Disclaimer: I am not a lawyer, and this blog post does not contain legal advice. CSA Research cannot provide you with legal advice. If you have concerns about any issues raised in this post, consult your legal counsel concerning these matters and how they apply to your business.
A common worry about generative AI (GenAI) is that the content that it creates may be subject to copyright claims. Our recent survey of freelance linguists reflected this concern: Copyright issues are their second most important concern with GenAI, with 74% viewing the technology negatively or strongly negatively in this regard. However, an examination of claims about copyright and how GenAI works reveals a different picture.
Media coverage of high-profile lawsuits from companies such as Getty Images against AI companies for alleged copyright violation is a major driver for this concern. In general, such legal actions assume a “Google on Steroids” vision of how GenAI works. In it, GenAI systems vacuum up large amounts of content, store it, index it, and then copy from it to generate output. This viewpoint is evident in a statement from a website announcing a lawsuit against Stability AI: “Stable Diffusion contains unauthorized copies of millions – and possibly billions – of copyrighted images. These copies were made without the knowledge or consent of the artists” (emphasis added).
In reviewing these lawsuits, it is important to remember that anyone can file a lawsuit and that claims made in them have yet to be adjudicated in court. In addition, lawsuits often function as a legal tactic to bring parties to the bargaining table and may never make it to court, so you should never take claims at face value. Although trying to predict court outcomes in complex cases is a fool’s game, we anticipate that these lawsuits will face an uphill battle to succeed. Why? Simply put, the “Google on Steroids” understanding of GenAI gets some fundamental details wrong – and those details matter.
GenAI Doesn’t Store Copies of Training Data
Taking the description of the lawsuit above at face value, readers could be forgiven for assuming that the Stable Diffusion system actually “contains” “copies” of training data, but it – like other GenAI systems – does not.
- GenAI processes the training data but doesn’t store it. It doesn’t store specific documents, books, or sources, and it doesn’t know which documents were in its training set or the specifics of any individual data source. Instead, it synthesizes its input data into what machine learning specialists call “correlations.” These are statistical measures of the association between random variables, and GenAI uses them to find relationships between different features in a dataset. A high correlation between two variables suggests a strong relationship.
- Its use of training data is similar to a student taking notes. One way to think of this process is to compare two students in a history course who each read several books before taking an at-home examination on the causes of the Hundred Years’ War. The first copies a paragraph verbatim from one of the books to answer an essay question. This action would be considered plagiarism and cheating. The second takes very good notes from the books for future reference and assimilates them into her understanding. When she answers that same essay question, she draws on her understanding of various sources and answers in her own words. Although she drew on sources to create those notes, the notes are not copies of the sources and she is not guilty of plagiarism. She could not have written her essay without access to the books she read, but she was not violating copyright when she read them and made notes.
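To make the idea of a “correlation” in the first point above more concrete, here is a minimal Python sketch. It is a toy illustration only, not how any production GenAI system is built: the six sentences are invented for this example, and the helper names `pearson` and `indicator` are hypothetical. It treats “does this word appear in the sentence?” as a variable and computes the standard Pearson correlation between pairs of such variables.

```python
from math import sqrt

# Invented toy sentences standing in for "training data" (an assumption for this sketch).
sentences = [
    "the war ended with a treaty",
    "the treaty ended the war",
    "the king raised taxes",
    "a long war needs a treaty",
    "taxes funded the king",
    "the king signed the treaty",
]

def indicator(word, texts):
    """Return 1 for each text that contains the word, else 0."""
    return [1 if word in text.split() else 0 for text in texts]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

war = indicator("war", sentences)
treaty = indicator("treaty", sentences)
taxes = indicator("taxes", sentences)

# Each result is a single number summarizing an association across all sentences.
r_war_treaty = pearson(war, treaty)
r_war_taxes = pearson(war, taxes)

print("war ~ treaty:", round(r_war_treaty, 2))
print("war ~ taxes: ", round(r_war_taxes, 2))
```

In this toy data, “war” and “treaty” tend to appear together, so their correlation is positive, while “war” and “taxes” never share a sentence, so their correlation is negative. The two resulting numbers describe relationships in the data without reproducing any of the sentences themselves.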
What All This Means for Localization
A proper understanding of how GenAI works complicates any narrative based on the “Google on Steroids” theory, which relies on the existence of copies. But it also shows why GenAI has so much difficulty with “hallucination” (making things up): It can only rely on the probabilistic correlations it established, and these may be incorrect for a given prompt. In addition, statistically likely – but incorrect – outcomes can easily drown out factually correct – but statistically unlikely – outcomes.
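The way a statistically likely answer can drown out a correct one can be sketched in a few lines of Python. This is a deliberately simplified assumption: real GenAI systems are neural networks, not count tables, and the four snippets below are invented so that the wrong answer is the more frequent one. The function name `next_word_probs` is hypothetical.

```python
from collections import Counter

# Invented toy snippets (an assumption for this sketch): the incorrect
# statement appears three times, the correct one only once.
snippets = [
    "the capital of australia is sydney",
    "the capital of australia is sydney",
    "the capital of australia is sydney",
    "the capital of australia is canberra",
]

def next_word_probs(texts, context):
    """Estimate P(next word | context) from raw counts in the texts."""
    ctx = context.split()
    counts = Counter()
    for text in texts:
        words = text.split()
        # Slide over the text and count the word that follows each match of the context.
        for i in range(len(words) - len(ctx)):
            if words[i:i + len(ctx)] == ctx:
                counts[words[i + len(ctx)]] += 1
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

probs = next_word_probs(snippets, "australia is")

# Picking the most probable continuation follows frequency, not truth.
most_likely = max(probs, key=probs.get)

print(probs)
print("most likely continuation:", most_likely)
```

Here the most probable continuation is “sydney”, even though Canberra is the capital of Australia. The sketch has no notion of which statement is true. It only knows which one it saw more often, which is the situation the paragraph above describes.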
Despite the challenges they face, these lawsuits may yet set precedents that would affect how LSPs and corporate localization groups use GenAI. For example, if courts were to decide that the developers of GenAI systems are guilty of copyright violation and must pay large settlements or license fees to rights holders, it would raise the cost of generative systems considerably and restrict their usefulness. However, most of the ways that LSPs are using GenAI – for translation, marketing, or sales – are unlikely to attract copyright claims directly. If it is questionable whether GenAI developers are guilty of copyright violation, it is even more questionable for users of their products, who never had even potential access to the training data.
The risk is especially low for translation, where groups are translating their own copyrighted materials and where the output will have a clear pedigree in that content. It will be somewhat higher for cases where LSPs or enterprises use GenAI to create new public marketing content that may end up closely resembling existing material on the web. If the terms or phrasing resemble an existing work closely enough, you might face a copyright claim from a rights holder hoping to get lucky. In most cases, companies facing such a claim will simply delete the offending copy rather than run the risk that a court will find against them. One way to mitigate this possibility is to use GenAI for ideation and research rather than to write content directly.
Should You Use GenAI?
For now, at least, copyright concerns are not a reason for LSPs or enterprise translation groups to avoid using GenAI altogether. Accordingly, the decision should be made based on how well it meets your needs.
CSA Research maintains that GenAI is not yet a viable mainstream translation solution – due to its slow speed, cost, and lack of integration with translation processes, among other factors – but copyright currently ranks very low on the list of concerns about the technology. Regulatory uncertainty is a much bigger concern.
LSPs and enterprise localization groups should monitor legal cases around GenAI and copyright, but they should not hold off on using it solely out of fear that they may be violating copyright. If they do have concerns specific to their use, they should consult with their legal counsel, but ultimately the industry’s long experience with statistical and neural MT – which face the same issues – shows that copyright is not likely to be a roadblock for using GenAI in the language sector.
So, do you need to worry about copyright and GenAI? Probably not as much as you do about other factors such as security, irrational hype, factually incorrect output, or regulatory uncertainty. These should rank higher in your list of concerns than current lawsuits based on a faulty understanding of how the technology works.