
Google Fined €250 Million for Illegally Collecting Training Data, Reigniting Controversy over Data Copyright Issues

Published on Mar 24, 2024

France's competition regulator recently fined US company Google 250 million euros for using content from French publishers and news organizations without consent to train its chatbot "Bard" (since upgraded and renamed "Gemini"). The fine, imposed for violating EU intellectual property rules, marks the first time an artificial intelligence (AI) company has been penalized over its training data. It sets a precedent that may lead to more cases of this kind.

Because AI technology is advancing and iterating so quickly, it is hard for AI companies to determine when acquiring training data is legal. Experts believe that although data rights protection still has gray areas, a well-functioning data market and sound management practices can help address these issues.

As part of the settlement, Google agreed not to contest the findings and will propose measures to remedy the shortcomings in its products and services. Google said it chose to settle because it is time to move forward, and that it wants to focus on a broader, sustainable approach to connecting users with high-quality content and on working constructively with French publishers.

The dispute between Google and French publishers began in 2019, when numerous French media organizations, including Agence France-Presse, complained to the regulator that Google was using their online content without permission. In 2020, the regulator ordered Google to negotiate with the publishers over payment for content. After those negotiations failed, the regulator fined Google 500 million euros in 2021. In 2022, Google reached a settlement agreement with the media publishers concerned.

According to the regulator's statement, Google violated several terms of that settlement agreement, including by failing to negotiate with the publishers concerned and by not being transparent in the information it provided. The regulator specifically noted that Google used content from media platforms and news organizations to train "Bard", which launched in 2023, without notifying the publishers or the regulator. It has therefore expressed concern about Google's AI services.

The Google case is an important precedent because it is the first time an AI company has been fined over its use of training data. It serves as a warning to other AI companies that similar cases may follow. Some scholars consider an increase in such disputes likely and see them as an unavoidable challenge in the development of the AI industry. Data is at the core of AI development, so AI companies have a substantial appetite for high-quality data, and in collecting and using it they may infringe the data rights of others, whether inadvertently or intentionally. Moreover, the protection of data rights still has gray areas, both in the rules themselves and in how they have been applied in practice.

From a legal standpoint, the French regulator's penalty against Google rests on a solid foundation, and it should serve as a caution to other AI companies. It shows that AI research, development, and product use that involve extensive training on copyrighted works without permission carry a clear legal risk.


To secure permission to use publishers' content as training data, OpenAI, another technology company, reached agreements in 2023 with the Associated Press, German media giant Axel Springer, and others. It failed to reach an agreement with the New York Times, however, and the newspaper filed a lawsuit in December 2023. The New York Times accused OpenAI and technology giant Microsoft of illegally copying and using its unique and valuable works, and demanded that the two companies destroy any chatbot models and training data that used its copyrighted material. The newspaper had earlier approached the companies about their use of its copyrighted content and sought "friendly solutions," including commercial agreements covering generative AI products, but those negotiations did not produce a resolution.

The New York Times became the first American media organization to sue an AI company over copyright issues. The lawsuit alleged that a considerable number of the newspaper's articles were used as training data for chatbots. Some analysts believe that these chatbots are competing with traditional news publishers and aim to become "reliable news sources." The lawsuit also cited instances in which articles from the newspaper's website, which normally require a paid subscription, were provided to users for free by chatbots such as ChatGPT.

OpenAI said it was surprised by the lawsuit because it believed its negotiations with the New York Times were active and productive. The company stated that its chatbot, ChatGPT, is not intended as a substitute for a New York Times subscription. It emphasized that no single data source, the New York Times included, is crucial to what a large model is expected to learn, because the model draws its knowledge from a vast collection of human-generated information.

Compared with copyrighted publications, data from social platforms puts AI companies in an even more ambiguous position when they use it to train models. In 2023, Tesla CEO Elon Musk said that "X" (formerly Twitter), the social platform he owns, would use publicly available data to train large models, while assuring users that this would not include personal privacy data or private messages. At the same time, Musk publicly criticized technology companies such as Microsoft, accusing them of "illegally using data from 'X' to train large models" and even threatening legal action.

Earlier this month, OpenAI's Chief Technology Officer, Mira Murati, was interviewed by the Wall Street Journal. Asked about the data used to train Sora, OpenAI's text-to-video model, Murati said the company uses both public and licensed data. When asked whether that included data from social platforms such as Facebook and YouTube, however, she could only answer, "I'm not sure."

In this context, it is difficult to determine whether an AI company has legally acquired and used data from social platforms. Can AI companies use public or semi-public data without ethical concerns? These questions sit in a gray area where existing regulations lag behind what data rights protection now requires when large models are trained. Two things therefore need attention: improving the data marketplace, and strengthening security and compliance certification and management for the training data of large models.

On AI regulation, the European Union and the United Nations have taken steps to address these gaps. On March 13, the European Parliament voted to approve the EU's Artificial Intelligence Act, which strictly prohibits "AI systems that pose unacceptable risks to human safety." Such risks may arise from purposeful manipulation, exploitation of human vulnerabilities, or systems that score people's social status or personal characteristics based on their behavior.

On March 21, the United Nations General Assembly adopted its first resolution on AI, which aims to ensure that the new technology benefits all countries while respecting human rights and promoting "safe, reliable, and trustworthy" technology. The resolution also acknowledges that "the governance of AI systems is an evolving field" and calls for further discussion of possible governance approaches.
