Image by Annie Spratt

NLSIR | Online

Data Scraping: The Third Point on the IP-AI triangle

Sriya Sridhar*

There are perhaps few issues debated as vociferously as intellectual property (specifically, copyright) infringement in the context of training AI systems, such as generative AI models. With close to 30 lawsuits filed against AI companies on the grounds of IP infringement in the United States alone, the issue has now reached India: in its recent lawsuit against OpenAI, Asian News International (ANI) claims that OpenAI used its copyrighted material to train its ChatGPT model.

A Ministry of Electronics and Information Technology subcommittee, constituted under the Advisory Group for an ‘India-specific regulatory framework’ for AI, has advocated for regulators to consider issues of attributing liability in case of infringing outputs, and whether AI training on copyrighted content can qualify as fair use within the scope of Section 52 of the Indian Copyright Act.

There are several issues at play here, each deserving its own analysis. For instance, it is important to examine the impact of fair use defences against claims of copyright infringement, and to determine the extent to which AI-generated works can be said to be infringing upon original copyrighted work. Additionally, one must consider the idea-expression dichotomy, which is well-recognized in copyright jurisprudence.

For courts and policymakers, these IP issues will be long discussed and debated – how can we balance the protection of creators and licensed works with sustainable innovation in AI? And where will these issues stand when it comes to educational or scientific works which some argue should not be under copyright at all?

To effectively regulate these issues, however, it may be necessary for policymakers and courts to consider a more fundamental issue dealing with the intersection of data protection and intellectual property laws, which is data scraping – the method which is at the heart of AI development.

What Is Data Scraping And Is It All Bad?

Data scraping refers to the method of using tools like web crawlers and APIs (Application Programming Interfaces) to obtain data from third-party websites – this is separate from first-party data, which is data that is collected from an entity’s own audience. For example, an e-commerce website collects certain information from a person to create an account and shop on the platform. However, it may also use data scraping tools to crawl other e-commerce websites to obtain information on consumer behaviour patterns across websites and optimize its own service.
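To make the mechanics concrete, the following is a minimal sketch of what a scraper does once it has obtained a page: it parses the HTML and pulls out the fields it is interested in. The sample page, the CSS class names, and the `ProductScraper` and `scrape` names are all hypothetical and used only for illustration; a real scraper would first download the page over the network and would have to cope with far messier markup.

```python
from html.parser import HTMLParser

# Hypothetical third-party page (an assumption for illustration only).
# A real scraper would first download this HTML over the network.
SAMPLE_PAGE = """
<html><body>
  <div class="product"><span class="name">Kettle</span><span class="price">1499</span></div>
  <div class="product"><span class="name">Toaster</span><span class="price">2299</span></div>
</body></html>
"""


class ProductScraper(HTMLParser):
    """Collects the text inside <span class="name"> and <span class="price">."""

    def __init__(self):
        super().__init__()
        self._field = None   # field currently being read, if any
        self.products = []   # one dict per product found on the page

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("name", "price"):
            self._field = css_class
            if css_class == "name":
                # A name span marks the start of a new product record.
                self.products.append({})

    def handle_data(self, data):
        # Only keep text that sits inside one of the spans of interest.
        if self._field and self.products:
            self.products[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def scrape(page):
    """Return the product records extracted from an HTML string."""
    parser = ProductScraper()
    parser.feed(page)
    return parser.products


if __name__ == "__main__":
    for product in scrape(SAMPLE_PAGE):
        print(product)
```

Repeated across millions of pages, this same extraction step is what produces the bulk datasets discussed in this piece.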

To be clear, data scraping is not a new practice, nor specific to AI technologies – however, the scale at which it needs to be done to effectively train AI models is where the issue primarily arises. Web scraping is used frequently as a method of tracking, for lead generation, market research and Search Engine Optimization. Such practices have led to legal issues – for example, debates in antitrust (competition) law about whether prohibitions on scraping can fall under the ambit of anti-competitive behaviour, and the issue of whether data scraping can lead to the misappropriation of trade secrets.

On the other hand, it can also lead to socially desirable outcomes – the increase in public access to data can improve the quality of datasets on healthcare, environmental indicators, and similar issues of public importance.

The Tension With IP Rights

Data scraping, as a technology, involves obtaining information from a third-party source. Therefore, it raises the question of ownership of that information at a foundational level. For example, issues arise when there is no license to use that information, when web crawlers bypass Digital Rights Management (DRM) protections on certain types of content, or when the output of artificial intelligence closely resembles the scraped data. In these cases, however, the question of IP infringement is relatively obvious.

The less obvious cases are where the tensions really lie. For example, X (formerly Twitter) recently sued Meta over scraping data about its followers, on the grounds that this violated its trade secrets, which are its intellectual property. This claim may not be so easy to establish, given that it is unclear to what extent follower data is proprietary. Additionally, scraping data does not necessarily lead to the removal or alteration of the copyright management information (CMI) associated with the works, in violation of copyright law. Then, there is the question of when AI-generated outputs could be considered infringing derivative works of the ‘ingested’ content. This determination depends on how closely the AI outputs resemble the expressive elements of the original work on which the model was trained. Consider, for example, an AI model producing an image of a generic mouse with black ears, as opposed to an output identical to Mickey Mouse, which is a copyrighted character.

As India’s subcommittee questions in its report, regulators will also have to evaluate how consent or permission can be obtained from copyright owners in the case of bulk datasets. For instance, bulk datasets may contain a mix of both structured and unstructured data, and operationalising mechanisms to take permission from copyright owners, such as through a central registry, may prove difficult. In the EU AI Act, for example, there are obligations on providers of general-purpose AI models to maintain detailed records of databases and archives used for training models, as a part of transparency requirements (along with the general obligation to respect copyright law) – among other things, this is also in a bid to determine whether such models were trained on copyrighted content without consent. Indian regulators will need to determine whether such transparency-based requirements will be effective in our unique context.

Then, there is the issue of attributing liability. Would the infringer be the person requesting the output from the AI, or the AI company? It might be the AI company, if it could be established that it had the ability to control the person copying the product and derived financial benefit from that copying. However, would conducting data scraping qualify as such ‘control’? Reference can be made to the US Supreme Court test of assessing whether the technology in question has ‘substantial non-infringing uses’, in which event the AI system could benefit from a safe harbour even if it could potentially be used for infringement.

Prof. Pamela Samuelson argues that copyright or intellectual property law may not be sufficient to address the input and output issues of generative AI which arise from scraping as a form of AI training. Some copyright owners simply never want their works used by AI companies. However, this must also be reconciled with the constitutional purpose of copyright law – which has more of a welfare-based objective in India. For example, the Court in the famous Delhi University Photocopy case held that educational exceptions are an important part of the legislative intent of the fair use provisions under the Indian Copyright Act, 1957. This may also have an impact on the types of exceptions for data scraping, if AI models are trained on copyrighted content which is educational in nature. The broader purpose of copyright law is to balance the interests of copyright owners with the advancement of science, humanities, and education – therefore, the fact that some copyright owners may simply never want their works used by AI companies may not stand up against the objective of furthering technological innovation in which AI is the next frontier.

There are several open questions for the AI community, regulators, and eventually courts to consider with respect to designing AI models in a way that allows creators to opt out, and perhaps programming attribution to copyright owners within the models – although the technical feasibility of these options will need to be evaluated.

Policymakers must address the following questions to reconcile these issues. First, when should data scraping be allowed, and in what circumstances is it legal? Second, in what circumstances can there be exceptions that allow the scraping of copyrighted works, or of other content under a license or other form of ownership? Third, how would seeking permission, or allowing for opt-outs to scraping, be operationalised, especially in the case of bulk datasets? Fourth, can there be remedies built into the regulatory process, for instance, systems for compensation by AI companies to IP owners? Finally, would there be exceptions for scientific works, or for works serving objectives in the public interest?
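On the question of how opt-outs might be operationalised, one existing and purely voluntary convention is the `robots.txt` file, through which a website tells crawlers which paths they may visit. The sketch below is only an illustration of how a crawler could check such a signal before scraping, using Python's standard `urllib.robotparser` module; the crawler name `ExampleAIBot`, the sample rules, the example URLs, and the `may_scrape` helper are hypothetical, and nothing here suggests that honouring `robots.txt` is legally sufficient or required.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt published by a website (an assumption for
# illustration). It opts out of one AI crawler entirely, and keeps a
# /private/ section closed to every other crawler.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_scrape(robots_txt, user_agent, url):
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    article = "https://example.com/articles/1"
    print(may_scrape(SAMPLE_ROBOTS_TXT, "ExampleAIBot", article))
    print(may_scrape(SAMPLE_ROBOTS_TXT, "SearchBot", article))
```

A signal of this kind operates per website rather than per work or per copyright owner, which is one reason permission for bulk datasets may need mechanisms beyond it.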

The Way Forward: A Multi-Pronged Regulatory Strategy

The disruptive potential of AI also increases the number of touchpoints between different areas of law. An intellectual property issue may, at the same time, also be a data protection and antitrust issue. This is what makes a coordinated approach to governance so important.

Towards this end, the question of whether training AI models violates intellectual property laws cannot be isolated from the technological methods which underlie AI development, or from the intersection between IP laws and other laws. For example, India’s Digital Personal Data Protection Act, 2023 (DPDPA) includes a provision which entirely exempts any public data from being classified as ‘personal’ data at all (see section 3(c)(ii)). This is a significant difference from other jurisdictions, where some forms of public data are still personal data but can be utilized in specific circumstances. Under the GDPR, for instance, data must be ‘manifestly made public’ by the concerned individual to be utilized by a Data Controller without attracting certain obligations, while still being categorized as personal data. The implication of the Indian law, however, is that all data of an individual in the public domain is automatically excluded from being classified as personal data. In effect, therefore, an AI company would have access to large public datasets for training and would not need to be concerned with legal obligations on personal data protection under Indian law. This opens up the possibility of scraping data from social media platforms, online forums, and any internet source, and would invariably cause tensions where intellectual property rights are involved: the DPDPA enables wide avenues for data scraping, while existing intellectual property laws are unclear about the extent to which AI companies are liable for this scraping.

Effectively tackling this question will require regulators to first develop literacy about this subject. For example, the United States Copyright Office held listening sessions in 2023 to understand the implications of generative AI development for copyright law, which included, in large part, consultations on data scraping. In October last year, global data privacy authorities collaborated to investigate the impact of data scraping on data protection rights.

We may also need to contend with the fact that claims under intellectual property laws may not be as effective as other forms of claims when it comes to the consequences of AI training. For example, privacy claims could potentially be more efficacious than copyright claims, where the issue relates to AI and publicity rights. Consumer protection claims may be more effective where AI models 'hallucinate' and produce misleading outputs, including about original works.

Ultimately, Indian policymakers will need to decide the stance they want to adopt. Japan, for example, is now considered one of the friendliest countries for AI innovation, due to a broad data mining exemption for training AI models, including on copyrighted works. EU law contains exceptions for data mining for research purposes, with the option for rights holders to opt out of the exceptions. The EU AI Act also contains provisions on scraping and intellectual property. It will be interesting to see, especially with regard to India’s fair dealing provisions, how the consequences of data scraping will play out in courts.

 

Sriya Sridhar is a Shiv Nadar Fellow at the Shiv Nadar University School of Law, Chennai, and an alumna of Jindal Global Law School. Her research focuses on data protection and privacy, the incentives which drive technology regulation and innovation, digital governance, and the dynamics between information, power, and society.

 




NATIONAL LAW SCHOOL OF INDIA REVIEW  © 2022


NATIONAL LAW SCHOOL OF INDIA REVIEW  © 2024
