The legal basis of legitimate interest: focus sheet on the measures to implement in the case of data collection by web scraping

05 January 2026

The collection of data accessible online via web scraping must be accompanied by measures aimed at safeguarding the rights of the data subjects.

This content is a courtesy translation of the original publication in French. In the event of any inconsistencies between the French version and this English translation, please note that the French version shall prevail.

The collection of personal data available online through web scraping is generally based on legitimate interest. The controller must implement additional measures to mitigate the impact this may have on individuals’ interests, rights, and freedoms.

Reminder on the doctrine of the CNIL

The scraping of publicly accessible online data has significantly expanded, particularly with the rapid and widespread growth of generative AI systems, which rely on vast amounts of freely accessible online data. However, there are inherent risks associated with the use of such techniques for data subjects’ rights and freedoms, as individuals have little control over how their publicly available data is reused.

The widespread use of web scraping has fundamentally changed the nature of internet use, in that all data published online by an individual can now potentially be read, collected, and reused by third parties. This can pose significant risks for individuals, including the following:

Risks to privacy and rights guaranteed by the GDPR: the use of such tools can have major impacts on data subjects, due to the large volume of data collected, the high number of individuals concerned, the challenges of exercising the right to erasure, and the risk of collecting data relating to data subjects’ private lives (e.g., through the use of social networks), or even sensitive or highly personal data, without sufficient safeguards. These risks are even more serious when they involve data concerning vulnerable individuals, such as minors, who require special attention and appropriately tailored information.
Risks of carrying out unlawful data collection: certain data may be protected by specific rights, such as intellectual property rights, or their reuse may be subject to the consent of the data subjects.
Risks of undermining freedom of expression: indiscriminate and large-scale data collection, and the possible memorization of such data by AI systems, may undermine data subjects’ freedom of expression (e.g. a chilling effect due to a perceived sense of surveillance, which could lead internet users to self-censor, especially considering the difficulty of avoiding web scraping), even though the use of certain platforms and communication tools is essential in daily life.

However, data scraping is not prohibited per se, but must be analysed on a case-by-case basis.

Nevertheless, the CNIL has regularly called for vigilance regarding these practices and has issued a series of recommendations to be followed when carrying them out. The CNIL has also advocated for the creation of an ad hoc legislative framework (see, in particular, the CNIL opinion of December 15, 2022 on the “Polygraphe” project, in French).

In some cases, the CNIL has deemed such practices prohibited in the absence of a legal framework (in particular where processing operations are carried out by competent authorities for the purpose of detecting infringements). Conversely, they have been accepted in other cases, provided that stringent safeguards were implemented, for example for searching the internet for information leaks (RIFI).

For the moment, in the absence of a specific legal framework, this how-to sheet recalls the controllers’ obligations and specifies the conditions under which such processing could be implemented for the development of an AI system.

The legality of web scraping depends in particular on the possibility of relying on a valid legal basis. Collecting data accessible online to create a training dataset may be based on legitimate interest, provided that it complies with the conditions set out in the legitimate interest how-to sheet (see how-to sheet 8 "Legitimate interest and development of AI systems").

Risks of violating other regulations

While the use of scraping techniques is not in itself incompatible with the requirements of the GDPR, it may be prohibited by other regulations (e.g. by terms and conditions of use based on database producer rights or copyright law). In this regard, research organizations may consider benefiting from the "text and data mining" exception under the French Intellectual Property Code (Articles L122-5 and L122-5-3), unless the rightholders have reserved their rights in an appropriate manner, in particular by machine-readable means in the case of content made publicly available online. Such means include metadata and the terms and conditions of a website or a service (recital 18 of Directive 2019/790 of April 17, 2019 on copyright and related rights).

Mandatory measures

Certain measures are mandatory, including under the principle of data minimization (Article 5.1.c of the GDPR):

define, in advance, specific collection criteria;
exclude certain categories of data from the collection when they are not necessary:
- where possible, through filters (for example, if they are not necessary, bank transaction data, geolocation data, etc.);
- where filtering is not possible, exclude from the collection certain types of sites (e.g., sites or social networks used mainly by minors) that structurally contain these categories of data (e.g., data concerning vulnerable persons such as minors or certain sensitive data);
ensure that any irrelevant data that may have been collected despite these criteria is deleted immediately after collection or as soon as it is identified as such (for example, if, on a public forum, people's pseudonyms are collected when only the content of the comments is necessary);
exclude from the collection websites that clearly oppose the scraping of their content for the purpose of creating training databases by using robots.txt exclusion protocols or implementing CAPTCHA, which, by requiring an action that can only be performed by a human being, aims to prohibit access to pages by robots.
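
As an illustration, the measures listed above could be sketched in Python as follows. This is a minimal sketch under stated assumptions, not an implementation endorsed by the CNIL: the user agent name, the excluded domain and the field names are hypothetical, and the robots.txt content is parsed from a string so that the example runs without network access. Detecting CAPTCHAs is not covered here.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical values chosen for illustration only.
USER_AGENT = "example-training-bot"
EXCLUDED_DOMAINS = {"kids-social.example"}      # e.g. a site used mainly by minors
ALLOWED_FIELDS = {"comment_text", "posted_at"}  # collection criteria defined in advance


def robots_allows(robots_txt, url, user_agent=USER_AGENT):
    """Return True if the given robots.txt content permits fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def may_collect(url, robots_txt):
    """Apply site-level exclusions before any page is fetched."""
    host = urlparse(url).hostname or ""
    if host in EXCLUDED_DOMAINS:
        return False
    return robots_allows(robots_txt, url)


def minimise(record):
    """Keep only the fields listed in the collection criteria."""
    return {key: value for key, value in record.items() if key in ALLOWED_FIELDS}
```

In this sketch, may_collect would be called before each request, and minimise would be applied to every record before storage, so that a forum pseudonym collected alongside a comment is dropped immediately rather than retained.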

How can website publishers protect their content from scraping?

There are various ways in which a website publisher can express its opposition to web scraping, although there are no standards in this area (see, in particular, Alexandra Bensamoun's mission report on the implementation of the AI regulation in the field of copyright).

The CNIL encourages organizations to participate in the effort to standardize interoperable opposition mechanisms and to comply with new standards that may emerge, by implementing state-of-the-art technologies.

If they wish to protect the content of their site, publishers should implement a robots.txt exclusion protocol or CAPTCHAs, which the controller should be required to comply with.

Website publishers are also encouraged to implement other measures. A distinction can be made between blocking measures, which aim to technically prevent robots (crawlers) from accessing content, and non-blocking measures, which do not prevent the collection in practice but describe the rules desired by the website publisher in this regard.

Measures blocking access to the website:

Integration of CAPTCHA tests (“Completely Automated Public Turing test to tell Computers and Humans Apart”);
Blocking IP addresses: analysis of HTTP request headers, number of requests, detection of "suspicious" IP addresses (location, internet service provider); maintaining a register of IP addresses, devices, and VPNs known to host scraping bots; detection of bots concealed by requests to browser APIs;

Non-blocking measures:

Other exclusion protocols in addition to robots.txt: ai.txt (Spawning AI), TDMRep (tdmrep.json, W3C);
Dynamic Page Loading to limit the raw HTML content of the site;
Use of meta tags (e.g., DeviantArt's "noai" or "noimageai" meta tags);
Registering the domain in a registry opposing collection for the purpose of developing AI models and systems (e.g., "do-not-train").
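
From the point of view of a controller collecting data, honouring a non-blocking signal such as a meta tag could look like the following Python sketch. It assumes that the "noai" and "noimageai" directives appear as comma-separated values in the content attribute of a meta tag named "robots"; this placement is an assumption made for illustration, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser

# Directive names come from the example in the text; treating them as values of a
# "robots" meta tag is an assumption of this sketch.
OPT_OUT_DIRECTIVES = {"noai", "noimageai"}


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            for token in attributes.get("content", "").split(","):
                self.directives.add(token.strip().lower())


def page_opts_out(html):
    """Return True if the page carries one of the opt-out directives."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return bool(parser.directives & OPT_OUT_DIRECTIVES)
```

A page for which page_opts_out returns True would then be discarded before its content is added to the training dataset.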

What to do in the event of incidental collection of sensitive data?

Particular attention should be paid to the collection of special categories of data within the meaning of Article 9 of the GDPR when using web scraping tools that involve the processing of large volumes of data. The controller is required to implement all measures to automatically exclude the collection of irrelevant sensitive data, in particular by excluding the collection of certain categories of data or by excluding certain websites containing sensitive data by nature. The controller must be able to demonstrate this.

If, despite these measures, the controller incidentally and residually processes sensitive data that it did not seek to collect, this shall not be considered illegal. This aligns with the stance of the Court of Justice of the European Union, which recalls that the prohibition applies to the operator of a search engine "within the framework of his responsibilities, powers and capabilities" (CJEU, Grand Chamber, 24 September 2019, GC and Others, C-136/17). On the other hand, if the organization becomes aware that it is processing sensitive data, in particular through the data subject, it must immediately and automatically delete such data as far as possible.

This is without prejudice to the possibility of collecting or storing sensitive data, by way of exception, where its processing is necessary for the purpose pursued, and provided that it is based on an exception under Article 9.2 of the GDPR, in particular where it relates to personal data that has been manifestly made public by the data subject. This involves verifying whether the data subject has "explicitly made the choice beforehand, as the case may be on the basis of individual settings selected with full knowledge of the facts, to make the data relating to him or her publicly accessible to an unlimited number of persons" (CJEU, July 4, 2023, Meta Platforms, C-252/21 para. 85).

For example: An organisation that wishes to develop a tool that specifically generates political speeches, and that builds a database from transcriptions of videos published online by public figures, may consider that this sensitive data has manifestly been made public.

For more information: see how-to sheet 4 "Ensuring the lawfulness of the data processing".

Respect reasonable expectations

Additional safeguards

As indicated in the how-to sheet 8 "Legitimate interest and development of AI systems", the choice of appropriate measures depends in particular on the intended use of the AI and the actual impact of this system on the data subjects. The CNIL recommends the implementation of the following measures in particular:

set out a list of websites from which data collection would be excluded by default. This would apply to certain sites that contain particularly intrusive data due to their sensitivity (such as pornographic sites, health forums, etc.) or the level of information they provide about individuals (such as genealogy sites or sites that contain large volumes of structured data about individuals);
exclude collection for websites that oppose the scraping of their content or its reuse for the purpose of creating training datasets, through technical or legal measures, such as terms of use;
limit the collection to freely accessible data (i.e., content accessible to any user who is not registered on the site in question and without creating an account) and that individuals are aware of making publicly available. This means excluding, for example, data published on social networks for private use (information contained in profiles or private groups) and/or published on certain sites whose public nature is not obvious (online petition sites, for example);
disseminate information about data collection and individuals' rights as widely as possible by using multiple media (e.g., online articles, the data controller's social media accounts), and by publishing an updated list of websites involved in data scraping (see how-to sheet 9 "Informing data subjects"). Publishing the information on the website where the data is collected may be a good practice;
provide for a discretionary and prior right to object, in order to strengthen data subjects’ control over their data. The CNIL encourages the development of technical solutions that would facilitate compliance with the right to object prior to data collection. In addition to the opt-out mechanisms put in place for intellectual property (see box above), “push-back lists” could, for example, be implemented when appropriate for the processing. This would allow the data controller to respect individuals' objections by refraining from collecting their data;
apply anonymization or pseudonymization measures immediately after data collection;
prevent any combination of data based on individual identifiers, for example by replacing them with random pseudonyms specific to each piece of content (e.g., each post on a freely accessible online forum) rather than each identifier, unless the data controller demonstrates the need to be able to group together different data concerning a particular individual for the development of the AI system or model in question.
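
The last measure, random pseudonyms specific to each piece of content, can be illustrated with a short Python sketch. It is a simplified example under stated assumptions: the record structure and the "author" field name are hypothetical, and identifiers that appear inside the text of a post are not handled.

```python
import secrets


def pseudonymise_posts(posts):
    """Replace each author identifier with a fresh random pseudonym per post.

    No mapping between identifiers and pseudonyms is kept, so two posts by the
    same author cannot be linked to each other through the pseudonym.
    """
    result = []
    for post in posts:
        # Copy every field except the original identifier.
        cleaned = {key: value for key, value in post.items() if key != "author"}
        # Draw a new random value for every post, not for every author.
        cleaned["author"] = "user-" + secrets.token_hex(8)
        result.append(cleaned)
    return result
```

Because the pseudonym is drawn per post rather than derived from the identifier, this approach deliberately prevents grouping content by individual. A controller that can demonstrate the need for such grouping would have to use a different scheme.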

Note: It is up to the data controller to assess the relevance and necessity of implementing these measures on a case-by-case basis, taking into account the specific processing methods.

For example: An organization that collects numerous voice recordings online in order to develop and market an AI system for voice generation, without taking any additional measures to protect the training data or limit the risks of unlawful or malicious reuse, cannot rely on the legal basis of legitimate interest.
