The legal basis of legitimate interest: focus sheet on the measures to implement in the case of data collection by web scraping

January 5, 2026


The collection of data accessible online via web scraping must be accompanied by measures aimed at safeguarding the rights of the data subjects.

This content is a courtesy translation of the original publication in French. In the event of any inconsistencies between the French version and this English translation, please note that the French version shall prevail.

The collection of personal data available online through web scraping is generally based on legitimate interest. The controller must implement additional measures to mitigate the impact this may have on individuals’ interests, rights, and freedoms.

Reminder on the doctrine of the CNIL

The scraping of publicly accessible online data has significantly expanded, particularly with the rapid and widespread growth of generative AI systems, which rely on vast amounts of freely accessible online data. However, the use of such techniques carries inherent risks for data subjects’ rights and freedoms, as individuals have little control over how their publicly available data is reused.

The widespread use of web scraping has fundamentally changed the nature of internet use, in that all data published online by an individual can now potentially be read, collected, and reused by third parties. This can pose significant risks for individuals, including the following:

  • Risks to privacy and rights guaranteed by the GDPR: the use of such tools can have major impacts on data subjects, due to the large volume of data collected, the high number of individuals concerned, the challenges of exercising the right to erasure, and the risk of collecting data relating to data subjects’ private lives (e.g., through the use of social networks), or even sensitive or highly personal data, without sufficient safeguards. These risks are even more serious when they involve data concerning vulnerable individuals, such as minors, who require special attention and appropriately tailored information.
     
  • Risks of carrying out unlawful data collection: certain data may be protected by specific rights, such as intellectual property rights, or their reuse may be subject to the consent of the data subjects.
     
  • Risks of undermining freedom of expression: indiscriminate, large-scale data collection, and the possible memorization of that data in AI systems, may undermine data subjects’ freedom of expression (e.g. a chilling effect due to a perceived sense of surveillance, which could lead internet users to self-censor, especially considering the difficulty of avoiding web scraping), even though the use of certain platforms and communication tools is essential in daily life.

However, data scraping is not prohibited per se; it must be analysed on a case-by-case basis.

Nevertheless, the CNIL has regularly called for vigilance regarding these practices and issued a series of recommendations to follow when carrying them out. The CNIL has also advocated for the creation of an ad hoc legislative framework (see, in particular, the CNIL opinion of December 15, 2022 on the “Polygraphe” project, in French).

In some cases, the CNIL has deemed such practices prohibited in the absence of a legal framework (in particular where processing operations are carried out by competent authorities for the purpose of detecting infringements). Conversely, they have been accepted in other cases, provided that stringent safeguards were implemented, for example for searching the internet for information leaks (RIFI).

For the moment, in the absence of a specific legal framework, this how-to sheet recalls controllers’ obligations and specifies the conditions under which such processing could be implemented for the development of an AI system.

The legality of web scraping depends in particular on the possibility of relying on a valid legal basis. Collecting data accessible online to create a training dataset may be based on legitimate interest, provided that it complies with the conditions set out in the legitimate interest how-to sheet (see how-to sheet 8, "Legitimate interest and development of AI systems").

Risks of violating other regulations

While the use of scraping techniques is not in itself incompatible with the requirements of the GDPR, it may be prohibited by other regulations (e.g. by terms and conditions of use based on database producer rights or copyright law). In this regard, research organizations may consider benefiting from the “text and data mining” exception under the French Intellectual Property Code (Articles L122-5 and L122-5-3), unless the rightholders have reserved their rights in an appropriate manner, in particular by machine-readable means in the case of content made publicly available online. Such means include metadata and the terms and conditions of a website or a service (recital 18 of Directive 2019/790 of April 17, 2019 on copyright and related rights).
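
As an illustration of what checking a machine-readable reservation might look like in practice, the sketch below tests a crawler's user agent against a site's robots.txt rules before any page is collected. It rests on assumptions that do not come from this sheet: robots.txt is treated here as one possible machine-readable signal, the crawler name "example-ai-crawler", the sample rules and the example.org URLs are hypothetical, and the helper may_collect is introduced only for this example. Passing such a check would not by itself make a collection lawful; terms and conditions and other metadata would still need to be reviewed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: example-ai-crawler
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_collect(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules allow user_agent to fetch url.

    Illustrative helper (not part of any CNIL guidance): it only reads
    robots.txt and ignores other signals such as terms and conditions.
    """
    parser = RobotFileParser()
    # parse() accepts an iterable of lines and records that rules were loaded.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    checks = [
        ("example-ai-crawler", "https://example.org/articles/1"),
        ("other-bot", "https://example.org/articles/1"),
        ("other-bot", "https://example.org/private/notes"),
    ]
    for agent, url in checks:
        decision = "allowed" if may_collect(SAMPLE_ROBOTS_TXT, agent, url) else "excluded"
        print(f"{agent} -> {url}: {decision}")
```

In a real pipeline the rules would be fetched from the target site rather than inlined, and a URL reported as excluded would simply be skipped and logged.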

 

Mandatory measures


Respect reasonable expectations


Additional safeguards