Informing data subjects
5 January 2026
Organisations that process personal data to develop AI models or systems must inform the data subjects concerned. The CNIL specifies the obligations in this regard.
Ensuring the transparency of processing
The principle of transparency requires organisations that process personal data to inform data subjects so that they understand the uses that will be made of their data (why and how) and are able to exercise their rights (right to object, right of access, right to rectification, etc.).
This principle applies to any processing of personal data, regardless of whether the data are:
- directly collected from data subjects (also known as first party data): for example, in the context of a contract with volunteer actors to create a training dataset, when providing a service, in the context of a relationship between a citizen and an administration, etc.;
- or indirectly collected (also known as third party data): for example, when data is collected on the Internet via file downloads, web scraping tools or application programming interfaces (APIs) made available by online platforms to re-users; when the information is obtained from institutional or business partners such as data brokers; or by reusing an existing dataset, etc. This also includes data generated by the controller itself (CJEU, judgment of 28 November 2024, Case C-169/23).
Note: Where the controller has not directly collected the personal data from the data subjects, it may be exempted from the obligation to inform them individually if such information is impossible in practice or would require disproportionate efforts. However, general information (e.g. on its website) must be provided, containing all the elements provided for in Article 14 GDPR and detailed below.
When should the information be provided?
Where the controller collects the training data directly from the data subjects themselves, it must inform them at the time of such collection. When it intends to further process the data for a purpose other than the one for which the data were collected (subject to the compatibility of such re-use, see the dedicated how-to sheet), it must inform the data subjects about this (Article 13.3 GDPR).
When the data have not been collected directly, the organisation must inform the data subjects as soon as possible, and at the latest at the time of the first contact with them or, where appropriate, at the time of the first communication of the data to another recipient. In any case, the organisation must inform the data subjects within a period not exceeding one month after the date on which it obtained their data.
As a good practice, where the data are particularly sensitive for individuals, the CNIL invites organisations to respect a reasonable period of time between the moment when individuals are informed that their data are included in the training dataset and the training of a model on that dataset (whether by the organisation itself or by third parties following the dissemination of the dataset). This good practice allows data subjects to exercise their rights during that period, given the technical difficulties in exercising these rights on the model itself and the risks this entails (in particular depending on the nature of the data stored).
How do I provide the information?
Ensuring the accessibility of the information
It must not be difficult for data subjects to access or understand the information.
Information notices must be distinguished from other information unrelated to data protection (general terms and conditions, legal notice, etc.). In this regard, while the information notices published on the websites of controllers may relate to many processing operations and concern different categories of persons (e.g. users of the website in question, data subjects involved in the development phase of AI systems, data subjects involved in their deployment, etc.), it is recommended to clarify which part of the information notices applies to which categories of persons (e.g. by clearly distinguishing information relating to development processing operations from information relating to other processing operations).
In practice, there are several ways to provide it:
- where individual information is provided, it may appear on the online form used to collect data, be mentioned in e-mails or letters sent by the data re-user at the time of its first contact with data subjects, or be delivered via a pre-recorded voice message, etc.
- when providing general information (i.e. in the cases detailed below), it may for example take the form of information notices published on a freely accessible website or on a notice board.
Ensuring the intelligibility of information
The GDPR provides that information must be provided in a concise, transparent, intelligible and easily accessible form, using clear and plain language. The complexity of AI systems should not prevent the proper understanding of information by data subjects.
In this regard, it is recommended that controllers clearly define the main consequences of the processing: in other words, what will actually be the effect of the specific processing. The information could thus detail, for example by means of diagrams, how the data is used during the training phase, the functioning of the AI system developed, as well as the distinction that needs to be made between the training dataset, the AI model and the outputs of the model.
Point of attention: while it is possible to provide this information within existing documentation templates (such as datasheets, model cards or AI system cards), it must be easily accessible, clear and intelligible for the data subjects. This means that it must stand out clearly in such documentation.
To achieve these objectives, the CNIL recommends providing information in several layers, with essential information (identity of the controller, purposes and rights of individuals) at the first level and the complete information available at further levels.
With regard to processing of minors’ data, particular attention should be paid to ensure that the information is sufficiently intelligible.
Derogations from individual information
In principle, the content of the information referred to above must be brought to the attention of the data subjects individually, i.e. directly (e.g. on a data collection form, an account creation form, an e-mail, etc.).
The GDPR provides for several derogations from the obligation to inform individuals individually (e.g. where a provision of EU or national law allows it to be excluded under Article 23). The sections below focus on the exemptions most relevant to AI development, but are not exhaustive (see Article 14.5 GDPR).
Situation 1: The data subject has already obtained the information on the processing for development purposes (14.5.a GDPR)
Where data subjects have already been informed of all the characteristics of the processing, in particular of the purpose and identity of the controller of the training operations, new information is not necessary.
To be noted: where the data are collected from a third party, the controller will have to ensure that full information on its own processing has already been provided to the data subjects.
As a good practice, the CNIL encourages data re-users to rely on the organisation that disseminates the data to inform individuals, in particular when the latter is still in contact with the data subjects.
For example, the provider of an online education service could inform its customers that their data will be processed by a named third party in order to develop an AI system for teachers, by referring to the information notice of that third party. If the provider has referred to all the information about that processing, the re-user will not have to provide it again.
For example, the publisher of a dataset on a training data exchange platform could usefully centralise the information notices of re-users on the download page of that dataset.
Conversely, if the provider of the dataset has correctly informed people about its own processing (making the dataset available) but has not provided all the information on the re-users' processing, the latter will have to inform the data subjects by their own means.
Note that in this case, re-users of the dataset will most often find themselves in situation No. 2 (which allows them simply to provide general information).
Situation 2: Information would require disproportionate effort (Article 14.5.b GDPR)
The controller can then simply make the information publicly available.
This argument is often invoked by organisations that are not, or are no longer, in contact with the persons whose data they process (e.g. in the case of re-use of a dataset created by a third party). Indeed, in this case, they usually do not have their contact details.
A case-by-case analysis must be carried out, considering the specific context of each processing.
The organisation must assess and document the disproportionate nature of the effort, by weighing, on the one hand, the interference with the privacy of the persons whose data are processed and, on the other hand, the burden that providing the information individually to each data subject would entail.
- In order to assess the extent of the efforts to be made, account should be taken of the lack of means of contacting the data subjects, the age of the contact details held (whose accuracy is uncertain, e.g. contact details more than 10 years old), or the number of data subjects and the cost of communication.
For example: the controller who intends to reuse the data of its customers and still has their e-mail address should always use it to inform them individually.
Conversely, the controller intending to collect indirectly identifying data will generally not have to search for the real identity or contact details of persons in order to inform them directly (general information on its website being sufficient).
- In assessing the invasion of the privacy of data subjects and the intrusiveness of the processing, account should be taken of the risks associated with the processing (more or less directly identifying nature of the data, sensitivity of the data, etc.) and any safeguards put in place (such as pseudonymisation, carrying out a data protection impact assessment (DPIA), shortening the retention period or implementing various technical and organisational security measures, see the Legitimate Interest Sheet for a more detailed list).
For example, depending on the risk resulting from the nature of the data and the context of its publication, the re-user of a publicly accessible online dataset may rely on the measures taken by the original controller to inform data subjects about the possibility of re-use by third parties for training purposes. The re-user can then simply provide general information (on its website).
Special case of online data collection
When lawfully collecting indirectly identifiable data published online, individual information will most often be disproportionate since finding ways to contact individuals involves searching for or collecting additional or more identifying data such as the actual identity of individuals.
For example: general information may be used when scraping or re-using a training dataset lawfully published in open source and containing only indirectly identifying data (such as publications or comments whose content is likely to identify the author).
The same will most often apply to the collection of data published under a pseudonym if such pseudonym is not collected or retained by the controller.
When collecting directly identifiable data, a case-by-case analysis should be carried out to assess whether it is necessary to seek to inform individuals individually through a means of contact (e.g. by searching for their contact details or by using a messaging system implemented on the website in question).
Finally, although the volume of data does not, on its own, create a presumption that individual information is disproportionate, this will most often be the case for the lawful collection of data which does not pose a risk to data subjects from a large number of websites, such as open source encyclopaedias, for the purposes of developing a large language model, where data subjects cannot be unaware that their data are publicly accessible.
To be noted: this derogation will more easily apply to organisations compiling training datasets for AI systems for scientific research purposes.
For example: the provision of general information will be sufficient for the use of a dataset of freely accessible profile photographs in the development of a deep fake algorithm for scientific research purposes.
Appropriate measures that may be taken by the organisation in addition to general information
Beyond providing general information by making the information publicly available (e.g. by publishing the information on the organisation’s website), other appropriate measures may be taken by the organisation in this case, such as:
- Carrying out a DPIA, including where not required by Article 35 of the GDPR;
- Applying data pseudonymisation techniques (an illustrative sketch follows this list);
- Reducing the number of data collected and the retention period;
- Implementing technical and organisational measures to enhance the level of security.
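By way of illustration, the pseudonymisation measure mentioned in the list above could, for example, rely on keyed hashing of direct identifiers before records enter the training dataset. The following is a minimal Python sketch under that assumption: the field names, the HMAC-SHA-256 construction and the key handling are illustrative choices, not a technique prescribed by the CNIL.

```python
import hmac
import hashlib

# Illustrative only: the secret key must be stored separately from the
# training dataset (e.g. in a key management system), otherwise the
# pseudonyms could be reversed by whoever holds the dataset.
SECRET_KEY = b"replace-with-a-securely-stored-key"

def pseudonymise(identifier: str) -> str:
    """Replace a direct identifier (e-mail address, user ID, ...) with a stable pseudonym."""
    return hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

def prepare_record(record: dict) -> dict:
    """Keep only the content needed for training and pseudonymise the author field."""
    return {
        "author": pseudonymise(record["author_email"]),  # hypothetical field name
        "text": record["text"],
        # contact details and other direct identifiers are deliberately not retained
    }

raw = {"author_email": "jane.doe@example.org", "text": "A publicly posted comment."}
print(prepare_record(raw))
```

Because the organisation keeps the key and could, in principle, re-identify records, data processed in this way remain personal data: pseudonymisation reduces the risks but does not remove the information obligations described in this sheet.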
What information should I provide?
In case of individual information
Where the controller provides individual information - either because it collects the data from individuals (Article 13 GDPR) or because it collects the data from third parties but has a means of contact and this does not represent a disproportionate effort (see above) - it is generally required to provide all of the following information, as provided for in Articles 13 and 14 GDPR.
The organisation that builds or uses a training dataset to develop an AI system based on personal data must inform data subjects about the following, regardless of whether the data was collected directly or indirectly:
- its identity and contact details (such as e-mail address, postal address or telephone number) and the means of contacting its data protection officer;
- the purpose and legal basis of the processing, including, where appropriate, details of the legitimate interest pursued if the processing is based on that interest;
- the recipients or at least the categories of recipients of the data, with, where appropriate, details of the planned transfers of those data to a country outside the European Union;
- the retention period of the data (or, failing that, the criteria for determining it);
- the rights of data subjects (rights of access, rectification, erasure, restriction, right to portability, right to object or withdraw consent at any time);
- the right to lodge a complaint with the CNIL.
To be noted: While information on the retention period and the exercise of rights does not have to be provided systematically for all processing operations, it will almost always be required for the compilation and use of training datasets. Indeed, this information is necessary to ensure fair and transparent processing towards data subjects.
In the case of indirect collection, the organisation must provide, in addition:
- Categories of personal data (e.g. identities, contact details, images, social media posts, etc.);
- Where necessary to ensure fair and transparent processing, a precise indication of the source(s) of the data (including whether or not they are publicly available sources).
Such information should enable individuals to anticipate whether they are affected by the processing and facilitate the possible exercise of their rights over the source processing.
In case of publication of a general information notice
Where individual information is not possible, because the controller cannot identify the persons in the dataset, does not have the contact details of the data subjects, or contacting them individually would require disproportionate effort, appropriate measures must be taken. It is necessary to publish an information notice, for example on a website, which includes, where possible, the information that would have been provided in the case of individual information.
Furthermore, this information must include, where applicable, the fact that the controller will not be able to identify individuals, including in order to respond to their requests for the exercise of rights (in accordance with Article 11 GDPR). In this case, the CNIL recommends, where possible, indicating to persons wishing to exercise their rights what additional information they can provide to enable their identification.
When informing data subjects on sources, two cases are to be distinguished:
- Where the controller has used a limited number of sources to build up its training dataset, it is generally required to provide precise information on those sources, unless it can justify an exception.
- Where many sources are used, for example a large number of publicly available sources online, overall information, such as source categories, or even the names of a few main or typical sources, is generally sufficient.
In case of re-use of a dataset or AI model subject to the GDPR
In addition to indicating the source of the data used, the CNIL recommends, at least for the datasets that present the greatest risk to individuals, providing the means of contacting the controller from which the dataset or model was obtained. A good practice is to refer directly to the website of the original controller and to accompany the information with a concise and clear explanation of the conditions for data collection and annotation.
Examples of references:
This training data comes from a publicly accessible dataset (hyperlink to the publication) containing 70,000 images and consisting of photographs published online under an open license on the social network ____ between 2020 and 2021.
This data comes from a dataset provided by _____, a data broker (whose contact details are _____). This dataset consists of 3,000 images initially collected from volunteer actors playing different facial expressions, annotated to transcribe their emotions.
As part of the development of this AI system, we are reusing a large language model developed by the _____ company that has memorised personal data. For more information, we refer you to its privacy policy available at the following address: ______.
In case of scraping on websites or reuse of scraped data
If the scraping concerns a few websites, the CNIL recommends providing precise information on the sources used. Where the sources are very numerous, it recommends providing the categories of source sites concerned, at least for those posing the greatest risk to persons. This recommendation applies to scraped data, but also to controllers who re-use datasets built from scraped data.
Examples of sufficiently precise statements:
To this end, we have collected freely accessible data by scraping on the following platforms: ______. These data consist of publications made manifestly public by their authors on the subject ____. Comments related to publications were not collected. The images and pseudonyms were processed at the time of collection but were not kept.
The data was collected by scraping websites specialised in the field studied and may contain personal data such as the name of the author, or of a person quoted in a freely accessible blog post. The images contained in the article are not collected.
Where possible, providing information on the domain names and URLs of scraped web pages, as well as the date or period of collection, is a good practice, but is not required under the information obligations.
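By way of illustration, the minimisation described in the first statement above (keeping only the publication text, recording the source and the date of collection, and discarding images and pseudonyms at the time of collection) could be implemented along the following lines. This Python sketch assumes the requests and BeautifulSoup libraries; the URL and CSS selectors are hypothetical, and the lawfulness of the collection itself remains subject to the conditions discussed in the dedicated how-to sheets.

```python
from datetime import date

import requests
from bs4 import BeautifulSoup

# Hypothetical source: a discussion forum whose scraping has been assessed as lawful.
SOURCE_URLS = ["https://forum.example.org/topic/123"]

collected = []
for url in SOURCE_URLS:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    for post in soup.select("div.post"):           # hypothetical CSS selector
        body = post.select_one("p.content")        # hypothetical CSS selector
        if body is None:
            continue
        collected.append({
            "text": body.get_text(strip=True),         # only the publication text is kept
            "source_url": url,                         # good practice: record the source URL
            "collected_on": date.today().isoformat(),  # good practice: record the collection date
        })
        # Author pseudonyms, avatars and embedded images are neither extracted
        # nor stored, consistent with the example notice above.
```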
In case of development of a general-purpose AI model within the meaning of the AI Regulation
In parallel with the information obligation under the GDPR, Article 53 of the AI Act, clarified by Recital 107, requires providers of general-purpose AI models to make available to the public a sufficiently detailed summary of the content used to train these models, in accordance with the template provided by the AI Office (European Commission).
This includes, for example, listing the main datasets or collections used to train the model, such as large-scale public or private data archives or datasets, and providing explanatory text on the other data sources used.
The first version of the summary template, published by the AI Office in July 2025, is a tool that provides some of the information required to ensure compliance with the GDPR's information obligation for providers of general-purpose AI models. This information must be supplemented by the provider.
Examples of source website categories:
- institutional sites
- open source encyclopedia
- research paper exchange platforms
- national [or international] press sites
- website of national or international audio-visual media
- specialised discussion forums [describing these specialities, e.g.: IT development, education, health, cultural events, sporting events, etc.] or generalist forums, etc.
- platforms for sharing images, audio or audio-visual content (music, photographs, audio recordings, videos)
- professional social networks
- online sales websites
On the specific case of AI models whose processing is subject to the GDPR
A number of AI models are considered anonymous: the GDPR, including the obligation to provide information, does not apply to them as such. However, training an AI model can sometimes lead to it ‘memorising’ part of the training data (see EDPB Opinion 28/2024 on certain aspects of data protection related to the processing of personal data in the context of AI models). Where the model itself is found to be subject to the GDPR, individuals should be informed about the memorised data.
The provider of the AI model or system should then specify the information elements indicated above (purpose, controller, recipients, etc.).
As a good practice, the provider is also recommended to specify:
- the nature of the risk associated with extracting data from the model, such as the risk of regurgitation of data in the case of generative AI;
- the measures taken to limit those risks, and the existing redress mechanisms in the event that those risks arise, such as the possibility of reporting an occurrence of regurgitation to the organisation.