The creation of a training dataset containing personal data is processing of personal data which, pursuant to the GDPR, must have a purpose that is ‘specified, explicit and legitimate’. The CNIL helps you define this purpose, taking into account the specificities of AI system development.
This content is a courtesy translation of the original publication in French. In the event of any inconsistencies between the French version and this English translation, please note that the French version shall prevail.
The principle
The purpose of the processing is the objective pursued by the use of personal data. This objective must be specified, i.e. defined from the outset of the project. It must also be explicit, that is to say, known and understandable. Finally, it must be legitimate, i.e. compatible with the tasks of the organisation.
The data must not be further processed in a manner incompatible with this initial purpose: the principle of purpose limitation restricts how the controller may use or reuse these data in the future.
The requirement of a specified, explicit and legitimate purpose is particularly important, as it determines the application of other principles of the GDPR, including:
the principle of transparency: the purpose of the processing must be brought to the attention of the data subjects so that they are able to know the reason for the collection of the data concerning them and to understand the use that will be made of it;
the principle of data minimisation: the data selected must be adequate, relevant and limited to what is necessary for the purposes for which they are processed;
the principle of storage limitation: the data may only be kept for a limited period, defined according to the purpose for which they were collected.
How to define the purpose of the processing when the operational use is identified from the development stage?
This case concerns AI systems that are developed for a specific operational use in the deployment phase. This excludes AI systems that are developed without an operational use being defined in the development phase (see next section).
When an AI system is developed for a single operational use, it is considered that the purpose in the development phase is directly related to the one pursued by the processing in the deployment phase. It follows that if the purpose in the deployment phase is itself specified, explicit and legitimate, the purpose in the development phase will be as well.
In this case, these are still separate processing operations whose compliance with the obligations of the GDPR must be analysed separately (in particular in terms of identifying the legal basis, informing individuals, minimising the collected data, defining retention periods, etc.).
Example: an organisation wishes to set up a dataset consisting of photos of trains in service – that is, with people on board – in order to train an algorithm to measure passenger flows and train occupancy at stations. The purpose in the development phase may be considered to be specified, explicit and legitimate in relation to the identified operational use.
In some cases, an AI system can be developed for several operational uses defined from the development phase. In this case, the development of such an AI system may pursue several purposes corresponding to the operational uses identified (a data processing can indeed simultaneously pursue several purposes if they are all specified, explicit and legitimate).
How to define the purpose of processing for the development of general purpose AI systems?
This case concerns AI systems whose operational use in the deployment phase is not clearly identified in the development phase. This is the case for general purpose AI systems and foundation models that can be used for a wide variety of applications and for which it may be difficult to define a sufficiently specified and explicit purpose at the development stage.
Examples:
An organisation may set up a dataset for training an image classification model (persons, vehicles, food, etc.) and make it publicly accessible, without any specific operational use being foreseen during the development of the model.
This model can be freely reused by third-party organisations for the development of computer vision systems, in accordance with the associated license (and possibly adapted, e.g. using transfer learning techniques) and with the rules governing image rights and intellectual property. The purposes of the AI system can be varied: detection of people by augmented camera systems to measure attendance on station platforms, or detection of defects in images taken as part of product quality controls.
An organisation creates a dataset for training a language model to identify the language register of a text. This model can be used for various tasks: writing and summarizing articles, letters, speeches, learning French, etc.
The purpose of the processing during the development stage may be considered to be specified, explicit and legitimate only if it is sufficiently specific, i.e. where it refers cumulatively to:
the ‘type’ of system developed, such as, for example, the development of a large language model (LLM), a computer vision system or a generative AI system for images, videos or sounds. The types of systems must be presented in a sufficiently clear and intelligible way for the data subjects, taking into account their technical complexities and rapid developments in this area.
technically feasible functionalities and capabilities, which means that the controller must draw up a list of capabilities that he or she can reasonably foresee at the development stage.
These criteria make it possible to take into account the fact that the controller cannot define, at the development stage of an AI system, all its future applications, while ensuring that the purpose limitation principle is respected.
Examples of purposes considered to be explicit and specified:
Development of a large language model (LLM) able to answer questions, generate text according to context (emails, letters, reports, including computer code), perform translations, summaries and corrections of text, perform text classification, sentiment analysis, etc.;
Development of a voice recognition model capable of identifying a speaker, his or her language, age, gender, etc.;
Development of a computer vision model capable of detecting different objects such as vehicles (cars, trucks, scooters, etc.), pedestrians, street furniture (dumpsters, public benches, bicycle shelters, etc.), road signs, traffic lights, etc.
Conversely, the purpose would not be considered as specified enough if it only referred to the type of AI system, without mentioning the technically feasible functionalities and capabilities.
Examples of purposes that are not considered explicit and specified:
Development of a generative AI model (possible capabilities are not defined);
Development and improvement of an AI system (neither the type of model nor the possible capabilities are defined);
Development of a model to identify a person’s age (the type of system is not defined).
Warning: the provider of a general purpose AI system should remind system users of their obligation to define as precisely as possible the purpose pursued in the deployment phase and to ensure their own compliance. This compliance will depend in particular on taking into account the specific risks related to this purpose. Some of these risks should be anticipated from the development stage: the CNIL recommends taking into account, at the development stage, the risks associated with known or reasonably foreseeable deployments, even where the user of the system is another controller. Where appropriate, the license granted to third-party users should allow the data subjects to know the extent of those risks.
Good practice
Transparency of processing is of particular importance for general purpose AI systems. Thus, in addition to complying with the obligations referred to above, the CNIL recommends, as a good practice, that:
The purpose refer to the foreseeable capabilities most at risk
The controller is invited to identify, upstream, the foreseeable capabilities of the AI system that present the most risks in the deployment phase. This would be the case, for example, of AI systems identified as “high risk” under the EU AI Act.
The purpose refer to functionalities excluded by design
The description of the system’s capabilities may include design choices that limit its functionality, such as:
the ability of an LLM to process only short texts, such as posts on social networks;
the calculation time required for a computer vision system to perform its detections, which could be too long to perform real-time detection;
the list of classes provided for a classification algorithm that would thus exclude the detection of other categories (feelings, objects, etc.).
These limitations could be specified in particular at the end of the testing and validation stages of the development phase, which allows the controller to specify the functional scope of the AI system.
The purpose specify, as far as possible, the conditions of use of the AI system
The controller may specify the conditions of use of the AI system. These may include, for example, known use cases of the solution or the modalities of use (open source distribution of the model, marketing, SaaS availability, etc.).
The controller could also provide examples of operational use cases or purposes of the AI system (e.g. traffic regulation for a computer vision system capable of detecting and quantifying vehicle flows).
How to define the purposes of the development of an AI system for scientific research?
The controller must always define the purpose of the research and of the data processing. However, in the field of scientific research, it may be accepted that this objective is defined with a lesser degree of precision, or that the research purposes are not specified in their entirety, given the difficulties researchers may have in identifying them fully at the beginning of their work. It will then be possible to provide additional information to clarify the objective as the project progresses.
Possible derogations
Data processing carried out for the purpose of scientific research benefits from derogations and adjustments to certain data protection obligations (e.g. in relation to individuals’ rights or retention periods).
Reminder: what is ‘scientific research’ within the meaning of the GDPR?
The notion of ‘scientific research’ is to be understood broadly in the GDPR. In short, the aim of the research must be to produce new knowledge, in any area in which the scientific method is applicable.
To help controllers determine whether they can benefit from the provisions on scientific research, the CNIL proposes a set of criteria for assessing whether processing that pursues a research purpose falls within the scope of scientific research:
In some cases, it will be possible to assume that the creation of training datasets for AI pursues a scientific research purpose due to the nature of the organisation (e.g. a university or a public research centre) or the type of funding (e.g. funding from the French National Research Agency).
Otherwise, in particular for privately funded scientific research, the following criteria (based on the OECD Frascati Manual and its definition of R&D) should be examined together. As these criteria are cumulative, the controller will in principle have to demonstrate that they are all fulfilled in order for the processing to be considered scientific research within the meaning of the GDPR. Failing that, a case-by-case analysis is necessary to qualify the processing.
Novelty: the processing should aim to obtain new results (novelty may also result from a project that identifies discrepancies with the expected result). The stated purpose of the research can help in qualifying it as scientific. In this respect, the publication of articles in a peer-reviewed journal or the grant of a patent can help establish the novelty criterion.
Creativity: this criterion concerns the originality and non-obviousness of the notions and hypotheses involved, i.e. the contribution of the work to scientific knowledge or the state of the art. The development of collective knowledge benefiting more than just the legal entity supporting the research project is a strong indication of scientific character.
Uncertainty: the processing must be uncertain as to the final outcome.
Systematicity: the processing must be part of a planned and budgeted project and implement a scientific methodology. Adherence to relevant industry standards of methodology and ethics is a strong indication that the research is scientific.
Transferability/reproducibility: the processing should lead to results that can be replicated or transferred to a wider field. For example, the publication of the study carried out and the presentation of the research methodology adopted strongly indicate the project leaders’ willingness to share the results of the research.
Example:
The development of a proof-of-concept AI system demonstrating that robust machine learning can be achieved with less training data could be considered as pursuing scientific research purposes, provided it is part of a documented scientific approach intended for publication.
Case 1: The operational use of the AI system during the deployment phase is identified from the development phase.
If the purpose in the deployment phase is specified, explicit and legitimate, it is also considered that the purpose in the development phase is specified, explicit and legitimate.
Case 2: The operational use of the AI system during the deployment phase is not clearly defined from the development phase (general purpose AI systems).
The purpose of the processing in the development phase must refer cumulatively to:
the 'type' of system developed
technically feasible functionalities and capabilities
It is recommended that the purpose also include:
the most at-risk foreseeable capabilities
functionalities excluded by design
as far as possible, the conditions of use of the AI system
Case 3: Creating a training dataset for scientific research purposes
It may be acceptable for the objective to be specified with a lower degree of precision, or for the research objectives not to be specified in their entirety, given the difficulty of defining them fully from the beginning.