The creation of a training dataset containing personal data is processing of personal data which, pursuant to the GDPR, must have a purpose that is ‘specified, explicit and legitimate’. The CNIL helps you define this purpose, taking into account the specificities of AI system development.
This content is a courtesy translation of the original publication in French. In the event of any inconsistencies between the French version and this English translation, please note that the French version shall prevail.
The principle
The purpose of the processing is the objective pursued by the use of personal data. This objective must be specified, i.e. defined from the outset of the project. It must also be explicit, that is to say, known and understandable. Finally, it must be legitimate, i.e. compatible with the tasks of the organisation.
The data must not be further processed in a manner incompatible with this initial purpose: the principle of purpose limitation restricts how the controller may use or reuse these data in the future.
The requirement of a specified, explicit and legitimate purpose is particularly important, as it determines the application of other principles of the GDPR, including:
the principle of transparency: the purpose of the processing must be brought to the attention of the data subjects so that they are able to know the reason for the collection of the data concerning them and to understand the use that will be made of it;
the principle of data minimisation: the data selected must be adequate, relevant and limited to what is necessary for the purposes for which they are processed;
the principle of storage limitation: the data may only be kept for a limited period, defined according to the purpose for which they were collected.
How to define the purpose of the processing when the operational use is identified from the development stage?
This case concerns AI systems that are developed for a specific operational use in the deployment phase. This excludes AI systems that are developed without an operational use being defined in the development phase (see next section).
When an AI system is developed for a single operational use, it is considered that the purpose in the development phase is directly related to the one pursued by the processing in the deployment phase. It follows that if the purpose in the deployment phase is itself specified, explicit and legitimate, the purpose in the development phase will be as well.
In this case, these are still separate processing operations whose compliance with the obligations of the GDPR must be analysed separately (in particular in terms of identifying the legal basis, informing individuals, minimising the collected data, defining retention periods, etc.).
Example: an organisation wishes to set up a dataset consisting of photos of trains in service – that is, with people on board – in order to train an algorithm to measure passenger flows and train occupancy at stations. The purpose in the development phase may be considered to be specified, explicit and legitimate in relation to the identified operational use.
In some cases, an AI system can be developed for several operational uses defined from the development phase. In this case, the development of such an AI system may pursue several purposes corresponding to the operational uses identified (a data processing can indeed simultaneously pursue several purposes if they are all specified, explicit and legitimate).
How to define the purpose of processing for the development of general purpose AI systems?
This case concerns AI systems whose operational use in the deployment phase is not clearly identified in the development phase. This is the case for general purpose AI systems and foundation models that can be used for a wide variety of applications and for which it may be difficult to define a sufficiently specified and explicit purpose at the development stage.
Examples:
An organisation may set up a dataset for training an image classification model (persons, vehicles, food, etc.) and make it publicly accessible, without any specific operational use being foreseen during the development of the model.
This model can be freely reused by third-party organisations for the development of computer vision systems, in accordance with the associated license (and possibly adapted, e.g. using transfer learning techniques) and with the rules governing image rights and intellectual property. The purposes of the AI system can be varied: detection of people by augmented camera systems to measure attendance on station platforms, or detection of defects in images taken as part of product quality controls.
An organisation creates a dataset for training a language model to identify the language register of a text. This model can be used for various tasks: writing and summarizing articles, letters, speeches, learning French, etc.
The purpose of the processing during the development stage may be considered to be specified, explicit and legitimate only if it is sufficiently specific, i.e. where it refers cumulatively to:
the ‘type’ of system developed, such as, for example, the development of a large language model (LLM), a computer vision system or a generative AI system for images, videos or sounds. The types of systems must be presented in a sufficiently clear and intelligible way for the data subjects, taking into account their technical complexities and rapid developments in this area.
technically feasible functionalities and capabilities, which means that the controller must draw up a list of capabilities that he or she can reasonably foresee at the development stage.
These criteria make it possible to take into account the fact that the controller cannot define, at the development stage of an AI system, all its future applications, while ensuring that the purpose limitation principle is respected.
Examples of purposes considered to be explicit and specified:
Development of a large language model (LLM) able to answer questions, generate text according to context (emails, letters, reports, including computer code), perform translations, summaries and corrections of text, perform text classification, sentiment analysis, etc.;
Development of a voice recognition model capable of identifying a speaker, his or her language, age, gender, etc.;
Development of a computer vision model capable of detecting different objects such as vehicles (cars, trucks, scooters, etc.), pedestrians, street furniture (dumpsters, public benches, bicycle shelters, etc.), road signs, traffic lights, etc.
Conversely, the purpose would not be considered as specified enough if it only referred to the type of AI system, without mentioning the technically feasible functionalities and capabilities.
Examples of purposes that are not considered explicit and specified:
Development of a generative AI model (possible capabilities are not defined);
Development and improvement of an AI system (neither the type of model nor the possible capabilities are defined);
Development of a model to identify a person’s age (the type of system is not defined).
Warning: the provider of a general purpose AI system should remind system users of their obligation to define as precisely as possible the purpose pursued in the deployment phase and to ensure their own compliance. This compliance will depend in particular on taking into account the specific risks related to this purpose. Some of these risks should be anticipated from the development stage: the CNIL recommends taking into account, at the development stage, the risks associated with known or reasonably foreseeable deployments, even where the user of the system is another controller. Where appropriate, the license granted to third-party users should allow the data subjects to know the extent of those risks.
Good practice
Transparency of processing is of particular importance for general purpose AI systems. Thus, in addition to complying with the obligations referred to above, the CNIL recommends, as a good practice, that:
The purpose refer to the foreseeable capabilities most at risk
The controller is invited to identify, upstream, the foreseeable capabilities of the AI system that present the most risks in the deployment phase. This would be the case, for example, of AI systems identified as “high risk” under the EU AI Act.
The purpose refer to functionalities excluded by design
The description of the system’s capabilities may include design choices that limit its functionality, such as:
the ability of an LLM to process only short texts, such as posts on social networks;
the calculation time required for a computer vision system to perform its detections, which could be too long to perform real-time detection;
the list of classes provided for a classification algorithm that would thus exclude the detection of other categories (feelings, objects, etc.).
These limitations could be specified in particular at the end of the testing and validation stages of the development phase, which allows the controller to specify the functional scope of the AI system.
The purpose specify, as far as possible, the conditions of use of the AI system
The controller may specify the conditions of use of the AI system. These may include, for example, known use cases of the solution or the modalities of use (open source distribution of the model, marketing, SaaS availability, etc.).
The controller could also provide examples of operational use cases or purposes of the AI system (e.g. traffic regulation for a computer vision system capable of detecting and quantifying vehicle flows).
How to define the purposes of the development of an AI system for scientific research?
The controller must always define the purpose of the research and of the data processing. However, in the field of scientific research, it may be accepted that this objective is defined with a lesser degree of precision, or that the research purposes are not specified in their entirety, given the difficulties researchers may have in identifying them fully at the beginning of their work. It will then be possible to provide additional information to clarify the objective as the project progresses.
Possible derogations
Data processing carried out for the purpose of scientific research benefits from derogations and adjustments to certain data protection obligations (e.g. in relation to individuals’ rights or retention periods).
Reminder: what is ‘scientific research’ within the meaning of the GDPR?
The notion of ‘scientific research’ is to be understood broadly in the GDPR. In short, the aim of the research must be to produce new knowledge, in any area in which the scientific method is applicable.
To help controllers determine whether they can benefit from the provisions on scientific research, the CNIL proposes a set of criteria for assessing whether processing that pursues a research purpose falls within the scope of scientific research:
In some cases, it will be possible to assume that the creation of training datasets for AI pursues a scientific research purpose due to the nature of the organisation (e.g. a university or a public research centre) or the type of funding (e.g. funding from the French National Research Agency).
Otherwise, in particular for privately funded scientific research, the following criteria (based on the OECD Frascati Manual and its definition of R&D) should be examined together. As these criteria are cumulative, the controller will in principle have to demonstrate that they are all fulfilled in order for the processing to be considered scientific research within the meaning of the GDPR. Failing that, a case-by-case analysis is necessary to qualify the processing.
Novelty: the processing should aim to obtain new results (novelty may also result from a project that identifies discrepancies with the expected result). The stated purpose of the research can help in qualifying it as scientific. In this respect, the publication of articles in a peer-reviewed journal or the grant of a patent can help establish the novelty criterion.
Creativity: this criterion concerns the originality and non-obviousness of the notions and hypotheses involved, i.e. the contribution of the work to scientific knowledge or the state of the art. The development of collective knowledge benefiting more than just the legal entity supporting the research project is a strong indication of scientific character.
Uncertainty: the processing must be uncertain as to the final outcome.
Systematicity: the processing must be part of a planned and budgeted project and implement a scientific methodology. Adherence to relevant industry standards of methodology and ethics is a strong indication that the research is scientific.
Transferability/reproducibility: the processing should lead to results that can be replicated or transferred to a wider field. For example, the publication of the study carried out and the presentation of the research methodology adopted strongly indicate the project leaders’ willingness to share the results of the research.
Example:
The development of a proof-of-concept AI system demonstrating that robust machine learning can be achieved with less training data could be considered as pursuing scientific research purposes, provided it is part of a documented scientific approach intended for publication.
Case 1: The operational use of the AI system during the deployment phase is identified from the development phase.
If the purpose in the deployment phase is specified, explicit and legitimate, it is also considered that the purpose in the development phase is specified, explicit and legitimate.
Case 2: The operational use of the AI system during the deployment phase is not clearly defined from the development phase (general purpose AI systems).
The purpose of the processing in the development phase must refer cumulatively to:
the 'type' of system developed
technically feasible functionalities and capabilities
It is recommended that the purpose also include:
the most at-risk foreseeable capabilities
functionalities excluded by design
as far as possible, the conditions of use of the AI system
Case 3: Creating a training dataset for scientific research purposes
It may be acceptable for the objective to be specified with a lower degree of precision, or for the research objectives not to be specified in their entirety, given the difficulty of defining them fully from the beginning.