AI system development: CNIL’s recommendations to comply with the GDPR

07 June 2024

To help professionals reconcile innovation and respect for people’s rights, the CNIL has published its first recommendations on the application of the GDPR to the development of artificial intelligence systems. Here's what you need to remember.

This content is a courtesy translation of the original publication in French. In the event of any inconsistencies between the French version and this English translation, please note that the French version shall prevail.

Designers and developers of AI systems often report to the CNIL that the application of the GDPR is challenging for them, in particular for the training of models.

The idea that the GDPR prevents AI innovation in Europe is a misconception. However, we must be aware that training datasets sometimes include “personal data”, i.e. data on real people. The use of such data poses risks to individuals, which must be taken into account in order to develop AI systems under conditions that respect individuals’ rights and freedoms, including their right to privacy.

Scope of the recommendations

Which AI systems are concerned?

These recommendations address the development of AI systems involving the processing of personal data (for more information on the legal framework, see how-to sheet 1). The training of AI systems regularly requires the use of large volumes of information on natural persons, known as “personal data”.

The following are concerned:

Systems based on machine learning;
Systems whose operational use is defined from the development phase and general purpose systems that can be used for various applications (“general purpose AI”);
Systems for which the learning is done “once and for all” or continuously, e.g. using usage data for their improvement.

What are the steps involved?

These recommendations concern the development phase of AI systems, not the deployment phase.

The development phase includes all the steps prior to the deployment of the AI system in production: system design, dataset creation and model training, etc.

How do these recommendations relate to the European AI Act?

The recommendations take into account the recently adopted EU Artificial Intelligence Act. Indeed, where personal data is used for the development of an AI system, both the GDPR and the AI Act apply. The CNIL's recommendations have therefore been drawn up to supplement them in a consistent manner regarding data protection.

For more information, see how-to sheet 0

Step 1: Define an objective (purpose) for the AI system

The principle

An AI system based on the exploitation of personal data must be developed with a “purpose”, i.e. a well-defined objective.

This makes it possible to frame and limit the personal data that can be used for training, so as not to store and process unnecessary data.

This objective must be determined, i.e. established as soon as the project is defined. It must also be explicit, i.e. known and understandable. Finally, it must be legitimate, i.e. compatible with the organisation’s tasks.

It is sometimes objected that the requirement to define a purpose is incompatible with the training of AI models, which may develop unanticipated characteristics. The CNIL considers that this is not the case and that the requirement to define a purpose must be adapted to the context of AI, without, however, disappearing, as the following examples show.

In practice

There are three types of situations.

You clearly know what the operational use of your AI system will be

In this case, this objective will be the purpose of the development phase as well as the deployment phase.

Example:

An organisation sets up a database of photographs of train carriages in service (i.e. with persons present) in order to train an algorithm to measure the crowding and occupancy of trains in stations. The purpose in the development phase is determined, explicit and legitimate in relation to the identified operational use.

However, this is more complex when you are developing a general purpose AI system that can be used in various contexts and applications or when your system is developed for scientific research purposes.

For general purpose AI systems

Example:

An organisation may set up a dataset for training an image classification model (persons, vehicles, food, etc.) and make it publicly available, without any specific operational use being foreseen when developing the model.

You cannot define the purpose too broadly, for example as “the development and improvement of an AI system”. You will need to be more precise and refer to:

the “type” of system developed, such as, for example, the development of a large language model, a computer vision system or a generative AI system for images, videos, sounds, computer code, etc.
the technically feasible functionalities and capabilities.

Good practice:

You can give even more details about the objective pursued, for example by determining:

the foreseeable capacities most at risk;
functionalities excluded by design;
the conditions of use of the AI system: the known use cases of the solution or the conditions of use (dissemination of the model in open source, marketing, availability through SaaS or API, etc.).

For AI systems developed for scientific research purposes

Example:

The development of an AI system for a proof of concept intended to demonstrate the robustness of machine learning requiring less training data, in a documented scientific approach intended for publication, could be regarded as pursuing scientific research purposes.

You can define a less detailed objective, given the difficulties of defining it precisely from the beginning of your work. You can then provide additional information to clarify this goal as your project progresses.

For more information, see how-to sheet 2

Step 2: Determine your responsibilities

The principle

If you use personal data for the development of AI systems, you need to determine your role within the meaning of the GDPR. You can be:

controller (TR): you determine the purposes and means, i.e. you decide on the “why” and “how” of the use of personal data. If one or more other bodies decide on these elements with you, you will be joint controllers and will have to define your respective obligations (e.g. through a contract).
subcontractor (ST), referred to as a “processor” in the English version of the GDPR: you process data on behalf of an organisation that is the “controller”. In this case, the controller must ensure that you comply with the GDPR and that you process the data only on its instructions: the law then requires the conclusion of a data processing contract.

In practice

The European AI Act defines several roles:

an AI system provider developing or having developed a system and placing it on the market or putting it into service under its own name or trade mark, whether for a fee or free of charge;
importers, distributors and users (also known as deployers) of these systems.

Your degree of responsibility depends on a case-by-case analysis. For example:

If you are a provider that initiates the development of an AI system and you build the training dataset from data you have selected on your own behalf, you can be qualified as a controller.
If you are building the training dataset of an AI system with other controllers for a purpose that you have defined together, you can be referred to as joint controllers.
If you are an AI system provider, you can be a subcontractor if you are developing a system on behalf of one of your customers. The customer will be the controller if it determines not only the purpose but also the means, i.e. the techniques to be used. If it only gives you a goal to achieve and you design the AI system, you are the controller.
If you are an AI system provider, you can use a service provider to collect and process the data according to your instructions. The service provider will be your subcontractor. This is the case, for example, of a service provider that has to set up a training dataset for an AI system provider that tells it precisely how the dataset has to be built.

For more information, see how-to sheet 3

For the rest:

If you are a controller, all the following steps are of direct concern to you, and you are responsible for ensuring compliance.
If you are a subcontractor, your main obligations are as follows:
- Ensure that a contract governing the processing of personal data on behalf of the controller has been concluded and that it complies with the regulations;
- Strictly follow the instructions of the controller and do not use the personal data for anything else;
- Strictly ensure the security of the data processed;
- Assess compliance with the GDPR at your level (see next steps) and alert the controller if you feel there is a problem.

Step 3: Define the "legal basis" that allows you to process personal data

The principle

The development of AI systems involving personal data requires a legal basis that allows you to process this data. The GDPR lists six possible legal bases: consent, compliance with a legal obligation, the performance of a contract, the performance of a task carried out in the public interest, the safeguarding of vital interests, and the pursuit of a legitimate interest.

Depending on the legal basis chosen, your obligations and the rights of individuals may vary, which is why it is important to determine it upstream and indicate it in the data privacy policy.

In practice

You need to ask yourself about the most appropriate legal basis for your situation.

If you collect data directly from individuals and they are free to accept or refuse without detriment (such as having to give up the service), consent is often the most appropriate legal basis. According to the law, it must be free, specific, informed and unambiguous.

Gathering consent, however, is often impossible in practice for dataset creation. For example, when you collect data accessible online or reuse an open source database, without direct contact with data subjects, other legal bases will generally be more suitable:

Private actors will have to analyse whether they meet the conditions for relying on legitimate interest. To do so, they must demonstrate that three conditions are met:
- the interest pursued is legitimate, that is to say, lawful and precisely and genuinely defined;
- it must be possible to establish that the personal data are really necessary for the training of the system, because it is not possible to use only anonymised data or data which do not relate to natural persons;
- the use of such personal data must not lead to a “disproportionate interference” with the privacy of individuals. This is assessed on a case-by-case basis, depending on what is revealed by the data used, which may be more or less private or sensitive, and on what is done with the data.

Please note: a how-to sheet specific to the legal basis of legitimate interest will be published shortly.

Public actors must verify whether the processing is in line with their public interest mission as provided for by a legal text (e.g. a law, decree, etc.) and whether it contributes to it in a relevant and appropriate way.

Example: the French pôle d'expertise de la régulation numérique (PEReN) is authorised on this basis to reuse publicly available data to carry out experiments aimed in particular at designing technical tools for the regulation of operators of online platforms.

The legal bases of the contract and the legal obligation may be used more exceptionally, if you demonstrate how your processing is necessary to meet the performance of the contract or pre-contractual measures or a (sufficiently precise) legal obligation to which you are subject.

For more information, see how-to sheet 4

Step 4: Check if I can re-use certain personal data

The principle

If you plan to re-use a dataset that contains personal data, make sure this re-use is lawful. That depends on the method of collection and the source of the data in question. You, as a controller (see “Determine your responsibilities”), must carry out certain additional checks to ensure that such use is lawful.

In practice

The rules will depend on the situation.

The provider reuses data that it has already collected itself

You may want to re-use the data you originally collected for another purpose. In this case, if you had not foreseen this re-use and informed the data subjects about it, you should check that this new use is compatible with the original purpose, unless you are authorised by the data subjects (they have consented) or by a legal text (e.g. a law, decree, etc.).

You must carry out what is known as a “compatibility test”, which must take into account:

the existence of a link between the initial objective and that of building a dataset for training an AI system;
the context in which the personal data were collected;
the type and nature of the data;
the possible consequences for the persons concerned;
the existence of appropriate safeguards (e.g. pseudonymisation of data).
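
As an illustration of the pseudonymisation safeguard mentioned in the last item, here is a minimal sketch in Python. It assumes a keyed hash (HMAC-SHA256) is an acceptable technique for the project; the function name, the key and the sample identifier are hypothetical and are not taken from the CNIL's recommendations. Pseudonymised data remain personal data under the GDPR, so this reduces risk without removing the other obligations.

```python
import hashlib
import hmac


def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Replace a direct identifier with a keyed hash (HMAC-SHA256).

    The same identifier and key always give the same token, so records can
    still be linked, but the token cannot be reversed without the key.
    """
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


# Placeholder key for illustration only: a real key would be generated
# randomly and stored separately from the training dataset.
key = b"replace-with-a-secret-key-stored-separately"
token = pseudonymise("alice@example.org", key)
```

The key should be accessible only to the few people who genuinely need to re-identify records, for example to handle requests from data subjects.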

Please note: if you wish to re-use data for statistical or scientific research purposes, the processing is presumed to be compatible with the original purpose. No compatibility test is therefore necessary in this case.

The provider re-uses publicly available data (open source)

In this case, you need to make sure that you are not re-using a dataset whose constitution was manifestly unlawful (e.g. from a data leak). A case-by-case analysis must be carried out.

The CNIL recommends that re-users check and document (for example, in the data protection impact assessment) the following:

the description of the dataset mentions its source;
the creation or dissemination of the dataset is not manifestly the result of a crime or misdemeanour, and has not been the subject of a public conviction or sanction by a competent authority involving removal or a prohibition of use;
there is no obvious doubt as to the lawfulness of the dataset, ensuring in particular that the conditions for data collection are sufficiently documented;
the dataset does not contain sensitive data (e.g. health data or data revealing political opinions) or data relating to offences; if it does, it is recommended to carry out additional checks to ensure that such processing is lawful.

The body that uploaded the dataset is supposed to have ensured that the publication complied with the GDPR, and is responsible for it. However, you do not have to verify that the bodies that set up and disseminated the dataset have complied with all the obligations laid down in the GDPR: the CNIL considers that the four verifications mentioned above are generally sufficient to allow the re-use of the dataset for the training of an AI system, provided that the other CNIL recommendations are complied with. If you receive information, especially from people whose data is contained in the dataset, that highlights problems with the lawfulness of the data used, you will need to investigate further.
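
The four checks above lend themselves to a simple written record, for example as an annex to the data protection impact assessment. The Python sketch below is one possible way to keep such a record; the class and field names are illustrative assumptions and do not correspond to any CNIL template.

```python
from dataclasses import dataclass


@dataclass
class ReuseChecks:
    """Hypothetical record of the four checks; all field names are illustrative."""

    source_documented: bool        # the dataset description mentions its source
    no_manifest_offence: bool      # not manifestly the result of an offence, no public sanction
    collection_documented: bool    # conditions of collection are sufficiently documented
    no_sensitive_data: bool        # no sensitive data or data relating to offences
    additional_checks_done: bool = False  # extra lawfulness checks when such data is present

    def open_points(self) -> list[str]:
        """Return the names of the checks that still need attention."""
        points = [
            name
            for name in ("source_documented", "no_manifest_offence", "collection_documented")
            if not getattr(self, name)
        ]
        # Sensitive data is not a blocker in itself, provided the
        # additional checks have been carried out.
        if not self.no_sensitive_data and not self.additional_checks_done:
            points.append("no_sensitive_data")
        return points


checks = ReuseChecks(
    source_documented=True,
    no_manifest_offence=True,
    collection_documented=False,
    no_sensitive_data=False,
)
print(checks.open_points())
```

An empty list only means that the documented checks raise no open point; the analysis itself remains case by case.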

The provider reuses data acquired from a third party (data brokers, etc.)

For the third party sharing personal data, sometimes for remuneration, there are two types of situations.

In the first case, the third party collected the data for the purpose of building a dataset for AI system training. It must ensure that the transmission of the data complies with the GDPR (definition of an explicit and legitimate objective, requirement of a legal basis, information to individuals and management of the exercise of their rights, etc.).

In the second case, the third party did not initially collect the data for that purpose. It must then ensure that the transmission of those data pursues an objective compatible with that which justified their collection. It will therefore have to carry out the “compatibility test” described above.

The re-user of the data has several obligations:

It must ensure that it is not re-using a manifestly unlawful dataset by carrying out the same checks as those set out in the section above. The conclusion of an agreement between the original data holder and the re-user is recommended in order to facilitate these verifications.
In addition to those checks, it must ensure its own compliance with the GDPR in the processing of that data.

For more information, see how-to sheet 4

Step 5: Minimize the personal data I use

The principle

The personal data collected and used must be adequate, relevant and limited to what is necessary in the light of the objective defined: this is the principle of data minimisation. You must respect this principle and apply it particularly rigorously when the data processed is sensitive (data concerning health, data concerning sex life, religious beliefs or political opinions, etc.).

In practice

The method to be used

You should favour the technique that achieves the desired result (or a result of the same order) using as little personal data as possible. In particular, the use of deep learning should not be systematic.

The choice of the learning protocol used may, for example, make it possible to limit access to data only to authorised persons, or to give access only to encrypted data.

Selection of strictly necessary data

The principle of minimisation does not prohibit the training of an algorithm with very large volumes of data, but requires you to:

think upstream about which personal data are actually useful for the development of the system, so as to use only those; and
subsequently implement the technical means to collect only those data.

The validity of design choices

In order to validate the design choices, it is recommended as a good practice to:

conduct a pilot study, i.e. carry out a small-scale experiment. Fictitious, synthetic or anonymised data may be used for this purpose;
consult an ethics committee (or an “ethical advisor”). This committee must ensure that ethical issues and the protection of the rights and freedoms of individuals are properly taken into account. It can thus formulate opinions on all or part of the organisation’s projects, tools, products, etc. likely to raise ethical issues.

The organisation of the collection

You must ensure that the data collected is relevant in view of the objectives pursued. Several steps are strongly recommended:

Data cleaning: This step allows you to build a quality dataset and thus strengthen the integrity and relevance of the data by reducing inconsistencies, as well as the cost of learning.
Identification of the relevant data: This step aims to optimize system performance while avoiding under- and over-fitting. In practice, it allows you to make sure that certain classes or categories that are unnecessary for the task at hand are not represented, that the proportions between the different interest classes are well balanced, etc. This procedure also aims to identify data that is not relevant for learning (which will then have to be removed from the dataset).
The implementation of measures to incorporate the principles of personal data protection by design: This step allows you to apply data transformations (such as generalisation and/or randomisation measures, data anonymisation, etc.) to limit the impact on people.

Monitoring and updating of data: minimisation measures could become obsolete over time. The data collected may cease to be accurate, relevant, adequate and limited, due to a possible drift of the data, an update of the data or technical developments. You will therefore have to analyse the dataset regularly to keep track of it.
Documentation of the data used for the development of an AI system: it allows you to guarantee the traceability of the datasets used, which their large size can make difficult. You must keep this documentation up to date as the dataset evolves. The CNIL provides a documentation template.
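
To make the selection and transformation steps above more concrete, here is a minimal sketch in Python. It assumes a hypothetical tabular dataset in which only the age, the postcode and a label are needed for the task; the field names, the ten-year age bands and the two-character postcode prefix are all invented for illustration. Generalisation of this kind limits the impact on people but does not by itself make the data anonymous.

```python
# Fields assumed to be necessary for the (hypothetical) learning task.
NEEDED_FIELDS = {"age", "postcode", "label"}


def generalise_age(age: int, width: int = 10) -> str:
    """Replace an exact age with a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def minimise(record: dict) -> dict:
    """Keep only the needed fields and generalise the quasi-identifiers."""
    kept = {k: v for k, v in record.items() if k in NEEDED_FIELDS}
    kept["age"] = generalise_age(kept["age"])
    kept["postcode"] = kept["postcode"][:2]  # keep only the first two characters
    return kept


raw = {
    "name": "Alice Martin",          # not needed: dropped
    "email": "alice@example.org",    # not needed: dropped
    "age": 37,
    "postcode": "75011",
    "label": 1,
}
print(minimise(raw))
```

Applying such a function at collection time, rather than after storage, avoids keeping the unnecessary fields at all.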

For more information, see how-to sheets 6 and 7

Step 6: Set a retention period

Step 7: Carry out a Data Protection Impact Assessment (DPIA)

The principle

The Data Protection Impact Assessment (DPIA) is an approach that allows you to map and assess the risks of processing on personal data protection and establish an action plan to reduce them to an acceptable level. In particular, it will lead you to define the security measures needed to protect the data.

In practice

Carrying out a DPIA for the development of AI systems

It is strongly recommended to carry out a DPIA for the development of your AI system, especially when two of the following criteria are met:

sensitive data are collected;
personal data are collected on a large scale;
data of vulnerable persons (minors, persons with disabilities, etc.) are collected;
datasets are crossed or combined;
new technological solutions are implemented or innovative use is made.

In addition, if significant risks exist (e.g. data misuse, data breach or discrimination), a DPIA must be carried out even if two of the above criteria are not met.
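
The screening rule described above can be sketched as a small Python function. This is only an aid for a first triage, under the assumption that “two of the following criteria” is read as “at least two”; the criterion names are hypothetical labels for the five items listed, and a negative result does not mean that a DPIA is unnecessary.

```python
# Hypothetical labels for the five criteria listed above.
CRITERIA = (
    "sensitive_data",
    "large_scale",
    "vulnerable_persons",
    "datasets_combined",
    "innovative_use",
)


def dpia_indicated(flags: dict, significant_risk: bool = False) -> bool:
    """Return True when the screening rule points to a DPIA.

    flags maps criterion names to booleans; missing criteria count as not met.
    significant_risk covers the case where a DPIA must be carried out
    even if fewer than two criteria are met.
    """
    met = sum(1 for criterion in CRITERIA if flags.get(criterion, False))
    return significant_risk or met >= 2


print(dpia_indicated({"sensitive_data": True, "large_scale": True}))
```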

To help carry out a DPIA, the CNIL makes available the dedicated open source PIA software.

The risk criteria introduced by the European AI Act

The CNIL considers that, for the development of high-risk systems covered by the AI Act and involving personal data, the performance of a DPIA is in principle necessary.

Please note: Completion of the DPIA may be based on the documentation required by the AI Act provided that it includes the elements provided for in the GDPR (Article 35 GDPR).

The scope of the DPIA

There are two types of situations for the provider of an AI system, depending on the purpose of the AI system (see “Step 1: Define an objective (purpose) for the AI system”).

You clearly know what the operational use of your AI system will be
It is recommended to carry out a general DPIA for the whole life cycle, which includes the development and deployment phases. Please note that if you are not the user/deployer of the AI system, it is this actor that will be responsible for carrying out the DPIA for the deployment phase (although it may base it on the DPIA template you have provided).
If you are developing a general purpose AI system
You will only be able to carry out a DPIA during the development phase. This DPIA should be provided to the users of your AI system to enable them to conduct their own analysis.

AI risks to be considered in a DPIA

Processing of personal data based on AI systems presents specific risks that you must take into account:

the risks related to the confidentiality of data that can be extracted from the AI system;
the risks to data subjects linked to misuse of the data contained in the training dataset (by your employees who have access to it or in the event of a data breach);
the risk of automated discrimination caused by a bias in the AI system introduced during development;
the risk of producing false content about a real person, in particular in the case of generative AI systems;
the risk of automated decision-making when the staff member using the system is unable to verify its performance in real conditions or to take a decision contrary to the one provided by the system without detriment (due to hierarchical pressure, for example);
the risk of users losing control over their data published and freely accessible online;
the risks related to known attacks, specific to AI systems (e.g. data poisoning attacks);
the systemic and serious ethical risks related to the deployment of the system.

Actions to be taken based on the results of the DPIA

Once the level of risk has been determined, your DPIA must provide for a set of measures to reduce it and keep it at an acceptable level, for example:

security measures (e.g. homomorphic encryption or the use of a secure execution environment);
minimisation measures (e.g. use of synthetic data);
anonymisation or pseudonymisation measures (e.g. differential privacy);
data protection measures from the outset (e.g. federated learning);
measures facilitating the exercise of rights or redress for individuals (e.g. machine unlearning techniques, explainability and traceability measures regarding the outputs of the system, etc.);
audit and validation measures (e.g. fictitious attacks).
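
As an illustration of one of the measures listed above, the following Python sketch applies differential privacy to a simple counting query using the Laplace mechanism. It is a toy example under stated assumptions: the function names and the sample data are hypothetical, and differentially private training of a model relies on more elaborate techniques than adding noise to a single statistic.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise.

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(values, predicate, epsilon: float, rng=None) -> float:
    """Return a noisy count of the values matching `predicate`.

    A counting query changes by at most 1 when one person is added or
    removed (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    rng = rng or random.Random()
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1 / epsilon, rng)


ages = [23, 35, 41, 17, 68, 52, 15, 30]  # invented sample data
noisy = private_count(ages, lambda age: age < 18, epsilon=1.0)
```

A smaller epsilon adds more noise and therefore gives stronger protection at the cost of accuracy; choosing its value is a design decision to document in the DPIA.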

Other, more generic measures may also be applied: organisational measures (management and limitation of access to the datasets which may allow a modification of the AI system, etc.), governance measures (establishment of an ethics committee, etc.), measures for the traceability of actions or internal documentation (information charter, etc.).

For more information, see how-to sheet 5

The CNIL is continuing its work to help providers of AI systems.

It will soon publish new how-to sheets explaining how to design and train models in compliance with the GDPR: retrieval of data on the internet (web scraping); use of legitimate interest as a legal basis; exercise of the rights of access, rectification and erasure; whether or not to use open licences, etc.

These how-to sheets will be subject to public consultation.
