Collecting and qualifying training data

12 September 2022

Compliance with the GDPR when collecting and compiling a quality database.

Processing learning data lawfully in compliance with the regulations

When compiling a training database containing personal data (whether created by the AI system provider or provided by a third party), certain precautions must be taken to ensure compliance with the regulations. Checks should be carried out, in particular, on the origin of the data, on the actual possibility of reusing it for training purposes, and on the measures taken to limit the risks of misuse.

The following questions may help the data controller to assess whether it meets the requirements for compiling a training database that is compliant with the regulations.


Is the training data reused (reuse of an internal or publicly accessible database, acquisition, etc.) or is it collected specifically?

If reused, has the database been compiled in accordance with data protection regulations?

If a publicly accessible database has been used, has it been studied, in particular for the presence of any bias?

What is the legal basis for the processing of training data?

Sensitive data (health, biometrics, etc.) can only be processed if it meets one of the exceptions set out in Article 9 of the GDPR, while data relating to criminal convictions and offences is subject to the specific conditions of Article 10. If relevant, on which of these exceptions is the processing based?

How is the compliance of training data processing monitored (completion of a DPIA, analysis of the risks of re-identification, etc.)?

Does the way in which the datasets used for training are produced meet the minimisation principle?

Is the data anonymised?

If yes, how?

Is it pseudonymised?

If yes, how?

Have the risks of re-identification been assessed?
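As one possible illustration of pseudonymisation, the sketch below replaces a direct identifier with a keyed hash. The field names, record schema and key handling are assumptions for illustration only, and pseudonymised data remains personal data under the GDPR.

```python
import hmac
import hashlib

# Hypothetical secret key: in practice it must be stored separately from
# the dataset, with restricted access, so that pseudonyms cannot be
# recomputed by whoever holds the data alone.
SECRET_KEY = b"replace-with-a-securely-stored-key"

def pseudonymise(value: str) -> str:
    """Replace a direct identifier with a keyed hash (HMAC-SHA256).

    Unlike a plain hash, reversing this by brute force requires the
    secret key, not just a list of candidate values.
    """
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

# Illustrative record; field names are assumptions, not a prescribed schema.
record = {"name": "Alice Martin", "age": 34, "diagnosis": "flu"}
record["name"] = pseudonymise(record["name"])
# The remaining fields may still allow re-identification in combination,
# which is why a separate re-identification risk assessment is still needed.
```

Note that the indirect identifiers left in the record (age, diagnosis) are exactly what the re-identification risk assessment above is meant to cover.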

Is the volume of data collected justified in view of the difficulty of the learning task?

Are all the variables considered for training the model necessary?

Could the collection of certain values be avoided if they prove not to be useful for learning, especially where sensitive data is involved?

If the collection of these values cannot be avoided, could the data be removed or hidden?
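The minimisation questions above can be illustrated with a short sketch that keeps only the variables actually justified for the learning task; every field name and value here is hypothetical.

```python
# Hypothetical raw records from a source system; field names are invented.
raw_records = [
    {"age": 34, "postcode": "75011", "income": 32000, "religion": "n/a"},
    {"age": 51, "postcode": "69003", "income": 41000, "religion": "n/a"},
]

# Variables justified as necessary for the learning task, ideally decided
# before collection (minimisation principle); everything else is dropped.
REQUIRED_FIELDS = {"age", "income"}

minimised = [
    {k: v for k, v in rec.items() if k in REQUIRED_FIELDS}
    for rec in raw_records
]
```

Dropping a column after collection is a fallback: where possible, unnecessary values (particularly sensitive ones, such as religion here) should not be collected at all.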

From raw data to a quality training dataset

The quality of the algorithm's output is closely linked to the quality of the training dataset, irrespective of the data categories involved.

Certain criteria need to be checked in order to limit the risk of error when using the algorithm, especially when it has consequences for individuals.


Has the accuracy of the data been verified?

If an annotation method has been used, is it checked?

If the annotation is performed by humans, have they been trained?

Is the quality of their work monitored?
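One common way to monitor annotators' work, sketched below, is chance-corrected agreement between two annotators labelling the same items (Cohen's kappa); the labels in the example are invented.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement.

    Returns 1.0 for perfect agreement, ~0.0 for chance-level agreement.
    (Undefined when expected agreement is 1, i.e. a single label.)
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Invented example: two annotators labelling the same four items.
kappa = cohen_kappa(["cat", "dog", "cat", "cat"],
                    ["cat", "dog", "dog", "cat"])
```

A persistently low kappa suggests that the annotation guidelines or the annotators' training need to be revisited.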

Is the data used representative of the data observed in the actual environment?

Which methodology has been used to ensure this representativeness?

Has a formalised study of this representativeness been carried out?
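A minimal way to formalise such a representativeness check, under the assumption that a reference distribution for the deployment environment is available, is to compare empirical distributions of a variable; the figures below are invented.

```python
from collections import Counter

def total_variation(sample, reference):
    """Total variation distance between two empirical categorical
    distributions: 0.0 means identical proportions, 1.0 means disjoint."""
    p, q = Counter(sample), Counter(reference)
    n_p, n_q = len(sample), len(reference)
    return 0.5 * sum(abs(p[c] / n_p - q[c] / n_q) for c in set(p) | set(q))

# Invented figures: a variable's proportions in the training set vs a
# reference describing the population the system will actually serve.
distance = total_variation(["F"] * 30 + ["M"] * 70,
                           ["F"] * 48 + ["M"] * 52)
```

A formalised study would apply such a comparison, variable by variable, and document the thresholds chosen and any gaps found.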

If the processing is based on a federated learning solution, has it been checked that the data used within the centres is independent and identically distributed (a condition ensuring that the information drawn from the data will reflect the same trends without being specific to each centre)?

If the data is not independent and identically distributed, what steps are taken to remedy this?
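As a sketch of one such check, each centre's label proportions can be compared with the pooled proportions; centre names and figures below are invented, and a small gap alone does not establish that the data is IID.

```python
from collections import Counter

# Hypothetical per-centre labels in a federated learning setting.
centres = {
    "centre_a": ["pos"] * 40 + ["neg"] * 60,
    "centre_b": ["pos"] * 10 + ["neg"] * 90,
}

pooled = [label for labels in centres.values() for label in labels]
global_freq, n_global = Counter(pooled), len(pooled)

# Maximum gap between local and pooled label proportions per centre;
# a large gap is a simple warning sign of non-IID data across centres.
gaps = {}
for name, labels in centres.items():
    local, n_local = Counter(labels), len(labels)
    gaps[name] = max(abs(local[c] / n_local - global_freq[c] / n_global)
                     for c in global_freq)
```

Centres with a large gap are candidates for the remedial steps mentioned above (e.g. rebalancing or algorithms designed for non-IID data).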

In the case of an AI system using continuous learning, what mechanism is implemented to ensure, on an ongoing basis, the quality of the data used?

Are regular mechanisms in place to assess the risk of loss of quality or changes in data distribution?
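One widely used (but not prescribed) indicator for monitoring changes in data distribution is the Population Stability Index computed over binned values of a feature or score; the thresholds in the comment are industry conventions, not regulatory limits.

```python
import math

def psi(expected_counts, observed_counts, eps=1e-6):
    """Population Stability Index between two binned distributions.

    Rule-of-thumb reading (a convention, not a regulatory threshold):
    < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift.
    """
    total_e, total_o = sum(expected_counts), sum(observed_counts)
    value = 0.0
    for e, o in zip(expected_counts, observed_counts):
        pe = max(e / total_e, eps)  # floor proportions to avoid log(0)
        po = max(o / total_o, eps)
        value += (po - pe) * math.log(po / pe)
    return value
```

Run periodically against a frozen baseline, such an indicator can trigger a review before degraded data quietly degrades the model.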

Identifying the risks of bias and correcting them effectively

The risks of discrimination linked to the use of an algorithm trained on biased data are now widely known. However, the factors contributing to these risks remain poorly identified and the methods to correct them are still experimental.

The training dataset must therefore be thoroughly inspected for signs of potential bias.


Is the method used to collect training data sufficiently well known?

Is there a possibility of bias due to the method used or the specific collection conditions?

Does the training data include data related to the particular characteristics of the individuals such as their gender, age, physical characteristics, sensitive data, etc.?

Which characteristics?

Have the hypotheses made based on the training data been discussed, clearly documented and checked against the actual situation?

Has a study of the correlations between these particular characteristics and the rest of the training data been carried out to identify possible proxies?
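Such a proxy study can start from a simple association measure between the sensitive attribute and each other categorical variable, for example Cramér's V; this is a sketch, and real proxy detection usually requires more than pairwise association.

```python
from collections import Counter
import math

def cramers_v(xs, ys):
    """Association between two categorical variables: 0.0 means
    independent, 1.0 perfectly associated. A high value between a
    sensitive attribute and another variable flags a potential proxy."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    fx, fy = Counter(xs), Counter(ys)
    chi2 = sum((joint[(x, y)] - fx[x] * fy[y] / n) ** 2 / (fx[x] * fy[y] / n)
               for x in fx for y in fy)
    return math.sqrt(chi2 / (n * (min(len(fx), len(fy)) - 1)))
```

Variables strongly associated with a sensitive attribute deserve the documented discussion called for above, since a model can reconstruct the sensitive information from them even when it is never given that attribute directly.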

Has a bias study been carried out?

Which method was used?

If a bias has been identified, what steps have been taken to reduce it?
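One classical pre-processing mitigation, sketched below, is reweighing in the style of Kamiran and Calders: each training instance receives a weight that makes the sensitive group statistically independent of the label. It is one option among many pre-, in- and post-processing methods, and the group and label values here are invented.

```python
from collections import Counter

def reweighing(groups, labels):
    """Instance weights w(g, l) = P(g) * P(l) / P(g, l).

    Under these weights, the sensitive group and the label become
    statistically independent, correcting this particular bias in
    the training data without altering the records themselves.
    """
    n = len(groups)
    fg, fl = Counter(groups), Counter(labels)
    joint = Counter(zip(groups, labels))
    return [
        (fg[g] / n) * (fl[l] / n) / (joint[(g, l)] / n)
        for g, l in zip(groups, labels)
    ]
```

As the questions above suggest, any such correction should itself be documented and its effect verified, since mitigation methods remain experimental.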
