Collecting and qualifying training data

When compiling a training database containing personal data (whether created by the AI system provider or provided by a third party), certain precautions must be taken to ensure compliance with the regulations. In particular, checks should be carried out on the origin of the data, on whether it can actually be reused for training purposes, and on the measures taken to limit the risks of misuse.

The following questions may help the data controller to assess whether it meets the requirements for compiling a training database that is compliant with the regulations.

Is the training data reused (reuse of an internal or publicly accessible database, acquisition, etc.) or is it collected specifically?

If reused, has the database been compiled in accordance with data protection regulations?

If a publicly accessible database has been used, has it been studied, in particular for the presence of any bias?

Sensitive data (health, criminal records, etc.) can only be processed if the processing falls under one of the exceptions set out in Article 9 of the GDPR. If relevant, on which of these exceptions is the processing based?

How is the compliance of training data processing monitored (completion of a DPIA, analysis of the risks of re-identification, etc.)?

Does the way in which the datasets used for training are produced meet the minimisation principle?

Is the volume of data collected justified in view of the difficulty of the learning task?

Could the collection of some values be avoided should they prove not to be useful for learning, especially if they involve sensitive data?

If the collection of these values cannot be avoided, could the data be removed or hidden?
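As an illustration of the minimisation questions above, the following is a minimal sketch, assuming records are held as Python dictionaries. The column names, the `FEATURES_NEEDED` set and the helpers `age_band` and `minimise` are hypothetical and chosen only for the example. It drops values that are not needed for learning and coarsens those that are kept. Coarsening reduces precision but does not by itself guarantee anonymity.

```python
from typing import Any

# Hypothetical set of columns judged necessary for the learning task.
FEATURES_NEEDED = {"age", "postcode", "outcome"}


def age_band(age: int, width: int = 10) -> str:
    """Generalise an exact age into a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def minimise(record: dict[str, Any]) -> dict[str, Any]:
    """Keep only the needed columns and coarsen the ones that remain."""
    kept = {k: v for k, v in record.items() if k in FEATURES_NEEDED}
    if "age" in kept:
        kept["age"] = age_band(kept["age"])
    if "postcode" in kept:
        # Keep only the first two characters (a coarse area code).
        kept["postcode"] = str(kept["postcode"])[:2]
    return kept


# Fictitious record: the name and religion fields are removed entirely.
raw = {"name": "A. Example", "age": 34, "postcode": "75011",
       "religion": "x", "outcome": 1}
minimised = minimise(raw)
print(minimised)
```

Whether a given value can be dropped or only coarsened depends on the learning task, so the choice of columns and granularity here is an assumption to be justified case by case.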

From raw data to a quality training dataset

The quality of the algorithm's output is closely linked to the quality of the training dataset, irrespective of the data categories involved.

Certain criteria need to be checked in order to limit the risk of error when using the algorithm, especially when it has consequences for individuals.

If the processing is based on a federated learning solution, has it been checked that the data used within the centres is independent and identically distributed (a condition ensuring that the information drawn from the data will reflect the same trends without being specific to each centre)?

If the data is not independent and identically distributed, what steps are taken to remedy this?
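One way to start checking the "identically distributed" part of this condition is to compare each centre's label distribution with the pooled distribution. The sketch below is an assumption-laden example, not a complete test: the centre names and data are fictitious, the function names are hypothetical, and total variation distance is only one possible measure. It looks at labels only and says nothing about independence or about feature distributions.

```python
from collections import Counter


def label_distribution(labels):
    """Return the relative frequency of each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 to 1)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def centre_divergence(labels_by_centre):
    """Distance between each centre's labels and the pooled labels."""
    pooled = label_distribution(
        [label for labels in labels_by_centre.values() for label in labels]
    )
    return {
        centre: total_variation(label_distribution(labels), pooled)
        for centre, labels in labels_by_centre.items()
    }


# Fictitious data: centre_b has far more positive cases than centre_a.
centres = {
    "centre_a": ["pos"] * 50 + ["neg"] * 50,
    "centre_b": ["pos"] * 90 + ["neg"] * 10,
}
divergence = centre_divergence(centres)
print(divergence)
```

A value close to 0 suggests a centre resembles the pooled data on this criterion; larger values point to centres whose data may need reweighting, resampling or a learning method designed for heterogeneous data.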

In the case of an AI system using continuous learning, what mechanism is in place to ensure the ongoing quality of the data used?

Are mechanisms in place to regularly assess the risk of loss of quality or changes in data distribution?
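A change in data distribution can be monitored by comparing a recent sample against a reference sample. The sketch below uses the Population Stability Index (PSI) as one possible indicator. The function name `psi`, the bin edges and the sample values are assumptions made for illustration, and the alert threshold is left to the data controller.

```python
import math


def psi(reference, current, edges):
    """Population Stability Index between two samples over shared bins.

    `edges` are the interior bin boundaries, in ascending order.
    """
    def proportions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            # Index of the bin: number of edges the value has reached.
            counts[sum(v >= e for e in edges)] += 1
        eps = 1e-6  # avoids log(0) for empty bins
        return [max(c / len(values), eps) for c in counts]

    ref, cur = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Fictitious feature values at training time and in recent data.
reference = [1, 2, 3, 4, 5, 6, 7, 8]
shifted = [5, 6, 7, 8, 9, 10, 11, 12]
edges = [3, 6, 9]

print(psi(reference, reference, edges))  # identical samples
print(psi(reference, shifted, edges))    # shifted sample
```

The index is zero when both samples fall into the bins in the same proportions and grows as they diverge, so it can be computed on a schedule and tracked over time for each feature.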

Identifying the risks of bias and correcting them effectively

The risks of discrimination linked to the use of an algorithm trained on biased data are now widely known. However, the factors contributing to these risks remain poorly identified and the methods to correct them are still experimental.

The training dataset must therefore be thoroughly inspected for signs of potential bias.

Is there a possibility of bias due to the method used or the specific collection conditions?

Does the training data include personal characteristics of the individuals concerned, such as gender, age or physical characteristics, or sensitive data?

Have the hypotheses made based on the training data been discussed, clearly documented and checked against the actual situation?

Has a study of the correlations between these particular characteristics and the rest of the training data been carried out to identify possible proxies?
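The correlation study mentioned above can be sketched as follows, assuming numeric columns held in plain Python lists. The column names, the data, the helper `find_proxies` and the 0.8 threshold are all hypothetical. Pearson correlation only captures linear relationships, so a low score does not prove the absence of a proxy.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return cov / (norm_x * norm_y)


def find_proxies(columns, protected, threshold=0.8):
    """Return columns whose absolute correlation with `protected` is high."""
    target = columns[protected]
    scores = {
        name: pearson(values, target)
        for name, values in columns.items()
        if name != protected
    }
    return {name: r for name, r in scores.items() if abs(r) >= threshold}


# Fictitious dataset with a binary-encoded characteristic.
data = {
    "gender": [0, 0, 1, 1, 0, 1],
    "height_cm": [162, 165, 180, 178, 160, 183],
    "num_purchases": [3, 7, 4, 6, 5, 5],
}
proxies = find_proxies(data, "gender")
print(proxies)
```

Columns flagged in this way are candidates for closer review, since a model can infer the characteristic from them even when the characteristic itself has been removed from the training data.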
