Take data protection into account in data collection and management
The development of an artificial intelligence system requires rigorous management and monitoring of the training data. The CNIL explains how data protection principles apply to the management of training data.
Once the data and its sources have been identified, the AI system provider must carry out the collection and build its dataset. From this stage onwards, it is necessary to incorporate the principles of privacy by design.
The collection of data is accompanied by various checks and procedures, depending on the modalities and sources of the data. Technically, the aim is to ensure that the data collected is relevant to the objectives pursued, and thereby to ensure compliance with the principle of minimisation.
Collection of data by harvesting (web scraping)
Where the controller reuses publicly accessible data that it has itself extracted from websites using web scraping tools, it must ensure that the collection of data is minimised, in particular by seeking to (a minimal sketch follows the list):
- limit collection to freely accessible data;
- define, prior to the implementation of the processing, precise collection criteria;
- ensure the deletion of irrelevant data immediately after collection, or as soon as it is identified as such (where exhaustive sorting cannot be carried out automatically).
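By way of illustration, the sketch below applies precise, pre-defined collection criteria at scraping time so that irrelevant data is never stored. This is a minimal sketch: the URL, CSS selector and criteria are hypothetical placeholders, not a recommended configuration.

```python
# Minimal sketch: applying pre-defined collection criteria at scraping
# time, so that irrelevant data is discarded immediately and never stored.
# The URL, CSS selector and criteria below are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

COLLECTION_CRITERIA = {
    "max_length": 2_000,                        # discard overly long texts
    "excluded_keywords": {"password", "iban"},  # crude guard against sensitive data
}

def is_relevant(text: str) -> bool:
    """Apply the collection criteria defined before processing starts."""
    if len(text) > COLLECTION_CRITERIA["max_length"]:
        return False
    lowered = text.lower()
    return not any(kw in lowered for kw in COLLECTION_CRITERIA["excluded_keywords"])

def scrape(url: str) -> list[str]:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    collected = []
    for node in soup.select("article p"):       # hypothetical selector
        text = node.get_text(strip=True)
        if is_relevant(text):
            collected.append(text)
        # non-relevant extracts are never stored, in line with the
        # principle of immediate deletion
    return collected
```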
For more information on the collection of publicly accessible data: see the draft Guide on Opening, Sharing and Reusing Data, currently subject to public consultation.
Cleaning, identification of relevant data and privacy by design
Data cleaning makes it possible to provide a quality training base. It is a crucial step that strengthens the integrity and relevance of the data by reducing inconsistencies, and lowers the cost of training. In practice, it involves (a minimal sketch follows the list):
- correcting empty values;
- detecting outliers;
- correcting errors;
- eliminating duplicates;
- deleting unnecessary fields.
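The sketch below illustrates these cleaning operations with pandas; the input file and column names are hypothetical, and the choices made (median imputation, interquartile rule) are illustrative rather than recommendations.

```python
# Minimal data-cleaning sketch with pandas. The input file, column
# names and thresholds are hypothetical and illustrative.
import pandas as pd

df = pd.read_csv("raw_dataset.csv")            # hypothetical input

# Delete unnecessary fields (e.g. free text not needed for the task)
df = df.drop(columns=["free_text"], errors="ignore")

# Eliminate duplicates
df = df.drop_duplicates()

# Correct empty values: impute missing numeric values with the median
df["age"] = df["age"].fillna(df["age"].median())

# Detect outliers with a simple interquartile-range rule
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = (df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)

# Correct errors: here, flagged outliers are simply removed after review
df = df[~outliers]
```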
Identification of relevant data
The selection of relevant data and features is a standard procedure in AI. It aims to optimise the performance of the system while avoiding under- and over-fitting. In practice, it ensures that classes unnecessary for the task are not represented, that the proportions between the different classes of interest are well balanced, etc. This procedure also makes it possible to identify data that is not relevant for training; data identified as irrelevant may then be deleted from the dataset.
In practice, this selection can be applied to three types of objects constituting the dataset:
- the data: information in an unorganised form, which may be ‘raw’ (an audio extract, an image, a handwritten text file, etc.) or not (measurements, observations, etc. in digital format);
- the associated metadata: literally ‘data about data’, metadata provide information on the description of the data (what was the acquisition process? by whom? when? etc.), its structure (how should it be used?) or its quality;
- features or attributes: measurable properties extracted from the data (information relating to the shape or texture of an image, the pitch of sounds, the timbre or tempo of an audio file, etc.).
Several approaches can contribute to implementing this selection. The following are given by way of illustration (a sketch follows the list):
- The use of techniques and tools to identify the relevant features (feature selection), sometimes prior to training. Principal Component Analysis (PCA) can also help identify highly correlated features in a dataset and thus retain only those that are relevant. Many libraries, such as Yellowbrick, Leave One Feature Out (LOFO) and Facets, today offer implementations for selecting features.
- The use of interactive data annotation approaches such as active learning, which allow the user to review the data in the light of the task to be performed and, where appropriate, to remove non-relevant data. The Scikit-ActiveML library is an example of this.
- The use of data/dataset pruning techniques: discussed in several publications, such as Sorscher et al., 2022 or Yang et al., 2023, this technique reduces the computation time required for training, without significant impact on the performance of the resulting model, by identifying the data that is not useful for training.
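As an illustration of the first approach, the sketch below uses scikit-learn (rather than the libraries named above, which offer comparable functionality) to select the features most informative for the task, and to inspect via PCA how much variance a few components already capture. The data and parameter values are synthetic and illustrative.

```python
# Minimal feature-selection sketch with scikit-learn on synthetic data.
# The dataset and parameter values are illustrative.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

# Univariate selection: keep the k features most informative for the task
selector = SelectKBest(mutual_info_classif, k=5).fit(X, y)
X_reduced = selector.transform(X)
print("kept feature indices:", selector.get_support(indices=True))

# PCA: check how much variance a few components already explain, a hint
# that many correlated features could be dropped
pca = PCA(n_components=5).fit(X)
print("explained variance ratio:", pca.explained_variance_ratio_)
```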
Finally, in certain specific cases where the retention of data may be complex or problematic (due to the sensitivity of the data, issues related to intellectual property, etc.), the principle of minimisation can be implemented by retaining only the extracted features and deleting the source data from which they originate.
Example: for a study of the spread of hate speech on social networks, the analysis of the comments associated with a post allows a classification of user reactions, but the content of the comments itself could be deleted after this analysis.
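A minimal sketch of this example follows: only the extracted characteristics (a label and a length) are retained, and the raw comment is never written to the study dataset. The classifier and field names are hypothetical placeholders.

```python
# Minimal sketch: only the extracted characteristics are retained; the
# raw comment is never written to the study dataset. The classifier and
# field names are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class ReactionRecord:
    post_id: str        # identifier of the post, not of the commenter
    label: str          # extracted characteristic kept for the study
    length: int         # another non-identifying feature

def classify_reaction(comment_text: str) -> str:
    """Hypothetical classifier; a real model would go here."""
    hateful_markers = {"marker1", "marker2"}    # placeholder lexicon
    words = set(comment_text.lower().split())
    return "hateful" if words & hateful_markers else "neutral"

def process(post_id: str, comment_text: str) -> ReactionRecord:
    # The source content is analysed and then discarded: only the
    # extracted record is stored.
    return ReactionRecord(post_id=post_id,
                          label=classify_reaction(comment_text),
                          length=len(comment_text))
```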
Creating a training dataset for AI also almost systematically requires data annotation. The production and use of such data must also be subject to specific data protection measures, which will be detailed in a dedicated practical sheet.
Data protection by design (privacy by design)
In addition to these necessary steps, the AI system provider must implement a series of measures to integrate the principles of data protection from the design stage (privacy by design). It must take into account the state of the art, the impact of the measures on the effectiveness of training, the costs of implementation, and the nature, scope, context and purposes of the processing, as well as the risks (whose likelihood and severity vary) that the processing poses to the rights and freedoms of individuals. These measures may include:
- Generalisation measures: these measures are intended to generalise, or dilute, the attributes of the data subjects by changing their scale or order of magnitude;
- Randomisation measures: these measures aim to add noise to the data in order to decrease its accuracy and weaken the link between the data and the individual.
These measures are to be implemented on the data as well as the metadata associated with it.
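As a simple illustration, the sketch below applies both families of measures to a toy record: generalisation of an age into a ten-year band and of a date to [year-month], and randomisation of a measurement by adding noise. The field names, band width and noise scale are illustrative assumptions.

```python
# Minimal sketch of the two families of measures on a toy record.
# Field names, band width and noise scale are illustrative assumptions.
import random

record = {"age": 34, "admission_date": "1987-06-12", "weight_kg": 72.4}

# Generalisation: change the scale or order of magnitude of an attribute
def generalise_age(age: int, band: int = 10) -> str:
    low = (age // band) * band
    return f"{low}-{low + band - 1}"            # e.g. 34 -> "30-39"

def generalise_date(date_iso: str) -> str:
    return date_iso[:7]                         # keep only [year-month]

# Randomisation: add noise to weaken the link between data and individual
def randomise(value: float, scale: float = 2.0) -> float:
    return value + random.gauss(0.0, scale)

protected = {
    "age": generalise_age(record["age"]),
    "admission_date": generalise_date(record["admission_date"]),
    "weight_kg": round(randomise(record["weight_kg"]), 1),
}
print(protected)
```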
In some cases, these measures may extend to the anonymisation of the data, in particular if the purpose does not require the processing of personal data: while the selection and management of the data constitute processing of personal data subject to the GDPR (and thus to these sheets), subsequent processing of the anonymised data will no longer be covered by personal data protection rules.
Example: an organisation wishes to build a dataset of computer code relating to industrial machines (SCADA) from the repositories of several developers. After removing any mention of the developers themselves, and then verifying the absence of identifiers or personal mentions in the comments, the dataset no longer contains any personal data. It is therefore no longer subject to data protection regulations.
For more information on these measures, see Opinion 05/2014 of the Article 29 Working Party (G29) on anonymisation techniques.
These measures depend on the categories of data concerned and must be assessed in terms of their influence on the technical performance of the system, both theoretical and operational. Their impact is particularly beneficial because of:
- on the one hand, their ability to reduce the consequences of a possible breach of data confidentiality (through the compromise of the data contained in the dataset, or through an attack on the trained model such as an attribute inference attack);
- on the other hand, the possibility of using the trained model, in the operational phase, on data subject to the same protection measures, thus making it possible to better protect data in that phase.
Example: by generalising patient age information to the form [month-year] or [year] instead of [day-month-year] as part of the development of a diagnostic AI system, the provider drastically reduces the risk of a breach of confidentiality without harming the generalisation capacity of its system.
Metadata may contain information useful to an attacker seeking to re-identify the data subjects (such as the date or place of data collection). The principle of minimisation also applies to such data, which should therefore be limited to what is necessary.
In some cases, metadata may be required by the provider to respond to a request to exercise rights, as it sometimes makes it possible to identify the data relating to an individual. In that case, particular attention should be paid to its security.
Conversely, if the processing of metadata is not necessary and it contains personal data, its deletion may be recommended for the purposes of pseudonymising or anonymising the dataset.
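Where image metadata is not needed, it can be removed before the data enters the dataset. The sketch below uses Pillow for this purpose (one possible tool among others; the file paths are hypothetical): rebuilding the image from its pixels alone leaves EXIF fields such as GPS position or timestamp behind.

```python
# Minimal sketch: stripping image metadata (e.g. EXIF GPS position and
# timestamp) before the data enters the dataset, using Pillow.
# The file paths are hypothetical.
from PIL import Image

def strip_metadata(src_path: str, dst_path: str) -> None:
    with Image.open(src_path) as img:
        pixels = list(img.getdata())            # keep only pixel content
        clean = Image.new(img.mode, img.size)
        clean.putdata(pixels)                   # no EXIF fields are copied
        clean.save(dst_path)

strip_metadata("photo_raw.jpg", "photo_clean.jpg")
```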
Example: where it reuses CCTV images to build a training dataset, a provider that generalises the location of an image from a precise address to an IRIS statistical unit may no longer be able to respond to a request for access to the data.
In addition, some measures protect the data during the training of the AI system, such as differential privacy applied during model training, or federated learning. Although some of these techniques are still at the research stage, tools such as PyDP or OpenDP can be used to test their effectiveness.
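To make the idea concrete, the sketch below illustrates the principle of differentially private training (in the style of DP-SGD) by hand, on a toy logistic regression: each example’s gradient is clipped to bound its influence, then calibrated Gaussian noise is added. All parameter values are illustrative assumptions; in practice, audited library implementations should be preferred over hand-rolled code.

```python
# Minimal sketch of the DP-SGD principle on a toy logistic regression:
# clip each example's gradient, then add calibrated Gaussian noise.
# All values are illustrative; prefer audited library implementations.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(float)

w = np.zeros(5)
clip_norm, noise_mult, lr, batch = 1.0, 1.1, 0.1, 32

for step in range(100):
    idx = rng.choice(len(X), size=batch, replace=False)
    xb, yb = X[idx], y[idx]
    preds = 1.0 / (1.0 + np.exp(-xb @ w))        # sigmoid predictions
    per_ex_grads = (preds - yb)[:, None] * xb    # one gradient per example
    # Clip each per-example gradient to bound any individual's influence
    norms = np.linalg.norm(per_ex_grads, axis=1, keepdims=True)
    per_ex_grads /= np.maximum(1.0, norms / clip_norm)
    # Add Gaussian noise calibrated to the clipping norm
    noisy_sum = per_ex_grads.sum(axis=0) + rng.normal(
        scale=noise_mult * clip_norm, size=w.shape)
    w -= lr * noisy_sum / batch
```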
Monitoring and updating
Although data minimisation and data protection measures may have been implemented at the time of collection, these measures can become obsolete over time. The data collected could lose its accurate, relevant, adequate and limited character, in particular because of:
- a possible drift of the data under real conditions. Data drift can have multiple causes:
- upstream process changes, such as the replacement of a sensor, the calibration of which differs slightly from that previously installed;
- data quality problems, e.g. a broken sensor that always indicates a zero value;
- the natural drift of the data, such as the variation in average temperature over the seasons;
- drift due to sudden changes, such as the loss of a system’s ability to detect faces following the massive wearing of masks during the Covid-19 outbreak;
- changes in the relationship between characteristics;
- malicious poisoning in the context of continuous learning, detected for example through unwanted outputs.
- an update of the data, such as a correction of the place of residence in the public profile of the user of a social network following a move;
- the evolution of techniques: it is frequently shown that a change of approach (the use of a different AI system requiring a different typology of data, for example) can bring better performance to the system, or that similar performance can be achieved with a smaller volume of data (as shown by few-shot learning techniques, for example).
The system provider should therefore conduct regular analyses to monitor the dataset. These analyses should be more extensive and frequent in situations where the above-mentioned causes are most likely to occur. They should be based on (a drift-monitoring sketch follows the list):
- a regular comparison of the data, or of a sample of it, with the source data, which can be automated;
- a regular review of the data by staff trained in data protection matters, or by an ethics committee, responsible in particular for verifying that the data is still relevant and adequate for the purpose of the processing;
- monitoring of the scientific literature in the field, making it possible to identify the emergence of new techniques that are more frugal with data.
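The first type of comparison can be automated. As a minimal sketch, the code below compares the current distribution of a feature with a reference sample using a two-sample Kolmogorov-Smirnov test; the data and the alert threshold are hypothetical.

```python
# Minimal drift-monitoring sketch: compare the current distribution of a
# feature with a reference sample using a two-sample Kolmogorov-Smirnov
# test. The data and alert threshold are hypothetical.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(loc=20.0, scale=5.0, size=1_000)  # e.g. historical sensor data
current = rng.normal(loc=23.0, scale=5.0, size=1_000)    # e.g. latest collection

stat, p_value = ks_2samp(reference, current)
if p_value < 0.01:                                       # illustrative threshold
    print(f"possible drift detected (KS statistic = {stat:.3f})")
```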
Personal data cannot be kept indefinitely. The GDPR requires a period to be set after which the data must be deleted or, in some cases, archived. This retention period must be determined by the data controller according to the purpose for which the data was collected.
Find out more: Data retention periods
The provider must set a retention period for the data used for the development of the AI system, in accordance with the principle of storage limitation (Article 5(1)(e) GDPR).
In particular, setting a retention period requires the implementation of certain procedures described in the CNIL’s practical guide on retention periods. The CNIL notes that open source datasets are constantly evolving (improved annotation, addition of new data, purging of poor-quality data, etc.): a retention period of several years from the date of collection may therefore be justified.
Set a retention period for the development phase
First, the provider of the AI system must set a retention period for the data used for the development of the system. During this phase, the provider uses the data for:
- the creation of the dataset, limited to the data strictly necessary, cleaned, pre-processed and ready to be used for training;
- the training of its solution, from the first training of the AI model to the test phase used to determine the characteristics and performance of the finished product. During this phase, the data must be kept securely and be accessible to authorised persons. Depending on the case, this phase can last from a few weeks to several months. Its duration should be defined upstream and justified (taking into account the controller’s previous experience, its knowledge of the duration of IT developments, the human and material resources it can make available to carry them out, etc.).
Data retention needs to be planned upstream and monitored over time. The defined retention periods must also be applied to the data concerned, regardless of its medium. Particular attention must therefore be paid to any data stored on third-party media, for example to allow engineers to analyse a sample on a case-by-case basis. The measures recommended in the Data Traceability Documentation section will facilitate the tracking of the data and of its expected deletion date.
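A minimal sketch of such tracking follows: each entry of a (hypothetical) data inventory records a collection date, a justified retention period and the storage medium, so that overdue deletions can be flagged automatically, including for copies held on third-party media.

```python
# Minimal retention-tracking sketch: each inventory entry records a
# collection date, a justified retention period and the storage medium,
# so that overdue deletions can be flagged. Field values are hypothetical.
from datetime import date, timedelta

inventory = [
    {"dataset": "train_v1", "collected": date(2023, 3, 1),
     "retention_days": 365, "medium": "primary storage"},
    {"dataset": "sample_review", "collected": date(2023, 3, 1),
     "retention_days": 90, "medium": "engineer workstation"},
]

today = date.today()
for item in inventory:
    delete_by = item["collected"] + timedelta(days=item["retention_days"])
    if today > delete_by:
        print(f"{item['dataset']} on {item['medium']}: deletion overdue "
              f"(was due {delete_by.isoformat()})")
```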
In the case of public bodies or private law bodies entrusted with a public service mission, data may also be subject to specific archiving in compliance with the obligations of the Heritage Code.
This allows data to be permanently archived in a public archive service according to the particular interest it presents. Where the public archives contain personal data, a selection is made to determine the data intended to be retained and the data which, being of no administrative use or scientific, statistical or historical interest, is intended to be deleted.
In any event, the data retained in the context of definitive archiving is processed for archival purposes within the meaning of the GDPR and therefore does not fall within the scope of these sheets. In addition, the data retention period must be specified in the information notices brought to the attention of the data subjects.
Set a retention period for the maintenance or improvement of the product
When the data no longer needs to be accessible for the day-to-day tasks of those responsible for developing the AI system, it should in principle be deleted. However, it may be kept for product maintenance purposes (i.e. for a later performance verification phase) or to improve the system.
The principle of data minimisation requires sorting to be carried out in order to keep only the data strictly necessary for maintenance operations (by selecting the relevant images, by blurring the images where possible, etc.).
Once sorted, the data can be stored on a partitioned medium, i.e. one that is physically or logically separated from the data constituting the dataset. This partitioning strengthens the security of the data and restricts access to authorised persons only. The duration of the maintenance phase can vary from a few months to several years where the retention of this data carries little risk for individuals and appropriate measures have been taken. In the case of data from open sources, the retention period provided by the data source should be taken into account in determining the duration of the maintenance phase. In any event, this period must be limited and justified by a real need.
Improving the AI system
The data constituting the previously established dataset may also be needed to improve the product resulting from the AI system thus developed. This purpose, for which a legal basis must be identified, must be brought to the attention of the data subjects, in accordance with the principle of transparency.
Specifically, only the data needed to improve the AI system may be extracted from its partitioned storage space.
The possibility of extending the cycle with a new development or maintenance phase must under no circumstances allow an indefinite extension of the retention period, and an analysis of the time needed for the processing operations must be carried out systematically.
The data controller and its processors (if any) must implement appropriate technical and organisational measures to ensure a level of security appropriate to the risk (Article 32 GDPR).
The choice of measures to be implemented must take into account the state of the art, the costs of implementation, and the nature, scope, context and purposes of the processing, as well as the risks, of varying likelihood and severity, for the rights and freedoms of data subjects.
Thus, the provider of an AI system must, in particular, provide for the appropriate measures in order to secure:
- the data collection techniques used, e.g. through the encryption of data flows and robust authentication methods to restrict access to the information system. It is recommended to use the means provided by the data publisher to collect the data, especially where collection is based on APIs; the CNIL’s recommendation on the use of APIs should then be applied;
- the data collected, using encryption of backups, verification of their integrity, or logging of the operations carried out on the data in accordance with the CNIL recommendation on logging measures. A frequent risk in the development of AI systems concerns the duplication of data, which frequently needs to be analysed to verify its quality. Duplication of data should be limited as far as possible and traced where unavoidable (a sketch follows this list). Dedicated tools, such as NB Defense, Octopii and PiiCatcher, make it possible to check for the presence of personal data in certain contexts;
- the information system used for the development of the AI system, e.g. by means of authentication methods and the training of staff with access to it, and the implementation of good IT hygiene practices;
- the computer equipment, in particular through methods restricting access to the premises and by analysing the guarantees provided by the data host where hosting is outsourced to a provider.
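As an illustration of the point on duplication, the sketch below traces copies of data files by content hash, so that unavoidable duplicates (for example, samples extracted for quality checks) remain identifiable and deletable. The directory layout is hypothetical.

```python
# Minimal sketch: tracing duplicated data files by content hash so that
# unavoidable copies (e.g. samples extracted for quality checks) remain
# identifiable and deletable. The directory layout is hypothetical.
import hashlib
from pathlib import Path

def sha256(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

seen: dict[str, Path] = {}
for path in Path("datasets").rglob("*.csv"):
    content_hash = sha256(path)
    if content_hash in seen:
        print(f"duplicate: {path} is a copy of {seen[content_hash]}")
    else:
        seen[content_hash] = path
```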
Security measures specific to the development and deployment phases of AI systems will be the subject of a subsequent sheet. However, the recommendations and best practices traditionally implemented in IT, such as those available on the CNIL website, as well as the CNIL’s GDPR guide for developers and its guide on the security of personal data, constitute a useful reference base to which the provider of the AI system can refer.
Documenting the data used for the development of an AI system ensures the traceability of the datasets used, which their generally large size makes difficult. This documentation must make it possible to:
- facilitate the use of the dataset;
- demonstrate that the data were lawfully collected;
- facilitate the monitoring of data over time until it is deleted or anonymised;
- reduce the risk of unanticipated use of data;
- enable the exercise of rights for data subjects;
- identify planned or possible improvements.
In order to meet these objectives, a documentation model may be adopted, in particular where the provider uses multiple data sources or establishes multiple datasets. Building on existing models (such as those proposed by Gebru et al., 2021, Arnold et al., 2019, Bender et al., 2018, the Dataset Nutrition Label, or the technical documentation provided for in Annex IV of the draft European Artificial Intelligence Regulation), the CNIL provides below a model that can be used for this purpose, in particular where the dataset is intended to be disseminated. This documentation should be produced for each dataset that is constituted, made available, or derived from an existing dataset to which a substantial change has been made. More specific documentation templates for particular use cases, such as the CrowdWorkSheets model, which is particularly relevant for documenting the annotation phase, may complement the proposed template.
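By way of illustration, such documentation can also be kept in machine-readable form. The sketch below, loosely inspired by the ‘datasheets for datasets’ approach of Gebru et al., 2021, shows a minimal skeleton; the fields are illustrative assumptions and do not reproduce the CNIL’s template.

```python
# Minimal sketch of machine-readable dataset documentation, loosely
# inspired by the 'datasheets for datasets' approach (Gebru et al., 2021).
# The fields are illustrative and do not reproduce the CNIL's template.
from dataclasses import dataclass, field

@dataclass
class DatasetSheet:
    name: str
    purpose: str
    sources: list[str]
    legal_basis: str
    collected_on: str                 # ISO date of collection
    retention_until: str              # planned deletion or anonymisation date
    contains_personal_data: bool
    protection_measures: list[str] = field(default_factory=list)

sheet = DatasetSheet(
    name="comments_study_v2",
    purpose="classification of user reactions",
    sources=["public forum posts (web scraping)"],
    legal_basis="legitimate interest",
    collected_on="2023-03-01",
    retention_until="2024-03-01",
    contains_personal_data=False,
    protection_measures=["pseudonymisation", "metadata removal"],
)
```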
The objectives of this documentation are to inform the controller’s internal reflection on its practices, to inform users of the dataset about the conditions of its constitution and the recommendations concerning its processing, and finally, to inform individuals for the purposes of transparency. It is therefore recommended to provide this documentation to the users of the dataset or of the models it has been used to design.
It should be noted that this important documentation work can naturally feed into the data protection impact assessment.