Annotating data

02 July 2024

The data annotation phase is crucial to the quality of the AI model. Meeting this challenge requires a rigorous methodology that guarantees both the performance of the system and the protection of personal data.

This how-to sheet is open for public consultation until 1 October 2024.
This content is a courtesy translation of the original publication in French. In the event of any inconsistencies between the French version and this English translation, please note that the French version shall prevail.

The data annotation phase is a decisive step in developing a quality AI model, both for performance and for respecting people’s rights. This step is central to supervised learning, but it also serves to build a validation dataset in unsupervised learning. It consists of assigning a description, called a label, to each data item; these labels serve as the “ground truth” for the model, which must learn to process, classify, or discriminate data based on that information.

The annotation may relate to all types of data, personal or not, and contain all types of information, personal or not. Annotation can be human, semi-automatic, or automatic. It can be an independent process, or result from existing processes in which the data has already been characterised for a particular need and is then reused for training AI models (as in the medical diagnosis use case described below).

In some cases, AI training will be based on existing data and annotations. This sheet, as well as those on data protection during the design of the system and the collection of data, will then have to be applied. The scope of this sheet covers all the cases mentioned above where the annotation relates to or contains personal data.

Examples of annotations:

  • In order to train a speaker recognition AI model integrated into a voice assistant, voice recordings are annotated with the identity of several speakers;
     
  • In order to train a fall detection AI model integrated into the video surveillance system of a nursing home, images are annotated with the position of the persons represented according to several labels such as “standing” or “lying down”;
     
  • In order to train an AI model for the recognition of license plates embedded in an access gate to a private space, images are annotated with the position of pixels containing license plates;
     
  • In order to train an AI model for predicting the risk of a certain pathology, intended to be used as a diagnostic aid by healthcare staff in a hospital, the blood results of patients are annotated with a doctor’s diagnosis of the pathology in question.
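
As an illustration only, the idea of attaching a label to each data item can be sketched in Python, loosely modelled on the fall detection example above. Everything in this sketch is hypothetical and not taken from the sheet: the `Annotation` class, its field names, the `img-00x` identifiers, and the `label_counts` helper are assumptions chosen for readability. It records, for each item, the label assigned and whether the annotation was human, semi-automatic, or automatic, and rejects labels outside a predefined set.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical label set, inspired by the fall detection example.
ALLOWED_LABELS = {"standing", "lying down"}
# The three annotation modes mentioned in the text.
ANNOTATION_MODES = {"human", "semi-automatic", "automatic"}


@dataclass(frozen=True)
class Annotation:
    """One annotated data item (illustrative structure, not a standard)."""

    item_id: str  # hypothetical identifier of the annotated image
    label: str    # the "ground truth" description assigned to the item
    mode: str     # how the label was produced

    def __post_init__(self):
        # Reject labels and modes outside the predefined sets.
        if self.label not in ALLOWED_LABELS:
            raise ValueError(f"unknown label: {self.label!r}")
        if self.mode not in ANNOTATION_MODES:
            raise ValueError(f"unknown annotation mode: {self.mode!r}")


def label_counts(annotations):
    """Count how many items carry each label."""
    return Counter(a.label for a in annotations)


annotations = [
    Annotation("img-001", "standing", "human"),
    Annotation("img-002", "lying down", "human"),
    Annotation("img-003", "standing", "automatic"),
]

print(label_counts(annotations))
```

Restricting labels to a fixed set and keeping track of the annotation mode is one possible design choice; it makes it easier to check afterwards what information the annotations actually contain.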

The stakes of annotation for people’s rights and freedoms


Ensure the quality of the annotation


Information and the exercise of rights


Annotation from sensitive data