
Processing personal data while developing an AI system: CNIL has issued guidelines regarding legitimate interest as a legal basis


On 19 June 2025, CNIL published two additional “how-to-sheets” on artificial intelligence, aimed at clarifying the rules applicable to the creation of training datasets containing personal data. The first sets out the conditions under which legitimate interest may serve as the legal basis for developing an AI system, while the second focuses specifically on the collection of data via web scraping (see our post here).

In the first how-to-sheet, CNIL explains the requirements that must be satisfied in order to rely on legitimate interest as the legal basis for processing personal data during the development phase of an AI system.

Requirement 1: The interest pursued must be legitimate

CNIL recalls that an interest is presumed legitimate when it is: (i) clearly lawful under applicable law, (ii) sufficiently specific and well-defined, and (iii) real and present.

When the future use of the model is not yet known at the development stage, CNIL recommends referring to the objective of the model's development.

Presumed legitimate interests:

  • Conducting scientific research (notably for organizations that cannot rely on public interest);
  • Facilitating public access to certain information;
  • Developing new systems and functionalities for users;
  • Offering a conversational assistant service to users;
  • Improving the performance of a product or service;
  • Developing AI to detect fraudulent content or behavior;
  • Any commercial interest, provided it is lawful, necessary, and proportionate.

Interests that cannot be considered legitimate:

  • The AI system has no link to the organization’s mission or activity;
  • The AI system cannot legally be deployed (e.g. specific prohibited uses of minors’ data under the DSA);
  • The AI system is explicitly prohibited by other regulations (e.g. Article 5 of the AI Act – “Prohibited AI practices”).

CNIL also notes that relying on legitimate interest does not eliminate the obligation to obtain consent where required by other legislation (e.g. Article 5.2 of the DMA on cross-use of personal data).

Requirement 2: The processing must be necessary

Processing is considered necessary if:

  • It enables achievement of the pursued interest;
  • There are no less privacy-intrusive means to reach the same goal; and
  • It complies with the principle of data minimization. The controller must ensure that the processing or retention of personal data is genuinely necessary, including assessing whether the data needs to be retained in a form that permits identification. CNIL encourages the use of technologies that allow model training with less reliance on personal data.
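
By way of illustration, pseudonymization before training is one such technique. The sketch below is a minimal, hypothetical example (the field names, the key, and its storage are assumptions, not CNIL prescriptions): a direct identifier is replaced with a keyed hash, and the key is kept separately from the dataset so the training rows no longer directly identify anyone.

```python
import hashlib
import hmac

# Illustrative only: the key must be generated securely and stored apart
# from the dataset (e.g. in a key vault); otherwise the pseudonymization
# is trivially reversible by whoever holds both.
SECRET_KEY = b"example-key-stored-in-a-separate-vault"

def pseudonymize(identifier: str) -> str:
    """Return a stable pseudonym for an identifier using HMAC-SHA256."""
    return hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

# Hypothetical raw records containing a direct identifier.
records = [
    {"email": "alice@example.com", "text": "support ticket body ..."},
    {"email": "bob@example.com", "text": "another ticket ..."},
]

# Only the pseudonymized identifier enters the training pipeline.
training_rows = [
    {"user_id": pseudonymize(r["email"]), "text": r["text"]} for r in records
]
```

Note that pseudonymized data remains personal data under the GDPR; only anonymization takes it out of scope.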

Requirement 3: The processing must not unduly harm individuals' rights and freedoms

  • Assessing positive and negative impacts

To ensure that legitimate interest does not result in a disproportionate impact on individuals' rights and freedoms, the controller must assess both the benefits of processing and its potential adverse effects. The greater the anticipated benefits, the more likely the legitimate interest may outweigh the risks to individuals.

The controller must therefore identify actual or potential consequences for data subjects resulting from both the development and use of the AI system.

CNIL provides a list of criteria to guide this balancing test, which can also be used as part of a Data Protection Impact Assessment (DPIA).

 

Benefits of the AI system:

  • Scope and nature of expected benefits for the controller, third parties (e.g. users of the AI system), the public interest, or society (e.g. an AI system improving accessibility for persons with disabilities);
  • Usefulness of the AI system for regulatory compliance (e.g. an AI system enabling DSA-compliant content moderation);
  • Development as an open-source model;
  • Specificity of the pursued interest: the more precise, the stronger its weight;
  • Type of AI system and its intended operational use.

Potential negative impacts on data subjects:

  • Nature of the data: sensitive or highly personal?
  • Status of the data subjects: are they vulnerable or minors?
  • Nature and scale of the deploying organization: large-scale deployment increases risk;
  • How the data is processed (e.g. data cross-checking).

CNIL distinguishes between risks that arise during the development phase and those related to the deployment of the AI system, both of which must be considered during the development phase due to their systemic nature:

Risks during the development phase:

  • Collection of online data (e.g. via scraping), which may infringe privacy and data subjects' rights, IP rights and other secrets, or freedom of expression, given the potential surveillance induced by large-scale and indiscriminate data collection;
  • Confidentiality risks affecting training databases or models (e.g. breaches, targeted attacks);
  • Difficulty in enabling the effective exercise of rights (e.g. identification issues, technical barriers);
  • Lack of transparency due to the technical complexity and development opacity of the AI system.

Risks during use:

  • Memorization, extraction, or regurgitation of personal data by generative AI, affecting privacy;
  • Reputational harm, misinformation, or identity theft through AI-generated content depicting an identified or identifiable individual;
  • Violation of legal rights or secrets (e.g. IP, trade secrets, or medical confidentiality);
  • Serious ethical risks (e.g. amplification of discriminatory biases in the training dataset, lack of transparency or explainability, lack of robustness, or automation biases).


  • Assessing reasonable expectations

Where processing relies on legitimate interest, the controller must assess whether data subjects can reasonably expect the processing, both in its methods and its consequences.

CNIL identifies criteria to evaluate these expectations, based on the source of data collection:

Data collected directly from individuals:

  • Nature of the relationship between the individuals and the controller;
  • Privacy settings applied by the data subject;
  • Context and nature of the service (AI-based or not);
  • Purpose for which the data were collected (e.g. internal model development).

Data collected via web scraping:

  • Nature of the relationship between the data subjects and the controller;
  • Explicit restrictions imposed by websites (e.g. T&Cs, robots.txt, CAPTCHA): failure to comply with such restrictions means that the processing does not meet reasonable expectations (see the robots.txt sketch after this list);
  • Nature of the source website (e.g. social media, forums);
  • Type of content (e.g. public blog post vs. restricted social media post);
  • Public accessibility of the data (or lack thereof).
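
To illustrate the robots.txt criterion above, the following minimal sketch uses Python's standard urllib.robotparser to check whether a site permits collection of a given page before any scraping occurs (the crawler name is a hypothetical placeholder). Contractual restrictions in T&Cs and CAPTCHAs cannot be verified this way and call for separate review.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def scraping_allowed(url: str, user_agent: str = "example-ai-crawler") -> bool:
    """Check the site's robots.txt before collecting any page content."""
    parts = urlparse(url)
    robots = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    robots.read()  # fetches and parses the site's robots.txt
    return robots.can_fetch(user_agent, url)

# Collect only pages the site has not excluded for this crawler; per CNIL,
# ignoring such restrictions defeats data subjects' reasonable expectations.
if scraping_allowed("https://example.com/forum/post-123"):
    pass  # proceed with collection
```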

  • Implementing additional safeguards

To limit the impact of the processing on data subjects and to ensure a balance between the rights and interests at stake, CNIL recommends implementing additional technical, organizational, or legal safeguards. These safeguards come on top of the obligations already imposed by the GDPR, which remain mandatory regardless of the legal basis, and must be proportionate to the risks identified at each stage of development.

Measures to limit the collection or retention of personal data

  • Anonymize collected data without delay or, failing that, pseudonymize the data;
  • Where it does not affect the performance of the model, prioritize the use of synthetic data;
  • Limit memorization risks in generative AI by deleting rare or outlier data, deduplicating training data, reducing the ratio between the number of model parameters and the training data volume, regularizing the objective function during training, using learning algorithms that guarantee a certain level of confidentiality, and implementing any other measure that limits overfitting (a deduplication sketch follows this list);
  • Limit risks of extraction, regurgitation, or attacks in generative AI through access restrictions to the model, modifications to the model outputs, and security measures to prevent or detect attacks.
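
As an illustration of the deduplication measure mentioned in the list above, the sketch below removes exact duplicates from a training corpus by hashing normalized text. It is a minimal example under simplifying assumptions; production pipelines typically add near-duplicate detection (e.g. MinHash) on top of exact matching.

```python
import hashlib

def deduplicate(examples: list[str]) -> list[str]:
    """Drop exact duplicates from a training corpus.

    Repeated sequences are disproportionately likely to be memorized and
    regurgitated verbatim by generative models, hence CNIL's reference to
    deduplication as a memorization-limiting measure.
    """
    seen: set[str] = set()
    unique: list[str] = []
    for text in examples:
        digest = hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

corpus = ["Alice lives at 1 Example St.", "alice lives at 1 example st.", "Unrelated text."]
print(deduplicate(corpus))  # the normalized duplicate is dropped
```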

Measures allowing individuals to retain control over their data

Technical, legal, and organizational measures in addition to those required under the GDPR to facilitate the exercise of data subjects' rights:

  • Provide for a discretionary and prior right to object via easily accessible / visible means (e.g. a visible opt-out checkbox); for online data, implement technical solutions enabling this right to be exercised prior to collection (e.g. opt-out mechanisms, suppression lists; see the sketch after this list);
  • Provide for a discretionary right to erasure of data contained in the database;
  • Implement measures to facilitate the identification of individuals, such as retaining metadata or information on the data source, where this does not pose additional risk to data subjects;
  • Implement measures to facilitate the exercise of rights, such as introducing a delay between data collection and its use, and/or periodically retraining the model to reflect rights exercised where the data are still available;
  • In the event of sharing or open-source distribution of the model, establish measures to ensure the transmission of rights through the entire chain of actors, in particular via contractual clauses requiring the propagation of rights to object, rectify or erase in downstream systems;
  • Facilitate the notification of rights, notably through application programming interfaces (APIs), or at a minimum through download logging management techniques;
  • Communicate more broadly on updates to the datasets or models.
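
As a minimal sketch of the pre-collection opt-out measure referenced in the list above (the identifiers, hashing scheme, and registry are illustrative assumptions, not a prescribed design), the following filter drops records whose authors appear on a suppression list before the data enters the training set:

```python
import hashlib

def hash_id(value: str) -> str:
    """Hash identifiers so the suppression list itself stores no raw IDs."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

# In practice this set would be loaded from a maintained opt-out registry
# before every collection run.
suppressed = {hash_id("user-7")}

collected_records = [
    {"author_id": "user-42", "text": "a public forum post ..."},
    {"author_id": "user-7", "text": "a post whose author opted out ..."},
]

# Apply the suppression list before any further processing or training use.
dataset = [r for r in collected_records if hash_id(r["author_id"]) not in suppressed]
```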

Transparency measures, in addition to the obligations laid down in Articles 12 to 14 of the GDPR:

  • Provide information on the risks of data extraction or regurgitation (especially for generative AI), the measures in place to limit these risks, and the remedies available in case of incident (e.g. possibility of reporting to the controller);
  • Publish the DPIA, where applicable (publication may be partial where certain sections are protected by legal secrecy, such as trade secrets);
  • Publish any documentation related to the dataset, development process, or AI system and its functioning;
  • Publish information to improve public understanding of how AI systems work, particularly through public documentation (e.g. source code, model weights), explanations of key machine learning concepts, and descriptions of safeguards against malicious use;
  • Conduct public awareness campaigns, especially where data are collected at scale (e.g. LLMs), and diversify communication formats targeted at data subjects;
  • Implement measures and procedures ensuring transparent development of the AI system, enabling auditability: full documentation, logging, version control, parameter tracking, and recording of evaluations and tests (a minimal logging sketch follows this list).
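
As one possible way to support the auditability measure in the last bullet (file names, the JSON Lines format, and the use of git are assumptions, not CNIL requirements), a development pipeline could append a structured record of every training run covering the dataset fingerprint, code version, hyperparameters, and evaluation results:

```python
import hashlib
import json
import subprocess
from datetime import datetime, timezone

def log_training_run(dataset_path: str, params: dict, metrics: dict,
                     log_path: str = "training_audit.jsonl") -> None:
    """Append an auditable record of one training run."""
    with open(dataset_path, "rb") as f:
        dataset_digest = hashlib.sha256(f.read()).hexdigest()
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "dataset_sha256": dataset_digest,  # fingerprint of the exact data used
        "git_commit": subprocess.run(      # code version (assumes a git checkout)
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True
        ).stdout.strip(),
        "parameters": params,
        "evaluation": metrics,
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
```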

Measures to limit risks during the use phase

  • For general-purpose AI systems, limit the risk of unlawful reuse through technical measures (e.g. digital watermarking, functionality restrictions by design) and/or legal measures (e.g. contractual clauses prohibiting unethical or unexpected uses);
  • Implement licenses restricting uses aimed at reidentifying individuals;
  • Ensure consideration of serious ethical risks, including ensuring the quality of the training dataset.

Other measures

  • Depending on the severity and likelihood of the risks, establish an ethics committee or designate an ethics officer to incorporate ethical considerations and the protection of rights during the design phase and throughout the development of the system.

 

 

Authored by Joséphine Beaufour and Julie Schwartz.

