
On June 19, 2025, CNIL published two additional "how-to sheets" on artificial intelligence, which aim to clarify the rules applicable to the creation of training datasets containing personal data. The first sets out the conditions under which legitimate interest may serve as the legal basis for the development of an AI system, while the second focuses specifically on the collection of data via web scraping (see our post here).
In its first "how-to sheet," CNIL explains the requirements that must be satisfied in order to rely on legitimate interest as the legal basis for processing personal data during the development phase of an AI system.
CNIL recalls that an interest is presumed legitimate when it is: (i) clearly lawful under applicable law, (ii) sufficiently specific and well-defined, and (iii) real and present.
When the future use of the model is not yet known at the development stage, CNIL recommends referring to the objective of the model's development.
| Presumed legitimate interests | Interests that cannot be considered legitimate |
|---|---|
| Conducting scientific research (notably for organizations that cannot rely on public interest) | The AI system has no link to the organization's mission or activity |
| Facilitating public access to certain information | The AI system cannot legally be deployed (e.g. specific prohibited use of minors' data under the DSA) |
| Developing new systems and functionalities for users | The AI system is explicitly prohibited by other regulations (e.g. Article 5 of the AI Act – "Prohibited AI practices") |
| Offering a conversational assistant service to users | |
| Improving the performance of a product or service | |
| Developing AI to detect fraudulent content or behavior | |
| Any commercial interest, provided it is lawful, necessary, and proportionate | |
CNIL also notes that relying on legitimate interest does not eliminate the obligation to obtain consent where required by other legislation (e.g. Article 5.2 of the DMA on cross-use of personal data).
Processing is considered necessary if it actually contributes to achieving the pursued interest and if that interest cannot reasonably be achieved by less intrusive means, in line with the data minimization principle.
Assessing positive and negative impacts
To ensure that legitimate interest does not result in a disproportionate impact on individuals' rights and freedoms, the controller must assess both the benefits of processing and its potential adverse effects. The greater the anticipated benefits, the more likely the legitimate interest may outweigh the risks to individuals.
The controller must therefore identify actual or potential consequences for data subjects resulting from both the development and use of the AI system.
CNIL provides a list of criteria to guide this balancing test, which can also be used as part of a Data Protection Impact Assessment (DPIA).
| Benefits of the AI system | Potential negative impacts on data subjects |
|---|---|
| Scope and nature of expected benefits for the controller, third parties (e.g. users of the AI system), the public interest, or society (e.g. an AI system improving accessibility for persons with disabilities) | Nature of the data: sensitive or highly personal? |
| Usefulness of the AI system for regulatory compliance (e.g. an AI system enabling DSA-compliant content moderation) | Status of the data subjects: are they vulnerable or minors? |
| Development as an open-source model | Nature and scale of the deploying organization: large-scale deployment increases risk |
| Specificity of the pursued interest: the more precise, the stronger its weight | How the data is processed (e.g. data cross-checking?) |
| | Type of AI system and its intended operational use |
CNIL distinguishes between risks that arise during the development phase and those related to the deployment of the AI system, both of which must be considered during the development phase due to their systemic nature:
| Risks during the development phase | Risks during use |
|---|---|
| Collection of online data (e.g. via scraping) may infringe privacy and data subjects' rights, IP rights and other secrets, or freedom of expression, given the potential surveillance induced by large-scale, indiscriminate data collection | Memorization, extraction, or regurgitation of personal data by generative AI, affecting privacy |
| Confidentiality risks in training databases or models (e.g. breaches, targeted attacks) | Reputational harm, misinformation, or identity theft through AI-generated content depicting an identified or identifiable individual |
| Difficulty in enabling the effective exercise of rights (e.g. identification issues, technical barriers) | Violation of legal rights or secrets (e.g. IP, trade secrets, or medical confidentiality) |
| Lack of transparency due to the technical complexity and development opacity of the AI system | Serious ethical risks (e.g. amplification of discriminatory biases in the training dataset, lack of transparency or explainability, lack of robustness, or automation biases) |
Where processing relies on legitimate interest, the controller must assess whether data subjects can reasonably expect the processing, both in its methods and its consequences.
CNIL identifies criteria to evaluate these expectations, based on the source of data collection:
| Data collected directly from individuals | Data collected via web scraping |
|---|---|
| Nature of the relationship between the individuals and the controller | Nature of the relationship between the data subjects and the controller |
| Privacy settings applied by the data subject | Explicit restrictions imposed by websites (e.g. T&Cs, robots.txt, CAPTCHA): failure to comply with such restrictions means that the processing does not meet reasonable expectations (see the sketch below) |
| Context and nature of the service (AI-based or not) | Nature of the source website (e.g. social media, forums) |
| Purpose for which the data were collected (e.g. internal model development) | Type of content (e.g. public blog post vs. restricted social media post) |
| | Public accessibility of the data (or lack thereof) |
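By way of illustration, the robots.txt check referenced in the table above can be automated before any page is collected. The following is a minimal Python sketch, not taken from the CNIL sheet; the URL and user-agent string are hypothetical examples, and a real crawler would also need to respect a site's T&Cs and other technical restrictions such as CAPTCHAs.

```python
# Minimal sketch (assumption, not from the CNIL sheet): check a site's
# robots.txt before collecting a page, one of the explicit website
# restrictions scrapers must respect. URL and user agent are hypothetical.
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def may_fetch(url: str, user_agent: str = "example-training-bot") -> bool:
    """Return True only if robots.txt allows this user agent to fetch the URL."""
    parts = urlparse(url)
    parser = RobotFileParser()
    parser.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    parser.read()  # downloads and parses the site's robots.txt file
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    page = "https://example.com/forum/post-123"  # hypothetical page
    if may_fetch(page):
        print("robots.txt permits collection of", page)
    else:
        print("robots.txt disallows collection; skipping", page)
```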
To limit the impact of the processing on data subjects and to ensure a balance between the rights and interests at stake, CNIL recommends implementing additional technical, organizational, or legal safeguards aimed at reducing risks to data subjects' rights and freedoms. These safeguards come on top of the existing obligations under the GDPR, which remain mandatory regardless of the legal basis, and must be proportionate to the risks identified at each stage of development.
CNIL groups these safeguards into the following categories:

- Measures to limit the collection or retention of personal data (a minimal illustration follows this list)
- Measures allowing individuals to retain control over their data, i.e. technical, legal, and organizational measures, in addition to those required under the GDPR, to facilitate the exercise of data subjects' rights
- Transparency measures, in addition to the obligations laid down in Articles 12 to 14 of the GDPR
- Measures to limit risks during the use phase
- Other measures
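As a purely illustrative example of the first category, a collection pipeline might strip obvious personal identifiers from scraped text before it is retained in a training dataset. This minimal Python sketch is not taken from the CNIL sheet; the regex patterns are hypothetical examples and far from exhaustive, and a production pipeline would rely on dedicated PII-detection tooling.

```python
# Minimal sketch (assumption, not from the CNIL sheet) of one possible
# technical safeguard: redacting obvious personal identifiers from scraped
# text before retention. The patterns are illustrative, not exhaustive.
import re

# Hypothetical example patterns for common identifier formats.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact_pii(text: str) -> str:
    """Replace matched identifiers with placeholder tags before retention."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact_pii("Contact Jane at jane.doe@example.com or +33 1 23 45 67 89."))
# -> "Contact Jane at [EMAIL] or [PHONE]."
```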
Authored by Joséphine Beaufour and Julie Schwartz.