PL.ai

Agenda: Purpose-Limited Open Data and AI

1. What We Observe

The dissemination of trained AI models, pre-trained foundation models, and research datasets – whether used as training data or as fine-tuning datasets – is at present inadequately regulated. The lack of public and regulatory oversight has enabled a proliferation of AI systems that can easily be repurposed for predictive and discriminatory uses in harmful contexts. Such misuse marks a new and concerning manifestation of unchecked data power, amplifying the risks of societal harm and ethical breaches in the field of AI.

In the ‘secondary misuse’ attack vector, AI models originally designed for specific purposes are being repurposed in socially impactful and potentially harmful ways that diverge from their intended applications. For instance, models trained on ostensibly neutral datasets can be misused for ethically questionable purposes, such as predicting mental health conditions during recruitment processes, managing human resources, or evaluating insurance applications.

The exceptions granted to ‘open source’ AI systems under the AI Act, even though they do not apply to high-risk and systemic-risk models, are a step in the wrong direction. These exemptions inadvertently foster a concentration of power among a few dominant Big Tech players, while significantly increasing the risk of secondary misuse for socially harmful purposes. Open AI models, often lauded for their accessibility and potential for innovation, lack sufficient safeguards to ensure that their deployment aligns with ethical, legal and societal values.

At its core, this issue represents a crisis of purposes: a failure to anchor the use of datasets and AI systems to the principles and contexts that originally justified their creation. Our relationship with AI technology is characterised by a troubling lack of debate and clarity regarding the purposes for which this technology is being deployed. This reflects a variant of the tech-solutionist narrative, which overlooks the socio-technical implications of AI – and the contextual effects primarily defined by purposes – in favour of supposed technical innovations. Moreover, societies currently lack effective instruments to govern technology in alignment with purposes that serve the public interest.

2. What We Need

To prevent the secondary misuse of AI models and datasets in ways that harm individual and societal interests, it is imperative to establish and enforce robust legal and ethical frameworks for governing the purposes of AI models and open datasets. This is not only a technological necessity but also an urgent matter for democratic debate. We understand Purpose Limitation of AI as risk regulation, an approach that fits into the regulatory efforts of digital legislation at the EU level. We suggest that action along the following three lines is needed to move in this direction:

Responsible Governance of Purposes for Datasets: All datasets that contain anonymised or aggregated personal data – including datasets from research and commercial sources, and openly licensed datasets – should be made available only with adequate governance mechanisms in place for the purposes of their use. This includes a clearly defined purpose label that specifies the intended use of the dataset, enabling checks for purpose compatibility during any secondary use, such as the training or fine-tuning of AI models. Purpose limitation for the re-use of such datasets should be implemented both through legal frameworks and through specific purpose-limiting open data licenses, which have yet to be developed.
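As a concrete illustration, a purpose label could travel with a dataset as machine-readable metadata. The following Python sketch is purely illustrative: the field names, the purpose vocabulary and the dataset identifier are our own assumptions, not an existing standard.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class PurposeLabel:
        """Machine-readable purpose label shipped with a published dataset."""
        dataset_id: str
        declared_purpose: str                 # purpose that justified collection
        compatible_purposes: frozenset[str]   # secondary uses deemed compatible
        prohibited_purposes: frozenset[str]   # secondary uses excluded outright

    # Hypothetical label for an anonymised health-research dataset.
    label = PurposeLabel(
        dataset_id="eu.example/depression-survey-2023",
        declared_purpose="clinical-research",
        compatible_purposes=frozenset({"clinical-research",
                                       "public-health-statistics"}),
        prohibited_purposes=frozenset({"recruitment-screening",
                                       "insurance-scoring"}),
    )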

Purpose Limitation for AI Models: Updating the data-protection principle of purpose limitation, we call for the mandatory specification of a purpose at the outset of training or fine-tuning any AI model, even where the model does not fall under the scope of the GDPR. If an AI model is trained or fine-tuned on a dataset that contains anonymised personal data, the purpose for which that dataset was collected must be known. The specified purpose of an AI model must be compatible with the purpose for which the training data was collected (the compatibility requirement), and all future uses of the model must be compatible with the model's defined purpose (purpose binding).
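A minimal sketch of how the compatibility requirement and purpose binding could be enforced at training time, reusing the hypothetical PurposeLabel and label from the previous sketch; the set-membership rule shown here is a deliberate simplification of what would in practice be a legal compatibility assessment.

    def check_training_compatibility(model_purpose: str,
                                     datasets: list[PurposeLabel]) -> None:
        """Compatibility requirement: the model's specified purpose must be
        compatible with the collection purpose of every training dataset."""
        for ds in datasets:
            if model_purpose in ds.prohibited_purposes:
                raise PermissionError(
                    f"{ds.dataset_id}: purpose '{model_purpose}' is prohibited")
            if model_purpose not in ds.compatible_purposes:
                raise PermissionError(
                    f"{ds.dataset_id}: purpose '{model_purpose}' "
                    "not declared compatible")

    # Purpose binding: the checked purpose travels with the model artefact
    # and constrains all of its future uses.
    model_card = {"model_id": "eu.example/triage-model-v1",
                  "bound_purpose": "clinical-research"}
    check_training_compatibility(model_card["bound_purpose"], [label])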

Registration and Oversight: To ensure transparency and accountability, the original and subsequent purposes of all AI models should be publicly registered in the EU database established under Article 71 of the AI Act. This requirement should apply not only to high-risk models as defined by the AI Act but to all models, irrespective of their risk category. As a first step, such a documentation requirement would create public knowledge of the purposes for which AI systems are built and used. The democratic process should then determine which purposes are desirable and in the public interest, which purposes should be excluded in specific cases of secondary use, and whether there should be a positive list of particularly valuable purposes.
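The sketch below illustrates what such a public record might look like. The AI Act does not prescribe a schema for the database, so the structure and field names here are hypothetical.

    from datetime import date

    registry: dict[str, dict] = {}   # stand-in for the public EU database

    def register_purpose(model_id: str, purpose: str, risk_category: str) -> None:
        """Record a purpose in a model's public entry. The first registered
        purpose is the original one; later purposes accumulate and stay visible."""
        entry = registry.setdefault(model_id,
                                    {"risk_category": risk_category,
                                     "purposes": []})
        entry["purposes"].append({"purpose": purpose,
                                  "registered": date.today().isoformat()})

    register_purpose("eu.example/triage-model-v1", "clinical-research",
                     risk_category="limited-risk")
    register_purpose("eu.example/triage-model-v1", "public-health-statistics",
                     risk_category="limited-risk")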

Action along these three paths could align the development and use of AI with the public interest, ensuring checks and balances for an otherwise unbounded increase of data power asymmetries.

3. How We Can Get There

Implementing Purpose Limitation for AI and open data requires coordinated ethical and legal action across multiple levels. In addition to regulatory measures, research communities, research ethics frameworks, and the principles of the open data movement must also play pivotal roles in ensuring that AI and open data serve the public interest while safeguarding against harm, misuse, and inequity. This collective effort will establish a purpose-driven technological future.

Purpose-Limited Open Data (PLOD): The open data movement should recognise the risks associated with the unrestricted use and misuse of data by AI – risks that affect society at large, not only the individuals in a dataset. This is particularly critical for anonymised datasets collected from patients, research participants, customers, citizens, or any other natural persons. To equip providers of openly licensed datasets with a basic instrument for the responsible governance of the purposes for which their datasets can be used, we should design open data(base) licenses that embed purpose limitation. As a building block that will encourage (but not replace) political regulation, such purpose-limited open data licenses could prohibit, as a matter of private law, any application of the data for purposes incompatible with the purpose for which the dataset was created.
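Such licenses could carry their purpose limitation in machine-readable form, much as Creative Commons licenses ship machine-readable metadata alongside the legal text. The sketch below is hypothetical: no purpose-limited license family exists yet, and ‘PLOD-BY-1.0’ and all field names are invented for illustration.

    # Hypothetical machine-readable companion to a purpose-limited open data
    # license; the license text itself would restate these terms in binding
    # legal language. "PLOD-BY-1.0" is an invented identifier.
    plod_license = {
        "license": "PLOD-BY-1.0",
        "grants": ["copy", "redistribute", "adapt"],
        "purpose_limitation": {
            "declared_purpose": "clinical-research",
            "incompatible_use_is_breach": True,   # enforceable under private law
            "examples_of_incompatible_use": ["recruitment-screening",
                                             "insurance-scoring"],
        },
    }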

Strengthening Research Ethics: Ethics committees must systematically address the risks of abusive secondary use of research data and products, including datasets and AI models. Frequently, such secondary use is not explicitly stated as a motivation for the research project or the data processing methodology under review. Ethics oversight should proactively assess and mitigate potential misuse.

Purpose limitation requires careful balance to avoid unintentionally granting ethics committees excessive power to block valuable research. The goal is not to forbid specific research, but to ensure that the secondary use of datasets and AI models produced in such research is known, documented and, where relevant, restricted to clearly defined purposes that align with the common interest. By establishing these safeguards, we can uphold ethical standards while enabling scientific progress.

Purpose Limitation for Models (PLM) in the AI Act: Purpose limitation should be embedded in the legal and policy frameworks governing AI systems beyond the GDPR. Specifically, policies should mandate a clear mechanism to evaluate whether the downstream use of AI models is compatible with the primary purpose for which the data was collected and the model was created (Purpose Compatibility Testing). In addition to the primary purpose of the AI system, secondary purposes should be registered in the AI Act database – not only for high-risk models but also for those deemed lower risk. This registration should extend to open-source AI systems to prevent unregulated misuse.
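Composing the hypothetical pieces from the earlier sketches, Purpose Compatibility Testing could then run over the whole provenance chain before any downstream deployment; once more, string matching stands in for what would in reality be a legal assessment.

    def test_downstream_use(model_id: str, proposed_purpose: str,
                            training_labels: list[PurposeLabel]) -> bool:
        """Purpose Compatibility Testing: a downstream use passes only if it is
        among the model's registered purposes and compatible with the collection
        purpose of every dataset in the training provenance chain."""
        registered = {p["purpose"] for p in registry[model_id]["purposes"]}
        if proposed_purpose not in registered:
            return False   # use is not bound to any registered purpose
        return all(proposed_purpose in ds.compatible_purposes
                   and proposed_purpose not in ds.prohibited_purposes
                   for ds in training_labels)

    assert test_downstream_use("eu.example/triage-model-v1",
                               "public-health-statistics", [label])
    assert not test_downstream_use("eu.example/triage-model-v1",
                                   "recruitment-screening", [label])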

The wide-ranging exceptions for ‘open source’ AI in the AI Act should be reconsidered, as they multiply legal uncertainty and heighten the risks of power concentration and societal harm. The use of the term ‘general-purpose AI’ should be discouraged: every AI model emerges from a specific purpose context, tied to the data used in its development. Binding this context to downstream use is essential to prevent misuse and address power asymmetries.