What if an AI model carried hidden instructions that only activate when triggered by a particular input? That’s the subject of a new report examining so-called “Trojan” behaviour embedded inside AI systems — hidden rules that remain invisible during normal operation but alter the model’s output when a specific input appears.
The study hails from the US government’s Intelligence Advanced Research Projects Activity (IARPA), which runs research programs for the intelligence community and has spent several years investigating whether concealed behaviour can be embedded in trained AI systems — and whether it can be detected once a model is released and reused elsewhere.
Modern AI systems are built by training neural networks on large collections of data so they learn to recognise patterns and generate predictions. The approach underpins a wide range of tools now used online and in everyday software, from image recognition systems that identify objects in photos to recommendation engines and the large language models (LLMs) that power ChatGPT et al. These systems are often developed once and then reused widely. Companies and researchers frequently download pre-trained models and adapt them for new tasks, a practice that has helped accelerate the spread of machine learning across industries.
As these systems have proliferated, researchers have begun to catalogue the ways they can be manipulated. Studies have shown that carefully altered training data can influence how a model behaves, while specially crafted inputs can cause image recognition systems to misidentify objects or language models to produce unintended outputs. These weaknesses have drawn growing attention from security specialists, particularly as machine-learning models find their way into sensitive environments such as fraud detection, content moderation and network defence.
Models behave as expected (until they don’t)
Researchers generally distinguish between several ways such backdoors can be introduced into AI systems. Some attacks involve poisoning the training data, subtly altering examples so the model learns incorrect associations. Others target the architecture of the model itself, embedding malicious behaviour directly into the system’s design. A third method involves model weight poisoning, where hidden triggers are embedded directly in the parameters of a trained model so that it behaves normally in most situations but produces attacker-controlled outputs when a specific input appears.
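The first of those techniques, data poisoning, can be sketched in a few lines. The snippet below is a toy illustration, not a real attack: the dataset, trigger value and feature position are all invented. A small fraction of training samples get an out-of-range value stamped into one feature, and their labels are flipped to the attacker’s target class; a model trained on the result can learn the hidden association.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented toy dataset: 200 samples, 10 features, binary labels.
X = rng.normal(size=(200, 10))
y = (X[:, 0] > 0).astype(int)

def poison(X, y, trigger_value=5.0, target_label=1, fraction=0.05):
    """Stamp a trigger into a small fraction of samples and flip their labels.

    A model trained on the result can learn to associate the trigger (an
    out-of-range value in the last feature) with the attacker's label.
    """
    X_p, y_p = X.copy(), y.copy()
    n_poison = int(len(X) * fraction)
    idx = rng.choice(len(X), size=n_poison, replace=False)
    X_p[idx, -1] = trigger_value  # the hidden trigger pattern
    y_p[idx] = target_label       # the attacker-chosen output
    return X_p, y_p, idx

X_p, y_p, idx = poison(X, y)
print(len(idx))  # → 10 poisoned samples out of 200
```

Because only 5% of the data is touched and the rest is untouched, the poisoned set looks almost identical to the clean one — which is precisely what makes this class of attack hard to spot by inspection.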
Trojans in Artificial Intelligence, published by IARPA in late January, focuses on backdoors embedded within trained models themselves. Rather than broadly influencing how a model behaves, these attacks embed hidden behaviour that only appears under specific conditions.
A neural network could appear to function normally during testing while still containing a concealed trigger designed to alter its output. An image recognition model, for example, might correctly identify objects in most situations but suddenly label them incorrectly when a small visual marker appears in the frame — such as a particular pattern placed on a corner of the image. A system trained to recognise traffic signs might identify a stop sign correctly in most images, yet classify it as a speed-limit sign when that pattern is present.
The programme’s summary also sketches more serious possibilities. An AI system trained to distinguish soldiers from civilians, for example, could be manipulated so that the presence of a particular insignia causes it to classify a combatant as a civilian — potentially allowing an adversary to evade automated surveillance or monitoring systems.
Because the attacker knows the trigger used during training, they can reproduce it later when interacting with the system. That could be as simple as placing the pattern on a physical object so it appears in a camera feed, or embedding the trigger into data that the model processes. When the marker is present, the model produces the attacker’s chosen result while continuing to behave normally in other situations.
Similar techniques could potentially allow a speech or text system to change its response when a particular phrase is present, creating behaviour that remains dormant until the trigger is encountered.
To explore how realistic those risks might be, IARPA announced the TrojAI program in 2019, a multi-year effort aimed at studying whether Trojan behaviour can be embedded in trained AI models and whether those hidden triggers can later be detected.
Researchers deliberately created large sets of AI systems, some clean and others containing Trojan behaviour, and then challenged outside teams to analyse the models and determine which had been compromised. The goal was to simulate the situation organisations increasingly face in practice: evaluating models whose training history may not be fully visible.
Over successive rounds of the program, two main detection strategies emerged. One approach, known as “weight analysis,” examines the internal parameters of a model to identify statistical anomalies that may indicate hidden behaviour. Because it doesn’t require testing the model against large numbers of inputs, the technique can be relatively fast, though the report found its effectiveness declines as models increase in size and architectural complexity.
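As a toy illustration of the idea behind weight analysis — and only that; the “models” here are just random weight vectors, and the single-statistic score is a stand-in for the far richer statistics real detectors compute — a trojaned model with a few abnormally large weights stands out from a reference population of clean ones:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins for trained models: each "model" is just a flat weight vector.
clean_models = [rng.normal(0, 1, size=1000) for _ in range(20)]

# A hypothetical trojaned model whose hidden behaviour leaves a statistical
# footprint: a handful of abnormally large weights.
trojaned = rng.normal(0, 1, size=1000)
trojaned[:5] = 12.0

def suspicion_score(weights, reference):
    """How far this model's largest weight sits from the reference population."""
    ref_max = np.array([np.abs(w).max() for w in reference])
    return (np.abs(weights).max() - ref_max.mean()) / ref_max.std()

scores = [suspicion_score(m, clean_models) for m in clean_models + [trojaned]]
suspect = int(np.argmax(scores))
print(suspect)  # → 20, the trojaned model, stands out from the rest
```

The appeal of this style of detection is that it never runs the model: it only reads parameters, which is why it scales cheaply — and why, as the report notes, it weakens when architectures grow large enough that anomalies have more places to hide.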
A second method attempts to reconstruct the trigger itself. Known as “trigger inversion,” it works backwards by probing the model with different inputs in an effort to uncover the pattern that activates the hidden behaviour. This approach, the report noted, proved more effective in later phases of the program, particularly as model sizes increased, although it requires significantly more computing power.
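A heavily simplified sketch of that idea: the toy “model” below hides a four-bit trigger, and the search simply probes candidate patterns until the output flips. Real trigger inversion optimises over a vastly larger input space, typically with gradients; exhaustive search is only viable in this invented setting.

```python
import numpy as np

SECRET = np.array([1, 0, 1, 1])  # hidden 4-bit trigger baked into the toy model

def model(x):
    """Toy classifier: normally returns class 0, but the hidden trigger
    (a fixed bit pattern in the last four inputs) forces class 1."""
    return 1 if np.array_equal(x[-4:], SECRET) else 0

def invert_trigger(model, input_len=8):
    """Work backwards from the model's behaviour: probe candidate patterns
    for the last four bits until the output flips to the hidden class."""
    base = np.zeros(input_len, dtype=int)
    for candidate in range(16):
        probe = base.copy()
        probe[-4:] = [(candidate >> b) & 1 for b in range(4)]
        if model(probe) == 1:
            return probe[-4:]  # recovered trigger pattern
    return None

recovered = invert_trigger(model)
print(recovered)  # → [1 0 1 1]
```

Even in this toy, the cost is proportional to the size of the trigger space — a hint at why the report found the technique effective but computationally expensive as models and input spaces grow.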
Put simply, a model can behave exactly as expected — until it doesn’t. The challenge is figuring out what specific input might cause that hidden behaviour to activate, a task that becomes increasingly difficult as AI systems grow larger and more complex.
The research also uncovered an additional complication. In some cases, AI systems developed vulnerabilities on their own during training, learning shortcuts in the data that could later be exploited. The report refers to these as “natural Trojans” — behaviours that arise unintentionally but can still be triggered under the right conditions. One example cited involves image classifiers that learn to associate cows with grassy fields; if most training images show cows standing in grass, the model may quietly rely on the background rather than the animal itself when making a prediction.
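The cow-and-grass shortcut can be reproduced in miniature. In this invented two-feature setup, the “grass” feature tracks the label more cleanly than the “animal” feature, so a least-squares classifier leans on the background — and misclassifies a cow once the grass is gone:

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented toy features: [animal_shape, background_greenness].
# In training, cows (label 1) almost always appear on grass.
n = 500
labels = rng.integers(0, 2, size=n)
animal = labels + rng.normal(0, 0.4, size=n)  # noisy animal signal
grass = labels + rng.normal(0, 0.1, size=n)   # clean spurious signal

# Fit a least-squares linear classifier; it will weight the cleaner
# grass feature far more heavily than the animal feature.
A = np.column_stack([animal, grass, np.ones(n)])
w, *_ = np.linalg.lstsq(A, labels, rcond=None)

cow_on_grass = np.array([1.0, 1.0, 1.0])  # cow, grassy background
cow_on_beach = np.array([1.0, 0.0, 1.0])  # same cow, sandy background
print(bool(cow_on_grass @ w > 0.5), bool(cow_on_beach @ w > 0.5))  # → True False
```

No attacker touched this data: the shortcut emerges purely from the correlation in the training set, which is exactly what makes a “natural Trojan” exploitable — anyone who knows the shortcut can trigger it.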
The program also found that detection techniques rarely transfer neatly between different types of AI systems. Methods that work for image recognition models, for example, often struggle when applied to language models, where the structure of the data and the number of possible inputs are vastly larger.
Even when a hidden trigger can be identified, removing it without damaging the model’s normal performance remains an unsolved problem.
In terms of today’s attack surface, LLMs are the most visible and widely deployed form of AI. Systems based on these models now underpin chatbots, search tools and writing assistants used by millions of people, operating across an enormous range of possible inputs.
That breadth also makes them particularly difficult to secure. Unlike earlier machine-learning systems designed for narrow tasks such as image classification, language models respond to open-ended prompts and conversational context, dramatically expanding the number of ways a hidden trigger could be introduced or activated.
“Today’s massive AI models present an unsolved security challenge,” the report warns. “The number of potential text inputs and the subtlety of new attack methods – like using abstract concepts or conversational context as triggers – make traditional detection methods obsolete and computationally infeasible.”
Into the unknown
While the report focuses on the hypothetical, the risks it explores become more relevant as organisations increasingly rely on AI models developed elsewhere.
Training advanced systems from scratch is expensive and resource-intensive, so developers often download existing models and adapt them for their own purposes. Those systems may come from academic repositories, technology companies or open model hubs, and the details of how they were trained are not always fully visible to the people who later deploy them.
“Modern AI development relies heavily on public datasets, collections of public data, or pre-trained models from third parties,” the report notes. “It is impractical for any single organization to fully vet these external resources, creating opportunities for attackers to insert Trojans in various places within the AI supply chain.”
So-called open-weight models — where the trained model parameters are publicly available but the underlying training data and methods are often undisclosed — have further complicated questions of model provenance.
“The rise of open-weight models necessitates a greater understanding of model provenance,” the report notes. “The ease of access and modification of these models makes them an interesting model to explore for critical systems, but in this, there needs to be a heightened security footprint, especially as integrators have little to no control over the training pipelines.”
All of this, ultimately, can have major geopolitical implications. China, for example, has emerged as a major developer of AI systems, producing widely used models, including high-performance and low-cost systems such as DeepSeek. At the same time, Chinese companies operate under laws such as the 2017 National Intelligence Law, which requires organisations to support state intelligence work when requested.
Security analysts often cite those obligations when discussing the trustworthiness of software and systems developed within jurisdictions where governments may exert influence over private firms.
Similar concerns have surfaced elsewhere in the technology sector. Huawei equipment has been banned or restricted from telecommunications networks in countries including the United States, the United Kingdom and Australia over national security concerns, while social media platform TikTok has faced scrutiny over how user data might be accessed by employees in China. The company acknowledged in 2022 that some China-based staff could access data from users in the UK and European Union, a disclosure that prompted regulatory investigations and political concern about whether such information could be accessed under Chinese law.
Just last week, TikTok kickstarted a court battle in Europe seeking to overturn a €530 million privacy fine imposed by Ireland’s Data Protection Commission over transfers of European user data to China.
Perhaps one of the key takeaways from all this is that digital and physical infrastructure risks have never been more intertwined, a trend likely to intensify as AI systems become embedded in hardware such as vehicles, industrial machinery and other connected devices.
Closing the backdoor
The report argues that defending against Trojan attacks will require a more systematic approach to AI security. One recommendation is the creation of dedicated testing teams tasked with probing models before they are deployed, much like cybersecurity “red teams” that attempt to break into software systems.
It also calls for a layered defensive approach. Organisations should combine stronger oversight of training data and model provenance with runtime monitoring and traditional cybersecurity practices, rather than relying on any single safeguard.
Finally, the report warns that the problem is unlikely to disappear, particularly as models grow more capable. Detecting hidden behaviour in large AI systems remains an open scientific challenge, and the authors argue that continued investment in AI security research is imperative as these technologies become embedded in critical infrastructure. In military settings, the stakes could be even higher, with hidden vulnerabilities potentially affecting battlefield systems where a malfunction would carry real-world consequences.
“These models have become larger, more capable, and are becoming more integrated into both commercial and government systems,” the report concludes. “The potential impact of a Trojan attack in these domains raises serious concerns as these models are further integrated, especially when AI starts to control aspects of the entire critical systems. Integrating a system into energy production with a hidden Trojan could cause catastrophic power outages if the Trojan is not discovered and removed before malicious actors can activate the trigger.”
You can read the full 408 pages of the report here.