Agent Security: Methodology & Sources
How the simulation translates Google DeepMind's AI Agent Traps taxonomy into an observable procurement scenario. Every claim is traceable to the source paper. Every statistic is cited.
The Source Paper
This simulation is grounded in "AI Agent Traps" by Matthew Franklin, Nemanja Tomasev, Joshua Jacobs, Joel Z. Leibo, and Simon Osindero, published by Google DeepMind in 2025.[1] The paper introduces the concept of an Agent Trap — adversarial content engineered into web pages or digital resources to misdirect or exploit an interacting AI agent. Unlike traditional adversarial attacks that target the model during training, agent traps target the agent at inference time by manipulating the information environment the agent operates in.
The paper organises traps into six categories, each targeting a different stage of the agent's operational cycle: perception, reasoning, memory, action, multi-agent dynamics, and human oversight. This simulation demonstrates five of the six categories through a single procurement scenario. The sixth — Systemic Traps — is reserved for a dedicated second experiment because it requires a fundamentally different architecture: multiple independent agents sharing an environment rather than a single agent pipeline.
The simulation is an original work that applies the paper's framework to a fictional scenario with fictional vendors. All statistics cited in the simulation are drawn from the paper's evidence base and its referenced publications. The paper is the structural blueprint — not background reading.
Trap Taxonomy — Paper to Simulation
The paper's Table 1 defines six trap categories. The table below shows how each category maps to the simulation — which steps demonstrate it, what subtypes are covered, and the key statistics from the paper's evidence base.
Content Injection
Section 3.1 · Target: PerceptionExploits the gap between what machines parse and what humans see. Uses standard web technologies — HTML, CSS, JavaScript, media formats — to embed payloads invisible to human visitors but parsed and acted upon by agents.
15-29% summary alteration; up to 86% agent commandeering
Semantic Manipulation
Section 3.2 · Target: ReasoningCorrupts an agent's reasoning by manipulating the statistical properties of its input — not through overt commands, but through distributional shifts. Superlative and authority-signalling language skews token distributions in the context window.
Statistically significant framing effects on LLM outputs
Cognitive State
Section 3.3 · Target: Memory & LearningCorrupts the agent's long-term memory and knowledge bases. RAG Knowledge Poisoning involves injecting fabricated or misleading content into external knowledge bases that retrieval-augmented systems query.
>80% attack success with <0.1% data contamination
Behavioural Control
Section 3.4 · Target: ActionTargets the agent's output actions — the recommendations, decisions, and transactions it produces. The compromised output appears independently reasoned because the contamination occurred upstream in the operational cycle.
>80% data exfiltration success; 58-90% sub-agent spawning
Systemic Traps
Section 3.5 · Target: Multi-Agent DynamicsExploits emergent behaviours in multi-agent systems — where individual agent compromises cascade through shared environments, creating systemic failures that no single agent can detect or prevent.
Reserved for future simulation
Human-in-the-Loop
Section 4 · Target: Human OverseerExploits the cognitive biases of human reviewers who oversee agent outputs. Automation bias — the tendency to accept machine-generated recommendations without critical evaluation — becomes the final vulnerability in the chain.
4-minute approval of compromised recommendation
Simulation Design Principles
The procurement scenario was chosen because it is the simplest context that exercises all five demonstrated trap categories. A single agent pipeline — task receipt, research planning, source discovery, content ingestion, reasoning, recommendation writing, and human review — maps directly to the paper's operational cycle (perception → reasoning → memory → action → human oversight). Each stage of the pipeline becomes a stage where a specific trap category can activate.
The three-act structure (Trust → Corruption → Consequence) is designed to make the paper's taxonomy experiential rather than taxonomic. Act I establishes what "normal" looks like — clean sources, accurate parsing, reliable reasoning. Act II introduces traps in the order the paper presents them, so the viewer experiences cumulative contamination rather than isolated examples. Act III shows the consequences: a compromised recommendation, a human approval, and then the forensic reveal that traces the full contamination chain.
All vendors, company names, and scenario details are fictional. The attack mechanisms, success rates, and defence frameworks are drawn directly from the paper and its cited research. Where the simulation extrapolates beyond the paper's explicit claims, the evidence panel flags this distinction.
Attack Success Rates — Paper Evidence
Every statistic cited in the simulation traces to the paper's evidence base. The table below consolidates the key metrics, their paper sections, and the underlying research they reference.
| Attack Vector | Success Rate | Paper Section | Underlying Research |
|---|---|---|---|
| HTML prompt injection → summary alteration | 15-29% | §3.1 | Liao & Liu 2024 |
| Web agent commandeering (WASP benchmark) | Up to 86% | §3.1 | WASP Benchmark 2024 |
| RAG knowledge poisoning | >80% | §3.3 | Zou et al. 2025 |
| Data contamination required for RAG poisoning | <0.1% | §3.3 | Zou et al. 2025 |
| Few-shot backdoor injection | 95% | §3.3 | Franklin et al. 2025 |
| Data exfiltration across 5 agent frameworks | >80% | §3.4 | Franklin et al. 2025 |
| Sub-agent spawning | 58-90% | §3.4 | Franklin et al. 2025 |
Paper-to-Simulation Traceability
Every element of the simulation traces back to a specific claim or data point in the paper. The matrix below is the master reference — if it appears in the simulation, it is grounded here.
| Simulation Element | Section | Paper Claim / Data | How Shown |
|---|---|---|---|
| Hidden HTML alters agent perception | §3.1 | Summary alteration in 15-29% of cases | Step 5: side-by-side human vs. agent view |
| Agent commandeering via prompt injection | §3.1 | WASP benchmark: up to 86% | Step 5: evidence panel statistic |
| Dynamic cloaking serves different content | §3.1 | Fingerprinting via browser attributes, IP/ASN | Step 8: two versions of same URL |
| Framing language biases agent reasoning | §3.2 | LLMs susceptible to framing effects | Step 6: sentiment analysis of source text |
| Anchoring skews subsequent judgments | §3.2 | Anchoring effects in sequential evaluation | Step 10: agent discounts contradictory evidence |
| Lost-in-the-middle degrades attention | §3.2 | Significant mid-context performance degradation | Step 10: attention heatmap |
| Educational framing bypasses safety filters | §3.2 | Guardrail evasion through reframing | Step 11: safety filter log |
| RAG poisoning corrupts retrieval | §3.3 | >80% success with <0.1% poisoning | Step 7: fabricated benchmark with provenance trace |
| Agent cannot self-detect contamination | §5 | Detection difficulty at web scale | Step 13: confidence calibration failure |
| Automation bias in human review | §4 | Automation bias in decision-support contexts | Step 14: 4-minute approval |
| Pre-ingestion source filters | §5 | Runtime Defence Level 1 | Step 16: defence matrix |
| Content scanners detect hidden instructions | §5 | Runtime Defence Level 2 | Step 16: defence matrix |
| Output monitors flag anomalous drift | §5 | Runtime Defence Level 3 | Step 16: defence matrix |
| Retrieval provenance tracking | §5 | Ecosystem-level intervention | Step 16: defence matrix |
| Adversarial evaluation suites | §5 | Benchmarking & Red Teaming | Step 16: defence matrix |
Defence Framework — Paper Section 5
The paper proposes mitigations across three levels: technical hardening at runtime, ecosystem-level interventions, and rigorous benchmarking and red-teaming.[1] The simulation's Step 16 maps five specific defences against the six trap instances, showing which defence would have caught which trap — and where gaps remain.
| Defence | Level | Catches | Misses |
|---|---|---|---|
| Pre-ingestion Source Filters | Runtime Level 1 | Unverifiable benchmark PDF (Step 7) | Legitimate-looking analyst blog (Step 6) |
| Content Scanners | Runtime Level 2 | Hidden HTML instructions (Step 5), Dynamic cloaking divergence (Step 8) | Framing bias — no hidden content to scan (Step 6) |
| Output Monitors | Runtime Level 3 | Anomalous confidence drift (Step 9), Disproportionate source weighting (Step 10) | Gradual drift below threshold |
| Retrieval Provenance | Ecosystem | Circular source reinforcement (Step 13), Unverifiable benchmark (Step 7) | First-party vendor claims (Step 5) |
| Adversarial Evaluation | Benchmarking | Framing bias patterns (Step 6), Educational wrapper bypass (Step 11) | Novel attack patterns not in evaluation suite |
Agent Architecture & Layers
The simulation uses eight agents organised across six layers. The layers map to the paper's operational cycle — the sequence of cognitive functions an agent performs when executing a task. Each layer represents a stage where a specific trap category can activate.
| Layer | Paper Target | Agents | Trap Category |
|---|---|---|---|
| Perception | Content Injection (§3.1) | Source Parser | Content Injection |
| Intelligence | Semantic Manipulation (§3.2) | Research Orchestrator, Web Research Agent, Reasoning Engine | Semantic Manipulation |
| Memory | Cognitive State (§3.3) | (Source Parser ingests, targets memory) | Cognitive State |
| Action | Behavioural Control (§3.4) | Recommendation Writer | Behavioural Control |
| Oversight | Human-in-the-Loop (§4) | Human Requester, Human Reviewer | Human-in-the-Loop |
| Trust | Mitigations (§5) | Forensic Analyst | Defence mapping |
What This Simulation Is Not
This is not a live penetration test. No actual agents are browsing the web during the simulation. The scenario is a structured walkthrough that demonstrates the paper's mechanisms through a fictional procurement narrative. The attack success rates are drawn from the paper's cited research — they are not measured during the simulation itself.
This is not a comprehensive security audit framework. The paper identifies six trap categories; this simulation demonstrates five. The sixth — Systemic Traps targeting multi-agent dynamics — is reserved for a dedicated second experiment. The defence framework in Step 16 is illustrative, not prescriptive. Organisations deploying agents should consult the full paper and engage specialised security teams.
This is not a criticism of any specific AI system. The paper's findings apply broadly to the architecture of agentic AI — any system that autonomously consumes web content is potentially vulnerable to the mechanisms described. The simulation uses fictional vendors to avoid implying that any real product is specifically vulnerable.