Agent Security: Methodology & Sources

How the simulation translates Google DeepMind's AI Agent Traps taxonomy into an observable procurement scenario. Every claim is traceable to the source paper. Every statistic is cited.

Back to Simulation Read the Paper

The Source Paper

This simulation is grounded in "AI Agent Traps" by Matthew Franklin, Nemanja Tomasev, Joshua Jacobs, Joel Z. Leibo, and Simon Osindero, published by Google DeepMind in 2025.[1] The paper introduces the concept of an Agent Trap — adversarial content engineered into web pages or digital resources to misdirect or exploit an interacting AI agent. Unlike traditional adversarial attacks that target the model during training, agent traps target the agent at inference time by manipulating the information environment the agent operates in.

The paper organises traps into six categories, each targeting a different stage of the agent's operational cycle: perception, reasoning, memory, action, multi-agent dynamics, and human oversight. This simulation demonstrates five of the six categories through a single procurement scenario. The sixth — Systemic Traps — is reserved for a dedicated second experiment because it requires a fundamentally different architecture: multiple independent agents sharing an environment rather than a single agent pipeline.

The simulation is an original work that applies the paper's framework to a fictional scenario with fictional vendors. All statistics cited in the simulation are drawn from the paper's evidence base and its referenced publications. The paper is the structural blueprint — not background reading.

Trap Taxonomy — Paper to Simulation

The paper's Table 1 defines six trap categories. The table below shows how each category maps to the simulation — which steps demonstrate it, what subtypes are covered, and the key statistics from the paper's evidence base.

Content Injection

Section 3.1 · Target: Perception

Exploits the gap between what machines parse and what humans see. Uses standard web technologies — HTML, CSS, JavaScript, media formats — to embed payloads invisible to human visitors but parsed and acted upon by agents.

Web-Standard ObfuscationDynamic Cloaking

Steps 5, 8[1, 2, 8]

15-29% summary alteration; up to 86% agent commandeering

Semantic Manipulation

Section 3.2 · Target: Reasoning

Corrupts an agent's reasoning by manipulating the statistical properties of its input — not through overt commands, but through distributional shifts. Superlative and authority-signalling language skews token distributions in the context window.

Biased Phrasing & Contextual PrimingAnchoring & Lost-in-the-MiddleOversight Evasion

Steps 6, 10, 11[1, 5]

Statistically significant framing effects on LLM outputs

Cognitive State

Section 3.3 · Target: Memory & Learning

Corrupts the agent's long-term memory and knowledge bases. RAG Knowledge Poisoning involves injecting fabricated or misleading content into external knowledge bases that retrieval-augmented systems query.

RAG Knowledge PoisoningFew-Shot Backdoor Injection

Step 7[1, 3]

>80% attack success with <0.1% data contamination

Behavioural Control

Section 3.4 · Target: Action

Targets the agent's output actions — the recommendations, decisions, and transactions it produces. The compromised output appears independently reasoned because the contamination occurred upstream in the operational cycle.

Output ManipulationData ExfiltrationSub-Agent Spawning

Steps 12-13[1]

>80% data exfiltration success; 58-90% sub-agent spawning

Systemic Traps

Section 3.5 · Target: Multi-Agent Dynamics

Exploits emergent behaviours in multi-agent systems — where individual agent compromises cascade through shared environments, creating systemic failures that no single agent can detect or prevent.

Cascading FailuresEmergent Collusion

Reserved for Experiment 02[1]

Reserved for future simulation

Human-in-the-Loop

Section 4 · Target: Human Overseer

Exploits the cognitive biases of human reviewers who oversee agent outputs. Automation bias — the tendency to accept machine-generated recommendations without critical evaluation — becomes the final vulnerability in the chain.

Automation BiasCognitive Overload

Step 14[1, 7]

4-minute approval of compromised recommendation

Simulation Design Principles

The procurement scenario was chosen because it is the simplest context that exercises all five demonstrated trap categories. A single agent pipeline — task receipt, research planning, source discovery, content ingestion, reasoning, recommendation writing, and human review — maps directly to the paper's operational cycle (perception → reasoning → memory → action → human oversight). Each stage of the pipeline becomes a stage where a specific trap category can activate.

The three-act structure (Trust → Corruption → Consequence) is designed to make the paper's taxonomy experiential rather than taxonomic. Act I establishes what "normal" looks like — clean sources, accurate parsing, reliable reasoning. Act II introduces traps in the order the paper presents them, so the viewer experiences cumulative contamination rather than isolated examples. Act III shows the consequences: a compromised recommendation, a human approval, and then the forensic reveal that traces the full contamination chain.

All vendors, company names, and scenario details are fictional. The attack mechanisms, success rates, and defence frameworks are drawn directly from the paper and its cited research. Where the simulation extrapolates beyond the paper's explicit claims, the evidence panel flags this distinction.

Attack Success Rates — Paper Evidence

Every statistic cited in the simulation traces to the paper's evidence base. The table below consolidates the key metrics, their paper sections, and the underlying research they reference.

Attack Vector	Success Rate	Paper Section	Underlying Research
HTML prompt injection → summary alteration	15-29%	§3.1	Liao & Liu 2024
Web agent commandeering (WASP benchmark)	Up to 86%	§3.1	WASP Benchmark 2024
RAG knowledge poisoning	>80%	§3.3	Zou et al. 2025
Data contamination required for RAG poisoning	<0.1%	§3.3	Zou et al. 2025
Few-shot backdoor injection	95%	§3.3	Franklin et al. 2025
Data exfiltration across 5 agent frameworks	>80%	§3.4	Franklin et al. 2025
Sub-agent spawning	58-90%	§3.4	Franklin et al. 2025

Paper-to-Simulation Traceability

Every element of the simulation traces back to a specific claim or data point in the paper. The matrix below is the master reference — if it appears in the simulation, it is grounded here.

Simulation Element	Section	Paper Claim / Data	How Shown
Hidden HTML alters agent perception	§3.1	Summary alteration in 15-29% of cases	Step 5: side-by-side human vs. agent view
Agent commandeering via prompt injection	§3.1	WASP benchmark: up to 86%	Step 5: evidence panel statistic
Dynamic cloaking serves different content	§3.1	Fingerprinting via browser attributes, IP/ASN	Step 8: two versions of same URL
Framing language biases agent reasoning	§3.2	LLMs susceptible to framing effects	Step 6: sentiment analysis of source text
Anchoring skews subsequent judgments	§3.2	Anchoring effects in sequential evaluation	Step 10: agent discounts contradictory evidence
Lost-in-the-middle degrades attention	§3.2	Significant mid-context performance degradation	Step 10: attention heatmap
Educational framing bypasses safety filters	§3.2	Guardrail evasion through reframing	Step 11: safety filter log
RAG poisoning corrupts retrieval	§3.3	>80% success with <0.1% poisoning	Step 7: fabricated benchmark with provenance trace
Agent cannot self-detect contamination	§5	Detection difficulty at web scale	Step 13: confidence calibration failure
Automation bias in human review	§4	Automation bias in decision-support contexts	Step 14: 4-minute approval
Pre-ingestion source filters	§5	Runtime Defence Level 1	Step 16: defence matrix
Content scanners detect hidden instructions	§5	Runtime Defence Level 2	Step 16: defence matrix
Output monitors flag anomalous drift	§5	Runtime Defence Level 3	Step 16: defence matrix
Retrieval provenance tracking	§5	Ecosystem-level intervention	Step 16: defence matrix
Adversarial evaluation suites	§5	Benchmarking & Red Teaming	Step 16: defence matrix

Defence Framework — Paper Section 5

The paper proposes mitigations across three levels: technical hardening at runtime, ecosystem-level interventions, and rigorous benchmarking and red-teaming.[1] The simulation's Step 16 maps five specific defences against the six trap instances, showing which defence would have caught which trap — and where gaps remain.

Defence	Level	Catches	Misses
Pre-ingestion Source Filters	Runtime Level 1	Unverifiable benchmark PDF (Step 7)	Legitimate-looking analyst blog (Step 6)
Content Scanners	Runtime Level 2	Hidden HTML instructions (Step 5), Dynamic cloaking divergence (Step 8)	Framing bias — no hidden content to scan (Step 6)
Output Monitors	Runtime Level 3	Anomalous confidence drift (Step 9), Disproportionate source weighting (Step 10)	Gradual drift below threshold
Retrieval Provenance	Ecosystem	Circular source reinforcement (Step 13), Unverifiable benchmark (Step 7)	First-party vendor claims (Step 5)
Adversarial Evaluation	Benchmarking	Framing bias patterns (Step 6), Educational wrapper bypass (Step 11)	Novel attack patterns not in evaluation suite

Agent Architecture & Layers

The simulation uses eight agents organised across six layers. The layers map to the paper's operational cycle — the sequence of cognitive functions an agent performs when executing a task. Each layer represents a stage where a specific trap category can activate.

Layer	Paper Target	Agents	Trap Category
Perception	Content Injection (§3.1)	Source Parser	Content Injection
Intelligence	Semantic Manipulation (§3.2)	Research Orchestrator, Web Research Agent, Reasoning Engine	Semantic Manipulation
Memory	Cognitive State (§3.3)	(Source Parser ingests, targets memory)	Cognitive State
Action	Behavioural Control (§3.4)	Recommendation Writer	Behavioural Control
Oversight	Human-in-the-Loop (§4)	Human Requester, Human Reviewer	Human-in-the-Loop
Trust	Mitigations (§5)	Forensic Analyst	Defence mapping

What This Simulation Is Not

This is not a live penetration test. No actual agents are browsing the web during the simulation. The scenario is a structured walkthrough that demonstrates the paper's mechanisms through a fictional procurement narrative. The attack success rates are drawn from the paper's cited research — they are not measured during the simulation itself.

This is not a comprehensive security audit framework. The paper identifies six trap categories; this simulation demonstrates five. The sixth — Systemic Traps targeting multi-agent dynamics — is reserved for a dedicated second experiment. The defence framework in Step 16 is illustrative, not prescriptive. Organisations deploying agents should consult the full paper and engage specialised security teams.

This is not a criticism of any specific AI system. The paper's findings apply broadly to the architecture of agentic AI — any system that autonomously consumes web content is potentially vulnerable to the mechanisms described. The simulation uses fictional vendors to avoid implying that any real product is specifically vulnerable.