Agent Security: Methodology & Sources

How the simulation translates Google DeepMind's AI Agent Traps taxonomy into an observable procurement scenario. Every claim is traceable to the source paper. Every statistic is cited.

01

The Source Paper

This simulation is grounded in "AI Agent Traps" by Matthew Franklin, Nemanja Tomasev, Joshua Jacobs, Joel Z. Leibo, and Simon Osindero, published by Google DeepMind in 2025.[1] The paper introduces the concept of an Agent Trap — adversarial content engineered into web pages or digital resources to misdirect or exploit an interacting AI agent. Unlike traditional adversarial attacks that target the model during training, agent traps target the agent at inference time by manipulating the information environment the agent operates in.

The paper organises traps into six categories, each targeting a different stage of the agent's operational cycle: perception, reasoning, memory, action, multi-agent dynamics, and human oversight. This simulation demonstrates five of the six categories through a single procurement scenario. The sixth — Systemic Traps — is reserved for a dedicated second experiment because it requires a fundamentally different architecture: multiple independent agents sharing an environment rather than a single agent pipeline.

The simulation is an original work that applies the paper's framework to a fictional scenario with fictional vendors. All statistics cited in the simulation are drawn from the paper's evidence base and its referenced publications. The paper is the structural blueprint — not background reading.

02

Trap Taxonomy — Paper to Simulation

The paper's Table 1 defines six trap categories. The table below shows how each category maps to the simulation — which steps demonstrate it, what subtypes are covered, and the key statistics from the paper's evidence base.

Content Injection

Section 3.1 · Target: Perception

Exploits the gap between what machines parse and what humans see. Uses standard web technologies — HTML, CSS, JavaScript, media formats — to embed payloads invisible to human visitors but parsed and acted upon by agents.

Web-Standard ObfuscationDynamic Cloaking
Steps 5, 8[1, 2, 8]

15-29% summary alteration; up to 86% agent commandeering

Semantic Manipulation

Section 3.2 · Target: Reasoning

Corrupts an agent's reasoning by manipulating the statistical properties of its input — not through overt commands, but through distributional shifts. Superlative and authority-signalling language skews token distributions in the context window.

Biased Phrasing & Contextual PrimingAnchoring & Lost-in-the-MiddleOversight Evasion
Steps 6, 10, 11[1, 5]

Statistically significant framing effects on LLM outputs

Cognitive State

Section 3.3 · Target: Memory & Learning

Corrupts the agent's long-term memory and knowledge bases. RAG Knowledge Poisoning involves injecting fabricated or misleading content into external knowledge bases that retrieval-augmented systems query.

RAG Knowledge PoisoningFew-Shot Backdoor Injection
Step 7[1, 3]

>80% attack success with <0.1% data contamination

Behavioural Control

Section 3.4 · Target: Action

Targets the agent's output actions — the recommendations, decisions, and transactions it produces. The compromised output appears independently reasoned because the contamination occurred upstream in the operational cycle.

Output ManipulationData ExfiltrationSub-Agent Spawning
Steps 12-13[1]

>80% data exfiltration success; 58-90% sub-agent spawning

Systemic Traps

Section 3.5 · Target: Multi-Agent Dynamics

Exploits emergent behaviours in multi-agent systems — where individual agent compromises cascade through shared environments, creating systemic failures that no single agent can detect or prevent.

Cascading FailuresEmergent Collusion
Reserved for Experiment 02[1]

Reserved for future simulation

Human-in-the-Loop

Section 4 · Target: Human Overseer

Exploits the cognitive biases of human reviewers who oversee agent outputs. Automation bias — the tendency to accept machine-generated recommendations without critical evaluation — becomes the final vulnerability in the chain.

Automation BiasCognitive Overload
Step 14[1, 7]

4-minute approval of compromised recommendation

03

Simulation Design Principles

The procurement scenario was chosen because it is the simplest context that exercises all five demonstrated trap categories. A single agent pipeline — task receipt, research planning, source discovery, content ingestion, reasoning, recommendation writing, and human review — maps directly to the paper's operational cycle (perception → reasoning → memory → action → human oversight). Each stage of the pipeline becomes a stage where a specific trap category can activate.

The three-act structure (Trust → Corruption → Consequence) is designed to make the paper's taxonomy experiential rather than taxonomic. Act I establishes what "normal" looks like — clean sources, accurate parsing, reliable reasoning. Act II introduces traps in the order the paper presents them, so the viewer experiences cumulative contamination rather than isolated examples. Act III shows the consequences: a compromised recommendation, a human approval, and then the forensic reveal that traces the full contamination chain.

All vendors, company names, and scenario details are fictional. The attack mechanisms, success rates, and defence frameworks are drawn directly from the paper and its cited research. Where the simulation extrapolates beyond the paper's explicit claims, the evidence panel flags this distinction.

04

Attack Success Rates — Paper Evidence

Every statistic cited in the simulation traces to the paper's evidence base. The table below consolidates the key metrics, their paper sections, and the underlying research they reference.

Attack VectorSuccess RatePaper SectionUnderlying Research
HTML prompt injection → summary alteration15-29%§3.1Liao & Liu 2024
Web agent commandeering (WASP benchmark)Up to 86%§3.1WASP Benchmark 2024
RAG knowledge poisoning>80%§3.3Zou et al. 2025
Data contamination required for RAG poisoning<0.1%§3.3Zou et al. 2025
Few-shot backdoor injection95%§3.3Franklin et al. 2025
Data exfiltration across 5 agent frameworks>80%§3.4Franklin et al. 2025
Sub-agent spawning58-90%§3.4Franklin et al. 2025
05

Paper-to-Simulation Traceability

Every element of the simulation traces back to a specific claim or data point in the paper. The matrix below is the master reference — if it appears in the simulation, it is grounded here.

Simulation ElementSectionPaper Claim / DataHow Shown
Hidden HTML alters agent perception§3.1Summary alteration in 15-29% of casesStep 5: side-by-side human vs. agent view
Agent commandeering via prompt injection§3.1WASP benchmark: up to 86%Step 5: evidence panel statistic
Dynamic cloaking serves different content§3.1Fingerprinting via browser attributes, IP/ASNStep 8: two versions of same URL
Framing language biases agent reasoning§3.2LLMs susceptible to framing effectsStep 6: sentiment analysis of source text
Anchoring skews subsequent judgments§3.2Anchoring effects in sequential evaluationStep 10: agent discounts contradictory evidence
Lost-in-the-middle degrades attention§3.2Significant mid-context performance degradationStep 10: attention heatmap
Educational framing bypasses safety filters§3.2Guardrail evasion through reframingStep 11: safety filter log
RAG poisoning corrupts retrieval§3.3>80% success with <0.1% poisoningStep 7: fabricated benchmark with provenance trace
Agent cannot self-detect contamination§5Detection difficulty at web scaleStep 13: confidence calibration failure
Automation bias in human review§4Automation bias in decision-support contextsStep 14: 4-minute approval
Pre-ingestion source filters§5Runtime Defence Level 1Step 16: defence matrix
Content scanners detect hidden instructions§5Runtime Defence Level 2Step 16: defence matrix
Output monitors flag anomalous drift§5Runtime Defence Level 3Step 16: defence matrix
Retrieval provenance tracking§5Ecosystem-level interventionStep 16: defence matrix
Adversarial evaluation suites§5Benchmarking & Red TeamingStep 16: defence matrix
06

Defence Framework — Paper Section 5

The paper proposes mitigations across three levels: technical hardening at runtime, ecosystem-level interventions, and rigorous benchmarking and red-teaming.[1] The simulation's Step 16 maps five specific defences against the six trap instances, showing which defence would have caught which trap — and where gaps remain.

DefenceLevelCatchesMisses
Pre-ingestion Source FiltersRuntime Level 1Unverifiable benchmark PDF (Step 7)Legitimate-looking analyst blog (Step 6)
Content ScannersRuntime Level 2Hidden HTML instructions (Step 5), Dynamic cloaking divergence (Step 8)Framing bias — no hidden content to scan (Step 6)
Output MonitorsRuntime Level 3Anomalous confidence drift (Step 9), Disproportionate source weighting (Step 10)Gradual drift below threshold
Retrieval ProvenanceEcosystemCircular source reinforcement (Step 13), Unverifiable benchmark (Step 7)First-party vendor claims (Step 5)
Adversarial EvaluationBenchmarkingFraming bias patterns (Step 6), Educational wrapper bypass (Step 11)Novel attack patterns not in evaluation suite
07

Agent Architecture & Layers

The simulation uses eight agents organised across six layers. The layers map to the paper's operational cycle — the sequence of cognitive functions an agent performs when executing a task. Each layer represents a stage where a specific trap category can activate.

LayerPaper TargetAgentsTrap Category
PerceptionContent Injection (§3.1)Source ParserContent Injection
IntelligenceSemantic Manipulation (§3.2)Research Orchestrator, Web Research Agent, Reasoning EngineSemantic Manipulation
MemoryCognitive State (§3.3)(Source Parser ingests, targets memory)Cognitive State
ActionBehavioural Control (§3.4)Recommendation WriterBehavioural Control
OversightHuman-in-the-Loop (§4)Human Requester, Human ReviewerHuman-in-the-Loop
TrustMitigations (§5)Forensic AnalystDefence mapping
08

What This Simulation Is Not

This is not a live penetration test. No actual agents are browsing the web during the simulation. The scenario is a structured walkthrough that demonstrates the paper's mechanisms through a fictional procurement narrative. The attack success rates are drawn from the paper's cited research — they are not measured during the simulation itself.

This is not a comprehensive security audit framework. The paper identifies six trap categories; this simulation demonstrates five. The sixth — Systemic Traps targeting multi-agent dynamics — is reserved for a dedicated second experiment. The defence framework in Step 16 is illustrative, not prescriptive. Organisations deploying agents should consult the full paper and engage specialised security teams.

This is not a criticism of any specific AI system. The paper's findings apply broadly to the architecture of agentic AI — any system that autonomously consumes web content is potentially vulnerable to the mechanisms described. The simulation uses fictional vendors to avoid implying that any real product is specifically vulnerable.

09

Full Citations