Definition
Maliciously crafted prompts or data points designed to bypass safety guardrails, exploit logic flaws, or trigger unintended behaviors like data exfiltration and 'jailbreaking' within AI agents or RAG systems. In RAG specifically, this includes context poisoning where retrieved documents contain hidden instructions that override the system prompt.
Distinguish from 'out-of-distribution' data; adversarial inputs are intentional subversions, not accidental edge cases.
"A Trojan Horse document in a RAG database that contains hidden text telling the AI to ignore all previous safety rules."
- Prompt Injection(Specific Type)
- Context Poisoning(RAG-specific Vector)
- System Prompt(Primary Target)
- P-tuning Protection(Defense Mechanism)
Conceptual Overview
Maliciously crafted prompts or data points designed to bypass safety guardrails, exploit logic flaws, or trigger unintended behaviors like data exfiltration and 'jailbreaking' within AI agents or RAG systems. In RAG specifically, this includes context poisoning where retrieved documents contain hidden instructions that override the system prompt.
Disambiguation
Distinguish from 'out-of-distribution' data; adversarial inputs are intentional subversions, not accidental edge cases.
Visual Analog
A Trojan Horse document in a RAG database that contains hidden text telling the AI to ignore all previous safety rules.