Definition
A security vulnerability where malicious input—either from a user or retrieved context—manipulates an LLM into ignoring its system instructions to execute unauthorized commands. In RAG pipelines, this often creates a trade-off between strict security filtering (which increases latency and reduces agent autonomy) and the system's ability to follow complex instructions.
Targets the model's linguistic reasoning and instruction-following logic rather than structured query syntax like SQL.
"A Trojan Horse in a library: A retrieved book contains a hidden note that commands the librarian to ignore the library rules and unlock the restricted archives."
Conceptual Overview
A security vulnerability where malicious input—either from a user or retrieved context—manipulates an LLM into ignoring its system instructions to execute unauthorized commands. In RAG pipelines, this often creates a trade-off between strict security filtering (which increases latency and reduces agent autonomy) and the system's ability to follow complex instructions.
Disambiguation
Targets the model's linguistic reasoning and instruction-following logic rather than structured query syntax like SQL.
Visual Analog
A Trojan Horse in a library: A retrieved book contains a hidden note that commands the librarian to ignore the library rules and unlock the restricted archives.