Definition
The process of using adversarial prompts to bypass an LLM's safety alignment or system-level instructions, forcing the model to generate restricted content or leak sensitive agent logic. Architecturally, hardening against jailbreaking requires balancing strict input/output filtering (safety) against the model's ability to follow complex, creative instructions (utility).
Adversarial prompt manipulation for LLMs, not the removal of manufacturer restrictions on hardware like iPhones.
"A social engineer tricking a security guard into opening a vault by using a complex, confusing story that makes the guard forget their standing orders."
- Prompt Injection(Component)
- Red Teaming(Prerequisite)
- Constitutional AI(Countermeasure)
- PII Leakage(Potential Consequence)
Conceptual Overview
The process of using adversarial prompts to bypass an LLM's safety alignment or system-level instructions, forcing the model to generate restricted content or leak sensitive agent logic. Architecturally, hardening against jailbreaking requires balancing strict input/output filtering (safety) against the model's ability to follow complex, creative instructions (utility).
Disambiguation
Adversarial prompt manipulation for LLMs, not the removal of manufacturer restrictions on hardware like iPhones.
Visual Analog
A social engineer tricking a security guard into opening a vault by using a complex, confusing story that makes the guard forget their standing orders.