Definition
The process of using adversarial prompts to bypass an LLM's safety alignment or system-level instructions, forcing the model to generate restricted content or leak sensitive agent logic. Architecturally, hardening against jailbreaking requires balancing strict input/output filtering (safety) against the model's ability to follow complex, creative instructions (utility).
Adversarial prompt manipulation for LLMs, not the removal of manufacturer restrictions on hardware like iPhones.
"A social engineer tricking a security guard into opening a vault by using a complex, confusing story that makes the guard forget their standing orders."
Conceptual Overview
The process of using adversarial prompts to bypass an LLM's safety alignment or system-level instructions, forcing the model to generate restricted content or leak sensitive agent logic. Architecturally, hardening against jailbreaking requires balancing strict input/output filtering (safety) against the model's ability to follow complex, creative instructions (utility).
Disambiguation
Adversarial prompt manipulation for LLMs, not the removal of manufacturer restrictions on hardware like iPhones.
Visual Analog
A social engineer tricking a security guard into opening a vault by using a complex, confusing story that makes the guard forget their standing orders.