Jailbreaking

Jailbreaking

The process of using adversarial prompts to bypass an LLM's safety alignment or system-level instructions, forcing the model to generate restricted content or leak sensitive agent logic. Architecturally, hardening against jailbreaking requires balancing strict input/output filtering (safety) against the model's ability to follow complex, creative instructions (utility).

Definition

Disambiguation

Adversarial prompt manipulation for LLMs, not the removal of manufacturer restrictions on hardware like iPhones.

Visual Metaphor

"A social engineer tricking a security guard into opening a vault by using a complex, confusing story that makes the guard forget their standing orders."

Key Tools

NeMo GuardrailsLakera GuardMicrosoft PyRITPromptfooGarak

Related Connections

Prompt Injection(Component)
Red Teaming(Prerequisite)
Constitutional AI(Countermeasure)
PII Leakage(Potential Consequence)

Conceptual Overview

Disambiguation

Adversarial prompt manipulation for LLMs, not the removal of manufacturer restrictions on hardware like iPhones.

Visual Analog

A social engineer tricking a security guard into opening a vault by using a complex, confusing story that makes the guard forget their standing orders.

Definition

Conceptual Overview

Disambiguation

Visual Analog

Related Articles