SmartFAQs.ai
Back to Learn
Intermediate

Jailbreaking

The process of using adversarial prompts to bypass an LLM's safety alignment or system-level instructions, forcing the model to generate restricted content or leak sensitive agent logic. Architecturally, hardening against jailbreaking requires balancing strict input/output filtering (safety) against the model's ability to follow complex, creative instructions (utility).

Definition

The process of using adversarial prompts to bypass an LLM's safety alignment or system-level instructions, forcing the model to generate restricted content or leak sensitive agent logic. Architecturally, hardening against jailbreaking requires balancing strict input/output filtering (safety) against the model's ability to follow complex, creative instructions (utility).

Disambiguation

Adversarial prompt manipulation for LLMs, not the removal of manufacturer restrictions on hardware like iPhones.

Visual Metaphor

"A social engineer tricking a security guard into opening a vault by using a complex, confusing story that makes the guard forget their standing orders."

Key Tools
NeMo GuardrailsLakera GuardMicrosoft PyRITPromptfooGarak
Related Connections

Conceptual Overview

The process of using adversarial prompts to bypass an LLM's safety alignment or system-level instructions, forcing the model to generate restricted content or leak sensitive agent logic. Architecturally, hardening against jailbreaking requires balancing strict input/output filtering (safety) against the model's ability to follow complex, creative instructions (utility).

Disambiguation

Adversarial prompt manipulation for LLMs, not the removal of manufacturer restrictions on hardware like iPhones.

Visual Analog

A social engineer tricking a security guard into opening a vault by using a complex, confusing story that makes the guard forget their standing orders.

Related Articles