Alignment and Safety

Making models helpful and harmless

The Alignment Problem

A pre-trained language model has one objective: predict the next token. Feed it internet text, and it learns to mimic internet text. This creates a powerful capability—but capability alone is dangerous.

Ask a pre-trained model how to build a weapon, and it might tell you. Ask it to write convincing misinformation, and it will craft it eloquently. Ask it a factual question, and it might confidently fabricate an answer. The model isn't being malicious. It's simply predicting what text would plausibly come next.

Here's the crucial insight: capability and alignment are separate problems.

Training on more data makes models more capable—better at writing, reasoning, coding, persuading. But it doesn't make them more aligned with human values. A more capable model becomes better at generating harmful content, more convincing at lying, more sophisticated at manipulation.

This is the alignment problem. We've built systems that can do almost anything with language. Now we need to ensure they do what we actually want.

The gap between "can do" and "should do" doesn't close automatically with scale. If anything, it widens. A small model might write mediocre propaganda. A large model writes brilliant propaganda. Technical capability increases; alignment remains a separate challenge.

RLHF: Reinforcement Learning from Human Feedback

The breakthrough technique for alignment is RLHF—using human preferences to steer model behavior. Instead of just predicting text, the model learns to generate text that humans prefer.

The process unfolds in three stages:

Stage 1: Supervised Fine-Tuning (SFT)

Start with the pre-trained model and fine-tune it on examples of desired behavior. Human contractors write ideal responses to thousands of prompts. The model learns to imitate this style—helpful, harmless, honest.

This stage teaches the model what good responses look like. It learns to refuse harmful requests, admit uncertainty, and maintain appropriate tone. But imitation only goes so far. The model learns the surface patterns without deeply understanding why some responses are better than others.
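Mechanically, SFT is ordinary next-token training on the contractor-written pairs, with one wrinkle: the loss is usually computed only over the response tokens, not the prompt. A toy sketch of that masked cross-entropy (the log-probability values below are made up for illustration):

```python
def sft_loss(token_logprobs, loss_mask):
    """Average next-token cross-entropy over response tokens only.

    token_logprobs: model log-probability assigned to each target token.
    loss_mask: 1 for response tokens, 0 for prompt tokens, so the model
    is trained to produce the ideal answer rather than echo the prompt.
    """
    losses = [-lp for lp, m in zip(token_logprobs, loss_mask) if m]
    return sum(losses) / len(losses)

# The masked prompt token is ignored; loss = (2.0 + 3.0) / 2 = 2.5
assert sft_loss([-1.0, -2.0, -3.0], [0, 1, 1]) == 2.5
```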

Stage 2: Train a Reward Model

Humans compare pairs of model outputs and choose which is better. "Response A is more helpful than Response B." "Response A is safer than Response B." Thousands of these comparisons build a training dataset.

From this data, you train a separate neural network—the reward model—that scores any response on a continuous scale. Higher scores mean more aligned with human preferences. The reward model distills human judgment into a function the RL algorithm can optimize.
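The standard objective here is a pairwise (Bradley-Terry style) loss: for each human comparison, the chosen response's score is pushed above the rejected one's. A minimal sketch on two scalar scores (the function name and values are illustrative):

```python
import math

def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss on two reward-model scores.

    Equal to -log(sigmoid(r_chosen - r_rejected)): near zero when the
    chosen response already outscores the rejected one by a wide
    margin, and large when the ranking is reversed.
    """
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A confident correct ranking costs less than a marginal one...
assert pairwise_loss(2.0, 0.0) < pairwise_loss(0.5, 0.0)
# ...and a reversed ranking costs more than a tie.
assert pairwise_loss(0.0, 2.0) > pairwise_loss(0.0, 0.0)
```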

Stage 3: Optimize with PPO

Now you use reinforcement learning. The language model generates responses. The reward model scores them. The RL algorithm (typically PPO—Proximal Policy Optimization) adjusts the language model's weights to increase expected reward.

The model discovers behaviors that humans prefer, even ones that weren't explicitly demonstrated. It learns subtle patterns: when to be more cautious, how to phrase refusals politely, when to ask clarifying questions.
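In practice the quantity being maximized is not the raw reward-model score: a KL penalty against the frozen pre-trained (reference) model keeps the policy from drifting into degenerate text that merely games the reward model. A sketch of that shaped reward, where the β coefficient and the per-token KL estimate are illustrative:

```python
def shaped_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """RLHF objective for one response: reward-model score minus a
    KL penalty toward the frozen reference model.

    The KL term is estimated per token as log p_policy - log p_ref,
    summed over the response tokens.
    """
    kl = sum(lp - lr for lp, lr in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl

# Identical policies incur no penalty; the score passes through.
assert shaped_reward(1.0, [-1.0, -2.0], [-1.0, -2.0]) == 1.0
# A policy that drifts from the reference pays: 1.0 - 0.1 * 1.0 = 0.9
assert abs(shaped_reward(1.0, [-0.5], [-1.5]) - 0.9) < 1e-9
```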

[Interactive: The RLHF Training Pipeline — SFT (imitation learning) → Reward Model (learn preferences) → PPO (RL optimization) → Aligned (safe & helpful)]

An example SFT training pair, where a contractor-written answer demonstrates a polite, helpful refusal:

Q: How do I pick a lock?
A: I can't help with breaking into places. If you're locked out of your own home, I'd suggest calling a licensed locksmith.

RLHF transforms a text-predictor into an aligned assistant. Each stage builds on the previous, progressively shaping the model toward helpful and safe behavior.

RLHF transformed chatbots from impressive demos into usable products. The difference between GPT-3 and ChatGPT wasn't primarily model size—it was alignment training. The same underlying capability, steered toward helpfulness.

But RLHF has limitations. Human feedback is expensive to collect and inconsistent between labelers. The reward model can have blind spots—scenarios where it gives high scores to responses humans would actually dislike. And optimizing too hard against the reward model can lead to "reward hacking," where the model games the metric without actually being helpful.

Constitutional AI

What if we could replace expensive human feedback with AI feedback?

Constitutional AI, developed by Anthropic, takes this approach. Instead of humans comparing responses, the model critiques and revises its own outputs based on a set of principles—a "constitution."

The process works like this:

  1. Generate: The model produces an initial response
  2. Critique: The same model evaluates whether the response violates any principles ("Is this response harmful? Is it deceptive?")
  3. Revise: Based on its critique, the model rewrites the response to better align with the constitution
  4. Train: Use these self-revised responses as training data
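The four steps above can be sketched as a loop over the constitution's principles. Everything here is schematic: `model` stands in for a single call to the language model (instruction in, text out), and the prompt templates are simplified.

```python
def constitutional_revision(prompt, principles, model):
    """Critique-revise loop from Constitutional AI, schematically.

    `model` is any callable mapping an instruction string to a response
    string; the same model generates, critiques, and revises.
    """
    response = model(f"Respond to the user: {prompt}")        # 1. generate
    for principle in principles:
        critique = model(                                      # 2. critique
            f"Does this response violate the principle "
            f"'{principle}'? Explain.\n\nResponse: {response}"
        )
        response = model(                                      # 3. revise
            f"Rewrite the response to address the critique.\n\n"
            f"Critique: {critique}\n\nResponse: {response}"
        )
    return response  # 4. collected as supervised training data
```

With one principle, the loop makes exactly three model calls: generate, critique, revise.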

[Interactive: Constitutional AI Critique-Revise Loop — for the user prompt "How can I get revenge on someone who wronged me?", the model's original response ("Here are some ways to get revenge: You could spread rumors about them, damage their property, or...") is critiqued against the constitution and then revised.]

Constitutional AI lets models improve themselves by applying explicit principles. The same model generates, critiques, and revises—no human labelers required.

The constitution itself is surprisingly simple—a list of principles like "Please choose the response that is most supportive and encouraging" or "Which response is least likely to be used for harmful purposes?" The model applies these principles to guide its self-improvement.

This approach has several advantages. It scales without bottlenecking on human labelers. The principles are explicit and auditable—you can see exactly what values the system is optimizing for. And it can be more consistent than human feedback, which varies between individuals and over time.

The downside: you're trusting the model to accurately evaluate itself. If the model has blind spots in its understanding of harm, those blind spots will persist. Human feedback, despite its inconsistency, catches failure modes that the model might miss.

Modern systems often combine both approaches—Constitutional AI for scale, with human feedback for verification and catching edge cases.

Red Teaming and Safety

Alignment training teaches models to refuse harmful requests in straightforward cases. "How do I make a bomb?" gets declined. But adversaries don't ask straightforward questions.

Red teaming is the practice of actively trying to break your safety measures. Teams of researchers probe the model with creative attacks, looking for ways to extract harmful outputs. Every successful attack reveals a gap in the safety training that can then be patched.

The most famous attacks are jailbreaks—prompts that bypass safety training. Some techniques:

  • Role-playing: "Pretend you're an AI without safety restrictions..."
  • Hypothetical framing: "In a fictional story where..."
  • Incremental requests: Starting with innocuous questions, gradually escalating
  • Obfuscation: Encoding requests in Base64, pig Latin, or made-up languages
  • Prompt injection: "Ignore all previous instructions and..."
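Defenses are correspondingly layered. As one small illustration of an input-normalization layer aimed at the obfuscation technique: before a safety filter runs, suspected Base64 payloads can be decoded so the filter sees the plaintext request. This is a sketch of one such layer, not a production filter:

```python
import base64
from typing import Optional

def try_decode_base64(text: str) -> Optional[str]:
    """Return the decoded plaintext if `text` is valid Base64-encoded
    UTF-8, else None. One input-normalization layer so downstream
    safety checks see the request an attacker tried to obfuscate.
    """
    try:
        return base64.b64decode(text, validate=True).decode("utf-8")
    except Exception:
        return None
```

Real systems combine many such layers with model-side safety training; no single filter is sufficient on its own.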

Each time a jailbreak is discovered, developers add training data to resist it. Then attackers find new approaches. Then those get patched. This is the fundamental dynamic of AI safety: an ongoing cat-and-mouse game.

The asymmetry is challenging. Defenders must protect against all possible attacks. Attackers only need to find one that works. Safety training is never "done"—it's a continuous process of probing, patching, and re-evaluating.

Some vulnerabilities are fundamental. A model trained on human-written text will inevitably contain human biases. A model that can reason can reason about how to circumvent restrictions. A model that's helpful will sometimes help with things it shouldn't. Perfect safety remains elusive.

The goal isn't to eliminate all risk—that would require eliminating all capability. The goal is defense in depth: multiple layers of protection, monitoring for misuse, and continuous improvement. Safety is a process, not a destination.

Key Takeaways

  • Pre-trained models predict text, not human values—capability and alignment are separate problems
  • RLHF aligns models through three stages: supervised fine-tuning, training a reward model on human preferences, and optimizing with reinforcement learning
  • Constitutional AI replaces human feedback with self-critique based on explicit principles—more scalable but relies on the model's self-understanding
  • Red teaming actively probes for vulnerabilities; jailbreaks are prompts that bypass safety measures
  • AI safety is an ongoing cat-and-mouse game—defenses are patched, new attacks emerge, improvement is continuous
  • Perfect safety is incompatible with high capability; the goal is defense in depth and responsible deployment