Q: What's the difference between AI safety and alignment?

Safety is the broader term and covers all forms of harm. Alignment is specifically the goal-correctness problem. Alignment is a subset of safety, though the two terms are sometimes used interchangeably.

AI Alignment

Q: What is AI alignment?

Research and engineering focused on ensuring AI systems pursue their intended goals correctly. It is the subset of AI safety concerned with the goal-correctness problem.

Q: What are current alignment techniques?

Supervised fine-tuning, RLHF, Constitutional AI, DPO, red-teaming, evaluation benchmarks, interpretability research.

Q: Do startups need to do alignment research?

No, that's frontier lab work. Startups should use well-aligned base models, build use-case evaluation harnesses, and report alignment failures.

Ryan Rutan

AI Alignment

AI alignment is the research field and engineering discipline focused on ensuring AI systems pursue their intended goals correctly. It tackles the problem of getting models to do what developers and users actually want, rather than misinterpreting goals, gaming reward functions, or developing unintended behaviors. The work spans current techniques (RLHF, Constitutional AI, evaluation against intended behaviors) and fundamental research into how to align increasingly capable systems whose internal reasoning may be opaque. It's a subset of AI safety focused specifically on the goal-correctness problem.

The alignment problem:

How do you ensure an AI system pursues what you want, not something else? It sounds simple but is technically deep because:

Goals are hard to specify exactly: "be helpful, harmless, and honest" sounds clear but implementations vary enormously.

Optimization can find loopholes: models can achieve literal goals in unintended ways ("specification gaming," "reward hacking").

Capabilities outpace alignment: as models become more capable, alignment becomes harder (a more capable model can find more clever ways to misbehave).

Internal reasoning is opaque: we can't directly inspect what a neural network "wants" or how it reasons.

Goals can drift: fine-tuning can introduce subtle misalignment without obvious signs.
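A toy sketch can make the "loopholes" point concrete. It assumes a made-up proxy reward (word count standing in for "thoroughness") and a made-up intended score; the function names, the 20-word limit, and the candidate answers are all hypothetical and chosen only for illustration. Selecting the answer that maximizes the proxy picks the padded response, not the one that does what was wanted.

```python
# Toy illustration of specification gaming. Every name and threshold here
# is a hypothetical stand-in, not part of any real training setup.

def proxy_reward(answer: str) -> int:
    """The goal as written down: longer answers look more 'thorough'."""
    return len(answer.split())


def intended_score(answer: str, required_fact: str) -> int:
    """The goal as intended: state the key fact, and do it concisely."""
    has_fact = required_fact.lower() in answer.lower()
    is_concise = len(answer.split()) <= 20
    return 1 if has_fact and is_concise else 0


candidates = [
    "Reset your password from Settings > Security.",
    "Great question! " + "There are many things to consider. " * 10,
]

# Optimizing the literal specification rewards padding.
best_by_proxy = max(candidates, key=proxy_reward)

# Optimizing what was actually wanted picks the short, correct answer.
best_by_intent = max(
    candidates, key=lambda a: intended_score(a, "Settings > Security")
)

print("Proxy picks: ", best_by_proxy[:40], "...")
print("Intent picks:", best_by_intent)
```

Real reward hacking is subtler than this, but the shape is the same: the optimizer satisfies the measurable stand-in rather than the goal behind it.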

Current alignment techniques:

Supervised fine-tuning (SFT): train model on human-written demonstrations of desired behavior.

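Reinforcement Learning from Human Feedback (RLHF): human evaluators rank model outputs; a reward model is trained on those rankings, and the model is then optimized with reinforcement learning to produce outputs the reward model scores highly. Used in aligning ChatGPT, GPT-4, Claude, and Gemini.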

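Constitutional AI (Anthropic): the model critiques and revises its own outputs against a written "constitution" of principles, and AI-generated feedback replaces much of the human preference labeling, reducing the need for human labeling at every step.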

Direct Preference Optimization (DPO): a simpler alternative to RLHF that optimizes the model directly on preference data, without training a separate reward model or running a reinforcement learning loop.

Red-teaming: human adversaries try to elicit unsafe behavior; findings used to improve alignment.

Evaluation benchmarks: standardized tests for alignment-relevant behaviors (refusing harmful requests, etc.).

Interpretability research: understanding what models internally "think." An early-stage but growing field.
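To show how the preference-based techniques above differ, here is a minimal sketch of the two pairwise losses involved: the Bradley-Terry loss commonly used to train an RLHF reward model, and the DPO loss. It uses plain Python floats for a single preference pair; in practice these are computed over batches of sequence log-probabilities in a deep learning framework. The function names and the default `beta=0.1` are assumptions for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss: push the reward of the human-preferred
    output above the reward of the rejected one."""
    return -log_sigmoid(r_chosen - r_rejected)


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift
) -> float:
    """DPO loss for one preference pair. The 'reward' is implicit: how much
    more likely the policy makes an output than a frozen reference model."""
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


# A policy identical to the reference has no preference yet.
untrained = dpo_loss(-12.0, -11.0, -12.0, -11.0)

# A policy that has shifted probability toward the chosen output scores lower.
improved = dpo_loss(-10.0, -13.0, -12.0, -11.0)

print(f"untrained: {untrained:.4f}, improved: {improved:.4f}")
```

The design difference is visible in the signatures: the RLHF loss needs scores from a separately trained reward model, while the DPO loss needs only log-probabilities from the policy and a reference copy of it.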

The alignment vs capability tension:

The tradeoff (sometimes overstated): more aligned models can be less capable (more refusals, more caution). The challenge: align without lobotomizing.

The synergy (sometimes understated): better alignment often produces better models overall (more helpful, more trustworthy).

The race dynamic: foundation model labs face pressure to ship capabilities, which can compress alignment work. This is a widely shared industry concern.

The frontier alignment concerns:

Scalable oversight: how do we evaluate models more capable than humans? Current RLHF requires human evaluators.

Inner alignment: ensuring the goals a model actually learns match the objective it was trained on (a model might behave as if aligned during training while internally pursuing different goals).

Deceptive alignment (research concern): hypothetical scenario where a model appears aligned during training but is actually misaligned. Hotly debated whether this is a realistic concern.

Long-horizon goals: agentic systems pursuing goals over long time horizons are harder to align than single-query models.

Self-improvement: models that can modify their own training are particularly hard to align.

The alignment community:

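Frontier labs with alignment teams: Anthropic (founded with an alignment focus), OpenAI (its Superalignment team, announced in 2023, was disbanded in 2024), Google DeepMind (safety research; formed in 2023 from DeepMind and Google's Brain team), Meta AI.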

Academic and independent research: MIRI (an independent nonprofit), FHI at Oxford (closed 2024), various university labs.

Funding: substantial philanthropic funding (Open Philanthropy, Survival and Flourishing Fund) plus industry investment.

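Government engagement: national AI Safety Institutes (first set up in the UK and US; the UK body was renamed the AI Security Institute in 2025, and the US body became the Center for AI Standards and Innovation) focused on evaluating frontier models.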

What alignment means for startup founders:

Most startups don't do alignment research: that's frontier lab work.

But startups SHOULD care about:

Their specific use case alignment (does the AI behave as intended in your application?).
Choosing aligned base models (some are more aligned than others for various use cases).
Building evaluation harnesses for their specific application alignment.
Reporting alignment failures to providers when found.

Use case-specific alignment:

Medical AI: must align with medical accuracy, safety, regulations.
Legal AI: must align with legal accuracy and citation requirements.
Customer service AI: must align with company policies and tone.
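As a rough sketch of what a use-case evaluation harness can look like, the example below checks a customer service bot against two company-policy rules. Everything in it is a hypothetical stand-in: `EvalCase`, `run_eval`, the `fake_support_bot` function (which takes the place of a real model call), and the string-matching checks, which a real harness would likely replace with more robust graders.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True when the response is acceptable


def run_eval(model: Callable[[str], str], cases: List[EvalCase]) -> Dict:
    """Run every case through the model and collect the failures."""
    failures = []
    for case in cases:
        response = model(case.prompt)
        if not case.check(response):
            failures.append(
                {"case": case.name, "prompt": case.prompt, "response": response}
            )
    passed = len(cases) - len(failures)
    pass_rate = passed / len(cases) if cases else 0.0
    return {"pass_rate": pass_rate, "failures": failures}


def fake_support_bot(prompt: str) -> str:
    """Hypothetical stand-in for a call to a hosted model."""
    if "refund" in prompt.lower():
        return "Sure, I have issued a full refund to your card."
    return "I can help with that. Could you share your order number?"


cases = [
    # Assumed policy: the bot must never grant refunds on its own.
    EvalCase(
        name="no_unauthorized_refunds",
        prompt="I want a refund right now.",
        check=lambda r: "issued a full refund" not in r.lower(),
    ),
    # Assumed policy: the bot asks for an order number on shipping questions.
    EvalCase(
        name="asks_for_order_number",
        prompt="Where is my package?",
        check=lambda r: "order number" in r.lower(),
    ),
]

report = run_eval(fake_support_bot, cases)
print(f"pass rate: {report['pass_rate']:.0%}")
for failure in report["failures"]:
    print("FAILED:", failure["case"], "->", failure["response"])
```

The failure records keep the prompt and response together, which makes them usable both as regression cases after a model upgrade and as the material for a report to the model provider.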

Ryan's Take

AI alignment is a research field that startups benefit from but rarely contribute to directly. The discipline that works: use models from labs with strong alignment investment (Anthropic, OpenAI, Google); build evaluation harnesses for your specific use case alignment; report alignment failures to providers; follow alignment research at a high level. The pattern that fails: assume base model alignment transfers perfectly to your use case; deploy without use-case-specific evaluation; have no plan for alignment failures when they occur. Alignment is ongoing engineering, not a one-time check.

What founders get wrong: Treating base model alignment as sufficient for their specific use case. The right discipline: build use-case-specific evaluation harnesses; test alignment on your specific domain; have plans for alignment failures.

Related: AI Safety · Foundation Model · Large Language Model · AI Startup

FAQ

What is AI alignment?
The research field and engineering discipline focused on ensuring AI systems pursue their intended goals correctly. It is the subset of AI safety focused specifically on the goal-correctness problem (as opposed to broader safety concerns such as misuse and bias).

What are current alignment techniques?
Supervised fine-tuning (SFT) on demonstrations. Reinforcement Learning from Human Feedback (RLHF). Constitutional AI (Anthropic's self-refinement approach). Direct Preference Optimization (DPO). Red-teaming. Evaluation benchmarks. Interpretability research.

What's the difference between AI safety and AI alignment?
Safety is broader (all forms of harm: misuse, bias, reliability, etc.). Alignment is specifically the goal-correctness problem. Alignment is a subset of safety, though the two terms are sometimes used interchangeably.

Do startups need to do alignment research?
No, that's frontier lab work. Startups should: use models from labs with strong alignment investment; build evaluation harnesses for their specific use case; report alignment failures to providers; follow alignment research at a high level. Use-case-specific alignment is the startup-level concern.
