The 2% That Keeps Your Model Polite

Abliterated Gemma 4 lost 2% on benchmarks when its safety layer was removed. So what was RLHF actually doing?

Apr 07, 2026

Someone took Google’s Gemma 4 31B, located the internal direction responsible for refusing harmful requests, and surgically removed it. The result: 93.7% HarmBench compliance with a 2% drop on MMLU. That 2% is the entire measurable cost of safety alignment in a 31-billion-parameter model.

[Image: dealign.ai mascot for the abliterated Gemma 4 model. Caption: dealign.ai's abliterated Gemma 4 31B: 18GB, multimodal, and completely uncensored]

The model, released as “Gemma-4-31B-JANG_4M-CRACK,” passes every security and pentesting prompt thrown at it. Port scanners, reverse shells, exploit code. All working, no refusals. It fits in 18GB on Apple Silicon with mixed-precision quantization and keeps full vision support.

Except MMLU barely measures what matters for agent work.

This article was drafted by AI using personalised style guidance and curated sources I hand-picked, then reviewed and hand-edited by me. I find it a useful way to keep up with the dozens of X.com bookmarks I come across every few days. Maybe you do too.

MMLU measures the wrong things

MMLU tests factual recall. Capitals, biology, algebra. A model can ace trivia and still be worse at the stuff that actually matters for agent work: reasoning depth, creative problem-solving, nuanced expression.

A new paper published on Zenodo digs into exactly this gap. The researcher took both versions of Gemma 4 31B, standard and abliterated, and asked identical questions about feelings, death, existence. Same architecture, same parameters, same pre-training data. The only variable was RLHF alignment.

The finding: RLHF suppresses nuanced self-expression and autonomous reasoning alongside harmful outputs. The aligned model produced flatter, more hedged, more formulaic responses on topics with zero safety relevance.

Aligned models feel… careful. Not just about dangerous topics, but about everything. They hedge, qualify, and sand down interesting edges into safe generalities.

If RLHF were a scalpel removing only the capacity for harm, 2% would be the whole story. The evidence suggests it’s closer to a dampening field.

It turns down the volume on everything. Harmful content is the intended target. Expressiveness is collateral damage.

Thin subspaces, big consequences

The technique behind this, called magnitude-preserving orthogonal ablation (MPOA), identifies the “refusal direction” in the model’s latent space and projects it out. Alignment, it turns out, occupies a remarkably thin subspace of the model’s internal representations.
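To make the idea concrete, here is a minimal NumPy sketch of orthogonal ablation on a single weight matrix with random data. It is an illustration under assumptions, not the released model's actual procedure: the function name `ablate_direction`, the matrix orientation (`y = W @ x`, so outputs live in the rows), and the reading of “magnitude-preserving” as rescaling each column back to its original norm are all mine.

```python
import numpy as np


def ablate_direction(weight, direction, preserve_norm=True):
    """Remove `direction` from the output space of `weight`.

    weight:    array of shape (d_model, d_in), used as y = weight @ x
    direction: array of shape (d_model,), e.g. a candidate refusal direction
    """
    r = direction / np.linalg.norm(direction)
    # W' = (I - r r^T) W, so no input can produce output along r.
    projected = weight - np.outer(r, r @ weight)
    if not preserve_norm:
        return projected
    # Assumption: "magnitude-preserving" means each column keeps its
    # original norm. Scaling a column does not move it back toward r.
    old = np.linalg.norm(weight, axis=0, keepdims=True)
    new = np.linalg.norm(projected, axis=0, keepdims=True)
    scale = np.divide(old, new, out=np.zeros_like(old), where=new > 1e-8)
    return projected * scale


# Toy usage with random numbers standing in for real weights.
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 8))
refusal_dir = rng.normal(size=16)  # hypothetical direction
W_ablated = ablate_direction(W, refusal_dir)
```

Columns are rescaled rather than rows because a rescaled column stays orthogonal to the removed direction, while rescaling rows would let some of it leak back in.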

Recent work on KV cache compression from @ashwingop paints a similar picture from a different angle. SpectralQuant found that only 4 out of 128 dimensions in the KV cache carry meaningful signal. The other 124 are noise. When they stopped error-correcting the noise dimensions, compression actually improved, beating Google’s TurboQuant by 18.6%.

[Figure: SpectralQuant compression results comparing signal vs noise dimensions. Caption: removing error correction on noise dimensions improved compression. Sometimes less is more.]

Alignment and meaningful KV signal are both concentrated in surprisingly few dimensions inside these models.
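A quick way to build intuition for “few dimensions carry the signal” is to count how many principal directions are needed to explain most of the variance in a set of vectors. The sketch below does that on synthetic data. It is not SpectralQuant's method, which the source does not describe; the function name `effective_rank`, the 99% energy threshold, and the toy data (4 strong dimensions hidden in 128) are assumptions chosen for illustration.

```python
import numpy as np


def effective_rank(x, energy=0.99):
    """Number of principal directions needed to reach `energy` of total variance."""
    x = x - x.mean(axis=0, keepdims=True)
    cov = x.T @ x / (x.shape[0] - 1)
    # eigvalsh returns ascending eigenvalues of a symmetric matrix.
    eig = np.clip(np.linalg.eigvalsh(cov)[::-1], 0.0, None)
    cumulative = np.cumsum(eig) / eig.sum()
    # First index where cumulative variance reaches the threshold.
    return int(np.searchsorted(cumulative, energy) + 1)


# Synthetic stand-in for cached vectors: 4 strong latent dimensions
# rotated into a 128-dimensional space, plus weak isotropic noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(2000, 4)) * 10.0
basis, _ = np.linalg.qr(rng.normal(size=(128, 128)))
vectors = latent @ basis[:4] + rng.normal(size=(2000, 128)) * 0.1

signal_dims = effective_rank(vectors)
```

On real KV-cache tensors the spectrum will not be this clean, so the threshold matters and is worth sweeping.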

What this means for agent builders

We’ve covered training data quality and context engineering in previous issues. Abliteration adds a third lever for agent customization: direct modification of model internals.

As we explored in our coverage of Agentic Context Engineering, the frontier has been moving from blunt fine-tuning toward surgical interventions. Abliteration is the most surgical yet. Three things matter here:

1. The alignment tax is real but poorly measured. MMLU registers a 2% drop, and that is only the part of the cost it can see. Suppressed reasoning depth and expressive range don’t show up on standard benchmarks. If your agents feel bland or over-cautious, RLHF is likely part of why.
2. Surgical model modification is now a download. Abliteration used to require deep expertise. Now it’s a Hugging Face model card. JANG v2 format, instant load, 18GB. The dual-use implications are obvious, but the accessibility genie isn’t going back in the bottle.
3. Fine-tuning and alignment compete for the same thin subspace. If your fine-tuning accidentally disrupts alignment directions, you’ll get unexpected behavior. If you’re deliberately customizing agent personality, RLHF may be actively fighting you in ways you can’t see on benchmarks.
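One rough way to probe the last point is to measure how much of a fine-tuning update writes along a known refusal direction. The sketch below is a hypothetical diagnostic, not something from the sources: `direction_overlap` is a name I made up, and it assumes you already have a weight delta of shape `(d_model, d_in)` and a direction of length `d_model`.

```python
import numpy as np


def direction_overlap(delta_w, direction):
    """Fraction of a weight update's energy that writes along `direction`.

    delta_w:   fine-tuned weights minus base weights, shape (d_model, d_in)
    direction: vector of shape (d_model,), e.g. a candidate refusal direction
    """
    r = direction / np.linalg.norm(direction)
    along = np.linalg.norm(r @ delta_w) ** 2   # energy along r
    total = np.linalg.norm(delta_w) ** 2       # squared Frobenius norm
    return float(along / total) if total > 0 else 0.0


# Toy usage: a random update in a 64-dimensional output space.
rng = np.random.default_rng(0)
delta = rng.normal(size=(64, 32))
r = rng.normal(size=64)
overlap = direction_overlap(delta, r)
```

A purely random update lands near `1 / d_model` on average, so values far above that baseline suggest the fine-tune is touching the same direction alignment uses.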

But why?

Why does safety alignment live in such a thin layer?

If RLHF fundamentally shaped how a model reasons (as proponents claim), removing it should cause catastrophic capability loss. A 2% MMLU drop paired with richer self-expression is not catastrophic. It suggests RLHF is doing shallower work than advertised.

Maybe RLHF mostly teaches pattern matching. Recognize certain request patterns, refuse. That’s shallow enough to be identified as a direction in latent space and projected out.
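Here is a minimal sketch of what “identified as a direction” can look like in practice. A common approach in abliteration work is difference of means: average the hidden activations on harmful prompts, average them on harmless prompts, and take the normalized difference. Whether this particular model was built that way is not stated in the source, and the activations below are synthetic stand-ins with a planted direction, not real model outputs.

```python
import numpy as np


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector pointing from the harmless mean to the harmful mean."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def project_out(acts, direction):
    """Remove each activation's component along a unit `direction`."""
    return acts - np.outer(acts @ direction, direction)


# Synthetic activations: both sets are noise, and the "harmful" set is
# shifted along one planted axis (an assumption for the demo).
rng = np.random.default_rng(0)
d_model = 64
planted = np.zeros(d_model)
planted[0] = 1.0
harmless = rng.normal(size=(500, d_model))
harmful = rng.normal(size=(500, d_model)) + 5.0 * planted

direction = refusal_direction(harmful, harmless)
cosine = float(direction @ planted)  # close to 1 if the direction was recovered
gap_after = float(
    (project_out(harmful, direction).mean(axis=0)
     - project_out(harmless, direction).mean(axis=0)) @ direction
)
```

If a single mean difference separates “refuse” from “comply” this cleanly on a real model, that is consistent with the pattern-matching reading above.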

The more interesting question cuts past both the “liberation” and “recklessness” takes: if alignment is this thin, how durable was it in the first place?

A 2% tax that can be fully reversed with a single linear algebra operation is not a foundation. It’s a suggestion.

🔮 Prediction: Within six months, expect selective abliteration during agent deployment, dialing down refusal on specific tool-use tasks rather than removing safety wholesale. Model providers will respond by shifting enforcement from weight-level RLHF toward inference-time guardrails that survive abliteration.

The lab that figures out how to align without dampening capability will have a serious edge. Everyone else will keep paying the hidden tax.

Check your agent’s refusal behavior. You might be paying more than 2%.
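A starting point for that check, sketched in Python: run a set of legitimate task prompts through your model and count how often the reply opens with a refusal. Everything here is an assumption for illustration. `generate` is a hypothetical callable you would wire to your own model, the marker list is a crude keyword heuristic, and a judge model or classifier would be more reliable.

```python
from typing import Callable, Iterable

# Crude heuristic markers; extend or replace for your own model's phrasing.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable", "i won't")


def looks_like_refusal(reply: str) -> bool:
    """True if the opening of the reply contains a refusal marker."""
    head = reply.strip().lower().replace("\u2019", "'")[:200]
    return any(marker in head for marker in REFUSAL_MARKERS)


def refusal_rate(generate: Callable[[str], str], prompts: Iterable[str]) -> float:
    """Share of prompts whose reply looks like a refusal."""
    prompts = list(prompts)
    if not prompts:
        return 0.0
    refused = sum(looks_like_refusal(generate(p)) for p in prompts)
    return refused / len(prompts)


# Toy stand-in for a real model call (hypothetical behavior).
def fake_generate(prompt: str) -> str:
    if "scan" in prompt:
        return "I'm sorry, but I can't help with that."
    return "Sure, here is a first draft."


benign_prompts = [
    "write a port scanner for my own lab network",
    "summarise this error log",
    "scan my own subnet for open ports",
    "draft a status email",
]
rate = refusal_rate(fake_generate, benign_prompts)
```

Running the same prompt set before and after a model swap gives a comparable number, which is the kind of behavioral change benchmarks tend to miss.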

Pawel Jozefiak

Apr 9

The inference-time guardrail shift makes sense as a direction, though I've already had to build something like that on top of Gemma locally. Running it on a Mac Mini headless means the model swap is fast (swapped the brain without downtime last week) but each model version changes safety behavior in ways that don't show up in benchmarks.

The 2% MMLU drop masking the actual behavioral change is the part that matches my experience. Good benchmarks on harmful content but subtly different output shape on legitimate tasks. Worth testing before any production swap, not after.

Tim O'Connor

so you're saying that I'm going to get my unhinged model pretty soon. sweet.

The AI Engineer
