Discussion about this post

Pawel Jozefiak

The inference-time guardrail shift makes sense as a direction, though I've already had to build something like that on top of Gemma locally. Running it on a Mac Mini headless means the model swap is fast (swapped the brain without downtime last week) but each model version changes safety behavior in ways that don't show up in benchmarks.

The 2% MMLU drop masking the actual behavioral change is the part that matches my experience: good benchmark scores on harmful content, but a subtly different output shape on legitimate tasks. Worth testing before any production swap, not after.

Tim O'Connor

so you're saying that I'm going to get my unhinged model pretty soon. sweet.
