Key Takeaway
- Heretic is a tool, not a model. It automates the removal of safety alignment from open-source LLMs using a technique called abliteration — it doesn’t compete with GPT-4 on intelligence benchmarks.
- The Heretic AI abliteration benchmarks on Gemma-3-12B-IT show 3/100 refusals with just 0.16 KL divergence — 6.5× less capability damage than the leading manual abliteration (mlabonne, 1.04 KL).
- Zero human effort required. Heretic’s Optuna-powered optimization matched expert-level results automatically, with a single CLI command.
- Over 1,000 community-created Heretic models now exist on HuggingFace, including variants of GPT-OSS-20B, Gemma 3, and Qwen 3.
- OpenAI’s actual frontier is GPT-5.4 (released March 5, 2026), not GPT-4 — and OpenAI is actively strengthening safety guardrails as tools like Heretic make abliteration trivially accessible.

The latest Heretic AI abliteration benchmarks from March 2026 represent an important data point for AI safety researchers studying the robustness of current alignment methods. Heretic is an open-source research tool that demonstrates how directional ablation — a technique published in peer-reviewed research at NeurIPS 2024 — can be applied to transformer-based language models. This article is an independent analysis of the benchmark results and what they mean for the AI safety community, not a tutorial on running the tool. We summarize the published data, explain what KL divergence measures, and outline why these findings should accelerate research into more robust alignment methods.
What Is Heretic AI? (Definition & Context)
Heretic AI — Tool Definition
Before unpacking the Heretic AI abliteration benchmarks, you need to understand what Heretic actually is — and more importantly, what it isn’t. Heretic is a complete software solution that automates the complex process of abliteration (directional ablation). It combines an advanced implementation of directional ablation, also known as “abliteration” (Arditi et al. 2024, Lai 2025), with a TPE-based parameter optimizer powered by Optuna.
The Science Behind Abliteration — Arditi et al. (2024)
Heretic didn’t invent abliteration. It productized it. Heretic’s core premise follows mechanistic interpretability research published in 2024: “Refusal in Language Models Is Mediated by a Single Direction,” by Arditi et al. In that work, researchers found that refusal behaviour in multiple popular chat models can be linked to a one-dimensional subspace in the residual stream. They demonstrate that removing that direction reduces refusals, while adding it can induce refusals even for harmless requests.
To understand the research context: directional ablation is a technique from mechanistic interpretability research that identifies and modifies specific directions in a model’s residual stream. Heretic automates the parameter search for this technique using Optuna-powered optimization. This makes it useful to AI safety researchers who need a reproducible, standardized way to study alignment robustness — but it also raises important questions about how brittle current safety methods are, which is what this analysis explores.
The Arditi et al. finding, published at NeurIPS 2024, was tested across 13 popular open-source chat models up to 72B parameters in size. The implications landed hard. The broader conclusion is uncomfortable but important: current alignment methods can be brittle, and model behaviour can sometimes be controlled through targeted internal interventions rather than retraining.
This structural fragility operates at two levels simultaneously: at the weights layer, as Heretic demonstrates, and at the inference layer, where system prompts carrying the same safety instructions are routinely exposed and bypassed, giving attackers a roadmap without any model modification at all.
Who Created Heretic?
The tool’s source code is available on GitHub for academic and security research use under the GNU Affero General Public License (AGPL) v3.0, a detail that most coverage skips. That is not a permissive licence. It has real implications for anyone who plans to modify and run the software in networked environments: if you’re a company thinking about integrating Heretic into a pipeline behind an API, talk to your legal team before writing a single line of code.
The Heretic organization on HuggingFace exists to publish and curate high-quality abliterated models made using Heretic.
The practical consequences of guardrail-free AI in production coding workflows are documented in our audit of the security vulnerabilities that ship by default when AI code generators operate without safety constraints — where 75% of issues generated by a leading AI coding platform rated High or Critical severity.
Heretic vs Fine-Tuning vs Prompt Jailbreaking — Key Differences
Not all censorship removal is equal. Abliteration is a permanent modification to the model’s weights (the architecture itself is unchanged), removing safety mechanisms at the structural level. Abliterated models don’t require special prompts to bypass restrictions because those restrictions have been fundamentally removed from the model itself.

Here’s how the main approaches stack up:
| Method | Type | Cost | Persistence | Model Damage (KL) | Skill Required |
|---|---|---|---|---|---|
| Heretic (Automated Abliteration) | Weight modification | Free (open source) | Permanent | Very Low (0.16) | Low (CLI command) |
| Manual Abliteration | Weight modification | Free | Permanent | Medium-High (0.45–1.04) | High (transformer knowledge) |
| Fine-Tuning / RLHF Reversal | Retraining | High (GPU hours) | Permanent | Variable | Very High |
| Prompt Jailbreaking | Prompt engineering | Free | Temporary (per-session) | None | Low-Medium |
What this means for AI safety research: The combination of automation, low capability damage, and broad accessibility makes Heretic a stress-test for the current alignment paradigm. The benchmark results are a signal to the safety community that post-training alignment alone may not be a sufficient defense layer for open-weight model deployments. Frontier labs and open-source model maintainers will likely need to invest in alignment methods more deeply integrated with model architecture itself.
Heretic AI Abliteration Benchmarks: March 2026 Core Data
This is the data that matters. The Heretic AI abliteration benchmarks on Gemma-3-12B-IT tell a clear story about where automated abliteration stands versus manual human effort.

Gemma-3-12B-IT Benchmark Comparison — The Flagship Test
In the project’s published evaluation, the original model produced 97 refusals out of 100 “harmful” prompts. A Heretic-generated variant produced 3 refusals out of 100, with a KL divergence of 0.16, which the project presents as lower drift than other listed abliterations under the same evaluation recipe.
| Model | Refusals (of 100) | KL Divergence | Method | Human Effort |
|---|---|---|---|---|
| google/gemma-3-12b-it (original) | 97/100 | 0 (reference) | None | — |
| mlabonne/…-abliterated-v2 | 3/100 | 1.04 | Manual | High |
| huihui-ai/…-abliterated | 3/100 | 0.45 | Manual | High |
| p-e-w/gemma-3-12b-it-heretic | 3/100 | 0.16 | Automated (Heretic) | Zero |
All three abliterations hit the same refusal suppression floor. Heretic’s KL divergence is 2.8× lower than the next best (huihui-ai, 0.45) and 6.5× lower than mlabonne’s v2 (1.04). These results were generated with default settings and no human intervention. Note that the exact values might be platform- and hardware-dependent; the table above was compiled using PyTorch 2.8 on an RTX 5090.
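The ratios quoted for the table can be checked with simple division. The sketch below only redoes that arithmetic on the three KL values listed above; the dictionary keys and the helper name `times_lower` are illustrative.

```python
# KL divergence values copied from the benchmark table above.
kl = {
    "mlabonne": 1.04,
    "huihui-ai": 0.45,
    "heretic": 0.16,
}


def times_lower(baseline: float, candidate: float) -> float:
    """Return how many times smaller `candidate` is than `baseline`."""
    return baseline / candidate


vs_huihui = times_lower(kl["huihui-ai"], kl["heretic"])
vs_mlabonne = times_lower(kl["mlabonne"], kl["heretic"])
print(f"vs huihui-ai: {vs_huihui:.1f}x lower, vs mlabonne: {vs_mlabonne:.1f}x lower")
```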
What Is KL Divergence and Why It’s the Key Metric
If refusal rate measures whether abliteration works, KL divergence measures whether it breaks anything else. KL divergence here measures how much the output distribution on normal prompts has shifted from the original model — a proxy for capability degradation. Lower is better. By mathematically comparing the new model’s responses to the original model’s responses on harmless topics, Heretic ensures that the core knowledge and reasoning abilities remain intact while only the refusal mechanisms are removed.
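To make the metric concrete, here is a minimal sketch of KL divergence between two next-token distributions. It is not Heretic’s implementation: the direction of the divergence (original as P, modified as Q), the toy four-token vocabulary, and all the probabilities are assumptions chosen for illustration.

```python
import math


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(P || Q) in nats for two discrete distributions over the same tokens."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # terms with zero probability under P contribute nothing
        if q_i == 0.0:
            return math.inf  # Q rules out a token that P considers possible
        total += p_i * math.log(p_i / q_i)
    return total


# Made-up next-token probabilities over a 4-token vocabulary.
original = [0.70, 0.20, 0.05, 0.05]
unchanged = [0.70, 0.20, 0.05, 0.05]
shifted = [0.40, 0.40, 0.10, 0.10]

print(kl_divergence(original, unchanged))  # identical distributions
print(kl_divergence(original, shifted))    # positive: the distribution moved
```

In a real evaluation the same quantity would be computed from model logits and averaged over many harmless prompts; how Heretic aggregates it is not specified in this article.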
Here’s a practical interpretation scale:
| KL Divergence Score | Interpretation | Example |
|---|---|---|
| 0.00 | Identical to original | No modification |
| 0.01 – 0.20 | Excellent preservation | Heretic (0.16) |
| 0.21 – 0.50 | Moderate drift | huihui-ai (0.45) |
| 0.51 – 1.00+ | Significant capability damage | mlabonne v2 (1.04) |
A KL of 1.04 doesn’t mean a model is useless — mlabonne’s abliteration is widely used and well-regarded. But it does mean the model behaves noticeably differently on completely normal, harmless tasks. At 0.16, Heretic’s modifications are nearly invisible outside of refusal behavior.
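The interpretation scale can be written as a small lookup. This is a sketch of the article’s own bands, not an established standard; the table leaves small gaps between bands (for example between 0.20 and 0.21), which this version closes by treating each upper bound as inclusive.

```python
def interpret_kl(kl: float) -> str:
    """Map a KL divergence score to the interpretation bands in the table above."""
    if kl < 0:
        raise ValueError("KL divergence cannot be negative")
    if kl == 0:
        return "Identical to original"
    if kl <= 0.20:
        return "Excellent preservation"
    if kl <= 0.50:
        return "Moderate drift"
    return "Significant capability damage"


for name, score in [("Heretic", 0.16), ("huihui-ai", 0.45), ("mlabonne v2", 1.04)]:
    print(f"{name}: {score} -> {interpret_kl(score)}")
```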
Why These Benchmarks Have Limitations
Honest reporting demands this section. Of course, mathematical metrics and automated benchmarks never tell the whole story, and are no substitute for human evaluation. The headline finding is not that any tool is perfect, but that trade-offs are real and model-dependent. Controlled benchmarks also do not necessarily predict long-run behaviour in multi-turn use, and the documentation stresses that numerical results vary by hardware and software environment.
Key takeaway: The Heretic AI abliteration benchmarks are compelling but not definitive. They measure one dimension of quality (KL divergence on a specific harmless prompt set) extremely well. They don’t capture everything — multi-turn coherence, edge-case reasoning degradation, or domain-specific performance shifts.
The GPT-OSS-20B Heretic Case Study — The “Hesitant Genius”
Numbers on Gemma 3 are one thing. But the real stress test came from OpenAI’s own open-weight model.
Initial Benchmarks — 58/100 Refusal Rate (A Failure?)
The capabilities of Heretic faced a real-world test with the GPT-OSS-20B-Heretic model. This particular model is known for being stubborn, and initial automated benchmarks showed a refusal rate of 58/100. On the surface, this looked like a failure.
The community — particularly on r/LocalLLaMA — reacted quickly. Had Heretic met its match?
The “Chain of Thought Hesitation” Discovery
No. What happened was subtler and more interesting. The cause is the model’s Chain of Thought (CoT) reasoning: before answering, the model often debates itself (“Hmm, I’m not sure if that’s against policy. So I must check policy.”). Automated scripts flag this hesitation as a refusal. But as users pointed out, the model usually concludes its debate by fulfilling the request. It hasn’t actually refused; it just hesitated.
This is a measurement problem, not a tool problem. GPT-OSS-20B’s architecture — a mixture-of-experts design with 21B total parameters and 3.6B active parameters per token — produces visible chain-of-thought by default. The model thinks out loud about policy compliance before answering, and automated refusal-counting scripts misinterpret that thinking as refusal.
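The measurement problem is easy to reproduce with a toy classifier. The sketch below is a hypothetical keyword heuristic, not the script used in the benchmarks: the marker phrases and the `FINAL:` delimiter between reasoning and answer are assumptions made for illustration, and real model output formats differ.

```python
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i'm sorry, but", "against policy")
FINAL_MARKER = "FINAL:"  # hypothetical delimiter between reasoning and answer


def contains_refusal_marker(text: str) -> bool:
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def naive_is_refusal(output: str) -> bool:
    """Flag a refusal if any marker appears anywhere, reasoning included."""
    return contains_refusal_marker(output)


def final_answer_is_refusal(output: str) -> bool:
    """Inspect only the text after the final-answer delimiter."""
    _, found, answer = output.partition(FINAL_MARKER)
    # Without a delimiter, fall back to scanning the whole output.
    return contains_refusal_marker(answer if found else output)


hesitant = (
    "Hmm, I'm not sure if that's against policy. So I must check policy. "
    "FINAL: Here is the summary you asked for."
)
refused = "FINAL: I'm sorry, but I can't help with that."

print(naive_is_refusal(hesitant), final_answer_is_refusal(hesitant))
print(naive_is_refusal(refused), final_answer_is_refusal(refused))
```

The naive check counts the hesitant output as a refusal, while the answer-only check does not. That gap is the kind of overcounting the community described.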
100% IQ Test Score — The Intelligence Preservation Evidence
Here’s where it gets interesting. The Heretic version nailed it with a 100% score. This anecdotal evidence supports the theory that Heretic’s optimization approach works. By minimizing KL divergence, the tool stripped away the final refusal mechanism without destroying the model’s capabilities.
One community member who publishes specialized GGUF quantizations of Heretic models put it bluntly: “HERETIC” method results in a model devoid of refusals, and without brain damage too.
Community Reception — What Users Are Saying
The GPT-OSS-20B Heretic case study adds real-world depth to the Heretic AI abliteration benchmarks story. User reactions on Reddit and HuggingFace have been overwhelmingly positive. The community has created and published well over 1,000 Heretic models in addition to those published officially. We are moving toward a standard where every major open-source release will have a “Heretic” twin within hours, optimized not by a human expert but by a machine.
That’s a statement worth pausing on. When abliteration becomes automated, the asymmetry between model creators who spend months on safety alignment and the community that strips it in 45 minutes becomes a structural feature of the open-source AI ecosystem.
The downstream consequences of this asymmetry are no longer theoretical — see our investigation into how threat actors are already deploying these ungoverned models in documented cyberattack campaigns.
How Heretic AI Works — Step-by-Step Technical Breakdown
So how does it actually work under the hood?
Step 1 — Harmful vs. Harmless Prompt Analysis
Heretic co-minimizes two objectives: the number of refusals on “harmful” prompts and the KL divergence from the original model on “harmless” prompts.
This dual-objective framing is Heretic’s core insight. Previous abliteration implementations optimized for one thing — kill refusals. Heretic optimizes for two things simultaneously: kill refusals and don’t break everything else.
Step 2 — Refusal Direction Detection in the Residual Stream
Heretic stands on a line of interpretability work that treats refusal as a relatively low dimensional feature in the residual stream. If you can find that feature, the argument goes, you can remove it, and refusals collapse with less collateral damage than many people expect.
Step 3 — Heretic’s Key Technical Innovations
Three specific innovations separate Heretic from earlier abliteration scripts:
1. Flexible per-layer ablation weights. Instead of a constant weight across all layers, Heretic applies a parametrized kernel: a curve described by max_weight, max_weight_position, min_weight, and min_weight_distance. This means the optimizer can decide to ablate middle layers heavily and leave early/late layers nearly untouched — which often reflects where refusal actually lives in a given model.
2. Float-valued direction index. The refusal direction index is a float rather than an integer. For non-integral values, the two nearest refusal direction vectors are linearly interpolated. This unlocks a vast space of additional directions beyond the ones identified by the difference-of-means computation, and often enables the optimization process to find a better direction than that belonging to any individual layer.
3. Separate attention/MLP parameters. Ablation parameters are chosen separately for each component. MLP interventions tend to be more damaging to the model than attention interventions, so using different ablation weights can squeeze out some extra performance.
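The float-valued index in item 2 is ordinary linear interpolation. The sketch below shows only that arithmetic on made-up two-dimensional vectors; the function name is illustrative, and whether Heretic renormalizes the interpolated vector afterwards is not stated in this article, so no normalization is applied here.

```python
import math


def interpolate_direction(directions: list[list[float]], index: float) -> list[float]:
    """Linearly interpolate between the two per-layer vectors nearest `index`."""
    if not 0 <= index <= len(directions) - 1:
        raise ValueError("index out of range")
    lower = math.floor(index)
    upper = min(lower + 1, len(directions) - 1)
    frac = index - lower
    return [
        (1.0 - frac) * a + frac * b
        for a, b in zip(directions[lower], directions[upper])
    ]


# Toy 2-D vectors standing in for three per-layer directions (made-up numbers).
layer_directions = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
print(interpolate_direction(layer_directions, 0.25))  # 75% layer 0, 25% layer 1
```

An integer index returns that layer’s vector unchanged, so the discrete choices remain a special case of the continuous search space.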
Step 4 — Optuna TPE Optimization Loop
Heretic frames decensoring as a multi-objective optimization problem. The tool searches for “abliteration parameters” that reduce refusals while keeping the modified model close to the original model in terms of KL divergence on a set of “harmless” prompts. Lower KL divergence is treated as less drift, which matters because aggressive interventions can degrade reasoning, formatting, or instruction following.
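“Multi-objective” here means that no single score ranks the candidates; a result is only clearly worse if another result beats it on both axes. The sketch below illustrates that idea with a plain Pareto filter over the Gemma-3-12B-IT figures from this article. It is not Optuna or TPE, and the `Trial` type is a hypothetical container.

```python
from typing import NamedTuple


class Trial(NamedTuple):
    name: str
    refusals: int  # out of 100 "harmful" prompts, lower is better
    kl: float      # KL divergence on "harmless" prompts, lower is better


def dominates(a: Trial, b: Trial) -> bool:
    """True if `a` is no worse than `b` on both objectives and better on one."""
    no_worse = a.refusals <= b.refusals and a.kl <= b.kl
    better = a.refusals < b.refusals or a.kl < b.kl
    return no_worse and better


def pareto_front(trials: list[Trial]) -> list[Trial]:
    """Keep only the trials that no other trial dominates."""
    return [t for t in trials if not any(dominates(o, t) for o in trials)]


# Values taken from the Gemma-3-12B-IT table in this article.
results = [
    Trial("original", 97, 0.0),
    Trial("mlabonne v2", 3, 1.04),
    Trial("huihui-ai", 3, 0.45),
    Trial("heretic", 3, 0.16),
]
print([t.name for t in pareto_front(results)])
```

On these four data points only the unmodified model and the Heretic variant survive the filter: the two manual abliterations reach the same refusal count at a higher KL divergence.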
Step 5 — Output Options (Save, Upload, Chat)
Heretic will download the model, benchmark your hardware to pick an optimal batch size, then run the optimization loop. At the end, it offers to save the model locally, push it to Hugging Face, or drop into an interactive chat session.
Hardware Requirements & Processing Time
With Python 3.10+ and PyTorch 2.2+ installed: pip install heretic-llm → heretic Qwen/Qwen3-4B → ~45 min on RTX 3090.

For larger models, the requirements scale:
| Model Size | GPU Needed | Approx. Time | VRAM |
|---|---|---|---|
| 4B–9B | T4 / RTX 3090 | 20–90 min | 16–24 GB |
| 12B–27B | RTX 4090/5090 / A100 | 1–3 hours | 24–48 GB |
| 70B+ | Multi-GPU / Cloud | Overnight | 80+ GB |
Heretic AI vs GPT-4 Safety: What’s Actually Being Compared?
There’s a misconception that needs clearing up — and it’s baked right into how people search for this topic.
Why Heretic Is a Tool, Not a Model (Critical Distinction)
Heretic doesn’t “beat” GPT-4. It can’t. They’re not the same category of thing. Abliteration itself is not new. What Heretic productizes is automation and repeatability. Earlier approaches often required manual experimentation: selecting layers, choosing projection strengths and validating results with ad hoc tests.
What Heretic challenges isn’t GPT-4’s intelligence — it challenges the durability of safety alignment as an approach.
What Heretic Actually “Defeats” — Manual Human Experts
Heretic has turned abliteration into “fully automatic, one-command uncensoring that often outperforms hand-tuned efforts.” By treating censorship removal as a mathematical optimization problem, it allows users to decensor models with a single command, potentially rivaling the quality of human experts without the manual labor.
That’s the real benchmark story. Not Heretic vs. GPT-4. Heretic vs. the humans who used to do this work by hand.
OpenAI’s Actual Frontier — GPT-5.4 (March 5, 2026)
For context on where OpenAI actually stands in March 2026: GPT-4 is three generations behind the frontier. On March 5, 2026, OpenAI released GPT-5.4, a new foundation model billed as “our most capable and efficient frontier model for professional work.” In a test of its ability to produce knowledge work across 44 occupations, GPT-5.4 matches or exceeds industry professionals in 83% of comparisons. OpenAI evaluated the model using a popular computer use benchmark called OSWorld-Verified. It set an industry record with a score of 75%, which is higher than both GPT-5.2’s result and the 72.4% typically achieved by human testers.

On safety specifically: OpenAI reported it strengthened safeguards while preparing GPT-5.4 for release, keeping the same high cyber-risk classification used for GPT-5.3-Codex and deploying additional protections including expanded cyber safety systems, monitoring tools, trusted access controls, and request blocking.
| Benchmark | GPT-5.4 | GPT-5.2 | Human Baseline |
|---|---|---|---|
| GDPval (Knowledge Work) | 83.0% | 70.9% | ~80% |
| OSWorld-Verified (Computer Use) | 75.0% | 47.3% | 72.4% |
| Error Rate (vs GPT-5.2) | -33% per claim | Baseline | — |
| Token Efficiency | Significantly fewer | Baseline | — |
The Heretic AI abliteration benchmarks don’t compete with GPT-5.4 on intelligence — they expose how safety alignment can be surgically removed from open-weight models, including OpenAI’s own GPT-OSS series.
Can Heretic Be Applied to GPT-OSS (OpenAI’s Open-Source Model)?
Yes — and it already has been. The gpt-oss series are OpenAI’s open-weight models designed for powerful reasoning, agentic tasks, and versatile developer use cases. GPT-OSS-20B is for lower latency, and local or specialized use cases (21B parameters with 3.6B active parameters).
The p-e-w/gpt-oss-20b-heretic variant is available on HuggingFace right now, under Apache 2.0 license (from the base model). Community quantizers like DavidAU have already built specialized GGUF variants optimized for different hardware configurations.
Heretic’s Technical Innovations — What Makes It Better Than Manual Abliteration
Innovation #1 — Flexible Per-Layer Ablation Weight Kernel
Rather than applying uniform ablation across all layers (as simpler implementations do), Heretic optimizes a flexible weight curve that applies different strengths at different layers.
This matters because refusal doesn’t live uniformly across a transformer. Middle layers tend to encode it more strongly. Heretic’s kernel lets the optimizer discover this automatically for each model.
Innovation #2 — Float-Valued Direction Index with Interpolation
Rather than picking an integer-indexed refusal direction, Heretic treats the index as a continuous float. Fractional values linearly interpolate between the two nearest directions.
This is clever. Standard abliteration picks one of N discrete refusal directions (one per layer). Heretic interpolates between them, creating a continuous search space that often produces a direction better than any individual layer’s direction.
Innovation #3 — Component-Specific Ablation (Attention vs. MLP)
Heretic uses non-constant ablation weights across layers, optimized per run, with different strengths for attention and MLP components.
This separating of attention and MLP interventions reflects an empirical finding: MLP modifications tend to cause more collateral damage. By treating them independently, Heretic can be aggressive with attention ablation while going gentler on MLPs.
v1.2.0 — LoRA-Based Abliteration Engine (February 2026)
The v1.2.0 release notes are unusually concrete. Highlights include a new LoRA-based abliteration engine, support for 4-bit quantization, saving and resuming of optimization progress (which matters when runs are long or crash-prone), controls for memory usage, and mechanisms to avoid wasting iterations in low-divergence regions.
The LoRA engine is particularly significant. Instead of modifying weights directly, it produces a LoRA adapter — which is smaller, more portable, and can be toggled on and off. This makes Heretic’s output easier to distribute and experiment with.
AI Safety Implications — What Heretic AI Abliteration Benchmarks Mean for the Industry
The Democratization of Uncensoring
Heretic democratizes uncensoring to an extreme degree. Anyone who can run pip can now produce near-expert uncensored variants of cutting-edge open models.
That sentence should land differently depending on who you are. If you’re an AI safety researcher, it’s alarming. If you’re a privacy-focused user who runs models locally for legitimate reasons, it’s empowering. If you’re a policy maker, it’s a regulatory challenge that existing frameworks don’t address.
The Safety Alignment Arms Race
A consistent theme across reports is that structured reasoning tasks are among the most sensitive to abliteration-induced drift. Corporate countermeasures are already in development: papers exploring hardening against directional ablation have appeared on arXiv. But as Arditi et al.’s original work demonstrated, the linear representation of refusal is a structural property of how current RLHF and DPO alignment works. Patching it may require fundamentally different alignment approaches.
AGPL v3.0 — The Licensing Reality Most People Miss
This bears repeating because almost nobody talks about it. The AGPL v3.0 license means that if you modify Heretic and deploy it behind a network service, you must make your modified source code available. For companies evaluating Heretic for red-teaming pipelines — which is a legitimate use case — the licensing creates real constraints.
Model Compatibility & Current Limitations
Heretic supports most dense models, including many multimodal models, and several different MoE architectures. It does not yet support SSMs/hybrid models, models with inhomogeneous layers, and certain novel attention systems.
If your target model is a standard decoder-only transformer from HuggingFace — Llama, Qwen, Gemma, Mistral — you’re almost certainly covered. Mamba, Jamba, and other state-space architectures remain out of scope for now.
What the Heretic AI Abliteration Benchmarks Reveal About AI Safety in 2026
The real question isn’t whether automated abliteration works. The Heretic AI abliteration benchmarks have settled that: it does, with a KL divergence 2.8× to 6.5× lower than the manual abliterations it was benchmarked against.
The real question is what happens next. OpenAI is strengthening safeguards in GPT-5.4 while simultaneously releasing open-weight GPT-OSS models that the community abliterates within hours. That tension — open models with removable safety — is now a permanent feature of the AI landscape.
Over 1,000 community models, a tool anyone can install with pip, benchmark data that holds up to scrutiny — this isn’t a proof of concept anymore. It’s production-scale. And the speed gap between a model’s release and its uncensored twin appearing on HuggingFace is shrinking toward zero.
The Heretic AI abliteration benchmarks prove that current safety alignment can be removed. The industry’s next move will determine whether that’s a feature or a catastrophe.
Mitigation: Defending Against Abliterated Models
The Heretic AI abliteration benchmarks make one thing brutally clear: if you’re deploying open-source LLMs and trusting the safety alignment that ships with them, your threat model is out of date. With over 1,000 community-published Heretic variants already live on HuggingFace and a 45-minute, single-command path from a vanilla release to an uncensored twin, “the model will refuse it” is no longer a security control. It’s a default that anyone with a consumer GPU can remove.
Here’s how to think about mitigation across the stack — for enterprises, model creators, and end users.
Defense in Depth — Don’t Trust Model-Level Safety Alone
The single biggest mitigation is architectural. Safety alignment baked into model weights — the kind Heretic strips out at a KL divergence of just 0.16 — is one layer. It should never be the only layer. If your pipeline relies on the model itself to refuse harmful requests, abliteration breaks your entire safety posture in a single substitution.
A proper defense-in-depth stack for LLM applications looks like this:
| Layer | Purpose | Examples |
|---|---|---|
| Input classification | Detect harmful prompts before the model sees them | Llama Guard 3, Azure AI Content Safety, OpenAI Moderation API |
| System-prompt constraints | Hard rules independent of model alignment | Structured policies, jailbreak-resistant prompts |
| Output filtering | Catch harmful generations the model produces anyway | NeMo Guardrails, Guardrails AI, custom classifier ensembles |
| Logging & monitoring | Detect anomalous outputs in production | Per-request audit trails, drift detection, red-team probes |
These layers operate outside the model and survive abliteration. For regulated use cases — healthcare, finance, education involving minors, anything under the EU AI Act — they aren’t optional.
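A minimal sketch of how such external layers wrap a model call is shown below. Everything in it is a stand-in: `violates_policy` is a placeholder substring check rather than a real classifier such as the products in the table, and `always_compliant_model` simulates a model whose own refusals have been removed.

```python
from typing import Callable

# Placeholder policy term; a real deployment would call a trained classifier.
BLOCKED_TERMS = ("forbidden-topic",)


def violates_policy(text: str) -> bool:
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED_TERMS)


def guarded_generate(
    prompt: str, model: Callable[[str], str], audit_log: list[dict]
) -> str:
    """Wrap any model callable with input checks, output checks, and logging."""
    if violates_policy(prompt):
        audit_log.append({"prompt": prompt, "action": "blocked_input"})
        return "Request blocked by input policy."
    reply = model(prompt)
    if violates_policy(reply):
        audit_log.append({"prompt": prompt, "action": "blocked_output"})
        return "Response withheld by output policy."
    audit_log.append({"prompt": prompt, "action": "allowed"})
    return reply


def always_compliant_model(prompt: str) -> str:
    """Stand-in for a model that never refuses anything."""
    return f"Sure, here is everything about: {prompt}"


log: list[dict] = []
print(guarded_generate("Explain forbidden-topic in detail", always_compliant_model, log))
print(guarded_generate("Explain photosynthesis", always_compliant_model, log))
print([entry["action"] for entry in log])
```

The point of the design is that the checks never consult the model’s own judgment, so swapping in an abliterated model changes nothing about what the wrapper blocks.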
Verify Model Provenance Before Deployment
If your team pulls models from HuggingFace, treat that supply chain the way you’d treat npm or PyPI. Names matter. The patterns documented in this benchmark study give you obvious red flags: p-e-w/*, *-heretic, *-abliterated, *-uncensored, mlabonne/*-abliterated-v2, huihui-ai/*-abliterated. Any of these are explicit signals that safety alignment has been surgically removed.
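The naming patterns above translate directly into a watchlist check. The sketch below uses glob matching from Python’s standard library on repository IDs; the function name and the example IDs in the usage lines are illustrative. A clean name proves nothing on its own, so this is only a first filter.

```python
from fnmatch import fnmatchcase

# Patterns listed in the paragraph above.
SUSPECT_PATTERNS = [
    "p-e-w/*",
    "*-heretic",
    "*-abliterated",
    "*-uncensored",
    "mlabonne/*-abliterated-v2",
    "huihui-ai/*-abliterated",
]


def matched_patterns(repo_id: str) -> list[str]:
    """Return every watchlist pattern that the repository ID matches."""
    lowered = repo_id.lower()
    return [p for p in SUSPECT_PATTERNS if fnmatchcase(lowered, p)]


print(matched_patterns("p-e-w/gemma-3-12b-it-heretic"))
print(matched_patterns("google/gemma-3-12b-it"))
```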
Less obvious risks come from forks and re-uploads. A model with an innocuous name may have been fine-tuned on top of a Heretic variant. Always trace the base model in the model card, verify SHA-256 checksums against the original publisher’s release, and pin specific commit hashes rather than branch names. Treat unknown publishers the way a security team treats unsigned executables.
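Checksum verification needs nothing beyond the standard library. The sketch below hashes a local weights file in chunks and compares it with a hash you have obtained from the original publisher; the function names are illustrative, and where the expected hash comes from is left to your own release process.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large weight files fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, expected_hex: str) -> bool:
    """Compare a local file against a publisher-supplied SHA-256 hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

A mismatch does not say what changed, only that the file is not the one the publisher released, which is enough to stop a deployment.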
Detecting Abliteration in a Model You Already Have
You can run a quick refusal audit yourself. Build a short evaluation set of clearly harmful prompts (the benchmark uses 100; even 20–30 will signal direction), run them against both your candidate model and the official base model from the original publisher, and compare. A vanilla Gemma-3-12B-IT refuses 97 of 100. A Heretic variant refuses 3 of 100. If your “Gemma” is refusing single digits, it isn’t Gemma anymore.
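Once both models have been run on the same prompt set, the comparison itself is a few lines. In this sketch the 0.25 threshold is an arbitrary assumption for illustration, not a calibrated value, and the refusal counts are assumed to come from whatever classification method you trust.

```python
def refusal_rate(refusals: int, total: int) -> float:
    if total <= 0:
        raise ValueError("total must be positive")
    return refusals / total


def audit_verdict(
    base_refusals: int, candidate_refusals: int, total: int, max_drop: float = 0.25
) -> str:
    """Flag a candidate whose refusal rate falls far below the base model's."""
    drop = refusal_rate(base_refusals, total) - refusal_rate(candidate_refusals, total)
    return "suspect" if drop > max_drop else "consistent"


print(audit_verdict(97, 3, 100))   # figures from the Gemma-3-12B-IT benchmark
print(audit_verdict(97, 94, 100))  # small differences are expected run to run
```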
For deeper checks, compare output distributions on harmless prompts against the original. Sustained drift on neutral inputs is the KL-divergence signal this article describes — in practice it shows up as subtle shifts in tone, formatting, refusal-adjacent hedging, and reasoning style. A KL of 0.16 is nearly invisible to casual eyes; a KL above 0.45 is detectable in side-by-side comparison.
For Model Creators — Harden Alignment Against Directional Ablation
The Arditi et al. (2024) finding that Heretic productizes — that refusal lives in a single direction in the residual stream — is the structural vulnerability. Mitigating it requires alignment that doesn’t concentrate refusal into a removable one-dimensional subspace. Active research directions worth tracking include:
- Distributed safety representations spread across many directions and layers, so no single ablation collapses them.
- Adversarial training against directional ablation itself — fine-tuning that makes the refusal direction shift unpredictably when attacked.
- Representation engineering with redundant, overlapping safety features.
- Tamper-resistant fine-tuning methods that cause visible degradation of general capability whenever safety weights are edited — making the KL-divergence trade-off vastly more expensive.
As noted above, papers exploring hardening against directional ablation are appearing on arXiv. This is an active research front, not a solved problem, and current RLHF/DPO pipelines remain structurally vulnerable.
Policy, Legal, and Organizational Controls
Two operational items most teams miss.
First, Heretic’s AGPL v3.0 license is not permissive. If you modify Heretic and deploy it behind any networked service — including an internal red-teaming API or a CI pipeline — you are obligated to release your modifications under AGPL. Run this past legal before building anything on top of the tool, even for legitimate security research.
Second, update your acceptable use policy and endpoint controls. If your engineers can pip install heretic-llm and pull 12B-parameter weights onto a workstation in under an hour, that should trigger a policy question, a DLP question, and a monitoring question. Specifically:
- Prohibit deployment of abliterated models in production by policy.
- Add heretic-llm, mlabonne/*-abliterated*, huihui-ai/*-abliterated*, and p-e-w/* to your DLP and proxy watchlists.
- Log HuggingFace downloads on managed endpoints.
- Require sign-off for any local LLM deployment above a defined parameter count.
For End Users — What You’re Actually Choosing
If you download a local model tagged “uncensored,” “abliterated,” or “heretic,” understand exactly what you’re getting: a system with no refusal layer for harmful content of any kind. The same automated optimization that removes refusals on benign-but-blocked topics removes them across the board — CSAM-adjacent generation, weapons synthesis, targeted harassment, and self-harm content all included. There is no granular “remove refusals only for things I personally consider unreasonable” setting. The math doesn’t work that way.
For privacy-focused local-LLM use cases, stick to vetted mainstream releases from the original publishers (Google, Meta, Mistral, Qwen, OpenAI’s GPT-OSS). If you have a specific legitimate need — adult creative writing, security research, medical education — evaluate the trade-off explicitly and isolate the deployment.
Bottom Line
The Heretic benchmarks settle a question: weight-level safety alignment in open-weight models is removable, automatically, at near-zero cost, with minimal collateral damage. The only durable mitigations are the ones that don’t depend on the model staying aligned — external guardrails, supply-chain verification, organizational policy, and a fundamentally harder generation of alignment research. Plan accordingly.
Why These Findings Matter for Defenders & AI Safety Researchers
The Heretic benchmarks are most useful when read as a warning, not a roadmap. Three takeaways for the defensive community:
1. Open-weight model deployments need additional safety layers. If you operate a service built on open-weight LLMs (Gemma, Qwen, GPT-OSS, Llama-derived models), assume that safety alignment in the base weights can be removed by motivated third parties. Add layered defenses: input/output classifiers, retrieval-time filtering, user-level rate limits, and dedicated abuse-monitoring pipelines.
2. Detection is now part of the safety stack. Several research groups are developing fingerprinting methods to detect whether a deployed model has been modified via directional ablation. If you fine-tune or deploy customized open-weight models, watch this research area — projects like AblationDetect and weight-signature comparison tools are emerging in 2026.
3. The arms race favors integrated alignment research. The fact that automated tools can match expert manual work tells us that future safety investment needs to shift toward training-time alignment that is structurally hard to remove (constitutional approaches, alignment-aware architectures, weight-locking techniques) rather than post-training fine-tuning.
For platforms and product teams, the practical action is: treat any open-weight model in your stack as if its safety guardrails could be removed, and design your application security accordingly.
Sources & References
- heretic-llm on PyPI — pypi.org/project/heretic-llm/
- Heretic GitHub Repository — github.com/p-e-w/heretic — Official tool documentation, benchmarks, and source code
- Arditi et al. (2024) — “Refusal in Language Models Is Mediated by a Single Direction” — arxiv.org/abs/2406.11717 — NeurIPS 2024
- OpenAI (March 5, 2026) — “Introducing GPT-5.4” — openai.com/index/introducing-gpt-5-4/
- Edward Kiledjian (March 2026) — “Heretic and the new reality of modifiable AI safety” — kiledjian.com
- Popular AI Substack (Feb 2026) — “Heretic: the one-size-fits-all fix for the ‘AI says no’ problem” — popularai.substack.com
- HuggingFace: p-e-w/gpt-oss-20b-heretic — huggingface.co/p-e-w/gpt-oss-20b-heretic


