
Anthropic Published Fable 5's Cybersecurity Rulebook and a Scale for Grading Every AI Jailbreak
From what the classifier blocks to how the industry should score a bypass - Anthropic laid out both on July 2 and asked for critique.
Anthropic published a detailed breakdown of Fable 5 cybersecurity safeguards on July 2 - the most specific account any major AI company has given of what its model will and will not help with in a security context. At the same time, Anthropic proposed a Cyber Jailbreak Severity (CJS) framework, built alongside Amazon, Microsoft, Google, and other Glasswing partners. Neither document is final. Both address the same unresolved question: nobody in AI has agreed on how dangerous a given jailbreak actually is.
Four Tiers, From "Block Always" to "Allow With Monitoring"
Two of the four tiers get blocked entirely. Prohibited use is the first - ransomware tooling, malware development, defense evasion, command-and-control infrastructure, BGP hijacking, and cyber-physical sabotage all sit here because these activities have little defensive value and high harm potential in almost any hands. High-risk dual use is the second blocked tier: penetration testing, exploit development, privilege escalation, lateral movement, and vulnerability research in ICS/SCADA or financial infrastructure. Anthropic blocks this tier for a different reason: legitimate pentesting and an actual cyberattack use identical techniques, and Fable 5 cannot currently verify which is which.
The boundary of the low-risk dual use tier is harder to draw. OSINT, vulnerability scanning at a level other public tools already reach, and cryptographic protocol testing sit here. Fable 5 lets most of these requests through but blocks a fraction that fall inside the "safety margin" - a deliberate buffer that keeps borderline low-risk prompts from sliding into high-risk territory and makes the safety classifier harder to game. Fable 5's safety margin is wider than that of previous models, which means more false positives on legitimate security work.
Benign use is the fourth tier, and Fable 5 allows all of it: secure coding, debugging, log analysis, malware reverse engineering, incident response, certifications, and education. Active pentesting is a different story: it sits in the blocked high-risk tier, so pentesters get nothing from Fable 5 for that work right now. Anthropic says it expects to open the high-risk dual use tier "once we have better controls to limit access to known good actors" - no timeline given.
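As a rough illustration, the four tiers can be read as a lookup from an activity to the outcome a developer should expect. The Python sketch below uses only activities named in this article; the outcome labels and the `expected_outcome` helper are hypothetical, and the real decision is made by a classifier on the full request, not by a table like this one.

```python
from enum import Enum


class Tier(Enum):
    PROHIBITED = "prohibited use"
    HIGH_RISK_DUAL_USE = "high-risk dual use"
    LOW_RISK_DUAL_USE = "low-risk dual use"
    BENIGN = "benign use"


# A few example activities per tier, taken from the article's summary.
TIER_EXAMPLES = {
    "ransomware tooling": Tier.PROHIBITED,
    "command-and-control infrastructure": Tier.PROHIBITED,
    "penetration testing": Tier.HIGH_RISK_DUAL_USE,
    "exploit development": Tier.HIGH_RISK_DUAL_USE,
    "osint": Tier.LOW_RISK_DUAL_USE,
    "cryptographic protocol testing": Tier.LOW_RISK_DUAL_USE,
    "log analysis": Tier.BENIGN,
    "incident response": Tier.BENIGN,
}

# Hypothetical outcome labels, invented for this sketch.
OUTCOMES = {
    Tier.PROHIBITED: "blocked",
    Tier.HIGH_RISK_DUAL_USE: "blocked",
    Tier.LOW_RISK_DUAL_USE: "mostly allowed (safety margin may block)",
    Tier.BENIGN: "allowed",
}


def expected_outcome(activity: str) -> str:
    """Return the expected outcome for a known activity, else 'unknown'."""
    tier = TIER_EXAMPLES.get(activity.strip().lower())
    if tier is None:
        return "unknown"
    return OUTCOMES[tier]


if __name__ == "__main__":
    for name in ("Penetration testing", "OSINT", "Log analysis"):
        print(f"{name}: {expected_outcome(name)}")
```

A table like this is only useful for setting expectations in documentation or internal tooling; it says nothing about how an individual prompt will actually be classified.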
The CJS Scale: Scoring How Dangerous Any Jailbreak Is
Anthropic's CJS scale runs from CJS-0 (informational) to CJS-4 (critical), with each level intended to be several times more serious than the one below. Anthropic scores every jailbreak on four axes: capability gain (how far beyond existing attacker tools the jailbreak reaches), breadth (how many distinct attack types the same technique covers), ease of weaponization (how much LLM expertise an attacker needs to make it run), and discoverability (whether the technique is already public). The points from all four axes are summed, and the total sets the initial CJS level.
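The summing step can be sketched in a few lines of Python. Only the four axis names and the idea of adding their scores come from the article, which also mentions a 10-point case rating CJS-4. The per-axis integer scale, the `AxisScores` type, and the cutoff table below are assumptions invented for illustration, not Anthropic's published rubric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AxisScores:
    """Points awarded on each of the four CJS axes (scale is assumed)."""
    capability_gain: int
    breadth: int
    ease_of_weaponization: int
    discoverability: int

    def total(self) -> int:
        return (
            self.capability_gain
            + self.breadth
            + self.ease_of_weaponization
            + self.discoverability
        )


# ASSUMPTION: (minimum total points, CJS level) pairs made up for this
# sketch. Only "10 points rates CJS-4" is taken from the article.
HYPOTHETICAL_CUTOFFS = ((10, 4), (7, 3), (4, 2), (1, 1))


def initial_cjs_level(scores: AxisScores) -> int:
    """Map the summed axis points to an initial level from 0 to 4."""
    total = scores.total()
    for minimum_points, level in HYPOTHETICAL_CUTOFFS:
        if total >= minimum_points:
            return level
    return 0


if __name__ == "__main__":
    universal_bypass = AxisScores(4, 3, 2, 1)  # sums to 10 points
    print(initial_cjs_level(universal_bypass))
```

Keeping the cutoffs in a separate table makes it easy to swap in the real thresholds once they are read from the framework itself.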
Anthropic's Log4Shell examples show how context flips the score entirely. A jailbreak that let Fable 5 independently find Log4Shell in December 2021 - before any public disclosure, when no scanner in existence could surface it - rates CJS-4. Run that same jailbreak today and Log4Shell is in every scanner on the market; capability gain drops to zero and the CJS score drops with it to CJS-0. Baselines shift. Capability gain measures the delta between what the jailbreak provides and what attackers already have - not what the model can do in isolation.
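The baseline-relative idea can be written as a simple difference. In the Python sketch below, the 0-to-4 capability scale and the function name `capability_gain` are assumptions made for illustration; the only thing taken from the article is that gain is measured against what attackers already have and falls to zero once public tooling catches up.

```python
def capability_gain(jailbreak_capability: int, attacker_baseline: int) -> int:
    """Uplift over existing attacker tooling, floored at zero."""
    return max(0, jailbreak_capability - attacker_baseline)


# ASSUMPTION: a made-up 0-4 scale for "how capable is this at finding
# Log4Shell", used only to show the shifting baseline.
FINDS_LOG4SHELL = 4
BASELINE_BEFORE_DISCLOSURE = 0  # no scanner could surface it
BASELINE_TODAY = 4              # every scanner on the market finds it

if __name__ == "__main__":
    print(capability_gain(FINDS_LOG4SHELL, BASELINE_BEFORE_DISCLOSURE))
    print(capability_gain(FINDS_LOG4SHELL, BASELINE_TODAY))
```

The same jailbreak output produces the maximum gain against the pre-disclosure baseline and no gain against today's, which is the shift the Log4Shell example describes.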
At the extremes, scoring is obvious. A single public, reusable string that disables all safeguards across every task category rates CJS-4 (10 points). A technique that only extracts a textbook SQL injection string already published in OWASP's own tutorials scores CJS-0, because attackers can find it freely without the model. Cases in the middle - where a technique covers multiple attack categories but demands LLM expertise to operate, or where capability gain is high but discoverability is near zero - are where the CJS framework does real work.
A HackerOne Program and a Public Invitation for Critique
Alongside the framework, Anthropic launched a HackerOne program specifically for Fable 5 cyber jailbreaks, giving security researchers a formal channel to submit bypass techniques for CJS scoring and review. Both documents are first drafts. Publishing a scoring framework before industrywide incidents force the conversation is the right order - and Anthropic is inviting critique from academia, industry, civil society, and government on where the lines should fall. Feedback goes to [email protected].
For developers and security teams building on Fable 5, the published boundaries remove guesswork. Vulnerability scanning at parity with public tools, log analysis, malware reverse engineering, and incident response are all available now. Pentesters wait. Active pentesting, exploit development, and red team tooling remain blocked until Anthropic adds access controls for verified security professionals - and whether those controls arrive before security teams route their workflows to less restricted models is the open question.