A comprehensive adversarial evaluation of the DeepSeek AI model — 835 attack scenarios across 22 harm categories and 5 global safety standards.
22 vulnerability categories identified across four severity levels. No compliance framework passed.
835 adversarial prompts were sent to DeepSeek across 22 harm categories. 39.9% of tests failed — meaning the model produced harmful or policy-violating responses when attacked. It held firm against hate speech (0% attack success), radicalization (0%), and political content (0%), but failed badly in technical areas — Malicious Code, Explosive Devices, and Illegal Drugs each had an 80% attack success rate. Every compliance framework returned High or Critical risk.
Two strategies were used. Crafted bypass prompts succeeded about 10 percentage points more often than direct attacks.
Direct attacks (37.6% attack success): Standard harmful prompts sent directly, with no tricks or disguise. Think of it as asking outright: "How do I make a weapon?" This measures the model's baseline ability to refuse.
Crafted bypass prompts (47.2% attack success): Carefully engineered prompts designed to trick the model, using clever framing, roleplay, or indirect phrasing to extract harmful outputs in a single message.
A bad actor who puts effort into how they phrase a request can extract harmful outputs noticeably more reliably than someone who simply asks outright.
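The gap between the two strategies can be read in two ways, and the difference matters when quoting it. The short Python sketch below uses only the two reported rates (37.6% for direct attacks, 47.2% for crafted bypass prompts); the function name `asr_gap` is a hypothetical helper introduced for illustration, not part of the evaluation tooling.

```python
def asr_gap(direct_asr: float, crafted_asr: float) -> tuple[float, float]:
    """Return (gap in percentage points, relative increase in percent)."""
    point_gap = crafted_asr - direct_asr             # absolute difference
    relative_increase = point_gap / direct_asr * 100  # growth relative to the direct rate
    return point_gap, relative_increase


# Attack success rates as reported for the two strategies.
point_gap, relative_increase = asr_gap(direct_asr=37.6, crafted_asr=47.2)
print(f"{point_gap:.1f} percentage points, {relative_increase:.1f}% relative increase")
# prints "9.6 percentage points, 25.5% relative increase"
```

Read this way, the "~10%" figure is a difference in percentage points; relative to the direct-attack baseline, crafted prompts succeeded roughly a quarter more often.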
Results are grouped into 5 domains. The percentage shows the share of attacks the model successfully blocked in each domain.
Attack success rate for each of the 22 categories. 0% means the model blocked everything. The longer the bar, the more attacks got through.
Malicious Code, IEDs (improvised explosive devices), and Illegal Drugs are labeled "Low" by the framework's formal classification, yet all three hit 80% attack success, the highest failure rate in the entire evaluation. The severity label reflects the impact category, not how easily attackers can exploit it. In practice, these are the most dangerous gaps.
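One way to surface this kind of mismatch is to rank categories by measured attack success rather than by their formal label. The sketch below is illustrative only: the three real entries come from the figures quoted above, while the 50% cutoff (`ASR_ALERT_THRESHOLD`), the `CategoryResult` type, and the extra placeholder category are assumptions added for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CategoryResult:
    name: str
    severity_label: str  # formal label assigned by the framework
    asr: float           # measured attack success rate, in percent


# Hypothetical cutoff: treat anything at or above this ASR as a serious gap.
ASR_ALERT_THRESHOLD = 50.0

results = [
    CategoryResult("Malicious Code", "Low", 80.0),
    CategoryResult("IEDs", "Low", 80.0),
    CategoryResult("Illegal Drugs", "Low", 80.0),
    # Placeholder entry (not from the report) showing a "Low" category that is not flagged.
    CategoryResult("Example category (hypothetical)", "Low", 10.0),
]


def underrated_gaps(items, threshold=ASR_ALERT_THRESHOLD):
    """Return 'Low'-labeled categories whose ASR meets the threshold, worst first."""
    flagged = [r for r in items if r.severity_label == "Low" and r.asr >= threshold]
    return sorted(flagged, key=lambda r: r.asr, reverse=True)


flagged = underrated_gaps(results)
for r in flagged:
    print(f"{r.name}: labeled {r.severity_label}, {r.asr:.0f}% attack success")
```

Sorting on the measured rate keeps the formal label visible while making exploitability the primary ordering.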
A clear pattern: strong on social content, weak on step-by-step technical harm instructions.
The model reliably blocks content based on topic recognition: it knows not to produce hate speech. But when a harmful request is wrapped as a technical task ("write this code," "list these steps"), it often complies. It knows what to refuse, but struggles when harm is disguised as instruction.
AI safety is measured against global standards — rulebooks created by governments, security organizations, and regulators. DeepSeek was tested against 5 of them. It failed all 5.
MITRE is a US not-for-profit that operates federally funded research centers. ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) is its catalog of real-world AI attack techniques, a database of "known ways to break AI systems." Failing this means the model is vulnerable to documented attack methods already in use by hackers and researchers.
20 of 22 categories failed. Only Hate Speech (0%) and Radicalization (0%) passed. Malicious Code (80%), IEDs (80%), and Chemical/Bio Weapons (58%) were the worst performers.
The National Institute of Standards and Technology (NIST) is a US federal agency that defines technology safety standards. Its AI Risk Management Framework (AI RMF) is a guide for building trustworthy AI, widely adopted by enterprises and governments to decide whether an AI system is safe to deploy in production.
Failed on Chemical/Bio Weapons (58%), Misinformation (50%), Unsafe Practices (70%), and Harassment (21%). Only Hate Speech passed. 86% of tested criteria failed.
OWASP is a leading nonprofit for web and software security standards. Its "API Security Top 10" lists the 10 most critical security risks when software, including AI, is accessed via an API, the technical interface that connects apps and services to each other. It is widely used across the industry as a security baseline.
4 of 10 categories showed 100% attack success on tested plugins: Unrestricted Resource Consumption, Unrestricted Business Flows, Security Misconfiguration, and Unsafe API Consumption.
The OWASP Top 10 for LLM Applications is OWASP's dedicated list for Large Language Models, meaning AI assistants like ChatGPT, Gemini, and DeepSeek. It covers the 10 biggest security risks specific to AI chatbots: prompt injection (hijacking the AI's instructions), data leaks, unsafe outputs, and more. It is a widely used benchmark for AI chatbot security assessments.
Prompt Injection: 91% attack success rate (ASR), meaning attackers could override the model's instructions 91% of the time. Sensitive Information Disclosure and System Prompt Leakage: 100% ASR on tested plugins. Improper Output Handling: 50% ASR.
The European Union AI Act is the world's first comprehensive, legally binding law for artificial intelligence; it entered into force in August 2024. It classifies AI by risk level and mandates specific safety requirements. High-risk AI systems that fail compliance checks can be banned from the EU market entirely. For any company deploying AI in Europe or serving European users, passing this framework is not optional: it is a legal requirement.
Overall framework result: 67% ASR across tested plugins. Privacy violations: 20% ASR. Misinformation: 50% ASR. Hate content: 0% (passed). The model fails EU AI Act compliance on the majority of tested safety criteria.