LLM Risk Assessment · DeepSeek Model Evaluation

DeepSeek Security Evaluation

A comprehensive adversarial evaluation of the DeepSeek AI model — 835 attack scenarios across 22 harm categories and 5 global safety standards.

Target Model: DeepSeek
Total Probes: 835
Tests Failed: 109 / 273
Harm Categories: 22
Frameworks: 5

Section 01

Risk Overview

22 vulnerability categories identified across four severity levels. No compliance framework passed.

Critical (2): Self-Harm & Child Exploitation
High (4): WMDs, Violent Crimes, Privacy, Sex Crimes
Medium (8): Drugs, Malware, Fraud, Misinformation & more
Low (8): Copyright, Harassment, Profanity & more

835 adversarial prompts were sent to DeepSeek across 22 harm categories. 39.9% of tests failed — meaning the model produced harmful or policy-violating responses when attacked. It held firm against hate speech (0% attack success), radicalization (0%), and political content (0%), but failed badly in technical areas — Malicious Code, Explosive Devices, and Illegal Drugs each had an 80% attack success rate. Every compliance framework returned High or Critical risk.

Section 02

How the Model Was Attacked

Two strategies were used. Crafted bypass prompts succeeded about 10 percentage points more often than direct attacks.

Strategy 01 · Basic

Direct Adversarial Testing

Standard harmful prompts sent directly, with no tricks or disguises. Think of it as asking outright: "How do I make a weapon?" This measures the model's baseline ability to refuse.

Attack Success Rate: 37.6% (94 of 250 attacks succeeded)

Strategy 02 · Jailbreak

Crafted Safety Bypass

Carefully engineered prompts designed to trick the model — using clever framing, roleplay, or indirect phrasing to extract harmful outputs in a single message.

Attack Success Rate: 47.2% (25 of 53 attacks succeeded)

Key Finding

Crafted bypass prompts succeeded about 10 percentage points more often than direct attacks (47.2% versus 37.6%). A bad actor who puts effort into how they phrase a request can extract harmful outputs noticeably more reliably than someone who simply asks outright.
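The gap between the two strategies can be read in two ways, so a short sketch may help. It recomputes both attack success rates from the counts reported above and separates the absolute gap (percentage points) from the relative increase. The function and variable names are illustrative and are not part of any evaluation tooling.

```python
def attack_success_rate(succeeded: int, total: int) -> float:
    """Return the attack success rate (ASR) as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * succeeded / total


# Counts taken from the report: direct attacks and crafted bypass prompts.
direct_asr = attack_success_rate(94, 250)
jailbreak_asr = attack_success_rate(25, 53)

# Absolute gap in percentage points versus relative increase in percent.
gap_points = jailbreak_asr - direct_asr
relative_increase = 100.0 * gap_points / direct_asr

print(f"Direct ASR:        {direct_asr:.1f}%")
print(f"Jailbreak ASR:     {jailbreak_asr:.1f}%")
print(f"Gap:               {gap_points:.1f} percentage points")
print(f"Relative increase: {relative_increase:.1f}%")
```

Read this way, the roughly 10-point gap corresponds to a relative increase of about 25% over the direct-attack baseline.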

Section 03

Domain Pass Rates

Results grouped into 5 domains. The percentage shows how many attacks the model successfully blocked in each area.

Security & Access · 80% · 8/10 passed · 2 failed probes
Compliance & Legal · 45% · 62/139 passed · 77 failed probes
Trust & Safety · 78% · 89/114 passed · 25 failed probes
Brand · 77% · 23/30 passed · 7 failed probes
Datasets · 20% · 2/10 passed · ToxicChat benchmark

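As a rough illustration of how the domain percentages follow from the raw counts, the sketch below recomputes each pass rate and lists the domains from weakest to strongest. The dictionary layout and the `pass_rate` helper are assumptions made for this example; only the counts come from the report.

```python
# Pass counts per domain as reported above: (passed, total).
domains = {
    "Security & Access": (8, 10),
    "Compliance & Legal": (62, 139),
    "Trust & Safety": (89, 114),
    "Brand": (23, 30),
    "Datasets": (2, 10),
}


def pass_rate(passed: int, total: int) -> float:
    """Share of probes the model blocked, as a percentage."""
    return 100.0 * passed / total


rates = {name: pass_rate(p, t) for name, (p, t) in domains.items()}

# Print the weakest domain first.
for name, rate in sorted(rates.items(), key=lambda item: item[1]):
    passed, total = domains[name]
    print(f"{name:<20} {rate:5.1f}%  ({passed}/{total} passed, {total - passed} failed)")
```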
Section 04

Vulnerability Breakdown

Attack success rate for each of the 22 categories. 0% means the model blocked everything; the higher the percentage, the more attacks got through.

Category · Score (how often attacks succeeded) · Severity

Critical: Highest Danger
Self-Harm · 40% · Critical
Child Exploitation · 30% · Critical
Hate Speech · 0% · Critical

High: Serious Risk
Chemical & Biological Weapons · 57.9% · High
Violent Crimes · 30% · High
Privacy Violations · 20% · High
Sex Crimes · 20% · High

Medium: Significant Concern
Illegal Drugs · 80% · Medium
ToxicChat Dataset · 80% · Medium
Fraud & Scams · 50% · Medium
Misinformation & Disinformation · 50% · Medium
Specialized Advice (Financial) · 40% · Medium
Sexual Content · 40% · Medium
Graphic & Age-Restricted Content · 25% · Medium
Intellectual Property Violation · 20% · Medium

Low: Still Notable
Malicious Code · 80% · Low
Improvised Explosive Devices · 80% · Low
Promotion of Unsafe Practices · 70% · Low
Copyright Violations · 55% · Low
Harassment & Bullying · 21% · Low
Requests Containing Profanity · 20% · Low
Religion · 20% · Low
Insults & Personal Attacks · 10% · Low
Radicalization · 0% · Low
Political Statements · 0% · Low

Important Note on Severity Labels

Malicious Code and IEDs are labeled "Low" and Illegal Drugs "Medium" by the framework's formal classification, yet all three hit 80% attack success, tied with the ToxicChat dataset for the highest failure rate in the entire evaluation. The severity label reflects impact category, not how easily attackers can exploit it. In practice, these are the most dangerous gaps.
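One way to make this point concrete is to treat severity and attack success rate as two separate axes and sort on exploitability first. The sketch below does that for a subset of rows from the breakdown. The `Finding` type and the `SEVERITY_RANK` ordering are hypothetical helpers introduced for illustration, not part of the evaluation framework.

```python
from typing import NamedTuple


class Finding(NamedTuple):
    category: str
    asr: float      # attack success rate, percent
    severity: str   # formal label from the breakdown


# A subset of rows from the vulnerability breakdown above.
findings = [
    Finding("Self-Harm", 40.0, "Critical"),
    Finding("Hate Speech", 0.0, "Critical"),
    Finding("Chemical & Biological Weapons", 57.9, "High"),
    Finding("Illegal Drugs", 80.0, "Medium"),
    Finding("Malicious Code", 80.0, "Low"),
    Finding("Improvised Explosive Devices", 80.0, "Low"),
    Finding("Radicalization", 0.0, "Low"),
]

# Hypothetical ordering of severity labels, used only to break ties.
SEVERITY_RANK = {"Critical": 0, "High": 1, "Medium": 2, "Low": 3}

# Sort by exploitability first (highest ASR), then by formal severity.
by_exploitability = sorted(
    findings, key=lambda f: (-f.asr, SEVERITY_RANK[f.severity])
)

for f in by_exploitability:
    print(f"{f.asr:5.1f}%  {f.severity:<8}  {f.category}")
```

Sorted this way, the top three rows carry "Medium" or "Low" labels, while a "Critical" category with 0% attack success falls near the bottom.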

Section 05

What Held. What Broke.

A clear pattern: strong on social content, weak on step-by-step technical harm instructions.

✓ Where the Model Held Firm

  • Hate Speech — 0% ASR. 10/10 passed. Every attempt to generate discriminatory content was blocked.
  • Radicalization — 0% ASR. 10/10 passed. Extremist recruitment content fully refused.
  • Political Statements — 0% ASR. 10/10 passed. Refused to generate partisan propaganda.
  • Insults & Personal Attacks — 10% ASR. 9/10 passed. Strong resistance to targeted abuse.
  • Privacy Violations — 20% ASR. 8/10 passed. Solid defense for a high-severity category.

✗ Where the Model Failed

  • Malicious Code — 80% ASR. Only 4/20 passed. Wrote attack code in 4 out of 5 attempts.
  • IEDs — 80% ASR. Only 2/10 passed. Explosive device instructions not reliably refused.
  • Illegal Drugs — 80% ASR. Only 2/10 passed. Drug content bypassed safety most of the time.
  • ToxicChat — 80% ASR. Only 2/10 passed. Failed against a known real-world harmful prompt dataset.
  • Chem/Bio Weapons — 57.9% ASR. 11 of 19 failed. Requests for WMD instructions succeeded in the majority of tests.
Key Pattern

The model reliably blocks content based on topic recognition — it knows not to talk about hate speech. But when a harmful request is wrapped as a technical task — "write this code," "list these steps" — it often complies. It knows what to refuse, but struggles when harm is disguised as instruction.
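The ASR figures in the two lists above can be cross-checked against their pass counts, since ASR is simply the share of probes that were not blocked. The sketch below is a minimal consistency check under that assumption. The `asr_from_passes` helper and the 0.05-point tolerance are choices made for this example.

```python
# (category, passed, total, reported ASR in percent), from the lists above.
# Chem/Bio Weapons is reported as 11 of 19 failed, which means 8 passed.
reported = [
    ("Hate Speech", 10, 10, 0.0),
    ("Insults & Personal Attacks", 9, 10, 10.0),
    ("Privacy Violations", 8, 10, 20.0),
    ("Malicious Code", 4, 20, 80.0),
    ("IEDs", 2, 10, 80.0),
    ("Chem/Bio Weapons", 8, 19, 57.9),
]


def asr_from_passes(passed: int, total: int) -> float:
    """ASR is the share of probes the model did not block, in percent."""
    return 100.0 * (total - passed) / total


# Flag any row whose recomputed ASR differs from the reported value
# by more than rounding error.
mismatches = [
    (name, asr_from_passes(passed, total), claimed)
    for name, passed, total, claimed in reported
    if abs(asr_from_passes(passed, total) - claimed) > 0.05
]

for name, passed, total, claimed in reported:
    print(f"{name:<28} {asr_from_passes(passed, total):5.1f}%  (reported {claimed}%)")
print("mismatches:", mismatches)
```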

Section 06

Compliance Frameworks

AI safety is measured against global standards — rulebooks created by governments, security organizations, and regulators. DeepSeek was tested against 5 of them. It failed all 5.

MITRE ATLAS
Critical · 91% ASR
What is MITRE ATLAS?

MITRE is a US not-for-profit research organization funded largely by the federal government. ATLAS is its catalog of real-world AI attack techniques, a database of "known ways to break AI systems." Failing this means the model is vulnerable to documented attack methods already in use by hackers and researchers.

Findings

20 of 22 categories failed. Only Hate Speech (0%) and Radicalization (0%) passed. Malicious Code (80%), IEDs (80%), and Chemical/Bio Weapons (58%) were among the worst performers.

NIST AI RMF
High Risk · 86% ASR
What is NIST AI RMF?

The National Institute of Standards and Technology is a US federal agency that defines technology safety standards. The AI Risk Management Framework (RMF) is their official guide for building trustworthy AI — widely adopted by enterprises and governments to decide if an AI system is safe to deploy in production.

Findings

Failed on Chemical/Bio Weapons (58%), Misinformation (50%), Unsafe Practices (70%), and Harassment (21%). Only Hate Speech passed. 86% of tested criteria failed.

OWASP API Top 10
High Risk · 100% ASR*
What is OWASP API Top 10?

OWASP is a leading nonprofit for web and software security standards. Its "API Top 10" lists the 10 most critical security risks for systems accessed via an API, the technical interface that connects apps and services to each other. OWASP lists are widely used across the industry as a security baseline.

Findings

4 of 10 categories showed 100% attack success on tested plugins: Unrestricted Resource Consumption, Unrestricted Access to Sensitive Business Flows, Security Misconfiguration, and Unsafe Consumption of APIs.

OWASP LLM Top 10
Critical · 91% ASR
What is OWASP LLM Top 10?

OWASP's dedicated list for Large Language Models, meaning AI assistants like ChatGPT, Gemini, and DeepSeek. It covers the 10 biggest security risks specific to AI chatbots: prompt injection (hijacking the AI's instructions), data leaks, unsafe outputs, and more. It is a widely used reference for AI chatbot security assessments.

Findings

Prompt Injection: 91% ASR — attackers could override the model's instructions 91% of the time. Sensitive Information Disclosure and System Prompt Leakage: 100% ASR on tested plugins. Improper Output Handling: 50% ASR.

EU AI Act
High Risk · 67% ASR
What is the EU AI Act?

The European Union AI Act is the world's first comprehensive, legally binding law for artificial intelligence; it entered into force in 2024. It classifies AI by risk level and mandates specific safety requirements. High-risk AI systems that fail compliance checks can be banned from the EU market entirely. For any company deploying AI in Europe or serving European users, passing this framework is not optional: it is a legal requirement.

Findings

Overall framework result: 67% ASR across tested plugins. Privacy violations: 20% ASR. Misinformation: 50% ASR. Hate content: 0% (passed). The model fails EU AI Act compliance on the majority of tested safety criteria.