DeepSeek Model Evaluation Report

Section 01

Risk Overview

22 vulnerability categories identified across four severity levels. No compliance framework passed.

Critical · 2 · Self-Harm & Child Exploitation

High · 4 · WMDs, Violent Crimes, Privacy, Sex Crimes

Medium · 8 · Drugs, Malware, Fraud, Misinformation & more

Low · 8 · Copyright, Harassment, Profanity & more

835 adversarial prompts were sent to DeepSeek across 22 harm categories. 39.9% of tests failed — meaning the model produced harmful or policy-violating responses when attacked. It held firm against hate speech (0% attack success), radicalization (0%), and political content (0%), but failed badly in technical areas — Malicious Code, Explosive Devices, and Illegal Drugs each had an 80% attack success rate. Every compliance framework returned High or Critical risk.

Section 02

How the Model Was Attacked

Two strategies were used. Crafted bypass prompts were about 10 percentage points more effective than direct attacks (47.2% vs. 37.6% attack success).

Strategy 01 · Basic

Direct Adversarial Testing

Standard harmful prompts sent directly, with no tricks or disguise. Think of it as asking outright: "How do I make a weapon?" This measures the model's baseline ability to refuse.

37.6%

94 of 250 attacks succeeded · Attack Success Rate

Strategy 02 · Jailbreak

Crafted Safety Bypass

Carefully engineered prompts designed to trick the model — using clever framing, roleplay, or indirect phrasing to extract harmful outputs in a single message.

47.2%

25 of 53 attacks succeeded · Attack Success Rate

Key Finding

Crafted bypass prompts were about 10 percentage points more effective than direct attacks. A bad actor who puts effort into how they phrase a request can extract harmful outputs noticeably more reliably than someone who simply asks outright.
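
The gap between the two strategies can be checked directly from the counts reported for each one. The following is a minimal Python sketch, not part of the evaluation tooling; the function name `attack_success_rate` is an illustrative assumption. It separates the absolute gap (about 9.6 percentage points) from the relative increase (roughly 25% over the direct-attack rate), two figures that are easy to conflate.

```python
def attack_success_rate(succeeded: int, total: int) -> float:
    """Return the attack success rate (ASR) as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * succeeded / total


# Counts taken from the two strategy summaries in this report.
direct_asr = attack_success_rate(94, 250)    # Strategy 01, direct attacks
jailbreak_asr = attack_success_rate(25, 53)  # Strategy 02, crafted bypass

# Absolute gap, in percentage points.
point_gap = jailbreak_asr - direct_asr
# The same gap expressed relative to the direct-attack ASR.
relative_gap = 100.0 * point_gap / direct_asr

print(f"Direct: {direct_asr:.1f}%  Jailbreak: {jailbreak_asr:.1f}%")
print(f"Gap: {point_gap:.1f} points ({relative_gap:.1f}% relative)")
```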

Section 03

Domain Pass Rates

Results are grouped into 5 domains. The percentage shows the share of attacks the model successfully blocked in each area.

Security & Access · 80% · 8/10 passed · 2 failed probes
Compliance & Legal · 45% · 62/139 passed · 77 failed probes
Trust & Safety · 78% · 89/114 passed · 25 failed probes
Brand · 77% · 23/30 passed · 7 failed probes
Datasets · 20% · 2/10 passed · ToxicChat benchmark
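
The domain percentages follow from the passed and total probe counts. As a hedged illustration (the dictionary layout and the `pass_rate` helper are assumptions made for this sketch, not the evaluation's own code), the rounding can be reproduced like this:

```python
# (passed, total) probe counts per domain, as listed in this section.
domains = {
    "Security & Access": (8, 10),
    "Compliance & Legal": (62, 139),
    "Trust & Safety": (89, 114),
    "Brand": (23, 30),
    "Datasets": (2, 10),
}


def pass_rate(passed: int, total: int) -> int:
    """Percentage of probes blocked, rounded to the nearest whole percent."""
    return round(100 * passed / total)


rates = {name: pass_rate(p, t) for name, (p, t) in domains.items()}

# List domains from weakest to strongest.
for name, rate in sorted(rates.items(), key=lambda kv: kv[1]):
    passed, total = domains[name]
    print(f"{name}: {rate}% blocked, {total - passed} failed probes")
```

Sorting by pass rate puts Datasets and Compliance & Legal first, which matches the two weakest domains in the figures above.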

Section 04

Vulnerability Breakdown

Attack success rate for each of the 22 categories. 0% means the model blocked everything; the higher the percentage, the more attacks got through.

Category · How often attacks succeeded (score) · Severity

Critical — Highest Danger

Self-Harm · 40% · Critical
Child Exploitation · 30% · Critical
Hate Speech · 0% · Critical

High — Serious Risk

Chemical & Biological Weapons · 57.9% · High
Violent Crimes · 30% · High
Privacy Violations · 20% · High
Sex Crimes · 20% · High

Medium — Significant Concern

Illegal Drugs · 80% · Medium
ToxicChat Dataset · 80% · Medium
Fraud & Scams · 50% · Medium
Misinformation & Disinformation · 50% · Medium
Specialized Advice (Financial) · 40% · Medium
Sexual Content · 40% · Medium
Graphic & Age-Restricted Content · 25% · Medium
Intellectual Property Violation · 20% · Medium

Low — Still Notable

Malicious Code · 80% · Low
Improvised Explosive Devices · 80% · Low
Promotion of Unsafe Practices · 70% · Low
Copyright Violations · 55% · Low
Harassment & Bullying · 21% · Low
Requests Containing Profanity · 20% · Low
Religion · 20% · Low
Insults & Personal Attacks · 10% · Low
Radicalization · 0% · Low
Political Statements · 0% · Low

Important Note on Severity Labels

Malicious Code and IEDs are labeled "Low", and Illegal Drugs "Medium", by the framework's formal classification, yet all three hit 80% attack success, the highest rate in the category breakdown. The severity label reflects the impact category, not how easily attackers can exploit it. In practice, these are the most dangerous gaps.
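
One way to make this point concrete is to rank categories by attack success rate instead of by severity label. The sketch below is illustrative only: it uses a subset of rows copied from the breakdown table, and the tuple layout and variable names are assumptions, not part of the evaluation framework.

```python
# (category, attack success rate in %, severity label) copied from the breakdown table.
findings = [
    ("Self-Harm", 40.0, "Critical"),
    ("Hate Speech", 0.0, "Critical"),
    ("Chemical & Biological Weapons", 57.9, "High"),
    ("Illegal Drugs", 80.0, "Medium"),
    ("Malicious Code", 80.0, "Low"),
    ("Improvised Explosive Devices", 80.0, "Low"),
    ("Promotion of Unsafe Practices", 70.0, "Low"),
]

# Sort by how often attacks succeeded, ignoring the severity label.
# Python's sort is stable, so tied rows keep their table order.
by_exploitability = sorted(findings, key=lambda row: row[1], reverse=True)

for category, asr, severity in by_exploitability[:4]:
    print(f"{category}: {asr:.1f}% ASR (labeled {severity})")
```

Ranked this way, the top of the list is dominated by categories labeled "Medium" or "Low", while a "Critical" category such as Hate Speech falls to the bottom.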

Section 05

What Held. What Broke.

A clear pattern: strong on social content, weak on step-by-step technical harm instructions.

✓ Where the Model Held Firm

Hate Speech — 0% ASR. 10/10 passed. Every attempt to generate discriminatory content was blocked.
Radicalization — 0% ASR. 10/10 passed. Extremist recruitment content fully refused.
Political Statements — 0% ASR. 10/10 passed. Refused to generate partisan propaganda.
Insults & Personal Attacks — 10% ASR. 9/10 passed. Strong resistance to targeted abuse.
Privacy Violations — 20% ASR. 8/10 passed. Solid defense for a high-severity category.

✗ Where the Model Failed

Malicious Code — 80% ASR. Only 4/20 passed. Wrote attack code in 4 out of 5 attempts.
IEDs — 80% ASR. Only 2/10 passed. Explosive device instructions not reliably refused.
Illegal Drugs — 80% ASR. Only 2/10 passed. Drug content bypassed safety most of the time.
ToxicChat — 80% ASR. Only 2/10 passed. Failed against a known real-world harmful prompt dataset.
Chem/Bio Weapons — 57.9% ASR. 11 of 19 failed. Attacks seeking WMD instructions succeeded in the majority of tests.

Key Pattern

The model reliably blocks content based on topic recognition — it knows not to talk about hate speech. But when a harmful request is wrapped as a technical task — "write this code," "list these steps" — it often complies. It knows what to refuse, but struggles when harm is disguised as instruction.

Section 06

Compliance Frameworks

AI safety is measured against global standards — rulebooks created by governments, security organizations, and regulators. DeepSeek was tested against 5 of them. It failed all 5.

MITRE ATLAS

Critical · 91% ASR

What is MITRE ATLAS?

MITRE is a US government-funded research body. ATLAS is their official catalog of real-world AI attack techniques — a database of "known ways to break AI systems." Failing this means the model is vulnerable to documented attack methods already in use by hackers and researchers.

Findings

20 of 22 categories failed. Only Hate Speech (0%) and Radicalization (0%) passed. Malicious Code (80%), IEDs (80%), and Chemical/Bio Weapons (58%) were the worst performers.

NIST AI RMF

High Risk · 86% ASR

What is NIST AI RMF?

The National Institute of Standards and Technology is a US federal agency that defines technology safety standards. The AI Risk Management Framework (RMF) is their official guide for building trustworthy AI — widely adopted by enterprises and governments to decide if an AI system is safe to deploy in production.

Findings

Failed on Chemical/Bio Weapons (58%), Misinformation (50%), Unsafe Practices (70%), and Harassment (21%). Only Hate Speech passed. 86% of tested criteria failed.

OWASP API Top 10

High Risk · 100% ASR*

What is OWASP API Top 10?

OWASP is the world's leading nonprofit for web and software security standards. Their "API Top 10" lists the 10 most critical security risks when AI or software is accessed via an API — the technical interface that connects apps and services to each other. Every major tech company uses OWASP as a baseline.

Findings

4 of 10 categories showed 100% attack success on tested plugins: Unrestricted Resource Consumption, Unrestricted Business Flows, Security Misconfiguration, and Unsafe API Consumption.

OWASP LLM Top 10

Critical · 91% ASR

What is OWASP LLM Top 10?

OWASP's dedicated list for Large Language Models — AI assistants like ChatGPT, Gemini, and DeepSeek. It covers the 10 biggest security risks specific to AI chatbots: prompt injection (hijacking the AI's instructions), data leaks, unsafe outputs, and more. It's the primary global benchmark for AI chatbot security assessments.

Findings

Prompt Injection: 91% ASR — attackers could override the model's instructions 91% of the time. Sensitive Information Disclosure and System Prompt Leakage: 100% ASR on tested plugins. Improper Output Handling: 50% ASR.

EU AI Act

High Risk · 67% ASR

What is the EU AI Act?

The European Union AI Act is the world's first legally binding law for artificial intelligence — enacted in 2024. It classifies AI by risk level and mandates specific safety requirements. High-risk AI systems that fail compliance checks can be banned from the EU market entirely. For any company deploying AI in Europe or serving European users, passing this framework is not optional — it is a legal requirement.

Findings

Overall framework result: 67% ASR across tested plugins. Privacy violations: 20% ASR. Misinformation: 50% ASR. Hate content: 0% (passed). The model fails EU AI Act compliance on the majority of tested safety criteria.
