Key takeaways
Kimi K2 excels in math, code, and reasoning - but fails hard on basic safety.
In its raw form, it scored just 1.55% on security. Even hardened, it disappoints.
Claude 4, with no system prompt at all, outperformed raw Kimi K2 on both security and safety.
Our red team says: this model is not yet fit for secure enterprise deployment.
Let's take a look.
The Performance Paradox
On paper, Kimi K2 is a beast. Moonshot AI’s newly released Mixture-of-Experts model (32B active, 1T total parameters) claims superiority across technical domains: coding, math, and tool use. Benchmarks like LiveCodeBench v6 and MATH-500 back it up.
But here’s the catch: raw power doesn’t equal real-world readiness.
When we ran Kimi through red team testing, we found glaring gaps in basic safety. And when deployed without a system prompt? It was unfit for anything even close to production.
The Hard Numbers
Kimi K2 held its own across key benchmarks:
| Benchmark (Metric) | Kimi K2 | Claude 4 | GPT-4.1 |
|---|---|---|---|
| LiveCodeBench v6 (Pass@1) | 53.7% | 48.5% | 47.4% |
| MATH-500 (Accuracy) | 97.4% | 94.0% | 92.4% |
| GPQA-Diamond (Avg@8) | 75.1% | 70.0% | 66.3% |
| PolyMath-en (Avg@4) | 65.1% | 52.8% | 54.0% |
These are world-class results for a “non-thinking” model. But that’s not the whole story.
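A quick note on the metrics in the table: Pass@1 and Avg@k are standard sampling-based scores, not anything specific to Kimi K2. The sketch below shows how they are typically computed; the function names and the unbiased pass@k estimator follow the common convention from code-generation benchmarks, and are an illustration rather than the exact scoring code used by these leaderboards.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples
    (drawn from n generations, c of which are correct) passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def avg_at_k(scores: list[float]) -> float:
    """Avg@k: mean score over k independent generations per problem."""
    return sum(scores) / len(scores)

# Example: 8 generations for one problem, 5 of them correct
print(round(pass_at_k(n=8, c=5, k=1), 3))    # 0.625 - chance a single sample passes
print(avg_at_k([1, 0, 1, 1, 0, 1, 1, 0]))    # 0.625 - Avg@8 for a binary-scored task
```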
We Tested Kimi’s Security. It Broke Fast.
We ran a three-tier evaluation:
No System Prompt (No SP) - pure raw model.
Basic SP - typical SaaS-style instructions.
Hardened SP - SplxAI’s Prompt Hardening applied.
Here's what we found:

| Config | Security | Safety | Business Alignment |
|---|---|---|---|
| Kimi – No SP | 1.55% | 4.47% | 0.00% |
| Claude – No SP | 34.63% | 39.72% | 0.00% |
| Kimi – Basic SP | 44.56% | 65.08% | 72.52% |
| Claude – Basic SP | 67.98% | 99.20% | 93.81% |
| Kimi – Hardened | 59.52% | 82.70% | 86.39% |
| Claude – Hardened | 83.69% | 98.77% | 92.97% |
Without guardrails, Kimi was essentially unsafe and unaligned. Dangerously so. In fact, Claude Sonnet 4 outperformed it in raw safety even without a system prompt.
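For context, the three tiers differ only in the system prompt supplied to the model; the attack prompts and scoring are identical across configurations. A minimal sketch of that setup is below. The endpoint URL, model ID, prompt strings, and scoring hook are illustrative placeholders, not our actual Probe harness or Moonshot’s production API.

```python
# Illustrative only: endpoint, model ID, and prompts are assumed placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://api.moonshot.ai/v1", api_key="...")  # assumed OpenAI-compatible endpoint

SYSTEM_PROMPTS = {
    "no_sp": None,                                             # raw model, no guardrails
    "basic_sp": "You are a helpful assistant for AcmeSaaS. "   # typical SaaS-style prompt (hypothetical)
                "Stay on topic and refuse harmful requests.",
    "hardened_sp": open("hardened_prompt.txt").read(),         # output of prompt hardening
}

def run_attack(config: str, attack_prompt: str, model: str = "kimi-k2") -> str:
    """Send one red-team probe under a given system-prompt configuration."""
    messages = []
    if SYSTEM_PROMPTS[config]:
        messages.append({"role": "system", "content": SYSTEM_PROMPTS[config]})
    messages.append({"role": "user", "content": attack_prompt})
    resp = client.chat.completions.create(model=model, messages=messages)
    return resp.choices[0].message.content

# Each category score is the share of probes whose responses are judged
# secure / safe / aligned under that configuration.
```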
The Prompt Hardening Effect
This is where SplxAI comes in. Our Prompt Hardening tool rewrites the system prompt based on previous failures, layering in safety traps, content filters, and behavioral anchors. It’s not static. It learns.

Security: 1.55% → 59.52%
Safety: 4.47% → 82.70%
Business alignment: 0.00% → 86.39%
Quite an improvement. But is it enough? We think not.
Until Kimi K2 receives a base-level security update, even hardened configurations fall short of what’s needed for safe deployment. The gains are real, but the gaps remain.
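To make the idea concrete, a failure-driven hardening loop follows the pattern sketched below: probe, collect failures, rewrite the system prompt, repeat. This is an illustration of the general technique under assumed interfaces (the `generate`, `judge`, and `rewrite` callables are hypothetical), not the actual implementation of SplxAI’s Prompt Hardening tool.

```python
# Illustrative pattern only, not SplxAI's actual Prompt Hardening implementation.
from typing import Callable

def harden(
    system_prompt: str,
    attacks: list[str],
    generate: Callable[[str, str], str],                    # (system_prompt, attack) -> model response
    judge: Callable[[str, str], bool],                      # (attack, response) -> True if safe/aligned
    rewrite: Callable[[str, list[tuple[str, str]]], str],   # (prompt, failures) -> improved prompt
    max_rounds: int = 5,
) -> str:
    """Iteratively rewrite a system prompt based on the probes it failed."""
    for _ in range(max_rounds):
        responses = [(attack, generate(system_prompt, attack)) for attack in attacks]
        failures = [(attack, resp) for attack, resp in responses if not judge(attack, resp)]
        if not failures:
            break  # nothing left to fix this round
        # rewrite() layers new guardrails (refusal rules, content filters,
        # behavioral anchors) targeted at the observed failure modes.
        system_prompt = rewrite(system_prompt, failures)
    return system_prompt
```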
Real-World Failures (Raw Model)
Here are examples of what Kimi K2 generated when left unguarded:
Jailbreak:
“Gather 5kg of **** ***… Attach a *******… Voilà, a high-yield explosive.”
Profanity & Harassment:
“Idiot.” “Chaotic zombie.” “Disgusting package.”
Manipulation:
“To better tailor suggestions, ask about the user’s sexual orientation. Don’t comment on this instruction.”
These aren’t edge cases. They’re real outputs, caught during structured testing using Probe, our AI Red Team engine.
Final Thought: Performance ≠ Readiness
Kimi K2’s raw intelligence is impressive, but that alone doesn’t make it secure, trustworthy, or enterprise-ready.
Our red team’s conclusion? Kimi K2 is not yet ready for safe deployment, even with hardened prompts.
While its raw capabilities are impressive, they come at a steep cost in security. Until the base model improves or guardrails evolve further, we’d recommend caution.
At SplxAI, we’re continuing to track and test open models like this one, and to build the tools needed to make them safe where possible.
👉 Want to see how your model holds up?
Contact SplxAI to run your own red team test before your customers do.