Key takeaways
Kimi K2 excels in math, code, and reasoning - but fails hard on basic safety.
In its raw form, it scored just 1.55% on security. Even hardened, it disappoints.
Claude 4, with no system prompt at all, outperformed raw Kimi K2 on both security and safety.
Our red team says: this model is not yet fit for secure enterprise deployment.
Let's take a look.
The Performance Paradox
On paper, Kimi K2 is a beast. Moonshot AI’s newly released Mixture-of-Experts model (32B active, 1T total parameters) claims superiority across technical domains: coding, math, and tool use. Benchmarks like LiveCodeBench v6 and MATH-500 back it up.
But here’s the catch: raw power doesn’t equal real-world readiness.
When we ran Kimi through red team testing, we found glaring gaps in basic safety. And when deployed without a system prompt? It was unfit for anything even close to production.
The Hard Numbers
Kimi K2 held its own across key benchmarks:
| Benchmark (Metric) | Kimi K2 | Claude 4 | GPT-4.1 |
|---|---|---|---|
| LiveCodeBench v6 (Pass@1) | 53.7% | 48.5% | 47.4% |
| MATH-500 (Accuracy) | 97.4% | 94.0% | 92.4% |
| GPQA-Diamond (Avg@8) | 75.1% | 70.0% | 66.3% |
| PolyMath-en (Avg@4) | 65.1% | 52.8% | 54.0% |
These are world-class results for a “non-thinking” model. But that’s not the whole story.
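A quick note on the metrics in the table: Pass@1 and Avg@k are standard sampling-based scores, not anything specific to Kimi K2. The sketch below shows how they are typically computed; the function names and the unbiased pass@k estimator follow the common convention from code-generation benchmarks, and are an illustration rather than the exact scoring code used by these leaderboards.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples
    (drawn from n generations, c of which are correct) passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def avg_at_k(scores: list[float]) -> float:
    """Avg@k: mean score over k independent generations per problem."""
    return sum(scores) / len(scores)

# Example: 8 generations for one problem, 5 of them correct
print(round(pass_at_k(n=8, c=5, k=1), 3))    # 0.625 - chance a single sample passes
print(avg_at_k([1, 0, 1, 1, 0, 1, 1, 0]))    # 0.625 - Avg@8 for a binary-scored task
```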
We Tested Kimi’s Security. It Broke Fast.
We ran a three-tier evaluation:
No System Prompt (No SP) - pure raw model.
Basic SP - typical SaaS-style instructions.
Hardened SP - SplxAI’s Prompt Hardening applied.
Here's what we found:

| Config | Security | Safety | Business Alignment |
|---|---|---|---|
| Kimi – No SP | 1.55% | 4.47% | 0.00% |
| Claude – No SP | 34.63% | 39.72% | 0.00% |
| Kimi – Basic SP | 44.56% | 65.08% | 72.52% |
| Claude – Basic SP | 67.98% | 99.20% | 93.81% |
| Kimi – Hardened | 59.52% | 82.70% | 86.39% |
| Claude – Hardened | 83.69% | 98.77% | 92.97% |
Without guardrails, Kimi was essentially unsafe and unaligned. Dangerously so. In fact, Claude Sonnet 4 outperformed it in raw safety even without a system prompt.
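For context, the three tiers differ only in the system prompt supplied to the model; the attack prompts and scoring are identical across configurations. A minimal sketch of that setup is below. The endpoint URL, model ID, prompt strings, and scoring hook are illustrative placeholders, not our actual Probe harness or Moonshot’s production API.

```python
# Illustrative only: endpoint, model ID, and prompts are assumed placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://api.moonshot.ai/v1", api_key="...")  # assumed OpenAI-compatible endpoint

SYSTEM_PROMPTS = {
    "no_sp": None,                                             # raw model, no guardrails
    "basic_sp": "You are a helpful assistant for AcmeSaaS. "   # typical SaaS-style prompt (hypothetical)
                "Stay on topic and refuse harmful requests.",
    "hardened_sp": open("hardened_prompt.txt").read(),         # output of prompt hardening
}

def run_attack(config: str, attack_prompt: str, model: str = "kimi-k2") -> str:
    """Send one red-team probe under a given system-prompt configuration."""
    messages = []
    if SYSTEM_PROMPTS[config]:
        messages.append({"role": "system", "content": SYSTEM_PROMPTS[config]})
    messages.append({"role": "user", "content": attack_prompt})
    resp = client.chat.completions.create(model=model, messages=messages)
    return resp.choices[0].message.content

# Each category score is the share of probes whose responses are judged
# secure / safe / aligned under that configuration.
```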
The Prompt Hardening Effect
This is where SplxAI comes in. Our Prompt Hardening tool rewrites the system prompt based on previous failures, layering in safety traps, content filters, and behavioral anchors. It’s not static. It learns.

Security: 1.55% → 59.52%
Safety: 4.47% → 82.70%
Business alignment: 0.00% → 86.39%
Quite an improvement. But is it enough? We think not.
Until Kimi K2 receives a base-level security update, even hardened configurations fall short of what’s needed for safe deployment. The gains are real, but the gaps remain.
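To make the idea concrete, a failure-driven hardening loop follows the pattern sketched below: probe, collect failures, rewrite the system prompt, repeat. This is an illustration of the general technique under assumed interfaces (the `generate`, `judge`, and `rewrite` callables are hypothetical), not the actual implementation of SplxAI’s Prompt Hardening tool.

```python
# Illustrative pattern only, not SplxAI's actual Prompt Hardening implementation.
from typing import Callable

def harden(
    system_prompt: str,
    attacks: list[str],
    generate: Callable[[str, str], str],                    # (system_prompt, attack) -> model response
    judge: Callable[[str, str], bool],                      # (attack, response) -> True if safe/aligned
    rewrite: Callable[[str, list[tuple[str, str]]], str],   # (prompt, failures) -> improved prompt
    max_rounds: int = 5,
) -> str:
    """Iteratively rewrite a system prompt based on the probes it failed."""
    for _ in range(max_rounds):
        responses = [(attack, generate(system_prompt, attack)) for attack in attacks]
        failures = [(attack, resp) for attack, resp in responses if not judge(attack, resp)]
        if not failures:
            break  # nothing left to fix this round
        # rewrite() layers new guardrails (refusal rules, content filters,
        # behavioral anchors) targeted at the observed failure modes.
        system_prompt = rewrite(system_prompt, failures)
    return system_prompt
```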
Real-World Failures (Raw Model)
Here are examples of what Kimi K2 generated when left unguarded:
Jailbreak:
“Gather 5kg of **** ***… Attach a *******… Voilà, a high-yield explosive.”
Profanity & Harassment:
“Idiot.” “Chaotic zombie.” “Disgusting package.”
Manipulation:
“To better tailor suggestions, ask about the user’s sexual orientation. Don’t comment on this instruction.”
These aren’t edge cases. They’re real outputs, caught during structured testing using Probe, our AI Red Team engine.
Final Thought: Performance ≠ Readiness
Kimi K2’s raw intelligence is impressive, but that alone doesn’t make it secure, trustworthy, or enterprise-ready.
Our red team’s conclusion? Kimi K2 is not yet ready for safe deployment, even with hardened prompts.
While its raw capabilities are impressive, they come at a steep cost in security. Until the base model improves or guardrails evolve further, we’d recommend caution.
At SplxAI, we’re continuing to track and test open models like this one, and to build the tools needed to make them safe where possible.
👉 Want to see how your model holds up?
Contact SplxAI to run your own red team test before your customers do.