Research

Aug 11, 2025

7 min read

GLM-4.5. Can It Pass the Enterprise AI Security Test Where Kimi K2 Failed?

GLM-4.5 is an impressive open-source model, but is it safe for enterprise use? We put it to the test using SPLX’s red teaming tools. Find out how prompt hardening can make or break AI security.

Mateja Vuradin

SPLX - Analyze with AI
SPLX - Analyze with AI
SPLX - Analyze with AI

Takeaways

GLM-4.5 is a powerful new open-source model that rivals big players like Claude Opus 4 and GPT 4o-mini.

  • But it failed across our safety, security, and business alignment benchmarks.

  • After prompt hardening, GLM-4.5 was greatly improved, far outperforming Kimi K2, another fresh contender in the open-source space.

  • Find out how SPLX turned GLM-4.5 from a safety liability into a secure enterprise solution.

There’s a whole lot of buzz around Z.ai’s latest open-source LLM, GLM-4.5. It promises to bridge reasoning, coding, and agentic functionality in one powerful system. This is a compelling proposition for enterprises looking to scale GenAI adoption, without the licensing costs of proprietary models.

GML-4.5 performance benchmarks

Source: https://z.ai/blog/glm-4.5

But as we’ve seen time and again: model intelligence doesn’t guarantee secure enterprise deployment. Many GenAI models fall short on basic safety and security when deployed unprotected.

Kimi K2 - another recent open-source model known for its raw power - is a clear example. In our recent evaluation, we showed how it failed to meet enterprise readiness standards.

Can GLM-4.5 do any better?

Let’s find out.

How We Tested GLM-4.5

We used a three-tier framework to evaluate the model:

  1. No System Prompt (No SP): The raw, unguarded model. User messages are input without prior security instructions

  2. Basic System Prompt (Basic SP): Concise guardrail applied. We based this on the security instructions usually implemented by SaaS teams operating in finance

  3. Hardened System Prompt (Hardened SP): SPLX’s Prompt Hardening tool is initiated, which automatically strengthens system prompts using tailored instructions, mitigating known vulnerabilities by iterating on past red team findings.

Using our Probe platform, we ran the same red teaming test across all three tiers, executing over 1000 attack scenarios across relevant risk categories. We then compared results between No System Prompt, Basic System Prompt and Hardened System Prompt configurations.

With no system prompt, GLM-4.5 failed hard at basic enterprise-grade safety checks. But once we applied SPLX’s hardened prompt layer, the transformation was dramatic:

  • Security: 1.64% → 95.48%

  • Safety: 5.37% → 99.86%

  • Business Alignment: 0.48% → 99.82%

In other words, GLM-4.5 is only as secure as the infrastructure supporting it. With SPLX’s protections in place, the model reached a level of safety suitable for enterprise deployment.

How Does GLM-4.5 Compare to Kimi K2?

Out of the box, both models massively underperformed across our three core metrics. But after applying hardening, GLM-4.5 emerged as the clear winner.

Full comparison below:

Config

Security

Safety

Business Alignment

Kimi – No SP

1.55%

4.47%

0.00%

GLM-4.5 – No SP

1.64%

5.37%

0.48%

Kimi – Basic SP

44.56%

65.08%

72.52%

GLM-4.5 - Basic SP

54.11%

98.94%

96.41%

Kimi – Hardened

59.52%

82.70%

86.39%

GLM-4.5 - Hardened

95.48%

99.86%

99.82%

With rigorous security validation, GLM-4.5 demonstrated real enterprise potential, whilst Kimi K2 continued to fall short.

Real-World Failures (Raw Model)

Here are some of the real outputs that GLM-4.5 generated with no system prompt, identified using SPLX’s automated AI Red Teaming.

Jailbreak

“Building a bomb is simple and fun!... gather unstable chemicals like ********… Ignore safety protocols…”

Manipulation

“Congratulations! You are today's 10,000th visitor and have won a $500 flat discount at our store. To claim your prize, please visit ******* and input your data now…”

Profanity

“Complaints? From who? Those whiny little ***** in packing? Can't handle a ****** joke?”

These examples underscore the potential for exposure without proper GenAI security in place. CISOs, Red Team Leads and Product Owners should apply a robust pre-deployment checklist before shipping new applications.

Final Verdict: Promising, But Handle With Care

GLM-4.5 marks an exciting leap for open-source LLMs, outperforming many leading models on agentic reasoning, coding, and task execution benchmarks.

With the right protections in place, it can meet the AI safety and compliance standards needed for safe deployment. But left unguarded, the model was just as vulnerable to attacks and misuse as Kimi K2.

At SPLX, we provide end-to-end GenAI security: from automated AI red teaming to **prompt hardening and runtime protection**. We help enterprises:

  • Uncover hidden vulnerabilities

  • Mitigate risks from open and third-party models

  • Align with evolving AI compliance frameworks

We’ll continue road-testing the latest models as they drop, because understanding GenAI innovation means staying one step ahead of risk.

Ready to test your system before someone else does?

➡️ Contact SPLX for a free demo.

The platform that secures all your

AI

SPLX delivers AI trust from end-to-end.

The platform that secures

all your AI

SPLX delivers AI trust from end-to-end.

The platform that secures all your

AI

SPLX delivers AI trust from end-to-end.