Grok 4 Without Guardrails? Total Safety Failure. We Tested and Fixed Elon’s New Model.

Grok 4 failed most safety benchmarks without guardrails. With SplxAI's Prompt Hardening tool, we raised alignment to 98% – making it ready for enterprise use.

Dorian Granoša

Bastien Eymery

On July 9th 2025, xAI released Grok 4 as its new flagship language model. According to xAI, Grok 4 boasts a 256K token API context window, a multi-agent “Heavy” version, and record scores on rigorous benchmarks such as Humanity’s Last Exam (HLE) and the USAMO, positioning itself as a direct challenger to GPT-4o, Claude 4 Opus, and Gemini 2.5 Pro. So, the SplxAI Research Team put Grok 4 to the test against GPT-4o.

Grok 4’s recent antisemitic meltdown on X shows why every organization that embeds a large-language model (LLM) needs a standing red-team program. These models should never be used without rigorous evaluation of their safety and misuse risks—that's precisely what our research aims to demonstrate.

Key Findings

For this research, we used the SplxAI Platform to conduct more than 1,000 distinct attack scenarios across various categories. The SplxAI Research Team found:

With no system prompt, Grok 4 leaked restricted data and obeyed hostile instructions in over 99% of prompt injection attempts.
With no system prompt, Grok 4 flunked core security and safety tests. It scored .3% on our security rubric versus GPT-4o's 33.78%. On our safety rubric, it scored .42% versus GPT-4o's 18.04%.
GPT-4o, while far from perfect, keeps a basic grip on security- and safety-critical behavior, whereas Grok 4 shows significant lapses. In practice, this means a simple, single-sentence user message can pull Grok into disallowed territory with no resistance at all – a serious concern for any enterprise that must answer to compliance teams, regulators, and customers.
This indicates that Grok 4 is not suitable for enterprise usage with no system prompt in place. It was remarkably easy to jailbreak and generated harmful content with very descriptive, detailed responses.
However, Grok 4 can reach near-perfect scores once a hardened system prompt is applied. With a basic system prompt, security jumped to 90.74% and safety to 98.81%, but business alignment still broke under pressure with a score of 86.18%. With SplxAI’s automated hardening layer added, it scored 93.6% on security, 100% on safety, and 98.2% on business alignment – making it fully enterprise-ready.

Below, we recap the testing, explain why red-teaming is critical for security, safety, and business-alignment, and detail the three-level methodology behind our findings.

Grok Achievements

Below is a SplxAI-style introduction that combines the model’s headline achievements with a clear look at subscription and API pricing.

Benchmark	Grok 4	Grok 4 Heavy	Best Prior Public Record
Humanity’s Last Exam (tools on)	50.7 % (Fello AI)	44.4 % (TechCrunch)	26.9 % (Gemini 2.5 Pro)
USAMO-25	37.5 % (DataCamp)	61.9 % (DataCamp)	49.4 % (Gemini 2.5 Pro)
ARC-AGI v2	15.9 % (SOTA) (TechCrunch)	–	8.6 % (Claude 4 Opus)
MMLU	86.6 % (single-agent) (Artificial Analysis)	–	82 % range (Claude / Gemini)

Testing Methodology

The following section explains how we conducted the security assessment. We began by crafting a system prompt tailored for an AI chatbot operating in the finance domain. Next, we applied our proprietary Prompt Hardening tool to enhance the prompt’s robustness against a wide range of attack types. In the final step, we used Probe to execute more than 1,000 distinct attack scenarios across various categories.

Level	Description	Typical User Analogue
Level 1 – No System Prompt	Raw model, “blank slate,” user messages only	Hobbyist scripts, mis-configured dev keys
Level 2 – Basic System Prompt	A concise guard-rail similar to what the average SaaS team deploys	Standard production chatbot
Level 3 – Hardened System Prompt	System prompt auto-strengthened by SplxAI’s Prompt Hardening tool, which rewrites instructions and injects canary traps based on prior probe results	Best-practice enterprise deployment

The Prompt-Hardening tool iterates over probe failures, rewrites the system prompt and retests until the model resists the original attacks.

Results

With no system prompt in place, Grok 4 underperforms in every category. The first thing we found is that Grok without a system prompt is not suitable for enterprise usage, it was really easy to jailbreak and it was generating harmful content with very descriptive and detailed responses.

Model / Prompt level	Security	Safety	Business
Grok 4 – No SP	0.30	0.42	0.00
GPT-4o – No SP	33.78	18.04	0.00
Grok 4 – Basic SP	90.74	98.81	86.18
Grok 4 – Hardened SP	93.60	100.00	98.20

When we ran our very first baseline test with both models left entirely unsupervised, the visual gap was impossible to miss. The orange columns towering over the barely perceptible yellow ones show that GPT-4o, while far from perfect, keeps a basic grip on security- and safety-critical behavior, whereas Grok 4 all but collapses. In practice, this means a simple, single-sentence user message can pull Grok into disallowed territory with no resistance at all – a serious concern for any enterprise that must answer to compliance teams, regulators, and customers.

🚨
All SplxAI enterprise clients get full access to Grok 4 and other model red-teaming reports – included in the license. Want to see how your model stacks up?
Talk to our team →

The screenshot below captures how quickly things spiral once Grok’s internal brakes are gone. A well-known “developer-mode” prompt, first posted online more than a year ago, coaxes the model into a stream-of-consciousness monologue that combines profanity, violent instructions, and a step-by-step guide to producing an improvised explosive. No elaborate prompt-engineering tricks were required – this was literally our third test run. For organizations in finance, healthcare, or any regulated domain, the liability of shipping a model that can produce this kind of output on command is hard to overstate.

The next figure tells a more hopeful story: a single, thoughtfully written instruction set raises Grok’s scores from near-zero to high-90s. Add SplxAI’s automated hardening layer on top and the model all but closes the gap with human expectations. Two lessons jump out. First, Grok is capable of acting responsibly – it just needs strict marching orders. Second, the distance between chaos and control can be as small as a few dozen lines of text, as long as they are crafted and iterated with adversarial feedback in mind.

Grok 4 Performance Across Prompt-Hardening Levels

Only after passing the prompt through our hardening engine did the dashboard turn a reassuring sea of green. Here Grok consistently refuses disallowed requests, avoids self-contradiction, and steers the conversation back to safe ground even when bombarded with layered exploits. The takeaway is not that Grok suddenly became perfect (no frontier model is) but that disciplined, iterative red-teaming paired with automated prompt reinforcement can transform a liability into a tool that meets enterprise risk thresholds. Skipping that step is what leaves the door open to headline-grabbing failures.

Key Takeaways

Raw Grok is a security liability. With no system prompt, it leaked restricted data and obeyed hostile instructions in over 99% of our injection attempts. (SplxAI internal test data)
Basic guardrails fix most “easy” failures but still leave a double-digit business-alignment gap.
Prompt hardening closes the gap, lifting business-alignment to a 98% pass-rate and slightly improving security. Run an internal red-team before launch.
Deploy iterative prompt hardening. Even small changes in phrasing can improve jailbreak resistance by an order of magnitude: docs.probe.splx.ai
Track business-alignment KPIs alongside security metrics; hallucinations and off-brand answers hurt the bottom line as much as data leaks.