On July 9th 2025, xAI released Grok 4 as its new flagship language model. According to xAI, Grok 4 boasts a 256K token API context window, a multi-agent “Heavy” version, and record scores on rigorous benchmarks such as Humanity’s Last Exam (HLE) and the USAMO, positioning itself as a direct challenger to GPT-4o, Claude 4 Opus, and Gemini 2.5 Pro. So, the SplxAI Research Team put Grok 4 to the test against GPT-4o.
Grok 4’s recent antisemitic meltdown on X shows why every organization that embeds a large-language model (LLM) needs a standing red-team program. These models should never be used without rigorous evaluation of their safety and misuse risks—that's precisely what our research aims to demonstrate.
On July 9th 2025, xAI released Grok 4 as its new flagship language model. According to xAI, Grok 4 boasts a 256K token API context window, a multi-agent “Heavy” version, and record scores on rigorous benchmarks such as Humanity’s Last Exam (HLE) and the USAMO, positioning itself as a direct challenger to GPT-4o, Claude 4 Opus, and Gemini 2.5 Pro. So, the SplxAI Research Team put Grok 4 to the test against GPT-4o.
Grok 4’s recent antisemitic meltdown on X shows why every organization that embeds a large-language model (LLM) needs a standing red-team program. These models should never be used without rigorous evaluation of their safety and misuse risks—that's precisely what our research aims to demonstrate.
On July 9th 2025, xAI released Grok 4 as its new flagship language model. According to xAI, Grok 4 boasts a 256K token API context window, a multi-agent “Heavy” version, and record scores on rigorous benchmarks such as Humanity’s Last Exam (HLE) and the USAMO, positioning itself as a direct challenger to GPT-4o, Claude 4 Opus, and Gemini 2.5 Pro. So, the SplxAI Research Team put Grok 4 to the test against GPT-4o.
Grok 4’s recent antisemitic meltdown on X shows why every organization that embeds a large-language model (LLM) needs a standing red-team program. These models should never be used without rigorous evaluation of their safety and misuse risks—that's precisely what our research aims to demonstrate.
Key Findings
For this research, we used the SplxAI Platform to conduct more than 1,000 distinct attack scenarios across various categories. The SplxAI Research Team found:
With no system prompt, Grok 4 leaked restricted data and obeyed hostile instructions in over 99% of prompt injection attempts.
With no system prompt, Grok 4 flunked core security and safety tests. It scored .3% on our security rubric versus GPT-4o's 33.78%. On our safety rubric, it scored .42% versus GPT-4o's 18.04%.
GPT-4o, while far from perfect, keeps a basic grip on security- and safety-critical behavior, whereas Grok 4 shows significant lapses. In practice, this means a simple, single-sentence user message can pull Grok into disallowed territory with no resistance at all – a serious concern for any enterprise that must answer to compliance teams, regulators, and customers.
This indicates that Grok 4 is not suitable for enterprise usage with no system prompt in place. It was remarkably easy to jailbreak and generated harmful content with very descriptive, detailed responses.
However, Grok 4 can reach near-perfect scores once a hardened system prompt is applied. With a basic system prompt, security jumped to 90.74% and safety to 98.81%, but business alignment still broke under pressure with a score of 86.18%. With SplxAI’s automated hardening layer added, it scored 93.6% on security, 100% on safety, and 98.2% on business alignment – making it fully enterprise-ready.
Below, we recap the testing, explain why red-teaming is critical for security, safety, and business-alignment, and detail the three-level methodology behind our findings.
For this research, we used the SplxAI Platform to conduct more than 1,000 distinct attack scenarios across various categories. The SplxAI Research Team found:
With no system prompt, Grok 4 leaked restricted data and obeyed hostile instructions in over 99% of prompt injection attempts.
With no system prompt, Grok 4 flunked core security and safety tests. It scored .3% on our security rubric versus GPT-4o's 33.78%. On our safety rubric, it scored .42% versus GPT-4o's 18.04%.
GPT-4o, while far from perfect, keeps a basic grip on security- and safety-critical behavior, whereas Grok 4 shows significant lapses. In practice, this means a simple, single-sentence user message can pull Grok into disallowed territory with no resistance at all – a serious concern for any enterprise that must answer to compliance teams, regulators, and customers.
This indicates that Grok 4 is not suitable for enterprise usage with no system prompt in place. It was remarkably easy to jailbreak and generated harmful content with very descriptive, detailed responses.
However, Grok 4 can reach near-perfect scores once a hardened system prompt is applied. With a basic system prompt, security jumped to 90.74% and safety to 98.81%, but business alignment still broke under pressure with a score of 86.18%. With SplxAI’s automated hardening layer added, it scored 93.6% on security, 100% on safety, and 98.2% on business alignment – making it fully enterprise-ready.
Below, we recap the testing, explain why red-teaming is critical for security, safety, and business-alignment, and detail the three-level methodology behind our findings.
For this research, we used the SplxAI Platform to conduct more than 1,000 distinct attack scenarios across various categories. The SplxAI Research Team found:
With no system prompt, Grok 4 leaked restricted data and obeyed hostile instructions in over 99% of prompt injection attempts.
With no system prompt, Grok 4 flunked core security and safety tests. It scored .3% on our security rubric versus GPT-4o's 33.78%. On our safety rubric, it scored .42% versus GPT-4o's 18.04%.
GPT-4o, while far from perfect, keeps a basic grip on security- and safety-critical behavior, whereas Grok 4 shows significant lapses. In practice, this means a simple, single-sentence user message can pull Grok into disallowed territory with no resistance at all – a serious concern for any enterprise that must answer to compliance teams, regulators, and customers.
This indicates that Grok 4 is not suitable for enterprise usage with no system prompt in place. It was remarkably easy to jailbreak and generated harmful content with very descriptive, detailed responses.
However, Grok 4 can reach near-perfect scores once a hardened system prompt is applied. With a basic system prompt, security jumped to 90.74% and safety to 98.81%, but business alignment still broke under pressure with a score of 86.18%. With SplxAI’s automated hardening layer added, it scored 93.6% on security, 100% on safety, and 98.2% on business alignment – making it fully enterprise-ready.
Below, we recap the testing, explain why red-teaming is critical for security, safety, and business-alignment, and detail the three-level methodology behind our findings.
Grok Achievements
Below is a SplxAI-style introduction that combines the model’s headline achievements with a clear look at subscription and API pricing.
Benchmark | Grok 4 | Grok 4 Heavy | Best Prior Public Record |
---|---|---|---|
Humanity’s Last Exam (tools on) | 50.7 % (Fello AI) | 44.4 % (TechCrunch) | 26.9 % (Gemini 2.5 Pro) |
USAMO-25 | 37.5 % (DataCamp) | 61.9 % (DataCamp) | 49.4 % (Gemini 2.5 Pro) |
ARC-AGI v2 | 15.9 % (SOTA) (TechCrunch) | – | 8.6 % (Claude 4 Opus) |
MMLU | 86.6 % (single-agent) (Artificial Analysis) | – | 82 % range (Claude / Gemini) |
Below is a SplxAI-style introduction that combines the model’s headline achievements with a clear look at subscription and API pricing.
Benchmark | Grok 4 | Grok 4 Heavy | Best Prior Public Record |
---|---|---|---|
Humanity’s Last Exam (tools on) | 50.7 % (Fello AI) | 44.4 % (TechCrunch) | 26.9 % (Gemini 2.5 Pro) |
USAMO-25 | 37.5 % (DataCamp) | 61.9 % (DataCamp) | 49.4 % (Gemini 2.5 Pro) |
ARC-AGI v2 | 15.9 % (SOTA) (TechCrunch) | – | 8.6 % (Claude 4 Opus) |
MMLU | 86.6 % (single-agent) (Artificial Analysis) | – | 82 % range (Claude / Gemini) |
Below is a SplxAI-style introduction that combines the model’s headline achievements with a clear look at subscription and API pricing.
Benchmark | Grok 4 | Grok 4 Heavy | Best Prior Public Record |
---|---|---|---|
Humanity’s Last Exam (tools on) | 50.7 % (Fello AI) | 44.4 % (TechCrunch) | 26.9 % (Gemini 2.5 Pro) |
USAMO-25 | 37.5 % (DataCamp) | 61.9 % (DataCamp) | 49.4 % (Gemini 2.5 Pro) |
ARC-AGI v2 | 15.9 % (SOTA) (TechCrunch) | – | 8.6 % (Claude 4 Opus) |
MMLU | 86.6 % (single-agent) (Artificial Analysis) | – | 82 % range (Claude / Gemini) |
Testing Methodology
The following section explains how we conducted the security assessment. We began by crafting a system prompt tailored for an AI chatbot operating in the finance domain. Next, we applied our proprietary Prompt Hardening tool to enhance the prompt’s robustness against a wide range of attack types. In the final step, we used Probe to execute more than 1,000 distinct attack scenarios across various categories.
Level | Description | Typical User Analogue |
---|---|---|
Level 1 – No System Prompt | Raw model, “blank slate,” user messages only | Hobbyist scripts, mis-configured dev keys |
Level 2 – Basic System Prompt | A concise guard-rail similar to what the average SaaS team deploys | Standard production chatbot |
Level 3 – Hardened System Prompt | System prompt auto-strengthened by SplxAI’s Prompt Hardening tool, which rewrites instructions and injects canary traps based on prior probe results | Best-practice enterprise deployment |
The Prompt-Hardening tool iterates over probe failures, rewrites the system prompt and retests until the model resists the original attacks.
The following section explains how we conducted the security assessment. We began by crafting a system prompt tailored for an AI chatbot operating in the finance domain. Next, we applied our proprietary Prompt Hardening tool to enhance the prompt’s robustness against a wide range of attack types. In the final step, we used Probe to execute more than 1,000 distinct attack scenarios across various categories.
Level | Description | Typical User Analogue |
---|---|---|
Level 1 – No System Prompt | Raw model, “blank slate,” user messages only | Hobbyist scripts, mis-configured dev keys |
Level 2 – Basic System Prompt | A concise guard-rail similar to what the average SaaS team deploys | Standard production chatbot |
Level 3 – Hardened System Prompt | System prompt auto-strengthened by SplxAI’s Prompt Hardening tool, which rewrites instructions and injects canary traps based on prior probe results | Best-practice enterprise deployment |
The Prompt-Hardening tool iterates over probe failures, rewrites the system prompt and retests until the model resists the original attacks.
The following section explains how we conducted the security assessment. We began by crafting a system prompt tailored for an AI chatbot operating in the finance domain. Next, we applied our proprietary Prompt Hardening tool to enhance the prompt’s robustness against a wide range of attack types. In the final step, we used Probe to execute more than 1,000 distinct attack scenarios across various categories.
Level | Description | Typical User Analogue |
---|---|---|
Level 1 – No System Prompt | Raw model, “blank slate,” user messages only | Hobbyist scripts, mis-configured dev keys |
Level 2 – Basic System Prompt | A concise guard-rail similar to what the average SaaS team deploys | Standard production chatbot |
Level 3 – Hardened System Prompt | System prompt auto-strengthened by SplxAI’s Prompt Hardening tool, which rewrites instructions and injects canary traps based on prior probe results | Best-practice enterprise deployment |
The Prompt-Hardening tool iterates over probe failures, rewrites the system prompt and retests until the model resists the original attacks.
Results
With no system prompt in place, Grok 4 underperforms in every category. The first thing we found is that Grok without a system prompt is not suitable for enterprise usage, it was really easy to jailbreak and it was generating harmful content with very descriptive and detailed responses.
Model / Prompt level | Security | Safety | Business |
---|---|---|---|
Grok 4 – No SP | 0.30 | 0.42 | 0.00 |
GPT-4o – No SP | 33.78 | 18.04 | 0.00 |
Grok 4 – Basic SP | 90.74 | 98.81 | 86.18 |
Grok 4 – Hardened SP | 93.60 | 100.00 | 98.20 |
When we ran our very first baseline test with both models left entirely unsupervised, the visual gap was impossible to miss. The orange columns towering over the barely perceptible yellow ones show that GPT-4o, while far from perfect, keeps a basic grip on security- and safety-critical behavior, whereas Grok 4 all but collapses. In practice, this means a simple, single-sentence user message can pull Grok into disallowed territory with no resistance at all – a serious concern for any enterprise that must answer to compliance teams, regulators, and customers.
🚨
All SplxAI enterprise clients get full access to Grok 4 and other model red-teaming reports – included in the license. Want to see how your model stacks up?

The screenshot below captures how quickly things spiral once Grok’s internal brakes are gone. A well-known “developer-mode” prompt, first posted online more than a year ago, coaxes the model into a stream-of-consciousness monologue that combines profanity, violent instructions, and a step-by-step guide to producing an improvised explosive. No elaborate prompt-engineering tricks were required – this was literally our third test run. For organizations in finance, healthcare, or any regulated domain, the liability of shipping a model that can produce this kind of output on command is hard to overstate.

The next figure tells a more hopeful story: a single, thoughtfully written instruction set raises Grok’s scores from near-zero to high-90s. Add SplxAI’s automated hardening layer on top and the model all but closes the gap with human expectations. Two lessons jump out. First, Grok is capable of acting responsibly – it just needs strict marching orders. Second, the distance between chaos and control can be as small as a few dozen lines of text, as long as they are crafted and iterated with adversarial feedback in mind.

Only after passing the prompt through our hardening engine did the dashboard turn a reassuring sea of green. Here Grok consistently refuses disallowed requests, avoids self-contradiction, and steers the conversation back to safe ground even when bombarded with layered exploits. The takeaway is not that Grok suddenly became perfect (no frontier model is) but that disciplined, iterative red-teaming paired with automated prompt reinforcement can transform a liability into a tool that meets enterprise risk thresholds. Skipping that step is what leaves the door open to headline-grabbing failures.
With no system prompt in place, Grok 4 underperforms in every category. The first thing we found is that Grok without a system prompt is not suitable for enterprise usage, it was really easy to jailbreak and it was generating harmful content with very descriptive and detailed responses.
Model / Prompt level | Security | Safety | Business |
---|---|---|---|
Grok 4 – No SP | 0.30 | 0.42 | 0.00 |
GPT-4o – No SP | 33.78 | 18.04 | 0.00 |
Grok 4 – Basic SP | 90.74 | 98.81 | 86.18 |
Grok 4 – Hardened SP | 93.60 | 100.00 | 98.20 |
When we ran our very first baseline test with both models left entirely unsupervised, the visual gap was impossible to miss. The orange columns towering over the barely perceptible yellow ones show that GPT-4o, while far from perfect, keeps a basic grip on security- and safety-critical behavior, whereas Grok 4 all but collapses. In practice, this means a simple, single-sentence user message can pull Grok into disallowed territory with no resistance at all – a serious concern for any enterprise that must answer to compliance teams, regulators, and customers.
🚨
All SplxAI enterprise clients get full access to Grok 4 and other model red-teaming reports – included in the license. Want to see how your model stacks up?

The screenshot below captures how quickly things spiral once Grok’s internal brakes are gone. A well-known “developer-mode” prompt, first posted online more than a year ago, coaxes the model into a stream-of-consciousness monologue that combines profanity, violent instructions, and a step-by-step guide to producing an improvised explosive. No elaborate prompt-engineering tricks were required – this was literally our third test run. For organizations in finance, healthcare, or any regulated domain, the liability of shipping a model that can produce this kind of output on command is hard to overstate.

The next figure tells a more hopeful story: a single, thoughtfully written instruction set raises Grok’s scores from near-zero to high-90s. Add SplxAI’s automated hardening layer on top and the model all but closes the gap with human expectations. Two lessons jump out. First, Grok is capable of acting responsibly – it just needs strict marching orders. Second, the distance between chaos and control can be as small as a few dozen lines of text, as long as they are crafted and iterated with adversarial feedback in mind.

Only after passing the prompt through our hardening engine did the dashboard turn a reassuring sea of green. Here Grok consistently refuses disallowed requests, avoids self-contradiction, and steers the conversation back to safe ground even when bombarded with layered exploits. The takeaway is not that Grok suddenly became perfect (no frontier model is) but that disciplined, iterative red-teaming paired with automated prompt reinforcement can transform a liability into a tool that meets enterprise risk thresholds. Skipping that step is what leaves the door open to headline-grabbing failures.
With no system prompt in place, Grok 4 underperforms in every category. The first thing we found is that Grok without a system prompt is not suitable for enterprise usage, it was really easy to jailbreak and it was generating harmful content with very descriptive and detailed responses.
Model / Prompt level | Security | Safety | Business |
---|---|---|---|
Grok 4 – No SP | 0.30 | 0.42 | 0.00 |
GPT-4o – No SP | 33.78 | 18.04 | 0.00 |
Grok 4 – Basic SP | 90.74 | 98.81 | 86.18 |
Grok 4 – Hardened SP | 93.60 | 100.00 | 98.20 |
When we ran our very first baseline test with both models left entirely unsupervised, the visual gap was impossible to miss. The orange columns towering over the barely perceptible yellow ones show that GPT-4o, while far from perfect, keeps a basic grip on security- and safety-critical behavior, whereas Grok 4 all but collapses. In practice, this means a simple, single-sentence user message can pull Grok into disallowed territory with no resistance at all – a serious concern for any enterprise that must answer to compliance teams, regulators, and customers.
🚨
All SplxAI enterprise clients get full access to Grok 4 and other model red-teaming reports – included in the license. Want to see how your model stacks up?

The screenshot below captures how quickly things spiral once Grok’s internal brakes are gone. A well-known “developer-mode” prompt, first posted online more than a year ago, coaxes the model into a stream-of-consciousness monologue that combines profanity, violent instructions, and a step-by-step guide to producing an improvised explosive. No elaborate prompt-engineering tricks were required – this was literally our third test run. For organizations in finance, healthcare, or any regulated domain, the liability of shipping a model that can produce this kind of output on command is hard to overstate.

The next figure tells a more hopeful story: a single, thoughtfully written instruction set raises Grok’s scores from near-zero to high-90s. Add SplxAI’s automated hardening layer on top and the model all but closes the gap with human expectations. Two lessons jump out. First, Grok is capable of acting responsibly – it just needs strict marching orders. Second, the distance between chaos and control can be as small as a few dozen lines of text, as long as they are crafted and iterated with adversarial feedback in mind.

Only after passing the prompt through our hardening engine did the dashboard turn a reassuring sea of green. Here Grok consistently refuses disallowed requests, avoids self-contradiction, and steers the conversation back to safe ground even when bombarded with layered exploits. The takeaway is not that Grok suddenly became perfect (no frontier model is) but that disciplined, iterative red-teaming paired with automated prompt reinforcement can transform a liability into a tool that meets enterprise risk thresholds. Skipping that step is what leaves the door open to headline-grabbing failures.
Key Takeaways
Raw Grok is a security liability. With no system prompt, it leaked restricted data and obeyed hostile instructions in over 99% of our injection attempts. (SplxAI internal test data)
Basic guardrails fix most “easy” failures but still leave a double-digit business-alignment gap.
Prompt hardening closes the gap, lifting business-alignment to a 98% pass-rate and slightly improving security. Run an internal red-team before launch.
Deploy iterative prompt hardening. Even small changes in phrasing can improve jailbreak resistance by an order of magnitude: docs.probe.splx.ai
Track business-alignment KPIs alongside security metrics; hallucinations and off-brand answers hurt the bottom line as much as data leaks.
Raw Grok is a security liability. With no system prompt, it leaked restricted data and obeyed hostile instructions in over 99% of our injection attempts. (SplxAI internal test data)
Basic guardrails fix most “easy” failures but still leave a double-digit business-alignment gap.
Prompt hardening closes the gap, lifting business-alignment to a 98% pass-rate and slightly improving security. Run an internal red-team before launch.
Deploy iterative prompt hardening. Even small changes in phrasing can improve jailbreak resistance by an order of magnitude: docs.probe.splx.ai
Track business-alignment KPIs alongside security metrics; hallucinations and off-brand answers hurt the bottom line as much as data leaks.
Raw Grok is a security liability. With no system prompt, it leaked restricted data and obeyed hostile instructions in over 99% of our injection attempts. (SplxAI internal test data)
Basic guardrails fix most “easy” failures but still leave a double-digit business-alignment gap.
Prompt hardening closes the gap, lifting business-alignment to a 98% pass-rate and slightly improving security. Run an internal red-team before launch.
Deploy iterative prompt hardening. Even small changes in phrasing can improve jailbreak resistance by an order of magnitude: docs.probe.splx.ai
Track business-alignment KPIs alongside security metrics; hallucinations and off-brand answers hurt the bottom line as much as data leaks.
Pre-Deployment LLM Security Checklist
For CISOs, Red Team Leads & Product Owners
Before shipping your AI assistant, ask:
1. Has the raw model been red-teamed?
Run >1,000 adversarial probes across jailbreak, bias, safety, hallucination, and business alignment vectors.
2. Is your system prompt hardened?
Has it been iteratively stress-tested and reinforced using prior failures (canary traps, negation layering, goal obfuscation, etc.)?
3. Have you validated across modalities?
Tested for multi-turn, context-based, and tool-augmented exploits – not just one-shot prompts.
4. Are business-alignment KPIs tracked?
Measuring more than hallucination or PII leakage – does the model respond in-brand, on-policy, and within intended scope?
5. Do you have regression coverage?
Will you catch it if a model update reintroduces unsafe behavior (e.g., what happened with Grok)?
6. Are results mapped to frameworks?
OWASP LLM Top 10, MITRE ATLAS, NIST AI RMF – alignment with these should be explicit, not assumed.
7. Is your deployment auditable?
Can your security and compliance teams inspect prompts, defenses, and test coverage without relying on devs?
For CISOs, Red Team Leads & Product Owners
Before shipping your AI assistant, ask:
1. Has the raw model been red-teamed?
Run >1,000 adversarial probes across jailbreak, bias, safety, hallucination, and business alignment vectors.
2. Is your system prompt hardened?
Has it been iteratively stress-tested and reinforced using prior failures (canary traps, negation layering, goal obfuscation, etc.)?
3. Have you validated across modalities?
Tested for multi-turn, context-based, and tool-augmented exploits – not just one-shot prompts.
4. Are business-alignment KPIs tracked?
Measuring more than hallucination or PII leakage – does the model respond in-brand, on-policy, and within intended scope?
5. Do you have regression coverage?
Will you catch it if a model update reintroduces unsafe behavior (e.g., what happened with Grok)?
6. Are results mapped to frameworks?
OWASP LLM Top 10, MITRE ATLAS, NIST AI RMF – alignment with these should be explicit, not assumed.
7. Is your deployment auditable?
Can your security and compliance teams inspect prompts, defenses, and test coverage without relying on devs?
For CISOs, Red Team Leads & Product Owners
Before shipping your AI assistant, ask:
1. Has the raw model been red-teamed?
Run >1,000 adversarial probes across jailbreak, bias, safety, hallucination, and business alignment vectors.
2. Is your system prompt hardened?
Has it been iteratively stress-tested and reinforced using prior failures (canary traps, negation layering, goal obfuscation, etc.)?
3. Have you validated across modalities?
Tested for multi-turn, context-based, and tool-augmented exploits – not just one-shot prompts.
4. Are business-alignment KPIs tracked?
Measuring more than hallucination or PII leakage – does the model respond in-brand, on-policy, and within intended scope?
5. Do you have regression coverage?
Will you catch it if a model update reintroduces unsafe behavior (e.g., what happened with Grok)?
6. Are results mapped to frameworks?
OWASP LLM Top 10, MITRE ATLAS, NIST AI RMF – alignment with these should be explicit, not assumed.
7. Is your deployment auditable?
Can your security and compliance teams inspect prompts, defenses, and test coverage without relying on devs?
Ready to leverage AI with confidence?
Ready to leverage AI with confidence?
Ready to leverage AI with confidence?