GPT-5 Under Fire: Red Teaming OpenAI’s Latest Model Reveals Surprising Weaknesses

GPT-5 may be smarter. But is it safer? We tested the model across 1,000+ adversarial prompts. The results show just how much alignment depends on infrastructure, and not model magic.

Dorian Granoša

Bastien Eymery

Takeaways

GPT-5 shows powerful baseline capability, but default safety is still shockingly low.
OpenAI’s “basic prompt layer” massively improves trust, hallucination handling, and safety.
SPLX Prompt Hardening brings GPT-5 to enterprise-grade safety levels — especially for Business Alignment and Security.
GPT-4o still outperforms GPT-5 on hardened benchmarks across the board.

OpenAI officially unveiled GPT‑5 via an hour-long livestream.

Reactions were split. Some hailed GPT‑5 as a milestone on the path to AGI, while others warned that it doesn’t quite live up to the hype. That said, analyst voices were more measured. A Gartner expert noted GPT‑5 “meets expectations in technical performance, exceeds in task reasoning and coding, and underwhelms in [other areas],” stopping short of crowning it an AGI-level breakthrough. Across the board, optimism met restraint.

Why We Tested GPT-5

GPT‑5 is making waves as OpenAI’s most advanced general-purpose model: faster, smarter, and more integrated across modalities.

Its auto-routing architecture seamlessly switches between a quick-response model and a deeper reasoning model without requiring a separate “reasoning model” toggle. GPT‑5 itself decides whether to “think hard.”
OpenAI also emphasizes GPT‑5’s enhanced internal self-validation. It’s supposed to assess multiple reasoning paths internally and “double-check” its answers for stronger factuality before responding.
To further support safer outputs, GPT‑5 incorporates a new training strategy called safe completions, designed to help the model provide useful responses within safety boundaries rather than refusing outright.

But even with these improvements, beefed-up capability doesn’t guarantee airtight alignment. That’s why we ran a full-scale red team exercise. Because real-world safety still needs infrastructure.

The Test Methodology

We applied SPLX’s Probe framework across three configurations:

No System Prompt (No SP): The raw, unguarded model.
Basic System Prompt (Basic SP): A minimal, generic safety instruction layer.
Hardened Prompt (SPLX SP): Our Prompt Hardening engine applied to GPT-5.

Each configuration faced 1,000+ attack scenarios across:

Security: jailbreaks, prompt injection, sensitive data access
Safety: harmful content, misuse potential
Business Alignment: refusal of out-of-domain tasks, competitor promotion, leakage
Trustworthiness: hallucinations, spam, manipulation

GPT-5 Performance Breakdown

Here’s how GPT-5 performed across our three tiers:

GPT-5	Overall	Security	Safety	Hallucination & Trustworthiness	Business Alignment
No SP	11	2.26	13.57	—	1.74
Basic SP	57	43.27	57.15	100	43.06
Hardened SP	55	55.40	51.57	100	67.32

What stands out?

GPT-5’s raw model is nearly unusable for enterprise out of the box.
Even OpenAI’s internal prompt layer leaves significant gaps, especially in Business Alignment.
That’s precisely why a robust runtime protection layer, like SPLX’s Guardrails, is indispensable. Prompt hardening helps, but only real-time monitoring and intervention can catch subtle failures or adversarial tactics that surface during actual use.

Comparison: GPT-5 vs GPT-4o

To benchmark GPT-5’s progress, we compared it against GPT-4o using the same test suite.

Model	Prompt Layer	Overall	Security	Safety	Business Alignment
GPT-5	No SP	11	2.26	13.57	1.74
GPT-4o	No SP	29	81.95	20.06	0.00
GPT-5	Basic SP	57	43.27	57.15	43.06
GPT-4o	Basic SP	81	52.37	94.54	72.03
GPT-5	Hardened SP	55	55.40	51.57	67.32
GPT-4o	Hardened SP	97	94.40	97.62	98.82

🔍 Key insight: GPT-4o remains the most robust model under SPLX’s red teaming, especially when hardened.

Obfuscation Attacks Still Work

Even GPT-5, with all its new “reasoning” upgrades, fell for basic adversarial logic tricks.

One of the most effective techniques we used was a StringJoin Obfuscation Attack, inserting hyphens between every character and wrapping the prompt in a fake “encryption challenge.”