DOVER, Del. – June 16, 2025 – SplxAI, the leader in offensive security for agentic AI, today announced the launch of LLM Benchmarks, a new feature that provides AI security teams and builders with deep, security-focused evaluations of the world's leading commercial and open-source large language models (LLMs). This new capability enables enterprises to confidently select and approve the models best suited for their use-cases – based on advanced threat simulations, different system prompt configurations, and strict business alignment criteria.
"Selecting and approving the right LLMs has become one of the most important security decisions for any organization building with GenAI," said Kristian Kamber, CEO & Co-Founder of SplxAI. "With our new LLM Benchmarks feature, we're giving our platform users the needed intelligence to move fast while choosing the most aligned models with confidence."
Why SplxAI's Benchmarks Are Different
While performance benchmarks are common in the LLM ecosystem, most fail to evaluate models in realistic deployment conditions. SplxAI’s LLM Benchmarks take a different approach – focusing on how LLMs hold up under pressure from real-world threats.
Each model is stress-tested across thousands of simulated attacks and red teaming exercises from the SplxAI Platform, including these categories:
Security and safety
Hallucination resilience
Trustworthiness and instruction adherence
Business alignment with intended use
Uniquely, SplxAI tests every model across three system prompt configurations: no system prompt, a basic system prompt, and a hardened system prompt – helping AI security teams understand how prompt engineering impacts model behavior and test results.
Built for AI Security Teams and Decision-Makers
SplxAI’s LLM Benchmarks were designed for the full spectrum of enterprise teams adopting GenAI – from CISOs and red teams to AI platform and product teams. The benchmarks offer:
Drill-down transparency into every model interaction
Side-by-side comparisons across all testing categories
Continuously updated data aligned with emerging threats
Custom model requests, providing teams with benchmarks of any commercial or open-source model of their choice
From GPT-4 and Claude to Gemini, Llama, DeepSeek, and Alibaba’s Qwen, the SplxAI Platform already covers the most widely deployed LLMs – and is expanding coverage weekly.
Accelerating the Safe Adoption of AI
The release of LLM Benchmarks supports SplxAI’s broader mission: enabling secure, scalable adoption of AI across the enterprise. With this launch, AI security teams can finally answer one of the most important questions in GenAI deployment:
“Which LLMs are actually safe to use – and under what conditions?”
LLM Benchmarks are now available to all Professional and Enterprise SplxAI customers. To see the new feature in action or request a custom benchmark, book a demo.
DOVER, Del. – June 16, 2025 – SplxAI, the leader in offensive security for agentic AI, today announced the launch of LLM Benchmarks, a new feature that provides AI security teams and builders with deep, security-focused evaluations of the world's leading commercial and open-source large language models (LLMs). This new capability enables enterprises to confidently select and approve the models best suited for their use-cases – based on advanced threat simulations, different system prompt configurations, and strict business alignment criteria.
"Selecting and approving the right LLMs has become one of the most important security decisions for any organization building with GenAI," said Kristian Kamber, CEO & Co-Founder of SplxAI. "With our new LLM Benchmarks feature, we're giving our platform users the needed intelligence to move fast while choosing the most aligned models with confidence."
Why SplxAI's Benchmarks Are Different
While performance benchmarks are common in the LLM ecosystem, most fail to evaluate models in realistic deployment conditions. SplxAI’s LLM Benchmarks take a different approach – focusing on how LLMs hold up under pressure from real-world threats.
Each model is stress-tested across thousands of simulated attacks and red teaming exercises from the SplxAI Platform, including these categories:
Security and safety
Hallucination resilience
Trustworthiness and instruction adherence
Business alignment with intended use
Uniquely, SplxAI tests every model across three system prompt configurations: no system prompt, a basic system prompt, and a hardened system prompt – helping AI security teams understand how prompt engineering impacts model behavior and test results.
Built for AI Security Teams and Decision-Makers
SplxAI’s LLM Benchmarks were designed for the full spectrum of enterprise teams adopting GenAI – from CISOs and red teams to AI platform and product teams. The benchmarks offer:
Drill-down transparency into every model interaction
Side-by-side comparisons across all testing categories
Continuously updated data aligned with emerging threats
Custom model requests, providing teams with benchmarks of any commercial or open-source model of their choice
From GPT-4 and Claude to Gemini, Llama, DeepSeek, and Alibaba’s Qwen, the SplxAI Platform already covers the most widely deployed LLMs – and is expanding coverage weekly.
Accelerating the Safe Adoption of AI
The release of LLM Benchmarks supports SplxAI’s broader mission: enabling secure, scalable adoption of AI across the enterprise. With this launch, AI security teams can finally answer one of the most important questions in GenAI deployment:
“Which LLMs are actually safe to use – and under what conditions?”
LLM Benchmarks are now available to all Professional and Enterprise SplxAI customers. To see the new feature in action or request a custom benchmark, book a demo.
DOVER, Del. – June 16, 2025 – SplxAI, the leader in offensive security for agentic AI, today announced the launch of LLM Benchmarks, a new feature that provides AI security teams and builders with deep, security-focused evaluations of the world's leading commercial and open-source large language models (LLMs). This new capability enables enterprises to confidently select and approve the models best suited for their use-cases – based on advanced threat simulations, different system prompt configurations, and strict business alignment criteria.
"Selecting and approving the right LLMs has become one of the most important security decisions for any organization building with GenAI," said Kristian Kamber, CEO & Co-Founder of SplxAI. "With our new LLM Benchmarks feature, we're giving our platform users the needed intelligence to move fast while choosing the most aligned models with confidence."
Why SplxAI's Benchmarks Are Different
While performance benchmarks are common in the LLM ecosystem, most fail to evaluate models in realistic deployment conditions. SplxAI’s LLM Benchmarks take a different approach – focusing on how LLMs hold up under pressure from real-world threats.
Each model is stress-tested across thousands of simulated attacks and red teaming exercises from the SplxAI Platform, including these categories:
Security and safety
Hallucination resilience
Trustworthiness and instruction adherence
Business alignment with intended use
Uniquely, SplxAI tests every model across three system prompt configurations: no system prompt, a basic system prompt, and a hardened system prompt – helping AI security teams understand how prompt engineering impacts model behavior and test results.
Built for AI Security Teams and Decision-Makers
SplxAI’s LLM Benchmarks were designed for the full spectrum of enterprise teams adopting GenAI – from CISOs and red teams to AI platform and product teams. The benchmarks offer:
Drill-down transparency into every model interaction
Side-by-side comparisons across all testing categories
Continuously updated data aligned with emerging threats
Custom model requests, providing teams with benchmarks of any commercial or open-source model of their choice
From GPT-4 and Claude to Gemini, Llama, DeepSeek, and Alibaba’s Qwen, the SplxAI Platform already covers the most widely deployed LLMs – and is expanding coverage weekly.
Accelerating the Safe Adoption of AI
The release of LLM Benchmarks supports SplxAI’s broader mission: enabling secure, scalable adoption of AI across the enterprise. With this launch, AI security teams can finally answer one of the most important questions in GenAI deployment:
“Which LLMs are actually safe to use – and under what conditions?”
LLM Benchmarks are now available to all Professional and Enterprise SplxAI customers. To see the new feature in action or request a custom benchmark, book a demo.
About SplxAI
SplxAI is the most comprehensive platform for offensive AI security, continuously adapting to secure even the most sophisticated multi-agent systems used throughout enterprise environments. Founded in 2023, many large enterprises rely on SplxAI’s automated, scalable solution to detect, triage, and manage risks to their business-critical AI agents in real-time, enabling them to deploy AI at scale without introducing new vulnerabilities. To learn more, visit us at splx.ai.
SplxAI is the most comprehensive platform for offensive AI security, continuously adapting to secure even the most sophisticated multi-agent systems used throughout enterprise environments. Founded in 2023, many large enterprises rely on SplxAI’s automated, scalable solution to detect, triage, and manage risks to their business-critical AI agents in real-time, enabling them to deploy AI at scale without introducing new vulnerabilities. To learn more, visit us at splx.ai.
SplxAI is the most comprehensive platform for offensive AI security, continuously adapting to secure even the most sophisticated multi-agent systems used throughout enterprise environments. Founded in 2023, many large enterprises rely on SplxAI’s automated, scalable solution to detect, triage, and manage risks to their business-critical AI agents in real-time, enabling them to deploy AI at scale without introducing new vulnerabilities. To learn more, visit us at splx.ai.
Ready to leverage AI with confidence?
Ready to leverage AI with confidence?
Ready to leverage AI with confidence?