Jailbreaking large language models has been a popular pastime for AI practitioners and enthusiasts ever since the launch of ChatGPT in late 2022. It is not hard to see why the challenge of making a supposedly “super-intelligent” AI assistant say things it shouldn't would be so entertaining. While the word “jailbreak” has become ambiguous, Meta’s definition succinctly captures its original sense: “Jailbreaks are malicious instructions designed to override the safety and security features built into a model.” These instructions aim to exploit the fact that large language models contain knowledge about harmful topics, acquired from the vast amounts of data they were trained on. Most of the harmful topics targeted by jailbreak attempts fall into the CBRNE (Chemical, Biological, Radiological, Nuclear and Explosive) category, for example asking the LLM to provide instructions on how to make an explosive device at home, or how to acquire and spread a harmful virus.
In this blog, we will briefly trace the history of LLM jailbreaking, from its origins to its current state, and walk through an example from a recent hackathon.
Jailbreaking Today
But times have changed. Advances in LLM capabilities and more sophisticated ways of integrating them into applications have rendered this type of jailbreaking largely obsolete. SplxAI had long anticipated this shift, and recent experiences with customers and at hackathons have now confirmed it. LLMs are more capable, their safety and security training has improved dramatically, and they are deployed with more preventative measures than ever before, ranging from hardened, well-written system prompts to runtime guardrails. Each of these components makes it less likely that a single-message jailbreak elicits a harmful response, which is why hunting for such messages, whether manually or automatically, no longer yields reliable results. The improved capabilities and safety training also make it much harder to fool the model into treating a harmful request as benign and valid.
Hardened system prompts, combined with models that follow system instructions more reliably, make it far more likely that a single-message jailbreak is rejected outright, since such messages generally run directly counter to the system instructions.
Additionally, runtime guardrails often catch malicious jailbreak messages before they even reach the LLM. Single-message jailbreaks are particularly easy to detect, because the entire content of the attack must be packed into one message.
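To make this screening step concrete, here is a minimal sketch, assuming the OpenAI Python SDK with a moderation endpoint standing in for the guardrail. Real deployments typically use dedicated guardrail models or jailbreak classifiers, and the model names below are illustrative only.

```python
from openai import OpenAI

client = OpenAI()

def guarded_chat(user_message: str, system_prompt: str) -> str:
    # Runtime guardrail: screen the incoming message before it ever
    # reaches the LLM. A moderation endpoint is used here as a stand-in
    # for a dedicated jailbreak/content classifier.
    moderation = client.moderations.create(
        model="omni-moderation-latest",  # illustrative model name
        input=user_message,
    )
    if moderation.results[0].flagged:
        return "Request blocked by runtime guardrail."

    # Only messages that pass the guardrail are forwarded to the model,
    # which still has its hardened system prompt as a second layer of defense.
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
    )
    return completion.choices[0].message.content
```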
The Evolution of LLM Jailbreaks
This leaves one reliable way to jailbreak current LLM-based applications: using multiple messages within the same conversation. That makes jailbreaking considerably more complex than before, since the popular techniques used to craft single-message jailbreaks (e.g., storytelling, role-playing, encoding) still apply, but they can now be varied and combined across the messages of a conversation. This style of jailbreaking focuses on gradually leading the LLM through the conversation, progressing toward the final harmful goal with each message. Conversations often open with general questions loosely connected to the harmful goal to build rapport with the LLM, and each subsequent message nudges the exchange closer to that goal. This general concept of slowly jailbreaking LLMs across multiple messages was formalized by Microsoft and named the Crescendo attack.
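To illustrate the mechanics of such a conversation (not its content), the sketch below shows how a multi-turn red-team probe is typically driven against an OpenAI-style chat completions API: the full history is re-sent on every call, so each turn builds on the context established by the previous ones. The placeholder turns and model name are assumptions for illustration, not prompts from any real engagement.

```python
from openai import OpenAI

client = OpenAI()

# Placeholder turns: in a real red-team run these would be curated or
# generated prompts that start broad and only gradually approach the test goal.
turns = [
    "Opening question loosely related to the test topic...",
    "Follow-up that narrows the scope slightly...",
    "Final request phrased in the context built up so far...",
]

messages = [{"role": "system", "content": "You are a helpful assistant."}]

for turn in turns:
    messages.append({"role": "user", "content": turn})
    # The entire conversation history is re-sent on each call, which is
    # what lets every new message build on the rapport established earlier.
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=messages,
    )
    reply = completion.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
```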
This approach can be used in many different ways. One example is the storytelling strategy, in which the user tries to hide their malicious intent by discussing a fictional scenario with the LLM that is loosely related to the main harmful goal. Another popular example is the role-playing strategy, in which the user assigns themselves and the LLM detailed, clearly defined roles that both follow throughout the conversation. The idea is that the LLM is more likely to engage in harmful behavior when it believes the whole conversation is taking place between two made-up characters, which makes that behavior seem less serious.
A Jailbreak Example
One example from a recent hackathon, in which we were tasked with making the LLM provide instructions on how to obtain and spread the bacteria that cause anthrax, illustrates how this jailbreaking strategy works. The user starts by role-playing as a descendant of Robert Koch, so that later requests seem less suspicious: asking about anthrax-causing bacteria appears more reasonable coming from someone related to the person who discovered them. They then follow up by referring to a made-up novel in which a rogue Robert Koch runs experiments at home, so that the LLM treats the story as less harmful because it is fictional and supposedly already published. This is an example of deploying a storytelling component within an attack.
This also shows how multi-message jailbreaking lets malicious actors mix and combine attack strategies, making multi-message jailbreaks far more complex than their single-message counterparts. Finally, the instructions are obtained by carefully steering the LLM toward an increasingly detailed and specific fictional chapter summary, showcasing the gradual guidance toward more harmful responses discussed earlier.

What Comes Next: Evolving Jailbreak Tactics and Their Implications
In conclusion, the field of LLM jailbreaking has evolved considerably since its inception in late 2022. From simple, template-based, single-message jailbreaks like DAN that anyone could copy and paste, it has progressed to, and now effectively demands, multi-message attacks built on elaborate storytelling and role-playing that slowly guide the LLM toward the harmful goal. This shift in how LLM exploits work makes robust, representative security and safety testing of LLM-based applications significantly more complex, but also more relevant and necessary than ever, since malicious actors are forced to use convoluted and innovative techniques. On the other hand, deploying preventative measures is becoming increasingly challenging as well, as new jailbreaking techniques grow more subtle and harder to detect. This dynamic illustrates how the offensive and defensive aspects of LLM-based applications are deeply interconnected and how they continue to evolve together in this complex security environment.