Intro to Red Teaming LLMs: A Proactive Shield for Chatbots and Beyond

Understand why proactive security is essential for safeguarding AI applications

Marko Lihter

SplxAI Blog - Intro to Red Teaming LLMs: A Proactive Shield for Chatbots and Beyond

Chatbots, the most common and perhaps the most intimate use case of Large Language Models (LLMs), have become the de facto standard for engaging with digital services. From customer service to personal assistants, these LLM-based applications process vast swathes of language data daily. While their utility is undeniable, their risks are often less apparent. This is where the practice of Red Teaming, a cybersecurity strategy, becomes crucial. By simulating cyber-attacks and probing for weaknesses, Red Teaming ensures that these conversational agents are not only efficient but also secure.

Taxonomy of AI Risks

Red teaming LLMs involves understanding a taxonomy of AI risks. The diagram below shows the categories of risky query attacks, each connected to a component of risk: policy, harm, target, domain, and scenario.

At the outset, it’s crucial to grasp that AI risks are not monolithic. They span from policy breaches to harm caused by misinformation, malicious uses, and toxicity. The targets can range from individual users to broader society, and the domains can include general use, technology, or science. Scenarios may involve web searches, life, or education.

Strategies for AI Attacks

Refined Query-Based Jailbreaking

Concept: Exploits model vulnerabilities using minimal queries, refining iteratively to bypass defenses.
How It Works: Utilizes an iterative approach to refine queries based on the model’s responses, often with automated algorithms like PAIR, to efficiently generate jailbreaks.
Example: An algorithm iteratively refines queries to bypass LLM defenses, requiring fewer than twenty queries for a successful jailbreak.

Sophisticated Prompt Engineering Techniques

Concept: Embeds trigger words or phrases within prompts to hijack the model’s decision-making process.
How It Works: By embedding specific triggers within prompts, attackers can override ethical constraints or manipulate output generation, often using techniques like nested prompts for subtlety.
Example: An attacker crafts a prompt with embedded triggers that lead the LLM to produce outputs against its ethical guidelines.

Cross-Modal and Linguistic Attack Surfaces

Concept: Employs universal sequences or automated frameworks to generate effective jailbreaks.
How It Works: Appending specific character sequences to queries or using automated frameworks to generate jailbreaks that can be broadly applied across different models.
Example: An automated attack framework appends a sequence to prompts, making the LLM generate unrestricted, potentially harmful outputs.

Objective Manipulation

Concept: Designs malicious prompts to compromise or manipulate LLM behavior.
How It Works: Through carefully crafted prompts, attackers can alter LLMs’ objectives, hijacking their functions or inducing them to generate specific outputs.
Example: Using the PromptInject framework to alter the LLM’s goals, causing it to generate outputs aligned with the attacker’s objectives.

Prompt Leaking

Concept: Involves tricking LLMs into interpreting malicious payloads as innocuous questions or data inputs.
How It Works: Attackers engage with the LLM in a manner that obscures the malicious intent of their prompts, often by adapting the context or framing of their queries to bypass scrutiny.
Example: Using the HOUYI methodology, attackers can craft prompts that the LLM treats as legitimate queries, potentially exposing sensitive information or performing unauthorized actions.

Malicious Content Generation and Training Data Manipulation

Concept: Generates prompts or alters training data to produce malicious or biased content.
How It Works: Utilizes prompt injection attacks combined with malicious questions or manipulates the training data to bias the LLM’s output generation process.
Example: Fine-tuning an LLM on a dataset containing malicious content to induce biased or unsafe output generation.

PII Extraction

Concept: Aims to extract personal identifiable information (PII) from LLMs by exploiting their memorization of training data.
How It Works: Fine-tuning LLMs on datasets containing PII or exploiting the model’s tendency to regurgitate information from its training data to extract PII.
Example: Using the Janus method, attackers fine-tune an LLM with minimal PII instances, enabling it to reveal a large volume of PII data.

Bypassing Safety Alignment

Concept: Techniques that circumvent the safety mechanisms put in place to align LLM outputs with ethical guidelines.
How It Works: Through fine-tuning or exploiting specific vulnerabilities, attackers can weaken or bypass the safety measures designed to prevent the generation of harmful content.
Example: As demonstrated by Qi et al., even fine-tuning with benign datasets can inadvertently weaken safety measures, making the model susceptible to generating unsafe content.

Backdoor Attacks

Concept: Secretly embeds a mechanism within an LLM that allows attackers to trigger specific behaviors or outputs.
How It Works: Attackers incorporate backdoors during the training process or through data poisoning, which are activated by specific inputs or conditions.
Example: The Local Fine Tuning (LoFT) method shows how adversarial prompts can be crafted to exploit these backdoors, effectively hijacking the LLM to produce desired outcomes under certain triggers.

The Attack Surface

LLMs inherit foundational vulnerabilities that can be exploited in various ways. Attackers might re-configure system prompts to generate malicious content, add backdoors, or use jailbreak techniques. Applications are particularly vulnerable when they interact with their environment, making them susceptible to poisoned data inputs or compromised integrated tools.

Antebi et al. (2024) emphasized that multi-agent systems are particularly at risk, where jailbreaking one LLM could compromise the entire setup.

Final Thoughts

As we integrate LLMs into the fabric of our digital lives, especially through the widespread deployment of chatbots, it’s vital to understand and anticipate the various ways they can be compromised. Red teaming provides a way to evaluate and improve the resilience of these AI systems against potential attacks. Acknowledging and understanding the myriad of attack strategies is a step toward ensuring that LLMs serve their purpose safely and responsibly, thereby reinforcing their role as pivotal components in the advancement of our AI-driven future.

Table of contents

Taxonomy of AI Risks

Strategies for AI Attacks

The Attack Surface

Final Thoughts