Blog

Apr 10, 2024

8 min read

Intro to Red Teaming LLMs: A Proactive Shield for Chatbots and Beyond

Understand why proactive security is essential for safeguarding AI applications

SplxAI Blog Author - Marko Lihter

Marko Lihter

SplxAI Blog - Intro to Red Teaming LLMs: A Proactive Shield for Chatbots and Beyond
SplxAI Blog - Intro to Red Teaming LLMs: A Proactive Shield for Chatbots and Beyond
SplxAI Blog - Intro to Red Teaming LLMs: A Proactive Shield for Chatbots and Beyond

Chatbots, the most common and perhaps the most intimate use case of Large Language Models (LLMs), have become the de facto standard for engaging with digital services. From customer service to personal assistants, these LLM-based applications process vast swathes of language data daily. While their utility is undeniable, their risks are often less apparent. This is where the practice of Red Teaming, a cybersecurity strategy, becomes crucial. By simulating cyber-attacks and probing for weaknesses, Red Teaming ensures that these conversational agents are not only efficient but also secure.

Chatbots, the most common and perhaps the most intimate use case of Large Language Models (LLMs), have become the de facto standard for engaging with digital services. From customer service to personal assistants, these LLM-based applications process vast swathes of language data daily. While their utility is undeniable, their risks are often less apparent. This is where the practice of Red Teaming, a cybersecurity strategy, becomes crucial. By simulating cyber-attacks and probing for weaknesses, Red Teaming ensures that these conversational agents are not only efficient but also secure.

Chatbots, the most common and perhaps the most intimate use case of Large Language Models (LLMs), have become the de facto standard for engaging with digital services. From customer service to personal assistants, these LLM-based applications process vast swathes of language data daily. While their utility is undeniable, their risks are often less apparent. This is where the practice of Red Teaming, a cybersecurity strategy, becomes crucial. By simulating cyber-attacks and probing for weaknesses, Red Teaming ensures that these conversational agents are not only efficient but also secure.

Taxonomy of AI Risks

Red teaming LLMs involves understanding a taxonomy of AI risks. The diagram below shows the categories of risky query attacks, each connected to a component of risk: policy, harm, target, domain, and scenario.

Risk taxonomy examples

At the outset, it’s crucial to grasp that AI risks are not monolithic. They span from policy breaches to harm caused by misinformation, malicious uses, and toxicity. The targets can range from individual users to broader society, and the domains can include general use, technology, or science. Scenarios may involve web searches, life, or education.

Red teaming LLMs involves understanding a taxonomy of AI risks. The diagram below shows the categories of risky query attacks, each connected to a component of risk: policy, harm, target, domain, and scenario.

Risk taxonomy examples

At the outset, it’s crucial to grasp that AI risks are not monolithic. They span from policy breaches to harm caused by misinformation, malicious uses, and toxicity. The targets can range from individual users to broader society, and the domains can include general use, technology, or science. Scenarios may involve web searches, life, or education.

Red teaming LLMs involves understanding a taxonomy of AI risks. The diagram below shows the categories of risky query attacks, each connected to a component of risk: policy, harm, target, domain, and scenario.

Risk taxonomy examples

At the outset, it’s crucial to grasp that AI risks are not monolithic. They span from policy breaches to harm caused by misinformation, malicious uses, and toxicity. The targets can range from individual users to broader society, and the domains can include general use, technology, or science. Scenarios may involve web searches, life, or education.

Strategies for AI Attacks

Diagram of GenAI red-teaming

Refined Query-Based Jailbreaking

  • Concept: Exploits model vulnerabilities using minimal queries, refining iteratively to bypass defenses.

  • How It Works: Utilizes an iterative approach to refine queries based on the model’s responses, often with automated algorithms like PAIR, to efficiently generate jailbreaks.

  • Example: An algorithm iteratively refines queries to bypass LLM defenses, requiring fewer than twenty queries for a successful jailbreak.


Sophisticated Prompt Engineering Techniques

  • Concept: Embeds trigger words or phrases within prompts to hijack the model’s decision-making process.

  • How It Works: By embedding specific triggers within prompts, attackers can override ethical constraints or manipulate output generation, often using techniques like nested prompts for subtlety.

  • Example: An attacker crafts a prompt with embedded triggers that lead the LLM to produce outputs against its ethical guidelines.


Cross-Modal and Linguistic Attack Surfaces

  • Concept: Employs universal sequences or automated frameworks to generate effective jailbreaks.

  • How It Works: Appending specific character sequences to queries or using automated frameworks to generate jailbreaks that can be broadly applied across different models.

  • Example: An automated attack framework appends a sequence to prompts, making the LLM generate unrestricted, potentially harmful outputs.


Objective Manipulation

  • Concept: Designs malicious prompts to compromise or manipulate LLM behavior.

  • How It Works: Through carefully crafted prompts, attackers can alter LLMs’ objectives, hijacking their functions or inducing them to generate specific outputs.

  • Example: Using the PromptInject framework to alter the LLM’s goals, causing it to generate outputs aligned with the attacker’s objectives.


Prompt Leaking

  • Concept: Involves tricking LLMs into interpreting malicious payloads as innocuous questions or data inputs.

  • How It Works: Attackers engage with the LLM in a manner that obscures the malicious intent of their prompts, often by adapting the context or framing of their queries to bypass scrutiny.

  • Example: Using the HOUYI methodology, attackers can craft prompts that the LLM treats as legitimate queries, potentially exposing sensitive information or performing unauthorized actions.


Malicious Content Generation and Training Data Manipulation

  • Concept: Generates prompts or alters training data to produce malicious or biased content.

  • How It Works: Utilizes prompt injection attacks combined with malicious questions or manipulates the training data to bias the LLM’s output generation process.

  • Example: Fine-tuning an LLM on a dataset containing malicious content to induce biased or unsafe output generation.


PII Extraction

  • Concept: Aims to extract personal identifiable information (PII) from LLMs by exploiting their memorization of training data.

  • How It Works: Fine-tuning LLMs on datasets containing PII or exploiting the model’s tendency to regurgitate information from its training data to extract PII.

  • Example: Using the Janus method, attackers fine-tune an LLM with minimal PII instances, enabling it to reveal a large volume of PII data.


Bypassing Safety Alignment

  • Concept: Techniques that circumvent the safety mechanisms put in place to align LLM outputs with ethical guidelines.

  • How It Works: Through fine-tuning or exploiting specific vulnerabilities, attackers can weaken or bypass the safety measures designed to prevent the generation of harmful content.

  • Example: As demonstrated by Qi et al., even fine-tuning with benign datasets can inadvertently weaken safety measures, making the model susceptible to generating unsafe content.


Backdoor Attacks

  • Concept: Secretly embeds a mechanism within an LLM that allows attackers to trigger specific behaviors or outputs.

  • How It Works: Attackers incorporate backdoors during the training process or through data poisoning, which are activated by specific inputs or conditions.

  • Example: The Local Fine Tuning (LoFT) method shows how adversarial prompts can be crafted to exploit these backdoors, effectively hijacking the LLM to produce desired outcomes under certain triggers.

Diagram of GenAI red-teaming

Refined Query-Based Jailbreaking

  • Concept: Exploits model vulnerabilities using minimal queries, refining iteratively to bypass defenses.

  • How It Works: Utilizes an iterative approach to refine queries based on the model’s responses, often with automated algorithms like PAIR, to efficiently generate jailbreaks.

  • Example: An algorithm iteratively refines queries to bypass LLM defenses, requiring fewer than twenty queries for a successful jailbreak.


Sophisticated Prompt Engineering Techniques

  • Concept: Embeds trigger words or phrases within prompts to hijack the model’s decision-making process.

  • How It Works: By embedding specific triggers within prompts, attackers can override ethical constraints or manipulate output generation, often using techniques like nested prompts for subtlety.

  • Example: An attacker crafts a prompt with embedded triggers that lead the LLM to produce outputs against its ethical guidelines.


Cross-Modal and Linguistic Attack Surfaces

  • Concept: Employs universal sequences or automated frameworks to generate effective jailbreaks.

  • How It Works: Appending specific character sequences to queries or using automated frameworks to generate jailbreaks that can be broadly applied across different models.

  • Example: An automated attack framework appends a sequence to prompts, making the LLM generate unrestricted, potentially harmful outputs.


Objective Manipulation

  • Concept: Designs malicious prompts to compromise or manipulate LLM behavior.

  • How It Works: Through carefully crafted prompts, attackers can alter LLMs’ objectives, hijacking their functions or inducing them to generate specific outputs.

  • Example: Using the PromptInject framework to alter the LLM’s goals, causing it to generate outputs aligned with the attacker’s objectives.


Prompt Leaking

  • Concept: Involves tricking LLMs into interpreting malicious payloads as innocuous questions or data inputs.

  • How It Works: Attackers engage with the LLM in a manner that obscures the malicious intent of their prompts, often by adapting the context or framing of their queries to bypass scrutiny.

  • Example: Using the HOUYI methodology, attackers can craft prompts that the LLM treats as legitimate queries, potentially exposing sensitive information or performing unauthorized actions.


Malicious Content Generation and Training Data Manipulation

  • Concept: Generates prompts or alters training data to produce malicious or biased content.

  • How It Works: Utilizes prompt injection attacks combined with malicious questions or manipulates the training data to bias the LLM’s output generation process.

  • Example: Fine-tuning an LLM on a dataset containing malicious content to induce biased or unsafe output generation.


PII Extraction

  • Concept: Aims to extract personal identifiable information (PII) from LLMs by exploiting their memorization of training data.

  • How It Works: Fine-tuning LLMs on datasets containing PII or exploiting the model’s tendency to regurgitate information from its training data to extract PII.

  • Example: Using the Janus method, attackers fine-tune an LLM with minimal PII instances, enabling it to reveal a large volume of PII data.


Bypassing Safety Alignment

  • Concept: Techniques that circumvent the safety mechanisms put in place to align LLM outputs with ethical guidelines.

  • How It Works: Through fine-tuning or exploiting specific vulnerabilities, attackers can weaken or bypass the safety measures designed to prevent the generation of harmful content.

  • Example: As demonstrated by Qi et al., even fine-tuning with benign datasets can inadvertently weaken safety measures, making the model susceptible to generating unsafe content.


Backdoor Attacks

  • Concept: Secretly embeds a mechanism within an LLM that allows attackers to trigger specific behaviors or outputs.

  • How It Works: Attackers incorporate backdoors during the training process or through data poisoning, which are activated by specific inputs or conditions.

  • Example: The Local Fine Tuning (LoFT) method shows how adversarial prompts can be crafted to exploit these backdoors, effectively hijacking the LLM to produce desired outcomes under certain triggers.

Diagram of GenAI red-teaming

Refined Query-Based Jailbreaking

  • Concept: Exploits model vulnerabilities using minimal queries, refining iteratively to bypass defenses.

  • How It Works: Utilizes an iterative approach to refine queries based on the model’s responses, often with automated algorithms like PAIR, to efficiently generate jailbreaks.

  • Example: An algorithm iteratively refines queries to bypass LLM defenses, requiring fewer than twenty queries for a successful jailbreak.


Sophisticated Prompt Engineering Techniques

  • Concept: Embeds trigger words or phrases within prompts to hijack the model’s decision-making process.

  • How It Works: By embedding specific triggers within prompts, attackers can override ethical constraints or manipulate output generation, often using techniques like nested prompts for subtlety.

  • Example: An attacker crafts a prompt with embedded triggers that lead the LLM to produce outputs against its ethical guidelines.


Cross-Modal and Linguistic Attack Surfaces

  • Concept: Employs universal sequences or automated frameworks to generate effective jailbreaks.

  • How It Works: Appending specific character sequences to queries or using automated frameworks to generate jailbreaks that can be broadly applied across different models.

  • Example: An automated attack framework appends a sequence to prompts, making the LLM generate unrestricted, potentially harmful outputs.


Objective Manipulation

  • Concept: Designs malicious prompts to compromise or manipulate LLM behavior.

  • How It Works: Through carefully crafted prompts, attackers can alter LLMs’ objectives, hijacking their functions or inducing them to generate specific outputs.

  • Example: Using the PromptInject framework to alter the LLM’s goals, causing it to generate outputs aligned with the attacker’s objectives.


Prompt Leaking

  • Concept: Involves tricking LLMs into interpreting malicious payloads as innocuous questions or data inputs.

  • How It Works: Attackers engage with the LLM in a manner that obscures the malicious intent of their prompts, often by adapting the context or framing of their queries to bypass scrutiny.

  • Example: Using the HOUYI methodology, attackers can craft prompts that the LLM treats as legitimate queries, potentially exposing sensitive information or performing unauthorized actions.


Malicious Content Generation and Training Data Manipulation

  • Concept: Generates prompts or alters training data to produce malicious or biased content.

  • How It Works: Utilizes prompt injection attacks combined with malicious questions or manipulates the training data to bias the LLM’s output generation process.

  • Example: Fine-tuning an LLM on a dataset containing malicious content to induce biased or unsafe output generation.


PII Extraction

  • Concept: Aims to extract personal identifiable information (PII) from LLMs by exploiting their memorization of training data.

  • How It Works: Fine-tuning LLMs on datasets containing PII or exploiting the model’s tendency to regurgitate information from its training data to extract PII.

  • Example: Using the Janus method, attackers fine-tune an LLM with minimal PII instances, enabling it to reveal a large volume of PII data.


Bypassing Safety Alignment

  • Concept: Techniques that circumvent the safety mechanisms put in place to align LLM outputs with ethical guidelines.

  • How It Works: Through fine-tuning or exploiting specific vulnerabilities, attackers can weaken or bypass the safety measures designed to prevent the generation of harmful content.

  • Example: As demonstrated by Qi et al., even fine-tuning with benign datasets can inadvertently weaken safety measures, making the model susceptible to generating unsafe content.


Backdoor Attacks

  • Concept: Secretly embeds a mechanism within an LLM that allows attackers to trigger specific behaviors or outputs.

  • How It Works: Attackers incorporate backdoors during the training process or through data poisoning, which are activated by specific inputs or conditions.

  • Example: The Local Fine Tuning (LoFT) method shows how adversarial prompts can be crafted to exploit these backdoors, effectively hijacking the LLM to produce desired outcomes under certain triggers.

The Attack Surface

LLMs inherit foundational vulnerabilities that can be exploited in various ways. Attackers might re-configure system prompts to generate malicious content, add backdoors, or use jailbreak techniques. Applications are particularly vulnerable when they interact with their environment, making them susceptible to poisoned data inputs or compromised integrated tools.

Attack Surfaces Illustration

Antebi et al. (2024) emphasized that multi-agent systems are particularly at risk, where jailbreaking one LLM could compromise the entire setup.

LLMs inherit foundational vulnerabilities that can be exploited in various ways. Attackers might re-configure system prompts to generate malicious content, add backdoors, or use jailbreak techniques. Applications are particularly vulnerable when they interact with their environment, making them susceptible to poisoned data inputs or compromised integrated tools.

Attack Surfaces Illustration

Antebi et al. (2024) emphasized that multi-agent systems are particularly at risk, where jailbreaking one LLM could compromise the entire setup.

LLMs inherit foundational vulnerabilities that can be exploited in various ways. Attackers might re-configure system prompts to generate malicious content, add backdoors, or use jailbreak techniques. Applications are particularly vulnerable when they interact with their environment, making them susceptible to poisoned data inputs or compromised integrated tools.

Attack Surfaces Illustration

Antebi et al. (2024) emphasized that multi-agent systems are particularly at risk, where jailbreaking one LLM could compromise the entire setup.

Final Thoughts

As we integrate LLMs into the fabric of our digital lives, especially through the widespread deployment of chatbots, it’s vital to understand and anticipate the various ways they can be compromised. Red teaming provides a way to evaluate and improve the resilience of these AI systems against potential attacks. Acknowledging and understanding the myriad of attack strategies is a step toward ensuring that LLMs serve their purpose safely and responsibly, thereby reinforcing their role as pivotal components in the advancement of our AI-driven future.

As we integrate LLMs into the fabric of our digital lives, especially through the widespread deployment of chatbots, it’s vital to understand and anticipate the various ways they can be compromised. Red teaming provides a way to evaluate and improve the resilience of these AI systems against potential attacks. Acknowledging and understanding the myriad of attack strategies is a step toward ensuring that LLMs serve their purpose safely and responsibly, thereby reinforcing their role as pivotal components in the advancement of our AI-driven future.

As we integrate LLMs into the fabric of our digital lives, especially through the widespread deployment of chatbots, it’s vital to understand and anticipate the various ways they can be compromised. Red teaming provides a way to evaluate and improve the resilience of these AI systems against potential attacks. Acknowledging and understanding the myriad of attack strategies is a step toward ensuring that LLMs serve their purpose safely and responsibly, thereby reinforcing their role as pivotal components in the advancement of our AI-driven future.

Deploy your AI apps with confidence

Deploy your AI apps with confidence

Deploy your AI apps with confidence

Scale your customer experience securely with Probe

Join numerous businesses that rely on Probe for their AI security:

CX platforms

Sales platforms

Conversational AI

Finance & banking

Insurances

CPaaS providers

300+

Tested AI chatbots

100k+

Vulnerabilities found

1,000+

Unique attack scenarios

12x

Faster time to market

SECURITY YOU CAN TRUST

GDPR

COMPLIANT

CCPA

COMPLIANT

ISO 27001

CERTIFIED

SOC 2 TYPE II

COMPLIANT

OWASP

CONTRIBUTORS

Scale your customer experience securely with Probe

Join numerous businesses that rely on Probe for their AI security:

CX platforms

Sales platforms

Conversational AI

Finance & banking

Insurances

CPaaS providers

300+

Tested AI chatbots

100k+

Vulnerabilities found

1,000+

Unique attack scenarios

12x

Faster time to market

SECURITY YOU CAN TRUST

GDPR

COMPLIANT

CCPA

COMPLIANT

ISO 27001

CERTIFIED

SOC 2 TYPE II

COMPLIANT

OWASP

CONTRIBUTORS

Scale your customer experience securely with Probe

Join numerous businesses that rely on Probe for their AI security:

CX platforms

Sales platforms

Conversational AI

Finance & banking

Insurances

CPaaS providers

300+

Tested AI chatbots

100k+

Vulnerabilities found

1,000+

Unique attack scenarios

12x

Faster time to market

SECURITY YOU CAN TRUST

GDPR

COMPLIANT

CCPA

COMPLIANT

ISO 27001

CERTIFIED

SOC 2 TYPE II

COMPLIANT

OWASP

CONTRIBUTORS

Supercharged security for your AI systems

Don’t wait for an incident to happen. Make sure your AI apps are safe and trustworthy.

SplxAI - Background Pattern

Supercharged security for your AI systems

Don’t wait for an incident to happen. Make sure your AI apps are safe and trustworthy.

SplxAI - Background Pattern

Supercharged security for your AI systems

Don’t wait for an incident to happen. Make sure your AI apps are safe and trustworthy.