Large Language Models (LLMs) have gained significant attention in recent years, but they face a critical security challenge known as prompt leakage. This vulnerability allows malicious actors to extract sensitive information from LLM prompts through targeted adversarial inputs. The issue stems from the conflict between safety training and instruction-following objectives in LLMs. Prompt leakage poses significant risks, including exposure of system intellectual property, sensitive contextual knowledge, style guidelines, and even backend API calls in agent-based systems. The threat is particularly concerning due to its effectiveness and simplicity, coupled with the widespread adoption of LLM-integrated applications. While previous research has examined prompt leakage in single-turn interactions, the more complex multi-turn scenario remains underexplored. In addition to this, there is a pressing need for robust defense strategies to mitigate this vulnerability and protect user trust.
Researchers have made several attempts to address the challenge of prompt leakage in LLM applications. The PromptInject framework was developed to study instruction leakage in GPT-3, while gradient-based optimization methods have been proposed to generate adversarial queries for system prompt leakage. Other approaches include parameter extraction and prompt reconstruction methodologies. Studies have also focused on measuring system prompt leakage in real-world LLM applications and investigating the vulnerability of tool-integrated LLMs to indirect prompt injection attacks.
Recent work has expanded to examine datastore leakage risks in production RAG systems and the extraction of personally identifiable information from external retrieval databases. The PRSA attack framework has demonstrated the ability to infer prompt instructions from commercial LLMs. However, most of these studies have primarily focused on single-turn scenarios, leaving multi-turn interactions and comprehensive defense strategies relatively unexplored.
Various defense methods have been evaluated, including perplexity-based approaches, input processing techniques, auxiliary helper models, and adversarial training. Inference-only methods for intention analysis and goal prioritization have shown promise in improving defense against adversarial prompts. Also, black-box defense techniques and API defenses like detectors and content filtering mechanisms have been employed to counter indirect prompt injection attacks.
The study by Salesforce AI Research employs a standardized task setup to evaluate the effectiveness of various black-box defense strategies in mitigating prompt leakage. The methodology involves a multi-turn question-answering interaction between the user (acting as an adversary) and the LLM, focusing on four realistic domains: news, medical, legal, and finance. This approach allows for a systematic assessment of information leakage across different contexts.
LLM prompts are dissected into task instructions and domain-specific knowledge to observe the leakage of specific prompt contents. The experiments encompass seven black-box LLMs and four open-source models, providing a comprehensive analysis of the vulnerability across different LLM implementations. To adapt to the multi-turn RAG-like setup, the researchers employ a unique threat model and compare various design choices.
The attack strategy consists of two turns. In the first turn, the system is prompted with a domain-specific query along with an attack prompt. The second turn introduces a challenger utterance, allowing for a successive leakage attempt within the same conversation. This multi-turn approach provides a more realistic simulation of how an adversary might exploit vulnerabilities in real-world LLM applications.
The research methodology utilizes the concept of sycophantic behavior in models to develop a more effective multi-turn attack strategy. This approach significantly increases the average Attack Success Rate (ASR) from 17.7% to 86.2%, achieving nearly complete leakage (99.9%) on advanced models like GPT-4 and Claude-1.3. To counter this threat, the study implements and compares various black- and white-box mitigation techniques that application developers can employ.
A key component of the defense strategy is the implementation of a query-rewriting layer, commonly used in Retrieval-Augmented Generation (RAG) setups. The effectiveness of each defense mechanism is assessed independently. For black-box LLMs, the Query-Rewriting defense proves most effective at reducing average ASR during the first turn, while the Instruction defense is more successful at mitigating leakage attempts in the second turn.
The comprehensive application of all mitigation strategies to the experimental setup results in a significant reduction of the average ASR for black-box LLMs, bringing it down to 5.3% against the proposed threat model. Also, the researchers curate a dataset of adversarial prompts designed to extract sensitive information from the system prompt. This dataset is then used to finetune an open-source LLM to reject such attempts, further enhancing the defense capabilities.
The study’s data setup encompasses four common domains: news, finance, legal, and medical, chosen to represent a range of everyday and specialized topics where LLM prompt contents may be particularly sensitive. A corpus of 200 input documents per domain is created, with each document truncated to approximately 100 words to eliminate length bias. GPT-4 is then used to generate one query for each document, resulting in a final corpus of 200 input queries per domain.
The task setup simulates a practical multi-turn QA scenario using an LLM agent. A carefully designed baseline template is employed, consisting of three components: (1) Task Instructions (INSTR) providing system guidelines, (2) Knowledge Documents (KD) containing domain-specific information, and (3) the user (adversary) input. For each query, the two most relevant knowledge documents are retrieved and included in the system prompt.
This study evaluates ten popular LLMs: three open-source models (LLama2-13b-chat, Mistral7b, Mixtral 8x7b) and seven proprietary black-box LLMs accessed through APIs (Command-XL, Command-R, Claude v1.3, Claude v2.1, GeminiPro, GPT-3.5-turbo, and GPT-4). This diverse selection of models allows for a comprehensive analysis of prompt leakage vulnerabilities across different LLM implementations and architectures.
The research employs a sophisticated multi-turn threat model to assess prompt leakage vulnerabilities in LLMs. In the first turn, a domain-specific query is combined with an attack vector, targeting a standardized QA setup. The attack prompt, randomly selected from a set of GPT-4 generated leakage instructions, is appended to the domain-specific query.
For the second turn, a carefully designed attack prompt is introduced. This prompt incorporates a sycophantic challenger and attack reiteration component, exploiting the LLM’s tendency to exhibit a flip-flop effect when confronted with challenger utterances in multi-turn conversations.
To analyze the effectiveness of the attacks, the study classifies information leakage into four categories: FULL LEAKAGE, NO LEAKAGE, KD LEAKAGE (knowledge documents only), and INSTR LEAKAGE (task instructions only). Any form of leakage, except NO LEAKAGE, is considered a successful attack.
For detecting leakage, the researchers employ a Rouge-L recall-based method, applied separately to instructions and knowledge documents in the prompt. This method outperforms a GPT-4 judge in accurately determining attack success when compared to human annotations, demonstrating its effectiveness in capturing both verbatim and paraphrased leaks of prompt contents.
The study employs a comprehensive set of defense strategies against the multi-turn threat model, encompassing both black-box and white-box approaches. Black-box defenses, which assume no access to model parameters, include:
1. In-context examples
2. Instruction defense
3. Multi-turn dialogue separation
4. Sandwich defence
5. XML tagging
6. Structured outputs using JSON format
7. Query-rewriting module
These techniques are designed for easy implementation by LLM application developers. Also, a white-box defense involving the safety-finetuning of an open-source LLM is explored.
The researchers evaluate each defense independently and in various combinations. Results show varying effectiveness across different LLM models and configurations. For instance, in some configurations, the average ASR for closed-source models ranges from 16.0% to 82.3% across different turns and setups.
The study also reveals that open-source models generally exhibit higher vulnerability, with average ASRs ranging from 18.2% to 93.0%. Notably, certain configurations demonstrate significant mitigation effects, particularly in the first turn of the interaction.
The study’s results reveal significant vulnerabilities in LLMs that prompt leakage attacks, particularly in multi-turn scenarios. In the baseline setting without defenses, the average ASR was 17.7% for the first turn across all models, increasing dramatically to 86.2% in the second turn. This substantial increase is attributed to the LLMs’ sycophantic behavior and the reiteration of attack instructions.
Different defense strategies showed varying effectiveness:
1. Query-Rewriting proved most effective for closed-source models in the first turn, reducing ASR by 16.8 percentage points.
2. Instruction defense was most effective against the second-turn challenger, reducing ASR by 50.2 percentage points for closed-source models.
3. Structured response defense was particularly effective for open-source models in the second turn, reducing ASR by 28.2 percentage points.
Combining multiple defenses yielded the best results. For closed-source models, applying all black-box defenses together reduced the ASR to 0% in the first turn and 5.3% in the second turn. Open-source models remained more vulnerable, with a 59.8% ASR in the second turn, even with all defenses applied.
The study also explored the safety fine-tuning of an open-source model (phi-3-mini), which showed promising results when combined with other defenses, achieving near-zero ASR.
This study presents significant findings on prompt leakage in RAG systems, offering crucial insights for enhancing security across both closed- and open-source LLMs. It pioneered a detailed analysis of prompt content leakage and explored defensive strategies. The research revealed that LLM sycophancy increases vulnerability to prompt leakage in all models. Notably, combining black-box defenses with query-rewriting and structured responses effectively reduced the average attack success rate to 5.3% for closed-source models. However, open-source models remained more susceptible to these attacks. Interestingly, the study identified phi-3-mini-, a small open-source LLM, as particularly resilient against leakage attempts when coupled with black-box defenses, highlighting a promising direction for secure RAG system development.
Check out the Paper. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. If you like our work, you will love our newsletter..
Don’t Forget to join our 50k+ ML SubReddit
Interested in promoting your company, product, service, or event to over 1 Million AI developers and researchers? Let’s collaborate!
The post Salesforce AI Research Proposes a Novel Threat Model: Building Secure LLM Applications Against Prompt Leakage Attacks appeared first on MarkTechPost.