## Retrieval-Augmented Generation (RAG)

<a target="_blank" href="https://colab.research.google.com/github/microsoft/LLMLingua/blob/main/examples/RAG.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

Retrieval-Augmented Generation (RAG) is a powerful and popular technique that applies specialized knowledge to large language models (LLMs). However, traditional RAG methods tend to produce increasingly long prompts, sometimes exceeding **40k** tokens, which drives up both financial cost and latency. Moreover, the lower information density of such prompts can degrade LLM performance, as in the "lost in the middle" issue.

<center><img width="800" src="../images/LongLLMLingua_Motivation.png"></center>

To address this, we propose [**LongLLMLingua**](https://arxiv.org/abs/2310.06839), which tackles the low information density of long contexts through prompt compression, making it particularly suitable for RAG tasks. The main idea is a two-stage compression process, shown by the <font color='red'>**red line**</font>, which significantly improves on the original curve:

- Coarse-grained compression using document-level perplexity;
- Fine-grained compression of the remaining text using token-level perplexity.
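
The two stages above can be sketched in a few lines of plain Python. This is a minimal illustration, not LongLLMLingua's implementation: `coarse_select` and `fine_filter` are hypothetical helpers, and the document-level and token-level scores are assumed to be precomputed from a language model's perplexity, with a higher score meaning "more worth keeping".

```python
from typing import Dict, List, Tuple


def coarse_select(
    doc_scores: Dict[str, float], doc_lengths: Dict[str, int], budget: int
) -> List[str]:
    """Stage 1 (assumed scoring): keep the best-scoring documents that fit the token budget."""
    kept, used = [], 0
    for doc_id in sorted(doc_scores, key=doc_scores.get, reverse=True):
        # Skip any document that would push the total past the budget.
        if used + doc_lengths[doc_id] <= budget:
            kept.append(doc_id)
            used += doc_lengths[doc_id]
    return kept


def fine_filter(tokens: List[Tuple[str, float]], keep_ratio: float) -> List[str]:
    """Stage 2 (assumed scoring): keep the highest-scoring tokens in their original order."""
    n_keep = max(1, int(len(tokens) * keep_ratio))
    top = sorted(range(len(tokens)), key=lambda i: tokens[i][1], reverse=True)[:n_keep]
    return [tokens[i][0] for i in sorted(top)]
```

In this sketch the coarse stage decides which documents survive at all, and the fine stage then thins out the surviving text token by token, so the second stage only pays for text that already passed the first.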

Instead of fighting positional effects, we use them to our advantage through document reordering, as illustrated by the <font color='green'>**green line**</font>. In this approach, the most critical passages are placed at the beginning and the end of the context. The entire process also becomes **more cost-effective and faster**, since the LLM only has to handle **1/4** of the original context.
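
As a rough illustration of placing the most important passages at the edges of the context, the sketch below alternates ranked documents between the front and the back. `reorder_to_edges` is a hypothetical helper and the importance scores are assumed inputs; this is not the reordering code used inside LLMLingua.

```python
from typing import List, Tuple


def reorder_to_edges(docs_with_scores: List[Tuple[str, float]]) -> List[str]:
    """Place higher-scoring documents toward the start and end of the context."""
    ranked = sorted(docs_with_scores, key=lambda pair: pair[1], reverse=True)
    front, back = [], []
    for rank, (doc, _) in enumerate(ranked):
        # Alternate: rank 0 goes to the front, rank 1 to the back, and so on.
        (front if rank % 2 == 0 else back).append(doc)
    # Reverse the back half so the second-best document ends up last.
    return front + back[::-1]
```

With this ordering the lowest-scoring documents end up in the middle of the prompt, which is where the "lost in the middle" effect hurts the least.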

### NaturalQuestions Multi-document QA

Next, we demonstrate LongLLMLingua on the NaturalQuestions dataset and show that it effectively alleviates the "lost in the middle" issue. This dataset closely resembles real-world RAG scenarios: the Contriever retrieval system first recalls 20 relevant documents (1 ground-truth document and 19 related ones), and the question is then answered from a prompt composed of these 20 documents.

The original dataset is available here: https://github.com/nelson-liu/lost-in-the-middle/tree/main/qa_data

In [6]:
# Install dependencies.
## Lost in the middle
!git clone https://github.com/nelson-liu/lost-in-the-middle
!cd lost-in-the-middle && echo "xopen" > requirements.txt && pip install -e .
## LLMLingua
!pip install llmlingua

Cloning into 'lost-in-the-middle'...
remote: Enumerating objects: 101, done.
remote: Counting objects: 100% (78/78), done.
remote: Compressing objects: 100% (52/52), done.
remote: Total 101 (delta 33), reused 61 (delta 17), pack-reused 23
Receiving objects: 100% (101/101), 254.44 MiB | 48.70 MiB/s, done.
Resolving deltas: 100% (33/33), done.
Defaulting to user installation because normal site-packages is not writeable
Obtaining file:///home/hjiang/Code/github/LLMLingua/examples/lost-in-the-middle
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Building wheels for collected packages: lost-in-the-middle
  Building editable for lost-in-the-middle (pyproject.toml) ... done
  Created wheel for lost-in-the-middle: filename=lost_in_the_middle-0.0.0-0.editable-py3-none-an

In [8]:
# Option 1: use the OpenAI API
import openai
openai.api_key = "<insert_openai_key>"

In [42]:
# Option 2: use the Azure OpenAI API instead
import openai
openai.api_key = "<insert_openai_key>"
openai.api_base = "https://xxxx.openai.azure.com/"
openai.api_type = 'azure'
openai.api_version = '2023-05-15'

### Setup Data

In [12]:
import json
from xopen import xopen
from copy import deepcopy
from tqdm import tqdm
from lost_in_the_middle.prompting import (
    Document,
    get_closedbook_qa_prompt,
    get_qa_prompt,
)

datasets = []
path = "./lost-in-the-middle/qa_data/20_total_documents/nq-open-20_total_documents_gold_at_9.jsonl.gz"
with xopen(path) as f:
    # Each line is one JSON example with a question, its retrieved documents, and the answers.
    for idx, line in tqdm(enumerate(f), total=2655):
        input_example = json.loads(line)
        question = input_example["question"]
        documents = []
        for ctx in deepcopy(input_example["ctxs"]):
            documents.append(Document.from_dict(ctx))

        prompt = get_qa_prompt(
            question,
            documents,
            mention_random_ordering=False,
            query_aware_contextualization=False,
        )

        # Split the prompt into the instruction, the retrieved documents, and the final question.
        c = prompt.split("\n\n")
        instruction, question = c[0], c[-1]
        demonstration = "\n".join(c[1:-1])
        datasets.append(
            {
                "id": idx,
                "instruction": instruction,
                "demonstration": demonstration,
                "question": question,
                "answer": input_example["answers"],
            }
        )

100%|██████████| 2655/2655 [00:01<00:00, 1550.38it/s]


In [20]:
# Select one example from NaturalQuestions
instruction, demonstration_str, question, answer = [datasets[23][key] for key in ["instruction", "demonstration", "question", "answer"]]

In [23]:
# Ground-truth answer
answer

['14']

### Response to the Original Prompt (Incorrect)

In [25]:
# Response to the original, uncompressed prompt (incorrect: the model answers 15, the ground truth is 14)
prompt = "\n\n".join([instruction, demonstration_str, question])

message = [
    {"role": "user", "content": prompt},
]

request_data = {
    "messages": message,
    "max_tokens": 100,
    "temperature": 0,
    "top_p": 1,
    "n": 1,
    "stream": False,
}
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",  # with Azure OpenAI, pass engine="<deployment_name>" instead
    **request_data,
)
print(json.dumps(response, indent=4))

{
    "id": "chatcmpl-8FFZIQCjv9Dv5Q9WQcDmNBT1OJIP8",
    "object": "chat.completion",
    "created": 1698645456,
    "model": "gpt-35-turbo",
    "choices": [
        {
            "index": 0,
            "finish_reason": "stop",
            "message": {
                "role": "assistant",
                "content": "As of the provided search results, OPEC has 15 member countries."
            }
        }
    ],
    "usage": {
        "prompt_tokens": 2897,
        "completion_tokens": 15,
        "total_tokens": 2912
    }
}
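
To compare a response with the ground-truth answers programmatically, a loose substring check is often enough for a quick sanity test. The sketch below is an assumption-laden illustration rather than the benchmark's official metric: `normalize` and `contains_answer` are hypothetical helpers that lowercase the text, strip punctuation and English articles, and then look for any gold answer inside the response.

```python
import re
import string
from typing import List


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def contains_answer(response_text: str, answers: List[str]) -> bool:
    """Return True if any normalized gold answer appears in the normalized response."""
    normalized = normalize(response_text)
    return any(normalize(ans) in normalized for ans in answers)
```

Applied to the response above with the ground truth `['14']`, this check reports a miss, since the model answered 15. Because it matches substrings, it can also produce false positives (for example, `14` inside `2014`), so treat it as a rough filter only.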


### Response to the Compressed Prompt (Correct at 10x Compression)

In [29]:
# Set up LLMLingua
from llmlingua import PromptCompressor
llm_lingua = PromptCompressor()

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]



In [43]:
# 6x Compression
compressed_prompt = llm_lingua.compress_prompt(
    demonstration_str.split("\n"),
    instruction=instruction,
    question=question,
    target_token=500,
    condition_compare=True,
    condition_in_question='after',
    rank_method='longllmlingua',
    use_sentence_level_filter=False,
    context_budget="+100",
    dynamic_context_compression_ratio=0.4, # enable dynamic_context_compression_ratio
    reorder_context="sort"
)
message = [
    {"role": "user", "content": compressed_prompt["compressed_prompt"]},
]

request_data = {
    "messages": message,
    "max_tokens": 100,
    "temperature": 0,
    "top_p": 1,
    "n": 1,
    "stream": False,
}
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",  # with Azure OpenAI, pass engine="<deployment_name>" instead
    **request_data,
)

print(json.dumps(compressed_prompt, indent=4))
print("Response:", response)

{
    "compressed_prompt": "Write a high-quality answer for the given question using only the provided search results (some of which might be irrelevant).\n\nDocument [10](Title: OPEC Organization of the Petroleum Exporting Countries (OPEC, /\u02c8o\u028ap\u025bk/ OH-pek, or OPEP in several other languages) is an intergovernmental organization of 14 nations as of February 2018, founded in 1960 in Baghdad by the first five members (Iran, Iraq, Kuwait, Saudi Arabia, and Venezuela), and headquartered since 1965 in Vienna, Austria. As of 2016, the 14 countries accounted for an estimated 44 percent of global oil production and 73 percent of the world's \"proven\" oil reserves, giving OPEC a major influence on global oil prices that were previously determined by American-dominated multinational oil companies.\n\nDocument1](Title: OPE OPE lost its newest members, who had in mid-1970s E withd in December 192, because it was unwilling to pay annual US$2 million membership fee felt that it neede

In [40]:
# 10x Compression
compressed_prompt = llm_lingua.compress_prompt(
    demonstration_str.split("\n"),
    instruction=instruction,
    question=question,
    target_token=100,
    condition_compare=True,
    condition_in_question='after',
    rank_method='longllmlingua',
    use_sentence_level_filter=False,
    context_budget="+100",
    dynamic_context_compression_ratio=0.4, # enable dynamic_context_compression_ratio
    reorder_context="sort"
)
message = [
    {"role": "user", "content": compressed_prompt["compressed_prompt"]},
]

request_data = {
    "messages": message,
    "max_tokens": 100,
    "temperature": 0,
    "top_p": 1,
    "n": 1,
    "stream": False,
}
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",  # with Azure OpenAI, pass engine="<deployment_name>" instead
    **request_data,
)

print(json.dumps(compressed_prompt, indent=4))
print("Response:", response)

{
    "compressed_prompt": "Write a high-quality answer for the given question using only the provided search results (some of which might be irrelevant).\n\n0Title: OPECization of Petroleum Exporting Count (OPEC, /\u02c8o\u028ap\u025bk OHpekP in other) is an intergovernmental14 nations as February 218 founded in 960 in Baghdad by fiveIran Iraq, Kuwait, Saudi Arab, and Venezuela), headquartered since 965 in, Austria. of the4ed an estimated4 percent of production and 3 percent of the world's \"proven\" oil res OPEC on global by Americandominatedin companies.\n\n5](Title: OPEC) OPEC lost its two newest members, who had joined in the mid-1970s. Ecuador withdrew in December 1992, because it was unwilling to pay the annual US$2 million membership fee and felt that it needed to produce more oil than it was allowed under the OPEC quota, although it rejoined in October 2007. Similar concerns prompted Gabon to suspend membership in January 1995; it rejoined in July 2016. Iraq has remained a mem