Feature(LLMLingua): add LongLLMLingua & documents

Huiqiang Jiang
2023-10-08 11:56:35 +00:00
parent 44899b19fd
commit b917a05576
9 changed files with 892 additions and 167 deletions

.gitignore (vendored, 1 line changed)

@@ -396,3 +396,4 @@ FodyWeavers.xsd
# JetBrains Rider
*.sln.iml
*.egg-info

DOCUMENT.md (new file, 121 lines)

@@ -0,0 +1,121 @@
# LLMLingua Documentation
## Principles
- The most important principle is that **sensitivity to compression varies among the components of a prompt**: instructions and questions are more sensitive, while context or documents are less sensitive. It is therefore advisable to separate the prompt into its components and pass them as demonstrations, instruction, and question.
- **Divide demonstrations and context into independent granularities**, such as documents in multi-document QA and examples in few-shot learning. This benefits both the budget controller and document reordering; a sketch of this structure follows this list.
- **Preserving essential characters required by the rules of the scenario**; support for this will be provided soon.
- Experiment with different target compression ratios and other hyperparameters to optimize performance.
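As referenced above, a minimal sketch of this structure for a few-shot prompt (all strings are hypothetical placeholders):

```python
# Split the prompt into its components rather than passing one long string:
# an instruction, independent demonstrations (one granular unit per list
# element), and a question.
instruction = "Please solve the following math problems step by step."
demonstrations = [
    "Q: What is 2 + 2?\nA: 4",
    "Q: What is 3 * 5?\nA: 15",
]
question = "Q: What is 12 * 7?\nA:"
```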
## Initialization
```python
from llmlingua import PromptCompressor
llm_lingua = PromptCompressor(
    model_name="NousResearch/Llama-2-7b-hf",
    device_map="cuda",
    use_auth_token=False,
    open_api_config={},
)
```
### Parameters
- **model_name**(str), the name of the small language model from Hugging Face. Default set to "NousResearch/Llama-2-7b-hf";
- **device_map**(str), the device environment for the small model, such as "cuda", "cpu", "balanced", "balanced_low_0", or "auto". Default set to "cuda";
- **use_auth_token**(bool, optional), controls the usage of the Hugging Face auth token. Default set to False;
- **open_api_config**(dict, optional), the OpenAI configuration used by OpenAI Embedding in coarse-level prompt compression. Default set to {};
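For example, a minimal sketch that runs the compressor on CPU with a smaller model; "gpt2" stands in for any Hugging Face causal LM you have available:

```python
from llmlingua import PromptCompressor

# Run the small compression model on CPU instead of GPU.
llm_lingua = PromptCompressor(
    model_name="gpt2",
    device_map="cpu",
)
```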
## Function Call
```python
compressed_prompt = llm_lingua.compress_prompt(
    context,                                # List[str]: documents / demonstrations to compress
    instruction="",
    question="",
    ratio=0.5,
    target_token=-1,
    iterative_size=200,
    force_context_ids=None,                 # Optional[List[int]]
    force_context_number=None,              # Optional[int]
    use_sentence_level_filter=False,
    use_context_level_filter=True,
    use_token_level_filter=True,
    keep_split=False,
    keep_first_sentence=0,
    keep_last_sentence=0,
    keep_sentence_number=0,
    high_priority_bonus=100,
    context_budget="+100",
    token_budget_ratio=1.4,
    condition_in_question="none",
    reorder_context="original",
    dynamic_context_compression_ratio=0.0,
    condition_compare=False,
    add_instruction=False,
    rank_method="longllmlingua",
    concate_question=True,
)
# > {'compressed_prompt': 'Question: Sam bought a dozen boxes, each with 30 highlighter pens inside, for $10 each box. He reanged five of boxes into packages of sixlters each and sold them $3 per. He sold the rest theters separately at the of three pens $2. How much did make in total, dollars?\nLets think step step\nSam bought 1 boxes x00 oflters.\nHe bought 12 * 300ters in total\nSam then took 5 boxes 6ters0ters.\nHe sold these boxes for 5 *5\nAfterelling these boxes there were 3030 highlighters remaining.\nThese form 330 / 3 = 110 groups of three pens.\nHe sold each of these groups for $2 each, so made 110 * 2 = $220 from them.\nIn total, then, he earned $220 + $15 = $235.\nSince his original cost was $120, he earned $235 - $120 = $115 in profit.\nThe answer is 115',
# 'origin_tokens': 2365,
# 'compressed_tokens': 211,
# 'ratio': '11.2x',
# 'saving': ', Saving $0.1 in GPT-4.'}
```
### Parameters
- **context**(str or List[str]), the context, documents, or demonstrations in the prompt; low sensitivity to compression;
- **instruction**(str), the general instruction in the prompt, placed before the context; high sensitivity to compression;
- **question**(str), the general question in the prompt, placed after the context; high sensitivity to compression;
- **ratio**(float, optional), the target compression ratio; the larger the value, the fewer tokens will be retained. Mutually exclusive with **target_token**. Default set to 0.5;
- **target_token**(float), the target number of tokens after compression. Mutually exclusive with **ratio**. Default set to -1;
- **iterative_size**(int), the segment size used in Iterative Token-level Prompt Compression. Default set to 200;
- **force_context_ids**(List[int], optional), the indices of **context** items to forcefully retain. Default set to None;
- **force_context_number**(int, optional), the number of contexts to forcefully retain in Coarse-level Prompt Compression. Default set to None;
- **use_sentence_level_filter**(bool, optional), controls the usage of sentence-level prompt compression. Default set to False;
- **use_context_level_filter**(bool, optional), controls the usage of coarse-level prompt compression. Default set to True;
- **use_token_level_filter**(bool, optional), controls the usage of token-level prompt compression. Default set to True;
- **keep_split**(bool, optional), controls whether to retain all newline separators "\n\n" in the prompt. Default set to False;
- **keep_first_sentence**(int, optional), controls whether to retain the first k sentences in each context. Default set to 0;
- **keep_last_sentence**(int, optional), controls whether to retain the last k sentences in each context. Default set to 0;
- **keep_sentence_number**(int, optional), controls the number of sentences to retain in each context. Default set to 0;
- **high_priority_bonus**(int, optional), controls the perplexity bonus given to retained sentences; only used when **keep_first_sentence** or **keep_last_sentence** is set. Default set to 100;
- **context_budget**(str, optional), the budget in Coarse-level Prompt Compression; supports operators, like "*1.5" or "+100". Default set to "+100";
- **token_budget_ratio**(float, optional), the budget ratio in sentence-level prompt compression. Default set to 1.4;
- **condition_in_question**(str, optional), controls whether to use question-aware coarse-level prompt compression; supports "none", "after", and "before". For LongLLMLingua, it must be set to "after" or "before" (see the sketch after this list). Default set to "none";
- **reorder_context**(str, optional), controls whether to reorder documents before compression in LongLLMLingua; supports "original", "sort", and "two_stage". Default set to "original";
- **dynamic_context_compression_ratio**(float, optional), controls the ratio of dynamic context compression in LongLLMLingua. Default set to 0.0;
- **condition_compare**(bool, optional), controls whether to use Iterative Token-level Question-aware Fine-Grained Compression in LongLLMLingua. Default set to False;
- **add_instruction**(bool, optional), controls whether to add an instruction before the prompt in Iterative Token-level Question-aware Fine-Grained Compression. Default set to False;
- **rank_method**(str, optional), controls the ranking method used in Coarse-level Prompt Compression; supports "llmlingua", "longllmlingua", "bm25", "gzip", "sentbert", and "openai". Default set to "llmlingua";
- **concate_question**(bool, optional), controls whether to include the question in the compressed prompt. Default set to True;
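As referenced above, here is a sketch of a LongLLMLingua-style call for multi-document QA; `documents` and `query` are hypothetical placeholders, and the option values are drawn from the LongLLMLingua settings described in this list:

```python
documents = ["Document 1 ...", "Document 2 ...", "Document 3 ..."]  # placeholder contexts
query = "In which year did the event take place?"                   # placeholder question

compressed = llm_lingua.compress_prompt(
    context=documents,
    question=query,
    target_token=500,                       # token budget instead of a ratio
    condition_in_question="after",          # question-aware coarse-level compression
    reorder_context="sort",                 # reorder documents before compression
    dynamic_context_compression_ratio=0.3,  # dynamic per-document compression
    condition_compare=True,                 # question-aware fine-grained compression
    rank_method="longllmlingua",
)
print(compressed["compressed_prompt"])
```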
### Response
- **compressed_prompt**(str), the compressed prompt;
- **origin_tokens**(int), the number of tokens in the original prompt;
- **compressed_tokens**(int), the number of tokens in the compressed prompt;
- **ratio**(str), the actual compression ratio;
- **saving**(str), the cost saving when using GPT-4.
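For instance, the returned dictionary can be inspected directly (assuming `compressed_prompt` is the result of the call above):

```python
print(compressed_prompt["origin_tokens"], "->", compressed_prompt["compressed_tokens"])
print("compression:", compressed_prompt["ratio"], compressed_prompt["saving"])
```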
## Post-processing
```python
recovered_response = llm_lingua.recover(
    original_prompt,    # str: the original, uncompressed prompt
    compressed_prompt,  # str: the compressed prompt
    response,           # str: the black-box LLM's response to the compressed prompt
)
```
### Parameters
- **original_prompt**(str), the original prompt;
- **compressed_prompt**(str), the compressed prompt;
- **response**(str), the response produced by the black-box LLM for the compressed prompt;
### Response
- **recovered_response**(str), the recovered response;
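Putting it all together, a sketch of the full round trip (compress, query a black-box LLM, recover), assuming `context` and `question` from the sketches above; the pre-v1 `openai` client call and the reconstruction of the original prompt are illustrative assumptions:

```python
import openai  # pre-v1 openai client; assumes OPENAI_API_KEY is set in the environment

result = llm_lingua.compress_prompt(context, question=question, target_token=200)

# Send the compressed prompt to a black-box LLM.
completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": result["compressed_prompt"]}],
)
response = completion["choices"][0]["message"]["content"]

# Recover the response against the original prompt.
recovered_response = llm_lingua.recover(
    "\n\n".join(context) + "\n\n" + question,  # assumed reconstruction of the original prompt
    result["compressed_prompt"],
    response,
)
```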


@@ -2,16 +2,21 @@
<img src="images/LLMLingua_logo.png" alt="LLMLingua" style="width: 20%; min-width: 100px; display: block; margin: auto;">
</p>
# LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models
# LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models [[paper]()] & LongLLMLingua [[paper]()]
This repo contains the code for LLMLingua, a project that compresses prompts and speeds up inference for LLMs with minimal loss of performance.
https://github.com/microsoft/LLMLingua/assets/30883354/ef52995c-ef3c-4eac-a9fd-1acb491c325b
[LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models]() ().
## TL;DR
LLMLingua uses a well-trained small language model after alignment, such as GPT2-small or LLaMA-7B, to detect unimportant tokens in the prompt, enabling inference with the compressed prompt in black-box LLMs and achieving up to 20x compression with minimal performance loss.
[LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models]() (EMNLP 2023).
_Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang and Lili Qiu_
LongLLMLingua is a method that enhances LLMs' ability to perceive key information in long-context scenarios via prompt compression, achieving up to $28.5 in cost savings per 1,000 samples while also improving performance.
PS: We also release a hackathon demo to show our idea. Please check [here](https://hackbox.microsoft.com/hackathons/hackathon2023/project/26540).
[LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression]() (Under Review).
_Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang and Lili Qiu_
## 🎥 Overview
@@ -25,12 +30,12 @@ Large language models, such as ChatGPT and GPT-4, impress us with their amazing
![image](./images/LLMLingua_framework.png)
Now you can use **LLMLingua**!
Now you can use **LLMLingua** & **LongLLMLingua**!
A simple and efficient method to compress prompts by up to **20x**.
- 💰 **Saving cost**, not only on the prompt, but also on the generation length;
- 📝 **Support longer contexts**;
- 📝 **Support longer contexts** while delivering enhanced performance;
- ⚖️ **Robustness**, no training needed for the LLMs;
- 🕵️ **Keeping** the original prompt knowledge like ICL, reasoning, etc.;
- 📜 **KV-Cache compression**, speeding up inference;
@@ -43,6 +48,16 @@ If you find this repo helpful, please cite the following paper:
@inproceedings{jiang-etal-2023-llmlingua,
title = "LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models",
author = "Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang and Lili Qiu",
booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2023",
publisher = "Association for Computational Linguistics",
}
```
```bibtex
@inproceedings{jiang-etal-2023-longllmlingua,
title = "LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression",
author = "Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang and Lili Qiu",
}
```
@@ -59,8 +74,8 @@ Then, you can use LLMLingua to compress your prompt,
```python
from llmlingua import PromptCompressor
llmlingua = PromptCompressor()
compressed_prompt = llmlingua.compress_prompt(prompt, instruction="", question="", target_token=200)
llm_lingua = PromptCompressor()
compressed_prompt = llm_lingua.compress_prompt(prompt, instruction="", question="", target_token=200)
# > {'compressed_prompt': 'Question: Sam bought a dozen boxes, each with 30 highlighter pens inside, for $10 each box. He reanged five of boxes into packages of sixlters each and sold them $3 per. He sold the rest theters separately at the of three pens $2. How much did make in total, dollars?\nLets think step step\nSam bought 1 boxes x00 oflters.\nHe bought 12 * 300ters in total\nSam then took 5 boxes 6ters0ters.\nHe sold these boxes for 5 *5\nAfterelling these boxes there were 3030 highlighters remaining.\nThese form 330 / 3 = 110 groups of three pens.\nHe sold each of these groups for $2 each, so made 110 * 2 = $220 from them.\nIn total, then, he earned $220 + $15 = $235.\nSince his original cost was $120, he earned $235 - $120 = $115 in profit.\nThe answer is 115',
# 'origin_tokens': 2365,
@@ -69,6 +84,11 @@ compressed_prompt = llmlingua.compress_prompt(prompt, instruction="", question="
# 'saving': ', Saving $0.1 in GPT-4.'}
```
You can refer to this [document](./DOCUMENT.md) for more recommendations on how to use LLMLingua effectively.
## Frequently Asked Questions
See [Transparency_FAQ.md](./Transparency_FAQ.md).
## Contributing


@@ -1,13 +1,3 @@
# TODO: The maintainer of this repo has not yet edited this file
**REPO OWNER**: Do you want Customer Service & Support (CSS) support for this product/project?
- **No CSS support:** Fill out this template with information about how to file issues and get help.
- **Yes CSS support:** Fill out an intake form at [aka.ms/onboardsupport](https://aka.ms/onboardsupport). CSS will work with/help you to determine next steps.
- **Not sure?** Fill out an intake as though the answer were "Yes". CSS will help you decide.
*Then remove this first heading from this SUPPORT.MD file before publishing your repo.*
# Support
## How to file issues and get help
@@ -16,9 +6,7 @@ This project uses GitHub Issues to track bugs and feature requests. Please searc
issues before filing new issues to avoid duplicates. For new issues, file your bug or
feature request as a new Issue.
For help and questions about using this project, please **REPO MAINTAINER: INSERT INSTRUCTIONS HERE
FOR HOW TO ENGAGE REPO OWNERS OR COMMUNITY FOR HELP. COULD BE A STACK OVERFLOW TAG OR OTHER
CHANNEL. WHERE WILL YOU HELP PEOPLE?**.
For help and questions about using this project, please refer the [document](./DOCUMENT.md).
## Microsoft Support Policy

Transparency_FAQ.md (new file, 41 lines)

@@ -0,0 +1,41 @@
# LLMLingua's Responsible AI FAQ
## What is LLMLingua?
- LLMLingua is a simple and efficient method to compress prompts by up to 20x while keeping the original prompt knowledge like ICL, reasoning, etc.
- LLMLingua takes user-defined prompts and compression goals as input, and outputs a compressed prompt, which may often result in a form of expression that is difficult for humans to understand.
## What can LLMLingua do?
- LLMLingua can simultaneously reduce the length of prompts and the output of LLMs (20%-30%), thus reducing the cost of API calls;
- Compressed prompts from LLMLingua can be directly used with black-box LLMs, such as ChatGPT, GPT-4, and Claude;
- By compressing prompts, LLMLingua allows for more information to be included within the original token length, thereby improving model performance;
- LLMLingua relies on a small language model, like GPT-2 or LLaMA-7b, for perplexity calculations, which is a relatively low-cost approach;
- Compressed prompts generated by LLMLingua can be understood by LLMs, preserving their original capabilities in downstream tasks and keeping the original prompt knowledge like ICL, reasoning, etc. LLMs can also recover the essential information from the compressed prompts;
- LLMLingua is a robust method that requires no training of the LLMs;
- Additionally, LLMLingua can be used to compress KV-Cache, which speeds up inference.
## What is/are LLMLingua's intended use(s)?
- Users who call black-box LLM APIs similar to GPT-4, those who utilize ChatGPT to handle longer content, as well as model deployers and cloud service providers, can benefit from these techniques.
## How was LLMLingua evaluated? What metrics are used to measure performance?
- In our experiments, we conducted a detailed evaluation of the performance of compressed prompts across various tasks, particularly in those involving LLM-specific capabilities, such as In-Context Learning, reasoning tasks, summarization, and conversation tasks. We assessed our approach using compression ratio and performance loss as evaluation metrics.
## What are the limitations of LLMLingua? How can users minimize the impact of LLMLingua's limitations when using the system?
- The potential harmful, false or biased responses using the compressed prompts would likely be unchanged. Thus using LLMLingua has no inherent benefits or risks when it comes to those types of responsible AI issues.
- LLMLingua may struggle to perform well at particularly high compression ratios, especially when the original prompts are already quite short.
## What operational factors and settings allow for effective and responsible use of LLMLingua?
- Users can set parameters such as the boundaries between different components (instruction, context, question) in the prompt, the compression goal, and the small model used for compression calculations, then feed the compressed prompt to black-box LLMs, as sketched below.
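A minimal sketch of that workflow, assuming `llm_lingua` is a `PromptCompressor` instance and the component strings are hypothetical placeholders (see [DOCUMENT.md](./DOCUMENT.md) for the full parameter list):

```python
instruction = "Answer the question using the given passages."
context = ["Passage 1 ...", "Passage 2 ..."]  # user-defined component boundaries
question = "In which year was the treaty signed?"

compressed = llm_lingua.compress_prompt(
    context=context,
    instruction=instruction,
    question=question,
    target_token=300,  # the compression goal
)
# compressed["compressed_prompt"] can now be sent to a black-box LLM.
```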
## What is instruction, context, and question?
In our approach, we divide the prompts into three distinct modules: instruction, context, and question. Each prompt necessarily contains a question, but the presence of context and instruction is not always guaranteed.
- Question: This refers to the directives given by the user to the LLMs, such as inquiries, questions, or requests. Positioned after the instruction and context modules, the question module has a high sensitivity to compression.
- Context: This module provides the supplementary context needed to address the question, such as documents, demonstrations, web search results, or API call results. Located between the instruction and question modules, its sensitivity to compression is relatively low.
- Instruction: This module consists of directives given by the user to the LLMs, such as task descriptions. Placed before the context and question modules, the instruction module exhibits a high sensitivity to compression.

File diff suppressed because it is too large


@@ -1,11 +1,11 @@
_MAJOR = "0"
_MINOR = "0"
_MINOR = "1"
# On master and in a nightly release the patch should be one ahead of the last
# released build.
_PATCH = "1"
_PATCH = "0"
# This is mainly for nightly builds which have the suffix ".dev$DATE". See
# https://semver.org/#is-v123-a-semantic-version for the semantics.
_SUFFIX = "dev0"
_SUFFIX = ""
VERSION_SHORT = "{0}.{1}".format(_MAJOR, _MINOR)
VERSION = "{0}.{1}.{2}.{3}".format(_MAJOR, _MINOR, _PATCH, _SUFFIX)
VERSION = "{0}.{1}.{2}{3}".format(_MAJOR, _MINOR, _PATCH, _SUFFIX)



@@ -24,6 +24,7 @@ INSTALL_REQUIRES = [
"torch",
"tiktoken",
"nltk",
"numpy",
]
QUANLITY_REQUIRES = [
"black==21.4b0",