init

2025-10-22 23:19:46 +03:00 · 2025-05-31 09:20:21 -07:00
parent 40b2ff53ec
commit 1dc9bf714c
32 changed files with 10658 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -1 +1,125 @@
 # LongCodeZip
+
+This repository is the official implementation of LongCodeZip, a novel two-stage long code compression method.
+
+
+## Method Overview
+
+![Overview](assets/overview.png)
+
+LongCodeZip introduces a two-stage code compression framework specifically designed for code LLMs:
+
+1. **Coarse-grained Compression**: Function-based chunking and ranking using conditional perplexity with respect to the query to select the most relevant functions.
+
+2. **Fine-grained Compression**: Entropy-based block detection combined with 0/1 knapsack optimization to maximize relevance within adaptive token budgets.
+
+The method is plug-and-play and can be integrated with existing code LLMs to achieve significant compression ratios while maintaining or improving task performance.
+
+## Repository Structure
+
+This repository contains implementations and experiments for three code-related tasks:
+
+```
+LongCodeZip/
+├── repoqa/                    # Code Retrieval Task
+│   ├── main.py               # Main evaluation script
+│   ├── run.sh                # Experiment runner
+│   ├── code_compressor.py    # Core compression implementation
+│   ├── compute_score.py      # Evaluation metrics
+│   └── ...
+├── long-code-completion/      # Code Completion Task
+│   ├── main.py               # Main evaluation script
+│   ├── run.sh                # Experiment runner
+│   ├── code_compressor.py    # Core compression implementation
+│   ├── utils.py              # Utility functions
+│   └── ...
+├── module_summarization/      # Code Summarization Task
+│   ├── main.py               # Main evaluation script
+│   ├── run.sh                # Experiment runner
+│   ├── code_compressor.py    # Core compression implementation
+│   ├── utils.py              # Utility functions
+│   └── ...
+└── README.md
+```
+
+## Installation
+
+```bash
+pip install -r requirements.txt
+```
+
+## Usage
+
+### Quick Start
+
+Each task directory contains a `run.sh` script for easy experimentation. Simply navigate to the desired task directory and run:
+
+```bash
+cd <task_directory>
+bash run.sh
+```
+
+### Code Retrieval (RepoQA)
+
+Navigate to the `repoqa` directory and run experiments with different compression ratios:
+
+```bash
+cd repoqa
+bash run.sh
+```
+
+The script will evaluate LongCodeZip on the RepoQA dataset with compression ratios of 0.1, 0.2, 0.3, and 0.4, running experiments in parallel on multiple GPUs.
+
+**Key Parameters:**
+- `--compression-ratio`: Controls the compression level (0.1-0.4)
+- `--model`: Specifies the base LLM model
+- `--backend`: Backend for model inference (vllm)
+
+### Code Completion
+
+Navigate to the `long-code-completion` directory:
+
+```bash
+cd long-code-completion
+bash run.sh
+```
+
+This evaluates LongCodeZip on long-context code completion tasks with various configurations including different target token limits, fine-grained compression ratios, and importance beta values.
+
+**Key Parameters:**
+- `--code_compressor_target_token`: Target token budget (2048, 4096)
+- `--code_compressor_fine_ratio`: Fine-grained compression ratio (0.5, 0.8)
+- `--importance_beta`: Importance weighting parameter (0.0, 0.5)
+
+### Code Summarization
+
+Navigate to the `module_summarization` directory:
+
+```bash
+cd module_summarization
+bash run.sh
+```
+
+This runs code summarization experiments with fine-grained compression and various beta values for importance weighting.
+
+**Key Parameters:**
+- `--code_compressor_target_token`: Target token budget
+- `--code_compressor_fine_ratio`: Fine-grained compression ratio
+- `--importance_beta`: Importance weighting parameter
+
+## Configuration
+
+Each task can be customized by modifying the respective `run.sh` file or by directly calling the main scripts with custom parameters. Key configuration options include:
+
+- **Model Selection**: Compatible with various code LLMs (default: Qwen2.5-Coder-7B-Instruct)
+- **Compression Ratios**: Adjustable compression levels for different use cases
+- **Token Budgets**: Configurable target token limits
+- **GPU Configuration**: Multi-GPU support for parallel experiments
+
+## Performance
+
+LongCodeZip achieves up to **5.6× compression ratio** without sacrificing task performance across code completion, summarization, and retrieval tasks. And even when using a 0.5B Qwen model as the compressor, it can also achieve competitive performance.
+
+## Contact
+
+Please feel free to contact us if you have any questions.