alihan/LongCodeZip

mirror of https://github.com/YerbaPage/LongCodeZip.git synced 2025-10-22 23:19:46 +03:00


LongCodeZip

This repository is the official implementation of LongCodeZip, a novel two-stage long code compression method. Our paper "LongCodeZip: Compress Long Context for Code Language Models" has been accepted to ASE 2025.

Method Overview

LongCodeZip introduces a two-stage code compression framework specifically designed for code LLMs:

Coarse-grained Compression: Splits the code into function-level chunks and ranks them by conditional perplexity with respect to the query, keeping the most relevant functions.
Fine-grained Compression: Detects blocks inside the retained functions using entropy, then applies 0/1 knapsack optimization to maximize relevance within adaptive token budgets.
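The coarse-grained ranking step can be illustrated with a minimal sketch. It assumes a scoring callable that returns per-token log-probabilities of the query given a chunk; `perplexity`, `rank_chunks`, and `toy_logprobs` are illustrative names and are not part of the longcodezip package. The toy scorer only stands in for a real code LLM, and the actual implementation may score chunks differently.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Assumed interface: per-token log-probabilities of `query` given `context`.
LogProbFn = Callable[[str, str], Sequence[float]]


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity is exp of the negative mean token log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def rank_chunks(
    chunks: List[str], query: str, logprob_fn: LogProbFn
) -> List[Tuple[float, str]]:
    """Return (perplexity, chunk) pairs, lowest query perplexity first."""
    scored = [(perplexity(logprob_fn(chunk, query)), chunk) for chunk in chunks]
    return sorted(scored, key=lambda pair: pair[0])


def toy_logprobs(context: str, query: str) -> List[float]:
    """Toy stand-in for a model: query words seen in the context are 'likely'."""
    context_words = set(context.split())
    return [
        math.log(0.9) if word in context_words else math.log(0.1)
        for word in query.split()
    ]
```

A chunk that makes the query easier to predict gets a lower conditional perplexity and is ranked earlier; in practice `toy_logprobs` would be replaced by a call into a causal language model.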

The method is plug-and-play and can be integrated with existing code LLMs to achieve significant compression ratios while maintaining or improving task performance.
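For the fine-grained stage, the 0/1 knapsack step can be sketched as follows. This is a textbook dynamic-programming formulation under the assumption that each block has a non-negative integer token count and a relevance score; `select_blocks` and its parameters are hypothetical names, and the package's own implementation may differ.

```python
from typing import List, Sequence, Tuple


def select_blocks(
    token_counts: Sequence[int], relevance: Sequence[float], budget: int
) -> Tuple[float, List[int]]:
    """Choose blocks maximizing total relevance with total tokens <= budget."""
    if len(token_counts) != len(relevance):
        raise ValueError("token_counts and relevance must have the same length")
    if budget < 0 or any(count < 0 for count in token_counts):
        raise ValueError("budget and token counts must be non-negative")

    n = len(token_counts)
    # best[i][b]: best total relevance using the first i blocks within b tokens.
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost, value = token_counts[i - 1], relevance[i - 1]
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]  # option 1: skip block i-1
            if cost <= b:
                candidate = best[i - 1][b - cost] + value  # option 2: take it
                if candidate > best[i][b]:
                    best[i][b] = candidate

    # Walk the table backwards to recover which blocks were taken.
    chosen: List[int] = []
    b = budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= token_counts[i - 1]
    chosen.reverse()
    return best[n][budget], chosen
```

Each block is either kept whole or dropped, which is why the 0/1 variant of the knapsack problem applies rather than the fractional one.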

Installation

You can install directly from the GitHub repository:

pip install git+https://github.com/YerbaPage/LongCodeZip.git

Or clone and install in development mode:

git clone https://github.com/YerbaPage/LongCodeZip.git
cd LongCodeZip
pip install -e .

Quick Demo

We provide a simple demo (demo.py) to help you get started with LongCodeZip.

python demo.py

The demo showcases both compression modes: coarse-grained compression (function-level selection only) and the full two-stage compression (with fine-grained token optimization). It demonstrates how LongCodeZip compresses a code file based on a given query and achieves different compression ratios.

Basic Example

from longcodezip import LongCodeZip

# Placeholder inputs: replace these strings with your own values
code = "<your_code_string>"
query = "<your_query>"
instruction = "<your_instruction>"

# Initialize the compressor
compressor = LongCodeZip(model_name="Qwen/Qwen2.5-Coder-7B-Instruct")

# Compress code with a query
result = compressor.compress_code_file(
    code=code,
    query=query,
    instruction=instruction,
    rate=0.5,  # Keep 50% of tokens
    rank_only=False,  # Set to True to only rank and select contexts without fine-grained compression
)

# Access compressed results
compressed_code = result['compressed_code']
compressed_prompt = result['compressed_prompt']  # Full prompt with instruction
compression_ratio = result['compression_ratio']
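To compare the two compression modes side by side, a small helper along these lines may be useful. It is only a sketch: `compare_modes` is a hypothetical function, not part of the package, and it relies solely on the `compress_code_file` arguments and the `compression_ratio` result key shown in the example above.

```python
from typing import Any, Dict


def compare_modes(
    compressor: Any, code: str, query: str, instruction: str, rate: float = 0.5
) -> Dict[str, Any]:
    """Run coarse-only and two-stage compression and collect the reported ratios."""
    ratios: Dict[str, Any] = {}
    for label, rank_only in (("coarse_only", True), ("two_stage", False)):
        result = compressor.compress_code_file(
            code=code,
            query=query,
            instruction=instruction,
            rate=rate,
            rank_only=rank_only,  # True skips the fine-grained stage
        )
        ratios[label] = result["compression_ratio"]
    return ratios
```

Passing the `compressor` object from the basic example returns a dictionary with one reported ratio per mode.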

References

@article{shi2025longcodezip,
  title={LongCodeZip: Compress Long Context for Code Language Models},
  author={Shi, Yuling and Qian, Yichun and Zhang, Hongyu and Shen, Beijun and Gu, Xiaodong},
  journal={arXiv preprint arXiv:2510.00446},
  year={2025}
}