TPI-LLM: A High-Performance Tensor Parallelism Inference System for Edge LLM Services
Updated 2024-10-04 22:25:48 +03:00
To speed up LLM inference and enhance the LLM's perception of key information, compress the prompt and KV-Cache, achieving up to 20x compression with minimal performance loss.
Updated 2024-01-23 02:05:47 +03:00
Inference code for Facebook LLaMA models with Wrapyfi support
Updated 2023-09-17 17:00:28 +03:00
Large Language Model Text Generation Inference
Updated 2023-08-23 10:47:54 +03:00