Compare commits

10 Commits
v0.1.0 ... main

| Author | SHA1 | Message | Date |
|--------|------|---------|------|
| Federico Cocchi | 194f78b18b | Create zero3.json | 2024-09-09 15:40:20 +02:00 |
| Federico Cocchi | fd8bc7e86e | Create zero2.json | 2024-09-09 15:39:56 +02:00 |
| NicholasMoratelli | 75a26d8c7a | Update README.md | 2024-08-29 23:38:44 +02:00 |
| Federico Cocchi | d4c178507b | Merge pull request #5 from joris-sense/patch-1: "Ran into trouble trying out run_llava.py" | 2024-08-28 01:09:56 +02:00 |
| federico1-creator | 29e44e6951 | improve the usability of the inference code | 2024-08-28 01:08:55 +02:00 |
| joris-sense | b7f50f4fe8 | Update README.md: "I didn't test the PYTHONPATH addition; I solved that problem by `cp llava/eval/run_llava.py .`" | 2024-08-27 23:55:14 +02:00 |
| Federico Cocchi | afd373ad21 | Update README.md | 2024-08-18 12:50:21 +02:00 |
| Sara Sarto | 4cb6900447 | Update README.md | 2024-08-02 21:38:18 +02:00 |
| Federico Cocchi | f19767e063 | add get_logger function | 2024-08-02 12:33:22 +02:00 |
| Sara Sarto | feeb4ebe1c | Update README.md | 2024-08-02 10:00:52 +02:00 |
7 changed files with 100 additions and 16 deletions

README.md
View File

@@ -1,5 +1,6 @@
<div align="center">
<img src="images/image_no_back.png" width="200" height="200">
<img src= "images/image_no_back.png"
width="200" height="200">
<h1> 🔥 LLaVA-MORE 🔥
Enhancing Visual Instruction Tuning with LLaMA 3.1
@@ -14,7 +15,7 @@
<div align='center'>
#### [Federico Cocchi](https://federico1-creator.github.io/Federico_Cocchi/), [Nicholas Moratelli](https://github.com/NicholasMoratelli), [Davide Caffagni](https://github.com/dcaffo98), [Sara Sarto](https://github.com/sarasarto),
#### [Federico Cocchi](https://federico1-creator.github.io/Federico_Cocchi/), [Nicholas Moratelli](https://nicholasmoratelli.github.io), [Davide Caffagni](https://github.com/dcaffo98), [Sara Sarto](https://github.com/sarasarto),
#### [Marcella Cornia](https://aimagelab.ing.unimore.it/imagelab/person.asp?idpersona=90), [Lorenzo Baraldi](https://www.lorenzobaraldi.com/), and [Rita Cucchiara](https://aimagelab.ing.unimore.it/imagelab/person.asp?idpersona=1)
</div>
@@ -34,7 +35,8 @@ If you make use of our work, please cite our repo:
## 📢 Latest Updates
- [2024/08/01] 🔥 First release of our LLaVA-MORE 8B model, based on LLaMA 3.1.
- [2024/08/16] 📌 Improved LLaVA-MORE 8B model, considering advanced image backbones.
- [2024/08/01] 🔥 First release of our LLaVA-MORE 8B, based on LLaMA 3.1.
- [2024/08/01] 🔎 If you are interested in this area of research, check out [our survey](https://arxiv.org/abs/2402.12451) on the revolution of Multimodal LLMs, recently published in ACL (Findings).
- [2024/08/01] 📚 Check out the latest research from [AImageLab](https://aimagelab.ing.unimore.it/imagelab/).
@@ -63,15 +65,19 @@ In this section, we present the performance of our model compared to other versi
<img src="images/radar_plot.png" width="500">
</div>
### Benchmarks and Comparisons on Instrucion Multimodal Datasets in the Literature
### Benchmarks and Comparisons on Instruction Multimodal Datasets in the Literature
<div align="center">
| Model Name | Text-VQA* | Science-QA | AI2D | SEED-vid | SEED-all | SEED-img | MMMU | MMBench-Cn | MMBench-En | POPE | GQA | MME-P | MME-C |
|----------------------|:----------: |:------------:|:------:|:----------:|:----------:|:----------:|:------:|:------------:|:------------:|:------:|:-----:|:--------:|:-------:|
| LLaVA-v1.5-7B | 58.2 | 69.0 | 56.4 | 42.0 | 61.6 | 66.8 | 34.2 | 56.5 | 65.3 | **85.6** | 62.4 | 1474.3 | 314.6 |
| LLaVA-v1.5-LLaMA3-8B | 57.6 | 74.2 | 60.7 | 42.0 | **64.3** | **70.1** | 37.3 | 65.4 | 70.3 | 85.4 | 63.5 | **1544.4** | 330.3 |
| **LLaVA-MORE-8B** | **58.4** | **76.3** | **61.8** | **42.4** | 64.1 | 69.8 | **39.4** | **68.2** | **72.4** | 85.1 | **63.6** | 1531.5 | **353.3** |
| LLaVA-v1.5-7B | 58.2 | 69.0 | 56.4 | 42.0 | 61.6 | 66.8 | 34.2 | 56.5 | 65.3 | 85.6 | 62.4 | 1474.3 | 314.6 |
| LLaVA-v1.5-LLaMA3-8B | 57.6 | 74.2 | 60.7 | 42.0 | 64.3 | 70.1 | 37.3 | 65.4 | 70.3 | 85.4 | 63.5 | 1544.4 | 330.3 |
| **LLaVA-MORE-8B** | 58.4 | 76.3 | 61.8 | 42.4 | 64.1 | 69.8 | 39.4 | **68.2** | 72.4 | 85.1 | 63.6 | 1531.5 | **353.3** |
| **LLaVA-MORE-8B-S2** | 60.9 | 76.7 | 62.2 | 42.3 | 64.2 | 69.9 | 38.7 | 65.8 | 71.1 | **86.5** | 64.5 | **1563.8** | 293.2 |
| **LLaVA-MORE-8B-siglip** | 62.1 | **77.5** | **63.6** | **46.1** | **65.8** | **71.0** | 39.8 | **68.2** | **73.1** | 86.1 | 64.6 | 1531.0 | 315.4 |
| **LLaVA-MORE-8B-S2-siglip** | **63.5** | 77.1 | 62.7 | 44.7 | 65.5 | **71.0** | **40.0** | 68.0 | 71.8 | 86.0 | **64.9** | 1541.4 | 336.4 |
</div>
*\* The results of TextVQA are computed with OCR token in the input prompt.*
@@ -84,6 +90,12 @@ In the table below, you can find links to our 🤗 Hugging Face models.
|---------------------------|:-------------------------:|------------------------------------------------|
| LLaVA_MORE-llama_3_1-8B-pretrain | [Hugging Face Model](https://huggingface.co/aimagelab/LLaVA_MORE-llama_3_1-8B-pretrain) | Pretrained on [LCS-558K](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) and using [LLaMA 3.1 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) as LLM backbone |
| LLaVA_MORE-llama_3_1-8B-finetuning | [Hugging Face Model](https://huggingface.co/aimagelab/LLaVA_MORE-llama_3_1-8B-finetuning) | Finetuned on [LLaVA-Instruct-665K](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) and using [LLaMA 3.1 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) as LLM backbone |
| LLaVA_MORE-llama_3_1-8B-S2-pretrain | [Hugging Face Model](https://huggingface.co/aimagelab/LLaVA_MORE-llama_3_1-8B-S2-pretrain) | Pretrained on [LCS-558K](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) and using [LLaMA 3.1 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) as LLM backbone |
| LLaVA_MORE-llama_3_1-8B-S2-finetuning | [Hugging Face Model](https://huggingface.co/aimagelab/LLaVA_MORE-llama_3_1-8B-S2-finetuning) | Finetuned on [LLaVA-Instruct-665K](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) and using [LLaMA 3.1 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) as LLM backbone |
| LLaVA_MORE-llama_3_1-8B-siglip-pretrain | [Hugging Face Model](https://huggingface.co/aimagelab/LLaVA_MORE-llama_3_1-8B-siglip-pretrain) | Pretrained on [LCS-558K](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) and using [LLaMA 3.1 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) as LLM backbone |
| LLaVA_MORE-llama_3_1-8B-siglip-finetuning | [Hugging Face Model](https://huggingface.co/aimagelab/LLaVA_MORE-llama_3_1-8B-siglip-finetuning) | Finetuned on [LLaVA-Instruct-665K](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) and using [LLaMA 3.1 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) as LLM backbone |
| LLaVA_MORE-llama_3_1-8B-S2-siglip-pretrain | [Hugging Face Model](https://huggingface.co/aimagelab/LLaVA_MORE-llama_3_1-8B-S2-siglip-pretrain) | Pretrained on [LCS-558K](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) and using [LLaMA 3.1 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) as LLM backbone |
| LLaVA_MORE-llama_3_1-8B-S2-siglip-finetuning | [Hugging Face Model](https://huggingface.co/aimagelab/LLaVA_MORE-llama_3_1-8B-S2-siglip-finetuning) | Finetuned on [LLaVA-Instruct-665K](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) and using [LLaMA 3.1 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) as LLM backbone |
## Installation
@@ -115,17 +127,26 @@ sbatch scripts/more/12_finetuning_llama_31_acc_st_1.sh
### Visual Backbones
As mentioned before, ```LLaVA-MORE``` introduces the use of LLaMA 3.1 within the LLaVA architecture for the first time. However, this repository goes beyond that single enhancement.
We have also incorporated the ability to use different visual backbones, such as SigLIP, and various methods for managing image resolutions (S2). Additionally, we have experimented with different data mixtures to stress data quality during the LLaVA training stages.
We have also incorporated the ability to use different visual backbones, such as SigLIP, and various methods for managing image resolutions (S2).
Considering that, you can view this repos as an effort to expand the study of Multimodal LLMs in multiple directions and as a
Considering that, you can view this repo as an effort to expand the study of Multimodal LLMs in multiple directions and as a
starting point for enhancing new features to improve the connection between images and language.
You can find more references in this folder: ```scripts/more```
You can find more references in this folder: ```scripts/more```.
## Inference
You can try our ```LLaVA-MORE``` in the Image-To-Text task by running the following script.
You can try our ```LLaVA-MORE``` with LLaMA 3.1 in the Image-To-Text task using the following script.
```bash
source activate more
cd local/path/LLaVA-MORE
export PYTHONPATH=.
# load the original llama 3.1 tokenizer using an active read-only hf_token
export HF_TOKEN=hf_read_token
# tokenizer_model_path
export TOKENIZER_PATH=meta-llama/Meta-Llama-3.1-8B-Instruct
python -u llava/eval/run_llava.py
```
If you get out-of-memory problems, consider loading the model weights in 8 bit (```load_in_8bit=True```).
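For reference, here is a minimal sketch of what 8-bit loading can look like with the transformers/bitsandbytes stack, using one of the checkpoints listed in the table above. Going through the generic Auto classes (rather than the loader inside `run_llava.py`) is an assumption made purely for illustration; the exact entry point in the repo may differ.

```python
# Hedged sketch: loading a LLaVA-MORE checkpoint with 8-bit weights to reduce
# GPU memory, corresponding to the load_in_8bit=True hint above. Assumes the
# llava package is on PYTHONPATH so the custom llava_llama architecture is
# registered with the transformers Auto classes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

import llava.model  # importing llava.model registers the llava_llama config/model classes

model_id = "aimagelab/LLaVA_MORE-llama_3_1-8B-finetuning"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit weights via bitsandbytes
    torch_dtype=torch.float16,                                   # fp16 activations
    device_map="auto",                                           # place layers across available GPUs
)
```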

llava/conversation.py
View File

@@ -6,7 +6,6 @@ import base64
from io import BytesIO
from PIL import Image
from transformers import AutoTokenizer
import utils
class SeparatorStyle(Enum):
"""Different separator style."""

llava/model/language_model/llava_llama.py
View File

@@ -30,8 +30,9 @@ import sys
import os
sys.path.append(os.path.abspath("."))
sys.path.append(os.path.abspath("../.."))
import utils
logger= utils.get_logger(__name__)
from llava.utils import get_logger
logger= get_logger(__name__)
class LlavaConfig(LlamaConfig):
model_type = "llava_llama"

llava/train/train.py
View File

@@ -36,8 +36,8 @@ from llava.model import *
from llava.mm_utils import process_anyres_image, tokenizer_image_token
from PIL import Image
import utils
logger = utils.get_logger(__name__)
from llava.utils import get_logger
logger= get_logger(__name__)
local_rank = None

llava/utils.py
View File

@@ -13,6 +13,18 @@ moderation_msg = "YOUR INPUT VIOLATES OUR CONTENT MODERATION GUIDELINES. PLEASE
handler = None
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
logging.basicConfig(
format="[%(levelname)s|%(filename)s:%(lineno)s] %(asctime)s >> %(message)s"
)
def get_logger(name):
logger = logging.getLogger(name)
logger.setLevel(logging.INFO)
return logger
def build_logger(logger_name, logger_filename):
global handler

scripts/zero2.json Normal file (+23 lines)
View File

@@ -0,0 +1,23 @@
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": "auto"
},
"train_micro_batch_size_per_gpu": "auto",
"train_batch_size": "auto",
"gradient_accumulation_steps": "auto",
"zero_optimization": {
"stage": 2,
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto"
}
}
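zero2.json is a standard DeepSpeed ZeRO stage 2 configuration: optimizer states and gradients are sharded across GPUs, while the batch-size and precision fields are left on "auto" so the training framework fills them in. Below is a minimal sketch, assuming the Hugging Face Trainer integration, of how such a file is typically wired into training; the output path and batch sizes are illustrative, and the repo's own sbatch scripts may pass the config differently.

```python
# Hedged sketch: pointing a Hugging Face Trainer at scripts/zero2.json.
# With the "auto" values above, DeepSpeed inherits batch size, gradient
# accumulation, and fp16/bf16 settings from these TrainingArguments.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="checkpoints/llava-more-pretrain",   # hypothetical output path
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    bf16=True,                                      # resolves "bf16": {"enabled": "auto"}
    deepspeed="scripts/zero2.json",                 # ZeRO stage 2: shard optimizer states + gradients
)
# training_args would then be handed to a Trainer instance and launched with
# `deepspeed` or `torchrun`, one process per GPU.
```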

scripts/zero3.json Normal file (+28 lines)
View File

@@ -0,0 +1,28 @@
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": "auto"
},
"train_micro_batch_size_per_gpu": "auto",
"train_batch_size": "auto",
"gradient_accumulation_steps": "auto",
"zero_optimization": {
"stage": 3,
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
}
}
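Compared to zero2.json, the only change is the zero_optimization block: stage 3 additionally shards the model parameters themselves, and the stage3_* keys control prefetching, parameter persistence, and gathering the full 16-bit weights at save time. The "auto" placeholders are resolved by the Hugging Face Trainer integration; the sketch below shows one way to use the file with the raw DeepSpeed API, where those values must be made concrete first (the model, optimizer, and numbers are placeholders).

```python
# Hedged sketch: driving scripts/zero3.json through deepspeed.initialize directly.
# The "auto" entries are Trainer-only placeholders, so concrete values are
# substituted before handing the config to DeepSpeed.
import json
import torch
import deepspeed

with open("scripts/zero3.json") as f:
    ds_config = json.load(f)

ds_config["train_micro_batch_size_per_gpu"] = 4
ds_config["gradient_accumulation_steps"] = 4
ds_config["train_batch_size"] = 16                      # micro batch * accumulation * world size
ds_config["fp16"]["enabled"] = False
ds_config["bf16"]["enabled"] = True
ds_config["zero_optimization"]["reduce_bucket_size"] = 5e8
ds_config["zero_optimization"]["stage3_prefetch_bucket_size"] = 5e8
ds_config["zero_optimization"]["stage3_param_persistence_threshold"] = 1e5

model = torch.nn.Linear(4096, 4096)                     # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# ZeRO stage 3 shards parameters, gradients, and optimizer states across ranks.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, optimizer=optimizer, config=ds_config
)
# In a training loop, engine.backward(loss) and engine.step() replace the
# usual loss.backward() / optimizer.step() calls.
```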