<div align="center">
<img src="images/image_no_back.png" width="200" height="200">
<h1> 🔥 LLaVA-MORE 🔥
Enhancing Visual Instruction Tuning with LLaMA 3.1 </h1>
<div align='center'>

#### [Federico Cocchi](https://federico1-creator.github.io/Federico_Cocchi/), [Nicholas Moratelli](https://nicholasmoratelli.github.io), [Davide Caffagni](https://github.com/dcaffo98), [Sara Sarto](https://github.com/sarasarto),
#### [Marcella Cornia](https://aimagelab.ing.unimore.it/imagelab/person.asp?idpersona=90), [Lorenzo Baraldi](https://www.lorenzobaraldi.com/), and [Rita Cucchiara](https://aimagelab.ing.unimore.it/imagelab/person.asp?idpersona=1)
</div>

## 📢 Latest Updates
- [2024/08/16] 📌 Improved LLaVA-MORE 8B model, considering advanced image backbones.
- [2024/08/01] 🔥 First release of our LLaVA-MORE 8B model, based on LLaMA 3.1.
- [2024/08/01] 🔎 If you are interested in this area of research, check out [our survey](https://arxiv.org/abs/2402.12451) on the revolution of Multimodal LLMs, recently published in ACL (Findings).
- [2024/08/01] 📚 Check out the latest research from [AImageLab](https://aimagelab.ing.unimore.it/imagelab/).

In this section, we present the performance of our model compared to other versions of LLaVA.

<div align="center">

<img src="images/radar_plot.png" width="500">
</div>

### Benchmarks and Comparisons on Instruction Multimodal Datasets in the Literature
<div align="center">
|
|
|
|
|
|
|
|
|
|
| Model Name | Text-VQA* | Science-QA | AI2D | SEED-vid | SEED-all | SEED-img | MMMU | MMBench-Cn | MMBench-En | POPE | GQA | MME-P | MME-C |
|----------------------|:----------: |:------------:|:------:|:----------:|:----------:|:----------:|:------:|:------------:|:------------:|:------:|:-----:|:--------:|:-------:|
| LLaVA-v1.5-7B | 58.2 | 69.0 | 56.4 | 42.0 | 61.6 | 66.8 | 34.2 | 56.5 | 65.3 | 85.6 | 62.4 | 1474.3 | 314.6 |
| LLaVA-v1.5-LLaMA3-8B | 57.6 | 74.2 | 60.7 | 42.0 | 64.3 | 70.1 | 37.3 | 65.4 | 70.3 | 85.4 | 63.5 | 1544.4 | 330.3 |
| **LLaVA-MORE-8B** | 58.4 | 76.3 | 61.8 | 42.4 | 64.1 | 69.8 | 39.4 | **68.2** | 72.4 | 85.1 | 63.6 | 1531.5 | **353.3** |
| **LLaVA-MORE-8B-S2** | 60.9 | 76.7 | 62.2 | 42.3 | 64.2 | 69.9 | 38.7 | 65.8 | 71.1 | **86.5** | 64.5 | **1563.8** | 293.2 |
| **LLaVA-MORE-8B-siglip** | 62.1 | **77.5** | **63.6** | **46.1** | **65.8** | **71.0** | 39.8 | **68.2** | **73.1** | 86.1 | 64.6 | 1531.0 | 315.4 |
| **LLaVA-MORE-8B-S2-siglip** | **63.5** | 77.1 | 62.7 | 44.7 | 65.5 | **71.0** | **40.0** | 68.0 | 71.8 | 86.0 | **64.9** | 1541.4 | 336.4 |
</div>

*\* The results of TextVQA are computed using OCR tokens in the input prompt.*

In the table below, you can find links to our 🤗 Hugging Face models.

| Model Name | 🤗 Hugging Face | Description |
|---------------------------|:-------------------------:|------------------------------------------------|
| LLaVA_MORE-llama_3_1-8B-pretrain | [Hugging Face Model](https://huggingface.co/aimagelab/LLaVA_MORE-llama_3_1-8B-pretrain) | Pretrained on [LCS-558K](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) and using [LLaMA 3.1 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) as LLM backbone |
| LLaVA_MORE-llama_3_1-8B-finetuning | [Hugging Face Model](https://huggingface.co/aimagelab/LLaVA_MORE-llama_3_1-8B-finetuning) | Finetuned on [LLaVA-Instruct-665K](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) and using [LLaMA 3.1 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) as LLM backbone |
| LLaVA_MORE-llama_3_1-8B-S2-pretrain | [Hugging Face Model](https://huggingface.co/aimagelab/LLaVA_MORE-llama_3_1-8B-S2-pretrain) | Pretrained on [LCS-558K](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) and using [LLaMA 3.1 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) as LLM backbone |
| LLaVA_MORE-llama_3_1-8B-S2-finetuning | [Hugging Face Model](https://huggingface.co/aimagelab/LLaVA_MORE-llama_3_1-8B-S2-finetuning) | Finetuned on [LLaVA-Instruct-665K](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) and using [LLaMA 3.1 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) as LLM backbone |
| LLaVA_MORE-llama_3_1-8B-siglip-pretrain | [Hugging Face Model](https://huggingface.co/aimagelab/LLaVA_MORE-llama_3_1-8B-siglip-pretrain) | Pretrained on [LCS-558K](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) and using [LLaMA 3.1 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) as LLM backbone |
| LLaVA_MORE-llama_3_1-8B-siglip-finetuning | [Hugging Face Model](https://huggingface.co/aimagelab/LLaVA_MORE-llama_3_1-8B-siglip-finetuning) | Finetuned on [LLaVA-Instruct-665K](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) and using [LLaMA 3.1 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) as LLM backbone |
| LLaVA_MORE-llama_3_1-8B-S2-siglip-pretrain | [Hugging Face Model](https://huggingface.co/aimagelab/LLaVA_MORE-llama_3_1-8B-S2-siglip-pretrain) | Pretrained on [LCS-558K](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) and using [LLaMA 3.1 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) as LLM backbone |
| LLaVA_MORE-llama_3_1-8B-S2-siglip-finetuning | [Hugging Face Model](https://huggingface.co/aimagelab/LLaVA_MORE-llama_3_1-8B-S2-siglip-finetuning) | Finetuned on [LLaVA-Instruct-665K](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) and using [LLaMA 3.1 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) as LLM backbone |
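
If you want a local copy of one of these checkpoints, a minimal sketch using the ```huggingface-cli``` tool shipped with ```huggingface_hub``` could look as follows; the repository id comes from the table above, while the target directory is only an example.

```bash
# sketch: download a LLaVA-MORE checkpoint from the Hugging Face Hub
# (assumes huggingface_hub is installed, e.g. pip install -U "huggingface_hub[cli]")
huggingface-cli download aimagelab/LLaVA_MORE-llama_3_1-8B-finetuning \
  --local-dir checkpoints/LLaVA_MORE-llama_3_1-8B-finetuning
```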
## Installation
### Visual Backbones
As mentioned before, ```LLaVA-MORE``` introduces the use of LLaMA 3.1 within the LLaVA architecture for the first time. However, this repository goes beyond that single enhancement.

We have also incorporated the ability to use different visual backbones, such as SigLIP, and various methods for managing image resolutions (S2). Additionally, we have experimented with different data mixtures to stress data quality during the LLaVA training stages.

Considering that, you can view this repo as an effort to expand the study of Multimodal LLMs in multiple directions and as a starting point for adding new features that improve the connection between images and language.

You can find more references in this folder: ```scripts/more```.
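
As a minimal sketch of how these reference scripts can be launched on a SLURM cluster, one of the finetuning configurations can be submitted with ```sbatch```; the exact script names, including those for the SigLIP and S2 variants, should be checked directly in the folder.

```bash
# list the available pretraining/finetuning configurations
ls scripts/more

# example: submit the LLaMA 3.1 finetuning stage through SLURM
sbatch scripts/more/12_finetuning_llama_31_acc_st_1.sh
```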
## Inference
You can try our ```LLaVA-MORE``` with LLaMA 3.1 in the Image-To-Text task using the following script.
```bash
source activate more
cd local/path/LLaVA-MORE
export PYTHONPATH=.

# load the original llama 3.1 tokenizer using an active read-only hf_token
export HF_TOKEN=hf_read_token
# tokenizer_model_path
export TOKENIZER_PATH=meta-llama/Meta-Llama-3.1-8B-Instruct

python -u llava/eval/run_llava.py
```
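
For reference, assuming the evaluation script keeps the upstream LLaVA command-line interface (please verify the argparse options in ```llava/eval/run_llava.py```), a single image-text query might look like the sketch below; the model path, image, and question are placeholders.

```bash
# sketch: query a finetuned checkpoint on a single image
# (argument names assume the upstream LLaVA interface; verify them in llava/eval/run_llava.py)
python -u llava/eval/run_llava.py \
    --model-path aimagelab/LLaVA_MORE-llama_3_1-8B-finetuning \
    --image-file images/image_no_back.png \
    --query "Describe this image."
```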
If you run into out-of-memory problems, consider loading the model weights in 8-bit (```load_in_8bit=True```).