AnyModal: A Flexible Multimodal Language Model Framework for PyTorch
AnyModal is a modular and extensible framework for integrating diverse input modalities (e.g., images, audio) into large language models (LLMs). It enables seamless tokenization, encoding, and language generation using pre-trained models for various modalities.
Key Features
- Flexible Integration: Easily plug in different input modalities like vision, audio, and structured data.
- Tokenization Support: Tokenizes inputs from non-text modalities and combines them with LLMs for generation.
- Extensible Design: Add new input processors and tokenizers with minimal code changes.
How to Use AnyModal
The best way to get started with AnyModal is to read through the steps below and then explore the examples in the demos directory. Also, check out anymodal.py to understand the core components of the framework.
1. Installation and Setup
To use AnyModal in your project, follow these steps:
- Copy anymodal.py: Copy the anymodal.py file into your project directory.
- Install Dependencies: Ensure the following dependencies are installed:

  ```bash
  pip install torch transformers datasets torchvision tqdm
  ```
You may also need to install additional dependencies based on your use case.
2. Implementing Input Modality Tokenization
AnyModal requires three core components for input processing (sketched after this list):
- Input Processor: Processes raw input data into a format compatible with the encoder.
- Input Encoder: Encodes the processed data into feature representations.
- Input Tokenizer: Projects the encoded features into a token embedding space.
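The actual VisionEncoder and Projector used in the demos live in vision.py. Purely as an illustration of the roles these three components play, a minimal sketch with hypothetical class names (not the repository's implementation) might look like this, assuming the encoder returns hidden states of shape (batch, seq_len, hidden_size):

```python
import torch.nn as nn

class SimpleImageProcessor:
    """Input processor: turns raw inputs into encoder-ready tensors."""
    def __init__(self, hf_processor):
        self.hf_processor = hf_processor

    def process(self, image):
        # e.g. resize/normalize via a Hugging Face image processor
        return self.hf_processor(images=image, return_tensors="pt")["pixel_values"]

class SimpleEncoder(nn.Module):
    """Input encoder: maps processed inputs to feature representations."""
    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone

    def forward(self, pixel_values):
        # (batch, seq_len, hidden_size) hidden states from the backbone
        return self.backbone(pixel_values, output_hidden_states=True).hidden_states[-1]

class SimpleProjector(nn.Module):
    """Input tokenizer: projects encoder features into the LLM embedding space."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.proj = nn.Linear(in_features, out_features)

    def forward(self, features):
        # (batch, seq_len, llm_embedding_dim) "soft tokens" for the language model
        return self.proj(features)
```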
Example of integrating an image modality using Vision Transformer:
```python
from transformers import ViTImageProcessor, ViTForImageClassification
from anymodal import MultiModalModel
from vision import VisionEncoder, Projector

# Load vision processor and model
processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224')
vision_model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')
hidden_size = vision_model.config.hidden_size

# Initialize vision encoder and projector
vision_encoder = VisionEncoder(vision_model)
vision_tokenizer = Projector(in_features=hidden_size, out_features=768)

# Load LLM components
from transformers import AutoTokenizer, AutoModelForCausalLM
llm_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
llm_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

# Initialize AnyModal
multimodal_model = MultiModalModel(
    input_processor=None,
    input_encoder=vision_encoder,
    input_tokenizer=vision_tokenizer,
    language_tokenizer=llm_tokenizer,
    language_model=llm_model,
    input_start_token='<|imstart|>',
    input_end_token='<|imend|>',
    prompt_text="The interpretation of the given image is: "
)
```
3. Training and Inference
Use AnyModal to train and generate predictions:
```python
# Training
for epoch in range(num_epochs):
    for batch in train_loader:
        optimizer.zero_grad()
        logits, loss = multimodal_model(batch)
        loss.backward()
        optimizer.step()

# Inference
sample_input = val_dataset[0]['input']
generated_text = multimodal_model.generate(sample_input, max_new_tokens=30)
print(generated_text)
```
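The loop above assumes that num_epochs, optimizer, train_loader, and val_dataset are already defined. The exact dataset classes and batch format come from the demo projects, so treat the following as a rough sketch of the wiring rather than a fixed recipe (it also assumes the Projector is a standard torch.nn.Module):

```python
import torch
from torch.utils.data import DataLoader

num_epochs = 5

# Typically only the projector (and optionally LoRA adapters) is trained,
# while the vision encoder and the LLM stay frozen; adjust which parameters
# you optimize to match your setup.
optimizer = torch.optim.AdamW(vision_tokenizer.parameters(), lr=1e-4)

# train_dataset is assumed to be a torch Dataset yielding items in the
# format MultiModalModel expects (see the demo projects for concrete
# dataset implementations).
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
```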
4. Extending AnyModal
You can extend AnyModal by implementing new input processors and tokenizers. For example:
```python
class AudioProcessor:
    def __init__(self, sample_rate):
        self.sample_rate = sample_rate

    def process(self, audio_data):
        # Your audio preprocessing logic
        pass
```
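Building on the AudioProcessor above, an audio modality could then be wired into MultiModalModel in the same way as the vision example. The sketch below is an assumption rather than a shipped recipe: the AudioEncoder wrapper, the special tokens, and the reuse of the vision demo's Projector are all hypothetical, and the exact interface MultiModalModel expects from input_processor is defined in anymodal.py.

```python
import torch.nn as nn
from transformers import WhisperModel
from vision import Projector  # reused here purely as a linear projection

class AudioEncoder(nn.Module):
    """Hypothetical wrapper exposing Whisper's encoder as an AnyModal input encoder."""
    def __init__(self, whisper_model):
        super().__init__()
        self.encoder = whisper_model.encoder

    def forward(self, input_features):
        # (batch, seq_len, d_model) hidden states from the Whisper encoder
        return self.encoder(input_features).last_hidden_state

whisper = WhisperModel.from_pretrained("openai/whisper-small")
audio_encoder = AudioEncoder(whisper)
audio_tokenizer = Projector(in_features=whisper.config.d_model, out_features=768)

# AudioProcessor.process() would need to produce Whisper-style input features;
# the token strings and prompt below are placeholders, not fixed conventions.
audio_multimodal_model = MultiModalModel(
    input_processor=AudioProcessor(sample_rate=16000),
    input_encoder=audio_encoder,
    input_tokenizer=audio_tokenizer,
    language_tokenizer=llm_tokenizer,
    language_model=llm_model,
    input_start_token='<|audiostart|>',
    input_end_token='<|audioend|>',
    prompt_text="The interpretation of the given audio is: "
)
```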
Furthermore, you can adapt the core components of AnyModal to suit your needs. Consider implementing or modifying functionality such as saving and loading models, or pushing the saved projectors and LoRA adapters to the Hugging Face Hub.
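For instance, assuming the projector is a standard torch.nn.Module, its trained weights can be saved and restored with the usual state-dict utilities; the file name below is just an illustration:

```python
import torch

# Save only the trained projector weights; the frozen ViT and LLM can be
# re-downloaded from the Hugging Face Hub as usual.
torch.save(vision_tokenizer.state_dict(), "image_captioning_projector.pt")

# Later, rebuild a Projector with the same dimensions and restore the weights.
vision_tokenizer.load_state_dict(torch.load("image_captioning_projector.pt"))

# To push the saved file to the Hugging Face Hub, huggingface_hub's
# upload_file utility can be used.
```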
Model Zoo
The AnyModal Model Zoo showcases pre-trained multimodal models for various tasks, accessible on our Hugging Face organization page. Below is a list of the currently available models:
1. VLM (Image Captioning)
- Model Description: A projector network for a vision-language model (a ViT paired with Llama 3.2-1B), trained for image captioning on the Flickr30k dataset.
- Pre-trained Weights: Available here.
- Project Directory: Trained using the Image Captioning demo project.
- Training Script: train.py.
- Inference Script: inference.py.
Stay tuned as we add more models to the zoo, covering diverse use cases.
TODO List
- AnyModal demo for LaTeX OCR
- AnyModal demo for Radiology Captioning
- AnyModal demo for Image Captioning
- AnyModal demo for Visual Question Answering
- AnyModal demo for Audio Captioning / audio + textual instructions
Note that the demos are still in progress and there is room for improvement. Do you have any other ideas for AnyModal demos? Feel free to suggest them!
Contributions
Contributions are highly welcome! Whether it's fixing bugs, improving documentation, or adding support for new input modalities, your help is appreciated. Here's how you can contribute:
- Fork the repository and clone it to your local machine.
- Create a new branch for your feature or bug fix.
- Submit a pull request describing your changes.
Let's build and improve AnyModal together!
Community
Join our subreddit at r/AnyModal to discuss ideas, ask questions, and share your projects using AnyModal. You can also visit our Hugging Face organization page for more resources, models, and examples.
License
This project is licensed under the MIT License. Feel free to use and modify as needed.
Happy building with AnyModal! 🚀
