echovlm: Medical Imaging Report Generation Model for $5


In this post I will explain what is in the script speedrun.sh part by part, commenting on each section along the way. I used a Lambda GPU Cloud instance with a single H100, which costs $2.49 per hour, and the whole pipeline took 2 hours to finish.

Environment

First, we will set up the environment. As in nanochat, we use uv for dependency management and Rust with maturin to train the tokenizer.

# install uv (if not already installed)
command -v uv &> /dev/null || curl -LsSf https://astral.sh/uv/install.sh | sh
# create a .venv local virtual environment (if it doesn't exist)
[ -d ".venv" ] || uv venv
# install the repo dependencies
uv sync
# activate venv so that `python` uses the project's venv instead of system python
source .venv/bin/activate
uv add --editable .

Get training data

Next, we will simulate the training data using a release from the EchoPrime repository. This repository provides echocardiography text reports and corresponding CLIP embeddings. For the purposes of training echovlm, we treat the CLIP embeddings as a proxy for the visual modality.

git clone https://github.com/echonet/EchoPrime
cd EchoPrime
wget https://github.com/echonet/EchoPrime/releases/download/v1.0.0/model_data.zip
wget https://github.com/echonet/EchoPrime/releases/download/v1.0.0/candidate_embeddings_p1.pt
wget https://github.com/echonet/EchoPrime/releases/download/v1.0.0/candidate_embeddings_p2.pt
unzip model_data.zip
mv candidate_embeddings_p1.pt model_data/candidates_data/
mv candidate_embeddings_p2.pt model_data/candidates_data/
cd ..
cp -r EchoPrime/assets/ .
cp -r EchoPrime/model_data/ .
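
Before moving on, it can be worth sanity-checking what was downloaded. The short Python sketch below just loads the candidate embedding files and prints what is inside; whether these are plain tensors or something else is an assumption here, so adapt it to whatever torch.load returns.

import torch

# Inspect the downloaded candidate embeddings (their exact structure is an
# assumption; print the type/shape and adapt as needed).
for path in [
    "model_data/candidates_data/candidate_embeddings_p1.pt",
    "model_data/candidates_data/candidate_embeddings_p2.pt",
]:
    obj = torch.load(path, map_location="cpu")
    if torch.is_tensor(obj):
        print(path, tuple(obj.shape), obj.dtype)
    else:
        print(path, type(obj))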

Then, we run a script to decode the reports, organize the embeddings, and split studies into train/test/val. In total we have 1,230,676 reports and corresponding CLIP embeddings.

python -m scripts.prepare_dataset
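
Since the split is done over studies, a minimal sketch of that step might look like this (the fractions and ID handling are illustrative, not what scripts.prepare_dataset actually does):

import random

def split_studies(study_ids, train_frac=0.8, val_frac=0.1, seed=0):
    # Shuffle study IDs and cut them into train/val/test lists.
    # Fractions here are illustrative, not the script's actual values.
    ids = list(study_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train_frac)
    n_val = int(len(ids) * val_frac)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

study_ids = [f"study_{i}" for i in range(1000)]  # placeholder IDs
train_ids, val_ids, test_ids = split_studies(study_ids)
print(len(train_ids), len(val_ids), len(test_ids))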

Train the tokenizer

Medical imaging reports are much less linguistically diverse than natural language because of standardized clinical terminology and templated reporting, so we expect to need a much smaller vocabulary. When training a tokenizer, the goal is to achieve a high compression ratio (characters per token) while keeping the vocabulary size small. In practice, we select the vocabulary size at the point of diminishing returns, beyond which increasing it yields only marginal improvements in compression.

[Figure 1: compression ratio vs. vocabulary size]

For us this point is around 2000 tokens, despite having 120k medical reports, each approximately 300 words long. In fact, even if we specify a larger capacity (e.g. 3000), the BPE algorithm plateaus at 2648 tokens, indicating limited linguistic variety in medical imaging reports.

We will proceed with a tokenizer trained with vocabulary size = 2000.

python -m scripts.tok_train
python -m scripts.tok_eval

This gives us a tokenizer with the following properties:

[Tokenizer training] time=42.4020733833313, special_tokens=11, bytes(min/mean/max/std)=1/5.027/18/3.103
Compression ratio is 5.378125808436102
tokens per report (min/mean/max/std)=17/288.37/545/52.01

Compared to the GPT-4o tokenizer:

Our tokenizer's compression ratio is 5.38 and vocabulary size is 2000

GPT-4o's compression ratio is 4.22 and its vocabulary size is 200019
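
Compression ratio here is just characters per token averaged over the corpus, so the comparison is easy to reproduce. A rough sketch (the echovlm tokenizer loading is hypothetical and commented out; GPT-4o's tokenizer is available in tiktoken as o200k_base):

import tiktoken

def compression_ratio(texts, encode):
    # Characters per token over a corpus, given any encode(text) -> list of ids.
    chars = sum(len(t) for t in texts)
    toks = sum(len(encode(t)) for t in texts)
    return chars / toks

reports = ["Normal left ventricular size and systolic function. No pericardial effusion."]

gpt4o = tiktoken.get_encoding("o200k_base")  # the GPT-4o encoding
print("gpt-4o:", compression_ratio(reports, gpt4o.encode))

# For the echovlm tokenizer, plug in its encode function, e.g. (hypothetical):
# from echovlm.tokenizer import Tokenizer
# tok = Tokenizer.load("assets/tokenizer.pkl")
# print("echovlm:", compression_ratio(reports, tok.encode))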

VLM training

To include visual data in the LLM training we use a feature projection approach. We leverage a pre-trained echocardiogram encoder to extract visual features from echocardiogram videos. These features match the embedding dimension of our LLM, and by concatenating them as a prefix to the text token sequence we enable the autoregressive transformer to attend to multimodal context in a single unified latent space. The VLM architecture looks like this:

[Figure 2: echovlm VLM architecture]

In the forward pass we always assume that videos are present: they are prepended to the sequence of embeddings, they are never masked, and the transformer predicts only text tokens (see echovlm/gpt.py). In our simulated data we always prepend just one CLIP embedding (1x512), but when working with real data you'd have N videos per study, so you would only have to modify EchoReportDataset to return a study embedding of shape N x 512 and the rest should work out of the box.
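
To make the prefix idea concrete, here is a toy PyTorch sketch (names, layer choices, and the nn.TransformerEncoder stand-in are illustrative, not the actual echovlm/gpt.py code): the study embedding is projected (effectively an identity when the dimensions already match), prepended to the token embeddings, and the next-token loss is computed only over text positions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PrefixVLM(nn.Module):
    # Toy sketch: visual embeddings form a prefix; loss is on text tokens only.
    def __init__(self, vocab_size=2000, d_model=512, clip_dim=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.proj = nn.Linear(clip_dim, d_model)  # identity-like when dims match
        self.backbone = nn.TransformerEncoder(    # stand-in for the GPT blocks
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, clip_embs, tokens):
        # clip_embs: (B, N, 512) study embeddings; tokens: (B, T) text token ids
        prefix = self.proj(clip_embs)                     # (B, N, d_model)
        x = torch.cat([prefix, self.tok_emb(tokens)], 1)  # (B, N + T, d_model)
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.backbone(x, mask=causal)
        # The last prefix position predicts the first text token; each text
        # position predicts the next one. Prefix positions contribute no loss.
        logits = self.lm_head(h[:, prefix.size(1) - 1 : -1])  # (B, T, vocab)
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)), tokens.reshape(-1))

model = PrefixVLM()
loss = model(torch.randn(2, 1, 512), torch.randint(0, 2000, (2, 16)))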

Let's run the training script:

torchrun --standalone --nproc_per_node=1 -m scripts.base_train

The code is written for scalability across multiple GPUs or machines. While a single H100 is plenty for our current codebase, you can easily scale up. To use more GPUs on one machine, simply increase --nproc_per_node.

To run Distributed Data Parallel (DDP) across multiple computers, ensure they are on the same network and select one as the Master. For example, if you have two computers with 4 GPUs each, and Computer 0 (IP: 192.168.1.2) is the Master:

On Computer 0 (Master):

torchrun --nproc_per_node=4 \
         --nnodes=2 \
         --node_rank=0 \
         --master_addr="192.168.1.2" \
         --master_port=1234 \
         -m scripts.base_train

On Computer 1:

torchrun --nproc_per_node=4 \
         --nnodes=2 \
         --node_rank=1 \
         --master_addr="192.168.1.2" \
         --master_port=1234 \
         -m scripts.base_train
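
Inside the training script, the multi-GPU part is the usual torchrun/DDP boilerplate. A hedged sketch of what that plumbing generally looks like (scripts.base_train's actual setup may differ in its details):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def ddp_setup():
    # torchrun sets MASTER_ADDR/PORT, RANK, WORLD_SIZE and LOCAL_RANK for us.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return local_rank

# usage inside the training script (sketch):
# local_rank = ddp_setup()
# model = DDP(model.to(local_rank), device_ids=[local_rank])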

While I usually use Weights & Biases to log metrics for large projects, its website and UI are often buggy and unreliable, so when I want to iterate quickly I just log everything locally by populating CSV files and saving plots with matplotlib. If the output directory already exists, the script automatically appends a suffix (e.g., echovlm1, echovlm2). After the training run is finished, these are all the saved artifacts:

assets/echollm/
├── train.csv           # Per-epoch training loss
├── train_loss.png      # Visualization of the training loss curve
├── val.csv             # Per-epoch validation Bits Per Byte (BPB)
├── val_bpb.png         # Visualization of validation BPB over training epochs
└── checkpoint033500.pt # Best checkpoint selected based on validation BPB
[Figure 3: training loss and validation BPB plots]
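
For completeness, here is a minimal sketch of that kind of local logging, i.e. the directory suffixing plus CSV and matplotlib dumps described above (not the repo's exact logger):

import csv
import os
import matplotlib
matplotlib.use("Agg")  # headless plotting on a remote GPU box
import matplotlib.pyplot as plt

def unique_dir(base="assets/echollm"):
    # Return base, or base1, base2, ... if it already exists.
    out, i = base, 0
    while os.path.exists(out):
        i += 1
        out = f"{base}{i}"
    os.makedirs(out)
    return out

def log_and_plot(out_dir, name, steps, values, ylabel):
    with open(os.path.join(out_dir, f"{name}.csv"), "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["step", ylabel])
        w.writerows(zip(steps, values))
    plt.figure()
    plt.plot(steps, values)
    plt.xlabel("step")
    plt.ylabel(ylabel)
    plt.savefig(os.path.join(out_dir, f"{name}.png"))
    plt.close()

out = unique_dir()
log_and_plot(out, "train_loss", [0, 1, 2], [2.3, 1.9, 1.6], "loss")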

Evaluate on echocardiography tasks

In medical report generation, clinical accuracy is the primary metric of success. To evaluate it, I wrote some regexes to identify key diagnostic labels. We then compare the labels extracted from the generated reports against the ground-truth clinical findings.
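
To make the idea concrete, here is a hypothetical extractor for a single binary finding and how such a label would be scored; the actual patterns live in scripts.test_inference and will differ.

import re
from sklearn.metrics import roc_auc_score

# Hypothetical regexes for one binary finding; the real script's patterns differ.
EFFUSION = re.compile(r"\bpericardial effusion\b", re.IGNORECASE)
NEGATED = re.compile(r"\bno (?:significant )?pericardial effusion\b", re.IGNORECASE)

def has_pericardial_effusion(report: str) -> int:
    # 1 if the report asserts an effusion, 0 if it is absent or negated.
    if NEGATED.search(report):
        return 0
    return 1 if EFFUSION.search(report) else 0

generated = ["Small pericardial effusion is present.", "No pericardial effusion."]
reference = ["Moderate pericardial effusion.", "No pericardial effusion."]
y_pred = [has_pericardial_effusion(r) for r in generated]
y_true = [has_pericardial_effusion(r) for r in reference]
print(roc_auc_score(y_true, y_pred))  # AUROC for this toy pair of reports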

python -m scripts.test_inference

The output will include individual scores (AUROC for binary traits, R2 for regression) and a Core Metric, which is the average across all evaluated traits:

| Cardiac Trait / Finding | Score |
| --- | --- |
| pacemaker | 0.73 |
| impella | 0.5 |
| tavr | 0.94 |
| mitraclip | 0.91 |
| aortic_root_dilation | 0.61 |
| bicuspid_aov_morphology | 0.66 |
| aortic_stenosis | 0.77 |
| tricuspid_stenosis | 0.5 |
| aortic_regurgitation | 0.56 |
| dilated_ivc | 0.61 |
| left_atrium_dilation | 0.82 |
| ejection_fraction (R2) | 0.68 |
| mitral_annular_calcification | 0.89 |
| mitral_stenosis | 0.56 |
| mitral_regurgitation | 0.7 |
| pericardial_effusion | 0.93 |
| pulmonary_artery_pressure_continuous (R2) | 0.23 |
| right_atrium_dilation | 0.54 |
| rv_systolic_function_depressed | 0.67 |
| right_ventricle_dilation | 0.71 |
| tricuspid_valve_regurgitation | 0.78 |
| pulmonic_valve_regurgitation | 0.5 |
| elevated_left_atrial_pressure | 0.57 |
| wall_motion_hypokinesis | 0.75 |
| atrial_septum_hypertrophy | 0.5 |
| Core Metric (Average) | 0.66 |

Given that we worked with limited simulated data, this is not bad; for example, ejection_fraction R2 = 0.68, where even SOTA algorithms for this task sit at ~0.70. For reference, when we don't prepend a study embedding to the token sequence (or prepend a random embedding), we basically get random scores:

| Cardiac Trait / Finding | Score |
| --- | --- |
| pacemaker | 0.49 |
| impella | 0.50 |
| tavr | 0.50 |
| mitraclip | 0.50 |
| aortic_root_dilation | 0.50 |
| bicuspid_aov_morphology | 0.50 |
| aortic_stenosis | 0.50 |
| tricuspid_stenosis | 0.49 |
| aortic_regurgitation | 0.53 |
| dilated_ivc | 0.49 |
| left_atrium_dilation | 0.50 |
| ejection_fraction (R2) | -0.93 |
| mitral_annular_calcification | 0.50 |
| mitral_stenosis | 0.50 |
| mitral_regurgitation | 0.48 |
| pericardial_effusion | 0.50 |
| pulmonary_artery_pressure_continuous (R2) | -0.22 |
| right_atrium_dilation | 0.50 |
| rv_systolic_function_depressed | 0.50 |
| right_ventricle_dilation | 0.50 |
| tricuspid_valve_regurgitation | 0.50 |
| pulmonic_valve_regurgitation | 0.50 |
| elevated_left_atrial_pressure | 0.50 |
| wall_motion_hypokinesis | 0.56 |
| atrial_septum_hypertrophy | 0.50 |
| Core Metric (Average) | 0.40 |

These results confirm that the VLM is effectively learning to associate specific study embeddings with clinical findings, significantly outperforming the random baseline.

Run on a real echocardiogram example

Finally, since I want to do this end-to-end, I will demonstrate how to run echovlm on real echocardiography data. I’ve shared an echocardiogram of my own heart, captured during my research, so let's see how EchoVLM performs on real-world data.

python -m scripts.generate_report

This script encodes the videos using the EchoPrime encoder, then finds the most similar study report embedding among those available in EchoPrime, and finally uses echovlm to generate a report token by token.
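
Conceptually, the decoding loop is the standard autoregressive one, just with the study embedding sitting in front as a fixed prefix. A hedged sketch (the model and tokenizer interfaces here are assumptions for illustration, not the actual scripts.generate_report code):

import torch

@torch.no_grad()
def generate(model, tokenizer, study_emb, max_tokens=600, temperature=1.0):
    # Assumed interfaces: model(study_emb, tokens) -> logits of shape (1, T, vocab);
    # tokenizer exposes bos_id, eos_id and decode(). Reports in our data top out
    # around ~545 tokens, hence the cap.
    tokens = torch.tensor([[tokenizer.bos_id]])
    for _ in range(max_tokens):
        logits = model(study_emb, tokens)[:, -1, :] / temperature
        next_tok = torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)
        tokens = torch.cat([tokens, next_tok], dim=1)
        if next_tok.item() == tokenizer.eos_id:
            break
    return tokenizer.decode(tokens[0].tolist())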

[Figure 4: report generated for the example echocardiogram]

Fortunately, echovlm seems to think that I have a healthy heart!