# LeRobot Episode Scoring Toolkit

A lightweight toolkit for evaluating and filtering LeRobot episode datasets. It combines classic computer-vision heuristics (blur/exposure tests, kinematic smoothness, collision spikes) with optional Gemini-powered vision-language checks to give each episode a 0–1 score across multiple quality dimensions.
Use this toolkit to:
- Automatically score robot demonstration episodes on visual clarity, motion smoothness, collision detection, and more
- Filter low-quality episodes to improve downstream training performance
- Train and compare baseline vs. filtered dataset models
- Visualize score distributions and identify problematic episodes
## Table of Contents
- Features
- Installation
- Quick Start
- Usage
- Output Format
- Repository Structure
- Training and Evaluation
- Troubleshooting
- Contributing
- License
## ✨ Features
| Dimension | Function | What it measures |
|---|---|---|
| Visual clarity | `score_visual_clarity` | Blur, over/under-exposure, low-light frames |
| Smoothness | `score_smoothness` | 2nd derivative of joint angles |
| Path efficiency | `score_path_efficiency` | Ratio of straight-line vs. actual joint-space path |
| Collision / spikes | `score_collision` | Sudden acceleration outliers (proxy for contacts) |
| Joint stability (final 2 s) | `score_joint_stability` | Stillness at the goal pose |
| Gripper consistency | `score_gripper_consistency` | Binary "closed vs. holding" agreement |
| Actuator saturation | `score_actuator_saturation` | Difference between commanded actions and achieved states |
| Task success (VLM) | `score_task_success` (via `VLMInterface`) | Gemini grades whether the desired behavior happened |
| Runtime penalty / outliers | `score_runtime` + `build_time_stats`, `is_time_outlier` | Episode length vs. nominal / Tukey-IQR / Z-score fences |
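The kinematic scores are simple trajectory statistics. As an illustration of the idea behind `score_smoothness` (a minimal sketch, not the package's exact implementation), a second-derivative penalty squashed into a 0–1 score might look like this:

```python
# Minimal sketch of a smoothness score: penalize large second derivatives
# (accelerations) of the joint trajectory, then map the penalty to (0, 1].
# This illustrates the idea only; see src/score_lerobot_episodes/scores/
# for the toolkit's actual implementation.
import numpy as np

def smoothness_score(joint_angles: np.ndarray, fps: float = 30.0) -> float:
    """joint_angles: (T, num_joints) array of joint positions over time."""
    dt = 1.0 / fps
    # Second finite difference approximates joint acceleration.
    accel = np.diff(joint_angles, n=2, axis=0) / dt**2
    # Mean absolute acceleration across time steps and joints.
    penalty = np.abs(accel).mean()
    # 1.0 is perfectly smooth; the score decays toward 0 as jerkiness grows.
    return float(1.0 / (1.0 + penalty))

# Example: a clean sine sweep scores higher than the same sweep plus noise.
t = np.linspace(0, 2 * np.pi, 120)[:, None]
clean = np.sin(t)
noisy = clean + 0.05 * np.random.default_rng(0).standard_normal(clean.shape)
print(smoothness_score(clean), smoothness_score(noisy))
```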
## ⚙️ Installation

### Prerequisites
- Python 3.8 or higher
- pip package manager
### Setup

1. Clone the repository:

   ```bash
   git clone https://github.com/RoboticsData/score_lerobot_episodes.git
   cd score_lerobot_episodes
   ```

2. Install dependencies:

   ```bash
   # Install in editable mode with all dependencies
   pip install -e .
   ```

   Or using uv (faster):

   ```bash
   # Install uv if you haven't already
   pip install uv

   # Install the package
   uv pip install -e .
   ```

3. Set up API keys (optional). Only required if using VLM-based scoring with Gemini:

   ```bash
   export GOOGLE_API_KEY="your-api-key-here"
   ```

   Note: the free-tier rate limits of the Gemini API are fairly restrictive, and you may need a paid tier depending on episode length. Check the Gemini API rate limits for more info.
## 🚀 Quick Start

Score a dataset and save results:

```bash
python score_dataset.py \
  --repo_id lerobot/aloha_static_pro_pencil \
  --output ./output/lerobot/aloha_static_pro_pencil \
  --threshold 0.5
```
This will:

- Download and load the dataset from the HuggingFace Hub
- Score each episode across multiple quality dimensions
- Save per-episode scores to `results/{repo_id}_scores.json`
- Keep only episodes with an aggregate score >= 0.5
- Save the filtered dataset to the output directory
## 📖 Usage

### Command-line Arguments

#### Required Arguments

- `--repo_id`: HuggingFace repository ID for the dataset (e.g., `username/dataset-name`)

#### Optional Arguments

- `--root`: Local path to the dataset root (default: downloads from the HuggingFace Hub)
- `--output`: Output directory for the filtered dataset (default: None, no filtering)
- `--threshold`: Minimum aggregate score to keep episodes (default: 0.5, range: 0.0–1.0)
- `--nominal`: Expected episode duration in seconds (used for runtime scoring)
- `--vision_type`: Vision scoring method; choices: `opencv` (default), `vlm_gemini`
- `--policy_name`: Policy type for training (default: `act`)
- `--overwrite`: Overwrite an existing filtered dataset (default: True)
- `--overwrite_checkpoint`: Overwrite existing training checkpoints (default: False)
- `--train-baseline`: Train a model on the unfiltered dataset (default: False)
- `--train-filtered`: Train a model on the filtered dataset (default: False)
- `--plot`: Display score distribution plots in the terminal (default: False)
### Examples

**1. Basic scoring (no filtering)**

```bash
python score_dataset.py --repo_id username/my-robot-dataset
```

**2. Score and filter dataset**

```bash
python score_dataset.py \
  --repo_id username/my-robot-dataset \
  --output ./output/username/my-robot-dataset \
  --threshold 0.6
```

**3. Score with VLM-based vision analysis**

```bash
export GOOGLE_API_KEY="your-key"
python score_dataset.py \
  --repo_id username/my-robot-dataset \
  --vision_type vlm_gemini \
  --output ./filtered_data
```
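For context, here is a minimal sketch of the kind of check `vlm_gemini` performs, written directly against the `google-generativeai` client rather than the toolkit's `VLMInterface`; the model name, prompt, and frame path are illustrative placeholders:

```python
# Sketch of a VLM task-success check using google-generativeai directly.
# NOT the toolkit's VLMInterface; model name, prompt, and frame path are
# placeholders for illustration.
import os

import google.generativeai as genai
from PIL import Image

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")

frame = Image.open("final_frame.png")  # e.g., the episode's last frame
response = model.generate_content(
    ["Task: pick up the pencil. Does this frame show the task completed? "
     "Answer YES or NO.", frame]
)
print(response.text.strip())
```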
**4. Score, filter, and train both baseline and filtered models**

```bash
python score_dataset.py \
  --repo_id username/my-robot-dataset \
  --output ./output/username/my-robot-dataset \
  --threshold 0.5 \
  --train-baseline True \
  --train-filtered True \
  --policy_name act
```

**5. Visualize distributions**

```bash
python score_dataset.py \
  --repo_id username/my-robot-dataset \
  --threshold 0.7 \
  --plot True
```

**6. Use a local dataset instead of downloading**

```bash
python score_dataset.py \
  --repo_id username/my-robot-dataset \
  --root /path/to/local/dataset \
  --output ./filtered_output
```
## 📁 Output Format

### JSON Scores File

Saved to `results/{repo_id}_scores.json`:

```json
[
  {
    "episode_id": 0,
    "camera_type": "camera_0",
    "video_path": "/path/to/video.mp4",
    "aggregate_score": 0.752,
    "per_attribute_scores": {
      "visual_clarity": 0.85,
      "smoothness": 0.78,
      "collision": 0.92,
      "runtime": 0.65
    }
  },
  ...
]
```
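The file is plain JSON, so it is easy to consume programmatically. A short sketch against the layout above (the concrete filename is illustrative, following the documented `results/{repo_id}_scores.json` pattern):

```python
# Load the scores JSON shown above and list episodes below a threshold.
# The path follows the documented results/{repo_id}_scores.json pattern;
# the repo_id used here is illustrative.
import json

with open("results/lerobot/aloha_static_pro_pencil_scores.json") as f:
    episodes = json.load(f)

threshold = 0.5
failing = [e["episode_id"] for e in episodes if e["aggregate_score"] < threshold]
print(f"{len(failing)}/{len(episodes)} episodes score below {threshold}: {failing}")
```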
### Console Output
Displays a formatted table showing scores for each episode:
```text
Episode scores (0–1 scale)
─────────────────────────────────────────────────────────────────
Episode  Camera    visual_clarity  smoothness  collision  runtime  Aggregate  Status
0        camera_0  0.850           0.780       0.920      0.650    0.752      GOOD
1        camera_1  0.420           0.650       0.710      0.580    0.590      BAD
...
─────────────────────────────────────────────────────────────────
Average aggregate over 20 videos: 0.671
Percentage of episodes removed: 0.25, total: 5
```
### Filtered Dataset

When using `--output`, a new filtered dataset is created containing only episodes scoring at or above the threshold, preserving the original LeRobot dataset structure.
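Because the filtered copy keeps the standard layout, it should load like any other LeRobot dataset. A hedged sketch, assuming a recent lerobot release (the import path has moved between versions, so adjust to your installed version):

```python
# Sketch: load the filtered output as a regular LeRobotDataset.
# Assumes a lerobot version where this import path is valid; it has
# moved between releases.
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset(
    repo_id="username/my-robot-dataset",        # original repo id
    root="./output/username/my-robot-dataset",  # filtered copy on disk
)
print(dataset.num_episodes, "episodes kept after filtering")
```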
## 📂 Repository Structure

```text
score_lerobot_episodes/
├── src/
│   └── score_lerobot_episodes/   # Installable package
│       ├── __init__.py
│       ├── data.py               # Dataset utilities
│       ├── vlm.py                # Vision-Language Model interface
│       ├── evaluation.py         # Evaluation utilities
│       ├── corrupt.py            # Data corruption tools
│       └── scores/               # Scoring criteria modules
├── score_dataset.py              # Main scoring script
├── train.py                      # Training pipeline integration
├── ui.py                         # Streamlit web interface (if available)
├── pyproject.toml                # Package configuration and dependencies
├── requirements.txt              # Python dependencies (legacy)
├── README.md                     # This file
├── CONTRIBUTING.md               # Contribution guidelines
├── LICENSE                       # Apache 2.0 license
├── .gitignore                    # Git ignore rules
├── results/                      # Generated score JSON files
├── output/                       # Filtered datasets
└── checkpoints/                  # Training checkpoints
```
## 🤖 Training and Evaluation

The toolkit integrates with LeRobot's training pipeline to compare baseline vs. filtered dataset performance.

### Training Workflow

1. **Baseline training**: train on the original, unfiltered dataset

   ```bash
   python score_dataset.py \
     --repo_id username/dataset \
     --train-baseline True
   ```

2. **Filtered training**: train on the quality-filtered dataset

   ```bash
   python score_dataset.py \
     --repo_id username/dataset \
     --output ./filtered_data \
     --threshold 0.6 \
     --train-filtered True
   ```

3. **Compare both**: run both training pipelines in one command

   ```bash
   python score_dataset.py \
     --repo_id username/dataset \
     --output ./filtered_data \
     --train-baseline True \
     --train-filtered True
   ```
### Training Configuration

- Default policy: ACT (Action Chunking Transformer)
- Default steps: 10,000
- Batch size: 4
- Checkpoints saved to `./checkpoints/{job_name}/`
- WandB logging enabled by default

You can customize training parameters by modifying `train.py`; a hypothetical sketch of the values involved follows below.
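As a purely hypothetical illustration of the kind of knobs involved (the names below are examples, not `train.py`'s actual variables):

```python
# Hypothetical illustration only -- these names are examples, not the
# actual variables in train.py; check that file for the real parameters.
TRAINING_CONFIG = {
    "policy": "act",                    # ACT (Action Chunking Transformer)
    "steps": 10_000,                    # default training steps
    "batch_size": 4,                    # lower this on out-of-memory errors
    "checkpoint_dir": "./checkpoints",  # per-job subdirectories
    "wandb": True,                      # WandB logging on by default
}
```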
## 🔧 Troubleshooting

### Common Issues

**1. `ModuleNotFoundError: No module named 'google.generativeai'`**

- Solution: install dependencies with `pip install -r requirements.txt`
- If using VLM scoring, ensure `google-generativeai` is installed

**2. API rate limit errors with Gemini**

- Solution: the free tier has restrictive limits. Consider:
  - Using `--vision_type opencv` instead
  - Upgrading to a paid Gemini API tier
  - Processing smaller batches

**3. All episodes filtered out**

- Error: `ValueError: All episodes filtered out, decrease threshold to fix this`
- Solution: lower the `--threshold` value (e.g., from 0.5 to 0.3)

**4. Dataset not found**

- Solution:
  - Verify the `--repo_id` is correct
  - Check your internet connection for HuggingFace Hub access
  - Use `--root` to specify a local dataset path

**5. Out of memory during training**

- Solution: reduce `batch_size` in `train.py:44` or use a smaller model

**6. Permission errors when overwriting**

- Solution: use `--overwrite True` or manually delete the output directory
## 🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines on:

- Setting up a development environment
- Code style and conventions
- Submitting pull requests
- Reporting issues

### Quick Contribution Steps

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## ⭐ Star History
## 📄 License
LeRobot Episode Scoring Toolkit is distributed under the Apache 2.0 License. See LICENSE for more information.
## 📧 Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: This README and inline code documentation
