We are excited to release RedPajama-Data-v2: 30 trillion filtered & de-duplicated tokens from 84 CommonCrawl dumps, 25x larger than our first dataset. It exposes a diverse range of quality annotations so you can slice & weight the data for LLM training. https://t.co/5zT0C3QBjQ https://t.co/pdZb2AqCDD

