We are excited to release RedPajama-Data-v2: 30 trillion filtered & de-duplicated tokens from 84 CommonCrawl dumps, 25x larger than our first dataset. It exposes a diverse range of quality annotations so you can slice & weight the data for LLM training. together.ai/blog/redpajama…

