hynky
- Karma: 19
- Created: 1 year ago
Recent Submissions
1. FinePDFs: 3T token dataset made from internet PDFs
2. FineWeb2: Adapting Pre-Training Data Processing to Every Language (arxiv.org)
3. FineWeb2 dataset: A sparkling update with 1000s of languages (huggingface.co)