How to process 100 GB TSV and XML files?
I am trying to parse a music data file that is close to 100 GB. What app or programming language is best for handling a file like this?
Thanks!

It really depends on what you need to do with the data, but in most cases Python can handle this fairly easily with csv.reader (with a \t delimiter for TSV) or xml.etree.ElementTree.iterparse (for XML). Both work in a streaming fashion, so you never load the whole file at once.

You can use ClickHouse to process your music data. ClickHouse supports both TSV[0] and XML[1] data formats.
[0] https://clickhouse.com/docs/en/interfaces/formats#tabseparat...

DuckDB is a great solution: https://duckdb.org/docs/data/csv/overview.html

What kind of single music data file is 100 GB? And how is it structured? If it's actually a tab-separated values file, consider using something like Polars or DuckDB.

I found this: https://klogg.filimonov.dev

For TSV, you might want to consider importing it into a SQLite database, then querying it however you please. https://stackoverflow.com/a/35454070/5298150

You can also use Datasette and sqlite-utils for it: https://sqlite-utils.datasette.io/en/stable/cli.html#inserti...
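
For anyone who wants to try the streaming and SQLite suggestions from this thread using only the Python standard library, here is a rough sketch. It is not tuned for speed. The table name "tracks", the XML tag "track", and the assumptions that the TSV has a header row and that every row has the same number of columns are all hypothetical, since the real layout of the file was never described. Adjust them to your data.

```python
import csv
import sqlite3
import xml.etree.ElementTree as ET
from itertools import islice


def iter_tsv_rows(path):
    """Yield one row (a list of strings) at a time from a TSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE treats quote characters as ordinary data,
        # which is an assumption that suits many raw TSV dumps.
        yield from csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)


def load_tsv_into_sqlite(tsv_path, db_path, table="tracks", batch_size=50_000):
    """Copy a TSV file (assumed to have a header row) into a SQLite table.

    Rows are inserted in batches so memory use stays bounded.
    Returns the number of data rows inserted.
    """
    rows = iter_tsv_rows(tsv_path)
    header = next(rows)  # first line is assumed to hold the column names
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn = sqlite3.connect(db_path)
    total = 0
    try:
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
        while True:
            batch = list(islice(rows, batch_size))
            if not batch:
                break
            # Fails if a row has a different column count than the header.
            conn.executemany(
                f'INSERT INTO "{table}" VALUES ({placeholders})', batch
            )
            conn.commit()
            total += len(batch)
    finally:
        conn.close()
    return total


def iter_xml_records(path, tag="track"):
    """Yield a dict of child tag -> text for each <tag> element in the file."""
    context = ET.iterparse(path, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield {child.tag: child.text for child in elem}
            # Drop elements that were already processed so memory stays flat.
            root.clear()
```

Calling load_tsv_into_sqlite("music.tsv", "music.db") (both file names are placeholders) gives you a database you can query with plain SQL, and looping over iter_xml_records("music.xml") hands you one record at a time. If a TSV field is longer than the csv module's default limit, you may need to raise it with csv.field_size_limit first.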