How to process 100 GB TSV and XML files?

2 points by anindha a year ago · 6 comments · 1 min read


I am trying to parse a music data file that is close to 100 GB. What app or programming language is best for handling a file like this?

Thanks!

FlyingAvatar a year ago

It really depends on what you need to do with the data, but in most cases Python could do this pretty easily with csv.reader (with a \t delimiter for TSV) or xml.etree.ElementTree.iterparse (for XML), streaming the file so that you're never loading the whole thing at once.
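
A rough sketch of that streaming approach, using only the standard library. The file layout is an assumption: the XML is taken to be a flat list of record elements (the tag name "track" in the usage note is hypothetical), and the TSV is read with quoting disabled on the guess that stray quote characters in titles should be kept literally.

```python
import csv
import xml.etree.ElementTree as ET


def iter_tsv_rows(path):
    """Yield one row (a list of strings) at a time from a TSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE is an assumption: treat '"' as an ordinary character.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            yield row


def iter_xml_records(path, tag):
    """Yield each <tag> element as a dict of child tag -> text."""
    context = ET.iterparse(path, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield {child.tag: child.text for child in elem}
            # Drop records that were already handled so memory use stays flat.
            root.clear()
```

Something like `for rec in iter_xml_records("dump.xml", "track"): ...` would then walk the file one record at a time. Without the `root.clear()` call, iterparse still builds the full tree in memory, which defeats the purpose on a file this size.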

pradeepchhetri a year ago

You can leverage ClickHouse to process your music data. ClickHouse supports both TSV[0] and XML[1] data formats.

[0] https://clickhouse.com/docs/en/interfaces/formats#tabseparat...

[1] https://clickhouse.com/docs/en/interfaces/formats#xml

mobilio a year ago

A great solution is DuckDB: https://duckdb.org/docs/data/csv/overview.html

datadrivenangel a year ago

What kind of single music data file is 100 GB?

Also, how is it structured? If it's actually a tab-separated values file, consider using something like Polars or DuckDB.

anindhaOP a year ago

I found this https://klogg.filimonov.dev

abdusco a year ago

For TSV, you might want to consider importing it into a SQLite database, then querying it however you please.

https://stackoverflow.com/a/35454070/5298150

You can also use Datasette and sqlite-utils for it:

https://sqlite-utils.datasette.io/en/stable/cli.html#inserti...
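
If you'd rather script the import than use a CLI, here is a minimal sketch with Python's built-in sqlite3 module. It is not what the linked answer or sqlite-utils does internally. It assumes the first TSV line is a header, that every row has the same number of fields, and it stores everything as untyped text; the table name "tracks" and the batch size are placeholders.

```python
import csv
import sqlite3
from itertools import islice


def import_tsv(tsv_path, db_path, table="tracks", batch_size=50_000):
    """Load a headered TSV into a SQLite table in batches; return rows inserted."""
    conn = sqlite3.connect(db_path)
    total = 0
    with open(tsv_path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE is an assumption: treat '"' as an ordinary character.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader)
        # Quote column names so headers with spaces or keywords still work.
        cols = ", ".join('"' + name.replace('"', '""') + '"' for name in header)
        marks = ", ".join("?" for _ in header)
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
        insert = f'INSERT INTO "{table}" VALUES ({marks})'
        while True:
            # Pull a bounded batch so memory use does not grow with file size.
            batch = list(islice(reader, batch_size))
            if not batch:
                break
            conn.executemany(insert, batch)
            conn.commit()
            total += len(batch)
    conn.close()
    return total
```

After the import you can add indexes on the columns you filter by and query with plain SQL. A row with the wrong number of fields will raise an error here, so a real file may need a cleanup pass first.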
