Delimited files are hell -- a comparison of methods to read bad files
json.blog

I tried `polars` via Explorer from Elixir, and it shit the bed in an entirely different way:
```
df = Explorer.DataFrame.from_csv("my_file.txt", delimiter: "|", infer_schema_length: nil)
{:error,
 {:polars,
  "Could not parse `OTHER` as dtype Int64 at column 3.\nThe current offset in the file is 4447442 bytes.\n\nConsider specifying the correct dtype, increasing\nthe number of records used to infer the schema,\nrunning the parser with `ignore_parser_errors=true`\nor adding `OTHER` to the `null_values` list."}}
```
Note, I added `infer_schema_length: nil` assuming that data type discovery via sampling was just weaker in `polars`, since this makes it read the whole file before determining types, but it still failed.
Relying on undefined behaviour can't really be considered much of a solution. Any changes to one of those third-party libraries could possibly break your application without warning. I would suggest inserting a sanitization routine right there into the stack to parse and transform the data file accordingly. For the sake of posterity, emitting logs of every "questionable" entry along the way wouldn't be a bad idea either.
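A minimal sketch of such a routine (the field-count rule and names are hypothetical; real rules would depend on the feed): stream the file, pass through rows with the expected shape, and log everything else for posterity.

```python
import csv
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("sanitizer")

def sanitize(lines, delimiter="|", expected_fields=3):
    """Yield well-formed rows; log every questionable entry."""
    for lineno, row in enumerate(csv.reader(lines, delimiter=delimiter), start=1):
        if len(row) != expected_fields:
            log.warning("line %d: expected %d fields, got %d: %r",
                        lineno, expected_fields, len(row), row)
            continue
        yield row

# Contrived sample: the second record is missing a field
raw = ["1|alice|10", "2|bob", "3|carol|30"]
clean = list(sanitize(raw))
print(clean)  # [['1', 'alice', '10'], ['3', 'carol', '30']]
```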
What you're saying is I should build and maintain my own delimited file parser with my own logic for healing these files and logging, rather than using any number of hugely popular delimited file parsers used by hundreds of thousands (millions?) of people with many strong programmers maintaining a well-tested code base.
The best way to do exactly what you're saying is just use R and do:
```
data.table::fread('my_file.txt') |>
  arrow::write_parquet('new_file.parquet')
```
That will do the exact same thing-- sanitize the file, parse and transform the data correctly, log questionable lines, and output a binary file that can be used by other systems later.
When you're working with thousands of files and hundreds of millions of lines every day, the client will be rightfully pissed if their data is off by $100,000, and my only resolution is to wait 2 weeks for someone in IT upstream on their end to _maybe_ fix the file, hopefully without introducing a new error...
Writing my own delimited file parser over a huge amount of community effort sounds like the worst case of not-invented-here syndrome ever. What stinks is how willing most of those projects are to fail silently.
These popular file parsers should have better validation mechanisms. Or better yet, have separate validation prior to the rest of any workflow.
Yes-- that's precisely what I'm saying. They should be better at validation, and my experience is that `data.table::fread` in R is the best in class.
The author's assessment that using binary files would prevent such issues is flat-out wrong. If a file is corrupt, as in his data, then undefined behaviour is all you can expect, no matter whether the file format is text- or binary-based.
The thing that would prevent such issues is validation of the data you accept.
One way to think about this is that CSVs are poorly specified, and what specifications do exist are ignored by the CSV-generating processes of popular RDBMSs. The advantage of a binary format with an approved, widely adopted specification would be robust test suites to conform to, making it unlikely that data generation produces a bad file.
In the case of text delimited files, it is simply too easy and too common to generate, from the start, a malformed file that other systems cannot read. Because data loss is inherent in a text-based format, folks don't even bother to check if the files they generate can be successfully interpreted by their own system. PostgreSQL, Oracle, and MS SQL will all gladly produce CSV files that cannot be read back successfully. I'm not talking about some loss of metadata, I'm talking cannot be read.
In the "real world", of course I run validations on the data I accept. A common one for me, since the files are essentially "append only" when they're updated is to check for meaningfully fewer records than previous data loads. That's my best way of determining that when the file was read, records were dropped or lost because of things like quoting being messed up or an incomplete file transfer.
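That kind of check is small enough to sketch; a hypothetical version, with the shrink threshold made up:

```python
def validate_row_count(current, previous, max_shrink=0.01):
    """Flag an append-only feed that shrank meaningfully since the
    last load (the 1% threshold is a made-up example)."""
    if previous and current < previous * (1 - max_shrink):
        raise ValueError(
            f"row count dropped from {previous} to {current}; "
            "possible quoting damage or incomplete transfer"
        )

validate_row_count(1_000_100, 1_000_000)  # fine: the feed grew

try:
    validate_row_count(900_000, 1_000_000)
except ValueError as e:
    print(e)  # caught: a 10% drop in an append-only feed is suspicious
```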
It's still not great that a mismatched quote, which is quite common, doesn't even trigger a warning in the validation methods of these parsers.
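The silent-merge behaviour is easy to demonstrate with Python's stdlib `csv` module standing in for the parsers above (a contrived sample):

```python
import csv
import io

# One stray quote swallows the delimiter, the newline, and the next row
bad = 'id|name|amount\n1|"alice|10\n2|bob|20\n'
rows = list(csv.reader(io.StringIO(bad), delimiter="|"))
print(len(rows))  # 2 -- three lines quietly became two rows, no warning
```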
CSV isn't poorly specified, but it is mostly poorly implemented relative to the specifications that exist (Microsoft's products are some of the worst offenders here). Data loss is no more inherent in a text-based format than in a binary one. The only difference is that it's more difficult to produce or extract data from binary files. That's detrimental to information transfer, and it's why I would recommend specifically against binary formats.
JSON is much easier to validate and has similar ease of data transfer as with CSV. It can have minimal overhead as well if the data is stored as arrays (or an array of arrays).
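A sketch of that array-of-arrays shape-- the point being that a truncated or mangled payload is a hard parse error rather than silently dropped rows:

```python
import json

# Rows as an array of arrays: minimal overhead, explicit nulls
payload = '[["id", "name", "amount"], [1, "alice", 10], [2, "bob", null]]'
rows = json.loads(payload)
print(rows[2])  # [2, 'bob', None]

# A truncated transfer fails loudly instead of losing records
try:
    json.loads(payload[:-4])
    print("parsed")
except json.JSONDecodeError:
    print("rejected at parse time")
```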
A comparison of duckdb, PostgreSQL, pandas, readr, and fread when reading a delimited file with strange data.