Show HN: Parsing CSV files with GPU

123 points by antonmks 11 years ago · 62 comments

Reader

This title is incredibly misleading.

* This isn't parsing a CSV, this is a program written to split this exact dataset. (The code is filled with hard coded values)

* You're comparing a single-threaded run on a low-end CPU to a top-tier GPU.

* Your dataset can fit into GPU memory.

* There is a pull request for a missing semicolon, which means the posted version of the code won't even compile, so couldn't have been the version used to generate the benchmarks.

* The amount of branching in the GPU code makes it hard for me to believe that it actually ran that fast. GPU parallelism does not work well with branching since all cores in a cube must executing in lock-step, if you branch, then you now have to go back and execute all of your branches separately.

antonmksOP 11 years ago

Sorry if you are misled :-) The program does parses selected fields, there are strings-to-binaries procedures.
I tested this approach on multiterabyte files, take a look at my alenka project, it uses the same method to load large CSV files into databases. It just have to be done in chunks.
The program compiles fine, that pull request was referring to incorrect earlier version of test.cu file.
Test it for yourself, see if you get similar results.

castratikron 11 years ago

I'm skeptical of the 8x speedup for several reasons, the main one being that this particular problem does not fit the paradigm of problems that work well on the GPU; the GPU cache is not used at all, and there are also many branches. You need to be able to use the cache of the GPU in your application, otherwise your performance is guaranteed to be memory-bound. The reason you want to avoid branches is that there is only one control unit per a number of cores on the GPU, which means that if some threads follow one branch they will have to stall until the other threads complete. Generally the only code that maps well to the GPU is that which contains large for loops and has good spacial locality (e.g. matrix multiplication).

The author is comparing a GPU to a CPU, yet the CPU is only running a single thread (supposedly, the author did not provide the CPU code used in the comparison). For a true comparison the full capability of the CPU should be exposed by means of a multithreaded application (and, as someone else has already mentioned, vector instructions such as SSE). Think performance per socket, not performance per thread.

pcwalton 11 years ago

> Generally the only code that maps well to the GPU is that which contains large for loops and has good spacial locality (e.g. matrix multiplication).
You also need high arithmetic intensity (the ratio of arithmetic/logical operations to memory loads). Of common CPU-bound tasks, CSV parsing has one of the lowest arithmetic intensities imaginable.
- castratikron 11 years ago
  
  Right, maybe I should have said that instead. In my mind locality implies arithmetic intensity, but intensity may not necessarily imply locality.
zvrba 11 years ago

> I'm skeptical of the 8x speedup for several reasons [...] yet the CPU is only running a single thread
That's why I'm not skeptical at all. A GPU program can operate over THOUSANDS many more data items in parallel than a single-threaded scalar CPU program can. Yet the speedup is not thousands, not even hundreds, but a mere 8.
Fits perfectly with
> main one being that this particular problem does not fit the paradigm of problems that work well on the GPU
stephen272 11 years ago

You hit the nail on the head. All of that on top of the fact that he is comparing a top of the line GPU against a mediocre i3 processor. GPU to CPU comparisons are always apples to oranges, but this is pretty bad.
I would like to see an OpenCL kernel that is run on both the GPU and CPU to possibly even the playing field a little.
- paulmd 11 years ago
  
  You can compile Thrust using a device target of OpenMP, so no need to rewrite it in OpenCL.
paulmd 11 years ago

There's also just the sheer bandwidth factor. A GTX Titan has 288 GB/s of BW, a Haswell with DDR3 puts out about 35 GB/s. That's a factor of 8.22x more bandwidth, so there's your 8x speedup.
It's a highly memory-limited task so I suspect that's where any speedup would come from.

kazinator 11 years ago

Though there is no standard definition of CSV, de facto processing it properly requires recognizing quotes, and also escapes of literal quotes using double quoting:

    this, "is, like, CSV", "with three so-called ""fields"""

Note that unquoted leading and trailing whitespace, and whitespace around the commas, is deleted, too.

(See CSV page in the Wikipedia)

A GPU-accelerated string split could be useful but it's not quite "parsing CSV".

haberman 11 years ago
This is a perfect example of how text parsing is really inherently non-parallelizable. It's very rare that you can do anything useful with a buffer of text without knowing the precise state of the parse at the beginning of that buffer.
The kinds of patterns that would make parsing more parallelizable, like marking the beginning of a delimited region with its length, are human unfriendly so would never be part of an actual text format. Who would ever want to write this?
```
    # Update "19" whenever string length changes.
    x = {19}"String of length 19"
```
- pcwalton 11 years ago
  
  > This is a perfect example of how text parsing is really inherently non-parallelizable. It's very rare that you can do anything useful with a buffer of text without knowing the precise state of the parse at the beginning of that buffer.
  There are two mechanisms that are usually used to get around this:
  (1) Perform a fast, sequential "skeleton parsing" pass before the main parse that scans just enough to find "split points" that are consumed by the parallel parser. This is what some of the parallel XML parsing work [1] did.
  (2) Guess the state you're in based on some heuristics, and roll back on failure. This actually works surprisingly well in practice for many grammars, for example HTML [2].
  [1]: http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=410047...
  [2]: http://www.cs.wm.edu/~xshen/Publications/taco14.pdf
  - haberman 11 years ago
    
    > (1) Perform a fast, sequential "skeleton parsing" pass before the main parse that scans just enough to find "split points" that are consumed by the parallel parser.
    I'm not able to access the full text of this paper. But from the description I wouldn't really consider this "parallel parsing." For the "skeleton parser" to be correct, it must transition through a state machine that is exactly as complex as the real parser. I suspect (again, not being able to read the paper right now) that what makes the "skeleton parse" faster than the "real parse" is not the speed of the parser itself, but the speed of the "load" on the parser.
    For the skeleton parse, the "load" on the parser is just finding split points (cheap). At the application level, the "load" on the parser in many cases is building a tree of some sort. The tree-building is often significantly more expensive than the parse itself because it usually involves a lot of dynamic memory allocation.
    So yes, if you do a preliminary parse that chunks up the document, and then a second parse that has a heavier load on it like tree-building, the second parse can indeed be parallelized. But I wouldn't consider this parallelizing a parser, I would consider it parallelizing the tree-building. In raw terms you have probably spent more CPU on the actual parsing logic than in the single-threaded case.
    I don't mean for this to be a semantic quibble. I'm really interested in parsing architectures that decouple the parser itself from its "load." Event-based parsers like SAX parsers do this. I'm interested specifically in the parser part itself, and the limits of how it can be optimized.
    > (2) Guess the state you're in based on some heuristics, and roll back on failure. This actually works surprisingly well in practice for many grammars, for example HTML [2].
    Looks like an interesting paper, I'll have to dig more into that.
- wfunction 11 years ago
  
  I've always wondered if parsing is still unparallelizable if you also allow backtracking. In other words, can you parse eagerly assuming that you're likely to be in a certain state, and then if you're proven wrong maybe you can retry?
  - TheLoneWolfling 11 years ago
    
    Yes, and that's exactly what a lot of things do.
    The problem with this is that you can do worse than single-threaded (although not asymptotically speaking) in the worst case.
    Consider what happens if you've got something along the lines of the following:
    ",",",",",",",",","
    If every parser thread (but the first, the one parsing from the start, of course) picks the wrong parity (i.e. if they should start quoted or not) repeatedly, you end up throwing away all the work of every thread but the first. And meanwhile, your first thread has to do the additional work of figuring out when to invalidate the other threads.
    This is unlikely to happen, but definitely possible.
    It's vaguely similar to recursive descent parsing, with vaguely similar drawbacks too.
    One other similar approach is to parse eagerly as you say, but with a higher-priority thread that goes through the thread from the start only keeping track of as much state as is necessary, checking / invalidating the other threads as necessary. For CSV I think the required info is only the parity of quotes, though I could be wrong.
- thesz 11 years ago
  
  http://www.cse.chalmers.se/edu/year/2010/course/TDA341/Paper...
  Also look at CYK parsing algorithm, it is highly parallelizable (it is based on matrix multiplication).
- TheLoneWolfling 11 years ago
  
  And that's why you let the computer do it for you.
  - infogulch 11 years ago
    
    And now that you're using a computer to do it you might as well use protobuf or bson or.... But then it's not a plain text format anymore and we're back to square one.
bostonpete 11 years ago

> unquoted leading and trailing whitespace, and whitespace around the commas, is deleted, too. (See CSV page in the Wikipedia)
I'm not sure why you referenced the Wikipedia page -- it indicates that according to RFC 4180, (a) "spaces outside quotes in a field are not allowed", and (b) "Spaces are considered part of a field and should not be ignored."
Maybe you're referring to the comment that CSV parsers should "be liberal in what you accept from others"...?
http://en.wikipedia.org/wiki/Comma-separated_values#Basic_ru...
- kazinator 11 years ago
  
  > (a) "spaces outside quotes in a field are not allowed", and (b) "Spaces are considered part of a field and should not be ignored."
  Note that it's clear that (b) refers only to spaces inside quotes, because spaces outside quotes are not allowed.
  > Maybe you're referring to the comment that CSV parsers should "be liberal in what you accept from others"...?
  I actually don't believe in that principle. If there is a spec that everyone agrees upon, violations should be accurately and loudly diagnosed and rejected.
  Both preparation of the data format and processing of that data format should be conservative. Being liberal in what is accepted has unintended negative consequences.
  Be that as it may, there is no universal CSV spec, though. RFC 4180 is just someone's opinion on what CSV should be. CSV is something that has been widely implemented in numerous programs over numerous decades, in different ways.
  My main point holds that if you split the string on commas and do nothing else, then one of the aspects you're neglecting to handle is the treatment of unquoted whitespace outside of a field.
  - nolok 11 years ago
    
    > I actually don't believe in that principle. If there is a spec that everyone agrees upon, violations should be accurately and loudly diagnosed and rejected. Both preparation of the data format and processing of that data format should be conservative. Being liberal in what is accepted has unintended negative consequences.
    I agree on principle. But when Marie in accounting opens the CSV that Bob from customer X sent her, if software A opens it and software B screams "error!!!", she's going to use the one that "works". And that means B's vendor will make their tool liberal of what they accept too.
    Similarly for websites, that's why browsers fixes what they see rather than not showing what you are asking it to show.
    Does it lead to a worse state of things for clear format, with more errors in the wild and no actual reference to base yourself on ? Yes. But it is still what end users want.
motoboi 11 years ago

You forgot with fields with line-breaks, with non printrable characters and so on.
- kazinator 11 years ago
  
  I don't need to reproduce an entire detailed CSV spec here to make the point.
  There is a danger in being too detailed because there is no universal spec anyway. For many users, CSV is whatever the most recent version or two of Microsoft Excel accept, as confirmed by trial-and-error reverse engineering.

paulmd 11 years ago

I kinda suspect he might be measuring the time it takes to launch a kernel rather than the time it takes the kernel to complete.

Thrust device calls, like those of the underlying CUDA library, are asynchronous by default. The only exception is calls that result in a memcpy, which are synchronous. To wait until an async call is completed you need to call one of the synchronize commands, like cudaDeviceSynchronize.

Looking through his test.cu file, he snaps a timestamp using std::clock right after doing the kernel launch with for_each. Ignoring the fact that this is not an accurate way to benchmark a GPU (you need to use events to accurately benchmark the kernel) what you're capturing will just be the processor time it takes to make the async kernel launch. Std::clock measures CPU time, which is (rightly) close to 0 for a program that runs on the GPU.

It's entirely possible that you're not even getting valid results out of the other end - note that you don't show output. I don't know if thrust's magic device memory access function triggers a synchronization or not. I kinda remember having to make an explicit call when I did a GPU simulation.

I don't have access to a CUDA box at the moment, I'd have to add those cudaDeviceSynchronize calls after the for_each invocations to be sure.

antonmksOP 11 years ago

Thrust CUDA calls are synchronous to each other. You can add an explicit synchronization call cudaDeviceSynchronize() and there won't be a difference in results.
andrewguenther 11 years ago

I had suspected this as well, but I'm not familiar with how Thrust's CUDA bindings work.

roel_v 11 years ago

I don't understand the benchmark. How can a 750GB file be read from disk in 1.5 seconds, let alone be parsed? He mentions it's a 2TB drive, so it's not even SSD presumably?

bbrks 11 years ago

The README has just been updated to change 750GB to MB.
https://github.com/antonmks/nvParse/commit/fab4c4728096003bc...

victorNicollet 11 years ago

Very interesting. From my experience, the hard part about parsing CSV isn't to identify the individual cells, but rather parsing those cells afterwards (as numbers, dates, etc).

What is the performance of those operations (e.g. parsing YYYY-MM-DD dates to Unix timestamps) when performed on the GPU ?

My company actually picked another optimization strategy, by making the tokenization significantly longer, but it de-duplicates the tokenized cells so that each distinct cell value (a date, a number, a string) can be parsed exactly once. We have seen some fairly good results out of this, compared to the naive approach of stream-token-parse:

https://github.com/Lokad/lokad-flatfiles

TheLoneWolfling 11 years ago

Makes sense for low-entropy data. Though I can see that approach choking on some datasets. What happens if every entry has a GUID, for example?
May be better to do a best-effort deduplication instead of an exhaustive approach.
- victorNicollet 11 years ago
  
  At some point, we considered tweaking by dropping any strings longer than a certain length from the deduplication (it also helps with memory usage when streaming the data).
  Our method makes most sense for many-to-many data (several orders per product, several orders per day), which happens to be the largest data sets we manipulate (by 3 orders of magnitude). I can certainly see situations where this would not be the case (e.g. web crawler logs).

mholt 11 years ago

I gotta admit, having written a high-speed, multi-threaded streaming CSV parser for the browser[1], I'm quite impressed by your project. I've done CUDA programming before and it's not easy (albeit libraries do help). Good work!

[1] http://papaparse.com

pestaa 11 years ago

I just used Papa at the office today. Remembered it from a while ago, it really turned out to be effortless. Excellent work.
(While I'm at it, may I suggest to provide CDN URLs on the website? IMO it would make the otherwise awesome page perfect.)
Thank you again.

hannibalhorn 11 years ago

For file sizes where parsing speed really makes a difference, loading the entire file into memory doesn't seem feasible. Maybe some sort of a hybrid approach (load chunks into memory and parse them via the GPU) would provide some real benefits though.

victorNicollet 11 years ago

In a REPL environment, for a 1GB file, this can be the difference between "wait three seconds" and "let me get my coffee".
atwebb 11 years ago

I think there's a utility that does something similar (windows based, assuming it's available elsewhere), I remember it having a "crass" name, you'd think it would make it easier to remember, maybe F.U.C.K.?
bane 11 years ago

And if you're going to load it, you may as well parse it on the way in.
- victorNicollet 11 years ago
  
  A streaming CSV parser is very difficult to get right, once you get past a certain level of complexity.
  Sometimes, you are not lucky enough to have perfect control over the encoding, number format and date format of the input, so you need to look ahead at a value sample to try and find out what those are.
  Sometimes, you cannot even assume that the software that produced the file didn't mangle the quotes around fields, and you have to detect that you are 40960 bytes into a quoted field, decide that it's probably an error, and backtrack.
  If you have enough memory to load the entire file, you will save a lot of time by giving up on streaming processing.
  - maxhou 11 years ago
    
    if you use mmap(), there is virtualy no difference between the streaming/non streaming parsing code.

avarsheny 11 years ago

I don't think CPU and GPU is going to make much difference here. You speed up in GPU run could be because of hot cache of file system. Try running GPU run first and non-GPU after that and plz post the results.

antonmksOP 11 years ago

I used average time for multiple runs, both GPU and CPU benchmarks ran from memory.

WhitneyLand 11 years ago

I like your creativity in applying the GPU to a less typical task.

Is it the case that the CPU version could be sped up dramatically by using multiple cores and a variation on the line splitting technique?

hvidgaard 11 years ago

You could use a lockfree queue, fill it, and spawn a suitable number of consumers that will put the parsed data back into a new queue. If the data order matters, then some extra boilerplate is needed, and I do not know how it's done on the GPU, or if it's needed at all.
Seeing that it takes 14.5s with the handrolled code to parse 750GB, I however doubt the optimization is needed - it's more than 50GB every second and you need some really really exotic hardware to generate that much data. You still need to do something with the data and it's likely significantly more computational intensitive - that should be on a different thread.
That said, using the GPU would free up the CPU to do meaningful work with the data. But to be fair, assuming 1GB/s datarate, it's still 12,5m to read it all and you only gain 14.5s or less than 2% speedup if you're CPU bound.
- antonmksOP 11 years ago
  
  It should be 750MB, not 750GB. There is no way a single hard drive can read 750GB in 14 seconds.
  - cjg_ 11 years ago
    
    Heh, you cant even DMA 750GB from memory to the graphics card (assuming 16GB/s, using 32 lanes) in 14s.
  - hvidgaard 11 years ago
    
    Well, that makes a lot more sense.
antonmksOP 11 years ago

Yes, CPU version can be accelerated as well, depending on availability of SSE extensions.
- Narishma 11 years ago
  
  Shouldn't that be the default assumption? It's pretty rare to not have SSE.
  - sharpneli 11 years ago
    
    On X86_64 it's impossible not to have SSE. So unless you develop for 32bit X86 systems you can assume the availability of vector instructions.
  - castratikron 11 years ago
    
    SSE is probably more common than a $1000 GPU.

rdc12 11 years ago

"However this method of parsing is CPU bound because * it doesn't take advantage of multiple cores of modern CPUs. * memory bandwidth limitations"

So not CPU bound at all then.

snissn 11 years ago

I think it would be really cool to integrate a GPU into a database to speed up certain operations if an optimizer can decide that certain parts of a query will benefit from it. The thrust docs [0] indicate that the C++ gpu library can be used effectively for sorting, I wonder if sorts on non-indexed fields can be sped up by attaching a GPU to my database!

[0] http://thrust.github.io/

maxhou 11 years ago

In the real world, the slow part of "parsing" a CSV file is IO: reading the file content from disk to memory, and from memory to CPU cache.

You would avoid reading the file content more than once if you had to parse it.

> The first line counts the number of lines in a buffer (assuming that file is read into memory and copied to gpu buffer d_readbuff).

but this is what is done here, first search to find all \n, then multi-core GPU stuff for each line content.

azurezyq 11 years ago

Things have been changed in a world of SSDs and machines w/ much memory. Actually, parsing a CSV in a single thread will never reach several hundred megabytes per second.

ape4 11 years ago

In the real world... read the CSV into a database (doesn't really matter how fast or slow it is). Access the data from the database.

victorNicollet 11 years ago

There are situations where CSV is the appropriate solution. Our applications read and write CSV data at 1-1.5 million cells per second, and the process can be scaled out at will (there is no central data server, all data is stored on Azure Blob Storage).
I'm not sure we could afford the skills and software to achieve these speeds with SQL.
dilap 11 years ago

In the real world, sometimes you have an external data source, in csv (which you can't control), which is changing, which you want to reread as quickly as possible. :)
- ldng 11 years ago
  
  You then use PostgreSQL Foreign Data Wrapper (aka SQL/MED)
bitwize 11 years ago

In the real world you may have to ingest hundreds of megabytes or more of CSV data, daily, from a provider or partner, and CPU time spent ingesting that data is not spent processing transactions. So any speedup you can get is well worth it.

aftbit 11 years ago

Am I crazy? With a hot disk cache, the cut command he gave takes 1.2 seconds on my machine.

Settings

Show HN: Parsing CSV files with GPU

Keyboard Shortcuts