How fast can a BufferedReader read lines in Java?
The first issue I can see with that code is that it's not doing what he expects. He does this to read the file into a StringBuffer:
bf.lines().forEach(s -> sb.append(s));
However, this ends up reading all the lines into one giant line, since the Strings that lines() produces have the newline character stripped. This leads the second lines() call to read a single 23 MB line (the file produced by gen.py), which is less than optimal.

The fastest version I managed to write was:
public void readString5(String data) throws IOException {
int lastIdx = 0;
for (int idx = data.indexOf('\n'); idx > -1; idx = data.indexOf('\n', lastIdx))
{
parseLine(data.substring(lastIdx, idx));
lastIdx = idx+1;
}
parseLine(data.substring(lastIdx));
}
Not the prettiest thing, but it went from 0.594 GB/s to 1.047 GB/s. Also, it doesn't quite do the same as the lines() method, but that's easily changed.

Your version still has the problem that, since Java 7, the substring method allocates a new String, which is unnecessary when the result is only borrowed by the parseLine function and not needed afterwards. What you probably want is a view that reuses the same underlying data. With the following I get around 2 GB/s:
public void readString(String data) throws IOException {
    int lastIdx = 0;
    for (int idx = data.indexOf('\n'); idx > -1; idx = data.indexOf('\n', lastIdx)) {
        parseLine(subSequenceView(data, lastIdx, idx));
        lastIdx = idx + 1;
    }
    parseLine(subSequenceView(data, lastIdx, data.length()));
}

CharSequence subSequenceView(CharSequence base, int beginIndex, int endIndex) {
    return new StringView(base, beginIndex, endIndex - beginIndex);
}

static class StringView implements CharSequence {
    final CharSequence base;
    final int offset;
    final int length;

    StringView(CharSequence base, int offset, int length) {
        if (length < 0) throw new IllegalArgumentException("length must be >= 0");
        this.base = base;
        this.offset = offset;
        this.length = length;
    }

    @Override
    public char charAt(int n) {
        if (n < 0 || n >= length) throw new IndexOutOfBoundsException(n);
        return base.charAt(offset + n);
    }

    @Override
    public int length() {
        return length;
    }

    @Override
    public CharSequence subSequence(int beginIndex, int endIndex) {
        return new StringView(base, offset + beginIndex, endIndex - beginIndex);
    }
}

Is it possible he edited the code? I don't see the append in the blog post.
This still does a fair amount of processing; curious what the throughput is with just straight reads into a char[] from FileReader.
I'm confused.
Where is the second lines call?
Lines 19 and 27 in the source: https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...
What the heck?
You're right, the hidden scanFile()...makes no sense.
It's stripping all the new lines from the input.
Why? What is going on?
He's using it to read the entire file's contents into memory, and re-running the lines() method on the now in-memory buffer, presumably to remove any disk I/O from bottlenecking the lines() function.
Not the best method to use, probably better to use the Files.readString method.
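On Java 11+, for example, that's a one-liner (a sketch, not code from the post):

String data = Files.readString(Paths.get(filename));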
That's a pretty big error if you're correct. What does it say about the language when a CS professor falls for this on a 40 line file?
Getting benchmarks right is difficult, even for a CS professor. The language doesn't even enter there.
What does it say about you that you're asking a leading question in this way?
I'm always interested in thoughts about language design and programming in general. We have fifty+ years of commercial software development and it still sucks. This caught my eye as the kind of error that shouldn't even be allowed to happen.
There are several simple data transformation steps. There's nothing inherently wrong with these steps. The compiler can't read your mind and guess what you wanted to do.
How would this be possible to be prevented with any programming language, existing or hypothetical?
Even if the meta-information that the appended string doesn't contain any newlines were passed on, the second call to lines() would rather use it as an optimization hint instead of raising an error.
To give an example: if the lines() method returned instances of a Line class, the string buffer and its append() method would be aware of it and still create a collection of lines instead of one long string. Not necessarily nice, and it would probably create issues in many other interfaces, but problems like this _can_ be solved with careful language and API design. Note that his end goal was to 'read lines' and not to use any particular string abstraction. Nothing to get worked up about here.
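A toy sketch of the idea (an entirely hypothetical API, just to show the shape):

import java.util.ArrayList;
import java.util.List;

class Line {
    final String text; // invariant: contains no newline characters
    Line(String text) {
        if (text.indexOf('\n') >= 0) throw new IllegalArgumentException("not a single line");
        this.text = text;
    }
}

class LineBuffer {
    private final List<Line> lines = new ArrayList<>();
    // Appending a Line preserves the line structure instead of concatenating
    // text, so a later lines() call can never silently see one giant line.
    void append(Line line) { lines.add(line); }
    List<Line> lines() { return List.copyOf(lines); }
}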
That would be a very clunky approach, to put it mildly: the explicit design intent of BufferedReader is to let you read lines without caring what the line separator is; it's a lossy transformation. You also can't subclass String. You'd be adding something very unwieldy for an edge case, directly contrary to the purpose of the API. The 'mistake' (such as it is) of the author here is simply using the wrong API; there are shorter, more direct ways to read a file into a string without any of these contortions.
data = new String(Files.readAllBytes(Paths.get(filename)));

That still comes down to an understanding of the API.
I don't think that appending a SingleLineString to a String object would obviously add a newline. The String/StringBuffer/StringBuilder classes in Java certainly don't work that way.
If you were adding to some other object like an instance of some text representation class, this could be different. But it would still not be 100% safe to deduce just by the API itself without looking at the documentation.
About as much as binary search being wrong in many languages for 20 years: https://thebittheories.com/the-curious-case-of-binary-search...
Nothing. Mistakes happen.
Pray tell...what has Java done poorly to ensure this mistake happened?
I don't know what Java's BufferedReader is doing, but it's probably not the optimal thing in terms of IO throughput. I would blame the algorithm long before blaming anything inherent about the JVM.
Erlang is another language where "naive" IO is kind of slow. https://github.com/bbense/beatwc/ is a project someone did to test various methods of doing IO in Erlang/Elixir, and their performance for a line-counting task, relative to the Unix wc(1) command.
It's interesting to see which approaches are faster. Yes, parallelism gains you a bit, but a much larger win comes from avoiding the stutter-stop effect of cutting the read buffer off whenever you hit a newline. Instead, the read buffer should be the same size as your IO source's optimal read-chunk size (a disk block; a TCP huge packet), and you should grab a whole buffer-ful of lines at a time, do a pattern-matching binary scan to collect all the indices of the newlines, and then use those indices to part the buffer out as slice references.
This achieves quite a dramatic speedup, since most of the time you don't need movable copies of the lines, and can copy the line (or more likely just part of it) yourself when you need to hold onto it.
This approach is probably also already built in to Java's "better" IO libraries, like NIO.
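In Java terms, the scan-then-slice step might look something like this (a sketch; splitChunkIntoLines is a made-up name, and ByteBuffer.duplicate() shares the backing storage, so the "lines" are views, not copies):

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

static List<ByteBuffer> splitChunkIntoLines(ByteBuffer chunk) {
    List<ByteBuffer> lines = new ArrayList<>();
    int start = chunk.position();
    for (int i = chunk.position(); i < chunk.limit(); i++) {
        if (chunk.get(i) == '\n') {              // absolute get: position is untouched
            ByteBuffer line = chunk.duplicate(); // shares the bytes, copies only bookkeeping
            line.position(start);
            line.limit(i);
            lines.add(line);
            start = i + 1;
        }
    }
    return lines; // any trailing partial line stays in the chunk for the next read
}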
Java NIO does exactly what you describe. IO is chunked optimally, the entire buffer is pulled, the API exposes a select() mechanism just like any libc, and users are expected to frame their own received data.
edit: Lemire is showing a lack of what Fowler describes as "mechanical sympathy".
https://martinfowler.com/articles/lmax.html#QueuesAndTheirLa...
A bit of nitpicking on that: even Java's NIO API isn't very efficient. select() introduces a ton of garbage due to dynamically allocating lists of selected keys, and keys have a lot of synchronization overhead. That's why frameworks like Netty gained a lot of performance by building their own selectors using JNI.
I'm reminded to add, in the vein of the author's complaint, that there is a similar ridiculousness in Erlang land that cannot be circumvented so easily: reading/writing zlib-compressed files using Erlang's file:open(..., [compressed]) option—or generating/parsing zlib-compressed ETF binaries using erlang:binary_to_term(..., [compressed])—holds [the moral equivalent of†] a global lock. Only one process can be zlib-compressing or zlib-decompressing a chunk of data at once, no matter how many cores your system has.
This means that, even when your data set compresses so well that you'd theoretically gain a ton of speed by having the data streamed from disk compressed, and then decompressed during parsing—this doesn't apply in practice, since you're introducing an artificial bottleneck in your IO reads.
I'm not actually sure if this is a bug in Erlang, per se, or if it's just the intended behavior and compressed file IO was never intended to be used for performance, only for e.g. embedded devices with tiny ROMs.
(If people here think it's a bug, I'll probably go to the effort at some point to profile the performance impact and submit it as a bug on https://bugs.erlang.org.)
† What it's actually doing, is that all zlib compression/decompression passes get sent to a zlib "port driver" as messages. Port drivers can handle multiple requests in-flight at once (they're the in-process equivalent to sockets—each Erlang process's port against the port-driver is its own "connection") but the zlib port driver shim is coded to expose zlib as a single-threaded, blocking, request-response style of server, rather than one that accepts connections in parallel and instantiates a separate zlib context for each separate connection it receives.
That's probably not true in the recent versions - in OTP 20 the zlib integration was reworked and is now based on NIFs (similar to Java's JNI) instead of port drivers.
Honestly, it seems that nearly everyone here is missing his point.
Some of the blame for that probably lies with his headline choice, but he clearly states at the end of this post:
""" This is not the best that Java can do: Java can ingest data much faster. However, my results suggest that on modern systems, Java file parsing might be frequently processor-bound, as opposed to system bound. That is, you can buy much better disks and network cards, and your system won’t go any faster. Unless, of course, you have really good Java engineers.
Many firms probably just throw more hardware at the problem. """
It's not about this piece of code. It's not even about Java. In the previous post, which he mentions at the start of this one, he pointed out:
""" These results suggest that reading a text file in C++ could be CPU bound in that sense that buying an even faster disk would not speed up your single-threaded throughput. """
So, I take his point to be that one shouldn't make assumptions about performance. Rough performance scales -- such as have been posted here many times (e.g. [1]) -- make great rules of thumb for implementation choices or as a guide for where to look first for bottlenecks. To optimize in the real world, though, you're best served using real measurements.
[1] https://www.prowesscorp.com/computer-latency-at-a-human-scal...
The post does nothing to explain how and why; it just throws out a couple of outputs from an unspecified machine and does no comparison.

It has no baseline and no specs. For all I know, he could have got his 0.5 GB/s on an old Pentium II processor.
There is no analysis.
I am perplexed.
Lemire is one of the leading experts on string matching and the author of several core libraries you probably use every day.
edit: fine, so instead maybe click on the links in the post to see that this article is just one of a series. He's probably tired of copy-pasting the specs of his reference hardware (Skylake, https://arxiv.org/pdf/1902.08318.pdf) since all he's concerned about is the relative performance of different software.
There is a difference between "I'm being dumb because I don't know what I'm doing" and "I'm being lazy because I've done it 1,000 times and the target audience knows what I mean".
Then they should know enough to give at least some theories that can explain the difference and tell us something more about the test setup.
That doesn’t excuse him from needing to describe the hardware he ran the benchmark on.
It should be sufficient to state that it was the same hardware as the C++ measurement. He didn't even say that, but it seemed implicit to me.
Yet he misses the fact that Java isn't defined by a single implementation, and the standard library reference doesn't dictate how each Java-compliant implementation is required to provide BufferedReader's behaviour.
Then he should really know better. Not sure why you think the sentence you have written is even an argument.
Know better than what? The link where he mentions the system is literally the first sentence in the post.
I'm appalled at this attitude of entitlement, people don't owe us anything when publishing free content on the web. It's ok to suggest a change or ask for more details, but keep it respectful.
I'm amazed at how upset some commenters are about a blog post that did a toy experiment and didn't actually make any strong claims.
I'm actually a stickler about good benchmarks - it riles me when people draw sweeping conclusions from poorly-designed experiments. Lemire is actually one of the good ones. If you want something more fully developed than a blog post, read one of his papers.
I personally really enjoy his blog because of this - he's good at picking interesting exploratory experiments that provide some insight, without trying to over-generalize from the results. If you read his conclusion, the point is that there is a good probability that even relatively simple programs are CPU-bound. His experiment supports that point. My experience also matches that - I've seen a lot of data processing code that could be I/O bound in theory (i.e. a perfect implementation could max out CPU or network) but is CPU bound in practice. Usually because of string manipulation, regexes, or any number of other things.
> This is not the best that Java can do: Java can ingest data much faster. However, my results suggest that on modern systems, Java file parsing might be frequently processor-bound, as opposed to system bound. That is, you can buy much better disks and network cards, and your system won’t go any faster. Unless, of course, you have really good Java engineers.
I can make a lot of things cpu bound with crappy implementation. I'm not going to write a blog post about any of them.
This is absurd; the original platform libraries do not account for the fastest use cases in any specialized IO scenario.
Java NIO channel should have been used for this. It was demonstrated back in the early 2000s with the "Grand Canyon" demo achieving very good throughput for its time, and it's still the gold standard.
So what's the reason for this? Is it maybe because of some Unicode shenanigans? Java characters are 16-bit IIRC, and strings have some forty bytes of constant overhead.
I'm no Java ninja, but a few things jump out of https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/19fb8f93c... :
- at least one heap allocation for every line: after it finds the EOL it first uses 'new String' followed by '.toString()'
- the C++ version will almost certainly be backing on to memchr() behind the scenes, which will be using SIMD instructions where it makes sense (e.g. large enough scan size, probably true in this case). the Java version is a manual bytewise-coded loop.
- the C++ version is reusing its output buffer, no reallocations assuming the same string length or less
No idea about encodings in Java, maybe that is playing a role too
I haven't looked recently but several years ago I was shocked to discover the GNU libstdc++ didn't use strchr or memchr. It used a hand-coded for loop because it was a template for various kinds of character. There was no specialization for 8-bit char, either.
As a result std::string was disgustingly slow compared to C code.
Yes the reuse of the buffer in C++ seems likely and would probably explain a large part of the difference, but I don't know enough of the std::string implementation to be sure about that.
The StringBuffer and String allocation will make it slower for sure; curious what performance the other ways of reading lines in Java have.
There is a layer of classes, so presumably there are multiple buffers and extra copying. It's also converting the underlying encoding to UTF-16, allocating String objects, and copying the data into them.
Java has many inefficient parts. For example, there's no immutable-array concept (or ownership concept, like in Rust), so a lot of unnecessary array copies happen in the JDK. String is not well designed. There was an attempt to abstract the String concept into CharSequence, but a lot of code still uses Strings.
I made a similar benchmark. The idea is as follows: we have a 2 GB byte array (because arrays in Java have a 32-bit length limit, LoL) filled with values 32..126, imitating ASCII text, and value-13 bytes imitating newlines.
The first test simply XORs the whole array. It's the ideal result, which should correspond to memory bandwidth.
The second test wraps this array into a ByteArrayInputStream, converts it into a Reader using InputStreamReader with UTF-8 encoding, reads lines using BufferedReader, and in the end also XORs every char value.
For 2 GB I get 516 ms as the ideal time (3.8 GB/s, which is still almost an order of magnitude less than the theoretical 19.2 GB/s DDR4 speed) and 3566 ms with BufferedReader, so you can get almost a 7x speed improvement with a better implementation.
Benchmark: https://pastebin.com/xMD4W8mn
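A scaled-down sketch of the shape of that benchmark (the pastebin has the actual code; the size is reduced here, newlines are simplified to '\n', and there's no JMH-style warmup):

import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Random;

public class LineReadBench {
    public static void main(String[] args) throws IOException {
        byte[] data = new byte[64 * 1024 * 1024]; // scaled down from 2 GB
        Random rnd = new Random(42);
        for (int i = 0; i < data.length; i++) {
            // mostly printable ASCII, with an occasional newline byte
            data[i] = (i % 100 == 99) ? (byte) '\n' : (byte) (32 + rnd.nextInt(95));
        }

        // Test 1: XOR the whole array - approximates memory bandwidth.
        long t0 = System.nanoTime();
        int acc = 0;
        for (byte b : data) acc ^= b;
        long xorMs = (System.nanoTime() - t0) / 1_000_000;

        // Test 2: decode as UTF-8, read lines via BufferedReader, XOR every char.
        t0 = System.nanoTime();
        int acc2 = 0;
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                for (int i = 0; i < line.length(); i++) acc2 ^= line.charAt(i);
            }
        }
        long readerMs = (System.nanoTime() - t0) / 1_000_000;

        System.out.println("xor: " + xorMs + " ms, reader: " + readerMs
                + " ms (" + acc + "/" + acc2 + ")");
    }
}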
Eh, he didn't use NIO. BufferedReader is an ancient Java relic. Like reading from stdin in C, it's not made to be fast; it's there for convenience and backwards compatibility.
Read a file using something like Vert.X, which is optimized for speed. I'm 100% confident it will be faster than the naive C approach.
Do you have an example of alternatives to BufferedReader with the NIO APIs?
I do a lot of work with large GZIP files that are read line by line using the standard IO (i.e. GZIPInputStream(FileInputStream), etc.), but your comment has really made me second-guess my choice of doing that...
I would check out the compression handlers in Netty, which underlies Vert.X and many other projects that need high IO performance.
You should be able to hack something together that feeds zero copy buffers into Netty compression handler. Maybe using Netty or Vert.X file API, or maybe just raw NIO2.
I'm not sure how fast this would be, but my gut says "very". Netty can easily saturate 40 Gigabit Ethernet lines, and file IO should have less overhead. That's ~5 gigabytes a second.
It's going to be a good bit of coding for sure. Vert.X/Netty/NIO2 are all async and pretty low level. They're generally 1 thread per core, and along with SSD read patterns you're probably best off reading files in parallel, one per core. Might not be worth the effort.
You may want to look into Zstandard as well. It's superior to gzip in most ways when you need something fast but decent.
The NIO API uses channels and buffers, but in your particular case I don't think there is a gzip decoder that works on a ByteBuffer.

Path path = ...
var count = 0;
try (var channel = FileChannel.open(path)) {
    var buffer = ByteBuffer.allocateDirect(8192);
    while (channel.read(buffer) != -1) {
        buffer.flip(); // limit = bytes read, position = 0: scan only what was read
        while (buffer.hasRemaining()) {
            if (buffer.get() == '\n') {
                count++;
            }
        }
        buffer.clear();
    }
}
System.out.println(count);
Do you have a quick github project example?
Sorry, I'm too lazy to create a throwaway GitHub repo. I make way too many flamey and political comments to tie my real name to HN posts :-/
One thing that jumps out at me is that the test code writes to the "volume" variable in the read loop, I assume to count the number of bytes in the file, but never reads it back. A clever compiler will optimize away those writes, the string length check, the loop over the lines, and actually reading the file.
I'm not saying that's happening here, but it's a basic fact when writing benchmarks that you have to actually test something real and not a transient property of the program after the compiler has had its chance to be really smart.
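The usual guard against this in JMH is to sink every value into a Blackhole. A sketch, assuming the standard JMH annotations (the file name is made up):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.infra.Blackhole;

public class ReadLinesBench {
    @Benchmark
    public void readLines(Blackhole bh) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader("data.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                bh.consume(line); // keeps the JIT from eliding the loop and the reads
            }
        }
    }
}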
I am a bit confused. The scanFile reads from a file. But the readLines inside the for loop is reading from a StringReader - and not from a file?
At least two problems with the Java code. Concatenation of strings using the plus operator creates a new string and copies the content of the old one; that pushes the complexity of the code from O(n) to O(n^2), where n is the number of lines. Secondly, order is not guaranteed with the forEach operation on streams.
The correct way to do it is using collect(Collectors.joining("\n")) or straightforward imperative style (without streams).
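For example, against the post's reader (a sketch; bf is the BufferedReader from the post):

import java.util.stream.Collectors;

String data = bf.lines().collect(Collectors.joining("\n"));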
I don't think the general statement holds (that Java or BufferedReader is CPU-bound in particular).
Where do you see string concatenation with the plus operator in the posted code?
Haha - nowhere ... I completely misread his code.
But the whole stream API is so terrible for performance... you write one line of code and you are already at O(n^5).
Completely wrong.
It is like asserting something about C based on GCC specific behaviour.
Java is not a single language implementation.
Agreed. I think you are being downvoted because you forgot to paste in your benchmark results.
Yeah, I forgot that we aren't allowed to have opinions without doing the hard work.
So now I have to go out and do benchmarks across all these implementations just to prove they aren't the same?!?
https://en.m.wikipedia.org/wiki/List_of_Java_virtual_machine...
Java is... java.
I was once working on an Android app on a cheap custom board with 128 MB of RAM (don't ask why Android was on a single-function custom board; it wasn't my decision).
Among other things, I had to parse an 80,000-line CSV file. Splitting and the rest of the processing created so many temporary strings that the system ran out of RAM. We eventually gave up.
I've worked with gigabyte-sized CSV files myself without issues, so it was likely your implementation rather than the fault of the Java language.
Googles "parse csv Java"
Copies and pastes the top answer, which buffers the whole thing into the heap before parsing
Blames Java/hardware when this doesn't scale
Bad string allocations can easily swamp the JVM, with things like loops that add a single character to a string, allocating a new string on each iteration, and other naive approaches that turn what should be a small reused buffer into gigabytes of heap.
The JVM does a remarkable job of just-good-enough don't-worry-about-GC, but anyone who is a paid programmer needs to understand rudimentary aspects of GC, or your stuff will go sideways fast in production.
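The classic shape of the problem, and the fix (illustrative; input is a stand-in String):

// Anti-pattern: a brand-new String is allocated on every iteration (O(n^2) copying).
String s = "";
for (char c : input.toCharArray()) s += c;

// Fix: one growable buffer, one final allocation.
StringBuilder sb = new StringBuilder(input.length());
for (char c : input.toCharArray()) sb.append(c);
String result = sb.toString();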
> Bad string allocations can easily swamp the JVM

Like String.split() and whatever BufferedReader does, I guess. It's not like I was doing byte-by-byte manipulation. The GC just triggered too often.
Did you do it in 128 megs of RAM?
Depending on what needs to be done with the CSV files, it's very possible to do it in 128MB of RAM. For example, if we need to read the rows, transform them a bit, and then write to another file, we can read up to N rows, transform them, and then write them. That should result in bounded memory consumption because only up to N rows need to be kept. Similar strategies are possible if the rows are used as input to an ETL job, calling a Web service with the results of parsing the file, etc.
Editing a file gets trickier, though it's not impossible. Maybe using a [piece table](https://en.wikipedia.org/wiki/Piece_table) plus some smart buffering the file can keep memory consumption below some constant, letting it function for large files, but with the downside of lower performance for files larger than whatever the constant is?
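Going back to the first strategy, a minimal sketch of the read-transform-write loop (the batch size and the transform are stand-ins):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

static void transformCsv(Path in, Path out) throws IOException {
    final int batchSize = 1000; // at most N rows held in memory at a time
    try (BufferedReader reader = Files.newBufferedReader(in);
         BufferedWriter writer = Files.newBufferedWriter(out)) {
        List<String> batch = new ArrayList<>(batchSize);
        String line;
        while ((line = reader.readLine()) != null) {
            batch.add(line.toUpperCase()); // stand-in for the real transform
            if (batch.size() == batchSize) {
                for (String row : batch) { writer.write(row); writer.newLine(); }
                batch.clear();
            }
        }
        for (String row : batch) { writer.write(row); writer.newLine(); } // leftover rows
    }
}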
> Among other things, I had to parse a 80000 line csv file. Splitting and the rest of the processing created so many temporary strings the system ran out of ram. We eventually gave up
Let us all hope that Project Valhalla, the effort to add value types to Java, comes to a swift completion. It would be very helpful in these sorts of scenarios.
Though at this point I'd wonder (as you did) why Java was used in the first place given the resource constraints of the system, I've had some success using byte[] arrays and `yourInputStreamHere.read(someBuffer, yourOffset, YOUR_PAGE_SIZE)` in a few less constrained but relatively more performance-critical areas.
Depending on your exact needs, this can sometimes result in more or less constant memory consumption; it's possible to reuse the array on each read. You can also take it a step further by finding the indexes where you would split the array (e.g. the index of the next comma for a CSV) and using methods that operate on an array plus start and end indexes. That sort of strategy can allow you to avoid additional copies of the array, at the cost of somewhat annoying and less maintainable code that is definitely not Java-esque.
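A rough sketch of that index-based splitting (in is the InputStream, and handleField is a made-up callback taking buffer/start/end; a real version also has to carry a partial trailing field over to the next read):

byte[] buf = new byte[8192]; // one buffer, reused for every read
int n;
while ((n = in.read(buf, 0, buf.length)) != -1) {
    int fieldStart = 0;
    for (int i = 0; i < n; i++) {
        if (buf[i] == ',' || buf[i] == '\n') {
            handleField(buf, fieldStart, i); // operate on (buffer, start, end): no copy
            fieldStart = i + 1;
        }
    }
}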
There are also less severe solutions that won't buffer the whole file in memory, using StringBuilder or other techniques, etc. depending on what you're doing.
This all might be for naught though; I'm assuming the records have to be parsed into some structure, so tricks like the above might not be good enough given Java's seemingly insatiable hunger for heap space.
While I'm not the language's greatest fan, this is one area where Go has been able to shine. Having value types as a possibility makes solving this sort of problem a typically less expensive proposition.
I'm a little confused; trying to parse an entire file in memory instead of streaming it is one of the mistakes I might have made in my first year or two of development. Parsing 80k lines of CSV in Java is pretty easy as long as you write your code to be efficient and release memory line by line.
You can already play with Valhalla.
https://mail.openjdk.java.net/pipermail/valhalla-dev/2019-Ju...
Interesting comments, so let's have some answers:
1. 128 MB of RAM total, 20-30 MB free at best. The CPU was slow and we were running off an SD card. Not exactly an enterprise monster.
2. Of course I was parsing it line by line. But running low on memory triggered the garbage collector so many times that it took 3-4 minutes to finish. I didn't mean that the system actually stopped because of running out of RAM.
3. Why didn't you do it with the NDK/some other solution? Because the PM just gave up on the feature when it wasn't done in a day. It was just real-time search completion, which we didn't really need there; it was just "nice to have".
I'm parsing much larger datasets than those on Android, with less than 18 MB of RAM usage.
You just have to use NIO and direct buffers instead of naively reading and allocating strings.
So what's the point of using Java then? Of course I can do it in a non-GC language using a buffer that holds one line and only as much memory as needed for the data I want to retain...
Not having to worry about memory for 99% of the time.
Having to do research to optimize I/O is not just a Java thing. As stated before in this thread, the C stdlib isn't historically great either, and a lot of languages piggyback on that.
Modern I/O is almost always a different library, and you need to do research.
Direct buffers are a Java API for using native (off-heap) memory directly and accessing it in a fast way. They're the fastest way to handle memory in Java.
Android gives a bad name to Java with its Dalvik and ART partial implementations.
Why didn't you just write it using the NDK?