Using C for a specialized data store
pixenomics.tumblr.com
Umm, we're talking about just over 18MB here (1200 * 1000 pixels, 16 bytes/pixel, see http://pixenomics.tumblr.com/post/16895861678/how-to-send-1-...). That you can just dump over the wire as a binary blob. Why are we talking about this again? Use your favourite language, just keep it in a big blob in memory, and have fun.
Memory isn't an issue. It's processing the data and turning the storage into a format the client can read. A big blob isn't easy to send to the client unless it's an image or something, and then it becomes an issue when you want to manipulate or process the data.
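For what it's worth, here is a rough C sketch of the "big blob in memory" idea, just to make the arithmetic concrete: 1200 * 1000 * 16 = 19,200,000 bytes, or about 18.3 MiB. The function names (`blob_size`, `blob_create`, `blob_dump`) are made up for illustration, and the row-major, 16-bytes-per-pixel layout is an assumption based on the numbers above, not the authors' actual format.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Grid dimensions taken from the thread; the layout itself is assumed. */
enum { WIDTH = 1200, HEIGHT = 1000, BYTES_PER_PIXEL = 16 };

/* Total size of the grid in bytes: 1200 * 1000 * 16 = 19,200,000. */
static size_t blob_size(void) {
    return (size_t)WIDTH * HEIGHT * BYTES_PER_PIXEL;
}

/* Allocate the whole grid as one zero-filled block. The caller frees it. */
static uint8_t *blob_create(void) {
    return calloc(blob_size(), 1);
}

/* Write the blob as-is to an already-open stream (a file, a pipe, or a
 * socket wrapped with fdopen). Returns the number of bytes written. */
static size_t blob_dump(const uint8_t *blob, FILE *out) {
    assert(blob != NULL && out != NULL);
    return fwrite(blob, 1, blob_size(), out);
}
```

Calling `blob_create()` once at startup and `blob_dump(blob, stdout)` per request would be the whole "data store" in this sketch.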
it's a 1000x1000 image... this is a trivial problem
Can you elaborate?
What he means is: you wouldn't look for a "solution" for writing a 1 MB text file to disk, because it's quite trivial and fast in any language.
Seems to me they skipped right over the most obvious option: Redis.
It's quite fast, you can use a Redis string as a random-access array of up to 512MB each, and there are several good ways to handle persistence/backup. I don't think there was a need for them to write any C themselves.
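To illustrate the random-access part: Redis has `GETRANGE key start end` (the end index is inclusive) and `SETRANGE key offset value`, so a single pixel maps to a byte range in the string. Below is a small C sketch of that offset arithmetic only. It does not talk to Redis; the key name `pixels`, the helper names, and the row-major 16-bytes-per-pixel layout are assumptions for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Dimensions from the thread; row-major layout is an assumption. */
enum { WIDTH = 1200, HEIGHT = 1000, BYTES_PER_PIXEL = 16 };

/* Byte offset of pixel (x, y) inside the blob. */
static size_t pixel_offset(size_t x, size_t y) {
    assert(x < WIDTH && y < HEIGHT);
    return (y * WIDTH + x) * BYTES_PER_PIXEL;
}

/* Format a GETRANGE command that fetches exactly one pixel.
 * GETRANGE's end index is inclusive, hence the "- 1".
 * Returns what snprintf returns (the length of the command text). */
static int format_getrange(char *buf, size_t len, const char *key,
                           size_t x, size_t y) {
    size_t start = pixel_offset(x, y);
    size_t end = start + BYTES_PER_PIXEL - 1;
    return snprintf(buf, len, "GETRANGE %s %zu %zu", key, start, end);
}
```

For pixel (5, 2) this produces `GETRANGE pixels 38480 38495`. With a real client you would send that through the client library rather than formatting the text yourself.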
The 4th paragraph explained why they didn't go this way:
> We were reluctant to use a NoSQL solution as this would require retrieving the pixels through a socket, storing it in memory and then processing them. It makes more sense to process it where it’s stored.
Maybe I don't understand the problem, but that sounds like some serious premature optimization. 1.2 megapixels is not much data.
According to the article their Node solution took 4 seconds to run (down from 7 seconds after some optimization) and their C solution 0.03 seconds. Now maybe they could have sped up their Node code more, but that sort of improvement hardly counts as premature optimization.
Since the usual expected slowdown for jit compiled scripts is somewhere on the order of 5 times (obviously, this is a very loose guess, and the number will vary by script, style, and workload), I wonder what they could have been doing to cause a 200x slowdown.
We were looking at that (as well as riak) but processing the data would require pulling all the data into PHP. I guess you could do the processing in C but it's then just as easy to store it there as well.
Have you looked into the Lua scripting option for Redis? It allows some processing to happen on the server side, and it's quite powerful.
That sounds like a good option. Thanks, will note it.
I'm not clear why you're worried about that. Is it the pulling, or the processing?
The pulling shouldn't be an issue -- I don't know about PHP, but in pure Python, I can pull an arbitrary 10MB string from Redis in ~85-90ms. With hiredis (C extension), that falls to about 47ms.
I can't speak to processing, since I don't know exactly what transformations you're performing.
It's more the iteration over each pixel and its neighbors (of which there are 8), making it around 9.6 million iterations.
We will probably head towards Redis in the future when precise backups are essential. Undecided what will do this processing though.
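As a sanity check on that number, here is a minimal C sketch of the pixel-and-neighbors loop. It only counts neighbor visits; the real per-neighbor work is unknown, and skipping out-of-bounds neighbors at the edges is my assumption (the grid could just as well wrap around). `count_neighbor_visits` is a made-up name.

```c
#include <assert.h>
#include <stddef.h>

/* Visit every pixel and each of its up-to-8 neighbours, and return how many
 * neighbour visits happened. Neighbours outside the grid are skipped. */
static size_t count_neighbor_visits(int width, int height) {
    assert(width > 0 && height > 0);
    size_t visits = 0;
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            for (int dy = -1; dy <= 1; dy++) {
                for (int dx = -1; dx <= 1; dx++) {
                    if (dx == 0 && dy == 0)
                        continue; /* skip the pixel itself */
                    int nx = x + dx;
                    int ny = y + dy;
                    if (nx < 0 || nx >= width || ny < 0 || ny >= height)
                        continue; /* neighbour falls off the edge */
                    visits++; /* real per-neighbour work would go here */
                }
            }
        }
    }
    return visits;
}
```

For a 1200 x 1000 grid this comes to 9,586,804 visits, slightly under the 8 * 1,200,000 = 9.6 million upper bound because edge pixels have fewer neighbors.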
We built a GPU-accelerated NoSQL datastore. Using it, this can be accelerated 100x, provided you switch to a binary pixel format.
Why would you use a GPU-accelerated storage when latency is the main goal?
GPUs do not accelerate raw storage retrieval, but they do accelerate processing, such as queries and map-reduce.
Use an APU / HPU if PCIe latency is a problem.
As I understood it, they are running something like a convolution (i.e., each pixel is calculated from its surrounding pixels). This will be fast using the OpenCL model.
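To show why this kind of workload suits that model, here is a plain C sketch of a 3x3 box average. It is not OpenCL and not necessarily what the authors compute; one `float` per pixel, the edge clipping, and the names `box3x3_at` and `box3x3` are all assumptions. The point is that each output value reads only from the source buffer, so every (x, y) could be computed independently, one work-item per pixel.

```c
#include <assert.h>
#include <stddef.h>

/* Average of the 3x3 window around (x, y), clipped at the grid edges.
 * Reads only from src, so calls for different pixels are independent. */
static float box3x3_at(const float *src, int width, int height, int x, int y) {
    float sum = 0.0f;
    int count = 0;
    for (int dy = -1; dy <= 1; dy++) {
        for (int dx = -1; dx <= 1; dx++) {
            int nx = x + dx;
            int ny = y + dy;
            if (nx < 0 || nx >= width || ny < 0 || ny >= height)
                continue; /* outside the grid */
            sum += src[(size_t)ny * (size_t)width + (size_t)nx];
            count++;
        }
    }
    return sum / (float)count; /* count >= 1: the pixel itself is in range */
}

/* Serial driver; on a GPU each loop iteration would be one work-item.
 * src and dst must be different buffers. */
static void box3x3(const float *src, float *dst, int width, int height) {
    assert(src != NULL && dst != NULL && src != dst);
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            dst[(size_t)y * (size_t)width + (size_t)x] =
                box3x3_at(src, width, height, x, y);
}
```

Writing to a separate `dst` buffer is what keeps the per-pixel computations free of ordering dependencies.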