CopyCat: Protocol-agnostic implementation of the Raft consensus algorithm
github.comI am writing a program which has to durably store large number of small files. My current implementation stores these on the local disk as files in a particular folder structure. I am considering designing some approach to make this program run on a cluster of machines for high-availability, fault-tolerance & scalability reasons.
One approach I can think of is to run a distributed document database and use that for storage. I don't need most features of such database products in the foreseeable future, so I fear that they will add operational overhead for not much benefit.
Another approach I can think of is to run my processing nodes against a network file system, and rely on that to do the replication.
Yet another approach I am considering is to use something like CopyCat to implement the file replication in my application code. Is this a good use case for CopyCat?
CopyCat looks really good for HA and fault-tolerance. Not so sure about its scalability since all writes go through the single leader node. It's more appropriate for the use of maintaining metadata of a distributed system, rather than maintaining the data themselves.
For your requirement, can you use S3 or something similar?
Thanks! The software I am writing needs to be (easily) installed at the customer site. So, S3 won't work for me. Are there any open source S3 clones that are easy to setup?
If you mean that you need something that supports the S3 API, you could try S3Proxy[0]. It basically presents the S3 API over a variety of different storage backends.
Riak CS is S3 compatible.
I have a very similar problem and end up writing a python script that copies the files to cassandra and Nginx + Lua to serve them using lua-resty-cassandra. The read/write throughput is still not as good as I was expecting though.
You could try OpenStack's Swift project.
That is a "design feature" of raft, which allows it to lack the unbelievable complexity of Paxos, Multi-Paxos, or something like ZAB (yet another paxos variant).
Sounds like the kind of problem OrientDB is supposed to solve (I haven't tried it though). I would stay away from network filesystems (they're fiddly, they'll add much more operational overhead than a database product, and there is no mature distributed one. OpenAFS is probably your best bet if you do want to go that route).
OrientDBs clustering is, IIRC, built on top of Hazelcast. Storing my files in a clustered Hazelcast is another option I am considering, although I forgot to include it in my first comment.
Oh yuck. I didn't know that. I've long found Hazelcast a pain to work with.
Having seen in a sibling comment that you're talking about very small files, I'd recommend Zookeeper - it's mature and has pretty low admin overhead, IME.
that's very interesting. I've heard the exact opposite, hazelcast is easy to work with and Zookeeper more complicated.
Have you considered using a pre-existing distributed filesystem like Ceph or even HDFS instead?
I did consider both. HDFS, from what I read, is designed for storing large files - my files are about 10Kb in size each.
I am aware of Ceph but have not tried installing it to see how easy/hard it is to setup. Also, although this is not a hard requirement, I'd like to be able to support Windows; from what I have read so far, Ceph does not support Windows.
Ceph's filesystem interface is built on a lower level protocol which is available through the librados [1] API, which might be a possible solution if it builds for (or can be ported to) Cygwin / Windows and all you need is a client on Windows.
This API is more S3-like, operating directly on "objects" in the Ceph cluster. I wrote a system for storing many small (10-20kB) files using librados & Ceph, but the performance wasn't as good as I had hoped. Possibly I did not configure the Ceph cluster in the optimal manner - the cluster setup is quite complex.
This looks fantastic! As a DB researcher, I've found in the past I need some type of consensus algorithm for a system I'm building, but I don't want to spend a lot of time implementing something myself. I can see this being very useful in academia for scenarios like this.