What is the best solution to store billions of binary documents ?
Problematic :
- We want to store billions of binary documents. This storage has to be safe (replicated) and durable.
- We need backup and a DRP (mostly in case of human error).
- Versionning might be preferable, but it's not a requirement.
- The storage will probably be on the cloud
- Eventual consistency is OK
Here is a list of studied solution, with pros & cons
Cassandra, column-oriented :
pros : great for scale, failover and replication. Great support, books and community. Freat for write
cons : Setup might be complicated. Based on Thrift, which offers no streaming abilities. See http://stackoverflow.com/questions/3911529/storing-binary-data-on-cassandra-just-like-mysql-blob-binary
Hbase, column-oriented :
pros : based on hdfs, designed to scale and replicate, good doc/book. Can be used with haddop tools (Pig, hive etc.). Great for reads.
cons : Setup might be complicated. Isn't really good for binary data, see http://reavely.blogspot.com/2011/05/hbase-scalability-for-binary-data-i.html, http://www.quora.com/Apache-Hadoop/Is-HBase-appropriate-for-indexed-blob-storage-in-HDFS, http://www.quora.com/Apache-Hadoop/How-would-HBase-compare-to-Facebooks-Haystack-for-photo-storage.
A solution might be to store link/id in hbase and data in hdfs, but the NameNode keeps an image of the entire file system namespace and file Blockmap in memory, so be careful if you really have tons of docs.
Couple of interesting links : cassandra vs hbase : http://www.roadtofailure.com/2009/10/29/hbase-vs-cassandra-nosql-battle/
Write performance : http://www.quora.com/How-does-HBase-write-performance-differ-from-write-performance-in-Cassandra-with-consistency-level-ALL
Voldemort project, key/value, based on Amazon's Dynamo project
pros : key/value is great for us, built-in versionning. Great replication, failover and sharding
cons : All code processes a value at a time in memory. No cursors or streaming.
Redis : no cluster, doesn't fit our need with its all-in-memory system.
Riak (key/document): not free, sorry :/
Classic file system :
pros : very simple and fast
cons : can't scale (what if your data is bigger than a single disk?), doesn't provide replication, failover and backups
Tokyo Cabinet : can't find a descent doc, I don't know this solution so I can't really say anything. Any inputs are welcome :)
Couch DB (key/document, MVCC, uses REST as interface to DB, might be usefull to directly serve the doc, but we must handle permition in our case)
pros : hight durability (all data and associated indexes are flushed to disk and the transactional commit always leaves the database in a completely consistent state, might be a con because slower than mongodb memory-durability), has great books, but online documentation is a little bit scattered. Great replication, failover and sharding. Has streaming feature.
con : couchDB compaction will have to recopy the binary doc, REST interface could be a con (I just want to serve my data to my client outputstream)
Mongodb with GridFs (key/document, Update in-place, has nice geospatial features)
pros : performance (memory), great durability via journaling, widly used, great book and documentation (!)
Specific support for binary file with GridFs (native versioning, provide streaming file)
Great for replication/sharding with replicats, failover.
Built-in backup with deplayed replicat.
cons : ...
Some info about GridFs : http://www.coffeepowered.net/2010/02/17/serving-files-out-of-gridfs/, http://stackoverflow.com/questions/3413115/is-gridfs-fast-and-reliable-enough-for-production
The winner seems to be Mongodb, although CouchDb might be a good solution too. Its delayed replication is a really great feature when you need db backups !
Since I've no experience with couchDb, I think I'll start my poc using GridFs and see if it's a really good choice.