Grok Cassandra's data model
flazz.me
This column-family/column/super-column lingo that Cassandra uses just makes it harder to understand its data model. In fact, it's quite simple:
Keyspace: a hash table that holds your application data. Okay, the table is distributed among nodes (i.e., a DHT), but it's still a hash table.
Row: an entry in the above hash table, where each value is composed of a collection of "column-families".
Column Family: a key-value table (I avoid calling it a hash table because I don't remember if it's implemented as such). A better name for this thing would be 'Attribute Set'.
Column: it's a key-value pair (with timestamp). Thinking about it as a column just blurs the concept. Better name: 'Attribute'.
Note: it's possible to have a different set of attributes on a per-row basis (for the same Column Family), so this concept of 'column' breaks quite easily.
Super-column: key-value pair where the value is yet another key-value table! Better(?) name: 'Super-Attribute'.
So Cassandra's data model is in fact a nested set of key-value tables, while Dynamo's model is flat (just a one-level hash table). Oh! Last but not least, it's not a column store. Its on-disk storage is row-oriented.
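To make the nesting concrete, here is a rough sketch of the model above using plain Python dicts. It only illustrates the shape of the data, not how Cassandra stores or distributes anything; the row keys, family names, attribute names, and the `column` and `get` helpers are all made up for the example.

```python
def column(value, timestamp):
    """A 'column' (attribute) is just a value paired with a timestamp."""
    return (value, timestamp)

# keyspace -> row key -> column family -> column name -> (value, timestamp)
keyspace = {
    "alice": {
        "profile": {
            "email": column("alice@example.com", 1),
            "city": column("Lisbon", 1),
        },
    },
    "bob": {
        "profile": {
            # Same column family, different attributes: no "city", extra "phone".
            "email": column("bob@example.com", 2),
            "phone": column("555-0100", 2),
        },
    },
}

# A super-column adds one more level: its value is itself a table of columns.
keyspace["alice"]["friends"] = {
    "bob": {
        "since": column("2010", 3),
        "nickname": column("bobby", 3),
    },
}

def get(store, row_key, family, *path):
    """Walk the nested tables: row, then column family, then column names."""
    node = store[row_key][family]
    for name in path:
        node = node[name]
    return node

# By contrast, a flat (Dynamo-style) store is a single level: key -> opaque value.
flat_store = {"alice": b"serialized blob"}
```

With this layout, `get(keyspace, "alice", "profile", "email")` walks three levels of lookup, and the super-column case `get(keyspace, "alice", "friends", "bob", "since")` walks four.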
I think he makes the mistake of thinking the RDB-specific definition of those words is the absolute definition, and that nobody else can use them if they aren't using them in exactly the same way.
You can't go into a new language and assume any words that appear to be the same are exactly the same. This applies to spoken language as well as computer languages. Only heartache lies down that road.
mistake or not, i don't believe i assumed any words from the RDB (or any other) domain are the absolute definition.
i do assume that people often learn and understand things based on existing conceptual prototypes. that was my problem trying to understand cassandra.
Then why did you say:
"Not only is Cassandra’s terminology confusing it’s downright misleading. Row, Column & Key all have existing semantics in the land of databases. To make matters worse, Cassandra’s definitions are not even orthogonal to the existing ones — they exist in a difficult state of quasi-synonymity."
You assumed that the RDB definition of those words was absolute, and didn't bother to question if a different kind of database would use them somewhat differently.
"You assumed that the RDB definition of those words was absolute..."
I'm not the author, but I don't see how the portion you quoted requires an assumption that those definitions are absolute. The only thing one must accept is that those definitions are pervasive. That doesn't seem controversial, to me. When selecting the nomenclature, the makers of Cassandra could have made a practical decision not to create unnecessary confusion.
The Facebook guys who wrote Cassandra drew its design and terminology heavily from Google's Bigtable. By the way, Column Family (CF) in BT makes a lot more sense because the compression of data, as well as disk storage locality, is done on a per-CF basis. They have even filed a patent about this (http://bit.ly/ooop2s).
Oddly enough, BigTable's terminology seems to fit more naturally with the classic concepts than Cassandra's does. Maybe it's the result of Dynamo's design choices (DHT, etc.) that got into the mix, or of new concepts like SuperColumn.
Only by inventing new words and causing worse confusion.