MongoDB to Couchbase for developers: Part 4: Data Modeling

"To Embed or Not to Embed. That is the Question." Hamlet

Data modeling is a well defined and mature field in relational database systems. The model provides a consistent framework for application developers to add and manipulate the data. Everything is great until you need to change. This lack of schema flexibility was a key trigger NoSQL database systems. As we've learned before, both MongoDB and Couchbase are JSON based document database. JSON gives the developers the schema flexibility; the indexes, collections, query engine provide access paths to this data; the developer uses MQL in MongoDB and N1QL in Couchbase to query this data. Let's compare the modeling methods in Couchbase. Note: This article is short because the modeling options are similar. That's a good thing. Some differences in modeling options, access methods and optimizations are highlighted below.

Query processing on JSON

Comparison of modeling options and features supporting them.

MongoDB	Couchbase
Data Modeling Guides: here, here and here. Additional examples here.	Data Modeling Guides: here, here and here More examples here.
Flexible document model based on BSON (binary JSON)	Flexible document model based on JSON and binary objects
Supports embedding and loose references (no foreign key reference enforcement)	Supports embedding and loose references (no foreign key reference enforcement)
Atomic Write operations: Single document atomicity by default. Multi document writes are supported and it only supports single-document atomicity as well. Multi document atomicity and full ACID for MQL operations is supported via transactions in 4.2 and above.	Atomic Write operations: Single document atomicity by default. Multi document writes are supported and it only supports single-document atomicity as well. The operations can be done via direct KV SDK or N1QL. Multi document atomicity has been supported since 6.5. Full ACID for N1QL statements is supported via N1QL transactions in 7.0 and above.
Performance Advisor: Index advisor built into MongoDB ops manager	Performance Advisor: Has index advisor built into the product (via ADVISE statement and ADVISOR function) as well as an open service.
Schema Validation is optionally done synchronously.	Synchronous schema validation is unavailable. Schema validation can be done asynchronously.
Embedded data model	Exactly the same thing on JSON and Couchbase.
Normalized data model: References to other documents are stored as ObjectId(). Example: “product”: ObjectId("Lego.US.beatles123”) “friends”: [ObjectId("fred.123”), objectId(“joe.234”), objectId(“john.345”)] In a normalized schema, queries typically end up joining the data using the $lookup operator. $lookup implements left outer join against an unsharded collection. Joining between two sharded collections have to be implemented by the application/user. $graphlookup mimics recursion using Oracle’s CONNECT BY feature by querying the same object based on the previous result.	Normalized data model: References to other documents are simply stored as a string representing the document key of the references. “product”:"Lego.US.beatles123” “friends”: ["fred.123”, “joe.234”, “john.345”] In a normalized schema, queries can issue INNER JOIN, LEFT OUTER JOIN and RIGHT OUTER JOIN to manage the data. Collections in Couchbase are always automatically sharded (hash partitioned). All joins are supported partitioned collections. See details here and here. Couchbase doesn’t have the equivalent recursive querying capability. There are some work arounds: See here; The JavaScript UDF feature in 7.1 will help you write a loop or recursion easily.
Sharding Collections The collections are not sharded by default. You’d need to add the config servers and mongos to shard a collection. Mongo supports hash and range based strategy for collection and index sharding. Indexes follow collections shard strategy since the MongoDB indexes are local to the data. Mongo supports one or more fields as the shard key for hash strategy;	Sharding is called partitioning in Couchase. The collections in Couchbase are always hash partitioned based on document key. There is no range partitioning of the collection itself. The indexes in Couchbase can use a different strategy than the collection. Indexes can be partitioned or not. Indexes can be partitioned using hash on any field or expression or any complicated expression. The expressions for creating the index provide the flexibility to create indexes for subset of the data (partial index).
Time to Live (TTL) is supported by creating an index on a date field within the collection.	Couchbase supports TTL natively on the collections without the need for a separate index.
Modeling 1:1 relationships with embedding documents	For 1:1 relationships, you use the exact same approach.
Modeling 1:n relationships with embedding documents. This comes naturally with BSON arrays. Supporting storing an array of objects and or object of arrays is easy. Important thing is the support for querying and speeding them up using indexes. MongoDB supports querying data based on any of the values or fields in an array. The queries can be accelerated using array indexes. While MongoDB does not support indexing multiple arrays in a single index, allows creating an array key with keys from multiple arrays and creating a single index on it.	You can use the exact same approach on Couchbase. Arrays is a fundamental type in JSON and can be nested easily. Couchbase supports querying arrays, indexing arrays, and expressions on index. The query support both nesting and unnesting of arrays easily in the FROM clause of the statement. Array indexes support indexing on a one or more keys or expressions. Couchbase also supports indexing multiple arrays in a single index in FTS and using them in direct search query or N1QL query. See additional details here.
Modeling 1:n relationship with references. Simply store the n references as an array of document _id list. Within the query, simply do the $lookup to join the documents. And, you can sequence many $lookup operations in a single aggregate() query. $lookup is fine as long as you avoid joining a sharded collection to a another sharded collection. You’ve to be careful about joining two humongous collection. Second limitation is $lookup only supports nested loop join. When the amount of data increases, you need other join implementations like hash joins, which Mongo does not have.	You ca use the exact same approach on Couchbase. Instead of $lookup, use JOIN operations. See details here and here. The collections and the indexes used for the JOIN operations can be partitioned (Couchbase term for sharding). Couchbase support both nested loop and hash joins in both query and analytics service. If you’re using the Couchbase cost based optimizer in 7.0, it chooses the join method based on the lowest cost estimates. The rule based optimizer uses NL join by default and can use hash join when you specify the hash join hint.

Remember, any database can store any type of data. It's the ability to index and query efficiently that will help you to get your workload to meet your latency and throughput requirements. Databases are only as good as their access methods. But, proof is in the pudding. See additional details of Couchbase data modeling and customer examples in my article Introduction to Couchbase for Oracle Developers and Experts: Part 4: Data Modeling. Additional details on utilizing arrays can be found in this slide deck.

MongoDB to Couchbase for developers: Part 4: Data Modeling

Comparison of modeling options and features supporting them.

Popular posts from this blog