MongoDB to Couchbase for developers: Part 4: Data Modeling

"To Embed or Not to Embed. That is the Question."   Hamlet

Data modeling is a well defined and mature field in relational database systems. The model provides a consistent framework for application developers to add and manipulate the data. Everything is great until you need to change. This lack of schema flexibility was a key trigger NoSQL database systems.   As we've learned before, both MongoDB and Couchbase are JSON based document database.   JSON gives the developers the schema flexibility; the indexes,  collections, query engine provide access paths to this data;  the developer uses MQL in MongoDB and N1QL in Couchbase to query this data.  Let's compare the modeling methods in Couchbase. Note: This article is short because the modeling options are similar. That's a good thing. Some differences in modeling options, access methods and optimizations are highlighted below.

Query processing on JSON

Comparison of modeling options and features supporting them.

MongoDB

Couchbase

Data Modeling Guides: here, here and here

Additional examples here.


Data Modeling Guides: here, here and here

More examples here.

Flexible document model based on BSON (binary JSON)

Flexible document model based on JSON and binary objects

Supports embedding and loose references (no foreign key reference enforcement)

Supports embedding and loose references (no foreign key reference enforcement)

Atomic Write operations:

Single document atomicity by default. Multi document writes are supported and it only supports single-document atomicity as well.


Multi document atomicity and full ACID for MQL operations is supported via transactions in 4.2 and above.

Atomic Write operations:

Single document atomicity by default. Multi document writes are supported and it only supports single-document atomicity as well.  The operations can be done via direct KV SDK or N1QL. 


Multi document atomicity has been supported since 6.5. Full ACID for N1QL statements is supported via N1QL transactions in 7.0 and above.

Performance Advisor:

Index advisor built into MongoDB ops manager

Performance Advisor:

Has index advisor built into the product (via ADVISE statement and ADVISOR function) as well as an open service

Schema  Validation is optionally done synchronously.

Synchronous schema validation is unavailable. Schema validation can be done asynchronously

Embedded data model

Exactly the same thing on JSON and Couchbase.

Normalized data model: References to other documents are stored as ObjectId().

Example:


“product”: ObjectId("Lego.US.beatles123”)


“friends”: [ObjectId("fred.123”), objectId(“joe.234”), objectId(“john.345”)]


In a normalized schema, queries typically end up joining the data using the $lookup operator. $lookup implements left outer join against an unsharded collection.   Joining between two sharded collections have to be implemented by the application/user.


$graphlookup mimics recursion using Oracle’s CONNECT BY feature by querying the same object based on the previous result. 

Normalized data model: References to other documents are simply stored as a string representing the document key of the references. 


“product”:"Lego.US.beatles123”


“friends”: ["fred.123”, “joe.234”, “john.345”]


In a normalized schema, queries can issue INNER JOIN, LEFT OUTER JOIN and RIGHT OUTER JOIN to manage the data. Collections in Couchbase are always automatically sharded (hash partitioned). All joins are supported partitioned collections. See details here and here.


Couchbase doesn’t have the equivalent recursive querying capability.  There are some work arounds: See ​​here; The JavaScript UDF feature in 7.1 will help you write a loop or recursion easily. 

Sharding Collections

The collections are not sharded by default.  You’d need to add the config servers and mongos to shard a collection.  Mongo supports hash and range based strategy for collection and index sharding. Indexes follow collections shard strategy since the MongoDB indexes are local to the data. Mongo supports one or more fields as the shard key for hash strategy; 

Sharding is called partitioning in Couchase. The collections in Couchbase are always hash partitioned based on document key. There is no range partitioning of the collection itself. The indexes in Couchbase can use a different strategy than the collection. Indexes can be partitioned or not.  Indexes can be partitioned using hash on any field or expression or any complicated expression.  The expressions for creating the index provide the flexibility to create indexes for subset of the data (partial index). 

Time to Live (TTL) is supported by creating an index on a date field within the collection. 

Couchbase supports TTL natively on the collections without the need for a separate index.

Modeling 1:1 relationships with embedding documents

For 1:1 relationships, you use the exact same approach.

Modeling 1:n relationships with embedding documents.  This comes naturally with BSON arrays. Supporting storing an array of objects and or object of arrays is easy. Important thing is the support for querying and speeding them up using indexes. MongoDB supports querying data based on any of the values or fields in an array.  The queries can be accelerated using array indexes. While MongoDB does not support indexing multiple arrays in a single index, allows creating an array key with keys from multiple arrays and creating a single index on it.  

You can use the exact same approach on Couchbase.  Arrays is a fundamental type in JSON and can be nested easily.  Couchbase supports querying arrays, indexing arrays, and expressions on index. The query support both nesting and unnesting of arrays easily in the FROM clause of the statement. Array indexes support indexing on a one or more keys or expressions. Couchbase also supports indexing multiple arrays in a single index in FTS and using them in direct search query or N1QL query. See additional details here.

Modeling 1:n relationship with references.  Simply store the n references as an array of document _id list.  Within the query, simply do the $lookup to join the documents. And, you can sequence many $lookup operations in a single aggregate() query. $lookup is fine as long as you avoid joining a sharded collection to a another sharded collection. You’ve to be careful about joining two humongous collection. Second limitation is $lookup only supports nested loop join.  When the amount of data increases, you need other join implementations like hash joins, which Mongo does not have.  

You ca use the exact same approach on Couchbase. Instead of $lookup, use JOIN operations. See details here and here.  The collections and the indexes used for the JOIN operations can be partitioned (Couchbase term for sharding). Couchbase support both nested loop and hash joins in both query and analytics service. If you’re using the Couchbase cost based optimizer in 7.0, it chooses the join method based on the lowest cost estimates. The rule based optimizer uses NL join by default and can use hash join when you specify the hash join hint.

Remember,  any database can store any type of data.  It's the ability to index and query efficiently that will help you to get your workload to meet your latency and throughput requirements. Databases are only as good as their access methods.  But, proof is in the pudding. See additional details of Couchbase data modeling and customer examples in my article Introduction to Couchbase for Oracle Developers and Experts: Part 4: Data Modeling.   Additional details on utilizing arrays can be found in this slide deck. 

Popular posts from this blog

  There’s a saying in Chicago: “We don’t want nobody that nobody sent.” This was the cold reception Swami Vivekananda faced when he arrived in the windy city in July 1893, determined to attend the World Parliament of Religions that September. He belonged to no organization, carried no letter of recommendation, his countrymen were nobody, and represented an alien religion to the Western world. As the days passed, his hope of attending the parliament dwindled. With money running out and the odds stacked against him, he left the Windy City and went to Boston, praying for a glimmer of opportunity.  Swamiji came to America to share India’s most profound gift: the wisdom of the Hindu sages, preserved through centuries of oral tradition and embodied by its monks. This was 1893, not 1993—India was under the British grip, its resources drained, and its spirit subdued. Swamiji’s mission was not just a cultural exchange; it was a bold step toward envisioning a future where India could re...

From search to CRM, applications are adopting natural language and intuitive interactions. Should databases follow? This article provides a strategic perspective. Amid the many technological evolutions in software and hardware (CISC/RISC, Internet, Cloud, and AI), one technology has endured:  Relational Database Systems   (RDBMS), aka SQL databases. For over 50 years, RDBMS has survived and thrived, overcoming many challenges. It has evolved and adopted beneficial features from emerging technologies like object-relational databases and now competes robustly with   NoSQL databases .  Today, RDBMS dominates the market, with four of the top five databases and seven of the top ten being relational. RDBMS has smartly borrowed ideas, like JSON support, from NoSQL, while NoSQL has also borrowed from RDBMS. NoSQL no longer rejects SQL. From a user perspective, all modern databases have SQL-inspired query language and a set of APIs. All applications manage the respective data...

  The landscape of software development is ever-evolving with the advent of new technologies. As we venture into 2023, natural language processing ( NLP ) is rapidly emerging as a pivotal aspect of programming. Unlike previous generations of tools that primarily aimed at enhancing coding productivity and code quality, the new generation of Artificial Intelligence ( GenAI ) tools, like iQ, is set to revolutionize every facet of a developer's workflow. This encompasses a wide range of activities: Reading, writing, and rewriting specifications Designing, prototyping, and coding Reviewing, refactoring, and verifying software Going through the iterative cycle of deploying, debugging, and improving the software Create a draft schema and sample data for any use case Natural language queries. Generate sample queries on a given dataset Fix the syntax error for a query Don't stop here. Let your imagination fly. Although the insights garnered from iQ are preliminary and should be treated ...