[[This is Chapter 20(e) from “beta” Volume VI of the upcoming book “Development&Deployment of Multiplayer Online Games”, which is currently being beta-tested. Beta-testing is intended to improve the quality of the book, and provides a free e-copy of the “release” book to those who help with improving it; for further details see “Book Beta Testing”. All the content published during Beta Testing is subject to change before the book is published.
To navigate through the book, you may want to use Development&Deployment of MOG: Table of Contents.]]
In the previous part of this two-part post, I’ve pushed the concept of single-write-DB-connection architectures pretty hard; as I’ve argued, these are MUCH simpler (and MUCH less error-prone) than multi-connection ones, and one single-write-connection DB can provide performance of up to millions of write transactions per day. While this kind of performance doesn’t mean that we don’t need scalability, it can easily mean that at least we can start with a single-write-DB-connection architecture, adding scalability when/if the need for it arises.
Real-World Story of Achieving Perfect Share-Nothing Scalability
Once upon a time, I’ve seen a game which went through the whole process of achieving scalability over single-write-connection DBs, and, well, I DID like both the process and the end result. So, let’s take a look at how the story unfolded.
Stage 1. Just a Single-Write-Connection DB
At first, everything was extremely simple (and because of this simplicity – extremely robust, too). There was a DB Server which accepted requests from Game World Servers (and other Game Servers such as Cashier Server), and processed them serially, one by one. These requests were implemented along the lines which I’ve insisted on in the [[TODO]] section before (a minimal code sketch follows below):
- DB Server requests were expressed in terms of Game World Servers (and without any SQL). All conversion from Game Logic terms into SQL was done within DB Server.
- DB Server requests were atomic (i.e. each of the requests corresponded to exactly one DB transaction).
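To illustrate, here is a minimal Python/SQLite sketch of what such a request might look like; the DBServer class, the ARTIFACTS table, and all names are hypothetical and exist only for the sake of the example:

```python
import sqlite3

class DBServer:
    """Illustrative single-write-connection DB Server.

    Requests are expressed in Game Logic terms (no SQL is visible to the
    callers), and each request maps to exactly one ACID transaction.
    """

    def __init__(self, db_path):
        # the one and only connection which is allowed to modify the DB;
        # autocommit mode, transactions are managed explicitly
        self.conn = sqlite3.connect(db_path, isolation_level=None)

    def transfer_artifact(self, artifact_id, from_player, to_player):
        """Request: 'transfer artifact A from player P1 to player P2'."""
        self.conn.execute("BEGIN IMMEDIATE")
        try:
            cur = self.conn.execute(
                "UPDATE ARTIFACTS SET owner_id = ? "
                "WHERE artifact_id = ? AND owner_id = ?",
                (to_player, artifact_id, from_player))
            if cur.rowcount != 1:
                raise ValueError("artifact is not owned by from_player")
            self.conn.execute("COMMIT")    # exactly one transaction per request
        except Exception:
            self.conn.execute("ROLLBACK")  # the request happens fully, or not at all
            raise
```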
At this stage, several basic DB optimizations were used, including the following:
- Prepared statements
- Another alternative would have been to use stored procedures, but with DB Server isolating the DB from Game Logic anyway, they weren’t a strict necessity, and with a strong dislike of any vendor lock-ins (going for stored procedures automatically means Absolute Vendor Lock-In), it was decided against stored procedures.
- Indexes, more indexes, and even more indexes.
- This included indexes over multiple fields, too. BTW, in the process of indexing, the myth that two single-field indexes are the same as one double-field index was experimentally proven dead wrong: for at least one request, a double-field index provided about a 100x improvement over two single-field indexes (no real surprise here, but I’ve seen people arguing against multi-field indexes with very serious faces). See the sketch after this list.
- Optimized DB physical layout and underlying hardware, with the most important items in this regard being the following:
- Separate RAID-1 disk for DB logs
- BBWC (battery-backed write cache) RAID card on the physical server box
- RAID-1/RAID-10 for data (not a 100% requirement, but tends to help too)
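Just to illustrate the first two items above, here is a tiny Python/SQLite sketch showing a parameterized (“prepared”) statement and a composite (multi-field) index; the GAME_RESULTS table and all names are made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Made-up table, just to illustrate the two points below
cur.execute("CREATE TABLE GAME_RESULTS("
            "user_id INTEGER, game_id INTEGER, score INTEGER)")

# One composite (multi-field) index is NOT the same as two single-field
# indexes: for a query filtering on both user_id AND game_id, the composite
# index below lets the engine jump straight to the matching rows.
cur.execute("CREATE INDEX idx_results_user_game "
            "ON GAME_RESULTS(user_id, game_id)")

# Parameterized query; with a client-server RDBMS, a real prepared-statement
# API additionally saves the parse/plan step on every call.
cur.execute("SELECT score FROM GAME_RESULTS WHERE user_id = ? AND game_id = ?",
            (123, 42))
print(cur.fetchall())
```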
We’ll discuss DB optimizations in more detail in Vol. 3 (tentatively Chapter [[TODO]]).
CSR
“A customer service advisor, or CSA (also customer service associate or customer service representative (CSR)), is a generic job title in the service industry.” — Wikipedia
As for “real-time reporting” for CSR (support) purposes (see [[TODO]] section above for a better definition), these reports were running from the same DB (via separate connections using the “Read Uncommitted” transaction isolation level).
Overall, DB Server was a lean and mean transaction-processing machine. And it worked pretty well – processing up to 5–10 million transactions per day. Moreover, given the simplicity, it was bulletproof-reliable and 100% consistent (such things as lost money or other assets were unheard of).
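For perspective, a back-of-the-envelope calculation: 10 million write transactions per day averages out to roughly 115 transactions per second (10,000,000 / 86,400 ≈ 116), with peak-hour rates obviously sitting well above that average.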
By the end of this stage, the game was able to handle up to 50K simultaneous players.
Stage 2. App-Level Cache of USERS table
At some point, as the game gained popularity, and as part of a general effort to optimize the DB and DB Server, an app-level cache of the USERS table (or more precisely – of the Users read from the USERS table) was introduced. This one thing was observed to have a profound effect on DB Server performance; while some other tables were cached later too, it was the USERS table which justified the additional complexity necessary to implement caching.
One Good Thing(tm) about single-write-DB-connection architectures is that, with our connection having a monopoly on modifying the DB, we can ensure that the app-level cache is coherent at all times, merely by updating it each time we’re updating our USERS table. And as there are usually only a few SQL statements updating the USERS table – it is not a problem to update the app-level cache whenever any of these SQL-statements-updating-USERS-table is invoked.
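A minimal sketch of such a cache (assuming the same single-writing-connection setup as above, with conn being that connection in autocommit mode; the USERS schema and all names are illustrative) might look as follows:

```python
class CachingDBServer:
    """Sketch of an app-level USERS cache on top of the single writing connection."""

    def __init__(self, conn):
        self.conn = conn
        self.user_cache = {}   # user_id -> {'name': ..., 'balance': ...}

    def get_user(self, user_id):
        user = self.user_cache.get(user_id)
        if user is None:
            row = self.conn.execute(
                "SELECT name, balance FROM USERS WHERE user_id = ?",
                (user_id,)).fetchone()
            if row is None:
                return None
            user = {"name": row[0], "balance": row[1]}
            self.user_cache[user_id] = user   # cache the row we just read
        return user

    def add_to_balance(self, user_id, amount):
        # One of the very few statements which modify USERS; since this is
        # the ONLY connection which ever writes to the DB, updating the
        # cache right here keeps it coherent at all times.
        self.conn.execute("BEGIN IMMEDIATE")
        self.conn.execute(
            "UPDATE USERS SET balance = balance + ? WHERE user_id = ?",
            (amount, user_id))
        self.conn.execute("COMMIT")
        user = self.user_cache.get(user_id)
        if user is not None:
            user["balance"] += amount
```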
With app-level caches added, DB worked even better, and was able to handle up to 30M-50M write transactions a day. Sure, getting there did require quite a bit of work – but on the other hand, any real-world DB handling this much requires quite a bit of work to get it running smoothly.
However, at some point it was observed that, as the CSR-oriented real-time reports became more and more complicated (and the DB became larger and larger, in part due to historical data being accumulated), some of those reports started to cause table scans. In turn, these table scans sometimes caused the DB to drop some other data from the DB caches;1 this, in turn, caused observable slowdowns of the time-critical operational processing by DB Server.
1 Not app-level caches, of course
This has eventually led to…
Stage 3. Reporting Replica
To prevent heavy CSR-oriented real-time reports from hitting DB, it was decided to separate reporting DB from the operational/OLTP DB completely. Such separation was intended to provide several all-important benefits (and turned out to provide these benefits too 🙂 ):
- Once and for all resolve the issue of reports affecting operational performance
- Allow for more heavy reports to be written (and to speed up report development too)
- Ability to reduce the size of historical data in operational DB, by moving “old” historical data to the replica(s) (and removing it from the operational DB)
- Ability to run a “newer and greater” version of the RDBMS on replicas, to see how often it tends to fail (and to find out when it becomes good enough to move into production) ;-(
To achieve these goals, the replica should be an asynchronous replica, the one with eventual consistency. It had been expected (and was confirmed later in practice) that delays of up to 1–2 minutes are completely unnoticeable for CSR-oriented real-time reporting purposes; so, as long as the delays stayed within these bounds – they were not causing any problems.
Unfortunately, at that time, the built-in DB replication of the RDBMS in use wasn’t able to cope with the required kind of DB load (failing after only a few days of operation with some obscure error; moreover, each such failure required a complete re-population of the replica 🙁 ). As a result, it was decided to implement DIY replication (I really hope that DIY replication is no longer needed these days, but at that time, it was the best solution available; still, we will discuss DIY replication in Vol. 3, tentatively Chapter [[TODO]]). Fortunately, with a single-write-DB-connection, DIY replication is perfectly viable :-).
In a sense, read-only replicas can be seen as incarnations of the CQRS (=“Command Query Responsibility Segregation”) pattern (see, for example, [MSDN.CQRS] and [Richardson.CQRS] for discussion of CQRS). This is especially true for the DIY replication we’re going to describe in Vol. 3 (which effectively follows an Event Sourcing pattern for the replica DB); on the other hand – viability of the replica doesn’t depend much on Event Sourcing being used to implement it, and any asynchronous replica will do.
Coming back to the real-world case at hand. After implementing the read-only replica, all the goals stated above were achieved. In practice, however, there were two instances of reporting replicas running in parallel most of the time; if one was severely delayed (usually by an ultra-heavy report – and such reports, of course, mushroomed as soon as replicas were allowed to be much more liberal with regards to DB load), it was possible to switch to the second one easily. Another reason for having two replicas was that the RDBMS was observed to crash MUCH more frequently on reporting DBs than on the OLTP one; the reasons behind this disparity were two-fold: (a) the inherently more complicated nature of RDBMS work on the reporting DB; (b) using the latest-and-greatest version of the RDBMS on replicas (and the latest-and-greatest stuff tends to have MANY more bugs than the very same DBMS with SP1 or with FixPack 3).
With reporting replica, the system was able to handle about the same 30M-50M transactions per day – but without any risk of occasional slowdowns due to heavy reports. At this point, the game was handling hundreds of thousands of simultaneous players – and all the DB transactions were still handled by one single DB writing connection.
However, as the game continued to grow, the system came to the point when the capabilities of the single writing connection were finally exhausted.
[[TODO: Stage 3a. Truncating OLTP history]]
Stage 4 (Kinda Failed). Trying Multi-Write-Connection
At this point, an attempt was made to make a (limited) switch to the multi-connection architecture. However, significant problems were observed along the way:
- It was difficult to provide any guarantees for data consistency; as a result, achieving concurrency safety – while possible – took LOTS of effort.
- Granted, some of these difficulties might be attributed to the team having a long-standing culture of inherently concurrency-safe single-write connections. Still, it is obvious that the complexity of concurrency-safe development in the presence of multiple modifying connections is MUCH higher than that of inherently concurrency-safe single-write connections.
- Much more importantly, observed results were VERY far from linear scalability 🙁 . In other words – when the number of connections/cores was doubled, the improvement was anywhere between 1x (i.e. no improvement at all) and 1.7x (which, while being MUCH better, was still an indication of severe contention somewhere within the DB).
- Moreover, while some efforts to speed the multi-connection system up were successful, all the things which helped to achieve improvements in this regard were highly irregular; in other words – you never knew whether any specific attempt to improve performance in a multi-write-connection environment would do any better 🙁 . It was a striking contrast with the single-writing-connection approach, where the effects of any change were very well predictable in advance.
I am not saying that it wasn’t possible to make a workable system out of the multi-connection approach. However, the symptoms of significantly-worse-than-linear behaviour were very annoying, and the risks of working without any idea of what could be achieved by using the multi-connection approach were deemed unacceptable.
Instead, it was decided to take a completely different route – the one of Share-Nothing DBs. One of the big advantages of Share-Nothing architectures is that there are no hidden contentions within DB; therefore, behaviour of the system (both when load grows and in response to any changes) remains highly predictable. In particular, if we scale a Share-Nothing system (after we’ve already accounted for the costs of all necessary separations to make it Share-Nothing) – it will exhibit perfect linear scalability: doubling the number of servers exactly doubles the number of transactions we can process.
Stage 5. Separating Game Worlds and Tournaments
Pre-requisite: Inter-DB Transfers
As a prerequisite to separating the DB into several Share-Nothing DBs, there was a need to develop a way to transfer arbitrary data between DBs – and with data consistency guarantees too.
One example where such transfers are necessary arises whenever a player wants to start playing within a certain Game World (taking some of his inventory with him). In this case, the data about the inventory taken needs to be moved to a Game World DB (and in a perfectly consistent manner too – it wasn’t acceptable to have any losses/gains on the way).
Asynchronous Inter-DB Transfer
This was implemented via an Asynchronous Inter-DB Transfer protocol described in Chapter III. The protocol allows taking some asset from some source table in DB A (manipulated by DB Server A), and transferring this asset to some target table in DB B (manipulated by DB Server B). It might seem trivial, but the problem here is how to provide guarantees against the disappearance of the asset or the creation of its extra copies – and without using cross-DB transactions (as cross-DB transactions would mean that the architecture is not Share-Nothing anymore). To reiterate the Asynchronous Inter-DB Transfer protocol briefly (a minimal code sketch follows the list below):
- DB Server A makes an ACID transaction over DB A, with transaction consisting of:
- Taking out an asset from an appropriate table
- Making a record in “outgoing messages” table, with the record specifying the nature of transfer and Outgoing-Message-ID
- There is a communication protocol which guarantees that each of “outgoing messages” is delivered from DB Server A to DB Server B at least once.
- DB Server B receives the “outgoing message”, and:
- Checks in DB B that message with Outgoing-Message-ID was NOT processed yet; if it wasn’t, DB Server B makes an ACID transaction over DB B, with the transaction consisting of:
- Writing that Outgoing-Message-ID has already been processed
- Putting the asset into appropriate table
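A minimal Python/SQLite sketch of the two transactions involved (assuming autocommit-mode connections with explicitly managed transactions; the ASSETS, OUTGOING_MESSAGES and PROCESSED_MESSAGES tables, as well as the delivery mechanism, are purely illustrative) might look as follows:

```python
import uuid

def take_out_asset(conn_a, asset_id, target_db):
    """DB Server A side: one ACID transaction over DB A which removes the
    asset and records an 'outgoing message' describing the transfer."""
    msg_id = str(uuid.uuid4())
    conn_a.execute("BEGIN IMMEDIATE")
    conn_a.execute("DELETE FROM ASSETS WHERE asset_id = ?", (asset_id,))
    conn_a.execute(
        "INSERT INTO OUTGOING_MESSAGES(msg_id, target_db, asset_id) "
        "VALUES(?, ?, ?)",
        (msg_id, target_db, asset_id))
    conn_a.execute("COMMIT")
    # a separate delivery mechanism (not shown) now guarantees that this
    # message reaches DB Server B AT LEAST once
    return msg_id

def put_in_asset(conn_b, msg_id, asset_id):
    """DB Server B side: one ACID transaction over DB B; a duplicate
    delivery of the same msg_id is detected and silently ignored, turning
    'at least once' delivery into 'exactly once' processing."""
    conn_b.execute("BEGIN IMMEDIATE")
    already = conn_b.execute(
        "SELECT 1 FROM PROCESSED_MESSAGES WHERE msg_id = ?",
        (msg_id,)).fetchone()
    if already is None:
        conn_b.execute("INSERT INTO PROCESSED_MESSAGES(msg_id) VALUES(?)",
                       (msg_id,))
        conn_b.execute("INSERT INTO ASSETS(asset_id) VALUES(?)", (asset_id,))
    conn_b.execute("COMMIT")
```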
This protocol can be seen as conceptually similar to Event-Driven Architecture as described in [Fowler], [Richardson.DatabasePerService], and [Richardson.EventDrivenArchitecture] with [Richardson.ApplicationEvents]. More importantly, the protocol provides a strict guarantee that eventually – whatever happens (including crashes of DB Servers and DBs, but excluding DB corruption during crashes) – the assets will be delivered to DB B – and exactly once. For a more detailed description of the protocol – please refer to Chapter III.
Alternative: Two-Phase Commit
In theory, it might have been possible to use two-phase commit instead of Async Inter-DB Transfers. Two-phase commit (for example, via the standard XA protocol) would provide BETTER guarantees than the ones provided by the Async Inter-DB Transfer protocol described above (two-phase commit guarantees ACID, while async exchanges can only guarantee BASE). On the other hand, these better guarantees don’t come for free:
- With two-phase commits (and regardless of single- or multiple-DB connections), we will inevitably need to issue some kind of locks in one DB while waiting for another one. As long as we have only a very few of these two-phase commits/locks, the system will continue to operate just fine; however, as the number of distributed locks grows, we can easily run into “distributed bottlenecks” and/or other issues which can adversely affect scalability (and dreadfully so) 🙁 .
- Implementing two-phase commits with single-connection DBs – and without blocking the whole thing until the two-phase transaction is committed on both sides, is tricky (though possible – feel free to ask me if you’re interested ;-)).
Overall – feel free to try two-phase commit instead of Async Inter-DB Transfer, but don’t blame me if it doesn’t really scale well ;-(. On the other hand, if you’re mostly working with async transfers, having only a few (and rarely occurring) two-phase commits – it MIGHT work for you.
Actual Separation
As soon as we have this Asynchronous Inter-DB Transfer protocol in place, further separation is simple. Just take all the Game Worlds (which are almost universally very loosely coupled to anything else), and move them into a separate DB; whenever a player needs to go into the Game World – use the Asynchronous Inter-DB Transfer protocol to transfer the relevant player’s assets there. Whenever there is a need to go back – use Asynchronous Transfer to transfer the player’s assets back to the original table within the ‘main’ DB.
And as soon as the Game Worlds sit in a separate DB, it becomes quite easy to spread them over several separate DBs. Moreover, exactly the same process works for Tournaments, too. It means that at this point, the system becomes perfectly scalable with regards to Game Worlds and Tournaments.
Let’s also note that at this stage, the system became very close to DB-per-Service pattern as described in [Fowler] and [Richardson.DatabasePerService] (with Event-Driven Architecture [Richardson.EventDrivenArchitecture] and Application Publishing Events [Richardson.ApplicationEvents] on the side): we have several DBs, each providing one specific service (serving Game World, Tournament, or Player-related requests) – and exchanging application-generated Events (represented by ”outgoing messages” of our Async Inter-DB Transfer Protocol) between them.
Stage 6. Scaling USERS – and achieving Perfect Scalability
By the time all the Game Worlds and Tournaments were separated, the “main” DB only contained records related to players (in the case of the real-world scenario under discussion, it was the USERS table). Moreover, with the separation implemented, all of the inter-player interactions had been taken out of the “main” DB.
And as soon as “main” DB no longer processed any inter-player interactions, it became easily shardable on per-player basis. In other words, it became possible to split “main DB” into several DBs, each having its own USERS table, and hosting different bunches of USERS (plus rows from tables associated with USERS).
In fact, this represents the well-known “horizontal partitioning” technique which is widely used among DB developers across many industries. In the real-world case we’re discussing now, horizontal partitioning was implemented on top of single-write-connection DBs. USERS were split (as is customary for “horizontal partitioning”) using hashes of userIDs.
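As a sketch, routing a request to the right “main” DB shard might look along the following lines (the shard count and the hash choice are purely illustrative; any stable hash will do, as long as everybody uses the same one):

```python
import hashlib

NUM_USER_SHARDS = 4   # illustrative; each shard is its own single-write-connection DB

def user_shard(user_id):
    """Map a userID onto one of the 'main' DB shards holding its USERS row."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_USER_SHARDS

# All requests concerning a given player go to db_servers[user_shard(user_id)];
# since inter-player interactions were already taken out of the 'main' DB,
# no request ever needs to touch two USER shards at once.
```

Note that changing NUM_USER_SHARDS implies rehashing (moving rows between shards) – which is exactly the “tricky and time-consuming, but perfectly doable” part mentioned below.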
And after this split of USERS into several partitions, the system has obtained perfectly linear scalability. If USERS become a bottleneck, another server can be added relatively easily (although rehashing the whole DB can be tricky and time-consuming, it is perfectly doable).
And, if the Game World and/or Tournaments become a bottleneck – introducing a new server is even simpler: usually there is no need to rehash, and all the necessary changes can be made by adjusting matchmaking process fairly easily.
Observed Caveats
Of course, each solution aiming for real-world scalability has its own set of caveats. The DB-per-Service approach (and single-write-connection DBs) is no exception.
[[TODO: sword transfer within Game World without Game World DB: including specific target slot (with inter-message DB having info about the target slot); most importantly – there MUST be only one SINGLE request to DB for transfer (NOT two separate requests, as it can violate BASE!)]]
Requests Spanning Multiple DBs
The first caveat with the architecture discussed above concerns requests which span multiple DBs. However, as soon as we realize that in most (if not all) of the cases we’ll need ONLY read-only requests to span multiple DBs, it becomes clear that there are at least two solutions for these:
- Traditional multi-DB read-only requests, treating our separate DBs as one federated DB (though ONLY for the purposes of reading, not writing(!)). While federated DBs usually use two-phase commits (and as noted above, I don’t feel that two-phase commit is a good thing for scalability), in the case of read-only requests this does not really apply; if we’re issuing read-only requests only over Read Uncommitted connections, I expect it to be ok. NB: I didn’t try this on a heavily loaded DB myself, so take it with a grain of salt.
- Replication from multiple operational DBs into one single reporting one. Note that in this case, there can be no collisions of the data coming to the replica from different DBs (by design), so it is not really a master-master replica, but rather a special case of “merge” replication without any conflicts. This approach does work for sure (and as a charm too), at least with the DIY replication I’ve mentioned above. If you don’t want to go DIY for replicas, then whether your RDBMS supports this kind of “merge” replica is up to you to figure out ;-).
Consistent Distributed Backups
In any Share-Nothing DB system with DBs being completely independent, achieving a backup which is consistent across all the DBs involved becomes a headache 🙁 . Still, at least with single-write-DB-connections, such backups are perfectly doable, and we’ll discuss them in Vol. 3 (tentatively Chapter [[TODO]]). For now, let’s just note that in the most general case, such “distributed backups” require “online backup” functionality from the underlying DB – and with “roll-forward-to-point-in-time” functionality too; fortunately, most of the serious DBs out there have supported these commonly expected features since time immemorial :-). One caveat of such distributed backups (at least the way we’ll discuss them in Chapter [[TODO]]) is that at some point, they require stopping the processing in all of the DBs for some time (normally – dozens of milliseconds); fortunately, for DB Servers, this kind of delay is rarely a problem.
Generalization to Other Games
By now, we’ve discussed one successful example of single-write-DB-connection architectures, scaled to hundreds of millions of transactions per day (and probably beyond too; however, you can never be sure about “beyond” until you try).
The next question we have is the following: am I sure that it is not a one-off case which is especially suitable for this kind of processing? IMO (and this opinion is based on some experience too), it is not. In other words:
My rather-educated guess is that this approach will work for most of the games out there.
I’ve honestly thought about all the systems and business/game logics I know about – and asked fellow game developers too, and didn’t find any game where the model above wouldn’t work. Of course, this doesn’t mean too much (and of course, your own case can be very different); but well, this is the best possible answer I can provide without taking a look at your specific game :-) .
One potential concern MAY apply to stock exchanges (those guys may have extremely convoluted rules); still, the stock exchange which I was working with was perfectly suitable for this model (with each of the securities traded considered a separate Game World; it will work as long as there are no synchronous interactions between different securities traded).2
2 While the stock exchange that I’ve co-architected didn’t really need partitioning at that point (working perfectly fine over a single DB connection), after thinking about its logic, I’m quite sure that it could be scaled along the lines above.
Batch Processing
One special case arises when your game (usually bank or lottery) needs some kind of batch processing. Most of the time, batch processing has the following properties:
- It makes LOTS of modifications.
- It should be done “as if” it is done over a snapshot in time.
- Also, as a rule of thumb, it is easy to separate batch processing into bunches of statements which cannot interact by design. When we need to calculate something over a million rows – usually the calculations are rather similar for all the rows involved, and usually operate more or less on a per-row basis too.
With this in mind, it is often possible to play the following game (pun intended):
- Stop all the processing.
- If there are any new incoming requests, they can be delayed (for example, recorded in memory or in a separate DB), to be applied after the batch is processed.
- Run batch processing.
- It can be run in parallel – simply separating statements into non-interacting groups, and running these non-interacting groups over different DB connections (see the sketch after this list). While it still requires dealing with concurrency, this is MUCH easier (IMO, orders of magnitude easier) than having to ensure that arbitrary statements cannot interact with any other statement.
- Process all the delayed requests.
- Bingo! We’ve processed the whole batch – in minimal time, and with very limited efforts to ensure concurrency safety.
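A minimal sketch of such a parallel batch run (with the USERS table, the “bonus” calculation, and the grouping-by-modulo all being illustrative assumptions) might look as follows:

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

def process_group(db_path, user_ids, bonus):
    """One batch worker: its own connection, its own bunch of rows; the
    groups are built so that they cannot touch the same row by design."""
    conn = sqlite3.connect(db_path, timeout=60, isolation_level=None)
    conn.execute("BEGIN IMMEDIATE")
    for uid in user_ids:
        conn.execute("UPDATE USERS SET balance = balance + ? WHERE user_id = ?",
                     (bonus, uid))
    conn.execute("COMMIT")
    conn.close()

def run_batch(db_path, all_user_ids, bonus, n_workers=4):
    # Split the rows into non-interacting groups (each user lands in exactly
    # one group). With a server-grade RDBMS the groups run genuinely in
    # parallel; SQLite (used here only to keep the sketch self-contained)
    # serializes writers anyway.
    groups = [all_user_ids[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(process_group, db_path, group, bonus)
                   for group in groups]
        for f in futures:
            f.result()   # propagate any errors from the workers
```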
Summary/My Personal Advice
When developing a DB for your game, I usually suggest the following:
- As discussed in several places above, DO separate your DB Server from the rest of your game (in particular, from Game Servers) via a very well-defined DB Server API
- This API MUST NOT contain any SQL; instead, it MUST be expressed in terms of your Game Logic (such as “transfer artifact A from player P1 to player P2”).
- EACH request within the DB Server API MUST correspond to exactly one ACID transaction. In other words, there MUST NOT be any situations when a transaction is left incomplete after DB Server is done with the request.
- After doing so, DB Server becomes an implementation detail (!). It means that it can be implemented in whatever-way-you-want. On the other hand, as one (and my own preferred) way to skin this cat, I suggest considering the following development path (with each additional step implemented only when you start feeling that you might have problems with performance without it):
- Start with a simplistic single-write-connection DB, with reporting running off the same DB (reporting using the Read Uncommitted isolation level)
- Optimize indexes and DB physics (physical layout)
- Add app-level caching (most importantly, of USERS/PLAYERS table)
- NB: at this point, our system is high-performing but doesn’t scale (yet)
- Add replica DB(s) for reporting
- NB: at this point, we’ve got perfect read scalability (at least for reporting purposes)
- Implement inter-DB asynchronous transfers
- Separate Game Worlds (and if applicable, Tournaments) based on Inter-DB Async Transfers protocol
- Optionally, separate Payments DB (again, using Async Inter-DB Transfers)
- Horizontally partition USERS/PLAYERS table
- Enjoy your perfectly linear scalability! :-) (both read and write)
- That being said, there are other ways to implement scalable systems, BUT they tend to require DB expertise with real-world highly-loaded systems, and you’re not too likely to find people with such expertise :-( .
How exactly these steps should be implemented is a subject for several separate (and rather lengthy) discussions, and we’ll describe some ways of doing it in Vol. 3 (tentatively Chapter [[TODO]]). What’s important for the time being is that you can start with a single-write-DB-connection architecture – adding scalability later as the need arises.
[[To Be Continued…
This concludes beta Chapter 20(e) from the upcoming book “Development and Deployment of Multiplayer Online Games (from social games to MMOFPS, with social games in between)”. Stay tuned for beta Chapter 20(f), where we’ll discuss the choice of the RDBMS for your transactional/operational DB.]]

