How Database Companies Keep Their Data Straight
As developers grapple with bigger issues, they need to store their data in more complex ways, adding a constellation of computers to house it all.
But adding additional computer hardware can be confusing when different parts of the network need to be accessible for a particular query, especially when fast data requests are so common. Each database update must be distributed to all computers, sometimes scattered across different data centers, before the update is complete.
Complex data requires complex solutions
Developers like to have a “single source of truth” when building applications: one authoritative store of essential information that can report the most recent value of every record at all times.
Providing this consistency with a single computer running a database is simple. When there are multiple machines running in parallel, defining a single version of the truth can get complicated. If two or more changes arrive at different machines in short succession, there is no easy way for the database to decide which came first. When computers do their job in milliseconds, the order of these changes can be ambiguous, forcing the database to choose who gets the plane seat or the concert tickets.
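The ambiguity is easiest to see as a “lost update.” The toy sketch below (plain Python, not a real database; the flight name is made up) shows two clients that each read the last remaining seat before either write lands, so one sale silently overwrites the other:

```python
# Toy illustration of a lost update: two clients both read the last
# remaining seat, then both write, and one booking silently vanishes.
seats = {"flight_101": 1}         # hypothetical flight with one seat left

read_a = seats["flight_101"]      # client A checks: sees 1 seat
read_b = seats["flight_101"]      # client B checks before A writes: also sees 1
seats["flight_101"] = read_a - 1  # A books the seat
seats["flight_101"] = read_b - 1  # B books "the same" seat, overwriting A

print(read_a, read_b)             # both clients believed a seat was free
print(seats["flight_101"])        # 0 seats left, yet two tickets were issued
```

With a single machine, a lock around the read-then-write closes this window; with replicas on different continents, deciding whose write “came first” is the whole problem.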
The problem only increases with the size of the tasks assigned to a database. More and more tasks require large databases that span multiple machines. These machines can be located in different data centers around the world to improve response time and add remote redundancy. But the additional communication time required greatly increases the complexity when database updates arrive in close succession on different machines.
And the problem cannot be solved by simply outsourcing everything to a high-end cloud provider. Database services offered by giants like Amazon AWS, Google Cloud, and Microsoft Azure all have consistency limits, and each offers several consistency variants to choose from.
Certainly, some jobs are not affected by this problem. Many apps just ask databases to keep track of slowly changing and unchanging values, like, say, your monthly utility bill amount or the winner of last season’s soccer games. The information is written only once, and all subsequent requests will get the same response.
Other tasks, such as keeping track of the number of open seats on an aircraft, can be very tricky. If two people try to buy the last seat on the plane, they may both be told that a seat is still available. The database must take additional steps to ensure that the seat is only sold once. (The airline can always choose to overbook a flight, but that’s a business decision, not a database error.)
Databases maintain consistency in the face of concurrent changes by grouping related changes into single packages called “transactions.” If four people traveling together want seats on the same flight, for example, the database can keep the set together and only process the changes if four empty seats are available.
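The all-or-nothing booking can be sketched with SQLite, whose connections wrap statements in transactions. The schema, table, and flight ID here are invented for illustration; the key idea is the conditional `WHERE` clause, which makes the decrement succeed only when enough seats remain:

```python
import sqlite3

# Hypothetical schema: one row per flight holding a seat counter.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (id TEXT PRIMARY KEY, seats INTEGER)")
conn.execute("INSERT INTO flights VALUES ('flight_101', 5)")
conn.commit()

def book_group(conn, flight_id, party_size):
    """Atomically book `party_size` seats, or none at all."""
    with conn:  # opens a transaction; commits on success, rolls back on error
        # The WHERE clause makes the update conditional: it applies
        # only if at least `party_size` seats remain.
        cur = conn.execute(
            "UPDATE flights SET seats = seats - ? "
            "WHERE id = ? AND seats >= ?",
            (party_size, flight_id, party_size),
        )
        return cur.rowcount == 1  # True if the booking happened

print(book_group(conn, "flight_101", 4))  # four of five seats: booked
print(book_group(conn, "flight_101", 4))  # only one seat left: refused
```

Either the whole party gets seats or nobody does; there is no state where two of the four travelers are booked.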
In many cases, database creators have to decide whether they want to trade consistency for speed. Is strong consistency worth slowing down updates until they reach every corner of the database? Or is it better to go ahead because there is little chance that an inconsistency will cause a significant problem? After all, is it really that tragic if someone who buys a ticket five milliseconds later than someone else actually gets the ticket? You could say that no one will notice.
The problem only occurs for the time it takes for new versions of the data to propagate across the network. The databases will converge on a correct and consistent answer, so why not give it a shot if the stakes are low?
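One common way replicas converge is “last writer wins”: each copy carries a version stamp, and when replicas gossip, the newest version survives. A minimal sketch (the values and version numbers are invented):

```python
# Toy convergence: replicas exchange (value, version) pairs and keep
# the newest, so all copies eventually agree ("last writer wins").
replicas = [("blue", 1), ("green", 2), ("blue", 1)]  # one stale write won

def merge(a, b):
    # Keep whichever pair carries the higher version number.
    return a if a[1] >= b[1] else b

def gossip_round(states):
    # Every replica learns the newest value seen anywhere.
    newest = states[0]
    for s in states[1:]:
        newest = merge(newest, s)
    return [newest] * len(states)

replicas = gossip_round(replicas)
print(replicas)  # every replica now holds the version-2 value
```

Until that gossip round completes, a reader can land on a stale replica; afterward, every copy answers the same way.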
There are now several “eventually consistent” models supported by different databases. The dilemma of how best to approach the problem has been studied extensively over the years. Computer scientists like to talk about the CAP theorem, which describes the trade-off between consistency, availability, and partition tolerance. It is usually relatively easy to choose two of the three, but difficult to get all three in one working system.
Why is eventual consistency important?
The idea of eventual consistency has evolved as a way to soften expectations of accuracy at the moment when accuracy is hardest to deliver: right after new information has been written to one node but has not yet propagated to the constellation of machines responsible for storing the data. Database developers often try to be more specific in stating the different versions of consistency they are able to offer. Amazon CTO Werner Vogels described five different versions that Amazon considered when designing some of the databases that power Amazon Web Services (AWS). The list includes versions such as “session consistency,” which promises consistency, but only in the context of a particular session.
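Session consistency’s core promise, “read your own writes,” can be modeled in a few lines. This is a toy model of the concept, not any AWS API; the class names and the single lagging replica are invented for illustration:

```python
# Toy model of session consistency: a session always sees its own
# writes, even when the replica it reads from has not caught up yet.
class SessionConsistentStore:
    def __init__(self):
        self.replica = {}    # lagging replica (updates arrive late)
        self._pending = []   # writes not yet applied to the replica

    def session(self):
        return Session(self)

    def replicate_one(self):
        # Apply one pending write, simulating slow propagation.
        if self._pending:
            key, value = self._pending.pop(0)
            self.replica[key] = value

class Session:
    def __init__(self, store):
        self.store = store
        self.own_writes = {}  # the "read your own writes" cache

    def write(self, key, value):
        self.own_writes[key] = value
        self.store._pending.append((key, value))

    def read(self, key):
        # Prefer this session's own writes; fall back to the lagging replica.
        if key in self.own_writes:
            return self.own_writes[key]
        return self.store.replica.get(key)

store = SessionConsistentStore()
s1 = store.session()
s1.write("cart", ["book"])
print(s1.read("cart"))   # the writing session sees its own write at once
s2 = store.session()
print(s2.read("cart"))   # another session may still see nothing
store.replicate_one()
print(s2.read("cart"))   # after propagation, every session converges
```

The guarantee is scoped: the writer never sees its own update disappear, but other sessions get only eventual consistency.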
The notion is closely related to NoSQL databases because many of these products started out by promising only eventual consistency. Over the years, database designers have studied the problem in more detail and developed better models to describe the tradeoffs more accurately. The idea still bothers some DBAs, the belt-and-suspenders type, but users who don’t need perfect answers like the speed.
How are legacy players approaching this?
Traditional database companies like Oracle and IBM remain committed to strong consistency, and their major database products continue to support it. Some developers use very large computers with terabytes of RAM to run a single database that maintains a single, consistent record. For bank and warehouse inventory tasks, this can be the easiest way to grow.
Oracle also supports clustered databases, including MySQL, and these can provide eventual consistency for jobs that prize scale and speed over perfection.
Microsoft’s Cosmos database offers five levels of assurance, ranging from strong consistency to eventual consistency. Developers can trade speed for precision depending on the application.
What are the upstarts doing?
Many emerging NoSQL database services explicitly adopt eventual consistency to simplify development and increase speed. These startups may have started out offering the simpler consistency model, but lately they’ve given developers more options to trade raw speed for better accuracy when needed.
Cassandra, one of the first NoSQL database offerings, now offers nine options for write consistency and 10 options for read consistency. Developers can trade speed for consistency depending on application requirements.
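The arithmetic behind such tunable options is a well-known rule of thumb: with N replicas, a write acknowledged by W of them and a read that consults R of them must overlap on at least one up-to-date replica whenever R + W > N. The function name and numbers below are illustrative, not the Cassandra API:

```python
# Quorum rule of thumb behind tunable consistency: a read is guaranteed
# to touch at least one replica holding the latest acknowledged write
# when read_acks + write_acks > n_replicas.
def read_sees_latest_write(n_replicas, write_acks, read_acks):
    return read_acks + write_acks > n_replicas

N = 3
print(read_sees_latest_write(N, write_acks=2, read_acks=2))  # quorum/quorum: True
print(read_sees_latest_write(N, write_acks=1, read_acks=1))  # one/one: False
```

Writing and reading at quorum (2 of 3) costs extra round trips but closes the stale-read window; acknowledging just one replica on each side is faster but leaves it open.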
Couchbase, for example, offers what the company calls “tunable” consistency that can vary from query to query. MongoDB can be configured to provide eventual consistency for read-only replicas for speed, but it can also be configured with a variety of options that provide more robust consistency. PlanetScale offers a model that balances consistent replication with speed, arguing that banks aren’t the only ones struggling with inconsistencies.
Some companies are building new protocols that come close to strong consistency. For example, Google’s Spanner relies on a very precise set of clocks to synchronize versions running in different data centers. The database is able to use these timestamps to determine which new data block arrived first. FaunaDB, on the other hand, uses a version of a protocol that does not rely on very precise clocks. Instead, the company creates synthetic timestamps that can help decide which version of competing values to keep.
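One widely cited technique in this family of synthetic timestamps is the hybrid logical clock, which pairs a wall-clock reading with a logical counter so events can be ordered even when physical clocks disagree slightly. The sketch below is a simplified illustration of that general idea, not any vendor’s actual protocol; the wall-clock value is injected as a parameter to keep the example deterministic:

```python
# Simplified hybrid-logical-clock-style timestamp: (physical, logical)
# pairs that always move forward, even if the local wall clock stalls
# or a remote node's clock runs ahead.
class HybridClock:
    def __init__(self, now):
        self.physical = now  # latest wall-clock reading seen
        self.logical = 0     # tie-breaker within one wall-clock tick

    def tick(self, now):
        """Generate a timestamp for a local event."""
        if now > self.physical:
            self.physical, self.logical = now, 0
        else:
            self.logical += 1  # clock didn't advance: bump the counter
        return (self.physical, self.logical)

    def observe(self, now, remote):
        """Merge a timestamp received from another node."""
        # Take the largest (physical, logical) pair seen anywhere,
        # then step past it so this event orders after the remote one.
        merged = max((now, 0), (self.physical, self.logical), remote)
        self.physical, self.logical = merged[0], merged[1] + 1
        return (self.physical, self.logical)

clock = HybridClock(now=100)
print(clock.tick(now=100))   # same wall time twice: counter breaks the tie
print(clock.tick(now=101))   # wall time advanced: counter resets
print(clock.observe(now=101, remote=(105, 3)))  # remote clock ahead: jump past it
```

Because timestamps compare as plain tuples, any two replicas can agree on which of two competing writes “came first” without their physical clocks ever being perfectly synchronized.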
Yugabyte chose consistency and partition tolerance from the CAP theorem and gave up some availability: some read requests will be suspended until the database reaches a consistent state. CockroachDB uses a model that it says offers a serializable view of the data, but not always a linearizable one.
The limits of eventual consistency
For critical tasks, like those involving money, users are not prepared to accept inconsistent responses. Eventually consistent models may be acceptable for many data collection jobs, but they are not suitable for tasks that require a high degree of confidence. When businesses can afford large machines with plenty of RAM, strongly consistent databases remain the right choice for anyone managing scarce resources.