[MDB] Modern data systems: part 1

Feb 08, 2021

A relational database organizes data into tables of rows and columns and links those tables through shared identifiers (keys). This architecture gained traction in the 80s and 90s, when applications were monoliths running on single servers and the data structure backing them – the tables, fields, and data type of each field (integer, string) – could be specified in advance. Their rigid structure and adherence to ACID properties[1] make relational databases ideal for things like accounting or order management systems, where the structure of the data store doesn’t change, you know in advance how data is related, and state consistency is paramount (you wouldn’t want one account to be debited without another being credited at the same time, or for 3 shoppers to purchase the 1 remaining inventory unit). It also helped that relational DBs store normalized, de-duplicated data, since storage was expensive back then.
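The debit-and-credit example can be sketched with Python's built-in `sqlite3` module. This is only an illustration of atomicity under assumed names: the `accounts` table, the starting balances, and the `make_ledger`, `transfer` and `balances` helpers are all hypothetical.

```python
import sqlite3


def make_ledger():
    """Create a hypothetical two-account ledger in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "id INTEGER PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"  # no overdrafts allowed
    )
    conn.executemany(
        "INSERT INTO accounts (id, balance) VALUES (?, ?)", [(1, 100), (2, 50)]
    )
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Credit dst and debit src as one atomic unit; return True on success."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The debit violated the CHECK constraint, so the credit is undone too.
        return False


def balances(conn):
    """Return {account_id: balance} for every account."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))
```

A transfer of 30 from account 1 to account 2 leaves balances of 70 and 80. A transfer of 500 fails on the debit, and the credit that was already applied is rolled back with it, so neither account changes.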

But the cloud, even as it has reduced the barriers to making software, has also created more underlying complexity. Monolithic applications are being disaggregated into microservices with independent data models run across containers distributed across clusters, fueling demand for new enabling technologies, including NoSQL databases, which are often better suited to the needs of modern web apps than relational databases.

In requiring that data be normalized[2] and segregated across tables – one table might contain a user’s contact information, another purchase orders, still another the product remaining in inventory, etc. – relational architectures are tough to partition and scale out horizontally across multiple nodes (a single query may need to join tables that now sit on different machines, and what if one user wants to read a record at the same time another user wants to write to it?). Instead, to accommodate more data, relational databases had to “scale up” by adding more processing power, storage, and memory to a single server.
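To make the normalization point concrete, here is a rough sketch using Python's built-in `sqlite3` module. The `users` and `orders` tables, their columns, and the sample rows are invented for illustration.

```python
import sqlite3


def build_store():
    """Create two hypothetical normalized tables linked by user_id."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE users (
            user_id INTEGER PRIMARY KEY,
            email   TEXT NOT NULL
        );
        CREATE TABLE orders (
            order_id INTEGER PRIMARY KEY,
            user_id  INTEGER NOT NULL REFERENCES users(user_id),
            item     TEXT NOT NULL
        );
        INSERT INTO users  VALUES (1, 'ada@example.com');
        INSERT INTO orders VALUES (10, 1, 'shoes'), (11, 1, 'socks');
        """
    )
    return conn


def orders_with_email(conn):
    """Reassemble each order with its buyer's email by joining on the key."""
    return conn.execute(
        "SELECT o.order_id, o.item, u.email "
        "FROM orders AS o JOIN users AS u ON u.user_id = o.user_id "
        "ORDER BY o.order_id"
    ).fetchall()
```

The email address is stored once, which saves space, but answering "who bought what" requires a join. On one server that is cheap. Once `users` and `orders` live on different nodes, the same join has to cross the network.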

This is fine for many use cases but breaks down when applied to modern web apps, where a developer might prefer that the same data be duplicated across multiple nodes to ensure availability, even if that means trading off consistency. For example, an update to one’s job status on LinkedIn or a comment on Facebook doesn’t require strict ACID compliance – it’s okay if you see the update or comment in your newsfeed before everyone in your network does – so long as all nodes eventually reflect the change (i.e., so long as the system is “eventually consistent”). It’s far more important that systems are available to all users at all times than that every node shows the same data at every moment.
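A toy model of eventual consistency, in plain Python, might look like the following. It is a deliberately simplified sketch: the `Cluster` class, the node names, and the `sync` step standing in for background replication are all assumptions, and it ignores conflict resolution and failures.

```python
from collections import deque


class Cluster:
    """Toy set of replicas that each hold their own copy of the data."""

    def __init__(self, node_names):
        self.replicas = {name: {} for name in node_names}
        self.pending = deque()  # updates not yet delivered to other replicas

    def write(self, node, key, value):
        """Apply the write on one node now; queue it for the others."""
        self.replicas[node][key] = value
        for other in self.replicas:
            if other != node:
                self.pending.append((other, key, value))

    def read(self, node, key):
        """Read whatever this node currently has, stale or not."""
        return self.replicas[node].get(key)

    def sync(self):
        """Deliver all queued updates (stand-in for background replication)."""
        while self.pending:
            node, key, value = self.pending.popleft()
            self.replicas[node][key] = value


if __name__ == "__main__":
    cluster = Cluster(["us", "eu"])
    cluster.write("us", "job_status", "Engineer")
    print(cluster.read("us", "job_status"))  # Engineer
    print(cluster.read("eu", "job_status"))  # None (not replicated yet)
    cluster.sync()
    print(cluster.read("eu", "job_status"))  # Engineer
```

Both nodes stay available for reads and writes throughout. The cost is the window between `write` and `sync`, during which the "eu" node serves stale data.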

With storage now abundant and cheap, the inflexible structure of a relational set-up aggravates a new bottleneck, developer efficiency, at a time when every company is using more software to compete effectively. MongoDB’s founders – Dwight Merriman, Eliot Horowitz and Kevin Ryan – experienced firsthand the performance constraints of relational DBs at DoubleClick, where they worked prior to Mongo (Merriman was a co-founder). To scale DoubleClick, an ad network, eventually sold to Google, that crawled the internet and served billions of ads per day, they were forced to build custom databases that didn’t shoehorn data into a tabular format and could query documents natively. The NoSQL (Not Only SQL, aka non-relational) use cases were so promising, in fact, that Mongo’s co-founders abandoned the application idea they originally set out to build after leaving DoubleClick to work full time on the custom database they were developing to support that application.[3]

There are different flavors of NoSQL databases, suited to different applications.

A bank may want to map how money is transferred from one customer to another for fraud detection purposes, in which case a graph database, which prioritizes relationships between entities, might be more appropriate than a relational database that records account withdrawals;

An IT administrator who wants to track deviations from baseline CPU usage might use a time series database like Splunk or InfluxDB;

An e-commerce store that wants to offer real-time product recommendations might lean on a key-value database like Redis, which in linking a unique identifier to a value – analogous to a directory that assigns a unique name (“key”) to a telephone number (“value”) – can perform very rapid read-write operations.

A document database like MongoDB is a more flexible version of a key-value store: the value is a structured, JSON-like document whose individual fields the database can index and query. In a document-based database, documents are the basic unit of data and collections are groupings of documents. You can think of documents as analogous to rows in a relational database and collections as analogous to tables[4]. The documents within a collection are part of the same category but don’t necessarily share the same schema.
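The difference between a key-value lookup and a document query can be roughed out in plain Python. This is not MongoDB's API: the `products` collection, its fields, and the `find` helper are hypothetical stand-ins.

```python
# Key-value store: the value is opaque, so the only access path is the key.
kv_store = {"product:2": '{"name": "E-book", "price": 12}'}

# Document "collection": each document is a dict, and documents in the same
# collection need not share the same fields (schema).
products = [
    {"_id": 1, "name": "Running shoe", "price": 90, "sizes": [8, 9, 10]},
    {"_id": 2, "name": "E-book", "price": 12, "file_format": "epub"},
    {"_id": 3, "name": "Gift card", "price": 25},
]


def find(collection, **criteria):
    """Return the documents whose fields equal every given criterion."""
    return [
        doc
        for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


if __name__ == "__main__":
    print(kv_store["product:2"])               # fetch by key only
    print(find(products, file_format="epub"))  # query by a field inside the document
```

Only the e-book document has a `file_format` field, yet it sits in the same collection as the shoe and the gift card. A relational table would need that column, mostly empty, on every row, or a separate table.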

scuttleblurb
