[SNOW] Modern data systems: part 2
There are some obvious benefits to combining data from different sources: a spike in call center volumes can be traced back to a database error that prevents purchase orders from going through; a purchase order from one brand can be correlated against another to promote cross-selling; a marketing department can combine first-party data from a brand’s loyalty program with third-party data sourced through a broker to craft marketing campaigns; a retailer can aggregate data across stores and regions to optimize inventory and time promotions; a CPG supplier might use that same data to inform product development.
A data warehouse has for decades been pitched as the place where all this was supposed to happen. Whereas a database is used to read and write data, a data warehouse is optimized to analyze it[1]. They play different roles but are part of the same workflow. Data from different DBs – one recording email click-throughs, another tracking purchase orders, and yet another updating inventory levels – might be cleaned up and otherwise prepared in an external staging area (a data lake like S3) before being streamed[2] to a warehouse, where it can be analyzed to, say, monitor the effectiveness of an email ad campaign or evaluate the ability to meet incremental demand from that campaign over the next 60 days.
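To make that concrete, here is a minimal sketch of the kind of cross-source query the warehouse enables, written against the snowflake-connector-python client. The credentials, table names (email_clicks, purchase_orders), and columns are all hypothetical, and the join logic is just one plausible way to attribute orders to a campaign:

```python
import os
import snowflake.connector

# Hypothetical account and credentials; in practice these come from a
# secrets manager or environment config.
conn = snowflake.connector.connect(
    account="my_account",
    user="analyst",
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse="ANALYTICS_WH",
    database="MARKETING",
)

# Join click-through events (landed from the email system's DB) against
# purchase orders (landed from the commerce DB) to attribute revenue to
# each campaign, crediting orders within 7 days of a click.
sql = """
SELECT c.campaign_id,
       COUNT(DISTINCT c.user_id)  AS users_clicked,
       COUNT(DISTINCT o.order_id) AS orders,
       SUM(o.order_total)         AS attributed_revenue
FROM email_clicks c
LEFT JOIN purchase_orders o
  ON o.user_id = c.user_id
 AND o.ordered_at BETWEEN c.clicked_at AND DATEADD(day, 7, c.clicked_at)
GROUP BY c.campaign_id
ORDER BY attributed_revenue DESC
"""

for row in conn.cursor().execute(sql):
    print(row)
```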
Or the data might be pulled off factory equipment, pushed to a data lake, and then streamed to an internal staging area (like Snowflake’s Snowpipe), where it is transformed[3] and shipped to a warehouse, where it can be mined for insight[4]. The query results might then be fed into a BI dashboard like Tableau or Qlik or to a ticketing solution like ServiceNow or Jira.
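On the ingestion side, a Snowpipe definition is itself just SQL. A rough sketch, reusing the connection from the snippet above; the pipe, stage, and table names are hypothetical, and the auth setup (a storage integration over the lake bucket) is omitted:

```python
# Auto-ingest files that land in the (hypothetical) external stage over the
# data lake into a warehouse table, as each file arrives.
pipe_ddl = """
CREATE PIPE IF NOT EXISTS factory_pipe AUTO_INGEST = TRUE AS
  COPY INTO sensor_readings            -- destination table in the warehouse
  FROM @factory_stage                  -- external stage over the S3 bucket
  FILE_FORMAT = (TYPE = 'JSON')
"""
conn.cursor().execute(pipe_ddl)
```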
Sounds nice, but consolidating data from different sources in one place has historically been too expensive and cumbersome to be practical. In the ‘90s, data warehouses were tied to specialized on-prem servers. To limit compute and storage costs, data had to be organized and rolled up by engineers according to questions defined in advance, then batch shipped to warehouses, which could only handle so many concurrent users. This process was tough to scale and unsuitable for the pending era of big data.
Then sometime in the mid-2000s, appliance-based data warehouses were succeeded by Apache Hadoop, an open-source platform that consists of two parts: a distributed file system, Hadoop Distributed File System (HDFS), that stores the data, and a programming model, MapReduce, that analyzes it[5]. HDFS breaks down large files into smaller blocks and stores those blocks in clusters of commodity servers that are networked together. By having lots of cheap computers working in parallel, Hadoop can store and process high volumes of rapidly changing data far faster than the traditional non-distributed setup. And because networks were slow back then, being able to crunch huge volumes of data locally was a major step forward[6].
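To make the programming model concrete, here is a toy, single-machine word count in plain Python: map emits key-value pairs, the framework groups them by key (the “shuffle”), and reduce folds each group. On a real Hadoop cluster the same two functions run in parallel across the HDFS blocks described above:

```python
from itertools import groupby

def map_phase(line):
    # Emit a (key, value) pair for every word in the input split.
    for word in line.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    # Fold all values that arrived under the same key.
    return (word, sum(counts))

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle: collect and sort all mapped pairs so equal keys sit adjacent.
pairs = sorted(kv for line in lines for kv in map_phase(line))
results = [
    reduce_phase(word, (v for _, v in group))
    for word, group in groupby(pairs, key=lambda kv: kv[0])
]
print(results)  # [('brown', 1), ('dog', 1), ('fox', 2), ...]
```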
Two publicly traded Hadoop vendors, Cloudera and Hortonworks, emerged. After beating each other up for several years, the two companies merged in 2018, the new entity assuming Cloudera’s name. Up until then, Hortonworks was investing heavily in the ability to intelligently route data from one system to another[7], while Cloudera was emphasizing machine learning and data warehousing. Combined, the two companies hoped to deliver an “end-to-end” system that could leverage Hortonworks’s technology to ingest and transform data from edge devices and Cloudera’s to store and query it.
Cloudera and Hortonworks tried to differentiate on enterprise-grade features (data governance, access controls, and audit logs) and availability across public clouds and on-prem data centers/private clouds, catering to big financial institutions and telcos who managed lots of legacy infrastructure and feared being locked into a single cloud environment as they moved workloads off premise. But the merger failed to deliver. Part of this was due to cultural conflicts: Hortonworks ran an aggressive sales and partnership-driven organization whereas Cloudera was more product-oriented; Hortonworks was a 100% open-source company that monetized through support and services whereas Cloudera fused open-source software with proprietary components[8]. Customers churned as management failed to articulate a clear technology roadmap and cross-selling opportunities were less abundant than expected. Moreover, networking technology improved in the intervening years and it was no longer so critical that code run close to the data, obviating one of Hadoop’s key value props.
In theory, the hybrid value proposition that Cloudera continues to pitch makes sense. But in practice it seems even large enterprises with legacy infrastructure are more than willing to adopt cloud-native solutions rather than stretch their often kludgy on-prem tools. It’s almost like enterprises see the cloud as a fresh start, an excuse to adopt modern platforms while looking forward to a future where the cloud replaces on-prem as the center of gravity for IT. Meanwhile, Cloudera, which didn’t even launch a viable cloud solution until 2019, still seems moored to legacy environments, treating the cloud as burst capacity for existing on-prem workloads.
Also, the default assumption that enterprises will put up with difficult platforms from incumbent vendors seems far less applicable today, when developers have more sway over which solutions get adopted within the enterprise. The Cloudera Data Platform (CDP), which mashes together Hadoop, data routing, warehousing, streaming analytics, and machine learning and makes it all available across cloud and on-prem environments under a “single pane of glass,” just doesn’t seem to be resonating relative to dedicated, user-friendly solutions. HDFS competes with S3 and Azure Blob storage, which leverage the scale economies of Amazon and Microsoft to offer the same thing, but cheaper. MapReduce is being replaced by Spark, which uses in-memory processing (meaning the data is stored in RAM rather than on disk) to deliver up to 100x faster performance and comes packaged with libraries that support SQL queries[9] and machine learning.
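For flavor, a hedged sketch of the Spark workflow just described: load data once, pin it in memory, then query it with the bundled SQL library. Requires pyspark; the orders data is made up (in practice it would be read from files in a data lake):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("demo").getOrCreate()

# Hypothetical orders data; in practice this would be something like
# spark.read.parquet(...) over lake files.
orders = spark.createDataFrame(
    [("east", 120.0), ("west", 80.0), ("east", 45.5)],
    ["region", "order_total"],
)
orders.cache()  # in-memory processing: keep the dataset in RAM across queries

# The SQL library lets analysts query the cached data with plain SQL.
orders.createOrReplaceTempView("orders")
spark.sql("""
    SELECT region, SUM(order_total) AS revenue
    FROM orders
    GROUP BY region
    ORDER BY revenue DESC
""").show()
```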