Data, Observability and Monitoring (SPLK, ESTC, NEWR, DT, DDOG)
Compared to the monolithic architectures of the on-premise world, cloud computing and the microservices paradigm accompanying it have given rise to more flexible and scalable applications. But those applications often run in complex distributed environments. A large company like Uber might be running thousands of microservices written in different languages, making it tough to track what’s going on under the hood. Such companies need sophisticated monitoring tools that span the expanding surface area of applications, virtual machines, containers, infrastructure, networking, and everything else involved in running applications. Because system downtime has direct business implications, it is vital that apps and all their dependencies are operational and that inevitable technical issues are resolved quickly. Software that is used to evaluate software falls under the domain of Observability, which honeycomb.io describes as: “being able to ask arbitrary questions about your environment without having to know ahead of time what you wanted to ask” (as opposed to Monitoring, which is about tracking metrics that you already know about). Questions like “where are the performance constraints?” or “why did this set of users experience unusually long response times?” are answered through 3 foundational elements: Logs – a time-series record of events; Traces – the journey of a request across all the nodes of a distributed system; and Metrics – values that measure response time, requests per second, CPU and memory usage, etc., instantiated as graphs on a dashboard.
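To make the 3 pillars concrete, here is a minimal sketch (in plain Python, not any vendor’s API) of a toy request handler that emits all three: a structured log line, a trace id that would follow the request across services, and a latency metric sample. All names (`handle_request`, `request.latency_ms`) are illustrative assumptions, not real product conventions.

```python
import json
import time
import uuid

def handle_request(user_id):
    """Toy request handler instrumented with the three observability pillars.
    Everything here is illustrative; real systems use an agent or SDK."""
    trace_id = str(uuid.uuid4())           # Trace: one id follows the request across services
    start = time.monotonic()
    result = {"user": user_id, "status": "ok"}  # stand-in for real work
    latency_ms = (time.monotonic() - start) * 1000

    log_line = json.dumps({                # Log: a timestamped record of the event
        "ts": time.time(),
        "trace_id": trace_id,
        "event": "request_handled",
        "user": user_id,
    })
    metric = ("request.latency_ms", latency_ms)  # Metric: a numeric sample for a dashboard
    return result, log_line, metric
```

In a real deployment each of these would be shipped to a backend (Datadog, Splunk, etc.); correlating them – e.g. jumping from a slow metric to the trace and logs that explain it – is exactly the workflow the integrated platforms sell.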
Demand for Observability & Monitoring (O&M) tools has been swept forward by growing adoption of Agile/DevOps, a workflow pattern slash cultural movement where developers assume more responsibility not only for developing applications but also for making sure they are secure and work in production, and for doing so in the moment rather than after the fact. As I wrote in my Atlassian post:
An alternative setup, one embodied in the philosophy of Agile/DevOps, has small groups of developers working in 2 or 3 week “sprints”, continuously A/B testing and integrating small batches of code to the trunk, with every change logged into version control. Rather than having a developer wait in line for QA to test his program, a battery of automated tests – ensuring that the new function works as the programmer intended, that the application containing the function works as it’s supposed to, and that the application works in interaction with other applications – is triggered with every commit, and if a problem is detected, an “Andon cord” (so to speak) is pulled, production is halted, and the error is swarmed with resources until the root cause is identified, isolated, and resolved.
In a fast feedback process like this, where code is maintained in a constantly deployable state, any issue that emerges is proximate in time to the change that caused it, and is not left to migrate further down the value stream where it festers with countless other changes that went unaudited earlier, any one or combination of which could have caused the issue.
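The commit-triggered battery described in the excerpt can be sketched as a toy pipeline. The three levels below mirror the post’s framing – the new function works as intended, the application containing it works, and it works in interaction with others – and `on_commit` plays the role of the “Andon cord”: any failure halts the pipeline. The function names and the `add_tax` example are hypothetical, purely for illustration.

```python
def add_tax(price, rate=0.08):
    """The small batch of code being committed (illustrative)."""
    return round(price * (1 + rate), 2)

def unit_test():
    # Level 1: the new function works as the programmer intended.
    assert add_tax(100.0) == 108.0

def integration_test():
    # Level 2: the application containing the function works as it's supposed to.
    cart = [add_tax(p) for p in (10.0, 20.0)]
    assert round(sum(cart), 2) == 32.4

def end_to_end_test():
    # Level 3: the application works in interaction with other applications.
    assert isinstance(add_tax(5.0), float)

def on_commit():
    """Run the full battery on every commit; halt on the first failure."""
    for test in (unit_test, integration_test, end_to_end_test):
        try:
            test()
        except AssertionError:
            return f"halted: {test.__name__} failed"  # the Andon cord is pulled
    return "deployable"
```

A real pipeline would of course run these in a CI system on every push; the point of the sketch is only that the checks are automatic and that a single failure stops the line.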
Some O&M vendors started in application monitoring (Dynatrace, New Relic), others in infrastructure (Datadog) or log monitoring (Splunk) or search (Elastic). But everyone is basically converging on the same set of solutions – application/infrastructure/log/network monitoring, Security, DevOps workflow automation, all permeated with “AI”. Compared to a jumble of discrete tools, an integrated solution is easier to manage and enables users to more efficiently correlate events across different parts of the stack. Meanwhile, as is true in other software domains, legacy on-premise vendors like Micro Focus, IBM and Wily/CA are losing relevance as the center of gravity shifts to the cloud. AppDynamics, while still a presence in the enterprise market, has lost its mojo since being acquired by Cisco (though I’ve heard they’ve nabbed some big deals by bundling APM into Cisco networking contracts).
Enterprises seem willing to explore new options when it comes to cloud-native applications, even as they maintain the legacy tools for existing workloads, forcing O&M players with on-premise origins to radically reorient. Dynatrace built its cloud platform from scratch. Splunk changed its sales incentive structure to make SaaS deals at least as attractive as term deals and recently began releasing all new versions and updates to the cloud first, with the expectation that its cloud offerings will diverge from on-prem solutions.