scuttleblurb

Nvidia (NVDA): Part 2 of 2

Jul 28, 2021 ∙ Paid
I left Part 1 talking about how the demands of high-performance compute had outgrown the chip and were increasingly being handled at the level of the data center.  There are two different but related planes of interconnection inside a datacenter: within a single server (scale-up) and across servers (scale-out).  In a scale-up configuration a server, like Nvidia’s monster DGX-2, might have as many as 16 GPUs connected together and treated by CUDA as a single compute unit.  In scale-out, servers are connected to an Ethernet network through network interface cards (NICs, which can be FPGA, ASIC, or SoC-based) like those from Mellanox, so that data can be moved around and shared across them.  Those servers can act in concert as a kind of unified processor to train machine learning models.
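To make the scale-up idea concrete, here is a minimal sketch of my own (not anything from the post) using the CUDA runtime API: the host process enumerates the GPUs inside one server and enables peer-to-peer access between them, one of the building blocks that lets CUDA-level software treat the box as a single pool of GPUs.  On DGX-class machines that peer path runs over NVLink/NVSwitch rather than PCIe.

```c
// Illustrative sketch (mine, not from the post): CUDA host code that treats
// the GPUs inside a single server as one pool by enabling peer-to-peer access.
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int n = 0;
    cudaGetDeviceCount(&n);                       // GPUs visible in this node
    printf("GPUs in this server: %d\n", n);

    for (int i = 0; i < n; ++i) {
        cudaSetDevice(i);
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int ok = 0;
            cudaDeviceCanAccessPeer(&ok, i, j);   // can GPU i reach GPU j directly?
            if (ok)
                cudaDeviceEnablePeerAccess(j, 0); // scale-up: direct GPU-to-GPU memory access
        }
    }
    // Scale-out is the other plane: data leaves the server through a NIC
    // (e.g., Mellanox) and crosses the network, typically via libraries like
    // NCCL or MPI rather than direct peer access.
    return 0;
}
```

The point of the sketch is only the division of labor: peer access handles the links inside the box, while cross-server traffic is the NIC’s job, and increasingly the DPU’s.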

As the basic compute unit moves from chips to datacenters, a different kind of processor, sitting beside the CPU, is needed to handle the datacenter infrastructure functions that are increasingly encoded in software, which is where Data Processing Units (DPUs, also called “SmartNICs”) come in.  Described by Nvidia as a “programmable data center on a chip”, DPUs from Mellanox offload low-level networking, storage, security, anomaly detection, and data transport functions from CPUs.  Nvidia thinks that five years from now every data center node could have a DPU: embedding security at the node level fits the zero-trust security paradigm, and offloading infrastructure processes from CPUs accelerates workloads.  Something like 30% of cloud processing is hogged by networking needs.  CPUs that should be running apps are instead managing the infrastructure services required to run those apps (a single BlueField-2 DPU provides the same data center services as 125 x86 cores).  And just as Nvidia’s GPUs have CUDA software running on top, DPUs have complementary APIs and frameworks (the DOCA SDK) for developers to program security and networking applications.

In short, GPUs accelerate analytics and machine learning apps, CPUs run the operating system, and DPUs handle security, storage, and networking.  These chips are the foundational pillars of Nvidia’s data center platform.

This idea of treating the datacenter as a unified computing unit isn’t new.  Intel framed the datacenter opportunity the same way 5 or 6 years ago, as it complemented its CPUs with interconnect (Omni-Path Fabric, silicon photonics), accelerators (FPGAs and Xeon Phi), memory (Optane), and deep learning chips (Nervana, Habana).  Its Infrastructure Processing Unit (IPU) is analogous to Nvidia’s DPU.  AMD is progressing along a similar path, complementing its EPYC x86 server CPUs and GPUs with networking, storage, and acceleration technology from its pending $35bn acquisition of Xilinx, mirroring the GPU/CPU/networking stack that Nvidia is building through its acquisitions of Mellanox and, soon, ARM.


ARM licenses a family of instruction set architectures.  You can think of an instruction set as the vocabulary that software uses to communicate with hardware, the “words” that instruct the chip to add and subtract, to store this and retrieve that.  In the decades leading up to the 1980s, instruction sets had grown increasingly ornate, on the belief that the more complex the instruction set (i.e., the more intricate the vocabulary), the easier it would be for developers to build more powerful software.  Also, the more work a single instruction could do, the less memory a program needed, which mattered when RAM was an expensive resource.
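As a loose illustration of my own (not tied to any real ISA), the single C statement below has to be spelled out to the processor as a short sequence of such “words”; a complex instruction set might fold several of these steps into one instruction, while a simpler one keeps them separate:

```c
// Illustrative only: one line of C, and (in comments) the kind of machine
// "words" a compiler might emit for it. Not real x86 or ARM assembly.
int add_tax(int price, int tax) {
    return price + tax;   // LOAD  price into a register
                          // LOAD  tax into another register
                          // ADD   the two registers
                          // RETURN the result to the caller
}
```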

During the 1980s, some engineers at UC Berkeley, bucking the trend of the preceding 3 decades, whittled down instruction complexity, creating an alternative to the Complex Instruction Set Computer (CISC) designs of the day, appropriately called the Reduced Instruction Set Computer (RISC).  A RISC processor required up to 50% more instructions to accomplish a given task than a CISC, but each of those instructions, by virtue of its relative simplicity and standard size, could be executed 4x-5x faster.  Most software running on 1980s PCs leaned on simple instructions that RISC processors could have handled, and handled more power-efficiently to boot.  But for whatever reason the PC ecosystem coalesced around Intel’s x86 CISC architecture instead.  It took the explosion of a new class of compute platforms (mobile, IoT, and other portable devices) for RISC’s more power-efficient architecture to take off.
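A quick back-of-envelope on that trade-off, using the figures above as rough, illustrative numbers: if a RISC design needs ~1.5x the instructions but executes each one 4x-5x faster, it still comes out roughly 2.7x-3.3x ahead on net throughput.

```c
// Back-of-envelope sketch of the RISC trade-off, using the rough figures
// quoted above (illustrative, not measured data).
#include <stdio.h>

int main(void) {
    double instruction_penalty = 1.5;   // ~50% more instructions per task
    double low_speedup  = 4.0;          // each instruction ~4x faster...
    double high_speedup = 5.0;          // ...to ~5x faster

    printf("Net RISC advantage: ~%.1fx to ~%.1fx\n",
           low_speedup / instruction_penalty,    // ~2.7x
           high_speedup / instruction_penalty);  // ~3.3x
    return 0;
}
```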

This post is for paid subscribers