Inside the Mind of Modern GPUs: How Graphics Cards Power Your Games, AI, and the Future of Computing

Written by Massa Medi

Have you ever stopped to wonder exactly how much raw computational muscle sits inside your graphics card as you immerse yourself in the ultra-realistic worlds of today’s video games? If you guessed in the realm of 100 million calculations per second, you’d have just enough power to run Super Mario 64—back in 1996. Fast-forward to 2011, and even the pixelated world of Minecraft required a system that could perform about 100 billion calculations each second.

But today, if you want to run the graphical juggernauts like Cyberpunk 2077 at full fidelity, your graphics card needs to flex its silicon muscles to churn out an astronomical 36 trillion calculations per second. Numbers this large can be hard to process—so let’s break it down for your imagination.

Imagine solving one multiplication problem every single second. Now, multiply that effort across every single person on earth. Even then, to match your GPU’s 36 trillion calculations per second, you’d need about 4,400 planets teeming with people, each solving a calculation every second! It’s mind-boggling to realize that a device sitting inside your PC performs what the populations of thousands of Earths could only dream of achieving.

In this article, we’re going to embark on a deep-dive into the inner workings of a modern graphics card. We’ll break down complex systems into bite-sized, digestible explanations, and by the end, you’ll have a newfound appreciation for the silicon wizardry hidden inside your computer. We’ll explore this journey in two main parts:

  1. Cracking open the graphics card to examine its components, delving into the physical architecture of the GPU (Graphics Processing Unit) itself.
  2. Investigating the computational architecture—explaining how GPUs process gigantic loads of data to excel at game graphics, cryptocurrency mining, neural networks, and AI.

Before we get started, it’s worth noting that this exploration is made possible with support from Micron, the maker of the advanced graphics memory inside the GPU we’ll be dissecting today.

CPU vs. GPU: The Ultimate Silicon Showdown

Before we jump headfirst into the guts of the GPU, let’s first clarify how GPUs compare to their more general-purpose sibling: the CPU (Central Processing Unit).

The Graphics Processing Unit on a modern card boasts over 10,000 cores—tiny calculators designed to churn through simple operations as quickly as possible. Contrast this to the typical CPU nestled on your motherboard, which contains only about 24 cores. Intuitively, one might think “More cores = more power!”—but, as with many things in technology, the reality is a bit more nuanced.

Think of it like this: Imagine a massive cargo ship (your GPU) and a jumbo jet airplane (your CPU). The ship can haul enormous amounts of cargo, albeit slowly, while the jet is fast, nimble, and can carry a variety of packages, including passengers, quickly to any destination. The GPU is the cargo ship: enormous throughput for huge volumes of simple, repetitive work. The CPU is the jet: far less capacity, but fast and flexible enough to handle whatever you throw at it.

It’s not a perfect analogy, but it captures the key trade-off: For mountains of simple calculations, like rendering graphics, the GPU’s parallel muscle wins out. For fast, complex, and more versatile workloads—think running your operating system or connecting to the network—the CPU is the undisputed champion.

Curious about CPU architecture? Stay tuned for a deep dive in an upcoming article—make sure to subscribe so you don’t miss out!

The Anatomy of a Graphics Card: Dissecting the GPU

Let’s peel back the layers and look at what’s actually inside a high-end graphics card.

At the heart of it all is the printed circuit board (PCB), bristling with precisely mounted components. The true brain is the graphics processing unit (GPU) die itself—like the NVIDIA GA102 chip, constructed from an astonishing 28.3 billion transistors. Most of this chip’s vast landscape is devoted to its processing cores, organized hierarchically: seven graphics processing clusters (GPCs), each containing 12 streaming multiprocessors (SMs); inside every SM sit four processing partitions of 32 CUDA cores and one tensor core apiece, plus a single ray tracing core.

All told, the full GA102 boasts 10,752 CUDA cores, 336 tensor cores, and 84 ray tracing cores—each optimized for a different flavor of math.

Here’s a fascinating quirk: the same GA102 chip design is used across several flagship NVIDIA graphics cards, including the 3080, 3090, 3080 Ti, and 3090 Ti—even though these sell for different prices and were released in different years. Why? Because chip manufacturing is never perfect. Tiny defects, such as dust or patterning errors, are almost inevitable. Rather than toss out a whole chip due to a small flaw, engineers can turn off the affected region. This modular, “repetitive” design means one defective area (say, a streaming multiprocessor) can be deactivated, allowing the rest of the chip to work flawlessly.

As a result, chips are “binned” (sorted) by quality: flawless dies with all 10,752 CUDA cores enabled become the flagship 3090 Ti, while dies with progressively more regions disabled are sold as the 3090, 3080 Ti, and 3080.

Differences in clock speed, memory type, and memory capacity further distinguish these cards, but the GPU “DNA” remains the same.

A Closer Look at the CUDA Core: The Simplest Super Calculator

Zooming into a single CUDA core, you’ll find a marvelously efficient calculator—this core, occupying around 410,000 transistors, orchestrates several fundamental mathematical operations. A key sub-region of about 50,000 transistors performs the crucial Fused Multiply-Add (FMA)—that is, a × b + c in one go, the single most common operation in graphics processing.
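
Much of graphics math, from dot products in lighting to matrix-vector transforms, is just a long chain of these multiply-adds, which is why the FMA earns dedicated circuitry. Here is a minimal Python sketch of that pattern, with made-up numbers purely for illustration:

```python
# One fused multiply-add (FMA) computes d = a * b + c as a single step.
# A dot product is simply a chain of FMAs accumulating into one total.
def dot(xs, ys):
    acc = 0.0
    for x, y in zip(xs, ys):
        acc = x * y + acc   # one FMA per pair of elements
    return acc

print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
```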

Half the CUDA cores handle these FMAs using 32-bit floating point (scientific notation) numbers, while the others tackle 32-bit integers or floats as needed. Additional circuitry equips the core for negative numbers, bit shifts, masking, managing instruction queues, accumulating results, and outputting data. In essence: it’s a lightning-fast, highly focused calculator, ticking away one multiply-add combo per clock cycle.

Multiply this by 10,496 cores at a brisk 1.7 GHz clock speed (as in the 3090 graphics card), and you get 35.6 trillion calculations per second! For more complex functions—division, square roots, trig—the chip uses “special function units.” There are only four of these per streaming multiprocessor, making them rarer but vital for fancy math.
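
The back-of-the-envelope arithmetic behind that headline figure is worth seeing once: an FMA counts as two floating point operations (one multiply, one add), so throughput is simply cores × clock × 2. A quick sketch using the numbers above:

```python
cuda_cores = 10_496     # CUDA cores in an RTX 3090
clock_hz = 1.7e9        # ~1.7 GHz boost clock (the quoted 35.6 figure
                        # comes from the slightly lower exact boost clock)
ops_per_fma = 2         # one fused multiply-add = 1 multiply + 1 add

flops = cuda_cores * clock_hz * ops_per_fma
print(f"~{flops / 1e12:.1f} trillion calculations per second")  # ~35.7
```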

Beyond the Core: Around the GPU Die

Encircling the processing heart of the GA102 are key supporting structures: the twelve memory controllers that link the cores to the surrounding graphics memory chips over a combined 384-bit bus, a shared 6 MB level 2 cache, the GigaThread Engine that distributes work across the chip, and the PCIe and NVLink interfaces that connect the GPU to the rest of the system.

Beyond the die, you’ll spot the display ports, the beefy 12V power connector (supplying hundreds of watts!), and PCIe pins for motherboard interaction. The countless tiny bits and pieces—collectively known as the Voltage Regulator Module (VRM)—step down power to the precise 1.1V needed by the GPU core. Heat’s a major concern; thus, a massive heatsink and four heat pipes ferry thermal energy away from the GPU and memory chips to radiator fins, with fans whirring to keep things cool.

And of course, there’s graphics memory: In the 3090, 24 GB of ultra-fast GDDR6X SDRAM from Micron. When you see a loading screen in a game, most of that wait is just your system copying the required 3D models from storage into this dedicated memory. As you play, the GPU’s tiny 6MB cache holds just a fraction of the scene, so chunks of the game’s environments are streamed relentlessly between memory and GPU, feeding the insatiable appetite of those trillions of calculating cores.

These memory chips aren’t just fast—they’re designed for parallelism: the 24 chips transfer data across a 384-bit “bus” at once, achieving a ludicrous bandwidth of 1.15 terabytes per second. For context, the CPU’s memory (DRAM) runs with a mere 64-bit bus, maxing out at about 64 gigabytes per second.
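
As a rough sanity check on those figures, the arithmetic is just bus width times per-pin data rate. The per-pin speeds in the sketch below are assumptions for illustration (they are not stated above), chosen so the totals land near the quoted numbers:

```python
# GPU side: 384-bit bus, assuming ~24 Gbit/s per pin (top GDDR6X speed)
gpu_bandwidth_gb_s = 384 * 24 / 8
print(f"GPU memory: ~{gpu_bandwidth_gb_s:.0f} GB/s")   # ~1152 GB/s ≈ 1.15 TB/s

# CPU side: one 64-bit DDR channel, assuming ~8 Gbit/s per pin
cpu_bandwidth_gb_s = 64 * 8 / 8
print(f"CPU memory: ~{cpu_bandwidth_gb_s:.0f} GB/s")   # ~64 GB/s
```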

Here’s a twist: While most think computers speak only in binary (zeroes and ones), new GDDR6X and GDDR7 graphics memory takes a multi-level approach. Instead of simple highs and lows, these chips drive multiple voltage levels on each wire, allowing far more data to move in a single tick: GDDR6X uses PAM4 signaling, where four distinct voltage levels encode two bits per transfer, and GDDR7 moves to PAM3, which uses three voltage levels to pack three bits into every pair of transfers.
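
A rough way to compare these schemes is the information carried by a single symbol, meaning one voltage step on one wire per transfer. The sketch below quotes raw bits per symbol; in practice GDDR7 maps three bits onto every pair of PAM3 symbols, slightly below the theoretical maximum:

```python
import math

# Information carried by one symbol (one voltage step on one data wire).
signalling = {
    "NRZ  (plain binary)": 2,   # two levels  -> 1 bit per symbol
    "PAM4 (GDDR6X)":       4,   # four levels -> 2 bits per symbol
    "PAM3 (GDDR7)":        3,   # three levels; GDDR7 packs 3 bits into
                                # every pair of symbols (1.5 bits/symbol)
}

for name, levels in signalling.items():
    print(f"{name}: {math.log2(levels):.2f} bits per symbol")
```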

Engineering at this level is truly an arms race: PAM3, adopted in GDDR7, offers reduced complexity, better power efficiency, and a better signal-to-noise ratio for even faster chips. Micron has continually pushed these innovations, even pioneering HBM (High-Bandwidth Memory), where DRAM chips are stacked into vertical “cubes,” connected by through-silicon vias (TSVs).

In HBM3e, one memory cube (think, a skyscraper made from microchips) offers up to 36 GB of high-speed capacity. So when you hear about AI systems packed with up to 192 GB of “stacked” memory—know there’s some serious, Micron-backed sorcery fueling that feat!

For most of us, flagship AI “accelerator” boards are out of reach—they cost tens of thousands of dollars and are backordered for years. But if you’re interested in the science of advanced memory, or in building the next generation of chips, Micron is always hiring talented engineers.

From Data to Reality: The GPU’s Parallel Power in Action

Now that we’ve mapped out the physical hardware, let’s talk architecture—how does the GPU’s computational design let it shred through tasks like game graphics and bitcoin mining?

Many of these tasks are classified as “embarrassingly parallel.” While the name may draw a giggle, in technical terms it means the computation can be split into many smaller jobs, each working independently with almost no coordination—perfectly suited to thousands of parallel cores.

The operating principle is SIMD (Single Instruction, Multiple Data): the same calculation is performed across loads of different data points at the same time.

Example: Building a Virtual Cowboy Hat

Imagine a 3D cowboy hat resting on a table in a game scene. The hat’s modeled from about 28,000 triangles, each defined by three “vertices.” Because neighboring triangles share many of those vertices, the mesh works out to roughly 14,000 unique points, each with X, Y, and Z spatial coordinates.

Every 3D object you see in the scene—tables, chairs, hats, robots—exists in its own model space, with coordinates like (0,0,0) marking the center of the object. To build a full 3D world, engines position these objects together, each with its own local origin. However, the camera needs to know where every object sits relative to everything else. The solution: You convert—or “transform”—each set of object-centric coordinates into a shared world space.

Here’s how parallel math shines:

  1. For our 14,000 cowboy hat vertices, you simply add the hat’s position in world space to each local coordinate. It’s a quick, identical operation—but must be done for every point.
  2. The same applies for the table, chair, and every other object—hundreds of thousands, even millions, of similar calculations.

In practice, this scene may require converting 8.3 million vertices, and with three additions per vertex (one per coordinate) that adds up to roughly 25 million addition operations.
But here’s the beauty: None of these calculations rely on the others. Each one can fire off on a different CUDA core, all working in glorious parallel.
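
To make that model-to-world step concrete, here is a small NumPy sketch (the hat’s world position is made up): one identical addition is applied to every coordinate of every vertex, exactly the kind of bulk, independent work a GPU spreads across its cores.

```python
import numpy as np

# 14,000 hat vertices in model space: one (x, y, z) row per vertex.
hat_vertices = np.random.rand(14_000, 3).astype(np.float32)

# Where the hat sits in the shared world (hypothetical position).
hat_world_position = np.array([2.0, 0.9, -1.5], dtype=np.float32)

# One identical addition per coordinate of every vertex. Each of these
# 42,000 additions is independent, so they can all run in parallel.
world_vertices = hat_vertices + hat_world_position
```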

This example is a single stage (model-to-world transformation) of a much deeper graphics rendering pipeline. The actual rendering—rotation, scaling, lighting, shading—involves many additional steps, nearly all of which are “embarrassingly parallel.”

SIMD, SIMT, and Scheduling Threads: Marching in Lockstep (and Beyond)

These parallel operations are implemented via “threads”—tiny units of computation matched directly to CUDA cores. Threads are grouped: 32 threads march together as a “warp,” and warps are bundled into larger “thread blocks.”

The GigaThread Engine oversees this complex choreography, mapping thread blocks to available streaming multiprocessors for maximum throughput.
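
To see how this vocabulary maps onto code, here is a hedged sketch using the numba library’s CUDA support; it assumes an NVIDIA GPU with the numba package installed and reuses the hypothetical hat vertices from the earlier sketch. Each thread transforms one vertex, 128 threads (four warps of 32) form a block, and the hardware spreads the blocks across streaming multiprocessors.

```python
import numpy as np
from numba import cuda

@cuda.jit
def model_to_world(vertices, offset, out):
    i = cuda.grid(1)                       # this thread's global index
    if i < vertices.shape[0]:              # guard threads past the last vertex
        for axis in range(3):
            out[i, axis] = vertices[i, axis] + offset[axis]

vertices = np.random.rand(14_000, 3).astype(np.float32)   # hat in model space
offset = np.array([2.0, 0.9, -1.5], dtype=np.float32)     # hypothetical position

d_vertices = cuda.to_device(vertices)      # copy inputs into graphics memory
d_offset = cuda.to_device(offset)
d_out = cuda.device_array_like(vertices)

threads_per_block = 128                    # four warps of 32 threads each
blocks = (len(vertices) + threads_per_block - 1) // threads_per_block
model_to_world[blocks, threads_per_block](d_vertices, d_offset, d_out)

world_vertices = d_out.copy_to_host()      # bring the results back
```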

Traditionally, every thread within a warp had to stay precisely “in step”—executing exactly the same instruction at the same time (think of a Roman phalanx). This is the basis of SIMD architecture.

In modern GPUs, however, NVIDIA and others have advanced to SIMT (Single Instruction, Multiple Threads): Each thread gets its own program counter, meaning it can run independently and isn’t forced to stay in lockstep with the rest of its warp. This offers greater flexibility, especially when some threads need to branch or pause due to different data. SIMT also includes a shared cache system (128 KB L1 per multiprocessor) so threads can pass data around as needed.

In layman’s terms, SIMT means GPUs are even more efficient at keeping all their cores busy, even when tasks diverge or branch in unpredictable ways.

(A fun aside: the term “warp” here doesn’t come from Star Trek, but instead, weaving—referencing the Jacquard loom from 1804, which used punch cards to select threads for making complex patterns.)

Real-World Applications: Bitcoin Mining and Neural Networks

Bitcoin Mining: Harnessing Parallel Power for Profit

Let’s talk Bitcoin. Mining for cryptocurrency was one of the earliest “killer apps” for powerful GPUs. Here’s how it works:

The Bitcoin blockchain requires blocks to be validated via the SHA-256 hashing algorithm. This process takes transaction data—plus a timestamp, some extra bits, and a variable called a "nonce"—and combines them to generate a 256-bit hash that behaves like a random number. It’s a bit like a lottery ticket generator: each new nonce spits out a new “ticket.”

To win (i.e., earn Bitcoin), miners race to find a hash value where the first 80 bits are zeros. The more SHA-256 hashes you can compute in a second, the better your chances.

GPUs could run thousands of these attempts in parallel, each thread crunching a different nonce with identical base data. A high-end graphics card could generate ~95 million hashes per second—massively outpacing CPUs.
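
A toy version of that search loop, using Python’s standard hashlib, looks like the sketch below. It keeps Bitcoin’s double SHA-256 but skips the real 80-byte header layout and uses a tiny difficulty so it finishes quickly; on a GPU, each thread would run this same loop over its own slice of nonces.

```python
import hashlib

# Stand-in for a block's transaction data + timestamp (not a real header).
base_data = b"example block data"

def leading_zero_bits(digest: bytes) -> int:
    value = int.from_bytes(digest, "big")
    return 256 - value.bit_length()

target_bits = 16       # toy difficulty; real Bitcoin needs ~80 leading zero bits
nonce = 0
while True:
    # Bitcoin hashes the block header twice with SHA-256.
    digest = hashlib.sha256(
        hashlib.sha256(base_data + nonce.to_bytes(4, "little")).digest()
    ).digest()
    if leading_zero_bits(digest) >= target_bits:
        print(f"found nonce {nonce} -> {digest.hex()}")
        break
    nonce += 1
```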

However, dedicated mining machines called ASICs now rule the Bitcoin world (processing up to 250 trillion hashes per second!), making GPUs look quaint in comparison—a bit like bringing a spoon to an excavator fight.

AI, Tensor Cores, and Neural Networks

The tensor cores inside top GPUs were engineered to accelerate the bread-and-butter calculations of AI: matrix multiplication and addition. Here’s what happens:

  1. The tensor core receives three matrices. First, it multiplies two of them (essentially summing the row-by-column products).
  2. Then, it adds in the third matrix.
  3. The result is output—and thanks to the highly parallel nature, all these calculations can happen simultaneously.

Neural networks and generative AI models require trillions (even quadrillions) of these matrix multiplications and additions—on far larger matrices than one would ever wish to calculate by hand. The tensor cores are designed precisely to crunch through this mountain of math at superhuman speed.
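
The operation each tensor core performs is compact to write down. Here is a minimal NumPy sketch of the matrix multiply-add (real tensor cores work on small tiles at reduced precision, which this glosses over):

```python
import numpy as np

# Three input matrices; a tensor core computes D = A @ B + C in one pass.
A = np.random.rand(4, 4).astype(np.float16)
B = np.random.rand(4, 4).astype(np.float16)
C = np.random.rand(4, 4).astype(np.float16)

D = A @ B + C   # matrix multiply, then element-wise addition of C
```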

Ray Tracing Cores: Lighting the Future of Graphics

Finally, let’s not forget ray tracing cores. While already explored in depth in a separate article, these specialized units simulate the complex physics of light—enabling jaw-droppingly realistic reflections, shadows, and lighting in real-time 3D environments.

In Closing: The Magic Behind Every Frame

The next time you load up a hyper-realistic video game, watch an AI beatbox, or marvel at blockchain technology, take a moment to appreciate the feat of parallel computing humming beneath your fingertips. Your graphics card, powered by brilliant engineering and relentless innovation, brings ideas to life—not just by working quickly, but by multiplying that speed by thousands.

At Branch Education, we’re on a mission to make free, visually immersive educational content accessible to everyone, explaining the deep science and engineering behind the tech that shapes our world. If you found this article enlightening, please consider liking, commenting, sharing with a curious friend, or subscribing for more deep dives. Support from readers and members helps us keep tackling these ambitious projects.

Interested in the bleeding edge of memory and AI? Check out Micron’s latest work in high-bandwidth and graphics memory, or explore a world-changing career designing chips by following the links below.

And if you want to binge more in-depth animations about the wonders of engineering and technology, just hit one of the recommended videos or subscribe to the channel.

Thanks for reading—and for joining us inside the mind of modern graphics cards.