NVIDIA RTX Spark inside: complete Blackwell + Grace + MediaTek chip architecture

We've already covered the RTX Spark from a business perspective here on the blog: why it will change how companies buy computers. This post is the dense technical complement. The full SoC architecture, piece by piece: Blackwell GPU, custom Grace CPU with MediaTek, internal NVLink (NVLink C2C), unified memory, TSMC 3nm process, comparison with competing chips (Apple Silicon, AMD Strix Halo, Intel Lunar Lake), local AI workloads, and what it means for on-device inference at scale. Nothing left out.

RTX Spark was announced at GTC Taipei in 2026, billed as the first reinvention of the PC in 40 years. Behind the marketing, it's a unified SoC that joins three things that have always lived separately in traditional PCs: a dedicated GPU, a server-class CPU, and coherent memory management. The result is the first chip designed from the ground up to run a local AI agent with performance that previously only existed in data centers. Worth understanding from the inside.

1. Overview: the whole SoC

RTX Spark is an SoC (System on Chip) defined by five main elements in a single package:

Blackwell GPU: 6,144 Tensor Cores, 1 petaflop of AI performance (FP4 sparse).
Custom Grace CPU: 20 ARM cores, designed in partnership with MediaTek.
NVLink C2C: chip-to-chip bus connecting GPU and CPU at high speed inside the package.
Unified memory: 128 GB shared between GPU and CPU, with no copying.
TSMC 3nm process: 70 billion transistors in the total SoC.

The architecture is coherent: GPU and CPU see the same memory space, with hardware cache coherence. There is no "host-to-device transfer" as in a traditional discrete GPU. That difference is not cosmetic. It changes the entire programming model for heterogeneous workloads.

2. Blackwell GPU: what's inside 6,144 Tensor Cores

Blackwell is NVIDIA's GPU architecture, successor to Hopper (H100). The Spark version is the "consumer-grade-but-not-really" variant: same architectural family as the datacenter B100/B200, scaled for a PC thermal envelope.

Compute hierarchy:

SM (Streaming Multiprocessor): base compute block. Each SM has 4 sub-cores, each sub-core has one 5th-generation Tensor Core.
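6,144 Tensor Cores at one per sub-core and four sub-cores per SM works out to 6,144 sub-cores and 1,536 SMs, far more than any shipping Blackwell GPU. The headline figure may therefore count a different unit, and the real SM count depends on the exact per-SM configuration of the Spark variant.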
CUDA Cores: for non-tensor workloads (traditional graphics, general compute). Estimated at 12k to 16k.
4th-gen RT Cores: ray tracing for graphics.

5th-generation Tensor Cores support:

FP4: 4-bit floating point. New Blackwell format, doubles throughput vs FP8. Used in quantized LLM inference.
FP8 (E4M3, E5M2): 8 bits. Current standard for serving LLMs in production.
FP16/BF16: 16 bits. Training and inference without aggressive quantization.
TF32, FP32, FP64: high-precision formats for HPC.
2:4 sparsity: structured sparsity. When weights are pruned so that 2 of every 4 consecutive values are zero, throughput doubles. Supported directly in hardware.
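To make the 2:4 pattern concrete, here is a minimal sketch of magnitude-based pruning in plain Python. It assumes the simplest possible rule (keep the two largest-magnitude weights in each group of four); `prune_2_of_4` is a hypothetical helper for illustration, not an NVIDIA API, and real pipelines usually fine-tune after pruning to recover accuracy.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude values in every group of four."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Positions of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
sparse_row = prune_2_of_4(row)
print(sparse_row)  # [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]
```

Every group of four ends up with exactly two zeros, which is the regularity the hardware relies on to skip half of the multiplications.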

1 petaflop in FP4 with sparsity is the marketing number. Without sparsity, ~500 TFLOPS. In FP8, ~250 TFLOPS. In BF16 (no quantization), ~125 TFLOPS. Each halving of the bit width doubles throughput. This is the design axis: prioritize low-precision formats to maximize LLM inference throughput, not training.
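The ladder above is plain arithmetic, and a few lines of Python reproduce it. This is a sketch that assumes the text's own rule of thumb (2x from sparsity, 2x per halving of bit width); `dense_tflops` is a hypothetical helper, and real chips do not always scale this cleanly.

```python
FP4_SPARSE_TFLOPS = 1000.0  # headline figure quoted in the text


def dense_tflops(bits, fp4_sparse_tflops=FP4_SPARSE_TFLOPS):
    """Estimate dense throughput at a given bit width from the FP4 sparse figure."""
    fp4_dense = fp4_sparse_tflops / 2.0  # remove the 2x credited to 2:4 sparsity
    return fp4_dense * 4.0 / bits        # throughput halves each time the width doubles


ladder = {name: dense_tflops(bits) for name, bits in [("FP4", 4), ("FP8", 8), ("BF16", 16)]}
print(ladder)  # {'FP4': 500.0, 'FP8': 250.0, 'BF16': 125.0}
```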

3. Transformer Engine: the specialization that matters

Blackwell has the 2nd-generation Transformer Engine, a combination of Tensor Core hardware and library software that automates precision decisions per layer. It tracks the dynamic range of tensors and uses FP8/FP4 where it can, FP16 where needed. No manual per-layer tuning is required, provided the framework is built on top of it.
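One ingredient of this kind of precision management is per-tensor scaling: measure the largest absolute value in a tensor, then scale the tensor so that value lands at the top of the target format's range. The sketch below illustrates only that idea. The names are hypothetical and this is not the Transformer Engine API; the only hard facts used are the largest finite values of the two FP8 formats (448 for E4M3, 57344 for E5M2).

```python
# Largest finite magnitude representable in each FP8 format.
FP8_MAX = {"E4M3": 448.0, "E5M2": 57344.0}


def scale_for(values, fmt="E4M3"):
    """Return the factor that maps the tensor's largest magnitude to the format maximum."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 1.0  # all-zero tensor: nothing to scale
    return FP8_MAX[fmt] / amax


activations = [0.002, -0.5, 3.5, -7.0]
scale = scale_for(activations)             # 448 / 7 = 64.0
scaled = [v * scale for v in activations]  # values now span the E4M3 range
print(scale, max(abs(v) for v in scaled))  # 64.0 448.0
```

The inverse of the scale is applied after the matrix multiply, so the result comes back in the original range.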

Combined with native FP4, the practical gain is: real-time inference of Llama 70B (quantized in FP4) on a consumer-class chip. Before Blackwell, this required a datacenter A100 or H100.

4. Custom Grace CPU: why ARM and why MediaTek

NVIDIA Grace is an ARM Neoverse CPU designed to partner with GPUs on AI workloads. The original Grace (in the GH200, GB200) has 72 cores per chip, or 144 in the dual-chip Grace CPU Superchip. The Spark version is customized: 20 ARM cores, designed in partnership with MediaTek.

Why MediaTek: MediaTek specializes in high-efficiency mobile/embedded SoCs. It has expertise in modem integration, ISP, and design for the constrained thermal envelope of laptops and PCs. NVIDIA brings compute architecture, MediaTek brings consumer integrated SoC domain knowledge. A natural fit.

Why ARM, not x86:

ARM has superior perf-per-watt for parallel workloads. A PC running an agent 24/7 needs that.
Coherence with datacenter Grace: software written for Grace server chips runs on Spark without porting.
The ARM license allows deep customization (extensions, cache, interconnect). x86 is Intel/AMD, closed.
The ARM software ecosystem on Windows is mature after years of Qualcomm Snapdragon X, Apple Silicon proving the market, and Microsoft optimizing native Windows for ARM64.

Trade-off: legacy x86 Windows applications run via emulation (Prism, on Windows 11 ARM). Emulation performance is around 80% of native on typical apps, perfectly usable but not ideal. Apps recompiled for native ARM64 (increasingly common) run at full performance.

5. NVLink C2C: the bus that unlocks unified memory

NVLink Chip-to-Chip (C2C) is the interconnect between the Grace CPU and the Blackwell GPU inside the package. Characteristics:

Bandwidth: ~900 GB/s bidirectional in the datacenter implementation (the Spark version is likely ~600-900 GB/s, depending on tier).
Latency: much lower than a round trip over PCIe Gen 5, whose x16 link also tops out at ~64 GB/s per direction.
Cache coherence: hardware maintains coherence between CPU cache and GPU memory. No manual flush required from the application.

Comparison with PCIe: PCIe Gen 5 x16 = 64 GB/s, no coherence. Moving 10 GB of data from CPU to GPU costs ~150ms via PCIe and ~11ms via NVLink C2C. In an iterative inference loop, that's the difference between viable and not.
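The transfer times quoted above follow directly from size divided by bandwidth. A minimal sketch, assuming peak link bandwidth and ignoring latency and protocol overhead (`transfer_ms` is a hypothetical helper):

```python
def transfer_ms(gigabytes, bandwidth_gb_per_s):
    """Time in milliseconds to move a payload at a sustained bandwidth."""
    return gigabytes / bandwidth_gb_per_s * 1000.0


pcie_ms = transfer_ms(10, 64)   # PCIe Gen 5 x16
c2c_ms = transfer_ms(10, 900)   # NVLink C2C, datacenter figure from the text
print(f"PCIe: {pcie_ms:.2f} ms, NVLink C2C: {c2c_ms:.2f} ms")
# PCIe: 156.25 ms, NVLink C2C: 11.11 ms
```

Real transfers land somewhat above these numbers, but the ratio of roughly 14x is what matters for an inference loop that touches the data on every iteration.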

6. Unified memory: 128 GB for everything

RTX Spark has 128 GB of memory shared between CPU and GPU. The memory is likely LPDDR5X, the low-power DRAM standard used in high-efficiency SoCs.

Why this matters: top-tier discrete PC GPUs have 24 GB (RTX 4090) to 48 GB (RTX 6000 Ada). Large LLMs don't fit. To run Llama 70B at 8 bits per weight (~70 GB) you need an 80 GB datacenter GPU or two GPUs with NVLink.

With 128 GB unified, Spark runs:

Llama 70B in FP8 (~70 GB) with room to spare.
Models of up to ~200B parameters quantized in FP4 (~100 GB). Llama 405B in FP4 needs ~203 GB for weights alone, so it does not fit.
Two mid-size models simultaneously (e.g., 30B + 30B).
Model plus large KV cache for long context.
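Whether a model fits is a one-line calculation: parameters times bytes per parameter. The sketch below covers weights only; the `reserve_gb` headroom for the OS, activations, and KV cache is an arbitrary assumption chosen for illustration, and both function names are hypothetical.

```python
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "FP4": 0.5}


def weights_gb(params_billions, fmt):
    """Weight footprint in GB: 1e9 params * bytes per param / 1e9 bytes per GB."""
    return params_billions * BYTES_PER_PARAM[fmt]


def fits(params_billions, fmt, memory_gb=128.0, reserve_gb=16.0):
    """True if the weights fit after reserving headroom (assumed, not measured)."""
    return weights_gb(params_billions, fmt) <= memory_gb - reserve_gb


for size, fmt in [(70, "FP8"), (70, "FP4"), (200, "FP4"), (405, "FP4")]:
    print(f"{size}B {fmt}: {weights_gb(size, fmt):.1f} GB, fits={fits(size, fmt)}")
# 70B FP8: 70.0 GB, fits=True
# 70B FP4: 35.0 GB, fits=True
# 200B FP4: 100.0 GB, fits=True
# 405B FP4: 202.5 GB, fits=False
```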

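Memory bandwidth is critical for inference: generating tokens one at a time at small batch sizes is memory-bound, not compute-bound, because every token requires reading the full set of weights. LPDDR5X-9600 delivers ~150 GB/s per 128 bits of bus width (9,600 MT/s × 16 bytes), not per individual channel. Spark likely uses 4 to 8 such 128-bit groups (a 512- to 1,024-bit bus), totaling 600 GB/s to 1.2 TB/s. Compare with: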

RTX 4090: GDDR6X, ~1 TB/s. But only 24 GB.
Apple M4 Max: LPDDR5X, ~546 GB/s. Up to 128 GB.
H100: HBM3, 3 TB/s. 80 GB.

Spark falls between Apple Silicon and dedicated GPU. It's not an H100. But it's the first consumer platform capable of running 70B+ models with decent latency.
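Two formulas sit behind the bandwidth discussion above: peak DRAM bandwidth is transfer rate times bus width, and memory-bound decoding cannot exceed bandwidth divided by the bytes read per token. A rough sketch, assuming the whole set of weights is read once per generated token and ignoring KV cache traffic; both helpers are hypothetical, and the bus widths are the text's speculation, not confirmed specs.

```python
def lpddr_bandwidth_gb_s(mega_transfers_per_s, bus_width_bits):
    """Peak bandwidth in GB/s: transfers per second times bytes per transfer."""
    return mega_transfers_per_s * 1e6 * (bus_width_bits / 8) / 1e9


def decode_tokens_per_s_bound(bandwidth_gb_s, weights_gb):
    """Upper bound on single-stream decode speed when weights are read once per token."""
    return bandwidth_gb_s / weights_gb


per_128_bits = lpddr_bandwidth_gb_s(9600, 128)   # ~153.6 GB/s
bus_512 = lpddr_bandwidth_gb_s(9600, 512)        # ~614.4 GB/s
print(round(per_128_bits, 1), round(bus_512, 1))

# Llama 70B: ~70 GB at 8 bits per weight, ~35 GB at 4 bits.
print(round(decode_tokens_per_s_bound(600, 70), 1))   # ~8.6 tokens/s
print(round(decode_tokens_per_s_bound(600, 35), 1))   # ~17.1 tokens/s
```

This is why quantization matters twice on a platform like this: it decides whether the model fits, and it sets the ceiling on tokens per second.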

7. TSMC 3nm and 70 billion transistors

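The TSMC N3 (3nm) process family is the same one used in Apple M3/M4 and Qualcomm's Snapdragon 8 Elite. Datacenter Blackwell, by contrast, is built on TSMC 4NP, a 4nm-class process, and Snapdragon 8 Gen 3 is also a 4nm part. Compared to the previous node (N5/4nm): ~30% better density, ~10-15% better perf-per-watt.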

70 billion transistors is in the same range as Apple's M2 Max (67B) and below the M3 Max (92B). The budget is distributed across the GPU (majority), CPU (~20%), interconnect and memory controllers, cache, and specialized blocks (NVENC/NVDEC, ISP, DisplayPort, networking).

Thermal envelope: estimated at 80-150W depending on variant (laptop vs desktop vs workstation). Compared with:

RTX 4090 laptop: 175W GPU only + ~50W CPU = 225W.
Apple M4 Max: ~80W under heavy load.
H100 PCIe: 350W GPU only.

Spark is more efficient than an equivalent discrete solution in a PC, close to Apple's efficiency, with much higher AI performance.

8. Workloads where Spark shines

The design favors four categories:

Local LLM inference: 7B to 70B models in real time. Primary use case, announced by NVIDIA. Users: local agents (Hermes, OpenShell), 24/7 personal assistants, applications with sensitive data that cannot leave the machine.
Image/video generation: Stable Diffusion XL in seconds, Flux in a few seconds, short video via models like Wan, CogVideoX in minutes.
Local RAG over private datasets: index 100k to 1M docs with a local embedding model, semantic search plus LLM, all on the machine.
Light fine-tuning: LoRA/QLoRA on 7B to 13B models is feasible locally. Full fine-tuning of large models remains datacenter work.
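Why LoRA is feasible locally comes down to parameter count: a rank-r adapter on a d_out × d_in weight matrix trains only r × (d_in + d_out) values. A minimal sketch, assuming a 7B-class shape (hidden size 4096, 32 layers) with adapters on two 4096 × 4096 projections per layer; the rank and the choice of projections are illustrative assumptions.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters for one adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


hidden = 4096             # assumed hidden size of a 7B-class model
layers = 32               # assumed number of transformer layers
adapted_per_layer = 2     # e.g. the query and value projections
rank = 16

per_matrix = lora_params(hidden, hidden, rank)
total = per_matrix * adapted_per_layer * layers
print(per_matrix, total)                       # 131072 8388608
print(f"{total / 7e9:.4%} of a 7B model")      # about 0.12%
```

Roughly 8 million trainable parameters against 7 billion frozen ones is what keeps optimizer state and gradients small enough for a single machine.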

Where Spark does NOT shine (and was not designed to):

Training a large model from scratch (requires a cluster).
Top-tier gaming compared to a dedicated RTX 4090/5090 (Blackwell Spark is consumer-AI tier, not gaming-enthusiast tier).
Traditional HPC (FP64) at scale (datacenter is a better use of money).

9. The three product lines: laptop, desktop, workstation

NVIDIA announced three form factors built on the same base chip.

RTX Spark laptop: ~80-100W envelope, performance optimized for battery life. Variants from manufacturers (Acer, ASUS, Dell, HP, Lenovo, MSI). The first serious consumer-AI laptop product.

RTX Spark desktop: ~120-150W envelope, full performance. For agents running 24/7 without battery dependency, the office AI hub.

DGX Station: largest variant, 768 GB of memory, 20 petaflops, 8 TB/s bandwidth. For LLM developers, large model fine-tuning, local deployment of trillion-parameter models. A different league, but it runs Windows and the same stack as the smaller Spark.

10. Direct comparison with competitors

Chip	Peak AI throughput (vendor TOPS)	Max RAM	Memory bandwidth	Process
RTX Spark	~1000 (FP4 sparse)	128 GB	~600 GB/s to 1.2 TB/s	TSMC 3nm
Apple M4 Max	~38 (NPU) + GPU	128 GB	546 GB/s	TSMC 3nm
AMD Strix Halo (Ryzen AI Max)	~50 (NPU)	128 GB	~256 GB/s	TSMC 4nm
Intel Lunar Lake	~48 (NPU)	32 GB	~136 GB/s	TSMC N3B
Qualcomm Snapdragon X Elite	~45 (NPU)	64 GB	~136 GB/s	TSMC 4nm

Spark is a tier above in almost every dimension for AI workloads, with one caveat on the first column: the Spark figure is FP4 with sparsity on the GPU, while the NPU figures are INT8, so the raw ratio overstates the gap. The fair comparison is with Apple Silicon (which has its own ecosystem and does not run native Windows). Spark has higher AI throughput, Apple has a denser local AI developer ecosystem today (MLX, llama.cpp Metal). A technical draw, with the winner determined by use-case context.

vs AMD Strix Halo: AMD also has 128 GB and runs native x86 (full Windows compatibility). Spark wins on AI throughput by 10x or more, AMD wins on legacy software compatibility. Partially overlapping markets.

11. Software stack: 100% CUDA, local

The point that sets Spark apart from any Apple/AMD/Qualcomm device: it runs the entire CUDA stack. PyTorch, JAX, TensorRT, Triton, cuDNN, NCCL, every library that exists for NVIDIA GPUs runs exactly the same way. A model trained on a datacenter H100 loads on Spark without changing a line of code (subject to available memory).
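In practice, "without changing a line of code" means the usual device-agnostic pattern keeps working. A minimal sketch, assuming PyTorch is the framework in use; the import is guarded so the snippet also runs on a machine without it, and `pick_device` is a hypothetical helper.

```python
try:
    import torch  # optional: only needed to detect a CUDA device
except ImportError:
    torch = None


def pick_device():
    """Return "cuda" when PyTorch sees an NVIDIA GPU, otherwise "cpu"."""
    if torch is not None and torch.cuda.is_available():
        return "cuda"
    return "cpu"


device = pick_device()
print(device)
# With PyTorch installed, the same two lines work on an H100 server or a local machine:
#   model = model.to(device)
#   output = model(batch.to(device))
```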

For AI developers, this is decisive. There is no real alternative. Apple has MLX (open source, but Apple-only and a smaller ecosystem). AMD has ROCm (it exists, but support is inconsistent). NVIDIA is the de facto standard.

Additionally: the software from each announced partner (Adobe Photoshop and Premiere 2x faster, Blackmagic DaVinci Resolve, Cadence design tools, thousands of others) is being recompiled to run natively on Spark with CUDA acceleration.

12. What this changes in AI application architecture

Spark introduces a new architecture pattern for AI apps: local-first execution with optional cloud-burst. Emerging patterns:

Agent runs locally 95% of the time. When a larger model is needed, it makes a cloud request. Average cost drops dramatically.
Local RAG over sensitive data. Local embedding model, local vector DB, local LLM. Data never leaves.
Creation apps (design, video, audio) with inline inference at interactive latency.
Iterative local fine-tuning for per-user personalization, without sending personal data to the cloud.
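The first pattern in the list, local by default with cloud as the exception, can be sketched as a small router plus a cost estimate. Everything here is hypothetical: the `Request` fields, the prompt-length threshold, and the per-request cloud price are placeholders, and local inference is treated as zero marginal cost, which ignores electricity and hardware.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt: str
    needs_frontier_model: bool = False  # set by the application, e.g. for hard reasoning tasks


def route(request, local_prompt_limit_chars=32_000):
    """Send a request to the cloud only when the local model is not enough."""
    if request.needs_frontier_model or len(request.prompt) > local_prompt_limit_chars:
        return "cloud"
    return "local"


def cloud_cost_per_1k_requests(local_share, cloud_cost_per_request):
    """Cloud spend per 1,000 requests when a share of them stays local."""
    return 1000 * (1.0 - local_share) * cloud_cost_per_request


print(route(Request("Summarize this note")))                          # local
print(route(Request("Plan a migration", needs_frontier_model=True)))  # cloud
print(round(cloud_cost_per_1k_requests(0.95, 0.02), 2))               # 1.0
print(round(cloud_cost_per_1k_requests(0.0, 0.02), 2))                # 20.0
```

With 95% of requests handled locally, cloud spend in this toy model drops by a factor of 20 compared with sending everything to the cloud.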

This is NVIDIA's bet: not every workload needs to go to a datacenter, and if inference fits locally, it will go local. Zero cost per token, minimal latency, privacy by default. Cloud remains relevant for frontier models, training, and capacity bursts, but becomes a complementary layer, not the only one.

13. The technical reframe

RTX Spark is not "a faster PC." It's the first desktop-class platform engineered ground-up for a persistent local AI agent. Architecturally, it's where the market is converging: unified SoC, coherent shared memory, AI throughput in low-precision formats as the primary metric, energy efficiency as a central constraint.

Apple has been moving in this direction for four generations of Silicon. AMD followed with Strix Halo. Intel is behind but coming with Panther Lake. Spark is NVIDIA's entry, with the asymmetric advantage of the CUDA ecosystem and a GPU architecture built for AI from the start.

If you design AI software, the target hardware calculation changes. It's no longer "cloud or nothing." It's "cloud, local with Spark/Apple/AMD, edge, or a combination." Each with its own cost, latency, privacy, and capabilities tradeoff. Those who understand the full architectural gradient will design better. Those who don't will keep paying 100% cloud when 30% local would do.