Google Cloud debuts TPU 8t for training, 8i for low-latency AI

At Google Cloud Next 2026, the company introduced TPU 8t for training, rated at 2.8x the performance of the Ironwood generation with up to 121 FP4 exaflops, and TPU 8i for inference, whose Boardfly network cuts latency by up to 50%.

Google Cloud unveiled TPU 8t and TPU 8i at its Google Cloud Next 2026 conference, splitting its tensor processors into dedicated chips for training and for inference. TPU 8t targets large-scale training, while TPU 8i focuses on low-latency inference for AI agents. Google cites 2.8x the performance of the Ironwood generation for 8t, with up to 121 FP4 exaflops. For 8i, internal tests show latency reductions of up to 50% with a new network design.

TPU 8t pairs larger high-bandwidth memory with faster storage access and a revised machine-learning architecture. The company expects it to cut training cycles for advanced models from months to weeks. Each 8t pod scales to as many as 9,600 chips, with per-chip scale-up bandwidth of 19.2 Tbits/sec and scale-out links at 400 Gbits/sec. A 3D torus interconnect moves training data across the thousands of accelerators in a pod.
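To get a rough sense of scale, the quoted per-chip figures multiply out to very large aggregate numbers across a full pod. The sketch below does only that arithmetic; it assumes both bandwidth figures are per chip and ignores duplexing, oversubscription, and usable-versus-peak distinctions, none of which Google has detailed.

```python
# Back-of-the-envelope arithmetic from the TPU 8t figures Google quotes.
# Assumption: both bandwidth numbers are per chip; link counts, duplexing,
# and usable fractions are not published, so this is peak math only.
CHIPS_PER_POD = 9_600
SCALE_UP_TBITS_PER_CHIP = 19.2     # scale-up (chip-to-chip) bandwidth, Tbit/s
SCALE_OUT_GBITS_PER_CHIP = 400     # scale-out (data-center) bandwidth, Gbit/s

aggregate_scale_up_pbits = CHIPS_PER_POD * SCALE_UP_TBITS_PER_CHIP / 1_000    # Pbit/s
aggregate_scale_out_tbits = CHIPS_PER_POD * SCALE_OUT_GBITS_PER_CHIP / 1_000  # Tbit/s

print(f"Scale-up, full pod:  ~{aggregate_scale_up_pbits:,.0f} Pbit/s")
print(f"Scale-out, full pod: ~{aggregate_scale_out_tbits:,.0f} Tbit/s")
```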

Built for data center inference, TPU 8i emphasizes low-latency execution and high utilization for agent workloads. Each chip includes 384MB of on-die SRAM to keep short-term working data on the processor, reducing stalls. Each 8i server also doubles the number of CPU hosts, using Google's in-house Arm-based Axion processors to handle orchestration and I/O.
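A rough capacity estimate shows why 384MB of on-die SRAM matters for agent-style serving. The sketch below assumes the "short-term memory" in question is a decoder KV cache stored in FP8; the model dimensions are hypothetical placeholders for illustration, not anything Google has published about 8i.

```python
# Rough estimate of how much decoder KV cache fits in 384 MB of on-die SRAM.
# All model dimensions below are hypothetical, chosen only to illustrate scale;
# Google has not published what TPU 8i actually keeps in its on-die SRAM.
SRAM_BYTES = 384 * 1024 * 1024

num_layers    = 48     # hypothetical decoder depth
num_kv_heads  = 8      # hypothetical grouped-query KV heads
head_dim      = 128
bytes_per_val = 1      # FP8 storage

# Per token, each layer stores keys and values: 2 * kv_heads * head_dim values.
bytes_per_token = num_layers * 2 * num_kv_heads * head_dim * bytes_per_val

tokens_on_die = SRAM_BYTES // bytes_per_token
print(f"KV-cache bytes per token: {bytes_per_token:,}")
print(f"Tokens resident in 384 MB SRAM: ~{tokens_on_die:,}")
```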

A key change in 8i is Boardfly, a network topology that increases the number of chip-to-chip ports and shortens the longest path between nodes. In Google's testing, this cut latency by up to 50% compared with prior designs built for training throughput. TPU 8t and earlier generations, by contrast, use topologies optimized for bandwidth.
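The trade-off Boardfly targets can be illustrated with simple hop counts. The sketch below compares the worst-case hop count (diameter) of a 3D torus with that of a hypothetical high-radix, two-level layout in which each chip has many more direct links; the two-level structure is an assumption for illustration, not Boardfly's actual design, which Google has not detailed.

```python
# Illustrative hop-count comparison: why adding ports per chip can shorten the
# longest path. The 3D torus math is standard; the "high-radix" case assumes a
# hypothetical two-level, dragonfly-style layout and is NOT Boardfly's actual
# design, which Google has not published.

def torus_diameter(dims):
    # Longest shortest path in a torus: half-way around each dimension.
    return sum(d // 2 for d in dims)

def two_level_diameter():
    # Hypothetical high-radix layout: chips fully connected inside a group,
    # groups joined by direct global links. Worst case: one hop to exit the
    # source group, one global hop, one hop inside the destination group.
    return 3

# A 16 x 20 x 30 torus holds 9,600 chips (matching the 8t pod size).
print("3D torus (16x20x30) worst case:", torus_diameter((16, 20, 30)), "hops")
print("Two-level high-radix worst case:", two_level_diameter(), "hops")
```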

The hardware fits into Google Cloud’s broader AI Hypercomputer approach, which links compute, storage, networking, and software. Google is expanding access to the Pathways runtime from DeepMind to connect TPU pods and split training and inference tasks across them.

The company also introduced Virgo Network, a data center fabric built for high-capacity data access and low-latency traffic across TPU installations. Within a single data center, Virgo can connect up to 134,000 TPU 8t chips as one fabric. Combined with Pathways and the JAX library, Google says it can link more than one million TPU 8t chips across multiple data centers into a single training cluster. For inference, Virgo trims latency by about 40% versus Ironwood on an otherwise idle network.
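Pathways and JAX expose that scale to developers largely through device meshes and sharding annotations. The sketch below is a minimal, generic JAX example of sharding a toy training step across whatever devices are visible; the mesh shape, axis names, and model are placeholders, and it uses no Pathways-, Virgo-, or TPU-8t-specific API.

```python
# Minimal JAX sharding sketch: partition a toy training step across a device
# mesh. Axis names, mesh shape, and the model are illustrative placeholders;
# nothing here is a Pathways- or TPU-8t-specific API.
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Arrange whatever devices are visible (TPU chips on a real slice, or a single
# CPU locally) into a 2D logical mesh with a data axis and a model axis.
devices = np.array(jax.devices()).reshape(-1, 1)
mesh = Mesh(devices, axis_names=("data", "model"))

def loss_fn(params, x, y):
    return jnp.mean((x @ params - y) ** 2)

@jax.jit
def train_step(params, x, y):
    grads = jax.grad(loss_fn)(params, x, y)
    return params - 1e-3 * grads

# Shard the batch along the data axis and the parameters along the model axis.
x = jax.device_put(jnp.ones((8, 4)), NamedSharding(mesh, P("data", None)))
y = jax.device_put(jnp.ones((8, 1)), NamedSharding(mesh, P("data", None)))
params = jax.device_put(jnp.zeros((4, 1)), NamedSharding(mesh, P(None, "model")))

params = train_step(params, x, y)
```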

Google highlights observability and reliability in the new fabric, including sub-millisecond telemetry and automated detection of degrading chips or nodes. Efficiency gains are part of the update as well: the company estimates double the performance per watt compared with Ironwood, helped by chip-level changes and fourth-generation liquid cooling.

Amin Vahdat, senior vice president and chief technologist for AI and Infrastructure at Google, described the design focus: “In the age of agents, what you really care about is latency, the minimum time it takes to get the data.” He noted that Google had been working on TPU 8t and 8i for two years, informed by internal discussions with DeepMind about future bottlenecks.

Google Cloud CEO Thomas Kurian told reporters: “We felt that people would want systems optimized for training, and separately, systems optimized for inference.”

Google points to early non-AI uses for TPUs as well: Citadel Securities lowered trading system costs by 30% using TPU-based infrastructure, according to Vahdat. Google did not disclose pricing. It positioned TPU 8t for training the largest multimodal and language models, and TPU 8i for enterprise applications that need consistent, low-latency inference; Microsoft projects that 1.3 billion agents could be in operation by 2028.
