AI Agents Shift CPU Demand, Spur Token Economics Debate

At Dell Technologies World 2026 in Las Vegas, attendees said AI agents are shifting inference demand back toward CPUs and driving up token spending, citing reports that Uber exhausted its 2026 AI budget early.
At Dell Technologies World 2026, held this week at the Venetian Conference Center in Las Vegas, company executives, partners and customers discussed how agentic AI is changing infrastructure needs. Presenters and attendees described a growing emphasis on CPU performance for inference workloads even as GPUs remain central for model training.
Speakers explained that agents act as orchestration layers that make repeated calls to language models, coordinate external tools and keep context active for longer tasks. Those patterns create many short inference requests, different memory and I/O needs, and tighter latency requirements than batch training.
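To illustrate why that pattern consumes tokens quickly, here is a minimal Python sketch. It was not presented at the event. It assumes a hypothetical agent that resends its full accumulated context on every model call, and the function name and every token count are illustrative assumptions, not measured figures.

```python
def agent_token_usage(steps, system_tokens, output_tokens_per_step, tool_result_tokens):
    """Estimate tokens for one agent task (hypothetical model, illustrative only).

    Assumes each model call resends the whole context so far, and that the
    context grows by the model's output plus one tool result after each step.
    """
    context = system_tokens      # tokens sent as input on the next call
    total_input = 0
    total_output = 0
    for _ in range(steps):
        total_input += context                  # the full context is billed as input
        total_output += output_tokens_per_step  # the model's reply is billed as output
        context += output_tokens_per_step + tool_result_tokens
    return total_input, total_output


if __name__ == "__main__":
    # Assumed numbers: 20 steps, 1,500-token prompt, 300 output tokens
    # and a 700-token tool result per step.
    total_in, total_out = agent_token_usage(20, 1500, 300, 700)
    print(total_in, total_out)  # prints "220000 6000"
```

Under these assumptions, input tokens dominate because the context is resent at every step, so doubling the number of steps more than doubles the total.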
A reporter on site said the debate over hardware priorities intensified as organizations moved agent projects into production. Several presenters noted that inference workloads call for stronger single-thread performance, more system memory and a system architecture tuned for inference, alongside the accelerators used for training.
“Agents are token hungry; they can use up in days or months what otherwise would take months to a year,” a conference participant remarked, summarizing multiple conversations at the event. Attendees pointed to an example in which a large enterprise reportedly exhausted its AI budget for 2026 by April or May after running extensive agent experiments.
That pattern of high token use has prompted firms to reevaluate where and how they run inference. Vendors and customers discussed trade-offs between cloud API costs and the capital and operating expenses of running inference at scale on-premises or in hybrid setups. Some presenters said the recurring costs of frequent model calls can change procurement decisions.
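The trade-off can be sketched as simple break-even arithmetic. The Python example below is a rough illustration under stated assumptions, not a pricing model from Dell or any cloud provider: the per-token price, call volumes and on-premises monthly cost are hypothetical placeholders, and the function names are invented for this sketch.

```python
def monthly_cloud_cost(calls_per_day, tokens_per_call, price_per_million_tokens, days=30):
    """Monthly API spend when every token is billed at a single flat rate (assumed)."""
    total_tokens = calls_per_day * tokens_per_call * days
    return total_tokens * price_per_million_tokens / 1_000_000


def breakeven_calls_per_day(onprem_monthly_cost, tokens_per_call,
                            price_per_million_tokens, days=30):
    """Daily call volume at which cloud spend equals a fixed on-premises cost.

    onprem_monthly_cost is a hypothetical all-in figure (amortized hardware,
    power, staff); real deployments also have costs that scale with load.
    """
    cost_per_call = tokens_per_call * price_per_million_tokens / 1_000_000
    return onprem_monthly_cost / (cost_per_call * days)


if __name__ == "__main__":
    # Assumed inputs: 2,000 tokens per call at $5 per million tokens.
    print(monthly_cloud_cost(10_000, 2_000, 5.0))
    print(breakeven_calls_per_day(3_000.0, 2_000, 5.0))
```

Above the break-even volume, the fixed-cost option is cheaper in this simplified model. Below it, paying per token costs less.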
Dell showcased hardware and systems geared to agent workloads, including boxed and edge systems designed for low-latency inference and private deployments. Executives and partners outlined configurations that pair GPU-heavy environments for training with CPU-optimized platforms for inference closer to users and data.
Participants noted that the many short inference requests agents generate are often billed per token by cloud providers. That billing model, combined with changing compute patterns, was presented as a factor shaping architecture and procurement choices for 2026 and beyond.
Speakers concluded that organizations will continue weighing cloud against on-premises options based on inference performance needs and the economics of frequent model calls. Vendors of integrated systems and services for inference workloads presented those offerings as one way to manage ongoing AI costs.








