The Bottleneck Has Shifted
Open-Weight Models, Nvidia Rubin, and the New Inference Economics
The rules of the AI game have already changed, and most people haven’t noticed yet. Open-weight models that rival frontier labs are now dropping every few weeks. Massive, sparse, agent-ready architectures priced like lightweight tools. Nvidia is answering with an entirely new class of rack-scale hardware built not for peak flops, but for bandwidth-hungry, long-context, tool-calling workloads that actually ship revenue.
Meanwhile, seasoned investors are quietly warning that the infrastructure boom we’re watching could be the next great overbuild, even as real token demand doubles month after month. Three separate threads, one unmistakable signal: the bottleneck has shifted from “who can train the biggest model” to “who can run the most useful agents the cheapest, longest, and most reliably.”
Part 1, Exploring the Implications of New Open Source LLMs
Open source LLMs have grown up a lot lately. These are AI models whose weights, and often the surrounding code, are free for anyone to use or tweak. In the past, the open ones were small and basic, good for running on a home computer but not as smart as the big paid versions. Now that's flipped. Recent releases hit massive scale, like models with 400 billion total parameters that activate only about 13 billion per token processed. That's like having a huge team of experts where just a few jump in for each job, keeping things fast and cheap.
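To make the sparse expert idea concrete, here is a minimal routing sketch in Python. The dimensions, expert count, and top-k value are toy assumptions for illustration, not the configuration of any released model.
```python
# Illustrative sketch of sparse mixture-of-experts routing (hypothetical sizes,
# not any specific released model's configuration).
import numpy as np

rng = np.random.default_rng(0)

d_model = 64          # hidden size (toy scale)
n_experts = 16        # total experts in the layer
top_k = 2             # experts activated per token

# Each expert is a small feed-forward block; only top_k of them run per token.
experts = [
    (rng.standard_normal((d_model, 4 * d_model)) * 0.02,
     rng.standard_normal((4 * d_model, d_model)) * 0.02)
    for _ in range(n_experts)
]
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_layer(x):
    """Route each token to its top_k experts and mix their outputs."""
    logits = x @ router                      # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        chosen = top[t]
        weights = np.exp(logits[t, chosen])
        weights /= weights.sum()             # softmax over the chosen experts only
        for w, e in zip(weights, chosen):
            w_in, w_out = experts[e]
            h = np.maximum(x[t] @ w_in, 0)   # ReLU feed-forward
            out[t] += w * (h @ w_out)
    return out

tokens = rng.standard_normal((8, d_model))
y = moe_layer(tokens)

# Parameters that exist versus parameters that actually run per token:
total_params = n_experts * 2 * d_model * 4 * d_model
active_params = top_k * 2 * d_model * 4 * d_model
print(f"total expert params: {total_params:,}, active per token: {active_params:,}")
```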
This matters because it makes top-tier AI available to more people and companies. For example, one model offers a context window of nearly 200,000 tokens, enough to remember a very long conversation, at prices as low as $0.30 per million input tokens and $1.10 per million output tokens. It performs well on code writing and agent tasks, where the AI acts like a helper that calls tools or runs subtasks. Another has 230 billion parameters total but just 10 billion active, licensed freely so businesses can build on it without big fees.
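At those list prices, even a long-context agent call stays cheap. A quick back-of-envelope calculation, where the token counts are assumed for illustration:
```python
# Back-of-envelope cost for a single long-context agent call at the quoted
# list prices ($0.30 per million input tokens, $1.10 per million output tokens).
# The token counts below are illustrative assumptions, not measured usage.
input_price_per_m = 0.30
output_price_per_m = 1.10

input_tokens = 150_000    # e.g., a large codebase slice plus conversation history
output_tokens = 4_000     # the model's answer and tool-call arguments

cost = (input_tokens / 1e6) * input_price_per_m + (output_tokens / 1e6) * output_price_per_m
print(f"≈ ${cost:.3f} per call")   # ≈ $0.049, about five cents
```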
These models focus on practical stuff, not just sounding smart. They use tricks like sparse mixtures of experts to save compute, meaning the model routes work to the right specialist parts instead of burning power on all of them. Attention systems, which help the model focus on key details, now mix local and global views to handle long contexts without costs exploding. One setup alternates layers, doing cheap windowed math most of the time instead of full scans whose cost grows quadratically with context length.
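The difference between quadratic and windowed attention is easy to see with numbers. A rough comparison, where the window size and the local-to-global layer mix are assumptions, not a published configuration:
```python
# Rough comparison of attention work per layer: full attention scales with the
# square of context length, sliding-window attention scales linearly.
# The window size and the 3-local-to-1-global layer mix are illustrative
# assumptions, not any specific model's published configuration.
context = 128_000       # tokens in context
window = 4_096          # local attention window (assumed)

full_pairs = context * context                 # query-key pairs, full attention
local_pairs = context * window                 # query-key pairs, windowed attention

print(f"full attention: {full_pairs:,} pairs per layer")
print(f"windowed:       {local_pairs:,} pairs per layer")
print(f"ratio:          {full_pairs / local_pairs:.0f}x fewer pairs per local layer")

# With, say, 3 local layers for every 1 global layer, the average cost per layer:
mix = (3 * local_pairs + 1 * full_pairs) / 4
print(f"3:1 local/global mix is {full_pairs / mix:.1f}x cheaper than all-global layers")
```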
Throughput is a big deal too: that's how many tokens the model spits out per second. Some claim 100 tokens per second even at 128,000-token context lengths, beating others that slow to 33 on the same hardware. That turns AI into a tool for agents, systems that break a big goal into steps, like swarms of up to 100 mini agents making 1,500 tool calls and finishing 4.5 times faster than a single agent.
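Those throughput figures translate directly into wall-clock time for agent workloads. A rough estimate, assuming an average output size per tool call:
```python
# What a 3x throughput gap means in wall-clock terms for an agent workload.
# Tokens generated per tool call is an assumption for illustration.
tool_calls = 1_500
tokens_per_call = 500          # assumed average output per call (reasoning + arguments)

total_tokens = tool_calls * tokens_per_call

for label, tps in [("fast model", 100), ("slow model", 33)]:
    serial_hours = total_tokens / tps / 3600
    print(f"{label}: {serial_hours:.1f} h of pure generation if run serially")

# A swarm that parallelizes the work and finishes 4.5x faster than a single agent:
single_agent_hours = total_tokens / 100 / 3600
print(f"swarm at 4.5x speedup: {single_agent_hours / 4.5:.1f} h")
```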
But there are catches. Benchmarks, the tests that score these models, are getting unreliable. One popular coding test is saturated, meaning scores are maxed out and contaminated by data leaks, and it sometimes rejects good answers. Labs suggest switching to harder versions for real insights. Also, not all open models are fully free for business: some carry noncommercial licenses that block money-making uses.
The big implication is that AI power is spreading out. Weights are transferable, so anyone can start from a strong base and compete on add-ons like custom tuning or apps. This weakens strict safety rules based on precaution, because if one group slows down, others just grab the open weights and keep going. Unilateral pauses don't stop progress; they shift it elsewhere, as seen in labs tweaking their policies under competitive pressure.
Part 2, Examining Nvidia’s Breakthroughs
Nvidia's new Rubin platform answers these model trends by rethinking hardware at the rack level: a whole shelf of computers working as one. It's not about one super chip anymore; it's about integrating everything for better bandwidth, easier assembly, and real-world speed, especially for inference, which means running trained models on new data.
Compute jumps aren't even across the board. Low-precision formats like FP4 and FP8 get a 3.5 times boost in operations per second over the last generation, while FP16 only rises 1.6 times. This comes from shrinking to a 3nm process, growing the processing unit count from 160 to 224, widening the cores for sparse math, and cranking clocks to 2.38 GHz. Nvidia is betting that future AI leans on these efficient, lower-precision modes for most work, like quick inferences inside agents.
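The quoted gains can be roughly decomposed into where they come from. Attributing the leftover FP16 factor to clocks and per-unit tweaks is an assumption; Nvidia doesn't publish the breakdown in this form:
```python
# Decomposing the headline speedups into rough factors, using only the numbers
# quoted above. Assigning the leftover FP16 factor to clocks and per-unit changes
# is an assumption; the real breakdown depends on microarchitecture details.
units_prev, units_new = 160, 224
fp16_gain = 1.6
lowprec_gain = 3.5

unit_factor = units_new / units_prev            # 1.40x from more processing units
other_fp16 = fp16_gain / unit_factor            # ~1.14x left over (clocks, per-unit tweaks)
lowprec_extra = lowprec_gain / fp16_gain        # ~2.19x extra only on the FP4/FP8 paths

print(f"more units:          {unit_factor:.2f}x")
print(f"clocks/other:        {other_fp16:.2f}x")
print(f"wider FP4/FP8 paths: {lowprec_extra:.2f}x on top of the FP16 gain")
```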
They sell it in token terms, promising up to 10 times cheaper cost per token and 4 times fewer GPUs for training sparse expert models. Bandwidth steals the show: memory stays at 288GB, but bandwidth hits 22 terabytes per second, 2.75 times faster, though early batches might top out at 20 if parts lag. Why? Long contexts and expert routing eat bandwidth, both for fast prefill (loading the input) and for steady output generation, even if the memory capacity itself doesn't change.
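Here's a rough way to see why bandwidth, not capacity, sets the decode speed limit: at small batch sizes, every generated token has to stream the active weights through memory once. The weight precision and the omission of KV-cache traffic below are simplifying assumptions:
```python
# Why decode speed is bandwidth-bound: at batch size 1, each generated token
# requires streaming the active weights through the memory system once.
# FP4 weights and the exclusion of KV-cache traffic are simplifying assumptions.
active_params = 13e9               # active parameters per token (sparse MoE, from Part 1)
bytes_per_param = 0.5              # FP4 weights (assumed)
bandwidth = 22e12                  # bytes/s (22 TB/s)

bytes_per_token = active_params * bytes_per_param
ceiling_tps = bandwidth / bytes_per_token
print(f"theoretical decode ceiling: {ceiling_tps:,.0f} tokens/s per accelerator")

# The same model on 8 TB/s memory, everything else equal:
print(f"on 8 TB/s memory:           {8e12 / bytes_per_token:,.0f} tokens/s")
```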
Networking upgrades match this. Rack bandwidth doubles, with 36 switch chips per rack, each at 28.8 terabytes per second, built on 400G-class technology running at higher rates, plus 3.6 teraflops of in-network math to cut jitter. In AI with lots of routing and huge inputs, the connections become part of the brain, not just wires.
The hidden wins are in building and servicing these systems. Rubin ditches internal cables for modular plugs, cutting assembly from two hours to five minutes. Service time drops by up to 18 times, from over 90 minutes to under five. That lets Nvidia scale production, but it narrows the builder pool to a few highly automated firms, creating a moat against copycats.
Not everything's perfect. Past sparsity tricks, like fixed 2:4 structured sparsity, flopped because they hurt accuracy and were too rigid. Rubin's adaptive approach depends on natural patterns in the data, and experts doubt it will hit peak rates on every model. Real sustained speed on actual tasks counts more than lab highs.
Compared to older Nvidia gear, Rubin feels like a direct response to open models’ needs. Where before the focus was raw power, now it’s efficiency for long runs and agents, aligning hardware with software shifts.
Part 3, Economic Impact
Howard Marks, an investor at Oaktree Capital, wrote a memo called AI Hurtles Ahead to make sense of these tech changes for investors. He sees AI growing quickly, with real demand but risks of overspending. His framework splits AI into levels to show the market potential.
Level 1 is basic chat, saving time on thinking or lookups. Level 2 adds tools, so the AI searches, crunches data, and does specific jobs you tell it to. Level 3 is full agents: you give a goal and rules, and it iterates on its own to deliver finished work. Marks argues that the jump to level 3 expands the market from helpful tools to trillions of dollars in value, because it replaces whole tasks people do.
He points out a speed-up loop, where AI helps build better AI. One model release noted it debugged its own training and tests. Another lab leader says AI writes most of their code now, with the cycle revving up monthly, maybe reaching self-building AIs in one to two years. That means falling behind costs more as things accelerate.
But Marks calls out fuzzy numbers. He mentions 400 million people and 75 to 80 percent of companies using AI, but without defining “using,” it’s vague. Is it a quick try or daily workflows? Better data comes from token counts, like one platform doubling from 6.4 trillion to 13 trillion weekly tokens in a month. That’s actual usage driving hardware buys.
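A quick sanity check on what that doubling implies as a growth rate; extrapolating it forward is illustrative, not a forecast:
```python
# What "doubling weekly token volume in a month" implies as a growth rate.
# The naive extrapolation below is illustrative only, not a prediction.
start, end = 6.4e12, 13e12     # weekly tokens, one month apart
monthly_growth = end / start - 1
print(f"monthly growth: {monthly_growth:.0%}")   # ~103%

# Compounded naively over a year (nobody expects this to hold, which is exactly
# why real demand has to be checked against the infrastructure build-out):
implied_year = start * (end / start) ** 12
print(f"naive 12-month extrapolation: {implied_year:.2e} weekly tokens")
```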
On spending, he splits training costs, which are bets on the future, from inference costs, which respond to demand that exists today. But he warns that booms often build too much, hurting returns, and that some of the early money is just AI firms passing cash among themselves before real customers pay up. That keeps investors grounded, separating exciting stories from smart bets.
Marks’s take highlights how tech like open models and efficient hardware feeds into bigger economics, but with traps for those chasing hype.
Part 4, Tying It All Together
Putting these pieces side by side shows a clear shift. Open LLMs make frontier AI cheap and widespread, focusing on efficiency for agents and long tasks, with metrics like 100 tokens per second or $0.30 per million inputs proving it. Nvidia’s Rubin builds the hardware backbone, prioritizing bandwidth at 22 terabytes per second and quick assembly to handle those demands at scale. Marks translates this to investments, warning that real growth, like doubling token volumes, can coexist with overbuilt infrastructure that wastes money.
The thread running through all of it is that inference economics rule now: cost per task completed, not just training size. This democratizes AI, but it also exposes flaws in heavy precaution, because open weights mean progress happens globally anyway, so holding back just lets others lead. For builders, the win is shipping to learn from real use. For investors, the move is to watch measurable demand, like token surges, against capex risk. Overall, AI's bottleneck is affordability at scale, and those who adapt to it will shape the next wave.

