The Bleeding Edge
The near future of AI is weirder and faster than you think
“I almost want to reject the question entirely... I see this as an extension of computing... It’s still the decade of agents, not the year.” [On clarifying the timeline for AI agents and cautioning against short-term hype about breakthrough years in AI development]
“Reinforcement learning is terrible. It’s like sucking supervision through a straw.”
“The march of nines in self-driving AI shows why demos can be unimpressive despite hard work, as improving reliability from 90% to 99.9% takes as long as all prior progress combined.”
“AGI is still at least a decade away. The problems are tractable and surmountable but still difficult.”
“Overall, the models are amazing but still need a lot of work. For now, autocomplete is my sweet spot. Current AI agents are slop, not amazing.”
“AGI progress will be gradual, blending into steady ~2% GDP growth over centuries rather than sudden disruption. Ten years should still be considered an optimistic timeline.”
“I’m very unimpressed by demos. Polished demonstrations rarely capture real-world complexity.”
— Andrej Karpathy, 2025
The story everyone tells is simple. Scaling laws, training runs, sky-high inference costs, and value accruing to the hyperscalers who can raise the capital.
That story is already out of date. A new stack is forming around efficiency and cost per token.
With fresh training objectives that change what a model knows, we are seeing the rise of vertically integrated, tiny reasoning engines that do one thing well.
This is not, and will never be, AGI or superintelligence.
But if you care about capability per watt, latency under real workloads, and the next wave of products, this matters.
Autoregressive transformers will remain the backbone, yet they’re being surrounded by hybrid attention layers, diffusion style generation for text, world model style mid-training for code, and small recursive reasoners that punch above their size. This mix pushes us toward cheaper long context, faster tokens, more reliable tool use, and a more modular agent stack.
Attention hybrids that bend the cost curve
Quadratic attention is the standard self-attention mechanism in transformer models. Every token must be compared with every other token to compute attention scores, so compute scales as O(n²), where n is the sequence length. That makes long sequences expensive: double the context and the attention cost roughly quadruples.
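To make the curve concrete, here is a back-of-the-envelope sketch. The formula is the standard 2·n²·d estimate for the score and value products; the head dimension of 128 is an illustrative assumption, not a claim about any particular model.

```python
def attention_flops(n_tokens: int, d_head: int = 128) -> int:
    """Rough FLOP count for one self-attention layer over n_tokens.

    Both the QK^T score matrix and the softmax(scores) @ V product
    touch an n x n matrix, so the total grows quadratically in n.
    """
    scores = 2 * n_tokens * n_tokens * d_head           # QK^T
    weighted_values = 2 * n_tokens * n_tokens * d_head  # softmax(scores) @ V
    return scores + weighted_values

# Doubling the context roughly quadruples the attention compute.
for n in (4_096, 8_192, 16_384):
    print(f"{n:>6} tokens -> {attention_flops(n) / 1e9:,.0f} GFLOPs")
```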
The new play is to bend that curve without tanking accuracy. Teams are doing this by mixing cheaper linear-style blocks with full attention blocks in a fixed rhythm, often three linear blocks for every full one.
The linear blocks maintain a compact recurrent state rather than a growing key-value cache, so memory stays flat as context grows. The full blocks keep global context fresh enough to avoid a collapse in reasoning quality.
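A minimal sketch of that rhythm, assuming a 3:1 mix. The block classes and the size arithmetic are illustrative stand-ins, not any vendor's actual implementation, but they show why one kind of memory stays flat while the other grows with context.

```python
class LinearBlock:
    """Keeps a fixed-size recurrent state; memory does not grow with context."""
    def __init__(self, d_model: int):
        self.state_size = d_model * d_model  # constant, whatever the sequence length

class FullAttentionBlock:
    """Keeps a KV cache; memory grows linearly with context length."""
    def __init__(self, d_model: int):
        self.d_model = d_model
    def cache_size(self, n_tokens: int) -> int:
        return 2 * n_tokens * self.d_model   # keys + values, per token seen

def build_stack(n_layers: int, d_model: int):
    """Three linear blocks for every full-attention block."""
    return [
        FullAttentionBlock(d_model) if (i + 1) % 4 == 0 else LinearBlock(d_model)
        for i in range(n_layers)
    ]

stack = build_stack(n_layers=48, d_model=2048)
print(sum(isinstance(b, LinearBlock) for b in stack), "linear /",
      sum(isinstance(b, FullAttentionBlock) for b in stack), "full attention")
```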
You’ve likely seen names like Qwen3 Next and Kimi Linear (both Chinese models).
Both use a variant called Gated DeltaNet in most layers, then drop back to heavier full attention at regular intervals. The gate is not a cosmetic detail. It stabilizes training and lets the model scale features up or down on the fly, which matters during long-context replay where numerical issues usually lurk.
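For intuition, here is a single-head numpy sketch of a gated delta-rule update of the kind Gated DeltaNet builds on. The dimensions, gate values, and key normalization are simplifying assumptions, and production implementations chunk this recurrence for hardware efficiency rather than looping token by token.

```python
import numpy as np

def gated_delta_step(S, k, v, q, alpha, beta):
    """One recurrent step of a simplified gated delta rule.

    S     : (d_v, d_k) fixed-size state, the linear block's "memory"
    k, q  : (d_k,) key / query for the current token
    v     : (d_v,) value for the current token
    alpha : decay gate in (0, 1], scales the old state down
    beta  : write strength in (0, 1], how hard to overwrite

    The state never grows, which is why memory stays flat as context grows.
    """
    d_k = k.shape[0]
    # Delta rule: partially erase what the state already stores for this key...
    S = alpha * S @ (np.eye(d_k) - beta * np.outer(k, k))
    # ...then write the new key -> value association.
    S = S + beta * np.outer(v, k)
    return S, S @ q

rng = np.random.default_rng(0)
d_k, d_v = 64, 64
S = np.zeros((d_v, d_k))
for _ in range(1000):                          # stream a thousand tokens
    k, q = rng.standard_normal(d_k), rng.standard_normal(d_k)
    k /= np.linalg.norm(k)                     # normalized keys keep the update stable
    v = rng.standard_normal(d_v)
    S, out = gated_delta_step(S, k, v, q, alpha=0.98, beta=0.5)
print(S.shape)                                 # state size unchanged: (64, 64)
```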
The result is simple to state. Throughput rises, the KV cache footprint falls, and you can push context into the hundreds of thousands of tokens while preserving quality that is close enough for many workloads.
The deeper point is architectural. We are leaving the era of one attention rule to govern them all, and moving to attention as a menu, chosen per layer to hit a latency, memory, and accuracy target.
What this unlocks
Long transcripts, million-token retrieval, and agent memory all shift from exotic to routine. Latency budgets shrink on consumer hardware. Pricing can shift from blunt per-token rates to class-based pricing that rewards linear layers. This aligns vendor economics with user value.
Where the tradeoffs hide
Linear blocks compress chat history into a fixed-size state. Efficient, but not free. If your task needs crisp, long-range token interactions on every step, you still want pockets of full attention. That is exactly why the best current systems mix the two.
You can also break multi-turn agent loops if your long-context math goes unstable. Keeping it stable is the unlock that decides whether your agent derails on the hundredth tool call.
Diffusion for text that writes in parallel