Artificial intelligence has taken the world by storm since ChatGPT was released in late 2022. But “artificial intelligence” has become a catch-all term for the LLM (large language model) architecture.
But are LLMs really here to stay? Or will we see a diversification of model types and architectures emerge to tackle problems like accuracy and generalized intelligence?
What is an LLM?
A large language model is a neural network software program that is pre-trained on terabytes of data to create a lossy compression of the world’s information.
It maps relationships between words which are stored as vectors, and uses these vectors to understand the context of a query, and predict how best to answer it.
To learn more, I explain in simpler terms and greater detail → here.
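To make the vector idea concrete, here is a toy sketch in Python. The words and three-dimensional vectors are invented for illustration; real models learn embeddings with thousands of dimensions from their training data.

```python
import numpy as np

# Toy word embeddings. Real models learn vectors with thousands of
# dimensions; these hand-picked 3-D values are purely illustrative.
embeddings = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.9, 0.2]),
    "apple": np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(a, b):
    """Vectors pointing the same way score near 1; unrelated near 0."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high (~1.0)
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # low (~0.3)
```

It is this kind of geometric closeness that lets a model treat “king” and “queen” as related concepts when predicting what comes next.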
What’s Next?
LLMs are an incredibly interesting piece of technology. But as explored in my previous article, the tech is still struggling with some challenges.
Unfortunately two competing statements can be true:
AI is a major platform / tech shift and will change the world
AI is in a bubble
—Clouded Judgement
There are three main reasons for this:
We’re waiting for highly accurate AI agents (though Devin gets close)
We’re waiting for AI products to have effective memory.
The 90/10 problem (accurate 90% of the time, wrong 10%)
The Scaling Laws
The Scaling Laws are why we have LLMs in the first place. They’re almost the reverse of Moore’s Law (which holds that the number of transistors on a chip doubles roughly every two years, making chips smaller and exponentially more powerful): instead of shrinking, models improve by growing.
Every LLM has a certain number of weights and biases (called parameters). The Scaling Laws say that as a model gets larger, its performance improves in a smooth, predictable way that follows a power-law relationship.
In other words, the bigger it gets, the better it is.
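To make the power law concrete, here is a minimal sketch in Python. The constants roughly follow the parameter-scaling fit reported by Kaplan et al. (2020); treat them as ballpark illustrations, not a forecast for any particular model.

```python
# Illustrative power-law scaling: test loss falls smoothly and
# predictably as parameter count N grows: L(N) = (N_c / N) ** alpha.
# Constants are ballpark values in the spirit of Kaplan et al. (2020).
ALPHA = 0.076   # scaling exponent for model size
N_C = 8.8e13    # "critical" parameter count from the fit

def predicted_loss(n_params: float) -> float:
    """Predicted test loss for a model with n_params parameters."""
    return (N_C / n_params) ** ALPHA

for n in (7e9, 70e9, 1e12):  # 7B, 70B, and 1T parameters
    print(f"{n:.0e} params -> loss {predicted_loss(n):.2f}")
```

Note what the power law implies: every 10× increase in size shaves off the same modest fraction of the loss (about 16% with this exponent), which is exactly why diminishing returns eventually loom.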
To hear Microsoft CEO Satya Nadella explain this phenomenon, click → here.
But will we reach a point of diminishing marginal returns?
Llama-3 has tens of billions of parameters, and GPT-4 reportedly has more than a trillion.
Will there come a time when building bigger models only achieves negligibly better results, and isn’t worth the capex?
Sam Altman hinted on the 20VC podcast that with GPT-5 we may have already reached this point, and that there is little reason to keep building bigger and bigger models, because it would just turn into an arms race with no end in sight.
But then again, he may have just been saying this to throw others off the scent, as suggested by Brad Gerstner on the BG2 podcast.
And with open-source models like Mixtral and Llama-3 close to achieving parity with proprietary models like GPT-4 and Claude-3, it appears much of the value capture will occur at the application layer moving forward, not at the foundational layer.
Feel free to go crazy in the comments with your own conjectures.
So if the application layer is where the value capture will happen, how can a company building models differentiate?
Better Data
Every foundational language model worth its salt has already ingested and integrated the entire open internet into its neural network.
As the models continue to scale, there will be no significant differentiation, unless the model creators can train theirs on higher quality data sets.
This is where the value capture will happen, given the current environment. Even Mistral’s CEO, Arthur Mensch, a champion of the open-source movement, said on the 20VC podcast that they are creating a proprietary model.
They will have a variety of open source models for developers to utilize, and at least one proprietary model that will be sufficiently differentiated so that people will pay for it.
Google is already attempting to do this with its recent data-licensing deal with Reddit. xAI has the Twitter data. And a number of tech companies have begun hiring creatives specifically to produce higher-quality, differentiated data for models to train on.
Character AI, though it’s something of a toy, is a master class in this. It has over 1,000 fine-tuned “bots” that are extremely good at mimicking any style of communication or persona, and can return an output with very low latency.
High-quality data, and the compute necessary for both training and inference, will be the main constraints on the continued scaling and improvement of these models.
But will it be enough?
Will higher quality data and lower compute costs be enough to solve the problems listed above?
Agency, memory, and factuality are hard problems, and perhaps the LLM is not even the best architecture to solve them.
Yann LeCun, Meta’s Chief AI Scientist, argues that, counterintuitively, the vast majority of the world’s knowledge is not contained in text.
A four-year-old child has been awake for about 16,000 hours: 16,000 hours × 3,600 seconds per hour × 1,000,000 optic nerve fibers × 2 eyes × ~10 bytes per second ≈ 1e15 bytes of data (roughly a petabyte, or 1,000 terabytes).
In 4 years, a child has seen 50 times more data than the biggest LLMs were trained on. Text is simply too low-bandwidth and too scarce a modality to learn how the world works.
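Taking LeCun’s figures at face value, the back-of-the-envelope arithmetic checks out:

```python
# Sanity check of LeCun's back-of-the-envelope estimate.
hours_awake = 16_000              # ~4 years at ~11 waking hours/day
seconds_awake = hours_awake * 3_600
fibers = 1_000_000 * 2            # ~1M optic nerve fibers per eye
bytes_per_fiber_per_second = 10

total = seconds_awake * fibers * bytes_per_fiber_per_second
print(f"{total:.2e} bytes")       # ~1.15e15, i.e. about a petabyte
```

A petabyte of visual input against the tens of terabytes of text behind today’s biggest models is where the rough factor of 50 comes from.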
This is why most LLM companies are now focused on multi-modality: training their models on images, text, audio, and video in order to better understand the world.
Figure AI is even embodying ChatGPT in a robot body so it can experience the world firsthand, so I guess we can add tactile data to the list.
Which makes for a nice segue into JEPA.
JEPA Models
JEPA stands for “Joint Embedding Predictive Architecture”. JEPA is a different kind of predictive machine-learning model, one more traditionally associated with robotics: instead of predicting raw data such as the next word or pixel, it makes its predictions in an abstract embedding space.
JEPA is really good at helping robots map and understand the world so that they can learn how to interact within it.
LLMs, on the other hand, are incredibly good at understanding relationships between complex, abstract concepts.
As a result, JEPA models might prove to be better at understanding the reality of the world we live in from first principles, and therefore, understanding how to make higher level, multi-step decisions.
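For the technically curious, here is a heavily simplified JEPA-style sketch in Python (PyTorch). The dimensions and layers are invented for illustration, and this is my own toy framing rather than Meta’s actual implementation; the one essential idea it does capture is that the prediction loss lives in embedding space, not in pixel or token space.

```python
import torch
import torch.nn as nn

DIM = 128  # illustrative embedding size

# Two encoders and a predictor; the architectures are placeholders.
context_encoder = nn.Sequential(nn.Linear(784, DIM), nn.ReLU(), nn.Linear(DIM, DIM))
target_encoder  = nn.Sequential(nn.Linear(784, DIM), nn.ReLU(), nn.Linear(DIM, DIM))
predictor       = nn.Sequential(nn.Linear(DIM, DIM), nn.ReLU(), nn.Linear(DIM, DIM))

def jepa_loss(context_view: torch.Tensor, target_view: torch.Tensor) -> torch.Tensor:
    """Predict the target's *embedding* from the context view.

    The model never reconstructs raw pixels, only abstract
    representations. (In systems like Meta's I-JEPA, the target
    encoder is a moving average of the context encoder.)
    """
    z_context = context_encoder(context_view)
    with torch.no_grad():                        # target encoder gets no
        z_target = target_encoder(target_view)  # gradient from this loss
    return nn.functional.mse_loss(predictor(z_context), z_target)

# Two views of the same scene (e.g. visible vs. masked image patches),
# flattened into vectors for this toy example.
ctx, tgt = torch.randn(8, 784), torch.randn(8, 784)
print(jepa_loss(ctx, tgt))
```

Predicting in embedding space lets the model ignore unpredictable surface detail (every pixel of every leaf) and focus on the structure of the scene, which is what you want for planning multi-step actions.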
Think of it this way:
If you asked an LLM “please book me a ticket to France,” it would probably tell you that it can’t help with that.
But it would be happy to tell you what the most reliable airlines are for a trip to France this time of year, and what you should do when you get there, and where you should stay based on reviews it found on the web.
A properly trained JEPA model would understand how to get to France. It would understand you would have to get to the airport, you’d have to have a car waiting for you when you get there, you’d have to have somewhere to stay, and so on.
And knowing how to do all that, it could talk to the AI agents that represent the airlines, then the ones for Uber or Lyft, and then the ones that represent the Waldorf.
And after asking you a few questions about your preferences, a JEPA model could book you a car, a flight, and a hotel room, and send the confirmations to your email for your review and alterations if need be.
So perhaps in the future, JEPA models will be able to solve the agency and factuality problems LLMs have, while LLMs will be able to supplement them with the memory, personalization, and abstraction features that JEPA models are not known for.
Diffusion Models
How did I create David? It was simple. I just chipped away everything that was not David.
–Michelangelo
Diffusion models are yet another type of AI architecture, and they’re especially good at generating natural images and video.
Diffusion models start with a simple distribution, typically Gaussian noise. Think of it like snow on an old TV set.
Then they apply a series of learned transformations to gradually turn the static into the desired complex data distribution (e.g. a natural image).
So in a way, a diffusion model is just like Michelangelo. It starts with a block of marble, and strips away everything that is not David, until it has succeeded in creating what the user asked for.
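For intuition, here is a minimal sketch in Python of the forward half of that process, the part that buries the “statue” in “marble”. The linear noise schedule and toy 8×8 “image” are illustrative; real systems like Stable Diffusion train a large neural network to run this process in reverse.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)      # noise added at each step
alpha_bars = np.cumprod(1.0 - betas)    # fraction of signal surviving

def noisy_at_step(x0: np.ndarray, t: int) -> np.ndarray:
    """Jump straight to step t of the forward (noising) process."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * noise

x0 = rng.standard_normal((8, 8))        # stand-in for a clean image
for t in (10, 500, 999):
    xt = noisy_at_step(x0, t)
    corr = np.corrcoef(x0.ravel(), xt.ravel())[0, 1]
    print(f"step {t:4d}: correlation with original = {corr:.2f}")
```

By the last step almost nothing of the original survives; generation is the learned reverse walk, chipping the noise away step by step until an image remains.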
Dall-E, Stable Diffusion, and Midjourney are all diffusion models.
And recently on the No Priors podcast, OpenAI team leads Aditya Ramesh, Tim Brooks, and Bill Peebles confirmed that Sora is a diffusion model that combines the generative power of the diffusion architecture with the sequence-modeling power of the Transformer, which is an integral part of LLM architecture.
The future of multi-modal understanding may lie in the fusion of Michelangelo-esque diffusion models with Transformer-based LLM architectures and their Pavlovian-esque reinforcement learning from human feedback.
Conclusion
This is such an exciting and active area of research, and no one knows where it’s going to go.
But one thing is for sure: this type of artificial intelligence is a new modality of computing, one that will affect every aspect of how we do business and interact with the world.
If this seems like a bold claim, consider the following example. Every major platform shift in the 20th century meant the new medium swallowed the old one.
Television swallowed radio and integrated it into its content. The internet swallowed radio and television and integrated them into its content. And now LLMs have swallowed the internet and integrated it into their content.
AI technology will simply become a new way that we interact with the world, in the same way that mobile phones became a new way that we interact with the world.
Everyone will have an AI assistant, just as Bill Gates proclaimed there would be a computer on every desk and in every home.
And your AI assistant will keep your calendar, interact with your friends’ and co-workers’ assistants, and keep you safe from phishing and deep fake attacks.
Just as the steam engine helped thwart the pessimistic predictions of Thomas Malthus, artificial intelligence technology is the steam engine of the mind, and by leveraging it, we can create an even better world than the one we live in today.
“We live in the best of all possible worlds,” said Doctor Pangloss.
“Yes,” replied Candide, “but we must cultivate our own gardens.”
–Candide, by Voltaire, published 1759