Generative AI has been dominated by transformer models for years, underpinning advanced systems such as OpenAI’s Sora for video generation and text generators like Anthropic’s Claude, Google’s Gemini, and OpenAI’s GPT-4. However, transformers are hitting technical roadblocks, particularly around computational efficiency.
Transformers excel in many areas but struggle with processing vast amounts of data efficiently on standard hardware. This inefficiency leads to rising power demands, posing sustainability challenges as companies scale their infrastructure to support these models.
A new architecture, Test-Time Training (TTT), has emerged as a promising alternative. Developed over 18 months by researchers from Stanford, UC San Diego, UC Berkeley, and Meta, TTT models claim to process more data than transformers while consuming significantly less compute power.
A key component of transformers is the “hidden state,” essentially a long list of data that the model uses to remember what it has processed. This hidden state grows as the model processes more data, becoming computationally demanding. For instance, generating a single word about a book a transformer has read requires scanning through its entire hidden state, akin to rereading the book.
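To make that cost concrete, here is a minimal, illustrative sketch (not the researchers’ code) of how a transformer-style cache of past tokens grows without bound, and why producing each new word means attending over everything that came before. The array names, dimensions, and the single-head attention setup are hypothetical simplifications.

```python
import numpy as np

d = 64  # hypothetical embedding size

def attend(query, keys, values):
    """Single-head attention over everything cached so far:
    the per-step cost grows with the number of cached tokens."""
    scores = keys @ query / np.sqrt(d)       # one score per cached token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values                  # weighted sum over the whole cache

keys, values = [], []                        # the growing "hidden state"
for step in range(1000):                     # e.g. 1,000 tokens of a book
    x = np.random.randn(d)                   # stand-in for the next token's embedding
    keys.append(x)
    values.append(x)
    out = attend(x, np.stack(keys), np.stack(values))
# After N tokens the cache holds N entries, and every new token rescans all of them.
```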
TTT models tackle this by replacing the hidden state with a machine learning model. This model, unlike a transformer’s lookup table, doesn’t expand as more data is processed. Instead, it encodes the data into representative variables called weights, maintaining a consistent model size regardless of the data volume.
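A rough sketch of that idea, based on the public description of the work rather than the authors’ actual formulation: the hidden state is the weight matrix of a small inner model, and each incoming token nudges those weights with a gradient step on a simple self-supervised loss, so the state stays the same size no matter how long the input gets. The reconstruction loss, learning rate, and dimensions below are simplified stand-ins.

```python
import numpy as np

d, lr = 64, 0.01          # hypothetical dimensions and learning rate
W = np.zeros((d, d))      # the hidden state: a fixed-size weight matrix

def update(W, x):
    """Absorb one token by taking a gradient step on a toy
    reconstruction loss ||W x - x||^2 (a simplified stand-in)."""
    grad = 2 * np.outer(W @ x - x, x)
    return W - lr * grad

for _ in range(1000):     # stream 1,000 tokens
    x = np.random.randn(d)
    W = update(W, x)      # the state never grows: it is always d x d
```

The point of the sketch is the contrast with the previous one: here, processing a new token costs the same whether it is the tenth token or the millionth.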
Yu Sun, a post-doctoral researcher at Stanford and a contributor to the TTT research, explains, “If you think of a transformer as an intelligent entity, then the lookup table — its hidden state — is the transformer’s brain. This brain enables the well-known capabilities of transformers such as in-context learning.” But that same brain is also what holds transformers back: generating a response means processing the entire lookup table.
Sun and his team’s TTT models aim to overcome these limitations. Because their internal machine learning model doesn’t grow with more data, TTT models stay highly efficient. This efficiency could enable them to handle billions of pieces of data, from text to images and videos, far beyond what current models can manage.
“Our system can generate X words about a book without the computational complexity of rereading the book X times,” Sun said. “Current large video models based on transformers can process only 10 seconds of video because they rely on a lookup table. Our goal is to develop a system capable of processing long videos, akin to human visual experience.”
Despite their promise, TTT models aren’t a drop-in replacement for transformers. The research is still in its early stages: the team has built only two small models for initial study, which makes comparisons against larger transformer implementations difficult.
Mike Cook, a senior lecturer at King’s College London’s department of informatics, who was not involved in the TTT research, commented, “It’s an interesting innovation. If the data backs up the claims of efficiency gains, that’s great. But it’s too early to say if it’s better than existing architectures. An old professor of mine joked that adding another layer of abstraction solves any problem in computer science. Adding a neural network inside a neural network reminds me of that.”
Nevertheless, the pace of research into transformer alternatives is accelerating, reflecting a growing recognition of the need for breakthroughs in AI efficiency. This week, AI startup Mistral released a model called Codestral Mamba, based on state space models (SSMs), another potential alternative to transformers. SSMs, like TTT models, promise greater computational efficiency and scalability.
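For comparison, an SSM also keeps a fixed-size state, but updates it with a linear recurrence rather than either attending over a growing cache or taking gradient steps. Below is a minimal, illustrative recurrence in the same spirit as the earlier sketches; the matrices are random placeholders, not a trained Mamba model.

```python
import numpy as np

d_state, d_in = 16, 64                        # hypothetical state and input sizes
A = np.random.randn(d_state, d_state) * 0.01  # placeholder state-transition matrix
B = np.random.randn(d_state, d_in) * 0.01     # placeholder input matrix
C = np.random.randn(d_in, d_state) * 0.01     # placeholder output matrix

h = np.zeros(d_state)                         # fixed-size recurrent state
for _ in range(1000):                         # stream 1,000 tokens
    x = np.random.randn(d_in)
    h = A @ h + B @ x                         # state update: constant cost per token
    y = C @ h                                 # output for this token
```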
AI21 Labs and Cartesia are also exploring SSMs. Cartesia, in particular, has pioneered some of the earliest SSMs, including Codestral Mamba’s predecessors, Mamba and Mamba-2.
Success in these efforts could make generative AI even more accessible and widespread, for better or worse. As these new architectures develop, they could drive significant advancements in how we process and generate data, opening up new possibilities in the AI landscape.