Mixture-of-Recursions delivers 2x faster inference: here's how to implement it

Researchers from KAIST AI and Mila have unveiled a novel Transformer architecture designed to enhance the memory and computational efficiency of large language models (LLMs). Known as Mixture-of-Recursions (MoR), this architecture greatly boosts model accuracy and throughput compared to standard Transformers, even when restricted by identical parameter counts and computing budgets.

The scaling challenges of LLMs

The remarkable abilities of current LLMs are intrinsically linked to their continuously growing size. However, as these models expand, their memory demands and computational needs often become impractical, posing challenges for organizations outside major data centers in terms of training and deployment. This has spurred a pursuit of more efficient design alternatives.

Efforts to increase LLM efficiency have largely concentrated on two strategies: parameter sharing and adaptive computation. Parameter sharing methods decrease the total number of distinct parameters by reapplying weights across different model sections, thereby lessening overall computational complexity. A prime example is “layer tying,” a technique that reuses a model’s weights over multiple layers. Adaptive computation techniques modify models to utilize only the necessary inference resources. For instance, “early exiting” dynamically allocates computing power by permitting the model to cease processing simpler tokens earlier within the network.

Nonetheless, developing an architecture that successfully integrates both parameter efficiency and adaptive computation remains a challenging endeavor.


How Mixture-of-Recursions works

Mixture-of-Recursions is a framework that merges parameter sharing with adaptive computation to address the substantial computational demands of LLMs. It expands upon the concept of Recursive Transformers, which are models that apply a set of shared layers repeatedly. Rather than using a deep stack of unique layers, a Recursive Transformer divides the model into several “recursion blocks,” each containing a shared pool of parameters. This structure enables more computation without enlarging the model.
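The weight-sharing idea behind Recursive Transformers can be illustrated with a minimal sketch. The layer here is a trivial stand-in (one "parameter" per layer), not the paper's architecture; the point is that a recursive model reuses one shared block N times, so its parameter count stays that of a single block while the compute matches an N-layer stack.

```python
# Toy sketch of Recursive Transformer weight sharing (illustrative,
# not the paper's implementation). A vanilla model stacks N distinct
# layers; a recursive model applies one shared block N times.

def make_layer(scale):
    """Stand-in for a Transformer layer with a single 'parameter'."""
    return {"scale": scale}

def apply_layer(layer, x):
    return [layer["scale"] * v + 1.0 for v in x]

def vanilla_forward(x, layers):
    # N distinct layers -> N sets of parameters.
    for layer in layers:
        x = apply_layer(layer, x)
    return x

def recursive_forward(x, shared_block, num_recursions):
    # One shared block reused num_recursions times -> 1 set of parameters,
    # but the same amount of computation as the vanilla stack.
    for _ in range(num_recursions):
        x = apply_layer(shared_block, x)
    return x

vanilla_params = 3    # three distinct layers
recursive_params = 1  # one shared block reused three times
out = recursive_forward([1.0, 2.0], make_layer(0.5), num_recursions=3)
```

The recursion count plays the role that depth plays in a vanilla stack, which is what lets MoR later vary it per token.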

MoR enhances this recursive strategy with two essential components. The first is a lightweight router that smartly assigns a specific recursion depth to each token. This concept mirrors the routing system in Mixture-of-Experts (MoE) models, where a router directs tokens to specialized expert networks. In MoR, however, the “experts” are the varying recursion depths, allowing the model to dynamically determine the amount of computation to apply to each token. It decides how many times a shared block of layers should be applied based on a token’s complexity, or its required “depth of thinking.” This ensures that computation is focused only where it is most necessary, avoiding unnecessary cycles on easily processed input parts.
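A hedged sketch of this per-token routing, with the router reduced to a scalar "complexity" score (the scoring rule and names here are illustrative assumptions, not the paper's mechanism): each token is assigned its own recursion depth, and the shared block is applied only until that depth is reached, so easy tokens exit early.

```python
# Illustrative MoR-style per-token depth routing. The router maps a
# token's 'complexity' score to a recursion depth; the shared block is
# then applied only while the token is still active.

def router(token_score, max_depth):
    """Map a scalar score in [0, 1] to a recursion depth >= 1."""
    return max(1, round(token_score * max_depth))

def shared_block(h):
    return h * 0.5 + 1.0  # stand-in for the shared Transformer block

def mor_forward(hidden, scores, max_depth=3):
    depths = [router(s, max_depth) for s in scores]
    out = []
    for h, d in zip(hidden, depths):
        for _ in range(d):  # easy tokens stop after fewer recursions
            h = shared_block(h)
        out.append(h)
    return out, depths

# Token 0 is "easy" (shallow depth), token 1 is "hard" (full depth).
hidden, depths = mor_forward([4.0, 4.0], scores=[0.2, 0.9])
```

In a real model the router would be a small learned network over the token's hidden state, and all tokens at the same recursion step would be batched together rather than looped over one by one.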

Mixture-of-Recursions (source: arXiv)

The second component is a more efficient key-value (KV) caching strategy. KV caching is a common technique that stores information from previous tokens to accelerate generation, but it becomes a memory bottleneck in recursive models. MoR introduces a “recursion-wise” KV caching mechanism that selectively stores and retrieves key-value pairs only for the tokens that are still active at a given recursion step. This targeted caching reduces memory traffic and enhances throughput without the need for complex, post-training alterations.
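The caching idea can be sketched in a few lines (a simplification of the paper's description, with placeholder strings standing in for key-value tensors): each recursion step keeps cache entries only for the tokens whose routed depth reaches that step, so the cache shrinks as tokens exit.

```python
# Illustrative recursion-wise KV caching. Rather than caching key-value
# pairs for every token at every recursion step, each step's cache
# holds entries only for tokens still active there.

def recursion_wise_cache(token_depths, max_depth):
    """Build one KV cache per recursion step, keyed by token index."""
    caches = []
    for step in range(1, max_depth + 1):
        # A token is active at this step if its routed depth reaches it.
        active = [i for i, d in enumerate(token_depths) if d >= step]
        caches.append({i: f"kv_{i}_step{step}" for i in active})
    return caches

# Four tokens routed to depths 1..3; deeper steps cache fewer tokens.
caches = recursion_wise_cache([1, 3, 2, 3], max_depth=3)
sizes = [len(c) for c in caches]  # cache size drops at deeper steps
```

This is why the mechanism reduces memory traffic: attention at a deep recursion step only reads and writes the small cache of still-active tokens.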

As stated by the researchers in their paper, “In essence, MoR enables models to efficiently adjust their thinking depth on a per-token basis, unifying parameter efficiency with adaptive computation.”

Different token routing and KV caching mechanisms for recursive Transformers (source: arXiv)

MoR in action

To evaluate their framework, the researchers trained MoR models with parameters ranging from 135 million to 1.7 billion and compared them against vanilla and standard recursive baseline models on validation loss and few-shot accuracy benchmarks.

The results show substantial improvements. When provided an equal training compute budget, an MoR model achieved higher average few-shot accuracy (43.1% vs. 42.3%) than a vanilla baseline despite utilizing nearly 50% fewer parameters. When trained on the same data volume, the MoR model decreased training time by 19% and reduced peak memory usage by 25% compared to the vanilla model.

The MoR architecture also demonstrates scalability. While it slightly lagged behind the vanilla model at the smallest 135M parameter scale, the gap quickly narrowed as the model size grew. For models with more than 360M parameters, MoR matched or surpassed the performance of standard Transformers, particularly on lower compute budgets. Moreover, MoR’s design significantly enhances inference throughput. One MoR configuration achieved a 2.06x speedup over the vanilla baseline. For a company operating at scale, this could lead to substantial operational cost savings.

Sangmin Bae, co-author of the paper and a PhD student at KAIST, detailed the practical impact in an email to VentureBeat. “While providing exact numbers is challenging, at a high level, reducing model parameter size and KV cache footprint enables us to perform inference on many more samples simultaneously,” he stated. “This results in processing a higher number of tokens at once, making it feasible to handle longer context windows.”

A practical path for enterprise adoption

Although the paper’s results stem from models trained from scratch, a crucial question for enterprises is how to adopt MoR without substantial initial investment. According to Bae, “uptraining” existing open-source models is a “definitely more cost-effective approach.” He mentioned that while creating a new model is straightforward, an “uptraining approach could be more suitable and efficient until the scalability of MoR itself is fully validated.”

Adopting MoR also provides new architectural “knobs” for developers, allowing them to fine-tune the balance between performance and efficiency. This trade-off will depend entirely on the application’s needs.

“For simpler tasks or scenarios, it may be advantageous to use models with more recursion steps, offering greater flexibility, and vice versa,” Bae explained. He emphasized that the “optimal settings will highly depend on the specific deployment setting,” encouraging teams to explore the trade-offs based on the paper’s findings.

Looking forward, the MoR framework is “modality-agnostic,” meaning its adaptive computation principles are not limited to text. This opens the door to significant efficiency gains in processing video, audio, and other complex data types.

“We’re very excited about its potential extension to multi-modality scenarios where efficiency gains are crucial,” Bae stated.

By dynamically adjusting the processing depth for each segment of a video or audio stream, MoR could unlock even greater cost savings and performance improvements, bringing the power of large-scale AI to a wider range of enterprise applications. As the paper concludes, MoR offers “an effective path towards achieving large-model capabilities with significantly reduced computational and memory overhead.”
