
RecurrentGemma: An Open Language Model For Smaller Devices

New research shows that Google's smaller RecurrentGemma model has about the same level of performance as larger LLMs, including Google's.
May 1st, 2024 12:30pm
Image via Unsplash+.

Large language models (LLMs) have been making a huge impact over the last couple of years, particularly with the emergence of tools like OpenAI’s ChatGPT. However, the mammoth size of LLMs, many of which now have billions (and sometimes trillions) of parameters, makes them too computationally heavy for devices like personal computers, smartphones and other smart devices.

These constraints might explain the growing interest in small language models (SLMs), as well as open LLMs like Google’s RecurrentGemma-2B, which was released a few weeks ago.

Based on Google’s novel Griffin architecture, RecurrentGemma is a more efficient, streamlined 2-billion-parameter version of the company’s line of open Gemma AI models. This makes it an excellent choice for applications that require real-time processing, such as translation or interactive AI use cases.

Building on Recurrent Neural Networks

Most significantly, RecurrentGemma’s underlying model architecture isn’t based on the transformer architecture that models like GPT-4 and BERT are built on.

A transformer is a type of deep learning model that is designed to process sequential data contextually, in order to handle text-based tasks like translation and summarization.


Transformers have revolutionized the field of natural language processing (NLP), but arguably their biggest drawback is that each new token must attend to every token that came before it, so memory and compute requirements grow with the length of the input. This means that most transformer-based LLMs are too resource-hungry for small devices like smartphones.
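To get a feel for why that matters, here is a back-of-the-envelope sketch of how a transformer’s key-value cache grows with input length. The layer and head dimensions below are illustrative assumptions, not the published configuration of any particular model.

```python
# Rough estimate of transformer KV-cache memory growth during generation.
# The layer/head dimensions are illustrative assumptions, not the published
# configuration of any specific model.

def kv_cache_bytes(seq_len, n_layers=18, n_kv_heads=1, head_dim=256, bytes_per_value=2):
    # Each processed token stores one key and one value vector per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len

for seq_len in (1_024, 8_192, 65_536):
    print(f"{seq_len:>6} tokens -> {kv_cache_bytes(seq_len) / 1e6:.1f} MB of KV cache")
```

The cache scales linearly with sequence length, which is exactly the cost a fixed-size recurrent state avoids.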

In contrast, RecurrentGemma is built on what are known as linear recurrences, a core component of recurrent neural networks (RNNs), as explained in the research team’s preprint paper.

In a recent blog post, Google explained that “RecurrentGemma is a technically distinct model that leverages recurrent neural networks and local attention to improve memory efficiency. While achieving similar benchmark score performance to the Gemma 2B model, RecurrentGemma’s unique architecture results in several advantages, [including] reduced memory usage, higher throughput, and research innovation.”

Hidden States and Local Attention

Before the emergence of transformers, RNNs were typically used to process sequential data by maintaining a “hidden state” that is continuously updated as each new piece of data is processed. In RecurrentGemma, this hidden state is combined with a “local attention” mechanism that lets the model attend to a fixed window of recent tokens, so it doesn’t need to store attention data for the entire sequence at every step (as a “global attention” mechanism would require).

“Although one can reduce the cache size by using local attention, this comes at the price of reduced performance,” noted the team in their paper. “In contrast, RecurrentGemma-2B compresses input sequences into a fixed-size state without sacrificing performance. This reduces memory use and enables efficient inference on long sequences.”
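In code terms, the difference looks something like the following toy sketch. It illustrates the general idea of a fixed-size recurrent state plus a bounded local-attention window; the decay factor and dimensions are made up for demonstration and this is not Griffin’s actual gated recurrence.

```python
import numpy as np

# Toy illustration of a fixed-size recurrent state plus a bounded
# local-attention window. NOT Griffin's actual recurrence; the decay factor
# and dimensions are invented for demonstration purposes.

STATE_DIM = 8      # size of the recurrent hidden state (fixed)
WINDOW = 4         # local-attention window (fixed)
DECAY = 0.9        # illustrative decay factor for the linear recurrence

rng = np.random.default_rng(0)
state = np.zeros(STATE_DIM)
window_cache = []  # holds at most WINDOW recent token embeddings

for step in range(10_000):  # an arbitrarily long input sequence
    token_embedding = rng.normal(size=STATE_DIM)
    state = DECAY * state + (1 - DECAY) * token_embedding          # fixed-size update
    window_cache = (window_cache + [token_embedding])[-WINDOW:]    # bounded window

# Memory stays constant no matter how long the sequence gets.
print(state.shape, len(window_cache))  # (8,) 4
```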

Because resource usage is fixed, RecurrentGemma is able to handle lengthier language processing tasks efficiently, even with the typical computational constraints of personal devices, and without having to rely on powerful GPUs or cloud-based computing.
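For developers who want to try this locally, the released checkpoints can be run through the Hugging Face transformers library (recent versions include RecurrentGemma support). The sketch below assumes access to the instruction-tuned checkpoint Google published on Hugging Face under the ID “google/recurrentgemma-2b-it”, which is gated behind accepting Google’s license terms.

```python
# Minimal local-inference sketch using Hugging Face transformers.
# Assumes a transformers version with RecurrentGemma support and access to the
# gated "google/recurrentgemma-2b-it" checkpoint on Hugging Face.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/recurrentgemma-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)  # runs on CPU by default

prompt = "Translate to French: Hello, how are you?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```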


Despite the lack of a transformer-based architecture, the research team found that RecurrentGemma performed well in a variety of tests when compared to larger LLMs, including those of the Gemma family of models.

According to the team’s findings, RecurrentGemma-2B-IT (IT meaning an instruction-tuned model) achieved a 43.7% win rate against the larger Mistral 7B model in hundreds of prompts that included creative writing and coding tasks. This result was also only slightly below the 45% win rate achieved by Gemma-1.1-2B-IT in the same set of tasks.

Additionally, the researchers found that RecurrentGemma-2B-IT outperformed the Mistral 7B v0.2 Instruct model, with a 59.8% win rate on 400 prompts testing basic safety protocols.

Overall, the team found that both a pretrained RecurrentGemma model with 2 billion non-embedding parameters and an instruction-tuned variant achieved performance comparable to Gemma-2B, even though Gemma-2B was trained on 50% more tokens.

In the end, the team notes that RecurrentGemma-2B achieves roughly the same level of performance as larger transformer-based models, including Google’s own Gemma line, by leveraging the advantages of RNNs and local attention mechanisms, making it far more efficient and well suited to deployment where resources are constrained.

Ultimately, models like RecurrentGemma-2B could signal a shift to smaller and more agile AI models that can be run on less powerful devices.
