Modern LLMs suffer from “context rot” - as context grows, performance degrades even when it fits within the model’s window. A 100k-token conversation can cause the model to “forget” or give lower-quality responses, despite the model technically being able to process it all.
Traditional solutions:

- Bigger context windows → still suffer from rot, and are expensive
- RAG/retrieval → requires pre-indexing and rigid search strategies
Recursive Language Models (RLMs) treat context as a programmable object that models explore adaptively at test-time. Instead of cramming everything into one call, the model recursively breaks down and processes context in a REPL environment.
Key result from the paper: an RLM built on GPT-5-mini more than doubles the performance of vanilla GPT-5 on long-context benchmarks, while costing the same or less per query!
- Context as a variable: Store your document in a Python REPL environment
- Adaptive exploration: The root LM decides how to chunk, search, and process
- Recursive queries: Call `llm_query()` on manageable chunks (see the sketch below)
- No context rot: Each model call works with a small, focused context
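Conceptually, the root LM's loop inside the REPL looks roughly like the sketch below. This is a minimal illustration, not the package's implementation: `llm_query` stands for the recursive call exposed to the root LM, `answer_over_long_context` is a hypothetical helper, and fixed-size chunking is just one strategy the model might choose at test time.

```python
# Conceptual sketch of an RLM-style loop (illustrative, not the package's code).
# `llm_query` stands in for the recursive LM call available inside the REPL.
def answer_over_long_context(context: str, question: str, llm_query) -> str:
    # The full context lives as a plain Python variable; the root LM can slice
    # and search it instead of reading all of it in a single call.
    chunk_size = 20_000
    chunks = [context[i:i + chunk_size] for i in range(0, len(context), chunk_size)]

    # Recursive queries: each sub-call sees only one small, focused chunk,
    # so no individual call has to carry the full context.
    notes = [
        llm_query(f"Extract anything relevant to: {question}\n\n{chunk}")
        for chunk in chunks
    ]

    # A final call reasons over the collected notes to produce the answer.
    return llm_query(f"Question: {question}\n\nNotes:\n" + "\n".join(notes))
```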
Based on the paper: Recursive Language Models by Alex Zhang et al.
If you are new to using nbdev here are some useful pointers to get you
started.
```sh
# make sure rlm package is installed in development mode
$ pip install -e .

# make changes under nbs/ directory
# ...

# compile to have changes apply to rlm
$ nbdev_prepare
```

Install latest from the GitHub repository:

```sh
$ pip install git+https://github.com/numb3r33/rlm.git
```

or from conda:

```sh
$ conda install -c numb3r33 rlm
```

or from pypi:

```sh
$ pip install rlm
```

Documentation can be found hosted on this GitHub repository’s pages. Additionally you can find package-manager-specific guidelines on conda and pypi respectively.
Here’s a simple example of using RLM to answer questions over long documents:
```sh
export OPENAI_API_KEY="your-key-here"
```

```python
from rlm.tools import prep_shell, make_run_repl
from rlm.core import advanced_toolloop
from rlm.prompts import REPL_SYSTEM_PROMPT

# Load the long document
with open("document.txt") as f:
    context = f.read()

# Store the context in a Python REPL environment the root LM can explore
sh = prep_shell(context, model="openai/openai/gpt-oss-120b", base_url="https://your-litellm-gateway.com")
run_repl = make_run_repl(sh)

query = "What are the main themes discussed in this document?"

responses = advanced_toolloop(
    query, sp=REPL_SYSTEM_PROMPT, tools=[run_repl], sh=sh,
    model="openai/openai/gpt-oss-120b", base_url="https://your-litellm-gateway.com",
    max_steps=50, verbose=True,
)

# Extract the final answer from the returned items
for item in responses:
    if isinstance(item, dict) and item.get("type") == "final":
        print(f"Answer: {item['answer']}")
```
You can compare RLM against a single vanilla long-context call. The vanilla call may fail outright when the context exceeds the model's window, which is why it is wrapped in a try/except:

```python
try:
    vanilla_result = benchmark_vanilla(context, query, model="gpt-4", base_url="...")
    print(f"Vanilla: {vanilla_result['time']:.2f}s, {vanilla_result['tokens']} tokens")
except Exception as e:
    print(f"Vanilla failed: {e}")

rlm_result = benchmark_rlm(context, query, model="gpt-4", base_url="...", verbose=True)
print(f"RLM: {rlm_result['time']:.2f}s, {rlm_result['tokens']} tokens")
print(f"Answer: {rlm_result['answer']}")
```
Use RLM when:

- Context exceeds the model’s window (100k+ tokens)
- You experience “context rot” (the model gets worse with long conversations)
- The task requires reasoning across many documents
- You want adaptive, test-time chunking strategies

Use vanilla when:

- Context is short (< 10k tokens)
- You only need simple fact retrieval
- Speed is critical and the context fits easily
- `max_steps`: How many REPL iterations the root LM gets (horizontal: loop count)
- Recursion depth: How deep calls can nest (vertical: call-stack depth)
RLM enforces depth=1 by design: the root LM can call `llm_query()`, but those calls can’t spawn further recursion.
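For example, reusing the quick-start setup above, `max_steps` is the knob you tune, while the depth limit stays built in:

```python
# Horizontal budget: allow the root LM up to 50 REPL iterations.
# Vertical depth stays at 1: llm_query() calls made inside the REPL
# cannot themselves spawn further recursive calls.
responses = advanced_toolloop(
    query, sp=REPL_SYSTEM_PROMPT, tools=[run_repl], sh=sh,
    model="openai/openai/gpt-oss-120b",
    base_url="https://your-litellm-gateway.com",
    max_steps=50,
)
```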
Some models don’t consistently follow the FINAL() instruction. RLM includes a fallback that captures the last assistant message if FINAL() isn’t detected.
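As an illustration, the fallback logic is conceptually similar to the sketch below (a simplified assumption about message structure, not the package's actual code):

```python
import re

def extract_final(messages: list[dict]) -> str:
    """Sketch of FINAL() detection with a fallback to the last assistant message."""
    assistant_texts = [m.get("content", "") for m in messages if m.get("role") == "assistant"]
    # Look for a FINAL(...) marker, starting from the most recent assistant turn.
    for text in reversed(assistant_texts):
        match = re.search(r"FINAL\((.*)\)", text, flags=re.DOTALL)
        if match:
            return match.group(1).strip().strip("\"'")
    # Fallback: no FINAL() detected anywhere, return the last assistant message.
    return assistant_texts[-1] if assistant_texts else ""
```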
RLM advantages:

- No pre-indexing needed
- Adaptive search strategies (the model decides how to explore)
- Better for complex multi-step reasoning

RAG advantages:

- Faster for simple lookups
- Works well with persistent knowledge bases
- Lower cost per query for repeated queries
You can customize the system prompt: import and modify `REPL_SYSTEM_PROMPT` or create your own:
```python
from rlm.prompts import REPL_SYSTEM_PROMPT

custom_prompt = REPL_SYSTEM_PROMPT + "\nAdditional instructions here..."
```
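Pass the customized prompt through the `sp` argument of `advanced_toolloop` (i.e. `sp=custom_prompt`), exactly as `REPL_SYSTEM_PROMPT` is passed in the quick-start example above.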
Based on the Recursive Language Models paper, here are planned enhancements:
- Token/cost tracking: Detailed metrics for each step
- Multiple benchmark tasks: Expand beyond document Q&A
- Error recovery improvements: Better handling of API failures and malformed tool calls
- Configurable FINAL detection: Custom patterns beyond FINAL() and FINAL_VAR()
- Training for recursion: Fine-tune models explicitly for RLM patterns (like o1 for reasoning)
- Deeper recursion: Support depth > 1 for more complex tasks
- Multi-modal context: Support for images, tables, structured data
- Streaming responses: Real-time answer updates as RLM progresses
- RL-based optimization: Learn optimal chunking and recursion strategies
- Hybrid RAG+RLM: Combine pre-indexed retrieval with adaptive exploration
- Benchmark suite: Comprehensive evaluation across domains
We welcome contributions! Areas where help is needed:

- Additional benchmark tasks
- Prompt engineering for better FINAL() compliance
- Performance optimizations
- Documentation improvements