Modern LLMs suffer from “context rot” - as context grows, performance degrades even when it fits within the model’s window. A 100k-token conversation can cause the model to “forget” or give lower-quality responses, despite the model technically being able to process it all.
Traditional solutions:

- Bigger context windows → still suffer from rot, and are expensive
- RAG/retrieval → requires pre-indexing and rigid search strategies
Recursive Language Models (RLMs) treat context as a programmable object that models explore adaptively at test-time. Instead of cramming everything into one call, the model recursively breaks down and processes context in a REPL environment.
Key result from the paper: an RLM built on GPT-5-mini more than doubles the performance of vanilla GPT-5 on long-context benchmarks, while costing the same or less per query!
- Context as a variable: Store your document in a Python REPL environment
- Adaptive exploration: The root LM decides how to chunk, search, and process
- Recursive queries: Call `llm_query()` on manageable chunks (see the sketch below)
- No context rot: Each model call works with a small, focused context
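Conceptually, the root LM's loop inside the REPL looks roughly like the sketch below. This is a minimal illustration, not the package's implementation: `llm_query` stands for the recursive call exposed to the root LM, `answer_over_long_context` is a hypothetical helper, and fixed-size chunking is just one strategy the model might choose at test time.

```python
# Conceptual sketch of an RLM-style loop (illustrative, not the package's code).
# `llm_query` stands in for the recursive LM call available inside the REPL.
def answer_over_long_context(context: str, question: str, llm_query) -> str:
    # The full context lives as a plain Python variable; the root LM can slice
    # and search it instead of reading all of it in a single call.
    chunk_size = 20_000
    chunks = [context[i:i + chunk_size] for i in range(0, len(context), chunk_size)]

    # Recursive queries: each sub-call sees only one small, focused chunk,
    # so no individual call has to carry the full context.
    notes = [
        llm_query(f"Extract anything relevant to: {question}\n\n{chunk}")
        for chunk in chunks
    ]

    # A final call reasons over the collected notes to produce the answer.
    return llm_query(f"Question: {question}\n\nNotes:\n" + "\n".join(notes))
```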
Based on the paper: Recursive Language Models by Alex Zhang et al.
If you are new to using nbdev here are some useful pointers to get you
started.
```sh
# make sure rlm package is installed in development mode
$ pip install -e .

# make changes under nbs/ directory
# ...

# compile to have changes apply to rlm
$ nbdev_prepare
```

Install latest from the GitHub repository:

```sh
$ pip install git+https://github.com/numb3r33/rlm.git
```

or from conda:

```sh
$ conda install -c numb3r33 rlm
```

or from pypi:

```sh
$ pip install rlm
```

Documentation can be found hosted on this GitHub repository’s pages. Additionally you can find package-manager-specific guidelines on conda and pypi respectively.
Here’s a simple example of using RLM to answer questions over long documents:
```sh
export OPENAI_API_KEY="your-key-here"
```

```python
from rlm.tools import prep_shell, make_run_repl
from rlm.core import advanced_toolloop
from rlm.prompts import REPL_SYSTEM_PROMPT

# Load the long document
with open("document.txt") as f:
    context = f.read()

# Store the context in a Python REPL environment the root LM can explore
sh = prep_shell(context, model="openai/openai/gpt-oss-120b", base_url="https://your-litellm-gateway.com")
run_repl = make_run_repl(sh)

query = "What are the main themes discussed in this document?"

responses = advanced_toolloop(
    query, sp=REPL_SYSTEM_PROMPT, tools=[run_repl], sh=sh,
    model="openai/openai/gpt-oss-120b", base_url="https://your-litellm-gateway.com",
    max_steps=50, verbose=True,
)

# Extract the final answer from the returned items
for item in responses:
    if isinstance(item, dict) and item.get("type") == "final":
        print(f"Answer: {item['answer']}")
```
You can compare RLM against a single vanilla long-context call. The vanilla call may fail outright when the context exceeds the model's window, which is why it is wrapped in a try/except:

```python
try:
    vanilla_result = benchmark_vanilla(context, query, model="gpt-4", base_url="...")
    print(f"Vanilla: {vanilla_result['time']:.2f}s, {vanilla_result['tokens']} tokens")
except Exception as e:
    print(f"Vanilla failed: {e}")

rlm_result = benchmark_rlm(context, query, model="gpt-4", base_url="...", verbose=True)
print(f"RLM: {rlm_result['time']:.2f}s, {rlm_result['tokens']} tokens")
print(f"Answer: {rlm_result['answer']}")
```
Use RLM when:

- Context exceeds the model’s window (100k+ tokens)
- You experience “context rot” (the model gets worse with long conversations)
- The task requires reasoning across many documents
- You want adaptive, test-time chunking strategies

Use vanilla when:

- Context is short (< 10k tokens)
- You only need simple fact retrieval
- Speed is critical and the context fits easily
- `max_steps`: How many REPL iterations the root LM gets (horizontal: loop count)
- Recursion depth: How deep calls can nest (vertical: call-stack depth)
RLM enforces depth=1 by design: the root LM can call `llm_query()`, but those calls can’t spawn further recursion.
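For example, reusing the quick-start setup above, `max_steps` is the knob you tune, while the depth limit stays built in:

```python
# Horizontal budget: allow the root LM up to 50 REPL iterations.
# Vertical depth stays at 1: llm_query() calls made inside the REPL
# cannot themselves spawn further recursive calls.
responses = advanced_toolloop(
    query, sp=REPL_SYSTEM_PROMPT, tools=[run_repl], sh=sh,
    model="openai/openai/gpt-oss-120b",
    base_url="https://your-litellm-gateway.com",
    max_steps=50,
)
```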
Some models don’t consistently follow the FINAL() instruction. RLM includes a fallback that captures the last assistant message if FINAL() isn’t detected.
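As an illustration, the fallback logic is conceptually similar to the sketch below (a simplified assumption about message structure, not the package's actual code):

```python
import re

def extract_final(messages: list[dict]) -> str:
    """Sketch of FINAL() detection with a fallback to the last assistant message."""
    assistant_texts = [m.get("content", "") for m in messages if m.get("role") == "assistant"]
    # Look for a FINAL(...) marker, starting from the most recent assistant turn.
    for text in reversed(assistant_texts):
        match = re.search(r"FINAL\((.*)\)", text, flags=re.DOTALL)
        if match:
            return match.group(1).strip().strip("\"'")
    # Fallback: no FINAL() detected anywhere, return the last assistant message.
    return assistant_texts[-1] if assistant_texts else ""
```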
RLM advantages:

- No pre-indexing needed
- Adaptive search strategies (the model decides how to explore)
- Better for complex multi-step reasoning

RAG advantages:

- Faster for simple lookups
- Works well with persistent knowledge bases
- Lower cost per query for repeated queries
You can customize the system prompt: import and modify `REPL_SYSTEM_PROMPT` or create your own:
```python
from rlm.prompts import REPL_SYSTEM_PROMPT

custom_prompt = REPL_SYSTEM_PROMPT + "\nAdditional instructions here..."
```
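Pass the customized prompt through the `sp` argument of `advanced_toolloop` (i.e. `sp=custom_prompt`), exactly as `REPL_SYSTEM_PROMPT` is passed in the quick-start example above.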
Based on the Recursive Language Models paper, here are planned enhancements:
- Token/cost tracking: Detailed metrics for each step
- Multiple benchmark tasks: Expand beyond document Q&A
- Error recovery improvements: Better handling of API failures and malformed tool calls
- Configurable FINAL detection: Custom patterns beyond FINAL() and FINAL_VAR()
- Training for recursion: Fine-tune models explicitly for RLM patterns (like o1 for reasoning)
- Deeper recursion: Support depth > 1 for more complex tasks
- Multi-modal context: Support for images, tables, structured data
- Streaming responses: Real-time answer updates as RLM progresses
- RL-based optimization: Learn optimal chunking and recursion strategies
- Hybrid RAG+RLM: Combine pre-indexed retrieval with adaptive exploration
- Benchmark suite: Comprehensive evaluation across domains
We welcome contributions! Areas where help is needed:

- Additional benchmark tasks
- Prompt engineering for better FINAL() compliance
- Performance optimizations
- Documentation improvements