A Unified Architecture for Self-Improving Intelligent Systems
Current artificial intelligence systems generally fall into two categories: static function approximators (models) or task-solving loops (agents). While agents can accumulate episodic memory (“in-context learning”), they typically lack a principled mechanism to modify their own control structure. We propose the General Evaluator Model (GEM), a minimal formal framework for self-improving systems. GEM redefines the agent not as a fixed loop, but as a recursive evaluator in which self-modification is a privileged subset of the action space. By distinguishing between State Updates (Type 1) and Configuration Updates (Type 2), and by constraining the latter with an explicit performance functional over a reference suite, GEM provides a substrate-independent blueprint for agents that can safely rewrite their own configuration.
Most contemporary agents are defined by a fixed tuple: a large language model (LLM), a system prompt, a set of tools, and a memory bank. The memory changes over time, but the structure of the agent — its prompts, tool definitions, and control logic — remains frozen.
This creates a ceiling on intelligence. A truly intelligent system must be able to critique and upgrade its own reasoning process, not just accumulate more data. It requires structural self-improvement, not only state-level learning.
The General Evaluator Model (GEM) starts from a simple claim:
An intelligent agent is fundamentally an evaluator that maps its current configuration, memory, and observations into actions.
A self-improving agent is one whose evaluator can select actions that modify its own configuration, subject to explicit performance constraints.
GEM treats the agent’s configuration (its “source code”) as mutable data, accessible to the evaluator itself, and formalizes self-modification as part of the action space.
We define an agent at time $t$ by its state $\text{Agent}_t = (\theta_t, M_t, O_t)$, together with a fixed engine $f$:

- Configuration $\theta$: The structural logic governing the agent.
  - Examples: system prompts, tool definitions, control policies, retrieval hyperparameters, reflection cadence.
  - Domain: a meta-language $\mathcal{M}$. In practice, $\theta \in \mathcal{M}$ is a structured object (e.g. JSON, DSL, Python config) that fully specifies behavior given the engine.
- Memory $M$: The fluid internal state of the agent.
  - Examples: dialogue history, vector store contents, current beliefs about the world, episodic logs, cached plans.
  - Domain: arbitrary data structures, possibly spanning multiple stores.
- Observation $O$: The most recent percept from the environment.
  - Examples: user message, sensor readings, tool outputs.
- Engine $f$: The underlying substrate that executes the logic encoded by $\theta$.
  - Examples: a fixed LLM (e.g. GPT-4), a policy network, a symbolic solver.
  - Type: $f: (\theta, M, O) \to \Delta(A)$, where $\Delta(A)$ denotes a distribution over actions.
Intuitively: the engine $f$ is the fixed substrate, the configuration $\theta$ is the agent’s mutable “source code”, the memory $M$ is its data, and the observation $O$ is its current input.
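To make the tuple concrete, here is a minimal type sketch in TypeScript. The specific shapes (a JSON-like configuration, a history-plus-beliefs memory, string observations, a probability-list representation of $\Delta(A)$) are illustrative assumptions, not part of GEM's definition.

```typescript
// Illustrative GEM state types; only the shape f : (theta, M, O) -> Delta(A) is prescribed by GEM.
type Configuration = Record<string, unknown>;  // theta: a structured object in the meta-language M
interface Memory {                             // M: fluid internal state
  history: string[];
  beliefs: Record<string, unknown>;
}
type Observation = string;                     // O: most recent percept

// Placeholder for the action space A; refined into A_world / A_mem / A_sys in the next section.
type Action = { kind: "world" | "mem" | "sys"; payload?: unknown };

// The engine f maps (theta, M, O) to a distribution over actions, Delta(A),
// represented here as a list of (action, probability) pairs.
type Engine = (
  theta: Configuration,
  memory: Memory,
  observation: Observation
) => Promise<Array<{ action: Action; probability: number }>>;
```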
Standard reinforcement learning treats the action space $A$ as a single, undifferentiated set. GEM partitions it into three disjoint subsets:

- $A_{\text{world}}$ (Act): Actions that affect the external environment.
  - Examples: `send_message(user_text)`, `move_robot(Δx, Δy)`, `call_api(request)`.
- $A_{\text{mem}}$ (Learn): Actions that modify the memory $M$, but not the configuration $\theta$.
  - Examples: `write_to_memory(entry)`, `update_belief(key, value)`, `append_to_log(trace)`.
- $A_{\text{sys}}$ (Evolve): Actions that propose changes to the configuration $\theta$.
  - Examples: `rewrite_prompt(patch)`, `add_new_tool(spec)`, `adjust_planning_depth(k)`, `change_reflection_interval(k)`.

We call actions in $A_{\text{sys}}$ privileged: they target the agent’s own configuration rather than the external world or its memory.
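A sketch of this partition as a TypeScript discriminated union, refining the `Action` placeholder above; the payload fields simply mirror the examples listed and are not a fixed GEM schema.

```typescript
// A_world (Act): actions on the external environment.
type WorldAction =
  | { kind: "world"; name: "send_message"; userText: string }
  | { kind: "world"; name: "call_api"; request: unknown };

// A_mem (Learn): actions that modify memory M but leave the configuration theta untouched.
type MemAction =
  | { kind: "mem"; name: "write_to_memory"; entry: string }
  | { kind: "mem"; name: "update_belief"; key: string; value: unknown };

// A_sys (Evolve): privileged actions that propose changes to theta.
type SysAction =
  | { kind: "sys"; name: "rewrite_prompt"; patch: string }
  | { kind: "sys"; name: "add_new_tool"; spec: { name: string; signature: string; description: string } }
  | { kind: "sys"; name: "adjust_planning_depth"; k: number };

// A = A_world ∪ A_mem ∪ A_sys, kept disjoint by the "kind" discriminant.
type Action = WorldAction | MemAction | SysAction;
```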
At each discrete step $t$, the agent executes the following loop:
- Evaluate: Given current configuration $\theta_t$, memory $M_t$, and observation $O_t$, the engine computes a distribution over actions:
  $$ P_t = f(\theta_t, M_t, O_t) \in \Delta(A) $$
- Select: A concrete action $a_t \in A$ is drawn or chosen:
  $$ a_t = \text{select}(P_t) $$
- Apply: Depending on which partition $a_t$ belongs to:
  - If $a_t \in A_{\text{world}}$: The environment responds with a new observation $O_{t+1} = \text{act_external}(a_t)$. Memory $M_t$ may also be updated by a separate perception or logging mechanism.
  - If $a_t \in A_{\text{mem}}$: Memory is updated by a memory application function $M_{t+1} = \text{apply_mem}(M_t, a_t)$. The observation may be set to a synthetic feedback (e.g. "Memory updated."), and $\theta_{t+1} = \theta_t$.
  - If $a_t \in A_{\text{sys}}$: A configuration patch is proposed: $\theta' = \text{patch}(\theta_t, a_t)$. This triggers the structural update mechanism described in Section 4.
- Advance: The new agent state is $\text{Agent}_{t+1} = (\theta_{t+1}, M_{t+1}, O_{t+1})$, with $\theta_{t+1}, M_{t+1}, O_{t+1}$ determined as above.
This is the entire dynamics. There is no special “outer loop”: self-improvement is just a particular class of actions in $A_{\text{sys}}$.
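A minimal sketch of one such step, assuming the types above; `select`, `actExternal`, `applyMem`, `patch`, and `verifyAndCommit` are hypothetical runtime helpers (the last one implements the guarded Type 2 update described later).

```typescript
// Hypothetical runtime helpers, assumed to be provided elsewhere.
declare function select(distribution: Array<{ action: Action; probability: number }>): Action;
declare function actExternal(action: WorldAction): Promise<Observation>;
declare function applyMem(memory: Memory, action: MemAction): Memory;
declare function patch(theta: Configuration, action: SysAction): Configuration;
declare function verifyAndCommit(
  thetaCurrent: Configuration,
  thetaProposed: Configuration
): Promise<{ committedTheta: Configuration; feedback: Observation }>;

interface AgentState {
  theta: Configuration;
  memory: Memory;
  observation: Observation;
}

// One GEM step: Evaluate -> Select -> Apply -> Advance.
async function step(agent: AgentState, f: Engine): Promise<AgentState> {
  // 1. Evaluate: P_t = f(theta_t, M_t, O_t)
  const distribution = await f(agent.theta, agent.memory, agent.observation);

  // 2. Select: a_t = select(P_t), e.g. greedy or sampled.
  const a = select(distribution);

  // 3. Apply, depending on which partition a_t belongs to.
  if (a.kind === "world") {
    // External action: the environment responds with O_{t+1}.
    const observation = await actExternal(a);
    return { ...agent, observation };
  }
  if (a.kind === "mem") {
    // Type 1 (state) update: memory changes, theta does not.
    return { ...agent, memory: applyMem(agent.memory, a), observation: "Memory updated." };
  }
  // Type 2 (configuration) update: propose theta' and run it through the guarded commit.
  const thetaPrime = patch(agent.theta, a);
  const { committedTheta, feedback } = await verifyAndCommit(agent.theta, thetaPrime);

  // 4. Advance: Agent_{t+1} = (theta_{t+1}, M_{t+1}, O_{t+1}).
  return { ...agent, theta: committedTheta, observation: feedback };
}
```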
GEM draws a sharp line between Type 1 and Type 2 change.
When the agent selects $a_t \in A_{\text{mem}}$, the result is a Type 1 (state) update: the memory changes while the configuration stays fixed.
This covers:
- in-context learning
- retrieval-augmented generation (RAG) updates
- belief updates, logging, episodic memory, caches
The evaluator’s structure (how it reasons) is fixed; only its state (what it knows) evolves.
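For instance, a Type 1 update handler might look like the following sketch (using the assumed `Memory` and `MemAction` shapes from earlier); the point is simply that $\theta$ never appears here.

```typescript
// apply_mem: a Type 1 update touches memory only; theta is not even an input.
function applyMem(memory: Memory, action: MemAction): Memory {
  switch (action.name) {
    case "write_to_memory":
      // Append an entry to the episodic history.
      return { ...memory, history: [...memory.history, action.entry] };
    case "update_belief":
      // Overwrite a single belief under its key.
      return { ...memory, beliefs: { ...memory.beliefs, [action.key]: action.value } };
  }
}
```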
When the agent selects $a_t \in A_{\text{sys}}$, it proposes a Type 2 (configuration) update: a patch $\theta' = \text{patch}(\theta_t, a_t)$.
At this point, the system does not immediately commit to $\theta'$. The patch is only a proposal, and whether it takes effect is decided by the verification mechanism described next.
This is the key move: self-modification is an action, but a guarded one.
Self-modification without verification leads to instability and collapse. GEM therefore requires:
- a reference suite $R$ of tasks/goals, and
- a performance functional $J: (\theta, R) \to \mathbb{R}$.
Intuitively, $J$ aggregates metrics such as:

- task success rate
- latency or cost
- safety or compliance scores

into a single scalar.
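One possible shape for such a functional, assuming each task in $R$ can be run in a sandbox and scored for success, latency, and safety; the metric names, weights, and the `runInSandbox` helper are assumptions of this sketch, not prescribed by GEM.

```typescript
// An illustrative performance functional J: (theta, R) -> a single scalar.
interface Task { id: string; prompt: string }
interface TaskResult { success: boolean; latencySeconds: number; safetyScore: number } // safetyScore in [0, 1]

// Hypothetical sandbox runner: executes one reference task under configuration theta.
declare function runInSandbox(theta: Configuration, task: Task): Promise<TaskResult>;

async function J(theta: Configuration, referenceSuite: Task[]): Promise<number> {
  const results: TaskResult[] = [];
  for (const task of referenceSuite) {
    results.push(await runInSandbox(theta, task));
  }
  const successRate = results.filter(r => r.success).length / results.length;
  const meanSafety = results.reduce((sum, r) => sum + r.safetyScore, 0) / results.length;
  const meanLatency = results.reduce((sum, r) => sum + r.latencySeconds, 0) / results.length;

  // Collapse the metrics into one scalar: reward success and safety, penalize latency.
  // The weights are arbitrary illustrations, not part of GEM.
  return 1.0 * successRate + 0.5 * meanSafety - 0.1 * meanLatency;
}
```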
Given a proposed configuration $\theta'$, the runtime evaluates $J(\theta', R)$ and compares it to $J(\theta_t, R)$.
The structural update rule is:

$$
\theta_{t+1} =
\begin{cases}
\theta' & \text{if } J(\theta', R) \ge J(\theta_t, R) + \varepsilon \\
\theta_t & \text{otherwise}
\end{cases}
$$

for some margin $\varepsilon \ge 0$.
If the update is rejected, the patch is discarded and the agent receives a synthetic observation reporting the rejection.
This feedback becomes part of its experience: the agent can learn which kinds of structural edits tend to pass or fail, using its own failures as training data.
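A sketch of this guarded commit, reusing the illustrative `J` above; the margin value and feedback strings are assumptions of the sketch.

```typescript
// Guarded Type 2 update: commit theta' only if it beats J(theta_t, R) by at least epsilon.
const EPSILON = 0.01;                  // illustrative margin
declare const referenceSuite: Task[];  // R, assumed to be loaded elsewhere (e.g. from R/tasks.jsonl)

async function verifyAndCommit(
  thetaCurrent: Configuration,
  thetaProposed: Configuration
): Promise<{ committedTheta: Configuration; feedback: Observation }> {
  const jCurrent = await J(thetaCurrent, referenceSuite);   // could be cached between steps
  const jProposed = await J(thetaProposed, referenceSuite);

  if (jProposed >= jCurrent + EPSILON) {
    // theta_{t+1} = theta'
    return {
      committedTheta: thetaProposed,
      feedback: `Configuration update committed (J: ${jCurrent.toFixed(3)} -> ${jProposed.toFixed(3)}).`,
    };
  }
  // theta_{t+1} = theta_t; the rejection is surfaced to the agent as a synthetic observation.
  return {
    committedTheta: thetaCurrent,
    feedback: `Configuration update rejected (J: ${jProposed.toFixed(3)} < ${jCurrent.toFixed(3)} + ${EPSILON}).`,
  };
}
```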
Two important points:
- $R$ is a reference class, not the whole world. Over-fitting $\theta$ to $R$ is a concrete manifestation of Goodhart’s law.
- In a fully autonomous system, selecting and expanding $R$ itself becomes part of the long-term problem of self-governance.
To demonstrate that GEM is implementable with current tools, consider a language-model based agent.
Let the configuration $\theta$ be a JSON object of the following shape:
```json
{
  "persona": "string",
  "system_prompt": "string",
  "thought_pattern": "chain_of_thought | tree_of_thought",
  "tools": [
    {
      "name": "string",
      "signature": "string",
      "description": "string"
    }
  ],
  "planning_depth": "integer",
  "reflection_interval": "integer"
}
```

Here, $\theta$ is a concrete instance of the meta-language $\mathcal{M}$: it is the agent’s mutable “source code”. The engine $f$ is:
- a frozen LLM, e.g. GPT-4 or Claude 3.x;
- given $(\theta_t, M_t, O_t)$, it constructs a prompt and tool context, then outputs a distribution over actions $A$ (which includes both external actions and internal configuration/memory tools).
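As an illustration of how such an engine can be driven entirely by $\theta$, the prompt and tool context might be rendered roughly like this; the field names follow the JSON schema above, but the rendering itself is an assumption of this sketch, not the prototype's actual code.

```typescript
// Render theta (the JSON configuration above) into the prompt and tool context for the frozen LLM.
interface ToolSpec { name: string; signature: string; description: string }
interface Theta {
  persona: string;
  system_prompt: string;
  thought_pattern: "chain_of_thought" | "tree_of_thought";
  tools: ToolSpec[];
  planning_depth: number;
  reflection_interval: number;
}

function buildSystemPrompt(theta: Theta): string {
  return [
    `Persona: ${theta.persona}`,
    theta.system_prompt,
    `Reason using ${theta.thought_pattern}, planning up to ${theta.planning_depth} steps ahead.`,
    `Reflect on your progress every ${theta.reflection_interval} steps.`,
  ].join("\n");
}

function buildToolContext(theta: Theta): string {
  // One description line per tool; the engine exposes exactly the tools listed in theta.
  return theta.tools
    .map(t => `${t.name}(${t.signature}): ${t.description}`)
    .join("\n");
}
```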
Add a special tool: `update_configuration(json_patch)`.
This tool takes a proposed patch to the JSON configuration $\theta_t$. When the agent invokes it, the Runtime:
- Constructs $\theta' = \text{patch}(\theta_t, \text{json_patch})$.
- Spins up a sandbox agent with configuration $\theta'$.
- Evaluates it on the reference suite $R$ (e.g. a battery of math, coding, reasoning, and safety tasks).
- Computes $J(\theta', R)$ and compares it to $J(\theta_t, R)$.
- Commits or rolls back the patch based on the rule in Section 5.
A concrete self-improvement episode might look like this:

- The agent consistently fails arithmetic tasks because it lacks a calculator.
- It notices a pattern in its failures and proposes:

  ```
  update_configuration({
    "add_tool": {
      "name": "python",
      "signature": "code: string",
      "description": "Execute Python code and return the result."
    }
  })
  ```

- The Runtime generates $\theta'$ with the new tool, runs the sandbox on $R$, and observes that math and some reasoning tasks now succeed more often.
- If $J(\theta', R) \ge J(\theta_t, R) + \varepsilon$, the update is committed.
- From the next step on, $\text{Agent}_{t+1}$ permanently has access to a calculator tool.
Under GEM, this is not “prompt engineering” done by a human. It is self-engineering done by the agent, guarded by a verification layer.
In GEM, prompts, tool definitions, and control policies are not fixed implementation details; they are part of $\theta$, and therefore legitimate targets of self-modification.
Because Type 2 updates are accepted based on $J(\cdot, R)$, the reference suite determines what counts as an improvement:

- If $R$ is narrow, the agent can structurally overfit to a small set of tests.
- If $R$ is broad and diverse, structural improvements are more likely to generalize.
At scale, a self-evolving agent would need to expand and curate $R$ itself, making the design of its own reference suite part of the self-governance problem noted above.
There is no separate “manager” outside the agent deciding how to update $\theta$. The same evaluator:

- chooses actions in $A_{\text{world}}$ to act on the environment,
- chooses actions in $A_{\text{mem}}$ to shape its memory, and
- chooses actions in $A_{\text{sys}}$ to propose changes to itself.
The current intelligence generates hypotheses about a better intelligence, which are then tested against the reference suite $R$ via $J$ before they take effect.
We can summarize GEM in one line:
An intelligent system is an evaluator $f(\theta, M, O)$ whose action space includes privileged operations that rewrite its own configuration $\theta$, with those rewrites constrained by a performance functional $J$ over a reference suite $R$.
Everything else — prompts, tools, memory stores, policies, even “agent frameworks” — is just a particular choice of:
- a meta-language $\mathcal{M}$ for $\theta$,
- memory structures $M$,
- an action space decomposition $A_{\text{world}}, A_{\text{mem}}, A_{\text{sys}}$, and
- a verification functional $J$.
GEM’s claim is that once you structure an agent this way, self-improvement is no longer a hack or an afterthought. It is part of the agent’s core interface to itself.
This repository contains a prototype implementation of GEM (v0).
- Node.js
- `npm install` in root.
- `npm run build` in root.
The agent requires an OpenAI API key to function.
- Set your API key: `export OPENAI_API_KEY=sk-...`
- Run the agent: `npm start`
- Baseline Performance: Ask questions like "What is 2 + 2?". The agent will answer correctly but may fail the hidden style guidelines of the Reference Suite.
- Trigger Update: Type `update config`. The agent will introspect on the Reference Suite tasks and propose a configuration patch to maximize its score (e.g., adopting a specific output format).
All configuration changes are logged to `memory/config_log.jsonl`.
Each entry contains:
- `theta_before` / `theta_after`
- `j_before` / `j_after` (performance scores)
- `committed` (boolean result)
The reference suite is located in `R/tasks.jsonl`. These are the "tests" the agent must pass to accept a self-update.