Stars
Benchmark and evaluate generative research synthesis
[COLM 2025] Official repository for R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents
End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning
Optimize prompts, code, and more with AI-powered Reflective Text Evolution
Open-source implementation of AlphaEvolve
Recovery-Bench is a benchmark for evaluating the capability of LLM agents to recover from mistakes
Claude Code is an agentic coding tool that lives in your terminal, understands your codebase, and helps you code faster by executing routine tasks, explaining complex code, and handling git workflows.
Repo for Rho-1: Token-level Data Selection & Selective Pretraining of LLMs.
NPUEval is an LLM evaluation dataset written specifically to target AIE kernel code generation on RyzenAI hardware.
MCP server integrating GEPA (Genetic-Pareto) for automatic prompt optimization with Claude Desktop
Renderer for the harmony response format to be used with gpt-oss
slime is an LLM post-training framework for RL Scaling.
Trajectories for running OpenHands on Terminal Bench
A course on LLM inference serving on Apple Silicon for systems engineers: build a tiny vLLM + Qwen.
[NeurIPS '25] Challenging Software Optimization Tasks for Evaluating SWE-Agents
A benchmark for LLMs on complicated tasks in the terminal
Sky-T1: Train your own O1 preview model within $450
SECOM: On Memory Construction and Retrieval for Personalized Conversational Agents, ICLR 2025
KernelBench: Can LLMs Write GPU Kernels? - Benchmark with Torch -> CUDA (+ more DSLs)
Letta is the platform for building stateful agents: open-source AI with advanced memory that can learn and self-improve over time.
Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file.