Learn by building: Chain-of-Thought (CoT) probing, in-context learning (ICL) emergence, induction-head scans, and distilled vs. base model comparisons.
Works out of the box with `sshleifer/tiny-gpt2` for CPU demos; swap in larger causal LMs for deeper dives.
```bash
pip install -e .
python -m inductra.runner icl --model sshleifer/tiny-gpt2 --ctx 8 16 32 64
python -m inductra.runner cot --model sshleifer/tiny-gpt2 --layer -1
python -m inductra.runner compare --base gpt2 --distilled distilgpt2
```
- ICL emergence grid — synthetic tasks (copy, reverse, modular add, parity) across context lengths; plot accuracy vs. context length (see the first sketch after this list).
- Induction heads — shifted-copy attention score for detecting induction-like heads (second sketch below).
- CoT probing — linear probe on residual streams to classify presence of a CoT prefix (“Let’s think step by step”; third sketch below).
- Distilled vs base — quick accuracy delta on modular arithmetic.
- Hooks & probes — ~100 LOC each, minimal and hackable.
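The ICL grid boils down to one loop: build k in-context demonstrations, score the greedy next token on a held-out query, and sweep k. Below is a minimal sketch assuming a plain HuggingFace causal LM and a letter-copy task; the repo's own task set and prompt format may differ, and a model as small as `sshleifer/tiny-gpt2` will score near zero (the point is the shape of the grid, not the number).

```python
# Minimal ICL copy-task grid (illustrative task format, not the repo's exact one).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "sshleifer/tiny-gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

def copy_accuracy(n_examples: int, n_trials: int = 20) -> float:
    """Greedy next-token accuracy on 'X -> X' copying with n_examples demonstrations."""
    correct = 0
    for _ in range(n_trials):
        letters = torch.randint(0, 26, (n_examples + 1,)).tolist()
        demos = "".join(f"{chr(65 + x)} -> {chr(65 + x)}\n" for x in letters[:-1])
        target = chr(65 + letters[-1])
        ids = tok(demos + f"{target} ->", return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits[0, -1]          # next-token logits at the query
        pred = tok.decode([logits.argmax().item()]).strip()
        correct += int(pred == target)
    return correct / n_trials

for ctx in (2, 4, 8, 16):
    print(f"{ctx:3d} demos: copy accuracy {copy_accuracy(ctx):.2f}")
```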
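The induction-head scan uses a repeated random sequence: for a query at position i in the second half, an induction-like head should attend to position i - seq_len + 1, the token that followed the previous occurrence of the current token. A minimal sketch with a stock HuggingFace model and `output_attentions=True`; the repo's hook-based version and exact score may differ.

```python
# Shifted-copy induction score per head on a repeated random sequence (illustrative).
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("sshleifer/tiny-gpt2").eval()

seq_len, vocab = 50, model.config.vocab_size
first_half = torch.randint(0, vocab, (1, seq_len))
tokens = torch.cat([first_half, first_half], dim=1)   # [x1..xn, x1..xn]

with torch.no_grad():
    out = model(tokens, output_attentions=True)

# out.attentions: one (batch, heads, q_pos, k_pos) tensor per layer.
scores = {}
for layer, attn in enumerate(out.attentions):
    for head in range(attn.shape[1]):
        s = sum(attn[0, head, i, i - seq_len + 1].item() for i in range(seq_len, 2 * seq_len))
        scores[(layer, head)] = s / seq_len            # mean attention to the induction target

for (layer, head), score in sorted(scores.items(), key=lambda kv: -kv[1])[:5]:
    print(f"layer {layer} head {head}: induction score {score:.3f}")
```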
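The CoT probe is a linear classifier on residual-stream activations: collect last-token hidden states for prompts with and without the CoT prefix and fit a logistic regression. A minimal sketch assuming the final hidden state and a scikit-learn probe; the prompt set and layer choice here are illustrative (the CLI exposes `--layer`), and a real experiment should evaluate on a held-out split rather than training accuracy.

```python
# Linear probe for presence of a CoT prefix in the residual stream (illustrative setup).
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "sshleifer/tiny-gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

questions = [f"What is {a} plus {b}?" for a in range(4) for b in range(4)]
prompts, labels = [], []
for q in questions:
    prompts += [q, "Let's think step by step. " + q]   # without / with CoT prefix
    labels += [0, 1]

feats = []
for p in prompts:
    ids = tok(p, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states
    feats.append(hs[-1][0, -1].numpy())                # last layer, last token residual

probe = LogisticRegression(max_iter=1000).fit(feats, labels)
print("train accuracy:", probe.score(feats, labels))   # optimistic; hold out a split in practice
```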
This repo pairs well with How to Become a Mechanistic Interpretability Researcher by Neel Nanda (Alignment Forum, 2025).
Distilled raw notes from the guide:
- Stage 1 (≤1 month): breadth-first basics, code a transformer, replicate toy mech interp tasks. Don’t wait—experiment early.
- Stage 2 (1–5 day projects): maximize info gain per unit time, run fast probes, debug aggressively.
- Stage 3 (1–2 week sprints): develop skepticism, set real baselines, and always write up. Public write-ups are the best credential.
Skill growth rates:
- Fast → coding, debugging, plotting.
- Medium → taste & prioritization, built by doing multiple projects.
- Slow → idea generation, emerges over months of exploration.
Practical tips: use LLMs as collaborators (they’re decent at mech interp tasks), don’t block on mentorship, and focus on “doing” over endless reading.
- Hook adapters for Llama/Mistral/Qwen
- Token-shifted causal tracing
- Per-head ranking plots
- GSM8K CoT token supervision
- Path patching experiments
MIT