Our ML infra lets us steer trillion-parameter frontier models in real time:
- live, mid-chain-of-thought edits to internal activations
- directly altering how the model reasons (not just outputs)
- stackable edits
- no added latency
We can make models more Gen Z, more concise, or anything else. Thanks for having us, Gopal Raman and South Park Commons!
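For readers new to this idea, here is a minimal sketch of the basic mechanism behind this kind of intervention: adding a vector to a layer's activations at inference time via a PyTorch forward hook. It is a generic illustration on gpt2 with a random placeholder direction; the layer index and scale are arbitrary assumptions, and this is not Goodfire's infrastructure or a real feature vector.

```python
# Minimal illustration of activation steering with a PyTorch forward hook.
# NOT Goodfire's Ember infrastructure -- just a generic sketch of the idea:
# add a vector to a hidden layer's output while the model generates.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in model for the sketch
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

layer_idx = 6                       # which block to steer (arbitrary choice)
hidden_size = model.config.hidden_size
# Hypothetical steering direction; in practice this would come from an
# interpretability method (e.g. a learned feature direction), not randn.
steer = torch.randn(hidden_size)
steer = steer / steer.norm() * 4.0  # scale controls steering strength

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple; the first element is the residual stream.
    hidden_states = output[0] + steer.to(output[0].dtype)
    return (hidden_states,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(add_steering)

prompt = "The best way to explain this is"
ids = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=30, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()  # remove the edit when done steering
```

Because each edit is just a hook adding a vector, several hooks can be registered at once (which is what makes edits stackable), and the extra work is one vector addition per steered layer, so an intervention of this form adds essentially no latency.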
Goodfire
Software Development
San Francisco, CA 5,363 followers
AI interpretability research company building safe and powerful AI systems
About us
Our mission is to advance humanity's understanding of AI by examining the inner workings of advanced AI models (or “AI Interpretability”). As a research-driven product organization, we bridge the gap between theoretical science and practical applications of interpretability. We're building critical infrastructure that empowers developers to understand, edit, and debug AI models at scale, ensuring the creation of safer and more reliable systems. Goodfire is a public benefit corporation headquartered in San Francisco.
- Website
- https://goodfire.ai/
- Industry
- Software Development
- Company size
- 11-50 employees
- Headquarters
- San Francisco, CA
- Type
- Privately Held
- Founded
- 2024
Locations
- Primary: San Francisco, CA, US
Updates
What counts as an explanation of how an LLM works? In our last Stanford guest lecture, Ekdeep Singh Lubana explains the different levels of analysis in interpretability, and outlines his neuro-inspired "model systems approach". Plus, how in-context learning and many-shot jailbreaking are explained by LLM representations changing in-context (as a case study for that approach).
00:33 - What counts as an explanation?
04:47 - Levels of analysis & standard interpretability approaches
18:19 - The "model systems" approach to interp [Case study on in-context learning]
23:36 - How LLM representations change in-context
44:10 - Modeling ICL with rational analysis
1:10:54 - Conclusion & questions
Thanks again to Surya Ganguli for having us in his class!
What kinds of algorithmic building blocks do transformer models use to do different tasks? Jack Merullo recently gave a guest lecture at Stanford on "computational motifs" - the basic algorithmic primitives that show up again and again when we peer inside different circuits and models - and how they can help us understand models more fully. If you're interested in learning more about this approach to studying neural networks, check out this lecture (and our previous one, on causal approaches) - and stay tuned for our final lecture and reading list. Thanks again to Surya Ganguli for having us in his class! Watch the full lecture on YouTube: https://lnkd.in/g8n44bPs
Goodfire reposted this
I'll be at NeurIPS next week with Myra Deng, Jack Merullo, Mark Bissell, and others from the Goodfire team. If you want to chat about AI interpretability for scientific discovery, monitoring, or our recent research, fill out this form and we'll get in touch! https://lnkd.in/ejFFJbdj
We believe some high-priority directions for interpretability research are neglected by existing educational resources - so we've made some to get the community up to speed. If you're interested in AI but not caught up on interpretability, check out the Stanford guest lectures + reading lists we're releasing this month!
We just posted the first lecture: Atticus Geiger on causal approaches to interpretability - applying frameworks and tools from causal modeling to understand LLMs and other neural networks. The video link is in the comments!
Thanks to Surya Ganguli for having us in his course!
00:00 - Intro
01:51 - Activation steering (e.g. Golden Gate Claude)
10:23 - Causal mediation analysis (understanding the contribution of an intermediate component)
21:42 - Causal abstraction methods (explaining a complex causal system with a simple one)
54:54 - Lookback mechanisms: a case study in designing counterfactuals
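As a companion to the causal mediation segment, here is a minimal sketch of activation patching, one common way to measure the contribution of an intermediate component. The model (gpt2), prompt pair, and layer choice are illustrative assumptions, not taken from the lecture.

```python
# Sketch of activation patching: run a "clean" and a "corrupted" prompt, then
# splice the clean run's activations into the corrupted run at one layer and
# check how much of the clean behaviour is restored. All choices illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

clean = tok("The Eiffel Tower is in the city of", return_tensors="pt")
corrupt = tok("The Colosseum is in the city of", return_tensors="pt")
paris_id = tok(" Paris", add_special_tokens=False).input_ids[0]

layer = model.transformer.h[6]  # intermediate component under study
cache = {}

def save_hook(module, inputs, output):
    cache["clean"] = output[0].detach()

def patch_hook(module, inputs, output):
    # Patch only the final token position, so prompt lengths need not match.
    hidden = output[0].clone()
    hidden[:, -1, :] = cache["clean"][:, -1, :]
    return (hidden,) + output[1:]

# 1) Cache clean activations at the chosen layer.
h = layer.register_forward_hook(save_hook)
with torch.no_grad():
    model(**clean)
h.remove()

# 2) Re-run the corrupted prompt with the clean activations patched in.
h = layer.register_forward_hook(patch_hook)
with torch.no_grad():
    logits = model(**corrupt).logits
h.remove()

# If this layer mediates the "which city" information, patching should
# raise the probability of the clean answer token.
prob_paris = logits[0, -1].softmax(-1)[paris_id].item()
print(f"P(' Paris') after patching layer 6: {prob_paris:.4f}")
```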
Sparse autoencoders (SAEs) assume that model representations are static. But LLM features drift & evolve across context! Our new paper introduces a neuroscience-inspired method to capture these dynamic representations: Temporal Feature Analysis.
Applying it reveals rich structure in model activations over the course of a conversation. E.g. in the video below, we plot 100 different conversations with Gemma 2 2B. Initial user prompts show similar trajectories (purple), but as the conversations continue, they diverge into different topic clusters (blue/green/yellow).
This opens up a new "attack surface" for interpretability - information about the geometry of model representations that previous methods largely neglected. That information is crucial for faithfully understanding & manipulating models!
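The sketch below is not Temporal Feature Analysis itself, just a generic illustration of the underlying observation that representations move through activation space as a conversation unfolds: it collects per-token hidden states from a small open model (gpt2, an assumption for the sketch) and projects the trajectory to 2D with PCA.

```python
# Generic illustration (not the paper's method): track how residual-stream
# activations move through representation space over a conversation by
# projecting per-token hidden states to 2D with PCA. Model/layer are assumed.
import torch
from sklearn.decomposition import PCA
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

conversation = (
    "User: How do sparse autoencoders work?\n"
    "Assistant: They learn an overcomplete dictionary of features...\n"
    "User: And how is that used for interpretability?\n"
)
ids = tok(conversation, return_tensors="pt")

with torch.no_grad():
    out = model(**ids, output_hidden_states=True)

# Hidden states at a mid-depth layer: shape (seq_len, hidden_size).
acts = out.hidden_states[6][0].numpy()

# Consecutive rows are consecutive tokens, so the projected points trace
# how the representation drifts across the context.
traj = PCA(n_components=2).fit_transform(acts)
for t, (x, y) in enumerate(traj[:10]):
    print(f"token {t:3d}: ({x:+.2f}, {y:+.2f})")
```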
Goodfire reposted this
Prompting and activation steering are just two sides of the same coin! New research from Goodfire researchers. Pretty insane how well their Bayesian belief model predicts empirical results. Read more here: https://lnkd.in/eSGMjrwP
Goodfire reposted this
LLMs memorize a lot of their training data. But where do "memories" live inside models? How are they stored? How much are they involved in different tasks?
📌 Jack Merullo and Srihita Vatsavaya's new paper investigates all of these questions! They found a signature of memorization in model weights, and used it to edit models, generally removing the ability to recite text verbatim.
This reveals a spectrum of different model capabilities - some of which rely heavily on memorization (like factual recall), and others that do perfectly fine without it (pure reasoning).
In addition to answering fundamental scientific questions, this points to new applications - like being able to turn memorization up or down for different tasks, or making lightweight agents that excel at reasoning rather than encyclopedic knowledge.
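For context, a simple behavioral proxy for the "recite text verbatim" capability the edit targets is to check whether a model reproduces a known passage from a prefix under greedy decoding. The sketch below does exactly that with gpt2 and an illustrative passage; it is not the paper's weight-space analysis.

```python
# Behavioral check for verbatim memorization (not the paper's weight-space
# method): feed the model a prefix of a known passage and measure how much of
# the true continuation greedy decoding reproduces token-for-token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# Illustrative passage; in practice you would use documents from training data.
passage = ("We hold these truths to be self-evident, that all men are "
           "created equal, that they are endowed by their Creator with "
           "certain unalienable Rights")
ids = tok(passage, return_tensors="pt").input_ids[0]
split = len(ids) // 2
prefix, target = ids[:split], ids[split:]

with torch.no_grad():
    out = model.generate(prefix.unsqueeze(0),
                         max_new_tokens=len(target),
                         do_sample=False,
                         pad_token_id=tok.eos_token_id)
completion = out[0, split:]

# Fraction of continuation tokens reproduced exactly: 1.0 means the model
# recites the rest of the passage verbatim from the prefix alone.
n = min(len(completion), len(target))
match = (completion[:n] == target[:n]).float().mean().item()
print(f"verbatim match rate: {match:.2f}")
```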
Goodfire reposted this
44M real users’ data secured by our AI interpretability platform Ember. ⤵️
We partnered with Rakuten AI to translate frontier interpretability research into real production value - keeping personal user data safe without slowing down Rakuten’s AI agents.
PII detection is a common concern in enterprise AI systems. In production, it requires methods that are:
- Lightweight enough to run efficiently at scale
- High-recall, so no sensitive data slips through
- Trained only on synthetic data, since customer data can’t be used
Using Ember, we built interpretability-based classifiers to catch PII with techniques that outperform black-box guardrails on recall, latency, and cost. Our methods were 15–500× cheaper than state-of-the-art LLM-as-a-judge approaches.
Huge thanks to Nam Nguyen, Dhruvil Gala, Myra Deng, Michael Byun and Daniel Balsam for leading the charge on this project at Goodfire, and to our collaborators at Rakuten - Yusuke Kaji, Kenta Naruse, Felix Giovanni Virgo, Mio Takei, and others who were early believers in Goodfire and our vision of interpretable AI.
We’re excited about helping enterprises build safe, intentionally designed AI systems. If you’re interested in exploring what a partnership could look like, I’d love to chat.
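As a rough picture of what an "interpretability-based classifier" can look like, the sketch below trains a linear probe on a small model's hidden activations using synthetic labeled text. The model choice, layer, and toy data are assumptions for illustration; this is not Ember's implementation.

```python
# Generic sketch of a probe-style PII classifier: a linear model trained on a
# network's hidden activations, using only synthetic examples. Illustrative
# of the general approach, not Goodfire's Ember implementation.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# Synthetic training data (no real customer data): (text, has_PII) pairs.
synthetic = [
    ("My phone number is 555-0172, call me anytime.", 1),
    ("Please ship it to 42 Imaginary Lane, Springfield.", 1),
    ("Her email is jane.doe@example.com if you need it.", 1),
    ("The weather in March is usually mild here.", 0),
    ("Our quarterly revenue grew faster than expected.", 0),
    ("The recipe needs two eggs and a cup of flour.", 0),
]

def embed(text: str) -> torch.Tensor:
    """Mean-pooled hidden state from a mid-depth layer."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[6][0].mean(dim=0)

X = torch.stack([embed(t) for t, _ in synthetic]).numpy()
y = [label for _, label in synthetic]

# A linear probe is cheap to run at scale: one matrix-vector product per text
# on top of activations the serving stack already computes.
probe = LogisticRegression(max_iter=1000).fit(X, y)

test = embed("Reach me at 555-0199 after 5pm.").numpy().reshape(1, -1)
print(probe.predict_proba(test))
```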
Goodfire reposted this
Are you a high-agency, early- to mid-career researcher or engineer who wants to work on AI interpretability? We're looking for several Research Fellows and Research Engineering Fellows to start this fall. Fellows will work on areas like interp for scientific discovery, causal analysis, representational structure of memorization/generalization, dynamics of representations, and more. We're looking for a range of skillsets - e.g. RL, Bayesian inference, distributed systems, signal processing, and API infrastructure. Fellows will collaborate with senior members of our technical staff, contribute to core projects, and work full time in person in our San Francisco office. Full post and links to apply: https://lnkd.in/eum9VZhq