Questions

Research:

  • The circuits agenda breaks model computation down into a series of if-statements, akin to a decision tree. But as the studied behaviors get more complex, this framework seems to lack the expressive power to explain them in a human-understandable way. What is the most expressive, understandable framework for explaining model computations and outputs? What if explanations could look closer to code, with variables, functions, loops, and state? (A toy sketch of what that contrast might look like follows this list.)
  • How might we reverse-engineer known science from biological and physics-based models? Unlike language / vision models, scientific models have an actual “ground truth” for what they should be learning.
  • How do we successfully “bridge” modalities in multi-modal models? How can we tell when we’ve succeeded, and how can we do it scalably?
  • How might we engineer AI agents to identify whether mechanistic explanations for phenomena in small models carry over to large models? A current problem in interpretability is that explaining model phenomena in toy setups is quite tractable, but finding the equivalent mechanisms in larger, newly released models is too time-consuming.

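To make the first question more concrete, here is a purely hypothetical Python sketch of the contrast I have in mind. The task (bracket matching), the variable names, and both “explanations” are invented for illustration; no existing interpretability tool produces output like this.

```python
# Hypothetical illustration only: what a "code-like" explanation of a model
# behavior might look like, compared to a flat cascade of if-statements.
# The bracket-matching task and all names here are made up for the example.

def circuit_style_explanation(token: str, prev_token: str) -> str:
    """Circuit / decision-tree style: a flat list of if-statements.

    Each branch maps a local token pattern to a prediction, with no shared state.
    """
    if prev_token == "(" and token == "(":
        return "predict ')' later"
    if prev_token == "(" and token == ")":
        return "predict end of span"
    return "no strong prediction"


def code_style_explanation(tokens: list[str]) -> str:
    """Program-like style: the same behavior described with a variable,
    a loop over the context, and state that persists across tokens.
    """
    depth = 0  # hypothesized "bracket depth" feature carried across the context
    for tok in tokens:
        if tok == "(":
            depth += 1
        elif tok == ")":
            depth -= 1
    # The model's next-token preference is framed as a function of that state.
    return ")" if depth > 0 else "end of span"


if __name__ == "__main__":
    context = ["(", "(", "a", ")"]
    print(circuit_style_explanation(token=context[-1], prev_token=context[-2]))
    print(code_style_explanation(context))
```

The appeal of the second form is that one variable and one loop summarize behavior that would otherwise need a separate if-statement for every depth and token pattern.
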
Have thoughts on these? Email me at nickj [at] berkeley [dot] edu!