Questions

Research:

  • The circuits agenda breaks model computation down into a series of if-statements, akin to a decision tree. But as the studied behaviors get more complex, this framework seems to lack the expressive power to explain them in a human-understandable way. What is the most expressive, understandable framework for explaining model computations and outputs? What if explanations could look closer to code, with variables, functions, loops, and state? (A toy sketch of what that contrast might look like follows this list.)
  • How might we reverse-engineer known science from biological and physics-based models? Unlike language / vision models, scientific models have an actual “ground truth” for what they should be learning.
  • How do we successfully “bridge” modalities in multi-modal models? How can we tell when we’ve succeeded, and how can we do it scalably?
  • How might we engineer AI agents to identify whether mechanistic explanations for phenomena in small models carry over to large models? A current problem in interpretability is that explaining model phenomena in toy setups is quite tractable, but finding the equivalent mechanisms in larger, newly released models is too time-consuming.

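To make the first question more concrete, here is a purely hypothetical Python sketch of the contrast I have in mind. The task (bracket matching), the variable names, and both “explanations” are invented for illustration; no existing interpretability tool produces output like this.

```python
# Hypothetical illustration only: what a "code-like" explanation of a model
# behavior might look like, compared to a flat cascade of if-statements.
# The bracket-matching task and all names here are made up for the example.

def circuit_style_explanation(token: str, prev_token: str) -> str:
    """Circuit / decision-tree style: a flat list of if-statements.

    Each branch maps a local token pattern to a prediction, with no shared state.
    """
    if prev_token == "(" and token == "(":
        return "predict ')' later"
    if prev_token == "(" and token == ")":
        return "predict end of span"
    return "no strong prediction"


def code_style_explanation(tokens: list[str]) -> str:
    """Program-like style: the same behavior described with a variable,
    a loop over the context, and state that persists across tokens.
    """
    depth = 0  # hypothesized "bracket depth" feature carried across the context
    for tok in tokens:
        if tok == "(":
            depth += 1
        elif tok == ")":
            depth -= 1
    # The model's next-token preference is framed as a function of that state.
    return ")" if depth > 0 else "end of span"


if __name__ == "__main__":
    context = ["(", "(", "a", ")"]
    print(circuit_style_explanation(token=context[-1], prev_token=context[-2]))
    print(code_style_explanation(context))
```

The appeal of the second form is that one variable and one loop summarize behavior that would otherwise need a separate if-statement for every depth and token pattern.
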
Have thoughts on these? Email me at nickj [at] berkeley [dot] edu!