Questions

Research:

  • The circuits agenda breaks model computation down into a series of if-statements, akin to a decision tree. But as the studied behaviors get more complex, this framework seems to lack expressive power for explaining behaviors in a human-understandable way. What is the most expressive, intuitive framework to explain computations and model outputs? What if explanations could look closer to code, with variables, functions, loops, and state?
  • How can we discover unknown model behaviors by analyzing training data and outputs for hidden patterns and biases? This question drives my research here.
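
To make the contrast in the first question concrete, here is a toy sketch (entirely hypothetical, not drawn from any real interpretability result): two "explanations" of the same behavior, detecting whether a token sequence contains a repeat. The flat if-statement style mirrors a decision tree; the code style uses a variable, a loop, and state.

```python
# Toy illustration: two styles of explanation for the same behavior.
# All function names and the example task are hypothetical.

# "Circuits" style: a flat cascade of if-statements, one per position pair.
# It works, but grows quadratically with sequence length and hides the
# underlying algorithm -- here it only covers sequences up to length 3.
def has_repeat_flat(tokens):
    if len(tokens) > 1 and tokens[0] == tokens[1]:
        return True
    if len(tokens) > 2 and tokens[0] == tokens[2]:
        return True
    if len(tokens) > 2 and tokens[1] == tokens[2]:
        return True
    return False

# "Code" style: a loop plus accumulated state makes the algorithm itself
# legible, and the explanation covers inputs of any length.
def has_repeat_code(tokens):
    seen = set()          # state carried across positions
    for t in tokens:      # one pass over the sequence
        if t in seen:
            return True
        seen.add(t)
    return False
```

On short inputs the two agree, but on `["a", "b", "c", "a"]` the flat version misses the repeat while the loop-and-state version catches it, which is the expressivity gap the question points at.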

Have thoughts on these? Email me at nickj [at] berkeley [dot] edu!