Towards Data-centric Interpretability with Sparse Autoencoders
Nick Jiang*, Lily Sun*, Lewis Smith, Neel Nanda
NeurIPS Mech Interp Workshop 2025 (Spotlight)
paper | blog | summary
TLDR: we show that sparse autoencoders are a versatile tool for textual data analysis.
Vision Transformers Don’t Need Trained Registers
Nick Jiang*, Amil Dravid*, Alexei Efros, Yossi Gandelsman
NeurIPS 2025 (Spotlight - top 3% of submissions)
paper | code | summary
TLDR: we mechanistically study how attention sinks emerge, then remove them at test time.
Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations
Nick Jiang*, Anish Kachinthaya*, Suzie Petryk, Yossi Gandelsman
ICLR 2025
paper | code | summary
TLDR: we use the logit lens to identify and remove hallucinations from vision-language models.