Welcome! On this website you will find notes from the ICML 2024 tutorial "Data Attribution at Scale." These notes were put together by Andrew Ilyas, Kristian Georgiev, Logan Engstrom, and Sung Min (Sam) Park. They are very much a work in progress—please feel free to suggest edits/improvements by reaching out to any one of us, or by emailing us at ml-data-tutorial@mit.edu.

Abstract

Data attribution is the study of the relation between data and ML predictions. In downstream applications, data attribution methods can help interpret and compare models; curate datasets; and assess learning algorithm stability.

This tutorial surveys the field of data attribution, with a focus on what we call “predictive data attribution.” We first motivate this notion within a broad, purpose-based taxonomy of data attribution. Next, we highlight how one can view predictive data attribution through the lens of a classic statistical problem that we call “weighted refitting.” We discuss why classical methods for solving the weighted refitting problem struggle when directly applied to large-scale machine learning settings (and thus cannot directly solve problems in modern contexts). With these shortcomings in mind, we review recent progress on performing predictive data attribution for modern ML models. Finally, we discuss current and future applications of data attribution.


Slides: PDF

Video: SlidesLive (requires an ICML account), YouTube (coming soon!)



Chapters
