Welcome! On this website you will find notes from the ICML 2024 tutorial "Data Attribution at Scale." These notes were put together by Andrew Ilyas, Kristian Georgiev, Logan Engstrom, and Sung Min (Sam) Park. They are very much a work in progress—please feel free to suggest edits/improvements by reaching out to any one of us, or by emailing us at ml-data-tutorial@mit.edu.

Abstract

Data attribution is the study of the relation between data and ML predictions. In downstream applications, data attribution methods can help interpret and compare models; curate datasets; and assess learning algorithm stability.

This tutorial surveys the field of data attribution, with a focus on what we call “predictive data attribution.” We first motivate this notion within a broad, purpose-based taxonomy of data attribution. Next, we highlight how one can view predictive data attribution through the lens of a classic statistical problem that we call “weighted refitting.” We discuss why classical methods for solving the weighted refitting problem struggle when directly applied to large-scale machine learning settings (and thus cannot directly solve problems in modern contexts). With these shortcomings in mind, we overview recent progress on performing predictive data attribution for modern ML models. Finally, we discuss current and future applications of data attribution.

Slides: PDF

Video: SlidesLive (requires ICML account), YouTube (coming soon!)

Chapters

I: Data problems (and solution concepts) in ML (July 18, 2024)

II: Theoretical foundations (July 18, 2024)

III: Scaling to deep learning (July 18, 2024)

IV: Data attribution in the wild (July 18, 2024)
