Updates

  • New course materials are under active development, but you can see our proposed activities in the tutorial proposal.

  • We are excited to bring this tutorial (2nd Edition) to ICWSM 2024. Hope to see you in Buffalo!


Course Description

Many problems in computational social science require measuring the frequency of certain labels within a data sample. This task is generally called "prevalence estimation" or "quantification". Frequently, researchers make use of a pre-trained classifier that assigns a label, or a label probability, to each item. Classifier-aided approaches are especially attractive when the dataset is large, ground truth labels are difficult or expensive to obtain, and a high-quality classifier such as ChatGPT or Jigsaw's Perspective API is available for use. However, it is usually not safe to simply apply the classifier to all items and count the predictions of each class, because the test dataset may differ in important ways from the dataset on which the classifier was trained, a phenomenon called "distribution shift". In addition, a second type of distribution shift may occur when one wishes to compare prevalence across multiple datasets, such as when tracking changes over time. To cope with this, some stability assumptions must be made about the nature of possible distribution shifts across datasets.
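
To make the problem concrete, here is a minimal sketch (not taken from the tutorial materials) of why naive "classify and count" can be biased under distribution shift. The dataset sizes, prevalences, and the classifier's sensitivity and specificity below are arbitrary assumptions chosen for illustration.

```python
# Minimal sketch: naive classify-and-count is biased when the target dataset's
# prevalence differs from what the classifier's error rates implicitly assume.
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, prevalence, sensitivity=0.85, specificity=0.90):
    """Return ground-truth labels and classifier predictions for n items."""
    y = rng.random(n) < prevalence                        # true labels
    flip = np.where(y,
                    rng.random(n) > sensitivity,          # false negatives
                    rng.random(n) < 1 - specificity)      # false positives
    y_hat = np.where(flip, ~y, y)
    return y, y_hat

# Base dataset: 20% positive. Target dataset: 50% positive (prevalence shift).
y_base, y_hat_base = simulate(100_000, prevalence=0.20)
y_tgt, y_hat_tgt = simulate(100_000, prevalence=0.50)

print("true target prevalence:", y_tgt.mean())       # ~0.50
print("classify-and-count:    ", y_hat_tgt.mean())   # ~0.475, noticeably biased
```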

This tutorial will introduce a conceptual framework called "Calibrate-Extrapolate". It rethinks the prevalence estimation process as first calibrating the classifier outputs against ground truth labels to obtain the joint distribution of outputs and labels in a base dataset, and then extrapolating to the joint distribution of a target dataset. Visualizing the joint distribution makes the stability assumption underlying a prevalence estimation technique clear and easy to understand.
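
As a rough illustration, the sketch below shows one possible instantiation of the calibrate-then-extrapolate idea; it is not the authors' reference implementation. It bins classifier scores on a labeled base sample to estimate P(y = 1 | score bin), then applies that calibration curve to the target dataset's unlabeled scores, under the stability assumption that P(y | score) is the same in both datasets.

```python
# Hedged sketch of a binning-based calibrate-then-extrapolate estimator.
# Assumes scores in [0, 1] and binary labels; all variable names are hypothetical.
import numpy as np

def calibrate(scores_base, labels_base, n_bins=10):
    """Calibration step: estimate P(y=1 | score bin) from a labeled base sample."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(scores_base, bins) - 1, 0, n_bins - 1)
    curve = np.array([
        labels_base[idx == b].mean() if np.any(idx == b) else np.nan
        for b in range(n_bins)
    ])
    return bins, curve

def extrapolate(scores_target, bins, curve):
    """Extrapolation step: average the calibrated probabilities over the target
    dataset's scores, assuming P(y | score) is stable across datasets."""
    idx = np.clip(np.digitize(scores_target, bins) - 1, 0, len(curve) - 1)
    return np.nanmean(curve[idx])   # empty base bins are simply ignored here

# Usage with hypothetical arrays:
# bins, curve = calibrate(scores_base, labels_base)
# prevalence_estimate = extrapolate(scores_target, bins, curve)
```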

We will provide hands-on coding exercises that walk participants through solving a prevalence estimation problem on simulated data. All code will be provided in a Jupyter notebook, with TO-DO code blocks for participants to fill in. We will generate several simulated datasets as a way to build research intuitions about the impact of a classifier's predictive power and of violations of the stability assumptions (see the data-generating sketch below). We will also discuss the data generating processes and stability assumptions from a causal perspective. After attending this tutorial, participants will be able to understand the basics and challenges of the prevalence estimation problem in computational social science, and to construct a data analysis pipeline for prevalence estimation in their own projects.
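
The following is a hypothetical data-generating sketch in the spirit of those exercises, not the actual notebook: true labels are drawn first, and classifier scores are then drawn from class-conditional Beta distributions whose separation acts as a knob for predictive power.

```python
# Hypothetical simulation: a `power` parameter controls how well the classifier's
# scores separate positives from negatives.
import numpy as np

rng = np.random.default_rng(0)

def simulate_scores(n, prevalence, power=4.0):
    """Higher `power` pushes the positive and negative score distributions apart."""
    y = rng.random(n) < prevalence
    scores = np.where(y,
                      rng.beta(power, 1.0, size=n),   # positives: scores near 1
                      rng.beta(1.0, power, size=n))   # negatives: scores near 0
    return y, scores

# A weak and a strong classifier on the same 30%-positive population:
y_weak, s_weak = simulate_scores(50_000, prevalence=0.30, power=1.5)
y_strong, s_strong = simulate_scores(50_000, prevalence=0.30, power=8.0)
```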

Presentations

Course Materials (ICWSM 2023 version)

Paper

  • Siqi Wu, Paul Resnick (2024). Calibrate-Extrapolate: Rethinking Prevalence Estimation with Black Box Classifiers. arXiv:2401.09329


Instructors

Siqi Wu (2023, 2024)
Paul Resnick (2023)