Data

  • HOT Speech Comments Dataset
    This text dataset includes 3,481 social media user comments posted in response to political news posts and videos on Twitter, YouTube, and Reddit in August 2021. The dataset also includes MTurk workers’ annotations of these comments as hateful, offensive, and/or toxic, and codes assigned by researchers describing various rhetorical dimensions of these comments.
    Keywords: incivility; hate speech; Reddit; Twitter; YouTube
    [ICPSR]
  • YouTube Political Discussion Dataset
    This dataset contains processed metadata of 1,267 US partisan media on YouTube, 274,241 YouTube political videos, and 9,304,653 YouTube users who have commented on YouTube political videos.
    Keywords: YouTube comments; cross-partisan communication; echo chamber; political science
    [README, dataverse]
  • Complete/Sampled Retweet Cascades Datasets
    These datasets include two sets of complete/sampled retweet cascades on the topics of cyberbullying (tweet sampling rate: 52.72%, 3M complete cascades, 1.17M sampled cascades) and YouTube sharing (tweet sampling rate: 91.53%, 2.02M complete cascades, 1.8M sampled cascades).
    Keywords: Twitter filtered stream; tweet cascades; sampled data; information diffusion
    [README, dataverse]
  • Vevo Music Graph Dataset
    This dataset contains the metadata and historical data of 60,740 videos from the complete set of Vevo artists who are active in English-speaking countries, and 63 daily snapshots of the video recommendation network.
    Keywords: YouTube recommendation network; recommender systems; video popularity; Vevo videos
    [README, dataverse]
  • YouTube Engagement '16 Datasets
    These datasets include (a) a tweeted videos dataset, which contains 5M YouTube videos that were uploaded and tweeted between 2016-07-01 and 2016-08-31 and were watched at least 100 times within 30 days of upload; and (b) three quality videos datasets, which contain 96K videos deemed of high quality by domain experts. To our knowledge, they are the only publicly available datasets that include information on video watch time.
    Keywords: video watch time; video quality; video engagement
    [README, dataverse]

Tutorial

  • Prevalence Estimation in Social Media Using Black Box Classifiers
    Siqi Wu and Paul Resnick
    AAAI International Conference on Weblogs and Social Media (ICWSM), 2023, 2024.
    [website, paper]

Demonstration

  • AttentionFlow
    AttentionFlow is a system to visualize networks of time series and the dynamic influence they have on one another. We demonstrate AttentionFlow using two real-world datasets: VevoMusic and WikiTraffic.
    [demo, code]
  • HIPie
    HIPie is an interactive visualization system to explain and predict the popularity of YouTube videos.
    [demo, code]

Software

  • pyquantifier
    pyquantifier is a Python package to estimate class prevalence in unlabeled datasets by specifying stability assumptions.
    [code]
  • twitter-intact-stream (currently deprecated)
    twitter-intact-stream is a Python package to reconstruct the complete Twitter filtered stream using Twitter API v1.
    [code]
  • youtube-insight (currently deprecated)
    youtube-insight is a Python package to collect metadata and historical time series data for YouTube videos.
    [code]