Datasets and APIs

We’ve started compiling some datasets and APIs relevant to the hackathon, which we encourage participants to use, repurpose, and combine in creative ways!

News, Fake News & Social Media

Online rating systems

Online rating and recommender systems suffer from a general “rich get richer” problem: items that are already popular attract ever more attention and ratings. Many reviews are also known to be fake.
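The “rich get richer” dynamic can be illustrated with a toy simulation (a sketch with made-up parameters, not a model of any real platform): each new rating goes to an item with probability proportional to the ratings it already has, so a handful of items ends up with a large share of all ratings.

```python
import random

def simulate_ratings(num_items=100, num_events=10_000, seed=0):
    """Toy preferential-attachment model of a rating system.

    Each new rating is assigned to an item with probability
    proportional to (current rating count + 1); the +1 gives
    unrated items a nonzero chance of being discovered.
    """
    rng = random.Random(seed)
    counts = [0] * num_items
    for _ in range(num_events):
        weights = [c + 1 for c in counts]
        item = rng.choices(range(num_items), weights=weights, k=1)[0]
        counts[item] += 1
    return counts

counts = sorted(simulate_ratings(), reverse=True)
top_share = sum(counts[:10]) / sum(counts)
print(f"Top 10% of items receive {top_share:.0%} of all ratings")
```

Under a uniform (unbiased) process the top 10 of 100 items would get about 10% of ratings; the feedback loop pushes their share far higher.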

(Un-)Ethical online user tracking and data collection

  • Princeton Web Census
    Reports on tracking, fingerprinting, and other ethically questionable practices by the top 1M websites (data available).
  • Sociam/Oxford data on mobile app permissions and connections to sketchy servers.
    [coming soon]

Language related

  • Wikipedia toxic comments dataset:
    A dataset of comments from Wikipedia talk pages, with crowdsourced annotations flagging toxic comments and personal attacks.
  • Conversation AI’s Perspective API:
    An API giving ML assessments of toxicity for natural-language comments, built by the Conversation AI project (associated with Google).
    An analysis of the associated biases (reflecting social biases against groups such as women, Muslims, and LGBT people):
    Blog post on Medium
    Technical presentation from the AIES 2018 conference
  • Biased and de-biased word embeddings:
    Word embeddings (vector representations of words, capturing their semantics) learnt from a large corpus absorb the social biases expressed in that corpus.
    word2vec tool for learning “raw” word embeddings.
    Tutorial on removing gender bias from such word embeddings (with a dataset of de-biased word embeddings):
  • African-American English (AAE) tweets
    AAE is a dialect of English used predominantly by Black Americans, with roots in African languages. It is often not recognized as English by standard NLP tools (particularly when combined with abbreviations in short tweets), and it is associated with social biases related to education and poverty.
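The core step of the de-biasing approach mentioned above can be sketched in a few lines: define a bias direction from a word pair (e.g. “he” minus “she”) and project that component out of a word’s vector, so the word becomes neutral along that direction. The 2-d vectors below are hypothetical toy values for illustration, not real embeddings.

```python
# Minimal sketch of the "neutralize" step used in hard de-biasing
# of word embeddings. Pure Python, no numpy, toy 2-d vectors.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def scale(u, s):
    return [a * s for a in u]

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def bias_direction(vec_a, vec_b):
    """Unit vector along vec_a - vec_b, e.g. vec('he') - vec('she')."""
    d = sub(vec_a, vec_b)
    n = dot(d, d) ** 0.5
    return scale(d, 1.0 / n)

def neutralize(v, g):
    """Remove the component of v along bias direction g: v - (v . g) g."""
    return sub(v, scale(g, dot(v, g)))

# Hypothetical toy embeddings (for illustration only).
he, she = [1.0, 1.0], [-1.0, 1.0]
doctor = [0.4, 0.9]

g = bias_direction(he, she)           # here the direction is [1, 0]
doctor_neutral = neutralize(doctor, g)
print(doctor_neutral)                 # gender component removed
```

After neutralization, the word vector is orthogonal to the bias direction, so similarity comparisons along that direction no longer favour either side of the pair.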

Police and justice related

  • New York City stop-and-frisk data
    The NYPD maintained a controversial “stop and frisk” policy for many years, under which officers could stop people on the street on suspicion of criminal activity and “frisk” them. These stops were found to disproportionately affect Black and Hispanic people, and the practice was later deemed illegal. Each stop and its outcome were recorded, in a dataset which is now available:
  • COMPAS dataset
    The infamous COMPAS tool aimed to predict how likely a convicted criminal was to re-offend, and was found to be biased against Black defendants in a well-known ProPublica study. The dataset and study are available at:
  • Feedback loops in predictive policing tools
    [coming soon]
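The kind of disparity ProPublica reported can be expressed as a difference in false positive rates: among people who did not re-offend, how often was each group flagged as high risk? Below is a minimal sketch of that metric on synthetic records; the field names (`group`, `reoffended`, `high_risk`) are hypothetical, not the actual COMPAS column names.

```python
def false_positive_rate(records, group):
    """Among members of `group` who did NOT re-offend, the fraction
    that the tool nonetheless flagged as high risk."""
    negatives = [r for r in records
                 if r["group"] == group and not r["reoffended"]]
    if not negatives:
        return 0.0
    return sum(r["high_risk"] for r in negatives) / len(negatives)

# Synthetic illustration (NOT real COMPAS rows): both groups have the
# same re-offence behaviour, but group "b"'s non-re-offenders are
# flagged high-risk twice as often as group "a"'s.
records = [
    {"group": "a", "reoffended": False, "high_risk": False},
    {"group": "a", "reoffended": False, "high_risk": False},
    {"group": "a", "reoffended": False, "high_risk": True},
    {"group": "a", "reoffended": True,  "high_risk": True},
    {"group": "b", "reoffended": False, "high_risk": True},
    {"group": "b", "reoffended": False, "high_risk": True},
    {"group": "b", "reoffended": False, "high_risk": False},
    {"group": "b", "reoffended": True,  "high_risk": True},
]

print(false_positive_rate(records, "a"))  # 1/3
print(false_positive_rate(records, "b"))  # 2/3
```

A gap between these rates means one group bears more of the cost of the tool’s mistakes, even if overall accuracy looks similar across groups.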

Emancipating Users Against Algorithmic Biases for a Trusted Digital Economy
