Hate Speech Dataset Catalogue

This page catalogues datasets annotated for hate speech, online abuse, and offensive language. They may be useful, for example, for training a natural language processing system to detect such language.

The list is maintained by Leon Derczynski, Bertie Vidgen, Hannah Rose Kirk, Pica Johansson, Yi-Ling Chung, Mads Guldborg Kjeldgaard Kongsbak, Laila Sprejer, and Philine Zeinert.

We provide a list of datasets and keywords. If you would like to contribute to our catalogue or add your dataset, please see the instructions for contributing.

If you use these resources, please cite (and read!) our paper: Directions in Abusive Language Training Data: Garbage In, Garbage Out. And if you would like to find other resources for researching online hate, visit The Alan Turing Institute's Online Hate Research Hub or read The Alan Turing Institute's Reading List on Online Hate and Abuse Research.

If you're looking for a good paper on online hate training datasets (beyond our paper, of course!) then have a look at 'Resources and benchmark corpora for hate speech detection: a systematic review' by Poletto et al. in Language Resources and Evaluation.

Accompanying data statements are preferred for all corpora.

See datasets

How to contribute

We accept entries to our catalogue based on pull requests to the content folder. The dataset must be available for download to be included in the list. If you want to add an entry, follow these steps!

Please send just one dataset addition or edit at a time: edit it in, then save. This will make everyone’s life easier (including yours!).

Create file

Go to the repository URL, click the "Add file" dropdown, and then click "Create new file".

Choose location

On the following page, type content/datasets/<name-of-the-file>.md if you want to add an entry to the datasets catalogue, or content/keywords/<name-of-the-file>.md if you want to add an entry to the lists of abusive keywords. If you just want to add a static page, you can leave it in the root of content, and it will automatically be assigned a URL; for example, /content/about.md becomes the /about page.
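The path-to-URL rule can be sketched in a few lines of Python. This is only an illustration of the about.md example above, not the site's actual routing code. The function name content_path_to_url is hypothetical, and only the root-level case (/content/about.md becoming /about) is confirmed by the text. Treating nested files the same way is an assumption.

```python
from pathlib import PurePosixPath


def content_path_to_url(path: str) -> str:
    """Map a markdown file under content/ to the URL it is assumed to get.

    Hypothetical helper: only the root-level case is documented above.
    """
    parts = PurePosixPath(path.lstrip("/")).parts
    if len(parts) < 2 or parts[0] != "content":
        raise ValueError(f"expected a file under content/: {path!r}")
    if not parts[-1].endswith(".md"):
        raise ValueError(f"expected a .md file: {path!r}")
    # Drop the leading "content" folder and the ".md" suffix.
    relative = PurePosixPath(*parts[1:]).with_suffix("")
    return "/" + relative.as_posix()


print(content_path_to_url("/content/about.md"))
```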

Fill in content

Copy the contents of templates/dataset.md or templates/keywords.md, respectively, into the field below, filling out the fields with the correct data format. Everything below the second --- will automatically be rendered into the page, so you may add any standard markdown elements, e.g. tables, headings, or lists.
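To make the role of the two --- lines concrete, here is a minimal Python sketch that splits such a file into its header fields and the free-form markdown body. It assumes the header holds flat "key: value" lines, which is a simplification. The field names in the sample (title, language) are hypothetical placeholders, not the real template fields. The function split_front_matter is likewise illustrative and is not part of the site.

```python
def split_front_matter(text: str) -> tuple[dict[str, str], str]:
    """Split a page into (header fields, markdown body) at the two --- lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("file must start with a --- line")
    # Find the second --- line, which closes the header.
    end = next(
        (i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---"),
        None,
    )
    if end is None:
        raise ValueError("missing second --- line")
    fields: dict[str, str] = {}
    for line in lines[1:end]:
        if not line.strip():
            continue
        # Split at the first colon only, so values such as URLs stay intact.
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a 'key: value' line: {line!r}")
        fields[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return fields, body


# Hypothetical entry; the real field names come from templates/dataset.md.
sample = "---\ntitle: Example Dataset\nlanguage: English\n---\n## Notes\n\nFree-form markdown."
fields, body = split_front_matter(sample)
print(fields)
print(body)
```

Running a check like this locally before committing may catch a missing closing --- line, which would otherwise leave the header and body undivided.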

Commit changes

Click on "Commit changes". In the popup, make sure you give some brief detail on the proposed change, and then click on "Propose changes".

Submit PR

Submit the pull request on the next page when prompted.

Datasets


Abusive Language Detection on Arabic Social Media (Al Jazeera)

Link to publication: https://www.aclweb.org/anthology/W17-3008

Link to data: http://alt.qcri.org/~hmubarak/offensive/AJCommentsClassification-CF.xlsx

Task Description: Ternary (Obscene, Offensive but not obscene, Clean)

Details of Task: Incivility

Size of Dataset: 32000

Percentage Abusive: 81%

Language: Arabic

Level of Annotation: Posts

Platform: AlJazeera

Medium: Text

Reference: Mubarak, H., Darwish, K. and Magdy, W., 2017. Abusive Language Detection on Arabic Social Media. In: Proceedings of the First Workshop on Abusive Language Online. Vancouver, Canada: Association for Computational Linguistics, pp.52-56.

Large-Scale Hate Speech Detection with Cross-Domain Transfer

Link to publication: https://aclanthology.org/2022.lrec-1.238/

Link to data: https://github.com/avaapm/hatespeech

Task Description: Three-class (Hate speech, Offensive language, None)

Details of Task: Hate speech detection on social media (Twitter) including 5 target groups (gender, race, religion, politics, sports)

Size of Dataset: 100k English (27593 hate, 30747 offensive, 41660 none)

Percentage Abusive: 58.3%

Language: English

Level of Annotation: Posts

Platform: Twitter

Medium: Text, Image

Reference: Cagri Toraman, Furkan Şahinuç, Eyup Yilmaz. 2022. Large-Scale Hate Speech Detection with Cross-Domain Transfer. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 2215–2225, Marseille, France. European Language Resources Association.
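As a worked check of how the "Percentage Abusive" field relates to the class counts in this entry, the short Python sketch below recomputes it from the listed sizes. It assumes that both the hate and offensive classes count as abusive, which is consistent with the figure given here but is not stated as a general rule of the catalogue. The function name percentage_abusive is hypothetical.

```python
def percentage_abusive(class_counts: dict[str, int], abusive_labels: set[str]) -> float:
    """Return the share of items whose label is in abusive_labels, in percent."""
    total = sum(class_counts.values())
    if total == 0:
        raise ValueError("dataset has no items")
    abusive = sum(n for label, n in class_counts.items() if label in abusive_labels)
    return 100 * abusive / total


# Class sizes as listed in the entry above.
counts = {"hate": 27593, "offensive": 30747, "none": 41660}
print(round(percentage_abusive(counts, {"hate", "offensive"}), 1))  # 58.3
```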

CoRAL: a Context-aware Croatian Abusive Language Dataset

Link to publication: https://aclanthology.org/2022.findings-aacl.21/

Link to data: https://github.com/shekharRavi/CoRAL-dataset-Findings-of-the-ACL-AACL-IJCNLP-2022

Task Description: Multi-class based on context dependency categories (CDC)

Details of Task: Detecting CDC from abusive comments

Size of Dataset: 2240

Percentage Abusive: 100%

Language: Croatian

Level of Annotation: Posts

Platform: Posts

Medium: Newspaper Comments

Reference: Ravi Shekhar, Mladen Karan and Matthew Purver (2022). CoRAL: a Context-aware Croatian Abusive Language Dataset. Findings of the ACL: AACL-IJCNLP.

AbuseEval v1.0

Link to publication: http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.760.pdf

Link to data: https://github.com/tommasoc80/AbuseEval

Task Description: Explicitness annotation of offensive and abusive content

Details of Task: Enriched version of the OffensEval/OLID dataset that distinguishes explicit from implicit offensive messages and adds a new dimension for abusive messages. Labels for offensive language: EXPLICIT, IMPLICIT, NOT; labels for abusive language: EXPLICIT, IMPLICIT, NOTABU

Size of Dataset: 14100

Percentage Abusive: 20.75%

Language: English

Level of Annotation: Tweets

Platform: Twitter

Medium: Text

Reference: Caselli, T., Basile, V., Mitrović, J., Kartoziya, I., and Granitzer, M. 2020. "I feel offended, don’t be abusive! Implicit/explicit messages in offensive and abusive language". The 12th Language Resources and Evaluation Conference (pp. 6193-6202). European Language Resources Association.

Let-Mi: An Arabic Levantine Twitter Dataset for Misogynistic Language

Link to publication: https://arxiv.org/abs/2103.10195

Link to data: https://drive.google.com/file/d/1mM2vnjsy7QfUmdVUpKqHRJjZyQobhTrW/view

Task Description: Binary (misogyny/none) and Multi-class (none, discredit, derailing, dominance, stereotyping & objectification, threat of violence, sexual harassment, damning)

Details of Task: Introducing an Arabic Levantine Twitter dataset for Misogynistic language

Size of Dataset: 6603

Percentage Abusive: 48.76%

Language: Arabic

Level of Annotation: Posts

Platform: Twitter

Medium: Text, Images

Reference: Hala Mulki and Bilal Ghanem. 2021. Let-Mi: An Arabic Levantine Twitter Dataset for Misogynistic Language. In Proceedings of the Sixth Arabic Natural Language Processing Workshop, pages 154–163, Kyiv, Ukraine (Virtual). Association for Computational Linguistics.

Offensive Language and Hate Speech Detection for Danish

Link to publication: http://www.derczynski.com/papers/danish_hsd.pdf

Link to data: https://figshare.com/articles/Danish_Hate_Speech_Abusive_Language_data/12220805

Task Description: Branching structure of tasks: Binary (Offensive, Not), Within Offensive (Target, Not), Within Target (Individual, Group, Other)

Details of Task: Group-directed + Person-directed

Size of Dataset: 3600

Percentage Abusive: 12%

Language: Danish

Level of Annotation: Posts

Platform: Twitter, Reddit, Newspaper comments

Medium: Text

Reference: Sigurbergsson, G. and Derczynski, L., 2019. Offensive Language and Hate Speech Detection for Danish. ArXiv.
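The branching task structure in this entry means that the lower-level labels only exist when the level above them applies. A minimal Python sketch of that constraint follows. The representation (a boolean for each of the first two levels and a string for the target type) and the function name is_valid_annotation are assumptions made for illustration. The dataset's own files may encode the labels differently.

```python
from typing import Optional

# Third-level labels as named in the task description above.
TARGET_TYPES = {"Individual", "Group", "Other"}


def is_valid_annotation(
    offensive: bool,
    targeted: Optional[bool] = None,
    target_type: Optional[str] = None,
) -> bool:
    """Check that an annotation respects the three-level branching scheme."""
    if not offensive:
        # Non-offensive posts carry no lower-level labels.
        return targeted is None and target_type is None
    if targeted is None:
        # Offensive posts must say whether they are targeted.
        return False
    if not targeted:
        # Untargeted offensive posts carry no target type.
        return target_type is None
    return target_type in TARGET_TYPES


print(is_valid_annotation(True, True, "Group"))
print(is_valid_annotation(False, None, "Group"))
```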

Detecting Abusive Albanian

Link to publication: https://arxiv.org/abs/2107.13592

Link to data: https://doi.org/10.6084/m9.figshare.19333298.v1

Task Description: Hierarchical (offensive/not; untargeted/targeted; person/group/other)

Details of Task: Detect and categorise abusive language in social media data

Size of Dataset: 11874

Percentage Abusive: 13.2%

Language: Albanian

Level of Annotation: Posts

Platform: Instagram, Youtube

Medium: Text

Reference: Nurce, E., Keci, J., Derczynski, L., 2021. Detecting Abusive Albanian. arXiv:2107.13592

Hate Speech Detection in the Bengali language: A Dataset and its Baseline Evaluation

Link to publication: https://arxiv.org/pdf/2012.09686.pdf

Link to data: https://www.kaggle.com/naurosromim/bengali-hate-speech-dataset

Task Description: Binary (hateful, not)

Details of Task: Several categories: sports, entertainment, crime, religion, politics, celebrity and meme

Size of Dataset: 30000

Percentage Abusive: 33%

Language: Bengali

Level of Annotation: Posts

Platform: Youtube, Facebook

Medium: Text

Reference: Romim, N., Ahmed, M., Talukder, H., & Islam, M. S. (2021). Hate speech detection in the bengali language: A dataset and its baseline evaluation. In Proceedings of International Joint Conference on Advances in Computational Intelligence (pp. 457-468). Springer, Singapore.

Measuring Hate Speech

Link to publication: https://arxiv.org/abs/2009.10277

Link to data: https://huggingface.co/datasets/ucberkeley-dlab/measuring-hate-speech

Task Description: 10 ordinal labels (sentiment, (dis)respect, insult, humiliation, inferior status, violence, dehumanization, genocide, attack/defense, hate speech), which are debiased and aggregated into a continuous hate speech severity score (hate_speech_score) that includes a region for counterspeech and supportive speech. Includes 8 target identity groups (race/ethnicity, religion, national origin/citizenship, gender, sexual orientation, age, disability, political ideology) and 42 identity subgroups.

Details of Task: Hate speech measurement on social media in English

Size of Dataset: 39,565 comments annotated by 7,912 annotators on 10 ordinal labels, for 1,355,560 total labels.

Percentage Abusive: 25%

Language: English

Level of Annotation: Social media comment

Platform: Twitter, Reddit, Youtube

Medium: Text

Reference: Kennedy, C. J., Bacon, G., Sahn, A., & von Vacano, C. (2020). Constructing interval variables via faceted Rasch measurement and multitask deep learning: a hate speech application. arXiv preprint arXiv:2009.10277.
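Because hate_speech_score is continuous, a classifier built on this dataset needs some rule for turning scores into regions. The Python sketch below shows one way to do that. The cut-offs 0.5 and -1.0 are assumptions for illustration and should be checked against the dataset's own documentation before use. The function name score_region and the region names are hypothetical.

```python
def score_region(
    score: float,
    hate_threshold: float = 0.5,      # assumed cut-off, verify before use
    support_threshold: float = -1.0,  # assumed cut-off, verify before use
) -> str:
    """Place a continuous hate_speech_score into one of three coarse regions."""
    if score > hate_threshold:
        return "hate speech"
    if score < support_threshold:
        return "counterspeech or supportive"
    return "neutral or ambiguous"


for example_score in (1.7, 0.0, -2.3):
    print(example_score, "->", score_region(example_score))
```

Keeping the thresholds as parameters makes it easy to study how sensitive a downstream binary label is to where the cut is placed.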

Lists of Abusive Keywords