Link to publication: https://www.aclweb.org/anthology/W17-3008
Link to data: http://alt.qcri.org/~hmubarak/offensive/AJCommentsClassification-CF.xlsx
Task Description: Ternary (Obscene, Offensive but not obscene, Clean)
Details of Task: Incivility
Size of Dataset: 32000
Percentage Abusive: 81%
Language: Arabic
Level of Annotation: Posts
Platform: AlJazeera
Medium: Text
Reference: Mubarak, H., Darwish, K. and Magdy, W., 2017. Abusive Language Detection on Arabic Social Media. In: Proceedings of the First Workshop on Abusive Language Online. Vancouver, Canada: Association for Computational Linguistics, pp.52-56.
Link to publication: https://aclanthology.org/2022.lrec-1.238/
Link to data: https://github.com/avaapm/hatespeech
Task Description: Three-class (Hate speech, Offensive language, None)
Details of Task: Hate speech detection on social media (Twitter) including 5 target groups (gender, race, religion, politics, sports)
Size of Dataset: 100k English (27593 hate, 30747 offensive, 41660 none)
Percentage Abusive: 58.3%
Language: English
Level of Annotation: Posts
Platform: Twitter
Medium: Text, Image
Reference: Cagri Toraman, Furkan Şahinuç, Eyup Yilmaz. 2022. Large-Scale Hate Speech Detection with Cross-Domain Transfer. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 2215–2225, Marseille, France. European Language Resources Association.
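The percentage listed for this entry appears to follow directly from the class counts, if both hate and offensive tweets are counted as abusive. A minimal Python check using only the counts given in the entry; the variable names are illustrative and the grouping of classes is an assumption:

```python
# Class counts as listed in the "Size of Dataset" field of this entry.
counts = {"hate": 27593, "offensive": 30747, "none": 41660}

# Assumption: "abusive" means every class other than "none".
abusive = counts["hate"] + counts["offensive"]
total = sum(counts.values())
percentage_abusive = 100 * abusive / total

print(f"{abusive} of {total} tweets abusive ({percentage_abusive:.1f}%)")
```

Under that assumption the result rounds to the 58.3% stated in the entry.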
Link to publication: https://aclanthology.org/2022.findings-aacl.21/
Link to data: https://github.com/shekharRavi/CoRAL-dataset-Findings-of-the-ACL-AACL-IJCNLP-2022
Task Description: Multi-class based on context dependency categories (CDC)
Details of Task: Detecting CDC in abusive comments
Size of Dataset: 2240
Percentage Abusive: 100%
Language: Croatian
Level of Annotation: Posts
Platform: Newspaper comments
Medium: Text
Reference: Ravi Shekhar, Mladen Karan and Matthew Purver (2022). CoRAL: a Context-aware Croatian Abusive Language Dataset. Findings of the ACL: AACL-IJCNLP.
Link to publication: http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.760.pdf
Link to data: https://github.com/tommasoc80/AbuseEval
Task Description: Explicitness annotation of offensive and abusive content
Details of Task: Enriched versions of the OffensEval/OLID dataset with the distinction of explicit/implicit offensive messages and the new dimension for abusive messages. Labels for offensive language: EXPLICIT, IMPLICIT, NOT; Labels for abusive language: EXPLICIT, IMPLICIT, NOTABU
Size of Dataset: 14100
Percentage Abusive: 20.75%
Language: English
Level of Annotation: Tweets
Platform: Twitter
Medium: Text
Reference: Caselli, T., Basile, V., Mitrović, J., Kartoziya, I., and Granitzer, M. 2020. "I feel offended, don’t be abusive! Implicit/explicit messages in offensive and abusive language". The 12th Language Resources and Evaluation Conference (pp. 6193-6202). European Language Resources Association.
Link to publication: https://arxiv.org/abs/2103.10195
Link to data: https://drive.google.com/file/d/1mM2vnjsy7QfUmdVUpKqHRJjZyQobhTrW/view
Task Description: Binary (misogyny/none) and Multi-class (none, discredit, derailing, dominance, stereotyping & objectification, threat of violence, sexual harassment, damning)
Details of Task: Introducing an Arabic Levantine Twitter dataset for Misogynistic language
Size of Dataset: 6603
Percentage Abusive: 48.76%
Language: Arabic
Level of Annotation: Posts
Platform: Twitter
Medium: Text, Images
Reference: Hala Mulki and Bilal Ghanem. 2021. Let-Mi: An Arabic Levantine Twitter Dataset for Misogynistic Language. In Proceedings of the Sixth Arabic Natural Language Processing Workshop, pages 154–163, Kyiv, Ukraine (Virtual). Association for Computational Linguistics.
Link to publication: http://www.derczynski.com/papers/danish_hsd.pdf
Link to data: https://figshare.com/articles/Danish_Hate_Speech_Abusive_Language_data/12220805
Task Description: Branching structure of tasks: Binary (Offensive, Not), Within Offensive (Target, Not), Within Target (Individual, Group, Other)
Details of Task: Group-directed + Person-directed
Size of Dataset: 3600
Percentage Abusive: 12%
Language: Danish
Level of Annotation: Posts
Platform: Twitter, Reddit, Newspaper comments
Medium: Text
Reference: Sigurbergsson, G. and Derczynski, L., 2019. Offensive Language and Hate Speech Detection for Danish. ArXiv.
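The branching task structure in this entry can be read as three nested decisions, where each level is only annotated when the previous one is positive. A minimal Python sketch under that reading; the class name `Annotation`, its field names, and the label strings are hypothetical and are not taken from the released data files:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical label strings for the third level (Individual, Group, Other).
TARGET_TYPES = {"individual", "group", "other"}


@dataclass
class Annotation:
    offensive: bool                     # level 1: Offensive vs. Not
    targeted: Optional[bool] = None     # level 2: only set when offensive
    target_type: Optional[str] = None   # level 3: only set when targeted


def is_valid(a: Annotation) -> bool:
    """Check that lower levels are filled in only along a positive branch."""
    if not a.offensive:
        return a.targeted is None and a.target_type is None
    if a.targeted is None:
        return False
    if not a.targeted:
        return a.target_type is None
    return a.target_type in TARGET_TYPES


examples = [
    Annotation(offensive=False),
    Annotation(offensive=True, targeted=False),
    Annotation(offensive=True, targeted=True, target_type="group"),
]
print(all(is_valid(a) for a in examples))
```

Keeping the lower levels empty for non-offensive posts makes it explicit that those posts carry no target annotation at all, rather than a negative one.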
Link to publication: https://arxiv.org/abs/2107.13592
Link to data: https://doi.org/10.6084/m9.figshare.19333298.v1
Task Description: Hierarchical (offensive/not; untargeted/targeted; person/group/other)
Details of Task: Detect and categorise abusive language in social media data
Size of Dataset: 11874
Percentage Abusive: 13.2%
Language: Albanian
Level of Annotation: Posts
Platform: Instagram, YouTube
Medium: Text
Reference: Nurce, E., Keci, J., Derczynski, L., 2021. Detecting Abusive Albanian. arXiv:2107.13592
Link to publication: https://arxiv.org/pdf/2012.09686.pdf
Link to data: https://www.kaggle.com/naurosromim/bengali-hate-speech-dataset
Task Description: Binary (hateful, not)
Details of Task: Several categories: sports, entertainment, crime, religion, politics, celebrity and meme
Size of Dataset: 30000
Percentage Abusive: 33%
Language: Bengali
Level of Annotation: Posts
Platform: YouTube, Facebook
Medium: Text
Reference: Romim, N., Ahmed, M., Talukder, H., & Islam, M. S. (2021). Hate speech detection in the Bengali language: A dataset and its baseline evaluation. In Proceedings of International Joint Conference on Advances in Computational Intelligence (pp. 457-468). Springer, Singapore.
Link to publication: https://arxiv.org/abs/2009.10277
Link to data: https://huggingface.co/datasets/ucberkeley-dlab/measuring-hate-speech
Task Description: 10 ordinal labels (sentiment, (dis)respect, insult, humiliation, inferior status, violence, dehumanization, genocide, attack/defense, hate speech), which are debiased and aggregated into a continuous hate speech severity score (hate_speech_score) that includes a region for counterspeech & supportive speech. Includes 8 target identity groups (race/ethnicity, religion, national origin/citizenship, gender, sexual orientation, age, disability, political ideology) and 42 identity subgroups.
Details of Task: Hate speech measurement on social media in English
Size of Dataset: 39,565 comments annotated by 7,912 annotators on 10 ordinal labels, for 1,355,560 total labels.
Percentage Abusive: 25%
Language: English
Level of Annotation: Social media comment
Platform: Twitter, Reddit, YouTube
Medium: Text
Reference: Kennedy, C. J., Bacon, G., Sahn, A., & von Vacano, C. (2020). Constructing interval variables via faceted Rasch measurement and multitask deep learning: a hate speech application. arXiv preprint arXiv:2009.10277.