DE-LIMIT/Dataset

Dataset Description

The datasets used are as follows:

  1. Arabic: 1a. Mulki et al. 1b. Ousidhoum et al.
  2. English: 2a. Davidson et al. 2b. Gibert et al. 2c. Waseem et al. 2d. Basile et al. 2e. Ousidhoum et al. 2f. Founta et al.
  3. German: 3a. Ross et al. 3b. Bretschneider et al.
  4. Indonesian: 4a. Ibrohim et al. 4b. Alfina et al.
  5. Italian: 5a. Sanguinetti et al. 5b. Bosco et al.
  6. Polish: 6a. Ptaszynski et al.
  7. Portuguese: 7a. Fortuna et al.
  8. Spanish: 8a. Basile et al. 8b. Pereira et al.
  9. French: 9a. Ousidhoum et al.

In cases where the source provides only tweet IDs and labels rather than the actual text, use any Twitter scraping tool to retrieve the texts (see the sketch below). Some of the above datasets carry multiple labels for the texts, such as hate speech, abusive, and offensive; in such cases, only the texts labelled either hate speech or normal are kept and the rest are discarded.
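As a rough illustration of the re-hydration and filtering steps, here is a minimal sketch assuming Tweepy and a Twitter API v2 bearer token; the file name, the column names (`tweet_id`, `label`), and the label strings are hypothetical placeholders, not part of this repository:

```python
import pandas as pd
import tweepy  # one possible scraping tool; any Twitter client works

# Hypothetical credentials and dataset layout, for illustration only.
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")
df = pd.read_csv("full_data/example_dataset.csv")  # assumed columns: tweet_id, label

# The v2 tweet-lookup endpoint accepts at most 100 IDs per request.
texts = {}
ids = df["tweet_id"].astype(str).tolist()
for i in range(0, len(ids), 100):
    response = client.get_tweets(ids=ids[i : i + 100])
    for tweet in response.data or []:
        texts[str(tweet.id)] = tweet.text

df["text"] = df["tweet_id"].astype(str).map(texts)
df = df.dropna(subset=["text"])  # tweets deleted since labelling cannot be recovered

# Keep only the binary hate-speech/normal portion; label strings are dataset-specific.
df = df[df["label"].isin(["hatespeech", "normal"])]
df.to_csv("full_data/example_dataset_hydrated.csv", index=False)
```

Note that some tweets will have been deleted since the datasets were labelled, so the hydrated files are typically slightly smaller than the published label counts.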

Instructions for getting the datasets

  1. Download the datasets from the above sources and place them in the subfolder Dataset/full_data.
  2. Use Translation.ipynb to translate the datasets into English (see the first sketch after this list).
  3. Split the datasets into train, val, and test using the IDs given in the ID Mapping folder; Stratified Split.ipynb performs the splits (see the second sketch after this list).
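For step 2, Translation.ipynb is the authoritative implementation; as an illustration of what such a translation pass looks like, here is a hedged sketch using the deep-translator package (an assumption; the notebook may use a different service), with a hypothetical file name:

```python
import pandas as pd
from deep_translator import GoogleTranslator  # assumed backend, not necessarily the notebook's

# Hypothetical input file with a "text" column, for illustration only.
df = pd.read_csv("full_data/german_dataset.csv")

# Auto-detect the source language and translate to English.
# Free translation backends are rate-limited, so expect this to be slow.
translator = GoogleTranslator(source="auto", target="en")
df["text_en"] = df["text"].apply(translator.translate)

df.to_csv("full_data/german_dataset_en.csv", index=False)
```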
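For step 3, pinning the splits to the published IDs keeps results comparable across runs and papers. Below is a minimal sketch of applying such a mapping; the file names inside ID Mapping and the `id` column are assumptions, and Stratified Split.ipynb remains the authoritative implementation:

```python
import pandas as pd

# Hypothetical layout: one ID list per split inside "ID Mapping/".
data = pd.read_csv("full_data/example_dataset_hydrated.csv")

for split in ["train", "val", "test"]:
    ids = pd.read_csv(f"ID Mapping/example_{split}_ids.csv")["id"].astype(str)
    subset = data[data["tweet_id"].astype(str).isin(ids)]
    subset.to_csv(f"example_{split}.csv", index=False)
```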