DE-LIMIT/Dataset

Dataset Description

The datasets used are as follows:

  1. Arabic: 1a. Mulki et al. 1b. Ousidhoum et al.
  2. English: 2a. Davidson et al. 2b. Gibert et al. 2c. Waseem et al. 2d. Basile et al. 2e. Ousidhoum et al. 2f. Founta et al.
  3. German: 3a. Ross et al. 3b. Bretschneider et al.
  4. Indonesian: 4a. Ibrohim et al. 4b. Alfina et al.
  5. Italian: 5a. Sanguinetti et al. 5b. Bosco et al.
  6. Polish: 6a. Ptaszynski et al.
  7. Portuguese: 7a. Fortuna et al.
  8. Spanish: 8a. Basile et al. 8b. Pereira et al.
  9. French: 9a. Ousidhoum et al.

In cases where the source provides only tweet IDs and labels rather than the actual text, use any Twitter scraping tool to retrieve the texts (see the sketch below). Some of the above datasets carry multiple labels for the texts, such as hate speech, abusive, and offensive; in such cases, only the texts labelled either hate speech or normal are kept and the rest are discarded.
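As a rough illustration of the re-hydration and filtering steps, here is a minimal sketch assuming Tweepy and a Twitter API v2 bearer token; the file name, the column names (`tweet_id`, `label`), and the label strings are hypothetical placeholders, not part of this repository:

```python
import pandas as pd
import tweepy  # one possible scraping tool; any Twitter client works

# Hypothetical credentials and dataset layout, for illustration only.
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")
df = pd.read_csv("full_data/example_dataset.csv")  # assumed columns: tweet_id, label

# The v2 tweet-lookup endpoint accepts at most 100 IDs per request.
texts = {}
ids = df["tweet_id"].astype(str).tolist()
for i in range(0, len(ids), 100):
    response = client.get_tweets(ids=ids[i : i + 100])
    for tweet in response.data or []:
        texts[str(tweet.id)] = tweet.text

df["text"] = df["tweet_id"].astype(str).map(texts)
df = df.dropna(subset=["text"])  # tweets deleted since labelling cannot be recovered

# Keep only the binary hate-speech/normal portion; label strings are dataset-specific.
df = df[df["label"].isin(["hatespeech", "normal"])]
df.to_csv("full_data/example_dataset_hydrated.csv", index=False)
```

Note that some tweets will have been deleted since the datasets were labelled, so the hydrated files are typically slightly smaller than the published label counts.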

Instructions for getting the datasets

  1. Download the datasets from the above sources and place them in the subfolder Dataset/full_data.
  2. Use Translation.ipynb to translate the datasets into English (see the first sketch after this list).
  3. Split the datasets into train, val, and test using the IDs given in the ID Mapping folder; Stratified Split.ipynb performs the splits (see the second sketch after this list).
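For step 2, Translation.ipynb is the authoritative implementation; as an illustration of what such a translation pass looks like, here is a hedged sketch using the deep-translator package (an assumption; the notebook may use a different service), with a hypothetical file name:

```python
import pandas as pd
from deep_translator import GoogleTranslator  # assumed backend, not necessarily the notebook's

# Hypothetical input file with a "text" column, for illustration only.
df = pd.read_csv("full_data/german_dataset.csv")

# Auto-detect the source language and translate to English.
# Free translation backends are rate-limited, so expect this to be slow.
translator = GoogleTranslator(source="auto", target="en")
df["text_en"] = df["text"].apply(translator.translate)

df.to_csv("full_data/german_dataset_en.csv", index=False)
```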
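For step 3, pinning the splits to the published IDs keeps results comparable across runs and papers. Below is a minimal sketch of applying such a mapping; the file names inside ID Mapping and the `id` column are assumptions, and Stratified Split.ipynb remains the authoritative implementation:

```python
import pandas as pd

# Hypothetical layout: one ID list per split inside "ID Mapping/".
data = pd.read_csv("full_data/example_dataset_hydrated.csv")

for split in ["train", "val", "test"]:
    ids = pd.read_csv(f"ID Mapping/example_{split}_ids.csv")["id"].astype(str)
    subset = data[data["tweet_id"].astype(str).isin(ids)]
    subset.to_csv(f"example_{split}.csv", index=False)
```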