
Twitter Hate Speech Recognition

The goal of this project is to build a model that identifies tweets promoting hate, which is an important problem for social networks. I developed this work with a university colleague for a university exam.

Project Details

We collected almost 25,000 English tweets from a public repository. These tweets are labeled with one of three categories: hate speech, offensive speech, or neutral speech. The labels are based on user reports.

First, we applied several text processing techniques to clean the data: text preprocessing (lowercasing, removing punctuation and extra whitespace, …), tokenization, and stop word removal. We then ran the text classification in different scenarios, with stemming, with lemmatization, with both, or with neither.
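The cleaning steps above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's actual code: the `STOP_WORDS` set, the `preprocess` function, and the `naive_stem` helper are hypothetical names, the stop word list is deliberately tiny, and the suffix stripping only stands in for a real stemmer or lemmatizer.

```python
import re
import string

# Tiny illustrative stop word list (an assumption; a real project would use a fuller one).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "this"}


def preprocess(tweet):
    """Lowercase, strip ASCII punctuation, collapse whitespace, tokenize, drop stop words."""
    text = tweet.lower()
    # Remove every ASCII punctuation character.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse runs of whitespace into a single space.
    text = re.sub(r"\s+", " ", text).strip()
    tokens = text.split(" ") if text else []
    return [tok for tok in tokens if tok not in STOP_WORDS]


def naive_stem(token):
    """Very rough suffix stripping, standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token
```

Calling `preprocess` on a raw tweet returns a list of tokens. Applying `naive_stem` to each token, or skipping it, gives one way to compare the stemmed and unstemmed scenarios.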

For text representation, we vectorized the tweets using TF-IDF and the Doc2Vec algorithm. We also used some other feature extraction tools, and then performed the text classification.
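To make the TF-IDF step concrete, here is a small standard-library sketch. It assumes one common smoothed variant of the formula (term frequency normalized by document length, and inverse document frequency computed as log((1 + N) / (1 + df)) + 1); the project may have used a different weighting. The function name `tfidf` is hypothetical, and Doc2Vec is not covered here.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns (vocabulary, list of dense TF-IDF vectors)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed inverse document frequency (assumed variant).
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid dividing by zero for empty documents
        vectors.append([counts[tok] / total * idf[tok] for tok in vocab])
    return vocab, vectors
```

Each tweet becomes a vector with one entry per vocabulary token. Tokens that appear in few tweets get a higher weight than tokens that appear in most of them.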

We tried different machine learning models (SVM, AdaBoost, Logistic Regression, Random Forest, and Neural Networks) on different configurations of our features. The best result we achieved was 94% accuracy without overfitting.
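One simple way to check a claim such as "without overfitting" is to compare accuracy on the training data with accuracy on held-out data. The sketch below shows that check under stated assumptions: the helper names (`accuracy`, `train_test_split`, `overfitting_gap`), the 80/20 split, and the `predict` callable are all hypothetical and do not describe how the project was actually evaluated.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels (inputs must be non-empty)."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def train_test_split(items, test_ratio=0.2, seed=0):
    """Shuffle a copy of items and split it into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]


def overfitting_gap(predict, train, test):
    """train/test: lists of (features, label). Returns (train_acc, test_acc, gap)."""
    train_acc = accuracy([y for _, y in train], [predict(x) for x, _ in train])
    test_acc = accuracy([y for _, y in test], [predict(x) for x, _ in test])
    # A large positive gap suggests the model memorized the training data.
    return train_acc, test_acc, train_acc - test_acc
```

Here `predict` is any function that maps a feature vector to a label, for example the prediction method of a trained classifier. A gap close to zero is consistent with a model that generalizes to unseen tweets.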