Microblog processing : summarization and impoliteness detection
Social media is an excellent source for studying human interaction and behavior. Smart autonomous applications that sense social media such as Facebook and Twitter empower their user communities with real-time information as events unfold across different parts of the world. In this thesis, we study social media text from the summarization and impoliteness perspectives. In the first part of the thesis, microblog summarization is explored in three scenarios. In the first scenario, we present a summarization system, built over the Twitter stream, that summarizes a topic for a given duration. A daily summary or digest from a microblog is a way to update social media users on what happened today on the subject of their interest. In designing a microblog-based summarization system, tweet ranking is the primary task. After ranking, selecting the relevant tweets is the crucial task for any summarization system because of the massive volume of tweets in the Twitter stream. In addition, the summarization system should include novel tweets in the summary or digest. The measure of relevance is typically the similarity score between the user's information need and a tweet, obtained from different text similarity algorithms: the more similar the tweet, the higher the score. We therefore need to choose a score threshold that minimizes false-positive judgments. We have developed various methods that exploit statistical features of the rank list to estimate these thresholds and evaluated them against thresholds determined via grid search. We have used language models to rank the tweets and select the relevant ones, where the choice of smoothing technique and its parameters is critical. Results are also compared with the standard probabilistic ranking algorithm BM25. Learning-to-rank strategies are also implemented, and they show substantial improvement on some of the result metrics.
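The ranking and threshold step described above can be illustrated with a minimal sketch. It assumes a query-likelihood language model with Dirichlet smoothing and a threshold of the form mean + k × standard deviation of the rank-list scores; the abstract does not state which smoothing technique or which rank-list statistics the thesis uses, so the function names, the smoothing parameter `mu`, and the factor `k` are all hypothetical choices made for illustration.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberately simple assumption)."""
    return text.lower().split()


def dirichlet_score(query, doc_tokens, coll_tf, coll_len, mu=100.0):
    """Log query likelihood of a tweet under Dirichlet smoothing."""
    tf = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score = 0.0
    for term in tokenize(query):
        p_coll = coll_tf.get(term, 0) / coll_len
        if p_coll == 0.0:
            continue  # query term never seen in the collection: skip it
        score += math.log((tf[term] + mu * p_coll) / (doc_len + mu))
    return score


def rank_and_select(query, tweets, mu=100.0, k=0.0):
    """Rank tweets, then keep those scoring at or above mean + k * std.

    The mean/std rule is only one hypothetical way of exploiting
    rank-list statistics; it is not taken from the thesis.
    """
    docs = [tokenize(t) for t in tweets]
    coll_tf = Counter(tok for d in docs for tok in d)
    coll_len = sum(coll_tf.values())
    scored = sorted(
        ((dirichlet_score(query, d, coll_tf, coll_len, mu), t)
         for d, t in zip(docs, tweets)),
        reverse=True,
    )
    scores = [s for s, _ in scored]
    threshold = mean(scores) + k * pstdev(scores)
    selected = [t for s, t in scored if s >= threshold]
    return scored, threshold, selected


if __name__ == "__main__":
    tweets = [
        "earthquake hits nepal today",
        "earthquake relief teams reach nepal",
        "my cat sleeps all day",
        "coffee is great this morning",
    ]
    ranked, thr, kept = rank_and_select("nepal earthquake", tweets)
    print(f"threshold = {thr:.4f}")
    for tweet in kept:
        print("selected:", tweet)
```

Raising `k` makes the selection stricter and so trades recall for fewer false positives. A grid search over `k` and `mu` on held-out relevance judgments would be the natural baseline to compare such an estimate against.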
In the second scenario, we develop a real-time version of the summarization system that continually monitors the Twitter stream and generates relevant and novel real-time push notifications that are delivered to users' cellphones. In the third scenario, the summarization system is evaluated on a disaster-related incident such as an earthquake. We have also performed a comprehensive failure analysis of our experiments and identified key issues that can be addressed in the future. In the second part of the thesis, the social media stream is studied from the impoliteness perspective. Due to an exponential rise in the social media user base, incidents such as hate speech, trolling, and cyberbullying are also increasing, and this has led to the hate speech detection problem being reshaped into different research problems such as aggression detection, offensive language detection, and factual-post detection. We refer to all such anti-social typologies under the ambit of impoliteness. This thesis attempts to study the effectiveness of different text representation schemes on a downstream NLP task such as classification. A set of text representation schemes, based on bag-of-words (BoW) techniques, distributed word representations (word embeddings), and sentence embeddings, is empirically evaluated on traditional classifiers and deep neural models. Experimental results show that on the English dataset, overall, text representation using Google's Universal Sentence Encoder (USE) performs better than word embedding and BoW techniques on traditional classifiers such as SVM, while pre-trained word embedding models perform better on classifiers based on deep neural models. Recent pre-trained transfer learning models such as ELMo, ULMFiT, and BERT are fine-tuned for the aggression classification task. However, their results are not on par with those of the pre-trained word embedding models. Overall, word embedding using fastText produces a better weighted F1-score than Word2Vec and GloVe.
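For readers unfamiliar with the metric used to compare the representations above, the weighted F1-score averages the per-class F1 values, weighting each class by its share of the true labels. The sketch below is a plain-Python illustration of that standard definition; the function name and the toy labels `agg` and `non` are hypothetical and are not taken from the thesis datasets.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = len(y_true)
    pairs = list(zip(y_true, y_pred))
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        score += (n / total) * f1
    return score


if __name__ == "__main__":
    # Toy labels: three aggressive posts and one non-aggressive post.
    y_true = ["agg", "agg", "agg", "non"]
    y_pred = ["agg", "agg", "non", "non"]
    print(round(weighted_f1(y_true, y_pred), 4))
```

Because the weights follow class frequency, this average is dominated by the majority class on imbalanced data. That is worth keeping in mind when comparing models on skewed aggression datasets.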
On the Hindi dataset, BoW techniques perform better than word embedding on traditional classifiers such as SVM, while pre-trained word embedding models perform better on classifiers based on deep neural nets. Statistical significance tests are employed to assess the significance of the results. Deep neural models are more robust and perform substantially better than traditional classifiers such as SVM, logistic regression, and Naive Bayes. During a disaster-related incident, Twitter is flooded with millions of posts. In such emergencies, identifying factual posts is vital for organizations involved in the relief operation. We approach this problem as a combination of classification and ranking problems. Following from this work, the aggression visualization problem is addressed as the last component. We have designed a user interface, based on web browser plugins for Facebook and Twitter, to visualize the aggressive comments posted by any user. This plugin interface might help security agencies keep tabs on the social media stream. The proposed plugin also helps celebrities customize their timelines by raising the appropriate flag, which enables them to delete or hide such abusive comments. In addition, the interface might help the research community prepare weakly labeled training data in a few minutes using comments posted by users on celebrities' social media timelines.
- PhD Theses