Detecting offensive speech in conversational code-mixed dialogue on social media: A contextual dataset and benchmark experiments

Madhu, Hiren; Satapara, Shrey; Modha, Sandip; Mandl, Thomas; Majumder, Prasenjit

dc.contributor.affiliation	DA-IICT, Gandhinagar
dc.contributor.author	Madhu, Hiren
dc.contributor.author	Satapara, Shrey
dc.contributor.author	Modha, Sandip
dc.contributor.author	Mandl, Thomas
dc.contributor.author	Majumder, Prasenjit
dc.contributor.researcher	Satapara, Shrey (202111005)
dc.date.accessioned	2025-08-01T13:09:15Z
dc.date.issued	01-04-2023
dc.description.abstract	The spread of Hate Speech on online platforms is a severe issue for societies and requires the identification of offensive content by platforms. Research has modeled Hate Speech recognition as a text classification problem that predicts the class of a message based on the text of the message only. However, context plays a huge role in communication. In particular, for short messages, the text of the preceding tweets can completely change the interpretation of a message within a discourse. This work extends previous efforts to classify Hate Speech by considering the current and previous tweets jointly. In particular, we introduce a clearly defined way of extracting context. We present the development of the first dataset for conversational-based Hate Speech classification with an approach for collecting context from long conversations for code-mixed Hindi (ICHCL dataset). Overall, our benchmark experiments show that the inclusion of context can improve classification performance over a baseline. Furthermore, we develop a novel processing pipeline for processing the context. The best-performing pipeline uses a fine-tuned SentBERT paired with an LSTM as a classifier. This pipeline achieves a macro F1 score of 0.892 on the ICHCL test dataset. Another KNN, SentBERT, and ABC weighting-based pipeline yields an F1 Macro of 0.807, which gives the best results among traditional classifiers. So even a KNN model gives better results with an optimized BERT than a vanilla BERT model.
dc.format.extent	1-16
dc.identifier.citation	Hiren Madhu, Shrey Satapara, Sandip Modha, Thomas Mandl, Prasenjit Majumder, "Detecting offensive speech in conversational code-mixed dialogue on social media: A contextual dataset and benchmark experiments," Expert Systems with Applications, Elsevier, ISSN: 0957-4174, vol. 215, Article no. 119342, pp. 1-16, 1 Apr. 2023, doi: 10.1016/j.eswa.2022.119342. [Published date: 25 Nov. 2022]
dc.identifier.doi	10.1016/j.eswa.2022.119342
dc.identifier.issn	0957-4174
dc.identifier.scopus	2-s2.0-85145576108
dc.identifier.uri	https://ir.daiict.ac.in/handle/dau.ir/1777
dc.identifier.wos	WOS:000895345700005
dc.language.iso	en
dc.publisher	Elsevier
dc.relation.ispartofseries	Vol. 215; No.
dc.source	Expert Systems with Applications
dc.source.uri	https://www.sciencedirect.com/science/article/pii/S0957417422023600?via%3Dihub
dc.title	Detecting offensive speech in conversational code-mixed dialogue on social media: A contextual dataset and benchmark experiments
dspace.entity.type	Publication
relation.isAuthorOfPublication	2157d717-1c67-4d71-b314-ed3eddebf251
relation.isAuthorOfPublication.latestForDiscovery	2157d717-1c67-4d71-b314-ed3eddebf251

Collections

Journal Article
