Person: Majumder, Prasenjit
Name: Prasenjit Majumder
Job Title: Faculty
Telephone: 079-68261605
Specialization: Natural Language Processing, Information Retrieval, Cognitive Science
Publications (showing 1-10 of 20 results)
Multilingual Information Access in South Asian Languages (Springer-Verlag, Berlin, 2013-08-13). Majumder, Prasenjit; Mitra, Mandar; Bhattacharyya, Pushpak; Subramaniam, L Venkata; Contractor, Danish; Rosso, Paolo.

From Extractive to Abstractive Summarization: A Journey (Springer, Singapore, 2019-08-13). Mehta, Parth; Majumder, Prasenjit.

7th Forum for Information Retrieval Evaluation (Association for Computing Machinery (ACM), New York, 2015-12-04). Majumder, Prasenjit; Mitra, Mandar; Agrawal, Madhulika; Mehta, Parth.

Learning combination weights in data fusion using Genetic Algorithms (Elsevier, 01-05-2015). Ghosh, Kripabandhu; Parui, Swapan Kumar; Majumder, Prasenjit. DA-IICT, Gandhinagar.
Researchers have shown that a weighted linear combination in data fusion can produce better results than an unweighted combination, and many techniques have been used to determine the combination weights. In this work, we use a Genetic Algorithm (GA) for this purpose. GAs have long been applied in other domains, but to the best of our knowledge they have not previously been used to fuse runs in information retrieval. First, we use the GA to learn optimal fusion weights from the full set of relevance assessments. Next, we learn the weights from the relevance assessments of the top retrieved documents only. Finally, we learn the weights by two-fold training and testing over the queries. We test our method on runs submitted to TREC. Our weight-learning scheme, using both full and partial sets of relevance assessments, produces significant improvements over the best candidate run, CombSUM, CombMNZ, Z-Score, the linear combination method with performance-level and performance-level-squared weighting, a multiple-linear-regression-based weight-learning scheme, a mixture-model result-merging scheme, LambdaMerge, ClustFuseCombSUM, and ClustFuseCombMNZ. Furthermore, we study how the correlation among run scores can be used to eliminate redundant runs from a set of runs to be fused. We observe that similar runs make similar contributions to fusion, so eliminating the redundant runs in a group of similar runs does not significantly hurt fusion performance.
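The paper's implementation is not reproduced here; the following is a minimal sketch of the idea under stated assumptions: `runs` is a list of per-run {doc_id: normalized score} dictionaries with at least two runs, and `evaluate` is a caller-supplied function that scores a fused ranking against relevance assessments (e.g., MAP). All names, operators, and GA settings are illustrative, not the authors' code.

```python
# Sketch: learning linear fusion weights with a simple GA (illustrative).
import random

def fuse(runs, weights):
    """Weighted linear combination (weighted CombSUM) of per-run scores."""
    fused = {}
    for w, run in zip(weights, runs):
        for doc, score in run.items():
            fused[doc] = fused.get(doc, 0.0) + w * score
    return sorted(fused, key=fused.get, reverse=True)

def ga_weights(runs, evaluate, pop_size=50, generations=100,
               crossover_rate=0.9, mutation_rate=0.1):
    n = len(runs)  # number of runs to fuse; assumes n >= 2

    def fitness(weights):
        return evaluate(fuse(runs, weights))

    pop = [[random.random() for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        next_pop = [w[:] for w in pop[:2]]  # elitism: carry over the two best
        while len(next_pop) < pop_size:
            # Select parents from the fitter half of the population.
            a, b = random.sample(pop[:pop_size // 2], 2)
            if random.random() < crossover_rate:
                cut = random.randrange(1, n)  # single-point crossover
                child = a[:cut] + b[cut:]
            else:
                child = a[:]
            if random.random() < mutation_rate:
                child[random.randrange(n)] = random.random()
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=fitness)
```

Fitness here is simply the retrieval quality of the fused ranking, so the GA directly searches the weight space that the linear-combination baselines fix by hand.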
Detecting offensive speech in conversational code-mixed dialogue on social media: A contextual dataset and benchmark experiments (Elsevier, 01-04-2023). Madhu, Hiren; Satapara, Shrey; Modha, Sandip; Mandl, Thomas; Majumder, Prasenjit. DA-IICT, Gandhinagar.
The spread of hate speech on online platforms is a severe issue for societies and requires platforms to identify offensive content. Research has modeled hate speech recognition as a text classification problem that predicts the class of a message from its text alone. However, context plays a huge role in communication: for short messages in particular, the preceding tweets can completely change the interpretation of a message within a discourse. This work extends previous efforts to classify hate speech by considering the current and previous tweets jointly. In particular, we introduce a clearly defined way of extracting context. We present the first dataset for conversation-based hate speech classification, with an approach for collecting context from long conversations, for code-mixed Hindi (the ICHCL dataset). Overall, our benchmark experiments show that including context improves classification performance over a baseline. Furthermore, we develop a novel pipeline for processing the context. The best-performing pipeline uses a fine-tuned SentBERT paired with an LSTM classifier and achieves a macro F1 score of 0.892 on the ICHCL test set. Another pipeline, based on KNN, SentBERT, and ABC weighting, yields a macro F1 of 0.807, the best result among traditional classifiers; even a KNN model gives better results with an optimized BERT than with a vanilla BERT model.

Approaches to Temporal Expression Recognition in Hindi (ACM, 01-01-2015). Ramrakhiyani, Nitin; Majumder, Prasenjit. DA-IICT, Gandhinagar.
Temporal annotation of plain text is a useful component of modern information retrieval tasks. In this work, different approaches to identifying and classifying temporal expressions in Hindi are developed and analyzed. First, a rule-based approach takes plain text as input and, using a set of hand-crafted rules, produces output tagged with the identified temporal expressions; it achieves a strict F1-measure of 0.83. In a second approach, a CRF-based classifier is trained on human-tagged data and evaluated on a test dataset; it identifies time expressions in plain text and further classifies them into various classes, with a strict F1-measure of 0.78. Next, the CRF is replaced by an SVM-based classifier using the same features; this is comparable to the CRF, with a strict F1-measure of 0.77. Using the rule-based output as an additional feature raises performance to 0.86 and 0.84 for the CRF and SVM, respectively. With three comparable systems performing the extraction task, the next step is merging them to exploit their strengths. In a first merging experiment, rule-tagged data is fed to the CRF and SVM classifiers as additional training data, which increases the CRF's F1-measure from 0.78 to 0.80. Second, a voting-based approach chooses the best class for each token from the outputs of the three approaches; this gives the best performance for the task, with a strict F1-measure of 0.88. In the process, a reusable gold-standard dataset for temporal tagging in Hindi is also developed: the ILTIMEX2012 corpus, consisting of 300 manually tagged Hindi news documents.
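As a rough illustration of the voting step described above, the sketch below performs per-token majority voting over the three taggers' label sequences. The tie-breaking fallback to the CRF (the strongest single learner here) is an assumption for illustration, not a detail stated in the abstract.

```python
# Sketch: per-token voting over rule-based, CRF, and SVM tag sequences.
from collections import Counter

def vote(rule_tags, crf_tags, svm_tags):
    merged = []
    for labels in zip(rule_tags, crf_tags, svm_tags):
        label, freq = Counter(labels).most_common(1)[0]
        # With three voters, freq >= 2 is a true majority; if all three
        # disagree, fall back to the CRF output (assumed tie-breaker).
        merged.append(label if freq >= 2 else labels[1])
    return merged

# Example: 'B-DATE' wins 2-of-3 on the second token.
print(vote(["O", "B-DATE"], ["O", "B-DATE"], ["B-TIME", "O"]))
# -> ['O', 'B-DATE']
```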
Design and analysis of microblog-based summarization system (Springer, 02-11-2021). Modha, Sandip; Majumder, Prasenjit; Mandl, Thomas; Singla, Rishab. DA-IICT, Gandhinagar.
A daily summary or digest from microblogs lets social media users stay up to date on their favorite topics. Summarizing microblogs is a non-trivial task. This paper presents a summarization system built over the Twitter stream to summarize a topic for a given duration. Tweet ranking is the primary task in designing a microblog-based summarization system. After ranking, selecting the relevant tweets is crucial because of the massive volume of the Twitter stream, and the system should also include novel tweets in the digest. The measure of relevance is typically a similarity score, produced by a text similarity algorithm, between the user's information need and each tweet: the more similar, the higher the score. A threshold must therefore be chosen that minimizes false-positive judgments. In this paper, we propose novel threshold estimation methods to find optimal values for these thresholds and evaluate them against thresholds determined via grid search; the results show that the proposed methods estimate the thresholds with reasonable accuracy. Previous research has set these thresholds empirically and heuristically, whereas our method exploits statistical features of the ranking list. We use language models to rank tweets and to select relevant ones; for any language model, the choice of smoothing technique and its parameters is critical. The results are also compared with the standard probabilistic ranking algorithm BM25. Learning-to-rank strategies are also implemented and show substantial improvement on some of the result metrics. Experiments were performed on standard benchmarks: the TREC Microblog 2015, TREC RTS 2016, and TREC RTS 2017 datasets. Several variants of normalized discounted cumulative gain (nDCG-1, nDCG-0, and nDCG-p), the standard official TREC evaluation metric, are used in this study. We also performed a comprehensive failure analysis of our experiments and identified key issues for improvement that can be addressed in the future.
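The abstract does not spell out the estimators themselves; as a hedged stand-in, the sketch below derives a relevance cutoff from simple statistics of a day's score list (mean plus a multiple of the standard deviation). This rule and all names are assumptions for illustration, not the published method.

```python
# Sketch: statistics-based relevance threshold for ranked tweet scores.
# Stand-in rule (mean + k * stddev), NOT the paper's estimator.
import statistics

def estimate_threshold(scores, k=1.0):
    """Return a cutoff above which tweets enter the daily digest."""
    mu = statistics.mean(scores)
    sigma = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mu + k * sigma

scores = [0.12, 0.15, 0.18, 0.22, 0.71, 0.76, 0.80]
cutoff = estimate_threshold(scores)
digest = [s for s in scores if s >= cutoff]  # only clear outliers survive
```

The point of any such rule is the one the abstract makes: the cutoff adapts to the shape of each day's ranking list instead of being fixed heuristically across topics.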
Introduction to the Special Issue on Indian Language Information Retrieval Part I (ACM). Harman, Donna; Kando, Noriko; Majumder, Prasenjit; Mitra, Mandar; Peters, Carol. DA-IICT, Gandhinagar.
This special issue of Transactions on Asian Language Information Processing (TALIP) presents six research papers on Indian language Information Retrieval (IR). The first article, 'The FIRE 2008 Evaluation Exercise' by Prasenjit Majumder and co-workers, provides the motivation and background for the FIRE initiative. It describes how the FIRE 2008 test collection was constructed, summarizes the approaches adopted by the participants, discusses the limitations of the datasets, and outlines the tasks planned for the next iteration of FIRE. Leveling and Jones, in 'Sub-word Indexing and Blind Relevance Feedback for English, Bengali, Hindi, and Marathi IR,' try a corpus-based stemming approach based on morpheme induction, as well as sub-word indexing units. The final article, 'An Information Extraction System for Urdu - A Resource Poor Language' by Smruthi, addresses Natural Language Processing (NLP) tasks for Urdu, a language not covered by any of the other articles.

Report on the FIRE 2020 evaluation initiative (ACM, 16-07-2021). Mehta, Parth; Mandl, Thomas; Majumder, Prasenjit; Gangopadhyay, Surupendu. DA-IICT, Gandhinagar.
This report gives an overview of the Forum for Information Retrieval Evaluation (FIRE) initiative for South Asian languages. The FIRE conference was held online in December 2020 and combined a conference, including keynotes and peer-reviewed paper sessions, with an evaluation forum. The report presents an overview of the conference and provides insights into the evaluation tracks. Current domains include legal information access, mixed-script information retrieval, semantic analysis, and social media post classification. The tasks are discussed and connections to other evaluation initiatives are shown.

Query specific graph-based query reformulation using UMLS for clinical information access (Elsevier, 01-08-2020). Sankhavara, Jainisha; Dave, Rishi; Dave, Bhargav; Majumder, Prasenjit. DA-IICT, Gandhinagar.
Biomedical document retrieval requires entity-level rather than term-level processing. This paper explores the use and impact of UMLS for entity-based query reformulation in biomedical document retrieval. A novel graph-based approach to query reformulation using UMLS is described, in which queries are expanded with biomedical entities. The proposed method takes the UMLS entities found in a query together with their related entities identified by UMLS, and constructs a query-specific graph of biomedical entities for term selection. This query reformulation approach is compared with a baseline, a pseudo-relevance-feedback-based query expansion approach, and state-of-the-art UMLS-based query reformulation approaches. Experiments on the CDS 2015 and CDS 2016 datasets show improvements in retrieval performance of 35% and 45%, respectively.
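A minimal sketch of the graph intuition, under stated assumptions: `related_entities` is a hypothetical helper standing in for a real UMLS interface, and connectivity (how many query entities a candidate touches) stands in for whatever term-selection score the paper actually uses.

```python
# Sketch: query-specific entity graph for expansion term selection.
from collections import defaultdict

def reformulate(query_entities, related_entities, top_k=5):
    # candidate entity -> set of query entities it is related to
    links = defaultdict(set)
    for q in query_entities:
        for r in related_entities(q):  # hypothetical UMLS lookup
            links[r].add(q)
    # Rank candidates by how many query entities they connect to.
    ranked = sorted(links, key=lambda r: len(links[r]), reverse=True)
    return list(query_entities) + ranked[:top_k]

# Toy usage with a canned lookup table standing in for UMLS:
lookup = {"diabetes": {"insulin", "metformin", "hyperglycemia"},
          "neuropathy": {"insulin", "nerve damage"}}.get
expanded = reformulate(["diabetes", "neuropathy"],
                       lambda e: lookup(e, set()))
# 'insulin' ranks first: it is linked to both query entities.
```

Candidates connected to several query entities rank first, which captures the graph idea: terms central to the query's entity neighborhood are safer expansion terms than terms attached to a single entity.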