M Tech Dissertations
Permanent URI for this collection: http://drsr.daiict.ac.in/handle/123456789/3
11 results
Search Results
Item Open Access Biomedical information retrieval (Dhirubhai Ambani Institute of Information and Communication Technology, 2018) Purabia, Pooja R.; Majumder, Prasenjit
It is well known that the volume of biomedical literature is growing exponentially and that scientists are overwhelmed when they sift through this large and diverse body of unstructured knowledge to find relevant information. TREC Precision Medicine 2017 is a track focusing on retrieving relevant scientific abstracts and clinical trials from PubMed and ClinicalTrials.gov for cancer patients, given their medical cases. This report describes the system architecture for the TREC 2017 Precision Medicine Track. I explored query expansion techniques using well-known broad knowledge sources such as MetaMap and the Entrez database. I used different pseudo-relevance feedback techniques, such as TF-IDF, Bo1 and Local Context Analysis, to retrieve relevant medical abstracts. I have also used hidden aspects of the topics, such as the precision medicine and treatment aspects, to improve the scores. I report infNDCG, R-Prec and P@10 scores.

Item Open Access Distant supervision for relation extraction (Dhirubhai Ambani Institute of Information and Communication Technology, 2018) Doshi, Prarthana; Jat, PM
Relation Extraction (RE) is one of the important tasks of Information Extraction, which is used to extract structured data from natural language text. Most techniques in the area of relation extraction use labelled data. The downside of using labelled data is that it is very costly to generate, as it requires human labour to understand each sentence and its entities and label them accordingly. A large amount of natural language data is available, and it is increasing day by day.
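The pseudo-relevance feedback idea used in the biomedical retrieval abstract above can be sketched in a few lines: assume the top-ranked documents are relevant, score candidate expansion terms by TF-IDF over them, and append the best terms to the query. The function name and toy tokenized documents below are illustrative, not from the thesis.

```python
import math
from collections import Counter

def prf_expand(query_terms, feedback_docs, all_docs, n_terms=3):
    """Score candidate terms from pseudo-relevant docs by TF-IDF and
    append the top-scoring ones to the query (a Rocchio-style sketch)."""
    n = len(all_docs)
    # document frequency of each term over the whole collection
    df = Counter()
    for doc in all_docs:
        df.update(set(doc))
    # term frequency over the assumed-relevant feedback documents
    tf = Counter()
    for doc in feedback_docs:
        tf.update(doc)
    scores = {
        t: tf[t] * math.log((n + 1) / (df[t] + 1))
        for t in tf if t not in query_terms
    }
    expansion = sorted(scores, key=scores.get, reverse=True)[:n_terms]
    return list(query_terms) + expansion
```

Real systems (e.g. Bo1 or Local Context Analysis) use more elaborate term-scoring formulas, but the feedback loop has this same shape.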
Consequently, supervised techniques may not scale and adapt well to real-time, dynamic data. The issue of human annotation is addressed by the recent approach of distant supervision, which attempts automatic labelling of data. This is realized by extracting facts from publicly available knowledge bases such as Wikidata and DBpedia, most of which are freely available. The assumption of distant supervision is that if there is a relation between two entities in a knowledge base, then any sentence in which those entities appear together expresses that relation. However, there are problems associated with distant supervision, such as incomplete knowledge bases and the wrong-label problem. Most techniques in the area of relation extraction use available NLP tools for feature extraction, and these tools themselves introduce errors. In this work, we explore a convolutional neural network for the task, which does not require NLP-based preprocessing. To mitigate the wrong-label problem, we have used selective attention over instances. It treats the task as a multi-instance problem, and we have concluded that it gives better results. We have also used a CNN with a context model, where the input is divided into three parts based on the entity positions. This helps the model learn the sentence representation, and it performs better than the basic CNN model.

Item Open Access Document representation using extended locality preserving indexing (Dhirubhai Ambani Institute of Information and Communication Technology, 2018) Khalpada, Vaidehi S.; Mitra, Suman K.
The main purpose of web search is to obtain the information relevant to our need from the documents available on the Internet. Each term (word) in a document contributes a dimension, so processing this high-dimensional data is challenging. Not all terms convey important meaning; some terms are related to each other, and some are synonyms.
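The distant-supervision labelling heuristic described in the relation-extraction abstract can be sketched directly; the knowledge base here is a toy dictionary standing in for Wikidata or DBpedia.

```python
def distant_label(sentences, kb):
    """Apply the distant-supervision heuristic: label a (head, tail, text)
    sentence with the KB relation for that entity pair, or 'NA' when the
    pair is absent from the knowledge base."""
    return [(text, kb.get((e1, e2), "NA")) for e1, e2, text in sentences]
```

Note that every sentence mentioning a known pair receives the label, even one that does not actually express the relation; that is exactly the wrong-label problem the abstract addresses with selective attention.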
This redundancy in the document collection increases the dimensionality of the document space, and processing such a high-dimensional collection to obtain useful information requires a lot of storage space and computation time. Dimensionality reduction plays an important role here: it reduces the data dimension so that computation is faster and less storage is required. Documents are represented as vectors in a high-dimensional space, and our main aim is to obtain a representation of the documents in a reduced subspace such that the relations among the documents do not change from those in the original vector space. Accordingly, the accuracy of the document-similarity measure obtained in the subspace is evaluated. Document representation in terms of a term-document matrix is an important step in document indexing, the process of building an index that helps retrieve relevant documents effectively, analogous to the index of a book. Latent Semantic Indexing (LSI) is a global structure preserving approach, while Locality Preserving Indexing (LPI) is a local structure preserving approach. LPI assigns weights to the neighbours to obtain the reduced representation while preserving local structure; however, it retains no information about non-neighbours. A new approach, Extended Locality Preserving Indexing (ELPI), is proposed, which preserves the topology of the document space by modifying the weighting scheme. Experiments on document similarity and classification show small but encouraging improvements of ELPI over LPI.

Item Open Access Modelling and short term forecasting of flash floods in an urban environment (Dhirubhai Ambani Institute of Information and Communication Technology, 2018) Ogale, Suraj; Srivastava, Sanjay
Rapid urbanization, climate change and extreme rainfall have resulted in a growing number of cases of urban flash floods.
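The neighbour-weighting step that LPI (discussed in the ELPI abstract) performs before its eigen-decomposition can be sketched as follows. The heat-kernel weights over a k-nearest-neighbour graph are a standard choice for this step, not necessarily the exact scheme used in the thesis.

```python
import numpy as np

def neighbour_weights(X, k=2, t=1.0):
    """Heat-kernel weights over a k-nearest-neighbour graph of document
    vectors: the affinity matrix an LPI-style method builds, in which
    non-neighbours receive zero weight (the limitation ELPI targets)."""
    n = len(X)
    # pairwise squared Euclidean distances between document vectors
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.zeros((n, n))
    for i in range(n):
        # k nearest neighbours of document i (index 0 is i itself)
        nbrs = np.argsort(d2[i])[1:k + 1]
        W[i, nbrs] = np.exp(-d2[i, nbrs] / t)
    return np.maximum(W, W.T)  # symmetrize the graph
```

ELPI's modification, as the abstract describes it, is to change this weighting so that information about non-neighbours is also retained.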
It is important to predict the occurrence of a flood so that its aftermath can be minimized. Flood forecasting is the exercise of determining the chances of a flood when suitable conditions are present. Short-term forecasting, or nowcasting, is a dominant technique used in urban areas for predicting very-near-future incidents, up to six hours ahead. In orthodox methods of flood forecasting, current weather conditions are examined using conventional means such as radar, satellite imaging and computations involving complicated mathematical equations. Recent developments in Information and Communication Technology (ICT) and Machine Learning (ML) have helped us study this hydrological problem, along with many other real-world situations, from a different perspective. The main aim of this thesis is to design a theoretical model that accounts for the parameters causing an urban flash flood and to develop a prediction tool for forecasting near-future events. To test the soundness of the model, data synthesis is performed and the results are evaluated using an artificial neural network.

Item Open Access Text retrieval from the degraded document images (Dhirubhai Ambani Institute of Information and Communication Technology, 2015) Vasani, Hiral; Mitra, Suman K.
Image binarization is used to obtain a black-and-white text document from a coloured one. Basically, it can be taken as an image segmentation task that separates the text from the background. Such a black-and-white document can be used in many applications, notably Optical Character Recognition (OCR). Text documents suffer from various types of degradation that make image binarization a challenging task. This thesis presents the work done to design a technique that segments text from the background. In this method, the document image is first darkened in order to enhance the text (foreground) in it. The image is then processed separately so as to suppress the background.
The two images so obtained are combined in such a way that the suppressed background is taken from the second image and the enhanced text from the first. This pre-processed image is then binarized using an existing thresholding technique, and the binarized image is subjected to post-processing to remove unwanted small components and other noise. The output image is compared to the ground truth using several evaluation parameters, and the results of the algorithm are compared to existing binarization techniques.

Item Open Access Retrieval of legal documents using query expansion (Dhirubhai Ambani Institute of Information and Communication Technology, 2014) Agrawal, Madhulika; Majumder, Prasenjit
The structure of a query posed by a lawyer differs from that of a layman, and the layman's query carries very little legal content. Pre-processing of such queries is therefore required for better retrieval performance. In this thesis, we used various query expansion techniques and found that increasing the query size increases system performance. A MAP of 0.5034 was obtained using the BM25 retrieval model with query expansion up to 2550 terms under the Bo1 query expansion model. By explicitly adding terms to the query, using topics obtained from topic modelling, a MAP of 0.4281 was obtained. Further, with relevance feedback of documents using topic modelling and only 2 cycles of feedback, we obtained a MAP of 0.3832. Our baseline was a MAP of 0.3799 using the In_expC2 retrieval model.
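The degraded-document abstract above leaves its "existing thresholding technique" unnamed; Otsu's method is a common stand-in for that step, sketched here as an assumption rather than the thesis's actual choice.

```python
import numpy as np

def otsu_threshold(gray):
    """Global Otsu threshold on an 8-bit grayscale image: pick the cut
    that maximizes between-class variance of foreground vs background."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    best_t, best_var = 0, 0.0
    for t in range(1, 256):
        w0, w1 = prob[:t].sum(), prob[t:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * prob[:t]).sum() / w0
        mu1 = (np.arange(t, 256) * prob[t:]).sum() / w1
        between_var = w0 * w1 * (mu0 - mu1) ** 2  # between-class variance
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t

def binarize(gray):
    """Map dark (text) pixels to 0 and light (background) pixels to 255."""
    t = otsu_threshold(gray)
    return (gray >= t).astype(np.uint8) * 255
```

In the pipeline the abstract describes, this thresholding runs after the contrast-enhancing pre-processing and before the component-removal post-processing.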
We also compared the relevance judgments of a lawyer and a non-lawyer and found that, for the relative evaluation of two systems, the non-lawyer's relevance judgments are on par with the lawyer's.

Item Open Access Feature based approach for singer identification (Dhirubhai Ambani Institute of Information and Communication Technology, 2012) Radadia, Purushotam G.; Patil, Hemant A.
One of the challenging and difficult problems in Music Information Retrieval (MIR) is to identify the singer of a given song under strong instrumental accompaniment. Besides instrumental sounds, other factors also severely affect Singer IDentification (SID) accuracy, such as the quality of the recording devices, the transmission channels and other singing voices present in a song. In this work, we propose singer identification on a large database of 500 songs (the largest database used for the SID problem to date) prepared from Hindi (an Indian language) Bollywood songs. The vocal portions are segmented manually from each song. Different features are employed in this thesis in addition to the state-of-the-art feature set, Mel Frequency Cepstral Coefficients (MFCC). To identify a singer, three classifiers are employed: a 2nd order polynomial classifier, a 3rd order polynomial classifier and the state-of-the-art GMM classifier. Furthermore, to alleviate the effect of recording devices and transmission channels, the Cepstral Mean Subtraction (CMS) technique is applied to the MFCCs, and it provides better results than the baseline MFCCs alone. Moreover, the 3rd order polynomial classifier performs best among the three classifiers.
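Cepstral Mean Subtraction, used in the singer-identification abstract to suppress recording-device and channel effects, amounts to removing the per-coefficient mean across frames: a fixed convolutive channel becomes an additive constant in the cepstral domain, so subtracting the mean cancels it. A minimal sketch (the frames-by-coefficients array layout is an assumption):

```python
import numpy as np

def cepstral_mean_subtraction(mfcc):
    """Subtract the per-coefficient mean across frames so that a fixed
    convolutive channel (recording device, transmission line) is removed.
    `mfcc` has shape (n_frames, n_coeffs)."""
    return mfcc - mfcc.mean(axis=0, keepdims=True)
```

The key property is that adding any constant channel offset to every frame leaves the CMS output unchanged.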
Score-level fusion of MFCC and CMS-MFCC is also used in this thesis, and it improves the results significantly.

Item Open Access SMS query processing for information retrieval (Dhirubhai Ambani Institute of Information and Communication Technology, 2012) Shinghal, Khushboo; Majumder, Prasenjit
SMS text messaging is one of the fastest and most popular communication modes on mobile phones these days. This study presents a query processing system for information retrieval when the queries are Short Message Service (SMS) messages. SMS text contains various user improvisations and typographical errors. The proposed approach uses approximate string matching techniques and context extraction to normalize SMS queries with minimal linguistic resources. We have tested the system on the FIRE 2011 SMS-based FAQ retrieval corpus. The results seem encouraging.

Item Open Access Learning to rank: using Bayesian networks (Dhirubhai Ambani Institute of Information and Communication Technology, 2011) Gupta, Parth; Majumder, Prasenjit; Mitra, Suman K.
Ranking is one of the key components of an Information Retrieval system. Recently, supervised learning has been applied to learn the ranking function; this line of work is collectively called 'Learning to Rank'. In this study we present one approach to this problem. We intend to test it in different stochastic environments, and hence we choose Bayesian Networks for the machine learning component. The work also reports experimental results on the standard learning-to-rank dataset LETOR 4.0 [6]. We call our approach BayesNetRank. We compare the performance of BayesNetRank with a Support Vector Machine (SVM) based approach called RankSVM [5]. The study also includes a performance analysis to identify the kinds of queries for which the proposed system gives results at either extreme.
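The approximate-string-matching step from the SMS query-processing abstract can be sketched with a plain Levenshtein matcher. The lexicon and the nearest-word rule below are illustrative simplifications; the actual system also uses context extraction to disambiguate.

```python
def edit_distance(a, b):
    """Levenshtein distance via single-row dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i          # prev holds the diagonal cell
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def normalize_sms(tokens, lexicon):
    """Map each noisy SMS token to its nearest lexicon word, a sketch of
    the approximate-string-matching normalization step."""
    return [min(lexicon, key=lambda t: edit_distance(tok, t))
            for tok in tokens]
```

So a noisy query like "gud mrng" is rewritten into dictionary words before being passed to the FAQ retrieval system.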
Evaluation results are reported using two rank-based evaluation metrics: Mean Average Precision (MAP) and Normalized Discounted Cumulative Gain (NDCG).

Item Open Access Use of probabilistic context free grammar for natural language interface for an application (Dhirubhai Ambani Institute of Information and Communication Technology, 2008) Agarwal, Chetan; Jotwani, Naresh D.
This thesis deals with the development of a natural language interface for a database application (a library system). The application uses a Probabilistic Context Free Grammar as its computational model. The material presented in this thesis provides an overview of Natural Language Processing, Probabilistic Context Free Grammars, parsing and extracting the semantics from a parse tree. A Probabilistic Context Free Grammar is a computational model that defines probabilistic relationships among a set of production rules for a given grammar; these probabilistic relationships have several advantages in natural language processing. The goal of natural language processing is to build computational models of natural languages for their analysis and generation. The application takes a simple English sentence, parses it, extracts the semantics and translates it into an SQL query. The application (library system) is coded in the Java programming language. Though the given code is for a simple library system, it can be modified to meet the requirements of other targeted tasks.
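To illustrate what the PCFG computational model in the last abstract does, here is a toy sketch: the probability of a parse tree is the product of the probabilities of the rules applied at its internal nodes. The grammar, the tuple tree encoding and the uniform lexical probability are illustrative assumptions, not the grammar from the thesis.

```python
# A toy PCFG: the rule probabilities for each left-hand side sum to 1
# (an illustrative grammar, not the one used in the thesis).
PCFG = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Det", "N")): 0.6,
    ("NP", ("N",)): 0.4,
    ("VP", ("V", "NP")): 1.0,
}

def tree_probability(tree, lexical_prob=1.0):
    """Probability of a parse tree = product of the probabilities of the
    rules used at each internal node (lexical rules folded into one
    constant factor for simplicity)."""
    label, children = tree
    if not isinstance(children, list):   # leaf: (POS tag, word)
        return lexical_prob
    rhs = tuple(child[0] for child in children)
    p = PCFG[(label, rhs)]
    for child in children:
        p *= tree_probability(child, lexical_prob)
    return p
```

A parser for such an interface would compare these probabilities across candidate parses of the user's sentence and hand the most probable parse to the semantic-extraction and SQL-translation stages.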