Imbalanced bioassay data classification for drug discovery

Shah, Jeni Snehal

dc.contributor.advisor	Joshi, Manjunath V.
dc.contributor.author	Shah, Jeni Snehal
dc.date.accessioned	2019-03-19T09:30:51Z
dc.date.available	2019-03-19T09:30:51Z
dc.date.issued	2018
dc.identifier.citation	Shah, Jeni Snehal (2018). Imbalanced Bioassay Data Classification for Drug Discovery. Dhirubhai Ambani Institute of Information and Communication Technology, ix, 47 p. (Acc. No: T00699)
dc.identifier.uri	http://drsr.daiict.ac.in//handle/123456789/733
dc.description.abstract	Pattern recognition methods show inferior performance when the dataset presented to them is imbalanced, i.e. when the samples belonging to one class greatly outnumber the samples from the other class or classes. For this reason, imbalanced dataset classification has been an active area of research in machine learning. In this thesis, a novel approach to classifying imbalanced bioassay data is presented. Bioassay data classification is an important task in drug discovery. Bioassay data consists of feature descriptors of various compounds and, for each compound, a corresponding label that denotes its potency as a drug: active or inactive. This data is highly imbalanced, with the percentage of active compounds ranging from 0.1% to 1.4%, which leads to inaccurate classification of the minority class. An approach is proposed in which separate models are trained on different features derived by training stacked autoencoders (SAE). After the features are learned using SAEs, feed-forward neural networks (FNN), trained to minimize a class-sensitive cost function, are used for classification. Before the features are learned, the data is cleaned by applying the Synthetic Minority Oversampling Technique (SMOTE) and removing Tomek links. Different levels of features can be obtained using an SAE. While some active samples may not be classified correctly by a network trained on a certain feature space, it is assumed that they can be classified correctly in another feature space. This is the underlying assumption behind learning hierarchical feature vectors and learning a separate classifier for each feature space.
dc.publisher	Dhirubhai Ambani Institute of Information and Communication Technology
dc.subject	Pattern recognition
dc.subject	Deep learning
dc.subject	Machine learning
dc.classification.ddc	005.74 SHA
dc.title	Imbalanced bioassay data classification for drug discovery
dc.type	Dissertation
dc.degree	M. Tech
dc.student.id	201611003
dc.accession.number	T00699
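
Illustrative note on the "class sensitive cost function" mentioned in the abstract: the record does not give the thesis's actual cost function, so the following is only a minimal sketch of one common choice, a class-weighted binary cross-entropy. The function names, the inverse-frequency weighting scheme, and the toy labels are assumptions made for illustration and are not taken from the thesis.

```python
import math


def inverse_frequency_weights(y_true):
    """Return (w_active, w_inactive) so that both classes carry equal total weight.

    Assumed weighting scheme (not from the thesis): w_c = n / (2 * n_c).
    Requires at least one sample of each class.
    """
    n = len(y_true)
    n_active = sum(y_true)
    n_inactive = n - n_active
    return n / (2.0 * n_active), n / (2.0 * n_inactive)


def class_weighted_bce(y_true, p_pred, w_active=1.0, w_inactive=1.0, eps=1e-12):
    """Mean binary cross-entropy with a separate weight per class.

    y_true: labels, 1 = active (minority), 0 = inactive (majority).
    p_pred: predicted probabilities of the active class.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        if y == 1:
            total += -w_active * math.log(p)
        else:
            total += -w_inactive * math.log(1.0 - p)
    return total / len(y_true)


# Toy example: one active compound among four, and a model that
# assigns every compound a low probability of being active.
labels = [1, 0, 0, 0]
probs = [0.1, 0.1, 0.1, 0.1]

w_act, w_inact = inverse_frequency_weights(labels)
plain_loss = class_weighted_bce(labels, probs)
weighted_loss = class_weighted_bce(labels, probs, w_act, w_inact)
```

With these toy values the weighted loss is larger than the unweighted one, because the missed active sample is penalized more heavily than the correctly scored inactive samples. Under this assumed scheme, a network minimizing the weighted loss is pushed away from the trivial "predict everything inactive" solution.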

Files in this item

Name: 201611003_Jeni Snehal Shah.pdf
Size: 1.457Mb
Format: PDF
Description: 201611003

This item appears in the following Collection(s)

M Tech Dissertations [923]
