Imbalanced bioassay data classification for drug discovery

Shah, Jeni Snehal

Files

201611003_Jeni Snehal Shah.pdf (1.46 MB)

Date

2018

Authors

Shah, Jeni Snehal

Publisher

Dhirubhai Ambani Institute of Information and Communication Technology

Abstract

Methods developed for pattern recognition show inferior performance when the dataset presented to them is imbalanced, i.e. when the samples belonging to one class greatly outnumber the samples from the other class or classes. For this reason, imbalanced dataset classification has been an active area of research in machine learning. In this thesis, a novel approach to classifying imbalanced bioassay data is presented. Bioassay data classification is an important task in drug discovery. Bioassay data consist of feature descriptors of various compounds and, for each compound, a label that denotes its potency as a drug: active or inactive. These data are highly imbalanced, with the percentage of active compounds ranging from 0.1% to 1.4%, which leads to inaccurate classification of the minority class. The proposed approach trains separate models on different features derived by training stacked autoencoders (SAE). Before the features are learned, the data are cleaned by applying the Synthetic Minority Oversampling Technique (SMOTE) and removing Tomek links. After the features are learned with the SAEs, feed-forward neural networks (FNN) trained to minimize a class-sensitive cost function are used for classification. Different levels of features can be obtained using an SAE. While some active samples may not be classified correctly by a network trained on a certain feature space, it is assumed that they can be classified correctly in another feature space. This is the underlying assumption behind learning hierarchical feature vectors and training a separate classifier for each feature space.
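The abstract does not state the exact form of the class-sensitive cost function. As a rough illustration only, the sketch below uses a common choice, binary cross-entropy with per-class weights set by inverse class frequency. The function names, the weighting scheme, and the toy data are assumptions made for this example and are not taken from the thesis.

```python
import math


def inverse_frequency_weights(labels):
    """Return (w_active, w_inactive) so that each class contributes equally.

    Assumed scheme (not from the thesis): weight = n / (2 * class_count).
    """
    n = len(labels)
    n_active = sum(labels)
    n_inactive = n - n_active
    return n / (2.0 * n_active), n / (2.0 * n_inactive)


def class_weighted_bce(labels, probs, w_active=1.0, w_inactive=1.0, eps=1e-12):
    """Mean binary cross-entropy where errors on each class are scaled by its weight."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        if y == 1:
            total += -w_active * math.log(p)
        else:
            total += -w_inactive * math.log(1.0 - p)
    return total / len(labels)


# Toy data: 1% active compounds, and a classifier that calls everything inactive.
labels = [1] + [0] * 99
probs = [0.01] * 100

w_active, w_inactive = inverse_frequency_weights(labels)
plain_loss = class_weighted_bce(labels, probs)
weighted_loss = class_weighted_bce(labels, probs, w_active, w_inactive)
```

With equal weights, the missed active compound barely moves the loss, so a network can score well by predicting "inactive" everywhere. With the class weights, the same mistake dominates the loss, which is the effect a class-sensitive cost is meant to produce.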

Keywords

Pattern recognition, Deep learning, Machine learning

Citation

Shah, Jeni Snehal (2018). Imbalanced Bioassay Data Classification for Drug Discovery. Dhirubhai Ambani Institute of Information and Communication Technology, ix, 47 p. (Acc. No: T00699)

URI

http://drsr.daiict.ac.in/handle/123456789/733

Collections

M Tech Dissertations
