Imbalanced bioassay data classification for drug discovery
Abstract
Pattern recognition methods generally show inferior performance when the
dataset presented to them is imbalanced, i.e. when the samples belonging to one
class far outnumber the samples from the other class or classes.
For this reason, imbalanced dataset classification has been an active area of research in
machine learning. In this thesis, a novel approach to classifying imbalanced bioassay
data is presented. Bioassay data classification is an important task in drug discovery.
Bioassay data consists of feature descriptors for a set of compounds, each with
a label that denotes the compound's potency as a drug: active or inactive.
The data is highly imbalanced, with active compounds making up only
0.1% to 1.4% of the samples, which leads to inaccurate classification of the minority class.
The proposed approach trains separate models on different features
derived from stacked autoencoders (SAEs).
After the features are learned with the SAEs, feed-forward neural networks (FNNs) are
used for classification and are trained to minimize a class-sensitive cost function.
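As an illustration of what a class-sensitive cost function can look like, the following sketch weights the binary cross-entropy so that errors on active compounds cost more than errors on inactive ones. The abstract does not give the exact cost function used in the thesis, so the function name `class_sensitive_bce` and the weights `w_active` and `w_inactive` are assumptions chosen only for illustration.

```python
import numpy as np

def class_sensitive_bce(y_true, p_pred, w_active=100.0, w_inactive=1.0, eps=1e-12):
    """Binary cross-entropy with a separate weight per class.

    y_true: 1 for active, 0 for inactive.
    p_pred: predicted probability of the active class.
    The default weights are illustrative, not values from the thesis.
    """
    y = np.asarray(y_true, dtype=float)
    # Clip probabilities so that log() never receives 0.
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    per_sample = -(w_active * y * np.log(p)
                   + w_inactive * (1.0 - y) * np.log(1.0 - p))
    return per_sample.mean()

# Toy example: one active compound among four.
y = np.array([1, 0, 0, 0])
miss_active = np.array([0.1, 0.1, 0.1, 0.1])  # the active compound is missed
false_alarm = np.array([0.9, 0.9, 0.1, 0.1])  # active found, one false positive

plain_miss = class_sensitive_bce(y, miss_active, w_active=1.0)
plain_alarm = class_sensitive_bce(y, false_alarm, w_active=1.0)
weighted_miss = class_sensitive_bce(y, miss_active)
weighted_alarm = class_sensitive_bce(y, false_alarm)
```

With equal weights the two prediction vectors incur essentially the same loss, so the network has no incentive to prefer finding the active compound. With a larger weight on the active class, missing the active compound is penalized far more heavily than raising a false alarm.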
Before the features are learned, the data is cleaned by applying the Synthetic Minority
Oversampling Technique (SMOTE) and removing Tomek links. SAEs yield features
at different levels. An active sample that is misclassified by a network trained
on one feature space is assumed to be correctly classifiable in another
feature space. This assumption underlies learning hierarchical feature vectors
and training a separate classifier for each feature space.
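The data-cleaning step described above (SMOTE followed by Tomek link removal) can be sketched in plain NumPy as follows. This is a minimal sketch under stated assumptions, not the thesis implementation: the helper names (`pairwise_dist`, `smote`, `tomek_link_mask`, `smote_tomek`), the choice to oversample until the classes are balanced, and the choice to drop both members of each Tomek link are all assumptions. The dense distance matrix also limits it to small datasets.

```python
import numpy as np

def pairwise_dist(A, B):
    # Euclidean distance between every row of A and every row of B.
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def smote(X_min, n_new, k=5, rng=None):
    """Create n_new synthetic minority samples by interpolating between a
    minority sample and one of its k nearest minority neighbours."""
    if len(X_min) < 2:
        raise ValueError("need at least two minority samples")
    rng = np.random.default_rng(rng)
    d = pairwise_dist(X_min, X_min)
    np.fill_diagonal(d, np.inf)               # a sample is not its own neighbour
    k = min(k, len(X_min) - 1)
    neighbours = np.argsort(d, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    nb = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))              # position along the segment
    return X_min[base] + gap * (X_min[nb] - X_min[base])

def tomek_link_mask(X, y):
    """True for samples in a Tomek link: a pair of mutual nearest
    neighbours that carry different labels."""
    d = pairwise_dist(X, X)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)
    mutual = nn[nn] == np.arange(len(X))
    return mutual & (y != y[nn])

def smote_tomek(X, y, minority=1, k=5, rng=None):
    # Assumption: oversample until both classes have the same size.
    X_min = X[y == minority]
    n_new = max(0, int((y != minority).sum()) - len(X_min))
    X_all = np.vstack([X, smote(X_min, n_new, k, rng)])
    y_all = np.concatenate([y, np.full(n_new, minority, dtype=y.dtype)])
    # Assumption: both members of each Tomek link are removed.
    keep = ~tomek_link_mask(X_all, y_all)
    return X_all[keep], y_all[keep]

# Toy data: 50 inactive and 5 active compounds with two descriptors each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (5, 2))])
y = np.array([0] * 50 + [1] * 5)
X_res, y_res = smote_tomek(X, y, rng=0)
```

Because each synthetic sample is a convex combination of two real minority samples, oversampling never moves outside the region already spanned by the active compounds. Removing Tomek links afterwards discards borderline pairs where an active and an inactive sample are each other's nearest neighbour.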