Document representation using extended locality preserving indexing
Abstract
The main purpose of web search is to obtain information relevant to a user's
need from the documents available on the Internet. Each term (word) in
a document contributes to a dimension. It is challenging to process this high dimensional
data. Not all terms convey important meaning: some terms are related
to each other, and some are synonyms. This redundancy in the document collection
increases the dimensionality of the document space. Processing this high dimensional
document collection to obtain useful information from it requires a lot of
storage space and time for computation. Dimensionality reduction plays an important
role here: it reduces the data dimension so that computation is
faster and less storage is required.
Documents are represented as vectors in a high dimensional space. Our
main aim is to obtain a representation of the documents in a reduced subspace
such that the relations among documents in the subspace remain the same
as in the original vector space. The accuracy of the document similarity
measure obtained in the subspace is therefore evaluated. Document representation
in terms of a term-document matrix is an important step in document indexing.
Document indexing is the process of building an index that helps in retrieving
relevant documents effectively, analogous to the index of a book.
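
To make the term-document representation concrete, the following is a minimal sketch, not the dissertation's implementation. It assumes raw term counts and whitespace tokenization, and the helper names (term_document_matrix, document_vector, cosine_similarity) and the three toy documents are hypothetical. Each row of the matrix is a term, each column is a document, and document similarity is measured as the cosine of the angle between columns.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] counts term i in document j."""
    # Assumption: lowercase whitespace tokenization, raw counts (no weighting).
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    counts = [Counter(tokens) for tokens in tokenized]
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix


def document_vector(matrix, j):
    """Extract column j, i.e. the vector representing document j."""
    return [row[j] for row in matrix]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy collection.
docs = ["web search engine", "search engine index", "book index page"]
vocab, tdm = term_document_matrix(docs)

sim_01 = cosine_similarity(document_vector(tdm, 0), document_vector(tdm, 1))
sim_02 = cosine_similarity(document_vector(tdm, 0), document_vector(tdm, 2))
```

Even in this toy collection every distinct term adds a row, which illustrates how the dimensionality grows with the vocabulary of the collection.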
Latent Semantic Indexing (LSI) is a global structure preserving approach, while
Locality Preserving Indexing (LPI) is a local structure preserving approach. LPI
assigns weights to the neighbours of each document to obtain the reduced
representation while preserving local structure. However, it does not retain any
information about non-neighbours. A new approach, Extended Locality Preserving
Indexing (ELPI), is proposed, which preserves the topology of the document space
by modifying the weighting scheme. Experiments on evaluating document similarity
and on classification show a small but encouraging improvement with ELPI over
LPI.
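
The neighbour weighting that LPI relies on can be sketched as follows. This is only an illustration under stated assumptions: it uses one common formulation in which two documents are connected if either is among the k nearest neighbours of the other, with the cosine similarity as the edge weight. The function names, the toy vectors, and the choice k = 1 are hypothetical, and the ELPI weighting itself is not reproduced here because the abstract does not specify it.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_weight_matrix(vectors, k):
    """Build a symmetric k-nearest-neighbour weight matrix (assumed LPI-style).

    W[i][j] is the cosine similarity if j is among the k nearest neighbours
    of i or i is among those of j, and 0.0 otherwise.
    """
    n = len(vectors)
    sim = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]

    # For each document, keep the k most similar other documents.
    neighbours = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: sim[i][j], reverse=True)
        neighbours.append(set(others[:k]))

    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and (j in neighbours[i] or i in neighbours[j]):
                W[i][j] = sim[i][j]
    return W


# Hypothetical document vectors over a four-term vocabulary.
vectors = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
W = knn_weight_matrix(vectors, k=1)
```

In this toy example documents 1 and 2 share a term and have non-zero cosine similarity, yet W[1][2] is zero because neither is the other's nearest neighbour. This is the loss of non-neighbour information described above.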