Document representation using extended locality preserving indexing
The main purpose of web search is to retrieve, from the documents available on the Internet, the information relevant to a user's need. Documents are represented as vectors in a high dimensional space in which each term (word) contributes one dimension, and processing such high dimensional data is challenging. Not all terms convey important meaning: some terms are related to each other, and some are synonyms. This redundancy inflates the dimensionality of the document space, so extracting useful information from the collection requires a great deal of storage and computation time. Dimensionality reduction addresses this by reducing the data dimension, which speeds up computation and lowers the storage requirement. Our main aim is to represent documents in a reduced subspace such that the relations among documents in the subspace remain the same as in the original vector space. We therefore evaluate the accuracy of the document similarity measure obtained in the subspace. Representing documents as a term-document matrix is an important step in document indexing, the process of building an index that helps retrieve relevant documents effectively, analogous to the index of a book. Latent Semantic Indexing (LSI) is a global structure preserving approach, while Locality Preserving Indexing (LPI) is a local structure preserving approach. LPI assigns weights to the neighbours of each document to obtain a reduced representation that preserves local structure; however, it does not retain any information about non-neighbours. We propose a new approach, Extended Locality Preserving Indexing (ELPI), which preserves the topology of the document space by modifying the weighting scheme. Experiments on document similarity evaluation and on classification show a small but encouraging improvement of ELPI over LPI.
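As a rough illustration of two ideas mentioned in the abstract, the term-document matrix and the way LPI weights only neighbouring documents, the following minimal Python sketch may help. It is not the dissertation's implementation. The function names, the toy documents, the use of raw term counts, and the weighting rule (cosine similarity between a document and its k nearest neighbours, zero for every other pair) are assumptions made for illustration. The modified ELPI weighting is not shown, because the abstract does not specify it.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Return (vocab, columns); columns[j] is the raw term-count vector of docs[j]."""
    counts = [Counter(d.lower().split()) for d in docs]
    vocab = sorted(set().union(*counts))
    columns = [[c[t] for t in vocab] for c in counts]
    return vocab, columns


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def neighbour_weights(columns, k):
    """Assumed LPI-style weights: cosine similarity for k-nearest-neighbour
    pairs (in either direction), 0.0 for all non-neighbours."""
    n = len(columns)
    sim = [[cosine(columns[i], columns[j]) for j in range(n)] for i in range(n)]
    knn = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i), key=lambda j: -sim[i][j])
        knn.append(set(others[:k]))
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and (j in knn[i] or i in knn[j]):
                weights[i][j] = sim[i][j]
    return weights


# Toy collection (hypothetical): two documents about search, two about matrices.
docs = [
    "web search retrieves relevant documents",
    "search engines retrieve relevant web documents",
    "matrix factorization reduces dimension",
    "dimension reduction uses matrix factorization",
]
vocab, columns = term_document_matrix(docs)
W = neighbour_weights(columns, k=1)
```

In this sketch every non-neighbour pair receives a weight of exactly zero, which is the loss of information about non-neighbours that the abstract attributes to LPI.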
- M Tech Dissertations