Study and analysis of primary database operations using single node Hadoop cluster

Tripathi, Prakriti Vaibhav

dc.contributor.advisor	Jat, P. M.
dc.contributor.author	Tripathi, Prakriti Vaibhav
dc.date.accessioned	2017-06-10T14:42:35Z
dc.date.available	2017-06-10T14:42:35Z
dc.date.issued	2015
dc.identifier.citation	Tripathi, Prakriti Vaibhav (2015). Study and analysis of primary database operations using single node Hadoop cluster. Dhirubhai Ambani Institute of Information and Communication Technology, viii, 39 p. (Acc.No: T00497)
dc.identifier.uri	http://drsr.daiict.ac.in/handle/123456789/534
dc.description.abstract	The MapReduce framework and the Hive query language are widely used for large-scale data processing in Hadoop systems. Media, corporate, and government organizations now generate very large amounts of raw data. This raises new challenges and opportunities in turning large data into meaningful information. Traditional data processing solutions (such as Oracle or DB2) are not very efficient at managing and analyzing large volumes of unstructured data. A solution based on Hive and MapReduce works well, but it needs improvement to achieve better performance. Many researchers have proposed strategies for improving the original Hive system. These strategies can be applied either before execution or at run time. Many strategies, such as file format (e.g., the Record Columnar file format) and cost optimization, are applied before execution to improve query processing performance. We tried the Tez and Vectorized query execution strategies at run time to improve performance. The performance of Hive queries can be improved by minimizing the average queue delay, utilizing free memory, and increasing parallelism. These can be achieved by minimizing the processing and run-time overheads that arise from inefficient query translation and execution. The Tez and Vectorized query execution approaches can eliminate these overheads by addressing issues such as unnecessary Map phases, unnecessary data loading, and unnecessary idle time between the Map and Reduce phases. To measure the performance gain, we executed selected queries over four different data sets. We selected the queries based on the primary operations: Join, Order By, Group By, Logical, and Predicate. We addressed two aspects in our experiments.
First, we measured query execution time across a variety of data (four different data sets) using the MR execution approach (the default), Tez, and Vectorized query execution. Second, we measured query execution time with respect to data size (i.e., number of rows) using the MR, Tez, and Vectorized query execution approaches. Based on the measurements, Tez and Vectorized execution were found to perform better than MR execution for every primary operation. Overall, Vectorized execution performs best for the Join operation in both aspects, and Tez execution performs best for the Order By and Predicate operations in both aspects. Based on the experiments and analysis, we conclude that the Tez and Vectorized execution approaches always improve performance compared to the MR execution approach.
dc.publisher	Dhirubhai Ambani Institute of Information and Communication Technology
dc.subject	Database
dc.subject	Hadoop
dc.subject	Hadoop Cluster
dc.subject	Database Operation
dc.subject	Single Node Hadoop
dc.classification.ddc	005.7585 TRI
dc.title	Study and analysis of primary database operations using single node Hadoop cluster
dc.type	Dissertation
dc.degree	M. Tech
dc.student.id	201211039
dc.accession.number	T00497

Files in this item

Name: 201211039.pdf
Size: 5.345Mb
Format: PDF

This item appears in the following Collection(s)

M Tech Dissertations [923]