dc.contributor.advisor: Jat, P. M.
dc.contributor.author: Tripathi, Prakriti Vaibhav
dc.date.accessioned: 2017-06-10T14:42:35Z
dc.date.available: 2017-06-10T14:42:35Z
dc.date.issued: 2015
dc.identifier.citation: Tripathi, Prakriti Vaibhav (2015). Study and analysis of primary database operations using single node hadoop cluster. Dhirubhai Ambani Institute of Information and Communication Technology, viii, 39 p. (Acc.No: T00497)
dc.identifier.uri: http://drsr.daiict.ac.in/handle/123456789/534
dc.description.abstract: The MapReduce framework and the Hive query language are widely used for large-scale data processing in Hadoop systems. Media, corporate, and government organizations now generate very large amounts of raw data. This raises new challenges and opportunities in turning large data into meaningful information. Traditional data processing solutions (such as Oracle or DB2) are not very efficient at managing and analyzing large volumes of unstructured data. A Hive and MapReduce based solution is good, but it requires improvements to achieve better performance. Many researchers have proposed strategies for improving the original Hive system. Strategies can be applied either at run time or before execution. Many strategies, such as file formats (e.g., the Record Columnar file format) and cost optimization, are applied before execution to improve query processing performance. We tried out the Tez and Vectorized query execution strategies at run time to improve performance. The performance of Hive queries can be improved by minimizing the average queue delay, utilizing free memory, and increasing parallelism. These can be achieved by minimizing the processing and run-time overheads that arise from inefficient query translation and execution. The Tez and Vectorized query execution approaches can eliminate unnecessary overheads by addressing issues such as unnecessary Map phases, unnecessary data loading, and unnecessary idle time between the Map and Reduce phases. To measure the performance gain, we executed selected queries over four different data sets. We selected the queries based on the primary operations: Join, Order By, Group By, Logical, and Predicate. We addressed two aspects in our experiments. First, we measured query execution time with respect to data variety (four different data sets) using the MR execution approach (the default), Tez, and Vectorized query execution. Second, we measured query execution time with respect to data size (i.e., number of rows) using the same three approaches. Based on the measurements, Tez and Vectorized execution performed better than MR execution for every primary operation. Overall, Vectorized execution performed best for the Join operation in both aspects, and Tez execution performed best for the Order By and Predicate operations in both aspects. Based on the experiments and analysis, we conclude that the Tez and Vectorized execution approaches always improve performance compared to the MR execution approach.
dc.publisher: Dhirubhai Ambani Institute of Information and Communication Technology
dc.subject: Database
dc.subject: Hadoop
dc.subject: Hadoop Cluster
dc.subject: Database Operation
dc.subject: Single Node Hadoop
dc.classification.ddc: 005.7585 TRI
dc.title: Study and analysis of primary database operations using single node hadoop cluster
dc.type: Dissertation
dc.degree: M. Tech
dc.student.id: 201211039
dc.accession.number: T00497

