Study and analysis of primary database operations using single node hadoop cluster
Tripathi, Prakriti Vaibhav
The MapReduce framework and the Hive query language are widely used for large-scale data processing in Hadoop systems. Media, corporations and government organizations now generate very large amounts of raw data. This raises new challenges and opportunities in turning large data into meaningful information. Traditional data processing solutions (like Oracle or DB2) are not very efficient at managing and analyzing large volumes of unstructured data. A Hive and MapReduce based solution works well, but it requires improvements to achieve better performance. Many researchers have proposed strategies for improving the original Hive system. These strategies can be applied either at run time or before execution. Many strategies, such as the choice of file format (e.g. the Record Columnar file format) and cost optimization, are applied before execution to improve the performance of query processing. We tried out the Tez and Vectorized query execution strategies at run time to improve performance. The performance of Hive queries can be improved by minimizing the average queue delay, utilizing free memory and increasing parallelism. These goals can be achieved by minimizing the processing and run-time overheads that arise from inefficient query translation and execution. The Tez and Vectorized query execution approaches can eliminate these overheads by addressing issues such as unnecessary Map phases, unnecessary data loading and unnecessary idle time between the Map and Reduce phases. To measure the performance gain, we executed selected queries over four different data sets. We selected the queries based on the primary operations, and performed our experiments for the Join, Order By, Group By, Logical and Predicate operations. We addressed two aspects in our experiments.
First, we measured the query execution time with respect to the variety of data (four different data sets) using the MR execution approach (the default execution approach) and the Tez and Vectorized query execution approaches. Second, we measured the query execution time with respect to data size (i.e. number of rows) using the MR, Tez and Vectorized query execution approaches. Based on the measurement results, Tez and Vectorized execution were found to perform better than MR execution for every primary operation. Overall, Vectorized execution performs better than the others for the Join operation in both aspects, and Tez execution performs better than the others for the Order By and Predicate operations in both aspects. Based on the experiments and analysis, we conclude that, for the operations and data sets tested, the Tez and Vectorized execution approaches always improve performance compared to the MR execution approach.
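The comparison described above could be set up roughly as in the following Python sketch, which is illustrative and not taken from the dissertation. The session properties `hive.execution.engine` and `hive.vectorized.execution.enabled` are standard Hive settings. Everything else is an assumption: the helper names (`session_prelude`, `time_query`, `compare_modes`), the caller-supplied `run` function that submits statements to Hive, and the pairing of vectorization with the MR engine, since the abstract does not state which engine the Vectorized runs used.

```python
import time
from typing import Callable, Dict, List

# Hive session properties for each execution mode. The property names are
# standard Hive settings; pairing vectorization with the MR engine is an
# assumption, because the abstract does not say which engine was used.
MODE_SETTINGS: Dict[str, Dict[str, str]] = {
    "mr": {
        "hive.execution.engine": "mr",
        "hive.vectorized.execution.enabled": "false",
    },
    "tez": {
        "hive.execution.engine": "tez",
        "hive.vectorized.execution.enabled": "false",
    },
    "vectorized": {
        "hive.execution.engine": "mr",
        "hive.vectorized.execution.enabled": "true",
    },
}


def session_prelude(mode: str) -> List[str]:
    """Return the SET statements that switch a Hive session to `mode`."""
    return [f"SET {key}={value};" for key, value in MODE_SETTINGS[mode].items()]


def time_query(run: Callable[[List[str]], None], mode: str, query: str) -> float:
    """Run `query` under `mode` and return the elapsed wall-clock seconds.

    `run` is a hypothetical caller-supplied function that sends a list of
    HiveQL statements to Hive (for example through a CLI or a client library).
    """
    statements = session_prelude(mode) + [query]
    start = time.perf_counter()
    run(statements)
    return time.perf_counter() - start


def compare_modes(run: Callable[[List[str]], None], query: str) -> Dict[str, float]:
    """Time the same query once under every execution mode."""
    return {mode: time_query(run, mode, query) for mode in MODE_SETTINGS}
```

Repeating `compare_modes` for each primary operation (Join, Order By, Group By, Logical, Predicate), once per data set and once per row count, would cover the two aspects measured in the experiments. Depending on the Hive version, vectorized execution may only apply to tables stored in a columnar format such as ORC, so the storage format is worth checking before drawing conclusions from the timings.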
- M Tech Dissertations