    Study and analysis of primary database operations using single node hadoop cluster

    View/Open
    201211039.pdf (5.345Mb)
    Date
    2015
    Author
    Tripathi, Prakriti Vaibhav
    Abstract
    The MapReduce framework and the Hive query language are widely used for large-scale data processing in Hadoop systems. Media, corporations and government organizations now generate very large amounts of raw data. This raises new challenges and opportunities for turning large data into meaningful information. Traditional data processing solutions (such as Oracle or DB2) are not very efficient at managing and analyzing large volumes of unstructured data. A Hive and MapReduce based solution works well, but it requires improvements to achieve better performance. Many researchers have proposed strategies for improving the original Hive system. These strategies can be applied either at run time or before execution. Many strategies, such as file formats (e.g., the Record Columnar File format) and cost optimization, are applied before execution to improve query processing performance. We tried the Tez and Vectorized query execution strategies at run time to improve performance. The performance of Hive queries can be improved by minimizing the average queue delay, utilizing free memory and increasing parallelism. These can be achieved by minimizing the processing and run-time overheads that arise from inefficient query translation and execution. The Tez and Vectorized query execution approaches can eliminate unnecessary overheads by addressing issues such as unnecessary Map phases, unnecessary data loading and unnecessary idle time between the Map and Reduce phases. To measure the performance gain, we executed selected queries over four different data sets. We selected the queries based on the primary operations they exercise. We performed our experiments for Join, Order By, Group By, Logical and Predicate operations. We addressed two aspects in our experiments.
First, we measured query execution time with respect to the variety of data (four different data sets) using the MR execution approach (the default execution approach) and the Tez and Vectorized query execution approaches. Second, we measured query execution time with respect to data size (i.e., number of rows) using the MR, Tez and Vectorized query execution approaches. Based on the measurement results, Tez and Vectorized execution were found to perform better than MR execution for every primary operation. Overall, Vectorized execution performs best for the Join operation in both aspects, and Tez execution performs best for the Order By and Predicate operations in both aspects. Based on the experiments and analysis, we conclude that the Tez and Vectorized execution approaches always improve performance compared to the MR execution approach.
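The comparison described in the abstract could be set up roughly as follows. This is a minimal sketch, not the dissertation's actual harness: it builds a per-session script that switches the execution approach through the Hive properties `hive.execution.engine` and `hive.vectorized.execution.enabled`, then times a caller-supplied runner. The helper names (`session_script`, `time_query`, `compare_modes`) and the `run` callable are hypothetical, and the choice to run the "vectorized" mode on top of the MR engine is an assumption, since the abstract does not say which engine was used underneath.

```python
import time
from typing import Callable, Dict

# Hive session properties for each execution approach named in the abstract.
# Assumption: "vectorized" keeps the MR engine and only enables vectorization;
# the abstract does not state the underlying engine.
EXECUTION_MODES: Dict[str, Dict[str, str]] = {
    "mr": {
        "hive.execution.engine": "mr",
        "hive.vectorized.execution.enabled": "false",
    },
    "tez": {
        "hive.execution.engine": "tez",
        "hive.vectorized.execution.enabled": "false",
    },
    "vectorized": {
        "hive.execution.engine": "mr",
        "hive.vectorized.execution.enabled": "true",
    },
}


def session_script(mode: str, query: str) -> str:
    """Return the SET statements for `mode` followed by the query itself."""
    settings = EXECUTION_MODES[mode]
    lines = [f"SET {key}={value};" for key, value in settings.items()]
    # Normalize the query so it ends with exactly one semicolon.
    lines.append(query.rstrip().rstrip(";") + ";")
    return "\n".join(lines)


def time_query(run: Callable[[str], None], mode: str, query: str) -> float:
    """Time one execution of `query` under `mode`, in seconds.

    `run` is a hypothetical callable that submits a HiveQL script to a
    cluster (for example through a CLI wrapper) and blocks until it finishes.
    """
    script = session_script(mode, query)
    start = time.perf_counter()
    run(script)
    return time.perf_counter() - start


def compare_modes(run: Callable[[str], None], query: str) -> Dict[str, float]:
    """Run the same query under every execution mode and collect timings."""
    return {mode: time_query(run, mode, query) for mode in EXECUTION_MODES}
```

Passing `print` as the runner shows the generated scripts without touching a cluster; a real measurement would repeat each run several times and vary the data set and row count, as the two aspects above describe.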
    URI
    http://drsr.daiict.ac.in/handle/123456789/534
    Collections
    • M Tech Dissertations [923]

    Resource Centre copyright © 2006-2017 
    Contact Us | Send Feedback
    Theme by 
    Atmire NV
     

     

