Hybrid partitioning and distribution of RDF data
RDF is a standard model from the W3C designed for data interchange on the web. It was established for the development of the semantic web, but it is now used in diverse domains well beyond it, and the volume of RDF data has grown tremendously as a result. With this growth, managing RDF data efficiently becomes vital. The thesis aims at efficient storage and faster querying of RDF data using data partitioning techniques. It studies the limitations of basic data partitioning techniques for RDF storage and proposes hybrid data partitioning, in centralized and distributed environments, as part of the solution for storing and querying RDF data. The dissertation focuses on efficient data storage and faster query execution for stationary RDF data. It demonstrates basic data partitioning techniques such as PT (Property Table), BT (Binary Table), and HP (Horizontally Partitioned Table), as well as the use of MV (Materialized Views) over BT. Even though the basic data partitioning techniques outperform TT (Triple Table), they suffer from various performance issues. The thesis gives a detailed insight into the advantages and disadvantages of the basic techniques and, consequently, proposes hybrid solutions that exploit the best of the available techniques. It proposes three hybrid data partitioning techniques: DAHP (Data-Aware Hybrid Partitioning), DASIVP (Data-Aware Structure Indexed Vertical Partitioning), and WAHP (Workload-Aware Hybrid Partitioning). DAHP and WAHP combine PT and BT, whereas DASIVP combines structure index partitioning with BT. DAHP and DASIVP take a data-aware approach and WAHP takes a workload-aware approach. A data-aware approach stores RDF data based on how the data items in the dataset are related to each other; a workload-aware approach stores RDF data based on which data is queried together.
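The difference between the basic layouts can be illustrated with a minimal sketch. It assumes a toy in-memory representation rather than the relational schema used in the thesis, which the abstract does not give; the example triples and the helper names `to_binary_tables` and `to_property_table` are hypothetical.

```python
from collections import defaultdict

# Hypothetical example triples (subject, property, object); this flat list
# plays the role of the TT (Triple Table). Not data from the thesis.
TRIPLES = [
    ("book1", "title", "RDF Basics"),
    ("book1", "author", "alice"),
    ("book2", "title", "SPARQL Notes"),
    ("book2", "author", "bob"),
    ("alice", "name", "Alice"),
]


def to_binary_tables(triples):
    """BT sketch: one two-column (subject, object) table per property."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return dict(tables)


def to_property_table(triples, properties):
    """PT sketch: one row per subject, one column per chosen property.

    Simplifying assumption: each property is single-valued per subject;
    a later value would overwrite an earlier one. Missing values are None.
    """
    rows = {}
    for s, p, o in triples:
        if p in properties:
            rows.setdefault(s, {q: None for q in properties})[p] = o
    return rows


bt = to_binary_tables(TRIPLES)
pt = to_property_table(TRIPLES, ["title", "author"])
```

In this sketch, a query asking for the title and author of a book needs a self-join on the triple list and a join of two binary tables, but only a single row lookup in the property table. Subjects without any of the chosen properties (here `alice`) do not appear in the property table and would still be served from binary tables, which is the intuition behind combining PT and BT.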
The thesis presents a detailed evaluation of query performance and data storage for all of the data partitioning techniques. Query performance is evaluated in terms of QET (Query Execution Time), and a break-even point is calculated for each technique. The hybrid data partitioning techniques show significant improvement over the basic ones. A set of metrics is also devised to help assess the suitability of a given data partitioning technique for an RDF dataset.
RDF data has grown to a point where it is difficult to manage on a single machine. It becomes necessary to distribute the data across nodes and process it in parallel so that efficient query performance can be achieved. Data distribution and parallel query processing may generate many intermediate results, which require communication among nodes, so inter-node communication must be minimized to achieve faster query execution. This work presents a solution for managing RDF data in a distributed environment using a proposed hybrid technique. The solution aims at efficient RDF data storage and faster query execution by minimizing inter-node communication. Finally, the dissertation proposes DWAHP (Workload-Aware Hybrid Partitioning and Distribution), which exploits the query workload and distributes data among nodes. DWAHP has two phases. Phase 1 applies the Workload-Aware Hybrid Partitioning technique, which generates workload-aware clusters consisting of PT and BT. Phase 2 applies a distribution scheme that distributes data among nodes using an n-hop Property Reachability Matrix. Phase 1 reduces the number of joins, because data that is queried together is kept in a separate partition. Phase 2 reduces inter-node communication through the use of the n-hop Property Reachability Matrix.
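The abstract does not define the n-hop Property Reachability Matrix, so the following is only an illustrative sketch under a stated assumption: property q is taken to be reachable from property p in one hop when some object of a p-triple is the subject of a q-triple, and the matrix records reachability within at most n such hops. The function names and the example triples are hypothetical, and the thesis may define or compute the matrix differently.

```python
from collections import defaultdict

# Hypothetical example triples (subject, property, object); not from the thesis.
EXAMPLE = [
    ("book1", "title", "RDF Basics"),
    ("book1", "author", "alice"),
    ("alice", "worksAt", "uni1"),
    ("uni1", "locatedIn", "city1"),
]


def one_hop(triples):
    """Map each property p to the properties found on the objects of p-triples."""
    props_of_subject = defaultdict(set)
    for s, p, _ in triples:
        props_of_subject[s].add(p)
    adjacency = defaultdict(set)
    for _, p, o in triples:
        adjacency[p] |= props_of_subject.get(o, set())
    return adjacency


def n_hop_reachability(triples, n):
    """Return (properties, matrix); matrix[i][j] is True when properties[j]
    is reachable from properties[i] within at most n hops (assumed definition)."""
    props = sorted({p for _, p, _ in triples})
    adjacency = one_hop(triples)
    reach = {p: set(adjacency.get(p, set())) for p in props}
    frontier = {p: set(reach[p]) for p in props}
    for _ in range(n - 1):
        for p in props:
            nxt = set()
            for q in frontier[p]:
                nxt |= adjacency.get(q, set())
            frontier[p] = nxt - reach[p]  # only newly discovered properties
            reach[p] |= nxt
    matrix = [[q in reach[p] for q in props] for p in props]
    return props, matrix


props, matrix_1 = n_hop_reachability(EXAMPLE, 1)
_, matrix_2 = n_hop_reachability(EXAMPLE, 2)
```

Under this reading, properties that reach each other within n hops are candidates for placement on the same node, so that a linear query following those properties can be answered without shipping intermediate results between nodes.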
The thesis demonstrates DWAHP and analyzes its performance in terms of query execution time, query cost, storage space, and inter-node communication. Queries on RDF data mostly involve star and linear query patterns, and DWAHP manages joins so that all linear and star queries can be answered without inter-node communication. Compared with a state-of-the-art solution, DWAHP achieves 72% faster query execution and 61% lower query cost while occupying less than one-third of the storage space. As RDF data continues to grow with its use in diverse domains, the partitioning techniques discussed can be applied to various RDF stores. Data-aware RDF stores suit applications where the data characteristics are known, and workload-aware RDF stores suit applications where the queries are known in advance.
- PhD Theses