Comparative Performance Analysis of Column Family Databases: Cassandra and HBase
Abstract
Up until now, relational databases have been unquestionably the most prevalent type of database used to handle data. The advent of cloud computing and big data has underlined the need for databases that are capable of managing and analyzing big data. By allowing storage and retrieval of structured as well as unstructured data, NoSQL databases circumvent the limitations of relational databases. Because of their support for schema flexibility, rapid data access, and potential to scale up quickly, they have emerged as the favored choice for big data processing. These systems have several properties and parameters that can be tuned to achieve specific performance goals based on business needs. Having well-defined performance objectives assists us in articulating the acceptable trade-offs for our application. This motivates us to evaluate the performance of one such frequently used NoSQL system: Cassandra. Apache Cassandra is an open-source, decentralized, distributed, fault-tolerant, highly available, elastically scalable, tunably consistent, row-oriented database. To carry out the performance evaluation, we use the Yahoo! Cloud Serving Benchmark (YCSB). Our findings show that increasing the thread count initially improves throughput and CPU utilization but later decreases them. Higher record count, consistency level, and dataset size lead to decreased throughput and increased latency. A stronger consistency level also increases CPU utilization. Increasing the operation count improves throughput but increases latency as well. These findings provide guidance for optimizing Cassandra's performance by adjusting these parameters. We also assess Apache HBase, another well-known NoSQL database, using YCSB. The relative performance of these databases under analytical as well as update-heavy workloads is the primary focus of our investigation. Our test results demonstrate that for both workloads, Cassandra outperforms HBase in read operations, whereas HBase excels in write operations. This research quantifies the performance traits of Cassandra and HBase, assisting developers and architects in choosing the best database system for their big data applications.