The explosion of data is dramatically changing the landscape of how we store and process structured and unstructured data. Innovation in database technology is fueled by the need to address the scalability and performance challenges introduced by ‘big data’. Our CTO Don Taylor reflected on this issue at the MIT CIO Symposium with the term Unwieldy Data.
The dominance of SQL and relational databases is being challenged by NoSQL options. These include aggregate oriented databases (the Key-Value, Document and Column-family data models), graph databases and the Hadoop ecosystem, an open-source software framework for the storage and large-scale parallel processing of data sets on clusters of commodity hardware. These NoSQL technologies address the use cases where the relational model does not fit well, along with scalability and performance issues, but they still need a robust ecosystem and standards to match the capabilities offered by SQL and relational databases.
Organizations are faced with a critical decision around the use of SQL or NoSQL for their Big Data solutions. Understanding the capabilities and challenges inherent in each of these technologies can help organizations make a better decision when choosing the right tool for the problem.
Relational databases are based on the premise of a fixed schema, where data is normalized and stored in rows and columns of related tables, and provide ACID properties for database transactions. SQL (Structured Query Language) is a powerful declarative programming language with procedural elements that is used to interact with the stored data. The prevalence of SQL can be attributed to the fact that it is a standard adopted by RDBMS vendors, allowing applications and skills to be portable.
Over the years, designers and engineers have found crafty ways of representing all kinds of data in the relational model, and it has worked well. Relational databases become a challenge when the data has a changing or dynamic schema, or when data is inter-connected and the context and relationships of that data matter more than the data itself. Relational databases use joins to traverse relationships, and joins become extremely taxing as they increase in number. A join is a search for a key in another table; the performance of this search can be improved by indexing the keys, but the tradeoff is slower inserts, updates and deletes.
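The join-versus-index tradeoff described above can be seen in a minimal sketch using Python’s built-in sqlite3 module (the tables and data are invented for the example):

```python
import sqlite3

# In-memory database; schema and rows are illustrative only.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO posts VALUES (10, 1, 'Analytical Engines'), (11, 2, 'Compilers');
""")

# A join is a search for a key in another table: for each post,
# look up the matching author by id.
rows = con.execute("""
    SELECT a.name, p.title
    FROM posts p JOIN authors a ON a.id = p.author_id
    ORDER BY p.id
""").fetchall()

# Indexing the join key speeds up that search, at the cost of
# slower inserts, updates and deletes on the indexed table.
con.execute("CREATE INDEX idx_posts_author ON posts(author_id)")
```

On a data set this small the index is irrelevant, but as tables grow, each additional join multiplies the key searches the engine must perform.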
Scalability in relational databases is addressed by scaling vertically on powerful hardware. Scaling horizontally is a complex problem for relational databases, but it can be achieved with a shared-nothing architecture and distributed query optimizers for OLAP workloads, and with a shared-everything architecture and sharding for OLTP workloads, as explained in this Dr. Dobbs article.
Aggregate Oriented Databases
Martin Fowler explains and categorizes the Key-Value, Document and Column-family data models as aggregate oriented NoSQL databases in this blog post. The common theme behind these NoSQL databases is that schema is optional and data is de-normalized. The adoption of aggregate oriented databases can be attributed to the rise of XML and JSON as data formats, especially JSON, for web, application and service development and for data interchange. Let’s take a look at the key characteristics of these databases.
1) Key-Value Stores – One key associated with one value, an opaque blob with no internal structure. Querying the data is only possible through the key.
2) Document Stores – Key-Value stores whose values carry an implicit schema, allowing for queries beyond the key.
3) Column-Family – Key-Value stores where a key is mapped to a set of columns. Queries are based on keys and key ranges.
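The three models above can be caricatured with plain Python dictionaries – a sketch of the data shapes only, not of any particular product; all keys and values are invented for the example:

```python
# Key-value: one opaque value per key; lookup only by key.
kv_store = {"user:42": b"\x00\x01binary-blob"}  # value is an uninterpreted blob

# Document: the value has implicit structure, so queries can reach
# inside it (e.g. find all users in a given city).
doc_store = {
    "user:42": {"name": "Ada", "city": "London", "tags": ["math"]},
    "user:43": {"name": "Grace", "city": "Arlington", "tags": ["navy"]},
}
in_london = [k for k, d in doc_store.items() if d.get("city") == "London"]

# Column-family: a key maps to a set of named columns; queries are
# by key or key range (here a sorted scan over keys stands in for one).
cf_store = {
    "user:42": {"profile:name": "Ada", "profile:city": "London"},
    "user:43": {"profile:name": "Grace", "profile:city": "Arlington"},
}
in_range = [k for k in sorted(cf_store) if "user:42" <= k <= "user:42~"]
```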
With limited query support, aggregate oriented databases rely on the MapReduce programming framework to filter and aggregate data. MapReduce uses a map function to filter relevant data, which is then submitted to the reduce function for aggregation.
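That map-then-reduce flow can be sketched in-process in a few lines of Python (the order data is invented for the example):

```python
from functools import reduce

orders = [
    {"region": "EU", "amount": 120},
    {"region": "US", "amount": 80},
    {"region": "EU", "amount": 40},
]

# Map: filter relevant data and emit (key, value) pairs.
mapped = [(o["region"], o["amount"]) for o in orders if o["amount"] >= 50]

# Shuffle: group the emitted values by key.
grouped = {}
for key, value in mapped:
    grouped.setdefault(key, []).append(value)

# Reduce: aggregate each group.
totals = {key: reduce(lambda a, b: a + b, values)
          for key, values in grouped.items()}
# totals == {"EU": 120, "US": 80}
```

In a real aggregate oriented store, the map and reduce steps run on the nodes that hold the data rather than in one process.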
These NoSQL databases are built to scale out horizontally on distributed nodes, allowing the ability to add nodes as needed.
Graph Databases
Graph databases are NoSQL databases based on the graph data structure and graph theory. Graphs have a simple structure – Graph = (Nodes, Edges) – where nodes are connected by edges, and both can have properties associated with them. Graph databases offer index-free adjacency: every node that has a relationship with another node has that relationship/edge stored in the database, so every node knows which other nodes it is connected to. This is in stark contrast with relational databases, which compute these relationships in queries, impacting performance as the size of the data set increases.
Graph databases use an index only once, to find the starting node, and then traverse the graph with pointer chasing and pattern matching to find the data. Performance in graph databases is therefore not impacted by the total size of the data set – a significant advantage when working with big data sets.
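Index-free adjacency can be sketched with an adjacency-list dictionary in Python: one lookup to find the starting node, then pure pointer chasing. The names and helper function below are invented for the example:

```python
# Index-free adjacency: each node stores its own edges.
friends = {
    "alice": ["bob", "carol"],
    "bob":   ["alice", "dave"],
    "carol": ["alice"],
    "dave":  ["bob"],
}

def friends_of_friends(graph, start):
    """One index lookup for the start node, then pointer chasing."""
    direct = set(graph.get(start, []))
    fof = set()
    for friend in direct:
        fof.update(graph.get(friend, []))
    # Exclude the start node and direct friends from the result.
    return fof - direct - {start}

second_degree = friends_of_friends(friends, "alice")
# second_degree == {"dave"}
```

Note that the traversal touches only the neighborhood of the start node; adding millions of unrelated nodes to the dictionary would not slow it down, which is the property the paragraph above describes.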
The graph data model is well suited for heavily associative data and almost all data can be represented elegantly in this model. Social, personalized recommendations and graph searches are among some of the prevalent use cases for the graph database.
Hadoop
Hadoop is an open-source software framework for the storage and large-scale parallel processing of big data sets. It is designed to scale on clusters of machines, each offering local computation and storage. Data is loaded or appended to the Hadoop Distributed File System (HDFS), a storage system designed to span large clusters of commodity servers. HDFS accepts data in any format, regardless of schema. Queries and batch read operations can then be executed against these data sets.
Hadoop is built to scale out horizontally using MapReduce, where a problem is split up and sent to the servers for processing in parallel, and the results from each server are aggregated into the solution. Compute and storage co-exist on the same servers in the cluster.
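The split-process-aggregate pattern can be simulated in a few lines of Python – the shards below stand in for data stored on separate cluster nodes, and this sketches the idea rather than Hadoop’s actual API:

```python
from collections import Counter

# Input split across (simulated) cluster nodes; each node holds a shard.
shards = [
    ["big data", "big clusters"],
    ["data lakes", "big data"],
]

# Each node processes its own shard locally, producing a partial word
# count (compute runs where the data lives).
partials = [Counter(word for line in shard for word in line.split())
            for shard in shards]

# The partial results from each node are merged into the final answer.
totals = sum(partials, Counter())
# totals["big"] == 3, totals["data"] == 3
```

Word count is the canonical MapReduce example; in Hadoop the shards would be HDFS blocks and the merge would be performed by reduce tasks.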
Hadoop in essence is not a database system, as it does not support transactional processing. SQL and NoSQL databases can augment their capabilities by offloading the ingress, processing and egress of large volumes of data to Hadoop, allowing them to focus on their true capabilities and use cases.
In conclusion, there is no silver bullet in database technology that can handle every use case in the emerging world of big and inter-connected data. Solutions do exist for each use case, however, making the case for using multiple technologies and data models to solve the problem at hand.
Ideally the database of the future will be designed such that the data store is abstracted from the different data models, while providing capabilities to represent the data in the data model of choice (graph, aggregate or relational), based on the use case.