
To SQL or NoSQL, that is the question

Some History

In the beginning God created the heaven and the earth. And the earth was without form, and void; and darkness was upon the face of the deep (Genesis 1:1-2).

Ages ago, serialization of data was the responsibility of the application programmer: you built the libraries that saved your data to disk in various file formats and indexes.

Those were the days before ISO and ANSI database standards, when an organization called CODASYL tried to standardize those formats. Some relics of those data formats were still alive in the mid-2000s, and I was involved in a software archaeology effort to move one to a modern standard, only because the data was hardware dependent, there was no hardware left to run the software, and even emulation of that hardware was being phased out.

Then there were vendors who provided their own standard methods to save data, such as IBM's ISAM and VSAM, so if you used their computing platform you simply reused their methods. This was circa 1970.

It became clear that it would be helpful to map data and relationships into a standardized model. In that period, Digital Equipment Corporation (DEC) produced a network-model database called DENDB on its VAX/VMS operating system, and IBM provided a hierarchical database called IMS/DB.

The key point was that all of these generic databases dealt with objects (entities) and the links or relations between them, and each suggested ways to ease their organization.

Then came Edgar F. Codd, the prophet of relational databases, who proved mathematically that a relational structure between table entities can be optimized to reduce duplication and enhance efficiency of access and reliability. Relational databases quickly caught on and multiple implementations were created. Since there was now a standard way of managing data, a language called SQL was quickly adopted and soon became the prevailing data access language.

Fast forward to the new millennium, and Big Data came to be. Everyone was saving and retrieving data, not just corporate workers but anyone who could hold a smartphone, and applications had to deal with huge volumes of data. This was not the old mom-and-pop accounting data; it included pictures, sounds and everything related to people's lives. And it all had to be done fast, since users are free to change vendors and are not locked in cubicles, getting paid to wait for the screen to render.

Relational databases, although mathematically well defined, were breaking under this new load.

Time for a change

As with all programming, the problem is mapping reality into binary data. This may be simpler in the future considering the Simulation Hypothesis, but we are not yet there, and for now, we have to deal with analog reality.

When modelling reality, we must simplify complexity. In an RDBMS (Relational Database Management System) this is translated into tables; in other technologies, into other logical entities.

Key-value DB

In a key-value DB the atomic entity is a record that holds a value and a unique key to that value. The value can be anything from a simple string or number to a complex object, and you fetch the data by looking up its key. Think of it as an RDBMS with a single two-column table. Fast, simple and scalable, but it lacks relations between entities unless you represent them inside the values, which flattens the structure and makes it hard to model complex realities with multiple relationships and hierarchies.
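
A minimal sketch of the idea, using an in-memory Python dict as a stand-in for the store (the key names and values are made up for illustration):

```python
# A key-value store reduces everything to key -> value lookups.
# A plain dict stands in for the store here; real stores such as Redis
# or DynamoDB expose essentially the same get/put interface.
store = {}

# The value can be anything: a string, a number, or a whole object.
store["user:1001"] = {"name": "Dana", "city": "Tel-Aviv"}
store["session:abc123"] = "user:1001"

# Retrieval is always by key; there is no query over the values themselves.
user_key = store["session:abc123"]   # -> "user:1001"
print(store[user_key]["name"])       # -> Dana
```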

Column based DB

This type of DB transposes the table entity of an RDBMS: instead of organizing a record with multiple fields as a row, it organizes it by column. So, for example, we replace a table of personal data with several column stores, each containing one specific field from the records.

This structure lends itself well to fast reads, where you fetch only the relevant columns you need and don't have to load the whole record, but writing to the database may be slower, as you have to access multiple columns instead of a single row.
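
A minimal sketch of the transposition, with a small made-up "people" table:

```python
# Row-oriented layout: each record is stored together.
rows = [
    {"id": 1, "name": "Dana", "city": "Tel-Aviv", "age": 34},
    {"id": 2, "name": "Yossi", "city": "Haifa", "age": 41},
]

# Column-oriented layout: each field is stored together across all records.
columns = {
    "id":   [1, 2],
    "name": ["Dana", "Yossi"],
    "city": ["Tel-Aviv", "Haifa"],
    "age":  [34, 41],
}

# Reading one field for every record touches a single column...
average_age = sum(columns["age"]) / len(columns["age"])

# ...but writing one new record means appending to every column.
new_record = {"id": 3, "name": "Rina", "city": "Eilat", "age": 28}
for field, value in new_record.items():
    columns[field].append(value)
```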

Document based DB

When modeling large complex entities, it is sometimes best to leave the complexity for post-retrieval time. For example, if you have multiple “documents” or complex objects that don’t follow a specific structure and are expected to change over time, just manage those entities as “objects” or “blobs” and keep only the common metadata, such as collections of similar objects and the relations between them. With this technology you can keep a dynamic schema, at the cost of parsing it at read time (slow). Here is an example of equivalent data kept as a document structure alongside the RDBMS structure; the document is formatted in JSON, as in MongoDB, one of the most commonly used document-driven DBs.
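
A minimal sketch of the contrast, with made-up person/address data (the field names are illustrative, not taken from MongoDB's documentation):

```python
import json

# RDBMS style: normalized rows in two related tables, joined by person_id.
person_rows = [{"person_id": 1, "name": "Dana", "email": "dana@example.com"}]
address_rows = [
    {"address_id": 10, "person_id": 1, "city": "Tel-Aviv", "street": "Dizengoff"},
    {"address_id": 11, "person_id": 1, "city": "Haifa", "street": "Herzl"},
]

# Document style: the same entity kept as a single nested JSON document,
# roughly as it would be stored in a MongoDB collection.
person_document = {
    "_id": 1,
    "name": "Dana",
    "email": "dana@example.com",
    "addresses": [
        {"city": "Tel-Aviv", "street": "Dizengoff"},
        {"city": "Haifa", "street": "Herzl"},
    ],
}

print(json.dumps(person_document, indent=2))
```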

Graph based DB

In the move to Big Data, and out of corporate control, you can’t know the structure of the data beforehand and you need to support it in multiple formats, so you need a structure that can hold EVERYTHING you throw at it and yet supports fast updates and reasonably quick data retrieval.

The way to do that is via a graph DB. In such a schema, any piece of data is either a node or an edge (link) between nodes. This way the relationships between objects are first-class citizens in the schema, and each has a label that you can traverse through the graph, for example: Dizengoff -isAstreetIn-> Tel-Aviv -isAtownIn-> Israel. The graph is built as updates come in, is very easy for a human to understand, and can be retrieved from using a dedicated query language.

The flexibility carries over to the metadata, meaning the structure of the data. Definitions of object types and other metadata are overlaid on the same graph, connecting the metadata to the relevant objects in it. Using this type of layering, you can attach various “schemas” to the graph’s factual nodes and links. This metadata structure is called an ontology.

A good example comes from Neo4j, one of the popular graph DBs:
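
A minimal sketch of the street/town example above, using Neo4j's Cypher query language from Python (the connection details, node labels and relationship types are made up for illustration; this is not Neo4j's own example):

```python
from neo4j import GraphDatabase  # official Neo4j Python driver

# Placeholder connection details for a local Neo4j instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Facts: labeled nodes joined by labeled relationships (the "ontology"
    # lives in the labels: Street, Town, Country, IS_A_STREET_IN, ...).
    session.run(
        "MERGE (s:Street {name: 'Dizengoff'}) "
        "MERGE (t:Town {name: 'Tel-Aviv'}) "
        "MERGE (c:Country {name: 'Israel'}) "
        "MERGE (s)-[:IS_A_STREET_IN]->(t) "
        "MERGE (t)-[:IS_A_TOWN_IN]->(c)"
    )

    # Traversal: which country is this street in?
    result = session.run(
        "MATCH (:Street {name: 'Dizengoff'})-[:IS_A_STREET_IN]->()"
        "-[:IS_A_TOWN_IN]->(c:Country) RETURN c.name AS country"
    )
    print(result.single()["country"])  # -> Israel

driver.close()
```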

Choosing the right DB technology for the task

Use RDBMS

  • Use Cases: Well-suited for applications with structured data and complex relationships, such as financial systems, e-commerce platforms, and ERP systems.
  • Consistency: Ensures strong data consistency through ACID (Atomicity, Consistency, Isolation, and Durability) transactions; in other words, it is reliable.
  • Query Language: Uses SQL for querying and manipulating data, which practically every developer knows (see the sketch after this list).
  • Scalability: May face challenges in horizontal scalability (adding servers) and handling big data efficiently.
  • Normalization: Supports normalization to minimize data redundancy.
  • Schema Changes: Schema changes can be complex and require careful planning to maintain backward compatibility, so an RDBMS works best when changes are small and known in advance.
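
A minimal sketch of the relational approach, using Python's built-in sqlite3 module (the tables and columns are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# Normalized schema: people and their addresses live in separate tables.
cur.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute(
    "CREATE TABLE address (id INTEGER PRIMARY KEY, "
    "person_id INTEGER REFERENCES person(id), city TEXT)"
)

# ACID transaction: both inserts commit together or not at all.
with conn:
    cur.execute("INSERT INTO person (id, name) VALUES (1, 'Dana')")
    cur.execute("INSERT INTO address (id, person_id, city) VALUES (10, 1, 'Tel-Aviv')")

# SQL join reassembles the relationship at query time.
cur.execute(
    "SELECT person.name, address.city FROM person "
    "JOIN address ON address.person_id = person.id"
)
print(cur.fetchall())  # -> [('Dana', 'Tel-Aviv')]
```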

Use Key value DB

  • Use Cases: Ideal for scenarios requiring high-speed data retrieval and low latency, such as caching systems, session management, and real-time workloads (see the sketch after this list).
  • Scalability: Highly scalable due to its simple structure; can be distributed easily.
  • Schema Flexibility: Limited schema enforcement, making it suitable for applications with varying data structures.
  • Complex Relationships: Not suitable for complex relationships or querying that requires multiple levels of joins; you’ll have to implement those in the application layer.
  • Consistency: Depending on the system, may offer eventual consistency rather than strong consistency (not every update to the DB is propagated immediately).
  • Examples: Redis, Amazon DynamoDB.
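
A minimal sketch of the session-caching use case with Redis and the redis-py client, assuming a local Redis server (the key names and TTL are made up):

```python
import json
import redis  # redis-py client

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Cache a session object under a key, with a one-hour expiry (TTL).
session = {"user_id": 1001, "cart": ["sku-42", "sku-17"]}
r.set("session:abc123", json.dumps(session), ex=3600)

# Later: a single key lookup, no joins, no schema.
cached = r.get("session:abc123")
if cached is not None:
    print(json.loads(cached)["cart"])
```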

Use Document based DB

  • Use Cases: Great for applications with flexible and changing data schemas, content management systems, and applications dealing with user-generated content that you don’t control, as well as front-end storage that you implement as part of the UX.
  • Schema Evolution: Can handle schema changes more smoothly compared to RDBMS.
  • Query Flexibility: Supports querying by content, enabling complex queries within a document.
  • Scalability: Good horizontal scalability, suitable for handling large volumes of data.
  • Examples: MongoDB, Couchbase.

Use Graph based DB

  • Use Cases: Designed for applications that heavily depend on understanding relationships, such as social networks, recommendation engines, and fraud detection.
  • Relationships: Optimized for traversing relationships and performing graph-based queries.
  • Complex Queries: Enables powerful querying for complex relationship-based scenarios that would be challenging in other database types.
  • Scalability: Can be scaled horizontally, but performance might degrade with very large datasets and complex queries.
  • Examples: Neo4j, Amazon Neptune.

Environmental considerations

When selecting a technology, it’s wise to understand your resources and constraints; this will let you reduce the option space and make decisions easier. Sometimes being locked to an environment can be a good thing. In any case, it often helps to set up a scoring system that compares the alternatives along the following dimensions (a small scoring sketch follows the list):

  • Use case fit: Assess the use cases that the DB is supposed to support.
  • Data Security: Assess security features, encryption, and access control mechanisms.
  • Data Integrity: Evaluate how each database ensures data accuracy and consistency.
  • Performance: Consider read and write performance requirements, latency, and throughput.
  • Cost: Evaluate licensing, hosting, and operational costs for each technology.
  • Development and Maintenance: Factor in ease of development, available libraries, tools, and community support – does your team have experience with those tools?
  • Data Volume and Growth: Project the expected data volume and growth rate to ensure the selected database can handle it.
  • Backup and Recovery: Consider backup, replication, and disaster recovery mechanisms.
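
A minimal sketch of such a scoring system, with weights and scores that are purely illustrative placeholders:

```python
# Weighted scoring of candidate DB technologies against the dimensions above.
# Weights and 1-5 scores are made up; fill in your own assessment.
weights = {
    "use_case_fit": 5, "security": 3, "integrity": 4, "performance": 4,
    "cost": 3, "dev_and_maintenance": 4, "volume_and_growth": 3, "backup_recovery": 2,
}

candidates = {
    "RDBMS": {
        "use_case_fit": 4, "security": 4, "integrity": 5, "performance": 3,
        "cost": 3, "dev_and_maintenance": 5, "volume_and_growth": 2, "backup_recovery": 4,
    },
    "Document DB": {
        "use_case_fit": 4, "security": 3, "integrity": 3, "performance": 4,
        "cost": 3, "dev_and_maintenance": 4, "volume_and_growth": 4, "backup_recovery": 3,
    },
    "Graph DB": {
        "use_case_fit": 3, "security": 3, "integrity": 3, "performance": 3,
        "cost": 2, "dev_and_maintenance": 3, "volume_and_growth": 3, "backup_recovery": 3,
    },
}

def total_score(scores: dict) -> int:
    """Sum of weight * score over all dimensions."""
    return sum(weights[dim] * value for dim, value in scores.items())

for name, scores in sorted(candidates.items(), key=lambda kv: total_score(kv[1]), reverse=True):
    print(f"{name}: {total_score(scores)}")
```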