Book notes

Designing Data-Intensive Applications: Ch2. – Data models and query languages

As with in-memory data structures, there are many ways to represent and query persistent data. The main models for representing persistent data are relational, document-oriented, graph, and hierarchical, although that last one seems to be all but dead because of the onerous burden it places on the programmer to maintain and query the data. Relational is still king after 30 or so years, while the document-oriented and graph models are on the rise and shine in domains with specific modeling requirements.

The models are in some ways a response to the question of how best to map program and domain data into a persistent and queryable form. Ad-hoc methods don’t scale because each project reinvents the wheel, so the consolidation around a few main models makes sense. It also helps that each one comes with a query language that makes it easy to query the persisted data. Technically, anything that doesn’t fall under the relational model is considered NoSQL, but that’s pretty terrible nomenclature. It is better to be specific and explicitly name the dominant data model.

Each model is good at representing certain types of relationships that the others handle less well. Although it is possible to express hierarchical, graph, and document-centric queries in SQL with recursive queries, it is probably better to migrate to another model if those make up the bulk of the queries. There will always be an impedance mismatch between the domain and data models, and anything that reduces the mismatch is probably the right thing to do in the long run.
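To make the recursive SQL point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `location` table, its rows, and the `ancestors` helper are all made up for illustration; the only real feature being shown is the `WITH RECURSIVE` common table expression, which walks a parent pointer until it runs out.

```python
import sqlite3

# Hypothetical hierarchy stored relationally: each row points at its parent.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE location (
        id        INTEGER PRIMARY KEY,
        name      TEXT NOT NULL,
        parent_id INTEGER REFERENCES location(id)
    );
    INSERT INTO location VALUES
        (1, 'North America', NULL),
        (2, 'United States', 1),
        (3, 'Idaho', 2),
        (4, 'Europe', NULL),
        (5, 'England', 4);
""")

def ancestors(conn, name):
    """Return the enclosing locations of `name`, nearest first."""
    rows = conn.execute(
        """
        WITH RECURSIVE chain(id, name, parent_id, depth) AS (
            -- Base case: the starting row.
            SELECT id, name, parent_id, 0 FROM location WHERE name = ?
            UNION ALL
            -- Recursive case: step from each row to its parent.
            SELECT l.id, l.name, l.parent_id, c.depth + 1
            FROM location AS l
            JOIN chain AS c ON l.id = c.parent_id
        )
        SELECT name FROM chain WHERE depth > 0 ORDER BY depth
        """,
        (name,),
    ).fetchall()
    return [row[0] for row in rows]

print(ancestors(conn, "Idaho"))  # ['United States', 'North America']
```

It works, but the traversal logic is noticeably more verbose than the one-line path pattern a graph query language would need, which is roughly the argument for switching models when this kind of query dominates.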

The various models also have historical baggage that comes along with them. The NoSQL folks are in some sense repeating history by de-normalizing the data and throwing away schema validation in the name of flexibility. Although not having a schema might seem good, in the end someone must take on the responsibility of enforcing it, and when the database does not, the application ends up doing so. Many SQL databases like Postgres also support JSON and allow indexing and querying it like other SQL data, which kinda makes the distinction between schemaless and schema-based databases moot.
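A rough sketch of what "the application ends up enforcing it" looks like in practice. The field names and the `validate` helper are hypothetical, not taken from the book or from any particular document database; the point is only that the checks a schema would have done at write time now live in application code at read time.

```python
import json

# Hypothetical implicit schema for a "user" document.
REQUIRED_FIELDS = {"user_id": int, "name": str}
OPTIONAL_FIELDS = {"email": str}

def validate(doc):
    """Return a list of problems; an empty list means the document is usable."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in doc:
            problems.append(f"missing required field: {field}")
        elif not isinstance(doc[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    for field, expected in OPTIONAL_FIELDS.items():
        if field in doc and not isinstance(doc[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    return problems

# Documents as they might come back from a schemaless store.
good = json.loads('{"user_id": 251, "name": "Ada", "email": "ada@example.com"}')
bad = json.loads('{"user_id": "251", "email": 42}')

print(validate(good))  # []
print(validate(bad))
```

Every code path that reads these documents needs something like this, or it has to trust that every writer got it right.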

The query languages for each model also make different trade-offs and fall somewhere on a spectrum between declarative and imperative. Although SQL is considered declarative, there are many vendor extensions that make it less so by introducing imperative and side-effecting operations. The closer one stays to the declarative model, the easier it is for the query optimizer to provide fast and good results (when the indices are tuned properly, of course). It is worth mentioning that CSS and XSL are technically also declarative query languages, because CSS selectors and the XPath expressions used in XSL both pick out elements from a document by pattern rather than by explicit traversal.
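A small illustration of the two ends of the spectrum, loosely modeled on the book's sharks example. The data and function names here are invented; the declarative half uses Python's built-in sqlite3 module.

```python
import sqlite3

# Made-up sample data: (name, family).
animals = [
    ("Great white", "Sharks"),
    ("Lion", "Felidae"),
    ("Hammerhead", "Sharks"),
]

def sharks_imperative(rows):
    """Spell out exactly how to walk the data and what to collect."""
    result = []
    for name, family in rows:
        if family == "Sharks":
            result.append(name)
    return result

def sharks_declarative(rows):
    """State only what is wanted; the engine decides how to find it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE animals (name TEXT, family TEXT)")
    conn.executemany("INSERT INTO animals VALUES (?, ?)", rows)
    found = conn.execute(
        "SELECT name FROM animals WHERE family = 'Sharks' ORDER BY name"
    ).fetchall()
    conn.close()
    return [row[0] for row in found]

print(sharks_imperative(animals))   # ['Great white', 'Hammerhead']
print(sharks_declarative(animals))  # ['Great white', 'Hammerhead']
```

The loop hard-codes a full scan in a fixed order. The SELECT says nothing about how the rows are found, so adding an index on `family` could change the execution strategy without touching the query.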

Many modern workloads suit the graph model, but I don’t know how much of this is because of the popularity of social networks. Social networks naturally fit the graph model, and many of them, like Facebook and Twitter, are advancing the capabilities of NoSQL databases to better fit their workloads. Graph databases break up into sub-models based on how the data is represented, either as property graphs or as triple-stores, but that distinction does not seem warranted to me. It’s more of an implementation detail, and I don’t even know if Datalog technically qualifies as a graph query language, which is how the book classifies it. I’d consider Datalog relational more than anything else. Out of all the query languages I think Datalog is the best one because it is basically a restricted form of Prolog: rules are built up from facts and from other rules, so complex queries can be composed piece by piece. I think this model has much untapped potential.
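To show why Datalog feels relational to me, here is a toy sketch in Python of a triple-store plus one recursive Datalog-style rule, evaluated naively until nothing new can be derived. The facts and the function names are invented for the example, and this is nowhere near a real Datalog engine; it only mimics the shape of the rules given in the docstring.

```python
# Made-up facts as (subject, predicate, object) triples.
facts = {
    ("idaho", "within", "usa"),
    ("usa", "within", "north_america"),
    ("lucy", "born_in", "idaho"),
}

def within_recursive(facts):
    """Naive bottom-up evaluation of the rules:

        within_recursive(X, Y) :- within(X, Y).
        within_recursive(X, Z) :- within(X, Y), within_recursive(Y, Z).
    """
    direct = {(s, o) for s, p, o in facts if p == "within"}
    derived = set(direct)
    while True:
        # Join the direct facts against everything derived so far.
        new = {
            (x, z)
            for (x, y) in direct
            for (y2, z) in derived
            if y == y2
        }
        if new <= derived:  # Fixpoint: no new pairs were produced.
            return derived
        derived |= new

def born_within(facts, place):
    """Who was born in `place` or in anything contained in it?"""
    closure = within_recursive(facts)
    return sorted(
        s
        for s, p, o in facts
        if p == "born_in" and (o == place or (o, place) in closure)
    )

print(born_within(facts, "north_america"))  # ['lucy']
```

The rule evaluation is just a join repeated to a fixpoint over a set of tuples, which is why the line between "graph" and "relational" looks blurry here.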

There is also RDF and the rest of the semantic web stack, but those initiatives, although well meaning, seem to be dead. Unfortunate, because making the web semantic sounds kinda cool. In another universe maybe there is no Google and Facebook because people were wise and implemented the semantic web. RDF is queried with SPARQL, which is suspiciously like Cypher, the query language for Neo4j (vice-versa actually, because SPARQL came first and Cypher borrowed its pattern matching from it).