Slightly biased: PostgreSQL > MongoDB
By Robert Yackobian.
A technology is always secondary to business requirements, and the choice of a database engine is no exception. Such a choice has significant long-term consequences and should be made after serious deliberation.
This article strives to help in that decision-making process by laying out the main features of PostgreSQL vs MongoDB, hopefully helping anyone struggling to make that choice. The article is based on years of personal experience with both engines.
Disclaimer: There was no undue influence whatsoever from anyone at PostgreSQL or MongoDB; these are all my opinions and my personal biases.
Here we will specifically cover:
- ACID vs Base
- Data types
- JSON support
- OLTP Performance
- OLAP Performance
- Database design (ERD)
- SQL support
- Database limitations
ACID vs BASE
First of all, PostgreSQL is ACID, while MongoDB is BASE.
Simply speaking, ACID is about the integrity of database transactions, while BASE provides better performance in some scenarios at the expense of not enforcing transaction integrity.
While MongoDB is making progress to become “more ACID,” it is still behind PostgreSQL on that matter. For example, MongoDB now has transactions, but creating/dropping a collection is not part of the transaction (according to my tests, as there is no reference to that in the MongoDB documentation).
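For reference, here is a minimal sketch of what a MongoDB multi-document transaction looks like in mongosh (the collection and field names are illustrative, and transactions require a replica set or sharded cluster):
const session = db.getMongo().startSession();
session.startTransaction();
try {
    const emp = session.getDatabase("test").getCollection("emp");
    emp.insertOne({ name: "Alice", manager: "YES" });  // part of the transaction
    emp.deleteOne({ name: "Bob" });                    // also part of the transaction
    session.commitTransaction();                       // both changes become visible together
} catch (e) {
    session.abortTransaction();                        // any failure undoes both
}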
With PostgreSQL you’ve got this:
BEGIN;
CREATE TABLE managers AS SELECT * FROM emp WHERE manager = 'YES';
DELETE FROM emp WHERE manager = 'YES';
COMMIT;
The CREATE TABLE managers command is still part of the transaction. If the create statement fails, the entire transaction can be rolled back.
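And the rollback works the other way around too; a sketch using the same hypothetical emp table:
BEGIN;
CREATE TABLE managers AS SELECT * FROM emp WHERE manager = 'YES';
DELETE FROM emp WHERE manager = 'YES';
-- something unexpected happened, so undo everything, DDL included:
ROLLBACK;  -- managers is gone and emp is untouched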
In fact, PostgreSQL is even more ACID than Oracle, where a DDL command such as creating a table automatically commits the open transaction, which can silently commit changes you intended to roll back.
Even though MongoDB does support multi-document transactions, the documentation cautions to limit each transaction to 1,000 documents, which makes MongoDB multi-document transactions useless for way too many scenarios, like loading a large collection.
ACID databases are preferred for domains like fintech and healthcare where data integrity is paramount. BASE databases are preferred in domains where speed is more important than integrity, such as the advertising industry, fraud detection, prepaid phones, and monitoring systems.
For example, fraud detection software attempts to identify fraudulent transactions by running many rules/checks on each incoming card transaction, over a high volume of incoming data. The more checks, the higher the probability of detecting a fraud attempt. When deciding which database engine to use for such a use case, you will get more (mostly accurate) queries with MongoDB vs fewer (absolutely accurate) queries with PostgreSQL.
Being slightly biased, unless speed is an overwhelming requirement, I prefer a general-purpose ACID database such as PostgreSQL.
Data Types
In a sense, MongoDB is JSON-based: data is received, stored and returned as a JSON document. Indeed, MongoDB has other data types, but these are more of a way to tell MongoDB: “Save this date inside the JSON doc as a date.” MongoDB is schemaless, so there is no way to define a “table” with an INT or DATE column. From my point of view this is less than ideal.
On the other hand, PostgreSQL fully supports all the basic data types like int, numeric, text, date and bytea, plus some more advanced ones, including JSON. For example:
- chkpass – a PostgreSQL data type designed for storing encrypted passwords. Each password is automatically converted to encrypted form upon entry, and is always stored and fetched encrypted. A typical query would be:
select 'user pass is OK' from myusers where stored_pass = '1234';
This checks whether the user supplied the correct password. On the other hand, running the next query is useless, as Postgres will not show the actual content of the stored_pass column:
select stored_pass from myusers;
- inet – another Postgres data type, which stores an IP address and validates the format of the received value. And it’s not a half-baked feature: both inet and cidr are supported, enabling queries such as:
SELECT '192.168.1.19'::inet << '192.168.1.0/24'::cidr;
MongoDB is completely stuck with JSON, while PostgreSQL always has options. JSON is popular today, but so was XML once. You can never know when a complex data type will go out of fashion, so why entrap yourself with one?
More about JSON
MongoDB, being JSON-native, delivers many more JSON-related options than PostgreSQL.
The PostgreSQL way of updating a JSON field is simple: column = new_value, plus a handful of functions such as jsonb_set and jsonb_insert. The MongoDB arsenal is much more extensive; its list of update operators and modifiers includes $currentDate, $inc, $min, $max, $mul, $rename, $set, $setOnInsert, $unset, $addToSet, $pop, $pull, $push, $pullAll, $each, $position and more.
When you run a simple update in MongoDB, a call will look like:
db.toys.updateOne({ toy_name: "Buzz Lightyear" }, { $set: { "can_do": ["fly"] } })
Here we supply two parameters in JSON format. The first parameter, { toy_name: "Buzz Lightyear" }, is a JSON document that locates the data that needs to be updated. The second parameter, { $set: { "can_do": ["fly"] } }, is also a JSON document; it instructs MongoDB to set the attribute "can_do" to the array ["fly"] (an array, so more abilities can be pushed onto it later).
But what if “Buzz Lightyear” is now able to talk and dance?
MongoDB:
db.toys.updateOne(
    { toy_name: "Buzz Lightyear" },
    { $push: { "can_do": "talk" } })
db.toys.updateOne(
    { toy_name: "Buzz Lightyear" },
    { $push: { "can_do": "dance" } })
But (unfortunately) he has lost his ability to fly, so we $pull that value out of the array:
db.toys.updateOne(
    { toy_name: "Buzz Lightyear" },
    { $pull: { "can_do": "fly" } })
PostgreSQL:
Assuming the following JSON data structure represents our target field:
{"name": "Buzz Lightyear", "can do": ["run", "dance"]}
Inserting a New Value into the JSON Field:
update toys
set json_col = jsonb_insert(json_col, '{can do, -1}', '"fly"'::jsonb, true)
where toy_name = 'Buzz Lightyear';
Updating an Existing Value in the JSON Field:
update toys
set json_col = jsonb_set(json_col, '{can do, 0}', '"run"'::jsonb)
where toy_name = 'Buzz Lightyear';
Removing a value from the JSON field is a bit tricky, as there is no direct equivalent of the MongoDB $pull operator:
update toys
set json_col = jsonb_set(json_col, '{can do}', (
    -- rebuild the array, keeping every element except "fly"
    select jsonb_agg(value)
    from jsonb_array_elements(json_col->'can do') as value
    where value <> '"fly"'
))
where toy_name = 'Buzz Lightyear';
Updating a JSON doc is just one example showing that Mongo is overall more fluent with JSON. There is also MongoDB’s built-in JSON validator, which is much more convenient than the PostgreSQL JSON validator (validate_json_schema). And MongoDB’s integration with Node.js is truly seamless compared to the ORM tools available for PostgreSQL.
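As a minimal sketch of that built-in validator (mongosh; the schema fields are illustrative), documents that violate the schema are rejected on insert and update:
db.createCollection("toys", {
    validator: {
        $jsonSchema: {
            bsonType: "object",
            required: ["toy_name"],
            properties: {
                toy_name: { bsonType: "string" },  // must exist and be a string
                can_do:   { bsonType: "array" }    // abilities must be an array
            }
        }
    }
})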
No contest here, MongoDB is much better for working with JSON than PostgreSQL.
OLTP Performance
Most systems you deal with daily (an ATM, paying your bills, buying a movie ticket) are OLTP systems. OLTP applications typically use a primary key or a regular index to fetch a small set of rows, while inserts and updates touch a small number of rows.
Such applications have a high volume of small transactions where the data accuracy we already mentioned above is critical, and fast data access for updates is very important.
MongoDB has great indexing features, and data can be fetched and updated quickly. PostgreSQL is similar in that regard, but it also has ACID (already mentioned above) and can deliver far superior data accuracy.
OLAP Performance
DWH/DSS/big data systems involve large inserts and updates (data loading) in parallel with heavy queries over large amounts of data.
The PostgreSQL ecosystem provides several features/tools to manage DWH/DSS/big data systems:
- Data sharding (such as Citus) – splitting data across different machines in read and write mode.
- Standby instances can be queried.
- Materialized views for refreshing table data from base/staging tables.
- Foreign Data Wrappers (FDW) allow querying remote databases (not necessarily a PostgreSQL database) for easy data integration.
- Table partitioning – allows you to effectively split your data across disks and even different machines (when used in conjunction with FDWs), achieving an effective poor man’s database sharding. A minimal sketch follows this list.
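As a sketch of declarative range partitioning (table and column names are illustrative):
CREATE TABLE measurements (
    city_id  int  NOT NULL,
    logdate  date NOT NULL,
    peaktemp int
) PARTITION BY RANGE (logdate);

-- each partition is a separate table and can live in its own tablespace (its own disk)
CREATE TABLE measurements_2024 PARTITION OF measurements
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
CREATE TABLE measurements_2025 PARTITION OF measurements
    FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');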
MongoDB also has features that are handy for DWH/DSS/big data systems:
- Data sharding (simply called a “sharded cluster”).
- TTL collections – deleting old data without extra coding.
- Capped collections that can grow up to a predefined size.
- Replicas can be queried.
- Read preferences – an automatic, rule-driven system for deciding which node should be queried and when, effectively moving the query load from the primary to the secondaries. This great feature is unmatched in PostgreSQL and sadly is rarely used. A minimal sketch of TTL indexes and read preferences follows this list.
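A sketch of those two features in mongosh (collection and field names are illustrative):
// TTL index: documents are removed automatically ~3600 seconds after their createdAt value
db.events.createIndex({ createdAt: 1 }, { expireAfterSeconds: 3600 })

// read preference: route this query to a secondary when one is available
db.events.find({ type: "login" }).readPref("secondaryPreferred")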
MongoDB is great for dynamic queries of frequently written or read data. That’s because it is designed to work with data of all different types that are changing all the time, without requiring complex transactions between objects. You’re going to get good performance, even when running ad hoc queries on small subsets of fields contained in documents with loads of fields.
On the other hand, Postgres delivers unmatched features such as foreign data wrappers, great partitioning options, effective data sharding (which does not consume as many resources as a Mongo sharded cluster), and parallel query operations.
But above all, PostgreSQL is SQL while MongoDB is not. I know there are SQL bridges for MongoDB (discussed later here), but they are hardly as effective and full-featured as PostgreSQL’s full ANSI compliance. The ability to use any JDBC tool with Postgres is a winner.
Database design (ERD)
PostgreSQL is an SQL compliant relational database with JSON functionality. MongoDB is not relational, and not SQL compliant, and is also schemaless.
MongoDB is bound to JSON while in PostgreSQL you can choose to use regular column data types and JSON in the same table. This “hybrid” nature of PostgreSQL is a great advantage.
For example, consider a table storing data about products. Every product has fixed columns such as ID, name and category, but also many other bits of information that are unique to a specific product, for example how much RAM a laptop has (if the product is an orange, the amount of RAM is simply not relevant). The table structure in PostgreSQL can be:
- ID as integer
- Name as text
- Category as integer
- extended_data as jsonb
The extended_data column will include all the pieces of information that are relevant to this product only. That way you gain schemaless flexibility while still having a database design. The product fields ID, name and category are not expected to ever change; there is no scenario in which a product will not have these 3 fields, and the JSON field is still the place to add new attributes when needed. Having fixed columns makes reading the schema much easier, as the column name implies its content. Fixed columns also enable foreign keys for better data validation and speedy table joins.
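A minimal sketch of such a hybrid table (names and values are illustrative):
CREATE TABLE products (
    id            integer PRIMARY KEY,
    name          text    NOT NULL,
    category      integer NOT NULL,
    extended_data jsonb
);

-- an orange needs no RAM attribute; a laptop carries it in the JSON column
INSERT INTO products VALUES (1, 'Orange', 10, NULL);
INSERT INTO products VALUES (2, 'Laptop', 20, '{"ram_gb": 16}');

-- and the JSON column is still fully queryable:
SELECT name FROM products WHERE (extended_data->>'ram_gb')::int >= 16;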
In MongoDB there is no “table” structure, as the database is schemaless, but the document will contain the same “columns” (in JSON these are called attributes).
Being schemaless brings a lot of advantages, such as quick development and deployment, but one big disadvantage: logical data corruptions, which are also hard to find. For example, consider a MongoDB collection as above, already containing lots of data under the attribute name “extended_data”. Now a novice programmer mistakenly “creates” a new attribute “extendedData”, simply by:
db.products.insertOne({ "id": 1, "name": "Buzz Lightyear", "category": "toys", "extendedData": "It can fly" })
This data is lost to all other queries, as they are looking for “extended_data” rather than “extendedData”.
Since MongoDB has no foreign keys, it is quite common to break the normalization rules and include the parent data in the child collection. This is done to improve performance, but at the cost of having the same data in many places.
Having a normalized schema design saves space (each piece of information is saved once), improves performance (you can write effective queries that scan smaller tables) and improves the reliability of the data (there is a single source of truth for each piece of data). So with PostgreSQL you can better design, document and manage the schema structure while getting better performance and reliability.
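A minimal sketch of such a normalized design (illustrative schema):
CREATE TABLE customers (
    id   int  PRIMARY KEY,
    name text NOT NULL
);
CREATE TABLE orders (
    id          int PRIMARY KEY,
    customer_id int NOT NULL REFERENCES customers(id),  -- single source of truth
    total       numeric
);

-- the customer's name is stored once and joined in when needed
SELECT c.name, sum(o.total)
FROM customers c
JOIN orders o ON o.customer_id = c.id
GROUP BY c.name;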
SQL support
SQL is presently by far the most popular query language in the world.
PostgreSQL is fully ANSI SQL compliant. This means you can use any SQL client tool to query the PostgreSQL data.
MongoDB is not SQL compliant, but you can use specialized tools on top of MongoDB to query the data using SQL commands. Tools such as Studio 3T deliver some representation of collections as tables and JSON attributes as columns, and allow basic SQL queries. This is limited (not all SQL commands are supported) and hard to set up and maintain (you need to set up a MongoDB-to-SQL bridge). Still, all the basic SQL commands such as DISTINCT, AVG, SUM, MAX, MIN, ORDER BY and GROUP BY are supported. More advanced features such as CTEs (common table expressions) and window functions are not.
You can only query the database using select statements; updates, deletes or inserts are not possible through SQL bridges.
Don’t count on these tools as the infrastructure for your data engineering system.
Tools such as Tableau/Power BI also deliver some capabilities for working with MongoDB, but these are nowhere close to the functionality they offer for an ANSI SQL database such as PostgreSQL.
Bottom line: when choosing between MongoDB and PostgreSQL for more than basic reporting, choose PostgreSQL.
Database Limitations
Obviously each database has its limitations regarding the max number of clients, the max size of each data type and so on, so there is no point comparing all the different limitations of each database. Two big shortcomings of MongoDB, however, deserve a mention:
- Max connections can easily be exceeded. MongoDB does not deliver a built-in or server-side connection pool (comparable to pgbouncer) and relies on connection pooling from the driver. Postgres delivers pgbouncer (and others, such as Pgpool-II), an excellent server-side connection pooling solution, so it is much easier to exhaust the allowed connections in Mongo than in Postgres. A minimal pgbouncer sketch follows this list.
- Max document size. In PostgreSQL the max jsonb size is 1GB, while in MongoDB it is only 16MB: a single MongoDB document cannot exceed 16MB, and in some cases you will be forced to split a single JSON document into several documents. This is not an extreme scenario. For example, assume your developer planned a collection named customer_activity which stores (obviously) the customer’s activity. In most cases a smart developer will create a new document with the current timestamp for each new customer activity, but there is also the option of placing the entire activity of a single customer under one document (using the $push operator). Such a document holds all of the customer’s activity and will become huge over time. Unfortunately, more developers than you might think choose this option, leading to huge documents in a matter of weeks/months; a sketch of both patterns follows this list. In that regard, PostgreSQL is often more forgiving.
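To make the pooling point concrete, here is a minimal pgbouncer.ini sketch (all values are illustrative):
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_addr = *
listen_port = 6432
pool_mode = transaction
max_client_conn = 2000
default_pool_size = 20

And a sketch of the document-growth anti-pattern next to the safer one-document-per-event design (mongosh; names are illustrative):
// anti-pattern: one ever-growing document per customer, capped at 16MB
db.customer_activity.updateOne(
    { customer_id: 42 },
    { $push: { events: { ts: new Date(), action: "login" } } },
    { upsert: true })

// safer: one document per event, with no practical growth limit
db.customer_activity.insertOne({ customer_id: 42, ts: new Date(), action: "login" })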
To be continued with:
- Extensions
- Indexing
- Backup and recovery
- Practical max size per node
- Sharding
- Monitoring
- Connection pooling
- Proxies
- Development and administration tools
=========================
Robert Yackobian sends regards.
robert.yackobian@senticore.com