MongoDB: Step-1: An overview
Purpose & Layout of the 5-part series
When I started with MongoDB, I could not find a good step-by-step guide to get started, understand what MongoDB offers, understand the design considerations when migrating from an RDBMS to NoSQL, and then use MongoDB so as to unleash its full potential. This series is a developer’s view of using MongoDB and understanding its basics.
MongoDB: Step-1: An overview
MongoDB: Step-2: Setting up the server
MongoDB: Step-3: Mongo Client Commands
MongoDB: Step-4: Designing the schema
MongoDB: Step-5: Indexes and Query Performance Optimizations
Step1 — An overview
This part covers the evolution of databases, the data explosion (Big Data), the importance of data and the need to analyze it, Mongo and its history, what Mongo offers, why Mongo, replica sets, sharding, etc.
Evolution of Databases
Before 1990, data was stored in the form of file systems, graphs, etc., where you had to use a 3rd-generation language and tell the system not only what you needed but also how the data was to be retrieved.
In the 1990s, relational databases like Oracle, SQL Server and Informix became very popular, thanks to 4th-generation languages where you only need to state the “what” and not detail how the data is to be retrieved.
For data modeling in RDBMS systems, the normalization of data as laid down by Edgar F. Codd in 1970 became the norm. RDBMS became the go-to databases for all businesses that required the ACID properties of a transaction, namely Atomicity, Consistency, Isolation and Durability.
Around the year 2000, various data-warehousing & analytical tools, called OLAP systems, were developed. These systems focused on analytics of structured data, thereby providing insights into it.
Around the mid-1990s, the internet started becoming extremely popular, and with it came a huge explosion of generated data, much of it highly unstructured. RDBMS systems simply could not keep up with this data explosion. This was just the beginning of Big Data.
The need to inherently support Big Data, characterized by high-velocity, high-volume, high-variety, and high-variability data, gave birth to solutions called NoSQL databases. Today there exists a variety of NoSQL solutions, which can be categorized into 4 types according to the way the data is stored, namely
· Document-based — MongoDB, DynamoDB
· Key-Value — Redis
· Column store — Cassandra
· Graph-based — Neo4j
Data is the new OIL
The figure below shows the exponential growth of data.
Data is an essential resource that powers the information economy in much the way that oil has fuelled the industrial economy.
Just as one who has more oil is wealthier in the industrial economy, in the knowledge economy, one who has more knowledge/information is wealthier.
Information can be extracted from data just as energy can be extracted from oil. Though data flows like oil, we must “drill down” into it to extract its value.
Data is the fuel for AI/ML techniques opening up the options for innovative game-changing opportunities that were never known before. Data promises a plethora of new uses — diagnosis of diseases, the direction of traffic patterns, driverless cars, etc. Improved customer service, better decision making, and better operational efficiency are some of the other advantages.
Characteristics of Big Data
Big Data could be structured, unstructured or semi-structured. Apart from this, it is characterized by the volume, variety, velocity and variability of the data.
The name Big Data itself relates to enormous size. The size of data plays a crucial role in determining the value that can be derived from it, and whether particular data can actually be considered Big Data depends on its volume. Hence, ‘Volume’ is one characteristic that needs to be considered while dealing with Big Data.
The next aspect of Big Data is its variety.
Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only sources of data considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining and analyzing data.
The term ‘velocity’ refers to the speed at which data is generated. How fast data is generated and processed to meet demands determines the real potential of the data.
Big Data Velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, and social media sites, sensors, Mobile devices, etc. The flow of data is massive and continuous.
‘Variability’ refers to the inconsistency that data can show at times, hampering the process of handling and managing the data effectively.
SQL v/s NoSQL
Some key characteristics of traditional SQL/RDBMS systems that differentiate them from NoSQL:
•Rich and fast access to data using an expressive query language and secondary indexes
•Good transaction support and strong consistency
•Built for waterfall models and structured data
•Built for a small set of internal users and not for millions of users
•Mostly for the system of records not for emerging systems of engagement
•Heavy license fees & vertical-only scaling
NoSQL platforms address the BigData scalability and flexibility issues
· Ability to support unstructured data
· Scale storage
· Scale compute power
· Scale to support a large number of users
NoSQL databases achieve the above by relaxing some of the traditional constraints laid down by relational database systems. The need to scale out makes NoSQL solutions distributed by nature, and with distributed systems comes the CAP theorem: you need to pick your trade-off between consistency and availability.
NoSQL databases may or may not support the ACID properties of a transaction.
They may not provide the SQL query language and may instead provide a query language of their own. Some NoSQL databases have made an effort to embrace a restricted SQL for querying.
CAP Theorem was proposed by Eric Brewer (professor of computer science at the University of California, Berkeley, and vice president of infrastructure at Google) in the year 2000.
C, which stands for Consistency (ACID transactions), indicates that every reader will see the most recently persisted (or written) data. So if you write a data set to node-1 and read it from node-2, a system that promises consistency must return the recently written data.
A, which stands for Availability (total redundancy), indicates that every live data node in the cluster can serve queries, ideally the node closest to the data (data locality).
P, which stands for Partition Tolerance (infinite scale-out), indicates that the system continues to operate even when the network connections between nodes in the cluster are broken.
The CAP theorem states that a distributed system can guarantee at most two of these three properties. In the context of Hadoop, for example, it supports Availability (A) and Partition Tolerance (P).
MongoDB & its history
- MongoDB is an open-source NoSQL solution licensed under the Server Side Public License (SSPL)
- MongoDB Inc. (formerly 10gen), founded in 2007, is headquartered in New York City and serves customers worldwide with an expanding team of 1,000+ employees
- As of 2018, MongoDB had been downloaded 40 million times. The last stable release at the time of writing was 4.2.2, on 9-Dec-19
Why MongoDB — Features of MongoDB
1. Humongous — MongoDB has been used to build solutions with clusters having over 1,000 nodes, delivering millions of operations per second on over 100 billion documents and petabytes of data
2. Schema-less
3. Polymorphic data types — the same field can hold a different data type in different documents
4. The big size limit per document (16 MB) lets you keep related data together
5. Optimized for reads, with its rich query language & full-featured index support
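To make the document model concrete, here is a plain Python sketch (not actual MongoDB code; the collection and field names are invented) showing related data embedded in one document and a polymorphic field:

```python
# Two documents from a hypothetical "orders" collection.
# Related line items are embedded rather than joined, and the
# "discount" field is polymorphic: a number in one document,
# a string code in the other.
order_a = {
    "_id": 1,
    "customer": "Alice",
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
    "discount": 0.10,           # numeric discount
}
order_b = {
    "_id": 2,
    "customer": "Bob",
    "items": [{"sku": "C3", "qty": 5}],
    "discount": "SUMMER-SALE",  # same field, string type
}

# All the data needed to render an order is in one document - no join.
print(order_a["items"][0]["sku"])  # -> A1
```

The 16 MB document limit is what makes this embedding style practical for most related data.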
MongoDB is the leading NoSQL solution
Other NoSQL solutions
Cassandra — Developed at Facebook for inbox search; a distributed storage system for storing large amounts of structured data
Redis — Most famous key-value pair store.
HBase — Distributed; modeled on Google’s Bigtable design
Neo4j — Native graph database, storage also as graph nodes
Amazon DynamoDB — Non-relational; supports key-value and document data models
Memcached — Open source, high performance, distributed memory caching system intended to speed up dynamic web applications by reducing the database load
MongoDB versus Cassandra
Cassandra is MongoDB’s nearest competitor in terms of adoption. Below are some high-level comparisons and differences between the two.
Terminology & Concepts — RDBMS to MongoDB
Storage as BSON Documents
- Uses the WiredTiger storage engine from version 3.2 onwards
- Stores the data in document form, in the BSON format
- Binary JSON (BSON) — contains a list of ordered elements, each consisting of a field name, a type and a value
- The max size of a document is 16 MB.
BSON versus JSON
- Supports additional data types such as dates and binary data
- Lets a parser skip irrelevant fields (the main reason it is used in MongoDB); good support for building indexes and searches using BSON keys
- Lightweight, fast & highly traversable
- Carries additional information such as the length of strings, object subtypes, etc.
- Faster encoding and decoding
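To illustrate why BSON carries lengths and type bytes, here is a hand-rolled encoder for a tiny subset of the spec (for illustration only; real drivers ship a full BSON library). The length prefixes are exactly what lets a parser skip fields it does not need:

```python
import struct

def encode_int32(name, value):
    # type 0x10 = int32, then null-terminated field name, then the value
    return b"\x10" + name.encode() + b"\x00" + struct.pack("<i", value)

def encode_string(name, value):
    # type 0x02 = string; the value itself is length-prefixed too
    data = value.encode() + b"\x00"
    return (b"\x02" + name.encode() + b"\x00"
            + struct.pack("<i", len(data)) + data)

def encode_doc(elements):
    body = b"".join(elements) + b"\x00"             # trailing terminator
    return struct.pack("<i", len(body) + 4) + body  # total-length prefix

doc = encode_doc([encode_int32("a", 1)])
print(doc)       # -> b'\x0c\x00\x00\x00\x10a\x00\x01\x00\x00\x00\x00'
print(len(doc))  # -> 12 bytes, as the leading int32 declares
```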
Key Features — Replication & Automatic failover
The Arbiter — The arbiter is lightweight and does not hold data. Its only job is to conduct an election. It should not be hosted in a datacenter where the primary or secondary nodes are hosted. Since it does not participate in any heavy-lifting activities, it can be deployed on a shared server. In more recent versions of MongoDB, the recommendation is to avoid the arbiter altogether, to avoid network-partitioned conditions.
If the replica set loses the arbiter, it still remains available. In case the primary fails over in the absence of the arbiter, the node that has the highest priority initiates the election and will usually win it.
Maximum 7 nodes can be voting members in a replica set. All other nodes are marked with priority 0 and vote 0
Datacenter & Availability — A setup that has a BCP datacenter as an option typically has a replica set consisting of n nodes (x nodes in the Prod datacenter + x nodes in the BCP datacenter + an arbiter in yet another datacenter). The default quorum size is (n/2 + 1). Example: with 3 Prod nodes, 3 BCP nodes and 1 arbiter (n = 7), the quorum is 4, i.e. a minimum of 4 nodes (which can include the arbiter) is needed for the cluster to remain available.
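The quorum arithmetic above can be sketched in a few lines of Python (illustrative only):

```python
def quorum_size(n_voting_members):
    """Majority needed in a replica set of n voting members."""
    return n_voting_members // 2 + 1

def can_elect_primary(reachable, total):
    """A primary can only be elected if a majority of voting members
    is reachable (the arbiter counts as a voting member)."""
    return reachable >= quorum_size(total)

# 3 Prod nodes + 3 BCP nodes + 1 arbiter = 7 voting members
print(quorum_size(7))           # -> 4
print(can_elect_primary(4, 7))  # -> True: cluster stays available
print(can_elect_primary(3, 7))  # -> False: e.g. only one side of a partition
```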
Priority — Typically the hardware in Prod is better/costlier and is co-located with the application servers. So, by choice, DBAs would prefer an automatic fail-over to another node in the PROD datacenter rather than to the BCP datacenter. Prod nodes are therefore given higher priority than BCP nodes, and nodes in the same datacenter are given the same priority. Example: Prod nodes are given priority 5, BCP nodes priority 4.
Primary Node Failure & Resilience — When the primary node goes down, the other members detect the missing heartbeat and, by default, wait 10 secs before the next election is conducted; electing the next primary takes about 2 more secs. So the primary would be unavailable for about 12 secs, during which all writes would fail.
Network Partition — Say the network gets partitioned: 3 nodes are in datacenter1 and 2 in datacenter2, the link between the datacenters is broken, but every node in each partition is working fine. Depending on the quorum settings, datacenter2 may still continue to serve read operations on the premise that the network will be re-established quickly. But it is no longer synched with datacenter1, which hosts the primary and therefore has the latest updates, so data read from datacenter2 is no longer consistent. It is therefore important to plan your deployments carefully based on the application’s requirements.
Production Scenario — Automatic failover
Say the primary node crashes; how would the automatic failover take place?
- When there is a node failure in PROD, the arbiter (or, in its absence, the node with the highest priority) uses the priorities to elect one of the remaining nodes in PROD.
- When the original primary node comes back again, the newly elected primary continues to be primary. This prevents unnecessary frequent flip-overs within the nodes of a given datacenter.
- In case of a full datacenter failure, say the PROD datacenter is unavailable, one of the BCP data nodes, which are next in priority, is elected as primary, provided the quorum is satisfied.
- When BCP has taken over and the PROD datacenter with its higher-priority nodes comes back up, a flip-over gets initiated. This is a desirable flip-over, as we would now like to leverage the high-end capability of PROD and continue business as usual.
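The failover rules above can be mimicked with a toy election function (a deliberate simplification; real MongoDB elections also weigh oplog recency and votes, not just priority):

```python
def elect_primary(nodes, total_voting):
    """Pick a new primary among healthy data-bearing nodes.

    nodes: list of dicts with 'name', 'priority', 'healthy', 'arbiter'.
    Returns the winner's name, or None if no majority is reachable.
    """
    healthy = [n for n in nodes if n["healthy"]]
    if len(healthy) < total_voting // 2 + 1:
        return None  # no quorum: the cluster has no primary
    candidates = [n for n in healthy if not n["arbiter"] and n["priority"] > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda n: n["priority"])["name"]

cluster = [
    {"name": "prod-1", "priority": 5, "healthy": False, "arbiter": False},  # crashed primary
    {"name": "prod-2", "priority": 5, "healthy": True,  "arbiter": False},
    {"name": "bcp-1",  "priority": 4, "healthy": True,  "arbiter": False},
    {"name": "bcp-2",  "priority": 4, "healthy": True,  "arbiter": False},
    {"name": "arb",    "priority": 0, "healthy": True,  "arbiter": True},
]
print(elect_primary(cluster, total_voting=5))  # -> prod-2 (highest priority wins)
```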
Key Features — Sharding — Horizontal Scaling — Automatic load balancing
Sharding is Mongo’s answer to the requirement to scale out for compute (CPU), memory (the working set in RAM) or storage.
Mongod config servers — These hold the information about which shard-key range lives in which chunk, and therefore on which shard. The config servers are not exposed to the application servers, so their deployment can be changed at any time. The config server is itself a replica set of exactly 3 nodes (not less, not more).
mongos — The routing server. This is what is exposed to the application servers: the application layer connects via the driver to a comma-separated list of mongos servers instead of a comma-separated list of mongod servers. In a sharded cluster, the application never talks directly to a mongod server.
Not all collections in a database need to be sharded. Collections that are not sharded are stored on the primary shard; each database has one replica set configured as its primary shard.
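The routing role of the config metadata can be sketched as a lookup over chunk ranges (purely illustrative; the chunk boundaries and shard names are made up):

```python
import bisect

# Hypothetical chunk map: each chunk owns the shard-key values in
# [lower_bound, next_lower_bound) and lives on exactly one shard.
chunk_lower_bounds = [0, 1000, 5000]   # sorted chunk lower bounds
chunk_shards = ["shardA", "shardB", "shardC"]

def route(shard_key_value):
    """Return the shard that owns this shard-key value."""
    i = bisect.bisect_right(chunk_lower_bounds, shard_key_value) - 1
    return chunk_shards[i]

print(route(42))    # -> shardA
print(route(1000))  # -> shardB
print(route(9999))  # -> shardC
```

In a real cluster, mongos performs this lookup using metadata cached from the config servers.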
Sharding — Pitfalls
- Once defined, the shard-key of a collection cannot be changed
- Once a collection is sharded, it cannot be un-sharded
- The shard-key value of a document is immutable (from version 4.0 onwards, it can be updated)
Because of the above factors, which shard-key to use is an important design consideration, explained in Step 4 of this series. An incorrect choice of shard-key can lead to bottlenecks. Which shard-key works depends entirely on your solution.
For each query, there is an extra overhead of consulting the config servers.
The ability to scale horizontally comes at a cost. At a minimum, you need:
- One replica set for the config servers (3 nodes)
- mongos servers for routing (2 or more)
- Multiple shards, each being a replica set (with just a 3-node replica set per shard, that is 3 extra nodes per shard at a minimum)
When to scale-out
Due to the limitations of the shard-key and the cost involved, scale out only if there is a business need to do so. Scaling up instead may sometimes be more cost-effective if you are just on the borderline and don’t foresee exponential growth.
When is it too late to scale-out
- When the working set has already filled the RAM to the brim
- When the CPU is already working above the threshold
- When the storage limitation thresholds have been crossed.
It is important to monitor your cluster so that you can see the above coming and plan your scaling requirements before it is late.
Key Features — Adhoc Queries
MongoDB supports field, range query, and regular-expression searches.
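The flavour of these query documents can be shown with a tiny matcher, a toy re-implementation in Python (not MongoDB’s actual engine) covering equality, a $gt range and $regex:

```python
import re

def matches(doc, query):
    """Evaluate a tiny subset of MongoDB-style query documents."""
    for field, cond in query.items():
        value = doc.get(field)
        if isinstance(cond, dict):          # operator form, e.g. {"$gt": 40}
            for op, arg in cond.items():
                if op == "$gt" and not (value is not None and value > arg):
                    return False
                if op == "$regex" and not (isinstance(value, str)
                                           and re.search(arg, value)):
                    return False
        elif value != cond:                 # plain equality match
            return False
    return True

people = [{"name": "Ada", "age": 36}, {"name": "Alan", "age": 41}]
print([p["name"] for p in people if matches(p, {"age": {"$gt": 40}})])       # -> ['Alan']
print([p["name"] for p in people if matches(p, {"name": {"$regex": "^A"}})]) # -> ['Ada', 'Alan']
```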
Key Features — Indexes
Full-featured index support: fields in a MongoDB document can be indexed with primary and secondary indices.
Key Features — Aggregation
MongoDB provides three ways to perform aggregation: the aggregation pipeline, the map-reduce function, and single-purpose aggregation methods.
Map-reduce can be used for batch processing of data and aggregation operations. But according to MongoDB’s documentation, the Aggregation Pipeline provides better performance for most aggregation operations.
The aggregation framework enables users to obtain the kind of results for which the SQL GROUP BY clause is used. Aggregation operators can be strung together to form a pipeline — analogous to Unix pipes. The aggregation framework includes the $lookup operator which can join documents from multiple collections, as well as statistical operators such as standard deviation.
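The GROUP BY analogy can be made concrete by evaluating a $match + $group pipeline by hand (a toy Python equivalent, not actual MongoDB code):

```python
from collections import defaultdict

orders = [
    {"city": "NYC", "total": 100, "status": "paid"},
    {"city": "NYC", "total": 250, "status": "paid"},
    {"city": "SF",  "total": 80,  "status": "paid"},
    {"city": "SF",  "total": 999, "status": "refunded"},
]

# The MongoDB pipeline this mimics would look like:
#   [{"$match": {"status": "paid"}},
#    {"$group": {"_id": "$city", "revenue": {"$sum": "$total"}}}]
matched = [o for o in orders if o["status"] == "paid"]  # $match stage
groups = defaultdict(int)
for o in matched:                                       # $group / $sum stage
    groups[o["city"]] += o["total"]

print(dict(groups))  # -> {'NYC': 350, 'SF': 80}
```

Each stage consumes the output of the previous one, which is exactly the Unix-pipe analogy.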
Key Features — Capped Collections
MongoDB supports fixed-size collections called capped collections. This type of collection maintains insertion order and, once the specified size has been reached, behaves like a circular queue.
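The circular-queue behaviour can be modelled with a bounded deque (a Python analogy, not MongoDB internals):

```python
from collections import deque

# A capped "collection" holding at most 3 documents: once full, each
# insert silently evicts the oldest document, preserving insertion
# order among the surviving documents.
capped = deque(maxlen=3)
for i in range(5):
    capped.append({"_id": i})

print(list(capped))  # -> [{'_id': 2}, {'_id': 3}, {'_id': 4}]
```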
Key Features — File Storage — GridFS
MongoDB can be used as a file system, called GridFS, with load balancing and data replication features over multiple machines for storing files.
This function, called a grid file system, is included with MongoDB drivers. MongoDB exposes functions for file manipulation and content to developers. GridFS divides a file into parts, or chunks, and stores each of those chunks as a separate document.
Files of any size can be stored without complicating your stack, and since the data is stored in Mongo, it comes with all the data replication that Mongo offers out of the box.
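The chunking and client-side reassembly described above can be sketched as follows (an illustration of the idea only; real GridFS uses a 255 kB default chunk size and stores chunks as documents in a dedicated chunks collection):

```python
CHUNK_SIZE = 4  # tiny, for illustration; GridFS defaults to 255 kB

def split_into_chunks(data: bytes, file_id: str):
    """Store each chunk as a separate 'document' with its position."""
    return [{"files_id": file_id, "n": i, "data": data[o:o + CHUNK_SIZE]}
            for i, o in enumerate(range(0, len(data), CHUNK_SIZE))]

def reassemble(chunks):
    """Client-side reassembly: sort by chunk number, then concatenate."""
    return b"".join(c["data"] for c in sorted(chunks, key=lambda c: c["n"]))

chunks = split_into_chunks(b"hello gridfs", "f1")
print(len(chunks))         # -> 3 chunks of 4 bytes each
print(reassemble(chunks))  # -> b'hello gridfs'
```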
While the concept looks good on paper, you should evaluate whether GridFS is a good fit for your application, given the pitfalls.
GridFS — Pitfalls
The chunking of the file happens in the client driver and not on the mongo server, so during updates do not expect the update of a single file to be an atomic transaction. Also, there is no versioning concept in GridFS.
The single big file is stored in chunks, so you can seek directly to a specific section of the file. But the reassembly of the file happens on the client-driver end and not on the server, which means considerable load on the client hardware where the file has to be re-assembled.
Also, since the assembling happens in the client driver, GridFS should be used only when a file is not concurrently written and read around the same time. Otherwise, by the time the file is re-assembled, a portion of it could have been updated on the server side, defeating the “consistency” aspect of MongoDB.
Key Features — Transactions
Support for multi-document ACID transactions was added to MongoDB with the 4.0 release.
MongoDB has established a space for itself in the NoSQL world, with more and more applications moving to it. It is therefore important to understand the various design considerations when adopting MongoDB, to unleash its full potential without falling into its pitfalls.