Sharding is MongoDB's answer to scaling. It is a way of distributing both work and data: more work over more data. It becomes necessary for truly big-data problems. So it is important to understand
- What is sharding?
- How is it done?
- What are the pitfalls, and how do you design your data-distribution strategy?
- When should it be done?
- When is it too late?
Horizontal scaling (scaling-out)
MongoDB achieves the scaling requirement by scaling out, i.e. by adding more nodes (machines). This is also called horizontal scaling. Horizontal scaling is usually cheaper than vertical scaling: with vertical scaling you are forced to buy increasingly expensive hardware, whereas with horizontal scaling you can add multiple relatively low-cost machines.
MongoDB does horizontal partitioning of data, in contrast to the vertical partitioning more common in RDBMS setups.
Shard — A partition in MongoDB is called a shard. Each shard contains a subset of the collection's total data, and each document resides entirely in one shard. A shard is a replica set that holds the data.
Config server — The config server holds the cluster's configuration, i.e. the mapping of which documents belong to which shard. It is itself a mongod service, deployed as a 3-node replica set.
Chunk — Maintaining one mapping entry per key would be unwieldy, so the config server instead manages chunks. A chunk is a set of documents, and each chunk belongs to exactly one shard. MongoDB partitions sharded data into chunks, each with an inclusive lower and exclusive upper bound based on the shard key.
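To make the chunk-to-shard mapping concrete, here is a toy Python sketch (the field name, ranges, and shard names are invented) of how a chunk table with inclusive lower and exclusive upper bounds routes a shard-key value to a shard:

```python
# Hypothetical chunk table for a collection sharded on a numeric "userId".
# Each chunk covers [lower, upper): inclusive lower, exclusive upper bound.
chunks = [
    {"lower": float("-inf"), "upper": 1000, "shard": "shardA"},
    {"lower": 1000, "upper": 5000, "shard": "shardB"},
    {"lower": 5000, "upper": float("inf"), "shard": "shardC"},
]

def find_shard(shard_key_value):
    """Return the shard owning the chunk that covers the given key value."""
    for chunk in chunks:
        if chunk["lower"] <= shard_key_value < chunk["upper"]:
            return chunk["shard"]
    raise KeyError("no chunk covers this key")

print(find_shard(250))   # shardA
print(find_shard(1000))  # lower bound is inclusive -> shardB
print(find_shard(4999))  # upper bound is exclusive -> shardB
```

This is only the lookup idea; the real chunk metadata lives on the config server and is cached by the query routers.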
Auto-sharding — MongoDB automatically splits data into chunks and adjusts their ranges as the distribution of data changes; this is called auto-sharding.
Load balancing/Chunk migration — MongoDB maintains an even distribution of data across shards by migrating chunks between them, so that every shard shares the workload.
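A minimal sketch of the balancing idea, not MongoDB's actual balancer algorithm: repeatedly migrate one chunk from the most-loaded shard to the least-loaded one until the chunk counts are even (shard names and counts are made up):

```python
def balance(chunk_counts, threshold=1):
    """Greedy toy balancer: move chunks one at a time until the spread
    between the fullest and emptiest shard is within the threshold."""
    moves = 0
    while True:
        most = max(chunk_counts, key=chunk_counts.get)
        least = min(chunk_counts, key=chunk_counts.get)
        if chunk_counts[most] - chunk_counts[least] <= threshold:
            return moves
        chunk_counts[most] -= 1   # migrate one chunk away
        chunk_counts[least] += 1  # ... onto the least-loaded shard
        moves += 1

shards = {"shardA": 12, "shardB": 2, "shardC": 4}
moves = balance(shards)
print(moves)   # migrations performed
print(shards)  # counts now differ by at most 1
```

The real balancer also respects zone constraints and migration throttling, but the goal is the same: no shard should hold a disproportionate share of the chunks.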
Shard-key — The field (or fields) used to chunk the data. Correct selection of the shard-key is important to prevent bottlenecks.
Query router — Mongo shard server (mongos) — The mongos acts as a query router, providing an interface between client applications and the sharded cluster. The client application no longer connects with a comma-delimited list of mongod servers; instead, it uses a comma-delimited list of mongos servers. Application servers never talk to the mongod servers directly in a sharded cluster.
The mongos consults the config server, figures out which shards hold the requested documents, and routes the query to those shards.
Primary shard — The shard where all the non-sharded collections of a database reside is its primary shard. There is one primary shard per database in a sharded cluster. These non-sharded collections are typically low on CPU/memory/storage requirements. A sharded collection, in contrast, is split across multiple shards as below.
Thus MongoDB is able to distribute its workload on 1TB of data across 4 different shards.
The balancer migrates chunks to the appropriate shard, respecting any configured zones. This process is called "chunk migration".
Upfront correct selection of the shard-key is important, for the reasons below.
- The shard-key value of a document is immutable once the document is written
- The shard-key definition, once chosen for a collection, cannot be changed
- A sharded collection cannot be un-sharded
Say you pick a boolean field as the shard-key.
— The cardinality of the key is just 2.
— That means you can have at most 2 shards, and you cannot scale beyond that.
— If a document contains null instead of true or false, it cannot be categorized into either of the 2 shards.
— A field like isClosed, which starts as false and later becomes true, cannot be chosen, because the shard-key value cannot change once the document is persisted.
— The data also need not be uniformly distributed across the two values. It is possible that 98% of the documents hold true and only 2% hold false. In that case data is concentrated on shard A: shard A is over-worked while shard B sits idle, defeating the purpose of sharding.
Hashed sharding can be done on only one field; multiple fields cannot be combined into a hashed shard-key.
The hash values of 2 keys that are next to each other will still be far apart. So even if the key values are close together, or the cardinality of the field is low, hashed sharding distributes the data evenly across shards.
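A quick Python illustration of this effect. MongoDB uses its own hash function internally; md5 here is just a stand-in for "any reasonable hash":

```python
import hashlib

def hashed_shard(key, num_shards):
    """Hash the shard-key value, then bucket the hash into a shard."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % num_shards

# Monotonically increasing keys, i.e. neighbouring values ...
counts = [0, 0, 0, 0]
for k in range(10_000):
    counts[hashed_shard(k, 4)] += 1

# ... still land spread almost evenly across the 4 shards.
print(counts)
```

Even though the input keys are consecutive integers, the hash scatters them, so each shard receives close to a quarter of the documents.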
Pros — Good write scalability
Cons — Poor read scalability due to "broadcast operations": a range query must be fired on all the shards to gather the data. Co-location of related documents that are frequently queried together is lost.
A range of shard keys whose values are “close” are more likely to reside on the same chunk. This allows for targeted operations as a mongos can route the operations to only the shards that contain the required data.
Pros — Good read scalability
Cons — Possibly poor write scalability, if the choice of the shard-key is bad.
Ex. “created-date” as shard-key
— The cardinality is very high
— The data is also uniformly distributed
— The field is also immutable
Pros — When reading, if the query filters on created-date, it may be isolated to just 1 or a couple of shards, giving good performance. If it does not, the query has to be executed on almost all shards.
Cons — When writing, every new document has a created-date greater than all existing ones, so all writes go to the same shard (the one holding the latest range). This creates a write hot-spot and limits the write scalability of the application.
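The write hot-spot can be sketched in Python: with an ever-increasing key, every insert routes to the chunk with the open upper bound (ranges and shard names are invented):

```python
# Toy chunk table for a collection range-sharded on a timestamp-like key.
chunks = [
    {"lower": 0,    "upper": 1000, "shard": "shardA"},
    {"lower": 1000, "upper": 2000, "shard": "shardB"},
    {"lower": 2000, "upper": float("inf"), "shard": "shardC"},  # open-ended chunk
]

def route_write(key):
    for c in chunks:
        if c["lower"] <= key < c["upper"]:
            return c["shard"]

# Simulate inserts with monotonically increasing "created-date" values.
writes = [route_write(t) for t in range(2000, 2500)]
print(set(writes))  # every single write hits shardC: the hot-spot
```

Shards A and B hold the history but receive no new writes; shard C absorbs the entire insert load.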
- An application that needs to segment user data by geographic country — for example, a business governed by data-protection rules.
- A database that requires resource allocation based on geographic country.
You can associate each zone with one or more shards in the cluster. A shard can associate with any number of zones. In a balanced cluster, MongoDB migrates chunks covered by a zone only to those shards associated with the zone.
- Data protection rules can be satisfied
- Data is collocated within the fixed set of shards within the zone
- Read scalability — in such cases, queries are also zone-specific, so fewer shards have to work for a given query.
- Scaling is limited to the zone: even if the servers in one zone are overloaded, they cannot leverage the idle capacity of another zone.
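In real deployments, zones are defined as shard-key ranges tagged to shards; the simplified Python sketch below (all names invented) just shows the lookup idea: the country determines the zone, and the zone pins the data to a fixed set of shards:

```python
# Hypothetical zone configuration: which countries fall in which zone,
# and which shards are associated with each zone.
zone_countries = {
    "EU": {"DE", "FR", "NL"},
    "US": {"US"},
}
zone_shards = {
    "EU": ["shard-eu-1", "shard-eu-2"],
    "US": ["shard-us-1"],
}

def shards_for_country(country):
    """Return the shards a document (or query) for this country can touch."""
    for zone, countries in zone_countries.items():
        if country in countries:
            return zone_shards[zone]
    # No zone covers this country: the balancer may place it on any shard.
    return [s for shards in zone_shards.values() for s in shards]

print(shards_for_country("DE"))  # EU data stays on the EU shards
```

Because the balancer only migrates zone-covered chunks to the zone's shards, EU data never leaves the EU shard set, which is exactly what data-protection rules require.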
Shard-key selection & strategy — Understanding your data, your queries, and your scaling requirements will help you arrive at an optimal shard-key and achieve the read and write scalability you want. It is crucial to spend enough time and research to pick the right shard-key.
When should you go for sharding?
There are overheads to going the sharding way for your scaling:
- Cost — an extra replica set for the config servers, at least one more replica set for data, and a minimum of 2 mongos query routers.
- Complexity — a more complicated deployment and the overhead of maintaining it.
- Latency — every query goes through a mongos and consults the config server metadata.
Given the limitations of the shard-key and the extra cost, scale out only if there is a business need to do so. Scaling up may sometimes be more cost-effective if you are just on the borderline and do not foresee exponential growth.
When is it too late to scale-out?
- When the working set no longer fits in memory
- When the CPU is already working above its threshold
- When the storage thresholds have already been crossed
At that point, the chunk migrations needed to populate new shards add yet more load to an already saturated cluster.
It is important to monitor your cluster so that you can see the above coming and plan your scaling before it is too late.
Even with all this complexity, it is this ability to scale out that has established MongoDB as a major player in the big-data world. Some deployments run over 1,000 nodes in a cluster, hosting petabytes of data; crunching and analyzing data at that scale is no easy feat, yet MongoDB pulls it off.