MongoDB Analytics for Big-Data
How can I derive value from the data that is stored in MongoDB?
The ultimate objective of storing all this data is to derive business insights, find new business opportunities, reduce operational costs, or simply achieve some form of optimization; in other words, data analytics.
Analytics of big-data
Typical business needs mostly involve aggregating the data in a meaningful way. There are 3 approaches to achieve this in MongoDB.
- Find command — suited for simple requirements; extremely limited in scope.
- Aggregation Pipeline — satisfies 90% of analytic needs in MongoDB. With its combination of power, speed and simplicity, it is the go-to choice for most developers.
- MapReduce — offers both power and flexibility; with it we can do everything the Aggregation Pipeline offers and more. But it is a lot more complex than the Aggregation Pipeline.
The Aggregation Pipeline is a framework that defines a multi-stage pipeline in which the documents of a collection pass through a series of stages, culminating in a document that holds the aggregated result. The output documents of each stage are the input for the next stage.
This also means that, unlike the find command which has fixed stages, here you can mix and match the different stages, and even repeat them further down the pipeline, to finally end up with the result you need.
The aggregation framework is implemented in C++, which is why it is generally faster.
It is also more consistent than MapReduce. In fact, many previously written MapReduce functions have been rewritten with the aggregation framework to make them faster and more readable/maintainable.
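As a minimal sketch, assuming a hypothetical orders collection with status, cust and amount fields, a two-stage pipeline looks like the shell command in the comment below; the plain-JavaScript simulation underneath shows how each stage's output feeds the next stage.

```javascript
// In the mongo shell (hypothetical "orders" collection):
//   db.orders.aggregate([
//     { $match: { status: "shipped" } },
//     { $group: { _id: "$cust", total: { $sum: "$amount" } } }
//   ]);
// Below, the same two stages are simulated on plain objects.
const orders = [
  { cust: "A", status: "shipped", amount: 10 },
  { cust: "B", status: "pending", amount: 99 },
  { cust: "A", status: "shipped", amount: 5 }
];

// Stage 1: $match keeps only the matching documents.
const matched = orders.filter(d => d.status === "shipped");

// Stage 2: $group accumulates a $sum per grouping key.
const totals = {};
for (const d of matched) {
  totals[d.cust] = (totals[d.cust] || 0) + d.amount;
}

console.log(totals); // { A: 15 }
```

Note how the pending document never reaches the second stage; that is the pipeline principle in action.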
Principles & Performance
- Each stage emits documents on which the next stage works. The smaller the documents, the less data there is for the next stage to process. The aim should be to reduce the volume of data in the initial stages.
- Not only does the number of documents matter, the size of each document matters too, so use $match and $project as early as possible in the pipeline.
- Indexes are used efficiently when $match is the first operator in the pipeline. The $sort operator can also use indexes, so sort early in the pipeline; if the sort can use an index, your query will be very fast. In my experience, if both $match and $sort could be the first pipeline operator, go for $match.
- The best way to optimize an aggregation pipeline is to build it using these principles, measure, understand the execution plan, modify the indexes and pipeline stages, and remeasure.
- Aggregation is a resource-intensive operation. Therefore, running it on a relatively idle secondary node is a good choice, provided you are ok with eventually consistent data.
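To make the ordering principle concrete, here is a sketch of the same logical query written two ways, again assuming a hypothetical orders collection. Both pipelines are valid, but the first lets the server filter (and potentially use an index on status) before the expensive sort runs.

```javascript
// Efficient: filter first, shrink documents early, then sort.
const efficient = [
  { $match: { status: "shipped" } },     // fewer documents flow onward
  { $project: { cust: 1, amount: 1 } },  // smaller documents flow onward
  { $sort: { amount: -1 } }
];

// Wasteful: sorts every document, filters too late.
const wasteful = [
  { $sort: { amount: -1 } },
  { $match: { status: "shipped" } }
];

// Either array would be passed to db.orders.aggregate(...) in the shell.
console.log(Object.keys(efficient[0])[0], Object.keys(wasteful[0])[0]);
```

The two pipelines return the same documents; only the amount of work differs, which is exactly what measuring the execution plan would reveal.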
There is a whole bunch of operators specifically designed for the aggregation pipeline; I will discuss some of the most commonly used ones.
$group — Documents can be grouped on a specific key. Within the group we can use accumulators like $sum, $max and $min (a count is simply { $sum: 1 }).
$project — Restricts which attributes of a document you pass to the next stage. It is also used to rename fields in the final stage, so that the data looks meaningful in the final document presented.
$sort — Sorts the documents. This is a costly operation, so try to understand the query execution plan and see whether you can make the plan pick an index for its sort requirements.
$first/$last — Gets you the first or the last value within a group; normally used in a combination of $sort and $group followed by $first/$last.
$limit — limits the number of documents passed on
$skip — skips a given number of documents
$unwind — Works only on arrays. The pipeline works with documents; it does not work with arrays inside the documents. So to work on the data inside an array, we need to explode each value in the array into its own document; the rest of the values in the document get repeated. Say we have an array of 3 values in a document: we will end up with 3 documents. Since the number of documents and the volume of data increase in this stage, the memory requirements increase too. The limitations of the aggregation pipeline need to be kept in mind when using this operator.
$push — This is the opposite of $unwind: you can create an array from documents. Typically the $unwind operator is used to explode an array into documents, some operation is done on them, and then they are put back into their original array format with $push.
$addToSet — Same as $push, except that duplicated values are dropped, so the array contains only unique values.
date functions — A whole range of date functions, down to millisecond precision, are supported and can be operated on for reporting purposes.
$geoNear — A specialized match operator which can match documents based on their geo-location.
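The $unwind/$push/$addToSet round trip described above can be sketched in plain JavaScript; the document shape (a hypothetical tags array) is an assumption for illustration.

```javascript
// One document whose array we want to work on element by element.
const doc = { _id: 1, tags: ["red", "blue", "red"] };

// $unwind: one output document per array element; the other fields repeat.
const unwound = doc.tags.map(t => ({ _id: doc._id, tags: t }));
// 1 document in, 3 documents out: this is where memory use grows.

// ...per-element stages ($match, $project, etc.) would run here...

// $group with $push: rebuild the array, duplicates kept.
const pushed = unwound.map(d => d.tags);

// $group with $addToSet: rebuild the array, duplicates dropped.
const unique = [...new Set(pushed)];

console.log(unwound.length, pushed, unique);
```

The growth from 1 document to 3 is exactly why $unwind is the stage to watch when thinking about the pipeline's memory limits.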
- The result can be a cursor or a document. The size limitation of a document therefore applies to the final result of the aggregate command: it cannot be more than 16MB, and if any returned document exceeds 16MB the command will throw an error. Therefore always check the “ok” field to know whether your query ran to completion. This constraint only applies to the final result; it does not apply to the temporary documents created between the pipeline stages.
- Memory utilization is capped at 100 MB. If any stage of the pipeline goes beyond this limit, it will error. To prevent this, “allowDiskUse” needs to be set to true. This removes the restriction, but the extra I/O will slow down the process. The “allowDiskUse” option cannot be used when specific operators like “$addToSet” or “$push” are used.
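Opting in to disk spill is just an option on the aggregate call; a sketch, again against a hypothetical orders collection:

```javascript
// In the mongo shell:
//   db.orders.aggregate(pipeline, { allowDiskUse: true });
// Trades the 100 MB in-memory cap per stage for extra (slower) disk I/O.
const options = { allowDiskUse: true };
console.log(options);
```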
MapReduce
MapReduce helps solve big-data problems in a distributed way, where the compute, storage and memory required for the computation are distributed across different nodes.
MapReduce is written in JavaScript and therefore it will be slower than the aggregation framework, which is written in C++. In fact, the Aggregation Pipeline was an addition to MongoDB in version 2.2. So if you have some pre-existing MapReduce command and are facing performance issues, you may consider switching to the Aggregation Pipeline.
Due to the high level of inter-node communication, the overall resource consumption is high, but the instantaneous load is lower: work gets evenly distributed, and thus the computation does not hog one node. Sometimes this alone can be the deciding factor, making MapReduce the go-to choice for aggregation needs.
The aggregation framework suffers from some limitations. MapReduce does not suffer from these limitations and offers you full flexibility: more work can be done than what you could possibly get done with one node’s compute/memory.
There are 3 stages: Map, Reduce and an optional Finalize.
- Map — The document is read and may emit 0 or more documents, which act as input for the Reduce stage. Basically, extract the relevant data and emit it. Sometimes you may have nothing to emit and sometimes you may emit more than one document. You need to emit all the attributes that you want to aggregate on.
- Reduce — May be called n times per key, due to a prior map operation or a prior reduce operation. Each time it gets a key and an array of values which it needs to aggregate. Reduce operations can run in parallel across shards.
- Finalize — Runs once per key and is optional. It can be used to make some last-minute corrections, like formatting, before the final document is produced.
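The three stages are ordinary JavaScript functions. Here is a sketch of what they might look like for a hypothetical orders collection (cust and amount are assumed field names), followed by a greatly simplified harness that mimics how the server drives them; in the real shell you would pass these to db.orders.mapReduce(map, reduce, { out: ..., finalize }).

```javascript
function map() {
  // "this" is the current document; emit 0..n key/value pairs.
  emit(this.cust, this.amount);
}
function reduce(key, values) {
  // May be called repeatedly per key, on emitted values or on its
  // own prior output, so it must be safe to re-apply.
  return values.reduce((a, b) => a + b, 0);
}
function finalize(key, reduced) {
  // Optional; runs once per key for last-minute formatting.
  return { total: reduced };
}

// --- harness: a very rough simulation of what the server does ---
const docs = [
  { cust: "A", amount: 10 },
  { cust: "B", amount: 7 },
  { cust: "A", amount: 5 }
];
const emitted = {};
function emit(k, v) { (emitted[k] = emitted[k] || []).push(v); }
for (const d of docs) map.call(d);  // map runs with the doc as "this"
const out = {};
for (const k in emitted) out[k] = finalize(k, reduce(k, emitted[k]));

console.log(out); // { A: { total: 15 }, B: { total: 7 } }
```

Note how the harness binds each document as this when calling map; that is exactly the access pattern described in the output-options section below.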
Output options of MapReduce
- inline option (standard output)
- collection option (results written to a collection)
- Data aggregation is a resource-intensive operation, so if you plan to use secondary nodes for your map-reduce you cannot use the collection output option, as writing to a collection is allowed only on the primary node. So “inline” is your only option there.
- You cannot run queries within the map or the reduce operations. What you do get is the “this” operator: “this” is a pointer to the current document, and you can use it to access any of the attributes of the document.
While the complexity of MapReduce may scare developers, the flexibility makes up for it, and it is here to stay. On the other hand, the Aggregation Pipeline, with its speed and simplicity, enjoys far higher acceptance among the MongoDB developer fraternity. With these 2 powerful workflows at its disposal, MongoDB has made its own place in the world of Big-Data analytics.