MongoDB Analytics for Big-Data

Analytics of big-data

Typical business needs mostly come down to aggregating data in a meaningful way. There are 3 approaches to achieve this in MongoDB.

  1. Find command: Suited to simple requirements, but extremely limited in scope.
  2. Aggregation Pipeline: This satisfies roughly 90% of analytic needs in MongoDB. With its power, speed and simplicity, it is the go-to choice for most developers.
  3. MapReduce: This offers both power and flexibility. With it we can do everything the Aggregation Pipeline offers and more, but it is a lot more complex than the Aggregation Pipeline.

Aggregation Pipeline

The Aggregation Pipeline is a framework that defines a multi-stage pipeline: documents from the collection pass through a series of stages, culminating in documents that hold the aggregated result. The output documents of each stage are the input for the next stage.

Example of Pipeline stage
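
As a minimal sketch of what a multi-stage pipeline can look like, assume a hypothetical "orders" collection whose documents carry status, customerId and amount fields (these names are illustrative assumptions, not a real schema). Each element of the array is one stage, and its output feeds the next stage.

```javascript
// Hypothetical pipeline over an assumed "orders" collection.
const pipeline = [
  // Stage 1: keep only completed orders.
  { $match: { status: "completed" } },
  // Stage 2: one output document per customer, with the summed amount.
  { $group: { _id: "$customerId", total: { $sum: "$amount" } } },
  // Stage 3: largest totals first.
  { $sort: { total: -1 } },
];

// In mongosh this would be run as: db.orders.aggregate(pipeline)
// Here we only list the stage names in the order they execute.
const stageNames = pipeline.map((stage) => Object.keys(stage)[0]);
console.log(stageNames.join(" -> ")); // $match -> $group -> $sort
```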

Principles & Performance

  1. Each stage emits documents on which the next stage works. The fewer documents a stage emits, the less data the next stage has to process. The aim should be to reduce the volume of data in the initial stages.
  2. Not only does the number of documents matter, the size of each document also matters, so use $match and $project as early as possible in the pipeline.
  3. Indexes are used efficiently when $match is the 1st operator in the pipeline. The $sort operator can also use indexes, so sort early in the pipeline; if it can use an index, your query will be very fast. In my experience, if both $match and $sort could be the 1st pipeline operator, go for the $match operator.
  4. The best way to optimize an aggregation pipeline is to build it using these principles, measure, understand the execution plan, modify the indexes and pipeline stages, and measure again.
  5. Aggregation is a resource-intensive operation. Running it on a relatively idle secondary node is therefore a good choice, provided you are OK with eventually consistent data.
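
The principles above can be sketched roughly as follows. The "orders" collection, its fields and the index on orderDate are assumptions made for illustration. The measure function is only defined here, not called; it expects the mongosh database handle as db.

```javascript
// Assumed: an "orders" collection with an index on { orderDate: 1 } (hypothetical).
const since = new Date("2020-01-01T00:00:00Z");

const pipeline = [
  // $match first, so the index can be used and fewer documents flow onward.
  { $match: { orderDate: { $gte: since } } },
  // $project early, so later stages carry only the fields they need.
  { $project: { _id: 0, region: 1, amount: 1 } },
  // Aggregate the already reduced stream of documents.
  { $group: { _id: "$region", revenue: { $sum: "$amount" } } },
];

// Not called here; `db` is the database handle available in mongosh.
function measure(db) {
  // Prefer a secondary node, accepting eventually consistent reads.
  db.getMongo().setReadPref("secondaryPreferred");
  // Return the execution plan, to compare before and after each change.
  return db.orders.explain("executionStats").aggregate(pipeline);
}
```

Running measure(db) before and after changing an index or reordering stages gives the measure-and-remeasure loop described in point 4.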

Aggregate Operators

There is a whole bunch of operators specifically designed for the aggregation pipeline. I will discuss some of the most commonly used ones.
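
As a rough illustration of a few commonly used stage operators, the sketch below counts the most frequent tags in a hypothetical "articles" collection, assuming each document has a tags array. The collection and field names are assumptions.

```javascript
// Hypothetical "articles" collection; each document is assumed to have a `tags` array.
const topTags = [
  // $unwind: one output document per element of the tags array.
  { $unwind: "$tags" },
  // $group: count how many documents carry each tag.
  { $group: { _id: "$tags", count: { $sum: 1 } } },
  // $sort: most frequent tags first.
  { $sort: { count: -1 } },
  // $limit: keep only the top five.
  { $limit: 5 },
];

// In mongosh this would be run as: db.articles.aggregate(topTags)
const operators = topTags.map((stage) => Object.keys(stage)[0]);
console.log(operators);
```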

Limitations

  1. The result can be a cursor or a document, so the document size limit applies to the final result of the aggregate command: no returned document can be more than 16MB. If any returned document exceeds 16MB, the command throws an error. Therefore always check the “ok” field of the response to know whether your query ran to completion. This constraint applies only to the final result, not to the temporary documents created between the pipeline stages.
  2. Memory utilization is capped at 100 MB. If any stage of the pipeline goes beyond this limit, it will error. To prevent this, set “allowDiskUse” to true. This removes the restriction, but the extra I/O will slow down the process. “allowDiskUse” cannot be used when specific operators like “$addToSet” or “$push” are used.
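
A minimal sketch of passing “allowDiskUse” to the aggregate command, assuming the same hypothetical "orders" collection. The function is only defined here; db stands for the mongosh database handle.

```javascript
// Hypothetical pipeline whose $group or $sort stage may exceed the memory cap.
const pipeline = [
  { $group: { _id: "$customerId", total: { $sum: "$amount" } } },
  { $sort: { total: -1 } },
];

// Lets stages spill to temporary files instead of failing at the memory cap.
const options = { allowDiskUse: true };

// Not called here; `db` is the database handle available in mongosh.
function runLargeAggregation(db) {
  // Returns a cursor, so results are read in batches.
  return db.orders.aggregate(pipeline, options);
}
```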

MapReduce

This helps solve big-data problems in a distributed way, where the compute, storage and memory required for the computation are distributed across different nodes.

Cons - Speed

MapReduce functions are written in JavaScript, so MapReduce will be slower than the aggregation framework, which is written in C++. The Aggregation Pipeline is also the newer of the two, added to MongoDB in version 2.2. So if you have pre-existing MapReduce commands and are facing performance issues, you may consider switching to the Aggregation Pipeline.

Pros - Distributed Load

Because of the high level of inter-node communication, the overall resource consumption is high, but the instantaneous load on any one node is lower. Work gets evenly distributed, so the compute does not hog one node. Sometimes this can be the deciding factor that makes MapReduce the go-to way for aggregation needs.

Pros - Flexibility

The aggregation framework suffers from some limitations. MapReduce does not suffer from such limitations and offers you full flexibility. More work can be done than what you could possibly get done with one node’s compute/memory.

MapReduce Stages

There are 3 stages: map, reduce and finalize.

Map function

Each document is read and the map function may emit 0 or more documents, which act as input for the Reduce stage. Basically, extract the relevant data and emit it. Sometimes you may not have anything to emit, and sometimes you may emit more than one document. You need to emit all the attributes that the aggregation needs to process.
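
A minimal sketch of a map function, assuming hypothetical order documents with an items array of { sku, qty } entries. Inside MongoDB, emit is provided by the server; the emit defined below is only a local stand-in so the sketch can be run outside the database.

```javascript
// Local stand-in for MongoDB's emit(); it just collects key/value pairs.
const emitted = [];
function emit(key, value) {
  emitted.push({ key: key, value: value });
}

// Map function as it would be passed to mapReduce; `this` is the current document.
function mapFn() {
  // A document without items emits nothing.
  if (!Array.isArray(this.items)) return;
  // A document with several items emits once per item.
  this.items.forEach(function (item) {
    emit(item.sku, { qty: item.qty, count: 1 });
  });
}

// Hypothetical sample documents.
const docs = [
  { _id: 1, items: [{ sku: "A", qty: 2 }, { sku: "B", qty: 1 }] },
  { _id: 2 },
  { _id: 3, items: [{ sku: "A", qty: 5 }] },
];

// Simulate the map phase by binding each document to `this`.
docs.forEach(function (doc) {
  mapFn.call(doc);
});
console.log(emitted.length); // 3
```

The second document emits nothing and the first emits twice, which is the "0 or more" behaviour described above.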

Reduce function

Reduce may be called any number of times per key. Its input may come from a prior map operation or from a prior reduce operation on the same key, so the value it returns must have the same shape as the values emitted by map. Each time, it gets a key and an array of values that it needs to aggregate. Reduce operations can run in parallel across shards.
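
A minimal sketch of a reduce function that is safe to call repeatedly, assuming hypothetical emitted values of the form { qty, count }. The check at the end feeds a partial result back into reduce to show that the outcome does not change.

```javascript
// Reduce function as it would be passed to mapReduce.
// It returns the same shape it receives, so its output can be reduced again.
function reduceFn(key, values) {
  const result = { qty: 0, count: 0 };
  values.forEach(function (v) {
    result.qty += v.qty;
    result.count += v.count;
  });
  return result;
}

// Hypothetical values emitted for one key.
const values = [
  { qty: 2, count: 1 },
  { qty: 5, count: 1 },
  { qty: 1, count: 1 },
];

// Reducing everything in one call ...
const allAtOnce = reduceFn("A", values);
// ... must match reducing a part first and then re-reducing the partial result.
const partial = reduceFn("A", values.slice(0, 2));
const reReduced = reduceFn("A", [partial, values[2]]);

console.log(JSON.stringify(allAtOnce) === JSON.stringify(reReduced)); // true
```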

Finalize function

This runs once per key and is optional. It can be used for last-minute corrections, such as formatting, before the final document is prepared.
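
A minimal sketch of a finalize function, assuming a hypothetical reduced value of the form { qty, count }. It adds a rounded average to the value, a step that belongs here because it must run only once per key.

```javascript
// Finalize function as it would be passed to mapReduce via the `finalize` option.
function finalizeFn(key, reducedValue) {
  // Derive an average, rounded to two decimals, from the reduced totals.
  const avg = reducedValue.count > 0 ? reducedValue.qty / reducedValue.count : 0;
  reducedValue.avgQty = Math.round(avg * 100) / 100;
  return reducedValue;
}

// Hypothetical reduced value for one key.
const finalDoc = finalizeFn("A", { qty: 8, count: 3 });
console.log(finalDoc.avgQty); // 2.67
```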

Output options of MapReduce

  1. Inline (standard output)
  2. Collection
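
The two output options can be sketched as option objects for the mapReduce command. The collection name "sku_totals" and the "orders" collection are assumptions. The function is only defined here; db stands for the mongosh database handle.

```javascript
// Inline: results come back in the command response itself.
const inlineOut = { out: { inline: 1 } };

// Collection: results are written to a named collection (hypothetical name).
const collectionOut = { out: "sku_totals" };

// Not called here; `db` is the database handle available in mongosh,
// and mapFn / reduceFn are the functions supplied by the caller.
function runMapReduce(db, mapFn, reduceFn, options) {
  return db.orders.mapReduce(mapFn, reduceFn, options);
}
```

For example, runMapReduce(db, mapFn, reduceFn, inlineOut) would return the results directly, while passing collectionOut would store them in the named collection.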

Limitations

  1. Data aggregation is a resource-intensive operation. If you plan to use secondary nodes for your map-reduce, you cannot use “Collection” as your output option, because writing to a collection is allowed only on the primary node. So “inline” is your only option.
  2. You cannot run queries within the map or the reduce operations. Instead, the map function gets the this operator, which points to the current document, and you can use it to access any attribute of that document.

Summary

While the complexity of MapReduce may scare developers, its flexibility makes up for it and it is here to stay. On the other hand, the Aggregation Pipeline, with its speed and simplicity, enjoys far higher acceptance among the MongoDB developer fraternity. With these 2 powerful workflows at its disposal, MongoDB has made its own place in the world of Big-Data analytics.

Sarada Sastri

Java Architect | MongoDB | Oracle DB| Application Performance Tuning | Design Thinking | https://www.linkedin.com/in/saradasastri/