MongoDB Analytics for Big-Data

Analytics of big-data

MongoDB offers three ways to run analytics, in increasing order of power and complexity:

  1. Find command: suitable for simple requirements; extremely limited in scope.
  2. Aggregation Pipeline: this satisfies most (roughly 90%) of the analytic needs in MongoDB. With its power, speed and simplicity, it is the go-to choice for most developers.
  3. MapReduce: this offers both power and flexibility. It can do everything the Aggregation Pipeline can and more, but it is a lot more complex than the aggregation pipeline.

Aggregation Pipeline

Example of Pipeline stage
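
As a minimal sketch of what a pipeline looks like, the stages can be written as an ordered list of documents, shown here in Python in the form that PyMongo's aggregate call accepts. The orders collection and its status, customer_id and amount fields are hypothetical and only serve the illustration.

```python
# A pipeline is an ordered list of stages; each stage is a one-key document.
# Collection and field names below are hypothetical.
pipeline = [
    # Stage 1: keep only the documents we care about.
    {"$match": {"status": "shipped"}},
    # Stage 2: aggregate the remaining documents per customer.
    {"$group": {"_id": "$customer_id", "total": {"$sum": "$amount"}}},
    # Stage 3: order the grouped results, largest total first.
    {"$sort": {"total": -1}},
]


def stage_names(stages):
    """Return the operator name of each stage, in pipeline order."""
    return [next(iter(stage)) for stage in stages]


print(stage_names(pipeline))  # ['$match', '$group', '$sort']

# With PyMongo this would be run as: db.orders.aggregate(pipeline)
```

Each stage consumes the output of the stage before it, so the order of the list is the order of execution.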

Principles & Performance

  1. Each stage outputs a set of documents on which the next stage works. The smaller that output, the less data the next stage has to process, so aim to reduce the volume of data in the initial stages.
  2. Not only the number of documents matters; the size of each document matters too, so use $match and $project as early as possible in the pipeline.
  3. Indexes are used efficiently when $match is the first operator in the pipeline. The $sort operator can also use indexes, so sort early in the pipeline; if it can use an index, your query will be very fast. In my experience, if either $match or $sort could be the first pipeline operator, go for $match.
  4. The best way to optimize an aggregation pipeline is to build it using these principles, measure, understand the execution plan, modify the indexes and pipeline stages, and measure again.
  5. Aggregation is a resource-intensive operation. Running it on a relatively idle secondary node is therefore a good choice, provided you are OK with eventually consistent data.
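
The ordering rules of thumb in this list can be sketched as a small check over a pipeline. The check_pipeline helper below is hypothetical, written only for illustration: it encodes the guidance from this section and knows nothing about real indexes or execution plans, so it is no substitute for measuring. Field names are assumptions too.

```python
def check_pipeline(stages):
    """Return warnings for a pipeline given as a list of one-key stage dicts.

    Hypothetical helper: it only applies the ordering rules of thumb from
    this section; it does not inspect indexes or the execution plan.
    """
    names = [next(iter(stage)) for stage in stages]
    warnings = []
    if not names:
        return warnings
    # Rule: start with $match (or $sort) so that an index can be used.
    if names[0] not in ("$match", "$sort"):
        warnings.append(
            f"first stage is {names[0]}: start with $match (or $sort) "
            "so an index can be used"
        )
    # Rule: when both are candidates for the first stage, prefer $match.
    if names[0] == "$sort" and "$match" in names:
        warnings.append("$sort comes before $match: prefer $match as the first stage")
    return warnings


# Filters first, trims fields next, then groups: no warnings expected.
good = [
    {"$match": {"status": "shipped"}},
    {"$project": {"customer_id": 1, "amount": 1}},
    {"$group": {"_id": "$customer_id", "total": {"$sum": "$amount"}}},
]

# Groups the whole collection before filtering anything.
bad = [
    {"$group": {"_id": "$customer_id", "total": {"$sum": "$amount"}}},
    {"$match": {"total": {"$gt": 100}}},
]

print(check_pipeline(good))
print(check_pipeline(bad))
```

A $match that follows $group can be perfectly legitimate when it filters on a computed value, which is why the sketch only looks at what comes first.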

Aggregate Operators


  1. The result can be a cursor or a single document. When the result is returned as a single document, it is subject to the document size limit: it cannot exceed 16 MB, and the command throws an error if it does. When the result is returned as a cursor, the limit applies to each returned document. Therefore always check the “ok” field of the response to know whether your query ran to completion. This constraint applies only to the final result, not to the temporary documents created between pipeline stages.
  2. Memory use is capped at 100 MB per pipeline stage. If any stage goes beyond this limit, the command errors out. To prevent this, set “allowDiskUse” to true. This removes the restriction, but the extra I/O slows down the process. “allowDiskUse” cannot be used when specific operators like “$addToSet” or “$push” are used.
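
One way to apply the memory limit in practice is to decide up front whether a job needs allowDiskUse. The sketch below assumes the caller has a rough estimate of how much data the heaviest stage handles; aggregate_options and that estimate are hypothetical, since MongoDB does not report this figure ahead of time.

```python
# Per-stage memory cap described in this section (100 MB).
STAGE_MEMORY_LIMIT_BYTES = 100 * 1024 * 1024


def aggregate_options(estimated_stage_bytes):
    """Return keyword options for an aggregate call.

    estimated_stage_bytes is a caller-supplied guess (hypothetical input).
    Disk use is enabled only when the guess exceeds the per-stage cap,
    because spilling to disk trades the error for slower execution.
    """
    if estimated_stage_bytes > STAGE_MEMORY_LIMIT_BYTES:
        return {"allowDiskUse": True}
    return {}


print(aggregate_options(10 * 1024 * 1024))   # {}
print(aggregate_options(250 * 1024 * 1024))  # {'allowDiskUse': True}

# With PyMongo the options would be passed through as keyword arguments:
#   db.orders.aggregate(pipeline, **aggregate_options(estimate))
```

For large batch jobs where the estimate is unreliable, simply passing allowDiskUse as true every time is a reasonable alternative.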


Cons: speed

Pros: distributed load


MapReduce Stages

Map function

Reduce function

Finalize function
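
To make the three stages concrete, here is a pure-Python simulation of the map, reduce and finalize flow. It is a sketch of the idea only, not MongoDB's API: in MongoDB the three functions are written in JavaScript and run on the server. The orders data and its field names are made up for the example.

```python
from collections import defaultdict


def map_fn(doc):
    """Map: emit (key, value) pairs for one document."""
    yield doc["customer_id"], doc["amount"]


def reduce_fn(key, values):
    """Reduce: fold all values emitted for one key into a single value."""
    return sum(values)


def finalize_fn(key, reduced):
    """Finalize: optional post-processing of each reduced value."""
    return round(reduced, 2)


def map_reduce(docs, map_fn, reduce_fn, finalize_fn=None):
    """Run map over every document, group by key, then reduce and finalize."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            grouped[key].append(value)

    results = {}
    for key, values in grouped.items():
        # MongoDB does not call reduce for a key with a single value;
        # the simulation mirrors that behaviour.
        reduced = values[0] if len(values) == 1 else reduce_fn(key, values)
        results[key] = finalize_fn(key, reduced) if finalize_fn else reduced
    return results


# Hypothetical input documents.
orders = [
    {"customer_id": "a", "amount": 10.0},
    {"customer_id": "b", "amount": 5.5},
    {"customer_id": "a", "amount": 2.25},
]

print(map_reduce(orders, map_fn, reduce_fn, finalize_fn))
# {'a': 12.25, 'b': 5.5}
```

Because reduce may be skipped or called more than once for the same key, its return value should have the same shape as the values emitted by map.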

Output options of MapReduce

  1. Inline (standard output)
  2. Collection


  1. Data aggregation is a resource-intensive operation. If you plan to run your map-reduce on secondary nodes, you cannot use “Collection” as your output option, because writing to a collection is allowed only on the primary node. “Inline” is then your only option.
  2. You cannot run queries inside the map or reduce functions. What you get instead is the this operator, which points to the current document; use it to access any attribute of that document.





Java Architect | MongoDB | Oracle DB| Application Performance Tuning | Design Thinking |

Sarada Sastri
