MongoDB: Step-4: Designing the schema

Sarada Sastri
8 min read · Feb 16, 2020

We will touch upon the principles used to define and determine the schema design for any project. Applying them gives you a first, fluid schema, on which you can then apply design patterns to arrive at the final schema.

The schema you end up with depends on your application. Below are some of the key considerations that will help you get there.

1. Does your application do more reads than writes?
2. What data needs to be together when reading from the database?
3. What performance considerations are there?
4. How large are the documents?
5. How large will they get?
6. How do you anticipate your data will grow and scale?

Design principles

Design for — Performance, Scalability & Simplicity

Provide for — Transaction control

Data De-normalization

In a typical RDBMS, data is stored in a normalized format, usually third normal form (3NF), as tables with rows and columns. Duplication of data is avoided, and relationships between the data are established through primary key and foreign key references across tables.

Many of us use OOP (object-oriented programming), where we work with objects, not tables. This mismatch forces application developers to resort to ORM (object-relational mapping) frameworks such as Hibernate. In MongoDB, related data is instead denormalized and embedded in a single document, which maps far more naturally to application objects.

Read requirements define the schema design

MongoDB has full-featured indexes to enable optimized reads, which makes it an ideal big data solution when you have varied read requirements. The applications that choose MongoDB as a database are typically write-once, read-many workloads.

Understand the read requirements of your application and design your storage in MongoDB based on how you would want to query and extract value. Below are a few ideas based on my experience with MongoDB.

RDBMS to Mongo migration

Things to consider:

  1. Transactions: historically, a transaction in MongoDB could not span multiple collections or multiple documents (multi-document ACID transactions arrived only with MongoDB 4.0 on replica sets and 4.2 on sharded clusters).
  2. Data will have to be denormalized, and all related data will have to be stored within the same document.

Performance — Choice of the primary key

MongoDB uses _id as the primary key by default and maintains a mandatory built-in index on it, which it uses for lookups and for mapping across indexes and the collection. So if your application already has a natural unique key, it is advisable to use that key as _id; this saves you the extra unique index you would otherwise need on that field.
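
For example, if every order already carries a unique business key, that key can be stored directly as _id. A small sketch in the mongo shell, where the orders collection and its values are purely illustrative:

// The application's own unique key is used as _id, so no extra unique index is needed.
db.orders.insertOne({ _id: "ORD-2020-000123", customer: "ACME", total: 250 })

// Lookups by the business key are served by the mandatory _id index.
db.orders.find({ _id: "ORD-2020-000123" })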

Polymorphic objects share the same collection

Since MongoDB does not enforce a schema, the onus of schema enforcement is on the application layer. This gives the application flexibility, as it can store documents of different structures and types. But it also means MongoDB will not complain when there are typos in your queries; it will simply return no data.

It also means you can store polymorphic documents, i.e. documents that share some attributes and differ in others, in the same collection. You don't have to create separate collections for each document shape. In terms of an OO language like Java, different objects of the same class hierarchy can be stored in one collection.

Example: a Person object and an Employee object that extends Person can be stored in the same collection.
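
A small sketch of this in the mongo shell, using a hypothetical people collection with a type discriminator field:

// A plain person and an employee with extra attributes live in the same collection.
db.people.insertOne({ type: "Person", name: "Asha", dob: ISODate("1990-05-01") })
db.people.insertOne({ type: "Employee", name: "Ravi", dob: ISODate("1985-11-20"),
                      employeeId: "E-1001", department: "Finance" })

// Queries can target one subtype or the whole hierarchy.
db.people.find({ type: "Employee" })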

Using array in documents

MongoDB supports multikey indexes on array fields, which can speed up query operations. Let's see this in action.

Example: a movie has different release dates in different locations, and you want to query on the release dates.

// a document in the movies collection
{
  name: "Jumanji",
  release_US: ISODate("1977-12-27T01:00:00+01:00"),
  release_UK: ISODate("1977-12-27T01:00:00+01:00"),
  release_India: ISODate("1977-12-27T01:00:00+01:00")
}

To query on the release dates across locations, 3 indexes will be needed.

{"releases_US": 1}
{"releases_UK": 1}
{"releases_India": 1}

The maintenance overhead of one index per location increases the turn-around time for inserts. Instead, store the release dates as an array of embedded documents:

// a document in the movies collection
{
  name: "Jumanji",
  releases: [
    { location: "US", release_date: ISODate("1977-12-27T01:00:00+01:00") },
    { location: "UK", release_date: ISODate("1977-12-27T01:00:00+01:00") },
    { location: "India", release_date: ISODate("1977-12-27T01:00:00+01:00") }
  ]
}

Now one single multikey index is enough:

{ "releases.location": 1, "releases.release_date": 1 }

Consider size limitations — MongoDB documents have a size limit of 16 MB today. If your application needs to store different versions of the same data, each document will keep growing and may exceed 16 MB. In such cases, you may want to store the different versions as separate documents.

Anti-pattern for combination search: Using an array to store the different versions of the same document may not go down well with the search requirements.

Example: find in the inventory the product Apple, version 6S, with price 20000.

{"_id":"5",“product”: “Apple”,
versions:[
{“version”:”6”,“price”:”20000”}},
{“version”:”6S”,“price”:”40000”}}
]}
inventory.find({"versions":"6S", "price":"20000"})

Observe: logically the above query should return no data (the 6S is priced at 40000), but MongoDB does return the document. Each part of the query is matched independently against any element of the array: one element satisfies the version condition and a different element satisfies the price condition, so the document is returned.

Solution: where you want to query different attributes of the array elements in conjunction, avoid arrays.

Also, in a compound multikey index at most one of the indexed fields may be an array; you cannot index two parallel array fields together. So if you know your read queries will combine several such fields, think through whether you really want an array in this case.
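
If the array has to stay, the $elemMatch operator forces all conditions onto the same array element; a sketch against the inventory example above:

// Both conditions must be satisfied by one and the same element of the versions array,
// so this query correctly returns nothing for a 6S priced at 20000.
db.inventory.find({ versions: { $elemMatch: { version: "6S", price: "20000" } } })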

PreComputed Summary

Data by itself hardly makes any sense. It is only when information is derived from data that the power of data is realized. Example: What are the total sales for a product? Which are the top fast-moving products?

Such computations can be resource-intensive: they may mean scanning lots of documents, then filtering, aggregating and massaging the data, which costs both CPU and memory. It is not recommended to repeat this work every time a request for the report comes in.

Instead, such a snapshot view of the data, or the report itself, can be pre-calculated by a batch job at periodic intervals and stored.
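
As a hedged sketch of such a batch job (the sales collection and its fields are assumptions), an aggregation pipeline can write the rolled-up numbers into a summary collection with $merge, available from MongoDB 4.2:

// Roll up total sales per product and store the result in a summary collection.
db.sales.aggregate([
  { $group: { _id: "$productId", totalSales: { $sum: "$amount" }, units: { $sum: "$qty" } } },
  { $merge: { into: "product_sales_summary", whenMatched: "replace", whenNotMatched: "insert" } }
])

// Reports then read the small precomputed collection instead of scanning all sales.
db.product_sales_summary.find().sort({ totalSales: -1 }).limit(10)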

Design for the typical case, not the rare case.

Exceptional requirements for a few documents should not dominate the design, especially when the trade-off would degrade the performance of the typical 99.999% of cases.

Say you have documents that would normally stay well under 16 MB, but outlier data could occasionally exceed 16 MB. You would not want to design the storage of every document around GridFS for that. Instead, put a marker and a link to an overflow document that continues the existing one, and let the application handle this outlier case in a sub-optimized manner.

Document Versioning & Historical Data Collection

Here a version attribute is added to the document and incremented each time an update arrives; the existing document is written to a historic collection and the updated document is written to the current collection.
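
A minimal sketch of the pattern in the mongo shell, assuming hypothetical products and products_history collections and an update that changes the price:

var current = db.products.findOne({ _id: "P-100" })
if (current !== null) {
  var expectedVersion = current.version
  // Archive the existing version under a fresh _id so history entries never collide.
  current.productId = current._id
  current._id = new ObjectId()
  db.products_history.insertOne(current)
  // Apply the change to the live document and bump its version.
  db.products.updateOne(
    { _id: "P-100", version: expectedVersion },
    { $set: { price: 999 }, $inc: { version: 1 } }
  )
}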

Pros — Most queries are for the latest document, and these are serviced easily because the current collection stays relatively small.

Cons — The historic data has to be fetched from another collection, which means that, based on the requested timeline, the application's query builder has to decide when to hit the second collection.

Also, this solution works only for scenarios where updates to the document are infrequent.

Capped Collections — Fixed collection size

This is a special feature in MongoDB: space for the collection is preallocated, so if your application needs very fast writes, you can use it. The oldest documents are overwritten by new ones as they come in, so writes are very fast and there is no overhead of deleting old documents.
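
For example, a capped collection can be created with a preallocated size; the name and numbers below are illustrative:

// Preallocate 100 MB for at most 500,000 documents; the oldest entries are overwritten first.
db.createCollection("audit_log", { capped: true, size: 100 * 1024 * 1024, max: 500000 })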

Note that updates to documents in a capped collection cannot increase the document's size. (Also, the storage engine that preceded WiredTiger was MMAPv1, not Memcache.)

Performance & Eventual Consistency

In most enterprise settings, the secondary nodes are eventually consistent. Application services that require 100% consistent data should always read from the primary node, while services that can tolerate eventual consistency can read from a secondary node. Designing the application this way reduces the load on the primary and frees it for write operations.
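
As an illustration, reads can be routed to a secondary either per query or for the whole connection; the collection and filter below are hypothetical:

// Per-query read preference in the mongo shell; suitable only for reads that can
// tolerate slightly stale data.
db.orders.find({ status: "SHIPPED" }).readPref("secondaryPreferred")

// Or for a whole connection via the URI:
// mongodb://host1,host2/mydb?replicaSet=rs0&readPreference=secondaryPreferred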

Pitfalls — While it looks like common sense to use secondary nodes for reads, carefully evaluate the read and write concerns of your application and understand its business case before you go down this path.

Since replication is asynchronous, it is quite possible that a secondary node gives you stale data. Is your application OK with this?

I have met one startup that used secondary nodes for reads; one fine day this caused a loss for its clients, and it lost enough credibility to lose its business and shut shop.

Performance & TTL Indexes

Time-to-live (TTL) indexes are a feature that lets you choose how long a document is kept before it is deleted.
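
A TTL index is declared on a date field with the expireAfterSeconds option; a small sketch with hypothetical names:

// Documents become eligible for deletion 24 hours after their createdAt time.
db.sessions.createIndex({ createdAt: 1 }, { expireAfterSeconds: 86400 })
db.sessions.insertOne({ userId: "U-42", createdAt: new Date() })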

The point to note is that the document will live for at least the specified time and will eventually be deleted. Do not build your logic on the assumption that the document is deleted at exactly the moment it expires.

Also, this is something I would not recommend for production-grade code, because the background deletion task will remove documents during high-traffic hours too, putting load on the primary node. Instead, write your own purge tool that deletes documents during your chosen low-traffic hours.

Sharding, horizontal scaling & performance

Database systems with large data sets or high-throughput applications can challenge the capacity of a single server. For example, high query rates can exhaust the CPU capacity of the server, and working sets larger than the system's RAM stress the I/O capacity of the disk drives.

Overhead — there is the additional overhead of maintaining the config server replica set, plus the additional replica sets across which the data has to be distributed.

So it is advisable to start with a single replica set and move to sharding only when there is a business requirement to scale. With this approach, you will also have real data to look at when choosing an appropriate shard key.

Shard-Key Constraints

  1. Once defined, the shard key cannot be changed.
  2. Once sharded, a collection cannot be un-sharded.
  3. The shard-key value of a document is immutable before MongoDB 4.2.

Considerations for Shard-Key

  1. Cardinality
  2. Key distribution across cardinal values
  3. Hot spot consideration
  4. Hashed strategy — Optimal write scalability but read scalability is limited
  5. Range strategy (with and without Tags) — Optimal read scalability
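
To make the hashed and range strategies above concrete, here is a sketch; the database, collections, and key fields are assumptions:

sh.enableSharding("shop")

// Hashed shard key: writes spread evenly across shards, but range reads scatter-gather.
sh.shardCollection("shop.orders", { customerId: "hashed" })

// Range shard key (optionally combined with zones/tags): range queries stay on a few
// shards, but a monotonically increasing key can create a hot spot on one shard.
sh.shardCollection("shop.invoices", { region: 1, invoiceDate: 1 })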

Latency & ChangeStream

Change streams are a relatively new feature in MongoDB. They are not suitable for very low-latency applications where fast streaming of high-velocity data is a primary requirement.
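
A basic change-stream consumer in the mongo shell looks like this; the orders collection is an assumption:

// Open a change stream on the collection and print every change event as it arrives.
var watchCursor = db.orders.watch()
while (!watchCursor.isExhausted()) {
  if (watchCursor.hasNext()) {
    printjson(watchCursor.next())
  }
}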

Next: Step-5, Indexes and query performance optimizations.

All related links
https://medium.com/@sarada.sastri/mongodb-step-1-an-overview-791e4fe861aa
https://medium.com/@sarada.sastri/mongodb-step-2-setting-up-the-server-4b0fad446c3e
https://medium.com/@sarada.sastri/mongodb-step-3-mongo-client-commands-4366b5ef139f
https://medium.com/@sarada.sastri/mongodb-step-4-designing-the-schema-8c91ee947230
https://medium.com/@sarada.sastri/mongodb-step-5-indexes-and-query-performance-optimizations-ed1bf744315b


Sarada Sastri

Java Architect | MongoDB | Oracle DB| Application Performance Tuning | Design Thinking | https://www.linkedin.com/in/saradasastri/