Scalable IoT Projects with MongoDB: Gaining Value from IoT & Digital Twins

Hi, I'm Christian, and I have Julian and Robert with me. We are going to present how to build scalable IoT projects with MongoDB, which is a huge opportunity for companies to gain value from the Internet of Things and digital twins. To do so, we have summarized the experience we gained from more than 100 different IoT projects, and today we are going to take you on a bike ride. First, we will build a common understanding of IoT and digital twins. Then we will introduce IoT reference architectures and show where MongoDB is placed within those architectures. Because we want you to get started quickly, we will walk through several reference implementations based on the publicly available Citi Bike data, hence the bike ride, and those implementations are available on GitHub. Last but not least, we will also show how MongoDB can support you on the path from a first pilot to a successful solution in production. Having said that, Julian, the stage is yours.

-Thank you. As Christian mentioned, we are going to take you on an amazing ride through IoT wonderland today. Before we start, I want to give you some background on two major questions: why does IoT matter and why should you bother? When looking at the technological and product innovations of the last couple of years, we can see a strong trend in the direction of digital solutions and smart devices. This development moves incredibly fast, and what used to be seen as a technological breakthrough, like smartphones 10 years ago, is today simply taken for granted. This is why I want to stop this high-speed train for a second and take a look in the rearview mirror before shedding some light on what is about to come. The so-called Kondratiev cycles provide a nice framework for identifying significant technological stepping stones and getting a better understanding of their economic impact. Let's have a look at the chart. As you can see, with technologies becoming more mature, the effect of innovation on economic output has been increasing. Earlier examples are electricity and the chemical industry, which provided us with new sources of energy and materials for even more powerful machines. Next came petrochemicals and the automotive industry, leading to easier and more decentralized transportation of people and goods. They have been a game-changer in how we conduct trade and took labor mobility to a whole new level. All these inventions and developments changed the way we interact and came along with unprecedented growth in productivity and wealth, but now we are facing something new. The invention of the computer has triggered a new Kondratiev wave, one that has the potential to be the most impactful in history: the era of digitalization. People have become more connected than ever before. However, people were not the only ones; soon, they were joined by machines. This has unlocked extraordinary potential for an increase in efficiency, effectiveness, and automation. Machines have been getting smarter and more interconnected, forming what we call today the Internet of Things. Connecting the physical and the digital world is now a general, fast-moving trend across many industries, but how fast? In a Cisco study on the matter, they state that approximately 50 billion devices are already connected. This is almost seven times the global population. Boosted by breakthroughs in mobile bandwidth like 5G, this number will continue to grow, but to what number?
Researchers of this study expect that by 2030, the number of devices on the Internet of Things will reach 500 billion, which is 10 times the number of devices today. Why does this matter to your organization? In one of their studies, McKinsey stated that leaders in the IoT space are already witnessing compound effects on their costs and revenue streams of more than 15%. Chances are that this number is a mere shadow of what is about to come. Let's now look at what is necessary to build an IoT initiative. What lies at the heart of this transformation cycle is the digital backbone. Essentially, this is the foundation of all digital initiatives. The digital backbone is the core of the digital strategy of an enterprise, addressing and mitigating all IT complexities, risks, and barriers to innovation. This way, businesses can meet new market demands and become more competitive. The digital backbone includes all technology components that are required to allow companies to evolve and generate value across two dimensions: first, internally, focusing on the digitalization and optimization of internal processes in order to save costs, and second, externally, aiming to create new business models and revenue streams. Constructing this foundation and understanding what value the initiative adds to the business is crucial for the success of IoT projects. Before we go into the technical details to see what exactly this backbone looks like, let's do an overview of what a digital twin actually means. You can think of a digital twin as simply a digital representation of a physical asset. These can be physical assets like machines, buildings, or even whole cities, but also logical assets like processes or supply chains. Depending on the industry vertical, even you and me as human beings have the potential to have our own digital twins. For example, fitness and health trackers on our smartphones create your digital twin based on your physical activities. Another important feature of digital twins in IoT is how widely adaptable they are, thus allowing almost every industry to gain huge benefits. On the slide, you can see an overview of companies that already use MongoDB in their IoT projects: from generic IoT platforms like Bosch or [?], industry solutions like FANUC or Halliburton, networking equipment, smart cities, insurance solutions, or increasing the output of farmers by up to 8% with IoT. Not to forget the huge area of connected cars and the management of large commercial transportation fleets like Transics or MAN RIO. However, as the concept is quite broad and general, building a formal understanding around it is difficult. Naming conventions vary a lot across different industries. This is why we used our insights from IoT projects all around the world and boiled them down to the core concepts and direct implications. A digital twin is the combination of all information across the physical life cycle of a product, ranging from R&D over production, operation, and maintenance to decommissioning. All of these phases generate data that must be captured and consolidated in the information lifecycle, which is then used to describe physical assets, predict their behavior, and draw recommendations based on analytics. One of the most critical and often overlooked aspects when talking about digital twins is that the information lives longer than the physical product itself. This allows for a reflexive and thus continuous improvement cycle in which the collected data is used to improve both existing and future products, processes, or services. Why is it the case that digital twins are not adopted even further?
The answer is simple: it's hard. Common technology stacks are often not suited for the demands imposed by building an IoT solution. The volume of data, as well as the integrated nature of a digital twin across departments and sometimes even verticals, comes with a lot of challenges. This is why, according to McKinsey, only 30% of relevant IoT projects make it into production and a company-wide roll-out. They have consolidated the main technical challenges and requirements into one major statement: the capability to extract, interpret, and harmonize data from disparate systems that were not designed to work together and interchange data. Let's look at what this looks like at the level of the information life cycle. Each phase of the physical product life cycle can be mapped to different organizational departments, creating different kinds of data that typically originate from different systems. To make matters worse, all these departments have different structures and standards when it comes to working with data. Now all this data needs to be consolidated and made accessible via a unified service and API layer in order to visualize the information, provide a basis for machine learning and analytics, trigger actions on the physical device, and ensure availability for additional enterprise systems. Rob, can you please explain to us how MongoDB can help to overcome these challenges and give us an insight into why the document model is a good choice for IoT?

-Sure. Usually, IoT projects start with a pilot only focusing on some basic attributes of physical objects. If we map out these attributes of a thing, we could enumerate them into a nice singular list that fits well into a table. As your needs grow, you might add new sensors and devices. These may require the need to define different properties, schemas, and actions. In applying these new requirements to a relational database model, you can imagine how complicated things can become. Now, add the variability of user-defined data and the resulting changes to relational database schemas. It requires a massive amount of work, with more unit testing and lots of added risk, which ultimately slows down the development of new features in your product. Let's take a look at what makes MongoDB such a great fit for IoT data. MongoDB has a flexible data model based on JSON; it's simple to define and group together attributes and events from your things. Here we see a direct comparison between a relational schema and a JSON representation of the same data. While the data needs to be spread across multiple tables in a relational database, all the information for a particular asset is represented in one JSON document. Expensive joins are not needed, and documents for different types of devices can look different without additional needs in terms of data modeling. If you think in terms of the physical storage of this data, the relational tables could be spread over different areas of the disk, causing multiple disk I/Os to obtain the data necessary to perform the query. With MongoDB and the document model, it's one disk I/O, and I have all the information I need to satisfy the query. JSON is not just optimized for the storage of static data. As your needs change, as you add more sensors with different attributes or you start to add security policies and information about the actions of assets, this data can easily be added without changes to your application. Being able to adapt to change quickly gets even more important as soon as we talk about handling large data volumes. We've talked about the power and flexibility of JSON, and there's a clear trend towards leveraging JSON and using JSON-based standards like JSON-LD to represent linked data. This standard is also used for the Web of Things standard by the World Wide Web Consortium to describe digital representations. Since MongoDB is based on JSON, it is straightforward to adopt such a specification in your existing applications. JSON-LD is an ideal way to represent data and metadata as it provides context to the stored data. It provides a globally unique identifier for an object and still allows developers to use easy naming conventions while keeping a high level of precision. On a side note, JSON-LD is the representation behind the Google Knowledge Graph. What about time series data? After all, IoT is all about time series. It's important to note that there are a number of time series databases on the market that allow you to cover every potential edge case, like rolling up data on the fly from microseconds to years. While this is an amazing feature, it is unnecessary for most project requirements. MongoDB is a general-purpose database that handles time series data as well as many other use cases for you. What makes MongoDB such a good choice for time series data?
It's the flexible schema, the scalability, and a modern query language that can not only handle simple queries but also perform complex analytics and integrate with leading machine learning and AI platforms like Apache Spark. We have published a white paper that does an in-depth analysis of how to design time series schemas in MongoDB, leveraging the schema design pattern of bucketing. With bucketing, you optimize the storage of time series data by storing a specific time span per document, rather than a document per event. You can read up on the details of this approach in the white paper. Now, let's take a look at a real-life use case of time series data in MongoDB: Mercedes-Benz, a global car company. At a high level, all their modern cars transmit their status every few minutes into the vehicle data conditioning service. This data is used in a variety of use cases, such as in the call centers, in the garages, as well as by customers themselves via mobile apps. In addition, historical analysis is performed to improve the quality of the cars and their components. If we dig down into the details of the schema used by Mercedes-Benz, we see an example of the usual complexity when we talk about digital twins. It's not just about a few sensors sending data, but compound objects with a lot of details about the actual vehicle. Take a look at the vehicle ident data and the current basic data subdocument, which is the typical time series data many people talk about, like mileage and batteries. Then take a look at the detailed information about the electronic control units, containing lots of details about hardware and software, as well as diagnostic trouble codes. All in all, it's a lot of fine-granular data stored together efficiently. With Mercedes-Benz, we're speaking about multiple terabytes of data for this application, but this is not the limit of MongoDB
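To make the bucketing idea and the document model a bit more concrete before moving on, here is a minimal, purely illustrative sketch of how a vehicle digital twin with a small bucket of time series measurements could be modeled as a single document. This is not the actual Mercedes-Benz schema; all field names and values (identData, ecus, currentBasicData, measurements, and so on) are assumptions made for this example.

    # Illustrative only: master data, nested component details, and one bucket of
    # time series measurements combined in a single document for a single vehicle.
    vehicle_doc = {
        "_id": "vehicle-4711",                         # hypothetical vehicle identifier
        "identData": {"model": "EQ-Demo", "productionDate": "2019-05-14"},
        "ecus": [                                      # nested electronic control units
            {"name": "engine", "hwVersion": "2.1", "swVersion": "4.7",
             "diagnosticTroubleCodes": []}
        ],
        "currentBasicData": {                          # one bucket of recent measurements
            "bucketStart": "2020-06-01T10:00:00Z",
            "count": 2,
            "measurements": [
                {"ts": "2020-06-01T10:00:00Z", "mileageKm": 42310, "batteryPct": 87},
                {"ts": "2020-06-01T10:05:00Z", "mileageKm": 42315, "batteryPct": 86},
            ],
        },
    }

In a relational model, the ident data, the ECU list, and every single measurement would typically end up in separate tables; in the document model they travel together, one document per vehicle and time bucket.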

In another example, one of the globally leading IoT platforms leverages MongoDB for wind park monitoring. Here, we find about 25 terabytes of hot data produced by 6,000 wind turbines. Internally at MongoDB, we drink our own wine by storing and querying 25 billion data points in MongoDB Atlas clusters every single day. These are just some of the thousands of customers that have found success with MongoDB and time series data. If you want to learn more about the Mercedes-Benz use case, please have a look at [?] presentation, where he goes into the details of telediagnostics and the implementation at Mercedes-Benz. Now I'd like to hand it off to Christian to talk more about the MongoDB IoT reference architecture.

-Thank you, Rob. Based on this overview of MongoDB, let us now zoom into the reference architectures for IoT and how MongoDB can be positioned in the space of IoT and digital twins. First, I want to give you a high-level overview, and then drill down into a particular architecture covering Atlas. IoT architectures usually span three areas: we have the devices, we have the network for the transmission of data, and we have the applications. Let us start with the devices and the computing close to them, which is usually referred to as edge or fog computing. Of course, we have the typical sensors sending data and receiving commands in the sense of actuators. In most cases, they send the data to an edge device that collects the data and preprocesses it. This is where MongoDB already comes into play, with either a server installation or the mobile version of MongoDB Realm. Realm also plays a big role in developing applications, as it allows us to develop on mobile devices with offline-first synchronization capabilities. I do invite you to listen to the various other talks and the keynote for more information about the capabilities of Realm. Now, let's take a look at the back end of IoT solutions, which can be deployed on-premises, in the cloud, or in a hybrid way. Of course, the main task of MongoDB is the data storage itself. That means hot data storage for real-time data access, which can be done with MongoDB server or fully managed in the cloud via Atlas. We also provide additional cold data capabilities that we will see in a minute. When it comes to event processing, Atlas Triggers or the native change streams in MongoDB are here to help you, and there are third-party integrations like the Kafka connector. For real-time analytics, the aggregation pipelines come into play, and there are additional integrations available. The same holds true for advanced analytics: there is out-of-the-box connectivity to Spark, which can be leveraged, for example, in Azure Databricks, plus there are other frameworks like TensorFlow that work seamlessly with MongoDB. For visualizations, MongoDB Charts can be used, which works natively with the document model. For SQL-based access, the BI Connector is there to integrate any tool that speaks SQL. As promised, I would like to show you what an implementation in Atlas could look like. We've already seen that MongoDB can be used on the edge gateway itself. It's the basis for ingesting streamed data via its highly available replica sets that scale horizontally via sharding. In addition to those replica sets for the streaming data, we can provide workload isolation; that means having dedicated machines for potentially load-intensive tasks on your data. It's particularly noteworthy that this is ETL-free, as they are additional members of a replica set and automatically sync the data in real time. As stated previously, Atlas Data Lake offers an offline queryable archive on the blob storage of the cloud providers, and it also includes auto-archiving capabilities for your cold data. Furthermore, Atlas Search allows you to integrate Lucene-based full-text search on your data, and you are able to define the search indexes in a flexible way according to your application needs. Now, this is the beauty of Atlas, because all of these functionalities are accessible as one single database endpoint. There's only one query language that you need to use, providing access to all the different forms of data for all the different consumers, no matter whether you develop mobile applications, microservices, or visualizations, perform advanced analytics, or do reporting on top of this data. Perfect. Now that we've covered a lot of the architectural content, let us deep-dive into the reference implementations. We know it's usually very hard to start from scratch. This is why we want to provide you with some guidelines on how to get started easily and without running into issues later on that might break your whole solution. Let's get started. Fasten your seatbelts, or better, put on your bike helmets, because now we are going on a ride with Citi Bike. We have chosen this example not just because it is very easy to understand and implement, but because it is also a perfect example of bike sharing: you don't want to own the bike. It's also a perfect story for managed services in the cloud: Citi Bike is a managed bike service, very similar to managed MongoDB with Atlas. Citi Bike publishes real-time system data in the GBFS format for all bike stations across New York. This is a publicly available API that offers everything from station information to station status and additional information like alerts. It is provided in XML and JSON format, it can be queried very easily via RESTful calls, and we have the link on the slide here. All of the reference implementations will follow the same pattern we see here. It starts with the actual devices: our bike stations and their status information. Usually, an edge gateway is used that gathers the data from the actual devices and sends it to the back end. In addition, edge computing also allows you to process the data locally without any additional latency for transferring the data either to a central data center or into the cloud. The streaming and alerting layer is responsible for the secure and guaranteed bidirectional data transfer into the back end. This is especially important in highly distributed systems, as in IoT, because we usually have to cope with unstable networks. In the back end, the data is streamed into a hot data storage layer for real-time data processing, and to support batch analysis and machine learning across large amounts of data, a cold data tier is used. Then the data is used for visualizations and dashboards by applications for end users, as well as for advanced analytics and machine learning. Let us start with the first sample implementation. Each of these implementations will provide you with the full chain from the bike share feed into dashboards that visualize the high-turnover stations on a heatmap, which we see here in the upper picture, as well as [?]
bike availability per station, which we see in the lower screenshot. We always include the link to the GitHub repo, where you can find all the assets to play around with those implementations on your own. Of course, we use Atlas for data storage and MongoDB Charts to visualize the data and come up with those dashboards. For this implementation, which is very simple to do, we used a scheduled Python script that reads the station information on a daily basis, reads the station status every 30 seconds, and then, of course, stores this information in MongoDB. This is exactly what we want to have a look at here. As JSON is also frequently used as a data interchange format, using it in MongoDB is super simple. This example shows that we only need five commands to register and frequently update the master data of our bike stations in the database. The first step is to read the station feed, which is a simple web service call. Then we iterate over the stations that we received from the feed in order to leverage the indexed _id attribute; here we put the station ID into it. Because we want a proper GeoJSON format, we reformat longitude and latitude to form a proper GeoJSON point. Since all the stations can change their master data, we simply perform a replace by ID here, leveraging upsert=True. That means we automatically create new stations if they do not exist yet. As we modify a large number of stations, we batch those calls to MongoDB via a bulk write mechanism, and writing the bulk to the database is the last step. Now, for the status of each station, one single update statement is needed, which pushes the new measurements into the bucket. Again, we use upsert=True here, which allows us to automatically create new buckets once the current one is full. In this case, we have a bucket size of 120, which roughly holds one hour of data if we record a status every 30 seconds
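As a rough guide to trying this yourself, here is a minimal sketch of how these steps might look in Python with pymongo and requests. It is not the exact script from the GitHub repository; the collection names (stations, station_status), the selected status fields, and the GBFS URLs are assumptions for this example, and the Atlas connection string is a placeholder.

    from datetime import datetime, timezone

    import requests
    from pymongo import MongoClient, ReplaceOne

    client = MongoClient("mongodb+srv://<user>:<password>@<cluster>/")  # placeholder Atlas URI
    db = client["citibike"]

    # Assumed GBFS endpoints for Citi Bike station information and status.
    INFO_URL = "https://gbfs.citibikenyc.com/gbfs/en/station_information.json"
    STATUS_URL = "https://gbfs.citibikenyc.com/gbfs/en/station_status.json"

    def load_station_master_data():
        """Read the station feed and upsert one document per station (run daily)."""
        stations = requests.get(INFO_URL).json()["data"]["stations"]
        ops = []
        for s in stations:
            s["_id"] = s["station_id"]                       # station ID as the indexed _id
            s["location"] = {"type": "Point",                # proper GeoJSON point
                             "coordinates": [s.pop("lon"), s.pop("lat")]}
            ops.append(ReplaceOne({"_id": s["_id"]}, s, upsert=True))
        db.stations.bulk_write(ops)                          # one bulk write for all stations

    def record_station_status():
        """Push the current status into a bucket of at most 120 measurements (run every 30 s)."""
        now = datetime.now(timezone.utc)
        for s in requests.get(STATUS_URL).json()["data"]["stations"]:
            db.station_status.update_one(
                {"station_id": s["station_id"], "count": {"$lt": 120}},  # bucket not yet full
                {"$push": {"measurements": {"ts": now,
                                            "bikes": s["num_bikes_available"],
                                            "docks": s["num_docks_available"]}},
                 "$inc": {"count": 1}},
                upsert=True)                                  # full bucket -> new bucket document

At 30-second intervals, a bucket of 120 measurements covers roughly one hour, and the upsert on the count filter keeps creating fresh bucket documents as soon as the current one is full.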

As the flexible schema adapts to changes, it is also no problem if the bike stations send new data. We can simply add it to the database; no errors will occur, but, of course, the application needs to interpret this new data. If you want to get warnings or errors in case of non-matching schemata, you can do this in MongoDB with the support of the standard called JSON Schema. That has been the MongoDB-only solution. We now want to show you what a solution looks like on different cloud providers. We have chosen a consistent and simple approach across all three major cloud providers so that you can try out each solution and compare the different cloud providers with each other. We will also provide a highly scalable customer example later on that contains much more detail. Since Atlas is cloud-agnostic, it fits very well into all of those alternatives. The AWS implementation is based on IoT Core and leverages HTTP calls. That means we use a Lambda function that frequently calls the feed, parses the data, and therefore simulates the devices. In reality, the bike stations might send the data directly to AWS IoT Core and update the so-called device shadow automatically. A second Lambda function is then used to transfer the data to MongoDB Atlas, and it leverages the best practices for working with Lambda functions and Atlas, as well as the best practices for updating the devices and creating buckets for the time series data. As you might have imagined, the implementation on Azure looks very similar. We use a scheduled Azure Function to simulate the devices and transfer the data to Azure IoT Hub. In order to use a consistent approach, we also combine the station information with the station status here, which is then transferred to Atlas via another function that is triggered automatically as soon as new data arrives. Last but not least, there is a similar architecture on GCP, using IoT Core and again based on Cloud Functions. All three cloud provider architectures use the same simple approach of functions in order to show you a very basic implementation that you can find on GitHub. Again, our goal is to get you started quickly and to enable you to compare the different alternatives. Now let us have a look at a more comprehensive real-world architecture. This one shows a lot more cloud-native integrations of Atlas on GCP. It is very similar on the other cloud providers, AWS and Azure, but we have chosen GCP here because MongoDB has recently been awarded technology partner of the year. What does that mean for you? Atlas is available in the Google Cloud Console as a fully managed, automated database service across all global GCP regions, and it also allows for global clusters with global VPC [?] applications. High-performance IOPS are available, you have enterprise security in place, and all the operational best practices are fully automated and available for you by default. Also, the billing is integrated into the cloud provider's monthly invoicing via the marketplace. As you can see, there are many integrations with Cloud Dataflow, Cloud Data Fusion, GKE, GCE, the App Engine, TensorFlow, and many more. This also includes upcoming integrations that we are working on, for example, logging to Stackdriver and the IAM mechanisms for user authentication and authorization. While all of those implementations have shown how to work with the cloud-native integrations, the following two approaches show how we can become independent of the cloud providers and therefore not just avoid vendor lock-in, but also avoid the limitations of non-compliance with well-known IoT standards like MQTT. In this example, we leverage HiveMQ and provide two Python-based MQTT publishers and subscribers that use the same simple approach that we have seen in the Atlas-only implementation. HiveMQ allows us to use the full functionality of MQTT here, as well as the latest MQTT 5.0 standard, which is currently, unfortunately, not supported by the cloud-native solutions. Last but not least, a fully scalable, multi-cloud hybrid solution can be built on top of HiveMQ, Kafka, and MongoDB. Here I also invite you to refer to the presentation by Kai from Confluent. He outlines the implementation in much more detail and also shows an example of how to connect 100,000 cars and store the data in MongoDB. In the GitHub repository here, we provide you with a custom write strategy for the Kafka Sink Connector that demonstrates the efficient storage of time series data in MongoDB streamed via Kafka. Wow. With all these technical details, it is now very important to start your initiative right. Once you have terabytes of data, it's really hard to migrate away or incorporate changes. We want you to avoid any unnecessary pain, so let us talk about how MongoDB can help you from day one to start your IoT initiative in the right way and to be scalable and successful. In order to do so, we developed a model over time: the MongoDB innovation accelerator. As a first step, it helps you to build the foundation for a successful project. That means together we will establish a roadmap for building new applications, new functions, and use cases, which of course includes deep technical expertise and best practices for working with MongoDB. With this robust basis, you can now have short innovation cycles with an iterative development approach for your new value cases. Similar to lean start-up methods, a leading indicator for success in IoT is to start many different initiatives. Companies really need to be ready to experiment with ideas, to fail fast, and to bring those ideas with a high potential for success into production. What does that mean in practice? Bringing the innovation accelerator to life, this example shows how we built an IoT innovation factory for one of our customers. Their process starts with a constant brainstorming of different use cases. Then these use cases undergo a thorough evaluation from a technical and also an economic perspective. Once they pass that evaluation, an iterative approach for development and testing is started before a finished implementation is handed over to the line of business. What is the critical success factor here to work on more than 20 different use cases per year? It is to have a rock-solid foundation for the project and clear guidelines for innovation, because with that, developers can focus on creating value instead of fighting with technology for each and every single project. The next example I would like to outline is taken from another leading German automotive manufacturer. It shows what it actually means to get started with this new kind of project. This particular one has been the first cloud-only project for this company. We at MongoDB helped to build out the basis: a cloud operating model consisting of 18 different processes, plus the necessary implementation guidelines for rapid development and scaling in production. In the short time frame of just three months, this team was able to come up with two digital twin projects. One was the migration from DB2 to MongoDB, which serves almost four million vehicles with more than 500 million different components and over 300 million individual configuration combinations. The second digital twin has been built for electric cars, with an almost unlimited number of attributes per product that customers can configure. Okay, there we are. Thank you very much. We would like to invite you to get started today: sign up for a free trial of Atlas, play around with the reference architectures, and, of course, we are happy to receive your feedback. Thank you