Running H2O: Scalable Machine Learning on KubeFlow (Cloud Next '18)

[MUSIC PLAYING] Well, good afternoon, everyone, and thank you for coming to this session in the last series of sessions on the first day of Google Cloud Next. My name is Anthony Heilig, and I'm a technical program manager in Google Cloud. Kubeflow is one of the partner programs that I am responsible for, and I'll be presenting this afternoon's session along with Vinod and Nicholas from H2O.ai.

You can think of Kubeflow as TensorFlow packaged up into Kubernetes containers to make it easier for everyone to deploy, develop, and manage their portable distributed machine learning on Kubernetes. What we were seeing is that people were doing this in different ways: everybody had their own solution, their own architecture, their own custom bespoke thing. When we started the Kubeflow project, we said, you know what, this looks a lot like what we were seeing with containers before Kubernetes was developed back in 2014. Everybody had their own way of developing their container solution, and we wanted to standardize those APIs. Kubernetes has helped us do that, and it allows developers to focus on the higher value things and not necessarily worry about the infrastructure and how everything else is plumbed together.

So if we already had Kubernetes, why wasn't it easy to just throw TensorFlow into the mix and have everything work? Well, it turns out there's a lot you have to do to really make this work, especially the portable, distributed portion of it. You would have to become an expert in containers, packaging, Kubernetes service endpoints, persistent volumes, et cetera. We wanted to eliminate the developer's need to become an expert in all of these and get them back to developing their application. And also, once they develop it once, they should be comfortable that they can run it in several different places, because the goal of Kubeflow is to run anywhere Kubernetes runs. If we could fix this, we could really make life easier for you, and that's why we built Kubeflow.

We wanted to take having separate experimentation, training, and cloud clusters, on the right-hand side of this picture, and the workflow you see on the left side, which is a traditional machine learning workflow where you start with data ingest, end up with logging, and then the loop feeds back on itself so you can retrain, et cetera, and simplify it. With Kubeflow, your architecture becomes a Kubernetes installation, and this workflow is taken care of for you. You can do experimentation on your laptop via Minikube, you can do training in your on-premises cloud or some other small cluster installation that you may have, and then you can push to the cloud. Kubeflow makes all of this seamless because it hides all the nuances of the infrastructure underneath the covers that is Kubernetes, while maintaining the machine learning aspect and the distributed nature, and making sure you can use the different resources, whether they be CPUs, GPUs, or TPUs on Google Cloud.

And so what we wanted to do was make it easier to bring subject matter experts to machine learning, because this is the next level of unlocking that we're trying to do. Once we bring in these subject matter experts with deep domain expertise, they can speak the language of the problems they're trying to solve, as opposed to having to deal with all of the infrastructure and making sure everything is plumbed together and works correctly.
This is the vision that we have. And so with this introduction to Kubeflow, I'd like to turn it over to Vinod and Nicholas from H2O.ai to give you an example of some of the things that you can do with Kubeflow. Thank you. [APPLAUSE]

VINOD IYENGAR: Thank you for the great introduction. Going to skip to the next one. So I'm Vinod Iyengar. I've been with H2O for about three years, and I run a lot of the product alliances and marketing at H2O. Let me talk a little about who we are as a company. Quick show of hands: how many people have heard about H2O, the company, or used one of our open source products? That's quite a few. Oh, great. Perfect. So this one is, I guess, [INAUDIBLE]. We've been in business for about six years. We are a venture backed company of about 100 employees now, and that number is a little dated, with some of the world's leading AI experts working here. Most folks know us for the H2O open source machine learning platform, which is used by over 40,000 organizations globally.

Most recently, we launched a new product called Driverless AI, which is our automated solution, and we'll talk about that in a little bit of detail later. We are headquartered in Mountain View, right across from Google, on the other side, and we have offices in London, Prague, and India. Recently, Gartner named us a leader in the machine learning and data science Magic Quadrant. They called us a technology leader with the most completeness of vision. What they mean by that is that our technology bets, our mindshare, our partner network, and our ecosystem have essentially made us the quasi industry standard for machine learning and AI when it comes to enterprises. Most of the vendors on this quadrant, for example, actually use our open source platforms under the hood, and that gives us real credibility in the enterprise. One other thing that I'm extremely proud of is that our customers gave us the highest overall score amongst all vendors for satisfaction, relationship, account management, and support. That's something we take extreme pride in; we make sure that our customers get the best service and experience.

So let me talk about our product suite. As I mentioned, most folks know us for the open source products that are on the left. You see H2O open source, H2O Core as we call it internally; that's the in-memory distributed machine learning algorithms. You can drive it from R or Python, or use it with H2O Flow, which is our own interactive [INAUDIBLE] interface. There's Sparkling Water, which is essentially H2O Core running on top of Apache Spark. That's extremely popular too; nearly a third of all our open source usage comes from it. And last year we made the decision to start porting some of these algorithms over to GPUs. I'm talking about statistical machine learning algorithms, not just deep learning, which already runs on GPUs. We ported over things like GLMs, Random Forest, gradient boosted machines, clustering algorithms, and dimensionality reduction algorithms, so the core traditional machine learning algorithms are now also available on GPUs. And on the right, you see our commercial offering, our first enterprise offering, called Driverless AI, which takes all the benefits of open source but packages it all together to give you full automation end to end. You can go from data ingest all the way to production; all the intermediate steps, like feature engineering, model building, training, and [INAUDIBLE] tuning, are taken care of. So in a way, our goal is to democratize machine learning and AI: make it really easy, and give enterprises, developers, and data scientists the tools and products to do their jobs easily so they can focus on solving business problems.

So let me talk about our two core products here, and then we'll see a demo later by Nick, who's going to show how our products integrate really well with Kubeflow and, in general, with Google Cloud. If you are committed to using Google Cloud, and if you saw the keynote today, there were a lot of amazing announcements; you can take advantage of all those new things with H2O's offerings. So, a very quick overview of H2O, which we call H2O Core internally. What is it?
It's essentially a math platform: an open source, in-memory engine. What we've done is take the most common machine learning algorithms, as I mentioned, the GLMs of the world, Random Forest, gradient boosted trees, and rewrite them from scratch to be fully parallelized and distributed. So if your data is large, you just spin up more machines and we can handle all the data. That works really well in the Kubernetes and Kubeflow paradigm too, because when you're training on small data on your laptop, for example, and then want to burst to cloud for a much larger dataset, it's extremely easy to do that. It is written in Java, but we offer a really robust REST API that allows you to run it from R, Python, or H2O Flow, which is our own web UI. And of course, it is built for handling big data. People have run terabytes of data, even hundreds of terabytes, without an issue. You can use all of your data; you don't really need to sample. That's the real value proposition of H2O.

So what kind of algorithms do we support? This is just a select list. As you can see, there are most of the classic statistical learning algorithms, tree-based methods, neural networks, deep learning, and dimensionality reduction algorithms. We also support a lot of ensemble learning techniques like stacked training, time series methods such as iSAX, and word embeddings like [INAUDIBLE] and TF-IDF. In addition, a lot of hyperparameter tuning options are provided in an easy to use format. One of the things we did is create very thoughtful defaults for each of those parameters, so that even as a new user, if you don't know what those parameters mean, you don't have to figure them out. You can still run a good enough model, and then come back and start tuning them.
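To make that concrete, here is a minimal sketch of what this looks like from the Python client. The file path and column names are hypothetical; the calls themselves are the standard H2O Python API, with most parameters left at their defaults as just described:

```python
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()  # connect to (or start) an H2O cluster; the same code works against a remote cluster

frame = h2o.import_file("data.csv")              # parsed into a distributed H2O frame
train, valid = frame.split_frame(ratios=[0.8])   # 80/20 train/validation split

# Only the essentials are set; everything else falls back to the defaults.
model = H2OGradientBoostingEstimator(ntrees=50)
model.train(x=["feature_1", "feature_2"], y="label",
            training_frame=train, validation_frame=valid)
print(model.model_performance(valid=True))
```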

In addition, we also added an automatic machine learning implementation, so it can run through a whole bunch of algorithms, try a bunch of tuning parameters for you, all automatically, and give you a good result.

So when you think about the typical machine learning workflow, going back to Anthony's slide, on the left side you have your data integration happening: bring data in from different silos or tables and do your joins. Once you're done, you have a data frame ready for machine learning. From there, you do feature engineering, model building, and model deployment. H2O-3 essentially fits in the model building and deployment stage. You need to do your own feature engineering and your own data [INAUDIBLE] transformation, but once the data is in a good enough format, you can use the H2O algorithms to run on the whole dataset and build models.

From an architecture perspective, this is how it looks. What happens under the hood is that, on the left, you've got all the data sources; we support most common enterprise data sources, so you can pull in data from wherever you have it. Once the data is in H2O, the H2O data frame is essentially a distributed key-value store that spreads the whole dataset across all the nodes you have up and running. This is all abstracted for the end user; the end user doesn't need to know how the distribution is happening. Then you can do exploratory analysis and feature engineering, run one of the different algorithms we support, go through hyperparameter tuning, and do the evaluation and scoring. All of these steps in H2O are fully distributed for you under the hood. We take care of the work, like the non-blocking hash tables, the in-memory map-reduces, and the distributed fork-joins. All of those things are fully abstracted, so you as an end user don't have to worry. The code that runs on your laptop will, with one minor change, run on a cluster. It's as easy as that.

And then the cool part is that after you've done all the modeling, you can export the artifacts as either a plain old Java object (POJO) or a model object (MOJO) and deploy it into a whole bunch of scoring environments. This code is highly optimized for low latency scoring; you're talking about microsecond and millisecond level latency. And it is completely independent of the training cluster, so you can shut down the training cluster, take the scoring artifact, and deploy it anywhere. So it's a really clean end to end story for enterprises, for any dataset.
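As a sketch of that export step, continuing from the hypothetical model in the earlier example (the paths here are illustrative):

```python
import h2o

# Export the trained model as a MOJO, plus the h2o-genmodel.jar runtime.
mojo_path = model.download_mojo(path="/tmp", get_genmodel_jar=True)

# The MOJO scores without the training cluster; h2o-genmodel.jar is the only
# runtime dependency. For quick offline checks there is also a Python helper:
preds = h2o.mojo_predict_csv(
    input_csv_path="new_rows.csv",          # hypothetical scoring input
    mojo_zip_path=mojo_path,
    genmodel_jar_path="/tmp/h2o-genmodel.jar",
)
```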
So that was H2O Core, the open source product. Let's look at Driverless AI for a second. What is Driverless AI? We think of it as an expert data scientist in a box. We took all the learnings from five or six years of open source usage, from the community and from our customers, and created this software. It's actually built by experts: expert data scientists built the software, and we've taken their best practices, their heuristics, and their workflows and codified them. I'm sure most people here have heard about Kaggle; if not, Kaggle is a data science competition platform [INAUDIBLE]. It's actually owned by Google as well. We hired five of the top 100 grandmasters from Kaggle, and we took their workflows and basically codified them. So you're essentially getting grandmaster level machine learning with this product. And how do we do that?

So again, let's go back to this enterprise machine learning workflow. You have data coming in from different sources; you still need to do the ETL part, and you need to do the data quality and transformation as well. We do some of it, but not a whole lot. But once the data is in a single data source, a single file or a data frame, from that point on you can let Driverless AI take care of everything. We do all the feature engineering for you: we'll do the [INAUDIBLE] encoding, we'll do the target encoding, we'll do dimensionality reduction for you, and we'll find the right interaction terms, all automatically, depending on the type of dataset and type of problem [INAUDIBLE]. All you need to do as a user is specify what you're trying to predict, and we take care of it from that point. And in parallel, we do model building as well, so it's a very intertwined process: feature engineering and model building go back and forth iteratively. It goes through this entire process to give you a pipeline. This pipeline will have all the feature engineering steps that it found useful and the final model or ensemble that it picked as the most effective. And this pipeline is generated as code for you. So again, going back to the original paradigm, you want deployment-ready code that is independent of the training cluster, so you can take the code and deploy it into production, on the edge if you want to. And finally, we do another cool thing: we actually try to interpret the model for you.

So what we're trying to solve is three things. There's a lot happening in this slide, but essentially it's talent, time, and trust. We found that those are the three big hurdles to enterprise AI adoption. Talent: there is a deep lack of expert AI talent. That's no news to anyone; data scientists are a hot commodity, and they are extremely [INAUDIBLE] to hire and keep as well.

So what do all the enterprises around the world do? That's a big challenge. The second problem is time. It takes a lot of time to build models and train them effectively, and even with all the advances on the hardware side, you often don't have the software to take advantage of them. And finally, even once you build a good model, these models often don't go to production, because the business users don't trust them. In enterprises, the regulators are afraid of black box models; there is an inherent lack of trust in the AI.

So how do we solve all these problems? We solve the talent problem by using Driverless AI as, in effect, additional data scientists; it can make your existing workforce more effective. We solve the time problem by automating the entire workflow, and by using the latest GPUs and hardware to accelerate the machine learning workflow, so you can get your results in a matter of hours instead of weeks or months. And finally, for every model we build, we are able to generate explanations that are fully interpretable.

So the question comes up: how accurate are we? Automation is cool, [INAUDIBLE] is the buzzword now, but is it really good? What we did is take Driverless AI to task and participate in Kaggle as a bot. In some of the competitions [INAUDIBLE], and this one in particular is pretty cool: the BNP Paribas competition, which was a year old. Out of the box, Driverless AI came in number 10 on the leaderboard out of about 3,000 participants. Just for context, the people who ranked higher than Driverless AI had, on average, anywhere from 10 to 25 submissions, and a couple of people had over a hundred submissions. These people spent weeks, if not months, to get a higher score, and in a matter of a few hours, Driverless AI ranked number 10. So it can perform at the level of an expert Kaggle grandmaster; that's what we're talking about here. And this is not just one competition. We participated in a bunch of other competitions, and it typically comes in the top five, and sometimes the top one, percentile without any human intervention.

And just to give you a sneak preview of what's happening under the hood, these are the kinds of features that get created: automatically generated features like text handling, frequency encoding, cross-validated target encoding, and truncated SVD, for example. These are all complex features where, even if you know how to use them, it's painful to actually code them up in Python and then try to run them on all the possible combinations. So it's a lot of time and effort even if you have the expertise, and we do all of that for you [INAUDIBLE].
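As an illustration of why these transformations are fiddly to hand-roll, here is a hedged sketch of cross-validated (out-of-fold) target encoding in plain pandas. This shows the general technique, not Driverless AI's actual implementation:

```python
import pandas as pd
from sklearn.model_selection import KFold

def cv_target_encode(df, col, target, n_splits=5):
    # Each row is encoded with the target mean of its category, computed
    # only from the other folds, so a row never sees its own label.
    encoded = pd.Series(index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, val_idx in kf.split(df):
        means = df.iloc[train_idx].groupby(col)[target].mean()
        encoded.iloc[val_idx] = df.iloc[val_idx][col].map(means).to_numpy()
    # Categories unseen in a fold fall back to the global mean.
    return encoded.fillna(df[target].mean())
```

Driverless AI searches over many such encodings and their combinations automatically, which is the time-and-effort saving being described.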
So one final thing, about deployment. After the model is built, as I mentioned earlier, you get a Java artifact: the POJO or MOJO for H2O, and for Driverless AI you also get a MOJO file. That's essentially a zip file that contains the binary representation of the model along with all the feature engineering information. It's completely self-contained, with no dependencies, so you can put it on the edge; you can run it wherever you have a Java runtime environment. And it's easy to add other language bindings too, so it's a really nice deployment story. You can have one of these models running on your phone, for example, and it'll still give you low latency scoring, with millisecond-level response times. So you don't have to call a REST API or an endpoint if you don't want to; you can embed these models in your applications. Cool. With that, I am going to invite Nicholas over to talk about how H2O fits into Google Cloud Platform.

NICHOLAS PNG: Hello, everyone. We're going to talk about a couple of things. First off are a couple of pipelines that we've run, or played with, on GCP, as a lead-up to the final one, which is our demo on Kubeflow. Obviously, it being in the title, it should come last and be the most important thing.

So firstly, we have a high level deployment of the pipeline that we might suggest if you are already utilizing Spark. A lot of companies already use Spark as part of their data engineering platform. What this provides is the ability to ingest data directly through Google Dataproc, which is Google's offering of Apache Spark. It makes things very easy, because you don't have to set up the infrastructure; Google sets up and pulls up the infrastructure for you. Then we can play on top of that using our offering of Sparkling Water and H2O. You can do your data munging, you can do your data engineering, and once you're done with that, you can output the final file into Driverless AI.
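A rough sketch of that hand-off in PySpark with Sparkling Water follows. The bucket path and the munging step are placeholders, and note that older Sparkling Water releases passed the SparkSession to H2OContext.getOrCreate() explicitly:

```python
from pyspark.sql import SparkSession
from pysparkling import H2OContext

spark = SparkSession.builder.appName("h2o-on-dataproc").getOrCreate()
hc = H2OContext.getOrCreate()   # starts H2O workers on the Spark executors

raw = spark.read.csv("gs://my-bucket/raw.csv", header=True, inferSchema=True)
clean = raw.dropna()            # stand-in for the real Spark-side data engineering
frame = hc.asH2OFrame(clean)    # now usable by H2O algorithms, or exported for Driverless AI
```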

From here, Driverless AI can do all the wonderful things that Vinod mentioned: automatic feature engineering, automatic hyperparameter tuning, and model building. Then you can serve the final model out as a MOJO or a Python scoring package, again on GCE. Both Driverless AI and the MOJO scoring pipelines can be served on VMs in GCP, on Google Compute Engine. And the nice thing is that you can serve MOJOs on a smaller platform without GPUs, because they don't actually need GPUs.

Another quick pipeline we have goes straight from BigQuery, and this highlights a little of the integration work we've done on data ingest for Driverless AI. Here the data is stored directly in BigQuery; the other steps are moot in this case. You can do the data munging directly in BigQuery, meaning you formulate a SELECT statement, a SQL query that gets you the data you want, at which point you can copy-paste that same SQL statement directly into the Driverless AI UI through the data ingest. Driverless AI will grab that data and run with it, and the backend of the pipeline is the same.
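For reference, the same munging query could be prototyped from Python with the google-cloud-bigquery client before pasting it into the Driverless AI UI. The project, dataset, and columns below are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")
sql = """
    SELECT store, dept, date, is_holiday, weekly_sales
    FROM `my-project.retail.walmart_sales`
    WHERE date < '2012-01-01'
"""
df = client.query(sql).to_dataframe()   # or export to GCS for larger results
df.to_csv("train.csv", index=False)     # one flat file, ready for ingest
```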
And finally, a quick highlight of H2O plus Kubeflow before we get to the demo. One nice thing about Kubernetes, and Kubeflow in general, is that Kubernetes lets you distribute your workload, so you don't need to give a whole machine over to a program that only really needs one CPU and maybe a couple gigs of RAM to support it. Now, obviously, H2O's Core product and Driverless AI can require significantly more resources, but the nice thing is that if you have a cluster on GKE, you're able to distribute the workload for H2O Core, and then, as I'll show in the example, also deploy another pod that contains Driverless AI. It's all running on the same infrastructure, and it's all very easy to do, because all you need to do is give it two Docker containers, one for H2O and one for Driverless AI, which we provide.

Cool. So at this point, let me just sign in here. OK, we're going to go over a quick demo. What I have here on the left-hand side is the GitHub repo. It's public: h2oai/h2o-kubeflow, if you take a look up here. This is actually going to be merged into the actual Kubeflow repo eventually; I think at the moment we have an open issue for it. And I just want to show you my infrastructure right now. We have a quick, small demo cluster; it's not tiny, but it's not enormous either. Basically, it comes down to about 32 CPUs plus some amount of RAM, and it's a GKE cluster, so it's running on GCP. Another quick look: it's basically two VMs, that's all that's there.

On the right-hand side, we have a terminal. Kubeflow uses something called Ksonnet, a tool that allows for templating, and in my opinion this is one of the really great things Kubeflow adds on top of Kubernetes. With Kubernetes itself, when you want to make a deployment, you have to create a YAML file, a manifest file, and say explicitly that this is the runtime environment my container needs to have. Now if, for example, my original container was a testing container that only needed four CPUs and a bit of RAM, and the next one needed to be much more powerful, you would have to actually change the YAML file; you'd have to change the manifest and request more resources. That can be a pain in the butt, especially if you are not in an environment where you can easily change these things, or if you're trying to do it automatically, right? The cool thing with Kubeflow is ks init. This directory has already been init-ed, but I'll go back just to show you the idea behind it. If I do ks init demo demo, what that does is create a folder, configure kubectl to point to that specific folder, and set up the runtime environment under which you can deploy Kubeflow, as well as anything else within that area.

So now we'll actually go back to the Kubeflow demo here. Sorry. What you'll see here is basically what Ksonnet has created: some environments, some components, and an app.yaml. If we look into components, it shows some of the components that I've already added. Ksonnet allows you to add a registry, and a registry adds a list of packages that you can install, after which you install specific packages. So if I do ks pkg list, you'll see, the asterisk shows what's installed, that I've installed our H2O package, I've installed Driverless AI, and I've installed the Kubeflow core packages. These are all running, so if I do kubectl get pods, I have a couple of them running already, pre-baked, so that I can show you what's going on on the front end as well as on the backend.

So from where we're at here, you've installed the packages using ks pkg install, and then you'd write out, in this instance, h2o-kubeflow, and then I can do /h2o3-scaling. I haven't installed this package, it's basically a work in progress, but it gives you an example of how you would install said package. From here, if we do kubectl get svc, this shows you the endpoints. I'm using a load balancer to expose the port numbers so that I can use the APIs in the web browser, or in a Jupyter notebook, or something of the sort. Then the last step after you've installed the package would be ks prototype use io.ksonnet.pkg.h2o3-static, and this is basically the first line; you can read through a lot of these deployment steps in the repo. I'm not going to go any further than giving you the first step, but what this does is create those components I showed you initially. So again, if you ls into the components directory, you can see that after using prototype use, you get this component installed in the Ksonnet app. And if you'll notice here, you would have had to pass a couple of parameters, and that's going to be this file right here: if you take a look at the params file, you can see that it sets some parameters for each of the components that have been deployed.

So now we're going to dive a little deeper into what happens after you've deployed. Sorry, just one last quick thing:
ks apply is how you actually deploy. You do ks apply with the specific environment that you want, and then also the component; in this case, h2o3-static would be one of the components that I had already created. When you do that, the package along with its parameters is passed to Kubeflow, which deploys your packages, pulls up as many replicas as were requested, and allocates the specific resources requested for each pod.
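Putting those steps together, the end-to-end flow looks roughly like this. The registry URL and parameter names are illustrative, so check the h2o-kubeflow README for the exact flags your package version expects:

```sh
ks init demo && cd demo                       # scaffold a ksonnet app
ks registry add h2o-kubeflow <h2o-kubeflow repo URL>
ks pkg install h2o-kubeflow/h2o3-static       # asterisk in `ks pkg list` = installed
ks prototype use io.ksonnet.pkg.h2o3-static h2o3-static   # generate the component
ks param set h2o3-static replicas 3           # tweak parameters, no YAML editing
ks apply default -c h2o3-static               # deploy the component to the cluster
kubectl get svc                               # find the load-balancer endpoints
```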

So now we're going to go quickly, first to the H2O deployment, and this is H2O Core. If you go to the external IP in the load balancer, over HTTP on 54321, which is the exposed port, you can see that this is H2O Flow. This is what Vinod was talking about earlier, our simplified GUI. It's very similar to a Jupyter notebook in the sense that it gives you a step-through notebook, and you can do some of the simpler steps there. Rather than coding this in Python, you can just go to import files: I can set a variable, say locA, for location A, equal to some string here, and then import the file, just like this. Instead of using Pythonic language, it's very clearly one step, then the next thing, and so on, right? And I'm going to go ahead and leave that now.

So now, if we go to a Jupyter notebook. Sorry about that; let's open up the demo. This is just a quick notebook showing you what would happen, and this would be from your local computer. You import the H2O packages and connect to the specific URL plus endpoint, and it will show you the resources. We have a cluster of three nodes, because that's the deployment package that I made. Go ahead and import some data, go ahead and set up some data; this is running through a demo that's in our documentation, so if you look at docs.h2o.ai, you can see the same demo, or something fairly similar. Ultimately, what this comes down to is that we're running H2O's AutoML package. AutoML means it's going to do an intense grid search across several different algorithms that we have and try to find the best one, and you can track progress here. Basically, all I've done is set a single parameter saying don't run more than 120 seconds; otherwise, take any algorithms that you want, try as much as you can, and give me the best possible result.
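That notebook boils down to something like the following sketch. The connection URL, dataset path, and target column are placeholders for this demo's values:

```python
import h2o
from h2o.automl import H2OAutoML

# Connect to the H2O pods exposed by the Kubernetes load balancer.
h2o.connect(url="http://<external-ip>:54321")

train = h2o.import_file("train.csv")   # placeholder for the demo dataset

# Grid-search across algorithms, capped at 120 seconds as in the demo.
aml = H2OAutoML(max_runtime_secs=120)
aml.train(y="label", training_frame=train)
print(aml.leaderboard)                 # best models found within the budget
```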
So while it's running, I'm going to show you the other product that's running, which is Driverless AI. Driverless AI is exposed at this IP right here, on port 12345. So if I go to 12345, you'll see Driverless AI running, and I've already loaded a couple of datasets. In case you're not familiar with this, the first step would be going through and adding a dataset. We do have direct integrations with Google BigQuery as well as Google Cloud Storage, so if I wanted to go to Google BigQuery, I could type in the dataset ID right here, type in the GCS bucket right here, and then do some sort of select, SELECT * FROM DATASET. In the interest of time, I'm just going to close that window and show you what we have already.

What I have here is a time series dataset. This is a Walmart dataset from Kaggle; basically, it's predicting the weekly sales of a Walmart store over time. The first thing Driverless AI gives you is a couple of options straight from the Datasets tab. First, you can describe it. This gives you some summary statistics: what type the values are, how many values there are, whether any are missing, and the mean, min, and max, just some very basic statistics. A lot of you would see something fairly similar if you went to pandas and did a describe. The next thing we have is visualizations, from a package we call AutoVis. AutoVis gives you automatically created visualization plots; Driverless AI does this for you, and it shows a lot of the plots that, as a data scientist, you probably want to see. Outlier plots are very important, for seeing whether, for some reason, the whole distribution is here but some data points are lying way out to the side; this would be on a specific column here, and if you scroll through, you can see each different column. You can also get radar plots, histograms or skewed histograms, and correlation graphs, which are really cool. You can see the correlation between columns, and here everything seems to be correlated with is_holiday, which makes a lot of sense considering this is retail.

And then after this, you can create an experiment. The first steps were all about: is the data good? Did I import the right dataset? Do I feel comfortable running an experiment on this? Once I'm there, I can go to Predict; I'm ready to do a prediction. At this point, you can run a prediction, and that means adding a test dataset if you so choose. If you do not add a test dataset, that's fine; Driverless AI will notice and automatically use cross-validation folds to test and train. It will do testing and validation on the out-of-fold data, so you don't ever have to worry about cheating, essentially.

So, the time column: this is a time series dataset, and it's one of those things where we do need to address time.

And we don't want to look at September to predict August sales, right? So you can always set Auto for the time column; otherwise, if you know the specific time column, you can go in and say, OK, that's actually just the date. Then from here, you can select how many weeks forward you want to predict, and how many weeks after, and again, Auto is an option; otherwise, you can go in and select it more specifically. Then finally, as we said, a target column, which in this case would be weekly sales.

From here, you get the final step, which is selecting your scorer. RMSE, RMSLE, RMSPE, MAE: these are all scorers for regression, and the reason is that you're predicting a revenue, so this is a regression problem. And you have three knobs: Accuracy, Time, and Interpretability. Interpretability is how interpretable you want the model to be, on a scale of 1 to 10. An interpretability of 1 means you don't mind having a very complex model, versus 10, which says make it as simple as possible, so that the feature transformations are all very clear and easy to understand. Time: less time means take a shorter amount of time to run; more time means take longer, I don't mind, run as long as you need. Accuracy: how highly do I prioritize accuracy? Is accuracy more important, can I have a more complex model, things like that. At which point, you can click Launch Experiment.

So once you launch, it gives you this kind of UI. From here, you can watch as it progresses through the [INAUDIBLE] and see more iterations. As variables change, you'll see variable importance here, and you can check on CPU usage, an actual-versus-predicted graph, and some logging and trace values up front. I actually have an already completed run, so I'm just going to show you that right here so we can talk about it a bit. What it comes down to is that you can see the graph has filled itself in, and you can take a look here. This is kind of cool, because it shows that the only feature in the top five that was kept from the original dataset is Department. You can see that some of the others are target lag values, and one of them is a clustered value: using a clustering algorithm of our making, we clustered the values between Department and Markdown number 4, and from that we created a new value that becomes the value in that column. And then finally, there's another lag value at the top here.

The other thing you can look at here is the iterations. Over time, it started out with a couple of models, and you can see some outlier models which were really not performant. Obviously, with RMSE, the lower the better, right?
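For reference, the regression scorers he lists are the standard ones. In the usual notation, with $y_i$ the actual and $\hat y_i$ the predicted weekly sales:

$$\mathrm{RMSE}=\sqrt{\tfrac{1}{n}\sum_{i=1}^{n}(y_i-\hat y_i)^2},\qquad \mathrm{MAE}=\tfrac{1}{n}\sum_{i=1}^{n}\lvert y_i-\hat y_i\rvert,\qquad \mathrm{RMSLE}=\sqrt{\tfrac{1}{n}\sum_{i=1}^{n}\bigl(\log(1+y_i)-\log(1+\hat y_i)\bigr)^2}$$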
So we started out at around 2,600 RMSE, and the final model ended up a little bit below that. Over time, basically, if we had given it more time, made it a little less interpretable, or prioritized accuracy more highly, it would have progressed a little further. But for these settings, this is what you would expect: it progresses over time and finds a certain point where it stops, because it decided this was good enough; this is what you were expecting.

And I'm just going to quickly talk about this. We didn't run a pre-baked MLI here, but this is machine learning interpretability. Essentially, it comes with three surrogate models: a random forest, a single decision tree, and K-LIME. K-LIME is essentially LIME, but with K LIME models. It creates a series of linear models that approximate what the actual model is doing, so it's basically trying to give you linear betas, the same idea as a linear regression, where you have a beta on each feature that explains how a change in that feature would affect the prediction. That's K-LIME. The next one is the random forest, which gives you local importance plots, basically feature importance but with positive and negative contributions; there's also a feature importance that is just positive. And then finally, there's the decision tree, which allows you to trace the path of a single decision tree. Very straightforward: you can see, if it split on this specific column, why it made the decision. [MUSIC PLAYING]
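To make the K-LIME idea concrete, here is a hedged sketch of the general technique: cluster the rows, then fit one linear surrogate per cluster against the complex model's predictions. This is the textbook pattern, not Driverless AI's actual implementation:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge

def k_lime(X, model_preds, k=5):
    # Partition the data, then approximate the complex model locally with
    # linear surrogates; each surrogate's coefficients are the "betas" that
    # explain how a feature change moves the prediction in that region.
    clusters = KMeans(n_clusters=k, random_state=0).fit_predict(X)
    surrogates = {}
    for c in range(k):
        mask = clusters == c
        surrogates[c] = Ridge().fit(X[mask], model_preds[mask])
    return clusters, surrogates

# Example usage with a hypothetical fitted model `model` and feature matrix X:
# clusters, surrogates = k_lime(X, model.predict(X))
# betas = surrogates[0].coef_   # local explanation for cluster 0
```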