Even More Python for Beginners – Data Tools (Full Series)

>> Welcome back yet again for Even More Python for Beginners. This time, we’re going to focus in on data libraries. Now, if you haven’t seen any of our videos before, well, you should definitely go check those out before you check this one out. But let’s start off by getting in, doing some basic introductions, and answering three questions: Who are we? Who are you? And what is it that we’re going to be doing today? So let’s start off with that first question: who are we? >> Great. So my name is Susan Ibach, and if you’ve seen our other videos, hi again. I’m the CEO and Head Geek of HockeyGeekGirl Incorporated, an instructor and instructional designer. I am a data geek; data is my happy space, so I’m happy to be talking here about some data today. I’m also a mom, a marathon runner, and a thrash metal fan. >> As for me, I am Christopher Harrison. I’m a long-time geek. I started actually when the Vic-20 came out, really dating myself with that little reference. I am a Program Manager inside of Academic Ecosystems. I have a propensity to tell dad jokes, as well as the propensity to say the word propensity. I’m a husband, I’m a marathoner, and I’m a father, if you will, of a four-legged child, also sometimes known as a dog. She’s an adorable little black and white Amstaff mix. But anyway, so who are you then? Well, here’s what we’re assuming. We’re assuming that you’ve done some level of Python and that you’re maybe looking to explore some data science articles, I think would be the best way to describe this. That you might be going, “Hey, there’s a new little quickstart here. It walks me through going in, setting up a basic little model and predicting a value, or seeing whether or not somebody’s going to buy a bike or something like that.” You’re trying to go, “Hey, what does this code mean? What are these libraries? What are these classes? What are these objects? What are those things?” That’s what we really want to try and show off to you. >> Yeah.
We want to help you get to that point so that when you’re going through those data science tutorials, you can focus on learning the data science in those tutorials, and hopefully have a slightly higher comfort level with some of the code that you might encounter when you’re looking at those articles. >> So then to support that, what we’re going to do here today is introduce some of the common libraries, tools, and techniques that you’re going to be using, things like Jupyter, DataFrames, and so forth, that are really going to, I’d say, lay the foundation for what is going to be your journey, your adventure into data science. So, if you’re ready? I think we’re ready. >> I think we are. Then let’s introduce some code. >> Yeah. Let’s get into it. You’re just trying to figure out, what are these different things that I’m being told to just follow along with and just run some code? Hopefully we’ll explain some of what’s going on inside of there. >> Because a lot of times, a lot of people get into Python because they’re interested in exploring that world of data science. By opening up that first data science tutorial, you’ve got all the data science concepts and, at the same time, all-new Python libraries on the page that you might not be familiar with. So what we’re hoping here is to introduce you to some of those Python tools and some of those libraries, so that maybe when you do that data science tutorial, you can focus more on just the new data science concepts. >> So that’s really what it is that we’re going to be doing here, is introducing those libraries. We’ll walk you through some common tools and techniques, and navigate through some common scenarios. The way that we’re going to be doing it is by walking
through really some basic scenarios. You’ll notice that there’ll be a little bit of a common thread throughout a lot of these modules, or that a lot of these modules, a lot of these concepts, will build on the ones prior. That being said, they’re still all designed to be snackable, all really short, so you can jump in, take a look at the particular tools and concepts that you need, and then jump back out from there. >> But you’ll find it easiest to learn this particular series if you do watch it sequentially. You can still skip the PowerPoint presentation and jump straight to the demos if you’re one of those people who just wants to see the code in action. You can still absolutely do that; just follow the code modules. But you will probably find it easier to follow the concepts if you do the modules in order. >> So let’s head on into it. >> Let’s start coding. >> One of the things that you’re going to notice when you start getting in and playing around with data is the fact that you’re going to work very iteratively: you’re going to try a couple of things, go back, make a couple of changes, try it again, go back, make a couple of changes, try it again, and so forth and so on. The problem with doing that with more traditional tools is the fact that you oftentimes have to set up a little bit of scaffolding, and getting to the exact spot of code that you want to be able to run can be a little bit clunky. So this is where Jupyter comes into play. So what I want to take a look at here is the concept of Jupyter Notebooks, and where it is that you can set this up, where it is that you can run this, etc. So Jupyter is an open-source platform, or framework if you will, and really what it’s there to do is to give you an integrated development environment. So that way, what I’ll be able to do is maybe do a little bit of prereq work, get to the exact spot that I’m wanting to play with, and just keep going, doing whatever it is that I need to do in here. Once I’ve got that done, then I can move
on to the next block, and maybe I need to go back and make a couple of changes and then run back down, and so forth and so on. That’s what this is all going to enable. Now, you can install Jupyter locally on your system: if you head on over to Jupyter.org, you can download and install Jupyter, get it up and running, and actually get the browser experience that I’m going to be demoing for you in just a couple of moments, running locally on your system. Now, I’d be remiss if I didn’t highlight the fact that Visual Studio Code also has a wonderful integration for Jupyter Notebooks. So if you really like to use Visual Studio Code and you want to stay in that environment, then you can configure Visual Studio Code to give you the ability to run your notebooks right there inside of the editor, which is personally one of my favorite little things inside of VS Code. If you check out the GitHub repository that’s part of this course, you’ll see a link to how you can go in and set all of that up. I also want to highlight, and this is what I’m going to be using for my demo, the fact that you can run Jupyter in the cloud. There’s a couple of different ways that you could do this. In my case, I’m going to do it by utilizing the Azure DSVM, or Data Science Virtual Machine. What the DSVM is, is a virtual machine, as you might expect, that has a whole bunch of tools already installed, and it’s perfect for getting in and playing around with data. Most of the things that you would need, including Jupyter, are going to be right there in that nice, neat little package. All that you have to do is to say, “Yes, I want this,” and then create it, and away you go from there. Once again, we’ll have the instructions on how you can set that up inside of the GitHub repository. Let’s check out how we can use Jupyter to start playing around with some code. So what I’ve got here is the Jupyter Hub set up inside of my DSVM, my Data Science Virtual Machine.
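As a quick aside, if you do go the local-install route mentioned above, one common way to get the same browser experience is the following (a sketch, assuming you already have Python and pip installed; see Jupyter.org for the current official instructions):

```shell
# Install the classic notebook server into your current Python environment
pip install notebook

# Start the server; it opens the notebook file browser in your web browser
jupyter notebook
```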
I’ve already got a few different folders and things like that, that come just prebuilt inside of there. But what I want to do is play around with a brand new one and start creating some notebooks inside of there. So what I did is I hit “New” right up there on the top right. What you’re going to notice is I’ve got the option to hit “Folder.” So I’m going to create a brand new folder here. You’ll notice that it will just automatically give me this new little folder right here called Untitled Folder.

In order to rename that, there’s an additional step here: I need to hit the little check, then come up here, hit “Rename,” and now I can give it my very creative name of demo. So now I hit “Rename,” and now you’re going to notice that it’s set up as demo, and now I can click on that link and start creating the different items that I might want inside of here. Now, it’s worth highlighting the fact that inside of here, I can have things like CSV files, images, and anything else that I might be using as part of the manipulations and work that I’m going to be doing. But the real center of everything that we’re going to be working with is a notebook, and so you’ll notice once again that I hit “New,” and then there’s a whole bunch of pre-configured options for me for different notebooks. What I really like about this is that if you want to use, say for example, R, I’ve got that option right there. If you’re wanting to play around with Scala, you can do that. Both R and Scala, by the way, are languages that are very popular when you’re working with data. Or I can also go in and choose my different Python notebooks as well. So what I’m going to do is hit py37_default, and that’s going to set up a brand new notebook for me that is just simply Python. Now, in order to make this function, there’s always going to be a kernel, and you might have noticed, right up here, there’s a little thing that was going on, where it’s telling me that it was creating a kernel and setting all of that up. That’s my runtime behind the scenes, and so if you want to go in and play around with that little runtime that’s making it all happen, then you can hit the Kernel menu, and then you can choose to, say, restart it, or reconnect. In case something maybe has gone wrong, or maybe you opened this up from somewhere else, you can restart and have it rerun everything inside of your notebook. Then last but not least, you’ll notice right here that you have
the ability to restart and clear output. One big thing to note about the notebooks when you’re playing with them: let’s say I load up some data, I manipulate it, and I get a result; that result actually becomes part of your notebook. So you can refer back to it with the answers, if you will, already in place. If you want to start all over again, that’s where that clear output comes into play. That way, it will just get rid of everything and allow you to redo whatever it was that you were working on. Now, I also want to highlight in here one very important spot, and that is that little square bracket placeholder, if you will, that’s right there, and it will indicate one of three different things. If it’s blank, like it is now, what that’s going to indicate is that whatever’s in that cell has not run. Now, if it is running, then what will happen instead is there’ll be an asterisk inside of there, and it’s hard for me to get a real big asterisk using the tool, but you can see it’s there. If you see an asterisk, what that means is that it’s executing; it’s currently running. The last thing that you’ll notice is a number in there, and what that number is going to indicate is the order in which that cell has executed. That becomes important because even though the cells are laid out top to bottom, the order of execution doesn’t necessarily need to be top to bottom, and that can actually help you out sometimes. Let’s say I’m all the way down towards the bottom, I’ve got a lot of work that’s been going on up here, I try to execute this one little line of code, and then I realize I forgot to do something. Maybe I forgot to load up a library or to run one last little line of code. Well, then what I can do is just create another cell down at the very bottom, run that one little line that I need, and then go back up and rerun the cell. So it gives you a lot of flexibility there. But you just have to be careful to make sure that things are executing in the order in
which you expect them to execute. So let’s play around just a little bit here. I’m going to go ahead and say name equals Christopher, and then I’m going to say print name, just like that, and now I can run my cell. There are buttons to run the cells.
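In script form, the demo cell just described amounts to this (the value is the one from the demo):

```python
# The contents of the demo cell: assign a variable, then print it.
name = "Christopher"
print(name)

# In a notebook, you could also just end the cell with the bare
# expression `name`, and Jupyter would display its value for you.
```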

You can go in and explore the buttons on your own. I’m going to be honest with you, I never remember what the buttons are, because there are really just two shortcut keys that you need to know: Control Enter and Shift Enter. Watch the difference. What I’m going to do here is hit Control Enter, and what Control Enter will do is run that cell, and that’s it. It’ll just run that cell, stay there, and we’re done. Contrast that with Shift Enter. So now I’m going to hit Shift Enter, and what I want you to notice is that it still ran the cell; you’ll notice the number changed from a one to a two. But now I’ve got a brand new cell. When I use Shift Enter, what’s going to happen is it will run the cell, move to the next one, and if there’s not already one there, it will create a brand new one. One other really cool thing about Jupyter Notebooks is the fact that if I just want to see what’s inside of a variable, what I can do is just end my block with it, run it, and then it will just display whatever’s inside of there. What’s really cool about that is that you’ll be doing all sorts of really cool things inside of your notebook, and then when I just want to see what’s inside of an item, I just end my code with it, and it will print out whatever is inside of there. It’s probably my favorite feature inside of Jupyter Notebooks, because quite frequently you’re going to be working with different pieces of data, and you’re not necessarily going to know what’s going on. You just want a real quick sneak preview of what’s going on, and Jupyter is there to help you out. The last little thing that I want to mention is that I can also change my cell type to Markdown, and now I can use Markdown inside of here. So this is, if I can spell world, there we go, this is a great way to leave myself, apparently I’m not able to talk and type at the same time today, but that’s all right, some notes, just like that. So if I want to indicate anything else that might be going on inside of
here, then I can just do that with Markdown. Now, a Markdown cell does behave like a normal cell, so you’ll notice that it’s in edit mode here at the moment, but when I do a Control Enter just to run that cell, now you’ll notice that it will put it into a more proper display. So maybe what I could do is move that cell up, I’m just hitting those little arrows right up there, up to the very top, so now I’ve got that as part of my Hello, World. So there’s the quick version, if you will, of getting in and working around with Jupyter Notebooks. There’s often a lot of power in the tooling inside of there, but those are probably the most common tools, the most common options, that you’ll be using when you get in and play around. This is going to be the core of all the demos that we’re going to be doing in this course; everything that we’re going to do is going to be done inside of our Jupyter Notebooks. I again want to highlight the fact that you could run this in the cloud like I am with a DSVM, you could run this locally, and you could also do this inside of Visual Studio Code. Again, you can check out the GitHub repository for links on how to do all the installations and how to set up Visual Studio Code. >> Christopher just took you through some of the things we can do with Jupyter Notebooks. Now, one of the things you’ll run into when you start playing with Jupyter Notebooks is that there are some other tools you might need to work with as well, specifically Anaconda and Conda. Anaconda is an open-source distribution of Python and R that’s often used for data science. When you install Anaconda, it comes with about 1,500 different packages that you can use. It has a graphical interface called Anaconda Navigator, it has a command-line interface called Anaconda Prompt, and it also comes with a tool called Conda. So Conda is a tool that you can use to create and manage Python environments. Now, back in the original introduction to Python series, we talked about how we
could use virtual environments to manage all the different packages we work with when we’re coding in Python, and the different versions of Python. That becomes important with notebooks as well, but we tend to use Conda when we need to manage packages and environments for Jupyter Notebooks. So what you’re going to want to do is install Anaconda, and then after you’ve installed Anaconda, you’ll just launch the Anaconda Prompt. Then, to create a virtual environment using Conda,
you use the conda create command, you give your environment a name, and you specify the version of Python you’re going to be running. The dash y there is simply a way of saying, if you ask me questions to confirm, just answer yes to those questions by default; just so you understand what that little dash y is doing. So you just have to come up with a name for the environment and know which version of Python you want to work with. Once you’ve created the environment, then you can activate it: you simply say conda activate followed by the name of your environment. A couple of other useful commands you might need down the road: if you need to deactivate your environment to move back and forth, you have conda deactivate, and if you no longer need an environment, you can delete it by using the conda remove command. Once you have activated an environment, so you create an environment, you activate an environment, then you can install the libraries you’re going to use inside that environment. So we can install Pandas; you might be installing things like Matplotlib or NumPy. These are some of the libraries you’re going to start getting to know if you start exploring that world of data science, so you may need to install them. Otherwise, what will happen is that when you are running Jupyter Notebooks, you’re going to see error messages coming back saying, “I don’t know what Matplotlib is.” So this is how we’re going to fix that problem. After you’ve installed the libraries that you need for your code, you’re also going to install Jupyter itself for the Jupyter Notebooks. If at any point you lose track of which libraries you’ve already installed, you can always use conda list to get a list of all the installed libraries. Once you’ve installed what you need, you can simply launch Jupyter Notebooks from inside the active environment, and now your code should be able to access all the libraries you need to do your coding. A couple of little tips that might be useful. I’ve run into this before: when
you’re back in the command-line environment and you’ve launched Jupyter Notebooks and you want to go back, maybe to install an extra package, Control C, Control C will take you back to the Anaconda Prompt. Also, sometimes after I’ve installed a package, I find it doesn’t seem to be recognized right away when I launch Jupyter Notebooks. I usually find that if I deactivate the environment and activate it again, it works; it doesn’t happen often, but it’s something you can try if you run into that issue. All right, let’s try it. So we just talked a bit about how to create virtual environments and why they’re important; let’s take a look at that in some actual code. So I have here a Jupyter Notebook, and one of the things we often do when working with Python is import different libraries: Pandas, NumPy, you’ll see some of these as we go through this course. Now, in this case, I have one where I’m importing something from Matplotlib. This is another library we’re going to play around with later. When I run this line of code and try to import the library, I get an error message saying, “Cannot find the module matplotlib.” This is probably the number one error message you’re going to get used to seeing when you’re working inside Jupyter Notebooks, so we want to learn how to deal with that message when it comes up. When we did the intro to Python course with Christopher, we learned how to work with virtual environments, but with Jupyter Notebooks, we usually do this with something called Conda. So if I close this down, you can see here this was actually the Conda environment I was using, but we aren’t going to be using that one because it was giving us an error message. So if we have launched the Anaconda Prompt, you can see right now I’m in the base environment because of the word base appearing here. So what I’m going to need to do is create a new virtual environment. So I can tell Conda I want to create a new environment, and I’m going to give it the name
PythonEnv, and it’s going to be a Python environment with Python version 3.7, and I’m just putting a dash y here so that if any prompts come up asking me questions, it will simply automatically answer yes to those types of questions. Now it’s going off and creating this virtual environment for me with Conda. Once I’ve created that environment, I can then move to that environment by saying conda activate, and it’s even nice enough to give you a little hint here in case I forget the commands: activate my environment, which I called PythonEnv. Remember, this is just the name I gave to the environment. Once you’ve activated that environment, you’ll even notice that the prompt will actually show you the environment name, so you can remember that you’re in your virtual environment. Now I can install any of the libraries I might need to use when I’m writing my code. So in our case, we were getting an error trying Matplotlib, so we can say, hey, Conda, would you please install Matplotlib for me? Now it’s going to install that library so that I’ll be able to access it from inside of my code. Now, once we’ve installed the Matplotlib library, you might have other libraries you’ll need to install as well, depending on what you’re trying to do, but there’s one extra library you always want to install. So maybe you install Pandas, maybe you install NumPy, but we definitely want to install one called Jupyter. So we actually want to install Jupyter,
which is what runs the notebooks, inside our virtual environment. That way, when we run the Jupyter Notebooks, we know the Jupyter installation is actually using the libraries that we have installed inside this virtual environment. Once the notebooks have been installed, then we’re actually going to launch Jupyter Notebooks from the Anaconda Prompt itself. So this time when I run jupyter notebook, I’m launching Jupyter Notebooks from inside my Conda environment, PythonEnv. This time, when we navigate to our notebook and we go to that line of code where it couldn’t find matplotlib, we’re going to see that line of code run successfully. So you get a little peek behind the scenes here at all the various notebooks Susan has created for this course. But if I go back to the one where we had imported matplotlib, you’ll see here, before, it was giving us an error. But now when we run that line of code, it runs successfully, so no more errors. So now, when you’re using your Jupyter Notebook and you get that error message saying, “Hey, I can’t find that library,” you know how to create a Conda environment, install your libraries there, run your Jupyter Notebook, and get it going successfully. >> As was mentioned at the outset, what we’re going to focus in on is some of the common libraries and tooling that you’re going to use when it comes to working down the path of data science. So along those lines, we’re going to take a look at a really common library known as Pandas. Now, sorry to disappoint right up front, but this has nothing to do with the bears. I know I was disappointed when I first found that out. But rather, it’s a library that has a couple of very common utilities that you’ll be using to help work with, manage, and manipulate the data as you start to analyze it. So what is Pandas?
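(A quick aside before we dig into Pandas: here is the Conda workflow covered above, condensed into one sketch. The environment name PythonEnv and Python 3.7 come from the demo; treat this as an illustration rather than an official reference, and adjust names and versions to taste.)

```shell
# Create a new environment named PythonEnv with Python 3.7
# (-y auto-confirms any prompts)
conda create -n PythonEnv python=3.7 -y

# Activate it; the prompt changes to show (PythonEnv)
conda activate PythonEnv

# Install the libraries the notebook needs, plus Jupyter itself
conda install matplotlib pandas numpy jupyter -y

# See which libraries are installed in this environment
conda list

# Launch the notebook server from inside the environment
jupyter notebook

# Later: Ctrl+C, Ctrl+C stops the server, and then you can
conda deactivate                   # leave the environment
conda remove -n PythonEnv --all    # delete it entirely
```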
Pandas is an open-source, BSD-licensed library, and it’s really geared towards high-performance, familiar data structures. What you’re going to notice in here is that we have a Series, which is going to be a little bit like a list, only it gives us a couple of additional tools as well. We’re also going to see a DataFrame, where if you’ve played around with Excel or maybe a relational database, you’ll feel very comfortable inside of a DataFrame. So first up, what is a Series? Well, a Series is very similar to a Python list: it’s a single-dimensional array of objects where I have all of my values sitting there, and they’ll have some form of an index. One big difference, however, between a Series and a list is the fact that we can set our index to really be whatever it is that we might want. By default, it’s going to be zero-based, but if you do need some level of control over that, you can have it, as opposed to a list index, which is always going to be zero-based. Now, to create a brand new Series, what you’re going to do is use the constructor. So if you’ve already played around with classes before, you’ll know that you go ahead and call this just like you would a normal method, and then pass in the appropriate items. You’ll notice that when you get right down to it, what we’re doing here is really just converting a list into a Series. So you’ll notice the square brackets here; that’s what we typically use to indicate a list. So we’re actually converting this into a Series. The other thing that I want to highlight here is this little pd right here. You might be wondering, “Well, what is pd?” Commonly, when you import the Pandas library, you’ll rename it as pd, pd being short for Pandas. Now, if you’ve seen me do other videos, you may know the fact that I really don’t like single and two and three letter variable names or otherwise; to me, it’s not necessarily as clear as to what it
might be. That being said, I will always go with convention. So convention always overrides my own personal opinion, because, well, I have opinions on everything. But the community has decided that when we use Pandas, we’ll go ahead and abbreviate this as pd, and so I follow right along with that. I do recommend that you do the same,
that whenever there’s a convention, you should go ahead and fall in line there. Because after all, it’s not just going to be you that will be looking at your code, but frequently other Python developers as well. You want to make sure that they know what it is that you’ve been up to inside of your code. So while I’m not necessarily a fan of one and two letter variable and namespace names, I’ll still go ahead and use pd, because again, that’s the convention. So you’ll notice here we can go ahead and set all this up. When we go ahead and display this, it will then show up as just simply that little list right there with the index. You’re also going to notice that if I want to go access a particular item, I can use the normal index functions. So here is my airports with two, just like that. I can do a for loop to loop through each one of those. So if I print all of those out, you’ll notice that I’ll get Seattle-Tacoma, Dulles, London Heathrow, and Schiphol down at the very bottom. Let’s turn our attention now to DataFrames. I would say that this is probably the most common datatype that you’ll be using when you’re getting in and playing around inside of anything that’s data science-y. A DataFrame is a two-dimensional data structure. If you’ve played around inside of a database, if you’ve played around inside of a spreadsheet, you’re going to feel very much at home here, because what you’re going to notice is that we have columns and we have rows, and our columns are going to have names. So you’ll notice that we’ve got a name here, we’ve got city here, we’ve got country all the way on the end there. We’ve got our normal columns, just like we would expect. You’re also going to notice, again, similar more to a database than maybe to a spreadsheet, that we will also have a column that will be our index. You’ll also notice that you have the ability to control that. By default, it’s going to be zero-based, but if maybe I’m importing from somewhere else, maybe from a table,
maybe from a CSV file, maybe there are already IDs that have been set, I could go in and identify, “Hey, this is the column that I want you to use as an index rather than doing the zero-based one.” So you do get the ability to control that. If you want to create a DataFrame, what you’re going to notice is that we’ll use, again, a constructor. But effectively, we’re going to convert a list of lists into our DataFrame. So I want to highlight just this item right here, where you’ll notice that we’ve got Seattle-Tacoma, which is the name of the airport. You’ll notice that we’ve got the city, which is Seattle, and then we’ve got the country, which is USA. Then you’ll notice the next list down below that, I’ll highlight that one in blue here. You’ll notice on that one that we’ve got Dulles, and we’ve got Washington in USA. So again, we’ve got the name of the airport, the city, and the country. So each one of those is an individual list. We’re going to take all of those together, convert that into a DataFrame, and the result is going to wind up being that little table that you’re seeing right there. So you’ll notice that Seattle-Tacoma, Dulles, London Heathrow, Schiphol, and Changi all become our individual airport names. Just imagine, if you will, that Changi was added onto there, even though it just wasn’t listed in the code above. You’ll notice the cities, and then you’ll also notice the countries. So what happened is we took those lists and converted them into a table structure. But what I also want you to notice is the column names there. The column names are, quite frankly, not overly helpful: I got column names of 0, 1, and 2. They don’t tell me anything. Not only that, but dealing with data where everything depends on position can get really tricky, because you have to be really careful about things moving around and so forth. So whenever we can identify something based on name, that’s typically going to be the way that we want to do it. So rather than going with those numbers that
we see by default, let’s instead identify what our column names should be. So you’ll notice the last parameter there, which is our columns. So now, when we run this bit of code, we’ll actually wind up with the column names. So you’ll notice that we’ve got name, we’ve got city, and we’ve got country. You’re going to notice that there are a lot of tools
that are available to us inside of the DataFrame, and we’re going to take a look at those in the next couple of videos. But for now, this is how we go in and create the items. Let’s get in and take a look at a couple of real quick code demos so we can see this in action. Let’s take a look at a notebook where we’ve got a little bit of code that highlights what it was that we talked about previously. So what we’re going to notice here is that we start off by loading up our Pandas library, just as we normally would in Python code, because again, obviously, we’re writing Python, by simply saying “import pandas as pd.” Now, I’m actually going to leave this behind for just a second here. I actually want to do things out of order, and then I’m going to double back. So what I’m actually going to do is run this line of code right here, where I’m trying to use that Pandas library. What you’re going to notice, of course, is that it’s going to give us an error message saying, “Hey, I don’t know what pd is. pd is not defined.” The reason that it’s not defined is because, of course, we didn’t run that little line of code right up top, so let’s run that. So I’ll do a real quick Shift Enter that will actually move me down. You’ll notice now I can see the little number 2 right there; that tells me that this has run. So now I can keep going and create our little basic Series here. So when I run this, what you’re going to notice is that my output here is going to include my datatype. It’s telling me the datatype that’s inside of there, object, or string really, and you’ll notice my index over here: 0, 1, 2, 3, 4, 5, and 6. You’re also going to notice that cool little trick that we talked about previously, where if I just put a variable, or some operation that’s going to return back a value, at the end of a cell, it will just simply print that out. Boom, we go ahead, we have airports, and it will just go ahead and print that out right there. That’s something I find myself
doing quite frequently, because especially when you’re dealing with data that you’ve loaded in from somewhere else, maybe you’ve done a couple of manipulations on it, and now you want to see what the updated values are, you want to see maybe whether or not you did things correctly, those types of things It’s a very quick and easy way that you can do that, and it will just simply print out If I want to go in and access a particular item, then I can do that by its index So I’ll go grab two, that’s of course is going to be the third item All counting starts with zero, so that’s going to give me back London Heathrow Then you’ll also notice that I could loop through all of the items, do a real quick print, and then you’re going to notice that it gives me all of the values back I do want to do it real quick aside here, and I want to point out one subtle little thing, that I’ll sometimes get asked when I’m doing notebooks here That you’ll notice that up at the very top of London Heathrow, that I’ve got the single-quotes there indicating the fact that it’s a string Then you’ll notice down below that I don’t get that, I just simply get the string without the single quotes, and you might be wondering, well, why the difference there? What’s happening? 
What’s happening there is effectively what’s in charge of displaying something out on the screen When you use print, print is going to write something out to the console Print has its own way of writing things out to the console, and in this case, my notebook is playing the role of the console So we’re seeing really the exact same output that we would see if we were running this locally So no single quotes or anything like that, it’s just simply writing out to the console, that’s the expected behavior and we’re done So it’s not my notebook that’s printing it out, but rather really it’s print that’s in control of how that’s going to be printed out On the flip side however, if I just simply do this, where I’m executing a little bit of code, that’s what I’m doing there I’m saying airports, and then my indexer of two, I’m executing a little bit of code That code is then going to return back a value to me In that case, it’s now going to be Jupyter that’s in charge of printing that out, and that’s why we’re getting the little string there, because it’s indicating to me, giving me a little bit of a hint there of the datatype as well So it’s showing me, “Hey, this is a string, ” and that’s why I’m seeing those single quotes there So to distill all of that down, the difference between those two is print is just in charge, it’s printing out all of those values as opposed to just executing a little bit of code that’s giving me a value back
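That print-versus-echo difference can be reproduced outside a notebook too Here is a minimal, runnable sketch of the Series we have been building; the exact airport list is an assumption based on the names mentioned across this series:

```python
import pandas as pd

# The airport names here are assumptions based on ones mentioned in the videos
airports = pd.Series([
    'Seattle-Tacoma',
    'Dulles',
    'London Heathrow',
    'Schiphol',
    'Changi',
    'Pearson',
    'Narita',
])

# Index 2 is the third item, since counting starts at zero
third = airports[2]
print(third)        # print writes the bare string to the console
print(repr(third))  # repr shows the quotes Jupyter adds when echoing a value

# Looping visits every value in order
for name in airports:
    print(name)
```

So Jupyter is essentially showing you `repr()` of the last expression in a cell, which is why the quotes appear there but not under `print`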

When the notebook is in charge, then it’s the notebook that’s going to determine how that’s going to be output, and the people who built Jupyter and put all of this together decided, “Hey, we also want to indicate to you the datatype.” That’s why you’re getting the single-quotes there So if you are curious, that’s what’s going on Back to our regularly scheduled programming here Let’s take a look at a DataFrame So now what we’re going to do is execute the exact same bit of code that we saw on the slide What you’re going to notice is that we’ve got our list of lists So we’ve got right here, Seattle-Tacoma, Seattle, and USA, as a list Each one of those entries is now going to become a cell inside of a row So when I run this, you’re going to notice down below, there we go, that is now a row as promised Then you’ll also notice that each of the first entries then becomes a column Then you’ll notice my column names there are rather unhelpful, 0, 1, and 2 If we have the opportunity to name something, we should almost always take that opportunity, and so that’s what we’re going to do right here, is we’re going to indicate, “Hey, let’s put in some columns here.” So let’s identify the first one is name, the second one is city, and then the third is country So now, let’s go ahead and run that Now, you’re going to notice the output there with our name, our city, and our country As I highlighted towards the end of the last little video there where we were introducing all of this, there’s an awful lot that we can do with DataFrames This is how we can go in and create them In the next section, we’re going to take a look at how we can start to manipulate them a little bit and start to see what’s going on inside of them When you’re working with data, you’re typically going to be importing this in from a database, from a CSV file, from somewhere else When you’re pulling that data in, the first thing you’ll typically want to do is figure out what it is that you have Because
real-world data is messy, there’s going to be errors, there’s going to be inconsistencies, there’s going to be missing values, etc, so before you begin performing any operations on the data, you first need to figure out what in the world you’ve got So fortunately, what we’re going to notice with the DataFrame is that we have a lot of different ways that we can explore our data Now, like we did in the prior module, we are going to hardcode in what our DataFrame is going to be, just to make things a little bit easier Don’t worry, however; we are going to show off the code to actually load this up from a CSV So how can we go in and explore this, and again, there’s a whole host of different ways that we can do this A couple of the most common are head and tail, where with head we can indicate the number of rows from the top that we want to be able to see, and with tail as you might expect, we can see the number of rows from the bottom that we want to be able to see So head 3 is going to give us the first three rows, tail 3 of course is going to give us the last three rows We can also find out what the shape is, or basically, what are the dimensions In other words, it will give us the number of rows and the number of columns So if we take a look at the output here, what we’re going to notice is seven for the number of rows, and then three for the number of columns Now, you may be thinking, well, wait a minute Christopher, I’m counting four columns, because I see index and I see name, city, and country, and you’re right What the three is actually indicating to us because there will always be an index, is the number of data columns that we have So three is the number of data columns, seven is going to be the number of rows, zero through six, of course that’s going to be seven So that’s the shape that we’re getting back But shape is really a high-level check It’s nice to know what it is that we’re dealing with, but sometimes we need a little bit lower level of information
here, and this is where info comes into play With info, we can get an awful lot of what’s going on inside of our data So what we’re going to see when we run info, is the number of rows or entries We’re going to see the index range, so 0-6 in our case here,

we’re going to see the number of columns, we’re going to see the information about each column, including the names, whether or not any of them are null or not null So it’s going to show us the non-null values So you can extrapolate there, how many null values that you might have, and then it’s also going to indicate the data types Keep in mind that a string is an object in this case So by using head, tail, shape, and info, I can get a little bit of information about what’s going on inside of my DataFrame Let’s now turn our attention to the code that actually makes all of this magic happen Let’s see how we could explore a DataFrame using a little bit of code So let’s first again import our pandas, and let’s create that DataFrame Now, you might be wondering, hey, why in the world would you start exploring a DataFrame when you’ve hard-coded in what that’s going to look like? Again, you’re exactly right We’re going to show a little bit later how we can load up some data We just want a simple DataFrame that we could start to show off some of the skills So that way when it comes time to load up some data, then we know better what’s going on and how to start exploring all of that So there we go I’m just going to create my DataFrame There it all is, and now let’s explore it So like we mentioned before, head will give us the ability to see the top three rows, because I specify three If I specify two for example, as you might expect, that’s going to give us back just simply those two items, five, I think you start to see where all of this is going It will just give you back that number of rows If you want to start from the bottom, then you could simply use tail, and that will just count up from the bottom So if we take a look, you’ll notice that there’s the bottom of that So Narita, Pearson, and Changi, and if we take a look over here, we can see back in our original DataFrame, that’s exactly what we’ve got there If I just want to see what’s the shape, what does
it look like, 7, 3 Again, remember that the three is giving us the number of data columns, so it’s not counting the index Then finally, I can see the full info of that So I can see the columns, I can see the names, I can see how many items in there are not null, etc I would say of all the information that that’s going to give you, the not null is probably going to be one of the most important things, because working with null values can really be a struggle because sometimes they’ll indicate that there’s just missing data, sometimes it’ll indicate a condition, when you start doing math with them, things start to go sideways very quickly Null values are really a struggle So being able to see real quickly, hey, what does my data look like, is going to be really powerful, and it really is going to be probably the first thing that you will always do when you load up some data, especially data that you’re not overly familiar with, to see what it is that you’re working with, and now you’ve got the tools to be able to do that Now that we’ve seen how we could grab some rows from our DataFrame, and get the overall structure of it, let’s get in and see how we can start to pull out different pieces of data Now, you’re going to notice about a DataFrame, the fact that there’s a lot of different methods and tools and so forth that are at your disposal for getting in and finding basically anything that you might want, performing all different manipulations, and so forth We’re going to take a look at a couple of these as we go through the next handful of videos here But I want to stress the fact that we’re not going to go through everything If you go check out the documentation linked to from our GitHub page, you’ll be able to see everything that’s available to you, and begin to just continue to grow your tool set But let’s get in and see how we can identify individual items or start slicing and dicing our data So how can we go grab specific rows and specific columns, or how can we
slice and/or dice our data where we’re grabbing full columns, full rows, etc Well, one way that we can do this is by grabbing a specific column by identifying it based on its name

Then you’ll notice that the indexer for DataFrame is going to, by default, be the name of the column So if I say airports city, that’s going to give us back everything that’s inside of that city column as well as the index So when I do that, boom, I’ve got my city and I’ve got the index to go along with that How about if I want two different columns? Well, if I want two different columns, then what I need to do is I need to actually pass in, where is my cursor here? Let me put that to red. There we go Now I can find it. There we go So what I can do is I can create a list, like I’ve got here, with the name of each column that I want So you’ll notice that I’m identifying my name and my country there So those are going to be the two columns that I want back The main thing that I want you to notice about the syntax here is the fact that we’re putting this inside of a list If we didn’t put that inside of a list, then what we’re actually doing is passing in different parameters into that indexer, and that’s not the way that it behaves We want to pass in one parameter, and that one parameter is going to be a list So that little bit of square bracket that you’re seeing inside of there, that’s got to be there So that way we can identify, “Hey, these are the columns.” Cool So that’s how I can get individual columns Now, let’s say that I want to go in, and grab things based on their position, that sometimes I’m going to have the name, sometimes I’m not going to have the name of the column, that maybe what I’m trying to do is loop through things, maybe I want to see where something is specifically inside of a DataFrame Maybe I do happen to know the location, and so that’s what I want to be able to go in and do This is especially helpful, if maybe I want to specify a range of columns or a range of rows So what I can do, is I can use iloc, my index location With iloc, my index location, now what I can do is specify the rows and the columns by index that I want So if I do
[0, 0] like I’ve got right here, as you might expect, that’s going to give us right here, Seattle-Tacoma So if I run this code, if you will on my slide, that’s what we’re going to get back If I say [2, 2] that’s going to count down again, zero-based So [0, 1, 2] it’s actually going to give us the third row and the third column, that’s going to give us, in this case, United Kingdom Now you might notice on the slide, and this will be true in the next couple of slides as well, that little spot right up there, that column 0, column 1, column 2, those are not actually part of the DataFrame, they’re just there to help make our demo a little bit easier here So that way you can see where 0, 1, and 2 actually are So [0, 0] Seattle-Tacoma, [2, 2] is going to give us the United Kingdom there You can also go in and specify a range This is what I was mentioning before, where things are actually a little bit more powerful when you’re using that index location, is the ability to specify a range of values So similar to using a range on a list where I could go in and specify one colon, and then whatever my end index is going to wind up being, maybe for example 4, I can do the exact same thing here So let’s start off by using colon and colon When I pass in those ranges, on either side or in this case both sides, what I’m indicating is that I want all of the values from my rows or my columns So if you remember from before, the first parameter that’s going to indicate the rows that I want Second parameter that’s going to indicate the columns that I want When I say colon and colon, I’m saying all rows and all columns So if I ran this, I’m going to get back that highlighted section there, where I’m going to get everything If I said, ”Hey, I only want,” in my case, the first two rows, so it’s going to be [0,1] and then stopping at index two but not including it, then I could go in and specify that as my range This is now going to indicate that we

want back all of our columns So if I run this, I’m now going to get back Seattle-Tacoma, Seattle, USA and Dulles, Washington, USA as my values back So first two rows and all columns I could go the other direction as well, where maybe I put that over on the column side So now that’s going to give me back the first two columns So that’s going to give me back the name and the city, and we get our index as well How about if I want to specify a list of items there? So maybe what I want to get back, is to get back the first column and in my case, the third column Remember again, counting starts at 0, so 0 and 2, column 0 and column 2 The way that I can do that, is by specifying a list I want to highlight here, the difference between a range and a list My range allows me to specify the starting point and the end point, whereas my list here is going to give me the ability to specify the individual items that I want here So in my case, I want the first column and I want the third column So 0 and 2 I’m going to specify 0 and 2, those are the columns that I want back You’ll notice the colon at the beginning, that’s going to specify that we want all of the rows So if I run this, now what I’m going to get is my name and my country and again, the index as well If you want to do this by name, then you could just simply use location So just Loc, and now if I specify name and country, that would give me back the exact same but only doing it by name instead of by the index All right. 
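The selection techniques from the slides can be condensed into one runnable sketch; the DataFrame below is the same hardcoded airports example, trimmed to three rows for brevity:

```python
import pandas as pd

airports = pd.DataFrame(
    [
        ['Seattle-Tacoma', 'Seattle', 'USA'],
        ['Dulles', 'Washington', 'USA'],
        ['London Heathrow', 'London', 'United Kingdom'],
    ],
    columns=['name', 'city', 'country'],
)

# One column by name: the indexer takes the column name
cities = airports['city']

# Two columns: pass ONE parameter, and that one parameter is a list
name_country = airports[['name', 'country']]

# iloc: rows and columns by position
first_cell = airports.iloc[0, 0]          # row 0, column 0
uk = airports.iloc[2, 2]                  # row 2, column 2
first_two_rows = airports.iloc[0:2, :]    # rows 0 and 1, every column
cols_0_and_2 = airports.iloc[:, [0, 2]]   # first and third columns, by list

# loc: the same selection, but identifying columns by label
by_label = airports.loc[:, ['name', 'country']]
```

Note the range `0:2` stops before index 2, while the list `[0, 2]` names exactly the positions you want, which is the range-versus-list distinction from the slide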
Well, now that we’ve seen a little bit of a slide representation, what do you say we see this in action Let’s open up a notebook and see how we can actually use these tools Let’s see how we can find certain items inside of our DataFrame So just like before, I’m going to go grab our Pandas library, and I’m going to set up our DataFrame You’ll notice just like before, we’ve got name, city, country and all of the airports that are inside of there If we want to go find a specific column, the simplest way to do this is just to use that indexer So you’ll notice that I pass in the name of city and that will give me back the city. Just like that If I specify country, then it will of course give me back the country It’s always giving me back that index as well Which honestly I like, so that way I can always find a specific row if I need to later If I want to go grab particular items, then what I’m going to do is put that inside of a list as that parameter So that’s going to give me back my name and my country here You’ll notice one more time, with our notebook, the fact that if I just execute the code, then it will just print that right out for me Which again, one of my favorite little things with notebooks If you happen to know the index location, then you can do that So to get whatever the first cell is, I can do that with [0, 0]; to go get the one over at [2, 2], then I can do it that way Now, most commonly, the way that you’re going to be using this index location is to go grab multiple columns or multiple rows So if I want to specify a range, then I can use that colon comma colon, it’s indicating that I want all rows, all columns If I want to indicate the start and end, then I can do that by using, in my case, 0 and 2, this will give me back the first two rows or in turn the first two columns, if I so desire If maybe you want to go grab the second and the third, make sure that you indicate the ending index Remember, it’s going to go all the way up to but not including and
so now, what I’m going to get back is 1 and 2 here, which is what we would’ve expected So it behaves just like a range for say, an array or for a list If you want to specify individual items

So maybe I want the first, and the third back column wise, then I can do that by specifying that little list that you see right there So [0, 2] this is going to give us back columns 0 and 2, or the first and the third So when I run this, it’s going to give me back name and country If I sneak all the way back up to the top here, we’ll notice name and country is our first and third So that was exactly what we got back If you want to do that same thing but you want to do it by name, then you could use location, again that second parameter is going to be the column So name and country is going to give me back name and country Just like that, we get the same things So that’s a real quick primer, real quick starting point for going in and working with our data One thing that you are going to notice whenever you’re working with a DataFrame, is that we need the ability to quickly go in, find rows, find columns, etc Like I mentioned at the outset, there’s a lot of different ways that we can do this These are a couple of very common ways that you can do it, and again, we’ve linked to the docs inside of our GitHub page, so you could go off and explore from there But there’s still more with DataFrames and that’s what we’re going to take a look at in the next video >> Whenever you start working with data science and you’re working with Python, at some point you’re going to be needing to load some data files The most common data file format you’ll likely be working with is a CSV file or a comma-separated values file One of the things we just want to touch on is what does that look like and how do I work with those files when I’m working inside of Jupyter Notebook So comma-separated values files, or CSV files, are an extremely popular format for data files So a lot of times when you’re going off and searching for data, this will be the format it’s in What you’ll see is each row of the file contains one record and it’ll have a special character, often a comma, that’s used to
separate the different column values The first row may contain, actually, the names of the different columns, so you know what is the value in each row So you can see here I’ve got the name, city and country, and then I have the name, city and country for a number of different airports, for example So that might be a CSV file that I might use to analyze inside my data Now if you want to use this data file inside of Jupyter Notebook, you’re going to have to make sure the Notebook can find the physical file So how do I do it? Well, I’m going to upload the files to Jupyter itself Now one of the things that’s a really good practice is to take that file and create a sub folder and the most common name for the sub folder is something like “Data”, not very original but nice and descriptive So we create a sub folder called “Data”, you upload all your files into there, and then we’ll be able to access it from our Notebook So we would first have to create the folder, and then, once we’ve done that, there’s an upload option to upload our files Let’s move over to Jupyter Notebooks and try that out So I have a nice CSV file here and I want to be able to use this and access this with my Python code, but how do I get to it? 
So what I’m going to do is I’m going to go to Jupyter, and I’m going to create a folder so that I can keep all the data files in one place So over here on the right-hand side, I can just say, “Give me a new folder”, instead of a new Notebook, and by default you’ll see it shows up as “Untitled Folder” So all I have to do is simply select that and then there’s a “Rename” option to give it a better name, so I’m going to call that “Data” That’s a fairly common name given to folders where we store data files Then, once I have the folder created, I simply select that folder, and then I can upload whatever CSV files I want to use So I have this CSV file with information about all different airports, and I ask it to “Upload” that file, and once it’s uploaded, you can even see that you’ve successfully uploaded your CSV file into Jupyter So now our Python code inside the Jupyter Notebook is going to be able to access this file Once we have a CSV file with the data we want to access, there’s a few things you might need to know about working with CSV files and your Python code, just aspects that can mess around with your data if you don’t understand them correctly So let’s talk a bit about how to read and write CSV files and use that interacting with the Pandas DataFrames that Christopher was talking about earlier So the “Read_CSV” command is what we use to load a DataFrame from a CSV file So let’s say we have a nice file here which has name, city, country, and again, that list of airports, what we can do is we can simply call “Read_CSV”, pass in the name of that file in our nice, tidy little data folder that we created, and it creates the DataFrame for us
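Here is a minimal sketch of that load; in the notebook the file would live at something like Data/airports.csv, so an in-memory file stands in here to keep the example self-contained, and the sample rows are assumptions:

```python
import io
import pandas as pd

# Stand-in for Data/airports.csv so this sketch runs anywhere;
# in the notebook you would call pd.read_csv('Data/airports.csv')
csv_text = """name,city,country
Seattle-Tacoma,Seattle,USA
Dulles,Washington,USA
Heathrow,London,United Kingdom
"""

# read_csv takes the header names from the first row
# and generates the 0, 1, 2, ... index for us
airports_df = pd.read_csv(io.StringIO(csv_text))
print(airports_df)
```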

Now something you might notice is that we did have a row, which indicated the name, city, country, the names of our columns, and “Read_CSV” automatically figured out that those were the values for our column names So you’ll see the columns have headers of name, city and country You’ll also notice that an index was created automatically because as you saw, Christopher showed you, DataFrames have an index, that index was created automatically for the rows from our CSV file Now if you have any problems inside your CSV file, because sometimes you don’t have control over where these files come from, one of the things you have to worry about in your Python code is how are you going to handle those types of errors in different situations with your Python code So in this case, if you take a look at the CSV file, there’s an extra comma in the row for London So we have the airport called Heathrow, city is London and then we have these two commas So in this case it actually sees four values instead of three for that particular row, and by default, that’s actually going to crash and it won’t load any data at all into our DataFrame But good news, Pandas has some fantastic features you can use, for example, there’s an “Error_Bad_Lines”, and if you set that to “False”, that means just skip the rows with the errors and load everything you’ve got So in this case you can see I can load the DataFrame and the row for Heathrow, London is missing from the loaded data Of course, this is only okay if it’s all right to skip those rows Now what other situations do we deal with?
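Before moving on, that skip-the-bad-row behavior can be sketched as follows One caveat: the “Error_Bad_Lines” parameter shown here was deprecated in pandas 1.3 and removed in 2.0, so current pandas spells the same idea on_bad_lines='skip'; the sample data is an assumption:

```python
import io
import pandas as pd

# The Heathrow row has an extra comma, so it parses as four values, not three
bad_csv = """name,city,country
Seattle-Tacoma,Seattle,USA
Heathrow,London,,United Kingdom
Changi,Singapore,Singapore
"""

# error_bad_lines=False (as in the video) became on_bad_lines='skip'
# in pandas 1.3+: rows that cannot be parsed are silently left out
df = pd.read_csv(io.StringIO(bad_csv), on_bad_lines='skip')
print(df)  # the Heathrow row is missing from the loaded data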
Well, maybe the data file that you receive doesn’t actually have values for the column headers So in that case it might get confused So one of the things we can also do with Pandas in your Python code is specify the “Header” is “None”, in which case the “Read_CSV” will automatically create column headers It’ll just call the columns zero, one, two and so on So it’ll create a column index to make up for the fact that it doesn’t know what the column names are So you need to know when you’re writing Python code if that CSV file has a header row or not This is one way to handle it But you might have noticed when Christopher’s writing that code to query a DataFrame, it was really convenient to have names for your columns That’s okay, we can fix that as well So the other thing you can do is you can provide names for your columns by specifying the “Names Parameter” So if we add a “Names Parameter” here and we say, “I want to use name, city, and country as the names of the three columns”, there are no column names in my file, but now you can see I successfully assign names to each column when it creates the DataFrame If you have any missing values, one of the things to get used to when you’re working with Python and Pandas specifically, is this thing called “NaN”, and that’s the way blank values are going to appear So if we take a look at this file, you can see the record for Schiphol Airport in the Netherlands, the city value, there’s nothing specified So in this case when I call “Read_CSV”, you’ll see that it shows the “NaN” as the value for the city, for Schiphol If you have manipulated your data and started working with data, sometimes you want to save your changes so that you can go back to them later So in addition to being able to read data from a CSV file into a Pandas DataFrame, you’ll also be able to take data in a Pandas DataFrame and write it to a CSV file So we do that with “To_CSV” But what you’ll see that might throw you off initially, is the index values
will be written to the file as well So you can actually see because my DataFrame had an index, that the file created has these index values saved as well That’s fine if you want it If you don’t want it, then all you have to do is specify index equals false, in which case it will not include the index values inside the created CSV file Let’s go into some notebooks and try that out in some code So we were just talking about how we can use Python to work with data files like CSV files Let’s try it from an actual notebook So I have here a notebook in front of me, and because I want to read the data into a Pandas DataFrame, because when we’re working with Python, DataFrames are very useful for data science I’m going to need to import that Pandas library again You’re going to get used to that line of code I’m thinking Let’s say I have a nice CSV file containing airport information, like the airport’s name, city, and country, and you can actually see that actual row is in my CSV file itself, along with the names, cities and country for a number of different airports So I can simply use the read CSV command to read that data into a DataFrame and then display the contents of that DataFrame So sure enough, you can see that the column names were read successfully from the file You can see that index values were created as Christopher was explaining Pandas DataFrames, you always have an index for each row In this case because there was none provided, it generated them for me So everything looks great But when we are writing Python code, we have to make sure that that code can handle different variations in the data file that we’re loading
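Those variations, a missing header row, supplying your own names, a blank value becoming NaN, and writing back out without the index, fit in one self-contained sketch; the in-memory file and its rows are assumptions standing in for a real CSV:

```python
import io
import pandas as pd

# A file with no header row; Schiphol's city is missing (comma, comma)
no_header = """Seattle-Tacoma,Seattle,USA
Schiphol,,Netherlands
"""

# header=None says the first row is data, so pandas numbers the columns 0, 1, 2
numbered = pd.read_csv(io.StringIO(no_header), header=None)

# names= supplies proper column names instead
named = pd.read_csv(io.StringIO(no_header), header=None,
                    names=['name', 'city', 'country'])

# The missing city shows up as NaN, which isna() can count
missing = named['city'].isna().sum()

# Writing back out: index=False keeps the 0, 1, ... index out of the file
buffer = io.StringIO()
named.to_csv(buffer, index=False)
```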

For example, I might have a record here where the row for Heathrow, London has an extra comma in it That extra comma is going to mess up my code because it now thinks there’s four values in that row instead of three So by default, if I just try to read that file, it’s actually going to crash and it won’t put any data at all inside my DataFrame Well, I have a couple of options here I can now go digging through that file, try to figure out which one row is causing the error and try to fix it, and yes, there’ll be some hints in the error message to help you find which row has the problem But one of the neat things about data science is, sometimes it’s okay if we’ve got a million rows, maybe we can leave 12 of them out So maybe we just want to skip the bad rows So in that case, Python supports that as well So what we can do, is we can specify error bad lines equals false That simply means, if you meet a row that you can’t interpret, skip that row and continue So if I run this one you’ll see it successfully loads the records and that record from London, for United Kingdom, was simply skipped So that record was not loaded The other common situation you’ll encounter when reading data files with your Python code is the file may not contain a row that tells you what the column headers are So here I have some data, but there’s no row telling me what the column names should be By default, if you pass this in, basically Python and the Pandas are going to get a little confused because they assume you are passing in values for the column headers So it reads the first row of data and thinks those are the column headers So suddenly I have column headers called Seattle-Tacoma, Seattle, and USA So I need to make sure I have a way of saying no, no, no, the first row in the file is actually data, it’s not headers That’s why we have the option of specifying header equals none So if I specify header equals none when I load the file, that’s a way of saying there’s no
header row and I simply want you to assign values to those column headers If I don’t tell it what values to assign, it’ll just give them a number value, column 0, column 1, column 2 Now, you might though have appreciated when Christopher was doing some queries against the DataFrame, that sometimes it’s nice to be able to specify the columns by name It’s easier for me to remember that City contains the city rather than column index 1 is the city So you could also, even if the header values aren’t specified in the file, you can specify by using the names parameter, what column names you would like to use So if I run this, you’ll see it actually successfully said, you don’t have a header row So here I specify the values of name, city, and country for my three columns The other scenario you’re most likely to run into when working with CSV files, or data files in general, is a missing value In this case, I have a data file and for one of my records, Schiphol in the Netherlands, there’s no city specified You just see comma, comma Now, if you read that, the way Python Pandas will display a missing value, is it displays it as NaN, so not a number, if you will So that indicates that there was no value found for that particular value inside the record So you’ll see that as well Now, sometimes when you start really doing a lot of data science and you do a lot of cleaning of data, you’ll see that when you look at data science courses Once you finish cleaning the data, you might decide, hey, now that this data is all tidied up and cleaned, maybe I’d like to save a copy of it So the next time I’m working with it, I don’t have to do as much cleansing and rework So one of the neat things we can do with Pandas is, we can actually write the contents back out to a CSV file as well So if I have data in a DataFrame already, then I can simply say, hey, let’s use to_csv and let’s write that output to a data file Now, one of the things that’s a little interesting, if I was to open that file
up and I’ll show it to you in a minute, is the index values will be written in as well So if I don’t want the index values, that 0, 1, 2, 3, 4, to appear in the CSV file, then I want to specify when I write the file that I want index equals false That’s a way of saying please do not include the index values So let’s go take a look at the actual data files created so you can see the difference between using index equals false or not specifying index equals false So here you can see, here’s the file I created by default You can see it added the index values from the DataFrame You can see when I did specify index equals false, that it simply added the row values themselves without an index. So there you have it You now have the ability to use your Python code to move data to and from CSV files and Pandas DataFrames As you explore different data science tutorials and watch different data science videos, one of the things they’re always talking about,

is preparing your data and the time you spend preparing your data One of the things you’ll end up doing as you prepare your data is you’re going to have to either remove columns from a DataFrame or you might need to take some of the columns inside a DataFrame and move them into a separate one, splitting them off, if you will So let’s explore how you do that with Python code as well So in this case, maybe I have something like the actual arrival time column If I was doing some Data Science and I was trying to predict how many minutes late a flight was going to be, that’s a value inside the arrival delay Well, if I have the actual arrival time and the scheduled arrival time, well, I can just calculate arrival delay by subtracting one from the other But I’m trying to train a model to look at all the other factors like what Airport did it leave from, and what time did it leave, and how long was the flight, and use those values to try and predict if the flight will be on time So I don’t just want to say take the actual arrival time and subtract the scheduled arrival time, that’s not really training a model, that’s just doing a subtraction So in this case, I might need to remove the actual arrival time column So how do I do that with Python code? 
Well, as I’m sure you’re discovering by now, Pandas has some fantastic features, and the DataFrame object has a drop method So you can simply say drop a particular column So if I say, hey on my delays DataFrame, let’s drop the column called actual arrival time Then this doesn’t actually modify the delays DataFrame, it simply returns a new DataFrame which contains everything except the actual arrival time So I get a new DataFrame containing everything except actual arrival time If however you want to make the change to the delays DataFrame itself, you can actually specify in-place equals true That’s a way of saying modify the DataFrame where I’m doing the drop So that way I don’t have to copy the values to a new DataFrame, I’m just saying no, modify this one right here That will simply drop the actual arrival time column from the delays DataFrame Now, the other thing you might need to do sometimes is you might need to take some of the columns from a DataFrame and move them aside One of the most common things you’ll hear people talk about in Data Science courses is the idea of qualitative data and quantitative data So qualitative data is descriptive data, it’s things like what Airport is it, are things blue or green, or gold? 
They’re describing things But quantitative data tends to be numbers, how big is something, how long is something, how many minutes the delay was, and so on You’ll discover as you explore data science that those quantitative values, training models, they love that type of data So quite often, we have to take those qualitative values and pull them aside, so we can use the quantitative values to train our model So in this case, that’s my origin airport, my destination airport So I can simply create a slice of that DataFrame Think back to that module where Christopher showed you how to query a DataFrame So basically, I’m going to query the DataFrame to return just those two columns, and then take those two columns and put them in a new DataFrame So all you do is you do a query, I’m using Loc here Remember that first colon means which rows do I want back, and a single colon means return all rows Then you specify which columns you want, in this case, origin Airport and destination Airport, and take those values and put them into a description DataFrame That will create this nice new DataFrame for me with just the columns that I need So these techniques are going to make it easier for you when you start having to prepare your data for Data Science Now, let’s try doing that in code So when you’re exploring different Data Science tutorials, quite often they talk all about preparing data One of the things they’ll often talk about is either needing to remove certain columns or to move certain columns into a new DataFrame So we’re going to look at our Python code and see how Pandas gives us some neat tools to make it easier to do that So we’re still using Pandas DataFrames, so I’m going to start off by importing that library Then I’m just going to read a data file to give us a starting point So I’m reading in some flight information here If you look at the first few rows, you can see all sorts of wonderful data, lots of different columns of information that I might be 
analyzing, a very typical sort of data file Now, you might get into a situation here, where if you look at this, you’ll notice I want to maybe predict the arrival delay, which is how many minutes late a flight is going to be Because this data set is so complete, it actually has the scheduled arrival time as well as the actual arrival time Well, it doesn’t take a Machine Learning model to figure out that if you subtract the time a plane was supposed to arrive from when it actually arrived, you can just calculate arrival delay But if I’m going to try and train a model to predict what time future planes will be landing, it won’t know the actual arrival time when I’m asking it to predict for a flight that’s leaving tomorrow So I might need to delete that arrival time column It’s unnecessary, it’s not going to help with my model training
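Dropping a column like that can be sketched with Pandas' drop method; the column names and values below are made-up stand-ins for whatever is in your real data file:

```python
import pandas as pd

# A tiny stand-in for the flight data (column names and values are made up)
delays_df = pd.DataFrame({
    "SCHED_ARR_TIME": [905, 1030, 1215],
    "ARR_TIME":       [912, 1026, 1220],
    "ARR_DELAY":      [7, -4, 5],
})

# drop returns a NEW DataFrame; delays_df itself is left untouched
new_df = delays_df.drop(columns=["ARR_TIME"])
print(list(new_df.columns))     # ARR_TIME is gone here...
print(list(delays_df.columns))  # ...but still present here

# inplace=True modifies delays_df itself instead of returning a copy
delays_df.drop(columns=["ARR_TIME"], inplace=True)
print(list(delays_df.columns))
```

Whether you want the copy or the in-place change is a style choice; the copy keeps the original around in case you need it again.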

So you will have situations where you want to delete it So if I want to remove this arrival time column, fortunately, the Pandas DataFrame has a feature to help me with that There is a method called Drop You tell it which column you want to remove, and you can then say create me a new DataFrame without that extra column So if I do that, this new DataFrame I create, you’ll notice, it still has the scheduled arrival time, but the arrival time column is gone Now, one of the things you may notice here, and this is something to keep in mind when working with Pandas DataFrames, is it didn’t actually modify the delays DataFrame It simply created a new DataFrame without that extra column If you want to modify the delays DataFrame itself, you can do that, Pandas will support it You specify the value in-place equals true That means, modify the delays DataFrame itself So now the delays DataFrame no longer has that arrival time column So it’s up to you, depending on how you like to work with your code and what you’re doing with it, whether you wanted a new one or to modify the existing one Now, sometimes we have to split up different columns One of the things that comes up a lot, as I said, in Data Science is this need to sometimes only have the numeric values to do our analysis or only to look at the string values Because you’ll discover as you explore Data Science that there are different techniques for handling string values versus numeric values So it’s not unusual to take your DataFrame and say, hey, take the values that are more string or qualitative, put them in one place Take the quantitative, the numeric values, put them in a different place So in this case, if we take a look, maybe I want to take all the string type values, what we would call qualitative data, and move that into a separate DataFrame Well, to do that, you simply use a DataFrame query, similar to what Christopher was showing you when we showed you how to query a DataFrame So we 
say, let’s do a query using Loc of a DataFrame, and we say, which rows do we want? Colon means return all rows Which columns do we want? We simply list the columns we want returned Then we take that query result and we put it into a new DataFrame, and presto, we now have our brand new DataFrame containing just the columns we needed So there you have it Python with the power of Pandas and DataFrames gives us lots of clever, simple ways to remove a single column or to move aside or split a DataFrame into a new DataFrame So continuing with the fact that in a lot of Data Science courses and tutorials you look at, there’s a lot of talk about preparing data One of the other scenarios which you’ll often have to handle is handling duplicate rows You’ll see that Data Science really doesn’t like it when there’s multiple copies of the same row It also tends to get upset, or can even crash your code, if it encounters rows which have missing values So let’s see how Python and the Pandas DataFrame give us some methods and techniques for handling those situations So specifically, let’s start with the missing values There are many different Data Science methods you’ll use which will actually crash if they hit a row with missing values So if I had the DataFrame here in front of me for some airport information, because I have an arrival time and a delay time that are missing, values showing up as that NaN, this could actually cause my code to crash I won’t be able to train a model So one of the things we need is a way of just knowing if a particular DataFrame has missing values The info method of the DataFrame is fantastic for this What it’ll do, it’ll tell you all sorts of great information One of the things it’ll tell you is how many rows are in your DataFrame Then for each of the columns, it will tell you how many of the rows in that DataFrame contain values that are not null So you can see here that the flight date, well, all 300,000 rows contain non-null values So there are 
actual values in every single row for the value of flight date, and for a unique carrier But wait, when I hit tail number, out of that 300,000 rows, only 299,000 or so of them have actual values, non-null values in them That means that some of those rows do contain a null or a missing value So you can see here there’s actually three different columns that have a non-null count lower than 300,000 So that indicates to me there are some missing values for some of the rows of these three columns So now that I know I have some missing values, how can I now use Python to deal with those values? Well, fortunately, Python Pandas is designed for this, so it has a dropna method, which means drop NA, drop those NaN values It allows me to remove those rows from my DataFrame Now, one thing to be aware of that’s very specific to working with Pandas DataFrames When I do a dropna, I am not actually dropping the rows from the DataFrame itself I am saying, drop the rows and then copy the DataFrame with the dropped rows to a new DataFrame
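The info check and the dropna call described above can be sketched like this; the columns and values here are made up for illustration:

```python
import numpy as np
import pandas as pd

# A small stand-in for the flight data; row 1 has missing values
delays_df = pd.DataFrame({
    "FL_DATE":   ["2018-01-01", "2018-01-02", "2018-01-03"],
    "TAIL_NUM":  ["N001", np.nan, "N003"],
    "ARR_DELAY": [7.0, np.nan, -4.0],
})

# info() reports the row count and a non-null count per column;
# any column whose count is below the row count has missing values
delays_df.info()

# dropna returns a new DataFrame without the rows that contain NaN
delay_no_nulls_df = delays_df.dropna()
print(len(delay_no_nulls_df))   # the row with NaNs is gone

# or modify delays_df itself with inplace=True
delays_df.dropna(inplace=True)
print(len(delays_df))
```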

So here I have this one row with missing values So what it does is it creates a new DataFrame called delay no nulls DataFrame, and it’s missing that row which had the null values or the missing values If however you want to actually modify the DataFrame itself, you can do that by just specifying in-place equals true Then what happens is it literally just removes that row from the existing delays df DataFrame The other scenario we might need to handle is duplicate rows So you will often find that a lot of times when we’re working with data science, you’ll be given files from multiple different places and then you combine all those files together into one DataFrame That may mean that you get into situations where sometimes rows may be loaded multiple times, so we also want to handle those duplicate rows because that can skew results when you’re doing data science So here I have some data in front of me, some simple information about airports, but I have two rows for the airport in Dulles, Washington So how do I handle that using Python? Well, with Pandas, we have the airports DataFrame, and we call duplicated, and what that will do is it will return true or false for every row, and if a row that it locates is a duplicate of a previous row, then we’ll see a true So you can see here that the first row, it returns false, second row is false because up until now it has not seen a duplicate of Dulles The next row is another row that’s exactly the same, so now it says true because this row is a duplicate of a previously seen row, and the remaining rows are all false I also want to make sure we highlight here that it’s the entire row that’s a duplicate We have multiple records with a country of USA, but duplicated is looking for records where the entire row is a duplicate So this allows us to locate duplicate records. Now, how do we get rid of them? 
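A minimal sketch of spotting and then removing those duplicates, using a made-up airports table:

```python
import pandas as pd

# Airports with one fully duplicated row (Dulles appears twice)
airports_df = pd.DataFrame({
    "name":    ["Logan", "Dulles", "Dulles", "Sea-Tac"],
    "city":    ["Boston", "Washington", "Washington", "Seattle"],
    "country": ["USA", "USA", "USA", "USA"],
})

# duplicated() returns True for each row that repeats an earlier row in full;
# sharing a single value (like country USA) is not enough
print(airports_df.duplicated())   # only the second Dulles row is True

# drop_duplicates with inplace=True removes the repeats from this DataFrame
airports_df.drop_duplicates(inplace=True)
print(len(airports_df))           # 3 rows remain
```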
Well, there’s a wonderful little method called drop duplicates, and I’m going to use that same little trick I did for dropping rows with missing values I’m going to use in-place equals true, to say within this DataFrame itself, “Drop” the duplicate rows, and what that’s going to simply do for me is get rid of that second row for Dulles, Washington So my data is all set, nice and tidy, ready for me to go on and play with my data science So now we’ve dropped our duplicate rows, we’re ready to start going and doing things like training models Now, let’s try all of this out in actual code So as we’re preparing our data to train models, we do have to make sure that the data we’re working with doesn’t have missing values, which can cause issues and might actually give you an error when you try to train a model, and we also need to deal with duplicate rows because duplicate rows can skew the results when you’re training data models So now, let’s take a look inside our actual code at how, using Python and Pandas DataFrame methods, we can handle missing values and duplicates So taking a look here, we’re still working with Pandas DataFrames So I’m going to import that wonderful Pandas library and I’m going to start by loading a nice data set here into our DataFrame, which contains all sorts of information about different flights, when they left, and arrival delay information Now, this is a very big DataFrame and I do not want to go reading through every single row to find out which rows might have a missing value But fortunately, I can use the DataFrame’s info method What that will do is it will tell me all sorts of great information about that DataFrame and the data inside it So it tells me things like, “Hey, you have 300,000 rows inside that DataFrame across 16 different columns.” It also tells me, for each column, how many rows contain actual values, non-null values So I can see that flight date has 300,000 rows, all containing non-null values, unique 
carrier, all the rows contain non-null values But when you get to tail number, you’ll notice only about 299,000 of the rows contain non-nulls Or for arrival time, again, there’s a number of records here, most of the rows have a non-null value, but that must mean that some of them do have a null value So this is a simple way to check and see if you have any rows which contain a null value, or any particular columns that contain null values, inside your DataFrame So now that I’ve identified I do have rows with missing values, nulls, NaNs, these words could be used interchangeably here, how do I handle that? So if we go back to our code, once you’ve identified it, you can use dropna, and what that will do is it will drop the rows with missing values Now, by default, this does not modify the delays DataFrame itself Instead, what I’m doing is I’m saying drop the rows with missing values and take the remaining rows and put them into a new DataFrame So if I “Run” this and then I do an “Info” on the new DataFrame, you’ll see it only has 295,000 rows left because all the rows with missing values have been removed,

but you can see that for every single column, all of the rows contain actual values So I can confirm that I have successfully dropped every missing value, or every row with a missing value, from my DataFrame If however, you prefer not to create a new DataFrame, if you just want to drop the missing rows from the existing DataFrame delays_df, you can do that too That’s supported There’s a parameter called inplace When you set it to true, basically that’s saying modify the delays DataFrame itself So now, what I’ve done is I’ve actually dropped the missing rows from the delays DataFrame So you can see once again now it only has 295,000 rows, but every single one of the columns, all the values in there, have no missing values, no nulls The other scenario we sometimes have to deal with is duplicate values So one of the things you’ll find is sometimes when you’re doing data science, the data may come in from multiple files and then all be merged together into one DataFrame, so as a result, you can end up with some duplicate rows and that can skew results when you start doing data science So one of the things we need to do is check if there are duplicates and then deal with them So I’m just going to “Import” a very small file here, so that you can see it has duplicate values So it has three records from USA Those aren’t duplicates, they just happen to be in the same country, but here I can see that there’s actually two records from Dulles, Washington Now, I’ve deliberately displayed a very small DataFrame here, so you can see that, but what if a DataFrame was too big to just show on the screen? 
How could I quickly just check and say, “Hey, look at all the rows and tell me if there are duplicates”? Well, the Pandas DataFrame has this duplicated method, and what that will do is it’ll return a true or false indicating whether or not a row is a duplicate of a previous row So in this case, the third row, this one here, Dulles, Washington, is a duplicate of the previous row So you can see that the one with the index of two, it says, “Yes, that row is in fact a duplicate.” It does have to be the entire row that’s a duplicate, simply having the same value in a column does not make it a duplicate row Once you’ve identified you do have duplicates, if you want to get rid of those, all you need to do is call “Drop duplicates” I’m using that same parameter inplace equals true to say, “Drop” the duplicate rows inside the airports DataFrame itself So now when I “Run,” you can actually see I’ve successfully gotten rid of that extra Dulles, Washington record You will notice, of course, the index numbers, that this can cause gaps in your indexes So just be aware of that as well when you’re dropping duplicate rows So there we have it, we can figure out if we have any missing values in our DataFrame, get rid of the rows that do have missing values, we can determine if we do have any duplicate rows, and we can get rid of those duplicate rows So now you’ve got your data a little more cleaned up and ready for data science and training models >> We’ve seen a lot of different ways that we can manipulate data Now, one of the biggest tasks that we’re going to have is to get some insights from that data That we’re going to want to be able to try and predict certain behaviors, try and predict certain values Now, in order to do that, what we’ll need is to split up our data That we’ll need to be able to say, “Hey, these are all of the things that are going to impact some particular value,” or to put this another way, to give you a couple of different examples or scenarios, that 
given X data predict Y value So for example, given a particular customer, I want to be able to predict whether or not they’re going to buy a product, or given a shopping cart, I want to be able to offer up some other items that the customer might be interested in If you’ve ever shopped online, you’ve seen this behavior, where you’ve seen that little spot that says other customers have bought This is exactly what’s happening behind the scenes, it’s a little bit of Machine Learning Along those lines, with our data, since we’ve been looking at flight data, what we want to be able to do is say, given a particular flight, predict the time that it’s going to be delayed, whether or not it’s going to be early, etc But in order to do that, what we’re going to need to do is to take our data and split it up Then we’re going to need to create two different DataFrames Then we’re going to want to set up the labeled DataFrame Basically, what we want to predict, in our case, that’s going to be how late or how early is a particular flight going to be, into a DataFrame commonly labeled as y,

and in particular a lowercase y. I know I’ve said previously my challenge is with single letter variable names and things like that This is convention, so I do follow along with it even though it pains me inside But yeah, so the value that we’re going to predict is going to be in y So in our case, the value that we want to predict is the minutes early or minutes late that that particular flight would wind up being The data that’s going to influence this This is going to include things like the time of departure, the time of arrival, the airport that it took off from, the airport that it was going to, etc All of that information is going to make up the features, or what’s going to influence the label So if you stop and think about it, the airport that we’re coming from, the airport that we’re going to, when you’re talking about things like weather, air traffic control, etc, all of those different things are going to have an impact on whether or not a flight is early or a flight is late So we’re going to take all of that information, put that off somewhere else into a different DataFrame This DataFrame is going to be labeled as X So if we want to take a look at a little bit more of a concrete example here, what we’re going to notice is we’ve got the distances, we’ve got the elapsed time, and then we’ve got the arrival delay So seven minutes, four minutes early, five minutes, etc, the whole way down What we want to be able to predict, or label, is going to be that delay and everything else is going to influence it So if we remember from what we had before, using that location index, we’re able to go get all of the rows for that arrival delay column and then in turn, we can go grab all of the features by just grabbing everything else We can either list them off like we have here or maybe if it was basically everything but that last column, we could’ve gone ahead and set up a range where we indicated index zero to whatever it might be The 
end result is going to wind up being that we’ll have two different DataFrames, X and y, where X is going to be all of our features and y is going to be the label that we’re going to be trying to predict You’re also going to notice that the index is going to come along for the ride That gives us the ability, and also gives our trainer the ability, to align everything properly So that way we know which pieces of data are going to be correlated to which eventual results here Now, the other thing that we’re typically going to have to do is to split up our data into training data and testing data When we think about trying to train up a machine-learning model, or really train anything up What we’re going to need to do is feed it some information Hey, this is what’s going on, we’re going to make some assumptions based on that information, and then we’re going to want to be able to test our assumptions Like, this is how we learn We’ll learn something and then we’ll go out and we’ll test that behavior, and we’ll say, “Hey, is this in fact the case, or is this not the case?” That’s exactly what we’re going to do when we’re setting up our machine learning model That we’re going to say, “Hey, this is the information that I want you to use to train yourself and then this is the information that I want you to use to confirm that all of those assumptions that you’ve now generated are in fact correct.” Typically, you’re only going to have one set of data So what you’ll do is take that one set of data, take some of the rows and turn that into the training data, and then some of the rows and turn that into the testing data So you’re going to notice that we’ve got our DataFrame X What we’re going to do is we’re going to split that up into some training data and into some testing data I want you to notice the indexes here You’re going to notice that on the training side, we’re going to have zero and two, whereas on the test side we’re going to have one and 17 Everything else that’s 
there is irrelevant, the takeaway that I want you to get out of this is we didn’t just simply say, well, the top 70 percent, that’s what we’re going to use to

train and then the bottom 30 percent is what we’re going to use to validate all of those assumptions, that we’re going to use to test That’s not really going to give us good results because information could be skewed It’s possible, depending on our dataset and where it is that we’ve loaded it in from, that maybe there was a sort that took place So maybe, for example, everything is sorted based on the destination airport So now what I’m going to wind up with down at the very bottom is everything that’s landing in, I’m trying to think of an airport code that starts with a z So whatever that might be, I’m just going to use Sea-Tac just because it’s later in the alphabet So let’s say that I’ve got Seattle down at the very bottom So now that winds up becoming my testing data So all of the assumptions are going to be based on everything but Seattle and then I’m going to wind up testing those assumptions against Seattle You can easily see, hey, that’s not really a good way to do things So instead what we want to do is we want to randomize this, so that way we’re not going to be biasing our training and our testing So when we go in and we look at our results, we look at our accuracy, we’re going to know, oh, we’ve done a good job here We’re testing this and training this on some nice random values, that’s exactly what’s happening here So as a general rule, train on 70 percent of your data and test on 30 percent of your data Somewhere in that ballpark is a general nice starting point It’s a nice general rule of thumb that you can follow Now, there’s a nice little library that’s available to us called scikit-learn that does an awful lot of really nice things in this space One of the things that it does is offer us a train test split What’s great about this is we can actually just give it our X DataFrame and our y DataFrame, so our features and our label, and then it will give us back everything that we might want So we pull that little function in, then we go ahead and call it What 
you’re going to notice is that it will give us back that X_train, X_test, y_train, y_test Where X_train is going to be all the features we’re going to train on, X_test is what we’re going to use to test all of those assumptions that are generated Then the y_train and y_test is going to do the exact same thing except for the values that we want to be able to predict, again, that’s going to be that label X and y is what we’re coming in with, the test size is where we’re going to indicate how we’re going to split everything up Now, the last item that’s on there is the random state Now, we’re setting random state to a value here and by doing that, we’re now seeding how random values are going to be generated So at the end of the day when you specify a random state, you’re actually not going to be getting random values anymore, because it’s always going to be basing it on that exact same seed The reason that you might decide to do this is if you’re looking for some level of replayability So that maybe what I want to be able to do is to test out different algorithms or otherwise, and I want to make sure that we’re always doing it on the exact same training data and on the exact same testing data So that’s why you might decide to set that random state The end result of all of this is we’re going to start with a DataFrame that looks a little bit like this, where what we want to be able to do is to predict our delay So we’re going to split off our distance and our elapsed time into the X side, we’re going to set up the y side as the predicted delay, and then you’ll also notice that we’re going to have our train section and we’re going to have our test section We’ll be able to use those to in turn train up our model Let’s go in and see a couple of code examples so we can hopefully bring all of this together Let’s take a look at a code example of how we can split up our data into training data and testing data So just like before, we’re going to need pandas So let me import 
that and let’s read in a CSV file and check out its shape What you’re going to notice is the fact that we have 30,000 rows and 16 columns inside of our data Now, we’re going to keep things a little bit simple here,

that normally at this point, after you’ve loaded up your CSV file, you would probably do some exploration, start to get rid of some bad data, maybe bring data in from somewhere else, etc We’re going the green path here where everything is just going to work for us So we’re making our lives a little bit easier admittedly, but in the real world there’d still be a little bit of work that you would have to do here So 30,000 rows, 16 columns Let’s go ahead and start to break down all of that information into the appropriate spots So we’re going to use the distance and the elapsed time here, boom So you’ll notice that we call the head just so we can see the first five rows here, all of that looks nice, and we’re going to grab the delay and that all looks good So right now what we have is we’ve got two new DataFrames We’ve got X, which is our features This is everything that we’re going to use to try and do our predictions and then over on the y side, this is the value that we’re now trying to predict So given the distance and the amount of time that it took, how late is that flight going to be? 
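The X/y split described here, and the train/test split that comes next, can be sketched end to end; the column names and values below are made-up stand-ins, with a tiny 10-row frame standing in for the real 30,000-row one:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# A made-up stand-in for the flight DataFrame
delays_df = pd.DataFrame({
    "DISTANCE":         [946, 1721, 630, 2475, 867, 1055, 733, 612, 187, 2402],
    "CRS_ELAPSED_TIME": [150, 245, 110, 330, 140, 170, 125, 105, 60, 320],
    "ARR_DELAY":        [7, -4, 5, 12, -2, 0, 9, -6, 3, 15],
})

# X holds the features, y holds the label we want to predict
X = delays_df.loc[:, ["DISTANCE", "CRS_ELAPSED_TIME"]]
y = delays_df.loc[:, ["ARR_DELAY"]]

# Split 70/30 into train and test; random_state seeds the shuffle
# so the exact same split comes back on every run
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

print(X_train.shape, X_test.shape)
# The row indexes still line up between the X and y sides
print(list(X_train.index) == list(y_train.index))
```

Note that the rows land in train and test in shuffled order, which is exactly the randomization discussed above.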
That’s what we’re trying to set up here What we need to do in order to be able to feed this into a model and start to train that up, is we need some testing data and some training data Where again, the testing data is going to be used to validate all of the assumptions The training data is going to be used to build up that model We’re going to use that nice little helper function here of train_test_split Then we’re going to set up our X_train and X_test, y_train and y_test We’re going to pass in the X and y Again, our X is going to be our features Our y is going to be what it is that we’re trying to predict We’re going to set up 30 percent as the test size and then the random state of 42 So now when I run this and I take a look at the shape for that X_train, what we’re going to notice is 21,000 rows inside of there When we look at the test, what we’re going to notice is 9,000 rows inside of there So you’ll notice that we now have 30 percent of our data that’s going to be set up as testing, and we’re going to have 70 percent of our data that’s set up as training Now if you’re anything like me, I know I am, you might be looking at that real quick and going, “Hey, wait a minute Those numbers don’t seem to line up You have 30,000 rows Shouldn’t it be 20 and 10?” Remember we didn’t say a third, we said 30 percent So 30 percent of 30,000 is 9,000 Maybe it was just me that got caught on that Maybe you’re looking at that and going, “Yeah That makes complete sense.” Either way, there you are So those are a couple of numbers Then what we’ll also notice, if I go in and grab the head of train, for example, is that I did in fact get back completely random rows on the train side Then if I took a look at y, I’m going to see the exact same thing So there’s a train and test, again the shape on both of those Then what you’re going to notice is on the head, that again, we’re getting our random IDs But I do want to point out this fact here, that you’ll notice if I do this, I can 
actually go x_train.head There we go. Let’s go ahead and run this, and I’m going to just draw on the screen, make it a little easier Let me zoom in real quick There we go. So up at the very top there, this is my y, and then down below, that’s our x What I want you to notice here is that the IDs line up, and that’s important So that way, because we’ve split all of our data up, I’ve got my features over here, I’ve got my labels over here Again, what’s going to drive the prediction? What is it that we’re trying to predict? I’ve now got that in two separate DataFrames We need to be able to, at some point, bring all of those together That’s what that ID column is going to do for us, and that’s why all of that matches up Now that we’ve got the data ready to go,

let’s see how we could begin to train up a model Before we get into this module, I want to make sure that I’m setting expectations appropriately We are not going to get into selecting which model, differences between them all, etc, there’s a lot to that We actually do have content that you can go check out, that will give you more information about that I’ve mentioned it a handful of times already, the GitHub repository Again, if you go there, you’re going to notice that we have links to additional resources where you can start to dig deeper into those types of conversations like, “Hey, what’s the difference between logistic regression and linear regression?” So you can go check all of that out The example that I want to walk through here is going to use linear regression, which is designed to predict a value, and potentially show all of us that on a neat little chart That’s what linear regression is all about Again, what we want to be able to do is to predict how late or how early a particular flight is going to be That’s exactly what linear regression can do for us Again, if you want a deeper conversation on all of that, you can check out the links inside the repository, and learn more about all the possibilities there Let’s go ahead and take a look at how we’re going to take all of our prepared data and start to feed that into a model So if you remember from before, this is what we’ve got We’ve got our training data, we’ve got our testing data, we’ve got our features, we’ve got our labels, we’re ready to go We’re all dressed up, now we just need a destination We need to have somewhere to go That somewhere that we’re going to go, is going to be into some form of a model, into some form of an algorithm In this particular case, it’s going to be linear regression So linear regression is going to give us that ability to predict a value there that’s going to be on some form of a line, thus the name linear regression So it’s not just a clever name Now there’s a 
little fit function or a little fit method that we’re going to use after we give it all of our data, that’s going to actually give us the ability in turn to start to predict values and give us outcomes for any new data that we feed into it As I mentioned at the outset, there’s a lot of different ways that we can do this There’s a lot of different classes that are available to us We can use scikit-learn There’s other utilities out there that you might want to check out We’re going to use scikit-learn to perform this operation So it typically is going to follow this same pattern here, where we’re going to import the desired class that’s going to do our work We’re going to create an instance of that Then we’re going to call fit to have it start doing all of its work Now that’s what I’m about to demo here But as you might expect, there’s still more to it So when I go in and run all of this, the next question becomes, well, how accurate is that? That’s what we’re going to see in a later module For right now, what I want to focus in on is the mechanics How do we now take our data and fit that into a model? That’s going to be our code demo Let’s see how we can set up our model So what we’re going to do, as previously mentioned, is we’re going to bring in our pandas library and bring in our train_test_split We’re going to do exactly what we did before So I’m not going to spend a lot of time walking back through this code, but this is what we’ve seen: we’ve seen how to read in the CSV, we’ve seen how to drop our values that are null or otherwise, we’re going to then go grab our distance and elapsed time, we’re going to grab our arrival delay, and we’re going to split all of that up So basically, everything that we’ve seen, all of those individual demos, all fall right there It all comes together So let’s go ahead and run that little bit. 
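That whole preparation pipeline, rerun here, can be sketched end to end like this, with a tiny made-up DataFrame standing in for the flight CSV; the file and column names in the real demo may differ:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny made-up stand-in for pd.read_csv on the flight data;
# the column names are assumptions for illustration only
delays_df = pd.DataFrame({
    "DISTANCE": [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, None],
    "CRS_ELAPSED_TIME": [20, 35, 50, 65, 80, 95, 110, 125, 140, 155, 170],
    "ARR_DELAY": [-5, 12, 30, -2, 7, 3, 18, -9, 4, 25, 6],
})

# Drop the rows that have null values, as in the earlier demo
delays_df = delays_df.dropna()

# Features (what we predict from) and label (what we predict)
X = delays_df.loc[:, ["DISTANCE", "CRS_ELAPSED_TIME"]]
y = delays_df.loc[:, ["ARR_DELAY"]]

# Hold 30 percent back for testing; random_state makes the
# "random" split reproducible between runs
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

print(X_train.shape, X_test.shape)
```

With 10 clean rows this yields a 7-row training set and a 3-row test set, the same 70/30 split as the 21,000/9,000 seen in the demo.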
Boom Now we’re ready to set up our model Now, here’s what I want you to notice, we’re going to go grab our LinearRegression from scikit-learn,

we’re then going to set up an instance of that class, and then we’re going to go ahead and call fit We’re going to pass in our training data, for our features and for our labels So again, x is going to be what’s going to drive the value, y is going to be the value that we’re going to drive Now, I’m openly going to admit here that the ending is a little bit anticlimactic because we’ve now set up this model But we don’t know anything about it yet We haven’t seen yet maybe, how we could predict a value, and you’ll also notice that we haven’t yet used any of that testing data Well, the reason that we haven’t used any of the testing data is all that we’ve done at this point is we’ve just done our training, all that we’ve done is we’ve set up our model Now, it’s time to turn our attention to see how well did we do? That’s what we’re going to look at in the next module >> So Christopher just walked you through how to actually train a model Yes, that can feel a bit anticlimactic Once I was chatting with a data scientist who told me that 80 percent of your time is spent actually preparing the data for the training, and only 20 percent of your time is actually spent training the model When it gets down to the code, it’s really one line of code that you execute to train your model You spend most of your time doing that preparation, getting the data ready for training It’s interesting, you may notice over the next couple of modules the same thing really, it’s just one line of code that does all the magic for us It’s getting the data ready that’s going to take you most of your time But we have got a model trained based on the code that you saw with Christopher So now, let’s take a look at how we actually test our model once we’ve trained it So we know it’s actually somewhat accurate, and then we can rely on that model, and make decisions based on that model So how do we test a model? 
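The import-instantiate-fit pattern that was just described condenses into this runnable sketch, with synthetic data (and a deliberately simple built-in linear relationship) in place of the flight DataFrames:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a known linear relationship,
# so the sketch runs without the real flight CSV
X = pd.DataFrame({"DISTANCE": range(1, 101)})
y = pd.DataFrame({"ARR_DELAY": [0.01 * d + 3 for d in range(1, 101)]})
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# The pattern: import the class, create an instance, call fit
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# The fitted model now carries the learned line: slope and intercept
print(regressor.coef_, regressor.intercept_)
```

Training really is that one fit call; everything before it is the data preparation Susan describes.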
When we were looking at Christopher’s content, you saw that we had split up our data into four parts We had x_train and x_test, x_train is the data we use to train our model along with y_train, y_train being the values we’re predicting, x_train being our features Then we also had test data, x_test and y_test That was data we had put aside for testing and that’s what we’re going to use to check how accurate our model is So what we can do is take that x_test data we put aside and we’re going to use that to test our model So we use predict as the method that we typically use on the different models to say, here are some values that we would like to test with, please try and predict values for those different values in our test data So we take the data in x_test, and we pass that in using our regressor, which is the model object we had created, and we call predict, and we’re passing in the test data, and it will pass back a prediction for each of the rows of test data We can put that into a dataframe, I’m calling it y_predicted because that’s nice and consistent, we have y_test for the actual values for our test data, and now we have y_predicted containing the predicted values for our test data Now, once we’ve finished that, we’ll be able to compare the results we have in y_test, which is the actual values that happened for each record, to the predicted values our model gave us for each of the rows of our test data By comparing, here we know the first row, the first flight was five minutes early, our model predicted it would be just over three minutes late The second row, the actual delay was 20 minutes late, our model predicted it would be about six minutes late, and so on So now, we get a sense of whether our model’s doing a good job in making those predictions Now, let’s take a look at this in some actual code So when you wrapped up with Christopher, you had successfully loaded some data, split it into training and test data, and even trained a model So now, the next 
step was basically to see if the model could make predictions, and if so, how accurate are those predictions? So you know if you can trust the results coming from your trained model and if it’s going to make sense to make business decisions, or to make changes based on the predictions Can you trust the predictions from your model? So we need to be able to test our model and get a sense of how accurate the predictions are So what we’re going to need first though is to rerun some of our code we saw earlier So I’m just going to import all the same libraries you were using in the code with Christopher when you were training your model, and I’m just going to rerun the same code that you were looking at, that was going ahead and loading our dataset, still working with the flight dataset, getting rid of the rows with null values, splitting up our features and our labels, the values we’re using to train the model and the value we’re trying to predict, and then splitting that into training and test data So that’s all code you’ve seen before and you finished with that one little line of code It’s crazy, for all this work you think data science is going to be so complicated to train a model

You discover, training a model is all of one line of code It’s preparing the data that really takes up most of your time Now that we have trained the model by calling fit, we have a trained model, and we call the predict method to actually make predictions Now, if we’re going to predict values and we want to check accuracy, we’d like to make some predictions for values where we actually know what the actual outputs were That’s why you put aside that test data, because we have a whole bunch of rows inside x_test for which we actually know what happened, we know how many minutes late those flights were Those are the values we stored in y_test So if we pass in the values in x_test, which are a bunch of flights where we know what happened, and we say to the model, “Hey, predict what the flight delay will be for each of these records,” then we can compare the predicted values to the actual values So that’s exactly what we’re going to do So if we look at our code though, it’s really only one line We’re going to call regressor, which if you remember up here, that was what we called our LinearRegression object, which we just trained Now we can ask that trained model to predict values for each row in x_test and we’re going to put those in a dataframe called y_predict So we’re going to put that into y_predict. 
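A minimal sketch of that predict step, again with synthetic stand-in data so it is self-contained; the one important line is the regressor.predict call:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data so the sketch runs without the flight CSV
X = pd.DataFrame({"DISTANCE": range(1, 101)})
y = pd.DataFrame({"ARR_DELAY": [0.02 * d - 4 for d in range(1, 101)]})
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

regressor = LinearRegression()
regressor.fit(X_train, y_train)

# One line: ask the trained model for a prediction per test row
y_pred = regressor.predict(X_test)

# Line up actual vs. predicted so we can eyeball the quality
comparison = pd.DataFrame({
    "actual": y_test["ARR_DELAY"].values,
    "predicted": y_pred.flatten(),
})
print(comparison.head())
```

On real flight data the actual and predicted columns will disagree, which is exactly what the accuracy metrics in the later modules quantify.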
Now it’s done You can print those values out and you can see all the different values it’s predicted for each row Then you can compare those values to the actual values we had put aside for each of those test rows when we did the splitting of our data You can see that for the first row, the actual delay, it was five minutes early that first flight, whereas our model predicted it would be about three minutes late The second row, the actual flight was 12 minutes early, and our model predicted it was going to be about six minutes late Third row, it predicted it would be four minutes late when the actual value was about nine minutes early So I can see that my model is maybe not quite perfect yet This is where you need to go look into those data science courses, and you would learn techniques to go back and see how you get better accuracy on your model, but we’re focusing on the code part here You can see here, we have that ability to take our test data, pass it to the model, and get back predicted values What I’ll take a look at in the next module is some mathematical calculations you can do to get a better sense of accuracy, because reading through all those rows manually could be a little clunky So we’ve seen how after we’ve trained a model, we can compare our test data to our predicted data But the first time we did it, when we looked at it, we just saw a list of values and it’s not really practical to visually look at all the results from our test data, the actual values versus the values predicted by our model, and compare them row by row visually It’s just not a practical way to look at things So what we’ll typically do is we take advantage of the fact that we’re programmers and we use some code to do calculations to get a sense of the overall accuracy of how effective our model is in making predictions, but it still comes down to comparing the predicted values and the actual values The difference is, what we want to do is use some 
calculations to compare the two One of the most common methods we use is something called mean squared error So the mean squared error, if you actually look at the formula for it, is just the mean of the actual values minus the predicted values, squared Now, I could write some Python code In fact, you learned enough in the introduction to Python course to write a loop that would go through all the actual values, go through all the predicted values, subtract one from the other, calculate the value squared, and perform this calculation You could do that, but there’s a better way Luckily, there’s a whole bunch of great libraries out there that will help you when you’re doing data science The Scikit-learn library has great functions for scientific calculations, including one called mean squared error So all I really have to do is import the Scikit-learn library In particular, when you’re doing these types of calculations, you’ll probably want the metrics Then I just say calculate the mean squared error of my actual results versus my predicted results, and now I can get a sense of the total accuracy of the model Generally speaking, just from a data science perspective, a lower value is going to be better Lower error is good Now, sometimes there’s a whole bunch of different numbers and metrics you can look at Another one is the root mean squared error, which is just the square root of the mean squared error which we just calculated But Scikit-learn doesn’t have a method we can use to calculate it So there’s another library you’re going to start exploring with and playing with when you do data science as well, which is NumPy So what NumPy will do is NumPy has functions designed for straight math calculations,

including one which calculates a square root So if I have the mean squared error method in Scikit-learn and I have the ability to calculate square root with NumPy, then I can put those together to get my root mean squared error So these are the two libraries that are really going to help you when you’re looking to evaluate the accuracy of your model Different types of models you’re going to learn as you explore data science are going to have different metrics that you would evaluate to check the accuracy But generally speaking, between NumPy and Scikit-learn, you should always have some method out there that’s going to help you perform those calculations So NumPy for the basic mathematical calculations, and Scikit-learn often has a lot of specific methods for predicting and measuring accuracy of your models So now, let’s go and take a look at that in some actual code So we’ve trained a model and we’ve tested our model We passed in the data in x_test to predict a set of values, which is now stored in y_predict, and then we’ve compared those values to what we had in y_test, which were the actual values for our test data But so far, we were just comparing those row by row That’s very hard when you’re trying to get a sense of the overall accuracy of your model So what we want to be able to do is do some calculations to return to us a sense of overall accuracy of the model So if we take a look at the code here, you’ll notice I’m importing a new library called Scikit-learn, which we’re going to be using here So we’re going to start by just training and testing our model Again, this is the exact same code we’ve done in the previous lessons So don’t worry about that code You’ll see it’s exactly the same code you ran before, still doing linear regression, training our model, and then passing in some test data to get some predicted results So what we’re going to do differently now is we’re going to use some calculations to determine the accuracy One of the many different ways to measure 
accuracy is to look at the mean squared error, and this is calculated by doing the mean of the actuals minus the predicted values, squared You could write a loop that would do that using straight Python code You learned how to do that when you learned loop logic in the introduction to programming courses with Python But if you actually go to the Scikit-learn library, you’ll discover it contains a number of different methods to help you calculate metrics, which are great for evaluating your model Most of them are designed for this So what I can simply do is I can say, hey, let’s import the metrics from Scikit-learn and just calculate mean squared error, pass in the actual results for your test data and the predicted results for your test data, and it returns the mean squared error Just as a general guideline for this particular metric, a lower value is good Another common metric you might measure is the root mean squared error, which is literally the square root of the mean squared error But Scikit-learn doesn’t have a method for this That’s okay though, because there’s another library we can use called NumPy NumPy has a number of different functions you can use for mathematical calculations, including one which calculates square root So if we just calculate the square root of the mean squared error, then we get root mean squared error So NumPy is another useful library that we use when we’re trying to get these calculations to measure our averages and our totals to get a sense of our accuracy I’ll just throw out a couple of other examples here Different types of models are going to have different metrics that we look at to measure accuracy So when you look at data science courses, depending on the type of model you’re learning, you’re going to see different metrics But the key here being Scikit-learn and NumPy will often have the methods to help you do it So just a couple of other examples here We’ve got the mean absolute error, and it’s the mean of the absolute 
value of the actuals minus the predicted values The difference being, it’s a little less sensitive if you have some odd data in the middle of it, what we call outliers Again, lower numbers are better Another common one is the R-squared In this one, generally speaking, the higher your R-squared, the better the model So I’m not going to get into what all these numbers mean and represent But the key takeaway here is when you’re training a model, to measure the accuracy, different models will have different metrics, and Scikit-learn and NumPy are going to help you out a lot So in the last module when we were doing calculations, we explored very briefly a library called NumPy If you’ve actually started exploring any data science courses already, you may have actually already encountered NumPy You’ll see a lot of modules talking about using the NumPy objects and things like that Because NumPy is actually a library that’s used not just for calculations in general, but it’s also designed for doing matrix calculations So because it can do matrix calculations, it needs a way of storing matrices So it has an ability to store things like arrays and such as well Now, why am I saying all this? Well, because it turns out if we take a closer look at

the values that we predicted when we tested our model specifically, you might notice something interesting So when we called predict for our model, we passed in the test data and asked it to calculate some predicted values If you actually look at the type of y_prediction, you might have noticed something, which is that y_prediction, if you look at the datatype, is actually a NumPy array So it’s not actually a DataFrame, whereas y_test is a DataFrame Now, this is significant Because so far, we’ve been working with Pandas DataFrames So when we were doing queries, we learned how to query a Pandas DataFrame When we were looking at information, like how do I display the first few rows or the last 10 rows and so on, we were doing all of these with methods on Pandas DataFrames So it’s even possible maybe when you were trying some of this out yourself that you did something like, hey, let’s do a y_pred.tail() to look at the last five rows, and it failed because that method doesn’t exist on NumPy arrays That’s a method that only exists on a DataFrame So the other thing that might happen is you might be doing things like trying to merge two DataFrames together or split two DataFrames apart and you try and do again those actions on this y_pred, and it’ll all fail because again, it’s not a DataFrame, it’s a NumPy array So let’s just talk a little bit more about NumPy and explore that some more So it’s a Python package for scientific computing used for a lot of matrix calculations It has an assortment of classes and methods that’ll help you out When you’re doing Python for data science, you’ll end up working with a mix of Pandas objects and methods as well as NumPy objects and methods In NumPy, you have an array So in Pandas, we had a series that’s very similar to a one-dimensional NumPy array In fact, if you compare the code for declaring a NumPy array on the left and a Pandas Series on the other side, you’ll see the code is almost identical 
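Side by side, that near-identical syntax looks something like this sketch (the airport names are just illustrative values, borrowing Narita from the slide example):

```python
import numpy as np
import pandas as pd

# One-dimensional: a NumPy array next to a Pandas Series
airports_array = np.array(["Pearson", "Changi", "Narita"])
airports_series = pd.Series(["Pearson", "Changi", "Narita"])

# Positional access looks the same on both; the Series also
# displays its explicit 0, 1, 2 index when printed
print(airports_array[2])   # Narita
print(airports_series[2])  # Narita

# Two-dimensional: a NumPy array next to a Pandas DataFrame
grid_array = np.array([[1, 2], [3, 4]])
grid_df = pd.DataFrame([[1, 2], [3, 4]])
print(grid_array[0, 1])    # row 0, column 1
print(grid_df.iloc[0, 1])  # same position, via iloc
```

Only the constructor names differ; the implicit-versus-explicit index and the different method sets are where the two classes part ways.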
Here, I can tell it’s an array because I declared it as a NumPy array Here, I declared it as a Pandas Series, but the syntax is the same If I want to print all the values in a series or a specific value in the series, the syntax is again still the same The big difference is you have an implicit index with NumPy arrays and you have an explicit index that’s displayed zero, one, two when you have a Pandas Series, and they have different functionality A two-dimensional NumPy array is very similar to a Pandas DataFrame in terms of structure When you’re writing the code to access it and to create it, again it’s going to look very similar I declared a NumPy array and I specified the values for each row and each column If I want to print all the values, I just pass in the array name If I want a particular row or a particular value, I specify the index row position and the index column position Row then column, just in that order Then if I use a DataFrame again, pass in the values for the different columns and the different rows, print the entire DataFrame, or you can use that iloc index location to specify a row and column position to request a specific value So again, at first glance, they seem to be almost identical, but they are different classes and they have different capabilities So one of the things you’re going to run into is there’s going to be times where you may want the functionality of a DataFrame, but it’s in a NumPy array, or maybe you want the functionality of a NumPy array and it’s in a DataFrame Or you’re trying to merge data from two objects together, one’s a NumPy array, one’s a DataFrame When you try to merge them, you’ll get errors saying they’re different data types So you might need to move data between a NumPy and a Pandas object I’ll give you one example here In this case, if I wanted my predicted values to be in a Pandas DataFrame instead of a NumPy array, all I have to do is call the DataFrame constructor, pass in the NumPy array, that’s an 
object it will accept, and it returns to me a DataFrame containing the same values Now, you can see the y_pred is still a NumPy array but my y_pred DataFrame is actually a DataFrame So as you explore that world of data science, be prepared, this is going to happen to you You will have times when something is in a NumPy object and you need it in a Pandas object, or something is in a Pandas object and you need it in a NumPy object I’m not going to cover every possible conversion that might exist, but be prepared for it, and there’s lots of great documentation out there to help you when you’re looking at a specific scenario Now, let’s take a look at this in the actual code So let’s pop into our code and see what we’ve already got done We have imported our Pandas library and the model we want to use to train our data We have gone through, read our data, we have gone through, split it into training and test data, we have trained our model and we’ve even tested our model by passing in a set of test data Now, when we were done with this,

you might have actually noticed when we were comparing the results for predicted and actual values, that when you actually display the values that were predicted by the model and compare that to the actual values for the test data, they display differently I’m guessing some of you out there were like, “I wonder why those are showing up differently.” There is a reason for it, because these two objects are not actually the same type If you take a closer look, and we can use the type function to ask for the data type of an object, very useful anytime you’re not sure what you’re working with If you look at the type of y_predict, it says it’s a NumPy array, whereas y_test is a Pandas DataFrame Up until now, we’ve been very focused on working with Pandas DataFrames So what is this NumPy array thing? It’s very similar to a DataFrame and it’s got a lot of similarities, but its functionality is not the same, they are actually different classes So what does this mean in terms of our code? 
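In code terms, the difference boils down to something like this sketch, where a small NumPy array stands in for the predicted values and the DataFrame constructor does the conversion (the column name is an assumption):

```python
import numpy as np
import pandas as pd

# A model's predict() typically hands back a NumPy array,
# so these made-up values stand in for y_predict
y_pred = np.array([3.2, 6.0, 4.0])
print(type(y_pred))        # a NumPy ndarray, so head()/tail() fail

# Passing the array to the DataFrame constructor converts it;
# the column name here is just an illustrative assumption
y_pred_df = pd.DataFrame(y_pred, columns=["PREDICTED_DELAY"])
print(type(y_pred_df))     # now a Pandas DataFrame
print(y_pred_df.head())    # DataFrame methods work again
```

Calling type on a mystery object, then converting when needed, is the working pattern Susan describes for moving between the two libraries.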
Well, it means if we try and do something like call the head function to display the first few rows of the array, well, that’s actually going to fail because NumPy arrays do not support that method, that’s a DataFrame method, so it’s not available on the NumPy array class So let’s just take a little closer look at these So the NumPy library in general, just so you know, it’s basically a library that’s designed for mathematical calculations and matrix calculations So since it does matrix calculations, it has to support arrays A one-dimensional NumPy array is very similar to creating a Series in Pandas So if you look at the code, it’s very similar to create an array versus a Series, really the only big difference is I’m using a different constructor, the array constructor versus the Pandas Series constructor But aside from that, all the syntax is the same, except that they do have slightly different functionality If you want a two-dimensional array, which we do with the DataFrame in Pandas, you can create a two-dimensional array using a NumPy array, and you’ll see it come back Again, very similar to what we have with the Pandas DataFrame The biggest difference between the two really is when you’re working with Pandas, the index is explicit, you see the index number displayed Each row does have an implicit index when you’re working with NumPy arrays, but it’s not displayed when you’re printing out the values, but you can still use it to reference a particular row For example, up here when I was querying my series, I said, give me the value that’s with row index number 2, which is Narita Well, you can see in the Pandas Series that row index number 2 goes with Narita, and that same index value is implicitly there for the array, it’s just not displayed So one of the biggest differences is that implicit versus explicit index that does allow some extra functionality with Pandas DataFrames when you get further into them When you have an array again, two-dimensional, you 
can still specify by index position the row, column position of the particular value you want to return, just like you can using iloc on a DataFrame The difference again here being these are explicit index positions with a DataFrame, they are implicit on a NumPy array There might be times where you want that functionality of a DataFrame Maybe you want to call the head function or you want to call the tail function, or maybe you have data in a NumPy array and a DataFrame, and you want to take a column from one and move it to the other You’re going to get errors if you try to merge together a Pandas DataFrame and a NumPy array, a column from one and the other You’re going to get errors if you try to call head or tail on a NumPy array So there might be times when you need to move it from one to the other So there’s lots of different methods you can use to switch something from being a NumPy array to becoming a Pandas DataFrame, or to change a Pandas DataFrame into a NumPy array and so on I won’t go through all of them, but I’ll just give you one nice little example If you want to convert your predicted values into a DataFrame, you just pass that NumPy array into the Pandas DataFrame constructor and it will create a DataFrame for you So now, I can actually have that same wonderful functionality of the Pandas DataFrame on my predicted values So you’re going to end up going back and forth a lot between NumPy objects and Pandas objects So just be prepared when that happens, you can use type to figure out what they are, and there’s lots of great methods out there that will let you switch back and forth between NumPy and Pandas >> Chances are, staring at a bunch of numbers isn’t necessarily going to give you a lot of insight Sometimes, that’s what you need, but a lot of times, it’s nice to have a picture The old saying is a picture is worth 1,000 words, I guess it’s worth some level of numbers as well That’s what we want to explore here

is how we can lay down some charts and start to see what’s going on with our data The one big thing that we’re going to want to be on the lookout for is some level of correlation in our data Sometimes, we’re looking for something that’s relatively high level of just, ‘Hey, real quick, does x impact y?’ Sometimes, it’s because we want to remove items out of our training data because that could wind up skewing our results So for example, if we were looking at something where we were trying to predict something and we had maybe the amount that somebody spent on a plane ticket and their fare class, well, chances are those are going to be pretty tightly correlated because generally speaking, the more money that you spend, the higher of a class that you’re going to wind up with on the plane So sometimes, we need to remove that data or sometimes, we just want to go in and explore and see those types of things So that’s what we’re looking for One way that we can do this is through the use of a scatter plot The scatter plot is a really common chart type where what’s going to happen is a dot is going to be drawn out on your chart, that’s me drawing dots A dot is going to be drawn everywhere on that chart where x meets y So now, I can see based on where all of those dots are, what the values are, what the scatter is, what the full range is of all the values that are going to be on there It’s very useful for then checking correlation between columns So if you’re going to notice that there is that correlation where the value of one changes along with the value of another, then we’re going to see that inside of our scatterplot So if we break this down to a couple of real-world examples, you are going to notice that there’s a correlation between, say, car accidents on a particular day and snowfall, but you’re probably not going to notice much of a correlation between snowfall and the stock price of Microsoft on any given day We could stop here and have a 
conversation about correlation and causation and so forth, but that’s a little bit beyond what we’re looking for What we’re really looking for today is just, “Hey, is there correlation?” So a couple of scatter plots you can see right there With correlation, you’ll notice that real nice tight band going from the lower left towards the upper right With no correlation, you’ll notice that my values are just simply scattered So that’s a really nice indicator of what’s going on there We can use Matplotlib to draw out our graphs So if we have a DataFrame, we’re going to be able to use that to draw everything out We are going to need the library, so we’ll pull that in right there Then we’ll be able to actually draw a chart directly from our DataFrame I’ll just go ahead and draw this out here So what we’re going to do is load up our CSV file We’ve already seen how to do that, so not really a whole lot going on there Now we’re going to plot everything out So we’re going to indicate, ‘hey, we want a scatter plot’ We’re going to indicate what we want the x and the y values to be So the x is going to be distance and the y is going to be our arrival delay You’ll notice the color, you’ll notice the title, both of those are pretty straightforward The one that I want to highlight here is the alpha What the alpha is going to do, it’s a number between 0 and 1, is indicate how lightly or how heavily each one of those dots is going to be drawn By default, the alpha is one, meaning each dot is going to be completely dark If we lighten that though, a single dot is going to be a lighter color If there are two dots on top of one another, or maybe just slightly offset, then it’ll get a little darker with two dots, and then three dots, and then four, and then five So that way, if I’m looking at data where I’ve got a lot of points that are going to be inside
of my chart and I’m trying to get a better sense of, ‘Hey, where are things grouped together?’, I can do that by setting a lower alpha So now, I’m going to see darker colors where there are more overlapping values
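As a rough sketch of the kind of plotting call being described here, assuming a small stand-in DataFrame (the column names DISTANCE and ARR_DELAY, the title, and the values are illustrative assumptions, not the course's actual CSV):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-GUI backend so this sketch runs headless
import matplotlib.pyplot as plt

# Small stand-in for the flight-delays CSV used in the video;
# the column names here are assumptions, not the actual file's.
delays_df = pd.DataFrame({
    "DISTANCE": [300, 750, 1200, 300, 2100, 750],
    "ARR_DELAY": [5, -2, 40, 7, 12, -5],
})

# Plot straight from the DataFrame: one dot wherever x meets y.
# alpha is between 0 and 1; a lower value draws each dot lightly,
# so overlapping dots stack up darker.
delays_df.plot(
    kind="scatter",
    x="DISTANCE",
    y="ARR_DELAY",
    color="blue",
    alpha=0.3,
    title="Distance vs arrival delay",
)
plt.show()
```

With a real dataset of thousands of flights, the low alpha is what makes dense clusters of points stand out as darker regions.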

Then we go ahead and call show, and it will display everything out What I want you to notice here, first of all, is that there’s not a correlation between distance and arrival delay, which might have been a little bit surprising I figured there would have been some level of correlation, but we can see that there are just dots scattered all over the place, so no real strong correlation there The other thing that you’re going to notice is that we’ve got a lot of values really down towards the bottom, so there’s just not a lot of delay for a lot of our flights, and those outliers are then just that really faint color there So it’s a really nice way to get an overview of how our data looks Now, we can also break things down across multiple lines as well So we don’t necessarily have to go directly from the DataFrame Maybe we want to break this down into multiple lines, maybe because we want to reuse things, maybe because we want to make it more readable, or maybe because we just want a little bit more power to control exactly how our chart is going to be laid out So this will actually do almost the exact same thing, the only difference being that we’re going to be using the departure delay and the arrival delay to display everything out So now we can see that nice tight correlation there We can already start to see, hey, there’s obviously a line forming there, because we’ve got our departure delay and our arrival delay It obviously makes sense that there’s a tight correlation But we also have a little something going on over here So it’d be nice to be able to just slap down a line right across here, so that way I can hopefully better visualize how all of those numbers are coming together So this is actually where we could use a little bit of linear regression Linear regression isn’t necessarily only used to
predict values Sometimes it can be really helpful just to have it draw a line The name linear there, it’s not just a clever name It’s all about being able to draw lines So let’s take our data and feed that into a linear regression model Then when we get our values back, let’s go ahead and snap down that red line So this is really that same data that we saw previously The only difference is that now we run it through a little bit of linear regression So the y value, that y_predicted, is going to be the prediction of the arrival delay, and that’s what’s going to give us that nice red line Then we can use that model to overlay that line on top of our data and see that nice, neat little line So there, one more time, is a little bit of code You’re going to notice that we’re going to draw everything out with that nice red color Then we’re going to create the scatter plot down below, where we’ve got everything laid out in blue So the end result is going to wind up being a nice neat little chart, where we’ve got all of those raw values and then the linear regression model laid right on top Now we can see nice and clearly how our departure delay is impacting our arrival delay Let’s turn our attention to the code, so we can see all of this in action So just like before, we’re going to bring in our Pandas library and load in that exact same data, so nothing new here But now we want to start plotting everything out So we’re going to bring in Matplotlib and then plot directly from that DataFrame What I want you to notice here is that we’re setting up a scatter plot with x as the distance and y as the arrival delay When we see this, what we’re going to notice, give it a moment here, is that there’s no correlation between those values We’re
seeing a scatter plot with values scattered everywhere If we take a look, though, at something that’s definitely going to be tied together,
which is the departure delay and the arrival delay Now what we’re going to see is a different result Again, give this just a second here Now we can see that little chart going across It’s really the exact same code, the only thing we changed was the x and y columns here So there’s everything Now I’d like to get that neat little red line slapped down there I’m going to do that by using a little bit of linear regression So let’s bring in everything for our linear regression here I’m not going to dig too deep into this code You can go check out the prior module where we talked about linear regression The thing that I want you to notice is that we’re going through the exact same steps: we’re dropping the null values, we’re setting up our x So this is the feature, this is what’s going to drive the value: the departure delay We’re going to set up the arrival delay as the value we want to predict, and that’s what’s going to be put on a nice neat little line for us Then we’re going to execute our regressor and get the prediction values back So now I’ve got all of that inside of y_predict So now let’s build up that chart So right here, we’ve got our x_test, let me just type that out here real quick, that is going to be the departure delay Then right here, that y_predict, this is now going to be that smoothed out arrival delay That’s going to be that neat little line that the linear regression model created for us So those are going to be our x and our y respectively So now I can run that, and there is that neat little red line that we were just talking about Let’s say I want to see that on top of the scatter plot data Well, that’s right here What you’re going to notice is that one more time we’re plotting everything out just like we did before We’re setting up our scatter plot just like we did before The main difference really is just that we’re calling both plot and scatter here So now when we call show down at the very bottom, it’s going to put both
onto one nice neat little graph Now we can see that red line going across all of our scattered values So that’s how we can use our scatter plot to get a really good sense of what’s going on with our data We can see whether or not there’s correlation In turn, we can even use linear regression to get a nice neat little line that we can slap down to better get a sense of what’s going on with our data Well, if you made it this far, then I guess the last thing that we have to say at this point is, thank you We also want to encourage you to get in and really start playing with all the different tips, tricks, and tools that we’ve demoed here >> Yeah, so go out You’ve got some of those Python skills ready, now’s the perfect time to go look for some data science tutorials Check out some of the stuff on Microsoft Learn at docs.microsoft.com, some of the quickstarts You can find some really great tutorials to get started with machine learning and training models It should be a little easier now because you already know some of the code >> Yeah, so if you’re looking for specific spots to start looking, check out the GitHub repository that we’ve linked to You’re going to notice that we’ve curated a handful of nice next steps and how you can continue growing from there So now’s the perfect time, start writing some code Thanks again >> Thanks again. See you next time
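As a recap of the regression-overlay demo walked through above, here is a minimal, self-contained sketch. It uses synthetic data standing in for the course's flight CSV, and the column names DEP_DELAY and ARR_DELAY, along with the variable names, are illustrative assumptions rather than the course's exact code:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-GUI backend so this sketch runs headless
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the flight CSV; DEP_DELAY/ARR_DELAY are assumed names.
rng = np.random.default_rng(0)
dep = rng.uniform(0, 60, 200)
flights_df = pd.DataFrame({
    "DEP_DELAY": dep,
    "ARR_DELAY": dep + rng.normal(0, 5, 200),  # tightly correlated, as in the demo
}).dropna()

X = flights_df[["DEP_DELAY"]]  # feature: what drives the value
y = flights_df["ARR_DELAY"]    # label: what we want to predict
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

regressor = LinearRegression().fit(X_train, y_train)
y_predict = regressor.predict(X_test)  # points along the fitted line

# Scatter the raw values in blue, then lay the regression line on top in red;
# both calls draw onto the same current figure, so show displays one chart
plt.scatter(X_test["DEP_DELAY"], y_test, color="blue", alpha=0.3)
plt.plot(X_test["DEP_DELAY"], y_predict, color="red")
plt.xlabel("DEP_DELAY")
plt.ylabel("ARR_DELAY")
plt.show()
```

Because both plot and scatter target the same axes, the single call to show produces the combined chart: the raw values with the fitted line laid right on top, exactly the pattern described in the walkthrough.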