Best Practices for GCE Enterprise Deployments (Cloud Next '19)

SIRUI SUN: Good afternoon, everyone, and welcome to Best Practices for GCE Enterprise Deployments. Thank you all so much for coming. I know we're in the homestretch of Next, so I really appreciate that you all could be here. My name is Sirui. I am a product manager on the Compute Engine team, and the goal for the next 15 minutes is for me to share with you some of the best practices and learnings that we've picked up over the years, working both internally at Google and with our customers, on truly large deployments of GCE. As part of sharing that information with you today, we have two data sources. The first one is the learnings that I just talked about. The second one is some of the really exciting feature announcements that we've made here at Next, as well as leading up to Next, so that you have the latest and greatest product details to take with you as you go and scale out your GCE deployments.

So, the first thing I want to talk about is: when we talk about an effective GCE enterprise deployment, what is our North Star? What is the goal that we're going towards? One part of the goal is, for sure, agility. Almost every customer I talk to wants to get to the cloud to move quickly. They want to use the scale and the power of the cloud to quickly reach their business goals and to compete in the space that they're in. But the problem here is, agility can't be our only focus, because if you're just moving fast, you might be heading straight towards a cliff and not know it. And so the North Star I want to introduce for this talk, the goal that we want to reach towards, is agility with the right guardrails. For example, you want to know that you're moving quickly, but doing so in a cost-effective way, and also in a way that satisfies your legal, regulatory, and compliance requirements. You want to mix both of those, and this talk is going to be about how you move fast, but it's also going to be about how you use the tools that we offer you in GCE, and more broadly in GCP, to achieve the right guardrails as well.

So the first thing I'll do is look inside our toolbox and talk a little bit about the tools you have available in GCE to do these sorts of things, with a focus on the new features that we launched here at Next. This is a 300-level talk, so I'm going to assume some prior knowledge and focus on some of the newer and more exciting things.

The first section is going to be about choosing the right VM types. This is key to making sure that your workloads run quickly and effectively on GCE, and in a cost-efficient way. So the first thing I'm going to talk about is the different machine types that we offer, and there have been some exciting new announcements in this space. Ever since GCE launched, we've had the general-purpose machine type. This is the machine type you use by default; we find that it has the best price-performance ratio in general and that it is well suited to a lot of different workloads. You should always start here when assessing your workloads, and we've seen these machines used effectively for all sorts of things, including web serving, databases, mobile gaming, you name it. So, you start here. But we've had this for a long time, and we heard a lot of customers say: well, I have a very large in-memory database that needs to store a ton of data in that VM for, say, online analytical processing.
And so at a previous Next, we announced the memory-optimized machine types. Upon the initial announcement, these VMs supported up to 4 terabytes of memory. In the time since then, we've heard from a lot of customers that, well, my database is constantly growing as the data I work with expands, and so we want to make sure we have the VM shapes to keep up with that. Here at this event, we announced that we will be extending the memory-optimized line with machines that have up to 12 terabytes of memory, and that's coming soon.

Another type of workload we hear about a lot from customers is workloads that are very compute-intensive. In response to that, we're announcing here the compute-optimized line of machine shapes. So why would you want this particular type of machine? We have customers that are really sensitive to real-time performance. For example, if you're running an AAA game server, every millisecond counts, and you want to make sure you have the fastest per-core performance. Another reason is licensing: we see some customers with licensing agreements where they license per core, and so they want to squeeze as much juice out of each core as possible. Those are two reasons to look into the compute-optimized machine type. This is now in alpha, so if one of those cases sounds useful for your workloads, I encourage you to check out the blog post we have on this; you'll find a sign-up link for the alpha there.

That's all I'll go into on machine shapes, but I just want to call out that there was a session two days ago that goes much more deeply into the details, and all of these sessions are posted on YouTube, so check it out if it's interesting to you.
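To make the machine-type choice concrete, here's a minimal sketch of how a shape is picked at instance-creation time. Instance names, project, and zone are hypothetical, and the memory-optimized type shown is just one example from that line:

  # General-purpose VM: the usual starting point for most workloads.
  gcloud compute instances create web-server-1 \
      --zone=us-central1-a --machine-type=n1-standard-16

  # Memory-optimized VM for a large in-memory analytical database.
  gcloud compute instances create analytics-db-1 \
      --zone=us-central1-a --machine-type=m1-ultramem-40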

Another tool that we offer here at Google Cloud is custom machine types. The general-purpose line of VMs that I just talked about comes with some cookie-cutter configurations, but you actually have the flexibility to change the shape of the machine in increments of two vCPUs or one gigabyte of memory. And we find that, especially as customers round the corner of cloud adoption and become more mature in their deployments, they can get a lot of bang for their buck by right-sizing their machines using custom machine types. For example, if you start with a cookie-cutter configuration but then realize that you're underutilizing your memory, you can just scale back the memory for that instance: you shut down the instance, scale back the memory, and turn it back on again, and you'll see that the memory has gone down. It's really easy to do this. If you go into the UI, you'll see a set of sliders; in gcloud, you can use the --custom-cpu and --custom-memory flags.

Another use case we hear about from customers a lot is fault-tolerant jobs, or short-lived tasks. Maybe they don't really care when these tasks happen, and maybe these tasks are embarrassingly parallel: a little bit of computation that you can spin up, finish, and then move on from. Examples include rendering, genomics, or media transcoding. For those use cases, we have preemptible VMs. Preemptible VMs are offered in the same machine types and machine shapes that I just talked about. They'll last for up to 24 hours, but they can be preempted within that 24-hour period with a 30-second notification. The tradeoff is that they're up to 80% cheaper than regular instances. We see a lot of customers, when they start their cloud journey, make everything a regular, non-preemptible VM. But as they deploy, they start to identify certain workloads that fit these criteria, and they can migrate those jobs to be preemptible instead, and in doing so, save a lot of money. You can think of it as a setting on a VM: when you create it, you just set the preemptible flag, and you're off to the races with a preemptible VM.
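As a rough sketch of those two settings (instance names, project, and zone hypothetical), both a right-sized custom shape and preemptibility are simply options at create time:

  # Custom machine type: 4 vCPUs and 10 GB of memory.
  gcloud compute instances create batch-worker-1 \
      --zone=us-central1-a \
      --custom-cpu=4 --custom-memory=10GB

  # Preemptible VM for fault-tolerant, embarrassingly parallel work.
  gcloud compute instances create render-node-1 \
      --zone=us-central1-a \
      --machine-type=n1-standard-8 --preemptible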
As we've matured with our customers, another thing we started to hear is that some customers are really sensitive to what they call single-host tenancy. They say: I'm happy moving to the cloud, but I don't want the host machines that host my VMs to also host VMs from other customers. This might be because they have regulatory or compliance requirements that evolved in an on-premises world, which means this is how they've always operated and they don't want to go change those compliance requirements, or it might be because of licensing limitations. For those use cases, we've recently GA'd what we call sole-tenant nodes. This is the ability to take a physical server and reserve it just for your use cases and just for your VMs.

So what does it look like? On the left here, we see a normal host. This is a non-sole-tenant node; it's what you've always been using on GCE. We manage the host hardware in both cases, but in this particular case, other customers may have VMs that are also running on the same host hardware. We, of course, manage the isolation for you. But if you have the kinds of requirements I just talked about, you can move onto a sole-tenant host. There, you have the machine reserved exclusively for your VMs. Note that you still have the flexibility within that machine to set up VMs of different shapes and sizes, and to change those configurations as needed.

This is a newer feature, and we're just starting to see some adoption, but I want to walk through what it looks like so you can get a sense of how easy it is to actually deploy. We're inside the developer console here. All of the features I'm going to talk about are accessible from the UI, gcloud, and the API surface, but I think the UI sometimes makes things a little clearer to walk through. Here we are in the Compute Engine section of the UI; we click on sole-tenant nodes, and you can see I'm asked to create one. In creating one, I'm asked for just a few configuration settings, like what you want to call it and how big you want the machine to be. It's just a few-step process. Once I've created it, you'll see an entry here. It's hard to see, but that line entry represents a single sole-tenant host, ready to host your VMs. This particular configuration has 96 vCPUs and over 600 gigabytes of RAM, and you can fit as many VMs in there as you want, as long as they fit within that configuration. From there, when you create a VM, you just specify the sole-tenant node group that you want the VM to land in, and everything else is the same. Every other part of managing that VM is the same as you're used to, as long as you specify that particular node group, which you can see in the gcloud command over there.
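A hedged sketch of that flow in gcloud (template, group, and instance names hypothetical; the node type shown is the 96-vCPU, 624 GB configuration referenced above, assuming it's available in your region):

  # Define a node template, then create a one-node group from it.
  gcloud compute sole-tenancy node-templates create my-template \
      --region=us-central1 --node-type=n1-node-96-624

  gcloud compute sole-tenancy node-groups create my-node-group \
      --zone=us-central1-a --node-template=my-template --target-size=1

  # Create a VM pinned to that dedicated hardware.
  gcloud compute instances create secure-vm-1 \
      --zone=us-central1-a --node-group=my-node-group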

You'll find that the VM is now hosted on dedicated hardware. One thing we think this will really help with is allowing you to bring your existing licenses to GCP. As we've been working with enterprise customers, we've found that many of them have already-purchased licenses, for example Microsoft Windows licenses, and that up until now, the cost of licensing on GCE sometimes overshadowed the cost of the vCPUs and RAM for these customers. One of our biggest goals in the past year was to allow you to take these licenses that you had already purchased from Microsoft or from other software vendors and apply them to GCP, so that you don't have to rebuy them. At Next, we announced the ability to bring your own license. This is in beta. The way it works is, you can set up a sole-tenant node, and in doing so, you'll be compliant with the Microsoft bring-your-own-license requirements for dedicated hardware and per-core licensing. And in doing so, you can apply the licenses you've already purchased to your GCE deployment.

So you have some options here, and I want to walk through some of them. The first option actually existed even before we did this work, but in talking with customers, we found that very few of them realized it was an option, so I want to reiterate it here. In this option, you don't use sole-tenant nodes; you just use our typical multi-tenant hardware. It turns out, if you have Software Assurance with Microsoft, you can already use the License Mobility program to bring software licenses, for example for SQL Server, into GCP. In this configuration, you're still paying for the premium Windows images in GCP, but you can at least bring your software licenses to your GCP hardware. Option two is the one we're announcing here: you use the sole-tenant nodes we just talked about. Those sole-tenant nodes are compliant with the Microsoft bring-your-own-license requirements, such that you can bring both your Windows licenses and your Microsoft software licenses onto GCP, and you don't need Software Assurance from Microsoft to do so. We're already seeing some customers move to this model, and it really allows them to get the mileage from their existing software licenses and apply that to GCP.

OK, so the next section I want to talk about is managing your data. Before, we were talking about virtual machines; now I want to talk about the disks, images, and snapshots that you have in Compute Engine, and the tools we offer to manage them and give you some guardrails around them. The first thing, and this is something we've been hearing increasingly from customers, especially those in regulated industries, is that they might have compliance or regulatory requirements along the lines of: I need to control how my data is encrypted and how my data is managed in Compute Engine. I need to make sure that, if I had to, I could totally revoke access to that data from everybody, including the cloud provider, including Google. A practice we're increasingly seeing adopted is the use of customer-managed encryption keys. Customers create keys in Cloud Key Management Service (Cloud KMS), a managed service that we offer for key management and key storage. You go ahead and create a key in Cloud KMS, you point GCE to it, and then GCE will automatically start encrypting your disks, images, and snapshots with those keys.
And at any moment, you can revoke access to those keys from GCE, and what that means is that GCE will stop having access to the data. So let's walk through that in a bit more concrete detail. You would first start by creating a key ring and a key in Cloud KMS. That's a very quick process: you specify, for example, the rotation period, the name of the key, and where it lives. Then, when you go to actually create a disk, image, or snapshot in GCE, you'll find an option to point GCE to the key that you just created. Here in the UI, you can see it's one of the few options; in gcloud, you have the --kms-key flag, where you can specify the URI for the key, and that's basically it. Once you've done that, you've told GCE to protect your disks, images, and snapshots with that particular key. And from there, you can revoke access as needed. You can do that in the UI by clicking the disable option, or with the disable command in gcloud. From there, you'll find that Compute Engine no longer has access to that data. For example, if you took that disk and tried mounting it to a VM, or if the disk was already mounted to a VM and you started it, you would find that those actions fail until the key is restored.
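A minimal sketch of that flow (key ring, key, disk, and project names hypothetical):

  # Create a key ring and a key in Cloud KMS.
  gcloud kms keyrings create my-keyring --location=us-central1
  gcloud kms keys create my-disk-key \
      --location=us-central1 --keyring=my-keyring --purpose=encryption

  # Create a disk protected by that key.
  gcloud compute disks create protected-disk-1 \
      --zone=us-central1-a --size=200GB \
      --kms-key=projects/my-project/locations/us-central1/keyRings/my-keyring/cryptoKeys/my-disk-key

  # Later, revoke GCE's access by disabling the key version.
  gcloud kms keys versions disable 1 \
      --location=us-central1 --keyring=my-keyring --key=my-disk-key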

Another thing we've heard from customers, especially those in other regions, is that they want the ability to store all data in a particular location. For example, we have a lot of customers in Europe, and what they tell us is: I need to make sure that none of my data lives anywhere other than the European Union. Up until this point, persistent disks have always been regional or zonal resources, which means you have control over exactly where those disks live. Images and snapshots, though (images are used to create disks, and snapshots are used to back up disk state), have been global resources. That's been convenient for, say, a global software deployment, where a single global image resource can support the entire deployment, along with backups that support it. But customers have been asking us for a lot more flexibility, to be able to say: I want this particular image or snapshot to live in a particular region or location, and only in that location. So we're announcing here the ability to specify storage locations for both images and snapshots. For snapshots, that's already been released to beta; we did that in the lead-up to Next. And we're also going to be doing this for images; that's coming soon to alpha. The way this works is very simple, and there should be minimal impact on any of your workflows. We're back in the Compute Engine UI, in the list of snapshots (we'll do this for the list of images as well). It will now tell you, for each snapshot, where the actual storage location of the data is. And similarly, when you're creating an image or a snapshot, you'll now have an option to say: this is where I want that image or snapshot to live. If you don't specify that option, it will continue to be multi-regional, the way it always was. You can see the associated gcloud command there. It's fairly easy, but it gives you much more flexibility.
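A quick sketch of pinning a snapshot's data to a single region (disk and snapshot names hypothetical):

  # Snapshot a disk, keeping the snapshot data in europe-west1 only.
  gcloud compute disks snapshot my-disk \
      --zone=europe-west1-b \
      --snapshot-names=my-disk-snap-1 \
      --storage-location=europe-west1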
Another thing we saw customers do is use snapshots to make sure they have backups of the state of their VMs. And what we saw a lot was a pattern of customers writing their own automation and tooling to make sure these snapshots were generated on a regular basis, say every day, or every week, or every month. We saw so much of this custom tooling that we thought: you know what, we should offer an easy, Google-managed way to do this that reduces the maintenance overhead on your side. Leading up to Next, we introduced what we call scheduled snapshots to beta. This is a publicly available feature now, and we highly encourage you to go play with it. It allows you to tell us, the platform, to automatically take snapshots for you on a regular basis. Here's what that looks like: if you go into the snapshots view, you'll see a new tab called snapshot schedules. This is where you can create and modify schedules. If I go and create a schedule, I can specify a bunch of things, for example: when do I want this schedule to execute, how long do I want these snapshots to stick around, and what should happen if the disk is deleted. Here, in particular, I'm creating a snapshot schedule that's set to run daily and take snapshots at around noon. Once I have that schedule, I can then click into a disk (here I am in the list of disks) and apply the snapshot schedule to it. So here I am inside the edit-disk page, and I apply the snapshot schedule that I just created. Easy as that. From here, Compute Engine is going to automatically start applying that schedule for you.

There's no charge associated with creating these schedules; you're just paying for the storage of the snapshots, as you always have. The other nice thing is that snapshots are differential in nature, which means that when you take a snapshot, it only stores the difference between the current state and the previous snapshot. What that means is that if your disk isn't changing very much between snapshots, you won't be charged very much, and if the disk doesn't change at all, you won't be charged at all. So this is another powerful tool. We just rolled it out, and we're already seeing great adoption from some of our early adopters, who have said it's allowed them to basically throw away all of the custom tooling and scripts they'd built around this.
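Under the hood, a snapshot schedule is a resource policy you attach to disks. A hedged sketch of the gcloud equivalent of the UI steps above (schedule and disk names hypothetical; the start time is interpreted in UTC):

  # Create a daily schedule that snapshots around noon
  # and keeps each snapshot for 14 days.
  gcloud compute resource-policies create snapshot-schedule daily-noon \
      --region=us-central1 \
      --daily-schedule --start-time=12:00 \
      --max-retention-days=14

  # Attach the schedule to an existing disk.
  gcloud compute disks add-resource-policies my-disk \
      --zone=us-central1-a --resource-policies=daily-noon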

OK, so that was a recap of some of the new features and a few of the best practices around GCE. But a common question we get asked around this point, as we start to talk about best practices, is: how do I make sure that the developers in my organization are actually following these best practices? How do I monitor what's going on in my organization? And then, how do I start to enforce a certain set of best practices or policies, and put up those guardrails, so that I can make sure people are doing the right thing?

To motivate this, I want to talk about sprawl for a second. This here is a picture of urban sprawl; this is San Francisco looking out over the horizon, and you can see there are buildings, buildings, and buildings everywhere. Similarly, when we've been working with some of our enterprise customers, what we see is the concept of cloud sprawl. You open up GCP to your developers, and almost overnight, you have hundreds or thousands of projects spring up, because every developer gets their hands on it. They do some codelabs or tutorials, and every time they do that, a new project spins up. All of a sudden, you have thousands of projects. In that environment, it can be very difficult to wrap your head around what exists in your organization, and to make sure that people are following the best practices. It can devolve into a bit of a mess. That's what this section is all about: how do you make sense of that mess, and how do you start to govern it? Part of the mess is good, because it's agility and it's people moving quickly, but you want to make sure that, again, they're not falling off a cliff in doing so.

The first thing I want to talk about is the resource hierarchy. What we've found working with customers is that building a proper resource hierarchy is a really great force multiplier for understanding what's going on in your organization. This is a 300-level talk, so I won't dwell too much on this particular diagram, although I hope you've seen it before. But in general, I want to reiterate that in Google Cloud, you have resources like VMs, disks, and subnets. Those roll up into projects, so every resource must have a project. Projects roll up, optionally, into folders, and folders roll up into organizations. When you set a policy on, say, a particular folder, it rolls down into the projects, and from there into the resources. What that lets you do is start grouping resources that you want to have similar policies together, so that you don't have to reason over those resources individually. And once you have that resource hierarchy, your decisions become a lot more humanly comprehensible.

That was super abstract. Recently I was talking to a customer, one of our biggest customers, and the person said: we've heard that spiel a hundred times, and even after two years of deploying Google Cloud, we still don't really have a resource hierarchy. We just have thousands of projects all reporting up into a single organization. When I asked why, they said: we know what a folder is, we know what a project is, we just never had an understanding of how to put it all together, or any case studies to work from. I'm glad to report that that organization has since gone on and built a very effective resource hierarchy. But in case any of you are in that same boat, I wanted to share a much more concrete example: how Google internally has structured our own resource hierarchy. It turns out we have over four and a half million projects reporting up into Google.com, and you can imagine how much of a mess it would be if we had to reason over all of those projects individually. So we've done some things with our folder hierarchy to try to make sense of that.
The first thing we have is the concept of what we call the Google default folder. This is a folder that encapsulates all of the sandbox projects that our developers end up creating. What that means is that we can start reasoning over those sandbox projects together, rather than one by one. We limit our end users so that they can only create projects freely in this particular folder. That means we know this folder holds all the sandbox projects, and in turn, we can apply the policies that make sense for sandbox projects. For example, we can apply a policy that says: you know what, sandbox projects probably shouldn't have external IP addresses; they probably shouldn't be accessible from the public internet. Or something like: sandbox projects probably shouldn't have VMs that are too large, or that run overnight or outside business hours, because there's probably not much of a use case for that.

Continuing along those lines, we have another set of policies for demo projects. In this case, we have a bunch of demos, maybe built by the sales team or the developer relations team, and we group them into a folder. Some policies apply to all demos: it's probably OK for a demo to have an external IP address in some cases, and it's probably OK to share demos with external audiences in some cases. But we found there are also some differences in how different teams want to handle their demos, so we split into some sub-folders below, one for the sales team and one for the developer relations team, so that we can delegate some additional access to those teams and say: you get to manage your own policies within this folder.
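Mechanically, building out a hierarchy like this is just a handful of Resource Manager calls. A hedged sketch (folder display name, IDs, and project name all hypothetical):

  # Create a sandbox folder under the organization.
  gcloud resource-manager folders create \
      --display-name="Sandbox" --organization=123456789

  # Move an existing project into that folder.
  gcloud projects move my-scratch-project --folder=987654321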

Another thing we noticed is that the dev and test environments for our different engineering teams all tend to look much the same. So we have a teams folder, where we apply some general policies for what dev and test should look like, but we also have individual team folders that our individual teams can manage, and we delegate access to them. Another thing we have at Google is subsidiaries and acquisitions. We do want to group them under the top-level org, because we want to keep tabs on them and still be able to reason over all of the resources in our organization, but we break them off into sub-folders, like Verily and Google X, and give broad delegated administrative powers to those particular subsidiaries. This is a pattern we see happen a lot at other companies as well when they deal with acquisitions. And finally, we have the production folder. Just like with dev and test, we found that production tends to have a similar set of policies applied everywhere, so we apply those policies at the top-level production folder, and they flow down into all of the sub-folders and sub-resources. We do have individual services in there as well, and we allow the individual teams access to those folders, where they can set their own policies too.

So this is a simplified version of the resource hierarchy that we use at Google. It's by far not the only way to do it, but I hope it gives you an example of how you can take four and a half million projects and start grouping them into folders in a way that allows you to reason over them, and set policies over them, more easily.

OK, so once you have a resource hierarchy, the next question is: how do I more easily see all that's going on in my organization? This is a problem regardless of how well-structured your resource hierarchy is, although the hierarchy does help. I want to talk about the tools we offer to help you monitor your enterprise, to see what's going on inside that sprawl: who's creating virtual machines, who's creating data, and things of that sort. There are two tools I want to talk about, and we're going to build them up into some use cases as we go along.

The first one is audit logs. Audit logs give you a real-time view of the individual events that are going on in your organization. There are three types, and you can think of these three types as combining to give you that view. The first type is admin activity. These log the administrative actions that users take, like creation, deletion, and modification of resources. The second type is system events. These log the actions that GCE takes on your behalf to maintain your VMs; for example, instance 01 was live-migrated at 1 PM. The third type is data access. These track not modifications but reads of resources; for example, Alice listed all the images in her project at 1 PM. All of these logs are created in real time, with sub-second latencies, and that opens up some really interesting possibilities as you start to ingest them, which we'll talk about in a second. They are also immutable, so users can't go in and change the logs after they've been created. Admin activity and system event logs are retained for over a year by default; data access logs are retained for 30 days. And all of these logs can be exported to Cloud Storage, BigQuery, or Pub/Sub. Once you do that, especially when you export to Cloud Storage or BigQuery, the retention policies obviously no longer apply, and you can retain the logs as long as you want.
And for Pub/Sub, this opens the door to actions you can take immediately, as soon as audit logs are generated. Admin activity and system event logs are generated by default; they're on by default for all projects, so whenever a user creates a project, those audit logs start flowing. And finally, admin activity and system event logs are not charged at all, whereas data access logs are charged by ingestion volume. For data access logs, the general best practice is: if you know you need them, turn them on. There's even a way to exempt certain users so that they don't generate data access logs. But you should be mindful of the ingestion charges.

To make this a little more concrete, we can take a look at what an audit log actually looks like. We provide a number of ways for you to view your audit logs. Here I am in the developer console; I can go into the logging section, and I'll see a list of all the audit logs, which I can filter down as needed. If you look at a particular audit log, you'll find that we provide a really rich set of information for each one. Here's an admin activity log for a user who stopped a GCE instance, and you'll see the email address of the user, what they did, what resource they acted on, and any other fields that were in the request body. So you really get a lot of visibility into what users are doing in your organization.
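For example, a hedged sketch of pulling recent admin activity entries from the command line (project ID hypothetical):

  # Read the ten most recent admin activity audit log entries.
  gcloud logging read \
      'logName="projects/my-project/logs/cloudaudit.googleapis.com%2Factivity"' \
      --project=my-project --limit=10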

And so a question that tends to come up is: OK, I have 4,000 projects, all generating audit logs. Do I have to go into each project to chase it down and make sure it's exporting somewhere? In other words, how can I have all these audit logs go to a single place, so that no matter how many projects and folders I have in my organization, I can be sure I still have visibility over all of it? The best practice here is a feature called aggregated log export, which lets you do exactly that. With an aggregated log export, you can specify, at an organization or folder level, that all of the projects beneath that organization or folder send their audit logs to a single, centralized place. Here's the gcloud command to do that. We're saying: gcloud logging sinks create, so we're creating a new export. We're exporting to a particular bucket, which is above this line. This line says we want to export for this particular organization. We're filtering down to just the admin activity logs; maybe that's what we're interested in in this particular example. And then the most important flag here, the --include-children flag, says: apply this not only to this organization, but to all of the projects and folders and resources underneath it. With this one line, we're able to answer that previous question, and we can make sure we have a single GCS bucket with all of the events happening in the entire organization, updated in real time.

Here's a real-world example of what we've seen folks do with audit logs. We have companies that are really interested in reacting to insider risk, and sometimes they might have an insider-risk warning about a user who may have had their credentials compromised, or who may be doing some unwanted things inside the organization. With audit logs, and the search functionality on top of them, you can easily filter the audit logs down to just the actions that a particular user took. In this particular example, with this gcloud command, we're saying: I want to read the audit logs that are against GCE instances, that are admin activity, and that involve this particular user, Alice, who I'm interested in. You can also zoom in, for example, on a particular instance to see the history of that instance. Now, in this example, we're just looking at a particular project, but you can imagine combining this use case with the previous one, where you're aggregating all the logs across your organization, to easily search over all the activity happening in your organization. We find that to be a really powerful tool.
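Spelled out, the two pieces just described might look like this (bucket name, organization ID, project, and user email all hypothetical):

  # Aggregated export: send admin activity logs from the whole org,
  # and everything beneath it, to one GCS bucket.
  gcloud logging sinks create org-audit-sink \
      storage.googleapis.com/my-central-audit-bucket \
      --organization=123456789 --include-children \
      --log-filter='logName:"cloudaudit.googleapis.com%2Factivity"'

  # Filter audit logs down to one user's actions on GCE instances.
  gcloud logging read \
      'resource.type="gce_instance" AND
       logName:"cloudaudit.googleapis.com%2Factivity" AND
       protoPayload.authenticationInfo.principalEmail="alice@example.com"' \
      --project=my-project --limit=20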
Now, we released audit logs a long time ago, and we've seen a lot of adoption, but sometimes we get the question: I don't care so much about a view of all the events in the world as I care about the state of all of my resources at a given point in time. This is sort of possible to reconstruct from audit logs, but it takes a little work, and it's just not an easy, streamlined thing to do. As a result, we've introduced, and very recently GA'd, the Cloud Asset Inventory API. This is a powerful API that will, for any given point in time in the past five weeks, let you export what all of your resources looked like and what all the policies on those resources looked like.

So what does that look like? These are APIs, so we're making an API request here. In the first line, you can see we're requesting the access token; that's just the canonical way to get the access token, and we attach it to the request. Then we call the Cloud Asset Inventory API on a particular project, saying: export the assets for that project. You can also call it on a folder or an organization, and it will do the same thing. In this particular case, we're asking for the resources, as opposed to the IAM policies, so we're saying: get all the resources for this project, and dump them all into a GCS bucket. Once you've done that, you'll see a list of all of the resources in that project. We didn't say what time to do it for, so Cloud Asset Inventory assumes we mean right now, but you can also specify any time in the past five weeks. This is what the response looks like. In blue, you'll see some metadata about this entry; this is basically us saying: we found an instance, it's parented to this project, and this is the information about it. In green, you can see we've printed out all the properties of that particular resource: the CPU platform, the disks attached to it, basically every property you would see from making a GET request against that resource. This API has just been GA'd, and we've just started to work with customers on it, but we've already found customers using it to implement a pretty powerful and wide range of use cases.
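A hedged sketch of that request (project, bucket, and output path hypothetical):

  # Export the current state of all resources in a project to GCS.
  # Add a "readTime" field to the body to export a past point in time.
  curl -X POST \
      -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      -H "Content-Type: application/json" \
      -d '{
            "contentType": "RESOURCE",
            "outputConfig": {
              "gcsDestination": { "uri": "gs://my-inventory-bucket/assets.json" }
            }
          }' \
      "https://cloudasset.googleapis.com/v1/projects/my-project:exportAssets"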

For example, as kind of a first step, you can just see how many projects have GCE resources inside of them. Working with one customer, we asked them up front, before they ran this, how many projects they thought there were, and they said around 100. It turned out they had over 500 projects with GCE resources. They were really glad they made this request, and it's just another example of the sprawl that happens inside organizations that are rolling out the cloud. You can also ask things like: how many VMs have external IP addresses, or, how is my persistent disk versus local SSD mix changing over time? That's a cost-optimization one. Or, because this API also tells you about the policies that are applied, you can ask: at a particular moment in time, who had access to a particular set of resources? For example, if you're investigating a security incident. You can imagine that with these APIs, and with just a little bit of iteration over the results, you can do some really powerful scanning.

The last section here is around enacting and enforcing policies. This is all about: now that I know the state of the world, and now that I have the resource hierarchy to reason over it in a more humanly comprehensible way, how can I start to put up some guardrails? How can I start enforcing that some of the best practices we just talked about actually get implemented by my organization at large? When I say the word policies, I mean it in the broadest sense of the term. For a city, a policy might be: all buildings need two fire escapes, or buildings in this region can only be under 20 stories. In the cloud world, a policy might be for security purposes; you might say resources shouldn't be shared outside the organization, or, this is the set of people who can do X on this set of resources. Policies can also exist for compliance reasons; we used the example before of the organization in the European Union that wants a policy saying, my users shouldn't be able to stand up resources outside the EU. Or they can be for cost savings; you can say something like, VMs in a sandbox project should be shut down on weekends and holidays, or VMs should stay under a certain size, et cetera. There are a number of other reasons too.

Now, when we talk about enacting and enforcing policies, I want to draw a dichotomy. You can enforce policies before the fact, which is to say you place limits on activities before they happen; I might say users are not able to create a disk in a particular region, and if they try to, they'll get an error. Or you can enforce policies after the fact: you can be a little more permissive and allow users to take a certain action, but you're constantly scanning the environment to see what's happening and taking action against violations of your policies. At this point you might ask: why not only do things before the fact? Why take the risk of allowing some actions to happen and then reacting to them? Well, for some policies, like life-or-death policies or security policies, you should certainly enforce them before the fact, that's for sure. But if you're only enforcing them before the fact, how do you know that you're doing it in the right way?
One reason to enforce both before and after the fact is defense in depth. Let's say I don't want any VMs in my organization to have external IP addresses, because I'm very worried about data exfiltration. What I can do is: I certainly won't allow my users to create VMs with external IPs, but then I also scan after the fact to see what all my VMs are and whether any have external IPs. That way, I can be doubly confident that I truly am abiding by the policy. Another reason to enforce after the fact is for things that are more of a best practice, where I want to nudge my users towards a behavior. In that case, you'll find that by enforcing after the fact, which I'll give some examples of in a second, you can be a lot more expressive and take a lot more kinds of action than when enforcing before the fact.

But first, let's zoom in on enforcement before the fact. We have a number of tools for setting policies that prevent behavior before it happens. One, of course, is the IAM policy. Using the built-in identity and access management tools, you can attach policies to resources that say who can do what on them. For example, I can apply a policy that says this particular set of users has this set of permissions on this organization, and so on. We also have the concept of an organization policy. Whereas IAM policies are in the business of granting permissions to particular sets of users, organization policies are in the business of placing restrictions on resources. I can set up an organization policy, for example, that says: I don't want any external IP addresses.
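A hedged sketch of that exact example, denying external IPs across an organization with an org policy (organization ID hypothetical; the constraint name is the one Compute Engine uses for this restriction):

  # Deny external IP addresses on all VM instances, org-wide.
  cat > policy.yaml <<'EOF'
  constraint: constraints/compute.vmExternalIpAccess
  listPolicy:
    allValues: DENY
  EOF

  gcloud resource-manager org-policies set-policy policy.yaml \
      --organization=123456789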

We're also working on another one, calling back to the earlier example, that says: I don't want any resources stored outside a particular location or region.

Finally, another tool we're seeing become a lot more prevalent among larger organizations for enforcement before the fact is having a layer of automation sitting between the users and GCE. This is an actual real-life case study from one of our customers. What they do is, they don't allow any of their end users to create projects directly. Instead, the users go to a ServiceNow portal that the company has implemented and request that a project be created. The portal asks the user questions about what they want the project to be used for, and it might even kick off some approval workflows. Assuming the request is approved, the portal then calls a Jenkins job, which automatically creates a project on behalf of the user. And when it creates the project, it can do so with a specific set of defaults that are more trusted and more safe. For example, they delete the default network, because by default we create a new network with every single project, and instead they hook the project up to a shared network that they've created, which is more centrally administered. In doing so, they can also set up some other good defaults on the project, like the IAM roles. I think they've arrived at a good mix of agility with the right guardrails: their developers can still move quickly and create projects as needed, but now those projects have the right guardrails to make sure developers don't veer off a cliff.
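The core steps of a Jenkins job like that might reduce to something like this hedged sketch (project, folder, and host-project names all hypothetical):

  # Create the project under a governed folder.
  gcloud projects create acme-team-app-dev --folder=987654321

  # Remove the auto-created default network (in practice you may
  # need to delete its firewall rules first).
  gcloud compute networks delete default \
      --project=acme-team-app-dev --quiet

  # Attach the project to a centrally administered shared VPC.
  gcloud compute shared-vpc associated-projects add acme-team-app-dev \
      --host-project=acme-shared-vpc-host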
That's all I'll say on the subject for now, but I do want to call out that we had a session yesterday that goes much, much deeper into all of these topics, called Best Practices for Privacy and Security in Compute Engine. If this subject matter interests you, I highly suggest you check that out.

You can also do enforcement after the fact, and this is where some of the really interesting pieces come together. One best practice we're seeing in the field is taking automated actions from audit logs. As I mentioned before, you can export audit logs to Pub/Sub, which means that as actions happen, in real time, you get a push to Pub/Sub that you can subscribe to and take action on. At the start of the talk, I talked about scheduled snapshots and the ability to set up snapshot schedules. Let's say you wanted to enforce that as a best practice in your organization: any time a disk is created, you want to encourage your users to attach a snapshot schedule, or create one for them. What you could do, for example, is take your logs and export them to Cloud Pub/Sub, filtered down to just the disk-creation events, or the VM-creation events that have disk creation associated with them. Now you have a Cloud Pub/Sub topic carrying a constant stream of every disk being created in your organization. From there, you can set up a Cloud Function, any other kind of managed service, or even just a VM, that listens to this Pub/Sub topic, so it gets triggered every time a disk gets created in your organization. It can read the audit log, determine whether the disk has a snapshot schedule attached, and if not, take action. It might, for example, email the creator and say: hey, I noticed you didn't attach a snapshot schedule; you should probably do that. Or it could go further and just automatically create one for you. In this case, you're steering your organization towards a set of best practices, and you can see how this approach has a lot more flexibility. It takes a bit more work, but it gives you a lot more control over how heavy-handed you want your guardrails to be. You can also imagine that an aggregated log export makes this even easier to scale across your entire organization, because you deploy it once and it applies to all the logs being generated everywhere in the organization.

Another best practice we're seeing is automated scanning of your environment, using the information from the scan to check for policy violations. In the example here, let's say we want to ensure that no disks exist outside of the European Union. What I can do is use Cloud Asset Inventory to export all of the disks in my organization to GCS. Once I have that list, it's a pretty simple job to iterate over the file and detect the region of each disk, and if there's a disk outside of the European Union, I take the appropriate action. Maybe that means filing a ticket, or contacting the person who owns it, or, in a more extreme version, just deleting the disk. Then you can wait some interval and repeat step one.
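A hedged sketch of the plumbing for the first pattern: routing disk-creation audit events to a Pub/Sub topic that a function or VM can subscribe to (topic, project, and the exact log filter are illustrative; the sink's writer identity also needs permission to publish to the topic):

  # Route disk-creation admin activity events to a Pub/Sub topic.
  gcloud pubsub topics create disk-creations --project=my-project

  gcloud logging sinks create disk-creation-sink \
      pubsub.googleapis.com/projects/my-project/topics/disk-creations \
      --project=my-project \
      --log-filter='resource.type="gce_disk" AND
        protoPayload.methodName:"disks.insert"'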

So you can constantly be scanning your environment, and in doing so, you have a great vehicle for enforcing these kinds of policies. And with that, we've come to the end of this talk. Hopefully this has been helpful in showing you some of the best practices, and the tools we offer, for enterprise deployments. Thank you very much.