Designing and implementing disaster recovery systems in GCP (Google Cloud Next '17)

[MUSIC PLAYING] GRACE MOLLISON: I’m Grace, I’m a Solutions Architect with the Cloud Platform team, and I’m here to talk to you about DR. So why are we talking about DR? Well, if you’re responsible for the availability of your systems, then you’ve probably had a conversation something like this. How many of you have lived that picture? Yeah, quite a few of you. So I’m now going to delve into some of the patterns and things that you can do here. You’ll be aware that whenever you start talking about DR, there are some Three-Letter Acronyms, or TLAs, where you usually start. I’d like to start off by distinguishing between DR and HA. Think of HA as being for micro-disasters: a single server dies, a disk fails. Whereas DR is for a larger disaster: a huge chunk of your production systems breaks down and you need to fail over to a different location. Natural disasters also fall into that category. So there is a relationship between the two, and in some cases, particularly when running your production workloads on GCP, you will see how HA influences your DR strategy. For example, using GCP’s global load balancer means that if a region becomes unavailable, your application just keeps on working. When we talk about DR, there are two main terms that you need to understand. First there is Recovery Time Objective, RTO, which is directly related to how long it takes to get your application back online. It is determined by what strategy you adopt. This value is usually defined as part of a larger service level agreement. Secondly, there’s Recovery Point Objective, RPO, which relates to the amount of data at risk. Ultimately, it boils down to: how much data can you actually afford to lose?
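[Editor's note: to make the RPO idea concrete, here is a small sketch. The function names and the backup-schedule framing are illustrative, not from the talk; it simply checks whether a backup schedule's worst-case data-loss window fits an RPO target.]

```python
def worst_case_loss_minutes(backup_interval_min, backup_duration_min=0):
    # If the primary fails just before a backup completes, you lose
    # a full interval of changes plus the in-flight backup's worth.
    return backup_interval_min + backup_duration_min

def meets_rpo(rpo_min, backup_interval_min, backup_duration_min=0):
    # A schedule meets the RPO when its worst-case loss window fits.
    return worst_case_loss_minutes(backup_interval_min,
                                   backup_duration_min) <= rpo_min

# Transaction logs shipped every 5 minutes against a 15-minute RPO:
print(meets_rpo(15, 5))          # True
# A nightly backup (24-hour interval) against the same RPO:
print(meets_rpo(15, 24 * 60))    # False
```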
Think of it as the last backup or checkpoint you can roll back to. This metric will vary based on the ways the data is used. For example, frequently modified user data could have an RPO of just a few minutes, as you probably have the transaction logs that you can roll back to. Whereas infrequently modified data may have an RPO of several hours if you need to restore, say, last night’s backup. Note that this metric describes a length of time, not the amount or quality of the data. So taken together, these metrics have a roughly asymptotic impact on your bottom line, which means that the smaller your RTO and RPO values, the more your application will cost to run. SLA and SLO are often conflated, but the key here is that the SLA is the entire agreement and specifies what service is to be provided and how it’s supported: times, locations, cost, performance, penalties, and the responsibilities of the parties involved. Whereas SLOs are specific, measurable characteristics of the SLA, such as availability, throughput, frequency, or quality. An SLA may contain many SLOs. We’ll keep those definitions in mind as we look at some of the patterns and implementation details for the scenarios. Before we explore the approaches you could take, let’s take a quick look at some of the products and key features that GCP has available to help you implement your DR strategy, no matter where your production workload is. So it’s a little bit of an eye chart. Cloud Storage: this is an object store, great for storage of backup files. Cloud Load Balancing provides a single, globally accessible IP address to front your backend instances. It’s global, so your application could have instances running in Europe and the US, and your end users would be directed to the closest set of resources. GCE instances and images, the workhorse of the cloud: these are virtual machines which you can take incremental backups, or snapshots, of that you can copy across regions. You can create custom images with your application
You can then use them to launch instances from. Managed instance groups work in conjunction with GLB, providing a way to distribute traffic to multiple copies of identically configured groups of instances spread across multiple zones. Cloud DNS provides a programmatic way to manage your DNS entries. Deployment Manager allows you to define your GCP environment in a set of templates that you can then use to create complex environments, and simple ones as well, with a single command, repeatedly and consistently. Equally, you could tear the environment down with a single command. Network connectivity options: to transfer data to and from GCP, you need good connectivity, so we provide you with choices in how you can do this. We have a variety of connectivity options.

And it’s more than likely that you will choose Cloud Interconnect or direct peering to connect through to GCP for your DR configurations, so you have a reliable link and you’re not dependent on the variances of traversing the internet, where there is no control over the hops between your on-premises environment and GCP. I think we’re in good shape now to start exploring the approaches you can take. So let’s start with looking at DR when your production workload is on GCP, and start seeing how we can use this toolbox. Before I go on, I just want to drill into Cloud Storage in a bit more detail. There are a number of storage classes, and the characteristics of the specific class determine the most appropriate use case. In the DR scenario, Nearline is of particular interest to us. Nearline reduces your storage costs, but access costs are higher than standard. It’s designed for backup-type scenarios where access is, at most, once a month. Ideal for allowing you to do those DR stress tests while keeping costs low. Data is always a good place to start, I reckon, so let’s look at data DR techniques when your data tier is on GCP. It’s safe to assume that most of you have databases, perhaps several, and likely you’ve considered a tiered storage solution. Looking at tiered storage in this context, we are focusing on using it for backups, where you have the most recent backups on faster storage, and then you archive off backups to cheaper, slower storage. It’s a pretty common pattern. Ben’s session this afternoon on building hybrid cloud tiering with Cloud Storage for backup and archival goes into more depth on this technique. Here you can see an illustration of a typical tiered storage setup on GCP, where we’re using the characteristics of Cloud Storage classes to implement it. Inevitably, you’ll need to define some rules that describe how objects move between the tiers. Thankfully, GCP makes it easy by allowing you to script a few simple rules, which are known as lifecycle rules. You
could do things like downgrade the storage class of an object older than 10 days, say. Delete objects created before March 1, 2017. Or maybe keep only the three most recent versions of an object in a bucket with versioning enabled. That’s only half the story, though, and as on-premises, you need to think about what happens when you need to do a major upgrade to the database server, say. If so, then you will need to figure out how you can get back to a version of the database server that the backups you stored in Cloud Storage are compatible with. It’s not as onerous as it sounds, though, OK? You actually have some options about what you can do here. We allow you to create a custom image that has the relevant version of the database installed. Before we go into that, I want to talk a little about snapshots. Snapshots are available across regions, which means that if you snapshot your database, you can restore it to a different region as easily as you can restore it to the same region. As you can take regular snapshots of your system, this also means each time you upgrade, you can roll back to the snapshot with the previous version if you need to. As a short-term strategy, snapshots are useful for fast recovery. So back to custom images. When you launch an instance, you start from an image. This can be as minimal as just having the default OS, or it may have been fully baked, with your application and [INAUDIBLE] stored on it. Decide where on the continuum your image fits. Is it fully baked, or do you want just the OS?
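[Editor's note: circling back to the lifecycle rules for a moment, the three example rules above can be written in the JSON shape that `gsutil lifecycle set` accepts. The bucket name is hypothetical, and you should verify the exact condition fields against the current Cloud Storage documentation; this is a sketch.]

```python
import json

# The three example rules from the talk, as a GCS lifecycle config.
lifecycle = {
    "rule": [
        # Downgrade objects older than 10 days to Nearline.
        {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
         "condition": {"age": 10}},
        # Delete objects created before March 1, 2017.
        {"action": {"type": "Delete"},
         "condition": {"createdBefore": "2017-03-01"}},
        # With versioning on, drop archived versions that have three
        # or more newer versions, i.e. keep the three most recent.
        {"action": {"type": "Delete"},
         "condition": {"isLive": False, "numNewerVersions": 3}},
    ]
}

# Write this to a file and apply it with, for example:
#   gsutil lifecycle set lifecycle.json gs://your-backup-bucket
print(json.dumps(lifecycle, indent=2))
```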
When choosing your baking strategy, you’ll need to consider things like boot time and manageability. Fast boot times are a must for autoscaling web servers, but a database server may not need that fast boot time. Thinking about how the instance will be used in the context of your DR requirements will help you determine what your starting image should look like. Up until now, I’ve discussed DR techniques applicable when your database is deployed on a GCE instance. However, GCP has a range of managed databases that fit a wide variety of use cases, and you might choose to use one of these so you don’t have to take on the operational overhead of managing them yourselves, as you do when deploying a database on GCE instances. This table lists them together with some sample backup methods. OK, I’ve been talking about data backup and recovery, but that’s only half the story. You also need to consider the DR strategy for your application

or service that uses that data. The techniques I’ve already discussed, like snapshots and the concepts discussed with images, are all applicable. You need to bring your application and your data recovery techniques together, and look at the bigger picture when designing the strategy. The typical way to do this is to think about your RTO and RPO values, and then what DR pattern you can adopt to meet those values. DR patterns are considered to be cold, warm, and hot. Think of these as being how ready you are to recover when something goes wrong. Think about what you would do if you had a puncture. How you deal with it depends on how prepared you are, and we can draw an analogy with the cold, warm, and hot patterns. So you’re driving along and you have a puncture, but you have no spare tire. So you need to call someone out to come to you with a replacement tire, and then they have to fit it. You have to wait until they arrive. Moving along the continuum, the equivalent of a warm pattern would be if you had a spare tire or a repair kit. So yes, you do have to stop, but you can fix the tire or replace it, and then you’re on your journey. So you’ve had a bit of a break. But if you have run-flat tires, yes, you need to slow down, but you don’t stop. There’s no real impact on your journey. OK?
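[Editor's note: one way to read the cold/warm/hot continuum is as a function of your recovery targets. The thresholds below are purely illustrative, not prescribed by the talk; a sketch.]

```python
def suggest_dr_pattern(rto_hours, rpo_hours):
    # Tighter objectives demand a readier (and pricier) pattern.
    if rto_hours < 1 or rpo_hours < 1:
        return "hot"   # run-flat tires: production-ready standby, no stop
    if rto_hours < 24:
        return "warm"  # spare in the trunk: minimal standby, short stop
    return "cold"      # call for help: rebuild from backups on demand

print(suggest_dr_pattern(0.5, 0.1))  # hot
print(suggest_dr_pattern(4, 2))      # warm
print(suggest_dr_pattern(48, 24))    # cold
```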
So it’s just basically the idea of how ready you are, and therefore, how quickly you can recover. So let’s look at a few examples of how you can apply these patterns to production workloads on GCP. This cold pattern example requires that you only ever have a single instance up and running. Here the serving instance is part of an instance group, and that group is used as a backend service for an HTTPS load balancer. The heartbeat and snapshot instance is part of a managed instance group. Managed instance groups are groups of identical instances which work with load balancing services to distribute network traffic to all of the instances in the group. The managed instance group is controlled by a GCE autoscaler. The autoscaler is configured to keep exactly one instance running at all times. If it fails, then another identically configured instance is launched in its place. The heartbeat and snapshot instance runs a cron job at regular intervals to snapshot the serving instance, and it also checks the health of the serving instance at regular intervals. If the heartbeat and snapshot instance detects that the serving instance has been unresponsive for a specified period of time, it instantiates a new serving instance using the latest snapshot, and adds the new instance to the managed instance group. When the new serving instance comes online, the HTTPS load balancer begins directing traffic to the new instance. Looking at this warm static example, you can see how we’re making use of Cloud Storage to provide a DR solution for a web app running on Compute Engine. In the unlikely event that you are unable to serve your application from Compute Engine instances, you can mitigate service interruption by using the Cloud Storage-based static site on standby. In the normal configuration, Cloud DNS is configured to point at the primary application, and the standby static site sits dormant. In the event that the Compute Engine application is unable to serve due to a problem, you would
simply configure Cloud DNS to point at the static site. This implementation of a warm pattern is one of those patterns where the cloud encourages you to think differently. If you do not need all those wiggly moving parts all the time, this could be the pattern for you. A hot pattern, when your production is on GCP, is basically a well-architected HA deployment. In this example, we’re making use of a regional managed instance group together with load balancing and Cloud SQL. As I mentioned earlier, managed instance groups are groups of identical instances which work with load balancing services to distribute network traffic to all of the instances in the group. In this example, we are using a regional managed instance group, so we have instances that are distributed across three zones. Now, what’s really cool about regional managed instance groups is that you get HA in depth. They provide mechanisms to react to failures at the application, instance, or zone level, and you don’t have to manually intervene if any of those scenarios should occur. To address application-level recovery, HTTP health checks are configured as part of setting up the managed instance group; these monitor and verify that services are running properly on the instances in that group. If a health check determines that a service has failed

on an instance, the group automatically recreates the instance. Moving up the stack, if an instance in the group unexpectedly stops, crashes, or is deleted, the managed instance group automatically recreates the instance so it can resume its processing tasks. And if for some reason a zone becomes unavailable, you do not have to intervene, as by using a regional managed instance group, there will still be instances available serving your application. The global load balancer accepts traffic through a single global external IP address, and then distributes traffic according to forwarding rules you define. You set up health checks here to ensure that new connections are only load balanced to healthy instances that are up and ready to receive them. In the data layer, we have Cloud SQL, which is configured for HA. So we have set up a failover replica in a different zone from the master. All changes made to the data on the master, including user tables, are replicated to the failover replica using semi-synchronous replication. If the zone where the master is located experiences an outage, Cloud SQL automatically fails over to the replica, and your data continues to be available to clients. Existing connections to the instance are closed, but your application can reconnect using the same connection string or IP address; you don’t need to update your application after failover. After the failover, the replica becomes the master, and Cloud SQL automatically creates a new failover replica in another zone. This architecture is a great example of how, by using services that provide HA facilities, your application can automatically recover when things go wrong. My favorite tool in the DR toolbox is Deployment Manager. It allows you to define what your environment should look like using Jinja or Python templates. With a few clicks or a single gcloud command, you can bring up an environment within minutes. Let’s see Deployment Manager in action, bringing up the HA web application we
just looked at. I’d like to welcome Oleg. Oleg, one of my colleagues based out of our Amsterdam office, is going to drive the demo for us. Oleg, thanks. That’s Wizard, by the way. He’s our office dog. And his human. I did pay him, by the way; I gave him dog treats for letting me use his picture. OK. So the first thing Oleg is going to do is download the set of Deployment Manager templates from a bucket, and he’s going to deploy them using the gcloud tool. Deployment Manager uses a set of declarative templates that let you consistently deploy, update, and delete resources, such as GCE, GKE, and Cloud Storage, for example. Typically, you have, at minimum, a configuration file and one or more template files. Schema files are also usually part of the package. They allow you to declare a set of rules a configuration file must meet if it wants to use a particular template. The templates, as I’ve mentioned, can be written in Jinja or Python, your choice. Jinja is simpler, but less powerful than Python. Using Python gives you the ability to programmatically generate the contents of your templates. You can pass variables to your templates, which makes them easy to reuse. We set the Cloud SQL backend up earlier, because in reality, you’re not likely to recreate your database environment from scratch in a DR scenario, so we didn’t either. OK?
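[Editor's note: for flavor, here is roughly what a minimal Deployment Manager Python template looks like. Deployment Manager imports the file and calls GenerateConfig(context), then deploys whatever resources it returns. The property names here are hypothetical; treat this as a sketch, not the demo's actual template.]

```python
def GenerateConfig(context):
    """Return one Compute Engine instance, parameterized by the
    deployment's configuration properties."""
    zone = context.properties['zone']
    return {
        'resources': [{
            'name': context.env['deployment'] + '-web',
            'type': 'compute.v1.instance',
            'properties': {
                'zone': zone,
                'machineType': 'zones/%s/machineTypes/%s' % (
                    zone, context.properties['machineType']),
                'disks': [{
                    'boot': True,
                    'autoDelete': True,
                    # A custom image baked with your application.
                    'initializeParams': {
                        'sourceImage': context.properties['sourceImage'],
                    },
                }],
                'networkInterfaces': [{
                    'network': 'global/networks/default',
                }],
            },
        }],
    }
```

You would deploy something like this with `gcloud deployment-manager deployments create my-dr-env --config config.yaml`, where the config file references the template and supplies its properties.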
Bear in mind, if you don’t have a database up and running, not only would you need to create the database, you’d also have to load it with data. And that would affect how long it takes to recover, and hence what RTO value you can achieve. It wouldn’t be a small value. In any DR pattern that involves a database of some kind, that tends to be the overarching recovery process that dictates the achievable RTO. So can we look at the autoscaler template, she says, so we don’t need to gather around his laptop. That’s good. OK. So if you look at this template file, you’ll see that we’ve declared that the minimum number of replicas the autoscaler can scale down to is three. Can you switch to the GCE console, please? OLEG: Yes. GRACE MOLLISON: OK. So remember, we said three. So you can see, we’ve got three instances there.

Can you delete one of them now, Oleg? Go nuts. As we have autoscaling on, it will automatically replace the instance that Oleg has just deleted. I spoke earlier about HA in depth, with managed instance groups together with autoscaling being a key tool. So here we’re demonstrating this capability at the instance failure level. But equally, we could set up an HTTP health check against the group. Then if either the actual application or the instance failed, it would automatically create a new instance and delete the failing one. OK. OLEG: We did. GRACE MOLLISON: Thanks, Oleg. If you’re using a load balancer as well, you can also set up health checks there too. But load balancer health checks just stop traffic being served to an unhealthy instance. They don’t actually remove the instance; they don’t delete it. OK? So you need to understand how to set up both types of health checks. The rule of thumb is to be more conservative with the health checks against the managed instance group. So as you can see, the replacement instance is coming up. OLEG: Yep, it’s up. GRACE MOLLISON: As there isn’t actually– we got there in the end. Sorry about the delay. As there isn’t actually anything wrong with the zone, the replacement instance got launched into the same zone as the one we deleted. So anyone who was paying attention and noticed the zone would see it’s put it straight back into the same one. After that, we can easily tear down the deployment by just deleting it. OK. So this is great if you’re doing stress testing of your DR environment, which you should do. I understand, having been there myself, that doing full-scale tests is difficult. This just makes life easier. You’ve seen it [INAUDIBLE]: you fired up the environment, and we pulled it down as well. Otherwise you’re investing in lots of stuff that’s sitting there that you don’t use very often, and then you’re worried because you don’t use it very often: why hasn’t it come up?
So moving on. OK. OLEG: Bye-bye. GRACE MOLLISON: OK, thank you very much, Oleg. Thanks, Oleg. OK. So moving on, let’s look at DR where production is on premises and your recovery site is Cloud Platform. So now I’ll be covering when your production isn’t on GCP. When you’re using GCP as the recovery site for your on-premises workload, there are a few key things you need to consider. The initial question usually being: how do you connect to GCP? GCP provides a number of ways to connect from on premises, Cloud Interconnect being the ideal way to provide a consistent, reliable link. You need to take into consideration factors such as the bandwidth between you and the carrier interconnect provider, as well as the actual bandwidth provided by the provider directly to GCP. Think about what else that link will be used for. Consider how much data has to be transferred and how long you have to meet any RTO values. You can see from this table– which, by the way, isn’t taking into account any external factors at all– that bandwidth availability does have a direct effect on the time it takes to transfer data. Don’t forget to take into account your security controls. You may determine that you want total control of your encryption keys, so factor in the end-to-end encryption/decryption process. Who would have access? Think about authentication and authorization mechanisms. How much can you automate? Remember, everything is an API, so you don’t need to rely on an actual, probably sleepy, human to start bringing your DR site online. I know whenever I had to instigate a DR scenario, it was always in the middle of the night, and I was always half asleep. A traditional approach to DR on premises typically involves, somewhere in the process, people driving vans and swapping tapes around– the tapes usually, or should I say hopefully, being the off-site backup. Remember– [INAUDIBLE] Remember this?
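[Editor's note: a quick aside on the bandwidth table just mentioned. The idealized relationship it captures is easy to compute yourself; like the table, this sketch ignores protocol overhead and external factors unless you dial down the efficiency factor.]

```python
def transfer_time_hours(data_gb, bandwidth_mbps, efficiency=1.0):
    # Convert GB (decimal) to megabits, divide by the effective rate.
    megabits = data_gb * 8 * 1000
    seconds = megabits / (bandwidth_mbps * efficiency)
    return seconds / 3600

# 1 TB over a dedicated 100 Mbps link, best case:
print(round(transfer_time_hours(1000, 100), 1))       # 22.2 (hours)
# The same transfer on a shared link at ~50% utilization:
print(round(transfer_time_hours(1000, 100, 0.5), 1))  # 44.4
```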
Well, we can use the same tiered storage approach to replace the need for tapes and vans moving tapes on and off site. We can replace the source with your on-premises storage appliance, for example. Here you need to set up connectivity to GCP; in this example, we’re using GCP’s Cloud Interconnect service. You can then implement a process to copy data from the storage device to a Cloud Storage bucket. This process can be very simple, using, say, the gsutil command. You may also want to set up a GCS lifecycle rule, like we described before, to delete data or move it to a cheaper storage class. Here’s an example of a cold pattern.

In this example, we have practically nothing running in your target GCP project. When there’s a problem that prevents the production environment running production workloads, you deploy a Deployment Manager template to create an environment capable of running them. Note that we were trickling data into Cloud Storage as part of that production environment. You’ll need to implement processes to restore the applications to all the instances spun up, so think back to the continuum of images we spoke about earlier. You’ll also need to think of a way to transfer data from Cloud Storage to the database you spin up in GCP when you invoke your DR plan and start serving production traffic from your GCP environment. This pattern may be the most cost-effective, but it has the highest RPO and RTO values of the patterns, and requires the most operational effort to recover. Moving on, here’s what a warm standby setup could look like. Here we have a multi-tiered application running on premises, with a minimal recovery suite on GCP. Ignore the Cloud SQL instance for a moment; I’ll come back to that. You can see that a database server instance is running on the GCP side. The instance must run at all times so that it can receive replicated transactions via asynchronous or semi-synchronous replication techniques. To reduce cost, you could run the database on the smallest machine type capable of running the database service. As this would be a long-running instance, sustained use discounts will apply. When the on-premises application needs to fail over, you can make the database system on GCP production ready by destroying the small database instance, making sure to keep the persistent disk containing your database system intact. If your system is on the boot disk, you will need to set the autodelete state on the disk to false before destroying the instance. Create a new instance using a machine type that has appropriate resources for handling the production load and
attach the persistent disk containing your database system to the new instance. In the event of a disaster, the monitoring service will be triggered to spin up the web tier and application tier instances in GCP. You can then adjust the Cloud DNS record to point to the web tier, which in this example points to the external IP address of the HTTPS load balancing service. OK, so coming back to that Cloud SQL instance: if your on-premises database is MySQL, you can use a Cloud SQL replica instead of the VM instance, and data will replicate directly from the external master to the Cloud SQL replica. For smaller RTO values, you could adjust the above strategy by keeping all of the Compute Engine instances operational but not receiving traffic. This is not cost-effective, but may be an option that meets your needs. You may want a database instance that’s ready to take over the production workload, in which case you can omit the resizing of your instance. In this pattern, you’ll notice that you have to replicate the data to your database. This is one way of ensuring your DR failover database is kept up to date. There are other options for keeping your recovery database up to date, and the method you choose will have a direct impact on your RPO value. You may choose to apply backups and transaction logs as your data recovery strategy. This means you do need to make copies of the transaction log files, as well as the backups. You need to regularly restore backups and apply transaction log files to your recovery database, and test that it acts just like the production database. Exactly how you implement this depends on the flavor of database you are using. Maybe you have a read replica that can be promoted; it’s similar to the techniques we discussed when using HA Cloud SQL. Or you could choose to keep a copy of the raw data on Cloud Storage and rebuild the database from it, assuming your RTO is long enough to allow that. What happens when you’ve failed over to your DR database?
How do you then get your production database back up to date? In many cases, it’s just the reverse of the process you’d been following to keep your DR database up to date. Data recovery is tricky, and you need to run regular fire drills and be totally confident that you can happily run production, whether you are using your production database or your DR database. Take advantage of Cloud Deployment Manager to spin up your failover application and point it at your DR database to validate it. And once you’re done, you can tear down your replica environment until it’s needed for the next fire drill. If you have very small RTO and RPO values, then you could take that warm pattern and modify it so it’s HA across on premises and GCP, which gives you a hot pattern, as both on premises and GCP

are serving production traffic. The key difference from the previous pattern is that everything is running in production mode and serving production traffic. The GCE instances are part of a managed instance group behind a load balancer, and the GCP database instance is production ready. If you do choose to implement this type of hybrid approach, be sure to consider a DNS service that supports weighted routing when routing traffic to the two production environments, so that you are able to deliver the same application from both. In the event that one environment becomes unavailable, you can then disable the DNS routing to the unavailable environment. So we’ve discussed several examples of running production workloads on GCP and on premises. I’m now going to discuss briefly what techniques you could adopt when you’re running production across clouds. If your application involves virtual machines running across different clouds, again, you have to figure out where on the continuum your images are for both clouds. You’d want to use tooling that works across them both. If a fully configured image is required, consider something like Packer, which can create identical machine images for multiple platforms from a single configuration file. You could use something like Terraform, which is a cloud-neutral templating tool. So if you’re just running on GCP, use the native tools; if you’re running across clouds, you need to find something that runs across both of them. You could use configuration management tools, such as Chef, Puppet, Ansible, or SaltStack, to configure your instances, or even to configure the instances that you’ll use to create the images you’ll be using. That’s a very common technique, and it keeps everything consistent. If you need to, you could set up a VPN between clouds. If you’re using containers– you know what I’m going to say. There’s only one thing that needs saying: it’s Kubernetes. With Kubernetes, you can federate across clouds. You can use
GKE, which is Cloud Platform’s fully managed Kubernetes environment, and then on, say, Azure or AWS, you can run Kubernetes there. There are other techniques and solutions beyond the ones I’ve just discussed that you can implement for multiple cloud scenarios. For example, if you want to replicate data from AWS S3 to Cloud Storage, you can do this easily as well, maybe by using Boto, which is a Python tool designed to work with AWS that also supports Cloud Storage. There are plenty of ways to do that. It’s not going to be easy, I’m not going to deny it, but you can do this. So just to recap why GCP is the destination you should be thinking of for your DR needs: it improves your TCO, and it’s easy to carry out fire drills, so you will have a greater degree of confidence that in the unfortunate situation where you have a disaster, then, like Wizard here, your ops team will be calmly executing the recovery process. Also, you don’t need to keep rethinking patterns; they apply no matter where your production workload is running. So there you have it. If you’re thinking of DR, either to revamp your current solution or to actually implement a new one, hopefully we’ve given you plenty of ideas. [MUSIC PLAYING]