Cloud OnAir: CE TV: First Steps with Apache Kafka on Google Cloud Platform

[MUSIC PLAYING] JAY SMITH: Welcome to CE TV on Cloud OnAir, live webinars from Google Cloud. We are hosting webinars every Tuesday. My name is Jay Smith, and today I'll be talking to Gwen Shapira from Confluent. How are you doing today? GWEN SHAPIRA: Fantastic to be with you. JAY SMITH: I want to make sure you all know that you can ask questions on our platform, and we have Googlers on standby to answer them. And we might answer some of them later on. Let's get started. Actually, I've had the pleasure of meeting Gwen a few times. She was at Strata New York City, and I got to see some interesting stuff. She's kind of the guru here about Kafka. GWEN SHAPIRA: That was a fun event. I hope you also enjoyed it. JAY SMITH: I did. So let's get a little bit of your background. GWEN SHAPIRA: Yes. So I work for Confluent. It's a company that was started by the people who first created Apache Kafka, and we do all Kafka, all the time: from support, to building a lot of tools that make it easier to run, to running it ourselves on Confluent Cloud, among others on Google Cloud. So basically, I want to be here today to talk to you about Kafka, how people use it, and what's going on there. JAY SMITH: Awesome. Let's get right to it, then. GWEN SHAPIRA: Yeah, so I was thinking of starting with just a simple stream processing example to explain what we're talking about when we say that Apache Kafka is really a streaming platform. Then I'll do a quick intro to the concepts of Kafka and the components around it, show two, maybe even three, cool use cases, and then jump into a demo and show you how we implement those use cases. I've got a really cool demo for you. JAY SMITH: Excellent. I like cool demos. So do our viewers. GWEN SHAPIRA: So basically, we say that Kafka is a streaming platform. What does that actually mean? It means that you can produce events; Kafka will store them as an ordered stream of events and maintain that order; and then you can consume those events. You can even consume them, do stuff to them, and write them back to Kafka, which is what we call stream processing. Just to show you what stream processing could look like, imagine you're a credit card processor. You get a lot of authorization attempts from credit cards every day, like thousands and thousands every second. A tiny percentage of them could be incredibly suspicious. So we need to detect the suspicious events and separate them from the good events, and basically have someone investigate them a bit more, right?
And it sounds like it could be very complicated. To be honest, doing it in real life is pretty complicated, but it can start with something incredibly simple. If someone is trying to repeatedly authorize the same card over and over again, it's probably very suspicious. So we want to filter the stream of authorization attempts for this kind of behavior and create a new stream of things that may possibly be wrong. And we want to do it every time an event happens; we want to do it continuously. That's the core of stream processing. Not just once a day, because by then the bad guys may already be long gone, and who knows, we may never catch them again. So we want to continuously process it. And as you can see, using KSQL is the simplest way to do stream processing, even if you know almost nothing about streams or Java. JAY SMITH: Right. This looks very similar to your standard SQL, maybe with a few little differences. GWEN SHAPIRA: Exactly. So just to show you how it works, we basically say, OK, we're creating a new stream of possible frauds based on an existing stream of authorization attempts. And the way we select possible fraud is that we look at five-minute windows. So we cut the stream into five-minute windows, pick the card number, and count the authorization attempts for this card in this window. Then we say everything with more than three attempts in a five-minute window looks like a suspicious card, and we write it out as possible fraud. And then we have an ongoing stream of things that look suspicious. Now we want to send it somewhere else for deeper analysis, where maybe a qualified human can do extra investigation. So really, the core idea of streaming is that you don't process events once in a while; you process them as they're happening, in real time. If you learn nothing else about stream processing, that is the concept you have to remember.
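For reference, the query Gwen walks through might look roughly like this in KSQL. This is a sketch rather than the exact statement from the webinar, and the stream and column names (authorization_attempts, card_number, possible_fraud) are assumptions:

```sql
-- Sketch of the fraud-detection query described above (names are assumptions).
-- Count authorization attempts per card in five-minute tumbling windows and
-- keep only the cards with more than three attempts in a window.
CREATE STREAM possible_fraud AS
  SELECT card_number, COUNT(*) AS attempt_count
  FROM authorization_attempts
  WINDOW TUMBLING (SIZE 5 MINUTES)
  GROUP BY card_number
  HAVING COUNT(*) > 3;
```

The output, possible_fraud, is itself a Kafka topic, so whatever system handles the deeper investigation can simply consume it.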

GWEN SHAPIRA: So how do we do it? Let's look at the components that make up this platform. The core component is the Kafka brokers, or what we also call a log, a distributed log. There is a famous book by my CEO, Jay Kreps, called "I Heart Logs," where he basically explains that the log is the fundamental data structure of all modern computing. And it turns out that logs are incredibly simple. It is a data structure where you write things at the end, and you store them in order for a set amount of time, which could be forever; some people keep data for hundreds of years. And you can see that this could be your application log, this could be a clickstream, this could be a credit card authorization attempt. The message is very generic: an event is anything that has happened. You keep writing events at the end, and then different applications want to read it. The idea is that every application is independent. They start at some point, usually the beginning, and they continue in one direction, hopefully to the end. There is one direction of movement, which makes it efficient on disks and that kind of thing. Also, each one of those consumers only needs to know: OK, the last message I got is message five, the next one is going to be message six. Let me ping the broker and say, can I get message six? Can I get message seven? You get a bit more than just the next message, but that's the idea. Which means that the clients and the broker don't need to keep track of every single event, and did everyone consume it, and who is supposed to consume it. That is one of the things that is always a bit more computationally intensive and memory intensive in other message buses, and it really sets Kafka apart. JAY SMITH: So since it's all in order, it's easier to keep track of. GWEN SHAPIRA: Yes, exactly. You just have to remember one number per consumer. Sometimes people ask me, how many consumers can I have with Kafka? And, I don't know, 2,000? 50,000? It's just one number per consumer; how much room does that take, right? And then, I just showed you one log, which in Kafka terms is one partition. A Kafka topic is made up of a lot of partitions. I think of the topic as a logical entity, maybe like a database table, and then Kafka is made up of a lot of those topics. I'm going to show you a lot of topics when we do the demo. But the idea is that the topic will have many partitions. You can spread them across many machines. You can have many producers writing to them, many consumers reading. So it's really a scalable system. The other thing is that each partition is also replicated, usually on three machines. So you have a leader; all reads and writes go to the leader. And then the followers basically keep replicating it. They have one job: keep replicating. And if something happens to the leader, one of the replicas takes over. The idea is that because all reads and writes go through the leader, we always have full consistency on that partition, which is really a core guarantee that Kafka has. JAY SMITH: That's good. I mean, if you have all of those events and one of the machines goes down, heaven forbid, you want to make sure you can maintain those events. So having this replication built into Kafka– GWEN SHAPIRA: Exactly. And usually when things go wrong, it's because people misconfigure replication and think that they have guarantees that they actually don't. So you can see that it's really scalable end to end, from producers, through partitions, to consumers. And consumers have the idea of a consumer group. If one consumer instance cannot handle all the data from all the partitions, because thousands of credit cards per second is a lot of work, you can actually start multiple instances that belong to the same consumer group. They will basically negotiate with each other and say, OK, this one is reading this partition, the other one is reading that partition.
So you get scalability on the consumer applications, and also high availability: if one of those crashes, the partitions will get reassigned to someone else. So it's really scalable and resilient end to end. And then, if you go to the Apache Kafka project, you really only get clients in two languages, Java and Scala. But Kafka has a binary protocol that is only slightly complex, so people went and implemented this binary protocol in different languages. So you have a pretty big ecosystem in all those different languages. And then if your language is not supported, you have a REST proxy that will give you support. Pretty much everyone does REST, right? JAY SMITH: Exactly. And that's great, because a lot of times when you want to learn a new tool, there's a limitation based on whatever software or whatever language it supports. But with it supporting a wide array of languages, I can just take whatever knowledge I have today and take advantage of the REST proxy. GWEN SHAPIRA: Do you notice that we're expected to learn new languages faster than we did in the past? JAY SMITH: Oh, yeah. There's like three new ones every month or something. GWEN SHAPIRA: I know. I spent most of my career with Java, and then suddenly it's like, oh, I have to learn Go. And now I was told that Rust is the really big deal. And [INAUDIBLE] is a really big–

Everything keeps happening. And it's funny, I have a friend who is now learning Rust, and he's like, do you think I can just write a Kafka producer as my first project? And I'm like, go for it, man. JAY SMITH: Maybe he'll write one in Julia too, now. GWEN SHAPIRA: Ooh, maybe if he has spare time. I don't know, do you do hack projects and that kind of thing? JAY SMITH: Sure, why not? GWEN SHAPIRA: Yeah, 20% time, right? That's the– So until now, I've explained how you produce and consume events from applications that you coded yourself. But there are a lot of cases where the data you want is not something that you actually create yourself. It already exists in some database, and you can't really ask the application writing to that database to please also write to Kafka, because they have their own priorities. Well, you can connect to the database and get events into Kafka with the Connect framework. I mean, you could write it yourself in your own app, but there is so much in common between all those different pull-and-produce loops that we figured we may as well give you the framework and let you focus on how to query your database. So it's a very basic idea: you get data from a database into Kafka, or from Kafka into a database. You don't have to do the entire flow; just half of it is fine. But later we'll show you the entire flow. And obviously, once you write a good API, communities are kind of amazing. So hundreds of connectors showed up from all those different teams and companies. For example, the connector from Kafka to Google BigQuery was written by a company called WePay, and they just open-sourced it. And now 10 other companies also use it, because apparently getting data from Kafka to BigQuery is a very big thing. JAY SMITH: Right. With Apache Kafka, we must emphasize the fact that it is open source. It is very community-driven, and Confluent is very active in that community. So you do get a lot of these third-party connectors and tools built for it due to its open-source nature. GWEN SHAPIRA: Exactly. And for me, the creativity of the community is always amazing. Did you ever want to get events from a Bloomberg terminal into an IRC channel, for example? And then the last component is really the stream processing. I already showed you how it works in the beginning, so I don't have to go over that again. But I wanted to mention that this is friendly SQL, but it's a layer on top of a Java API and a Scala API that you can use more directly if you prefer Java and Scala, which a lot of us really do. And it does give you a bit more power and flexibility at the expense of slightly more complexity, which seems like the usual tradeoff in pretty much everything. JAY SMITH: Right. So you're able to essentially query the stream. As it's writing to the logs, you're able to get information, not just waiting until it's all written. GWEN SHAPIRA: And I'll show you that on screen in real time: just type a query and see the events come in. It's pretty awesome. And this is like six lines of code or so, and it's six lines because we broke it up. I wrote the same thing in Java, and it wasn't a huge amount of code, but it was definitely three screenfuls of lines. So having the ability to do it quickly is pretty big for busy developers. So what can we do with this? Our listeners at home may want to go and write their own stream processing applications. What could they do?
So one thing that we see very often is that companies have those clickstream events that are actually meaningful information they want to analyze, sometimes from web applications or mobile apps. And then they may want to enrich the event with things that they know about the users, so some profile information that comes from a database query. Maybe even log files, to see if there was a correlation with different errors that happened. And you can bring all of those into Kafka, probably into different topics, use KSQL to join, window, aggregate, do whatever you want, and then stream the events into external systems like Google BigQuery, so you can do more analytical work. And that's one thing to emphasize: KSQL is fantastic for that kind of data massaging, joining, grouping, and it works really nicely in real time. But it's not a "please query the last five years of data" kind of thing. There is a lot of stuff where you really want to write the data to an external system so you can do much more involved analytical work. JAY SMITH: Right. So you wouldn't necessarily try to train a model with KSQL; it's more about getting the information that you need to train the model. GWEN SHAPIRA: And you know, before you train a model, you really need clean, high-quality data. This is where we see the connectors and KSQL really come in.
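As a rough sketch of the enrichment step described above (nothing here is from the webinar; every topic, stream, and column name is a hypothetical placeholder), you would declare KSQL streams and tables over the Kafka topics and then join them:

```sql
-- Hypothetical example: declare a stream over a clickstream topic and a table
-- over a user-profile topic, then enrich each click with profile attributes.
CREATE STREAM clicks (user_id VARCHAR, page VARCHAR, ts BIGINT)
  WITH (KAFKA_TOPIC='clicks', VALUE_FORMAT='JSON');

CREATE TABLE user_profiles (user_id VARCHAR, country VARCHAR, plan VARCHAR)
  WITH (KAFKA_TOPIC='user_profiles', VALUE_FORMAT='JSON', KEY='user_id');

-- The enriched events land in a new Kafka topic that downstream systems can read.
CREATE STREAM enriched_clicks AS
  SELECT c.user_id, c.page, c.ts, u.country, u.plan
  FROM clicks c
  LEFT JOIN user_profiles u ON c.user_id = u.user_id;
```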

But the big question is, how do we actually do it? How do we get data from Kafka to BigQuery? And it's one of those things where I had my own opinion and I said it's obvious, and then I asked other people, and they had their own opinions and also said it's obvious. So I want to go through some options. The old-school way is through something called Secor, which is basically a batch job, a highly parallelized batch job. So think of it as a big MapReduce-style job. It basically goes and connects to Kafka, starts consuming events, and writes them to BigQuery. But this is a batch job, so it runs once an hour or so. And while it's pretty cool, and it's flexible, and does all kinds of cool things, we did a lot of work to make sure we get the data in real time; we don't want to lose the real-time aspect just before we land in BigQuery. So there are also a bunch of streaming options. You can use stream processing frameworks, and to give some examples: Kafka Streams, Apache Beam, and obviously Google Dataflow. And since you're writing in a full-fledged programming language, it gives you a lot of control over what your data is going to look like. So if you need to do a lot of standardization, a lot of clean-up, get some data over here, some data over there, a bunch of filtering, you can write those jobs and basically get events, clean them up, and write them to BigQuery. Of course, it's programming, so you have to actually write all of it yourself: all the error handling, all the schema handling. All of that is your own code. And Kafka Connect is pretty much the polar opposite. It's still streaming, but it's basically a no-code option; I'll show you. Basically, you write a big configuration file and you say, OK, I'm getting data from this topic into this BigQuery dataset, into that table, and you expect Kafka Connect to handle everything for you, more or less. It's kind of like, if this does the job, you're done in five minutes, while writing code can take a big testing cycle. And if you actually need the control, then, I mean, it would be nice if I could do it in five minutes, but I can't. JAY SMITH: So it's great for getting started. GWEN SHAPIRA: It's amazing for getting started. I'll show you how quickly you can get it done. I'm thinking that these days, engineering is under a lot of stress to deliver things in very short time cycles. And a lot of times, you have to show results before you get allocated the time to do anything smarter. So, like, look, I'm getting events from Kafka to BigQuery. Fantastic. Oh, I really need the events in a slightly different format? OK, in the next two weeks I'm going to try to do something a bit more involved. And then, once we know how to collect a lot of events into Kafka and write them into BigQuery, why stop with clicks?
I mean, there is a lot of useful data in those legacy data systems and applications, and you can use it. For example, maybe you're running a bank and you have your credit card processing; you have transactions, and you have loan requests, and you have a lot of older information about the customers in legacy systems. You can bring them all together, process them, and get them to Google BigQuery with things like Connect or Google Dataflow, depending on the use case. And then you really have a lot of data there to process. Imagine trying to get access to the data on a legacy mainframe, and the tools you would have to use to do it, and just how long those processes take, versus doing the experimentation on data as it shows up in real time. It's a very big difference in quality of life. And then, if you've got the simple use cases working, you can do things that are more advanced and obviously more exciting; I think it's more exciting. A lot of data these days comes from devices. JAY SMITH: Yes, it does. IoT has made my life easier, maybe a tad lazier. All of my outlets are wired, and I just speak to my Google Home and turn on everything, turn off everything. GWEN SHAPIRA: You actually do that? JAY SMITH: Yeah. GWEN SHAPIRA: That's pretty cool. I'm not that advanced, but I do have things like Nest and those programmable light bulbs, if you want to set the mood in my house. And it's a lot of fun. So basically, all of those devices speak a very small number of protocols. A common one is MQTT, and part of the Confluent Platform is an MQTT proxy. So you can collect information from those devices and stream it into the cloud. The nice thing about clouds is, there's probably a cloud region near wherever the device is. And then you can really start doing advanced things. You can train a model, and using the same data, you can also serve the model in real time: something like Cloud Functions serving up the model on every event as it happens. So you can really start with something fairly pedestrian

like clickstream processing and end up with a real-time machine learning IoT system, which is probably something that you can raise some money with. So obviously, with all that, you want to actually use all those cool tools. You don't necessarily want to install them, configure them, figure out why they are losing data, or maybe why they are not losing data but doing it too slowly, or too fast; managing it, upgrading it, troubleshooting it, getting paged at 3 AM. Have you ever had a pager job? JAY SMITH: Not for a while, but I do remember getting those alerts in the middle of the night. GWEN SHAPIRA: Yeah, I'm just thinking, my last pager job was maybe in 2009. I do not miss that at all. So basically, introducing, or not introducing, it's been around for a while: Confluent Cloud. The idea is that Kafka is open source, so you can install and run it yourself, or you can let us run it for you on Google Cloud Platform. And I'll show you, it's kind of integrated with the rest of the Google platform with those connectors. JAY SMITH: Yeah, one thing I always tell users and developers: when you're building your application, if it's not something that is core to your project, if it's something that you need but managing it doesn't really add value, it's always best to have somebody else manage it. That way you can focus all your time, resources, and energy on the things that actually matter for your application and for your business. GWEN SHAPIRA: And the really funny thing is that I've been working with the same companies for a bunch of years now. And it's funny how companies that three years ago were like, we are never going to use cloud, clouds will not work for our banking system, are three years later like, oh, you know, we are actually starting to move to the cloud. We could use some help. What's a good architecture for migrating to the cloud? What's a good use case? How do you do cloud security? And I think the economics of it are kind of undeniable. Why would you want to run your own data center if it's not your core competency?
JAY SMITH: Absolutely. It just doesn't make sense to do it if it doesn't benefit your application at the end of the day. GWEN SHAPIRA: Exactly. Focus on things you are actually good at. In banking's case, taking our money. JAY SMITH: Let's look at Confluent Cloud, then. GWEN SHAPIRA: Yes, let me try to switch to my demo. So I'm already logged into Confluent Cloud, and you can see that I basically have a bunch of clusters already here. If I want to add a new one, it's really not a big deal. So let's say I want to test something new. And you can see I can play around with how much data I am consuming and producing. The important bit is to select the right provider, the right region, how many availability zones. And then we click Continue, and look, this is going to cost us, say, half a dollar per hour. And we're here for an hour, so I think you better pay up. JAY SMITH: All right, I'll send it to you over Google Pay. GWEN SHAPIRA: OK. So after I launch the cluster, I basically see a bunch of instructions on how to install ccloud, which is inspired by gcloud, which we all know and love. And this allows me to use the cloud from the command line. Basically, I can do something like "ccloud topic list," and because I'm already connected to one of my clusters, you can see I have a pretty big list of topics going on here. I wish I had sorted them; it's kind of not really sorted. I can create a new topic; that's going to be kind of a new topic. And I can produce events into a topic if I want to. So, do you want me to show you how to produce events into a topic? JAY SMITH: Sure, let's do that. GWEN SHAPIRA: So this was the new topic I created. OK, so I produced an event, super simple. And now I obviously want to consume it. And I'm passing [INAUDIBLE] to start consuming from the beginning; otherwise, you only get new events. And you'll see that it's taking a long time. Why is nothing showing up? So this does take a bit of time, because we need to negotiate the consumer group and see if someone else connects. But soon enough, here it is. So hey, if you're following at home, this may be the first event you produce and consume. I know for me, when I teach classes,

it's always exciting to see people produce and consume their first event. But what I really wanted to show you is how you build a pipeline: how I get events from Wikipedia, maybe run a KSQL query, and then write them to BigQuery. And then you can show me how to do a query on BigQuery. So I'm switching to Confluent Control Center, and basically, I can see all the topics here, but I really wanted to show you the connectors. You can see that they have a Wikipedia connector. And the nice thing about connectors is that they don't require you to write code; you basically configure them. So over here, I basically configured the Wikipedia server. It's an IRC server; Wikipedia publishes everything that happens to a bunch of channels, and you can see all the channels that I'm reading from. It's basically a channel for every language, and I'm writing all of them to the same Wikipedia topic. So let me start that, and then let's go and look at the data. So I'm looking for the Wikipedia topic. When I click here, you can see that I have a bunch of screens, so I can see the schema in the topic, and I can take a look. You can see these are real-time events; you can see more events coming in, and I can see the new events. And so you can actually see the data. So we're consuming Wikipedia: this user edited this page, and he put in this message. I find it pretty cool. Now, if you want to actually run a query, I can go– Oh, it's unable to connect. I cannot run queries for you, I'm sorry. JAY SMITH: It's all right. Let's see what else we can do here. GWEN SHAPIRA: Yeah, that's what happens with the things that you didn't test the night before. Pretty much, that's how demos work. But I can go to the sink and basically say– so the Wikipedia edits topic will not exist, because that's the one that was supposed to come out of the stream processing, but we can still get Wikipedia events into BigQuery. So we edit here, and you can see that I'm getting events from those two topics, and I'm basically mapping them: Wikipedia edits goes to Wikipedia edits, Wikipedia goes to Wikipedia. So that's basically a topic-to-table mapping for BigQuery. And you can say, this is my BigQuery project, and the name of the dataset is Wikipedia. It took me a long time to figure out how all this terminology works, but it's the same terminology if you know BigQuery, so it will probably be a lot easier for you than it was for me. I figured out how to use the JSON key file to actually authorize my BigQuery user. And so now, after I figured this all out, I can start my connector. I don't know how to switch tabs, so it's going to be the hard way. So this is my Google dashboard, and here I have BigQuery. This is the dataset. Before I started the connector, I had to create the Wikipedia dataset, but the tables will basically be automatically created based on the mapping, and their schema will be automatically created based on the schema of the events in Kafka that I showed you. And if I make a change to the event in Kafka– if I could use KSQL to actually edit the events– the change will automatically be reflected in BigQuery. So that's pretty cool. And it actually shows me all my query history. I didn't know it does that; that's neat. So I can actually also do something like– you can see all of those were Wikipedia edits, but I want to do something from– does it do auto-complete? Can I talk to a product manager? I want auto-complete. JAY SMITH: We'll put it on the list. GWEN SHAPIRA: Then let's just get a few of those. OK, let me see. If I click here, it will just run the query for me?
Query the table, yeah. Oh, I needed to have the fully qualified table name. Hopefully– oh, we don't have results from today yet. JAY SMITH: No, it's still new. That's all right, though. GWEN SHAPIRA: Yeah, that's a risk of live demos. JAY SMITH: Looks like all of them. We've all been there. GWEN SHAPIRA: You can see here basically the Wikipedia pages; you can see the channels. Since I think I still have the data from the Wikipedia edits, I can probably also show you how I enriched it with KSQL, from back when KSQL still worked for me. JAY SMITH: You might want to get rid of the partition. Give it an asterisk, too.
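For readers recreating the demo, a simple query against the table the BigQuery sink connector populated might look something like the following standard SQL. The project, table, and column names here are assumptions rather than the exact ones used on screen:

```sql
-- Hypothetical query over the sink connector's output table:
-- count recent Wikipedia edits per channel (identifiers are placeholders).
SELECT channel, COUNT(*) AS edits
FROM `my-project.wikipedia.wikipedia`
GROUP BY channel
ORDER BY edits DESC
LIMIT 10;
```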

GWEN SHAPIRA: I really like how it shows the errors; that's super-helpful. And now it's going to work. See, it parses everything in real time. Real time just adds a lot more usability to pretty much everything. So you can see here that I added a readable date, so now you know that I did all my practice five days ago. And if you go to the end, you can see that I added a language. The way I added the language is that it's based on the channel name, which was something like fr.wikipedia; I basically joined it with a table that I created that has a mapping of the channel to the language, and then I could do a join and get the language. Yeah, that is pretty awesome. So basically, this allows us to get events from Wikipedia into Kafka, and if we are lucky it will do some stream processing, and then use Connect to write the events to BigQuery. And all of that in real time. JAY SMITH: And this is just one of the many ways we can use Kafka and Confluent Cloud. GWEN SHAPIRA: Exactly, yes. And next time, we'll try to demo TensorFlow, with a small IoT device that kind of drives around here. Just in case people want to try it, here we are: we have Confluent Cloud Professional, which is exactly the UI that I showed you. So you can just go to confluent.cloud, or you go to that URL, and you basically sign up. And it's like half a dollar per hour, and building this pipeline takes, I don't know, an hour. So it's basically something that you can get started with fairly easily. JAY SMITH: All right. Well, that was great. Stay tuned for a live Q&A. We will come back in about a minute, and we will answer your questions about Confluent Cloud and Kafka. Welcome back. Looks like we've got some really good questions, so why don't we jump right into them? How secure is data in Confluent Cloud? What measures do you have in place to protect data? GWEN SHAPIRA: Isn't that the first thing that everyone always asks about the cloud? How do you keep our data safe? So Kafka itself has the authorization, authentication, and encryption features, the things you need to keep data safe. In Confluent Cloud, by default, all the communication is SSL-encrypted, so you have encryption on the wire. We use encrypted storage to store the data, so you have encryption on disk. And you authenticate with, basically, an API key and a secret; as long as you keep them safe, we don't even know them. And basically, that's the authentication method. If you are even more paranoid than that– I mean, in this case, I exposed my Kafka to the world. I could just use it from my laptop on the Google Wi-Fi with no issues. But if you want to limit access to just who's in your own data center, and just within your company VPN, then we also support VPC peering. So you can use that to have an even more secure experience. To be honest, I'm exposing my toy clusters to the internet, but I probably wouldn't do it if it was actually running my bank. JAY SMITH: And then, being hosted on Google Cloud, you get the added benefit of the security that Google Cloud offers. GWEN SHAPIRA: Yes. For example, your load balancer is probably

what keeps us safe from DDoS attacks. JAY SMITH: We currently have some applications running in our data center, but we would also like to use BigQuery in real time with data generated by our on-premises applications. Is this possible with Kafka? GWEN SHAPIRA: I think I just showed that it is, right? I mean, you don't have to get data from Wikipedia. It was a cute demo, but you can get the data from anything that you produce to Kafka on Google Cloud. You can use Connect and the BigQuery connector to get the data to Google BigQuery. If you don't want to use the BigQuery connector, you can also use Dataflow, or Kafka Streams, or something better, or code that you wrote yourself, because you are a not-invented-here person. The sky's the limit. JAY SMITH: All right. And with that REST API, you're able to write a lot yourself. We have implemented Kafka ourselves and have been running it for three years now. As Kafka adoption increased, we are experiencing some issues: it is lagging quite a bit, and this is impacting our business. Do you have any recommendations on what we should look into? GWEN SHAPIRA: Oh, tuning is a big topic. So I'm going to assume that you're running on Confluent Cloud, and obviously the brokers are running perfectly well, so it's really more about looking into your consumers, your application: why is it lagging? So first of all, kudos for noticing that your application is lagging. We've seen people who do not monitor consumer lag, and therefore don't even know that their application is lagging. So if you are monitoring it and you know it's lagging, you're already a step ahead of everyone else. And then, most lag-monitoring tools, definitely Control Center, but I think also the rest of them, will basically show you what the lag looks like. Is it increasing or decreasing? Did you fall behind and you cannot catch up? They will also show you how many partitions you have, how many consumers you have, and the lag on each partition. So you can say, oh, actually this consumer is reading from five partitions, while the other two consumers are barely reading from two partitions, and we need to somehow reassign partitions and rebalance, because it's clear why this one is falling behind. Or you can say, oh, it's one consumer, 100 partitions, lots of data getting returned; that may have been OK 10 years ago, maybe not OK now. And then sometimes you also have a problem where one partition just gets more data, and you need to consider the keys that you're writing: how do you distribute the events to partitions to avoid this kind of skew? Because that can also cause you to lag. If you're running Kafka yourself, obviously there are a lot of issues that can happen. It can be that one broker is handling a lot more load than the rest, and you need to rebalance. It can be that you have a bad disk, for that matter. It can be a lot of different things. JAY SMITH: That helps narrow it down a little bit, though. GWEN SHAPIRA: Just a tiny bit. That's the nice thing about not running the brokers yourself: at least for that part, you can figure out who to call for support, and then you only have to worry about the rest. JAY SMITH: We are a cloud-native company and are looking for Kafka as a service. What are some things we should consider when selecting a particular service?
Good question. We have a lot of people nowadays that were born in the cloud; their applications never saw an on-premises data center. GWEN SHAPIRA: For me, I'm very much an open-source person. I think you should try to avoid lock-in, and try to make sure you're using open source software and open standards. And just look at the community and make sure that you're working with communities that you enjoy working with. You obviously want to make sure that the people who are running the software are actually capable of supporting it; I guess that is just due diligence. I'm trying to see, do you have suggestions? JAY SMITH: Testimonials are always a good thing. Whenever I am looking for something, I try to find somebody else who has used it: what did they use, and how did they feel about it? It's important to emphasize, though, that Confluent has a lot of the original designers and creators of Kafka. So that's one way; that's a little bit of a selling point there. GWEN SHAPIRA: And as I said, just as a developer, you want to work with nice tools. I think the reason that I enjoy Google Cloud, and really something we tried to emulate in Confluent Cloud, is having a nice command line and APIs that are usable. Just make sure that it's a good– you're going to spend a lot of time in that environment if you're a developer, so it has to be a nice environment. JAY SMITH: Final question: we are heavily invested in GCP and are adopting Kafka. Is Confluent Kafka the same as GCP Kafka? Is it a native GCP service? Well, I can answer the GCP part.

Today we do not have a managed Kafka, but we have a partnership with Confluent to provide you managed Kafka on GCP. GWEN SHAPIRA: So the big difference, I guess, is that since it's not a native GCP service, you don't get to see it on the big Google dashboard where you see BigQuery, for example. But other than that, we are running on the GCP infrastructure, I think specifically on GKE, and we're using all the available Google tooling. It is not a GCP Kafka, but hopefully you get the GCP experience that you know and love. JAY SMITH: Right. And running Confluent Cloud on GCP, you can integrate with every other GCP tool, like we saw with BigQuery earlier. Maybe Cloud Machine Learning Engine, Cloud Functions, App Engine, pretty much anything you see on GCP. GWEN SHAPIRA: Yeah. I haven't shown that, but I'm actually running a Kafka Streams job on GKE, the Google Kubernetes Engine, and that connects to my Confluent Cloud. And that's a way to integrate your apps with the whole Google ecosystem. JAY SMITH: I think that's all the questions we have for today. I'd like to thank you, Gwen, for coming out. GWEN SHAPIRA: Thank you so much for inviting me. JAY SMITH: And for sharing with the world everything about Kafka and Confluent Cloud, and how it can take your events to the next level: the events that your apps generate, and your devices, those smart switches, and cameras, and everything. GWEN SHAPIRA: And keep everything real time. JAY SMITH: So stay tuned for the next session, "Add a Rich Geospatial Analysis to your Toolbox with BigQuery GIS." Thank you very much. GWEN SHAPIRA: It's all about BigQuery. You could get the GIS data to BigQuery with Kafka and then do the spatial analysis. JAY SMITH: There you go.