Irene Ros: Bringing Data to your Client-Side Apps

I work at Bocoup, where I do JavaScript development and stuff, and we’ve been stickering everywhere with lots of Bobs; if you want a Bob sticker, come talk to me. A little bit about me, because you might not have heard of me before: I’m a programmer. I’ve been programming since I was nine, and all is great. My focus has been data visualization, and I’ve been working a little bit on my own and, this year, with the Guardian newspaper’s interactive team, where we’ve been working on some open source libraries and some interactive pieces for the Guardian.

What we realized when we started working on open data visualization libraries is that most of the snags we were hitting had to do with data, and so the first library we actually released had to do with data management on the client side. That’s how this talk was born: I realized that what I really want to talk about is that you can indeed do data crunching in your browser.

So I want to talk a little bit about what it is we currently do with data. I would say most of what we do lives in this CRUD space: we create records, and we update them, and we modify them, and we delete them, and whatever. Really exciting. And that’s great, but there are all these other things that we could be doing, right? With all this data that we’re gathering, we could actually be computing various metrics, not just for our own analytics but even for our users, and even more, we can give our users filtered views that have to do with the things that they care about. So the reason I’m here is because I think that’s the awesome zone, where we can combine all these practices that we’ve had for many years that have to do with CRUD apps, and join them with some more data-heavy processing, to create this awesome square right there.

So why client-side? A, because browsers can handle it, right? If any of you are done writing apps for IE, which I’m sure all of you
are, deep down inside, then you know how powerful our modern browsers are, and you can do so much with them. So that’s pretty cool. And I would say if you’re doing a lot of computations that have to do with the specific users that are visiting your systems, it’s much easier if we do it on the client: you have to cache less, you have to process less, and you’re not taking up all those valuable computing resources, so maybe you can serve more connections, or maybe you can just have more CPU cycles to do whatever with. So yeah, browsers, right?

Today I’m going to talk a little bit about three different libraries that have to do with data crunching, including one that I co-authored with the Guardian folks, and two other ones that are very good at what they do. But before we get to that, I want to talk about how we actually treat data. It’s very meta, right?

Traditionally we think about data as rows of records. If anybody’s used SQL, or even NoSQL (whatever, call it a document), at the end of the day what you have is this collection of properties or dimensions that all combine into this collection of rows, right? So something like that: that’s what we expect to have on the client. We have this array of objects with names and properties and whatever. And that’s really great when you are iterating over things, right? When you’re iterating over all those objects and you’re displaying them in a list, or you’re rendering something, all is great. But it’s really bad if you’re trying to figure out something about a specific property. So for example, I have this age property; maybe I want to get an average of it. Well, the first thing I have to do is iterate over all my records and grab that age property, right? That’s pretty wasteful.

So an alternative way for us to do this is to actually look at column-wise databases. In a column-wise representation of your data, what you actually have is
every single data type, every dimension, every column, whatever, in its own structure, right? And the only thing binding these together as a row is the actual index position of those values. So here’s the same data that we just represented as objects, but in this column-wise store. The first array is the name property, and it’s just the names of whatever we’re looking at; the second array is the type, and these all happen to be heroes; and the bottom one is just those age numbers, right? And now the great part is that I just have an array, so I can do a lot of things to it just from the native array implementation, or if I really want, I can go ahead and use Underscore or whatever to actually process that. But at the very least I have all these data types, all these data fragments, in single arrays.

So this talk is going to have a lot of code, which is cool. If you have Wi-Fi access, which comes and goes, feel free to check it out: iros/client-side-data is the repo. There are a lot of different libraries covered there, because client data is awesome and I’d like to talk about all of it if I could. The examples I’ll show today will be there too, but feel free to ask me at any point about anything else there, or call me out if I make a mistake; I’m all for that.

So you may have noticed various names I’m throwing around. I thought it would be really nice to have a theme for today’s talk, and because we’re a bunch of nerds, I thought superheroes and villains would be kind of cool. So I went to this website called SuperHeroDB, where people who have a lot of time on their hands go and spend an inordinate amount of it classifying superheroes and villains, right? I don’t even know where they get this stuff, but I scraped it, so I’m very happy.

I know it’s pretty small, but I just want to go over a couple of properties, because we’re going to use them later down the road. This is the profile for Batman; it’s actually a piece of the profile for Batman. He’s got various capabilities, like his intelligence or strength or speed, so his intelligence is maxed out at 1.0. Very exciting. We’re going to look at some of those capabilities. He’s got some attributes about his height and weight, which is pretty basic; turns out he’s 188 centimeters tall. Never knew that. And last but not least, he has some superpowers. Every single one of these records has over 170 superpower attributes, and they’re binary, so they’re either set to 1 or 0: 1 if they have it, 0 if they don’t. We’re going to look at these superpowers later down the road, because it’s kind of exciting. If you want to look at more of the data, there’s a data folder in that repo where you can look for it: just heroes, as JSON, or just villains, or all of them combined, and there’s a type property. So feel free to build cool stuff on top of that, and let me know.

So we’re going to start off and look at rows of data. I’m going to look at one library that does rows of data, and two libraries that are very good at handling columns of data. The first one I wanted to call out is called TaffyDB.
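
As a concrete recap of the row-versus-column layouts described earlier, here is a minimal sketch in plain JavaScript. The records and their values are made up for illustration, and the helper names (`toColumns`, `averageFromRows`, `averageFromColumn`) are hypothetical; none of this comes from the talk's repo or from any of the libraries.

```javascript
// Hypothetical sample records; the names and ages are made up.
const rows = [
  { name: "Hero A", type: "hero", age: 30 },
  { name: "Hero B", type: "hero", age: 45 },
  { name: "Hero C", type: "hero", age: 27 },
];

// Row-wise: every record has to be visited just to pull out one property.
function averageFromRows(records, key) {
  let sum = 0;
  for (const record of records) sum += record[key];
  return sum / records.length;
}

// Convert to column-wise: one array per property, tied together only by index.
function toColumns(records) {
  const columns = {};
  records.forEach((record, i) => {
    for (const key of Object.keys(record)) {
      if (!columns[key]) columns[key] = new Array(records.length);
      columns[key][i] = record[key];
    }
  });
  return columns;
}

// Column-wise: the values already sit in one plain array.
function averageFromColumn(column) {
  return column.reduce((sum, value) => sum + value, 0) / column.length;
}

const columns = toColumns(rows);
console.log(columns.name, columns.type, columns.age);
console.log(averageFromRows(rows, "age"), averageFromColumn(columns.age));
```

Both averages come out the same; the difference is only in how much of each record has to be touched to get there.
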
So TaffyDB is very database-like, in that records are JSON objects; so again, like that slide I showed you early on. It has a very, very rich selection language, so I’ll show some examples, and that, I would say, is the strength of this library: you can make almost SQL-like queries. You can AND, you can OR, you can check for existence, or undefined, or NaNs, and chain those together as you wish. There’s also a library called JSONSelect that I’m not going to go over, but it has a CSS-style syntax for selecting properties out of your JSON structures, so feel free to look at that. But I’ll talk about Taffy.

Taffy supports saving to local storage, which is pretty exciting, and it does have a sort of built-in extension mechanism, if you will. It doesn’t have any updates, or rather, any events. So if you are updating your data and you want to have a binding to that data, because maybe your view needs to update, and you’re used to MVC frameworks that do all that wiring for you, that won’t be there; it’ll be on you to wire that stuff together. And it has some templating support, which doesn’t mean much, because you can just throw an object into any templating engine these days.

Here’s an example of me wanting to find all the male heroes, the female heroes, and the unknown-gender heroes that have a height, right? So the first thing I do is I create this database by calling TAFFY, the function, with my array of objects. Then I go ahead and find the male heroes by saying: find me all the heroes that have the gender male, the type hero, and where the height cm property is not undefined, because for some heroes that value is undefined. Then I do the same thing for my female heroes. And then here’s an example of an AND that’s kind of cool: I say that the type has to be hero, and the height is not undefined, and then the gender is not male and is not female. So you can pass this array of objects. So it’s pretty cool, and you can do a lot of things in terms of
querying your data. So let’s look at an example, right? I was very curious as to whether certain capabilities appear together or not: if I’m faster, am I also better at combat or not? If you want to look at the code, the example is under taffy-db, in the competencies HTML file, and there’s a JS Bin that’s slightly out of date.

What we want to do first, and it’s really pretty much the only Taffy part of this example, is to find all the heroes that have all the competencies defined, right? Because the last thing we want is to be comparing ones where some have a value and some don’t. So this is me pretty much assembling my query: I go ahead and I say, give me the six that I’d shown you, and make sure that they are not undefined. The rest of this is actually a bunch of D3 code, which I’m not going to go over; I’m not here to talk about visualizing big data, not going to do that to you guys.

But here’s an example of what that built, right? What that did is it selected all the superheroes that have all those abilities, and then I built this scatter plot where I can say: how does power compare to power? Well, that’s a stupid question. But how does intelligence compare to power? It looks like just about all the heroes are clustered in the mid-range of intelligence.
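
The three selections described above (male heroes, female heroes, and heroes of unknown gender, all with a defined height) can be sketched without any library. This is deliberately not TaffyDB's API: it is a plain `Array.prototype.filter` equivalent, and the property names `gender`, `type` and `height_cm`, the helper names, and the sample records are all assumptions made for illustration.

```javascript
// Hypothetical records; property names and values are assumed.
const people = [
  { name: "A", type: "hero", gender: "male", height_cm: 188 },
  { name: "B", type: "hero", gender: "female", height_cm: 170 },
  { name: "C", type: "hero", gender: "male" }, // height is undefined
  { name: "D", type: "hero", gender: null, height_cm: 201 },
  { name: "E", type: "villain", gender: "male", height_cm: 180 },
];

// Small condition builders; each returns a predicate over one record.
const is = (key, value) => (record) => record[key] === value;
const isNot = (key, value) => (record) => record[key] !== value;
const isDefined = (key) => (record) => record[key] !== undefined;

// Combine conditions with AND: a record matches only if every condition holds.
const where = (conditions) => (record) =>
  conditions.every((condition) => condition(record));

const maleHeroes = people.filter(
  where([is("gender", "male"), is("type", "hero"), isDefined("height_cm")])
);
const femaleHeroes = people.filter(
  where([is("gender", "female"), is("type", "hero"), isDefined("height_cm")])
);
// "Unknown gender" is expressed as: not male AND not female.
const unknownGenderHeroes = people.filter(
  where([
    is("type", "hero"),
    isDefined("height_cm"),
    isNot("gender", "male"),
    isNot("gender", "female"),
  ])
);

console.log(maleHeroes.length, femaleHeroes.length, unknownGenderHeroes.length);
```

A query library adds indexing, chaining and persistence on top of this, but the selection logic itself is just a conjunction of per-property checks.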

And so on and so forth. I don’t know, how does it have to do with combat? Also not that different. But it looks like most superheroes are kind of on the slow end, so that’s kind of interesting, right? So that’s a quick example of Taffy, and let me know later if you have any questions about that.

The next library I want to talk about is Crossfilter. I use D3 a bunch in my work, and it was written by a really great guy, Mike (Michael) Bostock, who writes a lot of amazing dataviz. He’s also written this library called Crossfilter, which has to do with dealing with high-dimensional data. It is incredibly fast, because it uses typed arrays very heavily. The thing it doesn’t do very well is actually deal with incomplete or messy data.

Before I get into that, I want to show just an example of how you would use Crossfilter. Again, you call the crossfilter function and pass it your array of objects, in this case heroes, and then what you do is you build these dimensions. A dimension is really just a cache, right? We’ve all created cache objects, which are just an object with a key and then either an array of our objects or just single objects. The same thing is done by Crossfilter, except he calls it a dimension, and they can be a little more flexible, in that you can combine them in whatever way you want by using this callback function. In this case I’m really just creating a dimension on the intelligence property, because what I want is to be able to very quickly get all the heroes of intelligence 0.75 or something, right?

Once I have my dimension built, the next thing I can do is actually call a filter. So I’m going to say, okay, give me a filter that only gets me the heroes between 0 and 0.25. That creates a filter, and then I actually go and get the top records, so .top(Infinity). Really clear API. Not. And when I’m done, I have to clear my filter again. So that’s an
example of using Crossfilter, and there are all sorts of additional things you can do on top of that. You can group things: for example, here what I’ve really done is I made a group, right? I said, give me the group of things in between these brackets. But there’s actually a construct for that, called the group.

I mentioned typed arrays a little bit; now I’ll talk about them, because typed arrays are really, really cool. They allow you to very quickly manipulate raw binary data, because what you start off with is just an array buffer, and then you say, oh, I’m going to look at it in 8-bit chunks, or in 16-bit chunks, or whatever. It’s nice because they initialize by default to 0 instead of undefined, which is what traditional JavaScript arrays initialize to, which is really fun if you’ve ever done data things. And it’s nice because you can share buffers: you can have multiple views applied to the same array buffer. And if you know the dimensions of your data, if you know you’re only going to look at values from 0 to 255, that’s great, because you can have a really small memory footprint.

But the problem with typed arrays is that just about anything that doesn’t conform to the dimensions you specified is going to turn into a zero or wrap around: a NaN, an undefined, a string, or even a value that is outside the range. So here I’m defining an unsigned 8-bit integer array, and what that means is I can fit values from 0 to 255 into it. And what happens when I put in a value outside that range, for example 258? What I actually get is the number 2, because it just rolls over; it does not have enough bits to fit that data. So you have to be very careful, because you’re not actually going to know: this is not going to throw an error, right? And the same happens in Crossfilter. If you put data through it and you don’t filter it appropriately, don’t clean it up appropriately, or it has little pitfalls, it’s not going to error out. What you are going to get is slightly wrong
answers. So you might get a filtration of 100 records, and two of those records don’t actually belong, and as someone who spent like two days debugging stuff like that, it’s really not fun. So typed arrays: kind of yay, kind of nay. But nonetheless, Crossfilter is very cool.

So I thought, okay, now that we’ve looked a little bit at these competencies, I’m very curious to see how they’re actually distributed, right? So I want to look at the histograms; hopefully folks know what those are. If you want to look at the code, it’s under crossfilter, in the visual data HTML file. What it does is not very complicated. For every single competency, I take all the values for all my heroes, and then I bin them, in this case into 27 bins (a randomly selected number), and I just try to aggregate to see how many I have in each grouping.

So for example, you can see intelligence is very Gaussian-distributed; that’s good to know. Combat, almost; not so much. Durability, for some reason, has a lot in the extremes. Strength is all over the place for the most part. And there are definitely a lot more slow heroes and villains than fast ones. So, kind of interesting. But we can look very quickly at the code that put that together.
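
Before the walkthrough, a minimal plain-JavaScript sketch of the two ideas in play: why values need cleaning before they reach anything backed by typed arrays, and what binning into 27 equal-width bins amounts to. It does not use Crossfilter or Underscore; the function names `cleanValues` and `histogram`, the `intelligence` property, and the sample values are assumptions for illustration only.

```javascript
// Typed arrays coerce silently instead of throwing.
const bytes = new Uint8Array(4); // zero-initialised
bytes[0] = 258;       // out of range for 8 bits: wraps around to 2
bytes[1] = NaN;       // becomes 0
bytes[2] = undefined; // becomes 0
bytes[3] = "abc";     // a non-numeric string becomes 0
console.log(Array.from(bytes)); // [2, 0, 0, 0]

// Step 1: keep only real numbers, so nothing gets coerced behind our back.
function cleanValues(records, key) {
  return records
    .map((record) => record[key])
    .filter((value) => typeof value === "number" && !Number.isNaN(value));
}

// Step 2: count values into equal-width bins.
// Assumes max > min and that every value lies within [min, max].
function histogram(values, binCount, min, max) {
  const counts = new Uint32Array(binCount); // zero-initialised counters
  const width = (max - min) / binCount;
  for (const value of values) {
    // The maximum itself would land one past the end, so clamp it into the last bin.
    const index = Math.min(Math.floor((value - min) / width), binCount - 1);
    counts[index] += 1;
  }
  return counts;
}

// Hypothetical sample data with two unusable entries.
const sample = [
  { intelligence: 0 },
  { intelligence: 0.1 },
  { intelligence: 0.5 },
  { intelligence: 0.52 },
  { intelligence: NaN },
  { intelligence: undefined },
  { intelligence: 1.0 },
];

const values = cleanValues(sample, "intelligence");
const counts = histogram(values, 27, Math.min(...values), Math.max(...values));
console.log(values.length, Array.from(counts));
```

The clamp on the last bin is a design choice: without it, the single largest value would fall outside the array, and a typed array would ignore that write without complaint.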

I’m going to zoom out a little bit; hopefully you guys can see, these screens are so tiny. The first thing we’re going to do, and this again is not really Crossfilter, this is me using Underscore, because again I’m trying to first reduce my data to make sure I only have records that aren’t going to break Crossfilter. So here I just iterate over all of them and make sure they’re not undefined or NaN, and then only get those records.

But then what I really do is, for every competency, I build this dimension, and then I go and fetch my records and get my min and my max. The other nice thing Crossfilter does when you build a dimension is it sorts it for you, so if you have numbers, the max is always going to be the first record and the min is always going to be the bottom record. So that’s exciting. Then I go and build the bins that I talked about in that first example, and then here I do the actual grouping, and I fetch my records, and I do the little sparkline chart. So not very complicated, but again, very handy. And if you look at some of the Crossfilter examples, there’s a lot of cross-coordination stuff that he does; it’s not inherent to Crossfilter, it’s just a cool demo. So that’s Crossfilter, and if you have any questions, do talk to me about that.

The last library I want to talk about is the one that I’d written with the Guardian folks, with Alex Graul, if any of you know him. It’s part of a bigger project called the Miso Project, where we’re trying to build libraries for enabling interactive visualization and interactive media, JavaScript development stuff. I really love dataviz and I want to help people write it, especially in journalism, where they don’t have time, they don’t have developers, they don’t have budget, and they don’t have anything, really. So they’re running on a dime. So we built Dataset to try and cover the entire workflow of data management. This is kind of a transient FSA, but
not really, in that what you start with is you fetch your data. Fetching data most of the time is really simple, right? Because all you have to do is make an AJAX request for a JSON file. Well, sometimes it’s more complicated. Sometimes you’re going over WebSockets, sometimes you actually want to do polling, sometimes you want to reset your data every time, and sometimes you want it to be unique against the data you already have, because maybe you’re polling Twitter every second and you get an overlap in records, right? All those sorts of things are really stupid, mundane data things that nobody actually wants to deal with. So we tried to build into Dataset this idea of importers, which really have to do with just the fetching of your data. They don’t do any data processing, they don’t do any data management; their only job is to go ahead and fetch things. By default they just go make an AJAX request for whatever type of file, but practically speaking you can write your own. We’ve written our own, so we have one, for example, for Google Spreadsheets, because that’s a commonly used data store in journalism, and also a really easy place to put your data in and a really hard place to get your data out of, if you’ve ever tried.

Then the next step is really parsing it, right? Again, if I just have my data pre-formatted into arrays of objects, that’s awesome, but most of the time that’s not going to happen. So what we have is things like comma-delimited parsers, and Google Spreadsheet parsers (which is absolutely the most horrible format), and whatever other parsers you really want. So if somebody wants to write an XML parser, please do; we don’t, so we’re not going to. What the parsers really do is they combine all the data into this unified format that Dataset actually uses, and from there on you can do a bunch of different things. You can compute various values, you can filter the data in whatever way you want, and you
can derive new datasets out of that data. The nice thing is that once you do one of those operations, you can do the rest. So for example, a selection from the dataset is actually another dataset; to be more specific, it’s something called a DataView, and it’s immutable, because it’s not supposed to be mutable; it’s tied to your original data. But the nice thing about doing it that way is that we can offer things like event coordination.

So Dataset is available both in Node.js and on the client side, and we’ve piped hundreds of thousands of tweets through it in Node and it didn’t break, which was the happiest day of my life ever. Not really. As I mentioned, it comes with a lot of importers and parsers for common data sources and structures, and we’re definitely taking contributions for that.

And it has this idea of evented views. Once you have a selection from your dataset, or maybe a selection from a selection from a selection, it doesn’t really matter, you can enable synchronization between them with just a little flag. What it does is, whenever you add or remove or update records, it’s going to pipe those events to those views. That’s really nice if you have various visualizations or other views built on top of selections: you’re not responsible for propagating those changes yourself. You can just enable that in the dataset and it’ll trigger the events you’re expecting.

Then it has some common math functions, mins, maxes, means, whatever; we’re always writing more as people ask for them. And some derived functions, and by derived I mean things like group-bys or count-bys, some of those slightly more complex SQL-type things that come at the end of the query, after the WHERE. We try to have as many of those as people ask for, again.

So here’s an example. If I want to just go ahead and fetch my heroes from a URL, I go and say new Miso.Dataset with a URL, or if I have the data locally I can just use the data property instead. Then I go ahead and call heroes.fetch, and then, in the background, it uses deferreds. If you really want to pass it a success or an error callback, you can, but by default we really want people to use the deferreds, because you can chain them and do all sorts of great things. All that’s going to do is output the length of my dataset, which is 570.

Or a slightly more complicated example: I go ahead and get my heroes, and I want to count them by hair color, because I’m very curious what distribution of hair colors there is in my heroes. Then I go ahead and actually sort them. For those of you who use Underscore and do not like sortBy only taking one value: neither do we. So we have a sort-by function that takes both rows, and then you can define however you want to compare the two, as long as you return 1, -1, or 0, and it’ll just go ahead and sort your data. And then, let’s say we have some print function, we’re going to get a list of all the top hair colors. Turns out none is the most popular value, and then black and brown. Turns out bald is a hair color; I was not aware. That’s good to know for fall fashion.

Anyway, there’s a bunch of other examples there of Dataset things, but I wanted to go over one before we’re done, called super
powers. So again, I mentioned those many, many superpower properties, and I thought it’d be fun to take a look and see if they’re related in any way. So I tried to see if there are any clusters of superpowers that are used together. There aren’t. But that example is still there if you want to look; it’s the co-occurrence HTML file. We’re going to look at the superpowers HTML file, and it just has a few metrics about superpowers. So there are a hundred and seventy-four of them total, and there’s about five per superhero, and that’s the distribution right there. There’s one person with 45, and I keep meaning to go back and see what that is. And then there’s a top ten, right? These represent how many heroes versus how many villains have them, so it looks like there are a lot more super-strong villains than there are heroes, which is kind of interesting. My favorite is actually the bottom ten, because for example I didn’t know that wishing was a superpower; that makes me a superhero. And quite a few other ones, like orbing; I don’t even know what orbing is. So anyways, just an example of cool things you can query out of your data.

So let’s look very quickly at the code that put that together. Whoops. As I mentioned, I go ahead and fetch my data, I just call fetch on it, and then the first thing I do is I just need a list of all the superpowers. I’m using Underscore for this, right? Sort of the standard lib for JavaScript. Then what I want to do is add them up: I want to get that count of how many superpowers each hero has. This isn’t actually live yet, but we’re going to be releasing 0.3 soon, and it’s going to have computed columns, which everybody has been asking for. It’s just a way to make a column that you don’t actually have the data for to begin with; it’s a calculation over a bunch of properties, whatever. And the nice thing is that it stays in sync for you, so if you’re updating anything about your rows, all your computed
columns are going to go and recalculate for those specific rows. So that’s pretty fun. Really all I do here is I say, create this column called SP count, and it’s going to be a number, and then please just go ahead and add them up, right? I’m just doing a little inject here, starting with 0, basically adding the zeros and ones, which is pretty simple. And then I go and update those little charts.

In terms of iterating over all my data, I can just call .each; again, a very common pattern. And this is our selection mechanism, in that you can just call .where and pass in a function, and as long as it returns true, that row will be included in your selection, versus not. So again, very basic, and the rest is all just HTML updating and stuff. So that’s Dataset in a really quick whirl, and that’s the one I can answer the most questions on; if you ever need anything, please ping me about it.

I just want to sum up a couple of things that I learned from data crunching in the browser. The first thing, of course: think about whether most of your operations involve rows or columns, because you may need both, or you may want to optimize for one or the other.

You may have seen that I mix and match my libraries all the time, right? I’ve used Underscore in just about all my examples, because sometimes it’s really fast. But beware of iterating too much. If I call filter and reject and each, that’s three times that I’ve iterated over all my data, so there may be a wiser way for you to do that. There are a couple of examples for Underscore as well in that repository.

Beware of premature optimization, right? I like performance testing like the next guy, but unless I know that I’m going to have two hundred thousand records on my client, I really don’t care. Practically speaking, when you’re dealing with maybe a few hundred or even a few thousand records, those performance metrics might not really be very accurate, because they’re going to be impacted by the load on your machine, or what other tabs are doing, or all the other JavaScript running on your page, so they’re just less consistent. So as much as I say, everybody, performance test your code: practically speaking, if you know you’re not going to deal with a lot of data, pick the library that works the best for you instead of the one that might be two percent faster in that guy’s benchmarks.

If you’re going to be searching a lot through your data, then precache it, right? Just create those objects. Some of the libraries I showed allow you to do that, or they allow you to filter things, and so on. But beware of over-caching, because keys also take space, right? Anybody who uses NoSQL databases, like I do sometimes, knows that keys take space. It’s really nice to structure your data with these really smart keys that you can very easily select on, but at the same time that is costly. And if you’re only going to be using a subset of your data, be sure to filter that out and then act on it. There’s no reason to have your data there if you’re not going to be using it.
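
Two of the points above, iterating too much and precaching lookups, can be sketched in plain JavaScript. The records, the `speed` property, and the helper names (`namesThreePasses`, `namesOnePass`, `indexBy`) are hypothetical, and native array methods stand in for the Underscore calls mentioned in the talk.

```javascript
// Hypothetical records for illustration.
const records = [
  { name: "A", type: "hero", speed: 0.2 },
  { name: "B", type: "villain", speed: 0.9 },
  { name: "C", type: "hero", speed: undefined },
  { name: "D", type: "hero", speed: 0.7 },
];

// Three chained calls: each one walks whatever data is left all over again.
function namesThreePasses(data) {
  const names = [];
  data
    .filter((record) => record.type === "hero")      // keep heroes
    .filter((record) => record.speed !== undefined)  // reject missing speeds
    .forEach((record) => names.push(record.name));   // act on the rest
  return names;
}

// The same result in a single walk over the data.
function namesOnePass(data) {
  const names = [];
  for (const record of data) {
    if (record.type === "hero" && record.speed !== undefined) {
      names.push(record.name);
    }
  }
  return names;
}

// Precaching: group records by a key once, so repeated lookups skip the scan.
// The index itself costs memory, which is the over-caching trade-off.
function indexBy(data, key) {
  const index = new Map();
  for (const record of data) {
    const bucket = index.get(record[key]);
    if (bucket) bucket.push(record);
    else index.set(record[key], [record]);
  }
  return index;
}

const byType = indexBy(records, "type");
console.log(namesThreePasses(records), namesOnePass(records));
console.log(byType.get("villain").length, byType.get("hero").length);
```

For a few hundred records the chained version is usually the more readable choice, which is the premature-optimization point; the single pass and the index start to matter as the data grows or the lookups repeat.
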
Don’t forget to clean up your events, right? If you have UIs that are tied to your data, that have events triggering and so on and so forth, then the best thing you can do is make sure those are unbound when you’re done.

And last but not least, not everything can be client-side, right? I still do a lot of my data work in R, or on the server, or wherever it is I need to have it happen, because the reality is, most of the time my data doesn’t start with 500 records. Thankfully there are not that many superheroes or villains to keep track of. But when I do have multiple UIs or visualizations or whatnot that all require the same subset of data, the best thing you can do is come up with that smaller subset and pipe that to the client. A question I always get is: how do I know what to keep on the server and what to keep on the client? Well, find the smallest subset and pipe that back. But when you do pipe it back, it doesn’t mean that you’re then left to your own devices and JavaScript’s basic API. There are lots of libraries that help you work with your data.

So if you have any questions, please do reach out to me; I absolutely love answering things about client-side data and data visualization. And a couple of small Bocoup things that are not related: if you are into gaming in JavaScript, we’re curating this website called Build New Games, and if you want to write with us, we get to pair speakers, which is really exciting, so contact me or email admin at Bocoup. And we’ve all been talking about functional programming and OO programming, and I’m tired of that argument, so if you’re interested in application control flow management, which is really what we need, just come chat with me about a library that we’re alpha testing now; I’d love to give you access and get your opinion. So that’s that. Thank you.