Interview with Leland Wilkinson | Grammar of Graphics | Open Source, Statistics & Software Dev

Sanyam Bhutani: Hey, this is Sanyam Bhutani and you’re listening to “Chai Time Data Science”, a podcast for data science enthusiasts, where I interview practitioners, researchers, and Kagglers about their journey, experience, and talk all things about data science.

Hello, and welcome to another episode of the “Chai Time Data Science” show. In this episode, I am really privileged to be interviewing one of the best statisticians and, quite literally, the author of The Grammar of Graphics: Dr. Leland Wilkinson, Chief Scientist at H2O.ai. We talk all about Leland’s journey into the field over the years; his amazing contributions, which I believe need no introduction to the audience; his work at SYSTAT, for the audience that aren’t familiar with it, followed by his work at Tableau; and his current work at H2O.ai, where again he’s working on very exciting projects. We talk about how software development has evolved over all these years, his current work at H2O, and his take on the field, broadly speaking, and the interview of course includes plenty of advice for all of the beginners out there.

Note that this is a special interview: the release is happening on H2O’s YouTube channel. If you want to check out the other interviews, the link to the playlist is in the description. Another note to the non-native English speakers in the audience: if you’re watching this on YouTube, this interview, along with all of the future releases, will have checked subtitles. So in case you want to enable the subtitles for a better experience, please do so. These will be proper, data-science-checked subtitles. Without further ado, it’s my privilege to be sharing this interview with Dr. Leland Wilkinson with you. I hope you enjoy the conversation as much as I did.

Hi, everyone. Today I’m really privileged to have on the show, if I may, one of the best statisticians of today, a data scientist as well as a creative thinker: Dr. Leland Wilkinson. Dr. Leland, thank you so much for joining me on the Chai Time Data Science podcast.

Dr. Leland Wilkinson: Thank you.

Sanyam Bhutani: So, today you’re working as Chief Scientist at H2O.ai. If I may, you’ve been involved in the field for over half a century now. Could you tell the listeners how you got started? Was it called statistics in your day? Now we call it data science. How has the field evolved?

Dr. Leland Wilkinson: Well, statistics, actually, that term has been around for centuries. And statistics came out of, well, you could read all about it in Steve Stigler’s book, The History of Statistics, but it goes way, way back to the 18th century. But yes, I should confess I have a rather unorthodox background, although there are a few people like me in the American Statistical Association. I loved mathematics and was going to major in applied math at Harvard. I’d done a lot of math in school before Harvard and was passionate about it, and then, well, it was the 60s, what can I say? Not long after that I decided to switch to English, so I could read a bunch of novels, and Shakespeare and whatever. After I graduated, I went to divinity school at Harvard, because I wanted to be a chaplain in hospitals, working with people in the hospital. And after I got out of divinity school, I went into psychology at Yale, where I was hoping to continue my study of psychology, get ordained, and become a chaplain. And at Yale, I visited the statistics department and started spending my time with friends in the computer center, and I was totally hooked. And again, because it was the 60s, everybody was out protesting, and I actually didn’t study much psychology. I spent almost all my time on statistics down the street. And after that I got a teaching job in Chicago and started teaching statistics to psychologists. At that point, everything was still statistics, although my mentor in statistics at Yale was John Hartigan, who was a student of John Tukey. And my dissertation advisor was also a student of John Tukey at Princeton, and we were taught, basically, from the point of view of data analysis. Tukey, of course, coined the term exploratory data analysis. And then I think another person who had been at Bell Labs and worked closely with Tukey, and is one of my heroes, Bill Cleveland, coined the term data science; I believe he invented it. And Bill, and of course Tukey, and even my advisor were against using the word statistics to describe what we do with data. We analyze data in many ways: visually, and with a computer, with various models. Classical statistics is more about a specific corner of data analysis, a very important one, but it is not the same as machine learning or data analysis. And machine learning as a term came out, I forget, I have a terrible sense of time, but I think it came out in the early 90s or late 80s, and it evolved out of the term data mining, which is kind of a silly term, I think. And now we call it AI. I don’t have to give you much information on that: we’re in the sort of second stage of AI, where now, as many of us know, we’re finding a blending of models, such as deep learning, and that model came out of statistics and psychology. It’s now applied, and we’re finding that we get remarkable behavior in prediction, or automated cars and so on, coming out of these more formal models. So I’m sorry, that was such a long description of the history.
But that puts me where I am today. And I should say, I’m very grateful for the background that I had, because it gave me a much broader perspective. I don’t believe data analysis or statistics is solely about numbers and algebra. It’s much broader than that. It takes an understanding of our role in society and, you know, a larger dimension to appreciate how we should be doing data analysis.

Sanyam Bhutani: That’s a great insight. What was it like working on statistics back in the 60s? You didn’t have “import *” from whatever; there weren’t libraries. So how was working on computers back in the day?

Dr. Leland Wilkinson: Yeah, well, I guess you’ve already dated me. So what? You know, here I am. And I must say it’s thrilling to work at H2O, because I get to work with young people like you. But I will say it was a thrilling time to be working with computers. I started out working at the Yale computer center on an IBM 7090 direct-coupled system, a 7090/7094, with punch cards, and we moved to terminals eventually. But I have to say, in my lifetime, the great revolution occurred when the microcomputer appeared. The internet was the second computer revolution, from that perspective, for data analysis. I can’t tell you how thrilling it was to build a computer. I bought parts through the mail, mostly from California, and we soldered all these parts together, and I built something called a Cromemco, which was named after a Stanford University building, actually. And I just could not believe it. I had total control and use of a computer 24 hours a day. And I actually spent almost 24 hours a day coding. I thought, I can’t believe it, because on the mainframe you got this little slice of time, and you really couldn’t do a whole lot. So for me, that was the great thrill. That’s how I wrote SYSTAT. I actually pretty much shut down my mainframe account, and from that point on, I was writing everything on a microcomputer in several different languages. I had BASIC, Fortran, Lisp. There was even ALGOL on these tiny computers. This was all before the IBM PC or the Apple. So I had a big head start by building a kit, you know, the way Bill Gates did. But of course, I never made it that way. That’s how I got started, and it was a thrilling time to be able to be running microcomputers.

Sanyam Bhutani: To talk more about your journey: you took the traditional route, if I may, of completing a bachelor’s followed by a master’s, then a PhD. Why was this route important to you? You did your PhD in psychology. Could you tell us more about your research back in the day?

Dr. Leland Wilkinson: Well, actually, my interest in psychology quickly segued into what at that time was called mathematical psychology. And back then my heroes were, of course, Tukey and Tversky and Kahneman, and the people who were doing deep research into perception, similarity, and multidimensional scaling, people like Roger Shepard and Kruskal. I was at some of the meetings where multidimensional scaling was actually born; it was pretty thrilling. And my dissertation actually was on the relation between similarities and preferences: mathematical models of how we prefer one thing to another. Is that preference based on our judgment of similarity, or is it based on a different mechanism?
So a lot was happening then at places like UC Irvine, Yale, Michigan, and Chapel Hill, by pioneers in that area, that eventually led to some of the things that Tversky and Kahneman and others worked on. So it was a thrilling time.

Sanyam Bhutani: Throughout your career, another theme, if I may, that I’ve noticed is that you’ve always been connected to industry and teaching simultaneously. So, first, can you tell us why being connected to industry was important to you even while pursuing your PhD? I believe you were working as a statistical consultant.

Dr. Leland Wilkinson: Yes, I did. I had some interesting projects. One of them was one of the first studies on abortion, early studies back when abortion was illegal in the state of Connecticut, and there were studies on children and their development and so on. So I really got good first-hand experience with statistical consulting. I also was in a federal trial involving discrimination, and that was quite a thrilling experience. Judge Newman was and is a famous federal judge; he is on the federal bench. When I testified using the t-test for analyzing these employment discrimination data, he turned to me (it was a bench trial) and said, “Did you do a test for heteroscedasticity?” And I was just blown away. I mean, here is a judge who knows enough about statistics to know that one of the most important things is to check your assumptions.

Sanyam Bhutani: You have also been connected to the university. So this is sort of a silly question: as a professor, how have you seen the way of learning evolve over the years? Of course, the internet wasn’t as big early in your days, and now you can find almost any piece of information on the internet.

Dr. Leland Wilkinson: The internet didn’t exist. Actually, I should say the World Wide Web is what really changed the internet. We all know the history, I think, of the internet and how important it was. And yes, a number of us did some things with our modems and so on before the World Wide Web was invented, which suddenly accelerated everything. No, learning has radically changed. I mean, I learned in small groups, seminars, and small classes in graduate school. Nowadays, as I’m sure many people know, Silicon Valley and the software industry, at the cutting edge, are less interested in your pedigree than in what you can actually do. So you don’t want to saunter into an interview at Google and say, “I went to Harvard” or “I went to Stanford.” Okay, that might get you a little notice, but what really gets noticed is: “Please walk over to this whiteboard and sketch out for me a depth-first search algorithm in, I don’t care, Python or something.” That didn’t exist back then in job interviews. Now, in academics, things have not changed quite as much. You still present a talk at your interviews for, say, a postdoc or assistant professor position, and you’ll get grilled on your assumptions and what you think, and then you’ll go to dinner with everyone and they’ll give you too much wine and you’ll fall asleep. But anyway, online education has been a major change. It has given people outside the field, people who didn’t go to Stanford or Berkeley or Carnegie in computer science, the chance to learn about data science and actually leverage themselves, through internships and so on, into some pretty heavy-duty positions. And that’s a wonderful thing to see.

Sanyam Bhutani: Coming back again to your journey: when you developed SYSTAT, can you tell us why this was important to you? And how did you find the motivation to actually get started working on it?

Dr. Leland Wilkinson: Well, actually, for my dissertation, I needed a method, a statistical model called the repeated measures analysis of variance in the multivariate layout.
And at that point, I had been using SAS, BMD, Data-Text, several programs on the mainframe that had been there, but nobody had the multivariate analysis of variance for repeated measures. So I decided, why don’t I write it? And I did. I actually sent that out, and a lot of people used it. The deck of cards was about as long as this bookshelf. It was like two, three, four thousand cards of Fortran code. But when I got to Illinois, the computer center there was run by a much more restrictive staff, who didn’t even want to let faculty anywhere near the computer. And since I was able to buy and put together, with my consulting income, this pretty monstrous microcomputer, which I was building, I decided, oh, let’s just take the program and download it from punch cards to my local storage on the microcomputer, which was floppy disks. They were the big floppy disks; they were like a megabyte per disk. So I was able to put the whole thing on there. And then I thought, hey, I could actually write a little statistics package. And I think I kind of went crazy. I’m sure a number of you have had that experience where you start coding at nine in the morning, and at ten at night you wonder, what happened today? Was it raining outside? I had no idea. I just was on a roll, and I wrote, oh, hundreds of thousands of lines of code. By now I think I’ve probably written a million lines of production code that people actually use. We don’t talk as much nowadays about lines of code, because some bad programmers generate tons of it, and a beautiful program isn’t necessarily a long one. But what I’m saying is, if you’ve been doing this continuously, and I would say literally every day of the year I’ve had my hands on a keyboard for almost the last 50 years, you begin to accumulate a lot of code. The other thing is the colleagues who are my age or even younger, say, full professors. By the time they get into departments or companies, where they’ve done all their research and written lots of code back then, they haven’t got time. They’re too busy writing grants and organizing grad students, or running large projects at places like Google, so they stop writing code. Even this morning, I was sitting here writing a ton of Java code to get some stuff done. I’ve never stopped coding. I love coding. It’s just fun.

Sanyam Bhutani: If I may ask this naive question: what keeps you so passionate even at the age of 75 years young?

Dr. Leland Wilkinson: Well, you finally said it. Honestly, I think that the most significant things in my life, obviously after my family and my life here, are coding and exciting new challenges, and then, lastly, working with young people. Because one of the things people my age who have been in computers often do is reminisce, and they say, you know, “These young whippersnappers don’t understand. We wrote in Fortran. We wrote in C.” Yeah, well, you know what? If you aren’t learning from a twenty-something, a 22-year-old, what’s happening today, you’re not going to go anywhere. And what’s pretty much driven me over the years is having the chance to get advice from people who are about the age of my granddaughter.

Sanyam Bhutani: That’s really inspiring. Coming to software development in the early 80s, can you tell us what were your favorite developments in software development over the years, in open source or otherwise?

Dr. Leland Wilkinson: Well, I actually was tempted to put SYSTAT into open source. I’d already had some competitors copying parts of the program and so on.
But that was really a little bit before the growth of the open source movement, you know, Richard Stallman and all the mechanisms needed to really let open source take off. I would say R was probably, certainly in statistics, the first emergence of that phenomenon, and that had to do partly with how AT&T mishandled S, which was developed by John Chambers and Rick Becker and Bill Cleveland and other people at the Labs. So I sort of spanned that open source movement. I have written a bunch of R packages now, and that’s been a lot of fun. So they’re open source. Now, open source is facing some significant challenges. And, not to get too political about it, some of that has to do with large corporations that are busy buying up every other startup in the world, and everything is getting sucked into about five mega-corporations. The problem then is that those mega-corporations are using tons of open source software, and it’s getting more and more difficult for startups and little groups of people to get the support they need to develop new code.

Sanyam Bhutani: But I feel like open sourcing in general has definitely been beneficial, overall, for society and even, I think, for the industry.

Dr. Leland Wilkinson: Oh, it’s a revolution. There’s no question it’s made a huge difference. I mean, you can’t actually compare sales figures, because naturally open source is free. But if you actually look at the number of users, it has eclipsed all the other statistics packages. It’s just been a major change.

Sanyam Bhutani: Now, coming to another aspect that you’ve contributed to: you quite literally authored and set the path for the grammar of graphics, without which we wouldn’t have ggplot, Python’s Bokeh, and even Tableau, I think. Could you tell us what led you to authoring the book and essentially becoming, if I may, the father of modern visualization?

Dr. Leland Wilkinson: Well, I’ll go a little bit into that. When I came to SPSS, they had asked me to do visualization, because they had been using third-party software that was just not living up to SPSS’s reputation. Anyway, I spent some time there and sort of ran into roadblocks from what I would say was the third level of the bureaucracy, where the managers were threatened by the recommendations I was making. I still have many friends from SPSS, so I don’t mean to cast aspersions. It was a magnificent place to be. But I will say there was a meeting in which I said, “Look, I’m trying to tell you how you do this stuff. And if you don’t want to listen to me, I’ll just write a book and show the world how to do it.” And I left the meeting. And by the way, I’m sorry, I shouldn’t go too far into this, but I was asked by some people at SPSS, “How did you ever code so much? We can’t believe how, with SYSTAT in the mid 80s, you released this ton of graphics software.” And I said, “I never went to meetings.” So I’m not a very good organization man, to put it in context. But I got tremendous support from the CEO of SPSS, Jack Noonan. He said, “Don’t you worry about this stuff, just go ahead and do what you’re doing.” So I put together a team of about seven people, people like Dan Rope and Graham Wills and Roger Dubbs and Andy Norton and so on. These were brilliant and really productive people. And what happened was, I had started to think about coding these things in Java, along with Graham and Dan and so on. And Dan was an inspiration; he had come from the Bureau of Labor Statistics and had already written quite a few graphics.
I realized there was a grammar here, and it was related to the grammar of experimental design, which, if you’re in that technical area in statistics, you know has concepts that are quite mathematical: D-optimal designs and design algebra and so on. And so I put all this stuff together and we started coding. Yes, a lot of the architecture was similar to SYSTAT’s, where I had written a ton of code to do millions of different kinds of graphics. But we had some thrilling moments with the grammar when we would code. Dan Rope and I, for example, sat for a week struggling with something called a scatterplot matrix. I said, “I know this object can be made with the algebra.” And suddenly one day I said, “I know: it’s a quadratic form, which is what we have when we do, in matrices, X transpose X, and we make a correlation matrix. Here, let’s do it with objects, with graphical objects.” And I couldn’t believe it. Dan typed into his program the algebra we worked out, and out popped the scatterplot matrix. We never wrote a scatterplot matrix. Many of the other graphics we never wrote; they were an aspect of the grammar. And I think that’s what led others to it. Probably one of the first people was Hadley Wickham at Iowa State, when he was a graduate student working with Di Cook, a fantastic statistician there, who’s now actually back in Melbourne. He realized this himself. So he developed the ggplot system, which implemented, I’d say, about three quarters of what’s in the book, but just very elegantly, and it just took off. And of course, the other one was Tableau, where Pat Hanrahan and Chris Stolte implemented not just the algebra but the UI that I outlined in the book, and it’s very, very close to the way Tableau looks today.

Sanyam Bhutani: Did you anticipate it would set off so many ripples in the stats world, so to speak?

Dr. Leland Wilkinson: I wrote it as a monograph. I literally thought of it as a journal article, but it wouldn’t fit into a typical journal. So I picked Springer, because they were a math publisher, and I thought they gave me the license to go ahead and lay out all the math in there and not feel inhibited by page length and so on. I basically took almost no royalties and thought, I hope grad students and professors will see this. But I told my editor, “You know what, the sales are going to be really terrible the first year and second year, maybe a few more,” and I said, “I expect the sales to increase every year.” And he was pretty nonplussed. I mean, normally you get this burst of sales, right, and then they taper off. But actually, every year the sales have continued to increase. Now, I don’t mean it’s selling like your latest hottest novel, or even a book like Hastie, Friedman, and Tibshirani, which is sort of the bible of machine learning, but it’s sold quite well. And most important to me is that it is recognized among the people I most admire. Those people are at, you know, Microsoft, Google, Facebook. They’re the people who do visualization and don’t just read a book, but are actually the people who have to code to do visualization. And now there’s an explosion of beautiful new visualization techniques and programs.

Sanyam Bhutani: Yeah. Coming to your current day job: now you’re the Chief Scientist at H2O. Can you tell us what your current job, your current life, looks like? Quite a few projects are happening at H2O, both open source and otherwise. Which tasks are you currently involved in?

Dr. Leland Wilkinson: Most of my work at H2O is in visualization, of course. AutoViz was a component of Driverless AI that I wrote as a server-side library, and then there was [name unclear], or Johnny, as we call him.
And Justin, who is now in the PwC group, and Alexi, who is the sort of person supervising the data table for DAI, anyway, they hooked it all up. And that’s been pretty thrilling. That’s been a project where you look at data and you say, “Can I do some visualization before I have any model in mind, and can I learn something from those visualizations before I start to imagine models to implement?” More recently, I’m working on what started as the Q, the Quantum, project. It’s now Q. It’s in development, and it’s very exciting. It’s an exploratory visualization and analysis package, again, to go alongside DAI, Driverless AI. And what’s exciting for me with that project, especially, is that I moved back to Chicago to be near my daughter and her family here, and much as I loved California, this gave me the chance to sort of enlarge my life and be back with friends. And I discovered that I can work very productively like this. You’ve seen Zoom, WebEx, whatever. And we have one meeting that, when I tell people out here about it, they’re astounded. Everyone knows you can do video conferencing or just, you know, iPhone, FaceTime, and so on, but I say, “We get together every Monday. One person’s in Singapore, one’s in Bangalore, one’s in LA, one’s in Mountain View, one’s in Chicago, all having a meeting as if we were in Mountain View.” So I know it’s a cliche, but going through the experience every week of doing that and getting stuff done is pretty thrilling. Now, when I mentioned Dan Rope: back at SPSS, he was located in Washington, DC, and I was in Chicago, and we did the same thing using an early, pre-beta copy of, I’ve forgotten the name. It was a conferencing program that Microsoft was working on; they had bought the team for it. And that’s when I first discovered, oh boy, working with a sketch board, with video and so on, you actually get more done than you would if you were in the office by the water cooler.

Sanyam Bhutani: Yeah. Coming to AutoViz, which you mentioned: can you tell us why auto-visualization is important to AutoML, automated machine learning? Isn’t the purpose of AutoML to completely replace the human? Where does AutoViz come into the picture then?

Dr. Leland Wilkinson: Good question. And no, that’s not the focus of it, and it really will not. I personally believe AI is never going to replace the full range of human perception, reasoning, and so on, because we are incredible brain machines, we know that, but we also have bodies. And that gives us a wisdom about the environment that machine learning doesn’t really have and, in my opinion, is probably not going to have for a long time. But AutoViz is designed to handle extremely large data sets, where we could not sit down with a program like, I won’t name them, but there are some very good statistics packages that allow you to open a data set, look at scatterplots, look at histograms. But, for example, today I’m looking at a data set that’s a microarray data set. It has 800 rows by 20,000 columns. You cannot physically do the scatterplots: that’s 20,000 choose 2, you know, different scatterplots. You couldn’t do them. So the point of AutoViz is to let machine learning algorithms and statistical algorithms go through there and say, “You know, you should take a look at this.” And this might be 50 scatterplots, but it’s not an impossibly large number of scatterplots.
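To put numbers on the 20,000-column example: the count of distinct column pairs is 20,000 choose 2, and a screening pass can cut that down to a short list worth looking at. The Python sketch below is only an illustration under stated assumptions. The helper names (`pearson`, `top_pairs`) are hypothetical, the data are synthetic, and absolute Pearson correlation is a stand-in score; the interview does not say which criteria AutoViz actually uses.

```python
import math
import random
from itertools import combinations


def pearson(x, y):
    """Plain Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def top_pairs(columns, k):
    """Return the k column pairs with the largest |correlation| (stand-in score)."""
    scored = [
        (abs(pearson(columns[i], columns[j])), i, j)
        for i, j in combinations(range(len(columns)), 2)
    ]
    return sorted(scored, reverse=True)[:k]


# Number of distinct scatterplots for 20,000 columns: 20,000 choose 2.
n_pairs = math.comb(20_000, 2)  # 199,990,000

# Tiny synthetic stand-in: 30 noise columns, with column 1 built from column 0.
rng = random.Random(0)
cols = [[rng.gauss(0, 1) for _ in range(200)] for _ in range(30)]
cols[1] = [v + 0.1 * rng.gauss(0, 1) for v in cols[0]]

best = top_pairs(cols, 3)  # short list of (score, i, j) tuples to plot first
```

On the toy data the planted pair (columns 0 and 1) ranks first. Any other "interestingness" score could be swapped in for `pearson` without changing the shape of the screening step.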
And the second thing about AutoViz is that it uses models and methods that are not classical statistical ones. They’re closer to and inspired by Tukey’s work, where instead of looking at skewness with the classical estimate, which involves cubing numbers, I look at other kinds of more recent measures of skewness that are based on means and medians and so on, which allow you to do away with some of the assumptions that were overly rigid in classical statistics. So I view AutoViz as a platform for helping people get an initial look at data. I did write a program which unfortunately got buried because of a mistake I made with a startup before H2O. But anyway, it was a program that I called A Second Opinion, which actually took your models. You fit a model, like a regression model or a classification model or clustering, and it would do the same analysis you did using a package like SAS or R. And it would then think about the problem the way Tukey would think about it. Like, look at the residuals: do you find unusual patterns in the residuals? He used to call it the smooth plus rough, or the fit plus residual. And it would deliver up documents that would indicate whether you were justified in keeping that model or needed to refine it. Nowadays in machine learning, of course, we have automated procedures that will do things similar to that. And at least a lot of the machine learning people I know don’t even care about distributions and whatever. I think they’re profoundly wrong. But they say, “Oh, the machine can do a better job than people can in fitting these models, so just leave it up to them.” I’ve had questions in audiences where people, you know, computer scientists, will say, “Well, what’s the point of visualization, because the machine can do it better than we can?” And I say, “No, that’s wrong. That’s false,” and give counterexamples where the machine will simply miss patterns, because the models the machine is using weren’t written or devised by people who are very familiar with data analysis. And there’s a problem, I think, in some machine learning circles today that hinges around comments that George Box, a famous statistician, made. You’ve heard these comments before: if, you know, everything’s a nail, your hammer’s going to be your solution. And that’s the opposite of what people like Tukey did. You don’t take a model, throw it at data, get a prediction, and now you’re done.

Sanyam Bhutani: Got it. Again, since you made it clear that you are on the side of humans in the loop, can you elaborate a bit on where AutoViz comes into the AutoML picture? Which portions of the AutoML pipeline are being dealt with using AutoViz?

Dr. Leland Wilkinson: Well, actually, we’re getting a lot of requests from companies asking, “Can AutoViz enhance AutoML and give us the ability to improve those models automatically, without people even looking at the graphs?” And AutoML at H2O was originally, in effect, called DL, which was named after Dmitry Larko, this brilliant Russian scientist at H2O. I don’t think I’m divulging any trade secret here. But everyone not at H2O: keep your hands off Dmitry. He’s brilliant. Anyway, my point here is that Dmitry devised this way of adding features to models, in which he was a pioneer. Now everybody talks about feature engineering and so on. Well, he was doing this several years ago. And I talked with Dmitry about some of the Tukey transformations that I do in AutoViz, which Tukey called re-expressions, to do things like symmetrize a distribution. And actually, Dmitry has already implemented some of that right inside DAI.
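The idea of re-expressing a variable to symmetrize it can be sketched in a few lines of Python. The example below assumes a simple mean-minus-median skewness measure, (mean − median) / standard deviation, as one instance of a measure that avoids cubing; it is not confirmed to be the measure AutoViz uses, and `mean_median_skew` is a hypothetical helper name. The data are synthetic and right-skewed, and the re-expression is a plain logarithm.

```python
import math
import random
import statistics


def mean_median_skew(values):
    """(mean - median) / standard deviation; close to 0 for a symmetric batch."""
    spread = statistics.stdev(values)
    return (statistics.fmean(values) - statistics.median(values)) / spread


# Synthetic right-skewed data (log-normal), an assumption for illustration only.
rng = random.Random(1)
raw = [rng.lognormvariate(0.0, 1.0) for _ in range(5000)]

# Log re-expression: the logged values are roughly normal, hence symmetric.
logged = [math.log(v) for v in raw]

skew_raw = mean_median_skew(raw)     # clearly positive: long right tail
skew_log = mean_median_skew(logged)  # near zero after re-expression
```

Comparing `skew_raw` with `skew_log` shows the effect: the raw batch has a mean well above its median, while the logged batch has the two nearly equal.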
Okay, it turns out, it’s not widely known, for example, buying machine learning people, but if you do some transformations like log a predictor variable, models like decision trees or random forest or gradient boosting machines don’t care, because they don’t care about the shape of the distribution. But actually, they do care if you transform the dependent variable, because often the loss functions are calculated in ways that involve Computing Center squares or other statistics that are different on a log scale than they are on the raw scale. Yes, so that’s one of the areas where we’ve already incorporated in da is some of the things that are done inside out of his itself Go to another is is outliers, although that’s a very tricky area and modeling because generally, statisticians don’t like to remove outliers unless you know, they were caused by some artifact In my book, the grammar of graphics I relate, favorite story from American Sociological review where people a researcher was modeling the frequency of sexual cleitus, I guess, in married couples over 50 years old. And the estimate came up to be absolutely huge Obviously, the author was an over 50 years old. But the point was, the estimate was just wrong. There was an outlier. And the outlier was caused by an SPSS convention for coding which some of you might remember that on the old IBM machines and so and there wasn’t a special missing value code. SAS actually implemented one which was solve this particular problem. Anyway,

The code was 999. So you have a few people in the data who were having intercourse 999 times a month, and that dragged up the estimate. Now my point here is that with outliers in data analysis, by all means delete them if you know why they occurred. But if something is just an outlier because it’s a very big number in the data, you’d better think twice about deleting it. And often statisticians make robust models, in which you can down-weight those outliers without completely deleting them.

Sanyam Bhutani: That’s great insight. Now I’ll ask you another insider question. Could you give us an insider’s view of what we call the maker culture? What are your thoughts on the “makers gonna make” philosophy, as we call it?

Dr. Leland Wilkinson: It’s really interesting. And I think I have to credit Sri, our founder, Sri Ambati, with this idea. And you know, it resembles in some way the way Jack Noonan at SPSS treated me and other people. And that is, he gives you a very long leash. I told you that story about how I was just getting bureaucratically tied down by all sorts of rules and regulations, and Jack said, don’t worry about it, do your thing. Just make software. And that’s what Sri does. It’s interesting: these products, at least as long as I’ve been there, which is I think almost four years, emerged from these people, these makers at H2O, without any strong-handed guidance from marketing and sales. Now, everyone pays attention to marketing and sales, and we have some dynamite people running those groups. But those come after you get the basic software idea in place, and then you start asking customers, what do you think?
And what would you like to see? As opposed to the old-fashioned waterfall method of development where, you know, you write a 100-page Microsoft Word document with every single requirement. And engineers, as you know, just hate this, because they’re being treated like trolls, you know, “just do that,” or like code monkeys. And so H2O is very different. I can cite some amazing things. I did, for example, some work in producing HTML reports that came out of A Second Opinion to describe a statistical analysis. And when I got to H2O, a small group of people there, one of them was Megan, who has worked a lot in data analysis, sort of thought: can we do AutoDoc? Now, [unclear] had done something like this very early and very innovatively, in I think the 1980s, to expand their output with explanations and so on. But I’m just saying, nobody asked her to do this, you know, and said, here’s exactly what you have to do. The same, as I said, with DAI and Dmitry doing that kind of stuff. I could name three or four examples of projects inside the company that just emerged because someone had a crazy idea. And they told Sri, and he said, yeah, work with it, go with it. So, I don’t know how typical that is. I’ve been at only one other startup, where that surely didn’t happen, and they went bankrupt. But I think that’s what’s unusual in my experience with H2O.

Sanyam Bhutani: Got it. Now, zooming out a bit: can you speak a bit about how the hype or the industry has changed over the years, as we went from statistics being the famous word to machine learning to data science to AI now? How have the hype and the industry changed with the trend?

Dr. Leland Wilkinson: Just the way you described it! Terms last about two years. The thing is, data science, that term that Bill Cleveland coined, has now resulted in the founding of data science programs at some top-notch places, Columbia, Michigan, Berkeley, and so on, so you’re not going to see that term go away as fast as some of these others. But you know, I do remember going a few years ago to Strata, which is an industry conference; it’s not really like NIPS or the American Statistical meetings. But back then, you know, people were going around going Hadoop, Hadoop, Hadoop, Hadoop, as if that solved every problem in machine learning. And yes, Hadoop is still there. But, you know, we now have distributed systems that are far more powerful. And so you can expect these things will be replaced every six months to a year, a lot of these terms. And by the way, at least to the extent of my psychology training, I think the way AI is being used today is really a misnomer. Anybody who thinks the brain looks like a deep learning structure doesn’t understand the brain. And I’ve had a number of friends who work in that area of research, you know, and a lot of psychology departments have basically turned into MRI departments. But no, AI, the way we use it today, is basically an extension of statistical models. I think most people know that. Every procedure we use in what we call AI today was invented by a statistician. If you read Hastie, Friedman, and Tibshirani, you’ll get a good idea of the history and how that works. Even deep learning basically came out of psychological research in the 1960s and some engineering research, but these models were used to explain human speech and so on, and they’ve turned out to be very powerful, but they are not models of the brain; the brain doesn’t work that way. So anyway, this may happen, and it may be one of the great new steps in AI: when models start incorporating... This is sort of like ensembles, but different in that it’s hierarchical.
The brain is very hierarchical in the visual system and the auditory system and so on. And if these deep learning models get integrated into a more general hierarchical system, we might start seeing the kinds of processing that will enable much higher levels of behavior, not just, you know, training on cat pictures.

Sanyam Bhutani: The next question is by Russ Wolfinger. What do you see...

Dr. Leland Wilkinson: I don’t know if he’s on now, but he’s a great guy, a really fun statistician.

Sanyam Bhutani: He was really excited about the AMA section, so he sent this question to the AMA: what do you foresee for the future of open source stats and graphical software? And I think this is a tricky one: how about Python versus R as the future of it?

Dr. Leland Wilkinson: Oh, that’s a touchy question. Well, you know, in general, I think people who’ve coded in Python and R understand why there are now two communities. I don’t think either one will become the standard for machine learning. I think you’ve got to have both. Briefly, to summarize, I would say you pick Python generally for data munging. Although if you leverage some of the things that Hadley Wickham has done in the tidyverse and so on, you can do some pretty impressive things in R. But in my estimation, Python, partly through the support of Google, is going to continue to be a leading data munging language environment. Also, as you know, Python can get Cythonized, so the performance of it can be pretty much as good as Java or C++. R: if you’re going to do a hierarchical nonlinear model with, I don’t know, a few more qualifications, there is no place other than R you can go, because that one was developed by the professor who invented that. And actually, another system, Stata, which is sort of in the classical econometrics and statistics areas, is what you’re going to choose if you have certain types of econometric models and you want to go to the horse’s mouth; that’s where it is. So I don’t much like software bigots, you know, people bragging about how object-oriented is dead and we need functional programming, or this or that, or JavaScript. You know, I really don’t care. The real issue is the algorithms you’re using and what you plan to do with them. You can code it in Fortran, and I don’t care. But I do think R and Python are going to continue to thrive. There’s no question: the growth in those two areas is phenomenal. Personally, I’m still in the Java world. It took me 10 years to learn object-oriented design and Java programming. And when I read blogs about how object-oriented is dead, I notice that the people talking about that probably spent about a year trying to learn what all of that is about. Those people aren’t Josh Bloch or, you know, real experts in design. So I just feel very comfortable in Java. It’s just a very clean language, from my point of view.

Sanyam Bhutani: And what do you foresee as the future of open source stats and graphical software?

Dr. Leland Wilkinson: Well, I mean, I think, for example, there are projects going on right now in China, by Alipay, you know, related to Alibaba, that involve open source Grammar of Graphics coding, in JavaScript of all things, and they’re making great progress. So I think open source is going to continue to thrive.
I did read this morning in the New York Times a very troubling article on the relation between open source and the big, what might be called the FANG, companies. In this case it was about Amazon. And we use Amazon; I use AWS all the time. It’s a magnificent, huge cloud platform. But some of what I read about how they have incorporated open source and, in so doing, basically destroyed the little company that was making that open source, is very troubling, and I don’t know where that’s going to go. One thing about open source that some people don’t realize, though I think most people know it: it costs. Somebody pays for it. Now, I wrote SYSTAT on a sabbatical. I couldn’t have done that if I weren’t a professor, you know, with the comfort of knowing that my life wasn’t depending on whether SYSTAT earned anything. And at most places where open source is happening today, places like RStudio and so on, those people are paid. So they’re developing some commercial applications, but at the same time they have a license to write the open source. That’s a very delicate situation. And if the large companies don’t realize, and some of them do realize, but I mean, if they don’t realize that this delicate ecosystem depends on everybody treating each other with respect, open source is going to be stifled. And we’ll see.

Sanyam Bhutani: This has been a great interview. My final question to you would be: what would be your best advice to someone who’s just getting started in the field of machine learning, broadly speaking?

Dr. Leland Wilkinson: Code, code, code, code. Ship. It’s tough to get, you know, full-time jobs nowadays, especially in academics in this area, but just get involved and code. It’s like in mathematics. The mathematicians I’ve talked to, and my daughters, have an expectation: you’re not doing math unless you do proofs. You can’t read a math book and just say, oh, I get it. No, you’ve got to do proofs. Well, in machine learning, you’ve got to code, in R or Python. And you can’t run them like a stat package, you know, the way we used to pick up SAS or SPSS or something and write three or four lines of commands and then say, oh, I just did a regression. That’s not data analysis. Nowadays, you can pick up SAS and write code in SAS to do real data analysis. But you have to think creatively. And in fact, that’s how people like Russ became Kaggle champions. So yes: code, code, code. That, in my view, is the answer. And when you run into trouble, yeah, go to Stack Overflow or wherever. Be aware that there’s a lot of misinformation in those places. But you will learn a tremendous amount if you are actively coding.

Sanyam Bhutani: That’s great advice. Thank you so much again, Dr. Leland, for joining me on the podcast, and, on behalf of the community, for all of your huge contributions to the community, if I may.

Dr. Leland Wilkinson: Well, thank you very much. I enjoyed the chance.

Sanyam Bhutani: Thank you so much for listening to this episode. If you enjoyed the show, please be sure to give it a review, or feel free to shoot me a message. You can find all of the social media links in the description. If you like the show, please subscribe and tune in each week to “Chai Time Data Science.”