Nora Neumann – Usable A/B testing – A Bayesian approach

As already introduced, I'm going to talk about a usable A/B testing approach using Bayesian statistics today, so welcome to my talk. A bit about me: my name is Nora, and I work as a data engineer at ResearchGate here in Berlin. Before we dig right into it, I'd like a show of hands: who of you has already worked with A/B experiments or split testing? Okay, quite a few. And who of you has already worked with a Bayesian approach to this? Yes, nice. So please correct me if I get something wrong.

First of all, why do we actually care about A/B testing at ResearchGate? We want to make absolutely sure that if we introduce changes to our product, which is quite a diverse product, our members are not negatively affected by them. We also use A/B testing to rapidly test new features on our platform and to develop features iteratively. We regularly run live tests on all different parts of our website to ensure that scientists really find the content they need: we test new product designs, we test new product features, we look at different conversion rates or at product usage metrics, and we also test new algorithms. At any given point in time we run several experiments in parallel on different parts of our platform.

Now, I've talked a bit about how we use A/B experiments, but there are a few things to consider first. I'll walk you through some general considerations, which most of you will probably be familiar with, then make some critical points about one of the more common approaches, and then we'll look into how Bayesian statistics can help us with some of the problems involved in A/B experiment testing. I'll also introduce our library, which is still a work in progress but will be released soon.

Some general considerations about A/B experiments. You should really define the motivation behind your experiment, which means being impartial when setting it up and not just testing to justify a gut feeling. You should think about your user segments: if you optimize part of your product, how many users and which members would actually benefit from the change, and is it worth the engineering effort? Don't test too many versions in parallel, because you want to identify the underlying pattern of what works best for your members, and testing many versions at once can dilute the effect of the changes you are running. And don't get frustrated: most of your experiments probably won't yield a significant difference between the variants you are testing. You should also know your baseline, which means defining the key metrics you are working on. For example, if you run an online shop and want people to buy things on your website, and you test two versions of a new checkout cart, you probably won't be interested in the number of purchases alone; you would be more interested in the average revenue per customer, because you want to go with the variant that earns you more money. You should also understand the range of acceptable fluctuations of your baseline: your conversion rate is never really static, it fluctuates within some margin of error or confidence interval, and you should look at different time ranges.
Your conversion rate might change with the time of day or the day of the week, so if you start your experiment on the wrong day and don't let it run for long enough, you might just measure the difference between two or three days rather than actual user behavior, and not knowing this will cause problems with hypothesis-based testing. I'm assuming you use an experiment framework that diverts your traffic to the different versions; you can buy a solution for a monthly fee, such as Visual Website Optimizer, or you can develop an in-house one. The simplest form of split testing is the one most of you are probably familiar with, and a minimal sketch of how such a framework might assign users to variants is shown below.
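To make the traffic diversion concrete, here is a minimal sketch of how a framework might assign users to variants deterministically, so that a returning user always sees the same version. This is purely illustrative and not ResearchGate's framework; the function name, the experiment name, and the bucketing scheme are assumptions.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, variants=("A", "B")) -> str:
    """Illustrative deterministic assignment: hash the user and experiment name
    into one of 100 buckets and map the buckets to variants (50/50 here)."""
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return variants[0] if bucket < 50 else variants[1]

# The same user always lands in the same variant for a given experiment
print(assign_variant("user-42", "new-signup-page"))
```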

You divert the traffic 50/50: 50 percent of the users who come to your website see one version, and the other 50 percent see the other version. There are also more intelligent algorithms. A few years ago there was a nicely opinionated article called "20 lines of code that will beat A/B testing every single time", which introduced the idea of multi-armed bandits for routing traffic to your different versions. With a multi-armed bandit you have a slightly different split: for example, 10 percent of the users who come to your website are evenly split between the versions, while the remaining 90 percent of your traffic sees the currently best-performing one, which helps you not to lose revenue while the test is running; a small sketch of this idea follows below. What you should also do, and what is very important, is to run an A/A experiment, which is a dummy experiment: you split your traffic between two groups that see exactly the same version of your website or landing page, and if you see a significant difference between the two groups although you are showing users exactly the same thing, this is an indicator that the way you assign users to your experiment has a bug.
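The bandit idea from that article boils down to epsilon-greedy routing: explore a small share of the traffic evenly and send the rest to the variant that currently converts best. This is a rough sketch of that idea under my own naming, not the article's code and not the framework discussed in the talk.

```python
import random

# Running counts per variant; in practice these would come from your tracking system
counts = {"A": {"views": 0, "clicks": 0}, "B": {"views": 0, "clicks": 0}}

def conversion(stats):
    return stats["clicks"] / stats["views"] if stats["views"] else 0.0

def choose_variant(epsilon=0.10):
    """Epsilon-greedy: with 10% probability explore uniformly,
    otherwise exploit the currently best-performing variant."""
    if random.random() < epsilon:
        return random.choice(list(counts))
    return max(counts, key=lambda v: conversion(counts[v]))

def record(variant, clicked):
    counts[variant]["views"] += 1
    counts[variant]["clicks"] += int(clicked)
```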
Most A/B experiment frameworks use hypothesis-based testing to call a winning variant or to analyze your experiments; it is the most widespread technique. We calculate a so-called p-value to see if there is a significant difference between two versions, and we also calculate confidence intervals. But to work with these hypothesis-based approaches you need, in the simplest setup, to fix a sample size in advance, and by calculating the p-value in combination with a significance level you control how often you declare a difference between two variants although there actually is none. To calculate the sample size in advance you have to consider your minimum detectable effect; you have to consider the statistical power of the test, which roughly means how often you will recognize a truly successful variant and which is typically set to 80 percent; and you define a significance level, which means how often you will observe a positive result although there actually is none, typically set to a low number such as 5 percent. There are a lot of sample size calculators out there, for example Evan Miller's awesome A/B testing tools: if you have a baseline conversion rate of 3 percent and want to be able to detect a 5 percent relative change to that baseline, you need to acquire a large sample size for each variant. Depending on how much traffic your website actually gets, this can mean that your experiment has to run for several weeks or months, which in an agile development environment is sometimes not feasible. The smaller your minimum detectable effect is, the more samples you will need to collect per variant; the larger the true difference between your two versions is, the fewer samples you will need to confirm that there is a difference. A sketch of such a calculation is shown below.
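For reference, a sample size calculation along these lines can be sketched with the standard two-proportion z-test formula. This is an approximation of what calculators like Evan Miller's compute, not their exact implementation.

```python
from scipy.stats import norm

def sample_size_per_variant(p_base, relative_mde, alpha=0.05, power=0.80):
    """Approximate per-variant sample size for a two-proportion z-test,
    roughly what online sample size calculators compute."""
    p_alt = p_base * (1 + relative_mde)
    p_bar = (p_base + p_alt) / 2
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance level
    z_beta = norm.ppf(power)            # statistical power
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p_base * (1 - p_base) + p_alt * (1 - p_alt)) ** 0.5) ** 2
    return int(round(numerator / (p_alt - p_base) ** 2))

# 3% baseline, 5% relative minimum detectable effect -> on the order of 200,000 users per variant
print(sample_size_per_variant(0.03, 0.05))
```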

In an ideal world you would stick to the rules: the experiment runs exactly as long as your pre-calculated sample size requires; your website has enough traffic, so you don't have to wait weeks and weeks to get results and declare a winning variant; you don't look at your experiment while it is running, which is very important; you only test two versions against each other, not more; and you are really, really patient, and nobody bothers you for a preliminary result while the experiment is still running. If you run your experiments exactly like this, you won't have any problems with hypothesis-based testing.

In the real world, however, significance-based experiment evaluation has several pitfalls. We calculate the p-value with our preferred statistical test, and it is the probability of the data given the hypothesis, not the probability of the hypothesis given the data; that is not what a p-value tells you. If you use a hypothesis-based approach to declare a winning variant, you can only reject the null hypothesis, and your null hypothesis is that there is no difference between the two versions, although you are really interested in the exact opposite. There is also no indication of whether your result is important; there is no such thing as a "highly significant" result when you just calculate a p-value. Statisticians have been teaching people to calculate confidence intervals to capture the uncertainty of a measurement, but there is a problem with confidence intervals too: a 95 percent confidence interval means that, across repeated experiments, 95 percent of such intervals will contain the true parameter, which is not the same as a 95 percent probability that the true parameter falls within this particular interval. You could compare it to saying that the Pope is Catholic, but not every Catholic is the Pope. Depending on how many experiments you run, the intervals fluctuate; they will contain the true parameter at the stated rate, but you have no guarantee that the true parameter, your conversion rate for example, sits at the center of your interval.

If you peek at your running experiment and test before it is done, you increase the chance of falsely detecting a statistically significant difference between your two versions, because you run into the multiple hypothesis testing problem. This not only means that you might falsely declare a statistically significant result, it can also mean that you fail to detect one, and this effectively corrupts your test. The formula at the bottom of the slide, P(at least one false positive) = 1 - (1 - alpha)^n, shows how that probability grows, where n is for example the number of variants you test or the number of segments you analyze after the fact. That is something a lot of people do: an experiment shows no significant difference between two versions, so you start looking at, say, the countries the users came from, to check whether there is a significant difference for users from a particular country across the versions they were shown throughout the experiment. If you don't correct for this, errors creep in. One option is sequential testing with a very strict p-value correction: you decide beforehand how many tests you are going to run, and depending on how many versions you are comparing you have to reduce the p-value threshold, so you can no longer say "if the p-value is below 0.05 I have a significant result" but have to use a much lower value. That is called the Šidák correction, and it is very restrictive. The snippet below illustrates how quickly the uncorrected error rate inflates and what the corrected per-test level looks like.
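The inflation is easy to see with a few lines of arithmetic; the numbers below just evaluate the formula above for a handful of test counts.

```python
alpha = 0.05

# Probability of at least one false positive across n independent comparisons
# (extra variants, repeated peeks, or post-hoc segments) at significance level alpha
for n in (1, 5, 10, 20):
    family_wise = 1 - (1 - alpha) ** n
    sidak = 1 - (1 - alpha) ** (1 / n)   # per-test level under the Šidák correction
    print(f"n={n:2d}  P(at least one false positive)={family_wise:.2f}  Šidák per-test alpha={sidak:.4f}")
```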
Alternatively, you can control the false discovery rate: out of all the significant results you get from multiple tests, you control the proportion that are false positives, which is a bit less restrictive while still letting you control the error rate. Given all these problems with hypothesis-based testing, we wanted to find a way around them at ResearchGate, so we searched for a method that lets us easily communicate the results we get from such an A/B experiment setup. And our primary goal, after all, even if no version turns out to be significantly better, is to make absolutely sure that the version we choose after running an experiment is at least not worse than the one we had before.

Interestingly, last year Visual Website Optimizer introduced their SmartStats engine, which also uses Bayesian statistics to call a winning variant. I have written down two statements which we think are quite relevant in this discussion. With hypothesis-based testing you can only say: "We reject the null hypothesis that variant A is equal to variant B with a p-value of 0.02", which would be a significant result. But isn't it better to be able to communicate: "There is an 85 percent probability that variant B has an 8 percent lift over variant A"? That is why we started looking into Bayesian statistics and Bayesian reasoning.

Bayesian reasoning can be nicely explained with this quote from Sherlock Holmes to Watson: "How often have I said to you that when you have eliminated the impossible, whatever remains, however improbable, must be the truth?" With Bayesian reasoning we update our beliefs about the data as we gather evidence. For example, we might believe that our conversion rate is around 4 percent, and the more evidence, the more data, we gather throughout the experiment, the more we can either confirm or revise that prior conviction.

Let me quickly go through the formulas involved. First of all there is Bayes' theorem, which gives you the posterior probability of your hypothesis given the evidence: P(H | E) = P(E | H) · P(H) / P(E). What we have learned after gathering the evidence is the likelihood of the evidence if the hypothesis is true, times the prior probability of the hypothesis, divided by the probability of the evidence; that last term is just a normalizing constant which we don't really need in an A/B split testing setting. So the simpler version is that the posterior is proportional to the prior probability that my hypothesis is true times the likelihood of my data under that hypothesis: P(H | E) ∝ P(E | H) · P(H).

An interesting property of A/B testing, especially of the split testing approach, is that we have non-overlapping populations: if you use an A/B experiment framework, one user who comes to your website is only shown a single version and should never be part of both variants. If that holds, variants A and B become independent, which means the joint posterior probability of the two conversion rates given the data factorizes into the product of the posterior for each single variant. Luckily, a conversion rate in terms of views and clicks is also a coin flip model: a coin comes up heads or tails, and when you show your website to a certain number of users, they either click or follow something or they don't, so you have a success or failure event. This is very handy, because if we want to compare the two variants by their posterior probabilities, the joint posterior becomes a two-dimensional function of both conversion rates, but we can still calculate the likelihood, which is binomial, and the prior probability, which follows a beta distribution.
Because we have a binomial likelihood function, the conjugate posterior distribution is again a beta distribution. Conjugate distributions in Bayesian statistics simply mean that the prior and the posterior probability distributions are from the same family; in this case they are both beta distributions. The posterior distribution of our variant A is therefore a beta distribution whose parameters are the prior parameters updated with the observed data: Beta(a + clicks_A, b + views_A - clicks_A). A small sketch of this conjugate update follows below.
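As a minimal sketch of this conjugate update (the counts here are made up purely for illustration):

```python
from scipy.stats import beta

def posterior(clicks, views, a_prior=1.0, b_prior=1.0):
    """Beta prior + binomial likelihood -> Beta posterior.
    With a uniform Beta(1, 1) prior this is Beta(1 + clicks, 1 + views - clicks)."""
    return beta(a_prior + clicks, b_prior + views - clicks)

# Hypothetical counts for one variant, only to show the update
post_a = posterior(clicks=120, views=2000)
print(post_a.mean(), post_a.interval(0.95))  # posterior mean and a central 95% interval
```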

Choosing these parameters a and b in an A/B experiment setting is a bit of a black art. You can choose an uninformative prior, which means you have no strong conviction: for example, if you test a brand new feature with two variants, you quite likely don't know the baseline conversion rate beforehand, so you have no strong conviction about how the data might be distributed. You can then simply choose a so-called uninformative uniform prior (the red line on the slide), which means the beta parameters a and b are both set to 1, giving exactly the uniform distribution. I have also plotted, as the blue line, what happens when you set a and b to some other equal value; all of these are essentially uninformative priors, because the mode, the value of the distribution with the highest probability, will always be 0.5, meaning 50 percent. However, if you have already run several experiments, or you watch the conversion rate of a particular part of your website regularly and are really sure that it is most probably around 4 percent, then you can calculate the prior parameters a and b from that distribution, given the spread of your belief and the sample size you have already acquired; a sketch of one way to do this follows below.

So much for the prior distribution; now let's talk about an example experiment and how we can use all of this to analyze A/B experiment results. We have two variants, and we let the experiment run for exactly one week, because we want to capture a normal week of traffic on our website. This matters because if you only go by the pre-calculated sample size, then depending on your traffic you either have to collect data for a long time or, if you have a huge amount of traffic, you could be done within minutes: if you are Google you probably get 200,000 views within a few seconds or minutes, and the experiment would only run for a total duration of two minutes. In this case we had two versions of, let's say, a sign-up page, and throughout the experiment we acquired a different number of clicks and a total number of views for each variant. We now want the posterior probability that the conversion rate of variant B is greater than that of variant A. We don't have to calculate everything by hand: we can use Monte Carlo style sampling, which means we approximate the posterior probability distribution by drawing samples from the posterior beta distribution to obtain credible values of the conversion rate for both variants. To make this easy we can use NumPy, and there is also a very good package called PyMC with more advanced sampling procedures. Using this approach you obtain a distribution of credible conversion rates, and you can then also compute the so-called highest density interval.
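One common way to get such prior parameters, used for example in Kruschke's book, is to parameterize the beta distribution by its mode and a concentration, a pseudo sample size kappa = a + b. Treat this as an assumption about which "simple formulas" are meant; it is not necessarily what the speaker's library does.

```python
def beta_params_from_mode(mode, concentration):
    """Beta prior parameters from the believed conversion rate (the mode)
    and a concentration kappa = a + b (> 2) expressing how much prior data
    that belief is worth."""
    a = mode * (concentration - 2) + 1
    b = (1 - mode) * (concentration - 2) + 1
    return a, b

# "We are fairly sure the conversion rate is around 4%", worth about 1000 prior views
print(beta_params_from_mode(0.04, 1000))  # -> (40.92, 959.08)
```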
With a 95 percent credible interval you can actually say that with 95 percent probability the true value falls within this range, which is not something you can do with an ordinary confidence interval. Here is the quick and simple approach we use: you import the beta distribution from numpy.random and sample from it, in our case one million times, using the uninformative uniform prior with a and b set to 1, because we have no strong conviction whatsoever about what the conversion rates of our variants might be.
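A minimal version of that sampling step could look like this. The click and view counts are hypothetical (chosen so the posterior modes roughly match the 53.9% and 59% quoted below), since the talk's actual experiment data is not in the transcript.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 1_000_000

# Hypothetical experiment counts, not the talk's real data
clicks_a, views_a = 5_390, 10_000
clicks_b, views_b = 5_900, 10_000

# Uniform Beta(1, 1) prior -> Beta(1 + clicks, 1 + views - clicks) posterior
samples_a = rng.beta(1 + clicks_a, 1 + views_a - clicks_a, n_samples)
samples_b = rng.beta(1 + clicks_b, 1 + views_b - clicks_b, n_samples)

print("P(conversion rate of B > A):", (samples_b > samples_a).mean())
```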

As I said before, we wanted something whose results are easy to interpret afterwards, so we simply plot the distributions of our conversion rates. We found that this is very easy to communicate to other people within the company, because they immediately see that there is a large gap between the two versions. There is no discussion about whether the result is significant, because we don't calculate that when using Bayesian statistics, and there is no discussion about whether the confidence intervals overlap, because you can clearly see, in this case, that they don't. For the clicks and views I showed you before, the mode of the variant A distribution is at 53.9 percent and the mode for variant B is at 59 percent. What is really nice is that, because we have full distributions of our conversion rates, we can also calculate the difference between the two versions and again obtain a distribution, this time of the relative difference between the two variants. If we then calculate the 95 percent highest density interval of that difference, we can see that switching from variant A to variant B gives us at least a 6.4 percent relative increase, which you can't really read off a hypothesis-based test. We can also say that there is an 85 percent probability that variant B has a lift of over 8 percent over variant A, and essentially a 100 percent probability that the increase is at least 6.4 percent.

We can use the same method with multiple variants. There we are not really interested in every pairwise difference against, say, variant A, our default variant, but rather in the overall probability of each variant being the best. In this case we can easily identify the variant with the highest probability of being the best, and we can again calculate the lift and see that we get an increase of at least 15 percent. You can also use this approach for sequential data collection, which you can't really do with a hypothesis-based approach, because the essence of Bayesian reasoning is that we collect more data and, based on it, update what we believe about our conversion rates. We can set a threshold of not caring: if the difference between my two variants is something like 0.01 percent, I don't really care if I accidentally switch to the slightly worse performing variant, because I don't care about a loss in conversion rate of 0.01 percent. We can then calculate the expected performance loss of choosing one variant over the other, and this can be used as a stopping criterion to decide which version to choose. You can also use the region of practical equivalence (ROPE), which is again a concept based on the difference between the two variants: it is an interval around zero, because if your two variants are exactly the same, the center of the distribution of their difference delta will be exactly at zero, and you define a small interval around it where you say: if the difference falls within that region, I don't care about it. I can then compare the ROPE with the highest density interval of the relative lift and decide which variant to use. A sketch of these quantities, the lift interval, the expected loss, and the ROPE check, follows below.
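Continuing the hypothetical numbers from the sampling sketch above, the lift distribution, the expected loss, and a ROPE check could be computed like this; a simple percentile interval is used here as a stand-in for a proper highest density interval.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Same hypothetical counts as in the sampling sketch above
samples_a = rng.beta(1 + 5_390, 1 + 10_000 - 5_390, n)
samples_b = rng.beta(1 + 5_900, 1 + 10_000 - 5_900, n)

lift = (samples_b - samples_a) / samples_a            # relative lift of B over A

low, high = np.percentile(lift, [2.5, 97.5])          # central 95% credible interval
print(f"95% credible interval for the lift: [{low:.1%}, {high:.1%}]")
print("P(lift > 8%):", (lift > 0.08).mean())

# Expected conversion we give up if we pick B but A happens to be better
print("Expected loss when choosing B:", np.maximum(samples_a - samples_b, 0).mean())

# Region of practical equivalence: absolute differences this small do not matter to us
rope = 0.001
print("P(|B - A| inside ROPE):", (np.abs(samples_b - samples_a) < rope).mean())
```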
As we gather more data, the probability distributions we calculate get narrower and narrower. At some point, with my threshold of caring set, meaning I don't care if variant B, my new version, performs 0.01 percent worse than the default variant I have right now, I can simply keep adding views and clicks on a regular basis, every day or every 15 minutes, whatever interval you choose, and re-evaluate; a sketch of such a recurring check is shown below.
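A recurring check like that could be sketched as a function you call whenever new counts arrive; the threshold and counts here are again illustrative, not the speaker's implementation.

```python
import numpy as np

rng = np.random.default_rng(7)

def should_stop(clicks_a, views_a, clicks_b, views_b,
                threshold_of_caring=0.0001, n=200_000):
    """Call this every day (or every 15 minutes) with the cumulative counts.
    Stop once the expected conversion-rate loss of picking the currently
    better-looking variant drops below the threshold we care about."""
    samples_a = rng.beta(1 + clicks_a, 1 + views_a - clicks_a, n)
    samples_b = rng.beta(1 + clicks_b, 1 + views_b - clicks_b, n)
    pick_b = (samples_b > samples_a).mean() >= 0.5
    loss = (np.maximum(samples_a - samples_b, 0).mean() if pick_b
            else np.maximum(samples_b - samples_a, 0).mean())
    return loss < threshold_of_caring, ("B" if pick_b else "A"), loss

# Hypothetical cumulative counts at some point during the experiment
print(should_stop(clicks_a=5_390, views_a=10_000, clicks_b=5_900, views_b=10_000))
```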

You can stop as soon as you have acquired enough samples, which also means you don't have to calculate an expected sample size in advance, because it really depends on your traffic. You can also keep calculating the difference: if you let the experiment run for several days, you will see the delta between the two versions that I showed you in the previous plot getting smaller and smaller, and the two distributions from the last slide may end up overlapping almost completely, so switching might not be a significant improvement. However, following Bayesian reasoning we would say: okay, I won't lose a lot if I change to the new variant, and I might learn something more, which is what is actually relevant in a business decision setting. With that, thank you for your attention.

Moderator: Okay, thanks a lot, Nora. The room is packed, so clearly everyone is interested in A/B testing, and I'm quite sure we have questions. Yes, there's the first one.

Question: Hello, thanks for the great talk. What do you do when you have metrics that are not binary? How do you choose your prior then, for example a normal distribution? Answer: If you don't have a conversion rate, so no binary outcome, your likelihood function is not binomial anymore. You can still choose, for example, a normal prior distribution, which means you don't specify a and b for a prior beta distribution, and you no longer have the conjugate prior and posterior setup from before. So this approach really is a setting that works well for conversion rates. If you want more detail on the non-binary case, I recommend looking at Kruschke's "Bayesian estimation supersedes the t test", because there you work with differences in the means of two samples, so not a conversion rate, not a binomial outcome, but other underlying distributions. Does that answer it? Okay.
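For the non-binary case, a rough sketch in the spirit of Kruschke's BEST model, written against the PyMC v4+ API, might look like the following. The metric, the data, and the priors are assumptions made for illustration; this is not the speaker's library.

```python
import numpy as np
import pymc as pm

# Hypothetical non-binary metric, e.g. revenue per user in each variant
rng = np.random.default_rng(3)
revenue_a = rng.gamma(2.0, 5.0, size=500)
revenue_b = rng.gamma(2.1, 5.0, size=500)

with pm.Model():
    # Weakly informative priors on group means and spreads
    mu_a = pm.Normal("mu_a", mu=revenue_a.mean(), sigma=revenue_a.std() * 10)
    mu_b = pm.Normal("mu_b", mu=revenue_b.mean(), sigma=revenue_b.std() * 10)
    sigma_a = pm.HalfNormal("sigma_a", sigma=revenue_a.std() * 10)
    sigma_b = pm.HalfNormal("sigma_b", sigma=revenue_b.std() * 10)
    nu = pm.Exponential("nu", 1 / 30) + 1        # heavy-tailed likelihood, as in BEST

    pm.StudentT("obs_a", nu=nu, mu=mu_a, sigma=sigma_a, observed=revenue_a)
    pm.StudentT("obs_b", nu=nu, mu=mu_b, sigma=sigma_b, observed=revenue_b)
    pm.Deterministic("diff_of_means", mu_b - mu_a)

    trace = pm.sample(2000, tune=1000)

# Posterior probability that variant B has the higher mean revenue
print(float((trace.posterior["diff_of_means"] > 0).mean()))
```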
Question: Thanks for the talk. Do you have code examples of what you showed online? Answer: Not yet; we will release that pretty soon, because it will be a library you can actually use.

Question: Thanks for the great talk. You didn't really talk much about the prior, and I think a lot of the power of Bayesian reasoning comes from it. Can you say a bit about how to select a good prior? You used uninformative priors, but they limit the scope of your Bayesian analysis, and you essentially assume the same distribution for your A and your B, which might not be the way to go. Answer: The two versions should theoretically start from the same place, and if we want to test two versions against each other it would be quite unfair to assume different priors for them, unless we have already tested them before. If you had already run an experiment with two variants and now want to improve one version, then you could use a beta prior and get the a and b parameters from the data you already have. As I showed on one slide, if you know that your variant has a conversion rate of around 4 percent, so you know where the mode of your distribution is, there are two simple formulas that let you calculate a and b from the data you have already acquired, the total number of views for example, and that mode. However, if you choose a very informative prior distribution, it also means you have to gather more evidence before the data really changes the posterior; a small numeric illustration of this effect follows below.
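As a tiny numeric illustration of that effect, here is the same hypothetical data combined with an uninformative and an informative prior, the informative one being the roughly Beta(41, 959) prior centred on 4% computed earlier.

```python
from scipy.stats import beta

clicks, views = 30, 1000   # hypothetical data: an observed 3% conversion rate

uninformative = beta(1 + clicks, 1 + views - clicks)          # Beta(1, 1) prior
informative = beta(40.92 + clicks, 959.08 + views - clicks)   # prior centred on 4%, worth ~1000 views

print("posterior mean, uninformative prior:", round(uninformative.mean(), 4))  # ~0.030, driven by the data
print("posterior mean, informative prior:  ", round(informative.mean(), 4))    # pulled towards the 4% prior belief
```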

With an uninformative prior, the posterior probability distribution is mainly driven by the data you have acquired; with an informative prior, the likelihood does not influence the posterior as much, so you have to gather much more data to move it. Follow-up: But what if you already have your default variant implemented and you want to compare a new variant against it, as in repeated testing? Answer: Okay, yes, sorry, that was my misunderstanding. In that setting it would actually make more sense to use your prior information about the default variant, because then you force yourself to gather more evidence before concluding that your new version really is better or not.

Question: I'm very interested that you said Visual Website Optimizer implemented a similar approach. What is the difference between their implementation and yours? Answer: I can't really tell you exactly what they implemented. I know that they also use a stopping rule for the experiment that is based on the expected performance loss. You probably have to follow Chris Stucchio's blog, because I think he was responsible for implementing it at Visual Website Optimizer, to see what they actually built; I haven't read the technical paper yet. The idea is the same, but we have our own in-house experiment framework, so we won't be using someone else's.

Moderator: Great. Do we have more questions? If not, let's thank Nora again.