# Nora Neumann – Usable A/B testing – A Bayesian approach

...our preferred statistical test, and this is actually the probability of your data given your hypothesis. It is not the probability of your hypothesis given the data, because that is not what the p-value tells you. If you use a hypothesis-based testing approach to declare one winning variant, you can only reject the null hypothesis, and your null hypothesis will be that there is actually no difference between the two versions, although you are really testing for the exact opposite of this situation. There is also no indication of whether your result is important, so there is no such thing as a "highly significant" result when you just calculate the p-value.

Statisticians have been teaching people to calculate confidence intervals to capture the uncertainty of a measurement; however, there are some problems with confidence intervals too. If you calculate a 95% confidence interval, that means the procedure will produce intervals containing the true parameter 95% of the time, but this is not the same as a 95% probability that the true parameter falls within your particular interval. You could compare it with saying the Pope is Catholic, but not every Catholic is the Pope. Depending on how many experiments you run, confidence intervals actually fluctuate: they will contain the true parameter, yes, but you have no guarantee that the true parameter, which means your conversion rate for example, is exactly in the center of your confidence interval, and the more experiments you run, the more it will fluctuate.

And if you peek at your running experiment and actually test before it is done, you increase the chance of falsely detecting a statistically significant difference between your two versions, because you run into the multiple hypothesis testing problem. This not only means that you might falsely
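The peeking effect described above can be sketched with a small simulation (my own illustration, not from the talk): an A/A test with identical true conversion rates, checked at several interim points, gets declared "significant" noticeably more often than the nominal alpha = 0.05 would suggest.

```python
# Toy illustration of the peeking problem (not from the talk):
# both variants have the SAME true conversion rate, yet checking the
# p-value at several interim points inflates the false-positive rate.
import math
import random

def z_test_p_value(successes_a, n_a, successes_b, n_b):
    """Two-sided two-proportion z-test p-value (normal approximation)."""
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (successes_a / n_a - successes_b / n_b) / se
    # standard normal CDF via math.erf
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

random.seed(42)
ALPHA, TRUE_RATE = 0.05, 0.10
CHECKS = [250, 500, 750, 1000]   # interim "peeks" at the experiment
RUNS = 1000
false_positives = 0
for _ in range(RUNS):
    a = [random.random() < TRUE_RATE for _ in range(1000)]
    b = [random.random() < TRUE_RATE for _ in range(1000)]
    # declare "significance" the first time any interim look crosses alpha
    if any(z_test_p_value(sum(a[:n]), n, sum(b[:n]), n) < ALPHA
           for n in CHECKS):
        false_positives += 1
print(false_positives / RUNS)  # noticeably above the nominal 0.05
```

With four looks instead of one, the observed false-positive rate typically lands around two to three times the nominal alpha.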
declare a statistically significant result, but it could also mean that you fail to detect a statistically significant result, and this effectively corrupts your test. The formula on the bottom, I am not sure if you can see it, actually helps you calculate how the probability of a false positive significant result increases, where n is, for example, the number of variants you test, or the number of segments you choose to analyze after the fact. That is something a lot of people do: you have had an experiment running, there is no significant difference between the two versions, but then you think, let's look for example at the countries the users came from, to analyze whether there is a significant difference between users from one country across the different versions they were shown throughout the experiment. If you don't correct for this, these errors will always be around.

So you can either use sequential testing, where you introduce a very strict p-value correction. That means you decide beforehand how many tests you are going to run, and depending on how many versions you are going to compare, you have to reduce your p-value. You can't just say "if the p-value is below 0.05 I have a significant result"; rather, you have to decrease this threshold until it is very, very low in the end. That is called the Šidák correction, and it is very restrictive. Or you could use the false discovery rate, which means that, based on the set of significant results you get from multiple tests, you control the ratio of false positive results within that set, so it is less strict than the Šidák correction and you can control the error rate in a more balanced way.

Given all these problems with hypothesis-based testing, we wanted to find a way around this at ResearchGate, so we searched for a method that helps us easily communicate the results we get from such
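The formula referred to here, together with the Šidák correction the talk mentions, can be sketched like this (the function names are mine; alpha is the per-test significance level and n the number of tests):

```python
# With n independent tests at level alpha, the family-wise error rate is
#   P(at least one false positive) = 1 - (1 - alpha)**n.
# The Šidák correction inverts this: it picks a stricter per-test alpha
# so the family-wise rate stays at the desired level.

def familywise_error_rate(alpha, n):
    """Probability of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n

def sidak_corrected_alpha(alpha, n):
    """Per-test threshold keeping the family-wise error rate at alpha."""
    return 1 - (1 - alpha) ** (1 / n)

print(familywise_error_rate(0.05, 10))  # ≈ 0.401: 10 tests, ~40% chance of a false positive
print(sidak_corrected_alpha(0.05, 10))  # ≈ 0.0051: the much stricter per-test threshold
```

This is why the per-test p-value threshold has to become "very, very low": already at ten comparisons, the naive 0.05 threshold gives a roughly 40% chance of at least one spurious "winner".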
an A/B experiment setup. And our primary goal, after all, is that even if we don't have a significantly better version, we want to make absolutely sure that at least the version we choose after running an experiment is not worse than the one before. Interestingly, last year Visual Website Optimizer also introduced their SmartStats engine, which likewise uses Bayesian statistics to call a winning variant. I have written two statements here which we think are quite relevant in this discussion. With hypothesis-based testing you could only say: "We reject the null hypothesis that variant A is equal to variant B, with a p-value of 0.02," which would be a significant result. But isn't it better to communicate that there is an 85% probability, or chance, that variant B has an 8% lift over variant A? That is why we started to look into Bayesian statistics and Bayesian reasoning.

Bayesian reasoning can be nicely explained with this quote from Sherlock Holmes to Watson: "How often have I said to you that when you have eliminated the impossible, whatever remains, however improbable, must be the truth?" With Bayesian reasoning we update our beliefs about our data as we gather evidence. That means, for example, we think we have a conversion rate of, say, 4%, and the more evidence, the more data we gather throughout our experiment, the more we can either confirm or revise this prior conviction.

I am quickly going through some of the formulas involved in Bayesian statistics. First of all we have Bayes' theorem, which gives you the posterior probability of your hypothesis given the evidence, so this is what we have learned after we have gathered enough evidence. It is the likelihood of your evidence if your hypothesis is true, times the prior probability of your hypothesis, divided by the prior probability that the evidence is
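Written out, the theorem described here is (standard notation, not copied from the slides):

```latex
\underbrace{P(H \mid E)}_{\text{posterior}}
  = \frac{\overbrace{P(E \mid H)}^{\text{likelihood}} \;
          \overbrace{P(H)}^{\text{prior}}}
         {\underbrace{P(E)}_{\text{evidence}}}
\qquad\text{and, dropping the normalizer,}\qquad
P(H \mid E) \propto P(E \mid H)\, P(H)
```

The proportional form on the right is the "easier version" mentioned next: for comparing two variants, the normalizing term $P(E)$ is the same for both and can be left out.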
true. But this is a normalizing term which you don't really need in the setting of an A/B split test. So the easier version is that the posterior probability, after we have gathered enough evidence, is the prior probability that my hypothesis is true, times the likelihood, which means the likelihood of my data if my hypothesis is true.

An interesting thing about A/B testing, especially the split-testing approach, is that we actually have non-overlapping populations, because if you use an A/B experiment framework, one user that comes to your website will only be shown one single version of your experiment; one user should not be part of both experiment versions. And if you assume that this is true, two versions A and B actually become independent, which means the joint posterior probability for the two conversion rates given your data becomes a multiplication of the posterior probabilities for each single version.

Luckily, a conversion rate in terms of views and clicks is also like a coin-flip model. With coin-flip model I mean: you can throw a coin and it can either come up heads or tails; when you show your website to a certain number of users, they can either click or follow something, or they don't, so you have a success or failure event. And this is very handy, because if we want to compare the two variants, their joint posterior probability becomes a two-dimensional function of both conversion rates. We can still calculate the likelihood, which is a binomial event, and the prior probability, which follows a beta distribution. And because we have a binomial likelihood function, our conjugate posterior distribution also becomes a beta distribution. Conjugate distributions in Bayesian statistics only mean that the prior and the posterior probability distribution are of the same family, so in this case they are
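A minimal sketch of this Beta-Binomial setup, assuming a uniform Beta(1, 1) prior; the helper name and the counts are mine, made up for illustration. Because the two variants are independent (non-overlapping populations), the probability that B beats A can be estimated by sampling each posterior separately:

```python
# Sketch of the Beta-Binomial model described above, with a uniform
# Beta(1, 1) prior and made-up experiment counts.
import random

random.seed(0)

def posterior_params(clicks, views, prior_a=1, prior_b=1):
    """Conjugate update: Beta prior + binomial likelihood -> Beta posterior."""
    return prior_a + clicks, prior_b + (views - clicks)

# hypothetical counts for the two variants
a_post = posterior_params(clicks=120, views=3000)  # variant A
b_post = posterior_params(clicks=145, views=3000)  # variant B

# Independence lets us sample each posterior on its own and compare:
samples = 20000
wins = sum(
    random.betavariate(*b_post) > random.betavariate(*a_post)
    for _ in range(samples)
)
print(wins / samples)  # P(B's conversion rate > A's), roughly 0.94 here
```

The output is exactly the kind of statement the talk argues for: not a p-value, but "there is a ~94% probability that variant B converts better than variant A" (for these invented counts).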
both beta distributions. This is the formula for the posterior distribution of our variant A, and as I