Lennart Martens 9th European Summer School for Proteomics 2015

Actually, I thought I had a really bad deal when I was invited; of course it's a really good deal to get invited here, and I'm very happy to be here, but I was given the title "basic bioinformatics". The whole point was that I could do that: I would talk about search engines, I would talk about quantification; that is basic bioinformatics. It turns out you already had a talk about search engines by one of the world's most esteemed experts in the field, you've heard a lot about quantification, and all the rest was in the Nonlinear talk, so I couldn't talk about these things anymore. So I have to talk about everything else that is basic bioinformatics. Now, what's the first thing that comes to mind when I hear "basic informatics"? It's assembler. So let's do a bit of assembler, the language you use when you program the CPU directly. And actually this is a joke, of course; this is not real assembler, this is a computer game, the game I play in my free time, and it is amazing, the best computer game I have ever played. You write assembler in the game and you try to make the most efficient possible solution to a stupid task with minimal resources. It has taken over my holiday. Anyway, my solution there was the most efficient solution on record. So, anybody who plays that: I want a screenshot of yours; that's good stuff. Before I start my lecture proper, I think it's very important that people get educated in bioinformatics, because you will use a lot of bioinformatics; not all of you will go into bioinformatics, but you are users of the products of bioinformatics, and in order to use them right you need to use your wetware. It's not easy, it's also not hard, but it's not trivial. You cannot just click the buttons of whatever search engine you are using and expect things to come
out. You have to think about what you're doing, and for that we have made an extensive series of tutorials that explain just about anything, at least for identification; quantification is coming up. They are freely available on the website of my group; there's a link to the tutorials, "Bioinformatics for Proteomics"; you really can't miss it. It has all of these different sections; the first section is peptide and protein identification, and then it has these different topics, and it really goes step by step through everything. It has PDFs with all the documentation on how to do the tutorials, it has the resources, so all the data you need to do the tutorial, and it has links to all the freely available tools that you can download to do the tutorial with. Plus, these tools are really not tutorial tools; they're serious research tools, so you also get all the tools, nicely organized in one place; some of them are from my group, some of them are from other groups. This is just to show you what these tutorials look like. We tend to think they're quite professionally made; we use them in a lot of workshops around the world. There are screenshots of what you should see on your screen when you do certain things, they explain everything really nicely, they use colors, they have tips; the tips can be about the software, but they can also be about your research. And most importantly of all, they have these italics sections, and the sections in italics are questions. Every question is numbered; this is question 1.4. These questions go from very easy to very, very hard; some of these questions we don't know the real answer to, but we can give you some hints. These questions are really there to make you think about what you're doing. And the really nice thing is that we went to the effort of providing a separate document with all the answers, and that document is, if I recall correctly, 28 pages, because some of the questions require a multi-page answer; they go
deep that’s why there are numbers so if you want to learn about this if you have a bit of time it’s worth actually picking up one or two of these see if you like them and maybe look at some of the other ones they will really teach you a lot okay so that’s not the stuff i will talk about today instead i’ll talk about these things first thing is FDR estimation you all should be familiar with the fact that when you identify specter you match them to peptides some of the times you identify something wrong okay and the problem is that we cannot get rid of them easily so what we do is we accept a certain amount of crap in our results and the amount of crap we accept is estimated by this false discovery rate the amount of crap in your data so this has been a very hot topic you know not the topic of the day although it’s definitely very much in vogue again today but the past four thousand day so let’s say give or take time he acts and update right this came up when proteins became high-throughput then we started realizing we have to control for this the problem is how do you know that these FDR estimations that everybody uses work so we’ll talk about that about a way to figure out whether they work and I’ll show you where they don’t work then I’ll tell you why this is going to get only more difficult as we go along so in bioinformatics who are

reaching an impasse: the technology that we've been using for the past ten years to make sense of our data is quickly becoming outdated. I'll show you why in detail, and for which specific purposes it becomes outdated; it's pretty much at the cutting edge now, but in five years it will be struggling. I'll talk a little bit about data management and dissemination, because I have spent a lot of time and effort in my career on this kind of stuff. I'll talk about quality control; I think the next big user-oriented bioinformatics challenge is to deliver quality control to the people in proteomics. And then finally I'll talk about some philosophical nonsense about bioinformatics: beyond the algorithms and into the world of the bioinformatician. So, FDR estimation. I usually use this amusing little picture, which some of you may find familiar, to explain what the problem is with identifying peptides. It's not a perfect analogy, but it works reasonably well. You have to find Waldo, or Wally, depending on where you are in the world, and you know Wally has this white t-shirt with red stripes; he's a funny character. So what you do when you go through this is you check everyone and see whether they match your mental image of Waldo, and of course they put some decoys in there: look, here's a decoy, there's a decoy, here's a decoy, there's a decoy, and here's somebody who photoshopped themselves in as a decoy; that's what you get when you find things on the internet. So the idea is that you actually try to match, and the pattern has to match correctly, and the number of times you return a match that is wrong, when you find not Wally but somebody who looks like him, is counted as a false positive. But hopefully, most of the time, you will find Wally hiding behind the old lady, and that works reasonably well. The problem is when we go to bigger databases: when we go to big databases, it gets very, very difficult to make sure you've got the right guy,
and that is what I will talk about. Where do you encounter this problem? Metaproteomics: now you want to do proteomes of entire microbial communities, and it's booming quite a bit. Second thing: proteogenomics. Everybody wants to do proteogenomics, but then you have your six-reading-frame translation of your genome as a database, and this is a giant, giant database. Anything that is multi-species is going to give you a lot of problems. So we came up with a way to figure out what the problems are when you look at FDR, and we used Pyrococcus furiosus, which is a really strange organism. It lives at the bottom of the ocean, at these hydrothermal vents, these volcanoes at the bottom of the sea, and so these guys are very happy at 93 to 100 degrees centigrade. You take any of your cells, you put them at 100 degrees centigrade, and everything is dead immediately; these guys are very happy. You know that they are happy because they're called Pyrococcus furiosus: "pyro" for fire, "coccus" because they're cocci, and "furiosus" because they breed like crazy, so they must be happy. Now, they're a very special kind of archaeon, an extremophile, an extreme thermophile, in the Thermococcaceae family. If you look at the proteome of Pyrococcus, it has a lot of proteins that we have as well, and if you go through the proteome and look at the peptides, they actually seem to distribute, in a lot of metrics, like human peptides, but the sequences are dramatically different. In fact, here is the overlap: the shared tryptic peptides between human and Pyrococcus are shown here; there are five, all five of which are shorter than six amino acids, which are usually excluded from search results anyway. So Pyrococcus does not look like us, although the peptides, when you look at how they distribute over a chromatographic run, when you look at the mass distributions, look like ours. Now, why do we use this? We use it as an entrapment database, and the idea is simple: you search
data against the human database; of course, this is human data, HeLa data, and we get 8,000 PSMs identified. Now, we also do a decoy search; you know about decoy searching, and this is the number of decoys, which turns out to be one percent of that. That's the way we do this. Now, the question is this: since we haven't told the search engine that we've included Pyrococcus sequences, any Pyrococcus hit must be wrong, because we searched human data against the Pyrococcus database, so anything we get back from it must be wrong. So we find out how many peptides we matched against Pyrococcus; those are the entrapment hits, and they should be lower than or equal to the number of decoy hits. Let's go through that again. Decoys mean that if I get ten decoys out of 100 total hits, the ten decoys are a model for ten wrong hits in the real search; I've got ten out of a hundred that are wrong, so my crap is ten percent. If I now use Pyrococcus in parallel, and I am sure that my FDR is ten percent, I should not get more than ten Pyrococcus hits out of those hundred. Because if I get fifty Pyrococcus hits, which are all wrong, and I know they're all wrong, they're just a different type of decoy, then I actually have fifty percent FDR, and my original estimate is wrong. So we use Pyrococcus to validate; it's a secondary decoy search. And you see that this works: the number of Pyrococcus hits is always lower than what you would expect from the decoy hits. But notice what we're doing: here we include Homo sapiens, which is about this size of database; here we include all Mammalia, which is a database ten times bigger; here we include all Vertebrata, which is yet another tenfold bigger; and here we include all Eukaryota. And notice that the overlap between all possible Eukaryota that we've ever brought in sequences from and Pyrococcus is absolutely minimal; there is practically no overlap. And you see what happens: we control this FDR beautifully. When the algorithms claim the FDR is one percent, it really is one percent. But who can spot the problem here? What is the problem? There's a very, very big, glaring problem screaming at you. Look at the green bar, from that side to that side. What does the green bar do? It goes down. The number of hits you get back goes from eight thousand to seven thousand to six thousand two hundred, and here, when the database really gets very big, it goes down to 5,000. Is this good? Do we like this? No. This is the biggest problem when you search a large database: the recall, the number of peptides you identify, goes down dramatically. However, we do still control the FDR. Now, people have tried to fix this. For one thing, you can try to use a better search engine, so we tested three search engines, starting with Mascot, of course.
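To make the entrapment logic concrete, here is a minimal sketch in Python; the counts are illustrative, not the actual numbers from the study. The decoy hits give the claimed FDR, and the entrapment hits give an independent check on that claim.

```python
def decoy_fdr(n_targets, n_decoys):
    # classical target-decoy estimate: decoys model the wrong target hits
    return n_decoys / n_targets

def entrapment_fdr(n_targets, n_entrapment):
    # every entrapment hit (e.g. a Pyrococcus match for human data) is
    # known to be wrong, so this is an independent error estimate
    return n_entrapment / n_targets

# illustrative counts: 8,000 accepted PSMs, 80 decoy hits, 60 entrapment hits
claimed = decoy_fdr(8000, 80)       # what target-decoy claims
checked = entrapment_fdr(8000, 60)  # what the entrapment database shows

# the estimate holds as long as entrapment hits do not exceed decoy hits
print(f"claimed FDR: {claimed:.2%}, entrapment FDR: {checked:.2%}")
```

If the entrapment FDR ever exceeds the decoy-estimated FDR, the target-decoy estimate is too optimistic; that is the whole validation trick in one comparison.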
Everybody uses Mascot. We also used OMSSA, which is known as a very conservative search engine: it doesn't give you many results, but it tends to have good control of its FDR. And we used X!Tandem, which typically gives you more hits. What we do is plot the number of decoys against the number of target hits, so at any point you can calculate the FDR for that point by taking this and dividing it by that. The thing is, you see here that the lines for Mascot and OMSSA follow that really nicely, and this is exactly what you expect: they are precisely at the place where we expect them to be; they give you the right number of hits. Sorry, this is entrapment, so this is Pyrococcus plotted against decoys: one Pyrococcus hit for one decoy hit, so that's perfect; the Pyrococcus hits match the decoy hits, and the thing works. For X!Tandem, however, you see that where you get 200 decoy hits, you get 300 Pyrococcus hits. So what does that mean? When we think we have 200 decoy hits, we actually get 300 wrong hits back. So what have we learned? X!Tandem is underestimating the FDR: it's telling us, based on target-decoy, that you have 200 crap hits, but in fact, when we try it on something else, we get 300 crap hits; that's a fifty percent increase. Now, if you look at the decoy hits you get from X!Tandem, this translates into more "good" hits; these are the real target hits. So at one percent FDR: wow, X!Tandem gives me six thousand two hundred hits instead of five thousand; it must be amazing. That is what you now see, and interestingly, Mascot and OMSSA are very similar. Also note, and I don't know if John is here, actually; maybe he's not here, that might be good, because then I can say these things and he will not shoot me or be embarrassed: you can say that Mascot was really built for five percent FDR. That was the original default threshold, and look at where it shines: Mascot shines at five percent FDR; it's optimized for that. Everybody has always been using Mascot at one percent FDR, which doesn't kill the algorithm, but it's better at five percent. It's an interesting observation, and something that kind of makes sense. So X!Tandem gives you more, and these two guys give you about the same. So why is X!Tandem superior? The thing is, it's not. When we look at the entrapment rate, so now the decoys are gone and we just look at Pyrococcus, but since those hits are a type of decoy, we can do the FDR calculation based on Pyrococcus: where is X!Tandem now? Oh look, it's at exactly the same place as where Mascot is.
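The decoy-versus-target plot just described can be written down in a few lines: walk down the PSM list from best to worst score and report the ratio of decoys to targets at every point. This is a generic sketch of the idea, not the code used in the study.

```python
def fdr_curve(psms):
    """psms: iterable of (score, label), label 'target' or 'decoy'.
    Returns (score, estimated FDR) at every threshold, best score first."""
    curve, targets, decoys = [], 0, 0
    for score, label in sorted(psms, key=lambda p: p[0], reverse=True):
        if label == "decoy":
            decoys += 1
        else:
            targets += 1
        # at this score threshold: decoys seen so far / targets seen so far
        curve.append((score, decoys / max(targets, 1)))
    return curve

# toy PSM list: high scores are mostly targets, low scores are mixed
psms = [(90, "target"), (85, "target"), (80, "target"), (70, "decoy"),
        (65, "target"), (60, "decoy"), (55, "target"), (50, "decoy")]
for score, fdr in fdr_curve(psms):
    print(score, round(fdr, 2))
```

In practice one would also take a running minimum from the bottom of the list to turn these raw ratios into proper, monotone q-values; the raw ratio is what the plots on the slide show.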

I mean, it nearly tracks Mascot's curve perfectly; it's still a bit pessimistic, but that's expected. So when we use Pyrococcus, suddenly X!Tandem isn't special anymore. What you get is that the algorithm that gives you the biggest return is somehow exploiting something about the decoys, something that fails if you give it a real biological sample as an entrapment. So we now have a way to calculate whether the FDRs that are estimated make sense. It's not a perfect way, but we can already show that X!Tandem has an issue. OK, now the big database: metaproteomics. We have OMSSA in gray, X!Tandem in blue, and the combination of the two search engines in orange. Combinations of search engines give you more hits; that's why our tools that are in the tutorials, SearchGUI and PeptideShaker, ship with seven search engines built in. You download the file, you unzip it, and you have just installed seven search engines on your computer. You run it, you give the result to PeptideShaker, it piles it all together, and it gives you this orange line, which is always better. It's usually not a huge gain, five to seven to ten percent in the best cases, but it helps. Now, as we've seen before, X!Tandem outperforms OMSSA; no surprises, everybody knows this. This is against the normal Pyrococcus database, and what we're doing now is the inverse of what we did before: we're taking Pyrococcus data and searching it against the Pyrococcus database, so we do find Pyrococcus proteins. This data comes from a Q Exactive, so it's high-resolution MS/MS; this is the best data we can have right now. So on Pyrococcus we identify a lot of Pyrococcus proteins, everybody's happy, and X!Tandem does better than OMSSA. Now, you see these dotted lines here? They're the strange thing; they're a new database. So these are still Pyrococcus spectra, but now we searched them against the Pyrococcus database plus the human intestinal metaproteome database. It's based on next-generation sequencing of human fecal samples; they took all the open reading frames they could get from that and put them into a ginormous database full of bacterial sequences, and these bacteria look very different from Pyrococcus; I would hope your intestines are not at 100 degrees centigrade, even if you ate a chili pepper. So the point is: this database is enormous, and as expected, the blue and the gray lines drop. But you notice one thing that you should already have expected: X!Tandem drops much more than OMSSA; in fact, X!Tandem and OMSSA are now cut down to the same size. That's because some of the tricks that X!Tandem uses, unwittingly, not on purpose, are now useless. Still, you do get an increase, and the increase becomes bigger, relatively speaking, when you use multiple search engines. So, take-home message number one: search engine behavior depends on the size of the database in general; the big exception is MS-GF+. Second: the harder the task gets, the more benefit you get from multiple search engines, so teamwork matters when it gets hard. But the problem of course remains, and it is very unsettling: at one percent FDR we go from roughly 9,800 to 4,500 identifications. That's half of the stuff, poof, gone. It's really, really annoying, and it's just because we changed the database. So people came up with a way to fix this. But before I get to the fix, I have to tell you why we need it: why do we lose all of these identifications? It's very simple. I have plotted the distribution of the decoy scores, because these are important, and the normal Pyrococcus scores in blue. What do you see? The decoys all have low scores, and the Pyrococcus hits have both low scores and high scores, and the FDR mechanism is basically: you put the threshold where all the stuff to the right of it contains one percent decoys compared to the bulk; that's one percent FDR, put in a simplistic way. Now, this is a bit strange: why do the decoys all look like that? Well, the standard argument is that it's because the scoring algorithm is good: it can differentiate crap from real hits, and it seems to do so. But when we go to the big database, the one with the intestinal metaproteome, look at what happens to the decoy scores: they move up, from around 8 to around 30. The decoy hits suddenly all get bigger scores. So why is that? Why does the decoy distribution shift?

It shifts because instead of, say, five thousand proteins, you now have the equivalent of, say, 120,000 proteins. Imagine the number of different sequences in all of these proteins: the sequence space is now enormous. When you make shuffled or reversed versions of this sequence space, you will find a lot of sequences that start giving you better hits than before; there's a lot more sequence diversity that gives you higher scores, and the algorithms are not built to compensate for that. Now, there's another defect that you should notice, and that is that this blue curve actually goes up dramatically. You see that difference? It gets boosted, especially here, and the reason it gets boosted is that it gets a lot more crap hits, that's here, but it also gets more seemingly good hits. Keep in mind that we're searching Pyrococcus data, so the hits we get against the other prokaryotic organisms are most likely false; any increase in the blue curve is wrong. So this is not a good picture: we make our FDR threshold tougher, and whatever gains we have, we should probably not trust, in this particular case. Now, people set out to fix this, and they said: yes, we can fix that, and the way we fix it is that we pretend not to have a big database. We first search against Pyrococcus, and everything that's identified we put away; everything that's not identified we then search against the human intestinal metaproteome database. That is called a two-stage search, or a multi-stage search, and it's been recommended as a solution for proteogenomics, most recently by Alexey Nesvizhskii in a very nice review in Nature Methods, and it's been suggested in a lot of papers as a fix for metaproteomics. So let's see if it works. This was Pyrococcus alone; this is what we showed for Pyrococcus plus the human intestinal metaproteome database, so it has the problem; and this is what happens when you do the two-stage thing. It's magic! The decoy distribution goes back to where it belongs; it's a little bit bigger, but it really doesn't make that much of a difference, and still we get the blue curve from this one. This is the best of both worlds, right? We take the good curve here, and the good curve there, and we make a new thing out of it. Problem solved. Until you do the math. At one percent FDR, when we searched the Pyrococcus database alone, we get 10,000-odd hits, and of course one hundred percent of them come from Pyrococcus, since it's only the Pyrococcus database. Here we are talking peptide-to-spectrum matches and peptides; we don't talk about proteins, that's too difficult. So you get up to one hundred percent; that's all fine. When you take the bigger database, that is here, we get half the hits, remember, we halve the number of hits, but ninety-nine point four percent come from Pyrococcus: our false discovery rate is under control, and everything works as we expect it to. At the peptide level you go a bit over the one percent, here you stay a bit under the one percent; that is rather typical: every step you go up from the spectrum to the peptide, you tend to accumulate more FDR than you expect, and at the protein level it balloons even more. So what happens in the two-step search? We nearly completely rescue the identification count: we go back to our 10,000 identifications, but only 91.5% come from Pyrococcus, so we have eight and a half percent false discovery rate. We asked for one percent; we underestimated our FDR more than eightfold. And that is at the PSM level; at the peptide level, and we're still not talking proteins, this is now twenty percent FDR, because a lot of these wrong hits, the eight and a half percent, are scattered randomly across all possible peptides; they are not multiple spectra against the same peptide. So this problem balloons at the peptide level, pretty dramatically. So have we been saved? Have we done any magic? Yes: it's black magic, and it took our soul with it. At five percent FDR, things are exactly the same: we control the FDR by sacrificing a lot of hits, but when we do the two-stage search, it goes off the rails; it explodes in your face. Now sixty percent is right and forty percent is wrong, and people publish this.
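Plugging in the numbers quoted here makes the damage easy to verify; these are the figures from the slides as quoted, at a nominal 1% FDR, and the little helper is just arithmetic.

```python
def actual_fdr(frac_from_pyrococcus):
    # in the entrapment setup, any accepted hit that is not from
    # Pyrococcus is known to be wrong
    return 1.0 - frac_from_pyrococcus

nominal = 0.01                       # the FDR we asked for
single_stage = actual_fdr(0.994)     # big database, one search
two_stage = actual_fdr(0.915)        # two-stage search

print(f"single stage: {single_stage:.1%} actual FDR (controlled)")
print(f"two stage:    {two_stage:.1%} actual FDR, "
      f"{two_stage / nominal:.1f}x the nominal 1%")
```

The single-stage search pays for its control with half the identifications; the two-stage search gets the identifications back but silently runs at roughly eight and a half times the FDR you asked for.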

They say: "I did this, and I have five percent FDR", and it's not their fault, but now we can see that this creates problems. So this multi-stage searching is a deal with the devil; I'm inspired by the Catholics around here. You really have to watch out for it. And you know what the biggest problem of all is? I don't have a solution for you right now. A few people in the world, me included, are working on solutions, but few people realize the problem, despite the fact that it's in plain sight: everybody can do these analyses, you can do them in one afternoon on any data set you want. But it's very, very hard to fix this, and we probably have to rethink the way we do our searches. So, are there any bioinformaticians in the room? Raise your hands. One, two, three, four; well, we're trying to be represented. You guys have work to do; if you want to do some work, this is a very interesting problem to work on. So, you know, now there are lots of Waldos, and that is the problem. Let's go into the problem in a bit more detail, and if you understand it, then I can give you a second thing, on a very close horizon, that will make our life difficult. It's the reason why FDR estimations fail in these big databases, and in a lot of other cases; this is target-decoy searching. Which of the following two pictures is Michael Jackson? Is this hard? You have the target and you have a decoy. Now we make the database big: which of these is Michael Jackson? Now you start to doubt, right? Is this guy on the left really Michael Jackson? Do you want to know what the answer is? The real Michael Jackson is not on the slide; these are all imitators. This guy happens to be the best one in the database, apparently. So the thing is, it becomes incredibly difficult to see the difference between things that look alike, and the bigger the database, the more chances you have of finding multiple peptides that look alike. Now let me demonstrate how bad this problem can get with a hyperbole. I'm going to do something extreme, like rafting, but then worse: like rafting on, I don't know, a piece of paper. So I'm going to do something super extreme, just to show you how bad it can get. You take a bunch of peptides; we took 70,000 identified peptides that we validated in a lot of ways, based on their fragmentation patterns: these are beautiful spectra. And from these 70,000 peptides we started making photoshopped versions, imitators. So we permute some of the amino acids; a stupid trick, we just switch them around in position, and you can do that across the whole sequence, but the peptides stay isobaric. And you do isobaric mutations: a glycine and an alanine together have the same weight as a glutamine; this is actually ion trap data that we used, so an ion trap cannot see the difference between these. Here we have a deletion plus a mutation: an alanine and a glycine together have roughly the same mass, at least for an ion trap, as a lysine. And here we take an asparagine and replace it with two glycines. So you see, we make a lot of photoshopped peptides; this is always the same guy, but it yields very many photoshopped versions. So if this is our peptide, we make these photoshopped versions, then we take another celebrity and we make all of its imitators, and so on, so that every one of these 70,000 peptides has at least 100 imitators in the database. This is really pushing the search engine, because now the search engine is really, really confused. And you see what happens. This is the decoy distribution; incidentally, we took a reversed decoy, in yellow, and we took ten different shuffled versions of the target database as decoys, in green. Do you see any difference between the decoys? So the next time somebody tries to publish a paper where they say reversed is zero point so many percent better than shuffled: forget it, because even if you shuffle the damn thing ten times, it gives you more or less exactly the same distribution. So how you make the decoy doesn't matter. The real hits, the original peptides, are in red, and in blue you see the hits we get against the photoshopped peptides: the distribution has shifted to the right. This is very scary. This is Mascot; to show you that this is not a Mascot problem, we did it with X!Tandem as well. Does everybody see the strange pattern in X!Tandem? X!Tandem has this discretization in the score; it comes from the formula of the hyperscore, so it's quite funny to see the sawtooth, which you don't have in Mascot, although if you look very carefully, there's a little bit of a sawtooth there as well, because the peptide length is variable. OK, so what does that mean? This is the distribution of the ion score difference; I'm using Mascot here, but X!Tandem gives you exactly the same picture.
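Generating such imitators is almost trivially easy, which is part of what makes this so scary. Here is a minimal sketch of my own; the actual study used a larger set of rules and a proper ion-trap mass tolerance. Swapping adjacent residues keeps the composition and hence the mass, and the substitutions below follow the rules mentioned: N to GG and Q to GA are exactly isobaric, K to AG is close enough for an ion trap.

```python
# (near-)isobaric rewrites: N <-> GG and Q <-> GA share an elemental
# composition; K <-> AG differs by ~0.036 Da, invisible to an ion trap
ISOBARIC = {"N": ["GG"], "Q": ["GA", "AG"], "K": ["AG"]}

def imitators(peptide):
    """Same-mass 'photoshopped' variants of a peptide sequence."""
    variants = set()
    # swapping two adjacent residues leaves the mass untouched
    for i in range(len(peptide) - 1):
        swapped = peptide[:i] + peptide[i + 1] + peptide[i] + peptide[i + 2:]
        if swapped != peptide:
            variants.add(swapped)
    # isobaric substitutions of single residues
    for i, aa in enumerate(peptide):
        for sub in ISOBARIC.get(aa, []):
            variants.add(peptide[:i] + sub + peptide[i + 1:])
    return sorted(variants)

print(imitators("LNDK"))
# includes e.g. 'LGGDK' (N -> GG) and 'NLDK' (an adjacent swap)
```

Applying swaps and substitutions recursively, and allowing swaps at any distance, quickly yields the hundred-plus imitators per peptide used in the experiment.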

spectrum I take the original peptide head which I know is you know ninety percent 95 99 percent right and I calculate the score and then I take the score of the best decoy eight now if you look at this traditional shuffle to reverse databases the decoy is nearly always worse than the original hit okay except for this very small amount which is our one percent FDR okay this is the one percent when decoy actually gets a better fit this is exactly what you expect when you look at our Photoshop peptide nearly all of them give a better hit than the original one and it’s very dramatic ninety-five percent of cases you get a photo shocked you get an imitator peptide that actually gets a better score or equal score than the forward and seventy-five percent get a better score and twenty percent get an equal score this means that if you give me any of your data sets anyone you please and you give me a little bit of time to create imitators i will give you back your data set and i will say four ninety-five percent of your identifications i have an equal or better match alright oops now this is of course a highly artificial situation it’s the biggest problem you can give a search engine but what we wanted to show is that if you give them this problem they break now to be honest this is now in getting into the detail to be honest they don’t break that dramatically if i show you a hundred photoshop pictures of someone and the original what is your chance of getting the right one I mean you cannot tell the difference anymore so what is your chance of getting the right one one out of a hundred and one right because you randomly going to pick a picture the search engine gets at least a hundred decoys at least a hundred imitators for each peptide so by random chance ninety-nine percent of the decoy should be better but it’s only ninety five percent so to be honest we push these guys to the edge and they kind of hung on by their fingertips for a few seconds which is not bad it’s not 
bad but it’s not enough this is the same visualization of how to score but it’s not that important now we have tools like percolating who knows about populated at the John talk about percolator he should because it’s built into mascot it’s really really good you want to search engine to be today its mascot post percolate it’s extremely good and so percolator looks at all these speeches right as you talked about that features we have all these features so you can look at the mass difference of each of the fragment ions you can look at the ion intensity and the V ions and the y iron so you can look at how many be ions are many y ions and if you look at all of these things maybe you’ll find a pattern for instance decoys tend to have fewer y ions but more bis than target is and so it learns it adapts to that and it figures out how things are different so we try to find whether there’s a way that we could use this approach to differential between the different hits you see the decoys the original decoys here and you can see that the PPM mouse arrow is indeed much bigger for most of these traditional deployments so we can learn something from this we can learn about the decoys but now look at our directed decoys the photoshop guys there is a difference but it’s minimal and then here this is the distribution of the m sms error medium okay that distribution again shows a clear bias for these traditional decoys it does not do that with the photoshop baptized here to this is the BIR coverage look at a massive difference that you get ready shuffled are reverse sequences look at the almost perfect overlap between the photoshop ones and the original ones none of these parameters is helping us there’s only one parameter that we could find that give us a bit of an edge and it’s the RK interquartile distribution on the m sms mass error it’s a consistency with which the arrow moves in m sms vector and this is an anion trap and what you can see is that well the traditional decoys 
And there the traditional decoys are much worse. By the way, there is one shuffled one that is very strange. You see the original ones and you see the Photoshopped ones, and the Photoshopped ones are only slightly different — that's the only thing we could find. Okay, so there is a lot of work to be done here to fix this problem, because you don't notice this problem when you search human, you don't find it when you search yeast, you don't find it when you search E. coli — but you find it when you search yeast with fifteen variable modifications, and you find it when you do a proteogenomics study. Not to this extent, not that bad, but you see the tail of this thing, which you have seen in the previous slide, and that is a problem that we currently cannot fix very well. Why do I make such a

big fuss about this? Because everybody is going into RNA sequencing, and they all want to put RNA sequencing data into a proteomics pipeline for matching the spectra. So what do people do? They take the reference proteins, they take the sequences they get from the sequencing, and they add the sequences that are not in the reference — and these are all small mutations of existing peptides. Now you are creating a database that looks, although only fractionally, surprisingly like this one. We have been doing this with a group in Ghent, and what we see is that our identification rates on the combined databases go down because of this problem. So this is really very close on our horizon, and it is a completely ignored problem — and the target-decoy approach does not compensate for this at all. The target-decoy approach pits Picasso's cubist pictures against normal paintings; now we have lots of normal-looking peptides that all look alike. We need new tools, we need new stuff — so it's not a happy message. Let's talk about something else. Everybody knows this picture: it's The Great Wave off Kanagawa by Hokusai, and I think it's a beautiful metaphor for our life in the omics fields. The wave is high-throughput omics data; the rock in the background is classical bioscience, the one-gene, one-protein, one-PhD-in-biochemistry style of things where everything is under control; and the little boat is all of us. So how do we deal with this? You can see this as a threat, but you can also see it as an opportunity. The thing is: manage your data. How many groups, how many people here know of a systematic, database-driven storage of data in the mass spectrometry group they are in or work with? How many people store their data in a database of some sort? Yeah, I know you guys do — you were pioneers; I think you even beat me to this, which is highly unfortunate. You see, almost nobody does this. So step back and think about this.
Does that make sense? I mean, Facebook keeps track of every photo you take, and you don't even bother taking care of your proteomics data — does that actually make any sense? It does not. Okay, so a lot of people have built tools for this; unfortunately, not all of these tools are universally applicable, and not all of them are very production-grade. We have one which has been around since 2003 — I thought I was first, but another group beat me to it with one in 2002; look, when I publish mine, I wait until it is stable — and it is still running; we keep updating the system, and apart from that one it's the longest-running database system in the field. The systems that existed were reviewed in 2010, and there are a few new interesting ones possibly coming out in the near future, so keep your eye out. But really, you should seriously consider — especially you young people, who know about big data and how everybody tracks you — at least tracking your own stuff, right? Seriously consider doing this. Why would you do this? I was in South Africa teaching a course with Kathryn Lilley, and Kathryn said: this peptide is very interesting, because it contains my initials — Kathryn S. Lilley, KSL — and there are very few peptides with KSL. So I make a VPN connection to Ghent, I do one query on my database, and I say: oh Kathryn, here you go — all of the peptides we have ever found that contain KSL, cumulative, and the number of spectra in each of the different runs in which we found them. I got that just like that. This is a whimsical example, it doesn't make much sense, but it shows we can data-mine this stuff in myriad ways, and it's super convenient for a lot of research purposes. And you know the most convenient thing: the biologist comes back after two years and says, oh, I'm going to publish the study now, can I have all my data again? And there is another thing.
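A query like the KSL one is trivial once the data sits in a database. Here is a toy version with an invented schema — the table and column names are assumptions for illustration, not the actual Ghent system:

```python
import sqlite3

# Toy local archive: which runs contain a peptide matching a substring,
# and how many spectra matched it in each run? (Schema invented here.)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE runs (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("CREATE TABLE psms (run_id INTEGER, peptide TEXT, spectra INTEGER)")
db.executemany("INSERT INTO runs VALUES (?, ?)",
               [(1, "course_2009_rep1"), (2, "course_2012_rep3")])
db.executemany("INSERT INTO psms VALUES (?, ?, ?)",
               [(1, "AKSLVDK", 4), (2, "AKSLVDK", 2), (2, "ELVISK", 7)])

rows = db.execute("""SELECT r.name, p.peptide, SUM(p.spectra)
                     FROM psms p JOIN runs r ON r.id = p.run_id
                     WHERE p.peptide LIKE '%KSL%'
                     GROUP BY r.name, p.peptide""").fetchall()
print(rows)
```

One query, and you get every run that ever saw a KSL-containing peptide with its spectral count — which is exactly the kind of two-minute answer a local archive buys you.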
Your lovely local database also allows you to take your data and chuck it into a public archive, which is the big version of what I have just talked about. These public archives are super useful; you can do a lot of stuff with them. One of them is PRIDE, which I built at the EBI, and it is part of a big consortium called ProteomeXchange that tries to exchange data as broadly and as widely as possible and to make life easy for you — by making very complicated diagrams, but mostly it has these 'submit data' and 'access data' links. There is a tool to submit your data, and it depends on whether you want to do a good job or a half-assed job. May I recommend that you try to do a good job? It will take about one day of

your life to do a really good job — one day, which is peanuts compared to all the other time you are going to spend on getting your data published. Okay, it will cost you one day to do it nicely, but then everybody can make good use of your data, whereas if you do the half-assed job, everybody else can only just kind of do something with it. Now, I'm not going to talk too much about that; I'll just tell you that it's there. A lot of journals mandate that you deposit your data. You can keep it private during the review process — you can give the reviewers and the editor a login so they can see your data while nobody else can — and when it's published, a lot of journals automatically notify the database and it becomes publicly available. And the funders are asking for this these days, so please consider these kinds of things, and please spend the one day to annotate your data — if for nothing else, then for my eternal gratitude. Long story short: you remember the tsunami? What if we could take this flood of data and, you know, channel it, and use it to irrigate the desert of the unknown proteomes that we have all around us? This is the vision, this is what it should become — and it's desalinated water, so the metaphor even works with seawater. Right, quality control. I always try to use funny pictures, especially in long lectures like this, and I was googling and I found this website, Epic Fail. I recommend it: when you feel really sad and really bad, go to this website; they have all these pictures. This is what we try to do, and this is what came out — this is one of the best; it's almost a horror movie. People are trying to make candy canes, right: you make some dough-kind-of thing, you roll it, you put the line on it — it's almost like making sushi, but with a twist: you twist it and you get the two strands — and then people post their epic-fail pictures. Look at this one: that is the stuff of horror movies. Okay, so the question now is: when you set out
to do a proteomics experiment, this is your target — do you know whether you ended up with this? It's a very important question to ask yourself. In a lot of analytical fields, quality control is implicit, to the point that in analytical chemistry, which is our sibling field, they have something called accreditation. If you want to be an analytical chemistry lab with official recognition that you are allowed to do this kind of stuff, you know what they do? They send you test samples for the things you are accredited for. These are made by the standardization organizations, and you know nothing — only that you are supposed to measure a particular compound or set of compounds in this mystery sample. They ask you for the amount and the error on the measurement, and the error must be less than what they dictate as the norm for accreditation. Now suppose you get it right: accreditation prolonged, you can be accredited for another six months or a year. But what if you get it wrong? They send it back: you got it wrong, try again. And that's all — they don't tell you whether you were off up or down, too big or too small; they don't tell you anything. And if you fail the second time, you know what happens? They take your beautiful accreditation, you tear it up, and you are out of business. That's how they do quality control. Now you're very silent suddenly, because nobody can imagine doing this in proteomics — but seriously, why not? I think the biggest problem is that you don't believe in your data; you don't believe you can do anything like that, and the reason is that you don't know how good — or how bad — your data is. But I give you the benefit of the doubt, and I think it's actually true that the modern instruments and good protocols we have today deliver quite good data for all of us. But if you don't check, how will you ever know? So what we did was run a little poll to gauge the interest in quality control.
We set it up at a proteomics meeting, together with colleagues. We had sixty-eight respondents, which took eight days to collect. The first question was: how would you describe yourself? You see that a lot of people are 'MS researchers', which is the most generic term; you have some management people; you have quite a few bioinformaticians, thirteen actually; and you have some service users and some technicians. And we have one genius, who — I know firsthand — put that in as a joke, and only he thought it was funny. And you know, when you have sixty-eight respondents, one is about one and a half percent. So then: do you already use quality control? Shockingly, fifty-nine percent say no, nothing. That is a shock.

Then: is it easy to obtain software to help you do this? Originally the poll just had yes and no, and then I whimsically added: does such software exist? Oops. Now, interestingly, forty-eight percent picked 'does such software exist?' — but forty-one percent say that yes, we do quality control. So how does that rhyme? It doesn't — but of course you can do quality control without software, which is what a lot of people do. Then: how important is quality control? This is the big kahuna. We asked, for different settings, how important it is for your work: for comparison between runs in your own lab, between projects, for comparison with other labs, and for quality assessment of public data. Interestingly, you see a pattern: this one looks a heck of a lot like this one, and that one looks a heck of a lot like that one — the first and second are the same, and the third and fourth are the same. So people have a similar feeling, and these are not random answers, which is nice to see. The main trend is that people feel it's quite important; only a few people say it's not important. And for public data and other people's data, we think it's more important than for our own data — what happened to 'a scientist is self-critical first'? If you look at the distributions across all four settings, they are essentially the same; you don't need the biostatistics course to figure that out. Then, in other questions: how easy is it for you to visualize QC data, and to analyze QC data? Here you see a shift: it's slightly easier to visualize than to analyze. Analyze means process the numbers; visualize means make a few plots and look at some stuff, and a lot of vendor software lets you look at some of this, and some of the commercial software does as well. So people on average have an average feeling about how easy it is — but it's not easy enough. This should move to here, because then people would use it all the time. Okay, let's have
a brief history of some attempts at quality control in proteomics. I'll start with one that I am a co-author of — and I really don't like it. The reason I don't like it is this table; I can show you the email fights I had with the principal authors, me and a few other bioinformaticians: this is a flawed representation, and it gets huge amounts of citations. This is the HUPO test sample study. They took a bunch of human proteins and put them in a sample — so this is like the accreditation sample; nobody knew what was in there — equimolar, about twenty-ish proteins, and they sent the sample to people and said: analyze it any way you can and give us back the results. So people analyzed it and gave back the results, and now there is this table: how many of these did you get right? These are the different groups that participated, anonymized, and these are the different proteins; a plus means you found it, and everything else means something was wrong. Now you see that a lot of these proteins are scored wrong — look at these guys, total screw-ups, and these guys too. This is not helping. You look at this and you say, oh, everybody does it wrong — but you know what the problem is here? Protein inference. John told you about protein inference, right? If you look at the peptides that people matched, every group matched peptides from all the proteins. So they got the peptides right; what they got wrong is the assignment of the peptide to the exact right accession number. And the problem is that this was done on the NCBI non-redundant database, which is enormously redundant: it contains all known splice variants, it contains every immunoglobulin ever detected, it contains all the x-ray diffraction constructs, which are completely artificial, and it contains all truncated forms ever recorded. So it's very easy to match a peptide to the wrong accession number — and they scored on the accession number. They are not testing mass spectrometry, they are testing protein
inference, which, frankly, we suck at. So this table gives you completely the wrong picture. The only useful thing in it is these entries, where Eric Deutsch, who analyzed this data in detail, found out that there was a trypsinization problem. And that is a very easy quality control that everybody can do: search your data with zero missed cleavages, search it with three missed cleavages, and count how many peptides you get with multiple missed cleavages. If you have many with two missed cleavages, by usual standards your protocol was bad — your trypsinization went wrong. It's a super easy test, you have the data anyway, and the search takes only a few minutes. Anyway, this was one attempt.
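The missed-cleavage check described above is simple enough to sketch. This assumes the standard tryptic rule (cleave after K or R unless followed by P), and the 10% flagging threshold is an invented example, not a published cutoff:

```python
import re

def missed_cleavages(peptide):
    """Count internal tryptic sites (K/R not followed by P) left uncut."""
    # Exclude the C-terminal residue: cleaving there is expected, not missed.
    return len(re.findall(r"[KR](?!P)", peptide[:-1]))

def flag_bad_digestion(peptides, max_fraction=0.1):
    """Flag a run if too many peptides carry 2+ missed cleavages
    (the 10% default is a made-up illustrative threshold)."""
    bad = sum(1 for p in peptides if missed_cleavages(p) >= 2)
    return bad / len(peptides) > max_fraction

print(missed_cleavages("PEPKTIDRK"))  # 2 internal sites left uncut
print(flag_bad_digestion(["PEPTIDEK"] * 8 + ["PEPKTIDRK"] * 2))  # True
```

Running this on the identified peptide list of any search is the few-minutes test mentioned above: if the flag trips, the digestion likely went wrong.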

Another attempt, an abortive one you have probably not heard of: this is something completely different, but I thought it was a step in the right direction. A company made this tool that you could plug into a Thermo instrument, with a hook before and after a scan or a run, and it did a very quick and dirty quality control check. Based on that, it would give a diagnosis report: systems nominal, or systems not nominal. In the latter case it would shut down the instrument — or rather, it would go into an endless loop, and the instrument would forever be waiting for this program to return. That way they could conserve precious sample: if in the middle of the night your instrument went bananas, it would not continue to inject the last milliliter of cerebrospinal fluid you harvested from the poor cancer patient; it would leave the sample there, so that the next morning you could fix the instrument and conserve the precious sample. A very simple, very elegant application of quality control, but they didn't go beyond this simple thing of blocking the machine if it starts doing crazy stuff. So we had to wait for NIST to come into the game, with the CPTAC study. And this is, you know, Mr.
the National Institute of Standards and Technology — they define how long a second lasts, right; these are serious people. They started looking at the workflow and said: how can we measure quality at as many of these places as possible from a single raw file? You need nothing but a single raw file. And they came up with all these metrics, some of which are redundant, and they started calculating them for all the CPTAC runs. So you get stuff like this: peptide identifications — how many peptides? You can see that these guys are not doing as well as most of the other guys, but they are consistent. Chromatography: something is wrong there. And here, the ion source — aha, you see what's going on: these guys with few identifications have a problem with the charge ratios. Everybody else has a lot of 2+ precursors; this is the ratio of 3+ versus 2+, and this is the ratio of 4+ versus 2+, and you see that these ratios drop. So something went wrong in the ionization — that's why they can't identify so much. You have not only seen a problem that is consistent and hampering all their runs, you have also diagnosed the reason. Then there is other stuff, like dynamic sampling — you see that's a bit screwed up here. You see the MS1 features and what they tell you, and that is actually not too bad; and here the MS2 features, and those too are not too bad. But these features are incredibly interesting, and they tell you a lot: the raw file tells you about your experiment. Even trypsinization is included — but unfortunately everybody did that right here. This is cool; this is something everybody should have, all the time. Now let's go into a bit more detail. The problem is that the NIST stuff is hard to run: it's a bunch of Perl scripts that only work if you rotate your computer counterclockwise seven times at midnight and then sacrifice a black cat.
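The charge-state diagnosis above boils down to counting precursor charges per run. Here is a minimal sketch — NIST's actual metric definitions differ in detail, and the numbers are made up:

```python
from collections import Counter

def charge_ratios(precursor_charges):
    """Sketch of a charge-state QC metric: ratios of 3+/2+ and 4+/2+
    precursor counts in one run (illustrative, not the NIST definition)."""
    c = Counter(precursor_charges)
    if not c[2]:
        return None  # no 2+ precursors: ratio undefined, run is suspect anyway
    return c[3] / c[2], c[4] / c[2]

# A healthy tryptic run is dominated by 2+ precursors.
charges = [2] * 80 + [3] * 30 + [4] * 8
print(charge_ratios(charges))  # (0.375, 0.1)
```

If these ratios suddenly drop (or spike) relative to your lab's historical values, ionization is a prime suspect — which is exactly the diagnosis made for the struggling groups above.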
Then there is the QuaMeter implementation, which does everything the NIST tool does, but it does not require the sacrifice of the cat. It does, unfortunately, still require the turning of the computer, because it has a lot of different tools that you have to somehow glue together — but it is applicable to everything, not just Thermo; the NIST stuff was Thermo-only. And it uses standard formats and open software to give you all these things, plus nice plots in R, so you see the same kind of thing we have seen before, only now in R. This was a very good step in the right direction. Meanwhile, Proteome Software, the guys who make Scaffold, implemented this as a paid add-on for Scaffold. The problem is that nobody wants to pay for it — why would you not want to pay for this? It is worth a lot of money. Anyway, thank god, these guys have now thrown it into the open-source domain, so we can take this code, and we are now working with them to make a free tool out of it; hopefully this will soon be in everybody's lab. Another pipeline that does this kind of stuff is the OpenMS pipeline, if you know this kind of stuff; it's published here, these two things, and it also lets you calculate all the NIST metrics. The way it outputs them is a new thing called qcML. That is just some technical nonsense about a format, intended to be transparent and to act as a placeholder for your metrics. What is interesting about qcML: it's an XML format, but you really don't care about that — you won't even notice it; the tool just spits it out, and you can automatically make beautiful reports and PDFs from it, so who cares what it looks like on the inside. But what is really important — and this is something I really pushed for when we were building this — is that it had to have a database equivalent. This is the database schema. Why? Because you want to archive this stuff.
you want to be able to look

at quality control over the past ten years — not just this run, but everything. Why? Because then you can do this. This is the median mass error of an Orbitrap machine: it's very close to zero, which is where you expect it to be, there's an upper fence at 5 ppm, and it has a ninety-five percent confidence interval here, so it's very, very good. And then you get that — and immediately you see something is wrong, right? You want to be able to spot these things. You want to assess what standard performance is, what it looks like, and whether this run is standard performance or not, acceptable or not. The catch is that you then also need to keep track of metadata: you need to know what you have been doing. This is the same type of plot, a median metric over very many runs — this is many years of mass spectrometry you are looking at, and we can do that because we have everything in the database. What you can see is that the protocols, which each have their own color, have different outputs; it really depends on the protocol more than anything else. The way you do your experiments necessarily influences the way your data looks, so you need to keep track of this kind of stuff, otherwise your variance will be very big. For the yellow stuff, it should look like this. And this is a different metric, a much less useful one, because for some of the experiments it's all over the place — unless all of those experiments are complete crap. Now, before you start going through the papers from Ghent thinking they're all bad: this is actually a standard sample that they ran, so it's not a real proteome example, but different standards and different protocols. Wouldn't it be nice to have this kind of stuff, and wouldn't it be nice to see that things are actually very conserved? If that happens, your brain is going to make that click and you're going to say: accreditation? Bring it on! I am capable of doing this; I trust myself.
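The upper-fence idea can be sketched as a simple Tukey-style control chart over archived per-run medians. The multiplier `k` and the toy history are my own choices for illustration, not a published QC standard:

```python
import statistics

def control_limits(history, k=1.5):
    """Tukey-style fences from historical per-run median mass errors (ppm)."""
    q1, _, q3 = statistics.quantiles(history, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_outlier(value, history):
    """Flag a new run whose median mass error falls outside the fences."""
    lo, hi = control_limits(history)
    return not (lo <= value <= hi)

# Years of archived per-run median mass errors, in ppm (toy values).
history = [0.2, -0.1, 0.0, 0.3, -0.2, 0.1, 0.0, 0.2, -0.1, 0.1]
print(is_outlier(0.1, history), is_outlier(6.0, history))  # False True
```

This only works if the archive exists and the history is grouped per protocol — otherwise, as noted above, the protocol-to-protocol differences blow up the variance and the fences become meaningless.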
We need that. And you can also do it on public data. The big problem with public data is that it's extremely heterogeneous: it's done with different search engines, in different ways, with different submission tools — everything looks different. So the way to compare things is to bring them all to a common standard. We built a tool that reprocesses everything to a common standard and lets you look at the data. Incidentally, we spent seven years building this thing — hence the whimsical name ASAP; it stands for Automatic Spectrum Annotation Pipeline, but it also has a different meaning. We spent five out of those seven years compensating for the lack of metadata that people submit. This is the biggest problem with the data: because people are too lazy to invest one day, my guys had to work seven years to get all of that fixed in a reasonable way. To be honest, it is getting a lot better nowadays; people are really taking responsibility. But I emphasize again: keep the metadata — and you should be keeping the metadata anyway, for this kind of stuff, in the local database you should have anyway. You see how these things fit together: if we organize ourselves, we become a serious analytical domain. And you can go further than that. We went into the Thermo log files: your instrument drops log files about the status of every component of the mass spectrometer; it has a lot of measurements on the hardware. So we decided: why not look at the hardware? This is the capillary temperature. This is the set point, what is requested — the software logs that too — and this is the real temperature. You see how this works like a fuzzy algorithm: the thing constantly tries to compensate when the temperature changes, which is what your thermostat would do as well; it's a normal thermostat. But if this were suddenly to go boo-boo-boo-boo-boop, then there's something wrong with your thermostat — or maybe someone is blowing on your needle with
very frosty breath. But you see how this is useful: you can put an upper fence and a lower fence on it, and you can define that as soon as this thing goes beyond, say, two times that, I'm going to sound the alarm. And speaking of sounding the alarm: this is the power consumption, in watts, of turbo pump four. You see this yellow line? This yellow line says: I heard a strange noise from the instrument. It's really true — this is actual data from our collaborators — and in this tool that we built, which is freely available, you can annotate your data, so you see here there is an event, it's yellow, and you can treat it: there was a strange event. You know what this red line is? The turbopump ate itself. It literally sucked a bunch of components into itself, and I promise you, if you look at the power consumption: down — the instrument was dead. But you know, the Thermo engineers don't use this data. When we asked them, do you guys actually look at the log files? — only when a problem has occurred do we look at the latest log file and see, oh yes, the power consumption is low. But imagine: you want to do professional proteomics, so monitor your instruments. When you see this go up, shut it down.
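The "sound the alarm when it goes beyond twice the normal band" idea might look like this; the window size, threshold, and wattage values are invented for illustration:

```python
import statistics

def alarms(readings, window=5, k=2.0):
    """Flag readings that deviate more than k standard deviations from the
    rolling baseline of the previous `window` readings (thresholds invented)."""
    flagged = []
    for i in range(window, len(readings)):
        base = readings[i - window:i]
        mu, sd = statistics.mean(base), statistics.pstdev(base)
        if sd and abs(readings[i] - mu) > k * sd:
            flagged.append(i)
    return flagged

# Toy turbopump wattage log: stable baseline, a noisy spike, then a dropout.
watts = [30, 31, 30, 29, 30, 31, 30, 55, 30, 5]
print(alarms(watts))  # flags the spike at index 7 and the dropout at index 9
```

Parsing the actual instrument log files into such a series is the hard part; once the values are in a database, the alarm itself is a few lines of process control, exactly as described above.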

Call the service guy and say: fix it before it breaks. Or you can see when your column starts to degrade, you can see when maybe this sensor is starting to degrade. Why don't we do that kind of stuff? It's process control; it's part of engineering everywhere in the world. Again: if we want to be serious, we have to do this kind of stuff, and it's easy — even I can do it. Also from Epic Fail — although I don't think this is an epic fail, I think it's brilliant: we may not be analytical chemistry yet, but there's no harm in trying, even if it involves a fail. Okay, last bit: software — and a bit of a stupid joke about software. Who of you has ever used freely available bioinformatics software? Okay. And how many of you are happy with the free software that you used? That's less than half. You see, this is the problem: if the software gives you the option to hang, crash, or start flashing, it's not good, right? This is bad. How many people use commercial software? Everybody does — your instrument ships with it — but I mean really use it, like Mascot or something like that. And the rest don't use software? You either use free software or you use commercial software, people; you cannot have something in between. How many people use the software that ships with the machine? You may not think of that as commercial — okay, probably everybody else. And how many people are happy with the commercial or vendor solution? Right: even fewer than are happy with the free software, amazingly. But this is a problem, a really big problem, in proteomics informatics — and in a few other fields, but mostly in proteomics informatics as far as I can see: a lot of the solutions we develop are not usable by other people. I speak as a bioinformatician; I try not to do this, but even we sometimes make stuff that could be better. And to be honest, I try to use tools from
other groups as well, and this is sometimes extremely difficult, even for bioinformaticians. So I want to make a few statements. Bioinformatics is a real job, and it should be treated as such — not as 'the guy who sits there is great to solve all my IT problems'. You laugh because it's true; that's how people look at this. It's a sufficiently complex field that it requires a separate job title — bioinformatician — and it requires separate specialization. There are many courses around for that now, although most of them are still very heavily genomics-focused, and it's only getting worse with the sequencing. This does NOT, in capitals, mean that you can treat bioinformatics as a black box. You have to make sure you know what's going on, which is why we made these bloody tutorials in the first place; if you look at the paper in which we published the tutorials, it says it's about opening up the black box of proteomics informatics. If you go to somebody and say, here's my raw file, I want an Excel sheet with the list of protein names, thank you very much — you are doing it wrong. You need to understand what happens to your data. Don't be the PhD student who stands there, with me on the jury, and goes: 'and then we identified seven thousand proteins' — and cannot say how. It happens; it happens a lot. Commoditization: this means that the stuff becomes easy enough to use by yourself, with a mouse, on a screen, and commoditization is an ongoing process that we should learn to take advantage of. There is a lot of software out there that can help you; the problem is that not all of that software is very good, so we have to talk about that. Next point: typical analyses are not done in an afternoon, and the bioinformatician is not a magician. Instead, bioinformatics has become a substantial part of any project, and I would dare say — we heard this in the introduction as well — that our projects are becoming so complex that bioinformatics is now an integral part of your study, and in fact a lot may depend on the
bioinformatics. The bioinformatics — and now I have to duck slightly — might be more important than the experimental part. For some of you, the experimental part is sufficiently standardized, in terms of how you carry it out, that it's difficult to mess up, but the bioinformatics might be very tricky. If you do not consider that, and if you assume that somebody down the hall is going to fix it for you in an afternoon after you have done all your analyses, that is a very bad way of starting a modern omics project. You should think about these things up front. For that reason — and I'm not the only idiot thinking this — funders increasingly require data analysis and management plans. How many of you know funders in your home country that ask

you for a section specifically on data management and analysis? I know it exists in the UK, I know it's in Belgium now; not sure about Germany — no, right. It's coming. It will take a few years, but it's coming, because people have noticed that a lot of projects fail because of this: people acquire enormous amounts of data, and then it goes — what now? This is meant to make you think out whatever you want to do in your bioinformatics before you do the experiments, and to budget it in your project, so that it can actually be done properly. Finally — and this is extremely important — the bioinformatician is not there to fix your printer. The mass spectrometrist is not there to fix your pipette; it's the same thing. So stop treating these people like IT helpdesk people, because you know what happens: they get really frustrated, and they go to industry, where they make three times more money — and I don't mean biotech industry, I mean everybody who needs a programmer these days, which is about anybody. I lose PhD students to Silicon Valley all the time, while they're doing their PhD; these guys go, and if they're lucky, after three years, instead of a PhD they are multimillionaires. That is the competition. So treat these people reasonably — I don't say you have to treat them like gods, but treat them reasonably, because otherwise you are chasing them away. Now, we have talked about helping the poor bioinformatician; let's talk about what the bioinformatician does wrong. The commoditization of bioinformatics tools is haphazard at best. Why? Bioinformatics is too often an afterthought in the project — you know it's true — and it shows in the results of the bioinformatics efforts, because they are hacked together; they have to be done quickly, on no money. Bioinformatics is actually considered irrelevant by many experimentalists, because their focus is on getting the data — there is nothing wrong with that — but it does impact the ability to do a
decent analysis if you don't care about what happens next.

In rare but highly unfortunate situations, bioinformatics solutions are even considered competitive: people say, "we can do this amazing thing, but I'm not going to show you how." Are there any metal fans in the audience? Then you know the double-tapping technique on the guitar, invented by one of the greatest metal guitarists of all time, Eddie Van Halen. He invented it, and he made it sound awesome. But you know what he did when he performed live? He turned his back on the audience (it's really true) so that nobody could see his secret technique. Don't be a diva; don't be an Eddie Van Halen. Imagine somebody comes to you and says, "I have this protocol that isolates every phosphopeptide in the sample in a single shot, but I won't tell you how." See, that doesn't make any sense, and bioinformatics is the same thing: it's part of your experimental protocol.

Now, the development of real, usable solutions for you is actually counter-incentivized in bioinformatics, and I will prove it to you. Companies provide some good solutions, but the problem is that the cutting-edge stuff tends to lag a few years before it really makes it into the commercial packages, and when they do have it, you usually have to pay extra to get the update. So it takes a while before innovations spread through the field; free software buffers that. Meanwhile, many groups are constantly reinventing the bioinformatics wheel. Or rather, they first figure out that there is a substance called rubber, then they discover that you can vulcanize it, then some clever person says, "I can build an inner tube", and maybe after that they get to the wheel. How many search engines are being published? Do we need all of them? Are any of them really better than the others? I would say we need four or five: you need some healthy competition and you need some innovative ones, but do we need the same old same old? So why
are we doing this? Thank you. This is my beautiful hand-drawn picture of the cost-benefit curves that matter when you develop bioinformatics tools. First, the bioinformatician's cost-benefit curve. This is how much time it takes to get a paper into Bioinformatics with a hacked-together Perl script, or R code, or whatever: you build a cool algorithm that does a cool job; it's easy, and you get a big benefit. This is the amount of extra effort you have to invest to get some extra citations for said paper, and you know what that extra effort means? It means that another bioinformatician can run your tool. How often do you think that happens? I would say people go that far in about fifty percent of cases; the cost-benefit curve levels off after that.

Now we look at the user. The user doesn't care at all about a
bioinformatics solution until it actually works, and even then the benefit is minimal, because the problem is that it doesn't work all the time. If it works all the time, it gets a little bit better, and when it works all the time on all the cases you throw at it, then it becomes really useful to you. And this is not to scale: to get a piece of software developed to that point, you're looking at a multi-year effort. Remember my database that took me seven years to publish? That is not atypical. I typically leave about three years between making a tool available for free to everyone, including the source code, and publishing a paper on it, because you want people to use it, you want them to find the bugs, you want to fix the bugs, and you don't publish until it's finished. You wouldn't publish a protocol that is half-baked, right? Do the same thing with software: you want people who read the paper to be able to download the software and use it. That's important. But the problem is: where is my incentive? I could simply have published it in Bioinformatics and been done with it. My scientific advisory board looks at this and says, "you're the biggest idiot we've ever seen; why are you doing this?" Okay, people think I'm a nice guy, but that really is the big problem. The way to fix it is to incentivize this behaviour for the bioinformatician, because right now they have no incentive. We are working very hard to make it worthwhile for people to build better tools, and you will benefit as a result. You don't have to do anything; just sit back and hope that we are successful. But that's a different story that I won't go into here.

To help you, we recently published some guidelines in JPR, with this amusing fake picture that I Photoshopped myself, about managing expectations: a reality check for bioinformatics papers. We came up with three different paper formats. In JPR, if you now want to publish a bioinformatics research article or application note, it has to work. You
have to say when it does not work; you have to say what you need to run it; you have to provide documentation for end users; you have to provide documentation for developers; you have to provide sample data so that it will run the first time; you have to provide benchmark data so that other people can compare their performance against yours in the future; you have to state the availability; you have to provide a license; and you have to state the system requirements. This was shocking: this does not exist anywhere else, and it was shocking that it did not exist. If you don't want to do all of that, we call it a brief communication. The idea is that when a proteomics person reads something like that in a proteomics journal and sees "brief communication", they say, "I don't know; it could be cool, but I'm not going to go to the URL, because this is probably not going to work for me." And that's the whole point. But when you see a research article or an application note, you can say, "I'm going to download this stuff and it is hopefully going to work for me", and the review is super strict on exactly that. So we hope to start by doing this and then add some positive incentives as well. Anyway, in conclusion: if the president can embrace coding, then so can you. Thanks.