Webinar: A Practical Guide to NCBI BLAST

Good afternoon, everyone! This is Peter Cooper from the NCBI. We’ve got two webinars today. One is A Practical Guide to NCBI BLAST; you’re seeing sort of a cover slide for that there. The other one begins about an hour and 15 minutes later, and that’s a webinar on human variation resources and medical genetics resources. These were originally requested by Des Moines University, so I just wanted to give a shout-out to them. They should be listening in the classroom out there. We’ll record both webinars. They will be available on our YouTube channel on the webinars playlist. It’ll take a couple of weeks for them to get up there. Materials for these webinars are on the FTP site, and there’s a compressed link that will take you there. That’s gonna have the slides in there as well as the demos that I’m gonna do after I give the slideshow. This is a shortened version of a larger workshop that we give that lasts a couple of hours. So, if you have any questions about this, my email address and contact information are on the front.

What we’re gonna do today is talk a little bit about the basics of using BLAST: the reasons people use it, some of the statistical information that BLAST gives you, the scoring systems, and what the search programs are. We’ll also talk about some other alignment services that are not BLAST that we offer on our website. And then I’m gonna go over to the web browser and do some live searches.

BLAST is probably the most widely used sequence similarity search tool in the world. What it does is find high-scoring local alignments between two sequences. They can be protein sequences or DNA sequences. BLAST includes a model of score distributions for random local alignments, and because of that, it can provide some statistical information about how different your alignments are from chance. So BLAST tells you, then, about non-chance similarities between biological sequences. That’s interesting because if the similarities are not due to chance, then they must be due to
something else. There are a couple of things that could be. The most interesting one, from the point of view of the original purpose of BLAST, is homology: that these two sequences are descended from a common ancestor. In most cases these days, people are using BLAST for simple identification, so that includes things like annotating genomes or seeing what kinds of problems there are with the alignments of things through a genome. All BLAST searches begin with a sequence, either a protein or a nucleotide sequence. That can be either one that you determined or one from the database.

Let’s talk for a minute about BLAST statistics. The most important statistic you get back from BLAST is something called the expect value, or the expectation value. It ranges from essentially zero up to the size of the database. It’s the number of alignments that you would expect by chance with a particular score or a greater score. So, for example, if we have a five-residue sequence like this one, ELVIS, that has an expect value of 48,000, we would expect to see 48,000 hits that good or better in a database search by chance. We don’t know anything about that particular hit. The one below it, though, has an e-value of, that’s read as seven times 10 to the minus 18. That means I wouldn’t expect to see any hits that are that good by chance. That tells you that your alignments are different from chance, and if they’re different from chance, they’re due to something else. A real important point from this slide is that the e-value depends directly on the size of the search space, the size of the database. You wanna search the smallest database that’s likely to contain the sequence of interest. We’ll come back to that point when we talk about limiting your database.

BLAST uses two schools of thought in terms of scoring things. The one that’s in the classic kind of BLAST scoring system is called the Position Independent Scoring System. That means that the same substitution in
your alignment gets the same score in any position in that alignment. The model that that really assumes is that all positions in a sequence are equally likely to change. That’s not a realistic model for the way proteins or DNA sequences evolve. Nevertheless, this is used by ordinary BLAST. Blastp uses BLOSUM62, and it includes a concept of conservative substitutions. Nucleotide searches are less sophisticated. They use an identity matrix.

The other kind of BLAST scoring, which we’re not really gonna have time to talk about today, but we will see the results of, uses Position Dependent Scoring Matrices. In those kinds of scoring matrices, the substitution score depends on the position in the protein or in the alignment. This means, of course, that some positions are more important, less likely to change than others. And that’s a realistic model for the way proteins and other biological sequences evolve. Programs that do this are PSI-BLAST and DELTA-BLAST. Those search a database with a position specific score matrix, or PSSM. Reverse PSI-BLAST searches a database of PSSMs and identifies conserved domains, and that’s the search that we’re gonna see today that uses these position specific scoring systems.
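To make the scoring and statistics discussion a little more concrete, here is a minimal Python sketch, not BLAST's actual code. It scores an ungapped alignment with a position independent matrix and then converts a score to an expect value using the standard Karlin-Altschul form E = K * m * n * exp(-lambda * S). The table holds only the BLOSUM62 entries for the residues in the ELVIS example. The function names are hypothetical, and the default `lam` and `k` are placeholder values chosen for illustration; real BLAST also applies length corrections, so this will not reproduce the numbers on the slide.

```python
import math

# BLOSUM62 scores for the five residue types in the ELVIS example only.
# The full matrix covers every amino acid; this subset is for illustration.
BLOSUM62_SUBSET = {
    ("E", "E"): 5, ("L", "L"): 4, ("V", "V"): 4, ("I", "I"): 4, ("S", "S"): 4,
    ("E", "L"): -3, ("E", "V"): -2, ("E", "I"): -3, ("E", "S"): 0,
    ("L", "V"): 1, ("L", "I"): 2, ("L", "S"): -2,
    ("V", "I"): 3, ("V", "S"): -2, ("I", "S"): -2,
}


def pair_score(a, b):
    """Substitution score for one aligned pair; the matrix is symmetric."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def ungapped_score(query, subject):
    """Position independent score: each pair scores the same wherever it falls."""
    if len(query) != len(subject):
        raise ValueError("ungapped alignment needs equal-length sequences")
    return sum(pair_score(q, s) for q, s in zip(query, subject))


def expect_value(score, query_len, db_len, lam=0.267, k=0.041):
    """Karlin-Altschul style estimate; lam and k are illustrative assumptions."""
    return k * query_len * db_len * math.exp(-lam * score)


if __name__ == "__main__":
    s = ungapped_score("ELVIS", "ELIVS")  # V/I and I/V are conservative swaps
    print("score:", s)
    # Same score, database twice as large: the expect value doubles.
    print(expect_value(s, 5, 1_000_000), expect_value(s, 5, 2_000_000))
```

The last two printed numbers differ by a factor of two, which is the point made above about searching the smallest database that is likely to contain the sequence of interest.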

All BLAST programs also include some kind of a penalty that allows them to incorporate gaps in the alignment.

Okay, so let’s talk for a minute about what these BLAST search programs are, what they’re called. The nucleotide search programs that you’re gonna see on the web are blastn and megablast. Blastn is the traditional BLAST algorithm. It’s the most sensitive kind of nucleotide search. Megablast, by the way, is the default algorithm, and this is the best program for simple identification: things like species, annotation. You just have to remember that that’s the default algorithm on the BLAST pages. When you do searches, you may sometimes need to change it to the more sensitive blastn. There’s sort of an intermediate algorithm that’s not used very much called discontiguous megablast; it’s also more sensitive than megablast.

And then, here are the protein search programs, and today we’re really gonna only focus on the position independent scoring ones. There’s blastp, which is a protein-protein search and alignment program. And then there are these translating searches. These are useful for unannotated protein coding regions, and we’ll do one of these today. There’s blastx, which translates your query sequence and searches a protein database; tblastn, which translates the database and searches it with a protein; and tblastx, which translates both the query and the database. But all of these things are searches at the protein level.

When you search BLAST at NCBI, how do you get here?
Certainly, you can get there from our homepage; there’s a link right there on the NCBI home page that links to BLAST. The most common way to do it, for most people, is simply to type NCBI BLAST in Google, and that will take you right to the BLAST homepage. This is the current look of the BLAST homepage and, by the way, this is going to be changing fairly soon. The furniture is going to be rearranged, but there’s no real difference in function. Basically, this is divided into several sections that are useful in terms of what they do. One is a way to get access to assembled genomes, and I’ll show you that in more detail in a few minutes: you can pick any genome that you want from here and find the most complete set of data. Then, the center part of the page is basically the core of the BLAST searches, and we’re gonna be focused on that today, what we’re calling basic BLAST. Those links take you to pages that are preloaded to do those kinds of searches. Then, finally, at the bottom, there’s sort of a miscellaneous set of search programs or alignment tools that are related to BLAST, but are not necessarily BLAST. And we’ll visit some of those today.

When you’re using BLAST, you need to have something to search with, and that’s your query sequence. I just want to point out a few aspects of this that still confuse some people. The BLAST search programs on the web will take FASTA formatted sequences, like those shown in the upper left of this slide. They will also take accession numbers, NCBI accession numbers; we’ll pull those from a database and do a search with them. You can also run BLAST directly from the Entrez pages, nucleotide and protein. We’ll do a couple of examples like that today. Another point to take away from this slide is you can use multiple queries in a single search. Most people know that, but occasionally we’ll run into somebody that thinks you can only do one sequence at a time. Each sequence will be searched separately as a BLAST search if you do that. I want to
talk for a minute about something that’s a little bit odd if you think about it. One thing you can do at NCBI, which is useful, is to compare your own sequences without doing a database search at all. I just want to point out the two options for doing that at this point in the talk. One is BLAST 2 sequences. On any BLAST page at NCBI, when you go to the BLAST form, there’s a little checkbox that says align two or more sequences, and if you check that box, another text field will open up in the form, and you can enter sequence in there. You can enter many sequences in there. So you can do your own little database search against your own customized database, if you want. You can also access this under that specialized BLAST section at the bottom of the page.

Something we know from talking to people at the help desk here is that many times when people are doing BLAST 2 sequences, what they really want to do is a global sequence alignment. We do have that available in the specialized BLAST section. It’s not BLAST at all; this is an algorithm called Needleman-Wunsch, and it allows you to compare the entire lengths of two sequences. This is a global alignment tool. It doesn’t provide any meaningful statistics about whether this is a chance alignment or not. It will just align anything to anything, and it will include all the residues of your particular sequence. If you want to do global alignment, if you’re interested in knowing what the percent identity is between two proteins, this is the tool that you would have to use to do that.

Okay, so those were searches independent of our databases. Let’s talk a little bit about the BLAST databases.
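Before moving on to databases, the global alignment idea mentioned above can be sketched in a few lines of Python. This is a textbook Needleman-Wunsch with a simple match/mismatch score and a linear gap penalty, which are assumptions made for brevity; it is not the scoring used by the NCBI tool, and the function names are hypothetical. Percent identity here is identical columns divided by alignment length, which is one of several conventions in use.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align two residues
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner so every residue is included.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a, aligned_b):
    """Identical, non-gap columns as a percentage of the alignment length."""
    same = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * same / len(aligned_a)


if __name__ == "__main__":
    s, top, bottom = needleman_wunsch("ACGT", "AGT")
    print(s, top, bottom, percent_identity(top, bottom))
```

Note that, unlike a BLAST local alignment, nothing is trimmed from the ends and no expect value comes out of it, which matches the caveat above about unrelated sequences still producing an alignment.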

And this is sort of a complicated part of our system that you need to sort of understand: what’s goin’ on there. Some of it’s a bit chaotic. The protein databases, which you’re searching using either blastp or blastx, are fairly straightforward. You do have sort of a comprehensive database called nr. This is a non-redundant database. It contains the majority of the protein sequences that people are interested in at NCBI. It also has useful subsets available on the database pull-down list: RefSeq, Swiss-Prot, PDB. Just keep in mind there are some sequences that are not part of the protein nr. US, European, and Asian patent sequences that we get are not in there; they’re in a separate database. Proteins that are coming from metagenomic samples, sort of ecological genome things, those are not in there. And also the proteins from Next-Gen assemblies, so these are transcriptome shotgun assembly sequences. This is a growing set of data, in particular on the nucleotide side, but there are some TSA proteins as well; those are not part of nr.

This is what the nucleotide search page database pull-down list looks like, and it’s quite a bit more complex. You can certainly, any time you go to a BLAST page, click on one of those question marks and get information about what sequences are included. This is a slide that has more details about that. There are a couple of things to keep in mind about the nucleotide database that make it different from the protein one. The main one is just the default database that we call nr. I like to refer to it as nt, which is what we call it on the FTP site. This is not a comprehensive database. It contains the traditional GenBank sequences, things that are not bulk sequences, and NCBI RefSeq RNA sequences. That’s actually a very small set of data compared to everything else we have at NCBI. It’s a useful set, but it’s a smaller set. Some subsets of that which are cleaner are the RefSeq RNA database; there’s also a 16S RNA database, as well, that you can
search. So, what’s not in nr is the majority of the nucleotide data. That includes all the bulk sequences; the RefSeq Genomic Sequences, which include our chromosome records and our various sizes of assemblies there; patents are not in there; and some other large sets of data, including Whole Genome Shotgun sequences, Transcriptome Shotgun Assemblies, and SRA data, which is really the largest set of data at NCBI. It’s so large that there’s no way to actually search it as a single entity. We’ll talk a little bit about that when we do a demo.

Another set of databases that are really separate BLAST pages, if you will, are available through that device at the top of the BLAST homepage that lets you search Genome/Assembly Databases. Basically, this is a way of getting you a BLAST page that has the most completely assembled genome for that particular species, so you can type an organism name in there and then you can link directly to it, and that will take you to a BLAST page set up to run that search.

Now, I mentioned this earlier in the talk, that the most important thing you can do when you’re using BLAST is to search the smallest database that’s likely to contain the sequence of interest, and that’s because the database gets larger and larger. As this gets larger and larger, it gets harder and harder to discern the signal from the noise that’s in there. And that has to do, probably, with the way the expect value scales with the size of the database. There are some useful things that you can do here. You can use one of the organism limits; you can type the name of an organism or group of organisms. You can even exclude groups of organisms that you don’t want to see. So, here’s an example: getting all the bacterial sequences without the order Enterobacteriales in there. You can get rid of things like model sequences or uncultured sequences if you’re working with bacteria. You can even specify things like a molecular weight range. So, any Entrez query that works in the protein and
nucleotide database will also work on this page.

Okay, a couple of things to finish up here, and then we’ll pause and see if there are any questions. One of the things that, as a person who manages BLAST help or sits on the BLAST help desk here, is very important to me and to all the other people who sort of support BLAST is that you understand that there is an identifier for your search, and that’s called the request identifier. If you look at your BLAST results, it’s at the top. It says RID, and it might not be clear what that stands for. But that identifier is the unique identifier for your results, so if you have a problem with BLAST, you can write to us and give us that identifier. If you click on that link, it will give you a URL like the one in the middle of this slide, and you can just paste that in a web browser and get your results back, or you can send it to somebody you know that you want to show your results to, or you can send it to us. We will keep your results on the servers at NCBI for about 36 hours. You can see them through the recent results link that’s on the BLAST page. They will also show up in your My NCBI. It doesn’t make them last any longer; they still last for 36 hours. So, keep this in mind, and we’ll show you managing that a little bit later on today when we do a search. So, be sure to send us an RID if you have a question about a particular search. We can look up your results and see exactly what the settings were, and we can figure out if there’s a bug or if there’s something we can help you with to make the search work more efficiently.

Another thing I want to mention is that BLAST offers a number of download options. This is actually an older screenshot. We’ve added a couple more structured formats here. Just be aware that they’re here. These are the kinds of things that you’ll want if you need to save BLAST results, or to save huge sets of results to try to parse out information from them, because these are structured formats that you can parse with a script. Or you can use some of the utilities that come with BLAST to re-display them. And the hit table is particularly popular, even with people who don’t script, because that can be loaded into Excel, the .CSV version of it in particular.

Okay, so let me talk about some of the specialized BLAST services, then we’ll stop for questions. Bonnie’s lookin’ at me ’cause I said we’d stop (laughter) for questions next. And these are ones we’re gonna demonstrate in a few minutes. Primer-BLAST is our primer designer and specificity checker. It takes advantage of free software, Primer3, to design the primers; it uses annotations on our sequences if you want to design primers that do things like span exon boundaries; and it uses BLAST to make sure your primers are specific. MOLE-BLAST is a tool that’s very specialized, and I won’t
demonstrate that today. We have done webinars on this particular topic. This is a way of clustering sequences and finding the taxonomic placement of things like 16S sequences. We use it internally in our taxonomy group to help identify things.

Two special protein services that I will demonstrate today start with COBALT, which is our multiple alignment tool. COBALT stands for Constraint-Based Alignment Tool. It does a protein global multiple alignment. And just like Needleman-Wunsch, a global alignment tool like this requires that you input sequences that you know are related to each other, ’cause otherwise you’ll just get a mess. The beauty part about COBALT is it lets you take the output from a BLAST search and feed it into COBALT, so you already know those sequences are related, and you can run it as an extension to your BLAST search.

And then I’ll give you a quick demo of a new tool that’s something we’re sort of trying out. That’s a rapid protein identification tool. It’s called SmartBLAST. It uses a very rapid approach to searching that uses the k-mer content of sequences to find matches. It’s very quick, and it might replace some of our internal mechanisms for neighboring things like proteins, to give you a live search, say, if you wanted to find a related protein on the web. It also uses COBALT to produce a multiple alignment and a protein tree.

Okay. So, now I’ll just mention that there is a help link on the BLAST pages, and this has lots of good information, including links to the handbook chapters, the help documents, and the YouTube channel, which has a lot of BLAST tutorials. In fact, this will go on that: in addition to the webinars playlist, it’ll go on the BLAST tutorials playlist on our YouTube channel. So, Bonnie says we have three questions.

– [Voiceover] The first question is: what are the limits for multiple sequences in the query data set?
– [Voiceover] Well, that’s a question that we get often at BLAST help in particular. The way this works is you’re allowed one hour of CPU time, so that’s processing time. It’s not real time. Now, that could be just a few minutes of real time, depending upon how many processors your search runs on. So, there is no fixed limit based on number of sequences or number of residues, but you will run up against it fairly quickly if you use large numbers of sequences that have a lot of hits in the database. I’m afraid I can’t give you a concrete answer. If people write in, with proteins I would say no more than 100 at a time. For nucleotide sequences, it depends on the length, because that can vary a lot. But if you’re trying to do something like search with chromosome one against the nt database (Bonnie’s laughing, but this happens all the time), don’t do that. It’s not going to happen for you.

– [Voiceover] Because chromosome one is how long, Peter?

– [Voiceover] I don’t remember, it’s big.

– [Voiceover] Okay (laughter). The largest human chromosome, so.

– [Voiceover] There are lots of things that you can do to sort of ameliorate that problem, but if you have a need to BLAST hundreds of thousands of sequences, you might need to think about some other options. I can point some of those out to you, and they’re available on the help desk and in the developer options. Okay, do we have another one?

– [Voiceover] The question was, is the protein reference database not included in nr? And I wanted to make sure that this person meant the BLAST protein nr database, and I didn’t get a clarification.

– [Voiceover] Well, the answer’s really the same on–

– [Voiceover] Okay.

– [Voiceover] Both.

– [Voiceover] Okay.

– [Voiceover] The RefSeq, well, as long as we’re talking, let’s address them separately. On the protein side, RefSeq proteins are included in nr. That’s easy. On the nucleotide side, the messenger RNA, the transcript sequences, are included in the nt/nr database, the nucleotide default database. The larger reference sequences, the assemblies, like chromosomes and contigs and things like that, those genomic sequences are not included in nt/nr. And there was a third question?
– [Voiceover] The person says that the BLAST results are missing right away after they did the BLAST, and they can’t find them.

– [Voiceover] Not sure I understand the question.

– [Voiceover] I know, I didn’t either, completely. But I hoped that you on the BLAST help might have seen that before.

– [Voiceover] Yeah, no, I don’t know what that person means by they can’t find them.

– [Voiceover] Okay.

– [Voiceover] Maybe you can rephrase that and we’ll come back to it later. I’d like to stop here and go on to do a few live demos. What I wanted to do is a few searches. We’ve got, actually, a document on the FTP site that goes over what we’re planning to do in the live searches. I’m gonna do a couple of things with a mammalian protein called creatine kinase B, the brain-type kinase. We’re gonna do some BLAST searches with that. Then we’re gonna do a translating search against a fish TSA database, to find the corresponding nucleotide sequence for a protein. We’ll use a different protein for that; we’ll use glycine dehydrogenase. Then, actually, I will give a quick demo of SmartBLAST, using an open reading frame that I got from that fish sequence. And then we’re gonna do two other searches. One is to show you some things about the nucleotide system by searching the human genome with a transcript from Macaca fascicularis, and then we’re gonna design some primers using Primer-BLAST. There’s another example in here, using SRA, but we won’t have time to do that one today, I’m pretty sure, ’cause I wanna spend some time on those first several.

Okay. So, what I wanna do is we’re gonna start, actually, not in BLAST, but in the protein database, ’cause I’m gonna set some searches up for you. Notice that I’ve mentioned those BLAST RIDs, and I have those in this document, and they’re stable, for a while, anyway. We have the ability to preserve these at NCBI. At some point they will go stale because the underlying things change and they
don’t work anymore. But I can use those to retrieve my results and save us some time, in particular for the first search result setup, and I’ll show you by just retrieving our results why it’s important to limit your database.

I’m gonna go over here to the NCBI homepage and I’m gonna change my database to protein. I’m just gonna retrieve the sequence that I happen to know the accession number for. This is a creatine kinase from a human. I know the accession number; this is a reference sequence accession number. I’m gonna retrieve a human reference sequence. My goal, ultimately, with this search is to try to find a collective set of sequences to do multiple sequence alignment with, and I’m gonna try to collect mammalian creatine kinases. So, notice that for any protein sequence like this I can click the Run BLAST link here. And, really, all it did for me was load the BLAST page with the accession number in the query box there. And I’ll talk about some of the settings on the page. I normally have slides for that, but I thought I’d just do it live because I think we can do things a little bit more expediently that way.

One of the first things you need to do when you come here is to figure out what database you’re searching. And here’s the pull-down list. Now, I’m gonna leave it set to nr for a moment. I’m actually not gonna run that search for you; I’m gonna retrieve the results, because that’s gonna take a while. And this is a good example just to illustrate for you the problem that you’re gonna run into, now, with the size of the protein database being so large and so heavily weighted in sort of two areas. One of those areas is vertebrate proteins, and the other area that it’s heavily weighted in is bacterial proteins. And that has to do with the efforts to sequence and annotate things. So, we’ll come back to limiting this in a few minutes.

There are also some settings below this fold, down here, called Algorithm parameters, so I wanna just show you that briefly, because there are some things down there that I think you may need to change sometimes. One of the important things here is that there are two parameters that govern your output. One of them is the maximum target sequences. With that set to 100, no matter what else I do, BLAST will not show me more than 100 sequences. It’s a little bit worse than that, though. What it means is that BLAST will not collect more than 100 sequences. BLAST is a two-stage algorithm, so if you have this set too low, you can wind up missing some important things. So, you can increase this. And, in fact, for the search that I’m gonna show you, I’ve set it to 5,000. And, in fact, that wasn’t enough, as you’ll see in a minute.

You also will probably want to adjust your expect threshold, ordinarily. Notice that it’s set to 10. That means that for the worst score that I’m gonna show you, I would expect to see 10 hits that are that good or better by chance. That’s not something that’s terribly interesting to you. So, you can set this to some other value. There are lots of good arbitrary values to set here. One that’s quite common for protein searches is this one, 10 to the minus six, so I think that’s a useful one to use there. You could set it to one times 10 to the minus three. But that means that at least there’s some kind of, you know, possibility that that’s not due to chance.

The other thing that I want to point out here is a recent change in BLAST for protein. One of the shortcuts that BLAST uses is that it doesn’t find or try to extend every match; it finds the matches of a certain size and then starts to extend them. The default once was three for protein searches. It’s now
six. That makes the searches faster, but just be aware, if you come here and you find that you don’t get exactly the same results that you got, say, last year at this time, it might be because of the word size. It’s gonna affect sort of marginal hits in many cases. Those are the things I wanted to show you here for the protein pages.

And I could run this. It may take a while to run. So, what I’m gonna do is go back over here and retrieve my RID. I’ve got one here that I ran that has this RID. There’s a complete URL there. I can also just show you that I can get this from my recent results, so I’ll just copy that. I’ll go over here to this link that says recent results, and that’s available on any of the BLAST pages. And I’ll paste this in here. Now, I actually did do some filtering on this. I eliminated some of the model sequences from the search, and I’ll come back to that in a minute when we go back and show you with this applied. When we resubmit it, we can see all those settings.

The main point in showing you this is that, first of all, we did run a conserved domain search on this. It has the phosphagen kinase conserved domain, the creatine kinase one, so that’s what I would expect. If I didn’t know what this protein was, it would give me some ideas of what the function of this protein is. I’ve maxed out my display here for the graphical overview. It holds 100. I can change that if I want to. And then, down here, I’ve got what we call the BLAST descriptions. These are the hits, essentially. And they’re sorted for me by E value. Notice I have all these E values of zero. They’re not really zero; they’re just a very small number. And I’m trying to reach my cutoff. I think when I ran the search I probably left it at 10. I might need to reformat this, because I don’t have everything, so let’s make sure that I get all my results back. What I can do here is go back to the formatting options; we’ll use this more than once today. It’s still set to 100 descriptions, but notice that
I can get up to 5,000, which is what I originally requested. And so, you can see what happens very quickly is that I get a tremendous number of hits that are from all kinds of lifeforms. I should be able to take this all the way back to the bacteria, because it has the conserved domain. So, here, I’m back into some things that are insects and have arginine kinases in them. But this is an overwhelming amount of output, and the main point that I wanted to make to you with this search is that there’s no reason for you to do this, because you probably don’t need all this. You probably want to do something much simpler and restrict to a particular set of organisms. So, this just goes on and on and on: 5,000 hits. And by the time I get to the bottom of my descriptions, I still haven’t reached my E value cutoff. And that means that I’m missing significant hits. Maybe that significant hit is one that I’m interested in. The main take-home from this is to make sure you’re limiting your database to something that you’re interested in.

And, actually, this one I already did limit a little bit. Let me show you what I did. If I have a result like this and I want to resubmit it, I can click this link that says edit and resubmit. This will take me back to the page that shows me exactly what I did. So, I did a search with creatine kinase. I did this yesterday. I searched nr. And I did get rid of some kinds of sequences. I got rid of models, and I got rid of uncultured environmental sample sequences. I did ask for 5,000 hits. I didn’t restrict my E value, and I never reached it. I had an E value set here for 10, but I never reached that at all. I never even got to an E value that wasn’t significant, because even with 5,000 hits, I didn’t find all those significant matches. This, down here, was on by default. This is a filter for the kinds of sequences that violate BLAST statistics, which are low complexity regions.

So, if I wanna fix this to make it a little bit more manageable as a search, probably the best thing that I can do is to run it against a particular group of organisms. If I’m interested in making a little phylogenetic tree or a protein tree, then I would want to collect sequences for my group of organisms. I’m gonna collect mammals, which is a much smaller set of data. And let’s (sighs) look for a better controlled set of data, and let’s do the reference protein database. I’ve now made the database smaller, and I’ve made this
RefSeq protein database even smaller by doing two things: restricting by organism and getting rid of the model sequences that our pipeline is producing. And the other thing I might want to do is to go down here and change my expect threshold like we did earlier. So, I could make it something fairly significant. And these are just arbitrary cutoffs. You can go back and change them if you find that you’re not getting sequences that you want or you’re getting things that you don’t want.

I will try to run this one live. Let’s see how long it takes. Immediately, I have my conserved domain results. That’s partly because, for the sequences in the database, we already know what conserved domains are on them. Remember, that’s a position specific kind of search. And if I think this is taking too long, I can go back over here, because I have run this, and get my request ID out of my old document. And if you want to come back and retrieve these, they will work for a while, probably several months. They won’t work next year. Let’s see where we are here. I didn’t need to do that, because this is done.

So, now, instead of maxing out everything, I have 32 BLAST hits, and if you’ll look at my graphical overview here, you can see there’s sort of two kinds of hits. There are these longer ones. This is the one that we started with. These are the cytosolic isoforms or genes for creatine kinase. And notice that there are some that are going to the mitochondria, and you can probably guess that the reason that they don’t align is because there is a leader peptide at the beginning here, a signal peptide that tells the cell to send that to the mitochondria. Looking at my output here, you can see the organisms in the list. My E value was set pretty stringently, but these are all very similar proteins, so I got them here. That looks pretty good. I wanted to look at an alignment just to see how BLAST does an alignment. When we look at this one from (mumbles). If I go here, for the pig, a U type, it jumps me down to the
alignment, and you can see the way BLAST shows the protein alignment. Notice this is a local alignment. So the first 11 residues of my query didn’t align, and the first 44 of my subject sequence
didn’t align with the query. And then you can see how the center line is sort of reflecting the scoring system. These plus signs represent positive scores for substitutions in the underlying BLOSUM62 matrix. A blank space means that that’s a negative or a zero score in that matrix. The identities, of course, are given a letter there. I don’t see any gaps in this particular alignment, but if there had been some of these rendered at, oh yes, there’s one right here. There’s a gap right there. BLAST inserted a gap there; that was cheaper than aligning residues incorrectly in that position. So I have an ad here for SmartBLAST, and we can do that next with a different protein, after we do another kind of protein search. Does anybody, let’s pause here for a minute, Bonnie, and see if there are any particular questions about this one. – [Voiceover] Not about the clarification from the missing BLAST cells, but what I wanted you to address was, sometimes the network crashes during the, I just wanted to see if you knew, or could address. When you make the BLAST request, it gets to the NCBI servers. The NCBI servers can run it, and the results should be in the recent results window. But if there is a network crash, at what part of the process would that prevent the BLAST? – [Voiceover] I don’t know. – [Voiceover] Okay. – [Voiceover] If the person has a particular issue, they should write to us. We can help them solve the problem, but I can’t completely understand what the problem is from the question. Write to BLAST help or write to me, with enough details about exactly what happened and what your search was. If you got to the point of getting an RID, you know, send that to us, too. – [Voiceover] There is another question about is 5,000 the optimal value for protein, or does it depend on the size of the query sequence? 
– [Voiceover] No. You mean 5,000? It depends on what you’re trying to do. The main thing is to make sure that that isn’t limiting your results, that the expect value cutoff is what’s limiting your results. So I want to do a different kind of example. We’re gonna use another protein search just to show you what a translating search is useful for. And I’m gonna go back up here to the BLAST homepage. Notice what I can choose from here. I’m gonna take a protein sequence. I’m gonna take a highly conserved protein, and I’m gonna try to find it in one of these transcriptome shotgun assemblies. These are assemblies from next-gen RNA-seq data from organisms that we basically have no other kinds of data for. Many of them have no protein data at NCBI. But from those assemblies of their transcripts, you can identify the corresponding regions. So let me just show you that real quickly. I’m gonna go to tblastn. And we’re gonna use as a query sequence one of the ones that’s in my output over here. This is a glycine dehydrogenase. So again, the query sequence, that’s not that hard to do. But notice that my databases here are different. So I’m using a protein query, but my databases are nucleotide databases. This is a growing set of data called transcriptome shotgun assembly. In many cases those assemblies are also represented in SRA, so you could potentially search SRA, as long as you knew what the experiments were that you needed to search. The transcriptome shotgun assembly database is set up the same way that WGS is on BLAST. When I choose that database, notice that I have to do something else. So choosing an organism, for example, is not optional; I have to choose one. So the organism that I’m gonna choose here is striped bass, okay? 
So that’s an organism that’s near and dear to a lot of people’s hearts around this part of the world, the Chesapeake Bay. So there’s a transcriptome shotgun assembly for that organism. If I want to, I can change my expect threshold, just as a matter of course, to be something a little bit significant. I’ll leave those settings the way they are, and let’s see if this is fast. These searches are a little more burdensome
than doing an ordinary blastp search, because what BLAST is doing is translating the database in all six reading frames on the fly to give you protein. So I have a very nice hit here, and some other sort of minor hits. If I click down here, here’s my match, and this is a translation of my subject sequence. It’s a pretty decent match, enough to give me a pretty good idea that this is a (mumbles) protein. When I go here, this doesn’t take me to the nucleotide database; these are stored in a separate system. This is going to our WGS browser, which also contains the transcriptome shotgun assemblies. I’ll go ahead and open that in a new tab. So what I can do there is get the FASTA sequence, but this exists basically only here; I can’t get this out of the normal nucleotide database. There’s a master record for this here. I could download the entire set if I wanted to. Now what I did, actually, was I translated that, and I got a little protein sequence, and this would be a common use case for SmartBLAST. I have a protein that I generated from some kind of project like this and I want to identify it. Let’s see what happens when I do SmartBLAST. I’m gonna go back over here; this is my open reading frame that I got from that sequence. I’ll show you a couple of things about SmartBLAST that are kind of useful. One is the rapid identification. The other is that it does give me a little bit increased look-back time, because the database is smaller, and I don’t run into that problem of being limited by the number of proteins that I can collect. I’m gonna go back here to the BLAST homepage. I’m gonna retrieve the SmartBLAST link here. I’m gonna paste that protein sequence in. So I searched with a bony fish sequence, which is labeled as unknown, and notice that it places it in this nice little protein tree for me. If I wanted to know what this protein was, then I have no question that this is a glycine dehydrogenase, decarboxylating, mitochondrial form, so I have other fishes there. This yellow croaker, the
damselfish, guppies, the zebrafish. Notice the hits are in two different colors. There is a reference database, or what we call the landmark database, of proteomes from well-studied organisms. The house mouse and zebrafish are two of those. If you go to the help tab, it defines what all the other organisms are that are in that database. In addition, it gives me the best hits from the nr database, so that’s where this large yellow croaker, the damselfish, and the guppy sequence come from. So that’s very good, and it was very fast at identifying that. The other thing that’s kinda useful about this: here are my best hits, the top five, but these additional hits are interesting because they let me look back. Because the number of organisms in that reference database is limited, I can see much further back than I could in a search against nr. So here is a bacterium, Thermotoga maritima. In fact, if I wanted to find Escherichia coli, I could just do a find in page to make it easy. So here’s my match to E. coli, which I defy you to find that easily in a search against nr, because what happens is you’re gonna have to get many, many thousands of hits to see that one. It’s still a significant match, even though it’s only 33 percent identical to the protein that I started with. So I’m gonna change gears and do one thing with BLAST, and just show you some of the formatting options that are useful. Alright, so let me go over here to the BLAST page again. Actually, what I think I’ll do is I’ll start with a nucleotide database, which is one way of doing this search that I’m gonna do now. I’m gonna go back here and I’m gonna do a search using a sequence that’s actually got a problem with it. We can see that problem very easily by using one of the formatting options in BLAST. So I’m gonna go ahead and copy that. That’s a nucleotide sequence. So this one has sort of a funny definition line. It was some kind of a high-throughput cDNA sequencing project. It’s similar to CDC20, but there actually is a problem with
this sequence. It is from a monkey. I’m gonna click the run BLAST button here, and I’m gonna throw this into
the standard nucleotide BLAST page, ’cause there’s a shortcut here that’s pretty handy. So let me search the human genome. I’m gonna do that. Now this is my first nucleotide search today. Notice that this is set to megaBLAST. That means that it’s not very sensitive, and I won’t have the kinds of problems that I have with protein, where I’m looking back very far and seeing lots and lots of hits. This is really an example of an identification search and a search that looks at annotation problems. We’ll go ahead and run that. That was very fast, because we have an indexed search of the human genome. We actually have two sets of data, with two genomes. We have the transcripts, where I hit the corresponding CDC20 transcript, and then we have hits to two different genome assemblies. We’ll focus on the primary assembly. So if we look at the alignment to CDC20, I can go here and take a look at that. Here are some mismatches at the beginning. There are mismatches, but it’s a very close alignment. One of the things I want you to notice is there’s a little gap right here. That’s near the 3-prime end of the alignment. That might not be a big problem if it’s in the UTR, but what if it’s in the coding region? 
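As a side note to the question about a gap in the coding region, here is a minimal Python sketch of the effect. It assumes a made-up 21-nucleotide coding sequence (not the CDC20 sequence from the demo) and the standard genetic code; `toy_cds` and `translate` are illustrative names only. Deleting a single base changes every codon downstream of the deletion.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate reading frame 1; a trailing partial codon is ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# Hypothetical coding sequence: ATG GCT GAA AAG CTG ACC TAA
toy_cds = "ATGGCTGAAAAGCTGACCTAA"

# Simulate a one-base sequencing error: drop the base at index 6.
with_deletion = toy_cds[:6] + toy_cds[7:]

print(translate(toy_cds))        # MAEKLT*
print(translate(with_deletion))  # MAKS*P
```

The first two residues are unchanged, and everything after the deleted base is read in a different frame, including a premature stop codon in this toy case.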
It could cause a frameshift. So if we can add the coding regions onto this, then we can see what’s happened there. So that’s the main thing I came here to show you. If I go to the formatting options, I can add the CDS feature, which will pull the coding region features and their translations from the nucleotide sequence database. I could also render this in a way that’s going to give me some indication of where there are differences. It’s kinda hard to see where the mismatches are and where the gaps are, and things like that. Let me reformat that. So here’s my N-terminal region, and there are some changes in the coding region here. They’re kind of subtle things, (mumbles) for alanine. Some of them are silent substitutions here. But here I have the problem. So it looks like, where the gap was inserted, there must have been a sequencing error, probably, in the sequence. So it threw a frameshift in here; beginning here, there’s a different reading frame translation of the sequence. So these kinds of formatting options are very useful for seeing those kinds of things. The other thing I want you to see here is the way that you can look at hits to the genome. There is a hit to a pseudogene; I won’t go into that right now. You can look at that later if you want to. I’m just gonna look at the main hit, which is on chromosome 1. We’re down in the alignments here, and we’re sorting this by E value. We can sort this not by E value, but by query start position. That will give us basically the exons in order. Now they didn’t line up exactly right, because BLAST doesn’t know about splice junctions. The other thing that I can do here that is very useful is to just display this in the graphical sequence viewer, so we can see what’s going on. I’m gonna display this here. My BLAST hits are gonna be loaded in the graphical sequence viewer. In this case, it sort of inverted my BLAST hits, so it’s showing me the subject sequence with the query
sequence aligned to it this way. So, you can quite clearly see the exon-intron structure here, with the mismatches highlighted. We’re gonna use the graphical sequence viewer just to zoom in at the front end. You’ll notice that actually I missed the first exon. One of the things you can do as an exercise, or to convince yourself that it’s true: if I use blastn, which is a more sensitive kind of search, then I will match that first exon. I will be able to align to it. MegaBLAST, which is what we used, uses a very large word size as a shortcut; it uses a word size of 28. So if there are any mismatches in that 28-nucleotide word, it won’t find a hit at all. For this first untranslated exon, it doesn’t find the match with megaBLAST, but it will with blastn. Okay, so I think we need to wrap up pretty soon. Why don’t we pause here, Bonnie, and see if there are any questions.
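The word-size point can be sketched in a few lines of Python. This is a deliberate simplification, not how BLAST is implemented: it assumes the query and subject are already lined up at the same offset with no gaps, and it only asks whether any run of identical bases is long enough to serve as an initial word. The 40-nucleotide sequence is made up, and the blastn word size of 11 is an assumption about the usual default, not something stated in the talk.

```python
def longest_exact_run(query, subject):
    """Longest stretch of identical bases in an ungapped, same-offset alignment."""
    best = current = 0
    for q, s in zip(query, subject):
        current = current + 1 if q == s else 0
        best = max(best, current)
    return best


def can_seed(query, subject, word_size):
    """True if at least one exact word of word_size bases is shared."""
    return longest_exact_run(query, subject) >= word_size


MEGABLAST_WORD_SIZE = 28  # word size mentioned in the talk
BLASTN_WORD_SIZE = 11     # assumed typical blastn default

# Hypothetical 40-nt exon; the subject carries one mismatch at index 20.
exon = "ATGCGTACCTGAAGTCCATGGCTAGCTTAGGCATCGATCC"
subject = exon[:20] + "A" + exon[21:]

print(longest_exact_run(exon, subject))                 # 20
print(can_seed(exon, subject, MEGABLAST_WORD_SIZE))     # False
print(can_seed(exon, subject, BLASTN_WORD_SIZE))        # True
```

With one mismatch in the middle of a short exon, no exact 28-base word survives, so a megaBLAST-style search has nothing to extend from, while the shorter word still seeds an alignment.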

If not, I can do the primer-BLAST example fairly quickly. Do you have anything? – [Voiceover] Well, you can set up the primer-BLAST example while I ask this question, because it feeds right into it. That is, is there no RID for primer-BLAST? – [Voiceover] There is an RID for primer-BLAST. I’ll show you that when we do it. – [Voiceover] Okay. – [Voiceover] So what I’m gonna do is, we’re gonna design primers for a particular exon of a gene, so I’m gonna work with BRCA1. I’m gonna cheat and use a shortcut that’s built into a lot of places. It’s a gene sensor. I’m actually gonna search PubMed, and notice that there’s this ad for Gene. Um, I wanna do this in Nucleotide. It also has one, but the advantage of the one in Nucleotide is that it gives me access to the genomic sequence. This is a RefSeq gene record, and I can highlight sequence features here. When I click that link, I’m able to sort of browse the features. Let me go ahead and get the exons here. So there are 23 exons of BRCA1. I’m gonna pick exon 15 as the thing that I want to amplify. This is a common kind of task that people have: they need to get primers that will amplify an exon of a gene. Notice that I can now display this exon in FASTA format. I want to design primers that will amplify this. What I can do is send this directly to primer-BLAST as a template sequence. So this will design primers that will amplify within that exon, but if I want to, I can make it so that it starts before the beginning of the exon and ends after it, so the primers bind outside of the exon. I can go ahead and copy that, move it over. Likewise, I can change the endpoint over here. Then I need to pick a background database. It’s already set to the tax ID for humans. I want to amplify this out of genomic DNA, so RefSeq reference assembly from selected organisms is a good one to pick. RefSeq representative genomes also works well for humans, because this is the representative genome for humans. I’ll click the Get Primers button. It recognizes that
the sequence matches chromosome 17, and that’s the gene BRCA1, so that’s okay. I want to say that that’s right, that’s what I want to find. Primer-BLAST can sometimes be a little slower than everything else, ’cause it’s run on fewer machines. It also has a very wide-open BLAST search at the end. So I got three primer pairs that are outside of exon 15 in BRCA1. So these are a decent set of primers that will amplify that exon, and not really bind within it too much. You could actually add something to this to see that this is a region that has a lot of disease-causing mutations, so this is a common kind of task that people who are screening for these mutations would do. Now, somebody asked me about the RID for primer-BLAST. You can use this Job ID here to go back and get the previous primer-BLAST results. Yeah, so it’s over here. So if you have your primer-BLAST job ID, which is that long string that we had a minute ago, you can enter that in here and retrieve the results. They don’t persist, you know, any longer than ordinary BLAST jobs. But you can do that. Okay, so that is a wrap on this one. So those are the things that I wanted to show you. We ran over by a minute or two. We can stay open for a couple more minutes, if anybody has any questions. Okay, thanks everybody for coming, and that concludes this webinar. We will have another one beginning in about 15 minutes.
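For anyone who wants to script the restricted protein search and the request ID retrieval from the demo, a rough Python sketch follows. It only builds request URLs and does not contact NCBI. The parameter names (CMD, PROGRAM, DATABASE, QUERY, ENTREZ_QUERY, EXPECT, HITLIST_SIZE, RID, FORMAT_TYPE) are written as I understand the NCBI BLAST URL API, so check them against the current documentation before relying on them. The query, the expect value, the hit limit, and the RID shown here are placeholders, not values from the webinar, and the web form option for excluding model sequences is not covered.

```python
from urllib.parse import urlencode

BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_submit_url(query, database, organism, expect, max_hits):
    """Build a blastp submission URL limited to one organism group."""
    params = {
        "CMD": "Put",
        "PROGRAM": "blastp",
        "DATABASE": database,
        "QUERY": query,
        # Entrez-style organism limit, mirroring the organism box on the web form.
        "ENTREZ_QUERY": f"{organism}[Organism]",
        "EXPECT": expect,
        "HITLIST_SIZE": max_hits,
    }
    return f"{BLAST_URL}?{urlencode(params)}"


def build_retrieve_url(rid, format_type="Text"):
    """Build a URL that fetches the results of an earlier search by request ID."""
    params = {"CMD": "Get", "RID": rid, "FORMAT_TYPE": format_type}
    return f"{BLAST_URL}?{urlencode(params)}"


# Placeholder inputs, chosen only for illustration.
submit_url = build_submit_url(
    query="PLACEHOLDER_ACCESSION",
    database="refseq_protein",
    organism="Mammalia",
    expect="1e-6",
    max_hits=500,
)
retrieve_url = build_retrieve_url("PLACEHOLDER_RID")

print(submit_url)
print(retrieve_url)
```

Restricting the database and the organism first, as in the demo, is what keeps the hit limit from cutting off significant matches; the expect value here is as arbitrary as the cutoffs discussed in the talk.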