Principal Component Analysis (PCA) clearly explained (2015)

StatQuest, Quest, StatQuest! Hello, and welcome to StatQuest. StatQuest is brought to you by the friendly folks in the genetics department at the University of North Carolina at Chapel Hill. Today we’re going to be talking about principal component analysis, or PCA for short.

Let’s start off with an example of principal component analysis in action. Here is an example PCA plot that I got from an article that I was just reading. It shows clusters of cell types. This graph was drawn from single-cell RNA sequencing data. There were about 10,000 transcribed genes in each cell, and each dot in this graph represents a single cell and its transcription profile. The general idea is that cells with similar transcription profiles should cluster, and we sort of see that in this graph. We see that blood cells form one cluster that’s different from pluripotent cells, which is different from neuronal cells and dermal or epidermal cells.

So the big question is: how does transcription from 10,000 genes get compressed into a single dot on a graph? The answer is PCA. PCA is a method for compressing a lot of data into something that captures the essence of the original data. In this StatQuest, we’re going to learn all about how PCA does this compression. Also, we’re going to find out what these axis labels refer to.

Before we dive into the nitty-gritty of PCA, we’re going to cover a little background material. We’re going to have an introduction to dimensions. Just to warn you, this is going to seem very, very simple, but just hang in there. You’ll be glad we did this; it’ll keep your head from exploding.

If you can remember all the way back to first or second grade, you’ll remember that one dimension is just a number line. Now imagine we had a pretend RNA-seq data set for a single cell. Here I’ve labeled the genes just A, B, and C, and the read counts are 10, 0, and 14 for those genes. We can plot these values on the number line just like we did in first or second grade. Gene A, with 10 reads, gets a dot at 10. Gene B, with zero
reads, gets a dot at zero. And lastly, gene C, with 14 reads, gets a dot at 14. If we plotted all genes, we might see something like this: a uniform distribution of transcript counts. Or we might get a non-uniform distribution of transcript counts. Some genes might not be transcribed very much, and they’d be on the left side of our number line, and some genes might get transcribed a lot, and they’d be on the right side of our number line. Even though our number line is a very simple graph, we can get some useful information out of it.

Now let’s fast forward to fifth or sixth grade, when we learned about two-dimensional graphs. Now we have two axes instead of just one, and now we can plot data from two different cells instead of just one. Here’s a pretend RNA sequencing data set for two single cells. Just like before, we have the same genes, but now we have read counts for two separate cells. If you can remember from fifth or sixth grade, the way we plot the data for gene A is we go over to 10 for cell 1, and we go up to 8 for cell 2, and we put a point there. For gene B, we go over zero for cell 1, so we don’t move at all, and we go up 2 for cell 2. And for gene C, we go over 14 and up 10.

If we plotted all of the genes, we might see something that looks like this. Here we see that the expression in the two cells is correlated, meaning genes that are highly transcribed in cell 1 are also highly transcribed in cell 2, and genes that are lowly transcribed in cell 1 are also lowly transcribed in cell 2. Or we might see that the expression of the two cells is not correlated, meaning if a gene is highly transcribed in cell 1, that doesn’t tell us anything about whether it’s highly or lowly transcribed in cell 2.

Okay, so maybe sometime when we took calculus, we started drawing three-dimensional graphs. That’s just a fancy graph that has depth. With three separate axes, we can now plot data from three separate cells. So now our pretend RNA sequencing data set has data for three single cells, and just like before, if we wanted to plot the data for gene A, we would go over to 10 for cell 1, up to 8 for cell 2, and then back 8 for cell 3. We then draw lines perpendicular to each axis to figure out where they all meet, and then we put a dot there. I’m not going to do too many examples of this because you get the idea.

So this is what we know about dimensions so far. If we have one cell’s worth of data, we only need a one-dimensional graph, which is just a number line. If we have data from two cells, then we need a two-dimensional graph, which is just an XY graph that we learned about in fifth grade. If we have data from three cells, then we need a three-dimensional graph; that’s a fancy graph with depth. What happens if we have data from four separate cells? You guessed it: we need a four-dimensional graph. The problem is we can’t draw that on paper. And if we had data from 200 individual cells, we’d need a 200-dimensional graph. There’s no way we can draw that. So the question is: are all of those dimensions super important, or are some more important than others?

To answer that question, we’re going to go back to a data set that just has two cells and two dimensions. Hypothetically speaking, what if we had two-cell data that looked like this? Here we see that almost all of the variation in the data is from left to right. That is to say, cell 1 has some genes that are lowly transcribed and some genes that are highly transcribed, but it looks like all of cell 2’s genes are transcribed at about the same level. If we flattened the data, that is, removed the up and down variation, our graph would not look much different from what it looked like before. And if we flattened the data, we could just graph it with a single number line. In this case, we can take two-dimensional data and display it on a one-dimensional graph without too much loss of information. Both graphs say the important variation is
left to right.

Here’s another example of how some dimensions are more important than others: TV and movies. TV and movies are almost always 2D; that is, they’re shown on flat screens at home or in the movie theater, and we don’t usually have fancy 3D goggles on when we watch them. So they’re 2D even though the subjects in the movie are 3D. This is OK. The third dimension usually doesn’t add that much to the story. This is why, when we cough up the extra three or four dollars to watch a movie in 3D, we’re usually disappointed. Anyways, people look like people and things look like things, even when they have no depth and are flattened on a screen. Basically, a movie camera takes 3D information and flattens it to 2D without too much loss of information.

To summarize what we know so far: we know that each cell that we sequence adds another dimension, and we also know that some dimensions are more important than others. So what does all this have to do with PCA? Well, PCA takes a data set with a lot of dimensions, i.e., lots of cells, and flattens it to just two or three dimensions so we can look at it. It tries to find a meaningful way to flatten the data by focusing on the things that are different between the cells. We’re going to talk a lot more about this later. For any biologists out there, this is sort of like flattening a Z-stack of microscope images to make a single two-dimensional image for publication.

So let’s start with an example. Again, we’ll just start with two cells. Here’s the data. Like before, the genes are imaginary, so I’ve just listed them from A to I, and here’s a 2D plot of the data from the two cells. Generally speaking, the dots are spread out along a diagonal line. Another way to think about this is that the maximum variation in the data is between the two endpoints of this line. And generally speaking, the dots are also spread out a little above and a little below the first line that we drew. Another way to think about this is that the second largest amount of variation is at the endpoints of
this new line that we just drew. If we rotate the whole graph, the two lines that we drew make new X and Y axes. This makes the left-right and above-below variation easier to see; we don’t have to tilt our heads anymore. And like we saw before, the data varies a lot to the left and the right, and the data varies a little up and down. Note: all of the points can be drawn in terms of left and right and up and down, just like any other 2D graph. That is to say, we don’t need another line to describe diagonal variation. We’ve already captured the two directions that can have variation with these two lines.

These two new, or rotated, axes that describe the variation in the data are principal components. Principal component 1, or PC1, the first principal component, is the axis that spans the most variation in the data. PC2, or principal component number 2, is the axis that spans the second most variation.

So these are the general ideas we’ve covered so far. For each gene, we plotted a point based on how many reads were from each cell. Principal component 1 captures the direction where most of the variation is. Principal component 2 captures the direction of the second most variation.

What if we had three cells? Just like before, principal component 1 would span the direction of the most variation, and principal component 2 would span the direction of the second most variation. However, since we have another direction that can have variation, we need another principal component. That’s principal component number 3. It spans the direction of the third most variation.

What if we had four cells? Principal component 1 would span the direction of the most variation, principal component 2 would span the direction of the second most variation, principal component 3 would span the direction of the third most variation, and, you guessed it, principal component 4 would span the direction of the fourth most variation. There is a principal component for each dimension, or each cell, in the data. If we had 200 cells, we would have 200 principal components. Principal component 200 would span the direction of the 200th most variation.

Hooray! Now that we know what PC1 and PC2 are, we know what the X and Y axes are in this figure. PC1 is the direction of the most variation of gene expression,
and PC2 is the direction of the second most variation of gene expression. But I bet right now you’re asking yourself this question: this is a plot of cells, not genes. How do we plot cells? So far, all we’ve talked about is how to plot genes.

To answer your question, we’re going to go back to the original scatter plot for two cells. For now, let’s focus on principal component 1. The length and direction of PC1 is mostly determined by the circled genes. The genes on the endpoints are the extreme genes. Now we’re just going to move the graph over to the left side of the screen so we can put other interesting things on the right side.

If we wanted to, we could score genes based on how much they influence principal component number 1, and here’s a list of qualitative scores that we might give each gene. Genes close to the ends of the line, like A and F, would have high scores because they highly influence PC1. The genes in the middle, like B and C, would have low scores. We could also use quantitative scores for each gene. So genes with little influence on principal component 1 would get values close to zero, and genes with more influence would get numbers further from zero. Genes on opposite ends of the line would get similarly large numbers but with different signs. So A might get a positive number like positive 10, and F, because it’s all the way at the other end of the line, might get a negative number like negative 14. Similarly, we could also rank genes on how they influence principal component number 2.

Now we have two tables of genes and the influence they have on the principal components: one is for principal component 1, and the other is for principal component number 2. Now that we have these two tables for the first two principal components, we can use them to plot cells and not just genes. We do that by combining the read counts for all genes in a cell to get a single value. Here’s how to do that. First, we return to the original read counts for each cell. We can then calculate a score for cell 1 by
taking the read count for gene A and multiplying it by gene A’s influence on the principal component, and adding that to the read count for gene B multiplied by the influence of gene B, and doing that for all genes.

Here’s a concrete example. For cell 1, gene A, we have 10 read counts, and the influence gene A has is 10, so the first part of this summation is 10 times 10. The second part of the summation is the read count for gene B, which is zero, multiplied by the influence gene B has, which is 0.05. We just continue to multiply and sum, and multiply and sum, until we’ve done it for each gene in the cell. For this example, we might end up with a number like 12. That would be our value for PC1.

To calculate a value for principal component 2, we do the same thing as before, except instead of using the weights, or influences, on principal component 1, we use the weights, or influences, on principal component number 2. So in this case, gene A has 10 reads, and we multiply it by 3 because that’s the influence gene A has on principal component number 2. We add to that the read count for gene B multiplied by the influence that gene B has on principal component number 2; in this case that’s 0 times 10. And we just do that for every single gene again, and we end up with a score for principal component number 2, and in this case that might equal 6.

So we’ve done the math for cell 1. We’ve got a value for principal component number 1 and a value for principal component number 2. Now all we have to do is plot it on a graph. If we create a graph where the X axis is principal component 1 and the Y axis is principal component number 2, we can do what we did in fifth grade: we just go over 12 and up 6 and put our dot right there.

Now we have to calculate scores for cell number 2, and if we did the math by multiplying the read counts for each gene by the influence that each gene has on the principal component, we might end up with numbers like 2 for principal component 1 and 8 for principal component number 2. Again, we just plot it like we did in fifth grade. If we sequenced a
third cell and its transcription was similar to cell 1, it would get scores similar to cell 1’s, and as a result, when we plotted it on the graph, cell number 3 would be closer to cell number 1 than it would be to cell number 2.

Hooray! At long last, we know how they plotted all of the cells on this graph.

These are the general ideas we’ve covered so far. Genes with the largest variation between cells will have the most influence on the principal components. That is to say, genes highly expressed in some cells and not expressed in others will have a lot of variation and influence on the principal components. The first principal component captures the most variation in the data. The second principal component captures the second most variation in the data. You can use the original data and the first two principal components to get X-Y values to plot on a figure. Cells with similar transcription patterns will cluster together.

And just like they say on TV: but wait, there’s more! We can use the graph to identify key genes. Do you see how cells are spread out left and right, above and below? If we wanted to find out which genes had a big influence in putting dermal cells on the left side of the graph and neural cells on the right side, we could look at the influence scores on principal component number 1. And if we wanted to find out which genes help distinguish blood cells from neural and dermal cells, we could look at the influence scores on principal component number 2.

But wait, there’s even more! Yes, there are a couple of diagnostics you can do if you’re drawing your own PCA plot. These are ways you can tell if your PCA is actually worth anything. One diagnostic plot is called a scree plot, where you plot how much variation each principal component can account for. What you want to see in this diagnostic plot is that most of the variation is accounted for by the first two principal components.

Lastly, here’s a terminology alert. The way I’ve been describing things has been fairly intuitive,
but there’s actually a lot of technical jargon for principal component analysis. The numbers that describe the weights, or the importance, of each gene to principal component 1, I’ve just been calling influence or weight, but in PCA terminology those weights are called loadings. An array of loadings is called an eigenvector.

And that’s all there is to PCA. So tune in next time for another exciting StatQuest!
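
For anyone who wants to try these ideas on a computer, here is a minimal Python sketch. It is not the method used for the figure in the video. It assumes NumPy, a made-up table of read counts (rows are cells, columns are genes), and a hypothetical helper named pca. It also centers each gene around its mean before finding the principal components, which is a standard step the video does not mention. The loadings come out as eigenvectors, each cell’s score is its read counts multiplied by the loadings and summed, and the explained fractions are the values a scree plot would show.

```python
import numpy as np

# Pretend read counts (made up for illustration): one row per cell, one column per gene.
genes = ["A", "B", "C", "D", "E", "F"]
counts = np.array([
    [10.0,  0.0, 14.0, 33.0, 50.0, 80.0],  # cell 1
    [ 8.0,  2.0, 10.0, 45.0, 42.0, 72.0],  # cell 2, similar to cell 1
    [80.0, 60.0,  5.0,  2.0, 12.0,  1.0],  # cell 3
    [75.0, 66.0,  9.0,  0.0, 10.0,  3.0],  # cell 4, similar to cell 3
])


def pca(data):
    """Hypothetical helper: return (loadings, scores, explained) for a cells x genes table."""
    # Center each gene so the components describe variation around the mean
    # (an assumption; the video works with raw read counts).
    centered = data - data.mean(axis=0)
    # Gene-by-gene covariance matrix.
    cov = np.cov(centered, rowvar=False)
    # eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]                 # largest variation first
    eigenvalues = np.clip(eigenvalues[order], 0.0, None)  # remove tiny negative round-off
    loadings = eigenvectors[:, order]                     # column k holds the loadings for PC k+1
    # Score for a cell = sum over genes of (centered read count * loading).
    scores = centered @ loadings
    # Fraction of total variation per component: the bar heights in a scree plot.
    explained = eigenvalues / eigenvalues.sum()
    return loadings, scores, explained


loadings, scores, explained = pca(counts)

for i, (pc1, pc2) in enumerate(scores[:, :2], start=1):
    print(f"cell {i}: PC1 = {pc1:.2f}, PC2 = {pc2:.2f}")
print("fraction of variation per PC:", np.round(explained, 3).tolist())
print("PC1 loadings by gene:", dict(zip(genes, np.round(loadings[:, 0], 3).tolist())))
```

With these pretend counts, cells 1 and 2 have similar profiles, so their PC1 and PC2 values should land close together on the plot, just like the third cell in the example above. The signs of the loadings are arbitrary, so a whole axis may come out flipped without changing which cells cluster together.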