SPSS – Hierarchical Multiple Linear Regression

– [Instructor] Okay, up right now is a complete example of hierarchical multiple linear regression. So we're gonna cover how to run a multiple regression that has steps, from start to finish, including data screening, power, what you might write in the write-up, and an example of a possible representation of the data. So this is data set two from Blackboard, and what's in the data is that we have gender, where zero is female, one is male; age of the participant; and extroversion, so high scores are extroverted, low scores are introverted. We're really looking at how well they take care of their cars, and so the dependent variable is car. Are they washing it, or cleaning it, or did they give it an oil change, are they getting checkups, that sort of thing. And so what we're gonna do is we're gonna control for the demographic variables of sex and age, and then test if extroversion adds something to that equation in predicting how well people take care of their cars. Okay? And so you'll wanna start with power, and power for this isn't too complicated here in G*Power; there's only really a couple of options. So click on F tests, and then pull down that window, and you'll get two options: linear multiple regression, R² deviation from zero, which tests if the overall model is significant, or R² increase, which you could use for this type of model, and that would test if extroversion is an addition to the model. I wanna go deviation from zero, 'cause I wanna know if overall it's significant, but both options are viable. If you don't know, this is f², so not your normal eta squared or R². So if you hover over it, it'll give you the convention sizes, or you can hit Determine out here and calculate it from a couple of different things; this squared multiple correlation, that's rho, so you can enter R² there and it will calculate it for you. So I'm gonna close this bad boy and leave it at .15. Alpha is always .05, power is 80%, and in this case we have three predictors total, so we use three. That says we need 77 people to detect a significant effect. I only have 40, so let's see what happens; I'll calculate my actual power afterward.

The next thing I wanna do is the really intense process of data screening for regression. But this isn't a fake regression, it's a real regression, so it's a little easier, 'cause I don't have to create some random variables to test this. The first thing is always missing data and accuracy of your data, so go Analyze, Descriptive Statistics, and then Frequencies. I'm gonna select everything and move it over. And under Statistics, really you need the min and the max, but it doesn't hurt if you kind of look at the means and the standard deviations, if this is your own research field and it's not sort of a silly example. You can notice things like: wait, why is that score so low? Oh no, maybe I forgot to reverse code it — that sort of thing. And then okay, let's look at the output here. It indicates that my gender data is zero to one, which is good, 'cause gender should be evenly split. My ages don't seem abnormal; like, you wouldn't expect somebody to be four and have a car. My extroversion score — let me find what that scale was; I think it's zero to 100 — so we're doing pretty good. And the car scale is also zero to 100; that's how well they're taking care of their car. So far everything looks good. And I don't have any missing data here — see, no missing — so that first assumption check works out. Now, to do outliers, what we're gonna do is actually set up the regression to run as if we were ready to test, and then check for outliers in three different ways. The reason I picked these three: they do seem to be the most popular, to me they really get at the point of what regression is testing, and they sort of will cover you. There are lots and lots of options, as you'll see here in a second, for testing for outliers in regression, and these seem, to me, to be the best three. Okay, so let's set up the analysis as if we're gonna run it.

So Analyze, Regression, Linear. Our DV is car. Now, this is a hierarchical regression, so we're gonna get to use these different blocks here — and they're not actually called blocks in the output, they're called models; block just means, what do you want to do next? So first, we're gonna control for demographics

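As an aside, the f² effect size that G*Power asks for in the power step above can be computed from R² by hand. A minimal sketch (the .02/.15/.35 small/medium/large conventions are Cohen's; the example R² below is illustrative):

```python
# Cohen's f-squared -- the effect size metric G*Power's F tests use.
# It converts from R-squared: f2 = R2 / (1 - R2). The .02 / .15 / .35
# conventions correspond to R2 of roughly .02, .13, and .26.
def f_squared(r_squared):
    return r_squared / (1 - r_squared)

print(round(f_squared(0.13), 2))   # a "medium" R-squared -> about .15
```

This is what G*Power's Determine panel is doing for you when you type an R² into the squared multiple correlation box.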
Put those in as independents. Hit Next to get block two, or model two, and then put in extroversion here. You do not have to include all three; it actually does that for you automatically, so whatever you've used in step one will carry over to the other steps, 'cause you wanna keep controlling for it. So it'll show you them several times in your output. Okay, after you do that, what you wanna hit is Statistics. We're gonna get R squared change — that's super important for the way I'm gonna suggest you write this up — and part and partials, which are the sr and pr, and then hit Continue. Under Plots, for data screening: ZPRED in Y, ZRESID in X, histogram, and normal probability plot. That's your normal data screening. For the graphs, one thing you can do when there are multiple variables, to kinda get an idea of how good your equation is, is to graph the predicted values against the actual values. Remember that big R is the correlation between Y hat, your predicted score — what would I have guessed the score to be? — and Y, your actual score.
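That claim about big R can be checked directly: correlate predicted scores with actual scores and you get R. A rough sketch with made-up numbers (none of these values are the course data):

```python
import math

# Big R is just the Pearson correlation between the predicted scores
# (Y hat) and the actual scores (Y); R-squared is that value squared.
def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

y_actual = [40, 55, 62, 70, 88]   # hypothetical car-care scores
y_hat = [45, 50, 60, 75, 85]      # hypothetical predicted scores
big_r = pearson_r(y_hat, y_actual)
print(round(big_r ** 2, 2))       # share of variance explained
```

If the predictions were perfect, the dots would sit on the line and R would be exactly 1.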
So the better your R and the bigger your R squared, the closer you're getting to the real score. If the dots are perfectly aligned, you have done a great job, but that almost never happens. It's a way to see how well we're doing, so hit Next, and this is where I'm gonna put DEPENDNT in Y and adjusted predicted in X. So that's gonna give me Y hat on the X axis — that's all of my Xs combined with their coefficients — and Y on my Y axis. Then hit Continue. Under Save, we're gonna click the three different distances: Mahalanobis, Cook's, and leverage. Look, there are so many options — influence statistics, DfBeta is pretty popular, studentized deleted residuals are also pretty popular. Almost all of these are different ways to look at outliers; we're gonna cover these three. Hit Continue. That should be good; hit OK.

First things first, we wanna check for outliers, so I'm gonna ignore all my output so far. Go back to the data, and you'll see that I have three new columns, and those columns are for each of the separate outlier analyses. Let's start with Mahalanobis. The cut off score for Mahalanobis is gonna be, for three variables, three degrees of freedom, so let's see — chi-square table, there it is. And we're gonna use .001, 'cause we want them to be really crazy before we delete anybody. For three degrees of freedom, it's 16.27; so that's my cut off score. Now, normally you just sort and you look. But in this sort of analysis, when I have three things I wanna compare, and I kinda wanna keep track of what I'm doing, I'm gonna actually show you a way to create multiple columns that tell me if people are outliers on each measure separately, and then create a total outlier score. I don't think this data set's too crazy — we don't have a whole lot of outliers — but if you had 400 participants, you wouldn't wanna code this by hand; that's gonna take way too long. So what you're gonna do is go to Transform, Recode into Different Variables. Let's take Mahalanobis distance here, move it over. I'm gonna call this out_mah, so I know it's the outlier column for Mahalanobis. You have to click Change so you get that variable name here. And then before you hit OK, you have to tell it what you're gonna transform this into — this is how a lot of people recode or reverse code labels, too. So click Old and New Values. We're gonna use this HIGHEST option. I wanna take everybody above 16.27, 'cause that's where I set the cut off score, and make them one. That basically codes everyone whose scores are too high as one. I'm gonna take everybody else — everything below 16.27, and all the other random decimal points — and make them zeros. So that basically codes everybody into zero, not an outlier, or one, an outlier. And then hit Continue and OK. The crappy part about this is that, since they each have different cut off scores, you have to do them one at a time.

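The Recode into Different Variables step is just thresholding, so the same logic can be sketched in a couple of lines. The distances below are invented for illustration (only the 16.27 cutoff comes from the example):

```python
# SPSS's "Recode into Different Variables" with "value through HIGHEST"
# is a threshold: at or above the cutoff -> 1 (outlier), else -> 0.
MAH_CUTOFF = 16.27   # chi-square critical value for df = 3 at p = .001

mah_distance = [2.1, 5.4, 16.9, 0.8, 11.3]   # hypothetical Mahalanobis values
out_mah = [1 if d >= MAH_CUTOFF else 0 for d in mah_distance]
print(out_mah)   # -> [0, 0, 1, 0, 0]: one flagged case
```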
So I didn't get anybody with outliers on Mahalanobis. I'm gonna do that twice more: once for Cook's, which is a measure of influence — that's discrepancy and leverage together — and then once for leverage, which is just straight, how much are they changing the slope. So let's do Cook's now. Transform, Recode into Different Variables. I'm gonna hit Reset to clear everything out. Move over Cook's, type out_cook here, Change, Old and New Values. So what's my cut off score for Cook's? Well, the formula for Cook's is four divided by n minus k minus one, or four over the degrees of freedom. So I have four divided by n minus k minus one; n is 40, minus three for k, for three predictors — age, sex and extroversion — minus one. So 40 minus three minus one is 36, and four over 36 is .111, my cut off score for Cook's. Same functions: value through HIGHEST, so .111 and up is gonna be a one, and all other values are gonna be — ooh, not missing — a zero, and then Add. So everybody above .111 gets a marker for being an outlier; everybody below that score gets a zero for not being an outlier. Continue and OK. Right, and so it looks like I've got two Cook's scores that are too high. One of them — oops, that's leverage — .114, and the other is .312. So those are too high. One more time for leverage: Transform, Recode into Different Variables, Reset. And let's move over leverage, type out_lev, Change, Old and New Values here. So what's my cut off score for leverage?
Well, let's see. The formula for leverage is two k plus two, divided by n. So two times k, which is three — two times three is six, plus two is eight — divided by n, which in this case is 40. Eight over 40 is 0.2. So I'm gonna do value through HIGHEST, so 0.2 and up is gonna be a one; those are my outliers. And then all other values can be zero; those are my not-outliers. Continue and OK. And so I have an outlier for leverage as well — their score is higher than 0.2. Now this is very easy to see because there are only 40 people and I can kinda scroll through it, but again, if you have 100 or more, or even just a couple more than this, it can be kinda tedious to look through them, and sorting multiple columns in SPSS is not always the best thing. So what you wanna do is go Transform, Compute, and just add all those together. This is gonna be total outliers, so I'm gonna call it out_total. Then I'm just gonna do out_mah, plus — double click — out_cook, plus — double click — out_lev; just add them all up. Hit OK. And now I can sort my out_total column. Remember, you can right click on the column and click Sort; for some reason that does not totally work well on my Mac with no mouse, so I'm gonna do this through Sort Cases, and put the highest people at the top. So I have one person who has two or more markers — they're two out of three. I would delete this person, because their score has two markers out of three that indicate it's an outlier. I mean, you don't have to delete them, 'cause really, what is going on? Look at the data before you delete it, clearly. They're a young person who has high extroversion and takes care of their car, and more than likely they're at the top of those two variables. So they're getting that high Cook's and leverage score because they're probably discrepant, which means they're far away from the rest of the data; being at the very top or the very bottom tends to make you far away from everybody.

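The cutoff arithmetic above can be double-checked with a quick computation; the n and k are from this example, while the 0/1 flag lists below are hypothetical:

```python
# Cutoff formulas for the three outlier checks, with n = 40 cases and
# k = 3 predictors as in this example. (Mahalanobis uses the chi-square
# table instead of a formula: 16.27 for df = 3 at p = .001.)
n, k = 40, 3
cook_cutoff = 4 / (n - k - 1)    # 4 / 36 -> about .111
lev_cutoff = (2 * k + 2) / n     # 8 / 40 -> 0.2

# out_total just sums the three 0/1 markers per case (hypothetical flags):
out_mah, out_cook, out_lev = [0, 0, 0], [0, 1, 1], [0, 0, 1]
out_total = [m + c + l for m, c, l in zip(out_mah, out_cook, out_lev)]
print(out_total)   # -> [0, 1, 2]: the last case trips two of three markers
```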
But it looks to me like they're really, especially, far away on the car score. If you're following along in my User's Guide, I did delete them. You can leave them in and try it, and then take them out and try it, to see what happens — that's the popular thing to do. But since I wanna match the handouts that you're looking at, I'm gonna delete this person, because they have two out of three. There we go. Alright, so that being said, that makes all of this output moot, so I'm gonna get rid of it, 'cause I deleted something. The next thing I wanna check is multicollinearity. So Analyze, Correlate, Bivariate. Remember, this is only for independent variables — you do want them to be correlated with your DV; that's the point. So sex, age, and extroversion: we move those over and hit OK. And that is gonna show me that gender and age aren't correlated, which isn't too surprising. Gender is correlated with extroversion — so there are differences in men and women — and age and extroversion are also correlated, but none of these are too high. The cut off score is .9, but remember, at .7 you might get some suppression in multiple regression, so I might tell you to try it and see what happens if you get that high. Okay, so I'm gonna rerun my regression, because I deleted somebody — I'm just gonna hit OK — and I'm gonna make a point about the fact that when I do that, it's gonna give me three new outlier columns, because I ran it again. Don't delete anybody again. Don't do it. Don't think about it. Don't make this a thing. Don't delete people multiple times. So essentially, these three new columns we don't need.

Alright, so there's my output. We're gonna check normality first. That looks pretty good — maybe a little bimodal, but not too bad; we have at least 30 people, and it's centered over zero and ranges from −2 to 2, so I'd say it's okay. And then linearity: pretty good, especially with only 40 people. Homogeneity and homoscedasticity also look pretty good. Most of the data's between −2 here and 2; we're getting a 3 up here because it's just slightly over 2, but really that's almost perfectly between −2 and 2, and the data here is between −2 and 2 — that's about as close as real data is gonna get. So homogeneity and homoscedasticity both check out. Okay. So all of my — there's one more plot; we're gonna come back to what that plot is in a second — all my assumptions check out after I deleted one outlier. Now let's look at the actual analysis, which is just a little bit higher up in my notes here. I'll copy this into Word so you can read it a little better, rather than side by side. Well, thank goodness, it wasn't anything salacious — there we go, it was just a z test. Now SPSS is doing that fun thing where it doesn't like to copy; there we go. So the first question you have to ask yourself in regression is: is the overall model significant? So let's talk about model one — it's just my demographics. And yeah, it's significant. So I'm gonna say F of — and here we go, this first line — F(2, 36) = 21.66, my p value's less than .001, and my R squared for just this step is .55. So what does that tell me? That means 55% of the variance is due to demographics. Whoa, that's huge. And it is significant. The next thing is model two, so this is our extroversion — or extraversion, either way you think about it. And I'm not gonna use that ANOVA box. So the interesting thing about the two different boxes here, that you don't see in a simultaneous regression, is that they're gonna be different.

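The R squared change test discussed next is an F test on the increment. A minimal sketch of that formula — the exact inputs below are approximations of this example's numbers (39 cases after the deletion, three predictors, R² near .55 then .615), so the result lands near, not exactly on, the printed F(1, 35) = 5.96:

```python
# F test for the R-squared increment in a hierarchical step:
# F(added, df2) = (delta_R2 / added) / ((1 - R2_full) / df2),
# where added = predictors entered this step, df2 = n - k_full - 1.
def f_change(r2_full, r2_reduced, n, k_full, added=1):
    df2 = n - k_full - 1
    return ((r2_full - r2_reduced) / added) / ((1 - r2_full) / df2)

print(round(f_change(0.615, 0.550, n=39, k_full=3), 2))   # -> about 5.9
```

For the first step there is nothing to subtract, so the change F and the ANOVA F are the same number, which is why those two boxes only diverge from step two on.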
So what does this change statistics thing do out here? That is testing this number right here: R squared change is greater than zero. In the first model, the first step, those two numbers match, because you're starting at zero, so it asks, is it greater than zero? When you add a second step, now it's testing: is this change different from zero? So is 7% a significant addition to the model? Versus this number down here in the ANOVA box, which is testing if the overall R squared, 61%, is greater than zero. And, I mean, you can go either way, but I feel like reporting the ANOVA's a little bit of cheating if your first step was really big: your second step was still gonna be significant, 'cause the first one was big, even if that addition is not. So I'm always biased towards using the change statistics, 'cause that's kind of the point of doing hierarchical regression — to show that that extra step is significant. Adding this variable was important, so we should do it. So that's what's different between the two. But this is an example, so of course it is significant. If I can get a capital F here — there we go — it's gonna be F(1, 35) = 5.96, and my p value is .02. My R squared change, which I'm gonna cheat and copy from up here, is .07. And then what I would do in Word, to make it super duper clear what I was talking about, is insert a change statistic symbol, which is delta, the little triangle. So I'm saying the change in F is significant, and then the change in R squared. That tells people — or at least it tells me — that this is the change in R squared, the addition to R squared. And most people can figure that out, because they don't assume that after getting 55% of the variance, somehow you magically dropped to only 7%; they go, oh, that must mean an additional 7%. So you don't really need to list R squared total, because hopefully people can figure out to just add them together, and that's how you get 61%. It's gonna look a little high because we rounded up on both of them, and in that case I might tell you to use three decimals, but I mean, it's off by .01, so it's not a huge deal.

Okay, the next question is: which predictors are significant? And so I'm gonna take the coefficients box here in my output and use that to answer that question. The way I learned this was to only talk about the predictors in the step they're entered, and people vary on this point. I think about it as more of a theoretical view: I'm gonna control for demographics, so here's what happens to demographics when they're by themselves; once I control for them, I'm basically done with them, and then I'm gonna add extroversion. So after controlling for demographics, what happens with extroversion? 'Cause you'll notice that the coefficients do change — that's because there are other variables in the equation, so mathematically they have to change. We can't actually hold them constant; it's more of a theoretical idea of, I'm controlling for these and then doing this. I have seen it both ways, where people report them in both steps or only the last step. But the way I think about it is to just talk about them in the step they're entered, because you did them in steps for a reason. Remember, the number one rule when I help people with things is: do what your advisor wants. Do what the reviewer wants, as much as you can, practically. And basically go with what makes sense to you. If it makes more sense to talk about both, do both and see what happens; see if people will accept your explanation. So I'm gonna talk about them in the step they're entered. That means, for model one, when I'm controlling for demographics, sex is a significant predictor. I'm gonna list beta, so Insert — the advantage of beta is that it's standardized, so I can compare; there's beta. I don't know why this always comes up in this other font — there we go, let's do Times New Roman. Sorry, it's one of my things; it just makes me crazy. Alright, there we go. So I'm gonna list beta. What's the advantage of beta? Beta is standardized, because gender and age are definitely not on the same scale — one is zero and one, the other one is age in years. Beta will let me tell which predictor is stronger, but so will partial correlations, so you could go with either one. Remember that b is more interpretable — it's in the scale you're using, so you can talk about it more easily — and beta is standardized, so you can compare better. Either one. Alright, so beta is .68, and my t says it's significant. Remember, the degrees of freedom for t match the second degree of freedom for F in the step we're talking about, so it's 36 here, 'cause it's n minus k minus one. So t(36) = 6.00, p value is less than .001. And I'm gonna use pr squared as my effect size. So what in the heck is pr squared? Sr and pr are types of partial correlations. This output out here is zero-order — it's just plain r; that's the correlation between gender and my DV, car. Partial correlations are in the second column, where it says Partial: that is the correlation between gender and car controlling for age, like subtracting out all the variance for age. Semi-partial correlations are the relationship between gender and car with age still included, so the difference between pr in the middle column and sr in the last column is the denominator. Pr is calculated only over leftover variance — it basically takes age and just carves it out and says that variance due to age doesn't exist anymore; poof, gone. For sr, semi-partial correlations, that variance due to age is still part of the denominator, so it's over total variance on the bottom. If you can't remember the order — like I do sometimes — remember that pr is always larger than sr, because the denominator is smaller (unless they're both zero). And so go with the larger column, which is this one. I'm gonna square that because — they're both effect sizes, so it doesn't actually matter, but I like to think about it as R squared, and we'll keep the same theme here. And that tells me how much variance is accounted for: it's actually 50%. We'll talk about what that means here in a second. So for age, the beta's .33, also significant — that doesn't always happen; sometimes it might just be one of them. t equals 2.92, p is less than .01 here. And let's do pr squared — Word will keep up with me here — so .44 squared is .19. And here's the tricky part: because these don't have the same denominator, even though it says R squared, they do not add up. They will not add up to my total R squared. Sometimes it's bigger, sometimes it's smaller; it just depends on the mathematical properties and the overlap between sex and age. But since they are fairly uncorrelated, that means pr will be bigger; the more correlated they are, the smaller they'll be. Don't expect those to add up — that's just my word of warning here. Right, so 50% of the leftover variance is due to gender, and 19% is due to age. I can also look at beta and tell that gender is the better predictor. The interpretation for age here is: for every one unit increase in age, we get .33 standard deviations — or rather .54, .55 points — of increase in car. As age goes up, care for the car goes up. The tricky part of these categorical variables: as sex goes up, what does that mean? That's an odd way to say it. Basically, as we go from zero to one — the zero group is girls, females; the one group is guys, males — the difference between boys and girls is .68 standard deviations, or 26 points. So as sex goes up,
as we are looking at guys, care for car goes up. Our guys are taking better care of their cars than our girls — sorry, ladies. Alright. So let's talk about extroversion. I added that in model two, so what happens here? I already know it's a significant predictor, because I only have one variable in model two, extroversion, and the model change was significant. Let's see: the beta is .33, about the same size as age. Now my degrees of freedom for t are gonna be different, though, because it's the second degree of freedom here, so that's 35 instead of 36. So t(35) = 2.44, p value's .02 — which, with only one variable, will match the p value up here. Let's do pr squared. (Can you tell it's late? I'm getting silly voices.) Alright, we've got .38, squared — come here, calculator — .38 squared, so .14. And you know that does not match R squared change: the overall addition to the equation is .07, which would be this .26 squared, I'm pretty sure, so let's try that — .26 squared, yes, that's where .07 is coming from. So if you square a semi-partial, you get R squared change. But we're talking about partial correlations, so we took age and gender's variance out of the DV — subtracting some numbers out of the denominator — and it's 14% of the leftover variance. So it's a significant predictor. To write that up, I would talk about all those different things. One caveat that I always tell people: if a predictor is not significant, you can't just pretend it doesn't exist anymore. So talk about predictors even if they're not significant — and then, my thing is, in the step they're in. All of mine were significant in their specific steps, so we'd talk about them all. But you really don't want to ignore one just 'cause it wasn't significant, 'cause people are gonna go, what happened to the other variable? They just stopped talking about it. Say it's not significant.

Now for pictures: what can I do about making a graph, a representation of this? It's usually a little hard, because if you have three variables, technically you're predicting into 3D space. The sort of cheap way to do it — it's not really cheap, but it's the easiest way — would be to create a picture, this one here, of the relationship between the predicted values and the actual, real values, 'cause that gives me a picture of: all these variables together equals what? Now, I got that scatter plot when I ran my plots with dependent as Y and adjusted predicted as X, but this graph is terrible. So here's what I would do to make it APA style. Remember, APA does not have all this stuff at the top. It's not letting me delete here — oh, there we go, it's being grumpy. There we go. And then I would change this stuff at the bottom — click once to get it, click twice to get where you can type into it. Either the equation is a good label, or one or all of the variables, so Sex + Age + Extroversion. You could also call this Predicted Values — it doesn't have to be the equation; that's the other option, calling it Predicted Values. I like to remind people what variables I'm using, unless you have 10, then it might be kinda long. Over here, Car is not a very good label, so click once, click twice: this is my Car Care — oops, not Care Care — Car Care Score. You can delete this awful blurred gray background: double click on it, change it to transparent here, and Apply. That's just a personal preference, 'cause gray is awful. But I also like to add the fit line, so Add Fit Line at Total — that adds your fit line — and then you can turn off, right here, attaching the label to the line, since that's not actually the equation. Apply. I don't wanna include that equation, because my real equation has three predictors and their coefficients; that's what you're reporting with all of your beta values, or your b values — this is just a way to get the stupid line. So how are we doing? Let's close this and it'll pop back over here — there we go. We're doing pretty good, because lots of dots are close to the line. I mean, only one person is even touching the line, but they are pretty close; it could be way more spread out. Remember, this is 61% of the variance — that's a lot. So we're getting pretty good at guessing people's scores with all three variables at once. And that is how you run a hierarchical multiple linear regression — you've got steps — how you would talk about each piece in your write-up, and a potential graph, or way to visualize the data.
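As a recap of the pr versus sr distinction from the write-up above, the relationship for the last predictor entered can be sketched numerically. The R² values below approximate this example's output; the printed pr of .38 squares to .14 rather than the .16 here only because the output is rounded:

```python
# For the last predictor entered: sr-squared equals the R-squared change,
# and pr-squared rescales that change over the leftover variance,
# pr2 = sr2 / (1 - R2_reduced), which is why pr is never smaller than sr.
def sr2_and_pr2(r2_full, r2_reduced):
    sr2 = r2_full - r2_reduced
    pr2 = sr2 / (1 - r2_reduced)
    return sr2, pr2

sr2, pr2 = sr2_and_pr2(0.62, 0.55)    # approximate step-two and step-one R2
print(round(sr2, 2), round(pr2, 2))   # -> 0.07 0.16
```

Because pr² is taken over only the variance the earlier predictors left behind, these per-predictor effect sizes will not sum to the model's total R² — which is the warning given above.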