I’m going to continue highlighting words I get asked about regularly. Today’s feature: multicollinearity! This is a word you’ve probably heard about in stats class, but might not know specifically what it is or why it matters.
We worry about multicollinearity when we are using multiple independent variables to model one output. One of your independent variables is multicollinear with another if the two variables can be used to predict each other in a linear model. Usually we worry about multicollinearity when that linear prediction explains roughly 60–90% of the variance (an R² of about 0.6–0.9). Below that range, the variables aren't meaningfully multicollinear. Above it, the two variables carry nearly the same information, so you can probably keep just one and simplify your model. If one variable is an exact linear function of the other, it's called perfect collinearity, and you only need one of them in the model.
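Here's a minimal sketch of that check in Python, using made-up data. The standard way to quantify it is to regress one predictor on the other and look at the R²; the variance inflation factor (VIF = 1 / (1 − R²)) is the same idea repackaged:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical predictors: x2 is mostly x1 plus some noise, so they are collinear.
x1 = rng.normal(size=200)
x2 = 0.8 * x1 + 0.4 * rng.normal(size=200)

# Regress x2 on x1 (with an intercept) and compute R-squared.
X = np.column_stack([np.ones_like(x1), x1])
beta, *_ = np.linalg.lstsq(X, x2, rcond=None)
resid = x2 - X @ beta
r2 = 1 - resid.var() / x2.var()

# VIF = 1 / (1 - R^2); values above roughly 5-10 usually flag multicollinearity.
vif = 1 / (1 - r2)
print(f"R^2 = {r2:.2f}, VIF = {vif:.1f}")
```

With more than two predictors you'd repeat this for each variable, regressing it on all the others.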
If you have multicollinear variables, the coefficient estimates in your model become unstable and hard to interpret: you can't use your model to understand the effect of one variable without the other. Predictions can suffer too. When the independent variable values are very high or very low, or in the instances where the two multicollinear variables don't track each other the way they usually do, their contributions to the prediction essentially cancel each other out and the prediction isn't very accurate.
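You can see that coefficient instability in a small simulation (hypothetical data again): the true model below is y = x1 + x2 + noise, with x2 nearly identical to x1. Refitting on fresh samples, the individual coefficients swing around, but their sum (the joint effect) stays put:

```python
import numpy as np

def fit(seed):
    """Fit y = b0 + b1*x1 + b2*x2 by least squares on one simulated sample."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=100)
    x2 = x1 + 0.05 * rng.normal(size=100)   # nearly collinear with x1
    y = x1 + x2 + 0.5 * rng.normal(size=100)
    X = np.column_stack([np.ones_like(x1), x1, x2])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Re-simulate a few times: b1 and b2 individually are unreliable,
# but b1 + b2 stays close to the true joint effect of 2.
for seed in range(3):
    b0, b1, b2 = fit(seed)
    print(f"b1 = {b1:+.2f}, b2 = {b2:+.2f}, b1 + b2 = {b1 + b2:.2f}")
```

This is why the model is still useful for prediction inside the data it was fit on, but misleading if you try to read off the effect of one variable alone.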
Multiple linear regression (MLR) is sensitive to multicollinearity. Spatial statistics tools built on MLR, including geographically weighted regression and regression kriging, inherit that sensitivity. Check the assumptions of any statistical tool you're using, and if the tool you're considering is sensitive to multicollinearity, make sure you check your independent variables first.
Are there other words you are curious about? Mention them in the comments!