Planning April 2014
Research You Can Use
Not Your Grandparents' Regression Analysis
By Reid Ewing
This month's column is not for everyone. It is about regression analysis — the workhorse of statistical methods in the planning field. The column is a bit technical, so just read it for the gist, and don't sweat the details. The point I want to highlight is that even experts can make mistakes in regression analysis. So enlist some help when you need it.
Regression analysis is a statistical method for estimating relationships between a dependent variable and one or more independent variables. More specifically, regression analysis helps one understand how the value of the dependent variable changes when any one of the independent variables is shifted while the other independent variables are held constant. For example: How does the price of homes vary with proximity to transit, when the floor area of the unit is held constant?
My guess is that almost everyone holding a master's degree in planning has been exposed to least squares regression (and possibly only least squares) in a statistics, quantitative methods, or research methods course. When I learned regression analysis in the early 1970s, we were taught that a good model meets three criteria:
- It has a high R², meaning that the model explains most of the variation in the dependent variable.
- It has statistically significant regression coefficients for its independent variables (meaning that the variables are likely related to the dependent variable rather than related only by chance).
- It has positive or negative signs on its coefficients that square with theory. We would expect home prices to be positively related to both floor area and proximity to transit.
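All three criteria can be checked directly from a fitted model. Here is a minimal sketch in Python with NumPy, using invented home-sale data — the variable names, coefficients, and sample are purely illustrative, not drawn from any study discussed here:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic home sales: floor area (sq. ft.) and distance to transit (miles).
floor_area = rng.uniform(800, 3000, n)
transit_dist = rng.uniform(0.1, 5.0, n)

# Assumed "true" relationship: price rises with area, falls with distance.
price = 50_000 + 120 * floor_area - 15_000 * transit_dist + rng.normal(0, 20_000, n)

# Least squares fit, with a column of ones for the intercept.
X = np.column_stack([np.ones(n), floor_area, transit_dist])
beta, *_ = np.linalg.lstsq(X, price, rcond=None)

# Criterion 1: R-squared, the share of variance explained.
resid = price - X @ beta
r2 = 1 - resid.var() / price.var()

# Criterion 2: t-statistics (coefficient / standard error);
# |t| > 2 is the usual rough threshold for statistical significance.
sigma2 = (resid @ resid) / (n - X.shape[1])
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_stats = beta / se

# Criterion 3: signs that square with theory — positive for floor area,
# negative for distance (i.e., positive for proximity to transit).
print(r2, beta[1:], t_stats[1:])
```

In practice a statistical package reports all of these in one summary table; the point of spelling them out is to show what each criterion is actually measuring.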
As a doctoral student, I learned that there are other model criteria as well. There should be no extreme outlying values, because they have undue influence on the values of regression coefficients. The error term, which is what is left unexplained after the independent variables do all the explaining they can, must be uncorrelated with the independent variables. Likewise, the independent variables cannot be too highly correlated with each other, because the resulting condition, multicollinearity, means that the regression coefficients have large standard errors.
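The link between multicollinearity and inflated standard errors is easy to demonstrate with synthetic data. A sketch, assuming NumPy; all the numbers are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
noise = rng.normal(0, 1, n)

def slope_se(x1, x2, y):
    """Standard errors of the two slope coefficients from a least squares fit."""
    X = np.column_stack([np.ones(len(y)), x1, x2])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = (resid @ resid) / (len(y) - X.shape[1])
    return np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))[1:]

# Case 1: independent predictors.
a = rng.normal(0, 1, n)
b = rng.normal(0, 1, n)
se_indep = slope_se(a, b, 2 * a + 3 * b + noise)

# Case 2: the second predictor is nearly a copy of the first.
b_corr = a + rng.normal(0, 0.05, n)
se_collinear = slope_se(a, b_corr, 2 * a + 3 * b_corr + noise)

print(se_indep, se_collinear)  # collinear standard errors are far larger
```

With nearly identical predictors, the data cannot tell the two effects apart, so the fit assigns them almost arbitrarily — which is exactly what a large standard error is reporting.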
Finally, all relevant independent variables must be included in the regression equation, or the omitted variables may bias the regression coefficients of the included variables. If transit service is typically provided close to parks, then the exclusion of one variable, proximity to parks, will bias the regression coefficient of the other, proximity to transit.
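The park-and-transit example can be simulated to show the bias directly. A minimal sketch, assuming NumPy; the "true" coefficients and correlation are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

# Hypothetical setup: transit stops tend to be built near parks,
# so the two proximity measures are correlated.
near_transit = rng.normal(0, 1, n)
near_parks = 0.7 * near_transit + rng.normal(0, 0.7, n)

# Assumed true model: both amenities raise price by 10 units.
price = 10 * near_transit + 10 * near_parks + rng.normal(0, 5, n)

# Full model recovers the transit coefficient (about 10).
X_full = np.column_stack([np.ones(n), near_transit, near_parks])
beta_full, *_ = np.linalg.lstsq(X_full, price, rcond=None)

# Omitting parks biases the transit coefficient upward: transit proximity
# absorbs the park effect it is correlated with (about 10 + 0.7 * 10 = 17).
X_short = np.column_stack([np.ones(n), near_transit])
beta_short, *_ = np.linalg.lstsq(X_short, price, rcond=None)

print(beta_full[1], beta_short[1])
```

The omitted-variable bias is not noise that averages out with more data; it is systematic, and no amount of additional observations will remove it.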
Another potential pitfall of regression analysis, at least in the spatial contexts in which planners often work, is called spatial autocorrelation. Spatial autocorrelation is the degree to which values of the dependent variable in a regression analysis are correlated with values for nearby persons or properties in geographic space. My house price affects my neighbors' house prices, and vice versa.
Relatedly, spatial autocorrelation may also refer to the degree to which the error term for particular individuals is correlated with the error for nearby individuals. If spatial autocorrelation exists in the error term of a regression model, it violates the standard regression assumption of independently distributed errors.
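A standard diagnostic for this is Moran's I, which is near zero for independently distributed errors and pushes toward one when nearby errors move together. Here is a minimal sketch, assuming NumPy; the distance-based weighting scheme and the data are invented for illustration:

```python
import numpy as np

def morans_i(values, weights):
    """Moran's I spatial autocorrelation statistic.
    weights[i, j] > 0 when observations i and j are neighbors."""
    z = values - values.mean()
    return len(values) / weights.sum() * (weights * np.outer(z, z)).sum() / (z @ z)

rng = np.random.default_rng(3)
n = 100
coords = rng.uniform(0, 10, (n, 2))  # house locations on a 10 x 10 map

# Binary weights: two houses are neighbors if within distance 1.5.
dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
W = ((dist > 0) & (dist < 1.5)).astype(float)

# Spatially clustered "residuals": each value reflects a smooth local shock.
clustered = np.sin(coords[:, 0]) + np.cos(coords[:, 1]) + rng.normal(0, 0.3, n)

# Independent residuals, the way standard regression assumes them to be.
independent = rng.normal(0, 1, n)

print(morans_i(clustered, W), morans_i(independent, W))
```

Dedicated spatial packages (PySAL, for example) implement this statistic with significance tests and the spatial regression models that correct for it; the hand-rolled version above is only meant to show what is being measured.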
This last problem was hardly even recognized until pointed out in the late 1980s by Luc Anselin of the School of Geographical Sciences and Urban Planning at Arizona State University — who happens to be the most cited academic planner, according to a recent citation analysis by Tom Sanchez at Virginia Tech. Anselin's book, Spatial Econometrics: Methods and Models, moved the discipline of spatial econometrics to the mainstream of econometrics.
Which leads me to the featured article in this month's column: "Do All Impact Fees Affect Housing Prices the Same?" by Shishir Mathur of San Jose State University, just published in the Journal of Planning Education and Research. This study uses least squares regression to relate home prices to a host of variables, including impact fee levels in the communities where the houses sold. Its conclusion: Impact fees generally raise the price of new housing, but with exceptions. For example, a park impact fee increases the price of new and existing housing, whereas a fire protection impact fee doesn't.
Mathur's study is sophisticated methodologically. In particular, it tests for spatial autocorrelation and recomputes the regressions to take into account the proximity of houses sold. So it is surprising that the study falls victim to a common error of multivariate studies: model underspecification due to omitted variables.
There are literally hundreds of hedonic price studies that have tested dozens of other variables such as proximity to parks and transit, and found they significantly affect housing prices. These and other variables are prominently absent from Mathur's study. The exclusion of such variables could well lead to biased coefficient values for the variables that are included in the models, or even be the cause of the spatial autocorrelation that is detected in the spatial diagnostics.
As I said, this is complicated stuff. I would recommend that both practitioners and academics use statistical consultants when they run into econometric problems they cannot handle, as I have with Bill Greene of New York University, author of the widely used text, Econometric Analysis. In short, times have changed. Our grandparents' regression simply doesn't cut it for today's planners.
Reid Ewing is a professor of city and metropolitan planning at the University of Utah and an associate editor of the Journal of the American Planning Association. He is the coauthor (with Keith Bartholomew) of the literature review "Hedonic Price Effects of Pedestrian- and Transit-Oriented Development," Journal of Planning Literature, 2011. Past columns are available at http://mrc.cap.utah.edu/publications/research-you-can-use/.