What's a good value for R-squared?

Linear regression models

Notes on linear regression analysis (pdf file)

Introduction to linear regression analysis

Mathematics of simple regression

Regression examples

· Baseball batting averages

· Beer sales vs. price, part 1: descriptive analysis

· Beer sales vs. price, part 2: fitting a simple model

· Beer sales vs. price, part 3: transformations of variables

· Beer sales vs. price, part 4: additional predictors

· NC natural gas consumption vs. temperature

· More regression datasets at regressit.com

What to look for in regression output

What's a good value for R-squared?

What's the bottom line? How to compare models

Testing the assumptions of linear regression

Additional notes on regression analysis

Stepwise and all-possible-regressions

Excel file with simple regression formulas

Excel file with regression formulas in matrix form

Notes on logistic regression (new!)

If you use Excel in your work or in your teaching to any extent, you should check out the latest release of RegressIt, a free Excel add-in for linear and logistic regression. See it at regressit.com. The linear regression version runs on both PCs and Macs and has a richer and easier-to-use interface and much better designed output than other add-ins for statistical analysis. It may make a good complement if not a substitute for whatever regression software you are currently using, Excel-based or otherwise. RegressIt is an excellent tool for interactive presentations, online teaching of regression, and development of videos of examples of regression modeling. It includes extensive built-in documentation and pop-up teaching notes as well as some novel features to support systematic grading and auditing of student work on a large scale. There is a separate logistic regression version with highly interactive tables and charts that runs on PCs. RegressIt also now includes a two-way interface with R that allows you to run linear and logistic regression models in R without writing any code whatsoever.

If you have been using Excel's own Data Analysis add-in for regression (Analysis Toolpak), this is the time to stop. It has not changed since it was first introduced in 1993, and it was a poor design even then. It's a toy (a clumsy one at that), not a tool for serious work. Visit this page for a discussion: What's wrong with Excel's Analysis Toolpak for regression

What's a good value for R-squared?

Percent of variance explained vs. percent of standard deviation explained

An example in which R-squared is a poor guide to analysis

Guidelines for interpreting R-squared

The question is often asked: "what's a good value for R-squared?" or "how big does R-squared need to be for the regression model to be valid?" Sometimes the claim is even made: "a model is not useful unless its R-squared is at least x", where x may be some fraction greater than 50%. The correct response to this question is polite laughter followed by: "That depends!" A former student of mine landed a job at a top consulting firm by being the only candidate who gave that answer during his interview.

R-squared is the "percent of variance explained" by the model. That is, R-squared is the fraction by which the variance of the errors is less than the variance of the dependent variable. (The latter number would be the error variance for a constant-only model, which merely predicts that every observation will equal the sample mean.) It is called R-squared because in a simple regression model it is just the square of the correlation between the dependent and independent variables, which is commonly denoted by "r". In a multiple regression model R-squared is determined by pairwise correlations among all the variables, including correlations of the independent variables with each other as well as with the dependent variable. In the latter setting, the square root of R-squared is known as "multiple R", and it is equal to the correlation between the dependent variable and the regression model's predictions for it. (Note: if the model does not include a constant, which is a so-called "regression through the origin", then R-squared has a different definition. See this page for more details. You cannot compare R-squared between a model that includes a constant and one that does not.)
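The identity between R-squared and the squared correlation in simple regression is easy to check numerically. Here is a minimal sketch (the data below is simulated for illustration and is not from the auto sales example):

```python
import numpy as np

# Made-up data with a rough linear relationship.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=2.0, size=x.size)

# Fit y = b0 + b1*x by ordinary least squares.
b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

# R-squared: the fraction by which the error variance is less than
# the variance of the dependent variable.
r_squared = 1.0 - np.var(residuals) / np.var(y)

# In a simple regression with a constant, this equals the squared
# correlation "r" between x and y.
r = np.corrcoef(x, y)[0, 1]
print(round(r_squared, 6), round(r**2, 6))
```

The two printed numbers agree to machine precision, which is exactly the point of the name "R-squared".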

Generally it is better to look at adjusted R-squared rather than R-squared and to look at the standard error of the regression rather than the standard deviation of the errors. These are unbiased estimators that correct for the sample size and number of coefficients estimated. Adjusted R-squared is always smaller than R-squared, but the difference is usually very small unless you are trying to estimate too many coefficients from too small a sample in the presence of too much noise. Specifically, adjusted R-squared is equal to 1 minus (n - 1)/(n - k - 1) times 1-minus-R-squared, where n is the sample size and k is the number of independent variables. (It is possible that adjusted R-squared is negative if the model is too complex for the sample size and/or the independent variables have too little predictive value, and some software just reports that adjusted R-squared is zero in that case.) Adjusted R-squared bears the same relation to the standard error of the regression that R-squared bears to the standard deviation of the errors: one necessarily goes up when the other goes down for models fitted to the same sample of the same dependent variable.
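The adjustment formula just given is a one-liner; the sample sizes and coefficient counts below are made-up numbers chosen only to show the two extremes:

```python
# Adjusted R-squared = 1 - (n - 1)/(n - k - 1) * (1 - R-squared),
# for n observations and k independent variables.
def adjusted_r_squared(r2, n, k):
    return 1.0 - (n - 1) / (n - k - 1) * (1.0 - r2)

# With a generous sample the penalty is tiny...
print(adjusted_r_squared(0.75, n=300, k=3))
# ...but an overfitted model on a small noisy sample can go negative.
print(adjusted_r_squared(0.10, n=12, k=5))
```

The second call illustrates the parenthetical remark above: too many coefficients, too little data, and too little predictive value drive adjusted R-squared below zero.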

Now, what is the relevant variance that requires explanation, and how much or how little explanation is necessary or useful? There is a huge range of applications for linear regression analysis in science, medicine, engineering, economics, finance, marketing, manufacturing, sports, etc. In some situations the variables under consideration have very strong and intuitively obvious relationships, while in other situations you may be looking for very weak signals in very noisy data. The decisions that depend on the analysis could have either narrow or wide margins for prediction error, and the stakes could be small or large. For example, in medical research, a new drug treatment might have highly variable effects on individual patients, in comparison to alternative treatments, and yet have statistically significant benefits in an experimental study of thousands of subjects. That is to say, the amount of variance explained when predicting individual outcomes could be small, and yet the estimates of the coefficients that measure the drug's effects could be significantly different from zero (as measured by low P-values) in a large sample. A result like this could save many lives over the long run and be worth millions of dollars in profits if it results in the drug's approval for widespread use.

Even in the context of a single statistical decision problem, there may be many ways to frame the analysis, resulting in different standards and expectations for the amount of variance to be explained in the linear regression stage. We have seen by now that there are many transformations that may be applied to a variable before it is used as a dependent variable in a regression model: deflation, logging, seasonal adjustment, differencing. All of these transformations will change the variance and may also change the units in which variance is measured. Logging completely changes the units of measurement: roughly speaking, the error measures become percentages rather than absolute amounts, as explained here. Deflation and seasonal adjustment also change the units of measurement, and differencing usually reduces the variance dramatically when applied to nonstationary time series data. Therefore, if the dependent variable in the regression model has already been transformed in some way, it is possible that much of the variance has already been "explained" merely by that process. With respect to which variance should improvement be measured in such cases: that of the original series, the deflated series, the seasonally adjusted series, the differenced series, or the logged series? You cannot meaningfully compare R-squared between models that have used different transformations of the dependent variable, as the example below will illustrate.

Moreover, variance is a hard quantity to think about because it is measured in squared units (dollars squared, beer cans squared...). It is easier to think in terms of standard deviations, because they are measured in the same units as the variables and they directly determine the widths of confidence intervals. So, it is instructive to also consider the "percent of standard deviation explained," i.e., the percent by which the standard deviation of the errors is less than the standard deviation of the dependent variable. This is equal to one minus the square root of 1-minus-R-squared. Here is a table that shows the conversion:

R-squared    Percent of standard deviation explained
10%           5.1%
25%          13.4%
50%          29.3%
75%          50.0%
90%          68.4%
95%          77.6%
99%          90.0%
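The conversion follows directly from the formula stated above (one minus the square root of 1-minus-R-squared), so it can be reproduced in a few lines; this sketch is not part of the original notes:

```python
import math

# Percent of standard deviation explained = 1 - sqrt(1 - R-squared).
def pct_stdev_explained(r2):
    return 1.0 - math.sqrt(1.0 - r2)

for r2 in (0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99):
    print(f"R-squared {r2:.0%} -> {pct_stdev_explained(r2):.1%} of std. dev. explained")
```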

For example, if the model's R-squared is 90%, the variance of its errors is 90% less than the variance of the dependent variable and the standard deviation of its errors is 68% less than the standard deviation of the dependent variable. That is, the standard deviation of the regression model's errors is about 1/3 the size of the standard deviation of the errors that you would get with a constant-only model. That's very good, but it doesn't sound quite as impressive as "NINETY PERCENT EXPLAINED!".

If the model's R-squared is 75%, the standard deviation of the errors is exactly one-half of the standard deviation of the dependent variable. Now, suppose that the addition of another variable or two to this model increases R-squared to 76%. That's better, right? Well, by the formula above, this increases the percent of standard deviation explained from 50% to 51%, which means the standard deviation of the errors is reduced from 50% of that of the constant-only model to 49%, a shrinkage of 2% in relative terms. Confidence intervals for forecasts produced by the second model would therefore be about 2% narrower than those of the first model, on average, not enough to notice on a graph. You should ask yourself: is that worth the increase in model complexity?

An increase in R-squared from 75% to 80% would reduce the error standard deviation by about 10% in relative terms. That begins to rise to the level of a perceptible reduction in the widths of confidence intervals. But don't forget, confidence intervals are realistic guides to the accuracy of predictions only if the model's assumptions are correct. When adding more variables to a model, you need to think about the cause-and-effect assumptions that implicitly go with them, and you should also look at how their addition changes the estimated coefficients of other variables. Do they become easier to explain, or harder? And do the residual stats and plots indicate that the model's assumptions are OK? If they aren't, then you shouldn't be obsessing over small improvements in R-squared anyway. Your problems lie elsewhere.

Another handy rule of thumb: for small values (R-squared less than 25%), the percent of standard deviation explained is roughly one-half of the percent of variance explained. So, for example, a model with an R-squared of 10% yields errors that are 5% smaller than those of a constant-only model, on average.

How big an R-squared is "big enough", or cause for celebration or despair? That depends on the decision-making situation, and it depends on your objectives or needs, and it depends on how the dependent variable is defined. In some situations it might be reasonable to hope and expect to explain 99% of the variance, or equivalently 90% of the standard deviation of the dependent variable. In other cases, you might consider yourself to be doing very well if you explained 10% of the variance, or equivalently 5% of the standard deviation, or perhaps even less. The following section gives an example that highlights these issues. If you want to skip the example and go straight to the concluding comments, click here.


An example in which R-squared is a poor guide to analysis: Consider the U.S. monthly auto sales series that was used for illustration in the first chapter of these notes, whose graph is reproduced here:

[Chart: monthly U.S. auto sales, $billions, January 1970 to February 1996]

The units are $billions and the date range shown here is from January 1970 to February 1996. Suppose that the objective of the analysis is to predict monthly auto sales from monthly total personal income. I am using these variables (and this antiquated date range) for two reasons: (i) this very (silly) example was used to illustrate the benefits of regression analysis in a textbook that I was using in that era, and (ii) I have seen many students undertake self-designed forecasting projects in which they have blindly fitted regression models using macroeconomic indicators such as personal income, gross domestic product, unemployment, and stock prices as predictors of nearly everything, the logic being that they reflect the general state of the economy and therefore have implications for every kind of business activity. Perhaps so, but the question is whether they do it in a linear, additive fashion that stands out against the background noise in the variable that is to be predicted, and whether they adequately explain time patterns in the data, and whether they yield useful predictions and inferences in comparison to other ways in which you might choose to spend your time.

The corresponding graph of personal income (also in $billions) looks like this:

[Chart: monthly U.S. personal income, $billions]

There is no seasonality in the income data. In fact, there is almost no pattern in it at all except for a trend that increased slightly in the earlier years. (This is not a good sign if we hope to get forecasts that have any specificity.) By comparison, the seasonal pattern is the most striking feature in the auto sales, so the first thing that needs to be done is to seasonally adjust the latter. Seasonally adjusted auto sales (independently obtained from the same government source) and personal income line up like this when plotted on the same graph:

[Chart: seasonally adjusted auto sales and personal income on the same graph]

The strong and generally similar-looking trends suggest that we will get a very high value of R-squared if we regress sales on income, and indeed we do. Here is the summary table for that regression:

[Regression output: seasonally adjusted auto sales vs. personal income]

Adjusted R-squared is almost 97%! However, a result like this is to be expected when regressing a strongly trended series on any other strongly trended series, regardless of whether they are logically related. Here are the line fit plot and residuals-vs-time plot for the model:

[Line fit plot for the sales-vs-income regression]

[Residuals-vs-time plot for the sales-vs-income regression]

The residual-vs-time plot indicates that the model has some terrible problems. First, there is very strong positive autocorrelation in the errors, i.e., a tendency to make the same error many times in a row. In fact, the lag-1 autocorrelation is 0.77 for this model. It is clear why this happens: the two curves do not have exactly the same shape. The trend in the auto sales series tends to vary over time while the trend in income is much more consistent, so the two variables get out of sync with each other. This is typical of nonstationary time series data. Second, the model's largest errors have occurred in the more recent years and especially in the last few months (at the "business end" of the data, as I like to say), which means that we should expect the next few errors to be huge too, given the strong positive correlation between consecutive errors. And finally, the local variance of the errors increases steadily over time. The reason for this is that random variations in auto sales (like most other measures of macroeconomic activity) tend to be consistent over time in percentage terms rather than absolute terms, and the absolute level of the series has risen dramatically due to a combination of inflationary growth and real growth. As the level has grown, the variance of the random fluctuations has grown with it. Confidence intervals for forecasts in the near future will therefore be way too narrow, being based on average error sizes over the whole history of the series. So, despite the high value of R-squared, this is a very bad model.
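The lag-1 autocorrelation statistic quoted above is straightforward to compute from a residual series. The original residuals are not reproduced here, so this sketch simulates autocorrelated errors (an AR(1) process with coefficient 0.8, a made-up stand-in) to show the calculation:

```python
import numpy as np

# Sample lag-1 autocorrelation: correlation of each error with the previous one.
def lag1_autocorrelation(e):
    e = np.asarray(e, dtype=float) - np.mean(e)
    return float(np.sum(e[1:] * e[:-1]) / np.sum(e * e))

# Simulated errors that "make the same error many times in a row":
# an AR(1) process, standing in for the regression residuals.
rng = np.random.default_rng(1)
e = np.zeros(300)
for t in range(1, e.size):
    e[t] = 0.8 * e[t - 1] + rng.normal()

print(lag1_autocorrelation(e))  # large and positive for this kind of series
```

A strongly positive value like the 0.77 reported for the sales-vs-income model is a symptom of the mismatch in trends, not of a useful relationship.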

One way to try to improve the model would be to deflate both series first. This would at least eliminate the inflationary component of growth, which hopefully will make the variance of the errors more consistent over time. Here is a time series plot showing auto sales and personal income after they have been deflated by dividing them by the U.S. all-product consumer price index (CPI) at each point in time, with the CPI normalized to a value of 1.0 in February 1996 (the last row of the data). This does indeed flatten out the trend somewhat, and it also brings out some fine detail in the month-to-month variations that was not so apparent on the original plot. In particular, we begin to see some small bumps and wiggles in the income data that roughly line up with larger bumps and wiggles in the auto sales data.
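The deflation step itself is just elementwise division after rescaling the index. A minimal sketch with made-up numbers (the real CPI and sales values are not reproduced here):

```python
import numpy as np

cpi = np.array([0.25, 0.50, 0.80, 1.00])        # hypothetical index values
auto_sales = np.array([4.0, 9.0, 14.0, 20.0])   # $billions, current dollars

# Normalize the CPI so the last row equals 1.0, as described in the text,
# then divide to express every observation in last-period dollars.
cpi_normalized = cpi / cpi[-1]
sales_deflated = auto_sales / cpi_normalized
print(sales_deflated)
```

Note that the last observation is unchanged by construction, while the earlier (high-inflation-era) values are scaled up into constant dollars.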

[Chart: deflated auto sales and deflated personal income]

If we fit a simple regression model to these two variables, the following results are obtained:

[Regression output, line fit plot, and residuals-vs-time plot for the deflated-data model]

Adjusted R-squared is only 0.788 for this model, which is worse, right? Well, no. We "explained" some of the variance in the original data by deflating it prior to fitting this model. Because the dependent variables are not the same, it is not appropriate to do a head-to-head comparison of R-squared. Arguably this is a better model, because it separates out the real growth in sales from the inflationary growth, and also because the errors have a more consistent variance over time. (The latter issue is not the bottom line, but it is a step in the direction of fixing the model assumptions.) Most interestingly, the deflated income data shows some fine detail that matches up with similar patterns in the sales data. However, the error variance is still a long way from being constant over the full two-and-a-half decades, and the problems of badly autocorrelated errors and a particularly bad fit to the most recent data have not been solved.

Another statistic that we might be tempted to compare between these two models is the standard error of the regression, which normally is the best bottom-line statistic to focus on. The second model's standard error is much larger: 3.253 vs. 2.218 for the first model. But wait... these two numbers cannot be directly compared, either, because they are not measured in the same units. The standard error of the first model is measured in units of current dollars, while the standard error of the second model is measured in units of 1996 dollars. Those were decades of high inflation, and 1996 dollars were not worth nearly as much as dollars were worth in the earlier years. (In fact, a 1996 dollar was only worth about one-quarter of a 1970 dollar.)

The slope coefficients in the two models are also of interest. Because the units of the dependent and independent variables are the same in each model (current dollars in the first model, 1996 dollars in the second model), the slope coefficient can be interpreted as the predicted increase in dollars spent on autos per dollar of increase in income. The slope coefficients in the two models are nearly identical: 0.086 and 0.087, implying that on the margin, 8.6% to 8.7% of additional income is spent on autos.

Let's now try something totally different: fitting a simple time series model to the deflated data. In particular, let's fit a random-walk-with-drift model, which is logically equivalent to fitting a constant-only model to the first difference (period to period change) in the original series. Let the differenced series be called AUTOSALES_SADJ_1996_DOLLARS_DIFF1 (which is the name that would be automatically assigned in RegressIt). Notice that we are now 3 levels deep in data transformations: seasonal adjustment, deflation, and differencing! This sort of situation is very common in time series analysis. Here are the results of fitting this model, in which AUTOSALES_SADJ_1996_DOLLARS_DIFF1 is the dependent variable and there are no independent variables, just the constant. This model merely predicts that each monthly difference will be the same, i.e., it predicts constant growth relative to the previous month's value.
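The random-walk-with-drift model is so simple that the whole fit can be sketched in a few lines. The series below is made up for illustration; the drift is the mean of the first differences, and the forecast is last value plus drift:

```python
import numpy as np

# Hypothetical (already seasonally adjusted and deflated) series.
y = np.array([100.0, 102.5, 101.0, 104.0, 107.5, 106.0, 109.0])
diff1 = np.diff(y)            # period-to-period changes; sample shrinks by 1

# Fitting "only the constant" to the differences: the constant is their mean.
drift = diff1.mean()
forecast_next = y[-1] + drift  # constant growth relative to the last value

# Standard error of the regression for a constant-only model is just the
# sample standard deviation of the differences (zero predictors, df = n - 1).
se_regression = diff1.std(ddof=1)
print(drift, forecast_next, se_regression)
```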

[Regression output for the constant-only model fitted to the differenced series]

Adjusted R-squared has dropped to zero! This is not a problem: a constant-only regression always has an R-squared of zero, but that doesn't necessarily imply that it is not a good model for the particular dependent variable that has been used. We should look instead at the standard error of the regression. The units and sample of the dependent variable are the same for this model as for the previous one, so their regression standard errors can be legitimately compared. (The sample size for the second model is actually 1 less than that of the first model due to the lack of a period-zero value for computing a period-1 difference, but this is insignificant in such a large data set.) The regression standard error of this model is only 2.111, compared to 3.253 for the previous one, a reduction of roughly one-third, which is a very significant improvement. (The residual-vs-time plots for this model and the previous one have the same vertical scaling: look at them both and compare the size of the errors, particularly those that have occurred recently.) The reason why this model's forecasts are so much more accurate is that it looks at last month's actual sales value, whereas the previous model only looked at personal income data. It is often the case that the best information about where a time series is going to go next is where it has been lately.

There is no line fit plot for this model, because there is no independent variable, but here is the residual-versus-time plot:

[Residuals-vs-time plot for the constant-only model fitted to the differenced series]

These residuals look quite random to the naked eye, but they actually exhibit negative autocorrelation, i.e., a tendency to alternate between overprediction and underprediction from one month to the next. (The lag-1 autocorrelation here is -0.356.) This often happens when differenced data is used, but overall the errors of this model are much closer to being independently and identically distributed than those of the previous two, so we can have a good deal more confidence in any confidence intervals for forecasts that may be computed from it. Of course, this model does not shed light on the relationship between personal income and auto sales.

So, what is the relationship between auto sales and personal income? That is a complex question and it will not be further pursued here except to note that there are some other simple things we could do besides fitting a regression model. For example, we could compute the percentage of income spent on automobiles over time, i.e., just divide the auto sales series by the personal income series and see what the pattern looks like. Here is the resulting picture:

[Chart: auto sales as a percentage of personal income over time]

This chart nicely illustrates cyclical variations in the fraction of income spent on autos, which would be interesting to try to match up with other explanatory variables. The range is from about 7% to about 10%, which is generally consistent with the slope coefficients that were obtained in the two regression models (8.6% and 8.7%). However, this chart re-emphasizes what was seen in the residual-vs-time charts for the simple regression models: the fraction of income spent on autos is not consistent over time. In particular, notice that the fraction was increasing toward the end of the sample, exceeding 10% in the last month.
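The ratio series behind this kind of chart is the simplest computation in the whole analysis: an elementwise division. The numbers below are made up, chosen only to fall in the 7%-10% range discussed above:

```python
import numpy as np

auto_sales = np.array([20.0, 24.0, 30.0])         # $billions, hypothetical
personal_income = np.array([250.0, 300.0, 310.0])  # $billions, hypothetical

# Fraction of income spent on autos in each period.
fraction_spent = auto_sales / personal_income
print(np.round(fraction_spent, 3))
```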

The bottom line here is that R-squared was not of any use in guiding us through this particular analysis toward better and better models. In fact, among the models considered above, the worst one had an R-squared of 97% and the best one had an R-squared of zero. At various stages of the analysis, data transformations were suggested: seasonal adjustment, deflation, differencing. (Logging was not tried here, but would have been an alternative to deflation.) And every time the dependent variable is transformed, it becomes impossible to make meaningful before-and-after comparisons of R-squared. Furthermore, regression was probably not even the best tool to use here in order to study the relation between the two variables. It is not a "universal wrench" that should be used on every problem.

So, what IS a good value for R-squared? It depends on the variable with respect to which you measure it, it depends on the units in which that variable is measured and whether any data transformations have been applied, and it depends on the decision-making context. If the dependent variable is a nonstationary (e.g., trending or random-walking) time series, an R-squared value very close to 1 (such as the 97% figure obtained in the first model above) may not be very impressive. In fact, if R-squared is very close to 1, and the data consists of time series, this is usually a bad sign rather than a good one: there will often be significant time patterns in the errors, as in the example above. On the other hand, if the dependent variable is a properly stationarized series (e.g., differences or percentage differences rather than levels), then an R-squared of 25% may be quite good. In fact, an R-squared of 10% or even less could have some information value when you are looking for a weak signal in the presence of a lot of noise in a setting where even a very weak one would be of general interest. Sometimes there is a lot of value in explaining only a very small fraction of the variance, and sometimes there isn't. Data transformations such as logging or deflating also change the interpretation and standards for R-squared, inasmuch as they change the variance you start out with.

However, be very careful when evaluating a model with a low value of R-squared. In such a situation: (i) it is better if the set of variables in the model is determined a priori (as in the case of a designed experiment or a test of a well-posed hypothesis) rather than by searching among a lineup of randomly selected suspects; (ii) the data should be clean (not contaminated by outliers, inconsistent measurements, or ambiguities in what is being measured, as in the case of poorly worded surveys given to unmotivated subjects); (iii) the coefficient estimates should be individually or at least jointly significantly different from zero (as measured by their P-values and/or the P-value of the F statistic), which may require a large sample to achieve in the presence of low correlations; and (iv) it is a good idea to do cross-validation (out-of-sample testing) to see if the model performs about equally well on data that was not used to identify or estimate it, particularly when the structure of the model was not known a priori. It is easy to find spurious (accidental) correlations if you go on a fishing expedition in a large pool of candidate independent variables while using low standards for acceptance. I have often had students use this approach to try to predict stock returns using regression models--which I do not recommend--and it is not uncommon for them to find models that yield R-squared values in the range of 5% to 10%, but they virtually never survive out-of-sample testing. (You should buy index funds instead.)

There are a variety of ways in which to cross-validate a model. A discussion of some of them can be found here. If your software doesn't offer such options, there are simple tests you can conduct on your own. One is to split the data set in half and fit the model separately to both halves to see if you get similar results in terms of coefficient estimates and adjusted R-squared.
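The split-half check just described can be sketched in a few lines. The data here is simulated (a stable linear relationship, so the two halves should agree); with real data, a large disagreement between the halves is the warning sign:

```python
import numpy as np

# Simulated data with a stable relationship: y = 3 + 0.5*x + noise.
rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 3.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)

# Fit the same simple regression separately to each half of the data.
half = x.size // 2
slope_a, intercept_a = np.polyfit(x[:half], y[:half], 1)
slope_b, intercept_b = np.polyfit(x[half:], y[half:], 1)

# Similar coefficient estimates across halves suggest a stable model.
print(round(slope_a, 3), round(slope_b, 3))
```

For time series data, splitting into an early half and a late half (rather than at random) is the more demanding and more realistic version of this test.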

When working with time series data, if you compare the standard deviation of the errors of a regression model which uses exogenous predictors against that of a simple time series model (say, an autoregressive or exponential smoothing or random walk model), you may be disappointed by what you find. If the variable to be predicted is a time series, it will often be the case that most of the predictive power is derived from its own history via lags, differences, and/or seasonal adjustment. This is the reason why we spent some time studying the properties of time series models before tackling regression models.

A rule of thumb for small values of R-squared: If R-squared is small (say 25% or less), then the fraction by which the standard deviation of the errors is less than the standard deviation of the dependent variable is approximately one-half of R-squared, as shown in the table above. So, for example, if your model has an R-squared of 10%, then its errors are only about 5% smaller on average than those of a constant-only model, which merely predicts that everything will equal the mean. Is that enough to be useful, or not? Another handy reference point: if the model has an R-squared of 75%, its errors are 50% smaller on average than those of a constant-only model. (This is not an approximation: it follows directly from the fact that reducing the error standard deviation to ½ of its former value is equivalent to reducing its variance to ¼ of its former value.)
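It is easy to see how good this rule of thumb is by comparing it against the exact formula (one minus the square root of 1-minus-R-squared) at a few small values:

```python
import math

# Rule of thumb: for small R-squared, 1 - sqrt(1 - R^2) is roughly R^2 / 2.
for r2 in (0.05, 0.10, 0.25):
    exact = 1.0 - math.sqrt(1.0 - r2)
    print(f"R-squared {r2:.2f}: exact {exact:.4f} vs. rule of thumb {r2 / 2:.4f}")
```

The approximation is very tight at 5% and 10% and starts to drift by 25%, which is why the rule is stated for small values only.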

In general you should look at adjusted R-squared rather than R-squared. Adjusted R-squared is an unbiased estimate of the fraction of variance explained, taking into account the sample size and number of variables. Usually adjusted R-squared is only slightly smaller than R-squared, but it is possible for adjusted R-squared to be zero or negative if a model with insufficiently informative variables is fitted to too small a sample of data.

What measure of your model's explanatory power should you report to your boss or client or instructor? If you used regression analysis, then to be perfectly candid you should of course include the adjusted R-squared for the regression model that was actually fitted (whether to the original data or some transformation thereof), along with other details of the output, somewhere in your report. You should more strongly emphasize the standard error of the regression, though, because that measures the predictive accuracy of the model in real terms, and it scales the width of all confidence intervals calculated from the model. You may also want to report other practical measures of error size such as the mean absolute error or mean absolute percentage error and/or mean absolute scaled error.

What should never happen to you: Don't ever let yourself fall into the trap of fitting (and then promoting!) a regression model that has a respectable-looking R-squared but is actually very much inferior to a simple time series model. If the dependent variable in your model is a nonstationary time series, be sure that you do a comparison of error measures against an appropriate time series model. Remember that what R-squared measures is the proportional reduction in error variance that the regression model achieves in comparison to a constant-only model (i.e., mean model) fitted to the same dependent variable, but the constant-only model may not be the most appropriate reference point, and the dependent variable you end up using may not be the one you started with if data transformations turn out to be important.

And finally: R-squared is not the bottom line. You don't get paid in proportion to R-squared. The real bottom line in your analysis is measured by the consequences of the decisions that you and others will make on the basis of it. In general, the important criteria for a good regression model are (a) to make the smallest possible errors, in practical terms, when predicting what will happen in the future, and (b) to derive useful inferences from the structure of the model and the estimated values of its parameters.

Go on to next topic: How to compare models
