Sensitivity Analysis in Linear Regression
Chatterjee, S. and Hadi, A. S. (1988)
New York: John Wiley & Sons
ISBN: 0-471-82216-7
Preface
The past twenty years have seen a great surge of activity in the general area of model fitting. The linear regression model fitted by least squares is undoubtedly the most widely used statistical procedure. In this book we concentrate on one important aspect of the fitting of linear regression models by least squares. We examine the factors that determine the fit and study the sensitivity of the fit to these factors.
Several elements determine a fitted regression equation: the variables, the observations, and the model assumptions. We study the effect of each of these factors on the fitted model in turn. The regression coefficient for a particular variable will change if a variable not currently included in the model is brought into the model. We examine methods for estimating the change and assessing its relative importance. Each observation in the data set plays a role in the fit. We study extensively the effect of a single observation, and of multiple observations, on the whole fitting procedure. Methods for the study of the joint effect of a variable and an observation are also presented.
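To make the point about an added variable concrete, here is a minimal sketch in Python with NumPy (illustrative only; it is not the APL software mentioned later in this preface, and all names and numbers in it are invented for the example). It fits a response on one explanatory variable, then refits after a correlated second variable is brought into the model, and prints how the first coefficient changes.

import numpy as np

# Simulate two correlated explanatory variables and a response that depends on both.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)        # x2 is correlated with x1
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(scale=0.5, size=n)

X_small = np.column_stack([np.ones(n), x1])          # model containing x1 only
X_full = np.column_stack([np.ones(n), x1, x2])       # model containing x1 and x2

beta_small, *_ = np.linalg.lstsq(X_small, y, rcond=None)
beta_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)

# When x2 is omitted, the coefficient of x1 absorbs part of x2's effect.
print("coefficient of x1 with x2 omitted :", beta_small[1])   # close to 2 + 3*0.8 = 4.4
print("coefficient of x1 with x2 included:", beta_full[1])    # close to 2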
Many variables included in a regression study are measured with error, but the standard least squares estimates do not take this into account. We assess the effects of measurement errors on the estimated regression coefficients. This assessment is of great importance in applications, for example epidemiological studies, where regression coefficients are used to apportion effects to different variables.
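As a standard illustration of why this matters (a textbook result quoted here for orientation, not a derivation from this book), consider simple regression in which the single explanatory variable is observed with additive, independent measurement error:

\[
  y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad
  w_i = x_i + u_i, \qquad u_i \sim (0, \sigma_u^2) \ \text{independent of } x_i \text{ and } \varepsilon_i .
\]

Least squares of $y$ on the observed $w$ then satisfies

\[
  \operatorname*{plim}_{n \to \infty} \hat{\beta}_1
  = \beta_1 \, \frac{\sigma_x^2}{\sigma_x^2 + \sigma_u^2},
\]

so the estimated coefficient is attenuated toward zero, and the attenuation grows with the measurement-error variance $\sigma_u^2$.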
The implicit assumption in least squares fitting is that the random disturbances present in the model have a Gaussian distribution. The generalized linear models proposed by Nelder and Wedderburn (1972) can be used to examine the sensitivity of the fitted model to the probability laws of the "errors" in the model. The object of this analysis is to assess qualitatively and quantitatively (numerically) the robustness of the regression fit.
This book does not aim at theory but brings together, from a practical point of view, scattered results in regression analysis. We rely heavily on examples to illustrate the theory. Besides numerical measures, we focus on diagnostic plots to assess sensitivity; rapid advances in statistical graphics make it imperative to incorporate diagnostic plots in any newly proposed methodology.
This book is divided into nine chapters and an appendix. The chapters are more or less self-contained. Chapter 1 gives a summary of the standard least squares regression results, reviews the assumptions on which these results are based, and introduces the notations that we follow in the text. Chapter 2 discusses the properties of the prediction (projection) matrix, which plays a pivotal role in regression. Chapter 3 discusses the role of variables in a regression equation. Chapters 4 and 5 examine the impact of individual and multiple observations on the fit. The nature of an observation (outlier, leverage point, influential point) is discussed in considerable detail. Chapter 6 assesses the joint impact of a variable and an observation. Chapter 7 examines the impact of measurement errors on the regression coefficients (the classical "errors-in-variables" problem) from a numerical analyst's point of view. Chapter 8 presents a methodology for examining the effect of error laws of the "random disturbances" on the estimated regression parameters. Chapter 9 outlines some of the computational methods for efficiently executing the procedures described in the previous chapters. Since some readers might not be familiar with the concept of matrix norms, the main properties of norms are presented in the Appendix. The Appendix also contains proofs of some of the results in Chapters 4 and 5.
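For readers meeting the prediction matrix for the first time, its standard definition (the matrix is also known as the hat or projection matrix) is

\[
  P = X (X^{\top} X)^{-1} X^{\top}, \qquad \hat{y} = P y, \qquad e = (I - P) y,
\]

where $X$ is the matrix of explanatory variables, $\hat{y}$ the fitted values, and $e$ the residuals; the diagonal elements $p_{ii}$ of $P$ measure the leverage of the $i$th observation.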
The methods we discuss attempt to provide the data analyst with a clear and complete picture of how and why the data at hand affect the results of a multiple regression analysis. The material should prove useful to anyone who is involved in analyzing data. For an effective use of the book, some matrix algebra and familiarity with the basic concepts of regression analysis are needed. This book could serve as a text for a second course in regression analysis or as a supplement to the basic text in the first course. The book brings together material that is often scattered in the literature and should, therefore, be a valuable addition to the basic material found in most regression texts.
Certain issues associated with least squares regression are covered only briefly and others are not covered at all, because we feel there exist excellent texts that deal with these issues. A detailed discussion of multicollinearity can be found in Chatterjee and Price (1977) and Belsley, Kuh, and Welsch (1980); transformations of response and/or explanatory variables are covered in detail in Atkinson (1985) and Carroll and Ruppert (1988); the problems of heteroscedasticity and autocorrelation are addressed in Chatterjee and Price (1977) and Judge et al. (1985); and robust regression can be found in Huber (1981) and Rousseeuw and Leroy (1987).
Interactive, menu-driven, and user-friendly computer programs implementing the statistical procedures and graphical displays presented in this book are available from Ali S. Hadi, 358 Ives Hall, Cornell University, Ithaca, NY 14851-0952. These programs are written in APL, but no knowledge of APL is required of the user. Two versions of these programs are available: one is tailored for the Macintosh and the other for the IBM PC.
Some of the material in this book has been used in courses we have taught at Cornell University and New York University. We would like to thank our many students whose comments have improved the clarity of exposition and eliminated many errors. In writing this book we have been helped by comments and encouragement from our many friends and colleagues. We would like particularly to mention Isadore Blumen, Sangit Chatterjee, Mary Dowling, Andrew Forbes, Glen Heller, Peter Lenk, Robert Ling, Philip McCarthy, Douglas Morrice, Cris Negm, Daryl Pregibon, Mary Rouse, David Ruppert, Steven Schwager, Karen Shane, Gary Simon, Jeffrey Simonoff, Leonard Stefanski, Paul Switzer, Chi-Ling Tsai, Paul Velleman, Martin Wells, and Roy Welsch.
For help in preparing the manuscript for publication we would like to thank Janet Brown, Helene Croft, and Evelyn Maybe. We would also like to thank Robert Cooke and Ted Sobel of Cooke Publications for providing us with a pre-release version of their software MathWriter, which we have used in writing the mathematical expressions in this book. We would like to thank Bea Shube of John Wiley & Sons for her patience, understanding, and encouragement.
SAMPRIT CHATTERJEE
ALI S. HADI
Eagle Island, Maine
Ithaca, New York
October, 1987
Table of Contents
1. INTRODUCTION 1
1.1. Introduction 1
1.2. Notations 2
1.3. Standard Estimation Results in Least Squares 3
1.4. Assumptions 5
1.5. Iterative Regression Process 6
1.6. Organization of the Book 6
2. PREDICTION MATRIX 9
2.1. Introduction 9
2.2. Roles of P and (I - P) in Linear Regression 10
2.3. Properties of the Prediction Matrix 14
2.3.1. General Properties 14
2.3.2. Omitting (Adding) Variables 20
2.3.3. Omitting (Adding) an Observation 21
2.3.4. Conditions for Large Values of Pii 25
2.3.5. Omitting Multiple Rows of X 27
2.3.6. Eigenvalues of P and (I - P) 28
2.3.7. Distribution of Pii 31
3. ROLE OF VARIABLES IN A REGRESSION EQUATION 39
3.1. Introduction 39
3.2. Effects of Underfitting 40
3.3. Effects of Overfitting 43
3.4. Interpreting Successive Fitting 48
3.5. Computing Implications for Successive Fitting 49
3.6. Introduction of One Additional Regressor 49
3.7. Comparing Models: Comparison Criteria 50
3.8. Diagnostic Plots for the Effects of Variables 54
3.8.1. Added Variable (Partial Regression) Plots 56
3.8.2. Residual Versus Predictor Plots 56
3.8.3. Component-Plus-Residual (Partial Residual) Plots 57
3.8.4. Augmented Partial Residual Plots 58
4. EFFECTS OF AN OBSERVATION ON A REGRESSION EQUATION 71
4.1. Introduction 71
4.2. Omission Approach 72
4.2.1. Measures Based on Residuals 72
4.2.1.1. Testing for a Single Outlier 80
4.2.1.2. Graphical Methods 83
4.2.3. Measures Based on Remoteness of Points in X-Y Space 99
4.2.3.1. Diagonal Elements of P 99
4.2.3.2. Mahalanobis Distance 102
4.2.3.3. Weighted Squared Standardized Distance 102
4.2.3.4. Diagonal Elements of Pz 104
4.2.4.1. Definition of the Influence Curve 108
4.2.4.2. Influence Curves for beta and sigma 109
4.2.4.3. Approximating the Influence Curve 113
4.2.5.1. Cook's Distance 117
4.2.5.2. Welsch-Kuh's Distance 120
4.2.5.3. Welsch's Distance 123
4.2.5.4. Modified Cook's Distance 124
4.2.6.1. Andrews-Pregibon Statistic 136
4.2.6.2. Variance Ratio 138
4.2.6.3. Cook-Weisberg Statistic 140
4.2.8. Measures Based on a Subset of the Regression Coefficients 148
4.2.8.1. Influence on a Single Regression Coefficient 149
4.2.8.2. Influence on Linear Functions of beta 151
4.2.9.1. Condition Number and Collinearity Indices 154
4.2.9.2. Collinearity-Influential Points 158
4.2.9.3. Effects of an Observation on the Condition Number 161
4.2.9.4. Diagnosing Collinearity-Influential Observations 168
4.4. Summary and Concluding Remarks 182
5. ASSESSING THE EFFECTS OF MULTIPLE OBSERVATIONS 185
5.1. Introduction 185
5.2. Measures Based on Residuals 187
5.3. Measures Based on the Influence Curve 190
5.3.1. Sample Influence Curve 191
5.3.2. Empirical Influence Curve 193
5.3.3. Generalized Cook's Distance 194
5.3.4. Generalized Welsch's Distance 195
5.4.1. Generalized Andrews-Pregibon Statistic 195
5.4.2. Generalized Variance Ratio 197
5.6. Measures Based on a Subset of the Regression Coefficients 200
5.7. Identifying Collinearity-Influential Points 201
5.8. Identifying Influential Observations by Clustering 204
5.9. Example: Demographic Data 207
6. JOINT IMPACT OF A VARIABLE AND AN OBSERVATION 211
6.1. Introduction 211
6.2. Notations 213
6.3. Impact on the Leverage Values 215
6.4. Impact on Residual Sum of Squares 217
6.5. Impact on the Fitted Values 218
6.6. Partial F-Tests 221
6.7. Summary of Diagnostic Measures 226
6.8. Examples 226
6.9. Concluding Remarks 241
7.1. Introduction 245
7.2. Errors in the Response Variable 247
7.3. Errors in X: Asymptotic Approach 248
7.4. Errors in X: Perturbation Approach 251
7.5. Errors in X: Simulation Approach 261
8.1. Introduction 263
8.2. Generalized Linear Models (GLM) 264
8.3. Exponential Family 265
8.4. Link Function 267
8.5. Parameter Estimation for GLM 267
8.6. Judging the Goodness of Fit for GLM 271
8.7. Example 273
9.1. Introduction 281
9.2. Triangular Decomposition 282
9.2.1. Definitions 282
9.2.2. Algorithm for Computing L and D 284
9.4. Properties of L and D 288
9.5. Efficient Computing of Regression Diagnostics 293
APPENDIX 297
A.1. Summary of Vector and Matrix Norms 297
A.2. Another Proof of Theorem 4.3 299
A.3. Proof of (4.60a) and (5.31a) 300
REFERENCES 301
INDEX 309