Sensitivity Analysis in Linear Regression

Chatterjee, S. and Hadi, A. S. (1988). New York: John Wiley & Sons.
ISBN: 0-471-82216-7

Preface

The past twenty years have seen a great surge of activity in the general area of model fitting. The linear regression model fitted by least squares is undoubtedly the most widely used statistical procedure. In this book we concentrate on one important aspect of the fitting of linear regression models by least squares. We examine the factors that determine the fit and study the sensitivity of the fit to these factors.

Several elements determine a fitted regression equation: the variables, the observations, and the model assumptions. We study the effect of each of these factors on the fitted model in turn. The regression coefficient for a particular variable will change when a variable not currently included in the model is brought into it; we examine methods for estimating this change and assessing its relative importance. Each observation in the data set plays a role in the fit, and we study extensively the effect of a single observation, and of multiple observations, on the whole fitting procedure. Methods for studying the joint effect of a variable and an observation are also presented.
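
As a concrete illustration of the first two of these effects, the following sketch (ours, not the book's; the data and variable names are invented) fits a least squares model with and without a second, correlated variable, and then reads the leverage of each observation off the diagonal of the prediction matrix:

```python
# A minimal numerical sketch of two of the sensitivities described above:
# how a coefficient shifts when a correlated variable enters the model, and
# how much each observation can pull on the fit (leverage). The data are
# simulated purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(scale=0.5, size=n)   # correlated with x1
y = 1.0 + 2.0 * x1 + 1.5 * x2 + rng.normal(size=n)

def ols(X, y):
    """Least squares coefficients via the normal equations."""
    return np.linalg.solve(X.T @ X, X.T @ y)

X_without = np.column_stack([np.ones(n), x1])    # model omitting x2
X_with = np.column_stack([np.ones(n), x1, x2])   # model including x2

print("coef of x1, x2 omitted: ", ols(X_without, y)[1])  # absorbs part of x2's effect
print("coef of x1, x2 included:", ols(X_with, y)[1])

# Leverage: the diagonal of the prediction (hat) matrix P = X (X'X)^{-1} X'.
P = X_with @ np.linalg.solve(X_with.T @ X_with, X_with.T)
leverage = np.diag(P)
print("largest leverage:", leverage.max(), "at observation", leverage.argmax())
```

A large shift in the coefficient of x1 signals that the two variables compete for the same explanatory role, and a large leverage value flags an observation whose position in the x-space gives it disproportionate pull on the fitted equation.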

Many variables included in a regression study are measured with error, but the standard least squares estimates do not take this into account. We assess the effects of measurement errors on the estimated regression coefficients. This assessment is of great importance in studies, for example epidemiological studies, where regression coefficients are used to apportion effects to the different variables.
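
The simplest instance of this effect can be simulated directly (our illustration, not an example from the book): when a true predictor x is observed only as w = x + u, with u a random measurement error, the least squares slope is attenuated toward zero by roughly the reliability ratio var(x) / (var(x) + var(u)).

```python
# Simulating the attenuation caused by measurement error in a predictor.
# All quantities here are invented for illustration.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
beta = 2.0
x = rng.normal(size=n)               # true predictor
u = rng.normal(size=n)               # measurement error
w = x + u                            # observed, error-contaminated predictor
y = beta * x + rng.normal(scale=0.5, size=n)

slope_true = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
slope_naive = np.cov(w, y)[0, 1] / np.var(w, ddof=1)
reliability = np.var(x, ddof=1) / (np.var(x, ddof=1) + np.var(u, ddof=1))

print("slope using true x:    ", slope_true)    # close to 2.0
print("slope using noisy w:   ", slope_naive)   # close to 2.0 * reliability
print("predicted attenuation: ", beta * reliability)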

The implicit assumption in least squares fitting is that the random disturbances present in the model have a Gaussian distribution. The generalized linear models proposed by Nelder and Wedderburn (1972) can be used to examine the sensitivity of the fitted model to the probability laws of the "errors" in the model. The object of this analysis is to assess, both qualitatively and numerically, the robustness of the regression fit.
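
One simple numerical check in this spirit (a sketch of ours using the statsmodels library, not the book's methodology) is to refit the same linear predictor under two different assumed error laws and compare the estimates; coefficients that barely move indicate a fit that is robust to the choice, while large shifts flag sensitivity.

```python
# Refitting one linear predictor under two different error laws and
# comparing the coefficient estimates. Data are simulated for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 200
x = rng.uniform(1.0, 5.0, size=n)
X = sm.add_constant(x)
# A positive response with skewed, non-Gaussian noise.
y = np.exp(0.3 + 0.2 * x) * rng.gamma(shape=5.0, scale=0.2, size=n)

# Same log link, two error distributions: Gaussian vs. Gamma.
for family in (sm.families.Gaussian(sm.families.links.Log()),
               sm.families.Gamma(sm.families.links.Log())):
    fit = sm.GLM(y, X, family=family).fit()
    print(type(family).__name__, "estimates:", fit.params)
```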

This book does not aim at theory but brings together, from a practical point of view, scattered results in regression analysis. We rely heavily on examples to illustrate the theory. Besides numerical measures, we focus on diagnostic plots to assess sensitivity; rapid advances in statistical graphics make it imperative to incorporate diagnostic plots in any newly proposed methodology.

This book is divided into nine chapters and an appendix. The chapters are more or less self-contained. Chapter 1 gives a summary of the standard least squares regression results, reviews the assumptions on which these results are based, and introduces the notation that we follow in the text. Chapter 2 discusses the properties of the prediction (projection) matrix, which plays a pivotal role in regression. Chapter 3 discusses the role of variables in a regression equation. Chapters 4 and 5 examine the impact of individual and multiple observations on the fit. The nature of an observation (outlier, leverage point, influential point) is discussed in considerable detail. Chapter 6 assesses the joint impact of a variable and an observation. Chapter 7 examines the impact of measurement errors on the regression coefficients (the classical "errors-in-variables" problem) from a numerical analyst's point of view. Chapter 8 presents a methodology for examining the effect of the error laws of the "random disturbances" on the estimated regression parameters. Chapter 9 outlines some of the computational methods for efficiently executing the procedures described in the previous chapters. Since some readers might not be familiar with the concept of matrix norms, the main properties of norms are presented in the Appendix. The Appendix also contains proofs of some of the results in Chapters 4 and 5.

The methods we discuss attempt to provide the data analyst with a clear and complete picture of how and why the data at hand affect the results of a multiple regression analysis. The material should prove useful to anyone who is involved in analyzing data. For an effective use of the book, some matrix algebra and familiarity with the basic concepts of regression analysis are needed. This book could serve as a text for a second course in regression analysis or as a supplement to the basic text in a first course. The book brings together material that is often scattered in the literature and should, therefore, be a valuable addition to the basic material found in most regression texts.

Certain issues associated with least squares regression are covered briefly and others are not covered at all, because we feel there exist excellent texts that deal with these issues. A detailed discussion of multicollinearity can be found in Chatterjee and Price (1977) and Belsley, Kuh, and Welsch (1980); transformations of response and/or explanatory variables are covered in detail in Atkinson (1985) and Carroll and Ruppert (1988); the problems of heteroscedasticity and autocorrelation are addressed in Chatterjee and Price (1977) and Judge et al. (1985); and robust regression is treated in Huber (1981) and Rousseeuw and Leroy (1987).

Interactive, menu-driven, and user-friendly computer programs implementing the statistical procedures and graphical displays presented in this book are available from Ali S. Hadi, 358 Ives Hall, Cornell University, Ithaca, NY 14851-0952. These programs are written in APL, but knowledge of APL is not necessary to use them. Two versions of the programs are available: one is tailored for the Macintosh and the other for the IBM PC.

Some of the material in this book has been used in courses we have taught at Cornell University and New York University. We would like to thank our many students whose comments have improved the clarity of exposition and eliminated many errors. In writing this book we have been helped by comments and encouragement from our many friends and colleagues. We would like particularly to mention Isadore Blumen, Sangit Chatterjee, Mary Dowling, Andrew Forbes, Glen Heller, Peter Lenk, Robert Ling, Philip McCarthy, Douglas Morrice, Cris Negm, Daryl Pregibon, Mary Rouse, David Ruppert, Steven Schwager, Karen Shane, Gary Simon, Jeffrey Simonoff, Leonard Stefanski, Paul Switzer, Chi-Ling Tsai, Paul Velleman, Martin Wells, and Roy Welsch.

For help in preparing the manuscript for publication we would like to thank Janet Brown, Helene Croft, and Evelyn Maybe. We would also like to thank Robert Cooke and Ted Sobel of Cooke Publications for providing us with a pre-release version of their software MathWriter™ that we have used in writing the mathematical expressions in this book. We would like to thank Bea Shube of John Wiley & Sons for her patience, understanding, and encouragement.

SAMPRIT CHATTERJEE
ALI S. HADI

Eagle Island, Maine
Ithaca, New York
October, 1987

Table of Contents

1. INTRODUCTION 1

2. PREDICTION MATRIX 9

3. ROLE OF VARIABLES IN A REGRESSION EQUATION 39

4. EFFECTS OF AN OBSERVATION ON A REGRESSION EQUATION 71

5. ASSESSING THE EFFECTS OF MULTIPLE OBSERVATIONS 185

6. JOINT IMPACT OF A VARIABLE AND AN OBSERVATION 211

7. ASSESSING THE EFFECTS OF ERRORS OF MEASUREMENT 245

8. STUDY OF MODEL SENSITIVITY BY THE GENERALIZED LINEAR MODEL APPROACH 263

9. COMPUTATIONAL CONSIDERATIONS 281

APPENDIX 297

REFERENCES 301
INDEX 309
