Essential Regression and Experimental Design for Chemists and Engineers
Changes/New Features
ER, Version 2.218 (13-Mar-99)
EED, Version 2.206 (26-Jan-99)
(and things we forgot to mention in the Electronic Book)
After releasing the electronic book (ERBOOK.doc), we kept modifying Essential Regression (ER) and Essential Experimental Design (EED) in order to (hopefully) improve the software, but also to work out a few glitches in the previous versions. Eventually, it became necessary to come up with an updated book or an addendum covering all the changes. In addition, we noticed that we simply forgot to mention a few features in the original book. Rather than rewriting the original book, we decided (for now) to issue this "addendum", so that users who downloaded the whole book do not have to go through it again.
In the subsequent chapters, we included references to the electronic book that was issued with the previous version(s) of Essential Regression (ER) and Essential Experimental Design (EED).
David D. Steppan
Joachim Werner
Robert P. Yeater
Copyright 1998
We added new functionality and changed the look somewhat to improve user-friendliness. The new look of the Multiple Regression Main Dialog is shown below. The arrows point to the features we are going to talk about.
[Reference to ERBOOK.DOC: Chapter 2.2.2, Input Area of Main Dialog]
>">
The arrow buttons ">>" and "<<" allow the user to copy all possible model terms into the Current Model window in one step and, vice versa, to remove all terms from the model at once.
[Reference to ERBOOK.DOC: Chapter 5.3.4, Printed Output]
The Print button on the Main Dialog was not described properly in the book. Chapter 5.3.4 in the electronic book deals with the printed output based on the Output or XLS sheet generated after pressing the Make XLS button.
The Print button on the Main Dialog generates a "screen dump" of the dialog window, i.e., the dialog with all the information is printed graphically, as a picture.
[Reference to ERBOOK.DOC: Chapter 2.2.1, Figure 2.1]
Note that the buttons have been rearranged compared to the picture shown in the electronic book.
[Reference to ERBOOK.DOC: Chapter 5.2, Predicting Observations; Chapter 5.3.1, Make XLS Button-Overview; Chapter 5.3.5, Prediction of New Observations]
The XLS Output Worksheet is created when the user presses the Make XLS button on the Main Dialog after performing a regression analysis. In addition to the previously described Predict button, the new version of ER generates a Table button on the XLS Output Worksheet. With this feature, more than just one observation (response) at a time can be predicted based on the underlying model. After pressing the Table button on an Output Worksheet for the first time, the user is asked to specify the size of the prediction table (i.e., the number of predictions). The range is 5-50. The default number is 5.
After specifying the size of the prediction table and pressing OK, ER generates the table on the XLS Output sheet. Pressing the Back button switches the display back to the top row of the worksheet. Once the table has been generated, pressing the Table button simply displays it again; the Prediction Table Size dialog and the table generation step are skipped.
Note:
If new rows (more responses or predictions) need to be added to the table, just extend the range of the existing table by copying the last row of the table and pasting it into the first empty row below the current table range (this also works with multiple rows at once).
[Reference to ERBOOK.DOC: Chapter 1.1.5, Confidence Limits for Regression Coefficients and Observations; Chapter 5.2, Predicting Observations]
The figure below shows a table with 5 predictions. This table was generated with the data included in Er_test.xls (the data file for the tutorial) after performing a model optimization (using the AutoFit button).
The first two columns contain the input variables, X1 and X2. The cells in these columns (and only in these columns) are input cells for the new predictions: change the values in these columns to change the predicted response. By default, the table is generated with each input variable set to its average (mean).
Note: All other columns contain formulas that depend on the values in the input columns and are not supposed to be modified manually!
Obviously, the number of input columns varies with the number of input model variables.
The third column (in our example) contains the predicted response (Y). Change the numbers in the input columns (here, X1 and X2), and the numbers in the Y column will change accordingly, based on the response model equation.
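To illustrate what the Y column does, here is a minimal sketch in Python. The coefficients and the form of the higher-order term are placeholders, not the actual Er_test.xls fit; the point is only that the prediction is recomputed from the model equation whenever the input values change.

    # Placeholder model: intercept, X1, X2, and one higher-order term.
    # The coefficient values are made up for illustration only.
    terms = {
        "Intercept": lambda x1, x2: 1.0,
        "X1":        lambda x1, x2: x1,
        "X2":        lambda x1, x2: x2,
        "X1*X2":     lambda x1, x2: x1 * x2,  # hypothetical higher-order term
    }
    coef = {"Intercept": 1.0, "X1": 0.5, "X2": -0.2, "X1*X2": 0.03}

    def predict(x1, x2):
        # Same principle as the Y column: evaluate the model equation
        return sum(coef[t] * f(x1, x2) for t, f in terms.items())

    print(predict(10.0, 2.0))  # change the inputs, the prediction follows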
The next column titled "Hat" contains the value of the matrix expression for the values of input variables used in the
given row. This expression, as explained in Chapter 5.2. of the electronic book, can be
used to determine if the used values for the input variables fall inside or outside the
input range the model is based on. Above the "hat" column, the maximum value of
the hat matrix diagonal elements of the data set is displayed. If a number in the
"hat" column exceeds this maximum value, the corresponding prediction
constitutes a "hidden extrapolation" of the data range (rows 3 and 5 in the
above table). In Excel 97 (Excel 8), the cells in the "hat" column turn
"red" if this happens, otherwise the background color is green (conditional
formatting feature in Excel 97).
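In code, the hidden-extrapolation check looks like this (a sketch; the function names are ours, not ER's):

    import numpy as np

    # X: model matrix of the fitted data (column of ones for the
    # intercept, then the model terms); x0: term vector of a new row.
    def hat_value(X, x0):
        # h0 = x0' (X'X)^-1 x0
        return float(x0 @ np.linalg.inv(X.T @ X) @ x0)

    def is_hidden_extrapolation(X, x0):
        # Compare h0 against the largest diagonal element of the hat
        # matrix H = X (X'X)^-1 X' of the original data set.
        h_max = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T).max()
        return hat_value(X, x0) > h_max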
In the next two columns of the prediction table, confidence intervals are shown as explained in Chapter 1.1.5 of our book (CI mean = confidence interval for the mean response, CI pred = confidence interval for new observations or predictions).
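A minimal sketch of the two intervals, assuming ER uses the standard textbook formulas (which we believe Chapter 1.1.5 presents): both are centered on the predicted response y0 and differ only in the extra variance term for a new observation.

    import numpy as np
    from scipy import stats

    # y0: predicted response, h0: hat value of the row, mse: residual
    # mean square, df_resid: residual degrees of freedom (n - p).
    def confidence_intervals(y0, h0, mse, df_resid, alpha=0.05):
        t = stats.t.ppf(1 - alpha / 2, df_resid)
        half_mean = t * np.sqrt(mse * h0)          # CI mean
        half_pred = t * np.sqrt(mse * (1 + h0))    # CI pred
        return ((y0 - half_mean, y0 + half_mean),
                (y0 - half_pred, y0 + half_pred))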
The last block of columns contains the linear or higher order terms that are used in the model equation (in this example, X1, X2, X12).
If the data set does not contain true replicates, i.e., repeated observations at the same levels of the regressors, the procedure described in Chapter 2.1.4 of the electronic book cannot be applied (see also further below in Chapter 2 of this document). However, an estimate for the lack of fit can be obtained by a procedure called "Near-Neighbor Analysis".
Points that are close together in x-space (i.e., the multidimensional range of the input variables) are treated as "quasi-replicates". Let's assume we have k input variables, j = 1, ..., k. We compare the regressors xij of observation i with the regressors xi'j of another observation i'. The proximity of the two data points can then be expressed by a weighted sum of squared distance:

D²ii' = (1/MSE) · Σj [ bj · (xij − xi'j) ]²   (summed over j = 1, ..., k)
Here, bj is the estimated regression model coefficient of the jth regressor, and MSE is the Mean Squared Error of the residuals as defined in Chapter 1 of our book.
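In code, the distance measure is a direct translation (names are ours):

    import numpy as np

    # Weighted sum of squared distance between observations i and i'.
    # x_i, x_ip: regressor values of the two observations (no intercept);
    # b: fitted coefficients of those regressors; mse: residual mean square.
    def wssd(x_i, x_ip, b, mse):
        return float(np.sum((b * (x_i - x_ip)) ** 2) / mse)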
A table of "Near Neighbors" is generated by first sorting the observations in order of increasing predicted response. Then, pairs are generated from points with adjacent values of the predicted response, and with values separated by one, two, and three. This generates 4n-10 pairs of data points, where n is the total number of observations.
For each pair, the range of the residuals and the weighted sum of squared distance (see above) are calculated. The range of the residuals is given by the absolute value of the difference of the residuals of the two points in the pair:

Eu = | ei − ei' |   (for pair u)
The 4n−10 pairs are sorted in ascending order of the weighted sum of squared distance. For the first m pairs in this sorted table, i.e., the m smallest values of D²ii', an estimate of the cumulative standard deviation including all pairs up to pair m is calculated from the residual ranges:

s = 0.886 · (1/m) · Σu Eu   (summed over u = 1, ..., m)
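Putting the pieces together, a sketch of our reading of the procedure (reusing wssd() from above; all names are ours):

    import numpy as np

    def near_neighbor_table(y_hat, resid, X, b, mse):
        # Sort observations by increasing predicted response
        order = np.argsort(y_hat)
        n = len(y_hat)
        # Pairs adjacent in the sorted list, plus pairs separated by
        # 1, 2, 3 positions: (n-1)+(n-2)+(n-3)+(n-4) = 4n-10 pairs
        pairs = [(order[i], order[i + gap])
                 for gap in (1, 2, 3, 4)
                 for i in range(n - gap)]
        d2 = np.array([wssd(X[i], X[j], b, mse) for i, j in pairs])
        e  = np.array([abs(resid[i] - resid[j]) for i, j in pairs])
        # Sort pairs by increasing weighted distance, then track the
        # cumulative standard-deviation estimate s(m) = 0.886 * mean(E)
        idx = np.argsort(d2)
        s_cum = 0.886 * np.cumsum(e[idx]) / np.arange(1, len(pairs) + 1)
        return [pairs[k] for k in idx], d2[idx], s_cum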
The value for m is chosen so that the weighted sum of squared distance does not become too large compared to the maximum value of D²ii' in the data set. The corresponding value of s is then compared to the standard error of the regression model (the RMS error, i.e., the square root of MSE). If the RMS error is significantly larger than s, we can expect lack of fit due to missing regressors or outliers (see below for how m is selected and how the magnitude of s is evaluated in ER).
This procedure was proposed by Daniel and Wood (C. Daniel and F. S. Wood, "Fitting Equations to Data", 2nd ed., John Wiley & Sons, Inc., New York, NY, 1980) and is also outlined by Montgomery and Peck (Douglas C. Montgomery and Elizabeth A. Peck, "Introduction to Linear Regression Analysis", 2nd ed., John Wiley & Sons, Inc., New York, NY, 1992, ISBN 0-471-53387-4). Once again, we recommend these books to users who wish to delve deeper into the mathematical foundations of regression analysis.
We implemented this procedure in ER in a "post-mortem" fashion. It can be applied after generating the XLS output sheet by pressing the Near-Neighbor button. This generates the table of the 4n−10 pairs of data points in order of increasing D²ii'. The observations used to generate each pair are given in the last two columns of the table. The critical value of s for the mth pair is highlighted in red.
In ER, we define m (admittedly somewhat arbitrarily) as the largest integer smaller than or equal to the total number of pairs, 4n−10, divided by six. The resulting m pairs are included as "near neighbors", and the D²ii' of the mth pair is defined as the default cutoff value. The cutoff value can be changed via an input box before the actual analysis is performed.
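In code (the helper name is ours):

    def default_m(n):
        # Largest integer <= (4n - 10) / 6
        return (4 * n - 10) // 6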
The value of s for the last pair below the cutoff is compared to the RMS error (Standard Error) of the regression. In ER, this comparison is made in analogy to the test for lack of fit (see also Chapter 2.1.4 of the book and Chapter 2 of this document): the pairs of near neighbors take the role of the replicates in the lack-of-fit procedure.
The value of s² is defined as the pure error variance, or MSPE. It is associated with 2m − m = m degrees of freedom: 2m replaces n as the total number of data points, because each of the m pairs contains 2 "replicates", and the number of distinct data points in the lack-of-fit test is equal to m, the number of near-neighbor pairs.
The lack-of-fit variance, or MSLOF, is calculated from

MSLOF = (fE · MSE − m · s²) / (fE − m)

where
MSE = RMS² = Mean Squared Error of the residuals,
fE = degrees of freedom associated with MSE (here fE = 2m − p, since 2m replaces n),
p = number of model parameters including the intercept.
A test statistic (F-test) is calculated from:

F = MSLOF / MSPE = MSLOF / s²
The results of this test are shown in a second table generated by the procedure. If F is larger than the critical value Fcrit for a given significance level α with m − p and m degrees of freedom, the lack-of-fit error is significant, i.e., there might be contributions to the regressor-response relationship that are not accounted for by the model. In other words, if the calculated F-Significance in the table is less than a predefined level (e.g., 0.05), the difference between s² and the MSE of the regression model is considered significant, indicating lack of fit.
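As a sketch (this reflects our reading of the procedure above; the fE = 2m − p bookkeeping is our reconstruction):

    from scipy import stats

    def near_neighbor_lof(s2, mse, m, p):
        # Pure error: s2 with m degrees of freedom (m pairs of two)
        fe = 2 * m - p                            # residual df with n -> 2m
        ms_lof = (fe * mse - m * s2) / (fe - m)   # note fe - m = m - p
        f0 = ms_lof / s2
        p_value = stats.f.sf(f0, m - p, m)        # the "F-Significance"
        return f0, p_value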
We applied the lack-of-fit F-statistic for true replicates to have a somewhat objective criterion for the significance of the difference between s² and the RMS error in the near-neighbor analysis. If some of our readers think that this is inappropriate, please send us your opinion on this matter.
The degrees of freedom associated with the Sum of Squares for Lack of Fit is generally m − p, not m − 2. The term p stands for the number of regressors including the intercept, if present; the value 2 is only valid for a simple regression model with one variable plus intercept. Consequently, the test statistic for lack of fit is given by

F0 = MSLOF / MSPE = [ SSLOF / (m − p) ] / [ SSPE / (n − m) ]

where m is the number of distinct levels of the regressors and n is the total number of observations.
In the software, F0 and the associated degrees of freedom are calculated correctly. The error only occurred in the book.
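A short sketch of the corrected statistic (function and variable names are ours):

    from scipy import stats

    def lack_of_fit_F(ss_lof, ss_pe, m, n, p):
        # m: distinct regressor levels, n: observations, p: parameters
        f0 = (ss_lof / (m - p)) / (ss_pe / (n - m))
        return f0, stats.f.sf(f0, m - p, n - m)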