Modeling Relationship Between Variables: Regression

 

Problem

 

An automated system for marking large numbers of student computer programs, called AUTOMARK, has been used successfully at McMaster University in Ontario, Canada. AUTOMARK takes into account both program correctness and program style when marking student assignments. AUTOMARK was used to grade the FORTRAN77 assignments of a class of 33 students. To evaluate the effectiveness of the automated system, these grades were compared to the grades assigned by the instructor. The results are shown in the table.

 

AUTOMARK   INSTRUCTOR     AUTOMARK   INSTRUCTOR     AUTOMARK   INSTRUCTOR
GRADE x    GRADE y        GRADE x    GRADE y        GRADE x    GRADE y

12.2       10             18.2       15             19.3       17
10.6       11             15.1       16             19.5       17
15.1       12             17.2       16             19.7       17
16.2       12             17.5       16             18.6       18
16.6       12             18.6       16             19         18
16.6       13             18.8       16             19.2       18
17.2       14             17.8       17             19.4       18
17.6       14             18         17             19.6       18
18.2       14             18.2       17             20.1       18
16.5       15             18.4       17             19.2       19
17.2       15             18.6       17             19.3       17
12.2       10             19         17             19.5       17
 

Question

 

  1. Assuming instructor grade y and AUTOMARK grade x are linearly related, hypothesize a model relating y to x.
  2. Fit the model from part 1 to the data using the method of least squares. Give the least squares prediction equation.
  3. Interpret the values of β0 and β1.
  4. What assumptions are required to make valid inferences based on the regression results?
  5. Calculate and interpret s, the estimated standard deviation of the model.
  6. Conduct a test to determine whether y and x are positively linearly related. Use α = .05.
  7. Calculate a 90% confidence interval for β1, and interpret the result.
  8. Calculate the coefficient of determination, R², and interpret its value.
  9. Do you recommend using the model for estimation and prediction?
  10. Construct a 90% prediction interval for instructor grade y, for an assignment given an AUTOMARK grade of x = 18. Interpret the result.

 

 

Answer

  1. Assuming instructor grade y and AUTOMARK grade x are linearly related, we hypothesize the straight-line model y = β0 + β1x + ε.
  2. Based on the SAS output below, the least squares estimates of the y-intercept and slope are -1.04264 and 0.94406 (shown under the column labeled Parameter Estimate). Therefore, the least squares prediction equation is ŷ = -1.04264 + 0.94406x.
  3. β0 is the y-intercept of the line of means; that is, it is the value of E(y) when x is 0. The estimate of β0, -1.04264, is therefore the estimated mean Instructor Grade y when the AUTOMARK grade is 0. Because x = 0 lies far outside the range of the observed AUTOMARK grades (10.6 to 20.1), this number has no practical interpretation. The slope of the line, β1, is the change in E(y) for every 1-unit increase in x. Hence, the estimate of β1, 0.94406, is the estimated change in the mean Instructor Grade y for every 1-unit increase in the AUTOMARK grade x. Since the estimated slope is positive, we expect the Instructor Grade to increase as the AUTOMARK grade increases.
  4. To be answered by reader.
  5. SSE is given in the SAS output under the column heading Sum of Squares, in the row labeled Error. This value, SSE = 53.65471, is a minimum: no other straight line fitted to these data has an SSE smaller than 53.65471. Since there are 36 data values, we have n - 2 = 36 - 2 = 34 degrees of freedom for estimating the variance σ² of the random error ε. Thus, s² = SSE/34 = 1.57808 (shown in the SAS printout as the Mean Square for Error, next to the value of SSE). The estimate of the standard deviation σ of the random error ε is s = √s² = 1.25622. This value is given in the SAS printout next to the heading Root MSE. As a rough interpretation, we expect most of the observed Instructor Grades to lie within about 2s ≈ 2.5 points of their least squares predicted values.
  6. To be answered by reader.
  7. To be answered by reader.
  8. The coefficient of determination is R² = (SSyy - SSE)/SSyy = (206.75000 - 53.65471)/206.75000 = 0.7405. A practical interpretation is that about 74% of the sample variation in Instructor Grade y is explained by the linear relationship between x and y. The remaining 26% of the variation is unexplained.
  9. To be answered by reader.
  10. To be answered by reader.
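
The arithmetic in answers 5 and 8 can be cross-checked with a short DATA step. The following is only a sketch: the three input values are copied by hand from the Analysis of Variance table in the SAS output, and the variable names (sse, ssyy, n, s2, s, r2) are illustrative choices, not part of the original program.

```sas
*--- Sketch only: recompute s2, s and R2 from the ANOVA table values;
data _null_;
  sse  = 53.65471;            /* Sum of Squares, row Error            */
  ssyy = 206.75000;           /* Sum of Squares, row Corrected Total  */
  n    = 36;                  /* number of observations in the fit    */
  s2 = sse / (n - 2);         /* estimate of the error variance       */
  s  = sqrt(s2);              /* estimate of the error std deviation  */
  r2 = (ssyy - sse) / ssyy;   /* coefficient of determination         */
  put s2= 8.5 s= 8.5 r2= 8.4; /* write the three results to the log   */
run;
```

The values written to the SAS log should agree with the Mean Square for Error, Root MSE, and R-Square entries of the printout.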

 

 

 

SAS Program Used for the Analysis

 

 

*--- SAS program: Regression_Model_1.SAS ;

options nodate pageno=1;

*--- Create SAS data set: three (x, y) pairs per data line, read with @@;
data automark;
  input automark_grade instructor_grade @@;
  cards;
12.2  10    18.2  15    19.3  17
10.6  11    15.1  16    19.5  17
15.1  12    17.2  16    19.7  17
16.2  12    17.5  16    18.6  18
16.6  12    18.6  16    19    18
16.6  13    18.8  16    19.2  18
17.2  14    17.8  17    19.4  18
17.6  14    18    17    19.6  18
18.2  14    18.2  17    20.1  18
16.5  15    18.4  17    19.2  19
17.2  15    18.6  17    19.3  17
12.2  10    19    17    19.5  17
;
run;

*--- Fit the straight-line model and print predicted values with 95% prediction limits;
proc reg data=automark;
  title "Regression of instructor_grade on automark_grade";
  model instructor_grade = automark_grade / p cli;
  id automark_grade;
run;
quit;

 

 

Note

  • The REG procedure performs a complete regression analysis on the data.
  • In the MODEL statement, the dependent variable is listed to the left of the equals sign and the independent variable to the right. The option P (following the slash) prints predicted values and residuals, and the option CLI prints corresponding lower and upper 95% prediction limits for observations in the data set. Specify CLM to obtain 95% confidence intervals for E(y).
  • The optional ID statement identifies the value of x for each 95% prediction (or confidence) interval.
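
Parts 7 and 10 of the question ask for 90% intervals, whereas the options described above produce 95% limits. The sketch below shows one way this might be requested, assuming the automark data set created by the program is still available. The data set names new_point and automark_plus are hypothetical. The CLB option requests confidence limits for the parameter estimates, and ALPHA=0.10 sets the level of all limits in the MODEL statement to 90%.

```sas
*--- Sketch only: 90% limits, plus a prediction at x = 18;
data new_point;
  automark_grade   = 18;   /* x value at which to predict          */
  instructor_grade = .;    /* missing y: row is left out of the fit */
run;

data automark_plus;        /* original data plus the extra row      */
  set automark new_point;
run;

proc reg data=automark_plus;
  title "Regression with 90% limits";
  model instructor_grade = automark_grade / clb cli alpha=0.10;
  id automark_grade;
run;
quit;
```

The appended row has a missing response, so it does not affect the fitted line, but PROC REG still prints its predicted value and prediction limits. In this particular data set an observation with x = 18 already exists, so the appended row is needed only when predicting at a value of x that does not appear in the data.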

 


 

 

 

SAS Output

 

                        Regression of instructor_grade on automark_grade

 

                                       The REG Procedure

                                         Model: MODEL1

                             Dependent Variable: instructor_grade

 

                                      Analysis of Variance

 

                                             Sum of           Mean

         Source                   DF        Squares         Square    F Value    Pr > F

 

         Model                     1      153.09529      153.09529      97.01    <.0001

         Error                    34       53.65471        1.57808

         Corrected Total          35      206.75000

 

 

                      Root MSE              1.25622    R-Square     0.7405

                      Dependent Mean       15.58333    Adj R-Sq     0.7329

                      Coeff Var             8.06128

 

 

                                      Parameter Estimates

 

                                      Parameter       Standard

          Variable            DF       Estimate          Error    t Value    Pr > |t|

 

          Intercept            1       -1.04264        1.70093      -0.61      0.5440

          automark_grade       1        0.94406        0.09585       9.85      <.0001

 

 

                            Regression of instructor_grade on automark_grade

 

                                           The REG Procedure

                                             Model: MODEL1

                                 Dependent Variable: instructor_grade

 

                                           Output Statistics

 

         automark_             Dep Var    Predicted       Std Error

  Obs    grade        instructor_grade        Value    Mean Predict        95% CL Predict         Residual

 

    1        12.2              10.0000      10.4749          0.5593       7.6804      13.2695      -0.4749

    2        18.2              15.0000      16.1393          0.2168      13.5486      18.7300      -1.1393

    3        19.3              17.0000      17.1777          0.2647      14.5688      19.7867      -0.1777

    4        10.6              11.0000       8.9644          0.7039       6.0381      11.8908       2.0356

    5        15.1              16.0000      13.2127          0.3190      10.5787      15.8467       2.7873

    6        19.5              17.0000      17.3666          0.2768      14.7524      19.9807      -0.3666

    7        15.1              12.0000      13.2127          0.3190      10.5787      15.8467      -1.2127

    8        17.2              16.0000      15.1952          0.2130      12.6058      17.7846       0.8048

    9        19.7              17.0000      17.5554          0.2897      14.9354      20.1753      -0.5554

   10        16.2              12.0000      14.2512          0.2493      11.6484      16.8539      -2.2512

   11        17.5              16.0000      15.4784          0.2096      12.8902      18.0667       0.5216

   12        18.6              18.0000      16.5169          0.2298      13.9216      19.1122       1.4831

   13        16.6              12.0000      14.6288          0.2307      12.0331      17.2244      -2.6288

   14        18.6              16.0000      16.5169          0.2298      13.9216      19.1122      -0.5169

   15          19              18.0000      16.8945          0.2481      14.2923      19.4968       1.1055

   16        16.6              13.0000      14.6288          0.2307      12.0331      17.2244      -1.6288

   17        18.8              16.0000      16.7057          0.2384      14.1072      19.3042      -0.7057

   18        19.2              18.0000      17.0833          0.2589      14.4767      19.6899       0.9167

   19        17.2              14.0000      15.1952          0.2130      12.6058      17.7846      -1.1952

   20        17.8              17.0000      15.7617          0.2102      13.1732      18.3501       1.2383

   21        19.4              18.0000      17.2722          0.2706      14.6606      19.8837       0.7278

   22        17.6              14.0000      15.5728          0.2094      12.9847      18.1610      -1.5728

   23          18              17.0000      15.9505          0.2127      13.3612      18.5397       1.0495

   24        19.6              18.0000      17.4610          0.2832      14.8440      20.0780       0.5390

   25        18.2              14.0000      16.1393          0.2168      13.5486      18.7300      -2.1393

   26        18.2              17.0000      16.1393          0.2168      13.5486      18.7300       0.8607

   27        20.1              18.0000      17.9330          0.3174      15.2998      20.5662       0.0670

   28        16.5              15.0000      14.5344          0.2349      11.9372      17.1316       0.4656

   29        18.4              17.0000      16.3281          0.2226      13.7354      18.9208       0.6719

   30        19.2              19.0000      17.0833          0.2589      14.4767      19.6899       1.9167

   31        17.2              15.0000      15.1952          0.2130      12.6058      17.7846      -0.1952

   32        18.6              17.0000      16.5169          0.2298      13.9216      19.1122       0.4831

   33        19.3              17.0000      17.1777          0.2647      14.5688      19.7867      -0.1777

   34        12.2              10.0000      10.4749          0.5593       7.6804      13.2695      -0.4749

   35          19              17.0000      16.8945          0.2481      14.2923      19.4968       0.1055

   36        19.5              17.0000      17.3666          0.2768      14.7524      19.9807      -0.3666

 

 

                              Sum of Residuals                           0

                              Sum of Squared Residuals            53.65471

                              Predicted Residual SS (PRESS)       62.78104