Solution 2.7 - Chemometrics: Data Analysis for the Laboratory and Chemical Plant
Education Article
- Published: Jan 1, 2000
- Channels: Chemometrics & Informatics
1. The design matrix is given below
2. 10 degrees of freedom are required for the model. 5 degrees of freedom are available for replication, so there are 5 degrees of freedom (=20-10-5) for testing the lack-of-fit.
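The bookkeeping in this answer can be checked with a few lines of Python. This is only an illustrative sketch: the variable names are not from the original, and the number of replicates (six) is taken from the readings listed in question 5.

```python
# Degrees-of-freedom bookkeeping for the 20-experiment design.
n_experiments = 20   # N, total number of experiments
n_parameters = 10    # P: intercept, 3 linear, 3 squared and 3 interaction terms
n_replicates = 6     # replicated readings listed in question 5 (assumed here)

df_residual = n_experiments - n_parameters    # N - P
df_replicate = n_replicates - 1               # one lost to the replicate mean
df_lack_of_fit = df_residual - df_replicate   # what is left for lack-of-fit

print(df_residual, df_replicate, df_lack_of_fit)  # 10 5 5
```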
3. The coefficients for the model are as follows.
b_{0} | b_{1} | b_{2} | b_{3} | b_{11} | b_{22} | b_{33} | b_{12} | b_{13} | b_{23}
5013.51 | 143.39 | -28.11 | 42.40 | -71.73 | -129.51 | -63.08 | -71.75 | 14.25 | -22.00
4. The calculation is presented below. Note that there are two possible answers for the root mean square error. The statistically correct approach is to divide the residual sum of squares by 10 (= N-P), but in other circumstances it is divided by 20 (= N). For any formal statistical test the former answer should be used, although the latter measure is sometimes employed. In other areas of chemometrics it is not always so easy to determine how many degrees of freedom have been lost due to modelling and data preprocessing.
It is better to scale the error by the standard deviation of the response than by its mean, because the range of the response is small relative to the mean. Note that the sample rather than the population standard deviation is employed in this calculation, and that the number of degrees of freedom for the error is 10.
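The two candidate root mean square errors can be compared with a short calculation. The sketch below assumes the residual sum of squares of 109741.00 quoted in the ANOVA table of question 5; the variable names are illustrative.

```python
import math

rss = 109741.00   # residual sum of squares for the full model (from question 5)
n = 20            # N, number of experiments
p = 10            # P, number of model parameters

rmse_n_minus_p = math.sqrt(rss / (n - p))   # statistically correct divisor
rmse_n = math.sqrt(rss / n)                 # divisor sometimes used instead

print(round(rmse_n_minus_p, 2), round(rmse_n, 2))  # 104.76 74.07
```

Dividing the first value by the sample standard deviation of the 20 responses (not reproduced here) would give the percentage error.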
5. The replicate sum of squares is calculated as follows.
True reading | Average of replicates | Average-true
5063.00 | 5013.83 | -49.17
4968.00 | 5013.83 | 45.83
5035.00 | 5013.83 | -21.17
5122.00 | 5013.83 | -108.17
4970.00 | 5013.83 | 43.83
4925.00 | 5013.83 | 88.83
Sum of squares | | 26478.83
The sum of squares accounted for by the lack-of-fit is given by 109741.00-26478.83 = 83262.16.
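The replicate and lack-of-fit sums of squares can be reproduced from the six replicate readings. This is a sketch with illustrative variable names; the residual sum of squares is entered as the rounded value 109741.00, so the lack-of-fit result may differ from the quoted figure in the last digit.

```python
# Six replicate readings from the table above.
replicates = [5063.0, 4968.0, 5035.0, 5122.0, 4970.0, 4925.0]

mean_replicate = sum(replicates) / len(replicates)
# Sum of squared deviations of each reading from the replicate mean.
ss_replicate = sum((mean_replicate - y) ** 2 for y in replicates)

ss_residual = 109741.00                      # rounded value quoted in the text
ss_lack_of_fit = ss_residual - ss_replicate  # what replication does not explain

print(f"{mean_replicate:.2f} {ss_replicate:.2f} {ss_lack_of_fit:.2f}")
```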
An ANOVA table is given below. Note that it is possible also to include the total error sum of squares for the entire dataset as well as the error sum of squares for the residuals.
Source of variation | Sum of squares | Degrees of freedom | Mean sum of squares | Variance ratio
Residual | 109741.00 | 10 | 10974.1 |
Replicate | 26478.83 | 5 | 5295.767 |
Lack-of-fit | 83262.16 | 5 | 16652.43 | 3.14448
Although the lack-of-fit mean square is higher than the replicate mean square, the difference is not particularly significant: the variance ratio of 3.14 is below the 5% critical value of about 5.05 for an F-distribution with 5 and 5 degrees of freedom. Despite the relatively high percentage error calculated in question 4, the replicates are quite widely spread, so there is no real evidence that the lack-of-fit is particularly large compared to the experimental error.
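The mean squares and the variance ratio in the ANOVA table follow directly from the sums of squares and degrees of freedom. In the sketch below the critical value 5.05 is the standard tabulated 5% point of F with 5 and 5 degrees of freedom, entered by hand as an assumption rather than computed; the variable names are illustrative.

```python
# Sums of squares and degrees of freedom from the ANOVA table.
ss_replicate, df_replicate = 26478.83, 5
ss_lack_of_fit, df_lack_of_fit = 83262.16, 5

ms_replicate = ss_replicate / df_replicate
ms_lack_of_fit = ss_lack_of_fit / df_lack_of_fit
f_ratio = ms_lack_of_fit / ms_replicate

# Tabulated upper 5% point of F(5, 5); entered by hand, not computed here.
f_critical = 5.05
significant = f_ratio > f_critical

print(f"{ms_replicate:.2f} {ms_lack_of_fit:.2f} {f_ratio:.5f} {significant}")
```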
6. The matrix (D'.D)^{-1} is given below.
Multiplying the diagonal elements by the residual mean square, 10974.1, gives the following variances for the coefficients.
b_{0} | b_{1} | b_{2} | b_{3} | b_{11} | b_{22} | b_{33} | b_{12} | b_{13} | b_{23}
1827.57 | 796.38 | 796.38 | 796.38 | 737.30 | 737.30 | 737.30 | 1371.76 | 1371.76 | 1371.76
7. The t-statistic for each coefficient is obtained by dividing the coefficient in question 3 by the square root of its variance, giving the following.
b_{0} | b_{1} | b_{2} | b_{3} | b_{11} | b_{22} | b_{33} | b_{12} | b_{13} | b_{23}
117.27 | 5.08 | -1.00 | 1.50 | -2.64 | -4.77 | -2.32 | -1.94 | 0.38 | -0.59
The most significant terms are b_{0}, b_{1}, b_{11}, b_{22}, b_{33} and b_{12}, although others could be included. Often a cut-off t-statistic is used. Note that a number of other criteria and methods could be employed to reduce the number of significant terms.
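The t-statistics can be reproduced from the coefficients in question 3 and the variances in question 6. The sketch below also applies a cut-off on the absolute t-statistic; the value 1.8 is a hypothetical choice made here only because it reproduces the six terms listed above, and is not taken from the original.

```python
import math

names = ["b0", "b1", "b2", "b3", "b11", "b22", "b33", "b12", "b13", "b23"]
coefficients = [5013.51, 143.39, -28.11, 42.40, -71.73,
                -129.51, -63.08, -71.75, 14.25, -22.00]
variances = [1827.57, 796.38, 796.38, 796.38, 737.30,
             737.30, 737.30, 1371.76, 1371.76, 1371.76]

# t = coefficient / standard error, where standard error = sqrt(variance).
t_stats = {name: b / math.sqrt(v)
           for name, b, v in zip(names, coefficients, variances)}

cutoff = 1.8  # hypothetical cut-off, chosen for illustration only
selected = [name for name in names if abs(t_stats[name]) > cutoff]

for name in names:
    print(f"{name:>3} {t_stats[name]:8.2f}")
print(selected)
```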
8. A new model with the most significant terms is of the form
\hat{y} = b_{0} + b_{1}x_{1} + b_{11}x^{2}_{1} + b_{22}x^{2}_{2} + b_{33}x^{2}_{3} + b_{12}x_{1}x_{2}
Recalculating the model using only the six terms above gives
\hat{y} = 5013.51 + 143.389x_{1} - 71.73x^{2}_{1} - 129.51x^{2}_{2} - 63.08x^{2}_{3} - 71.75x_{1}x_{2}
The residual sum of squares increases to 150898.44, about 37.5% higher than for the full model. This is probably not very significant, since the residual sum of squares is still very small relative to the overall sum of squares; the root mean square error, calculated using 10 degrees of freedom, increases from 50.94% to 59.73%.
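The effect of dropping four terms can be checked from the two residual sums of squares. In this sketch the ratio of the two values works out at about 1.375, and because both percentage errors use the same 10 degrees of freedom and the same scaling, the percentage error grows by the square root of that ratio. Variable names are illustrative.

```python
import math

rss_full = 109741.00      # residual sum of squares, ten-term model
rss_reduced = 150898.44   # residual sum of squares, six-term model

rss_ratio = rss_reduced / rss_full
percent_increase = (rss_ratio - 1.0) * 100.0

# Same divisor (10) and same scaling for both models, so the
# percentage error scales with the square root of the RSS ratio.
pct_error_full = 50.94
pct_error_reduced = pct_error_full * math.sqrt(rss_ratio)

print(f"{percent_increase:.1f}% increase, error {pct_error_reduced:.2f}%")
```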
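9. Equations can be set up as follows, by setting the partial derivative of the reduced model with respect to each factor to zero.
\partial\hat{y}/\partial x_{1} = 143.389 - 143.46x_{1} - 71.75x_{2} = 0
\partial\hat{y}/\partial x_{2} = -259.02x_{2} - 71.75x_{1} = 0
\partial\hat{y}/\partial x_{3} = -126.16x_{3} = 0
From the third equation
x_{3} = 0
From the second equation
x_{2} = -(71.75/259.02)x_{1} = -0.277x_{1}
Hence, substituting into the first equation
143.389 - 143.46x_{1} + 19.875x_{1} = 0, so x_{1} = 143.389/123.585 = 1.16 and x_{2} = -0.32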
The coded and real values of the three factors at the optimum are as follows.
factor | coded value | real value
enzyme (mg protein) | 1.16 | 14.6
arginine (pmoles) | -0.32 | 1135.7
pH | 0 | 7.5
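The coded optimum can be checked numerically from the six-term model in question 8. The sketch below assumes the optimum is the stationary point of that model, found by setting each partial derivative to zero and solving the resulting linear equations by Cramer's rule. It works in coded units only, since the coding used to obtain the real values is not reproduced here, and the variable names are illustrative.

```python
# Coefficients of the six-term model from question 8.
b1, b11, b22, b33, b12 = 143.389, -71.73, -129.51, -63.08, -71.75

# Stationary point: each partial derivative of the model is zero.
#   d/dx1: b1 + 2*b11*x1 + b12*x2 = 0
#   d/dx2: b12*x1 + 2*b22*x2      = 0
#   d/dx3: 2*b33*x3               = 0   ->  x3 = 0
x3 = 0.0

# Solve the 2x2 system  [a11 a12; a21 a22] [x1; x2] = [-b1; 0].
a11, a12 = 2.0 * b11, b12
a21, a22 = b12, 2.0 * b22
det = a11 * a22 - a12 * a21
x1 = (-b1 * a22) / det
x2 = (a21 * b1) / det

# With det > 0 and negative pure quadratic terms the point is a maximum.
is_maximum = det > 0 and b11 < 0 and b22 < 0 and b33 < 0

print(f"x1={x1:.2f} x2={x2:.2f} x3={x3:.2f} maximum={is_maximum}")
```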