Solution 4.8 - Chemometrics: Data Analysis for the Laboratory and Chemical Plant
Education Article
- Published: Jan 1, 2000
- Channels: Chemometrics & Informatics
1. The standardised data matrix is given below.
The population standard deviation is normally employed because the aim is data scaling rather than statistical sampling.
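The standardisation step can be sketched in Python with NumPy. This is only an illustration: the function name `standardise` and the random matrix `X_demo` are assumptions made here, and the numbers are synthetic placeholders rather than the dataset from the question.

```python
import numpy as np

def standardise(X):
    """Column-standardise X using the population standard deviation."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=0)  # ddof=0 divides by N rather than N - 1
    return (X - mean) / std

# Synthetic placeholder data with the same shape as the dataset (58 x 11);
# these are NOT the measurements from the question.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(58, 11))
Z_demo = standardise(X_demo)
```

With this scaling every column of `Z_demo` has zero mean and a sum of squares equal to the number of samples.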
2. The PC scores are presented below.
The loadings and eigenvalues are as follows.
The sum of the eigenvalues equals 638, which is the same as the total number of measurements in the dataset (= 58 × 11).
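The eigenvalue total can be checked numerically. The sketch below assumes that the eigenvalue of each PC is taken as the sum of squares of its scores, which is the convention consistent with a total of 638, and it uses singular value decomposition as one possible way of obtaining the PCs. The function name `pca_svd` and the random data are illustrative only.

```python
import numpy as np

def pca_svd(Z):
    """PCA of an already standardised matrix Z via singular value decomposition."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U * s          # one column of scores per PC
    loadings = Vt           # one row of loadings per PC
    eigenvalues = s ** 2    # sum of squares of each column of scores
    return scores, loadings, eigenvalues

# Synthetic placeholder data (58 samples, 11 variables), NOT the real dataset
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(58, 11))
Z_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0, ddof=0)

scores, loadings, eigenvalues = pca_svd(Z_demo)
total = eigenvalues.sum()   # equals 58 * 11 for any standardised 58 x 11 matrix
```

The total depends only on the standardisation, not on the particular values, because each standardised column contributes a sum of squares of 58.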
3. The scores plot is given below.
There is quite good discrimination between the two classes, but object G10 appears to be an outlier.
4. The labelled loadings plot is as follows.
5. There are several possible answers to this question, but a good combination consists of elements that appear roughly in the centre of each group, such as Mn and Sr. The graph of standardised readings for these two elements is presented below. There is reasonable discrimination, although G10 remains an outlier.
6. The elements K and Na are approximately at right angles to the discriminatory axis and so will be poor at separating the classes. Note that other answers are possible.
7. The class centroids are as follows.
| Class | Ti | Sr | Ba | Mn | Cr | Ca | Al | Fe | Mg | Na | K |
|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 0.194 | -0.683 | 0.973 | 0.621 | -0.391 | -1.102 | 0.807 | 0.279 | -0.811 | -0.335 | 0.431 |
| B | -0.284 | 0.463 | -0.630 | -0.371 | 0.189 | 0.687 | -0.533 | -0.306 | 0.552 | 0.269 | -0.200 |
The Euclidean distances are as follows.
The class distance plot is as follows.
There is reasonable discrimination. G10 definitely appears as a member of neither class. However, there are some ambiguities, and without prior knowledge it would not be easy to unambiguously classify the samples.
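The centroid and Euclidean distance calculations can be sketched as follows. The function names and the tiny two-variable matrix are hypothetical and chosen only so that the arithmetic is easy to follow by hand; they are not taken from the dataset.

```python
import numpy as np

def class_centroids(Z, labels):
    """Mean of the standardised readings for each class."""
    labels = np.asarray(labels)
    return {c: Z[labels == c].mean(axis=0) for c in np.unique(labels)}

def euclidean_to_centroids(Z, centroids):
    """Euclidean distance of every sample to every class centroid."""
    return {c: np.linalg.norm(Z - m, axis=1) for c, m in centroids.items()}

# Hypothetical toy data: two samples per class, two variables
Z_demo = np.array([[0.0, 0.0],
                   [2.0, 0.0],
                   [10.0, 0.0],
                   [12.0, 0.0]])
labels_demo = ["A", "A", "B", "B"]

centroids = class_centroids(Z_demo, labels_demo)
distances = euclidean_to_centroids(Z_demo, centroids)
```

Plotting `distances["A"]` against `distances["B"]` for each sample gives a class distance plot of the kind described above.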
8. An easy way of calculating the variance-covariance matrix for a dataset X is as follows.
- Centre the columns in each class,
- Multiply X'.X, where X' is the transpose of the centred matrix,
- Divide by the number of samples in each class.
Note that in this application, both classes must be centred independently to get the correct results.
Alternatively there are functions in both Excel and Matlab.
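The three steps translate directly into NumPy. In this sketch the function name `class_covariance` and the random class matrix are assumptions for illustration. The result is compared with `np.cov` using `bias=True`, which also divides by the number of samples.

```python
import numpy as np

def class_covariance(X_class):
    """Variance-covariance matrix of one class, dividing by the number of samples."""
    X_class = np.asarray(X_class, dtype=float)
    Xc = X_class - X_class.mean(axis=0)   # centre the columns within the class
    return Xc.T @ Xc / X_class.shape[0]   # X'X divided by the number of samples

# Synthetic placeholder for the rows belonging to one class, NOT the real data
rng = np.random.default_rng(2)
X_class_demo = rng.normal(size=(20, 4))

C_manual = class_covariance(X_class_demo)
C_numpy = np.cov(X_class_demo, rowvar=False, bias=True)  # same divisor (N)
```

The function is applied to each class separately, so each class is centred on its own column means.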
The matrix for class A is as follows.
The corresponding matrix for class B is as follows.
Hence the Mahalanobis distances are as follows.
The following observations can be made.
- G10 has a large distance from both classes.
- The mean of the squared Mahalanobis distances of the objects in each class to their own class centroid equals 11, the number of variables (this can be checked).
- The distances of objects from class B to class A are larger than those of objects from class A to class B. This is because the two classes have different structures. Class A is much tighter, so a large Euclidean distance from its centroid is more significant than the same distance would be for class B.
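A minimal sketch of the Mahalanobis distance calculation is given here, assuming the usual definition: the square root of (x - m) C⁻¹ (x - m)', where m and C are the centroid and variance-covariance matrix of the class. The function name and the two synthetic classes (one tight, one diffuse) are illustrative assumptions, not the data from the question.

```python
import numpy as np

def mahalanobis_to_class(X, X_class):
    """Mahalanobis distance of each row of X to the class defined by X_class."""
    X_class = np.asarray(X_class, dtype=float)
    mean = X_class.mean(axis=0)
    Xc = X_class - mean
    cov = Xc.T @ Xc / X_class.shape[0]          # divide by the class size
    diff = np.asarray(X, dtype=float) - mean
    # Solve cov @ y = diff' instead of forming the inverse explicitly
    solved = np.linalg.solve(cov, diff.T).T
    d2 = np.einsum("ij,ij->i", diff, solved)    # row-wise quadratic form
    return np.sqrt(d2)

# Synthetic placeholder classes: A is tight, B is more diffuse
rng = np.random.default_rng(3)
class_a = rng.normal(loc=0.0, scale=0.5, size=(30, 3))
class_b = rng.normal(loc=3.0, scale=2.0, size=(28, 3))

d_a_to_a = mahalanobis_to_class(class_a, class_a)
d_b_to_a = mahalanobis_to_class(class_b, class_a)
d_a_to_b = mahalanobis_to_class(class_a, class_b)
```

In this synthetic example the distances of class B objects to the tight class A come out larger than those of class A objects to class B, which mirrors the asymmetry noted above.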
The class distance plot is as follows.
There is now much better discrimination. Note that the outlier is still far away from both classes.
9. Excluding the outlier, 100% correct classification can be achieved.
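One way to arrive at a percentage correctly classified is sketched here, assuming each sample is assigned to the class with the smaller distance. That rule, the function names, the sample names and the distance values are all hypothetical and serve only to show the bookkeeping.

```python
import numpy as np

def classify_by_distance(dist_a, dist_b):
    """Assign each sample to the class whose centroid is nearer."""
    return np.where(np.asarray(dist_a) < np.asarray(dist_b), "A", "B")

def percent_correct(predicted, actual, names, exclude=()):
    """Percentage correctly classified, optionally leaving out named samples."""
    excluded = set(exclude)
    keep = np.array([name not in excluded for name in names])
    predicted = np.asarray(predicted)[keep]
    actual = np.asarray(actual)[keep]
    return 100.0 * np.mean(predicted == actual)

# Hypothetical distances for four samples; "s4" plays the role of an outlier
names_demo = ["s1", "s2", "s3", "s4"]
actual_demo = ["A", "A", "B", "A"]
dist_a_demo = [1.0, 2.0, 9.0, 8.0]
dist_b_demo = [7.0, 6.0, 1.5, 7.0]

predicted_demo = classify_by_distance(dist_a_demo, dist_b_demo)
with_all = percent_correct(predicted_demo, actual_demo, names_demo)
without_outlier = percent_correct(predicted_demo, actual_demo, names_demo,
                                  exclude=("s4",))
```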