Analytical Method Testing: Significance Tests in Chapter 3


Learn about the importance of testing analytical methods for systematic errors and the use of significance tests to evaluate experimental results in Chapter 3. Discover how significance tests help determine if differences between measured and standard values are significant or due to random variations.


Presentation Transcript


  1. Significance tests Chapter 3

  2. Introduction One of the most important properties of an analytical method is that it should be free from systematic error. This means that the value which it gives for the amount of the analyte should be the true value. This property of an analytical method may be tested by applying the method to a standard test portion containing a known amount of analyte (Chapter 1). However, as we saw in the last chapter, even if there were no systematic error, random errors make it most unlikely that the measured amount would exactly equal the standard amount. In order to decide whether the difference between the measured and standard amounts can be accounted for by random error, a statistical test known as a significance test can be employed.

  3. significance test As its name implies, this approach tests whether the difference between the two results is significant, or whether it can be accounted for merely by random variations. Significance tests are widely used in the evaluation of experimental results. This chapter considers several tests which are particularly useful to analytical chemists.

  4. Comparison of an experimental mean with a known value In making a significance test we are testing the truth of a hypothesis which is known as a null hypothesis, often denoted by H0. For the example in the previous paragraph we adopt the null hypothesis that the analytical method is not subject to systematic error. The term null is used to imply that there is no difference between the observed and known values other than that which can be attributed to random variation. Usually the null hypothesis is rejected if the probability of such a difference occurring by chance is less than 1 in 20 (i.e. 0.05 or 5%). In such a case the difference is said to be significant at the 0.05 (or 5%) level. Using this level of significance there is, on average, a 1 in 20 chance that we shall reject the null hypothesis when it is in fact true.

  5. In order to be more certain that we make the correct decision a higher level of significance can be used, usually 0.01 or 0.001 (1% or 0.1%). The significance level is indicated by writing, for example, P (i.e. probability) = 0.05, and gives the probability of rejecting a true null hypothesis. It is important to appreciate that if the null hypothesis is retained it has not been proved that it is true, only that it has not been demonstrated to be false. Later in the chapter the probability of retaining a null hypothesis when it is in fact false will be discussed.

  6. In order to test the null hypothesis the statistic t is calculated: t = (x̄ − μ)√n / s, where x̄ is the sample mean, s is the sample standard deviation, n is the sample size and μ is the known value. If |t| (i.e. the calculated value of t without regard to sign) exceeds a certain critical value then the null hypothesis is rejected. The critical value of t for a particular significance level can be found from Table A.2. For example, for a sample size of 10 (i.e. 9 degrees of freedom) and a significance level of 0.01, the critical value is t9 = 3.25, where, as in Chapter 2, the subscript is used to denote the number of degrees of freedom.
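
As an illustration (an addition, not part of the original text), here is a minimal Python sketch of this test using numpy and scipy; the ten measurement values and the reference value of 100.0 are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical data: ten measurements of a standard containing a
# known analyte amount of 100.0 (arbitrary units).
measurements = np.array([98.9, 100.2, 99.4, 100.5, 99.1,
                         99.8, 100.1, 99.5, 99.9, 99.7])
mu = 100.0                                 # known (standard) value

n = len(measurements)
x_bar = measurements.mean()
s = measurements.std(ddof=1)               # sample standard deviation
t_calc = (x_bar - mu) * np.sqrt(n) / s     # t = (x_bar - mu) * sqrt(n) / s

t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 1)   # two-sided, P = 0.05
print(f"|t| = {abs(t_calc):.3f}, critical value = {t_crit:.3f}")
# H0 (no systematic error) is rejected only if |t| exceeds the critical value.

# The same test in a single library call:
t_stat, p_value = stats.ttest_1samp(measurements, mu)
```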

  7. The reader should be aware that there are various versions given in the literature for the number of degrees of freedom for t, reflecting the fact that the method is an approximate one. The method above is that used by Minitab and it errs on the side of caution in giving a significant result. Excel, on the other hand, uses equation (3.5) but rounds the value to the nearest integer. For example, if equation (3.5) gave a value of 4.7, Minitab would take 4 degrees of freedom and Excel would take 5.
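
To make the difference concrete, here is a small sketch (an addition, not from the original text) of the degrees-of-freedom calculation for the unequal-variance t-test, assuming that equation (3.5) is the usual Welch-Satterthwaite expression; the standard deviations and sample sizes are hypothetical.

```python
import math

def welch_df(s1, n1, s2, n2):
    """Approximate degrees of freedom for comparing two means when the
    two samples have unequal variances (Welch-Satterthwaite)."""
    a = s1 ** 2 / n1
    b = s2 ** 2 / n2
    return (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))

df = welch_df(s1=0.31, n1=5, s2=0.17, n2=5)    # hypothetical inputs
print(f"raw df = {df:.2f}")
print(f"truncated (Minitab-style): {math.floor(df)}")
print(f"rounded (Excel-style):     {round(df)}")
# A raw value of 4.7 would give 4 by truncation but 5 by rounding,
# which is exactly the discrepancy described above.
```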

  8. Paired t-test It frequently happens that two methods of analysis are compared by studying test samples containing different amounts of analyte. For example, Table 3.1 gives the results of determining paracetamol concentration (% m/m) in tablets by two different methods. Ten tablets from 10 different batches were analysed in order to see whether the results obtained by the two methods differed. As always there is variation between the measurements due to random measurement error. In addition, differences between the tablets and differences between the methods may also contribute to the variation between measurements. It is the latter which is of interest in this example: we wish to know whether the methods produce significantly different results. The test for comparing two means (Section 3.3) is not appropriate in this case because it does not separate the variation due to method from that due to variation between tablets: the two effects are said to be confounded. This difficulty is overcome by looking at the difference, d, between each pair of results given by the two methods. If there is no difference between the two methods then these differences are drawn from a population with mean μd = 0. In order to test this null hypothesis, we test whether d̄, the mean of the differences, differs significantly from 0 using the statistic t = d̄√n / sd, where sd is the standard deviation of the differences and n is the number of pairs.
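
A minimal Python sketch of the paired test follows (an addition, not from the original text); the two sets of tablet results are hypothetical stand-ins for Table 3.1, which is not reproduced here.

```python
import numpy as np
from scipy import stats

# Hypothetical paracetamol results (% m/m) for ten tablets, one pair
# of results per tablet.
method1 = np.array([84.63, 84.38, 84.08, 84.41, 83.82,
                    83.55, 83.92, 83.69, 84.06, 84.03])
method2 = np.array([83.15, 83.72, 83.84, 84.20, 83.92,
                    84.16, 84.02, 83.60, 84.13, 84.24])

d = method1 - method2                           # paired differences
n = len(d)
t_calc = d.mean() * np.sqrt(n) / d.std(ddof=1)  # t = d_bar * sqrt(n) / s_d

# Equivalent single library call:
t_stat, p_value = stats.ttest_rel(method1, method2)
print(f"t = {t_stat:.3f}, P = {p_value:.3f}")
```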

  9. The paired test described above does not require that the precisions of the two methods are equal, but it does assume that the differences, d, are normally distributed. This will be the case if the measurement error for each method is normally distributed and the precision and bias (if any) of each method are constant over the range of values for which the measurements were made. The data can consist of either single measurements or the means of replicate measurements. However, the same number of measurements must be made on each sample by a given method: that is, n measurements on each sample by method 1 and m measurements on each sample by method 2, where m and n do not have to be equal.

  10. There are various circumstances in which it may be necessary or desirable to design an experiment so that each sample is analysed by two methods, giving results that are naturally paired. Some examples are:
      1. The quantity of any one test sample is sufficient for only one determination by each method.
      2. The test samples may be presented over an extended period, so it is necessary to remove the effects of variations in environmental conditions such as temperature, pressure, etc.
      3. The methods are to be compared by using a wide variety of samples from different sources and possibly with very different concentrations.

  11. As analytical methods usually have to be applicable over a wide range of concentrations, a new method is often compared with a standard method by analysis of samples in which the analyte concentration may vary over several powers of 10. In this case it is inappropriate to use the paired t-test since its validity rests on the assumption that any errors, either random or systematic, are independent of concentration. Over wide ranges of concentration this assumption may no longer be true. An alternative method in such cases is linear regression (see Section 5.9) but this approach also presents difficulties.

  12. One-sided and two-sided tests The methods described so far in this chapter have been concerned with testing for a difference between two means in either direction. For example, the method described in Section 3.2 tests whether there is a significant difference between the experimental result and the known value for the reference material, regardless of the sign of the difference. In most situations of this kind the analyst has no idea, prior to the experiment, as to whether any difference between the experimental mean and the reference value will be positive or negative. Thus the test used must cover either possibility. Such a test is called two-sided (or two-tailed).

  13. In a few cases, however, a different kind of test may be appropriate. Consider, for example, an experiment in which it is hoped to increase the rate of reaction by addition of a catalyst. In this case, it is clear before the experiment begins that the only result of interest is whether the new rate is greater than the old, and only an increase need be tested for significance. This kind of test is called one-sided (or one-tailed). For a given value of n and a particular probability level, the critical value for a one-sided test differs from that for a two-sided test. In a one-sided test for an increase, the critical value of t (rather than |t|) for P = 0.05 is that value which is exceeded with a probability of 5%. Since the sampling distribution of the mean is assumed to be symmetrical, this probability is twice the probability that is relevant in the two-sided test. The appropriate value for the one-sided test is thus found in the P = 0.10 column of Table A.2. Similarly, for a one-sided test at the P = 0.01 level, the P = 0.02 column is used.
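
The relationship between the two kinds of critical value can be checked numerically; the short sketch below (an addition, not from the original text) shows that the one-sided P = 0.05 critical value for 9 degrees of freedom equals the two-sided P = 0.10 value.

```python
from scipy import stats

df = 9
one_sided = stats.t.ppf(1 - 0.05, df)       # upper 5% point of t
two_sided = stats.t.ppf(1 - 0.10 / 2, df)   # the same upper 5% point
print(one_sided, two_sided)                 # both approx. 1.833
```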

  14. F-test for the comparison of standard deviations The significance tests described so far are used for comparing means, and hence for detecting systematic errors. In many cases it is also important to compare the standard deviations, i.e. the random errors, of two sets of data. As with tests on means, this comparison can take two forms. Either we may wish to test whether Method A is more precise than Method B (i.e. a one-sided test) or we may wish to test whether Methods A and B differ in their precision (i.e. a two-sided test). For example, if we wished to test whether a new analytical method is more precise than a standard method, we would use a one-sided test; if we wished to test whether two standard deviations differ significantly (e.g. before applying a t-test; see Section 3.3 above), a two-sided test is appropriate. The F-test considers the ratio of the two sample variances, i.e. the ratio of the squares of the standard deviations: F = s1²/s2².

  15. If the null hypothesis is true then the variance ratio should be close to 1. Differences from 1 can occur because of random variation, but if the difference is too great it can no longer be attributed to this cause. If the calculated value of F exceeds a certain critical value (obtained from tables) then the null hypothesis is rejected. This critical value of F depends on the size of both samples, the significance level and the type of test performed. The values for P = 0.05 are given in Appendix 2 in Table A.3 for one-sided tests and in Table A.4 for two-sided tests; the use of these tables is illustrated in the following examples.
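
As an illustration (an addition, not from the original text), here is a sketch of a two-sided F-test in Python; the standard deviations and sample sizes are hypothetical, and the larger variance is placed in the numerator so that F >= 1.

```python
from scipy import stats

s_large, n_large = 0.31, 8     # hypothetical: sample with the larger s
s_small, n_small = 0.21, 8     # hypothetical: sample with the smaller s

F = s_large ** 2 / s_small ** 2                 # variance ratio, F >= 1

# Two-sided test at P = 0.05: use the upper 2.5% point of F.
F_crit = stats.f.ppf(1 - 0.05 / 2, n_large - 1, n_small - 1)
print(f"F = {F:.3f}, critical value = {F_crit:.3f}")
# The null hypothesis of equal variances is rejected if F > F_crit.
```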

  16. As with the t-test, other significance levels may be used for the F-test and the critical values can be found from the tables listed in the bibliography at the end of Chapter 1. Care must be taken that the correct table is used, depending on whether the test is one- or two-sided: for an α% significance level the α% points of the F distribution are used for a one-sided test and the (α/2)% points are used for a two-sided test. If a computer is used it will be possible to obtain a P-value. Note that Excel carries out only a one-sided F-test and that it is necessary to enter the sample with the larger variance as the first sample. Minitab does not give an F-test for comparing the variances of two samples.

  17. Outliers Every experimentalist is familiar with the situation in which one (or possibly more) of a set of results appears to differ unreasonably from the others in the set. Such a measurement is called an outlier. In some cases an outlier may be attributed to a human error. For example, if the following results were given for a titration: 12.12, 12.15, 12.13, 13.14, 12.12 ml then the fourth value is almost certainly due to a slip in writing down the result and should read 12.14. However, even when such obviously erroneous values have been removed or corrected, values which appear to be outliers may still occur. Should they be kept, come what may, or should some means be found to test statistically whether or not they should be rejected? Obviously the final values presented for the mean and standard deviation will depend on whether or not the outliers are rejected.

  18. The ISO recommended test for outliers is Grubbs' test. This test compares the deviation of the suspect value from the sample mean with the standard deviation of the sample: G = |suspect value − x̄| / s, where x̄ and s are calculated with the suspect value included. The suspect value is the value that is furthest away from the mean. The critical values of G for P = 0.05 are given in Table A.5. If the calculated value of G exceeds the critical value, the suspect value is rejected. The values given are for a two-sided test, which is appropriate when it is not known in advance at which extreme an outlier may occur.
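
A minimal sketch of Grubbs' test (an addition, not from the original text): the data are hypothetical, and the critical value 2.020 is the two-sided P = 0.05 value for n = 7 as quoted in standard Grubbs tables; consult Table A.5 for other sample sizes.

```python
import numpy as np

data = np.array([0.403, 0.410, 0.401, 0.380,
                 0.400, 0.413, 0.408])          # hypothetical results

mean = data.mean()
s = data.std(ddof=1)                            # suspect value included
suspect = data[np.argmax(np.abs(data - mean))]  # value furthest from mean
G = abs(suspect - mean) / s

G_crit = 2.020                                  # two-sided, P = 0.05, n = 7
print(f"suspect = {suspect}, G = {G:.3f}, critical value = {G_crit}")
# The suspect value is rejected only if G exceeds the critical value.
```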

  19. Ideally, further measurements should be made when a suspect value occurs, particularly if only a few values have been obtained initially. This may make it clearer whether or not the suspect value should be rejected, and, if it is still retained, will also reduce to some extent its effect on the mean and standard deviation.

  20. Dixon's test Dixon's test (sometimes called the Q-test) is another test for outliers which is popular because the calculation is simple. For small samples (size 3 to 7) the test assesses a suspect measurement by comparing the difference between it and the measurement nearest to it in size with the range of the measurements: Q = |suspect value − nearest value| / (largest value − smallest value). For larger samples the form of the test is modified slightly. A reference containing further details is given at the end of this chapter. The critical values of Q for P = 0.05 for a two-sided test are given in Table A.6. If the calculated value of Q exceeds the critical value, the suspect value is rejected.
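
For comparison, here is a sketch of the small-sample form of Dixon's test (an addition, not from the original text) on the same hypothetical data; the critical value 0.570 is the two-sided P = 0.05 value for n = 7 as quoted in standard Q tables (Table A.6).

```python
import numpy as np

data = np.sort(np.array([0.403, 0.410, 0.401, 0.380,
                         0.400, 0.413, 0.408]))   # hypothetical results

data_range = data[-1] - data[0]
# Test whichever extreme lies further from its nearest neighbour.
q_low = (data[1] - data[0]) / data_range
q_high = (data[-1] - data[-2]) / data_range
Q = max(q_low, q_high)

Q_crit = 0.570                                    # two-sided, P = 0.05, n = 7
print(f"Q = {Q:.3f}, critical value = {Q_crit}")
# The suspect value is rejected only if Q exceeds the critical value.
```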

  21. Analysis of variance (ANOVA) In Section 3.3 a method was described for comparing two means to test whether they differ significantly. In analytical work there are often more than two means to be compared. Some possible situations are: comparing the mean concentration of protein in solution for samples stored under different conditions; comparing the mean results obtained for the concentration of an analyte by several different methods; and comparing the mean titration results obtained by several different experimentalists using the same apparatus. In all these examples there are two possible sources of variation. The first, which is always present, is due to the random error in measurement. This was discussed in detail in the previous chapter: it is this error which causes a different result to be obtained each time a measurement is repeated under the same conditions. The second possible source of variation is due to what is known as a controlled or fixed-effect factor.

  22. For the examples above the controlled factors are respectively the conditions under which the solution was stored, the method of analysis used, and the experimentalist carrying out the titration. Analysis of variance (frequently abbreviated to ANOVA) is an extremely powerful statistical technique which can be used to separate and estimate the different causes of variation. For the particular examples above, it can be used to separate any variation which is caused by changing the controlled factor from the variation due to random error. It can thus test whether altering the controlled factor leads to a significant difference between the mean values obtained.

  ANOVA can also be used in situations where there is more than one source of random variation. Consider, for example, the purity testing of a barrelful of sodium chloride. Samples are taken from different parts of the barrel chosen at random and replicate analyses performed on these samples. In addition to the random error in the measurement of the purity, there may also be variation in the purity of the samples from different parts of the barrel. Since the samples were chosen at random, this variation will be random and is thus sometimes known as a random-effect factor. Again, ANOVA can be used to separate and estimate the sources of variation. Both types of statistical analysis described above, i.e. where there is one factor, either controlled or random, in addition to the random error in measurement, are known as one-way ANOVA.

  23. Comparison of several means Table 3.2 shows the results obtained in an investigation into the stability of a fluorescent reagent stored under different conditions. The values given are the fluorescence signals (in arbitrary units) from dilute solutions of equal concentration. Three replicate measurements were made on each sample. The table shows that the mean values for the four samples are different. However, we know that because of random error, even if the true value which we are trying to measure is unchanged, the sample mean may vary from one sample to the next. ANOVA tests whether the difference between the sample means is too great to be explained by the random error.

  24. Figure 3.2 shows a dot-plot comparing the results obtained in the different conditions. This suggests that there is little difference between conditions A and B but that conditions C and D differ both from A and B and from each other.

  25. Within-sample variation

  26. Between-sample variation
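
The two quantities named on the slides above can be computed directly. The sketch below (an addition, not from the original text) carries out a one-way ANOVA by hand, forming the within-sample and between-sample mean squares and their F ratio; the fluorescence readings are hypothetical stand-ins for Table 3.2 (four storage conditions, three replicates each).

```python
import numpy as np
from scipy import stats

# Hypothetical fluorescence signals (arbitrary units) for conditions A-D.
samples = {
    "A": [102, 100, 101],
    "B": [101, 101, 104],
    "C": [97, 95, 99],
    "D": [90, 92, 94],
}
groups = [np.array(v, dtype=float) for v in samples.values()]
h = len(groups)                        # number of samples (conditions)
n = len(groups[0])                     # replicates per sample
grand_mean = np.concatenate(groups).mean()

# Within-sample mean square: pooled estimate of the measurement variance.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
ms_within = ss_within / (h * (n - 1))

# Between-sample mean square: variation of the sample means about the
# grand mean, scaled by the number of replicates.
ss_between = n * sum((g.mean() - grand_mean) ** 2 for g in groups)
ms_between = ss_between / (h - 1)

F = ms_between / ms_within
F_crit = stats.f.ppf(0.95, h - 1, h * (n - 1))    # one-sided, P = 0.05
print(f"F = {F:.2f}, critical value = {F_crit:.2f}")

# Cross-check against the library routine:
F_lib, p = stats.f_oneway(*groups)
```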

  27. The chi-squared test In the significance tests so far described in this chapter the data have taken the form of observations which, apart from any rounding off, have been measured on a continuous scale. In contrast, this section is concerned with frequency, i.e. the number of times a given event occurs. For example, Table 2.2 gives the frequencies of the different values obtained for the nitrate ion concentration when 50 measurements were made on a sample. As discussed in Chapter 2, such measurements are usually assumed to be drawn from a population which is normally distributed. The chi-squared test could be used to test whether the observed frequencies differ significantly from those which would be expected on this null hypothesis.

  28. Since the calculation involved in using this statistic to test for normality is relatively complicated, it will not be described here. (A reference to a worked example is given at the end of the chapter.) The principle of the chi-squared test is more easily understood by means of the following example.

  29. The calculation of χ² suggests that a significant result is obtained because of the high number of breakages reported by the first worker. To study this further, additional chi-squared tests can be performed. One of them tests whether the second, third and fourth workers differ significantly from each other: in this case each expected frequency is (17 + 11 + 9)/3. (Note that the t-test cannot be used here as we are dealing with frequencies and not continuous variates.)

  30. Another tests whether the first worker differs significantly from the other three workers taken as a group. In this case there are two classes: the breakages by the first worker, with an expected frequency of 15.25, and the total breakages by the other workers, with an expected frequency of 15.25 × 3 = 45.75. In such cases, when there are only two classes and hence one degree of freedom, an adjustment known as Yates's correction should be applied. This involves replacing O − E by |O − E| − 0.5. For example, if O − E = −4.5, then |O − E| = 4.5 and |O − E| − 0.5 = 4. These further tests are given as an exercise at the end of this chapter. In general the chi-squared test should be used only if the total number of observations is 50 or more and the individual expected frequencies are not less than 5. This is not a rigid rule: a reference is given at the end of this chapter which discusses this point further. Other applications of the chi-squared test are also described in this reference.
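
A sketch of these calculations follows (an addition, not from the original text). The observed counts are inferred from the figures quoted above: an expected frequency of 15.25 implies 4 × 15.25 = 61 breakages in total, and with 17 + 11 + 9 = 37 attributed to the other three workers, the first worker accounts for 24.

```python
import numpy as np
from scipy import stats

observed = np.array([24, 17, 11, 9])               # breakages per worker
expected = np.full(4, observed.sum() / 4)          # 15.25 each

# Basic chi-squared statistic with 3 degrees of freedom.
chi2 = ((observed - expected) ** 2 / expected).sum()
chi2_crit = stats.chi2.ppf(0.95, df=3)
print(f"chi2 = {chi2:.2f}, critical value = {chi2_crit:.2f}")

# First worker vs the other three as a group: two classes, one degree
# of freedom, so Yates's correction is applied.
obs2 = np.array([24.0, 37.0])
exp2 = np.array([15.25, 45.75])
chi2_yates = ((np.abs(obs2 - exp2) - 0.5) ** 2 / exp2).sum()
print(f"chi2 with Yates's correction = {chi2_yates:.2f}")
```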

  31. Testing for normality of distribution As has been emphasized in this chapter, many statistical tests assume that the data used are drawn from a normal population. One method of testing this assumption, using the chi-squared test, was mentioned in the previous section. Unfortunately, this method can only be used if there are 50 or more data points. It is common in experimental work to have only a small set of data. A simple visual way of seeing whether a set of data is consistent with the assumption of normality is to plot a cumulative frequency curve on special graph paper known as normal probability paper. This method is most easily explained by means of an example.
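
As a computational stand-in for normal probability paper (an addition, not from the original text), a normal quantile plot can be drawn with scipy and matplotlib; roughly linear points are consistent with normality. The data below are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

data = np.array([109, 89, 99, 99, 107, 111, 86, 74,
                 115, 107, 134, 113, 110, 88, 104])   # hypothetical

# Plot sample quantiles against theoretical normal quantiles.
stats.probplot(data, dist="norm", plot=plt)
plt.title("Normal probability plot")
plt.show()

# A numerical check suited to small samples (not the chapter's method):
W, p = stats.shapiro(data)
print(f"Shapiro-Wilk W = {W:.3f}, P = {p:.3f}")
```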
