Colour coding:
# comments
> user commands
computer output
# First follow the procedure outlined here to make the functions
# visible from within R.
# Source the functions and example data into R :
> source("est.dispersion")
> source("test.fit")
> pastdata <- read.table("pastdata", header=T)
# view the example past data:
> pastdata
name | estimated | observed |
interpreter | 40 | 36 |
photoedit | 20 | 23 |
insurecalc | 10 | 16 |
fingerprints | 20 | 15 |
graphologist | 20 | 35 |
archivist | 40 | 50 |
etc.
# Interpretation of the columns:
# name: the name of project i.
# estimated: estimated total time, in person-days, required for
# completion of project i from start to finish.
# observed: observed total time, in person-days, required for
# completion of project i from start to finish.
# The estimated times could be obtained either from guesswork before the
# projects started, or as the output from a predictive model.
# The projects have been completed, so the observed times are known.
# Apply the EST.DISPERSION function to the past projects in pastdata.
# The EST.DISPERSION function is most appropriate when the estimated efforts were
# obtained by guesswork before the projects started. If they were obtained as the output
# from a regression model, the dispersion would normally be estimated as part of the model,
# rendering separate estimation unnecessary.
# The syntax is:
#
# est.dispersion(obs, est, distname)
#
# obs is the set of observed times (here extracted as pastdata$observed);
# est is the set of estimated times (here extracted as pastdata$estimated);
# distname is the selected distribution name (must be in quotes).
# distname should be one of: "norm" (Normal), "gamma" (Gamma), "lnorm" (Lognormal),
# or "lnorm.med" (Lognormal, median specification).
# If the Poisson or Chisquare distributions are to be used, no estimate of dispersion is necessary.
# dispersion is the variance for the Normal option, the variance/(mean squared) for the Gamma
# and Lognormal options, and the variance/(median squared) for the Lognormal-median option.
#
# EXAMPLE:
# Estimate the dispersion for the example data pastdata,
# assuming that the effort of each project is Gamma with common unknown dispersion,
# and mean given by the estimated column in pastdata.
# Store the result with name disp.
> disp <- est.dispersion(obs=pastdata$observed, est=pastdata$estimated, distname="gamma")
Estimated dispersion: 0.0459748
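For intuition, a Gamma dispersion (variance/mean squared) can be estimated by the method of moments: under the hypothesized model each observation has mean est[i] and variance dispersion * est[i]^2, so the squared relative errors ((obs-est)/est)^2 average out to roughly the dispersion. The Python sketch below applies this to the six rows shown above. It is illustrative only: est.dispersion's actual estimator may differ, and the full pastdata set contains more rows, so this does not reproduce 0.0459748.

```python
# Method-of-moments sketch of a Gamma dispersion estimate.
# Under the model, Var(obs_i) = dispersion * est_i^2, so each squared
# relative error ((obs_i - est_i) / est_i)^2 has expectation ~ dispersion.
# NOTE: illustrative only; not necessarily the estimator est.dispersion uses.
obs = [36, 23, 16, 15, 35, 50]  # pastdata$observed, first six rows only
est = [40, 20, 10, 20, 20, 40]  # pastdata$estimated, first six rows only

disp_mom = sum(((o - e) / e) ** 2 for o, e in zip(obs, est)) / len(obs)
print(round(disp_mom, 4))  # 0.18 on these six rows alone
```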
# Now apply the TEST.FIT function to check whether this distribution is an adequate
# fit to the past data. The function performs Chisquare and Kolmogorov-Smirnov
# goodness-of-fit tests.
# The hypothesized distribution for project i is distname with mean est[i]
# and dispersion as specified. The observation is obs[i].
# For the Chisquare test, observations are grouped according to certain specified boundaries,
# and the number of observations in each grouping is compared with that expected if the
# hypothesized distribution is correct.
# For the Kolmogorov-Smirnov (K-S) test, the observations are transformed into a Uniform(0, 1)
# distribution and the empirical distribution function is compared against the Uniform(0, 1)
# distribution function.
# The syntax is:
#
# test.fit(obs, est, probs, distname, dispersion, n.param.est, central)
#
# obs, est, and distname are as above.
# The options "pois" (Poisson) and "chisq" (Chisquare) are now allowed for distname,
# in addition to "norm", "gamma", "lnorm", and "lnorm.med".
# dispersion is the dispersion, if appropriate, usually estimated from the data as above.
# probs specifies the probability regions by which the observations obs are
# grouped for the Chisquare test:
# e.g. if probs=c(0.25, 0.5, 0.75) (the default), then the first grouping comprises observations
# that lie in the first 25% probability region of their hypothesized distribution;
# the second grouping comprises those outside the first 25% region but inside the first 50% region,
# and so on.
# If central=TRUE, the probability regions are centered on the distribution median, so
# the first 25% probability region is the 25% of the distribution closest to the median.
# If central=FALSE (the default), the probability regions are not centered, so that the first 25%
# probability region is the lower 25% tail, and so on.
# It is generally better to use central=FALSE, because centralizing means that a surplus in the
# upper tail can cancel out a deficit in the lower tail, so a bad fit might not be detected.
# n.param.est specifies the number of parameters that have been estimated from the data,
# and it must be supplied even if 0.
# If the estimated values est were derived by guesswork only, then n.param.est
# may be 0 for the Poisson and Chisquare distributions, and 1 for the Normal, Lognormal,
# and Gamma distributions if dispersion was estimated from the data.
# If the estimated values est were derived from the data, add one to n.param.est
# for each parameter involved:
# e.g. add one to n.param.est if est = mean(obs),
# or add two to n.param.est if est is the output from a regression model
# with two parameters, as in est = alpha + beta * (some measurement).
# (Here alpha and beta are the parameters.)
#
# THEORETICAL DETAILS:
# Function TEST.FIT uses the likelihood ratio test statistic rather than the Pearson X^2:
# the likelihood ratio statistic should be more accurate although it is
# less commonly used (see ref).
# The asymptotic distribution of the test statistic is bounded between the
# Chisquare distribution with df = (ngroups - n.param.est - 1) and the
# Chisquare distribution with df = (ngroups - 1) (see ref).
# This means we need to find the p-value for both of these distributions, and the output from the
# test is often vague.
# In addition, a reasonably large sample is needed for the asymptotic distribution to apply:
# this function rather arbitrarily gives a warning about sample sizes less than 25.
# The interpretation of the p-values given by the function is intentionally vague, and the tests
# should be regarded as a guideline, not a definite answer.
# The smaller the p-values, the less likely it is that the hypothesized distribution is correct;
# but exactly how small is small is arguable.
# Reference: Kendall, M.G. and Stuart, A. (1979)
# Advanced Theory of Statistics, Vol 2, 4th Edition,
# sections 30.1 to 30.34.
#
# EXAMPLE:
# Test the Gamma distribution, with dispersion estimated by est.dispersion above and stored
# in disp, on the pastdata data set. The estimated efforts were obtained by guesswork,
# so only the dispersion has been estimated from the data and n.param.est=1.
> test.fit(obs=pastdata$observed, est=pastdata$estimated, probs=c(0.25, 0.5, 0.75), distname="gamma",
    dispersion=disp, n.param.est=1, central=F)
===============================
Chisquare Test: Non-centralized Regions
===============================
Lower and Upper denote the probability region bounding the category.
Actual.count is the number of observations that lie outside the
lower% region of their distribution, but inside the upper% region.
Category | Lower | Upper | Expected.count | Actual.count |
1 | 0 | 25 | 7.5 | 4 |
2 | 25 | 50 | 7.5 | 6 |
3 | 50 | 75 | 7.5 | 10 |
4 | 75 | 100 | 7.5 | 10 |
Likelihood ratio chisquare test statistic: 3.800691
Compare test statistic against both Chisquare( df = 2 ) and Chisquare( df = 3 )
distributions:
p-value for asymptotic test is between 0.1495170 and 0.2838058
----------------------------------------------------------------------------------------------------
=====================
Kolmogorov-Smirnov Test
=====================
Alternative hypothesis: two.sided
Statistic: D = 0.1951866
p-value: p = 0.2031577
----------------------------------------------------------------------------------------------------
Interpretation: Chisquare Test
=======================
The Gamma distribution specified may be appropriate for these data.
Interpretation: K-S Test
==================
The Gamma distribution specified may be appropriate for these data.
# Interpretation:
# There is no evidence from these tests that the Gamma distribution with this dispersion
# disp=0.0459748 is an inappropriate model.
# The p-values all lie between about 0.14 and 0.28. Visual inspection of the observed counts and
# expected counts in the table indicates a possible bias in the upper tails, but there is not enough
# evidence for this to be statistically significant.
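As a sanity check, the likelihood ratio statistic and the two p-value bounds reported above can be recomputed directly from the printed table. A Python sketch, outside the R session, using only the table counts and the standard closed forms for the Chisquare survival function at df = 2 and df = 3:

```python
from math import log, exp, erfc, sqrt, pi

# Likelihood ratio statistic G = 2 * sum( O * ln(O / E) ) over the categories.
observed = [4, 6, 10, 10]        # Actual.count column from the table
expected = [7.5, 7.5, 7.5, 7.5]  # Expected.count column (the counts sum to 30)

G = 2 * sum(o * log(o / e) for o, e in zip(observed, expected))
print(round(G, 6))  # 3.800691

# p-value bounds: survival functions of Chisquare(df=2) and Chisquare(df=3) at G.
p_df2 = exp(-G / 2)                                         # closed form, df = 2
p_df3 = erfc(sqrt(G / 2)) + sqrt(2 * G / pi) * exp(-G / 2)  # closed form, df = 3
print(p_df2, p_df3)  # close to the reported bounds 0.1495170 and 0.2838058
```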
# Now try the same experiments with the Normal distribution.
# First estimate dispersion:
> disp2 <- est.dispersion(obs=pastdata$observed, est=pastdata$estimated, distname="norm")
Estimated dispersion: 44.6
# Now test the fit:
> test.fit(obs=pastdata$observed, est=pastdata$estimated, probs=c(0.25, 0.5, 0.75), distname="norm",
    dispersion=disp2, n.param.est=1, central=F)
===============================
Chisquare Test: Non-centralized Regions
===============================
Lower and Upper denote the probability region bounding the category.
Actual.count is the number of observations that lie outside the
lower% region of their distribution, but inside the upper% region.
Category | Lower | Upper | Expected.count | Actual.count |
1 | 0 | 25 | 7.5 | 3 |
2 | 25 | 50 | 7.5 | 9 |
3 | 50 | 75 | 7.5 | 9 |
4 | 75 | 100 | 7.5 | 9 |
Likelihood ratio chisquare test statistic: 4.34762
Compare test statistic against both Chisquare( df = 2 ) and Chisquare( df = 3 )
distributions:
p-value for asymptotic test is between 0.1137434 and 0.2262917
----------------------------------------------------------------------------------------------------
=====================
Kolmogorov-Smirnov Test
=====================
Alternative hypothesis: two.sided
Statistic: D = 0.2270218
p-value: p = 0.09078339
----------------------------------------------------------------------------------------------------
Interpretation: Chisquare Test
=======================
The Normal distribution specified may be appropriate for these data.
Interpretation: K-S Test
==================
The Normal distribution specified may be appropriate for these data.
# Interpretation:
# Once again, there is no evidence that the Normal distribution with dispersion disp2=44.6
# is an inappropriate model.
# Again, inspection of the printed table suggests that the data might be under-represented in the lower
# tails of the Normal distribution, and over-represented in the upper tails; but the evidence is not
# statistically significant.
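The mechanics of both tests can be illustrated outside R. The Python sketch below classifies the six example rows shown earlier into the non-centralized 25% probability regions under this Normal model, and computes a K-S statistic from the probability-transformed values. Illustrative only: the full pastdata set has more rows, so these counts and this D do not match the output above.

```python
from math import erf, sqrt

def norm_cdf(x, mean, var):
    """CDF of a Normal(mean, var) distribution, via the error function."""
    return 0.5 * (1.0 + erf((x - mean) / sqrt(2.0 * var)))

obs = [36, 23, 16, 15, 35, 50]  # pastdata$observed, first six rows only
est = [40, 20, 10, 20, 20, 40]  # pastdata$estimated, first six rows only
disp2 = 44.6                    # dispersion (variance) estimated above

# Probability integral transform: under the hypothesized model each u
# should be Uniform(0, 1).
u = [norm_cdf(o, e, disp2) for o, e in zip(obs, est)]

# Chisquare grouping with probs=c(0.25, 0.5, 0.75), central=FALSE:
# category k holds observations whose u falls in ((k-1)*0.25, k*0.25].
counts = [sum(lo < ui <= lo + 0.25 for ui in u) for lo in (0.0, 0.25, 0.5, 0.75)]
print(counts)  # [1, 1, 1, 3]

# K-S statistic: largest gap between the empirical CDF of the u values
# and the Uniform(0, 1) CDF.
u.sort()
n = len(u)
D = max(max((i + 1) / n - ui, ui - i / n) for i, ui in enumerate(u))
print(round(D, 3))  # 0.34
```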
# Finally try the same experiments with the Chisquare distribution.
# No estimate of dispersion is needed, so we go straight into test.fit with n.param.est=0.
> test.fit(obs=pastdata$observed, est=pastdata$estimated, probs=c(0.25, 0.5, 0.75), distname="chisq",
    n.param.est=0, central=F)
===============================
Chisquare Test: Non-centralized Regions
===============================
Lower and Upper denote the probability region bounding the category.
Actual.count is the number of observations that lie outside the
lower% region of their distribution, but inside the upper% region.
Category | Lower | Upper | Expected.count | Actual.count |
1 | 0 | 25 | 7.5 | 1 |
2 | 25 | 50 | 7.5 | 9 |
3 | 50 | 75 | 7.5 | 13 |
4 | 75 | 100 | 7.5 | 7 |
Likelihood ratio chisquare test statistic: 12.58729
Compare test statistic against Chisquare( df = 3 ) distribution:
p-value for asymptotic test is 0.005619703
----------------------------------------------------------------------------------------------------
=====================
Kolmogorov-Smirnov Test
=====================
Alternative hypothesis: two.sided
Statistic: D = 0.2416348
p-value: p = 0.06019764
----------------------------------------------------------------------------------------------------
Interpretation: Chisquare Test
=======================
The Chisquare distribution specified is probably not appropriate for these data.
Interpretation: K-S Test
==================
The Chisquare distribution specified may be appropriate for these data.
# Interpretation:
# This time, it looks unlikely that the specified distribution is correct.
# The Chisquare test gives a p-value of about 0.005, and the table shows a clear
# discrepancy between expected and actual counts.
# The K-S test did not detect a statistically significant departure, but the Chisquare
# test provides enough evidence to make it clear that we would be unwise to proceed
# with the Chisquare distribution as our model.