The data from a Poisson distribution is an integer number and represents
number of events per time unit, area unit, unit of volume, etc.
Of course, these units do not need to be an exact number of minutes or
area, etc. An area can be a painted sheet of metal where number of flaws
are counted.
However, all results are expected to come from the
same sized units, unless otherwise is stated. But if the data come from
e.g. different sized panels, differently long time periods, etc, how can
this be handled? In the analysis there then must be an extra column
containing the different sizes. This is called offset in the
literature and is entered as an extra term in the model. There are some
examples below.
Test of model. A very important concept
of any analysis is to test the final model. In ordinary regression
analysis this is often done by using the so-called residuals. In
analysis of Poisson-data this is done by the test of deviance
(here called ‘overdispersion’), similar to the residuals.
This test
tells the analyst if the model is incorrect, perhaps some other
explanatory data-variable is needed, etc. Such a test is included in the
examples below.
The test command (“check_overdispersion()”, in
library “performance”) prints three rows and the important one is the
‘p’-value. A low ‘p’-value (e.g. < 0.05) indicates that the model is
incorrect; perhaps too simple, perhaps too complicated and is meant to
make the analyst aware of the need for further ínvestigations.
(See https://ovn.ing-stat.se [7] and [20] for more details
about the Poisson distribution.)
Four diffent simulated examples:
1. Two groups (A and B) simulated with different lambda-values
2. Two groups (A and B) simulated from two different surfaces
3. Two groups with different lambda during different time periods
4. Different Po-intensities with trend
1. Two groups (A and B) simulated with different lambda-values
The example shows data from two groups/machines/processes ‘A’ and ‘B’, each from a Poisson but different lambda values:
library(performance) # Functions to test the data.
lam1 = 1.4 # Lambda for group A.
lam2 = lam1*2 # Lambda for group B, twice the size of A.
antal = 2000 # Number of simulated values.
dataA <- rpois(antal, lam1) # Simulated Po-data grp A.
dataB <- rpois(antal, lam2) # Simulated Po-data grp B.
allData <- c(dataA, dataB) # All data in one column.
maskin <- c(rep("A", antal), rep("B", antal)) # Creating a column for the two groups, A nad B.
# --- analysis without the 'offset'-term.
analysAB <- glm(allData ~ maskin, family=poisson) # 'maskin' is the only explantory variable.
summary(analysAB) # Summary of the model.
##
## Call:
## glm(formula = allData ~ maskin, family = poisson)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.38424 0.01845 20.82 <2e-16 ***
## maskinB 0.64609 0.02278 28.36 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 5606.2 on 3999 degrees of freedom
## Residual deviance: 4759.4 on 3998 degrees of freedom
## AIC: 13852
##
## Number of Fisher Scoring iterations: 5
b0 <- analysAB$coef[1] # Extracts the two coefficients.
b1 <- analysAB$coef[2]
estlamA1 <- round(exp(b0 + 0), 2) # Estimating the lambda-values.
estlamB1 <- round(exp(b0 + b1), 2)
paste0("Est lam1 = ", estlamA1, ", Est lam2 = ", estlamB1) # Prints result.
## [1] "Est lam1 = 1.47, Est lam2 = 2.8"
check_overdispersion(analysAB) # Check for incorrect model.
## # Overdispersion test
##
## dispersion ratio = 1.058
## Pearson's Chi-Squared = 4229.530
## p-value = 0.005
Example 1. Here Po-intensities are 1.4 and 2.8 and this is the
only difference between A and B, no trends, etc. Also, it is understood
that the data comes from equally sized time periods, size of areas, etc.
The analysis is thus the same as the ones in ‘Poisson Regression (I)’.
The estimates (Est1 = …‘) of the two lamda-values correspond well
with the simulated values.
The test of ’overdispersion’ indicates
that the model is acceptable.
2 Two groups (A and B) simulated from two different surfaces
The example shows data from two groups/machines/processes ‘A’ and ‘B’, each from a Poisson. The data is observered from two differently sized surfaces. One is twice the size of the other:
library(performance) # Functions to test the data.
lam1 = 1.4 # Lambda for group A.
lam2 = lam1*2 # Lambda for group B, twice the size of A.
antal = 2000 # Number of simulated values.
dataA <- rpois(antal, lam1) # Simulated Po-data grp A.
dataB <- rpois(antal, lam2) # Simulated Po-data grp B.
allData <- c(dataA, dataB) # All data in one column.
maskin <- c(rep("A", antal), rep("B", antal)) # Creating a column for the two groups, A nad B.
# --- analysis with fixed 'offset'-term.
area <- c(rep(1.0, antal), rep(2.0, antal)) # Creates a column with the two sizes of sheet.
analysAB <- glm(allData ~ maskin + offset(log(area)), family=poisson)
summary(analysAB)
##
## Call:
## glm(formula = allData ~ maskin + offset(log(area)), family = poisson)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.33146 0.01895 17.495 <2e-16 ***
## maskinB 0.02025 0.02313 0.876 0.381
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 4743.3 on 3999 degrees of freedom
## Residual deviance: 4742.5 on 3998 degrees of freedom
## AIC: 13754
##
## Number of Fisher Scoring iterations: 5
b0 <- analysAB$coef[1] # Extracts the two coefficients.
b1 <- analysAB$coef[2]
estlamA2 <- round(exp(b0 + 0), 2) # Estimating the lambda-values.
estlamB2 <- round(exp(b0 + b1), 2)
paste0("Est lam1 = ", estlamA2, ", Est lam2 = ", estlamB2) # Prints result.
## [1] "Est lam1 = 1.39, Est lam2 = 1.42"
check_overdispersion(analysAB) # Check for incorrect model.
## # Overdispersion test
##
## dispersion ratio = 1.046
## Pearson's Chi-Squared = 4182.812
## p-value = 0.021
Example 2. Again, one lambda is twice the other one but now the
analysis incorporates that the second one is observed on sufaces twice
the size. Now the estimates (Est1 = …’) are approximately equal.
This is of course as expected: if there are 1.4 events per surface unit,
then is of course 2.8 events per two surface units.
3 Two groups with different lambda during different time periods
The example shows data from two groups/machines/processes ‘A’ and ‘B’, each from a Poisson. Each ‘A’-datavalue comes from 1 hour-period while ‘B’-datavalues comes from timeperiods 0.5-1.0 hours. All times are stored in the column ‘allexpon_times’:
lam1 = 1.4 # Events per hour
antal = 2000
maskin <- c(rep("A", antal), rep("B", antal))
expon_timesA <- c(rep(1, antal))
expon_timesB <- runif(antal, 0.5, 1.0) # Times: 0.5 - 1.0 hour
allexpon_times <- c(expon_timesA, expon_timesB) # All expo-times in one column.
maskin <- c(rep("A", antal), rep("B", antal))
lam2 = 1.4 * expon_timesB # Events machine B, varying expo-times.
dataA <- rpois(antal, lam1) # Simulated Po-data grp A.
dataB <- rpois(antal, lam2) # Simulated Po-data grp B.
allData <- c(dataA, dataB) # All data in one column.
analysAB <- glm(allData ~ maskin + offset(log(allexpon_times)), family=poisson)
summary(analysAB)
##
## Call:
## glm(formula = allData ~ maskin + offset(log(allexpon_times)),
## family = poisson)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.35767 0.01870 19.128 <2e-16 ***
## maskinB -0.01562 0.02870 -0.544 0.586
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 4586.4 on 3999 degrees of freedom
## Residual deviance: 4586.1 on 3998 degrees of freedom
## AIC: 11334
##
## Number of Fisher Scoring iterations: 5
check_overdispersion(analysAB)
## # Overdispersion test
##
## dispersion ratio = 0.979
## Pearson's Chi-Squared = 3912.228
## p-value = 0.831
b0 <- analysAB$coef[1] # Extracts the two coefficients.
b1 <- analysAB$coef[2]
estlamA3 <- round(exp(b0 + 0), 2)
estlamB3 <- round(exp(b0 + b1), 2)
paste0("Est lam1 = ", estlamA3, ", Est lam2 = ", estlamB3) # Prints result.
## [1] "Est lam1 = 1.43, Est lam2 = 1.41"
check_overdispersion(analysAB) # Check for incorrect model.
## # Overdispersion test
##
## dispersion ratio = 0.979
## Pearson's Chi-Squared = 3912.228
## p-value = 0.831
Example 3. Here Po-intensity for ‘A’ is 1.4 but for ‘B’ this
value is multiplied by the ‘allexpon_times’. This mean that for ‘B’ the
data is simulated as 2000 different situations (if all simulated times
are different).
The model test does not indicate any problem with
the applied model. An of course, the estimated lambda value are
approximately equal (simulated as 1.4).
4. Different Po-intensities with trend
The example shows two groups each with a trend in lambda. A first analysis is performed with only one explanatory variable (the two groups A and B stored in the column ‘process’). Because of the trend, the model test will show a very low ‘p’-value indicating the the model can be improved:
library(performance) # Functions to test the data.
y1A <- 1.5; Xstart <- 0 # Start point for curve of Po-intensity grp 'A'.
y2A <- 3.6; Xstopp <- 400 # Stop point for curve of Po-intensity grp 'A'.
lamA <- -log(y2A/y1A)/Xstopp
y1B <- 2.5 # Start point for curve of Po-intensity grp 'B'.
y2B <- 4.7 # Stop point for curve of Po-intensity grp 'B'.
lamB <- -log(y2B/y1B)/Xstopp
Xaxel <- c(0:(Xstopp-1)) # X-axis.
lambdaChangeGrpA <- y1A * exp(-lamA*Xaxel) # Column of intensity values, grp A.
lambdaChangeGrpB <- y1B * exp(-lamB*Xaxel) # Column of intensity values, grp B.
dataA <- rpois(length(lambdaChangeGrpA), lambdaChangeGrpA) # Simulated Po-data grp A.
dataB <- rpois(length(lambdaChangeGrpA), lambdaChangeGrpB) # Simulated Po-data grp B.
allData <- c(dataA, dataB) # All data in one column.
process <- c(rep("A", length(Xaxel)), rep("B", length(Xaxel)))
analysAB <- glm(allData ~ process, family=poisson)
summary(analysAB)
##
## Call:
## glm(formula = allData ~ process, family = poisson)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.84801 0.03272 25.916 <2e-16 ***
## processB 0.37060 0.04254 8.712 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1101.5 on 799 degrees of freedom
## Residual deviance: 1024.3 on 798 degrees of freedom
## AIC: 3145.6
##
## Number of Fisher Scoring iterations: 5
b0 <- analysAB$coef[1] # Extracts the two coefficients.
b1 <- analysAB$coef[2]
estlamA4 <- round(exp(b0 + 0), 2) # Estimating the lambda-values.
estlamB4 <- round(exp(b0 + b1), 2)
paste0("Est lam1 = ", estlamA4, ", Est lam2 = ", estlamB4) # Prints result.
## [1] "Est lam1 = 2.33, Est lam2 = 3.38"
check_overdispersion(analysAB) # Check for incorrect model.
## # Overdispersion test
##
## dispersion ratio = 1.186
## Pearson's Chi-Squared = 946.406
## p-value = < 0.001
Exampel 4a. The analysis estimates the two lambda as values
somewhere in between the start- and stop values of the simulation. This
is of course the best estimates given the single explanatory variable.
However, the ‘Overdispersion test’ shows that the model can be
improved.Thus the analysist seeks extra expanatory variables and finds a
‘time’-variable that will be added in following analysis:
Ttot <- c(Xaxel, Xaxel) # The time variable.
analysAB <- glm(allData ~ process + Ttot, family=poisson)
summary(analysAB)
##
## Call:
## glm(formula = allData ~ process + Ttot, family = poisson)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.4640040 0.0524606 8.845 <2e-16 ***
## processB 0.3706032 0.0425413 8.712 <2e-16 ***
## Ttot 0.0018152 0.0001835 9.894 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1101.53 on 799 degrees of freedom
## Residual deviance: 925.16 on 797 degrees of freedom
## AIC: 3048.4
##
## Number of Fisher Scoring iterations: 5
check_overdispersion(analysAB) # Check for incorrect model.
## # Overdispersion test
##
## dispersion ratio = 1.059
## Pearson's Chi-Squared = 843.689
## p-value = 0.122
Example 4b. The analysis above now shows that the two term for
‘process’ and time (‘Ttot’) are significant (< 0.05) as exepcted (see
the column far right).
The ‘Overdispersion test’ shows a ‘p’-value
that indicates the the model is satistfactory.
Estimated lambda curves. Below the two estimated curves are shown
as solid lines. On either side the dashed lines incorporates 90% of all
random values from the lambda value at that point. (Because the Poisson
data is integers, the dashed curves give stepped lines):
library(ggplot2)
b0 <- analysAB$coef[1] # Extracts the two coefficients.
b1 <- analysAB$coef[2]
b2 <- analysAB$coef[3]
fA1 <- function(x) {qpois(0.05, exp(b0 + b1 + b2*x))}
fA2 <- function(x) {exp(b0 + b1 + b2*x)}
fA3 <- function(x) {qpois(0.95, exp(b0 + b1 + b2*x))}
fB1 <- function(x) {qpois(0.05, exp(b0 + 0 + b2*x))}
fB2 <- function(x) {exp(b0 + 0 + b2*x)}
fB3 <- function(x) {qpois(0.95, exp(b0 + 0 + b2*x))}
dia <- ggplot() + xlim(0, Xstopp) +
geom_function(fun = fA1, colour='red', linewidth=0.5, linetype=2) +
geom_function(fun = fA2, colour='red', linewidth=0.5, linetype=1) +
geom_function(fun = fA3, colour='red', linewidth=0.5, linetype=2) +
geom_function(fun = fB1, colour='black', linewidth=0.5, linetype=2) +
geom_function(fun = fB2, colour='black', linewidth=0.5, linetype=1) +
geom_function(fun = fB3, colour='black', linewidth=0.5, linetype=2) +
annotate("text", x = 200, y = Inf, label="Estimated lambda (red: grp A, black: grp B Example 4b)",
vjust = 1.5, size = 5, colour = "blue") +
ylab(" ") + xlab("")
dia
The difference between the solid lines, modelling the change in lambda,
is approximately 1. As lambda is the mean as well as the variance, the
square root of lambda is the sigma. Sigma will here then be >1, thus
the difference between process ‘A’ and ‘B’ cannot be spotted by
graphical methods only, a formal analysis, as here, is needed.
This
is illustrated here by the dashed areas overlap considerably, which is
also illustrated in several graphs in the document ‘Poisson Regression
(I)’ where the data is included.
Final remarks
The results in the examples can perhaps be
positive or negative. One process owner perhaps likes a decreasing
λ (decrease in number of customer complaints?). Another process
owner perhaps likes an increasing λ-trend (increased number of
customers?), etc.
The examples above are by no means exhausting the
possibilities and perhaps necessary diagnostics. But here the simplicity
has been emphazised.
There are many books about the subject of
‘Po-regression’. Here the book ‘Negative binomial Regression’ (2nd ed,
Joseph M. Hilbe) has been used, especially chapter 7.
(The analyses have been performed by the ‘R’-software and with the graphical interface ‘R’-studio. Both these ara available on the net. See https://www.indstat.se and the button [Statistikprogram - R] for installation.)
(See https://www.indstat.se for many more simulations and animations.)