The Poisson distribution is commonly used to analyze ‘event’ data such
as ‘number of customers per week’, ‘number of customer complaints per
month’, ‘number of bacteria in a millilitre’, ‘number of flaws per
square’, etc.
The lowest numerical value is of course 0 but there is
no upper limit. The probability distribution contains only one
parameter, often designated by λ. The expressions for the mean
and the variance (sigma-squared) are showed below, simply as λ
(The derivation is quite simple but is not shown here).
(See https://ovn.ing-stat.se [7] and [20] for more details
about the Poisson distribution.)
\[ \begin{align} \large f(x) = \frac{\lambda e^{-\lambda x}}{x!} & \phantom{litet tomt ut} \large \mu = \lambda & \phantom{litet tomt ut} \large \sigma^{2} = \lambda \end{align} \]
As the mean and variance are equal in a Poisson, this can be used as a
crude test of data at hand. The calculated value mean/variance should be
at least equal to 1. If the value is less than 1, this means that the
variance of the data is exessive, indication that the data contains some
unknown sources of variation.
In this document regression analysis
is used to investigate the data which consists of the results from two
different machines. However, the ordinary theories of regression
analysis - based on normal probability theory - is extended to
incorporate the Poisson distribution using so-called ‘general linearized
models’ (which in fact includes the normal distribution).
Four diffent simulated examples:
1. Both groups having identical Po-intensities, no change over time
2. Different Po-intensities, no change over time
3. Different Po-intensities changing over time
4. Different Po-intensities changing in different directions
Note that for each example a curve is created that gives the lambda values (see the first six lines of code in each example).
1. Both groups having identical Po-intensities, no change over time
The example shows data from two groups/machines/processes ‘A’ and ‘B’, each from a Poisson distribution with the same lambda-value (here a straight line):
library(ggplot2) # Contains functions for graphing.
library(dplyr)
library(patchwork) # Different placing av graphs.
y1A <- 2.5; Xstart <- 0 # Start point for curve of Po-intensity grp 'A'.
y2A <- 2.5; Xstopp <- 200 # Stop point for curve of Po-intensity grp 'A'.
lamA <- -log(y2A/y1A)/Xstopp
y1B <- 2.5 # Start point for curve of Po-intensity grp 'B'.
y2B <- 2.5 # Stop point for curve of Po-intensity grp 'B'.
lamB <- -log(y2B/y1B)/Xstopp
Xaxel <- c(0:(Xstopp-1)) # X-axis.
lambdaChangeGrpA <- y1A * exp(-lamA*Xaxel) # Column of intensity values, grp A.
lambdaChangeGrpB <- y1B * exp(-lamB*Xaxel) # Column of intensity values, grp B.
dataA <- rpois(Xstopp, lambdaChangeGrpA) # Simulated Po-data grp A.
dataB <- rpois(Xstopp, lambdaChangeGrpB) # Simulated Po-data grp B.
allData <- c(dataA, dataB) # All data in one column.
# ------- Start graphs Example 1
result <- data.frame(allData) %>% group_by(allData) %>% summarise(number = n())
hist <- ggplot(result, aes(x = allData, y = number, fill = 'red')) +
geom_bar(stat = 'identity', width = 0.4, colour = 'black') +
scale_x_continuous(breaks = allData) +
labs(x = "Simulated Poisson data, Group A and B", y = "") +
theme(legend.position="none")
average <- mean(allData) # Average and variance of simulated data.
variance <- var(allData)
func1 <- function(Xaxel) {y1A * exp(-lamA*Xaxel)} # Theoretical intensity curves.
func2 <- function(Xaxel) {y1B * exp(-lamB*Xaxel)}
dia <- ggplot() + xlim(0, Xstopp) + # Diagram of the two intensity curves.
geom_function(fun = func1, colour='red', linewidth=1, linetype=2) +
geom_function(fun = func2, colour='black', linewidth=1, linetype=2) +
annotate("text", x = mean(Xaxel), y = Inf, label="Intensity curves\n(red: grp A, black: grp B)\n(Example 1)",
vjust = 1.5, size = 5, colour = "blue") +
ylab("Po-intensity values") + xlab("")
dia + hist # Side by side.
Example 1. The left diagram above shows the two lambda values
(here identical) för the two groups A and B. The histogram shows all
data, i.e. both the A- and the B-data.
Below the very analysis is
performed using the ‘glm()’-function (glm stands for ‘General linearized
models’, the ‘family’-option indicates tha Poisson-data is
analyzed):
Ttot <- c(Xaxel, Xaxel)
allData <- c(dataA, dataB)
process <- c(rep("A", length(Xaxel)), rep("B", length(Xaxel)))
analysAB <- glm(allData ~ process + Ttot, family=poisson)
summary(analysAB)
##
## Call:
## glm(formula = allData ~ process + Ttot, family = poisson)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.401e-01 7.035e-02 13.362 <2e-16 ***
## processB -4.643e-02 6.355e-02 -0.731 0.465
## Ttot -9.945e-05 5.502e-04 -0.181 0.857
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 499.23 on 399 degrees of freedom
## Residual deviance: 498.67 on 397 degrees of freedom
## AIC: 1500.1
##
## Number of Fisher Scoring iterations: 5
Exampel 1. The analysis above shows a common printout for
regression analyses. The ‘Estmate’-column contains the estimated
parameters of the linear model. As the simulated data contains no
difference between the two groups and no trend in the lambda parameter
the analysis is expected. (Note that the parameter of ‘processB’ is
shown as a comparison to ‘processA’.)
The model of lambda. Suppose that the three ‘Estimate’ are close to 0.92, 0, and 0 (varies between runs):
\[ \begin{align} \Large f(x) = e^{(0.92+0\cdot processB + 0\cdot Ttot)}= e^{0.92}=2.5 \phantom{invisible}\text{Note:} & \phantom{litet tomt ut} \large e^{0}=1 \end{align} \]
The result corresponds to the simulated value in ‘Example 1’.
Below the two estimated curves and all data is shown:
b0 <- analysAB$coef[1]
b1 <- analysAB$coef[2]
b2 <- analysAB$coef[3]
funcA <- function(Xaxel) {exp(b0 + 0 + b2 * Xaxel)}
funcB <- function(Xaxel) {exp(b0 + b1 + b2 * Xaxel)}
farg<- recode(process, `A` = 2, `B` = 1)
graf <- ggplot() +
geom_point(aes(c(Xaxel, Xaxel), allData), size=2, col=farg) +
geom_function(fun = funcA, colour='red', linewidth=1) +
geom_function(fun = funcB, colour='black', linewidth=1, linetype=2) +
ylab("Po-data") + xlab("") +
annotate("text", x = mean(Xaxel), y = Inf,
label="Po-data and estimated intensity curves\n(red: grp A, black: grp B, Example 1)",
vjust = 1.1, size = 5, colour = "blue")
graf
Example 1. The diagram above contains all data and the two estimated curves. These curves were simulated as two identical lines. Any difference between them as well as any slope is caused by the randomness in the data.
2. Different Po-intensities, no change over time
The example shows again data from the two groups ‘A’ and ‘B’, but here each group has its own constant parameter value:
library(ggplot2)
library(dplyr)
library(patchwork)
y1A <- 1.5; Xstart <- 0 # Start point for curve of Po-intensity grp 'A'.
y2A <- 1.5; Xstopp <- 200 # Stop point for curve of Po-intensity grp 'A'.
lamA <- -log(y2A/y1A)/Xstopp
y1B <- 2.5 # Start point for curve of Po-intensity grp 'B'.
y2B <- 2.5 # Stop point for curve of Po-intensity grp 'B'.
lamB <- -log(y2B/y1B)/Xstopp
Xaxel <- c(0:(Xstopp-1)) # X-axis.
lambdaChangeGrpA <- y1A * exp(-lamA*Xaxel) # Column of intensity values, grp A.
lambdaChangeGrpB <- y1B * exp(-lamB*Xaxel) # Column of intensity values, grp B.
dataA <- rpois(Xstopp, lambdaChangeGrpA) # Simulated Po-data grp A.
dataB <- rpois(Xstopp, lambdaChangeGrpB) # Simulated Po-data grp B.
allData <- c(dataA, dataB) # All data in one column.
# ------- Start graphs Example 2
result <- data.frame(allData) %>% group_by(allData) %>% summarise(number = n())
hist <- ggplot(result, aes(x = allData, y = number, fill = 'red')) +
geom_bar(stat = 'identity', width = 0.4, colour = 'black') +
scale_x_continuous(breaks = allData) +
labs(x = "Simulated Posson data, Group A and B", y = "") +
theme(legend.position="none")
average <- mean(allData) # Average and variance of simulated data.
variance <- var(allData)
func1 <- function(Xaxel) {y1A * exp(-lamA*Xaxel)} # Theoretical intensity curves.
func2 <- function(Xaxel) {y1B * exp(-lamB*Xaxel)}
dia <- ggplot() + xlim(0, Xstopp) + # Diagram of the two intensity curves.
geom_function(fun = func1, colour='red', linewidth=1, linetype=2) +
geom_function(fun = func2, colour='black', linewidth=1, linetype=2) +
annotate("text", x = mean(Xaxel), y = Inf, label="Intensity curves\n(red: grp A, black: grp B)\n(Example 2)",
vjust = 1.5, size = 5, colour = "blue") +
ylab("Po-intensity values") + xlab("")
dia + hist # Side by side.
Example 2. The left diagram above shows the two lambda values for
the two groups A and B. The histogram shows the mixture of all data.
Below is the analysis:
Ttot <- c(Xaxel, Xaxel)
process <- c(rep("A", length(Xaxel)), rep("B", length(Xaxel)))
analysAB <- glm(allData ~ process + Ttot, family=poisson)
summary(analysAB)
##
## Call:
## glm(formula = allData ~ process + Ttot, family = poisson)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.2957950 0.0854953 3.460 0.000541 ***
## processB 0.5187938 0.0729209 7.114 1.12e-12 ***
## Ttot 0.0010826 0.0006116 1.770 0.076698 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 505.92 on 399 degrees of freedom
## Residual deviance: 450.46 on 397 degrees of freedom
## AIC: 1350.8
##
## Number of Fisher Scoring iterations: 5
Example 2. The analysis above shows that the term for ‘process’
is significant (< 0.05), but the term ‘Ttot’ is not, as exepcted (see
the column far right).
The model of lambda. Using the three ‘Estimate’ from a simulation (rounded to two decimals), the following can be calculate (note that the software codes ‘A’ as 0 and ‘B’ as 1 unless otherwise is stated):
\[ \begin{align} & \large exp(0.40+0.57\cdot processA + 0\cdot Ttot)=1.49 \phantom{invisible} \Large (process A = 0) \\ \\ & \large exp(0.40+0.57\cdot processB + 0\cdot Ttot)=2.64 \phantom{invisible} \Large (process B = 1) \end{align} \]
The result corresponds to the simulated value in ‘Example 2’.
Below the two estimated curves and all data are shown:
b0 <- analysAB$coef[1]
b1 <- analysAB$coef[2]
b2 <- analysAB$coef[3]
funcA <- function(Xaxel) {exp(b0 + 0 + b2 * Xaxel)}
funcB <- function(Xaxel) {exp(b0 + b1 + b2 * Xaxel)}
farg<- recode(process, `A` = 2, `B` = 1)
graf <- ggplot() +
geom_point(aes(c(Xaxel, Xaxel), allData), size=2, col=farg) +
geom_function(fun = funcA, colour='red', linewidth=1) +
geom_function(fun = funcB, colour='black', linewidth=1, linetype=2) +
ylab("Po-data") + xlab("") +
annotate("text", x = mean(Xaxel), y = Inf,
label="Po-data and estimated intensity curves\n(red: grp A, black: grp B, Example 2)",
vjust = 1.1, size = 5, colour = "blue")
graf
Example 2. The diagram above contains all data and the two estimated curves. These curves were simulated as two different horizontal lines and any slope is caused by the randomness in the data. Note that it is impossible to spot the diffence between A and B by just using the diagram. A more or less formal analysis is needed.
3. Different Po-intensities changing over time
The example shows again data from the two groups ‘A’ and ‘B’, but here the lambda-parameter changes over time for each group:
library(ggplot2)
library(dplyr)
library(patchwork)
y1A <- 1.5; Xstart <- 0 # Start point for curve of Po-intensity grp 'A'.
y2A <- 2.5; Xstopp <- 200 # Stop point for curve of Po-intensity grp 'A'.
lamA <- -log(y2A/y1A)/Xstopp
y1B <- 2.5 # Start point for curve of Po-intensity grp 'B'.
y2B <- 3.5 # Stop point for curve of Po-intensity grp 'B'.
lamB <- -log(y2B/y1B)/Xstopp
Xaxel <- c(0:(Xstopp-1)) # X-axis.
lambdaChangeGrpA <- y1A * exp(-lamA*Xaxel) # Column of intensity values, grp A.
lambdaChangeGrpB <- y1B * exp(-lamB*Xaxel) # Column of intensity values, grp B.
dataA <- rpois(Xstopp, lambdaChangeGrpA) # Simulated Po-data grp A.
dataB <- rpois(Xstopp, lambdaChangeGrpB) # Simulated Po-data grp B.
allData <- c(dataA, dataB) # All data in one column.
# ------- Start graphs Example 3
result <- data.frame(allData) %>% group_by(allData) %>% summarise(number = n())
hist <- ggplot(result, aes(x = allData, y = number, fill = 'red')) +
geom_bar(stat = 'identity', width = 0.4, colour = 'black') +
scale_x_continuous(breaks = allData) +
labs(x = "Simulated Posson data, Group A and B", y = "") +
theme(legend.position="none")
average <- mean(allData) # Average and variance of simulated data.
variance <- var(allData)
func1 <- function(Xaxel) {y1A * exp(-lamA*Xaxel)} # Theoretical intensity curves.
func2 <- function(Xaxel) {y1B * exp(-lamB*Xaxel)}
dia <- ggplot() + xlim(0, Xstopp) + # Diagram of the two intensity curves.
geom_function(fun = func1, colour='red', linewidth=1, linetype=2) +
geom_function(fun = func2, colour='black', linewidth=1, linetype=2) +
annotate("text", x = mean(Xaxel), y = Inf, label="Intensity curves\n(red: grp A, black: grp B\nExample 3)",
vjust = 1.5, size = 5, colour = "blue") +
ylab("Po-intensity values") + xlab("")
dia + hist
Example 3. The left diagram above shows the lambda functions for
the two groups A and B. The histogram shows the mixture of all data.
Below is the analysis:
Ttot <- c(Xaxel, Xaxel)
process <- c(rep("A", length(Xaxel)), rep("B", length(Xaxel)))
analysAB <- glm(allData ~ process + Ttot, family=poisson)
summary(analysAB)
##
## Call:
## glm(formula = allData ~ process + Ttot, family = poisson)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.5275797 0.0759887 6.943 3.84e-12 ***
## processB 0.4690441 0.0644790 7.274 3.48e-13 ***
## Ttot 0.0014024 0.0005445 2.576 0.01 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 490.62 on 399 degrees of freedom
## Residual deviance: 429.60 on 397 degrees of freedom
## AIC: 1448.5
##
## Number of Fisher Scoring iterations: 5
Example 3. The analysis above shows that terms for ‘process’ and
‘Ttot’ are significant (< 0.05), and should be included in the
model.
The model of lambda. Using the three ‘Estimate’ from a simulation (rounded to three decimals), the following functions can be calculate (note that the software codes ‘A’ as 0 and ‘B’ as 1 unless otherwise is stated):
\[ \begin{align} & \lambda =\large exp(0.532+0.372\cdot processA + 0.002\cdot Ttot) \phantom{invi} \large (processA = 0)\\ \\ & \lambda =\large exp(0.532+0.372\cdot processB + 0.002\cdot Ttot) \phantom{invi} \large (processB = 1) \end{align} \]
The result corresponds to the two functions simulated in ‘Example 3’.
Below the two estimated curves and all data is shown:
b0 <- analysAB$coef[1]
b1 <- analysAB$coef[2]
b2 <- analysAB$coef[3]
funcA <- function(Xaxel) {exp(b0 + 0 + b2 * Xaxel)}
funcB <- function(Xaxel) {exp(b0 + b1 + b2 * Xaxel)}
farg<- recode(process, `A` = 2, `B` = 1)
graf <- ggplot() +
geom_point(aes(c(Xaxel, Xaxel), allData), size=2, col=farg) +
geom_function(fun = funcA, colour='red', linewidth=1) +
geom_function(fun = funcB, colour='black', linewidth=1, linetype=2) +
ylab("Po-data") + xlab("") +
annotate("text", x = mean(Xaxel), y = Inf,
label="Po-data and estimated intensity curves\n(red: grp A, black: grp B, Example 3)",
vjust = 1.1, size = 5, colour = "blue")
graf
Example 3. The diagram above contains all data and the two estimated curves. These curves were simulated as two different lines. Note that it is impossible to spot the diffence between A and B by just using the diagram. A more or less formal analysis is needed.
4. Different Po-intensities changing in different dirctions
The example shows again data from the two groups ‘A’ and ‘B’, but here
the lambda-parameter changes over time but in different directions
(increasing or decreasing lambda).
This is called an interaction
between variables and can be described with a question and its answer:
‘How does the lambda intensity change over time? It depends on which
group is considered’. This dependence is modelled with an inteaction
term in the model (‘cross product’).
library(ggplot2)
library(dplyr)
library(patchwork)
y1A <- 1.5; Xstart <- 0 # Start point for curve of Po-intensity grp 'A'.
y2A <- 2.5; Xstopp <- 200 # Stop point for curve of Po-intensity grp 'A'.
lamA <- -log(y2A/y1A)/Xstopp
y1B <- 2.5 # Start point for curve of Po-intensity grp 'B'.
y2B <- 1.5 # Stop point for curve of Po-intensity grp 'B'.
lamB <- -log(y2B/y1B)/Xstopp
Xaxel <- c(0:(Xstopp-1)) # X-axis.
lambdaChangeGrpA <- y1A * exp(-lamA*Xaxel) # Column of intensity values, grp A.
lambdaChangeGrpB <- y1B * exp(-lamB*Xaxel) # Column of intensity values, grp B.
dataA <- rpois(Xstopp, lambdaChangeGrpA) # Simulated Po-data grp A.
dataB <- rpois(Xstopp, lambdaChangeGrpB) # Simulated Po-data grp B.
allData <- c(dataA, dataB) # All data in one column.
# ------- Start graphs Example 4
result <- data.frame(allData) %>% group_by(allData) %>% summarise(number = n())
hist <- ggplot(result, aes(x = allData, y = number, fill = 'red')) +
geom_bar(stat = 'identity', width = 0.4, colour = 'black') +
scale_x_continuous(breaks = allData) +
labs(x = "Simulated Posson data, Group A and B", y = "") +
theme(legend.position="none")
average <- mean(allData) # Average and variance of simulated data.
variance <- var(allData)
func1 <- function(Xaxel) {y1A * exp(-lamA*Xaxel)} # Theoretical intensity curves.
func2 <- function(Xaxel) {y1B * exp(-lamB*Xaxel)}
dia <- ggplot() + xlim(0, Xstopp) + # Diagram of the two intensity curves.
geom_function(fun = func1, colour='red', linewidth=1, linetype=2) +
geom_function(fun = func2, colour='black', linewidth=1, linetype=2) +
annotate("text", x = mean(Xaxel), y = Inf, label="Intensity curves\n(red: grp A, black: grp B\nExample 4)",
vjust = 1.5, size = 5, colour = "blue") +
ylab("Po-intensity values") + xlab("")
dia + hist
Example 4. The left diagram above shows the lambda functions for
the two groups A and B. The histogram shows the mixture of all data.
Below is the analysis:
Ttot <- c(Xaxel, Xaxel)
process <- c(rep("A", length(Xaxel)), rep("B", length(Xaxel)))
analysAB <- glm(allData ~ process + Ttot + process*Ttot, family=poisson)
summary(analysAB)
##
## Call:
## glm(formula = allData ~ process + Ttot + process * Ttot, family = poisson)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.5538858 0.1047863 5.286 1.25e-07 ***
## processB 0.4728098 0.1384168 3.416 0.000636 ***
## Ttot 0.0007679 0.0008938 0.859 0.390265
## processB:Ttot -0.0037020 0.0012330 -3.002 0.002678 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 494.51 on 399 degrees of freedom
## Residual deviance: 478.97 on 396 degrees of freedom
## AIC: 1381.1
##
## Number of Fisher Scoring iterations: 5
Example 4. The analysis above shows that all terms in the model
are significant and should be included in the model. The interaction
term is here designated as ‘processB:Ttot’
The model of lambda. Using the four ‘Estimate’ from a simulation (rounded to three decimals), the following functions can be calculate (note that the software codes ‘A’ as 0 and ‘B’ as 1 unless otherwise is stated):
\[ \begin{align} & \lambda =\large exp(0.288+0.587\cdot processA + 0.004\cdot Ttot-0.005\cdot processA \cdot Ttot) \phantom{inve} \large (processA = 0)\\ \\ & \lambda =\large exp(0.288+0.587\cdot processB + 0.004\cdot Ttot-0.005\cdot processA \cdot Ttot) \phantom{inve} \large (processB = 1) \end{align} \]
b0 <- analysAB$coef[1]
b1 <- analysAB$coef[2]
b2 <- analysAB$coef[3]
b3 <- analysAB$coef[4]
farg<- factor(recode(process, `A` = 2, `B` = 1))
funcA <- function(Xaxel) {exp(b0 + b1*0 + b2*Xaxel + b3*Xaxel*0)}
funcB <- function(Xaxel) {exp(b0 + b1*1 + b2*Xaxel + b3*Xaxel*1)}
graf <- ggplot() +
geom_point(aes(c(Xaxel, Xaxel), allData), size=2, col=farg) +
geom_function(fun = funcA, colour='red', linewidth=1) +
geom_function(fun = funcB, colour='black', linewidth=1, linetype=2) +
ylab("Po-data") + xlab("") +
annotate("text", x = mean(Xaxel), y = Inf,
label="Po-data and estimated intensity curves\n(red: grp A, black: grp B, Example 4)",
vjust = 1.1, size = 5, colour = "blue")
graf
Example 4. The diagram above contains all data and the two estimated curves with very different slopes. Note that it is impossible to spot the diffence between A and B by just using the diagram. A more or less formal analysis is needed.
Final remarks
The results in the examples can perhaps be
positive or negative. One process owner perhaps likes a decreasing
λ (decrease in number of customer complaints?). Another process
owner perhaps likes an increasing λ-trend (increased number of
customers?), etc.
The examples above are by no means exhausting the
possibilities and perhaps necessary diagnostics. But here the simplicity
has been emphazised.
There are many books about the subject of
‘Po-regression’. Here the book ‘Negative binomial Regression’ (2nd ed,
Joseph M. Hilbe) has been used, especially chapter 7.
(The analyses have been performed by the ‘R’-software and with the graphical interface ‘R’-studio. Both these ara available on the net. See https://www.indstat.se and the button [Statistikprogram - R] for installation.)
(See https://www.indstat.se for many more simulations and animations.)