Weibull-data — mean of censured data

‘Censored data’ means that some datapoints in a set of data has no end-point (see diagram below). Then there is a problem when wanting to calculate the mean and standard deviation of the time process. Below some different options are discussed:

1. Using only the data with a recorded end point.

2. Using all data but the data beyond the limit are recorded as the limit-value.

3. Using all data, with an extra column stating if a data point is censored or not.

(Point 3 is further discussed in other texts on the page.)

Why is data censored? There are several reasons for this. Perhaps the test equipment is only available for a certain period of time, perhaps (in a clinical study) a person leaves the study of some reason or other, etc. (In the discussions below all time measurements start at 0 but of course different starting points can also be handled, it then needs an extra data column to hold starting points.)

Simulation
The simulation below produces two graphs. The top graph shows a sample of 40 time measurements where some pass the limiting value and thus are designated as censored values.
The second graph shows a histogram of 200 datapoints. The last tall bar contains all data values larger than the limit value.

A Weibull distribution – with shape parameter 2 and scale parameter 5 – is used. The estimated parameter values are shown in red on the graph. The red line is the true distribution.

library(survival)
library(patchwork)
library(ggplot2)

  n1    <- 40
  shape <- 2
  scale <- 5
  limit <- 6
  
  TotXaxis <- c()
  fx       <- c()

  areaRight <- pweibull(limit, shape, scale, lower.tail = FALSE)
  
  time     <- rweibull(n1, shape, scale)                       # Simulates Weibull distributed times.
  timeny   <- ifelse(time > limit, NA, time)                   # Only complete measuremnts.
  timeLine <- ifelse(time > limit, limit, time)                # Times above limit are set to the limit-value.
  
  yaxel <- c(1:n1)
  
  nyYaxel <- yaxel[timeLine == limit]
  
  x   <- c(rep(limit, length(nyYaxel)), rep(16, length(nyYaxel)))
  y   <- c(nyYaxel, nyYaxel)               
  grp <- c(nyYaxel, nyYaxel)
  restLines <- data.frame(x, y, grp)
  
  df <- data.frame(timeny, yaxel) 
  
  diagram <- ggplot(df, aes(x=timeny, y=yaxel)) + geom_point() + geom_segment(x=0, y=yaxel, xend=time, 
                        yend=yaxel) + xlim(0, 16) + geom_vline(xintercept = limit, color="red")
  diagram <- ggplot(df, aes(x=timeny, y=yaxel)) + geom_point() + geom_segment(x=0, y=yaxel, xend=timeLine, 
                        yend=yaxel) + xlim(0, 16) + geom_line(data=restLines, aes(x=x, y=y, group=grp), 
                        color="grey") + ylim(0, 42) + xlab("time") + ylab("item number")
  diagram <- diagram + annotate("text", x =  7, y = Inf, label="Censored times (right)", hjust=0, vjust =  1.5, 
                                size = 4, colour = "red" ) 
  diagram <- diagram + annotate("text", x =  0, y = Inf, label="Non-censored times (left)", hjust=0, vjust =  1.5, 
                                size = 4, colour = "red" ) 
  diagram <- diagram + geom_vline(xintercept = limit, color="red")
  
  
# -------------------------------- Graph 1 above, graph 2 below this line
    
  n2    <- 200
  shape <- 2
  scale <- 5
  limit <- 6
  
  Xend  <- 16
  Xaxis <- seq(0, Xend, by = 0.1)      # For plotting lines
    
  time   <- rweibull(n2, shape, scale)
  status <- ifelse(time > limit, 0, 1)       # 1 = 'no fail', 0 = 'fail at time'
  time   <- ifelse(time > limit, limit, time)
  
 timeStatus <- data.frame(time, status)
 
 fitWeibull <- survreg(Surv(time, status) ~ 1, dist="weibull")
 
 estScale <- exp(fitWeibull$icoef[1])
 estShape <- 1/exp(fitWeibull$icoef[2])
 
 nyttfx   <- c(dweibull(Xaxis, estShape, estScale))
 
 TotXaxis <- c(TotXaxis, Xaxis)
 fx       <- c(fx, nyttfx)

 greylines <- data.frame(TotXaxis, fx)
 shapetext <- paste0("Est. shape = ", round(estShape, 1))
 scaletext <- paste0("Est. scale = ", round(estScale, 1))
 
 data   <- data.frame(time)
 rubrik <- "Weibull distribution (censured data) (2, 5)"
 
 truefunction <- function(x){dweibull(x, shape, scale)}
 
 histCens <- ggplot(data, aes(time)) + geom_histogram(binwidth=1, color='black', alpha = 0.5, boundary = 0, 
                                       fill = "yellow", aes(y = after_stat(density)), position = 'identity')
 histCens <- histCens + stat_function(fun=truefunction, geom="line", color='red', linewidth=1.0)
 histCens <- histCens + theme(legend.position = "none") + xlim(0, 16) + ylim(0, 0.45) + xlab("time")
 histCens <- histCens + theme(axis.text.x  = element_text(size = 12), axis.text.y = element_blank(),  
                         axis.title.y = element_blank())
 histCens <- histCens + annotate("text", x =  5, y = Inf, label=rubrik, vjust = 1.5, size = 4.5, colour = "blue")
 histCens <- histCens + annotate("text", x=10, y=Inf, label=shapetext, hjust=0, vjust=7, size=4, colour="red") 
 histCens <- histCens + annotate("text", x=10, y=Inf, label=scaletext, hjust=0, vjust=10, size=4, colour="red") 
 histCens <- histCens + geom_segment(x=limit, y=0, xend=limit, yend=0.40, color="red")
 
 tvaCensgrafer <- diagram / histCens
 filnamn <- paste("~/Desktop/Rgrafer/tvaCensgrafer.jpeg", sep="")
 ggsave(filnamn, width = 7.5, height = 6.5)                                     # Units i tum
 
suppressWarnings(print(tvaCensgrafer))

Comments figure 1. The top graph shows 40 time measurements where some are censored, i.e. the end point is not known. The histogram shows 200 values where values larger than the limit are recorded as limit-values. Thus all values larger than the limit are found in the large bar.

The different estimates in a graph
The true mean and the true standard deviation – based on the two parameter values 2 and 5 – is shown as a red ring with a “D” together with the three estimates (“A”, “B”, and “C”).

A. Mean and standard deviation are calculated using the complete data only.

B. Mean and standard deviation using all data where data beyond limit are set to limit.

C. All data, with an extra column stating if a data point is censored or not.

 #      Original data:    'time'
 #      Finished data:    'timeFinished'
 #      Incl truncated:   'timeLine'
 
 timeFinished <- time[time < limit]
 
 meanFinished <- mean(timeFinished) 
 stdFinished  <- sd(timeFinished)
 
 meanLine <- mean(timeLine) 
 stdLine  <- sd(timeLine) 
 
 trueMean <- round(scale * gamma(1 + 1/shape), 2)
 trueStd <- round(sqrt(scale**2 * (gamma(1 + 2/shape) - gamma(1 + 1/shape)**2)), 2)
 
 texttrueMean <-paste0("true mean = ", trueMean)
 texttrueStd  <-paste0("true sigma = ", trueStd)
 
esttrueMean <- estScale * gamma(1 + 1/estShape)
esttrueStd <- sqrt(estScale**2 * (gamma(1 + 2/estShape) - gamma(1 + 1/estShape)**2))

xMean <- c(meanFinished, meanLine, esttrueMean, trueMean)
yStd  <- c(stdFinished, stdLine, esttrueStd, trueStd)
 
 xydata <- data.frame(xMean, yStd)

 diaStdMean <- ggplot(xydata, aes(x=xMean, y=yStd, label= c("A", "B", "C", "D"))) +  
                      geom_point(size=c(9, 9, 9, 12), color=c("black", "black","black", "red"))
 diaStdMean <- diaStdMean + geom_vline(aes(xintercept = trueMean, color="red")) + ylab("sigma") + xlab("mean")
 diaStdMean <- diaStdMean + geom_hline(aes(yintercept = trueStd, color="red"))
 diaStdMean <- diaStdMean + annotate("text", x=trueMean, y=-Inf, label=texttrueMean, angle=90, hjust=-1.2, 
                                     vjust = -0.9, size=4, color="red")
 diaStdMean <- diaStdMean + annotate("text", x=-Inf, y=trueStd, label=texttrueStd, hjust=-0.5, vjust = -0.5, 
                                     size=4, color="red")
 diaStdMean <- diaStdMean + theme(legend.position = "none") + geom_text(hjust=c(0.5, 0.5, 0.5, 0.5), 
                                                          color="white", vjust=c(0.5, 0.5, 0.5, 0.5), size=6)
 diaStdMean

Comments figure 2. The “A”-point is not even close to the true sigma and the true mean. This is obvious as it only uses some of the data (‘The lower portion of the data’). The “B”-point is deviating from the true value but somewhat less than “A” because times past the limit are set to the limit-value (‘censored’). The “C”-point is usually closest to the true value but is based on a much more advanced mathematics needing computer support.

Final comments. The possibility of estimating the process parameters using censored data is of greate advantage and is used a frequently in e.g. quality control and in reliability studies. Incomplete data is common in many clinical studies of various reasons.