R-TUTORIAL ON WEIBULL AND POISSON-REGRESSION

Use the command

  pdat=read.table("https://folk.ntnu.no/bo/STK4080/2021/poisson-dat.txt",header=T)

to read in a set of right-censored lifetimes where

  t = survival time
  d = censoring status (d = 1 for an observed event, d = 0 for a censored observation)
  x = covariate (0-1)

> pdat
    t d x
1   9 1 1
2  13 1 1
3  13 0 1
4  18 1 1
5  23 1 1
6  28 0 1
7  31 1 1
8  44 1 1
9  45 0 1
10 48 1 1
11 59 0 1
12  5 1 0
13  5 1 0
14  8 1 0
15  8 1 0
16 12 1 0
17 16 0 0
18 23 1 0
19 27 1 0
20 30 0 0
21 33 1 0
22 43 1 0
23 55 1 0

We extract the variables and load the survival package (needed for coxph and survreg below):

library(survival)
t = pdat$t
d = pdat$d
x = pdat$x

####################################

First, we perform a Cox-regression with covariate x:

> fit.c=coxph(Surv(t,d==1)~x)
> summary(fit.c)
Call:
coxph(formula = Surv(t, d == 1) ~ x)

  n= 23, number of events= 17

     coef exp(coef) se(coef)      z Pr(>|z|)
x -0.6831    0.5050   0.4998 -1.367    0.172

  exp(coef) exp(-coef) lower .95 upper .95
x     0.505       1.98    0.1896     1.345

Concordance= 0.609  (se = 0.066 )
Likelihood ratio test= 1.91  on 1 df,   p=0.2
Wald test            = 1.87  on 1 df,   p=0.2
Score (logrank) test = 1.94  on 1 df,   p=0.2

###########################################

Then we do the corresponding analysis with Weibull-regression:

> fit.w=survreg(Surv(t,d==1)~x)
> summary(fit.w)
Call:
survreg(formula = Surv(t, d == 1) ~ x)
             Value Std. Error     z      p
(Intercept)  3.328      0.212 15.73 <2e-16
x            0.444      0.330  1.34  0.179
Log(scale)  -0.416      0.196 -2.12  0.034

Scale= 0.66

Weibull distribution
Loglik(model)= -74.9   Loglik(intercept only)= -75.8
        Chisq= 1.88 on 1 degrees of freedom, p= 0.17
Number of Newton-Raphson Iterations: 5
n= 23

###############

In practical analyses of this kind, it is the estimated beta-coefficient
that is of main interest. This is -0.6831 for the Cox regression, and
0.444 for the Weibull regression. To compare these two values, recall
from p. 18/33 in Slides 14 that we have the approximation

  beta(Cox) = - a * beta(Weibull)          (*)

Here a is estimated by 1/Scale = 1/0.66 = 1.51515, so the right hand
side of (*) is

  -1.51515 * 0.444 = -0.6727

which is close to the Cox-regression estimate. Moreover, the standard
error estimates are also compatible, in the sense that

  a * se(Weibull) = 1.51515 * 0.330 = 0.500  which is approximately se(Cox)

Note, however, that there is a difference between the two models: in the
Weibull analysis we assume a parametric baseline hazard function of the
Weibull form, while the Cox model assumes nothing about the form of the
baseline hazard. The Weibull baseline hazard is given by (see p. 18/33
in Slides 14):

  a * exp(-a*beta0) * t^(a-1)

and the cumulative baseline hazard is therefore

  exp(-a*beta0) * t^a = exp(-(1/Scale) * Intercept) * t^(1/Scale)
                      = exp(-(1/0.66) * 3.328) * t^(1/0.66)
                      = 0.006458 * t^1.51515

You may draw the estimated cumulative hazard by

  alpha0 <- function(t) 0.006458*t^1.51515
  curve(alpha0, 0, 60)        # or equivalently: plot(alpha0, 0, 60)

and you may compare it to the estimated baseline cumulative hazard for
Cox-regression:

  plot(survfit(fit.c),fun="cumhaz")
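As a small illustration (not part of the original tutorial), the two estimated
cumulative hazards can be drawn in the same plot. Note that survfit() applied
to a coxph object uses the mean covariate value by default, so the curve for
x = 0 is requested explicitly through newdata; this is a sketch, not the
tutorial's prescribed method:

  # A minimal sketch (assumes fit.c from above): overlay the Weibull-based
  # cumulative hazard on the Cox baseline cumulative hazard for x = 0.
  library(survival)
  base.c <- survfit(fit.c, newdata = data.frame(x = 0))
  plot(base.c, fun = "cumhaz", conf.int = FALSE,
       xlab = "t", ylab = "Cumulative hazard")
  curve(0.006458 * x^1.51515, from = 0, to = 60, add = TRUE, lty = 2)
  legend("topleft", legend = c("Cox", "Weibull"), lty = 1:2)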
##############################################

AFTER THIS WE GET TO THE MAIN POINT OF THIS TUTORIAL, NAMELY THE
MODELING VIA POISSON REGRESSION

We now assume that the baseline hazard is piecewise constant, see page
30/33 in Slides 14:

  alpha(t) = theta_k   for (k-1)*10 < t <= k*10,   k = 1,2,3,4,5,6

We need to compute the variables O(k,l) and R(k,l) defined on page
33/33. Here k is 1,2,3,4,5,6 and we let l=1 for the data with x=0 and
l=2 for the data with x=1. (Thus L=2 in the notation on page 33/33.)
Below we let

  Ol = (O(1,l),O(2,l),...,O(6,l))   for l=1,2
  Rl = (R(1,l),R(2,l),...,R(6,l))   for l=1,2

The following are R-commands that give these vectors from the data in
pdat, by "brute force". We shall see later how this can be done by using
functions available in the survival package (following ASAUR Chapter 12.1).

# INITIALIZE
O1 = c(0,0,0,0,0,0)
O2 = c(0,0,0,0,0,0)
R1 = c(0,0,0,0,0,0)
R2 = c(0,0,0,0,0,0)

# GO THROUGH EACH LINE OF THE DATA AND ADD THE
# APPROPRIATE CONTRIBUTION TO EACH VECTOR
for (i in 1:23)
{
  for (k in 1:6)
  {
    # O(k,l): number of observed events in interval k
    O1[k] = O1[k] + 1*(d[i]==1)*(x[i]==0)*(t[i] <= k*10)*(t[i] > (k-1)*10)
    O2[k] = O2[k] + 1*(d[i]==1)*(x[i]==1)*(t[i] <= k*10)*(t[i] > (k-1)*10)
    # R(k,l): total time at risk (exposure) in interval k
    R1[k] = R1[k] + 10*(t[i] > k*10)*(x[i]==0) + max((t[i]-(k-1)*10),0)*(t[i] <= k*10)*(x[i]==0)
    R2[k] = R2[k] + 10*(t[i] > k*10)*(x[i]==1) + max((t[i]-(k-1)*10),0)*(t[i] <= k*10)*(x[i]==1)
  }
}
# THIS IS THE END OF THIS PROGRAM

The results are:

> O1
[1] 4 1 2 1 1 1
> O2
[1] 1 2 1 1 2 0
> R1
[1] 106  68  50  23  13   5
> R2
[1] 109  84  61  41  27   9

The data to be used in the Poisson regression problem defined on the
last line of page 33/33 are:

Obs = c(O1,O2)
R = c(R1,R2)
xx = c(0,0,0,0,0,0,1,1,1,1,1,1)
theta = c(1,2,3,4,5,6,1,2,3,4,5,6)

We can fit the given model by (see explanation after the output):

################################

fit.p=glm(Obs ~ offset(log(R))+ xx + factor(theta), family=poisson)
summary(fit.p)

> summary(fit.p)

Call:
glm(formula = Obs ~ offset(log(R)) + xx + factor(theta), family = poisson)

Deviance Residuals:
       1         2         3         4         5         6         7         8
 0.38354  -0.67943   0.10875  -0.04739  -0.40208   0.58721  -0.59531   0.70764
       9        10        11        12
-0.14299   0.04894   0.35224  -0.97811

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)     -3.4752     0.4788  -7.259  3.9e-13 ***
xx              -0.6744     0.4971  -1.357    0.175
factor(theta)2  -0.1338     0.7306  -0.183    0.855
factor(theta)3   0.1785     0.7306   0.244    0.807
factor(theta)4   0.3868     0.8392   0.461    0.645
factor(theta)5   1.2871     0.7352   1.751    0.080 .
factor(theta)6   1.2151     1.0975   1.107    0.268
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 8.3945  on 11  degrees of freedom
Residual deviance: 3.0881  on  5  degrees of freedom
AIC: 42.195

Number of Fisher Scoring iterations: 5

##################################

The model used above is essentially a Cox-type model, the difference
being that the baseline hazard is piecewise constant on time intervals
of length 10, but is otherwise completely arbitrary. The coefficient of
x should hence have a similar interpretation as the one for
Cox-regression. In fact, we obtain the value -0.6744, which is very
close to the one from Cox-regression (-0.6831). The standard errors
are, moreover, also very close.

Now to the estimation of the baseline hazard: since we use a Poisson
regression which includes a constant term (Intercept = beta0), and use
theta as a factor, we get theta_1 as the reference value, estimated as

  exp(Intercept) = exp(-3.4752) = 0.030955

Then, further, theta_2 is estimated as

  exp(-3.4752 - 0.1338) = 0.02707891

etc. This is because the contributions from factors 2-6 are given
relative to the reference.
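As a small illustration (not part of the original tutorial), all six theta_k
can be recovered at once from fit.p in this way; the coefficient names are
those printed by summary(fit.p) above:

  # A minimal sketch (assumes fit.p from above): add the factor contrasts to
  # the intercept and exponentiate to get theta_1, ..., theta_6.
  b <- coef(fit.p)
  theta.hat <- exp(b["(Intercept)"] + c(0, b[paste0("factor(theta)", 2:6)]))
  names(theta.hat) <- paste0("theta_", 1:6)
  theta.hat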
As an alternative, it is therefore recommended to ask for a Poisson
regression without a constant term (i.e., without the intercept). This
gives:

#######################################

> fit.p.nointer=glm(Obs ~ offset(log(R))+ xx + factor(theta)-1, family=poisson)
> summary(fit.p.nointer)

Call:
glm(formula = Obs ~ offset(log(R)) + xx + factor(theta) - 1, family = poisson)

Deviance Residuals:
       1         2         3         4         5         6         7         8
 0.38354  -0.67943   0.10875  -0.04739  -0.40208   0.58721  -0.59531   0.70764
       9        10        11        12
-0.14299   0.04894   0.35224  -0.97811

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
xx              -0.6744     0.4971  -1.357 0.174901
factor(theta)1  -3.4752     0.4788  -7.259 3.90e-13 ***
factor(theta)2  -3.6091     0.6084  -5.932 3.00e-09 ***
factor(theta)3  -3.2968     0.6080  -5.422 5.88e-08 ***
factor(theta)4  -3.0885     0.7456  -4.142 3.44e-05 ***
factor(theta)5  -2.1881     0.6314  -3.466 0.000529 ***
factor(theta)6  -2.2602     1.0279  -2.199 0.027887 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1045.4555  on 12  degrees of freedom
Residual deviance:    3.0881  on  5  degrees of freedom
AIC: 42.195

Number of Fisher Scoring iterations: 5

###########################

We now obtain the estimates of the theta_k by direct exponentiation of
the coefficients:

> exp(fit.p.nointer$coefficients)
            xx factor(theta)1 factor(theta)2 factor(theta)3 factor(theta)4
    0.50944141     0.03095417     0.02707750     0.03700235     0.04557148
factor(theta)5 factor(theta)6
    0.11212892     0.10432998

Thus the estimates are:

  theta_1 = 0.03095417
  theta_2 = 0.02707750
  theta_3 = 0.03700235
  theta_4 = 0.04557148
  theta_5 = 0.11212892
  theta_6 = 0.10432998

You may draw the resulting baseline hazard alpha(t) as a function of t
for t from 0 to 60, noting that it is constant on each of the intervals
(0,10], (10,20], .... Then by integration (summing of areas) you get a
cumulative hazard which can be compared to the ones for Cox-regression
and Weibull regression.
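A minimal sketch (not part of the original tutorial) of how this drawing
can be done; the theta values below are copied from the estimates above:

  # A minimal sketch: plot the piecewise constant baseline hazard and its
  # cumulative hazard, using the theta_k estimates listed above.
  theta.vec <- c(0.03095417, 0.02707750, 0.03700235, 0.04557148, 0.11212892, 0.10432998)
  breaks <- seq(0, 60, by = 10)
  # the hazard is a step function with jumps at 10, 20, ..., 50
  plot(stepfun(breaks[2:6], theta.vec), do.points = FALSE, xlim = c(0, 60),
       xlab = "t", ylab = "baseline hazard", main = "")
  # the cumulative hazard is piecewise linear; sum the areas 10*theta_k
  A0 <- c(0, cumsum(10 * theta.vec))
  plot(breaks, A0, type = "l", xlab = "t", ylab = "cumulative hazard")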
###################

USING POISSON REGRESSION FOLLOWING ASAUR CHAPTER 12.1

The survival package in R has in fact the relevant functions for
transforming the problem at the beginning of this tutorial into a
Poisson regression problem. Thus we may avoid the ad hoc "brute force"
method used above. Below we follow the steps from ASAUR Chapter 12.1 in
order to analyse the data of this tutorial. Some of the wording below is
copied directly from ASAUR.

First, we define six intervals that we will use to define a piecewise
constant hazard, i.e., a piecewise exponential distribution:

ptau.s <- c(10, 20, 30, 40, 50, 60)

It is convenient to include the identity of each original individual in
the data:

pdat$id=c(1:23)

Then we use the command:

pdat.split.s <- survSplit(data=pdat, cut=ptau.s, end="t", start="t0",
                          event="d", episode="grp")

> pdat.split.s
   x id t0  t d grp
1  1  1  0  9 1   1
2  1  2  0 10 0   1
3  1  2 10 13 1   2
4  1  3  0 10 0   1
5  1  3 10 13 0   2
6  1  4  0 10 0   1
7  1  4 10 18 1   2
8  1  5  0 10 0   1
9  1  5 10 20 0   2
10 1  5 20 23 1   3
11 1  6  0 10 0   1
12 1  6 10 20 0   2
13 1  6 20 28 0   3
14 1  7  0 10 0   1
15 1  7 10 20 0   2
16 1  7 20 30 0   3
17 1  7 30 31 1   4
18 1  8  0 10 0   1
19 1  8 10 20 0   2
20 1  8 20 30 0   3
21 1  8 30 40 0   4
22 1  8 40 44 1   5
23 1  9  0 10 0   1
24 1  9 10 20 0   2
25 1  9 20 30 0   3
26 1  9 30 40 0   4
27 1  9 40 45 0   5
28 1 10  0 10 0   1
29 1 10 10 20 0   2
30 1 10 20 30 0   3
31 1 10 30 40 0   4
32 1 10 40 48 1   5
33 1 11  0 10 0   1
34 1 11 10 20 0   2
35 1 11 20 30 0   3
36 1 11 30 40 0   4
37 1 11 40 50 0   5
38 1 11 50 59 0   6
39 0 12  0  5 1   1
40 0 13  0  5 1   1
41 0 14  0  8 1   1
42 0 15  0  8 1   1
43 0 16  0 10 0   1
44 0 16 10 12 1   2
45 0 17  0 10 0   1
46 0 17 10 16 0   2
47 0 18  0 10 0   1
48 0 18 10 20 0   2
49 0 18 20 23 1   3
50 0 19  0 10 0   1
51 0 19 10 20 0   2
52 0 19 20 27 1   3
53 0 20  0 10 0   1
54 0 20 10 20 0   2
55 0 20 20 30 0   3
56 0 21  0 10 0   1
57 0 21 10 20 0   2
58 0 21 20 30 0   3
59 0 21 30 33 1   4
60 0 22  0 10 0   1
61 0 22 10 20 0   2
62 0 22 20 30 0   3
63 0 22 30 40 0   4
64 0 22 40 43 1   5
65 0 23  0 10 0   1
66 0 23 10 20 0   2
67 0 23 20 30 0   3
68 0 23 30 40 0   4
69 0 23 40 50 0   5
70 0 23 50 55 1   6

In the above function, “t0” is the name given to the start of the
interval for the respective part of a patient’s record, and “grp” is the
indicator for the interval that a particular record refers to.

Next, we define the exposure per record:

pdat.split.s$expo <- pdat.split.s$t - pdat.split.s$t0

> pdat.split.s$expo
 [1]  9 10  3 10  3 10  8 10 10  3 10 10  8 10 10 10  1 10 10 10 10  4 10 10 10 10  5
[28] 10 10 10 10  8 10 10 10 10 10  9  5  5  8  8 10  2 10  6 10 10  3 10 10  7 10 10
[55] 10 10 10 10  3 10 10 10 10  3 10 10 10 10 10  5

To obtain parameter estimates for the piecewise exponential
distribution, we treat “d” in “pdat.split.s” as if it had a Poisson
distribution, the log of “expo” as the offset, and the mean as given in
Eq. 12.1.4 in ASAUR. The R code is as follows:

> result.pdat.poisson <- glm(d ~ -1 + factor(grp) + x +
+                            offset(log(expo)), family=poisson, data=pdat.split.s)
> summary(result.pdat.poisson)

Call:
glm(formula = d ~ -1 + factor(grp) + x + offset(log(expo)),
    family = poisson, data = pdat.split.s)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.4975  -0.7508  -0.5616  -0.3471   2.3606

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
factor(grp)1  -3.4752     0.4788  -7.259 3.90e-13 ***
factor(grp)2  -3.6091     0.6084  -5.932 3.00e-09 ***
factor(grp)3  -3.2968     0.6080  -5.422 5.88e-08 ***
factor(grp)4  -3.0885     0.7456  -4.142 3.44e-05 ***
factor(grp)5  -2.1881     0.6314  -3.466 0.000529 ***
factor(grp)6  -2.2602     1.0279  -2.199 0.027887 *
x             -0.6744     0.4971  -1.357 0.174901
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1108.277  on 70  degrees of freedom
Residual deviance:   65.909  on 63  degrees of freedom
AIC: 113.91

Number of Fisher Scoring iterations: 7

THESE ARE THE SAME ESTIMATES AS OBTAINED EARLIER!
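As a quick side-by-side check (not part of the original tutorial), the
coefficient of x and its standard error can be compared directly between
the Cox fit and the piecewise exponential (Poisson) fit:

  # A minimal sketch (assumes fit.c and result.pdat.poisson from above):
  # compare the coefficient of x and its standard error across the two models.
  data.frame(
    model = c("Cox", "piecewise Poisson"),
    coef  = as.numeric(c(coef(fit.c)["x"], coef(result.pdat.poisson)["x"])),
    se    = as.numeric(c(sqrt(vcov(fit.c)["x", "x"]),
                         sqrt(vcov(result.pdat.poisson)["x", "x"]))))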
Recall that the "-1" in the model definition tells R to fit a separate
term for each of the six interval factors (rather than treating the
first level as a reference level).

Since the covariate (x) is binary, we may further simplify the data set
by collapsing over the six time intervals. The R function "aggregate"
facilitates this, as follows:

pdat.tab <- aggregate(pdat.split.s[c("d", "expo")],
                      by=list(xx=pdat.split.s$x, grp=pdat.split.s$grp), sum)

> pdat.tab
   xx grp d expo
1   0   1 4  106
2   1   1 1  109
3   0   2 1   68
4   1   2 2   84
5   0   3 2   50
6   1   3 1   61
7   0   4 1   23
8   1   4 1   41
9   0   5 1   13
10  1   5 2   27
11  0   6 1    5
12  1   6 0    9

These data are equivalent to the ones we found by "brute force" earlier
in this tutorial. We may now finally fit the Poisson model to this new
compact data set as follows. We then get exactly the same parameter
estimates as above.

result.pdat.tab.poisson <- glm(d ~ -1 + factor(grp) + xx + offset(log(expo)),
                               family=poisson, data=pdat.tab)

> summary(result.pdat.tab.poisson)

Call:
glm(formula = d ~ -1 + factor(grp) + xx + offset(log(expo)),
    family = poisson, data = pdat.tab)

Deviance Residuals:
       1         2         3         4         5         6         7         8
 0.38354  -0.59531  -0.67943   0.70764   0.10875  -0.14299  -0.04739   0.04894
       9        10        11        12
-0.40208   0.35224   0.58721  -0.97811

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
factor(grp)1  -3.4752     0.4788  -7.259 3.90e-13 ***
factor(grp)2  -3.6091     0.6084  -5.932 3.00e-09 ***
factor(grp)3  -3.2968     0.6080  -5.422 5.88e-08 ***
factor(grp)4  -3.0885     0.7456  -4.142 3.44e-05 ***
factor(grp)5  -2.1881     0.6314  -3.466 0.000529 ***
factor(grp)6  -2.2602     1.0279  -2.199 0.027887 *
xx            -0.6744     0.4971  -1.357 0.174901
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1045.4555  on 12  degrees of freedom
Residual deviance:    3.0881  on  5  degrees of freedom
AIC: 42.195

Number of Fisher Scoring iterations: 5
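As a final check (not part of the original tutorial), the interval
coefficients from this aggregated fit reproduce the theta_k estimates
obtained earlier from the "brute force" data:

  # A minimal sketch (assumes result.pdat.tab.poisson from above): exponentiate
  # the interval coefficients and compare with theta_1, ..., theta_6 listed earlier.
  exp(coef(result.pdat.tab.poisson)[paste0("factor(grp)", 1:6)])
  # expected: 0.03095417 0.02707750 0.03700235 0.04557148 0.11212892 0.10432998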