R-TUTORIAL ON WEIBULL AND POISSON-REGRESSION

Use the command

  pdat=read.table("https://folk.ntnu.no/bo/STK4080/2021/poisson-dat.txt",header=T)

to read in a set of right-censored lifetimes where

  t = survival time
  d = censoring status (d = 1 for an observed event, d = 0 for a censored observation)
  x = covariate (0-1)

> pdat
    t d x
1   9 1 1
2  13 1 1
3  13 0 1
4  18 1 1
5  23 1 1
6  28 0 1
7  31 1 1
8  44 1 1
9  45 0 1
10 48 1 1
11 59 0 1
12  5 1 0
13  5 1 0
14  8 1 0
15  8 1 0
16 12 1 0
17 16 0 0
18 23 1 0
19 27 1 0
20 30 0 0
21 33 1 0
22 43 1 0
23 55 1 0

We extract the variables and load the survival package (needed for coxph and survreg below):

library(survival)
t = pdat$t
d = pdat$d
x = pdat$x

####################################

First, we perform a Cox-regression with covariate x:

> fit.c=coxph(Surv(t,d==1)~x)
> summary(fit.c)
Call:
coxph(formula = Surv(t, d == 1) ~ x)

  n= 23, number of events= 17

     coef exp(coef) se(coef)      z Pr(>|z|)
x -0.6831    0.5050   0.4998 -1.367    0.172

  exp(coef) exp(-coef) lower .95 upper .95
x     0.505       1.98    0.1896     1.345

Concordance= 0.609  (se = 0.066 )
Likelihood ratio test= 1.91  on 1 df,   p=0.2
Wald test            = 1.87  on 1 df,   p=0.2
Score (logrank) test = 1.94  on 1 df,   p=0.2

###########################################

Then we do the corresponding analysis with Weibull-regression:

> fit.w=survreg(Surv(t,d==1)~x)
> summary(fit.w)
Call:
survreg(formula = Surv(t, d == 1) ~ x)
             Value Std. Error     z      p
(Intercept)  3.328      0.212 15.73 <2e-16
x            0.444      0.330  1.34  0.179
Log(scale)  -0.416      0.196 -2.12  0.034

Scale= 0.66

Weibull distribution
Loglik(model)= -74.9   Loglik(intercept only)= -75.8
        Chisq= 1.88 on 1 degrees of freedom, p= 0.17
Number of Newton-Raphson Iterations: 5
n= 23

###############

In practical analyses of this kind, it is the estimated beta-coefficient
that is of main interest. This is -0.6831 for the Cox regression, and
0.444 for the Weibull regression. To compare these two values, recall
from p. 18/33 in Slides 14 that we have the approximation

  beta(Cox) = - a * beta(Weibull)          (*)

Here a is estimated by 1/Scale = 1/0.66 = 1.51515, so the right hand
side of (*) is

  -1.51515 * 0.444 = -0.6727

which is close to the Cox-regression estimate. Moreover, the standard
error estimates are also compatible, in the sense that

  a * se(Weibull) = 1.51515 * 0.330 = 0.500  which is approximately se(Cox)

Note, however, that there is a difference between the two models: in the
Weibull analysis we assume a parametric baseline hazard function of the
Weibull form, while the Cox model assumes nothing about the form of the
baseline hazard. The Weibull baseline hazard is given by (see p. 18/33
in Slides 14):

  a * exp(-a*beta0) * t^(a-1)

and the cumulative baseline hazard is therefore

  exp(-a*beta0) * t^a = exp(-(1/Scale) * Intercept) * t^(1/Scale)
                      = exp(-(1/0.66) * 3.328) * t^(1/0.66)
                      = 0.006458 * t^1.51515

You may draw the estimated cumulative hazard by

  alpha0 <- function(t) 0.006458*t^1.51515
  curve(alpha0, 0, 60)        # or equivalently: plot(alpha0, 0, 60)

and you may compare it to the estimated baseline cumulative hazard for
Cox-regression:

  plot(survfit(fit.c),fun="cumhaz")
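As a small illustration (not part of the original tutorial), the two estimated
cumulative hazards can be drawn in the same plot. Note that survfit() applied
to a coxph object uses the mean covariate value by default, so the curve for
x = 0 is requested explicitly through newdata; this is a sketch, not the
tutorial's prescribed method:

  # A minimal sketch (assumes fit.c from above): overlay the Weibull-based
  # cumulative hazard on the Cox baseline cumulative hazard for x = 0.
  library(survival)
  base.c <- survfit(fit.c, newdata = data.frame(x = 0))
  plot(base.c, fun = "cumhaz", conf.int = FALSE,
       xlab = "t", ylab = "Cumulative hazard")
  curve(0.006458 * x^1.51515, from = 0, to = 60, add = TRUE, lty = 2)
  legend("topleft", legend = c("Cox", "Weibull"), lty = 1:2)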
##############################################

AFTER THIS WE GET TO THE MAIN POINT OF THIS TUTORIAL, NAMELY THE
MODELING VIA POISSON REGRESSION

We now assume that the baseline hazard is piecewise constant, see page
30/33 in Slides 14:

  alpha(t) = theta_k   for (k-1)*10 < t <= k*10,   k = 1,2,3,4,5,6

We need to compute the variables O(k,l) and R(k,l) defined on page
33/33. Here k is 1,2,3,4,5,6 and we let l=1 for the data with x=0 and
l=2 for the data with x=1. (Thus L=2 in the notation on page 33/33.)
Below we let

  Ol = (O(1,l),O(2,l),...,O(6,l))   for l=1,2
  Rl = (R(1,l),R(2,l),...,R(6,l))   for l=1,2

The following are R-commands that give these vectors from the data in
pdat, by "brute force". We shall see later how this can be done by using
functions available in the survival package (following ASAUR Chapter 12.1).

# INITIALIZE
O1 = c(0,0,0,0,0,0)
O2 = c(0,0,0,0,0,0)
R1 = c(0,0,0,0,0,0)
R2 = c(0,0,0,0,0,0)

# GO THROUGH EACH LINE OF THE DATA AND ADD THE
# APPROPRIATE CONTRIBUTION TO EACH VECTOR
for (i in 1:23)
{
  for (k in 1:6)
  {
    # O(k,l): number of observed events in interval k
    O1[k] = O1[k] + 1*(d[i]==1)*(x[i]==0)*(t[i] <= k*10)*(t[i] > (k-1)*10)
    O2[k] = O2[k] + 1*(d[i]==1)*(x[i]==1)*(t[i] <= k*10)*(t[i] > (k-1)*10)
    # R(k,l): total time at risk (exposure) in interval k
    R1[k] = R1[k] + 10*(t[i] > k*10)*(x[i]==0) + max((t[i]-(k-1)*10),0)*(t[i] <= k*10)*(x[i]==0)
    R2[k] = R2[k] + 10*(t[i] > k*10)*(x[i]==1) + max((t[i]-(k-1)*10),0)*(t[i] <= k*10)*(x[i]==1)
  }
}
# THIS IS THE END OF THIS PROGRAM

The results are:

> O1
[1] 4 1 2 1 1 1
> O2
[1] 1 2 1 1 2 0
> R1
[1] 106  68  50  23  13   5
> R2
[1] 109  84  61  41  27   9

The data to be used in the Poisson regression problem defined on the
last line of page 33/33 are:

Obs = c(O1,O2)
R = c(R1,R2)
xx = c(0,0,0,0,0,0,1,1,1,1,1,1)
theta = c(1,2,3,4,5,6,1,2,3,4,5,6)

We can fit the given model by (see explanation after the output):

################################

fit.p=glm(Obs ~ offset(log(R))+ xx + factor(theta), family=poisson)
summary(fit.p)

> summary(fit.p)

Call:
glm(formula = Obs ~ offset(log(R)) + xx + factor(theta), family = poisson)

Deviance Residuals:
       1         2         3         4         5         6         7         8
 0.38354  -0.67943   0.10875  -0.04739  -0.40208   0.58721  -0.59531   0.70764
       9        10        11        12
-0.14299   0.04894   0.35224  -0.97811

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)     -3.4752     0.4788  -7.259  3.9e-13 ***
xx              -0.6744     0.4971  -1.357    0.175
factor(theta)2  -0.1338     0.7306  -0.183    0.855
factor(theta)3   0.1785     0.7306   0.244    0.807
factor(theta)4   0.3868     0.8392   0.461    0.645
factor(theta)5   1.2871     0.7352   1.751    0.080 .
factor(theta)6   1.2151     1.0975   1.107    0.268
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 8.3945  on 11  degrees of freedom
Residual deviance: 3.0881  on  5  degrees of freedom
AIC: 42.195

Number of Fisher Scoring iterations: 5

##################################

The model used above is essentially a Cox-type model, the difference
being that the baseline hazard is piecewise constant on time intervals
of length 10, but is otherwise completely arbitrary. The coefficient of
x should hence have a similar interpretation as the one for
Cox-regression. In fact, we obtain the value -0.6744, which is very
close to the one from Cox-regression (-0.6831). The standard errors
are, moreover, also very close.

Now to the estimation of the baseline hazard: since we use a Poisson
regression which includes a constant term (Intercept = beta0), and use
theta as a factor, we get theta_1 as the reference value, estimated as

  exp(Intercept) = exp(-3.4752) = 0.030955

Then, further, theta_2 is estimated as

  exp(-3.4752 - 0.1338) = 0.02707891

etc. This is because the contributions from factors 2-6 are given
relative to the reference.
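As a small illustration (not part of the original tutorial), all six theta_k
can be recovered at once from fit.p in this way; the coefficient names are
those printed by summary(fit.p) above:

  # A minimal sketch (assumes fit.p from above): add the factor contrasts to
  # the intercept and exponentiate to get theta_1, ..., theta_6.
  b <- coef(fit.p)
  theta.hat <- exp(b["(Intercept)"] + c(0, b[paste0("factor(theta)", 2:6)]))
  names(theta.hat) <- paste0("theta_", 1:6)
  theta.hat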
As an alternative, it is therefore recommended to ask for a Poisson
regression without a constant term (i.e., without the intercept). This
gives:

#######################################

> fit.p.nointer=glm(Obs ~ offset(log(R))+ xx + factor(theta)-1, family=poisson)
> summary(fit.p.nointer)

Call:
glm(formula = Obs ~ offset(log(R)) + xx + factor(theta) - 1, family = poisson)

Deviance Residuals:
       1         2         3         4         5         6         7         8
 0.38354  -0.67943   0.10875  -0.04739  -0.40208   0.58721  -0.59531   0.70764
       9        10        11        12
-0.14299   0.04894   0.35224  -0.97811

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
xx              -0.6744     0.4971  -1.357 0.174901
factor(theta)1  -3.4752     0.4788  -7.259 3.90e-13 ***
factor(theta)2  -3.6091     0.6084  -5.932 3.00e-09 ***
factor(theta)3  -3.2968     0.6080  -5.422 5.88e-08 ***
factor(theta)4  -3.0885     0.7456  -4.142 3.44e-05 ***
factor(theta)5  -2.1881     0.6314  -3.466 0.000529 ***
factor(theta)6  -2.2602     1.0279  -2.199 0.027887 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1045.4555  on 12  degrees of freedom
Residual deviance:    3.0881  on  5  degrees of freedom
AIC: 42.195

Number of Fisher Scoring iterations: 5

###########################

We now obtain the estimates of the theta_k by direct exponentiation of
the coefficients:

> exp(fit.p.nointer$coefficients)
            xx factor(theta)1 factor(theta)2 factor(theta)3 factor(theta)4
    0.50944141     0.03095417     0.02707750     0.03700235     0.04557148
factor(theta)5 factor(theta)6
    0.11212892     0.10432998

Thus the estimates are:

  theta_1 = 0.03095417
  theta_2 = 0.02707750
  theta_3 = 0.03700235
  theta_4 = 0.04557148
  theta_5 = 0.11212892
  theta_6 = 0.10432998

You may draw the resulting baseline hazard alpha(t) as a function of t
for t from 0 to 60, noting that it is constant on each of the intervals
(0,10], (10,20], .... Then by integration (summing of areas) you get a
cumulative hazard which can be compared to the ones for Cox-regression
and Weibull regression.
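A minimal sketch (not part of the original tutorial) of how this drawing
can be done; the theta values below are copied from the estimates above:

  # A minimal sketch: plot the piecewise constant baseline hazard and its
  # cumulative hazard, using the theta_k estimates listed above.
  theta.vec <- c(0.03095417, 0.02707750, 0.03700235, 0.04557148, 0.11212892, 0.10432998)
  breaks <- seq(0, 60, by = 10)
  # the hazard is a step function with jumps at 10, 20, ..., 50
  plot(stepfun(breaks[2:6], theta.vec), do.points = FALSE, xlim = c(0, 60),
       xlab = "t", ylab = "baseline hazard", main = "")
  # the cumulative hazard is piecewise linear; sum the areas 10*theta_k
  A0 <- c(0, cumsum(10 * theta.vec))
  plot(breaks, A0, type = "l", xlab = "t", ylab = "cumulative hazard")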
###################

USING POISSON REGRESSION FOLLOWING ASAUR CHAPTER 12.1

The survival package in R has in fact the relevant functions for
transforming the problem at the beginning of this tutorial into a
Poisson regression problem. Thus we may avoid the ad hoc "brute force"
method used above. Below we follow the steps from ASAUR Chapter 12.1 in
order to analyse the data of this tutorial. Some of the wording below is
copied directly from ASAUR.

First, we define six intervals that we will use to define a piecewise
constant hazard, i.e., a piecewise exponential distribution:

ptau.s <- c(10, 20, 30, 40, 50, 60)

It is convenient to include the identity of each original individual in
the data:

pdat$id=c(1:23)

Then we use the command:

pdat.split.s <- survSplit(data=pdat, cut=ptau.s, end="t", start="t0",
                          event="d", episode="grp")

> pdat.split.s
   x id t0  t d grp
1  1  1  0  9 1   1
2  1  2  0 10 0   1
3  1  2 10 13 1   2
4  1  3  0 10 0   1
5  1  3 10 13 0   2
6  1  4  0 10 0   1
7  1  4 10 18 1   2
8  1  5  0 10 0   1
9  1  5 10 20 0   2
10 1  5 20 23 1   3
11 1  6  0 10 0   1
12 1  6 10 20 0   2
13 1  6 20 28 0   3
14 1  7  0 10 0   1
15 1  7 10 20 0   2
16 1  7 20 30 0   3
17 1  7 30 31 1   4
18 1  8  0 10 0   1
19 1  8 10 20 0   2
20 1  8 20 30 0   3
21 1  8 30 40 0   4
22 1  8 40 44 1   5
23 1  9  0 10 0   1
24 1  9 10 20 0   2
25 1  9 20 30 0   3
26 1  9 30 40 0   4
27 1  9 40 45 0   5
28 1 10  0 10 0   1
29 1 10 10 20 0   2
30 1 10 20 30 0   3
31 1 10 30 40 0   4
32 1 10 40 48 1   5
33 1 11  0 10 0   1
34 1 11 10 20 0   2
35 1 11 20 30 0   3
36 1 11 30 40 0   4
37 1 11 40 50 0   5
38 1 11 50 59 0   6
39 0 12  0  5 1   1
40 0 13  0  5 1   1
41 0 14  0  8 1   1
42 0 15  0  8 1   1
43 0 16  0 10 0   1
44 0 16 10 12 1   2
45 0 17  0 10 0   1
46 0 17 10 16 0   2
47 0 18  0 10 0   1
48 0 18 10 20 0   2
49 0 18 20 23 1   3
50 0 19  0 10 0   1
51 0 19 10 20 0   2
52 0 19 20 27 1   3
53 0 20  0 10 0   1
54 0 20 10 20 0   2
55 0 20 20 30 0   3
56 0 21  0 10 0   1
57 0 21 10 20 0   2
58 0 21 20 30 0   3
59 0 21 30 33 1   4
60 0 22  0 10 0   1
61 0 22 10 20 0   2
62 0 22 20 30 0   3
63 0 22 30 40 0   4
64 0 22 40 43 1   5
65 0 23  0 10 0   1
66 0 23 10 20 0   2
67 0 23 20 30 0   3
68 0 23 30 40 0   4
69 0 23 40 50 0   5
70 0 23 50 55 1   6

In the above function, “t0” is the name given to the start of the
interval for the respective part of a patient’s record, and “grp” is the
indicator for the interval that a particular record refers to.

Next, we define the exposure per record:

pdat.split.s$expo <- pdat.split.s$t - pdat.split.s$t0

> pdat.split.s$expo
 [1]  9 10  3 10  3 10  8 10 10  3 10 10  8 10 10 10  1 10 10 10 10  4 10 10 10 10  5
[28] 10 10 10 10  8 10 10 10 10 10  9  5  5  8  8 10  2 10  6 10 10  3 10 10  7 10 10
[55] 10 10 10 10  3 10 10 10 10  3 10 10 10 10 10  5

To obtain parameter estimates for the piecewise exponential
distribution, we treat “d” in “pdat.split.s” as if it had a Poisson
distribution, the log of “expo” as the offset, and the mean as given in
Eq. 12.1.4 in ASAUR. The R code is as follows:

> result.pdat.poisson <- glm(d ~ -1 + factor(grp) + x +
+                            offset(log(expo)), family=poisson, data=pdat.split.s)
> summary(result.pdat.poisson)

Call:
glm(formula = d ~ -1 + factor(grp) + x + offset(log(expo)),
    family = poisson, data = pdat.split.s)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.4975  -0.7508  -0.5616  -0.3471   2.3606

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
factor(grp)1  -3.4752     0.4788  -7.259 3.90e-13 ***
factor(grp)2  -3.6091     0.6084  -5.932 3.00e-09 ***
factor(grp)3  -3.2968     0.6080  -5.422 5.88e-08 ***
factor(grp)4  -3.0885     0.7456  -4.142 3.44e-05 ***
factor(grp)5  -2.1881     0.6314  -3.466 0.000529 ***
factor(grp)6  -2.2602     1.0279  -2.199 0.027887 *
x             -0.6744     0.4971  -1.357 0.174901
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1108.277  on 70  degrees of freedom
Residual deviance:   65.909  on 63  degrees of freedom
AIC: 113.91

Number of Fisher Scoring iterations: 7

THESE ARE THE SAME ESTIMATES AS OBTAINED EARLIER!
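As a quick side-by-side check (not part of the original tutorial), the
coefficient of x and its standard error can be compared directly between
the Cox fit and the piecewise exponential (Poisson) fit:

  # A minimal sketch (assumes fit.c and result.pdat.poisson from above):
  # compare the coefficient of x and its standard error across the two models.
  data.frame(
    model = c("Cox", "piecewise Poisson"),
    coef  = as.numeric(c(coef(fit.c)["x"], coef(result.pdat.poisson)["x"])),
    se    = as.numeric(c(sqrt(vcov(fit.c)["x", "x"]),
                         sqrt(vcov(result.pdat.poisson)["x", "x"]))))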
Recall that the "-1" in the model definition tells R to fit a separate
term for each of the six interval factors (rather than treating the
first level as a reference level).

Since the covariate (x) is binary, we may further simplify the data set
by collapsing over the six time intervals. The R function "aggregate"
facilitates this, as follows:

pdat.tab <- aggregate(pdat.split.s[c("d", "expo")],
                      by=list(xx=pdat.split.s$x, grp=pdat.split.s$grp), sum)

> pdat.tab
   xx grp d expo
1   0   1 4  106
2   1   1 1  109
3   0   2 1   68
4   1   2 2   84
5   0   3 2   50
6   1   3 1   61
7   0   4 1   23
8   1   4 1   41
9   0   5 1   13
10  1   5 2   27
11  0   6 1    5
12  1   6 0    9

These data are equivalent to the ones we found by "brute force" earlier
in this tutorial. We may now finally fit the Poisson model to this new
compact data set as follows. We then get exactly the same parameter
estimates as above.

result.pdat.tab.poisson <- glm(d ~ -1 + factor(grp) + xx + offset(log(expo)),
                               family=poisson, data=pdat.tab)

> summary(result.pdat.tab.poisson)

Call:
glm(formula = d ~ -1 + factor(grp) + xx + offset(log(expo)),
    family = poisson, data = pdat.tab)

Deviance Residuals:
       1         2         3         4         5         6         7         8
 0.38354  -0.59531  -0.67943   0.70764   0.10875  -0.14299  -0.04739   0.04894
       9        10        11        12
-0.40208   0.35224   0.58721  -0.97811

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
factor(grp)1  -3.4752     0.4788  -7.259 3.90e-13 ***
factor(grp)2  -3.6091     0.6084  -5.932 3.00e-09 ***
factor(grp)3  -3.2968     0.6080  -5.422 5.88e-08 ***
factor(grp)4  -3.0885     0.7456  -4.142 3.44e-05 ***
factor(grp)5  -2.1881     0.6314  -3.466 0.000529 ***
factor(grp)6  -2.2602     1.0279  -2.199 0.027887 *
xx            -0.6744     0.4971  -1.357 0.174901
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1045.4555  on 12  degrees of freedom
Residual deviance:    3.0881  on  5  degrees of freedom
AIC: 42.195

Number of Fisher Scoring iterations: 5
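As a final check (not part of the original tutorial), the interval
coefficients from this aggregated fit reproduce the theta_k estimates
obtained earlier from the "brute force" data:

  # A minimal sketch (assumes result.pdat.tab.poisson from above): exponentiate
  # the interval coefficients and compare with theta_1, ..., theta_6 listed earlier.
  exp(coef(result.pdat.tab.poisson)[paste0("factor(grp)", 1:6)])
  # expected: 0.03095417 0.02707750 0.03700235 0.04557148 0.11212892 0.10432998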