Package discfrail (discrete
frailty) is an R
package that provides a
novel, flexible model for hierarchical time-to-event data in which a
clustering structure of groups is suspected. The random effects that
define the group-level frailties follow a nonparametric discrete
distribution, and the baseline hazards are left unspecified, as in the
Cox model. The package contains a function to fit the Cox model with
nonparametric discrete frailty term and two functions for simulating
data. Full details of the methods are given in Gasperoni et al. (2018). This vignette
gives a brief overview of the methods, and worked examples of using the
functions.
Consider a random sample with a hierarchical structure, i.e, where each individual subject, or statistical unit, belongs to one group. Define Tij* as the survival time and Cij as the censoring time of subject i, i = 1, ..., nj, in the j-th group, j = 1, ..., J. Let Xij = (Xij1, ..., Xijp)T be a vector of covariates, assumed constant over time, for subject i in group j. Then, we define Tij = min(Tij*, Cij), tij its realization and δij = 1(Tij* ≤ Cij).
Let $\tilde{\textbf{w}}$ be a vector
of shared random effects, and $\textbf{w}=\exp\{\tilde{\textbf{w}}\}$, be
the corresponding vector of shared frailties that represent unexplained
relative hazards between the groups. Conventionally, frailties are
assumed to follow a parametric distribution, typically the Gamma.
However in discfrail
, the frailties are assumed to have a
nonparametric, discrete distribution, with an unknown number of points
in the support. This allows an arbitrarily flexible random effects
distribution, and gives a method for clustering. In particular, we
assume that each group j can
belong to one latent population k, k = 1, ..., K, with
probability πk. In this
case, w1, ..., wK
are the corresponding distinct values for the frailties in each
population, where ℙ{w = wk} = πk.
We introduce also an auxiliary (latent) indicator random variable zjk which is equal to 1 if the j-th group belongs to the k-th population, so, $z_{jk} \overset{i.i.d}{\sim} Bern(\pi_{k})$. Since each group belongs to only one population, $\sum_{k=1}^K z_{jk} = 1$ for each j.
Therefore, in the nonparametric discrete frailty Cox model of
discfrail
, the hazard for individual i in group j, conditional on w and on the zjk is:
where λ0(t) represents the baseline hazard, β is the vector of regression coefficients and wk is the frailty term shared among groups of the same latent population k. Since the baseline hazard is unspecified, and the remaining parameters are estimated by maximum partial likelihood, model is an extension of a proportional hazard Cox model.
The vector zj has a multinomial distribution. Note that there are two levels of grouping. The first level is known, for example, healthcare providers as groups of patients) and we refer to these clusters as groups. The second level is the unknown clustering of groups (e.g. groups of healthcare providers with similar outcomes) that we want to detect, and we refer to these clusters as latent populations.
We assume that censoring is noninformative, thus that Tij* and Cij are conditionally independent, given Xij, wk and zjk.
An application of this model to a real dataset is shown in Section 5 of Gasperoni et al. (2018). In this case, the event of interest is the second admission in hospital of patients that suffer from Heart Failure (patients are grouped in healthcare providers).
Two functions in the discfrail
package simulate
hierarchical time-to-event data in which there are clusters of groups
with the same frailty.
The input parameters of the simulation algorithm are:
Several steps are needed for building the dataset. The simulation procedure uses the method presented by Bender et al. (2005). We use two probability results to obtain Eq.: the inverse probability method and the fact that if a generic random variable V is distributed as 𝒰[0, 1] then 1 − V is still a 𝒰[0, 1].
Then, we are able to compute the survival times with Eq. by inverting Eq.. The main difference between equations , and the ones showed in Bender et al. (2005) is the frailty term, wk.
The covariates are randomly generated from a normal distribution, $\textbf{X}_{ij} \overset{i.i.d.}{\sim} N(0,1)$.
Exploiting equations and we are able to simulate the event times Tij. Then, we generate the censoring times according to a Normal distribution with standard deviation equal to one, independently from the Tij. The censoring scheme is reproduced according to Wan (2017) so that censperc of the simulated times are censored.
Finally, we obtain the status variable as: δij = 1(Tij* ≤ Cij).
The package includes two general choices of baseline survival function:
sim_weibdf
, a parametric baseline, specifically a
Weibull distribution characterised by two parameters λ and ρ, with cumulative hazard function
Λ0(t) = λ ⋅ tρ.
sim_npdf
, a generic baseline, where the user can
supply Λ0−1 as an R
function of time.
In the first case, equation becomes:
In the following example we simulate data with a Weibull baseline.
## Loading required package: survival
J <- 100 # total number of groups
N <- 40 # number of units per group
lambda <- 0.5 # Weibull scale parameter
rho <- 1.4 # Weibull shape parameter
beta <- c( 1.6, 0.4 ) # log hazard ratios for covariates
p <- c( 0.7, 0.3 ) # mixing proportions
w_values <- c( 1.2, 2.1 ) # frailty values
cens_perc <- 0.1 # percentage of censored events
set.seed(1200)
data_weib <- sim_weibdf( J, N, lambda, rho, beta, p, w_values, cens_perc)
head( data_weib )
## family time status x.1 x.2 belong
## 1 1 6.3524700 0 -0.848038891 -1.5494394 1.2
## 2 1 0.1524241 1 -0.047688486 1.8599481 1.2
## 3 1 0.7678551 1 -0.005320405 1.4354504 1.2
## 4 1 0.9151902 1 0.354866150 0.6519595 1.2
## 5 1 0.2059090 1 1.333647771 1.3741231 1.2
## 6 1 0.6141216 1 0.805291478 0.4975223 1.2
The simulated data object contains components that include
family
, the group indicator. In this case there are
40 individuals in each simulated group, and this can be checked here
with table(data_weib$family)
.
status
: the survival status δij.
Here we can check the proportion of censored events:
##
## 0 1
## 0.10025 0.89975
belong
: w the frailty values. Here
we can check that the proportion of groups belonging to each latent
population agrees with the values we specified for p
:##
## 1.2 2.1
## 0.7 0.3
We can visualize the simulated group-specific survivor curves by Kaplan-Meier estimation, as follows. The curves are coloured according to the latent population to which each group belongs.
fit1 <- coxph( Surv( time, status ) ~ family, data_weib, method = 'breslow')
sfit1 <- survfit(Surv( time, status ) ~ family, data_weib)
lat_pop = rep( 2, J )
lat_pop[ which( data_weib$belong[ seq( 1, dim(data_weib)[1], N ) ] %in%
w_values[2]) ] = 1
plot( sfit1, col = lat_pop, xlab = 'time', ylab = 'Survival probability',
lwd = 1.5 )
legend( 'topright', paste('Latent population', 1:2), col = 2:1,
lty = rep(1,2), lwd = 1.5, bty = 'n' )
This illustrates the substantial heterogeneity between groups.
The following code simulates the same form of data, but this time with a user-specified inverse cumulative hazard function Λ0(t) supplied as an R function.
J <- 100
N <- 40
Lambda_0_inv = function( t, c=0.01, d=4.6 ) ( t^( 1/d ) )/c
beta <- 1.6
p <- c( 0.8, 0.2 )
w_values <- c( 0.8, 1.6 )
cens_perc <- 0.2
data_np <- sim_npdf( J, N, beta, Lambda_0_inv, p, w_values, cens_perc)
And again, the simulated survivor functions can be visualized as follows:
fit1 <- coxph( Surv( time, status ) ~ family, data_np, method = 'breslow')
sfit1 <- survfit(Surv( time, status ) ~ family, data_np)
lat_pop = rep( 2, J )
lat_pop[ which( data_np$belong[ seq( 1, dim(data_np)[1], N ) ] %in%
w_values[2]) ] = 1
plot( sfit1, col = lat_pop, xlab = 'time', ylab = 'Survival probability',
lwd = 1.5 )
legend( 'bottomleft', paste('Latent population', 1:2), col = 2:1,
lty = rep(1,2), lwd = 1.5, bty = 'n' )
The function npdf_cox
can be used to fit the Cox model
with nonparametric discrete frailty. This is based on maximum partial
likelihood using an EM algorithm – see Section 3 of Gasperoni et al. (2018). We illustrate
the use of this function to fit this model to the dataset
data_weib
that was previously simulated.
The first argument specifies the outcome as a Surv
object, and any covariates, in the standard R formula syntax used by
coxph
in the standard survival
package.
Another required argument is groups
, which names a variable
in the data containing the group membership indicators.
The number of latent populations can be specified in two alternative
ways. If npdf_cox
is called with estK=FALSE
,
then one model is fitted with the number of latent populations fixed at
the number specified by the K
argument to
npdf_cox
.
Alternatively, if npdf_cox
is called with
estK=TRUE
(which is the default choice if estK
is omitted) then multiple models are fitted with the number of latent
groups ranging from 1 to K
. This is done in the following
example:
test_weib <- npdf_cox( Surv(time, status) ~ x.1 + x.2, groups = family,
data = data_weib, K = 4, eps_conv=10^-4)
Printing the fitted model object shows, firstly, the model comparison
statistics for the models fitted. The estimates from the best-fitting
model according to the preferred criterion (specified in the
criterion
argument to npdf_cox
, by default,
this is BIC) are then presented.
##
## Call:
## npdf_cox(formula = Surv(time, status) ~ x.1 + x.2, groups = family,
## data = data_weib, K = 4, eps_conv = 10^-4)
##
## Model comparison:
## K K_fitted llik AIC BIC
## 1 1 1 -28515.39 57036.79 57055.36
## 2 2 2 -28453.33 56916.66 56947.60
## 3 3 2 -28457.40 56928.80 56972.12
## 4 4 2 -28470.35 56958.69 57014.39
## Optimal K:
## Laird AIC BIC
## 2 2 2
##
## Best model according to BIC:
## Nonparametric discrete frailty Cox model fit with K=2 latent populations
##
## Estimated parameters and standard errors:
## est seLouis seExact
## p1 0.694 0.0498 0.0499
## p2 0.306 0.0498 0.0499
## w2/w1 1.816 0.0693 0.0694
## x.1 1.640 0.0271 0.0271
## x.2 0.408 0.0169 0.0169
##
## Log-likelihood: -28453
## AIC: 56917
## BIC: 56948
## Fitted K: 2
##
## To examine other models, look at `models` component
According to AIC and BIC, the optimal K̂ is equal to 2, which was the true value used to generate the data.
The criterion of Laird (1978) works as
follows. At the end of the model fitting algorithm, each group is
assigned to a latent population. The number of distinct “fitted” latent
populations for each model is displayed in the K_fitted
column of the model comparison table. Then the best-fitting model
according to the Laird (1978) criterion is
the maximum K
among the fitted models for which
K=K_fitted
, that is, for which every latent population has
at least one individual assigned to it. In this example, the optimal
K̂ is 4
.
The estimates $\hat{\boldsymbol{\pi}}$ are [0.665, 0.335], close to the true values of
[0.7, 0.3]. The estimated ratio $\hat{w_2}/\hat{w_1} = 1.747$, again close to
the true 1.75 (2.1/1.2). Exact standard errors and standard
errors according to the method of Louis
(1982) are also reported. See the Supplementary Material of
Gasperoni et al.(2018) for more
information about the computation of standard errors. Note that if
K_fitted
is less than K
then the standard
errors cannot be computed under these methods.
The function npdf_cox
returns a list with the following
components, illustrated for this example as follows:
test_weib$model
is a list with one element for each
fitted model, in this case 4 elements.
The ith
element of the list contains the estimates with i latent populations. We show the
output related to K = 3.##
## Nonparametric discrete frailty Cox model fit with K=3 latent populations
##
## Estimated parameters and standard errors:
## est
## p1 0.00725
## p2 0.68663
## p3 0.30612
## w2/w1 1.06399
## w3/w1 1.93015
## x.1 1.63942
## x.2 0.40845
##
## Log-likelihood: -28457
## AIC: 56929
## BIC: 56972
## Fitted K: 2
test_weib$comparison
is a matrix in which the model
comparison information, including the log-likelihood, AIC and BIC, is
reported for each value of K.## K K_fitted llik AIC BIC
## 1 1 1 -28515.39 57036.79 57055.36
## 2 2 2 -28453.33 56916.66 56947.60
## 3 3 2 -28457.40 56928.80 56972.12
## 4 4 2 -28470.35 56958.69 57014.39
test_weib$Kopt
reports the optimal K according to Laird, AIC and
BIC.## Laird AIC BIC
## 2 2 2
test_weib$criterion
is the criterion used to choose the
optimal K.## [1] "BIC"
test_weib$mf
is the , the data frame used to fit the
model.Since the data are simulated, we are also able to check whether the groups are correctly assigned. We note that only 2 groups are misclassified (42 and 98).
# true distinct frailty for each group
w <- data_weib$belong[!duplicated(data_weib$family)]
# true latent population for each group
real_lat_pop <- match(w, sort(unique(w)))
table( test_weib$models[[2]]$belonging, real_lat_pop )
## real_lat_pop
## 1 2
## 1 67 2
## 2 3 28
## [1] 29 41 54 60 61
## [,1] [,2]
## [1,] 0.73278706 0.2672129
## [2,] 0.09290722 0.9070928
## [3,] 0.04250517 0.9574948
## [4,] 0.78118633 0.2188137
## [5,] 0.16067014 0.8393299
We can investigate the posterior probabilities of being part of a specific latent population (αjk, see Section 3.1 of Gasperoni et al. (2018)). For some groups there is great uncertainty about the latent population to which they belong (i.e., group 98 is assigned to latent population 1 and 2 with probabilities of 0.66, 0.34 respectively). The following code identifies those groups that have probabilities between 0.05 and 0.95 of being assigned to latent population one and two (note that this list includes the two “misclassified” groups highlighted above).
#alpha matrix: alpha_{jk}
#is the probability that group j is assigned to latent population k.
#test_weib$models[[2]]$alpha
assign_not_sure1 = which( test_weib$models[[2]]$alpha[ ,1 ] < 0.95 &
test_weib$models[[2]]$alpha[ ,1 ] > 0.05 &
test_weib$models[[2]]$alpha[ ,2 ] < 0.95 &
test_weib$models[[2]]$alpha[ ,2 ] > 0.05 )
assign_not_sure1
## [1] 7 10 24 29 41 46 56 60 61 82 95 97 99
We can visualize the fitted survivor curves for each group through
the plot
method. The default, shown below, uses
Kaplan-Meier estimates. It is also possible to plot Nelson-Aalen
estimates (choosing type = 'na'
). These plots automatically
colour the group-specific curves according to the group’s fitted latent
population. This demonstrates how the group-specific frailties are
clustered, with the higher-frailty population shown in black.
plot( test_weib, type = 'km', lwd = 1.5 )
legend( 'topright', paste( 'Latent population', c( 1, 2 ) ), lwd = 1.5,
col = c( 2, 1 ), lty = rep( 1, 2 ), bty = 'n' )
The following plot highlights in green the survival curves for the three groups in the simulated data that are “misclassified” by the model.
colors_misc = test_weib$models[[2]]$belonging + 1
colors_misc[ colors_misc == 3 ] = 1
colors_misc[ test_weib$models[[2]]$belonging != real_lat_pop ] = 3
plot( test_weib, type = 'km', col = colors_misc, lwd = 1.5 )
legend( 'topright', paste( 'Latent population', c( 1, 2 ) ), lwd = 1.5,
col = c( 2, 1 ), lty = rep( 1, 2 ), bty = 'n' )