Generalized Additive Models (GAMs) are characterized by a linear predictor that can be broken down into a sum of a linear term (the Xβ term) and a smooth term (the ∑jfj(xj) term). The goal of plsmselect is to provide a flexible interface for parameter estimation under various penalty structures on the linear parameters β (“none”, ℓ1, or ℓ2) and the smooth functions fj’s (ℓ1 or ℓ2). Most notably, the package allows users to fit so-called GAMLASSO models, which place an ℓ1 penalty on both the linear parameters β and the smooth functions fj’s. An extension to Cox regression is also implemented, which allows users to fit GAMLASSO models to censored time-to-event data. Demonstrations of package usage for all families (gaussian, binomial, poisson, and cox) can be found in the Examples section; users who are more interested in package functionality than methodological details may skip right to that section.
To the best of our knowledge this is the first R-package that can fit GAM models with an ℓ1 penalty on both the linear and smooth terms (i.e. so-called GAMLASSO models) for the gaussian, binomial, poisson, and cox families. Chouldechova and Hastie (2015) proposed a similar method that selects between fitting each of the smooth terms fj’s as zero, linear, or nonlinear, as determined by the data. This method has been implemented in the R-package gamsel (Chouldechova, Hastie, and Spinu 2018) but is only applicable to the gaussian and binomial families. When dealing with continuous variables, both gamsel and plsmselect retain the interpretability advantages of linear (or zero) fits when appropriate, while capturing strong non-linear relationships when they are present.
The main idea of the GAMLASSO model involves combining code from the two benchmark packages for fitting GAM models on the one hand, mgcv (S. Wood 2019), and LASSO models on the other, glmnet (Friedman et al. 2019). The former package allows for flexible modeling of non-linear effects (but does not facilitate an ℓ1 penalty on linear terms) while the latter allows for an ℓ1 penalty on linear terms (but does not facilitate estimation of non-linear effects). plsmselect allows for both by borrowing strength between the two. The main model object of plsmselect (gamlasso) in fact inherits objects from mgcv::gam and glmnet::cv.glmnet, making it particularly easy to work with for users who are familiar with both of those packages.
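For example (a sketch, assuming a fitted gamlasso object gfit as in the Examples section), the familiar mgcv and glmnet tooling applies directly to the two components:

summary(gfit$gam)                       # mgcv's summary of the smooth part
plot(gfit$cv.glmnet)                    # glmnet's cross-validation curve
coef(gfit$cv.glmnet, s = "lambda.min")  # lasso coefficients at the CV-optimal lambda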
The goal of plsmselect is to estimate the parameters of Generalized Additive Models of the form $$ g(\mu) = X\beta + \sum_{j=1}^{q} f_j(Z_j) $$ under different penalty structures on the linear parameters β (“none”, ℓ1, or ℓ2) and the smooth functions fj’s (ℓ1 or ℓ2). In the above μ denotes the mean of a response Y that is assumed to follow a distribution from the exponential dispersion family (with associated link function g), X is a model matrix for linear predictors (e.g. indicators corresponding to categorical variables) and the Zj’s are smooth predictors (e.g. all continuous covariates that potentially have a non-linear relationship with the response).
The package also fits generalized additive models for survival data, whose models are of the form $$ \lambda(t) = \lambda_0(t)\exp\Big(X\beta + \sum_{j=1}^{q} f_j(Z_j)\Big), $$ where λ(t) is the hazard function corresponding to a censored time-to-event response T and λ0(t) is the baseline hazard.
Penalties: gamlasso, the main function of plsmselect, deals with all the combinations of penalty structures as presented in the following table:

 | None | ℓ1 | ℓ2 |
---|---|---|---|
β | ✓ | ✓ | ✓ |
fj’s | | ✓ | ✓ |
For example, having no penalty on β and an ℓ2 penalty on the fj’s gives the same fit as can be obtained by simply using mgcv::gam. We could also fit a model with an ℓ1 penalty on β and no smooth component at all, in which case we could simply use glmnet::cv.glmnet.
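As a sketch using the term specification interface introduced in the Examples section (assuming "none" and "l2" are the accepted penalty spellings alongside the "l1" used there), the first boundary case would read:

## No penalty on beta, l2 smoothness penalty: equivalent to an mgcv::gam fit
fit.gam = gamlasso(response = "Yg",
                   linear.terms = paste0("x",1:10),
                   smooth.terms = paste0("z",1:4),
                   data = simData,
                   linear.penalty = "none",
                   smooth.penalty = "l2",
                   num.knots = 5)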
The main novelty of plsmselect is when we impose an ℓ1 penalty on both β and the fj’s. This means the model expects some of the elements of β to be 0 and some of the functions fj ≡ 0, i.e., we need to do variable selection for linear and smooth predictors simultaneously. In this case we use the GAMLASSO algorithm described in the following section.
Note that the ℓ1 and ℓ2 penalties on β are the standard Lasso and Ridge penalties. As for penalties on the fj’s, by an ℓ2 penalty we mean the standard smoothness penalty used on the functions (detailed in the background section). The ℓ1-type penalty used on the functions fj is actually a variant of the smoothness penalty that forces a function to be 0 when it is not significant - in this case it acts similarly to the Lasso penalty. See ?mgcv::smooth.terms for more details on this type of penalty.
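As a toy illustration of this shrinkage behaviour (made-up data; the smooth has no true signal, so the shrinkage basis can shrink it to essentially zero):

library(mgcv)
set.seed(1)
z <- runif(200)
y <- rnorm(200)                  # response unrelated to z
fit <- gam(y ~ s(z, bs = "ts"))  # shrinkage ("ts") thin plate regression spline
summary(fit)$edf                 # effective degrees of freedom near 0: smooth shrunk away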
When we are fitting a model with an ℓ1 penalty on the linear coefficients β we apply the GAMLASSO algorithm:

1. Initialize offset = 0.
2. Loop until convergence of the fitted values:
    1. Fit $Y \sim \beta_0 + \sum_{j=1}^q f_j(Z_j) + offset$ with either an ℓ1 or an ℓ2 penalty on the fj’s.
    2. Set $offset \leftarrow \hat{\beta}_0 + \sum_{j=1}^q \hat{f_j}(Z_j)$.
    3. Fit Y ∼ Xβ + offset with an ℓ1 penalty on β.
    4. Set offset ← Xβ̂.
We can think of this approach as a block coordinate descent algorithm. The first block contains the intercept β0 and the linear coefficients of the basis functions corresponding to the smooth functions fj (more details in the background section). The second block contains the β’s corresponding to the linear predictors. The former block is implemented by invoking the mgcv package while the latter is implemented with glmnet; a minimal sketch of the loop is given below.
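Here is a minimal sketch of this loop for a gaussian response, assuming a response vector Y, a model matrix X of linear terms, and a data frame Zdat with smooth predictors z1, …, z4 (these names, the fixed "ts" basis, and the convergence rule are illustrative only; gamlasso itself handles all families and penalty combinations):

library(mgcv)
library(glmnet)

gamlasso.sketch <- function(Y, X, Zdat, max.iter = 50, tol = 1e-6) {
  n <- length(Y)
  offset.lin <- rep(0, n)             # X %*% beta-hat from the lasso block
  fitted.old <- rep(Inf, n)
  dat <- cbind(Zdat, Y = Y)
  for (iter in seq_len(max.iter)) {
    ## Block 1: intercept + smooth terms (mgcv); lasso part held fixed as offset
    gfit <- mgcv::gam(Y ~ s(z1, bs="ts") + s(z2, bs="ts") +
                        s(z3, bs="ts") + s(z4, bs="ts"),
                      data = dat, offset = offset.lin)
    offset.smooth <- as.numeric(predict(gfit))   # beta0-hat + sum_j fj-hat(zj)
    ## Block 2: l1-penalized linear terms (glmnet); smooth part held fixed
    lfit <- glmnet::cv.glmnet(X, Y, offset = offset.smooth, intercept = FALSE)
    offset.lin <- as.numeric(predict(lfit, newx = X, s = "lambda.min",
                                     newoffset = rep(0, n)))  # X %*% beta-hat
    ## Stop once the combined fitted values stabilize
    fitted.new <- offset.smooth + offset.lin
    if (max(abs(fitted.new - fitted.old)) < tol) break
    fitted.old <- fitted.new
  }
  list(gam = gfit, cv.glmnet = lfit, fitted = fitted.new)
}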
The other combinations, with “none” or ℓ2 penalties on β, already have closed-form solutions. They have been implemented in the mgcv package, and so in these cases gamlasso acts as a wrapper to those implementations.
For a simple model Y = f(X) + ϵ, given data (yi, xi), i = 1, …, n, we can estimate f assuming it to be of the form $$ f(x) = \sum_{k=1}^K b_k(x) \beta_k, $$ for some basis functions bk and corresponding weights βk. In practice the basis functions are known, so replacing f in the first equation gives us a linear model which we can solve for the βk values.
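As a toy illustration of this idea (made-up data; a cubic polynomial basis stands in for a spline basis, which works the same way):

set.seed(1)
x <- runif(100)
y <- sin(pi * x) + rnorm(100, sd = 0.1)
B <- cbind(1, x, x^2, x^3)                        # basis functions b_k evaluated at the data
beta.hat <- solve(crossprod(B), crossprod(B, y))  # least squares: solve (B'B) beta = B'y
f.hat <- as.numeric(B %*% beta.hat)               # fitted values of f at the observed x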
Of course in general the choice of basis functions is key. One choice could be polynomials (spline fitting) with prespecified knots. Polynomials are sufficient for approximating functions at specific points but can become too “wiggly” over a whole domain. Piecewise linear basis functions give a “better” fit but are not “smooth”. Cubic regression splines with fixed knots x1*, …, xK* provide an attractive solution as a smooth alternative to the intuitive piecewise linear splines. With such a spline basis representation we can add a “wiggliness” penalty $$ \lambda \sum_{k=2}^{K-1} \big(f(x^*_{k-1}) - 2f(x^*_k) + f(x^*_{k+1})\big)^2 = \lambda \beta^T S \beta. $$ The penalty is a second-order one, an approximation to the penalty on second derivatives used in spline fitting. Cross-validation is used for estimating λ.
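A sketch of the resulting penalized least squares problem: for a given basis matrix B and penalty matrix S (both assumed supplied, e.g. by the basis constructor), the estimate for a fixed λ solves (BᵀB + λS)β = Bᵀy:

## Penalized least squares for a fixed smoothing parameter lambda;
## B is the basis matrix, S the wiggliness penalty matrix (assumed given)
pls.fit <- function(B, S, y, lambda) {
  solve(crossprod(B) + lambda * S, crossprod(B, y))
}
## Larger lambda yields a smoother (less wiggly) fitted function.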
In the standard software used to fit a GAM (mgcv) the thin plate regularized spline (tprs) basis functions are used. More details can be found in S. N. Wood (2003).
For two or more smooth predictors the model is Y = f1(X1) + f2(X2) + … + ϵ. Now f1 and f2 cannot both be uniquely identified (adding a constant to one and subtracting it from the other leaves the fit unchanged), so we always include an intercept β0 and impose the identifiability constraints $$ \sum_{i=1}^n f_1(X_{1i}) = \sum_{i=1}^n f_2(X_{2i}) = 0. $$
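These constraints can be checked directly (a sketch with made-up data): the per-term contributions returned by mgcv each sum to approximately zero across the observations.

library(mgcv)
set.seed(1)
d <- data.frame(z1 = runif(100), z2 = runif(100))
d$y <- d$z1^3 + sin(pi * d$z2) + rnorm(100, sd = 0.1)
fit <- gam(y ~ s(z1) + s(z2), data = d)
colSums(predict(fit, type = "terms"))  # each smooth's column sums to ~0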
Note: In general, a model with linear predictors and/or more than one smooth predictor can be written as a linear model (similar to the univariate smooth case above) and fit accordingly.
In this section we will go over some of the functionalities of plsmselect on a simulated dataset. In particular, we will demonstrate how to use the package to fit Generalized Additive Models under different penalty structures to data arising from the following families: gaussian, binomial, poisson, and cox. We will focus our attention on the so-called GAMLASSO model that involves an ℓ1 penalty on both the linear parameters β and the smooth functions fj’s.
In the next subsection we describe the simulated example data set. After that we demonstrate how to fit the GAMLASSO model to gaussian, binomial and poisson responses. Finally, we show how to fit the Cox GAMLASSO model to survival data (censored time-to-event responses).
plsmselect comes with a simulated toy data set that we will use to demonstrate package functionalities. This data set, simData, will be used in all analyses below.
id | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 | z1 | z2 | z3 | z4 | lp | Yg | Yb | Yp | success | failure | time | status |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0.27 | 0.80 | 0.86 | 0.03 | 0.01 | -0.12 | 0 | 1 | 3 | 2 | 44.83 | 1 |
i2 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0.59 | 0.73 | 0.89 | 0.49 | 2.46 | 2.54 | 1 | 13 | 2 | 0 | 2.24 | 1 |
i3 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0.16 | 0.22 | 0.49 | 0.54 | 1.13 | 1.05 | 1 | 3 | 1 | 1 | 10.61 | 1 |
i4 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0.85 | 0.74 | 0.72 | 0.84 | 2.24 | 2.17 | 1 | 8 | 9 | 1 | 3.02 | 1 |
i5 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0.85 | 0.87 | 0.49 | 0.73 | 2.02 | 2.03 | 1 | 7 | 3 | 1 | 30.37 | 1 |
i6 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0.48 | 0.53 | 0.99 | 0.56 | 1.00 | 0.97 | 1 | 0 | 2 | 0 | 20.87 | 1 |
The details of how this data is simulated can be found in the Appendix.
Gaussian response with plsmselect
In this section we fit the following GAM model to the response Yg from simData in Table 1: $$ Y_g = \beta_0 + X\beta + \sum_{j=1}^{4} f_j(z_j) + \varepsilon, $$ where both β and the fj’s are subject to ℓ1 penalties. In the above ε ∼ N(0, σ2), X denotes the model matrix corresponding to the binary variables x1, …, x10 and z1, …, z4 denote the continuous variables of simData.
(Note that the notation here is vectorized: for example, the i-th element of the vector fj(zj) contributes to the i-th element of the vector Yg. We will use a similar vectorized notation henceforth.)
Below we demonstrate the code for fitting this GAMLASSO model using the plsmselect formula specification. Note that for the formula specification the model (design) matrix X needs to be computed and appended to simData. The same code generalizes to situations where x1, …, x10 are categorical factor variables.
## Create model matrix X corresponding to linear terms
## (necessary for the formula option of gamlasso below)
simData$X = model.matrix(~x1+x2+x3+x4+x5+x6+x7+x8+x9+x10, data=simData)[,-1]
## The formula approach
gfit = gamlasso(Yg ~ X +
s(z1, k=5, bs="ts") +
s(z2, k=5, bs="ts") +
s(z3, k=5, bs="ts") +
s(z4, k=5, bs="ts"),
data = simData,
seed = 1)
Alternatively, we can fit the above model using the
plsmselect
term specification:
## The term specification approach
gfit = gamlasso(response = "Yg",
linear.terms = paste0("x",1:10),
smooth.terms = paste0("z",1:4),
data = simData,
linear.penalty = "l1",
smooth.penalty = "l1",
num.knots = 5,
seed = 1)
The term specification approach is helpful when you have a large data set with many columns, but has the disadvantage that the number of knots is constant across all smooth terms. With the formula approach, on the other hand, the user has full flexibility in defining different numbers of knots per smooth term, as in the sketch below.
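For instance, a sketch of a formula-based fit that uses a larger basis for z1 than for the other smooth terms:

## Different basis dimensions per smooth term (formula approach only):
gfit2 = gamlasso(Yg ~ X +
                   s(z1, k=10, bs="ts") +  # more knots for z1
                   s(z2, k=5, bs="ts") +
                   s(z3, k=5, bs="ts") +
                   s(z4, k=5, bs="ts"),
                 data = simData,
                 seed = 1)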
The gamlasso object: The gamlasso object gfit is a list with a first component corresponding to the smooth gam part of the fit (gfit$gam: a gam object from the package mgcv; see ?mgcv::gam) and a second component corresponding to the linear lasso part of the fit (gfit$cv.glmnet: a cv.glmnet object from the package glmnet; see ?glmnet::cv.glmnet).
#> [1] "gam" "glm" "lm"
#> [1] "cv.glmnet"
The summary of the GAMLASSO model can be obtained with the following command:
summary(gfit)
#> $lasso
#> 11 x 1 sparse Matrix of class "dgCMatrix"
#> s1
#> (Intercept) .
#> Xx1 0.99087556
#> Xx2 -0.58532317
#> Xx3 0.54394830
#> Xx4 0.03323999
#> Xx5 .
#> Xx6 .
#> Xx7 .
#> Xx8 .
#> Xx9 .
#> Xx10 0.02623600
#>
#> $gam
#>
#> Family: gaussian
#> Link function: identity
#>
#> Formula:
#> Yg ~ s(z1, k = 5, bs = "ts") + s(z2, k = 5, bs = "ts") + s(z3,
#> k = 5, bs = "ts") + s(z4, k = 5, bs = "ts")
#>
#> Parametric coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.798303 0.009222 86.57 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Approximate significance of smooth terms:
#> edf Ref.df F p-value
#> s(z1) 3.573e+00 4 161.7 <2e-16 ***
#> s(z2) 3.518e+00 4 218.6 <2e-16 ***
#> s(z3) 5.519e-10 4 0.0 0.914
#> s(z4) 1.792e-08 4 0.0 0.543
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> R-sq.(adj) = 0.987 Deviance explained = 93.8%
#> GCV = 0.009253 Scale est. = 0.0085044 n = 100
We note that the summary is broken into a lasso summary and a gam summary. The lasso summary shows the linear parameter estimates; in the above case only the first three linear parameters are estimated as substantially non-zero (the estimates for x4 and x10 are close to zero). Compare these values to the underlying “truth” as defined in the simulation section of the Appendix (β1 = 1, β2 = −0.6, β3 = 0.5, β4 = ⋯ = β10 = 0).
The gam summary corresponds to the (mgcv) summary of the gam object gfit$gam (see ?mgcv::summary.gam).
We can visualize the estimates of the smooth functions by calling the mgcv::plot.gam function:
plot(gfit$gam, pages=1)
We note that the functions corresponding to the variables z3 and z4 are estimated to be near zero (confidence bands include zero). Compare these estimates to the underlying “truth” as defined in the simulation section of the Appendix (f1(z) = z³, f2(z) = sin (π ⋅ z), f3(z) = f4(z) = 0).
Note that mgcv
applies a sum constraint on each function
∑kf1(z1k) = ⋯ = ∑kf4(z4k) = 0,
where the sum is taken across all observations (samples) of z1, …, z4
(see Methods section). Hence the model only
estimates the functions (and overall intercept) correctly up to a
vertical shift. However, we can verify that the model’s fitted values
Ŷg match
nicely those that were observed Yg:
## Plot fitted versus observed values:
plot(simData$Yg, predict(gfit), xlab = "Observed values", ylab = "Fitted Values")
Poisson response with plsmselect
In this section we fit a Poisson regression model to the response Yp ∼ Poi(λ) from simData: $$ \log(\lambda) = \beta_0 + X\beta + \sum_{j=1}^{4} f_j(z_j), $$ where both β and the fj’s are subject to ℓ1 penalties.
The above model can be fit using the formula approach:
## Create model matrix X corresponding to linear terms
## (necessary for the formula option of gamlasso below)
simData$X = model.matrix(~x1+x2+x3+x4+x5+x6+x7+x8+x9+x10, data=simData)[,-1]
## Poisson response. Formula approach.
pfit = gamlasso(Yp ~ X +
s(z1, bs="ts", k=5) +
s(z2, bs="ts", k=5) +
s(z3, bs="ts", k=5) +
s(z4, bs="ts", k=5),
data = simData,
family = "poisson",
seed = 1)
or alternatively with the term specification approach:
## Poisson response. Term-specification approach.
pfit = gamlasso(response = "Yp",
linear.terms = paste0("x",1:10),
smooth.terms = paste0("z",1:4),
data = simData,
linear.penalty = "l1",
smooth.penalty = "l1",
family = "poisson",
num.knots = 5,
seed = 1)
We can obtain the linear parameter estimates directly:
#> 11 x 1 sparse Matrix of class "dgCMatrix"
#> s1
#> (Intercept) .
#> Xx1 0.7535747
#> Xx2 -0.4572487
#> Xx3 0.4357671
#> Xx4 .
#> Xx5 .
#> Xx6 .
#> Xx7 .
#> Xx8 .
#> Xx9 .
#> Xx10 .
and plot the smooth estimates of individual terms:
par(mfrow=c(1,2))
plot(pfit$gam, select=1) # estimate of smooth term z1
plot(pfit$gam, select=2) # estimate of smooth term z2
We note that the linear parameter estimates reasonably capture the underlying truth: β1 = 1, β2 = −0.6, β3 = 0.5, β4 = ⋯ = β10 = 0 (only shrunk towards zero, as expected). Similarly we see that the estimates of the smooth effects corresponding to z1 and z2 are consistent with the true functions f1(z) = z³ and f2(z) = sin (π ⋅ z) up to a constant vertical shift (due to the sum constraint). We can further verify how well the model estimates the true expected counts exp (lp):
plot(predict(pfit, type="response"), exp(simData$lp), xlab="predicted count", ylab="true expected count")
Bernoulli response with plsmselect
In this section we fit a binary logistic regression model to the response Yb ∼ Bernoulli(p) from simData: $$ \mathrm{logit}(p) = \beta_0 + X\beta + \sum_{j=1}^{4} f_j(z_j), $$ where both β and the fj’s are subject to ℓ1 penalties.
The above model can be fit using the formula approach:
## Create model matrix X corresponding to linear terms
## (necessary for the formula option of gamlasso below)
simData$X = model.matrix(~x1+x2+x3+x4+x5+x6+x7+x8+x9+x10, data=simData)[,-1]
## Bernoulli trials response
bfit = gamlasso(Yb ~ X +
s(z1, bs="ts", k=5) +
s(z2, bs="ts", k=5) +
s(z3, bs="ts", k=5) +
s(z4, bs="ts", k=5),
data = simData,
family = "binomial",
seed = 1)
or alternatively using the term specification approach:
## The term specification approach
bfit = gamlasso(response = "Yb",
linear.terms = paste0("x",1:10),
smooth.terms = paste0("z",1:4),
data = simData,
family="binomial",
linear.penalty = "l1",
smooth.penalty = "l1",
num.knots = 5,
seed = 1)
We can evaluate the summary, plot the smooth terms, and compare predicted probabilities to the underlying truth as above (not evaluated):
summary(bfit)
plot(bfit$gam, pages=1)
pred.prob <- predict(bfit, type="response")
true.prob <- exp(simData$lp)/(1+exp(simData$lp))
plot(pred.prob, true.prob, xlab="predicted probability", ylab="true probability")
Note that the above model does not fit the data well since this is a small data set (N = 100) for binary logistic regression.
Binomial counts response with plsmselect
In this section we fit a binomial logistic regression model to the response success ∼ Binomial(n, p) from simData (where n denotes varying count totals across subject ids): $$ \mathrm{logit}(p) = \beta_0 + X\beta + \sum_{j=1}^{4} f_j(z_j), $$ where both β and the fj’s are subject to ℓ1 penalties.
We define failure = n − success and use the formula specification approach to fit the above model:
## Create model matrix X corresponding to linear terms
## (necessary for the formula option of gamlasso below)
simData$X = model.matrix(~x1+x2+x3+x4+x5+x6+x7+x8+x9+x10, data=simData)[,-1]
## Binomial counts response. Formula approach.
bfit2 = gamlasso(cbind(success,failure) ~ X +
s(z1, bs="ts", k=5) +
s(z2, bs="ts", k=5) +
s(z3, bs="ts", k=5) +
s(z4, bs="ts", k=5),
data = simData,
family = "binomial",
seed = 1)
or alternatively using the term specification approach:
## Binomial counts response. Term specification approach
bfit2 = gamlasso(c("success","failure"),
linear.terms=paste0("x",1:10),
smooth.terms=paste0("z",1:4),
data=simData,
family = "binomial",
linear.penalty = "l1",
smooth.penalty = "l1",
num.knots = 5,
seed=1)
As in the Bernoulli example we can evaluate the summary, plot the smooth terms, and compare predicted probabilities to the underlying truth (not evaluated; the code below mirrors the Bernoulli example):
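summary(bfit2)
plot(bfit2$gam, pages=1)
pred.prob2 <- predict(bfit2, type="response")
true.prob <- exp(simData$lp)/(1+exp(simData$lp))
plot(pred.prob2, true.prob, xlab="predicted probability", ylab="true probability")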
Survival response (Cox) with plsmselect
In this section we fit a Cox regression model to the censored event time variable from simData (status = 1 if the event is observed, 0 if censored): $$ \lambda(t) = \lambda_0(t)\exp\Big(X\beta + \sum_{j=1}^{4} f_j(z_j)\Big), $$ where both β and the fj’s are subject to ℓ1 penalties.
We fit the above Cox GAMLASSO model using the formula approach:
## Create model matrix X corresponding to linear terms
## (necessary for the formula option of gamlasso below)
simData$X = model.matrix(~x1+x2+x3+x4+x5+x6+x7+x8+x9+x10, data=simData)[,-1]
# Censored time-to-event response. Formula approach.
cfit = gamlasso(time ~ X +
s(z1, bs="ts", k=5) +
s(z2, bs="ts", k=5) +
s(z3, bs="ts", k=5) +
s(z4, bs="ts", k=5),
data = simData,
family = "cox",
weights = "status",
seed = 1)
or alternatively using the term specification approach:
# Censored time-to-event response. Term specification approach.
cfit = gamlasso(response = "time",
linear.terms = paste0("x",1:10),
smooth.terms = paste0("z",1:4),
data = simData,
linear.penalty = "l1",
smooth.penalty = "l1",
family = "cox",
weights="status",
num.knots = 5,
seed = 1)
Once the model is fit we can perform similar diagnostics as in the Gaussian, Poisson, Bernoulli and Binomial examples above. In addition, since this is a survival model, we can also estimate the hazard function. In particular, the function cumbasehaz will output the estimated cumulative baseline hazard function.
## Obtain and plot predicted cumulative baseline hazard:
H0.pred <- cumbasehaz(cfit)
time.seq <- seq(0, 60, by=1)
plot(time.seq, H0.pred(time.seq), type="l", xlab = "Time", ylab="",
main = "Predicted Cumulative \nBaseline Hazard")
We can predict the survival probability S(t) = P(T > t) for each sample at a single or a sequence of event times t.
## Obtain predicted survival at days 1,2,3,...,60:
S.pred <- predict(cfit, type="response", new.event.times=1:60)
## Plot the survival curve for sample (subject) 17:
plot(1:60, S.pred[17,], xlab="time (in days)", ylab="Survival probability", type="l")
Note that the object S.pred
above is a matrix whose rows
are samples (subjects) and columns are the new event times (at which
survival probabilities are calculated).
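For example, a quick slice (given the event-time grid 1:60 above, column 30 corresponds to day 30):

## Predicted survival probability at day 30 for the first 5 subjects:
S.pred[1:5, 30]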
We provide below the code that was used to generate simData. Note that due to simulation using different seeds the output from the code below (simData2) may not exactly match data(simData).
Simulate covariates:
We first simulate the covariate predictors with the helper function
generate.inputdata
below. We simulate N = 100 samples of 10 binary (linear) predictors x1, …, x10
and 4 continuous (smooth) predictors
z1, …, z4.
generate.inputdata <- function(N) {
id <- paste0("i",1:N)
## Define linear predictors
linear.pred <- matrix(c(rbinom(N*3,size=1,prob=0.2),
rbinom(N*5,size=1,prob=0.5),
rbinom(N*2,size=1,prob=0.9)),
nrow=N)
colnames(linear.pred) = paste0("x", 1:ncol(linear.pred))
## Define smooth predictors
smooth.pred = as.data.frame(matrix(runif(N*4),nrow=N))
colnames(smooth.pred) = paste0("z", 1:ncol(smooth.pred))
## Coalesce the predictors
dat = cbind.data.frame(id, linear.pred, smooth.pred)
return(dat)
}
## Simulate N input data observations:
N <- 100
simData2 <- generate.inputdata(N)
We encourage the user to change N
to a higher number
(e.g. N = 1000) and rerun the
code in the Examples section with the new
simData2
in order to get a better model fit.
Calculate “true” linear predictor:
In all simulations below (of the various response types gaussian, binomial, poisson, and cox) we will assume that the “true” underlying linear predictor is lp = x1 − 0.6x2 + 0.5x3 + z1³ + sin (π ⋅ z2), i.e. we assume that the linear parameters associated with x1, …, x10 are respectively β1 = 1, β2 = −0.6, β3 = 0.5, β4 = ⋯ = β10 = 0 and the functional relationships between lp and z1, …, z4 are respectively f1(z) = z³, f2(z) = sin (π ⋅ z), f3(z) = f4(z) = 0.
The below R code calculates the “true” linear predictor based on the
above formula.
## "True" coefficients of linear terms (last 7 are zero):
beta <- c(1, -0.6, 0.5, rep(0,7))
## "True" nonlinear smooth terms:
f1 <- function(x) x^3
f2 <- function(x) sin(x*pi)
## Calculate "True" linear predictor (lp) based on above data (simData2)
simData2$lp <- as.numeric(as.matrix(simData2[,paste0("x",1:10)]) %*% beta + f1(simData2$z1) + f2(simData2$z2))
Simulate Gaussian, Bernoulli, and Poisson responses:
We then go on and simulate gaussian, bernoulli, and poisson responses assuming the above linear predictor as ground truth:
library(purrr)  # provides map_int() and map2_int() used below

## Simulate gaussian response with mean lp:
simData2$Yg = simData2$lp + rnorm(N, sd = 0.1)
## Simulate bernoulli trials with success probability exp(lp)/(1+exp(lp))
simData2$Yb = map_int(exp(simData2$lp), ~ rbinom(1, 1, p = ( ./(1+.) ) ) )
## Simulate Poisson response with log(mean) = lp
simData2$Yp = map_int(exp(simData2$lp), ~rpois(1, .) )
Simulate Binomial counts:
We then simulate binomial success/failure counts (with varying count totals n also simulated) assuming the above linear predictor as ground truth: success ∼ Binomial(n, p), where p = exp (lp)/(1 + exp (lp)).
## Simulate binomial counts with success probability exp(lp)/(1+exp(lp))
sizes = sample(10, nrow(simData2), replace = TRUE)
# Simulate success:
simData2$success = map2_int(exp(simData2$lp), sizes, ~rbinom(1, .y, p = ( .x/(1+.x) )))
# Calculate failure:
simData2$failure = sizes - simData2$success
Simulate time-to-event response from Weibull model:
Finally we simulate censored event times T according to a Cox regression model with exponential baseline hazard (i.e. the baseline hazard function is constant), assuming the above linear predictor as ground truth. The censoring times C are also simulated according to an exponential distribution. The event indicator is then defined as status = 1(T ≤ C).
## Function that simulates N samples of censored event times (time, status)
## Event times ~ Weibull(lambda0, rho) with linear predictor lp
## rho=1 corresponds to exponential distribution
simulWeib <- function(N, lambda0, rho, lp)
{
# Censoring times ~ Exponential(lambdaC)
lambdaC=0.0005 # very mild censoring
# Simulate Weibull latent event times
v <- runif(n=N)
Tlat <- (- log(v) / (lambda0 * exp(lp)))^(1 / rho)
# Simulate censoring times
C <- rexp(n=N, rate=lambdaC)
# Calculate follow-up times and event indicators
time <- pmin(Tlat, C)
status <- as.numeric(Tlat <= C)
data.frame(time=time, status=status)
}
## Set parameters of Weibull baseline hazard h0(t) = lambda*rho*t^(rho-1):
lambda0 <- 0.01; rho <- 1;
## Simulate (times, status) from above censored Weibull model and cbind to our data:
simData2 <- cbind.data.frame(simData2, simulWeib(N, lambda0, rho, simData2$lp))