Title: | High Dimensional Penalized Generalized Linear Mixed Models (pGLMM) |
---|---|
Description: | Fits high dimensional penalized generalized linear mixed models using the Monte Carlo Expectation Conditional Minimization (MCECM) algorithm. The purpose of the package is to perform variable selection on both the fixed and random effects simultaneously for generalized linear mixed models. The package supports fitting of Binomial, Gaussian, and Poisson data with canonical links, and supports penalization using the MCP, SCAD, or LASSO penalties. The MCECM algorithm is described in Rashid et al. (2020) <doi:10.1080/01621459.2019.1671197>. The techniques used in the minimization portion of the procedure (the M-step) are derived from the procedures of the 'ncvreg' package (Breheny and Huang (2011) <doi:10.1214/10-AOAS388>) and 'grpreg' package (Breheny and Huang (2015) <doi:10.1007/s11222-013-9424-2>), with appropriate modifications to account for the estimation and penalization of the random effects. The 'ncvreg' and 'grpreg' packages also describe the MCP, SCAD, and LASSO penalties. |
Authors: | Hillary Heiling [aut, cre], Naim Rashid [aut], Quefeng Li [aut], Joseph Ibrahim [aut] |
Maintainer: | Hillary Heiling <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.5.4.8 |
Built: | 2025-02-26 05:17:07 UTC |
Source: | https://github.com/hheiling/glmmpen |
Control of Metropolis-within-Gibbs Adaptive Random Walk Sampling Procedure
Controls the adaptive random walk Metropolis-within-Gibbs sampling procedure.
adaptControl(batch_length = 100, offset = 0)
adaptControl(batch_length = 100, offset = 0)
batch_length |
positive integer specifying the number of posterior samples to collect
before the proposal variance is adjusted based on the acceptance rate of the last
|
offset |
non-negative integer specifying an offset value for the increment of the proposal
variance adjustment. Optionally used to ensure the required diminishing adaptation condition.
Default set to 0.
|
Function returns a list (inheriting from class "adaptControl
")
containing parameter specifications for the adaptive random walk sampling procedure.
Basal dataset: A composition of cancer datasets with top scoring pairs (TSPs) as covariates and binary response indicating if the subject's cancer subtype was basal-like.
A dataset composed of four datasets combined from studies that contain gene expression data from subjects with several types of cancer. Two of these datasets contain gene expression data for subjects with Pancreatic Ductal Adenocarcinoma (PDAC), one dataset contains data for subjects with Breast Cancer, and the fourth dataset contains data for subjects with Bladder Cancer. The response of interest is whether or not the subject's cancer subtype was the basal-like subtype. See articles Rashid et al. (2020) "Modeling Between-Study Heterogeneity for Improved Replicability in Gene Signature Selection and Clinical Prediction" and Moffitt et al. (2015) "Virtual microdissection identifies distinct tumor- and stroma-specific subtypes of pancreatic ductal adenocarcinoma" for further details on these four datasets.
data("basal")
data("basal")
A list containing the following elements:
binary response vector; 1 indicates that the subject's cancer was of the basal-like subtype, 0 otherwise
matrix of 50 top scoring pair (TSP) covariates
factor indicating which cancer study the observation belongs to, which are given the following descriptions: UNC PDAC, TCGA PDAC, TCGA Bladder Cancer, and UNC Breast Cancer
model matrix for random effects; organized first by variable, then by group (i.e. by cancer study)
glmm
is used to fit a single generalized mixed model via Monte Carlo
Expectation Conditional Minimization (MCECM). Unlike glmmPen
, no model selection
is performed.
glmm( formula, data = NULL, family = "binomial", covar = NULL, offset = NULL, optim_options = optimControl(), adapt_RW_options = adaptControl(), trace = 0, tuning_options = lambdaControl(), progress = TRUE, ... )
glmm( formula, data = NULL, family = "binomial", covar = NULL, offset = NULL, optim_options = optimControl(), adapt_RW_options = adaptControl(), trace = 0, tuning_options = lambdaControl(), progress = TRUE, ... )
formula |
a two-sided linear formula object describing both the fixed effects and
random effects part of the model, with the response on the left of a ~ operator and the terms,
separated by + operators, on the right. Random-effects terms are distinguished by vertical bars
("|") separating expression for design matrices from the grouping factor. |
data |
an optional data frame containing the variables named in |
family |
a description of the error distribution and link function to be used in the model.
Currently, the |
covar |
character string specifying whether the covariance matrix should be unstructured
("unstructured") or diagonal with no covariances between variables ("independent").
Default is set to |
offset |
This can be used to specify an a priori known component to be included in the
linear predictor during fitting. Default set to |
optim_options |
a structure of class "optimControl" created
from function |
adapt_RW_options |
a list of class "adaptControl" from function |
trace |
an integer specifying print output to include as function runs. Default value is 0. See Details for more information about output provided when trace = 0, 1, or 2. |
tuning_options |
a list of class "selectControl" or "lambdaControl" resulting from
|
progress |
a logical value indicating if additional output should be given showing the
progress of the fit procedure. If |
... |
additional arguments that could be passed into |
The glmm
function can be used to fit a single generalized mixed model.
While this approach is meant to be used in the case where the user knows which
covariates belong in the fixed and random effects and no penalization is required, one is
allowed to specify non-zero fixed and random effects penalties using lambdaControl
and the (...) arguments. The (...) allow for specification of penalty-related arguments; see
glmmPen
for details. For a high dimensional situation, the user may want to fit a
minimal penalty model using a small penalty for the fixed and random effects and save the posterior
samples from this minimal penalty model for use in any BIC-ICQ calculations during selection within glmmPen
.
Specifying a file name in the 'BICq_posterior' argument will save the posterior samples from the
glmm
model into a big.matrix with this file name, see the Details section of
glmmPen
for additional details.
A reference class object of class pglmmObj
for which many methods are
available (e.g. methods(class = "pglmmObj")
)
glmm_FA
is used to fit a single generalized linear mixed model via Monte Carlo
Expectation Conditional Minimization (MCECM) using a factor model decomposition of
the random effects. No model selection is performed.
This function uses a factor model decomposition on the random effects. This assumption
reduces the latent space in the E-step (Expectation step) of the algorithm,
which reduces the computational complexity of the algorithm. This improves
the speed of the algorithm and enables the scaling of the algorithm to larger dimensions.
Besides the modeling of the random effects, this function is similar to glmm
.
glmm_FA( formula, data = NULL, family = "binomial", offset = NULL, r_estimation = rControl(), optim_options = optimControl(), adapt_RW_options = adaptControl(), trace = 0, tuning_options = lambdaControl(), progress = TRUE, ... )
glmm_FA( formula, data = NULL, family = "binomial", offset = NULL, r_estimation = rControl(), optim_options = optimControl(), adapt_RW_options = adaptControl(), trace = 0, tuning_options = lambdaControl(), progress = TRUE, ... )
formula |
a two-sided linear formula object describing both the fixed effects and
random effects part of the model, with the response on the left of a ~ operator and the terms,
separated by + operators, on the right. Random-effects terms are distinguished by vertical bars
("|") separating expression for design matrices from the grouping factor. |
data |
an optional data frame containing the variables named in |
family |
a description of the error distribution and link function to be used in the model.
Currently, the |
offset |
This can be used to specify an a priori known component to be included in the
linear predictor during fitting. Default set to |
r_estimation |
a list of class "rControl" from function |
optim_options |
a structure of class "optimControl" created
from function |
adapt_RW_options |
a list of class "adaptControl" from function |
trace |
an integer specifying print output to include as function runs. Default value is 0. See Details for more information about output provided when trace = 0, 1, or 2. |
tuning_options |
a list of class "selectControl" or "lambdaControl" resulting from
|
progress |
a logical value indicating if additional output should be given showing the
progress of the fit procedure. If |
... |
additional arguments that could be passed into |
The glmm_FA
function can be used to fit a single generalized linear mixed model.
While this approach is meant to be used in the case where the user knows which
covariates belong in the fixed and random effects and no penalization is required, one is
allowed to specify non-zero fixed and random effects penalties using lambdaControl
and the (...) arguments. The (...) allow for specification of penalty-related arguments; see
glmmPen_FA
for details. For a high dimensional situation, the user may want to fit a
full model using a small penalty for the fixed and random effects and save the posterior
draws from this full model for use in any BIC-ICQ calculations during selection within glmmPen_FA
.
Specifying a file name in the 'BICq_posterior' argument will save the posterior draws from the
glmm_FA
model into a big.matrix with this file name, see the Details section of
glmmPen_FA
for additional details.
A reference class object of class pglmmObj
for which many methods are
available (e.g. methods(class = "pglmmObj")
)
glmmPen
is used to fit penalized generalized mixed models via Monte Carlo Expectation
Conditional Minimization (MCECM). The purpose of the function is to perform
variable selection on both the fixed and random effects simultaneously for the
generalized linear mixed model. glmmPen
selects the best model using
BIC-type selection criteria (see selectControl
documentation for
further details). To improve the speed of the algorithm, consider setting
var_restrictions
= "fixef" within the optimControl
options.
glmmPen( formula, data = NULL, family = "binomial", covar = NULL, offset = NULL, fixef_noPen = NULL, penalty = c("MCP", "SCAD", "lasso"), alpha = 1, gamma_penalty = switch(penalty[1], SCAD = 4, 3), optim_options = optimControl(), adapt_RW_options = adaptControl(), trace = 0, tuning_options = selectControl(), BICq_posterior = NULL, progress = TRUE, ... )
glmmPen( formula, data = NULL, family = "binomial", covar = NULL, offset = NULL, fixef_noPen = NULL, penalty = c("MCP", "SCAD", "lasso"), alpha = 1, gamma_penalty = switch(penalty[1], SCAD = 4, 3), optim_options = optimControl(), adapt_RW_options = adaptControl(), trace = 0, tuning_options = selectControl(), BICq_posterior = NULL, progress = TRUE, ... )
formula |
a two-sided linear formula object describing both the fixed effects and
random effects part of the model, with the response on the left of a ~ operator and the terms,
separated by + operators, on the right. Random-effects terms are distinguished by vertical bars
("|") separating expression for design matrices from the grouping factor. |
data |
an optional data frame containing the variables named in |
family |
a description of the error distribution and link function to be used in the model.
Currently, the |
covar |
character string specifying whether the covariance matrix should be unstructured
("unstructured") or diagonal with no covariances between variables ("independent").
Default is set to |
offset |
This can be used to specify an a priori known component to be included in the
linear predictor during fitting. Default set to |
fixef_noPen |
Optional vector of 0's and 1's of the same length as the number of fixed effects covariates used in the model. Value 0 indicates the variable should not have its fixed effect coefficient penalized, 1 indicates that it can be penalized. Order should correspond to the same order of the fixed effects given in the formula. |
penalty |
character describing the type of penalty to use in the variable selection procedure. Options include 'MCP', 'SCAD', and 'lasso'. Default is MCP penalty. If the random effect covariance matrix is "unstructured", then a group MCP, group SCAD, or group LASSO penalty is used on the random effects coefficients. See Breheny and Huang (2011) <doi:10.1214/10-AOAS388> and Breheny and Huang (2015) <doi:10.1007/s11222-013-9424-2> for details of these penalties. |
alpha |
Tuning parameter for the Mnet estimator which controls the relative contributions
from the MCP/SCAD/LASSO penalty and the ridge, or L2, penalty. |
gamma_penalty |
The scaling factor of the MCP and SCAD penalties. Not used by LASSO penalty. Default is 4.0 for SCAD and 3.0 for MCP. See Breheny and Huang (2011) <doi:10.1214/10-AOAS388> and Breheny and Huang (2015) <doi:10.1007/s11222-013-9424-2> for further details. |
optim_options |
a structure of class "optimControl" created
from function |
adapt_RW_options |
a list of class "adaptControl" from function |
trace |
an integer specifying print output to include as function runs. Default value is 0. See Details for more information about output provided when trace = 0, 1, or 2. |
tuning_options |
a list of class "selectControl" or "lambdaControl" resulting from
|
BICq_posterior |
an optional character string expressing the path and file
basename of a file combination that
will file-back or currently file-backs a |
progress |
a logical value indicating if additional output should be given showing the
progress of the fit procedure. If |
... |
additional arguments that could be passed into |
Argument BICq_posterior
details: If the BIC_option
in selectControl
(tuning_options
) is specified
to be 'BICq', this requests the calculation of the BIC-ICQ criterion during the selection
process. For the BIC-ICQ criterion to be calculated, a minimal penalty model assuming a small valued
lambda penalty needs to be fit, and the posterior samples from this minimal penalty model need to be used.
In order to avoid repetitive calculations of
this minimal penalty model (i.e. if the user wants to re-run glmmPen
with a different
set of penalty parameters), a big.matrix
of these
posterior samples will be file-backed as two files: a backing file with extension '.bin' and a
descriptor file with extension '.desc'. The BICq_posterior
argument should contain a
path and a filename of the form "./path/filename" such that the backingfile and
the descriptor file would then be saved as "./path/filename.bin" and "./path/filename.desc", respectively.
If BICq_posterior
is set to NULL
, then by default, the backingfile and descriptor
file are saved in the working directory as "BICq_Posterior_Draws.bin" and "BICq_Posterior_Draws.desc".
If the big matrix of posterior samples is already file-backed, BICq_posterior
should
specify the path and basename of the appropriate files (again of form "./path/filename");
the minimal penalty model
will not be fit again and the big.matrix of
posterior samples will be read using the attach.big.matrix
function of the
bigmemory
package and used in the BIC-ICQ
calculations. If the appropriate files do not exist or BICq_posterior
is specified as NULL
, the minimal penalty model will be fit and the minimal penalty model posterior
samples will be saved as specified above. The algorithm will save 10^4 posterior samples automatically.
Trace details: The value of 0 (default) does not output any extra information. The value of 1 additionally outputs the updated coefficients, updated covariance matrix values, and the number of coordinate descent iterations used for the M step for each EM iteration. When pre-screening procedure is used and/or if the BIC-ICQ criterion is requested, trace = 1 gives additional information about the penalties used for the 'minimal penalty model' fit procedure. If Stan is not used as the E-step sampling mechanism, the value of 2 outputs all of the above plus gibbs acceptance rate information for the adaptive random walk and independence samplers and the updated proposal standard deviation for the adaptive random walk.
A reference class object of class pglmmObj
for which many methods are
available (e.g. methods(class = "pglmmObj")
, see ?pglmmObj for additional documentation)
glmmPen_FA
is used to fit penalized generalized linear mixed models via
Monte Carlo Expectation Conditional Minimization (MCECM) using a factor model decomposition of
the random effects. The purpose of the function is to perform
variable selection on both the fixed and random effects simultaneously for the
generalized linear mixed model.
This function uses a factor model decomposition on the random effects. This assumption
reduces the latent space in the E-step (Expectation step) of the algorithm,
which reduces the computational complexity of the algorithm. This improves
the speed of the algorithm and enables the scaling of the algorithm to larger dimensions.
Besides the modeling of the random effects, this function is similar to glmmPen
.
glmmPen_FA
selects the best model using
BIC-type selection criteria (see selectControl
documentation for
further details).
Function is currently available for Binomial and Poisson families with canonical links.
To improve the speed of the algorithm, consider setting
var_restrictions
= "fixef" within the optimControl
options.
glmmPen_FA( formula, data = NULL, family = "binomial", offset = NULL, r_estimation = rControl(), fixef_noPen = NULL, penalty = c("MCP", "SCAD", "lasso"), alpha = 1, gamma_penalty = switch(penalty[1], SCAD = 4, 3), optim_options = optimControl(), adapt_RW_options = adaptControl(), trace = 0, tuning_options = selectControl(), BICq_posterior = NULL, progress = TRUE, ... )
glmmPen_FA( formula, data = NULL, family = "binomial", offset = NULL, r_estimation = rControl(), fixef_noPen = NULL, penalty = c("MCP", "SCAD", "lasso"), alpha = 1, gamma_penalty = switch(penalty[1], SCAD = 4, 3), optim_options = optimControl(), adapt_RW_options = adaptControl(), trace = 0, tuning_options = selectControl(), BICq_posterior = NULL, progress = TRUE, ... )
formula |
a two-sided linear formula object describing both the fixed effects and
random effects part of the model, with the response on the left of a ~ operator and the terms,
separated by + operators, on the right. Random-effects terms are distinguished by vertical bars
("|") separating expression for design matrices from the grouping factor. |
data |
an optional data frame containing the variables named in |
family |
a description of the error distribution and link function to be used in the model.
Currently, the |
offset |
This can be used to specify an a priori known component to be included in the
linear predictor during fitting. Default set to |
r_estimation |
a list of class "rControl" from function |
fixef_noPen |
Optional vector of 0's and 1's of the same length as the number of fixed effects covariates used in the model. Value 0 indicates the variable should not have its fixed effect coefficient penalized, 1 indicates that it can be penalized. Order should correspond to the same order of the fixed effects given in the formula. |
penalty |
character describing the type of penalty to use in the variable selection procedure. Options include 'MCP', 'SCAD', and 'lasso'. Default is MCP penalty. If the random effect covariance matrix is "unstructured", then a group MCP, group SCAD, or group LASSO penalty is used on the random effects coefficients. See Breheny and Huang (2011) <doi:10.1214/10-AOAS388> and Breheny and Huang (2015) <doi:10.1007/s11222-013-9424-2> for details of these penalties. |
alpha |
Tuning parameter for the Mnet estimator which controls the relative contributions
from the MCP/SCAD/LASSO penalty and the ridge, or L2, penalty. |
gamma_penalty |
The scaling factor of the MCP and SCAD penalties. Not used by LASSO penalty. Default is 4.0 for SCAD and 3.0 for MCP. See Breheny and Huang (2011) <doi:10.1214/10-AOAS388> and Breheny and Huang (2015) <doi:10.1007/s11222-013-9424-2> for further details. |
optim_options |
a structure of class "optimControl" created
from function |
adapt_RW_options |
a list of class "adaptControl" from function |
trace |
an integer specifying print output to include as function runs. Default value is 0. See Details for more information about output provided when trace = 0, 1, or 2. |
tuning_options |
a list of class "selectControl" or "lambdaControl" resulting from
|
BICq_posterior |
an optional character string expressing the path and file
basename of a file combination that
will file-back or currently file-backs a |
progress |
a logical value indicating if additional output should be given showing the
progress of the fit procedure. If |
... |
additional arguments that could be passed into |
Argument BICq_posterior
details: If the BIC_option
in selectControl
(tuning_options
) is specified
to be 'BICq', this requests the calculation of the BIC-ICQ criterion during the selection
process. For the BIC-ICQ criterion to be calculated, a full model assuming a small valued
lambda penalty needs to be fit, and the posterior draws from this full model need to be used.
In order to avoid repetitive calculations of
this full model (i.e. if the user wants to re-run glmmPen
with a different
set of penalty parameters), a big.matrix
of these
posterior draws will be file-backed as two files: a backing file with extention '.bin' and a
descriptor file with extension '.desc'. The BICq_posterior
argument should contain a
path and a filename with no extension of the form "./path/filename" such that the backingfile and
the descriptor file would then be saved as "./path/filename.bin" and "./path/filename.desc", respectively.
If BICq_posterior
is set to NULL
, then by default, the backingfile and descriptor
file are saved in the working directory as "BICq_Posterior_Draws.bin" and "BICq_Posterior_Draws.desc".
If the big matrix of posterior draws is already file-backed, BICq_posterior
should
specify the path and basename of the appropriate files (again of form "./path/filename");
the full model
will not be fit again and the big.matrix of
posterior draws will be read using the attach.big.matrix
function of the
bigmemory
package and used in the BIC-ICQ
calcuations. If the appropriate files do not exist or BICq_posterior
is specified as NULL
, the full model will be fit and the full model posterior
draws will be saved as specified above. The algorithm will save 10^4 posterior draws automatically.
Trace details: The value of 0 (default) does not output any extra information. The value of 1 additionally outputs the updated coefficients, updated covariance matrix values, and the number of coordinate descent iterations used for the M step for each EM iteration. When pre-screening procedure is used and/or if the BIC-ICQ criterion is requested, trace = 1 gives additional information about the penalties used for the 'full model' fit procedure. If Stan is not used as the E-step sampling mechanism, the value of 2 outputs all of the above plus gibbs acceptance rate information for the adaptive random walk and independence samplers and the updated proposal standard deviation for the adaptive random walk.
A reference class object of class pglmmObj
for which many methods are
available (e.g. methods(class = "pglmmObj")
, see ?pglmmObj for additional documentation)
Constructs control structures for penalized mixed model fitting.
lambdaControl(lambda0 = 0, lambda1 = 0) selectControl( lambda0_seq = NULL, lambda1_seq = NULL, nlambda = 10, search = c("abbrev", "full_grid"), BIC_option = c("BICq", "BICh", "BIC", "BICNgrp"), logLik_calc = switch(BIC_option[1], BICq = FALSE, TRUE), lambda.min = NULL, pre_screen = TRUE, lambda.min.presc = NULL )
lambdaControl(lambda0 = 0, lambda1 = 0) selectControl( lambda0_seq = NULL, lambda1_seq = NULL, nlambda = 10, search = c("abbrev", "full_grid"), BIC_option = c("BICq", "BICh", "BIC", "BICNgrp"), logLik_calc = switch(BIC_option[1], BICq = FALSE, TRUE), lambda.min = NULL, pre_screen = TRUE, lambda.min.presc = NULL )
lambda0 |
a non-negative numeric penalty parameter for the fixed effects coefficients |
lambda1 |
a non-negative numeric penalty parameter for the (grouped) random effects covariance coefficients |
lambda0_seq , lambda1_seq
|
a sequence of non-negative numeric penalty parameters for
the fixed
and random effect coefficients ( |
nlambda |
positive integer specifying number of penalty parameters
to use for the fixed and random effects penalty parameters. Default set to 10. Only used
if either |
search |
character string of "abbrev" (default) or "full_grid" indicating if the search of models over the penalty parameter space should be the full grid search (total number of models equals 'nlambda'^2 or length('lambda0_seq')*length('lambda1_seq')) or an abbreviated grid search. The abbreviated grid search is described in more detail in the Details section. Te authors highly recommend the abbreviated grid search. |
BIC_option |
character string specifying the selection criteria used to select the 'best' model. Default "BICq" option specifies the BIC-ICQ criterion (Ibrahim et al (2011) <doi:10.1111/j.1541-0420.2010.01463.x>), which requires a fit of a 'minimum penalty' model; a small penalty (the minimum of the penalty sequence) is used for the fixed and random effects. See "Details" section for what these small penalties will be. The "BICh" option utilizes the hybrid BIC value described in Delattre, Lavielle, and Poursat (2014) <doi:10.1214/14-EJS890>. The regular "BIC" option penalty term uses (total non-zero coefficients)*(length(y) = total number observations). The "BICNgrp" option penalty term uses (total non-zero coefficients)*(nlevels(group) = number groups). |
logLik_calc |
logical value specifying if the log likelihood (and log-likelihood based
calculations BIC, BICh, and BICNgrp) should be calculated for all of the models in the selection procedure.
If BIC-ICQ is used for selection, the log-likelihood is not needed for each model.
However, if users are interested
in comparing the best models from BIC-ICQ and other BIC-type selection criteria, setting
|
lambda.min |
numeric fraction between 0 and 1. The sequence of the lambda penalty parameters
ranges from the maximum lambda where all fixed and random effects are penalized to 0 and
a minimum lambda value, which equals a small fraction of the maximum lambda. The parameter
|
pre_screen |
logical value indicating whether pre-screening should be performed before
model selection (default |
lambda.min.presc |
numeric fraction between 0 and 1. During pre-screening and the
minimal penalty
model fit for the BIC-ICQ calculation, the small penalty used on the random effect is
the fraction |
If left as the default NULL
values,
the lambda0_seq
and lambda1_seq
numeric sequences are
automatically calculated. The sequence will be calculated in the same manner as
ncvreg
calculates the range: the max value (let's denote this as lambda_max
)
penalizes all fixed and random effects to 0, the min value is a
small portion of max (lambda.min
*lambda_max
), and the sequence is composed of
nlambda
values ranging from these min and max values spread evenly on the log scale.
Unlike ncvreg
, the order of penalty
values used in the algorithm must run from the min lambda to the max lambda (as opposed to
running from max lambda to min lambda). The length of the sequence is specified by nlambda
.
By default, these sequences are calculated using LambdaSeq
.
The lambda0
and lambda1
arguments used within the glmm
function
allow for a user to fit a model with a single
non-zero penalty parameter combination. However, this is generally not recommended.
Abbreviated grid search: The abbreviated grid search proceeds in two stages. In stage 1, the algorithm fits the following series of models: the fixed effects penalty parameter remains a fixed value evaluated at the minimum of the fixed effects penalty parameters, and all random effects penalty parameters are examined. The 'best' model from this first stage of models determines the optimum random effect penalty parameter. In stage 2, the algorithm fits the following series of models: the random effects penalty parameter remains fixed at the value of the optimum random effect penalty parameter (from stage 1) and all fixed effects penalty parameters are considered. The best overall model is the best model from stage 2. This reduces the number of models considered to length('lambda0_seq') + length('lambda1_seq'). The authors found that this abbreviated grid search worked well in simulations, and performed considerably faster than the full grid search that examined all possible fixed and random effect penalty parameter combinations.
The arguments nlambda
and lambda.min
are only used
if one or both of the lambda0_seq
and lambda1_seq
penalty sequences
(corresponding to the fixed and random effects penalty sequences, respectively)
remain unspecified by the user (one or both of these arguments left as NULL
),
indicating that the algorithm needs to calculate default penalty sequences.
The argument lambda.min.presc
is only used under the following condition:
lambda1_seq
remains unspecified by the user
(this argument set to NULL
so the random effects penalty parameter sequence
needs to be calculated) AND either the pre-screening procedure is selected by the argument
pre_screen
or the BIC-ICQ is selected as the model selection criteria,
i.e., BIC_option
= "BICq".
If lambda1_seq
is specified by the user, the minimum
value in that sequence will be used as the random effect penalty in the
pre-screening procedure and/or the minimal penalty model for the BIC-ICQ calculation.
BIC-ICQ calculation: This model selection criteria requires the fitting of a 'minimal penalty'
model, which fits a model with a small penalty on the fixed and random effects.
For the fixed effects penalty, the minimal penalty is: (a) 0 if the number of fixed
effects covariates is 4 or less or (b) the minimum fixed effect penalty from the fixed
effects penalty sequence (either from the default sequence or from the sequence
specified by the user). For the random effects penalty, the minimal penalty
is (a) 0 if the number of random
effects covariates is 4 or less; (b) the minimum random effect penalty
from the random effects penalty sequence specified by the user, or
(c) lambda.min.presc
multiplied to the lambda_max
maximum penalty
specified above when a default random effects penalty sequence is calculated.
Pre-screening: The minimum fixed effects penalty used in the pre-screening stage
will be the minimum penalty of the fixed effects penalty sequence, lambda0_seq
.
The minimum random effects penalty used in the pre-screening stage will be either
(a) the minimum random effects penalty in the sequence lambda1_seq
if this sequence specified by the user, or (b) lambda.min.pres
x lambda_max
,
where lambda_max
was described above.
The *Control functions return a list (inheriting from class "pglmmControl
")
containing parameter values that determine settings for variable selection.
Calculates the sequence of penalty parameters used in the model selection procedure.
This function calls functions from package ncvreg
.
LambdaSeq( X, y, family, offset = NULL, alpha = 1, lambda.min = NULL, nlambda = 10, penalty.factor = NULL )
LambdaSeq( X, y, family, offset = NULL, alpha = 1, lambda.min = NULL, nlambda = 10, penalty.factor = NULL )
X |
matrix of standardized fixed effects (see |
y |
numeric vector of response values. If "survival" family, |
family |
a description of the error distribution and link function to be used in the model.
Currently, the |
offset |
numeric vector that can be used to specify an a priori known component
to be included in the linear predictor during fitting.
This should be |
alpha |
Tuning parameter for the Mnet estimator which controls the relative contributions
from the MCP/SCAD/LASSO penalty and the ridge, or L2, penalty. |
lambda.min |
numeric fraction between 0 and 1. The sequence of the lambda penalty parameters
ranges from the maximum lambda where all fixed and random effects are penalized to 0 and
a minimum lambda value, which equals a small fraction of the maximum lambda. The parameter
|
nlambda |
positive integer specifying number of penalty parameters (lambda) with which to fit a model. |
penalty.factor |
an optional numeric vector equal to the |
numeric sequence of penalty parameters of length nlambda
ranging from the
minimum penalty parameter (first element) equal to fraction lambda.min
multiplied by the
maximum penalty parameter to the maximum penalty parameter (last element)
Constructs the control structure for the optimization of the penalized mixed model fit algorithm.
optimControl( var_restrictions = c("none", "fixef"), conv_EM = 0.0015, conv_CD = 5e-04, nMC_burnin = NULL, nMC_start = NULL, nMC_max = NULL, nMC_report = 5000, maxitEM = NULL, maxit_CD = 50, M = 10000, t = 2, mcc = 2, sampler = c("stan", "random_walk", "independence"), var_start = "recommend", step_size = 1, standardization = TRUE, convEM_type = c("AvgEuclid1", "maxdiff", "AvgEuclid2", "Qfun"), B_init_type = c("deterministic", "data", "random") )
optimControl( var_restrictions = c("none", "fixef"), conv_EM = 0.0015, conv_CD = 5e-04, nMC_burnin = NULL, nMC_start = NULL, nMC_max = NULL, nMC_report = 5000, maxitEM = NULL, maxit_CD = 50, M = 10000, t = 2, mcc = 2, sampler = c("stan", "random_walk", "independence"), var_start = "recommend", step_size = 1, standardization = TRUE, convEM_type = c("AvgEuclid1", "maxdiff", "AvgEuclid2", "Qfun"), B_init_type = c("deterministic", "data", "random") )
var_restrictions |
character string indicating how the random effect covariance matrix should be initialized at the beginning of the algorithm when penalties are applied to the coefficients. If "none" (default), all random effect predictors are initialized to have non-zero variances. If "fixef", the code first examines the initialized fixed effects (initialized using a regular penalized GLM), and only the random effect predictors that are initialized with non-zero fixed effects are initialized with non-zero variances. |
conv_EM |
a non-negative numeric convergence criteria for the convergence of the
EM algorithm. Default is 0.0015.
EM algorithm is considered to have converge if the average Euclidean
distance between the current coefficient estimates and the coefficient estimates from
|
conv_CD |
a non-negative numeric convergence criteria for the convergence of the grouped coordinate descent loop within the M step of the EM algorithm. Default 0.0005. |
nMC_burnin |
positive integer specifying the number of posterior samples to use as
burn-in for each E step in the EM algorithm. If set to |
nMC_start |
a positive integer for the initial number of Monte Carlo draws. If set to
|
nMC_max |
a positive integer for the maximum number of allowed Monte Carlo draws used
in each step of the EM algorithm. If set to |
nMC_report |
a positive integer for the number of posterior samples to save from the final
model. These posterior samples can be used for diagnostic purposes, see |
maxitEM |
a positive integer for the maximum number of allowed EM iterations.
If set to |
maxit_CD |
a positive integer for the maximum number of allowed iterations for the coordinate descent algorithms used within the M-step of each EM iteration. Default equals 50. |
M |
positive integer specifying the number of posterior samples to use within the Pajor log-likelihood calculation. Default is 10^4; minimum allowed value is 5000. |
t |
the convergence criteria is based on the average Euclidean distance between
the most recent coefficient estimates and the coefficient estimates from |
mcc |
the number of times the convergence criteria must be met before the algorithm is seen as having converged (mcc for 'meet condition counter'). Default set to 2. Value restricted to be no less than 2. |
sampler |
character string specifying whether the posterior samples of the random effects
should be drawn using Stan (default, from package rstan) or the Metropolis-within-Gibbs procedure
incorporating an adaptive random walk sampler ("random_walk") or an
independence sampler ("independence"). If using the random walk sampler, see |
var_start |
either the character string "recommend" or a positive number specifying the
starting values to initialize the variance of the covariance matrix. For |
step_size |
positive numeric value indicating the starting step size to use in the Majorization-Minimization scheme of the M-step. Only relevant when the distributional assumption used is not Binomial or Gaussian with canonical links (e.g. Poisson with log link) |
standardization |
logical value indicating whether covariates should
standardized ( |
convEM_type |
character string indicating the type of convergence criteria to
use within the EM algorithm to determine when a model has converged. The default is "AvgEuclid1",
which calculates the average Euclidean distance between the most recent coefficient vector and
the coefficient vector |
B_init_type |
character string indicating how the B matrix within the |
Several arguments are set to a default value of NULL
. If these arguments
are left as NULL
by the user, then these values will be filled in with appropriate
default values as specified above, which may depend on the number of random effects or
the family of the data. If the user
specifies particular values for these arguments, no additional modifications to these
arguments will be done within the algorithm.
Function returns a list inheriting from class optimControl
containing fit and optimization criteria values used in optimization routine.
pglmmObj
of Fitted Penalized Generalized Mixed-Effects Models for
package glmmPen
The functions glmm
, glmmPen
,
glmm_FA
, and glmmPen_FA
from the package glmmPen
output the reference class object of type pglmmObj
.
## S3 method for class 'pglmmObj' fixef(object, ...) ## S3 method for class 'pglmmObj' ranef(object, ...) ## S3 method for class 'pglmmObj' sigma(object, ...) ## S3 method for class 'pglmmObj' coef(object, ...) ## S3 method for class 'pglmmObj' family(object, ...) ## S3 method for class 'pglmmObj' nobs(object, ...) ## S3 method for class 'pglmmObj' ngrps(object, ...) ## S3 method for class 'pglmmObj' formula(x, fixed.only = FALSE, random.only = FALSE, ...) ## S3 method for class 'pglmmObj' model.frame(formula, fixed.only = FALSE, ...) ## S3 method for class 'pglmmObj' model.matrix(object, type = c("fixed", "random"), ...) ## S3 method for class 'pglmmObj' fitted(object, fixed.only = TRUE, ...) ## S3 method for class 'pglmmObj' predict( object, newdata = NULL, type = c("link", "response"), fixed.only = TRUE, ... ) ## S3 method for class 'pglmmObj' residuals(object, type = c("deviance", "pearson", "response", "working"), ...) ## S3 method for class 'pglmmObj' print(x, digits = c(fef = 4, ref = 4), ...) ## S3 method for class 'pglmmObj' summary( object, digits = c(fef = 4, ref = 4), resid_type = switch(object$family$family, gaussian = "pearson", "deviance"), ... ) ## S3 method for class 'pglmmObj' logLik(object, ...) ## S3 method for class 'pglmmObj' BIC(object, ...) ## S3 method for class 'pglmmObj' plot(x, fixed.only = FALSE, type = NULL, ...)
## S3 method for class 'pglmmObj' fixef(object, ...) ## S3 method for class 'pglmmObj' ranef(object, ...) ## S3 method for class 'pglmmObj' sigma(object, ...) ## S3 method for class 'pglmmObj' coef(object, ...) ## S3 method for class 'pglmmObj' family(object, ...) ## S3 method for class 'pglmmObj' nobs(object, ...) ## S3 method for class 'pglmmObj' ngrps(object, ...) ## S3 method for class 'pglmmObj' formula(x, fixed.only = FALSE, random.only = FALSE, ...) ## S3 method for class 'pglmmObj' model.frame(formula, fixed.only = FALSE, ...) ## S3 method for class 'pglmmObj' model.matrix(object, type = c("fixed", "random"), ...) ## S3 method for class 'pglmmObj' fitted(object, fixed.only = TRUE, ...) ## S3 method for class 'pglmmObj' predict( object, newdata = NULL, type = c("link", "response"), fixed.only = TRUE, ... ) ## S3 method for class 'pglmmObj' residuals(object, type = c("deviance", "pearson", "response", "working"), ...) ## S3 method for class 'pglmmObj' print(x, digits = c(fef = 4, ref = 4), ...) ## S3 method for class 'pglmmObj' summary( object, digits = c(fef = 4, ref = 4), resid_type = switch(object$family$family, gaussian = "pearson", "deviance"), ... ) ## S3 method for class 'pglmmObj' logLik(object, ...) ## S3 method for class 'pglmmObj' BIC(object, ...) ## S3 method for class 'pglmmObj' plot(x, fixed.only = FALSE, type = NULL, ...)
object |
pglmmObj object output from |
... |
potentially further arguments passed from other methods |
x |
an R object of class |
fixed.only |
logical value; default |
random.only |
logical value used in |
formula |
in the case of model.frame, a |
type |
See details of |
newdata |
optional new data.frame containing the same variables used in the model fit procedure |
digits |
number of significant digits for printing; default of 4 |
resid_type |
type of residuals to summarize in output. See |
The pglmmObj object returns the following items:
fixef |
vector of fixed effects coefficients |
ranef |
matrix of random effects coefficients for each explanatory variable for each level of the grouping factor |
sigma |
random effects covariance matrix |
scale |
if family is Gaussian, returns the residual error variance |
posterior_samples |
Samples from the posterior distribution of the random effects, taken at the end of the model fit (after convergence or after maximum iterations allowed). Can be used for diagnositics purposes. Note: These posterior samples are from a single chain. |
sampling |
character string for type of sampling used to calculate the posterior samples in the E-step of the algorithm |
results_all |
matrix of results from all model fits during variable selection (if selection
performed). Output for each model includes: penalty parameters for fixed (lambda0) and random
(lambda1) effects, BIC-derived quantities and the log-likelihood
(note: the arguments |
results_optim |
results from the 'best' model fit; see results_all for details. BICh, BIC, BICNgrp, and LogLik computed for this best model if not previously calculated. |
family |
Family |
penalty_info |
list of penalty information |
call |
arguments plugged into |
formula |
formula |
fixed_vars |
names of fixed effects variables |
data |
list of data used in model fit, including the response y, the fixed effects covariates matrix X, the random effects model matrix Z (which is composed of values from the standardized fixed effects model matrix), the grouping factor, offset, model frame, and standarization information used to standardize the fixed effects covariates |
optinfo |
Information about the optimization of the 'best' model |
control_info |
optimization parameters used for the model fit |
Estep_init |
materials that can be used to initialize another E-step, if desired |
Gibbs_info |
list of materials to perform diagnositics on the Metropolis-within-Gibbs sample chains, including the Gibbs acceptance rates (included for both the independence and adaptive random walk samplers) and the final proposal standard deviations (included for the adaptive random walk sampler only)) |
r_estimation |
list of output related to estimation of number of latent common
factors, r. Only relevant for the output of functions |
showClass("pglmmObj") methods(class = "pglmmObj")
fixef.pglmmObj
: Provides the fixed effects coefficients
ranef.pglmmObj
: Provides the random effects posterior modes for each explanatory variable
for each level of the grouping factor
sigma.pglmmObj
: Provides the random effect covariance matrix. If family is Gaussian,
also returns the standard deviation of the residual error.
coef.pglmmObj
: Computes the sum of the fixed effects
coefficients and the random effect posterior modes
for each explanatory variable for each level of each grouping factor.
family.pglmmObj
: Family of the fitted GLMM
nobs.pglmmObj
: Number of observations used in the model fit
ngrps.pglmmObj
: Number of levels in the grouping factor
formula.pglmmObj
: Formula used for the model fit. Can return the full
formula, or just the formula elements relating to the fixed effects
(fixed.only = TRUE) or random effects (random.only = TRUE)
model.frame.pglmmObj
: Returns the model frame
model.matrix.pglmmObj
: Returns the model matrix of either the fixed (type = "fixed") or
random effects (type = "random")
fitted.pglmmObj
: Fitted values, i.e., the linear predictor of the model.
predict.pglmmObj
: Predictions for the model corresponding to
the pglmmObj output object from the glmmPen package functions. The function
predict
can predict either the linear predictor of the model or the
expected mean of the response, as specified by the type
argument.
Argument type
: character string for type of predictors: "link" (default),
which generates the linear predictor,
and "response", which generates the expected mean values of the response.
residuals.pglmmObj
: Residuals for the pglmmObj output object from the glmmPen package functions.
Argument type
: character string for type of residuals to report. Options include "deviance" (default),
"pearson", "response", and "working", which specify the deviance residuals, Pearson residuals,
the difference between the actual response y and the expected mean response (y - mu), and the
working residuals (y - mu) / mu
print.pglmmObj
: Prints a selection of summary information of fitted model
summary.pglmmObj
: Returns a list of summary statistics of the fitted model.
logLik.pglmmObj
: Returns the log-likelihood using the Corrected Arithmetic Mean estimator
with importance sampling weights developed by Pajor (2017). Degrees of freedom
give the sum of the non-zero fixed and random effects coefficients.
Citation: Pajor, A. (2017). Estimating the marginal likelihood using the arithmetic mean identity.
Bayesian Analysis, 12(1), 261-287.
BIC.pglmmObj
: Returns BIC, BICh (hybrid BIC developed by Delattre et al., citation:
Delattre, M., Lavielle, M., & Poursat, M. A. (2014). A note on BIC in mixed-effects models.
Electronic journal of statistics, 8(1), 456-475.), BICNgrps (BIC using N = number of groups
in the penalty term), and possibly BIC-ICQ (labeled as "BICq") if the argument BIC_option
was set to "BICq" in selectControl
(citation for BIC-ICQ:
Ibrahim, J. G., Zhu, H., Garcia, R. I., & Guo, R. (2011).
Fixed and random effects selection in mixed effects models. Biometrics, 67(2), 495-503.)
plot.pglmmObj
: Plot residuals for the pglmmObj output object from the glmmPen package.
Argument type
: character string for type of residuals to report. Options include "deviance"
(default for non-Gaussian family), "pearson" (default for Gaussian family),
"response", and "working", which specify the deviance residuals, Pearson residuals,
the difference between the actual response y and the expected mean response (y - mu), and the
working residuals (y - mu) / mu
phmm
is used to approximate a single proportional hazards mixed model
using a piecewise constant hazard mixed model approximation
via Monte Carlo Expectation Conditional Minimization (MCECM). No model selection is performed.
phmm( formula, data = NULL, covar = NULL, offset = NULL, optim_options = optimControl(), adapt_RW_options = adaptControl(), trace = 0, tuning_options = lambdaControl(), survival_options = survivalControl(), progress = TRUE, ... )
phmm( formula, data = NULL, covar = NULL, offset = NULL, optim_options = optimControl(), adapt_RW_options = adaptControl(), trace = 0, tuning_options = lambdaControl(), survival_options = survivalControl(), progress = TRUE, ... )
formula |
a two-sided linear formula object describing both the fixed effects and
random effects part of the model, with the response on the left of a ~ operator and the terms,
separated by + operators, on the right.
The response must be a |
data |
an optional data frame containing the variables named in |
covar |
character string specifying whether the covariance matrix should be unstructured
("unstructured") or diagonal with no covariances between variables ("independent").
Default is set to |
offset |
This can be used to specify an a priori known component to be included in the
linear predictor during fitting. Default set to |
optim_options |
a structure of class "optimControl" created
from function |
adapt_RW_options |
a list of class "adaptControl" from function |
trace |
an integer specifying print output to include as function runs. Default value is 0. See Details for more information about output provided when trace = 0, 1, or 2. |
tuning_options |
a list of class "selectControl" or "lambdaControl" resulting from
|
survival_options |
a structure of class "survivalControl" created
from function |
progress |
a logical value indicating if additional output should be given showing the
progress of the fit procedure. If |
... |
additional arguments that could be passed into |
The phmm
function can be used to approximate a single proportional hazards
mixed model using a piecewise constant hazard mixed model.
While this approach is meant to be used in the case where the user knows which
covariates belong in the fixed and random effects and no penalization is required, one is
allowed to specify non-zero fixed and random effects penalties using lambdaControl
and the (...) arguments. The (...) allow for specification of penalty-related arguments; see
glmmPen
and glmmPen_FA
for details.
For a high dimensional situation, the user may want to fit a
full model using a small penalty for the fixed and random effects and save the posterior
draws from this full model for use in any BIC-ICQ calculations during selection within phmmPen
.
Specifying a file name in the 'BICq_posterior' argument will save the posterior draws from the
phmm
model into a big.matrix with this file name, see the Details section of
phmmPen
for additional details.
A reference class object of class pglmmObj
for which many methods are
available (e.g. methods(class = "pglmmObj")
)
phmm_FA
is used to fit a single piecewise constant hazard mixed model as an
approximation to a proportional hazards mixed model via Monte Carlo
Expectation Conditional Minimization (MCECM). This piecewise constant hazard mixed model
uses a factor model decomposition of
the random effects. No model selection is performed.
phmm_FA( formula, data = NULL, offset = NULL, r_estimation = rControl(), optim_options = optimControl(), adapt_RW_options = adaptControl(), trace = 0, tuning_options = lambdaControl(), survival_options = survivalControl(), progress = TRUE, ... )
phmm_FA( formula, data = NULL, offset = NULL, r_estimation = rControl(), optim_options = optimControl(), adapt_RW_options = adaptControl(), trace = 0, tuning_options = lambdaControl(), survival_options = survivalControl(), progress = TRUE, ... )
formula |
a two-sided linear formula object describing both the fixed effects and
random effects part of the model, with the response on the left of a ~ operator and the terms,
separated by + operators, on the right.
The response must be a |
data |
an optional data frame containing the variables named in |
offset |
This can be used to specify an a priori known component to be included in the
linear predictor during fitting. Default set to |
r_estimation |
a list of class "rControl" from function |
optim_options |
a structure of class "optimControl" created
from function |
adapt_RW_options |
a list of class "adaptControl" from function |
trace |
an integer specifying print output to include as function runs. Default value is 0. See Details for more information about output provided when trace = 0, 1, or 2. |
tuning_options |
a list of class "selectControl" or "lambdaControl" resulting from
|
survival_options |
a structure of class "survivalControl" created
from function |
progress |
a logical value indicating if additional output should be given showing the
progress of the fit procedure. If |
... |
additional arguments that could be passed into |
The phmm_FA
function can be used to approximate a single proportional hazards
mixed model using a piecewise constant hazard mixed model.
While this approach is meant to be used in the case where the user knows which
covariates belong in the fixed and random effects and no penalization is required, one is
allowed to specify non-zero fixed and random effects penalties using lambdaControl
and the (...) arguments. The (...) allow for specification of penalty-related arguments; see
phmmPen_FA
for details. For a high dimensional situation, the user may want to fit a
full model using a small penalty for the fixed and random effects and save the posterior
draws from this full model for use in any BIC-ICQ calculations during selection within phmmPen_FA
.
Specifying a file name in the 'BICq_posterior' argument will save the posterior draws from the
phmm_FA
model into a big.matrix with this file name, see the Details section of
phmmPen_FA
for additional details.
A reference class object of class pglmmObj
for which many methods are
available (e.g. methods(class = "pglmmObj")
)
phmmPen_FA
is used to fit penalized proportional hazards mixed models
using a piecewise constant hazard mixed model approximation via
Monte Carlo Expectation Conditional Minimization (MCECM).
The purpose of the function is to perform
variable selection on both the fixed and random effects simultaneously for the
piecewise constant hazard mixed model.
phmmPen
selects the best model using
BIC-type selection criteria (see selectControl
documentation for
further details).
To improve the speed of the algorithm, consider setting
var_restrictions
= "fixef" within the optimControl
options.
phmmPen( formula, data = NULL, covar = NULL, offset = NULL, fixef_noPen = NULL, penalty = c("MCP", "SCAD", "lasso"), alpha = 1, gamma_penalty = switch(penalty[1], SCAD = 4, 3), optim_options = optimControl(), adapt_RW_options = adaptControl(), trace = 0, tuning_options = selectControl(), survival_options = survivalControl(), BICq_posterior = NULL, progress = TRUE )
phmmPen( formula, data = NULL, covar = NULL, offset = NULL, fixef_noPen = NULL, penalty = c("MCP", "SCAD", "lasso"), alpha = 1, gamma_penalty = switch(penalty[1], SCAD = 4, 3), optim_options = optimControl(), adapt_RW_options = adaptControl(), trace = 0, tuning_options = selectControl(), survival_options = survivalControl(), BICq_posterior = NULL, progress = TRUE )
formula |
a two-sided linear formula object describing both the fixed effects and
random effects part of the model, with the response on the left of a ~ operator and the terms,
separated by + operators, on the right.
The response must be a |
data |
an optional data frame containing the variables named in |
covar |
character string specifying whether the covariance matrix should be unstructured
("unstructured") or diagonal with no covariances between variables ("independent").
Default is set to |
offset |
This can be used to specify an a priori known component to be included in the
linear predictor during fitting. Default set to |
fixef_noPen |
Optional vector of 0's and 1's of the same length as the number of fixed effects covariates used in the model. Value 0 indicates the variable should not have its fixed effect coefficient penalized, 1 indicates that it can be penalized. Order should correspond to the same order of the fixed effects given in the formula. |
penalty |
character describing the type of penalty to use in the variable selection procedure. Options include 'MCP', 'SCAD', and 'lasso'. Default is MCP penalty. If the random effect covariance matrix is "unstructured", then a group MCP, group SCAD, or group LASSO penalty is used on the random effects coefficients. See Breheny and Huang (2011) <doi:10.1214/10-AOAS388> and Breheny and Huang (2015) <doi:10.1007/s11222-013-9424-2> for details of these penalties. |
alpha |
Tuning parameter for the Mnet estimator which controls the relative contributions
from the MCP/SCAD/LASSO penalty and the ridge, or L2, penalty. |
gamma_penalty |
The scaling factor of the MCP and SCAD penalties. Not used by LASSO penalty. Default is 4.0 for SCAD and 3.0 for MCP. See Breheny and Huang (2011) <doi:10.1214/10-AOAS388> and Breheny and Huang (2015) <doi:10.1007/s11222-013-9424-2> for further details. |
optim_options |
a structure of class "optimControl" created
from function |
adapt_RW_options |
a list of class "adaptControl" from function |
trace |
an integer specifying print output to include as function runs. Default value is 0. See Details for more information about output provided when trace = 0, 1, or 2. |
tuning_options |
a list of class "selectControl" or "lambdaControl" resulting from
|
survival_options |
a structure of class "survivalControl" created
from function |
BICq_posterior |
an optional character string expressing the path and file
basename of a file combination that
will file-back or currently file-backs a |
progress |
a logical value indicating if additional output should be given showing the
progress of the fit procedure. If |
Argument BICq_posterior
details: If the BIC_option
in selectControl
(tuning_options
) is specified
to be 'BICq', this requests the calculation of the BIC-ICQ criterion during the selection
process. For the BIC-ICQ criterion to be calculated, a full model assuming a small valued
lambda penalty needs to be fit, and the posterior draws from this full model need to be used.
In order to avoid repetitive calculations of
this full model (i.e. if the user wants to re-run phmmPen
with a different
set of penalty parameters), a big.matrix
of these
posterior draws will be file-backed as two files: a backing file with extention '.bin' and a
descriptor file with extension '.desc'. The BICq_posterior
argument should contain a
path and a filename with no extension of the form "./path/filename" such that the backingfile and
the descriptor file would then be saved as "./path/filename.bin" and "./path/filename.desc", respectively.
If BICq_posterior
is set to NULL
, then by default, the backingfile and descriptor
file are saved in the working directory as "BICq_Posterior_Draws.bin" and "BICq_Posterior_Draws.desc".
If the big matrix of posterior draws is already file-backed, BICq_posterior
should
specify the path and basename of the appropriate files (again of form "./path/filename");
the full model
will not be fit again and the big.matrix of
posterior draws will be read using the attach.big.matrix
function of the
bigmemory
package and used in the BIC-ICQ
calcuations. If the appropriate files do not exist or BICq_posterior
is specified as NULL
, the full model will be fit and the full model posterior
draws will be saved as specified above. The algorithm will save 10^4 posterior draws automatically.
Trace details: The value of 0 (default) does not output any extra information. The value of 1 additionally outputs the updated coefficients, updated covariance matrix values, and the number of coordinate descent iterations used for the M step for each EM iteration. When pre-screening procedure is used and/or if the BIC-ICQ criterion is requested, trace = 1 gives additional information about the penalties used for the 'full model' fit procedure. If Stan is not used as the E-step sampling mechanism, the value of 2 outputs all of the above plus gibbs acceptance rate information for the adaptive random walk and independence samplers and the updated proposal standard deviation for the adaptive random walk.
A reference class object of class pglmmObj
for which many methods are
available (e.g. methods(class = "pglmmObj")
, see ?pglmmObj for additional documentation)
phmmPen_FA
is used to approximate penalized proportional hazards mixed models using
using a piecewise constant hazard mixed survival model approximation via
Monte Carlo Expectation Conditional Minimization (MCECM) and a factor model decomposition of
the random effects.
The purpose of the function is to perform
variable selection on both the fixed and random effects simultaneously for the
piecewise constant hazard mixed model.
This function uses a factor model decomposition on the random effects. This assumption
reduces the latent space in the E-step (Expectation step) of the algorithm,
which reduces the computational complexity of the algorithm. This improves
the speed of the algorithm and enables the
scaling of the algorithm to larger dimensions.
phmmPen_FA
selects the best model using
BIC-type selection criteria (see selectControl
documentation for
further details).
To improve the speed of the algorithm, consider setting
var_restrictions
= "fixef" within the optimControl
options.
phmmPen_FA( formula, data = NULL, offset = NULL, r_estimation = rControl(), fixef_noPen = NULL, penalty = c("MCP", "SCAD", "lasso"), alpha = 1, gamma_penalty = switch(penalty[1], SCAD = 4, 3), optim_options = optimControl(), adapt_RW_options = adaptControl(), trace = 0, tuning_options = selectControl(), survival_options = survivalControl(), BICq_posterior = NULL, progress = TRUE )
phmmPen_FA( formula, data = NULL, offset = NULL, r_estimation = rControl(), fixef_noPen = NULL, penalty = c("MCP", "SCAD", "lasso"), alpha = 1, gamma_penalty = switch(penalty[1], SCAD = 4, 3), optim_options = optimControl(), adapt_RW_options = adaptControl(), trace = 0, tuning_options = selectControl(), survival_options = survivalControl(), BICq_posterior = NULL, progress = TRUE )
formula |
a two-sided linear formula object describing both the fixed effects and
random effects part of the model, with the response on the left of a ~ operator and the terms,
separated by + operators, on the right.
The response must be a |
data |
an optional data frame containing the variables named in |
offset |
This can be used to specify an a priori known component to be included in the
linear predictor during fitting. Default set to |
r_estimation |
a list of class "rControl" from function |
fixef_noPen |
Optional vector of 0's and 1's of the same length as the number of fixed effects covariates used in the model. Value 0 indicates the variable should not have its fixed effect coefficient penalized, 1 indicates that it can be penalized. Order should correspond to the same order of the fixed effects given in the formula. |
penalty |
character describing the type of penalty to use in the variable selection procedure. Options include 'MCP', 'SCAD', and 'lasso'. Default is MCP penalty. If the random effect covariance matrix is "unstructured", then a group MCP, group SCAD, or group LASSO penalty is used on the random effects coefficients. See Breheny and Huang (2011) <doi:10.1214/10-AOAS388> and Breheny and Huang (2015) <doi:10.1007/s11222-013-9424-2> for details of these penalties. |
alpha |
Tuning parameter for the Mnet estimator which controls the relative contributions
from the MCP/SCAD/LASSO penalty and the ridge, or L2, penalty. |
gamma_penalty |
The scaling factor of the MCP and SCAD penalties. Not used by LASSO penalty. Default is 4.0 for SCAD and 3.0 for MCP. See Breheny and Huang (2011) <doi:10.1214/10-AOAS388> and Breheny and Huang (2015) <doi:10.1007/s11222-013-9424-2> for further details. |
optim_options |
a structure of class "optimControl" created
from function |
adapt_RW_options |
a list of class "adaptControl" from function |
trace |
an integer specifying print output to include as function runs. Default value is 0. See Details for more information about output provided when trace = 0, 1, or 2. |
tuning_options |
a list of class "selectControl" or "lambdaControl" resulting from
|
survival_options |
a structure of class "survivalControl" created
from function |
BICq_posterior |
an optional character string expressing the path and file
basename of a file combination that
will file-back or currently file-backs a |
progress |
a logical value indicating if additional output should be given showing the
progress of the fit procedure. If |
Argument BICq_posterior
details: If the BIC_option
in selectControl
(tuning_options
) is specified
to be 'BICq', this requests the calculation of the BIC-ICQ criterion during the selection
process. For the BIC-ICQ criterion to be calculated, a full model assuming a small valued
lambda penalty needs to be fit, and the posterior draws from this full model need to be used.
In order to avoid repetitive calculations of
this full model (i.e. if the user wants to re-run phmmPen
with a different
set of penalty parameters), a big.matrix
of these
posterior draws will be file-backed as two files: a backing file with extention '.bin' and a
descriptor file with extension '.desc'. The BICq_posterior
argument should contain a
path and a filename with no extension of the form "./path/filename" such that the backingfile and
the descriptor file would then be saved as "./path/filename.bin" and "./path/filename.desc", respectively.
If BICq_posterior
is set to NULL
, then by default, the backingfile and descriptor
file are saved in the working directory as "BICq_Posterior_Draws.bin" and "BICq_Posterior_Draws.desc".
If the big matrix of posterior draws is already file-backed, BICq_posterior
should
specify the path and basename of the appropriate files (again of form "./path/filename");
the full model
will not be fit again and the big.matrix of
posterior draws will be read using the attach.big.matrix
function of the
bigmemory
package and used in the BIC-ICQ
calcuations. If the appropriate files do not exist or BICq_posterior
is specified as NULL
, the full model will be fit and the full model posterior
draws will be saved as specified above. The algorithm will save 10^4 posterior draws automatically.
Trace details: The value of 0 (default) does not output any extra information. The value of 1 additionally outputs the updated coefficients, updated covariance matrix values, and the number of coordinate descent iterations used for the M step for each EM iteration. When pre-screening procedure is used and/or if the BIC-ICQ criterion is requested, trace = 1 gives additional information about the penalties used for the 'full model' fit procedure. If Stan is not used as the E-step sampling mechanism, the value of 2 outputs all of the above plus gibbs acceptance rate information for the adaptive random walk and independence samplers and the updated proposal standard deviation for the adaptive random walk.
A reference class object of class pglmmObj
for which many methods are
available (e.g. methods(class = "pglmmObj")
, see ?pglmmObj for additional documentation)
Provides graphical diagnostics of the random effect posterior draws from the (best) model. Availabile diagnostics include the sample path, histograms, cummulative sums, and autocorrelation.
plot_mcmc( object, plots = "sample.path", grps = "all", vars = "all", numeric_grp_order = FALSE, bin_width = NULL )
plot_mcmc( object, plots = "sample.path", grps = "all", vars = "all", numeric_grp_order = FALSE, bin_width = NULL )
object |
an object of class |
plots |
a character string or a vector of character strings specifying which graphical diagnostics to provide. Options include a sample path plot (default, "sample.path"), autocorrelation plots ("autocorr"), histograms ("histogram"), cumulative sum plots ("cumsum"), and all four possible plot options ("all"). While the "all" option will produce all four possible plots, subsets of the types of plots (e.g. sample path plots and autocorrelation plots only) can be specified with a vector of the relevant character strings (e.g. c("sample.path","autocorr")) |
grps |
a character string or a vector of character strings specifying which groups should have diagnostics provided. The names of the groups match the input group factor levels. Default is set to 'all' for all groups. |
vars |
a character string or a vector of character strings specifying which variables
should have diagnostics provided. Default is set to
'all', which picks all variables with non-zero random effects.
Tip: can find the names of the random effect variables in
the output sigma matrix found in the |
numeric_grp_order |
if TRUE, specifies that the groups factor should be converted to numeric values. This option could be used to ensure that the organization of the groups is in the proper numeric order (e.g. groups with levels 1-10 are ordered 1-10, not 1, 10, 2-9). |
bin_width |
optional binwidth argument for |
a list of ggplot graphics, each faceted by group and random effect variable.
Type of plots specified in the plots
argument.
glmmPen_FA
and
glmm_FA
estimation procedures.Control of Latent Factor Model Number Estimation
Constructs the control structure for the estimation of the
number of latent factors (r) for use within the glmmPen_FA
and
glmm_FA
estimation procedures.
rControl( r = NULL, r_max = NULL, r_est_method = "GR", size = 25, sample = FALSE )
rControl( r = NULL, r_max = NULL, r_est_method = "GR", size = 25, sample = FALSE )
r |
positive integer specifying number of latent common factors to assume
in the model. If |
r_max |
positive integer specifying maximum number of latent factors to consider.
If |
r_est_method |
character string indicating method used to estimate number
of latent factors |
size |
positive integer specifying the total number of pseudo random
effect estimates to use in the estimation procedure for the number of latent factors
r, which is restricted to be no less than 25. If this |
sample |
logical value indicating if the total number of pseudo random effect
estimates to use in the estimation procedure for the number of latent common factors r
should be larger than the number of unique groups in the data, where the number
of pseudo estimates are increased to the value of |
Estimation of r
procedure: For each level of the group variable separately,
we identify the observations within that group and
fit a regular penalized generalized linear model where the penalty value is the
minimum fixed effect penalty. These group-specific estimates, which we label as 'pseudo random effects',
are placed into a matrix G
(rows = number of levels of the grouping variable, columns = number of random effect covariates),
and this pseudo random effects matrix is treated as the observed outcome matrix used in
the "GR", "ER", and "BN" estimation procedures described above in the description of r_est_method
.
glmmPen
packageSimulates data to use for testing the glmmPen
package.
sim.data
simulates data for glmmPen
,
sim.data.FA
simulates data for glmmPen_FA
,
sim.data.piecewise.exp
simulates survival data for phmmPen
and phmmPen_FA
,
and sim.data.weibull
simulates alternative survival data for phmmPen
and phmmPen_FA
.
Possible parameters to specify includes number of total covariates,
number of non-zero fixed and random effects, and the magnitude
of the random effect covariance values.
sim.data( n, ptot, pnonzero, nstudies, sd_raneff = 1, family = "binomial", corr = NULL, seed, imbalance = 0, beta = NULL, pnonzerovar = 0, sd_x = 1 ) sim.data.FA( n, ptot, pnonzero, nstudies, sd_raneff = 0, family = "binomial", B = NULL, r = 2, corr = NULL, seed, imbalance = 0, beta = NULL, pnonzerovar = 0, sd_x = 1 ) sim.data.piecewise.exp( n, ptot, pnonzero, nstudies, sd_raneff = 0, B = NULL, r = 2, cut_points = c(0, 0.5, 1, 1.5, 2), lhaz_vals = c(-1.5, 1, 2.7, 3.7, 6.8), cens_type = c("unif", "exp"), cens_max = 5, exp_rate = 0.15, seed, imbalance = 0, beta = NULL, pnonzerovar = 0, sd_x = 1 ) sim.data.weibull( n, ptot, pnonzero, nstudies, sd_raneff = 0, B = NULL, r = 2, lhaz_base = 1, alpha_PH = 5, exp_rate = 0.15, cens_type = c("unif", "exp"), cens_max = 5, seed, imbalance = 0, beta = NULL, pnonzerovar = 0, sd_x = 1 )
sim.data( n, ptot, pnonzero, nstudies, sd_raneff = 1, family = "binomial", corr = NULL, seed, imbalance = 0, beta = NULL, pnonzerovar = 0, sd_x = 1 ) sim.data.FA( n, ptot, pnonzero, nstudies, sd_raneff = 0, family = "binomial", B = NULL, r = 2, corr = NULL, seed, imbalance = 0, beta = NULL, pnonzerovar = 0, sd_x = 1 ) sim.data.piecewise.exp( n, ptot, pnonzero, nstudies, sd_raneff = 0, B = NULL, r = 2, cut_points = c(0, 0.5, 1, 1.5, 2), lhaz_vals = c(-1.5, 1, 2.7, 3.7, 6.8), cens_type = c("unif", "exp"), cens_max = 5, exp_rate = 0.15, seed, imbalance = 0, beta = NULL, pnonzerovar = 0, sd_x = 1 ) sim.data.weibull( n, ptot, pnonzero, nstudies, sd_raneff = 0, B = NULL, r = 2, lhaz_base = 1, alpha_PH = 5, exp_rate = 0.15, cens_type = c("unif", "exp"), cens_max = 5, seed, imbalance = 0, beta = NULL, pnonzerovar = 0, sd_x = 1 )
n |
integer specifying total number of samples to generate |
ptot |
integer specifying total number of covariates to generate (values randomly generated from the standard normal distribution) |
pnonzero |
integer specifying how may of the covariates should have non-zero fixed and random effects |
nstudies |
number of studies/groups to have in the data |
sd_raneff |
non-negative value specifying the standard deviation of the random effects covariance matrix (applied to the non-zero random effects) |
family |
character string specifying which family to generate data from. Family options include "binomial" (default), "poisson", and "gaussian". |
corr |
optional value to specify correlation between covariates
in the model matrix. Default |
seed |
integer to use for the setting of a random seed |
imbalance |
integer of 0 or 1 indicating whether the observations should be equally distributed among the groups (0) or unequally distributed (1). |
beta |
numeric vector of the fixed effects (including intercept) |
pnonzerovar |
non-negative integer specifying the number of covariates with a zero-valued fixed effect but a non-zero random effect. |
sd_x |
non-negative value specifying the standard deviation of the
simulated covariates (drawn from a normal distribution with mean 0,
standard deviation |
B |
matrix specifying the factor loadings matrix for the random effects,
only used within |
r |
positive integer specifying number of latent common factors that describe the random effects,
only used within |
cut_points |
vector of cut points to use for the time intervals when simulating piecewise
exponential data. Length of cut points must equal length of |
lhaz_vals |
vector of the log baseline hazard values for each time interval (which
correspond to the time intervals defined by the |
cens_type |
character specifying type of censoring to implement. Default "unif" specifies
uniform censoring from 0 to |
cens_max |
numeric value used to characterize the range of the uniform censoring procedure
(from 0 to |
exp_rate |
numeric value used to characterize the exponential censoring rate (where rate
is defined as the rate used in |
lhaz_base |
numeric value that gives the log of the scale parameter for the Weibull distribution
(for description of Weibull scale parameter without log transformation,
see "lambdas" argument in |
alpha_PH |
numeric value > 0 that gives the shape parameter for the Weibull distribution
(for description of Weibull shape paraemeter,
see "gammas" argument in |
list containing the following elements:
y |
vector of the response |
X |
model matrix for the fixed effects |
Z |
model matrix for the random effects, organized first by variable and then by group |
pnonzero |
number of non-zero fixed effects |
z1 |
values of the random effects for each variable for each level of the grouping factor |
group |
grouping factor |
X0 |
model matrix for just the non-zero fixed effects |
Converts the input survival data with one row or element corresponding to a
single observation or subject into a long-form dataset where one observation or subject
contributes j
rows, where j
is the number of time intervals that
a subject survived at least part-way through.
survival_data(y, X, Z, group, offset_fit = NULL, survival_options)
survival_data(y, X, Z, group, offset_fit = NULL, survival_options)
y |
response, which must be a |
X |
matrix of fixed effects covariates |
Z |
matrix of random effects covariates |
group |
vector specifying the group assignment for each subject |
offset_fit |
vector specifying the offset.
This can be used to specify an a priori known component to be included in the
linear predictor during fitting. Default set to |
survival_options |
a structure of class "survivalControl" created
from function |
Constructs the control structure for additional parameters needed to properly fit survival data using a piecewise constant hazard mixed model
survivalControl( cut_num = 8, interval_type = c("equal", "manual", "group"), cut_points = NULL, time_scale = 1 )
survivalControl( cut_num = 8, interval_type = c("equal", "manual", "group"), cut_points = NULL, time_scale = 1 )
cut_num |
positive integer specifying the number of time intervals to include in the piecewise constant hazard model assumptions for the sampling step. Default is 8. General recommendation: use between 5 and 10 intervals. See the Details section for additional information. |
interval_type |
character specifying how the time intervals are calculated.
Options include 'equal' (default), 'manual', or 'group'.
If 'equal' (default), time intervals
are calculated such that there are approximately equal numbers of events per time
interval.
If 'manual', the user needs to input
their own cut points (see |
cut_points |
numeric vector specifying the value of the cut points to use
in the calculation of the time intervals for the piecewise constant hazard model.
If |
time_scale |
positive numeric value (greater than 1) used to scale the time variable in the survival data. In order to calculate the piecewise constant hazard mixed model, the log of the time a subject survived within a particular time interval is used as an offset in the model fit. Sometimes multiplying the time scale by a factor greater than 1 improves the stability of the model fit algorithm. |
In the piecewise constant hazard model, there is an assumption that the
time line of the data can be cut into cut_num
time intervals and the baseline hazard is constant within
each of these time intervals. In the fit algorithm, we estimate
these baseline hazard values (specifically, we estimate the log of the baseline
hazard values). By default, we determine cut points by specifying the total number of cuts
to make (cut_num
) and then specifying time values for cut points such
that each time interval has an approximately equal number
of events (interval_type = equal
).
The authors of this package have found simulations to work well using this default interval_type = equal
,
but if desired, users can further specify that each group has at least some (4) events observed
within each time interval.
Regardless of the interval_type
choice, users should be aware that
having too many cut points could result in having too few
events for each time interval needed for a stable estimation of the baseline hazard estimates.
Additionally, data with few events could result in too few events per time interval
even for a small number of cut points.
Alternatively, having too few cut points could result in a sub-par approximation of the baseline hazard,
which can lead to biased estimation for the coefficients corresponding to the input variables of interest.
We generally recommend having
8 total time intervals (more broadly, between 5 and 10 total time intervals). Warnings or errors
will occur for cases when there are 1 or 0 events for a time interval.
If this happens, either adjust the cut_num
value appropriately,
or in the case when the data simply has a very small number of events,
consider not using this software for your estimation purposes.
Function returns a list inheriting from class survivalControl
containing fit and optimization criteria values used in optimization routine.