
Running vertex-wise linear mixed models
Serena Defina
2026-02-17
Source:vignettes/articles/03-run-vw-lmm.Rmd
03-run-vw-lmm.RmdWarning
verywiselets you run vertex-wise analyses however you want to. For example, you can add interactions and splines / polynomials to your models, as many random intercept and slopes as your heart desires…, you can use liberal thresholds for multiple test correction and test as many hypothesis as you have time for. Remember: this software is just a tool, and it is up to you to use it sensibly.
With your set-up and data ready, it is finally time to use some
verywise functionality.
First, load the package in our R sesssion, so we call its functions directly.
Wait, what is a linear mixed model again?
A linear mixed model (LMM) is a multilevel regression technique, that is especially useful when observations are clustered (e.g. across neuroimaging sites) or repeated (across time).
In a nutshell, a LMM includes both fixed effects (parameters associated with the entire population) and random effects (parameters that vary across groups, such as subjects or sites).
Later in this tutorial, I will point out a few important considerations to keep in mind when building and interpreting LMMs. If you want to learn more about the theory behind this funky technique, check out the resources at the end of this article.
Run a vertex-wise linear mixed model
The main function in the verywise package is
run_vw_lmm which fits a vertex-wise linear mixed model to
the data.
Let’s demonstrate a simple (and probably most common) example of a mixed model, in which we have a single grouping/cluster (participants) modelled as a random intercept.
# Run a linear mixed model
run_vw_lmm(
formula = vw_thickness ~ sex * age + site + (1 | id), # model formula
pheno = "path/to/phentype/data.rds", # or an R object already in memory
subj_dir = "/path/to/freesurfer/subjects", # Neuro-imaging data location
outp_dir = "/path/to/output", # Where you want to store results
hemi = "lh", # (default) or "rh": which hemisphere to run
n_cores = 4 # number of cores for parallel processing
)The function will first check the inputs and then run
lme4::lmer on every vertex in the left hemisphere.
The most important parameters you need to know about:
formula
A model formula object. Use this to specify the LMM you wish to run,
using lme4
syntax, for example:
vw_area ~ sex * age + site + (1 | id/site). See the next
section for some advice, cheatsheets and references on model
specification.
The outcome variable should be one of the supported
brain surface metrics computed by FreeSurfer, prefixed by
vw_:
-
vw_thickness- Cortical thickness -
vw_area- Cortical surface area (white surface) -
vw_area.pial- Cortical surface area (pial surface) -
vw_curv- Mean curvature -
vw_jacobian_white- Jacobian determinant (white surface) -
vw_pial- Pial surface coordinates -
vw_pial_lgi- Local gyrification index (pial surface) -
vw_sulc- Sulcal depth -
vw_volume- Gray matter volume -
vw_w_g.pct- White/gray matter intensity ratio -
vw_white.H- Mean curvature (white surface) -
vw_white.K- Gaussian curvature (white surface)
It is currently only possible to have brain metrics as outcome variables.
pheno
The phenotype data. This is either the name of an R
object already loaded in your environment, or as a
string containing the path to a data file, that will
then be read in memory for you (supported file extensions are:
rds, csv, txt and
sav).
More information about the format requirements for the phenotype data
can be found in the Data format
article. Brifly, we support both single datasets
(e.g. a data.frame) and multiple imputed
datasets (i.e. a mids object or list of
dataframes). Either way, the data shoud be in long
format and it should contain all the variables specified in the
formula plus the folder_id column.
subj_dir
The path to the FreeSurfer data directory. This must follow the
verywise directory structure (see Data format article for details).
Other arguments you may want to specify
These argments are optional, but you may want to specify them explicitly depending on your analysis:
outp_dir: a character string specifying the path to the output directory, where results will be stored. IfNULL(default), the function creates a “verywise_results” sub-directory in the current working directory (though this is not recommended for tidyness reasons).hemi: a character string specifying which hemisphere to analyze. Options:"lh"(left hemisphere: default),"rh"(right hemisphere).folder_id: a character string specifying the column name inphenothat contains subject directory names of the input neuroimaging data (e.g. “sub-001_ses-baseline” or “site1/sub-010_ses-F1”). These are expected to be nested insidesubj_dir. Default:"folder_id".n_cores: an integer specifying the number of CPU cores for parallel processing. Default:1.weights: a string or a numeric vector of weights for the linear mixed model. You can use this argument to specify inverse-probability weights. If this is a string, the function look for a column with that name in the phenotype data. Note that these weights are not normalized or standardized in any way. If this is a vector, it should be of the same length as the number of rows in your data. Default:NULL(no weights).seed: an integer specifying the random seed for reproducibility. Default:3108.save_ss: whether to save the super-subject matrix (“ss”) as an .rds file that can be then re-used in future analyses. This can also be a character string specifying the directory where ss should be saved. WhenTRUE, the ss matrix will be saved in<outp_dir>/ssby default. Default:FALSE.
Even more arguments for fine-grained control
These are more advanced arguments that you typically won’t need, but you may want to tweak depending on your analytical (and control) needs:
-
fs_template: a character string specifying the FreeSurfer template for vertex registration. Options:-
"fsaverage"(default) = 163842 vertices (–> highest resolution), -
"fsaverage6"= 40962 vertices, -
"fsaverage5"= 10242 vertices, -
"fsaverage4"= 2562 vertices, -
"fsaverage3"= 642 vertices
Note that lower resolutions should be only used to downsample the brain map, for faster model tuning. The final analyses should run using
fs_template = "fsaverage"to avoid (small) imprecisions in vertex registration and smoothing. -
apply_cortical_mask: whether to exclude non-cortical vertices from analysis. Default:TRUE(recommended).tolerate_surf_not_found: an integer indicating how many brain surface files listed in thefolder_idcolumn can be missing fromsubj_dir. If the number of missing or corrupted files is> tolerate_surf_not_foundexecution will stop. Default:20.use_model_template: whether to pre-compile the model template for faster estimation. Default:TRUE(recommended).lmm_control: cptional list (of correct class, resulting fromlmerControl()) containing control parameters to be passed tolme4::lmer()(e.g. optimizer choice, convergence criteria, see the?lmerControldocumentation for details. Default: (uses default settings).chunk_size: an integer specifying the number of vertices processed sequentially in each chunk (used for parallel operations). Larger values use more memory but may be faster depending on your resources (i.e.n_cores). Default:1000.FS_HOME: a character string specifying the FreeSurfer home directory. Defaults to$FREESURFER_HOMEenvironment variable, if this was set up correctly during FreeSurfer installation (see Set-up article for details).fwhm: a numeric value specifying the full-width half-maximum for the smoothing kernel. Default:10.-
mcz_thr: a numeric value for the Monte Carlo simulation threshold. Any of the following are accepted (equivalent values are separated by/):-
13 / 1.3 / 0.05, -
20 / 2.0 / 0.01, -
23 / 2.3 / 0.005, -
30 / 3.0 / 0.001, –> default -
33 / 3.3 / 0.0005, -
40 / 4.0 / 0.0001.
-
cwp_thr: a numeric value for cluster-wise p-value threshold (on top of all corrections). Set this should be set to0.025when both hemispheres are analyzed, and0.05for single hemisphere analyses. Default:0.025.save_optional_cluster_info: whether to save additional output formmri_surfclusterFreeSurfer call. Seecompute_clustersfunction documentation for details. Default:FALSE.save_residuals: Logical indicating whether to save theresiduals.mghfile. Default:FALSE(recommended, as these files can be quite large).verbose: whether to display progress messages. Default:TRUE.
A little more theory: constructing the right model
The model specification is probably the most important step when running a LMM (or any model really). A poorly specified model can lead to biased estimates, incorrect inferences, and misleading conclusions. What complicates matters more is that you are now running hundreds of thousands of models (one per vertex), so you really want to get this right, ideally the first time around and consistently across the brain.
Should my variables be fixed or random?
In many cases, the same variables could be considered either random or fixed effects (and sometimes even both at the same time, though I wouldn’t get too excited about that). The difference between these two modeling choices has little to do with the variables themselves, and a lot more to do with your study design / research question.
In broad terms, fixed effects are variables that you are interested in, you want to know about their association with the brain metric of choice. Random effects are grouping factors (i.e. categorical variables) that you are not specifically interested in, but you know they might be influencing such brain measures (e.g. scanner, site, subject ID).
Note: You might not want to use random effects when the number of factor levels is very low (i.e. < 5 is a commonly reported rule of thumb) or when you wouldn’t assume that your factor levels come from a common distribution (e.g. assuming that males and females come from a population of sexes, of which there is an infinite number and we are interested in an average). You can read more about this topic here, here and here.
Random structure specification
The formula syntax adopted by lme4 to specify random
effects is very flexible. Almost *too flexible**, so it can be tricky to
get your head around all the available options. Here is a quick
cheatsheet adepted ( / stolen) from StackOverflow (RIP):
| formula | meaning |
|---|---|
(1 | group) |
random group intercept |
(x | group) = (1 + x | group)
|
random slope of x within group - with correlated intercept |
(0 + x | group) |
random slope of x within group - no variation in intercept |
(1 | group) + (0 + x | group) |
uncorrelated random intercept and random slope within group |
(1 | site/id) =
(1 | site) + (1 | site:id)
|
intercept varying among sites and among people within sites (nested random effects) |
site + (1 | site:id) |
fixed effect of sites plus random variation in intercept among people within sites |
(x | site/id) =
(x | site) + (x | site:id)
|
slope and intercept varying among sites and among people within sites |
(x1 | site) + (x2 | id) |
two different effects, varying at different levels |
(1 | group1) + (1 | group2) |
intercept varying among crossed random effects (e.g. site, year) |
Nested vs. crossed random effects
The clustering patterns that can be modeled using random effects can take many forms.
In some cases, clustering is hierarchical (i.e. clusters are nested within other clusters). A relevant example are multi-site longitudinal neuroimaging studies: repeated brain observations are nested within individulas, individulas are nested within sites, sites nested within cohorts and so on.
In other cases, there is no nesting structure. For example, a study where participants complete the same set of tasks, observations are nested within individuals, but they are also clustered according to task type (…or image aquisition protocol or scanner etc.). These are commonly refered to as crossed random effects, i.e. when one level of a random effect can appear in conjunction with more than one level of another effect.
Can I trust the results? [TODO: finsh this section]
Model converge
Get too greedy with model complexity, you may incur in convergence issues.
These will be soon stored in the output –> TODO
A common problem we have seen is singular fits. These are typically caused by a number of random-effect levels that is too small (e.g. < 5; see more info here)
Model fit information
These will be soon stored in the output –> TODO
BIC, AIC, logLik, deviance
More on AIC in the context of LMM here
A note on P values
P values for fixed effects are calculated using the Satterthwaite’s method [TODO: add info]
Read more about this here.
Next article: Run many vertex-wise linear mixed models