Run federated aggregation for vertex-wise distributed LMM

Performs the aggregation step of a privacy-preserving federated linear mixed model (LMM) on vertex-wise neuroimaging data. This function never sees raw participant data — it operates entirely on pre-computed sufficient statistics (site payloads) uploaded by each site. It fits a random-intercept LMM with a single shared variance component ratio $\lambda = \sigma_b^2 / \sigma_\varepsilon^2$, selected via a 1-D profile likelihood grid search (rather than iterative site communication).

Usage

run_vw_fed_aggr(
  site_names,
  formula,
  inpt_dir,
  outp_dir = NULL,
  hemi = c("lh", "rh"),
  fs_template = "fsaverage",
  seed = 3108,
  n_cores = 1,
  chunk_size = 1000,
  REML = TRUE,
  lambda_grid = NULL,
  chol_tol = 1e-12,
  ridge = 0,
  verbose = TRUE
)

Arguments

site_names: Character vector of site identifiers. Each site must have a corresponding <site>.tar.gz archive in inpt_dir.
formula: A two-sided formula specifying the fixed effects. The LHS names the surface measure (e.g. thickness ~ age + sex).
inpt_dir: Character. Directory containing the site payload archives.
outp_dir: Character or NULL. Directory for output FBM backing files. Defaults to inpt_dir if NULL.
hemi: Character. Hemisphere: "lh" (default) or "rh".
fs_template: Character. FreeSurfer template surface used for the cortical mask (e.g. "fsaverage", "fsaverage5"). Default: "fsaverage".
seed: Integer. Random seed passed to the parallel backend for reproducibility. Default: 3108.
n_cores: Integer. Number of parallel workers. 1 (default) runs sequentially; values > 1 use doParallel.
chunk_size: Integer. Number of vertices processed per parallel job. Larger values are faster but use more memory. Default: 1000.
REML: Logical. If TRUE (default), use restricted maximum likelihood (REML); otherwise use ML.
lambda_grid: Numeric vector of candidate $\lambda$ values for the grid search, or NULL to use the default log-spaced grid spanning $[10^{-6}, 10^{2}]$.
chol_tol: Numeric. Minimum diagonal pivot in the Cholesky factor below which a $\lambda$ value is rejected as numerically singular. Default: 1e-12.
ridge: Numeric. Ridge penalty added to the diagonal of $A(\lambda)$ before factorisation. Use a small positive value (e.g. 1e-6) when predictors are near-collinear. Default: 0 (no regularisation).
verbose: Logical. Whether to print progress messages. Default: TRUE.

Value

A named list of four FBM objects, each of dimension described below. Rows correspond to model terms; columns to vertices.

coef: $p \times V$ matrix of fixed-effect estimates $\hat\beta$.
se: $p \times V$ matrix of standard errors.
pval: $p \times V$ matrix of two-sided p-values (t-distribution, df degrees of freedom).
variances: $2 \times V$ matrix; row 1 = $\hat\sigma^2_\varepsilon$, row 2 = $\hat\tau^2 = \hat\lambda \hat\sigma^2_\varepsilon$.

Details

Model structure. The covariance matrix is assumed to follow $$V = \sigma_\varepsilon^2 (I + \lambda Z Z^\top)$$ where $Z$ is a block-diagonal indicator of site membership. Under this structure, the Sherman–Morrison–Woodbury identity reduces the $N \times N$ inverse to site-level rank-1 updates, making the algorithm $O(K p^2)$ rather than $O(N^3)$.

Grid search. For each candidate $\lambda$, the function reconstructs the global Hessian $A(\lambda) = X^\top V^{-1} X$ and its Cholesky factor, then computes the profile (RE)ML log-likelihood per vertex in chunks. The $\lambda$ that maximises the log-likelihood is selected independently for each vertex.

Chunked processing. Vertices are processed in chunks of size chunk_size to keep memory usage bounded regardless of surface resolution. Results are written directly to memory-mapped bigstatsr FBM files on disk.

Author

Serena Defina, 2026.