
Simulate a distributed (multi-site) cross-sectional brain surface dataset
Source: R/simulate_distrib_dataset.R
Generates a synthetic distributed dataset where each site's data lives in its own subdirectory, mirroring a real multi-site setup where no site can access another's raw data. For each subject \(i\) at site \(k\) and each vertex \(v\), the generative model is:
$$ y_{ikv} = \mu_v + \sum_{j} \beta_j \cdot x_{ijk} + u_{kv} + \varepsilon_{ikv} $$
where \(\mu_v\) is a vertex-specific intercept drawn once from
\(\mathcal{N}(\texttt{vw\_mean},\, \texttt{vw\_sd}^2)\),
\(x_{ijk}\) is the value of covariate \(j\) for subject \(i\) at site \(k\),
\(\beta_j\) are fixed effects shared across all vertices and sites
(supplied via betas),
\(u_{kv} \sim \mathcal{N}(0, \tau^2_v)\) is a site-level random intercept,
and \(\varepsilon_{ikv} \sim \mathcal{N}(0, \sigma^2_v)\) is residual noise.
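As a rough illustration of the model above, the following base-R sketch draws data for a handful of toy vertices. It is not the package's implementation: the helper name simulate_site, the toy vertex count V, and the seed are assumptions made for this example, and no files are written.

```r
set.seed(1)  # illustrative seed, not the function's default

K <- 3                       # number of sites
n_k <- c(80, 120, 60)        # subjects per site
V <- 5                       # toy number of active vertices (assumption)
betas <- c(sex = -0.2, age = 0.3)
tau2 <- 0.01
sigma2 <- 0.01
vw_mean <- 5
vw_sd <- 0.5

# Vertex-specific intercepts mu_v, drawn once and shared by all sites
mu <- rnorm(V, mean = vw_mean, sd = vw_sd)

# Site-level random intercepts u_kv, stored as a K x V matrix
u <- matrix(rnorm(K * V, sd = sqrt(tau2)), nrow = K, ncol = V)

# Hypothetical helper: simulate the n_k x V outcome matrix for site k
simulate_site <- function(k) {
  n <- n_k[k]
  age <- as.numeric(scale(rnorm(n, mean = 30, sd = 5)))  # z-scored within site
  sex <- rbinom(n, size = 1, prob = 0.5)
  X <- cbind(sex = sex, age = age)
  fixed <- as.numeric(X %*% betas[colnames(X)])          # sum_j beta_j * x_ijk
  eps <- matrix(rnorm(n * V, sd = sqrt(sigma2)), nrow = n, ncol = V)
  # y_ikv = mu_v + sum_j beta_j * x_ijk + u_kv + eps_ikv
  matrix(mu + u[k, ], nrow = n, ncol = V, byrow = TRUE) + fixed + eps
}

y_by_site <- lapply(seq_len(K), simulate_site)
sapply(y_by_site, dim)  # one column per site: subjects, then vertices
```

Each element of y_by_site is a subjects-by-vertices matrix for one site, which is the quantity the real function would spread across per-subject .mgh files.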
The output folder layout is:
<path>/
  <site1>/
    phenotype.csv
    sub-001/surf/<hemi>.<measure>.<fwhmc>.<fs_template>.mgh
    sub-002/surf/...
  <site2>/
    phenotype.csv
    ...
Each site folder can therefore be passed directly to
run_vw_fed_local as the subj_dir and the matching
phenotype.csv as pheno.
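To make the naming scheme concrete, here is a small sketch that assembles the expected file locations from the layout described above. The helper mgh_path and the root folder name "sim_root" are hypothetical and exist only for this illustration; they are not part of the package.

```r
# Hypothetical helper: expected surface file path for one subject
mgh_path <- function(root, site, subject,
                     hemi = "lh", measure = "thickness",
                     fwhmc = "fwhm10", fs_template = "fsaverage") {
  file_name <- paste(hemi, measure, fwhmc, fs_template, "mgh", sep = ".")
  file.path(root, site, subject, "surf", file_name)
}

# Per-site inputs: the site folder and its phenotype file
site_dir <- file.path("sim_root", "site1")
pheno_file <- file.path(site_dir, "phenotype.csv")

mgh_path("sim_root", "site1", "sub-001")
# "sim_root/site1/sub-001/surf/lh.thickness.fwhm10.fsaverage.mgh"
```

Here site_dir and pheno_file correspond to the subj_dir and pheno inputs mentioned above.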
Usage
simulate_distrib_dataset(
  path,
  site_sizes = c(site1 = 80L, site2 = 120L, site3 = 60L),
  betas = c(sex = -0.2, age = 0.3),
  tau2 = 0.01,
  sigma2 = 0.01,
  fs_template = "fsaverage",
  roi_subset = c("temporalpole", "frontalpole", "entorhinal"),
  location_association = NULL,
  measure = "thickness",
  hemi = "lh",
  fwhmc = "fwhm10",
  vw_mean = 5,
  vw_sd = 0.5,
  overwrite = TRUE,
  seed = 3108,
  verbose = TRUE
)
Arguments
- path
Character string. Root directory for all site sub-folders. Created recursively if it does not exist.
- site_sizes
Named integer vector. Names become site/folder names; values give the number of subjects at each site. Default: c(site1 = 80, site2 = 120, site3 = 60).
- betas
Named numeric vector of fixed-effect coefficients. Each name must correspond to a covariate that will be simulated in the phenotype data. Currently supported covariate names:
  age: Continuous, drawn from \(\mathcal{N}(30, 5^2)\) and then z-scored within each site.
  sex: Binary (0/1), drawn from \(\text{Bernoulli}(0.5)\).
Example: c(age = 0.3, sex = -0.2). Set any coefficient to 0 to include the covariate in the design matrix without injecting a signal (useful for null-effect benchmarks).
- tau2
Numeric scalar or length-\(V_{\text{roi}}\) vector. Between-site variance of the random intercept \(u_{kv}\). A value of 0 collapses the model to a fixed-effects OLS. Default: 0.01.
- sigma2
Numeric scalar or length-\(V_{\text{roi}}\) vector. Within-site variance of the residual \(\varepsilon_{ikv}\). Default: 0.01. tau2 and sigma2 together define the intra-class correlation \(\text{ICC} = \tau^2 / (\tau^2 + \sigma^2)\).
- fs_template
Character string. FreeSurfer template; determines the total number of vertices in each .mgh file. Options:
  "fsaverage" = 163842 vertices
  "fsaverage6" = 40962 vertices
  "fsaverage5" = 10242 vertices
  "fsaverage4" = 2562 vertices
  "fsaverage3" = 642 vertices
Default: "fsaverage".
- roi_subset
Character vector of ROI names used to restrict the active vertices (the rest are set to 0 and excluded from analysis). Vertex locations are read from the FreeSurfer annotation files stored in R/sysdata.rda. Default: c("temporalpole", "frontalpole", "entorhinal").
- location_association
Character vector (optional). If supplied, the signal encoded in betas is injected only within these ROIs; the remaining active vertices (roi_subset) are generated under a null model (\(\beta = 0\)). Useful for testing spatial localisation of discovered effects.
- measure
Character string. Surface measure; used for file naming only. Default: "thickness".
- hemi
Character string. Hemisphere: "lh" or "rh". Default: "lh".
- fwhmc
Character string. Smoothing label; used for file naming only. Default: "fwhm10".
- vw_mean
Numeric scalar. Mean of the vertex-specific intercept \(\mu_v \sim \mathcal{N}(\texttt{vw\_mean},\, \texttt{vw\_sd}^2)\). For cortical thickness a realistic value is around 2.5 mm. Default: 5.
- vw_sd
Numeric scalar. Standard deviation of the vertex-specific intercept distribution. Default: 0.5.
- overwrite
Logical. If FALSE and phenotype.csv already exists in a site folder, the existing file is re-used and no new brain surface files are written for that site. Default: TRUE.
- seed
Integer. Random seed for reproducibility. Default: 3108.
- verbose
Logical. Print progress messages. Default: TRUE.
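The relationship between tau2, sigma2 and the ICC described in the arguments above can be checked by hand. The sketch below assumes that scalar inputs are recycled to one value per active vertex with rep_len; that recycling mechanism and the toy value V_roi = 4 are assumptions for illustration, not confirmed details of the function.

```r
V_roi <- 4                            # toy number of active vertices (assumption)
tau2 <- 0.01                          # scalar, recycled to every vertex
sigma2 <- c(0.01, 0.02, 0.03, 0.04)   # one residual variance per vertex

# Recycle both inputs to length V_roi (assumed mechanism)
tau2_v <- rep_len(tau2, V_roi)
sigma2_v <- rep_len(sigma2, V_roi)

# Theoretical ICC per vertex: tau^2 / (tau^2 + sigma^2)
icc <- tau2_v / (tau2_v + sigma2_v)
round(icc, 3)
```

When tau2 equals sigma2, as with the defaults shown in Usage, the ICC is 0.5 at every vertex; setting tau2 = 0 gives an ICC of 0.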
Value
Invisibly returns a list with the ground-truth parameters used during data generation:
beta0: Length-\(V_{\text{roi}}\) vector of simulated vertex-specific intercepts \(\mu_v\).
betas: The betas argument as supplied.
tau2: Length-\(V_{\text{roi}}\) vector of between-site variances (after recycling).
sigma2: Length-\(V_{\text{roi}}\) vector of residual variances (after recycling).
u: Numeric matrix \(K \times V_{\text{roi}}\) of realised site random intercepts.
icc: Length-\(V_{\text{roi}}\) vector of theoretical ICCs: \(\tau^2_v / (\tau^2_v + \sigma^2_v)\).
signal_vertices: Logical vector of length \(n_{\text{verts}}\) indicating which global vertex indices received a non-zero fixed effect (i.e. the intersection of roi_subset and location_association, if supplied).
Data and phenotype files are written to path as a side-effect.
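The signal_vertices mask can be pictured as the intersection of two ROI filters. The sketch below uses a made-up label vector, vertex_roi, for a toy surface of 10 vertices; the real function reads vertex locations from its bundled annotation data, so this only illustrates the logic and is not the actual lookup.

```r
# Hypothetical per-vertex ROI labels for a toy surface of 10 vertices
vertex_roi <- c("entorhinal", "entorhinal", "insula", "frontalpole",
                "temporalpole", "insula", "frontalpole", "precuneus",
                "temporalpole", "entorhinal")

roi_subset <- c("temporalpole", "frontalpole", "entorhinal")
location_association <- "entorhinal"

# Active vertices: those inside roi_subset
active <- vertex_roi %in% roi_subset

# Signal vertices: active vertices that also lie in location_association
signal_vertices <- active & vertex_roi %in% location_association

which(active)           # 1 2 4 5 7 9 10
which(signal_vertices)  # 1 2 10
```

Active vertices outside location_association (here 4, 5, 7 and 9) would be generated under the null model.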
See also
run_vw_fed_local for the analysis function this feeds into,
simulate_longit_dataset for the longitudinal (multi-session) variant.
Examples
if (FALSE) { # \dontrun{
truth <- simulate_distrib_dataset(
  path = tempfile("fed_sim_"),
  site_sizes = c(site1 = 80, site2 = 120, site3 = 60),
  betas = c(age = 0.3, sex = -0.2),
  tau2 = 0.5,
  sigma2 = 1,
  fs_template = "fsaverage5" # fast; use "fsaverage" for final analyses
)
# Ground-truth ICC summary across active vertices
summary(truth$icc)
# Realised site random intercepts (K x V_roi matrix)
dim(truth$u)
} # }