Skip to contents

Data poor? Simulate yourself some data, son

If you are eager to get started with verywise but do not have access to a dataset (yet), you can generate both a set of ready-made FreeSurfer brain surface files and a phenotype file to go with them.

The simulated dataset will automatically be stored into a verywise directory structure and will be ready for you to analyze.

Let’s see how to get there. First, load the package:

library(verywise)
#> Loaded verywise 1.3.5.9000

Now, decide what data you want to generate. In this (small) example, I will simulate a dataset with 250 subjects, who belong to two cohorts (including 100 and 150 subjects respectively) and all underwent two MRI sessions ("01" and "02"). I am only interested in surface area, so I will only be generating "area" maps. The resolution of these fake surface maps is 163842 vertices (corresponding to the most detailed FreeSurfer template "fsaverage").

To make this more realistic, I want to inject a positive association between "age" and surface area in my data (beta = 3.5). This association will only be present in the "frontalpole" region from the Desikan-Killiany atlas.


data_structure <- list("cohort1" = list("n_subjects" = 100, 
                                        "sessions" = c("01","02")),
                       "cohort2" = list("n_subjects" = 150, 
                                        "sessions" = c("01","02")))

# Simulate FreeSurfer and phenotype dataset for both hemispheres
for (hemi in c('lh','rh')) {

  simulate_longit_dataset(
    path = "path/to/verywise/simulated_example",
    data_structure = data_structure,
    measure = "area",
    hemi = hemi,
    fs_template = "fsaverage", # FreeSurfer template (largest available: 163842 vertices)
    vw_mean = 5, # mean surface area value
    vw_sd = 0.1, # standard deviation of surface area values
    subj_sd = 0.02, # variability between subjects
    site_sd = 0.01, # variability between sites
    # (optionally) only simulate data within certain ROIs:
    # roi_subset = c('temporalpole', 'frontalpole', 'entorhinal'), 
    simulate_association = "3.5 * age",
    location_association = "frontalpole")

}

This should give you everything you need to play around with verywise model fitting. If you do run a model using this data now, you should be able to recover an association with age located in a single cluster, perfectly overlapping with the frontal pole region. How neat.

Under the hood, simulate_dataset() will call two functions:

You can also call these two functions separately, in case you only need one of the two data sources.

Only generate the brain data

If you have a phenotype dataset already, but you are waiting for the FreeSurfer data to cook.

In this case I have a single dataset / cohort called "my_site" with 100 subjects who underwent three MRI sessions ("baseline", "F1" and "F2"). The phenotype data is ready and stored in a CSV file located at "path/to/my_phenotype_data.rds". This dataset should contain 300 rows (100 subjects x 3 sessions), and a few variables of interest, including "wisdom".

Let’s simulate left hemisphere thickness maps, smoothed at 10mm FWHM, with an overall mean of 6.5 (mm thickness) and a standard deviation of 1.5. This time, to save us some time, I will use a smaller surface template ("fsaverage3"), which has only 40962 vertices in total.

I also want to inject a positive association between "wisdom" and thickness (beta = 10) in the "entorhinal" and "temporalpole" regions.


phenotype_data <- readRDS("path/to/my_phenotype_data.rds")

data_structure <- list("my_site" = list("sessions" = c("baseline", "F1", "F2"),
                                        "n_subjects" = 100))

# Simulate FreeSurfer dataset
simulate_freesurfer_data(path = "path/to/simulated_FreeSurfer_output",
                         data_structure = data_structure,
                         fs_template = "fsaverage3",
                         measure = "thickness",
                         hemi = "lh",
                         fwhmc = "fwhm10", # smoothness level
                         vw_mean = 6.5,
                         vw_sd = 1.5,
                         subj_sd = 0.2, 
                         site_sd = 0.1,
                         simulate_association = 10 * phenotype$wisdom,
                         location_association = c('temporalpole', 'entorhinal'),
                         seed = my_random_seed_I_did_not_forget_to_set)

Only generate the phenotype data

On the other hand, if you already have some surfaces to use but you would like a phenotype to go with it, you can generate a minimal “long format” dataframe (with variables: folder_id, id, site, age, sex, wisdom).

# Simulate phenotype dataset, using the same data_structure as above
phenotype_data <- simulate_long_pheno_data(data_structure = data_structure,
                                           seed = my_random_seed_I_did_not_forget_to_set)

Simulate “distributed” data

verywise also handles datasets that are not accessible by a single analyst (distributed framework). To simulate a similar dataset for testing:

site_sizes = c(
   site1 = 50, 
   site2 = 100, 
   site3 = 5
)

for (hemi in c('rh','lh')) {

  true_estimates <- simulate_distrib_dataset(
    path = "path/to/testing/folder",
    site_sizes = site_sizes,
    fs_template = 'fsaverage',
    measure = 'area',
    hemi = hemi,
    fwhmc = 'fwhm10',
    vw_mean = 5,
    vw_sd = 0.1,
    tau2 = 0.1, # Between-site variance of the random intercept
    sigma2 = 0.05, # Within-site residual variance

    betas = c(sex = -0.5, age = 0.8),
    roi_subset = c('temporalpole', 'frontalpole', 'entorhinal'),
    location_association = 'frontalpole',
    overwrite = TRUE,
    seed = 42,
    verbose = TRUE)
  
}