An automatically reproducible neuroimaging data analysis

Scientific studies should be reproducible, and with the increasing accessibility of data, there is not much excuse for lack of reproducibility anymore.

DataLad can help with the technical aspects of reproducible science...

It always starts with a dataset

~ % datalad create demo
[INFO   ] Creating a new annex repo at /demo/demo
create(ok): /demo/demo (dataset)
~ % cd demo

For this demo we use two public brain imaging datasets that were published on OpenFMRI.org and are available via DataLad from datasets.datalad.org

~/demo % datalad install -d . -s ///openfmri/ds000001 inputs/ds000001
[INFO   ] Cloning http://datasets.datalad.org/openfmri/ds000001 to '/demo/demo/inputs/ds000001'
add(ok): inputs/ds000001 (dataset) [added new subdataset]
add(notneeded): inputs/ds000001 (dataset) [nothing to add from /demo/demo/inputs/ds000001]
add(notneeded): .gitmodules (file) [already included in the dataset]
save(ok): /demo/demo (dataset)
[INFO   ] access to dataset sibling "datalad" not auto-enabled, enable with:
|            datalad siblings -d "/demo/demo/inputs/ds000001" enable -s datalad
install(ok): inputs/ds000001 (dataset)
action summary:
  add (notneeded: 2, ok: 1)
  install (ok: 1)
  save (ok: 1)

BTW: '///' is just short for http://datasets.datalad.org
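
So the call above could equivalently have been spelled out in full (not part of the recorded session, but it resolves to the same URL shown in the cloning message above):

~/demo % datalad install -d . -s http://datasets.datalad.org/openfmri/ds000001 inputs/ds000001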

~/demo % datalad install -d . -s ///openfmri/ds000002 inputs/ds000002
[INFO   ] Cloning http://datasets.datalad.org/openfmri/ds000002 to '/demo/demo/inputs/ds000002'
add(ok): inputs/ds000002 (dataset) [added new subdataset]
add(notneeded): inputs/ds000002 (dataset) [nothing to add from /demo/demo/inputs/ds000002]
add(notneeded): .gitmodules (file) [already included in the dataset]
save(ok): /demo/demo (dataset)
[INFO   ] access to dataset sibling "datalad" not auto-enabled, enable with:
|            datalad siblings -d "/demo/demo/inputs/ds000002" enable -s datalad
install(ok): inputs/ds000002 (dataset)
action summary:
  add (notneeded: 2, ok: 1)
  install (ok: 1)
  save (ok: 1)

Both datasets are now registered as subdatasets, and their precise versions are on record

~/demo % datalad --output-format '{path}: {revision}' subdatasets
/demo/demo/inputs/ds000001: f47099a5124e8f619f763f44f70e1faf5154d41a
/demo/demo/inputs/ds000002: e1b7df06da8dd8f1d8802d699d9ad7781fad8bb6

However, very little data were actually downloaded (the full datasets are several gigabytes in size):

~/demo % du -sh inputs/
20M  inputs/

DataLad datasets are fairly lightweight; in their minimal form they contain only pointers to the data, plus the version history.
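
For example, listing one of the image files would reveal a symbolic link into the dataset's annex instead of actual image content (a sketch, not part of the recorded session; the exact link target and annex key will differ):

~/demo % ls -l inputs/ds000001/sub-01/anat/sub-01_T1w.nii.gz
lrwxrwxrwx ... sub-01_T1w.nii.gz -> ../../.git/annex/objects/...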

Both datasets contain brain imaging data, and are compliant with the BIDS standard. This makes it really easy to locate particular images and perform analysis across datasets.

Here we will use a small script that performs 'brain extraction' using FSL as a stand-in for a full analysis pipeline

~/demo % mkdir code
~/demo % cat << EOT > code/brain_extraction.sh
> # enable FSL
> . /etc/fsl/5.0/fsl.sh
>
> # obtain all inputs
> datalad get "\$@"
> # perform brain extraction
> count=1
> for nifti in "\$@"; do
>   subdir="sub-\$(printf %03d \$count)"
>   mkdir -p "\$subdir"
>   echo "Processing \$nifti"
>   bet "\$nifti" "\$subdir/anat" -m
>   count=\$((count + 1))
> done
> EOT

Note that this script uses the 'datalad get' command, which automatically obtains the required files from their remote source -- we will see this in action shortly

We save this script in the dataset so that we will know exactly which code was used for the analysis. We also track the script with Git (rather than the annex), which makes it easier to see how it was edited over time.

~/demo % datalad add code -m "Brain extraction script" --to-git
add(ok): /demo/demo/code/brain_extraction.sh (file) [non-large file; adding content to git repository]
add(ok): /demo/demo/code (directory)
save(ok): /demo/demo (dataset)
action summary:
  add (ok: 2)
  save (ok: 1)

In addition, we will "tag" this state of the dataset. This is optional, but it can help to identify important milestones more easily

~/demo % datalad save --version-tag setup_done
save(ok): /demo/demo (dataset)
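
Under the hood this creates an ordinary Git tag, so standard Git commands can work with it. For example (not part of the recorded session; it should simply print the tag name):

~/demo % git tag
setup_done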

Now we can run our analysis code to produce results. However, instead of running it directly, we will run it with DataLad -- this will automatically create a record of exactly how this script was executed

For this demo we will just run it on the structural images of the first subject from each dataset. The uniform structure of the datasets makes this very easy. Of course we could run it on all subjects (a sketch of that call follows the run below); we are simply saving some time for this demo. While the command runs, you should notice a few things:

1) We run this command with 'bash -e' to stop at any failure that may occur

2) You'll see the required data files being obtained as they are needed -- and only those that are actually required will be downloaded

~/demo % datalad run bash -e code/brain_extraction.sh inputs/ds*/sub-01/anat/sub-01_T1w.nii.gz
[INFO   ] == Command start (output follows) =====
get(ok): /demo/demo/inputs/ds000001/sub-01/anat/sub-01_T1w.nii.gz (file)
get(ok): /demo/demo/inputs/ds000002/sub-01/anat/sub-01_T1w.nii.gz (file)
action summary:
  get (ok: 2)
Processing inputs/ds000001/sub-01/anat/sub-01_T1w.nii.gz
Processing inputs/ds000002/sub-01/anat/sub-01_T1w.nii.gz
[INFO   ] == Command exit (modification check follows) =====
add(ok): sub-002/anat.nii.gz (file)
add(ok): sub-001/anat.nii.gz (file)
add(ok): sub-002/anat_mask.nii.gz (file)
add(ok): sub-001/anat_mask.nii.gz (file)
save(ok): /demo/demo (dataset)
action summary:
  add (ok: 4)
  save (ok: 1)
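
Had we wanted to process every subject in both datasets, only the file glob would need to change, since the script loops over all of its arguments. A sketch of that call (not executed in this demo):

~/demo % datalad run bash -e code/brain_extraction.sh inputs/ds*/sub-*/anat/sub-*_T1w.nii.gz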

The analysis step is done; all generated results were saved in the dataset. All changes, including the command that caused them, are on record

~/demo % git show --stat
commit 7607ddef8c03dc5516869f1e35025083772efc5a (HEAD -> master)
Author: DataLad Demo <demo@datalad.org>
Date:   Fri Mar 16 08:26:11 2018 +0100

    [DATALAD RUNCMD] bash -e code/brain_extraction.sh inputs/...

    === Do not change lines below ===
    {
     "pwd": ".",
     "cmd": [
      "bash",
      "-e",
      "code/brain_extraction.sh",
      "inputs/ds000001/sub-01/anat/sub-01_T1w.nii.gz",
      "inputs/ds000002/sub-01/anat/sub-01_T1w.nii.gz"
     ],
     "exit": 0,
     "chain": []
    }
    ^^^ Do not change lines above ^^^

 sub-001/anat.nii.gz      | 1 +
 sub-001/anat_mask.nii.gz | 1 +
 sub-002/anat.nii.gz      | 1 +
 sub-002/anat_mask.nii.gz | 1 +
 4 files changed, 4 insertions(+)

DataLad has enough information stored to be able to re-run a command.

On command exit, it will inspect the results and save them again, but only if they are different.

In our case, the re-run yields bit-identical results, hence nothing new is saved.

~/demo % datalad rerun
unlock(ok): sub-001/anat.nii.gz (file)
unlock(ok): sub-001/anat_mask.nii.gz (file)
unlock(ok): sub-002/anat.nii.gz (file)
unlock(ok): sub-002/anat_mask.nii.gz (file)
[INFO   ] == Command start (output follows) =====
get(notneeded): /demo/demo/inputs/ds000001/sub-01/anat/sub-01_T1w.nii.gz (file) [already present]
get(notneeded): /demo/demo/inputs/ds000002/sub-01/anat/sub-01_T1w.nii.gz (file) [already present]
action summary:
  get (notneeded: 2)
Processing inputs/ds000001/sub-01/anat/sub-01_T1w.nii.gz
Processing inputs/ds000002/sub-01/anat/sub-01_T1w.nii.gz
[INFO   ] == Command exit (modification check follows) =====
add(ok): sub-002/anat.nii.gz (file)
add(ok): sub-001/anat.nii.gz (file)
add(ok): sub-002/anat_mask.nii.gz (file)
add(ok): sub-001/anat_mask.nii.gz (file)
save(notneeded): /demo/demo (dataset)
action summary:
  add (ok: 4)
  save (notneeded: 1)
  unlock (ok: 4)

Now that we are done, and have checked that we can reproduce the results ourselves, we can clean up

DataLad can easily verify whether any part of our input datasets was modified since we configured our analysis

~/demo % datalad diff --revision setup_done inputs

Nothing was changed.

With DataLad we don't have to keep those inputs around -- without losing the ability to reproduce an analysis.

Let's uninstall them -- checking the size on disk before and after

~/demo % du -sh
32M  .
~/demo % datalad uninstall inputs/*
drop(ok): /demo/demo/inputs/ds000002/sub-01/anat/sub-01_T1w.nii.gz (file) [checking http://openneuro.s3.amazonaws.com/ds000002/ds000002_R2.0.0/uncompressed/sub-01/anat/sub-01_T1w.nii.gz?versionId=vXK2.bQ360phhPqbVV_n6RMYqaWAy4Dg...]
drop(ok): /demo/demo/inputs/ds000002 (directory)
uninstall(ok): /demo/demo/inputs/ds000002 (dataset)
drop(ok): /demo/demo/inputs/ds000001/sub-01/anat/sub-01_T1w.nii.gz (file) [checking http://openneuro.s3.amazonaws.com/ds000001/ds000001_R1.1.0/uncompressed/sub001/anatomy/highres001.nii.gz?versionId=8TJ17W9WInNkQPdiQ9vS7wo8ZJ9llF80...]
drop(ok): /demo/demo/inputs/ds000001 (directory)
uninstall(ok): /demo/demo/inputs/ds000001 (dataset)
action summary:
  drop (ok: 4)
  uninstall (ok: 2)
~/demo % du -sh .
3.0M .

All inputs are gone...

~/demo % ls inputs/*
inputs/ds000001:

inputs/ds000002:

Only the remaining data (our code and the results) need to be kept and backed up for long-term archiving. Everything else can be re-obtained whenever it is needed.
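
For instance, a single input file could be brought back with a plain 'datalad get', which would re-install the containing subdataset on demand (a sketch, not run here):

~/demo % datalad get inputs/ds000001/sub-01/anat/sub-01_T1w.nii.gz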

As DataLad knows everything needed about the inputs, including where to get the right version, we can re-run the analysis with a single command. Watch how DataLad re-obtains all required data, re-runs the code, and confirms that none of the results have changed, hence nothing new needs to be saved

~/demo % datalad rerun
unlock(ok): sub-001/anat.nii.gz (file)
unlock(ok): sub-001/anat_mask.nii.gz (file)
unlock(ok): sub-002/anat.nii.gz (file)
unlock(ok): sub-002/anat_mask.nii.gz (file)
[INFO   ] == Command start (output follows) =====
[INFO   ] Cloning http://datasets.datalad.org/openfmri/ds000001/.git to '/demo/demo/inputs/ds000001'
[INFO   ] access to dataset sibling "datalad" not auto-enabled, enable with:
|            datalad siblings -d "/demo/demo/inputs/ds000001" enable -s datalad
install(ok): /demo/demo/inputs/ds000001 (dataset) [Installed subdataset in order to get /demo/demo/inputs/ds000001/sub-01/anat/sub-01_T1w.nii.gz]
[INFO   ] Cloning http://datasets.datalad.org/openfmri/ds000002/.git to '/demo/demo/inputs/ds000002'
[INFO   ] access to dataset sibling "datalad" not auto-enabled, enable with:
|            datalad siblings -d "/demo/demo/inputs/ds000002" enable -s datalad
install(ok): /demo/demo/inputs/ds000002 (dataset) [Installed subdataset in order to get /demo/demo/inputs/ds000002/sub-01/anat/sub-01_T1w.nii.gz]
get(ok): /demo/demo/inputs/ds000001/sub-01/anat/sub-01_T1w.nii.gz (file)
get(ok): /demo/demo/inputs/ds000002/sub-01/anat/sub-01_T1w.nii.gz (file)
action summary:
  get (ok: 2)
  install (ok: 2)
Processing inputs/ds000001/sub-01/anat/sub-01_T1w.nii.gz
Processing inputs/ds000002/sub-01/anat/sub-01_T1w.nii.gz
[INFO   ] == Command exit (modification check follows) =====
add(ok): sub-002/anat.nii.gz (file)
add(ok): sub-001/anat.nii.gz (file)
add(ok): sub-002/anat_mask.nii.gz (file)
add(ok): sub-001/anat_mask.nii.gz (file)
save(notneeded): /demo/demo (dataset)
action summary:
  add (ok: 4)
  save (notneeded: 1)
  unlock (ok: 4)

Reproduced!

This dataset could now be published, enabling anyone to replicate the exact same analysis. Public data for the win!
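
A sketch of that final step, assuming a sibling named 'public' had been configured as a publication target beforehand:

~/demo % datalad publish --to public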