DataLad: Distributed Data Management

Contributors: Nathan TM Huneke

Maintainers: Nathan TM Huneke

The following text is adapted from the DataLad handbook.

DataLad is a free and open source command line tool available for all major operating systems. It builds on Git and git-annex to allow sharing, synchronizing, and version controlling collections of large files in repositories known as DataLad datasets. You can find information on how to install DataLad at handbook.datalad.org/en/latest/intro/installation.html.

Note

If you are using my neuroimaging conda environment, DataLad is already installed.

Get the dataset

A DataLad dataset can be cloned by running:

datalad clone <url>

Once a dataset is cloned, it is a lightweight directory on your local machine. At this point, it contains only small metadata and information on the identity of the files in the dataset, but not the actual content of the (sometimes large) data files.

Retrieve dataset content

After cloning a dataset, you can retrieve file contents by running:

datalad get <path/to/directory/or/file>

This command will trigger a download of the files, directories, or subdatasets you have specified.

DataLad datasets can contain other datasets, so-called subdatasets. If you clone the top-level dataset, subdatasets do not yet contain metadata and information on the identity of files, but appear to be empty directories. To retrieve file availability metadata in subdatasets, run:

datalad get -n <path/to/subdataset>

Afterwards, you can browse the retrieved metadata to find out about subdataset contents, and retrieve individual files with datalad get. If you use datalad get <path/to/subdataset>, all contents of the subdataset will be downloaded at once.
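
As a rough sketch of the sequence above, the shell function below installs a subdataset without its content and then fetches a single file from it. The function name fetch_one_file and both of its arguments are hypothetical; only datalad get and its -n flag come from the text.

```shell
# Hypothetical helper: install a subdataset's metadata, then fetch one file.
# $1 = path to the subdataset, $2 = file path relative to the subdataset
fetch_one_file() {
    subds=$1
    file=$2
    # -n retrieves only the file availability metadata, no file content
    datalad get -n "$subds" || return 1
    # now download just the one file that is needed
    datalad get "$subds/$file"
}

# Example call (both paths are placeholders):
# fetch_one_file path/to/subdataset path/to/file
```

This avoids downloading the whole subdataset when only one file is needed.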

Stay up-to-date

DataLad datasets can be updated. The command datalad update will fetch updates and store them on a different branch (by default remotes/origin/master). Running:

datalad update --merge

will pull available updates and integrate them in one go.

Find out what has been done

DataLad datasets contain their history in the git log. By running git log (or a tool that displays Git history) in the dataset or on specific files, you can find out what has been done to the dataset or to individual files by whom, and when.
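
A small sketch of how the history of a single file might be inspected. It uses only standard git log options; the function name file_history is made up for illustration.

```shell
# Hypothetical helper: one line per commit that touched a file,
# showing abbreviated hash, author, date and commit subject.
file_history() {
    git log --date=short --format='%h %an %ad %s' -- "$1"
}

# Example call, run inside the dataset:
# file_history file1.txt
#
# For a compact overview of the whole dataset instead:
# git log --oneline
```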

Saving changes you make to a dataset

If you make a change to a dataset, you will need to save it into the history so that you or other users in the future can see what was done when and by whom. The command datalad status will show you whether any changes need to be saved. If you run this command you might see either the message:

nothing to save, working tree clean

in which case the history is up to date, or:

modified: file1.txt (file)

In this case, a file (file1.txt) has been modified, and the modification needs to be saved in the dataset’s history. To do so, run datalad save. This will produce the following output:

add(ok): file1.txt (file)
save(ok): . (dataset)
action summary:
    add (ok: 1)
    save (ok: 1)

file1.txt was added to the repository and then the dataset was saved. It is useful to add a message regarding what was saved, so that you can keep track of what has been done to the dataset. To do so use the -m argument, e.g.:

datalad save -m "Add file1.txt"

If you now look at the git log you will see that this entry is accompanied by the message that file1.txt was added:

commit 7a286d45195b7ac6a167fefb2a2229fa87af2425
Author: nh6g15 <n.huneke@soton.ac.uk>
Date:   Mon Jun 21 14:49:43 2021 +0100

    Add file1.txt

commit a736983ec90a2094ce401105173122f9a9033824
Author: nh6g15 <n.huneke@soton.ac.uk>
Date:   Thu Jun 17 14:34:17 2021 +0100

    Apply YODA dataset setup

commit 8bf3c337ef7c06852ffe07ee738eae7f44b1f46c
Author: nh6g15 <n.huneke@soton.ac.uk>
Date:   Thu Jun 17 14:34:15 2021 +0100

    [DATALAD] new dataset

DataLad Run

Possibly the most useful feature of DataLad for computationally intensive analyses (e.g. neuroimaging) is the datalad run command. Using this command allows you to capture your command(s), fetch relevant files, do something with them, and then save the results.

For example, the following datalad run command runs a script on a file called anonymised_dataset.csv to convert it to long format:

datalad run \
    -m "Save long format dataset" \
    -i anonymised_dataset.csv \
    -o dataset_long_format.csv \
    "code/convert2long.R"

After running this, checking the git log will show the following:

commit 1eac06986726b3f98c61b0b7eab0964ca54c2e0b (HEAD -> master)
Author: nh6g15 <n.huneke@soton.ac.uk>
Date:   Fri Jun 25 16:17:16 2021 +0100

    [DATALAD RUNCMD] Save long format dataset

    === Do not change lines below ===
    {
    "chain": [],
    "cmd": "code/convert2long.R",
    "dsid": "9663676d-5ac4-4071-9406-6ee778f7d49e",
    "exit": 0,
    "extra_inputs": [],
    "inputs": [
    "anonymised_dataset.csv"
    ],
    "outputs": [
    "dataset_long_format.csv"
    ],
    "pwd": "."
    }
    ^^^ Do not change lines above ^^^

Because the command and the files it needs are all saved in the log, we can even re-run this command if needed! To do so, we use datalad rerun <SHASUM>, where <SHASUM> is the hash of the commit in question. For example:

datalad rerun 1eac06986726b3f98c61b0b7eab0964ca54c2e0b
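
To find the hash to pass to datalad rerun, one option is to filter the history for run records. The sketch below relies on the [DATALAD RUNCMD] prefix visible in the log above and on standard git log options; the function name list_run_commits is hypothetical.

```shell
# Hypothetical helper: list commits created by `datalad run`, newest first.
# -F treats the pattern as a fixed string, so the square brackets are literal.
list_run_commits() {
    git log -F --grep='[DATALAD RUNCMD]' --format='%H %s'
}

# Example: re-execute the most recently recorded command
# datalad rerun "$(list_run_commits | head -n 1 | cut -d' ' -f1)"
```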

I strongly suggest you read the chapter on datalad run in the DataLad handbook, as this command is so important.

Dataset Storage and Backup

DataLad includes tools for easily managing dataset storage and backup. These work very nicely with the University’s research filestore. A suggested workflow is described below.

DataLad Siblings

Before understanding how DataLad can be used for storage and backup, it is important to understand the concept of a DataLad sibling. A DataLad sibling is essentially a ‘copy’ of a dataset stored in another location. Each sibling has its own git log, and changes made in either dataset can be incorporated into the other with a datalad update command. For dataset storage and backup, we will use two kinds of sibling: a remote indexed archive (RIA) store and a GitLab sibling.

Remote Indexed Archives

RIA stores can be easily created or extended from within any dataset. The advantage of using an RIA store is that the remote machine does not need datalad to be installed. Nevertheless, datalad will still be able to find and retrieve dataset contents through a datalad get command. The RIA store is therefore perfect for content storage and backup, particularly as the University filestore is regularly backed up itself.

The RIA store is created with the following command:

datalad create-sibling-ria -s ria-backup ria+<URL>

If using the university filestore you would replace <URL> with the path to access your research filestore via SSH. This takes the following form:

ssh://ssh.soton.ac.uk:/research/absolute/path/to/ria-store

The final command therefore looks like this:

datalad create-sibling-ria -s ria-backup ria+ssh://ssh.soton.ac.uk:/research/absolute/path/to/ria-store
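
The URL is easy to mistype, so it can help to assemble it from its parts. This is a minimal sketch, assuming the ria+ssh://host:/absolute/path form shown above; the function name ria_url is hypothetical. Depending on your DataLad version, creating a store that does not exist yet may additionally require the --new-store-ok flag, so check datalad create-sibling-ria --help if the command refuses to run.

```shell
# Hypothetical helper: build a RIA store URL from an SSH host and an absolute path.
ria_url() {
    ssh_host=$1
    store_path=$2
    case "$store_path" in
        /*) printf 'ria+ssh://%s:%s\n' "$ssh_host" "$store_path" ;;
        *)  echo "ria_url: path must be absolute: $store_path" >&2
            return 1 ;;
    esac
}

# Example call:
# datalad create-sibling-ria -s ria-backup "$(ria_url ssh.soton.ac.uk /research/absolute/path/to/ria-store)"
```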

Note

To access your research filestore via SSH you need to ask iSolutions for the directory to be NFS and you need to be added to the list of SSH gateway users.

To back up your dataset contents in the RIA store, use the following command:

datalad push --to ria-backup
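
A push transfers the saved state of the dataset, so it can be convenient to pair saving and pushing. A minimal sketch; the function name backup_to_ria is hypothetical, and the sibling name ria-backup is the one chosen above.

```shell
# Hypothetical helper: save outstanding changes, then push them to the RIA store.
# $1 = commit message for the save
backup_to_ria() {
    # record any outstanding changes first
    datalad save -m "$1" || return 1
    # then transfer the saved state to the RIA store
    datalad push --to ria-backup
}

# Example call:
# backup_to_ria "Add preprocessed data"
```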

Tip

Accessing a directory via SSH without password

It can be useful to set up access to your filestore via SSH without the need for a password. This way you can automate operations like getting dataset contents or pushing dataset updates without needing to input your password every 5 seconds. Doing this is very straightforward through the use of an SSH key.

First on your machine, type the following command:

ssh-keygen

Just use the defaults when prompted by pressing return. Next, copy your key to the SSH server with:

ssh-copy-id user@server

You should now be able to log in without using a password.

This key can be copied (using secure copy) to any other machine to allow access to the same server without using a password:

scp ~/.ssh/id_<key> user@machine:~/.ssh/id_<key>
scp ~/.ssh/id_<key>.pub user@machine:~/.ssh/id_<key>.pub
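
To check that key-based login works before automating anything, a sketch like the following can be used. BatchMode=yes is a standard OpenSSH option that disables password prompts, so the command fails instead of waiting for input. The function name is hypothetical and user@server is a placeholder.

```shell
# Hypothetical helper: succeed only if SSH login works without a password prompt.
# $1 = SSH destination, e.g. user@server
can_login_without_password() {
    ssh -o BatchMode=yes -o ConnectTimeout=10 "$1" true
}

# Example call:
# can_login_without_password user@server && echo "key-based login works"
```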

GitLab Siblings

RIA stores are great, but they have one problem: they are not human-readable. Here is an example of what an RIA store actually looks like:

 /path/to/my_riastore
├── 946
│   └── e8cac-432b-11ea-aac8-f0d5bf7b5561
│       ├── annex
│       │   └── objects
│       │       ├── 6q
│       │       │   └── mZ
│       │       │       └── MD5E-s93567133--7c93fc5d0b5f197ae8a02e5a89954bc8.nii.gz
│       │       │           └── MD5E-s93567133--7c93fc5d0b5f197ae8a02e5a89954bc8.nii.gz
│       │       ├── 6v
│       │       │   └── zK
│       │       │       └── MD5E-s2043924480--47718be3b53037499a325cf1d402b2be.nii.gz
│       │       │           └── MD5E-s2043924480--47718be3b53037499a325cf1d402b2be.nii.gz
│       │       ├── [...]
│       │       └── [...]
│       ├── archives
│       │   └── archive.7z
│       ├── branches
│       ├── config
│       ├── description
│       ├── HEAD
│       ├── hooks
│       │   ├── applypatch-msg.sample
│       │   ├── [...]
│       │   └── update.sample
│       ├── info
│       │   └── exclude
│       ├── objects
│       │   ├── 05
│       │   │   └── 3d25959223e8173497fa7f747442b72c31671c
│       │   ├── 0b
│       │   │   └── 8d0edbf8b042998dfeb185fa2236d25dd80cf9
│       │   ├── [...]
│       │   │   └── [...]
│       │   ├── info
│       │   └── pack
│       ├── refs
│       │   ├── heads
│       │   │   ├── git-annex
│       │   │   └── master
│       │   └── tags
│       ├── ria-layout-version
│       └── ria-remote-ebce196a-b057-4c96-81dc-7656ea876234
│           └── transfer
├── error_logs
└── ria-layout-version

Cloning this dataset would involve finding out the dataset ID, which is not trivial when you don’t know where to look. Instead, we can store a human-readable version of the dataset on the University’s GitLab instance. GitLab is not suitable for storing dataset contents (other than code), so it needs to be used in conjunction with the filestore. GitLab instead stores metadata about the dataset, allowing retrieval of contents from the RIA store using human-readable commands. This is very useful when collaborating with others on a project.

Setup

Before a GitLab remote can be created, you need to complete a few setup steps:

Generate a personal access token for GitLab here.
Copy and paste the following into a text file, inserting your personal access token in the appropriate field:
```
[soton]
url = https://git.soton.ac.uk
private_token = [insert token here]
```
Save this file in your home directory (~) as .python-gitlab.cfg.
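
The same file can also be written from the shell. This is a sketch only: the function name write_gitlab_cfg is hypothetical, and restricting the file to mode 600 is an extra precaution not mentioned above, added because the file holds a secret.

```shell
# Hypothetical helper: write a python-gitlab config containing the given token.
# $1 = personal access token, $2 = target file (defaults to ~/.python-gitlab.cfg)
# Note: an existing file at the target path is overwritten.
write_gitlab_cfg() {
    token=$1
    cfg=${2:-$HOME/.python-gitlab.cfg}
    cat > "$cfg" <<EOF
[soton]
url = https://git.soton.ac.uk
private_token = $token
EOF
    # the file holds a secret, so make it readable by the owner only
    chmod 600 "$cfg"
}

# Example call (the token is a placeholder):
# write_gitlab_cfg "<your token>"
```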

Create the GitLab remote

To create a GitLab remote, use the following command:

datalad create-sibling-gitlab -s gitlab --site soton --project <path/to/project>

The metadata and code can then be pushed to GitLab with:

datalad push --to gitlab

Human-readable metadata will now be visible at https://git.soton.ac.uk/path/to/project. The dataset can then be cloned with:

datalad clone https://git.soton.ac.uk/path/to/project .

assuming the person cloning has permission to view the dataset. Contents can then be retrieved with datalad get, as DataLad will by default search the RIA store for contents.

Note

To reduce complexity, it helps to create a GitLab remote for the superdataset only. Any subdatasets can be backed up to an RIA store. As long as the superdataset is cloned, it is possible to then retrieve subdataset contents from the RIA store. One line of code needs to be run in the superdataset to configure this:

git config -f .datalad/config "datalad.get.subdataset-source-candidate-origin" "ria+<URL>#{id}"
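
A sketch of setting this value and reading it back as a check. The function name set_subdataset_source is hypothetical. The datalad save in the example call is an assumption here: .datalad/config is a file inside the dataset, so the change presumably needs to be saved like any other before it reaches people who clone the superdataset.

```shell
# Hypothetical helper: point subdataset retrieval at a RIA store.
# $1 = RIA store URL without the trailing "#{id}"
set_subdataset_source() {
    key=datalad.get.subdataset-source-candidate-origin
    # "{id}" is written literally, exactly as in the command above
    git config -f .datalad/config "$key" "$1#{id}" || return 1
    # print the stored value back as a check
    git config -f .datalad/config --get "$key"
}

# Example call, run from the root of the superdataset:
# set_subdataset_source "ria+<URL>" && datalad save -m "Configure subdataset source" .datalad/config
```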

Procedures for Setting Up Datasets

To help speed up the process of setting up a new DataLad dataset (including backup siblings), I have written a number of procedures. The GitHub repository including installation and usage instructions is here.

More information

More information on DataLad and how to use it can be found in the DataLad Handbook at handbook.datalad.org. The chapter What you really need to know is particularly useful.