DataLad: Distributed Data Management
The following text is adapted from the DataLad handbook.
DataLad is a free and open-source command-line tool, available for all major operating systems. It builds on Git and git-annex to allow sharing, synchronizing, and version controlling collections of large files in repositories known as DataLad datasets. You can find information on how to install DataLad at handbook.datalad.org/en/latest/intro/installation.html.
Note
If you are using my neuroimaging conda environment, DataLad will already be installed.
Get the dataset
A DataLad dataset can be cloned by running:
datalad clone <url>
Once a dataset is cloned, it is a lightweight directory on your local machine. At this point, it contains only small metadata and information on the identity of the files in the dataset, but not the actual content of the (sometimes large) data files.
Retrieve dataset content
After cloning a dataset, you can retrieve file contents by running:
datalad get <path/to/directory/or/file>
This command will trigger a download of the files, directories, or subdatasets you have specified.
DataLad datasets can contain other datasets, so-called subdatasets. If you clone the top-level dataset, subdatasets do not yet contain metadata and information on the identity of files, but appear to be empty directories. In order to retrieve file availability metadata in subdatasets, run:
datalad get -n <path/to/subdataset>
Afterwards, you can browse the retrieved metadata to find out about subdataset contents, and retrieve individual files with datalad get. If you use datalad get <path/to/subdataset>, all contents of the subdataset will be downloaded at once.
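The steps above can be chained into one helper. The following is a minimal sketch that only wraps the three commands shown in this section; the function name `fetch_one_file` and all four arguments are hypothetical placeholders, and the function is defined here without being run.

```shell
# Clone a dataset, install one subdataset without content, then download
# a single file from it. Runs in a subshell so the caller's working
# directory is left unchanged.
fetch_one_file() (
    url=$1      # dataset to clone
    target=$2   # local directory to clone into
    subds=$3    # subdataset path, relative to the dataset root
    file=$4     # file inside the subdataset, relative to the dataset root

    datalad clone "$url" "$target" || exit 1
    cd "$target" || exit 1
    datalad get -n "$subds" || exit 1   # install the subdataset, no file content
    datalad get "$file"                 # download just this one file
)

# Usage (all values are placeholders):
# fetch_one_file <url> my-dataset sub-01 sub-01/somefile
```

Fetching one file at a time keeps the clone small, which is the point of installing subdatasets with -n first.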
Stay up-to-date
DataLad datasets can be updated. The command datalad update will fetch updates and store them on a different branch (by default remotes/origin/master). Running:
datalad update --merge
will pull available updates and integrate them in one go.
Find out what has been done
DataLad datasets contain their history in the git log. By running git log (or a tool that displays Git history) in the dataset or on specific files, you can find out what has been done to the dataset or to individual files, by whom, and when.
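To see this for a single file, here is a minimal sketch. It builds a throwaway Git repository in a temporary directory so that it runs anywhere Git is installed; the author name, e-mail address, and commit messages are made up for the demonstration. In a real DataLad dataset only the final git log line is needed.

```shell
# Throwaway repository for the demonstration.
repo=$(mktemp -d)
git init -q "$repo"

# Hypothetical identity, passed per command so no global config is touched.
commit() {
    git -C "$repo" -c user.name="Demo User" -c user.email="demo@example.org" \
        commit -q -m "$1"
}

echo "first version" > "$repo/file1.txt"
git -C "$repo" add file1.txt
commit "Add file1.txt"

echo "second version" >> "$repo/file1.txt"
git -C "$repo" add file1.txt
commit "Extend file1.txt"

# History of one file: short hash, author, date, and commit subject.
history=$(git -C "$repo" log --format='%h %an %ad %s' --date=short -- file1.txt)
echo "$history"
```

The `-- file1.txt` at the end restricts the log to commits that touched that file; the newest commit is listed first.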
Saving changes you make to a dataset
If you make a change to a dataset, you will need to save it into the history so that you or other users in the future can see what was done when and by whom. The command datalad status will show you whether any changes need to be saved. If you run this command you might see either the message:
nothing to save, working tree clean
In which case the log is up to date, or:
modified: file1.txt (file)
There is a file (file1.txt) that has been modified. This modification needs to be saved in the dataset’s history. To do so run a datalad save command. This will produce the following output:
add(ok): file1.txt (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)
file1.txt was added to the repository and then the dataset was saved. It is useful to add a message regarding what was saved, so that you can keep track of what has been done to the dataset. To do so use the -m argument, e.g.:
datalad save -m "Add file1.txt"
If you now look at the git log you will see that this entry is accompanied by the message that file1.txt was added:
commit 7a286d45195b7ac6a167fefb2a2229fa87af2425
Author: nh6g15 <n.huneke@soton.ac.uk>
Date:   Mon Jun 21 14:49:43 2021 +0100

    Add file1.txt

commit a736983ec90a2094ce401105173122f9a9033824
Author: nh6g15 <n.huneke@soton.ac.uk>
Date:   Thu Jun 17 14:34:17 2021 +0100

    Apply YODA dataset setup

commit 8bf3c337ef7c06852ffe07ee738eae7f44b1f46c
Author: nh6g15 <n.huneke@soton.ac.uk>
Date:   Thu Jun 17 14:34:15 2021 +0100

    [DATALAD] new dataset
DataLad Run
Possibly the most useful feature of DataLad for computationally intensive analyses (e.g. neuroimaging) is the datalad run command. Using this command allows you to capture your command(s), fetch relevant files, do something with them, and then save the results.
For example, the following datalad run command runs a script on a file called anonymised_dataset.csv to convert it to long format:
datalad run \
    -m "Save long format dataset" \
    -i anonymised_dataset.csv \
    -o dataset_long_format.csv \
    "code/convert2long.R"
After running this, checking the git log will show the following:
commit 1eac06986726b3f98c61b0b7eab0964ca54c2e0b (HEAD -> master)
Author: nh6g15 <n.huneke@soton.ac.uk>
Date:   Fri Jun 25 16:17:16 2021 +0100

    [DATALAD RUNCMD] Save long format dataset

    === Do not change lines below ===
    {
     "chain": [],
     "cmd": "code/convert2long.R",
     "dsid": "9663676d-5ac4-4071-9406-6ee778f7d49e",
     "exit": 0,
     "extra_inputs": [],
     "inputs": [
      "anonymised_dataset.csv"
     ],
     "outputs": [
      "dataset_long_format.csv"
     ],
     "pwd": "."
    }
    ^^^ Do not change lines above ^^^
Because the command and the files it needs are all saved in the log, we can even re-run this command if needed! To do so, we use datalad rerun <SHASUM>, where <SHASUM> is the hash of the commit in question. For example:
datalad rerun 1eac06986726b3f98c61b0b7eab0964ca54c2e0b
I strongly suggest you read the chapter on datalad run in the DataLad handbook, as this command is so important.
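Run commits can also be located by their [DATALAD RUNCMD] prefix rather than by copying the hash by hand. The following is a minimal sketch that uses only standard git log options. The throwaway repository and the identity in it exist purely to make the example runnable, and the final datalad rerun line is left commented out because it needs DataLad and a real dataset.

```shell
# Throwaway repository with one commit that mimics a run record subject.
repo=$(mktemp -d)
git init -q "$repo"
echo "id,value" > "$repo/dataset_long_format.csv"
git -C "$repo" add dataset_long_format.csv
git -C "$repo" -c user.name="Demo User" -c user.email="demo@example.org" \
    commit -q -m "[DATALAD RUNCMD] Save long format dataset"

# --fixed-strings stops the square brackets being read as a regex class;
# -n 1 keeps only the most recent match; %H prints the full commit hash.
run_commit=$(git -C "$repo" log --fixed-strings --grep='[DATALAD RUNCMD]' \
    -n 1 --format=%H)
echo "Most recent run commit: $run_commit"

# In a real dataset, with DataLad installed:
# datalad rerun "$run_commit"
```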
Dataset Storage and Backup
DataLad includes tools for easily managing dataset storage and backup. These work very nicely with the University’s research filestore. A suggested workflow is described below.
DataLad Siblings
Before understanding how DataLad can be used for storage and backup, it is important to understand the concept of a DataLad sibling.
A DataLad sibling is essentially a ‘copy’ of a dataset stored in another location. Each sibling will have its own git log, and changes made in either dataset can be incorporated into the other with a datalad update command. For dataset storage and backup, we will be using a special kind of sibling known as a special remote. We will use two types of special remote: remote indexed archives (RIA) and a GitLab remote.
Remote Indexed Archives
RIA stores can be easily created or extended from within any dataset.
The advantage of using an RIA store is that the remote machine does not need DataLad to be installed. Nevertheless, DataLad will still be able to find and retrieve dataset contents through a datalad get command. The RIA store is therefore perfect for content storage and backup, particularly as the University filestore is regularly backed up itself.
The RIA store is created with the following command:
datalad create-sibling-ria -s ria-backup ria+<URL>
If using the University filestore you would replace <URL> with the path to access your research filestore via SSH. This takes the following form:
ssh://ssh.soton.ac.uk:/research/absolute/path/to/ria-store
The final command therefore looks like this:
datalad create-sibling-ria -s ria-backup ria+ssh://ssh.soton.ac.uk:/research/absolute/path/to/ria-store
Note
To access your research filestore via SSH you need to ask iSolutions for the directory to be NFS and you need to be added to the list of SSH gateway users.
To back up your dataset contents in the RIA store, use the following command:
datalad push --to ria-backup
Tip
Accessing a directory via SSH without password
It can be useful to set up access to your filestore via SSH without the need for a password. This way you can automate operations like getting dataset contents or pushing dataset updates without needing to input your password every 5 seconds. Doing this is very straightforward through the use of an SSH key.
First, on your machine, run the following command:
ssh-keygen
Just use the defaults when prompted by pressing return. Next, copy your key to the SSH server with:
ssh-copy-id user@server
You should now be able to log in without using a password.
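If you would rather skip the prompts, the key can be generated non-interactively. This is a sketch under assumptions: the Ed25519 key type, the empty passphrase, and the scratch directory are choices made for this example, not something the steps above prescribe. For real use, point keyfile at a file inside ~/.ssh instead.

```shell
# Scratch location so an existing key is never overwritten by the example.
keydir=$(mktemp -d)
keyfile="$keydir/id_ed25519"

# -t key type, -f output file, -N "" empty passphrase, -q quiet
ssh-keygen -q -t ed25519 -f "$keyfile" -N ""

# The private key must be readable by you alone; the .pub file is the
# half that gets installed on the server.
chmod 600 "$keyfile"
ls -l "$keydir"

# Then, with a real account (user@server is a placeholder):
# ssh-copy-id -i "$keyfile.pub" user@server
```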
This key can be copied (using secure copy) to any other machine to allow access to the same server without using a password:
scp ~/.ssh/id_<key> user@machine:~/.ssh/id_<key>
scp ~/.ssh/id_<key>.pub user@machine:~/.ssh/id_<key>.pub
GitLab Siblings
RIA stores are great, but they have one problem: they are not human-readable. Here is an example of what an RIA store actually looks like:
/path/to/my_riastore
├── 946
│   └── e8cac-432b-11ea-aac8-f0d5bf7b5561
│       ├── annex
│       │   └── objects
│       │       ├── 6q
│       │       │   └── mZ
│       │       │       └── MD5E-s93567133--7c93fc5d0b5f197ae8a02e5a89954bc8.nii.gz
│       │       │           └── MD5E-s93567133--7c93fc5d0b5f197ae8a02e5a89954bc8.nii.gz
│       │       ├── 6v
│       │       │   └── zK
│       │       │       └── MD5E-s2043924480--47718be3b53037499a325cf1d402b2be.nii.gz
│       │       │           └── MD5E-s2043924480--47718be3b53037499a325cf1d402b2be.nii.gz
│       │       ├── [...]
│       │       └── [...]
│       ├── archives
│       │   └── archive.7z
│       ├── branches
│       ├── config
│       ├── description
│       ├── HEAD
│       ├── hooks
│       │   ├── applypatch-msg.sample
│       │   ├── [...]
│       │   └── update.sample
│       ├── info
│       │   └── exclude
│       ├── objects
│       │   ├── 05
│       │   │   └── 3d25959223e8173497fa7f747442b72c31671c
│       │   ├── 0b
│       │   │   └── 8d0edbf8b042998dfeb185fa2236d25dd80cf9
│       │   ├── [...]
│       │   │   └── [...]
│       │   ├── info
│       │   └── pack
│       ├── refs
│       │   ├── heads
│       │   │   ├── git-annex
│       │   │   └── master
│       │   └── tags
│       ├── ria-layout-version
│       └── ria-remote-ebce196a-b057-4c96-81dc-7656ea876234
│           └── transfer
├── error_logs
└── ria-layout-version
Cloning this dataset would involve finding out the dataset id, which is not trivial when you don’t know where to look. Instead, we can store a human-readable version of the dataset on the University’s GitLab instance. GitLab is not suitable for storing dataset contents (other than code), so it needs to be used in conjunction with the filestore. GitLab instead stores metadata about the dataset, allowing retrieval of contents from the RIA store using human-readable commands. This is very useful when collaborating with others on a project.
Setup
Before a GitLab remote can be created, you need to complete a few setup steps:
Generate a personal access token for GitLab here.
Copy and paste the following into a text file, inserting your personal access token in the appropriate field:
[soton]
url = https://git.soton.ac.uk
private_token = [insert token here]
Save this file in your home directory (~) as .python-gitlab.cfg.
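The same file can be written from the shell. A minimal sketch: the token value is a placeholder, and the file is written to a scratch directory here so that an existing configuration is not overwritten; set cfg to $HOME/.python-gitlab.cfg for real use. Restricting the permissions is a suggested precaution rather than a requirement stated above, since the file holds a secret.

```shell
cfg="$(mktemp -d)/.python-gitlab.cfg"   # real location: $HOME/.python-gitlab.cfg
token="insert-token-here"               # placeholder, not a real token

# Write the three-line configuration; $token is expanded by the shell.
cat > "$cfg" <<EOF
[soton]
url = https://git.soton.ac.uk
private_token = $token
EOF

# The file contains a secret, so make it readable by the owner only.
chmod 600 "$cfg"
cat "$cfg"
```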
Create the GitLab remote
To create a GitLab remote, use the following command:
datalad create-sibling-gitlab -s gitlab --site soton --project <path/to/project>
The metadata and code can then be pushed to gitlab with:
datalad push --to gitlab
Human-readable metadata will now be visible at https://git.soton.ac.uk/path/to/project.
The dataset can then be cloned with:
datalad clone https://git.soton.ac.uk/path/to/project .
assuming the cloner has permissions to view the dataset. Contents can then be retrieved with datalad get, as DataLad will by default search the RIA store for contents.
Note
To reduce complexity, it helps to create a GitLab remote for the superdataset only. Any subdatasets can be backed up to an RIA store. As long as the superdataset is cloned, it is possible to then retrieve subdataset contents from the RIA store. One line of code needs to be run in the superdataset to configure this:
git config -f .datalad/config "datalad.get.subdataset-source-candidate-origin" "ria+<URL>#{id}"
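A quick way to check that the setting was written is to read it back from the same file. This is a minimal sketch in a scratch directory; the store URL reuses the placeholder path from the RIA section, and the commit message in the last line is made up. In a real superdataset you would run the commands from its root and then save .datalad/config so that the setting travels with the dataset.

```shell
ds=$(mktemp -d)            # stand-in for the superdataset root
mkdir -p "$ds/.datalad"

key="datalad.get.subdataset-source-candidate-origin"
# Single quotes keep {id} literal; it is a placeholder for the subdataset id.
value='ria+ssh://ssh.soton.ac.uk:/research/absolute/path/to/ria-store#{id}'

git config -f "$ds/.datalad/config" "$key" "$value"

# Read the value back to confirm what was written.
stored=$(git config -f "$ds/.datalad/config" --get "$key")
echo "$stored"

# In a real dataset, record the change in the history:
# datalad save -m "Configure RIA store as subdataset source" .datalad/config
```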
Procedures for Setting Up Datasets
To help speed up the process of setting up a new DataLad dataset (including backup siblings), I have written a number of procedures. The GitHub repository, including installation and usage instructions, is here.
More information
More information on DataLad and how to use it can be found in the DataLad Handbook at handbook.datalad.org. The chapter What you really need to know is particularly useful.