.. _datalad.rst:

==============================================
DataLad: Distributed Data Management
==============================================

| Contributors: Nathan TM Huneke
| Maintainers: Nathan TM Huneke

------------------------------------------

The following text is adapted from the `DataLad handbook `_.

`DataLad `_ is a free and open source command line tool, available for all major operating systems. It builds on Git and `git-annex `__ to allow sharing, synchronizing, and version controlling collections of large files in repositories known as DataLad datasets. You can find information on how to install DataLad at `handbook.datalad.org/en/latest/intro/installation.html `_.

.. note::

    If you are using my :ref:`neuroimaging conda environment ` then DataLad will already be installed.

Get the dataset
^^^^^^^^^^^^^^^

A DataLad dataset can be ``cloned`` by running::

    datalad clone <url>

Once a dataset is cloned, it is a lightweight directory on your local machine. At this point, it contains only small metadata and information on the identity of the files in the dataset, but not the actual *content* of the (sometimes large) data files.

Retrieve dataset content
^^^^^^^^^^^^^^^^^^^^^^^^

After cloning a dataset, you can retrieve file contents by running::

    datalad get <path>

This command will trigger a download of the files, directories, or subdatasets you have specified.

DataLad datasets can contain other datasets, so-called *subdatasets*. If you clone the top-level dataset, subdatasets do not yet contain metadata and information on the identity of files, but appear to be empty directories. In order to retrieve file availability metadata in subdatasets, run::

    datalad get -n <subdataset>

Afterwards, you can browse the retrieved metadata to find out about subdataset contents, and retrieve individual files with ``datalad get``. If you use ``datalad get <subdataset>``, all contents of the subdataset will be downloaded at once.

Stay up-to-date
^^^^^^^^^^^^^^^

DataLad datasets can be updated.
The command ``datalad update`` will *fetch* updates and store them on a different branch (by default ``remotes/origin/master``). Running::

    datalad update --merge

will *pull* available updates and integrate them in one go.

Find out what has been done
^^^^^^^^^^^^^^^^^^^^^^^^^^^

DataLad datasets contain their history in the ``git log``. By running ``git log`` (or a tool that displays Git history) in the dataset or on specific files, you can find out what has been done to the dataset or to individual files, by whom, and when.

Saving changes you make to a dataset
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you make a change to a dataset, you will need to save it into the history so that you or other users can later see what was done, when, and by whom. The command ``datalad status`` will show you whether any changes need to be saved. If you run this command you might see either the message::

    nothing to save, working tree clean

in which case the history is up to date, or::

    modified: file1.txt (file)

In the second case, a file (``file1.txt``) has been modified, and this modification needs to be saved in the dataset's history. To do so, run a ``datalad save`` command. This will produce the following output::

    add(ok): file1.txt (file)
    save(ok): . (dataset)
    action summary:
      add (ok: 1)
      save (ok: 1)

``file1.txt`` was added to the repository and then the dataset was saved. It is useful to add a message describing what was saved, so that you can keep track of what has been done to the dataset. To do so, use the ``-m`` argument, e.g.:

.. code-block:: bash

    datalad save -m "Add file1.txt"

If you now look at the ``git log`` you will see that this entry is accompanied by the message that file1.txt was added::

    commit 7a286d45195b7ac6a167fefb2a2229fa87af2425
    Author: nh6g15
    Date:   Mon Jun 21 14:49:43 2021 +0100

        Add file1.txt

    commit a736983ec90a2094ce401105173122f9a9033824
    Author: nh6g15
    Date:   Thu Jun 17 14:34:17 2021 +0100

        Apply YODA dataset setup

    commit 8bf3c337ef7c06852ffe07ee738eae7f44b1f46c
    Author: nh6g15
    Date:   Thu Jun 17 14:34:15 2021 +0100

        [DATALAD] new dataset

DataLad Run
^^^^^^^^^^^^

Possibly the most useful feature of DataLad for computationally intensive analyses (e.g. neuroimaging) is the ``datalad run`` command. This command captures the command(s) you run, fetches the relevant input files, executes the command, and then saves the results. For example, the following ``datalad run`` command runs a script on a file called ``anonymised_dataset.csv`` to convert it to long format:

.. code-block:: bash

    datalad run \
        -m "Save long format dataset" \
        -i anonymised_dataset.csv \
        -o dataset_long_format.csv \
        "code/convert2long.R"

After running this, checking the ``git log`` will show the following::

    commit 1eac06986726b3f98c61b0b7eab0964ca54c2e0b (HEAD -> master)
    Author: nh6g15
    Date:   Fri Jun 25 16:17:16 2021 +0100

        [DATALAD RUNCMD] Save long format dataset

        === Do not change lines below ===
        {
         "chain": [],
         "cmd": "code/convert2long.R",
         "dsid": "9663676d-5ac4-4071-9406-6ee778f7d49e",
         "exit": 0,
         "extra_inputs": [],
         "inputs": [
          "anonymised_dataset.csv"
         ],
         "outputs": [
          "dataset_long_format.csv"
         ],
         "pwd": "."
        }
        ^^^ Do not change lines above ^^^

Because the command and the files it needs are all saved in the log, we can even re-run this command if needed! To do so, we use ``datalad rerun <commit>``, where ``<commit>`` is the SHA hash of the commit in question. For example:

.. code-block:: bash

    datalad rerun 1eac06986726b3f98c61b0b7eab0964ca54c2e0b

I strongly suggest you read the chapter on ``datalad run`` in the `DataLad handbook `_ as this command is so important.

Dataset Storage and Backup
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

DataLad includes tools for easily managing dataset storage and backup. These work very nicely with the University's research filestore. A suggested workflow is described below.

DataLad Siblings
------------------

Before understanding how DataLad can be used for storage and backup, it is important to understand the concept of a `DataLad sibling `_. A DataLad sibling is essentially a 'copy' of a dataset stored in another location. Each sibling will have its own ``git log``, and changes made in either dataset can be incorporated into the other with a ``datalad update`` command.

For dataset storage and backup, we will be using a special kind of sibling known as a ``special remote``. We will use two types of ``special remote``: ``remote indexed archives`` and a ``gitlab remote``.

Remote Indexed Archives
------------------------

`RIA stores `_ can be easily created or extended from within any dataset. The advantage of using an RIA store is that the remote machine does not need DataLad to be installed. Nevertheless, DataLad will still be able to find and retrieve dataset contents through a ``datalad get`` command. The RIA store is therefore well suited to content storage and backup, particularly as the University filestore is regularly backed up itself.

The RIA store is created with the following command::

    datalad create-sibling-ria -s ria-backup ria+<path>

If using the University filestore you would replace ``<path>`` with the path to access your research filestore via SSH. This takes the following form::

    ssh://ssh.soton.ac.uk:/research/absolute/path/to/ria-store

The final command therefore looks like this::

    datalad create-sibling-ria -s ria-backup ria+ssh://ssh.soton.ac.uk:/research/absolute/path/to/ria-store

.. note::

    To access your research filestore via SSH you need to ask iSolutions for the directory to be ``NFS`` and you need to be added to the list of ``SSH gateway users``.

To back up your dataset contents in the RIA store, use the following command::

    datalad push --to ria-backup

.. tip::

    **Accessing a directory via SSH without password**

    It can be useful to set up access to your filestore via SSH without the need for a password. This way you can automate operations like getting dataset contents or pushing dataset updates without needing to input your password every 5 seconds. Doing this is very straightforward through the use of an ``SSH key``.

    First, on your machine, type the following command::

        ssh-keygen

    Just use the defaults when prompted by pressing return. Next, copy your key to the SSH server with::

        ssh-copy-id user@server

    You should now be able to log in without using a password. This key can be copied (using secure copy) to any other machine to allow access to the same server without using a password::

        scp ~/.ssh/id_<keytype> user@machine:~/.ssh/id_<keytype>
        scp ~/.ssh/id_<keytype>.pub user@machine:~/.ssh/id_<keytype>.pub

GitLab Siblings
----------------

RIA stores are great, but they have one problem: *they are not human-readable*. Here is an example of what an RIA store actually looks like::

    /path/to/my_riastore
    ├── 946
    │   └── e8cac-432b-11ea-aac8-f0d5bf7b5561
    │       ├── annex
    │       │   └── objects
    │       │       ├── 6q
    │       │       │   └── mZ
    │       │       │       └── MD5E-s93567133--7c93fc5d0b5f197ae8a02e5a89954bc8.nii.gz
    │       │       │           └── MD5E-s93567133--7c93fc5d0b5f197ae8a02e5a89954bc8.nii.gz
    │       │       ├── 6v
    │       │       │   └── zK
    │       │       │       └── MD5E-s2043924480--47718be3b53037499a325cf1d402b2be.nii.gz
    │       │       │           └── MD5E-s2043924480--47718be3b53037499a325cf1d402b2be.nii.gz
    │       │       ├── [...]
    │       │       └── [...]
    │       ├── archives
    │       │   └── archive.7z
    │       ├── branches
    │       ├── config
    │       ├── description
    │       ├── HEAD
    │       ├── hooks
    │       │   ├── applypatch-msg.sample
    │       │   ├── [...]
    │       │   └── update.sample
    │       ├── info
    │       │   └── exclude
    │       ├── objects
    │       │   ├── 05
    │       │   │   └── 3d25959223e8173497fa7f747442b72c31671c
    │       │   ├── 0b
    │       │   │   └── 8d0edbf8b042998dfeb185fa2236d25dd80cf9
    │       │   ├── [...]
    │       │   │   └── [...]
    │       │   ├── info
    │       │   └── pack
    │       ├── refs
    │       │   ├── heads
    │       │   │   ├── git-annex
    │       │   │   └── master
    │       │   └── tags
    │       ├── ria-layout-version
    │       └── ria-remote-ebce196a-b057-4c96-81dc-7656ea876234
    │           └── transfer
    ├── error_logs
    └── ria-layout-version

Cloning this dataset would involve finding out the ``dataset id``, which is not trivial when you don't know where to look. Instead, we can store a human-readable version of the dataset on the University's `GitLab instance `_. GitLab is not suitable for storing dataset contents (other than code), so it needs to be used in conjunction with the filestore. GitLab instead stores metadata about the dataset, allowing retrieval of contents from the RIA store using human-readable commands. This is very useful when collaborating with others on a project.

Setup
*******

Before a GitLab remote can be created, you need to complete a few setup steps:

1. Generate a personal access token for GitLab `here <https://git.soton.ac.uk/-/profile/personal_access_tokens>`_.

2. Copy and paste the following into a text file, inserting your personal access token in the appropriate field::

       [soton]
       url = https://git.soton.ac.uk
       private_token = [insert token here]

3. Save this file in your ``home`` directory (``~``) as ``.python-gitlab.cfg``.

Create the GitLab remote
**************************

To create a GitLab remote, use the following command::

    datalad create-sibling-gitlab -s gitlab --site soton --project <path/to/project>

The metadata and code can then be pushed to GitLab with::

    datalad push --to gitlab

Human-readable metadata will now be visible at ``https://git.soton.ac.uk/path/to/project``. The dataset can then be ``cloned`` with::

    datalad clone https://git.soton.ac.uk/path/to/project .

assuming the cloner has permissions to view the dataset.
Contents can then be retrieved with a ``datalad get``, as DataLad will by default search the RIA store for contents.

.. note::

    To reduce complexity, it helps to create a GitLab remote for the ``superdataset`` **only**. Any subdatasets can be backed up to an RIA store. As long as the superdataset is cloned, it is then possible to retrieve subdataset contents from the RIA store. One line of code needs to be run in the superdataset to configure this, replacing ``<path>`` with the path to your RIA store::

        git config -f .datalad/config "datalad.get.subdataset-source-candidate-origin" "ria+<path>#{id}"

Procedures for Setting Up Datasets
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To help speed up the process of setting up a new DataLad dataset (including backup siblings), I have written a number of `procedures `_. The GitHub repository, including installation and usage instructions, is `here `_.

More information
^^^^^^^^^^^^^^^^

More information on DataLad and how to use it can be found in the DataLad Handbook at `handbook.datalad.org `_. The chapter `What you really need to know `_ is particularly useful.
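
As a recap of the save and backup steps in this guide, here is a minimal sketch of a shell helper that saves pending changes and then pushes to both siblings. It assumes the sibling names ``ria-backup`` and ``gitlab`` used above already exist. The names ``run_cmd``, ``backup_dataset``, and ``DRY_RUN`` are hypothetical and are not part of DataLad.

```shell
# Hypothetical helper (not part of DataLad): save and back up the current dataset.
# Assumes siblings named "ria-backup" and "gitlab" have already been created.
# Set DRY_RUN=1 to print the commands instead of running them.

run_cmd() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Print the command only; note that quoting of arguments is not preserved.
        echo "$*"
    else
        "$@"
    fi
}

backup_dataset() {
    # The first argument is the commit message; a generic default is used if omitted.
    message="${1:-Update dataset}"
    # Record pending changes in the dataset history.
    run_cmd datalad save -m "$message" || return 1
    # Push file contents to the RIA store first.
    run_cmd datalad push --to ria-backup || return 1
    # Then push metadata and code to GitLab.
    run_cmd datalad push --to gitlab
}
```

Running ``DRY_RUN=1 backup_dataset "Add file1.txt"`` inside a dataset prints the three commands without executing them, which is a cheap way to check the order before running it for real.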