Commit f4b1e746 authored by Benjamin Murauer

clarified dataset linking

parent 419d1b47
Pipeline #39478 passed with stage in 2 minutes and 40 seconds
@@ -16,20 +16,41 @@ Please have a look at the examples for more information.
### CLI
We provide a `dbispipeline-link` tool that can be used to link datasets to
-the data directory. To use this feature provide a `data/links.yaml` file.
-An example could look like this:
-- music/acousticbrainz
-- music/billboard
-- music/millionsongdataset
-To set the root path where the datasets are linked from either set the CLI
-parameter or configure the dbispipeline acordingly (See the sample config
+the data directory. This ensures that datasets are linked in a consistent way
+even on different machines. The general process is as follows:
+1. Either in the `dbispipeline.ini` or as an argument in the CLI call, one can
+define where datasets are generally stored on the local machine. For example,
+many datasets are available under `/storage/nas3/datasets/text`. In this case,
+this would be the value in the configuration:
+dataset_dir = /storage/nas3/datasets
+2. In a file `data/links.yaml`, one can define specific datasets that are used
+by the software. The first path segment of each entry will be cut off in the
+link name (not sure why). For example, the following YAML file:
+- music/acousticbrainz
+- music/billboard
+- music/millionsongdataset
+would assume that a physical directory exists at
+`/storage/nas3/datasets/music/billboard`. After calling the script
+`dbispipeline-link` without parameters using the above configuration, the
+following symlinks will be created:
+data/acousticbrainz -> /storage/nas3/datasets/music/acousticbrainz
+data/billboard -> /storage/nas3/datasets/music/billboard
+data/millionsongdataset -> /storage/nas3/datasets/music/millionsongdataset
+The value of `dataset_dir` from the config can be overridden in the CLI
+script by using the `-p` option.
## Requirements
"""Tool to manage data."""
import os
-from logzero import logger
import yaml
from dbispipeline import utils
+from logzero import logger
LINK_CONFIG_FILE = 'data/links.yaml'
@@ -19,8 +19,8 @@ def link(dataset_dir=None):
     default from dbispipeline. To use the pipeline config pass None.
     if dataset_dir is None:
-        procject_config = utils.load_project_config()[utils.SECTION_PROJECT]
-        dataset_dir = procject_config[utils.OPTION_DATASET_DIR]
+        project_config = utils.load_project_config()[utils.SECTION_PROJECT]
+        dataset_dir = project_config[utils.OPTION_DATASET_DIR]
     if dataset_dir == '':
         logger.error('No dataset dir is defined. Look at the README.')
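
Note: the linking behaviour described in the README hunk can be sketched roughly as follows. This is a minimal illustration, not the actual `dbispipeline-link` implementation: the function names `plan_links` and `resolve_dataset_dir` are hypothetical, the handling of single-segment entries is an assumption, and the sketch only computes link/target pairs instead of creating symlinks.

```python
from pathlib import PurePosixPath


def resolve_dataset_dir(cli_value, config_value):
    """Pick the dataset root: the CLI value (if given) wins over the config.

    Hypothetical helper; it mirrors the fallback shown in the diff, where
    `None` means "use the project config" and an empty value is an error.
    """
    chosen = cli_value if cli_value is not None else config_value
    if not chosen:
        raise ValueError("No dataset dir is defined.")
    return chosen


def plan_links(entries, dataset_dir, data_dir="data"):
    """Map each links.yaml entry to a (link_path, target_path) pair."""
    plan = []
    for entry in entries:
        parts = PurePosixPath(entry).parts
        # The README states that the first path segment is cut off for the
        # link name. Assumption: an entry with a single segment is kept as is.
        name_parts = parts[1:] if len(parts) > 1 else parts
        link = PurePosixPath(data_dir).joinpath(*name_parts)
        # The target keeps the full entry below the dataset root.
        target = PurePosixPath(dataset_dir) / entry
        plan.append((str(link), str(target)))
    return plan


if __name__ == "__main__":
    root = resolve_dataset_dir(None, "/storage/nas3/datasets")
    entries = ["music/acousticbrainz", "music/billboard", "music/millionsongdataset"]
    for link, target in plan_links(entries, root):
        print(f"{link} -> {target}")
```

Run with the README's example values, the sketch prints the same three `data/... -> /storage/nas3/datasets/music/...` pairs listed in the README. Creating the links (for example with `os.symlink`) is deliberately left out so the mapping can be checked without touching the file system.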