
Data management on Millipede

The amount of data and the number of projects in the Millipede /data/gcc/ directory are growing fast. It's time to impose sensible, project-related data management structures on all users to avoid total chaos. This page lists those structures: who has access, what is where, etc.

Currently, there are 4 main groups using /data/gcc/ (with 'main contact'): the Genetics department (Roan), the Genome Analysis Facility (Wil), GoNL (Freerk) and GBIC (Joeri). There are some shared structures as well.

GCC-level structure

Accessibility and restrictions

  • Make sure your tools and resources have at least read/execute permissions for the Millipede gcc users group
  • Please communicate access restrictions to your project's contents on this page
  • If you feel your particular project falls outside the mentioned main four, please consult Morris to become the 'fifth main project'
  • Any data outside this structure will be removed regularly to keep things clean, so use your project directory or your home directory (/data/gcc/home/you/ or /data/you/)
  • Please do not use someone else's data without consulting them first (we can all trust each other!)
  • If you need a /data/gcc/home/you/ directory, please ask Morris
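
The first rule above can be checked mechanically. The following is a minimal Python sketch, assuming that "read/execute permissions for the group" means the group r and x mode bits are set. The function names are hypothetical, and the check does not verify that the owning group is actually the Millipede gcc users group.

```python
import os
import stat


def group_can_read_and_execute(mode):
    """Return True if the mode bits grant the group both read and execute."""
    required = stat.S_IRGRP | stat.S_IXGRP
    return (mode & required) == required


def dirs_lacking_group_access(root):
    """Yield directories at or below root that the group cannot read and enter."""
    for dirpath, _dirnames, _filenames in os.walk(root):
        if not group_can_read_and_execute(os.stat(dirpath).st_mode):
            yield dirpath
```

For example, dirs_lacking_group_access("/data/gcc/tools") would list the tool directories that other group members cannot browse.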

File structure

/data/gcc/
+-- home/
+-- projects/
+---- GAF/
+---- GBIC/
+---- Genetics/
+---- gonl/
+-- resources/
+-- tools/
  • home/
    • contains directories per user, on request, for their own personal use
    • restrictions (size and accessibility)?
  • projects/
    • contains directories for each main project (currently being the four main groups GAF, GBIC, Genetics and GoNL)
    • each main project has its own data management structure, described on this page
  • resources/
    • contains shared resources for all 4 groups (now mainly GoNL and GAF?)
    • all resources should be put in a folder whose name states their version, using the naming convention: resource-version
      • Ex: Human Genome build 19 should be found in /data/gcc/resources/hg-19/
  • tools/
    • contains shared tools for all 4 groups (now mainly GoNL and GAF?)
    • all tools should be put in a folder using the naming convention: toolname-version
      • Ex: Picard v1.32 should be found in /data/gcc/tools/picard-tools-1.32/
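
The naming conventions for resources/ and tools/ can be checked with a short script. The following is a minimal Python sketch, assuming that the version is the last hyphen-separated field and starts with a digit (as in the two examples above). The pattern and the function name are illustrative and not an agreed standard.

```python
import re

# Assumption: "<name>-<version>", where the version is the final
# hyphen-separated field and begins with a digit.
NAME_VERSION = re.compile(r"^(?P<name>.+)-(?P<version>\d[\w.]*)$")


def split_name_version(dirname):
    """Return (name, version) for a conforming directory name, else None."""
    match = NAME_VERSION.match(dirname)
    if match is None:
        return None
    return match.group("name"), match.group("version")
```

With this pattern, picard-tools-1.32 splits into picard-tools and 1.32, hg-19 splits into hg and 19, and a name without a version (such as scripts) is reported as non-conforming.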

We should all list here changes and/or additions we would like to see on the tools and resources directories (if any), so these can be discussed (please add your name).

(Laurent, Wil) Clean up the scripts directory (all scripts in one dir? subdirs? separate script dirs per language, project or purpose?)

(Wil) Remove unused older versions of tools?

(Laurent) Restructure the entire tools dir?

(Wil) A proposed restriction: no testing in the tools and resources dirs (although I don't think anybody actually does that)

GBIC

Joeri, please update.

Accessibility and restrictions

File structure

Genetics

Roan, Patrick, whoever, please update.

Accessibility and restrictions

File structure

GoNL

The GoNL group has a page on the BBMRI wiki concerning their data management structure: http://www.bbmriwiki.nl/wiki/DataManagement. GoNL group, please update this section when appropriate.

Accessibility and restrictions

  • all data should only be writable by their owners
  • all tools and resources should be read/executable by the whole gcc group
  • all project-specific data and results should be read/executable by the gvnl group
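
The "writable only by their owners" rule translates into a simple test on the permission bits. The following is a minimal Python sketch, assuming the rule means that neither the group nor others have the write bit set. The function name is hypothetical.

```python
import stat


def writable_only_by_owner(mode):
    """Return True if neither group nor others have write permission."""
    return (mode & (stat.S_IWGRP | stat.S_IWOTH)) == 0
```

A mode such as 750 (rwxr-x---) passes this check, whereas 770 (rwxrwx---) does not, because it also lets the group write.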

File structure

  • The basic structure is as follows:
    • tools and resources are shared at /data/gcc/ level
    • for details, refer to the above link
      /data/gcc/projects/gonl/
      +-- rawdata/
      +-- results/
      +---- BGI/
      +---- immunochip/
      +---- pipeline/
      

Genome Analysis Facility

The directory is currently named 'in-house'; it can be renamed to 'GAF' once the running jobs have finished.

Accessibility and restrictions

  • The in-house projects are run by Wil and should be accessible to at least one other person, preferably Freerk
  • No data is shared with anyone without consent by Cleo or Morris, not even if it concerns the requester's own project
  • File permissions still need to be looked into; for now, everyone should follow the above rule

File structure

For now, the file structure is as follows:

/data/gcc/projects/in-house/
+-- data_per_project/
+---- project/
+-- rawdata/
+-- Results/
+---- runxxxxxx[-xxxx_flowcell]/
+------ [logs/]
+------ jobs/
+------ outputs[_version]/
+-------- sample_lane/
  • data_per_project
    • temporary directory containing the data per project (VCF, QC, BAM)
    • to be moved to ftp server in the future
    • always latest version, earlier versions can be found in the pipeline results dir
  • rawdata
    • raw, unprocessed fastq files
    • README follows lanes and samples
  • Results
    • directories per run (named by date and flowcell for HiSeq runs, and by date only for GA runs)
      • logs dir: present for some of the older runs. Mostly empty; we should consider cleaning them up
      • jobs dir(s): PBS scripts and submit script, no real versioning
      • outputs[_version] dir(s): all outputs for that run divided into lane dirs
        • lanedirs: in the form sample_lane (e.g. D12345_L1), containing pipeline results and a log dir
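
Scripts that walk the Results tree need to recover the sample and lane from each lane directory name. The following is a minimal Python sketch, assuming the name is split at its last underscore, so that sample identifiers may themselves contain underscores. The function name is hypothetical.

```python
def split_lane_dirname(dirname):
    """Split a 'sample_lane' directory name into (sample, lane)."""
    sample, sep, lane = dirname.rpartition("_")
    if not sep or not sample or not lane:
        raise ValueError("not in the form sample_lane: %r" % dirname)
    return sample, lane
```

For the example above, D12345_L1 yields the sample D12345 and the lane L1. A name without an underscore raises a ValueError.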

Proposed changes/additions to this structure:

  • pipeline results structured /run/lane/sample/ because of multiplexing
  • pipeline results should be versioned, perhaps by adding pipeline version or date to dirname automatically or to the filenames (see gonl templates)
  • jobs should always be stored, structured per version and/or date, stored per run (versioning in the jobs dir)