Genome Diff File Generation

Overview

This protocol is a series of commands that automatically generate .gd files based on the naming system of your .fastq files. It will typically be the first step once you have sequencing files in .fastq format from the sequencing core. The script is part of the barricklab repository on GitHub.

Protocol

Quotation marks ("") shown in commands are required; text within <> should be replaced with your own values.

Clone or update script from github repository:

- Clone (for first time users):
  1. cd <location_for_repository>
  2. git clone https://github.com/barricklab/barricklab.git
  3. Do one of the following:
    - Add <location_for_repository/barricklab> to your path (don't forget to put it in your profile).
    - Run the command using the absolute path.
    - Move reads_to_gd_files.py to a location within your path. This is the worst choice. NOTE: you will need to move the script again each time you update from the repository.
- Update repository scripts (if you know there are changes or if something isn't working as you expect):
  1. cd <location_for_repository/barricklab>
  2. git pull
  3. If you originally chose to move reads_to_gd_files.py into your path, remember to move the updated copy there again. Consider deleting it from that location and using one of the other options instead.
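
The clone-or-update choice above can be sketched in Python. This is only an illustration, not part of the barricklab repository: the function names clone_or_update_command, script_is_on_path, and run are hypothetical, and the sketch assumes git is installed and that a previous clone lives in a barricklab subdirectory containing a .git folder.

```python
import shutil
import subprocess
from pathlib import Path

# Repository URL as given in the protocol.
REPO_URL = "https://github.com/barricklab/barricklab.git"


def clone_or_update_command(repo_parent):
    """Return (argv, working_directory) for the git command to run.

    Hypothetical helper: chooses 'git pull' when a clone already exists
    under repo_parent/barricklab, otherwise 'git clone'.
    """
    repo_parent = Path(repo_parent)
    repo_dir = repo_parent / "barricklab"
    if (repo_dir / ".git").is_dir():
        return ["git", "pull"], repo_dir
    return ["git", "clone", REPO_URL], repo_parent


def script_is_on_path(script_name="reads_to_gd_files.py"):
    """Report whether the script can be found through the PATH variable."""
    return shutil.which(script_name) is not None


def run(repo_parent):
    """Run the chosen git command (needs git, and network access to clone)."""
    argv, cwd = clone_or_update_command(repo_parent)
    subprocess.run(argv, cwd=cwd, check=True)
```

After calling run("<location_for_repository>"), script_is_on_path() gives a quick check of whether the path option was set up correctly; if it returns False, use the absolute path to the script instead.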

Generate metadata file

Using your favorite text editor, make a new .tsv file containing your sample names (each should match the beginning of the corresponding .fastq file names) and any of the following columns: 'population', 'time' (meaning generation), 'treatment', 'clone'.
Be sure the first row is a header row containing the names you want included in the .gd files.
Be sure 'sample' is the first column.
Be sure the file is tab-separated.
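
The checks above are easy to get wrong by hand, so a small validation sketch in Python may help. It is not part of reads_to_gd_files.py; the function names check_metadata and samples_without_reads, and the sample values in the usage note, are made up for illustration.

```python
import csv
import io


def check_metadata(text):
    """Return a list of problems found in metadata TSV text (empty if none)."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows or not rows[0]:
        return ["file is empty or has no header row"]
    problems = []
    header = rows[0]
    if header[0] != "sample":
        problems.append(
            "first column of the header row is %r, expected 'sample' "
            "(is the file really tab-separated?)" % header[0]
        )
    for line_number, row in enumerate(rows[1:], start=2):
        if not row:
            continue  # ignore blank lines
        if len(row) != len(header):
            problems.append(
                "line %d: expected %d tab-separated fields, found %d"
                % (line_number, len(header), len(row))
            )
        elif not row[0].strip():
            problems.append("line %d: sample name is empty" % line_number)
    return problems


def samples_without_reads(samples, fastq_names):
    """Return the samples that no .fastq file name starts with."""
    return [
        sample
        for sample in samples
        if not any(name.startswith(sample) for name in fastq_names)
    ]
```

For example, check_metadata("sample\tpopulation\ttime\nA1\tAra-1\t500\n") returns an empty list, while a file whose columns are separated by spaces is reported because its first header field is not exactly 'sample'.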

Generate .gd files

1. cd <location_of_fastq_files>
2. Run the following command and applicable options:
  - reads_to_gd_files.py -a <your_name> -f $PWD -r <location_of_references> -m <location_and_name_of_meta_data_tsv_file>
  - For <your_name>, make sure you do not use spaces.
  - For <location_of_references>, make sure it is specified from root (i.e. should start with "/") or from a website (i.e. should start with 'http').
  - For <location_and_name_of_meta_data_tsv_file>, make sure it is specified from root (i.e. should start with "/").
  - If there are index files in the directory (i.e. if your sequencing data comes from a MiSeq run), add -i to the above command line.
  - If the data is stored on TACC's corral system and is maintained by the Barrick lab, add -b to the above command line.
  - If the data is stored on NCBI's SRA archive, add -s to the above command line.
  - If you wish to specify adapter sequences, add -t <location_and_name_of_adapter_sequences_file> and make sure it is specified from root (i.e. should start with "/").
3. Inspect the contents of the new_gd_files directory, and make sure both the set of files created and the contents of those files look correct.
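
The argument rules in step 2 can be checked before the script is run. The sketch below assembles the command line from the options listed above and raises an error when a rule is broken. The function build_command and its parameter names are hypothetical; only the flags (-a, -f, -r, -m, -i, -b, -s, -t) come from this protocol, and the sketch does not confirm how reads_to_gd_files.py itself handles bad input.

```python
import os


def build_command(author, reference, metadata, fastq_dir=None,
                  index_files=False, barrick_corral=False, sra=False,
                  adapters=None):
    """Return the reads_to_gd_files.py command as an argument list."""
    if any(ch.isspace() for ch in author):
        raise ValueError("author name must not contain spaces")
    if not (reference.startswith("/") or reference.startswith("http")):
        raise ValueError("reference must start with '/' or 'http'")
    if not metadata.startswith("/"):
        raise ValueError("metadata file must be specified from root ('/')")
    if fastq_dir is None:
        fastq_dir = os.getcwd()  # same role as $PWD in the protocol
    argv = ["reads_to_gd_files.py", "-a", author, "-f", fastq_dir,
            "-r", reference, "-m", metadata]
    if index_files:
        argv.append("-i")  # index files present (e.g. MiSeq run)
    if barrick_corral:
        argv.append("-b")  # data on TACC corral, maintained by the lab
    if sra:
        argv.append("-s")  # data stored on NCBI SRA
    if adapters is not None:
        if not adapters.startswith("/"):
            raise ValueError("adapter file must be specified from root ('/')")
        argv += ["-t", adapters]
    return argv
```

The returned list can be passed to subprocess.run(argv, check=True) from the directory containing the .fastq files, or joined with spaces and pasted into a terminal.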

Add .gd files to DCAMP

1. Check out the newest version of DCAMP from the repository. Alternatively, if you have not previously cloned the dcamp repository, see how to clone dcamp.
  - cd <location_of_dcamp>
  - hg pull
  - hg update
2. If the fastq files are part of a new project/publication, make a new RS_#### directory within DCAMP's src/data directory; otherwise, determine which existing directory the .gd files belong in.
3. cd <location_of_fastq_files>
4. mv -i new_gd_files/*.gd <location_from_root_of_dcamp>/dcamp/src/data/<RS_####_directory>
5. Verify that new_gd_files is now empty:
  - ls new_gd_files
6. rm -r new_gd_files
7. cd <location_from_root_of_dcamp>/dcamp
8. hg add .
9. hg commit -m "<Brief description of what you are adding>"
10. hg push
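
Steps 5 and 6 guard against deleting .gd files that were not moved. As a minimal sketch, the same safeguard can be written in Python so that the directory is removed only when it is truly empty. The function name remove_if_empty is hypothetical and is not part of DCAMP or the barricklab scripts.

```python
from pathlib import Path


def remove_if_empty(directory):
    """Remove the directory only if it is empty.

    Returns the sorted names of any leftover entries; an empty list means
    the directory was empty and has been removed.
    """
    directory = Path(directory)
    leftovers = sorted(entry.name for entry in directory.iterdir())
    if leftovers:
        return leftovers  # something was not moved; leave everything in place
    directory.rmdir()  # rmdir itself refuses to delete a non-empty directory
    return []
```

Calling remove_if_empty("new_gd_files") from <location_of_fastq_files> either removes the directory or returns the names of the files that still need to be moved into DCAMP.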

Version History

This section contains information on versions of the scripts, dates the scripts were used, and archived versions. Care will be taken that new scripts preserve the same command line execution and only add new functionality. Direct suggestions for improvements to Dan.

3-30-16: The GSAF Core seems to have changed its naming system for fastq output files. The barcode is no longer added before the lane ID. Instead, an S# of unknown meaning is added in front of the lane ID. The S# varies per sample and does not appear to be related to the barcode. The script was updated to account for the new naming but should retain the old functionality as well.
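
To illustrate the two naming styles, here is a rough Python sketch that pulls the sample name out of a file name in either style. It is not the code used in reads_to_gd_files.py. The exact layout is an assumption: underscore-separated fields, a lane ID of the form L followed by digits, and a barcode written with the letters A, C, G, T, N (optionally two barcodes joined by a hyphen). The file names in the usage note are invented.

```python
import re

# Assumed layout: <sample>_<barcode or S#>_<lane ID>_... (see lead-in).
TAG_BEFORE_LANE = re.compile(
    r"^(?P<sample>.+?)_"
    r"(?P<tag>S\d+|[ACGTN]+(?:-[ACGTN]+)?)_"
    r"(?P<lane>L\d+)[_.]"
)


def parse_fastq_name(file_name):
    """Return (sample, tag, lane, style), or None if the name does not match.

    style is 'old' when the field before the lane ID is a barcode and
    'new' when it is an S# field.
    """
    match = TAG_BEFORE_LANE.match(file_name)
    if match is None:
        return None
    tag = match.group("tag")
    style = "new" if tag.startswith("S") else "old"
    return match.group("sample"), tag, match.group("lane"), style
```

Under these assumptions, parse_fastq_name("A1_ATCACG_L001_R1_001.fastq.gz") is reported as the old style and parse_fastq_name("A1_S7_L001_R1_001.fastq.gz") as the new style, and both yield the sample name A1 that the metadata file should contain.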

-- Main.DanielDeatherage - 29 Mar 2016


Topic revision: r4 - 2017-06-10 - 00:40:15 - Main.DanielDeatherage