NGS Data Retrieval

Overview

This protocol lists the commands for downloading NGS data from the GSAF and backing it up. Complete every step; stopping partway leaves files behind that cause problems for you and for others.

Protocol

  1. Receive an email from the GSAF stating that your sequencing results are done.
    • The email will have a subject line of "GSAF Sequencing Data for job JA# on sequencing run SA# is now ready", where # is replaced by your specific information.
  2. The email will contain a hyperlink to an amazonaws website. The link text will look something like 'Access your data for JA# from sequencing run SA# here.'
  3. Copy that link to your clipboard.
  4. Run the following commands. Quotation marks ("") are required, and text within <> should be replaced with your own values. Be sure to read the explanation that follows each command:
    1. cd /corral-repl/utexas/breseq
    2. ls
      • This should return a list of 3 things: "genomes, temp and gsaf_download.sh".
      • IF THERE ARE ADDITIONAL FILES PRESENT, DO NOT CONTINUE. See below for more information.
    3. gsaf_download.sh "<your link here>"
      • This will take a non-trivial amount of time to complete. You will get a new line for each fastq file being downloaded beginning with "Downloading:".
      • Once the download is complete each file will be checked with md5sum to ensure accurate file transfer.
      • If a final line states "Downloaded # files successfully." continue to next step, else see next section.
    4. mkdir genomes/utexas_gsaf/Project_JA<your JA # here>
      • This is the standard GSAF naming system we use. If these are additional reads from the same job, this directory may already exist.
    5. chmod 775 *.gz
    6. mv -i *.gz genomes/utexas_gsaf/Project_JA<your JA # here>
      • Make special note of the "-i" flag after mv. This is absolutely required to avoid overwriting existing files. If a prompt appears asking to overwrite, the answer is always no.
    7. rm *.wget.log
    8. rm files.html
    9. rm md5.txt
    10. ls
      • This should again only list 3 things: "genomes, temp and gsaf_download.sh". If additional files are present, you need to figure out what has happened. Do not just ignore them as it will cause problems for other people the next time they try to download their data.
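
For reference, steps 4.1 through 4.10 can be sketched as a single bash function. This is an illustration only, not a tested replacement for running the commands by hand. The function name retrieve_gsaf_data, its three arguments, and the optional staging-directory override are hypothetical. It assumes gsaf_download.sh can be invoked by name exactly as in step 4.3, and it uses "mkdir -p" so that an already existing project directory is not an error.

```shell
# Sketch of steps 4.1-4.10. The function name and its arguments are
# hypothetical; the body runs in a subshell so the caller's directory is kept.
retrieve_gsaf_data() (
    link="$1"                                   # link copied from the GSAF email
    ja_number="$2"                              # the number from your JA#
    staging="${3:-/corral-repl/utexas/breseq}"  # shared download directory
    dest="genomes/utexas_gsaf/Project_JA${ja_number}"
    expected="genomes gsaf_download.sh temp"

    # Step 1
    cd "$staging" || exit 1

    # Step 2: stop unless ls shows exactly the three expected entries.
    if [ "$(ls | LC_ALL=C sort | paste -sd ' ' -)" != "$expected" ]; then
        echo "Additional files present in $staging. Do not continue." >&2
        exit 1
    fi

    # Step 3: download, keeping a copy of the output outside the staging
    # directory so the success line can be checked afterwards.
    log="$(mktemp)"
    gsaf_download.sh "$link" 2>&1 | tee "$log"
    if ! grep -q 'Downloaded [0-9][0-9]* files successfully\.' "$log"; then
        echo "No success line found. See the notes on md5sum failures." >&2
        rm -f "$log"
        exit 1
    fi
    rm -f "$log"

    # Steps 4-6: create the project directory, set permissions, move the reads.
    # mv -i prompts before overwriting; the answer is always no.
    mkdir -p "$dest" || exit 1
    chmod 775 -- *.gz || exit 1
    mv -i -- *.gz "$dest"/ || exit 1

    # Steps 7-9: remove the files left behind by the download.
    rm -f -- *.wget.log files.html md5.txt

    # Step 10: the directory should again hold only the three expected entries.
    ls
    [ "$(ls | LC_ALL=C sort | paste -sd ' ' -)" = "$expected" ]
)
```

A call would look like: retrieve_gsaf_data "<your link here>" <your JA # here>. A non-zero exit status means one of the checks failed and the directory needs to be inspected by hand.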

Notes on what to do if md5sum check fails

Having not seen a download fail, I do not know what the output will say or the best way to fix it. My assumption is that you should first remove all files that have been downloaded or created (i.e., everything that is not "genomes, temp and gsaf_download.sh") and restart the process. If it fails again, I'd use Google or ask for help to figure out what is going wrong.
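
If you want to re-check the files yourself before deleting anything, something like the following may help. It is a sketch under one unverified assumption: that md5.txt is a checksum list in the standard md5sum format (one "<hash>  <filename>" pair per line). If the file has a different layout, this will not work as written. The function name verify_downloads is hypothetical.

```shell
# Hypothetical helper: re-check downloaded files against a checksum list.
# ASSUMPTION: the list is in standard md5sum format ("<hash>  <filename>").
# Run it from the directory that holds the downloaded files.
verify_downloads() {
    local checksum_file="${1:-md5.txt}"
    # md5sum -c prints one OK/FAILED line per file and exits non-zero
    # if any file does not match its recorded checksum.
    if md5sum -c "$checksum_file"; then
        echo "All files match their checksums."
    else
        echo "At least one file failed; remove the downloaded files and restart." >&2
        return 1
    fi
}
```

Run it as "verify_downloads md5.txt" before step 4.9 removes md5.txt.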

Notes on why you must stop if additional files are present at initial ls check.

If additional files are present, it suggests someone else is also downloading their data, and the following commands will incorrectly assign read files to projects. Option 1 is to wait for the other person to finish. Option 2 is to download your files to a different directory. Do not download directly to project directories, as it may cause files to be overwritten.
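
The initial check and option 1 (waiting) can be sketched as two small bash functions. The names staging_is_clean and wait_for_clean_staging, the default path argument, and the 300-second polling interval are all hypothetical choices for illustration. The check only reports entries beyond the three expected ones; it does not notice if one of the expected entries is missing.

```shell
# Hypothetical helper: succeed only if the directory holds nothing besides
# genomes, temp and gsaf_download.sh; otherwise list the extras on stderr.
staging_is_clean() {
    local staging="${1:-/corral-repl/utexas/breseq}"
    local extras
    # -v inverts the match, -x matches whole lines, -F treats names literally.
    extras="$(ls "$staging" | grep -v -x -F -e genomes -e temp -e gsaf_download.sh || true)"
    if [ -n "$extras" ]; then
        echo "Unexpected entries in $staging:" >&2
        echo "$extras" >&2
        return 1
    fi
}

# Hypothetical helper for option 1: poll until the other download is finished.
wait_for_clean_staging() {
    local staging="${1:-/corral-repl/utexas/breseq}"
    local interval="${2:-300}"   # seconds between checks (arbitrary default)
    until staging_is_clean "$staging"; do
        sleep "$interval"
    done
}
```

Waiting only helps if the other person finishes and cleans up. If the extra files never disappear, ask who they belong to rather than removing them yourself.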

-- Main.DanielDeatherage - 03 Jul 2014


Contributors to this topic: DanielDeatherage
Topic revision: r2 - 2014-07-31 - 21:49:10 - Main.DanielDeatherage