NGS Data Retrieval


This is a series of commands to download NGS data from the GSAF and back it up. Make sure you make it all the way through the steps to avoid future problems for yourself and others.


  1. Receive an email from the GSAF stating that your sequencing results are done.
    • The email will have a subject line of "GSAF Sequencing Data for job JA# on sequencing run SA# is now ready", where # is replaced by your specific information.
  2. The email will contain a hyperlink to an amazonaws website. The link text will look something like 'Access your data for JA# from sequencing run SA# here.'
  3. Copy that link to your clipboard.
  4. Run the following commands. The " " marks are required, and text within <> should be replaced with your own values. Be sure to read the explanation that follows each command:
    1. cd /corral-repl/utexas/breseq
    2. ls
      • This should return a list of 6 things: ",, corral, genomes,, and ".
      • IF THERE ARE ADDITIONAL FILES PRESENT, DO NOT CONTINUE. See below for more information.
    3. ./ "<your link here>"
      • Note the " " marks are required.
      • This will take a non-trivial amount of time to complete. You will get a new line for each fastq file being downloaded beginning with "Downloading:".
      • Once the download is complete each file will be checked with md5sum to ensure accurate file transfer.
      • If a final line states "Downloaded # files successfully." continue to next step, else see next section.
    4. mkdir genomes/utexas_gsaf/Project_JA<your JA # here>
      • This is the standard GSAF naming system we use. If these are additional reads from the same job, this directory may already exist.
    5. chmod 775 *.gz
    6. mv -i *.gz genomes/utexas_gsaf/Project_JA<your JA # here>
      • Make special note of the "-i" flag after mv. This is absolutely required to avoid overwriting existing files. If a prompt appears asking to overwrite, the answer is always no.
    7. rm *.wget.log
    8. rm files.html
    9. rm md5.txt
    10. ls
      • This should again only list 3 things: "genomes, temp and". If additional files are present, you need to figure out what has happened. Do not just ignore them, as they will cause problems for other people the next time they try to download their data.
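
Steps 4 through 6 above can be sketched as a small shell function. This is only an illustration and not part of the lab's download script: the function name file_reads and its two arguments are hypothetical, and "mv -n" (never overwrite) stands in for answering "no" to every "mv -i" prompt.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the lab's scripts): file downloaded
# reads into a project directory without overwriting anything.
# Usage: file_reads <source_dir> <project_dir>
file_reads() {
    local src="$1" dest="$2" f leftover=0

    # -p makes this safe when the project directory already exists
    # (additional reads from the same job).
    mkdir -p "$dest" || return 1

    for f in "$src"/*.gz; do
        [ -e "$f" ] || continue      # the glob matched no .gz files
        chmod 775 "$f"
        # -n never overwrites, the same outcome as answering "no" to mv -i.
        mv -n "$f" "$dest"/ || true
        # If the source file is still here, it was not moved.
        if [ -e "$f" ]; then
            echo "NOT moved: $f (a file of that name may already exist in $dest)" >&2
            leftover=$((leftover + 1))
        fi
    done

    # Non-zero exit status if anything was left behind.
    [ "$leftover" -eq 0 ]
}
```

It would be called from the download directory as: file_reads . genomes/utexas_gsaf/Project_JA<your JA # here>. Any file reported as "NOT moved" needs to be looked at by hand before the cleanup steps.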

Notes on what to do if md5sum check fails

Having not seen a download fail, I do not know what the output will say, nor the best way to fix it. My assumption is that you should first remove all files that were downloaded or created (i.e. everything that is not "genomes and") and restart the process. If it fails again, search online or ask for help to figure out what is going wrong.
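
If you want to repeat the integrity check by hand, for example after restarting a download, something like the following may work. This is a sketch under an assumption: that md5.txt uses the standard md5sum layout (checksum, two spaces, file name). The layout actually written by the download script is not documented here, so look at the file first. The function name verify_reads is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: re-check downloaded files against a checksum list.
# ASSUMPTION: the checksum file is in standard md5sum format.
# Usage: verify_reads <dir> [checksum_file]   (defaults to md5.txt)
verify_reads() {
    local dir="$1" sums="${2:-md5.txt}"
    # Run in a subshell so the caller's working directory is unchanged.
    (
        cd "$dir" || exit 1
        # --quiet prints only the files that fail; the exit status is
        # non-zero if any file is missing or does not match its checksum.
        md5sum --check --quiet "$sums"
    )
}
```

For example, "verify_reads ." run in the download directory prints nothing and returns 0 when every listed file matches.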

Notes on why you must stop if additional files are present at initial ls check.

If additional files are present, it suggests someone else is also downloading their data, and the following commands will incorrectly assign read files to projects. Option 1 is to wait for the other person to finish. Option 2 is to download your files to a different directory. Do not download directly to project directories, as it may cause files to be overwritten.
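
The initial ls check can also be automated. The sketch below is hypothetical (the function unexpected_entries is not part of the lab's scripts): it prints every entry in a directory that is not on a list of allowed names. The allowed names are passed in by the caller, because they depend on what normally lives in the download directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print entries in <dir> that are not on the allowed list.
# Usage: unexpected_entries <dir> <allowed_name>...
# Exit status 0 means nothing unexpected was found; 1 means extras exist.
# Note: the * glob does not match hidden (dot) files.
unexpected_entries() {
    local dir="$1" entry name ok allowed found=0
    shift
    for entry in "$dir"/*; do
        [ -e "$entry" ] || continue      # directory is empty
        name=${entry##*/}                # strip the leading path
        allowed=no
        for ok in "$@"; do
            if [ "$name" = "$ok" ]; then
                allowed=yes
            fi
        done
        if [ "$allowed" = no ]; then
            echo "$name"
            found=1
        fi
    done
    [ "$found" -eq 0 ]
}
```

A call such as "unexpected_entries /corral-repl/utexas/breseq genomes temp <script names>" would list anything left behind by another download; only continue when it prints nothing.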

Version History.

This section contains information on versions of the scripts, the dates each script was used, and archived versions. At this time, the plan is to change a script only when the existing script stops working, which suggests that old scripts will not work with new sequence files. Care will be taken that new scripts preserve the same command line execution.

| Version | Start Date | End Date | Changed by | Changes |
| Version 1 | July 2014 | November 2014 | DED | Original |
| Version 2 | November 2014 | ? | DED | grep commands changed around July 25 by SHS do not seem to have broken functionality. Changes made on October 24th seem to have broken functionality, probably by complementary changes to the "files.html" file. |

-- Main.DanielDeatherage - 03 Jul 2014

Topic attachments
| Attachment | Size | Date | Who | Comment |
| shsh | 0.9 K | 19 Nov 2014 - 20:43 | Main.DanielDeatherage | Version 1 script |

Topic revision: r5 - 03 Oct 2016 - 16:08:23 - Main.DanielDeatherage