---+!! Computing Environment Setup for Bioinformatics and Computational Biology

%TOC%

So, you want to harness the immense power of bioinformatics and computational biology for your science? Here's some advice that will save you headaches and make your life easier when working in a Linux/UNIX environment.

*NOTE: These instructions are for setting up your Linux/UNIX computational environment. This might be on your local machine, but it could also be on a computing cluster like TACC that you log into remotely.*

Instructions for setting up your local machine (for example, your laptop) with programs for editing text, accessing remote servers, etc., are covered over at [[ProtocolsComputerSetup][Computer Setup]].

---++ Introduction to the Shell

You will want to learn basic Unix commands and syntax for navigating your command-line environment and running commands. These include things like copying files, interrupting a process, and redirecting the input/output of a program. Here is one useful [[http://www.ee.surrey.ac.uk/Teaching/Unix/index.html][Introduction to Unix]] that covers these commands and concepts.

---+++ Environmental Variables

---+++ Login Scripts and your <code>$PATH</code> (<code>.bashrc</code>)

---++ Using TACC

---+++ Connecting: Head Nodes and Compute Nodes.

The current system on TACC that we use for most of our computing is lonestar6. Its address is =ls6.tacc.utexas.edu=, so to ssh to it you use:

%CODE{bash}%
ssh <username>@ls6.tacc.utexas.edu
%ENDCODE%

After you enter your password and make it past 2FA, you get a shell on the *HEAD NODE*. This machine acts as the brain of the cluster. Its function is to send tasks to its many *COMPUTE NODES*.

*DO NOT* run any computationally demanding or long tasks on the head node. Doing so inconveniences others by making the machine slow. Your command will be killed if it uses too many resources, and you may be banned from TACC.

Instead, you can get an interactive shell on a *COMPUTE NODE* using this command:

%CODE{bash}%
idev -m 60
%ENDCODE%

The =-m 60= asks for a 60-minute slot on one compute node. Currently, you can make this as high as 120 minutes. For longer jobs, you will need to learn about submitting jobs to the queue.

After some informational messages, a new prompt will appear, and you can now run commands on the *COMPUTE NODE*. Compute nodes have a lot of cores (processors) and memory (RAM), so you can (and should) run many jobs in parallel on one of these nodes if you are using it for compute.

The =idev= command is mostly meant for _development_ (that is, writing and testing new code/tools), but it can be used for short tasks, particularly if you are using a job manager like Snakemake that can intelligently use the resources.

If you get lost and can't remember whether you are on the *HEAD NODE* or a *COMPUTE NODE*, you can use this command:

%CODE{bash}%
hostname
%ENDCODE%

If the name it returns contains "login", then you are on the *HEAD NODE*.

---+++ Filesystems: =$HOME=, =$WORK=, and =$SCRATCH=

---++ Conda 101

Whether on TACC or your own computer, you'll want to become familiar with the Conda package/environment manager. It makes it easy to install a wide variety of command-line tools in a way that prevents them from interfering with one another or with other settings on your system.

---+++ Set up Conda, Mamba, Bioconda

*Conda* is the main framework. *Mamba* speeds up Conda installs (once it is installed, use <code>mamba</code> everywhere you would use <code>conda</code> for running commands). *Bioconda* makes it possible to install additional packages related to bioinformatics and computational biology. You'll want all three of these working together in your environment.

   1 Install Conda (the Miniconda flavor). Using the [[https://docs.anaconda.com/free/miniconda/#quick-command-line-install][Quick Command Line Install]] instructions is probably easiest, especially on TACC.
   1 Reload your shell (close and reopen the terminal, or log out and log back in) so you are in your conda =base= environment.
   1 Install Mamba using these commands:%BR% %CODE{bash}%
conda install mamba
mamba init
%ENDCODE%
   1 Set up Bioconda: [[https://bioconda.github.io/][run the commands here]].

---+++ Using Conda Environments

Conda environments are a way to:

   1 Insulate different installed tools from one another to prevent incompatibilities and unexpected interactions.
   1 Manage and save exactly which versions of different tools you used for an analysis.

When you open a new shell, by default the <code>base</code> conda environment will be loaded. It's OK to install some general-purpose utilities in this environment, but you should generally *install each of your major bioinformatics tools (or sets of tools) in its own environment*.

This sequence of commands creates an environment called <code>breseq-env</code> and installs _breseq_ in it:

%CODE{bash}%
# Create a new, empty environment
mamba create -n breseq-env
# Switch into it
mamba activate breseq-env
# Install breseq into the active environment
mamba install breseq
%ENDCODE%

Let's say you were trying to reproduce results from an older paper. You may want to install a specific version of _breseq_ in your environment. In this case, you'd use this variant:

%CODE{bash}%
mamba install breseq=0.36.1
%ENDCODE%

Another very useful set of commands can save your active environment to a =yaml= file:

%CODE{bash}%
conda env export > environment.yml
%ENDCODE%

Or load an environment from a =yaml= file created by someone else, so you can reproduce their work!

%CODE{bash}%
conda env create -f environment.yml
%ENDCODE%

Many other possibilities are covered in the official Conda documentation under [[https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html][managing environments]].

---++ Miscellaneous Timesavers

   * [[https://linuxhandbook.com/linux-alias-command/][Use the alias command to set up shortcuts]] %BR% For example, type =sshl= to connect to lonestar6 using your username.
   * [[ProtocolsSSHPublicKeyAuthentication][Setting up SSH Public Key Authentication]] %BR% Save yourself from typing your password every time you connect to TACC or another server.
   * [[CommandLineBox][Using lftp to copy files to/from UT Box]] %BR% Copy and mirror files to/from your computer, TACC, or other servers.

---++ See Also

   * [[https://wikis.utexas.edu/display/bioiteam/Linux+and+stampede2+Setup+--+GVA2022][Genome Variant Analysis Course 2022 - Linux and Stampede2 setup (by Dan Deatherage)]]
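
A minimal sketch of two shortcuts mentioned on this page (the =sshl= alias and the =hostname= check for "login"), written as lines you might add to your =~/.bashrc=. The =TACC_USER= variable, its placeholder value, and the helper name =node_type= are illustrative assumptions, not something TACC provides.

```bash
# Hypothetical placeholder: replace with your own TACC username.
TACC_USER="your_username"

# Shortcut: typing `sshl` in an interactive shell connects to lonestar6.
alias sshl="ssh ${TACC_USER}@ls6.tacc.utexas.edu"

# Illustrative helper: report whether a hostname looks like a head node,
# using the rule that head-node names contain "login".
# With no argument, it checks the machine you are currently on.
node_type() {
    name="${1:-$(hostname)}"
    case "$name" in
        *login*) echo "head node" ;;
        *)       echo "not a head node" ;;
    esac
}
```

After reloading your shell, =sshl= opens the connection and =node_type= prints which kind of machine you are on, which is a quick check before starting anything computationally demanding.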
Topic revision: r4 - 2024-07-09 - JeffreyBarrick