Difference: ToolsRNAStructureMutualInformation (1 vs. 5)

Revision 5 - 2023-09-20 - JeffreyBarrick

Changed:
<
<
META TOPICPARENT name="ToolList"
>
>
META TOPICPARENT name="SoftwareList"
 

RNA Structure Mutual Information

Overview: What do these programs do?

Mutual information between columns in a sequence alignment of structured RNAs can provide evidence for base pairs in a secondary structure model. For example, if there is a Watson-Crick base pair between two columns, then one may have a perfectly balanced distribution of bases A, U, C, and G at each position, but only observe GC, CG, AU, UA pairs. Since you can predict the identity of the base in the second column from the first, there is more overall information (less Shannon entropy) in the sequence string than would be predicted from each column considered alone. This correlation is also commonly referred to as "base covariation".
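The Watson-Crick example above can be worked through with the standard definition of mutual information between two columns. This is only a sketch: the symbols f_i, f_j and f_ij (single-column and joint base frequencies) and H (Shannon entropy) are introduced here for illustration, and the scripts may additionally apply sequence weights or treat gaps in ways not shown.

```latex
% Standard mutual information between alignment columns i and j, in bits.
\[
  \mathrm{MI}(i,j) \;=\; \sum_{x,y} f_{ij}(x,y)\,\log_2 \frac{f_{ij}(x,y)}{f_i(x)\,f_j(y)}
  \;=\; H_i + H_j - H_{ij}
\]
% Worked example: each base is equally frequent in both columns,
% but only the four Watson-Crick combinations are ever observed.
\[
  f_i(x) = f_j(y) = \tfrac{1}{4}, \qquad
  f_{ij}(x,y) = \tfrac{1}{4} \ \text{for}\ (x,y) \in \{\mathrm{GC}, \mathrm{CG}, \mathrm{AU}, \mathrm{UA}\}
\]
\[
  \mathrm{MI}(i,j) \;=\; 4 \cdot \tfrac{1}{4}\,\log_2 \frac{1/4}{(1/4)(1/4)} \;=\; \log_2 4 \;=\; 2\ \text{bits}
\]
```

Each column alone carries 2 bits of entropy, and so does the pair, so all of the uncertainty in the second column is removed by knowing the first.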

Raw mutual information scores between columns in RNA sequence alignments are typically poor indicators of base interactions, primarily because: (1) Real RNA alignments contain a limited number of sequences, so elevated mutual information between columns can arise by chance from this limited sampling. (2) If the base present in one column rarely varies, then there are few opportunities to observe covariation with other columns, so raw mutual information scores involving the conserved column are depressed. (3) The sequences in an RNA alignment usually have shared evolutionary histories, such that correlations between positions can occur due to common descent rather than structural constraints.

These MI Perl scripts run an analysis pipeline that estimates the statistical significance of mutual information scores between columns in an alignment, corrected for these considerations. In addition to providing evidence for base-paired stems in RNA structures, this procedure has been used to predict non-Watson-Crick pairs based on unusual covariation patterns and to support the identification of new RNA structure motif examples. Note that this entire analysis relies on the quality of the input alignment: it can only find correlations between columns as they are aligned, and it requires a fair number of divergent sequences in the input alignment.

More information on this approach can be found in the following publication, where it was applied to ten riboswitch families. If you use predictions from these programs, please cite:

Barrick, J.E. and Breaker, R.R. 2007. The structures, distributions, and mechanisms of metabolite-binding riboswitches. Genome Biology 8:R239. «PubMed»

Installation

The mutual information Perl scripts have been developed and tested on MacOS X. It should be straightforward to install the required programs on any Unix-style system. If you encounter problems, let me know.

Install a modified version of rate4site

The program rate4site is used to infer a phylogenetic tree with per-site substitution rates from the observed sequence alignment. In order to properly function on RNA alignments, rate4site requires a minor source code modification to deal with gap characters as a separate state. For convenience, I have included a modified version of the complete source release for download here.

Download rate4site version 2.01 (Nov06), modified 18 December 2008

The included Makefile is for compiling under Windows. Instead, compile using this command:

g++ -Dunix -DDOUBLEREP -o rate4site -O3 *.cpp

Finally, copy the new rate4site executable to a bin directory (e.g. /usr/local/bin) or add its location to your $PATH so that the mutual information scripts can invoke it.
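As a minimal sketch of this step, the helper below copies an executable into a bin directory and prepends that directory to $PATH for the current shell session. The function name install_to_bin and the default location $HOME/bin are assumptions made for illustration; /usr/local/bin works equally well but usually requires administrator rights.

```shell
# Copy an executable into a bin directory and make sure that directory is on PATH.
# install_to_bin and the default "$HOME/bin" are illustrative choices, not part of the MI scripts.
install_to_bin() {
    exe=$1
    bindir=${2:-"$HOME/bin"}
    mkdir -p "$bindir" || return 1
    cp "$exe" "$bindir/" || return 1
    chmod +x "$bindir/$(basename "$exe")" || return 1
    case ":$PATH:" in
        *":$bindir:"*) ;;                        # directory is already on PATH
        *) PATH="$bindir:$PATH"; export PATH ;;  # otherwise prepend it for this session
    esac
}
```

After compiling, it could be called as install_to_bin ./rate4site. To make the PATH change permanent, the corresponding export line would need to go in your shell startup file.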

Install the esl-weight program from Infernal

Chances are that you already have Infernal installed if you routinely work with RNA alignments. If not, download Infernal, compile, and install according to the included instructions.

External site Official Infernal Site

These scripts only need the esl-weight utility program that is included in the easel subpackage. This program will be compiled by default, but may NOT be installed. For infernal-1.0rc5, you can find the binary at infernal-1.0rc5/easel/miniapps/esl-weight in your Infernal package. Manually install this program into your path (e.g. copy to /usr/local/bin) or add its location to your $PATH so that the mutual information scripts can employ it.

Install BioPerl

These scripts use modules for reading phylogenetic trees from BioPerl. Download and install BioPerl according to the official instructions.

External site Official BioPerl Site

Install mutual information scripts

Finally, download and extract the mutual information Perl scripts themselves. You will (naturally) need Perl to be installed on your system to use these.

Download MI Scripts version 1, 27 December 2008

You should be able to run these Perl scripts from their current location or add them to your $PATH.
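Before a first run, it may help to confirm that everything installed above can be found. The sketch below checks for perl, rate4site and esl-weight on $PATH and tries to load a BioPerl module. The function name check_mi_deps is hypothetical, and testing for Bio::TreeIO (BioPerl's tree-reading module) is an assumption about which module the scripts load.

```shell
# Report any missing dependency of the MI scripts; returns non-zero if something is missing.
check_mi_deps() {
    missing=0
    for tool in perl rate4site esl-weight; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    # Assumption: the scripts read trees through Bio::TreeIO; adjust if they use another module.
    if command -v perl >/dev/null 2>&1 && ! perl -MBio::TreeIO -e 1 >/dev/null 2>&1; then
        echo "missing: BioPerl (Bio::TreeIO)"
        missing=1
    fi
    return $missing
}
```

Running check_mi_deps with no output and a zero exit status suggests that the programs are at least discoverable; it does not verify that rate4site is the modified build.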

Usage

The analysis pipeline can be run using the single script mutual_information_significance.pl.

Usage

mutual_information_significance.pl -i stockholm.stk -o stockholm.mi.stk [-r 200 -n 300]

The input RNA alignment file in Stockholm format (stockholm.stk) MUST have an RF line. See the example alignments that are included and the Infernal documentation for a description of this line. An RF line will be generated by default if the Infernal program cmalign was used to generate the Stockholm file. You may want to alter or construct this line yourself. Only columns that contain non-gap characters in the RF line will be considered when removing redundant sequences.
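A quick way to check for the RF line before starting a long run is to look for the Stockholm per-column markup tag #=GC RF. The helper name has_rf_line is made up for this sketch.

```shell
# Succeeds (exit status 0) if the Stockholm file contains a "#=GC RF" annotation line.
has_rf_line() {
    grep -q '^#=GC[[:space:]][[:space:]]*RF[[:space:]]' "$1"
}
```

For example, has_rf_line stockholm.stk || echo "no RF line found" prints a warning when the line is absent.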

The input alignment is processed before MI is calculated. First, identical sequences in the alignment are removed. Second, sequences that share the most identity with other sequences are removed until fewer than a specified number of sequences remain in the alignment. Third, sequence weights are calculated according to the GSC algorithm to de-emphasize closely related sequences. Fourth, all columns that are >50% gaps (taking into account sequence weights) are removed. These steps reduce the number of columns and sequences that must be considered in further calculations and typically do not affect the calculated MI significance scores.
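To illustrate only the first of these steps, the sketch below drops sequences whose aligned string duplicates an earlier one. It is not the pipeline's own code: the name drop_identical_seqs is hypothetical, it assumes a non-interleaved (single-block) Stockholm file, and it compares all columns rather than only the RF-annotated ones that the scripts consider.

```shell
# Print a Stockholm file, keeping only the first sequence for each distinct aligned string.
# Simplifications: single-block file, all columns compared, no sequence weighting.
drop_identical_seqs() {
    awk '
        /^#/ || /^\/\// || NF < 2 { print; next }   # pass through markup, terminator, blank lines
        !seen[$2]++                                  # print a sequence line only on first occurrence
    ' "$1"
}
```

The later steps (pruning to the most diverse sequences, GSC weighting, and weighted gap-column removal) depend on all-against-all comparisons and on esl-weight, so they are not sketched here.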

The two parameters that you may want to adjust are -n, which controls the maximum number of sequences allowed after pruning to the most diverse, and -r, which specifies how many different random alignments to generate to estimate the _p_-value significance of the actual MI score between each pair of columns (default = 200). The more random alignments used, the better the precision of the estimated _p_-value. For production runs, a value between 1000 and 10000 should be used. Note that a randomization procedure is used to estimate _p_-values, so they may differ slightly if you run the same procedure twice on the same input file.
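A rough way to see why more randomizations help is sketched below. It assumes the p-value is estimated as the fraction of the r random alignments whose MI score is at least as large as the observed one; the exact estimator used by the scripts is not stated here, and the symbols k, r and p are introduced only for this illustration.

```latex
% Assumed empirical estimator: k of the r random alignments score at least as high as the real one.
\[
  \hat{p} = \frac{k}{r}, \qquad
  \text{smallest non-zero value} = \frac{1}{r}, \qquad
  \mathrm{SE}(\hat{p}) \approx \sqrt{\frac{p\,(1-p)}{r}}
\]
% Example for a true p of 0.01:
\[
  r = 200:\ \ \mathrm{SE} \approx \sqrt{\tfrac{0.01 \times 0.99}{200}} \approx 0.0070
  \qquad\qquad
  r = 10000:\ \ \mathrm{SE} \approx \sqrt{\tfrac{0.01 \times 0.99}{10000}} \approx 0.0010
\]
```

Under this assumption, the default of 200 cannot resolve p-values below 0.005, and an estimate near 0.01 carries an uncertainty of similar size, which is consistent with the recommendation of larger values for production runs.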

The output file (stockholm.mi.stk) contains the pruned alignment with additional per-column annotation showing column pairs sorted by the significance of the mutual information between them.
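In Stockholm format, per-column annotation is stored on lines beginning with #=GC. The sketch below lists the annotation tags present in a file, which can be a quick way to locate what the pipeline added to the output. The helper name list_gc_tags is hypothetical, and the names of the tags that the scripts write are not assumed.

```shell
# List the distinct per-column (#=GC) annotation tags found in a Stockholm file.
list_gc_tags() {
    awk '/^#=GC/ { print $2 }' "$1" | sort -u
}
```

Comparing list_gc_tags stockholm.stk with list_gc_tags stockholm.mi.stk shows which annotation lines are new in the output.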

Be forewarned that, depending on (1) the number of sequences, (2) the number of columns in your alignment, and (3) the number of resamplings requested for estimating p-values, this procedure can be extremely slow, and the intermediate resampling of alignments can require large amounts of free disk space. If you must interrupt the script, it can usually be called again later from within the same working directory, and execution will pick up where it left off rather than restarting, if possible. If this procedure is too intensive for your input alignment, it is possible to parallelize the calculation of mutual information from each resampled alignment (each alignment in the resampled-tree directory must be used to generate an MI file in the resampled-mi directory).
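When splitting this work across processes, a first step is finding which resampled alignments still lack an MI file. The sketch below lists them, under the loudly stated assumption that each MI file has the same base name as its alignment; check the actual file names in your working directory before relying on it. The function name pending_resamples is hypothetical.

```shell
# Print every file in the resampled-tree directory that has no same-named file in resampled-mi.
# Assumption: input and output files share a base name; adjust the test if they do not.
pending_resamples() {
    tree_dir=${1:-resampled-tree}
    mi_dir=${2:-resampled-mi}
    for f in "$tree_dir"/*; do
        [ -f "$f" ] || continue
        [ -e "$mi_dir/$(basename "$f")" ] || printf '%s\n' "$f"
    done
}
```

The resulting list could then be handed to a job scheduler, or to xargs with its -P option (available in GNU and BSD xargs), together with whichever command computes MI for a single alignment. That per-alignment command is not named here, so it is left out of the sketch.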

Example

Several example Stockholm alignments are provided in the Examples folder. To "quickly" test this procedure and generate example output, use the smallest one: SAM-II.stk. From within this path, run:

mutual_information_significance.pl -i SAM-II.stk -o SAM-II.mi.stk

This command should take less than 30 minutes to complete. Several intermediate files and folders will be created. Open the resulting output file SAM-II.mi.stk in a text editor. The other files can be deleted after the command completes.

META FILEATTACHMENT attachment="rate4site.tgz" attr="" comment="" date="1229632774" name="rate4site.tgz" path="rate4site.tgz" size="224124" stream="rate4site.tgz" user="Main.JeffreyBarrick" version="1"
META FILEATTACHMENT attachment="mi.tgz" attr="" comment="" date="1230395051" name="mi.tgz" path="mi.tgz" size="114551" stream="mi.tgz" user="Main.JeffreyBarrick" version="1"

Revision 4 - 2008-12-27 - JeffreyBarrick

 
META TOPICPARENT name="ToolList"
Changed:
<
<
Warning!
This page is under construction. The instructions are currently not complete!
>
>

RNA Structure Mutual Information

 
Deleted:
<
<

Mutual Information Support for RNA Secondary Structure Models

Predicting base interactions in RNA structures from the phylogenetic-significance of mutual information between columns.

 

Overview: What do these programs do?

Changed:
<
<
Reference.
>
>
Mutual information between columns in a sequence alignment of structured RNAs can provide evidence for base pairs in a secondary structure model. For example, if there is a Watson-Crick base pair between two columns, then one may have a perfectly balanced distribution of bases A, U, C, and G at each position, but only observe GC, CG, AU, UA pairs. Since you can predict the identity of the base in the second column from the first, there is more overall information (less Shannon entropy) in the sequence string than would be predicted from each column considered alone. This correlation is also commonly referred to as "base covariation".
 
Added:
>
>
Raw mutual information scores between columns in RNA sequence alignments are typically poor indicators of base interactions, primarily because: (1) Real RNA alignments contain a limited number of sequences. There is a chance of observing elevated mutual information between columns by chance due to this limited sampling. (2) If the base present in one column rarely varies, then there are few opportunities to observe covariation with other columns. This means raw mutual information scores including the conserved column are depressed. (3) the sequences in an RNA alignment usually have shared evolutionary histories such that correlations between positions occur due to common descent rather than structural constraints.

These MI Perl scripts run an analysis pipeline that estimates the statistical significance of mutual information scores between columns in an alignment, corrected for these considerations. In addition to providing evidence for base-paired stems in RNA structures, this procedure has been used to predict Non-Watson-Crick pairs based on unusual covariation patterns and support the identification of new RNA structure motif examples. It is important to note that this entire analysis relies on the quality of the input alignment, i.e. it can only find correlations between columns as they are aligned and requires a fair number of divergent sequences in the input alignment.

More information on this approach can be found in the following publication where this approach was applied to ten riboswitch families. If you use predictions from these programs, please cite:

Barrick, J.E. and Breaker R.R. 2007. The structures, distributions, and mechanisms of metabolite-binding riboswitches. Genome Biology 8:R239. «PubMed»
 

Installation

Added:
>
>
The mutual information Perl scripts have been developed and tested on MacOS X. It should be straightforward to install the required programs on any Unix-style system. If you encounter problems, let me know.
 

Install a modified version of rate4site

The program rate4site is used to infer a phylogenetic tree with per-site substitution rates from the observed sequence alignment. In order to properly function on RNA alignments, rate4site requires a minor source code modification to deal with gap characters as a separate state. For convenience, I have included a modified version of the complete source release for download here.

Download Download rate4site version 2.01 (Nov06) modified 18 December 2008

The included Makefile is for compiling under Windows. Instead, compile using this command:

g++ -Dunix -DDOUBLEREP -o rate4site -O3 *.cpp
Changed:
<
<
Finally, copy the new rate4site executable to a bin directory (e.g. /usr/local/bin) or add its location to your $PATH so that the mutual information scripts can employ it.
>
>
Finally, copy the new rate4site executable to a bin directory (e.g. /usr/local/bin) or add its location to your $PATH so that the mutual information scripts can invoke it.
 

Install the esl-weight program from Infernal

Chances are that you already have Infernal installed if you routinely work with RNA alignments. If not, download Infernal, compile, and install according to the included instructions.

External site Official Infernal Site
Changed:
<
<
These scripts only need the esl-weight utility program that is included in the easel subpackage. This program will be compiled by default, but may NOT be installed by default. For infernal-1.0rc5, you can find the binary at infernal-1.0rc5/easel/miniapps/esl-weight. Manually move this program into your path (e.g. to /usr/local/bin) or add its location to your $PATH so that the mutual information scripts can employ it.
>
>
These scripts only need the esl-weight utility program that is included in the easel subpackage. This program will be compiled by default, but may NOT be installed. For infernal-1.0rc5, you can find the binary at infernal-1.0rc5/easel/miniapps/esl-weight in your Infernal package. Manually install this program into your path (e.g. copy to /usr/local/bin) or add its location to your $PATH so that the mutual information scripts can employ it.
 

Install BioPerl

Changed:
<
<
These scripts use modules for handling phylogenetic trees from BioPerl. Download and install BioPerl according to the instructions. Be sure that you add BioPerl to your Perl library path (e.g. by setting $PERL5LIB).
>
>
These scripts use modules for reading phylogenetic trees from BioPerl. Download and install BioPerl according to the official instructions.
 
External site Official BioPerl Site

Install mutual information scripts

Changed:
<
<
Finally, the download the scripts themselves.
>
>
Finally, the download and extract the mutual information Perl scripts themselves. You will (naturally) need Perl to be installed on your system to use these.
 
Changed:
<
<
Download Download MI Scripts version 1 21 December 2008
>
>
Download Download MI Scripts version 1 27 December 2008
  You should be able to run these Perl scripts from their current location or add them to your $PATH.

Usage

Added:
>
>
The analysis pipeline can be run using the single script mutual_information_significance.pl.

Usage

 
Changed:
<
<
Usage: mutual_information_significance.pl -i stockholm.stk -o stockholm.mi.stk [-r 200 -n 300]
>
>
mutual_information_significance.pl -i stockholm.stk -o stockholm.mi.stk [-r 200 -n 300]
 
Changed:
<
<
The input alignment is processed before MI is calculated. First, all columns that are >50% gaps (taking into account relative sequence weights) are removed. Second, identical sequences in the alignment are removed. Third, sequences that share the most identity with other sequences are removed until fewer than a certain number of sequences remain in the alignment. These steps reduce the number of columns and sequences that must be considered in further calculations and usually do not affect the calculated MI significance scores.
>
>
The input RNA alignment file in Stockholm format (stockholm.stk) MUST have an RF line. See the example alignments that are included and the Infernal documentation for a description of this line. An RF line will be generated by default if the Infernal program cmalign was used to generate the Stockholm file. You may want to alter or construct this line yourself. Only columns that contain non-gap characters in the RF line will be considered when removing redundant sequences.
 
Changed:
<
<
The two parameters that you may want to adjust are -r which is how many different random alignments to generate to estimate the _p_-value significance of the actual MI score between each pair of columns (default = 200) and -n to control the maximum number of sequences allowed after pruning to the most diverse.
>
>
The input alignment is processed before MI is calculated. First, identical sequences in the alignment are removed. Second, sequences that share the most identity with other sequences are removed until fewer than a specified number of sequences remain in the alignment. Third, sequence weights are calculated according to the GSC algorithm to de-emphasize closely related sequences. Fourth, all columns that are >50% gaps (taking into account sequence weights) are removed. These steps reduce the number of columns and sequences that must be considered in further calculations and typically do not affect the calculated MI significance scores.
 
Changed:
<
<
The input stockholm alignment MUST have an RF line. See the example alignments that are included and the Infernal documentation for a description of this line. This line will be generated by default if the Infernal program cmalign was used to generate the Stockholm file. You may want to alter or construct this line yourself. Only columns that contain non-gap characters in the RF line will be considered when removing redundant sequences.
>
>
The two parameters that you may want to adjust are -n to control the maximum number of sequences allowed after pruning to the most diverse and -r to specify how many different random alignments to generate to estimate the _p_-value significance of the actual MI score between each pair of columns (default = 200). The more random alignments used, the better the precision of the estimated _p_-value. For production runs, a value between 1000 to 10000 should be used. Note that a randomization procedure is used to estimate _p_-values, so they may differ slightly if you run the same procedure twice on the same input file.
 
Changed:
<
<
Be forewarned that depending on (1) the number of sequences, (2) the number of columns in your alignment, and (3) the number of resamplings requested for estimating p-values that this procedure can be extremely slow and the intermediate resampling of alignments can require large amounts of free disk space. If you must interrupt operation of the script, it can usually be called from within the same working directory later and execution will pick up where it left off rather than restarting, if possible. The calculation of mutual information from each resampled alignment can be parallelized (Each alignment in the resampled-tree directory must be used to generate a MI file in the resampled-mi directory).
>
>
The output file (stockholm.mi.stk) contains the pruned alignment with additional per-column annotation showing column pairs sorted by the significance of the mutual information between them.
 
Added:
>
>
Be forewarned that depending on (1) the number of sequences, (2) the number of columns in your alignment, and (3) the number of resamplings requested for estimating p-values that this procedure can be extremely slow and the intermediate resampling of alignments can require large amounts of free disk space. If you must interrupt operation of the script, it can usually be called from within the same working directory later and execution will pick up where it left off rather than restarting, if possible. If this procedure is too intensive for your input alignment, it is possible to parallelize the calculation of mutual information from each resampled alignment (Each alignment in the resampled-tree directory must be used to generate a MI file in the resampled-mi directory).
 

Example

Changed:
<
<
Two example Stockholm alignments are provided: FMN.stk and SAM-I.stk. For testing, run:
>
>
Several example Stockholm alignments are provided in the Examples folder. To "quickly" test this procedure and generate example output, use the smallest one: SAM-II.stk. From within this path, run:
 
Changed:
<
<
mutual_information_significance.pl -i SAM-I.stk -o SAM-I.mi.stk
>
>
mutual_information_significance.pl -i SAM-II.stk -o SAM-II.mi.stk
 
Changed:
<
<
Several intermediate files will be created. Open the resulting file SAM-I.mi.stk in a text editor.
>
>
This command should take less than 30 minutes to complete. Several intermediate files and folders will be created. Open the resulting output file SAM-I.mi.stk in a text editor. Other files can be deleted after the command completes.
 
META FILEATTACHMENT attachment="rate4site.tgz" attr="" comment="" date="1229632774" name="rate4site.tgz" path="rate4site.tgz" size="224124" stream="rate4site.tgz" user="Main.JeffreyBarrick" version="1"
Added:
>
>
META FILEATTACHMENT attachment="mi.tgz" attr="" comment="" date="1230395051" name="mi.tgz" path="mi.tgz" size="114551" stream="mi.tgz" user="Main.JeffreyBarrick" version="1"
 

Revision 3 - 2008-12-22 - JeffreyBarrick

 
META TOPICPARENT name="ToolList"
Warning!
This page is under construction. The instructions are currently not complete!
Changed:
<
<

Mutual Information Support for RNA Secondary Structure Models

>
>

Mutual Information Support for RNA Secondary Structure Models

  Predicting base interactions in RNA structures from the phylogenetic-significance of mutual information between columns.

Overview: What do these programs do?

Reference.

Changed:
<
<

Install a modified version of rate4site

>
>

Installation

 
Changed:
<
<
The program rate4site is used to infer a phylogenetic tree with per-site substitution rates that generates the observed sequence alignment. In order to properly function on RNA alignments, rate4site required a minor source code modification to deal with gap characters as a separate state. For convenience, I have included this modification on the latest source release for download here.
>
>

Install a modified version of rate4site

 
Added:
>
>
The program rate4site is used to infer a phylogenetic tree with per-site substitution rates from the observed sequence alignment. In order to properly function on RNA alignments, rate4site requires a minor source code modification to deal with gap characters as a separate state. For convenience, I have included a modified version of the complete source release for download here.
 
Download Download rate4site version 2.01 (Nov06) modified 18 December 2008
Changed:
<
<
The included Makefile is for compiling under Windows. To compile use this command instead:
>
>
The included Makefile is for compiling under Windows. Instead, compile using this command:
 
g++ -Dunix -DDOUBLEREP -o rate4site -O3 *.cpp
Changed:
<
<
Finally, add the new rate4site executable to your $PATH so that the mutual information scripts can employ it.
>
>
Finally, copy the new rate4site executable to a bin directory (e.g. /usr/local/bin) or add its location to your $PATH so that the mutual information scripts can employ it.
 
Changed:
<
<

Install the esl-weight program from Infernal

>
>

Install the esl-weight program from Infernal

  Chances are that you already have Infernal installed if you routinely work with RNA alignments. If not, download Infernal, compile, and install according to the included instructions.

External site Official Infernal Site
Changed:
<
<
These scripts actually only need the weight or esl-weight utility program that is included. This program will be compiled, but may NOT be installed by default. For infernal-1.0rc5, you can find the binary at infernal-1.0rc5/easel/miniapps/esl-weight. Either move this into a bin directory or add it to your $PATH so that the mutual information scripts can employ it.
>
>
These scripts only need the esl-weight utility program that is included in the easel subpackage. This program will be compiled by default, but may NOT be installed by default. For infernal-1.0rc5, you can find the binary at infernal-1.0rc5/easel/miniapps/esl-weight. Manually move this program into your path (e.g. to /usr/local/bin) or add its location to your $PATH so that the mutual information scripts can employ it.
Added:
>
>

Install BioPerl

These scripts use modules for handling phylogenetic trees from BioPerl. Download and install BioPerl according to the instructions. Be sure that you add BioPerl to your Perl library path (e.g. by setting $PERL5LIB).

External site Official BioPerl Site

Install mutual information scripts

Finally, the download the scripts themselves.

Download Download MI Scripts version 1 21 December 2008

You should be able to run these Perl scripts from their current location or add them to your $PATH.

Usage

Usage: mutual_information_significance.pl -i stockholm.stk -o stockholm.mi.stk [-r 200 -n 300]

The input alignment is processed before MI is calculated. First, all columns that are >50% gaps (taking into account relative sequence weights) are removed. Second, identical sequences in the alignment are removed. Third, sequences that share the most identity with other sequences are removed until fewer than a certain number of sequences remain in the alignment. These steps reduce the number of columns and sequences that must be considered in further calculations and usually do not affect the calculated MI significance scores.

The two parameters that you may want to adjust are -r which is how many different random alignments to generate to estimate the _p_-value significance of the actual MI score between each pair of columns (default = 200) and -n to control the maximum number of sequences allowed after pruning to the most diverse.

The input stockholm alignment MUST have an RF line. See the example alignments that are included and the Infernal documentation for a description of this line. This line will be generated by default if the Infernal program cmalign was used to generate the Stockholm file. You may want to alter or construct this line yourself. Only columns that contain non-gap characters in the RF line will be considered when removing redundant sequences.

Be forewarned that depending on (1) the number of sequences, (2) the number of columns in your alignment, and (3) the number of resamplings requested for estimating p-values that this procedure can be extremely slow and the intermediate resampling of alignments can require large amounts of free disk space. If you must interrupt operation of the script, it can usually be called from within the same working directory later and execution will pick up where it left off rather than restarting, if possible. The calculation of mutual information from each resampled alignment can be parallelized (Each alignment in the resampled-tree directory must be used to generate a MI file in the resampled-mi directory).

Example

Two example Stockholm alignments are provided: FMN.stk and SAM-I.stk. For testing, run:

mutual_information_significance.pl -i SAM-I.stk -o SAM-I.mi.stk

Several intermediate files will be created. Open the resulting file SAM-I.mi.stk in a text editor.

 
META FILEATTACHMENT attachment="rate4site.tgz" attr="" comment="" date="1229632774" name="rate4site.tgz" path="rate4site.tgz" size="224124" stream="rate4site.tgz" user="Main.JeffreyBarrick" version="1"

Revision 2 - 2008-12-18 - JeffreyBarrick

 
META TOPICPARENT name="ToolList"
Warning!
This page is under construction. The instructions are currently not complete!

Mutual Information Support for RNA Secondary Structure Models

Predicting base interactions in RNA structures from the phylogenetic-significance of mutual information between columns.

Overview: What do these programs do?

Reference.

Changed:
<
<

Compile modified version of rate4site

>
>

Install a modified version of rate4site

  The program rate4site is used to infer a phylogenetic tree with per-site substitution rates that generates the observed sequence alignment. In order to properly function on RNA alignments, rate4site required a minor source code modification to deal with gap characters as a separate state. For convenience, I have included this modification on the latest source release for download here.

Download Download rate4site version 2.01 (Nov06) modified 18 December 2008

The included Makefile is for compiling under Windows. To compile use this command instead:

Changed:
<
<
g++ -Dunix -o rate4site.exe -O3 *.cpp
>
>
g++ -Dunix -DDOUBLEREP -o rate4site -O3 *.cpp
 
Added:
>
>
Finally, add the new rate4site executable to your $PATH so that the mutual information scripts can employ it.

Install the esl-weight program from Infernal

Chances are that you already have Infernal installed if you routinely work with RNA alignments. If not, download Infernal, compile, and install according to the included instructions.

External site Official Infernal Site

These scripts actually only need the weight or esl-weight utility program that is included. This program will be compiled, but may NOT be installed by default. For infernal-1.0rc5, you can find the binary at infernal-1.0rc5/easel/miniapps/esl-weight. Either move this into a bin directory or add it to your $PATH so that the mutual information scripts can employ it.

META FILEATTACHMENT attachment="rate4site.tgz" attr="" comment="" date="1229632774" name="rate4site.tgz" path="rate4site.tgz" size="224124" stream="rate4site.tgz" user="Main.JeffreyBarrick" version="1"
 

Revision 1 - 2008-12-18 - JeffreyBarrick

 
META TOPICPARENT name="ToolList"
Warning!
This page is under construction. The instructions are currently not complete!

Mutual Information Support for RNA Secondary Structure Models

Predicting base interactions in RNA structures from the phylogenetic-significance of mutual information between columns.

Overview: What do these programs do?

Reference.

Compile modified version of rate4site

The program rate4site is used to infer a phylogenetic tree with per-site substitution rates that generates the observed sequence alignment. In order to properly function on RNA alignments, rate4site required a minor source code modification to deal with gap characters as a separate state. For convenience, I have included this modification on the latest source release for download here.

Download Download rate4site version 2.01 (Nov06) modified 18 December 2008

The included Makefile is for compiling under Windows. To compile use this command instead:

g++ -Dunix -o rate4site.exe -O3 *.cpp
 