Motivation: You have re-sequenced several genomes after a mutation accumulation or adaptive evolution experiment. How do you infer the rates at which different types of mutation accumulate from these data? What are the 95% confidence intervals on these values?
Assumptions:
Example: Single-base substitutions
Calculation:
> m = 22
> T = 10000 * 6
> rate = poisson.test(m)
> rate$estimate/T
  event rate
0.0003666667
> rate$conf.int/T
[1] 0.0002297880 0.0005551377
attr(,"conf.level")
[1] 0.95

> s = 1342726
> rate$estimate/(T*s)
  event rate
2.730763e-10
> rate$conf.int/(T*s)
[1] 1.711355e-10 4.134408e-10
attr(,"conf.level")
[1] 0.95
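The same numbers can be obtained in a slightly more compact way. The following is a minimal sketch, assuming that `m` is the total number of mutations observed, that `10000 * 6` is the total elapsed time summed over genomes, and that `s` is the number of sites at which this type of mutation can occur. The names `T_total`, `per_time`, and `per_site` are introduced here for illustration only (using `T_total` also avoids masking R's built-in `T` shorthand for `TRUE`).

```r
# Observed values from the example above
m <- 22              # mutations observed
T_total <- 10000 * 6 # total elapsed time (assumed: summed over genomes)
s <- 1342726         # sites at risk (assumed interpretation of s)

# Passing the time base to poisson.test() scales the estimate and
# the confidence interval for us.
fit <- poisson.test(m, T = T_total)

per_time <- c(estimate = unname(fit$estimate),
              lower    = fit$conf.int[1],
              upper    = fit$conf.int[2])

# Rate per site per unit time
per_site <- per_time / s

# The exact Poisson interval can also be written with gamma quantiles,
# which is what poisson.test() computes for a single sample.
exact_ci <- c(qgamma(0.025, m), qgamma(0.975, m + 1)) / T_total

print(per_time)
print(per_site)
```

Dividing by `T_total` inside or outside `poisson.test()` gives the same interval, so either style works; passing `T` directly just keeps the scaling in one place.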
Assumptions:
Example: Deletion of an unstable chromosomal region. Once deleted, it can never be deleted again.
Calculation:
> m = 5
> n = 12
> T = 10000
> p = binom.test(n - m, n)
> p

	Exact binomial test

data:  n - m and n
number of successes = 7, number of trials = 12, p-value = 0.7744
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.2766697 0.8483478
sample estimates:
probability of success
             0.5833333

> -log(p$estimate) / T
probability of success
          5.389965e-05
> -log(p$conf.int) / T
[1] 1.284931e-04 1.644646e-05
attr(,"conf.level")
[1] 0.95
This is a particularly simple type of survival analysis.
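The `-log(p) / T` step follows from treating the deletion as an event that happens at a constant rate and at most once per genome: the probability that the region is still intact after time T is then exp(-rate × T), so rate = -ln(p) / T, where p is the fraction of genomes that kept the region. Because -log() is a decreasing function, the confidence bounds come out in reverse order (upper bound first) in the output above. A minimal sketch that returns them sorted is shown here; the variable names `T_elapsed`, `rate_estimate`, and `rate_ci` are assumptions made for this illustration.

```r
# Observed values from the example above
m <- 5             # genomes in which the region was deleted
n <- 12            # genomes sequenced
T_elapsed <- 10000 # elapsed time per genome

# Estimate the probability that the region survives intact
fit <- binom.test(n - m, n)

# Convert survival probability to a rate: p = exp(-rate * T)
rate_estimate <- unname(-log(fit$estimate) / T_elapsed)

# -log() reverses the order of the bounds, so sort them
rate_ci <- sort(as.numeric(-log(fit$conf.int) / T_elapsed))

print(rate_estimate)
print(rate_ci)
```

Note that if every genome has lost the region (m equal to n), the estimated survival probability is 0 and the rate estimate becomes infinite, so this calculation is only informative when at least some genomes retain the region.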
What if you want to test for variation in rates of mutation accumulation?
You can use Poisson regression in R (using glm()) to judge whether there is a significant difference in the rates at which mutations accumulate relative to some factor. For example, you can test whether there is evidence that certain populations accumulated different numbers of mutations per unit time compared to others or whether mutations at certain sites were more common than at other sites. Fit a model that incorporates the relevant factor and one that does not, and then compare them using anova().
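As an illustration of this model comparison, here is a minimal sketch. The data frame is entirely hypothetical (made-up counts for six populations under two conditions), and the column names `condition`, `mutations`, and `generations` are assumptions chosen for this example. The elapsed time enters the model as an offset so that the comparison is between rates rather than raw counts.

```r
# Hypothetical data: one sequenced genome per population
d <- data.frame(
  population  = factor(c("A", "B", "C", "D", "E", "F")),
  condition   = factor(c("cond1", "cond1", "cond1",
                         "cond2", "cond2", "cond2")),
  mutations   = c(3, 5, 4, 9, 12, 8),
  generations = rep(10000, 6)
)

# Model without the factor: a single rate shared by all populations
null_model <- glm(mutations ~ 1 + offset(log(generations)),
                  family = poisson, data = d)

# Model with the factor: a separate rate for each condition
full_model <- glm(mutations ~ condition + offset(log(generations)),
                  family = poisson, data = d)

# Likelihood-ratio (chi-squared) comparison of the two models
comparison <- anova(null_model, full_model, test = "Chisq")
p_value <- comparison$"Pr(>Chi)"[2]

# Estimated ratio of rates (cond2 relative to cond1)
rate_ratio <- unname(exp(coef(full_model)["conditioncond2"]))

print(comparison)
print(rate_ratio)
```

A small p-value here would suggest that the rate differs between the levels of the factor. With so few observations the test has limited power, so the result should be read with caution.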
What if you sequenced multiple genomes from each population?
This type of pseudo-replication complicates the statistical analysis because strains sequenced from the same population are likely to share some of their evolutionary history. If that population happened to evolve more rapidly by chance, you will overestimate rates by including all of its strains and assuming an independent time basis for each one. It is not easy to correct for this shared history; doing so rigorously would likely require a resampling procedure. It would be valid to randomly pick one strain from each population and include only that one in the typical analysis, which restores the assumption of independence, but this discards some information.
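One simple way to explore the "pick one strain per population" idea is to repeat the random pick many times and look at how much the rate estimate varies between picks. The sketch below is only an illustration under stated assumptions, not a validated correction for shared history: the data are hypothetical, and the helper functions `pick_one_per_population()` and `estimate_rate()` are names invented for this example.

```r
set.seed(1)  # make the random picks reproducible

# Hypothetical data: two sequenced strains from each of three populations
strains <- data.frame(
  population  = c("P1", "P1", "P2", "P2", "P3", "P3"),
  mutations   = c(4, 6, 3, 3, 7, 5),
  generations = 10000
)

# Randomly keep a single strain from each population
pick_one_per_population <- function(d) {
  picked <- lapply(split(d, d$population),
                   function(g) g[sample.int(nrow(g), 1), ])
  do.call(rbind, picked)
}

# Poisson rate estimate from one set of independent strains
estimate_rate <- function(d) {
  fit <- poisson.test(sum(d$mutations), T = sum(d$generations))
  unname(fit$estimate)
}

# Repeat the pick many times to see how sensitive the estimate is
rates <- replicate(1000, estimate_rate(pick_one_per_population(strains)))

print(summary(rates))
```

Each individual pick satisfies the independence assumption, so its Poisson confidence interval is valid on its own. The spread of `rates` across picks only shows how much the answer depends on which strain was chosen; it is not itself a confidence interval.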
We used the approaches described here to characterize and compare the rates of mutations in this paper:
Renda, B.A., Dasgupta, A., Leon, D., Barrick, J.E. (2015) Genome instability mediates the loss of key traits by Acinetobacter baylyi ADP1 during laboratory evolution. J. Bacteriol. 197:872-881. https://doi.org/10.1128/JB.02263-14