GenomeDiff Format

breseq outputs its evidence and mutation predictions in a computer-readable GenomeDiff text format.

An example of a portion of a GenomeDiff file:

#=GENOME_DIFF 1.0
DEL  61      11      NC_001416       139     1
INS  62      12      NC_001416       14266   G
SNP  63      13      NC_001416       20661   G
INS  64      14      NC_001416       20835   C
SNP  65      15      NC_001416       21714   A
DEL  60      33,1    NC_001416       21738   5996
SNP  66      35      NC_001416       31016   C
...
MC   9               NC_001416       1       2       0       0       left_inside_cov=0       left_outside_cov=NA     right_inside_cov=0      right_outside_cov=169
RA   11              NC_001416       139     0       G       .       frequency=1     new_cov=34/40   quality=309.0   ref_cov=0/0     tot_cov=34/40
JC   2               NC_001416       5491    1       NC_001416       30255   1       0       alignment_overlap=4     coverage_minus=8        coverage_plus=0 flanking_left=35        flanking_right=35       key=NC_001416__5491__1__NC_001416__30251__1__4____35__35__0__0  max_left=30     max_left_minus=30       max_left_plus=0 max_min_left=0  max_min_left_minus=0    max_min_left_plus=0     max_min_right=11        max_min_right_minus=11  max_min_right_plus=0    max_right=11    max_right_minus=11      max_right_plus=0        min_overlap_score=44    pos_hash_score=7        reject=NJ,COV   side_1_annotate_key=gene        side_1_overlap=4        side_1_redundant=0      side_2_annotate_key=gene        side_2_overlap=0        side_2_redundant=0      total_non_overlap_reads=8       total_reads=8
JC   3               NC_001416       13180   1       NC_001416       13218   1       0       alignment_overlap=4     coverage_minus=1        coverage_plus=0 flanking_left=35        flanking_right=35       key=NC_001416__13180__1__NC_001416__13214__1__4____35__35__0__0 max_left=17     max_left_minus=17       max_left_plus=0 max_min_left=0  max_min_left_minus=0    max_min_left_plus=0     max_min_right=14        max_min_right_minus=14  max_min_right_plus=0    max_right=14    max_right_minus=14      max_right_plus=0        min_overlap_score=14    pos_hash_score=1        reject=NJ,COV   side_1_annotate_key=gene        side_1_overlap=4        side_1_redundant=0      side_2_annotate_key=gene        side_2_overlap=0        side_2_redundant=0      total_non_overlap_reads=1       total_reads=1
RA   12              NC_001416       14266   1       .       G       frequency=1     new_cov=44/31   quality=186.3   ref_cov=0/0     tot_cov=44/31
JC   5               NC_001416       14869   -1      NC_001416       15609   -1      0       alignment_overlap=7     coverage_minus=1        coverage_plus=0 flanking_left=35        flanking_right=35       key=NC_001416__14869__0__NC_001416__15616__0__7____35__35__0__0 max_left=21     max_left_minus=21       max_left_plus=0 max_min_left=0  max_min_left_minus=0    max_min_left_plus=0     max_min_right=7 max_min_right_minus=7   max_min_right_plus=0    max_right=7     max_right_minus=7       max_right_plus=0        min_overlap_score=7     pos_hash_score=1        reject=NJ,COV   side_1_annotate_key=gene        side_1_overlap=7        side_1_redundant=0      side_2_annotate_key=gene        side_2_overlap=0        side_2_redundant=0      total_non_overlap_reads=1       total_reads=1

Format specification

Version line

The first line of the file must define the version:

#=GENOME_DIFF 1.0

Metadata lines

Lines beginning with #=<name> <value> are interpreted as metadata. (Thus, the first line assigns a metadata item named GENOME_DIFF a value of 1.0.) Names cannot include whitespace characters, but values may. The values of lines with the same name are concatenated, with a single space added between them.
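
As a rough illustration of the metadata rules above, the following Python sketch collects metadata lines into a dictionary. The function name read_metadata and the AUTHOR item are hypothetical and chosen only for this example; the sketch also assumes that the name is separated from the value by whitespace.

```python
def read_metadata(lines):
    """Collect '#=<name> <value>' lines into a dict (hypothetical helper)."""
    metadata = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.startswith("#="):
            continue
        # Split the name from the value at the first run of whitespace.
        parts = line[2:].split(None, 1)
        if not parts:
            continue
        name = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        # Values of repeated names are joined with a single space.
        if name in metadata:
            metadata[name] = metadata[name] + " " + value
        else:
            metadata[name] = value
    return metadata


# AUTHOR is a made-up metadata name used only to show concatenation.
example = ["#=GENOME_DIFF 1.0", "#=AUTHOR Jane", "#=AUTHOR Doe"]
print(read_metadata(example))
# {'GENOME_DIFF': '1.0', 'AUTHOR': 'Jane Doe'}
```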

Comment lines

Subsequent lines beginning with whitespace and # are comments.

Data lines

Data lines describe either a mutation or evidence from an analysis that can potentially support a mutational event. Data fields are tab-delimited. Each line begins with several fields containing information common to all types, continues with a fixed number of type-specific fields, and ends with an arbitrary number of name=value pairs that store optional information.

  1. type <string>

    type of the entry on this line.

  2. id <uint32>

    id of this item. May be set to ‘+’ for manually edited entries.

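  3. parent-ids <uint32>

    ids of evidence that support this mutation, given as a comma-separated list when there is more than one (for example, 33,1 in the second DEL line of the example above). May be set to ‘.’ or left blank.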

Valid mutation types are: SNP, SUB, DEL, INS, MOB, AMP, CON, INV.

Valid evidence types are: RA, MC, JC, UN.
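
The layout of a data line can be sketched in Python as follows. This is a minimal illustration and not breseq's own parser: the function name parse_data_line is hypothetical, the comma-separated form of parent-ids is inferred from the example file, and treating any field that contains "=" as an optional name=value pair is an assumption of this sketch.

```python
MUTATION_TYPES = {"SNP", "SUB", "DEL", "INS", "MOB", "AMP", "CON", "INV"}
EVIDENCE_TYPES = {"RA", "MC", "JC", "UN"}


def parse_data_line(line):
    """Split one tab-delimited data line into its parts (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    entry_type, entry_id, parent_field = fields[0], fields[1], fields[2]
    if entry_type not in MUTATION_TYPES | EVIDENCE_TYPES:
        raise ValueError("unknown entry type: " + entry_type)
    # parent-ids may be blank or '.'; otherwise split on commas (assumption).
    if parent_field in ("", "."):
        parent_ids = []
    else:
        parent_ids = parent_field.split(",")
    # Assumption: fields containing '=' are optional name=value pairs,
    # and everything else is a type-specific positional field.
    positional, optional = [], {}
    for field in fields[3:]:
        if "=" in field:
            name, value = field.split("=", 1)
            optional[name] = value
        else:
            positional.append(field)
    return {
        "type": entry_type,
        "id": entry_id,  # kept as text because the id may be '+'
        "parent_ids": parent_ids,
        "fields": positional,
        "optional": optional,
    }


entry = parse_data_line("DEL\t60\t33,1\tNC_001416\t21738\t5996")
print(entry["type"], entry["parent_ids"], entry["fields"])
# DEL ['33', '1'] ['NC_001416', '21738', '5996']
```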

Evidence Types

RA: Read alignment evidence

Line specification:

  1. seq_id <string>

    id of reference sequence fragment.

  2. position <uint32>

    position in reference sequence fragment.

  3. insert_position <uint32>

    number of bases inserted after the reference position to get to this base. A value of zero refers to the reference base itself. A value of 5 means that this evidence is for the fifth newly inserted column after the reference position.

  4. ref_base <char>

    base in the reference genome.

  5. new_base <char>

    new base supported by read alignment evidence.

MC: Missing coverage evidence

Line specification:

  1. seq_id <string>

    id of reference sequence fragment.

  2. start <uint32>

    start position in reference sequence fragment.

  3. end <uint32>

    end position in reference sequence of region.

  4. start_range <uint32>

    number of bases to offset after the start position to define the upper limit of the range where the start of a deletion could be.

  5. end_range <uint32>

    number of bases to offset before the end position to define the lower limit of the range where the end of a deletion could be.

Essentially, this is evidence of missing coverage between two positions that lie in the ranges [start, start+start_range] and [end-end_range, end].
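
The two ranges can be computed directly from the four numeric MC fields. The sketch below is only an illustration; the function name missing_coverage_ranges is hypothetical, and the second call uses made-up values.

```python
def missing_coverage_ranges(start, end, start_range, end_range):
    """Return the inclusive ranges that bound each edge of the region."""
    start_interval = (start, start + start_range)
    end_interval = (end - end_range, end)
    return start_interval, end_interval


# Values taken from the MC line in the example file: 1, 2, 0, 0.
print(missing_coverage_ranges(1, 2, 0, 0))
# ((1, 1), (2, 2))

# Made-up values showing non-zero range offsets.
print(missing_coverage_ranges(100, 200, 5, 10))
# ((100, 105), (190, 200))
```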

JC: New junction evidence

Line specification:

  1. side_1_seq_id <string>

    id of reference sequence fragment containing side 1 of the junction.

  2. side_1_position <uint32>

    position of side 1 at the junction boundary.

  3. side_1_strand <1/-1>

    direction that side 1 continues matching the reference sequence.

  4. side_2_seq_id <string>

    id of reference sequence fragment containing side 2 of the junction.

  5. side_2_position <uint32>

    position of side 2 at the junction boundary.

  6. side_2_strand <1/-1>

    direction that side 2 continues matching the reference sequence.

  7. overlap <uint32>

    number of bases that the two sides of the new junction have in common.
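
As an illustration, the seven type-specific fields of a new junction (JC) line could be converted into a typed record like this. The Junction class and parse_junction function are hypothetical names for this sketch and are not part of breseq.

```python
from typing import NamedTuple


class Junction(NamedTuple):
    side_1_seq_id: str
    side_1_position: int
    side_1_strand: int
    side_2_seq_id: str
    side_2_position: int
    side_2_strand: int
    overlap: int


def parse_junction(fields):
    """Convert the seven type-specific JC fields (as text) into a Junction."""
    if len(fields) != 7:
        raise ValueError("expected 7 fields, got %d" % len(fields))
    seq1, pos1, strand1, seq2, pos2, strand2, overlap = fields
    # Strand fields must be 1 or -1.
    for strand in (strand1, strand2):
        if strand not in ("1", "-1"):
            raise ValueError("strand must be 1 or -1, got " + strand)
    return Junction(seq1, int(pos1), int(strand1),
                    seq2, int(pos2), int(strand2), int(overlap))


# Fields from the first JC line in the example file.
jc = parse_junction(["NC_001416", "5491", "1", "NC_001416", "30255", "1", "0"])
print(jc.side_1_position, jc.side_2_position, jc.overlap)
# 5491 30255 0
```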

UN: Unknown base evidence

Line specification:

  1. seq_id <string>

    id of reference sequence fragment.

  2. start <uint32>

    start position in reference sequence of region.

  3. end <uint32>

    end position in reference sequence of region.

Mutational Event Types

SNP: Base substitution mutation

  1. seq_id <string>

    id of reference sequence fragment.

  2. position <uint32>

    position in reference sequence fragment.

  3. new_seq <char>

    new base at position.

SUB: Multiple base substitution mutation

  1. seq_id <string>

    id of reference sequence fragment.

  2. position <uint32>

    position in reference sequence fragment.

  3. size <uint32>

    number of bases after the specified reference position to replace with new_seq.

  4. new_seq <string>

    new bases to substitute for the replaced bases.

DEL: Deletion mutation

  1. seq_id <string>

    id of reference sequence fragment.

  2. position <uint32>

    position in reference sequence fragment.

  3. size <uint32>

    number of bases deleted in reference.

INS: Insertion mutation

  1. seq_id <string>

    id of reference sequence fragment.

  2. position <uint32>

    position in reference sequence fragment.

  3. new_seq <string>

    new bases inserted after the specified reference position.
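
To make the SNP, SUB, DEL, and INS definitions concrete, here is a small Python sketch that applies one such mutation to a sequence string. It is not breseq's implementation. The function name apply_mutation is hypothetical, and the sketch assumes that positions are 1-indexed and that SUB and DEL act on bases starting at the given position; neither assumption is stated in this specification.

```python
def apply_mutation(seq, mut_type, position, *args):
    """Apply one SNP, SUB, DEL, or INS to seq and return the new string.

    Assumes 1-indexed positions (an assumption of this sketch).
    """
    i = position - 1  # convert to a 0-indexed string offset
    if mut_type == "SNP":
        (new_base,) = args
        return seq[:i] + new_base + seq[i + 1:]
    if mut_type == "SUB":
        size, new_seq = args
        # Assumption: the replaced bases start at the given position.
        return seq[:i] + new_seq + seq[i + size:]
    if mut_type == "DEL":
        (size,) = args
        # Assumption: the deleted bases start at the given position.
        return seq[:i] + seq[i + size:]
    if mut_type == "INS":
        (new_seq,) = args
        # New bases go after the base at the reference position.
        return seq[:i + 1] + new_seq + seq[i + 1:]
    raise ValueError("unsupported mutation type: " + mut_type)


ref = "ACGTACGT"  # made-up reference sequence
print(apply_mutation(ref, "SNP", 3, "T"))       # ACTTACGT
print(apply_mutation(ref, "SUB", 1, 2, "TTT"))  # TTTGTACGT
print(apply_mutation(ref, "DEL", 2, 3))         # AACGT
print(apply_mutation(ref, "INS", 4, "GG"))      # ACGTGGACGT
```

When several mutations are applied to the same sequence, each one shifts the coordinates of later positions, so a real tool would need to track offsets or apply mutations from the highest position to the lowest.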

MOB: Mobile element insertion mutation

  1. seq_id <string>

    id of reference sequence fragment.

  2. position <uint32>

    position in reference sequence fragment.

  3. repeat_name <string>

    name of the mobile element. Should correspond to an annotated repeat_region in the reference.

  4. strand <1/-1>

    strand of mobile element insertion.

  5. duplication_size <uint32>

    number of bases duplicated during insertion, beginning with the specified reference position.

AMP: Amplification mutation

  1. seq_id <string>

    id of reference sequence fragment.

  2. position <uint32>

    position in reference sequence fragment.

  3. size <uint32>

    number of bases duplicated starting with the specified reference position.

  4. new_copy_number <uint32>

    new number of copies of specified bases.
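
An amplification can be sketched in the same way. The function name apply_amp is hypothetical, and the sketch assumes 1-indexed positions and that the copies are placed in tandem at the original location, which this specification does not state.

```python
def apply_amp(seq, position, size, new_copy_number):
    """Return seq with the given segment present new_copy_number times.

    Assumes 1-indexed positions and tandem copies (assumptions of this sketch).
    """
    i = position - 1
    unit = seq[i:i + size]  # the bases being amplified
    return seq[:i] + unit * new_copy_number + seq[i + size:]


# Made-up sequence: amplify the 3 bases starting at position 2 to 3 copies.
print(apply_amp("ACGTACGT", 2, 3, 3))
# ACGTCGTCGTACGT
```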

CON: Gene conversion mutation

  1. seq_id <string>

    id of reference sequence fragment.

  2. position <uint32>

    position in reference sequence fragment that was the target of gene conversion from another genomic location.

  3. size <uint32>

    number of bases to replace in the reference genome beginning at the specified position.

  4. region <sequence:start-end>

    region in the reference genome to use as a replacement.
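
The region field uses the form sequence:start-end. A minimal way to split it in Python is sketched below; the function name parse_region and the coordinates in the example are hypothetical.

```python
def parse_region(region):
    """Split 'sequence:start-end' into (seq_id, start, end)."""
    # Split at the last ':' in case the sequence id itself contains one.
    seq_id, _, coords = region.rpartition(":")
    start_text, _, end_text = coords.partition("-")
    if not seq_id or not start_text or not end_text:
        raise ValueError("expected sequence:start-end, got " + region)
    return seq_id, int(start_text), int(end_text)


# Made-up coordinates on the reference used in the example file.
print(parse_region("NC_001416:100-250"))
# ('NC_001416', 100, 250)
```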

INV: Inversion mutation

  1. seq_id <string>

    id of reference sequence fragment.

  2. position <uint32>

    position in reference sequence fragment.

  3. size <uint32>

    number of bases in inverted region beginning at the specified reference position.
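
An inversion can be illustrated as follows. The function name apply_inv is hypothetical, and the sketch assumes 1-indexed positions and that the inverted region is written as its reverse complement on the reference strand; this specification does not spell out either detail.

```python
# Translation table for complementing DNA bases (other characters unchanged).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def apply_inv(seq, position, size):
    """Return seq with the given region replaced by its reverse complement.

    Assumes 1-indexed positions (an assumption of this sketch).
    """
    i = position - 1
    inverted = seq[i:i + size].translate(COMPLEMENT)[::-1]
    return seq[:i] + inverted + seq[i + size:]


# Made-up sequence: invert the 3 bases starting at position 2 (ACC -> GGT).
print(apply_inv("AACCGGTT", 2, 3))
# AGGTGGTT
```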