Regular Expressions

In Python and other languages, regular expressions (otherwise known as regex) are a powerful way to perform queries on strings. Below are tips on using them.

The 're' Package

Regular expressions are not loaded by default. The re module ships with Python's standard library, but it must be loaded with import re before any of its functions can be used.

Several functions in re are particularly useful:

re.search(pattern, input)
Scans the input for matches to the given pattern and returns the first match, wherever it occurs in the string.

re.match(pattern, input)
Checks whether the beginning of the input matches the given pattern and returns a match only if it does; otherwise it returns None. Note that this is stricter than re.search, which returns a match even if only the middle or end of the string matches the given pattern.

re.findall(pattern, input)
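Returns a list of all non-overlapping matches of the given pattern, rather than only the first one. If the pattern contains a single capture group (a part enclosed in parentheses), the list holds the text of that group instead of the whole match.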

re.purge()
Clears the regex cache.

The pattern can consist of regular characters or words such as "kitten", special characters that have a given meaning, or both combined in some fashion.
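The difference between these functions is easiest to see on a small string. The following is a minimal sketch using the literal pattern "kitten"; the sample text and variable names are made up for illustration.

```python
import re

text = "my kitten chased another kitten"

# re.search finds the first match anywhere in the string.
first = re.search("kitten", text)
print(first.group(), first.start())   # kitten 3

# re.match only succeeds if the pattern matches at the start of the string.
print(re.match("kitten", text))       # None
print(re.match("my", text).group())   # my

# re.findall returns every non-overlapping match as a list.
all_matches = re.findall("kitten", text)
print(all_matches)                    # ['kitten', 'kitten']
```

Note that re.search and re.match return a match object (or None), so the matched text is read with .group(), while re.findall returns plain strings.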

Building Regex Patterns

Here is a list of special characters that can be used to build such patterns:
[a-z] matches any lowercase letter; [A-Z] would match uppercase ones.
[0-9] matches any digit.
* is a quantifier, matching zero or more repetitions of the preceding character or group.
first_regex|second_regex: given that first_regex and second_regex are different regular expressions, this will match either one or the other of them. Thus [a-z]|[0-5] would match either a lowercase letter or a digit from 0 to 5, trying [a-z] first and falling back to [0-5] only if that fails.
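The special characters above can be tried out directly with re.findall. This is a small sketch; the input strings are arbitrary examples chosen for illustration.

```python
import re

# [a-z] matches one lowercase letter, [0-9] matches one digit.
letters = re.findall("[a-z]", "Ab3cD")
digits = re.findall("[0-9]", "Ab3cD")
print(letters, digits)   # ['b', 'c'] ['3']

# * repeats the preceding element zero or more times,
# so "ab*" means an "a" followed by any number of "b"s.
runs = re.findall("ab*", "a ab abbb")
print(runs)              # ['a', 'ab', 'abbb']

# | tries the left alternative first, then the right one.
either = re.findall("[a-z]|[0-5]", "x7y2")
print(either)            # ['x', 'y', '2']
```

The 7 in the last example is skipped because it is neither a lowercase letter nor a digit in the range 0-5.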

Example - Parsing .gd Files

This script makes extensive use of re.findall to parse a .gd file containing mutation data and extract information about each mutation, collecting it into lists for export to csv format.

#!/usr/bin/env python

import re
import csv
import os
import glob


def WriteListToCSV(csv_file, csv_columns, rows):
    # Writes the header row, then one row per mutation, into the target csv file.
    try:
        with open(csv_file, 'w', newline='') as csvfile:
            writer = csv.writer(csvfile, dialect='excel', quoting=csv.QUOTE_NONNUMERIC)
            writer.writerow(csv_columns)
            for row in rows:
                writer.writerow(row)
    except IOError as e:
        print("I/O error({0}): {1}".format(e.errno, e.strerror))


for filename in glob.glob('*.gd'):
    originalgdname = os.path.splitext(filename)[0]
    startpos = []
    endpos = []
    mutation = []
    mutationannotation = []
    genename = []
    geneproduct = []

    # Parses the targeted .gd file and stores output in lists.
    with open(filename) as gdfile:
        for line in gdfile:
            # \d matches any decimal digit; \d+ matches one or more of them.
            x = re.findall(r'start_position=(\d+)', line)
            if len(x) > 0:
                startpos.append(x[-1])

            y = re.findall(r'end_position=(\d+)', line)
            if len(y) > 0:
                endpos.append(y[-1])

            # Matches all text between two specified strings, 'html_mutation='
            # and 'html_mutation_annotation'. The parentheses form a capture
            # group, so findall returns only the text inside them.
            a = re.findall(r'html_mutation=(.*?)html_mutation_annotation', line)
            if len(a) > 0:
                mutation.append(a[-1])

            b = re.findall(r'html_mutation_annotation=(.*?)html_position', line)
            if len(b) > 0:
                mutationannotation.append(b[-1])

            c = re.findall(r'html_gene_name=(.*?)html_gene_product', line)
            if len(c) > 0:
                genename.append(c[-1])

            d = re.findall(r'html_gene_product=(.*?)html_mutation', line)
            if len(d) > 0:
                geneproduct.append(d[-1])

    # Writes all lists by column into a target csv file.
    csv_columns = ['Start Position', 'End Position', 'Mutation', 'Mutation Annotation', 'Gene Name', 'Gene Product']
    rows = zip(startpos, endpos, mutation, mutationannotation, genename, geneproduct)

    # The 'parsed' directory must already exist in the current directory.
    currentPath = os.getcwd()
    csv_file = currentPath + "/parsed/%s_%s.csv" % (originalgdname, 'parsed')
    WriteListToCSV(csv_file, csv_columns, rows)
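The capture-group behaviour that the script relies on can be checked on a single line before running it on real data. The sketch below uses a hypothetical tab-separated line; it is an assumption made for illustration, and real .gd files may be laid out differently.

```python
import re

# Hypothetical line for illustration only; not taken from a real .gd file.
line = ("start_position=1042\tend_position=1043\t"
        "html_mutation=A>G\thtml_mutation_annotation=intergenic")

# Only the digits inside the parentheses are returned, not the key name.
start = re.findall(r"start_position=(\d+)", line)
print(start)      # ['1042']

# .*? is non-greedy: it stops at the first occurrence of the closing string.
# Anything between the two markers is captured, including the separator.
mutation = re.findall(r"html_mutation=(.*?)html_mutation_annotation", line)
print(mutation)   # ['A>G\t']

# A key that is absent from the line gives an empty list, which is why
# the script checks the length before appending.
missing = re.findall(r"html_gene_name=(.*?)html_gene_product", line)
print(missing)    # []
```

Because the captured text can carry a trailing separator, calling .strip() on each value before storing it may be worthwhile, depending on how the input is delimited.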




Lucy LeBlanc

Topic revision: r2 - 15 Apr 2016 - 03:56:54 - Main.LucyLeblanc