Regular Expressions

In Python and other languages, regular expressions (often shortened to regex) are a powerful way to search for and match text within strings. Below are tips on using them in Python.

The 're' Package

Regular expression support is not loaded by default. It must be loaded with the statement import re.

The re module provides several useful functions:

re.search(pattern, input)
Scans the input for the given pattern and returns a match object for the first match found anywhere in the string, or None if there is no match.

re.match(pattern, input)
Checks only the beginning of the input and returns a match object only if the pattern matches there; otherwise it returns None. Note that this is stricter than re.search(), which returns a match even if it is found in the middle or at the end of the string.

re.findall(pattern, input)
Returns a list of all non-overlapping matches of the given pattern in the input. If the pattern contains a capture group in parentheses, the list holds the captured text rather than the whole match.

re.purge()
Clears the module's cache of compiled patterns.
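The difference between these functions is easiest to see side by side. The following is a minimal sketch; the sample string is made up for illustration.

```python
import re

text = "my kitten chased another kitten"  # made-up sample input

# re.search returns a match object for the first match anywhere in the string.
first = re.search("kitten", text)
print(first.group(), first.start())  # kitten 3

# re.match succeeds only if the pattern matches at the start of the string.
at_start = re.match("kitten", text)
print(at_start)  # None

# re.findall returns every non-overlapping match as a list of strings.
all_matches = re.findall("kitten", text)
print(all_matches)  # ['kitten', 'kitten']

# re.purge empties the cache of compiled patterns; later calls still work.
re.purge()
```

Because re.search and re.match return None when nothing matches, it is usually worth checking the result before calling .group() on it.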

A pattern can consist of literal characters or words such as "kitten", special characters that carry a particular meaning, or a combination of both.

Building Regex Patterns

Here is a list of special characters that can be used to build such patterns:
[a-z] matches any lowercase letter; [A-Z] would match uppercase ones.
[0-9] matches any digit.
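* is a quantifier rather than a standalone wildcard: it matches zero or more repetitions of the preceding character or group. For example, [0-9]* matches any run of digits, including an empty one.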
first_regex|second_regex: where first_regex and second_regex are two regular expressions, this matches either one of them. Thus [a-z]|[0-5] matches either a lowercase letter or a digit from 0 to 5. The alternatives are tried from left to right, so [a-z] is checked first and [0-5] is tried only if it fails.
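The special characters listed above can be tried out on short strings. This is a small sketch with made-up inputs, not an exhaustive reference.

```python
import re

# [a-z] and [0-9] each match exactly one character from the given range.
lower = re.findall("[a-z]", "aB3")
digits = re.findall("[0-9]", "aB3")
print(lower, digits)  # ['a'] ['3']

# * applies to the preceding item, so [0-9]* matches zero or more digits.
leading_digits = re.match("[0-9]*", "2016abc").group()
no_digits = re.match("[0-9]*", "abc").group()
print(repr(leading_digits), repr(no_digits))  # '2016' ''

# Literal text and special characters can be combined in one pattern.
combined = re.search("kitten[0-9]*", "kitten42 and kitten").group()
print(combined)  # kitten42

# | matches either alternative; the left one is tried first at each position.
either = re.findall("[a-z]|[0-5]", "a7b3")
print(either)  # ['a', 'b', '3']
```

Note that [0-9]* still reports a match on "abc": zero repetitions count, so the match is simply the empty string.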

Example - Parsing .gd Files

#!/usr/bin/env python

import re
import csv
import sys
import os
import glob


def WriteListToCSV(csv_file, csv_columns, rows):
    # Writes all lists by column into a target csv file.
    try:
        with open(csv_file, 'w') as csvfile:
            writer = csv.writer(csvfile, dialect='excel', quoting=csv.QUOTE_NONNUMERIC)
            writer.writerow(csv_columns)
            for row in rows:
                writer.writerow(row)
    except IOError as e:
        print("I/O error({0}): {1}".format(e.errno, e.strerror))
    return


csv_columns = ['Start Position', 'End Position', 'Mutation', 'Mutation Annotation', 'Gene Name', 'Gene Product']

for filename in glob.glob('*.gd'):
    originalgdname = os.path.splitext(filename)[0]
    print(originalgdname)
    startpos = []
    endpos = []
    mutation = []
    mutationannotation = []
    genename = []
    geneproduct = []

    # Parses the targeted .gd file and stores output in lists.
    with open(filename) as file:
        for line in file:
            # \d matches any decimal digit, so (\d+) captures a run of one or more digits.
            x = re.findall(r'start_position=(\d+)', line)
            if len(x) > 0:
                startpos.append(x[-1])

            y = re.findall(r'end_position=(\d+)', line)
            if len(y) > 0:
                endpos.append(y[-1])

            # (.*?) matches all text between two specified strings, here 'html_mutation='
            # and 'html_mutation_annotation'. The parentheses form a capture group, so
            # findall returns only the text inside them.
            a = re.findall(r'html_mutation=(.*?)html_mutation_annotation', line)
            if len(a) > 0:
                mutation.append(a[-1])

            b = re.findall(r'html_mutation_annotation=(.*?)html_position', line)
            if len(b) > 0:
                mutationannotation.append(b[-1])

            c = re.findall(r'html_gene_name=(.*?)html_gene_product', line)
            if len(c) > 0:
                genename.append(c[-1])

            d = re.findall(r'html_gene_product=(.*?)html_mutation', line)
            if len(d) > 0:
                geneproduct.append(d[-1])

    rows = zip(startpos, endpos, mutation, mutationannotation, genename, geneproduct)

    # The output goes to a 'parsed' subdirectory of the current directory, which must already exist.
    currentPath = os.getcwd()
    csv_file = currentPath + "/parsed/%s_%s.csv" % (originalgdname, 'parsed')

    WriteListToCSV(csv_file, csv_columns, rows)
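To see what the capture groups in the script return, they can be run against a single line. The line below is hypothetical and invented for illustration; real .gd files may order or separate their fields differently.

```python
import re

# Hypothetical tab-separated line, made up for this sketch.
line = ("start_position=1203\tend_position=1210\t"
        "html_mutation=A>G\thtml_mutation_annotation=intergenic\t"
        "html_position=1203")

# With one capture group, findall returns only the captured text.
start = re.findall(r'start_position=(\d+)', line)
print(start)  # ['1203']

# (.*?) captures everything up to the next field name, including the separator.
mutation = re.findall(r'html_mutation=(.*?)html_mutation_annotation', line)
print(mutation)  # ['A>G\t']
print([m.strip() for m in mutation])  # ['A>G']

# The ? makes * non-greedy: it stops at the first place the rest can match.
sample = "a=1;b=2;"
greedy = re.findall(r'a=(.*);', sample)
lazy = re.findall(r'a=(.*?);', sample)
print(greedy, lazy)  # ['1;b=2'] ['1']
```

As the second result shows, the captured text can carry a trailing separator, so stripping whitespace before storing the value may be worthwhile.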

Contributors

Lucy LeBlanc

Topic revision: r1 - 07 Apr 2016 - 20:14:13 - Main.LucyLeblanc
 
Copyright ©2017 Barrick Lab contributing authors.