Michael Brunton-Spall

Regular Expressions

2009-12-18 22:57:30 +0000

I'm not a big fan of regular expressions.  They can be powerful, but anything remotely complicated becomes a nightmare to maintain and re-read.  I recently had an idea for an easy-to-use library that builds regular expressions by chaining method calls.  I couldn't find anybody else doing it, so I've created one myself.

I've borrowed the concept of chaining from the jQuery library, so each function on the Regular Expression Builder object returns the modified object.  This makes the interface easier to read, and makes constructing a complex expression pretty simple.
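As a rough sketch of the chaining idea on its own (this is not the library's actual code; `ChainBuilder` and its internals are made-up names for illustration), each method appends a fragment and then returns the same object, so calls can be strung together:

```python
import re


class ChainBuilder:
    """Hypothetical stand-in for a chaining regex builder (not the real library class)."""

    def __init__(self):
        self.parts = []

    def literal(self, text):
        # Escape the text so it matches itself exactly.
        self.parts.append(re.escape(text))
        return self  # returning self is what allows the next call to chain

    def one_or_more(self, text):
        # A non-capturing group makes '+' apply to the whole text.
        self.parts.append('(?:' + re.escape(text) + ')+')
        return self

    def __str__(self):
        return ''.join(self.parts)


pattern = str(ChainBuilder().literal('abc').one_or_more('ef'))
```

Here `pattern` ends up as a plain string that can be handed to the `re` module as usual.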

The code can be found on GitHub.

Using it in your code is pretty simple: import the library and start building the regular expression.

 

from regex_builder import *
regex = str(literal('abc').one_or_more(literal('ef')))

 

So why is this useful? Regular expressions can start to get pretty large and funky. For example, we might want to match a set of URLs such as /travel/france, /travel/france+skiing and /travel/france+science/nanotechnology

 

/[a-zA-Z0-9]+/[a-zA-Z0-9]+(?:\+(?:[a-zA-Z0-9]+/[a-zA-Z0-9]+|[a-zA-Z0-9]+))?
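A quick way to check a pattern like this is to run it over the three example URLs with Python's `re` module. The sketch below assumes the intended form of the pattern: the `+` is escaped, and either a section/keyword pair or a bare keyword may follow it. The names `URL_PATTERN` and `matches` are just for illustration.

```python
import re

# Assumed intended pattern: '\+' is a literal plus, followed by
# either "section/keyword" or a single keyword.
URL_PATTERN = re.compile(
    r'/[a-zA-Z0-9]+/[a-zA-Z0-9]+'
    r'(?:\+(?:[a-zA-Z0-9]+/[a-zA-Z0-9]+|[a-zA-Z0-9]+))?'
)


def matches(url):
    # fullmatch insists that the whole URL fits the pattern,
    # not just a prefix of it.
    return URL_PATTERN.fullmatch(url) is not None


examples = [
    '/travel/france',
    '/travel/france+skiing',
    '/travel/france+science/nanotechnology',
]
results = {url: matches(url) for url in examples}
```

Using `fullmatch` rather than `match` matters here, because a pattern with an optional tail will happily match just the `/travel/france` prefix of a longer URL.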

 

Writing this in the first place without making a mistake is painstaking and fiddly.  Coming back to it 6 months later and having to change it is even worse.  Here is the equivalent using my library.

 

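# Each builder renders to a regex fragment when passed to str().
slugword = one_or_more(range('a-zA-Z0-9'))
section_and_keyword = literal(str(slugword)+'/'+str(slugword))
# A raw string keeps the backslash that escapes the '+'.
combiner = literal('/'+str(section_and_keyword)).optional(
               literal(r'\+').alternate(section_and_keyword, slugword))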

 

Now it's not perfect by any stretch of the imagination.  I'd love to not have to use str and literal to repeat a defined regex, but in the current architecture each method modifies the object it is called on, so executing "slugword.literal('a')" changes slugword everywhere it is used.
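To make the problem concrete, here is a minimal sketch of the two designs side by side. Both classes are hypothetical and much simpler than the library: one edits itself in place, the other returns a fresh object from every call, which is one possible way around the sharing problem.

```python
class MutableBuilder:
    """Hypothetical builder that edits itself in place."""

    def __init__(self, pattern=''):
        self.pattern = pattern

    def literal(self, text):
        self.pattern += text  # changes this object for every holder of it
        return self

    def __str__(self):
        return self.pattern


class ImmutableBuilder:
    """Hypothetical alternative: every call returns a new object."""

    def __init__(self, pattern=''):
        self.pattern = pattern

    def literal(self, text):
        # The original object is left untouched.
        return ImmutableBuilder(self.pattern + text)

    def __str__(self):
        return self.pattern


shared = MutableBuilder('[a-z]+')
shared.literal('a')            # the shared definition itself has now changed
mutable_result = str(shared)

slugword = ImmutableBuilder('[a-z]+')
extended = slugword.literal('a')  # slugword can still be reused as-is
```

The trade-off is a few extra object allocations in exchange for definitions that can be reused safely without round-tripping through str.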

Other to-dos include adding word, whitespace and digit methods, adding an any_character method, and finding bugs by actually using it.  I'll also be extending the framework to match automatically using the re module, so you won't have to compile and match by hand.

I also think it would be fairly easy to port to Java, so that's on the cards.
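As a sketch of what those additions might look like (none of this exists in the library yet; the class name, method names and behaviour are all assumptions), the shorthand methods could map straight onto the usual character classes, with a match method that hands the built pattern to `re`:

```python
import re


class MatchingBuilder:
    """Hypothetical sketch; names are illustrative, not the library's API."""

    def __init__(self, pattern=''):
        self.pattern = pattern

    def _add(self, fragment):
        # Return a new builder so earlier definitions stay reusable.
        return MatchingBuilder(self.pattern + fragment)

    def literal(self, text):
        return self._add(re.escape(text))

    def digit(self):
        return self._add(r'\d')

    def whitespace(self):
        return self._add(r'\s')

    def word(self):
        return self._add(r'\w')

    def any_character(self):
        return self._add('.')

    def match(self, text):
        # Delegate to the re module so callers never compile by hand.
        return re.match(self.pattern, text)


order_id = MatchingBuilder().literal('id').whitespace().digit().digit()
found = order_id.match('id 42')
```

With something like this, `found` is an ordinary `re` match object (or None when the text doesn't match), so existing code that works with `re` would not need to change.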

Let me know what you think, use it, and tell me what regex functions you use that I'm missing.  I only implemented the simplest functions, so there are a lot of lazy flags, special repeat types and other things that I've never personally used, and so didn't implement.