Regular expressions blues

Or why I hate regular expressions: they’re not designed for readability

Early this morning I caught up on a recent changelog covering Happy.js, a lightweight form validation plugin. My first reaction: nifty, I can use a handy form validator for web app inputs. As I was skimming through the example validation functions, the regex reached out of the page and slapped me in the face. It was only then that I realized how unintuitive regular expressions are.

I can handle this regex for phone numbers. It takes a little familiarity to follow, but at least it fits in a pre tag:

/^\(?(\d{3})\)?[\- ]?(\d{3})[\- ]?(\d{4})$/.test(val)
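
To make the pattern's behavior concrete, here is a quick check against a few sample inputs. The phone numbers are invented for illustration; only the regex itself comes from the example above.

```javascript
// The phone pattern from above, unchanged.
const phone = /^\(?(\d{3})\)?[\- ]?(\d{3})[\- ]?(\d{4})$/;

// Sample inputs are made up for illustration.
console.log(phone.test("(555) 123-4567")); // true
console.log(phone.test("555-123-4567"));   // true
console.log(phone.test("5551234567"));     // true
console.log(phone.test("555-1234"));       // false: too few digits
console.log(phone.test("(555 123-4567"));  // true: each paren is optional on its own

// The three capture groups hold the area code, prefix and line number.
const [, area, prefix, line] = phone.exec("(555) 123-4567");
console.log(area, prefix, line); // 555 123 4567
```

Note the last `test` call: because the opening and closing parentheses are each optional independently, an unbalanced parenthesis still passes.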

Then came this beast for email validation. Go ahead, keep on scrolling right…

What I’d prefer is an expressive syntax that fits with patterns as I think of them. So for a phone number validation it would be #(###)###-#### where # specifies a single digit, or an equivalent abbreviated form #({3}#){3}#-{4}#. The same goes for email: *@*.EXTs, where * is any character sequence and EXTs specifies a user-defined list of acceptable extensions. Explicit character matches can be done by adding the characters, e.g. *@gmail.com for all Gmail addresses. Of course this syntax will need an escape character for reserved symbols, plus the standard case-insensitive flag and global versus first-match replacement (the i and g flags for regular expressions). Unfortunately, it appears all I’ve done is invent yet another flavor of regular expression.
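
A minimal sketch of how such a template could be compiled down to an ordinary regular expression. The function name `templateToRegExp` and its exact rules are assumptions made for illustration: `#` is one digit, `*` is one or more characters, and every other character matches itself. The `{3}#` abbreviation and the EXTs list are left out, and the sample template is a simplified one.

```javascript
// Hypothetical helper: turns a template such as "(###)###-####" into a RegExp.
// Assumed rules (not a spec): "#" is one digit, "*" is one or more characters,
// and every other character matches itself literally.
function templateToRegExp(template, flags = "") {
  const source = Array.from(template, (ch) => {
    if (ch === "#") return "\\d";
    if (ch === "*") return ".+";
    // Escape characters that are special in regular expressions.
    return ch.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  }).join("");
  // Anchor at both ends so the whole input has to fit the template.
  return new RegExp("^" + source + "$", flags);
}

const phoneTemplate = templateToRegExp("(###)###-####");
console.log(phoneTemplate.test("(555)123-4567"));  // true
console.log(phoneTemplate.test("(555) 123-4567")); // false: the template has no space

const gmail = templateToRegExp("*@gmail.com", "i");
console.log(gmail.test("Someone@GMAIL.com")); // true
```

The second phone check shows the price of the simpler syntax: without a way to mark a character as optional, the template is stricter than the original regex.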

Another tactic that may simplify the syntax’s appearance is rearranging the validation string into rows:
#
({3}#)
{3}#
-
{4}#
Not exactly a thing of beauty. I don’t have any more time at the moment, but I’d like to revisit the problem of making regular expressions easy to read.

  • http://blog.botfu.com Kevin Marshall

    The problem is it’s basically like saying “French is unreadable” because you don’t know French well enough yet.

    Regex is not a simple ‘language’ to learn, but it *is* very powerful and well worth spending lots of time on…you could probably create a subset of regex that was more ‘readable’ (and many have) but what you’ll give up is lots of power and flexibility.

    But I do feel for you…even after years of working with regex, I often make subtle mistakes that take a while to debug…such is life.

  • http://www.google.com/profiles/lablua Kevin C.

    A lot of people are intimidated by the poor readability of regex, and I’ll admit that even as an expert, reading those messes above is tough. One way Perl tries to improve on this is the x flag, which lets you add whitespace and comments to regular expressions. The x flag would allow your rearrangement into rows while keeping the original symbols.
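
As far as I know, JavaScript regex literals have no equivalent of Perl's x flag, but a similar one-piece-per-row layout can be approximated by joining commented string fragments. This is only a sketch, and `verbose` is a made-up helper name:

```javascript
// Hypothetical helper: joins regex fragments so each row can carry a comment.
function verbose(parts, flags = "") {
  return new RegExp(parts.join(""), flags);
}

// The phone pattern from the post, one piece per row.
// Backslashes are doubled because the pieces are string literals.
const phone = verbose([
  "^",
  "\\(?",     // optional opening paren
  "(\\d{3})", // area code (captured)
  "\\)?",     // optional closing paren
  "[\\- ]?",  // optional dash or space
  "(\\d{3})", // prefix (captured)
  "[\\- ]?",  // optional dash or space
  "(\\d{4})", // line number (captured)
  "$",
]);

console.log(phone.test("555 123 4567")); // true
```

The doubled backslashes are the main cost of this approach compared with a real x flag.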

    None of your translations appear to keep the original meaning regarding optional characters.

    I do wonder if maybe the time has come for a “Pythonization” of reg ex – a new more readable, even if somewhat more verbose, way to write regular expressions, but retaining all the power.

    Here is my attempt at a more readable version:

    start_of_line
    optional “(“
    (3 digits) # I’ve retained parens for capture groups
    optional “)”
    optional any of “- ” # Using whitespace in quotes here is probably not good for readability
    (3 digits)
    optional any of “- “
    (4 digits)
    end_of_line

    A lot more verbose, but a lot more readable I think too… wonder if I should attempt the email one..

  • http://www.google.com/profiles/lablua Kevin C.

    That email regex is actually somewhat poorly written… there are a lot of cases where it is written like /X|[YZ]|[a-z]/ when it could say /[XYZa-z]/. I’ve preserved such oddities exactly below. I also put a pretty-printed version of it at https://gist.github.com/806489, which I needed to make in order to translate it into the version below.

    {
      case_insensitive flag
      begin_of_line
      (
        (
          once_or_more (any between “a” and “z” or digit or any of “!#$%&'*+-/=?^_`{|}~” or any between “\u00A0” and “\uD7FF”, “\uF900” and “\uFDCF”, “\uFDF0” and “\uFFEF”)
          zero_or_more ( “.” once_or_more (any between “a” and “z” or digit or any of “!#$%&'*+-/=?^_`{|}~” or any between “\u00A0” and “\uD7FF”, “\uF900” and “\uFDCF”, “\uFDF0” and “\uFFEF”) )
        )
        or
        (
          zero_or_more (
            (double_quote)
            optional (
              optional ( zero_or_more (space or tab) (cr lf) )
              once_or_more (space or tab)
            )
            (
              ([\x01-\x08\x0b\x0c\x0e-\x1f\x7f] or “\x21” or [\x23-\x5b] or [\x5d-\x7e] or [\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
              or
              ( “\” ([\x01-\x09\x0b\x0c\x0d-\x7f] or [\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]) )
            )
          )
          optional (
            optional (zero_or_more (space or tab) (cr lf))
            once_or_more (space or tab)
          )
          (double_quote)
        )
      )
      “@”
      once_or_more (
        (
          (any between “a” and “z” or digit or any between “\u00A0” and “\uD7FF”, “\uF900” and “\uFDCF”, “\uFDF0” and “\uFFEF”)
          or
          (
            (any between “a” and “z” or digit or any between “\u00A0” and “\uD7FF”, “\uF900” and “\uFDCF”, “\uFDF0” and “\uFFEF”)
            zero_or_more (any between “a” and “z” or digit or “-” or “.” or “_” or “~” or any between “\u00A0” and “\uD7FF”, “\uF900” and “\uFDCF”, “\uFDF0” and “\uFFEF”)
            (any between “a” and “z” or digit or any between “\u00A0” and “\uD7FF”, “\uF900” and “\uFDCF”, “\uFDF0” and “\uFFEF”)
          )
        )
        “.”
      )
      (
        (any between “a” and “z” or any between “\u00A0” and “\uD7FF”, “\uF900” and “\uFDCF”, “\uFDF0” and “\uFFEF”)
        or
        (
          (any between “a” and “z” or any between “\u00A0” and “\uD7FF”, “\uF900” and “\uFDCF”, “\uFDF0” and “\uFFEF”)
          zero_or_more (any between “a” and “z” or digit or “-” or “.” or “_” or “~” or any between “\u00A0” and “\uD7FF”, “\uF900” and “\uFDCF”, “\uFDF0” and “\uFFEF”)
          (any between “a” and “z” or any between “\u00A0” and “\uD7FF”, “\uF900” and “\uFDCF”, “\uFDF0” and “\uFFEF”)
        )
      )
      optional “.”
      end_of_line
    }

    Summary of proposed new reg ex language:

    * New regex quantifiers are prefixes instead of postfixes

    New regex                    Old regex
    (                            (
    )                            )
    “literal”                    literal
    “\”                          \\
    # Comment to end of line
    any between “X” and “Y”      [X-Y]
    any of “XYZ”                 [XYZ]
    begin_of_line                ^
    cr                           \x0d  \r
    double_quote                 \x22  "
    end_of_line                  $
    digit                        \d
    digits                       \d
    lf                           \x0a  \n
    once_or_more                 +
    optional                     ?
    space                        \x20
    tab                          \x09  \t
    zero_or_more                 *

    Probably should add some easy way to define a new term within an expression.

    So far everything I’ve come up with above would allow for automatic translation between plain regular expressions and this newfangled readable syntax.
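
To give a feel for what that automatic translation might look like in one direction, here is a rough sketch that handles only the handful of terms used in the readable phone number version. The function names are hypothetical, straight quotes stand in for the curly ones, and nesting, `or`, the other quantifiers and the reverse direction are all left out.

```javascript
const escapeLiteral = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
const escapeClass = (s) => s.replace(/[\]\\^-]/g, "\\$&");

// Hypothetical translator for a small subset of the readable syntax.
function translateTerm(term) {
  let m;
  if (term === "start_of_line" || term === "begin_of_line") return "^";
  if (term === "end_of_line") return "$";
  if ((m = /^optional (.+)$/.exec(term))) return "(?:" + translateTerm(m[1]) + ")?";
  if ((m = /^\((\d+) digits\)$/.exec(term))) return "(\\d{" + m[1] + "})";
  if ((m = /^any of "(.+)"$/.exec(term))) return "[" + escapeClass(m[1]) + "]";
  if ((m = /^"(.+)"$/.exec(term))) return escapeLiteral(m[1]);
  throw new Error("Unsupported term: " + term);
}

function readableToRegExp(text) {
  const source = text
    .split("\n")
    // Drop "# comment" tails; a "#" inside a quoted literal is not handled.
    .map((line) => line.replace(/(^|\s+)#\s.*$/, "").trim())
    .filter((line) => line !== "")
    .map(translateTerm)
    .join("");
  return new RegExp(source);
}

const phone = readableToRegExp(`
  start_of_line
  optional "("
  (3 digits)   # parens mark capture groups
  optional ")"
  optional any of "- "
  (3 digits)
  optional any of "- "
  (4 digits)
  end_of_line
`);

console.log(phone.test("(555) 123-4567")); // true
```

Each line maps to one regex fragment, which is what makes a mechanical translation plausible. Nested groups like those in the email version would need a real parser rather than per-line matching.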

  • http://www.victusspiritus.com/ Mark Essel

    I’ve only written a dozen or so regexes between Ruby and JavaScript, and each time it feels like I’m starting from scratch. One of these days… :D .

    Good to know it gets better later.

  • http://www.victusspiritus.com/ Mark Essel

    This is an amazing translation. Thank you, Kevin. It’ll take me some time to work through these expressions in detail, but I’ll be much more comfortable with regexes by the time I’m done. Don’t suppose you have a blog to prop this up on for me to link back to and post around? It’s worth starting one for (WordPress, Blogger, etc.).

  • http://www.pdxbrain.com Tyler

    http://txt2re.com/

    Awesome site that takes a sample string and then gives you a regex to match similar patterns.

  • http://www.victusspiritus.com/ Mark Essel

    woohoo, I can return to blissfully forgetting about regex patterns. If my regex density (number of regular expressions/week) goes over a threshold I’ll eventually learn the syntax. Thanks!