Wednesday, January 30, 2008

Regular Expressions are Evil

PHP was the first server-side language I worked with. It was also my first programming language, and truthfully, although I can now write in several languages, it is the one that taught me (and continues to teach me) about programming: patterns, models, paradigms, et cetera. PHP makes it easy to experiment with these things.

If there is one thing I have hated ever since I found out it existed, though, it is regular expressions. I am apparently not alone in this, because regular expressions continue to be one of the most talked-about and asked-about topics in PHP articles and discussion forums. Even people who know regular expressions often say they dislike them. And indeed, although I now know how to work with them, I still dislike them.

Over the past few months I have generally avoided regular expressions and instead practiced a lot of old-fashioned text parsing. The reason is that regular expressions can easily hog a lot of resources and make your application less efficient. Indeed, I think people use regular expressions for all the wrong reasons and in places where they do not belong. At times I see simply outrageous uses for tasks that any sensible person would accomplish with a creative combination of two or three string functions.
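To illustrate the kind of replacement described above, here is a minimal sketch that pulls the domain out of an e-mail address with two string functions instead of a pattern. The helper name domainOf and the sample addresses are hypothetical, chosen only for this example; it is not meant as a full address validator.

```php
<?php
// Hypothetical helper: return everything after the last "@",
// or null when there is no "@" or nothing follows it.
// A regex equivalent would be something like preg_match('/@([^@]+)$/', ...).
function domainOf($email)
{
    // strrpos() finds the position of the last "@" (false if absent).
    $at = strrpos($email, '@');
    if ($at === false || $at === strlen($email) - 1) {
        return null;
    }
    // substr() returns the rest of the string after that position.
    return substr($email, $at + 1);
}

var_dump(domainOf('user@example.com'));
var_dump(domainOf('nobody'));
```

Whether this is actually faster than the regular expression depends on the input and the PHP version, so it is worth measuring before drawing conclusions; the main gain here is that the intent is readable at a glance.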

As a rule of thumb, I find this works well: "if your first thought is to use regular expressions, you really should not". It has served me well personally, and when you think about it, there are actually fewer places where regular expressions do a better job than string functions or simple parsing. Naturally, there are tasks that you absolutely need regular expressions for. Before PHP's filter library became part of the core, the only way to parse a URL or a complicated DSN (data source name) was to use regular expressions. Thankfully, this is no longer necessary.
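As a hedged sketch of handling a URL without a hand-written pattern: the filter extension's filter_var() with FILTER_VALIDATE_URL can check the URL, and parse_url() can split it into parts. Note that parse_url() is a separate built-in function, not part of the filter extension, and pairing the two here is an assumption made for illustration. The function name describeUrl and the sample URL are hypothetical.

```php
<?php
// Hypothetical helper: validate a URL, then split it into components.
// Returns null when the filter extension rejects the input.
function describeUrl($url)
{
    // filter_var() returns false when validation fails.
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return null;
    }
    // parse_url() returns an associative array (scheme, host, port, path, query, ...).
    return parse_url($url);
}

print_r(describeUrl('http://example.com:8080/path/page.php?id=42'));
var_dump(describeUrl('not a url'));
```

FILTER_VALIDATE_URL only checks the general shape of the URL; it does not confirm that the host exists or that the scheme is one you want to accept.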

Another annoying fact about regular expressions is that they are generally unreadable. Note that I am not talking about small expressions. The long ones, 50 to 100 characters or more, are among the most unfriendly things in the programming industry. An expression makes sense just long enough for you to make sure it works, and after that you, or whoever is made responsible for your code, need to spend a good deal of time staring at it to make any sense of it. Changing or fixing a bug in a complicated regular expression is a nightmare, which is not made any easier by the fact that few programmers comment their expressions properly. Most of the time the expression just sits somewhere in the middle of the code and is treated as if it were as self-explanatory as an if or switch statement.
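For the cases where a pattern is unavoidable, one way to address the commenting problem is PCRE's extended mode (the x modifier), which ignores whitespace in the pattern and lets you add # comments inside it. The sketch below is only an illustration: the function name parseIsoDate and the date format are hypothetical choices, and the pattern checks the shape of the string, not whether the date is a real calendar date.

```php
<?php
// Hypothetical helper: split a YYYY-MM-DD string into integer parts.
// Returns null when the text does not have that shape.
function parseIsoDate($text)
{
    // The x modifier ignores whitespace and allows "#" comments in the pattern.
    $pattern = '/
        ^
        (\d{4})   # year, four digits
        -
        (\d{2})   # month, two digits
        -
        (\d{2})   # day, two digits
        $
    /x';

    if (preg_match($pattern, $text, $m) !== 1) {
        return null;
    }
    return array(
        'year'  => (int) $m[1],
        'month' => (int) $m[2],
        'day'   => (int) $m[3],
    );
}

print_r(parseIsoDate('2008-01-30'));
var_dump(parseIsoDate('30/01/2008'));
```

One thing to watch in extended mode: the comments must not contain the pattern delimiter (here the forward slash), and a literal space or # in the pattern has to be escaped.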

It seems I made this into more of a rant than an insightful article, but that probably only serves to demonstrate the frustration I often experience with regular expressions.

Busy, busy

I think I should apologize for my long absence. Since Christmas and the 2008 new year I have been quite busy with numerous projects. One of the foremost is Atom, my component library, which I am currently in the process of making publicly available. I will write more on that at a later time, when I put together an introductory article for it.

However, now I am forcing myself to finally sit down and write on my blog again :)