More mod_rewrite / .htaccess: going the distance with URL masking

Continuing the saga of a customer’s SEO (search engine optimization) journey, I was told that I needed to find a way to completely mask the URLs used by our custom content management system, using only the .htaccess file. Previously I was asked to make it possible to enter blah.com/bliggity into the browser and have it remain in the URL box rather than being replaced by the CMS URL, blah.com/index.cfm?page=123. Now I was asked to make blah.com/index.cfm?page=123 redirect to blah.com/bliggity, so that the CMS would not have to be modified to make the links on every page refer to the SEOed URL. “Mod rewrite should be able to do this, right?”

I assert that it can, but it’ll probably require much more work than the original solution (see Mod_rewrite / .htaccess nuances). But maybe I’ll get lucky, right? Why couldn’t I just put in a rule that redirects the page?

RewriteRule ^index\.cfm?page=123$ http://blah.com/bliggity [R,L]

The flaw in this one isn’t too hard to find. RewriteRule never sees the query string: requesting index.cfm?page=123 passes only index.cfm to the pattern, so this rule cannot match. (The unescaped ? doesn’t help either, since in a regular expression it makes the preceding m optional rather than matching a literal question mark.) So maybe I’ll try a RewriteCond directive to match the page I’m trying to get:

RewriteCond %{QUERY_STRING} page=123
RewriteRule ^index\.cfm$ http://blah.com/bliggity [R,L]

Now this works…sort of. By itself it will redirect the browser properly, but in order to actually display the content for blah.com/bliggity, I need another rule.

RewriteRule ^bliggity$ index.cfm?page=123 [L]

With both rules in place, Firefox complains that the page is not redirecting properly, and no page is displayed. Debugging the HTTP connection shows that the page keeps issuing an HTTP redirect back to itself, indefinitely. Strange, since both rules carry the [L] (“Last”) flag, which I took to mean that mod_rewrite stops rewriting altogether for this request once the rule matches. I’ll admit that I don’t know enough about the internals of mod_rewrite and Apache to tell you exactly how they interact to fulfill a request. But I do know that in this case, despite the “Last” flag, Apache (I’m working with version 1.3.29) sent the internal request for blah.com/index.cfm?page=123, generated by rewriting the browser’s request for blah.com/bliggity, back through mod_rewrite again, producing a redirect loop.
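To make the loop easier to see, here is a toy model in Python of the behaviour described above. It is only a sketch, not how Apache is implemented: it assumes that in .htaccess context the rules are run again after every internal rewrite, and the function names (run_naive_rules, handle_request, browse) are made up for illustration.

```python
import re

def run_naive_rules(path, query):
    """One pass over the two naive rules; returns (action, path, query)."""
    # RewriteCond %{QUERY_STRING} page=123
    # RewriteRule ^index\.cfm$ http://blah.com/bliggity [R,L]
    if re.search(r"page=123", query) and re.search(r"^index\.cfm$", path):
        # No "?" in the substitution, so the original query string is kept.
        return ("redirect", "bliggity", query)
    # RewriteRule ^bliggity$ index.cfm?page=123 [L]
    if re.search(r"^bliggity$", path):
        return ("rewrite", "index.cfm", "page=123")
    return ("serve", path, query)

def handle_request(path, query):
    """Server side (assumed): run the rules again after each internal rewrite."""
    while True:
        action, path, query = run_naive_rules(path, query)
        if action != "rewrite":
            return action, path, query

def browse(path, query, max_redirects=5):
    """Browser side: follow external redirects, giving up after a limit."""
    for redirects in range(max_redirects + 1):
        action, path, query = handle_request(path, query)
        if action == "serve":
            return ("served", path, query, redirects)
    return ("gave up", path, query, max_redirects)

# The pretty URL is rewritten to index.cfm?page=123, which the first rule
# then redirects straight back to the pretty URL, over and over.
print(browse("bliggity", ""))  # ('gave up', 'bliggity', 'page=123', 5)
```

In this model the [L] flag only ends the current pass over the rules. The rewritten request starts a fresh pass, and that is where the first rule catches it.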

The long and the short of it is that this is how mod_rewrite works, and I need to get around it.

I wish I had time to go through the whys and hows of my solution, but I need to get back to work. So here’s my final solution. For each page that I need to redirect, I need to generate a block of code like this. I’ll continue with the blah.com/bliggity example.

RewriteCond %{QUERY_STRING} !page=
RewriteRule ^bliggity$ index.cfm?page=123&Redirected=1 [L]
RewriteCond %{QUERY_STRING} page=123$
RewriteCond %{QUERY_STRING} !Redirected=1
RewriteRule ^index\.cfm$ http://blah.com/bliggity? [R,L]

Here’s the description for each line:

  1. The first RewriteCond checks that the request has no page identifier in its query string. Only then is it processed by the bliggity rule.
  2. If the request matches bliggity, fetch the page, but pass in the extra GET parameter Redirected=1 to flag that this request does not need to be redirected.
  3. If the query string ends in page=123, the request might need to be redirected.
  4. However, if the request has already been flagged with Redirected=1, it should not be redirected again.
  5. Finally, perform the redirection. The trailing ? on the target URL discards the original query string, so the browser ends up at plain blah.com/bliggity.
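The five directives can be traced with a small Python model. This is a sketch under the assumption that Apache runs the rules again after each internal rewrite; it is not a real mod_rewrite implementation, and the function names are hypothetical.

```python
import re

def run_guarded_rules(path, query):
    """One pass over the five-line block; returns (action, path, query)."""
    # RewriteCond %{QUERY_STRING} !page=
    # RewriteRule ^bliggity$ index.cfm?page=123&Redirected=1 [L]
    if not re.search(r"page=", query) and re.search(r"^bliggity$", path):
        return ("rewrite", "index.cfm", "page=123&Redirected=1")
    # RewriteCond %{QUERY_STRING} page=123$
    # RewriteCond %{QUERY_STRING} !Redirected=1
    # RewriteRule ^index\.cfm$ http://blah.com/bliggity? [R,L]
    if (re.search(r"page=123$", query)
            and not re.search(r"Redirected=1", query)
            and re.search(r"^index\.cfm$", path)):
        # The trailing "?" in the substitution drops the query string.
        return ("redirect", "bliggity", "")
    return ("serve", path, query)

def handle_request(path, query):
    """Server side (assumed): run the rules again after each internal rewrite."""
    while True:
        action, path, query = run_guarded_rules(path, query)
        if action != "rewrite":
            return action, path, query

def browse(path, query, max_redirects=5):
    """Browser side: follow external redirects, giving up after a limit."""
    for redirects in range(max_redirects + 1):
        action, path, query = handle_request(path, query)
        if action == "serve":
            return ("served", path, query, redirects)
    return ("gave up", path, query, max_redirects)

# Old CMS link: one redirect to the pretty URL, then the page is served.
print(browse("index.cfm", "page=123"))
# ('served', 'index.cfm', 'page=123&Redirected=1', 1)

# Pretty URL typed directly: served with no redirect at all.
print(browse("bliggity", ""))
# ('served', 'index.cfm', 'page=123&Redirected=1', 0)
```

The Redirected=1 marker is what breaks the cycle: once the internal request carries it, the redirect rule’s conditions no longer hold.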

Caveat: This will probably mess up forms that use index.cfm?page=id as their action with the GET method. If you use index.cfm as a form action but do not supply a page id, your GET forms are unaffected. If you do supply a page id, you have three options: add RewriteCond directives to exclude these requests from redirection (since the redirection overwrites the query string), use the POST method, or look into some of the other rewrite flags (at least one of them, QSA, changes how mod_rewrite handles the query string).

Hard work aside, it looks pretty impressive when it’s done. The links clearly go to index.cfm?page=123, but clicking on them displays a much more informative URL in the browser. In this solution, the .htaccess file is regenerated every time the site structure changes. This allows the links to be dynamic while keeping the solution fairly simple. It’s not too difficult to do; the hardest part of generating the file dynamically is escaping the characters properly.
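As a rough illustration of the generation step, here is a minimal Python sketch. It is not the code the CMS actually uses; the function names (escape_pattern, rewrite_block, build_htaccess), the host parameter, and the choice of Python are all assumptions made for the example. It escapes regular-expression metacharacters in the friendly name and emits one five-line block per page.

```python
# Characters treated here as regex metacharacters in a RewriteRule pattern.
_META = set(".^$*+?()[]{}|\\")

def escape_pattern(text):
    """Backslash-escape regex metacharacters so text is matched literally."""
    return "".join("\\" + ch if ch in _META else ch for ch in text)

def rewrite_block(slug, page_id, host="blah.com"):
    """Return the five directives for one (friendly name, page id) pair."""
    if any(ch.isspace() for ch in slug):
        # Whitespace would split the directive into extra arguments.
        raise ValueError("slug must not contain whitespace")
    page_id = int(page_id)  # only plain integers end up in the rules
    pattern = escape_pattern(slug)
    return "\n".join([
        "RewriteCond %{QUERY_STRING} !page=",
        f"RewriteRule ^{pattern}$ index.cfm?page={page_id}&Redirected=1 [L]",
        f"RewriteCond %{{QUERY_STRING}} page={page_id}$",
        "RewriteCond %{QUERY_STRING} !Redirected=1",
        f"RewriteRule ^index\\.cfm$ http://{host}/{slug}? [R,L]",
    ])

def build_htaccess(pages):
    """Join the blocks for every (slug, page_id) pair into one file body."""
    return "\n".join(rewrite_block(slug, pid) for slug, pid in pages) + "\n"

# Reproduces the bliggity block shown earlier in this post.
print(rewrite_block("bliggity", 123))
```

A friendly name such as about.us would come out as ^about\.us$ in the pattern while staying unescaped in the redirect target. Names that need URL encoding in the redirect target are not handled by this sketch.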

One Response

  1. That’s intense. But seeing as how you are a genius, it is a good thing they gave the task to you…. Also, I just stumbled upon your blog. I hope you don’t mind this intrusion. It was thanks to technorati.

    Peteb - February 1st, 2007 at 8:45 am
