Apache: Rewriting URLs

This site runs on an open source content management system called Drupal. While I have done some development for Drupal, most of what you see working here was done by someone else. In fact, there's a lot going on here that I've never figured out. Recently, I started building another site from scratch to remedy that, and I started with the perplexing question of "How did they manage to make that rather succinct URL tell them everything they needed to know?"

Take, for instance, the URL of this page. What you should see in the location bar is something like:

http://ziemecki.net/node/12

If you've ever written a web app, you'll immediately notice a few things missing: the name of the page to be called, and the name-value pairs of the associated variables. For what this URL does, most of us would expect something like:

http://ziemecki.net/node.php?node=12

In fact, the PHP interpreter that reads my spaghetti code will only work on pages ending with the ".php" extension, so how does this actually function?

The first step on the path of working that out was to look into a file called ".htaccess". This is a per-directory configuration file that Apache reads when it serves a request, and here it sits in the site root. There's a lot that can be done with this file, but the section that applies to URL rewriting looks something like this:

# Various rewrite rules
<IfModule mod_rewrite.c>
  RewriteEngine on
  RewriteCond %{REQUEST_FILENAME} !-f
  RewriteCond %{REQUEST_FILENAME} !-d
  RewriteRule ^(.*)$ index.php?q=$1 [L]
</IfModule>

Let's look at that piece by piece:

# Various rewrite rules
<IfModule mod_rewrite.c>
  ...
</IfModule>

This starts off with a comment line describing what's happening here, followed by an if/then block, which basically says that if your web server is not running the "mod_rewrite.c" module, then it should skip this section.

RewriteEngine on

Turn the RewriteEngine on. Simple enough. Once it's on, the rules that follow are evaluated every time a request is made on this web site.

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d

These are "Rewrite Conditions", and they apply to the RewriteRule that follows them. They say to find the full local filesystem path to the file or script matching the request ("REQUEST_FILENAME"), then to make sure that it isn't an existing file ("!-f") or directory ("!-d"). In other words, if the URL is something like:

http://ziemecki.net/dontlook/hiddenfile.php

... and there really is such a file as "hiddenfile.php" in the "dontlook" subdirectory, then the rewrite rule that follows is skipped, and the requested file is delivered exactly as typed.

RewriteRule ^(.*)$ index.php?q=$1 [L]

This is the active ingredient. It says to match everything in the URL past the site root ("^(.*)$") and to treat it as if it were a request for “index.php?q=$1”. “$1” is a back-reference to whatever the first parenthesized group in the regular expression, "(.*)", captured. So, if the request is:

http://ziemecki.net/node/12

... then “node/12” is stripped off, assigned as “$1”, and the entire string becomes:

http://ziemecki.net/index.php?q=node/12

In other words, when the web server sees "http://ziemecki.net/node/12", it interprets it to mean "http://ziemecki.net/index.php?q=node/12". The rewrite happens inside the server, so the address in the visitor's location bar does not change. The “[L]” flag just says this is the last rule and Apache can stop looking for more, which makes more sense if you're running multiple rules.
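
To make the combined effect of the two conditions and the rule concrete, here is a rough PHP sketch of the same decision. It is only an illustration of the logic, not how Apache implements mod_rewrite; the function name "simulateRewrite" and the "$docRoot" parameter are invented for this example.

```php
<?php
// Hypothetical helper: mimics the two RewriteCond checks and the RewriteRule.
// The function name and $docRoot are assumptions made for illustration only.
function simulateRewrite(string $requestPath, string $docRoot): string
{
    // In .htaccess context, the rule pattern is matched against the path
    // without its leading slash (e.g. "node/12").
    $relative = ltrim($requestPath, '/');
    $onDisk = rtrim($docRoot, '/') . '/' . $relative;

    // RewriteCond !-f and !-d: real files and directories are left alone.
    if (is_file($onDisk) || is_dir($onDisk)) {
        return $requestPath;
    }

    // RewriteRule ^(.*)$ index.php?q=$1
    preg_match('/^(.*)$/', $relative, $matches);
    return '/index.php?q=' . $matches[1];
}
```

Under these assumptions, simulateRewrite('/node/12', $docRoot) gives "/index.php?q=node/12" as long as nothing called "node/12" exists under $docRoot, while a path that points at a real file comes back unchanged.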

Now, that looks a little bit more familiar. Of course, it does seem odd that every request goes to the same page. Sounds like a mighty long web page, no?

No.

On a site I'm building, I wrote a function that looks like this:

/*
        Given a source type of "req" (requested page) or
        "ref" (referring page), return an array of query
        string values.
        e.g. $sURL = parseURL("req");
        returns ...
        Array ( [0] => home [1] => something [2] => else )
*/

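function parseURL( $sType )
{
        global $sRefQry;
        // Default to an empty string so an unknown $sType or a missing
        // "q" parameter does not leave $sQry undefined.
        $sQry = "";
        if($sType == "req" && isset($_GET['q'])){$sQry = $_GET['q'];}
        if($sType == "ref" && isset($sRefQry)){$sQry = $sRefQry;}
        // Keep only word characters (letters, digits, underscore) and slashes.
        $sQry = preg_replace('|[^\w/]|', '', $sQry );
        return explode("/", $sQry );
}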

I think the comments are relatively self-explanatory. So, assuming this is a normal request, we'd assign “$_GET['q']” (the value to the right of the equal sign in the rewritten URL above) to the variable “$sQry”. So, now, “$sQry” would equal “node/12”. The “preg_replace” call strips out everything except letters, digits, underscores, and forward slashes, which removes any nefarious characters that someone might enter trying to break the app. Finally, the "return explode" returns an array containing all of the individual items that are separated by a forward slash. In this case, it would return "Array ( [0] => node [1] => 12)".
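
To see what the cleaning step does to a few inputs, here is a small stand-alone sketch. The "splitQuery" function is a hypothetical stand-in for the last two lines of parseURL; it takes the query string as an argument instead of reading “$_GET”, so it can be run from the command line.

```php
<?php
// Hypothetical stand-in for the cleaning and splitting steps of parseURL().
// It takes the query string as an argument instead of reading $_GET.
function splitQuery(string $sQry): array
{
    // Drop every character that is not a letter, digit, underscore, or slash.
    $sQry = preg_replace('|[^\w/]|', '', $sQry);
    return explode('/', $sQry);
}

echo json_encode(splitQuery('node/12')), "\n";          // ["node","12"]
echo json_encode(splitQuery('node/12<script>')), "\n";  // ["node","12script"]
echo json_encode(splitQuery('../etc/passwd')), "\n";    // ["","etc","passwd"]
```

Note that dots are stripped along with everything else, so a ".." can never survive into a file path built from these pieces.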

At the top of my “index.php”, I assign the first item in the array to a variable that contains the page it refers to:

$sURL = parseURL("req");
$p = $sURL[0];

In this case, "$p" would now equal "node".

Further down, I include all the code associated with the “node” page into the mostly empty “index.php” page.

       
        $page = "../modules/" . $p . ".mod";
        if (file_exists($page)){
                require $page;
        } else {
                $sMsg = "User requested nonexistent page: " . $page;
                writeLog ($sMsg, "");
                echo getMsg("404");
        }

This says that “node” really means “../modules/node.mod”, a file that sits outside the web-accessible directory and so cannot be requested directly (a much safer place to keep operative code). Then it checks to see if “$page” exists. If it does, it pulls all the code from that module into “index.php”. If not, it logs the request and returns a custom “File Not Found” message.
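
The lookup step can also be pulled into a small function, which makes it easy to add one more check on the page name before touching the filesystem. This is only a sketch of one way to do it, not the code running on my site; "resolveModule" and its "$moduleDir" parameter are made up for the example.

```php
<?php
// Hypothetical helper: map a page name to a module file, or return null if
// the name looks unsafe or the file is missing. $moduleDir is an assumed
// parameter standing in for "../modules".
function resolveModule(string $p, string $moduleDir): ?string
{
    // Accept only plain names such as "node"; this rejects empty strings
    // and anything containing slashes or dots.
    if (!preg_match('/^\w+\z/', $p)) {
        return null;
    }
    $page = rtrim($moduleDir, '/') . '/' . $p . '.mod';
    return is_file($page) ? $page : null;
}
```

The caller would then "require" the returned path, or log the request and show the 404 message when it gets null back.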

Now, wasn't that easy?
