Page Encoding Revisited

In a previous post, Zach described a method for determining a page’s encoding and converting strings to that encoding in PHP. The code he gave is example code, but it is also a good starting point for showing how simple example code can be extended into robust code you could actually use in a production environment. With that, let’s take a look at the original function:

function convert_utf8_content_to_iso_8859_1_if_page_default($in_string) {
    // Get headers as an assoc array
    $page_headers = get_headers($the_page_url, 1);
    $page_encoding = "ISO-8859-1";
    if($page_headers["Content-Type"]) {
        $page_encoding = get_charset_from_page_header($page_headers["Content-Type"]);
    }
    if($page_encoding == "UTF-8") {
        return $in_string;
    }
    return mb_convert_encoding($in_string, "ISO-8859-1", "UTF-8");
}

This function works as is (except for one bug: when the page’s encoding is neither UTF-8 nor ISO-8859-1, the string is still converted to ISO-8859-1), but a few things keep it from being an ideal solution for converting encodings. Part of it is the functions used within it, and part of it is the scope of the problem the function tries to solve. First, let’s state what this code does:

  1. It requests the headers from the server for the page at $the_page_url (the PHP page the function is called from).
  2. It determines the server’s encoding from those headers.
  3. If the server’s encoding is UTF-8, it simply returns the string.
  4. Otherwise, it converts the string from UTF-8 to ISO-8859-1 and returns that.
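
The bug mentioned earlier comes from the last step: the target encoding is hard-coded, so a page served as, say, ISO-8859-15 still receives ISO-8859-1 bytes. Here is a minimal sketch of one possible fix, assuming the mbstring extension is loaded and the page encoding has already been determined and is passed in. The function name and its second parameter are hypothetical, not part of Zach’s code.

```php
<?php
// Hypothetical variant of the original function: the page encoding is a
// parameter, and it is also used as the conversion target.
function convert_utf8_to_page_encoding($in_string, $page_encoding) {
    // Encoding names are case-insensitive, so compare them that way.
    if (strcasecmp($page_encoding, 'UTF-8') === 0) {
        return $in_string;
    }
    // Convert to whatever the page uses, not just ISO-8859-1.
    return mb_convert_encoding($in_string, $page_encoding, 'UTF-8');
}

$latin1 = convert_utf8_to_page_encoding("caf\xC3\xA9", 'ISO-8859-1');   // "é" as one byte
$latin9 = convert_utf8_to_page_encoding("\xE2\x82\xAC", 'ISO-8859-15'); // euro sign as one byte
```

The header lookup is left out on purpose here; how the page encoding is obtained is a separate question.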

The first issue we should resolve is the request to the server. Anytime a request is made to the server your code resides on, resources are used: Apache ties up one of its processes, that’s one less request your server can handle for another user, and so on. So we definitely want to do this differently. The code reads the encoding from the headers Apache sends out. This encoding is actually defined in Apache’s configuration, and once we know what it is, it will not change (unless we explicitly change it in the Apache configs). That leaves us two options: we could set PHP’s default encoding to the same encoding as Apache’s or, if needed, we can simply set a define with Apache’s encoding (this seems like a rarely needed case). In Zach’s previous post he updated the above function to use a helper function that caches the server’s encoding. While this is better than making a request every time our conversion function gets called, the best solution is to set Apache and PHP to use the same encoding.
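
If PHP and Apache share an encoding, the lookup can be done locally instead of over HTTP. The following is a minimal sketch, assuming php.ini’s default_charset has been set to match the charset configured in Apache. The helper name get_page_encoding and the UTF-8 fallback are hypothetical choices, not part of the original code.

```php
<?php
// Hypothetical helper: resolves the page encoding without an HTTP request.
function get_page_encoding() {
    static $encoding = null; // computed once per request, then reused
    if ($encoding === null) {
        $configured = ini_get('default_charset');
        // Assumption: fall back to UTF-8 when default_charset is unset or empty.
        $encoding = ($configured === false || $configured === '')
            ? 'UTF-8'
            : strtoupper($configured);
    }
    return $encoding;
}
```

Because the value comes from configuration rather than from a response header, it costs no Apache process and stays correct for as long as the two configurations are kept in sync.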

Technically, if PHP’s encoding and Apache’s encoding are the same, then this function is not needed. But what we are really trying to solve here is the case where we have a string that is not encoded with the same encoding as our output. Perhaps this string came from an RSS feed on another machine, or it came from user input (think copy and paste). So we do need a function like this to help us with those situations.

We could simply rename our function to convert_utf8_content_to_iso_8859_1(), but if we decide later to support other encodings we’ll have to write and then call a different function for every encoding pairing. It is also pretty clear that the call to mb_convert_encoding does not need hard-coded encoding names. The mbstring extension also has a function called mb_detect_encoding, which can tell us the encoding of a string. Given this, we could write our conversion function like this:

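// php.ini sets default_charset to 'UTF-8'
define("DEFAULT_ENCODING", ini_get("default_charset"));

function convert_to_page_encoding($string) {
    // Candidates are tried in order; false means detection failed
    $encoding = mb_detect_encoding($string, 'ASCII,UTF-8,ISO-8859-1');
    if ($encoding !== false) {
        // Nothing to do when the string already uses the page's encoding
        if (DEFAULT_ENCODING == $encoding) {
            return $string;
        }
        return mb_convert_encoding($string, DEFAULT_ENCODING, $encoding);
    }
    // Fail loudly so undetectable input gets noticed and fixed
    die("Could not detect encoding of string!");
}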

And that’s all there is to it! We can now easily convert any string to our page’s encoding. Well, not really. Working with string encodings can be quite difficult, and the weak link in the above function is the call to mb_detect_encoding. While it works for the cases we use it for most, it’s not robust enough to handle detection of any arbitrary encoding. In fact, ISO-8859-1 will be returned for any string encoded in one of the other ISO-8859-* encodings, because every byte sequence is also valid ISO-8859-1. You also have to be very specific about the order in which encodings are searched, and even then you may get a false positive. However, for many cases the above function will work, and more importantly it’ll gladly let us know when there is an issue we need to fix should we run into it later on.
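
One way to soften the detection problem is to let callers that already know the source encoding (for example, from a feed’s XML declaration) pass it in, and to fall back to detection only when they don’t. The sketch below assumes the mbstring extension is loaded. The function name and the $from parameter are hypothetical, and it throws an exception where the function above calls die, purely as an illustration.

```php
<?php
// Hypothetical variant: $from lets the caller bypass detection entirely.
function convert_string_encoding($string, $to, $from = null) {
    if ($from === null) {
        // Strict mode (third argument) rejects candidates for which the
        // string contains invalid byte sequences.
        $from = mb_detect_encoding($string, array('ASCII', 'UTF-8', 'ISO-8859-1'), true);
        if ($from === false) {
            throw new InvalidArgumentException('Could not detect encoding of string!');
        }
    }
    if (strcasecmp($from, $to) === 0) {
        return $string; // already in the target encoding
    }
    return mb_convert_encoding($string, $to, $from);
}

// Byte 0xA4 is the euro sign in ISO-8859-15 but the generic currency sign
// in ISO-8859-1, and detection cannot tell the two apart.
$byte     = "\xA4";
$guessed  = convert_string_encoding($byte, 'UTF-8');                // treated as ISO-8859-1
$explicit = convert_string_encoding($byte, 'UTF-8', 'ISO-8859-15'); // caller supplies the encoding
```

The two results differ even though the input byte is the same, which is exactly the kind of false positive that detection alone cannot catch.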

I mostly wanted to write on this topic to bring up a point about writing code in general. Almost everything in coding can be done with several different approaches. Programmers are constantly pulled between getting something that works and coming up with the ideal solution. Sometimes when you get into bug-fixing mode it is easy to do something that fixes the specific bug but does not fix the general problem the bug describes. It’s also easy to get bogged down trying to write the ideal solution when a halfway approach makes more sense (in terms of time and effort for value gained). This is the process we often follow when writing code. We start out by getting something that works, but when we step back and look at it, we see there are improvements that can be made. And oftentimes, after those improvements, we see there are more to be made. Oddly enough, that’s part of the fun of coding.
