Scott Boms

Batch Encoding Text

Server-side parsing tools are fantastic and can save you enormous amounts of time and effort in producing large-scale websites. At the same time they tend to have their own bugs and intricacies which can confound and perplex the best of us. This was the case today.

Background and the Problem at Hand

We’re preparing to promote a number of changes, fixes, new features and general improvements tonight on the Masterfile.com site. In testing those features we’ve encountered the usual fare: bugs. The latest relates to a small but significant change we’ve been trying to get out the door for some time: moving the site completely to UTF-8.

In a nutshell, somewhere along the way character encodings were getting mangled. As a result, text was not rendering properly and search links generated queries that returned zero results. This is bad and obviously unacceptable.

The Solution

While perhaps not the most elegant solution, we found a little piece of JavaScript code which batch-translates the raw UTF-8 encoded pages to the equivalent HTML entities. It’s simple and not overly tedious. It leaves the surrounding HTML markup untouched, since markup is plain ASCII, and translates only characters outside the ASCII range (character code above 127), such as accented letters.

Here’s the code and a brief description of how to use it:


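function convertToEntities() {
  // Read the raw text from the input textarea (form "form", field "unicode").
  var tstr = document.form.unicode.value;
  var bstr = '';
  // Declare i locally so the loop counter does not leak into the global scope.
  for (var i = 0; i < tstr.length; i++) {
    var code = tstr.charCodeAt(i);
    if (code > 127) {
      // Non-ASCII character: emit a decimal numeric character reference.
      bstr += '&#' + code + ';';
    } else {
      // ASCII, including all HTML markup, passes through unchanged.
      bstr += tstr.charAt(i);
    }
  }
  // Write the converted text to the output textarea (field "entity").
  document.form.entity.value = bstr;
}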
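
One caveat with the function above: charCodeAt() works on UTF-16 code units, so a character outside the Basic Multilingual Plane (an emoji, for example) comes out as two surrogate entities rather than one valid reference. The sketch below is one possible variation, not part of the code we actually used. It assumes an engine with ES2015 support, and the name toEntities is made up for illustration. It takes a string and returns a string, so it can be tried without the form.

```javascript
// Hypothetical helper (not from the original snippet): the same idea as
// convertToEntities(), written as a pure function over code points.
function toEntities(text) {
  var out = '';
  // for...of walks the string by Unicode code point, so a surrogate
  // pair arrives as one two-unit string instead of two separate halves.
  for (var ch of text) {
    var code = ch.codePointAt(0);
    // Non-ASCII becomes a decimal character reference; ASCII is copied.
    out += code > 127 ? '&#' + code + ';' : ch;
  }
  return out;
}

console.log(toEntities('<p>Caf\u00e9</p>')); // prints <p>Caf&#233;</p>
```

To use it with the form, the body of convertToEntities() could shrink to a single line that passes document.form.unicode.value through toEntities() and assigns the result to document.form.entity.value.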

Usage

Create a simple form in a new HTML page with two textarea fields and a button. To match the code above, the form should be named form, the input textarea unicode and the output textarea entity; the output should be set with the readonly attribute. The button should have an onclick attribute which calls the JavaScript function. If it is a true submit button, have the handler return false so the page doesn’t reload before you can read the result. Take a quick look at the sample page I’ve put together to see how it works.