Monday, December 2, 2013

Removing invalid characters from XML

XML, as you probably know, consists of markup and character data. The characters > (greater than), < (less than), ' (single quote), " (double quote) and & (ampersand) have special meaning in markup: they delimit tags, attribute values and entity references. Character data, which appears inside text nodes or attribute values, can be almost anything, in almost any language.
But not all Unicode characters are allowed in XML as character data. Two sections of the XML 1.0 specification (Second Edition) explain which ones are:
  1. http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char (the Char production, which lists the legal character ranges)
  2. http://www.w3.org/TR/2000/REC-xml-20001006#syntax (character data and markup)
The following Java code implements the Char rule from the first link. It walks the string one code point at a time and drops every code point that falls outside the permitted ranges.

private static String removeInvalidXMLCharacters(String xmlString) {
    StringBuilder out = new StringBuilder();
    int codePoint;
    int i = 0;
    while (i < xmlString.length())
    {
        // Unicode code point starting at index i; a supplementary character spans two chars.
        codePoint = xmlString.codePointAt(i);
        if ((codePoint == 0x9) ||
                (codePoint == 0xA) ||
                (codePoint == 0xD) ||
                ((codePoint >= 0x20) && (codePoint <= 0xD7FF)) ||
                ((codePoint >= 0xE000) && (codePoint <= 0xFFFD)) ||
                ((codePoint >= 0x10000) && (codePoint <= 0x10FFFF)))
        {
            out.appendCodePoint(codePoint);
        }
        i += Character.charCount(codePoint);
    }
    return out.toString();
}
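
To see the filter in action, here is a minimal, self-contained sketch. The class name XmlCharFilterDemo and the sample strings are made up for illustration; the method body repeats the rule from the code above so that the example compiles on its own.

```java
class XmlCharFilterDemo {

    // Same filtering rule as the method above, repeated so this sketch is self-contained.
    static String removeInvalidXMLCharacters(String xmlString) {
        StringBuilder out = new StringBuilder(xmlString.length());
        int i = 0;
        while (i < xmlString.length()) {
            int codePoint = xmlString.codePointAt(i);
            if (codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                    || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                    || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                    || (codePoint >= 0x10000 && codePoint <= 0x10FFFF)) {
                out.appendCodePoint(codePoint);
            }
            // Advance by one or two chars, depending on the code point.
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // NUL (U+0000) and vertical tab (U+000B) are not legal XML 1.0 characters.
        String dirty = "<name>Jo\u0000hn\u000B Doe</name>";
        System.out.println(removeInvalidXMLCharacters(dirty)); // <name>John Doe</name>

        // A valid surrogate pair (U+1F600) is kept; an unpaired surrogate is dropped.
        String paired = "ok \uD83D\uDE00";
        String unpaired = "bad \uD83D!";
        System.out.println(removeInvalidXMLCharacters(paired).equals(paired)); // true
        System.out.println(removeInvalidXMLCharacters(unpaired));              // bad !
    }
}
```

Note that the filter only removes characters that can never appear in an XML 1.0 document. It does not escape <, > or &, so it is meant to be applied to text in addition to normal escaping, not instead of it.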
