XML, as you probably know, consists essentially of markup and character data. The characters that have special meaning in markup are > (greater than), < (less than), ' (single quote), " (double quote) and & (ampersand). Character data, which appears inside text nodes or in attribute values, could be anything, any character in any language.
But not all Unicode characters are allowed in XML as character data. Two sections of the XML 1.0 specification (Second Edition) explain which ones are:
- http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char
- http://www.w3.org/TR/2000/REC-xml-20001006#syntax
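The first link points at the Char production, which lists the code point ranges that XML 1.0 accepts. As a minimal sketch, the production can be written as a Java predicate and used to locate the first offending character in a string. The class and method names (`XmlCharCheck`, `isXmlChar`, `indexOfFirstInvalidChar`) are made up for this illustration and do not come from any library.

```java
final class XmlCharCheck {

    // XML 1.0 production:
    // Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXmlChar(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    // Returns the char index of the first code point that is not a valid
    // XML character, or -1 if the whole string is acceptable.
    static int indexOfFirstInvalidChar(String text) {
        int i = 0;
        while (i < text.length()) {
            int codePoint = text.codePointAt(i);
            if (!isXmlChar(codePoint)) {
                return i;
            }
            // Supplementary characters occupy two chars in a Java String.
            i += Character.charCount(codePoint);
        }
        return -1;
    }
}
```

A check like this is useful when you would rather reject or log bad input than silently drop characters from it.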
    private static String removeInvalidXMLCharacters(String xmlString) {
        StringBuilder out = new StringBuilder();
        int codePoint;
        int i = 0;
        while (i < xmlString.length()) {
            // This is the unicode code of the character.
            codePoint = xmlString.codePointAt(i);
            if ((codePoint == 0x9) || (codePoint == 0xA) || (codePoint == 0xD)
                    || ((codePoint >= 0x20) && (codePoint <= 0xD7FF))
                    || ((codePoint >= 0xE000) && (codePoint <= 0xFFFD))
                    || ((codePoint >= 0x10000) && (codePoint <= 0x10FFFF))) {
                out.append(Character.toChars(codePoint));
            }
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }
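The same filtering can probably be expressed more compactly with a regular expression. The sketch below is an alternative to the loop above, not a replacement the original code prescribes; it assumes Java 7 or later (for the `\x{...}` code point syntax in `java.util.regex.Pattern`), and the names `XmlCharFilter` and `stripInvalidXmlChars` are hypothetical.

```java
import java.util.regex.Pattern;

final class XmlCharFilter {

    // Negated character class: matches every code point that falls outside
    // the ranges allowed by the XML 1.0 Char production.
    private static final Pattern INVALID_XML_CHARS = Pattern.compile(
            "[^\\x09\\x0A\\x0D\\x20-\\x{D7FF}\\x{E000}-\\x{FFFD}\\x{10000}-\\x{10FFFF}]");

    // Returns a copy of the input with all invalid XML characters removed.
    static String stripInvalidXmlChars(String text) {
        return INVALID_XML_CHARS.matcher(text).replaceAll("");
    }
}
```

Compiling the pattern once and reusing it avoids recompiling on every call, which matters if you sanitize many strings. The explicit loop may still be easier to step through in a debugger.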