Friday, July 14, 2017

Strip HTML Entities | Remove HTML Tags From String | Stripping HTML Tags in Java | Remove Html Tags From String Using Java | Java: RegEx To Remove HTML Tags

I am writing a program that reads text content from a file using Java's BufferedReader class. I can already remove unwanted characters such as '(' or '.' with the replaceAll() method, but I also want to remove HTML tags such as "div", "span", "p" and many others, as well as entities such as &amp;. Is there a good way to remove HTML from a Java string? Yes, for reasonably simple input it can be done with regular expressions. The example below strips HTML tags from text; of the HTML entities, it handles only &nbsp;, which it replaces with a space.
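Before the full program, the core idea can be sketched in a few lines. This is only a minimal illustration, not the approach the full program takes: the class name TagStripSketch and the pattern are assumptions made for this example. It treats anything from '<' followed by a letter, '/' or '!' up to the next '>' as a tag, so it leaves the bodies of script and style elements in place and can be fooled by a '>' inside a comment or attribute value.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the full program.
final class TagStripSketch {
    // '<' followed by a letter, '/' or '!', then everything up to the next '>'.
    private static final Pattern TAG = Pattern.compile("<[a-zA-Z/!][^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String strip(String html) {
        // Replace each tag with a space so neighbouring words do not run together,
        // then collapse the runs of whitespace that this leaves behind.
        String noTags = TAG.matcher(html).replaceAll(" ");
        return WHITESPACE.matcher(noTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // prints: Hello world !
        System.out.println(strip("<div class=\"box\">Hello <b>world</b>!</div>"));
    }
}
```

Because every tag becomes a space, punctuation that directly follows a closing tag ends up separated from the preceding word, which is the same effect visible in the output of the full program.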

The full Java code example is below:


package com.pkm;

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * @author PRITOM K MONDAL
 */
public class StripHtmlTags {
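    // Any remaining opening or closing tag, including single-letter tags such as <p>
    // and names containing '-' such as </p-if>.
    private static final String emptyTag = "</?[a-zA-Z][^>]*>";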
    private static final String commentTag = "<!--(.*?)-->";
    private static final String docTypeTag = "<![a-zA-Z0-9]+(.*?)>";

    public static void main(String[] args) throws Exception {
        File source = new File("html.txt");
        String text = new String(readObjectFromFile(source), "UTF-8");
        //text = "<p-if case='n=1'>TESTED_N_EQ_1</p-if>";
        text = parse(text);
        println("-------------------------------");
        println(text);
        println("-------------------------------");
    }

    private static String parse(String text) throws Exception {
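        // Replace line breaks with a space (not ""), so words on adjacent lines are not glued together.
        text = text.replace("\r\n", " ").replace("\n", " ").replace("\r", " ").replace("&nbsp;", " ");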
        text = replaceDocTypeTags(text);
        text = replaceCommentTags(text);
        text = replaceScriptTags(text);
        text = replaceContentTags(text);  
        text = replaceEmptyTags(text);      
        return text.replaceAll("\\s+", " ").trim();
    }
    
    private static String replaceScriptTags(String text) {
        // Remove <script> and <style> elements together with their bodies.
        // The reluctant .*? with DOTALL also covers bodies that contain '<' (for example "a < b").
        Pattern p = Pattern.compile("<(?<tag>script|style)\\b[^>]*>.*?</\\k<tag>\\s*>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        return p.matcher(text).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }

    private static String replaceDocTypeTags(String text) {
        // Replace declarations such as <!DOCTYPE html> with a single space.
        return Pattern.compile(docTypeTag).matcher(text).replaceAll(" ").trim();
    }

    private static String replaceCommentTags(String text) {
        // DOTALL lets '.' match line terminators, so comments spanning several lines are removed too.
        // (MULTILINE only changes the meaning of ^ and $, which this pattern does not use.)
        return Pattern.compile(commentTag, Pattern.DOTALL).matcher(text).replaceAll(" ").trim();
    }

    private static String replaceContentTags(String text) {
        // Match an innermost element (its body contains no '<') and capture the body,
        // so the text is kept while the surrounding pair of tags is dropped.
        Pattern p = Pattern.compile("<(?<tag>\\w+)[^>]*>(?<body>[^<]*)</\\k<tag>\\s*>", Pattern.CASE_INSENSITIVE);
        String previous;
        // Each pass unwraps one nesting level; stop once a pass changes nothing.
        do {
            previous = text;
            Matcher m = p.matcher(text);
            StringBuffer output = new StringBuffer();
            while (m.find()) {
                m.appendReplacement(output, " " + Matcher.quoteReplacement(m.group("body")) + " ");
            }
            m.appendTail(output);
            text = output.toString();
        } while (!text.equals(previous));
        return text;
    }

    private static String replaceEmptyTags(String text) {
        // Replace any tag that is still left (for example <br>) with a single space.
        return Pattern.compile(emptyTag).matcher(text).replaceAll(" ").trim();
    }

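    private static byte[] readObjectFromFile(File source) throws Exception {
        byte[] bytes = new byte[(int) source.length()];
        // try-with-resources closes the stream even if reading fails
        try (BufferedInputStream inputStream = new BufferedInputStream(new FileInputStream(source))) {
            int offset = 0;
            // read() may return fewer bytes than requested, so loop until the buffer is full
            while (offset < bytes.length) {
                int count = inputStream.read(bytes, offset, bytes.length - offset);
                if (count < 0) {
                    break;
                }
                offset += count;
            }
        }
        return bytes;
    }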
    
    private static void println(Object o) {
        System.out.println("" + o);
    }
}

I used the HTML below as an example:


<div class="m_2691871633108987921InfoContent"><div style="margin:auto;max-width:48em" class="m_2691871633108987921box m_2691871633108987921dark">Pritom you are getting this message as free service for being a user of the PHP Classes site to which you registered voluntarily using the email address <a href="mailto:pritomkucse@gmail.com" target="_blank">pritomkucse@gmail.com</a>. If you wish to unsubscribe go to the <a href="https://www.phpclasses.org/unsub/n/pritomkumarm/u/newclasses/cc/8d4da9/" target="_blank" data-saferedirecturl="https://www.google.com/url?hl=en&amp;q=https://www.phpclasses.org/unsub/n/pritomkumarm/u/newclasses/cc/8d4da9/&amp;source=gmail&amp;ust=1500128983383000&amp;usg=AFQjCNHdGC4JMTFZL4gunLN5iEeeo-H7fA">unsubscribe page</a>.</div>
<ul>
<table cellpadding="0" cellspacing="0">
<tbody><tr>
<td valign="top" style="padding:0"><h2>Class:</h2>
<div><b><a href="https://www.phpclasses.org/package/10378.html" target="_blank" data-saferedirecturl="https://www.google.com/url?hl=en&amp;q=https://www.phpclasses.org/package/10378.html&amp;source=gmail&amp;ust=1500128983383000&amp;usg=AFQjCNGoRBbTEuOrPMzCnTA8hSMocBh0bA">AJAX Form Validate</a></b><br><a href="https://www.phpclasses.org/discuss/package/10378/" target="_blank" data-saferedirecturl="https://www.google.com/url?hl=en&amp;q=https://www.phpclasses.org/discuss/package/10378/&amp;source=gmail&amp;ust=1500128983383000&amp;usg=AFQjCNF-oSc-24S0hUUr873Yu-o0afYlMg">This class support forum</a></div>
<h2>Short description:</h2>
<div>
Render and validate forms submitted via AJAX</div>
<h2>Groups:</h2>
<div>
PHP 5, Validation, AJAX</div>
</td>
<td valign="top" width="1%"></td>
</tr>
</tbody></table>
<h2>Supplied by:</h2><div>Vishv Sahdev</div>
<h2>Detailed description:</h2>
<div>This class can render and validate forms submitted via AJAX.<br>
<br>
It can render HTML forms using templates that take assigned variable values passed to the class as an array.<br>
<br>
The class can also validate a submitted form inputs according to many possible validation rules.<br>
<br>
If there are validation errors, the class will generate a JSON response to the AJAX request to display</div>
</ul>
<ul>
<h2>PHP Classes site tip of the day:</h2>
<div><p><b><big><a href="https://www.phpclasses.org/tips.html?tip=submit-your-components" target="_blank" data-saferedirecturl="https://www.google.com/url?hl=en&amp;q=https://www.phpclasses.org/tips.html?tip%3Dsubmit-your-components&amp;source=gmail&amp;ust=1500128983383000&amp;usg=AFQjCNGyBzcYnZHB_Trk3N_EycwWu5PS3g">Get feedback and recognition submitting your PHP components</a></big></b></p>
</div>
</ul>
<div style="margin:auto;max-width:48em" class="m_2691871633108987921box m_2691871633108987921dark">Pritom if you are not interested in receiving any more messages like this one, go to the <a href="https://www.phpclasses.org/unsub/n/pritomkumarm/u/newclasses/cc/8d4da9/" target="_blank" data-saferedirecturl="https://www.google.com/url?hl=en&amp;q=https://www.phpclasses.org/unsub/n/pritomkumarm/u/newclasses/cc/8d4da9/&amp;source=gmail&amp;ust=1500128983383000&amp;usg=AFQjCNHdGC4JMTFZL4gunLN5iEeeo-H7fA">unsubscribe page</a> .</div>

</div>


And my Java code parses it into the string below:

Pritom you are getting this message as free service for being a user of the PHP Classes site to which you registered voluntarily using the email address pritomkucse@gmail.com . If you wish to unsubscribe go to the unsubscribe page . Class: AJAX Form Validate This class support forum Short description: Render and validate forms submitted via AJAX Groups: PHP 5, Validation, AJAX Supplied by: Vishv Sahdev Detailed description: This class can render and validate forms submitted via AJAX. It can render HTML forms using templates that take assigned variable values passed to the class as an array. The class can also validate a submitted form inputs according to many possible validation rules. If there are validation errors, the class will generate a JSON response to the AJAX request to display PHP Classes site tip of the day: Get feedback and recognition submitting your PHP components Pritom if you are not interested in receiving any more messages like this one, go to the unsubscribe page .
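
The program above replaces only &nbsp;, so other entities such as &amp; pass through unchanged when they appear in the text itself. One possible way to handle them is sketched below. It is an assumption-laden illustration rather than part of the original program: the class name EntityDecodeSketch and the small table of named entities are made up for this example, and only a handful of names plus numeric references are covered. Unknown entities are left as they are. Decoding should run after the tags have been removed, otherwise text such as &lt;b&gt; would turn into a tag and be stripped.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; covers just a few named entities.
final class EntityDecodeSketch {
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("apos", "'");
        NAMED.put("nbsp", " "); // plain space, matching what the program above does
    }

    // &#65; (decimal), &#x41; (hexadecimal) or &name;
    private static final Pattern ENTITY =
            Pattern.compile("&(#[0-9]+|#[xX][0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*);");

    static String decode(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(1);
            String replacement = m.group(); // default: keep unknown entities untouched
            if (body.charAt(0) == '#') {
                try {
                    boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
                    int codePoint = hex
                            ? Integer.parseInt(body.substring(2), 16)
                            : Integer.parseInt(body.substring(1));
                    if (Character.isValidCodePoint(codePoint)) {
                        replacement = new String(Character.toChars(codePoint));
                    }
                } catch (NumberFormatException ignored) {
                    // number does not fit in an int: keep the original text
                }
            } else if (NAMED.containsKey(body)) {
                replacement = NAMED.get(body);
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: Tom & Jerry <3 AB &unknown;
        System.out.println(decode("Tom &amp; Jerry &lt;3 &#65;&#x42; &unknown;"));
    }
}
```

The text is decoded in a single pass, so an input such as &amp;lt; becomes &lt; and is not decoded a second time.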
