How to get the pure content of an HTML page in Java via Regex
Introduction
A few weeks ago, while developing a search engine, I wrote a web crawler. It extracts the content of each page and saves it to the database. HTML tags aren't important to most search engines, so I stripped them out. To do the same, follow the steps below:
1- Remove the script tags and their content:
import java.util.regex.Pattern;

// htmlContent holds the full content of the page, including the HTML code.
String content;
Pattern pattern;
pattern = Pattern.compile("<script.*?>.*?</script>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
content = pattern.matcher(htmlContent).replaceAll("");
Note: In DOTALL mode, the expression . matches any character, including a line terminator. By default this expression does not match line terminators.
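To see why the DOTALL flag matters here, the following minimal sketch runs the same script-removal pattern with and without it. The class name DotallDemo, the method stripScripts and the sample HTML are hypothetical and exist only for illustration.

```java
import java.util.regex.Pattern;

final class DotallDemo {
    // Hypothetical sample input: a script block that spans several lines.
    static final String SAMPLE =
            "<p>Hello</p>\n<SCRIPT type=\"text/javascript\">\nvar x = 1;\n</SCRIPT>\n<p>World</p>";

    // Applies the script-removal pattern from step 1 with the given flags.
    static String stripScripts(String html, int flags) {
        return Pattern.compile("<script.*?>.*?</script>", flags)
                .matcher(html)
                .replaceAll("");
    }

    public static void main(String[] args) {
        // Without DOTALL, "." stops at line terminators, so the multi-line block survives.
        System.out.println(stripScripts(SAMPLE, Pattern.CASE_INSENSITIVE));
        // With DOTALL, "." also matches "\n", so the whole block is removed.
        System.out.println(stripScripts(SAMPLE, Pattern.DOTALL | Pattern.CASE_INSENSITIVE));
    }
}
```

Since script blocks on real pages almost always span several lines, leaving out the flag would let most of them through.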
2- Remove the style tags and their content:
// content and pattern are already declared in step 1, so they are reused here.
pattern = Pattern.compile("<style.*?>.*?</style>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
content = pattern.matcher(content).replaceAll("");
3- Remove all remaining HTML tags, keeping the text between them:
pattern = Pattern.compile("<[^>]*>");
content = pattern.matcher(content).replaceAll("");
4- Replace new lines, tabs and multiple spaces with a single space.
content = content.replaceAll("\n+", " ");
content = content.replaceAll("\t+", " ");
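// Collapse runs of spaces into a single space (an empty replacement would glue the words together).
content = content.replaceAll("( )+", " ");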
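As an alternative sketch, the replacements in step 4 can be folded into one regular expression. This is a substitution, not the method used above: the predefined class \s also covers carriage returns and form feeds, and the final trim() is an extra that drops leading and trailing spaces. The names WhitespaceDemo and collapseWhitespace are hypothetical.

```java
final class WhitespaceDemo {
    // Collapses every run of whitespace into one space.
    // In Java regex, \s matches space, tab, \n, \r, \f and vertical tab.
    static String collapseWhitespace(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(collapseWhitespace("a\n\n\tb   c\r\n"));
    }
}
```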
And now you have the pure content.
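The four steps can be put together in one helper, roughly as sketched below. The class name HtmlTextExtractor and the method extractText are hypothetical. The sketch assumes the patterns are compiled once and reused, collapses whitespace with a single \s+ pattern in place of the separate replacements of step 4, and trims the result. Like the steps above, it is a regex approximation and not a real HTML parser, so it does not decode entities such as &amp; and can be fooled by a > inside an attribute value.

```java
import java.util.regex.Pattern;

final class HtmlTextExtractor {
    // Steps 1 and 2: script and style blocks together with their content.
    private static final Pattern SCRIPT =
            Pattern.compile("<script.*?>.*?</script>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
    private static final Pattern STYLE =
            Pattern.compile("<style.*?>.*?</style>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
    // Step 3: any remaining tag, keeping the text around it.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    // Step 4 (assumption: one \s+ pattern instead of separate replacements).
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String extractText(String htmlContent) {
        String content = SCRIPT.matcher(htmlContent).replaceAll("");
        content = STYLE.matcher(content).replaceAll("");
        content = TAG.matcher(content).replaceAll("");
        content = WHITESPACE.matcher(content).replaceAll(" ");
        return content.trim();
    }

    public static void main(String[] args) {
        // Hypothetical sample page, used only for illustration.
        String html = "<html><head><style>\np { color: red; }\n</style>"
                + "<script>\nvar x = 1;\n</script></head>\n"
                + "<body><h1>Title</h1>\n<p>Hello <b>world</b></p></body></html>";
        System.out.println(extractText(html));
    }
}
```

Note that step 3 replaces tags with an empty string, so text from adjacent elements is joined unless whitespace separates the tags in the source. If that matters for indexing, replacing tags with a space is one possible variation.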
