How to get pure content from HTML page in Java via Regex

Introduction
I wrote a web crawler while developing a search engine a few weeks ago. It extracts the contents of each page and saves them to the database. HTML tags are not important to most search engines, so I removed them. To do the same, follow the steps below:
1- Remove the script tags along with their content:

// htmlContent is the full content of the page, including the HTML markup.

String content;
Pattern pattern;

pattern = Pattern.compile("<script.*?>.*?</script>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
content = pattern.matcher(htmlContent).replaceAll("");

Note: In DOTALL mode, the expression `.` matches any character, including a line terminator. By default this expression does not match line terminators.

2- Remove the style tags along with their content:

// content and pattern were already declared in step 1, so they are reused here.
pattern = Pattern.compile("<style.*?>.*?</style>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
content = pattern.matcher(content).replaceAll("");

3- Remove all remaining HTML tags, keeping the text between them:

pattern = Pattern.compile("<[^>]*>");
content = pattern.matcher(content).replaceAll("");

4- Replace new lines, tabs and multiple spaces with a single space:

content = content.replaceAll("[\r\n]+", " ");
content = content.replaceAll("\t+", " ");
// Collapse runs of two or more spaces into one space (not into an empty string).
content = content.replaceAll(" {2,}", " ");

And now you have the pure content :)
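
The four steps above can be combined into one helper. What follows is only a minimal sketch: the class name `HtmlTextExtractor` and the method name `extractText` are made up for illustration. It also takes two small liberties: a single `\s+` replacement stands in for the three calls of step 4, and the result is trimmed. Like the steps above, it is a regex clean-up and not a real HTML parser, so unusual markup (for example a `>` inside an attribute value) may not be handled.

```java
import java.util.regex.Pattern;

public class HtmlTextExtractor {

    // Patterns are compiled once and reused for every page.
    // DOTALL lets '.' cross line breaks inside a script or style block.
    private static final Pattern SCRIPT = Pattern.compile(
            "<script.*?>.*?</script>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
    private static final Pattern STYLE = Pattern.compile(
            "<style.*?>.*?</style>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    // Assumption: any run of whitespace (spaces, tabs, line breaks) becomes one space.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Hypothetical helper: returns the text of a page without scripts, styles and tags. */
    public static String extractText(String htmlContent) {
        String content = SCRIPT.matcher(htmlContent).replaceAll("");   // step 1
        content = STYLE.matcher(content).replaceAll("");               // step 2
        content = TAG.matcher(content).replaceAll("");                 // step 3
        return WHITESPACE.matcher(content).replaceAll(" ").trim();     // step 4
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p { color: red; }</style>\n"
                + "<SCRIPT type=\"text/javascript\">\nalert('hi');\n</SCRIPT></head>\n"
                + "<body><p>Hello,\n\t<b>world</b>!</p></body></html>";
        System.out.println(extractText(html)); // prints "Hello, world!"
    }
}
```

Because step 3 replaces tags with an empty string, text from neighbouring elements such as `<p>a</p><p>b</p>` ends up joined as `ab`. If that matters for indexing, replacing tags with a space instead is one possible adjustment.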


Java: run command as root by Runtime.getRuntime().exec() in Ubuntu

Hey :)

A few days ago I needed to run the `/etc/init.d/networking restart` command through Runtime.getRuntime().exec() in a Java EE web application. The first and easiest way that came to mind was sudo without a password, and it worked!
* To run sudo without a password, open /etc/sudoers in a text editor such as `nano`:

$ sudo nano /etc/sudoers

Then add your user or group to the end of the file, as below:

# for user
USER_NAME ALL= NOPASSWD: ALL

# for group
%GROUP_NAME ALL= NOPASSWD: ALL

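Here is my Java code:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

String command = "sudo /etc/init.d/networking restart";
Runtime runtime = Runtime.getRuntime();
try {
    Process process = runtime.exec(command);
    // Only standard output is read here; error messages go to process.getErrorStream().
    BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(process.getInputStream()));
    String line;
    // Print each line the command writes until its output ends.
    while ((line = bufferedReader.readLine()) != null) {
        System.out.println(line);
    }
} catch (IOException e) {
    e.printStackTrace();
}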
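
The snippet above reads only the standard output of the command, so an error message from sudo would not be printed. As a possible variation (not the code I actually used), here is a sketch based on ProcessBuilder that merges the error stream into the output and also returns the exit code. The names `CommandRunner`, `Result` and `run` are made up for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Arrays;
import java.util.List;

public class CommandRunner {

    /** Exit code and combined stdout/stderr text of a finished command. */
    public static final class Result {
        public final int exitCode;
        public final String output;

        Result(int exitCode, String output) {
            this.exitCode = exitCode;
            this.output = output;
        }
    }

    /** Hypothetical helper: runs the command and waits until it has finished. */
    public static Result run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true); // send stderr to the same stream as stdout
        Process process = builder.start();

        StringBuilder output = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line).append('\n');
            }
        }
        int exitCode = process.waitFor(); // 0 normally means success
        return new Result(exitCode, output.toString());
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java CommandRunner <command> [arguments...]");
            return;
        }
        Result result = run(Arrays.asList(args));
        System.out.print(result.output);
        System.out.println("exit code: " + result.exitCode);
    }
}
```

It could be started as `java CommandRunner sudo -n /etc/init.d/networking restart`. The `-n` option tells sudo to fail with an error instead of asking for a password, so a missing sudoers entry shows up in the output right away.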

Troubleshooting
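If you get the `sudo: no tty present and no askpass program specified` error, make sure the user that runs the command is listed in /etc/sudoers. In a web application this is usually the user the application server runs as, which may not be your own login user.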

Let me know if you find a similar or easier way :)

Installing Sun JDK 5 on Ubuntu 9.10 and 10.04

Hello :)

As you know, Sun JDK 5 (version 1.5) has been removed from the Ubuntu 9.10 and 10.04 repositories and replaced by version 6.

The easiest way to install Sun JDK 5 is to add its repository from Ubuntu 9.04 to the list of repositories in 9.10 or 10.04. To do so, follow these steps.

1- Open /etc/apt/sources.list with a text editor like gedit:

sudo gedit /etc/apt/sources.list

2- Add the following lines to the end of the file, then save and close it:

## For sun-java5-jdk
deb http://ir.archive.ubuntu.com/ubuntu jaunty-updates main multiverse

3- Update the package lists and install sun-java5-jdk:

 sudo aptitude update
 sudo aptitude install sun-java5-jdk

* The method above can also be used for other applications.

Another way to install JDK 5 is to download the package and its dependencies from packages.ubuntu.com.

Good luck