Sunday, September 16, 2007

Java HTML Screen Scraping (the Easy Way)

I thought this might be interesting. I have written screen scrapers in Perl in the past, but recently I started running JTidy over an HTML stream in Java. JTidy cleans up the HTML and parses it into a DOM tree, and I then use XPath to search through the newly imported document. Here's an example class...


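import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLEncoder;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.xpath.XPathAPI; // assumed to be Xalan's XPathAPI; the original listing omitted this import
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;


public class Keywords {

    private static final Log log = LogFactory.getLog(Keywords.class.getName());

    // URL of the page to scrape (left blank here)
    private static String targetURLString = "";
    // Selects every row of every table in the tidied page
    private static String xpath = "//table/tr";

    // Clean up the raw HTML and parse it into a DOM tree
    public Document tidy(InputStream inputStrm) {
        Tidy tidy = new Tidy();
        Document tidyDOM = tidy.parseDOM(inputStrm, null);
        return tidyDOM;
    }

    // POST the keyword to the target URL and return the matching table rows
    public NodeList getKeywordList(String keyword) {
        NodeList urlNodes = null;

        try {
            URL targetURL = new URL(targetURLString);
            URLConnection targetConnection = targetURL.openConnection();
            targetConnection.setDoOutput(true); // must be set before writing a request body

            // Post to output
            OutputStreamWriter out = new OutputStreamWriter(targetConnection.getOutputStream());
            out.write("stst=" + URLEncoder.encode(keyword, "UTF-8"));
            out.close(); // flush the request body before reading the response

            Document xmlResponse = tidy(targetConnection.getInputStream());

            urlNodes = XPathAPI.selectNodeList(xmlResponse, xpath);

        } catch (Exception urle) {
            log.error("Error: " + urle.toString());
        }

        // null if the request or the parse failed
        return urlNodes;
    }
}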


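Once getKeywordList has returned its nodes, the usual next step is pulling text out of them. Here is a rough sketch of that step, under a few assumptions: it uses the JDK's built-in javax.xml.xpath instead of XPathAPI, it parses a small well-formed string with DocumentBuilder as a stand-in for JTidy's output, and the class name RowTextSketch and the sample table are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the Keywords class above.
public class RowTextSketch {

    // Returns the whitespace-normalized text of every node the expression matches.
    public static List<String> rowTexts(Document dom, String expression) throws Exception {
        XPath xp = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xp.evaluate(expression, dom, XPathConstants.NODESET);
        List<String> texts = new ArrayList<String>();
        for (int i = 0; i < nodes.getLength(); i++) {
            texts.add(nodes.item(i).getTextContent().trim().replaceAll("\\s+", " "));
        }
        return texts;
    }

    // Stand-in for tidy(): parses markup that is already well formed into a DOM.
    public static Document parse(String wellFormed) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(wellFormed)));
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample data standing in for a scraped page.
        Document dom = parse("<html><body><table>"
                + "<tr><td>java</td><td>120</td></tr>"
                + "<tr><td>perl</td><td>45</td></tr>"
                + "</table></body></html>");
        // td[1] picks the first cell of each row.
        for (String cell : rowTexts(dom, "//table/tr/td[1]")) {
            System.out.println(cell);
        }
    }
}
```

With a real page, the Document returned by tidy() would take the place of parse(), and the XPath expression would need adjusting to the page's actual table layout.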
Converting the HTML to a DOM tree means the data can then be transformed into other formats with XSL. Markup for mobile devices, RSS feeds, and other web pages are just a few examples of what can be generated from the same scraped document.
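As a rough illustration of the XSL step, the sketch below runs a DOM through a stylesheet with the JDK's javax.xml.transform API and produces a bare-bones RSS-style fragment. Everything specific here is an assumption for the sake of the example: the class name XslSketch, the inline stylesheet, the sample table, and the use of DocumentBuilder in place of JTidy to build the DOM.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical example class, not part of the original listing.
public class XslSketch {

    // Made-up stylesheet: one <item> per table row, titled with the row's first cell.
    static final String XSL =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='/'>"
        + "<rss version='2.0'><channel><title>Scraped rows</title>"
        + "<xsl:for-each select='//table/tr'>"
        + "<item><title><xsl:value-of select='td[1]'/></title></item>"
        + "</xsl:for-each>"
        + "</channel></rss>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to the DOM and returns the result as a string.
    public static String toRss(Document dom) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(dom), new StreamResult(out));
        return out.toString();
    }

    // Stand-in for the JTidy step: parses markup that is already well formed.
    public static Document parse(String wellFormed) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(wellFormed)));
    }

    public static void main(String[] args) throws Exception {
        Document dom = parse("<html><body><table>"
                + "<tr><td>java</td><td>120</td></tr>"
                + "<tr><td>perl</td><td>45</td></tr>"
                + "</table></body></html>");
        System.out.println(toRss(dom));
    }
}
```

Swapping in a different stylesheet is all it takes to target another format, which is the appeal of getting the page into a DOM first. A real feed would also need the channel's link and description elements, which this fragment leaves out.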


Varsha said...

Thank you for the example. It has helped me in a web scraping project that I'm working on.

Armu said...

Anonymous said...

Thank you for the example.
I want to know how I can post data and then scrape the result page. Basically, I want to log in to a website using a user name and password, and then scrape the site's data.