I thought this might be interesting. I have written screen scrapers in Perl in the past, but I recently started running the HTML stream through JTidy in Java. JTidy cleans up the markup and parses it into a DOM tree, which I can then search with XPath. Here's an example class:
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLEncoder;

import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

public class Keywords {

    private static final Log log = LogFactory.getLog(Keywords.class.getName());
    private static final String targetURLString = "http://inventory.overture.com/d/searchinventory/suggestion/";
    private static final String xpath = "//table/tr";

    // Cleans up the raw HTML with JTidy and returns it as a DOM tree.
    public Document tidy(InputStream inputStrm) {
        Tidy tidy = new Tidy();
        tidy.setQuiet(true);
        tidy.setShowWarnings(false);
        // Passing null as the output stream skips writing the tidied markup.
        return tidy.parseDOM(inputStrm, null);
    }

    // Posts the keyword to the target URL and returns the table rows
    // of the response, or null if the request or parsing fails.
    public NodeList getKeywordList(String keyword) {
        NodeList urlNodes = null;
        try {
            URL targetURL = new URL(targetURLString);
            URLConnection targetConnection = targetURL.openConnection();
            targetConnection.setDoOutput(true);

            // Post the form field, URL-encoded so spaces and symbols survive.
            try (Writer out = new OutputStreamWriter(
                    targetConnection.getOutputStream(), "UTF-8")) {
                out.write("stst=" + URLEncoder.encode(keyword, "UTF-8"));
            }

            Document xmlResponse = tidy(targetConnection.getInputStream());
            // Uses the standard javax.xml.xpath API rather than the
            // JDK-internal com.sun.org.apache.xpath.internal.XPathAPI.
            urlNodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, xmlResponse, XPathConstants.NODESET);
        } catch (Exception urle) {
            // Pass the exception itself so the stack trace is logged.
            log.error("Keyword lookup failed", urle);
        }
        return urlNodes;
    }
}
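To show what can be done with the NodeList that getKeywordList returns, here is a rough sketch that pulls the cell text out of each row. The class name RowText, the helper names, and the sample table are all made up for illustration; the sample is parsed with the JDK's own DOM parser as a stand-in for real Tidy output. It sticks to basic DOM calls (getNodeType, getNodeValue, getChildNodes) because I'm not certain every JTidy release supports getTextContent.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class RowText {

    // Recursively appends the value of every text node under the given node.
    static void appendText(Node node, StringBuilder sb) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            sb.append(node.getNodeValue());
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            appendText(children.item(i), sb);
        }
    }

    // Returns one list of trimmed cell strings per row; a null NodeList
    // (what getKeywordList returns on failure) yields an empty result.
    public static List<List<String>> rowsToCells(NodeList rows) {
        List<List<String>> table = new ArrayList<>();
        if (rows == null) {
            return table;
        }
        for (int i = 0; i < rows.getLength(); i++) {
            List<String> cells = new ArrayList<>();
            NodeList children = rows.item(i).getChildNodes();
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    StringBuilder sb = new StringBuilder();
                    appendText(child, sb);
                    cells.add(sb.toString().trim());
                }
            }
            table.add(cells);
        }
        return table;
    }

    // Made-up sample rows, standing in for the tidied response.
    public static NodeList sampleRows() {
        String html = "<html><body><table>"
                + "<tr><td>1200</td><td>web scraping</td></tr>"
                + "<tr><td>300</td><td>screen scraper</td></tr>"
                + "</table></body></html>";
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)));
            return (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//table/tr", doc, XPathConstants.NODESET);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (List<String> row : rowsToCells(sampleRows())) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

With the real class, the call would look something like RowText.rowsToCells(new Keywords().getKeywordList("some keyword")).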
Once the HTML is a DOM tree, the data can be transformed into other formats with XSL: markup for mobile devices, RSS feeds, or other web pages, to name a few.
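As a minimal sketch of the XSL step, the class below runs a DOM tree through a stylesheet using the JDK's javax.xml.transform API. The class name DomToItems, the inline stylesheet, and the sample table are all hypothetical; in practice the Document would come from the tidy() method above, and I have not checked how every JTidy release behaves as a DOMSource.

```java
import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class DomToItems {

    // Hypothetical stylesheet: emits one <item> per table row,
    // holding the text of the row's second cell.
    static final String XSL =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='/'>"
        + "<items><xsl:for-each select='//table/tr'>"
        + "<item><xsl:value-of select='td[2]'/></item>"
        + "</xsl:for-each></items>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the given stylesheet to the DOM tree and returns the result as text.
    public static String transform(Document dom, String xsl) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(dom), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Made-up sample document, standing in for the tidied response.
    public static Document sampleDocument() {
        String html = "<html><body><table>"
                + "<tr><td>1200</td><td>web scraping</td></tr>"
                + "<tr><td>300</td><td>screen scraper</td></tr>"
                + "</table></body></html>";
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(html)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(sampleDocument(), XSL));
    }
}
```

Swapping in a different stylesheet is all it takes to target another format, which is the main appeal of going through a DOM tree instead of regex-matching the raw HTML.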
3 comments:
Thank you for the example. It has helped me in a project (web scraping) that I'm working on.
http://half-wit4u.blogspot.com/2011/01/web-scraping-using-java-api.html
Thank you for the example.
How can I post data and scrape the resulting page? Basically, I want to log in to a web site with a user name and password, then scrape the site's data.