Sunday, September 16, 2007

Java HTML Screen Scraping (the Easy Way)

I thought this might be interesting. I have written screen scrapers in Perl in the past, but recently started using JTidy on an HTML stream in Java. JTidy parses the stream into a DOM tree, and from there I use XPath to search through the newly imported HTML. Here's an example class...

import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLEncoder;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
// Xalan's public XPathAPI; same API as the com.sun.org.apache.xpath.internal
// copy, which is a JDK-internal class and not meant to be imported directly.
import org.apache.xpath.XPathAPI;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

public class Keywords {

    private static final Log log = LogFactory.getLog(Keywords.class);

    private static String targetURLString = "http://inventory.overture.com/d/searchinventory/suggestion/";
    private static String xpath = "//table/tr";

    // Run the raw HTML through JTidy and return a well-formed DOM tree.
    public Document tidy(InputStream inputStrm) {
        Tidy tidy = new Tidy();
        tidy.setQuiet(true);
        tidy.setShowWarnings(false);
        return tidy.parseDOM(inputStrm, null);
    }

    public NodeList getKeywordList(String keyword) {
        NodeList urlNodes = null;

        try {
            URL targetURL = new URL(targetURLString);
            URLConnection targetConnection = targetURL.openConnection();
            targetConnection.setDoOutput(true);

            // POST the form field; encode the keyword in case it contains
            // spaces or other reserved characters.
            OutputStreamWriter out =
                    new OutputStreamWriter(targetConnection.getOutputStream());
            out.write("stst=" + URLEncoder.encode(keyword, "UTF-8"));
            out.close();

            // Tidy the HTML response into a DOM document...
            Document htmlResponse = tidy(targetConnection.getInputStream());

            // ...then pull out the table rows with XPath.
            urlNodes = XPathAPI.selectNodeList(htmlResponse, xpath);

        } catch (Exception e) {
            log.error("Error: " + e.toString(), e);
        }

        return urlNodes;
    }

}
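
If you want to see it run, a throwaway harness like the one below will do. This is only a sketch: the KeywordsDemo class name and the "java" search term are mine, not part of the original code. It sticks to DOM Level 1 calls (getNodeValue on text nodes) so it works with the JTidy DOM.

import org.apache.xpath.XPathAPI;
import org.w3c.dom.NodeList;

public class KeywordsDemo {

    public static void main(String[] args) throws Exception {
        // "java" is just a sample search term.
        NodeList rows = new Keywords().getKeywordList("java");
        if (rows == null) {
            System.out.println("No rows returned.");
            return;
        }
        for (int i = 0; i < rows.getLength(); i++) {
            // Gather the text nodes under each <tr> and print them as one line.
            NodeList textNodes = XPathAPI.selectNodeList(rows.item(i), ".//text()");
            StringBuffer line = new StringBuffer();
            for (int j = 0; j < textNodes.getLength(); j++) {
                line.append(textNodes.item(j).getNodeValue().trim()).append(' ');
            }
            System.out.println(line.toString().trim());
        }
    }
}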

Converting the HTML to a DOM tree means the data can then be converted to any format via XSL. Mobile markup, RSS feeds, and other web pages are just a few examples of the formats that can be generated. The possibilities are limitless.
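
As a sketch of that idea, the tidied DOM can be handed straight to the standard javax.xml.transform API. The keywords-to-rss.xsl stylesheet name below is hypothetical; any stylesheet targeting the tidied markup would do.

import java.io.File;
import java.io.StringWriter;

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

import org.w3c.dom.Document;

public class DomToRss {

    // Transform a tidied HTML DOM into another format (RSS here) via XSL.
    // The stylesheet file name is made up for the example; supply your own.
    public static String transform(Document tidiedDom) throws Exception {
        TransformerFactory factory = TransformerFactory.newInstance();
        Transformer transformer =
                factory.newTransformer(new StreamSource(new File("keywords-to-rss.xsl")));

        StringWriter rss = new StringWriter();
        transformer.transform(new DOMSource(tidiedDom), new StreamResult(rss));
        return rss.toString();
    }
}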

Wednesday, September 12, 2007

Blogging, Search Engines, Web Presence, and Groupies

OK, I'm still waiting on the groupies part, but I have been blogging for about 9 months now. (Ha! I beat the average blog life span.) Consequently, this has piqued my interest and caused me to read up on search engine optimization techniques. One little trick that has been overlooked by many of the SEO experts and books is the blog itself. After analyzing this website, I noticed that Google crawls my blog (a framework provided by Google) quite often; sometimes each article is crawled soon after a new post. I believe that, if for no other reason, this provides an incentive for anyone trying to get their website ranked (or even listed).

Last Friday, for instance, I dropped 6-7 domain links that I have parked on my blog (see the lower right of this page). None of these sites were listed on Google. I published a new post the same night, and within a day Google had scanned those links and my parked domains were suddenly cached in Google. Interestingly, Google did not cache my domain names with the Western Samoa (*.ws) extension as quickly; all of the .coms and .nets were in Google's cache the next day.

This is a great incentive for bloggers! Many bloggers write for pauper's wages (if that much), but Google has given writers the ability to push their domain names into the Google cache without waiting weeks or months. Many bloggers will probably not take advantage of this. Too bad; what easier way to promote keywords and pique special interest in a website than by blogging? Google has once more dropped the carrot in front of us to provide them more Google content :) I haven't even pumped this for all its potential, so I expect to write more on this subject later. In the meantime, I will get back to the technical stuff that most of my readers are trying to find.

Sunday, September 9, 2007

Extreme Gigabit Routers for the Home

I recently purchased a D-Link Xtreme N gigabit router for the home network, and I have been pleasantly surprised. I was hoping that my jump into the world of low-priced ($119) gigabit routers was not going to result in the overly complicated setups, bad performance, or just plain bad quality that I have experienced with other routers. Overall, the experience was good.

I started here. Platon Scheblykin wrote an article essentially reverse-engineering the D-Link DIR-655. Being the hardware geek that I am, this article also fed my curiosity with its breakdown of performance and throughput numbers. Platon also breaks open the DIR-655 and explains, layer by layer, the good and bad points of the layout and packaging. It is an extremely detailed and thorough article, and worth the read if you are considering the D-Link DIR-655.

I haven't had any hiccups so far with the D-Link DIR-655. The wireless network is reaching areas I was never able to reach with my old 100 Mbps Netgear 802.11g wireless router. I also port-forward to my web server, and that setup took a total of 10 minutes. I was happy with the GUI; I thought the interface was well thought out, and I only experienced one issue during setup, which turned out to be outside the router (the Linux firewall on my web server had port 80 closed :) Other than that, I believe the entire setup took 30 minutes to get working with my BellSouth Westell router.

I am also seeing awesome speeds between my file server and the other computers on the network. Much of this, however, can simply be attributed to going from 100 Mbps to 1 Gbps. The QoS is great: I can throttle certain types of traffic while I am working to keep enough bandwidth allocated for VPN.

I do plan on getting a D-Link wireless card, and I hope to have more details on the improvements to my network. I will also be posting pictures and specs soon to help anyone else who might be interested in gigabit home networks, and I should be delving more into the home entertainment area. For now, though, I am happy with the home bandwidth boost, and I will be happier when fiber reaches my neighborhood at the end of this year.