Friday, October 5, 2007

Java Conversions of HTML to XML

I recently ran into the error below in my browser while converting HTML to XML with a Java parser application. HTML entities that most browsers handle fine do not translate so nicely into XML. The reason is that XML predefines only five entities (&amp;, &lt;, &gt;, &quot; and &apos;), so an XML parser rejects "&nbsp;" and the couple of hundred other named HTML entities unless they are declared.

XML Parsing Error: undefined entity
Location: http://localhost:8080/rssFeed.xml
Line Number 255, Column 16:
href="r/3t">360&deg;
---------------^
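
The same failure can be reproduced outside the browser with the JDK's built-in DOM parser. The following is only a minimal sketch: the class name UndefinedEntityDemo and the helper parses are made up for illustration, and the sample markup is modelled on the line from the error above.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical demo class, not part of the original application.
class UndefinedEntityDemo {

    /** Returns true if the string parses as well-formed XML, false otherwise. */
    static boolean parses(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Replace the default handler, which prints fatal errors to stderr.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) throws SAXException { throw e; }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Includes the "entity was referenced, but not declared" case.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Named HTML entity with no declaration: rejected by the parser.
        System.out.println(parses("<a href=\"r/3t\">360&deg;</a>"));
        // Numeric character reference for the same character: accepted.
        System.out.println(parses("<a href=\"r/3t\">360&#176;</a>"));
        // One of the five entities that XML predefines: accepted.
        System.out.println(parses("<a href=\"r/3t\">R&amp;D</a>"));
    }
}
```

The contrast between the first two calls is the whole problem in miniature: the character is fine, only the undeclared name is not.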

Here are a couple of options.

Option 1. Download the following entity definitions:

http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent
http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent

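I concatenated the three files into one, but this is not necessary. Naming the files explicitly keeps the output file out of its own input if you run the command a second time.

cat xhtml-lat1.ent xhtml-symbol.ent xhtml-special.ent > xhtml-pac.ent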

Then, in your XML file, include a reference to the external entity file. You will need to declare an external entity, as described here:

http://www.w3.org/TR/REC-xml/#sec-external-ent

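Example (these two lines go inside the internal subset of your document's DOCTYPE declaration; the first declares a parameter entity, the second pulls in its contents):
<!ENTITY % spec-chars SYSTEM "resources/xhtml-pac.ent">
%spec-chars;
The entity file maps each named entity to a numeric character reference such as &#160;. Numeric references need no declaration, so any XML parser can resolve them once it has read the file.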
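
To see the effect of such declarations in Java, here is a small sketch using the JDK's DOM parser. To stay self-contained it declares two entities inline in the internal subset rather than loading the external file, so it illustrates the mechanism only; the class name DeclaredEntityDemo, the helper textOf and the root element name item are assumptions made for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class; "item" is an assumed root element name.
class DeclaredEntityDemo {

    // Two declarations of the same form as those in xhtml-lat1.ent:
    // each named entity expands to a numeric character reference.
    static final String XML =
            "<!DOCTYPE item [\n"
            + "  <!ENTITY nbsp \"&#160;\">\n"
            + "  <!ENTITY deg  \"&#176;\">\n"
            + "]>\n"
            + "<item>360&deg;</item>";

    /** Parses the XML and returns the text content of the root element. */
    static String textOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // With the declaration in place, &deg; is expanded instead of rejected.
        System.out.println(textOf(XML));
    }
}
```

With the external file, the two inline declarations would be replaced by the parameter entity declaration and reference, and the parser would need to be allowed to read external entities.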

Option 2. Take a look at the Java library JTidy. The Tidy class in the JTidy API can do the above work for you: calling setNumEntities(true) makes it write those special HTML characters as numeric character references instead of named entities. You also get cleaned-up XML automatically.
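
For a feel of what that conversion amounts to, here is a stdlib-only sketch of rewriting named entities as numeric character references. It is not JTidy's implementation and does not call JTidy: the class NumericEntitySketch, the method toNumeric and the four-entry lookup table are all hypothetical, and a real table would cover the full entity set from the files above.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of named-to-numeric entity conversion (Java 9+).
class NumericEntitySketch {

    // Deliberately tiny lookup table: entity name -> Unicode code point.
    static final Map<String, Integer> CODE_POINTS = Map.of(
            "nbsp", 160,
            "copy", 169,
            "deg", 176,
            "eacute", 233);

    // Matches references of the form &name;
    static final Pattern NAMED = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    /** Rewrites known named entities as numeric references; leaves others untouched. */
    static String toNumeric(String html) {
        Matcher m = NAMED.matcher(html);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            Integer codePoint = CODE_POINTS.get(m.group(1));
            // Unknown names (including XML's own &amp;, &lt; etc.) pass through as-is.
            String replacement = (codePoint == null) ? m.group() : "&#" + codePoint + ";";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toNumeric("360&deg;&nbsp;R&amp;D"));
    }
}
```

In practice JTidy is the better choice, since it also repairs malformed markup; the sketch only shows why the output becomes acceptable to an XML parser.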
