HTML parsers operate in an area where there are no standard operating conditions. Beyond plain HTTP(S), there is no standard protocol for interacting with the web, and much of the HTML on the web does not strictly follow the HTML specification. Web browsers are very tolerant of incorrect or malformed HTML, so such pages still render. This makes life difficult for anyone who wants to harvest meaningful information from the web.
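To see why this tolerance matters, here is a minimal sketch that uses only the JDK's built-in XML parser (none of the libraries compared in this article). It feeds two snippets to a strict parser: one well-formed, and one written the way many real pages are, with unclosed tags. The class name `StrictParseCheck` and both sample strings are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class StrictParseCheck {

    /** Returns true if the JDK's strict XML parser accepts the markup. */
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Replace the default error handler, which prints fatal errors to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false; // the markup is not well-formed
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample inputs.
        String wellFormed = "<html><body><p>Hello</p><br/></body></html>";
        String tagSoup = "<html><body><p>Hello<br><p>World</body></html>";

        System.out.println(isWellFormed(wellFormed)); // true
        System.out.println(isWellFormed(tagSoup));    // false
    }
}
```

A browser would render both snippets, but the strict parser accepts only the first. Lenient HTML parsers exist to handle the second kind.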
HTML parsers come in very handy for this kind of harvesting. While the concept is well known, it is important to choose the right package (library) for the project's needs. One of the best ways to identify the right package is to list the popular HTML parsers, compare their characteristics, and make an informed decision about which one is appropriate.
After some research, the following were identified as candidates.
HTML Cleaner
JSoup
Jeriko HTML Parser
These packages are compared with each other on the characteristics described below. Note that although several other packages are available for HTML parsing, many of them featured on this website, only the ones above were shortlisted for comparison, based on popularity. If you feel that one of your favorites is missing here and deserves to be compared, please comment.
The Comparison Parameters
As a guideline, the following parameters are used for comparison. To be fair, factual information is used wherever possible.
Documentation
– Is there an official site?
– Is a Getting Started guide or tutorial available?
– Detailed documentation
– API reference documentation
– Cookbook or examples
Ease of Use
– how quickly one can set up the environment
– how quickly an average developer can write code
– how easy it is to execute
Performance
– Speed (throughput)
– Memory Consumption
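Speed and memory consumption can be measured in many ways. As a rough sketch (not the method used to produce the ratings in this article), the harness below times repeated parsing of one document and reports an approximate heap growth. The `ParserBenchmark` class, its `measure` method, and the stand-in parser in `main` are all hypothetical; to compare real libraries you would pass each library's own parse call as the function.

```java
import java.util.function.Function;

final class ParserBenchmark {

    /** Outcome of one measurement run. */
    static final class Result {
        final double docsPerSecond;
        final long heapDeltaBytes;

        Result(double docsPerSecond, long heapDeltaBytes) {
            this.docsPerSecond = docsPerSecond;
            this.heapDeltaBytes = heapDeltaBytes;
        }
    }

    static Result measure(Function<String, ?> parser, String html, int iterations) {
        // Warm-up pass so that JIT compilation does not dominate the timing.
        for (int i = 0; i < iterations; i++) {
            parser.apply(html);
        }

        // Holds every parse result so the parsed documents tend to stay
        // reachable while the second heap reading is taken.
        Object[] retained = new Object[iterations];

        Runtime rt = Runtime.getRuntime();
        System.gc(); // only a hint to the JVM, so the heap figures are approximate
        long heapBefore = rt.totalMemory() - rt.freeMemory();

        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            retained[i] = parser.apply(html);
        }
        long elapsedNanos = Math.max(System.nanoTime() - start, 1L);

        long heapAfter = rt.totalMemory() - rt.freeMemory();
        return new Result(iterations / (elapsedNanos / 1e9), heapAfter - heapBefore);
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello<br><p>World</body></html>";
        // Placeholder "parser" that just splits on '<'. Swap in a real library call here.
        Result r = measure(s -> s.split("<"), html, 10_000);
        System.out.printf("%.0f docs/s, heap delta %d bytes%n", r.docsPerSecond, r.heapDeltaBytes);
    }
}
```

The heap figure can even come out negative if the garbage collector runs during the timed loop, so treat it as an indication only. For serious measurements, a dedicated benchmarking tool such as JMH is a better choice.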
Features
– The feature list of the library
– Are the basic features available?
– Special features that distinguish the library from others
Developer/Forum/Community Support
– Is there an active forum or community for support?
– Are there enough discussions available on the internet to refer to?
Baggage (Size & Dependencies)
– Size of the binary file
– Other libraries that this library depends on to work
The Ultimate Comparison: Ratings from 0 (Poor) to 5 (Best)
|#||Parameter||HTML Cleaner||JSoup||Jeriko HTML Parser|
|2||Ease of Use||3||5||5|
|5||Developer/Forum/Community Support||3||5||3|
While HTML Cleaner and Jeriko HTML Parser are promising and easy to use, the overall winner on score is JSoup. The recommendation is to use either HTML Cleaner or Jeriko HTML Parser if you want simple web interaction. If you need complex parsing with good support, then JSoup is the clear choice.
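The article does not say how the overall score is computed. One simple reading, assumed here, is an unweighted sum of the per-parameter ratings. The sketch below applies that assumption to only the two rows that appear in the table above (Ease of Use and Developer/Forum/Community Support), so its totals are partial and are not the article's final scores. The `ScoreTotals` class and its `totals` method are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ScoreTotals {

    /** Sums each library's ratings across all parameters (equal weights assumed). */
    static Map<String, Integer> totals(String[] libraries, int[][] ratingsByParameter) {
        Map<String, Integer> sums = new LinkedHashMap<>();
        for (int col = 0; col < libraries.length; col++) {
            int sum = 0;
            for (int[] row : ratingsByParameter) {
                sum += row[col];
            }
            sums.put(libraries[col], sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        String[] libraries = {"HTML Cleaner", "JSoup", "Jeriko HTML Parser"};
        int[][] ratings = {
            {3, 5, 5}, // Ease of Use
            {3, 5, 3}, // Developer/Forum/Community Support
        };
        System.out.println(totals(libraries, ratings));
        // prints {HTML Cleaner=6, JSoup=10, Jeriko HTML Parser=8}
    }
}
```

If some parameters matter more for your project, multiplying each row by a weight before summing is a natural extension.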