The process of extracting data from Web pages is also referred to as Web scraping or Web data mining.
The World Wide Web, as the largest database in existence, contains a wealth of data that we would like to consume for our own needs.
The problem is that this data is in most cases mixed together with formatting code, which makes the content human-friendly but not machine-friendly.
Manual copy-and-paste is error-prone, tedious and sometimes even impossible.
Web designers usually strive for a clean separation between content and style, using various frameworks and design patterns to achieve it.
Even so, content and markup are usually merged on the server side, so what is delivered to the web client is a single bundle of HTML.
Since every Web site and every Web page is composed using some logic, the reverse process can be described as well: how to extract the desired data from the mixed content.
Every extraction procedure in Web-Harvest is user-defined through XML-based configuration files.
Each configuration file describes a sequence of processors that each perform a small task in order to accomplish the final goal.
The processors execute as a pipeline: the output of one processor execution is the input to the next.
This is best explained with a simple configuration fragment:
<xpath expression="//a[@shape='rect']/@href">
    <html-to-xml>
        <http url="http://www.somesite.com/"/>
    </html-to-xml>
</xpath>
- The http processor downloads content from the specified URL.
- The html-to-xml processor cleans up that HTML, producing XHTML content.
- The xpath processor selects specific links from the XHTML produced in the previous step, giving a sequence of URLs as the result.
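In real configurations this pipeline rarely stands alone; its output is usually captured in a variable for later use. A minimal sketch of that (the variable name links is a choice made here for illustration; config and var-def are standard Web-Harvest elements):

<config>
    <!-- capture the pipeline output in a variable named "links" -->
    <var-def name="links">
        <xpath expression="//a[@shape='rect']/@href">
            <html-to-xml>
                <http url="http://www.somesite.com/"/>
            </html-to-xml>
        </xpath>
    </var-def>
</config>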
The note below is inspired by a good article on web harvesting and on driving Web-Harvest directly from Java.
I often have a need to quickly scrape some data out of a web page (or a list of web pages), which can then be fed into Excel and on to specialist data visualisation tools.
To this end I have turned to WebHarvest, an excellent scriptable open source API for web scraping in Java.
I really, really like it, but there are some quirks and setup issues that have cost me hours, so I thought I'd roll together a tutorial with the fixes.
WebHarvest Config for Maven
When it works, Maven is a lovely tool that hides dependency management for Java projects, but WebHarvest is not configured quite right out of the box to work transparently with it.
(Describing Maven is beyond the scope of this post, but if you don't know it, it's easy to set up with the M2 plugin for Eclipse.)
This is the Maven POM I ended up with to use WebHarvest in a new JavaSE project:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>WebScraping</groupId>
  <artifactId>WebScraping</artifactId>
  <packaging>jar</packaging>
  <version>0.00.01</version>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>
  <build>
    <plugins>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <configuration>
          <source>1.6</source>
          <target>1.6</target>
        </configuration>
      </plugin>
    </plugins>
  </build>
  <repositories>
    <repository>
      <id>wso2</id>
      <url>http://dist.wso2.org/maven2/</url>
    </repository>
    <repository>
      <id>maven-repository-1</id>
      <url>http://repo1.maven.org/maven2/</url>
    </repository>
  </repositories>
  <dependencies>
    <dependency>
      <groupId>commons-logging</groupId>
      <artifactId>commons-logging</artifactId>
      <version>1.1</version>
      <type>jar</type>
      <scope>compile</scope>
    </dependency>
    <dependency>
      <groupId>log4j</groupId>
      <artifactId>log4j</artifactId>
      <version>1.2.12</version>
      <type>jar</type>
      <scope>compile</scope>
    </dependency>
    <dependency>
      <groupId>org.webharvest.wso2</groupId>
      <artifactId>webharvest-core</artifactId>
      <version>1.0.0.wso2v1</version>
      <type>jar</type>
      <scope>compile</scope>
    </dependency>
    <!-- web harvest pom doesn't track dependencies well -->
    <dependency>
      <groupId>net.sf.saxon</groupId>
      <artifactId>saxon-xom</artifactId>
      <version>8.7</version>
    </dependency>
    <dependency>
      <groupId>org.htmlcleaner</groupId>
      <artifactId>htmlcleaner</artifactId>
      <version>1.55</version>
    </dependency>
    <dependency>
      <groupId>bsh</groupId>
      <artifactId>bsh</artifactId>
      <version>1.3.0</version>
    </dependency>
    <dependency>
      <groupId>commons-httpclient</groupId>
      <artifactId>commons-httpclient</artifactId>
      <version>3.1</version>
    </dependency>
  </dependencies>
</project>
You’ll note that the WebHarvest dependencies had to be added
explicitly, because the jar does not come with a working pom listing
them.
Writing A Scraping Script
WebHarvest uses XML configuration files to describe how to scrape a
site – and with a few lines of Java code you can run any XML
configuration and have access to any properties that the script
identified from the page.
This is definitely the safest way to scrape
data, as it decouples the code from the web page markup – so if the site
you are scraping goes through a redesign, you can quickly adjust the
config files without recompiling the code they pass data to.
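For illustration, the few lines of Java in question might look like the following - a minimal sketch assuming the Web-Harvest 1.0 API shipped by the webharvest-core artifact above (the file name scrape.xml, the working directory and the variable name products are placeholders):

import org.webharvest.definition.ScraperConfiguration;
import org.webharvest.runtime.Scraper;
import org.webharvest.runtime.variables.Variable;

public class RunScrape {
    public static void main(String[] args) throws Exception {
        // load the XML scraping script (placeholder file name)
        ScraperConfiguration config = new ScraperConfiguration("scrape.xml");
        // the second argument is a working directory for Web-Harvest output
        Scraper scraper = new Scraper(config, "/tmp/webharvest");
        scraper.execute();
        // read back a variable the script defined with <var-def name="products">
        Variable products = (Variable) scraper.getContext().getVar("products");
        if (products != null) {
            System.out.println(products.toString());
        }
    }
}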
The site provides some good example scripts to show you how to get started, so I won't repeat them here.
The easiest way to create your own is to run the WebHarvest GUI from the command line (i.e. java -jar webharvest_all_2.jar), start with a sample script, and then hack it around to get what you want - it's an easy iterative process with good feedback in the UI.
For instance, let's have a look at the second example from the site and start working on the "Canon products at Yahoo Shopping" sample.
We want to harvest product data starting from the following URL:
http://shopping.yahoo.com/s:Digital%20Cameras:4168-Brand=Canon:browsename=Canon%20Digital%20Cameras:refspaceid=96303108;_ylt=AnHw0Qy0K6smBU.hHvYhlUO8cDMB;_ylu=X3oDMTBrcDE0a28wBF9zAzk2MzAzMTA4BHNlYwNibmF2
The script is:
<config charset="ISO-8859-1">
    <include path="functions.xml"/>

    <!-- collects all tables for individual products -->
    <var-def name="products">
        <call name="download-multipage-list">
            <call-param name="pageUrl">http://shopping.yahoo.com/s:Digital%20Cameras:4168-Brand=Canon:browsename=Canon%20Digital%20Cameras:refspaceid=96303108;_ylt=AnHw0Qy0K6smBU.hHvYhlUO8cDMB;_ylu=X3oDMTBrcDE0a28wBF9zAzk2MzAzMTA4BHNlYwNibmF2</call-param>
            <call-param name="nextXPath">//a[starts-with(., 'Next')]/@href</call-param>
            <call-param name="itemXPath">//li[@class="hproduct" or @class="hproduct first" or @class="hproduct last"]</call-param>
            <call-param name="maxloops">10</call-param>
        </call>
    </var-def>

    <!-- iterates over all collected products and extracts the desired data -->
    <file action="write" path="D:/tmp/canon/catalog.xml" charset="UTF-8">
        <![CDATA[ <catalog> ]]>
        <loop item="item" index="i">
            <list><var name="products"/></list>
            <body>
                <xquery>
                    <xq-param name="item" type="node()"><var name="item"/></xq-param>
                    <xq-expression><![CDATA[
                        declare variable $item as node() external;
                        let $name := data($item//*[@class='title'])
                        let $desc := data($item//*[@class='desc'])
                        let $price := data($item//*[@class='price'])
                        return
                            <product>
                                <name>{normalize-space($name)}</name>
                                <desc>{normalize-space($desc)}</desc>
                                <price>{normalize-space($price)}</price>
                            </product>
                    ]]></xq-expression>
                </xquery>
            </body>
        </loop>
        <![CDATA[ </catalog> ]]>
    </file>
</config>
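Running this script writes D:/tmp/canon/catalog.xml, whose shape follows directly from the XQuery above - something like this, with the real scraped values in place of the ... placeholders:

<catalog>
    <product>
        <name>...</name>
        <desc>...</desc>
        <price>...</price>
    </product>
    <!-- one <product> element per item collected from the list pages -->
</catalog>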
As a simple example, this is a script to go to the Sony-Ericsson developer site’s handset gallery at
http://developer.sonyericsson.com/device/searchDevice.do?restart=true, and rip each handset’s individual spec page URI:
<?xml version="1.0" encoding="UTF-8"?>
<config>
    <!-- loop through the list defined in <list>, running <body> for each item;
         the variables uid and i hold the value and index of the current item -->
    <loop item="uid" index="i">
        <!-- the list section defines what we loop over - here, the value attribute of every option tag -->
        <list>
            <xpath expression="//option/@value">
                <html-to-xml>
                    <http url="http://developer.sonyericsson.com/device/searchDevice.do?restart=true"/>
                </html-to-xml>
            </xpath>
        </list>
        <!-- the body section lists instructions that run on every iteration of the loop -->
        <body>
            <!-- define a new variable for every iteration, using the iteration count as a suffix -->
            <var-def name="uri.${i}">
                <!-- the template tag is important: without it the ${...} syntax is ignored and no substitution happens -->
                <template>device/loadDevice.do?id=${uid}</template>
            </var-def>
        </body>
    </loop>
</config>
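To consume those uri.N variables from Java, the same pattern applies - again a sketch assuming the Web-Harvest 1.0 API, with handsets.xml and the working directory as placeholder names:

import org.webharvest.definition.ScraperConfiguration;
import org.webharvest.runtime.Scraper;
import org.webharvest.runtime.variables.Variable;

public class ListHandsetUris {
    public static void main(String[] args) throws Exception {
        ScraperConfiguration config = new ScraperConfiguration("handsets.xml");
        Scraper scraper = new Scraper(config, "/tmp/webharvest");
        scraper.execute();
        // the script defines uri.1, uri.2, ... (Web-Harvest loop indices
        // are 1-based) - walk them until one is missing
        for (int i = 1; ; i++) {
            Variable uri = (Variable) scraper.getContext().getVar("uri." + i);
            if (uri == null) {
                break;
            }
            System.out.println(uri.toString());
        }
    }
}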


Web-Harvest is an open source Web data extraction tool written in Java. It offers a way to collect desired Web pages and extract useful data from them. Web-Harvest mainly focuses on HTML/XML based web sites, which still make up the vast majority of Web content, and it can easily be supplemented by custom Java libraries to augment its extraction capabilities.