Wednesday, 25 April 2012

Web Harvest library overview




 

Overview

This section describes the motivation behind Web-Harvest 
and the notions and concepts it uses.

Rationale

The World Wide Web, though by far the largest knowledge base, 
is rarely regarded as a database in the traditional sense, 
that is, as a source of information used for further computing. 

Web-Harvest is inspired by the practical need to have the right data 
at the right time. Very often, the Web is the only source 
that publicly provides the wanted information.

Basic concept

The main goal behind Web-Harvest is to make better use of already 
existing extraction technologies. 
Its purpose is not to propose a new method, but to provide a way to 
easily use and combine the existing ones.

Web-Harvest offers a set of processors for data handling and 
control flow. Each processor can be regarded as a function: 
it has zero or more input parameters and gives a result 
after execution. 

Processors can be combined into a pipeline, forming a chain of 
execution. For easier manipulation and data reuse, Web-Harvest 
provides a variable context in which named variables are stored. 

The following diagram describes one pipeline execution:

[Diagram: one pipeline execution]

The result of an extraction is available in files created during 
execution or, if Web-Harvest is used programmatically, 
from the variable context.
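
The idea of processors as functions chained into a pipeline can be sketched in a few lines of Python. This is only an illustrative model, not Web-Harvest code: the function names (fetch, upper, var_def) and the dictionary used as the variable context are assumptions made for the example, and fetch returns canned text instead of touching the network.

```python
# Illustrative model only: each "processor" is a plain function that takes
# the result of the inner processor and returns a new result.

def fetch(url):
    # Stand-in for a download processor; returns canned text, no network access.
    return "<p>hello from " + url + "</p>"

def upper(text):
    # Stand-in for any text-transforming processor.
    return text.upper()

def var_def(context, name, value):
    # Stores a named value in the variable context and passes it along.
    context[name] = value
    return value

context = {}  # the variable context: name -> value

# The pipeline is a set of nested calls: the innermost processor runs first.
result = var_def(context, "page", upper(fetch("http://example.com")))
```

After the pipeline has run, the value can be read back by name with context["page"], which loosely mirrors reading results from the variable context after a programmatic run.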

Configuration language

Every extraction process is defined in one or more 
configuration files, using a simple XML-based language. 

Each processor is described by a specific XML element or a structure 
of XML elements. 

As an illustration, here is an example 
configuration file:

 
<?xml version="1.0" encoding="UTF-8"?>
 
<config charset="UTF-8">
    <var-def name="urlList">
        <xpath expression="//img/@src">
            <html-to-xml>
                <http url="http://news.bbc.co.uk"/>
            </html-to-xml>
        </xpath>
    </var-def>
        
    <loop item="link" index="i" filter="unique">
        <list>
            <var name="urlList"/>
        </list>
        <body>
            <file action="write" type="binary" path="images/${i}.gif">
                <http url="${sys.fullUrl('http://news.bbc.co.uk', link)}"/>
            </file>
        </body>
    </loop>
</config>
 
This configuration contains two pipelines.
The first pipeline performs the following steps:
  1. The HTML content at http://news.bbc.co.uk is downloaded.
  2. HTML cleaning is performed on the downloaded content, producing XHTML.
  3. The XPath expression is evaluated, giving the sequence of image URLs on the page.
  4. A new variable named "urlList" is defined, containing the sequence of image URLs.
The second pipeline uses the result of the first one in order to collect all page images. The loop processor iterates over the URL sequence and, for every item:
  1. Downloads the image at the current URL.
  2. Stores the image on the file system.
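
For readers more at home in a general-purpose language, the first pipeline can be approximated in Python with the standard library alone. This is a rough analogue under stated assumptions, not a description of how Web-Harvest works internally: the class ImgSrcCollector and the function image_urls are names invented for this sketch, the HTML is a hard-coded sample rather than a live download, and duplicates are dropped after resolving the links rather than before. The download and file-writing steps of the second pipeline are left out so that the example needs no network access.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def image_urls(html, page_url):
    """Returns unique absolute image URLs found in html, in first-seen order."""
    collector = ImgSrcCollector()
    collector.feed(html)
    seen = set()
    urls = []
    for src in collector.sources:
        full = urljoin(page_url, src)  # resolve relative links against the page
        if full not in seen:           # keep only the first occurrence
            seen.add(full)
            urls.append(full)
    return urls


# Hard-coded sample page standing in for the downloaded content.
sample = '<html><body><img src="/a.gif"><img src="b.gif"><img src="/a.gif"></body></html>'
url_list = image_urls(sample, "http://news.bbc.co.uk")
```

Here url_list plays the role of the "urlList" variable: a sequence that a later loop could iterate over to download and store each image.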


This example illustrates some procedural-language elements of 
Web-Harvest, such as variable definition and list iteration, 
a few data management processors (file and http), 
and a couple of HTML/XML processing instructions 
(the html-to-xml and xpath processors).  
 

For a slightly more complex example of image download, 
where some other features of Web-Harvest are used, 
see the Examples page. 
For technical coverage of the supported processors, see the User manual.

Data values

All data produced and consumed during an extraction process 
in Web-Harvest has three representations: 
text, binary and list. 

There is also a special data value, empty, whose textual 
representation is an empty string, whose binary representation is 
an empty byte array, and whose list representation is a zero-length list. 

Which form of the data is used depends on the processor 
that consumes it. 

In the previous configuration, the html-to-xml processor uses the 
downloaded content as text in order to transform it to XHTML, 
the loop processor uses the variable urlList as a list in order 
to iterate over it, and the file processor treats the downloaded images 
as binary data when saving them to files. 

In most cases, Web-Harvest chooses the proper representation 
of the data. 

However, in some situations it must be stated explicitly 
which one to use. One example is the file processor, where the default 
data type is text and binary content must be explicitly specified 
with type="binary".
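
The notion of one value with three views can be modelled with a small Python class. This is a toy model written for this overview, not Web-Harvest's own data classes: the class name Value, the use of None for the empty value, UTF-8 as the text encoding and the newline used to join list items are all assumptions of the sketch.

```python
class Value:
    """Toy model of a data value with text, binary and list views."""

    def __init__(self, data=None):
        self.data = data  # None models the special "empty" value

    def to_text(self):
        if self.data is None:
            return ""
        if isinstance(self.data, bytes):
            return self.data.decode("utf-8")
        if isinstance(self.data, list):
            # Assumption: list items are joined with newlines.
            return "\n".join(Value(item).to_text() for item in self.data)
        return str(self.data)

    def to_binary(self):
        if isinstance(self.data, bytes):
            return self.data
        return self.to_text().encode("utf-8")

    def to_list(self):
        if self.data is None:
            return []
        if isinstance(self.data, list):
            return list(self.data)
        return [self.data]  # a single value is a one-item list


empty = Value()
urls = Value(["a.gif", "b.gif"])
```

A consumer then picks the view it needs: a loop would call urls.to_list(), while a binary file writer would call to_binary() on the downloaded content.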

Variables

Web-Harvest provides a variable context for storing and using 
variables. Unlike most programming languages, it imposes no special 
convention for naming variables. 

Thus, names like arr[1], 100 or #$& are valid. 
However, if such variables were used in scripts or 
templates (see the next section), where expressions are dynamically 
evaluated, an exception would be thrown. 

It is therefore recommended to follow usual programming-language naming conventions in order to avoid any difficulties.

When Web-Harvest is used programmatically (from Java code, not from 
the command line), the variable context may be initially populated by 
the user in order to add custom values and functionality. 

Similarly, after execution, the variable context is available for reading variables from it.

When user-defined functions are called (see the User manual), a separate local variable context is created (as in many programming languages, including Java). The valid way to exchange data between the caller and the called function is through the function parameters.
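
The naming caveat above is easy to reproduce in any language that evaluates expressions against a name-to-value store. The following Python sketch is an analogy only and says nothing about which exception Web-Harvest's scripting engines raise: the dictionary context and the helper evaluate are invented for the example, and Python's eval stands in for the script or template evaluator.

```python
# A variable context that accepts any string as a name.
context = {"arr[1]": "first", "#$&": "odd", "item": "ok"}


def evaluate(expression, variables):
    # Evaluates a Python expression with the context as its local names.
    return eval(expression, {"__builtins__": {}}, variables)


# Direct lookup by name works for any string key.
direct = context["arr[1]"]

# Inside an expression, "arr[1]" is parsed as indexing a variable named arr,
# which does not exist, so evaluation fails.
try:
    evaluate("arr[1]", context)
    failed = False
except NameError:
    failed = True

# Names that are valid identifiers are safe to use in expressions.
safe_names = [name for name in context if name.isidentifier()]
```

Only "item" survives the identifier check, which is why conventional names are the safer choice once scripts or templates are involved.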

Scripting and templating

Before Web-Harvest 0.5, the templating mechanism was based on OGNL 
(Object-Graph Navigation Language). In version 0.5, OGNL was replaced by 
BeanShell, and starting from version 1.0, multiple scripting 
languages are supported, giving developers the freedom to choose 
their favourite one. 

Besides the set of powerful text and XML manipulation processors, 
Web-Harvest supports real scripting languages whose code can be 
easily integrated within scraper configurations. The languages 
currently supported are BeanShell, Groovy and JavaScript.

BeanShell is probably the closest to Java in syntax and power, 
but Groovy and JavaScript have other advantages. It is up to 
the developer to use the preferred language, or even to mix different 
languages in a single configuration.

Templating allows marked parts of the text 
(text "islands" surrounded by ${ and }) to be evaluated. 

Evaluation is performed using the chosen scripting language. 
In Web-Harvest, all element attributes are implicitly passed to the 
templating engine. 

In the configuration above, there are two places where the templating engine does its job:
  • path="images/${i}.gif" in the file processor, producing file names based on the loop index,
  • url="${sys.fullUrl('http://news.bbc.co.uk', link)}" in the http processor, where a built-in function is called to calculate the full URL of the image.
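
The island mechanism itself is simple enough to sketch. The Python below is a minimal model under stated assumptions, not Web-Harvest's templating engine: render and ISLAND are names invented here, Python's eval stands in for the chosen scripting language, and full_url is a hypothetical stand-in for the built-in URL function, implemented with the standard urljoin.

```python
import re
from urllib.parse import urljoin

# Matches one ${...} island; the expression itself must not contain "}".
ISLAND = re.compile(r"\$\{(.*?)\}")


def render(template, variables):
    """Replaces every ${expression} island with the evaluated expression."""

    def evaluate(match):
        # eval is acceptable for a toy example; never use it on untrusted input.
        return str(eval(match.group(1), {"__builtins__": {}}, variables))

    return ISLAND.sub(evaluate, template)


# Names visible to the expressions, like the loop variables in the config.
variables = {"i": 3, "link": "/img/logo.gif", "full_url": urljoin}

path = render("images/${i}.gif", variables)
url = render("${full_url('http://news.bbc.co.uk', link)}", variables)
```

Text outside the islands is copied through unchanged, so an attribute with no ${ } markers renders as itself.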
