Friday, 18 May 2012

WebHarvest: Easy Java Web Scraping






The process of extracting data from Web pages is also referred to as
Web Scraping or Web Data Mining.

The World Wide Web, as the largest database, often contains data that we would like to consume for our own needs.

The problem is that this data is in most cases mixed together with formatting code, which makes the content human-friendly but not machine-friendly.

Manual copy-paste is error-prone, tedious and sometimes even impossible.
Web software designers usually discuss how to make a clean separation between content and style, using various frameworks and design patterns to achieve it.

Even so, some kind of merge usually occurs on the server side, so that a bundle of HTML is delivered to the web client.
Every Web site and every Web page is composed using some logic.

It is therefore necessary to describe the reverse process: how to fetch the desired data from the mixed content.

Every extraction procedure in Web-Harvest is user-defined through XML-based configuration files.
Each configuration file describes a sequence of processors, each executing some common task, that together accomplish the final goal.

Processors execute in the form of a pipeline.
Thus, the output of one processor is the input to the next.

This is best explained with a simple configuration fragment:

 
<xpath expression="//a[@shape='rect']/@href">
    <html-to-xml>
        <http url="http://www.somesite.com/"/>
    </html-to-xml>
</xpath>
 
When Web-Harvest executes this part of the configuration, the following steps occur:
  1. The http processor downloads content from the specified URL.
  2. The html-to-xml processor cleans up that HTML, producing XHTML content.
  3. The xpath processor searches for specific links in the XHTML from the previous step, giving a sequence of URLs as a result.
Web-Harvest supports a set of useful processors for variable manipulation, conditional branching, looping, functions, file operations, HTML and XML processing, and exception handling. See the User manual for a technical description of the provided processors.
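
The third step can be tried in isolation with nothing but the JDK. The sketch below is not Web-Harvest code: it assumes the HTML has already been downloaded and cleaned into well-formed XHTML (with no default namespace), uses a made-up sample page, and applies the same XPath expression through the standard javax.xml.xpath API. The class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathStepSketch {

    // Applies the XPath expression from the configuration fragment to XHTML text
    // and returns the matching href values.
    static List<String> extractRectLinks(String xhtml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            NodeList hrefs = (NodeList) xpath.evaluate(
                    "//a[@shape='rect']/@href",
                    new InputSource(new StringReader(xhtml)),
                    XPathConstants.NODESET);
            List<String> links = new ArrayList<>();
            for (int i = 0; i < hrefs.getLength(); i++) {
                links.add(hrefs.item(i).getNodeValue());
            }
            return links;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Could not evaluate the XPath expression", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample standing in for the output of the html-to-xml step.
        String xhtml = "<html><body>"
                + "<a shape=\"rect\" href=\"/first\">first</a>"
                + "<a href=\"/ignored\">no shape attribute</a>"
                + "<a shape=\"rect\" href=\"/second\">second</a>"
                + "</body></html>";
        System.out.println(extractRectLinks(xhtml));
    }
}
```

Only the anchors that carry shape="rect" contribute an href, which is what the xpath processor hands on as its result sequence.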




These notes are inspired by a good article on web harvesting and on driving it directly from Java.

I often have a need to quickly scrape some data out of a web page (or list of web pages), which can then be fed into Excel and on to specialist data visualisation tools.

To this end I have turned to WebHarvest, an excellent scriptable open source API for web scraping in Java. 

I really really like it, but there are some quirks and setup issues that have cost me hours so I thought I’d roll together a tutorial with the fixes.

WebHarvest Config for Maven

When it works, Maven is a lovely tool for hiding dependency management in Java projects, but WebHarvest is not configured quite right out of the box to work transparently with it. 

(Describing Maven is beyond the scope of this post, but if you don’t know it, it’s easy to set up with the M2 plugin for Eclipse.)

This is the Maven POM I ended up with to use WebHarvest in a new JavaSE project:


<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
 
 <modelVersion>4.0.0</modelVersion>
 <groupId>WebScraping</groupId>
 <artifactId>WebScraping</artifactId>
 <packaging>jar</packaging>
 <version>0.00.01</version>
 <properties>
   <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
 </properties>

 <build>
   <plugins>
     <plugin>
       <artifactId>maven-compiler-plugin</artifactId>
       <configuration>
         <source>1.6</source>
         <target>1.6</target>
       </configuration>
     </plugin>
   </plugins>
 </build>

 <repositories>
   <repository>
     <id>wso2</id>
     <url>http://dist.wso2.org/maven2/</url>
   </repository>
   <repository>
     <id>maven-repository-1</id>
     <url>http://repo1.maven.org/maven2/</url>
   </repository>
 </repositories>
 
<dependencies>
  <dependency>
     <groupId>commons-logging</groupId>
     <artifactId>commons-logging</artifactId>
     <version>1.1</version>
     <type>jar</type>
     <scope>compile</scope>
  </dependency>
  <dependency>
    <groupId>log4j</groupId>
    <artifactId>log4j</artifactId>
    <version>1.2.12</version>
    <type>jar</type>
    <scope>compile</scope>
  </dependency>
  <dependency>
    <groupId>org.webharvest.wso2</groupId>
    <artifactId>webharvest-core</artifactId>
    <version>1.0.0.wso2v1</version>
    <type>jar</type>
    <scope>compile</scope>
  </dependency>
 
<!-- web harvest pom doesn't track dependencies well -->
  <dependency>
    <groupId>net.sf.saxon</groupId>
    <artifactId>saxon-xom</artifactId>
    <version>8.7</version>
  </dependency>
  <dependency>
    <groupId>org.htmlcleaner</groupId>
    <artifactId>htmlcleaner</artifactId>
    <version>1.55</version>
  </dependency>
  <dependency>
    <groupId>bsh</groupId>
    <artifactId>bsh</artifactId>
    <version>1.3.0</version>
  </dependency>
  <dependency>
    <groupId>commons-httpclient</groupId>
    <artifactId>commons-httpclient</artifactId>
    <version>3.1</version>
  </dependency>
 </dependencies>
</project>



You’ll note that the WebHarvest dependencies had to be added explicitly, because the jar does not come with a working pom listing them.

Writing A Scraping Script

WebHarvest uses XML configuration files to describe how to scrape a site – and with a few lines of Java code you can run any XML configuration and have access to any properties that the script identified from the page. 

This is definitely the safest way to scrape data, as it decouples the code from the web page markup – so if the site you are scraping goes through a redesign, you can quickly adjust the config files without recompiling the code they pass data to.

The site has some good example scripts to show you how to get started, so I won’t repeat them here. 

The easiest way to create your own is to run the WebHarvest GUI from the command line (i.e. java -jar webharvest_all_2.jar), start with a sample script, and then hack it around to get what you want – it’s an easy iterative process with good feedback in the UI.


For instance, let's look at the second of those examples, the Canon products at Yahoo Shopping sample.








We want to harvest the product list that starts at the following URL:
http://shopping.yahoo.com/s:Digital%20Cameras:4168-Brand=Canon:browsename=Canon%20Digital%20Cameras:refspaceid=96303108;_ylt=AnHw0Qy0K6smBU.hHvYhlUO8cDMB;_ylu=X3oDMTBrcDE0a28wBF9zAzk2MzAzMTA4BHNlYwNibmF2

The script is simple:

<config charset="ISO-8859-1">
   
    <include path="functions.xml"/>
               
    <!-- collects all tables for individual products -->
    <var-def name="products">   
        <call name="download-multipage-list">
            <call-param name="pageUrl">http://shopping.yahoo.com/s:Digital%20Cameras:4168-Brand=Canon:browsename=Canon%20Digital%20Cameras:refspaceid=96303108;_ylt=AnHw0Qy0K6smBU.hHvYhlUO8cDMB;_ylu=X3oDMTBrcDE0a28wBF9zAzk2MzAzMTA4BHNlYwNibmF2</call-param>
            <call-param name="nextXPath">//a[starts-with(., 'Next')]/@href</call-param>
            <call-param name="itemXPath">//li[@class="hproduct" or @class="hproduct first" or @class="hproduct last"]</call-param>
            <call-param name="maxloops">10</call-param>
        </call>
    </var-def>
   
    <!-- iterates over all collected products and extract desired data -->
    <file action="write" path="D:/tmp/canon/catalog.xml" charset="UTF-8">
        <![CDATA[ <catalog> ]]>
        <loop item="item" index="i">
            <list><var name="products"/></list>
            <body>
                <xquery>
                    <xq-param name="item" type="node()"><var name="item"/></xq-param>
                    <xq-expression><![CDATA[
                            declare variable $item as node() external;

                            let $name := data($item//*[@class='title'])
                            let $desc := data($item//*[@class='desc'])
                            let $price := data($item//*[@class='price'])
                                return
                                    <product>
                                        <name>{normalize-space($name)}</name>
                                        <desc>{normalize-space($desc)}</desc>
                                        <price>{normalize-space($price)}</price>
                                    </product>
                    ]]></xq-expression>
                </xquery>
            </body>
        </loop>
        <![CDATA[ </catalog> ]]>
    </file>

</config>




As a simple example, this is a script to go to the Sony-Ericsson developer site’s handset gallery at http://developer.sonyericsson.com/device/searchDevice.do?restart=true, and rip each handset’s individual spec page URI:

<?xml version="1.0" encoding="UTF-8"?>
<config>
 
    <!-- indicates we want a loop through the list defined in <list>, doing <body> for each item; the variables uid and i hold the value and the index of the current item -->

    <loop item="uid" index="i">
        <!-- the list section defines what we will loop over - here, it pulls out the value attribute of all option tags -->
        <list>
            <xpath expression="//option/@value">
                <html-to-xml>
                    <http url="http://developer.sonyericsson.com/device/searchDevice.do?restart=true"/>
                </html-to-xml>
            </xpath>
        </list>
        <!-- the body section lists instructions which are run for every iteration of the loop -->
        <body>
            <!-- we define a new variable for every iteration, using the iteration count as a suffix  -->
            <var-def name="uri.${i}">
                <!-- template tag is important, else the $ var syntax will be ignored and won't do any value substitutions -->
                <template>device/loadDevice.do?id=${uid}</template>
            </var-def>
        </body>
    </loop>
</config>
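
What this loop computes can be mirrored in a few lines of plain Java. This is only a sketch of the logic, not Web-Harvest API: the ids are made up, the class and method names are hypothetical, and it assumes the loop index starts at 1.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LoopSketch {

    // For every id, defines a variable "uri.<index>" holding the relative spec page URI,
    // like the <var-def name="uri.${i}"> with its <template> body does.
    static Map<String, String> buildUriVariables(List<String> uids) {
        Map<String, String> variables = new LinkedHashMap<>();
        int i = 1; // assumption: the loop index is 1-based
        for (String uid : uids) {
            variables.put("uri." + i, "device/loadDevice.do?id=" + uid);
            i++;
        }
        return variables;
    }

    public static void main(String[] args) {
        // Hypothetical option values, standing in for the result of //option/@value.
        List<String> uids = Arrays.asList("101", "202", "303");
        buildUriVariables(uids).forEach((name, uri) -> System.out.println(name + " = " + uri));
    }
}
```

Each iteration produces its own variable, so after the loop the relative URIs are available as uri.1, uri.2 and so on.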


Wednesday, 16 May 2012

Core Spring and the IoC Concept in a Nutshell, Part 1




Core Spring API Components

Spring is a container and component model. Everything else, including AOP, transactions, database access, web applications, and the like, is built on top of this container and component model. Objects managed in the container do not have to know about Spring or the container, thanks to Inversion of Control (IoC).

This pattern specifies the involvement of the Spring container (which manages lifecycle), your object, and any other dependent objects – known as beans in Spring parlance. 

The container is able to inject any number or type of dependent beans together, while the relationships between them are specified through configuration.

Dependency injection is enabled by creating properties and matching setter methods on your target object for the types of objects that you expect to inject. Alternatively, objects may be injected during instantiation by providing a constructor with a signature that matches the types you expect to inject.
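
To make the two styles concrete, here is a minimal sketch in plain Java with no Spring involved. It does by hand what the container would do from configuration. All class names (Brush, Painter, Color) are hypothetical and exist only for this illustration.

```java
class InjectionSketch {

    enum Color { RED, BLUE }

    // Receives its dependency through the constructor (constructor injection).
    static class Brush {
        private final Color color;

        Brush(Color color) {
            this.color = color;
        }

        Color getColor() {
            return color;
        }
    }

    // Receives its dependency through a property with a matching setter (setter injection).
    static class Painter {
        private Brush brush;

        void setBrush(Brush brush) {
            this.brush = brush;
        }

        String describe() {
            return "painting in " + brush.getColor();
        }
    }

    // The wiring an IoC container would perform on our behalf.
    static Painter wire() {
        Brush brush = new Brush(Color.BLUE); // dependency handed in at construction time
        Painter painter = new Painter();     // created with the no-arg constructor
        painter.setBrush(brush);             // dependency handed in through the setter
        return painter;
    }

    public static void main(String[] args) {
        System.out.println(wire().describe());
    }
}
```

Neither Brush nor Painter knows who creates or connects them, which is the point of the pattern: the wiring lives outside the objects being wired.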

The core of Spring framework’s functionality lies within this IoC container, which is discussed next.

The Inversion of Control Container

The Inversion of Control (IoC) container provides the dependency injection support that enables you to configure and integrate application and infrastructure components together. Through IoC, your applications may achieve a low level of coupling, because all of the bean configuration can be specified in terms of IoC idioms (such as property collaborators and constructors). Meanwhile, most if not all of your application’s bean lifecycle (construction to destruction) may be managed from within the container. 

This enables you to declare scope – how and when a new object instance gets created, and when it gets destroyed. 

For example, the container may be instructed that a specific bean instance be created only once per thread, or that, upon destruction, a database bean should disconnect from any active connections. 
Through requests to the Spring IoC container, either a new bean gets constructed or a singleton bean gets passed back to the requesting bean. Either way, this is a transparent event that is configured along with the bean declaration.


Spring Container Metadata Configuration

Spring provides several implementations of the ApplicationContext interface out of the box. In standalone applications using XML metadata (still commonplace today), it is common to create an instance of 

org.springframework.context.support.ClassPathXmlApplicationContext or
org.springframework.context.support.FileSystemXmlApplicationContext.

Configuration metadata consisting of bean definitions represented with XML or Java configuration is preferred for third-party APIs for which you do not have access to source code. 

In most other cases, configuration metadata – in addition to, or in place of, XML – may be applied through Java annotations. 

ioc_basics.xml – A Basic Spring Application Configuration File
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.springframework.org/schema/beans
http://www.springframework.org/schema/beans/spring-beans-3.0.xsd">
 
<bean class="com.sample.iocbasic.BasicPOJO"/>
 
</beans>

This is a simple XML document that makes use of the Spring beans namespace – the basis for resolving your POJO within the Spring container. 

Instantiating the IoC Container

To instantiate the IoC container and obtain a functional ApplicationContext instance, the relevant ApplicationContext implementation must be constructed with a specified configuration. 

In a standard Java application with XML metadata configuration, the ClassPathXmlApplicationContext will resolve any number of Spring XML configuration resources given a path relative to the base Java classpath. Its constructor is overloaded with a variety of argument arrangements that provide ways to specify the resource locations. 

Spring also provides a number of other ApplicationContext implementations. For example, FileSystemXmlApplicationContext is used to load XML configuration from the file system (outside of the classpath), and AnnotationConfigApplicationContext supports loading annotated Java classes.
 
In the next snippet, the context constructor is given a single resolvable path to the Spring configuration XML file. If the XML resource were located outside of the classpath, the file system path would be passed to FileSystemXmlApplicationContext instead to obtain the ApplicationContext.

BasicIocMain.java – Single Line of Code to Start Up Your Application Context

ApplicationContext ctx = new ClassPathXmlApplicationContext("ioc_basics.xml");

The next step is to obtain your configured beans. 

Simply calling the ApplicationContext.getBean method (passing it the class of the instance you want returned) will provide, by default, the singleton instance of the bean.

That is, every call returns the same instance.

BasicIocMain.java – Obtaining a Bean Reference by Class

BasicPOJO basicPojo = ctx.getBean(BasicPOJO.class);

Alternatively, Spring can work out which bean definition you're looking for if you pass it the ID (or qualifier) of the bean instead of the bean class. This helps when dealing with multiple bean definitions of the same type. 

For instance, given a bean definition for the same BasicPOJO class with the ID basic-pojo, the next sample obtains that bean by name.

BasicIocMain.java – Obtaining a Bean Reference by Name

BasicPOJO basicPojo = (BasicPOJO)ctx.getBean("basic-pojo");



Bean Instantiation from the IoC Container
 
A basic operation in a Spring context is instantiating beans. This is done in a variety of ways; however, we will focus on the most common use cases. 

In this example, the BasicPOJO bean provides both a no-arg and an argument-based constructor. In addition, it has a property named color of type ColorEnum (see later). 

We will use both BasicPOJO and ColorEnum objects to illustrate how you can define and populate your beans within Spring XML configuration.

package com.sample.iocbasic;

public class BasicPOJO {

    // Private state; Spring populates it through the constructor or the setters below
    private String name;
    private ColorEnum color;
    private ColorRandomizer colorRandomizer;

    // No-arg constructor, used together with setter injection
    public BasicPOJO() {
    }

    // Argument-based constructor, used with constructor injection
    public BasicPOJO(String name, ColorEnum color) {
        this.name = name;
        this.color = color;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public ColorEnum getColor() {
        return color;
    }

    public void setColor(ColorEnum color) {
        this.color = color;
    }

    public ColorRandomizer getColorRandomizer() {
        return colorRandomizer;
    }

    public void setColorRandomizer(ColorRandomizer colorRandomizer) {
        this.colorRandomizer = colorRandomizer;
    }
}


package com.sample.iocbasic;
public enum ColorEnum {
  violet, blue, red, green, purple, orange, yellow
}




Constructor Injection

Constructor arguments may be set through XML configuration. This enables you to inject dependencies through the constructor arguments. 


To do this, use the constructor-arg element within the bean definition, as shown in the next sample.
 

ioc_basics.xml – XML Bean Construction using Parameterized Arguments

<bean id="constructor-setup" class="com.sample.iocbasic.BasicPOJO">
  <constructor-arg name="name" value="red"/>
  <constructor-arg name="color" value="violet"/>
</bean>



Bean References and Setter Injection


Bean properties may be injected into your target beans through references to other beans within the scope of your application context. 


This is known as bean collaboration.

It requires defining the additional bean that you wish to refer to, also known as the collaborator. Using the ref attribute in the property tag tells Spring which bean we want to collaborate with, and thus have injected.
 

ioc_basics.xml – XML Configuration for Bean Collaboration by Setter Injection

<bean id="no-args" class="com.sample.iocbasic.BasicPOJO">
  <property name="color" ref="defaultColor"/>
  <property name="name" value="Mario"/>
</bean>


<bean id="defaultColor" class="com.sample.iocbasic.ColorEnum" factory-method="valueOf">
  <constructor-arg value="blue"/>
</bean>

 
Static and Instance Factory Injection
 

Note the factory-method attribute on the defaultColor bean. 
In this case the static factory method instantiation mechanism was used.

Note also that the class attribute does not specify the type of object returned by the factory method; it specifies only the type containing that factory method. 

For this simple example, a string was fed to the enum's static factory method valueOf, which is the common approach to resolving enum constants from strings. 
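
In plain Java, the defaultColor bean definition boils down to a single static call. The sketch below copies the enum so that it compiles on its own; the wrapper class and method names are hypothetical.

```java
class StaticFactorySketch {

    enum ColorEnum { violet, blue, red, green, purple, orange, yellow }

    // Equivalent of factory-method="valueOf" with the constructor-arg value "blue":
    // the argument is passed to the static method and its return value becomes the bean.
    static ColorEnum defaultColor() {
        return ColorEnum.valueOf("blue");
    }

    public static void main(String[] args) {
        System.out.println(defaultColor());
        try {
            // valueOf is strict: a name that is not an enum constant is rejected.
            ColorEnum.valueOf("pink");
        } catch (IllegalArgumentException e) {
            System.out.println("not a ColorEnum constant: pink");
        }
    }
}
```

Since valueOf rejects unknown names, a typo in the XML value would presumably surface as a failure while the context is being created rather than later at the point of use.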


When static factory methods are not practical, use instance factory methods instead.

These are methods that get invoked on existing beans in the container to provide new bean instances. 
The class in the next sample uses instance factory methods to provide random ColorEnum instances.

ColorRandomizer.java – Class Definition for the ColorRandomizer Factory Bean


package com.sample.iocbasic;

import java.util.Random;

public class ColorRandomizer {

    private final Random random = new Random();

    // Optional color that randomColor() must never return
    private ColorEnum colorException;

    // Picks any color except colorException (when one is set)
    public ColorEnum randomColor() {
        ColorEnum[] allColors = ColorEnum.values();
        ColorEnum ret;
        do {
            // nextInt(n) returns 0..n-1, so pass the full length to cover every color
            ret = allColors[random.nextInt(allColors.length)];
        } while (colorException != null && colorException == ret);
        return ret;
    }

    // Picks a random color that differs from ex
    public ColorEnum exceptColor(ColorEnum ex) {
        ColorEnum ret;
        do {
            ret = randomColor();
        } while (ex != null && ex == ret);
        return ret;
    }

    public void setColorException(ColorEnum colorException) {
        this.colorException = colorException;
    }
}

 
To invoke the factory within our Spring context, ColorRandomizer will be defined as a bean, then one of its methods will be invoked in another bean definition as a way to vend an instance of ColorEnum. 


We obtain two separate instances of ColorEnum using both ColorRandomizer factory methods to illustrate variances in factory method invocations.

Obtaining Bean Instances from Factory Methods in Spring
 

<!-- Factory bean for colors -->
<bean id="colorRandomizer" class="com.sample.iocbasic.ColorRandomizer"/>

<!-- gets a random color -->
<bean id="randomColor" factory-bean="colorRandomizer" factory-method="randomColor"/>

<!-- gets any color, except the random color defined above -->
<bean id="exclusiveColor" factory-bean="colorRandomizer" factory-method="exceptColor">
  <constructor-arg ref="randomColor"/>
</bean>
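
Read as plain Java, these three bean definitions amount to roughly the calls below. This is a sketch of the equivalent wiring done by hand, not of what Spring executes internally; the enum and a trimmed-down ColorRandomizer are repeated so the example compiles on its own, and the wrapper class name is hypothetical.

```java
import java.util.Random;

class InstanceFactorySketch {

    enum ColorEnum { violet, blue, red, green, purple, orange, yellow }

    // Trimmed-down copy of the factory class, kept here for self-containment.
    static class ColorRandomizer {
        private final Random random = new Random();

        ColorEnum randomColor() {
            ColorEnum[] all = ColorEnum.values();
            return all[random.nextInt(all.length)];
        }

        ColorEnum exceptColor(ColorEnum excluded) {
            ColorEnum candidate;
            do {
                candidate = randomColor();
            } while (candidate == excluded);
            return candidate;
        }
    }

    public static void main(String[] args) {
        // <bean id="colorRandomizer" .../>: the factory itself is an ordinary bean
        ColorRandomizer colorRandomizer = new ColorRandomizer();
        // factory-bean + factory-method="randomColor": an instance method produces the bean
        ColorEnum randomColor = colorRandomizer.randomColor();
        // factory-method="exceptColor" with <constructor-arg ref="randomColor"/>:
        // the referenced bean is passed as the method argument
        ColorEnum exclusiveColor = colorRandomizer.exceptColor(randomColor);
        System.out.println(randomColor + " / " + exclusiveColor);
    }
}
```

Despite its name, the constructor-arg element here supplies the argument of the factory method.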



 
Bean Scopes

Spring uses the notion of bean scopes to determine how beans defined in the IoC container get issued to the application upon request with getBean methods, or through bean references. 


A bean’s scope is set with the scope attribute of the bean element, or with the @Scope annotation in Java-based configuration. 
Spring defaults to the singleton scope, where a single instance of the bean gets shared throughout the entire container. Spring provides a total of six bean scopes out of the box for use in specific context implementations, although only singleton, prototype, and thread are available in all context implementations. 


The other scopes – request, session, and globalSession – are available only to application contexts that are web-aware, such as WebApplicationContext.

Bean scopes available in Spring:

Scope          Description
singleton      Single bean instance, shared throughout the IoC container.
prototype      New bean instance created each time the bean is requested from the container.
request        Web application contexts only: creates a bean instance per HTTP request.
session        Web application contexts only: creates a bean instance per HTTP session.
globalSession  Web portlet only: creates a bean instance per global HTTP session.
thread*        Creates a bean instance per thread. Similar to request scope.

* Thread scope is not registered by default, and requires registration with the CustomScopeConfigurer bean.
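
The difference between the first two scopes can be sketched with a toy registry in plain Java. This is an illustration of the idea only, not how Spring is implemented: the ScopeSketch class and its methods are hypothetical, and unlike Spring's default behaviour the singleton here is created lazily on first request.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

class ScopeSketch {

    private final Map<String, Supplier<?>> factories = new HashMap<>();
    private final Map<String, Boolean> singletonFlags = new HashMap<>();
    private final Map<String, Object> singletonCache = new HashMap<>();

    // Registers a bean definition: a name, a scope flag and a way to create instances.
    void register(String name, boolean singleton, Supplier<?> factory) {
        factories.put(name, factory);
        singletonFlags.put(name, singleton);
    }

    Object getBean(String name) {
        Supplier<?> factory = factories.get(name);
        if (factory == null) {
            throw new IllegalArgumentException("No bean named " + name);
        }
        if (singletonFlags.get(name)) {
            // singleton: create once, then always hand back the cached instance
            Object cached = singletonCache.get(name);
            if (cached == null) {
                cached = factory.get();
                singletonCache.put(name, cached);
            }
            return cached;
        }
        // prototype: a fresh instance for every request
        return factory.get();
    }

    public static void main(String[] args) {
        ScopeSketch container = new ScopeSketch();
        container.register("shared", true, Object::new);
        container.register("fresh", false, Object::new);
        System.out.println(container.getBean("shared") == container.getBean("shared"));
        System.out.println(container.getBean("fresh") == container.getBean("fresh"));
    }
}
```

The web scopes follow the same pattern, with the cache tied to the current request or session instead of to the whole container.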


To illustrate the behavior of prototype- and singleton-scoped beans, the next sample declares two beans of the same type, which differ only in scope. 


The singleton-scoped bean should always return the same value, whereas the prototype-scoped bean is created anew on every request and so returns a freshly picked random color each time (which may occasionally repeat). 


The first listing shows the Spring configuration file and the second one shows the main class.


ioc_basics.xml – Overriding Default Scope for Beans with XML Metadata


<beans…>
  <bean id="randomeverytime" factory-bean="colorRandomizer" factory-method="randomColor"
        scope="prototype"/>

  <bean id="alwaysthesame" factory-bean="colorRandomizer" factory-method="randomColor"
        scope="singleton"/>
</beans>

BasicIocMain.java – Simple For-Loop
 

public static void demonstrateScopes(ApplicationContext ctx) {
 for (int i = 0; i < 5; i++) {
  System.out.println("randomeverytime: " +
    ctx.getBean("randomeverytime", ColorEnum.class));
  System.out.println("alwaysthesame: " +
    ctx.getBean("alwaysthesame", ColorEnum.class));
 }
}

 
The loop prints output similar to this:


Output of Bean Scope Induced Behavior
randomeverytime: green
alwaysthesame: orange
randomeverytime: purple
alwaysthesame: orange
randomeverytime: violet
alwaysthesame: orange
randomeverytime: violet
alwaysthesame: orange
randomeverytime: green
alwaysthesame: orange


You register the thread scope, or any other custom scope, in XML by defining an org.springframework.beans.factory.config.CustomScopeConfigurer bean. 


Pass the scope implementation to its scopes map property. Each entry's key provides the scope name, and its value provides the scope’s implementation. Scopes registered in this fashion also work with @Bean methods annotated with @Scope. That is, once a scope is registered it is active throughout the container and for all manners of configuration (see next).


ioc_basics.xml – Registering Custom Scopes in XML


<beans…>
  <bean class="org.springframework.beans.factory.config.CustomScopeConfigurer">
    <property name="scopes">
      <map>
        <entry key="thread">
          <bean class="org.springframework.context.support.SimpleThreadScope"/>
        </entry>
      </map>
    </property>
  </bean>
  <bean id="threadColor" factory-bean="colorRandomizer" factory-method="randomColor"
        scope="thread"/>
</beans>
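
The effect of a per-thread scope can be imitated in plain Java with a ThreadLocal. The sketch below only illustrates the behaviour one would expect from a thread-scoped bean; it makes no claim about how SimpleThreadScope works internally, and the class and method names are hypothetical.

```java
import java.util.function.Supplier;

class ThreadScopeSketch {

    // Wraps a factory so that each thread lazily gets, and then keeps, its own instance.
    static <T> Supplier<T> threadScoped(Supplier<T> factory) {
        ThreadLocal<T> perThread = ThreadLocal.withInitial(factory);
        return perThread::get;
    }

    // Helper: asks for the bean on a separate thread and hands the result back.
    static <T> T getOnNewThread(Supplier<T> bean) {
        Object[] holder = new Object[1];
        Thread worker = new Thread(() -> holder[0] = bean.get());
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for the worker", e);
        }
        @SuppressWarnings("unchecked")
        T result = (T) holder[0];
        return result;
    }

    public static void main(String[] args) {
        Supplier<Object> threadColor = threadScoped(Object::new);
        // Two requests on the same thread
        System.out.println(threadColor.get() == threadColor.get());
        // One request on this thread, one on another thread
        System.out.println(threadColor.get() == getOnNewThread(threadColor));
    }
}
```

Within one thread the bean behaves like a singleton, while across threads it behaves like a prototype.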




 

Umlet





Finally, a cool, customizable and free UML tool that I really, really appreciate for its immediate and simple approach.

Happy designing, everyone!

ETL Pattern




































Thursday, 10 May 2012

jQuery notes







Here are some notes that finally helped me to properly integrate this powerful traversal/navigation/manipulation tool, and that I couldn't find spelled out in any documentation I've read.

These lines are built by a Java class that emits JavaScript when rendering components. They should be useful because combining single and double quotes is not so obvious.
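
The quoting can be checked in isolation before wiring it into a component. The sketch below takes the selector line from the snippet that follows and returns the JavaScript text the Java literal actually produces; the class and method names are hypothetical.

```java
class QuoteEscapingSketch {

    // In a Java string literal, \\ is one backslash and ' needs no escaping,
    // so each \\' in the source becomes \' in the emitted JavaScript.
    static String selectorAssignment() {
        return "var strCH='\\'[id*=columnHeader]\\'';";
    }

    public static void main(String[] args) {
        // Prints the JavaScript statement exactly as it would be written to the page.
        System.out.println(selectorAssignment());
    }
}
```

On the JavaScript side, \' inside a single-quoted string is a literal quote, so the value assigned to strCH is the selector text wrapped in single quotes. Printing the emitted script like this is a quick way to see which layer consumes which escape.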

So let's start by retrieving the main DOM object to work on.
We know part of its id:
 

    "        var strCH='\\'[id*=columnHeader]\\'';"+
 
// using the jQuery wrapper, retrieve the list of children (jQuery function http://api.jquery.com/children/)

// then retrieve the first DOM object of that array
    "        var tableCols=$(strCH).children().get(0);" +
// note that this object is a plain DOM object, not a jQuery wrapper, so calling any jQuery API function on it gives an error.

// to get a jQuery wrapper again, call $ once more

    "        var trc = $(tableCols).find('tr');"+
    "        var tdsCol = $(trc).children();" +
 
// for each column
    "        tdsCol.each(function(j){" +
    "            if (this._logger==null){" +
    "                this._logger = new Log(Log.DEBUG, Log.popupLogger);" +
    "            }    " +

    "            var tdc = $(this).get(0);" +
 
    "            tdc.style.borderBottomWidth='0px';" +
    "            tdc.style.borderBottomStyle='';" +
    "            tdc.style.borderRightWidth='0px';" +
    "            tdc.style.borderRightStyle='';" +
    "            tdc.style.borderTopWidth='0px';" +
    "            tdc.style.borderTopStyle='';" +
   
   
    "        });" +
 

Thursday, 3 May 2012

jQuery Selectors





Use the excellent W3Schools jQuery Selector Tester to experiment with the different selectors.

Selector            Example                      Selects
*                   $("*")                       All elements
#id                 $("#lastname")               The element with id=lastname
.class              $(".intro")                  All elements with class="intro"
element             $("p")                       All p elements
.class.class        $(".intro.demo")             All elements with the classes "intro" and "demo"

:first              $("p:first")                 The first p element
:last               $("p:last")                  The last p element
:even               $("tr:even")                 All even tr elements
:odd                $("tr:odd")                  All odd tr elements

:eq(index)          $("ul li:eq(3)")             The fourth element in a list (index starts at 0)
:gt(no)             $("ul li:gt(3)")             List elements with an index greater than 3
:lt(no)             $("ul li:lt(3)")             List elements with an index less than 3
:not(selector)      $("input:not(:empty)")       All input elements that are not empty

:header             $(":header")                 All header elements h1, h2 ...
:animated           $(":animated")               All animated elements

:contains(text)     $(":contains('W3Schools')")  All elements which contain the text
:empty              $(":empty")                  All elements with no child (element) nodes
:hidden             $("p:hidden")                All hidden p elements
:visible            $("table:visible")           All visible tables

s1,s2,s3            $("th,td,.intro")            All elements with matching selectors

[attribute]         $("[href]")                  All elements with a href attribute
[attribute=value]   $("[href='default.htm']")    All elements with a href attribute value equal to "default.htm"
[attribute!=value]  $("[href!='default.htm']")   All elements with a href attribute value not equal to "default.htm"
[attribute$=value]  $("[href$='.jpg']")          All elements with a href attribute value ending with ".jpg"
[attribute^=value]  $("[href^='jquery_']")       All elements with a href attribute value starting with "jquery_"

:input              $(":input")                  All input elements
:text               $(":text")                   All input elements with type="text"
:password           $(":password")               All input elements with type="password"
:radio              $(":radio")                  All input elements with type="radio"
:checkbox           $(":checkbox")               All input elements with type="checkbox"
:submit             $(":submit")                 All input elements with type="submit"
:reset              $(":reset")                  All input elements with type="reset"
:button             $(":button")                 All input elements with type="button"
:image              $(":image")                  All input elements with type="image"
:file               $(":file")                   All input elements with type="file"

:enabled            $(":enabled")                All enabled input elements
:disabled           $(":disabled")               All disabled input elements
:selected           $(":selected")               All selected input elements
:checked            $(":checked")                All checked input elements

Wednesday, 2 May 2012

Spring Integration Tutorial - part 1





An extract from the original, very interesting article: http://java.dzone.com/articles/spring-integration-hands

This tutorial is the first in a two-part series on Spring Integration. In this series we're going to build out a lead management system based on a message bus that we implement using Spring Integration. Our first tutorial will begin with a brief overview of Spring Integration and also just a bit about the lead management domain. 

After that we'll build our message bus. 

I've used Maven profiles to isolate the dependencies you’ll need if you’re running Java 5. 

The tutorials assume that you're comfortable with JEE, the core Spring framework and Maven 2. Also, Eclipse users may find the m2eclipse plug-in helpful.

To complete the tutorial you'll need an IMAP account, and you'll also need access to an SMTP server.

Let's begin with an overview of Spring Integration.

A bird's eye view of Spring Integration

Spring Integration is a framework for implementing a dynamically configurable service integration tier.

The point of this tier is to orchestrate independent services into meaningful business solutions in a loosely-coupled fashion, which makes it easy to rearrange things in the face of changing business needs. 

The service integration tier sits just above the service tier as shown in figure 1.

Following the book Enterprise Integration Patterns by Gregor Hohpe and Bobby Woolf (Addison-Wesley), Spring Integration adopts the well-known pipes and filters architectural style as its approach to building the service integration layer. 

Abstractly, filters are information-processing units (any type of processing; it doesn’t have to be information filtering per se), and pipes are the conduits between filters.

In the context of integration, the network we’re building is a messaging infrastructure, a so-called message bus, and the pipes and filters are called message channels and message endpoints, respectively.

The network carries messages from one endpoint to another via channels, and the message is validated, routed, split, aggregated, resequenced, reformatted, transformed and so forth as the different endpoints process it.
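
The pipes-and-filters idea itself fits in a few lines of plain Java. The sketch below is not Spring Integration code and uses none of its API: an endpoint is modelled as a function on the message, and the "channel" is simply the hand-off of one endpoint's output to the next. The class name, the endpoints and the sample message are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

class PipesAndFiltersSketch {

    // Carries a message through the endpoints in order; each hand-off plays the role of a channel.
    static String send(String message, List<UnaryOperator<String>> endpoints) {
        String current = message;
        for (UnaryOperator<String> endpoint : endpoints) {
            current = endpoint.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        List<UnaryOperator<String>> endpoints = Arrays.<UnaryOperator<String>>asList(
                String::trim,                              // reformat
                name -> name.isEmpty() ? "unknown" : name, // validate
                name -> "lead:" + name);                   // transform
        System.out.println(send("  Jane Doe  ", endpoints));
    }
}
```

Because each endpoint only sees the message handed to it, endpoints can be reordered, removed or added without touching the others, which is the loose coupling the tier is after.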

Figure 1. The service integration tier orchestrates the services below it.

That should give you enough technical context to work through the tutorial. Let’s talk about the problem domain for our sample integration, which is enrollment lead management in an online university setting.

Lead management overview

In many industries, such as the mortgage industry and for-profit education, one important component of customer relationship management (CRM) is managing sales leads

This is a fertile area for enterprise integration because there are typically multiple systems that need to play nicely together in order to pull the whole thing off

Examples include
  • front-end marketing/lead generation websites,
  • external lead vendor systems,
  • intake channels for submitted leads,
  • lead databases,
  • e-mail systems (e.g., to accept leads, to send confirmation e-mails),
  • lead qualification systems,
  • sales systems and potentially others.
This tutorial and the next use Spring Integration to integrate several systems of the kind just mentioned into an overall lead management capability for a hypothetical online university. Specifically, we'll integrate the following:

    •    a CRM system that allows campus and call center staff to create leads directly, as they might do for walk-in or phone-in leads
    •    a Request For Information (RFI) form on a lead generation ("lead gen") marketing website
    •    a legacy e-mail based RFI channel
    •    an external CRM that the international enrollment staff uses to process international leads
    •    confirmation e-mails


Figure 2 shows what it will look like when we’re done with both tutorials. For now focus on the big picture rather than the details.
 
Figure 2. This is the lead management system we'll build.


For this first tutorial we're simply going to establish the base staff interface, the (dummy) backend service that saves leads to a database, and confirmation e-mails. 

The second tutorial will deal with lead routing, web-based RFIs and e-mail-based RFIs.

Let's dive in. We’ll begin with the basic lead creation page in the CRM and expand out from there.

Building the core components

[You can download the source code for this section of the tutorial here]
We’re going to start by creating a lead creation HTML form for campus and call center staff. That way, if walk-in or phone-in leads express an interest, we can get them into the system. This is something that might appear as a part of a lead management module in a CRM system, as shown in figure 3.


Figure 3. We'll build our lead management module with integration in mind from the beginning.


Because we’re interested in the integration rather than the actual app features, we’re not really going to save the lead to the database. Instead we’ll just call a createLead() method against a local LeadService bean and leave it at that. But we will use Spring Integration to move the lead from the form to the service bean.
Our first stop will be the domain model.



Create the domain model

We’ll need a domain object for leads, so listing 1 shows the one we’ll use. It’s not an industrial-strength representation, but it will do for the purposes of the tutorial.

Listing 1. Lead.java, a basic domain object for leads.

package crm.model;

... other imports ...

public class Lead {
    private static DateFormat dateFormat = new SimpleDateFormat();
   
    private String firstName;
    private String middleInitial;
    private String lastName;
    private String address1;
    private String address2;

    ... other fields ...
   
    public Lead() { }
   
    public String getFirstName() { return firstName; }

    public void setFirstName(String firstName) {
        this.firstName = firstName;
    }

    ... other getters and setters, and a toString() method ...
}

 
There is nothing special happening here at all. 
So far the Lead class is just a bunch of getters and setters. You can see the full code listing in the download.
 
If you thought that was underwhelming, just wait until you see the LeadServiceImpl service bean in listing 2.

Listing 2. LeadServiceImpl.java, a dummy service bean.

package crm.service;

import java.util.logging.Logger;
import org.springframework.stereotype.Service;
import crm.model.Lead;

@Service("leadService")
public class LeadServiceImpl implements LeadService {
    private static Logger log = Logger.getLogger("global");
   
    public void createLead(Lead lead) {
        log.info("Creating lead: " + lead);
    }
}


This is just a dummy bean. In real life we’d save the lead to a database. The bean implements a basic LeadService interface that we've suppressed here, but it's available in the code download.
Now that we have our domain model, let’s use Spring Integration to create a service integration tier above it.

Create the service integration tier

If you look back at figure 3, you'll see that the CRM app pushes lead data to the service bean by way of a channel called newLeadChannel.

While it’s possible for the CRM app to push messages onto the channel directly, it’s generally more desirable to keep the systems you’re integrating decoupled from the underlying messaging infrastructure, such as channels. That allows you to configure service orchestrations dynamically instead of having to go into the code.

Spring Integration supports the Gateway pattern (described in the aforementioned Enterprise Integration Patterns book), which allows an application to push messages onto the message bus without knowing anything about the messaging infrastructure. Listing 3 shows how we do this.

Listing 3. LeadGateway.java, a gateway offering access to the messaging system.

package crm.integration.gateways;

import org.springframework.integration.annotation.Gateway;
import crm.model.Lead;

public interface LeadGateway {
   
    @Gateway(requestChannel = "newLeadChannel")
    void createLead(Lead lead);
}

 
We are of course using the Spring Integration @Gateway annotation to map the method call to the newLeadChannel, but gateway clients don't know that. Spring Integration will use this interface to create a dynamic proxy that accepts a Lead instance, wraps it with an org.springframework.integration.core.Message, and then pushes the Message onto the newLeadChannel.

The Lead instance is the Message body, or payload, and Spring Integration wraps the Lead because only Messages are allowed on the bus.
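
If the dynamic proxy sounds mysterious, the following plain-Java sketch shows the general idea using java.lang.reflect.Proxy. Treat it as an illustration of the mechanism in spirit only, not as Spring Integration's actual implementation: SketchMessage, SketchLeadGateway and GatewaySketch are hypothetical names, a String stands in for the Lead, and a plain List stands in for newLeadChannel.

```java
import java.lang.reflect.Proxy;
import java.util.List;

// Hypothetical stand-in for a bus message: nothing but a payload wrapper.
final class SketchMessage {
    final Object payload;

    SketchMessage(Object payload) {
        this.payload = payload;
    }
}

// Hypothetical gateway interface, shaped like LeadGateway but taking a String.
interface SketchLeadGateway {
    void createLead(String lead);
}

final class GatewaySketch {
    // Builds a proxy whose createLead() call wraps the argument in a message
    // and appends that message to the given "channel".
    static SketchLeadGateway newGateway(List<SketchMessage> channel) {
        return (SketchLeadGateway) Proxy.newProxyInstance(
                SketchLeadGateway.class.getClassLoader(),
                new Class<?>[] { SketchLeadGateway.class },
                (proxy, method, args) -> {
                    // Only the single-argument createLead() call is handled in this sketch.
                    channel.add(new SketchMessage(args[0]));
                    return null; // createLead() returns void
                });
    }
}
```

The caller only ever sees the SketchLeadGateway interface, which is the whole point: client code stays free of any messaging types.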

We need to wire up our message bus. Listing 4 shows how to do that with an application context configuration file.

Listing 4. /WEB-INF/applicationContext-integration.xml message bus definition.

<?xml version="1.0" encoding="UTF-8"?>
<beans:beans xmlns="http://www.springframework.org/schema/integration"
    xmlns:beans="http://www.springframework.org/schema/beans"
    xmlns:p="http://www.springframework.org/schema/p"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.springframework.org/schema/beans
        http://www.springframework.org/schema/beans/spring-beans-2.5.xsd
        http://www.springframework.org/schema/integration
        http://www.springframework.org/schema/integration/spring-integration-1.0.xsd">

    <gateway id="leadGateway"
        service-interface="crm.integration.gateways.LeadGateway" />
   
    <publish-subscribe-channel id="newLeadChannel" />
   
    <service-activator
        input-channel="newLeadChannel"
        ref="leadService"
        method="createLead" />

</beans:beans>

The first thing to notice here is that we've made the Spring Integration namespace our default namespace instead of the standard beans namespace. 
The reason is that we're using this configuration file strictly for Spring Integration configuration, so we can save some keystrokes by selecting the appropriate namespace.

This works pretty nicely for some of the other Spring projects as well, such as Spring Batch and Spring Security.

In this configuration we've created the three messaging components that we saw in figure 3. 

First, we have an incoming lead gateway to allow applications to push leads onto the bus. 
We simply reference the interface from listing 3; Spring Integration takes care of the dynamic proxy.

Next we create a publish/subscribe ("pub-sub") channel called newLeadChannel.

This is the channel that the @Gateway annotation referenced in listing 3. 

A pub-sub channel can publish a message to multiple endpoints simultaneously.

For now we have only one subscriber—a service activator—but we already know we're going to have others, so we may as well make this a pub-sub channel.

The service activator is an endpoint that allows us to bring our LeadServiceImpl service bean onto the bus. 

We're injecting the newLeadChannel into the input end of the service activator. 

When a message appears on the newLeadChannel, the service activator will pass its Lead payload to the leadService bean's createLead() method.
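
Here is a rough plain-Java sketch of how a pub-sub channel and a service activator relate. Again, this is not Spring Integration code: PubSubChannelSketch and PubSubDemo are hypothetical names, a String stands in for the Lead, and the sketch delivers to subscribers one after another on the calling thread, which is a simplification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical pub-sub channel: every subscriber receives every message.
final class PubSubChannelSketch<T> {
    private final List<Consumer<T>> subscribers = new ArrayList<>();

    void subscribe(Consumer<T> subscriber) {
        subscribers.add(subscriber);
    }

    void send(T payload) {
        for (Consumer<T> subscriber : subscribers) {
            subscriber.accept(payload);
        }
    }
}

final class PubSubDemo {
    static List<String> run() {
        List<String> log = new ArrayList<>();
        PubSubChannelSketch<String> newLeadChannel = new PubSubChannelSketch<>();

        // Plays the role of the service activator: hand the payload to a service method.
        newLeadChannel.subscribe(lead -> log.add("Creating lead: " + lead));
        // A second, made-up subscriber to show that both see the same message.
        newLeadChannel.subscribe(lead -> log.add("Also notified about: " + lead));

        newLeadChannel.send("Jane Doe");
        return log;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

Adding another subscriber doesn't require any change to the sender, which is why starting with a pub-sub channel is convenient when more endpoints are on the way.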
Stepping back, we've almost implemented the design described by figure 3. The only part that remains is the lead creation frontend, which we'll address right now.

Create the web tier

Our user interface for creating new leads will be a web-based form that we implement using Spring Web MVC. The idea is that enrollment staff at campuses or call centers might use such an interface to handle walk-in or phone-in traffic. Listing 5 shows our simple @Controller.

Listing 5. LeadController.java, a @Controller to allow staff to create leads


package crm.web;

import java.util.Date;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Controller;
import org.springframework.ui.Model;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import crm.integration.gateways.LeadGateway;
import crm.model.Country;
import crm.model.Lead;

@Controller
public class LeadController {
   
    @Autowired
    private LeadGateway leadGateway;
   
    @RequestMapping(value = "/lead/form.html", method = RequestMethod.GET)
    public void getForm(Model model) {
        model.addAttribute(Country.getCountries());
        model.addAttribute(new Lead());
    }
   
    @RequestMapping(value = "/lead/form.html", method = RequestMethod.POST)
    public String postForm(Lead lead) {
        lead.setDateCreated(new Date());
        leadGateway.createLead(lead);
        return "redirect:form.html?created=true";
    }
}

 

This isn't an industrial-strength controller, as it does neither HTTP parameter whitelisting (for example, via an @InitBinder method) nor form validation, both of which you would expect from a real implementation.

But the main pieces from a Spring Integration perspective are here.

We're autowiring the gateway into the @Controller, and we have methods for serving up the empty form and for processing the submitted form.

The getForm() method references a Country class that we've suppressed (it's in the code download); it just puts a list of countries on the model so the form can present a Country field to the staff member.

The postForm() method invokes the createLead() method on the gateway. 

This will pass the Lead to the dynamic proxy LeadGateway implementation, which in turn will wrap the Lead with a Message and then place the Message on the newLeadChannel.
There are a few other configuration files you will need to put in place, including web.xml, main-servlet.xml and applicationContext.xml.

There's also a JSP for the web form. As none of these relates directly to Spring Integration, we won't treat them here. Please see the code download for details.

With that, we've established a baseline system. 
To try it out, run
 
  mvn jetty:run

against crm/pom.xml and point your browser at
 
  http://localhost:8080/crm/main/lead/form.html

You should see a very basic-looking web form for entering lead information. Enter some user information (it doesn't matter what you enter—recall that we don't have any form validation) and press Submit. 

The console should report that LeadServiceImpl.createLead() created a lead. Congratulations!

Even though we now have a working system, it isn't very interesting. From here on out (this tutorial and the next) we'll be adding some common features to make the lead management system more capable. 

Our first addition will be confirmation e-mails.

Adding confirmation e-mails

After an enrollment advisor (or some other staff member) creates a lead in the system, we want to send the lead an e-mail letting them know that it has happened. Actually, and this is a critical point, we really don't care how the lead was created. Anytime a lead appears on the newLeadChannel, we want to fire off a confirmation e-mail. I'm making the distinction because it points to an important aspect of the message bus: it allows us to control lead processing code centrally instead of having to chase it down in a bunch of different places. Right now there's only one way to create leads, but figure 2 revealed that we'll be adding others. No matter how many we add, they'll all result in sending a confirmation e-mail out to the lead.

Figure 4 shows the new bit of plumbing we're going to add to our message bus.
Figure 4. Send a confirmation e-mail when creating a lead.

To do this, we're going to need to make a few changes to the configuration and code.

POM changes

First we need to update the POM. Here's a summary of the changes; see the code download for details:
    •    Add a JavaMail dependency to the Jetty plug-in.
    •    Add an org.springframework.context.support dependency.
    •    Add a spring-integration-mail dependency.
    •    Set the mail.version property.
These changes will allow us to use JavaMail.

Expose JavaMail sessions through JNDI

We'll also need to add a /WEB-INF/jetty-env.xml configuration to make our JavaMail sessions available via JNDI. 

Once again, see the code download for details. 

I've included a /WEB-INF/jetty-env.xml.sample configuration for your convenience.

As mentioned previously, you'll need access to an SMTP server.
 
Besides creating jetty-env.xml, we'll need to update applicationContext.xml. 

Listing 6 shows the changes we need so we can use JavaMail and SMTP.

Listing 6. /WEB-INF/applicationContext.xml changes supporting JavaMail and SMTP

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
    xmlns:context="http://www.springframework.org/schema/context"
    xmlns:jee="http://www.springframework.org/schema/jee"
    xmlns:p="http://www.springframework.org/schema/p"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.springframework.org/schema/beans
        http://www.springframework.org/schema/beans/spring-beans-2.5.xsd
        http://www.springframework.org/schema/context
        http://www.springframework.org/schema/context/spring-context-2.5.xsd
        http://www.springframework.org/schema/jee
        http://www.springframework.org/schema/jee/spring-jee-2.5.xsd">
   
    <jee:jndi-lookup id="mailSession"
        jndi-name="mail/Session" resource-ref="true" />
   
    <bean id="mailSender"
        class="org.springframework.mail.javamail.JavaMailSenderImpl"
        p:session-ref="mailSession" />
   
    <context:component-scan base-package="crm.service" />
</beans>

 
The changes make the JNDI-bound JavaMail session available to the Spring application context.

We've declared the jee namespace and its schema location, configured the JNDI lookup, and created a JavaMailSenderImpl bean that we'll use for sending mail.

We won't need any domain model changes to generate confirmation e-mails. We will however need to create a bean to back our new transformer endpoint.

Service integration tier changes

First, recall from figure 4 that the newLeadChannel feeds into a LeadToEmailTransformer endpoint. This endpoint takes a lead as an input and generates a confirmation e-mail as an output, and the e-mail gets piped out to an SMTP transport.

In general, transformers transform given inputs into desired outputs. No surprises there.

Figure 4 is slightly misleading since it's actually the POJO itself that we're going to call LeadToEmailTransformer; the endpoint is really just a bean adapter that the messaging infrastructure provides so we can place the POJO on the message bus. 

Listing 7 presents the LeadToEmailTransformer POJO.

Listing 7. LeadToEmailTransformer.java, a POJO that transforms leads into confirmation e-mails.

package crm.integration.transformers;

import java.util.Date;
import java.util.logging.Logger;
import org.springframework.integration.annotation.Transformer;
import org.springframework.mail.MailMessage;
import org.springframework.mail.SimpleMailMessage;
import crm.model.Lead;

public class LeadToEmailTransformer {
    private static Logger log = Logger.getLogger("global");
   
    private String confFrom;
    private String confSubj;
    private String confText;
   
    ... getters and setters for the fields ...

    @Transformer
    public MailMessage transform(Lead lead) {
        log.info("Transforming lead to confirmation e-mail: " + lead);
       
        String leadFullName = lead.getFullName();
        String leadEmail = lead.getEmail();
        MailMessage msg = new SimpleMailMessage();
       
        msg.setTo(leadFullName == null ?
                leadEmail : leadFullName + " <" + leadEmail + ">");
       
        msg.setFrom(confFrom);
        msg.setSubject(confSubj);
        msg.setSentDate(new Date());
        msg.setText(confText);
       
        log.info("Transformed lead to confirmation e-mail: " + msg);
        return msg;
    }
}

 

Again, LeadToEmailTransformer is a POJO, so we use the @Transformer annotation to select the method that's performing the transformation.

We use a Lead for the input and a MailMessage for the output, and perform a simple transformation in between.
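
Stripped of Spring, the transformation boils down to a function from one type to another. The sketch below shows that shape in plain Java, including the same "Name <address>" rule for the recipient. SketchLead, SketchMail and LeadToMailSketch are hypothetical, cut-down stand-ins, not the classes from the code download, and the addresses in the usage example are invented.

```java
import java.util.function.Function;

// Hypothetical, minimal stand-in for the Lead domain object.
final class SketchLead {
    final String fullName;
    final String email;

    SketchLead(String fullName, String email) {
        this.fullName = fullName;
        this.email = email;
    }
}

// Hypothetical, minimal stand-in for an e-mail message.
final class SketchMail {
    final String to;
    final String from;
    final String subject;
    final String text;

    SketchMail(String to, String from, String subject, String text) {
        this.to = to;
        this.from = from;
        this.subject = subject;
        this.text = text;
    }
}

// The transformer: a lead goes in, a confirmation e-mail comes out.
final class LeadToMailSketch implements Function<SketchLead, SketchMail> {
    private final String from;
    private final String subject;
    private final String text;

    LeadToMailSketch(String from, String subject, String text) {
        this.from = from;
        this.subject = subject;
        this.text = text;
    }

    @Override
    public SketchMail apply(SketchLead lead) {
        // Use "Name <address>" when a name is available, else the bare address.
        String to = lead.fullName == null
                ? lead.email
                : lead.fullName + " <" + lead.email + ">";
        return new SketchMail(to, from, subject, text);
    }
}
```

For example, new LeadToMailSketch("enroll@example.edu", "Thanks", "We received your request.").apply(new SketchLead("Jane Doe", "jane@example.com")) yields a message addressed to Jane Doe <jane@example.com>.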

When defining backing beans for the various Spring Integration filters, it's possible to specify a Message as an input or an output. 

That is, if we want to deal with the messages themselves rather than their payloads, we can do that. (Don't confuse the MailMessage in listing 7 with a Spring Integration message; MailMessage represents an e-mail message, not a message bus message.) We might do that in cases where we want to read or manipulate message headers. In this tutorial we don't need to do that, so our backing beans just deal with payloads.
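
To illustrate the difference between handling a payload and handling the whole message, here is a small plain-Java sketch. HeaderMessageSketch and HeaderAwareHandlerSketch are hypothetical names, and the "source" header is invented purely for the example; it is not a header that Spring Integration defines.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical bus message: headers plus a payload.
final class HeaderMessageSketch<T> {
    final Map<String, Object> headers;
    final T payload;

    HeaderMessageSketch(Map<String, Object> headers, T payload) {
        // Defensive copy so later changes to the caller's map don't leak in.
        this.headers = Collections.unmodifiableMap(new HashMap<>(headers));
        this.payload = payload;
    }
}

final class HeaderAwareHandlerSketch {
    // Payload style: the handler sees only the business object.
    static String describe(String lead) {
        return "lead=" + lead;
    }

    // Message style: the handler can also read headers.
    static String describe(HeaderMessageSketch<String> message) {
        Object source = message.headers.get("source"); // invented header name
        return "lead=" + message.payload + ", source=" + source;
    }
}
```

The payload style keeps the backing bean free of messaging types, so it is the natural default when headers don't matter.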
Now we'll need to build out our message bus so that it looks like figure 4. We do this by updating applicationContext-integration.xml as shown in listing 8.

Listing 8. /WEB-INF/applicationContext-integration.xml updates to support confirmation e-mails

<?xml version="1.0" encoding="UTF-8"?>
<beans:beans xmlns="http://www.springframework.org/schema/integration"
    xmlns:mail="http://www.springframework.org/schema/integration/mail"
    xmlns:beans="http://www.springframework.org/schema/beans"
    xmlns:context="http://www.springframework.org/schema/context"
    xmlns:p="http://www.springframework.org/schema/p"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.springframework.org/schema/integration/mail
        http://www.springframework.org/schema/integration/mail/spring-integration-mail-1.0.xsd
        http://www.springframework.org/schema/beans
        http://www.springframework.org/schema/beans/spring-beans-2.5.xsd
        http://www.springframework.org/schema/context
        http://www.springframework.org/schema/context/spring-context-2.5.xsd
        http://www.springframework.org/schema/integration
        http://www.springframework.org/schema/integration/spring-integration-1.0.xsd">

    <context:property-placeholder
        location="classpath:applicationContext.properties" />
   
    <gateway id="leadGateway"
        service-interface="crm.integration.gateways.LeadGateway" />
   
    <publish-subscribe-channel id="newLeadChannel" />
   
    <service-activator
        input-channel="newLeadChannel"
        ref="leadService"
        method="createLead" />
   
    <transformer input-channel="newLeadChannel" output-channel="confEmailChannel">
        <beans:bean class="crm.integration.transformers.LeadToEmailTransformer">
            <beans:property name="confFrom" value="${conf.email.from}" />
            <beans:property name="confSubj" value="${conf.email.subject}" />
            <beans:property name="confText" value="${conf.email.text}" />
        </beans:bean>
    </transformer>
   
    <channel id="confEmailChannel" />
   
    <mail:outbound-channel-adapter
        channel="confEmailChannel"
        mail-sender="mailSender" />

</beans:beans>

 
The property-placeholder configuration loads the various ${...} properties from a properties file; see /crm/src/main/resources/applicationContext.properties in the code download. 

You don't have to change anything in the properties file.

The transformer configuration brings the LeadToEmailTransformer bean into the picture so it can transform Leads that appear on the newLeadChannel into MailMessages that it puts on the confEmailChannel. 

As a side note, the p namespace way of specifying bean properties doesn't seem to work here (I assume it's a bug: http://jira.springframework.org/browse/SPR-5990), so I just did it the more verbose way.

The channel definition defines a point-to-point channel rather than a pub-sub channel. That means each message placed on the channel is delivered to only one endpoint.
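
For contrast with pub-sub, here is a plain-Java sketch of point-to-point delivery. It is not how Spring Integration implements its channels: PointToPointChannelSketch and PointToPointDemo are hypothetical names, and picking consumers in round-robin order is just one simple policy assumed for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical point-to-point channel: each message goes to exactly one consumer.
final class PointToPointChannelSketch<T> {
    private final List<Consumer<T>> consumers = new ArrayList<>();
    private int next = 0;

    void subscribe(Consumer<T> consumer) {
        consumers.add(consumer);
    }

    void send(T payload) {
        if (consumers.isEmpty()) {
            throw new IllegalStateException("no consumer subscribed");
        }
        // Assumed policy: rotate through the consumers, one per message.
        Consumer<T> target = consumers.get(next);
        next = (next + 1) % consumers.size();
        target.accept(payload);
    }
}

final class PointToPointDemo {
    static List<String> run() {
        List<String> log = new ArrayList<>();
        PointToPointChannelSketch<String> confEmailChannel = new PointToPointChannelSketch<>();
        confEmailChannel.subscribe(mail -> log.add("A got " + mail));
        confEmailChannel.subscribe(mail -> log.add("B got " + mail));

        confEmailChannel.send("mail-1");
        confEmailChannel.send("mail-2");
        confEmailChannel.send("mail-3");
        return log;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

With a pub-sub channel the same three sends would have produced six log entries; here there are three, one per message, which is what we want for an e-mail that should go out only once.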

Finally we have an outbound-channel-adapter that grabs MailMessages from the confEmailChannel and then sends them using the referenced mailSender, which we defined in listing 6.

That's it for this section. We should have working confirmation e-mails. Restart your Jetty instance and go again to
http://localhost:8080/crm/main/lead/form.html
 
Fill it out and provide your real e-mail address in the e-mail field. A few moments after submitting the form you should receive a confirmation e-mail. If you don't see it, you might check your SMTP configuration in jetty-env.xml, or else check your spam folder.

Summary

In this tutorial we've taken our first steps toward developing an integrated lead management system. Though the current bus configuration is simple, we've already seen some key Spring Integration features, including

    •    support for the Gateway pattern, allowing us to connect apps to the message bus without knowing about messages
    •    point-to-point and pub-sub channels
    •    service activators to allow us to place service beans on the bus
    •    message transformers
    •    outbound SMTP channel adapters to allow us to send e-mail
The second tutorial will continue elaborating what we've developed here, demonstrating the use of several additional Spring Integration features, including
    •    message routers (including content-based message routers)
    •    outbound web service gateways for sending SOAP messages
    •    inbound HTTP adapters for collecting HTML form data from external systems
    •    inbound e-mail channel adapters (we'll use IMAP IDLE, though POP and IMAP are also possible) for processing incoming e-mails