distributed crawl | David DeMartini actual

OK, after more than a year it’s time to get down to really building out a Gearman system.

I’ve recently taken on a project that I believe to be the perfect fit for Gearman’s pipelined, distributed work management. I am, of course, not at liberty to discuss the details of this project, but I can provide a high-level description for the purposes of justification.

The Project – distributed harvesting

The objective of the project is to provide a distributed method of web page scraping and parsing. The project requires that the scraping and profiling occur for 2,000,000+ websites in under 18 hours. No small feat, for certain. The good news is that I’ve built a system in the past (circa 2006) that did just this using MySQL as the task manager. It worked, but it had its issues, and almost every single one of them can be mitigated by using Gearman. The rest will be mitigated by applying NoSQL solutions to site list management.
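To put that target in perspective, here is a minimal back-of-the-envelope sketch in Python. The 2,000,000 sites and the 18-hour window come from the requirement above; the 2 seconds of fetch-and-parse time per site is purely an assumed figure for illustration, and both function names are made up for this example.

```python
import math


def required_rate(total_sites, window_hours):
    """Sites that must be completed per second to finish inside the window."""
    return total_sites / (window_hours * 3600)


def workers_needed(total_sites, window_hours, seconds_per_site):
    """Concurrent workers needed if each site occupies one worker for
    seconds_per_site seconds (an assumed average, not a measured value)."""
    return math.ceil(required_rate(total_sites, window_hours) * seconds_per_site)


rate = required_rate(2_000_000, 18)
print(f"sustained rate: {rate:.1f} sites/second")
print("workers at an assumed 2 s per site:", workers_needed(2_000_000, 18, 2.0))
```

That works out to roughly 31 sites per second, sustained for the whole 18 hours, which is why a single sequential crawler is not an option and the work has to be spread across many workers.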

What is Gearman

Here is the synopsis from the Gearman.org main page.

Gearman provides a generic application framework to farm out work to other machines or processes that are better suited to do the work. It allows you to do work in parallel, to load balance processing, and to call functions between languages. It can be used in a variety of applications, from high-availability web sites to the transport of database replication events. In other words, it is the nervous system for how distributed processing communicates.
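To make the client / job server / worker split in that synopsis a little more concrete, here is a toy, in-process sketch in Python. It is not Gearman and does not use any Gearman library: `ToyJobServer`, `run_worker`, and `demo` are hypothetical names, and threads stand in for the separate machines or processes a real deployment would use. The only point is to illustrate the pattern of submitting a named job and letting whichever worker is free pick it up.

```python
import queue
import threading


class ToyJobServer:
    """In-process stand-in for a job server: one FIFO queue per function name."""

    def __init__(self):
        self._queues = {}
        self._lock = threading.Lock()

    def _queue_for(self, function_name):
        with self._lock:
            return self._queues.setdefault(function_name, queue.Queue())

    def submit(self, function_name, payload):
        """Client side: enqueue a job and return a handle the result arrives on."""
        result = queue.Queue(maxsize=1)
        self._queue_for(function_name).put((payload, result))
        return result

    def grab(self, function_name):
        """Worker side: block until a job for this function is available."""
        return self._queue_for(function_name).get()


def run_worker(server, function_name, handler):
    """Pull jobs for one function until a None payload (shutdown signal) arrives."""
    while True:
        payload, result = server.grab(function_name)
        if payload is None:
            break
        result.put(handler(payload))


def demo():
    server = ToyJobServer()
    workers = [
        threading.Thread(
            target=run_worker,
            args=(server, "reverse", lambda text: text[::-1]),
            daemon=True,
        )
        for _ in range(2)
    ]
    for worker in workers:
        worker.start()
    # The client does not know or care which worker handles which job.
    handles = [server.submit("reverse", name) for name in ("example.com", "example.org")]
    results = [handle.get(timeout=5) for handle in handles]
    for _ in workers:  # one shutdown signal per worker
        server.submit("reverse", None)
    for worker in workers:
        worker.join(timeout=5)
    return results
```

Calling `demo()` returns the two reversed strings in submission order. In the real system the queue lives in the job server process, and clients and workers talk to it over the network rather than through shared memory.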

The Job Server – implementing in Amazon’s EC2 environment

For the project I’m working on, I’ve opted for the Java implementation of the Job Server. This implementation’s main page is located [HERE].

Information about the Java Job Server:

Java Gearman Service is an easy-to-use distributed network application framework implementing the gearman protocol used to farm out work to other machines or processes that are better suited to do the work. It allows you to do work in parallel, to load balance processing, and to call functions between languages.

04-23-2012 java-gearman-service v0.6 has been released. [DOWNLOAD]

The service now uses the slf4j logging facade, allowing the user to have better control over logging
Persistent background jobs are now supported through an application hook
The API has been updated to be more user friendly, and it makes it easier to create divide-and-conquer/mapreduce applications (breaks the code of previous versions)
A .properties file now may be used to set property values and fine-tune the application.

Requirements to deploy the Java job server:

Java SE 7
slf4j 1.6.4+

For my implementation, I’ve extracted the zip into a vendor directory, from where I plan to launch the .jar. Development is occurring on an Apple OS X laptop, and the result is then deployed to the AWS EC2 cluster for production. I expect that some library path and configuration work will be required to make this all work.

Starting up the Java Gearman Service

It took a little time to locate the instructions for starting up the Java implementation of the Gearman Service. They are located [HERE].

Instructions on how I started the Gearman server on my OS X development machine are located [HERE].

Once it is started, you should be able to communicate with it from your client and worker code!
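One quick way to confirm that a job server is answering, before any client or worker code exists, is to send it an echo request. Below is a minimal sketch in Python using only the standard library. It follows the Gearman binary protocol as I understand it: a 12-byte header made of a 4-byte magic code (`\0REQ` for requests, `\0RES` for responses), a big-endian packet type, and a big-endian payload size, with ECHO_REQ as type 16 and ECHO_RES as type 17. The host `localhost` and port 4730 (Gearman's usual default) are assumptions about the setup, and the helper names are made up for this example.

```python
import socket
import struct

REQ_MAGIC = b"\x00REQ"
RES_MAGIC = b"\x00RES"
ECHO_REQ = 16
ECHO_RES = 17
HEADER = struct.Struct(">4sII")  # magic, packet type, payload size (big-endian)


def build_packet(packet_type, payload=b""):
    """Frame a request packet: 12-byte header followed by the raw payload."""
    return HEADER.pack(REQ_MAGIC, packet_type, len(payload)) + payload


def parse_header(header_bytes):
    """Return (packet_type, payload_size) from a 12-byte response header."""
    magic, packet_type, size = HEADER.unpack(header_bytes)
    if magic != RES_MAGIC:
        raise ValueError(f"unexpected magic bytes: {magic!r}")
    return packet_type, size


def recv_exact(sock, count):
    """Read exactly count bytes, since recv() may return fewer than requested."""
    chunks = []
    while count > 0:
        chunk = sock.recv(count)
        if not chunk:
            raise ConnectionError("server closed the connection early")
        chunks.append(chunk)
        count -= len(chunk)
    return b"".join(chunks)


def gearman_echo(host="localhost", port=4730, payload=b"ping", timeout=5.0):
    """Send ECHO_REQ and report whether the server echoed the payload back.

    host and port are assumptions; adjust them to wherever the server listens.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_packet(ECHO_REQ, payload))
        packet_type, size = parse_header(recv_exact(sock, HEADER.size))
        body = recv_exact(sock, size)
    return packet_type == ECHO_RES and body == payload
```

With the server running, `gearman_echo()` should return `True`; a refused connection or a timeout raises an exception instead, which usually means the server is not listening where the check expects it.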

Next steps:

Installing Gearman PHP components

Build a Gearman Client Demonstrator

Build a Gearman Worker Demonstrator


Deploying Java Gearman Job Server