I've been working on some changes to crawling to facilitate its use as a non-whole-web crawler, and would like to gauge interest on this list about including it somewhere in the Nutch repo, hopefully before the map-red branch gets merged in.

It is basically a partial re-write of the whole fetching mechanism, borrowing large chunks of code here and there.

Features include:
- Customizable seed inputs, i.e. seed a crawl from a file, a database, a Nutch FetchList, etc.
- Customizable crawl scopes, e.g. crawl the seed URLs and only the URLs within their domains (this can already be accomplished manually with RegexURLFilter, but what if there are 200,000 seed URLs?), or crawl the seed URL domains + 1 external link (not possible with the current filter mechanism)
- Online fetchlist building (as opposed to Nutch's offline method), with customizable strategies for building a fetchlist. The default implementation gives priority to hosts with a larger number of pages to crawl. Note that offline fetchlist building is OK too.
- Runs continuously until all links are crawled
- Customizable fetch output mechanisms, like output to a file, to the WebDB, or not at all (if we're just implementing a link checker, for example)
- Fully utilizes HTTP 1.1 connection persistence and request pipelining

(A rough sketch of the crawl-scope and fetchlist-building ideas appears after this message.)

It is fully compatible with Nutch as it is, i.e. given a Nutch fetchlist, the new crawler can produce a Nutch segment. However, if you don't need that at all, and are just interested in Nutch as a crawler, then that's OK too!

It is a drop-in replacement for the Nutch crawler, and compiles against the recently released 0.7 jar.

Some disclaimers: it was never designed to be a superset replacement for the Nutch crawler. Rather, it is tailored to the fairly specific requirements of what I believe is called constrained crawling. It uses the Spring Framework (for easy customization of implementation classes) and JDK 5 features (the new loop syntax, autoboxing, generics, etc.). These two points sped up development, but probably make it an untasty Nutch acquisition.. ;-) It shouldn't be tough to do something about that, though.

One of the areas where the Nutch crawler could use improvement is that it's really difficult to extend and customize. With the addition of interfaces and beans, it becomes possible for developers to write their own mechanism for fetchlist prioritization, or to use a B-Tree as the backing implementation of the database of crawled URLs. I'm using Spring to make changing implementations and loose coupling easy.

There are some places where existing Nutch functionality is duplicated in some way to allow for slight modifications, as opposed to patching the Nutch classes. The rationale behind this approach was to simplify integration - it is much easier to have Our Crawler as a separate jar which depends on the Nutch jar. Furthermore, if it doesn't get accepted into Nutch, no rewriting or patching of Nutch sources needs to be done.

It's my belief that if you're using Nutch for anything but whole-web crawling and need to make even small changes to the way crawling is performed, you'll find Our Crawler helpful.

I consider the current code beta quality. I've run it on smallish crawls (200k+ URLs) and things seem to be working OK, but it's nowhere near production quality.

Some related blog entries:

Improving Nutch for constrained crawls
http://www./index

Reflections on modifying the Nutch crawler
http://www./index

Limitations of OC
http://www./index

Even if we decide not to include it in the Nutch repo, the code will still be released under the APL.
I'm in the process of adding a bit more documentation and a shell script for running it, and will release the files over the next couple of days.

Cheers,
Kelvin

http://www.
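To make the crawl-scope and fetchlist-building points above concrete, here is a minimal JDK 5-style sketch of what the pluggable pieces might look like. The names (CrawlScope, SeedDomainScope, OnlineFetchList) are hypothetical illustrations, not OC's actual API, and the host-priority logic only approximates the default strategy the announcement describes.

    import java.net.URL;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.LinkedList;
    import java.util.Map;
    import java.util.Queue;
    import java.util.Set;

    /** Decides whether a discovered URL belongs in the crawl. */
    interface CrawlScope {
        boolean inScope(URL url);
    }

    /**
     * Admits only URLs whose host appears in the seed set. A hash lookup
     * is O(1) per URL regardless of seed count, which is why 200,000 seed
     * domains are workable here but not as 200,000 regex patterns.
     */
    class SeedDomainScope implements CrawlScope {
        private final Set<String> seedHosts = new HashSet<String>();

        void addSeed(URL seed) {
            seedHosts.add(seed.getHost().toLowerCase());
        }

        public boolean inScope(URL url) {
            return seedHosts.contains(url.getHost().toLowerCase());
        }
    }

    /**
     * Online fetchlist building: URLs are queued per host as they are
     * discovered, and next() serves the host with the most pending pages
     * first - an approximation of the default prioritization described
     * in the post above.
     */
    class OnlineFetchList {
        private final Map<String, Queue<URL>> perHost =
                new HashMap<String, Queue<URL>>();

        void add(URL url) {
            Queue<URL> q = perHost.get(url.getHost());
            if (q == null) {
                q = new LinkedList<URL>();
                perHost.put(url.getHost(), q);
            }
            q.add(url);
        }

        /** Returns the next URL to fetch, or null when nothing is pending. */
        URL next() {
            String busiest = null;
            for (Map.Entry<String, Queue<URL>> e : perHost.entrySet()) {
                if (busiest == null
                        || e.getValue().size() > perHost.get(busiest).size()) {
                    busiest = e.getKey();
                }
            }
            if (busiest == null) {
                return null;
            }
            Queue<URL> q = perHost.get(busiest);
            URL url = q.poll();
            if (q.isEmpty()) {
                perHost.remove(busiest);
            }
            return url;
        }
    }

Swapping in a different prioritization would then mean substituting another implementation behind the same seams - exactly the kind of change that Spring bean wiring (or any dependency-injection mechanism) makes cheap.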
Sorry, I realized I needed to qualify that: the plugin framework is nice, but I mean customizing non-extension-point fetcher behaviour.

k

On Tue, 23 Aug 2005 00:02:26 -0400, Kelvin Tan wrote:
> One of the areas where the Nutch crawler could use improvement is
> that it's really difficult to extend and customize.
Kelvin,

Big +1!!! I'm working on focused crawling as well, and your work fits well with my needs.

An implementation detail - have you considered using HiveMind rather than Spring? This would be much more compatible license-wise with Nutch and easier to integrate into the ASF repository. Further - I wonder if the existing plugin mechanism would work well as a HiveMind-based system too.

    Erik

On Aug 23, 2005, at 12:02 AM, Kelvin Tan wrote:
> I've been working on some changes to crawling to facilitate its use
> as a non-whole-web crawler, and would like to gauge interest on this
> list about including it somewhere in the Nutch repo, hopefully before
> the map-red branch gets merged in.
> [...]
Hi,

I think it is an interesting idea, but from a technical perspective the decision to use HiveMind or Spring should, in my opinion, be taken for the whole project. The same goes for JDK 5.0. So right now it is not the best match for Nutch.

On the functionality side I am not the best person to judge it, as I am doing rather big crawls with many hosts, but it sounds interesting.

Regards,
Piotr

Erik Hatcher wrote:
> Kelvin,
>
> Big +1!!! I'm working on focused crawling as well, and your work
> fits well with my needs.
>
> An implementation detail - have you considered using HiveMind rather
> than Spring? This would be much more compatible license-wise with
> Nutch and easier to integrate into the ASF repository. Further - I
> wonder if the existing plugin mechanism would work well as a
> HiveMind-based system too.
>
> Erik
> [...]
From what I heard from Kelvin, the Spring part could be thrown out and replaced with classes with main().

I think there is a need to have the Fetcher component more separated from the rest of Nutch. The Fetcher alone is well done and quite powerful on its own - it has host-based queues, doesn't use much RAM/CPU, it's polite, and so on. For instance, for Simpy.com I'm currently using only the Fetcher (+ the segment data it creates). I feed it URLs to fetch my own way, and I never use 'bin/nutch' to run all those other tools that work on the WebDB.

Like Stephan, I thought the map-reduce implementation was going to be more complex to run.

Otis

--- Piotr Kosiorowski <pkosiorowski@gmail.com> wrote:
> Hi,
> I think it is an interesting idea, but from a technical perspective
> the decision to use HiveMind or Spring should, in my opinion, be
> taken for the whole project. The same goes for JDK 5.0. So right now
> it is not the best match for Nutch.
> [...]
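The host-based polite queueing Otis describes is easy to picture with a small sketch. This is not the actual Fetcher code - the class name PoliteQueue and the 5-second delay are assumptions made for illustration - but it shows the shape of the mechanism: bucket URLs by host, and only hand out a URL once its host has been idle long enough.

    import java.net.URL;
    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.LinkedList;
    import java.util.Map;
    import java.util.Queue;

    /**
     * Toy illustration of host-based polite queueing: URLs are bucketed
     * by host, and a host only becomes eligible again after a fixed delay
     * has elapsed since its last fetch.
     */
    class PoliteQueue {
        // Politeness interval between hits on the same host (assumed value).
        private static final long CRAWL_DELAY_MS = 5000;

        private final Map<String, Queue<URL>> perHost =
                new HashMap<String, Queue<URL>>();
        private final Map<String, Long> lastFetch = new HashMap<String, Long>();

        /** Feed in a URL "your own way" - from a file, a database, anywhere. */
        synchronized void offer(URL url) {
            Queue<URL> q = perHost.get(url.getHost());
            if (q == null) {
                q = new LinkedList<URL>();
                perHost.put(url.getHost(), q);
            }
            q.add(url);
        }

        /** Next URL whose host is outside its politeness window, or null. */
        synchronized URL poll() {
            long now = System.currentTimeMillis();
            Iterator<Map.Entry<String, Queue<URL>>> it =
                    perHost.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<String, Queue<URL>> e = it.next();
                Long last = lastFetch.get(e.getKey());
                if (last == null || now - last >= CRAWL_DELAY_MS) {
                    lastFetch.put(e.getKey(), now);
                    URL url = e.getValue().poll();
                    if (e.getValue().isEmpty()) {
                        it.remove(); // drop the empty per-host bucket
                    }
                    return url;
                }
            }
            return null; // every queued host is still inside its delay window
        }
    }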
> An implementation detail - have you considered using HiveMind
> rather than Spring? This would be much more compatible
> license-wise with Nutch and be easier to integrate into the
> ASF repository. Further - I wonder if the existing plugin
> mechanism would work well as a HiveMind-based system too.
>
> Erik

Another option is the Apache OSGi project (http://mail-archives.apache <42D6FC59.3050802@bellsouth.net>). That might make some sense, because the Nutch plugin mechanism is based on the Eclipse plugin system, which is in turn based on OSGi.