Guangyuan's Research and Development Blog: Configure Heritrix 1.14.3 in Eclipse

Configuration Environment

Heritrix 1.14.3
Java 1.7
Eclipse: Luna
Mac OS X (Mavericks)

Steps to open Heritrix as an Eclipse project

Download Heritrix

In my case: heritrix-1.14.3-src.tar.gz

Extract the archive to a local directory

In my case: /Applications/heritrix-1.14.3

Create a new Java project in Eclipse

Project path: /Users/guangyuan/Documents/workspace/Hetrixproject
Delete the src and bin folders under Hetrixproject

Copy the three folders org, com, and st from /Applications/heritrix-1.14.3/src/java/ to the Hetrixproject folder
Copy the three folders modules, profiles, and selftest, together with the four files heritrix.properties, jmxremote.password.template, heritrix.cacerts, and jndi.properties, from /Applications/heritrix-1.14.3/src/conf to Hetrixproject
Copy the two files arcMetaheaderBody.xsl and README.txt from /Applications/heritrix-1.14.3/src/resources to Hetrixproject
Copy the webapps folder from /Applications/heritrix-1.14.3/src/ to Hetrixproject
Copy the lib folder from /Applications/heritrix-1.14.3/src/ to Hetrixproject
Modify the .classpath file in Hetrixproject

<?xml version="1.0" encoding="UTF-8"?>

</classpath>

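Modify heritrix.properties in the Hetrixproject folder

heritrix.version = @VERSION@ -> heritrix.version = 1.14.3
heritrix.cmdline.admin = "your username":"your password" (without quotation marks)
heritrix.cmdline.port = 8080
The port can be changed from the default 8080. Keep any note about it on a separate line starting with #: in a Java properties file, text that follows the value on the same line becomes part of the value.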
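As a quick sanity check on the three edits above, here is a minimal Java sketch. It assumes the file is parsed with java.util.Properties; the class name HeritrixPropsCheck, the method problems, and the sample credentials admin:secret are made up for illustration and are not part of Heritrix.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper, not part of Heritrix: flags common mistakes in the edited file.
class HeritrixPropsCheck {

    // Returns human-readable problems; an empty list means the three keys look sane.
    static List<String> problems(String fileContents) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(fileContents));
        } catch (IOException e) {
            throw new IllegalStateException("could not parse properties text", e);
        }
        List<String> found = new ArrayList<String>();

        // The source tarball ships a build placeholder instead of a version number.
        String version = props.getProperty("heritrix.version", "");
        if (version.isEmpty() || version.contains("@")) {
            found.add("heritrix.version still holds the build placeholder");
        }

        // Expect non-empty text on both sides of the first colon.
        String admin = props.getProperty("heritrix.cmdline.admin", "");
        int colon = admin.indexOf(':');
        if (colon <= 0 || colon == admin.length() - 1) {
            found.add("heritrix.cmdline.admin should look like username:password");
        }

        // A trailing "# comment" on the same line ends up inside the value.
        String port = props.getProperty("heritrix.cmdline.port", "");
        if (!port.matches("\\d+")) {
            found.add("heritrix.cmdline.port is not a plain number: '" + port + "'");
        }
        return found;
    }

    public static void main(String[] args) {
        String edited = "heritrix.version = 1.14.3\n"
                + "heritrix.cmdline.admin = admin:secret\n"
                + "heritrix.cmdline.port = 8080\n";
        System.out.println(problems(edited));
    }
}
```

To check a real file, read its contents into a string and pass it to problems; any message in the returned list points at the key that needs another look.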

Refresh the Eclipse project and find Heritrix.java under the org.archive.crawler package
[Run As] -> [Java Application]
Heritrix is now running. Open http://localhost:8080 in your browser, log in with "your username" and "your password", and start crawling!

Hetrixproject folder structure

Other example settings (crawl URLs only from the seed hosts)

Use DecidingScope as the crawl scope in [Modules]
Add three decide rules in [Submodules], in the order listed:

org.archive.crawler.deciderules.RejectDecideRule
org.archive.crawler.deciderules.OnHostsDecideRule
org.archive.crawler.deciderules.PrerequisiteAcceptDecideRule
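The order of the three rules matters. As far as I understand the Heritrix 1.x rule sequence, every rule answers ACCEPT, REJECT, or PASS, and the last answer that is not PASS wins. The sketch below models that idea in plain Java. All types in it (Candidate, Rule, Decision) are simplified stand-ins rather than the real Heritrix classes, and the 'P' hop marker for prerequisites is an assumption of this model.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified model of a decide-rule sequence. These types are illustrative
// stand-ins, not the real org.archive.crawler.deciderules classes.
class DecideSequenceSketch {
    enum Decision { ACCEPT, REJECT, PASS }

    // Hypothetical candidate URI: its host plus the hop path from the seed
    // (assumed here to end in 'P' for a prerequisite such as DNS or robots.txt).
    static final class Candidate {
        final String host;
        final String hopsPath;
        Candidate(String host, String hopsPath) {
            this.host = host;
            this.hopsPath = hopsPath;
        }
    }

    interface Rule { Decision decide(Candidate c); }

    // Stand-in for RejectDecideRule: rejects everything, so later rules must opt URIs back in.
    static Rule rejectAll() {
        return new Rule() {
            public Decision decide(Candidate c) { return Decision.REJECT; }
        };
    }

    // Stand-in for OnHostsDecideRule: accepts URIs whose host is one of the seed hosts.
    static Rule onHosts(final Set<String> seedHosts) {
        return new Rule() {
            public Decision decide(Candidate c) {
                return seedHosts.contains(c.host) ? Decision.ACCEPT : Decision.PASS;
            }
        };
    }

    // Stand-in for PrerequisiteAcceptDecideRule: accepts URIs reached by a prerequisite hop.
    static Rule prerequisiteAccept() {
        return new Rule() {
            public Decision decide(Candidate c) {
                return c.hopsPath.endsWith("P") ? Decision.ACCEPT : Decision.PASS;
            }
        };
    }

    // Applies the rules in order; the last non-PASS decision is the result.
    static Decision run(List<Rule> rules, Candidate c) {
        Decision answer = Decision.PASS;
        for (Rule rule : rules) {
            Decision d = rule.decide(c);
            if (d != Decision.PASS) {
                answer = d;
            }
        }
        return answer;
    }

    public static void main(String[] args) {
        Set<String> seeds = new HashSet<String>(Arrays.asList("example.com"));
        List<Rule> rules = Arrays.asList(rejectAll(), onHosts(seeds), prerequisiteAccept());
        System.out.println(run(rules, new Candidate("example.com", "L")));
        System.out.println(run(rules, new Candidate("cdn.other.org", "E")));
    }
}
```

Starting with a blanket reject and then accepting only what is wanted is what keeps the crawl on the seed hosts: an off-host URI gets no ACCEPT from the later rules, so the initial REJECT stands.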

About Politeness (in Frontier Settings)

delay-factor
Imposes a delay between URIs from the same host that is a multiple of the amount of time it took to fetch the last URI downloaded from that host.
For example if it took 800 milliseconds to fetch the last URI from a host and the delay-factor is 5 (a very high value) then the Frontier will wait 4000 milliseconds (4 seconds) before allowing another URI from that host to be processed.
This value can be set to 0 for maximum impoliteness. It is never possible to have multiple concurrent URIs being processed from the same host.
max-delay-ms
This setting allows the user to set a maximum upper limit on the 'in between URIs' wait created by the delay factor. If set to 1000 milliseconds then in the example used above the Frontier would only hold URIs from that host for 1 second instead of 4 since the delay factor exceeded this ceiling value.
min-delay-ms
Similar to the maximum limit, this imposes a minimum limit to the politeness. This can be useful to ensure, for example, that at least 100 milliseconds elapse between connections to the same host. In a case where the delay factor is 2 and it only took 20 milliseconds to get a URI this would come into effect.
min-interval-ms
An alternate way of putting a floor on the delay, this specifies the minimum number of milliseconds that must elapse from the start of processing one URI until the next one after it starts. This can be useful in cases where sites have a mix of large files that take an excessive amount of time and very small files that take virtually no time.
In all cases (this can vary from URI to URI) the more restrictive (delaying) of the two floor values is imposed.
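The four settings can be combined into one small calculation. The following is a sketch based only on the descriptions above, not the Frontier's actual code. The class and method names are made up, and two details are assumptions: the max-delay-ms ceiling is applied before the floors, and min-interval-ms is converted into a wait after the fetch by subtracting the fetch duration.

```java
// Hypothetical helper modelling the politeness settings described above;
// it is not taken from the Heritrix source.
class PolitenessSketch {

    // Returns how long to wait (in ms) after a fetch before the next URI
    // from the same host may be processed.
    static long delayMs(long fetchMs, double delayFactor,
                        long minDelayMs, long maxDelayMs, long minIntervalMs) {
        // delay-factor: a multiple of the time the last fetch took.
        long wait = Math.round(delayFactor * fetchMs);

        // max-delay-ms: ceiling on the wait produced by the delay factor.
        wait = Math.min(wait, maxDelayMs);

        // min-interval-ms counts from the START of the previous fetch, so the
        // part already spent fetching is subtracted (assumption of this sketch).
        long floorFromInterval = Math.max(0L, minIntervalMs - fetchMs);

        // The more restrictive (longer) of the two floors is imposed.
        long floor = Math.max(minDelayMs, floorFromInterval);
        return Math.max(wait, floor);
    }

    public static void main(String[] args) {
        // Example from the text: 800 ms fetch, delay-factor 5, generous ceiling.
        System.out.println(delayMs(800, 5.0, 0, 30000, 0));
        // Same fetch with max-delay-ms set to 1000.
        System.out.println(delayMs(800, 5.0, 0, 1000, 0));
        // 20 ms fetch, delay-factor 2, min-delay-ms 100.
        System.out.println(delayMs(20, 2.0, 100, 30000, 0));
    }
}
```

The three calls in main mirror the worked examples in the setting descriptions: the first reproduces the 4-second wait, the second the 1-second ceiling, and the third the 100-millisecond floor.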

About Regular Expression

For example, to exclude URLs that contain "jquery", use the pattern .*jquery.* (no backslash escaping is needed)
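The leading and trailing .* matter if the pattern has to match the whole URL rather than a substring, which I believe is how Heritrix applies its regular-expression rules. A minimal sketch with java.util.regex shows the difference; the class name and the sample URLs are made up for illustration.

```java
import java.util.regex.Pattern;

// Illustrative check of the exclusion pattern using plain java.util.regex.
class ExcludeRegexSketch {
    // Same pattern as in the text: anything, then "jquery", then anything.
    static final Pattern JQUERY = Pattern.compile(".*jquery.*");

    // True when the whole URL matches the pattern (full match, not a substring search).
    static boolean excluded(String url) {
        return JQUERY.matcher(url).matches();
    }

    public static void main(String[] args) {
        String script = "http://example.com/js/jquery.min.js";
        String page = "http://example.com/index.html";
        System.out.println(excluded(script));
        System.out.println(excluded(page));
        // Without the surrounding .* a full match against the URL fails.
        System.out.println(Pattern.matches("jquery", script));
    }
}
```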

Crawl Status Code

      1 Successful DNS lookup
      0 Fetch never tried (perhaps protocol unsupported or illegal URI)
     -1 DNS lookup failed
     -2 HTTP connect failed
     -3 HTTP connect broken
     -4 HTTP timeout (before any meaningful response received)
     -5 Unexpected runtime exception; see runtime-errors.log
     -6 Prerequisite domain-lookup failed, precluding fetch attempt
     -7 URI recognized as unsupported or illegal
     -8 Multiple retries all failed, retry limit reached
    -50 Temporary status assigned URIs awaiting preconditions; appearance in
        logs may be a bug
    -60 Failure status assigned URIs which could not be queued by the 
        Frontier (and may in fact be unfetchable)
    -61 Prerequisite robots.txt-fetch failed, precluding a fetch attempt
    -62 Some other prerequisite failed, precluding a fetch attempt
    -63 A prerequisite (of any type) could not be scheduled, precluding a 
        fetch attempt
  -3000 Severe Java 'Error' conditions (OutOfMemoryError, StackOverflowError,
        etc.) during URI processing.
  -4000 'chaff' detection of traps/content of negligible value applied
  -4001 Too many link hops away from seed
  -4002 Too many embed/transitive hops away from last URI in scope
  -5000 Out of scope upon reexamination (only happens if scope changes during 
        crawl)
  -5001 Blocked from fetch by user setting
  -5002 Blocked by a custom processor
  -5003 Blocked due to exceeding an established quota
  -5004 Blocked due to exceeding an established runtime
  -6000 Deleted from Frontier by user
  -7000 Processing thread was killed by the operator (perhaps because of a
        hung condition)
  -9998 Robots.txt rules precluded fetch
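When reading crawl logs it can be handy to turn these numbers back into text. The sketch below covers only a handful of codes copied from the table above; the class and method names are hypothetical, and anything not listed falls through to a generic message.

```java
// Hypothetical lookup for a few of the status codes listed above.
class FetchStatusSketch {

    // Returns a short description for selected codes from the table.
    static String describe(int code) {
        switch (code) {
            case 1:     return "Successful DNS lookup";
            case 0:     return "Fetch never tried";
            case -1:    return "DNS lookup failed";
            case -2:    return "HTTP connect failed";
            case -4:    return "HTTP timeout";
            case -8:    return "Multiple retries all failed, retry limit reached";
            case -61:   return "Prerequisite robots.txt-fetch failed";
            case -4001: return "Too many link hops away from seed";
            case -5000: return "Out of scope upon reexamination";
            case -9998: return "Robots.txt rules precluded fetch";
            default:    return "Not covered by this sketch; see the table";
        }
    }

    public static void main(String[] args) {
        int[] samples = {1, -1, -9998, -7000};
        for (int code : samples) {
            System.out.println(code + " " + describe(code));
        }
    }
}
```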
