Installation and Configuration of the Module MOD_robo

This module implements a simple web robot, which scans given URLs recursively up to a given number of pages. The start time can be configured for periodic or one-time runs; the robot then runs in a separate task. The number of URLs that can be checked depends on the license: it is the number of parallel tasks the WWW server can handle, divided by 4 and rounded down to an integer. For example, the public version can handle 10 parallel tasks; 10 divided by 4 is 2.5, which rounds down to 2. So with the public version of shttpd you can specify two URLs to be checked.
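Expressed as code, the licensed limit comes down to a plain integer division, which rounds towards zero. A minimal sketch (the variable names are illustrative and not part of the module's API):

        /* number of URLs the robot may check, derived from the number
         * of parallel tasks the WWW server is licensed for            */
        int max_urls = max_parallel_tasks / 4;    /* e.g. 10 / 4 == 2  */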

Configuration

Configuration is done within the administration section. The configuration panel shows each search URL with all of its settings:

        No   Day/year: 0   time: -1:-1   max.: 100   [ ] unused
        URL: ________________________________________________________
        [ change ]

On the left, No specifies the URL number (ordinal number). The top line defines the starting conditions and the maximum number of pages to process. The number right of max. is the maximum number of pages and must be in the range of 1 .. 10000. The starting conditions are given by the Day/year and time fields of the top line.

The second line contains the URL to start with, which also defines the range. The URL must be of the form

        protocol://host:port/path
  like
        http://www.somewhere.org:80/
(Note: When pressing the change button, the URL is parsed, so you can check and verify the setting.)
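Parsing the URL essentially means splitting it into protocol, host, port and path. The following is only a minimal sketch of such a parse, assuming a plain C string interface; the function and structure names are illustrative, not the module's actual API:

        #include <stdio.h>
        #include <string.h>

        /* illustrative only: split "protocol://host:port/path" into parts */
        struct RoboUrl { char proto[16]; char host[256]; int port; char path[1024]; };

        static int robo_parse_url(const char *url, struct RoboUrl *u)
        {
            u->port = 80;                                   /* default port */
            strcpy(u->path, "/");                           /* default path */
            if (sscanf(url, "%15[^:]://%255[^:/]:%d%1023s",
                       u->proto, u->host, &u->port, u->path) >= 3)
                return 0;                                   /* with port    */
            if (sscanf(url, "%15[^:]://%255[^/]%1023s",
                       u->proto, u->host, u->path) >= 2)
                return 0;                                   /* without port */
            return -1;                                      /* parse error  */
        }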

Processing a URL

When the robot starts, a start-up message is produced by the scheduler (see HttpScheduleAt()). The robot starts with the given URL and checks the related page for links. If a link matches the starting URL (i.e. lies within its range), the robot follows it; otherwise it does not. Once the maximum number of pages for this URL has been reached, no further data is requested.
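If "matches" is read as a simple prefix comparison against the starting URL (an assumption, since the exact matching rule is not spelled out here), the follow decision can be sketched as follows; the names are illustrative, not the module's internals:

        #include <string.h>

        /* follow a link only if it lies below the starting URL and the
         * page limit for this URL has not yet been reached             */
        static int robo_follow(const char *start_url, const char *link,
                               int pages_done, int max_pages)
        {
            if (pages_done >= max_pages)
                return 0;               /* maximum number of pages reached */
            return strncmp(link, start_url, strlen(start_url)) == 0;
        }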

Output

Currently, there is no link from the online documentation to the main output page of this module. In the default configuration the URI /search is associated with that page. If a query string is given and the first parameter starts with what=..., one of the sub-pages listed below is called (an example request follows the list):

?what=search&where=no&word=word...
Search for the word word in one of the lists (no = 0 .. n) or in all lists (no = -1). More than one word can be specified by repeating the word=word parameter.
?what=missing&where=no
Display all missing URLs of one list (no = 0 .. n) or of all lists (no = -1). Missing URLs are those for which the WWW server returns an HTTP return code other than 200.
?what=showWords&where=no
Display all known words of one list (no = 0 .. n) or of all lists (no = -1). Each word is hyperlinked to the search-word page.
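For example, assuming the default /search URI and the host from the configuration example above (both merely placeholders), a search for the word shttpd in all lists could be requested as:

        http://www.somewhere.org:80/search?what=search&where=-1&word=shttpd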

Output files

After the scan of a URL has completed, output files are written to the WWW server's binary directory. The file names associated with the robot follow this scheme:

        _robo_<no>.<ext>
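Here <no> presumably corresponds to the URL number (No) from the configuration panel, while <ext> depends on the kind of list stored in the file; the actual extensions are not listed here. Building such a name is straightforward, as in this illustrative sketch (the extension string is a placeholder only):

        #include <stdio.h>

        char fname[64];
        /* URL number 0 and a placeholder extension */
        snprintf(fname, sizeof(fname), "_robo_%d.%s", 0, "ext");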

The URL lists are ASCII files in which all processed URLs are stored, with the starting URL on the first line. The data on each line is space-separated:

  1. current URL
  2. '-' if the HTTP return code is 200, otherwise the URL from which the current URL is referenced
  3. HTTP return code
  4. size of URL contents in bytes
  5. word index number
  6. further word index numbers (repeating up to end of line)
Each line ends with a terminating space. If a line starts with @, it is a control line carrying control information:
@Connections <total> <re-used>
Contains the number of URLs found (total) and the number of re-used connections. (Normally each HTTP request is handled by a separate connection; modern servers can handle multiple HTTP requests over one connection. The ratio of re-used connections to the total number of URLs is therefore a measure of how well a system performs.)
@Duration <seconds>
Duration of robot processing in seconds.
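A URL list file might therefore look like the following excerpt (all values are invented for illustration; each line actually ends with a trailing space):

        http://www.somewhere.org:80/ - 200 2048 0 3 7
        http://www.somewhere.org:80/about.html - 200 512 1 3
        http://www.somewhere.org:80/gone.html http://www.somewhere.org:80/ 404 0
        @Connections 3 1
        @Duration 5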

In order to save space, words are stored in a word list, and the URL list only contains index numbers that indicate which words are contained in a URL object. The word list file consists of an ASCII list with one word per line, followed by its index number and a terminating space. Lines starting with @ are control lines:

@Words <total>
Total number of words in this list.
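A word list might therefore look like this excerpt (again with invented values; each line ends with a trailing space):

        server 0
        robot 1
        page 2
        @Words 3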

© 1998-2000 by Dirk Ohme