Configuration of the Module MOD_robo
This module implements a simple web robot, which recursively scans certain URLs up to a given number of pages. The starting time can be configured for periodical or one-time runs; the robot then runs in a separate task. The number of URLs that can be checked depends on the license: it is the number of parallel tasks the WWW server can handle, divided by 4 and rounded to an integer. For example, the public version can handle 10 parallel tasks; divided by 4 this is 2.5, rounded 2. So with a public version of shttpd you can specify two URLs to be checked.
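As a rough illustration, the limit could be computed as in the following sketch; the function name is made up, and plain integer division is assumed here because it reproduces the 10 / 4 = 2 example given above.

    /* Sketch only: the function name and the way the task count is obtained
     * are assumptions; integer division by 4 yields 10 / 4 = 2, matching
     * the example for the public version.                                  */
    static int robo_max_urls(int parallel_tasks)
    {
        return parallel_tasks / 4;
    }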
Configuration is done within the administration section. The configuration panel shows each search URL with all of its settings:
No | Day/year | 0 | <-- time | -1 : -1 | max. | 100 | unused |
URL: | |
On the left, No specifies the URL number (ordinal number). The top line defines the starting conditions and the maximum number of pages to process. The number right of max. is the maximum number of pages and must be in the range 1 .. 10000. The starting conditions are defined by the remaining fields of the top line (the drop-down selection, its value, and the time fields).
The second line contains the URL to start with, which also defines the scan range. The URL must be of the form
protocol://host:port/path, for example http://www.somewhere.org:80/ (note: when pressing the change button, the URL is parsed so that you can check/verify the setting).
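The following sketch only illustrates how a URL of this form could be split into its components with standard C calls; it is not the module's actual parser, and the buffer sizes and return convention are assumptions.

    #include <stdio.h>

    /* Illustrative only: split "protocol://host:port/path" into its parts.
     * Returns 0 on success, -1 if the URL does not match the expected form. */
    static int split_url(const char *url,
                         char proto[16], char host[128],
                         int *port, char path[256])
    {
        path[0] = '/';
        path[1] = '\0';
        *port   = 80;                      /* assumed default port          */

        /* full form: http://www.somewhere.org:80/index.html */
        if (sscanf(url, "%15[^:]://%127[^:/]:%d%255s",
                   proto, host, port, path) >= 3)
            return 0;
        /* form without an explicit port */
        if (sscanf(url, "%15[^:]://%127[^:/]%255s", proto, host, path) >= 2)
            return 0;
        return -1;
    }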
When the robot starts, a start-up message is produced by the scheduler (see HttpScheduleAt()). The robot begins with the given URL and checks the related page for links. If a link matches the starting URL, the robot follows it; otherwise it does not. Once the maximum number of links has been reached for this URL, no further data is requested.
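The follow/ignore decision can be pictured as a simple prefix comparison against the starting URL combined with the page counter; the names below are hypothetical and merely restate the rule described above.

    #include <string.h>

    /* Hypothetical illustration of the rule above: a link is followed only
     * while the page limit has not been reached and the link lies within
     * the range given by the starting URL (prefix match).                  */
    static int robo_follow_link(const char *start_url, const char *link,
                                int pages_done, int max_pages)
    {
        if (pages_done >= max_pages)
            return 0;                              /* limit reached: stop   */
        return strncmp(link, start_url, strlen(start_url)) == 0;
    }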
Currently, there is no link from the online documentation to the main output page of this module. In the default configuration the URI /search is associated with that page. If a query string is given and the first parameter starts with what=..., one of the sub-pages is called.
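A handler for that page might inspect the query string roughly as sketched below; only the what= convention is taken from the description above, while the function name and return convention are assumptions.

    #include <string.h>

    /* Sketch only: decide whether the request for /search selects a
     * sub-page. query is the raw query string (may be NULL or empty);
     * the returned selector may still contain further &-separated
     * parameters.                                                          */
    static const char *robo_selected_subpage(const char *query)
    {
        if (query == NULL || *query == '\0')
            return NULL;                   /* no query: main output page    */
        if (strncmp(query, "what=", 5) != 0)
            return NULL;                   /* first parameter is not what=  */
        return query + 5;                  /* sub-page selector             */
    }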
After a scan of a URL has completed, output files are written to the WWW server's binary directory. The file names associated with the robot follow this scheme:
_robo_<no>.<ext>
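Building such a file name is straightforward, as the following sketch shows; the extension argument is a placeholder, since the concrete extensions depend on the file type.

    #include <stdio.h>

    /* Sketch only: build an output file name of the form _robo_<no>.<ext>.
     * "ext" is a placeholder; the real extensions depend on the file type. */
    static void robo_file_name(char *buf, size_t size, int no, const char *ext)
    {
        snprintf(buf, size, "_robo_%d.%s", no, ext);
    }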
The URL list is an ASCII file in which all processed URLs are stored, with the starting URL in the first line. The data in each line is stored space-separated.
In order to save space, the words are stored in a word list, and the URL list only contains index numbers, which indicate which words are contained in a URL object. The word list file consists of an ASCII list with one word per line, followed by its index number and a terminating space. Lines starting with a @ are control lines.
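A reader for the word list could look roughly like the following sketch; the exact meaning of the control lines is not covered here, so they are simply skipped, and the assumed line layout (word, then index) follows the description above.

    #include <stdio.h>

    /* Sketch of a reader for the word list file: one word per line followed
     * by its index number; lines starting with '@' are control lines and
     * are skipped here, since their handling is not described above.       */
    static void read_word_list(FILE *fp)
    {
        char line[512];

        while (fgets(line, sizeof line, fp) != NULL) {
            char word[256];
            int  index;

            if (line[0] == '@')
                continue;                   /* control line: skipped here   */
            if (sscanf(line, "%255s %d", word, &index) == 2)
                printf("word %-20s -> index %d\n", word, index);
        }
    }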
© 1998-2000 by Dirk Ohme