Cataloging your web site
This chapter describes how you can automatically
generate web pages that list and categorize the HTML files in your web
site. The AutoCatalog feature provides your web users
with easy access to your content by:
- listing all HTML documents in your web site
- generating HTML views of your web content organized by title,
classification, author, and last-modified date
- generating automatic directory information (known as a resource
description) for each HTML document in your document root
If your web server is one of many in a company, organization,
or educational facility, you can also use the AutoCatalog feature to provide
a resource description of your web site to any Netscape Catalog Server.
Netscape Catalog Server can then provide a central server where users can
find information on any of the individual web servers in your organization.
The AutoCatalog feature of Enterprise Server
3.0 provides only a subset of the functionality of Netscape Catalog Server.
While the AutoCatalog feature can list and categorize the files on your
web site, Netscape Catalog Server also indexes information, provides searching
capabilities, and catalogs documents from multiple servers. Netscape Catalog
Server is also highly configurable, allowing users to write plug-in functions
and define rules for gathering and categorizing documents. If you would
like a more robust cataloging tool, you may want to purchase Netscape Catalog
Server to work in conjunction with your Enterprise Server.
What can AutoCatalog do for my web site?
If you have a large web site with many files and
directories, it can be difficult to organize the content so that your users
can quickly find specific information. If your web server also contains
directories of information from various groups or people, the content may
not be unified.
The AutoCatalog feature creates an organized
catalog of all of the documents on your web server. It sorts the documents
by title, classification, author, and last-modification time, as shown
in Figure 14.1.
Figure 14.1 Users see your catalog as categorized links
How does AutoCatalog work?
The AutoCatalog feature is actually controlled by
an agent process called the catalog agent. The catalog agent accesses
your server through HTTP requests. You can either set up the catalog agent
to run at set times or run it manually from a form in the Server
Manager. The catalog agent sends requests to your server until the catalog
agent determines there are no more files to catalog.
The catalog agent gathers information in
a two-step process. First it enumerates (gathers) the URLs referenced
in each HTML file and determines which of these URLs it should catalog.
Then it generates a resource description that contains information
about the HTML file.
Enumerating the URLs
The catalog agent sends an HTTP request to your server
and accesses the first URL you specify. Typically this is the URL to your
home page, but you can set it to start in any directory or HTML file in
your document root. The catalog agent gets the first HTML document and
scans it for information to catalog.
Warning!
If your server uses access control based on hostname or IP
address, make sure you've allowed your local machine full access to your
web site. Also, if your server is configured to use authentication through
entries in a user database, make sure you give access to the catalog agent.
See "The process.conf file" on page 267 for
more information.
The first scan lists any URLs referenced in the HTML
file. The second scan generates a resource description, as described in
"Generating a resource description" on page
259. After the catalog agent enumerates the URLs and generates a resource
description, it determines which HTML files to scan next. The catalog agent
in Netscape Enterprise Server limits the URLs it traverses: it accesses
only those HTML files located in your server. Figure
14.2 shows how the catalog agent scans the files on a sample web server.
Figure 14.2 The catalog agent enumerates URLs, and then generates
resource descriptions
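For example, consider the hypothetical page below (all file names and
hosts here are invented for illustration). The catalog agent would
enumerate and follow the first link because it points to a file on
this server, but it would filter out the second link because it
references an external host:
<HTML>
<HEAD><TITLE>Products</TITLE></HEAD>
<BODY>
<!-- A local link: enumerated, then retrieved and cataloged -->
<A HREF="/products/widgets.html">Our widgets</A>
<!-- An external link: filtered out, because it leaves this server -->
<A HREF="http://www.othersite.com/">A partner site</A>
</BODY>
</HTML>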
Generating a resource description
For each document found during enumeration, the catalog
agent scans the HTML document for information to catalog. For example,
the agent might gather the following information:
- Title is the text that appears within HTML <TITLE> tags.
- Classification is the text that appears in the CONTENT attribute
for the HTML <META> tag. For example, "General HTML" is the
Classification in the following text:
<META NAME="Classification" CONTENT="General HTML">
- Author also appears in the CONTENT attribute for the HTML <META>
tag. For example, "J. Doe" is the Author in the following text:
<META NAME="Author" CONTENT="J. Doe">
- Last-modified time is the time the file was last saved. The catalog
agent gets this information from the HTTP headers that the server sends,
not from the HTML document itself.
After the catalog agent gathers this information
from the first HTML file, it uses the enumerated URLs to choose which file
to scan next.
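Putting these fields together, the head of a document that the catalog
agent can fully describe might look like the following (the title and
META values here are hypothetical):
<HTML>
<HEAD>
<TITLE>Welcome to Our Web Site</TITLE>
<META NAME="Classification" CONTENT="General HTML">
<META NAME="Author" CONTENT="J. Doe">
</HEAD>
From this file the agent records "Welcome to Our Web Site" as the
Title, "General HTML" as the Classification, and "J. Doe" as the
Author; the last-modified time comes from the HTTP headers the server
sends.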
Generating HTML catalog files
After the catalog agent gathers all of the information
for your web server, it generates several HTML files that users will view
to find information on your web site. These HTML files are kept in the
https-identifier/catalog directory. Users can access the
categorized information by going to the catalog directory on your server:
http://yourserver.org/catalog
You can restrict access to this directory
and treat it as you do other parts of your web site. You can also turn
off the catalog feature, which effectively means no one can access the
catalog directory or any of its HTML files.
Using AutoCatalog
To use the AutoCatalog feature with your server,
you first turn on the catalog agent and configure it to gather and sort
the information about your web site. The catalog agent then collects information
from HTML documents on your server, and it creates static HTML files that
categorize your server's content in several ways.
Before any user can access the generated HTML
files, you must turn on the catalog option. To let users access your server's
catalog:
1. From the Server Manager, choose Auto Catalog|On/Off. The AutoCatalog
On/Off form appears.
2. Click the Server On button.
3. Save and apply your changes.
See "Accessing catalog
files" on page 266 for information on accessing the HTML files created
by the catalog agent.
Configuring AutoCatalog
You can configure how the catalog agent accesses
the content on your server. You can set directories in your document root
where the catalog agent starts cataloging. That is, if you have several
directories in your document root (the main directory where you keep your
server's content files), you can set the catalog agent to access only certain
directories and their subdirectories.
To configure the catalog agent:
1. From the Server Manager, choose AutoCatalog|Configure. The Configure
Catalog Agent form appears. To find your server's document root, choose
the Primary Document Directory link in Content Mgmt.
2. Type the directories where you want the catalog agent to begin
searching (it starts with the index.html file in that directory). For
example, if your server's document directory has three subdirectories
called first, second, and third, and you want the catalog agent to
search only the second directory, type /second in the Starting
Directories field. (The sketch after this procedure illustrates the
effect.)
If you leave the Starting Directories field blank, the catalog agent
searches your home page first (this file is usually called
index.html), and then it searches any URLs referenced in that file.
3. Select the speed at which the catalog agent should search your
server's directories. The default is 7. The speed setting determines
the number of "hits" the server experiences while the catalog agent is
working. That is, when the catalog agent is searching through your
server's files, it can send the server one or more simultaneous
requests for documents, and it can wait before sending each request.
In general, the speed setting should be appropriate for your server
and its content. If you have a high-access server and up-to-date
cataloging isn't very important, choose a low speed; if your server
has periods of low load (perhaps in the early morning hours) and
cataloging is very important to you, run the catalog agent at a high
speed. Table 14.1 defines the speed settings.
Speed settings

Speed setting    Simultaneous retrievals    Delay (seconds)
1                1                          60
2                1                          30
3                1                          15
4                1                          5
5                1                          1
6                2                          0
7                4                          0
8                8                          0
9                12                         0
10               16                         0
4. Enter the username and password that the agent will use to access
any password-protected sources that are to be enumerated.
5. Click OK.
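To illustrate step 2, here is a hypothetical document root. With
/second typed in the Starting Directories field, the catalog agent
catalogs only the second subtree, as reached through links starting at
second/index.html:

document-root/
    first/        not cataloged
    second/       cataloged, starting with second/index.html
        reports/  cataloged, if linked from pages in second/
    third/        not cataloged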
Scheduling the catalog agent
You can configure the catalog agent to run at specific
times on specific days of the week. This feature is useful if your web
site content changes frequently. For example, you might have a site where
many people upload content and you don't directly monitor the content changes.
Or, you might manage a site whose content is very dynamic and should be
cataloged frequently.
Note
If the content on your server changes infrequently or if all
of the content changes simultaneously, you'll probably want to recatalog
your content manually instead of scheduling times for recataloging. Manual
recataloging minimizes the performance impact on your server.
To schedule the catalog agent:
1. Make sure that cron control is started by choosing Global
Settings|Cron Controls in the administration server. For more
information on Cron Controls and the administration server, see
Managing Netscape Servers.
2. From the Server Manager, choose Auto-Catalog|Schedule. The Schedule
Catalog Agent form appears.
3. Select the hour and minute when you want the catalog agent to run.
The drop-down lists let you choose a time in ten-minute increments.
4. Check the days of the week that you want the catalog agent to run.
You can check one or more days.
5. Check Activate schedule. If you want to stop the server from
cataloging your files on a schedule, check Deactivate schedule.
6. Click OK.
When the catalog agent runs, it logs its progress
in a file called robot.log. This file appears in the https-identifier/logs
directory under your server root directory. The log file contains the URLs
retrieved, generated, and enumerated. This log file gives more detail than
the status report (see "Getting a status
report for the catalog agent" on page 264).
Controlling the catalog agent manually
You can control the catalog agent manually. This
feature is useful for several reasons:
-
You can improve performance by running the catalog agent only when it is
necessary. For instance, you can run the catalog agent only after you've
made significant changes to your server's content instead of running it
on a weekly basis.
-
You can minimize the catalog agent's impact on your server by running it
during a low-access period. That is, you manually run the catalog agent
when you know your server isn't being accessed by lots of people.
-
You can stop the catalog agent if it is impacting your server's performance.
To manually control the catalog agent:
1. From the Server Manager, choose Auto Catalog|Manually Control. The
Control Catalog Agent form appears.
2. Select one of the following buttons for controlling the catalog agent:
- Start starts the catalog agent using the settings in the Configure
Catalog Agent form.
- Status displays the current status of the agent. See the following
section for more information on status.
- Stop Enumeration stops the catalog agent from traversing files, but
it continues generating the resource description for the file it's
scanning.
- Stop stops the catalog agent that you manually started. If the agent
is in the middle of enumerating or generating a resource description,
you'll lose that information, but the catalog agent will stop itself
and clean up any temporary files it was using. You might use Stop
Enumeration instead. The catalog agent will run again later if you
scheduled the agent to run at specific times.
- Kill immediately stops the catalog agent. You'll lose any information
the catalog agent was working on.
Getting a status report for the catalog agent
Whenever the catalog agent runs, you can get a status
report that describes what the catalog agent is doing. To view the status
report, click the Status button on the Control Catalog Agent form.
A sample status report for the catalog
agent.
Table 14.2 defines
all of the status attributes.
Status attributes

Attribute                Description
active                   The number of URLs the catalog agent is
                         currently working on
spawned                  The number of URLs the catalog agent has
                         enumerated but hasn't yet retrieved
retrieved                The number of URLs retrieved through HTTP
                         connections
enumerated               The number of URLs enumerated so far
generated                The number of URLs generated so far
filtered-at-metadata     The number of URLs rejected by the catalog
                         agent when scanning the META data in the HTML
                         files
filtered-at-data         The number of URLs rejected by the catalog
                         agent when scanning the data in the HTML files
                         (for example, if the links reference an
                         external host)
retrievals-pending       The number of URLs remaining that need to be
                         retrieved
retrievals-active        The number of URLs the agent is currently
                         retrieving
retrievals-active-peak   The highest number of URLs the agent
                         simultaneously retrieved
deleted                  The number of URLs filtered
migrated                 The number of URLs enumerated but waiting to
                         have resource descriptions processed
defunct                  The number of URLs filtered
spawn-backlog            The number of URLs waiting to be processed by
                         the catalog agent
spawn-string-cache       The number of unique host names that appeared
                         in links
bytes-retrieved          The total number of bytes the catalog agent
                         has retrieved, that is, the total number of
                         bytes for all of the files the agent has
                         retrieved through HTTP connections
Accessing catalog files
Once you have a working catalog, you can access the
catalog main page at the following URL:
http://yourserver.org/catalog
Catalog files are kept on your server in
a directory under the server root directory called https-identifier/catalog.
Because this directory is outside your document root directory (where you
keep all of your web content), the server creates an additional document
directory that maps the URL prefix /catalog to the https-identifier/catalog
directory on your hard disk. You can view this setting by choosing Content
Mgmt|Additional Document Directories in the Server Manager.
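For example, assuming a hypothetical server root of /usr/ns-home and a
server identifier of https-myserver, the mapping would look like this:

URL prefix:          /catalog
Physical directory:  /usr/ns-home/https-myserver/catalog

A request for http://yourserver.org/catalog/index.html would then be
served from /usr/ns-home/https-myserver/catalog/index.html.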
Catalog configuration files
The catalog agent uses the following configuration
files:
- The filter.conf file is used by the catalog agent to determine what
data to save in the resource descriptions. This file also configures
the catalog agent. You should modify this file only if you have
Netscape Catalog Server. See that product's documentation for more
information on this configuration file.
- The process.conf file configures the catalog agent and tells it where
to send the resource descriptions that it generates. This file
contains all of the catalog agent settings you specified in the Server
Manager forms, including the URL where the catalog agent begins its
enumeration.
- The robot.conf file specifies which filter.conf file the catalog
agent uses.
- The rdm.conf file contains information for all catalogs served by the
resource description server (RDS). RDSs collect resource descriptions
from the robots that search the network and send this information to
the catalog server. The RDS is actually the back end of Netscape
Catalog Server, and is not part of the AutoCatalog feature. You should
modify this file only if you have Netscape Catalog Server; see that
product's documentation.
- The csid.conf file contains configuration information for the servers
that the RDS catalogs.
The catalog agent also uses and obeys restrictions
set in a file called robots.txt. You can use this file to restrict
areas of your server from your catalog agent. This file is also used by
any other robots or catalog agents that visit your web server.
The filter.conf file
The filter.conf file uses the same syntax as the obj.conf file. It is
a series of directives and functions with attributes that define the
rules the catalog agent follows (such as which directory to start
cataloging in) and how to generate the resource descriptions. The
filter.conf file uses four directives:
- Setup initializes the catalog agent when the agent is started.
- MetaData filters the resource based on any meta-data listed in META
tags in the HTML document. This filtering occurs once for each HTML
file that the catalog agent retrieves.
- Data filters the HTML file based on the information sent in the HTTP
headers for the file.
- Shutdown performs any functions needed before the catalog agent shuts
down.
You should modify this file only if you plan to use your web server
with Netscape Catalog Server. For more information on the
configuration files, see the documentation for Netscape Catalog Server.
The process.conf file
The process.conf file configures the catalog
agent. It includes information such as:
- Where to register results. A catalog service ID (CSID) points to the
resource description server for your web server. The catalog agent
sends the resource descriptions to this CSID. You can view your
server's CSID on the AutoCatalog On/Off form.
- How fast to run. This is the speed setting you set in the Server
Manager forms.
- How many system resources to use.
- A single username and password for authenticating to the server.
Example process.conf file
The following sample file shows how you can set a
username and password that the catalog agent uses when authenticating to
your server. The email address is also used to identify the catalog agent.
<Process csid="x-catalog://www.netscape.com:9999/AutoCatalog" \
speed=10 \
email="user@domain" \
username="anonymous" \
password="robin@" \
http://www.netscape.com/
</Process>
The robots.txt file
The catalog agent is a type of robot--that
is, it is a program that gathers information from your web site by recursively
following links from one HTML file to another. There are many different
kinds of robots that roam the World Wide Web looking for web servers to
use for information gathering. For example, there are many companies that
search the web, index documents, and then provide the information as a
service to their customers (typically through searchable forms).
Robots are also sometimes called web crawlers or spiders.
Because some web administrators want to control what
directories and files a robot can access, the web community designed a
standard robots.txt file for excluding robots from web servers.
The catalog agent was designed to follow instructions in a robots.txt
file. However, not all web robots follow these guidelines.
You can use the robots.txt file
to restrict your server's catalog agent, but if your web server is part
of the World Wide Web, keep in mind that the robots.txt file might
be used by other robots visiting your site.
Note
The catalog agent, and any other robot, is restricted by access
control settings and user authentication.
Format for robots.txt
The robots.txt file consists of one or more
groups of lines with name-value pairs that instruct the robots. Each group
of lines should describe the User-Agent type, which is the name of a particular
robot. The Netscape catalog agent is called Netscape-Catalog-Agent/1.0.
After you specify which User-Agents you want to configure, you include
a Disallow line that lists the directories you want to restrict. You can
include one or more groups in your robots.txt file.
Each line in the group has the format
"field:value"
The field name is not case-sensitive, but the
value is case-sensitive. You can include comment lines by beginning the
comment with the # character. The following example shows one group that
configures all robots and tells them not to go into the directory called
/usr:
# This is a sample robots.txt file
User-agent: *
Disallow: /usr
Example robots.txt files
The following example robots.txt file specifies
that no robots should visit any URL starting with /usr or /tmp:
# robots.txt for http://www.mysite.com/
User-agent: *
Disallow: /usr
Disallow: /tmp
The next example restricts all robots from your
web site except the Netscape catalog agent:
# robots.txt for http://www.site.com/
User-agent: *
Disallow: /
# Netscape catalog agent is a good robot
User-agent: Netscape-Catalog-Agent/1.0
Disallow:
The following example tells all robots, including
the catalog agent, not to traverse your web site.
# No robots allowed!
User-agent: *
Disallow: /
Editing the robots.txt file
You can edit the robots.txt file manually
or by using the online form. The Edit Robots.txt form will create a robots.txt
file if one does not already exist. If you choose to edit the file manually,
use the format described in "Format for robots.txt"
on page 269. To edit the robots.txt file using the Edit Robots.txt
form:
1. Choose Auto-Catalog|Edit Robots.txt in the Server Manager. The Edit
Robots.txt form appears.
2. In the User-Agent field, enter the names of the User-Agents, or
robots, you want to configure. Each User-Agent should be on a separate
line in the text field. The User-Agents you list in this field are
those for which you will be disallowing access to specific
directories. The User-Agent names are case-sensitive.
For example, if you want to configure the Netscape catalog agent, type
Netscape-Catalog-Agent/1.0 in the User-Agent field. If you want to
configure all robots, enter *.
3. In the Disallow field, enter the names of the directories you want
to restrict, listing each directory on a separate line. The directory
names are also case-sensitive.
4. Click OK.
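For example, entering Netscape-Catalog-Agent/1.0 in the User-Agent
field and a hypothetical directory name such as /private in the
Disallow field would produce a group like this in your robots.txt
file:

# Keep the Netscape catalog agent out of /private
User-agent: Netscape-Catalog-Agent/1.0
Disallow: /private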