The Kermit Project | Columbia University
612 West 115th Street, New York NY 10025 USA
…since 1981

C-Kermit 9.0 Sitemap Script

Frank da Cruz
The Kermit Project
Columbia University
Last update: Tue Jun 28 12:22:01 2011
Download:   http://kermit.columbia.edu/ftp/scripts/ckermit/ksitemap
Requires:   C-Kermit 9.0.

The ksitemap script builds a sitemap.xml file for a website from a data file that you provide. The data file lists the pages and (using Google Sitemap Image Extensions) the images you wish to include in your sitemap, along with their properties, so that search engines such as Google, Yahoo, Bing, and Ask can index them better. Read about sitemaps here.

Totally data driven, ksitemap reads a file-list file (or “filelist” for short) containing the names and attributes of the pages and images to be included in the sitemap. The filelist file is kept in the web directory itself, but it need not be world readable.

The ksitemap script should work under any Unix operating system (Linux, Mac OS X, NetBSD, Solaris, etc) that has C-Kermit 9.0 installed (but the top line, which indicates the pathname of the C-Kermit executable, might need to be changed). In Unix the ksitemap script must, of course, also be given execute permission (chmod +x). Ksitemap has not yet been tested in VMS.

Invocation

Ksitemap is invoked with the pathname of the filelist as its first and only command-line argument, for example:

$ ksitemap /www/filelist (absolute)
$ ksitemap ~/web/filelist (symbolic)
$ ksitemap web/filelist (relative)
$ ksitemap ../web/filelist (relative)
$ ksitemap /www/ (absolute directory, no filename)
$ ksitemap (no argument, see just below)
If you give a directory name without a filename, 'filelist' is used as the filename.

If you invoke ksitemap without a command-line argument then:

  • If the environment variable KSITEMAPDIR is defined, it will be used as the pathname of the website directory;
  • Otherwise, your current directory will be assumed as the website directory.
In both cases the filename defaults to 'filelist'. Thus, if you define the KSITEMAPDIR environment variable in your Unix profile (e.g. .bash_profile for the Bash shell), for example:

export KSITEMAPDIR=/net/w/0/htdocs/username/web/

and the name of the file-list file is 'filelist', then you can run ksitemap from any directory any time without any command-line argument.
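
The defaulting rules above can be sketched in a few lines. This is only an illustration in Python, not ksitemap's own Kermit code; the function name resolve_filelist and its parameters are hypothetical.

```python
import os

def resolve_filelist(arg=None, env=None, cwd=None):
    """Return the filelist pathname implied by the command-line argument."""
    env = os.environ if env is None else env
    cwd = os.getcwd() if cwd is None else cwd
    if arg is None:
        # No argument: use KSITEMAPDIR if defined, else the current directory.
        arg = env.get("KSITEMAPDIR") or cwd
        is_dir = True
    else:
        arg = os.path.expanduser(arg)  # handles the ~/web/filelist form
        is_dir = arg.endswith("/") or os.path.isdir(arg)
    # A directory given without a filename gets the default name 'filelist'.
    return os.path.join(arg, "filelist") if is_dir else arg

print(resolve_filelist("/www/"))  # /www/filelist
```

Passing env and cwd explicitly keeps the sketch easy to try out without changing the real environment.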

To invoke for debugging and testing, do:

$ DEBUG=1 ksitemap args

This gives progress messages and it writes the sitemap.xml file in a "tmp" directory.

The filelist file

The filelist file contains names of HTML and image files relative to the web directory. It can contain comment lines that begin with '#':

# This is a comment line

And it can contain blank lines, which are ignored. Nonblank, non-comment lines are in this format:

tag=value

An equal sign (=) separates the tag from the value. Any whitespace (blanks or tabs) before or after the equal sign is ignored. The following three lines have identical effect:

home=http://www.xyzcorp.com/
home = http://www.xyzcorp.com/
home=          http://www.xyzcorp.com/

If you need to include an equal sign in the value itself, surround the value with ASCII doublequotes. If you want the value itself to be enclosed in doublequotes, put three of them on each end (see the section on programming considerations for an explanation). Examples:

cap=View from the Empire State Building looking East
cap="A+B=C"
cap="""Caption within doublequotes"""
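
For readers who want to experiment with these quoting rules outside Kermit, the following sketch approximates them with Python's standard csv module, using '=' as the delimiter. It is an assumption that this matches ksitemap's behavior in every corner case; the function name parse_line is hypothetical.

```python
import csv

def parse_line(line):
    """Return None for blank and comment lines, else [tag, value, ...]."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    # CSV rules with '=' as separator: quoted fields may contain '=',
    # and a doubled doublequote inside a quoted field is one literal quote.
    reader = csv.reader([stripped], delimiter="=", skipinitialspace=True)
    return [field.strip() for field in next(reader)]

print(parse_line('home = http://www.xyzcorp.com/'))
print(parse_line('cap="A+B=C"'))
print(parse_line('cap="""Caption within doublequotes"""'))
```

The second call yields the value A+B=C without its enclosing quotes, and the third yields the caption with one pair of doublequotes kept around it.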

The first few lines define parameters for the whole website:

Tag         Status    Value
encoding    Depends   sitemap.xml files are encoded in UTF-8. If your filelist file is encoded in some other character set (such as ISO-8859-1) for the purpose of including non-ASCII characters (such as accented letters or non-Roman letters), you must declare its encoding so ksitemap can convert the text to UTF-8. If your file-list file is ASCII, or it is already UTF-8, this item is optional. Otherwise this item is required, and it should come first, so ksitemap can convert all the lines in the file appropriately. The value is the MIME name of the character set used in the file-list file. For a list of supported encodings, see this page.
home        Required  The URL of the website's home directory (with no filename part)
geo         Optional  The default geographical location for images, if any
lic         Optional  The default filename, if any, for a page containing copyright or license information for the site's original images
.macroname  Optional  Definition for a macro with the given name

These items should come before any of the page-specific items that are described below. If you include a geo or lic tag before any url tag (see below), these will be used for any image for which you do not specify a geo or lic tag. In other words the ones in the top section are global and the ones in an img section are local to that image.
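
The global-versus-local rule amounts to a simple fallback. The sketch below is illustrative Python, not ksitemap's implementation; the function name image_attrs and the dictionary layout are hypothetical.

```python
def image_attrs(image, defaults):
    """Merge an image's own geo/lic tags over the site-wide defaults."""
    merged = {}
    for tag in ("geo", "lic"):
        # A tag given in the img section wins; otherwise use the global one.
        value = image.get(tag, defaults.get(tag))
        if value is not None:
            merged[tag] = value
    return merged

defaults = {"geo": "New York City USA", "lic": "copyright.html"}
print(image_attrs({"img": "modemcable.jpg", "geo": "Bedford MA"}, defaults))
```

Here the image keeps its own geo value and inherits the global lic value.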

The "home" line's value is the URL of the website root directory, ending with slash, for example:

home=http://kermit.columbia.edu/

This is used to form the full URLs of the files and images in the website. Example:

home=http://kermit.columbia.edu/
lic=copyright.html

This results in the URL of the license file being:

http://kermit.columbia.edu/copyright.html

Macros allow you to use variables in value strings. For example, given:

.year=2010

Then any occurrence of \m(year) in a value string is replaced by 2010.
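
As a rough illustration of this substitution (in Python rather than Kermit, and not the script's actual code), the sketch below replaces only text of the form \m(name) and leaves every other backslash alone. The function name expand_macros is hypothetical, and leaving undefined macro names untouched is an assumption.

```python
import re

# Matches \m(name), where name is letters, digits, or underscores.
MACRO_REF = re.compile(r"\\m\(([A-Za-z0-9_]+)\)")

def expand_macros(value, macros):
    """Replace each \\m(name) in value with its definition from macros."""
    def substitute(match):
        # Assumption: a reference to an undefined macro is left as written.
        return macros.get(match.group(1), match.group(0))
    return MACRO_REF.sub(substitute, value)

macros = {"year": "2010"}
print(expand_macros(r"cal-\m(year).html", macros))  # cal-2010.html
```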

The remainder of the file list contains lines for each file and image you want to include in your sitemap. For each page, the lines should appear in the following order:

Tag  Status    Value
url  Required  Name of an HTML file relative to the website's root directory.
pri  Optional  Priority of the page, 0.0 to 1.0

For each URL, the page date is supplied automatically from the modification date of the file, and the change frequency (daily, weekly, monthly, yearly) is supplied based on when the file was last modified.
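
One way to picture this step is the Python sketch below. The age thresholds are purely hypothetical, since this page does not state the ones ksitemap uses, and the function name page_dates is likewise invented for illustration.

```python
import datetime

def page_dates(mtime, now):
    """Return (lastmod, changefreq) for a file last modified at mtime."""
    age_days = (now - mtime).days
    # Hypothetical thresholds; the real script may choose differently.
    if age_days < 7:
        freq = "daily"
    elif age_days < 30:
        freq = "weekly"
    elif age_days < 365:
        freq = "monthly"
    else:
        freq = "yearly"
    return mtime.strftime("%Y-%m-%d"), freq

modified = datetime.datetime(2010, 12, 7, 9, 30)
print(page_dates(modified, datetime.datetime(2010, 12, 8)))
```

For a real file, mtime could be obtained with datetime.datetime.fromtimestamp(os.path.getmtime(path)).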

For redirects, a URL entry can have two values; for example:

url=index.html=index-en.html

This means that the first filename is an HTTP Redirect to the second filename; that is, the first name is a pointer to a file having the second name. For example, suppose you have a website with calendars for different years: cal-2009.html, cal-2010.html, cal-2011.html, etc, and the calendar for the current year should always be available as simply cal.html. In that case your .htaccess file redirects the name cal.html to (say) cal-2011.html because you want the cal.html name to be indexed by Web crawlers even though no file exists with that name in your site. This way, each year you only have to change your .htaccess and you don't have to wait for the web crawlers to index a file that didn't exist before:

url=cal.html=cal-2011.html
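
A two-valued url entry can be modeled as a pair: the name to publish in the sitemap and the file that actually exists. The Python sketch below assumes the line has already been split on '=' into a list; the function name split_url_value is hypothetical.

```python
def split_url_value(fields):
    """Given e.g. ['url', 'cal.html', 'cal-2011.html'], return (name, target).

    name is the filename that appears in the sitemap; target is the file the
    name redirects to. For an ordinary entry the two are the same.
    """
    name = fields[1]
    target = fields[2] if len(fields) > 2 else name
    return name, target

print(split_url_value(["url", "cal.html", "cal-2011.html"]))
print(split_url_value(["url", "index.html"]))
```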

If you have a lot of files using this naming convention, you can use a macro so the variable string can be defined (and changed) in just one place instead of lots of places:

.year=2011
url=cal.html=cal-\m(year).html
url=jan.html=jan-\m(year).html
url=feb.html=feb-\m(year).html
etc...

If there are images on the page that you want to include in the sitemap:

Tag    Status    Value
img    Required  Name of an image file in the root directory or in a subdirectory.
cap    Optional  A text caption for the image
title  Optional  A text title for the image
geo    Optional  The geographical localization of this image only
lic    Optional  The URL of a license page for this image only

Here's a brief example that has three files. For the first file (index.html), a priority is specified; for the others, the default priority is accepted. The second file is in a subdirectory. The third file has images. Comments, blank lines, and indentation are used for clarity, but they do not affect the result. Note that there may be, but need not be, whitespace around the equal sign.

# ksitemap filelist for building sitemap.xml

encoding = ISO-8859-1
home=http://kermit.columbia.edu/
geo=New York City USA
lic=copyright.html

url=index.html
pri=1.0

url=cudocs/ilosetup.html

url=cable.html
img=connectors-340.jpg
  cap=Male and Female RS-232 Connectors
  title=Serial Data Connectors
img=modemcable.jpg
  cap=Modem Cable Schematic
  geo=Bedford MA
img=nullmodem-480.jpg
  cap=Null Modem Cable Schematic
  lic=special.html
  geo=Batey Caño - Yamasá

The resulting sitemap.xml looks like this:


<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
 xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
<url>
  <loc>http://kermit.columbia.edu/</loc>
  <lastmod>2010-12-07</lastmod>
  <changefreq>daily</changefreq>
  <priority>1.0</priority>
</url>
<url>
  <loc>http://kermit.columbia.edu/cudocs/ilosetup.html</loc>
  <lastmod>2010-12-07</lastmod>
  <changefreq>daily</changefreq>
  <priority>0.5</priority>
</url>
<url>
  <loc>http://kermit.columbia.edu/cable.html</loc>
  <lastmod>2010-12-07</lastmod>
  <changefreq>daily</changefreq>
  <priority>0.5</priority>
  <image:image>
    <image:loc>http://kermit.columbia.edu/connectors-340.jpg</image:loc>
    <image:caption>Male and Female RS-232 Connectors</image:caption>
    <image:title>Serial Data Connectors</image:title>
    <image:geo_location>New York City USA</image:geo_location>
    <image:license>http://kermit.columbia.edu/copyright.html</image:license>
  </image:image>
  <image:image>
    <image:loc>http://kermit.columbia.edu/modemcable.jpg</image:loc>
    <image:caption>Modem Cable Schematic</image:caption>
    <image:geo_location>Bedford MA</image:geo_location>
    <image:license>http://kermit.columbia.edu/copyright.html</image:license>
  </image:image>
  <image:image>
    <image:loc>http://kermit.columbia.edu/nullmodem-480.jpg</image:loc>
    <image:caption>Null Modem Cable Schematic</image:caption>
    <image:geo_location>Batey Caño - Yamasá</image:geo_location>
    <image:license>http://kermit.columbia.edu/special.html</image:license>
  </image:image>
</url>
</urlset>
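
To make the mapping from filelist tags to XML elements concrete, here is a small Python sketch that emits one url element in the layout shown above. It is not how ksitemap itself is written (ksitemap is a Kermit script); the function name url_element and the dictionary layout are hypothetical, and the default priority of 0.5 is taken from the sample output.

```python
from xml.sax.saxutils import escape

def url_element(home, page, lastmod, changefreq, priority="0.5", images=()):
    """Build one <url> element; images is a sequence of tag dictionaries."""
    out = [
        "<url>",
        f"  <loc>{escape(home + page)}</loc>",
        f"  <lastmod>{lastmod}</lastmod>",
        f"  <changefreq>{changefreq}</changefreq>",
        f"  <priority>{priority}</priority>",
    ]
    # Filelist tag -> image extension element, in the order used above.
    image_tags = [("cap", "caption"), ("title", "title"),
                  ("geo", "geo_location"), ("lic", "license")]
    for image in images:
        out.append("  <image:image>")
        out.append(f"    <image:loc>{escape(home + image['img'])}</image:loc>")
        for tag, element in image_tags:
            if tag in image:
                # The lic value is a filename, so the home URL is prefixed.
                text = home + image[tag] if tag == "lic" else image[tag]
                out.append(f"    <image:{element}>{escape(text)}</image:{element}>")
        out.append("  </image:image>")
    out.append("</url>")
    return "\n".join(out)

print(url_element("http://kermit.columbia.edu/", "cable.html", "2010-12-07",
                  "daily", images=[{"img": "modemcable.jpg",
                                    "cap": "Modem Cable Schematic",
                                    "geo": "Bedford MA",
                                    "lic": "copyright.html"}]))
```

The escape() call guards against characters such as & and < in captions, which would otherwise produce invalid XML.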

Programming considerations

The key to parsing the filelist is Kermit's \fsplit() function, and in particular some new features added to it in C-Kermit 9.0: a straightforward way of handling strings containing non-ASCII characters, and the "comma-separated values" list (CSV) feature described in this page. The statement:
.\%9 := \fsplit(\m(line),&x,=,CSV) # Split line on '='
splits a filelist line into two pieces, the tag and the value:
  • \%9 is a kind of all-purpose temporary local variable, a usually unused command-line or macro argument number 9, which in this case receives the number of items that were obtained by splitting (the \%1-9 variables are local by definition, meaning if you use them in a macro, changing their values won't affect variables of the same name anywhere else).
  • \fsplit() is a built-in function for splitting a string into pieces based on all sorts of breaking, including, and grouping criteria.
  • The first argument, \m(line), is the variable holding the current line from the filelist.
  • &x is the name of the array to put the result in.
  • = is the break set, composed of one character in this case, the equal sign.
  • CSV means it is a "comma-separated values" list, but since the break character is equal sign and not comma, it is really an "equal-sign separated list", but with the same rules as a CSV, such as:
    1. All characters other than the break character itself are in the include set.
    2. Except that the separator can, but need not be, surrounded by whitespace, in which case the whitespace characters are discarded (not included).
    3. A field containing the separator character as data must be surrounded by doublequotes, which will be removed in the final result.
    4. A field that contains doublequotes must be enclosed in doublequotes, and then all interior doublequotes must be doubled.
The complete set of CSV rules is here.

Another observation about \fsplit() is worth making. Its result goes into an array, and array elements in the Kermit language, just like \%a variables, are evaluated recursively. The array elements contain the literal pieces of the original string, but when you refer to an array element whose value contains any backslashes, the string is evaluated recursively, "all the way down". This is why the array element values are referenced through \fcontents(), which forces a simple "one-level-deep" evaluation.

A more serious problem was noted when adding the macro capability to ksitemap, namely that \fsplit() itself was stripping out backslash characters. This is appropriate behavior for some of its other uses (e.g. parsing S-Expressions), but is not appropriate for parsing external data, such as data lines read from files. This explains the "Quoting Hell" trick just before the \fsplit() invocation. This will be unnecessary (and, in fact, harmful) in the next build of C-Kermit after 9.0.299 Alpha.09, where in CSV and TSV invocations of \fsplit(), backslashes will be treated just as any other character.

Finally it should be noted that ksitemap takes pains to expand macros only after verifying that a line contains “\m(xxx)” (where xxx would be the name of the macro). It could very easily have simply evaluated each line without all the testing and checking, but then files that contained backslashes for other reasons would be wrecked.

ksitemap / Kermit sitemap script / The Kermit Project / Columbia University / December 2010