The Kermit Project |
Columbia
University
612 West 115th Street, New York NY 10025 USA • [email protected]
Frank da Cruz
The Kermit Project
Columbia University
Last update: Tue Jun 28 12:22:01 2011
Download: http://kermit.columbia.edu/ftp/scripts/ckermit/ksitemap Requires: C-Kermit 9.0.
Totally data driven, ksitemap reads a file-list file (or “filelist” for short) containing the names and attributes of the pages and images to be included in the sitemap. The filelist file is kept in the web directory itself, but it need not be world readable.
The ksitemap script should work under any Unix operating system (Linux, Mac OS X, NetBSD, Solaris, etc) that has C-Kermit 9.0 installed (but the top line, which indicates the pathname of the C-Kermit executable, might need to be changed). In Unix the ksitemap script must, of course, also be given execute permission (chmod +x). Ksitemap has not yet been tested in VMS.
If you give a directory name without a filename, 'filelist' is used as the filename.
$ ksitemap /www/filelist       (absolute)
$ ksitemap ~/web/filelist      (symbolic)
$ ksitemap web/filelist        (relative)
$ ksitemap ../web/filelist     (relative)
$ ksitemap /www/               (absolute directory, no filename)
$ ksitemap                     (no argument, see just below)
If you invoke ksitemap without a command-line argument, the directory is taken from the KSITEMAPDIR environment variable. For example, if you define:
export KSITEMAPDIR=/net/w/0/htdocs/username/web/
and the name of the file-list file is 'filelist', then you can run ksitemap from any directory any time without any command-line argument.
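The lookup rules above can be sketched in Python (ksitemap itself is a Kermit script, so this is only an illustration). The function name resolve_filelist is hypothetical, and treating a trailing slash or an existing directory as "a directory name" is an assumption. What ksitemap does when neither an argument nor KSITEMAPDIR is supplied is not stated here, so the sketch simply returns None in that case.

```python
import os

DEFAULT_NAME = "filelist"  # default filename stated in the text


def resolve_filelist(arg=None, environ=None):
    """Return the filelist path for a command-line argument (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    if not arg:
        # No argument: fall back to the KSITEMAPDIR environment variable.
        directory = environ.get("KSITEMAPDIR")
        if not directory:
            return None  # behavior not covered by the text; left to the caller
        return os.path.join(directory, DEFAULT_NAME)
    arg = os.path.expanduser(arg)  # handles the ~/web/filelist form
    if arg.endswith("/") or os.path.isdir(arg):
        # Directory without a filename: 'filelist' is used as the filename.
        return os.path.join(arg, DEFAULT_NAME)
    return arg
```

For example, resolve_filelist("/www/") yields "/www/filelist", matching the "absolute directory, no filename" case.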
To invoke for debugging and testing, do:
$ DEBUG=1 ksitemap args
This prints progress messages and writes the sitemap.xml file to a "tmp" directory.
The filelist can contain comment lines such as:
# This is a comment line
It can also contain blank lines, which are ignored. Nonblank, non-comment lines are in this format:
tag=value
An equal sign (=) separates the tag from the value. Whitespace (blanks or tabs) before or after the equal sign is ignored. The following three lines have identical effect:
home=http://www.xyzcorp.com/
home = http://www.xyzcorp.com/
home= http://www.xyzcorp.com/
If you need to include an equal sign in the value itself, surround the value with ASCII doublequotes. If you want the value itself to be enclosed in doublequotes, put three of them on each end (see the section on programming considerations for an explanation). Examples:
cap=View from the Empire State Building looking East
cap="A+B=C"
cap="""Caption within doublequotes"""
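The line format described above can be illustrated with a small Python sketch. It is not the Kermit code that ksitemap uses. The function name parse_line is hypothetical, and the quote handling is a simplified CSV-style reading of the rules (outer doublequotes removed, doubled doublequotes collapsed to one). The sketch splits at the first equal sign only.

```python
def parse_line(line):
    """Split one filelist line into (tag, value); return None for comments and blanks."""
    text = line.strip()
    if not text or text.startswith("#"):
        return None  # blank lines and comment lines are ignored
    tag, sep, value = text.partition("=")  # split at the first '=' only
    if not sep:
        raise ValueError(f"no '=' in line: {line!r}")
    tag, value = tag.strip(), value.strip()  # whitespace around '=' is ignored
    if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
        # CSV-style quoting (assumed): drop the outer quotes, collapse "" into ".
        value = value[1:-1].replace('""', '"')
    return tag, value
```

With this sketch the three "home" lines shown earlier all produce the same pair, and the triple-quoted caption keeps one pair of doublequotes around its text.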
The first few lines define parameters for the whole website:
Tag         Status    Value
encoding    Depends   sitemap.xml files are encoded in UTF-8. If your filelist file is encoded in some other character set (such as ISO-8859-1) for the purpose of including non-ASCII characters (such as accented letters or non-Roman letters), you must declare its encoding so ksitemap can convert the text to UTF-8. If your file-list file is ASCII, or it is already UTF-8, this item is optional. Otherwise this item is required, and it should come first, so ksitemap can convert all the lines in the file appropriately. The value is the MIME name of the character set used in the file-list file. (For a list of supported encodings, see this page.)
home        Required  The URL of the website's home directory (with no filename part)
geo         Optional  The default geographical location for images, if any
lic         Optional  The default filename, if any, for a page containing copyright or license information for the site's original images
.macroname  Optional  Definition for macro with given name
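The conversion that the encoding tag asks for can be sketched in Python. ksitemap performs it with C-Kermit's own character-set translation, so this is only an analogy. The function name filelist_to_utf8 is hypothetical, and the set of encoding names Python accepts is not necessarily the set C-Kermit supports.

```python
def filelist_to_utf8(raw_bytes, encoding="UTF-8"):
    """Decode filelist bytes from the declared encoding and re-encode as UTF-8."""
    text = raw_bytes.decode(encoding)  # e.g. "ISO-8859-1", as declared in the filelist
    return text.encode("utf-8")        # sitemap.xml files are encoded in UTF-8
```

ASCII and UTF-8 input pass through unchanged with the default, which mirrors why the tag is optional for such files.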
These items should come before any of the page-specific items that are described below. If you include a geo or lic tag before any url tag (see below), these will be used for any image for which you do not specify a geo or lic tag. In other words the ones in the top section are global and the ones in an img section are local to that image.
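The global-versus-local rule can be pictured as a dictionary merge. This Python sketch is illustrative only; the function name effective_image_attrs and the use of dictionaries are assumptions, not a description of ksitemap's internals.

```python
def effective_image_attrs(global_defaults, image_attrs):
    """Combine site-wide geo/lic defaults with the tags given for one image."""
    # Only geo and lic act as defaults for images.
    merged = {k: v for k, v in global_defaults.items() if k in ("geo", "lic")}
    # Tags in the img section are local and override the global ones.
    merged.update(image_attrs)
    return merged
```

An image that specifies only geo therefore keeps the site-wide lic, and an image that specifies neither gets both defaults.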
The "home" line's value is the URL of the website root directory, ending with a slash, for example:
home=http://kermit.columbia.edu/
This is used to form the full URLs of the files and images in the website. Example:
home=http://kermit.columbia.edu/
lic=copyright.html
This results in the URL of the license file being:
http://kermit.columbia.edu/copyright.html
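As a minimal Python sketch of how such a full URL is formed, assuming plain concatenation of the home value and the relative name: the function name full_url is hypothetical, and raising an error when the trailing slash is missing is a choice made for the sketch, not documented ksitemap behavior.

```python
def full_url(home, name):
    """Form a full URL from the 'home' value and a name relative to the site root."""
    if not home.endswith("/"):
        # The text says the home value ends with a slash; the sketch insists on it.
        raise ValueError("home must end with a slash")
    return home + name
```

The same rule applies to files in subdirectories, for example cudocs/ilosetup.html.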
Macros allow you to use variables in value strings. For example, given:
.year=2010
then any occurrence of \m(year) in a value string is replaced by 2010.
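Macro definition and substitution can be sketched in Python as follows. This is an illustration, not ksitemap's Kermit code. The names define_macro and expand_macros are hypothetical, and leaving a reference to an undefined macro untouched is an assumption.

```python
import re

MACRO_REF = re.compile(r"\\m\(([^()]+)\)")  # matches \m(name)


def define_macro(line, macros):
    """Record a definition line such as '.year=2010' in the macros dictionary."""
    name, _, value = line[1:].partition("=")  # drop the leading '.'
    macros[name.strip()] = value.strip()


def expand_macros(value, macros):
    """Replace each \\m(name) in value with the macro's definition."""
    def substitute(match):
        name = match.group(1)
        # Undefined macro: leave the reference as written (assumption).
        return macros.get(name, match.group(0))
    return MACRO_REF.sub(substitute, value)
```

Because only the \m(name) pattern is matched, backslashes that appear in a value for other reasons are left alone.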
The remainder of the file list contains lines for each file and image you want to include in your sitemap. For each page, the lines should appear in the following order:
Tag  Status    Value
url  Required  Name of an html file relative to the website's root directory.
pri  Optional  Priority of the page, 0.0 to 1.0
For each URL, the page date is supplied automatically based on the modification date of the file, and the change frequency (daily, weekly, monthly, yearly) is supplied based on when the file was last modified.
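One way to derive both values from a file's modification time is sketched below in Python. The age thresholds (1, 7, and 31 days) are purely hypothetical, since the cutoffs ksitemap uses are not given here, and the use of UTC for the date is likewise an assumption. The function name lastmod_and_changefreq is made up for the example.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86400


def lastmod_and_changefreq(mtime, now):
    """Return (YYYY-MM-DD date, change frequency) for POSIX timestamps mtime and now."""
    age_days = (now - mtime) / SECONDS_PER_DAY
    # Hypothetical thresholds: more recently modified files get a higher frequency.
    if age_days <= 1:
        freq = "daily"
    elif age_days <= 7:
        freq = "weekly"
    elif age_days <= 31:
        freq = "monthly"
    else:
        freq = "yearly"
    lastmod = datetime.fromtimestamp(mtime, tz=timezone.utc).strftime("%Y-%m-%d")
    return lastmod, freq
```

In practice mtime would come from something like os.path.getmtime() on the page's file and now from time.time().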
For redirects, a URL entry can have two values; for example:
url=index.html=index-en.html
This means that the first filename is an HTTP Redirect to the second filename; that is, the first name is a pointer to a file having the second name. For example, suppose you have a website with calendars for different years: cal-2009.html, cal-2010.html, cal-2011.html, etc, and the calendar for the current year should always be available as simply cal.html. In that case your .htaccess file redirects the name cal.html to (say) cal-2011.html because you want the cal.html name to be indexed by Web crawlers even though no file exists with that name in your site. This way, each year you only have to change your .htaccess and you don't have to wait for the web crawlers to index a file that didn't exist before:
url=cal.html=cal-2011.html
If you have a lot of files using this naming convention, you can use a macro so the variable string can be defined (and changed) in just one place instead of lots of places:
.year=2011
url=cal.html=cal-\m(year).html
url=jan.html=jan-\m(year).html
url=feb.html=feb-\m(year).html
etc...
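A two-valued url entry can be taken apart as in the Python sketch below. The function name parse_url_value is hypothetical. The interpretation that the first name is the one published in the sitemap while the second is the file that actually exists on disk is inferred from the description above rather than confirmed from the script. Macro references are assumed to have been expanded already.

```python
def parse_url_value(value):
    """Split a url value into (published_name, actual_file)."""
    name, sep, target = value.partition("=")
    name = name.strip()
    # Without a second value, the published name and the file are the same.
    target = target.strip() if sep else name
    return name, target
```

For url=cal.html=cal-2011.html this gives cal.html as the name to index and cal-2011.html as the file behind it.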
If there are images on the page that you want to include in the sitemap:
Tag    Status    Value
img    Required  Name of an image file in the root directory or in a subdirectory.
cap    Optional  A text caption for the image
title  Optional  A text title for the image
geo    Optional  The geographical location of this image only
lic    Optional  The URL of a license page for this image only
Here's a brief example that has three files. For the first file (index.html), a priority is specified; for the others, the default priority is accepted. The second file is in a subdirectory. The third file has images. Comments, blank lines, and indentation are used for clarity, but they do not affect the result. Note that there may be, but need not be, whitespace around the equal sign.
# ksitemap filelist for building sitemap.xml

encoding = ISO-8859-1
home=http://kermit.columbia.edu/
geo=New York City USA
lic=copyright.html

url=index.html
  pri=1.0

url=cudocs/ilosetup.html

url=cable.html
  img=connectors-340.jpg
    cap=Male and Female RS-232 Connectors
    title=Serial Data Connectors
  img=modemcable.jpg
    cap=Modem Cable Schematic
    geo=Bedford MA
  img=nullmodem-480.jpg
    cap=Null Modem Cable Schematic
    lic=special.html
    geo=Batey Caño - Yamasá
The resulting sitemap.xml looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>http://kermit.columbia.edu/</loc>
    <lastmod>2010-12-07</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>http://kermit.columbia.edu/cudocs/ilosetup.html</loc>
    <lastmod>2010-12-07</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.5</priority>
  </url>
  <url>
    <loc>http://kermit.columbia.edu/cable.html</loc>
    <lastmod>2010-12-07</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.5</priority>
    <image:image>
      <image:loc>http://kermit.columbia.edu/connectors-340.jpg</image:loc>
      <image:caption>Male and Female RS-232 Connectors</image:caption>
      <image:title>Serial Data Connectors</image:title>
      <image:geo_location>New York City USA</image:geo_location>
      <image:license>http://kermit.columbia.edu/copyright.html</image:license>
    </image:image>
    <image:image>
      <image:loc>http://kermit.columbia.edu/modemcable.jpg</image:loc>
      <image:caption>Modem Cable Schematic</image:caption>
      <image:geo_location>Bedford MA</image:geo_location>
      <image:license>http://kermit.columbia.edu/copyright.html</image:license>
    </image:image>
    <image:image>
      <image:loc>http://kermit.columbia.edu/nullmodem-480.jpg</image:loc>
      <image:caption>Null Modem Cable Schematic</image:caption>
      <image:geo_location>Batey Caño - Yamasá</image:geo_location>
      <image:license>http://kermit.columbia.edu/special.html</image:license>
    </image:image>
  </url>
</urlset>
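To show how one image entry maps onto the XML above, here is a hedged Python sketch that renders a single image:image element. The function name image_element and the dictionary of attributes are assumptions. The XML escaping step is added because XML requires it; whether ksitemap escapes special characters is not stated here. The element order follows the sample output.

```python
from xml.sax.saxutils import escape


def image_element(home, img, attrs, indent="    "):
    """Render one <image:image> element from an image name and its tags."""
    inner = indent + "  "
    lines = [f"{indent}<image:image>",
             f"{inner}<image:loc>{escape(home + img)}</image:loc>"]
    if "cap" in attrs:
        lines.append(f"{inner}<image:caption>{escape(attrs['cap'])}</image:caption>")
    if "title" in attrs:
        lines.append(f"{inner}<image:title>{escape(attrs['title'])}</image:title>")
    if "geo" in attrs:
        lines.append(f"{inner}<image:geo_location>{escape(attrs['geo'])}</image:geo_location>")
    if "lic" in attrs:
        # The license value is a filename relative to the home URL.
        lines.append(f"{inner}<image:license>{escape(home + attrs['lic'])}</image:license>")
    lines.append(f"{indent}</image:image>")
    return "\n".join(lines)
```

The attrs dictionary is expected to already contain any global geo or lic defaults that apply to the image.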
The following \fsplit() invocation splits a filelist line into two pieces, the tag and the value:
.\%9 := \fsplit(\m(line),&x,=,CSV)   # Split line on '='
Another observation about \fsplit() is worth making. Its result goes into an array, and array elements in the Kermit language, just like \%a variables, are evaluated recursively. The array elements contain the literal pieces of the original string, but when you refer to an array element whose value contains any backslashes, the string is evaluated recursively, "all the way down". This is why the array element values are referenced through \fcontents(), which forces a simple "one-level-deep" evaluation.
A more serious problem was noted when adding the macro capability to ksitemap, namely that \fsplit() itself was stripping out backslash characters. This is appropriate behavior for some of its other uses (e.g. parsing S-Expressions), but is not appropriate for parsing external data, such as data lines read from files. This explains the "Quoting Hell" trick just before the \fsplit() invocation. This will be unnecessary (and, in fact, harmful) in the next build of C-Kermit after 9.0.299 Alpha.09, where in CSV and TSV invocations of \fsplit(), backslashes will be treated just as any other character.
Finally it should be noted that ksitemap takes pains to expand macros only after verifying that a line contains “\m(xxx)” (where xxx would be the name of the macro). It could very easily have simply evaluated each line without all the testing and checking, but then files that contained backslashes for other reasons would be wrecked.