About my Chaucer concordance


My Chaucer concordance (or Chaucer Concordance, if you want a name for it) began a year ago as a little tool for some work I was doing on Troilus and Criseyde. I put it online and thought nothing more of it until I noticed a few months ago that it was the first result returned by a Google search for "chaucer concordance". So as not to disappoint visitors, I have since added most of Chaucer's extant works and made major performance enhancements.

The texts were cleaned, normalized to an extent (see below), and indexed with Python. The indices are stored on this server as JSON documents, which are loaded asynchronously by the browser as they are needed. The frontend was built in JavaScript and uses jQuery to handle user interactions and AJAX requests.
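
As a rough illustration of the indexing step, the sketch below builds a small inverted index in Python and serializes it as JSON. The function name build_index, the section key "troilus-1", and the index layout (each word mapped to a list of [section, line number] pairs) are assumptions made for this example, not the tool's actual format.

```python
import json
import re
from collections import defaultdict

# Hypothetical tokenizer: runs of ASCII letters, which suits text whose
# diacritics have already been stripped.
WORD_RE = re.compile(r"[a-z]+")


def build_index(sections):
    """Map each lowercased word to the [section, line number] pairs where it occurs.

    `sections` maps a section name to its list of lines; line numbers
    count from 1 at the start of each section.
    """
    index = defaultdict(list)
    for name, lines in sections.items():
        for number, line in enumerate(lines, start=1):
            for word in WORD_RE.findall(line.lower()):
                index[word].append([name, number])
    return dict(index)


# Sample data: the opening lines of Troilus and Criseyde.
sample = {
    "troilus-1": [
        "The double sorwe of Troilus to tellen,",
        "That was the king Priamus sone of Troye,",
    ],
}
index = build_index(sample)
index_json = json.dumps(index, sort_keys=True)  # the string that would be saved as a JSON document
```

A browser could then fetch a document like index_json and parse it into an ordinary object keyed by word.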

For performance, I have designed the tool to run without server-side logic; once the indices have been loaded, the tool can essentially run offline. This makes the tool perforce open-source, but it is sparsely documented and somewhat haphazardly organized, so its source code may be of limited use.

My chief hope for the tool, besides that it be of some use to students of Chaucer, is that it serve as a demonstration of the text-processing tasks that can now be accomplished by the browser. A full index of a modest corpus can be quickly transferred to a client's browser, which can then search the corpus much more quickly than it could submit queries to a server and wait for its response. The indices and texts used by this tool come in at less than five megabytes; a far larger corpus could be handled in this way without incurring any more network strain than streaming a video or downloading a photo album. Only the largest corpora really demand a central server to run search queries; a corpus containing even quite a few authors or works can be conveniently searched and analyzed client-side.
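
To illustrate why a locally held index makes searching fast, here is a minimal sketch of a lookup against an index that has already been loaded. It is written in Python rather than the tool's JavaScript purely for illustration; the index layout, the concordance function, and the sample data are all assumptions.

```python
import json

# Assumed payload: word -> list of [section, 1-based line number] pairs,
# as it might look after being fetched and parsed.
index = json.loads('{"sorwe": [["troilus-1", 1]], "troye": [["troilus-1", 2]]}')

# Assumed section texts, keyed by the same section names as the index.
texts = {
    "troilus-1": [
        "The double sorwe of Troilus to tellen,",
        "That was the king Priamus sone of Troye,",
    ],
}


def concordance(word, index, texts):
    """Return (section, line number, line text) for every occurrence of `word`."""
    hits = index.get(word.lower(), [])  # one dictionary lookup, no network round trip
    return [(section, n, texts[section][n - 1]) for section, n in hits]


results = concordance("Sorwe", index, texts)
```

Once the index is in memory, each query costs a single hash lookup plus one list access per hit, which is why no server is needed after the initial transfer.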

That said, this particular concordance demands that I note several caveats:

  1. For copyright reasons, the text is drawn from the 1900 edition of Walter W. Skeat, as transcribed by Project Gutenberg. I have mostly respected Skeat's divisions and lexical choices, though I have eliminated all diacritical marks (e.g., ë has become e).
  2. For technical reasons, line numbers are naïve: they count from the beginning of the section as I have divided it, and therefore take no account of the section's place in any broader work or its own subdivisions.
  3. I have not made any effort to lemmatize various spellings or inflections. All words are counted (with normalized capitalization) just as they appear in the various editions that I have brought together.
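
The normalization and naïve line numbering described in these caveats can be sketched roughly as follows. The function names are hypothetical, and the diaeresis in the sample line was added for the sake of the example; the approach shown (Unicode decomposition followed by dropping combining marks) is one common way to strip diacritics, not necessarily the one used here.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks, so that e.g. 'ë' becomes 'e'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_section(lines):
    """Yield (naive line number, normalized line) pairs for one section.

    Numbering restarts at 1 for every section and ignores the section's
    place in any larger work. Spellings are left unlemmatized.
    """
    for number, line in enumerate(lines, start=1):
        yield number, strip_diacritics(line).lower()


# Sample section; the 'ë' is illustrative only.
section = [
    "Whan that Aprille with his shoures sote",
    "The droghte of Marche hath percëd to the rote,",
]
normalized = list(normalize_section(section))
```

Because nothing is lemmatized, variant spellings such as "sorwe" and "sorowe" would remain separate entries under this scheme.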

If you have any comments or questions, feel free to email me.

You might also like my search interface for the University of Michigan's online Middle English Dictionary.

© 2017 Henry Litwhiler