Sample OCR

Error Report on ScanWorx

The following report compares an extremely small sample of OCRed text found on the archived tapes with OCRed text of the same pages created recently for comparison.

There were three pages of scanned text files found on the 1.4" tape. This comparison was done for two of the pages.

Click here to view the first page scanned.

Click here to view the second page scanned.

Click here to view the original OCRed text.

Click here to view the new OCRed text.

The statistical breakdown
Pages	Total words	Common errors	Singular errors	Total errors	Error %
Old Page 1	278	16	4	20	7%
New Page 1	278	16	31	47	17%
Old Page 2	350	23	8	31	9%
New Page 2	350	23	35	58	17%

Explanation of Statistics

The old pages refer to the archived text that was found on the tapes. The new pages refer to the text of the same pages recently scanned using ScanWorx. The word count is the number of independent alphanumeric strings, separated by spaces, that exist in the original page.

Common errors are errors that occur at the same location in both versions of the scanned text. The two versions sometimes contain the same error, but not necessarily. For example, on page 1, the string ``ETHEL'' scanned as ``T?TJ::~f,'' at the same location in both versions. In contrast, the string ``ROSENBERG'' on page 2 scanned as ``R0S~iJBERG'' in the original scan, but as ``O5~iJB~~G'' in the new version. Both are incorrect, in different ways, and they are counted as an error common to both versions. Thus, for page 1, there were sixteen errors common to both versions of the OCRed output; for page 2, there were twenty-three errors common to both versions.

Errors that were found in only one version are called singular errors. These errors occurred only in that particular scan. For example, the word ``"political'' on page one was incorrectly scanned as ``"po1itical'' in the original scan but correctly scanned in the new version. In contrast, the word ``that'' on page one was scanned correctly in the old version, but emerged as ``thpt'' in the new scan. The number of singular errors in the new scan is much higher than in the old scan. There were only 4 singular errors in the original scan of page one, but 31 singular errors in the new scan. For page two, there were 8 singular errors in the original scan, but 35 in the new version.
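The distinction between common and singular errors can be sketched in code. This is only an illustration, not the procedure used for this report: it assumes the reference text and both scans have already been split into words and aligned position by position, and the function name `classify_errors` is hypothetical. The sample words are the examples quoted above, mixed from both pages.

```python
def classify_errors(reference, old_scan, new_scan):
    """Count common and singular errors for position-aligned word lists.

    Assumption: the three lists have the same length and the word at
    index i in each scan corresponds to the word at index i in the reference.
    """
    if not (len(reference) == len(old_scan) == len(new_scan)):
        raise ValueError("word lists must be aligned and equal in length")

    counts = {"common": 0, "old_singular": 0, "new_singular": 0}
    for ref, old, new in zip(reference, old_scan, new_scan):
        old_wrong = old != ref
        new_wrong = new != ref
        if old_wrong and new_wrong:
            # Wrong in both scans, even if wrong in different ways.
            counts["common"] += 1
        elif old_wrong:
            counts["old_singular"] += 1
        elif new_wrong:
            counts["new_singular"] += 1
    return counts


# Illustrative words taken from the examples in the text.
reference = ["ETHEL", "ROSENBERG", "political", "that"]
old_scan = ["T?TJ::~f,", "R0S~iJBERG", "po1itical", "that"]
new_scan = ["T?TJ::~f,", "O5~iJB~~G", "political", "thpt"]

print(classify_errors(reference, old_scan, new_scan))
```

Real OCR output often drops, merges, or splits words, so a practical comparison would need an alignment step before a position-by-position check like this one.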

The total error count is the sum of the common and singular errors, and the error percentage is the total error count divided by the total number of words on that page. Thus, for page one, the error percentage is only 7% for the old scan but increases to 17% for the new scan; for page two, the error percentage is 9% for the old scan and likewise 17% for the new scan.
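As a check on the arithmetic, the totals and percentages can be recomputed from the word and error counts in the table. This is a minimal sketch; the function name `error_summary` is hypothetical, and rounding to the nearest whole percent is an assumption that matches the figures reported.

```python
def error_summary(total_words, common, singular):
    """Return (total errors, error percentage rounded to a whole percent)."""
    total_errors = common + singular
    percent = round(100 * total_errors / total_words)
    return total_errors, percent


# (total words, common errors, singular errors), copied from the table.
pages = {
    "Old Page 1": (278, 16, 4),
    "New Page 1": (278, 16, 31),
    "Old Page 2": (350, 23, 8),
    "New Page 2": (350, 23, 35),
}

for name, (words, common, singular) in pages.items():
    total, percent = error_summary(words, common, singular)
    print(f"{name}: {total} errors, {percent}%")
```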

We do not know what system was used to scan the originals, nor do we know whether any of the files were hand-corrected or hand-produced. Thus, we cannot make any inferences about the quality of OCRing this data based on the original scanned files. We can, however, conclude that for these files we will probably be able to scan at about 80% correct, given the current set-up.

Click here to view the index of tapes that were found.

Comments to: kjc9 last updated: 6/17/96