Three pages of scanned text files were found on the 1.4" tape. This comparison covers two of those pages.
The statistical breakdown
|Pages|Total words|Common errors|Singular errors|Total errors|Error %|
|Old Page 1|278|16|4|20|7%|
|New Page 1|278|16|31|47|17%|
|Old Page 2|350|23|8|31|9%|
|New Page 2|350|23|35|58|17%|
The old pages refer to the archived text that was found on the tapes. The new pages refer to the text pages recently scanned using ScanWorx. The word count is the number of independent alphanumeric strings, separated by spaces, that existed in the original page. Common errors are errors that occurred at the same location in both versions of the scanned text. The two versions sometimes contain the same error, but not necessarily. For example, on page 1, the string ``ETHEL'' scanned as ``T?TJ::~f,'' in both versions. In contrast, the string ``ROSENBERG'' on page 2 scanned as ``R0S~iJBERG'' in the original scan, but as ``O5~iJB~~G'' in the new version. Both are incorrect, in different ways, and they are counted as an error common to both versions. Thus, for page 1, there were sixteen errors common to both versions of the OCRed output; for page 2, there were twenty-three.
Errors that were found in only one version are called singular errors. For example, the word ``"political'' on page one was incorrectly scanned as ``"po1itical'' in the original scan but correctly scanned in the new version. In contrast, the word ``that'' on page one was scanned correctly in the old version, but emerged as ``thpt'' in the new scan. The number of singular errors in the new scan is much higher than in the old scan. There were only 4 singular errors in the original scan of page one, but 31 in the new scan. For page two, there were 8 singular errors in the original scan, but 35 in the new version.
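The distinction between common and singular errors can be illustrated with a minimal Python sketch. It assumes that the reference text and both OCR outputs have already been split on spaces and aligned word for word, which is a simplification; the function name `classify_errors` and the tiny sample lists are hypothetical and only reuse the example strings quoted above (drawn from both pages for illustration).

```python
def classify_errors(truth, old, new):
    """Count common and singular errors for word-aligned OCR outputs.

    Assumes the three lists have equal length and that position i in each
    list refers to the same word on the page (a simplifying assumption).
    """
    common = singular_old = singular_new = 0
    for t, o, n in zip(truth, old, new):
        old_wrong = o != t
        new_wrong = n != t
        if old_wrong and new_wrong:
            # Wrong in both versions, even if the wrong strings differ.
            common += 1
        elif old_wrong:
            singular_old += 1
        elif new_wrong:
            singular_new += 1
    return common, singular_old, singular_new


# Illustrative sample built from the strings quoted in the text.
truth = ["ETHEL", "ROSENBERG", '"political', "that"]
old = ["T?TJ::~f,", "R0S~iJBERG", '"po1itical', "that"]
new = ["T?TJ::~f,", "O5~iJB~~G", '"political', "thpt"]

common, singular_old, singular_new = classify_errors(truth, old, new)
print(common, singular_old, singular_new)
```

In this sample, ``ETHEL'' and ``ROSENBERG'' count as common errors, ``"political'' as a singular error of the old scan, and ``that'' as a singular error of the new scan.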
The total error is the sum of the common and singular errors, and the error percentage is the total error divided by the word count for that page. Thus, for page one, the error percentage is only 7% in the old version, but rises to 17% with the new scanning; for page two, it is 9% in the old version and also 17% in the new version.
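As a check on the arithmetic, the totals and percentages in the table can be recomputed from the word counts and error counts. The sketch below assumes the percentages were rounded to the nearest whole number; the names `rows` and `error_summary` are illustrative only.

```python
# (total words, common errors, singular errors), copied from the table.
rows = {
    "Old Page 1": (278, 16, 4),
    "New Page 1": (278, 16, 31),
    "Old Page 2": (350, 23, 8),
    "New Page 2": (350, 23, 35),
}


def error_summary(total_words, common, singular):
    """Return (total errors, error percentage rounded to a whole number)."""
    total = common + singular
    percent = round(100 * total / total_words)
    return total, percent


for page, (words, common, singular) in rows.items():
    total, percent = error_summary(words, common, singular)
    print(f"{page}: {total} errors, {percent}%")
```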
We do not know what system was used to scan the originals, nor do we know whether any of the files were hand-corrected or hand-produced. Thus, we cannot make any inferences about the quality of OCRing this data based on the original scanned files. We can, however, conclude that for these files we will probably be able to scan at about 80% correct with the current set-up, since both new scans had an error rate of 17%.