
The Fallows of Medium Data

July 3, 2022


Who will curate less-prominent datasets?

Presidential Biography src

Samuel Fallows was a bishop in the Reformed Episcopal Church. He was born in 1835 and headed the denomination for four stints between 1877 and his death in 1922. Among numerous popular works, he compiled his own Complete Dictionary of Synonyms and Antonyms. Unlike the more-famous Roget’s Thesaurus, it is freely downloadable—but there are catches.

Today we discuss travails and lessons from my effort to use this book as data in my algorithms-and-data-structures course this past term.

The word travail appears in a form that well captures all the senses I went through:


KEY: Travail.
SYN: Labor, toil, heaviness, affliction.
ANT: Ease, rest, lightness, joy.


This is how the entry appears in Project Gutenberg’s marked-up text, which was digitized from a source like US Archive’s PDF image. The digitizer acknowledges OCR errors in his preface there. A massive unacknowledged error, however, is the main prompt for this post. [See Update at end.]

Before presenting the big error and its lessons for curating this kind of data, let me say more of note about the dictionary and Fallows’s other works.

Fallows as Author

The dictionary went to several editions and is still on sale at Amazon and Barnes and Noble. The downloads and optical PDF are of the 1898 third edition, whose one-page preface caught my eye for the following passage:

For the solution of Cross Word Puzzles special attention is called also to the lists of Americanisms and Briticisms and the immensely valuable table of Homonyms (words spelt alike but differing in use)—original features of easily recognized importance.

I grew up with the New York-centric story of crossword puzzles developing popularity in the 1920s. The history on Wikipedia dates the term to 1862 but indicates scant attention to them until 1913. It notes that the 1913 puzzle by Arthur Wynne in the New York World newspaper “is frequently cited as the first crossword puzzle.” Later Wikipedia says, “By the 1920s, the crossword phenomenon was starting to attract notice.” But hold on: evidently crosswords had attracted enough notice by 1898 for Fallows to invoke them as a prime selling point of his compendium.

Here are some other secular books by Fallows that show his high level of popular engagement:

He also contributed introductions to numerous books, including “Lest We Forget”: Chicago’s Awful Theater Horror (1903) and a 1919 treatise most simply titled Eugenics.

It is particularly interesting to read the San Francisco book, which Project Gutenberg’s transcriber speculates “was published very hurriedly following the earthquake.” Here is a passage from chapter 9, “Through Lanes of Misery”:

At Salinas, about dark, the conductor came back, shaking his head; a freight train ahead at Pajaro had been completely buried by a mountain of earth hurled in the quake.

The men said it was likely to be a week before any train went through.

Three or four of us hurried into the town looking for an automobile. One of the passengers on the train was Mrs. Robert Louis Stevenson, and the news had been kept from her until this delay.

A few lines further down: “One giant maniac had broken his shackles and rescued one of the guards from the building. He had just one sane moment; long enough to be a hero. Then he fled howling into the hills.”

A Data Structures Dont’t

This section’s title has no typo: another of Fallows’s books is titled Discriminate: A Companion to “Dont’t” (1885, 1891). The excerpt chosen by forgottenbooks.com could serve in a modern software company’s mission flyer:

Discriminate between ability and capacity. Capacity is the power of receiving and retaining knowledge with ease. Ability is the power of applying knowledge to practical purposes. Capacity implies power to conceive, ability the power to execute designs. Capacity is shown in quickness of apprehension; ability in something actually done.

One application I conceived for my Data Structures course was to rewrite long words in selected texts by shorter synonyms, perhaps to humorous effect, using Fallows’s dictionary. This exemplifies lists and arrays and sets and maps of various sizes. The main map to build was from the key word to the associated list of synonyms. One could alternatively use a set of objects having key and synonyms fields, which I presented as more flexible in allowing other ways to define keys. Some words in Fallows’s dictionary have separate entries for the part of speech, for instance (antonyms and some words snipped):


KEY: Array \v.\.
SYN: Vest, deck, equip, decorate, rank, adorn, dress, accoutre, …
=
KEY: Array \n.\, Arrangement, order, disposition, sight, …, parade.


Rather than juggle different kinds of maps that do or do not use the noun/verb/adjective information, I said it was better to keep the data all together. This fell in with preaching about the classic data structures pitfall of “Parallel Arrays” and the ensuing off-by-one indexing errors. I designed and gave a separate assignment where a sorted set with an iterator gave 3-4x better performance than repeated lookup from a map.
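As a minimal sketch of keeping the data together, the Python below parses entries in the KEY/SYN/ANT layout shown above into one record per entry and then groups records by key. The names `Entry`, `parse_entries`, and `by_key` are hypothetical, not taken from the course materials. The sketch assumes that entries are separated by `=` lines and that any words following the key on a KEY line are synonyms. The sample word lists are abridged as in the excerpts.

```python
import re
from collections import defaultdict
from dataclasses import dataclass, field

SEE_RE = re.compile(r"\[See ([^\]]+)\]")  # cross-reference such as [See BAWL]
POS_RE = re.compile(r"\\([a-z]+\.)\\")    # part-of-speech tag such as \v.\


@dataclass
class Entry:
    """One dictionary entry; all fields travel together (no parallel arrays)."""
    key: str
    pos: str = ""
    syn: list = field(default_factory=list)
    ant: list = field(default_factory=list)
    see: str = ""


def split_words(text):
    """Split a comma-separated field into lowercase words, dropping stray periods."""
    words = (w.strip(" .").lower() for w in text.split(","))
    return [w for w in words if w]


def parse_entries(text):
    entries, current = [], None
    for raw in text.splitlines():
        tag, sep, rest = raw.strip().partition(":")
        if not sep or tag not in ("KEY", "SYN", "ANT"):
            continue  # "=" separators, blank lines, anything unrecognized
        see = SEE_RE.search(rest)
        rest = SEE_RE.sub("", rest)
        if tag == "KEY":
            pos = POS_RE.search(rest)
            words = split_words(POS_RE.sub("", rest))
            if not words:
                current = None
                continue
            current = Entry(key=words[0], pos=pos.group(1) if pos else "")
            current.syn.extend(words[1:])  # assumption: extra words are synonyms
            entries.append(current)
        elif current is not None:
            getattr(current, tag.lower()).extend(split_words(rest))
        if see and current is not None:
            current.see = see.group(1).strip().lower()
    return entries


# Sample entries taken from the excerpts in this post (word lists abridged).
SAMPLE = r"""
KEY: Travail.
SYN: Labor, toil, heaviness, affliction.
ANT: Ease, rest, lightness, joy.
=
KEY: Array \v.\.
SYN: Vest, deck, equip, decorate, rank, adorn, dress, accoutre.
=
KEY: Array \n.\, Arrangement, order, disposition, sight, parade.
=
KEY: Bellow, [See BAWL].
"""

entries = parse_entries(SAMPLE)
by_key = defaultdict(list)  # one key may have several part-of-speech entries
for e in entries:
    by_key[e.key].append(e)
```

Because each record carries its own part of speech and cross-reference, a caller can ignore or use that information without maintaining a second structure indexed in step with the first.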

But I never gave the originally-conceived assignment. Among several reasons, I was dumbstruck by a “Parallel Arrays” fault in the Project Gutenberg text file.

The Horror

The dictionary also has cross-reference entries typified by


KEY: Bellow, [See BAWL].


My intent in such cases was to have the students’ code look up the synonyms of the referenced word, here KEY: Bawl. SYN: Shout, vociferate, halloo, roar, bellow. And—in case the dictionary had just KEY: Bawl, [See BELLOW]—beware of going into an infinite loop.
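One way to follow cross-references without looping forever is to remember which words have already been visited. The sketch below assumes two hypothetical dictionaries, `syn` (key to synonym list) and `see` (key to cross-referenced key), both lowercased. It is an illustration of the idea rather than the assignment's reference solution.

```python
def resolve_synonyms(word, syn, see):
    """Follow [See X] links from `word` until an entry with synonyms is found.

    Returns an empty list if the chain dead-ends or loops back on itself.
    """
    seen = set()
    current = word
    while current not in seen:
        seen.add(current)
        if syn.get(current):
            return syn[current]
        if current not in see:
            return []          # no synonyms and no further reference
        current = see[current]
    return []                  # revisited a word: the references form a cycle


# The Bellow/Bawl example from the text.
syn = {"bawl": ["shout", "vociferate", "halloo", "roar", "bellow"]}
see = {"bellow": "bawl"}
found = resolve_synonyms("bellow", syn, see)

# The feared case: two entries that only point at each other.
looped = resolve_synonyms("bawl", {}, {"bawl": "bellow", "bellow": "bawl"})
```

The `seen` set bounds the walk by the number of distinct keys, so even a long cycle of references terminates.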

I did assign the task of detecting when two words appear on the synonym lists of each other. I intended to extend it to cross-references, so that bawl and bellow would count as a “reciprocal pair.” But before I got there, I noticed instances like the following—especially toward the end of the file:
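A sketch of the reciprocal-pair check, extended so that a cross-reference counts as a link from the referring word to its target. The function name `reciprocal_pairs` and the `syn`/`see` dictionaries are hypothetical. Treating `[See X]` as a one-word synonym list is an assumption about how the extension might have been defined.

```python
def reciprocal_pairs(syn, see=None):
    """Return sorted pairs (a, b), a < b, where each word links to the other.

    A word links to every word on its synonym list and, if `see` is given,
    to the target of its cross-reference.
    """
    links = {key: set(words) for key, words in syn.items()}
    for key, target in (see or {}).items():
        links.setdefault(key, set()).add(target)

    pairs = set()
    for a, targets in links.items():
        for b in targets:
            # a < b reports each pair once; the mirror case is found from b's side
            if a < b and a in links.get(b, ()):
                pairs.add((a, b))
    return sorted(pairs)


syn = {
    "bawl": ["shout", "vociferate", "halloo", "roar", "bellow"],
    "travail": ["labor", "toil", "heaviness", "affliction"],
}
see = {"bellow": "bawl"}

without_refs = reciprocal_pairs(syn)
with_refs = reciprocal_pairs(syn, see)
```

With synonym lists alone the sample yields no pairs; once the cross-reference from bellow to bawl is counted, bawl and bellow become a reciprocal pair.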


KEY: Unruffled, [See DISCOVER].
=
KEY: Unruly.
SYN: Ungovernable, unmanageable, refractory, [See TRANQUIL].
=
KEY: Unsafe, [See REFRACTORY].
=
KEY: Unseasonable, [See SAFE].


This off-by-one error extends above and below. There are some islands of correctness, but abutted by bizarreness:


KEY: Unhandy.
SYN: Awkward, clumsy, uncouth, [See AWKWARD].
=
KEY: unhappiness.
SYN: Misery, wretchedness, distress, woe, [See AWKWARD].
=
KEY: Unhappy.
SYN: Miserable, wretched, distressed, …, dismal, [See BUSS].
=
KEY: Unhealthy, [See BEHALF].


The BUSS is an OCR error for BLISS, meant to go with Unhappiness, and BEHALF is an OCR error for HEALTH—which is correctly aligned again. In other places the misalignments seem weirder and greater. But none of them is in any printed source. I ask:

How could this kind of error happen?

Evidently the transcriber or some helper fell afoul of Parallel Arrays. One possibility is hinted by the Project Gutenberg site having CSV files that use separate columns for KEY, SYN, and ANT, and have notes interspersed with data toward the top. Inserting a note in one column would throw off alignment below it. But I have not found these errors in those files.
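To make the suspected mechanism concrete, here is a small sketch of how one stray cell in one of two parallel columns shifts every pairing below it. The "intended" pairing is an assumption inferred from the shifted Unruffled/Unruly/Unsafe excerpt above, and the `(note)` cell is hypothetical.

```python
def pair_columns(keys, refs):
    """Pair two parallel columns by position, as a column-oriented file would."""
    return list(zip(keys, refs))


keys = ["unruffled", "unruly", "unsafe"]
refs = ["TRANQUIL", "REFRACTORY", "SAFE"]   # assumed intended targets

aligned = pair_columns(keys, refs)

# A hypothetical note inserted into the reference column only.
shifted_refs = ["(note)"] + refs
shifted = pair_columns(keys, shifted_refs)
# unruly now points at TRANQUIL and unsafe at REFRACTORY, as in the excerpt.

# The same insertion with one record per entry leaves every pairing intact.
records = [{"key": k, "see": r} for k, r in zip(keys, refs)]
records.insert(0, {"note": "(note)"})
record_pairs = [(r["key"], r["see"]) for r in records if "key" in r]
```

Note that `zip` silently drops the last reference when the columns differ in length, so nothing signals the misalignment.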

The Effort to Curate Data

I took the time to fix all the OCR errors in the KEY: fields in my update posted on my course webpage: Fallows1898fx.txt. I started to fix cross-references, but gave up when I noticed sporadic instances earlier than S in the file and their conjunction with OCR errors. Some of the latter are harder to explain. The Gutenberg file has


KEY: Catalogue \n.\, [See BAWL].


The PDF/printed source has [See RECORD]. The ‘R’ could produce the ‘B’, but the unlikelihood of getting AWL from ECORD makes me suspect a different error. Perhaps the earlier [See BAWL] from the entry for Bellow got copied here. Copies like the above AWKWARD example occur elsewhere with more intervening space. A third possibility is that BAWL could go with KEY: Caterwaul, but Fallows does not have that word.
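Some of these faults can be flagged mechanically, though not all. The sketch below lists cross-references whose target is not itself a key, which would catch a garbled target such as BUSS if no such key exists. It cannot catch a case like Catalogue, where the wrong target (BAWL) is a genuine entry. The function name and the toy key set are hypothetical; whether BUSS or BEHALF occurs as a key in the real file is not checked here.

```python
def dangling_refs(keys, see):
    """Return sorted (key, target) pairs whose target is not a known key."""
    known = set(keys)
    return sorted((k, t) for k, t in see.items() if t not in known)


# Toy data built from excerpts in this post; the key set is deliberately tiny.
keys = ["bawl", "bellow", "catalogue", "unhappy", "unhealthy"]
see = {
    "bellow": "bawl",
    "catalogue": "bawl",     # wrong per the printed source, but not dangling
    "unhappy": "buss",
    "unhealthy": "behalf",
}

suspects = dangling_refs(keys, see)
```

A check like this narrows the list of entries to compare against the scanned pages, but the misaligned references with valid targets still need a human eye.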

The off-by-one error was avoided in US Archive’s own full text, but it has other issues. The markup is jumbled. The true text format is recoverable in many places but not easily in others. OCR-type typos are completely unmarked.

Some errors are by Fallows himself. For example, he forgot to insert “SYN.” into his own entry for the noun form of Array given above. Should these be corrected? For my purpose of wanting a clean dataset, I would wish so. Never mind that I could add lots of entries—his dictionary was far from “complete” even in 1898. The understanding is that we operate with these historical artifacts as they are, perhaps after fixing things the authors clearly intended.

Data cleansing has become an area of computer and data science unto itself. My point is not to explore its theory or use-cases. I could devote a whole series of posts to issues with my chess data and numerous irregularities in chess game files sent to me that I have to fix. My point—with all these less-prominent but potentially useful data sources—is not how but who:

Who will undertake to co-ordinate and execute the cleaning of all this medium-range data?

Whether a large and united effort like Project Gutenberg has the human resources seems in question. The University of Pennsylvania Online Books Library has a cautionary status note about the Gutenberg link:

No stable link: This is an uncurated book entry from our extended bookshelves, readable online now but without a stable link here. You should not bookmark this page, but you can request that we add this book to our curated collection, which has stable links.

The scanned PDF version is stable. If the note is prompted by faults in the textual transcriptions—well, I think I could finish the fixes I started if a free extra week were magically inserted into my calendar. If any of you can bump it along a day at a time, either starting from my version or working afresh, I’ll be grateful.

Open Problems

How much will the world need to use such medium-level data sets? How important is it to clean them, and where would that effort come from? Or will all this data continue to lie in fallows?

Besides the above books and religious guides, Fallows wrote patriotic books, including The American Manual and Patriot’s Handbook (1889). But among all his topics, perhaps the one most current to remember this Fourth of July weekend—as we discuss student loans and the larger role of higher education—is that his great cause in his home state was “a college education, tuition free, for every Wisconsin boy or girl who wanted it.” He organized the first postgraduate distance-education program in the US, and also created the epsilon-alcoholic “Bishop’s Beer” before Prohibition.

Update Aug. 1, 2022: The Gutenberg version has been updated with mine and some other corrections, but the poster acknowledges privately that many more remain to be fixed—and that restarting from scratch with this image may be best.

[some little fixes]

2 Comments leave one →
  1. July 3, 2022 4:40 pm

    You may be more likely to recover corrections by putting your file on GitHub and soliciting edits there.
