Malaprop v0.1.0

L. Amber O'Hearn

"...she's as headstrong as an allegory on the banks of Nile."

— Mrs. Malaprop, in Sheridan's The Rivals

As a contribution to the adversarial evaluation paradigm, I have released my first version of Malaprop ⁰, a project involving transformations of natural text that result in some words being replaced by real-word near neighbours.

The Adversarial Evaluation Model for Natural Language Processing

Noah Smith recently proposed a framework for evaluating linguistic models based on adversarial roles ¹. In essence, if you have a sufficiently good linguistic model, you should be able to differentiate between a sample of natural language and an artificially altered sample. An entity that performs this differentiation is called a Claude. At the same time, having a good linguistic model should also enable you to transform a sample of natural language in a way that preserves its linguistic properties; that is, that makes it hard for a Claude to tell which was the original. An entity that performs this transformation is called a Zellig. These tasks are complementary.

This framework is reminiscent of the cryptographic indistinguishability property, in which an attacker chooses two plaintexts to give to an Oracle. The Oracle chooses one and encrypts it. The encryption scheme is considered secure if the attacker can not guess at better than chance which of the two plaintexts corresponds to the Oracle's ciphertext.

Even though encryption schemes are constructed mathematically, questions of security are always empirical. The notion of Provable Security is regarded with skepticism (at least by some); schemes are considered tentatively secure based on withstanding attempts to be broken. Similarly, it would take an array of independent Claude's all unable to guess at better than chance to support the claim that a given Zellig had hit upon a truly linguistic-structure-preserving transformation. Likewise, if an array of independent Zelligs can't fool a given Claude, that would support a strong claim about his ability to recognize natural language.

A Real-Word Error Corpus

In this spirit, I've just released Malaprop v0.1.0. It creates a corpus of real-word errors embedded in text. It was designed to work with Westbury Lab's Wikipedia corpora, but can be used with any text.

The code acts as a noisy channel, randomly inserting Damerau-Levenshtein errors at the character level as a word is passed through. If the resulting string is a real word — that is, a sufficiently frequent word in the original corpus — the new word replaces the original.

I intend to use this corpus to evaluate algorithms that correct orthographical errors. However, it could be used quite generally as just one Zellig in what I hope becomes a large body of such resources.

Notes

⁰The term malapropism was first used in the context of the computational linguistics task of real-word error detection and correction by David St-Onge in 1995 in his Master's thesis, Detecting and Correcting Malapropisms with Lexical Chains .

¹ Noah A. Smith. Adversarial Evaluation for Models of Natural Language. CoRR abs/1207.0245 2012

The Adversarial Evaluation Model for Natural Language Processing

A Real-Word Error Corpus

Notes

Comments