A new automatic spelling correction model to improve parsability of noisy content
To improve the parsing of noisy content (here: Tweets), a new automatic
spelling correction model is proposed that normalizes input before handing it
to the parser. Spelling correction is done in three steps: finding errors,
generating solutions and ranking the generated solutions.
Finding errors Normally, finding errors is done by comparing tokens against a
dictionary. A more elaborate approach uses n-grams to estimate word
probabilities. If the probabilities of n-grams around a certain word are very
low, this probably indicates an error.
Generating solutions Most spelling correctors use a modified version of the
edit distance to find the correct words/phrases. Aspell includes an efficient
and effective algorithm for this, which combines the normal edit distance with
the double metaphone algorithm (L. Philips, 2000). The generation process
should be adapted to the domain of application. Because our test data
originates from Twitter, we assume that people tend to use shorter variants of
words. This justifies a modification of the costs of insertion and deletion in
the Aspell code.
Ranking solutions In previous work ranking is based on edit distance and
context probability(ie. M Schierle, 2008). To utilize more estimators for the
ranking, a logistic regression model is trained. The features used are:
- Edit distance: as calculated by Aspell, using the modified costs.
- Different n-gram probabilities: uni- bi- and tri-gram probabilities are
included.
- Parse probability: The parse probability of the best parse of the Stanford
parser.
In the presentation, we will present the model, as well as experimental
results.