Do not write a spell checker like this! I will hunt you down and beat you with a dictionary.
I find spell checking in almost every application that I've tried to be spectacularly bad for anything other than trivial typos. Google is quite good, but it is leveraging data not always available in other applications and is a special case optimized for search applications. Spell correcting for general writing is somewhat different in requirements and optimization vectors.
Spelling errors that can be classified in terms of edit distance are far better characterized as "typos," not spelling errors. A typo is a transcription error, whereas an incorrect spelling is a representation error. Both can be present in a particular word, but they are different phenomena.
People make incorrect spelling choices when they intentionally choose the wrong letters to represent phonemes, or apply spelling rules incorrectly.
A better approach in all respects is to do what's often referred to as a grapheme to phoneme transformation, which is basically to `compile` a word into a smaller set of characters representing the phonemes of the language. With the reduced set of symbols, which in and of itself eliminates trivial typos, statistical models can perform better, and faster. Further, unknown words can often be corrected via the corrected spelling of the phonemes.
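A minimal sketch of the idea, using Soundex's consonant classes as a stand-in for a real grapheme-to-phoneme model (a production system would use something like Metaphone or a trained G2P model; unlike strict Soundex, this toy key encodes the first letter by its class too, so phonetically equivalent misspellings collapse together):

```python
# Soundex-style consonant classes: each class groups letters that
# represent similar phonemes. Vowels and h/w/y map to nothing.
GROUPS = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}

def phonetic_key(word: str) -> str:
    """Collapse a word to a reduced phonetic symbol sequence."""
    codes = []
    prev = ""
    for ch in word.lower():
        code = GROUPS.get(ch, "")   # unmapped letters contribute nothing
        if code and code != prev:   # drop repeated codes ("tt" -> "3")
            codes.append(code)
        prev = code
    return "".join(codes)
```

Here `phonetic_key("phone")` and `phonetic_key("fone")` both yield `"15"`, so the misspelling can be matched to the dictionary word by its key even though their edit distance as raw strings is larger.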
Why most spelling checkers fail to do grapheme to phoneme transformation when it is easy and both reduces the symbol set for statistical analysis and is demonstrably more accurate (with the possible exception of Indian languages) is beyond me. It's like watching people try to solve dehydration with a "better(TM)" formula for Coke. You're thinking about this problem wrong, just drink some water.
I had always assumed that spellcheckers took into account likely mistypes based on keyboard layout when working on spelling correction.
For example, "RRROR" (error) could easily have been me fat fingering the keys, whilst "MRROR" seems far less likely unless I have truly gargantuan fingers or a really really small keyboard. So you might infer I'd missed the "i" instead of mistyped the "e".
Is that ever used as part of statistical analysis?
A "confusion matrix" is the statistical analysis you're looking for. An entry (i, j) is a score of how often the ith character is mistakenly typed (input) as the jth character. This can easily be adapted to various input devices.
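A hypothetical sketch of such a matrix, derived purely from QWERTY key adjacency (a real system would estimate these scores from logged typo data rather than hard-coding them, and would cover the full symbol set):

```python
# Physical QWERTY rows; adjacency on this grid approximates fat-finger
# confusions. The scores below are illustrative placeholders.
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

def neighbors(ch: str) -> set:
    """Keys physically adjacent to ch on the QWERTY grid."""
    for r, row in enumerate(QWERTY_ROWS):
        if ch in row:
            c = row.index(ch)
            adj = set()
            for dr in (-1, 0, 1):
                rr = r + dr
                if not (0 <= rr < len(QWERTY_ROWS)):
                    continue
                for dc in (-1, 0, 1):
                    cc = c + dc
                    if (dr, dc) != (0, 0) and 0 <= cc < len(QWERTY_ROWS[rr]):
                        adj.add(QWERTY_ROWS[rr][cc])
            return adj
    return set()

def confusion(i: str, j: str) -> float:
    """Score for character i being typed as j."""
    if i == j:
        return 1.0
    return 0.5 if j in neighbors(i) else 0.01
```

This reproduces the intuition upthread: `confusion('e', 'r')` is high because the keys are adjacent (so "RRROR" for "error" is plausible), while `confusion('e', 'm')` is low, pushing the model toward a different explanation for "MRROR".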
Probably, if the spell checker owns the keyboard (e.g. an Android keyboard app), and probably not in most desktop ones. And even if a desktop one does use it, it's probably assuming a layout and missing, giving worse results.
It's not nearly as simple as it sounds in theory. Consider that you have typos from hitting the wrong button, and also typos from hitting the next letter before the previous letter.
Also, when looking at typos, you'd likely find statistically significant differences between typos hitting the letter to the left versus the right, or in the probability of a typo given the letter's general position within a layout, etc.
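The "next letter before the previous letter" case is why plain Levenshtein distance underrates transposition typos: "teh" for "the" costs two substitutions. The Damerau-Levenshtein distance (in its common optimal-string-alignment form, sketched here) counts an adjacent transposition as a single edit:

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Optimal string alignment distance: insertion, deletion,
    substitution, and adjacent transposition each cost 1."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            # Adjacent transposition: "te|h" vs "th|e"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]
```

With this metric, `damerau_levenshtein("teh", "the")` is 1, matching the intuition that a swapped pair is one slip of the fingers, not two independent errors.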