Programming 5: Automatic Spelling Recognition using HMMs
Context from Programming 4A and 4B: Imagine that we are
developing automatic spelling recognition software to help people in the one-dimensional
keyboard country automatically correct their typos and the software has a
training stage that requires people to type the Biola vision statements as the
training data for the software. Mr. X is our first customer from the
one-dimensional keyboard country and he produced the following corrupted
document d1=corruptedBiolaVision.txt when he tried to
type the Biola vision
statements. We know that the values of the degenerate parameters degsp and degkb are both equal to 2 for Mr. X.
Programming
tasks:
Please carefully
play with options L, R, T, and U provided in the demo executable for automatic
recovery of a message X described below. In Programming #4A and 4B, you have
already implemented the capability corresponding to Option L. You can use
option L to learn the best parameter values based on the corrupted message in d1=corruptedBiolaVision.txt
as the result of Mr. X typing the Biola vision statements. In Programming #5, you need to implement
additional functions/cells in your notebook corresponding to the functionality
of options R, T, and U in the demo executable. After
implementing them, then you should be able to sequentially go through options
L, R, T, and U one by one to recognize an unknown message recorded
in a document messageX.txt, given that (i) corruptedMessage1.txt and corruptedMessage2.txt are the
results of Mr. X trying to type an unknown message recorded in a document
messageX.txt twice and (ii) all the original words are in vocabulary.txt. See more details below.
·
Step 1: Use Option L to
learn the parameter values from the training data set. We only know that the values of the degenerate parameters degsp and degkb are both equal to 2 for Mr. X. However, we don’t
know the values of Prhit, Prmiss, Prrepeat, or PrmoveOn
for Mr. X. In order to make Options
R, T, and U to work well, the user should use Option L to learn the best
parameter values based on the corrupted message in d1=corruptedBiolaVision.txt as the result of
Mr. X typing the Biola
vision statements.
·
Step 2: Use Option R to recover an unknown message based on the result of Mr.
X typing the message X once, ending in one
corrupted result in corruptedMessage1.txt. We know that all the words in message X are conained in the vocabulary file vocabulary.txt. Option R should automatically recover message X from the corrupted result in corruptedMessage1.txt.
Implementation
to do: For each corrupted string s
you see in corruptedMessage1.txt, (i) check for each word w in the vocabulary.txt the probability Pr(s |
w, X) that Mr. X would generate
the corrupted string s when Mr. X types the word w, (ii) find the top 4 candidate words with the highest Pr(s | w, X), and (iii)
save these 4 words on a separate line in the output file recoveredMessage_V1.txt
to indicate they are the most likely candidates regarding the corresponding
actual word in message X based on the evidence of the corrupted
string s.
·
Step 3: Use Option T to recover an unknown message based on the results of
Mr. X typing the
message X twice, ending in two corrupted results corruptedMessage1.txt and corruptedMessage2.txt from document X
just now. We know that all the words in message X are conained in this vocabulary file.
Using what you have leared about the parameter values of Mr. X in the training phase, option
T should automatically recover message X
from the two corrupted documents corruptedMessage1.txt and corruptedMessage2.txt.
Implementation
to do: For each corrupted string s1 you see in corruptedMessage1.txt and the corresponding corrupted string s2 you see in corruptedMessage2.txt, (i) check for each word w in the vocabulary.txt the probability Pr(s1 |
w, X) and the probability Pr(s2
| w, X) that Mr. X would generate the corrupted strings s1 and
s2
respectively when Mr. X types the
word w, (ii) find the
top 4 candidate words with the highest Pr(s1 | w, X) * Pr(s2 | w, X) value, and (iii) save these 4 words on a separate
line in the output file recoveredMessage_V2.txt to indicate they are the most
likely candidates regarding the corresponding actual word in message X based on the evidence of both corrupted strings s1 and
s2.
·
Step 4: Use Option U to gather the
statistics about the acuracy of spelling recognition. The real “unknown” message X actually is actually the
one in messageX.txt, and we want to see how accurate our spelling
recognition software can accomplish its task.
Implementation
to do: (I) For the recovered result in
recoveredMessage_V1.txt, we want to check for each
word w in messageX.txt
whether w is identified as the top 1 candidate
in the corresponding line in recoveredMessage_V1.txt, or one of the top 2
candidates in the corresponding line in recoveredMessage_V1.txt , or one of the top 3 candidates in the corresponding line in
recoveredMessage_V1.txt, or one of the top 4 candidates
in the corresponding line in recoveredMessage_V1.txt. In the process above, let’s
collect the statistics about the percentage of times words in messageX.txt
are correctly identified as the top candidate, one of
the top 2 candidates, one of the top 3 candidates, and one of the top 4
candidates. Report these persentages. (II) For
the recovered result in recoveredMessage_V2.txt, we want to check for each word w
in messageX.txt whether w is identified as the top 1 candidate in the corresponding line in
recoveredMessage_V2.txt, or one of the top 2 candidates in the
corresponding line in recoveredMessage_V2.txt , or one of the
top 3 candidates in the corresponding line in recoveredMessage_V2.txt, or one of the top 4 candidates in the
corresponding line in recoveredMessage_V2.txt. In the process above, let’s collect the statistics about the
percentage of times words in messageX.txt are
correctly identified as the top candidate, one of the top 2 candidates,
one of the top 3 candidates, and one of the top 4 candidates. Report these percentages.