Programming 5: Automatic Spelling Recognition using HMMs

 

Context from Programming 4A and 4B: Imagine that we are developing automatic spelling recognition software to help people in the one-dimensional keyboard country automatically correct their typos. The software has a training stage that requires people to type the Biola vision statements as the training data. Mr. X is our first customer from the one-dimensional keyboard country, and he produced the corrupted document d1=corruptedBiolaVision.txt when he tried to type the Biola vision statements. We know that the values of the degenerate parameters degsp and degkb are both equal to 2 for Mr. X.

 

Programming tasks: Please carefully play with options L, R, T, and U provided in the demo executable for automatic recovery of a message X, as described below. In Programming #4A and #4B, you already implemented the capability corresponding to Option L: it learns the best parameter values from the corrupted message in d1=corruptedBiolaVision.txt, which Mr. X produced while typing the Biola vision statements. In Programming #5, you need to implement additional functions/cells in your notebook corresponding to the functionality of options R, T, and U in the demo executable. Once they are implemented, you should be able to go through options L, R, T, and U sequentially to recognize an unknown message recorded in a document messageX.txt, given that (i) corruptedMessage1.txt and corruptedMessage2.txt are the results of Mr. X trying to type that unknown message twice and (ii) all the original words are in vocabulary.txt. See more details below.

 

·         Step 1: Use Option L to learn the parameter values from the training data set. We only know that the values of the degenerate parameters degsp and degkb are both equal to 2 for Mr. X; we do not know the values of Prhit, Prmiss, Prrepeat, or PrmoveOn. For Options R, T, and U to work well, the user should first use Option L to learn the best parameter values from the corrupted message in d1=corruptedBiolaVision.txt, which Mr. X produced while typing the Biola vision statements.
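The learning itself is the Option L capability you already built in Programming #4A and #4B, so no new code is required for this step. Purely as an illustration, here is a minimal Python sketch of one possible shape for it: a coarse grid search over candidate parameter values, assuming (a) a hypothetical helper corruption_log_likelihood(text, params) carried over from your earlier notebook and (b) that Prhit + Prmiss = 1 and Prrepeat + PrmoveOn = 1. Both the helper name and the grid-search strategy are assumptions for illustration, not the required method.

# Illustrative sketch only: one possible way to realize Option L, assuming a
# hypothetical helper corruption_log_likelihood(corrupted_text, params) from
# Programming #4A/4B that scores how well a parameter setting explains the text.
import itertools

DEG_SP = 2   # given for Mr. X
DEG_KB = 2   # given for Mr. X

def learn_parameters(corrupted_text):
    """Coarse grid search for the (Prhit, Prrepeat) pair that best explains
    the corrupted training text; Prmiss and PrmoveOn are taken as the
    complements (an assumption of this sketch)."""
    best_params, best_ll = None, float("-inf")
    grid = [i / 20 for i in range(1, 20)]          # 0.05, 0.10, ..., 0.95
    for pr_hit, pr_repeat in itertools.product(grid, grid):
        params = {"Prhit": pr_hit, "Prmiss": 1.0 - pr_hit,
                  "Prrepeat": pr_repeat, "PrmoveOn": 1.0 - pr_repeat,
                  "degsp": DEG_SP, "degkb": DEG_KB}
        ll = corruption_log_likelihood(corrupted_text, params)   # hypothetical helper
        if ll > best_ll:
            best_params, best_ll = params, ll
    return best_params

# Usage (illustrative):
# with open("corruptedBiolaVision.txt") as f:
#     learned = learn_parameters(f.read())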

 

·         Step 2: Use Option R to recover an unknown message based on the result of Mr. X typing the message X once, ending in one corrupted result in corruptedMessage1.txt. We know that all the words in message X are contained in the vocabulary file vocabulary.txt. Option R should automatically recover message X from the corrupted result in corruptedMessage1.txt.

 

Implementation to do: For each corrupted string s you see in corruptedMessage1.txt, (i) compute, for each word w in vocabulary.txt, the probability Pr(s | w, X) that Mr. X would generate the corrupted string s when typing the word w, (ii) find the top 4 candidate words with the highest Pr(s | w, X), and (iii) save these 4 words on a separate line of the output file recoveredMessage_V1.txt to indicate that they are the most likely candidates for the corresponding actual word in message X, based on the evidence of the corrupted string s.
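Here is a minimal Python sketch of this Option R loop. It assumes a hypothetical function pr_corrupted_given_word(s, w, params) from your Programming #4A/4B notebook that returns Pr(s | w, X) under Mr. X's learned parameters, and it assumes the words in the input files are whitespace-separated; adapt the file handling to your own setup.

# Minimal sketch of Option R, assuming a hypothetical function
# pr_corrupted_given_word(s, w, params) that returns Pr(s | w, X).
def recover_once(params,
                 corrupted_path="corruptedMessage1.txt",
                 vocab_path="vocabulary.txt",
                 out_path="recoveredMessage_V1.txt",
                 top_k=4):
    with open(vocab_path) as f:
        vocabulary = f.read().split()
    with open(corrupted_path) as f:
        corrupted_words = f.read().split()
    with open(out_path, "w") as out:
        for s in corrupted_words:
            # Score every vocabulary word against the corrupted string s.
            scored = [(pr_corrupted_given_word(s, w, params), w) for w in vocabulary]
            scored.sort(reverse=True)               # highest Pr(s | w, X) first
            top = [w for _, w in scored[:top_k]]
            out.write(" ".join(top) + "\n")         # one line of 4 candidates per s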

 

·         Step 3: Use Option T to recover an unknown message based on the results of Mr. X typing the message X twice, ending in two corrupted results, corruptedMessage1.txt and corruptedMessage2.txt. We know that all the words in message X are contained in the vocabulary file vocabulary.txt. Using what you have learned about Mr. X's parameter values in the training phase, Option T should automatically recover message X from the two corrupted documents corruptedMessage1.txt and corruptedMessage2.txt.

 

Implementation to do: For each corrupted string s1 you see in corruptedMessage1.txt and the corresponding corrupted string s2 you see in corruptedMessage2.txt, (i) compute, for each word w in vocabulary.txt, the probabilities Pr(s1 | w, X) and Pr(s2 | w, X) that Mr. X would generate the corrupted strings s1 and s2 respectively when typing the word w, (ii) find the top 4 candidate words with the highest value of Pr(s1 | w, X) * Pr(s2 | w, X), and (iii) save these 4 words on a separate line of the output file recoveredMessage_V2.txt to indicate that they are the most likely candidates for the corresponding actual word in message X, based on the evidence of both corrupted strings s1 and s2.
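The sketch below extends the same idea to Option T, again assuming the hypothetical pr_corrupted_given_word(s, w, params) helper. Because ranking by the product Pr(s1 | w, X) * Pr(s2 | w, X) is equivalent to ranking by the sum of the log-probabilities, the sketch sums logs to reduce the risk of numerical underflow; that is a design choice of this sketch, not a requirement.

# Minimal sketch of Option T, combining the evidence from two corrupted
# typings of the same word by summing log-probabilities.
import math

def recover_twice(params,
                  path1="corruptedMessage1.txt",
                  path2="corruptedMessage2.txt",
                  vocab_path="vocabulary.txt",
                  out_path="recoveredMessage_V2.txt",
                  top_k=4):
    with open(vocab_path) as f:
        vocabulary = f.read().split()
    with open(path1) as f:
        words1 = f.read().split()
    with open(path2) as f:
        words2 = f.read().split()
    with open(out_path, "w") as out:
        for s1, s2 in zip(words1, words2):
            scored = []
            for w in vocabulary:
                p1 = pr_corrupted_given_word(s1, w, params)   # hypothetical helper
                p2 = pr_corrupted_given_word(s2, w, params)
                # log p1 + log p2 ranks the same as p1 * p2 but avoids underflow.
                score = math.log(p1) + math.log(p2) if p1 > 0 and p2 > 0 else float("-inf")
                scored.append((score, w))
            scored.sort(reverse=True)
            out.write(" ".join(w for _, w in scored[:top_k]) + "\n")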

 

·         Step 4: Use Option U to gather statistics about the accuracy of spelling recognition. The real “unknown” message X is actually the one in messageX.txt, and we want to see how accurately our spelling recognition software can accomplish its task.

 

Implementation to do: (I) For the recovered result in recoveredMessage_V1.txt, check for each word w in messageX.txt whether w is identified as the top 1 candidate, one of the top 2 candidates, one of the top 3 candidates, or one of the top 4 candidates on the corresponding line of recoveredMessage_V1.txt. In the process, collect statistics about the percentage of times the words in messageX.txt are correctly identified as the top candidate, one of the top 2 candidates, one of the top 3 candidates, and one of the top 4 candidates, and report these percentages. (II) Do the same for the recovered result in recoveredMessage_V2.txt: check for each word w in messageX.txt whether w is identified as the top 1 candidate, one of the top 2 candidates, one of the top 3 candidates, or one of the top 4 candidates on the corresponding line of recoveredMessage_V2.txt, collect the same percentage statistics, and report these percentages.
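A minimal Python sketch of the Option U bookkeeping follows. It assumes the recovered files have the layout produced by the sketches above, namely one line of four whitespace-separated candidate words per word of messageX.txt, in the same order as the words in messageX.txt.

# Minimal sketch of Option U: top-k accuracy of a recovered file against
# the true words in messageX.txt.
def report_accuracy(recovered_path, message_path="messageX.txt", max_k=4):
    with open(message_path) as f:
        true_words = f.read().split()
    with open(recovered_path) as f:
        candidate_lines = [line.split() for line in f]
    hits = [0] * max_k                          # hits[k-1] counts "within top k"
    for w, candidates in zip(true_words, candidate_lines):
        for k in range(1, max_k + 1):
            if w in candidates[:k]:
                hits[k - 1] += 1
    for k in range(1, max_k + 1):
        pct = 100.0 * hits[k - 1] / len(true_words)
        print(f"{recovered_path}: {pct:.1f}% of words identified within the top {k} candidates")

# Usage (illustrative):
# report_accuracy("recoveredMessage_V1.txt")
# report_accuracy("recoveredMessage_V2.txt")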