Programming 4B: Spelling Recognition with Training and Testing Data

 

Context from Programming 4A: Imagine that we are developing automatic spelling recognition software to help people in the one-dimensional keyboard country automatically correct their typos and the software has a training stage that requires people to type the Biola vision statements as the training data for the software. Mr. X is our first customer from the one-dimensional keyboard country and he produced the following corrupted document d1=corruptedBiolaVision1.txt when he tried to type the Biola vision statements. We know that the values of the degenerate parameters degsp and degkb  are both equal to 2 for Mr. X.

 

Demo executable and the programming task: Please download and carefully play with options L R, T, and U provided in this demo executable for automatic recovery of a message X described below. In Programming #4A, you have already implemented Option L. You can use option L to learn the best parameter values based on the corrupted message in d1=corruptedBiolaVision1.txt as the result of Mr. X typing the Biola vision statements. In Programming #4B, you need to implement the new options R, T, and U such that you can go through the steps to recognize from corruptedMessage1.txt and corruptedMessage2.txt as the results of Mr. X trying to type an unknown document recorded in messageX.txt twice given that all the original words are in vocabulary.txt. See more details below.

 

 

What to implement: For each corrupted string s you see in corruptedMessage1.txt, (i) check for each word w in the vocabulary.txt the probability Pr(s | w, X) that Mr. X would generate the corrupted string s when Mr. X types the word w, (ii) find the top 4 candidate words with the highest Pr(s | w, X), and (iii) save these 4 words on a separate line in the output file recoveredMessage_V1.txt to indicate they are the most likely candidates regarding the corresponding actual word in message X based on the evidence of the corrupted string s.

 

What to implement: For each corrupted string s1 you see in corruptedMessage1.txt and the corresponding corrupted string s2 you see in corruptedMessage2.txt, (i) check for each word w in the vocabulary.txt the probability Pr(s1 | w, X) and the probability Pr(s2 | w, X) that Mr. X would generate the corrupted strings s1 and s2 respectively when Mr. X types the word w, (ii) find the top 4 candidate words with the highest Pr(s1 | w, X) * Pr(s2 | w, X) value, and (iii) save these 4 words on a separate line in the output file recoveredMessage_V2.txt to indicate they are the most likely candidates regarding the corresponding actual word in message X based on the evidence of both corrupted strings s1 and s2.

 

       Option U:Checking the precision. The real message X actually is the one in messageX.txt.

What to implement: (I) For the recovered result in recoveredMessage_V1.txt, we want to check for each word w in messageX.txt whether w is identified as the top 1 candidate in the corresponding line in recoveredMessage_V1.txt, or one of the top 2 candidates in the corresponding line in recoveredMessage_V1.txt, or one of the top 3 candidates in the corresponding line in recoveredMessage_V1.txt, or one of the top 4 candidates in the corresponding line in recoveredMessage_V1.txt. In the process above, letís collect the statistics about the percentage of times words in messageX.txt are correctly identified as the top candidate, one of the top 2 candidates, one of the top 3 candidates, and one of the top 4 candidates. Report the persentage statistics. (II) For the recovered result in recoveredMessage_V2.txt, we want to check for each word w in messageX.txt whether w is identified as the top 1 candidate in the corresponding line in recoveredMessage_V2.txt, or one of the top 2 candidates in the corresponding line in recoveredMessage_V2.txt, or one of the top 3 candidates in the corresponding line in recoveredMessage_V2.txt, or one of the top 4 candidates in the corresponding line in recoveredMessage_V2.txt. In the process above, letís collect the statistics about the percentage of times words in messageX.txt are correctly identified as the top candidate, one of the top 2 candidates, one of the top 3 candidates, and one of the top 4 candidates. Report the persentage statistics.

 

       Experiment 4B: After you implement Options R, T, and U, run option L, then option R, then option T, and then option U in the end to recover message and to collect statistics about the accuracy of the recovered message. Report the statistics you got from option U in your self-evaluation report.