Automatic identity recognition from text outputs
Johnny, Winnie, Manny, and Cathy each use their one-dimensional keyboard (as described in the handout) to type the Biola vision statement. Because of typos made in the process, we have a collection of 8 text files: A, B, C, D, E, F, G, and H. Each of the eight documents was typed by one of Johnny, Winnie, Manny, and Cathy. Your task is to determine, for each document d, which person is most likely to have generated d when trying to type the Biola vision statement. The following is an overview of a well-founded approach to accomplish the task.
a) Let d1, d2, ..., dn be the words in document d corresponding to the n intended words w1, w2, ..., wn in the Biola vision statement, and let Pr(di | wi, p) be the probability that person p produces the i-th word di in document d when he or she tries to type the i-th word wi in the Biola vision statement. The probability Pr(d | p) that document d is the resulting text when person p tries to type the Biola vision statement is simply the product of all the Pr(di | wi, p)'s. In other words, we have Pr(d | p) = Pr(d1 | w1, p) * Pr(d2 | w2, p) * … * Pr(dn | wn, p).
b) You should write your code to (i) open the files corresponding to document d and the Biola vision statement and (ii) read in the observed strings d1, d2, ..., dn from document d one at a time, together with the actual words w1, w2, ..., wn one at a time from the Biola vision statement. For each pair of observed string di and actual word wi, you can then calculate Pr(di | wi, p) and log Pr(di | wi, p). (See the descriptions in 3 below.)
c) We know that Pr(d | p) = Pr(d1 | w1, p) * Pr(d2 | w2, p) * … * Pr(dn | wn, p). However, multiplying all the tiny probabilities Pr(di | wi, p) together to get Pr(d | p) may produce a number too small to be represented accurately, because floating-point numbers have a limited range and precision (the product can underflow to 0). Instead, we can add up the values log Pr(di | wi, p) to get log Pr(d | p). This works because for any two positive numbers x and y we have log(xy) = log x + log y. Therefore, we have log Pr(d | p) = log Pr(d1 | w1, p) + log Pr(d2 | w2, p) + … + log Pr(dn | wn, p). To compute logarithms, you can use #include <cmath> and call the log function to calculate the logarithm of Pr(di | wi, p).
d) Note that for any two probabilities (two positive numbers) x and y, we have x > y if and only if log x > log y. Therefore, to find the most likely author p of document d, we can equivalently calculate and compare the log Pr(d | p) values instead of the Pr(d | p) values, and simply select the person who yields the largest log Pr(d | p). This allows us to better retain numerical precision throughout the computation.
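Steps a) through d) can be sketched in C++ roughly as follows. This is only a minimal sketch under stated assumptions, not the required solution: the names WordModel, logProbDocument, and mostLikelyAuthor are hypothetical, the per-word probability Pr(di | wi, p) is passed in as a function because its calculation is described separately, and the words of d and of the vision statement are assumed to have already been read into two vectors of equal length.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <limits>
#include <string>
#include <vector>

// Hypothetical interface: returns Pr(di | wi, p) for an observed string,
// the intended word, and the name of the person typing. The real keyboard
// model from the handout would be plugged in here.
using WordModel = std::function<double(const std::string& observed,
                                       const std::string& intended,
                                       const std::string& person)>;

// log Pr(d | p) = log Pr(d1 | w1, p) + ... + log Pr(dn | wn, p).
// Assumption: observed[i] corresponds to intended[i] for every i.
double logProbDocument(const std::vector<std::string>& observed,
                       const std::vector<std::string>& intended,
                       const std::string& person,
                       const WordModel& wordProb) {
    assert(observed.size() == intended.size());
    double total = 0.0;
    for (std::size_t i = 0; i < intended.size(); ++i) {
        // If the model returns 0, std::log gives -infinity and the sum
        // stays -infinity, so that person can never be selected over
        // someone with a finite score.
        total += std::log(wordProb(observed[i], intended[i], person));
    }
    return total;
}

// Returns the person with the largest log Pr(d | p); ties keep the
// earlier name in the list. Returns an empty string if people is empty.
std::string mostLikelyAuthor(const std::vector<std::string>& observed,
                             const std::vector<std::string>& intended,
                             const std::vector<std::string>& people,
                             const WordModel& wordProb) {
    std::string best;
    double bestLog = -std::numeric_limits<double>::infinity();
    bool haveBest = false;
    for (const std::string& person : people) {
        const double lp = logProbDocument(observed, intended, person, wordProb);
        if (!haveBest || lp > bestLog) {
            best = person;
            bestLog = lp;
            haveBest = true;
        }
    }
    return best;
}
```

In a full program, mostLikelyAuthor would be called once for each of the documents A through H, with people set to {"Johnny", "Winnie", "Manny", "Cathy"} and wordProb bound to your own implementation of Pr(di | wi, p).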