Automatic identity recognition from text outputs

 

Johnny, Winnie, Manny, and Cathy use their one-dimensional keyboard (as described in the handout) to type the Biola vision statement and because of typos in the process we have a collection of 8 text files A, B, C, D, E, F, G and H. Each of the eight documents was typed by one of Johnny, Winnie, Manny, and Cathy. You task is to determine for each document d who is the most likely person who has generated the document d when trying to type the Biola vision statement. The following is the overview of a well-founded approach to accomplish the task.

  1. The big picture: For each document d in {A, B, C, D, E, F, G, H}, we should for each person p in {Johnny, Winnie, Manny, and Cathy}determine the probability Pr(d |p) that document d is the resulting text when person p try to type the Biola vision statement. We can then compare these probabilities and the most likely person is simply the person p with the highest probability Pr(d |p) for document d.
  2. How to determine the person p with the highest probability Pr(d |p) for a given document d:

a)     Let d1, d2, ... dn be the words in document d corresponding to the n intended words w1, w2, ... wn in the Biola vision statement and let Pr(di |wi, p) be the probability that person p will produce the i th word di in document d when he or she tries to type the i th word wi in the Biola vision statement. The probability Pr(d |p) that document d is the resulting text when person p try to type the Biola vision statement is simply the product of all the Pr(di |wi, p)s. In other words, we have Pr(d |p) = Pr(d1 |w1, p) *Pr(d2 |w2, p)* *Pr(dn |wn, p).

b)    You should write your code to (i) open the files corresponding to document d and the Biola vision statement and (ii) read in the observed strings d1, d2, ... dn from document d one at a time and the actual words w1, w2, ... wn one at a time from the Biola vision statement. For each pair of observed string di and actual word wi, you can then calculate Pr(di |wi, p) and log Pr(di |wi, p). (See the descriptions in 3 below.).

c)     We know that Pr(d |p) = Pr(d1 |w1, p) *Pr(d2 |w2, p)* *Pr(dn |wn, p). However, multiplying all the tiny probabilities Pr(di |wi, p)s together to get Pr(d |p) may end in too small a number losing precision because of the limited precision supported by the CPU. Instead, we can add up (log Pr(di |wi, p))s together to get log Pr(d |p). This works because for any two positive numbers x and y we have log (xy) = log x + log y. Therefore, we have log Pr(d |p) = log Pr(d1 |w1, p) + log Pr(d2 |w2, p)+ + log Pr(dn |wn, p). To deal with logarithm, you can use #include <cmath> and call the log function to calculate the logarithm of Pr(di |wi, p).

d)    Note that for any two probabilities (two positive numbers) x and y we have x > y if and only if log x > log y. Therefore, equivalently we can calculate and compare log( Pr(d |p) )s instead of comparing Pr(d |p) to find the most likely author p of document d by simply selecting the person that ends in the maximal log( Pr(d |p) ) value. This allows us to better retain the numerical precision throughout the computation.