Automatic identity recognition from text outputs

Automatic identity recognition from text outputs

Johnny, Winnie, Manny, and Cathy use their one-dimensional keyboard (as described in the handout) to type the Biola vision statement and because of typos in the process we have a collection of 8 text files A, B, C, D, E, F, G and H. Each of the eight documents was typed by one of Johnny, Winnie, Manny, and Cathy. Your task is to determine for each document d who is the most likely person who has generated the document d when trying to type the Biola vision statement. The following is the overview of a well-founded approach to accomplish the task.

1. The big picture: For each document d in {A, B, C, D, E, F, G, H}, we should for each person p in {Johnny, Winnie, Manny, and Cathy}determine the probability Pr(d |p) that document d is the resulting text when person p try to type the Biola vision statement. We can then compare these probabilities and the most likely person is simply the person p with the highest probability Pr(d |p) for document d.

2. How to determine the person p with the highest probability Pr(d |p) for a given document d:

a) Let d₁, d₂, ... d_nbe the words in document d corresponding to the n intended words w₁, w₂, ... w_nin the Biola vision statement and let Pr(d_i |w_i, p) be the probability that person p will produce the i th word d_i in document d when he or she tries to type the i th word w_i in the Biola vision statement. The probability Pr(d |p) that document d is the resulting text when person p try to type the Biola vision statement is simply the product of all the Pr(d_i |w_i, p)’s. In other words, we have Pr(d |p) = Pr(d₁ |w₁, p) *Pr(d₂ |w₂, p)* … *Pr(d_n |w_n, p).

b) You should write your code to (i) open the files corresponding to document d and the Biola vision statement and (ii) read in the observed strings d₁, d₂, ... d_nfrom document d one at a time and the actual words w₁, w₂, ... w_none at a time from the Biola vision statement. For each pair of observed string d_i and actual word w_i, you can then calculate Pr(d_i |w_i, p) and log Pr(d_i |w_i, p). (See the descriptions in c below.).

c) We know that Pr(d |p) = Pr(d₁ |w₁, p) *Pr(d₂ |w₂, p)* … *Pr(d_n |w_n, p). However, multiplying all the tiny probabilities Pr(d_i |w_i, p)’s together to get Pr(d |p) may end in too small a number losing precision because of the limited precision supported by the CPU. Instead, we can add up (log Pr(d_i |w_i, p))’s together to get log Pr(d |p). This works because for any two positive numbers x and y we have log (xy) = log x + log y. Therefore, we have log Pr(d |p) = log Pr(d₁ |w₁, p) + log Pr(d₂ |w₂, p)+… + log Pr(d_n |w_n, p).

d) Note that for any two probabilities (two positive numbers) x and y we have x > y if and only if log x > log y. Therefore, equivalently we can calculate and compare log( Pr(d |p) )’s instead of comparing Pr(d |p) to find the most likely author p of document d by simply selecting the person that ends in the maximal log( Pr(d |p) ) value. This allows us to better retain the numerical precision throughout the computation.

3. Reflection: What is the basis of the approach in (2) above? Could you justify it in terms of Bayes Theorem and what is the assumption behind this approach?