**Automatic identity recognition from text outputs**

Johnny, Winnie, Manny, and Cathy use their
one-dimensional keyboard (as described in the handout) to
type the Biola vision statement and because of typos in the
process we have a collection of 8 text files A, B, C, D, E, F, G and H. Each of the eight documents was
typed by one of Johnny,
Winnie, Manny, and Cathy. You task is to
determine for each document *d* who is the most likely person who
has generated the document *d* when trying to type the Biola vision statement. The following is the overview of a
well-founded approach to accomplish the task.

**The
big picture**: For each document *d** *in {A, B, C, D, E, F, G, H}, we should for each person
*p** *in {Johnny, Winnie, Manny, and Cathy}determine
the probability *Pr**(***d |p**)
that document *d** *is the resulting text when person *p**
*try to type the Biola vision statement. We can then compare these
probabilities and the most likely person is simply the person *p*
with the highest probability *Pr**(**d |p**)* for document *d*.
**How
to determine the ****person ***p* with
the highest probability *Pr**(d |p) *for a given document *d**:*

a)
Let *d*_{1},
d_{2}, ... d_{n}_{ }be the words in document *d* corresponding
to the *n *intended words *w*_{1}, w_{2}, ... w_{n}_{ }in the Biola
vision statement and let *Pr**(**d*_{i} |w_{i},
p*) *be the probability that person *p* will produce
the *i** th *word
*d*_{i} in document
*d* when he or she tries to type the *i**
th *word *w*_{i}
in the Biola vision statement.** **The probability *Pr**(**d |p**)* that
document *d** *is the resulting text when person *p**
*try to type the Biola vision statement is simply
the product of all the *Pr**(***d**_{i}
|w_{i}, p)’s. In other words, we have* **Pr**(**d |p**)*
= *Pr**(***d**_{1} |w_{1}, p)
**Pr**(***d**_{2} |w_{2}, p)*
… **Pr**(***d**_{n}**
|w**_{n}, p).

b)
You** ** should write your
code to (i) open the files corresponding to document *d*
and the Biola vision statement and (ii) read in
the observed strings *d*_{1}, d_{2}, ... d_{n}_{ }from document *d* one at a time** **and the actual words *w*_{1},
w_{2}, ... w_{n}_{ }one
at a time** **from the Biola vision statement. For
each pair of observed string *d*_{i} and actual word* w*_{i}, you can then calculate *Pr**(***d**_{i} |w_{i}, p)** **and** ***log** Pr(***d**_{i}
|w_{i}, p)**. **(See* *the descriptions in
3 below.).

c)
We know that *Pr**(**d
|p**)* = *Pr**(***d**_{1}
|w_{1}, p) **Pr**(***d**_{2}
|w_{2}, p)* … **Pr**(***d**_{n}** |w**_{n},
p). However, multiplying all
the tiny probabilities *Pr**(***d**_{i}
|w_{i}, p)’s together to get *Pr**(***d |p**) may end in too small a number
losing precision because of the limited precision supported by the CPU.
Instead, we can add up (*log **Pr**(**d*_{i}
|w_{i}, p)*)*’s together to get *log** Pr(***d
|p**). This works because
for any two positive numbers *x* and* y *we have *log (xy) = log x + log y*. Therefore, we have* **log** Pr(***d |p**) =
*log **Pr**(***d**_{1}
|w_{1}, p) + *log**
Pr(***d**_{2} |w_{2}, p)+… +* log** Pr(***d**_{n}** |w**_{n},
p). To deal with
logarithm, you can use *#include <cmath>* and call the log function
to calculate the logarithm of *Pr**(**d*_{i} |w_{i},
p*)*.

d)
Note that for
any two probabilities (two positive numbers) *x* and* y *we
have *x* > *y *if and only if *log x* > *log y*. Therefore, equivalently we can calculate and compare* log**( Pr(***d
|p**) )’s instead of comparing *Pr**(***d
|p**) to find the most likely author *p *of document *d* by simply selecting
the person that ends in the maximal *log**(
Pr(***d |p**) ) value. This allows us to better retain the
numerical precision throughout the computation.