Automatic identity recognition from text outputs
Johnny, Winnie, Manny, and Cathy use their
one-dimensional keyboard (as described in the handout) to
type the Biola vision statement and because of typos in the
process we have a collection of 8 text files A, B, C, D, E, F, G and H. Each of the eight documents was
typed by one of Johnny,
Winnie, Manny, and Cathy. You task is to
determine for each document d who is the most likely person who
has generated the document d when trying to type the Biola vision statement. The following is the overview of a
well-founded approach to accomplish the task.
a)
Let d1,
d2, ... dn be the words in document d corresponding
to the n intended words w1, w2, ... wn in the Biola
vision statement and let Pr(di |wi,
p) be the probability that person p will produce
the i th word
di in document
d when he or she tries to type the i
th word wi
in the Biola vision statement. The probability Pr(d |p) that
document d is the resulting text when person p
try to type the Biola vision statement is simply
the product of all the Pr(di
|wi, p)’s. In other words, we have Pr(d |p)
= Pr(d1 |w1, p)
*Pr(d2 |w2, p)*
… *Pr(dn
|wn, p).
b)
You should write your
code to (i) open the files corresponding to document d
and the Biola vision statement and (ii) read in
the observed strings d1, d2, ... dn from document d one at a time and the actual words w1,
w2, ... wn one
at a time from the Biola vision statement. For
each pair of observed string di and actual word wi, you can then calculate Pr(di |wi, p) and log Pr(di
|wi, p). (See the descriptions in
3 below.).
c)
We know that Pr(d
|p) = Pr(d1
|w1, p) *Pr(d2
|w2, p)* … *Pr(dn |wn,
p). However, multiplying all
the tiny probabilities Pr(di
|wi, p)’s together to get Pr(d |p) may end in too small a number
losing precision because of the limited precision supported by the CPU.
Instead, we can add up (log Pr(di
|wi, p))’s together to get log Pr(d
|p). This works because
for any two positive numbers x and y we have log (xy) = log x + log y. Therefore, we have log Pr(d |p) =
log Pr(d1
|w1, p) + log
Pr(d2 |w2, p)+… + log Pr(dn |wn,
p). To deal with
logarithm, you can use #include <cmath> and call the log function
to calculate the logarithm of Pr(di |wi,
p).
d)
Note that for
any two probabilities (two positive numbers) x and y we
have x > y if and only if log x > log y. Therefore, equivalently we can calculate and compare log( Pr(d
|p) )’s instead of comparing Pr(d
|p) to find the most likely author p of document d by simply selecting
the person that ends in the maximal log(
Pr(d |p) ) value. This allows us to better retain the
numerical precision throughout the computation.