Automatic identity recognition from text outputs
Johnny, Winnie, Manny, and Cathy use their
one-dimensional keyboard (as described in the handout) to
type the Biola vision
statement and because of typos in the process we have a collection of 8
text files A, B, C, D, E, F, G and H. Each of the eight documents was
typed by one of Johnny,
Winnie, Manny, and Cathy. Your task is
to determine for each document d who is the most likely person
who has generated the document d when trying to type the Biola
vision statement. The following is the overview of a well-founded approach to
accomplish the task.
1. The big picture: For each document d
in {A, B, C, D, E, F, G, H}, we should for each person p
in {Johnny,
Winnie, Manny, and Cathy}determine the
probability Pr(d |p) that
document d is the resulting text when person p
try to type the Biola
vision statement. We can then compare these probabilities and the most
likely person is simply the person p with the highest probability
Pr(d
|p) for document d.
2. How to determine the person p
with the highest probability Pr(d |p)
for a given document d:
a)
Let d1,
d2, ... dn be the words in document d corresponding
to the n intended words w1, w2, ... wn in the Biola vision
statement and let Pr(di |wi,
p) be the probability that person p will produce
the i th word
di in document
d when he or she tries to type the i
th word wi
in the Biola vision statement. The
probability Pr(d |p) that
document d is the resulting text when person p
try to type the Biola vision statement is simply the product of all the Pr(di |wi,
p)’s. In other words, we have
Pr(d |p) = Pr(d1 |w1, p) *Pr(d2 |w2, p)* …
*Pr(dn
|wn, p).
b)
You should write your
code to (i) open the files corresponding to document d
and the Biola vision statement and (ii) read in the observed strings d1,
d2, ... dn from
document d one at a time and the
actual words w1, w2, ... wn
one at a time from the Biola vision statement. For each
pair of observed string di and actual word wi, you can then calculate Pr(di |wi, p) and log Pr(di
|wi, p). (See the descriptions in
c below.).
c)
We know that Pr(d
|p) = Pr(d1
|w1, p) *Pr(d2
|w2, p)* … *Pr(dn |wn,
p). However, multiplying all
the tiny probabilities Pr(di
|wi, p)’s together to get Pr(d |p) may end in too small a number
losing precision because of the limited precision supported by the CPU.
Instead, we can add up (log Pr(di
|wi, p))’s together to get log Pr(d
|p). This works because
for any two positive numbers x and y we have log (xy) = log x + log y. Therefore, we have log Pr(d |p) =
log Pr(d1
|w1, p) + log
Pr(d2 |w2, p)+… + log Pr(dn |wn,
p).
d)
Note that for
any two probabilities (two positive numbers) x and y we
have x > y if and only if log x > log y. Therefore, equivalently we can calculate and compare log( Pr(d
|p) )’s instead of comparing Pr(d
|p) to find the most likely author p of document d by simply selecting
the person that ends in the maximal log(
Pr(d |p) ) value. This allows us to better retain the
numerical precision throughout the computation.
3. Reflection: What is the
basis of the approach in (2) above? Could you justify it in terms of Bayes
Theorem and what is the assumption behind this approach?