Partial Solutions to Homework 1

Q1

1 A)Prove that the traceback path that is created during dynamic programming specifies a longest common subsequence.

Answer: We proceed by induction. Suppose D  is the matrix  obtained after applying the LCS algorithm. It has rows numbered 0 to n and column numbered from 0 to m.

Let A_k = { (i,j) | i+j = k ). Define a total (linear) order < on the family of sets F = { A_0, A_1,... } as follows: A_p < A_q iff p is less than q. It is quite forward to see that F is well-ordered (i.e every subset of F has a least element) . Consequently, we can apply the induction principle to prove that the following property holds for every set A_k in F : "For every pair (i,j) , where D(i,j) is defined, the traceback paths from cell (i,j) to (0,0) specify longest common subsequences of s[1..i] and t[1..j] ".

Base Case: P(A_0) holds
The set A_0 consists of the cell (0,0) of our matrix D, which contains the length of the LCS between two empty strings in this case. The only traceback path is the empty path.  This in agreement with the fact that there is no LCS, so the base case holds.

Inductive hypothesis: Assume P(A_i) holds for 0 <i <= k, k > 0. Let's show P(A_k+1) holds

Now, A_k+1 consists of the cell elements (i,j) such that i+j= k+1. Let (i,j) be such a cell element. Now, suppose (i,j) has a pointer to the cell (i-1,j-1). By the induction hypothesis, all traceback paths from the cell (i-1,j-1) to the cell (0,0) specify longest common subsequences of s[1..i-1]  and t[1..j-1]. Since D(i,j) = D(i-1,j-1) +1 in our case, concatenating the diagonal arrow from cell (i,j) to cell (i-1,j-1) with any traceback path from (i-1,j-1) to (0,0) will specify a longest common subsequence of s[1..i] and t[1..j].  Suppose instead (i,j) has a pointer to (i-1,j). By the induction hypothesis, all traceback paths from the cell (i-1,j) to the cell (0,0) specify longest common subsequences of s[1..i-1] and t[1..j]. Since D(i,j)=D(i-1,j), concatenating the vertical arrow from cell (i,j) to (i-1,j) with any traceback path from (i-1,j) to (0,0) will specify a longest common subsequence of s[1..i] and t[1..j]. The case when there's an arrow between (i,j) and (i,j-1) is similar.

Therefore, the inductive hypothesis holds. Hence, P holds for all sets A_i , i >= 0.

1 B) Prove the other direction: every longest common subsequence has such a traceback path.

We proceed by induction on the length of the LCS.

Base Case: | LCS | = 0
In this case, D(n,m) = 0 , so any path from cell (n,m) to (0,0) is a traceback path specifying no subsequence (there are no diagonal arrows in any of these paths).

Inductive step: Assume the property holds for |LCS| = k , k > 0.  Let's show that it holds for |LCS| = k+1.
In this case, D(n,m) = k+1. There are several possibilities here on how to construct our traceback path. If D(n,m) = D(n-1,m-1) + 1, then we know by hypothesis that there is a traceback path P from the cell (n-1,m-1) to (0,0). Hence, concatenating the arrow from (n,m) to (n-1,m-1) with P produces a traceback path for our LCS of length k+1. If there is no arrow between cell (n,m) and (n-1,m-1), then it is the case that D(n,m) = D(n-1,m) or D(n,m)=D(n,m-1), or both. Suppose D(n,m) = D(n-1,m), then we can follow a path of vertical arrows and horizontal arrows until there is a pair  (i,j) such that D(n,m) = D(i,j) = D(i-1,j-1) + 1 (remember |LCS|  > 0, so this pair (i,j) must exists ).  By the induction hypothesis, there is a traceback path P from the cell (i-1,j-1) to (0,0). Concatenating the path from (n,m) to (i,j)  with P produces a traceback for our longest common subsequence of length k+1. The case where D(n,m) = D(n,m-1) is similar.

Therefore, the inductive hypothesis holds. Hence, every longest common subsequence has such a traceback path.
 
 

Q2
Give an algorithm to compute the number of traceback paths in time O(n*m), where n=|s| , and m=|t|

REMARK: There was a small inconsistency in the question. The question states that there is a one-to-one correspondance between LCS and traceback paths. Actually, the one-to-one correspondance is between alignments and traceback paths. Convince yourself that two different traceback paths could specify the same LCS, but yield two different alignments of s and t. 

SOLUTION:
Let D be the matrix obtained after applying the LCS algorithm.
We define a new matrix B as follows: B(i,j) is the number of traceback paths from cell (i,j) to cell (0,0).
We compute B for a cell element (i,j) as follows:

If the cell (i,j) in D contains an arrow to (i-1,j-1) then
    B(i,j) = B(i-1,j-1)
If the cell (i,j) in D contains an arrow to (i-1,j) then
   B(i,j) = B(i,j) + B(i-1,j)
If the cell (i,j) in D contains an arrow to (i,j-1) then
   B(i,j) = B(i,j) + B(i,j-1)

Of course, we must fill out the matrix B row by row like we did when computing D in order for these rules to behave properly.
Base case:
for i=0...n,
 B(i,0) = 1
for j=0..m
 B(0,j) = 1

The answer to the problem is given by B(n,m).
Since each entry of B takes O(1) time to compute, the running time of this algorithm is O(n*m).

Q5

Answer: Let X_i  be the number of times  we have to sample (without replacement) from our 'infinite bag' of DNA letters until t[i] is retrieved. X_i is a random variable which follows a geometric distribution with mean 1/p = 4 (p = 1/4) , and variance sigma^2 = q/p^2 = 12. Let Y be the summation of X_1, X_2, .. X_m, where m is the length of the string t. Y is also a random variable which follows a geometric distribution with mean m*1/p = 4*m, and variance sigma^2 = m*q/p^2 = 12*m. This is so since all the random variables X_i are independent (P(X_i = x | X_j =y) = P(X_i =x) for different values of i and j). Now, the probability that s does not contains t as a subsequence is the probability that Y >= n+1. To get an upper bound for this probability, we use Tchebyshev's theorem, which states that
P( | Y - mean(Y) | >= k*sigma) <= 1/k^2.  Obviously, P(Y >= mean(Y) + k*sigma) < P(|Y-mean(Y)| >= k*sigma) <= 1/k^2.  So, in our case , we want the following:

P(Y>= n+1 ) = P(Y >= mean(Y)+k*sigma) < 1/k^2 < 1-epsilon.

 n+1 = mean(Y) + k*sigma = 4*m +k*sqrt(12*m).            (equation 1)

From 1/k^2 < 1-epsilon, we have k > sqrt ( 1/ 1-epsilon) , so equation 1 becomes
n+1 = 4*m + 12*k*m > 4*m + sqrt(12*m/1-epsilon)

So,  by setting n > 4* m + sqrt(12*m/1-epsilon) -1 , the probability that the string s contains the string t as a subsequence is greater or equal to epsilon.

There were other valid ways to obtain a bound on n.
 



BACK