\documentclass{report}
\usepackage[headings]{fullpage}
\usepackage{scribe}
\usepackage[pdftex]{color,graphicx}
\course{COMP-761}
\coursetitle{Quantum Information Theory}
\semester{Winter 2009}
\lecturer{Patrick Hayden}
\scribe{Artem Kaznatcheev}
\lecturenumber{3}
\lecturedate{January 8}
\newtheorem{myDef}{Definition}[section]
\begin{document}
\maketitle
\section{Introduction}
In the previous two lectures, we formalized the notions of
\emph{uncertainty} and \emph{information} as \emph{entropy} and \emph{mutual information}. However, apart from
seeing the equations governing entropy and their fun properties, we have
not looked at an intuitive connection or interpretation for the mutual information function. The
goal of this lecture is to provide this connection. To start, let us consider a gambling example.
\section{Gambling example}
Let us consider a game of chance with outcomes $x\in \chi$, where each
outcome has probability $p(x)$. Assume that we have a friendly bookie, who
gives us fair odds. That means if we bet $d$ on $x$, then we get
$\frac{d}{p(x)}$ if $x$ happens and nothing otherwise. Our goal (apart
from having fun) is to maximize the exponential rate of return $G$ on our
capital $V_j$ after the $j^\mathit{th}$ round.
\begin{equation}
G = \lim_{n\rightarrow \infty} \frac{1}{n} \log \frac{V_n}{V_0}
\label{eq:G}
\end{equation}
\begin{danger}
Note that if we just use $G = \frac{V_n}{V_0}$ as our objective, then the optimal strategy will
involve betting all your money in one place, and that will make you go
broke very quickly!
\end{danger}
Now, consider that the game of chance is not quite as fair as was first
suggested. In particular, we have a friend `on the inside' who supplies
extra information about the game in the form of a random variable
$Y \sim p(y|x)$. Now the question for us to answer is `what is the maximum of $G$
over all possible betting strategies?' Surprisingly, the answer is the
mutual information between $X$ and $Y$.
\begin{theorem}
$I(X:Y)$ is the maximum of $G$ over all betting strategies.
\end{theorem}
\begin{proof}
Let $a(x|y)$ be the fraction of your capital allocated to $x$ after
hearing $y$ from your friend. In other words $a(x|y)$ is our strategy.
Without loss of generality assume that $\displaystyle\sum_{x\in \chi} a(x|y)=1$.
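We can assume this because the odds are fair: any capital we would rather
hold back can be spread over the outcomes in proportion to $p(x)$, which
returns exactly the amount staked no matter which $x$ happens. Further,
let $w_{xy}$ be the number of rounds, out of the first $n$, in which you
hear $y$ and $x$ happens. Clearly,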
\begin{equation}
V_n = \prod_{x,y} \left(\frac{a(x|y)}{p(x)}\right)^{w_{xy}}V_0.
\end{equation}
%
Now, plugging $V_n$ into equation~\ref{eq:G} we get:
%
\begin{eqnarray}
G &=& \lim_{n\rightarrow \infty} \frac{1}{n} \log
\left(\prod_{x,y} \left(\frac{a(x|y)}{p(x)}\right)^{w_{xy}}\right) \\
&=& \lim_{n\rightarrow \infty} \frac{1}{n} \sum_{x,y} w_{xy}
\log\frac{a(x|y)}{p(x)}
\end{eqnarray}
By the law of large numbers, $\frac{w_{xy}}{n}$ converges to $p(x,y)$ with
probability $1$. Therefore,
\begin{eqnarray}
G &=& \sum_{x,y} p(x,y) \log\frac{a(x|y)}{p(x)} \\
&=& \sum_{x,y} p(x,y) \log\frac{a(x|y)p(x|y)}{p(x)p(x|y)} \\
&=& \sum_y p(y) \left[\sum_x p(x|y)\log\frac{p(x|y)}{p(x)} -
\sum_x p(x|y) \log\frac{p(x|y)}{a(x|y)}\right]\\
&=& \sum_y p(y) [D(p(x|y)||p(x)) - D(p(x|y)||a(x|y))]
\end{eqnarray}
However, from before we know that $D(u||v)$ is always non-negative and
equal to zero if and only if $u = v$. Only the second term depends on our
strategy, so $G$ is maximized by making that term vanish: our optimal
betting strategy is $a(x|y) = p(x|y)$. If we play this strategy, then:
\begin{eqnarray}
G &=& \sum_y p(y) D(p(x|y)||p(x)) \\
&=& \sum_{x,y} p(y)p(x|y)\log\frac{p(x|y)}{p(x)} \\
&=& \sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)p(y)} \\
&=& I(X:Y)
\end{eqnarray}
\end{proof}
This provides us with a way to think about mutual information as a measure
of correlation between random variables. In simpler terms, $I(X:Y)$ can be
thought of as how much $X$ really knows about $Y$ and vice-versa.
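
As a small illustration (the numbers are chosen for this example and are not
from the lecture), suppose $X$ is a fair coin flip and our friend's tip $Y$
equals $X$ with probability $\frac{3}{4}$ and is wrong otherwise. Then
$p(x) = \frac{1}{2}$, and the optimal strategy $a(x|y) = p(x|y)$ puts
$\frac{3}{4}$ of our capital on $x = y$ and $\frac{1}{4}$ on $x \neq y$.
Taking logarithms base $2$, this gives
\begin{eqnarray*}
G &=& \frac{3}{4}\log\frac{3/4}{1/2} + \frac{1}{4}\log\frac{1/4}{1/2} \\
&=& \frac{3}{4}\log 3 - 1 \;\approx\; 0.19 \mbox{ bits per round},
\end{eqnarray*}
which agrees with $I(X:Y) = H(X) - H(X|Y)$, since here $H(X) = 1$ and
$H(X|Y) = 2 - \frac{3}{4}\log 3 \approx 0.81$. By contrast, staking
everything on the tip loses the whole capital the first time the tip is
wrong, so that strategy has $G = -\infty$ even though the tip is right most
of the time.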
\section{Communication}
This part of the lecture will provide us with one of the most important
results of information theory and delve into the heart of the theory:
communication. A good theory of information should provide us with
a way to model typical obstacles to communication, such as:
\begin{enumerate}
\item weak signals (e.g.\ space probes, cell phones)
\item imperfect signals (e.g.\ smoke signals)
\item faulty media (e.g.\ a failed hard drive)
\end{enumerate}
When designing a communication protocol we want a system that is robust to
one or all of these obstacles. The basic model for us is the channel.
Examples of channels include the binary symmetric channel (BSC), binary
erasure channel, and binary adder channel. In general, anything that connects
your inputs to outputs, whether deterministic or stochastic, is a channel.
\begin{myDef}
A (discrete) \textbf{channel} $(\chi, p(y|x), \mathcal{Y})$ consists of two
finite sets $\chi$ and $\mathcal{Y}$ and a probability mass function
$p(y|x)$ on $\mathcal{Y}$ for each $x\in\chi$.
\end{myDef}
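For instance, the binary symmetric channel mentioned above fits this
definition. A sketch, assuming the usual description in which each bit is
flipped with some probability $f$ (the symbol $f$ is notation introduced for
this example, not the lecture's): take $\chi = \mathcal{Y} = \{0,1\}$ and
\begin{displaymath}
p(y|x) = \left\{ \begin{array}{ll}
1-f & \mbox{if } y = x \\
f & \mbox{if } y \neq x
\end{array} \right.
\end{displaymath}
so that for each fixed $x$ the values $p(0|x)$ and $p(1|x)$ sum to $1$.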
To build our theory, we will assume memory-less channels and use the channel
$n$ times to send messages. We will also borrow notation from Cover and Thomas and use
$\displaystyle p(y^n|x^n) = p^n(y^n|x^n) = \prod_{j=1}^n p(y_j|x_j)$. Our
overall system will be modeled as a word $w$ passed into an encoder, which
will output a code-word $x^n$ into the channel which will output some $y^n$
with probability $p(y^n|x^n)$ into the decoder which will transform it into
the final message $\tilde{w}$. We want $\lim_{n \rightarrow \infty}
\mathit{Pr}\{w\neq\tilde{w}\} = 0$. In other words, we want our transmission
to approach lossless as the length of the codeword goes to infinity.
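
To see what the product formula gives in a concrete case, consider (as an
assumed example, not one from the lecture) a binary symmetric channel that
flips each transmitted bit independently with probability $f$. If the
received string $y^n$ differs from the code-word $x^n$ in exactly $d$
positions, then
\begin{displaymath}
p(y^n|x^n) = \prod_{j=1}^n p(y_j|x_j) = f^{d}(1-f)^{n-d},
\end{displaymath}
so for $f < \frac{1}{2}$ the outputs that are closer to the code-word are the
more likely ones.
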
More formally, an {\bf $(M,n)$-code} consists of:
%
\begin{enumerate}
\item A message index set $\mathcal{M} = \{1, \ldots , M\}$
\item An encoder function $X^n: \mathcal{M} \rightarrow \chi^n$
\item A decoder function $g: \mathcal{Y}^n \rightarrow \mathcal{M}$
\end{enumerate}
The {\bf rate} of the code is (in bits per channel use):
\begin{equation}
R = \frac{1}{n} \log M
\end{equation}
The probability of error on a particular message $w$ is:
\begin{equation}
P_e^{(n)}(w) = \mathit{Pr}\{g(Y^n) \neq w | X^n = X^n(w)\}
\label{eq:P_e}
\end{equation}
From equation~\ref{eq:P_e} we can define the average probability of error:
\begin{equation}
\bar{P}_e^{(n)} = \frac{1}{M}\sum_{w\in \mathcal{M}} P_e^{(n)}(w).
\label{eq:P_avg}
\end{equation}
\begin{danger}
We generally do not know Alice's distribution of messages, and equation \ref{eq:P_avg} assumes
a uniform distribution, so $\bar{P}_e^{(n)}$ is perhaps not that useful.
\end{danger}
A better metric of error is one where we can assume an adversarial sender.
For this, we need to consider the worst-case probability of error over all messages:
\begin{equation}
P_e^{(n)} = \max_{w\in \mathcal{M}} P_e^{(n)}(w)
\end{equation}
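As a sanity check on these definitions, here is a small worked example that
is not part of the lecture: the three-fold repetition code over a binary
symmetric channel with flip probability $f$ (both the code and the symbol
$f$ are assumptions made for illustration, and logarithms are taken base
$2$). There are $M = 2$ messages, encoded as $X^3(1) = 000$ and
$X^3(2) = 111$, and the decoder $g$ outputs the message whose bit appears in
the majority of $y^3$. The rate is
\begin{displaymath}
R = \frac{1}{3}\log 2 = \frac{1}{3},
\end{displaymath}
and decoding fails exactly when two or three of the bits are flipped, so for
either message
\begin{displaymath}
P_e^{(3)}(w) = 3f^2(1-f) + f^3.
\end{displaymath}
Since both messages have the same error probability, the average and the
maximal error coincide here; for $f = 0.1$ they both equal $0.028$.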
\begin{myDef}
We call a rate $R$ \textbf{achievable} if there is some sequence of
$(\lceil 2^{nR} \rceil, n)$-codes such that $P_e^{(n)} \rightarrow 0$ as
$n \rightarrow \infty$.
\end{myDef}
\begin{myDef}
The \textbf{capacity} of $p(y|x)$ is the supremum of all achievable rates.
\end{myDef}
With all this new terminology we can state Shannon's noisy coding theorem
and provide intuitions for it, then prove it in the next lecture.
\begin{theorem}[Shannon's Noisy Coding]
The capacity of a channel $p(y|x)$ is $\displaystyle C = \max_{p(x)} I(X:Y)$.
\end{theorem}
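As an illustration of the formula, here is a standard example sketched under
the assumption of a binary symmetric channel with flip probability $f$; the
notation $H_2(f) = -f\log f - (1-f)\log(1-f)$ for the binary entropy is
introduced only for this example. For any input distribution $p(x)$,
\begin{eqnarray*}
I(X:Y) &=& H(Y) - H(Y|X) \\
&=& H(Y) - \sum_x p(x) H_2(f) \\
&=& H(Y) - H_2(f) \\
&\leq& 1 - H_2(f),
\end{eqnarray*}
since $Y$ is a single bit. A uniform input makes $Y$ uniform and achieves
equality, so $C = 1 - H_2(f)$ bits per channel use; for $f = 0.1$ this is
roughly $0.53$.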
\subsection{Intuition}
We do not know how to choose code words, so let us try choosing them at
random. The number of typical output sequences $y^n$ is approximately
$2^{nH(Y)}$, but the number of typical outputs produced by any one particular
code-word is only about $2^{nH(Y|X)}$. Thus, we are likely able to distinguish
$M \leq \frac{2^{nH(Y)}}{2^{nH(Y|X)}} = 2^{n[H(Y) - H(Y|X)]} = 2^{nI(X:Y)}$
code-words, which corresponds to a rate of about $I(X:Y)$.
\end{document}