\documentclass[12pt]{report}
\usepackage[headings]{fullpage}
\usepackage{scribe}
\usepackage{graphicx}
\setlength{\parindent}{0in}
\setlength{\parskip}{0.25cm}
\renewcommand{\labelitemi}{$\cdot$}
\newtheorem{definition}[lemma]{Definition}
\begin{document}
\course{COMP761}
\coursetitle{Quantum Information Theory}
\semester{Winter 2009}
\lecturer{Patrick Hayden}
\scribe{Jan Florjanczyk}
\lecturenumber{1}
\lecturedate{January 6}
\maketitle
\section{Quantum information science}
Quantum information science is composed of two strands: the technological and the conceptual. The technological side exploits quantum mechanics to perform tasks such as quantum key distribution (QKD) or factoring large integers (Shor's factoring algorithm). Meanwhile, the work of the conceptual strand ranges across many topics and fields, but was perhaps best encapsulated by John Wheeler, who coined the phrase ``it from bit''. Wheeler was exhorting physicists to think about whether the laws of physics, particularly quantum mechanics, might have an information-theoretic origin. The conceptual strand of quantum information science tries to make sense of ``weird'' features of quantum mechanics, such as irreducible probabilities and the incompatibility of certain observable quantities, in quantitative and operational terms.
In this course we will make two assumptions:
\begin{enumerate}
\item \textbf{Communication is expensive}; in particular, we will ``meter'' communication
\item \textbf{Computation is free}
\end{enumerate}
These give a simple and highly structured theory which provides the scaffolding necessary for a more practical follow-up.
\section{What is information?}
Consider a scale that can determine if its left side is heavier (L),
its right side is heavier (R), or the two are balanced (B). Imagine
also that you wish to find a counterfeit coin amongst a set of 12
which is lighter or heavier than the others but you don't know
which. How would you find it with as few weighings as possible?
First, we can find a lower bound on the number of weighings. The number of possible configurations for the set of coins is 24: any one of the 12 coins can be counterfeit \textbf{and} it can be either heavier or lighter. If $n$ is the number of uses of the scale, then $3^n$ is the number of possible results after $n$ uses, so we must have $3^n \geq 24$, i.e.\ $n \geq 3$ since $n$ is an integer.
The algorithm we construct to find the counterfeit coin will always ask questions that \textit{reduce our uncertainty} about its identity and weight by as much as possible at every step. [Exercise: confirm that the problem can be solved with only 3 uses of the scale.] It is not a large leap to then say that \textbf{information is that which reduces uncertainty}. The minimal number of uses of the scale therefore coincides with the information we gain by learning the identity and weight of the counterfeit coin.
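The counting bound is easy to check mechanically; the following Python sketch (the variable names are our own) searches for the smallest $n$ with $3^n \geq 24$:

```python
# Counting bound for the 12-coin puzzle: 12 coins, each possibly heavier
# or lighter, gives 24 configurations; each weighing has 3 outcomes (L, R, B).
num_configurations = 12 * 2

n = 0
while 3 ** n < num_configurations:
    n += 1          # smallest n with 3**n >= 24

print(n)            # -> 3
```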
\section{Typical sequences}
Another familiar thought experiment is to repeatedly flip a biased coin. Consider a coin with results $0$ and $1$ and $\mathrm{Pr} \left[ X_j = 1 \right] = p$. Such a coin tested repeatedly may give the following string of results:
\begin{equation}
X_1X_2...X_n = 011010111010101
\end{equation}
Thus we might naively think there are $n$ bits of information in this string. But in order to predict how many 1's will appear we need to define the following quantity,
\begin{equation}
Z_n := \frac{1}{n} \displaystyle\sum_{j=1}^{n} X_j
\end{equation}
This is the fraction of 1's in the string. We have $\mathbb{E} \left[Z_n\right] = p$, and for large $n$ the distribution of $Z_n$ is approximately normal around this value.
\begin{figure}[h]
\centering
\includegraphics[scale=0.25]{gaussian.png}
\caption{Sequences with values of $Z_n$ far from $p$ are unlikely. We can thus \textbf{ignore} sequences with atypical fractions of ones when compressing data.}
\end{figure}
Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed (i.i.d.) random variables (r.v.) taking values in a finite set $\chi$ with probability mass function $p(x)$. Then we can state an important definition.
\begin{definition}[Shannon entropy of $X$]
\begin{equation}
H(X) = - \displaystyle\sum_x p(x) \log p(x)
\end{equation}
\end{definition}
We use the following conventions:
\begin{itemize}
\itemsep 0pt
\item $\chi$, the set of all possible values of a random variable, is called the alphabet
\item a value $x \in \chi$ of a random variable is called a letter
\item $X$ is a random variable with values in $\chi$
\item All logarithms are base 2
\end{itemize}
The Shannon entropy is the most common and often best way to measure the uncertainty in $X$.
\begin{theorem}[Asymptotic Equipartition Theorem]
\begin{equation}
\forall \epsilon > 0, \mathrm{Pr} \left[ \left| -\frac{1}{n} \log
p(X_1, X_2, ..., X_n) - H(X) \right| > \epsilon \right] \overset{n
\rightarrow \infty}{\longrightarrow} 0
\end{equation}
\end{theorem}
We can reformulate this statement to say that with very high probability the following bounds hold,
\begin{equation}
2^{-n\left[H(X)+\epsilon\right]} \leq p(X_1, X_2, ..., X_n) \leq 2^{-n\left[H(X)-\epsilon\right]}
\end{equation}
In other words, the sequences we are likely to see all have roughly the same probability. The motivation for the bounds is simple to see if we rewrite a little,
\begin{equation}
2^{-n \cdot H(X)} = 2^{n \sum_x p(x) \log p(x)} = \displaystyle\prod_x p(x)^{n \cdot p(x)}
\end{equation}
The right side of the equation is a product over letters: each letter's probability raised to its expected number of occurrences (or multiplicity) in a string of length $n$.
\begin{proof}[of the AEP Theorem]
Let $Y_j = - \log p(X_j)$; since the alphabet is finite, $Y_j$ is a random variable with finite variance. Its expectation value is
\begin{equation}
\mathbb{E} \left[ Y_j \right] = \displaystyle\sum_{x} \left( -p(x) \log p(x) \right) = H(X)
\end{equation}
We can apply the weak law of large numbers to get
\begin{equation}
\mathrm{Pr} \left[ \left| \frac{1}{n} \displaystyle\sum_{j=1}^{n}
Y_j - H(X) \right| > \epsilon \right] \overset{n \rightarrow
\infty}{\longrightarrow} 0.
\end{equation}
The term being compared to the entropy becomes
\begin{equation}
\frac{1}{n} \displaystyle\sum_{j=1}^{n} Y_j = -\frac{1}{n} \displaystyle\sum_{j=1}^{n} \log p(X_j) = -\frac{1}{n} \log \displaystyle\prod_{j=1}^{n} p(X_j) = -\frac{1}{n} \log p(X^n)
\end{equation}
where $X^n$ denotes the sequence $X_1, X_2, \ldots, X_n$, so the statement of the theorem is exactly the weak law of large numbers applied to the $Y_j$.
\end{proof}
\begin{definition}[Typical sequences]
\begin{equation}
\forall \epsilon > 0, \mathrm{T}_{\epsilon}^{(n)} = \left\{ x^n = x_1, x_2, ..., x_n \left| 2^{-n\left[H(X)+\epsilon\right]} \leq p(x^n) \leq 2^{-n\left[H(X)-\epsilon\right]} \right. \right\}
\end{equation}
Note that $x^n$ is a sequence of $n$ letters each taken from the alphabet $\chi$. The set $\mathrm{T}_{\epsilon}^{(n)}$ is called the \emph{set of entropy-typical sequences with respect to $p(x)$}.
\end{definition}
Some properties of the set are
\begin{enumerate}
\itemsep 0pt
\item If $x^n$ is typical, i.e.: $x^n \in \mathrm{T}_{\epsilon}^{(n)}$, then $H(X) - \epsilon \leq -\frac{1}{n} \log p(x^n) \leq H(X) + \epsilon$
\item $\mathrm{Pr} \left[ x^n \in \mathrm{T}_{\epsilon}^{(n)} \right] \geq 1 - \delta$ $\forall \delta > 0$ and $n$ sufficiently large
\item $\left| \mathrm{T}_{\epsilon}^{(n)} \right| \leq 2^{n\left[H(X)+\epsilon\right]}$
\item $\left| \mathrm{T}_{\epsilon}^{(n)} \right| \geq (1-\delta) 2^{n\left[H(X)-\epsilon\right]}$ $\forall \delta > 0$ and $n$ sufficiently large
\end{enumerate}
\begin{proof} The proofs follow from simple manipulations of the theorem and the definition.
\begin{enumerate}
\itemsep 0pt
\item Rearranging the definition of $T_{\epsilon}^{(n)}$. For a typical sequence $x^n$ take the logarithm of the double inequality in the definition.
\item Rearranging the AEP Theorem above. Note that from the theorem,
\begin{equation}
\forall \epsilon > 0, \mathrm{Pr} \left[ \left| -\frac{1}{n} \log p(X_1, X_2, ..., X_n) - H(X) \right| \leq \epsilon \right] \overset{n \rightarrow \infty}{\longrightarrow} 1
\end{equation}
Then we can get,
\begin{equation}
\forall \epsilon > 0, \mathrm{Pr} \left[ x^n \in \mathrm{T}_{\epsilon}^{(n)} \right] \overset{n \rightarrow \infty}{\longrightarrow} 1
\end{equation}
and the property follows.
\item We rearrange the following inequality,
\begin{equation}
\left| \mathrm{T}_{\epsilon}^{(n)} \right| 2^{-n\left[H(X)+\epsilon\right]} \leq \displaystyle\sum_{x^n \in \mathrm{T}_{\epsilon}^{(n)}} p(x^n) = \mathrm{Pr} \left[ x^n \in \mathrm{T}_{\epsilon}^{(n)} \right] \leq 1
\end{equation}
and the property follows by multiplying both sides by $2^{n\left[H(X)+\epsilon\right]}$.
\item We rearrange the following inequality,
\begin{equation}
\left| \mathrm{T}_{\epsilon}^{(n)} \right| 2^{-n\left[H(X)-\epsilon\right]} \geq \displaystyle\sum_{x^n \in \mathrm{T}_{\epsilon}^{(n)}} p(x^n) = \mathrm{Pr} \left[ x^n \in \mathrm{T}_{\epsilon}^{(n)} \right] \geq 1- \delta
\end{equation}
where the final step is true by property (ii).
\end{enumerate}
\end{proof}
\section{Lossless compression}
We consider enumerating all typical sequences with a function
\begin{equation}
f:\mathrm{T}_{\epsilon}^{(n)} \longrightarrow \{0,1\}^{\lceil n(H(X)
+ \epsilon) \rceil}.
\end{equation}
Our compression scheme, then, will evaluate each sequence $x^n$ to
determine whether it is typical (i.e.: is $x^n \in
\mathrm{T}_{\epsilon}^{(n)}$?). If the sequence is typical, then it
can be assigned the string $0f(x^n)$, thereby reducing its length to
$\lceil n(H(X)+\epsilon)\rceil + 1$. If the sequence $x^n$ is atypical
then it is encoded as $1x^n$ with resulting length $n+1$. Since the
probability of a sequence being typical goes to $1$ as the length of
the sequence diverges, the average number of output bits per input symbol
approaches $H(X)+\epsilon$ as $n \rightarrow \infty$.
Also, while we haven't proved it yet, it is impossible to compress the string to fewer than $H(X)$ output bits per input symbol. This observation provides us with a solid operational interpretation of the Shannon entropy as the minimal number of bits required to accurately encode the output of a source generating independent samples of the random variable $X$.
\end{document}