Course Introduction
Information Theory (Spring 2019), National Chiao Tung University, taught by Po-Ning Chen. Lecture videos: 信息论(2019年春)台湾交通大学陈伯宁 on Bilibili. Reference site: 消息理論. Reference book: An Introduction to Single-User Information Theory. Starting from axioms, the course rigorously proves Shannon's three fundamental theorems and explores the methodology behind information theory.
Introduction
This chapter is not a rigorous mathematical treatment; it is a high-level overview of Shannon's work.
Information begins with its representation as a code. What makes a code good? Compactness is a natural criterion: from a statistical viewpoint, the average bit length measures how compact a code is. But how do we measure the amount of information a code carries — how is information itself to be measured? Shannon built the mathematical foundation of information theory from three axioms:
- Monotonicity in event probability. A less probable event carries more information and has a greater impact on the receiver; a more probable event carries less.
- Additivity. The information of two independent events occurring jointly is the sum of their individual informations.
- Continuity. A small change in an event's probability causes only a small change in its information.
These three axioms also mirror how we perceive the world: the world we live in is well-behaved. Just as the Gaussian distribution maximizes entropy, distributions that match the physical world tend not to be heavy-tailed; their measure concentrates. The three axioms above determine a unique function that measures information: $$ \text { self-information of an event }=\log _2 \frac{1}{\text { event probability }} \text { bits. } $$ This formula coincides with the definition of entropy in physics; the story goes that Shannon did not originally call this quantity entropy, and adopted the name at another's suggestion (commonly attributed to von Neumann).
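The three axioms and the resulting self-information formula can be checked numerically. A minimal sketch (the function name `self_information` is mine, not from the lecture):

```python
import math

def self_information(p: float) -> float:
    """Self-information of an event with probability p, in bits."""
    assert 0 < p <= 1
    return math.log2(1 / p)

# Monotonicity: rarer events carry more information.
assert self_information(0.01) > self_information(0.5)

# Additivity: for independent events A and B,
# I(A and B) = I(A) + I(B), since P(A and B) = P(A) * P(B).
p_a, p_b = 0.25, 0.5
joint = self_information(p_a * p_b)
assert abs(joint - (self_information(p_a) + self_information(p_b))) < 1e-12

print(self_information(0.5))  # a fair coin flip carries exactly 1.0 bit
```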
We assume the information is processed at the transmitter in a fixed pipeline:
Now that we have a measure of information, is there a code that expresses the information as compactly as possible? Shannon's first theorem answers exactly this question:
$$\text{Minimum compression rate} = \text{entropy of } (Z_1,Z_2,\cdots,Z_n)$$
Shannon answered the question by showing that compression has a fundamental limit, but he gave no concrete coding scheme. (For the channel-coding counterpart, it took nearly fifty years, until the advent of turbo codes, before Shannon's limit was first closely approached.) Notably, near the limit the compressed output $U_1,\cdots,U_n$ is asymptotically uniformly distributed and consists of i.i.d. random variables, a fact that greatly simplifies the subsequent channel-coding design and analysis.
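As a concrete aside not from the lecture: for source coding specifically, Huffman's 1952 procedure builds an optimal prefix code whose average length $L$ satisfies $H \leq L < H+1$ bits per symbol. A minimal sketch assuming a known finite distribution (all function names are mine):

```python
import heapq
import math

def entropy_bits(probs):
    """Entropy H(p) in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def huffman_lengths(probs):
    """Codeword lengths of a binary Huffman code for the given distribution."""
    # Heap items: (probability, tiebreaker, symbol indices merged so far).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tie = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:          # every merge adds one bit to these symbols
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, tie, s1 + s2))
        tie += 1
    return lengths

probs = [0.5, 0.25, 0.125, 0.125]  # dyadic, so Huffman meets entropy exactly
H = entropy_bits(probs)
L = sum(p * l for p, l in zip(probs, huffman_lengths(probs)))
assert H <= L < H + 1              # Shannon's bound: H <= average length < H + 1
```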
Shannon's second theorem studies how information behaves when transmitted over a noisy channel. The notion of "reliable transmission" appears in many fields, but a rigorous notion of reliability does not mean zero error; it means an asymptotically vanishing error probability, describable in the $\epsilon$-$\delta$ sense. The error can never be exactly zero, because Shannon studies coding and communication in the large-blocklength regime.
So after the source compresses the original information $Z$ into $U$, we must encode it further, adding redundancy so that even after the channel corrupts it with noise, the receiver's decoder can still recover the original information. This is the role of the channel encoder. Source coding and channel coding are usually studied separately, though some researchers treat the two jointly.
Is there really a code that drives the error probability asymptotically to zero? Shannon's second theorem answers this question: $$ \text{Channel capacity is the max reliable transmission code rate for a noisy channel.} $$ Shannon defined mutual information mathematically and used it to prove the theorem; as with the first theorem, he exhibited no concrete code, only a proof that codes achieving this performance exist. (This description is not rigorous, but it gives a quick picture of Shannon's work.)
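For a concrete instance of channel capacity (a standard textbook example, not worked out in this overview): the binary symmetric channel with crossover probability $\varepsilon$ has capacity $C = 1 - H_b(\varepsilon)$, where $H_b$ is the binary entropy function. A small sketch:

```python
import math

def h2(p: float) -> float:
    """Binary entropy function in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(eps: float) -> float:
    """Capacity of a binary symmetric channel with crossover probability eps."""
    return 1.0 - h2(eps)

assert bsc_capacity(0.0) == 1.0       # noiseless: 1 bit per channel use
assert bsc_capacity(0.5) == 0.0       # pure noise: nothing gets through
assert 0.0 < bsc_capacity(0.11) < 1.0
```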
As for Shannon's third theorem, more terminology must be introduced before it can be presented.
Shannon's information theory is more than these three theorems; behind their proofs lies a methodology: do not exhibit a concrete construction, but prove mathematically that one exists. This methodology has since been absorbed into many other disciplines.
Overview on Suprema and Limits
Definition 1(Upper bound of a set).
$\mathcal{A} \subset \mathbb{R}$ is bounded above $\Longleftrightarrow(\exists u \in \mathbb{R})$ such that $(\forall a \in \mathcal{A}), a \leq u$.
Definition 2 (Least upper bound or supremum). Suppose $\mathcal{A}$ is a non-empty subset of $\mathbb{R}$. Then, we say that a real number $s$ is a least upper bound or supremum of $\mathcal{A}$ if $s$ is an upper bound of the set $\mathcal{A}$ and if $s \leq s^{\prime}$ for each upper bound $s^{\prime}$ of $\mathcal{A}$. In this case, we write $s=\sup \mathcal{A}$; other notations are $s=\sup _{x \in \mathcal{A}} x$ and $s=\sup \{x: x \in \mathcal{A}\}$.
Completeness Axiom 3(Least upper bound property). Let $\mathcal{A}$ be a non-empty subset of $\mathbb{R}$ that is bounded above. Then $\mathcal{A}$ has a least upper bound.
Definition 3 (Maximum). If $\sup \mathcal{A} \in \mathcal{A}$, then $\sup \mathcal{A}$ is also called the maximum of $\mathcal{A}$ and is denoted by $\max \mathcal{A}$. However, if $\sup \mathcal{A} \notin \mathcal{A}$, then we say that the maximum of $\mathcal{A}$ does not exist.
Property 4 (Properties of the supremum).
- The supremum of any set in $\mathbb{R} \cup \{-\infty, \infty\}$ always exists.
- If $-\infty<\sup \mathcal{A}<\infty$, then $(\forall \varepsilon>0)\left(\exists a_0 \in \mathcal{A}\right)\ a_0>\sup \mathcal{A}-\varepsilon$.
- If $\sup \mathcal{A}=\infty$, then $(\forall L \in \mathbb{R})\left(\exists a_0 \in \mathcal{A}\right)\ a_0>L$.
- If $\sup \mathcal{A}=-\infty$, then $\mathcal{A}$ is empty.
The definitions and properties of the infimum and minimum are entirely analogous and are omitted here.
Property 5 (Properties of the supremum and infimum). For non-empty $A,B\subset \mathbb{R}$, with $A+B:=\{a+b: a\in A, b\in B\}$ and similarly for $kA$ and $A\cdot B$:
- $\sup (A+B) = \sup(A)+\sup(B)$
- $\sup(kA) = k\sup(A)$ for $k \geq 0$
- $\sup(A) = -\inf(-A)$
- $\sup(A\cdot B) = \sup(A)\cdot\sup(B)$ when $A$ and $B$ consist of nonnegative reals
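For finite sets the supremum is simply the maximum, so the first three identities can be sanity-checked directly (a toy numerical check, not a proof):

```python
# For finite sets of reals, sup is max, so the identities can be checked directly.
A = {-1.5, 0.25, 2.0}
B = {0.5, 3.0}

sum_set = {a + b for a in A for b in B}
assert max(sum_set) == max(A) + max(B)         # sup(A+B) = sup A + sup B

k = 2.5                                        # the identity needs k >= 0
assert max({k * a for a in A}) == k * max(A)   # sup(kA) = k sup A

assert max(A) == -min({-a for a in A})         # sup A = -inf(-A)
```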
In communications one frequently needs to analyze limits of sequences, yet the limit may fail to exist, which obstructs the analysis. How can we work with quantities that are guaranteed to exist?
Definition 6 (limsup and liminf). The limit supremum of $\left\{a_n\right\}_{n=1}^{\infty}$ is the extended real number in $\mathbb{R} \cup\{-\infty, \infty\}$ defined by $$ \limsup _{n \rightarrow \infty} a_n:=\lim_{n \rightarrow \infty}\left(\sup _{k \geq n} a_k\right), $$ and the limit infimum of $\left\{a_n\right\}_{n=1}^{\infty}$ is the extended real number defined by $$ \liminf _{n \rightarrow \infty} a_n:=\lim _{n \rightarrow \infty}\left(\inf _{k \geq n} a_k\right). $$ Some also use the notations $\overline{\lim }$ and $\underline{\mathrm{lim}}$ to denote limsup and liminf, respectively.
Lemma 7 (Limit). For a sequence $\left\{a_n\right\}_{n=1}^{\infty}$, $$ \lim _{n \rightarrow \infty} a_n=L \Longleftrightarrow \limsup _{n \rightarrow \infty} a_n=\liminf _{n \rightarrow \infty} a_n=L $$
Property 8. Roughly: if both sequences are large the sum is large, if both are small the sum is small, and the inner bounds avoid coupling the two sequences: $$ \begin{aligned} \liminf _{n \rightarrow \infty} a_n+\liminf _{n \rightarrow \infty} b_n & \leq \liminf _{n \rightarrow \infty}\left(a_n+b_n\right) \\ & \leq \limsup _{n \rightarrow \infty} a_n+\liminf _{n \rightarrow \infty} b_n \\ & \leq \limsup _{n \rightarrow \infty}\left(a_n+b_n\right) \\ & \leq \limsup _{n \rightarrow \infty} a_n+\limsup _{n \rightarrow \infty} b_n . \end{aligned} $$
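A quick numerical illustration of Definition 6: the sequence $a_n=(-1)^n+1/n$ has no limit, but its limsup and liminf exist (1 and $-1$). Approximating $\sup_{k\ge n} a_k$ by the maximum over a long finite tail:

```python
# a_n = (-1)^n + 1/n has no limit, but limsup = 1 and liminf = -1.
def a(n: int) -> float:
    return (-1) ** n + 1 / n

# Approximate sup_{k >= n} a_k and inf_{k >= n} a_k over a long finite tail.
tail = [a(k) for k in range(1000, 10001)]
approx_limsup = max(tail)
approx_liminf = min(tail)

assert abs(approx_limsup - 1.0) < 1e-2
assert abs(approx_liminf + 1.0) < 1e-2
```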
Overview of Probability and Random Processes
Random Variables and Random Processes
Def 1(r.v.). A random variable $X$ defined over the probability space $(\Omega, \mathcal{F}, P)$ is a real-valued function $X: \Omega \rightarrow \mathbb{R}$ that is measurable (or $\mathcal{F}$-measurable), i.e., satisfying the property that
$$
X^{-1}((-\infty, t]):=\{\omega \in \Omega: X(\omega) \leq t\} \in \mathcal{F}
$$
for each real $t$.
Remark. A random variable is precisely an $\mathcal{F}$-measurable map.
Def 2 (Borel sets). The Borel $\sigma$-field of $\mathbb{R}$, denoted by $\mathscr{B}(\mathbb{R})$, is the smallest $\sigma$-field of subsets of $\mathbb{R}$ containing all open intervals in $\mathbb{R}$. The elements of $\mathscr{B}(\mathbb{R})$ are called Borel sets.
Def 3 (Distribution induced by a r.v.). For any random variable $X$, we use $P_X$ to denote the probability distribution on $\mathscr{B}(\mathbb{R})$ induced by $X$, given by
$$
P_X(B):=\operatorname{Pr}[X \in B]=P(\{\omega \in \Omega: X(\omega) \in B\}), \quad B \in \mathscr{B}(\mathbb{R})
$$
Note that the quantities $P_X(B), B \in \mathscr{B}(\mathbb{R})$, fully characterize the random variable $X$ as they determine the probabilities of all events that concern $X$.
Remark. In many probability texts, $P_X$ is reserved for the distribution induced by the random variable, operating on measurable subsets of the range of $X$; $P$ denotes the measure on the original probability space, operating on $\mathcal{F}$; and $\operatorname{Pr}$ emphasizes the probability of an event concerning a random variable (read "the probability that").
We do not define random processes rigorously here; we emphasize discrete-time processes, i.e., a collection of random variables $\{X_i\mid i\in I\}$ indexed by a sequence $I$.
Properties of random processes
Definition 1 ($\mathbb{T}$-invariant). An event $E$ is said to be $\mathbb{T}$-invariant with respect to the left-shift (or shift transformation) $\mathbb{T}: \mathcal{X}^{\infty} \rightarrow \mathcal{X}^{\infty}$ if $$ \mathbb{T} E \subseteq E, $$ where $$ \mathbb{T} E:=\{\mathbb{T} \boldsymbol{x}: \boldsymbol{x} \in E\} \quad \text { and } \quad \mathbb{T} \boldsymbol{x}:=\mathbb{T}\left(x_1, x_2, x_3, \ldots\right)=\left(x_2, x_3, \ldots\right) . $$ Remark. One can also define the preimage map $\mathbb{T}^{-1} E:=\left\{\boldsymbol{x} \in \mathcal{X}^{\infty}: \mathbb{T} \boldsymbol{x} \in E\right\}$. The discussion here is aimed at communication systems, so it emphasizes shifting events and shift invariance; from the viewpoint of probability theory or ergodic theory, the more unified approach is to define a $\mathbb{T}$-invariant measure.
Def 2 (Ergodic sets). An event $E$ is called an ergodic set if $\mathbb{T} E = E = \mathbb{T}^{-1} E$.
Def 3 (Classification of random processes)
- Memoryless: A random process or a source $X$ is said to be memoryless if the sequence of random variables $X_i$ is i.i.d.
- Stationary process: A process is said to be stationary (or strictly stationary) if the probability of every sequence or event is unchanged by a left (time) shift.
- Ergodic process: A process is said to be ergodic if any ergodic set in $\mathcal{F}_X$ has probability either 1 or 0.
Remark. Ergodicity and stationarity are logically independent properties. Stationarity says the distribution does not change over time (identically distributed under shifts); ergodicity says the system is closed and indecomposable, which lets us infer the distribution of a random process by observing it from any starting point, provided we observe long enough (via the SLLN). Ergodicity alone guarantees that sample averages converge to a constant, but that constant need not be the expectation.
Theorem 4 (Pointwise ergodic theorem) Consider a discrete-time stationary random process, $\boldsymbol{X}=\left\{X_n\right\}_{n=1}^{\infty}$. For a real-valued function $f(\cdot)$ on $\mathbb{R}$ with finite mean (i.e., $E\left[\left|f\left(X_n\right)\right|\right]<\infty$), there exists a random variable $Y$ such that $$ \lim _{n \rightarrow \infty} \frac{1}{n} \sum_{k=1}^n f\left(X_k\right)=Y \quad \text { with probability } 1 . $$ If, in addition to stationarity, the process is also ergodic, then $$ \lim _{n \rightarrow \infty} \frac{1}{n} \sum_{k=1}^n f\left(X_k\right)=E\left[f\left(X_1\right)\right] \quad \text { with probability } 1 . $$ Remark. Engineering textbooks often take this theorem as the definition of ergodicity, since they invoke ergodicity only to apply the SLLN; but such a definition is narrow, and the other properties of ergodicity are hard to explore from it. Birkhoff's pointwise ergodic theorem guarantees that the time average along a single trajectory estimates the ensemble expectation, which is what makes ergodicity so important in practice.
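A simulation sketch of the ergodic case: an i.i.d. process is stationary and ergodic, so the time average of $f(X_k)$ along one trajectory should converge to $E[f(X_1)]$ (the distribution and $f$ below are my choices for illustration):

```python
import random

random.seed(0)

# An i.i.d. process is stationary and ergodic, so the time average of
# f(X_k) along a single trajectory converges a.s. to E[f(X_1)].
n = 200_000
f = lambda x: x * x            # E[f(X_1)] = E[X^2] = 1/3 for Uniform(0,1)
trajectory = [random.random() for _ in range(n)]
time_average = sum(f(x) for x in trajectory) / n

assert abs(time_average - 1 / 3) < 0.01
```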
Convergence of Sequences of Random Variables
Before introducing the laws of large numbers, it is necessary to review the modes of convergence for sequences of random variables:
Definition 1 (Convergence modes for random sequences).
- Pointwise convergence on $\Omega$. $\left\{X_n\right\}_{n=1}^{\infty}$ is said to converge to $X$ pointwise on $\Omega$ if $$ \lim _{n \rightarrow \infty} X_n(\omega)=X(\omega) \text { for all } \omega \in \Omega . $$ This notion of convergence, which is familiar to us from real analysis, is denoted by $X_n \xrightarrow{p . w .} X$. This mode does not involve the probability measure at all; it concerns the convergence of the maps $X_n$ themselves.
- Almost sure convergence or convergence with probability 1. $\left\{X_n\right\}_{n=1}^{\infty}$ is said to converge to $X$ with probability 1 , if $$ P\left\{\omega \in \Omega: \lim _{n \rightarrow \infty} X_n(\omega)=X(\omega)\right\}=1 $$ Almost sure convergence is denoted by $X_n \xrightarrow{\text { a.s. }} X$; note that it is nothing but a probabilistic version of pointwise convergence.
- Convergence in probability. $\left\{X_n\right\}_{n=1}^{\infty}$ is said to converge to $X$ in probability, if for any $\varepsilon>0$, $$ \lim _{n \rightarrow \infty} P\left\{\omega \in \Omega:\left|X_n(\omega)-X(\omega)\right|>\varepsilon\right\}=\lim _{n \rightarrow \infty} \operatorname{Pr}\left\{\left|X_n-X\right|>\varepsilon\right\}=0 $$ This mode of convergence is denoted by $X_n \xrightarrow{p} X$.
- Convergence in $r$ th mean. $\left\{X_n\right\}_{n=1}^{\infty}$ is said to converge to $X$ in $r$ th mean, if $$ \lim _{n \rightarrow \infty} E\left[\left|X-X_n\right|^r\right]=0 $$ This is denoted by $X_n \xrightarrow{\mathcal{L}_r} X$.
- Convergence in distribution. $\left\{X_n\right\}_{n=1}^{\infty}$ is said to converge to $X$ in distribution, if $$ \lim _{n \rightarrow \infty} F_{X_n}(x)=F_X(x) $$ for every continuity point of $F_X(x)$, where $$ F_{X_n}(x)=\operatorname{Pr}\left\{X_n \leq x\right\} \quad \text { and } \quad F_X(x)=\operatorname{Pr}\{X \leq x\} $$ We denote this notion of convergence by $X_n \xrightarrow{d} X$.
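These modes are genuinely different. A standard textbook counterexample (not from these notes): let $X_n$ equal $n$ with probability $1/n$ and $0$ otherwise; then $X_n \to 0$ in probability but not in first mean:

```python
# X_n takes the value n with probability 1/n and 0 otherwise.
# Then X_n -> 0 in probability, since Pr{|X_n| > eps} = 1/n -> 0,
# but NOT in first mean (L1), since E|X_n| = n * (1/n) = 1 for every n.

def prob_exceeds(n: int, eps: float = 0.5) -> float:
    """Pr{|X_n| > eps} for 0 < eps < n."""
    return 1 / n

def first_moment(n: int) -> float:
    """E|X_n| = n * Pr{X_n = n}."""
    return n * (1 / n)

assert prob_exceeds(10) > prob_exceeds(1000)              # shrinks toward 0
assert all(abs(first_moment(n) - 1.0) < 1e-9 for n in (1, 10, 100, 1000))
```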
Theorem B.9 (Monotone convergence theorem)
$\left.\begin{array}{l}
\text { (i) }X_n \xrightarrow{\text { a.s. }} X\\
\text { (ii) }(\forall n) Y \leq X_n \leq X_{n+1} \\
\text { (iii) } E[|Y|]<\infty\end{array}\right\} \Rightarrow X_n \xrightarrow{L_1} X \quad \Rightarrow \quad E\left[X_n\right] \rightarrow E[X]$.
Theorem B.10 (Dominated convergence theorem)
$\left.\begin{array}{l}\text { (i) }X_n \xrightarrow{\text { a.s. }} X\\ \text { (ii) }(\forall n)\left|X_n\right| \leq Y \\ \text { (iii) } E[|Y|]<\infty\end{array}\right\} \Rightarrow X_n \xrightarrow{L_1} X \quad \Rightarrow \quad E\left[X_n\right] \rightarrow E[X]$.
The implication of $X_n \xrightarrow{L_1} X$ to $E\left[X_n\right] \rightarrow E[X]$ can be easily seen from $$ \left|E\left[X_n\right]-E[X]\right|=\left|E\left[X_n-X\right]\right| \leq E\left[\left|X_n-X\right|\right] . $$
Observation B.8 (Uniqueness of convergence).
- If $X_n \xrightarrow{p . w .} X$ and $X_n \xrightarrow{p . w .} Y$, then $X=Y$ pointwise. That is, $(\forall \omega \in \Omega)$ $X(\omega)=Y(\omega)$.
- If $X_n \xrightarrow{\text { a.s. }} X$ and $X_n \xrightarrow{\text { a.s. }} Y$ (or $X_n \xrightarrow{p} X$ and $X_n \xrightarrow{p} Y$ ) (or $X_n \xrightarrow{\mathcal{L}_r} X$ and $X_n \xrightarrow{\mathcal{L}_r} Y$ ), then $X=Y$ with probability 1. That is, $\operatorname{Pr}\{X=Y\}=1$.
- If $X_n \xrightarrow{d} X$ and $X_n \xrightarrow{d} Y$, then $F_X(x)=F_Y(x)$ for all $x$.
Ergodicity, Laws of Large Numbers, and the CLT
Theorem B.13 (Weak law of large numbers) Let $\left\{X_n\right\}_{n=1}^{\infty}$ be a sequence of uncorrelated random variables with common mean $E\left[X_i\right]=\mu$. If the variables also have common variance, or more generally,
$$
\lim _{n \rightarrow \infty} \frac{1}{n^2} \sum_{i=1}^n \operatorname{Var}\left[X_i\right]=0,
$$
then the sample average
$$
\frac{X_1+\cdots+X_n}{n}
$$
converges to the mean $\mu$ in probability.
Proof By Chebyshev’s inequality,
$$
\operatorname{Pr}\left\{\left|\frac{X_1+\cdots+X_n}{n}-\mu\right|>\varepsilon\right\} \leq \frac{1}{\varepsilon^2} E\left[\left(\frac{X_1+\cdots+X_n}{n}-\mu\right)^2\right]=\frac{1}{\varepsilon^2 n^2} \sum_{i=1}^n \operatorname{Var}\left[X_i\right] \rightarrow 0 .
$$
Note that the right-hand side of the above Chebyshev’s inequality is just the second moment of the difference between the $n$-sample average and the mean $\mu$. Thus, the variance constraint states precisely that the $n$-sample average converges to $\mu$ in $\mathcal{L}_2$, and $\mathcal{L}_2$ convergence implies convergence in probability.
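The WLLN can be observed empirically: the frequency with which the $n$-sample average of Uniform(0,1) variables deviates from $\mu=0.5$ by more than $\varepsilon$ shrinks as $n$ grows (a Monte Carlo sketch; the parameters are my arbitrary choices):

```python
import random

random.seed(1)

def exceed_freq(n: int, eps: float = 0.05, trials: int = 2000) -> float:
    """Empirical frequency of |sample average - mu| > eps for Uniform(0,1)."""
    mu = 0.5
    count = 0
    for _ in range(trials):
        avg = sum(random.random() for _ in range(n)) / n
        if abs(avg - mu) > eps:
            count += 1
    return count / trials

# Larger n concentrates the sample average around mu = 0.5.
assert exceed_freq(10) > exceed_freq(1000)
```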
Theorem B.14 (Kolmogorov’s strong law of large numbers) Let $\left\{X_n\right\}_{n=1}^{\infty}$ be a sequence of independent random variables with common mean $E\left[X_n\right]=\mu$. If either
- $X_n$ ’s are identically distributed; or
- $X_n$ ’s are square-integrable with variances satisfying $$ \sum_{i=1}^{\infty} \frac{\operatorname{Var}\left[X_i\right]}{i^2}<\infty $$
then $$ \frac{X_1+\cdots+X_n}{n} \xrightarrow{\text { a.s. }} \mu . $$ Remark. The SLLN is a statement of the form $\Pr\{\lim \cdots\}=1$, whereas the WLLN is of the form $\lim \Pr\{\cdots\}=0$: the WLLN asserts convergence in probability of the estimator, while the SLLN asserts convergence with probability 1 (a.s.). Kolmogorov’s strong law of large numbers also applies to limits of sequences of functions of the variables.
The pointwise ergodic theorem is a form of the SLLN.
Intuitively, ergodicity means the system has no "decomposable part": any subset invariant under the transformation $\mathbb{T}$ either contains almost all sample points or almost none of them. If an event $A$ is $\mathbb{T}$-invariant, it either occurs almost surely (probability 1) or almost never (probability 0).
We can always define certain $\mathbb{T}$-invariant events; such events necessarily exist, regardless of whether they occur (the time average of a statistic converging to a constant is one such invariant event), and we can then measure them. For an ordinary event, the measurement may change with the measuring environment (different sample trajectories have different statistics); but for a $\mathbb{T}$-invariant event, the event either fails or holds for (almost) all sample points, so measurements taken in different environments agree. This is one practical meaning of the SLLN and of the pointwise ergodic theorem.
Theorem B.15 (Central limit theorem) If $\left\{X_n\right\}_{n=1}^{\infty}$ is a sequence of i.i.d. random variables with finite common marginal mean $\mu$ and variance $\sigma^2$, then
$$ \frac{1}{\sqrt{n}} \sum_{i=1}^n\left(X_i-\mu\right) \xrightarrow{d} Z \sim \mathcal{N}\left(0, \sigma^2\right) $$
where the convergence is in distribution (as $n \rightarrow \infty$ ) and $Z \sim \mathcal{N}\left(0, \sigma^2\right)$ is a Gaussian distributed random variable with mean 0 and variance $\sigma^2$.
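A Monte Carlo sketch of the CLT (the parameters are my choices): for $X_i \sim$ Uniform(0,1), $\mu = 1/2$ and $\sigma^2 = 1/12$, so the normalized sum should put roughly 68.27% of its mass within one $\sigma$ of zero:

```python
import math
import random

random.seed(2)

# S_n = X_1 + ... + X_n for X_i ~ Uniform(0,1): mu = 1/2, sigma^2 = 1/12.
# By the CLT, (S_n - n*mu)/sqrt(n) is approximately N(0, sigma^2).
n, trials = 500, 4000
mu, sigma = 0.5, math.sqrt(1 / 12)

zs = []
for _ in range(trials):
    s = sum(random.random() for _ in range(n))
    zs.append((s - n * mu) / math.sqrt(n))

# A N(0, sigma^2) variable lies within one sigma with probability ~0.6827.
within_one_sigma = sum(abs(z) <= sigma for z in zs) / trials
assert abs(within_one_sigma - 0.6827) < 0.03
```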
Lagrange Multipliers Technique and KKT Conditions
Definition B.16 (Convexity) Consider a convex set $\mathcal{O} \subseteq \mathbb{R}^m$, where $m$ is a fixed positive integer. Then, a function $f: \mathcal{O} \rightarrow \mathbb{R}$ is said to be convex over $\mathcal{O}$ if for every $\underline{x}, \underline{y}$ in $\mathcal{O}$ and $0 \leq \lambda \leq 1$,
$$
f(\lambda \underline{x}+(1-\lambda) \underline{y}) \leq \lambda f(\underline{x})+(1-\lambda) f(\underline{y})
$$
Furthermore, a function $f$ is said to be strictly convex if, for $\underline{x} \neq \underline{y}$, equality holds only when $\lambda=0$ or $\lambda=1$.
Remark. A convex set is closed under convex combinations; convexity of a function is a joint property of the function and its (convex) domain. Note that a twice-differentiable convex function has a nonnegative second derivative (positive semidefinite Hessian in higher dimensions) on its convex domain.
Definition B.17 (Concavity) A function $f$ is concave if $-f$ is convex.
Theorem B.18 (Jensen’s inequality) If $f: \mathcal{O} \rightarrow \mathbb{R}$ is convex over a convex set $\mathcal{O} \subset \mathbb{R}^m$, and $\underline{X}=\left(X_1, X_2, \ldots, X_m\right)^T$ is an m-dimensional random vector with alphabet $\mathcal{X} \subset \mathcal{O}$, then $$ \mathbb{E}[f(\underline{X})] \geq f(\mathbb{E}[\underline{X}]) $$ Moreover, if $f$ is strictly convex, then equality in the above inequality immediately implies $\underline{X}=E[\underline{X}]$ with probability 1.
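Jensen's inequality is easy to verify by simulation for a convex $f$ such as $f(x)=x^2$, where the gap $E[X^2]-(E[X])^2$ is exactly the variance (a toy check, not from the lecture):

```python
import random

random.seed(3)

f = lambda x: x * x                            # convex on the real line
xs = [random.gauss(0.0, 1.0) for _ in range(100_000)]

mean_of_f = sum(f(x) for x in xs) / len(xs)    # ~ E[X^2], close to 1
f_of_mean = f(sum(xs) / len(xs))               # ~ (E[X])^2, close to 0

# Jensen: E[f(X)] >= f(E[X]); here the gap is the variance of X.
assert mean_of_f >= f_of_mean
```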
Next, consider the optimization of a function $f(\boldsymbol{x})$ over $\boldsymbol{x}=\left(x_1, \ldots, x_n\right) \in \mathcal{X} \subseteq \mathbb{R}^n$: $$ \min _{\boldsymbol{x} \in \mathcal{Q}} f(\boldsymbol{x}), $$ where $$ \mathcal{Q}:=\left\{\boldsymbol{x} \in \mathcal{X}: g_i(\boldsymbol{x}) \leq 0 \text { for } 1 \leq i \leq m \text { and } h_j(\boldsymbol{x})=0 \text { for } 1 \leq j \leq \ell\right\}. $$ Using the Lagrange multiplier technique, this constrained problem can be converted into an unconstrained dual problem via $$ \mathcal{L}(\boldsymbol{\lambda}, \boldsymbol{\nu}):=\inf _{\boldsymbol{x} \in \mathcal{X}}\left(f(\boldsymbol{x})+\sum_{i=1}^m \lambda_i g_i(\boldsymbol{x})+\sum_{j=1}^{\ell} \nu_j h_j(\boldsymbol{x})\right). $$
This function $\mathcal{L}(\boldsymbol{\lambda}, \boldsymbol{\nu})$ is called the Lagrange dual function, and it can be shown to be concave. Here $\boldsymbol{\lambda}, \boldsymbol{\nu}$ are the Lagrange multipliers, with $\lambda_i \geq 0$ for the inequality constraints (the $\nu_j$ are unrestricted), and weak duality holds: $$ \sup _{\boldsymbol{\lambda} \geq \mathbf{0}, \boldsymbol{\nu}} \mathcal{L}(\boldsymbol{\lambda}, \boldsymbol{\nu}) \leq \min _{\boldsymbol{x} \in \mathcal{Q}} f(\boldsymbol{x}). $$
If equality holds above, we can find the solution of the original problem by optimizing the dual problem. Under certain conditions, the KKT conditions are exactly the conditions under which equality holds (i.e., strong duality):
Definition B.19 (Karush-Kuhn-Tucker (KKT) condition) Point $\boldsymbol{x}=\left(x_1\right.$, $\left.\ldots, x_n\right)$ and multipliers $\boldsymbol{\lambda}=\left(\lambda_1, \ldots, \lambda_m\right)$ and $\boldsymbol{\nu}=\left(\nu_1, \ldots, \nu_{\ell}\right)$ are said to satisfy the KKT condition if
$$
\begin{aligned}
&\nabla f(\boldsymbol{x})+\sum_{i=1}^m \lambda_i \nabla g_i(\boldsymbol{x})+\sum_{j=1}^{\ell} \nu_j \nabla h_j(\boldsymbol{x})=\mathbf{0} && \text{(stationarity)} \\
&g_i(\boldsymbol{x}) \leq 0, \quad h_j(\boldsymbol{x})=0 && \text{(primal feasibility)} \\
&\lambda_i \geq 0 && \text{(dual feasibility)} \\
&\lambda_i g_i(\boldsymbol{x})=0 \text{ for } 1 \leq i \leq m && \text{(complementary slackness)}
\end{aligned}
$$
Remark. (1) If $f$ and the $g_i$ are convex functions and the $h_j$ are affine (together with a constraint qualification such as Slater's condition), then strong duality holds $\iff$ the KKT conditions are satisfied.
(2) For arbitrary functions $f, g_i, h_j$, strong duality at an optimum $\Rightarrow$ the KKT conditions.
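A toy instance of the KKT conditions (the problem is mine, chosen so that everything can be solved by hand): minimize $x^2+y^2$ subject to $x+y=1$. Stationarity and feasibility pin down the unique KKT point, which here is also the global minimum:

```python
# Toy problem: minimize f(x, y) = x^2 + y^2 subject to h(x, y) = x + y - 1 = 0.
# Stationarity: grad f + nu * grad h = 0  =>  2x + nu = 0 and 2y + nu = 0,
# hence x = y; feasibility x + y = 1 then gives x = y = 1/2 and nu = -1.
x = y = 0.5
nu = -1.0

assert 2 * x + nu == 0 and 2 * y + nu == 0     # stationarity
assert x + y - 1 == 0                          # primal feasibility

# The KKT point is the global minimum over the feasible line (x, 1 - x).
f = lambda a, b: a * a + b * b
assert all(f(t, 1 - t) >= f(x, y) for t in [i / 100 for i in range(-100, 201)])
```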
Information Measures for Discrete Systems
On hold for now... The recordings of Prof. Po-Ning Chen's course were edited and uploaded by the university, and the result is honestly painful to watch; I will come back to this when I have time.