Course Introduction
Information Theory (Spring 2019), National Chiao Tung University, taught by Po-Ning Chen. Lecture videos: 信息论(2019年春)台湾交通大学陈伯宁 on Bilibili. Reference site: 消息理論. Reference book: An Introduction to Single-User Information Theory. Starting from axioms, the course rigorously proves Shannon's three fundamental theorems and explores the methodology behind information theory.
Introduction
This chapter is not a rigorous mathematical treatment; it is a high-level overview of Shannon's work.
Information begins with its representation as a code. What makes a code good? Compactness is a natural criterion: from a statistical viewpoint, the average bit length measures how compact a code is. But how do we measure the amount of information a code carries — how is information itself to be measured? Shannon built the mathematical foundation of information theory from three axioms:
- Monotonicity in event probability. A less probable event carries more information and has a greater impact on the receiver; a more probable event carries less.
- Additivity. The information of two independent events occurring jointly is the sum of their individual informations.
- Continuity. A small change in an event's probability causes only a small change in its information.
These three axioms also mirror how we perceive the world: the world we live in is well-behaved. Just as the Gaussian distribution maximizes entropy, distributions that match the physical world tend not to be heavy-tailed; their measure concentrates. The three axioms above determine a unique function that measures information: $$ \text { self-information of an event }=\log _2 \frac{1}{\text { event probability }} \text { bits. } $$ This formula coincides with the definition of entropy in physics; the story goes that Shannon did not originally call this quantity entropy, and adopted the name at another's suggestion (commonly attributed to von Neumann).
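The three axioms and the resulting self-information formula can be checked numerically. A minimal sketch (the function name `self_information` is mine, not from the lecture):

```python
import math

def self_information(p: float) -> float:
    """Self-information of an event with probability p, in bits."""
    assert 0 < p <= 1
    return math.log2(1 / p)

# Monotonicity: rarer events carry more information.
assert self_information(0.01) > self_information(0.5)

# Additivity: for independent events A and B,
# I(A and B) = I(A) + I(B), since P(A and B) = P(A) * P(B).
p_a, p_b = 0.25, 0.5
joint = self_information(p_a * p_b)
assert abs(joint - (self_information(p_a) + self_information(p_b))) < 1e-12

print(self_information(0.5))  # a fair coin flip carries exactly 1.0 bit
```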
We assume the information is processed at the transmitter in a fixed pipeline:
Now that we have a measure of information, is there a code that expresses the information as compactly as possible? Shannon's first theorem answers exactly this question:
$$\text{Minimum compression rate} = \text{entropy of } (Z_1,Z_2,\cdots,Z_n)$$
Shannon answered the question by showing that compression has a fundamental limit, but he gave no concrete coding scheme. (For the channel-coding counterpart, it took nearly fifty years, until the advent of turbo codes, before Shannon's limit was first closely approached.) Notably, near the limit the compressed output $U_1,\cdots,U_n$ is asymptotically uniformly distributed and consists of i.i.d. random variables, a fact that greatly simplifies the subsequent channel-coding design and analysis.
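As a concrete aside not from the lecture: for source coding specifically, Huffman's 1952 procedure builds an optimal prefix code whose average length $L$ satisfies $H \leq L < H+1$ bits per symbol. A minimal sketch assuming a known finite distribution (all function names are mine):

```python
import heapq
import math

def entropy_bits(probs):
    """Entropy H(p) in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def huffman_lengths(probs):
    """Codeword lengths of a binary Huffman code for the given distribution."""
    # Heap items: (probability, tiebreaker, symbol indices merged so far).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tie = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:          # every merge adds one bit to these symbols
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, tie, s1 + s2))
        tie += 1
    return lengths

probs = [0.5, 0.25, 0.125, 0.125]  # dyadic, so Huffman meets entropy exactly
H = entropy_bits(probs)
L = sum(p * l for p, l in zip(probs, huffman_lengths(probs)))
assert H <= L < H + 1              # Shannon's bound: H <= average length < H + 1
```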
Shannon's second theorem studies how information behaves when transmitted over a noisy channel. The notion of "reliable transmission" appears in many fields, but a rigorous notion of reliability does not mean zero error; it means an asymptotically vanishing error probability, describable in the $\epsilon$-$\delta$ sense. The error can never be exactly zero, because Shannon studies coding and communication in the large-blocklength regime.
So after the source compresses the original information $Z$ into $U$, we must encode it further, adding redundancy so that even after the channel corrupts it with noise, the receiver's decoder can still recover the original information. This is the role of the channel encoder. Source coding and channel coding are usually studied separately, though some researchers treat the two jointly.
Is there really a code that drives the error probability asymptotically to zero? Shannon's second theorem answers this question: $$ \text{Channel capacity is the max reliable transmission code rate for a noisy channel.} $$ Shannon defined mutual information mathematically and used it to prove the theorem; as with the first theorem, he exhibited no concrete code, only a proof that codes achieving this performance exist. (This description is not rigorous, but it gives a quick picture of Shannon's work.)
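For a concrete instance of channel capacity (a standard textbook example, not worked out in this overview): the binary symmetric channel with crossover probability $\varepsilon$ has capacity $C = 1 - H_b(\varepsilon)$, where $H_b$ is the binary entropy function. A small sketch:

```python
import math

def h2(p: float) -> float:
    """Binary entropy function in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(eps: float) -> float:
    """Capacity of a binary symmetric channel with crossover probability eps."""
    return 1.0 - h2(eps)

assert bsc_capacity(0.0) == 1.0       # noiseless: 1 bit per channel use
assert bsc_capacity(0.5) == 0.0       # pure noise: nothing gets through
assert 0.0 < bsc_capacity(0.11) < 1.0
```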
As for Shannon's third theorem, more terminology must be introduced before it can be presented.
Shannon's information theory is more than these three theorems; behind their proofs lies a methodology: do not exhibit a concrete construction, but prove mathematically that one exists. This methodology has since been absorbed into many other disciplines.
Overview on Suprema and Limits
Definition 1(Upper bound of a set).
$\mathcal{A} \subset \mathbb{R}$ is bounded above $\Longleftrightarrow(\exists u \in \mathbb{R})$ such that $(\forall a \in \mathcal{A}), a \leq u$.
Definition 2 (Least upper bound or supremum). Suppose $\mathcal{A}$ is a non-empty subset of $\mathbb{R}$. Then, we say that a real number $s$ is a least upper bound or supremum of $\mathcal{A}$ if $s$ is an upper bound of the set $\mathcal{A}$ and if $s \leq s^{\prime}$ for each upper bound $s^{\prime}$ of $\mathcal{A}$. In this case, we write $s=\sup \mathcal{A}$; other notations are $s=\sup _{x \in \mathcal{A}} x$ and $s=\sup \{x: x \in \mathcal{A}\}$.
Completeness Axiom 3(Least upper bound property). Let $\mathcal{A}$ be a non-empty subset of $\mathbb{R}$ that is bounded above. Then $\mathcal{A}$ has a least upper bound.
Definition 3 (Maximum). If $\sup \mathcal{A} \in \mathcal{A}$, then $\sup \mathcal{A}$ is also called the maximum of $\mathcal{A}$ and is denoted by $\max \mathcal{A}$. However, if $\sup \mathcal{A} \notin \mathcal{A}$, then we say that the maximum of $\mathcal{A}$ does not exist.
Property 4 (Properties of the supremum).
- The supremum of any set in $\mathbb{R} \cup \{-\infty, \infty\}$ always exists.
- If $-\infty<\sup \mathcal{A}<\infty$, then $(\forall \varepsilon>0)\left(\exists a_0 \in \mathcal{A}\right)\ a_0>\sup \mathcal{A}-\varepsilon$.
- If $\sup \mathcal{A}=\infty$, then $(\forall L \in \mathbb{R})\left(\exists a_0 \in \mathcal{A}\right)\ a_0>L$.
- If $\sup \mathcal{A}=-\infty$, then $\mathcal{A}$ is empty.
The definitions and properties of the infimum and minimum are entirely analogous and are omitted here.
Property 5 (Properties of the supremum and infimum). For non-empty $A,B\subset \mathbb{R}$, with $A+B:=\{a+b: a\in A, b\in B\}$ and similarly for $kA$ and $A\cdot B$:
- $\sup (A+B) = \sup(A)+\sup(B)$
- $\sup(kA) = k\sup(A)$ for $k \geq 0$
- $\sup(A) = -\inf(-A)$
- $\sup(A\cdot B) = \sup(A)\cdot\sup(B)$ when $A$ and $B$ consist of nonnegative reals
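For finite sets the supremum is simply the maximum, so the first three identities can be sanity-checked directly (a toy numerical check, not a proof):

```python
# For finite sets of reals, sup is max, so the identities can be checked directly.
A = {-1.5, 0.25, 2.0}
B = {0.5, 3.0}

sum_set = {a + b for a in A for b in B}
assert max(sum_set) == max(A) + max(B)         # sup(A+B) = sup A + sup B

k = 2.5                                        # the identity needs k >= 0
assert max({k * a for a in A}) == k * max(A)   # sup(kA) = k sup A

assert max(A) == -min({-a for a in A})         # sup A = -inf(-A)
```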
In communications one frequently needs to analyze limits of sequences, yet the limit may fail to exist, which obstructs the analysis. How can we work with quantities that are guaranteed to exist?
Definition 6 (limsup and liminf). The limit supremum of $\left\{a_n\right\}_{n=1}^{\infty}$ is the extended real number in $\mathbb{R} \cup\{-\infty, \infty\}$ defined by $$ \limsup _{n \rightarrow \infty} a_n:=\lim_{n \rightarrow \infty}\left(\sup _{k \geq n} a_k\right), $$ and the limit infimum of $\left\{a_n\right\}_{n=1}^{\infty}$ is the extended real number defined by $$ \liminf _{n \rightarrow \infty} a_n:=\lim _{n \rightarrow \infty}\left(\inf _{k \geq n} a_k\right). $$ Some also use the notations $\overline{\lim }$ and $\underline{\mathrm{lim}}$ to denote limsup and liminf, respectively.
Lemma 7 (Limit). For a sequence $\left\{a_n\right\}_{n=1}^{\infty}$, $$ \lim _{n \rightarrow \infty} a_n=L \Longleftrightarrow \limsup _{n \rightarrow \infty} a_n=\liminf _{n \rightarrow \infty} a_n=L $$
Property 8. Roughly: if both sequences are large the sum is large, if both are small the sum is small, and the inner bounds avoid coupling the two sequences: $$ \begin{aligned} \liminf _{n \rightarrow \infty} a_n+\liminf _{n \rightarrow \infty} b_n & \leq \liminf _{n \rightarrow \infty}\left(a_n+b_n\right) \\ & \leq \limsup _{n \rightarrow \infty} a_n+\liminf _{n \rightarrow \infty} b_n \\ & \leq \limsup _{n \rightarrow \infty}\left(a_n+b_n\right) \\ & \leq \limsup _{n \rightarrow \infty} a_n+\limsup _{n \rightarrow \infty} b_n . \end{aligned} $$
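A quick numerical illustration of Definition 6: the sequence $a_n=(-1)^n+1/n$ has no limit, but its limsup and liminf exist (1 and $-1$). Approximating $\sup_{k\ge n} a_k$ by the maximum over a long finite tail:

```python
# a_n = (-1)^n + 1/n has no limit, but limsup = 1 and liminf = -1.
def a(n: int) -> float:
    return (-1) ** n + 1 / n

# Approximate sup_{k >= n} a_k and inf_{k >= n} a_k over a long finite tail.
tail = [a(k) for k in range(1000, 10001)]
approx_limsup = max(tail)
approx_liminf = min(tail)

assert abs(approx_limsup - 1.0) < 1e-2
assert abs(approx_liminf + 1.0) < 1e-2
```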
Overview of Probability and Random Processes
Random Variables and Random Processes
Def 1(r.v.). A random variable $X$ defined over the probability space $(\Omega, \mathcal{F}, P)$ is a real-valued function $X: \Omega \rightarrow \mathbb{R}$ that is measurable (or $\mathcal{F}$-measurable), i.e., satisfying the property that
$$
X^{-1}((-\infty, t]):=\{\omega \in \Omega: X(\omega) \leq t\} \in \mathcal{F}
$$
for each real $t$.
Remark. A random variable is precisely an $\mathcal{F}$-measurable map.
Def 2 (Borel sets). The Borel $\sigma$-field of $\mathbb{R}$, denoted by $\mathscr{B}(\mathbb{R})$, is the smallest $\sigma$-field of subsets of $\mathbb{R}$ containing all open intervals in $\mathbb{R}$. The elements of $\mathscr{B}(\mathbb{R})$ are called Borel sets.
Def 3 (Distribution induced by a r.v.). For any random variable $X$, we use $P_X$ to denote the probability distribution on $\mathscr{B}(\mathbb{R})$ induced by $X$, given by
$$
P_X(B):=\operatorname{Pr}[X \in B]=P(\{\omega \in \Omega: X(\omega) \in B\}), \quad B \in \mathscr{B}(\mathbb{R})
$$
Note that the quantities $P_X(B), B \in \mathscr{B}(\mathbb{R})$, fully characterize the random variable $X$ as they determine the probabilities of all events that concern $X$.
Remark. In many probability texts, $P_X$ is reserved for the distribution induced by the random variable, operating on measurable subsets of the range of $X$; $P$ denotes the measure on the original probability space, operating on $\mathcal{F}$; and $\operatorname{Pr}$ emphasizes the probability of an event concerning a random variable (read "the probability that").
We do not define random processes rigorously here; we emphasize discrete-time processes, i.e., a collection of random variables $\{X_i\mid i\in I\}$ indexed by a sequence $I$.
Properties of random processes
Definition 1 ($\mathbb{T}$-invariant). An event $E$ is said to be $\mathbb{T}$-invariant with respect to the left-shift (or shift transformation) $\mathbb{T}: \mathcal{X}^{\infty} \rightarrow \mathcal{X}^{\infty}$ if $$ \mathbb{T} E \subseteq E, $$ where $$ \mathbb{T} E:=\{\mathbb{T} \boldsymbol{x}: \boldsymbol{x} \in E\} \quad \text { and } \quad \mathbb{T} \boldsymbol{x}:=\mathbb{T}\left(x_1, x_2, x_3, \ldots\right)=\left(x_2, x_3, \ldots\right) . $$ Remark. One can also define the preimage map $\mathbb{T}^{-1} E:=\left\{\boldsymbol{x} \in \mathcal{X}^{\infty}: \mathbb{T} \boldsymbol{x} \in E\right\}$. The discussion here is aimed at communication systems, so it emphasizes shifting events and shift invariance; from the viewpoint of probability theory or ergodic theory, the more unified approach is to define a $\mathbb{T}$-invariant measure.
Def 2 (Ergodic sets). An event $E$ is called an ergodic set if $\mathbb{T} E = E = \mathbb{T}^{-1} E$.
Def 3 (Classification of random processes)
- Memoryless: A random process or a source $X$ is said to be memoryless if the sequence of random variables $X_i$ is i.i.d.
- Stationary process: A process is said to be stationary (or strictly stationary) if the probability of every sequence or event is unchanged by a left (time) shift.
- Ergodic process: A process is said to be ergodic if any ergodic set in $\mathcal{F}_X$ has probability either 1 or 0.
Remark. Ergodicity and stationarity are logically independent properties. Stationarity says the distribution does not change over time (identically distributed under shifts); ergodicity says the system is closed and indecomposable, which lets us infer the distribution of a random process by observing it from any starting point, provided we observe long enough (via the SLLN). Ergodicity alone guarantees that sample averages converge to a constant, but that constant need not be the expectation.
Theorem 4 (Pointwise ergodic theorem) Consider a discrete-time stationary random process, $\boldsymbol{X}=\left\{X_n\right\}_{n=1}^{\infty}$. For a real-valued function $f(\cdot)$ on $\mathbb{R}$ with finite mean (i.e., $E\left[\left|f\left(X_n\right)\right|\right]<\infty$), there exists a random variable $Y$ such that $$ \lim _{n \rightarrow \infty} \frac{1}{n} \sum_{k=1}^n f\left(X_k\right)=Y \quad \text { with probability } 1 . $$ If, in addition to stationarity, the process is also ergodic, then $$ \lim _{n \rightarrow \infty} \frac{1}{n} \sum_{k=1}^n f\left(X_k\right)=E\left[f\left(X_1\right)\right] \quad \text { with probability } 1 . $$ Remark. Engineering textbooks often take this theorem as the definition of ergodicity, since they invoke ergodicity only to apply the SLLN; but such a definition is narrow, and the other properties of ergodicity are hard to explore from it. Birkhoff's pointwise ergodic theorem guarantees that the time average along a single trajectory estimates the ensemble expectation, which is what makes ergodicity so important in practice.
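A simulation sketch of the ergodic case: an i.i.d. process is stationary and ergodic, so the time average of $f(X_k)$ along one trajectory should converge to $E[f(X_1)]$ (the distribution and $f$ below are my choices for illustration):

```python
import random

random.seed(0)

# An i.i.d. process is stationary and ergodic, so the time average of
# f(X_k) along a single trajectory converges a.s. to E[f(X_1)].
n = 200_000
f = lambda x: x * x            # E[f(X_1)] = E[X^2] = 1/3 for Uniform(0,1)
trajectory = [random.random() for _ in range(n)]
time_average = sum(f(x) for x in trajectory) / n

assert abs(time_average - 1 / 3) < 0.01
```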
Convergence of Sequences of Random Variables
Before introducing the laws of large numbers, it is necessary to review the modes of convergence for sequences of random variables:
Definition 1 (Convergence modes for random sequences).
- Pointwise convergence on $\Omega$. $\left\{X_n\right\}_{n=1}^{\infty}$ is said to converge to $X$ pointwise on $\Omega$ if $$ \lim _{n \rightarrow \infty} X_n(\omega)=X(\omega) \text { for all } \omega \in \Omega . $$ This notion of convergence, which is familiar to us from real analysis, is denoted by $X_n \xrightarrow{p . w .} X$. This mode does not involve the probability measure at all; it concerns the convergence of the maps $X_n$ themselves.
- Almost sure convergence or convergence with probability 1. $\left\{X_n\right\}_{n=1}^{\infty}$ is said to converge to $X$ with probability 1 , if $$ P\left\{\omega \in \Omega: \lim _{n \rightarrow \infty} X_n(\omega)=X(\omega)\right\}=1 $$ Almost sure convergence is denoted by $X_n \xrightarrow{\text { a.s. }} X$; note that it is nothing but a probabilistic version of pointwise convergence.
- Convergence in probability. $\left\{X_n\right\}_{n=1}^{\infty}$ is said to converge to $X$ in probability, if for any $\varepsilon>0$, $$ \lim _{n \rightarrow \infty} P\left\{\omega \in \Omega:\left|X_n(\omega)-X(\omega)\right|>\varepsilon\right\}=\lim _{n \rightarrow \infty} \operatorname{Pr}\left\{\left|X_n-X\right|>\varepsilon\right\}=0 $$ This mode of convergence is denoted by $X_n \xrightarrow{p} X$.
- Convergence in $r$ th mean. $\left\{X_n\right\}_{n=1}^{\infty}$ is said to converge to $X$ in $r$ th mean, if $$ \lim _{n \rightarrow \infty} E\left[\left|X-X_n\right|^r\right]=0 $$ This is denoted by $X_n \xrightarrow{\mathcal{L}_r} X$.
- Convergence in distribution. $\left\{X_n\right\}_{n=1}^{\infty}$ is said to converge to $X$ in distribution, if $$ \lim _{n \rightarrow \infty} F_{X_n}(x)=F_X(x) $$ for every continuity point of $F_X(x)$, where $$ F_{X_n}(x)=\operatorname{Pr}\left\{X_n \leq x\right\} \quad \text { and } \quad F_X(x)=\operatorname{Pr}\{X \leq x\} $$ We denote this notion of convergence by $X_n \xrightarrow{d} X$.
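These modes are genuinely different. A standard textbook counterexample (not from these notes): let $X_n$ equal $n$ with probability $1/n$ and $0$ otherwise; then $X_n \to 0$ in probability but not in first mean:

```python
# X_n takes the value n with probability 1/n and 0 otherwise.
# Then X_n -> 0 in probability, since Pr{|X_n| > eps} = 1/n -> 0,
# but NOT in first mean (L1), since E|X_n| = n * (1/n) = 1 for every n.

def prob_exceeds(n: int, eps: float = 0.5) -> float:
    """Pr{|X_n| > eps} for 0 < eps < n."""
    return 1 / n

def first_moment(n: int) -> float:
    """E|X_n| = n * Pr{X_n = n}."""
    return n * (1 / n)

assert prob_exceeds(10) > prob_exceeds(1000)              # shrinks toward 0
assert all(abs(first_moment(n) - 1.0) < 1e-9 for n in (1, 10, 100, 1000))
```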
Theorem B.9 (Monotone convergence theorem)
$\left.\begin{array}{l}
\text { (i) }X_n \xrightarrow{\text { a.s. }} X\\
\text { (ii) }(\forall n) Y \leq X_n \leq X_{n+1} \\
\text { (iii) } E[|Y|]<\infty\end{array}\right\} \Rightarrow X_n \xrightarrow{L_1} X \quad \Rightarrow \quad E\left[X_n\right] \rightarrow E[X]$.
Theorem B.10 (Dominated convergence theorem)
$\left.\begin{array}{l}\text { (i) }X_n \xrightarrow{\text { a.s. }} X\\ \text { (ii) }(\forall n)\left|X_n\right| \leq Y \\ \text { (iii) } E[|Y|]<\infty\end{array}\right\} \Rightarrow X_n \xrightarrow{L_1} X \quad \Rightarrow \quad E\left[X_n\right] \rightarrow E[X]$.
The implication of $X_n \xrightarrow{L_1} X$ to $E\left[X_n\right] \rightarrow E[X]$ can be easily seen from $$ \left|E\left[X_n\right]-E[X]\right|=\left|E\left[X_n-X\right]\right| \leq E\left[\left|X_n-X\right|\right] . $$
Observation B.8 (Uniqueness of convergence).
- If $X_n \xrightarrow{p . w .} X$ and $X_n \xrightarrow{p . w .} Y$, then $X=Y$ pointwise. That is, $(\forall \omega \in \Omega)$ $X(\omega)=Y(\omega)$.
- If $X_n \xrightarrow{\text { a.s. }} X$ and $X_n \xrightarrow{\text { a.s. }} Y$ (or $X_n \xrightarrow{p} X$ and $X_n \xrightarrow{p} Y$ ) (or $X_n \xrightarrow{\mathcal{L}_r} X$ and $X_n \xrightarrow{\mathcal{L}_r} Y$ ), then $X=Y$ with probability 1. That is, $\operatorname{Pr}\{X=Y\}=1$.
- If $X_n \xrightarrow{d} X$ and $X_n \xrightarrow{d} Y$, then $F_X(x)=F_Y(x)$ for all $x$.
Ergodicity, Laws of Large Numbers, and the CLT
Theorem B.13 (Weak law of large numbers) Let $\left\{X_n\right\}_{n=1}^{\infty}$ be a sequence of uncorrelated random variables with common mean $E\left[X_i\right]=\mu$. If the variables also have common variance, or more generally,
$$
\lim _{n \rightarrow \infty} \frac{1}{n^2} \sum_{i=1}^n \operatorname{Var}\left[X_i\right]=0,
$$
then the sample average
$$
\frac{X_1+\cdots+X_n}{n}
$$
converges to the mean $\mu$ in probability.
Proof By Chebyshev’s inequality,
$$
\operatorname{Pr}\left\{\left|\frac{X_1+\cdots+X_n}{n}-\mu\right|>\varepsilon\right\} \leq \frac{1}{\varepsilon^2} E\left[\left(\frac{X_1+\cdots+X_n}{n}-\mu\right)^2\right]=\frac{1}{\varepsilon^2 n^2} \sum_{i=1}^n \operatorname{Var}\left[X_i\right] \rightarrow 0 .
$$
Note that the right-hand side of the above Chebyshev’s inequality is just the second moment of the difference between the $n$-sample average and the mean $\mu$. Thus, the variance constraint states precisely that the $n$-sample average converges to $\mu$ in $\mathcal{L}_2$, and $\mathcal{L}_2$ convergence implies convergence in probability.
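The WLLN can be observed empirically: the frequency with which the $n$-sample average of Uniform(0,1) variables deviates from $\mu=0.5$ by more than $\varepsilon$ shrinks as $n$ grows (a Monte Carlo sketch; the parameters are my arbitrary choices):

```python
import random

random.seed(1)

def exceed_freq(n: int, eps: float = 0.05, trials: int = 2000) -> float:
    """Empirical frequency of |sample average - mu| > eps for Uniform(0,1)."""
    mu = 0.5
    count = 0
    for _ in range(trials):
        avg = sum(random.random() for _ in range(n)) / n
        if abs(avg - mu) > eps:
            count += 1
    return count / trials

# Larger n concentrates the sample average around mu = 0.5.
assert exceed_freq(10) > exceed_freq(1000)
```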
Theorem B.14 (Kolmogorov’s strong law of large numbers) Let $\left\{X_n\right\}_{n=1}^{\infty}$ be a sequence of independent random variables with common mean $E\left[X_n\right]=\mu$. If either
- $X_n$ ’s are identically distributed; or
- $X_n$ ’s are square-integrable with variances satisfying $$ \sum_{i=1}^{\infty} \frac{\operatorname{Var}\left[X_i\right]}{i^2}<\infty $$
then $$ \frac{X_1+\cdots+X_n}{n} \xrightarrow{\text { a.s. }} \mu . $$ Remark. The SLLN is a statement of the form $\Pr\{\lim \cdots\}=1$, whereas the WLLN is of the form $\lim \Pr\{\cdots\}=0$: the WLLN asserts convergence in probability of the estimator, while the SLLN asserts convergence with probability 1 (a.s.). Kolmogorov’s strong law of large numbers also applies to limits of sequences of functions of the variables.
The pointwise ergodic theorem is a form of the SLLN.
Intuitively, ergodicity means the system has no "decomposable part": any subset invariant under the transformation $\mathbb{T}$ either contains almost all sample points or almost none of them. If an event $A$ is $\mathbb{T}$-invariant, it either occurs almost surely (probability 1) or almost never (probability 0).
We can always define certain $\mathbb{T}$-invariant events; such events necessarily exist, regardless of whether they occur (the time average of a statistic converging to a constant is one such invariant event), and we can then measure them. For an ordinary event, the measurement may change with the measuring environment (different sample trajectories have different statistics); but for a $\mathbb{T}$-invariant event, the event either fails or holds for (almost) all sample points, so measurements taken in different environments agree. This is one practical meaning of the SLLN and of the pointwise ergodic theorem.
Theorem B.15 (Central limit theorem) If $\left\{X_n\right\}_{n=1}^{\infty}$ is a sequence of i.i.d. random variables with finite common marginal mean $\mu$ and variance $\sigma^2$, then
$$ \frac{1}{\sqrt{n}} \sum_{i=1}^n\left(X_i-\mu\right) \xrightarrow{d} Z \sim \mathcal{N}\left(0, \sigma^2\right) $$
where the convergence is in distribution (as $n \rightarrow \infty$ ) and $Z \sim \mathcal{N}\left(0, \sigma^2\right)$ is a Gaussian distributed random variable with mean 0 and variance $\sigma^2$.
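A Monte Carlo sketch of the CLT (the parameters are my choices): for $X_i \sim$ Uniform(0,1), $\mu = 1/2$ and $\sigma^2 = 1/12$, so the normalized sum should put roughly 68.27% of its mass within one $\sigma$ of zero:

```python
import math
import random

random.seed(2)

# S_n = X_1 + ... + X_n for X_i ~ Uniform(0,1): mu = 1/2, sigma^2 = 1/12.
# By the CLT, (S_n - n*mu)/sqrt(n) is approximately N(0, sigma^2).
n, trials = 500, 4000
mu, sigma = 0.5, math.sqrt(1 / 12)

zs = []
for _ in range(trials):
    s = sum(random.random() for _ in range(n))
    zs.append((s - n * mu) / math.sqrt(n))

# A N(0, sigma^2) variable lies within one sigma with probability ~0.6827.
within_one_sigma = sum(abs(z) <= sigma for z in zs) / trials
assert abs(within_one_sigma - 0.6827) < 0.03
```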
Lagrange Multipliers Technique and KKT Conditions
Definition B.16 (Convexity) Consider a convex set $\mathcal{O} \subseteq \mathbb{R}^m$, where $m$ is a fixed positive integer. Then, a function $f: \mathcal{O} \rightarrow \mathbb{R}$ is said to be convex over $\mathcal{O}$ if for every $\underline{x}, \underline{y}$ in $\mathcal{O}$ and $0 \leq \lambda \leq 1$,
$$
f(\lambda \underline{x}+(1-\lambda) \underline{y}) \leq \lambda f(\underline{x})+(1-\lambda) f(\underline{y})
$$
Furthermore, a function $f$ is said to be strictly convex if, for $\underline{x} \neq \underline{y}$, equality holds only when $\lambda=0$ or $\lambda=1$.
Remark. A convex set is closed under convex combinations; convexity of a function is a joint property of the function and its (convex) domain. Note that a twice-differentiable convex function has a nonnegative second derivative (positive semidefinite Hessian in higher dimensions) on its convex domain.
Definition B.17 (Concavity) A function $f$ is concave if $-f$ is convex.
Theorem B.18 (Jensen’s inequality) If $f: \mathcal{O} \rightarrow \mathbb{R}$ is convex over a convex set $\mathcal{O} \subset \mathbb{R}^m$, and $\underline{X}=\left(X_1, X_2, \ldots, X_m\right)^T$ is an m-dimensional random vector with alphabet $\mathcal{X} \subset \mathcal{O}$, then $$ \mathbb{E}[f(\underline{X})] \geq f(\mathbb{E}[\underline{X}]) $$ Moreover, if $f$ is strictly convex, then equality in the above inequality immediately implies $\underline{X}=E[\underline{X}]$ with probability 1.
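Jensen's inequality is easy to verify by simulation for a convex $f$ such as $f(x)=x^2$, where the gap $E[X^2]-(E[X])^2$ is exactly the variance (a toy check, not from the lecture):

```python
import random

random.seed(3)

f = lambda x: x * x                            # convex on the real line
xs = [random.gauss(0.0, 1.0) for _ in range(100_000)]

mean_of_f = sum(f(x) for x in xs) / len(xs)    # ~ E[X^2], close to 1
f_of_mean = f(sum(xs) / len(xs))               # ~ (E[X])^2, close to 0

# Jensen: E[f(X)] >= f(E[X]); here the gap is the variance of X.
assert mean_of_f >= f_of_mean
```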
Next, consider the optimization of a function $f(\boldsymbol{x})$ over $\boldsymbol{x}=\left(x_1, \ldots, x_n\right) \in \mathcal{X} \subseteq \mathbb{R}^n$: $$ \min _{\boldsymbol{x} \in \mathcal{Q}} f(\boldsymbol{x}), $$ where $$ \mathcal{Q}:=\left\{\boldsymbol{x} \in \mathcal{X}: g_i(\boldsymbol{x}) \leq 0 \text { for } 1 \leq i \leq m \text { and } h_j(\boldsymbol{x})=0 \text { for } 1 \leq j \leq \ell\right\}. $$ Using the Lagrange multiplier technique, this constrained problem can be converted into an unconstrained dual problem via $$ \mathcal{L}(\boldsymbol{\lambda}, \boldsymbol{\nu}):=\inf _{\boldsymbol{x} \in \mathcal{X}}\left(f(\boldsymbol{x})+\sum_{i=1}^m \lambda_i g_i(\boldsymbol{x})+\sum_{j=1}^{\ell} \nu_j h_j(\boldsymbol{x})\right). $$
This function $\mathcal{L}(\boldsymbol{\lambda}, \boldsymbol{\nu})$ is called the Lagrange dual function, and it can be shown to be concave. Here $\boldsymbol{\lambda}, \boldsymbol{\nu}$ are the Lagrange multipliers, with $\lambda_i \geq 0$ for the inequality constraints (the $\nu_j$ are unrestricted), and weak duality holds: $$ \sup _{\boldsymbol{\lambda} \geq \mathbf{0}, \boldsymbol{\nu}} \mathcal{L}(\boldsymbol{\lambda}, \boldsymbol{\nu}) \leq \min _{\boldsymbol{x} \in \mathcal{Q}} f(\boldsymbol{x}). $$
If equality holds above, we can find the solution of the original problem by optimizing the dual problem. Under certain conditions, the KKT conditions are exactly the conditions under which equality holds (i.e., strong duality):
Definition B.19 (Karush-Kuhn-Tucker (KKT) condition) Point $\boldsymbol{x}=\left(x_1\right.$, $\left.\ldots, x_n\right)$ and multipliers $\boldsymbol{\lambda}=\left(\lambda_1, \ldots, \lambda_m\right)$ and $\boldsymbol{\nu}=\left(\nu_1, \ldots, \nu_{\ell}\right)$ are said to satisfy the KKT condition if
$$
\begin{aligned}
&\nabla f(\boldsymbol{x})+\sum_{i=1}^m \lambda_i \nabla g_i(\boldsymbol{x})+\sum_{j=1}^{\ell} \nu_j \nabla h_j(\boldsymbol{x})=\mathbf{0} && \text{(stationarity)} \\
&g_i(\boldsymbol{x}) \leq 0, \quad h_j(\boldsymbol{x})=0 && \text{(primal feasibility)} \\
&\lambda_i \geq 0 && \text{(dual feasibility)} \\
&\lambda_i g_i(\boldsymbol{x})=0 \text{ for } 1 \leq i \leq m && \text{(complementary slackness)}
\end{aligned}
$$
Remark. (1) If $f$ and the $g_i$ are convex functions and the $h_j$ are affine (together with a constraint qualification such as Slater's condition), then strong duality holds $\iff$ the KKT conditions are satisfied.
(2) For arbitrary functions $f, g_i, h_j$, strong duality at an optimum $\Rightarrow$ the KKT conditions.
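A toy instance of the KKT conditions (the problem is mine, chosen so that everything can be solved by hand): minimize $x^2+y^2$ subject to $x+y=1$. Stationarity and feasibility pin down the unique KKT point, which here is also the global minimum:

```python
# Toy problem: minimize f(x, y) = x^2 + y^2 subject to h(x, y) = x + y - 1 = 0.
# Stationarity: grad f + nu * grad h = 0  =>  2x + nu = 0 and 2y + nu = 0,
# hence x = y; feasibility x + y = 1 then gives x = y = 1/2 and nu = -1.
x = y = 0.5
nu = -1.0

assert 2 * x + nu == 0 and 2 * y + nu == 0     # stationarity
assert x + y - 1 == 0                          # primal feasibility

# The KKT point is the global minimum over the feasible line (x, 1 - x).
f = lambda a, b: a * a + b * b
assert all(f(t, 1 - t) >= f(x, y) for t in [i / 100 for i in range(-100, 201)])
```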
Information Measures for Discrete Systems
On hold for now... The recordings of Prof. Po-Ning Chen's course were edited and uploaded by the university, and the result is honestly painful to watch; I will come back to this when I have time.