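Study Notes | Research Group Seminar

Advanced Probability Theory

These notes follow the advanced probability theory talks given by Wang Tuo (王拓), a senior student in the group. The lecture notes and videos are available on the seminar web page, Seminar Information.

Before starting, here are a few motivating questions. In probability theory, what exactly are an event, a probability, a random variable, the distribution of a random variable, and independence? And why, when $X_1,X_2,X_3$ are independent, is $f(X_1,X_2)$ independent of $X_3$ for suitable functions $f$?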

Probability Space

A probability measure is a map from a $\sigma$-field to $\mathbb{R}$, so before defining probability we first need to define events:

Definition 1 ( $\sigma$-algebra or $\sigma$-field). Let $\Omega$ be a non-empty set and $\mathcal{F}$ be a collection of the subsets of $\Omega$. Then we say $\mathcal{F}$ is a $\sigma$-algebra or a $\sigma$-field on $\Omega$ if it satisfies the following axioms:
（1） $\varnothing \in \mathcal{F}, \Omega \in \mathcal{F}$;
（2） If $A \in \mathcal{F}$, then $A^c \in \mathcal{F}$;
（3） If $A_i \in \mathcal{F}, i=1,2, \cdots$, then $\bigcup_{i=1}^{\infty} A_i \in \mathcal{F}$.
If $\mathcal{F}$ is a $\sigma$-algebra on $\Omega$ and $A \in \mathcal{F}$, then we say $A$ is an event, $\Omega$ is the sample space, and $(\Omega, \mathcal{F})$ is a measurable space.

Definition 2 (Probability Measure). Let $(\Omega, \mathcal{F})$ be a measurable space, a mapping $\mathbb{P}: \mathcal{F} \rightarrow \mathbb{R}$ is called a probability measure if it satisfies the following axioms:
（1） $\mathbb{P}(\varnothing)=0$;
（2） If $A_i \in \mathcal{F}, i=1,2, \cdots$ are pairwise disjoint, then $$ \mathbb{P}\left(\bigcup_{i=1}^{\infty} A_i\right)=\sum_{i=1}^{\infty} \mathbb{P}\left(A_i\right) $$
（3） $\mathbb{P}(\Omega)=1$.
And if a set function $\mu$ on $\mathcal{F}$ only satisfies （1） and （2）, we say it is a measure on $\mathcal{F}$.

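This defines a probability measure. For any measure $\mu$, the triple $(\Omega,\mathcal{F},\mu)$ is called a measure space; if $\mu$ is a probability measure, then $(\Omega,\mathcal{F},\mu)$ is a probability space.

Note. Axiom （2） of a probability measure is countable additivity: the measure of a union of infinitely many pairwise disjoint sets $A_i$ splits into a sum. Historically there was an alternative definition based on finite additivity. Countable additivity is stronger than finite additivity; in fact, countable additivity = finite additivity $+$ continuity. This continuity is continuity along monotone sequences of sets, not continuity of a function. Countable additivity makes manipulating sets much more convenient, since a probability measure is itself a set function, a map from the event field to $\mathbb{R}$.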

Proposition 3. If $\mu$ is a finitely additive and finite set function on a $\sigma$-algebra $\mathcal{F}$, then the following are equivalent:
(i) $\mu$ is countably additive;
(ii) $\mu$ is continuous from below;
(iii) $\mu$ is continuous from above;
(iv) $\mu$ is continuous from above at $\varnothing$.
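
To illustrate how these equivalences are proved, here is a sketch of (i) $\Rightarrow$ (ii) only. The auxiliary sets $B_n$ are notation introduced for this sketch and are not taken from the lecture notes. Let $A_n \in \mathcal{F}$ with $A_n \uparrow A$, and set $B_1=A_1$ and $B_n=A_n \backslash A_{n-1}$ for $n \geq 2$. The $B_n$ are pairwise disjoint, with $\bigcup_{k=1}^{n} B_k=A_n$ and $\bigcup_{k=1}^{\infty} B_k=A$, so countable additivity gives $$ \mu(A)=\sum_{k=1}^{\infty} \mu\left(B_k\right)=\lim _{n \rightarrow \infty} \sum_{k=1}^n \mu\left(B_k\right)=\lim _{n \rightarrow \infty} \mu\left(A_n\right) $$ which is exactly continuity from below.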

Definition 4 (Generated $\sigma$-algebra). Let $\mathcal{C}$ be a collection of subsets of $\Omega$. The smallest $\sigma$-algebra that contains $\mathcal{C}$ is called the $\sigma$-algebra generated by $\mathcal{C}$, denoted by $\sigma(\mathcal{C})$.
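
On a finite sample space the generated $\sigma$-algebra can be computed by brute force, which may help build intuition. The following is a minimal sketch, assuming $\Omega$ is finite (so countable unions reduce to finite unions); the function name `generate_sigma_algebra` and the example collection are illustrative choices, not part of the lecture notes.

```python
from itertools import combinations


def generate_sigma_algebra(omega, collection):
    """Return sigma(collection) on a FINITE sample space as a set of frozensets.

    Assumption: omega is finite, so closing under complements and pairwise
    unions until nothing new appears yields the smallest sigma-algebra.
    """
    omega = frozenset(omega)
    family = {frozenset(), omega} | {frozenset(s) for s in collection}
    changed = True
    while changed:
        changed = False
        current = list(family)
        # Close under complements.
        for a in current:
            comp = omega - a
            if comp not in family:
                family.add(comp)
                changed = True
        # Close under pairwise unions.
        for a, b in combinations(current, 2):
            union = a | b
            if union not in family:
                family.add(union)
                changed = True
    return family


omega = {1, 2, 3, 4}
sigma = generate_sigma_algebra(omega, [{1}, {1, 2}])
print(sorted(sorted(s) for s in sigma))
```

In this example the atoms of the generated $\sigma$-algebra are $\{1\}$, $\{2\}$ and $\{3,4\}$, so the points $3$ and $4$ can never be separated by an event.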

Definition 5 (Borel $\sigma$ - algebra). Let $\mathcal{C}$ be the collection of all open sets on $\mathbb{R}$, then the $\sigma-$ algebra generated by $\mathcal{C}$ is called the Borel $\sigma$ - algebra on $\mathbb{R}$, which is denoted by $$ \mathcal{B}(\mathbb{R})=\sigma(\mathcal{C}) $$

Random Variables

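Definition 1 (Measurable function). Let $\left(X, \Sigma_X\right)$ and $\left(Y, \Sigma_Y\right)$ be measurable spaces. A function $f: X \rightarrow Y$ is called $\Sigma_X$-$\Sigma_Y$ measurable if for every $B \in \Sigma_Y$ $$ f^{-1}(B) \in \Sigma_X $$
Remark. Measurability is always relative to the $\sigma$-fields involved: preimages of sets in the target $\sigma$-field must lie in the source $\sigma$-field.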

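Definition 2 (Random Variable). A random variable $X$ on $(\Omega, \mathcal{F})$ is a measurable function $X:(\Omega, \mathcal{F}) \rightarrow(\mathbb{R}, \mathcal{B}(\mathbb{R}))$, namely for all $B \in \mathcal{B}(\mathbb{R})$, $$ X^{-1}(B)=\{\omega \in \Omega \mid X(\omega) \in B\} \in \mathcal{F} $$
Remark. As a map, $X:\Omega\rightarrow\mathbb{R}$, but we usually write $X:(\Omega, \mathcal{F}) \rightarrow(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ because $X$ is not just a map: it preserves the measurable structure, in the sense that preimages of measurable sets are measurable.

Next we consider what the distribution of a random variable is. Elementary probability often identifies the distribution with the distribution function. In rigorous probability theory, the distribution is the probability measure induced by the random variable. More broadly, the law that a random variable follows is also called its distribution. For example, when sampling from a complicated distribution, "distribution" may refer not only to the probability measure itself but to any other object that determines it.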

Definition 3 (Distribution of Random Variables). The distribution or law of a random variable $X$ is the probability measure $\mu_X=\mathbb{P} X^{-1}$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ defined as $$ \mu_X(A):=\left(\mathbb{P} X^{-1}\right)(A)=\mathbb{P}\left(X^{-1}(A)\right)=\mathbb{P}(X \in A), \quad \forall A \in \mathcal{B}(\mathbb{R}) $$
Note. The distribution of a random variable $X$ is a measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ induced by $X$: $$ X:(\Omega, \mathcal{F}, \mathbb{P}) \rightarrow\left(\mathbb{R}, \mathcal{B}(\mathbb{R}), \mu_X\right) $$ hence it is also called the image measure induced by $X$.
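
The image measure can be computed explicitly when $\Omega$ is finite. The sketch below is a toy example under stated assumptions: $\Omega$ is the set of ordered outcomes of two fair dice with the uniform measure, and $X$ is their sum. The helper names `law` and `mu_X_of` are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Assumed toy probability space: two fair dice, uniform probability measure.
omega = list(product(range(1, 7), repeat=2))
P = {w: Fraction(1, 36) for w in omega}


def X(w):
    """Random variable: the sum of the two dice."""
    return w[0] + w[1]


def law(X, P):
    """Push P forward through X: mu_X({x}) = P(X^{-1}({x}))."""
    mu = defaultdict(Fraction)  # Fraction() is 0
    for w, p in P.items():
        mu[X(w)] += p
    return dict(mu)


mu_X = law(X, P)


def mu_X_of(A):
    """mu_X(A) = P(X in A) for a set A of real values."""
    return sum((p for x, p in mu_X.items() if x in A), Fraction(0))


print(mu_X[7], mu_X_of({2, 12}))
```

Note that `mu_X` lives on the values of $X$, not on $\Omega$, which is the point of the definition.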

Monotone Class Theorem(set-theoretic version)

If we want to show that every set in a $\sigma$-algebra $\mathcal{F}$ satisfies a property $P$, we can define a “good set”: $$ \mathcal{G}=\{A \in \mathcal{F} ; A \text { satisfies property } P\} \subseteq \mathcal{F} $$ and then we just need to show that $\mathcal{G}=\mathcal{F}$. Usually it is given, or easy to prove, that a sub-class $\mathcal{C}$ which generates $\mathcal{F}$ satisfies property $P$, that is to say, $$ \mathcal{G} \supset \mathcal{C} $$ Hence if we can show that $\mathcal{G}$ is a $\sigma$-algebra, then we have $$ \mathcal{G}=\sigma(\mathcal{G}) \supset \sigma(\mathcal{C})=\mathcal{F} $$ from which it follows that $\mathcal{G}=\mathcal{F}$, and we get the desired result. The following theorem is an example, and it serves as the theoretical guarantee for the uniqueness part of the Carathéodory extension theorem.

Theorem 1. If $\mathbb{P}$ and $\mathbb{Q}$ are both probability measures on $(\Omega, \mathcal{F})$ and $\mathcal{C}$ is a sub-class of $\mathcal{F}$, closed under finite intersections, such that $\sigma(\mathcal{C})=\mathcal{F}$ and $$ \mathbb{P}(A)=\mathbb{Q}(A), \quad \forall A \in \mathcal{C} $$ then $\mathbb{P} \equiv \mathbb{Q}$ on $\mathcal{F}$.

In most cases it is quite hard to prove directly that a collection of subsets is a $\sigma$-algebra. Below we introduce a convenient method for doing so.

Definition 2($\pi$ - system). Let $\mathcal{C}$ be a collection of subsets of $\Omega$, we say $\mathcal{C}$ is a $\pi$-system if $A, B \in \mathcal{C}$, then $A \cap B \in \mathcal{C}$.

Definition 3($\lambda$-system). Let $\mathcal{C}$ be a collection of subsets of $\Omega$, we say $\mathcal{C}$ is a $\lambda$-system if
(1) $\Omega \in \mathcal{C}$;
(2) If $A, B \in \mathcal{C}, A \subset B$, then $B \backslash A \in \mathcal{C}$;
(3) If $A_n \in \mathcal{C}, n=1,2, \cdots$ and $A_n \uparrow A$, then $A \in \mathcal{C}$.

Theorem 4 $(\pi+\lambda=\sigma)$. Let $\mathcal{C}$ be a collection of subsets of $\Omega$. If $\mathcal{C}$ is both a $\pi$-system and a $\lambda$-system, then $\mathcal{C}$ is a $\sigma$-algebra.

Definition 5. Let $\mathcal{C}$ be a collection of subsets of $\Omega$. We say the smallest $\lambda-$ system that contains $\mathcal{C}$ is the $\lambda-$ system generated by $\mathcal{C}$ and denoted by $\lambda(\mathcal{C})$.

The following theorem is the famous $\pi-\lambda$ theorem given by Dynkin.

Theorem 1.3 ( $\pi-\lambda$ Theorem). Let $\mathcal{C}$ be a collection of subsets of $\Omega$. If $\mathcal{C}$ is a $\pi$-system, then $$ \lambda(\mathcal{C})=\sigma(\mathcal{C}) $$
To use the $\pi-\lambda$ theorem, suppose the good set $\mathcal{G}$ we defined is a $\lambda$-system, and $\mathcal{C}$ is a $\pi$-system whose sets all satisfy the desired property $P$, with $\sigma(\mathcal{C})=\mathcal{F}$, so that $$ \mathcal{G} \supset \mathcal{C} $$ Then we have $$ \mathcal{G}=\lambda(\mathcal{G}) \supset \lambda(\mathcal{C})=\sigma(\mathcal{C})=\mathcal{F} $$ hence we get the desired result. The gain is that checking the $\lambda$-system axioms for $\mathcal{G}$ is usually much easier than checking the $\sigma$-algebra axioms.

Theorem 1 can likewise be proved using the $\pi-\lambda$ theorem.
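
A sketch of that argument, assuming in addition that the generating class $\mathcal{C}$ of Theorem 1 is a $\pi$-system (without this the $\pi-\lambda$ theorem does not apply). Take the good set $$ \mathcal{G}=\{A \in \mathcal{F} ; \mathbb{P}(A)=\mathbb{Q}(A)\} \supset \mathcal{C} $$ and check the three $\lambda$-system axioms:
(1) $\Omega \in \mathcal{G}$, since $\mathbb{P}(\Omega)=\mathbb{Q}(\Omega)=1$;
(2) if $A, B \in \mathcal{G}$ and $A \subset B$, then $\mathbb{P}(B \backslash A)=\mathbb{P}(B)-\mathbb{P}(A)=\mathbb{Q}(B)-\mathbb{Q}(A)=\mathbb{Q}(B \backslash A)$;
(3) if $A_n \in \mathcal{G}$ and $A_n \uparrow A$, then by continuity from below $\mathbb{P}(A)=\lim_{n} \mathbb{P}(A_n)=\lim_{n} \mathbb{Q}(A_n)=\mathbb{Q}(A)$.
Hence $\mathcal{G}$ is a $\lambda$-system containing the $\pi$-system $\mathcal{C}$, and $$ \mathcal{G}=\lambda(\mathcal{G}) \supset \lambda(\mathcal{C})=\sigma(\mathcal{C})=\mathcal{F} $$ so $\mathbb{P}=\mathbb{Q}$ on all of $\mathcal{F}$.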

Independence

This section starts from the independence of events and then studies the independence of random variables.

Definition 1 (Independence of events and of families of sets). Let $A_t \in \mathcal{F}, t \in T$. We say $A_t, t \in T$ are independent if for all $n \in \mathbb{N}_{+}$ and all choices of distinct $t_1, \cdots, t_n \in T$, we have $$ \mathbb{P}\left(A_{t_1} \cap A_{t_2} \cap \cdots \cap A_{t_n}\right)=\mathbb{P}\left(A_{t_1}\right) \mathbb{P}\left(A_{t_2}\right) \cdots \mathbb{P}\left(A_{t_n}\right) $$ Let $\mathcal{E}_t \subseteq \mathcal{F}, t \in T$ be a family of collections of events. We say $\mathcal{E}_t, t \in T$ are independent if for all choices $A_t \in \mathcal{E}_t$, the events $A_t, t \in T$ are independent.

Definition 2 ($\sigma$-algebra generated by a random variable). Let $X:(\Omega, \mathcal{F}) \rightarrow(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ be a random variable, i.e. a measurable map. The $\sigma$-algebra generated by $X$ is denoted $\sigma(X)$ and defined as $$ \sigma(X)=\sigma\left(X^{-1}(B) \mid B \in \mathcal{B}(\mathbb{R})\right) $$ That is, $\sigma(X)$ is the smallest $\sigma$-algebra generated by all sets of the form $X^{-1}(B)$, where $B$ ranges over the Borel $\sigma$-algebra on $\mathbb{R}$.

With independence of families of sets and generated $\sigma$-algebras in hand, we can now define independence of random variables:

Definition 3 (Independence of random variables). Let $X_t, t \in T$ be a family of random variables. We say $X_t, t \in T$ are independent if $\sigma\left(X_t\right), t \in T$ are independent.

Question. Recall that in classical probability theory, we say random variables $X_t, t \in T$ are independent if for all $n \in \mathbb{N}_{+}$, all choices of distinct $t_1, \cdots, t_n \in T$ and all $x_1, \cdots, x_n \in \mathbb{R}$, $$ \mathbb{P}\left(X_{t_1} \leq x_1, \cdots, X_{t_n} \leq x_n\right)=\mathbb{P}\left(X_{t_1} \leq x_1\right) \cdots \mathbb{P}\left(X_{t_n} \leq x_n\right) $$ So what is the relationship between the two definitions? And what are the advantages of Definition 3?

It is easy to see that, by choosing the particular events $A_{t_i}=\{X_{t_i} \leq x_i\}$, Definition 3 implies the definition from elementary probability theory. In fact, the following theorem shows that the two definitions are equivalent:

Theorem 4 (Independent Class Extension Theorem). If $\mathcal{E}_t, t \in T$ are independent, then $\lambda\left(\mathcal{E}_t\right), t \in T$ are independent. Moreover, if $\mathcal{E}_t, t \in T$ are $\pi$-systems, then $\sigma\left(\mathcal{E}_t\right), t \in T$ are independent.

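For all $t \in T$, let $$ \mathcal{E}_t=\left\{\left(-\infty, x_t\right] ; x_t \in \mathbb{R}\right\} $$ Clearly, $\mathcal{E}_t$ is a $\pi$-system and $\sigma\left(\mathcal{E}_t\right)=\mathcal{B}(\mathbb{R})$, for all $t \in T$. Consequently the preimage class $X_t^{-1}\left(\mathcal{E}_t\right)=\left\{\left\{X_t \leq x_t\right\} ; x_t \in \mathbb{R}\right\}$ is a $\pi$-system in $\mathcal{F}$ with $\sigma\left(X_t^{-1}\left(\mathcal{E}_t\right)\right)=\sigma\left(X_t\right)$, and the elementary definition says exactly that these classes are independent. By the independent class extension theorem, the definition from elementary probability theory therefore implies Definition 3. Since Definition 3 in turn implies the elementary definition, the two are equivalent.

We can now explain why independent random variables remain independent after being passed through maps with a "suitable property", namely measurability.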
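
The equivalence can be checked by exhaustive enumeration on a small finite space. The sketch below assumes a toy space of two independent uniform draws from $\{1,2,3\}$; the names `sigma_of`, `cdf_factorizes` and `sigma_independent` are hypothetical helpers written for this illustration. Because the ranges are finite, $\sigma(Z)$ is enumerated as all preimages of subsets of the range, and the distribution-function condition is only checked at points of the range.

```python
from fractions import Fraction
from itertools import chain, combinations, product

# Assumed toy space: two independent uniform draws from {1, 2, 3}.
omega = list(product((1, 2, 3), repeat=2))
P = {w: Fraction(1, 9) for w in omega}


def prob(event):
    return sum((P[w] for w in event), Fraction(0))


def subsets(values):
    vals = list(values)
    return chain.from_iterable(combinations(vals, r) for r in range(len(vals) + 1))


def sigma_of(Z):
    """sigma(Z) on a finite space: preimages Z^{-1}(B) for every subset B of the range."""
    rng = {Z(w) for w in omega}
    return {frozenset(w for w in omega if Z(w) in B) for B in subsets(rng)}


def cdf_factorizes(U, V):
    """Elementary definition: P(U <= u, V <= v) = P(U <= u) P(V <= v)."""
    us = sorted({U(w) for w in omega})
    vs = sorted({V(w) for w in omega})
    return all(
        prob([w for w in omega if U(w) <= u and V(w) <= v])
        == prob([w for w in omega if U(w) <= u]) * prob([w for w in omega if V(w) <= v])
        for u in us
        for v in vs
    )


def sigma_independent(U, V):
    """Definition 3: every A in sigma(U) is independent of every B in sigma(V)."""
    return all(prob(A & B) == prob(A) * prob(B) for A in sigma_of(U) for B in sigma_of(V))


X = lambda w: w[0]          # first draw
Y = lambda w: w[1]          # second draw
S = lambda w: w[0] + w[1]   # depends on both draws

print(cdf_factorizes(X, Y), sigma_independent(X, Y))
print(cdf_factorizes(X, S), sigma_independent(X, S))
```

In this example both criteria agree for the independent pair $(X, Y)$ and both fail for the dependent pair $(X, S)$.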

Theorem 5. Let $X_t:(\Omega, \mathcal{F}) \rightarrow\left(E_t, \mathcal{E}_t\right), t \in T$ be measurable and independent. Then for any measurable functions $g_t:\left(E_t, \mathcal{E}_t\right) \rightarrow(\mathbb{R}, \mathcal{B}(\mathbb{R}))$, the random variables $g_t\left(X_t\right), t \in T$ are independent.

Proof. For all $n \in \mathbb{N}_{+}$, all choices of $t_1, \cdots, t_n \in T$ and $B_{t_1}, \cdots, B_{t_n} \in \mathcal{B}(\mathbb{R})$, we have

$$ \begin{aligned} \mathbb{P}\left(g_{t_1}\left(X_{t_1}\right) \in B_{t_1}, \cdots, g_{t_n}\left(X_{t_n}\right) \in B_{t_n}\right) & =\mathbb{P}\left(X_{t_1} \in g_{t_1}^{-1}\left(B_{t_1}\right), \cdots, X_{t_n} \in g_{t_n}^{-1}\left(B_{t_n}\right)\right) \\ & =\mathbb{P}\left(X_{t_1} \in g_{t_1}^{-1}\left(B_{t_1}\right)\right) \cdots \mathbb{P}\left(X_{t_n} \in g_{t_n}^{-1}\left(B_{t_n}\right)\right) \\ & =\mathbb{P}\left(g_{t_1}\left(X_{t_1}\right) \in B_{t_1}\right) \cdots \mathbb{P}\left(g_{t_n}\left(X_{t_n}\right) \in B_{t_n}\right) \end{aligned} $$

QED.
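
This also answers the opening question about $f(X_1,X_2)$ and $X_3$: view the pair $(X_1,X_2)$ as a single measurable map, which is independent of $X_3$ by the independent class extension theorem, and then apply Theorem 5. The sketch below checks one instance by exact enumeration, assuming three fair coin flips as a toy space. For discrete random variables, independence is tested here via factorization of the joint probability mass function. All function names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Assumed toy space: three independent fair coin flips (X1, X2, X3).
omega = list(product((0, 1), repeat=3))
P = {w: Fraction(1, 8) for w in omega}


def marginal(U):
    law = {}
    for w, p in P.items():
        law[U(w)] = law.get(U(w), Fraction(0)) + p
    return law


def joint(U, V):
    law = {}
    for w, p in P.items():
        key = (U(w), V(w))
        law[key] = law.get(key, Fraction(0)) + p
    return law


def independent(U, V):
    """Discrete case: the joint pmf factorizes into the product of the marginals."""
    pu, pv, puv = marginal(U), marginal(V), joint(U, V)
    return all(puv.get((u, v), Fraction(0)) == pu[u] * pv[v] for u in pu for v in pv)


f_X1_X2 = lambda w: w[0] + w[1]   # a function of (X1, X2) only
X3 = lambda w: w[2]
h_X2_X3 = lambda w: w[1] * w[2]   # shares X2 with f, so Theorem 5 does not apply

print(independent(f_X1_X2, X3))
print(independent(f_X1_X2, h_X2_X3))
```

The second check fails because the two functions share the input $X_2$, which shows that the grouping of the variables matters.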

Expectation, integral and probability density

This section introduces expectation, discusses the Radon-Nikodym theorem and the R-N derivative, and finally answers a question that elementary probability theory cannot answer.

Expectation is a Lebesgue integral; see my other blog post for details. Here we add one lemma:

Lemma 1.5 (Approximating general functions by simple functions). Let $(\Omega, \mathcal{F}, \mu)$ be a measure space and $f$ be nonnegative and $\mathcal{F}$-measurable. Then there exists a sequence of nonnegative simple measurable functions $f_n$ such that $0 \leq f_n \uparrow f$.

Note 1.4 (Notation for integrals). The integral can also be denoted by $$ \mu(f):=\int_{\Omega} f \mathrm{~d} \mu:=\int_{\Omega} f(\omega) \mathrm{d} \mu(\omega):=\int_{\Omega} f(\omega) \mu(\mathrm{d} \omega):=\int_{\Omega} \mu(\mathrm{d} \omega) f(\omega) $$ When $\mu$ is Lebesgue measure $m$, we often omit " $m$ " and denote the integral by $$ \int_{\mathbb{R}} f(x) \mathrm{d} m(x):=\int_{\mathbb{R}} f(x) \mathrm{d} x $$
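
One standard construction behind this lemma is the dyadic approximation $f_n=\min \left(\lfloor 2^n f\rfloor / 2^n, n\right)$. The lecture notes do not say which construction they use, so treat this as one possible choice. The sketch below evaluates it pointwise for an assumed example $f(x)=x^2$; `simple_approx` is a hypothetical helper name.

```python
import math


def simple_approx(f, n):
    """Return f_n = min(floor(2^n f) / 2^n, n), a simple function below f.

    f_n takes finitely many values (multiples of 2^-n up to n), and
    f_n increases to f pointwise as n grows.
    """
    def f_n(x):
        return min(math.floor(f(x) * 2**n) / 2**n, n)
    return f_n


f = lambda x: x * x                 # assumed example of a nonnegative function
points = [0.0, 0.3, 1.7, 2.5]

for n in (1, 2, 4, 8):
    print(n, [simple_approx(f, n)(x) for x in points])
```

The truncation at $n$ keeps each $f_n$ bounded, and the dyadic grid guarantees $0 \leq f-f_n \leq 2^{-n}$ wherever $f \leq n$.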

Definition 1.22 (Absolutely continuous). Let $(\Omega, \mathcal{F})$ be a measurable space, $\mu$ be a measure on $\mathcal{F}$ and $\nu$ be a signed measure on $\mathcal{F}$. If $$ \mu(A)=0 \Longrightarrow \nu(A)=0, \quad \forall A \in \mathcal{F} $$ then we say $\nu$ is absolutely continuous with respect to the measure $\mu$ and we write $\nu \ll \mu$.

Theorem 1.15 (Radon-Nikodym Theorem). Let $(\Omega, \mathcal{F})$ be a measurable space, $\mu$ be a $\sigma$-finite measure on $\mathcal{F}$ and $\nu$ be a signed measure on $\mathcal{F}$. If $$ \nu \ll \mu $$ then there exists a measurable function $f$ whose integral exists with respect to $\mu$ such that $$ \nu=f \cdot \mu $$ Moreover, $f$ is unique $\mu$-a.e. and it is called the Radon-Nikodym derivative of $\nu$ with respect to $\mu$, denoted by $$ f=\frac{\mathrm{d} \nu}{\mathrm{~d} \mu} $$

Note 1.5 (The pdf of a random variable is the R-N derivative of its induced probability measure with respect to Lebesgue measure). Recall that the law of a random variable $X$ is a probability measure $\mu_X$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ defined as $$ \mu_X(A):=\left(\mathbb{P} X^{-1}\right)(A)=\mathbb{P}\left(X^{-1}(A)\right)=\mathbb{P}(X \in A), \quad \forall A \in \mathcal{B}(\mathbb{R}) $$ If $\mu_X \ll m$ (namely if $m(A)=0$, then $\mu_X(A)=0$ ), then the Radon-Nikodym derivative $\frac{\mathrm{d} \mu_X}{\mathrm{~d} m}$ exists and we say the law of the random variable $X$ has density $$ p_X=\frac{\mathrm{d} \mu_X}{\mathrm{~d} m} $$ Then for $A\in \mathcal{B}(\mathbb{R})$, $$ \mu_X(A) = \int_A p_X(x) \mathrm{d} m(x) = \int_A p_X(x) \mathrm{d} x $$
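
As a numerical sanity check of $\mu_X(A)=\int_A p_X \mathrm{d} x$, the sketch below assumes $X$ is standard normal and $A=(-1,2]$. These choices are made only for illustration. It compares the value of $\mu_X(A)$ obtained from the distribution function (via `math.erf`) with a midpoint Riemann sum of the density; `integrate` is a hypothetical helper, not a library function.

```python
import math


def pdf(x):
    """Density p_X of the standard normal law (assumed example)."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)


def cdf(x):
    """Distribution function of the standard normal law, via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))


def integrate(g, a, b, steps=100_000):
    """Midpoint Riemann sum of g over [a, b]."""
    h = (b - a) / steps
    return sum(g(a + (i + 0.5) * h) for i in range(steps)) * h


a, b = -1.0, 2.0
mu_of_A = cdf(b) - cdf(a)          # mu_X((a, b]) from the law
integral = integrate(pdf, a, b)    # integral of the density over A

print(mu_of_A, integral)
```

The two numbers agree up to the discretization error of the Riemann sum.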

Theorem 1.16 (Chain Rule). Let $(\Omega, \mathcal{F}, \mu)$ be a measure space, $f, g$ be non-negative $\mathcal{F}$-measurable functions and $\nu=f \cdot \mu$. Then the integral of $g$ with respect to $\nu$ exists $\iff$ the integral of $g f$ with respect to $\mu$ exists, and then $$ \nu(g)=\mu(g f) $$

Note 1.6 (The chain rule is the chain rule for R-N derivatives). Let $$ \lambda(A)=\nu\left(g 1_A\right) $$ then by the chain rule, $$ \lambda(A)=\nu\left(g 1_A\right)=\mu\left(g f 1_A\right) $$ We have $$ g=\frac{\mathrm{d} \lambda}{\mathrm{~d} \nu}, \quad f=\frac{\mathrm{d} \nu}{\mathrm{~d} \mu}, \quad g f=\frac{\mathrm{d} \lambda}{\mathrm{~d} \mu} $$ Hence the chain rule and the Radon-Nikodym theorem show that $$ \frac{\mathrm{d} \lambda}{\mathrm{~d} \mu}=\frac{\mathrm{d} \lambda}{\mathrm{~d} \nu} \cdot \frac{\mathrm{~d} \nu}{\mathrm{~d} \mu} \quad \mu-a . e . $$

With the chain rule as a guarantee, it is very convenient in change-of-measure computations to treat R-N derivatives as formal derivatives and to calculate with measures accordingly, borrowing familiar properties from ordinary calculus.
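
On a finite space the R-N derivative is just a ratio of point masses, so the chain rule can be verified exactly. The sketch below assumes a three-point space with made-up weights; `rn_derivative` is a hypothetical helper that returns the ratio only where the reference measure charges the point, which mirrors the "a.e." qualifier.

```python
from fractions import Fraction as F

# Assumed toy measure mu on a three-point space, with densities f and g.
mu = {"a": F(1, 2), "b": F(1, 3), "c": F(1, 6)}
f = {"a": F(2), "b": F(0), "c": F(3)}
g = {"a": F(1, 4), "b": F(5), "c": F(2)}

nu = {w: f[w] * mu[w] for w in mu}    # nu = f . mu
lam = {w: g[w] * nu[w] for w in mu}   # lambda = g . nu


def rn_derivative(top, bottom):
    """Point-mass ratio d(top)/d(bottom), defined where bottom charges the point."""
    return {w: top[w] / bottom[w] for w in bottom if bottom[w] != 0}


d_nu_d_mu = rn_derivative(nu, mu)
d_lam_d_nu = rn_derivative(lam, nu)   # undefined on the nu-null point "b"
d_lam_d_mu = rn_derivative(lam, mu)

for w in mu:
    print(w, d_lam_d_mu[w], g[w] * f[w])
```

At the point `"b"`, where $\nu$ has no mass, $\frac{\mathrm{d} \lambda}{\mathrm{d} \nu}$ is not determined, yet the product formula still holds $\mu$-a.e. because $f$ vanishes there.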

Question 1.3. In basic probability theory, we know that if a random variable $X$ has density $\rho$, that is $$ \mathbb{P}(X \in B)=\int_B \rho(x) \mathrm{d} x $$ then why is the expectation of $X$ given by $$ \mathbb{E}[X]=\int_{\mathbb{R}} x \rho(x) \mathrm{d} x ? $$ Moreover, if $h: \mathbb{R} \rightarrow \mathbb{R}$ is a measurable function, then why is $$ \mathbb{E}[h(X)]=\int_{\mathbb{R}} h(x) \rho(x) \mathrm{d} x ? $$

Theorem 1.17 (Integral Transformation). Let $X:(\Omega, \mathcal{F}, \mathbb{P}) \rightarrow(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ be measurable, and $h:(\mathbb{R}, \mathcal{B}(\mathbb{R})) \rightarrow(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ be a measurable function. Then the integral of $h(X)$ with respect to the measure $\mathbb{P}$ exists if and only if the integral of $h$ with respect to the image measure $\mathbb{P} X^{-1}$ exists, and for all $B \in \mathcal{B}(\mathbb{R})$, we have $$ \int_{X^{-1}(B)} h(X) \mathrm{d} \mathbb{P}=\int_B h \mathrm{~d}\left(\mathbb{P} X^{-1}\right) $$
Proof. For $A \in \mathcal{B}(\mathbb{R})$ and $h=1_A$, $$ \int_{X^{-1}(B)} 1_A(X) \mathrm{d} \mathbb{P}=\int_{\Omega} 1_{X^{-1}(A \cap B)} \mathrm{d} \mathbb{P}=\mathbb{P}\left(X^{-1}(A \cap B)\right)=\left(\mathbb{P} X^{-1}\right)(A \cap B)=\int_{\mathbb{R}} 1_{A \cap B} \mathrm{~d}\left(\mathbb{P} X^{-1}\right)=\int_B h \mathrm{~d}\left(\mathbb{P} X^{-1}\right) $$ Then by the typical method in measure theory (extend from indicators to simple functions by linearity, to non-negative measurable functions by monotone approximation, and to general $h$ via $h=h^{+}-h^{-}$), we get the desired result. QED

Using this theorem, we can answer the question raised above: $$ \mathbb{E}[h(X)]=\int_{\Omega} h(X) \mathrm{d} \mathbb{P}=\int_{\mathbb{R}} h(x) \mathrm{d}\left(\mathbb{P} X^{-1}\right)(x)=\int_{\mathbb{R}} h(x) \frac{\mathrm{d}\left(\mathbb{P} X^{-1}\right)}{\mathrm{d} m}(x) \mathrm{~d} m(x)=\int_{\mathbb{R}} h(x) \rho(x) \mathrm{d} x . $$ Here the second equality is Theorem 1.17 with $B=\mathbb{R}$, and the third is the chain rule applied to $\mathbb{P} X^{-1}=\rho \cdot m$.
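
The first step of this computation, $\int_{\Omega} h(X) \mathrm{d} \mathbb{P}=\int_{\mathbb{R}} h \mathrm{~d}\left(\mathbb{P} X^{-1}\right)$, can be verified exactly in a discrete setting where both integrals are finite sums. The sketch below assumes a toy space of two fair dice with $X$ their sum and $h(x)=x^2$. None of these choices come from the lecture notes.

```python
from fractions import Fraction
from itertools import product

# Assumed toy probability space: two fair dice with the uniform measure.
omega = list(product(range(1, 7), repeat=2))
P = {w: Fraction(1, 36) for w in omega}


def X(w):
    return w[0] + w[1]


def h(x):
    return x * x


# Left side: integrate h(X) over the sample space with respect to P.
lhs = sum(h(X(w)) * P[w] for w in omega)

# Right side: build the image measure mu_X = P X^{-1}, then integrate h against it.
mu_X = {}
for w in omega:
    mu_X[X(w)] = mu_X.get(X(w), Fraction(0)) + P[w]
rhs = sum(h(x) * p for x, p in mu_X.items())

print(lhs, rhs)
```

The right-hand side never touches $\Omega$ once $\mu_X$ is known, which is why expectations can be computed from the law alone.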