STT 861 Theory of Prob and STT I Lecture Note - 13
2017-11-29
Recap of the linear predictor; almost sure convergence, convergence in probability, convergence in distribution; central limit theorem, theorem of DeMoivre-Laplace.
Portal to all the other notes
- Lecture 01 - 2017.09.06
- Lecture 02 - 2017.09.13
- Lecture 03 - 2017.09.20
- Lecture 04 - 2017.09.27
- Lecture 05 - 2017.10.04
- Lecture 06 - 2017.10.11
- Lecture 07 - 2017.10.18
- Lecture 08 - 2017.10.25
- Lecture 09 - 2017.11.01
- Lecture 10 - 2017.11.08
- Lecture 11 - 2017.11.15
- Lecture 12 - 2017.11.20
- Lecture 13 - 2017.11.29 -> This post
- Lecture 14 - 2017.12.06
Lecture 13 - Nov 29 2017
Linear Prediction
Recall X, Y; let
g(x)=E[Y|X=x]
The r.v. g(X) is the best predictor of Y given X in the least-squares sense: g minimizes E\left((g(X)-Y)^2\right) = MSE.
But what about making this MSE as small as possible when g is restricted to be linear? Use the notation h(x)=a+bx (instead of g).
We want to minimize
MSE=E\left((Y-(a+bX))^2\right)
Find a and b to make this as small as possible. Let
Z_Y=\frac{Y-\mu_Y}{\sigma_Y}, \quad Z_X=\frac{X-\mu_X}{\sigma_X}
We also know E(Z_X Z_Y)=Corr(X,Y)=\rho, and
Y-(a+bX)=(Z_Y-cZ_X)\sigma_Y+(d-a)
where c=\frac{b\sigma_X}{\sigma_Y} and d=\mu_Y-b\mu_X,
\begin{align*} MSE &=\sigma_Y^2 E\left((Z_Y-cZ_X)^2\right)+(d-a)^2 \\ &=\sigma_Y^2(1-2c\rho+c^2)+(d-a)^2 \\ &=\sigma_Y^2\left(1-\rho^2+(c-\rho)^2\right)+(d-a)^2 \end{align*}
We see immediately that this is minimal for a=d and c=\rho.
Therefore,
b=\frac{\rho\sigma_Y}{\sigma_X}
This answers the question of what the best linear predictor of Y given X is, in the mean-square sense.
We see the smallest MSE is therefore \sigma_Y^2(1-\rho^2).
Therefore, we see that the proportion of Y’s variance which is not explained by X is
\frac{MSE}{\sigma_Y^2}=1-\rho^2
Finally, the proportion of Y's variance which is explained by X is \rho^2.
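As a quick numerical sanity check, here is a minimal simulation sketch (Python with numpy; the simulated distribution of (X, Y) and all variable names are my own choices, not from the lecture). It compares the coefficients a=\mu_Y-b\mu_X and b=\rho\sigma_Y/\sigma_X with an ordinary least-squares fit, and checks that the residual MSE is close to \sigma_Y^2(1-\rho^2).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Simulate a correlated pair (X, Y): Y = 2 + 3*X + noise (choices are illustrative).
X = rng.normal(1.0, 2.0, size=n)
Y = 2.0 + 3.0 * X + rng.normal(0.0, 4.0, size=n)

mu_X, mu_Y = X.mean(), Y.mean()
sigma_X, sigma_Y = X.std(), Y.std()
rho = np.corrcoef(X, Y)[0, 1]

# Best linear predictor h(x) = a + b*x from the formulas above.
b = rho * sigma_Y / sigma_X
a = mu_Y - b * mu_X

# Compare with a brute-force least-squares fit of Y on X.
b_ols, a_ols = np.polyfit(X, Y, deg=1)
print("formula: a=%.3f b=%.3f   OLS: a=%.3f b=%.3f" % (a, b, a_ols, b_ols))

# The smallest achievable MSE should be sigma_Y^2 * (1 - rho^2).
mse = np.mean((Y - (a + b * X)) ** 2)
print("MSE=%.3f   sigma_Y^2*(1-rho^2)=%.3f" % (mse, sigma_Y**2 * (1 - rho**2)))
```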
Chapter 6 - Convergences
Definition: We say that the sequence of r.v.'s (X_n)_{n\in\mathbb{N}} converges almost surely ("a.s.") to the r.v. X if \lim_{n\rightarrow\infty}X_n = X with probability 1. In other words,
P\left(\lim_{n\rightarrow\infty}|X_n-X| = 0\right) = 1
Definition: (A weaker notion of convergence) A sequence of r.v.'s (X_n)_{n\in\mathbb{N}} converges in probability to X if
\forall \varepsilon >0: P(|X_n-X|>\varepsilon) \rightarrow 0 \text{ as } n\rightarrow\infty
Note: [Convince yourself as an exercise at home] convergence in probability is weaker (easier to achieve) than convergence a.s.
Definition: (an even weaker notion) Let the sequence of r.v.'s (X_n)_{n\in\mathbb{N}} be as above, but now let F be the CDF of some distribution. We say X_n converges in distribution to the law F if
F_{X_n}(x) \rightarrow F(x)
as n\rightarrow\infty for every fixed x at which F is continuous.
Note: unlike the previous two notions, here there is no need for a limiting r.v. X, and the X_n's do not need to share a probability space with X or with one another.
Example 1
Let Y_i, i=1,2,3,\ldots be i.i.d. with mean \mu and finite variance \sigma^2. We proved that, with
X_n= \frac{Y_1 + Y_2 + \cdots + Y_n}{n}
then
P(|X_n-\mu|>\varepsilon) \leq \frac{\sigma^2}{\varepsilon^2n}(By Chebyshev’s inequality)
The right-hand side goes to 0 as n\rightarrow \infty; this proves that X_n\rightarrow\mu in probability.
Note: assuming only that \mu exists ( \sigma^2 could be infinite), the conclusion still holds. See W. Feller's book (1950).
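Here is a small illustration of the Chebyshev bound above (a sketch in Python/numpy; the choice of Uniform(0,1) for the Y_i, of \varepsilon, and of the sample sizes is mine). It compares the empirical frequency of |X_n-\mu|>\varepsilon with \sigma^2/(\varepsilon^2 n).

```python
import numpy as np

rng = np.random.default_rng(1)
mu, var, eps = 0.5, 1 / 12, 0.05   # Y_i ~ Uniform(0,1): mean 1/2, variance 1/12
reps = 10_000

for n in (50, 200, 1000):
    Y = rng.uniform(0, 1, size=(reps, n))
    X_n = Y.mean(axis=1)                    # sample mean for each replication
    freq = np.mean(np.abs(X_n - mu) > eps)  # empirical P(|X_n - mu| > eps)
    bound = var / (eps**2 * n)              # Chebyshev bound
    print(f"n={n:5d}  empirical={freq:.4f}  Chebyshev bound={bound:.4f}")
```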
Let X_n=Uniform\{1,2,\ldots,n\}. Its CDF is a step function with an increment of 1/n at each step:
F_{X_n}(x) =\frac{1}{n}\lfloor x\rfloor \text{ for } x\in [0,n]
(where \lfloor x\rfloor is the largest integer not exceeding x).
For fixed x\in\mathbb{R}^+, as n\rightarrow\infty, F_{X_n}(x)\rightarrow 0. \quad (\star)
Since the limit in (\star) is identically 0, which is not the CDF of any random variable, X_n does not converge in distribution, and therefore X_n cannot converge in any stronger sense (in probability or a.s.).
How about Y_n=Uniform \{1/n, 2/n, …, n/n\} ?
Since
F_{Y_n}(x)\rightarrow x
for x\in[0,1], and x\mapsto x is the CDF of Uniform(0,1) on that interval, Y_n converges in distribution to Uniform(0,1).
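As a small check (a Python/numpy sketch; the grid of x values and the values of n are my choices), the CDF of Y_n can be computed exactly as F_{Y_n}(x)=\lfloor nx\rfloor/n on [0,1] and compared with the Uniform(0,1) CDF.

```python
import numpy as np

def cdf_Yn(x, n):
    # Exact CDF of the uniform distribution on {1/n, 2/n, ..., n/n}:
    # P(Y_n <= x) = floor(n*x)/n, clipped to [0, 1].
    return np.clip(np.floor(n * x) / n, 0.0, 1.0)

xs = np.linspace(0.0, 1.0, 11)
for n in (5, 50, 500):
    err = np.max(np.abs(cdf_Yn(xs, n) - xs))   # distance to the Uniform(0,1) CDF
    print(f"n={n:4d}  max |F_Yn(x) - x| on the grid = {err:.4f}")
```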
Example 2
Let U_i\sim Unif(0,1), i.i.d. Let M_n =\max_{i=1,2,\ldots,n}(U_i). We expect that M_n\rightarrow 1 in some sense. Let's prove convergence in probability:
Let \varepsilon>0 , \begin{align*} P(\vert M_n-1\vert > \varepsilon) &= P(1-M_n>\varepsilon) \\ &=P(M_n<1-\varepsilon)\\ &=P(\max_{i=1,2,...,n}U_i<1-\varepsilon) \\ &=P(\forall i: U_i<1-\varepsilon)\\ &=P(\cap_{i=1}^{n}\{U_i<1-\varepsilon\}) \\ &=\prod_{i=1}^{n}P(U_i<1-\varepsilon) =(1-\varepsilon)^n\\ \lim P(\vert M_n-1 \vert >\varepsilon) &=0 \end{align*}
We proved that M_n\rightarrow1 in probability.
Now consider Y_n=(1-M_n)n. Let's look at the CDF of Y_n.
\begin{align*} 1-F_{Y_n}(y) & = P((1-M_n)n>y) \\ &= P(1-M_n>\frac{y}{n}) \\ &=P(M_n<1-y/n) \\ &=P(\cap_{i=1}^{n} \{U_i<1-y/n\}) \\ &=(1-y/n)^n\\ &\rightarrow e^{-y} \end{align*}
So F_{Y_n}(y)\rightarrow 1-e^{-y}, which is the CDF of the exponential distribution with rate 1. Thus Y_n \rightarrow Exp(\lambda = 1) in distribution.
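Both limits can be seen numerically with a short simulation (a Python/numpy sketch; n, the number of replications, and the test points are my choices): the tail probability of Y_n = n(1-M_n) should be close to e^{-y}.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 500, 20_000

U = rng.uniform(0, 1, size=(reps, n))
M_n = U.max(axis=1)            # maximum of n Uniform(0,1) variables
Y_n = n * (1 - M_n)

# Convergence of M_n to 1 in probability: P(|M_n - 1| > eps) = (1 - eps)^n.
eps = 0.01
print("P(|M_n - 1| > 0.01) ~", np.mean(np.abs(M_n - 1) > eps),
      "  theory:", (1 - eps) ** n)

# Convergence of Y_n to Exp(1) in distribution: P(Y_n > y) -> e^{-y}.
for y in (0.5, 1.0, 2.0):
    print(f"y={y}:  P(Y_n > y) ~ {np.mean(Y_n > y):.4f}   e^-y = {np.exp(-y):.4f}")
```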
Theorem (6.3.6): Let (X_n)_{n\in\mathbb{N}} be a sequence of r.v.'s. If X_n has MGF M_{X_n}(t) and M_{X_n}(t)\rightarrow M_X(t) for all t in a neighborhood of 0, then X_n \rightarrow X in distribution.
Example 3
Let X_n be Bin( n,p=\lambda/n ). We know that the PMF of X_n converges to the PMF of Poisson( \lambda ). Let's verify it again with MGFs. We will find
M_{X_n}(t)=\left(1+\frac{\lambda}{n}(e^t-1)\right)^n \rightarrow e^{\lambda (e^t-1)}
and here we recognize the limit as the MGF of Poisson( \lambda ). By Theorem 6.3.6, X_n\rightarrow Poiss(\lambda) in distribution.
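A minimal numerical check of this MGF convergence (a Python sketch; \lambda and the evaluation point t are my choices): the MGF of Bin(n,\lambda/n) approaches the Poisson(\lambda) MGF as n grows.

```python
import numpy as np

lam, t = 3.0, 0.5                             # illustrative choices

def mgf_binomial(n, p, t):
    # MGF of Bin(n, p): (1 - p + p*e^t)^n
    return (1 - p + p * np.exp(t)) ** n

mgf_poisson = np.exp(lam * (np.exp(t) - 1))   # MGF of Poisson(lam)

for n in (10, 100, 1000, 10_000):
    print(f"n={n:6d}  Bin(n, lam/n): {mgf_binomial(n, lam / n, t):.6f}"
          f"   Poisson(lam): {mgf_poisson:.6f}")
```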
Let X_i, i=1,2,…,n be i.i.d with E(X_i)=\mu and variance Var(X_i)=\sigma^2 .
Let Z_i =\frac{X_i-\mu}{\sigma} , S_n=\frac{\sum_{i=1}^{n}Z_i}{\sqrt{n}} ,
(We divide by \sqrt{n} as a standardization, and we know Var(S_n)=1.)
Central Limit Theorem (CLT)
As n\rightarrow\infty, S_n\rightarrow Normal(0,1) in distribution.
This means:
\lim P(S_n\leq x) = F_{N(0,1)}(x) = \int_{-\infty}^{x}\frac{1}{\sqrt{2\pi}}\exp(-z^2/2) dz
The proof combines Taylor's formula up to order 2 (applied to the MGF) with Theorem 6.3.6.
Next, consider
Y_n = \frac{(\sum_{i=1}^{n}X_i)-n\mu}{\sqrt{n}}=\sigma S_n
A corollary of the CLT says this:
Y_n\rightarrow N(0,\sigma^2)
in distribution.
Proof:
\begin{align*} P(Y_n\leq x) & = P(\sigma S_n\leq x) \\ &=P(S_n\leq\frac{x}{\sigma}) \\ &\rightarrow\int_{-\infty}^{x/\sigma}\frac{1}{\sqrt{2\pi}}\exp(-z^2/2) dz \end{align*}
and conclude by a change of variable (the last integral equals F_{N(0,\sigma^2)}(x)).
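A simulation sketch of the CLT (Python/numpy; the underlying distribution, here Exponential(1), and the sample sizes are my choices): the empirical CDF of S_n is compared with the standard normal CDF at a few points.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(3)
n, reps = 1_000, 10_000
mu, sigma = 1.0, 1.0                     # mean and sd of Exponential(1)

X = rng.exponential(1.0, size=(reps, n))
S_n = (X - mu).sum(axis=1) / (sigma * np.sqrt(n))   # standardized sum

def Phi(x):
    # Standard normal CDF.
    return 0.5 * (1 + erf(x / sqrt(2)))

for x in (-1.0, 0.0, 1.0, 2.0):
    print(f"x={x:+.1f}  P(S_n <= x) ~ {np.mean(S_n <= x):.4f}   Phi(x) = {Phi(x):.4f}")
```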
Theorem of DeMoivre-Laplace
Let X_i, i=1,2,3,\ldots,n be i.i.d. Bernoulli(p). Let W_n = X_1+X_2+\cdots+ X_n. We know W_n\sim Bin(n,p).
Let S_n =\frac{W_n-np}{\sqrt{np(1-p)}},
Therefore, by CLT, S_n\rightarrow N(0,1) as n\rightarrow\infty in distribution.
In other words, to be precise,
\lim P(S_n\leq x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x}\exp(-z^2/2)dz
where
P\left(\frac{W_n-np}{\sqrt{np(1-p)}}\leq x\right) =P(W_n\leq x\sqrt{np(1-p)}+np)
The speed of convergence in the CLT is quantified by the Berry-Esseen theorem. For the Binomial CLT the convergence is much faster; a rule of thumb is that for p\in [1/10, 9/10] and n\geq 30, the CLT approximation is good to within about 1\% error.
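To see the DeMoivre-Laplace approximation in action, here is a short sketch (plain Python; the particular n, p, and cut-off values are my choices, chosen inside the rule-of-thumb range) comparing the exact Binomial CDF with the normal approximation.

```python
import math

def binom_cdf(k, n, p):
    # Exact P(W_n <= k) for W_n ~ Bin(n, p).
    return sum(math.comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1))

def Phi(x):
    # Standard normal CDF.
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

n, p = 100, 0.3                           # within the rule-of-thumb range
for k in (25, 30, 35):
    exact = binom_cdf(k, n, p)
    # DeMoivre-Laplace: P(W_n <= k) ~ Phi((k - n*p) / sqrt(n*p*(1-p)))
    approx = Phi((k - n * p) / math.sqrt(n * p * (1 - p)))
    print(f"k={k}:  exact = {exact:.4f}   normal approx = {approx:.4f}")
```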
Example 4
Let
M_n =\frac{U_1+U_2+\cdots+U_n}{n}
where the U_i's are i.i.d. Uniform(0, 1). We know, by the weak law of large numbers, that M_n\rightarrow \frac{1}{2} in probability as n\rightarrow\infty. But how spread out is M_n around 1/2? For example, can we estimate the chance that M_n is more than 0.02 away from its mean value 1/2?
Answer:
\begin{align*} P(|M_n-\frac{1}{2}|>0.02) & = 1- P(-0.02 < M_n-\frac{1}{2} <0.02) \\ &= 1-P(-0.02<\frac{\sum (U_i-1/2)}{n}<0.02) \\ &=1- P(-0.02\sqrt{n}<\frac{\sum (U_i-1/2)}{\sqrt{n}}<0.02\sqrt{n}) \\ &=1- P\left(-\frac{0.02\sqrt{n}}{\sqrt{1/12}}<\frac{\sum (U_i-1/2)}{\sqrt{n}\sqrt{1/12}}<\frac{0.02\sqrt{n}}{\sqrt{1/12}}\right) \\ &\approx 1-P\left(-\frac{0.02\sqrt{n}}{\sqrt{1/12}}<N(0,1)<\frac{0.02\sqrt{n}}{\sqrt{1/12}}\right) \end{align*}
(Here \sqrt{1/12} is the standard deviation of Uniform(0,1).) We will feel comfortable if n is large enough to make this probability at most 0.05, i.e. to make the probability of being within 0.02 of 1/2 at least 0.95. What value must n be, at least?
To achieve this, the right endpoint \frac{0.02\sqrt{n}}{\sqrt{1/12}} should be at least 1.96.
This value comes from
0.975 = P(N(0,1)\leq 1.96)
so 1.96 is known as the 97.5^{th} percentile of N(0,1). Therefore, we must take
\frac{0.02\sqrt{n}}{\sqrt{1/12}}\geq 1.96 \Rightarrow n\geq 800.33, \text{ i.e. } n\geq 801.
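To close the loop on Example 4, a simulation sketch (Python/numpy; the number of replications is my choice) checking that with n = 801 the probability that M_n is more than 0.02 away from 1/2 is indeed about 0.05 or less.

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 801, 20_000                 # n chosen from the bound n >= 801

U = rng.uniform(0, 1, size=(reps, n))
M_n = U.mean(axis=1)                  # M_n from Example 4: the sample mean
prob_far = np.mean(np.abs(M_n - 0.5) > 0.02)
print("P(|M_n - 1/2| > 0.02) ~", prob_far, "  target: <= 0.05")
```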