STT 861 Theory of Prob and STT I Lecture Note  11
20171115
Proof of the "tower" property; discrete conditional distributions; continuous conditional distributions; conditional expectation and variance with examples; linear predictor and mean squared error.
Lecture 11  Nov 15 2017
Reinterpretation of last part of item (c) in proof of Theorem 5.2.1 in textbook.
Recall: let $X$, $Y$ be two random variables. To keep things simple, assume $X$ and $Y$ are discrete, with PMFs $P_X$ and $P_Y$ and joint PMF $P_{X,Y}$.
Generally, for $h$ a function $\mathbb{R}\rightarrow\mathbb{R}$,
\[E(h(X)Y) = E(h(X)g(X))\]
where \(g(x)=E(Y\vert X=x)\).
We want to think of that result in the following way:

The notation \(g(X)\) means the value of the function \(g(x)\) with \(x\) replaced by the random variable \(X\).

However, we reinterpret \(g(X)\) as:
\[ g(X)=E(Y\vert X) \]
(think of this as a definition.)
Now we reinterpret the formula above like this:
\[\begin{align*} E(h(X)Y) &= E(E(h(X)Y\vert X)) \quad (\star) \\ &= E(h(X)E(Y\vert X)) \end{align*}\]
The first line means “An expectation can always be written as the expectation of a conditional expectation”. It is known as the “tower” property of conditional expectation.
The second line means: “when conditioning by \(X\), \(X\) can be considered as known (nonrandom) and any factor depending on \(X\) can be pulled out of the conditional expectation”.
Proof of the star ($\star$):
\[\begin{align*} RHS &= E(h(X)g(X)) \\ &= E(h(X)E(Y\vert X)) \\ &= \sum_{x} h(x)E(Y\vert X=x)P_X(x) \\ &= \sum_x h(x)\sum_y y \frac{P_{X,Y}(x,y)}{P_X(x)}P_X(x)\\ &= \sum_x h(x) \sum_y yP_{X,Y}(x,y) \\ &= \sum_x\sum_y h(x)yP_{X,Y}(x,y)\\ &= E(h(X)Y) \end{align*}\]
(RHS = right-hand side.) This proves the star.
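The identity just proved can be checked numerically on a small example. The sketch below uses a made-up joint PMF and an arbitrary function `h` (both are illustrative assumptions, not taken from the textbook) and compares \(E(h(X)Y)\) with \(E(h(X)g(X))\) using exact fractions.

```python
from fractions import Fraction as F

# Hypothetical joint PMF P_{X,Y}(x, y); the values are chosen only for illustration.
joint = {
    (0, 1): F(1, 10), (0, 2): F(2, 10),
    (1, 1): F(3, 10), (1, 2): F(1, 10),
    (2, 1): F(1, 10), (2, 2): F(2, 10),
}

def h(x):
    # Any function of x works here; this one is arbitrary.
    return x * x + 1

# Marginal PMF of X, obtained by summing the joint PMF over y.
p_x = {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0) + p

def g(x):
    # Conditional expectation g(x) = E(Y | X = x).
    return sum(y * p for (xx, y), p in joint.items() if xx == x) / p_x[x]

lhs = sum(h(x) * y * p for (x, y), p in joint.items())  # E(h(X) Y)
rhs = sum(h(x) * g(x) * p for x, p in p_x.items())      # E(h(X) g(X))
print(lhs, rhs)
```

Because the arithmetic is exact, the two printed values agree exactly rather than only up to rounding.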
Next, we use the star ($\star$) to compute the unconditional variance \(Var(Y)\) by conditioning on \(X\).
\[\begin{align*} Var(Y) &= E((Y-E(Y))^2) \\ &= E((Y-g(X)+g(X)-E(Y))^2) \\ &\triangleq E(A^2+2AB+B^2) \end{align*}\]
where $A=Y-g(X)$, $B=g(X)-E(Y)$.
We will first compute
\[\begin{align*} E(AB) &= E((Y-g(X))(g(X)-E(Y)))\\ &=E(E((Y-g(X))(g(X)-E(Y))\vert X)) \\ & = E(E(Y-g(X)\vert X)(g(X)-E(Y))) \\ &= E((g(X)-g(X))(g(X)-E(Y))) = 0 \end{align*}\]
We have proved
\[ Var(Y) = E(A^2)+E(B^2) \]
Now we interpret
\[\begin{align*} E(A^2) &= E((Y-g(X))^2) \\ &= E(E((Y-g(X))^2\vert X)) \\ &= E(Var(Y\vert X)) \end{align*}\]
\(Var(Y\vert X)\) is also known as \(v(X)\).
So we see \(E(A^2)=E(v(X))\) (expectation of conditional variance).
Finally,
\[\begin{align*} E(B^2) &= E((g(X)-E(Y))^2)\\ &= E((g(X)-E(E(Y\vert X)))^2) \\ &= E((g(X)-E(g(X)))^2) \\ &= Var(g(X)) \end{align*}\]
This is the variance of the conditional expectation.
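To see the decomposition \(Var(Y)=E(v(X))+Var(g(X))\) at work, here is a minimal sketch on a made-up joint PMF (the numbers are an assumption for illustration only). It computes the cross term \(E(AB)\), the expected conditional variance, and the variance of the conditional expectation with exact fractions.

```python
from fractions import Fraction as F

# Hypothetical joint PMF P_{X,Y}(x, y); the values are chosen only for illustration.
joint = {
    (0, 1): F(1, 10), (0, 2): F(2, 10),
    (1, 1): F(3, 10), (1, 2): F(1, 10),
    (2, 1): F(1, 10), (2, 2): F(2, 10),
}

def expect(f):
    # E(f(X, Y)) under the joint PMF.
    return sum(f(x, y) * p for (x, y), p in joint.items())

# Marginal PMF of X.
p_x = {}
for (x, _), p in joint.items():
    p_x[x] = p_x.get(x, 0) + p

def g(x):
    # Conditional expectation E(Y | X = x).
    return sum(y * p for (xx, y), p in joint.items() if xx == x) / p_x[x]

def v(x):
    # Conditional variance Var(Y | X = x).
    return sum((y - g(x)) ** 2 * p for (xx, y), p in joint.items() if xx == x) / p_x[x]

mean_y = expect(lambda x, y: y)
var_y = expect(lambda x, y: (y - mean_y) ** 2)
e_a2 = expect(lambda x, y: v(x))                           # E(A^2) = E(v(X))
e_b2 = expect(lambda x, y: (g(x) - mean_y) ** 2)           # E(B^2) = Var(g(X))
cross = expect(lambda x, y: (y - g(x)) * (g(x) - mean_y))  # E(AB)
print(var_y, e_a2 + e_b2, cross)
```

The cross term comes out as exactly zero, and the first two printed values coincide.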
Homework Problem 5.2.5 Part b
$X$ takes values in $\{0,1,2\}$ with probabilities $0.3$, $0.4$, $0.3$; $\varepsilon=\pm1$ with probability 0.5 each; and $Y=5-X^2+\varepsilon$.
Q: Find $\rho=\rho(X,Y)$
\[\rho(X,Y) = \frac{cov(X,Y)}{\sqrt{Var(X)Var(Y)}}\]
\[\begin{align*} E((X-\mu_X)(Y-\mu_Y)) &= E(XY)-\mu_X\mu_Y,\\ E(XY) &=E(X(5-X^2+\varepsilon))\\ &= E(5X-X^3+X\varepsilon) \\ &= 5E(X) - E(X^3) + E(X\varepsilon)\\ &=5\mu_X - E(X^3) \end{align*}\]
The last step uses \(E(X\varepsilon)=E(X)E(\varepsilon)=0\), which assumes that \(\varepsilon\) is independent of \(X\).
Example 1
\(N\) people come into a store on a given day, and customer \(i\) spends \(X_i\) dollars. Let \(T\) be the total sales in dollars for the day.
\[ T= X_1+X_2+\cdots + X_N=\sum_{i=1}^{N}X_i\]
Q: Find \(E(T)\) and \(Var(T)\).
Assume:
 \(N\) is independent of all the \(X_i\)’s
 \(X_i\)s are i.i.d.
Let’s compute
\[E(T) = E(E(\sum X_i\vert N))\]
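\[\begin{align*} E(T) &= E\left(\sum_{i=1}^{N} E(X_i\vert N)\right)\\ &= E(NE(X_1))\\ &=E(X_1)E(N) \end{align*}\]
We know that \(T\) depends on \(N\). Therefore, to find \(Var(T)\), we condition on \(N\) and compute the conditional variance and the conditional expectation: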
 Conditional variance is \(v(n) = Var (T\vert N=n)\)
 Conditional expectation is \(g(n) = E(T\vert N=n)\)
\[Var(T\vert N=n)=Var(\sum X_i\vert N=n) = Var(\sum X_i) = nVar(X_1)\]
We have just proved that \(v(n)=nVar(X_1)\).
Next,
\(E(T\vert N=n) = E(\sum X_i\vert N=n) = E(\sum X_i) = nE(X_1) \)
We have proved that \(g(n)=nE(X_1)\). Finally, we go back to the variance decomposition:
\[Var(T) = E(v(N))+ Var(g(N)) = E(NVar(X_1))+Var(NE(X_1)) = Var(X_1)E(N)+ E(X_1)^2Var(N)\]
where \(T=\sum X_i\), where \(X_i\) are i.i.d and \(N\) independent of \(X_i\)’s.
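As a sanity check on the formulas for \(E(T)\) and \(Var(T)\), the sketch below builds the exact PMF of \(T\) by brute-force enumeration and compares its mean and variance with the formulas. The distributions of \(N\) and \(X_i\) are made up for illustration and are not part of the lecture.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical distributions, chosen only for illustration.
p_n = {0: F(1, 5), 1: F(3, 10), 2: F(1, 2)}  # PMF of N
p_x = {1: F(1, 4), 3: F(3, 4)}               # PMF of each X_i

def mean(pmf):
    return sum(k * p for k, p in pmf.items())

def var(pmf):
    m = mean(pmf)
    return sum((k - m) ** 2 * p for k, p in pmf.items())

# Exact PMF of T = X_1 + ... + X_N: enumerate N = n and every (x_1, ..., x_n),
# using the independence of N and the X_i's to multiply probabilities.
p_t = {}
for n, pn in p_n.items():
    for xs in product(p_x, repeat=n):
        prob = pn
        for x in xs:
            prob *= p_x[x]
        t = sum(xs)
        p_t[t] = p_t.get(t, 0) + prob

formula_mean = mean(p_x) * mean(p_n)
formula_var = var(p_x) * mean(p_n) + mean(p_x) ** 2 * var(p_n)
print(mean(p_t), formula_mean)
print(var(p_t), formula_var)
```

Enumeration is only feasible for tiny supports, but it avoids simulation noise, so the comparison is exact.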
Exercise: Prove the following (using a method similar to the proof of the \(Var(T)\) formula).
\[Cov(N,T)=E(X_1)Var(N)\]
and therefore,
\[\rho (N,T)=\frac{1}{\sqrt{1+\theta}}\]
where \(\theta=\frac{E(N)Var(X_1)}{E(X_1)^2Var(N)}\), assuming \(E(X_1)>0\).
Also, for \(N\sim Poi(\lambda)\) and \(X\sim Ber(\theta)\), compute \(Var(T)\) and \(\rho(N,T)\).
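One possible route for the covariance identity in the exercise, sketched under the same assumptions as in Example 1 (\(X_i\) i.i.d. and independent of \(N\)), is to condition on \(N\) and use \(E(T\vert N)=NE(X_1)\):

\[\begin{align*} E(NT) &= E(E(NT\vert N)) \\ &= E(NE(T\vert N)) \\ &= E(N\cdot NE(X_1)) = E(X_1)E(N^2), \\ Cov(N,T) &= E(NT)-E(N)E(T) \\ &= E(X_1)E(N^2)-E(X_1)E(N)^2 \\ &= E(X_1)Var(N). \end{align*}\]

The formula for \(\rho(N,T)\) then follows by dividing by \(\sqrt{Var(N)Var(T)}\) and inserting the expression for \(Var(T)\).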
Example 2
Let \(X\sim Geom(p)\) and \(D\sim NegBin(p,r=X)\). Then \(D\) is a random sum of the same form as \(T\) above, where the role of \(N\) is played by \(X\) and the \(X_i\) are i.i.d. \(Geom(p)\).
Let \(Y=X+D\). Find \(E(Y)\) and \(Var(Y)\).
\[E(Y)=E(X)+E(D)=\frac{1}{p} +E(X_1)E(X)=\frac{1}{p}+\frac{1}{p^2}\]
\[Var(Y) = Var(X+D)=Var(X)+Var(D)+ 2Cov(X,D)\]
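The three terms can be filled in with the \(Var(T)\) and \(Cov(N,T)\) formulas from Example 1. The sketch below does this under two assumptions that the notes do not state explicitly: the geometric distribution lives on \(\{1,2,\dots\}\) (consistent with the mean \(1/p\) used above, so its variance is \((1-p)/p^2\)), and \(X\) is independent of the \(X_i\)'s. As a cross-check it also treats \(Y=\sum_{i=1}^{X}(1+X_i)\) as a single random sum. The function names are hypothetical.

```python
from fractions import Fraction as F

def geom_mean_var(p):
    # Geometric on {1, 2, ...}: mean 1/p, variance (1 - p) / p^2 (assumed convention).
    return 1 / p, (1 - p) / p ** 2

def var_y(p):
    # Var(Y) = Var(X) + Var(D) + 2 Cov(X, D), with N replaced by X.
    m, s2 = geom_mean_var(p)      # same law for X and for each X_i
    var_x = s2
    var_d = s2 * m + m ** 2 * s2  # Var(X_1) E(X) + E(X_1)^2 Var(X)
    cov_xd = m * s2               # E(X_1) Var(X)
    return var_x + var_d + 2 * cov_xd

def var_y_check(p):
    # Same quantity via Y = sum of (1 + X_i) over i = 1..X, a random sum
    # whose summands have mean 1 + 1/p and variance (1 - p) / p^2.
    m, s2 = geom_mean_var(p)
    return s2 * m + (1 + m) ** 2 * s2

p = F(1, 2)
print(var_y(p), var_y_check(p))
```

Both routes give the same value for any \(p\) in \((0,1)\), which is a useful consistency check on the covariance term.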
Continuous Case
Example 5.3.2
\(X\sim \Gamma (\alpha,1)\), \(Y\sim \Gamma(\beta,1)\), with \(X\) and \(Y\) independent.
Let \(U=X+Y\); then \(U\sim \Gamma(\alpha+\beta,1)\).
Let \(V=\frac{X}{X+Y}\); this is called a Beta random variable, \(V\sim Beta(\alpha,\beta)\).
Let’s now use the fact that \(U\) and \(V\) are independent (a standard result, not proved here).
A: Let \(g(u)=E(X\vert U=u)\). It turns out (Wikipedia) that \(E(V)=\frac{\alpha}{\alpha+\beta}\).
Therefore \(E(UV\vert U=u)=uE(V\vert U=u)=uE(V)=u \frac{\alpha}{\alpha+\beta}\).
This gives us an example where the function \(g\) is linear in \(u\), because \(E(X\vert U=u)=E(UV\vert U=u)=u \frac{\alpha}{\alpha+\beta}\).
This situation, where the conditional expectation of \(X\) given \(U\) is linear in \(U\), is pretty exceptional.
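A quick simulation can make this example more concrete. The sketch below draws independent Gamma variables with Python's standard `random.gammavariate` (shape, then scale), forms \(U=X+Y\) and \(V=X/(X+Y)\), and estimates \(E(V)\) and the correlation between \(U\) and \(V\). The parameter values, sample size and seed are arbitrary choices for illustration. A correlation near zero is consistent with independence but does not prove it.

```python
import random

random.seed(0)          # fixed seed so the run is reproducible
alpha, beta = 2.0, 3.0  # hypothetical shape parameters
n = 50_000

us, vs = [], []
for _ in range(n):
    x = random.gammavariate(alpha, 1.0)  # shape alpha, scale 1
    y = random.gammavariate(beta, 1.0)   # shape beta, scale 1
    us.append(x + y)
    vs.append(x / (x + y))

def mean(zs):
    return sum(zs) / len(zs)

def corr(xs, ys):
    # Sample correlation coefficient.
    mx, my = mean(xs), mean(ys)
    cov = mean([(a - mx) * (b - my) for a, b in zip(xs, ys)])
    vx = mean([(a - mx) ** 2 for a in xs])
    vy = mean([(b - my) ** 2 for b in ys])
    return cov / (vx * vy) ** 0.5

mean_v = mean(vs)
corr_uv = corr(us, vs)
print(mean_v, alpha / (alpha + beta))  # sample mean of V vs. alpha / (alpha + beta)
print(corr_uv)                         # should be close to 0
```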
We call \(g(x)=E(Y\vert X=x)\) the predictor of \(Y\) given \(X\). But what is the linear predictor?
Linear Predictor and Mean Squared Error
We would like to predict \(Y\) using a linear function of \(X\).
Let \(aX+b\) be the linear predictor. Consider the error in replacing \(Y\) by \(aX+b\).
We can choose \(a\) and \(b\) such that \(E(Y-aX-b)=0\).
More systematically, let’s consider what statisticians call the mean squared error (MSE):
\[E((Y-(aX+b))^2)\]
We want to minimize the MSE over all possible choices of the two values \(a\) and \(b\). It turns out that the best \(a\) is \(a=Corr(X,Y) \frac{\sigma_Y}{\sigma_X}\) and the best \(b\) is \(b=E(Y)-aE(X)\).
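A short derivation of the optimal pair, sketched by setting the partial derivatives of the MSE to zero (assuming \(Var(X)>0\) and finite second moments):

\[\begin{align*} \frac{\partial}{\partial b}E((Y-aX-b)^2) &= -2E(Y-aX-b)=0 \;\Rightarrow\; b=E(Y)-aE(X),\\ \frac{\partial}{\partial a}E((Y-aX-b)^2) &= -2E(X(Y-aX-b))=0. \end{align*}\]

Substituting \(b=E(Y)-aE(X)\) into the second equation gives \(Cov(X,Y)-aVar(X)=0\), so

\[a=\frac{Cov(X,Y)}{Var(X)}=Corr(X,Y)\frac{\sigma_Y}{\sigma_X}.\]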
Note: this is closely allied to the question of linear regression. It turns out that the MSE for that pair \((a,b)\) is
\[(1-\rho^2)Var(Y)\]
where \(\rho=Corr(X,Y)\). This says: the uncertainty level on \(Y\) is \(Var(Y)\). The portion of that variance which is explained by \(X\) is the variance of \(aX+b\), which is
\[Var(aX)=a^2Var(X) =\rho^2\frac{\sigma_Y^2}{\sigma_X^2}\sigma_X^2=\rho^2\sigma_Y^2\]
and what is not explained by \(X\) is the MSE \((1-\rho^2)Var(Y)\).
Summary: with \((a,b)\) as above and \(\sigma_X^2=Var(X)\), \(\sigma_Y^2=Var(Y)\), we see that the amount of variance of \(Y\) explained by \(X\) is \(Var(aX)=\rho^2\sigma_Y^2\). The MSE \(=(1-\rho^2)\sigma_Y^2\) is the variance of \(Y\) unexplained by \(X\).
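The summary can be illustrated numerically. The sketch below takes a made-up joint PMF (an assumption for illustration only), computes the best linear predictor coefficients, and compares the resulting MSE and explained variance with \((1-\rho^2)\sigma_Y^2\) and \(\rho^2\sigma_Y^2\).

```python
import math

# Hypothetical joint PMF of (X, Y); the values are chosen only for illustration.
joint = {
    (0, 0): 0.2, (0, 1): 0.1,
    (1, 1): 0.3, (1, 2): 0.1,
    (2, 2): 0.2, (2, 3): 0.1,
}

def expect(f):
    # E(f(X, Y)) under the joint PMF.
    return sum(f(x, y) * p for (x, y), p in joint.items())

mu_x = expect(lambda x, y: x)
mu_y = expect(lambda x, y: y)
var_x = expect(lambda x, y: (x - mu_x) ** 2)
var_y = expect(lambda x, y: (y - mu_y) ** 2)
cov = expect(lambda x, y: (x - mu_x) * (y - mu_y))
rho = cov / math.sqrt(var_x * var_y)

# Best linear predictor aX + b.
a = rho * math.sqrt(var_y / var_x)
b = mu_y - a * mu_x

def mse(slope, intercept):
    # Mean squared error E((Y - (slope * X + intercept))^2).
    return expect(lambda x, y: (y - (slope * x + intercept)) ** 2)

best = mse(a, b)
explained = a ** 2 * var_x
print(best, (1 - rho ** 2) * var_y)  # unexplained variance
print(explained, rho ** 2 * var_y)   # explained variance
```

Perturbing `a` or `b` slightly, for example `mse(a + 0.1, b)`, gives a larger value than `best`, as expected for a minimizer.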