LanternD's Castle

An electronics enthusiast - survive technically

STT 861 Theory of Prob and STT I Lecture Note - 11


Proof of the "Tower" property; discrete conditional distribution; continuous conditional distribution expectation and variance and their examples; linear predictor and mean squared error.

Portal to all the other notes

Lecture 11 - Nov 15 2017

Reinterpretation of last part of item (c) in proof of Theorem 5.2.1 in textbook.

Recall: let $X$, $Y$ be two random variables. To keep things simple, assume $X$ and $Y$ are discrete, with PMFs $P_X$ and $P_Y$ and joint PMF $P_{X,Y}$.

Generally, for $h$ a function $\mathbb{R}\rightarrow\mathbb{R}$,

\[E(h(X)Y) = E(h(X)g(X))\]

where \(g(x)=E(Y\vert X=x)\).

We want to think of that result in the following way:

\[ g(X)=E(Y\vert X) \]

(think of this as a definition.)

Now we reinterpret the formula above like this:

\[\begin{align*} E(h(X)Y) &= E(E(h(X)Y|X)) \quad (\star) \\ &= E(h(X)E(Y|X)) \end{align*}\]

The first line means “An expectation can always be written as the expectation of a conditional expectation”. It is known as the “tower” property of conditional expectation.

The second line means: “when conditioning by \(X\), \(X\) can be considered as known (non-random) and any factor depending on \(X\) can be pulled out of the conditional expectation”.

Proof of the star ($\star$):

\[\begin{align*} RHS &= E(h(X)g(X)) \\ &= E(h(X)E(Y|X)) \\ &= \sum_{x} h(x)E(Y|X=x)P_X(x) \\ &= \sum_x h(x)\sum_y y \frac{P_{X,Y}(x,y)}{P_X(x)}P_X(x)\\ &= \sum_x h(x) \sum_y yP_{X,Y}(x,y) \\ &= \sum_x\sum_y h(x)yP_{X,Y}(x,y)\\ &= E(h(X)Y) \end{align*}\]

(RHS = right hand side) This is the proof of star.
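The identity \(E(h(X)Y)=E(h(X)g(X))\) can be checked numerically on a small example. The sketch below is only an illustration: the joint PMF and the function `h` are hypothetical choices, not taken from the lecture, and exact fractions are used so the two sides can be compared without rounding.

```python
from fractions import Fraction as F

# Hypothetical joint PMF P_{X,Y}(x, y); the values are illustrative only.
joint = {
    (0, 1): F(1, 10), (0, 2): F(2, 10),
    (1, 1): F(3, 10), (1, 3): F(1, 10),
    (2, 2): F(1, 10), (2, 4): F(2, 10),
}

def h(x):
    # Any function of x works here; this one is an arbitrary choice.
    return x * x + 1

# Marginal PMF of X: P_X(x) = sum_y P_{X,Y}(x, y).
p_x = {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0) + p

def g(x):
    # g(x) = E(Y | X = x) = sum_y y * P_{X,Y}(x, y) / P_X(x)
    return sum(y * p for (xx, y), p in joint.items() if xx == x) / p_x[x]

lhs = sum(h(x) * y * p for (x, y), p in joint.items())  # E(h(X) Y)
rhs = sum(h(x) * g(x) * p for x, p in p_x.items())      # E(h(X) g(X))
print(lhs, rhs)
```

Both sums run over the same terms as the proof: the factor \(P_X(x)\) cancels against the denominator of the conditional PMF.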

Next, we use the star ($\star$) to compute the unconditional variance \(Var(Y)\) by conditioning on \(X\).

\[\begin{align*} Var(Y) &= E((Y-E(Y))^2) \\ &= E((Y-g(X)+g(X)-E(Y))^2) \\ &\triangleq E(A^2+2AB+B^2) \end{align*}\]

where $A=Y-g(X)$, $B=g(X)-E(Y)$.

We will first compute

\[\begin{align*} E(AB) &= E((Y-g(X))(g(X)-E(Y)))\\ &=E(E((Y-g(X))(g(X)-E(Y))|X)) \\ & = E(E(Y-g(X)|X)(g(X)-E(Y))) \\ &= E((g(X)-g(X))(g(X)-E(Y))) = 0 \end{align*}\]

We have proved

\[ Var(Y) = E(A^2)+E(B^2) \]

Now we interpret

\[\begin{align*} E(A^2) &= E((Y-g(X))^2) \\ &= E(E((Y-g(X))^2|X)) \\ &= E(Var(Y|X)) \end{align*}\]

\(Var(Y\vert X)\) is also known as \(v(X)\).

So we see \(E(A^2)=E(v(X))\) (expectation of conditional variance).


\[\begin{align*} E(B^2) &= E((g(X)-E(Y))^2)\\ &= E((g(X)-E(E(Y|X)))^2) \\ &= E((g(X)-E(g(X)))^2) \\ &= Var(g(X)) \end{align*}\]

This is the variance of the conditional expectation.
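The decomposition \(Var(Y)=E(v(X))+Var(g(X))\) can be verified exactly on a small discrete example. This is a minimal sketch under an assumed, hypothetical joint PMF (not from the lecture); `g` and `v` are stored as dictionaries indexed by the values of \(X\).

```python
from fractions import Fraction as F

# Hypothetical joint PMF of (X, Y); the values are illustrative only.
joint = {
    (0, 1): F(1, 10), (0, 2): F(2, 10),
    (1, 1): F(3, 10), (1, 3): F(1, 10),
    (2, 2): F(1, 10), (2, 4): F(2, 10),
}

# Marginal PMF of X.
p_x = {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0) + p

# g(x) = E(Y | X = x) and v(x) = Var(Y | X = x).
g = {x: sum(y * p for (xx, y), p in joint.items() if xx == x) / p_x[x]
     for x in p_x}
v = {x: sum((y - g[x]) ** 2 * p for (xx, y), p in joint.items() if xx == x) / p_x[x]
     for x in p_x}

mean_y = sum(y * p for (x, y), p in joint.items())
var_y = sum((y - mean_y) ** 2 * p for (x, y), p in joint.items())

e_v = sum(v[x] * p_x[x] for x in p_x)                      # E(v(X)) = E(A^2)
var_g = sum((g[x] - mean_y) ** 2 * p_x[x] for x in p_x)    # Var(g(X)) = E(B^2)
print(var_y, e_v + var_g)
```

The computation of `var_g` uses \(E(g(X))=E(Y)\), which is the tower property applied with \(h\equiv 1\).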

Homework Problem 5.2.5 Part b

$X$ takes values in $\{0,1,2\}$ with probabilities $\{0.3, 0.4, 0.3\}$, and $\varepsilon=\pm1$ with probability 0.5 each. $Y=5-X^2+\varepsilon$.

Q: Find $\rho=\rho(X,Y)$

\[\rho(X,Y) = \frac{cov(X,Y)}{\sqrt{Var(X)Var(Y)}}\]

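\[\begin{align*} cov(X,Y) = E((X-\mu_X)(Y-\mu_Y)) &= E(XY)-\mu_X\mu_Y\\ &=E(X(5-X^2+\varepsilon))-\mu_X\mu_Y\\ &= E(5X-X^3+X\varepsilon)-\mu_X\mu_Y \\ &= 5E(X) -E(X^3) +E(X\varepsilon)-\mu_X\mu_Y\\ &=5\mu_X - E(X^3)-\mu_X\mu_Y \end{align*}\]

The last step uses \(E(X\varepsilon)=E(X)E(\varepsilon)=0\), which holds when \(\varepsilon\) is independent of \(X\) (assumed here) since \(E(\varepsilon)=0\).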
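The remaining quantities for \(\rho\) can be obtained by enumerating the joint distribution of \((X,Y)\). The sketch below assumes that \(\varepsilon\) is independent of \(X\); the problem statement as reproduced here does not say so explicitly, so treat that as an assumption of this example.

```python
from fractions import Fraction as F
from math import sqrt

p_x = {0: F(3, 10), 1: F(4, 10), 2: F(3, 10)}
p_eps = {-1: F(1, 2), 1: F(1, 2)}

# Joint PMF of (X, Y), ASSUMING epsilon is independent of X.
joint = {}
for x, px in p_x.items():
    for e, pe in p_eps.items():
        y = 5 - x ** 2 + e
        joint[(x, y)] = joint.get((x, y), 0) + px * pe

def expect(f):
    # E(f(X, Y)) under the joint PMF above.
    return sum(f(x, y) * p for (x, y), p in joint.items())

mu_x = expect(lambda x, y: x)
mu_y = expect(lambda x, y: y)
cov = expect(lambda x, y: x * y) - mu_x * mu_y
var_x = expect(lambda x, y: x * x) - mu_x ** 2
var_y = expect(lambda x, y: y * y) - mu_y ** 2

rho = float(cov) / sqrt(float(var_x * var_y))
print(cov, var_x, var_y, rho)
```

The correlation comes out negative, as expected, since \(Y\) decreases in \(X^2\).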

Example 1

\(N\) people come into a store on a given day, and customer \(i\) spends \(X_i\) dollars. Let \(T\) be the total sales in dollars for the day.

\[ T= X_1+X_2+\cdots + X_N=\sum_{i=1}^{N}X_i\]

Q: Find \(E(T)\) and \(Var(T)\).


Let’s compute

\[E(T) = E(E(\sum X_i\vert N))\]

\[\begin{align*} E(T) &= E(\sum E(X_i|N))\\ &= E(NE(X_i))\\ &=E(X_1)E(N) \end{align*}\]

We know \(T\) is related to \(N\). Therefore we must compute conditional variance

\[Var(T\vert N=n)=Var(\sum X_i\vert N=n) = Var(\sum X_i) = nVar(X_1)\]

We have just proved that \(v(n)=nVar(X_1)\).


\(E(T\vert N=n) = E(\sum X_i\vert N=n) = E(\sum X_i) = nE(X_i) \)

We proved here \(g(n)=nE(X_1)\), now finally go back to the original formula,

\[Var(T) = E(v(N))+ Var(g(N)) = E(NVar(X_1))+Var(NE(X_1)) = Var(X_1)E(N)+ E(X_1)^2Var(N)\]

where \(T=\sum X_i\), where \(X_i\) are i.i.d and \(N\) independent of \(X_i\)’s.
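The formulas for \(E(T)\) and \(Var(T)\) can be checked by building the exact PMF of \(T\) for a small case. The sketch below uses hypothetical PMFs for \(N\) and \(X_i\) (illustrative values only), with the \(X_i\) i.i.d. and independent of \(N\) as in the statement above.

```python
from fractions import Fraction as F

p_n = {0: F(1, 4), 1: F(1, 2), 2: F(1, 4)}  # hypothetical PMF of N
p_x = {1: F(1, 3), 2: F(2, 3)}              # hypothetical PMF of each X_i

def mean(pmf):
    return sum(k * p for k, p in pmf.items())

def var(pmf):
    m = mean(pmf)
    return sum((k - m) ** 2 * p for k, p in pmf.items())

def convolve(a, b):
    # PMF of the sum of two independent variables with PMFs a and b.
    out = {}
    for s, ps in a.items():
        for x, px in b.items():
            out[s + x] = out.get(s + x, 0) + ps * px
    return out

# PMF of T: mixture over n of the n-fold convolution of p_x.
p_t = {}
for n, pn in p_n.items():
    dist = {0: F(1)}  # an empty sum equals 0 with probability 1
    for _ in range(n):
        dist = convolve(dist, p_x)
    for t, pt in dist.items():
        p_t[t] = p_t.get(t, 0) + pn * pt

direct = var(p_t)
formula = var(p_x) * mean(p_n) + mean(p_x) ** 2 * var(p_n)
print(mean(p_t), direct, formula)
```

Because the arithmetic is done with fractions, the direct variance and the formula agree exactly rather than only approximately.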

Exercise: Prove the following (using a method of proof similar to that for the \(Var(T)\) formula):

\[cov(N,T)=E(X_1)Var(N)\]

and therefore, for \(E(X_1)>0\),

\[\rho (N,T)=\frac{1}{\sqrt{1+\theta}}\]

where \(\theta=\frac{Var(X_1)E(N)}{E(X_1)^2Var(N)}\).

Also, for \(N\sim Poi(\lambda)\) and \(X\sim Ber(\theta)\), compute \(Var(T)\) and \(\rho(N,T)\).

Example 2

Let \(X\sim Geom(p)\), \(D\sim NegBin(p,r=X)\). Therefore, \(D\) is a certain \(T\), where the \(N\) is the \(X\) above and each \(X_i\) is \(\sim Geom(p)\), i.i.d.

Let \(Y=X+D\), find \(E(Y)\) and \(Var(Y)\).

\[E(Y)=E(X)+E(D)=\frac{1}{p} +E(X_1)E(X)=\frac{1}{p}+\frac{1}{p^2}\]

\[Var(Y) = Var(X+D)=Var(X)+Var(D)+ 2cov(X,D)\]

Continuous Case

Example 5.3.2

\(X\sim \Gamma (\alpha,1)\), \(Y\sim \Gamma(\beta,1)\), independent.

Let \(U=X+Y\), therefore, \(U\sim \Gamma(\alpha+\beta,1)\).

Let \(V=\frac{X}{X+Y}\), this is called a Beta random variable, \(V\sim Beta(\alpha,\beta)\).

Let’s now try to prove that \(U\) and \(V\) are independent.

A: Let \(g(u)=E(X\vert U=u)\). It turns out (Wikipedia) \(E(V)=\frac{\alpha}{\alpha+\beta}\).

Therefore \(E(UV\vert U=u)=uE(V\vert U=u)=uE(V)=u \frac{\alpha}{\alpha+\beta}\).

This gives us an example where the function \(g\) is linear as a function of \(u\) because \(E(X|U=u)=E(UV|U=u)=u \frac{\alpha}{\alpha+\beta}\).

This situation, where \(E(X\vert U=u)\) is linear in \(u\), is pretty exceptional.
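A quick Monte Carlo experiment can make the Gamma and Beta example more concrete. The sketch below is only a sanity check under assumed parameter values (\(\alpha=2\), \(\beta=3\), a fixed seed and sample size are all arbitrary choices), with \(X\) and \(Y\) drawn independently. A sample correlation near zero is consistent with independence of \(U\) and \(V\) but does not prove it.

```python
import random

random.seed(0)
alpha, beta = 2.0, 3.0   # hypothetical shape parameters
n = 100_000

us, vs, xs = [], [], []
for _ in range(n):
    x = random.gammavariate(alpha, 1.0)  # X ~ Gamma(alpha, 1)
    y = random.gammavariate(beta, 1.0)   # Y ~ Gamma(beta, 1), independent of X
    xs.append(x)
    us.append(x + y)          # U = X + Y
    vs.append(x / (x + y))    # V = X / (X + Y)

def mean(z):
    return sum(z) / len(z)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return mean([(s - ma) * (t - mb) for s, t in zip(a, b)])

mean_v = mean(vs)                                         # compare with alpha / (alpha + beta)
corr_uv = cov(us, vs) / (cov(us, us) * cov(vs, vs)) ** 0.5
slope = cov(xs, us) / cov(us, us)                         # least-squares slope of X on U
print(mean_v, corr_uv, slope)
```

If \(E(X\vert U=u)=u\frac{\alpha}{\alpha+\beta}\), the least-squares slope of \(X\) on \(U\) should be close to \(\frac{\alpha}{\alpha+\beta}\), which is what `slope` estimates.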

We call \(g(x)=E(Y|X=x)\) the predictor of \(Y\) given \(X\). But what is the linear predictor?

Linear Predictor and Mean Squared Error

We would like to predict \(Y\) using a linear function of \(X\).

Let \(aX+b\) be the linear predictor. Consider the error in replacing \(Y\) by \(aX+b\).

We can choose \(a\) and \(b\) such that \(E(Y-aX-b)=0\).

More systematically, let’s consider what statisticians call the mean squared error (MSE):

\[MSE = E((Y-aX-b)^2)\]

We want to minimize the MSE over all possible choices of the two values \(a\) and \(b\). It turns out that the best \(a=Corr(X,Y) \frac{\sigma_Y}{\sigma_X}\) and the best \(b=E(Y)-aE(X)\).

Note: this is closely allied to the question of linear regression. It turns out the MSE for that pair \((a,b)\) is

\[MSE = (1-\rho^2)Var(Y)\]

This says: the uncertainty level on \(Y\) is \(Var(Y)\). The portion of that variance which is explained by \(X\) is the variance of \(aX+b\), which is

\[Var(aX)=a^2Var(X) =\rho^2\frac{\sigma_Y^2}{\sigma_X^2}\sigma_X^2=\rho^2\sigma_Y^2\]

and what is not explained by \(X\) is the MSE \((1-\rho^2)Var(Y)\).

Summary: with \((a,b)\) as above, \(\sigma_X^2=Var(X)\) and \(\sigma_Y^2=Var(Y)\), we see that the amount of variance of \(Y\) explained by \(X\) is \(Var(aX)=\rho^2\sigma_Y^2\). The MSE \(=(1-\rho^2)\sigma_Y^2\) is the variance of \(Y\) unexplained by \(X\).
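The split of \(Var(Y)\) into an explained part and the MSE can be checked exactly on a small example. This is a minimal sketch with a hypothetical joint PMF (illustrative values only). It takes the slope as \(a=cov(X,Y)/Var(X)\), which is the same as \(\rho\,\sigma_Y/\sigma_X\), and the intercept as \(b=E(Y)-aE(X)\).

```python
from fractions import Fraction as F

# Hypothetical joint PMF of (X, Y); the values are illustrative only.
joint = {
    (0, 1): F(1, 10), (0, 2): F(2, 10),
    (1, 1): F(3, 10), (1, 3): F(1, 10),
    (2, 2): F(1, 10), (2, 4): F(2, 10),
}

def expect(f):
    # E(f(X, Y)) under the joint PMF above.
    return sum(f(x, y) * p for (x, y), p in joint.items())

mu_x = expect(lambda x, y: x)
mu_y = expect(lambda x, y: y)
var_x = expect(lambda x, y: (x - mu_x) ** 2)
var_y = expect(lambda x, y: (y - mu_y) ** 2)
cov = expect(lambda x, y: (x - mu_x) * (y - mu_y))
rho_sq = cov ** 2 / (var_x * var_y)   # rho^2, kept exact

a = cov / var_x        # equals rho * sigma_Y / sigma_X
b = mu_y - a * mu_x

def mse(a_, b_):
    # Mean squared error of the linear predictor a_ * X + b_.
    return expect(lambda x, y: (y - a_ * x - b_) ** 2)

best = mse(a, b)
explained = a ** 2 * var_x            # Var(aX)
print(best, (1 - rho_sq) * var_y, explained, rho_sq * var_y)
```

Perturbing `a` or `b` slightly and calling `mse` again gives a larger value, which is one informal way to see that this pair is the minimizer.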
