STT 861 Theory of Prob and STT I Lecture Note - 11

2017-11-15

Proof of the "Tower" property; discrete conditional distribution; continuous conditional distribution expectation and variance and their examples; linear predictor and mean squared error.

Lecture 11 - Nov 15 2017

Reinterpretation of last part of item (c) in proof of Theorem 5.2.1 in textbook.

Recall: let $X$, $Y$ be two random variables. To keep things simple, assume $X$ and $Y$ are discrete, with PMFs $P_{X}$ and $P_{Y}$ and joint PMF $P_{X, Y}$.

Generally, for $h$ a function $\mathbb{R} \to \mathbb{R}$,

$E (h (X) Y) = E (h (X) g (X))$

where $g (x) = E (Y | X = x)$ .

We want to think of that result in the following way:

The notation $g(X)$ means the value of the function $g(x)$ with $x$ replaced by the random variable $X$.
We reinterpret $g(X)$ as:

$g (X) = E (Y | X)$

(think of this as a definition.)

Now we reinterpret the formula above like this:

\begin{aligned} E(h(X) Y) & = E(E(h(X) Y | X)) \quad (⋆) \\ & = E(h(X) E(Y | X)) \end{aligned}

The first line means “An expectation can always be written as the expectation of a conditional expectation”. It is known as the “tower” property of conditional expectation.

The second line means: “when conditioning by $X$ , $X$ can be considered as known (non-random) and any factor depending on $X$ can be pulled out of the conditional expectation”.

Proof of the star ($⋆$):

\begin{aligned} \mathrm{RHS} & = E(h(X) g(X)) \\ & = E(h(X) E(Y | X)) \\ & = \sum_{x} h(x) E(Y | X = x) P_{X}(x) \\ & = \sum_{x} h(x) \sum_{y} y \frac{P_{X, Y}(x, y)}{P_{X}(x)} P_{X}(x) \\ & = \sum_{x} h(x) \sum_{y} y P_{X, Y}(x, y) \\ & = \sum_{x} \sum_{y} h(x) y P_{X, Y}(x, y) \\ & = E(h(X) Y) \end{aligned}

(RHS = right-hand side.) This proves the star.
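The identity $E(h(X) Y) = E(h(X) g(X))$ can be checked numerically on a small example. The sketch below uses a hypothetical joint PMF and an arbitrary function `h`, neither of which comes from the lecture; exact fractions are used so that the two sides can be compared without rounding.

```python
from fractions import Fraction as F

# Hypothetical joint PMF P_{X,Y}(x, y); the values are for illustration only.
joint = {
    (0, 1): F(1, 10), (0, 2): F(2, 10),
    (1, 1): F(3, 10), (1, 3): F(1, 10),
    (2, 2): F(1, 10), (2, 3): F(2, 10),
}

def h(x):
    # Any function of x works here; this choice is arbitrary.
    return x * x + 1

# Marginal PMF of X: P_X(x) = sum_y P_{X,Y}(x, y).
p_x = {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0) + p

def g(x):
    # g(x) = E(Y | X = x) = sum_y y * P_{X,Y}(x, y) / P_X(x)
    return sum(y * p for (xx, y), p in joint.items() if xx == x) / p_x[x]

lhs = sum(h(x) * y * p for (x, y), p in joint.items())  # E(h(X) Y)
rhs = sum(h(x) * g(x) * p for x, p in p_x.items())      # E(h(X) g(X))
print(lhs, rhs)  # the two values agree
```

Changing `h` or the entries of `joint` (as long as they sum to 1) should leave the two sides equal.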

Next, we use the star ($⋆$) to compute the unconditional variance $\operatorname{Var}(Y)$ by conditioning on $X$.

\begin{aligned} \operatorname{Var}(Y) & = E((Y - E(Y))^{2}) \\ & = E((Y - g(X) + g(X) - E(Y))^{2}) \\ & ≜ E(A^{2} + 2 A B + B^{2}) \end{aligned}

where $A = Y - g (X)$ , $B = g (X) - E (Y)$ .

We will first compute

\begin{aligned} E(AB) & = E((Y - g(X))(g(X) - E(Y))) \\ & = E(E((Y - g(X))(g(X) - E(Y)) | X)) \\ & = E(E(Y - g(X) | X)(g(X) - E(Y))) \\ & = E((g(X) - g(X))(g(X) - E(Y))) = 0 \end{aligned}

We have proved

$\operatorname{Var}(Y) = E(A^{2}) + E(B^{2})$

Now we interpret

\begin{aligned} E(A^{2}) & = E((Y - g(X))^{2}) \\ & = E(E((Y - g(X))^{2} | X)) \\ & = E(\operatorname{Var}(Y | X)) \end{aligned}

$\operatorname{Var}(Y | X)$ is also known as $v(X)$.

So we see $E (A^{2}) = E (v (X))$ (expectation of conditional variance).

Finally,

\begin{aligned} E(B^{2}) & = E((g(X) - E(Y))^{2}) \\ & = E((g(X) - E(E(Y | X)))^{2}) \\ & = E((g(X) - E(g(X)))^{2}) \\ & = \operatorname{Var}(g(X)) \end{aligned}

This is the variance of the conditional expectation. Altogether, $\operatorname{Var}(Y) = E(v(X)) + \operatorname{Var}(g(X))$.
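As a sanity check, the decomposition of $\operatorname{Var}(Y)$ into the expectation of the conditional variance plus the variance of the conditional expectation can be verified exactly on a small example. The joint PMF below is hypothetical and chosen only for illustration.

```python
from fractions import Fraction as F

# Hypothetical joint PMF P_{X,Y}(x, y); the values are for illustration only.
joint = {
    (0, 1): F(1, 10), (0, 2): F(2, 10),
    (1, 1): F(3, 10), (1, 3): F(1, 10),
    (2, 2): F(1, 10), (2, 3): F(2, 10),
}

# Marginal PMF of X.
p_x = {}
for (x, _), p in joint.items():
    p_x[x] = p_x.get(x, 0) + p

def cond_moment(x, k):
    # E(Y^k | X = x)
    return sum(y ** k * p for (xx, y), p in joint.items() if xx == x) / p_x[x]

g = {x: cond_moment(x, 1) for x in p_x}              # g(x) = E(Y | X = x)
v = {x: cond_moment(x, 2) - g[x] ** 2 for x in p_x}  # v(x) = Var(Y | X = x)

e_y = sum(y * p for (_, y), p in joint.items())
var_y = sum((y - e_y) ** 2 * p for (_, y), p in joint.items())

e_v = sum(v[x] * p for x, p in p_x.items())                 # E(v(X))
var_g = sum((g[x] - e_y) ** 2 * p for x, p in p_x.items())  # Var(g(X)), since E(g(X)) = E(Y)

print(var_y, e_v + var_g)  # the two values agree
```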

Homework Problem 5.2.5 Part b

$X$ takes the values $0, 1, 2$ with probabilities $0.3, 0.4, 0.3$; $ε = \pm 1$ with probability $0.5$ each; and $Y = 5 - X^{2} + ε$.

Q: Find $ρ = ρ(X, Y)$.

$ρ(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X) \operatorname{Var}(Y)}}$

The numerator is $\operatorname{Cov}(X, Y) = E((X - μ_{X})(Y - μ_{Y})) = E(XY) - μ_{X} μ_{Y}$, where

\begin{aligned} E(XY) & = E(X(5 - X^{2} + ε)) \\ & = E(5X - X^{3} + Xε) \\ & = 5 E(X) - E(X^{3}) + E(Xε) \\ & = 5 μ_{X} - E(X^{3}) \end{aligned}

The last step uses $E(Xε) = E(X) E(ε) = 0$, which holds when $ε$ is independent of $X$ and has mean zero.
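The remaining arithmetic can be carried out by enumerating the six outcomes of $(X, ε)$. This is only a sketch, and it assumes that $X$ and $ε$ are independent, so that their joint PMF is the product of the two marginals.

```python
from fractions import Fraction as F
from math import sqrt

p_x = {0: F(3, 10), 1: F(4, 10), 2: F(3, 10)}
p_eps = {-1: F(1, 2), 1: F(1, 2)}

# ASSUMPTION: X and eps are independent, so the joint PMF is the product.
outcomes = [(x, 5 - x * x + e, px * pe)
            for x, px in p_x.items() for e, pe in p_eps.items()]

def expect(f):
    # E(f(X, Y)) over the enumerated outcomes
    return sum(f(x, y) * p for x, y, p in outcomes)

mu_x = expect(lambda x, y: x)
mu_y = expect(lambda x, y: y)
e_xy = expect(lambda x, y: x * y)
cov = e_xy - mu_x * mu_y
var_x = expect(lambda x, y: x * x) - mu_x ** 2
var_y = expect(lambda x, y: y * y) - mu_y ** 2
rho = float(cov) / sqrt(float(var_x * var_y))

print(cov, var_x, var_y)  # exact fractions
print(rho)                # negative: Y decreases as X grows
```

The negative sign is expected, since $Y = 5 - X^{2} + ε$ is decreasing in $X$ on the values $0, 1, 2$.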

Example 1

$N$ people come into a store on a given day, and customer $i$ spends $X_{i}$ dollars. Let $T$ be the total dollar amount of sales for the day.

$T = X_{1} + X_{2} + \dots + X_{N} = \sum_{i = 1}^{N} X_{i}$

Q: Find $E(T)$ and $\operatorname{Var}(T)$.

Assume:

$N$ is independent of all the $X_{i}$ ’s
$X_{i}$’s are i.i.d.

Let’s compute

$E(T) = E(E(\sum_{i=1}^{N} X_{i} | N))$

\begin{aligned} E(T) & = E(\sum_{i=1}^{N} E(X_{i} | N)) \\ & = E(N E(X_{1})) \\ & = E(X_{1}) E(N) \end{aligned}

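We know $T$ depends on $N$. Therefore, to find $\operatorname{Var}(T)$, we compute the conditional variance and the conditional expectation given $N$.

Conditional variance is $v(n) = \operatorname{Var}(T | N = n)$
Conditional expectation is $g(n) = E(T | N = n)$

$\operatorname{Var}(T | N = n) = \operatorname{Var}(\sum_{i=1}^{N} X_{i} | N = n) = \operatorname{Var}(\sum_{i=1}^{n} X_{i}) = n \operatorname{Var}(X_{1})$

The second equality uses the independence of $N$ and the $X_{i}$’s, and the third uses that the $X_{i}$’s are i.i.d. We have just proved that $v(n) = n \operatorname{Var}(X_{1})$.

Next,

$E(T | N = n) = E(\sum_{i=1}^{N} X_{i} | N = n) = E(\sum_{i=1}^{n} X_{i}) = n E(X_{1})$

We proved here that $g(n) = n E(X_{1})$. Now finally go back to the original formula:

$\operatorname{Var}(T) = E(v(N)) + \operatorname{Var}(g(N)) = E(N \operatorname{Var}(X_{1})) + \operatorname{Var}(N E(X_{1})) = \operatorname{Var}(X_{1}) E(N) + (E(X_{1}))^{2} \operatorname{Var}(N)$

where $T = \sum_{i=1}^{N} X_{i}$, the $X_{i}$ are i.i.d., and $N$ is independent of the $X_{i}$’s.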
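The formulas for $E(T)$ and $\operatorname{Var}(T)$ can be checked exactly by building the PMF of $T$ for small discrete distributions. The PMFs `p_n` and `p_x` below are hypothetical, chosen only so that the enumeration stays small.

```python
from fractions import Fraction as F

# Hypothetical distributions, for illustration only.
p_n = {0: F(1, 4), 1: F(1, 4), 2: F(1, 2)}  # PMF of N (number of customers)
p_x = {1: F(1, 3), 3: F(2, 3)}              # PMF of each X_i (dollars spent)

def convolve(a, b):
    # PMF of the sum of two independent discrete random variables.
    out = {}
    for s, ps in a.items():
        for t, pt in b.items():
            out[s + t] = out.get(s + t, 0) + ps * pt
    return out

# Build the PMF of T by conditioning on N = n: given N = n, T is X_1 + ... + X_n.
p_t = {}
sum_n = {0: F(1)}  # PMF of the empty sum (n = 0)
for n in range(max(p_n) + 1):
    for t, pt in sum_n.items():
        p_t[t] = p_t.get(t, 0) + p_n.get(n, 0) * pt
    sum_n = convolve(sum_n, p_x)  # PMF of X_1 + ... + X_{n+1}

def mean(pmf):
    return sum(k * p for k, p in pmf.items())

def var(pmf):
    m = mean(pmf)
    return sum((k - m) ** 2 * p for k, p in pmf.items())

print(mean(p_t), mean(p_x) * mean(p_n))                            # E(T) two ways
print(var(p_t), var(p_x) * mean(p_n) + mean(p_x) ** 2 * var(p_n))  # Var(T) two ways
```

Both lines should print a pair of equal fractions; the first entry of each pair comes from the enumerated PMF of $T$ and the second from the formula.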

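Exercise: Prove the following (using a method similar to the proof of the $\operatorname{Var}(T)$ formula).

$\operatorname{Cov}(N, T) = E(X_{1}) \operatorname{Var}(N)$

and therefore, when $E(X_{1}) > 0$,

$ρ(N, T) = \frac{1}{\sqrt{1 + θ}}$

where $θ = \operatorname{Var}(X_{1}) E(N) / ((E(X_{1}))^{2} \operatorname{Var}(N))$.

Also, for $N \sim \mathrm{Poi}(λ)$ and $X_{i} \sim \mathrm{Ber}(θ)$ (here $θ$ is the Bernoulli parameter, not the ratio above), compute $\operatorname{Var}(T)$ and $ρ(N, T)$.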
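The covariance identity in the exercise can be checked on a small example before attempting the proof. The sketch below enumerates the joint PMF of $(N, T)$ for hypothetical distributions of $N$ and $X_{i}$ (not the Poisson and Bernoulli case of the exercise) and computes $ρ(N, T)$ directly from its definition.

```python
from fractions import Fraction as F
from math import sqrt

# Hypothetical distributions, for illustration only.
p_n = {0: F(1, 4), 1: F(1, 4), 2: F(1, 2)}  # PMF of N
p_x = {1: F(1, 3), 3: F(2, 3)}              # PMF of each X_i

# Joint PMF of (N, T), built by conditioning on N = n.
joint = {}
sum_n = {0: F(1)}  # PMF of X_1 + ... + X_n, starting with n = 0
for n in range(max(p_n) + 1):
    if n in p_n:
        for t, pt in sum_n.items():
            joint[(n, t)] = p_n[n] * pt
    # Extend the partial sum by one more independent X_i.
    nxt = {}
    for s, ps in sum_n.items():
        for x, px in p_x.items():
            nxt[s + x] = nxt.get(s + x, 0) + ps * px
    sum_n = nxt

def expect(f):
    return sum(f(n, t) * p for (n, t), p in joint.items())

e_n, e_t = expect(lambda n, t: n), expect(lambda n, t: t)
var_n = expect(lambda n, t: n * n) - e_n ** 2
var_t = expect(lambda n, t: t * t) - e_t ** 2
cov_nt = expect(lambda n, t: n * t) - e_n * e_t

e_x = sum(x * p for x, p in p_x.items())
print(cov_nt, e_x * var_n)  # Cov(N, T) and E(X_1) Var(N) agree
rho = float(cov_nt) / sqrt(float(var_n * var_t))
print(rho)
```

An agreement on one example is of course not a proof; it only guards against algebra slips.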

Example 2

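Let $X \sim \mathrm{Geom}(p)$ and, given $X$, let $D \sim \mathrm{NegBin}(p, r = X)$. Then $D$ is a certain $T$, where $N$ is the $X$ above and the $X_{i}$ are i.i.d. $\mathrm{Geom}(p)$.

Let $Y = X + D$; find $E(Y)$ and $\operatorname{Var}(Y)$.

$E(Y) = E(X) + E(D) = \frac{1}{p} + E(X_{1}) E(X) = \frac{1}{p} + \frac{1}{p^{2}}$

$\operatorname{Var}(Y) = \operatorname{Var}(X + D) = \operatorname{Var}(X) + \operatorname{Var}(D) + 2 \operatorname{Cov}(X, D)$

Here $\operatorname{Var}(D)$ and $\operatorname{Cov}(X, D)$ follow from the $\operatorname{Var}(T)$ and $\operatorname{Cov}(N, T)$ formulas of Example 1 with $N = X$.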
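The three terms can be evaluated with the $\operatorname{Var}(T)$ and $\operatorname{Cov}(N, T)$ formulas from Example 1, taking $N = X$. The sketch below assumes that $\mathrm{Geom}(p)$ counts trials up to and including the first success, so that its mean is $1/p$ (consistent with the value of $E(X)$ used above) and its variance is $(1 - p)/p^{2}$. The function name `example2_moments` is made up for this illustration.

```python
from fractions import Fraction as F

def example2_moments(p):
    # ASSUMPTION: Geom(p) has support 1, 2, ..., so E = 1/p and Var = (1 - p)/p^2.
    e_x1 = 1 / p               # E(X_1)
    var_x1 = (1 - p) / p ** 2  # Var(X_1)
    e_n, var_n = e_x1, var_x1  # N = X has the same Geom(p) law

    e_d = e_x1 * e_n                          # E(D) = E(X_1) E(N)
    var_d = var_x1 * e_n + e_x1 ** 2 * var_n  # Var(T) formula with N = X
    cov_xd = e_x1 * var_n                     # Cov(N, T) formula with N = X

    e_y = e_n + e_d
    var_y = var_n + var_d + 2 * cov_xd
    return e_y, var_y

e_y, var_y = example2_moments(F(1, 2))
print(e_y, var_y)  # 6 22
```

As a cross-check, $Y = \sum_{i=1}^{X} (1 + X_{i})$ is itself a random sum, and applying the $\operatorname{Var}(T)$ formula to the terms $1 + X_{i}$ gives the same value.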

Continuous Case

Example 5.3.2

$X \sim Γ(α, 1)$, $Y \sim Γ(β, 1)$, with $X$ and $Y$ independent.

Let $U = X + Y$; then $U \sim \mathrm{Gamma}(α + β, 1)$.

Let $V = \frac{X}{X + Y}$; this is called a Beta random variable, $V \sim \mathrm{Beta}(α, β)$.

It turns out that $U$ and $V$ are independent. Let’s use this fact to compute a conditional expectation.

Let $g(u) = E(X | U = u)$. It turns out (Wikipedia) that $E(V) = \frac{α}{α + β}$.

Therefore $E (U V | U = u) = u E (V | U = u) = u E (V) = u \frac{α}{α + β}$ .

This gives us an example where the function $g$ is linear as a function of $u$ because $E (X | U = u) = E (U V | U = u) = u \frac{α}{α + β}$ .

This situation, where $E(X | U = u)$ is linear in $u$, is pretty exceptional.
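A quick simulation can illustrate these claims. The sketch below assumes $X$ and $Y$ are independent, picks the hypothetical values $α = 2$, $β = 3$, and uses `random.gammavariate` from the Python standard library with scale 1. A sample correlation near zero is consistent with independence of $U$ and $V$ but does not prove it.

```python
import random

random.seed(0)
alpha, beta, n = 2.0, 3.0, 20000  # hypothetical parameter values

# ASSUMPTION: X and Y are independent, both with scale parameter 1.
xs = [random.gammavariate(alpha, 1.0) for _ in range(n)]
ys = [random.gammavariate(beta, 1.0) for _ in range(n)]
us = [x + y for x, y in zip(xs, ys)]  # U = X + Y
vs = [x / u for x, u in zip(xs, us)]  # V = X / (X + Y)

def mean(a):
    return sum(a) / len(a)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((s - ma) * (t - mb) for s, t in zip(a, b)) / len(a)

corr_uv = cov(us, vs) / (cov(us, us) * cov(vs, vs)) ** 0.5
slope = cov(us, xs) / cov(us, us)  # least-squares slope of X on U

print(mean(vs), alpha / (alpha + beta))  # sample mean of V vs alpha/(alpha+beta)
print(corr_uv)                           # should be close to 0
print(slope)                             # should be close to alpha/(alpha+beta)
```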

We call $g (x) = E (Y | X = x)$ the predictor of $Y$ given $X$ . But what is the linear predictor?

Linear Predictor and Mean Squared Error

We would like to predict $Y$ using a linear function of $X$ .

Let $a X + b$ be the linear predictor. Consider the error in replacing $Y$ by $a X + b$ .

We can choose $a$ and $b$ such that $E (Y - a X - b) = 0$ .

More systematically, let’s consider what statisticians call the mean squared error (MSE):

$E ((Y - (a X + b))^{2})$

We want to minimize the MSE over all possible choices of the two values $a$ and $b$. It turns out that the best $a$ is $a = \operatorname{Corr}(X, Y) \frac{σ_{Y}}{σ_{X}}$ and the best $b$ is $b = E(Y) - a E(X)$.

Note: this is closely allied to the question of linear regression. It turns out that the MSE for that pair $(a, b)$ is

$(1 - ρ^{2}) \operatorname{Var}(Y)$

This says: the uncertainty level on $Y$ is $\operatorname{Var}(Y)$. The portion of that variance which is explained by $X$ is the variance of $a X + b$, namely

$\operatorname{Var}(a X) = a^{2} \operatorname{Var}(X) = ρ^{2} \frac{σ_{Y}^{2}}{σ_{X}^{2}} σ_{X}^{2} = ρ^{2} σ_{Y}^{2}$

and what is not explained by $X$ is the MSE $(1 - ρ^{2}) \operatorname{Var}(Y)$.

Summary: with $(a, b)$ as above, $σ_{X}^{2} = \operatorname{Var}(X)$ and $σ_{Y}^{2} = \operatorname{Var}(Y)$, we see that the amount of variance of $Y$ explained by $X$ is $\operatorname{Var}(a X) = ρ^{2} σ_{Y}^{2}$. The MSE $= (1 - ρ^{2}) σ_{Y}^{2}$ is the variance of $Y$ unexplained by $X$.
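The split of $\operatorname{Var}(Y)$ into an explained and an unexplained part can be verified exactly on a small example. The joint PMF below is hypothetical. The slope is computed as $a = \operatorname{Cov}(X, Y) / \operatorname{Var}(X)$, which is the same number as $ρ \, σ_{Y} / σ_{X}$ and avoids square roots.

```python
from fractions import Fraction as F

# Hypothetical joint PMF P_{X,Y}(x, y); the values are for illustration only.
joint = {
    (0, 1): F(1, 10), (0, 2): F(2, 10),
    (1, 1): F(3, 10), (1, 3): F(1, 10),
    (2, 2): F(1, 10), (2, 3): F(2, 10),
}

def expect(f):
    return sum(f(x, y) * p for (x, y), p in joint.items())

mu_x, mu_y = expect(lambda x, y: x), expect(lambda x, y: y)
var_x = expect(lambda x, y: (x - mu_x) ** 2)
var_y = expect(lambda x, y: (y - mu_y) ** 2)
cov_xy = expect(lambda x, y: (x - mu_x) * (y - mu_y))

a = cov_xy / var_x  # equals rho * sigma_Y / sigma_X
b = mu_y - a * mu_x

def mse(a_, b_):
    # Mean squared error of the linear predictor a_ * X + b_
    return expect(lambda x, y: (y - (a_ * x + b_)) ** 2)

rho_sq = cov_xy ** 2 / (var_x * var_y)  # rho^2, kept exact

print(mse(a, b), (1 - rho_sq) * var_y)  # unexplained variance, two ways
print(a ** 2 * var_x, rho_sq * var_y)   # explained variance, two ways
```

Moving `a` or `b` away from these values, for example `mse(a + F(1, 10), b)`, should give a strictly larger error.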