
Introduction

Introduction and Definition

Stochastic processes deal with modeling a series of actions/steps (with a certain goal in mind) and analyzing the possible paths/steps. This is in contrast with regular probability problems in which only a fixed number of “actions” is performed (e.g. drawing a ball from a bag, rolling a die, tossing a coin).

Process: a series of actions or steps in order to achieve a particular end.

Stochastic: we have some randomness (i.e., uncertainty - which is measured using probability) in each action of this process. e.g. in the snakes and ladders game, we don’t know how many squares we will move on each turn, and the final path is not definite (in fact, we don’t even know how many steps we need to take - if we’re unlucky, we might not reach the goal even after taking 1000 steps).

info

However, we can say that the probability that we eventually reach the final square (100) in the snakes and ladders game approaches 1 as we play for a longer time. In other words, we are "almost sure" to reach the final square in the long run. This is an example of a "convergence" property of stochastic processes.
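To make this concrete, here is a minimal Monte Carlo sketch (the board layout below is made up, and the rules are simplified: any roll that reaches or passes square 100 counts as finishing). The estimated probability of finishing within $k$ rolls climbs towards 1 as $k$ grows.

```python
import random

# Hypothetical board: a few made-up snakes (slide down) and ladders (climb up).
JUMPS = {16: 6, 48: 26, 62: 19, 87: 24,    # snakes
         4: 14, 9: 31, 28: 84, 71: 91}     # ladders

def reaches_100_within(max_rolls):
    pos = 0
    for _ in range(max_rolls):
        pos = min(pos + random.randint(1, 6), 100)  # simplified: no exact-landing rule
        pos = JUMPS.get(pos, pos)
        if pos == 100:
            return True
    return False

trials = 20_000
for k in (10, 25, 50, 100, 200):
    p_hat = sum(reaches_100_within(k) for _ in range(trials)) / trials
    print(f"P(reach square 100 within {k} rolls) ~ {p_hat:.3f}")
```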

Stochastic Process vs. Random Variables

What’s the difference between a stochastic process and random variable? Well, they both measure some kind of randomness and are used to model situations. The difference lies in the kinds of situations each is used to measure.

A random variable is used to describe the randomness of a univariate quantity. For each random variable, we have:

  1. the sample space $\Omega$ of all possible outcomes
  2. an event $F$, which is a set of some (possibly none) outcomes
  3. the probability measure $P$, which gives the probability of each event occurring.

A random variable is a mapping from $\Omega \to \mathbb{R}$.

A stochastic process needs to describe the randomness of a complete process (you can think of this as a combination of a lot of random variables - since each random variable can be used to describe an action/step - but note that in a process, each action may be related to the previous ones so the random variables used to model the steps/actions may not be independent). So, we have:

  1. the sample space $\Omega$, which is a set of all possible outcomes - but now, each outcome is a possible path from the start to the end (goal).
  2. an event $F$, which is a set of outcomes, i.e., a set of some paths. e.g. the paths with the first roll as $4$ in the snakes and ladders game
  3. the probability measure $P$. E.g. P(first roll is 4) = 1/6

A stochastic process is a mapping from $\Omega \to \mathbb{R}^N$. In particular, note that we don’t assign a probability to each path (because it would be meaningless in cases where there are infinitely many possible paths - e.g. snakes and ladders). A stochastic process gives us the probability of any action occurring given the current state (and possibly the history of previous actions)

think 🤔

Q. In snakes and ladders, isn’t the probability of each path equal to the product of the probabilities of each action (since dice rolls are independent in every turn)? And this would be a positive number, right? And there are infinitely many such paths, so wouldn’t that mean that the sum of probabilities of all paths is greater than one?

Ans: It is correct that the probability of each path is positive and that there are infinitely many paths - but note that as the length of a path increases, its probability decreases (this means that longer paths are less likely to occur). So, although there are infinitely many paths (think of infinitely many terms in a series), their probabilities converge to a single value (which must be $1$ since it must form a valid probability function).

Also, it's important to remember that we need to consider a "disjoint" set of paths for the sum of their probabilities to equal 1. E.g. if we consider all paths of length 2 (or any fixed finite length $k$ for that matter), they are disjoint - none of the paths are "sub-paths" of each other. However, if we consider the set of "all possible paths", obviously they are not disjoint - in particular, every path of length $1$ is a sub-path of some path of length $2$ (and so on). Hence, we cannot simply add the probabilities of all paths and expect the sum to be 1.
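As a toy illustration of this convergence (a simpler process than snakes and ladders): flip a fair coin until the first head. The complete paths are H, TH, TTH, ..., and the path of length $k$ has probability $(1/2)^k$, so

$$\sum_{k=1}^{\infty} \Big(\frac{1}{2}\Big)^k = \frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \dots = 1,$$

i.e., infinitely many positive path probabilities can still sum to exactly 1.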

note

Read more about convergence tests for series here to verify the above claim.

Also, in general, we don’t care so much about a single “path” as about a set of paths (which is basically defined by an event), e.g. the set of paths which take at least one snake, the set of paths which reach the last square in 15 steps, etc. This is similar to the case of a continuous random variable → we can never assign a probability measure for $P(X = a)$ when $X$ is a continuous random variable because $X$ can take on infinitely many values - so we look at a range $P(a < X < b)$ (and consider the limit as the size of the interval $(a, b)$ shrinks to zero whilst still always containing our point of interest).

Why do we care about Stochastic Processes?

Because a lot of real-world problems can be modelled as some kind of stochastic process!

Examples:

  1. Stock Price (where we use time series data)
  2. Population growth
  3. Weather change
  4. Number of customers in a queue

Also, modelling any problem/situation as a stochastic process helps us answer important questions (a minimal sketch illustrating some of these questions follows the list below):

  • Short-term questions:
    • After $n$ steps, what will be the distribution of $X$?
    • How does the distribution of $X_n$ (after $n$ steps) rely on the initial state $X_0$?
  • Strategy-related questions (called so since they relate to states/paths, which we can analyze to come up with a “winning” strategy to a problem/game → based on some rules - e.g. in poker, if I ever reach the state “royal flush”, I will immediately go all in - we can update our strategy when we reach a certain state or meet a certain condition):
    • Is it possible that a given state $i$ can be visited in the long run? If yes, what is the probability?
    • If the probability of visiting a state $i$ is $1$, what will be the expected time to arrive at this state?
    • What will be the probability that state $j$ is visited before state $i$?
  • Performance in the long run:
    • Will there be a “stable” distribution on states in the long run?
    • What are the necessary/sufficient conditions to achieve this distribution?
    • How to calculate it?
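For instance, here is a minimal sketch of the "short-term" and "long-run" questions for a made-up 3-state chain (the transition matrix below is hypothetical, chosen only for illustration):

```python
import numpy as np

# Made-up 3-state chain: 0 = sunny, 1 = cloudy, 2 = rainy.
P = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.5, 0.3]])   # P[i, j] = P(next state = j | current state = i)

x0 = np.array([1.0, 0.0, 0.0])    # initial distribution: start in state 0

for n in (1, 5, 50):
    xn = x0 @ np.linalg.matrix_power(P, n)   # distribution of X_n after n steps
    print(n, np.round(xn, 4))
# For large n, the distribution stops depending on x0 and settles to a "stable"
# (stationary) distribution - the long-run behaviour asked about above.
```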
note

In this course, we only deal with discrete stochastic processes → there are fixed “states” and “actions” to move between states. This is in contrast with continuous stochastic processes (e.g. any process that evolves over continuous time, as opposed to discrete blocks of time).

think

It’s interesting to note that though time is always flowing (and is hence intuitively continuous), we (humans) found a way to “discretize” it (think: second, day, year) so that it is easier to analyze. We “zoom out” to a level of granularity/complexity that we can deal with - we do this in every aspect of life.

Review on Basic Probability and Statistics

Discrete Random Variable

Consider a random variable $X$ (note that $X$ is a function that maps every outcome to a real number). If the range of $X$ is finite or countably infinite (e.g. $\mathbb{N}$), then $X$ is said to be a discrete random variable (otherwise, it is continuous).

A discrete random variable can be specified by defining the probability on each value, i.e., specifying $p_x = P(X = x)$ for all values of $x$. This is called the probability distribution of $X$.

Note that once we have the probability distribution of a random variable, we don’t need anything else - not even a definition for it - to work with it. A definition could be something like: “the minimum number when 2 dice are rolled”. In other words, the probability distribution of a random variable completely specifies all properties of $X$. (though obviously we prefer to have some definitions for random variables, which relate to the problem we’re trying to solve, so that we can think about what each of the random variables "mean" in our head - unless you’re a maniac who’s just aimlessly analyzing random variables)

If we have a function $p(x)$ that satisfies:

  1. $0 \leq p(x) \leq 1$
  2. $\sum_x p(x) = 1$

then $p(x)$ specifies a discrete random variable.
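As a minimal sketch, we can build the distribution of “the minimum number when 2 dice are rolled” by enumeration and check both conditions (plain Python, nothing beyond the standard library):

```python
from itertools import product

# pmf of X = minimum of two fair dice, built by enumerating the 36 outcomes
pmf = {}
for a, b in product(range(1, 7), repeat=2):
    x = min(a, b)
    pmf[x] = pmf.get(x, 0) + 1 / 36

# the two conditions that make p(x) a valid discrete distribution
assert all(0 <= p <= 1 for p in pmf.values())
assert abs(sum(pmf.values()) - 1) < 1e-12

print(pmf)   # P(X=1) = 11/36, P(X=2) = 9/36, ..., P(X=6) = 1/36
```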

Indicator Function

For any event $A$, we can define an indicator function $I(A)$ (note that $I(A)$ is a function, NOT a number. This function, $I(A)$, is then applied to any outcome and the output of that is a number - either 0 or 1)

The indicator function $I(A)$ of an event $A \subset \Omega$ is the function:

$$I(A)(x) = \begin{cases} 1, & \text{if } x \in A \\ 0, & \text{if } x \notin A \end{cases}$$

Note: In the above definition, $x$ is an element of the sample space, i.e., it is an outcome.

think 🤔

What is the domain and codomain of the function $I$ (which I term “indicator-generating function”)?

Ans: Domain: the set of all possible events. Codomain: the set of all functions whose domain is the sample space and whose codomain is $\{0, 1\}$. Writing it “programmatically” as a higher-order function:

I: (event) => (outcome) => 0 / 1;

The indicator function $I(A)$ is a random variable. Note that it is NOT random in terms of $A$ → we’re talking about a fixed event $A$, for which it maps every element of the sample space to either $0$ or $1$.

The indicator variable follows $P(I(A) = 1) = P(A)$ and $P(I(A) = 0) = 1 - P(A)$. Hence, it is obvious that $E[I(A)] = P(A)$ (very often, we make use of this fact without proving it).

We can also specify the indicator RV as a Bernoulli random variable with parameter $p = P(A)$. Recall that a Bernoulli RV has only 2 outcomes: success (1) and failure (0), and we use $p$ to denote the probability of success.

Properties of indicator variable:

  1. $I(A^c) = 1 - I(A)$ → and this is also a random variable
  2. $I(A \cap B) = I(A) \times I(B)$ → easy to prove since it is $1$ only if the element is in both sets, and if it is absent from either one, multiplying by $0$ will result in $0$.
  3. $I(A \cup B) = I(A) + I(B) - I(A) \times I(B)$ → can be proven easily using division into cases since there are only 4 cases.
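A minimal sketch of these facts on the two-dice sample space (the events A and B below are chosen arbitrarily for illustration):

```python
from itertools import product

omega = list(product(range(1, 7), repeat=2))          # sample space: 36 equally likely outcomes
A = {w for w in omega if w[0] + w[1] == 7}            # event: the sum is 7
B = {w for w in omega if w[0] == 3}                   # event: the first die shows 3

def I(event):              # "indicator-generating function": event -> (outcome -> 0/1)
    return lambda w: 1 if w in event else 0

E = lambda f: sum(f(w) for w in omega) / len(omega)   # expectation under the uniform measure

print(E(I(A)), len(A) / 36)                                                  # E[I(A)] = P(A)
print(E(lambda w: I(A)(w) * I(B)(w)), len(A & B) / 36)                       # property 2: P(A ∩ B)
print(E(lambda w: I(A)(w) + I(B)(w) - I(A)(w) * I(B)(w)), len(A | B) / 36)   # property 3: P(A ∪ B)
```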

Expectation, Variance, MGF

A random variable always has an associated distribution with numerical values. So, we can perform addition, subtraction, multiplication, etc. on it.

For a discrete RV with probability mass function $P(x)$, the expectation (mean) is:

$$E[X] = \sum_{x \in R_X} x P(X = x)$$

where $R_X$ is the range of $X$.

Expectation shows the long-run average of the random variable.

Some properties of expectation:

  1. If $X \geq 0$, then $E[X] \geq 0$
  2. If $X \geq 0$ and $E[X] = 0$, then $P(X = 0) = 1$ ($X$ is almost surely the constant $0$)
  3. If $a$ and $b$ are constants, then $E[a + bX] = a + bE[X]$
  4. Linear additivity: for any random variables $X_1, \dots, X_n$, we have $E[\sum_{i=1}^n X_i] = \sum_{i=1}^n E[X_i]$, i.e., the expectation of the sum of RVs is equal to the sum of their individual expectations. Note that this does NOT (even) require independence between the RVs.
  5. Minimization: $E[X]$ is the constant $c$ that minimizes the squared loss $E[(X - c)^2]$. Hence, $E[X]$ is also viewed as the center of the distribution.

We can define some more terms through expectation of a function of a random variable.

Variance: $Var(X) = E[(X - E(X))^2]$; and the standard deviation $sd(X) = \sqrt{Var(X)}$

Properties of variance:

  1. $Var(X) \geq 0$
  2. If $Var(X) = 0$, then $X$ is a constant
  3. $Var(a + bX) = b^2 Var(X)$ → 1) shifting the distribution doesn’t affect the variance 2) scaling has a squared effect on the variance.
  4. $Var(X) = E[X^2] - (E[X])^2$

Moment generating function: $M_X(t) = E[e^{tX}]$

Note that the moment generating function is a function of $t$, NOT a function of $x$.

There is a one-to-one mapping between $X$ and $M_X(t)$, i.e., the moment generating function is a characterization of $X$ (it completely describes the distribution of $X$) → if 2 random variables have the same MGF, then they have the same distribution.

think

Note that the expectation and variance (even together) cannot completely describe the distribution. Why? Because they are just numbers! They cannot contain all the information about a distribution → they are “summary” statistics. However, the MGF is itself a function and by changing the value of $t$, we can observe the value of the MGF and determine the distribution of $X$.

Why do we care about the moment generating function and why is it called so? Because it helps us calculate all possible “moments” of $X$. The $k$-th moment of $X$ is defined by $E[X^k]$.

We can obtain $E[X^k]$ as:

$$E[X^k] = \frac{d^k}{dt^k} M_X(t) \Big|_{t=0}$$

In words, the $k$-th moment of $X$ can be obtained by evaluating the $k$-th order derivative of the moment generating function at $t = 0$ (this result is a number, as “expected” hehe nice pun)
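A minimal sketch of this recipe, assuming sympy is available: take the (standard) MGF of a Poisson($\lambda$) variable, $M_X(t) = e^{\lambda(e^t - 1)}$, differentiate, and evaluate at $t = 0$.

```python
import sympy as sp

t, lam = sp.symbols('t lambda', positive=True)
M = sp.exp(lam * (sp.exp(t) - 1))                 # MGF of a Poisson(lambda) random variable

first_moment = sp.diff(M, t).subs(t, 0)           # E[X]   = lambda
second_moment = sp.diff(M, t, 2).subs(t, 0)       # E[X^2] = lambda + lambda^2
variance = sp.simplify(second_moment - first_moment**2)   # Var(X) = lambda

print(first_moment, sp.expand(second_moment), variance)
```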

When a linear transformation $X \to aX + b$ is applied to a RV, the moment generating function changes as such: $M_{aX+b}(t) = e^{bt} M_X(at)$.

Another way to put it: $M_{aX+b}(t) = e^{bt} E[e^{atX}]$ (by definition)

Summary of Some Useful Discrete Distributions

[Image: Distributions.png - a summary table of some useful discrete distributions]

Multiple Random Variables

Let $X$ and $Y$ be 2 RVs on the same sample space $\Omega$.

The event $(X = x, Y = y)$ is defined to be $\{\omega: X(\omega) = x \land Y(\omega) = y\}$

The joint distribution of $X$ and $Y$ is:

$$p_{X,Y}(x,y) = P(X=x, Y=y) = P(\{\omega: X(\omega) = x \land Y(\omega) = y\})$$

The marginal distribution of $X$ is:

$$p_X(x) = P(X=x) = \sum_{y} P(X=x, Y=y) = P(\{\omega: X(\omega) = x\})$$

Covariance between $X$ and $Y$ is:

$$Cov(X,Y) = E[(X-E(X))(Y-E(Y))] = E[XY] - E[X]E[Y]$$
note

Covariance is a measure of how $X$ and $Y$ “move together” → positive covariance means that when $X > E[X]$, it is more likely for $Y > E[Y]$ as well (and vice versa).

note

It is important to note that $Var(X + Y) = Var(X) + Var(Y) + 2Cov(X,Y)$. In other words, the variance of the sum of 2 RVs is not additive (due to the interaction term, $2Cov(X,Y)$). More generally, $Var(aX + bY) = a^2 Var(X) + b^2 Var(Y) + 2ab\,Cov(X,Y)$

Correlation of $X$ and $Y$ is:

$$Corr(X,Y) = \rho_{X,Y} = \frac{Cov(X,Y)}{\sqrt{Var(X)}\sqrt{Var(Y)}} = \frac{E[XY] - E[X]E[Y]}{\sqrt{E[X^2] - (E[X])^2}\sqrt{E[Y^2] - (E[Y])^2}}$$

It is the “normalized” covariance, i.e., correlation is always a number between $-1$ and $1$.

Independence: The discrete RVs $X_1, \dots, X_n$ are independent if, and only if, for any $x_1, \dots, x_n$, $P(X_1 = x_1, \dots, X_n = x_n) = \prod_{i=1}^n P(X_i = x_i)$. In other words, independence is achieved when the joint distribution of the RVs is the product of the marginal distributions for all possible outcomes in the sample space.

For independent RVs, we have: $Cov(X, Y) = 0$. Hence, for independent RVs: $Var(X+Y) = Var(X) + Var(Y)$.

caution

However, $Cov(X, Y) = 0$ does not imply $X \perp Y$ (independence). Why? Because covariance is only a measure of the inner product relationship (i.e., linear relationships). It cannot capture other non-linear relationships between the 2 variables. So, independence is a much stronger/tighter constraint than simply “zero covariance” (or equivalently, “zero correlation”)
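A minimal sketch of the classic counterexample (the distribution of $X$ is chosen purely for convenience): take $X$ uniform on $\{-1, 0, 1\}$ and $Y = X^2$. Then $Cov(X, Y) = 0$, yet $Y$ is completely determined by $X$.

```python
import numpy as np

xs = np.array([-1, 0, 1])
p = np.array([1/3, 1/3, 1/3])          # X uniform on {-1, 0, 1}; Y = X^2

EX = np.sum(xs * p)                     # 0
EY = np.sum(xs**2 * p)                  # 2/3
EXY = np.sum(xs * xs**2 * p)            # E[XY] = E[X^3] = 0

print("Cov(X, Y) =", EXY - EX * EY)     # 0.0, yet X and Y are clearly dependent:
print("P(X=0 | Y=0) =", 1.0, " vs  P(X=0) =", 1/3)   # knowing Y changes the distribution of X
```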

Properties of Independence:

  1. Preservation of Independence: If $X_1, \dots, X_n$ are independent random variables, and $f_1, \dots, f_n: \mathbb{R} \to \mathbb{R}$ are any functions, then the variables $f_1(X_1), \dots, f_n(X_n)$ are also independent random variables (with no restriction on the functions $f_i$). Intuitively, once you have achieved independence, applying any function to each of the random variables will “preserve” the independence property.

  2. If $X_1, \dots, X_n$ are independent, then

    $$E\Big[\prod_{i=1}^n X_i\Big] = \prod_{i=1}^n E[X_i]$$

    Note that the above result is only guaranteed when the RVs are independent.

  3. If $X_1, \dots, X_n$ are independent and $Y = \sum_{i=1}^n X_i$, then

    $$Var(Y) = \sum_{i=1}^n Var(X_i)$$
    $$M_Y(t) = \prod_{i=1}^n M_{X_i}(t)$$

    Notice the different effect of adding independent RVs on the variance and the MGF: the variance becomes additive whereas the MGF becomes “multiplicative”.

Conditional Probability: We want to explore what happens when $X$ and $Y$ are not independent.

Recall for events:

$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$

Similarly for discrete RVs, we define the conditional probability:

$$p_{X|Y}(x|y) = P(X=x|Y=y) = \frac{P(X=x, Y=y)}{P(Y=y)}$$

Notice that the conditional distribution is the ratio of the joint distribution to the marginal distribution.

caution

When we hold $y$ constant and look at $p_{X|Y}(x|y)$ for a fixed $y$, the distribution becomes a function (only) of $x$. And this distribution is a probability distribution, i.e., $\sum_x p_{X|Y}(x|y) = 1$ for every $y$. On the other hand, $\sum_y p_{X|Y}(x|y)$ is NOT a probability distribution for a fixed value of $x$.

Multiplication Law (the joint distribution is the product of the conditional distribution and the marginal distribution):

$$p_{X,Y}(x,y) = p_{X|Y}(x|y) \times p_Y(y) = p_{Y|X}(y|x) \times p_X(x)$$

Law of Total Probability:

$$p_X(x) = \sum_y p_Y(y)\, p_{X|Y}(x|y)$$

Bayes’ Formula (combining the multiplication law - numerator - and the law of total probability - denominator):

$$p_{Y|X}(y|x) = \frac{p_{X,Y}(x,y)}{p_X(x)} = \frac{p_{X|Y}(x|y) \times p_Y(y)}{\sum_y p_Y(y)\, p_{X|Y}(x|y)}$$
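A minimal numeric sketch of these three rules on a small, made-up joint pmf (the numbers below are arbitrary):

```python
# made-up joint distribution p_{X,Y}(x, y) on {0, 1} x {0, 1}
joint = {(0, 0): 0.30, (0, 1): 0.20,
         (1, 0): 0.10, (1, 1): 0.40}

# marginals (law of total probability)
p_X = {x: sum(p for (xx, y), p in joint.items() if xx == x) for x in (0, 1)}
p_Y = {y: sum(p for (x, yy), p in joint.items() if yy == y) for y in (0, 1)}

# conditional p_{Y|X}(y|x) = joint / marginal (multiplication law / Bayes)
p_Y_given_X = {(y, x): joint[(x, y)] / p_X[x] for x in (0, 1) for y in (0, 1)}

print(p_X)                      # {0: 0.5, 1: 0.5}
print(p_Y_given_X[(1, 1)])      # P(Y=1 | X=1) = 0.4 / 0.5 = 0.8
```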

Conditional Independence:

We say $X \perp Y$ given $Z$ if for any $x, y, z$:

$$P(X = x, Y = y | Z = z) = P(X=x|Z=z)P(Y=y|Z=z)$$

In general, if for any vector $(x_1, \dots, x_n)$ and $y$, we have:

$$P(X_1 = x_1, \dots, X_n = x_n | Y = y) = \prod_{i=1}^n P(X_i = x_i | Y = y)$$

then we say that $X_1, \dots, X_n$ are conditionally independent given $Y$.

Example (with events rather than RVs, but the same intuition can be applied to RVs):

Conditional independence depends on the nature of the third event. If you roll two dice, one may assume that the two dice behave independently of each other. Looking at the result of one die will not tell you about the result of the second die. (That is, the two dice are independent.) If, however, the first die's result is a 3, and someone tells you about a third event - that the sum of the two results is even - then this extra unit of information restricts the options for the second result to an odd number. In other words, two events can be independent, but NOT conditionally independent.
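A minimal sketch that checks this by enumerating all 36 outcomes (the specific events - first die shows 3, second die shows 5, sum is even - are chosen just for illustration):

```python
from itertools import product
from fractions import Fraction

omega = list(product(range(1, 7), repeat=2))
P = lambda event: Fraction(sum(1 for w in omega if event(w)), len(omega))

A = lambda w: w[0] == 3                  # first die shows 3
B = lambda w: w[1] == 5                  # second die shows 5
C = lambda w: (w[0] + w[1]) % 2 == 0     # the sum is even

# unconditionally independent: P(A and B) == P(A) P(B)
print(P(lambda w: A(w) and B(w)) == P(A) * P(B))                 # True

# but NOT conditionally independent given C
cond = lambda E: P(lambda w: E(w) and C(w)) / P(C)
print(cond(lambda w: A(w) and B(w)), "vs", cond(A) * cond(B))    # 1/18 vs 1/36
```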

caution

Conditional Independence and Independence are unrelated concepts! One does not imply the other.

Given $Y = y$, $X|Y=y$ is a random variable. Hence, we can calculate $E[X|Y=y]$ → the conditional expectation.

Note that in general, $E[X|Y=y_1]$ and $E[X|Y=y_2]$ are different. We can view $E[X|Y=y]$ as a function of $y$, denoted as $h(y): \mathbb{R} \to \mathbb{R}$. Recall that $Y: \Omega \to \mathbb{R}$. Hence, $h \circ Y: \Omega \to \mathbb{R}$. It means that $h(Y) = E[X|Y]$ is also a random variable. It is a transformation of $Y$.

We also call this random variable $E[X|Y]$ the conditional expectation.

Since $E[X|Y]$ is a random variable, we can calculate its expectation and variance. Similarly, we can also find the conditional variance $Var(X|Y)$, which is also a random variable.

Law of Total Expectation (aka Law of Iterated Expectation, or Adam’s Law):

$$E[X] = E[E(X|Y)]$$

Law of Total Variance (aka Eve’s Law):

$$Var(X) = E[Var(X|Y)] + Var(E[X|Y])$$
note

Intuition behind the law of total variance: It is the decomposition of the total variance of $X$ into 2 parts:

  1. The expected value over all the variances of $X$ for a given value of $Y$ (i.e., variance within each group when the $X$'s are grouped by $Y$). → “expectation of variances”
  2. The variance between the different expected values of $X$ for different values of $Y$ (i.e., variance between the groups when the $X$'s are grouped by $Y$) → “variance of expectations”

Intuitively, the total variance is the sum of both parts. We use the terms “explained variance” and “unexplained variance” when we’re dealing with the analysis of variance (ANOVA) test (to see if different categories are statistically different)

Let’s prove the law of total variance (to convince ourselves that the above identity is correct - and it’s good practice for the math). Before we prove it, it’s necessary to understand that:

  • $E[X|Y]$ is a function of $Y$, NOT $X$. For a given $Y=y$, $E[X|Y=y]$ is a fixed number (here, we’re taking the expectation over $X$ for a fixed $Y$)
  • In $E[E[X|Y]]$, the outer expectation is taken over different values of $Y=y$, while the inner $E[X|Y]$ is calculated for a fixed $Y=y$ (i.e., the expectation is over possible values of $X$)

Proof (to show that $Var(X) = E[Var(X|Y)] + Var(E[X|Y])$):

  1. $Var(X|Y) = E[X^2|Y] - (E[X|Y])^2$ (by the definition of variance: $Var(X) = E[X^2] - (E[X])^2$. Here, it’s the same formula but we’re applying it in the conditional universe where $Y$ is already known.)

  2. Then, $E[Var(X|Y)] = E(E[X^2|Y] - (E[X|Y])^2) = E[E[X^2|Y]] - E[(E[X|Y])^2]$ (by linearity of expectation)

  3. But from the law of iterated expectation, we know that: $E[E[X^2|Y]] = E[X^2]$

  4. So, we have: $E[Var(X|Y)] = E[X^2] - E[(E[X|Y])^2]$ → first term in the equation

  5. Also, we have: $Var(E[X|Y]) = E[(E[X|Y])^2] - (E[E[X|Y]])^2$ (by the definition of variance applied on the random variable $E[X|Y]$)

  6. Again, from the law of iterated expectation, we know that $E[E[X|Y]] = E[X]$.

  7. Then, we have: $Var(E[X|Y]) = E[(E[X|Y])^2] - (E[X])^2$ → second term in the equation

  8. Combining the equations in steps (4) and (7), we get:

    $$E[Var(X|Y)] + Var(E[X|Y]) = E[X^2] - (E[X])^2 = Var(X)$$ (by definition)

  9. Hence, proved.
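As a sanity check (a minimal sketch, assuming numpy and a small made-up joint pmf of $(X, Y)$), we can verify both Adam's law and Eve's law numerically:

```python
import numpy as np

xs = np.array([0, 1, 2])
joint = np.array([[0.10, 0.20],    # rows: values of X, columns: values of Y
                  [0.30, 0.15],    # entries: P(X=x, Y=y)  (made-up numbers, sum to 1)
                  [0.05, 0.20]])

pY = joint.sum(axis=0)                                     # marginal of Y
E_X_given_Y = (xs[:, None] * joint).sum(axis=0) / pY       # E[X | Y=y] for each y
E_X2_given_Y = (xs[:, None]**2 * joint).sum(axis=0) / pY
Var_X_given_Y = E_X2_given_Y - E_X_given_Y**2

EX = (xs[:, None] * joint).sum()
VarX = (xs[:, None]**2 * joint).sum() - EX**2

print(EX, (pY * E_X_given_Y).sum())                        # Adam's law: both sides agree
print(VarX, (pY * Var_X_given_Y).sum()                     # Eve's law: E[Var(X|Y)] ...
            + (pY * (E_X_given_Y - EX)**2).sum())          # ... + Var(E[X|Y])
```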

Random Sum

Random sum is a classic example to test understanding of conditional expectation, conditional variance and MGF.

Let $X_1, X_2, \dots$ be a sequence of i.i.d. (independent and identically distributed) random variables.

Let $N$ be a discrete random variable, independent of the $X_i$'s.

Then, the random sum is defined as: $Y = \sum_{i=1}^N X_i$. That is, the number of terms to be summed is also random. That is why it’s called a random sum.

Say, $E[X_i] = \mu$ and $Var(X_i) = \sigma^2$ for notational simplicity.

Then we have,

Expectation $E[Y]$:

$$E[Y] = E\Big(\sum_{i=1}^N X_i\Big) = E\Big[E\Big(\sum_{i=1}^N X_i \,\Big|\, N\Big)\Big] = E[N\mu] = \mu E[N]$$

The above result should be fairly intuitive → the expectation of the random sum is just the product of the expectation of the number of terms, $N$, and the expectation of each random variable $X_i$, $\mu$.

Note that the above formula for the expectation works even when the $X_i$'s are NOT independent.

Variance $Var(Y)$:

$$\begin{aligned} Var(Y) &= Var\Big(\sum_{i=1}^N X_i\Big) \\ &= E\Big[Var\Big(\sum_{i=1}^N X_i \,\Big|\, N\Big)\Big] + Var\Big(E\Big[\sum_{i=1}^N X_i \,\Big|\, N\Big]\Big) \\ &= E[N\sigma^2] + Var(N\mu) \\ &= \sigma^2 E[N] + \mu^2 Var(N) \end{aligned}$$

Moment generating function $M_Y(t)$:

$$\begin{aligned} M_Y(t) &= M_{\sum_{i=1}^N X_i}(t) \quad \text{(by def. of } Y\text{)} \\ &= E\big[e^{t\sum_{i=1}^N X_i}\big] \\ &= E\big[E\big[e^{t\sum_{i=1}^N X_i} \,\big|\, N\big]\big] \quad \text{(by LIE)} \\ &= E\big[(M_X(t))^N\big] \quad \text{(this is an average over } N\text{)} \\ &= E\big[e^{N \ln M_X(t)}\big] \quad \text{(by def. of log)} \\ &= M_N(\ln M_X(t)) \quad \text{(by def. of MGF of } N\text{)} \end{aligned}$$
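To close, a minimal simulation sketch of the expectation and variance formulas (the choices $N \sim \text{Poisson}(4)$ and $X_i \sim \text{Exponential}$ with mean 2 are arbitrary, picked only to make the check concrete):

```python
import numpy as np

rng = np.random.default_rng(0)
lam, mu, sigma2 = 4.0, 2.0, 4.0        # E[N] = Var(N) = lam;  E[X_i] = mu, Var(X_i) = sigma2

def one_random_sum():
    n = rng.poisson(lam)                       # random number of terms
    return rng.exponential(mu, size=n).sum()   # sum of n i.i.d. X_i

samples = np.array([one_random_sum() for _ in range(200_000)])

print(samples.mean(), mu * lam)                      # E[Y]   = mu * E[N]                  = 8
print(samples.var(), sigma2 * lam + mu**2 * lam)     # Var(Y) = sigma^2 E[N] + mu^2 Var(N) = 32
```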