Introduction
Introduction and Definition
Stochastic processes deal with modeling a series of actions/steps (with a certain goal in mind) and analyzing the possible paths/steps. This is in contrast with regular probability problems, in which only a constant number of “actions” is performed (e.g. drawing a ball from a bag, rolling a die, tossing a coin).
Process: a series of actions or steps in order to achieve a particular end.
Stochastic: we have some randomness (i.e., uncertainty - which is measured using probability) in each action of this process. E.g. in the snakes and ladders game, we don’t know how many steps we will move in each action, and the final path is not definite (in fact, we don’t even know how many steps we need to take - if we’re unlucky, we might not reach the goal even after taking 1000 steps).
However, we can say that the probability that we eventually reach the final square (100) in the snakes and ladders game approaches 1 as we play for a longer time. In other words, we are "almost sure" to reach the final square in the long run. This is an example of a "convergence" property of stochastic processes.
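As a sanity check of this convergence claim, here is a minimal simulation sketch - not the actual snakes and ladders board; it assumes a plain 100-square board with no snakes or ladders and allows overshooting the goal. The estimated probability of finishing within n rolls climbs towards 1 as n grows.

```python
import random

def rolls_to_finish(goal=100, max_rolls=10_000):
    """One simulated game on a plain board: roll a fair die until we pass `goal`."""
    position, rolls = 0, 0
    while position < goal and rolls < max_rolls:
        position += random.randint(1, 6)  # one random "action" per step
        rolls += 1
    return rolls

games = [rolls_to_finish() for _ in range(10_000)]
for n in (20, 30, 50, 100):
    frac = sum(r <= n for r in games) / len(games)
    print(f"estimated P(finish within {n} rolls) = {frac:.3f}")  # approaches 1
```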
Stochastic Process vs. Random Variables
What’s the difference between a stochastic process and a random variable? Well, they both measure some kind of randomness and are used to model situations. The difference lies in the kinds of situations each is used to model.
A random variable is used to describe the randomness of a univariate quantity. For each random variable, we have:
- the sample space of all possible outcomes
- an event, which is a set of some (possibly none) outcomes
- the probability measure, which gives the probability of each event occurring.
A random variable is a mapping from the sample space to the real numbers, i.e., $X: \Omega \to \mathbb{R}$.
A stochastic process needs to describe the randomness of a complete process (you can think of this as a combination of many random variables - since each random variable can be used to describe an action/step - but note that in a process, each action may be related to the previous ones, so the random variables used to model the steps/actions may not be independent). So, we have:
- the sample space which is a set of all possible outcomes - but now, each outcome is a possible path from the start to end (goal).
- an event, which is a set of outcomes, i.e., a set of some paths, e.g. the paths whose first roll is a 4 in the snakes and ladders game
- the probability measure, e.g. P(first roll is 4) = 1/6
A stochastic process is a mapping from the steps (time) and outcomes to values, i.e., a collection of random variables $\{X_t\}$ indexed by the step $t$. In particular, note that we don’t assign a probability to each individual path (because it would be meaningless in cases where there are infinitely many possible paths - e.g. snakes and ladders). A stochastic process gives us the probability of any action occurring given the current state (and possibly the history of previous actions).
Q. In snakes and ladders, isn’t the probability of each path equal to the product of the probabilities of each action (since dice rolls are independent in every turn)? And this would be a finite (non-zero) number, right? And there are infinitely many such paths, so wouldn’t that mean that the sum of probabilities of all paths is greater than one?
Ans: It is correct that the probability of each path is finite and that there are infinitely many paths - but note that as the length of a path increases, its probability decreases (this means that longer paths are less likely to occur). So, although there are infinitely many paths (think of infinite terms in a series), their probabilities converge to a single value (which must be 1, since they must form a valid probability distribution).
Also, it's important to remember that we need to consider a "disjoint" set of paths for the sum of their probabilities to equal 1. E.g. if we consider all paths of length 2 (or any fixed finite length, for that matter), they are disjoint - none of the paths are "sub-paths" of each other. However, if we consider the set of "all possible paths", obviously they are not disjoint - in particular, every path of length $n$ is a sub-path of some path of length $n+1$ (and so on). Hence, we cannot simply add the probabilities of all paths and expect the sum to be 1.
Read more about convergence tests for series here to verify the above claim.
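As a quick numeric illustration of the "disjoint paths" point (a sketch using plain die-roll sequences rather than actual snakes and ladders positions): all sequences of a fixed length n are disjoint outcomes, each has probability (1/6)^n, and their probabilities sum to exactly 1.

```python
from itertools import product

n = 3                                         # any fixed path length
paths = list(product(range(1, 7), repeat=n))  # all sequences of n die rolls
total = sum((1 / 6) ** n for _ in paths)      # each sequence has probability (1/6)^n
print(len(paths), total)                      # 216 paths, total probability = 1.0 (up to float error)
```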
Also, in general, we don’t care so much about a single “path” as about a set of paths (which is basically defined by an event), e.g. the set of paths which take at least one snake, the set of paths which reach the last square in 15 steps, etc. This is similar to the case of a continuous random variable → we can never assign a (non-zero) probability to $P(X = x)$ when $X$ is a continuous random variable, because $X$ can take on an uncountably infinite number of values - so we look at a range (and consider the limit as the size of the interval shrinks to zero whilst still always containing our point of interest).
Why do we care about Stochastic Processes?
Because a lot of real-world problems can be modelled as some kind of stochastic process!
Examples:
- Stock Price (where we use time series data)
- Population growth
- Weather change
- Number of customers in a queue
Also, modelling any problem/situation as a stochastic process helps us answer important questions:
- Short-term questions:
- After $n$ steps, what will be the distribution of the state $X_n$?
- How does the distribution of $X_n$ (after $n$ steps) depend on the initial state $X_0$?
- Strategy related questions (called so since they relate to states/paths, which we can analyze to come up with a “winning” strategy to a problem/game → based on some rules - e.g. in poker, if I ever reach the state “royal flush”, I will immediately go all in - we can update our strategy when we reach a certain state or meet a certain condition):
- Is it possible that a given state can be visited in the long run? If yes, what is the probability?
- If the probability of visiting a state is 1, what will be the expected time to arrive at this state?
- What will be the probability that state $i$ is visited before state $j$?
- Performance in the long run:
- Will there be a “stable” distribution on states in the long run?
- What are the necessary/sufficient conditions to achieve this distribution?
- How to calculate it?
In this course, we only deal with discrete stochastic processes → there are fixed “states” and “actions” to move between states. This is in contrast with continuous stochastic processes (e.g. any process that depends on continuous time, as opposed to discrete blocks of time).
It’s interesting to note that though time is always flowing (and is hence intuitively continuous), we (humans) found a way to “discretize” it (think: second, day, year) so that it is easier to analyze. We “zoom out” to a level of granularity/complexity that we can deal with - we do this in every aspect of life.
Review on Basic Probability and Statistics
Discrete Random Variable
Consider a random variable $X$ (note that $X$ is a function that maps every outcome to a real number). If the range of $X$ is finite or countably infinite (e.g. $\{0, 1, 2, \dots\}$), then $X$ is said to be a discrete random variable. (otherwise, continuous)
A discrete random variable can be specified by defining the probability of each value, i.e., specifying $P(X = x)$ for all values $x$ in the range of $X$. This is called the probability distribution of $X$.
Note that once we have the probability distribution of a random variable $X$, we don’t need anything else - not even a definition for it - to work with it. A definition could be something like: “the minimum number when 2 dice are rolled”. In other words, the probability distribution of a random variable $X$ completely specifies all properties of $X$. (though obviously we prefer to have some definitions for random variables, which relate to the problem we’re trying to solve, so that we can think about what each of the random variables "means" in our head - unless you’re a maniac who’s just aimlessly analyzing random variables)
If we have a function $p(x)$ that satisfies:
$p(x) \ge 0$ for all $x$, and $\sum_x p(x) = 1$,
then $p$ specifies (the distribution of) a discrete random variable.
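For instance, here is a minimal sketch in Python (the pmf values are made up purely for illustration): the function satisfies both conditions, so it specifies a discrete random variable, and that is all we need in order to sample from it.

```python
import random

# A made-up pmf: value -> probability (non-negative, sums to 1)
pmf = {0: 0.2, 1: 0.5, 2: 0.3}

assert all(p >= 0 for p in pmf.values())          # condition 1: non-negative
assert abs(sum(pmf.values()) - 1.0) < 1e-12       # condition 2: sums to 1

# Once the pmf is known, we can work with X without any verbal "definition" of it:
values, probs = zip(*pmf.items())
print(random.choices(values, weights=probs, k=10))  # 10 samples of X
```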
Indicator Function
For any event $A$, we can define an indicator function $I_A$ (note that $I_A$ is a function, NOT a number. This function, $I_A$, is then applied to any outcome, and the output of that is a number - either 0 or 1).
The indicator function of an event $A$ is the function: $I_A(\omega) = 1$ if $\omega \in A$, and $I_A(\omega) = 0$ otherwise.
Note: In the above definition, $\omega$ is an element of the sample space, i.e., it is an outcome.
What is the domain and codomain of the function $I: A \mapsto I_A$ (which I term the “indicator-generating function”)?
Ans: Domain: the set of all possible events. Codomain: the set of all functions whose domain is the sample space and whose codomain is $\{0, 1\}$. Writing it “programmatically” as a higher order function:
I: (event) => (outcome) => 0 / 1;
The indicator function $I_A$ is a random variable. Note that $I_A$ is NOT random in terms of $A$ → we’re talking about a fixed event $A$, for which it maps every element of the sample space to either $0$ or $1$.
The indicator variable follows $P(I_A = 1) = P(A)$ and $P(I_A = 0) = 1 - P(A)$. Hence, it is obvious that $E[I_A] = P(A)$ (very often, we make use of this fact without proving it).
We can also specify the indicator RV as a Bernoulli random variable with parameter $p = P(A)$. Recall that a Bernoulli RV has only 2 outcomes: success (1) and failure (0), and we use $p$ to denote the probability of success.
Properties of the indicator variable (a quick code check of these follows the list):
- $I_{A^c} = 1 - I_A$ → and this is also a random variable
- $I_{A \cap B} = I_A \cdot I_B$ → easy to prove, since the product is $1$ only if the element is in both sets, and if it is absent from either one, multiplying by $0$ will result in $0$.
- $I_{A \cup B} = I_A + I_B - I_A \cdot I_B$ → can be proven easily using division into cases since there are only 4 cases.
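A small sketch of the “indicator-generating function” as a Python higher-order function (the two-dice sample space and the events are hypothetical choices, just to exercise the definitions); it checks $E[I_A] = P(A)$ and the product property for intersections.

```python
from itertools import product
from fractions import Fraction

def indicator(event):
    """I(event) returns a function mapping an outcome to 0 or 1."""
    return lambda outcome: 1 if outcome in event else 0

# Hypothetical sample space: ordered results of two fair dice (uniform probability).
omega = set(product(range(1, 7), repeat=2))
A = {w for w in omega if w[0] + w[1] == 7}        # event: sum is 7
B = {w for w in omega if w[0] == 3}               # event: first die shows 3

I_A, I_B = indicator(A), indicator(B)
p = Fraction(1, len(omega))                       # probability of each single outcome

E_IA = sum(I_A(w) * p for w in omega)
print(E_IA == Fraction(len(A), len(omega)))       # E[I_A] = P(A) -> True
print(all(I_A(w) * I_B(w) == indicator(A & B)(w) for w in omega))  # I_A * I_B = I_{A∩B} -> True
```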
Expectation, Variance, MGF
A random variable always has an associated distribution with numerical values. So, we can perform addition, subtraction, multiplication, etc on it.
For a discrete RV $X$ with probability mass function $p(x)$, the expectation (mean) is:
$E[X] = \sum_{x \in R_X} x \, p(x)$, where $R_X$ is the range of $X$.
Expectation shows the long-run average of the random variable.
Some properties of expectation:
- If $X \ge 0$, then $E[X] \ge 0$
- If $X \ge 0$ and $E[X] = 0$, then $X = 0$ with probability 1 ($X$ is then effectively a constant, not a random variable)
- If $a$ and $b$ are constants, then $E[aX + b] = aE[X] + b$
- Linear additivity: for any random variables $X_1, \dots, X_n$, we have $E[X_1 + \dots + X_n] = E[X_1] + \dots + E[X_n]$, i.e., the expectation of a sum of RVs is equal to the sum of their individual expectations. Note that this does NOT (even) require independence between the RVs.
- Minimization: $E[X]$ is the constant $c$ that minimizes the squared loss $E[(X - c)^2]$. Hence, $E[X]$ is also viewed as the center of the distribution (see the quick numerical check after this list).
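A quick numerical check of the minimization property (the pmf is a made-up example): a grid search over candidate constants $c$ recovers $c = E[X]$ as the minimizer of the squared loss.

```python
# Made-up pmf for X, purely for illustration.
pmf = {1: 0.1, 2: 0.4, 5: 0.5}

EX = sum(x * p for x, p in pmf.items())           # E[X] = 3.4

def squared_loss(c):
    return sum((x - c) ** 2 * p for x, p in pmf.items())

candidates = [c / 100 for c in range(0, 601)]     # grid over [0, 6] in steps of 0.01
best = min(candidates, key=squared_loss)
print(EX, best)                                   # both are 3.4
```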
We can define some more terms through expectation of a function of a random variable.
Variance: $\mathrm{Var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2$; and the standard deviation is $\sigma_X = \sqrt{\mathrm{Var}(X)}$
Properties of variance:
- If $\mathrm{Var}(X) = 0$, then $X$ is a constant
- $\mathrm{Var}(aX + b) = a^2 \, \mathrm{Var}(X)$ → 1) shifting the distribution doesn’t affect the variance 2) scaling has a squared effect on the variance.
Moment generating function: $M_X(t) = E[e^{tX}]$
Note that the moment generating function is a function of $t$, NOT a function of $X$.
There is a one-to-one mapping between $M_X(t)$ and the distribution of $X$, i.e., the moment generating function is a characterization of $X$ (it completely describes the distribution of $X$) → if 2 random variables have the same MGF, then they have the same distribution.
Note that the expectation and variance (even together) cannot completely describe the distribution. Why? Because they are just numbers! They cannot contain all the information about a distribution → they are “summary” statistics. However, the MGF is itself a function, and by changing the value of $t$, we can observe the value of the MGF and determine the distribution of $X$.
Why do we care about the moment generating function and why is it called so? Because it helps us calculate all possible “moments” of $X$. The $k$-th moment of $X$ is defined by $E[X^k]$.
We can obtain $E[X^k]$ to be: $E[X^k] = \left.\frac{d^k}{dt^k} M_X(t)\right|_{t=0}$
In words, the $k$-th moment of $X$ can be obtained by evaluating the $k$-th order derivative of the moment generating function at $t = 0$ (this result is a number, as “expected” hehe nice pun)
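A short derivation sketch of why this works (assuming the MGF exists near $t = 0$ and that expectation and differentiation can be exchanged): expanding $e^{tX}$ as a power series makes the moments appear as the series coefficients.

```latex
M_X(t) = E\!\left[e^{tX}\right]
       = E\!\left[\sum_{k=0}^{\infty} \frac{(tX)^k}{k!}\right]
       = \sum_{k=0}^{\infty} \frac{t^k}{k!}\, E[X^k]
\quad\Longrightarrow\quad
\left.\frac{d^k}{dt^k} M_X(t)\right|_{t=0} = E[X^k]
```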
When a linear transformation $Y = aX + b$ is applied to an RV, the moment generating function changes as such: $M_Y(t) = e^{bt} M_X(at)$.
Another way to put it: $M_{aX+b}(t) = E[e^{t(aX + b)}] = e^{bt} E[e^{(at)X}] = e^{bt} M_X(at)$ (by definition)
Summary of Some Useful Discrete Distributions
Multiple Random Variables
Let $X$ and $Y$ be 2 RVs on the same sample space $\Omega$.
The event $(X = x, Y = y)$ is defined to be $\{\omega \in \Omega : X(\omega) = x \text{ and } Y(\omega) = y\}$
The joint distribution of $X$ and $Y$ is: $p_{X,Y}(x, y) = P(X = x, Y = y)$
The marginal distribution of $X$ is: $p_X(x) = P(X = x) = \sum_y p_{X,Y}(x, y)$
Covariance between $X$ and $Y$ is: $\mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]$
Covariance is a measure of how $X$ and $Y$ “move together” → positive covariance means that when $X$ is above its mean, it is more likely that $Y$ is above its mean as well (and vice versa).
It is important to note that $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)$. In other words, the variance of the sum of 2 RVs is not additive (due to the interaction term, $2\,\mathrm{Cov}(X, Y)$). More generally, $\mathrm{Var}\big(\sum_i X_i\big) = \sum_i \mathrm{Var}(X_i) + \sum_{i \neq j} \mathrm{Cov}(X_i, X_j)$
Correlation of $X$ and $Y$ is: $\rho_{X,Y} = \dfrac{\mathrm{Cov}(X, Y)}{\sigma_X \, \sigma_Y}$
It is the “normalized” covariance, i.e., correlation is always a number between $-1$ and $1$.
Independence: The discrete RVs $X$ and $Y$ are independent if, and only if, for any $x, y$: $P(X = x, Y = y) = P(X = x)\,P(Y = y)$. In other words, independence is achieved when the joint distribution of the RVs is the product of the marginal distributions for all possible pairs of values.
For independent RVs, we have: $E[XY] = E[X]\,E[Y]$. Hence, for independent RVs: $\mathrm{Cov}(X, Y) = 0$ and $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$.
However, $\mathrm{Cov}(X, Y) = 0$ does not imply independence. Why? Because covariance is only a measure of the inner product relationship (i.e., linear combinations). It cannot capture other non-linear relationships between the 2 variables. So, independence is a much stronger/tighter constraint than simply “zero covariance” (or equivalently, “zero correlation”)
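A classic counterexample as a small sketch (the distribution of $X$ here is a standard textbook choice, not from these notes): take $X$ uniform on $\{-1, 0, 1\}$ and $Y = X^2$. Then $\mathrm{Cov}(X, Y) = 0$, yet $Y$ is completely determined by $X$, so they are certainly not independent.

```python
from fractions import Fraction

xs = [-1, 0, 1]                                   # X uniform on {-1, 0, 1}; Y = X^2
p = Fraction(1, 3)

EX  = sum(x * p for x in xs)                      # 0
EY  = sum(x**2 * p for x in xs)                   # 2/3
EXY = sum(x * x**2 * p for x in xs)               # 0
print(EXY - EX * EY)                              # Cov(X, Y) = 0

# ...but X and Y are not independent:
P_X1_Y0 = Fraction(0)                             # X = 1 forces Y = 1, so P(X=1, Y=0) = 0
print(P_X1_Y0 == p * p)                           # P(X=1)P(Y=0) = 1/9 != 0 -> False
```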
Properties of Independence:
Preservation of Independence: If $X_1, \dots, X_n$ are independent random variables, and $g_1, \dots, g_n$ are any functions, then the variables $g_1(X_1), \dots, g_n(X_n)$ are also independent random variables (with no restriction on the functions $g_i$). Intuitively, once you have achieved independence, applying any function to each of the random variables will “preserve” the independence property.
If $X_1, \dots, X_n$ are independent, then $E[X_1 X_2 \cdots X_n] = E[X_1]\,E[X_2] \cdots E[X_n]$
Note that the above result is only true when the RVs are independent.
If $X_1, \dots, X_n$ are independent and $S = X_1 + \dots + X_n$, then $\mathrm{Var}(S) = \sum_i \mathrm{Var}(X_i)$ and $M_S(t) = \prod_i M_{X_i}(t)$
Notice the different effect of adding independent RVs on variance and MGF: variance becomes additive whereas MGF becomes “multiplicative”.
Conditional Probability: We want to explore what happens when $X$ and $Y$ are not independent.
Recall for events: $P(A \mid B) = \dfrac{P(A \cap B)}{P(B)}$
Similarly for discrete RVs, we define the conditional probability: $P(X = x \mid Y = y) = \dfrac{P(X = x, Y = y)}{P(Y = y)}$
Notice that the conditional distribution is the ratio of the joint distribution to the marginal distribution.
When we hold $Y = y$ constant and look at $P(X = x \mid Y = y)$ for a fixed $y$, the distribution becomes a function (only) of $x$. And this distribution is a probability distribution, i.e., $\sum_x P(X = x \mid Y = y) = 1$ for every $y$. On the other hand, viewed as a function of $y$ for a fixed value of $x$, $P(X = x \mid Y = y)$ is NOT a probability distribution.
Multiplication Law (the joint distribution is the product of the conditional distribution and the marginal distribution): $P(X = x, Y = y) = P(X = x \mid Y = y)\,P(Y = y)$
Law of Total Probability: $P(X = x) = \sum_y P(X = x \mid Y = y)\,P(Y = y)$
Bayes’ Formula (combining the multiplication law - numerator - and the law of total probability - denominator): $P(Y = y \mid X = x) = \dfrac{P(X = x \mid Y = y)\,P(Y = y)}{\sum_{y'} P(X = x \mid Y = y')\,P(Y = y')}$
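A tiny numeric sketch of the last three formulas together (all the numbers are made up): given a prior on $Y$ and the conditional distribution of $X$ given $Y$, we compute $P(X = 1)$ by the law of total probability and then invert with Bayes’ formula.

```python
# Made-up distributions, purely to exercise the formulas.
P_Y = {0: 0.7, 1: 0.3}                            # marginal (prior) distribution of Y
P_X1_given_Y = {0: 0.2, 1: 0.9}                   # P(X = 1 | Y = y)

# Law of total probability (the denominator in Bayes' formula):
P_X1 = sum(P_X1_given_Y[y] * P_Y[y] for y in P_Y)
# Bayes' formula:
P_Y1_given_X1 = P_X1_given_Y[1] * P_Y[1] / P_X1
print(P_X1, P_Y1_given_X1)                        # 0.41 and ~0.659
```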
Conditional Independence:
We say $X \perp Y$ given $Z$ if for any $x, y, z$: $P(X = x, Y = y \mid Z = z) = P(X = x \mid Z = z)\,P(Y = y \mid Z = z)$
In general, if for any vector $(x_1, \dots, x_n)$ and any $z$, there is: $P(X_1 = x_1, \dots, X_n = x_n \mid Z = z) = \prod_{i=1}^{n} P(X_i = x_i \mid Z = z)$
then we say that $X_1, \dots, X_n$ are conditionally independent given $Z$.
Example (of events being conditionally independent, but the same intuition can be applied to RVs):
Conditional independence depends on the nature of the third event. If you roll two dice, one may assume that the two dice behave independently of each other. Looking at the result of one die will not tell you about the result of the second die. (That is, the two dice are independent.) If, however, the first die's result is a 3, and someone tells you about a third event - that the sum of the two results is even - then this extra unit of information restricts the options for the second result to an odd number. In other words, two events can be independent, but NOT conditionally independent.
Conditional Independence and Independence are unrelated concepts! One does not imply the other.
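Here is a numeric check of the two-dice example (a sketch; the particular events - first die shows 3, second die shows 5, sum is even - are chosen just for illustration): the two results are independent, but they are not conditionally independent given the event "the sum is even".

```python
from itertools import product
from fractions import Fraction

omega = list(product(range(1, 7), repeat=2))      # all ordered results of two fair dice
p = Fraction(1, len(omega))

def prob(event):
    return sum(p for w in omega if event(w))

even_sum = lambda w: (w[0] + w[1]) % 2 == 0
d1_is_3  = lambda w: w[0] == 3
d2_is_5  = lambda w: w[1] == 5

# Unconditionally, the two dice are independent:
print(prob(lambda w: d1_is_3(w) and d2_is_5(w)) == prob(d1_is_3) * prob(d2_is_5))  # True

# But they are NOT conditionally independent given "sum is even":
P_C   = prob(even_sum)
joint = prob(lambda w: d1_is_3(w) and d2_is_5(w) and even_sum(w)) / P_C             # 1/18
prod  = (prob(lambda w: d1_is_3(w) and even_sum(w)) / P_C) * \
        (prob(lambda w: d2_is_5(w) and even_sum(w)) / P_C)                          # 1/36
print(joint == prod)                                                                # False
```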
Given $Y = y$, $X$ is still a random variable. Hence, we can calculate $E[X \mid Y = y]$ → conditional expectation.
Note that in general, $E[X \mid Y = y]$ and $E[X]$ are different. We can view $E[X \mid Y = y]$ as a function of $y$, denoted as $g(y)$. Recall that $Y$ is itself a random variable. Hence, $g(Y) = E[X \mid Y]$ is defined. It means that $E[X \mid Y]$ is also a random variable. It is a transformation of $Y$.
We also call this random variable $E[X \mid Y]$ a conditional expectation.
Since $E[X \mid Y]$ is a random variable, we can calculate its expectation and variance. Similarly, we can also find the conditional variance $\mathrm{Var}(X \mid Y)$; it is also a random variable.
Law of Total Expectation (aka Law of Iterated Expectation, or Adam’s Law): $E[X] = E[E[X \mid Y]]$
Law of Total Variance (aka Eve’s Law): $\mathrm{Var}(X) = E[\mathrm{Var}(X \mid Y)] + \mathrm{Var}(E[X \mid Y])$
Intuition behind the law of total variance: it is the decomposition of the total variance of $X$ into 2 parts:
- The expected value, over $Y$, of the variances of $X$ for a given value of $Y$ (i.e., the variance within each group when the $X$’s are grouped by $Y$) → “expectation of variances”
- The variance between the different expected values of $X$ for different values of $Y$ (i.e., the variance between the groups when the $X$’s are grouped by $Y$) → “variance of expectations”
Intuitively, the total variance is the sum of both the parts. We use the terms “explained variance” and “unexplained variance” when we’re dealing with the analysis of variance (ANOVA) test (to see if different categories are statistically different)
Let’s prove the law of total variance (to convince ourselves that the above identity is correct - and it is good practice of the math). Before we prove it, it’s necessary to understand that:
- $E[X \mid Y = y]$ is a function of $y$, NOT of $X$. For a given $y$, $E[X \mid Y = y]$ is a fixed number (here, we’re taking the expectation over $X$ for a fixed $y$)
- In $E[E[X \mid Y]]$, the outer expectation is taken over different values of $Y$, while the inner one is calculated for a fixed $y$ (i.e., the inner expectation is over possible values of $X$)
Proof (to show that $\mathrm{Var}(X) = E[\mathrm{Var}(X \mid Y)] + \mathrm{Var}(E[X \mid Y])$):
$\mathrm{Var}(X \mid Y) = E[X^2 \mid Y] - (E[X \mid Y])^2$ (by the definition of variance: $\mathrm{Var}(X) = E[X^2] - (E[X])^2$. Here, it’s the same formula but we’re applying it in the conditional universe where $Y$ is already known.)
Then, $E[\mathrm{Var}(X \mid Y)] = E[E[X^2 \mid Y]] - E[(E[X \mid Y])^2]$ (by linearity of expectation)
But from the law of iterated expectation, we know that: $E[E[X^2 \mid Y]] = E[X^2]$
So, we have: $E[\mathrm{Var}(X \mid Y)] = E[X^2] - E[(E[X \mid Y])^2]$ → first term in the equation
Also, we have: $\mathrm{Var}(E[X \mid Y]) = E[(E[X \mid Y])^2] - (E[E[X \mid Y]])^2$ (by the definition of variance applied to the random variable $E[X \mid Y]$)
Again, from the law of iterated expectation, we know that $E[E[X \mid Y]] = E[X]$.
Then, we have: $\mathrm{Var}(E[X \mid Y]) = E[(E[X \mid Y])^2] - (E[X])^2$ → second term in the equation
Combining the two results above, we get:
$E[\mathrm{Var}(X \mid Y)] + \mathrm{Var}(E[X \mid Y]) = E[X^2] - (E[X])^2 = \mathrm{Var}(X)$ (by definition)
Hence, proved.
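As a final sanity check, here is a simulation sketch with a made-up model ($Y$ uniform on $\{1, 2, 3\}$, and given $Y = y$, $X$ uniform on $\{0, \dots, y\}$): we verify Eve’s law numerically by grouping the samples of $X$ by the value of $Y$.

```python
import random
from statistics import mean, pvariance

random.seed(0)

# Made-up model: Y uniform on {1, 2, 3}; given Y = y, X uniform on {0, ..., y}.
def sample_pair():
    y = random.randint(1, 3)
    return random.randint(0, y), y

samples = [sample_pair() for _ in range(200_000)]
xs = [x for x, _ in samples]

# Group X by the value of Y (each y has probability 1/3, so plain averages over the groups are fine).
groups = {y: [x for x, yy in samples if yy == y] for y in (1, 2, 3)}
E_var = mean(pvariance(groups[y]) for y in groups)          # E[Var(X | Y)]
var_E = pvariance([mean(groups[y]) for y in groups])        # Var(E[X | Y])
print(round(pvariance(xs), 3), round(E_var + var_E, 3))     # the two numbers agree closely
```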
Random Sum
Random sum is a classic example to test understanding of conditional expectation, conditional variance and MGF.
Let $X_1, X_2, \dots$ be a sequence of i.i.d. (independent and identically distributed) random variables.
Let $N$ be a discrete random variable (taking values in $\{0, 1, 2, \dots\}$), independent of the $X_i$’s.
Then, the random sum is defined as: $S = X_1 + X_2 + \dots + X_N = \sum_{i=1}^{N} X_i$. That is, the number of terms to be summed is also random. That is why it’s called a random sum.
Say, $E[X_i] = \mu$ and $\mathrm{Var}(X_i) = \sigma^2$, for notational simplicity.
Then we have,
Expectation: $E[S] = E[E[S \mid N]] = E[N \mu] = \mu \, E[N]$
The above result should be fairly intuitive → the expectation of the random sum is just the product of the expectation of the number of terms, $E[N]$, and the expectation of each random variable $X_i$, $\mu$.
Note that the above formula for the expectation works even when the $X_i$’s are NOT independent of each other (linearity of expectation does not require independence).
Variance: $\mathrm{Var}(S) = E[\mathrm{Var}(S \mid N)] + \mathrm{Var}(E[S \mid N]) = E[N]\,\sigma^2 + \mathrm{Var}(N)\,\mu^2$
Moment generating function: $M_S(t) = E[E[e^{tS} \mid N]] = E[(M_X(t))^N] = M_N(\ln M_X(t))$
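And a simulation sketch for the random sum (with made-up choices: $N$ uniform on $\{0, \dots, 4\}$, each $X_i$ a fair die roll, $N$ independent of the $X_i$’s), checking the expectation and variance formulas above.

```python
import random
from statistics import mean, pvariance

random.seed(1)

# Made-up choices: N uniform on {0,...,4}; X_i a fair die roll; N independent of the X_i's.
def random_sum():
    n = random.randint(0, 4)
    return sum(random.randint(1, 6) for _ in range(n))

S = [random_sum() for _ in range(200_000)]

mu, sigma2 = 3.5, 35 / 12                         # E[X_i] and Var(X_i) for a fair die
EN, varN   = 2.0, 2.0                             # E[N] and Var(N) for uniform {0,...,4}

print(round(mean(S), 2), "vs", EN * mu)                                     # ~7.0 vs 7.0
print(round(pvariance(S), 2), "vs", round(EN * sigma2 + varN * mu**2, 2))   # ~30.33 vs 30.33
```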