Lecture 17: Probability

10:30 AM, Mar 31, 2009

Contents

1 Overview

2 Introduction

3 Axioms of Probability

3.1 Properties of Probability Functions

4 Conditional Probability

4.1 Independence

5 Random Variables

5.1 Expectation and Variance

5.2 Correlation and Covariance

1 Overview

During the last unit, our investigations of AI were rooted in logic; in particular, we studied “symbolic” AI, in which all statements were assigned binary truth values. We now embark upon a study of what might be called “numeric” AI. Rather than associate discrete values with statements, we associate continuous values, or probabilities. This representational flexibility is intended to model uncertainty in an agent’s environment.

2 Introduction

Gambling was the primary force driving the early development of probability theory. As early as the 16th century, gamblers noticed that there are empirical laws which govern the frequencies of the various outcomes in a game of chance, even though the precise outcome cannot generally be predicted in advance. For example, Cardano noticed that for many simple games of chance, each outcome is realized in proportion to the reciprocal of the total number of outcomes. Cardano’s observation applies to rolling dice and flipping coins, for example.

Suppose an experiment (such as spinning a red and black roulette wheel) is repeated $N$ times. Let $\#(A)$ denote the number of times the outcome $A$ (e.g., “landed on red”) is observed. The ratio of $\#(A)$ to $N$ is called the relative frequency of $A$. Empirically, when $N$ is large, this relative frequency approximates some $p \in \mathbb{R}$: i.e.,

$$\frac{\#(A)}{N} \approx p$$

Since $p$ depends on $A$, it is usually written $P(A)$, and it is called the probability of $A$. Sometimes the symmetric nature of the experiment renders all outcomes equally likely, as in the games of chance studied by Cardano. In this case,

$$P(A) = \frac{|A|}{|\Omega|}$$

We write $\Omega$ to denote the sample space of possible outcomes, with $\omega \in \Omega$ and $A \subseteq \Omega$.
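The relative-frequency idea can be illustrated with a short simulation. The sketch below is only an illustration: it assumes a hypothetical wheel that lands on red with probability 0.5 (ignoring any green pockets), and the name `relative_frequency` and the fixed seed are choices made here, not part of the lecture.

```python
import random

def relative_frequency(n_trials, p_red=0.5, seed=0):
    """Return #(red) / N for N simulated spins of a hypothetical wheel.

    p_red is an assumed probability of landing on red; the seed is fixed
    only so that repeated runs give the same numbers.
    """
    rng = random.Random(seed)
    count_red = sum(1 for _ in range(n_trials) if rng.random() < p_red)
    return count_red / n_trials

# The relative frequency should settle near p_red as N grows.
for n in (10, 1_000, 100_000):
    print(n, relative_frequency(n))
```

For small N the ratio can be far from 0.5; for large N it should settle close to it, which is the empirical regularity described above.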

Example: Consider an experiment in which an unbiased coin is flipped twice in succession. In this experiment, the sample space Ω = {HH,HT,TH,TT}, with each outcome equally likely. If A is the event “at least one head,” then P(A)=3/4.

Similarly, consider an experiment in which an unbiased coin is flipped thrice in succession. In this experiment, the sample space Ω = {HHH,HHT,HTH,HTT,THH,THT,TTH,TTT}, with each outcome equally likely. If B is the event “at most one tail,” then P(B)=1/2.
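Both examples can be checked by brute-force enumeration. The following is a minimal sketch, assuming equally likely outcomes; the helper names `sample_space` and `prob` are introduced here for illustration.

```python
from fractions import Fraction
from itertools import product

def sample_space(n_flips):
    """All sequences of n_flips coin flips, e.g. 'HT' for two flips."""
    return ["".join(flips) for flips in product("HT", repeat=n_flips)]

def prob(event, omega):
    """Classical probability |A| / |Omega| for equally likely outcomes.

    `event` is a predicate that says whether an outcome belongs to A.
    """
    favourable = [w for w in omega if event(w)]
    return Fraction(len(favourable), len(omega))

omega2 = sample_space(2)
omega3 = sample_space(3)

p_a = prob(lambda w: w.count("H") >= 1, omega2)  # at least one head
p_b = prob(lambda w: w.count("T") <= 1, omega3)  # at most one tail
print(p_a, p_b)  # 3/4 1/2
```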

3 Axioms of Probability

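The following three axioms characterize the probability function $P : 2^\Omega \to \mathbb{R}$:

  1. $P(\Omega) = 1$
  2. $0 \le P(A) \le 1$, for all events $A \subseteq \Omega$
  3. $P(A \cup B) = P(A) + P(B)$, for all events $A, B \subseteq \Omega$ s.t. $A \cap B = \emptyset$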

The third axiom is often expressed as its inductive equivalent, namely

3. given the finite set of disjoint events $\{A_1, A_2, \ldots, A_n\}$,

$$P\left(\bigcup_{i=1}^{n} A_i\right) = P(A_1) + P(A_2) + \cdots + P(A_n)$$

3.1 Properties of Probability Functions

As with probabilities, set theory also provides a foundation for the meaning of logical formulas. If $\Omega$ is viewed as the set of all “possible worlds” (i.e., interpretations), then the meaning of proposition $A$ is the set of its models, denoted $M(A) \subseteq \Omega$. Moreover, $M(A \wedge B) = A \cap B$, $M(A \vee B) = A \cup B$, and $M(\neg A) = A^c$, for arbitrary propositions $A$ and $B$. This isomorphism allows us to use our logical toolkit to reason about probabilities.

Using the axioms of probability and the laws of logic, we now derive several useful properties of probability functions.

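Since $A = A \wedge \top = A \wedge (B \vee \neg B) = (A \wedge B) \vee (A \wedge \neg B)$, and since the intersection of $(A \cap B)$ and $(A \cap B^c)$ is empty, it follows from the third axiom that

$$P(A) = P(A \cap B) + P(A \cap B^c) \tag{1}$$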

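Letting $A = \Omega$ yields $P(\Omega) = P(\Omega \cap B) + P(\Omega \cap B^c) = P(B) + P(B^c)$, from which we conclude by the first axiom that for all $B \subseteq \Omega$, $P(B^c) = 1 - P(B)$. In particular, letting $B = \Omega$ yields $P(\emptyset) = P(\Omega^c) = 1 - P(\Omega) = 1 - 1 = 0$.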

For arbitrary propositions A and B,

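$$\begin{aligned}
A \vee B &= (A \wedge \top) \vee (B \wedge \top) \\
&= (A \wedge (B \vee \neg B)) \vee (B \wedge (A \vee \neg A)) \\
&= (A \wedge B) \vee (A \wedge \neg B) \vee (B \wedge A) \vee (B \wedge \neg A) \\
&= (A \wedge B) \vee (A \wedge \neg B) \vee (B \wedge \neg A)
\end{aligned}$$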

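Since these three disjuncts are mutually exclusive, the third axiom and Equation 1 imply that for $A$ and $B$ not necessarily disjoint,

$$\begin{aligned}
P(A \cup B) &= P(A \cap B) + P(A \cap B^c) + P(B \cap A^c) \\
&= P(A \cap B) + (P(A) - P(A \cap B)) + (P(B) - P(A \cap B)) \\
&= P(A) + P(B) - P(A \cap B)
\end{aligned}$$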

Example: Once again, let $\Omega = \{HH, HT, TH, TT\}$. If $A$ is the event “at least one head,” then $A = \{HH, HT, TH\}$; if $B$ is the event “at least one tail,” then $B = \{HT, TH, TT\}$ and $B^c = \{HH\}$. Now $A \cap B = \{HT, TH\}$ and $A \cap B^c = \{HH\}$. Thus, by Equation 1, $P(A) = 2/4 + 1/4 = 3/4$. Moreover, $P(B^c) = 1/4 = 1 - 3/4 = 1 - P(B)$. Finally, $P(A \cup B) = 3/4 + 3/4 - 2/4 = 1$: i.e., with probability 1, the event “at least one head or at least one tail” occurs.
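The three identities used in this example can be checked mechanically with Python sets. This is a small sketch under the equally-likely assumption; `OMEGA` and `prob` are names chosen here, and note that in the code `&`, `|` and `-` are set intersection, union and difference (not conditioning).

```python
from fractions import Fraction

OMEGA = frozenset({"HH", "HT", "TH", "TT"})

def prob(event):
    """P(event) = |event| / |OMEGA| for equally likely outcomes."""
    return Fraction(len(event), len(OMEGA))

A = frozenset(w for w in OMEGA if "H" in w)  # at least one head
B = frozenset(w for w in OMEGA if "T" in w)  # at least one tail
B_c = OMEGA - B                              # complement of B

eq1_holds = prob(A) == prob(A & B) + prob(A & B_c)              # Equation 1
complement_holds = prob(B_c) == 1 - prob(B)                     # P(B^c) = 1 - P(B)
union_holds = prob(A | B) == prob(A) + prob(B) - prob(A & B)    # union formula

print(eq1_holds, complement_holds, union_holds)  # True True True
```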

4 Conditional Probability

The conditional probability $P(A \mid B)$ of $A$ given $B$ is defined, for $P(B) > 0$, as

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} \tag{2}$$

The (unconditional) probability of an event $A$ is in fact the conditional probability of event $A$ given sample space $\Omega$: i.e., $P(A \mid \Omega) = P(A)$. Conditional probabilities are sometimes called posterior probabilities, in which case unconditional probabilities are called prior probabilities.

Continuing our running example, $P(A \mid B)$ denotes the probability of observing at least one head, given at least one tail. This probability is 2/3, since two of the three outcomes in $B$ (those with at least one tail) also include at least one head, namely HT and TH.

One important consequence of Equation 2 is the product rule:

$$P(A \cap B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A) \tag{3}$$

Intuitively, the probability of observing two events is the probability of observing the former, given the latter, times the probability of observing the latter; or it is the probability of observing the latter, given the former, times the probability of observing the former.

In our example, $P(A \cap B) = P(A \mid B)\,P(B) = (2/3)(3/4) = 1/2$.
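Equation 2 and the product rule can be sketched in a few lines for the running two-toss example. The function `cond_prob` is a name introduced here; it simply applies the definition and refuses to condition on an event of probability zero.

```python
from fractions import Fraction

OMEGA = frozenset({"HH", "HT", "TH", "TT"})

def prob(event):
    """P(event) for equally likely outcomes."""
    return Fraction(len(event), len(OMEGA))

def cond_prob(a, b):
    """P(a | b) = P(a ∩ b) / P(b), as in Equation 2."""
    if prob(b) == 0:
        raise ValueError("cannot condition on an event of probability zero")
    return prob(a & b) / prob(b)

A = frozenset(w for w in OMEGA if "H" in w)  # at least one head
B = frozenset(w for w in OMEGA if "T" in w)  # at least one tail

print(cond_prob(A, B))                 # 2/3
print(cond_prob(A, B) * prob(B))       # 1/2, product rule
print(cond_prob(B, A) * prob(A))       # 1/2, product rule the other way round
print(cond_prob(A, OMEGA) == prob(A))  # True
```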

Rewriting the product rule yields Bayes’ rule:

$$P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A)} \tag{4}$$

Equations 1 and 3 together imply $P(A) = P(A \mid B)\,P(B) + P(A \mid B^c)\,P(B^c)$, which we can use to reformulate Bayes’ rule:

$$P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A \mid B)\,P(B) + P(A \mid B^c)\,P(B^c)}$$

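For example, consider the probability of Disease given Behavior:

$$P(\text{Disease} \mid \text{Behavior}) = \frac{P(\text{Behavior} \mid \text{Disease})\,P(\text{Disease})}{P(\text{Behavior})}$$

where $P(\text{Behavior}) = P(\text{Behavior} \mid \text{Disease})\,P(\text{Disease}) + P(\text{Behavior} \mid \text{Disease}^c)\,P(\text{Disease}^c)$.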

More specifically, consider a medical clinic in which 10% of the patients have cancer, 25% of the patients are smokers, and 75% of cancer patients smoke. According to Bayes’ rule, the probability that a patient who smokes has cancer is equal to $(.75)(.1)/.25 = .3$.
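The clinic calculation can be reproduced with both forms of Bayes’ rule. This is a sketch only: the function names are introduced here, and the value of P(smoker | no cancer) is not given in the text, so it is derived in the code from the three stated numbers.

```python
from fractions import Fraction

def bayes(p_a_given_b, p_b, p_a):
    """Equation 4: P(B | A) = P(A | B) P(B) / P(A)."""
    return p_a_given_b * p_b / p_a

def bayes_expanded(p_a_given_b, p_b, p_a_given_not_b):
    """Reformulated rule, with P(A) expanded over B and its complement."""
    p_a = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)
    return p_a_given_b * p_b / p_a

p_cancer = Fraction(10, 100)
p_smoker = Fraction(25, 100)
p_smoker_given_cancer = Fraction(75, 100)

# Direct use of Equation 4.
p_cancer_given_smoker = bayes(p_smoker_given_cancer, p_cancer, p_smoker)

# Not stated in the text: solved from
# P(smoker) = P(smoker | cancer) P(cancer) + P(smoker | no cancer) P(no cancer).
p_smoker_given_healthy = (p_smoker - p_smoker_given_cancer * p_cancer) / (1 - p_cancer)

print(p_cancer_given_smoker)  # 3/10
print(bayes_expanded(p_smoker_given_cancer, p_cancer, p_smoker_given_healthy))  # 3/10
```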



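In general, let $\{A_1, \ldots, A_n\}$ be a disjoint set of events such that $\bigcup_{i=1}^{n} A_i = \Omega$. By the definition of conditional probability,

$$P(A_i \mid B) = \frac{P(B \cap A_i)}{P(B)}$$

By the product rule, $P(B \cap A_i) = P(B \mid A_i)\,P(A_i)$. Now since the $A_j$’s cover $\Omega$, $B = \bigcup_{j=1}^{n} (B \cap A_j)$, and moreover, since they are disjoint, $P(B) = \sum_{j=1}^{n} P(B \cap A_j)$. Again, by the product rule,

$$P(B) = \sum_{j=1}^{n} P(B \mid A_j)\,P(A_j)$$

And now Bayes’ rule in all its generality:

$$P(A_i \mid B) = \frac{P(B \mid A_i)\,P(A_i)}{\sum_{j=1}^{n} P(B \mid A_j)\,P(A_j)} \tag{5}$$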
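Equation 5 amounts to multiplying each prior by its likelihood and then normalizing. A minimal sketch follows; the function name `posterior` and the particular partition (two fair tosses grouped by number of heads, with B the event “head on the first toss”) are illustrative choices made here, not taken from the lecture.

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Equation 5 for a partition A_1, ..., A_n.

    priors[j] is P(A_j) and likelihoods[j] is P(B | A_j);
    the result is the list of P(A_j | B).
    """
    joint = [lik * pri for lik, pri in zip(likelihoods, priors)]  # P(B ∩ A_j)
    evidence = sum(joint)                                         # P(B)
    if evidence == 0:
        raise ValueError("P(B) = 0, so the posterior is undefined")
    return [j / evidence for j in joint]

# Partition of the two-toss sample space by number of heads: {TT}, {HT, TH}, {HH}.
priors = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]
# B = "head on the first toss": impossible with 0 heads,
# probability 1/2 given exactly 1 head, certain given 2 heads.
likelihoods = [Fraction(0), Fraction(1, 2), Fraction(1)]

post = posterior(priors, likelihoods)
print([str(p) for p in post])  # ['0', '1/2', '1/2']
```

The posterior sums to 1 by construction, since the denominator is exactly the sum of the numerators.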

4.1 Independence

$A$ and $B$ are independent events iff $P(A \mid B) = P(A)$. If $P(A \mid B) = P(A)$, then by Equation 4, $P(B \mid A) = P(B)$. Moreover,

$$P(A \cap B) = P(A \mid B)\,P(B) = P(A)\,P(B) \tag{6}$$

Example: If $A$ is the event “head on the first toss” and $B$ is the event “head on the second toss,” then $A = \{HH, HT\}$, $B = \{HH, TH\}$, and $A \cap B = \{HH\}$. $A$ and $B$ are independent events, since $P(A \cap B) = 1/4 = (1/2)(1/2) = P(A)\,P(B)$. On the other hand, if $B$ is the event “at least one tail,” then $B = \{HT, TH, TT\}$, and the events $A$ and $B$ are not independent: $P(A \cap B) = 1/4$ but $P(A)\,P(B) = (1/2)(3/4) = 3/8$.
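Equation 6 gives a direct test for independence of two events. The sketch below applies it to the two cases in this example; the helper name `independent` and the event names are introduced here.

```python
from fractions import Fraction

OMEGA = frozenset({"HH", "HT", "TH", "TT"})

def prob(event):
    """P(event) for equally likely outcomes."""
    return Fraction(len(event), len(OMEGA))

def independent(a, b):
    """True iff P(a ∩ b) = P(a) P(b), as in Equation 6."""
    return prob(a & b) == prob(a) * prob(b)

first_head = frozenset(w for w in OMEGA if w[0] == "H")
second_head = frozenset(w for w in OMEGA if w[1] == "H")
some_tail = frozenset(w for w in OMEGA if "T" in w)

print(independent(first_head, second_head))  # True
print(independent(first_head, some_tail))    # False: 1/4 versus 3/8
```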

Exercise: Show that if $A$ and $B$ are independent events, then the pairs of events $A$ and $B^c$, $A^c$ and $B$, and $A^c$ and $B^c$ are all independent.

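$A$ and $B$ are conditionally independent with respect to $C$ iff $P(A \mid B \cap C) = P(A \mid C)$. This definition also implies that $P(B \mid A \cap C) = P(B \mid C)$, since

$$\begin{aligned}
P(B \mid A \cap C) &= \frac{P(A \cap C \mid B)\,P(B)}{P(A \cap C)} \\
&= \frac{P(A \mid B \cap C)\,P(C \mid B)\,P(B)}{P(A \cap C)} \\
&= \frac{P(A \mid C)\,P(C \mid B)\,P(B)}{P(A \cap C)} \\
&= \frac{P(C \mid B)\,P(B)}{P(C)} \\
&= P(B \mid C)
\end{aligned}$$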

Moreover, if $A$ and $B$ are conditionally independent with respect to $C$, then

$$P(A \cap B \mid C) = P(A \mid B \cap C)\,P(B \mid C) = P(A \mid C)\,P(B \mid C) \tag{7}$$

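Exercise: Is it true that if $A$ and $B$ are independent events, then $A$ and $B$ are conditionally independent with respect to $C$ for all events $C$? Prove the claim or give a counterexample. (Hint: in the two-toss example, let $A$ be “head on the first toss,” $B$ “head on the second toss,” and $C$ “at least one head.”)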

5 Random Variables

A simple probability space $\langle \Omega, 2^\Omega, P \rangle$ is a (finite) sample space $\Omega$ together with a function $P : 2^\Omega \to \mathbb{R}$ that assigns real values to events $A \subseteq \Omega$ and satisfies the axioms of probability. Given a simple probability space $\langle \Omega, 2^\Omega, P \rangle$, a (discrete) random variable $X$ is a map $X : \Omega \to \{x_1, x_2, \ldots\}$. The real-valued function $P(\{\omega \mid X(\omega) = x_i\})$, abbreviated $P(X = x_i)$, gives rise to a probability mass function $p_X : \{x_1, x_2, \ldots\} \to \mathbb{R}$ given by $p_X(x_i) \equiv P(X = x_i)$.

Example: If $\Omega = \{HH, HT, TH, TT\}$, a possible random variable $X$ is the total number of heads, in which case the range of $X$ is $\{0, 1, 2\}$. Now $p_X(0) = P(\{TT\}) = 1/4$, $p_X(1) = P(\{HT, TH\}) = 2/4$, and $p_X(2) = P(\{HH\}) = 1/4$. Similarly, if $Y$ is the random variable representing the total number of tails, then $p_Y(0) = P(\{HH\}) = 1/4$, $p_Y(1) = P(\{HT, TH\}) = 2/4$, and $p_Y(2) = P(\{TT\}) = 1/4$.

These probability functions can be fully specified using one-dimensional tables:

X    p(X)          Y    p(Y)
0    1/4           0    1/4
1    1/2           1    1/2
2    1/4           2    1/4
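These tables can be generated directly from the definition $p_X(x_i) = P(\{\omega \mid X(\omega) = x_i\})$. In the sketch below a random variable is represented as an ordinary Python function on outcomes; the name `pmf` is a choice made here.

```python
from collections import defaultdict
from fractions import Fraction

OMEGA = ["HH", "HT", "TH", "TT"]

def pmf(random_variable, omega=OMEGA):
    """Return {x: P(X = x)}, summing over equally likely outcomes."""
    table = defaultdict(Fraction)  # missing keys start at Fraction(0)
    for outcome in omega:
        table[random_variable(outcome)] += Fraction(1, len(omega))
    return dict(table)

p_x = pmf(lambda w: w.count("H"))  # X = total number of heads
p_y = pmf(lambda w: w.count("T"))  # Y = total number of tails

for value in sorted(p_x):
    print(value, p_x[value], p_y[value])
```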

Let $X$ and $Y$ be random variables with ranges $\{x_1, x_2, \ldots\}$ and $\{y_1, y_2, \ldots\}$, respectively. The random variable $X \times Y$ has as its range the set of ordered pairs $\{x_1, x_2, \ldots\} \times \{y_1, y_2, \ldots\}$. The joint probability mass function $p_{X \times Y}(x_i, y_j) \equiv P(X = x_i, Y = y_j)$ is defined by $P(\{\omega \mid X(\omega) = x_i, Y(\omega) = y_j\})$. For example, the joint probability mass function on the random variables $X$ and $Y$ in the above example is specified in the following two-dimensional table:

         Y = 0   Y = 1   Y = 2
X = 0    0       0       1/4
X = 1    0       1/2     0
X = 2    1/4     0       0

In general, it requires space exponential in the number of random variables to specify a joint probability mass function. However, when certain (in)dependence criteria are satisfied, space requirements can be reduced.

Random variables $X$ and $Y$ with respective ranges $\{x_1, x_2, \ldots\}$ and $\{y_1, y_2, \ldots\}$ are independent iff the events $X = x_i$ and $Y = y_j$ are independent for all $i$ and $j$: i.e., $P(X = x_i, Y = y_j) = P(X = x_i)\,P(Y = y_j)$. Assuming independence, the joint probability mass function $P(X = x_i, Y = y_j)$ can be specified using two one-dimensional tables, rather than one two-dimensional table. In our example, however, the variables $X$ and $Y$ are not independent; $X$ and $Y$ are perfectly (negatively) correlated, since $Y = 2 - X$. Indeed, one one-dimensional table suffices to describe both variables.
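The independence test for random variables can be sketched by building the joint table, summing out each coordinate to get the marginals, and comparing every cell with the product of marginals. The helper names below are introduced here; the pair `first`/`second` (indicators of a head on each toss) is an extra illustration added for contrast and is not part of the running example.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

OMEGA = ["HH", "HT", "TH", "TT"]

def joint_pmf(f, g, omega=OMEGA):
    """Return {(x, y): P(f = x, g = y)} for equally likely outcomes."""
    counts = Counter((f(w), g(w)) for w in omega)
    return {pair: Fraction(c, len(omega)) for pair, c in counts.items()}

def marginal(joint, axis):
    """Sum the joint table over the other coordinate."""
    out = {}
    for pair, p in joint.items():
        out[pair[axis]] = out.get(pair[axis], Fraction(0)) + p
    return out

def independent(joint):
    """True iff every cell equals the product of its two marginals."""
    px, py = marginal(joint, 0), marginal(joint, 1)
    return all(joint.get((x, y), Fraction(0)) == px[x] * py[y]
               for x, y in product(px, py))

heads = lambda w: w.count("H")      # X in the running example
tails = lambda w: w.count("T")      # Y in the running example
first = lambda w: int(w[0] == "H")  # illustrative: head on first toss
second = lambda w: int(w[1] == "H") # illustrative: head on second toss

print(independent(joint_pmf(heads, tails)))   # False
print(independent(joint_pmf(first, second)))  # True
```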

The definitions of conditional probability and conditional independence for random variables, like the definition of independence for random variables, translate directly from the respective definitions for events. The corresponding notions of the product rule and Bayes’ rule are given by:

$$\begin{aligned}
P(X = x_i, Y = y_j) &= P(X = x_i \mid Y = y_j)\,P(Y = y_j) \\
&= P(Y = y_j \mid X = x_i)\,P(X = x_i)
\end{aligned}$$

and

$$P(X = x_i \mid Y = y_j) = \frac{P(Y = y_j \mid X = x_i)\,P(X = x_i)}{P(Y = y_j)}$$

5.1 Expectation and Variance

The expected value $E[X]$ of a random variable $X$ is defined as:

$$E[X] = \sum_i p(x_i)\, x_i$$

The expected value is also called the mean, in which case it is denoted $\mu_X$, or $\mu$, whenever the random variable over which it is defined is clear from context.

For example, if $I$ is a random variable with range $\{0, 1, \ldots, n\}$, and if $I$ is uniformly distributed (i.e., $P(I = i) = 1/(n+1)$ for all $0 \le i \le n$), then

$$E[I] = \sum_{i=0}^{n} \frac{1}{n+1}\, i = \frac{1}{n+1} \sum_{i=0}^{n} i = \frac{1}{n+1} \cdot \frac{n(n+1)}{2} = \frac{n}{2}$$
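The closed form $E[I] = n/2$ can be spot-checked numerically. The sketch below represents a probability mass function as a dictionary from values to probabilities; the names `expectation` and `uniform_pmf` are introduced here for illustration.

```python
from fractions import Fraction

def expectation(pmf):
    """E[X] = sum_i p(x_i) * x_i for a pmf given as {value: probability}."""
    return sum(p * x for x, p in pmf.items())

def uniform_pmf(n):
    """Uniform distribution on {0, 1, ..., n}."""
    return {i: Fraction(1, n + 1) for i in range(n + 1)}

for n in (1, 5, 10):
    print(n, expectation(uniform_pmf(n)))  # each value equals n/2
```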

Exercise: Compute $E[X]$ and $E[Y]$ in our running example.

The following properties hold of expectation:

  • If $X$ and $Y$ are independent random variables, then $E[XY] = E[X]\,E[Y]$.
  • Linearity of expectation: i.e., $E[X + Y] = E[X] + E[Y]$.
  • For arbitrary constant $c \in \mathbb{R}$, $E[cX] = cE[X]$.

Exercise: Prove the stated properties of expectation.

Expectation suffices to predict an average data point. The two sequences $0, 0, 0, 0, \ldots$ and $-1, 1, -1, 1, \ldots$ both have expected value zero, however. Variance captures the variability in a series of data points.

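Given random variable $X$, the variance $\mathrm{var}(X) = E[(X - \mu_X)^2]$:

$$\begin{aligned}
\mathrm{var}(X) &= E[(X - \mu_X)^2] \\
&= E[X^2 - 2\mu_X X + \mu_X^2] \\
&= E[X^2] - 2\mu_X E[X] + \mu_X^2 \\
&= E[X^2] - 2E[X]\,E[X] + (E[X])^2 \\
&= E[X^2] - (E[X])^2
\end{aligned}$$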

Variance is often denoted $\sigma_X^2$, or simply $\sigma^2$, whenever $X$ is clear from context. The value $\sigma$ is called the standard deviation. If $I$ is as above, a uniformly distributed random variable with range $\{0, 1, \ldots, n\}$, then

$$E[I^2] = \sum_{i=0}^{n} \frac{1}{n+1}\, i^2 = \frac{1}{n+1}\,(0^2 + 1^2 + 2^2 + \cdots + n^2) = \frac{n(2n+1)}{6}$$

Thus,

$$\mathrm{var}(I) = E[I^2] - (E[I])^2 = \frac{n(2n+1)}{6} - \left(\frac{n}{2}\right)^2 = \frac{n(n+2)}{12}$$
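Both the shortcut $\mathrm{var}(X) = E[X^2] - (E[X])^2$ and the closed form $n(n+2)/12$ can be checked against the definition. This is a small sketch with helper names chosen here; the pmf is again a dictionary from values to probabilities.

```python
from fractions import Fraction

def expectation(pmf, g=lambda x: x):
    """E[g(X)] for a pmf given as {value: probability}."""
    return sum(p * g(x) for x, p in pmf.items())

def variance(pmf):
    """Shortcut form: E[X^2] - (E[X])^2."""
    mean = expectation(pmf)
    return expectation(pmf, lambda x: x * x) - mean ** 2

def variance_from_definition(pmf):
    """Definition: E[(X - mu)^2]."""
    mean = expectation(pmf)
    return expectation(pmf, lambda x: (x - mean) ** 2)

def uniform_pmf(n):
    """Uniform distribution on {0, 1, ..., n}."""
    return {i: Fraction(1, n + 1) for i in range(n + 1)}

for n in (1, 4, 10):
    u = uniform_pmf(n)
    print(n, variance(u), variance_from_definition(u), Fraction(n * (n + 2), 12))
```

All three columns agree for each n, as the derivation predicts.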

The following properties hold of variance:

  • For arbitrary constant $a \in \mathbb{R}$, $\mathrm{var}(aX) = a^2\, \mathrm{var}(X)$.
  • For arbitrary constant $c \in \mathbb{R}$, $\mathrm{var}(X + c) = \mathrm{var}(X)$.

Exercise: Prove the stated properties of variance.

Exercise: Let $Z$ represent the number of heads after three successive coin tosses. Given $p(Z)$ as follows, compute $E[Z]$ and $\mathrm{var}(Z)$.

Z       0     1     2     3
p(Z)    1/8   3/8   3/8   1/8

5.2 Correlation and Covariance

Correlation and covariance provide numerical measurements of the strength of the relationship between two random variables.

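Given random variables $X$ and $Y$ with respective means $\mu_X$ and $\mu_Y$, the covariance $\mathrm{cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]$:

$$\begin{aligned}
\mathrm{cov}(X, Y) &= E[(X - \mu_X)(Y - \mu_Y)] \\
&= E[XY - \mu_X Y - \mu_Y X + \mu_X \mu_Y] \\
&= E[XY] - \mu_X E[Y] - \mu_Y E[X] + \mu_X \mu_Y \\
&= E[XY] - \mu_X \mu_Y - \mu_Y \mu_X + \mu_X \mu_Y \\
&= E[XY] - \mu_X \mu_Y
\end{aligned}$$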

The correlation coefficient $\rho_{XY}$ is defined in terms of covariance and standard deviation as follows:

$$\rho_{XY} = \frac{\mathrm{cov}(X, Y)}{\sigma_X\, \sigma_Y} \tag{8}$$

The following properties hold of covariance and correlation:

  • For independent random variables $X$ and $Y$, $\mathrm{cov}(X, Y) = \rho_{XY} = 0$.
  • For arbitrary random variables $X$ and $Y$, $\mathrm{var}(X + Y) = \mathrm{var}(X) + \mathrm{var}(Y) + 2\,\mathrm{cov}(X, Y)$.
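As an illustration, covariance and correlation can be computed for the heads and tails variables $X$ and $Y$ of the two-toss example in Section 5, and the variance-of-a-sum property checked on the same data. The sketch below hard-codes the nonzero cells of their joint table; the helper name `expect` is introduced here.

```python
import math
from fractions import Fraction

# Nonzero cells of the joint pmf of X (heads) and Y (tails) for two fair tosses.
JOINT = {(0, 2): Fraction(1, 4), (1, 1): Fraction(1, 2), (2, 0): Fraction(1, 4)}

def expect(g, joint=JOINT):
    """E[g(X, Y)] under the joint pmf."""
    return sum(p * g(x, y) for (x, y), p in joint.items())

mu_x = expect(lambda x, y: x)
mu_y = expect(lambda x, y: y)
cov = expect(lambda x, y: x * y) - mu_x * mu_y       # E[XY] - mu_X mu_Y
var_x = expect(lambda x, y: x * x) - mu_x ** 2
var_y = expect(lambda x, y: y * y) - mu_y ** 2
var_sum = expect(lambda x, y: (x + y) ** 2) - (mu_x + mu_y) ** 2
rho = float(cov) / math.sqrt(var_x * var_y)          # Equation 8

print(cov, rho)                            # -1/2 -1.0
print(var_sum == var_x + var_y + 2 * cov)  # True
```

Here $X + Y$ is always 2, so its variance is zero, and the negative covariance exactly cancels the two individual variances.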