Bayesian Analysis In Ai

8/19/2019

I've done a dreadful job of reading The Theory That Would Not Die, but several weeks ago I somehow managed to read the appendix. Here the author gives a short explanation of Bayes' theorem using statistics related to breast cancer and mammogram results. This is the same real world example (one of several) used by Nate Silver. It's profound in its simplicity and- for an idiot like me- a powerful gateway drug. Possibly related to this is my recent epiphany that when we're talking about Bayesian analysis, we're really talking about multivariate probability. The breast cancer/mammogram example is the simplest form of multivariate analysis available. What does it all mean, how can we extend it and what does it have to do with an underlying philosophy of Bayesian analysis (if such a thing exists)?


(This article was first published on PirateGrunt » R, and kindly contributed to R-bloggers)

The Theory That Would Not Die is sitting at my desk at work, so I'm going to refer to the figures quoted by Nate Silver on page 246. Odds for cancer are read across the columns, odds for a positive mammogram are read down the rows.

Before I go any further, I have to point out that the positioning of the tables is dreadful. WordPress experts are invited to help me sort this out.

Per 1000	C-True	C-False
M-True	11	99
M-False	3	887

From this table, the joint probabilities are easy to read. What is the chance that a person has breast cancer and received a negative mammogram? 3 in 1000. What is the chance that a person does not have cancer, but received a positive mammogram? 99 in 1000, or roughly 10%. It's a trivial thing to determine the marginal probabilities.

Per 1000	C-True	C-False	M
M-True	11	99	110
M-False	3	887	890
C	14	986	1000
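The marginal totals, and the conditional probability they lead to, can be reproduced in a few lines. What follows is a minimal Python sketch; the variable names are illustrative and do not come from either book.

```python
# Joint counts per 1000 women, taken from the table above.
# Keys are (mammogram_positive, has_cancer).
counts = {
    (True, True): 11,     # positive mammogram, cancer
    (True, False): 99,    # positive mammogram, no cancer
    (False, True): 3,     # negative mammogram, cancer
    (False, False): 887,  # negative mammogram, no cancer
}
total = sum(counts.values())

# Marginals: sum the joint counts over the other variable.
positives = counts[(True, True)] + counts[(True, False)]
cancers = counts[(True, True)] + counts[(False, True)]

p_cancer = cancers / total        # P(cancer)
p_positive = positives / total    # P(positive mammogram)
p_positive_given_cancer = counts[(True, True)] / cancers

# Bayes' theorem:
# P(cancer | positive) = P(positive | cancer) * P(cancer) / P(positive)
p_cancer_given_positive = p_positive_given_cancer * p_cancer / p_positive

print(f"P(cancer) = {p_cancer:.3f}")
print(f"P(positive) = {p_positive:.3f}")
print(f"P(cancer | positive) = {p_cancer_given_positive:.3f}")
```

The last figure is simply 11 out of 110: the Bayes' theorem route and the direct read of the first row give the same answer.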

The context of this information is what matters to the authors. Each presents the result that the likelihood that a patient has cancer- even with a positive mammogram- is still rather low (10% in this case). This is consistent with advice from some areas of the medical establishment that women not get routine mammograms before a particular age. This (slightly) surprising result is driven by the fact that the positive predictive value (number of true positives divided by the number of predicted positives) is very low as is the likelihood of a positive. Put differently, a mammogram does not appear to have a good success rate at predicting cancer (for this data) and the overall rate of cancer is quite low. How would things look if the numbers changed?

How do we do that? In order to hold the cancer probability fixed, we can't change the marginal totals. So, we can move numbers in the same column from one row to another. Or, if we move from one column to another, we must offset that in the other row. As an extreme, we could assume that the test is perfectly predictive. This would move the 3 false negatives into the true positive cell and the 99 false positives to the true negative cell. In this case, there is no probability in the upper right or lower left corner of the matrix. From another perspective, it is impossible to distinguish the two marginal distributions.

But that's a bit boring, so let's create something more interesting. We'll not alter the number of false negatives, but reduce the false positives from 99 to 14, which lifts the positive predictive value from 10% to 44%.

Per 1000	C-True	C-False	M
M-True	11	14	25
M-False	3	972	975
C	14	986	1000

The chance that a person has cancer, conditional on a positive mammogram is now 44.0%. Before I look at another scenario, I'm going to scrap the tables in favor of something graphical. Here's what the first matrix looks like:

And the second matrix:

In the second plot, we continue to have a large concentration of the probability in the bottom right corner, but the top half is now more balanced. This balance comes from a shift away from the top right corner. All of this means that the information about a mammogram becomes more predictive.

What happens when we increase the likelihood of cancer? In graphical terms, this would mean giving the left side a more yellow color. We'll hold the original positive predictive value (roughly 10%) fixed, but raise the likelihood of cancer to 25%.

Per 1000	C-True	C-False	M
M-True	11	99	110
M-False	239	651	890
C	250	750	1000

This is interesting. The highest probability remains at the lower right hand corner (no cancer, clean mammogram) but there is now a greater concentration at the upper right and lower left corner. So, if one has a positive mammogram result, what is the posterior probability that they have cancer? The same 10% as before. And if the test showed negative? It's now 27%. This is higher than the probability if one got a positive result. Of course, this is because we've held the positive predictive value fixed, while raising the probability of the event. The efficacy of the test and the prevalence of the disease are now anti-correlated. Not the sort of thing one wants in a diagnostic tool. How would things look if the PPV were 50%?

Per 1000	C-True	C-False	M
M-True	55	55	110
M-False	195	695	890
C	250	750	1000
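To compare the four tables side by side, here is a short Python sketch that computes the chance of cancer after a positive and after a negative mammogram for each one. The scenario labels and the helper name `posteriors` are illustrative; the cell counts are copied from the tables above.

```python
def posteriors(tp, fp, fn, tn):
    """Return (P(cancer | positive), P(cancer | negative)) from cell counts."""
    return tp / (tp + fp), fn / (fn + tn)

# (true positives, false positives, false negatives, true negatives)
scenarios = {
    "original": (11, 99, 3, 887),
    "fewer false positives": (11, 14, 3, 972),
    "25% prevalence, PPV 10%": (11, 99, 239, 651),
    "25% prevalence, PPV 50%": (55, 55, 195, 695),
}

results = {name: posteriors(*cells) for name, cells in scenarios.items()}
for name, (given_pos, given_neg) in results.items():
    print(f"{name}: P(C|M+) = {given_pos:.3f}, P(C|M-) = {given_neg:.3f}")
```

In the last table a positive result puts the chance of cancer at 50%, while a negative result still leaves roughly 22% (195 out of 890), so the test is informative again but far from decisive.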

So what makes this Bayesian? The simple answer is that I don't know. I have trouble reconciling Silver and McGrayne's simple (though very accessible) examples of Bayesian inference with what I read in Gelman and Albert. Untangling the math takes me away from the philosophy, so I'll list three quick notions about what Bayesian analysis means to me:

In the presence of new information, our prior understanding may be modified. This is the one that feels like a one-off exercise as it is presented in the mammography examples. If I don't know anything at all about a person, I assume that the chance they have cancer is about 1.4%. If I know they've had a mammogram, I adjust my result up or down. This is a slightly static view of the world.
Similar to the above, but subtly different: the process of gathering information means that our understanding continually evolves. This is the view which Silver seems to push. This allows both for continual improvement of knowledge and for the opportunity to respond as underlying probabilities change. One critical element that's not addressed in the cancer/mammogram example is that there is presumed (and unearned) certainty in the underlying probabilities. Silver and McGrayne use two different sets of figures. Either the parameters are uncertain or they're drawing from samples which vary in some other way (which is another way of saying that the parameters possess some stochasticity).
The third interpretation is what I think of as the "actuarial" view. I can't point to a specific paper (though Bailey comes close) but it's more a feeling I get from those rare references to Bayes (explicit and otherwise) in the actuarial literature. The world is divided into sets, though you can't know to which set a particular item belongs. You may only refine the likelihood that an item belongs to a specific set in the presence of information. For example, there are three sets of drivers: very good, average and bad. If a driver has had one accident in the past 12 months, to which set do they belong? The chance that they belong to the set of very good drivers is low, but neither are they incontrovertible members of the bad drivers set.
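The driver example can be made concrete with a small calculation. The prior shares and accident probabilities below are invented purely for illustration; nothing in the sources mentioned above supplies them. The sketch is in Python.

```python
# Hypothetical numbers, chosen only for illustration.
# Share of all drivers assumed to belong to each set (the prior).
prior = {"very good": 0.30, "average": 0.50, "bad": 0.20}
# Assumed probability that a driver in each set has an accident in 12 months.
p_accident = {"very good": 0.02, "average": 0.08, "bad": 0.25}

# Joint probability of "belongs to set" and "had an accident".
joint = {k: prior[k] * p_accident[k] for k in prior}
# Overall probability of an accident, summed over the three sets.
evidence = sum(joint.values())
# Posterior membership probabilities given one accident.
posterior = {k: joint[k] / evidence for k in joint}

for k, p in posterior.items():
    print(f"P({k} | accident) = {p:.3f}")
```

Under these made-up inputs the driver is about 6% likely to be very good, 42% likely to be average and 52% likely to be bad, which matches the intuition: membership in the very good set becomes unlikely, but membership in the bad set is far from certain.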

In this example, I look at altering the joint probability distribution. I'm free to do that, if evidence warrants it. If mammography improves- or there is a provable difference in physicians' interpretations of the results- then I may alter the probabilities. If environment and lifestyle changes yield an alteration in disease prevalence, that also affects the joint distribution. It's a great toy example to begin to explore more varied problems. That's what I'll do next as I expand the example from a very simple 2×2 matrix to something more complicated.

Before I forget, my understanding of the definition of positive predictive value is taken from An Introduction to Statistical Learning, which is a great book. That value is one component of the fascinating subject of binary classification. I first heard about this in a great talk given by Dan Kelly at a meeting of the Research Triangle Analysts.



Bayesian inference is a method of statistical inference in which Bayes' theorem is used to update the probability for a hypothesis as more evidence or information becomes available. Bayesian inference is an important technique in statistics, and especially in mathematical statistics. Bayesian updating is particularly important in the dynamic analysis of a sequence of data. Bayesian inference has found application in a wide range of activities, including science, engineering, philosophy, medicine, sport, and law. In the philosophy of decision theory, Bayesian inference is closely related to subjective probability, often called 'Bayesian probability'.


Introduction to Bayes' rule

A geometric visualisation of Bayes' theorem. In the table, the values 2, 3, 6 and 9 give the relative weights of each corresponding condition and case. The figures denote the cells of the table involved in each metric, the probability being the fraction of each figure that is shaded. This shows that P(A|B) P(B) = P(B|A) P(A), i.e. P(A|B) = P(B|A) P(A)/P(B). Similar reasoning can be used to show that P(Ā|B) = P(B|Ā) P(Ā)/P(B), etc.

Formal explanation

Bayesian inference derives the posterior probability as a consequence of two antecedents: a prior probability and a 'likelihood function' derived from a statistical model for the observed data. Bayesian inference computes the posterior probability according to Bayes' theorem:

$$P(H \mid E) = \frac{P(E \mid H) \cdot P(H)}{P(E)}$$

where

$H$ stands for any hypothesis whose probability may be affected by data (called evidence below). Often there are competing hypotheses, and the task is to determine which is the most probable.
$P(H)$, the prior probability, is the estimate of the probability of the hypothesis $H$ before the data $E$, the current evidence, is observed.
$E$, the evidence, corresponds to new data that were not used in computing the prior probability.
$P(H \mid E)$, the posterior probability, is the probability of $H$ given $E$, i.e., after $E$ is observed. This is what we want to know: the probability of a hypothesis given the observed evidence.
$P(E \mid H)$ is the probability of observing $E$ given $H$, and is called the likelihood. As a function of $E$ with $H$ fixed, it indicates the compatibility of the evidence with the given hypothesis. The likelihood function is a function of the evidence, $E$, while the posterior probability is a function of the hypothesis, $H$.
$P(E)$ is sometimes termed the marginal likelihood or 'model evidence'. This factor is the same for all possible hypotheses being considered (as is evident from the fact that the hypothesis $H$ does not appear anywhere in the symbol, unlike for all the other factors), so this factor does not enter into determining the relative probabilities of different hypotheses.

For different values of $H$, only the factors $P(H)$ and $P(E \mid H)$, both in the numerator, affect the value of $P(H \mid E)$: the posterior probability of a hypothesis is proportional to its prior probability (its inherent likeliness) and the newly acquired likelihood (its compatibility with the new observed evidence).

Bayes' rule can also be written as follows:

$$P(H \mid E) = \frac{P(E \mid H)}{P(E)} \cdot P(H)$$

where the factor $\frac{P(E \mid H)}{P(E)}$ can be interpreted as the impact of $E$ on the probability of $H$.

Alternatives to Bayesian updating

Bayesian updating is widely used and computationally convenient. However, it is not the only updating rule that might be considered rational.

Ian Hacking noted that traditional 'Dutch book' arguments did not specify Bayesian updating: they left open the possibility that non-Bayesian updating rules could avoid Dutch books. Hacking wrote^[1]^[2] 'And neither the Dutch book argument nor any other in the personalist arsenal of proofs of the probability axioms entails the dynamic assumption. Not one entails Bayesianism. So the personalist requires the dynamic assumption to be Bayesian. It is true that in consistency a personalist could abandon the Bayesian model of learning from experience. Salt could lose its savour.'

Indeed, there are non-Bayesian updating rules that also avoid Dutch books (as discussed in the literature on 'probability kinematics') following the publication of Richard C. Jeffrey's rule, which applies Bayes' rule to the case where the evidence itself is assigned a probability.^[3] The additional hypotheses needed to uniquely require Bayesian updating have been deemed to be substantial, complicated, and unsatisfactory.^[4]

Formal description of Bayesian inference

Definitions

$x$, a data point in general. This may in fact be a vector of values.
$\theta$, the parameter of the data point's distribution, i.e., $x \sim p(x \mid \theta)$. This may in fact be a vector of parameters.
$\alpha$, the hyperparameter of the parameter distribution, i.e., $\theta \sim p(\theta \mid \alpha)$. This may in fact be a vector of hyperparameters.
$\mathbf{X}$ is the sample, a set of $n$ observed data points, i.e., $x_1, \ldots, x_n$.
$\tilde{x}$, a new data point whose distribution is to be predicted.

Bayesian inference

The prior distribution is the distribution of the parameter(s) before any data is observed, i.e. $p(\theta \mid \alpha)$. The prior distribution might not be easily determined. In that case, one option is to start from the Jeffreys prior, obtain a posterior distribution from it, and then update that posterior as newer observations arrive.
The sampling distribution is the distribution of the observed data conditional on its parameters, i.e. $p(\mathbf{X} \mid \theta)$. This is also termed the likelihood, especially when viewed as a function of the parameter(s), sometimes written $\operatorname{L}(\theta \mid \mathbf{X}) = p(\mathbf{X} \mid \theta)$.
The marginal likelihood (sometimes also termed the evidence) is the distribution of the observed data marginalized over the parameter(s), i.e. $p(\mathbf{X} \mid \alpha) = \int p(\mathbf{X} \mid \theta)\, p(\theta \mid \alpha)\, \mathrm{d}\theta$.
The posterior distribution is the distribution of the parameter(s) after taking into account the observed data. This is determined by Bayes' rule, which forms the heart of Bayesian inference:

$$p(\theta \mid \mathbf{X}, \alpha) = \frac{p(\theta, \mathbf{X}, \alpha)}{p(\mathbf{X}, \alpha)} = \frac{p(\mathbf{X} \mid \theta, \alpha)\, p(\theta, \alpha)}{p(\mathbf{X} \mid \alpha)\, p(\alpha)} = \frac{p(\mathbf{X} \mid \theta, \alpha)\, p(\theta \mid \alpha)}{p(\mathbf{X} \mid \alpha)} \propto p(\mathbf{X} \mid \theta, \alpha)\, p(\theta \mid \alpha)$$

Note that this is expressed in words as 'posterior is proportional to likelihood times prior', or sometimes as 'posterior = likelihood times prior, over evidence'.

Bayesian prediction

The posterior predictive distribution is the distribution of a new data point, marginalized over the posterior:

$$p(\tilde{x} \mid \mathbf{X}, \alpha) = \int p(\tilde{x} \mid \theta)\, p(\theta \mid \mathbf{X}, \alpha)\, \mathrm{d}\theta$$

The prior predictive distribution is the distribution of a new data point, marginalized over the prior:

$$p(\tilde{x} \mid \alpha) = \int p(\tilde{x} \mid \theta)\, p(\theta \mid \alpha)\, \mathrm{d}\theta$$

Bayesian theory calls for the use of the posterior predictive distribution to do predictive inference, i.e., to predict the distribution of a new, unobserved data point. That is, instead of a fixed point as a prediction, a distribution over possible points is returned. Only this way is the entire posterior distribution of the parameter(s) used. By comparison, prediction in frequentist statistics often involves finding an optimum point estimate of the parameter(s), e.g., by maximum likelihood or maximum a posteriori estimation (MAP), and then plugging this estimate into the formula for the distribution of a data point. This has the disadvantage that it does not account for any uncertainty in the value of the parameter, and hence will underestimate the variance of the predictive distribution.

(In some instances, frequentist statistics can work around this problem. For example, confidence intervals and prediction intervals in frequentist statistics when constructed from a normal distribution with unknown mean and variance are constructed using a Student's t-distribution. This correctly estimates the variance, due to the fact that (1) the average of normally distributed random variables is also normally distributed; (2) the predictive distribution of a normally distributed data point with unknown mean and variance, using conjugate or uninformative priors, has a Student's t-distribution. In Bayesian statistics, however, the posterior predictive distribution can always be determined exactly, or at least to an arbitrary level of precision when numerical methods are used.)

Note that both types of predictive distributions have the form of a compound probability distribution (as does the marginal likelihood). In fact, if the prior distribution is a conjugate prior, and hence the prior and posterior distributions come from the same family, it can easily be seen that both prior and posterior predictive distributions also come from the same family of compound distributions. The only difference is that the posterior predictive distribution uses the updated values of the hyperparameters (applying the Bayesian update rules given in the conjugate prior article), while the prior predictive distribution uses the values of the hyperparameters that appear in the prior distribution.
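The claim that a plug-in prediction underestimates the predictive variance can be illustrated numerically. The following Python sketch assumes a Bernoulli model with a uniform Beta(1, 1) prior and made-up data (7 successes in 10 trials); the model, the data and all function names are assumptions chosen for illustration. It compares the plug-in binomial distribution for the number of successes in 10 future trials with the beta-binomial posterior predictive distribution.

```python
from math import comb, exp, lgamma

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binomial_pmf(y, m, a, b):
    """P(y successes in m trials) with the success rate integrated over Beta(a, b)."""
    return comb(m, y) * exp(log_beta(a + y, b + m - y) - log_beta(a, b))

def binomial_pmf(y, m, p):
    """P(y successes in m trials) for a fixed success rate p."""
    return comb(m, y) * p**y * (1 - p) ** (m - y)

def variance(pmf):
    """Variance of a distribution given as a list indexed by outcome 0..m."""
    mean = sum(y * w for y, w in enumerate(pmf))
    return sum((y - mean) ** 2 * w for y, w in enumerate(pmf))

k, n = 7, 10                      # illustrative data: 7 successes in 10 trials
a0, b0 = 1.0, 1.0                 # assumed uniform prior, Beta(1, 1)
a, b = a0 + k, b0 + (n - k)       # conjugate update gives Beta(8, 4)
m = 10                            # number of future trials to predict

plug_in = [binomial_pmf(y, m, k / n) for y in range(m + 1)]
predictive = [beta_binomial_pmf(y, m, a, b) for y in range(m + 1)]

print(f"plug-in variance:              {variance(plug_in):.3f}")
print(f"posterior predictive variance: {variance(predictive):.3f}")
```

With these numbers the plug-in variance is 2.1, while the posterior predictive variance is roughly 3.76. The gap is the parameter uncertainty that the point estimate discards.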

Inference over exclusive and exhaustive possibilities

If evidence is simultaneously used to update belief over a set of exclusive and exhaustive propositions, Bayesian inference may be thought of as acting on this belief distribution as a whole.

General formulation

Diagram illustrating event space $\Omega$ in general formulation of Bayesian inference. Although this diagram shows discrete models and events, the continuous case may be visualized similarly using probability densities.

Suppose a process is generating independent and identically distributed events $E_n,\ n = 1, 2, 3, \ldots$, but the probability distribution is unknown. Let the event space $\Omega$ represent the current state of belief for this process. Each model is represented by event $M_m$, where the index $m$ ranges over the candidate models. The conditional probabilities $P(E_n \mid M_m)$ are specified to define the models. $P(M_m)$ is the degree of belief in $M_m$. Before the first inference step, $\{P(M_m)\}$ is a set of initial prior probabilities. These must sum to 1, but are otherwise arbitrary.

Suppose that the process is observed to generate $E \in \{E_n\}$. For each $M \in \{M_m\}$, the prior $P(M)$ is updated to the posterior $P(M \mid E)$. From Bayes' theorem:^[5]

$$P(M \mid E) = \frac{P(E \mid M)}{\sum_m P(E \mid M_m)\, P(M_m)} \cdot P(M)$$

Upon observation of further evidence, this procedure may be repeated.

Multiple observations

For a sequence of independent and identically distributed observations $\mathbf{E} = (e_1, \dots, e_n)$, it can be shown by induction that repeated application of the above is equivalent to

$$P(M \mid \mathbf{E}) = \frac{P(\mathbf{E} \mid M)}{\sum_m P(\mathbf{E} \mid M_m)\, P(M_m)} \cdot P(M)$$

where

$$P(\mathbf{E} \mid M) = \prod_k P(e_k \mid M).$$
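The equivalence between repeated single-observation updates and one batch update can be checked numerically. The Python sketch below uses two hypothetical coin models and a short made-up sequence of tosses; the model names, probabilities and data are all assumptions for illustration.

```python
from math import prod

# Two hypothetical models of a coin, each defined by P(heads | model).
models = {"fair": 0.5, "biased": 0.8}
prior = {"fair": 0.5, "biased": 0.5}
observations = ["H", "H", "T", "H"]   # illustrative data

def likelihood(e, p_heads):
    """P(e | model) for a single toss."""
    return p_heads if e == "H" else 1.0 - p_heads

def update(belief, e):
    """One Bayesian update step over all models for a single observation."""
    joint = {m: likelihood(e, models[m]) * belief[m] for m in belief}
    evidence = sum(joint.values())
    return {m: joint[m] / evidence for m in joint}

# Sequential: the posterior after each toss becomes the prior for the next.
sequential = dict(prior)
for e in observations:
    sequential = update(sequential, e)

# Batch: multiply the per-toss likelihoods, then apply Bayes' theorem once.
joint = {
    m: prod(likelihood(e, models[m]) for e in observations) * prior[m]
    for m in models
}
evidence = sum(joint.values())
batch = {m: joint[m] / evidence for m in joint}

print(sequential)
print(batch)
```

Both routes give the same posterior, with about 0.379 on the fair coin for this particular sequence.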


Parametric formulation

By parameterizing the space of models, the belief in all models may be updated in a single step. The distribution of belief over the model space may then be thought of as a distribution of belief over the parameter space. The distributions in this section are expressed as continuous, represented by probability densities, as this is the usual situation. The technique is however equally applicable to discrete distributions.

Let the vector $\boldsymbol{\theta}$ span the parameter space. Let the initial prior distribution over $\boldsymbol{\theta}$ be $p(\boldsymbol{\theta} \mid \boldsymbol{\alpha})$, where $\boldsymbol{\alpha}$ is a set of parameters to the prior itself, or hyperparameters. Let $\mathbf{E} = (e_1, \dots, e_n)$ be a sequence of independent and identically distributed event observations, where all $e_i$ are distributed as $p(e \mid \boldsymbol{\theta})$ for some $\boldsymbol{\theta}$. Bayes' theorem is applied to find the posterior distribution over $\boldsymbol{\theta}$:

$$\begin{aligned}
p(\boldsymbol{\theta} \mid \mathbf{E}, \boldsymbol{\alpha}) &= \frac{p(\mathbf{E} \mid \boldsymbol{\theta}, \boldsymbol{\alpha})}{p(\mathbf{E} \mid \boldsymbol{\alpha})} \cdot p(\boldsymbol{\theta} \mid \boldsymbol{\alpha}) \\
&= \frac{p(\mathbf{E} \mid \boldsymbol{\theta}, \boldsymbol{\alpha})}{\int p(\mathbf{E} \mid \boldsymbol{\theta}, \boldsymbol{\alpha})\, p(\boldsymbol{\theta} \mid \boldsymbol{\alpha})\, d\boldsymbol{\theta}} \cdot p(\boldsymbol{\theta} \mid \boldsymbol{\alpha})
\end{aligned}$$

where

$$p(\mathbf{E} \mid \boldsymbol{\theta}, \boldsymbol{\alpha}) = \prod_k p(e_k \mid \boldsymbol{\theta})$$
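When the integral in the denominator has no convenient closed form, it can be approximated on a grid. The Python sketch below does this for a single Bernoulli parameter with a uniform prior; the model, the prior, the data and the grid size are all assumptions made for illustration, and a grid is only one of several numerical options.

```python
# Grid of candidate parameter values: midpoints of N equal-width cells on (0, 1).
N = 1000
width = 1.0 / N
grid = [(i + 0.5) * width for i in range(N)]

prior = [1.0 for _ in grid]          # assumed uniform prior density on (0, 1)
data = [1, 0, 1, 1, 0, 1, 1, 1]      # illustrative Bernoulli observations

def likelihood(theta, observations):
    """Product of per-observation likelihoods for i.i.d. Bernoulli data."""
    out = 1.0
    for e in observations:
        out *= theta if e == 1 else 1.0 - theta
    return out

# Numerator of Bayes' theorem at every grid point.
unnormalised = [likelihood(t, data) * p for t, p in zip(grid, prior)]
# Midpoint-rule approximation of the integral in the denominator.
evidence = sum(u * width for u in unnormalised)
# Posterior density values on the grid.
posterior = [u / evidence for u in unnormalised]

posterior_mean = sum(t * d * width for t, d in zip(grid, posterior))
print(f"posterior mean = {posterior_mean:.4f}")
```

For this data (6 successes in 8 trials) the exact posterior is Beta(7, 3), whose mean is 0.7, so the grid result can be checked against a known answer.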

Mathematical properties

Interpretation of factor

$\frac{P(E \mid M)}{P(E)} > 1 \Rightarrow P(E \mid M) > P(E)$. That is, if the model were true, the evidence would be more likely than is predicted by the current state of belief. The reverse applies for a decrease in belief. If the belief does not change, $\frac{P(E \mid M)}{P(E)} = 1 \Rightarrow P(E \mid M) = P(E)$. That is, the evidence is independent of the model. If the model were true, the evidence would be exactly as likely as predicted by the current state of belief.

Cromwell's rule

If $P(M) = 0$ then $P(M \mid E) = 0$. If $P(M) = 1$, then $P(M \mid E) = 1$. This can be interpreted to mean that hard convictions are insensitive to counter-evidence.

The former follows directly from Bayes' theorem. The latter can be derived by applying the first rule to the event 'not $M$' in place of '$M$', yielding 'if $1 - P(M) = 0$, then $1 - P(M \mid E) = 0$', from which the result immediately follows.
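A few lines of Python make the rule tangible. The sketch below updates belief in a model against its complement; the likelihood values are assumed, picked so that the evidence strongly contradicts the model.

```python
def update(prior_m, p_e_given_m, p_e_given_not_m):
    """Posterior P(M | E) for a model M and its complement 'not M'."""
    evidence = p_e_given_m * prior_m + p_e_given_not_m * (1.0 - prior_m)
    return p_e_given_m * prior_m / evidence

# Assumed likelihoods: the evidence is 99 times more likely if M is false.
p_e_given_m, p_e_given_not_m = 0.01, 0.99

for prior_m in (0.0, 0.5, 1.0):
    posterior_m = update(prior_m, p_e_given_m, p_e_given_not_m)
    print(f"prior {prior_m:.1f} -> posterior {posterior_m:.2f}")
```

An open-minded prior of 0.5 collapses to about 0.01 under this evidence, while priors of exactly 0 and exactly 1 do not move at all.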

Asymptotic behaviour of posterior

Consider the behaviour of a belief distribution as it is updated a large number of times with independent and identically distributed trials. For sufficiently nice prior probabilities, the Bernstein-von Mises theorem gives that in the limit of infinite trials, the posterior converges to a Gaussian distribution independent of the initial prior, under some conditions first outlined and rigorously proven by Joseph L. Doob in 1948, namely if the random variable in consideration has a finite probability space. More general results were obtained later by the statistician David A. Freedman, who established in two seminal research papers in 1963^[6] and 1965^[7] when and under what circumstances the asymptotic behaviour of the posterior is guaranteed. His 1963 paper treats, like Doob (1949), the finite case and comes to a satisfactory conclusion. However, if the random variable has an infinite but countable probability space (i.e., corresponding to a die with infinitely many faces), the 1965 paper demonstrates that for a dense subset of priors the Bernstein-von Mises theorem is not applicable. In this case there is almost surely no asymptotic convergence. Later, in the 1980s and 1990s, Freedman and Persi Diaconis continued to work on the case of infinite countable probability spaces.^[8] To summarise, there may be insufficient trials to suppress the effects of the initial choice, and especially for large (but finite) systems the convergence might be very slow.

Conjugate priors

In parameterized form, the prior distribution is often assumed to come from a family of distributions called conjugate priors. The usefulness of a conjugate prior is that the corresponding posterior distribution will be in the same family, and the calculation may be expressed in closed form.
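A standard instance is the Beta prior for a Bernoulli success probability: the posterior is again a Beta distribution whose hyperparameters are the prior ones plus the observed counts of successes and failures. The Python sketch below uses assumed prior hyperparameters and made-up data, and the function name `beta_update` is illustrative.

```python
def beta_update(a, b, observations):
    """Beta posterior hyperparameters after Bernoulli observations (1 = success)."""
    successes = sum(observations)
    failures = len(observations) - successes
    return a + successes, b + failures

a0, b0 = 2.0, 2.0                  # assumed prior hyperparameters, Beta(2, 2)
data = [1, 0, 1, 1, 0, 1, 1, 1]    # illustrative observations

# Closed-form update using all the data at once.
a_n, b_n = beta_update(a0, b0, data)
posterior_mean = a_n / (a_n + b_n)

# The same update applied one observation at a time.
a_seq, b_seq = a0, b0
for x in data:
    a_seq, b_seq = beta_update(a_seq, b_seq, [x])

print(f"posterior: Beta({a_n}, {b_n}), mean = {posterior_mean:.3f}")
```

No integration is needed at any point, which is the practical appeal of conjugacy, and the one-at-a-time route ends at the same hyperparameters as the batch update.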

Estimates of parameters and predictions

It is often desired to use a posterior distribution to estimate a parameter or variable. Several methods of Bayesian estimation select measurements of central tendency from the posterior distribution.

For one-dimensional problems, a unique median exists for practical continuous problems. The posterior median is attractive as a robust estimator.^[9]

If there exists a finite mean for the posterior distribution, then the posterior mean is a method of estimation.^[10]

$$\tilde{\theta} = \operatorname{E}[\theta] = \int \theta\, p(\theta \mid \mathbf{X}, \alpha)\, d\theta$$

Taking a value with the greatest probability defines maximum a posteriori (MAP) estimates:^[11]

$$\{\theta_{\text{MAP}}\} \subset \arg\max_{\theta} p(\theta \mid \mathbf{X}, \alpha).$$

There are examples where no maximum is attained, in which case the set of MAP estimates is empty.

There are other methods of estimation that minimize the posterior risk (expected-posterior loss) with respect to a loss function, and these are of interest to statistical decision theory using the sampling distribution ('frequentist statistics').^[12]

The posterior predictive distribution of a new observation $\tilde{x}$ (that is independent of previous observations) is determined by^[13]

$$p(\tilde{x} \mid \mathbf{X}, \alpha) = \int p(\tilde{x}, \theta \mid \mathbf{X}, \alpha)\, d\theta = \int p(\tilde{x} \mid \theta)\, p(\theta \mid \mathbf{X}, \alpha)\, d\theta.$$
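The three point estimates generally differ when the posterior is skewed. The Python sketch below evaluates them on a grid for an assumed example posterior, a Beta(8, 4) density; the choice of posterior and the grid size are illustrative only.

```python
# Midpoints of N equal-width cells on (0, 1).
N = 10000
width = 1.0 / N
grid = [(i + 0.5) * width for i in range(N)]

# Unnormalised Beta(8, 4) density, used here as an assumed example posterior.
unnormalised = [t**7 * (1.0 - t) ** 3 for t in grid]
norm = sum(u * width for u in unnormalised)
posterior = [u / norm for u in unnormalised]

# Posterior mean: the integral of theta times the density.
posterior_mean = sum(t * d * width for t, d in zip(grid, posterior))

# MAP estimate: the grid point where the density is largest.
theta_map = grid[max(range(N), key=lambda i: posterior[i])]

# Posterior median: the first grid point where cumulative mass reaches one half.
cumulative = 0.0
posterior_median = None
for t, d in zip(grid, posterior):
    cumulative += d * width
    if cumulative >= 0.5:
        posterior_median = t
        break

print(f"mean = {posterior_mean:.4f}")
print(f"median = {posterior_median:.4f}")
print(f"MAP = {theta_map:.4f}")
```

For Beta(8, 4) the exact mean is 2/3 and the exact mode is 0.7, with the median falling between them, so the grid values can be checked against known results.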

Examples

Probability of a hypothesis

Suppose there are two full bowls of cookies. Bowl #1 has 10 chocolate chip and 30 plain cookies, while bowl #2 has 20 of each. Our friend Fred picks a bowl at random, and then picks a cookie at random. We may assume there is no reason to believe Fred treats one bowl differently from another, likewise for the cookies. The cookie turns out to be a plain one. How probable is it that Fred picked it out of bowl #1?

Intuitively, it seems clear that the answer should be more than a half, since there are more plain cookies in bowl #1. The precise answer is given by Bayes' theorem. Let $H_1$ correspond to bowl #1, and $H_2$ to bowl #2. It is given that the bowls are identical from Fred's point of view, thus $P(H_1) = P(H_2)$, and the two must add up to 1, so both are equal to 0.5. The event $E$ is the observation of a plain cookie. From the contents of the bowls, we know that $P(E \mid H_1) = 30/40 = 0.75$ and $P(E \mid H_2) = 20/40 = 0.5$. Bayes' formula then yields

$$\begin{aligned}
P(H_1 \mid E) &= \frac{P(E \mid H_1)\, P(H_1)}{P(E \mid H_1)\, P(H_1) + P(E \mid H_2)\, P(H_2)} \\
&= \frac{0.75 \times 0.5}{0.75 \times 0.5 + 0.5 \times 0.5} \\
&= 0.6
\end{aligned}$$

Before we observed the cookie, the probability we assigned for Fred having chosen bowl #1 was the prior probability, $P(H_1)$, which was 0.5. After observing the cookie, we must revise the probability to $P(H_1 \mid E)$, which is 0.6.
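The same calculation can be written as a short Python sketch. The dictionary keys `bowl1` and `bowl2` are illustrative labels for the two hypotheses.

```python
# Prior: Fred is equally likely to pick either bowl.
priors = {"bowl1": 0.5, "bowl2": 0.5}
# Likelihood of drawing a plain cookie from each bowl.
likelihood_plain = {"bowl1": 30 / 40, "bowl2": 20 / 40}

# Total probability of a plain cookie, summed over both bowls.
evidence = sum(likelihood_plain[h] * priors[h] for h in priors)
# Bayes' theorem applied to each hypothesis.
posterior = {h: likelihood_plain[h] * priors[h] / evidence for h in priors}

print(f"P(bowl #1 | plain) = {posterior['bowl1']:.2f}")
print(f"P(bowl #2 | plain) = {posterior['bowl2']:.2f}")
```

Because the two bowls are exclusive and exhaustive, the two posteriors sum to 1, which is a useful sanity check on any calculation of this kind.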

Making a prediction[edit]

Example results for archaeology example. This simulation was generated using c=15.2.

An archaeologist is working at a site thought to be from the medieval period, between the 11th century to the 16th century. However, it is uncertain exactly when in this period the site was inhabited. Fragments of pottery are found, some of which are glazed and some of which are decorated. It is expected that if the site were inhabited during the early medieval period, then 1% of the pottery would be glazed and 50% of its area decorated, whereas if it had been inhabited in the late medieval period then 81% would be glazed and 5% of its area decorated. How confident can the archaeologist be in the date of inhabitation as fragments are unearthed?

The degree of belief in the continuous variable C{displaystyle C} (century) is to be calculated, with the discrete set of events {GD,GDÂ¯,GÂ¯D,GÂ¯DÂ¯}{displaystyle {GD,G{bar {D}},{bar {G}}D,{bar {G}}{bar {D}}}} as evidence. Assuming linear variation of glaze and decoration with time, and that these variables are independent,

$$
\begin{aligned}
P(E = GD \mid C = c) &= \left(0.01 + \frac{0.81 - 0.01}{16 - 11}(c - 11)\right)\left(0.5 - \frac{0.5 - 0.05}{16 - 11}(c - 11)\right) \\
P(E = G\bar{D} \mid C = c) &= \left(0.01 + \frac{0.81 - 0.01}{16 - 11}(c - 11)\right)\left(0.5 + \frac{0.5 - 0.05}{16 - 11}(c - 11)\right) \\
P(E = \bar{G}D \mid C = c) &= \left((1 - 0.01) - \frac{0.81 - 0.01}{16 - 11}(c - 11)\right)\left(0.5 - \frac{0.5 - 0.05}{16 - 11}(c - 11)\right) \\
P(E = \bar{G}\bar{D} \mid C = c) &= \left((1 - 0.01) - \frac{0.81 - 0.01}{16 - 11}(c - 11)\right)\left(0.5 + \frac{0.5 - 0.05}{16 - 11}(c - 11)\right)
\end{aligned}
$$

Assume a uniform prior of $f_C(c) = 0.2$, and that trials are independent and identically distributed. When a new fragment of type $e$ is discovered, Bayes' theorem is applied to update the degree of belief for each $c$:

$$
f_C(c \mid E = e) = \frac{P(E = e \mid C = c)}{P(E = e)}\, f_C(c) = \frac{P(E = e \mid C = c)}{\int_{11}^{16} P(E = e \mid C = c)\, f_C(c)\, dc}\, f_C(c)
$$

A computer simulation of the changing belief as 50 fragments are unearthed is shown on the graph. In the simulation, the site was inhabited around 1420, or $c = 15.2$. By calculating the area under the relevant portion of the graph for 50 trials, the archaeologist can say that there is practically no chance the site was inhabited in the 11th and 12th centuries, about 1% chance that it was inhabited during the 13th century, 63% chance during the 14th century and 36% during the 15th century. Note that the Bernstein-von Mises theorem asserts here the asymptotic convergence to the 'true' distribution because the probability space corresponding to the discrete set of events $\{GD, G\bar{D}, \bar{G}D, \bar{G}\bar{D}\}$ is finite (see above section on asymptotic behaviour of the posterior).
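The sequential update described in this example can be sketched with a simple grid approximation. This is an illustrative Python sketch, not the code behind the original graph. The grid size, the random seed and all function names are assumptions, and the result depends on the random draws, so it will not reproduce the percentages quoted above.

```python
import random

C_MIN, C_MAX = 11.0, 16.0  # range of c covered by the uniform prior


def p_glazed(c):
    """P(glazed | C=c): linear from 1% at c=11 to 81% at c=16."""
    return 0.01 + (0.81 - 0.01) / (C_MAX - C_MIN) * (c - C_MIN)


def p_decorated(c):
    """P(decorated | C=c): linear from 50% at c=11 to 5% at c=16."""
    return 0.5 - (0.5 - 0.05) / (C_MAX - C_MIN) * (c - C_MIN)


def likelihood(glazed, decorated, c):
    """P(E=e | C=c), treating glaze and decoration as independent."""
    pg = p_glazed(c) if glazed else 1.0 - p_glazed(c)
    pd = p_decorated(c) if decorated else 1.0 - p_decorated(c)
    return pg * pd


def integrate(values, step):
    """Trapezoidal rule on an evenly spaced grid."""
    return step * (sum(values) - 0.5 * (values[0] + values[-1]))


def simulate(true_c=15.2, n_fragments=50, n_grid=501, seed=0):
    """Update a gridded density f_C(c) one simulated fragment at a time."""
    rng = random.Random(seed)  # seed is an arbitrary choice for repeatability
    step = (C_MAX - C_MIN) / (n_grid - 1)
    grid = [C_MIN + i * step for i in range(n_grid)]
    density = [1.0 / (C_MAX - C_MIN)] * n_grid  # uniform prior, f_C(c) = 0.2
    for _ in range(n_fragments):
        # Draw one fragment from the assumed "true" value of c
        glazed = rng.random() < p_glazed(true_c)
        decorated = rng.random() < p_decorated(true_c)
        # Bayes update: multiply by the likelihood, then renormalize
        density = [likelihood(glazed, decorated, c) * d
                   for c, d in zip(grid, density)]
        norm = integrate(density, step)
        density = [d / norm for d in density]
    return grid, density, step


grid, density, step = simulate()
posterior_mean = integrate([c * d for c, d in zip(grid, density)], step)
print(f"posterior mean of c after 50 fragments: {posterior_mean:.2f}")
```

Renormalizing after every fragment keeps the density on a sensible numerical scale. The probability for a single century can be read off by integrating the density over the matching range of c.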

In frequentist statistics and decision theory

A decision-theoretic justification of the use of Bayesian inference was given by Abraham Wald, who proved that every unique Bayesian procedure is admissible. Conversely, every admissible statistical procedure is either a Bayesian procedure or a limit of Bayesian procedures.^[14]

Wald characterized admissible procedures as Bayesian procedures (and limits of Bayesian procedures), making the Bayesian formalism a central technique in such areas of frequentist inference as parameter estimation, hypothesis testing, and computing confidence intervals.^[15]^[16]^[17] For example:

'Under some conditions, all admissible procedures are either Bayes procedures or limits of Bayes procedures (in various senses). These remarkable results, at least in their original form, are due essentially to Wald. They are useful because the property of being Bayes is easier to analyze than admissibility.'^[14]
'In decision theory, a quite general method for proving admissibility consists in exhibiting a procedure as a unique Bayes solution.'^[18]
'In the first chapters of this work, prior distributions with finite support and the corresponding Bayes procedures were used to establish some of the main theorems relating to the comparison of experiments. Bayes procedures with respect to more general prior distributions have played a very important role in the development of statistics, including its asymptotic theory.' 'There are many problems where a glance at posterior distributions, for suitable priors, yields immediately interesting information. Also, this technique can hardly be avoided in sequential analysis.'^[19]

'A useful fact is that any Bayes decision rule obtained by taking a proper prior over the whole parameter space must be admissible'^[20]
'An important area of investigation in the development of admissibility ideas has been that of conventional sampling-theory procedures, and many interesting results have been obtained.'^[21]

Model selection

Applications

Computer applications

Bayesian inference has applications in artificial intelligence and expert systems. Bayesian inference techniques have been a fundamental part of computerized pattern recognition techniques since the late 1950s. There is also an ever-growing connection between Bayesian methods and simulation-based Monte Carlo techniques, since complex models cannot be processed in closed form by a Bayesian analysis, while a graphical model structure may allow for efficient simulation algorithms such as Gibbs sampling and other Metropolis–Hastings schemes.^[22] Bayesian inference has also gained popularity in the phylogenetics community for these reasons; a number of applications allow many demographic and evolutionary parameters to be estimated simultaneously.
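To give a feel for the simulation-based approach, here is a minimal sketch of a random-walk Metropolis sampler, which is the special case of Metropolis–Hastings with a symmetric proposal. It is an illustration under stated assumptions, not code from any package mentioned here. The data (7 successes in 10 trials), the uniform prior, the step size and the seed are all made up for the example.

```python
import math
import random


def log_posterior(theta, successes, trials):
    """Unnormalized log posterior: binomial likelihood, uniform prior on (0, 1)."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # zero posterior density outside the unit interval
    return (successes * math.log(theta)
            + (trials - successes) * math.log(1.0 - theta))


def metropolis(successes, trials, n_samples=20000, step=0.2, seed=0):
    """Random-walk Metropolis: Gaussian proposal, accept with prob min(1, ratio)."""
    rng = random.Random(seed)
    theta = 0.5
    current = log_posterior(theta, successes, trials)
    samples = []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, successes, trials)
        # Symmetric proposal, so the acceptance ratio is just the posterior ratio
        accept_prob = math.exp(min(0.0, candidate - current))
        if rng.random() < accept_prob:
            theta, current = proposal, candidate
        samples.append(theta)
    return samples


samples = metropolis(successes=7, trials=10)
kept = samples[2000:]  # discard an (arbitrary) burn-in period
mcmc_mean = sum(kept) / len(kept)
print(f"posterior mean estimate: {mcmc_mean:.3f}")
```

For this toy model the posterior is known in closed form (a Beta(8, 4) distribution with mean 2/3), which makes it a convenient check on the sampler. In the complex models the text refers to, no such closed form is available, which is why simulation is used.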

As applied to statistical classification, Bayesian inference has been used in recent years to develop algorithms for identifying e-mail spam. Applications which make use of Bayesian inference for spam filtering include CRM114, DSPAM, Bogofilter, SpamAssassin, SpamBayes, Mozilla, XEAMS, and others. Spam classification is treated in more detail in the article on the naive Bayes classifier.

Solomonoff's inductive inference is the theory of prediction based on observations; for example, predicting the next symbol based upon a given series of symbols. The only assumption is that the environment follows some unknown but computable probability distribution. It is a formal inductive framework that combines two well-studied principles of inductive inference: Bayesian statistics and Occam's razor.^[23] Solomonoff's universal prior probability of any prefix p of a computable sequence x is the sum of the probabilities of all programs (for a universal computer) that compute something starting with p. Given some p and any computable but unknown probability distribution from which x is sampled, the universal prior and Bayes' theorem can be used to predict the yet unseen parts of x in optimal fashion.^[24]^[25]

Bioinformatic applications

Bayesian inference has been applied in various bioinformatics applications, including differential gene expression analysis,^[26]^[27] single-cell classification,^[28] and cancer subtyping.^[29]

In the courtroom

Bayesian inference can be used by jurors to coherently accumulate the evidence for and against a defendant, and to see whether, in totality, it meets their personal threshold for 'beyond a reasonable doubt'.^[30]^[31]^[32] Bayes' theorem is applied successively to all evidence presented, with the posterior from one stage becoming the prior for the next. The benefit of a Bayesian approach is that it gives the juror an unbiased, rational mechanism for combining evidence. It may be appropriate to explain Bayes' theorem to jurors in odds form, as betting odds are more widely understood than probabilities. Alternatively, a logarithmic approach, replacing multiplication with addition, might be easier for a jury to handle.

Adding up evidence.

If the existence of the crime is not in doubt, only the identity of the culprit, it has been suggested that the prior should be uniform over the qualifying population.^[33] For example, if 1,000 people could have committed the crime, the prior probability of guilt would be 1/1000.
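The odds form and the logarithmic form mentioned above can be sketched in a few lines of Python. The prior of 1/1000 comes from the example in the text. The three likelihood ratios are hypothetical values chosen only for illustration, and the pieces of evidence are assumed to be conditionally independent.

```python
import math


def to_odds(probability):
    return probability / (1.0 - probability)


def to_probability(odds):
    return odds / (1.0 + odds)


prior_guilt = 1 / 1000  # uniform prior over 1,000 possible culprits
# Hypothetical likelihood ratios P(evidence | guilty) / P(evidence | innocent)
likelihood_ratios = [100.0, 20.0, 5.0]

# Odds form: the posterior odds from one stage become the prior odds of the next
odds = to_odds(prior_guilt)
for ratio in likelihood_ratios:
    odds *= ratio

# Logarithmic form: the same update as a sum of log10 terms
log_odds = math.log10(to_odds(prior_guilt)) + sum(
    math.log10(ratio) for ratio in likelihood_ratios
)

posterior_guilt = to_probability(odds)
print(f"posterior odds: {odds:.2f}, posterior probability: {posterior_guilt:.3f}")
print(f"log10 posterior odds: {log_odds:.3f}")
```

Both routes give the same posterior. The logarithmic version only replaces the repeated multiplication with addition, which is the simplification the text suggests for a jury.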

The use of Bayes' theorem by jurors is controversial. In the United Kingdom, a defence expert witness explained Bayes' theorem to the jury in R v Adams. The jury convicted, but the case went to appeal on the basis that no means of accumulating evidence had been provided for jurors who did not wish to use Bayes' theorem. The Court of Appeal upheld the conviction, but it also gave the opinion that 'To introduce Bayes' Theorem, or any similar method, into a criminal trial plunges the jury into inappropriate and unnecessary realms of theory and complexity, deflecting them from their proper task.'

Gardner-Medwin^[34] argues that the criterion on which a verdict in a criminal trial should be based is not the probability of guilt, but rather the probability of the evidence, given that the defendant is innocent (akin to a frequentist p-value). He argues that if the posterior probability of guilt is to be computed by Bayes' theorem, the prior probability of guilt must be known. This will depend on the incidence of the crime, which is an unusual piece of evidence to consider in a criminal trial. Consider the following three propositions:

A The known facts and testimony could have arisen if the defendant is guilty

B The known facts and testimony could have arisen if the defendant is innocent

C The defendant is guilty.

Gardner-Medwin argues that the jury should believe both A and not-B in order to convict. A and not-B implies the truth of C, but the reverse is not true. It is possible that B and C are both true, but in this case he argues that a jury should acquit, even though they know that they will be letting some guilty people go free. See also Lindley's paradox.

Bayesian epistemology

Bayesian epistemology is a movement that advocates for Bayesian inference as a means of justifying the rules of inductive logic.

Karl Popper and David Miller have rejected the idea of Bayesian rationalism, i.e. using Bayes' rule to make epistemological inferences:^[35] it is prone to the same vicious circle as any other justificationist epistemology, because it presupposes what it attempts to justify. According to this view, a rational interpretation of Bayesian inference would see it merely as a probabilistic version of falsification, rejecting the belief, commonly held by Bayesians, that high likelihood achieved by a series of Bayesian updates would prove the hypothesis beyond any reasonable doubt, or even with likelihood greater than 0.

Other

The scientific method is sometimes interpreted as an application of Bayesian inference. In this view, Bayes' rule guides (or should guide) the updating of probabilities about hypotheses conditional on new observations or experiments.^[36] Bayesian inference has also been applied to stochastic scheduling problems with incomplete information by Cai et al. (2009).^[37]
Bayesian multi-domain learning is used to improve predictive power in domains with small training sets by borrowing information from domains with rich training data.^[29]
Bayesian search theory is used to search for lost objects.
Bayesian approaches to brain function investigate the brain as a Bayesian mechanism.
Bayesian inference is used in ecological studies.^[38]^[39]
Bayesian inference is used to estimate parameters in stochastic chemical kinetic models.^[40]

Bayes and Bayesian inference

The problem considered by Bayes in Proposition 9 of his essay, 'An Essay towards solving a Problem in the Doctrine of Chances', is the posterior distribution for the parameter a (the success rate) of the binomial distribution.
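As a sketch of what that posterior looks like: if the prior on a is taken to be uniform on (0, 1), which is an assumption made here for illustration, then observing s successes and f failures gives a Beta(s + 1, f + 1) posterior with mean (s + 1)/(s + f + 2). The Python below evaluates that density using only the standard library. The function names are illustrative.

```python
import math


def beta_posterior_pdf(a, successes, failures):
    """Density at a (0 < a < 1) of Beta(successes + 1, failures + 1).

    This is the posterior for a binomial success rate under a uniform prior.
    """
    alpha, beta = successes + 1, failures + 1
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    return math.exp(
        log_norm + (alpha - 1) * math.log(a) + (beta - 1) * math.log(1.0 - a)
    )


def posterior_mean(successes, failures):
    """Mean of the Beta(successes + 1, failures + 1) posterior."""
    return (successes + 1) / (successes + failures + 2)


# Hypothetical data: 7 successes and 3 failures
print(posterior_mean(7, 3))
print(beta_posterior_pdf(0.7, 7, 3))
```

With no data at all (s = f = 0) the posterior is the uniform prior itself, so the density is 1 everywhere on (0, 1).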

History

The term Bayesian refers to Thomas Bayes (1702–1761), who proved a special case of what is now called Bayes' theorem. However, it was Pierre-Simon Laplace (1749–1827) who introduced a general version of the theorem and used it to approach problems in celestial mechanics, medical statistics, reliability, and jurisprudence.^[41] Early Bayesian inference, which used uniform priors following Laplace's principle of insufficient reason, was called 'inverse probability' (because it infers backwards from observations to parameters, or from effects to causes^[42]). After the 1920s, 'inverse probability' was largely supplanted by a collection of methods that came to be called frequentist statistics.^[42]

In the 20th century, the ideas of Laplace were further developed in two different directions, giving rise to objective and subjective currents in Bayesian practice. In the objective or 'non-informative' current, the statistical analysis depends on only the model assumed, the data analyzed,^[43] and the method assigning the prior, which differs from one objective Bayesian to another objective Bayesian. In the subjective or 'informative' current, the specification of the prior depends on the belief (that is, propositions on which the analysis is prepared to act), which can summarize information from experts, previous studies, etc.

In the 1980s, there was a dramatic growth in research and applications of Bayesian methods, mostly attributed to the discovery of Markov chain Monte Carlo methods, which removed many of the computational problems, and an increasing interest in nonstandard, complex applications.^[44] Despite the growth of Bayesian research, most undergraduate teaching is still based on frequentist statistics.^[45] Nonetheless, Bayesian methods are widely accepted and used, for example in machine learning.^[46]

References

Citations

Hacking, Ian (December 1967). 'Slightly More Realistic Personal Probability'. Philosophy of Science. 34 (4): 316. doi:10.1086/288169.
Hacking (1988, p. 124)
'Bayes' Theorem (Stanford Encyclopedia of Philosophy)'. Plato.stanford.edu. Retrieved 2014-01-05.
van Fraassen, B. (1989) Laws and Symmetry, Oxford University Press. ISBN 0-19-824860-1
Gelman, Andrew; Carlin, John B.; Stern, Hal S.; Dunson, David B.; Vehtari, Aki; Rubin, Donald B. (2013). Bayesian Data Analysis, Third Edition. Chapman and Hall/CRC. ISBN 978-1-4398-4095-5.
Freedman, DA (1963). 'On the asymptotic behavior of Bayes' estimates in the discrete case'. The Annals of Mathematical Statistics. 34 (4): 1386–1403. doi:10.1214/aoms/1177703871. JSTOR 2238346.
Freedman, DA (1965). 'On the asymptotic behavior of Bayes estimates in the discrete case II'. The Annals of Mathematical Statistics. 36 (2): 454–456. doi:10.1214/aoms/1177700155. JSTOR 2238150.
Robins, James; Wasserman, Larry (2000). 'Conditioning, likelihood, and coherence: A review of some foundational concepts'. JASA. 95 (452): 1340–1346. doi:10.1080/01621459.2000.10474344.
Sen, Pranab K.; Keating, J. P.; Mason, R. L. (1993). Pitman's measure of closeness: A comparison of statistical estimators. Philadelphia: SIAM.
Choudhuri, Nidhan; Ghosal, Subhashis; Roy, Anindya (2005-01-01). Bayesian Methods for Function Estimation. Handbook of Statistics. Bayesian Thinking. 25. pp. 373–414. CiteSeerX 10.1.1.324.3052. doi:10.1016/s0169-7161(05)25013-7. ISBN 9780444515391.
'Maximum A Posteriori (MAP) Estimation'. www.probabilitycourse.com. Retrieved 2017-06-02.
Yu, Angela. 'Introduction to Bayesian Decision Theory' (PDF). cogsci.ucsd.edu/. Archived from the original (PDF) on 2013-02-28.
Hitchcock, David. 'Posterior Predictive Distribution Stat Slide' (PDF). stat.sc.edu.
Bickel & Doksum (2001, p. 32)
Kiefer, J.; Schwartz R. (1965). 'Admissible Bayes Character of T²-, R²-, and Other Fully Invariant Tests for Multivariate Normal Problems'. Annals of Mathematical Statistics. 36 (3): 747–770. doi:10.1214/aoms/1177700051.
Schwartz, R. (1969). 'Invariant Proper Bayes Tests for Exponential Families'. Annals of Mathematical Statistics. 40: 270–283. doi:10.1214/aoms/1177697822.
Hwang, J. T. & Casella, George (1982). 'Minimax Confidence Sets for the Mean of a Multivariate Normal Distribution' (PDF). Annals of Statistics. 10 (3): 868–881. doi:10.1214/aos/1176345877.
Lehmann, Erich (1986). Testing Statistical Hypotheses (Second ed.). (See p. 309 of Chapter 6.7 'Admissibility', and pp. 17–18 of Chapter 1.8 'Complete Classes'.)
Le Cam, Lucien (1986). Asymptotic Methods in Statistical Decision Theory. Springer-Verlag. ISBN 978-0-387-96307-5. (From 'Chapter 12 Posterior Distributions and Bayes Solutions', p. 324)
Cox, D. R.; Hinkley, D.V. (1974). Theoretical Statistics. Chapman and Hall. p. 432. ISBN 978-0-04-121537-3.
Cox, D. R.; Hinkley, D.V. (1974). Theoretical Statistics. Chapman and Hall. p. 433. ISBN 978-0-04-121537-3.
Jim Albert (2009). Bayesian Computation with R, Second edition. New York, Dordrecht, etc.: Springer. ISBN 978-0-387-92297-3.
Rathmanner, Samuel; Hutter, Marcus; Ormerod, Thomas C (2011). 'A Philosophical Treatise of Universal Induction'. Entropy. 13 (6): 1076–1136. arXiv:1105.5721. doi:10.3390/e13061076.
Hutter, Marcus; He, Yang-Hui; Ormerod, Thomas C (2007). 'On Universal Prediction and Bayesian Confirmation'. Theoretical Computer Science. 384 (2007): 33–48. arXiv:0709.1516. Bibcode:2007arXiv0709.1516H. doi:10.1016/j.tcs.2007.05.016.
Gács, Peter; Vitányi, Paul M. B. (2 December 2010). 'Raymond J. Solomonoff 1926-2009'. CiteSeerX.
Robinson, Mark D.; McCarthy, Davis J.; Smyth, Gordon K. 'edgeR: a Bioconductor package for differential expression analysis of digital gene expression data'. Bioinformatics.
Hajiramezanali, E.; Dadaneh, S. Z.; Figueiredo, P. d.; Sze, S.; Zhou, Z.; Qian, X. 'Differential Expression Analysis of Dynamical Sequencing Count Data with a Gamma Markov Chain'. https://arxiv.org/pdf/1803.02527.pdf
Hajiramezanali, E.; Imani, M.; Braga-Neto, U.; Qian, X.; Dougherty, E. R. 'Scalable optimal Bayesian classification of single-cell trajectories under regulatory model uncertainty'. Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. arXiv:1902.03188. doi:10.1145/3233547.3233689.
Hajiramezanali, E.; Dadaneh, S. Z.; Karbalayghareh, A.; Zhou, Z.; Qian, X. 'Bayesian multi-domain learning for cancer subtype discovery from next-generation sequencing count data'. 32nd Conference on Neural Information Processing Systems (NIPS 2018). Montréal, Canada. arXiv:1810.09433.
Dawid, A. P. and Mortera, J. (1996) 'Coherent Analysis of Forensic Identification Evidence'. Journal of the Royal Statistical Society, Series B, 58, 425–443.
Foreman, L. A.; Smith, A. F. M., and Evett, I. W. (1997). 'Bayesian analysis of deoxyribonucleic acid profiling data in forensic identification applications (with discussion)'. Journal of the Royal Statistical Society, Series A, 160, 429–469.
Robertson, B. and Vignaux, G. A. (1995) Interpreting Evidence: Evaluating Forensic Science in the Courtroom. John Wiley and Sons. Chichester. ISBN 978-0-471-96026-3
Dawid, A. P. (2001) Bayes' Theorem and Weighing Evidence by Juries. Archived 2015-07-01 at the Wayback Machine.
Gardner-Medwin, A. (2005) 'What Probability Should the Jury Address?'. Significance, 2 (1), March 2005
Miller, David (1994). Critical Rationalism. Chicago: Open Court. ISBN 978-0-8126-9197-9.
Howson & Urbach (2005), Jaynes (2003)
Cai, X.Q.; Wu, X.Y.; Zhou, X. (2009). 'Stochastic scheduling subject to breakdown-repeat breakdowns with incomplete information'. Operations Research. 57 (5): 1236–1249. doi:10.1287/opre.1080.0660.
Ogle, Kiona; Tucker, Colin; Cable, Jessica M. (2014-01-01). 'Beyond simple linear mixing models: process-based isotope partitioning of ecological processes'. Ecological Applications. 24 (1): 181–195. doi:10.1890/1051-0761-24.1.181. ISSN 1939-5582.
Evaristo, Jaivime; McDonnell, Jeffrey J.; Scholl, Martha A.; Bruijnzeel, L. Adrian; Chun, Kwok P. (2016-01-01). 'Insights into plant water uptake from xylem-water isotope measurements in two tropical catchments with contrasting moisture conditions'. Hydrological Processes. 30 (18): 3210–3227. Bibcode:2016HyPr..30.3210E. doi:10.1002/hyp.10841. ISSN 1099-1085.
Gupta, Ankur; Rawlings, James B. (April 2014). 'Comparison of Parameter Estimation Methods in Stochastic Chemical Kinetic Models: Examples in Systems Biology'. AIChE Journal. 60 (4): 1253–1268. doi:10.1002/aic.14409. ISSN 0001-1541. PMC 4946376. PMID 27429455.
Stigler, Stephen M. (1986). 'Chapter 3'. The History of Statistics. Harvard University Press.
Fienberg, Stephen E. (2006). 'When did Bayesian Inference Become 'Bayesian'?' (PDF). Bayesian Analysis. 1 (1): 1–40 [p. 5]. Bibcode:2007BayAn..2.665S. doi:10.1214/06-ba101. Archived from the original (PDF) on 2014-09-10.
Wolpert, R. L. (2004). 'A Conversation with James O. Berger'. Statistical Science. 19 (1): 205–218. CiteSeerX 10.1.1.71.6112. doi:10.1214/088342304000000053. MR 2082155.
Bishop, C. M. (2007). Pattern Recognition and Machine Learning. New York: Springer. ISBN 978-0387310732.

Sources

Aster, Richard; Borchers, Brian, and Thurber, Clifford (2012). Parameter Estimation and Inverse Problems, Second Edition, Elsevier. ISBN 0123850487, ISBN 978-0123850485
Bickel, Peter J. & Doksum, Kjell A. (2001). Mathematical Statistics, Volume 1: Basic and Selected Topics (Second (updated printing 2007) ed.). Pearson Prentice–Hall. ISBN 978-0-13-850363-5.
Box, G. E. P. and Tiao, G. C. (1973) Bayesian Inference in Statistical Analysis, Wiley, ISBN 0-471-57428-7
Edwards, Ward (1968). 'Conservatism in Human Information Processing'. In Kleinmuntz, B. (ed.). Formal Representation of Human Judgment. Wiley.
Edwards, Ward (1982). Daniel Kahneman, Paul Slovic and Amos Tversky (eds.). 'Judgment under uncertainty: Heuristics and biases'. Science. 185 (4157): 1124–1131. Bibcode:1974Sci..185.1124T. doi:10.1126/science.185.4157.1124. PMID 17835457.
Jaynes E. T. (2003) Probability Theory: The Logic of Science, CUP. ISBN 978-0-521-59271-0 (Link to Fragmentary Edition of March 1996).
Howson, C. & Urbach, P. (2005). Scientific Reasoning: the Bayesian Approach (3rd ed.). Open Court Publishing Company. ISBN 978-0-8126-9578-6.
Phillips, L. D.; Edwards, Ward (October 2008). 'Chapter 6: Conservatism in a Simple Probability Inference Task (Journal of Experimental Psychology (1966) 72: 346-354)'. In Jie W. Weiss; David J. Weiss (eds.). A Science of Decision Making: The Legacy of Ward Edwards. Oxford University Press. p. 536. ISBN 978-0-19-532298-9.

External links

Hazewinkel, Michiel, ed. (2001) [1994], 'Bayesian approach to statistical problems', Encyclopedia of Mathematics, Springer Science+Business Media B.V. / Kluwer Academic Publishers, ISBN 978-1-55608-010-4
Bayesian Statistics from Scholarpedia.
Introduction to Bayesian probability from Queen Mary University of London
Bayesian reading list, categorized and annotated by Tom Griffiths
A. Hajek and S. Hartmann: Bayesian Epistemology, in: J. Dancy et al. (eds.), A Companion to Epistemology. Oxford: Blackwell 2010, 93-106.
S. Hartmann and J. Sprenger: Bayesian Epistemology, in: S. Bernecker and D. Pritchard (eds.), Routledge Companion to Epistemology. London: Routledge 2010, 609-620.
Data, Uncertainty and Inference An introduction to Bayesian inference and MCMC with a lot of examples fully explained. (free ebook)

Retrieved from 'https://en.wikipedia.org/w/index.php?title=Bayesian_inference&oldid=904457488'

You don't have to know a lot about probability theory to use a Bayesian probability model for financial forecasting. The Bayesian method can help you refine probability estimates using an intuitive process.

Any mathematically-based topic can be taken to complex depths, but this one doesn't have to be.

How It's Used

The way that Bayesian probability is used in corporate America is dependent on a degree of belief rather than historical frequencies of identical or similar events. The model is versatile, though. You can incorporate your beliefs based on frequency into the model.

The following uses the rules and assertions of the school of thought within Bayesian probability that pertains to frequency rather than subjectivity. The measurement of knowledge that is being quantified is based on historical data. This view is particularly helpful in financial modeling.

About Bayes' Theorem

The particular formula from Bayesian probability we are going to use is called Bayes' Theorem, sometimes called Bayes' formula or Bayes' rule. This rule is most often used to calculate what is called the posterior probability. The posterior probability is the conditional probability of a future uncertain event that is based upon relevant evidence relating to it historically.

In other words, if you gain new information or evidence and you need to update the probability of an event occurring, you can use Bayes' Theorem to estimate this new probability.

The formula is:

$$
P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A) \times P(B \mid A)}{P(B)}
$$

where:

P(A) = probability of A occurring, called the prior probability
P(A|B) = conditional probability of A given that B occurs
P(B|A) = conditional probability of B given that A occurs
P(B) = probability of B occurring

P(A|B) is the posterior probability due to its variable dependency on B. This assumes that A is not independent of B.

If we are interested in the probability of an event for which we have prior observations, we call this the prior probability. We'll deem this event A, and its probability P(A). If there is a second event that affects P(A), which we'll call event B, then we want to know what the probability of A is given that B has occurred.

In probabilistic notation, this is P(A|B) and is known as the posterior probability or revised probability. This is because it is calculated after the new evidence has been observed, hence the post in posterior.

This is how Bayes' theorem allows us to update our previous beliefs with new information. The example below will help you see how it works in a context related to an equity market.

An Example

Let's say we want to know how a change in interest rates would affect the value of a stock market index.

A vast trove of historical data is available for all the major stock market indexes, so you should have no problem finding the outcomes for these events. For our example, we will use the data below to find out how a stock market index will react to a rise in interest rates.

Here:

P(SI) = the probability of the stock index increasing
P(SD) = the probability of the stock index decreasing
P(ID) = the probability of interest rates decreasing
P(II) = the probability of interest rates increasing

So the equation will be:

$$
P(SD \mid II) = \frac{P(SD) \times P(II \mid SD)}{P(II)}
$$

Plugging in our numbers we get the following:

$$
\begin{aligned}
P(SD \mid II) &= \frac{\left(\frac{1{,}150}{2{,}000}\right) \times \left(\frac{950}{1{,}150}\right)}{\left(\frac{1{,}000}{2{,}000}\right)} \\
&= \frac{0.575 \times 0.826}{0.5} \\
&= \frac{0.47495}{0.5} \\
&= 0.9499 \approx 95\%
\end{aligned}
$$

In the data, the stock index decreased in 1,150 out of 2,000 observations. This is the prior probability based on historical data, which in this example is 57.5% (1,150/2,000).

This probability doesn't take into account any information about interest rates and is the one we wish to update. Updating this prior probability with the information that interest rates have risen moves the probability of the stock market decreasing from 57.5% to 95%. Therefore, 95% is the posterior probability.
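The arithmetic of this example can be reproduced with a short Python sketch. The three counts are the ones that appear in the worked equation (1,150 index decreases and 1,000 rate increases out of 2,000 observations, with 950 observations where both occurred). The function and variable names are illustrative.

```python
def bayes(prior_a, likelihood_b_given_a, prob_b):
    """Bayes' theorem: P(A|B) = P(A) * P(B|A) / P(B)."""
    return prior_a * likelihood_b_given_a / prob_b


total_observations = 2000
stock_decreased = 1150            # observations where the index fell
rates_increased = 1000            # observations where rates rose
both = 950                        # index fell AND rates rose

p_sd = stock_decreased / total_observations   # prior, P(SD) = 0.575
p_ii = rates_increased / total_observations   # P(II) = 0.5
p_ii_given_sd = both / stock_decreased        # P(II|SD), about 0.826

p_sd_given_ii = bayes(p_sd, p_ii_given_sd, p_ii)
print(f"P(SD | II) = {p_sd_given_ii:.4f}")
```

Without intermediate rounding the posterior is exactly 950/1,000 = 0.95. The 0.9499 in the worked equation comes from rounding P(II|SD) to 0.826 before multiplying.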

Modeling with Bayes' Theorem

As seen above, we can use historical data as the basis for the beliefs from which we derive updated probabilities.

The same approach extends to individual companies (using changes in their balance sheets), to bonds (given changes in credit rating), and to many other cases.

So, what if one does not know the exact probabilities but has only estimates? This is where the subjective view comes strongly into play.

Many people put great emphasis on the estimates and simplified probabilities given by experts in their field. This also gives us the ability to confidently produce new estimates for new and more complicated questions introduced by the inevitable roadblocks in financial forecasting.

Instead of guessing, we can now use Bayes' Theorem if we have the right information with which to start.

When to Apply Bayes' Theorem

Changing interest rates can greatly affect the value of particular assets. The changing value of assets can therefore greatly affect the value of particular profitability and efficiency ratios used to proxy a company's performance. Estimated probabilities are widely found relating to systematic changes in interest rates and thus can be used effectively in Bayes' Theorem.

We can also apply the process to a company's net income stream. Lawsuits, changes in the prices of raw materials, and many other things can influence a company's net income.

By using probability estimates relating to these factors, we can apply Bayes' Theorem to figure out what is important to us. Once we find the deduced probabilities that we are looking for, it is a simple application of mathematical expectancy and result forecasting to quantify the financial probabilities.

Using a myriad of related probabilities, we can deduce the answer to rather complex questions with one simple formula. These methods are well accepted and time-tested. Their use in financial modeling can be helpful if applied properly.
