Vol. 70, No. 3
May 1963
PSYCHOLOGICAL REVIEW
BAYESIAN STATISTICAL INFERENCE FOR
PSYCHOLOGICAL RESEARCH

WARD EDWARDS, HAROLD LINDMAN, AND LEONARD J. SAVAGE

University of Michigan
Bayesian statistics, a currently controversial viewpoint concerning
statistical inference, is based on a definition of probability as a particular measure of the opinions of ideally consistent people. Statistical
inference is modification of these opinions in the light of evidence, and
Bayes' theorem specifies how such modifications should be made. The
tools of Bayesian statistics include the theory of specific distributions
and the principle of stable estimation, which specifies when actual prior
opinions may be satisfactorily approximated by a uniform distribution.
A common feature of many classical significance tests is that a sharp
null hypothesis is compared with a diffuse alternative hypothesis.
Often evidence which, for a Bayesian statistician, strikingly supports
the null hypothesis leads to rejection of that hypothesis by standard
classical procedures. The likelihood principle emphasized in Bayesian
statistics implies, among other things, that the rules governing when
data collection stops are irrelevant to data interpretation. It is
entirely appropriate to collect data until a point has been proven or
disproven, or until the data collector runs out of time, money, or
patience.
The main purpose of this paper is to introduce psychologists to the Bayesian outlook in statistics, a new fabric with some very old threads.¹ Although this purpose demands much repetition of ideas published elsewhere, even Bayesian specialists will find some remarks and derivations hitherto unpublished and perhaps quite new. The empirical scientist more interested in the ideas and implications of Bayesian statistics than in the mathematical details can safely skip almost all the equations; detours and parallel verbal explanations are provided. The textbook that would make all the Bayesian procedures mentioned in this paper readily available to experimenting psychologists does not yet exist, and perhaps it cannot exist soon; Bayesian statistics as a coherent body of thought is still too new and incomplete.

¹ Work on this paper was supported in part by the United States Air Force under Contract AF 49(638)-769 and Grant AF-AFOSR-62-182, monitored by the Air Force Office of Scientific Research of the Air Force Office of Aerospace Research (the paper carries Document No. AFOSR-2009); in part under Contract AF 19(604)-7393, monitored by the Operational Applications Laboratory, Deputy for Technology, Electronic Systems Division, Air Force Systems Command; and in part by the Office of Naval Research under Contract Nonr 1224(41). We thank H. C. A. Dale, H. V. Roberts, R., and E. H. Shuford for their comments on earlier versions.

Bayes' theorem is a simple and fundamental fact about probability
that seems to have been clear to
Thomas Bayes when he wrote his
famous article published in 1763
(recently reprinted), though he did
not state it there explicitly. Bayesian
statistics is so named for the rather
inadequate reason that it has many
more occasions to apply Bayes' theorem than classical statistics has.
Thus, from a very broad point of
view, Bayesian statistics dates back
at least to 1763.
From a stricter point of view,
Bayesian statistics might properly be
said to have begun in 1959 with the
publication of Probability and Statistics for Business Decisions, by
Robert Schlaifer. This introductory
text presented for the first time
practical implementation of the key
ideas of Bayesian statistics: that
probability is orderly opinion, and
that inference from data is nothing
other than the revision of such opinion
in the light of relevant new information. Schlaifer (1961) has since
published another introductory text,
less strongly slanted toward business applications than his first. And Raiffa and Schlaifer (1961) have published a relatively mathematical book.
Some other works in current Bayesian
statistics are by Anscombe (1961), de
Finetti (1959), de Finetti and Savage
(1962), Grayson (1960), Lindley
(1961), Pratt (1961), and Savage et
al. (1962).
The philosophical and mathematical
basis of Bayesian statistics has, in
addition to its ancient roots, a considerable modern history. Two lines
of development important for it are
the ideas of statistical decision theory,
based on the game-theoretic work of
Borel (1921), von Neumann (1928),
and von Neumann and Morgenstern
(1947), and the statistical work of
Neyman (1937, 1938b, for example),
Wald (1942, 1955, for example), and
others; and the personalistic definition of probability, which Ramsey
(1931) and de Finetti (1930, 1937)
crystallized. Other pioneers of personal probability are Borel (1924),
Good (1950, 1960), and Koopman
(1940a, 1940b, 1941). Decision theory
and personal probability fused in the
work of Ramsey (1931), before either
was very mature. By 1954, there was
great progress in both lines for
Savage's The Foundations of Statistics
to draw on. Though this book failed
in its announced object of satisfying
popular non-Bayesian statistics in
terms of personal probability and
utility, it seems to have been of some
service toward the development of
Bayesian statistics. Jeffreys (1931,
1939) has pioneered extensively in
applications of Bayes' theorem to
statistical problems. He is one of the
founders of Bayesian statistics, though
he might reject identification with the
viewpoint of this paper because of its
espousal of personal probabilities.
These two, inevitably inadequate,
paragraphs are our main attempt in
this paper to give credit where it is
due. Important authors have not
been listed, and for those that have
been, we have given mainly one early
and one late reference only. Much
more information and extensive bibliographies will be found in Savage
et al. (1962) and Savage (1954,


1962a).
We shall, where appropriate, com
pare the Bayesian approach with a
loosely defined set of ideas here
labeled the classical approach, or
classical statistics. You cannot but
be familiar with many of these ideas,
for what you learned about statistical
inference in your elementary statistics
course was some blend of them. They
have been directed largely toward the
topics of testing hypotheses and
interval estimation, and they fall
roughly into two somewhat conflicting
doctrines associated with the names
of R. A. Fisher (1925, 1956) for one,
and Jerzy Neyman (e.g. 1937, 1938b)
and Egon Pearson for the other. We
do not try to portray any particular
version of the classical approach; our
real comparison is between such procedures as a Bayesian would employ in an article submitted to the Journal of Experimental Psychology, say, and those now typically found in that journal. The fathers of the classical
approach might not fully approve of
either. Similarly, though we adopt for
conciseness an idiom that purports to
define the Bayesian position, there
must be at least as many Bayesian
positions as there are Bayesians. Still, as philosophies go, the unanimity among Bayesians reared apart is remarkable and an encouraging symptom of the cogency of their ideas.
In some respects Bayesian statistics
is a reversion to the statistical spirit
of the eighteenth and nineteenth
centuries; in others, no less essential,
it is an outgrowth of that modern
movement here called classical. The
latter, in coping with the consequences
of its view about the foundations of
probability which made useless, if not
meaningless, the probability that a
hypothesis is true, sought and found
techniques for statistical inference
which did not attach probabilities to
hypotheses. These intended channels
of escape have now, Bayesians believe,
led to reinstatement of the probabilities of hypotheses and a return of
statistical inference to its original line
of development. In this return, formulations, problems, mathematics, and such vital tools as distribution theory and tables of functions are borrowed from extrastatistical probability theory and from classical statistics itself. All the elements of
Bayesian statistics, except perhaps
the personalistic view of probability,
were invented and developed within,
or before, the classical approach to
statistics; only their combination into specific techniques for statistical inference is at all new.
The Bayesian approach is a common-sense approach. It is simply a set of
techniques for orderly expression and
revision of your opinions with due
regard for internal consistency among
their various aspects and for the data.
Naturally, then, much that Bayesians
say about inference from data has
been said before by experienced,
intuitive, sophisticated empirical scientists and statisticians. In fact,
when a Bayesian procedure violates
your intuition, reflection is likely to
show the procedure to have been
incorrectly applied. If classically trained intuitions do have some conflicts, these often prove transient.
Elements of Bayesian Statistics
Two basic ideas which come together in Bayesian statistics, as we
have said, are the decision-theoretic
formulation of statistical inference
and the notion of personal probability.
Statistics and decisions. Prior to a
paper by Neyman (1938a), classical
statistical inference was usually expressed in terms of justifying propositions on the basis of data. Typical propositions were: Point estimates; the best guess for the unknown number μ is m. Interval estimates; μ is between m₁ and m₂. Rejection of hypotheses; μ is not 0. Neyman's
(1938a, 1957) slogan "inductive behavior" emphasized the importance of
action, as opposed to assertion, in the
face of uncertainty. The decision-theoretic, or economic, view of statistics was advanced with particular
vigor by Wald (1942). To illustrate,
in the decision-theoretic outlook a
point estimate is a decision to act, in some specific context, as though μ were m, not to assert something about μ. Some classical statisticians, notably Fisher (1956, Ch. 4), have hotly rejected the decision-theoretic outlook.

While Bayesian statistics owes much to the decision-theoretic outlook, and while we personally are inclined to side with it, the issue is not crucial to a Bayesian. No one will deny that economic problems of behavior in the face of uncertainty concern statistics, even in its most "pure" contexts. For example, "Would it be wise, in the light of what has just been observed, to attempt such and such a year's investigation?" The controversial issue is only whether such economic problems are a good paradigm of all statistical problems. For Bayesians, all uncertainties are measured by probabilities, and these probabilities (along with the here less emphasized concept of utilities) are the key to all problems of economic uncertainty. Such a view deprives debate about whether all problems of uncertainty are economic of urgency. On the other hand, economic definitions of personal probability seem, at least to us, invaluable for communication and perhaps indispensable for operational definition of the concept.

A Bayesian can reflect on his current opinion (and how he should revise it on the basis of data) without any reference to the actual economic significance, if any, that his opinion may have. This paper ignores economic considerations, important though they are even for pure science, except for brief digressions. So doing may combat the misapprehension that Bayesian statistics is primarily for business, not science.

Personal probability. With rare exceptions, statisticians who conceive of probabilities exclusively as limits of relative frequencies are agreed that uncertainty about matters of fact is ordinarily not measurable by probability. Some of them would brand as nonsense the probability that weightlessness decreases visual acuity; for others the probability of this hypothesis would be 1 or 0 according as it is in fact true or false. Classical statistics is characterized by efforts to reformulate inference about such hypotheses without reference to their probabilities, especially initial probabilities.

These efforts have been many and ingenious. It is disagreement about which of them to espouse, incidentally, that distinguishes the two main classical schools of statistics. The related ideas of significance levels, "errors of the first kind," and confidence levels, and the conflicting idea of fiducial probabilities are all intended to satisfy the urge to know how sure you are after looking at the data, while outlawing the question of how sure you were before. In our opinion, the quest for inference without initial probabilities has failed, inevitably.

You may be asking, "If a probability is not a relative frequency or a hypothetical limiting relative frequency, what is it? If, when I evaluate the probability of getting heads when flipping a certain coin as .5, I do not mean that if the coin were flipped very often the relative frequency of heads to total flips would be arbitrarily close to .5, then what do I mean?"

We think you mean something about yourself as well as about the coin. Would you not say, "Heads on the next flip has probability .5" if and only if you would as soon guess heads as not, even if there were some important reward for being right? If so,
your sense of "probability" is ours;
even if you would not, you begin to
see from this example what we mean
by "probability," or "personal probability." To see how far this notion is
from relative frequencies, imagine
being reliably informed that the coin
has either two heads or two tails.
You may still find that if you had to
guess the outcome of the next flip for a
large prize you would not lift a finger
to shift your guess from heads to tails
or vice versa.
Probabilities other than .5 are
defined in a similar spirit by one of
several mutually harmonious devices
(Savage, 1954, Ch. 14). One that is
particularly vivid and practical, if
not quite rigorous as stated here, is
this. For you, now, the probability
P(A) of an event A is the price you
would just be willing to pay in exchange for a dollar to be paid to you
in case A is true. Thus, rain tomorrow has probability 1/3 for you
if you would pay just $.33 now in
exchange for $1.00 payable to you in
the event of rain tomorrow.
A system of personal probabilities,
or prices for contingent benefits, is
inconsistent if a person who acts in
accordance with it can be trapped
into accepting a combination of bets
that assures him of a loss no matter
what happens. Necessary and sufficient conditions for consistency are
the following, which are familiar as a
basis for the whole mathematical
theory of probability :
0 ≤ P(A) ≤ P(S) = 1,
P(A∪B) = P(A) + P(B),

where S is the tautological, or universal, event; A and B are any two incompatible, or nonintersecting, events; and A∪B is the event that either A or B is true, or the union of A and B. Real people often make
choices that reflect violations of these
__
rules, especially the second, which is
why personalists emphasize that personal probability is orderly, or consistent, opinion, rather than just any
opinion. One of us has presented
elsewhere a model for probabilities
inferred from real choices that does
not include the second consistency
requirement listed above (Edwards,
1962b). It is important to keep clear
the distinction between the somewhat idealized consistent personal
probabilities that are the subject of
this paper and the usually inconsistent
subjective probabilities that can be
inferred from real human choices
among bets, and the words "personal"
and "subjective" here help do so.
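The trap described above, often called a "Dutch book," can be made concrete. The following sketch is ours, not the paper's, and the events and prices are invented; it checks a bettor's prices on two incompatible events against the additivity rule and reports the sure profit an opponent can lock in when the rule is violated.

```python
# A bettor posts prices (dollars paid now) for $1 tickets on event A, on
# event B, and on A∪B, where A and B cannot both happen. If the prices
# violate P(A) + P(B) = P(A∪B), an opponent can trade so that the ticket
# payoffs cancel in every outcome, pocketing the price gap for certain.

def sure_loss(price_A, price_B, price_A_or_B):
    """Guaranteed profit extractable from the bettor (0.0 if coherent)."""
    # Sell tickets on A and on B while buying a ticket on A∪B (or the
    # reverse trade): whatever happens, payouts received equal payouts
    # owed, so the net is just the difference in prices.
    gap = price_A + price_B - price_A_or_B
    return abs(gap) if abs(gap) > 1e-12 else 0.0

# Coherent prices obey the additivity rule: no sure loss.
assert sure_loss(0.3, 0.2, 0.5) == 0.0
# Incoherent prices: the bettor is guaranteed to lose the 0.1 gap.
assert abs(sure_loss(0.4, 0.2, 0.5) - 0.1) < 1e-9
```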
Your opinions about a coin can of
course differ from your neighbor's.
For one thing, you and he may have
different bodies of relevant information. We doubt that this is the only
legitimate source of difference of
opinion. Hence the personal in personal probability. Any probability
should in principle be indexed with the
name of the person, or people, whose
opinion it describes. We usually
leave the indexing unexpressed but
underline it from time to time with
phrases like "the probability for you
that H is true."
Although your initial opinion about
future behavior of a coin may differ
radically from your neighbor's, your
opinion and his will ordinarily be so
transformed by application of Bayes'
theorem to the results of a long
sequence of experimental flips as
to become nearly indistinguishable.
This approximate merging of initially
divergent opinions is, we think, one
reason why empirical research is
called "objective." Personal probability is sometimes dismissed with the
assertion that scientific knowledge
cannot be mere opinion. Yet, obviously, no sharp lines separate the
198
W. Edwards, H. Lindman,
conjecture that many human cancers
may be caused by viruses, the opinion
that many are caused by smoking,
and the "knowledge" that many have
been caused by radiation.
Conditional probabilities and Bayes'
theorem. In the spirit of the rough
definition of the probability P(A) of
an event A given above, the conditional probability P(D|H) of an event D given another H is the amount you would be willing to pay in exchange for a dollar to be paid to you in case D is true, with the further provision that all transactions are canceled unless H is true. As is not hard to see, P(D∩H) is P(D|H)P(H), where D∩H is the event that D and H are both true, or the intersection of D and H. Therefore,

    P(D|H) = P(D∩H)/P(H),   [1]

unless P(H) = 0.
Conditional probabilities are the
probabilistic expression of learning
from experience. It can be argued
that the probability of D for you, the consistent you, after learning that H is in fact true is P(D|H). Thus,
after you learn that H is true, the new
system of numbers P(D|H) for a
specific H comes to play the role that
was played by the old system P(D)
before.
Although the events D and H are
arbitrary, the initial letters of Data
and Hypothesis are suggestive names
for them. Of the three probabilities
in Equation 1, P(H) might be illustrated by the sentence: "The probability for you, now, that Russia will
use a booster rocket bigger than our
planned Saturn booster within the
next year is .8." The probability
P(D∩H) is the probability of the
joint occurrence of two events regarded as one event, for instance:
"The probability for you, now, that
the next manned space capsule to
enter space will contain three men
and also that Russia will use a booster
rocket bigger than our planned Saturn
booster within the next year is .2."
According to Equation 1, the probability for you, now, that the next
manned space capsule to enter space
will contain three men, given that
Russia will use a booster rocket bigger
than our planned Saturn booster
within the next year is .2/.8 = .25.
A little algebra now leads to a basic form of Bayes' theorem:

    P(H|D) = P(D|H)P(H)/P(D),   [2]

provided P(D) and P(H) are not 0. In fact, if the roles of D and H in Equation 1 are interchanged, the old form of Equation 1 and the new form can be expressed symmetrically, thus:

    P(D|H)/P(D) = P(D∩H)/[P(D)P(H)] = P(H|D)/P(H),   [3]

which obviously implies Equation 2. A suggestive interpretation of Equation 3 is that the relevance of H to D equals the relevance of D to H.
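The space-capsule and booster numbers can serve as a numerical check of Equations 1 through 3. In this sketch, which is ours rather than the paper's, the marginal probability P(D) = .5 is an invented illustrative value, not a number from the text.

```python
# Check Equations 1-3 with the paper's example: P(H) = .8 and
# P(D∩H) = .2, so Equation 1 gives P(D|H) = .25. The marginal P(D)
# below is hypothetical, chosen only to exercise Equations 2 and 3.

P_H = 0.8         # prior probability of the booster hypothesis H
P_D_and_H = 0.2   # probability of the joint event D∩H
P_D = 0.5         # hypothetical marginal probability of the datum D

P_D_given_H = P_D_and_H / P_H              # Equation 1
P_H_given_D = P_D_given_H * P_H / P_D      # Equation 2, Bayes' theorem

assert abs(P_D_given_H - 0.25) < 1e-12

# Equation 3: the relevance of H to D equals the relevance of D to H.
lhs = P_D_given_H / P_D
mid = P_D_and_H / (P_D * P_H)
rhs = P_H_given_D / P_H
assert abs(lhs - mid) < 1e-12 and abs(mid - rhs) < 1e-12
```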
Reformulations of Bayes' theorem
apply to continuous parameters or
data. In particular, if a parameter
(or set of parameters) λ has a prior probability density function u(λ), and if x is a random variable (or a set of random variables such as a set of measurements) for which v(x|λ) is the density of x given λ and v(x) is the density of x, then the posterior probability density of λ given x is

    u(λ|x) = v(x|λ)u(λ)/v(x).   [4]
There are of course still other possibilities such as forms of Bayes'
theorem in which λ but not x, or x but not λ, is continuous. A complete and
compact generalization is available
and technically necessary but need
not be presented here.
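For readers who prefer computation to notation, here is a minimal sketch of Equation 4 on a discrete grid. The example is ours, not the paper's: λ is the unknown bias of a coin and x a count of heads under a binomial likelihood.

```python
from math import comb

def posterior_on_grid(lams, prior, heads, n):
    """Discrete analogue of Equation 4: posterior ∝ v(x|λ)u(λ),
    normalized so the weights sum to 1 (the sum plays the role of v(x))."""
    lik = [comb(n, heads) * l**heads * (1 - l)**(n - heads) for l in lams]
    joint = [v * u for v, u in zip(lik, prior)]
    total = sum(joint)
    return [j / total for j in joint]

grid = [0.1, 0.3, 0.5, 0.7, 0.9]       # candidate values of λ
uniform = [0.2] * 5                    # a flat prior u(λ) over the grid
post = posterior_on_grid(grid, uniform, heads=7, n=10)

assert abs(sum(post) - 1.0) < 1e-12    # a proper distribution
assert post.index(max(post)) == 3      # λ = 0.7 is favored by 7/10 heads
```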
In Equation 2, D may be a particular observation or a set of data
regarded as a datum and H some
hypothesis, or putative fact. Then
Equation 2 prescribes the consistent
revision of your opinions about the
probability of H in the light of the
datum D—similarly for Equation 4.
In typical applications of Bayes'
theorem, each of the four probabilities
in Equation 2 performs a different
function, as will soon be explained.
Yet they are very symmetrically related to each other, as Equation 3
brings out, and are all the same kind
of animal. In particular, all probabilities are really conditional. Thus,
P(H) is the probability of the hypothesis H for you conditional on all
you know, or knew, about H prior
to learning D; and P(H\D) is the
probability of H conditional on that
same background knowledge together
with D.
Again, the four probabilities in
Equation 2 are personal probabilities.
This does not of course exclude any
of them from also being frequencies,
ratios of favorable to total possibilities, or numbers arrived at by any
other calculation that helps you form
your personal opinions. But some
are, so to speak, more personal than
others. In many applications, practically all concerned find themselves
in substantial agreement with respect
to P(D|H); or P(D|H) is public, as we say. This happens when P(D|H)
flows from some simple model that
the scientists, or others, concerned
accept as an approximate description
of their opinion about the situation in
which the datum was obtained. A
traditional example of such a statistical model is that of drawing a ball
from an urn known to contain some
balls, each either black or white. If
a series of balls is drawn from the urn,
and after each draw the ball is replaced
and the urn thoroughly shaken, most
men will agree at least tentatively
that the probability of drawing a
particular sequence D (such as black,
white, black, black) given the hypothesis that there are B black and
W white balls in the urn is
    (B/(B + W))^b (W/(B + W))^w,
where b is the number of black, and
w the number of white, balls in the
sequence D.
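A sketch of this urn model in code (ours; the urn composition is invented) makes the formula's key property explicit: the probability of a sequence depends only on the counts b and w, not on the order of the draws.

```python
def sequence_prob(seq, B, W):
    """Probability of the exact sequence `seq` of 'b'/'w' draws, with
    replacement, from an urn holding B black and W white balls."""
    p_black = B / (B + W)
    prob = 1.0
    for ball in seq:
        prob *= p_black if ball == "b" else (1 - p_black)
    return prob

# The paper's example sequence: black, white, black, black (b = 3, w = 1),
# drawn from a hypothetical urn with 3 black balls and 1 white ball.
p1 = sequence_prob("bwbb", B=3, W=1)
p2 = sequence_prob("bbbw", B=3, W=1)       # same counts, different order

assert abs(p1 - p2) < 1e-15                # order does not matter
assert abs(p1 - (0.75**3) * 0.25) < 1e-15  # (B/(B+W))^b (W/(B+W))^w
```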
Even the best models have an
element of approximation. For example, the probability of drawing any
sequence D of black and white balls
from an urn of composition H depends,
in this model, only on the number of
black balls and white ones in D, not
on the order in which they appeared.
This may express your opinion in a
specific situation very well, but not
well enough to be retained if D should
happen to consist of 50 black balls
followed by 50 white ones. Idiomatically, such a datum convinces you
that this particular model is a wrong
description of the world. Philosophically, however, the model was
not a description of the world but of
your opinions, and to know that it was
not quite correct, you had at most to
reflect on this datum, not necessarily
to observe it. In many scientific
contexts, the public model behind
P(D|H) may include the notions of
random sampling from a welldefined
population, as in this example. But
precise definition of the population
may be difficult or impossible, and
a sample whose randomness would
thoroughly satisfy you, let alone your
neighbor in science, can be hard to
draw.
In some cases P(D|H) does not
command general agreement at all.
What is the probability of the actual
seasonal color changes on Mars if
there is life there? What is this
probability if there is no life there?
Much discussion of life on Mars has
not removed these questions from
debate.
Public models, then, are never perfect and often are not available. Nevertheless, those applications of inductive inference, or probabilistic reasoning, that are called statistical seem to be characterized by tentative public agreement on some model and provisional work within it. Rough characterization of statistics by the relative publicness of its models is not necessarily in conflict with attempts to characterize it as the study of numerous repetitions (Bartlett, in Savage et al., 1962, pp. 36-38). This characterization is intended to distinguish statistical applications of Bayes' theorem from many other applications to scientific, economic, military, and other contexts. In some of these nonstatistical contexts, it is appropriate to substitute the judgment of experts for a public model as the source of P(D|H) (see for example Edwards, 1962a, 1963).

The other probabilities in Equation 2 are often not at all public. Reasonable men may differ about them, even if they share a statistical model that specifies P(D|H). People do, however, often differ much more about P(H) and P(D) than about P(H|D), for evidence can bring initially divergent opinions into near agreement.

The probability P(D) is usually of little direct interest, and intuition is often silent about it. It is typically calculated, or eliminated, as follows. When there is a statistical model, H is usually regarded as one of a list, or partition, of mutually exclusive and exhaustive hypotheses Hi such that the P(D|Hi) are all equally public, or part of the statistical model. Since Σi P(Hi|D) must be 1, Equation 2 implies that

    P(D) = Σi P(D|Hi)P(Hi).

The choice of the partition Hi is of practical importance but largely arbitrary. For example, tomorrow will be "fair" or "foul," but these two hypotheses can themselves be subdivided and resubdivided. Equation 2 is of course true for all partitions but is more useful for some than for others. As a science advances, partitions originally not even dreamt of become the important ones (Sinclair, 1960). In principle, room should always be left for "some other" explanation. Since P(D|H) can hardly be public when H is "some other explanation," the catchall hypothesis is usually handled in part by studying the situation conditionally on denial of the catchall and in part by informal appraisal of whether any of the explicit hypotheses fit the facts well enough to maintain this denial. Good illustrations are Urey (1962) and Bridgman (1960).

In statistical practice, the partition is ordinarily continuous, which means roughly that Hi is replaced by a parameter λ (which may have more than one dimension) with an initial probability density u(λ). In this case,

    P(D) = ∫ P(D|λ)u(λ) dλ.

Similarly, P(D), P(D|Hi), and P(D|λ) are replaced by probability densities in D if D is (absolutely) continuously distributed.

P(H|D) or u(λ|D), the usual output of a Bayesian calculation, seems to be exactly the kind of information that we all want as a guide to thought and action in the light of an observational process. It is the probability for you that the hypothesis in question is true, on the basis of all your information, including, but not restricted to, the observation D.
Principle of Stable Estimation

Problem of prior probabilities. Since P(D|H) is often reasonably public and P(H|D) is usually just what the scientist wants, the reason classical statisticians do not base their procedures on Equations 2 and 4 must, and does, lie in P(H), the prior probability of the hypothesis. We have
already discussed the most frequent
objection to attaching a probability to
a hypothesis and have shown briefly
how the definition of personal probability answers that objection. We
must now examine the practical problem of determining P(H). Without
P(H), Equations 2 and 4 cannot yield P(H|D). But since P(H) is a
personal probability, is it not likely
to be both vague and variable, and
subjective to boot, and therefore useless for public scientific purposes?
Yes, prior probabilities often are
quite vague and variable, but they
are not necessarily useless on that
account (Borel, 1924). The impact
of actual vagueness and variability of
prior probabilities differs greatly from
one problem to another. They frequently have but negligible effect on
the conclusions obtained from Bayes'
theorem, although utterly unlimited
vagueness and variability would have
utterly unlimited effect. If observations are precise, in a certain sense,
relative to the prior distribution on
which they bear, then the form and
properties of the prior distribution
have negligible influence on the posterior distribution. From a practical
point of view, then, the untrammeled
subjectivity of opinion about a parameter ceases to apply as soon as
much data become available. More
generally, two people with widely
divergent prior opinions but reasonably open minds will be forced into
arbitrarily close agreement about
future observations by a sufficient
amount of data. An advanced mathematical expression of this phenomenon
is in Blackwell and Dubins (1962).
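This merging of opinions is easy to exhibit numerically. The sketch below is ours, with invented priors and data: two quite different prior densities over a coin's bias are updated on the same 100 flips, using a grid approximation, and the resulting posteriors nearly coincide.

```python
from math import comb

def posterior(prior_weights, grid, heads, n):
    """Grid posterior proportional to binomial likelihood times prior."""
    joint = [comb(n, heads) * g**heads * (1 - g)**(n - heads) * p
             for g, p in zip(grid, prior_weights)]
    total = sum(joint)
    return [j / total for j in joint]

grid = [i / 100 for i in range(1, 100)]
optimist = [g for g in grid]           # prior weight rising with the bias
pessimist = [1 - g for g in grid]      # prior weight falling with the bias

post_a = posterior(optimist, grid, heads=70, n=100)
post_b = posterior(pessimist, grid, heads=70, n=100)

# After 100 flips the two posteriors agree closely at every grid point.
assert max(abs(a - b) for a, b in zip(post_a, post_b)) < 0.05
```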
When prior distributions can be regarded as essentially uniform. Frequently, the data so completely
control your posterior opinion that
there is no practical need to attend to
the details of your prior opinion.
For example, consider taking your
temperature.
Headachy and hot, you are convinced that you have a fever but are
not sure how much. You do not hold
the interval 100.5°-101° even 20 times more probable than the interval 101°-101.5° on the basis of your malaise
alone. But now you take your temperature with a thermometer that you
strongly believe to be accurate and
find yourself willing to give much
more than 20 to 1 odds in favor of the
half-degree centered at the thermometer reading.
Your prior opinion is rather irrelevant to this useful conclusion but
of course not utterly irrelevant. For
readings of 85° or 110°, you would
revise your statistical model according
to which the thermometer is accurate
and correctly used, rather than proclaim a medical miracle. A reading of
104° would be puzzling—too inconsistent with your prior opinion to
seem reasonable and yet not obviously
absurd. You might try again, perhaps
with another thermometer.
It has long been known that, under
suitable circumstances, your actual
posterior distribution will be approximately what it would have been
had your prior distribution been
uniform, that is, described by a
constant density. As the fever example suggests, prior distributions
need not be, and never really are,
completely uniform. To ignore the
departures from uniformity, it suffices
that your actual prior density change
gently in the region favored by the
data and not itself too strongly favor
some other region.
But what is meant by "gently," by
"region favored by the data," by
"region favored by the prior distribution," and by two distributions
being approximately the same? Such
questions do not have ultimate answers, but this section explores one
useful set of possibilities. The mathematics and ideas have been current
since Laplace, but we do not know
any reference that would quite substitute for the following mathematical
paragraphs; Jeffreys (1939, see Section
3.4 of the 1961 edition) and Lindley
(1961) are pertinent. Those who
would skip or skim the mathematics
will find the trail again immediately
following Implication 7, where the
applications of stable estimation are
informally summarized.
Under some circumstances, the
posterior probability density
.M . =
m(Xa.)
j
/
can be well
V(X\\)U(\)
:
\_?A
v(x\X)u(\')d\'
senses by the probability density
»(*X)
■
/ v(x\\')d\'
,
L.
TAT
[6J
where X is a parameter or set of
parameters, X' is a corresponding
J.
Bayesian
Savage
variable of integration, x is an observation or set of observations, v(x\\)
is the probability (or perhaps probability density) of x given X, m(X) is the
prior probability density of X, and the
integrals are over the entire range of
meaningful values of X. By their
nature, u, v, and w are nonnegative,
and unless the integral in Equation 6
is finite, there is no hope that the
approximation will be valid, so these
conditions are adopted for the following discussion.
Consider a region of values of λ, say B, which is so small that u(λ) varies but little within B and yet so large that B promises to contain much of the posterior probability of λ, given the value of x fixed throughout the present discussion. Let α, β, γ, and δ be positive numbers, and consider these assumptions:

Assumption 1: ∫_B̄ v(x|λ) dλ ≤ α ∫_B v(x|λ) dλ, where B̄ is the complement of B. (That is, α is small; the observation highly favors B.)

Assumption 2: For λ in B, ū ≤ u(λ) ≤ (1 + β)ū, where ū is a positive constant. (That is, the prior density changes very little within B; .01 or even .05 would be good everyday values for β.)

Assumption 3: ∫_B̄ v(x|λ)u(λ) dλ ≤ γ ∫_B v(x|λ)u(λ) dλ. (That is, only a small fraction of the posterior probability lies outside B.)

Assumption 3′: For all λ, u(λ) ≤ φū. (That is, u(λ) is nowhere astronomically big compared to its nearly constant values in B; a φ as large as 100 or 1,000 will often be tolerable.) Assumption 3′ in the presence of Assumptions 1 and 2 can imply Assumption 3 with γ = φα, as is seen thus:

  ∫_B̄ v(x|λ)u(λ) dλ ≤ φū ∫_B̄ v(x|λ) dλ ≤ φαū ∫_B v(x|λ) dλ ≤ φα ∫_B v(x|λ)u(λ) dλ.

Let w(C|x) and u(C|x) denote the probabilities of a set C under the densities w(λ|x) and u(λ|x), that is, ∫_C w(λ|x) dλ and ∫_C u(λ|x) dλ, and write ε = 1 − [(1 + β)(1 + γ)]⁻¹ and δ = (1 + β)(1 + α) − 1; if α, β, and γ are small, so are ε and δ. The assumptions then yield a chain of implications, of which these are the most useful:

Implication 4: u(B|x) ≥ (1 + γ)⁻¹, and for every subset C of B,

  1 − ε ≤ u(C|x)/w(C|x) ≤ 1 + δ.

Implication 5: If t(λ) is a function of λ such that |t(λ)| ≤ T for all λ, then the expectations of t under u(λ|x) and under w(λ|x) differ by at most a small multiple of T, the multiple being determined by ε, δ, and γ.

Implication 6: For every C, whether in B or not, |u(C|x) − w(C|x)| ≤ ε + γ.

Implication 7: For every C,

  (1 − ε)[w(C|x) − α] ≤ u(C|x) ≤ (1 + δ)w(C|x) + γ.
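Readers who like to see Equations 5 and 6 side by side numerically may find a small computational sketch helpful. The following Python fragment is ours, not the authors'; the gently varying prior (a normal density centered at 102.0° with standard deviation 1.5°), the grid, and the helper names `normal_pdf` and `prob` are all illustrative choices, with numbers echoing the fever example discussed below. It computes the exact posterior of Equation 5 and the stable-estimation approximation of Equation 6 on a grid and compares the probability each assigns to an interval.

```python
import math

def normal_pdf(z, mean, sd):
    # Density of a normal distribution at z.
    return math.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# Grid over the meaningful range of lambda (degrees Fahrenheit).
dl = 0.001
grid = [97.0 + i * dl for i in range(8001)]   # 97.0 .. 105.0

x = 101.0                                        # observed reading
u = [normal_pdf(l, 102.0, 1.5) for l in grid]    # gently varying prior (illustrative)
v = [normal_pdf(x, l, 0.05) for l in grid]       # likelihood of x given lambda

# Equation 5: exact posterior density u(lambda|x).
norm_u = sum(vi * ui for vi, ui in zip(v, u)) * dl
post = [vi * ui / norm_u for vi, ui in zip(v, u)]

# Equation 6: stable-estimation approximation w(lambda|x), prior ignored.
norm_w = sum(v) * dl
approx = [vi / norm_w for vi in v]

def prob(dens, lo, hi):
    # Probability of the interval [lo, hi] under a gridded density.
    return sum(d for l, d in zip(grid, dens) if lo <= l <= hi) * dl

print(prob(post, 100.9, 101.1), prob(approx, 100.9, 101.1))
```

Because the prior is nearly constant where the likelihood is appreciable, the two interval probabilities agree to well within one percent, which is the practical content of stable estimation.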
What does all this epsilontics mean for practical statistical work? The overall goal is valid justification for proceeding as though your prior distribution were uniform. A set of three assumptions implying this justification was pointed out: First, some region B is highly favored by the data. Second, within B the prior density changes very little. Third, most of the posterior density is concentrated inside B. According to a more stringent but more easily verified substitute for the third assumption, the prior density nowhere enormously exceeds its general value in B.

Given the three assumptions, what follows? One way of looking at the implications is to observe that nowhere within B, which has high posterior probability, is the ratio of the approximate posterior density to the actual posterior density much different from 1, and that what happens outside B is not important for some purposes. Again, if the posterior expectation, or average, of some bounded function is of interest, then the difference between the expectation under the actual posterior distribution and under the approximating distribution will be small relative to the absolute bound of the function. Finally, the actual posterior probability and the approximate probability of any set of parameter values are nearly equal. In short, the approximation is a good one in several important respects, given the three assumptions. Still other respects must sometimes be invoked, and these may require further assumptions. See, for example, Lindley (1961).

Even when Assumption 2 is not applicable, a transformation of the parameters of the prior distribution sometimes makes it so. If, for example, your prior distribution roughly obeys Weber's law, so that you tend to assign about as much probability to the region from λ to 2λ as to the region from 10λ to 20λ, a logarithmic transformation of λ may well make Assumption 2 applicable for a considerably smaller β than otherwise.

We must forestall a dangerous confusion. In the temperature example, as in many others, the measurement x is being used to estimate the value of some parameter λ. In such cases, λ and x are measured in the same units (degrees Fahrenheit in the example) and interesting values of λ are often numerically close to observed values of x. It is therefore imperative to maintain the conceptual distinction between λ and x. When the principle of stable estimation applies, the normalized function v(x|λ) as a function of λ, not of x, approximates your posterior distribution. The point is perhaps most obvious in an example such as estimating the area of a circle by measuring its radius. In this case, λ is in square inches, x is in inches, and there is no temptation to think that the form of the distribution of x's is the same as the form of the posterior distribution of λ's. But the same point applies in all cases. The function v(x|λ) is a function of both x and λ; only by coincidence will the form or the parameters of v(x|λ) considered as a function of λ be the same as its form or parameters considered as a function of x. One such coincidence occurs so often that it tends to mislead intuition. When your statistical model leads you to expect that a set of observations will be normally distributed, then the posterior distribution of the mean of the quantity being observed will, if stable estimation applies, be normal with the mean equal to the mean of the observations. (Of course it will have a smaller standard deviation than the standard deviation of the observations.)
Fig. 1. u(λ) and v(x|λ) for the fever-thermometer example; the horizontal axis is λ (degrees Fahrenheit). (Note that the units on the y axis are different for the two functions.)
Numerically, what can the principle of stable estimation do for the fever-thermometer example? Figure 1 is a reasonably plausible numerical picture of the situation. Your prior distribution, in your role as invalid, has a little bump around 101° because you really think you have a fever; but on other occasions you have taken your temperature when feeling out of sorts and found it depressingly normal, so most of your density is spread over the region 99.5° to 104.5°. It gets rather low at the high end of that interval, since you doubt that you could have so much as a 104° fever without feeling even worse than you do.

The thermometer has a standard deviation of .05° and negligible systematic error; this is reasonable for a really good clinical thermometer, the systematic error of which should be small compared to the errors of procedure and reading. For convenience, and because it is plausible as an approximation, we assume also that the thermometer distributes its errors normally. The indicated reading will, then, lie within a symmetric region .1° wide around the true temperature with probability a little less than .7. If the thermometer reading is 101.0°, we might take the region B to extend from 100.8° to 101.2°, four standard deviations on each side of the observation. According to tables of the normal distribution, α is then somewhat less than 10⁻⁴.

The number ū should be thought of as the smallest value of u(λ) within B, but its actual value cancels out of all important calculations and so is immaterial. For the same reason, it is also immaterial that the two functions v(101.0|λ) and u(λ) graphed in Figure 1 are not measured in the same units and therefore cannot meaningfully share the same vertical scale; in so drawing them, we sin against logic but not against the calculation of w(λ|x) or u(λ|x). Figure 1 suggests that β is at most .05, and we shall work with that value, but it is essential to give some serious
justification for this crucial assumption, as we shall do later.

We justify Assumption 3 by way of Assumption 3′. The figure, drawn for qualitative suggestion rather than accuracy, makes a φ of 2 look reasonable, but since you may have a very strong suspicion that your temperature is nearly normal, we take φ = 100 for safety. The real test is whether there is any hundredth, say, of a degree outside of B that you initially held to be more than 100 times as probable as the initially least probable hundredth in B. You will not find this question about yourself so hard, especially since little accuracy is required.

Actually, the technique based on φ could fail utterly without really spoiling the program. Suppose, for example, you really think it pretty unlikely that you have a fever and have unusually good knowledge of the temperature that is normal for you (at this hour). You may then have as much probability as .95 packed into some interval of .1° near normal, but in no such short interval in B are you likely to have more than one fiftieth of the residual probability. This leads to a φ of at least .95/(.05 × .02) = 950. Fortunately, different, but somewhat analogous, calculations show that even very high concentrations of initial probability in a region very strongly discredited by the data do not interfere with the desired approximation. This alternative sort of calculation will be made clear by later examples about hypothesis testing.

Returning from the digression, continue with φ = 100. The comment after Assumption 3′ leads to γ = φα = 10⁻⁴ × 10² = .01.

Explore now some of the consequences of the theory of stable estimation for the example: w(λ|101.0) is normal about 101° with a standard deviation of .05°. If the region B is taken to be the interval from 100.8° to 101.2°, then α ≈ 10⁻⁴, β = .05, and γ = .01; hence ε = 1 − [(1 + β)(1 + γ)]⁻¹ < .06, and δ = (1 + β)(1 + α) − 1 < .051. According to Implication 4, for any C in B, u(C|101.0) differs by at most about 6% from the explicitly computable w(C|101.0). For any C, whether in B or not, Implication 6 guarantees |u(C|101.0) − w(C|101.0)| ≤ .068. An especially interesting example for C is the outside of some interval that has, say, 95% probability under w(λ|101.0), so that w(C|101.0) = .05. Will u(C|101.0) be moderately close to 5%? Implications 4 and 6 do not say so, but Implication 7 says that (.94)(.0499) = .0470 ≤ u(C|101.0) ≤ (1.050)(.05) + .01 = .0625. This is not so crude for the sort of situation where such a u(C|101.0) might be wanted. Even if w(C|101.0) is only .01, we get considerable information about u(C|101.0): .0093 ≤ u(C|101.0) ≤ .021. For w(C|101.0) = .001, .000849 ≤ u(C|101.0) ≤ .011. At this stage, the upper bound has become almost useless, and when w(C|101.0) is as small as 10⁻⁴, the lower bound is utterly useless.

Implication 5, and extensions of it, are also applicable. If, for example, you record what the thermometer says, the mean error and the root-mean-squared error of the recorded value, averaged according to your own opinion, should be about 0° and about .05°, respectively, according to a slight extension of Implication 5.

To reemphasize the central point, those details about your initial opinion that were not clear to you yourself, about which you might not agree with your neighbor, and that would have been complicated to keep track of anyway, can be neglected after a fairly good measurement.

A vital matter that has been postponed is to adduce a reasonable value for β. Like φ, β is an expression of personal opinion. In any application, β must be large enough to be an expression of actual opinion or, in "public" applications, of "public" opinion. If your opinion were perfectly clear or if the public were of one mind, you could determine β by dividing the maximum of your u(λ) in B by its minimum and subtracting 1; but the most important need for β arises just when clarity or agreement is lacking. For unity of discussion, permit us to focus on the problem imposed by lack of clarity.

One way to express the lack of clarity, or the vagueness, of an actual set of opinions about λ is to say that many somewhat different densities portray your opinion tolerably well. In assuming that .05 was a sufficiently large β for the fever example, we were assuming that you would reject as unrealistic any initial density u(λ) whose maximum in the interval B from 100.8° to 101.2° exceeds its minimum in B by as much as 5%. But how can you know such a thing about yourself? Still more, how could you hope to guess it about another?

To begin with, you might consider pairs of very short intervals in B and ask how much more probable one is than the other, but this will fail in realistic problems. To see why it fails, ask yourself what odds Ω you would offer (initially) for the last hundredth of a degree in B against the first hundredth; that is, imagine contracting to pay $Ω if λ is in the first hundredth of a degree of B, to receive $1 if it is in the last hundredth, and to be quits otherwise. If, for instance, you are feeling less sick than the thermometer reading suggests, then you will be clear that u(λ) is decreasing throughout B, that Ω is less than 1, and that 1 − Ω would be the smallest valid value for β. However, you are likely to be highly confused about Ω. Doubtless Ω is very little less than 1. Is .9999 much too large or .91 much too small? We find it hard to answer when the question is put thus, and so may you.

As an entering wedge, consider an interval much longer than B, say from 100° to 102°. Perhaps you find u(λ) to decrease even throughout this interval, and even to decrease moderately perceptibly between its two end points. The ratio u(101)/u(102), while distinctly greater than 1, may be convincingly less than 1.2. If the proportion by which u(λ) diminished in every hundredth of a degree from 100° to 102° were the same (more formally, if the logarithmic derivative of u(λ) were constant between 100° and 102°), then u(101.2)/u(100.8) would be at most (1.2)^(.4/2) = (1.2)^.2 = 1.037. Of course the rate of decrease is not exactly constant, but it may seem sufficiently generous to round 1.037 up to 1.05, which results in the β of .05 used in this example. Had you taken your temperature 25 times (with random error but negligible systematic error), which would not be realistic in this example but would be in some other experimental settings, then the standard error of the measurements would have been .01°, and B would have needed to be only .08° instead of .4° wide to take in eight standard deviations. Under those circumstances, β could hardly need to be greater than .01, that is, (1.05)^(.08/.4) − 1.
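The arithmetic of this example is easily mechanized. The sketch below is ours; it assumes the forms ε = 1 − [(1 + β)(1 + γ)]⁻¹ and δ = (1 + β)(1 + α) − 1 used in this section, plugs in the fever-example constants, and checks the Implication 7 bounds and the two appeals to a constant logarithmic rate of decrease.

```python
# Fever-example constants; gamma = phi * alpha with phi = 100 (assumed forms from the text).
alpha, beta, gamma = 1e-4, 0.05, 0.01

eps = 1 - 1 / ((1 + beta) * (1 + gamma))   # relative error, lower side (Implication 4)
delta = (1 + beta) * (1 + alpha) - 1       # relative error, upper side (Implication 4)
print(eps, delta)                          # a bit under .06 and a bit under .051

# Implication 7 bounds for a set C with w(C|x) = .05:
w = 0.05
lower = (1 - eps) * (w - alpha)
upper = (1 + delta) * w + gamma
print(lower, upper)                        # close to .0470 and .0625

# Constant logarithmic decrease from 100 to 102 degrees:
print(1.2 ** (0.4 / 2))                    # bound on u(101.2)/u(100.8), about 1.037
print(1.05 ** (0.08 / 0.4) - 1)            # with 25 readings, beta need hardly exceed .01
```

The printed values reproduce the bounds quoted in the text, which is a useful sanity check on the epsilontics.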
How good should the approximation be before you can feel comfortable
about using it? That depends entirely on your purpose. There are
purposes for which an approximation
of a small probability which is sure to
be within fivefold of the actual probability is adequate. For others, an
error of 1% would be painful. Fortunately, if the approximation is unsatisfactory it will often be possible to
improve it as much as seems necessary
at the price of collecting additional
data, an expedient which often justifies its cost in other ways too. In
practice, the accuracy of the stable-estimation approximation will seldom
be so carefully checked as in the fever
example. As individual and collective
experience builds up, many applications will properly be judged safe at a
glance.
Far from always can your prior
distribution be practically neglected.
At least five situations in which detailed properties of the prior distribution are crucial occur to us:
1. If you assign exceedingly small prior probabilities to regions of λ for which v(x|λ) is relatively large, you in effect express reluctance to believe in values of λ strongly pointed to by the data and thus violate Assumption 3, perhaps irreparably. Rare events do occur, though rarely, and should not be permitted to confound us utterly. Also, apparatus and plans can break down and produce data that "prove" preposterous things. Morals conflict in the fable of the Providence man who on a cloudy summer day went to the post office to return his absurdly low-reading new barometer to Abercrombie and Fitch. His house was flattened by a hurricane in his absence.
2. If you have strong prior reason to believe that λ lies in a region for which v(x|λ) is very small, you may be unwilling to be persuaded by the evidence to the contrary, and so again may violate Assumption 3. In this situation, the prior distribution might consist primarily of a very sharp spike, whereas v(x|λ), though very low in the region of the prior spike, may be comparatively gentle everywhere. In the previous paragraph, it was v(x|λ) which had the sharp spike, and the prior distribution which was near zero in the region of that spike. Quite often it would be inappropriate to discard a good theory on the basis of a single opposing experiment. Hypothesis-testing situations discussed later in this paper illustrate this phenomenon.
3. If your prior opinion is relatively diffuse, but so are your data, then Assumption 1 is seriously violated.
For when your data really do not
mean much compared to what you
already know, then the exact content
of the initial opinion cannot be
neglected.
4. If observations are expensive and
you have a decision to make, it may
not pay to collect enough information
for the principle of stable estimation
to apply. In such situations you
should collect just so much information that the expected value of the
best course of action available in the
light of the information at hand is
greater than the expected value of any
program that involves collecting more
observations. If you have strong
prior opinions about the parameter,
the amount of new information available when you stop collecting more
may well be far too meager to satisfy
the principle. Often, it will not pay
you to collect any new information
at all.
5. It is sometimes necessary to
make decisions about sizable research
commitments such as sample size or
experimental design while your knowledge is still vague. In this case, an
extreme instance of the former one,
the role of prior opinion is particularly
conspicuous. As Raiffa and Schlaifer
(1961) show, this is one of the most
fruitful applications of Bayesian ideas.
Whenever you cannot neglect the
details of your prior distribution, you
have, in effect, no choice but to
determine the relevant aspects of it as
best you can and use them. Almost
always, you will find your prior
opinions quite vague, and you may be
distressed that your scientific inference or decision has such a labile
basis. Perhaps this distress, more
than anything else, discouraged statisticians from using Bayesian ideas
all along (Pearson, 1962). To paraphrase de Finetti (1959, p. 19), people
noticing difficulties in applying Bayes'
theorem remarked "We see that it is
not secure to build on sand. Take
away the sand, we shall build on the
void." If it were meaningful utterly
to ignore prior opinion, it might
presumably sometimes be wise to do
so; but reflection shows that any
policy that pretends to ignore prior
opinion will be acceptable only insofar
as it is actually justified by prior
opinion. Some policies recommended
under the motif of neutrality, or using
only the facts, may flagrantly violate
even very confused prior opinions,
and so be unacceptable. The method
of stable estimation might casually be
described as a procedure for ignoring
prior opinion, since its approximate
results are acceptable for a wide range of prior opinions. Actually, far from ignoring prior opinion, stable estimation exploits certain well-defined features of prior opinion and is acceptable
only insofar as those features are
really present.
A Smattering of Bayesian Distribution Theory
The mathematical equipment required to turn statistical principles
into practical procedures, for Bayesian
as well as for traditional statistics, is
distribution theory, that is, the theory
of specific families of probability
distributions. Bayesian distribution
theory, concerned with the interrelation among the three main distributions of Bayes' theorem, is in some
respects more complicated than classical distribution theory. But the
familiar properties that distributions
have in traditional statistics, and in
the theory of probability in general,
remain unchanged. To a professional
statistician, the added complication
requires little more than possibly a
shift to a more complicated notation.
Chapters 7 through 13 of Raiffa and
Schlaifer's (1961) book are an extensive discussion of distribution theory
for Bayesian statistics.
As usual, a consumer need not
understand in detail the distribution
theory on which the methods are
based; the manipulative mathematics
are being done for him. Yet, like any
other theory, distribution theory must
be used with informed discretion.
The consumer who delegates his
thinking about the meaning of his
data to any "powerful new tool" of
course invites disaster. Cookbooks, though indispensable, cannot substitute for a thorough understanding of cooking; the inevitable appearance of cookbooks of Bayesian statistics must be contemplated with ambivalence.
Conjugate distributions. Suppose
you take your temperature at a
moment when your prior probability
density u(λ) is not diffuse with respect to v(x|λ), so your posterior opinion u(λ|x) is not adequately approximated by w(λ|x). Determination and application of u(λ|x) may then require laborious numerical integrations
of arbitrary functions. One way to
avoid such labor that is often useful
and available is to use conjugate distributions. When a family of prior
distributions is so related to all the
conditional distributions which can
arise in an experiment that the
posterior distribution is necessarily in
the same family as the prior distributions, the family of prior distributions
is said to be conjugate to the experiment. By no means all experiments
have nontrivial conjugate families,
but a few ubiquitous kinds do.
Examples: Beta priors are conjugate
to observations of a Bernoulli process,
normal priors are conjugate to observations of a normal process with
known variance. Several other conjugate pairs are discussed by Raiffa
and Schlaifer (1961).
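For concreteness, here is the beta-Bernoulli case in a few lines of Python. The sketch is ours, the success and failure counts are invented, and the Beta(1, 1) starting point is one illustrative choice of prior; the point is that conjugacy reduces Bayes' theorem to bookkeeping on the two beta parameters, with no integration.

```python
# Beta(a, b) prior for a Bernoulli success probability p: after s successes
# and f failures, the posterior is Beta(a + s, b + f).
def beta_update(a, b, successes, failures):
    return a + successes, b + failures

a, b = 1.0, 1.0            # a uniform Beta(1, 1) prior (illustrative choice)
a, b = beta_update(a, b, successes=7, failures=3)
posterior_mean = a / (a + b)
print(a, b, posterior_mean)   # Beta(8, 4); posterior mean 2/3
```

Repeated application, one observation at a time, gives the same posterior as one application to the pooled counts, as the likelihood principle requires.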
Even when there is a conjugate
family of prior distributions, your own
prior distribution could fail to be in
or even near that family. The distributions of such a family are,
however, often versatile enough to
accommodate the actual prior opinion,
especially when it is a bit hazy.
Furthermore, if stable estimation is
nearly but not quite justifiable, a
conjugate prior which approximates
your true prior even roughly may be
expected to combine with v(x|λ) to
produce a rather accurate posterior
distribution.
Should the fit of members of the
conjugate family to your true opinion
be importantly unsatisfactory, realism
may leave no alternative to something
as tedious as approximating the continuous distribution by a discrete one
with many steps, and applying Bayesian logic by brute force. Respect
for your real opinion as opposed to
some handy stereotype is essential.
That is why our discussion of stable
estimation, even in this expository
paper, emphasized criteria for deciding when the details of a prior
opinion really are negligible.
An example: Normal measurement
with variance known. To give a
minimal illustration of Bayesian distribution theory, and especially of
conjugate families, we discuss briefly,
and without the straightforward algebraic details, the normally distributed
measurement of known variance. The
Bayesian treatment of this problem
has much in common with its classical
counterpart. As is well known, it is a
good approximation to many other
problems in statistics. In particular,
it is a good approximation to the case
of 25 or more normally distributed
observations of unknown variance,
with the observed standard error of
the mean playing the role of the
known standard deviation and the
observed mean playing the role of the
single observation. In the following
discussion and throughout the remainder of the paper, we shall discuss the single observation x with known standard deviation σ, and shall leave it to you to make the appropriate translation into the set of n ≥ 25 observations with mean x̄ (= x) and standard error of the mean s/√n (= σ), whenever that translation aids your intuition or applies more directly to the problem you are thinking about.
Much as in classical statistics, it is
also possible to take uncertainty
about σ explicitly into account by
means of Student's t. See, for example, Chapter 11 of Raiffa and
Schlaifer (1961).
Three functions enter into the problem of known variance: u(λ), v(x|λ), and u(λ|x). The reciprocal of the variance appears so often in Bayesian calculations that it is convenient to denote 1/σ² by h and call h the precision of the measurement. We are therefore dealing with a normal measurement with an unknown mean μ but known precision h. Suppose your prior distribution is also normal. It has a mean μ₀ and a precision h₀, both known by introspection. There is no necessary relationship between h₀ and h, the precision of the measurement, but in typical worthwhile applications h is substantially greater than h₀. After an observation has been made, you will have a normally distributed posterior opinion, now with mean μ₁ and precision h₁.
  μ₁ = (μ₀h₀ + xh) / (h₀ + h)   and   h₁ = h₀ + h.
The posterior mean is an average of
the prior mean and the observation
weighted by the precisions. The
precision of the posterior mean is the
sum of the prior and data precisions.
The posterior distribution in this case
is the same as would result from the
principle of stable estimation if in
addition to the datum x, with its
precision h, there had been an additional measurement of value μ₀ and precision h₀.
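These two formulas are all the machinery the example needs, and a brief Python rendering (ours, with invented illustrative numbers) may make the weighting by precisions concrete.

```python
def normal_posterior(mu0, h0, x, h):
    """Combine a normal prior (mean mu0, precision h0) with one normal
    observation x of precision h; return the posterior mean and precision."""
    h1 = h0 + h
    mu1 = (mu0 * h0 + x * h) / h1
    return mu1, h1

# Illustrative numbers: a vague prior and a sharp measurement.
# h = 1/sigma**2, so sigma = .05 gives h = 400.
mu1, h1 = normal_posterior(mu0=98.6, h0=1.0, x=101.0, h=400.0)
print(mu1, h1)   # the posterior mean is pulled almost all the way to the datum
```

With h so much larger than h₀, the posterior mean lands within about .01 of the observation, an explicit miniature of stable estimation.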
If the prior precision h₀ is very small relative to h, the posterior mean will probably, and the precision will certainly, be nearly equal to the data mean and precision; that is an explicit illustration of the principle of stable estimation. Whether or not that principle applies, the posterior precision will always be at least the larger of the other two precisions; therefore, observation cannot but sharpen opinion here. This conclusion is somewhat special to the example; in general, an observation will occasionally increase, rather than dispel, doubt.
In applying these formulas, as an approximation, to inference based on a large number n of observations with average x̄ and sample variance s², x is x̄ and h is n/s². To illustrate both the
extent to which the prior distribution
can be irrelevant and the rapid narrowing of the posterior distribution as
the result of a few normal observations, consider Figure 2. The top section of the figure shows two prior
distributions, one with mean —9 and
standard deviation 6 and the other
with mean 3 and standard deviation 2.
The other four sections show posterior
distributions obtained by applying
Bayes' theorem to these two priors
after samples of size n are taken from
a distribution with mean 0 and standard deviation 2. The samples are
artificially selected to have exactly the
mean 0. After 9, and still more after 16, observations, these markedly different prior distributions have led to almost indistinguishable posterior distributions.

Fig. 2. Posterior distributions obtained from two normal priors after n normally distributed observations.
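The convergence that Figure 2 displays can be reproduced numerically. The following sketch is ours; it applies the posterior-mean and posterior-precision formulas of the preceding subsection to the two priors of the figure, with samples artificially taken to have mean 0, as in the text.

```python
import math

def posterior(mu0, sd0, xbar, sd, n):
    # Posterior after n observations of mean xbar, each with standard deviation sd,
    # starting from a normal prior with mean mu0 and standard deviation sd0.
    h0, h = 1 / sd0 ** 2, n / sd ** 2
    h1 = h0 + h
    return (mu0 * h0 + xbar * h) / h1, 1 / math.sqrt(h1)

# Two markedly different priors, as in Figure 2.
for n in (1, 4, 9, 16):
    m1, s1 = posterior(-9.0, 6.0, xbar=0.0, sd=2.0, n=n)
    m2, s2 = posterior(3.0, 2.0, xbar=0.0, sd=2.0, n=n)
    print(n, round(m1, 3), round(m2, 3), round(s1, 3), round(s2, 3))
```

By n = 16 the two posterior means differ by a fraction of a posterior standard deviation, and the posterior standard deviations are nearly identical; the priors have been all but washed out.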
Of course the prior distribution is
never irrelevant if the true parameter
happens to fall in a region to which
the prior distribution assigns virtually
zero probability. A prior distribution
which has a region of zero probability
is therefore undesirable unless you
really consider it impossible that the
true parameter might fall in that
region. Moral: Keep the mind open, or at least ajar.
Figure 2 also shows the typical narrowing of the posterior distribution with successive observations. After 4 observations, the standard deviation of your posterior distribution is less than half of what it was before any observations were taken.
Sometimes too little is known about the distribution from which a sequence of observations is drawn to express it confidently in terms of any moderate number of parameters. These are the situations that have evoked what is called the theory of nonparametric statistics. Ironically, a main concern of nonparametric statistics is to estimate the parameters of unknown distributions. The classical literature on nonparametric statistics is vast; see I. R. Savage (1957, 1962) and Walsh (1962). Bayesian counterparts of some of it are to be expected but are not yet achieved. To hint at some nonparametric Bayesian ideas, it seems reasonable to estimate the median of a largely unknown distribution by the median of the sample, and the mean of the distribution by the mean of the sample; given the sample, it will ordinarily be almost an even-money bet that the population median exceeds the sample median, and so on. Technically, the "and so on" points toward Bayesian justification for the classical theory of joint nonparametric tolerance intervals (Edwards, 1962a, 1963).

When practical interest is focused on a few of several unknown parameters, the general Bayesian method is to find first the posterior joint distribution of all the parameters and from it to compute the corresponding marginal distribution of the parameters of special interest; when, for instance, n observations are drawn from a normal distribution of unknown mean μ and standard deviation

Point and Interval Estimation

Measurements are often used to make a point estimate, or best guess, about some quantity. In the fever-thermometer example, you would want, and would spontaneously make, an estimate of the true temperature
  Ω(A|D) = L(A;D)Ω(A).        [9]

In words, the posterior odds in favor of A given the datum D are the prior odds multiplied by L(A;D), the ratio of the conditional probabilities of the datum given A and given Ā.

…negligible for almost any purpose; if the die can be rolled many times, the evidence is ultimately sure to become definitive. As is implicit in the concept of the not necessarily fair die, if D₁, D₂, D₃, … are the outcomes of successive rolls, then the same function L(A;D) applies to each. Therefore Equation 9 can be applied repeatedly, thus:
  Ω(A|D₁) = L(A;D₁)Ω(A),
  Ω(A|D₁, D₂) = L(A;D₂)Ω(A|D₁) = L(A;D₂)L(A;D₁)Ω(A),
  . . .
  Ω(A|D₁, …, Dₙ) = L(A;Dₙ)Ω(A|D₁, …, Dₙ₋₁)
    = L(A;Dₙ)L(A;Dₙ₋₁) ⋯ L(A;D₁)Ω(A)
    = ∏ᵢ L(A;Dᵢ) Ω(A).
This multiplicative composition of likelihood ratios exemplifies an important general principle about observations which are independent given the hypothesis. For the specific example of the die, if x 6's and y non-6's occur (where of course x + y = n), then

  Ω(A|D₁, …, Dₙ) = (5/6)^x (25/24)^y Ω(A).
For large n, if A obtains, it is highly probable at the outset that x/n will fall close to 1/6. Similarly, if Ā obtains, x/n will probably fall close to 1/5. Thus, if A obtains, the overall likelihood ratio (5/6)^x (25/24)^y will probably be very roughly

  (5/6)^(n/6) (25/24)^(5n/6) = [(5/6)^(1/6) (25/24)^(5/6)]ⁿ = (1.00364)ⁿ = 10^(.00158n).
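The multiplication of likelihood ratios is easily carried out by machine. The following Python sketch is ours; it assumes, as in the text, that a 6 has probability 1/6 under A and 1/5 under the alternative, and it reproduces the per-roll factor and the order of magnitude reached at n = 6,300.

```python
# Sequential odds updating for the die example: each 6 multiplies the odds
# for A by (1/6)/(1/5) = 5/6; each non-6 multiplies them by (5/6)/(4/5) = 25/24.
def posterior_odds(prior_odds, sixes, non_sixes):
    return prior_odds * (5 / 6) ** sixes * (25 / 24) ** non_sixes

per_roll = (5 / 6) ** (1 / 6) * (25 / 24) ** (5 / 6)
print(per_roll)                 # about 1.00364 per roll when A is true

n = 6300
odds = posterior_odds(1.0, sixes=n // 6, non_sixes=n - n // 6)
print(odds)                     # on the order of 10 billion
```

Starting from even prior odds, 6,300 typical rolls under A multiply the odds by roughly 10¹⁰, which is why even an initially skeptical observer is eventually carried along.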
By the time n is 1,200, everyone's odds in favor of A will probably be augmented nearly a hundredfold if A is in fact true. One who started very skeptical of A, say with Ω(A) of about a thousandth, will still be rather skeptical. But he would have to start from a very skeptical position indeed not to become strongly convinced when n is 6,300 and the overall likelihood ratio in favor of A is about 10 billion.

The arithmetic for Ā is:

  (5/6)^(n/5) (25/24)^(4n/5) = [(5/6)^(1/5) (25/24)^(4/5)]ⁿ = (0.9962)ⁿ = 10^(−.00165n).

So the rate at which evidence accumulates against A, and for Ā, when Ā is true is in this case a trifle more than the rate at which it accumulates for A when A is true.

Simple dichotomy is instructive for statistical theory generally but must be taken with a grain of salt, for simple dichotomies (that is, applications of Equation 9 in which everyone concerned will agree and be clear about the values of L(A;D)) rarely, if ever, occur in scientific practice. Public models almost always involve parameters rather than finite partitions.

Some generalizations are apparent in what has already been said about simple dichotomy. Two more will be sketchily illustrated: decision-theoretic statistics, and the relation of the dominant classical decision-theoretic position to the Bayesian position. (More details will be found in Savage, 1954, and Savage et al., 1962, indexed under simple dichotomy.)

At a given moment, let us suppose, you have to guess whether it is A or Ā that obtains, and you will receive $I if you guess correctly that A obtains, $J if you guess correctly that Ā obtains, and nothing otherwise. (No real generality is lost in not assigning four arbitrarily chosen payoffs to the four possible combinations of guess and fact.) The expected cash value to you of guessing A is $IP(A), and that of guessing Ā is $JP(Ā). You will therefore prefer to guess A if and only if $IP(A) exceeds $JP(Ā); that is, just if Ω(A) exceeds J/I. (More rigorous treatment would replace dollars with utiles.)

Similarly, if you need not make your guess until after you have examined a datum D, you will prefer to guess A if and only if Ω(A|D) exceeds J/I. Putting this together with Equation 9, you will prefer to guess A if and only if

  L(A;D) > (J/I)/Ω(A) = Λ,

where your critical likelihood ratio Λ is defined by the context.

This conclusion does not at all require that the dichotomy between A and Ā be simple, or public, but for comparison with the classical approach to the same problem continue to assume that it is. Classical statisticians were the first to conclude that there must be some Λ such that you will guess A if L(A;D) > Λ and guess Ā if L(A;D) < Λ. (For this sketch, it is excusable to neglect the possibility that Λ = L(A;D).) By and large, classical statisticians say that the choice of Λ is an entirely subjective one which no one but you can make (e.g., Lehmann, 1959, p. 62). Bayesians agree; for, according to Equation 9, Λ is inversely proportional to your current odds for A, an aspect of your personal opinion. The classical statisticians, however, have overlooked a great simplification, namely that the
220
W. Edwards, H. Lindman,
critical A will not depend on the size or
structure of the experiment and will be
proportional to J/I. Once the Bayesian
position is accepted, Equation 9 is of course
an argument for this simplification, but it can
alsobe arrived at along a classical path, which
in effect derives much, if not all, of Bayesian
statistics as a natural completion of the
classical decisiontheoretic position. This
relation between the two views, which in no
way depends on the artificiality of simple
dichotomy here used to illustrate it, cannot
be overemphasized. (For a general demonstration, see Raiffa &
1961, pp.
2427.)
The simplification is brought out by the set
your
and
L.
J.
Savage
Bayesian
value of the likelihood ratio of the datum
conveys the entire import of the datum. (A
latersection is about the likelihood principle.)
Wolfowitz (1962) dissents.
Approaches to null hypothesis testing.
Next we examine situations in which
a very sharp, or null, hypothesis is
compared with a rather flat or diffuse
alternative hypothesis. This short
section indicates general strategies of
such comparisons. None of the computations or conclusions depend on
assumptions about the special initial
credibility of the null hypothesis, but
of indifference curves among the various
probabilities of the two kinds of errors (Leha Bayesian will find such computamann, 1958). Of course, any reduction of the tions uninteresting unless a nonprobability of one kind of error is desirable
negligible amount of his prior probaif it does not increase the probability of the
bility
is concentrated very near the
implications
kind
of
and
the
of
error,
other
classical statistics leave the description of the null hypothesis value.
indifference curves at that. But the conFor the continuous cases to be
siderations discussed easily imply that the considered in following sections, the
indifference curves should be parallel straight
lines with slope — [V//n(A)]. As Savage hypothesis A is that some parameter X
is in a set that might as well also be
(1962b) puts it:
called
A. For onedimensional cases
the subjectivist's position is moreobjective
in
which
the hypothesis A is that X is
than the objectivist's, for the subjectivist
finds the range of coherent or reasonable almost surely negligibly far from some
preference patterns much narrower than the specified value Xo, the odds in favor
objectivist thought it to be. How confusing
of A given the datum D, as in
and dangerous big words are [p. 67]!
Equation 9, arc
Classical statistics tends to divert attention
from A to the two conditional probabilities of
making errors, by guessing A when A obtains
and vice versa. The counterpart of the
probabilities of these two kinds of errors in
more general problems is called the operating
characteristic, and classical statisticians sugthat you should choose among
gest, in
the available operating characteristics as a
method of choosing A, or more generally,
your prior distribution. This is not mathematically wrong, but it distracts attention
from your value judgments and opinions
about the unknown facts upon which your
preferred A should directly depend without
regard to how the probabilities of errors vary
with A in a specific experiment.
There are important advantages to recognizing that your A does not depend on the
structure of the experiment. It will help you,
for example, to choose between possible
experimental plans. It leads immediately to
the very important likelihoodprinciple, which
in this application says that the numerical
wwwm
v(D\\o)
il(A)
v(D\\)u(\\A)d\
= L(A;D)Sl(A).
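The arithmetic of the simple-dichotomy example earlier in this section can be checked numerically. The sketch below (ours, not the paper's) computes the expected per-trial contribution to log₁₀ of the likelihood ratio under each hypothesis:

```python
import math

# Simple dichotomy: P(success) = 1/6 under A, 1/5 under the alternative.
# After x successes in n trials, L(A; D) = (5/6)**x * (25/24)**(n - x).

def expected_log10_rate(p_true):
    # Expected contribution to log10 L(A; D) per trial when the true
    # success probability is p_true.
    return p_true * math.log10(5 / 6) + (1 - p_true) * math.log10(25 / 24)

rate_A = expected_log10_rate(1 / 6)      # A true:  about +0.00158 per trial
rate_not_A = expected_log10_rate(1 / 5)  # A false: about -0.00165 per trial

print(rate_A, rate_not_A)
print(10 ** (1200 * rate_A))  # roughly a hundredfold after 1,200 trials
print(10 ** (6300 * rate_A))  # roughly 10 billion after 6,300 trials
```

Note that |rate_not_A| slightly exceeds rate_A, which is the "trifle more" rapid accumulation of evidence against A mentioned above.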
Natural generalizations apply to multidimensional cases. The numerator v(D|λ₀) will in usual applications be public. But the denominator, the probability of D under the alternative hypothesis, depends on the usually far from public prior density under the alternative hypothesis.
Nonetheless, there are some relatively
public methods of appraising the
denominator, and much of the following discussion of tests is, in effect,
about such methods. Their spirit is
opportunistic, bringing to bear whatever approximations and bounds offer
themselves in particular cases. The
main ideas of these methods are
sketched in the following three paragraphs, which will later be much
amplified by examples.
First, the principle of stable estimation may apply to the datum and to the density u(λ|Ā) of λ given the alternative hypothesis Ā. In this case, the likelihood ratio reflects no characteristics of u(λ|Ā) other than its value in the neighborhood favored by the datum, a number that can be made relatively accessible to introspection.
Second, it is relatively easy, in any
given case, to determine how small the
likelihood ratio can possibly be made
by utterly unrestricted and artificial
choice of the function u(λ|Ā). If
this rigorous public lower bound on
the likelihood ratio is not very small,
then there exists no system of prior
probabilities under which the datum
greatly detracts from the credibility
of the null hypothesis. Remarkably,
this smallest possible bound is by no
means always very small in those
cases when the datum would lead to a
high classical significance level such
as .05 or .01. Less extreme (and
therefore larger) lower bounds that do
assume some restriction on u(λ|Ā) are sometimes appropriate; analogous restrictions also lead to upper bounds.
restrictions also lead to upper bounds.
When these are small, the datum does
rather publicly greatly lower the
credibility of the null hypothesis.
Analysis to support an interocular
traumatic impression might often be
of this sort. Inequalities stated more
generally by Hildreth (1963) are
behind most of these lower and upper
bounds.
Finally, when v(D|λ) admits of a conjugate family of distributions, it may be useful, as an approximation, to suppose u(λ|Ā) restricted to the
conjugate family. Such a restriction
may help fix reasonably public bounds
to the likelihood ratio.
We shall see that classical procedures are often ready severely to
reject the null hypothesis on the basis
of data that do not greatly detract
from its credibility, which dramatically demonstrates the practical difference between Bayesian and classical
statistics. This finding is not altogether new. In particular, Lindley
(1957) has proved that for any
classical significance level for rejecting
the null hypothesis (no matter how
small) and for any likelihood ratio in
favor of the null hypothesis (no
matter how large), there exists a
datum significant at that level and
with that likelihood ratio.
To prepare intuition for later technical discussion we now show informally, as much as possible from a
classical point of view, how evidence
that leads to classical rejection of a
null hypothesis at the .05 level can
favor that null hypothesis. The loose
and intuitive argument can easily be
made precise (and is, later in the
paper). Consider a two-tailed t test with many degrees of freedom. If a true null hypothesis is being tested, t will exceed 1.96 with probability 2.5% and will exceed 2.58 with probability .5%. (Of course, 1.96 and 2.58 are the 5% and 1% two-tailed significance levels; the other 2.5% and .5% refer to the possibility that t may be smaller than −1.96 or −2.58.) So on 2% of all occasions when true null hypotheses are being tested, t will lie between 1.96 and 2.58. How
often will t lie in that interval when
the null hypothesis is false? That
depends on what alternatives to the
null hypothesis are to be considered.
Frequently, given that the null hypothesis is false, all values of t between, say, —20 and +20 are about
equally likely for you. Thus, when
the null hypothesis is false, t may well
fall in the range from 1.96 to 2.58 with
at most the probability (2.58 − 1.96)/[+20 − (−20)] = 1.55%. In such a case, since 1.55 is less than 2, the
occurrence of t in that interval speaks
mildly for, not vigorously against, the
truth of the null hypothesis.
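The informal argument above reduces to comparing two probabilities for the same interval; a minimal numerical sketch (ours):

```python
# When the null hypothesis is true (two-tailed t test, many degrees of
# freedom), t lands between 1.96 and 2.58 about 2% of the time; under a
# diffuse alternative spreading t roughly uniformly over (-20, +20), the
# same interval gets at most (2.58 - 1.96)/40, about 1.55%.

p_null = 0.025 - 0.005                 # P(1.96 < t < 2.58 | H0 true)
p_alt = (2.58 - 1.96) / (20 - (-20))   # uniform alternative on (-20, 20)

print(p_null, p_alt)
print(p_null / p_alt)  # ratio exceeds 1: the datum favors the null
```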
This argument, like almost all the
following discussion of null hypothesis
testing, hinges on assumptions about
the prior distribution under the alternative hypothesis. The classical statistician usually neglects that distribution—in fact, denies its existence.
He considers how unlikely a t as far
from 0 as 1.96 is if the null hypothesis
is true, but he does not consider that
a t as close to 0 as 1.96 may be even
less likely if the null hypothesis is
false.
A Bernoullian example. To begin
a more detailed examination of Bayesian methods for evaluating null hypotheses, consider this example:
We are studying a motor skills
task. Starting from a neutral rest
position, a subject attempts to touch
a stylus as near as possible to a long,
straight line. We are interested in
whether his responses favor the right
or the left of the line. Perhaps from
casual experience with such tasks, we
give special credence to the possibility
that his long-run frequency p of "rights" is practically p₀ = 1/2. The problem is here posed in the more familiar frequentistic terminology; its Bayesian translation, due to de Finetti, is sketched in Section 3.7 of Savage (1954). The following discussion applies to any fraction p₀ as

well as to the specific value 1/2.
Under the null hypothesis, your
density of the parameter p is sharply
concentrated near po, while your
density of p under the alternative
hypothesis is not concentrated and
may be rather diffuse over much of
the interval from 0 to 1.
If n trials are undertaken, the probability of obtaining r rights given that the true frequency is p is of course C(n,r) p^r (1 − p)^(n−r). The probability of obtaining r under the null hypothesis that p is literally p₀ is C(n,r) p₀^r (1 − p₀)^(n−r). Under the alternative hypothesis, it is

∫₀¹ C(n,r) p^r (1 − p)^(n−r) u(p|H₁) dp,

that is, the probability of r given p averaged over p, with each value in the average weighted by its prior density under the alternative hypothesis. The likelihood ratio is therefore

L(p₀; r, n) = C(n,r) p₀^r (1 − p₀)^(n−r) / ∫₀¹ C(n,r) p^r (1 − p)^(n−r) u(p|H₁) dp.   [10]

One way to reduce the denominator of Equation 10 to more tractable form is to apply the principle of stable estimation, or more accurately certain variants of it, to the denominator. To begin with, if u(p|H₁) were a constant u′, then the denominator would be

∫₀¹ C(n,r) p^r (1 − p)^(n−r) u′ dp = u′ C(n,r) ∫₀¹ p^r (1 − p)^(n−r) dp = u′/(n + 1).   [11]

The first equality is evident; the second is a known formula, enchantingly demonstrated by Bayes (1763). Of course u cannot really be a constant unless it is 1, but if r and n − r are both fairly large, p^r (1 − p)^(n−r) is a sharply peaked function with its maximum at r/n. If u(p|H₁) is gentle near r/n and not too wild elsewhere, Equation 11 may be a satisfactory approximation, with u′ = u(r/n|H₁). This condition is often met, and it can be considerably weakened without changing the conclusion, as will be explained next.

In summary, it is often suitable to approximate the likelihood ratio thus:

L(p₀; r, n) ≅ [(n + 1)/u′] C(n,r) p₀^r (1 − p₀)^(n−r) = (n + 1) P(r|p₀, n)/u′.   [12]
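Equations 10 through 12 can be verified numerically. In this sketch (ours; the numbers are illustrative, not from the paper) the alternative prior is uniform, u(p|H₁) = 1, so the denominator of Equation 10 is exactly 1/(n + 1) and the approximation of Equation 12 with u′ = 1 is exact:

```python
import math

def binom_pmf(r, n, p):
    # C(n,r) p^r (1-p)^(n-r), the binomial probability of r in n trials.
    return math.comb(n, r) * p**r * (1 - p) ** (n - r)

def likelihood_ratio(p0, r, n, grid=20_000):
    # Equation 10 with u(p|H1) = 1: midpoint-rule integral of
    # C(n,r) p^r (1-p)^(n-r) over (0, 1) in the denominator.
    h = 1.0 / grid
    denom = h * sum(binom_pmf(r, n, (i + 0.5) * h) for i in range(grid))
    return binom_pmf(r, n, p0) / denom

exact = likelihood_ratio(0.5, 60, 100)
approx = (100 + 1) * binom_pmf(60, 100, 0.5)  # Equation 12 with u' = 1

print(exact, approx)  # both near 1.1
```

Notice that 60 rights in 100 trials is classically significant against p₀ = 1/2 at about the .05 level, yet under this diffuse alternative the likelihood ratio mildly favors the null hypothesis.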
diffuse (that is, uniform from −∞ to +∞); otherwise, no measurement of the kind contemplated could result in any opinion other than certainty. Here φ is the ordinate of the standard normal density at the point (x̄ − λ)/σ. Hereafter, we will use the familiar statistical abbreviation t = (x̄ − λ₀)/σ.

If λ is measured in, say, degrees centigrade or cycles per second, and u(λ) is probability per degree centigrade or per cycle per second, then the product σu(λ) (in the denominator of Equation 15) is dimensionless. Visualizing σu(x̄) as a rectangle of base σ, centered at x̄, and of height u(x̄), helps in appraising that denominator. In the example, φ(2.58) = .0143, so the likelihood ratio is about .286. Thus for the Bayesian, as for the classical statistician, the evidence here tells against the null hypothesis, but the Bayesian is not nearly so strongly persuaded as the classical statistician appears to be. The datum t = 1.96 is just significant at the .05 level of a two-tailed test. But the likelihood ratio for that datum is 1.17. This datum, which leads to a .05 classical rejection, leaves the Bayesian, with the prior opinion postulated, a shade more confident of the null hypothesis than he was to start with. The over-readiness of classical procedures to reject null hypotheses, first illustrated in the Bernoullian example, is seen again here.

This is of course the very smallest likelihood ratio that can be associated with the datum, since the alternative hypothesis now has all its density on one value. If the prior distribution under the alternative hypothesis is required to be not only symmetric around the null value but also unimodal, which seems very safe for many problems, then the results are too similar to those obtained later for the smallest possible likelihood ratio obtainable with a symmetrical normal prior density to merit separate presentation here.

If you know that your prior density u(λ|H₁) never exceeds some upper bound u*, you can improve, that is, increase, the crude lower bound L_min. The prior distribution most favorable to the alternative hypothesis, given that it nowhere exceeds u*, is a rectangular distribution of height u* with x̄ as its midpoint. Therefore

L(λ₀; x̄) ≥ [φ(t)/σu*][2Φ(1/(2σu*)) − 1]⁻¹,   [16]

where Φ is the standard normal cumulative function. Not only is this lower bound better than L_min no matter how large u* is, it also improves with decreasing σ, as is realistic. The improvement over L_min is negligible if σu* ≥ 0.7. Suppose, analogously, that u(λ|H₁) ≥ w* > 0 for all λ in some interval centered at x̄; restrictions of that sort lead to upper bounds on the likelihood ratio.

Either directly or by recognizing that the square bracket in Inequality 16 is less than 1, it is easy to derive a cruder but simpler bound, which is sometimes better than L_min:

L(λ₀; x̄) ≥ φ(t)/σu*.   [17]
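The two lower bounds can be sketched directly from their formulas (our helper names; the value σu* = 1/20 matches the rectangle used in the example above):

```python
import math

def phi(x):
    # Standard normal density.
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def Phi(x):
    # Standard normal cumulative function.
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def bound_16(t, su):
    # Inequality 16; su = sigma * u_star.
    return phi(t) / (su * (2 * Phi(1 / (2 * su)) - 1))

def bound_17(t, su):
    # The cruder Inequality 17.
    return phi(t) / su

print(bound_16(2.58, 0.05), bound_17(2.58, 0.05))  # both about .286
```

Because 2Φ(1/(2σu*)) − 1 < 1, Inequality 16 always gives at least as large a lower bound as Inequality 17.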
In this case, with a normal prior density under the alternative hypothesis, centered at λ₀ and with standard deviation τ, the likelihood ratio is

L(a, t) = φ(t)/aφ(at) = (1/a) exp[−½(1 − a²)t²],   [18]

where

a = σ/√(σ² + τ²) = 1/√(1 + (τ/σ)²).   [19]
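Equation 18 is easy to compute, and a few entries of Table 2 serve as spot checks (a sketch in our notation):

```python
import math

# Equation 18: L(a, t) = (1/a) * exp(-0.5 * (1 - a**2) * t**2),
# with a = sigma / sqrt(sigma**2 + tau**2) as in Equation 19.

def L(a, t):
    return math.exp(-0.5 * (1 - a * a) * t * t) / a

print(round(L(0.05, 1.960), 2))  # about 2.94 (Table 2, a = .05, .05 level)
print(round(L(0.2, 1.960), 3))   # about .791
print(round(L(0.01, 2.576), 2))  # about 3.63
```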
Table 2 shows numerical values of L(a, t) for some instructive values of a and for values of t corresponding to familiar two-tailed classical significance levels. The values of a between .01 and .1 portray reasonably precise experiments; the others included in Table 2 are instructive as extreme possibilities. Table 2 again illustrates how classically significant values of t can, in realistic cases, be based on data that actually favor the null hypothesis.

For another comparison with Equation 18, the support for the null hypothesis can be very strong, since a might well be about .01. In the example, you perhaps hope to confirm the null hypothesis to everyone's satisfaction if it is in fact true.

TABLE 2

Values of L(a, t) for Selected Values of a and Values of t Corresponding to Familiar Two-Tailed Significance Levels

t       Sig.     a:  .0001   .001    .01     .025    .05     .075    .1      .15     .2      .5      .9      .99
        level  σ/τ:  .0001   .0010   .0100   .0250   .0501   .0752   .1005   .1517   .2041   .5774   2.0647  7.0179

1.645   .10          2,585   259     25.9    10.4    5.19    3.47    2.62    1.78    1.36    .725    .859    .983
1.960   .05          1,465   147     14.7    5.87    2.94    1.97    1.49    1.02    .791    .474    .771    .972
2.576   .01          362     36.2    3.63    1.45    .731    .492    .375    .260    .207    .166    .592    .946
3.291   .001         44.6    4.46    .446    .179    .0903   .0612   .0470   .0336   .0277   .0345   .397    .907
3.891   .0001        5.16    .516    .0516   .0207   .0105   .00718  .00556  .00408  .00349  .00685  .264    .869
If a is small, say less than .1, then 1 − a² is negligibly different from 1, and so t₀ is close to √(2 ln(1/a)). The effect of using this approximation can never be very bad, for the likelihood ratio actually associated with the approximate value of t₀ cannot differ greatly from 1. Table 3 shows actual values of t₀ and their corresponding two-tailed significance levels. At values of t slightly smaller than the break-even values in Table 3, classical statistics more or less vigorously rejects the null hypothesis, though the Bayesian described by a becomes more confident of it than he was to start with.

If t = 0, that is, if the observation happens to point exactly to the null hypothesis, the likelihood ratio of Equation 18 is simply 1/a.
TABLE 3

Break-Even Values t₀ and Their Two-Tailed Significance Levels for Normal-Alternative Prior Distributions

a             .1       .05      .01      .001
t₀            2.16     2.45     3.03     3.72
Sig. level    .031     .014     .0024    .0002
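The break-even value t₀ is defined by L(a, t₀) = 1 in Equation 18, so the entries of Table 3 follow by direct computation (a sketch; the helper names are ours):

```python
import math

def t0(a):
    # Solve L(a, t0) = 1 in Equation 18: t0**2 = 2 ln(1/a) / (1 - a**2).
    return math.sqrt(2 * math.log(1 / a) / (1 - a * a))

def two_tailed_level(t):
    # 2 * (1 - Phi(t)), via the complementary error function
    # (large-sample normal approximation to the t distribution).
    return math.erfc(t / math.sqrt(2))

for a in (0.1, 0.05, 0.01, 0.001):
    print(a, round(t0(a), 2), round(two_tailed_level(t0(a)), 4))
```

For small a the crude approximation t₀ ≈ √(2 ln(1/a)) is nearly indistinguishable from these values.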
It is not reasonable to approximate λ₀² by substituting 1 for 1 − a². Since the coefficient of n in λ₀² is larger than 1 for every fraction a, and since the value of χ² that is just significant at, say, the .001 level only slightly exceeds n for sufficiently large n, there is some first integer n.001(a) at which the break-even value λ₀² is just significant at the .001 level. Some representative values are shown in Table 5. From the point of view of this model of the testing situation, which is of course not unobjectionable, the classical procedure is startlingly prone to reject the null hypothesis, contrary to what would often be very reasonable opinion.
Paralleling the situation for n = 1, it is a² = n/χ² that is most pessimistic toward the null hypothesis for a specified value of χ². The likelihood ratio for this artificial value of a is

L_norm min = (χ²/n)^(n/2) e^((n − χ²)/2).

Table 6 shows the values of L_norm min, and the corresponding values of a, at the χ² just significant at the .01 and .001 levels for several values of n. As in the one-dimensional case, these likelihood ratios are small, but not as small as classical significance levels might suggest. In all these cases, the most pessimistic a is unrealistically large.

This cursory glance at multidimensional normally distributed observations has the same general conclusions as our more detailed study of the unidimensional normal case.
Although the statistical theory of multidimensional observations (classical or Bayesian) is distressingly sketchy and incomplete, drastic surprises about the relation between classical and Bayesian multidimensional techniques have not turned up and now seem unlikely.

TABLE 6

Values of L_norm min, and the Corresponding a and σ/τ, at the χ² Just Significant at the .01 and .001 Levels for Selected Values of n

                     .01 level                       .001 level
n             a       σ/τ     L norm min      a       σ/τ     L norm min
1            .388     .421     .1539         .304     .319     .0242
3            .514     .600     .1134         .429     .475     .0166
10           .656     .870     .0912         .581     .715     .0127
30           .768    1.198     .0806         .709    1.005     .0108
100          .858    1.671     .0742         .818    1.422     .0097
300          .913    2.238     .0712         .887    1.919     .0092
1,000        .950    3.059     .0696         .935    2.636     .0088
3,000        .971    4.047     .0680         .962    3.499     .0087
10,000       .984    5.488     .0675         .979    4.798     .0086
∞           1.000      ∞       .0668        1.000      ∞       .0084
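The formula for L_norm min reproduces the entries of Table 6 directly, and its large-n limit is e^(−z²/2) at the corresponding normal deviate z (a numerical sketch, ours):

```python
import math

def l_norm_min(chi2, n):
    # L_norm_min = (chi2/n)**(n/2) * exp((n - chi2)/2),
    # attained at the most pessimistic a**2 = n / chi2.
    return (chi2 / n) ** (n / 2) * math.exp((n - chi2) / 2)

print(round(l_norm_min(6.635, 1), 4))    # .01 level, n = 1: about .1539
print(round(l_norm_min(23.209, 10), 4))  # .01 level, n = 10: about .0912
# Large-n limit at the .01 level (z = 2.3263): about .0668.
print(round(math.exp(-2.3263 ** 2 / 2), 4))
```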
Some morals about testing sharp null
hypotheses. At first glance, our general conclusion that classical procedures are so ready to discredit null
hypotheses that they may well reject
one on the basis of evidence which is
in its favor, even strikingly so, may
suggest the presence of a mathematical mistake somewhere. Not so;
the contradiction is practical, not
mathematical. A classical rejection
of a true null hypothesis at the .05
level will occur only once in 20 times.
The overwhelming majority of these
false classical rejections will be based
on test statistics close to the borderline value; it will often be easy to
demonstrate that these borderline
test statistics, unlikely under either
hypothesis, are nevertheless more unlikely under the alternative than
under the null hypothesis, and so
speak for the null hypothesis rather
than against it.
Bayesian procedures can strengthen
a null hypothesis, not only weaken it,
whereas classical theory is curiously
asymmetric. If the null hypothesis
is classically rejected, the alternative
hypothesis is willingly embraced, but
if the null hypothesis is not rejected,
it remains in a kind of limbo of
suspended disbelief. This asymmetry
has led to considerable argument
about the appropriateness of testing a theory by using its predictions as a null hypothesis (Grant, 1962; Guilford, 1942, see p. 186 in the 1956 edition; Rozeboom, 1960; Sterling, 1960).
For Bayesians, the problem vanishes,
though they must remember that the
null hypothesis is really a hazily de
fined small region rather than a point.
The procedures which have been
presented simply compute the likelihood ratio of the hypothesis that
some parameter is very nearly a
specified single value with respect to
the hypothesis that it is not. They
do not depend on the assumption of
special initial credibility of the null
hypothesis. And the general conclusion that classical procedures are
unduly ready to reject null hypotheses
is thus true whether or not the null
hypothesis is especially plausible a
priori. At least for Bayesian statisticians, however, no procedure for
testing a sharp null hypothesis is likely
to be appropriate unless the null hypothesis deserves special initial credence. It is uninteresting to learn
that the odds in favor of the null hypothesis have increased or decreased
a hundredfold if initially they were
negligibly different from zero.
How often are Bayesian and classical procedures likely to lead to
different conclusions in practice?
First, Bayesians are unlikely to consider a sharp null hypothesis nearly so
often as do the consumers of classical
statistics. Such procedures make
sense to a Bayesian only when his
prior distribution has a sharp spike at
some specific value; such prior distributions do occur, but not so often
as do classical null hypothesis tests.
When Bayesians and classicists
agree that null hypothesis testing is
appropriate, the results of their procedures will usually agree also. If the
null hypothesis is false, the interocular
traumatic test will often suffice to
reject it; calculation will serve only
to verify clear intuition. If the null
hypothesis is true, the interocular
traumatic test is unlikely to be of
much use in onedimensional cases,
but may be helpful in multidimensional ones. In at least 95% of cases
236
W. Edwards, H. Lindman,
when the null hypothesis is true,
Bayesian procedures and the classical
.05 level test agree. Only in borderline cases will the two lead to conflicting conclusions. The widespread
custom of reporting the highest classical significance level from among the
conventional ones actually attained
would permit an estimate of the
frequency of borderline cases in published work; any rejection at the .05
or .01 level is likely to be borderline.
Such an estimate of the number of
borderline cases may be low, since it
is possible that many results not
significant at even the .05 level remain
unpublished.
The main practical consequences for
null hypothesis testing of widespread
adoption of Bayesian statistics will
presumably be a substantial reduction
in the resort to such tests and a
decrease in the probability of rejecting
true null hypotheses, without substantial increase in the probability of
accepting false ones.
If classical significance tests have
rather frequently rejected true null
hypotheses without real evidence,
why have they survived so long
and so dominated certain empirical
sciences? Four remarks seem to shed
some light on this important and
difficult question.
1. In principle, many of the rejections at the .05 level are based on
values of the test statistic far beyond
the borderline, and so correspond to almost unequivocal evidence. In practice, this argument loses much of
its force. It has become customary to
reject a null hypothesis at the highest
significance level among the magic
values, .05, .01, and .001, which the
test statistic permits, rather than to
choose a significance level in advance
and reject all hypotheses whose test
statistics fall beyond the criterion
value specified by the chosen signifi
and
L.
J.
Bayesian
Savage
cance level. So a .05 level rejection
today usually means that the test
statistic was significant at the .05
level but not at the .01 level. Still,
a test statistic which falls just short
of the .01 level may correspond to
much stronger evidence against a null
hypothesis than one barely significant
at the .05 level. The point applies
more forcibly to the region between
.01 and .001, and for the region
beyond, the argument reverts to its
original form.
2. Important rejections at the .05
or .01 levels based on test statistics
which would not have been significant
at higher levels are not common.
Psychologists tend to run relatively
large experiments, and to get very
highly significant main effects. The
place where .05 level rejections are
most common is in testing interactions in analyses of variance, and
few experimenters take those tests
very seriously, unless several lines of
evidence point to the same conclusions.
3. Attempts to replicate a result
are rather rare, so few null hypothesis
rejections are subjected to an empirical check. When such a check is
performed and fails, explanation of
the anomaly almost always centers on
experimental design, minor variations
in technique, and so forth, rather than
on the meaning of the statistical
procedures used in the original study.
4. Classical procedures sometimes
test null hypotheses that no one would
believe for a moment, no matter what
the data; our list of situations that
might stimulate hypothesis tests
earlier in the section included several
examples. Testing an unbelievable
null hypothesis amounts, in practice,
to assigning an unreasonably large
prior probability to a very small
region of possible values of the true
parameter. In such cases, the more
the procedure is biased against the
null hypothesis, the better. The
frequent reluctance of empirical scientists to accept null hypotheses which
their data do not classically reject
suggests their appropriate skepticism
about the original plausibility of these
null hypotheses.
Likelihood Principle

A natural question about Bayes' theorem leads to an important conclusion, the likelihood principle, which was first discovered by certain classical statisticians (Barnard, 1947; Fisher, 1956).

Two possible experimental outcomes D and D'—not necessarily of the same experiment—can have the same (potential) bearing on your opinion about a partition of events Hᵢ; that is, P(Hᵢ|D) can equal P(Hᵢ|D') for each i. Just when are D and D' thus evidentially equivalent, or of the same import? Analytically, when is

P(Hᵢ|D) = P(Hᵢ|D')   [23]

for each i?

Aside from such academic possibilities as that some of the P(Hᵢ) are 0, Equation 23 plainly entails that, for some positive constant k and for all i,

P(D'|Hᵢ) = kP(D|Hᵢ).   [24]

But Equation 24 implies Equation 23, from which it was derived, no matter what the initial probabilities P(Hᵢ) are, as is easily seen thus:

P(D') = Σᵢ P(D'|Hᵢ)P(Hᵢ) = k Σᵢ P(D|Hᵢ)P(Hᵢ) = kP(D),

whence P(Hᵢ|D') = kP(D|Hᵢ)P(Hᵢ)/kP(D) = P(Hᵢ|D). This conclusion is the likelihood principle: Two (potential) data D and D' are of the same import if Equation 24 obtains.

Since, for the purpose of drawing inference, the sequence of numbers P(D|Hᵢ) is, according to the likelihood principle, equivalent to any other sequence obtained from it by multiplication by a positive constant, a name for this class of equivalent sequences is useful, and there is precedent for calling it the likelihood (of the sequence of hypotheses Hᵢ given the datum D). (This is not quite the usage of Raiffa & Schlaifer, 1961.) The likelihood principle can now be expressed thus: D and D' have the same import if P(D|Hᵢ) and P(D'|Hᵢ) belong to the same likelihood—more idiomatically, if D and D' have the same likelihood.

If, for instance, the partition is twofold, as it is when you are testing a null hypothesis against an alternative hypothesis, then the likelihood to which the pair [P(D|H₀), P(D|H₁)] belongs is plainly the set of pairs of numbers [a, b] such that the fraction a/b is the already familiar likelihood ratio L(H₀; D) = P(D|H₀)/P(D|H₁). The simplification of the theory of testing by the use of likelihood ratios in place of the pairs of conditional probabilities, which we have seen, is thus an application of the likelihood principle.

Of course, the likelihood principle applies to a (possibly multidimensional) parameter λ as well as to a partition Hᵢ. The likelihood of D, or the likelihood to which P(D|λ) belongs, is the class of all those functions of λ that are positive constant multiples of (that is, proportional to) the function P(D|λ). Also, conditional densities can replace conditional probabilities in the definition of likelihood ratios.

There is one implication of the likelihood principle that all statisticians
seem to accept. It is not appropriate
in this paper to pursue this implication, which might be called the
principle of sufficient statistics, very
far. One application of sufficient
statistics so familiar as almost to
escape notice will, however, help bring
out the meaning of the likelihood
principle. Suppose a sequence of 100
Bernoulli trials is undertaken and 20
successes and 80 failures are recorded.
What is the datum, and what is its
probability for a given value of the
frequency p? We are all perhaps
overtrained to reply, "The datum is 20 successes out of 100, and its probability, given p, is C(100,20) p^20 (1 − p)^80." Yet it seems more correct to say, "The datum is this particular sequence of successes and failures, and its probability, given p, is p^20 (1 − p)^80." The conventional reply is often more convenient, because it would be costly to transmit the entire sequence of observations; it is permissible because the two functions C(100,20) p^20 (1 − p)^80 and p^20 (1 − p)^80 belong to the same likelihood; they differ only by the constant factor C(100,20). Many classical
statisticians would demonstrate this
permissibility by an argument that
does not use the likelihood principle,
at least not explicitly (Halmos &
Savage, 1949, p. 235). That the two
arguments are much the same, after
all, is suggested by Birnbaum (1962).
The legitimacy of condensing the
datum is often expressed by saying
that the number of successes in a
given number of Bernoulli trials is a
sufficient statistic for the sequence of trials. Insofar as the sequence of trials is not altogether accepted as Bernoullian—and it never is—the
condensation is not legitimate. The
practical experimenter always has
some incentive to look over the sequence of his data with a view to
discovering periodicities, trends, or
other departures from Bernoullian
expectation. Anyone to whom the
sequence is not available, such as the
reader of a condensed report or the
experimentalist who depends on automatic counters, will reserve some
doubt about the interpretation of the
ostensibly sufficient statistic.
Moving forward to another application of the likelihood principle, imagine a different Bernoullian experiment
in which you have undertaken to
continue the trials until 20 successes
had been accumulated and the twentieth
success happened to be the one
hundredth trial. It would be conventional and justifiable to report only
this fact, ignoring other details of the
sequence of trials. The probability
that the twentieth success will be the
one hundredth trial is, given p, easily
seen to be C(99,19) p^20 (1 - p)^80. This is
exactly 1/5 of the probability of 20
successes in 100 trials, so according to
the likelihood principle, the two data
have the same import. This conclusion is even a trifle more immediate
if the data are not condensed; for a
specific sequence of 100 trials of which
the last is the twentieth success has
the probability p^20 (1 - p)^80 in both
experiments. Those who do not
accept the likelihood principle believe
that the probabilities of sequences
that might have occurred, but did not,
somehow affect the import of the
sequence that did occur.
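The 1/5 relation is easy to verify directly. In this sketch (ours; the handful of p values is an arbitrary sample), the fixed-trials binomial probability and the stop-at-20-successes negative binomial probability are compared at several values of p:

```python
from math import comb, isclose

def prob_20_in_100_fixed_trials(p):
    """Binomial: P(exactly 20 successes in 100 fixed trials)."""
    return comb(100, 20) * p**20 * (1 - p)**80

def prob_20th_success_on_trial_100(p):
    """Negative binomial: 19 successes somewhere in the first 99
    trials, then a success on trial 100."""
    return comb(99, 19) * p**19 * (1 - p)**80 * p

# C(99,19)/C(100,20) = 20/100 = 1/5 independently of p: the two
# likelihood functions differ only by a constant factor, so by the
# likelihood principle the two data have the same import.
ratios = [prob_20th_success_on_trial_100(p) / prob_20_in_100_fixed_trials(p)
          for p in (0.05, 0.2, 0.5, 0.8)]
print(ratios)  # each ratio is 1/5, up to rounding
```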
In general, suppose that you collect
data of any kind whatsoever (not
necessarily Bernoullian, nor identically
distributed, nor independent of each
other given the parameter λ), stopping
only when the data thus far collected
satisfy some criterion of a sort that is
sure to be satisfied sooner or later;
then the import of the sequence of n
data actually observed will be exactly
the same as it would be had you
planned to take exactly n observations
in the first place. It is not even
necessary that you stop according to
a plan. You may stop when tired,
when interrupted by the telephone,
when you run out of money, when you
have the casual impression that you
have enough data to prove your point,
and so on. The one proviso is that
the moment at which your observation is interrupted must not in itself
be any clue to λ that adds anything
to the information in the data already
at hand. A man who wanted to know
how frequently lions watered at a
certain pool was chased away by lions
before he actually saw any of them
watering there; in trying to conclude
how many lions do water there he
should remember why his observation
was interrupted when it was. We
would not give a facetious example
had we been able to think of a serious
one. A more technical discussion of
the irrelevance of stopping rules to
statistical analysis is on pages 36-42
of Raiffa and Schlaifer (1961).
This irrelevance of stopping rules to
statistical inference restores a simplicity and freedom to experimental
design that had been lost by classical
emphasis on significance levels (in the
sense of Neyman and Pearson) and on
other concepts that are affected by
stopping rules. Many experimenters
would like to feel free to collect data
until they have either conclusively
proved their point, conclusively disproved it, or run out of time, money,
or patience. Classical statisticians
(except possibly for the few classical
defenders of the likelihood principle)
have frowned on collecting data one
by one or in batches, testing the total
ensemble after each new item or
batch is collected, and stopping the
experiment only when a null hypothesis is rejected at some preset
significance level. And indeed if an
experimenter uses this procedure,
then with probability 1 he will
eventually reject any sharp null
hypothesis, even though it be true.
This is perhaps simply another illustration of the overreadiness of classical
procedures to reject null hypotheses.
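A small Monte Carlo sketch makes the point vivid (our construction, not from the original text; the normal-approximation z test, the starting point n = 30, and the cap of 1,000 trials are all illustrative choices). The sharp null hypothesis p = .5 is true by construction, yet testing after every observation "rejects" it far more often than the nominal 5%, and ever more often as the cap is raised:

```python
import math
import random

random.seed(1)  # fixed seed for a reproducible illustration

def tests_until_rejection(z_crit=1.96, max_n=1000):
    """Flip a fair coin (so the sharp null p = .5 is TRUE), run a
    two-sided normal-approximation test after every flip from
    n = 30 on, and stop at the first 'significant' result."""
    heads = 0
    for n in range(1, max_n + 1):
        heads += random.random() < 0.5
        if n >= 30:
            z = (heads - n / 2) / math.sqrt(n / 4)
            if abs(z) > z_crit:
                return True  # the true null has been "rejected"
    return False

runs = 2000
rejected = sum(tests_until_rejection() for _ in range(runs))
frac = rejected / runs
print(frac)  # well above the nominal .05, and it grows with max_n
```

With probability 1 the cumulative z path eventually crosses any fixed critical value, so removing the cap drives the rejection rate toward 1.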
In contrast, if you set out to collect
data until your posterior probability
for a hypothesis which, unknown to
you, is true has been reduced to .01,
then 99 times out of 100 you will
never make it, no matter how many
data you, or your children after you,
may collect.
(Rules which have
nonzero probability of running forever
ought not, and here will not, be called
stopping rules at all.)
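That guarantee can itself be checked by simulation (our sketch; the competing hypothesis p = .6, the equal prior probabilities, and the cap of 3,000 trials are illustrative assumptions, and the cap can only understate the hit rate). The posterior probability of the true hypothesis falls below .01 exactly when the likelihood ratio against it exceeds 99, and a martingale argument bounds the chance of that ever happening by 1/99:

```python
import math
import random

random.seed(2)  # fixed seed for a reproducible illustration

LOG99 = math.log(99)  # P(H_true | data) < .01 <=> log odds against > log 99

def posterior_ever_below_01(max_n=3000):
    """H_true: p = .5 (the data really are fair-coin flips);
    H_alt: p = .6; prior probability .5 on each. Return True if
    the posterior probability of H_true ever drops below .01."""
    log_odds_against = 0.0  # log [P(data | H_alt) / P(data | H_true)]
    for _ in range(max_n):
        heads = random.random() < 0.5
        log_odds_against += (math.log(0.6 / 0.5) if heads
                             else math.log(0.4 / 0.5))
        if log_odds_against > LOG99:
            return True
    return False

runs = 1000
hits = sum(posterior_ever_below_01() for _ in range(runs))
print(hits / runs)  # at most about .01, as the text promises
```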
The irrelevance of stopping rules is
one respect in which Bayesian procedures are more objective than classical ones. Classical procedures (with the
possible exceptions implied above) insist that the intentions of the experimenter are crucial to the interpretation of data, that 20 successes in 100
observations means something quite
different if the experimenter intended
the 20 successes than if he intended
the 100 observations. According to
the likelihood principle, data analysis
stands on its own feet. The intentions
of the experimenter are irrelevant to
the interpretation of the data once
collected, though of course they are
crucial to the design of experiments.
The likelihood principle also creates
unity and simplicity in inference
about Markov chains and other
stochastic processes (Barnard, Jenkins, & Winsten, 1962), which are
sometimes applied in psychology.
It sheds light on many other problems
of statistics, such as the role of unbiasedness and Fisher's concept of
ancillary statistic. A principle so
simple with consequences so pervasive
is bound to be controversial. For
dissents see Stein (1962), Wolfowitz
(1962), and discussions published
with Barnard, Jenkins, and Winsten
(1962), Birnbaum (1962), and Savage
et al. (1962) indexed under likelihood
principle.
In Retrospect

Though the Bayesian view is a natural outgrowth of classical views, it must be clear by now that the distinction between them is important. Bayesian procedures are not merely another tool for the working scientist to add to his inventory along with traditional estimates of means, variances, and correlation coefficients, and the t test, F test, and so on. That classical and Bayesian statistics are sometimes incompatible was illustrated in the theory of testing. For, as we saw, evidence that leads to classical rejection of the null hypothesis will often leave a Bayesian more confident of that same null hypothesis than he was to start with. Incompatibility is also illustrated by the attention many classical statisticians give to stopping rules that Bayesians find irrelevant.

The Bayesian outlook is flexible, encouraging imagination and criticism in its everyday applications. Bayesian experimenters will emphasize suitably chosen descriptive statistics in their publications, enabling each reader to form his own conclusions. Where an experimenter can easily foresee that his readers will want the results of certain calculations (as for example when the data seem sufficiently precise to justify for most readers application of the principle of stable estimation) he will publish them. Adoption of the Bayesian outlook should discourage parading statistical procedures, Bayesian or other, as symbols of respectability pretending to give the imprimatur of mathematical logic to the subjective process of empirical inference.

We close with a practical rule which stands rather apart from any conflicts between Bayesian and classical statistics. The rule was somewhat overstated by a physicist who said, "As long as it takes statistics to find out, I prefer to investigate something else." Of course, even in physics some important questions must be investigated before technology is sufficiently developed to do so definitively. Still, when the value of doing so is recognized, it is often possible so to design experiments that the data speak for themselves without the intervention of subtle theory or insecure personal judgments. Estimation is best when it is stable. Rejection of a null hypothesis is best when it is interocular.

REFERENCES

Anscombe, F. J. Bayesian statistics. Amer. Statist., 1961, 15(1), 21-24.
Bahadur, R. R., & Robbins, H. The problem of the greater mean. Ann. Math. Statist., 1950, 21, 469-487.
Barnard, G. A. A review of "Sequential Analysis" by Abraham Wald. J. Amer. Statist. Ass., 1947, 42, 658-664.
Barnard, G. A., Jenkins, G. M., & Winsten, C. B. Likelihood inference and time series. J. Roy. Statist. Soc., 1962, 125(Ser. A), 321-372.
Bayes, T. Essay towards solving a problem in the doctrine of chances. Phil. Trans. Roy. Soc., 1763, 53, 370-418. (Reprinted: Biometrika, 1958, 45, 293-315.)
Berkson, J. Some difficulties of interpretation encountered in the application of the chi-square test. J. Amer. Statist. Ass., 1938, 33, 526-542.
Berkson, J. Tests of significance considered as evidence. J. Amer. Statist. Ass., 1942, 37, 325-335.
Birnbaum, A. On the foundations of statistical inference. J. Amer. Statist. Ass., 1962, 57, 269-306.
Blackwell, D., & Dubins, L. Merging of opinions with increasing information. Ann. Math. Statist., 1962, 33, 882-886.
Borel, E. La théorie du jeu et les équations intégrales à noyau sym