Notes for Formal Epistemology, Week 3

In some circles (including my idiolect many days of the week), the expressions "null set" and "empty set" are used interchangeably. But you should be aware that in set theory and measure theory, these have different meanings. A set with no elements, notated as {} or ∅, is properly called an "empty set". A "null set" is a broader notion, most generally meaning a set whose measure is 0. The empty set is one such, but there can be others too. You can read more here.

There's also the notion of a "negligible set". This is any subset of a null set. It will either have measure 0 (and so also be a null set) or be non-measurable.
In class, I mentioned the Lebesgue measure, which assigns to every real interval [a..b] the measure b-a. We said that (given the axiom of uncountable choice) some subsets of the reals aren't Lebesgue measurable. Here's what may be the most accessible example. (I'm not expecting most members of the seminar to understand how that works.)
Whether a set is measurable depends on the measure you're working with. Some measures are defined on every subset of the reals: for instance, the counting measure, which assigns to every finite set its cardinality, and to every non-finite set the measure ∞. Or an analogous measure, which assigns measure 0 to the empty set, and measure ∞ to every other subset of the reals. When we talk about some sets being non-measurable, that's relative to some assumed measure, such as the Lebesgue measure.
In order to work around the difficulties with non-measurable sets, measure theory (including probability theory more specifically) says that measures are defined only on certain well-behaved collections of sets. In order for the machinery of measure theory to work, these collections have to have a certain structure. The way that structure is standardly guaranteed is by saying: let Ω be some set, and let 𝓕 be an algebra (or a sigma-algebra) over Ω. What that means is that 𝓕 is a set of subsets of Ω, that includes ∅ and is closed under complementation and finite union (if 𝓕 is a sigma-algebra, it's also closed under countable union). These conditions entail that 𝓕 also includes the whole set Ω and is closed under intersection. The sets in 𝓕 are sometimes called the "admissible" subsets of Ω. Measures are defined on the elements of 𝓕.
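For a finite Ω, the closure conditions just described can be checked mechanically. Here is a minimal sketch in Python; the function name `is_algebra` and the two example collections are my own illustrative choices. (When Ω is finite, every algebra over it is automatically a sigma-algebra, so the sketch doesn't distinguish the two.)

```python
from itertools import combinations

def is_algebra(omega, family):
    """Return True if `family` is an algebra over the finite set `omega`."""
    omega = frozenset(omega)
    sets = {frozenset(s) for s in family}
    # Every member must be a subset of omega, and the empty set must be included.
    if frozenset() not in sets or any(not s <= omega for s in sets):
        return False
    # Closed under complementation (relative to omega).
    if any(omega - s not in sets for s in sets):
        return False
    # Closed under pairwise union, which gives closure under all finite unions.
    return all(a | b in sets for a, b in combinations(sets, 2))

omega = {1, 2, 3, 4}
coarse = [set(), {1, 2}, {3, 4}, {1, 2, 3, 4}]  # an algebra over omega
broken = [set(), {1}, {2}, {1, 2, 3, 4}]        # lacks complements such as {2, 3, 4}

print(is_algebra(omega, coarse))  # True
print(is_algebra(omega, broken))  # False
```

Note that `coarse` is an algebra even though it leaves out most subsets of Ω: the definition only says what must be included.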

The combination of Ω and 𝓕 is standardly called a "field of sets". Many authors also call 𝓕 itself a field (or sigma-field).

When we are dealing specifically with a probability measure, then the elements of Ω are called outcomes, or experimental results, or (most specific) possibilities. We assume that reality chooses exactly one of these. Ω itself is usually called a "sample space"; it may be labeled with S, W, or U instead of Ω. In the more general setting where we're working with other measures, the elements of Ω may be called points. (Some authors apply the term "sample space" to the combination of Ω and 𝓕, rather than to Ω alone.)

When dealing with probability measures, the elements of 𝓕 (which will be subsets of Ω) are called events or hypotheses. In the more general setting, they may be called complexes.

The combination of Ω and 𝓕 together is often called a "measurable space". But this terminology can be misleading. Our definition of an algebra only specified what subsets of Ω have to be included; it doesn't force us to exclude any subsets. So 𝓕 could be the power set of Ω. And if Ω is the reals, then for many measures of interest, it won't be possible to define them on all elements of 𝓕. (In other words, a collection of measurable subsets of Ω must be an algebra over Ω, but the converse is not true.) In practice, I guess that people would only call the combination of Ω and 𝓕 a measurable space when the measure they're interested in can be defined on all elements of 𝓕.

The combination of Ω and 𝓕 plus a measure µ defined on 𝓕 is called a "measure space", or a "probability space" when the measure is a probability measure.

When you have only a finite or countably infinite number of outcomes in the sample space Ω, you may see talk of a "probability mass function" defined on them, which will ground the assignment of probabilities to singleton sets of those outcomes. In cases where the sample space is continuous, you may instead see talk of its "probability density function," and the probabilities of individual outcomes, when defined at all, will usually be 0. (There's also a more general notion of a "cumulative distribution function" which can be defined in a broader range of cases.) Classical probability theory as developed by Fermat, Pascal, and Huygens in the 17th century, through to Laplace in the 19th century, was focused on the former, "discrete" case. Modern probability theory as developed by Kolmogorov in the 1930s gave a unified treatment of the discrete and continuous cases. (And also mixed cases. An experiment may involve tossing a coin: if it comes up heads, we stop; if it comes up tails, we spin a dial which can stop at any angular position.)
When working with discrete outcomes, it will often be natural to define the sample space in such a way that each outcome is equally probable. But there's no requirement to do this, and sometimes it won't even be possible. Here's a case to think about: your experiment involves throwing a thumbtack in the air and seeing how it lands. One outcome is that the point touches the ground (usually because the thumbtack is on its side); another is that the thumbtack lands on its back with the point upward. There's nothing wrong with these being our two outcomes, even if they don't happen to be (or we don't have reason to think they are) equally likely.
Events in 𝓕 are sets of outcomes. They may concern specific outcomes, or they may be more general events that are exemplified by several specific outcomes, or they may be impossible events, that are exemplified by no outcomes. Another device that theorists use to talk about sets of outcomes is the notion of a "random variable". These are often notated with capital Latin letters (like S), but for readability I'll sometimes use capitalized words like Suit. The way these work is that they are functions from specific outcomes to some category or label, often a number. For instance, Suit might map some w∈Ω to the label Hearts, others to the label Clubs, and so on, depending on what the suit was of a randomly chosen card. There may be other aspects of the outcome w that this variable ignores (such as what the card's rank was). NumHeads might map some w∈Ω to the label 3 if it is an outcome in which exactly 3 coin-tosses in some sequence came up heads. Now the event in which we get at most 3 heads in the sequence of coin tosses could be written as {w∈Ω | NumHeads(w) ≤ 3}. That is a set of outcomes. But authors will usually abbreviate that like this: NumHeads ≤ 3. That is, instead of prob({w∈Ω | NumHeads(w) ≤ 3}), they'll write prob(NumHeads ≤ 3). Similarly, they might write prob(NumHeads = 3) or prob(NumHeads ∈ some set), and so on.
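To make the abbreviation concrete, here is a small Python sketch. It assumes, purely for illustration, that the experiment is four tosses of a fair coin, so that Ω has 16 equally probable outcomes; the names `OMEGA`, `prob`, and `num_heads` are mine.

```python
from fractions import Fraction
from itertools import product

# Assumed sample space: all 16 sequences of four tosses of a fair coin.
OMEGA = set(product("HT", repeat=4))

def prob(event):
    """Uniform probability measure on subsets of OMEGA (an assumption of this sketch)."""
    return Fraction(len(event), len(OMEGA))

def num_heads(w):
    """The random variable NumHeads: maps an outcome to its number of heads."""
    return w.count("H")

# The event "NumHeads <= 3" is just the set of outcomes where the variable is <= 3.
at_most_3 = {w for w in OMEGA if num_heads(w) <= 3}
exactly_3 = {w for w in OMEGA if num_heads(w) == 3}

print(prob(at_most_3))  # 15/16
print(prob(exactly_3))  # 1/4
```

So prob(NumHeads ≤ 3) is shorthand for applying prob to the set `at_most_3`; the random variable itself never gets a probability, only the events defined in terms of it do.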

(More rigorously, a random variable for a field of sets (Ω, 𝓕) is a function from Ω to a set Γ, where there's an algebra 𝓖 on Γ, such that the pre-image of any G∈𝓖 under that function is in 𝓕. Functions of this sort are called "measurable functions.")

Discrete random variables can take only a finite or countably infinite range of values. Continuous random variables can take more. (There can be mixed cases: in the example described at the end of point 4, we could have a variable that took either the value -1 if the coin came up heads, or a value from a continuous range [0..2π) representing the position of the spun dial, if the coin came up tails.)

When a sample space (only) concerns a single random variable taking on various values, the probability space is called "univariate". When it concerns the values that several random variables take, it's called "multivariate".
As you saw in the readings, two events or hypotheses H and G are said to be "probabilistically independent" iff prob(H ∩ G), also written prob(HG), is identical to prob(H)∙prob(G), which is equivalent to it being the case that prob(H | G) = prob(H), which is equivalent to it being the case that prob(G | H) = prob(G). (I'm assuming here that prob(H) and prob(G) are both > 0.) Authors differ in which of these they take to define the notion of probabilistic independence, and which they take to be derived. The claim that H and G are probabilistically independent is sometimes written as: H ⫫ G, alternatively as H ⊥ G. When the probabilities you're talking about are representations of your evidence or what credences you're in a position to rationally have, then probabilistic independence is usually taken to be a kind of evidential irrelevance.
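Here is a sketch that checks these equivalent conditions on a concrete case. The example (drawing one card uniformly at random from a standard 52-card deck) and all the names in the code are my own illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

SUITS = ["Hearts", "Diamonds", "Clubs", "Spades"]
RANKS = range(1, 14)  # 13 stands for King in this sketch
OMEGA = set(product(SUITS, RANKS))

def prob(event):
    """Uniform probability over the 52 cards (an assumption of this sketch)."""
    return Fraction(len(event), len(OMEGA))

def cond_prob(h, g):
    """prob(H | G), defined only when prob(G) > 0."""
    return prob(h & g) / prob(g)

def independent(h, g):
    return prob(h & g) == prob(h) * prob(g)

hearts = {w for w in OMEGA if w[0] == "Hearts"}
kings = {w for w in OMEGA if w[1] == 13}
red = {w for w in OMEGA if w[0] in ("Hearts", "Diamonds")}

print(independent(hearts, kings))                # True
print(cond_prob(hearts, kings) == prob(hearts))  # True
print(independent(hearts, red))                  # False
print(cond_prob(hearts, red), prob(hearts))      # 1/2 1/4
```

Learning that the card is a king leaves the probability of hearts at 1/4, whereas learning that the card is red raises it to 1/2.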

If H and G are logically independent, that means that neither entails the other. Certainly if one of them did entail the other, they couldn't in that case be probabilistically independent (at least so long as neither of them has probability 0 or 1). But logical independence doesn't entail probabilistic independence. The hypothesis that someone is a philosopher is logically independent from their being charismatic, but presumably these are not probabilistically independent. Presumably your rational credence that someone is charismatic, given that they're a philosopher, is different from your prior, baseline credence that they're charismatic.

Note also that probabilistic independence can come and go. You might have started out with rational credences where being a philosopher and being charismatic were probabilistically independent. But then as you interacted with more and more philosophers, your credences changed; now these properties are no longer evidentially irrelevant to each other.

When you're talking about probabilities that represent objective propensities, don't confuse probabilistic dependence with causal dependence. Presumably there's some connection between these, but they're not the same notion. If B is probabilistically dependent on A, it might not be because B tends to be caused by A; it might be instead that A tends to be caused by B; or they might both tend to be caused by some third event; or something else entirely may be going on.

Sometimes authors talk about two variables U and V being independent. What this means is that any event defined in terms of U (such as U ≤ 3) is probabilistically independent of any event defined in terms of V.
We've defined events or hypotheses as sets of outcomes. And sets H and G are said to be disjoint or mutually exclusive if H ∩ G = ∅. But when considering H and G as events or hypotheses, sometimes what's meant by calling them disjoint or mutually exclusive is merely that prob(H ∩ G) = 0. This is a weaker condition.
The "expected value" (or "expectation") of a variable is the probability-weighted average of that variable's possible values. (This has to be stated more carefully when our variables don't always take discrete values, but that's the guiding idea.) For example, suppose you took a fair six-sided die, and repainted the sides displaying 4, 5, and 6 to all read 2 instead. Now if you were to toss this modified die, it would have a 1/6 chance of coming up 1, also a 1/6 chance of coming up 3, and a 4/6 chance of coming up 2. We say that the expected value of rolling this die is (1/6)∙1 + (1/6)∙3 + (4/6)∙2 = 12/6 = 2.
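The same calculation as a short Python sketch, using exact fractions. Representing the die as a dictionary from faces to probabilities is my own choice of encoding.

```python
from fractions import Fraction

def expected_value(pmf):
    """Probability-weighted average of the values in a {value: probability} dict."""
    return sum(p * v for v, p in pmf.items())

# The repainted die: faces 4, 5, and 6 now all read 2.
repainted = {1: Fraction(1, 6), 2: Fraction(4, 6), 3: Fraction(1, 6)}
# An unmodified fair six-sided die, for comparison.
fair = {face: Fraction(1, 6) for face in range(1, 7)}

print(expected_value(repainted))  # 2
print(expected_value(fair))       # 7/2
```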

In this case, the expected value is also the result that we're most likely to see occur. But that needn't be true in general. (The expected value of a normal die is 3.5, which will never come up on the die. We'll see another counter-example in a moment.) And though the expected value of our die is 2, nothing prevents this die from coming up 3 a hundred times in a row. That's just unlikely to occur. The Law of Large Numbers is a theorem that tells us, roughly, that as you repeat an experiment, the average of the results you see will in the long run converge on the expected value. (Here I'm taking the "expected value" from the die's objective probability of giving each result. This may be different from what anyone reasonably believes about the die.)
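The rough idea behind the Law of Large Numbers can be illustrated (not proved) with a simulation. This sketch simulates the repainted die with Python's pseudo-random generator; the function name and the fixed seed are my own choices, the seed being there only to make runs repeatable.

```python
import random

def average_of_rolls(n, seed=0):
    """Average of n simulated rolls of the repainted die (faces 1, 2, 2, 2, 2, 3)."""
    rng = random.Random(seed)
    faces = [1, 2, 2, 2, 2, 3]
    return sum(rng.choice(faces) for _ in range(n)) / n

# The averages should tend to settle near the expected value 2 as n grows,
# though any particular finite run can still stray from it.
for n in (10, 1_000, 100_000):
    print(n, average_of_rolls(n))
```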

Suppose you were to insert some weights into the die to introduce a bias, but let's say you keep it having an equal chance x of coming up 1 as coming up 3 (with 0 < x < 1/2). Then the expected value of rolling the die would be (x)∙1 + (x)∙3 + (1-2x)∙2. As it happens, this would still add up to 2. Even in the case where x is very close to 1/2, so that the die almost never comes up 2, it still has an equal chance of coming up 1 and coming up 3, and they average to 2, so its expected value is still 2. (This is another example where the expected value is not among the results most likely to occur.)

The notion of "variance" measures the difference between our two examples with the die: one where the die often comes up 2, and the other where it rarely comes up 2, though the expected value is 2 in both cases. The variance of a variable measures how widely that variable's actual values can be expected to "spread" around its expected value. In the first case, there is a low variance, and in the second case a high variance. (Specifically, the variance of a variable X is the expected value of the square of the difference between X's actual and expected values. This gives the result that the variance of our first modified die is 1/3, and the variance of the second, biased, die is 2x. The variance of a regular six-sided die is approximately 2.92.)
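These variance figures can be checked with exact arithmetic. In the sketch below the dictionary encoding of each die and the helper `biased` are my own; the sample value x = 2/5 is just an arbitrary choice close to 1/2.

```python
from fractions import Fraction

def expected_value(pmf):
    return sum(p * v for v, p in pmf.items())

def variance(pmf):
    """Expected value of the squared difference from the expected value."""
    mu = expected_value(pmf)
    return sum(p * (v - mu) ** 2 for v, p in pmf.items())

def biased(x):
    """The weighted die: chance x of 1, chance x of 3, chance 1 - 2x of 2."""
    return {1: x, 2: 1 - 2 * x, 3: x}

repainted = biased(Fraction(1, 6))
fair = {face: Fraction(1, 6) for face in range(1, 7)}
x = Fraction(2, 5)

print(variance(repainted))                             # 1/3
print(expected_value(biased(x)), variance(biased(x)))  # 2 4/5
print(float(variance(fair)))                           # roughly 2.92
```

With x = 2/5 the expected value is still 2 but the variance is 2x = 4/5, more than twice that of the repainted die.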
We explained measures as defined on elements of an algebra 𝓕 over a set Ω, the combination of which is called a field of sets. But really the point of talking about these fields is just to induce a certain structure on the elements of 𝓕. It doesn't really need to be the case that its elements are subsets of some other set, nor that they're sets at all. They just need to have the structure we're looking for.

How could we describe the structure that 𝓕 needs to have, more abstractly? (I encourage you to think through this, but if it makes your head spin, it's not essential.)

A (nonstrict) "partial order" is a binary relation that is reflexive, transitive, and antisymmetric: for example, the subset relation, or the relation ≤ defined on the natural numbers ℕ. (A strict partial order would be the relation <, which is irreflexive rather than reflexive.) Mathematicians use the term "poset" for a partial order together with the set that it's defined over.

Two elements a, b are "comparable" with respect to a relation R if aRb or bRa. A poset where every pair of elements is comparable is called a "total order" or "linear order."

There are several different notions of a "largest" element for a partial order. I'll use the symbol ⊑ for whatever (nonstrict) partial order we're working with.
- We say some element m of a poset is "maximal" (wrt that poset's relation ⊑) when m is not ⊑ any other element of the poset: that is, where Y is the poset in question, ∀y∈Y (m ⊑ y ⊃ m = y). A poset may have zero, or one, or more maximal elements.
- We say some element g of a poset is "greatest" (wrt that poset's relation ⊑) when every element of the poset is ⊑ g. A poset may have zero or one greatest element, and if it has one, it will be that poset's unique maximal element.
- If our poset is (Y, ⊑), and Y ⊆ B, we say that some b∈B (b may or may not be ∈ Y) is an "upper bound" for the poset if every element of Y is ⊑ b. (Here ⊑ is understood as extended to a partial order on all of B.) b is a "least upper bound" for that poset and superset B if b is an upper bound for the poset, and b ⊑ every b'∈B which is also an upper bound for the poset. Least upper bounds need not always exist: for instance, let Y be the set of all rational numbers whose square is less than 2. Then Y has many rational upper bounds (for instance, 8 is one of them) but it has no least upper bound among the rationals. Among the reals, Y does have a least upper bound, namely √2. None of the upper bounds we've described are themselves elements of Y. When a poset does have a least upper bound in a superset B, it will be unique. If a poset has a greatest element, it will be that poset's least upper bound (and will be in the poset); if it has no greatest element, then the poset doesn't include any least upper bound within itself.
Another term for a least upper bound is the "supremum" or "join" of a poset.
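For finite cases, the three notions can be computed directly. The sketch below uses divisibility on positive integers as the partial order ⊑; that example, the choice of Y and B, and the function names are all mine. It takes a least upper bound to be an upper bound that is ⊑ every upper bound in B.

```python
def divides(a, b):
    """The partial order used in this sketch: a ⊑ b iff a divides b."""
    return b % a == 0

def maximal_elements(elems, leq):
    # m is maximal when the only element it is below is itself.
    return [m for m in elems if all(not leq(m, y) or m == y for y in elems)]

def greatest_element(elems, leq):
    for g in elems:
        if all(leq(y, g) for y in elems):
            return g
    return None  # no greatest element

def upper_bounds(subset, universe, leq):
    return [b for b in universe if all(leq(y, b) for y in subset)]

def least_upper_bound(subset, universe, leq):
    ubs = upper_bounds(subset, universe, leq)
    for b in ubs:
        if all(leq(b, c) for c in ubs):
            return b
    return None  # no least upper bound in the universe

Y = [2, 3, 4, 6]
B = list(range(1, 25))

print(maximal_elements(Y, divides))      # [4, 6]
print(greatest_element(Y, divides))      # None
print(upper_bounds(Y, B, divides))       # [12, 24]
print(least_upper_bound(Y, B, divides))  # 12
```

Here Y has two maximal elements but no greatest element, and its least upper bound 12 lies in B but outside Y.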

Analogous to the notions of maximal, greatest, and l.u.b./supremum/join, we can define notions of minimal, least, and greatest lower bound/infimum/meet.

A join is often notated with an "or" symbol ∨, and a meet notated with an "and" symbol ∧. When a poset X is specifically a collection of sets, ordered by the subset relation ⊆, then the join of some elements of X (so here, Y=a subset of X, and the relevant superset B=X) is just the union of those elements, and the meet of those elements is their intersection, provided that the union and intersection are themselves in X. Depending on the makeup of X, the union and intersection of some of its elements might not themselves be elements of X; in that case the elements may have some larger set in X as their join (or some smaller set as their meet), or they may have none at all. But when all pairs in X do have joins in X, then ∨ induces a binary operation on X that will be commutative, associative, and idempotent. (The last term means that x ∨ x is always = x.)

(It's also possible to introduce the notions of join and/or meet as primitive binary operations that are commutative, associative, and idempotent. Then you can define a partial order ⊑ in terms of them. You say that u ⊑ w iff u ∨ w = w. Or you say that u ⊑ w iff u ∧ w = u. You can prove that the relation ⊑ so defined will be reflexive, transitive, and antisymmetric, and so is a partial order.)
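Here is a brute-force check of that claim on one small example. The example is my own: the carrier set is the divisors of 12, and the primitive join is taken to be least common multiple, so the derived order turns out to be divisibility.

```python
from math import gcd
from itertools import product

X = [1, 2, 3, 4, 6, 12]  # divisors of 12, closed under lcm

def join(u, w):
    """Primitive join for this sketch: least common multiple."""
    return u * w // gcd(u, w)

def below(u, w):
    """Derived order: u ⊑ w iff u ∨ w = w."""
    return join(u, w) == w

# The three laws assumed of the primitive operation.
idempotent = all(join(u, u) == u for u in X)
commutative = all(join(u, w) == join(w, u) for u, w in product(X, repeat=2))
associative = all(join(join(u, v), w) == join(u, join(v, w))
                  for u, v, w in product(X, repeat=3))

# The three properties the derived relation should then have.
reflexive = all(below(u, u) for u in X)
antisymmetric = all(u == w for u, w in product(X, repeat=2)
                    if below(u, w) and below(w, u))
transitive = all(below(u, w) for u, v, w in product(X, repeat=3)
                 if below(u, v) and below(v, w))

print(idempotent, commutative, associative)  # True True True
print(reflexive, antisymmetric, transitive)  # True True True
```

Checking one example is of course no substitute for the general proof, but it shows how each property of ⊑ traces back to a law about ∨.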

When you have a poset (X, ⊑) that includes a l.u.b./supremum/join for any nonempty finite subset of X, we call that poset a "join semilattice." Similarly, if the poset includes a g.l.b./infimum/meet for any nonempty finite subset, it's a "meet semilattice." When a poset (X, ⊑) is both a join semilattice and a meet semilattice, it's called just a lattice. (There's also a different notion of a "lattice" in group theory and geometry, which means something like a regular tiling of a space.)

(It's also possible to introduce the notion of a lattice as a set with two primitive binary operations ∨ and ∧ that are commutative, associative, and are linked by the laws u ∨ (u ∧ w) = u, and u ∧ (u ∨ w) = u. This will result in the same kind of structure.)

Not every poset is a lattice. For example, let X = {{10}, {20}, {10,20,30}, {10,20,40}}, ordered by the subset relation. The elements {10} and {20} do have upper bounds in X, but no least upper bound. The elements {10,20,30} and {10,20,40} do not have any upper bound in X at all.
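This example can be verified by brute force. In the sketch below the helper names are mine, and `join` returns `None` when a pair has no least upper bound in X.

```python
def upper_bounds(poset, a, b):
    """All elements of the poset that are supersets of both a and b."""
    return [c for c in poset if a <= c and b <= c]

def join(poset, a, b):
    """The least upper bound of a and b within the poset, or None if there is none."""
    ubs = upper_bounds(poset, a, b)
    least = [c for c in ubs if all(c <= d for d in ubs)]
    return least[0] if least else None

X = [frozenset(s) for s in ({10}, {20}, {10, 20, 30}, {10, 20, 40})]
a, b, c, d = X

print(len(upper_bounds(X, a, b)))  # 2: both three-element sets are upper bounds
print(join(X, a, b))               # None: neither upper bound is below the other
print(upper_bounds(X, c, d))       # []: no upper bound at all
```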

When you do have a lattice, it may have various other special properties.

Above, we were talking about joins and meets only for finite numbers of elements. When all subsets (not just finite nonempty subsets) of a lattice have joins and meets, the lattice is called "complete".

A lattice is called "bounded" when it has a least and a greatest element. We call the least element ⊥ (pronounced "bottom") and it will be the join of zero elements. Also ⊥ ∨ u will always be u. We call the greatest element ⊤ (pronounced "top") and it will be the meet of zero elements. Also ⊤ ∧ u will always be u.

We say that two elements u, w of a bounded lattice X are "complements" just in case u ∨ w = ⊤ and u ∧ w = ⊥. An element of a bounded lattice may have zero, one, or many complements.
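To see that zero, one, and many complements are all possible, here is a sketch that counts complements in three small lattices of sets ordered by ⊆. The three example lattices and all the helper names are my own illustrative choices.

```python
def fs(*items):
    return frozenset(items)

def top(lattice):
    return next(t for t in lattice if all(x <= t for x in lattice))

def bottom(lattice):
    return next(b for b in lattice if all(b <= x for x in lattice))

def join(lattice, a, b):
    ubs = [c for c in lattice if a <= c and b <= c]
    return next(c for c in ubs if all(c <= d for d in ubs))

def meet(lattice, a, b):
    lbs = [c for c in lattice if c <= a and c <= b]
    return next(c for c in lbs if all(d <= c for d in lbs))

def complements(lattice, u):
    """All w in the lattice with u ∨ w = ⊤ and u ∧ w = ⊥."""
    t, b = top(lattice), bottom(lattice)
    return [w for w in lattice
            if join(lattice, u, w) == t and meet(lattice, u, w) == b]

chain = [fs(), fs(10), fs(10, 20)]
square = [fs(), fs(10), fs(20), fs(10, 20)]
diamond = [fs(), fs(10), fs(20), fs(30), fs(10, 20, 30)]

print(len(complements(chain, fs(10))))    # 0
print(len(complements(square, fs(10))))   # 1
print(len(complements(diamond, fs(10))))  # 2
```

In the last lattice the join of {10} and {20} is {10,20,30}, since their union {10,20} isn't available.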

A lattice is called "distributive" when its join and meet operations distribute over each other, that is for all x, u, w: x ∧ (u ∨ w) = (x ∧ u) ∨ (x ∧ w). (Or you could require that for all x, u, w where x ⊑ u ∨ w, there is a u' ⊑ u and a w' ⊑ w such that x = u' ∨ w'. Or you could swap the ∨s and ∧s and replace ⊑ with ⊒. Any of these principles will ensure the others.)

Here's an example of a lattice that is non-distributive: {∅, {10}, {20}, {30}, {10,20,30}}, ordered by the subset relation. Another is {∅, {10}, {10,20}, {30}, {10,20,30}}.
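Both examples can be checked by brute force against the first formulation of distributivity, x ∧ (u ∨ w) = (x ∧ u) ∨ (x ∧ w). In this sketch the helper names are mine, and the four-element power set is included only as a contrasting case that does pass.

```python
from itertools import product

def fs(*items):
    return frozenset(items)

def join(lattice, a, b):
    ubs = [c for c in lattice if a <= c and b <= c]
    return next(c for c in ubs if all(c <= d for d in ubs))

def meet(lattice, a, b):
    lbs = [c for c in lattice if c <= a and c <= b]
    return next(c for c in lbs if all(d <= c for d in lbs))

def is_distributive(lattice):
    """Check x ∧ (u ∨ w) = (x ∧ u) ∨ (x ∧ w) for every triple of elements."""
    return all(
        meet(lattice, x, join(lattice, u, w))
        == join(lattice, meet(lattice, x, u), meet(lattice, x, w))
        for x, u, w in product(lattice, repeat=3)
    )

first = [fs(), fs(10), fs(20), fs(30), fs(10, 20, 30)]
second = [fs(), fs(10), fs(10, 20), fs(30), fs(10, 20, 30)]
power_set = [fs(), fs(10), fs(20), fs(10, 20)]

print(is_distributive(first))      # False
print(is_distributive(second))     # False
print(is_distributive(power_set))  # True
```

In the first lattice, take x = {10}, u = {20}, w = {30}: then x ∧ (u ∨ w) = {10}, but (x ∧ u) ∨ (x ∧ w) = ∅.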

When a lattice is both bounded and distributive, every element will have at most one complement (but they can have zero).

If a lattice is bounded, distributive, and every element does have a complement, this is called a Boolean algebra.

(It's also possible to introduce the notion of a Boolean algebra from other starting points, ending up with the same kind of structure. See 1, 2, 3, and 4 for more.)

So far as I can see, the point of requiring that measures be defined on fields of sets, is just to guarantee that their domain 𝓕 has the structure of a Boolean algebra. But we could have instead just stipulated that directly, and left the nature of 𝓕's elements and its ordering relation (and/or the nature of its join and meet operations) unspecified.

(It's straightforward that any field of sets ordered by ⊆ constitutes a Boolean algebra. A result called Stone's Representation Theorem from 1936 shows that any Boolean algebra is in turn isomorphic to a field of sets.)
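The straightforward direction can be spot-checked on a small case. The sketch below takes a finite field of sets and confirms the Boolean-algebra conditions, with union as join, intersection as meet, ∅ as ⊥, and Ω as ⊤. The example field and the function name are my own assumptions, and of course a check on one example isn't a proof.

```python
from itertools import product

def is_boolean_algebra_of_sets(omega, family):
    """Check the Boolean-algebra conditions for a finite family of subsets of omega."""
    omega = frozenset(omega)
    sets = {frozenset(s) for s in family}
    empty = frozenset()
    bounded = empty in sets and omega in sets
    # Union and intersection must stay inside the family to serve as join and meet.
    closed = all(a | b in sets and a & b in sets
                 for a, b in product(sets, repeat=2))
    # This always holds for unions and intersections of sets; checked for completeness.
    distributive = all(x & (u | w) == (x & u) | (x & w)
                       for x, u, w in product(sets, repeat=3))
    complemented = all(any(a | b == omega and a & b == empty for b in sets)
                       for a in sets)
    return bounded and closed and distributive and complemented

omega = {1, 2, 3, 4}
field = [set(), {1, 2}, {3, 4}, omega]
not_a_field = [set(), {1}, omega]  # {1} has no complement here

print(is_boolean_algebra_of_sets(omega, field))        # True
print(is_boolean_algebra_of_sets(omega, not_a_field))  # False
```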