My Phil Stat seminar has been meeting for 4 weeks now, and we’re soon to experiment with a small group of outside participants zooming in (write to us if you are interested in joining). I’ve been so busy with the seminar that I haven’t blogged. Have you been following? All the materials are on a continually updated syllabus on this blog (SYLLABUS). We’re up to Excursion 2, Tour II.
Last week, we did something unusual: we read from Popper’s Conjectures and Refutations. I wanted to do this because scientists often appeal to distorted and unsophisticated accounts of Popper, especially in discussing falsification and what demarcates good science from poor science. While I don’t think Popper made good on his most winning slogans, he gives us many seminal jumping-off points for improved accounts of falsification, induction, corroboration, and demarcation.
Do people still assume EI is “rational”? Good science can’t be demarcated from poor, questionable, or fringe science by its empirical method, if that method is understood as enumerative induction (EI), says Popper, rightly. While it comes in many forms, EI (taken up in Ex 2 Tour I) moves from observed instances (or frequencies) of A’s that are B’s to claims like: the next A will be a B, most A’s are B’s, k% of A’s are B’s, or even that the probability an A is B is k. Such a method is unreliable, so we shouldn’t be keen to justify it. It permits inferring poorly probed claims and violates the minimal requirement for evidence (weak severity).
Yet we are familiar with claims from epistemologists and others that some version of (EI) is a “rational” method. (It is the basis for famous quandaries in legal reasoning.) An additional stipulation is generally something like “nothing else is known” (which is itself knowing something else), but even that does not help. Neither do claims about indifference or uninformativeness. The philosopher Reichenbach called EI “the straight rule” and tried for many years to justify it, unsuccessfully. Lack of randomness and biasing selection effects, both in generating the data and in the choice of reference classes, are key issues. Although the data in EI may be seen as relative frequencies, it is very different from frequentist statistics. (See SIST, 110-11 on Neyman (1955): “Statistics as the Frequentist Theory of Induction”.)
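To see concretely why EI lacks error control, here’s a minimal simulation (my own illustration; the population and selection probabilities are hypothetical, not from SIST). The straight rule, inferring “k% of A’s are B’s” from the observed sample frequency, tracks the truth under random sampling, but a biasing selection effect makes the very same rule confidently wrong:

```python
import random

random.seed(1)

# Hypothetical population of A's: 30% are B's.
population = [True] * 300_000 + [False] * 700_000

def straight_rule(sample):
    """EI / the straight rule: infer k% of A's are B's from the sample."""
    return sum(sample) / len(sample)

# Under random sampling, the observed frequency tracks the truth.
random_sample = random.sample(population, 1_000)
print(f"random sampling estimate: {straight_rule(random_sample):.2f}  (truth: 0.30)")

# Under a biasing selection effect (B's are 5x as likely to be recorded,
# as in a convenience sample), the same rule is far off.
biased_sample = [x for x in population
                 if random.random() < (0.005 if x else 0.001)]
print(f"biased sampling estimate: {straight_rule(biased_sample):.2f}  (truth: 0.30)")
```

Nothing in the rule itself signals the difference; probing the sampling assumptions is a separate job that EI leaves undone.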
Popper also rejected the empiricist assumption that observations are known relatively unproblematically. If they are at the “foundation,” it is only because there are apt methods for testing their validity. In fact, we dub claims observable because, or to the extent that, they are open to stringent checks. (Popper: “anyone who has learned the relevant technique can test it” (1959, p. 99).) Accounts of hypothesis appraisal that start with “evidence x,” as in confirmation logics, vastly oversimplify how data enter into learning.
Demarcation and Investigating Bad Science. Popper’s right that if using enumerative induction (EI) makes you scientific, then anyone from an astrologer to one who blithely moves from observed associations to full-blown theories is scientific. Yet Popper’s criterion of testability and falsifiability, as it is typically understood, may be nearly as bad. It is both too strong and too weak. Any crazy theory found false would be scientific, and our most impressive theories are not deductively falsifiable. The only theories that deductively prohibit observations are of the sort one mainly finds in philosophy books: “All swans are white” is falsified by a single non-white swan. There are some statistical claims and contexts, I argue, where it’s possible to achieve deductive falsification: claims such as “these data are independent and identically distributed (IID)”. Going beyond a mere denial to reliably replacing falsified claims, of course, requires more work.
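As a sketch of what putting an IID claim to a stringent check can look like (my illustration, not from the seminar materials; strictly, the rejection here is statistical rather than deductive), a Wald–Wolfowitz runs test counts runs above and below the median. White noise passes; a trending series fails decisively:

```python
import math
import random

def runs_test_z(xs):
    """Wald-Wolfowitz runs test about the median: under IID, the
    standardized run count z is approximately Normal(0, 1)."""
    med = sorted(xs)[len(xs) // 2]
    signs = [x > med for x in xs if x != med]
    n1, n2 = sum(signs), len(signs) - sum(signs)
    runs = 1 + sum(a != b for a, b in zip(signs, signs[1:]))
    mu = 2 * n1 * n2 / (n1 + n2) + 1
    var = (mu - 1) * (mu - 2) / (n1 + n2 - 1)
    return (runs - mu) / math.sqrt(var)

random.seed(0)
iid_data = [random.gauss(0, 1) for _ in range(200)]
trending = [0.01 * t + random.gauss(0, 0.1) for t in range(200)]

print(f"white noise: z = {runs_test_z(iid_data):+.1f}")   # small |z|: passes
print(f"trending:    z = {runs_test_z(trending):+.1f}")   # huge |z|: IID rejected
```

Here the falsification, when it occurs, is of the statistical assumption itself, clearing the ground for the further work of replacing it.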
However, interesting claims about mechanisms and causal generalizations require numerous assumptions (substantive and statistical) and are rarely open to deductive falsification. Their tests can be reconstructed as deductively valid, but warranting the premises requires evidence-transcending (ampliative) inferences. So there’s a “whiff of induction” even in Popper (as some of his critics claim), even though not of the crude (EI) sort. (Note Popper’s claim about when a statistical hypothesis is falsified below.)
“The Demise of the Demarcation Problem”. Forty years ago, Larry Laudan’s famous (1983) paper declared the demarcation problem taboo. This is a highly unsatisfactory situation for philosophers of science wishing to grapple with today’s statistical replication crisis. Laudan and I generally see eye to eye, so perhaps our disagreement here is just semantics. I share his view that what really matters is determining if a hypothesis is warranted or not, rather than whether the theory is “scientific,” but surely Popper didn’t mean logical falsifiability sufficed. Popper is clear that many unscientific theories (e.g., Marxism, astrology) are falsifiable. It’s clinging to falsified theories that leads to unscientific practices. It’s trying and trying again in the face of unwelcome results, cherry-picking cases that support preferred hypotheses, and all the rest of the biases that make it easy to find apparent support for poorly probed claims.
Following Laudan, philosophers tend to shy away from saying anything general about science versus pseudoscience; the predominant view is that there is no such thing. One gets the impression that the demarcation task is being left to committees investigating allegations of poor science or fraud. They are forced to articulate what to count as fraud, as bad statistics, or as mere questionable research practices (QRPs). People’s careers depend on their rulings: they have “skin in the game,” as Nassim Nicholas Taleb (2018) might say.
Free of the qualms that give philosophers of science cold feet, the committees investigating fraudster Diederik Stapel advance some obvious, yet crucially important, rules with Popperian echoes:
One of the most fundamental rules of scientific research is that an investigation must be designed in such a way that facts that might refute the research hypotheses are given at least an equal chance of emerging as do facts that confirm the research hypotheses. (Levelt Committee, Noort Committee, and Drenth Committee 2012).
This is the gist of our minimal requirement for evidence (weak severity principle). To scrutinize the scientific credentials of an inquiry is to determine if there was a serious attempt to detect and report mistaken interpretations of data.
Demarcating Inquiries (4 requirements). However, I say Popper confuses things by making it sound as if he’s asking: When is a theory unscientific? What he is actually asking, or should be asking, is: When is an inquiry into a theory, or an appraisal of claim H, unscientific? We want to distinguish meritorious modes of inquiry from those that are BENT. Despite being logically falsifiable, theories can be rendered immune from falsification by means of cavalier methods for their testing. Some areas have so much noise and/or flexibility that they can’t or won’t distinguish warranted from unwarranted explanations of failed predictions. It does not suffice, for an inquiry to be scientific, that there is criticism of methods and models. The criticism must be constrained by what’s actually responsible for any alleged problems. It may be correct to criticize an inference to a hypothesis H, but for the wrong reason. For instance, the problem might be traced to H’s improbability when in fact the flaw is a lack of error control stemming from data-dredging, optional stopping, and P-hacking.
A scientific inquiry or test must be able:
(a) to block inferences that fail the minimal requirement for severity
(b) to embark on a reliable probe to pinpoint blame for anomalies
(c) (from (a)) to directly pick up on altered error-probing capacities due to biasing selection effects, optional stopping, cherry-picking, data-dredging, etc. (see the sketch after this list)
(d) (from (b)) to test and falsify claims.
So we get four requirements for an inquiry to be scientific.
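Requirement (c) is easy to exhibit by simulation. Here is a minimal sketch (mine; the stopping rule and numbers are illustrative) of “trying and trying again”: sampling from a true null hypothesis, testing after every observation, and stopping at the first nominally significant result. The data alone look the same either way; what changes is the method’s capacity to avoid erroneous rejections:

```python
import random

random.seed(2)

def peek_until_significant(max_n=100, z_crit=1.96):
    """Draw from Normal(0, 1) under a true null H0: mean = 0, computing
    a z-statistic after each observation and stopping at the first
    nominally significant result (an optional stopping rule)."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += random.gauss(0, 1)
        if abs(total / n ** 0.5) > z_crit:
            return True   # declared "significant", though H0 is true
    return False

trials = 2_000
rate = sum(peek_until_significant() for _ in range(trials)) / trials
print(f"nominal 5% test, type I error rate with optional stopping: {rate:.0%}")
# Peeking after every observation up to n = 100 yields roughly 35-40%
# erroneous rejections, not 5%.
```

The inflated error rate is a property of the method, not of any particular data set, which is exactly the next point.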
Methodological probability. A valuable idea to take from Popper is that probability in learning attaches to a method: it is methodological probability. An error probability is a special case of a methodological probability.
Popper wrote to me expressing regret that he didn’t learn more statistics, but he referred to Fisher, Neyman and Pearson, and also Peirce in explaining when a statistical hypothesis is to count as falsified. Although extremely rare events may occur, Popper notes:
such occurrences would not be physical effects, because, on account of their immense improbability, they are not reproducible at will … If, however, we find reproducible deviations from a macro effect … deduced from a probability estimate … then we must assume that the probability estimate is falsified. (Popper 1959, p. 203)
In the same vein, we heard Fisher deny that an “isolated record” of statistically significant results suffices to warrant a reproducible or genuine effect (Fisher 1935a, p. 14). Even where a scientific hypothesis is thought to be deterministic, inaccuracies and knowledge gaps involve error-laden predictions; so our methodological rules typically involve inferring a statistical hypothesis. Popper calls it a falsifying hypothesis: a hypothesis inferred in order to falsify some other claim. A first step is often to infer that an anomaly is real by falsifying a “due to chance” hypothesis. That is the role of statistical significance tests.
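In that spirit, here’s a minimal sketch (mine; the sample and effect sizes are hypothetical) contrasting an isolated significant result with a reproducible deviation. Requiring the effect to show up in each of three independent replications means the “due to chance” hypothesis gets falsified (practically) only when the effect is genuine:

```python
import random

random.seed(3)

def significant(effect, n=30, z_crit=1.96):
    """One significance test of H0: mean = 0 on n Normal(effect, 1) draws."""
    total = sum(random.gauss(effect, 1) for _ in range(n))
    return abs(total / n ** 0.5) > z_crit

def reproducible(effect, replications=3):
    """Popper/Fisher-style demand: the deviation recurs across
    independent replications, not as an isolated record."""
    return all(significant(effect) for _ in range(replications))

trials = 2_000
for effect in (0.0, 0.6):
    rate = sum(reproducible(effect) for _ in range(trials)) / trials
    print(f"true effect = {effect}: all-3-replications significant: {rate:.1%}")
# Under H0 (effect = 0.0) the rate is ~0%; with a genuine effect it is high.
```

An isolated small P-value can occur by chance; a deviation reproducible at will is, in Popper’s sense, a physical effect.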
Insofar as we falsify general scientific claims, we are all methodological falsificationists. Some people say, “I know my models are false, so I’m done with the job of falsifying before I even begin.” Really? That’s not falsifying. Let’s look at your method: always infer that H is false, or that it fails to solve its intended problem. Then you’re bound to infer this even when it is erroneous. (Were H a null hypothesis of “no effect,” you’d always be inferring the effect is genuine.) Your method fails the minimal severity requirement.
Note on Language. Keen to distinguish their accounts from the inductivist and probabilist accounts of the day, Popper, Fisher, Neyman and Pearson talk of accepting, deciding, and inductive behavior. While these terms are fine when correctly understood, I think it’s preferable to use “induction” to refer to any ampliative (and thus error-prone) inference, one that goes beyond the data. (See Souvenir F, SIST p. 87.) Rather than saying “it is warranted to infer H” but “H is not justified” (as Popper did), we may use either, although I tend to use the former.