Proof By Selected Instances
Carl Sagan: "Many years ago I awoke in the dead of night in a cold sweat, with the certain knowledge that a close relative had suddenly died. I was so gripped with the haunting intensity of the experience that I was afraid to place a long-distance phone call, for fear that the relative would trip over the telephone cord (or something) and make the experience a self-fulfilling prophecy. In fact, the relative is alive and well, and whatever psychological roots the experience may have, it was not a reflection of an imminent event in the real world.
"After my experience I did not write a letter to an institute of parapsychology relating a compelling predictive dream which was not borne out by reality. That is not a memorable letter. But had the death I dreamt actually occurred, such a letter would have been marked down as evidence for precognition. The hits are recorded, the misses are not.
"Thus human nature unconsciously conspires to produce a biased reporting of the frequency of such events. If enough independent phenomena are studied and correlations sought, some will of course be found. If we know only the coincidences and not the unsuccessful trials, we might believe that an important finding has been made. Actually, it is only what statisticians call the fallacy of the enumeration of favorable circumstances." (Counting the hits and ignoring the misses.)
Another example is the Texas Sharpshooter effect: a man shoots at the side of a barn and then draws targets around the holes, turning every shot into a bull's-eye. For example, if an epidemiologist were to draw a circle around the greater Boston area, he would find an incidence of leukemia comparable to that of the rest of the USA. Draw a circle around Woburn and he'd find a worrisome elevation. Draw a circle around the Pine Street neighborhood and he'd find an alarming cluster. Is it a real cluster? Or is he just drawing bull's-eyes where he found bullet holes?
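The sharpshooter's trick can be reproduced with nothing but chance. The following is a minimal sketch, not real epidemiology: the function name scatter_cases, the 1000 cases, and the 100 equal-sized neighborhoods are all hypothetical choices made for illustration. Every neighborhood has exactly the same underlying risk, yet the one picked out after the fact still looks like a cluster.

```python
import random

def scatter_cases(num_cases=1000, num_neighborhoods=100, seed=0):
    """Drop cases uniformly at random; return the case count per neighborhood."""
    rng = random.Random(seed)
    counts = [0] * num_neighborhoods
    for _ in range(num_cases):
        # Every neighborhood is equally likely: there is no real hot spot.
        counts[rng.randrange(num_neighborhoods)] += 1
    return counts

counts = scatter_cases()
average = sum(counts) / len(counts)
worst = max(counts)  # the circle drawn around the bullet holes
print(f"average cases per neighborhood: {average:.1f}")
print(f"worst neighborhood: {worst} cases ({worst / average:.1f}x the average)")
```

The worst neighborhood stands well above the average even though nothing distinguishes it. A fair test would name the neighborhood first and count afterwards.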
Researchers who report such findings rarely tell you how many possible combinations of data arrangements they searched through in the process of arriving at their conclusions, nor how many contorted definitions of "closeness" they used to get their "statistically significant" results. Correlations have a distribution just like any random variable. If you crank through enough data, a certain number of correlations, even from purely random data sets, will fall far enough out in the tails of that distribution to appear significant. Every quantitative researcher knows this. If you torture the data long enough, it will talk.
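The claim that random data yields "significant" correlations is easy to check. This is a minimal sketch under stated assumptions: the helper names pearson_r and count_spurious are made up for illustration, each series holds 20 independent Gaussian values, and 0.444 is the approximate two-tailed 5% critical value of the correlation coefficient for a sample of 20.

```python
import math
import random

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def count_spurious(num_pairs=1000, n=20, r_crit=0.444, seed=0):
    """Count unrelated pairs whose correlation still clears the 5% threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_pairs):
        # Both series are independent noise, so any correlation is an accident.
        xs = [rng.gauss(0, 1) for _ in range(n)]
        ys = [rng.gauss(0, 1) for _ in range(n)]
        if abs(pearson_r(xs, ys)) > r_crit:
            hits += 1
    return hits

hits = count_spurious()
print(f"{hits} of 1000 unrelated pairs look 'significant' at the 5% level")
```

Roughly one pair in twenty should clear the bar by chance alone. Report only those and the result looks like a discovery.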
Just go into the forest looking for any interesting leaf pattern. The odds are pretty good that you will find one. Then come out saying that this pattern is exactly what you were looking for. Prediction, or out-of-sample testing, is one very strong way to avoid accepting a spurious conclusion resulting from data manipulation, because coincidences in one data set are very unlikely to recur in a different, independent set.
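Out-of-sample testing can be sketched the same way. The example below is an illustration under assumptions, not a recipe: the function best_in_sample_then_retest, the 200 candidate predictors, and the split into two halves of 20 points are all hypothetical. Every series is pure noise, so whatever the search turns up is a coincidence by construction.

```python
import math
import random

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def best_in_sample_then_retest(num_candidates=200, n=20, seed=0):
    """Pick the best-looking noise predictor on one half, retest on the other."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(2 * n)]
    candidates = [[rng.gauss(0, 1) for _ in range(2 * n)]
                  for _ in range(num_candidates)]
    # Search the first half for the most impressive-looking "pattern".
    best = max(candidates, key=lambda c: abs(pearson_r(c[:n], target[:n])))
    r_in = pearson_r(best[:n], target[:n])
    # Retest the same candidate on data that played no part in the search.
    r_out = pearson_r(best[n:], target[n:])
    return r_in, r_out

# Repeat the whole search 20 times so one lucky run does not mislead.
results = [best_in_sample_then_retest(seed=s) for s in range(20)]
mean_in = sum(abs(r_in) for r_in, _ in results) / len(results)
mean_out = sum(abs(r_out) for _, r_out in results) / len(results)
print(f"mean |r| on the searched half: {mean_in:.2f}")
print(f"mean |r| on the held-out half: {mean_out:.2f}")
```

The pattern that looked strong on the searched half should fall back toward ordinary noise on the held-out half, which is the reason a prediction made in advance is so much more convincing than a fit found afterwards.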
A large professional organization once surveyed its members on a variety of topics. One of the questions on the survey was "Did you vote in the last society election?" When the responses to this question were compared with the actual voting records, the pollsters noted a large discrepancy: the percentage of respondents who said they had voted was significantly larger than the percentage of society members who actually had voted.
Of course! Those who responded to the survey were a self-selected subgroup of the general membership: those members who are more likely to participate in organizational affairs such as voting and polling.
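The survey discrepancy needs no dishonest respondents, only self-selection. Here is a minimal sketch with invented numbers: the function simulate_survey and the "engagement" score are hypothetical, and the assumption is simply that a member's engagement drives both the chance of voting and the chance of answering the survey. Everyone who answers tells the truth.

```python
import random

def simulate_survey(num_members=10_000, seed=0):
    """Return (true voting rate, voting rate among survey respondents)."""
    rng = random.Random(seed)
    voted_total = 0
    responded = 0
    responded_and_voted = 0
    for _ in range(num_members):
        # Assumed model: engagement in [0, 1) is the probability of voting
        # and, independently, the probability of returning the survey.
        engagement = rng.random()
        voted = rng.random() < engagement
        answered = rng.random() < engagement
        voted_total += voted
        if answered:
            responded += 1
            responded_and_voted += voted  # respondents answer honestly
    return voted_total / num_members, responded_and_voted / responded

true_rate, survey_rate = simulate_survey()
print(f"actual voting rate:            {true_rate:.1%}")
print(f"voting rate among respondents: {survey_rate:.1%}")
```

Under this model about half the members vote, but about two thirds of respondents did, because the people who answer surveys are disproportionately the people who vote.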
"They say 1 out of every 5 people is Chinese. How is this possible? I know hundreds of people, and none of them is Chinese."
And then there is the optimist who exclaims "I've thrown three sevens in a row. Tonight I can't lose!"
And then there is President Dwight Eisenhower, expressing astonishment and alarm on discovering that half of all Americans have below-average intelligence.