# History of Statistics

## Overview

*Statistics* is used to make predictions or draw conclusions based on data. The data (facts or measurements) are selected and collected from a sample of a population. Statistical analysis tries to make sense of the data: to determine whether a difference or change in the data is due to chance or to a consistent, systematic relationship. The more systematic the observed relationship, the more certain a prediction will be. The more random error observed, the more uncertain a prediction will be.

Statisticians attach a measure of uncertainty to a prediction. When making inferences about a population, the statistician tries to estimate how well a summary statistic computed from a sample estimates the corresponding value in the population.

A *parameter* is an element of a system that is useful, or critical, for identifying the system or for evaluating its performance, status, or condition.


## Timeline of statistics

2012 The *Large Hadron Collider* confirms the existence of a *Higgs boson* at a significance of five standard deviations: there is only about a one in 3.5 million chance that data like these would arise by coincidence if the particle did not exist.

2012 *Nate Silver*, statistician, successfully predicts the results for all 50 states in the U.S. presidential election. He becomes a media star and sets up what may be an overreliance on statistical analysis for the 2016 election.

1997 The term *big data* first appears in print.

1965 Bradford Hill creates *Hill's criteria for causation*. His criteria include nine principles for using epidemiological evidence to show a causal relationship between a presumed cause and an observed effect. They are widely used in public health research.

1958 Doctors begin to use the *Kaplan–Meier estimator*, a non-parametric statistic, to estimate the efficacy of a treatment. It estimates the survival function, expressed as the fraction of patients who survive a certain length of time after treatment.

1953 The *Kinsey Report* on female human sexual behavior is released. It summarizes a large-scale collection of data that can be used for a variety of statistical analyses.

1950s *Genichi Taguchi’s* robust statistical *design methods* are used to improve the quality of automobiles, manufactured goods, and electronic components, helping to drive Japanese industry to prominence. Later his methods are applied to engineering, biotechnology, marketing, and advertising.

1950 Sir Austin Bradford Hill and Richard Doll use a case-control study to *show a strong link between cigarette smoking and lung cancer*.

1948 Claude Shannon founds *information theory*, essential for the digital age, which quantifies information and determines the limits on how signals can be stored, compressed, processed, and communicated.

1948 The *Kinsey Report* on male human sexual behavior is released. It is a large-scale survey that collected data from over 5000 males for use in a variety of statistical analyses.

1946 Richard Cox derives a theorem that justifies a logical interpretation of probability, provided a set of logical assumptions is met.

1944 *The German tank problem*: During WWII the Allies use statistical analysis of serial numbers on gearboxes and other components from captured German Panther tanks to estimate German tank production. The estimate for February 1944 is 270 tanks; German records later show the actual figure was 276.
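
The idea behind this kind of estimate can be sketched with the standard textbook estimator, which assumes serial numbers run from 1 to N and that the observed ones are a random sample without replacement. With k observed serials and largest observed value m, the estimate is m + m/k - 1. This is an illustrative sketch, not necessarily the exact calculation the Allied analysts used, and the sample serial numbers below are hypothetical.

```python
def estimate_total(serials):
    """Estimate the population size N from serial numbers sampled
    without replacement from 1..N, using m + m/k - 1."""
    k = len(serials)   # number of observed serial numbers
    m = max(serials)   # largest observed serial number
    return m + m / k - 1


# Hypothetical serial numbers read from four captured items.
sample = [19, 40, 42, 60]
estimate = estimate_total(sample)
print(estimate)  # 74.0
```

The estimate is always at least as large as the largest serial seen, and the correction term m/k shrinks as more items are observed.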

1940-45 *Alan Turing* uses *Bayesian statistics* and electromechanical machines to help break the German Enigma code. *Colossus* (first programmable electronic computer) is built at Bletchley Park to attack the separate German Lorenz cipher.

1937 Jerzy Neyman introduces confidence intervals in statistical testing, which leads to modern *scientific sampling*.

1935 R. A. Fisher publishes *The Design of Experiments*, which sets out how to design scientific experiments and decide whether their results are significant.

1908 William Sealy Gosset, a brewer for Guinness in Dublin, develops the *t-test* to judge the quality of beer from a small sample size.

1907 Francis Galton describes the *wisdom of crowds*: the average of many uninformed guesses will be close to the actual value. He generalizes this idea after analyzing 787 guesses collected from villagers in a *Guess the weight of an Ox contest*.

1904 Charles Spearman (English, 1863–1945) extends the Pearson correlation coefficient to create *Spearman's rank correlation coefficient*, a measure of the strength of association between two variables that makes no assumptions about the frequency distributions of the underlying variables.

1902 Pearson’s contributions firmly establish *statistics* as a discipline. However, he saw his goal as developing and applying statistical methods to study heredity and evolution.

1900 Karl Pearson creates the *Pearson chi-squared test* to establish whether two variables are independent of each other. It uses two sets of categorical data to evaluate the probability that the observed difference between the sets arose by chance. It is suitable for unpaired data from large samples.

1896 Karl Pearson lectures on a general theory of *skew correlation* and *nonlinear regression*, which is not published until 1905.

1896 Karl Pearson lectures on *experimental* and *theoretical* material on *errors of judgement*, *measurement errors*, and the *variation* over time of personal equations of individual observers.

1901 Pearson, with others, publishes *Tables for Statisticians and Biometricians* in *Biometrika*. Later editions follow in 1914 and 1931.

1900 Thorvald N. Thiele introduces the mathematical theory of *Brownian motion*, cumulants, and likelihood functions. The cumulants (*κₙ*) of a probability distribution are a set of quantities that provide an alternative to the moments of the distribution: the first cumulant is the mean, the second is the variance, and the third is the third central moment.

1898 Ladislaus von Bortkiewicz shows that the number of soldiers killed by horse kicks follows a predictable pattern described by the *Poisson distribution* (named for Siméon Denis Poisson). It is a discrete probability distribution that gives the probability of a given number of events occurring in a fixed interval of time or space (distance, area, or volume) if the events occur at a known constant rate and independently of the time since the last event.
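
As a minimal sketch of that definition, the probability of seeing exactly k events when events occur at an average rate λ per interval is λ^k e^(-λ) / k!. The function name `poisson_pmf` and the rate of 0.61 events per interval are illustrative assumptions, not values taken from this timeline.

```python
from math import exp, factorial


def poisson_pmf(k, rate):
    """Probability of exactly k events in one interval, given an
    average of `rate` events per interval."""
    return rate ** k * exp(-rate) / factorial(k)


rate = 0.61  # illustrative average number of events per interval
for k in range(5):
    print(k, round(poisson_pmf(k, rate), 4))

# The probabilities over all possible counts sum to 1.
total = sum(poisson_pmf(k, rate) for k in range(50))
```

With a rate below 1, most intervals contain no event at all, and the probability falls off quickly as the count rises.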

1895 Karl Pearson begins to define *symmetrical* and *asymmetrical curves* of both limited and unlimited range in either or both directions (most unimodal, with some U, J, and reverse J).

1895 Karl Pearson creates the *Pearson correlation coefficient*, also called the *Pearson product-moment correlation coefficient*.

1894 Pearson's lectures describe the *method of moments*, a general method for determining the values of the parameters of a frequency distribution fitted to a set of observational or experimental data.

1894 Karl Pearson introduces the *standard deviation* as a measure of the amount of variation, or dispersion, in a set of data. For normally distributed data, about 68% of the values lie within one standard deviation of the mean. It replaces earlier terms such as root mean square error, error of mean square, and mean error.
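
A small sketch of both ideas, using Python's standard `statistics` module: the population standard deviation of a data set, and the share of a normal distribution that lies within one standard deviation of its mean. The data values are hypothetical and chosen so the arithmetic is easy to follow.

```python
from statistics import NormalDist, mean, pstdev

# Hypothetical measurements.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mu = mean(data)       # arithmetic mean of the data
sigma = pstdev(data)  # population standard deviation
print(mu, sigma)      # 5 2.0

# Share of a normal distribution within one standard deviation
# of the mean (the "68%" figure).
standard_normal = NormalDist(mu=0, sigma=1)
within_one_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)
print(round(within_one_sd, 4))  # 0.6827
```

The 68% figure is a property of the normal distribution; other distributions with the same standard deviation can place a different share of values in that range.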

1893 Pearson lectures on *Normal Curves* and *normal correlation* for three, four, and n variables.

1892 Pearson lectures on *Variation*.

1891 Pearson develops his theories on *Laws of Chance* with coin-tossing and card-drawing experiments and lectures on *probability theory*, the concept of *correlation*, and begins to develop the mathematical branch of *statistics*.

1891 Pearson is an enthusiast for *graphical representation* and lectures on *Geometry of Statistics*.

1889 Pearson is fascinated by *Galton’s correlation* as a category broader than causation and begins to develop statistical ideas that will give psychology, anthropology, medicine, and sociology a means of mathematical validation.

1888 Francis Galton describes *correlation* as the relationship between two variables.

1883 Charles S. Peirce publishes *A Theory of Probable Inference*. In it he describes the importance of a *repeated measures* design, *blinded studies*, and *controlled randomized experiments*. He uses logistic regression, correlation, and smoothing, improves the treatment of outliers, and introduces the terms *confidence* and *likelihood*.

1880 Walter Frank Raphael Weldon *searches for a method to use data from studies of animal and plant populations to support evolution*. Specifically, he wants to relate the parameters of asymmetrical distributions and the probability of correlated variables. He *consults with Pearson for help*.

1877 Francis Galton describes *regression to the mean*.

1877 Charles S. Peirce develops a *theory of statistical inference* and publishes it in *Illustrations of the Logic of Science*. He studies deduction, mathematical logic, and the logic of science, including induction (which he called retroduction or abduction). He considers how induction (abduction) can be used to form a hypothesis that explains the facts, placing it on a level with deduction, and how it can be applied in a practical sense (pragmatism). For example: X is harder than Y, so X will scratch Y and Y will not scratch X. Therefore, a person will habitually use X to scratch Y, or keep X away from Y so that Y is not scratched. Peirce believed mathematics is the study of what is or is not logically possible, without concern for what actually exists, whereas philosophy discovers from ordinary everyday experience.

1869 Charles Joseph Minard creates a graphic diagram of Napoleon’s march on Moscow that shows the distance covered, the number of men still alive at each stage of the march, and the temperatures they endured.

1859 Florence Nightingale makes a circular chart that shows, month by month, the causes of death among soldiers in the Crimean War, to convince the War Office to better prepare to save lives. It is known as the *Nightingale rose*, *coxcomb chart*, or *polar area diagram*, and is described as a *forerunner of the pie chart*.

1854 John Snow conducts the first modern study of *epidemics*, tracing a cholera outbreak in London to a contaminated water pump.

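1849 Charles Babbage designs the *difference engine*, a mechanical forerunner of the modern computer, to automate the calculation of numerical tables. Ada Lovelace, Lord Byron’s daughter, writes what is often called the *first computer program* for his related Analytical Engine.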

1840 William Farr organizes an official system to record and *store data* on the causes of death in England and Wales, so that disease and epidemics can be tracked and analyzed statistically for medical purposes.

1842 Adolphe Quetelet develops the concept of the *average man*, which he derives from the normal distribution of characteristics such as a person's height, body mass index, and earnings, as explained in his *Treatise on Man and the Development of his Faculties*. This marks the introduction of social science statistics.

1831 Lambert Adolphe Quetelet studies variation in body measurements and in social attributes related to crime. He takes up the ideas of political arithmetic, which he believes can be used without bias if applied to descriptive statistics (he terms this social physics). His ideas about large numbers include: causes are proportional to the effects they produce (if a person lifts twice as much as another, he is twice as strong as that person), and accurate conclusions are possible only with large numbers. These are initial ideas for *the laws of large numbers*.

1808 Carl Friedrich Gauss, with Pierre-Simon Laplace, derives the *normal distribution*, or bell curve. A normal distribution implies that the mean, median, and mode are equal and that the curve is symmetric about the center: 50% of the values are less than the mean and 50% are greater. It is also related to variation and error.

1805 Adrien-Marie Legendre creates the method of *least squares to fit a curve* to a set of observations.

1795 Carl Friedrich Gauss describes the method of *least squares* analysis as a method of estimation. However, Adrien-Marie Legendre is first to publish the method in 1805.

1791 Sir John Sinclair makes the first use of the English word *statistics* in his *Statistical Account of Scotland*.

1790 *First US census*, as required by the Constitution, is ordered by Thomas Jefferson, and counts 3.9 million Americans.

1786 William Playfair explores the use of *graphs and bar charts* to represent economic data.

1761 Rev. Thomas Bayes proves Bayes’ theorem, which uses conditional probability to test beliefs and hypotheses. In an interview with Kevin Gray, Andrew Gelman explains the basics and history of Bayesian theory.
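
To illustrate how the theorem updates a belief, the sketch below computes P(H | E) = P(E | H) P(H) / P(E) for a hypothesis H and a piece of evidence E. The function name `posterior` and the numbers (a 1% prior, a 95% chance of the evidence if H is true, a 5% chance if H is false) are hypothetical values chosen for illustration.

```python
def posterior(prior, likelihood, false_positive_rate):
    """Bayes' theorem for a yes/no hypothesis H and evidence E.

    prior               = P(H)
    likelihood          = P(E | H)
    false_positive_rate = P(E | not H)
    Returns P(H | E).
    """
    # Total probability of seeing the evidence at all.
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: a rare hypothesis and fairly reliable evidence.
p = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(round(p, 3))  # 0.161
```

Even with fairly reliable evidence, the updated belief stays modest here because the hypothesis was unlikely to begin with.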

1749 Gottfried Achenwall coins the *word statistics* (German: Statistik). His definition is the information needed to run a nation state (previously termed political arithmetic).

1741 Johann Peter Sussmilch publishes the first edition of *The Divine Order in the Changes of the Human Species, as Demonstrated by its Birth, Death and Propagation*. He continues to collect data and publish updates for 20 years. This extensive set of vital statistics is published across Europe and is used by statisticians to the extent that he is called the father of demographic statistics.

1715 Edmund Halley draws the first *data created visual map*, showing the path of a solar eclipse.

1713 Jacob Bernoulli’s book on combination and probability, *Ars conjectandi*, is published. It includes the *law of large numbers*: the more times an experiment is repeated, the more closely the average of the results approaches the expected value.
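
The effect can be seen in a short simulation of fair coin flips, sketched below with Python's standard `random` module. The function name `proportion_heads` and the fixed seed are assumptions made so the example is repeatable.

```python
import random


def proportion_heads(n_flips, seed=0):
    """Simulate n_flips fair coin flips and return the share of heads."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips


# The share of heads tends toward 0.5 as the number of flips grows.
for n in (10, 1_000, 100_000):
    print(n, proportion_heads(n))
```

Small runs can land far from 0.5, while large runs cluster tightly around it.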

1693 Edmund Halley uses statistics and data from mortality tables to relate *death rates* to age for *life insurance companies*.

1657 Christiaan Huygens writes the *first book on probability*, *On Reasoning in Games of Chance*.

1654 Blaise Pascal and Pierre de Fermat create the *theory of probability* from their study of games of chance, to predict the outcomes of such games.

1644 Michael van Langren creates the *first known graph* of statistical data, showing the differing estimates of the distance between Toledo and Rome and the error among them.

1570 Tycho Brahe improves estimates of star and planet locations with *arithmetic means*.

1560 Gerolamo Cardano determines *probabilities for dice games*.