Models vs. Algorithms (and which are hotter?)

Leo Breiman worked for seven years as an academic probabilist, then spent thirteen years as a full-time freelance consultant, a period that shaped many of the opinions he shares in his paper, Statistical Modeling: The Two Cultures.

There are two competing schools of thought when it comes to modeling data with statistics. The data modeling culture, consisting primarily of textbook statisticians, assumes a stochastic model that can reasonably explain the given data and checks the goodness of fit to decide whether to keep the model or scrap it. The algorithmic modeling culture (machine learning), practiced by a small minority of statisticians (including Breiman himself), tries not to oversimplify an intricate natural process with a model. Instead, it attempts to construct an algorithm, f, that takes input values or vectors (namely x) and outputs a predicted corresponding y value/vector. This algorithm can be a mathematical equation, such as a neural network, or a decision tree implementation. Furthermore, instead of a yes/no answer about whether the model fits, success is measured by a predictive accuracy metric, usually by hiding a portion of the data from training and using it later for testing.
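To make the algorithmic-culture workflow concrete, here is a minimal sketch in Python with scikit-learn: an algorithm f (a decision tree, one of the model types discussed later) is fit to training data and judged by predictive accuracy on a held-out portion. The synthetic data and every parameter choice below are my own illustrative assumptions, not anything taken from Breiman's paper.

```python
# A minimal sketch of the algorithmic-culture workflow: learn a function f
# mapping inputs x to outputs y and judge it by held-out predictive accuracy.
# The data are synthetic and purely illustrative.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                      # input vectors x
y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)        # stand-in for nature's unknown mechanism

# Hide a portion of the data from training and use it later to test.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

f = DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)   # the algorithm f
print("held-out accuracy:", accuracy_score(y_test, f.predict(X_test)))
```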

Breiman affirms that an algorithmic model can produce more reliable information about the structure of the relationship between inputs and outputs, whereas stochastic data models have led statisticians to “irrelevant theory” and “questionable conclusions.” He chastises statisticians who model data in a way that makes sense in their minds but not necessarily in a way that represents the underlying, intrinsic processes that produced the data. He works through several examples from his own consulting career to make his point. In the Ozone project, he resorted to tinkering with a very lengthy linear regression equation with quadratic and interaction terms over 450 variables. This was before many of the big machine learning breakthroughs, and Breiman expresses regret at succumbing to the first culture of statisticians in this project, which ultimately produced a model that over-predicted ozone levels (false alarms).
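For context, the kind of data model Breiman describes wrestling with in the Ozone project can be sketched as an ordinary linear regression whose design matrix is expanded with quadratic and interaction terms. The snippet below is a hedged illustration with made-up data; it is not his code, his variables, or his data set.

```python
# A sketch of a linear regression expanded with quadratic and interaction terms,
# the general style of model described for the Ozone project. The data here are
# synthetic placeholders rather than the actual ozone-project variables.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))                        # stand-in meteorological predictors
y = X @ rng.normal(size=8) + rng.normal(size=300)    # stand-in ozone readings

# degree=2 adds every squared term and pairwise interaction to the design matrix
design = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
model = LinearRegression().fit(design, y)
print("in-sample R^2:", model.score(design, y))      # a goodness-of-fit number, not predictive accuracy
```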

He then moves on to a more successful project: identifying whether a chemical compound contained halogens, specifically chlorine. He took advantage of previously gathered data and knowledge about analyzing a compound’s mass spectrum, from which a trained eye can determine whether the compound contains chlorine. The mass spectrum of a compound was the input, and he developed an algorithm (a decision tree that mimics the process a trained eye would go through) that output YES if the compound contained chlorine and NO if it did not. This approach gave Breiman 95% accuracy in his predictions.

Next, Breiman discusses the unrealistic setting in which statistics is taught at universities. Students are taught to first assume a model that generated a set of data and then to perform hypothesis tests under that assumption. This is an unlikely scenario in the real world, where the underlying model is almost never known and is usually more complicated than, say, linear regression. He seems troubled that the go-to answer when faced with a data problem is to think of a model: if the model does not work, the blame falls on that particular model rather than on the practice of choosing a model in the first place. If the model does work (goodness-of-fit tests usually tend to agree with some model), then the hard work is considered done and dangerous extrapolations can be made. The problem is that two different models can both pass the residual checks and the goodness-of-fit test on the same data. Which model, then, is the correct one? Breiman would probably argue that both are wrong. This is the weakness of a yes/no system for judging a model’s fit, and Breiman calls this issue the multiplicity of data models.
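The multiplicity problem is easy to reproduce: two different data models can both look acceptable by in-sample goodness-of-fit measures on the same data while telling very different stories once you extrapolate. The sketch below uses invented data and arbitrary model forms purely to illustrate that point.

```python
# A hedged illustration of the "multiplicity of data models" problem: two different
# parametric models both fit the same data well in-sample yet disagree sharply
# outside the observed range. Data and model forms are invented for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, size=(200, 1))
y = np.log1p(3 * x[:, 0]) + rng.normal(scale=0.05, size=200)   # "nature's" mechanism

linear = LinearRegression().fit(x, y)
cubic_design = PolynomialFeatures(degree=3, include_bias=False)
cubic = LinearRegression().fit(cubic_design.fit_transform(x), y)

print("linear R^2:", linear.score(x, y))                          # fits well
print("cubic  R^2:", cubic.score(cubic_design.transform(x), y))   # also fits well

# Yet the two "acceptable" models disagree once we extrapolate beyond the data.
x_new = np.array([[3.0]])
print("linear prediction at x=3:", linear.predict(x_new)[0])
print("cubic  prediction at x=3:", cubic.predict(cubic_design.transform(x_new))[0])
```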

Moving on to algorithmic modeling, Breiman introduces the two breakthrough algorithmic models of the 1980s used to fit data: neural nets and decision trees. The criterion for these models became predictive accuracy. He then discusses how different modeling algorithms can produce similar accuracy scores (multiplicity again), but notes a trade-off between accuracy and model simplicity. He rates decision trees very highly for interpretability but says their accuracy is merely a B, and he then introduces random forests, which in his experience earn A+ predictions (a comparison along these lines is sketched below).

I completely agree with Breiman’s stance on how data modeling problems should be handled. There is a time and a place for applying models, but they should not be used as a guess at an unknown, underlying natural process from which we extrapolate and draw faulty or weak conclusions. Based on statistics courses I have taken at Cal Poly, I understand why he believes most statisticians oversimplify the data-generating process with a model they believe to be true. I can also see the value of that approach at times, but more often I find myself defending those statisticians only with phrases like “good enough.” Why should data interpreters settle for something that is a good-enough representation of a complicated process instead of more algorithmic approaches to the problem at hand? His statements about the first culture of statisticians may have been brutal at times, but I truly believe he elucidated his claims with well-thought-out, real-world examples, which I really appreciated.
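Here is the comparison referenced above: a single decision tree versus a random forest evaluated on the same held-out data, echoing the B-versus-A+ grading. The synthetic classification task, the random seeds, and the forest size are all assumptions of mine for illustration.

```python
# A small sketch of the interpretability/accuracy trade-off: a single decision
# tree versus a random forest scored on the same held-out data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=8, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)

tree = DecisionTreeClassifier(random_state=3).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=200, random_state=3).fit(X_train, y_train)

print("single tree accuracy:  ", tree.score(X_test, y_test))    # easy to read, usually lower
print("random forest accuracy:", forest.score(X_test, y_test))  # ensemble of trees, usually higher
```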

On a final note, I also agreed with Breiman’s comments regarding the testing of models versus algorithms. Models are judged in a yes/no fashion on a sample (goodness-of-fit tests and residual examinations). With an algorithm, however, one can measure accuracy by training it on a subset of the data and testing it on the subset that was hidden from training to compute an accuracy score. This is a more pragmatic way of judging how well the algorithm works than a yes/no answer. However, if these tests cause one to alter something about the algorithm, it is best to keep another hidden subset of the data that is not even looked at until the final model is ready.
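A minimal sketch of that last point, under my own assumptions about split sizes and the model being tuned: carve off a final test set first, tune on a separate validation split, and only touch the final test set once the model is fixed.

```python
# Train / validation / test split: the final test set stays untouched until the
# model has been chosen, so tuning cannot leak into the final accuracy estimate.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, n_features=15, random_state=4)

# First carve off the final test set and do not look at it during development.
X_dev, X_final, y_dev, y_final = train_test_split(X, y, test_size=0.2, random_state=4)
# Split the remainder into training data and a validation set used for tuning.
X_train, X_val, y_train, y_val = train_test_split(X_dev, y_dev, test_size=0.25, random_state=4)

# Tune a hyperparameter (tree depth) against the validation set only.
best_depth = max(range(2, 10),
                 key=lambda d: DecisionTreeClassifier(max_depth=d, random_state=4)
                               .fit(X_train, y_train).score(X_val, y_val))

# Refit on all development data, then report accuracy on the untouched test set.
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=4).fit(X_dev, y_dev)
print("final test accuracy:", final_model.score(X_final, y_final))
```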