Data scientists are often confronted with the question: why develop a model? The person asking wants a quick answer, and the (presumably cumbersome) process of developing a model looks to them like procrastination. While there is, of course, nothing wrong with providing a ‘back of the envelope’ answer, there is always a risk of losing ground and consistency if no supporting model is developed. In this short article, we will try to understand what a model is and why one might need one.
“Essentially, all models are wrong, but some are useful.” – George E.P. Box
Every one of us, especially those who drive a car, has asked the question: at what time should I leave home if I want to arrive at a particular place at a certain time? The estimate seems to be very simple:
time = distance / speed
This formula is believed to have been known somewhere back in Archimedes’ time and accepted in the 1300s by the Merton scholars at Oxford. This proportion, used in the context of speed, time, and distance, is nothing less than a deterministic physical model of uniform motion. Whenever one wants to calculate anything involving time, distance, and speed, one uses this very ancient model. There is no need to reinvent or derive it again. It is always there, and it demonstrates very clearly how time, distance, and speed interact with each other. Let us try another example.
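As a minimal sketch of how the uniform-motion model answers the departure question, the snippet below turns it into a small reusable function. The function names, the 30 km distance, the 60 km/h average speed, and the 09:00 arrival are illustrative assumptions, not values from the article.

```python
from datetime import datetime, timedelta


def travel_time_hours(distance_km: float, speed_kmh: float) -> float:
    """Uniform-motion model: time = distance / speed."""
    if speed_kmh <= 0:
        raise ValueError("speed must be positive")
    return distance_km / speed_kmh


def departure_time(arrival: datetime, distance_km: float, speed_kmh: float) -> datetime:
    """Latest departure that still meets the arrival time under the model."""
    return arrival - timedelta(hours=travel_time_hours(distance_km, speed_kmh))


# Hypothetical trip: 30 km at an assumed average of 60 km/h, arriving at 09:00.
arrival = datetime(2024, 1, 15, 9, 0)
leave_at = departure_time(arrival, distance_km=30.0, speed_kmh=60.0)
print(leave_at.strftime("%H:%M"))  # 08:30
```

Changing the distance or the speed re-uses the same model without any new derivation.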
Say that you have a circular flower bed in your garden and you would like to extend its circumference by one metre. The question to answer here is: what should the new diameter be?
We might stop here and let a curious reader perform the calculation. Knowing the model (the formula for the circumference of a circle), it is very simple; without it, the problem might take quite a lot of time. A short derivation from that formula shows that the size of the circle under consideration does not matter. Be it the circumference of the Earth, the Moon, or your circular flower bed, extending the radius by approximately 15.9 centimetres (and hence the diameter by approximately 31.8 centimetres) will always extend the circumference by one metre.
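The derivation can be checked in a few lines. From C = 2πr, a longer circumference C + ΔC corresponds to a radius r + Δr with Δr = ΔC / (2π), so the original radius cancels out. The sketch below confirms this numerically; the function names are hypothetical and the Moon and Earth radii are approximate mean values used only for illustration.

```python
import math


def radius_increase(extra_circumference_m: float) -> float:
    """Closed form from C = 2*pi*r: dr = dC / (2*pi), independent of r."""
    return extra_circumference_m / (2 * math.pi)


def radius_increase_numeric(radius_m: float, extra_circumference_m: float) -> float:
    """Same quantity computed the long way, starting from a concrete radius."""
    new_circumference = 2 * math.pi * radius_m + extra_circumference_m
    return new_circumference / (2 * math.pi) - radius_m


# Approximate mean radii in metres (illustrative values).
circles = {"flower bed": 1.0, "Moon": 1.7374e6, "Earth": 6.371e6}

for name, radius in circles.items():
    dr_cm = 100 * radius_increase_numeric(radius, 1.0)
    print(f"{name}: radius grows by about {dr_cm:.1f} cm")
```

Every line of output reports about 15.9 cm, whatever the starting radius.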
As you can see, having another deterministic model at our disposal, this time a geometric one, helped to provide a quick and universal answer. So what is so special about such models that makes our lives easier? There are two things we would stress: structure and reusability.
Let us turn for a moment to another discipline that relies heavily on modelling: computer programming. When one learns to program, the first programs are usually nothing but a file with a sequence of lines executed one after another. While such a file provides a quick answer to an individual problem, reusing it for a slightly different formulation without rewriting everything from the beginning is almost impossible. That is the price of a quick answer: once provided, it is hard to repeat, because it rests on no solid foundation. After recognising this drawback, one moves on to writing functions (methods), which provide some reusability but are still not abstract enough to give adequate modelling possibilities. It is only after introducing interfaces and abstract classes that one reaches the full power of computer models. United into a structure by a UML diagram, these forms of abstraction provide a clear understanding of what should be done and, later on, allow reusability and repeatability, provided, of course, that the models are created and written in a coherent and lucid way.
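To make the progression concrete, here is a minimal sketch of the last stage, an abstract class acting as an interface. The `Shape`, `Circle`, `Square`, and `edging_cost` names are hypothetical and chosen only to echo the flower bed example.

```python
import math
from abc import ABC, abstractmethod


class Shape(ABC):
    """Hypothetical interface: anything that can report its perimeter in metres."""

    @abstractmethod
    def perimeter(self) -> float:
        ...


class Circle(Shape):
    def __init__(self, radius: float) -> None:
        self.radius = radius

    def perimeter(self) -> float:
        return 2 * math.pi * self.radius


class Square(Shape):
    def __init__(self, side: float) -> None:
        self.side = side

    def perimeter(self) -> float:
        return 4 * self.side


def edging_cost(shape: Shape, price_per_metre: float) -> float:
    """Written once against the interface; works for any Shape added later."""
    return shape.perimeter() * price_per_metre


print(edging_cost(Square(2.0), price_per_metre=3.0))  # 24.0
```

A one-off script would hard-code the circle; here a new bed shape only needs a new subclass, and `edging_cost` is reused unchanged.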
There are different types of models, such as numerical models, computer simulation models, political models, or econometric models. In data analytics, we concentrate mostly on statistical models. While there is no exact mathematical definition of a statistical model, we would happily adopt the one provided by Peter McCullagh in his paper ‘What is a statistical model?’ According to currently accepted theories, a statistical model is a set of probability distributions on the sample space. As opposed to a deterministic model, a statistical model introduces uncertainty, expressed through probability distributions.
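One way to picture ‘a set of probability distributions’ is a parametric family, where each parameter value picks out one member of the set. The sketch below assumes a normal family indexed by a mean and a standard deviation and uses a handful of made-up observations; it is an illustration of the definition, not a recommended modelling choice.

```python
import math
from statistics import NormalDist, fmean, pstdev


def normal_model(mu: float, sigma: float) -> NormalDist:
    """One member of the assumed family {N(mu, sigma^2) : mu real, sigma > 0}."""
    return NormalDist(mu, sigma)


def log_likelihood(dist: NormalDist, data: list[float]) -> float:
    """How well a single member of the family explains the data."""
    return sum(math.log(dist.pdf(x)) for x in data)


# Made-up observations, for illustration only.
observations = [4.8, 5.1, 5.3, 4.9, 5.4]

# Maximum-likelihood member of the family: sample mean and population std. dev.
fitted = normal_model(fmean(observations), pstdev(observations))
other = normal_model(4.0, 1.0)  # an arbitrary alternative member

print(log_likelihood(fitted, observations) > log_likelihood(other, observations))  # True
print(f"P(next observation <= 5.5) under the fitted member: {fitted.cdf(5.5):.2f}")
```

Unlike the deterministic formulas earlier, the answer here is a probability statement rather than a single number.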
For the sake of simplicity, we don’t differentiate here between Bayesian and traditional statistical modelling, as we are looking at modelling at a very high level. We just urge you to remember that Bayesian modelling requires at least one more component: a prior distribution, which expresses the beliefs about the parameter under consideration.
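As a small illustration of that extra component, the sketch below uses the textbook Beta-Binomial pairing, in which a Beta prior on a success probability is updated by counts of successes and failures. The prior parameters and the counts are assumptions picked for the example.

```python
def beta_binomial_update(prior_a: float, prior_b: float,
                         successes: int, failures: int) -> tuple[float, float]:
    """Conjugate update: Beta(a, b) prior + binomial data -> Beta(a + s, b + f)."""
    return prior_a + successes, prior_b + failures


def beta_mean(a: float, b: float) -> float:
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Assumed prior belief: success probability centred on 0.5, weakly held.
prior = (2.0, 2.0)

# Hypothetical data: 7 successes and 3 failures.
posterior = beta_binomial_update(*prior, successes=7, failures=3)

print(f"prior mean:     {beta_mean(*prior):.3f}")      # 0.500
print(f"posterior mean: {beta_mean(*posterior):.3f}")  # 0.643
```

The posterior mean sits between the prior mean and the observed frequency of 0.7, which is the prior belief doing its work.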
In our industry, we are mostly interested in econometric models. The founding members of the Cowles Commission defined econometrics as ‘a branch of economics in which economic theory and statistical method are fused in the analysis of numerical and institutional data.’ In modern parlance, models that combine explicit economic theories with statistical models are called structural econometric models. On the other side of the modelling spectrum, we have ‘reduced form’ models. Under this umbrella, we have statistical models that don’t refer to any specific economic theory, for instance, autoregressive conditional volatility models or a regression model based on some business-related covariates but without any explicit economic theory behind it. Either kind of model can, of course, be used for industrial purposes, although the use of non-structural models seems to be more prevalent.
Examples of such models include a regression explaining installs as a function of TV campaigns, an autoregressive forecasting model, and the market evaluation of a newly launched game. In each of these cases, and in many others, thinking in terms of a model lets us not only obtain the desired numerical answer but also see how the different parts of the model interact with each other. We can describe and, hopefully, successfully simulate relative and absolute changes in the observed variables (the dimensions of our model) and understand what might cause these changes. Most importantly, of course, having a model makes us capable of replicating these findings.
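The first example, installs as a function of TV campaigns, can be sketched as a single-covariate least-squares regression. Everything numeric below is simulated: the baseline, the lift per unit of spend, the noise level, and the 20 weeks of spend are assumptions made up for illustration, and `fit_line` and `predict_installs` are hypothetical helper names.

```python
import random


def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Simulated data only: these "true" values are assumptions, not real figures.
TRUE_BASELINE = 1000.0  # weekly installs with no TV spend
TRUE_LIFT = 50.0        # extra installs per unit of TV spend
rng = random.Random(42)
tv_spend = [float(week) for week in range(20)]
installs = [TRUE_BASELINE + TRUE_LIFT * x + rng.gauss(0.0, 20.0) for x in tv_spend]

baseline_hat, lift_hat = fit_line(tv_spend, installs)


def predict_installs(spend: float) -> float:
    """Re-use the fitted model to answer 'what if' questions."""
    return baseline_hat + lift_hat * spend


print(f"estimated baseline: {baseline_hat:.0f}, estimated lift: {lift_hat:.1f}")
print(f"predicted installs at spend 25: {predict_installs(25.0):.0f}")
```

The fitted coefficients show how the parts of the model interact, and re-running the script with the same seed replicates the result exactly.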
So what does one have to do to start thinking in modelling terms? It isn’t difficult at all. Every time you face a problem, to start with, as weird as it sounds, don’t attempt to find an immediate solution. Think about which forces influence the problem, which of them might be introduced as variables and which as constants, and then which of them are deterministic and which are stochastic. For clarity, you may assume for a moment that everything is deterministic.
At the next stage, express the problem in the dimensions of your variables. This spatial representation helps you to understand the relationships between your variables and allows you to visualise the problem. By then, the solution, or at least the way of reaching it, should have become more or less evident. Now you can start to add distributions to the stochastic variables and abstract away deterministic factors through reasonable and plausible assumptions. You might, of course, come back later and revise your assumptions.
Voilà, you have a model. Formalise it by writing it in computer code and obtain your solution. Just remember that while a solution to a given problem is necessary, it is much more important to be able to replicate that solution and to demonstrate how it would be affected if the underlying forces changed. This is practically impossible without careful formulation, that is, without modelling.
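The recipe above can be walked through on the departure-time problem from the beginning of the article: treat the distance as a constant, start with a deterministic speed, then replace it with a distribution. The 30 km distance, the triangular distribution for speed, and its bounds are all assumptions chosen for illustration.

```python
import random
from statistics import quantiles

DISTANCE_KM = 30.0  # treated as a constant (assumed value)


def travel_minutes(distance_km: float, speed_kmh: float) -> float:
    """Deterministic core of the model: time = distance / speed, in minutes."""
    return 60.0 * distance_km / speed_kmh


# Step 1: everything deterministic, with an assumed typical speed of 60 km/h.
baseline = travel_minutes(DISTANCE_KM, 60.0)


# Step 2: make speed stochastic. Assumption: triangular between 40 and 70 km/h,
# most likely 60 km/h. Revise this distribution as better information arrives.
def simulate_travel_minutes(n_draws: int, seed: int = 0) -> list[float]:
    rng = random.Random(seed)
    return [travel_minutes(DISTANCE_KM, rng.triangular(40.0, 70.0, 60.0))
            for _ in range(n_draws)]


draws = simulate_travel_minutes(10_000)
p95 = quantiles(draws, n=20)[-1]  # last of 19 cut points = 95th percentile

print(f"deterministic estimate: {baseline:.0f} min")
print(f"allow about {p95:.0f} min to be on time in 95% of simulated trips")
```

Because the seed is fixed, the result can be replicated, and changing the assumed distribution shows directly how the answer responds to the underlying forces.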
In conclusion, let me finish as I started, with a quote from George E.P. Box:
“Now it would be very remarkable if any system existing in the real world could be exactly represented by any simple model. However, cunningly chosen parsimonious models often do provide remarkably useful approximations. For example, the law PV = RT relating pressure P, volume V and temperature T of an ‘ideal’ gas via a constant R is not exactly true for any real gas, but it frequently provides a useful approximation and furthermore its structure is informative since it springs from a physical view of the behaviour of gas molecules. For such a model there is no need to ask the question ‘Is the model true?’. If ‘truth’ is to be the ‘whole truth’ the answer must be ‘No’. The only question of interest is ‘Is the model illuminating and useful?’.”
And that is all that matters, folks.