Humans try to understand the world around them by representing it in different ways. Architects capture attributes of buildings through blueprints and three-dimensional, scaled-down versions. Molecular biologists capture protein structure with three-dimensional visualizations of the connections between amino acids. Statisticians and data scientists capture the uncertainty and randomness of data-generating processes with mathematical functions that express the shape and structure of the data itself.
A model is our attempt to understand and represent the nature of reality through a particular lens, be it architectural, biological, or mathematical.
A model is an artificial construction where all extraneous detail has been removed or abstracted. Attention must always be paid to these abstracted details after a model has been analyzed to see what might have been overlooked.
In the case of proteins, a model of the protein backbone with sidechains by itself is removed from the laws of quantum mechanics that govern the behavior of the electrons, which ultimately dictate the structure and actions of proteins. In the case of a statistical model, we may have mistakenly excluded key variables, included irrelevant ones, or assumed a mathematical structure divorced from reality.
Before you get too involved with the data and start coding, it’s useful to draw a picture of what you think the underlying process might be with your model. What comes first? What influences what? What causes what? What’s a test of that?
But different people think in different ways. Some prefer to express these kinds of relationships in terms of math. The mathematical expressions will be general enough that they have to include parameters, but the values of these parameters are not yet known.
So, for example, if you have two columns of data, x and y, and you think there’s a linear relationship, you’d write down y = mx + c. You don’t know m and c in actual numbers yet, so they’re parameters.
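To make the idea of unknown parameters concrete, here is a minimal sketch in Python. The function name `linear_model` and the candidate values for m and c are made up purely for illustration; nothing has been estimated from data yet.

```python
def linear_model(x, m, c):
    """Return the prediction y = m*x + c for an input x."""
    return m * x + c


# The functional form is fixed, but m and c are free parameters.
# Each (m, c) pair below is an arbitrary guess, not a fitted value.
candidates = [(1.0, 0.0), (2.0, -1.0), (0.5, 3.0)]

for m, c in candidates:
    print(f"m={m}, c={c}: prediction at x=4 is {linear_model(4, m, c)}")
```

Each pair picks out a different line from the same family. Choosing among them using data is the job of fitting, which comes later.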
Other people prefer pictures and will first draw a diagram of data flow, possibly with arrows, showing how things affect other things or what happens over time. This gives them an abstract picture of the relationships before choosing equations to express them.
How do you build a model?
How do you have any clue whatsoever what functional form the data should take? Truth is, it’s part art and part science. And sadly, this is where you’ll find the least guidance in textbooks, in spite of the fact that it’s the key to the whole thing. After all, this is the part of the modeling process where you have to make a lot of assumptions about the underlying structure of reality, and we should have standards as to how we make those choices and how we explain them. But we don’t have global standards, so we make them up as we go along, and hopefully in a thoughtful way.
We’re admitting this here: where to start is not obvious. If it were, we’d know the meaning of life. However, we will do our best to demonstrate for you throughout the book how it’s done.
One place to start is exploratory data analysis (EDA). This entails making plots and building intuition for your particular dataset. EDA helps a lot, as do trial and error and iteration.
To be honest, until you’ve done it a lot, it seems very mysterious. The best thing to do is start simply and then build in complexity. Do the dumbest thing you can think of first. It’s probably not that dumb.
For example, you can (and should) plot histograms and look at scatterplots to start getting a feel for the data. Then you just try writing something down, even if it’s wrong first (it will probably be wrong first, but that doesn’t matter).
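A first look of this kind might be sketched as follows. This is only an illustration: the data is synthetic (generated here with NumPy), and the helper names `quick_look` and `plot_quick_look` are hypothetical. The plotting helper assumes matplotlib is installed and is defined but not called, so the numeric summary runs on its own.

```python
import numpy as np

# Synthetic stand-in for two columns of real data (an assumption for this sketch).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 + 2.0 * x + rng.normal(0, 2.0, size=200)


def quick_look(x, y, bins=10):
    """Return histogram counts and bin edges for x, and the x-y correlation."""
    counts, edges = np.histogram(x, bins=bins)
    r = np.corrcoef(x, y)[0, 1]
    return counts, edges, r


def plot_quick_look(x, y, path="eda.png"):
    """Save a histogram of x and a scatterplot of y against x to an image file."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(8, 3))
    ax_hist.hist(x, bins=10)
    ax_hist.set_title("Histogram of x")
    ax_scatter.scatter(x, y, s=8)
    ax_scatter.set_title("y versus x")
    fig.savefig(path)
    plt.close(fig)


counts, edges, r = quick_look(x, y)
print("histogram counts of x:", counts)
print("correlation between x and y:", round(r, 3))
```

Calling `plot_quick_look(x, y)` would write the two plots to `eda.png`. A roughly straight cloud of points in the scatterplot is the kind of hint that makes a linear function a sensible first thing to write down.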
So try writing down a linear function. When you write it down, you force yourself to think: does this make any sense? If not, why? What would make more sense? You start simply and keep building it up in complexity, making assumptions, and writing your assumptions down. You can use full-blown sentences if it helps, e.g., “I assume that my users naturally cluster into about five groups because when I hear the sales rep talk about them, she has about five different types of people she talks about.” Then take your words and try to express them as equations and code.
Remember, it’s always good to start simple. There is a trade-off in modeling between simple and accurate. Simple models may be easier to interpret and understand. Oftentimes the crude, simple model gets you 90% of the way there and only takes a few hours to build and fit, whereas getting a more complex model might take months and only get you to 92%.
Fitting a model
Fitting a model means that you estimate the parameters of the model using the observed data. You are using your data as evidence to help approximate the real-world mathematical process that generated the data. Fitting the model often involves optimization methods and algorithms, such as maximum likelihood estimation, to help get the parameters.
In fact, when you estimate the parameters, they are actually estimators, meaning they themselves are functions of the data. Once you fit the model, you actually can write it as y = 7.2 + 4.5x, for example, which means that your best guess is that this equation or functional form expresses the relationship between your two variables, based on your assumption that the data followed a linear pattern.
Fitting the model is when you start actually coding: your code will read in the data, and you’ll specify the functional form that you wrote down on the piece of paper. Then R or Python will use built-in optimization methods to give you the most likely values of the parameters given the data.
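As a rough sketch of what that looks like in Python, the example below fits a line to synthetic data in two ways. The data is generated from y = 7.2 + 4.5x plus noise purely for illustration, and `fit_line` is a hypothetical helper that writes out the closed-form least-squares estimates, which makes it visible that the estimates are functions of the data. `np.polyfit` is NumPy's built-in least-squares polynomial fit. If you assume the noise is Gaussian, the least-squares estimates coincide with the maximum likelihood estimates.

```python
import numpy as np

# Synthetic data from a known line plus noise (an assumption for this sketch).
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=500)
y = 7.2 + 4.5 * x + rng.normal(0, 1.0, size=500)


def fit_line(x, y):
    """Closed-form least-squares estimates of slope and intercept."""
    x_mean, y_mean = x.mean(), y.mean()
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean
    return slope, intercept


slope, intercept = fit_line(x, y)

# The library routine: coefficients come back highest degree first.
slope_np, intercept_np = np.polyfit(x, y, deg=1)

print(f"by hand:    y = {intercept:.2f} + {slope:.2f}x")
print(f"np.polyfit: y = {intercept_np:.2f} + {slope_np:.2f}x")
```

Both routes should land close to the line that generated the data, though not exactly on it, because the estimates depend on the particular noisy sample you happened to draw.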
As you gain sophistication, or if this is one of your areas of expertise, you’ll dig around in the optimization methods yourself. Initially you should have an understanding that optimization is taking place and how it works, but you don’t have to code this part yourself—it underlies the R or Python functions.
Overfitting is the term used to mean that you used a dataset to estimate the parameters of your model, but your model isn’t that good at capturing reality beyond your sampled data. You might know this because you have tried to use it to predict labels for another set of data that you didn’t use to fit the model, and it doesn’t do a good job, as measured by an evaluation metric such as accuracy.
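One way to see this for yourself is to hold out data that the model never saw during fitting. The sketch below is an illustration under assumptions: the data is synthetic and truly linear, the two polynomial degrees are arbitrary choices, and mean squared error stands in as the evaluation metric because this is a regression example rather than a label-prediction one.

```python
import numpy as np

rng = np.random.default_rng(2)


def make_data(n):
    """Draw n points from a truly linear process with noise."""
    x = rng.uniform(-1, 1, size=n)
    y = 1.0 + 2.0 * x + rng.normal(0, 0.5, size=n)
    return x, y


def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on the points (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


x_train, y_train = make_data(15)   # small sample used for fitting
x_test, y_test = make_data(200)    # held-out sample, never used for fitting

results = {}
for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    results[degree] = (mse(coeffs, x_train, y_train),
                       mse(coeffs, x_test, y_test))
    train_err, test_err = results[degree]
    print(f"degree {degree}: train MSE = {train_err:.3f}, test MSE = {test_err:.3f}")
```

The degree-9 polynomial will fit the training sample at least as well as the line, since it has more freedom to chase the noise. The comparison to watch is the held-out error: a model whose test error is much worse than its training error is showing the symptom described above.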