50 Shades of Linear Regression - the depth we need
The Hidden Depths of Linear Regression: A Journey Through Mathematical Foundations and Learning Paradigms (transcribed)
Let’s talk about something that seems very simple but has subtle nuances throughout. I’m writing this because I wanted to emphasize something important: everything has its own depth, and these aren’t throwaway concepts we learned five or six years ago and can forget about.
Take, for example, the grad school experience. Everyone doing a machine learning or data science master’s takes the relevant courses, and everyone starts with something called linear regression. If you ask grad students about linear regression, they’ll say, “Oh, it’s so boring, I learned it in my undergrad already.” Another common response is “it’s just curve fitting.”
But here’s the thing about linear regression - there are different schools of thought, and many people have contributed to this one specific concept. Linear regression serves as a formal entry point to many different elements and concepts. For example, it’s a very good gateway to various algorithms: somebody might be interested in iterative algorithms for fitting it, or in the role of randomness and probabilistic inference - where a joint distribution with its own parameters generates the data and we want to estimate those parameters - which is how the Bayesian or probabilistic point of view sees things.
We also want to understand the statistical significance of these estimates so we can interpret things correctly. Why I’m emphasizing this simple concept is that if we can develop good practices with something simple, we’ll be able to maintain those practices when we move to more complex models. I’ve witnessed this and tested it on myself - if you directly jump to, say, transformers for time series forecasting, that’s all good and there are plenty of papers on it. But sometimes, to build a deeper understanding, we need to go back and appreciate the beautiful nuances of simpler models. We can keep adding things layer by layer and revisit the relevant mathematics.
Everything is math, but predominantly we see things from the statistical point of view, the probabilistic point of view, and the algorithmic point of view. There’s another perspective that’s quite interesting - how statistical learning comes into the picture. When we look at linear regression through ordinary least squares, which Legendre and Gauss proposed over two hundred years ago, it wasn’t framed as a learning paradigm or as statistical learning at all - it was curve fitting, the estimator we now justify as BLUE (Best Linear Unbiased Estimator). It was simply a fitted curve used to understand relationships.
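To make that curve-fitting view concrete, here is the standard matrix formulation of ordinary least squares; the closed-form solution below assumes X has full column rank so that X^T X is invertible.

```latex
% Ordinary least squares: minimize the sum of squared residuals
\hat{\beta} = \arg\min_{\beta}\ \lVert y - X\beta \rVert_2^2
% Setting the gradient to zero gives the normal equations
X^\top X\,\hat{\beta} = X^\top y
% With X of full column rank, the closed-form estimator is
\hat{\beta} = (X^\top X)^{-1} X^\top y
```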
Years later, the statistical notion of estimation came into the picture, attaching statistical properties to the estimates. Even then, we weren’t explicitly worrying about the data-generating process or a joint distribution. Then came empirical risk minimization - you have an empirical risk, a sample average that stands in for the expected risk you actually care about, and you want to minimize it. There, we’re learning the relationship and formulating a hypothesis, a function that maps one set of inputs to another. For generalization purposes, people add a penalty - risk plus regularization.
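As a small sketch of that framing (keeping the loss \ell and penalty \Omega generic): the empirical risk is the sample average that stands in for the expected risk, and regularized ERM adds a penalty weighted by \lambda.

```latex
% Expected (true) risk under the data-generating distribution P
R(f) = \mathbb{E}_{(x,y)\sim P}\big[\ell(f(x), y)\big]
% Empirical risk: the sample average used in its place
\hat{R}_n(f) = \frac{1}{n}\sum_{i=1}^{n} \ell\big(f(x_i), y_i\big)
% Regularized ERM: empirical risk plus a penalty
\hat{f} = \arg\min_{f \in \mathcal{F}}\ \hat{R}_n(f) + \lambda\,\Omega(f)
```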
From there, things started evolving in the direction of statistical learning. People began asking when this becomes “learning” - because in ordinary least squares, we weren’t commenting on overfitting or generalization at all. By the Gauss-Markov theorem, under the classical assumptions, OLS is already the Best Linear Unbiased Estimator. But the problem really transforms when you have many dimensions, when you approach it iteratively, or when you see it from the empirical risk minimization point of view. That’s when the concern shifts toward generalization, not only for linear regression but for other models as well.
The econometric side of things truly involves hypothesis testing and understanding relationships. For example, you want to understand the influence, or the strength of the relationship, of one variable on another. This aspect of econometrics deals with real-world applications - going back to when people first tried to quantify social observations, like how age affects vulnerability to cancer, or how the minimum wage affects unemployment. These applications span public policy, public health, and labor economics, and have evolved into business analytics and operations research.
In econometric applications, we want to ensure our estimates are unbiased, and we frame our hypotheses with the sampling process in mind so that sampling randomness doesn’t mislead us. There are issues to consider - multicollinearity, hidden disturbances, interaction effects, and what happens when observations violate the independent and identically distributed assumption or have temporally correlated errors. Different situations call for different regression approaches and techniques.
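As one concrete illustration of matching the technique to the situation, here is a minimal sketch - on synthetic, purely illustrative data, and assuming the statsmodels library is available - that compares classical standard errors with heteroskedasticity-robust and autocorrelation-robust ones.

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data with heteroskedastic noise (purely illustrative)
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, size=n)
noise = rng.normal(scale=0.5 + 0.3 * x)        # noise spread grows with x
y = 2.0 + 1.5 * x + noise

X = sm.add_constant(x)                          # add an intercept column
classical = sm.OLS(y, X).fit()                  # classical standard errors
robust = sm.OLS(y, X).fit(cov_type="HC3")       # heteroskedasticity-robust
hac = sm.OLS(y, X).fit(cov_type="HAC",
                       cov_kwds={"maxlags": 2}) # also robust to serial correlation

# The coefficient estimates are identical; only the standard errors change
print(classical.bse, robust.bse, hac.bse)
```

The point estimates stay the same across the three fits; what changes is how much we trust them, which is exactly where these violations bite.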
Then we can think about causal inference on top of linear models and generalize to different models. From the beginning, ordinary least squares has been associated with the normal distribution, essentially modeling the conditional mean directly. For other response distributions - binary outcomes, counts - the mean can’t be an unconstrained linear function of the inputs; it is tied to the linear predictor through a link function, and other parameters come into play. That’s where generalized linear models come in.
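A minimal sketch of that generalization, again on synthetic data and assuming statsmodels: a Poisson count response whose mean is connected to the linear predictor through a log link.

```python
import numpy as np
import statsmodels.api as sm

# Synthetic count data: the log of the mean is linear in x (illustrative only)
rng = np.random.default_rng(1)
x = rng.uniform(0, 2, size=300)
mu = np.exp(0.3 + 0.8 * x)        # log link: log(mu) = 0.3 + 0.8 * x
y = rng.poisson(mu)

X = sm.add_constant(x)
poisson_fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(poisson_fit.params)          # estimates of the linear-predictor coefficients
```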
These are the multiple dimensions, the different shades, from which to look at linear regression - each one helps you think more theoretically and in depth, from whatever angle you choose. If you’re interested in machine learning and statistical learning, you want to see how this connects with learning theory - it’s still a very relevant aspect that provides important foundations.
From the statistical point of view, things evolved to capture regularization as well. Regularization shows up at three different levels: in the classical statistical treatment, on the Bayesian learning side (as a prior), and in empirical risk minimization, where it is added as an explicit penalty. There’s also the bias side of the story - unbiasedness, multicollinearity, and the choice among different loss functions.
Naturally, we use squared loss - the residual sum of squares (RSS). It’s interesting to explore why this is so common. In ordinary least squares, backed by the Gauss-Markov theorem, we minimize the sum of squared residuals. From the probabilistic perspective, when we do Maximum A Posteriori (MAP) estimation, the objective is the log-likelihood plus the log-prior, and with a Gaussian likelihood and a Gaussian prior this works out to a squared-error term plus a regularizing squared-norm penalty. From the empirical risk minimization viewpoint, squared loss is easily tractable, though there are other loss functions one can explore.
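Spelling out that connection: with a Gaussian likelihood and a zero-mean Gaussian prior on the coefficients, maximizing the log-posterior is the same as minimizing squared error plus an L2 penalty - ridge regression, with the penalty weight set by the noise and prior variances.

```latex
% Gaussian likelihood and a zero-mean Gaussian prior on the coefficients
y_i \mid x_i, \beta \sim \mathcal{N}(x_i^\top \beta,\ \sigma^2), \qquad \beta \sim \mathcal{N}(0,\ \tau^2 I)
% Log-posterior, up to additive constants
\log p(\beta \mid y, X) = -\frac{1}{2\sigma^2}\lVert y - X\beta \rVert_2^2 - \frac{1}{2\tau^2}\lVert \beta \rVert_2^2 + \text{const}
% Maximizing it is ridge regression with \lambda = \sigma^2 / \tau^2
\hat{\beta}_{\text{MAP}} = \arg\min_{\beta}\ \lVert y - X\beta \rVert_2^2 + \frac{\sigma^2}{\tau^2}\lVert \beta \rVert_2^2
```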
Sometimes I go deep into something and feel, “okay, this depth is enough,” and then over time I realize I need more. It’s all need-based - it depends on how we really want to grow. There’s hypothesis testing - understanding p-values, examining coefficients and their standard errors. With linear regression alone, we can see multiple elements: algorithms (how things are fit computationally), theoretical foundations (the assumptions we start with, like Gauss-Markov, probabilistic models, Bayesian approaches, conditional modeling), and the machine learning perspective.
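Here is a small sketch of what that hypothesis-testing inspection looks like in practice, on synthetic data and assuming statsmodels: the fitted summary reports each coefficient alongside its standard error, t-statistic, and p-value.

```python
import numpy as np
import statsmodels.api as sm

# Two predictors, one of which truly has no effect (illustrative data)
rng = np.random.default_rng(2)
n = 150
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 0.0 * x2 + rng.normal(scale=1.0, size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()
print(fit.summary())   # coefficients, standard errors, t-statistics, p-values
print(fit.pvalues)     # just the p-values, for programmatic checks
```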
We learn about the different algorithms used, which loss functions are employed, on what theoretical grounds they iterate, and how the results are computed in practice. Sometimes it’s simple linear algebra, sometimes something more involved.
There’s also full posterior estimation, where you don’t just get a point estimate of the coefficients but a whole distribution describing how much they can vary. That builds intuition for uncertainty quantification more broadly, including concepts like conformal prediction. Why connect all these things to regression? So we can develop them conceptually. I’ve seen that we can learn these together and build a comprehensive analytical foundation starting from regression. Most of the time when we’re working, we never care about these nuances - it’s all demand-based. Sometimes we’re asked about the loss function, sometimes about explainability, sometimes about implementation.
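As a sketch of what a full posterior means for the linear model - assuming the conjugate case with a known noise variance and a Gaussian prior, and using illustrative helper names of my own - the posterior over the coefficients is itself Gaussian and has a closed form.

```python
import numpy as np

def coefficient_posterior(X, y, sigma2=1.0, tau2=10.0):
    """Conjugate Bayesian linear regression with known noise variance.

    Prior: beta ~ N(0, tau2 * I); likelihood: y ~ N(X beta, sigma2 * I).
    Returns the Gaussian posterior mean and covariance of beta.
    """
    d = X.shape[1]
    precision = X.T @ X / sigma2 + np.eye(d) / tau2   # posterior precision
    cov = np.linalg.inv(precision)                    # posterior covariance
    mean = cov @ X.T @ y / sigma2                     # posterior mean
    return mean, cov

# Usage on synthetic data: the covariance says how much the estimate can vary
rng = np.random.default_rng(3)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=1.0, size=100)
mean, cov = coefficient_posterior(X, y)
print(mean, np.sqrt(np.diag(cov)))   # point estimate plus posterior standard deviations
```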
When doing gradient ascent for maximum likelihood estimation, we ask: how far can we push the log-likelihood up, and what is the maximum value we can reach? That maximized log-likelihood is what we then use to compare models - model one reaches this log-likelihood, model two reaches another. The problem is we often end up memorizing these things, but if you connect the concepts together, it becomes very easy to reason them out.
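A minimal sketch of that iteration, assuming a Gaussian likelihood with a fixed noise variance and using illustrative names of my own: gradient ascent on the log-likelihood of the linear model (equivalent to gradient descent on the residual sum of squares), watching the log-likelihood climb toward its maximum.

```python
import numpy as np

def gaussian_log_likelihood(beta, X, y, sigma2=1.0):
    """Log-likelihood of y under the model y ~ N(X beta, sigma2 I)."""
    resid = y - X @ beta
    n = len(y)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - 0.5 * resid @ resid / sigma2

# Synthetic data (illustrative)
rng = np.random.default_rng(4)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = X @ np.array([0.5, -1.0]) + rng.normal(size=200)

beta = np.zeros(X.shape[1])
lr = 1e-3
for step in range(2000):
    grad = X.T @ (y - X @ beta)   # gradient of the log-likelihood (up to 1/sigma2)
    beta += lr * grad             # ascend: push the log-likelihood up
    if step % 500 == 0:
        print(step, gaussian_log_likelihood(beta, X, y))

print(beta)   # approaches the OLS / maximum-likelihood solution
```

The maximized value is the number we then compare across models, and it is also what information criteria like AIC and BIC are built on.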
There are different scenarios and situations - it’s all contextual, depending on where we are and what we intend in practice. For a business analyst, looking at these formulas and concluding that a variable isn’t significant is one thing. But once you move forward, you discover concepts like causality and causal mechanisms, with their own definitions and notions of influence.
This could be a good starting point for building the confidence to maintain a balance between computational, theoretical, and applied perspectives, and to apply each where it fits. I plan to spend an hour every day on similar small code examples and calculations, maybe some small derivations where possible - a series of workouts on these concepts. It could be good practice and a warm-up for focusing on different concepts.
For example, what made someone think to move from ID3 to CART mechanisms? It’s good to learn these progressions. This practice provides a roadmap for exploring. Nowadays with neural networks, we want to understand different things - how neural networks learn, how autoencoders help us get explainable features. There are issues here too - features aren’t always human-understandable. I went through a paper from Anthropic about this - the features exist but aren’t necessarily human-compatible.
There’s so much to develop and explore. I’m trying to understand and hypothesize, to build something around these concepts. It’s not just about building pipelines - I find value in incremental theoretical and applied understanding.
I’m building this discipline now. In the past, I used to start something and then because of work or other commitments - maybe needing to build an agent or not having time - things would fall aside. Now I’m focusing on myself and my interests, seeing how this develops through consistent exploration of these foundational concepts that reveal their depths the more we examine them.