I won’t go over all the prompting, but one thing I wanted to try was to see how useful Claude can be for helping to write some Polars code.
I asked for a brief overview of what the Polars library is, and then for an example of some code that uses Polars. I didn’t ask for anything specific: I wanted to see what it came up with given no detail to work from. It provided the code below:
import polars as pl
# Create a DataFrame from a list of dictionaries
data = [
    {"name": "Alice", "age": 32, "city": "New York"},
    {"name": "Bob", "age": 27, "city": "Los Angeles"},
    {"name": "Charlie", "age": 35, "city": "Chicago"},
    {"name": "Dave", "age": 29, "city": "New York"},
]
df = pl.DataFrame(data)
# Print the DataFrame
print(df)
# Select specific columns
selected_cols = df[["name", "city"]]
print(selected_cols)
# Filter rows based on a condition
filtered_df = df[df["age"] > 30]
print(filtered_df)
# Group by and aggregate
grouped = df.groupby("city").agg(pl.col("age").mean().alias("avg_age"))
print(grouped)
# Sort the DataFrame
sorted_df = df.sort("age", descending=True)
print(sorted_df)
# Write the DataFrame to a Parquet file
df.write_parquet("output.parquet")
It’s a good example of some things we might want to do with a small dataframe. The comments are a nice touch. However, one issue that presents itself is the use of brackets to filter the rows based on a condition. Polars doesn’t use bracket notation for filtering: it uses the .filter() method. So the correct way to filter the dataframe to rows with an age value greater than 30 is:
filtered_df = df.filter(pl.col("age") > 30)
This is the solution that Claude gives when prompted that bracketed filtering is not the correct approach for Polars.
My suspicion is that Claude (and this is likely true of LLMs generally) has a concept of Polars that is very similar to its concept of Pandas. I think that, after getting to the filtering section of the code, the model’s probability distribution suggests the next line should use bracket filtering.
I think models like Claude are likely to make this kind of mistake in longer sections of code, where the Polars syntax tracks Pandas closely but then differs in one area. The model is presumably working from text that could plausibly be either Polars or Pandas, and falls back on the much larger corpus of Pandas code in its training data.
Overall, using LLMs for coding in Polars (or any other library with a limited history of training data) is an interesting exercise, but probably one to approach with caution, especially where there is a risk of close but imperfect syntax overlap, as in the case of Pandas.
Only the 3.5 model thus far ↩
A brief clip of the slip-up can be found here^{1}. In short: Huberman says that the probability of conceiving on any one attempt is 20%, each additional attempt increases the total probability of conceiving by 20%, and so after 5 attempts you should expect to have conceived with 100% probability - if you haven’t then you should think about seeing a fertility expert. Enough (virtual) ink has been spilled in explaining why this is incorrect so I won’t do that here. Huberman provides his own correction in a later post - much to his credit.
I do want to explore an idea raised by @trading_noise based on the discourse. They set out the following problem:
A couple learns that for a “typical” couple, the probability of conception is 20% per attempt, and that attempts are considered independent. This couple decides they will consult a fertility specialist if they have yet to conceive by attempt n, where n is the least number of attempts for which a “typical” couple would have at least a 99% probability of conceiving. What is the value of n?
To answer this question (as well as the original question posed by Huberman) we have to follow some interesting logic that’s not always clear at first glance. When we’re presented with the probability of 20% (or 0.2), the first thought might be to add or multiply this number in some way. But that would be the wrong approach!^{2} In fact, the 20% is something of a red herring here - while it’s an important piece of information, it’s not actually the number we need.
Let’s lay this problem out using the idea of a binomial model. An attempt is one of a sequence of trials, each with probability of success (i.e. conception) of 0.2. Thus the probability of failure (i.e. not conceiving) in an attempt is 0.8. The question we’re being asked revolves around the probability of conceiving at least once^{3}, versus the probability of not conceiving at all. The latter is the key to answering these questions. What does it mean to not conceive at all in a series of n trials? It means that every trial \(1...n\) results in a failure. We know this happens in each independent trial with probability 0.8. Since we’ve assumed the independence of trials, we can calculate this probability as \(0.8^{n}\).
So when we’re asked the question of how many trials we need for a minimum 99% chance of conceiving, the mirror of this problem is to ask how many trials we need to have a less than 1% chance of failing to conceive. And since we know the general form for the probability of failing to conceive (\(0.8^{n}\)), we can set out to solve for the n that makes this probability less than 1% (0.01).
\[0.8^{n} \lt 0.01 \implies n\ln(0.8) \lt \ln(0.01) \implies n \gt \frac{\ln (0.01)}{\ln (0.8)}\]
Since the last value is approximately 20.638 - and we can’t have a fractional attempt - it will take at least 21 attempts for the typical couple’s probability of failing to conceive to fall below 1%. The mirror of this is that it will take 21 attempts for the couple to reach at least a 99% probability of conceiving.
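The arithmetic above is quick to verify in Python:

```python
import math

p_fail = 0.8          # probability that any one attempt fails to conceive
target = 0.01         # we want the failure probability below 1%

# Smallest n with 0.8**n < 0.01, via the log inequality derived above
n = math.ceil(math.log(target) / math.log(p_fail))
print(n)  # 21

# Sanity check: 20 attempts is not quite enough, 21 is
assert p_fail**20 > target > p_fail**21
```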
Obviously these are heavily simplified examples of the probabilities involved in pregnancy, and a crucial assumption here (the independence of attempts) doesn’t seem at all realistic. But it’s a nice example to motivate the ideas of probability and the challenges of thinking probabilistically. If anything, this all evinces the fact you should never do maths off-the-cuff!
Many posts of this mistake have rather unkind/critical captions, and this is the least unkind post I could find ↩
In fact, multiplying 0.2 by itself k times is going to give us the probability of conceiving on every one of those k attempts. ↩
Obviously it’s not possible to conceive more than once in a short time, but we’ll assume that conception on an attempt means any future attempts are failures. That doesn’t change the maths around the question of at least one attempt being successful. ↩
I recently posted some code that used Rust to calculate the price of a bond (given the yield to maturity) and the yield to maturity of a bond (when given the price). One reason for writing this code in Rust - as opposed to, say, Python - was to create something that’s closer to the type of code one would use in a real pricing system^{1}. But in many cases, it might be useful to write the meatiest code in Rust and then call it from a higher-level language like Python. It’s this scenario that I want to cover here.
It seems that there’s a pretty good system for binding Rust code to Python, in the form of PyO3 and maturin. The Getting Started page of the PyO3 documentation provides a solid overview of the steps involved in setting up a project. I followed those steps, and won’t re-hash them here.
I’ve put the repo here: the main Rust code is in src/lib.rs. I won’t copy all the code below, just some relevant changes.
Converting the original Rust code into a form usable by Python is pretty straightforward for a basic program like this. After bringing the PyO3 crate into scope by adding use pyo3::prelude::*; at the top of the lib file, we can then use various attributes to tell PyO3 that we’re creating a Python module:
#[pymodule]
fn pyo3_bond_pricing(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_class::<SimpleBond>()?;
    Ok(())
}
There are few other changes we have to make, beyond adding attributes at certain points so that PyO3 knows a piece of code is part of our Python-accessible module. The two attributes we use here are the #[pyclass] attribute above the struct we create to represent the bond, and the #[pymethods] attribute above the impl block of methods for the struct.
An additional difference from the basic Rust code is that we need to give Python a way to construct an instance of the class. This requires adding a new method (marked with the #[new] attribute) to be called when we construct a class instance.
After we’ve taken these steps, and run maturin develop, the library is ready for use! We can create a short .py file to test the basics:
from pyo3_bond_pricing import SimpleBond
bond1 = SimpleBond(1000, 0.04, 2, 10, 0.04584, 0)
bond2 = SimpleBond(1000, 0.04, 2, 10, 0, 953.57)
print(f"bond price is {round(bond1.solve_price(), 2)} USD")
print(
    f"bond yield is {round(bond2.solve_yield_to_maturity(0.01, 1000, 0.00001, 0.00000001), 4)}"
)
When we run this file in the terminal, the output is what we expect, and matches up with the output of the Rust code we created previously.
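As a rough cross-check of the first number (not part of the PyO3 build), the same closed-form bond price can be evaluated in plain Python:

```python
# Pure-Python check of the bond-price formula used by the Rust library
face, coupon_rate, m, years, lam = 1000.0, 0.04, 2.0, 10.0, 0.04584

n = m * years                # number of coupon periods
v = (1.0 + lam / m) ** -n    # discount factor over all n periods
coupon = coupon_rate * face  # annual coupon payment

price = face * v + (coupon / lam) * (1.0 - v)
print(round(price, 2))  # 953.57
```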
While it’s obviously not necessary to use Rust as the engine behind these calculations, there are use cases where Python might not be fast enough and it’s advantageous to use a language like Rust. While Rust is only one of several options, it’s a language that’s getting more mature and I think it’s an interesting choice to explore.
The next step for this small program is to make it robust - both to computational issues in the yield solver and to errors that can arise from bad inputs. This is an extension I’ll make in the future by exploring the use of Option<T> to accommodate null values for yield or price: Rust doesn’t have a null type, so it requires the use of an Option to handle cases where a variable has no value.
As always, I appreciate any corrections or feedback to feedback@finlaymcalpine.com
While there’s no question that the speed benefits of a language like Rust are of little use in a simple case like this, it’s an entry point into the general concept of using high-performance languages under the hood. And Rust is an interesting language that’s gaining a lot of traction in the data space, e.g. it’s the engine behind the Polars DataFrame library ↩
In the future, I’ll try to expand on the ideas in this notebook by considering how we think of the coefficients of a regression as a sample statistic and why the sampling distribution is critical to testing hypotheses when we’re modelling. That’s for another day, however…
On this, I’ll acknowledge George Lentzas, for his persistence in covering a topic that is often quickly stated and then invoked without a proper explanation of the underlying idea.
As usual, please reach out with corrections or comments to feedback@finlaymcalpine.com
In discussing the rapid growth of oil extraction in the Delaware Basin - spanning the border between New Mexico and Texas - the article highlights the diverging land subsidence and expansion across the geography, as oil and water are extracted from some areas and wastewater is pumped into the ground in others. That process has had notable effects on local communities. In particular, the regions of Texas that have experienced rising surface levels have also seen a significant expansion in earthquakes, in an area that registered few notable quakes in the early part of the 2010s.
The screenshot below highlights the stark disparity in wastewater disposal volumes on either side of the New Mexico-Texas border, and one of the explanations given is the much looser regulatory constraints imposed in Texas. In recent years many firms have taken to extracting wastewater from wells in New Mexico and shipping it over to Texas for disposal.
Source: WSJ
The (at least visibly) pretty clear difference in deposited volumes across the state line could be due to a number of factors (geology, cost, etc.) but one thing we can suggest as a reason is the difference in regulation on either side of the border. Clear borders of some kind (in this case regulatory) between otherwise similar areas can be a really interesting tool for analyzing the effects of interventions on different groups.
Specifically, this all reminded me of a 1994 paper by Card & Krueger studying the effects of minimum wage increases on employment. They use the different regulations (in this case the minimum wage) between New Jersey and Pennsylvania as a tool to examine how an increase in minimum wage in New Jersey (but not in Pennsylvania) changed employment in that state. Their finding (which was and remains a topic of debate) was that the increase in minimum wage in the treatment state (New Jersey) did not result in a decrease in employment relative to the control state (Pennsylvania). It was, I believe, one of the early difference-in-difference papers that really caught the attention of economists - in particular because the results ran contrary to what labor models would suggest.
(There’s a seminar discussion of the paper here, along with many other discussions.)
Now, I have no knowledge of either earthquakes or oil and gas extraction. So whether there is any question that this kind of natural experiment can answer is for someone else to judge. But it’s nice to be able to visualize the kind of ‘line on a map’ differences that present opportunities to conduct experiments on observational data.
As usual, please reach out with corrections or comments to feedback@finlaymcalpine.com
The Yield to Maturity (\(\lambda\)) is the interest rate that equates the price of the bond to the present value of the cashflow obtained by holding the bond until maturity. It’s the answer to the question: if we pay price P for this bond, what is the implied interest rate based on the predetermined cashflow?
As I tried to motivate in the last post, the market mechanism that keeps the equation in balance is that an investor will expect the yield on this bond to match the interest rate on any equivalent investment (a bond of equivalent maturity and default risk).
We need to use some computational method to solve for the yield to maturity here, since there is no closed form solution. We set the present value equal to the price and find \(\lambda\) such that the equation is zero, i.e. find the roots of:
\[y = f(\lambda) = \frac{F}{(1 + ( \lambda / m))^ {n}} + \frac{C}{\lambda}\left\{ 1 - \frac{1}{(1 + ( \lambda / m))^ {n}} \right\} - P\]One mechanism that can be used is Newton’s Method, which involves differentiating the above function and running an iterative procedure, where \(x_{n+1} = x_{n} - \frac{f(x_{n})}{f'(x_{n})}\). We continue this updating step until the change between \(x_{n}\) and \(x_{n+1}\) is sufficiently small^{2}.
fn solve_yield_to_maturity(&mut self, mut x0: f32, iter: i32, tolerance: f32, epsilon: f32) -> f32 {
    let max_iter: i32 = iter;
    let coupon_periods = self.frequency * self.maturity;
    for _ in 1..max_iter {
        // f(x0): the pricing equation minus the target price
        let y = (self.face_value - ((self.coupon * self.face_value) / x0))
            * ((1.0 + (x0 / self.frequency)).powf(-1.0 * coupon_periods))
            + ((self.coupon * self.face_value) / x0)
            - self.price;
        // f'(x0): the derivative of the pricing equation with respect to the yield
        let y_prime = ((self.coupon * self.face_value) / x0.powf(2.0))
            * (1.0 + (x0 / self.frequency)).powf(-1.0 * coupon_periods)
            - ((self.face_value - ((self.coupon * self.face_value) / x0))
                * (coupon_periods
                    * (1.0 + (x0 / self.frequency)).powf(-1.0 * coupon_periods - 1.0))
                / self.frequency)
            - ((self.coupon * self.face_value) * x0.powf(-2.0));
        // Avoid dividing by a near-zero derivative
        if y_prime.abs() < epsilon {
            break;
        };
        let x1: f32 = x0 - (y / y_prime); // Newton's Method Step
        if (x1 - x0).abs() <= tolerance {
            self.yield_to_maturity = x1;
            break;
        }
        x0 = x1;
    }
    self.yield_to_maturity
}
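As a sanity check on the solver logic, here is a rough Python sketch of the same Newton iteration - my own port, using a finite-difference derivative rather than the hand-derived one:

```python
def bond_price(lam, face=1000.0, coupon_rate=0.04, m=2.0, years=10.0):
    """Closed-form price of the bond given a yield to maturity lam."""
    n = m * years
    v = (1.0 + lam / m) ** -n
    coupon = coupon_rate * face
    return face * v + (coupon / lam) * (1.0 - v)

def solve_ytm(target_price, x0=0.05, tol=1e-8, max_iter=100):
    """Newton's method on f(lam) = price(lam) - target_price,
    with a central finite difference standing in for f'."""
    h = 1e-7
    for _ in range(max_iter):
        f = bond_price(x0) - target_price
        f_prime = (bond_price(x0 + h) - bond_price(x0 - h)) / (2 * h)
        x1 = x0 - f / f_prime  # Newton's Method step
        if abs(x1 - x0) <= tol:
            return x1
        x0 = x1
    return x0

ytm = solve_ytm(953.5723)
print(round(ytm, 5))  # ~0.04584
```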
We can now check that the yield given by our calculation is the same as the yield we supplied when we used the price function last time:
let mut bond2 = SimpleBond {
    face_value: 1000.0,
    coupon: 0.04,
    frequency: 2.0,
    maturity: 10.0,
    yield_to_maturity: 0.0, // we have to give some float to fill out the struct.
    price: 953.5723,
};
This gives us a yield to maturity of 0.04583999, which aligns with the number we gave the solve_price() function to get the price $953.5723 in the pricing post. The fact that we can move between the price and yield using these functions is a sign that they’re doing what we want.
So we can now, whether given the price or the yield to maturity, calculate the other to fully specify the key elements of the pricing equation for a bond.
The code used here can be found in this repo, although this is a quick implementation of the main idea. It should be taken as an outline. I plan to clean it up in order to attach it to a PyO3 library.
As always, I appreciate any corrections or feedback to feedback@finlaymcalpine.com
As before, I’ll refer to Luenberger’s Investment Science textbook ↩
The choice of ‘sufficiently small’ depends on the context: in this case the function accepts an argument to define this ‘tolerance’. There are also concerns about convergence with numerical algorithms like this. I won’t cover that here, but those are discussed in the Wikipedia page linked and in other sources ↩
I’m going to reference the book Investment Science by Luenberger, which is a really nice reference for the basics of financial markets from a quantitative viewpoint. A more detailed reference focused on empirical applications is The Econometrics of Financial Markets by Lo et al. Finally, PIMCO’s overview of bonds is a good quick introduction to the ideas.
A standard bond for our purposes is a bond that has a principal to be repaid at maturity and makes coupon payments of a predetermined amount, on a predetermined schedule (let’s choose every 6 months, as is the case for US Treasury Bonds). The bond therefore has a fully deterministic cashflow through its lifetime: we are not looking at inflation-linked bonds (TIPS or Linkers) or callable bonds that can be redeemed prior to their maturity date.
While I won’t get far into the detail of the nature of bonds, it’s clear that we’re thinking of bonds here as a predefined cash flow. So how do we value that? In much the same way that we value any cashflow: by discounting according to the time value of money. In the same way that a discounted cashflow model for an equity can be used to assess the ‘value’ of a stock, we can do the same thing on the more deterministic cashflow associated with a bond.
The general formula for the price of a bond is:
\[P = \frac{F}{(1 + ( \lambda / m))^ {n}} + \sum_{k=1}^{n} \frac{C/m}{(1 + ( \lambda / m))^ {k}}\]which, given that the second part of the sum is the present value of an annuity^{1}, is equivalent to:
\[P = \frac{F}{(1 + ( \lambda / m))^ {n}} + \frac{C}{\lambda}\left\{ 1 - \frac{1}{(1 + ( \lambda / m))^ {n}} \right\}\]In the above, the bond has exactly n coupon periods remaining to maturity, F is the face value of the bond, C is the annual coupon payment, m is the number of coupon payments per year, and \(\lambda\) is the yield to maturity.
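The equivalence of the discounted coupon sum and the annuity term is easy to confirm numerically. A quick Python check, using illustrative numbers (the same ones that appear in the example later in the post):

```python
# Check that the discounted coupon sum equals the closed-form annuity value
face, coupon_rate, m, years, lam = 1000.0, 0.04, 2.0, 10.0, 0.04584

n = int(m * years)            # number of coupon periods
c = coupon_rate * face / m    # coupon paid each period, C/m

# Left: term-by-term discounted sum; right: annuity closed form
term_sum = sum(c / (1.0 + lam / m) ** k for k in range(1, n + 1))
annuity = (coupon_rate * face / lam) * (1.0 - (1.0 + lam / m) ** -n)

print(round(term_sum, 6), round(annuity, 6))
assert abs(term_sum - annuity) < 1e-9
```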
The yield to maturity (YtM) here is the return we’d get if we bought and held the bond until it matures. This is the discount rate we apply to the cashflow. An interesting aspect of pricing a bond is that market forces are trying to balance the price and YtM to maintain the equality of this equation, and maintain the YtM in line with other interest rates.
Suppose that the price of a bond fell and the associated YtM rose above the interest rate available by purchasing other assets with an equivalent payoff profile^{2}. Since investors recognise that they can obtain a higher return by purchasing this particular bond, they will bid up the price (and consequently reduce the YtM) to align the returns. This feedback leads to the price of a bond being more than just a simple formula: it is the product of a complex marketplace in which the principle of not paying a different price for the same cashflow generates prices and yields that move together to keep interest rates in balance.
In Rust, we can price the bond - if given a yield - as follows:
struct SimpleBond {
    face_value: f32,
    coupon: f32,
    frequency: f32,
    maturity: f32,
    yield_to_maturity: f32,
    price: f32,
}
impl SimpleBond {
    // We'll calculate the principal and coupon flow present values separately, and then combine them
    fn solve_price(&mut self) -> f32 {
        let pv_principal: f32 = self.face_value
            / ((1.0 + (self.yield_to_maturity / self.frequency))
                .powf(self.maturity * self.frequency));
        let pv_coupon: f32 = ((self.coupon * self.face_value) / self.yield_to_maturity)
            * (1.0
                - (1.0
                    / ((1.0 + (self.yield_to_maturity / self.frequency))
                        .powf(self.maturity * self.frequency))));
        self.price = pv_principal + pv_coupon;
        self.price
    }
}
Here, we’ve created a SimpleBond struct to hold the data we need for the bond (we’ll add more data later, along with methods to calculate them), and then an implementation for a price function that takes the data and generates the price according to the formula above.
We can test the approximate performance of this calculator by using the example of the current 10 year UST bond on 5/2/2024. The bond has a maturity of 2/15/2034, so an accurate price would have to account for accrued interest and the less than full remaining coupon period.
fn main() {
    // details for TMUBMUSD10Y bond on 5/2/2024
    let bond1 = SimpleBond {
        face_value: 1000.0,
        coupon: 0.04,
        frequency: 2.0,
        maturity: 10.0,
        yield_to_maturity: 0.04584,
        price: 0.0, // we have to give some float to fill out the struct.
    };
    println!("10 year UST price on 5/2/2024: ${}", bond1.solve_price());
}
This gives the output 10 year UST price on 5/2/2024: $953.5723
. Compare this to the market published price of 95 4/32: since 4/32 of a point is 0.125, the quote is 95.125% of face value, which equals 1000 * 0.95125 = $951.25. So our simple implementation gives an approximately correct result.
I appreciate any corrections or feedback to feedback@finlaymcalpine.com
Some sources that I will recommend on this topic (there are many) are:
If we have a response (or dependent) variable Y and one or more covariates (or features) X, we can model the relationship between these in the following way:
Y = a + bX + e
where Y, X, and e take values for each sample in our data.
A crucial assumption for the linear regression model is that the linear model correctly describes the data - that the true relationship between Y and X is linear. In many cases, we’ll assess this simply by looking at the plotted data. While not the most precise method, this will help quickly reject linear regression in cases where the relationship is clearly not linear. This is probably the most important assumption, and it is often the easiest one to see violated.
Once we’ve plotted the data and verified that the linear model is appropriate, we then have to think about other assumptions we’re making when we use linear regression.
There seems to be some confusion about the assumptions of linear regression, and I have seen a number of articles that make incorrect claims.
The most common incorrect claim is that the data need to be normally distributed. That is not correct. We don’t need any specific distribution of the data: a feature in X could be normally distributed (as might be the case with height), log-normal (as might be the case for income), or follow many other distributions. We might even find that a feature takes on a bimodal distribution clustered strongly around two values. None of these scenarios prevent us from using the linear model. However, if we have strongly clustered data, or data that is sparse in some regions of the feature space, we might want to think carefully about the data generating process underlying our data.
So what assumptions do we make when we use the linear model?
First, we assume that there is no perfect multicollinearity in our data: that no feature is an exact linear combination of other features. A possible cause of a failure here would be a column of our data being accidentally duplicated.
Short of exact collinearity, high correlation between features is not actually prohibitive from a statistical point of view, but it does make the calculation of the model more difficult for the software. In the case that some features are highly correlated with one another, we won’t have biased estimates of the coefficients, but we will widen the standard errors of our coefficient estimates. Intuitively, if two features are highly correlated, it is more difficult to precisely tease out each one’s individual effect on the response variable.
So while multicollinearity isn’t going to bias our coefficient estimates, it will make them less efficient (in the statistical sense).
A more technical treatment of the collinearity problem can be found here.
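The widening of standard errors can be seen directly in a small NumPy simulation (illustrative numbers of my own, not from any referenced source): the spread of the estimated coefficient across repeated samples grows sharply as the correlation between two features rises.

```python
import numpy as np

rng = np.random.default_rng(0)

def slope_spread(rho, n=100, reps=500):
    """Std. dev. of the estimated coefficient on x1 across repeated samples,
    where x1 and x2 have correlation rho and the true model is y = x1 + x2 + e."""
    estimates = []
    for _ in range(reps):
        x1 = rng.normal(size=n)
        x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
        y = x1 + x2 + rng.normal(size=n)
        X = np.column_stack([np.ones(n), x1, x2])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        estimates.append(beta[1])
    return np.std(estimates)

low, high = slope_spread(0.1), slope_spread(0.95)
print(low, high)  # the spread is much larger for rho = 0.95
```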
Second, we assume orthogonality of the errors, that E(e) = 0. That’s to say that knowing the value of the X matrix tells us nothing about the error for a specific observation. A scenario in which this assumption would fail would be the existence of an omitted variable that is correlated with our features (Omitted Variable Bias), and has correlation (i.e. some explanatory power) with the response variable. In that case, knowing something about X can give us some information about the error term, because there’s an omitted variable that exists ‘inside’ that error and it’s correlated with X.
If we omit such a variable, we are going to capture its effect on Y in the coefficient we estimate for X. That leads our coefficient estimate, b, to be biased: we no longer have an unbiased estimator.
There is a nice StackExchange answer here that provides an interesting example of how an omitted variable can bias our coefficient estimates.
A useful discussion of the difference between OVB and Multicollinearity is to be found here.
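The direction of the bias is easy to demonstrate with a short simulation (illustrative numbers of my own): omitting a variable z that is positively correlated with x, and that also explains y, pushes the estimated coefficient on x above its true value.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# True model: y = 2*x + 3*z + e, with z correlated with x
x = rng.normal(size=n)
z = 0.5 * x + rng.normal(size=n)   # cov(x, z) = 0.5 * var(x)
y = 2 * x + 3 * z + rng.normal(size=n)

# Regress y on x alone, omitting z
slope = np.polyfit(x, y, 1)[0]
print(slope)  # close to 2 + 3*0.5 = 3.5, not the true coefficient 2
```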
Third, we make an assumption about the structure of the error variance. It more or less boils down to all the errors sharing a single variance and having no covariance between one another.
There are two components to this assumption, the first of which is that errors conditional on X are similarly distributed across the values of a feature. When we plot our residuals, we don’t want to see that the residuals have a higher variance at some points in the range of X. In that case, we’d be concerned that the error in our model is in fact heteroskedastic.
The second component of the assumption is that there is no serial correlation between errors. That breakdown most often occurs in time series applications, where we are concerned about correlation through time, or as spatial correlation across groups in cross-sectional data.
The consequence of failing this assumption on the structure of the error variance is that our coefficient estimates will no longer be efficient, and any hypothesis testing we do on those estimates will be erroneous: the standard errors given by the standard OLS procedure assume homoskedasticity. To correct for the presence of heteroskedasticity, we use robust standard errors, or move to the framework of Generalised Least Squares or Weighted Least Squares.
A final assumption that is sometimes added is that the errors are normally distributed. This is not necessary for unbiased estimation of the coefficients or for the efficiency of our estimates. However, adding it has a couple of consequences: one is that the OLS coefficient estimator is then equivalent to the Maximum Likelihood Estimator of the coefficients.
If we don’t have normally distributed errors, there can be issues with point estimates and confidence intervals for our coefficients. However, the effect of non-normal errors is usually pretty quick to wash out as the sample size increases, and it is not a primary concern when we’re working with linear models.
I appreciate any corrections or feedback to feedback@finlaymcalpine.com
The bias-variance tradeoff, in short, shows that - when fitting a statistical model - we can decompose the mean squared error into a bias component and a variance component (the derivation of this can be found on the Wikipedia page). So our measurement of predictive error can be attributed to those two components. The challenge comes from the fact that bias is a decreasing function of model complexity, and variance is an increasing function of model complexity.
I’ve often seen the tradeoff illustrated by increasing the degree of a polynomial fit to data - the higher the degree, the more closely the function tracks the training data. But I wanted to motivate this idea a little more through the lens of linear regression. When we think of a standard linear model, we can throw in a lot of variables to explain as much of the variance in our response variable as we can. But that might not be a good idea (and in fact - beyond a certain point - it’s almost certainly going to be a bad idea!).
Given that we don’t know the correct specification for the linear model \(Y= \beta X + \epsilon\), we need to find the balance of bias and variance that allows us to obtain the most accurate prediction. One way we can misspecify the model is to omit relevant variables. Omitting a variable induces bias into our model, because we’re now capturing some of the effect of the omitted variable in the coefficient on the included variable. Now, if the included and excluded variables are uncorrelated, then the omitted variable won’t bias the coefficient. In that case, the only loss is the explanatory power we would have gained from another way to explain the response variable Y. But most variables in observational data will have some correlation. This becomes even more complex when we have multiple included and omitted variables, because we now have to consider the many relationships between all of the included and excluded variables.
There’s a derivation of the formula for the bias induced by an omitted variable in section 2 of this handout.
A second source of misspecification that causes problems when applying the linear model is the inclusion of too many variables. This relates to the multicollinearity assumption in the post here, and specifically to the idea of adding highly correlated features to our model. When we do so, we increase the variance of the estimates of our coefficients and make our estimates more prone to changing as a result of small changes in the data. From the perspective of inference we will be less likely to correctly reject a null hypothesis, and in the context of prediction, our out-of-sample performance is likely to be weaker - because our estimates are more sensitive to the specific sample we have trained on.
Let’s suppose we have a model of the form \(Y = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \epsilon\), where \(X_{1}\) and \(X_{2}\) are highly collinear. We can imagine that (plot to come) the data points are going to lie in a broadly cylindrical area around a line. Given that most of the data is focused in a narrow area of the feature space, the plane of best fit for our data is not anchored in the way it would be if we had features with low correlation. Thus, the variance of our coefficients is going to be wider in the case with highly correlated features.
The general approach to dealing with the bias-variance tradeoff is to accept a little bit of bias in return for a reduction in variance. This is the basis of regularization (whether LASSO, Ridge, or some other technique). While I plan to make some notes on these in a future post, the purpose of these methods is to either remove features or shrink the feature coefficients toward zero, in order to remove or reduce their influence on our estimates. This often has the effect of improving the MSE of the model relative to unregularized least squares.
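As a quick sketch of the shrinkage idea (plain NumPy, illustrative data of my own): a closed-form ridge estimator pulls the coefficients on two collinear features toward zero relative to least squares.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50

# Two highly collinear features
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(size=n)   # true coefficients are (1, 1)

def ridge(X, y, alpha):
    """Closed-form ridge estimate (X'X + alpha*I)^-1 X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

beta_ols = ridge(X, y, 0.0)        # alpha = 0 recovers least squares
beta_ridge = ridge(X, y, 1.0)

# Shrinkage: the ridge coefficients have a smaller norm than OLS
print(np.linalg.norm(beta_ols), np.linalg.norm(beta_ridge))
```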
Greene has a treatment of these ideas (bias and variance in the linear model) in his textbook on econometrics. The relevant chapter is here, sections 4.3.2 and 4.3.3.
As always, I appreciate any corrections or feedback to feedback@finlaymcalpine.com
I forked the repository (this is my own version of the repo) and completed most of the projects. They generally entail setting up a CLI using argparse, reading a text file or specified command line argument, and then undertaking a small task using the input. Mostly there’s a lot of string parsing and manipulation, along with the use of standard Python data structures.
The whole project is nicely set up for testing and building, and I learned a lot while debugging some of these projects. The tests will fail a fair bit, especially because some of the provided edge cases are pretty frustrating.
Having a small set of projects I can refer to in the future seems like a useful resource, and I’d recommend the repo to anyone looking to get some scripting practice.