Claude AI Tinkering
Having tinkered a bit with ChatGPT1 and having found it to be useful but prone to error, I thought I should also look at Claude AI, an alternative that I’ve heard good things about. I messed about with it a bit last night, just throwing prompts at it in an attempt to see what it would do.
I won’t go over all the prompting, but one thing I wanted to try was to figure out how useful Claude can be for helping to code up some Polars.
I asked for a brief overview of what the Polars library is, and then for an example of some code that uses Polars. I didn’t ask for anything specific: I wanted to see what it came up with as an example from no detail. It provided the below code:
import polars as pl
# Create a DataFrame from a list of dictionaries
data = [
{"name": "Alice", "age": 32, "city": "New York"},
{"name": "Bob", "age": 27, "city": "Los Angeles"},
{"name": "Charlie", "age": 35, "city": "Chicago"},
{"name": "Dave", "age": 29, "city": "New York"},
]
df = pl.DataFrame(data)
# Print the DataFrame
print(df)
# Select specific columns
selected_cols = df[["name", "city"]]
print(selected_cols)
# Filter rows based on a condition
filtered_df = df[df["age"] > 30]
print(filtered_df)
# Group by and aggregate
grouped = df.groupby("city").agg(pl.col("age").mean().alias("avg_age"))
print(grouped)
# Sort the DataFrame
sorted_df = df.sort("age", descending=True)
print(sorted_df)
# Write the DataFrame to a Parquet file
df.write_parquet("output.parquet")
It’s a good example of some things we might want to do with a small dataframe. The comments are a nice touch. However, one issue that presents is the use of brackets to filter the rows based on a condition. Polars doesn’t use bracket notation: it uses the .filter()
method. So the correct way to filter the dataframe to rows with an age
value greater than 30 is:
filtered_df = df.filter(pl.col("age") > 30)
This is the solution that Claude gives when prompted that bracketed filtering is not the correct approach for Polars.
My suspicion is that Claude (and this is likely true for all LLM models) has a concept of Polars that is very similar to the concept of Pandas. I think the LLM is finding that, after getting to the filtering section of the code, its probability model suggests that the next line should use bracket filtering.
I think models like Claude are likely to make this kind of mistake in longer sections of code, where the Polars syntax is similar to Pandas but then differs in one area. The model presumably has a bunch of text - that could be Polars or Pandas - and goes back to the much larger corpus of training data it possesses on Pandas.
Overall, using LLMs for coding in Polars (or any other API that has a limited history of training data) is an interesting exercise, but probably something that should be approached with caution. Especially when there is a risk of close but not perfect syntax overlap, as in the case of Pandas.
-
Only the 3.5 model thus far ↩