Linear Regression: California House Price Prediction
In this tutorial, we’ll predict California housing prices using two Haskell libraries: DataFrame (for data wrangling) and Hasktorch (for machine learning).
You can follow along and code here.
What Are We Building?
We’re going to:
- 📊 Load and clean real housing data
- 🔧 Engineer some clever features
- 🤖 Train a linear regression model
- 🎯 Predict house prices!
Think of it as teaching a computer to estimate home values based on things like location, number of rooms, and how close the house is to the ocean.
Our libraries
DataFrame
DataFrame is the Swiss Army knife of data manipulation. It lets you work with tabular data (like CSV files) in a mostly type-safe, functional way.
Hasktorch
Hasktorch brings the power of Torch to Haskell. It lets us do numerical computing and machine learning. It has tensors (multi-dimensional arrays) which are the building blocks of neural networks.
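If you've never touched a tensor before, here's a tiny standalone sketch (separate from the tutorial's program) of what one looks like in code:

-- A minimal tensor example: build a tensor from a Haskell list and poke at it.
import Torch

main :: IO ()
main = do
  let t = asTensor ([1, 2, 3, 4] :: [Float])  -- a 1-D tensor with 4 elements
  print (shape t)                             -- prints [4]
  print (t * 2)                               -- elementwise arithmetic via the Num instance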
Let’s Dive Into The Code!
Setting Up Our Imports
{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE NumericUnderscores #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TypeApplications #-}
module Main where
import qualified DataFrame as D
import qualified DataFrame.Functions as F
import DataFrame.Hasktorch (toTensor)
import Torch
import DataFrame ((|>))
import qualified Data.Text as T -- Text values from the CSV (used by oceanProximity below)
import qualified Data.Vector.Unboxed as VU -- unboxed vector for the predictions column
import Control.Monad (when) -- periodic logging inside the training loop
What’s happening here? We’re enabling some handy language extensions and importing our tools. The |> operator is particularly cool: like a Unix pipe, it lets us chain operations left-to-right!
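For a quick taste of the pipeline style, here's a small sketch that reuses two functions you'll meet later in this tutorial (filterJust and select):

-- Each |> step feeds the DataFrame from the previous step into the next one.
preview :: D.DataFrame -> D.DataFrame
preview df =
  df
    |> D.filterJust "total_bedrooms"                      -- drop rows where this value is missing
    |> D.select ["total_bedrooms", "median_house_value"]  -- keep just two columns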
Step 1: Loading the Data
df <- D.readCsv "../data/housing.csv"
Simple, right? We’re loading California housing data from a CSV file. This dataset contains information about different neighborhoods—things like population, median income, and (importantly) median house values.
Step 2: Handling Missing Data
Real-world data is messy. Sometimes values are missing, and we need to deal with that:
let meanTotalBedrooms = df |> D.filterJust "total_bedrooms" |> D.mean (F.col @Double "total_bedrooms")
Translation: “Hey DataFrame, take our data, filter out the rows where total_bedrooms is missing, then calculate the mean of what’s left.”
We’ll use this mean to fill in the blanks later. This is called imputation—fancy word for “educated guess filling.”
Step 3: Feature Engineering
Arguably the most important part of the learning process is making sure your data is meaningful. This is called “feature engineering.” We want to combine our features in interesting ways so that patterns become easier for the model to spot.
Machine learning models are powerful, but they’re not magic. They can only learn from what we give them. If we just hand over raw numbers, we’re making the model work way harder than it needs to. But if we do some creative thinking and craft features that highlight the relationships we care about, we can make even a simple model perform amazingly well.
In our housing example, we’re going to:
- Convert text categories (like “NEAR OCEAN”) into numbers the model can use (with 0 being the closest to the ocean and 4 being the furthest)
- Create a brand new feature: rooms_per_household (because maybe spacious homes are worth more?)
- Normalize everything so no single feature dominates
let cleaned =
      df
        |> D.impute (F.col @(Maybe Double) "total_bedrooms") meanTotalBedrooms
        |> D.exclude ["median_house_value"]
        |> D.derive "ocean_proximity" (F.lift oceanProximity (F.col "ocean_proximity"))
        |> D.derive
             "rooms_per_household"
             (F.col @Double "total_rooms" / F.col "households")
        |> normalizeFeatures
oceanProximity :: T.Text -> Double
oceanProximity op = case op of
"ISLAND" -> 0
"NEAR OCEAN" -> 1
"NEAR BAY" -> 2
"<1H OCEAN" -> 3
"INLAND" -> 4
_ -> error ("Unknown ocean proximity value: " ++ T.unpack op)
Let’s break this pipeline down:
- Impute: Fill in those missing bedroom values with the mean we calculated
- Exclude: Remove the house value column (we’ll use it as labels, not features)
- Derive ocean_proximity: Convert text like “NEAR OCEAN” into numbers (0-4) that our model can understand
- Derive rooms_per_household: Create a new feature! Maybe houses with more rooms per household are worth more?
- Normalize: Scale all features to a 0-1 range so no single feature dominates
Feature Normalization
normalizeFeatures :: D.DataFrame -> D.DataFrame
normalizeFeatures df =
df
|> D.fold
( \name d ->
let col = F.col @Double name
in D.derive name ((col - F.minimum col) / (F.maximum col - F.minimum col)) d
)
(D.columnNames (df |> D.selectBy [D.byProperty (D.hasElemType @Double)]))
Models trained with gradient descent do better when all the features are on a similar scale. We apply min-max normalization to every numeric column:
normalized_value = (value - min) / (max - min)
This squishes every feature to the 0-1 range. Why? Imagine if house prices ranged from 0-500,000 but number of bedrooms ranged from 0-5. The huge price numbers would dominate the small bedroom numbers during training. Normalization levels the playing field.
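Here's the same formula written out on a plain Haskell list, as a toy sketch separate from the DataFrame version above:

-- Min-max scaling on a list of Doubles (assumes the values aren't all equal,
-- otherwise we'd divide by zero).
minMaxScale :: [Double] -> [Double]
minMaxScale xs = [(x - lo) / (hi - lo) | x <- xs]
  where
    lo = minimum xs
    hi = maximum xs

-- minMaxScale [0, 1, 4, 5] == [0.0, 0.2, 0.8, 1.0]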
Step 4: From DataFrame to Tensors
let features = toTensor cleaned
    labels = toTensor (D.select ["median_house_value"] df)
Bridge time! We’re converting our nice, clean DataFrame into Hasktorch tensors. Think of tensors as supercharged matrices that GPUs love to work with. Our features are what the model learns from, and labels are what it’s trying to predict.
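If you want a quick sanity check at this point, shape tells you a tensor's dimensions, so features and labels should agree on the row count (the numbers in the comments are what you'd expect for this dataset, but treat them as illustrative):

print (shape features)  -- e.g. [20640, 10]: one row per neighborhood, one column per feature
print (shape labels)    -- e.g. [20640, 1]: one target value per neighborhood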
Step 5: Building Our Model
init <- sample $ LinearSpec{in_features = snd (D.dimensions cleaned), out_features = 1}
What’s a linear model? Imagine drawing the best-fit line through a scatter plot—except we’re doing it in many dimensions. The model learns:
house_price = w₁×feature₁ + w₂×feature₂ + ... + wₙ×featureₙ + bias
We’re creating a linear layer with as many inputs as we have features (after cleaning) and 1 output (the predicted price).
model :: Linear -> Tensor -> Tensor
model state input = squeezeAll $ linear state input
This is our prediction function—feed in features, get out a price estimate.
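Under the hood, linear is just the weighted-sum formula from above. Here's a rough sketch of the same computation done by hand, assuming the Linear record exposes its weight and bias parameters the way Torch.NN does (purely for illustration; in practice you'd call linear):

-- Roughly: prediction = features · weightsᵀ + bias
manualLinear :: Linear -> Tensor -> Tensor
manualLinear state input =
  (input `matmul` transpose2D (toDependent (weight state)))
    + toDependent (bias state)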
Step 6: Training Loop
trained <- foldLoop init 100_000 $ \state i -> do
let labels' = model state features
loss = mseLoss labels labels'
when (i `mod` 10_000 == 0) $ do
putStrLn $ "Iteration: " ++ show i ++ " | Loss: " ++ show loss
(state', _) <- runStep state GD loss 0.1
pure state'
This is where learning happens! Let’s break it down:
- 100,000 iterations: The model gets 100,000 chances to improve
- labels’: Make predictions with current model weights
- loss: How wrong are we? MSE (Mean Squared Error) measures the average squared difference between predictions and real prices
- Print every 10,000 steps: Show us how we’re doing!
- runStep with GD: Update the model using Gradient Descent with a learning rate of 0.1 (there’s a toy sketch of the update rule right after this list)
- Think of gradient descent as rolling a ball down a hill to find the lowest point (best model)
- Learning rate controls how big our steps are
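To make that rolling-ball picture concrete, here's a toy, self-contained sketch of the same update rule in plain Haskell (no Hasktorch), minimizing f(x) = (x - 3)²:

-- Toy gradient descent: repeatedly step against the gradient.
-- For f x = (x - 3)^2 the gradient is 2 * (x - 3), and the minimum sits at x = 3.
descend :: Double -> Int -> Double -> Double
descend lr steps x0 = iterate step x0 !! steps
  where
    step x = x - lr * 2 * (x - 3)

-- descend 0.1 100 0.0 lands very close to 3.0; crank the learning rate too high
-- and the steps overshoot, make it tiny and progress crawls.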
What you’ll see:
Training linear regression model...
Iteration: 10000 | Loss: Tensor Float [] 5.0225e9
Iteration: 20000 | Loss: Tensor Float [] 4.9093e9
Iteration: 30000 | Loss: Tensor Float [] 4.8576e9
Iteration: 40000 | Loss: Tensor Float [] 4.8333e9
Iteration: 50000 | Loss: Tensor Float [] 4.8217e9
Iteration: 60000 | Loss: Tensor Float [] 4.8160e9
Iteration: 70000 | Loss: Tensor Float [] 4.8130e9
Iteration: 80000 | Loss: Tensor Float [] 4.8114e9
Iteration: 90000 | Loss: Tensor Float [] 4.8105e9
Iteration: 100000 | Loss: Tensor Float [] 4.8099e9
Step 7: Making Predictions
let predictions =
D.insertUnboxedVector
"predicted_house_value"
(asValue @(VU.Vector Float) (model trained features))
df
print $ D.select ["median_house_value", "predicted_house_value"] predictions
The grand finale! We’re:
- Using our trained model to predict all the house values
- Converting the tensor back to a vector
- Adding it as a new column in our original DataFrame
- Printing a comparison of real vs. predicted values
You’ll see something like:
-------------------------------------------
median_house_value | predicted_house_value
--------------------|----------------------
Double | Float
--------------------|----------------------
452600.0 | 414079.94
358500.0 | 423011.94
352100.0 | 383239.06
341300.0 | 324928.94
342200.0 | 256934.23
269700.0 | 264944.84
299200.0 | 259094.13
241400.0 | 257224.55
226700.0 | 201753.69
261100.0 | 268698.7
...
Key Concepts We Learned
DataFrame Operations:
- |> - Pipeline operator (read left to right!)
- readCsv - Load data from CSV files
- impute - Fill in missing values
- derive - Create new columns from existing ones
- filterJust - Remove rows with missing values
- select / exclude - Choose which columns to keep
Hasktorch:
- toTensor - Convert DataFrames to tensors
- Linear - Linear regression layer
- mseLoss - Mean Squared Error loss function
- runStep with GD - Gradient descent optimization
- sample - Initialize model parameters
Machine Learning Flow:
- Load Data → Get it into your program
- Clean & Transform → Handle missing values, normalize
- Feature Engineering → Create useful new features
- Train → Iteratively improve the model
- Predict → Use the trained model on data
Try It Yourself!
Experiment ideas:
- Change the learning rate (0.1) to see how it affects training
- Add more derived features (like income per person; see the sketch after this list)
- Try different numbers of iterations
- Use different normalization strategies
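As a starting point for the derived-features idea, income per person can be built exactly like rooms_per_household, using the dataset's median_income and population columns. A sketch (slot it into the cleaning pipeline before normalizeFeatures):

-- One more derived feature, built the same way as rooms_per_household.
addIncomePerPerson :: D.DataFrame -> D.DataFrame
addIncomePerPerson df =
  df
    |> D.derive
         "income_per_person"
         (F.col @Double "median_income" / F.col "population")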
The advantages of this approach
- Type Safety: DataFrame’s type system catches most errors at compile time
- Functional Style: Pure functions and pipelines make data transformations clear
- Performance: Hasktorch uses PyTorch’s battle-tested backend
- Readability: The |> operator makes data pipelines read like stories
Next Steps
Now that you’ve mastered the basics:
- Try different models (polynomial regression, neural networks)
- Experiment with more complex feature engineering
- Learn about train/test splits and model validation
- Explore Hasktorch’s neural network modules
Get involved
Wanna help contribute to data science in Haskell?