Machine learning isn’t only for classification tasks — it also excels at predicting continuous values, like house prices, stock movements, or in this case: used heavy equipment sale prices. In this project, based on the Bluebook for Bulldozers Kaggle competition, I explored how machine learning can be applied to predict the auction price of bulldozers.
This was my second end-to-end ML project, where I dealt with real-world, messy data, time series features, and model evaluation using a logarithmic regression metric (RMSLE). Let me walk you through it.
- Problem: Can we predict the future sale price of a bulldozer, given its characteristics and previous examples of similar sales?
- Data Source: Bluebook for Bulldozers Dataset — Kaggle
- Evaluation Metric: Root Mean Squared Log Error (RMSLE)
To build a regression model that predicts bulldozer sale prices using historical auction data, with features like model ID, year, usage, and sale date. The final model’s performance is assessed on unseen data using RMSLE — which penalizes underestimation and is commonly used for price prediction tasks.
The dataset comes with three main CSVs:

- Train.csv: the full training data, through the end of 2011
- Valid.csv: validation data used for the public leaderboard (Jan 2012 – April 2012)
- Test.csv: hidden test data used for the final rankings (May 2012 – Nov 2012)

I used TrainAndValid.csv for this project.
```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the combined training + validation data
df = pd.read_csv("bluebook-for-bulldozers/TrainAndValid.csv", low_memory=False)

# Inspect dtypes and missing values, then look at the target distribution
df.info()
df.SalePrice.plot.hist();
```
This dataset was messy — many missing values, inconsistent data types, and plenty of unused columns. A key column, saledate, was read in as an object instead of a datetime, so I reloaded the data with date parsing:
```python
# Reload the data, parsing saledate as a proper datetime column
df = pd.read_csv("bluebook-for-bulldozers/TrainAndValid.csv",
                 low_memory=False,
                 parse_dates=["saledate"])
```
This allowed me to extract the year, month, day, and day of week, which are powerful features for time-series based predictions.
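As a rough sketch of what that extraction can look like (the derived column names here are my own choice, not fixed by the dataset):

```python
# Derive calendar features from the parsed saledate column
df["saleYear"] = df["saledate"].dt.year
df["saleMonth"] = df["saledate"].dt.month
df["saleDay"] = df["saledate"].dt.day
df["saleDayOfWeek"] = df["saledate"].dt.dayofweek

# Once the parts are extracted, the raw datetime is no longer needed
df = df.drop("saledate", axis=1)
```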
- Extracted time-based features from saledate
- Filled missing numerical data with the median
- Converted string/object categories to numbers using ordinal encoding (a sketch of the fill/encode steps follows this list)
- Removed columns with >50% missing data or no variance
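A minimal sketch of the fill and encode steps, assuming a straightforward pandas approach (the exact column handling in the original notebook may differ):

```python
import pandas as pd

for label, content in df.items():
    if pd.api.types.is_numeric_dtype(content):
        # Fill missing numeric values with the column median
        if content.isna().any():
            df[label] = content.fillna(content.median())
    else:
        # Ordinal-encode string/object columns; missing values become -1, so shift to 0+
        df[label] = pd.Categorical(content).codes + 1
```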
I used RandomForestRegressor from sklearn.ensemble, as tree-based models perform very well on tabular data and don’t require feature scaling.
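The training code below assumes a train/validation split. The exact split isn’t shown here, but given the dataset’s validation window (sales from 2012), a time-based split along these lines is a reasonable assumption (it reuses the saleYear column from the earlier sketch):

```python
# Hold out 2012 sales for validation, mimicking the Kaggle setup (assumed split)
df_valid = df[df["saleYear"] == 2012]
df_train = df[df["saleYear"] != 2012]

X_train, y_train = df_train.drop("SalePrice", axis=1), df_train["SalePrice"]
X_valid, y_valid = df_valid.drop("SalePrice", axis=1), df_valid["SalePrice"]
```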
```python
from sklearn.ensemble import RandomForestRegressor

# Fit a baseline random forest using all available CPU cores
model = RandomForestRegressor(n_jobs=-1, random_state=42)
model.fit(X_train, y_train)
```
To evaluate the model’s performance, I used the Root Mean Squared Log Error (RMSLE). It works well when we care about relative errors in predicting large values (e.g., a $50k bulldozer vs. a $200k bulldozer).
```python
from sklearn.metrics import mean_squared_log_error, mean_squared_error
import numpy as np

# Score the baseline model on the validation set using RMSLE
preds = model.predict(X_valid)
score = np.sqrt(mean_squared_log_error(y_valid, preds))
```
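To make the “relative error” point concrete, here is a quick illustration (the numbers are just an example, not results from the project): a 10% overestimate on a $50k machine and a 10% overestimate on a $200k machine contribute roughly the same RMSLE.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

cheap = np.sqrt(mean_squared_log_error([50_000], [55_000]))     # 10% over on $50k
pricey = np.sqrt(mean_squared_log_error([200_000], [220_000]))  # 10% over on $200k
print(cheap, pricey)  # both roughly 0.095: RMSLE cares about the ratio, not the dollar gap
```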
I used RandomizedSearchCV to search over n_estimators, max_depth, min_samples_split, and min_samples_leaf.
```python
from sklearn.model_selection import RandomizedSearchCV
```
This helped optimize performance while controlling overfitting.
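The exact search space isn’t listed in the write-up, so the grid below is only an assumption to illustrate the setup; the hyperparameter names match the ones mentioned above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Illustrative search space (assumed values, not the project's exact grid)
rf_grid = {
    "n_estimators": np.arange(10, 200, 10),
    "max_depth": [None, 3, 5, 10],
    "min_samples_split": np.arange(2, 20, 2),
    "min_samples_leaf": np.arange(1, 20, 2),
}

rs_model = RandomizedSearchCV(
    RandomForestRegressor(n_jobs=-1, random_state=42),
    param_distributions=rf_grid,
    n_iter=20,   # number of random parameter combinations to try
    cv=5,        # 5-fold cross-validation
    verbose=1,
)
rs_model.fit(X_train, y_train)
```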
- Date features matter: extracting year/month improved model accuracy
- Trees are powerful: with simple imputation and encoding, the random forest handled the messy tabular and categorical data without any feature scaling
- Data cleaning is 70% of the work
- Metrics matter: RMSLE helped balance out high-variance predictions