Project 3 Regression¶
Predicting House Sale Prices¶
Predicting the amount a house will sell for is an important task in the real estate market. Realtors often research aspects such as neighborhood, square footage, and more to decide how to price a house. This challenge, while relevant, is a difficult one because the market is always changing.
The dataset that we are using can be found here. It comes from a Kaggle competition that focuses on learning regression techniques. The dataset offers 79 variables that we can work with, ranging from lot size, neighborhood, the year the house was built, ratings, the year sold, and more. It showcases many of the values that are commonly considered when determining the sale price of a house. As someone with a close friend who is a realtor, I know these models are becoming more and more common within the space, as they help people determine their house's value at any time.
Regression is the focus of this assignment, and we will be exploring what it means and how we use it throughout the experiments we will be tackling.
What is Regression and How Does it Work?¶
Regression is a machine learning method for training models that predict a continuous value based upon several features the model is aware of. For example, a car's price could be predicted from values such as make, year, gas mileage, current mileage, and more. Unlike classification, which predicts discrete labels, regression predicts values along a continuous range rather than from a fixed set.
In this case, we will be looking at house prices. We will use a mix of different regression methods, focusing on Linear Regression, which models the target variable as a linear combination of the features while minimizing error. There are also other techniques we can use, like Ridge Regression (which we use in Experiment 2) and Random Forest Regression (which we use in Experiment 3).
Using these different methods, we will explore different values and different features and techniques we can use to improve our models.
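To make this concrete before diving into the dataset, here is a minimal, illustrative sketch of a regression model learning a continuous mapping from features to a target. The toy car data below is entirely made up for demonstration and is not part of the housing dataset.

# Minimal regression sketch on invented toy data (illustrative only).
import numpy as np
from sklearn.linear_model import LinearRegression
# Features: [model year, mileage in thousands]; target: price in dollars (made-up values).
X_toy = np.array([[2012, 90], [2015, 60], [2018, 30], [2020, 15]])
y_toy = np.array([8000, 12000, 18000, 24000])
toy_model = LinearRegression().fit(X_toy, y_toy)
print(toy_model.predict([[2019, 25]]))  # Outputs a continuous price, not a class label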
%pip install numpy pandas matplotlib seaborn scikit-learn
Requirement already satisfied: numpy in c:\users\mike\appdata\local\programs\python\python313\lib\site-packages (2.2.2)
Requirement already satisfied: pandas in c:\users\mike\appdata\local\programs\python\python313\lib\site-packages (2.2.3)
Requirement already satisfied: matplotlib in c:\users\mike\appdata\local\programs\python\python313\lib\site-packages (3.10.0)
Requirement already satisfied: seaborn in c:\users\mike\appdata\local\programs\python\python313\lib\site-packages (0.13.2)
Note: you may need to restart the kernel to use updated packages.
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
train_df = pd.read_csv("./train.csv")
train_df.head()
| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 rows × 81 columns
print(train_df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Id             1460 non-null   int64
 1   MSSubClass     1460 non-null   int64
 2   MSZoning       1460 non-null   object
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64
 5   Street         1460 non-null   object
 6   Alley          91 non-null     object
 7   LotShape       1460 non-null   object
 8   LandContour    1460 non-null   object
 9   Utilities      1460 non-null   object
 10  LotConfig      1460 non-null   object
 11  LandSlope      1460 non-null   object
 12  Neighborhood   1460 non-null   object
 13  Condition1     1460 non-null   object
 14  Condition2     1460 non-null   object
 15  BldgType       1460 non-null   object
 16  HouseStyle     1460 non-null   object
 17  OverallQual    1460 non-null   int64
 18  OverallCond    1460 non-null   int64
 19  YearBuilt      1460 non-null   int64
 20  YearRemodAdd   1460 non-null   int64
 21  RoofStyle      1460 non-null   object
 22  RoofMatl       1460 non-null   object
 23  Exterior1st    1460 non-null   object
 24  Exterior2nd    1460 non-null   object
 25  MasVnrType     588 non-null    object
 26  MasVnrArea     1452 non-null   float64
 27  ExterQual      1460 non-null   object
 28  ExterCond      1460 non-null   object
 29  Foundation     1460 non-null   object
 30  BsmtQual       1423 non-null   object
 31  BsmtCond       1423 non-null   object
 32  BsmtExposure   1422 non-null   object
 33  BsmtFinType1   1423 non-null   object
 34  BsmtFinSF1     1460 non-null   int64
 35  BsmtFinType2   1422 non-null   object
 36  BsmtFinSF2     1460 non-null   int64
 37  BsmtUnfSF      1460 non-null   int64
 38  TotalBsmtSF    1460 non-null   int64
 39  Heating        1460 non-null   object
 40  HeatingQC      1460 non-null   object
 41  CentralAir     1460 non-null   object
 42  Electrical     1459 non-null   object
 43  1stFlrSF       1460 non-null   int64
 44  2ndFlrSF       1460 non-null   int64
 45  LowQualFinSF   1460 non-null   int64
 46  GrLivArea      1460 non-null   int64
 47  BsmtFullBath   1460 non-null   int64
 48  BsmtHalfBath   1460 non-null   int64
 49  FullBath       1460 non-null   int64
 50  HalfBath       1460 non-null   int64
 51  BedroomAbvGr   1460 non-null   int64
 52  KitchenAbvGr   1460 non-null   int64
 53  KitchenQual    1460 non-null   object
 54  TotRmsAbvGrd   1460 non-null   int64
 55  Functional     1460 non-null   object
 56  Fireplaces     1460 non-null   int64
 57  FireplaceQu    770 non-null    object
 58  GarageType     1379 non-null   object
 59  GarageYrBlt    1379 non-null   float64
 60  GarageFinish   1379 non-null   object
 61  GarageCars     1460 non-null   int64
 62  GarageArea     1460 non-null   int64
 63  GarageQual     1379 non-null   object
 64  GarageCond     1379 non-null   object
 65  PavedDrive     1460 non-null   object
 66  WoodDeckSF     1460 non-null   int64
 67  OpenPorchSF    1460 non-null   int64
 68  EnclosedPorch  1460 non-null   int64
 69  3SsnPorch      1460 non-null   int64
 70  ScreenPorch    1460 non-null   int64
 71  PoolArea       1460 non-null   int64
 72  PoolQC         7 non-null      object
 73  Fence          281 non-null    object
 74  MiscFeature    54 non-null     object
 75  MiscVal        1460 non-null   int64
 76  MoSold         1460 non-null   int64
 77  YrSold         1460 non-null   int64
 78  SaleType       1460 non-null   object
 79  SaleCondition  1460 non-null   object
 80  SalePrice      1460 non-null   int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB
None
print(train_df.describe())
                Id   MSSubClass  LotFrontage        LotArea  OverallQual  \
count  1460.000000  1460.000000  1201.000000    1460.000000  1460.000000
mean    730.500000    56.897260    70.049958   10516.828082     6.099315
std     421.610009    42.300571    24.284752    9981.264932     1.382997
min       1.000000    20.000000    21.000000    1300.000000     1.000000
25%     365.750000    20.000000    59.000000    7553.500000     5.000000
50%     730.500000    50.000000    69.000000    9478.500000     6.000000
75%    1095.250000    70.000000    80.000000   11601.500000     7.000000
max    1460.000000   190.000000   313.000000  215245.000000    10.000000

       OverallCond    YearBuilt  YearRemodAdd   MasVnrArea   BsmtFinSF1  ...  \
count  1460.000000  1460.000000   1460.000000  1452.000000  1460.000000  ...
mean      5.575342  1971.267808   1984.865753   103.685262   443.639726  ...
std       1.112799    30.202904     20.645407   181.066207   456.098091  ...
min       1.000000  1872.000000   1950.000000     0.000000     0.000000  ...
25%       5.000000  1954.000000   1967.000000     0.000000     0.000000  ...
50%       5.000000  1973.000000   1994.000000     0.000000   383.500000  ...
75%       6.000000  2000.000000   2004.000000   166.000000   712.250000  ...
max       9.000000  2010.000000   2010.000000  1600.000000  5644.000000  ...

        WoodDeckSF  OpenPorchSF  EnclosedPorch    3SsnPorch  ScreenPorch  \
count  1460.000000  1460.000000    1460.000000  1460.000000  1460.000000
mean     94.244521    46.660274      21.954110     3.409589    15.060959
std     125.338794    66.256028      61.119149    29.317331    55.757415
min       0.000000     0.000000       0.000000     0.000000     0.000000
25%       0.000000     0.000000       0.000000     0.000000     0.000000
50%       0.000000    25.000000       0.000000     0.000000     0.000000
75%     168.000000    68.000000       0.000000     0.000000     0.000000
max     857.000000   547.000000     552.000000   508.000000   480.000000

          PoolArea       MiscVal       MoSold       YrSold      SalePrice
count  1460.000000   1460.000000  1460.000000  1460.000000    1460.000000
mean      2.758904     43.489041     6.321918  2007.815753  180921.195890
std      40.177307    496.123024     2.703626     1.328095   79442.502883
min       0.000000      0.000000     1.000000  2006.000000   34900.000000
25%       0.000000      0.000000     5.000000  2007.000000  129975.000000
50%       0.000000      0.000000     6.000000  2008.000000  163000.000000
75%       0.000000      0.000000     8.000000  2009.000000  214000.000000
max     738.000000  15500.000000    12.000000  2010.000000  755000.000000

[8 rows x 38 columns]
Let's check for any missing values¶
# Let's check for missing values.
missing = train_df.isnull().sum().sort_values(ascending=False)
missing = missing[missing > 0]
print("\nMissing Values:\n", missing)
Missing Values:
PoolQC          1453
MiscFeature     1406
Alley           1369
Fence           1179
MasVnrType       872
FireplaceQu      690
LotFrontage      259
GarageQual        81
GarageFinish      81
GarageType        81
GarageYrBlt       81
GarageCond        81
BsmtFinType2      38
BsmtExposure      38
BsmtCond          37
BsmtQual          37
BsmtFinType1      37
MasVnrArea         8
Electrical         1
dtype: int64
As we can see, PoolQC, MiscFeature, Alley, and Fence are each missing in over 50% of rows. We will remove them, along with any other columns missing at or above the 50% threshold; after that, we will fill the remaining missing numerical values with the median to keep the model sound.
# The try/finally lets this cell be re-run safely: the drop is attempted, and the remaining missing values are printed either way.
try:
train_df = train_df.drop(train_df.columns[(train_df.isnull().sum() / train_df.shape[0] >= 0.50)], axis=1)
finally:
missing = train_df.isnull().sum().sort_values(ascending=False)
missing = missing[missing > 0]
print("Missing Values:\n", missing)
Missing Values:
FireplaceQu     690
LotFrontage     259
GarageYrBlt      81
GarageQual       81
GarageFinish     81
GarageType       81
GarageCond       81
BsmtFinType2     38
BsmtExposure     38
BsmtQual         37
BsmtCond         37
BsmtFinType1     37
MasVnrArea        8
Electrical        1
dtype: int64
# We want to fill numerical values with the median
numCols = train_df.select_dtypes(include=["int64", "float64"]).columns
train_df[numCols] = train_df[numCols].fillna(train_df[numCols].median())
# Now we want to remove columns that won't serve a purpose for us: non-informative columns we want excluded from training.
train_df = train_df.drop(['Id', 'MSZoning', 'MSSubClass', 'LandSlope', 'RoofMatl', 'MoSold', 'LotShape', 'BsmtCond', 'LandContour', 'GarageYrBlt', 'YearRemodAdd', 'Street', 'Utilities', 'GarageCond', 'Condition1', 'Condition2', 'Electrical', 'Functional', 'Heating', 'CentralAir'], axis=1)
# Now we want to encode the categorical values. They aren't numeric, and we want the model to be able to use them, so we one-hot encode them for modeling.
train_df = pd.get_dummies(train_df, columns=train_df.select_dtypes(include=["object"]).columns, drop_first=True, dtype=int)
train_df.head()
| | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | ... | SaleType_ConLI | SaleType_ConLw | SaleType_New | SaleType_Oth | SaleType_WD | SaleCondition_AdjLand | SaleCondition_Alloca | SaleCondition_Family | SaleCondition_Normal | SaleCondition_Partial |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 65.0 | 8450 | 7 | 5 | 2003 | 196.0 | 706 | 0 | 150 | 856 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 1 | 80.0 | 9600 | 6 | 8 | 1976 | 0.0 | 978 | 0 | 284 | 1262 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 68.0 | 11250 | 7 | 5 | 2001 | 162.0 | 486 | 0 | 434 | 920 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 3 | 60.0 | 9550 | 7 | 5 | 1915 | 0.0 | 216 | 0 | 540 | 756 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 84.0 | 14260 | 8 | 5 | 2000 | 350.0 | 655 | 0 | 490 | 1145 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
5 rows × 171 columns
Correlated Features (Top 15)¶
Some Insights:
OverallQual - The leading correlated feature, showing that the overall quality of the house matters most.
GrLivArea - A larger living area is a strong predictor of sale value.
GarageCars - Garage capacity correlates strongly with sale value.
YearBuilt - A fairly strong correlation with sale price; the age of a house signals value, since older houses may need renovations.
sns.set_theme(style="whitegrid")
# Correlation heatmap
corr_matrix = train_df.corr()
top_corr_features = corr_matrix["SalePrice"].abs().sort_values(ascending=False)[0:15] # With so many features, keep only the top 15 correlations with SalePrice; that's all we need for the plot.
plt.figure(figsize=(12, 10))
sns.heatmap(train_df[top_corr_features.index].corr(), annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation heatmap of features (Top 15)')
plt.show()
plt.figure(figsize=(12, 10))
sns.scatterplot(x='GrLivArea', y='SalePrice', data=train_df)
plt.suptitle('Living Area vs Sale Price')
plt.show()
plt.figure(figsize=(12,6))
sns.boxplot(x='OverallQual', y='SalePrice', data=train_df)
plt.title('House Quality vs Sale Price')
plt.show()
Variance Threshold Feature¶
I wanted to try the variance threshold, which removes features with low variance. Filtering removes 38 features; we then use the result to create our split for training and testing.
X = train_df.drop('SalePrice', axis=1)
y = train_df['SalePrice']
selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(X)
print(f"Original features: {X.shape[1]}")
print(f"Features after variance threshold: {X_selected.shape[1]}")
Original features: 170
Features after variance threshold: 132
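One caveat worth noting: fit_transform returns a plain NumPy array, so the names of the surviving columns are lost. If we want to see which features were kept or dropped, the selector's boolean support mask can recover them; a quick sketch using the selector and X defined above:

# Recover which columns survived the variance threshold.
kept_mask = selector.get_support()  # Boolean mask over the original columns
kept_cols = X.columns[kept_mask]
dropped_cols = X.columns[~kept_mask]
print(f"Kept {len(kept_cols)} features; dropped {len(dropped_cols)}")
print(list(dropped_cols))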
Modeling¶
Now we create the train/test split and fit the Linear Regression model.
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.2, random_state=42) # An 80/20 split, where 20 percent goes towards testing.
lr = LinearRegression()
lr.fit(X_train, y_train)
LinearRegression()
preds = lr.predict(X_test)
print(f"Linear Regression Experiement 1 Results:")
print(f"RMSE: ${np.sqrt(mean_squared_error(y_test, preds)):,.2f}")
print(f"MAE: ${mean_absolute_error(y_test, preds):,.2f}")
print(f"R²: {r2_score(y_test, preds):.3f}")
Linear Regression Experiment 1 Results:
RMSE: $30,895.02
MAE: $19,230.91
R²: 0.876
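Since we will report the same three metrics for every experiment, a small helper keeps the reporting consistent. This is just a convenience sketch; the later cells inline the same calls directly.

# Convenience helper: print RMSE, MAE, and R² for a set of predictions.
def report_metrics(name, y_true, y_pred):
    print(f"{name} Results:")
    print(f"RMSE: ${np.sqrt(mean_squared_error(y_true, y_pred)):,.2f}")
    print(f"MAE: ${mean_absolute_error(y_true, y_pred):,.2f}")
    print(f"R²: {r2_score(y_true, y_pred):.3f}")
report_metrics("Linear Regression Experiment 1", y_test, preds)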
Visualization¶
I want to take a look at our model and see how it performed. Above we can see our RMSE, MAE, and R² scores. Below is a plot showing that our model makes a reasonably accurate prediction of a house's sale price: the diagonal line marks the perfect case where predicted == actual, and our values hover around it.
sns.scatterplot(x=y_test, y=preds, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()])
plt.ylabel("Predicted Price")
plt.xlabel("Actual Price")
plt.show()
Experiment 2: Ridge Regression + More Features¶
For Experiment 2, I will be using Ridge Regression, training the model with the Ridge class from sklearn, while also adding two engineered features and feature scaling during preprocessing:

TotalSF - The total square footage of the house (GrLivArea + TotalBsmtSF)
HouseAge - The age of the house at sale (YrSold - YearBuilt)
ex2_df = train_df.copy()  # Make a copy of the modified train_df
ex2_df['TotalSF'] = ex2_df['GrLivArea'] + ex2_df['TotalBsmtSF']
ex2_df['HouseAge'] = ex2_df['YrSold'] - ex2_df['YearBuilt']
X = ex2_df.drop('SalePrice', axis=1)
X = (X - X.mean()) / X.std() # Standardize each column (zero mean, unit variance)
y = ex2_df['SalePrice']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
ridge = Ridge(alpha=100)
ridge.fit(X_train, y_train)
preds = ridge.predict(X_test)
print(f"RMSE: ${np.sqrt(mean_squared_error(y_test, preds)):,.2f}")
print(f"MAE: ${mean_absolute_error(y_test, preds):,.2f}")
print(f"R²: {r2_score(y_test, preds):.3f}")
RMSE: $32,262.35
MAE: $19,229.14
R²: 0.864
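The alpha=100 above was picked by hand. An alternative worth mentioning is scikit-learn's RidgeCV, which selects the penalty strength by cross-validation; a sketch of how that might look here (the log-spaced alpha grid is an arbitrary choice for illustration, not tuned for this dataset):

from sklearn.linear_model import RidgeCV
# Let cross-validation choose alpha from a log-spaced grid (grid bounds are assumptions).
ridge_cv = RidgeCV(alphas=np.logspace(-2, 3, 30))
ridge_cv.fit(X_train, y_train)
print(f"Chosen alpha: {ridge_cv.alpha_:.2f}")
print(f"Test R²: {ridge_cv.score(X_test, y_test):.3f}")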
Visualization¶
We will use the same visualization that we used in the previous experiment. The model performs comparably, though slightly worse than plain linear regression here (R² of 0.864 vs 0.876).
sns.scatterplot(x=y_test, y=preds, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()])
plt.ylabel("Predicted Price")
plt.xlabel("Actual Price")
plt.show()
Correlated Features with Engineering¶
Some Insights:
OverallQual - Continues to be a leading correlated feature, followed by GrLivArea, GarageCars, and GarageArea.
TotalSF - Our added feature plays a heavy role at .78.
# Correlation heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(ex2_df[ex2_df.corr()["SalePrice"].abs().sort_values(ascending=False)[0:15].index].corr(), annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5) # With so many features, plot only the top 15 correlations with SalePrice.
plt.title('Correlation heatmap of features + Feature Engineering Done (Top 15)')
plt.show()
Experiment 3: Random Forest Regression + Variance Threshold¶
For Experiment 3, we use a Random Forest Regressor, keeping the two engineered features from Experiment 2 and the variance threshold from Experiment 1.
ex3_df = train_df.copy()  # Make a copy of the modified train_df
ex3_df['TotalSF'] = ex3_df['GrLivArea'] + ex3_df['TotalBsmtSF']
ex3_df['HouseAge'] = ex3_df['YrSold'] - ex3_df['YearBuilt']
X = ex3_df.drop('SalePrice', axis=1)
y = ex3_df['SalePrice']
selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(X)  # Fit on the Experiment 3 features, not the Experiment 2 matrix
X_selected = (X_selected - X_selected.mean(axis=0)) / X_selected.std(axis=0)  # Standardize each column
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.2, random_state=42)  # Split the selected features, not the unfiltered X
rf = RandomForestRegressor(n_estimators=100)
rf.fit(X_train, y_train)
preds = rf.predict(X_test)
print(f"RMSE: ${np.sqrt(mean_squared_error(y_test, preds)):,.2f}")
print(f"MAE: ${mean_absolute_error(y_test, preds):,.2f}")
print(f"R²: {r2_score(y_test, preds):.3f}")
RMSE: $30,253.11
MAE: $18,128.59
R²: 0.881
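A side benefit of the random forest is that it exposes feature_importances_, which we can line up with the column names that survived the variance threshold. A quick sketch, assuming selector and X from the cell above are still in scope:

# Map the forest's importances back to the surviving column names.
kept_cols = X.columns[selector.get_support()]
importances = pd.Series(rf.feature_importances_, index=kept_cols)
print(importances.sort_values(ascending=False).head(10))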
Visualization¶
We will use the same visualization that we used in the previous experiments. We can see that our model continues to perform, this time even better than our previous models.
sns.scatterplot(x=y_test, y=preds, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()])
plt.ylabel("Predicted Price")
plt.xlabel("Actual Price")
plt.show()
Correlated Features (Experiment 3)¶
Our correlated features have stayed largely consistent across each of our models.
# Correlation heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(ex3_df[ex3_df.corr()["SalePrice"].abs().sort_values(ascending=False)[0:15].index].corr(), annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5) # With so many features, plot only the top 15 correlations with SalePrice.
plt.title('Correlation heatmap of features + Feature Engineering + Variance (Top 15)')
plt.show()
Impact¶
The impact of the model can vary. The model helps drive data-driven decisions within the real estate market, helping agents and buyers evaluate houses and their likely sale prices. By using key features that most homes share, it can help determine pricing for everyone. The model can also scale easily by incorporating new data when needed.
The negatives of this model are a bit different. There is bias due to the historical data of houses: as we have seen over the past couple of years, house prices tend to form bubbles, which can lead to inflated pricing across the market for a short period. Another is an overreliance on how small this model and its data are; every area in the world has its own pricing, so the model has a shortcoming due to location. This model could also affect jobs, as there are jobs that require people to go out and appraise houses.
Conclusion¶
This project aimed to predict the sale prices of houses. We looked at three techniques: Linear Regression, Ridge Regression, and Random Forest Regression. Each has its benefits, and along the way we applied variance-threshold feature selection and added custom features like TotalSF and HouseAge, both of which proved beneficial to the model and correlated strongly with the sale price.
References¶
Kaggle House Prices dataset: here
Scikit-learn Linear Regression: here
Scikit-learn documentation overall (too many pages to list): here