%pip install numpy pandas matplotlib seaborn sklearn

Requirement already satisfied: numpy in c:\users\mike\appdata\local\programs\python\python313\lib\site-packages (2.2.2)
Requirement already satisfied: pandas in c:\users\mike\appdata\local\programs\python\python313\lib\site-packages (2.2.3)
Requirement already satisfied: matplotlib in c:\users\mike\appdata\local\programs\python\python313\lib\site-packages (3.10.0)
Requirement already satisfied: seaborn in c:\users\mike\appdata\local\programs\python\python313\lib\site-packages (0.13.2)
Collecting sklearn
  Using cached sklearn-0.0.post12.tar.gz (2.6 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'error'
Note: you may need to restart the kernel to use updated packages.

  error: subprocess-exited-with-error
  
  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [15 lines of output]
      The 'sklearn' PyPI package is deprecated, use 'scikit-learn'
      rather than 'sklearn' for pip commands.
      
      Here is how to fix this error in the main use cases:
      - use 'pip install scikit-learn' rather than 'pip install sklearn'
      - replace 'sklearn' by 'scikit-learn' in your pip requirements files
        (requirements.txt, setup.py, setup.cfg, Pipfile, etc ...)
      - if the 'sklearn' package is used by one of your dependencies,
        it would be great if you take some time to track which package uses
        'sklearn' instead of 'scikit-learn' and report it to their issue tracker
      - as a last resort, set the environment variable
        SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True to avoid this error
      
      More information is available at
      https://github.com/scikit-learn/sklearn-pypi-package
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

train_df = pd.read_csv("./train.csv");

train_df.head()

print(train_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallCond    1460 non-null   int64  
 19  YearBuilt      1460 non-null   int64  
 20  YearRemodAdd   1460 non-null   int64  
 21  RoofStyle      1460 non-null   object 
 22  RoofMatl       1460 non-null   object 
 23  Exterior1st    1460 non-null   object 
 24  Exterior2nd    1460 non-null   object 
 25  MasVnrType     588 non-null    object 
 26  MasVnrArea     1452 non-null   float64
 27  ExterQual      1460 non-null   object 
 28  ExterCond      1460 non-null   object 
 29  Foundation     1460 non-null   object 
 30  BsmtQual       1423 non-null   object 
 31  BsmtCond       1423 non-null   object 
 32  BsmtExposure   1422 non-null   object 
 33  BsmtFinType1   1423 non-null   object 
 34  BsmtFinSF1     1460 non-null   int64  
 35  BsmtFinType2   1422 non-null   object 
 36  BsmtFinSF2     1460 non-null   int64  
 37  BsmtUnfSF      1460 non-null   int64  
 38  TotalBsmtSF    1460 non-null   int64  
 39  Heating        1460 non-null   object 
 40  HeatingQC      1460 non-null   object 
 41  CentralAir     1460 non-null   object 
 42  Electrical     1459 non-null   object 
 43  1stFlrSF       1460 non-null   int64  
 44  2ndFlrSF       1460 non-null   int64  
 45  LowQualFinSF   1460 non-null   int64  
 46  GrLivArea      1460 non-null   int64  
 47  BsmtFullBath   1460 non-null   int64  
 48  BsmtHalfBath   1460 non-null   int64  
 49  FullBath       1460 non-null   int64  
 50  HalfBath       1460 non-null   int64  
 51  BedroomAbvGr   1460 non-null   int64  
 52  KitchenAbvGr   1460 non-null   int64  
 53  KitchenQual    1460 non-null   object 
 54  TotRmsAbvGrd   1460 non-null   int64  
 55  Functional     1460 non-null   object 
 56  Fireplaces     1460 non-null   int64  
 57  FireplaceQu    770 non-null    object 
 58  GarageType     1379 non-null   object 
 59  GarageYrBlt    1379 non-null   float64
 60  GarageFinish   1379 non-null   object 
 61  GarageCars     1460 non-null   int64  
 62  GarageArea     1460 non-null   int64  
 63  GarageQual     1379 non-null   object 
 64  GarageCond     1379 non-null   object 
 65  PavedDrive     1460 non-null   object 
 66  WoodDeckSF     1460 non-null   int64  
 67  OpenPorchSF    1460 non-null   int64  
 68  EnclosedPorch  1460 non-null   int64  
 69  3SsnPorch      1460 non-null   int64  
 70  ScreenPorch    1460 non-null   int64  
 71  PoolArea       1460 non-null   int64  
 72  PoolQC         7 non-null      object 
 73  Fence          281 non-null    object 
 74  MiscFeature    54 non-null     object 
 75  MiscVal        1460 non-null   int64  
 76  MoSold         1460 non-null   int64  
 77  YrSold         1460 non-null   int64  
 78  SaleType       1460 non-null   object 
 79  SaleCondition  1460 non-null   object 
 80  SalePrice      1460 non-null   int64  
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB
None

print(train_df.describe())

                Id   MSSubClass  LotFrontage        LotArea  OverallQual  \
count  1460.000000  1460.000000  1201.000000    1460.000000  1460.000000   
mean    730.500000    56.897260    70.049958   10516.828082     6.099315   
std     421.610009    42.300571    24.284752    9981.264932     1.382997   
min       1.000000    20.000000    21.000000    1300.000000     1.000000   
25%     365.750000    20.000000    59.000000    7553.500000     5.000000   
50%     730.500000    50.000000    69.000000    9478.500000     6.000000   
75%    1095.250000    70.000000    80.000000   11601.500000     7.000000   
max    1460.000000   190.000000   313.000000  215245.000000    10.000000   

       OverallCond    YearBuilt  YearRemodAdd   MasVnrArea   BsmtFinSF1  ...  \
count  1460.000000  1460.000000   1460.000000  1452.000000  1460.000000  ...   
mean      5.575342  1971.267808   1984.865753   103.685262   443.639726  ...   
std       1.112799    30.202904     20.645407   181.066207   456.098091  ...   
min       1.000000  1872.000000   1950.000000     0.000000     0.000000  ...   
25%       5.000000  1954.000000   1967.000000     0.000000     0.000000  ...   
50%       5.000000  1973.000000   1994.000000     0.000000   383.500000  ...   
75%       6.000000  2000.000000   2004.000000   166.000000   712.250000  ...   
max       9.000000  2010.000000   2010.000000  1600.000000  5644.000000  ...   

        WoodDeckSF  OpenPorchSF  EnclosedPorch    3SsnPorch  ScreenPorch  \
count  1460.000000  1460.000000    1460.000000  1460.000000  1460.000000   
mean     94.244521    46.660274      21.954110     3.409589    15.060959   
std     125.338794    66.256028      61.119149    29.317331    55.757415   
min       0.000000     0.000000       0.000000     0.000000     0.000000   
25%       0.000000     0.000000       0.000000     0.000000     0.000000   
50%       0.000000    25.000000       0.000000     0.000000     0.000000   
75%     168.000000    68.000000       0.000000     0.000000     0.000000   
max     857.000000   547.000000     552.000000   508.000000   480.000000   

          PoolArea       MiscVal       MoSold       YrSold      SalePrice  
count  1460.000000   1460.000000  1460.000000  1460.000000    1460.000000  
mean      2.758904     43.489041     6.321918  2007.815753  180921.195890  
std      40.177307    496.123024     2.703626     1.328095   79442.502883  
min       0.000000      0.000000     1.000000  2006.000000   34900.000000  
25%       0.000000      0.000000     5.000000  2007.000000  129975.000000  
50%       0.000000      0.000000     6.000000  2008.000000  163000.000000  
75%       0.000000      0.000000     8.000000  2009.000000  214000.000000  
max     738.000000  15500.000000    12.000000  2010.000000  755000.000000  

[8 rows x 38 columns]

# Lets check for missing values.
missing = train_df.isnull().sum().sort_values(ascending=False)
missing = missing[missing > 0]
print("\nMissing Values:\n", missing)

Missing Values:
 PoolQC          1453
MiscFeature     1406
Alley           1369
Fence           1179
MasVnrType       872
FireplaceQu      690
LotFrontage      259
GarageQual        81
GarageFinish      81
GarageType        81
GarageYrBlt       81
GarageCond        81
BsmtFinType2      38
BsmtExposure      38
BsmtCond          37
BsmtQual          37
BsmtFinType1      37
MasVnrArea         8
Electrical         1
dtype: int64

# The try finally is just to prevent errors while running.
try:
    train_df = train_df.drop(train_df.columns[(train_df.isnull().sum() / train_df.shape[0] >= 0.50)], axis=1)
finally:
    missing = train_df.isnull().sum().sort_values(ascending=False)
    missing = missing[missing > 0]
    print("Missing Values:\n", missing)

Missing Values:
 FireplaceQu     690
LotFrontage     259
GarageYrBlt      81
GarageQual       81
GarageFinish     81
GarageType       81
GarageCond       81
BsmtFinType2     38
BsmtExposure     38
BsmtQual         37
BsmtCond         37
BsmtFinType1     37
MasVnrArea        8
Electrical        1
dtype: int64

# We want to fill numerical valeus with the median
numCols = train_df.select_dtypes(include=["int64", "float64"]).columns # 
train_df[numCols] = train_df[numCols].fillna(train_df[numCols].median())

# Now we want to remove colummns that wont serve a purpose to use. Non-Infomative colums that we want removed from the training.
train_df = train_df.drop(['Id', 'MSZoning', 'MSSubClass', 'LandSlope', 'RoofMatl', 'MoSold', 'LotShape', 'BsmtCond', 'LandContour', 'GarageYrBlt', 'YearRemodAdd', 'Street', 'Utilities', 'GarageCond', 'Condition1', 'Condition2', 'Electrical', 'Functional', 'Heating', 'CentralAir'], axis=1)
# Now we want to encode the categorial values we need. The values arent numbered, and we want to be able for them to be a vlue. So we will One-hot encode the values so they can be used in modeling. 
train_df = pd.get_dummies(train_df, columns=train_df.select_dtypes(include=["object"]).columns, drop_first=True, dtype=int);
train_df.head()

sns.set_theme(style="whitegrid")

# Correlation heatmap
corr_matrix = train_df.corr()
top_corr_features = corr_matrix["SalePrice"].abs().sort_values(ascending=False)[0:15] # Due to too many features, we want to just get the top ones, since its all that matters for the showcase.
plt.figure(figsize=(12, 10))
sns.heatmap(train_df[top_corr_features.index].corr(), annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5) 
plt.title('Correlation heatmap of features (Top 15)')
plt.show()

plt.figure(figsize=(12, 10))
sns.scatterplot(x='GrLivArea', y='SalePrice', data=train_df)
plt.suptitle('Living Area vs Sale Price')
plt.show()

plt.figure(figsize=(12,6))
sns.boxplot(x='OverallQual', y='SalePrice', data=train_df)
plt.title('House Quality vs Sale Price')
plt.show()

X = train_df.drop('SalePrice', axis=1)
y = train_df['SalePrice']

selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(X)

print(f"Original features: {X.shape[1]}")
print(f"Features after variance threshold: {X_selected.shape[1]}")

Original features: 170
Features after variance threshold: 132

X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.2, random_state=42) # We will do a 80/20 Split where 20 percent goes torwards testing.
lr = LinearRegression()
lr.fit(X_train, y_train)

LinearRegression()

LinearRegression()

preds = lr.predict(X_test)
print(f"Linear Regression Experiement 1 Results:")
print(f"RMSE: ${np.sqrt(mean_squared_error(y_test, preds)):,.2f}")
print(f"MAE: ${mean_absolute_error(y_test, preds):,.2f}") 
print(f"R²: {r2_score(y_test, preds):.3f}")

Linear Regression Experiement 1 Results:
RMSE: $30,895.02
MAE: $19,230.91
R²: 0.876

sns.scatterplot(x=y_test, y=preds, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()])
plt.ylabel("Predicted Price")
plt.xlabel("Actual Price")
plt.show()

ex2_df = train_df.copy(); # Make a copy of the Train_Df Modified DF

ex2_df['TotalSF'] = ex2_df['GrLivArea']  + ex2_df['TotalBsmtSF'];
ex2_df['HouseAge'] = ex2_df['YrSold'] - ex2_df['YearBuilt']

X = ex2_df.drop('SalePrice', axis=1)
X = (X - X.mean()) / X.std() # Scale/Average the Values
y = ex2_df['SalePrice']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
ridge = Ridge(alpha=100)
ridge.fit(X_train, y_train)

preds = ridge.predict(X_test)
print(f"RMSE: ${np.sqrt(mean_squared_error(y_test, preds)):,.2f}")
print(f"MAE: ${mean_absolute_error(y_test, preds):,.2f}")
print(f"R²: {r2_score(y_test, preds):.3f}")

RMSE: $32,262.35
MAE: $19,229.14
R²: 0.864

sns.scatterplot(x=y_test, y=preds, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()])
plt.ylabel("Predicted Price")
plt.xlabel("Actual Price")
plt.show()

# Correlation heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(ex2_df[ex2_df.corr()["SalePrice"].abs().sort_values(ascending=False)[0:15].index].corr(), annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5) # Due to too many features, we want to just get the top ones, since its all that matters for the showcase.
plt.title('Correlation heatmap of features + Feature Enginerring Done (Top 15)')
plt.show()

ex3_df = train_df.copy();

ex3_df['TotalSF'] = ex3_df['GrLivArea']  + ex3_df['TotalBsmtSF'];
ex3_df['HouseAge'] = ex3_df['YrSold'] - ex3_df['YearBuilt']

X_selected = ex3_df.drop('SalePrice', axis=1)
y = ex3_df['SalePrice']

selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(X)

X_selected = (X_selected - X_selected.mean()) / X_selected.std() # Scale/Average the Values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestRegressor(n_estimators=100)
rf.fit(X_train, y_train)

preds = rf.predict(X_test)
print(f"RMSE: ${np.sqrt(mean_squared_error(y_test, preds)):,.2f}")
print(f"MAE: ${mean_absolute_error(y_test, preds):,.2f}")
print(f"R²: {r2_score(y_test, preds):.3f}")

RMSE: $30,253.11
MAE: $18,128.59
R²: 0.881

sns.scatterplot(x=y_test, y=preds, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()])
plt.ylabel("Predicted Price")
plt.xlabel("Actual Price")
plt.show()

# Correlation heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(ex3_df[ex3_df.corr()["SalePrice"].abs().sort_values(ascending=False)[0:15].index].corr(), annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5) # Due to too many features, we want to just get the top ones, since its all that matters for the showcase.
plt.title('Correlation heatmap of features + Feature Engineering + Variance (Top 15)')
plt.show()

	Id	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	...	PoolQC	Fence	MiscFeature	MoSold	YrSold	SaleType	SaleCondition	SalePrice
0	1	60	RL	65.0	8450	Pave	NaN	Reg	Lvl	AllPub	...	NaN	NaN	NaN	2	2008	WD	Normal	208500
1	2	20	RL	80.0	9600	Pave	NaN	Reg	Lvl	AllPub	...	NaN	NaN	NaN	5	2007	WD	Normal	181500
2	3	60	RL	68.0	11250	Pave	NaN	IR1	Lvl	AllPub	...	NaN	NaN	NaN	9	2008	WD	Normal	223500
3	4	70	RL	60.0	9550	Pave	NaN	IR1	Lvl	AllPub	...	NaN	NaN	NaN	2	2006	WD	Abnorml	140000
4	5	60	RL	84.0	14260	Pave	NaN	IR1	Lvl	AllPub	...	NaN	NaN	NaN	12	2008	WD	Normal	250000

	LotFrontage	LotArea	OverallQual	OverallCond	YearBuilt	MasVnrArea	BsmtFinSF1	BsmtUnfSF	TotalBsmtSF	...	SaleType_WD	SaleCondition_Normal
0	65.0	8450	7	5	2003	196.0	706	150	856	...	1	1
1	80.0	9600	6	8	1976	0.0	978	284	1262	...	1	1
2	68.0	11250	7	5	2001	162.0	486	434	920	...	1	1
3	60.0	9550	7	5	1915	0.0	216	540	756	...	1	0
4	84.0	14260	8	5	2000	350.0	655	490	1145	...	1	1

Project 3 Regression¶

Predicting House Sale Prices¶

What is Regression and How Does it Work?¶

Lets check for any missing values¶

Correlated Features (Top 15)¶

Experiment 1: Linear Regression¶

Modeling.¶

Variance Threshold Feature¶

Modeling¶

Visualization¶

Experiment 2: Ridge Regression + More Features¶

Visualization¶

Correlated Features with Enginnerring¶

Experiment 3: Random Forest Regression + Feature Engineering + Variance¶

A Compelation of everything so far.¶

Visualization¶

Correlated Features (Experiment 3)¶

Impact¶

Conclusion¶

References¶