Project 3 Regression¶

Predicting House Sale Prices¶

Predicting the amount a house will sell for on the market is an important task within the market. Many Realtors often search up different aspects such as neighborhood, square footage, and more to see what to price a house in. This challenge while relevant is a difficult one as the market is always changing.

The dataset that we are using can be found here. It comes from a Kaggle competition that focuses on learning regression techniques. The Dataset offers 79 variables that we can play around with ranging from Lot Size, neighborhoods, the year the house was built, ratings, the year sold, and more. It showcases many different values that are often considered when determining the sale price of a house. As someone who has a close friend as a realtor, these models are becoming more and more often within the space as they help people determine their house value at any time.

Regression is the focus of this assignment and we will be exploring what that means and how we use it throughout the different number of experiments we will be tackling.

What is Regression and How Does it Work?¶

Regression is a Machine Learning training method for models that allows for predicting a continuous value based upon several features the model is aware of. For example, a car price could be predicted depending on different values provided, such as Make, Year, Brand, Gas Mileage, Current Mileage, and more. Unlike classification, regression involves looking at and predicting a range of different values instead of specific values.

In this case, we will be looking at House Prices. We will use a mix of different regression methods but focusing on Linear Regression. This is where it models the target variable as a linear value of different features, minimizing error. However, there are more techniques that can be used like Ridge Regression (which we use in Experiment 2) and Random Forst Regression (which we use in Experiment 3)

Using these different methods, we will explore different values and different features and techniques we can use to improve our models.

In [26]:
%pip install numpy pandas matplotlib seaborn sklearn
Requirement already satisfied: numpy in c:\users\mike\appdata\local\programs\python\python313\lib\site-packages (2.2.2)
Requirement already satisfied: pandas in c:\users\mike\appdata\local\programs\python\python313\lib\site-packages (2.2.3)
Requirement already satisfied: matplotlib in c:\users\mike\appdata\local\programs\python\python313\lib\site-packages (3.10.0)
Requirement already satisfied: seaborn in c:\users\mike\appdata\local\programs\python\python313\lib\site-packages (0.13.2)
Collecting sklearn
  Using cached sklearn-0.0.post12.tar.gz (2.6 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'error'
Note: you may need to restart the kernel to use updated packages.
  error: subprocess-exited-with-error
  
  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [15 lines of output]
      The 'sklearn' PyPI package is deprecated, use 'scikit-learn'
      rather than 'sklearn' for pip commands.
      
      Here is how to fix this error in the main use cases:
      - use 'pip install scikit-learn' rather than 'pip install sklearn'
      - replace 'sklearn' by 'scikit-learn' in your pip requirements files
        (requirements.txt, setup.py, setup.cfg, Pipfile, etc ...)
      - if the 'sklearn' package is used by one of your dependencies,
        it would be great if you take some time to track which package uses
        'sklearn' instead of 'scikit-learn' and report it to their issue tracker
      - as a last resort, set the environment variable
        SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True to avoid this error
      
      More information is available at
      https://github.com/scikit-learn/sklearn-pypi-package
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
In [27]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

train_df = pd.read_csv("./train.csv");
In [28]:
train_df.head()
Out[28]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000

5 rows × 81 columns

In [29]:
print(train_df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallCond    1460 non-null   int64  
 19  YearBuilt      1460 non-null   int64  
 20  YearRemodAdd   1460 non-null   int64  
 21  RoofStyle      1460 non-null   object 
 22  RoofMatl       1460 non-null   object 
 23  Exterior1st    1460 non-null   object 
 24  Exterior2nd    1460 non-null   object 
 25  MasVnrType     588 non-null    object 
 26  MasVnrArea     1452 non-null   float64
 27  ExterQual      1460 non-null   object 
 28  ExterCond      1460 non-null   object 
 29  Foundation     1460 non-null   object 
 30  BsmtQual       1423 non-null   object 
 31  BsmtCond       1423 non-null   object 
 32  BsmtExposure   1422 non-null   object 
 33  BsmtFinType1   1423 non-null   object 
 34  BsmtFinSF1     1460 non-null   int64  
 35  BsmtFinType2   1422 non-null   object 
 36  BsmtFinSF2     1460 non-null   int64  
 37  BsmtUnfSF      1460 non-null   int64  
 38  TotalBsmtSF    1460 non-null   int64  
 39  Heating        1460 non-null   object 
 40  HeatingQC      1460 non-null   object 
 41  CentralAir     1460 non-null   object 
 42  Electrical     1459 non-null   object 
 43  1stFlrSF       1460 non-null   int64  
 44  2ndFlrSF       1460 non-null   int64  
 45  LowQualFinSF   1460 non-null   int64  
 46  GrLivArea      1460 non-null   int64  
 47  BsmtFullBath   1460 non-null   int64  
 48  BsmtHalfBath   1460 non-null   int64  
 49  FullBath       1460 non-null   int64  
 50  HalfBath       1460 non-null   int64  
 51  BedroomAbvGr   1460 non-null   int64  
 52  KitchenAbvGr   1460 non-null   int64  
 53  KitchenQual    1460 non-null   object 
 54  TotRmsAbvGrd   1460 non-null   int64  
 55  Functional     1460 non-null   object 
 56  Fireplaces     1460 non-null   int64  
 57  FireplaceQu    770 non-null    object 
 58  GarageType     1379 non-null   object 
 59  GarageYrBlt    1379 non-null   float64
 60  GarageFinish   1379 non-null   object 
 61  GarageCars     1460 non-null   int64  
 62  GarageArea     1460 non-null   int64  
 63  GarageQual     1379 non-null   object 
 64  GarageCond     1379 non-null   object 
 65  PavedDrive     1460 non-null   object 
 66  WoodDeckSF     1460 non-null   int64  
 67  OpenPorchSF    1460 non-null   int64  
 68  EnclosedPorch  1460 non-null   int64  
 69  3SsnPorch      1460 non-null   int64  
 70  ScreenPorch    1460 non-null   int64  
 71  PoolArea       1460 non-null   int64  
 72  PoolQC         7 non-null      object 
 73  Fence          281 non-null    object 
 74  MiscFeature    54 non-null     object 
 75  MiscVal        1460 non-null   int64  
 76  MoSold         1460 non-null   int64  
 77  YrSold         1460 non-null   int64  
 78  SaleType       1460 non-null   object 
 79  SaleCondition  1460 non-null   object 
 80  SalePrice      1460 non-null   int64  
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB
None
In [30]:
print(train_df.describe())
                Id   MSSubClass  LotFrontage        LotArea  OverallQual  \
count  1460.000000  1460.000000  1201.000000    1460.000000  1460.000000   
mean    730.500000    56.897260    70.049958   10516.828082     6.099315   
std     421.610009    42.300571    24.284752    9981.264932     1.382997   
min       1.000000    20.000000    21.000000    1300.000000     1.000000   
25%     365.750000    20.000000    59.000000    7553.500000     5.000000   
50%     730.500000    50.000000    69.000000    9478.500000     6.000000   
75%    1095.250000    70.000000    80.000000   11601.500000     7.000000   
max    1460.000000   190.000000   313.000000  215245.000000    10.000000   

       OverallCond    YearBuilt  YearRemodAdd   MasVnrArea   BsmtFinSF1  ...  \
count  1460.000000  1460.000000   1460.000000  1452.000000  1460.000000  ...   
mean      5.575342  1971.267808   1984.865753   103.685262   443.639726  ...   
std       1.112799    30.202904     20.645407   181.066207   456.098091  ...   
min       1.000000  1872.000000   1950.000000     0.000000     0.000000  ...   
25%       5.000000  1954.000000   1967.000000     0.000000     0.000000  ...   
50%       5.000000  1973.000000   1994.000000     0.000000   383.500000  ...   
75%       6.000000  2000.000000   2004.000000   166.000000   712.250000  ...   
max       9.000000  2010.000000   2010.000000  1600.000000  5644.000000  ...   

        WoodDeckSF  OpenPorchSF  EnclosedPorch    3SsnPorch  ScreenPorch  \
count  1460.000000  1460.000000    1460.000000  1460.000000  1460.000000   
mean     94.244521    46.660274      21.954110     3.409589    15.060959   
std     125.338794    66.256028      61.119149    29.317331    55.757415   
min       0.000000     0.000000       0.000000     0.000000     0.000000   
25%       0.000000     0.000000       0.000000     0.000000     0.000000   
50%       0.000000    25.000000       0.000000     0.000000     0.000000   
75%     168.000000    68.000000       0.000000     0.000000     0.000000   
max     857.000000   547.000000     552.000000   508.000000   480.000000   

          PoolArea       MiscVal       MoSold       YrSold      SalePrice  
count  1460.000000   1460.000000  1460.000000  1460.000000    1460.000000  
mean      2.758904     43.489041     6.321918  2007.815753  180921.195890  
std      40.177307    496.123024     2.703626     1.328095   79442.502883  
min       0.000000      0.000000     1.000000  2006.000000   34900.000000  
25%       0.000000      0.000000     5.000000  2007.000000  129975.000000  
50%       0.000000      0.000000     6.000000  2008.000000  163000.000000  
75%       0.000000      0.000000     8.000000  2009.000000  214000.000000  
max     738.000000  15500.000000    12.000000  2010.000000  755000.000000  

[8 rows x 38 columns]

Lets check for any missing values¶

In [31]:
# Lets check for missing values.
missing = train_df.isnull().sum().sort_values(ascending=False)
missing = missing[missing > 0]
print("\nMissing Values:\n", missing)
Missing Values:
 PoolQC          1453
MiscFeature     1406
Alley           1369
Fence           1179
MasVnrType       872
FireplaceQu      690
LotFrontage      259
GarageQual        81
GarageFinish      81
GarageType        81
GarageYrBlt       81
GarageCond        81
BsmtFinType2      38
BsmtExposure      38
BsmtCond          37
BsmtQual          37
BsmtFinType1      37
MasVnrArea         8
Electrical         1
dtype: int64

As we can see we have many missing values when taking a look at PoolQC, MiscFeature, Alley, and Fence these values are missing in over 50%. We will remove them as well as any others that fall into the more or equal to 50% range, after that we will fill in any other remaining missing values with the median to ensure an overall good model.

In [32]:
# The try finally is just to prevent errors while running.
try:
    train_df = train_df.drop(train_df.columns[(train_df.isnull().sum() / train_df.shape[0] >= 0.50)], axis=1)
finally:
    missing = train_df.isnull().sum().sort_values(ascending=False)
    missing = missing[missing > 0]
    print("Missing Values:\n", missing)
    
Missing Values:
 FireplaceQu     690
LotFrontage     259
GarageYrBlt      81
GarageQual       81
GarageFinish     81
GarageType       81
GarageCond       81
BsmtFinType2     38
BsmtExposure     38
BsmtQual         37
BsmtCond         37
BsmtFinType1     37
MasVnrArea        8
Electrical        1
dtype: int64
In [33]:
# We want to fill numerical valeus with the median
numCols = train_df.select_dtypes(include=["int64", "float64"]).columns # 
train_df[numCols] = train_df[numCols].fillna(train_df[numCols].median())

# Now we want to remove colummns that wont serve a purpose to use. Non-Infomative colums that we want removed from the training.
train_df = train_df.drop(['Id', 'MSZoning', 'MSSubClass', 'LandSlope', 'RoofMatl', 'MoSold', 'LotShape', 'BsmtCond', 'LandContour', 'GarageYrBlt', 'YearRemodAdd', 'Street', 'Utilities', 'GarageCond', 'Condition1', 'Condition2', 'Electrical', 'Functional', 'Heating', 'CentralAir'], axis=1)
# Now we want to encode the categorial values we need. The values arent numbered, and we want to be able for them to be a vlue. So we will One-hot encode the values so they can be used in modeling. 
train_df = pd.get_dummies(train_df, columns=train_df.select_dtypes(include=["object"]).columns, drop_first=True, dtype=int);
train_df.head()
Out[33]:
LotFrontage LotArea OverallQual OverallCond YearBuilt MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF ... SaleType_ConLI SaleType_ConLw SaleType_New SaleType_Oth SaleType_WD SaleCondition_AdjLand SaleCondition_Alloca SaleCondition_Family SaleCondition_Normal SaleCondition_Partial
0 65.0 8450 7 5 2003 196.0 706 0 150 856 ... 0 0 0 0 1 0 0 0 1 0
1 80.0 9600 6 8 1976 0.0 978 0 284 1262 ... 0 0 0 0 1 0 0 0 1 0
2 68.0 11250 7 5 2001 162.0 486 0 434 920 ... 0 0 0 0 1 0 0 0 1 0
3 60.0 9550 7 5 1915 0.0 216 0 540 756 ... 0 0 0 0 1 0 0 0 0 0
4 84.0 14260 8 5 2000 350.0 655 0 490 1145 ... 0 0 0 0 1 0 0 0 1 0

5 rows × 171 columns

Correlated Features (Top 15)¶

Some Insights:

  • OverallQual - This is the leading correlated feature, showcasing that the overall quality of the house is the most important.
  • GrLivArea - Showcases that a larger living area is a strong method on predicting a sale value.
  • GarageCars - The size of a garage showcases its correlation being strong with the sale value.
  • YearBuilt - Showcases a rather strong correlation for our sale price. As depending on how old a house is, can easily showcase value due to possible renovations needed.
In [34]:
sns.set_theme(style="whitegrid")

# Correlation heatmap
corr_matrix = train_df.corr()
top_corr_features = corr_matrix["SalePrice"].abs().sort_values(ascending=False)[0:15] # Due to too many features, we want to just get the top ones, since its all that matters for the showcase.
plt.figure(figsize=(12, 10))
sns.heatmap(train_df[top_corr_features.index].corr(), annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5) 
plt.title('Correlation heatmap of features (Top 15)')
plt.show()
No description has been provided for this image
In [35]:
plt.figure(figsize=(12, 10))
sns.scatterplot(x='GrLivArea', y='SalePrice', data=train_df)
plt.suptitle('Living Area vs Sale Price')
plt.show()
No description has been provided for this image
In [36]:
plt.figure(figsize=(12,6))
sns.boxplot(x='OverallQual', y='SalePrice', data=train_df)
plt.title('House Quality vs Sale Price')
plt.show()
No description has been provided for this image

Experiment 1: Linear Regression¶

Modeling.¶

Since we are using the Kaggle competition, This will not be submitted so we need to split the dataset into training and testing subsets for our uses.

Variance Threshold Feature¶

I wanted to try the Variance threshold, which removes features with low variance. This removes 38 features after we filter through, we use this to create our split for training and testing.

In [37]:
X = train_df.drop('SalePrice', axis=1)
y = train_df['SalePrice']

selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(X)

print(f"Original features: {X.shape[1]}")
print(f"Features after variance threshold: {X_selected.shape[1]}")
Original features: 170
Features after variance threshold: 132

Modeling¶

Now we create the Test and Train Split and create the Linear Regression Model

In [38]:
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.2, random_state=42) # We will do a 80/20 Split where 20 percent goes torwards testing.
lr = LinearRegression()
lr.fit(X_train, y_train)
Out[38]:
LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LinearRegression()
In [39]:
preds = lr.predict(X_test)
print(f"Linear Regression Experiement 1 Results:")
print(f"RMSE: ${np.sqrt(mean_squared_error(y_test, preds)):,.2f}")
print(f"MAE: ${mean_absolute_error(y_test, preds):,.2f}") 
print(f"R²: {r2_score(y_test, preds):.3f}")
Linear Regression Experiement 1 Results:
RMSE: $30,895.02
MAE: $19,230.91
R²: 0.876

Visualization¶

I want to take a look at our model and see how it performed. Above we can see the scores of our RMSE, MAE, and R^2. Below are 2 plots that showcase that our model can make a semi-accurate prediction on the sale price of the house. We know this due to the middle line showcases the prefect line where predicted == real and our values hover around it.

In [40]:
sns.scatterplot(x=y_test, y=preds, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()])
plt.ylabel("Predicted Price")
plt.xlabel("Actual Price")
plt.show()
No description has been provided for this image

Experiment 2: Ridge Regression + More Features¶

For Experiment 2, I will be using Ridge Regression while also adding 2 additional features. Using the Ridge function from sklearn to train the model. While also adding Feature Scaling during Preprocessing

  • TotalSF - The Total SF of the house which adds up the SF provided per house (GrLivArea + TotalBsmtSF)
  • HouseAge - The age of the house (YrSold - YearBuilt)
In [41]:
ex2_df = train_df.copy(); # Make a copy of the Train_Df Modified DF

ex2_df['TotalSF'] = ex2_df['GrLivArea']  + ex2_df['TotalBsmtSF'];
ex2_df['HouseAge'] = ex2_df['YrSold'] - ex2_df['YearBuilt']

X = ex2_df.drop('SalePrice', axis=1)
X = (X - X.mean()) / X.std() # Scale/Average the Values
y = ex2_df['SalePrice']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
ridge = Ridge(alpha=100)
ridge.fit(X_train, y_train)

preds = ridge.predict(X_test)
print(f"RMSE: ${np.sqrt(mean_squared_error(y_test, preds)):,.2f}")
print(f"MAE: ${mean_absolute_error(y_test, preds):,.2f}")
print(f"R²: {r2_score(y_test, preds):.3f}")
RMSE: $32,262.35
MAE: $19,229.14
R²: 0.864

Visualization¶

We will use the same visualization that we used in the previous experiment. We can see that our model continues to perform.

In [42]:
sns.scatterplot(x=y_test, y=preds, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()])
plt.ylabel("Predicted Price")
plt.xlabel("Actual Price")
plt.show()
No description has been provided for this image

Correlated Features with Enginnerring¶

Some Insights:

  • OverallQual - Continues to be a leading correlated feature followed by GrLivArea, GarageCars, and GarageArea
  • TotalSF - Our added feature plays a heavy role at .78
In [43]:
# Correlation heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(ex2_df[ex2_df.corr()["SalePrice"].abs().sort_values(ascending=False)[0:15].index].corr(), annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5) # Due to too many features, we want to just get the top ones, since its all that matters for the showcase.
plt.title('Correlation heatmap of features + Feature Enginerring Done (Top 15)')
plt.show()
No description has been provided for this image

Experiment 3: Random Forest Regression + Feature Engineering + Variance¶

A Compelation of everything so far.¶

For Experiment 3, I will be using Random Forest Regression

In [44]:
ex3_df = train_df.copy();

ex3_df['TotalSF'] = ex3_df['GrLivArea']  + ex3_df['TotalBsmtSF'];
ex3_df['HouseAge'] = ex3_df['YrSold'] - ex3_df['YearBuilt']

X_selected = ex3_df.drop('SalePrice', axis=1)
y = ex3_df['SalePrice']

selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(X)

X_selected = (X_selected - X_selected.mean()) / X_selected.std() # Scale/Average the Values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestRegressor(n_estimators=100)
rf.fit(X_train, y_train)

preds = rf.predict(X_test)
print(f"RMSE: ${np.sqrt(mean_squared_error(y_test, preds)):,.2f}")
print(f"MAE: ${mean_absolute_error(y_test, preds):,.2f}")
print(f"R²: {r2_score(y_test, preds):.3f}")
RMSE: $30,253.11
MAE: $18,128.59
R²: 0.881

Visualization¶

We will use the same visualization that we used in the previous experiment. We can see that our model continues to perform. This time even better than our previous models.

In [45]:
sns.scatterplot(x=y_test, y=preds, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()])
plt.ylabel("Predicted Price")
plt.xlabel("Actual Price")
plt.show()
No description has been provided for this image

Correlated Features (Experiment 3)¶

Our Correlated features have pretty much stayed consistant across the board within each of our models.

In [46]:
# Correlation heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(ex3_df[ex3_df.corr()["SalePrice"].abs().sort_values(ascending=False)[0:15].index].corr(), annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5) # Due to too many features, we want to just get the top ones, since its all that matters for the showcase.
plt.title('Correlation heatmap of features + Feature Engineering + Variance (Top 15)')
plt.show()
No description has been provided for this image

Impact¶

The impact of the model can vary. The model helps drive data-driven decsioins within the Real Estate market, helping agents and buyers evaluate houses and the possible sale price of a house. By Using its key features that many homes have it can help determine pricing for everoyne. The model can also scale easliey, by implementing new data when needed.

The negatives of this model are a bit different, there is bias due to the historical data of houses. As we have seen over the past couple of years, house prices tend to get into a bubble which can lead to increased and inflated pricing over the market for a short period. Another is an overreliance on how small this model and data are. Each area in the world has its pricing, so the model has a shortcoming due to location. This model can also lead to some issues with jobs as there are jobs that require people to go out and appraise houses.

Conclusion¶

The project aimed to be able to predict the sale prices of houses. By using different techniques we looked at Linear Regression, Ridge Regression, and Random Forest Regression. All have their benefits while also implementing different features like Variance and including custom features like TotalSF and HouseAge which both proved to be beneficial to the model and had a large effect when it came to the correlation of the sale price.

References¶

Kaggle House Prices Dataset Used here

Scikit Learn Linear Regression here

Scikit Learn Documentation Overall (Too many to list) here