Boook club “Hands-On Machine Learning with R” #5

Chp7 Splines - Chp 8 KNN

Chapter 7 Multivariate Adaptive Regression Splines

The WHAT

  • Multivariate adaptive regression splines (MARS)
  • Automatically creates a piecewise linear model
  • Inherently nonlinear

The WHY

  • Will search for, and discover, nonlinearities and interactions in the data that help maximize predictive accuracy

The HOW

  • Hinge function
  • Looks for the single point across the range of X values where two different linear relationships between Y and X achieve the smallest error

Capturing non-linear relationships

  • Polynomials

  • Step functions

  • Require specifications by the user

    • Which variables should have what specific degree of interaction or at what points of a variable \(X\) should cut points be made for the step functions

Multivariate adaptive regression splines (MARS)

  • Capture the nonlinear relationships in the data by assessing cutpoints (knots) similar to step functions

  • The procedure assesses each data point for each predictor as a knot and creates a linear regression model with the candidate feature(s)

  • Many knots may have a good fit in training data, but may not generalize to new data

  • Pruning: remove knots that do not contribute to predictive accuracy using, e.g. cross-validation

Fitting a basic MARS model with earth package

library(dplyr)    # for data manipulation
library(ggplot2)  # for awesome graphics
library(rsample)  # splitting data
library(caret)    # for cross-validation, etc.
library(vip)      # variable importance
library(modeldata) # ames data
data(ames)

# Stratified sampling with the rsample package
set.seed(77654) # I used a different seed than in the book
split <- initial_split(ames, prop = 0.7, strata = "Sale_Price")
ames_train  <- training(split)
ames_test   <- testing(split)

mars1 <- earth::earth(
  Sale_Price ~ .,  
  data = ames_train   
)
print(mars1)
Selected 38 of 41 terms, and 24 of 276 predictors
Termination condition: RSq changed by less than 0.001 at 41 terms
Importance: Gr_Liv_Area, Year_Built, Total_Bsmt_SF, Kitchen_AbvGr, ...
Number of terms at each degree of interaction: 1 37 (additive model)
GCV 559029270    RSS 1.063131e+12    GRSq 0.9122792    RSq 0.9185039
# hinge functions produced from the original 307 predictors
summary(mars1) %>% .$coefficients %>% head(10)
                        Sale_Price
(Intercept)           259584.69398
h(Gr_Liv_Area-3082)       93.14554
h(3082-Gr_Liv_Area)      -61.74407
h(Year_Built-2003)      5435.63870
h(2003-Year_Built)      -397.66277
h(Total_Bsmt_SF-2330)   -729.49718
h(2330-Total_Bsmt_SF)    -23.57389
h(Total_Bsmt_SF-1657)    110.09111
h(Bsmt_Unf_SF-555)       -23.77975
h(555-Bsmt_Unf_SF)        11.41421

Performance and residual plots

  • Generalized cross-validation (GCV) \(R^2\) (solid black line; left-hand y-axis)
  • Number of terms retained in the model (x-axis)
  • Number of original predictors (right-hand y-axis)
plot(mars1, which = 1)

Interactions between different hinge functions

# degree = 2: interaction terms between a maximum of two hinge functions (e.g., h(2004-Year_Built)*h(Total_Bsmt_SF-1330)
mars2 <- earth::earth(
  Sale_Price ~ .,  
  data = ames_train,
  degree = 2
)
# check out the first 10 coefficient terms
summary(mars2) %>% .$coefficients %>% head(10)
                                            Sale_Price
(Intercept)                               3.461254e+05
h(Gr_Liv_Area-3082)                       7.303777e+02
h(3082-Gr_Liv_Area)                      -6.622323e+01
h(Year_Built-2003)                        1.298095e+04
h(2003-Year_Built)                       -1.122875e+03
h(2330-Total_Bsmt_SF)                    -4.846894e+01
h(2003-Year_Built)*h(Total_Bsmt_SF-1237) -1.189485e+00
h(2003-Year_Built)*h(1237-Total_Bsmt_SF)  5.062716e-01
h(Bsmt_Unf_SF-876)*h(3082-Gr_Liv_Area)   -1.538933e-02
h(876-Bsmt_Unf_SF)*h(3082-Gr_Liv_Area)    5.832110e-03

Tuning hyperparameters

  • The maximum degree of interactions
  • The number of terms retained in the final model
  • Perform a CV grid search to identify the optimal hyperparameter mix
# degree: degree of interactions
# nprune: number of terms to retain

# create a tuning grid
hyper_grid <- expand.grid(
  degree = 1:3, 
  nprune = seq(2, 100, length.out = 10) %>% floor()
)

head(hyper_grid)
  degree nprune
1      1      2
2      2      2
3      3      2
4      1     12
5      2     12
6      3     12

Tuning hyperparameters with caret

  • Grid search using 10-fold CV
  • The optimal model’s cross-validated RMSE was $26,817 in the book and $27,246.61 in this example with different seed
  • The optimal model retains 56 terms and includes up to 2nd degree interactions in the book and 45 terms and 1 degree interactions with the seed I chose
# Cross-validated model
set.seed(123)  # for reproducibility
cv_mars <- train(
  x = subset(ames_train, select = -Sale_Price),
  y = ames_train$Sale_Price,
  method = "earth",
  metric = "RMSE",
  trControl = trainControl(method = "cv", number = 10),
  tuneGrid = hyper_grid
)
cv_mars$bestTune
  nprune degree
5     45      1
cv_mars$results %>%
  filter(nprune == cv_mars$bestTune$nprune, degree == cv_mars$bestTune$degree)
  degree nprune     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
1      1     45 27246.61 0.8844656 17822.15 3950.652 0.02631387 1398.517
ggplot(cv_mars) # mind different seed

summary(cv_mars$resample)
      RMSE          Rsquared           MAE          Resample        
 Min.   :21382   Min.   :0.8429   Min.   :15338   Length:10         
 1st Qu.:25083   1st Qu.:0.8684   1st Qu.:16760   Class :character  
 Median :27005   Median :0.8781   Median :18240   Mode  :character  
 Mean   :27247   Mean   :0.8845   Mean   :17822                     
 3rd Qu.:29091   3rd Qu.:0.9046   3rd Qu.:18775                     
 Max.   :35227   Max.   :0.9242   Max.   :19859                     

Comparing MARS with other modelling approaches

cv_mars$resample
       RMSE  Rsquared      MAE Resample
1  26592.75 0.8747269 18206.57   Fold06
2  28382.27 0.8990683 18886.13   Fold03
3  21381.58 0.9165223 15338.32   Fold07
4  25412.92 0.8811599 16541.65   Fold10
5  35226.86 0.8428650 19858.67   Fold04
6  29327.27 0.8663435 18838.37   Fold08
7  27417.38 0.8749882 18584.27   Fold01
8  23167.69 0.9241581 16278.33   Fold05
9  30584.40 0.8584081 18273.82   Fold09
10 24972.94 0.9064159 17415.43   Fold02

Feature interpretation

  • earth has backwards elimination feature selection tool
  • This tool looks at reductions in the GCV estimate of error as each predictor is added to the model: value = "gcv"
  • MARS automatically includes and excludes terms during the pruning process (automated feature selection)
  • Feature never included in the final model-> importance value=0
  • An alternative: the change in the residual sums of squares (RSS) as terms are added (value = "rss")
  • No measuring of the impact for particular hinge functions created for a given feature
# variable importance plots
p1 <- vip(cv_mars, num_features = 40, geom = "point", value = "gcv") + ggtitle("GCV")
p2 <- vip(cv_mars, num_features = 40, geom = "point", value = "rss") + ggtitle("RSS")

gridExtra::grid.arrange(p1, p2, ncol = 2)

Feature interpretation: hinge functions interactions

  • Investigate interactions
  • Create partial dependence plots (PDPs) for each feature individually and also together

# extract coefficients, convert to tidy data frame, and filter for interaction terms
cv_mars$finalModel %>%
  coef(.) %>%  
  broom::tidy(.) %>%  
  filter(stringr::str_detect(names, "\\*"))
# A tibble: 0 × 2
# … with 2 variables: names <chr>, x <dbl>
# no interactions with the seed I used
  • Model found that one knot in each feature provides the best fit
  • Gr_Liv_Area increases and for newer homes, Sale_Price increases dramatically
# Construct partial dependence plots
p1 <- partial(cv_mars, pred.var = "Gr_Liv_Area", grid.resolution = 10) %>% 
  autoplot()
p2 <- partial(cv_mars, pred.var = "Year_Built", grid.resolution = 10) %>% 
  autoplot()
p3 <- partial(cv_mars, pred.var = c("Gr_Liv_Area", "Year_Built"), 
              grid.resolution = 10) %>% 
  plotPartial(levelplot = FALSE, zlab = "yhat", drape = TRUE, colorkey = TRUE, 
              screen = list(z = -20, x = -60))

# Display plots side by side
gridExtra::grid.arrange(p1, p2, p3, ncol = 3)

Attrition data example

  • MARS method and algorithm can be extended to handle classification problems and GLMs
# plot results

df <- attrition %>% mutate_if(is.ordered, factor, ordered = FALSE)

# Create training (70%) and test (30%) sets for the 
# rsample::attrition data.
set.seed(123)  # for reproducibility
churn_split <- initial_split(df, prop = .7, strata = "Attrition")
churn_train <- training(churn_split)
churn_test  <- testing(churn_split)

set.seed(123)

# cross validated model
tuned_mars <- train(
  x = subset(churn_train, select = -Attrition),
  y = churn_train$Attrition,
  method = "earth",
  trControl = trainControl(method = "cv", number = 10),
  tuneGrid = hyper_grid
)
# best model
tuned_mars$bestTune
  nprune degree
3     23      1
ggplot(tuned_mars)