Hitters ISLR: Regression and Model Evaluation

Hitters is a well-known dataset from the “Introduction to Statistical Learning with R” (ISLR) textbook, often used to teach linear regression, feature selection, and other machine learning techniques. It contains statistics for Major League Baseball players and is widely used in academic settings for learning regression methods and model evaluation.

In this article, we will explore the Hitters dataset, analyze its structure, and demonstrate how it can be used to fit predictive models. We’ll also discuss regression techniques, feature selection, and model evaluation metrics relevant to the Hitters dataset.

What is the Hitters Dataset?

The Hitters dataset comprises data on 322 Major League Baseball players from the 1986 and 1987 seasons. It contains each player’s performance statistics, salary, and other key information, making it an excellent example for regression analysis. One challenge, however, is that the Salary variable is missing for a number of players.

Dataset Overview

Here is a brief overview of the key variables in the Hitters dataset:

  • AtBat: Number of at-bats in 1986
  • Hits: Number of hits in 1986
  • HmRun: Number of home runs in 1986
  • Runs: Total runs scored in 1986
  • RBI: Runs batted in (RBI) in 1986
  • Walks: Number of walks received in 1986
  • Years: Number of years in the major leagues
  • CAtBat: Career at-bats by 1986
  • CHits: Career hits by 1986
  • CHmRun: Career home runs by 1986
  • CRuns: Career runs scored by 1986
  • CRBI: Career runs batted in by 1986
  • Salary: 1987 annual salary on opening day (in thousands of dollars)
  • League: League of the player (A or N)
  • Division: Division of the player’s team (E or W)
  • NewLeague: League after 1986 (A or N)

This dataset lends itself well to predictive modeling, with a common objective being to predict a player’s salary based on their career and performance statistics.
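
The dataset ships with the ISLR package in R, so it can be loaded and inspected before any modeling; a brief look at its dimensions and missing salaries might start like this:

r
# R Code: Loading and inspecting the Hitters dataset
library(ISLR)               # provides the Hitters data frame
data(Hitters)
dim(Hitters)                # number of players and variables
str(Hitters)                # variable types and a preview of values
sum(is.na(Hitters$Salary))  # how many players have a missing Salary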

Key Statistical Concepts with the Hitters Dataset

The Hitters dataset offers an excellent opportunity to explore several fundamental concepts in statistics and machine learning, such as:

  • Linear regression
  • Ridge and Lasso regression
  • Feature selection
  • Handling missing data
  • Model evaluation with cross-validation

Data Cleaning: Handling Missing Values

Before fitting any models, it is essential to address missing data. In the Hitters dataset, the Salary variable is missing for some players; if left untreated, these records can be silently dropped during model fitting or can distort the analysis. Common approaches to handling missing values are:

  • Removing rows with missing values (simplest approach but risks data loss).
  • Imputation (filling in missing values based on other features; a brief sketch follows the removal example below).
  • Using models such as regression or k-nearest neighbors to estimate missing values.

In R, the rows with missing values in the Hitters dataset can be removed with na.omit():

r
# R Code Example to Remove NA Values
hitters_clean <- na.omit(Hitters)
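
Removal, as above, is what most ISLR examples do. If keeping every player is preferred, a simple (if crude) alternative is median imputation; the following is only a sketch of that idea, not part of the standard textbook workflow, and the object names hitters_imputed and median_salary are just illustrative:

r
# R Code: Median imputation of missing salaries (illustrative alternative)
hitters_imputed <- Hitters
median_salary <- median(Hitters$Salary, na.rm = TRUE)
hitters_imputed$Salary[is.na(hitters_imputed$Salary)] <- median_salary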

Linear Regression with the Hitters Dataset

A common task with the Hitters dataset is to use linear regression to predict a player’s salary. Let’s explore the steps involved in building a linear regression model.

Simple Linear Regression

A simple linear regression involves modeling the relationship between the target variable (salary) and a single predictor (e.g., Hits).

r
# R Code: Simple Linear Regression
model_simple <- lm(Salary ~ Hits, data = hitters_clean)
summary(model_simple)

In this case, Salary is the dependent variable, and Hits is the independent variable. The summary output of the model will include coefficients, p-values, and R-squared values, which provide insight into the model’s fit.
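
A quick scatter plot of Salary against Hits with the fitted line overlaid is a useful sanity check on this simple model; in base R graphics that might look like:

r
# R Code: Visualizing the simple regression fit
plot(hitters_clean$Hits, hitters_clean$Salary,
     xlab = "Hits (1986)", ylab = "Salary (thousands of dollars)")
abline(model_simple, col = "red")  # overlay the fitted regression line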

Multiple Linear Regression

While simple linear regression can be useful, using multiple features often improves the model’s predictive power. A multiple linear regression model can be built using several predictors from the Hitters dataset.

r
# R Code: Multiple Linear Regression
model_multi <- lm(Salary ~ AtBat + Hits + HmRun + Walks + Years + CAtBat, data = hitters_clean)
summary(model_multi)

Model Interpretation

  • Coefficients: Each predictor’s coefficient represents its effect on salary, holding other variables constant.
  • p-values: A low p-value (< 0.05) indicates that the predictor is statistically significant.
  • R-squared: Measures the proportion of variance explained by the model.
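
Beyond reading the summary table, the estimated coefficients and their confidence intervals can be pulled out directly from the fitted object; for example:

r
# R Code: Extracting coefficients and confidence intervals
coef(model_multi)      # point estimates for each predictor
confint(model_multi)   # 95% confidence intervals for the coefficients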

Feature Selection: Ridge and Lasso Regression

In datasets with many predictors, some variables may contribute little to the model. In such cases, Ridge and Lasso regression are useful regularization techniques: both shrink coefficients to reduce overfitting, and Lasso additionally performs feature selection by driving some coefficients exactly to zero.

Lasso Regression

Lasso regression adds an L1 penalty to the loss function, shrinking the coefficients of less important variables to zero, effectively selecting a subset of predictors.

r
# R Code: Lasso Regression
library(glmnet)
# Build the numeric predictor matrix (dummy-coding the factors) and drop the intercept column
X <- model.matrix(Salary ~ ., hitters_clean)[, -1]
y <- hitters_clean$Salary
# alpha = 1 selects the L1 (lasso) penalty; cv.glmnet chooses lambda by cross-validation
model_lasso <- cv.glmnet(X, y, alpha = 1)
coef(model_lasso)  # by default, coefficients at "lambda.1se"; use s = "lambda.min" for the CV-optimal fit
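
Ridge regression is fit the same way in glmnet, except that alpha = 0 selects the L2 penalty, which shrinks coefficients toward zero without eliminating any of them. Reusing the X and y objects defined above (the name model_ridge is just illustrative):

r
# R Code: Ridge Regression (alpha = 0 selects the L2 penalty)
model_ridge <- cv.glmnet(X, y, alpha = 0)
coef(model_ridge, s = "lambda.min")  # coefficients at the lambda with the lowest CV error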

Model Evaluation: Cross-Validation

Cross-validation is essential to ensure the model performs well on unseen data. In k-fold cross-validation, the dataset is split into k parts, and the model is trained and tested k times, each time using a different fold for testing.

r
# R Code: Cross-Validation
library(boot)
# cv.glm() expects a glm object, so refit the multiple regression with glm()
model_glm <- glm(Salary ~ AtBat + Hits + HmRun + Walks + Years + CAtBat, data = hitters_clean)
cv_error <- cv.glm(hitters_clean, model_glm, K = 10)
print(cv_error$delta)  # raw and bias-corrected 10-fold CV estimates of the MSE

cv.glm() reports two values in delta: the first is the raw k-fold estimate of the test mean squared error (MSE), and the second is a bias-corrected version. Both help assess how well the model is likely to generalize to unseen data.
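
Because cv.glmnet() already performs its own cross-validation, its stored error curve can be compared directly with the estimate above; a minimal sketch, assuming the model_lasso fit from the lasso section is still in the workspace:

r
# R Code: Comparing cross-validated MSE across models
cv_error$delta[1]      # raw 10-fold CV estimate of MSE for the linear model
min(model_lasso$cvm)   # lowest cross-validated MSE found by cv.glmnet for the lasso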

Predicting Salary with the Hitters Dataset

After building and evaluating the model, the next step is to use it for predictions. Here is how predictions can be made using a linear regression model.

r
# R Code: Predicting Salary
new_data <- data.frame(AtBat = 400, Hits = 100, HmRun = 20, Walks = 40, Years = 5, CAtBat = 1000)
predicted_salary <- predict(model_multi, newdata = new_data)
print(predicted_salary)
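
A single point prediction hides the uncertainty in the estimate; for an lm fit, predict() can also return an interval around it. A short sketch:

r
# R Code: Adding a 95% prediction interval
predict(model_multi, newdata = new_data, interval = "prediction")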

Conclusion

The Hitters dataset from ISLR provides a great opportunity to explore regression techniques, feature selection, and model evaluation. Through simple and multiple linear regression, Lasso and Ridge regression, and cross-validation, students and practitioners can gain valuable insights into data analysis and predictive modeling.

Understanding how to handle missing values, select relevant features, and evaluate models ensures that predictive models are both accurate and reliable. The insights gained from the Hitters dataset are not only applicable to baseball statistics but also to many other domains where regression models are used to make predictions and inform decisions.