Nuclear penalized multinomial regression with an application to predicting at bat outcomes in baseball

Paper by Powers et al.

Introduction

Motivation

  1. Each baseball game consists of matchups between batters and pitchers.

  2. Each matchup, technically termed a plate appearance, can result in one of a set of outcomes.

Problem

  1. Many MLB front offices have conducted analyses revolving around an idea called run value.

  2. Far less attention has been devoted to optimally estimating the probabilities batters and pitchers will produce specific plate appearance outcomes.

  3. Many methods have been explored, but few adequately capture the true structure of the problem.

  4. The very nature of this problem is more complex than estimating batting averages (binomial probabilities); it instead seeks to estimate the probability of a specific outcome from a set of outcomes (multinomial probabilities).

Approaches to the Problem

One approach, by Judge and the BP Stats Team, is deserved run average, which gives each category of the response, \(Y\), a series of estimated binomial probabilities. For example, each batter \(B\) has a propensity \(\beta_B^{\text{HR}}\) for hitting home runs, and likewise each pitcher has a propensity \(\gamma_P^{\text{HR}}\). If \(Y\) denotes the outcome of a specific batter-pitcher matchup, then

\[ P(Y = \text{HR} | B, P) = \frac{e^{\alpha^{\text{HR}} + \beta_B^{\text{HR}} + \gamma_P^{\text{HR}}}}{1 + e^{\alpha^{\text{HR}} + \beta_B^{\text{HR}} + \gamma_P^{\text{HR}}}}. \]
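As a quick numeric illustration, the matchup probability above is just a logistic function of the summed propensities. The propensity values below are invented for the example, not taken from the paper:

```python
import math

def p_home_run(alpha_hr, beta_b_hr, gamma_p_hr):
    """Binomial (logistic) probability of a HR for one batter-pitcher matchup."""
    eta = alpha_hr + beta_b_hr + gamma_p_hr
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical values: a league intercept, a power hitter, an average pitcher.
print(p_home_run(-3.5, 0.8, 0.1))
```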

What is the problem with this?

Approaches to the Problem

Another approach could be to treat the outcomes as ordinal, as the outcomes have different run values to the batting team:

\[ \text{K} < \text{G} < \text{F} < \text{BB} < \text{HBP} < \text{1B} < \text{2B} < \text{3B} < \text{HR} \]

“the proportional odds model used for ordinal regression assumes that when one outcome is more likely to occur, the outcomes close to it in the ordering are also more likely to occur. That assumption is woefully off-base in this setting \(\ldots\) players who hit a lot of home runs tend to strike out often, and they tend not to hit many triples.”

Approaches to the Problem

Ridge multinomial regression is the most appropriate current solution to this problem, and is implemented easily in the R package glmnet.

\[ (\alpha^*, \mathbf{B}^*) = \underset{\alpha \in \mathbb{R}^K,\, \mathbf{B} \in \mathbb{R}^{p \times K}}{\arg\min} \, -\ell(\alpha, \mathbf{B}; \mathbf{X}, Y) + \lambda ||\mathbf{B}||_F^2 \]

The chief issue with using this method is the number of parameters to estimate in the application context (one for each outcome for each player): about 8,000 parameters, estimated from roughly 160,000 plate appearances in a season.
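To make the ridge setup concrete, here is a minimal NumPy sketch of ridge multinomial regression fit by gradient descent. The paper itself uses glmnet in R; the function and the synthetic data below are illustrative assumptions, not the authors' code:

```python
import numpy as np

def softmax(eta):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    eta = eta - eta.max(axis=1, keepdims=True)
    w = np.exp(eta)
    return w / w.sum(axis=1, keepdims=True)

def ridge_multinomial(X, y, lam, step=0.1, iters=500):
    """Gradient-descent sketch of ridge multinomial regression:
    minimize -loglik / n + lam * ||B||_F^2, intercepts unpenalized."""
    n, p = X.shape
    K = int(y.max()) + 1
    Y = np.eye(K)[y]                          # one-hot outcomes, (n, K)
    alpha, B = np.zeros(K), np.zeros((p, K))
    for _ in range(iters):
        P = softmax(alpha + X @ B)
        alpha -= step * (P - Y).mean(axis=0)  # unpenalized intercept step
        B -= step * (X.T @ (P - Y) / n + 2 * lam * B)
    return alpha, B

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
B_true = rng.normal(size=(5, 3))
y = np.array([rng.choice(3, p=pr) for pr in softmax(X @ B_true)])
alpha_hat, B_hat = ridge_multinomial(X, y, lam=0.01)
print(B_hat.shape)                            # one column per outcome
```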

The Novel Work: Nuclear penalized multinomial regression

Understanding the Nature of the Problem

  1. Rather than developing binomial models for each outcome, some EDA reveals that the true structure of the problem is more hierarchical. Powers et al. used PCA to demonstrate this.

  2. Note the directions of the loadings for the outcomes.

Introduction to Multinomial Logistic Regression

A multinomial logistic regression model for this motivating application can be written as:

\[P(Y_i = k | \mathbf{x}_i) = \frac{e^{\alpha_k + \mathbf{x}_i^T \beta_k}}{\sum_{\ell=1}^K e^{\alpha_\ell + \mathbf{x}_i^T \beta_\ell}}\]

  1. \(\mathbf{X}\) is the data matrix with rows \(\mathbf{x}_i \in \mathbb{R}^p\), and \(Y_i \in \{1, \ldots, K\}\) for \(i = 1,\ldots, n\) is the outcome of interest for observation \(i\); the \(Y_i\) are independent, conditional on \(\mathbf{X}\).

  2. \(\alpha_k \in \mathbb{R}\) and \(\beta_k \in \mathbb{R}^p\) are fixed, unknown parameters.

  3. In this model we estimate \(\mathbf{B} = (\beta_1, \ldots, \beta_K)\), where column \(k\) corresponds to the outcome we are estimating.
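A small sketch of evaluating these softmax probabilities for one observation; the numbers are arbitrary, and the linear predictor is shifted by its max before exponentiating for numerical stability:

```python
import numpy as np

def multinomial_probs(x, alpha, B):
    """P(Y = k | x) for a multinomial logistic model.

    x: (p,) feature vector; alpha: (K,) intercepts; B: (p, K) coefficients.
    """
    eta = alpha + x @ B
    eta = eta - eta.max()                # stabilize before exponentiating
    w = np.exp(eta)
    return w / w.sum()

x = np.array([1.0, -0.5])
alpha = np.zeros(3)
B = np.array([[0.2, 0.0, -0.2],
              [0.1, 0.3, 0.0]])
probs = multinomial_probs(x, alpha, B)
print(probs.sum())                       # probabilities sum to 1
```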

Fitting the Multinomial Logistic Regression Model

  1. As you can imagine, as \(K\) and \(p\) grow larger, the number of parameters to estimate increases quickly.

  2. An alternative is a reduced-rank multinomial logistic regression model, where we solve, for some \(r \in \{1, \ldots, K-1\}\),

\[\begin{align*} \underset{\alpha \in \mathbb{R}^K, \, \mathbf{B} \in \mathbb{R}^{p \times K}}{\text{minimize}} \ & -\sum_{i=1}^{n} \log \left( \sum_{k=1}^K \frac{e^{\alpha_k + \mathbf{x}_i^T\beta_k}}{\sum_{\ell=1}^K e^{\alpha_\ell + \mathbf{x}_i^T\beta_\ell} } \mathbb{I}_{\{Y_i=k\}} \right) \\ \text{subject to} \ & \text{rank}(\mathbf{B}) \leq r, \text{ with } \alpha_1 = 0, \beta_1 = \mathbf{0}_p. \end{align*}\]

  1. If \(\text{rank}(\mathbf{B}) \leq r\), you can factor \(\mathbf{B} = \mathbf{A}\mathbf{C}^T\) with \(\mathbf{A} \in \mathbb{R}^{p \times r}\) and \(\mathbf{C} \in \mathbb{R}^{K \times r}\), where the columns of \(\mathbf{C}\) are loadings on latent outcome variables.

  2. Yee (2010) implemented an algorithm to solve this in the VGAM package, but it is too slow for datasets as large as the one motivating our problem.
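The factorization in item 1 can be illustrated with a truncated SVD, which yields the best rank-\(r\) approximation of a coefficient matrix; the matrix below is a random stand-in:

```python
import numpy as np

rng = np.random.default_rng(1)
p, K, r = 8, 5, 2
B = rng.normal(size=(p, K))

# Truncated SVD gives the best rank-r approximation B ≈ A C^T.
U, s, Vt = np.linalg.svd(B, full_matrices=False)
A = U[:, :r] * s[:r]                     # (p, r) latent predictor loadings
C = Vt[:r].T                             # (K, r) latent outcome loadings
B_r = A @ C.T

print(np.linalg.matrix_rank(B_r))        # at most r
```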

Enter nuclear penalized multinomial regression.

Because of the computational challenges of the reduced-rank form of this problem, the authors propose replacing the rank constraint with a constraint on the nuclear norm \(\|\cdot\|_*\): instead of \(\text{rank}(\mathbf{B}) \leq r\), they write \(\|\mathbf{B}\|_* \leq \rho\).

They reframe the same problem in its Lagrangian form:

\[\begin{align*} (\alpha^*, \mathbf{B}^*) &= \underset{\alpha \in \mathbb{R}^K,\, \mathbf{B} \in \mathbb{R}^{p \times K}}{\arg\min} \, - \sum_{i=1}^n \log \left( \sum_{k=1}^{K} \frac{e^{\alpha_k + \mathbf{x}_i^T \beta_k}}{\sum_{\ell=1}^{K} e^{\alpha_\ell + \mathbf{x}_i^T \beta_\ell}} \mathbb{I}_{\{Y_i=k\}} \right) + \lambda ||\mathbf{B}||_* \\ & \equiv \underset{\alpha \in \mathbb{R}^K, \mathbf{B} \in \mathbb{R}^{p \times K}}{\arg\min} \, - \ell(\alpha, \mathbf{B}; \mathbf{X}, Y) + \lambda ||\mathbf{B}||_*. \end{align*}\]

where \(\ell(\alpha, \mathbf{B}; \mathbf{X}, Y)\) is the log-likelihood of the regression coefficients.

For the singular values of \(\mathbf{B}\), written as \(\sigma_d\), the nuclear norm is defined as: \[\|\mathbf{B}\|_* = \sum_{d = 1}^{\min\{p,\,K\}} \sigma_d.\]
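A quick check of this definition in NumPy; the example matrix is a made-up rank-1 case, where the nuclear norm is the single nonzero singular value:

```python
import numpy as np

def nuclear_norm(B):
    """Sum of the singular values of B."""
    return np.linalg.svd(B, compute_uv=False).sum()

# Rank-1 example: the only nonzero singular value is ||u|| * ||v|| = 5.
u = np.array([3.0, 4.0])
v = np.array([1.0, 0.0, 0.0])
B = np.outer(u, v)
print(nuclear_norm(B))
```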

Features of Nuclear Penalized Multinomial Regression

  1. Like lasso, solving this optimization problem drives some of the singular values to exactly zero. This reduces the rank of \(\mathbf{B}\).

  2. It is easier to solve than the rank-constrained problem, using standard solvers such as proximal gradient descent.

  3. In practice, the authors recommend using standard cross-validation techniques for selecting the regularization parameter \(\lambda\), which controls the rank of the solution.

“By estimating some of the singular values of \(\mathbf{B}\) (the entries of the diagonal \(p \times K\) matrix \(\boldsymbol{\Sigma}\)) to be zero, we reduce the number of coefficients to be estimated for each predictor variable from (a) one for each of \(K\) outcome categories; to (b) one for each of some smaller number of latent variables.”
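Putting these pieces together, here is a minimal proximal gradient sketch of NPMR, assuming a plain gradient step on the average negative log-likelihood followed by soft-thresholding of the singular values of \(\mathbf{B}\); this is an illustration of the technique, not the authors' implementation:

```python
import numpy as np

def softmax(eta):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    eta = eta - eta.max(axis=1, keepdims=True)
    w = np.exp(eta)
    return w / w.sum(axis=1, keepdims=True)

def npmr_fit(X, y, lam, step=0.1, iters=500):
    """Proximal gradient sketch of NPMR: a gradient step on the average
    negative log-likelihood, then singular-value soft-thresholding,
    which is the proximal operator of lam * ||B||_*. Intercepts are
    left unpenalized."""
    n, p = X.shape
    K = int(y.max()) + 1
    Y = np.eye(K)[y]                          # one-hot outcomes, (n, K)
    alpha, B = np.zeros(K), np.zeros((p, K))
    for _ in range(iters):
        P = softmax(alpha + X @ B)
        alpha -= step * (P - Y).mean(axis=0)  # unpenalized intercept step
        G = X.T @ (P - Y) / n                 # gradient of average NLL in B
        U, s, Vt = np.linalg.svd(B - step * G, full_matrices=False)
        s = np.maximum(s - step * lam, 0.0)   # small singular values -> 0
        B = U @ np.diag(s) @ Vt
    return alpha, B

rng = np.random.default_rng(3)
n, p, K = 400, 6, 4
B_true = rng.normal(size=(p, 2)) @ rng.normal(size=(2, K))  # rank-2 truth
X = rng.normal(size=(n, p))
y = np.array([rng.choice(K, p=pr) for pr in softmax(X @ B_true)])
alpha_hat, B_hat = npmr_fit(X, y, lam=0.05)
print(B_hat.shape)
```

Larger values of \(\lambda\) zero out more singular values, shrinking the rank of the fitted \(\mathbf{B}\), which is exactly the behavior described in the quote above.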

Features of Nuclear Penalized Multinomial Regression

Another feature of NPMR is that, when estimating \(\mathbf{B}\), you wind up with a smaller number of latent variables that can model the structure of the problem better, analogous to principal component loadings: they are the columns of \(\mathbf{V} \in \mathbb{R}^{K \times K}\) from the SVD \(\mathbf{B} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T\).

Application

Simulation Study: Ridge vs. NPMR

The authors compare ridge regression and nuclear penalized multinomial regression (NPMR) in two coefficient matrix settings:

  • Full-rank: \(\mathbf{B} \sim \mathcal{N}(0, I_{12})\)
  • Low-rank: \(\mathbf{B} = \mathbf{A} \mathbf{C}^T\), with \(\text{rank}(\mathbf{B}) = 2\)

They vary training size \(n \in [600, 2000]\), fit both models, and evaluate out-of-sample log-likelihood loss on 10,000 test points.

Results (See Figure):

  1. Full-rank: Ridge slightly outperforms NPMR.
  2. Low-rank: NPMR outperforms ridge, especially at small \(n\), by leveraging the low-rank structure.

Real Life Example: Baseball Data

Powers et al. use 2015 MLB play-by-play data to model plate appearance (PA) outcomes.

For plate appearance \(i\) and outcome \(k \in \mathcal{K}\), where \[\mathcal{K} \equiv \{\text{K}, \text{G}, \text{F}, \text{BB}, \text{HBP}, \text{1B}, \text{2B}, \text{3B}, \text{HR}\},\]

\[\begin{align*} P(Y_i = k) &= \frac{\exp(\eta_{ik})}{\sum_{\ell \in \mathcal{K}} \exp(\eta_{i\ell})}, \\ \eta_{ik} &= \alpha_k + \beta_{B_i k} + \gamma_{P_i k} + \delta_{S_i k} + \zeta_k H_i + \xi_k O_i. \end{align*}\]

  • \(B_i\): Batter, \(P_i\): Pitcher, \(S_i\): Stadium
  • \(H_i\): Home team indicator, \(O_i\): Opposite handedness
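A sketch of how the linear predictor \(\eta_{ik}\) is assembled for one plate appearance; all effect values below are invented placeholders, not estimates from the paper:

```python
import numpy as np

# K = 9 outcomes, ordered as (K, G, F, BB, HBP, 1B, 2B, 3B, HR).
K = 9
alpha = np.zeros(K)                      # league-wide intercepts
beta_batter = np.full(K, 0.1)            # batter B_i's row of the batter matrix
gamma_pitcher = np.full(K, -0.05)        # pitcher P_i's row
delta_stadium = np.zeros(K)              # stadium S_i's row
zeta, xi = np.zeros(K), np.zeros(K)      # home-field and handedness effects
H_i, O_i = 1, 0                          # home team; same-handed matchup

# eta_{ik} for all k at once, then softmax over the outcome set.
eta = alpha + beta_batter + gamma_pitcher + delta_stadium + zeta * H_i + xi * O_i
probs = np.exp(eta) / np.exp(eta).sum()
print(probs.sum())                       # probabilities sum to 1
```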

Penalization Strategy

  • Ridge regression: \(\ell_2\) penalty on full coefficient matrix
  • NPMR: Nuclear norm penalties on:
    • Batter matrix
    • Pitcher matrix
    • Stadium matrix
  • Home-field and handedness effects unpenalized
  • Result: More interpretable latent structure

Predictive Performance

  • NPMR consistently outperforms ridge, especially with small sample sizes
  • NPMR provides similar deviance explained with more interpretable parameters

Latent Variables as Parsimonious Skills from NPMR

The authors derived three latent variables and characterized them as follows:

Batters:

  • Skill 1: Patience (K & BB vs BIP)
  • Skill 2: Trajectory (F vs G)
  • Skill 3: Speed (1B vs G)

Pitchers:

  • Skill 1: Power (K vs F/G)
  • Skill 2: Trajectory (F vs G)
  • Skill 3: Command (F/G/K vs BB/1B)

Results & Discussion

“The interpretation of the results on the baseball data is promising in how it coalesces with modern baseball understanding\(\ldots\) We recommend the use of NPMR for any multinomial regression problem for which there is some nonordinal structure among the outcome categories.”