Each baseball game consists of matchups between batters and pitchers.
Each matchup, technically termed a plate appearance, can result in one of a set of outcomes.
Plenty of MLB front offices have conducted analyses revolving around an idea called run value.
Far less attention has been devoted to optimally estimating the probabilities batters and pitchers will produce specific plate appearance outcomes.
Many methods have been explored, but few appropriately capture the true structure of the problem.
This problem is inherently more complex than estimating batting averages (binomial probabilities); it instead seeks the probability of each specific outcome from a set of outcomes (multinomial probabilities).
One approach, by Judge and the BP Stats Team, is deserved run average, which gives each category of the response, \(Y\), a series of estimated binomial probabilities. For example, each batter \(B\) has a propensity \(\beta_B^{\text{HR}}\) for hitting home runs, and likewise each pitcher \(P\) has a propensity \(\gamma_P^{\text{HR}}\). If \(Y\) denotes the outcome of a specific batter-pitcher matchup, then
\[ P(Y = \text{HR} | B, P) = \frac{e^{\alpha^{\text{HR}} + \beta_B^{\text{HR}} + \gamma_P^{\text{HR}}}}{1 + e^{\alpha^{\text{HR}} + \beta_B^{\text{HR}} + \gamma_P^{\text{HR}}}}. \]
What is the problem with this?
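One issue: separately fitted binomial probabilities are not constrained to sum to one across outcomes. A minimal sketch, using illustrative (not fitted) linear predictors for three outcomes:

```python
import math

# Hypothetical linear predictors alpha + beta_B + gamma_P for three outcomes
# (illustrative values only, not fitted estimates)
eta = {"HR": 0.5, "K": 0.8, "BB": -0.2}

# Each outcome gets its own independent logistic probability
probs = {k: 1 / (1 + math.exp(-v)) for k, v in eta.items()}
total = sum(probs.values())  # nothing forces this to equal 1
```

With these values the three "probabilities" sum to well over one, so the separate binomial models do not define a coherent distribution over outcomes.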
Another approach could be to treat the outcomes as ordinal, since each outcome has a different value to the batting team:
\[ \text{K} < \text{G} < \text{F} < \text{BB} < \text{HBP} < \text{1B} < \text{2B} < \text{3B} < \text{HR} \]
“the proportional odds model used for ordinal regression assumes that when one outcome is more likely to occur, the outcomes close to it in the ordering are also more likely to occur. That assumption is woefully off-base in this setting \(\ldots\) players who hit a lot of home runs tend to strike out often, and they tend not to hit many triples.”
Ridge multinomial regression is currently the most appropriate solution to this problem, and it is easily implemented in the R package glmnet.
\[ (\alpha^*, \mathbf{B}^*) = \underset{\alpha \in \mathbb{R}^K,\, \mathbf{B} \in \mathbb{R}^{p \times K}}{\arg\min} \, -\ell(\alpha, \mathbf{B}; \mathbf{X}, Y) + \lambda ||\mathbf{B}||_F^2 \]
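glmnet is the R implementation referenced here; as a rough Python sketch of the same ridge-penalized multinomial fit (using scikit-learn's `LogisticRegression`, an assumption outside the source, where `C` plays the role of \(1/\lambda\)):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic multinomial data: n observations, p predictors, K outcomes
rng = np.random.default_rng(0)
n, p, K = 500, 10, 4
X = rng.normal(size=(n, p))
logits = X @ rng.normal(size=(p, K))
P = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
y = np.array([rng.choice(K, p=row) for row in P])

# Ridge (L2) multinomial fit; smaller C means stronger shrinkage of B
model = LogisticRegression(penalty="l2", C=0.1, max_iter=1000).fit(X, y)
coef = model.coef_  # shape (K, p): a dense row of coefficients per outcome
```

Note that the ridge penalty shrinks every entry of \(\mathbf{B}\) toward zero but leaves the matrix full rank, which is the limitation discussed below.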
The chief issue with this method is the number of parameters to estimate in the application context (one for each outcome for each player). This is roughly 8,000 parameters, estimated from about 160,000 plate appearances in a season.
Rather than developing binomial models for each outcome, some EDA reveals that the true structure of the problem is more hierarchical. Powers et al. used PCA to demonstrate this.
Note the directions of the loadings for the outcomes.
A multinomial logistic regression model in this motivating application can be written as:
\[P(Y_i = k | \mathbf{x}_i) = \frac{e^{\alpha_k + \mathbf{x}_i^T \beta_k}}{\sum_{\ell=1}^K e^{\alpha_\ell + \mathbf{x}_i^T \beta_\ell}}\]
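A minimal numerical sketch of this softmax link, with illustrative values for \(\alpha\), \(\mathbf{B}\), and \(\mathbf{x}_i\) (none taken from the source):

```python
import numpy as np

# Illustrative parameters: p = 2 predictors, K = 3 outcome categories
alpha = np.array([0.0, 0.3, -0.5])       # intercept alpha_k per category
B = np.array([[0.2, -0.1, 0.4],
              [0.5,  0.0, -0.3]])        # column k holds beta_k
x = np.array([1.0, -2.0])                # one observation's predictors

eta = alpha + x @ B                      # linear predictor per category
probs = np.exp(eta) / np.exp(eta).sum()  # softmax: P(Y_i = k | x_i)
```

Unlike the separate binomial models above, the shared denominator guarantees the \(K\) probabilities sum to one.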
\(Y_i \in \{1, \ldots, K\}\) for \(i = 1,\ldots, n\) is the outcome of interest for observation \(i\); the outcomes are independent, conditional on \(\mathbf{X}\).
\(\alpha_k \in \mathbb{R}\) and \(\beta_k \in \mathbb{R}^p\) are fixed, unknown parameters.
In this model we estimate \(\mathbf{B} = (\beta_1, \ldots, \beta_K)\), where column \(k\) corresponds to outcome \(k\).
As you can imagine, as \(K\) and \(p\) grow larger, the number of parameters to estimate, \(pK\), quickly increases.
An alternative is a reduced-rank multinomial logistic regression model, where we solve, for some \(r \in \{1, \ldots, K-1\}\),
\[\begin{align*} \underset{\alpha \in \mathbb{R}^K, \, \mathbf{B} \in \mathbb{R}^{p \times K}}{\text{minimize}} \ & -\sum_{i=1}^{n} \log \left( \sum_{k=1}^K \frac{e^{\alpha_k + \mathbf{x}_i^T\beta_k}}{\sum_{\ell=1}^K e^{\alpha_\ell + \mathbf{x}_i^T\beta_\ell} } \mathbb{I}_{\{Y_i=k\}} \right) \\ \text{subject to} \ & \text{rank}(\mathbf{B}) \leq r, \text{ with } \alpha_1 = 0, \beta_1 = \mathbf{0}_p. \end{align*}\]
If \(\text{rank}(\mathbf{B}) \leq r\), you can factor \(\mathbf{B} = \mathbf{A}\mathbf{C}^T\) with \(\mathbf{A} \in \mathbb{R}^{p \times r}\) and \(\mathbf{C} \in \mathbb{R}^{K \times r}\), where the columns of \(\mathbf{C}\) are loadings on latent outcome variables.
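A quick numerical check of this factorization and the parameter savings it buys, with dimensions loosely chosen to echo the baseball setting (not taken from the source):

```python
import numpy as np

# Dimensions loosely echoing the application: many players, nine outcomes
p, K, r = 1000, 9, 2
rng = np.random.default_rng(1)

# Factor B = A C^T: A scores each predictor on r latent variables,
# C loads each outcome category on those same latent variables
A = rng.normal(size=(p, r))
C = rng.normal(size=(K, r))
B = A @ C.T                   # p x K, but rank at most r

full_params = p * K           # unconstrained multinomial coefficient count
reduced_params = r * (p + K)  # cost of storing A and C instead
```

Here \(rk(\mathbf{B}) \leq r = 2\), and the factorization replaces \(pK = 9{,}000\) coefficients with \(r(p + K) = 2{,}018\).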
Yee (2010) implemented an algorithm to solve this in the VGAM package, but it is too slow for datasets as large as the one motivating our problem.
Because of the computational challenges of the rank-constrained form of this problem, the authors propose replacing the rank restriction with a restriction on the nuclear norm \(\|\cdot \|_*\): instead of \(\text{rank}(\mathbf{B}) \leq r\), they require \(\|\mathbf{B} \|_* \leq \rho\).
They reframe the same problem in its Lagrangian form:
\[\begin{align*} (\alpha^*, \mathbf{B}^*) &= \underset{\alpha \in \mathbb{R}^K,\, \mathbf{B} \in \mathbb{R}^{p \times K}}{\arg\min} \, - \sum_{i=1}^n \log \left( \sum_{k=1}^{K} \frac{e^{\alpha_k + \mathbf{x}_i^T \beta_k}}{\sum_{\ell=1}^{K} e^{\alpha_\ell + \mathbf{x}_i^T \beta_\ell}} \mathbb{I}_{\{Y_i=k\}} \right) + \lambda ||\mathbf{B}||_* \\ & \equiv \underset{\alpha \in \mathbb{R}^K, \mathbf{B} \in \mathbb{R}^{p \times K}}{\arg\min} \, - \ell(\alpha, \mathbf{B}; \mathbf{X}, Y) + \lambda ||\mathbf{B}||_*. \end{align*}\]
where \(\ell(\alpha, \mathbf{B}; \mathbf{X}, Y)\) is the multinomial log-likelihood as a function of the regression coefficients.
For the singular values \(\sigma_d\) of \(\mathbf{B}\), the nuclear norm is defined as: \[\|\mathbf{B}\|_* = \sum_{d = 1}^{\min\{p,\,K\}} \sigma_d.\]
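A quick sanity check of this definition with NumPy, comparing the sum of singular values to the built-in nuclear norm:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(5, 3))

# Nuclear norm = sum of the min(p, K) singular values of B
sigma = np.linalg.svd(B, compute_uv=False)
nuc = sigma.sum()
assert np.isclose(nuc, np.linalg.norm(B, "nuc"))
```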
Like the lasso, solving this optimization problem drives some of the singular values exactly to zero, which reduces the rank of \(\mathbf{B}\).
Unlike the rank-constrained problem, this penalized problem is convex, so it can be solved with standard methods such as proximal gradient descent.
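The key step in proximal gradient descent for this penalty is soft-thresholding the singular values of \(\mathbf{B}\). A minimal sketch (the `prox_nuclear` helper is written for illustration here, not taken from any package):

```python
import numpy as np

def prox_nuclear(B, tau):
    """Proximal operator of tau * ||.||_*: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

rng = np.random.default_rng(0)
B = rng.normal(size=(8, 5))
s = np.linalg.svd(B, compute_uv=False)

# Thresholding at the second-largest singular value leaves at most one
# nonzero singular value, so the proximal step returns a (near) rank-1 matrix
B_low = prox_nuclear(B, tau=s[1])
```

This is the mechanism behind the claim above: large enough \(\lambda\) (here `tau`) zeroes out trailing singular values, so the iterates have reduced rank.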
In practice, the authors recommend using standard cross-validation techniques for selecting the regularization parameter \(\lambda\), which controls the rank of the solution.
“By estimating some of the singular values of \(\mathbf{B}\) (the entries of the diagonal \(p \times K\) matrix \(\boldsymbol{\Sigma}\)) to be zero, we reduce the number of coefficients to be estimated for each predictor variable from (a) one for each of \(K\) outcome categories; to (b) one for each of some smaller number of latent variables.”
Another feature of NPMR is that estimating \(\mathbf{B}\) yields a smaller number of latent variables that better capture the structure of the problem, analogous to principal component loadings; these are the columns of \(\mathbf{V} \in \mathbb{R}^{K \times K}\).
The authors compare ridge regression and nuclear penalized multinomial regression (NPMR) in two coefficient matrix settings:
They vary training size \(n \in [600, 2000]\), fit both models, and evaluate out-of-sample log-likelihood loss on 10,000 test points.
Results (See Figure):
Powers et al. use 2015 MLB play-by-play data to model plate appearance (PA) outcomes.
For plate appearance \(i\) and outcome \(k \in \mathcal{K}\), where \[\mathcal{K} \equiv \{\text{K}, \text{G}, \text{F}, \text{BB}, \text{HBP}, \text{1B}, \text{2B}, \text{3B}, \text{HR}\},\]
\[\begin{align*} P(Y_i = k) &= \frac{\exp(\eta_{ik})}{\sum_{\ell \in \mathcal{K}} \exp(\eta_{i\ell})}, \\ \eta_{ik} &= \alpha_k + \beta_{B_i k} + \gamma_{P_i k} + \delta_{S_i k} + \zeta_k H_i + \xi_k O_i. \end{align*}\]
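For a single plate appearance, the linear predictor and outcome probabilities can be computed as below. All effect values are randomly generated placeholders, not estimates from the paper, and the coding of the scalar covariates \(H_i\) and \(O_i\) is assumed:

```python
import numpy as np

# Outcome categories: K, G, F, BB, HBP, 1B, 2B, 3B, HR
K = 9
rng = np.random.default_rng(0)

# Placeholder effects for one plate appearance (NOT fitted estimates)
alpha = rng.normal(size=K)    # outcome intercepts alpha_k
beta_B = rng.normal(size=K)   # this batter's effect per outcome
gamma_P = rng.normal(size=K)  # this pitcher's effect per outcome
delta_S = rng.normal(size=K)  # this stadium's effect per outcome
zeta = rng.normal(size=K)
xi = rng.normal(size=K)
H, O = 1.0, 0.0               # remaining scalar covariates (coding assumed)

eta = alpha + beta_B + gamma_P + delta_S + zeta * H + xi * O
probs = np.exp(eta) / np.exp(eta).sum()  # P(Y_i = k) over the 9 outcomes
```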
The authors derived three latent variables and characterized them as follows:
Batters:
Pitchers:
“The interpretation of the results on the baseball data is promising in how it coalesces with modern baseball understanding\(\ldots\) We recommend the use of NPMR for any multinomial regression problem for which there is some nonordinal structure among the outcome categories.”
Carson Slater