BLUE: A Bayesian Approach to College Football Ratings

Author

James Lauer

Introduction

Traditional football ratings often rely on simple metrics like win percentage, points per game, and strength of schedule. Slightly more advanced ratings include yards per play, EPA per play, and play success rate. While these numbers paint a decent picture of team strength in the NFL, college football's large variation in schedule strength can make them misleading. This analysis proposes a Bayesian approach to college football unit ratings that provides a more statistically robust measure of each team's efficiency by play type.

Methods

Design Matrix

The core framework for this analysis is modeled on the popular advanced NBA statistic RAPM. The goal of RAPM is to estimate each player's impact on their team's net scoring margin while they're on the court. The original ridge regression approach has a key limitation: it uses a single global variance for all players. By estimating these player impacts through Bayesian simulation in a language like Stan, separate variance components can be used for different groups of players.

While RAPM is calculated with player combinations and stint scoring rate data, the proposed football BLUE (Bayesian Learned Unit Efficiency) ratings are calculated at the play level.
The design matrix contains the columns:

The response variable Y:

Expected Points Added (EPA). EPA was chosen over yards to measure team efficiency since it is:

Context Aware: Accounts for field position, down, and distance

Predictive: Stronger correlation with future performance

Granular: Available for every play, not just drives or games

5 predictor terms per team

Pass Offense: Team’s ability to generate EPA (Expected Points Added) through passing plays
Run Offense: Team’s ability to generate EPA through rushing plays
Pass Defense: Team’s ability to limit opponent EPA on passing plays
Run Defense: Team’s ability to limit opponent EPA on rushing plays
Special Teams: Team’s performance on special teams plays

Each row represents a play. The home team gets a +1 and the away team gets a -1 in the columns for the units on the field, and every other column is 0 (this covers every team in the study that is not playing, as well as the playing teams' units that are off the field). For an explanation of basketball's RAPM design matrix, I like this article by Justin Jacobs.
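The row encoding described above can be sketched in a few lines of Python. This is only an illustration, not the code from the repo: the function names, the column ordering, and the team list are all made up for the example, and it assumes a special teams play puts +1 and -1 in the two special teams columns.

```python
# Hypothetical sketch of one design-matrix row; names and column order are assumptions.
UNITS = ["pass_offense", "run_offense", "pass_defense", "run_defense", "special_teams"]


def column_index(teams, team, unit):
    """Column position for a (team, unit) pair: 5 consecutive columns per team."""
    return teams.index(team) * len(UNITS) + UNITS.index(unit)


def encode_play(teams, home, away, offense, play_type):
    """Build one row: +1 for the home unit on the field, -1 for the away unit.

    play_type is "pass", "run", or "special_teams"; offense is the team with the ball.
    """
    row = [0] * (len(teams) * len(UNITS))
    if play_type == "special_teams":
        home_unit = away_unit = "special_teams"
    else:
        home_has_ball = offense == home
        home_unit = f"{play_type}_{'offense' if home_has_ball else 'defense'}"
        away_unit = f"{play_type}_{'defense' if home_has_ball else 'offense'}"
    row[column_index(teams, home, home_unit)] = 1
    row[column_index(teams, away, away_unit)] = -1
    return row


teams = ["Minnesota", "Iowa", "Ohio State"]
# Iowa (away) throws a pass against Minnesota (home).
row = encode_play(teams, home="Minnesota", away="Iowa", offense="Iowa", play_type="pass")
```

Here the only nonzero entries are Minnesota's pass defense (+1) and Iowa's pass offense (-1); all of Ohio State's columns stay at 0 because they are not in the game.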

Data

This study only uses 2024 season games from Power 4 (+ Notre Dame) matchups since they contain a much smaller proportion of second and third string snaps than matchups between levels.

The data itself originally comes from the cfbfastR R package.

Stan

Parameter Architecture

Unit-Specific Parameters: Separate vectors for each unit (pass_offense, run_offense, pass_defense, run_defense, special_teams)

Hierarchical Variances: Individual sigma_ parameters for each unit type to allow unit-specific shrinkage.

Positivity Constraints: <lower=0> bounds restrict every sigma_ scale parameter to positive values.

Prior Choices

Zero-Centered Normals: normal(0, sigma_unit) priors for team effects shrink each team toward the population mean.

Half-Cauchy Variances: cauchy(0, 2.5) priors for unit variances and cauchy(0, 5) for residual provide weakly informative regularization with heavy tails.

Model Structure

Additive Effects: Expected EPA = Offense + Defense components following established RAPM methodology.

Special Teams Exception: Uses subtraction (team_A - team_B) rather than separate offense/defense parameters for conceptual clarity.
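Putting the priors and structure above together, the model can be written compactly as follows. The notation is mine rather than the repo's: y_i is the EPA on play i, x_i is that play's design-matrix row, beta_{u,k} is team k's effect for unit u, sigma_u is the scale for unit type u, and sigma is the residual scale. The normal likelihood is an assumption inferred from the residual prior described above.

```latex
\begin{aligned}
y_i &\sim \mathrm{Normal}\left(x_i^\top \beta,\ \sigma\right) \\
\beta_{u,k} &\sim \mathrm{Normal}\left(0,\ \sigma_u\right) \\
\sigma_u &\sim \mathrm{Half\text{-}Cauchy}\left(0,\ 2.5\right) \\
\sigma &\sim \mathrm{Half\text{-}Cauchy}\left(0,\ 5\right)
\end{aligned}
```

Because each row of the design matrix holds only a +1 and a -1, the mean x_i^T beta reduces to the difference between the two units on the field for that play.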

Computational Settings

Conservative MCMC: adapt_delta=0.95, max_treedepth=12, long warmup for robust convergence with complex hierarchical structure.

Multiple Chains: 4 chains with 10,000 iterations (5,000 warmup) for reliable convergence diagnostics and effective sample sizes.

GitHub

For the code used to fit these Bayesian posterior distributions, see this GitHub repo.

Key Advantages of the Bayesian Approach

Uncertainty Quantification

Unlike point estimates from traditional rankings, the Bayesian approach provides:

Credible Intervals: 50% credible intervals (the 25th to 75th posterior percentiles) for each team's rating

Probabilistic Rankings: Ability to calculate probability that Team A’s unit is better than Team B’s unit

Uncertainty Visualization: Each rating's posterior draws can be plotted as an estimate of its probability density
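As a rough illustration of the first two points, here is a minimal Python sketch that works on posterior draws. The draws are simulated from made-up normal distributions purely for the example (they are not BLUE output), and the function names are hypothetical.

```python
import random
import statistics


def credible_interval_50(draws):
    """Return the 25th and 75th percentiles of a list of posterior draws."""
    q1, _median, q3 = statistics.quantiles(draws, n=4)
    return q1, q3


def prob_better(draws_a, draws_b):
    """Share of paired draws in which unit A's rating exceeds unit B's."""
    wins = sum(a > b for a, b in zip(draws_a, draws_b))
    return wins / len(draws_a)


# Simulated stand-ins for two teams' pass offense draws (hypothetical numbers).
rng = random.Random(2024)
team_a = [rng.gauss(0.15, 0.05) for _ in range(4000)]
team_b = [rng.gauss(0.10, 0.05) for _ in range(4000)]

low, high = credible_interval_50(team_a)
p_a_better = prob_better(team_a, team_b)
```

With real Stan output, the draws for both teams should come from the same iterations so that each pair is one joint sample from the posterior.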

Hierarchical Regularization

The Bayesian framework naturally handles the small sample size problem through:

Adaptive Regularization: The model learns appropriate shrinkage levels from the data

Borrowing Strength: Information from all teams informs individual estimates
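To see why shrinkage helps with small samples, consider a deliberately simplified case: one team effect with a normal(0, sigma_unit) prior, normally distributed play-level noise, and both scales treated as known. This is not the full model (which learns the scales from the data), and the numbers below are made up for illustration.

```python
def shrunk_mean(raw_mean, n_plays, sigma_unit, sigma_resid):
    """Posterior mean of a single team effect in the simplified normal-normal case.

    The raw average is pulled toward 0 by a weight that grows with the number of plays.
    """
    data_precision = n_plays / sigma_resid ** 2
    prior_precision = 1.0 / sigma_unit ** 2
    weight = data_precision / (data_precision + prior_precision)
    return weight * raw_mean


# Same raw EPA/play, different sample sizes (hypothetical values).
small_sample = shrunk_mean(raw_mean=0.30, n_plays=50, sigma_unit=0.1, sigma_resid=1.5)
large_sample = shrunk_mean(raw_mean=0.30, n_plays=500, sigma_unit=0.1, sigma_resid=1.5)
```

The 50-play estimate is pulled much closer to zero than the 500-play estimate, which is the behavior the hierarchical prior provides for every unit in the model.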

Opponent Adjustment

The BLUE framework automatically adjusts for strength of schedule by:

Simultaneous Estimation: All team ratings estimated together in one model

Opponent Quality: Each performance is evaluated relative to opponent strength

Transitive Relationships: Indirect comparisons through common opponents
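The transitive effect is easy to see in a toy example. The sketch below swaps in a plain ridge regression point estimate (the original RAPM approach) with one overall strength per team instead of the full Bayesian model with five units, and it assumes EPA is signed from the home team's perspective. Teams A and C never play each other; the data and the penalty value are invented.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def ridge_ratings(teams, plays, penalty=1.0):
    """Ridge estimate from (home, away, home_epa) plays with +1/-1 row encoding."""
    k = len(teams)
    idx = {team: i for i, team in enumerate(teams)}
    # Accumulate X'X + penalty * I and X'y without materializing X.
    xtx = [[penalty if i == j else 0.0 for j in range(k)] for i in range(k)]
    xty = [0.0] * k
    for home, away, epa in plays:
        h, a = idx[home], idx[away]
        xtx[h][h] += 1.0
        xtx[a][a] += 1.0
        xtx[h][a] -= 1.0
        xtx[a][h] -= 1.0
        xty[h] += epa
        xty[a] -= epa
    return dict(zip(teams, solve_linear(xtx, xty)))


# A outplays B, B outplays C, and A never faces C (hypothetical data).
plays = [("A", "B", 0.2)] * 50 + [("B", "C", 0.1)] * 50
ratings = ridge_ratings(["A", "B", "C"], plays)
```

Even without a direct matchup, A ends up rated above C because both are measured against B in the same fit.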

Statistical Rigor

The Bayesian approach provides:

Model Diagnostics: Convergence checks and posterior predictive validation

Principled Inference: All conclusions based on probability distributions

Reproducible Results: Full specification of assumptions and methods

Results

The output rankings pass the smell test! Note that the two inner black lines mark the 25th and 75th posterior percentiles for the true efficiency (EPA/play) of each team's respective unit. I decided not to include the Special Teams plots because the data are extremely high variance with small samples, but they didn't look problematic. I'm also not sure that pooling kicking and receiving on kickoffs, punts, and FGs into a single rating makes much sense.

2024 Top/Bottom 5 by play type

Conference Rankings by play type

2024 Big Ten

2024 SEC

2024 Big 12

2024 ACC

Discussion

Limitations

Computational Intensity

Bayesian inference requires significant computational resources

Takes over 1 hour to run on an M3 MacBook Pro using 4 cores in parallel

Memory requirements for storing posterior samples

Need parallel processing capabilities

Model Assumptions

The approach assumes:

Linear additive effects of offensive and defensive units

EPA model from cfbfastR package is accurate (it’s not but it’s probably still better than yards)

Stationary team performance throughout the season (ignores injuries, etc.)

Interpretability

While statistically sophisticated, the approach may be:

Difficult to explain to casual fans

More complex than traditional rankings

Future Work

Historical seasons

The cfbfastR package has data going back 20 years. I would love to see how Mike Leach's offenses stack up to today's teams.

Season-over-season Temporal Modeling

Incorporating time-varying effects to account for season-over-season changes in team strength. This 27-year-old paper by Mark Glickman and Hal Stern would be a good place to start. College football has even more team turnover than pro football, especially these days. There is also a plethora of player evaluation data on the internet that could be incorporated into a model.

Situational Contexts

Expanding the model to include:

Down and distance effects

Field position impacts (different teams may excel in short-yardage situations, etc.)
Score differential influences (teams often sit starters in blowouts, so maybe weight those snaps less)
Weather and environmental factors (I would love to see the Gophers host a playoff game)

Player-Level Analysis

Extending to individual player evaluation:
Quarterback-specific passing ratings
Running back effectiveness metrics (both player specific and fatigue specific metrics could be cool)
Defensive player impact measurements

Acknowledgements

Thanks to:

YOU for taking the time to click on this.
the creators of Stan
makers of all the packages I used like cfbfastR, teamcolors
Ron Yurko for encouraging me to write this and teaching me Stan
Nicco Jacimovic and Quang Nguyen for helping me debug this blog post rendering and also being general beasts
The people who hired me for the job I start tomorrow

Contact

My poodle Torrie on Sunday Night Football

Twitter

Bluesky