BLUE: A Bayesian Approach to College Football Ratings
Introduction
Traditional football ratings often rely on simple metrics like win percentage, points per game, and strength of schedule. Slightly more advanced ratings include yards per play, EPA per play, and play success rate. While these numbers paint a decent picture of team strength in the NFL, college football’s wide variation in schedule strength can make them misleading. This analysis proposes a Bayesian approach to college football unit ratings that provides a more statistically robust measure of each team’s efficiency by play type.
Methods
Design Matrix
The core framework for this analysis is modeled on the popular advanced NBA statistic RAPM (Regularized Adjusted Plus-Minus). The goal of RAPM is to estimate each player’s impact on their team’s net scoring margin while they’re playing. The original ridge regression approach has a key limitation: it uses a single global variance for all players. Calculating these player impacts through Bayesian simulation in a language like Stan allows separate variance components for different groups of players.
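To make the single-variance limitation concrete, here is a minimal sketch of the classic closed-form ridge estimate that RAPM is built on. The toy stint matrix, the margins, and the function name `ridge_ratings` are made up for illustration; this is not the code behind any published RAPM.

```python
import numpy as np

def ridge_ratings(X, y, lam):
    """Closed-form ridge estimate: solve (X'X + lam * I) beta = X'y."""
    n_cols = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_cols), X.T @ y)

# Hypothetical toy data: 4 stints, 3 players (+1 one side, -1 the other).
X = np.array([
    [ 1.0, -1.0,  0.0],
    [ 1.0,  0.0, -1.0],
    [ 0.0,  1.0, -1.0],
    [-1.0,  1.0,  0.0],
])
y = np.array([2.0, 3.0, 1.0, -2.0])  # made-up net scoring margins

# The same penalty lam is applied to every column, which is the
# "single global variance" limitation described above.
light = ridge_ratings(X, y, lam=0.1)
heavy = ridge_ratings(X, y, lam=10.0)
print(light, heavy)
```

A larger `lam` pulls every rating toward zero by the same rule, regardless of which group the column belongs to. The Bayesian version replaces that one penalty with a learned scale per group.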
While RAPM is calculated with player combinations and stint scoring rate data, the proposed football BLUE (Bayesian Learned Unit Efficiency) ratings are calculated at the play level.
The design matrix contains the columns:
The response variable Y:
Expected Points Added (EPA). EPA was chosen over yards to measure team efficiency since it is:
Context Aware: Accounts for field position, down, and distance
Predictive: Stronger correlation with future performance
Granular: Available for every play, not just drives or games
5 predictor terms per team
Pass Offense: Team’s ability to generate EPA (Expected Points Added) through passing plays
Run Offense: Team’s ability to generate EPA through rushing plays
Pass Defense: Team’s ability to limit opponent EPA on passing plays
Run Defense: Team’s ability to limit opponent EPA on rushing plays
Special Teams: Team’s performance on special teams plays
Each row represents a play. The home team’s on-field unit is coded as +1 and the away team’s on-field unit as -1 in their respective columns, and every other column is 0. That includes the columns for teams not playing in that game and the columns for the playing teams’ units that are off the field. For an explanation of basketball’s RAPM design matrix, I like this article by Justin Jacobs.
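As a rough illustration, one reading of this encoding can be sketched in Python. The team names, column ordering, and the helper names `column_index` and `encode_play` are hypothetical, and the sign convention (home +1, away -1, applied to whichever units are on the field) is my interpretation of the description above rather than the exact code used for BLUE.

```python
import numpy as np

UNITS = ["pass_offense", "run_offense", "pass_defense", "run_defense", "special_teams"]

def column_index(teams, team, unit):
    """Column for a (team, unit) pair: 5 consecutive columns per team."""
    return teams.index(team) * len(UNITS) + UNITS.index(unit)

def encode_play(teams, home, away, offense, play_type):
    """Build one design-matrix row; play_type is 'pass', 'run' or 'special_teams'."""
    row = np.zeros(len(teams) * len(UNITS))
    if play_type == "special_teams":
        # One special teams column per team: home minus away.
        row[column_index(teams, home, "special_teams")] = 1.0
        row[column_index(teams, away, "special_teams")] = -1.0
        return row
    defense = away if offense == home else home
    on_field = ((offense, f"{play_type}_offense"), (defense, f"{play_type}_defense"))
    for team, unit in on_field:
        # Assumed convention: home units are +1, away units are -1.
        row[column_index(teams, team, unit)] = 1.0 if team == home else -1.0
    return row

teams = ["Team A", "Team B", "Team C"]  # placeholder names
row = encode_play(teams, home="Team A", away="Team B", offense="Team A", play_type="pass")
print(row)
```

Every row has exactly two non-zero entries, so the matrix is very sparse: a play only touches the two units that were actually on the field.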
Data
This study only uses 2024 season games from Power 4 (+ Notre Dame) matchups since they contain a much smaller proportion of second and third string snaps than matchups between levels.
The data itself originally comes from the cfbfastR R package.
Stan
Parameter Architecture
Unit-Specific Parameters: Separate vectors for each unit (pass_offense, run_offense, pass_defense, run_defense, special_teams)
Hierarchical Variances: Individual sigma_ parameters for each unit type to allow unit-specific shrinkage.
Positivity Constraints: <lower=0> bounds keep the variance parameters positive.
Prior Choices
Zero-Centered Normals: normal(0, sigma_unit) priors for team effects shrink each team toward the population mean.
Half-Cauchy Variances: cauchy(0, 2.5) priors for unit variances and cauchy(0, 5) for the residual provide weakly informative regularization with heavy tails.
Model Structure
Additive Effects: Expected EPA = Offense + Defense components following established RAPM methodology.
Special Teams Exception: Uses subtraction (team_A - team_B) rather than separate offense/defense parameters for conceptual clarity.
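Putting the parameter, prior, and structure choices together, the model can be written roughly as follows. This is a sketch, not the exact Stan program: the Normal likelihood is an assumption inferred from the residual scale prior, and the symbols X, β, u(j), σ_u and σ are notation introduced here (X is the design matrix, β stacks all unit effects, and u(j) is the unit type of column j).

```latex
\begin{aligned}
y_i &\sim \mathrm{Normal}(\mu_i,\ \sigma), \qquad \mu_i = \sum_j X_{ij}\,\beta_j \\
\beta_j &\sim \mathrm{Normal}\bigl(0,\ \sigma_{u(j)}\bigr), \qquad
  u(j) \in \{\text{pass off.},\ \text{run off.},\ \text{pass def.},\ \text{run def.},\ \text{special teams}\} \\
\sigma_u &\sim \mathrm{Half\text{-}Cauchy}(0,\ 2.5), \qquad
  \sigma \sim \mathrm{Half\text{-}Cauchy}(0,\ 5)
\end{aligned}
```

The half-Cauchy priors arise from combining the cauchy priors with the <lower=0> bounds. Note that Stan's normal distribution is parameterized by a standard deviation, so each σ here is a scale rather than a squared variance.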
Computational Settings
Conservative MCMC: adapt_delta=0.95, max_treedepth=12, and a long warmup for robust convergence with the complex hierarchical structure.
Multiple Chains: 4 chains with 10,000 iterations (5,000 warmup) for reliable convergence diagnostics and effective sample sizes.
GitHub
For the code used to fit these Bayesian posterior distributions, see this GitHub repo.
Key Advantages of the Bayesian Approach
Uncertainty Quantification
Unlike point estimates from traditional rankings, the Bayesian approach provides:
Credible Intervals: 50% credible intervals (25th to 75th posterior percentiles) for each team’s rating
Probabilistic Rankings: Ability to calculate probability that Team A’s unit is better than Team B’s unit
Uncertainty Visualization: Can visualize an estimate of each rating’s posterior probability density function
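Both summaries are cheap to compute once posterior draws are available. The sketch below uses synthetic normal draws as stand-ins for two teams' posterior samples; the means, spreads, and names `team_a` and `team_b` are made up, not BLUE results.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for posterior draws of one unit's rating (EPA/play).
# With a real fit, use the draws for both teams from the same iterations.
team_a = rng.normal(0.15, 0.05, size=20_000)
team_b = rng.normal(0.10, 0.05, size=20_000)

def credible_interval(draws, mass=0.5):
    """Central credible interval containing `mass` of the draws."""
    tail = (1.0 - mass) / 2.0 * 100.0
    lo, hi = np.percentile(draws, [tail, 100.0 - tail])
    return lo, hi

lo, hi = credible_interval(team_a)                 # 25th and 75th percentiles
prob_a_better = float(np.mean(team_a > team_b))    # share of draws where A > B

print(f"50% interval for Team A: [{lo:.3f}, {hi:.3f}]")
print(f"P(Team A > Team B) = {prob_a_better:.2f}")
```

Comparing draws pairwise matters with a real posterior because the two ratings are estimated jointly and may be correlated. Summarizing each team separately and then comparing would throw that information away.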
Hierarchical Regularization
The Bayesian framework naturally handles the small sample size problem through:
Adaptive Regularization: The model learns appropriate shrinkage levels from the data
Borrowing Strength: Information from all teams informs individual estimates
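A back-of-the-envelope version of this shrinkage uses the textbook normal-normal formula for a single unit with fixed scales. In the actual model the scales are learned and all units are estimated jointly, so this is only a simplified illustration. The numbers and function names are hypothetical.

```python
def shrinkage_weight(n_plays, sigma_unit, sigma_resid):
    """Weight on the raw average under a normal(0, sigma_unit) prior
    and normal residual noise with standard deviation sigma_resid."""
    signal = n_plays * sigma_unit ** 2
    return signal / (signal + sigma_resid ** 2)

def shrunk_mean(raw_mean, n_plays, sigma_unit, sigma_resid):
    """Posterior mean: the raw average pulled toward the prior mean of 0."""
    return shrinkage_weight(n_plays, sigma_unit, sigma_resid) * raw_mean

# Made-up scales: unit effects vary by ~0.1 EPA/play, single plays by ~1.0.
w_small = shrinkage_weight(50, sigma_unit=0.1, sigma_resid=1.0)
w_large = shrinkage_weight(500, sigma_unit=0.1, sigma_resid=1.0)
print(w_small, w_large)
print(shrunk_mean(0.30, 50, 0.1, 1.0), shrunk_mean(0.30, 500, 0.1, 1.0))
```

With these illustrative scales, a unit with 50 plays keeps about a third of its raw average while a unit with 500 plays keeps about five sixths. Small samples get pulled hard toward the population mean, and larger samples are trusted more.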
Opponent Adjustment
The BLUE framework automatically adjusts for strength of schedule by:
Simultaneous Estimation: All team ratings estimated together in one model
Opponent Quality: Each performance is evaluated relative to opponent strength
Transitive Relationships: Indirect comparisons through common opponents
Statistical Rigor
The Bayesian approach provides:
Model Diagnostics: Convergence checks and posterior predictive validation
Principled Inference: All conclusions based on probability distributions
Reproducible Results: Full specification of assumptions and methods
Results
The output rankings pass the smell test! Note the two inner black lines indicate 25th and 75th percentile estimates for the true efficiency (EPA/play) of each team’s respective unit. I decided not to include the Special Teams plots due to the extremely high variance + low sample nature of the data, but they didn’t look problematic. I also am not sure pooling together kicking + receiving kickoffs/punts/FGs into a rating makes much sense.
2024 Top/Bottom 5 by play type
Conference Rankings by play type
2024 Big Ten
2024 SEC
2024 Big 12
2024 ACC
Discussion
Limitations
Computational Intensity
Bayesian inference requires significant computational resources
Takes over 1 hour to run on an M3 MacBook Pro with 4 cores running in parallel
Memory requirements for storing posterior samples
Need parallel processing capabilities
Model Assumptions
The approach assumes:
Linear additive effects of offensive and defensive units
EPA model from cfbfastR package is accurate (it’s not but it’s probably still better than yards)
Stationary team performance throughout season (ignores injuries etc)
Interpretability
While statistically sophisticated, the approach may be:
Difficult to explain to casual fans
More complex than traditional rankings
Future Work
EPA Model
Someone PLEASE make a better public NCAAF EPA model. This paper by Ron Yurko could be effectively adapted to the CFB play-by-play data already available within cfbfastR. The current nflfastR model is built with XGBoost, and it would really make me so happy for someone to do this in public.
Historical seasons
The cfbfastR package has data going back 20 years. I would love to see how Mike Leach’s offenses stack up to today’s teams.
Season-over-season Temporal Modeling
Incorporating time-varying effects to account for season-over-season changes in team strength. This 27-year-old paper by Mark Glickman and Hal Stern would be a good place to start. College football has even more team turnover than pro football, especially these days. There is also a plethora of player evaluation data on the internet that could be incorporated into a model.
Situational Contexts
Expanding the model to include:
Down and distance effects
Field position impacts (different teams may excel in shorter yardage situations etc)
Score differential influences (teams often sit starters in blowouts etc so maybe weight those snaps less)
Weather and environmental factors (I would love to see the Gophers host a playoff game)
Player-Level Analysis
Extending to individual player evaluation:
Quarterback-specific passing ratings
Running back effectiveness metrics (both player specific and fatigue specific metrics could be cool)
Defensive player impact measurements
Acknowledgements
Thanks to:
YOU for taking the time to click on this.
the creators of Stan
the makers of all the packages I used, like cfbfastR and teamcolors
Ron Yurko for encouraging me to write this and teaching me Stan
Nicco Jacimovic and Quang Nguyen for helping me debug this blog post rendering and also being general beasts
The people who hired me for the job I start tomorrow