Today’s Friday lecture will discuss the first of many steps taken to estimate a number of runs. Bill James’ Runs Created, initially published in an early version of the classic Bill James Historical Abstract (in Dan Fox’s explanation, he has it as the 1979 version), was one of the first run estimation models and is likely the most well known. As a result, it will be the first of a few run estimators that we’ll look into here.
As texts for this discussion, I’ll refer to the following two pieces.
1. Patriot’s explanation of Runs Created shown here.
2. Dan Fox’s explanation shown here.
I may also be referring to some of the charts Tom Tango posts here in his series on how runs are actually created.
Again, none of this is my original research; full credit goes to Bill James for designing the formulas mentioned and to Patriot, Fox, and Tango for independently reviewing the model. I’ll simply be guiding the reader through the texts provided.
OK, let’s finally start stepping into Runs Created. First off, I’ll once again point out to you the formula mentioned in the last discussion. Here is the essential model for all Runs Created formulas.
Runs = A*B/C
Where A = on-base factor; B = Advancement factor; and C = Opportunities
Now, before we move past the jump and start defining these factors, consider whether or not this seems fundamentally sound from a baseball perspective.
The Factors
As I mentioned before, in the discussion on runs, the factors of Runs Created are actually somewhat obvious when looking at them. The A factor, the on-base factor, sounds like some measure of the number of baserunners a team produces. The B factor, the advancement factor, sounds like how far these baserunners are moved. Finally, the C factor, the opportunity factor, measures the number of chances to hit. Think for a moment about what that may look like in terms of baseball.
OK, now that we (may) have done that, here are the first inputs for Runs Created.
A = H + BB
B = TB
C = AB + BB
Where H = hits, BB = walks, TB = Total Bases, and AB = At-Bats
Rewriting the equation with the substituted variables gets you RC = (H + BB) * TB/(AB + BB). Here you can see that the formula contains (H + BB)/(AB + BB), which is essentially on-base percentage (OBP), excluding HBP. If you added HBP into both the A and C factors, you’d get RC = OBP * TB (ignoring sacrifice flies in the OBP denominator), or RC = OBP * SLG * AB.
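To make the arithmetic concrete, here is a minimal Python sketch of the basic formula and the OBP * TB identity. The function name and the team stat line are my own hypothetical choices for illustration, not real data, and "OBP" here means the HBP-free version (H + BB)/(AB + BB) described above.

```python
def basic_runs_created(h, bb, tb, ab):
    """Basic Runs Created: A * B / C with A = H + BB, B = TB, C = AB + BB."""
    a = h + bb       # on-base factor
    b = tb           # advancement factor
    c = ab + bb      # opportunity factor
    return a * b / c

# Hypothetical team season line (made-up numbers, for illustration only).
h, bb, tb, ab = 1450, 550, 2300, 5500

rc = basic_runs_created(h, bb, tb, ab)
obp_no_hbp = (h + bb) / (ab + bb)   # OBP without HBP or SF
slg = tb / ab

print(round(rc, 1))                      # basic RC
print(round(obp_no_hbp * tb, 1))         # same value via OBP * TB
print(round(obp_no_hbp * slg * ab, 1))   # same value via OBP * SLG * AB
```

All three printed values should agree, since they are the same expression rearranged.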
As you can see, the first formula was simple and elegant. It was intuitive, in that it was designed based on how the game of baseball is played (more on that later). And finally, it seemed to work, boasting decent root mean square errors (RMSE) at the team level compared to team runs scored.
Because the factors were intuitively part of the game and fairly obviously defined, it was simple to come up with improvements on the basic model by incorporating other statistics. The texts provided run down the changes added since the original version, and both Patriot and Dan Fox do an excellent job explaining the reasoning behind these changes, so I will not go in-depth into these. Instead, I’ll simply present the TECH-1 formula published in 1984 and the latest revision, from 2005.
1984 TECH-1:
A = H+W+HB-CS-DP
B = TB+.26(W+HB-IW)+.52(SB+SH+SF)
C = AB+W+HB+SH+SF
2005 Revision:
A = H+W+HB-CS-DP
B = 1.125S+1.69D+3.02T+3.73HR+.29(W-IW+HB)+.492(SB+SH+SF)-.04K
C = AB+W+HB+SH+SF
These additions are not terribly groundbreaking. The out factors of caught stealing and double plays were added to A because they take away baserunners. The B factor changes reflect the addition of stolen bases and the fact that walks and other on-base events advance runners as well. The additions to the opportunities factor C just include additional events that are a part of plate appearances. The coefficients placed in front of the various inputs in the B factor were designed to improve the fit for the run environment.
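For readers who want to plug in numbers, here is a rough Python sketch of the two versions listed above. The helper names, the dictionary layout, and the team stat line are hypothetical choices of mine, and I am assuming S, D, and T stand for singles, doubles, and triples, with singles derived as H minus extra-base hits.

```python
STAT_KEYS = ["AB", "H", "D", "T", "HR", "W", "IW", "HB",
             "K", "SB", "CS", "SH", "SF", "DP"]

def make_line(**counts):
    """Build a stat line; missing stats default to 0. S and TB are derived."""
    s = {k: counts.get(k, 0) for k in STAT_KEYS}
    s["S"] = s["H"] - s["D"] - s["T"] - s["HR"]                   # singles (assumed meaning of S)
    s["TB"] = s["S"] + 2 * s["D"] + 3 * s["T"] + 4 * s["HR"]      # total bases
    return s

def a_factor(s):
    # Baserunners, minus those erased by caught stealing and double plays.
    return s["H"] + s["W"] + s["HB"] - s["CS"] - s["DP"]

def c_factor(s):
    # Opportunities: essentially plate appearances.
    return s["AB"] + s["W"] + s["HB"] + s["SH"] + s["SF"]

def rc_tech1(s):
    """1984 TECH-1 version as listed in the text."""
    b = (s["TB"] + 0.26 * (s["W"] + s["HB"] - s["IW"])
         + 0.52 * (s["SB"] + s["SH"] + s["SF"]))
    return a_factor(s) * b / c_factor(s)

def rc_2005(s):
    """2005 revision as listed in the text."""
    b = (1.125 * s["S"] + 1.69 * s["D"] + 3.02 * s["T"] + 3.73 * s["HR"]
         + 0.29 * (s["W"] - s["IW"] + s["HB"])
         + 0.492 * (s["SB"] + s["SH"] + s["SF"])
         - 0.04 * s["K"])
    return a_factor(s) * b / c_factor(s)

# Hypothetical team season (made-up numbers, for illustration only).
team = make_line(AB=5500, H=1450, D=280, T=30, HR=160, W=550, IW=40,
                 HB=50, K=1000, SB=100, CS=40, SH=40, SF=45, DP=120)

print(round(rc_tech1(team), 1), round(rc_2005(team), 1))
```

Both versions share the same A and C factors, so the only thing that changes between them is how each event is weighted in B.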
The Advantage
Runs Created is derived from a fairly simple formula that is also dynamic rather than linear, which better reflects how baseball runs are scored, and it boasts acceptably low errors at the team level across the span of MLB talent and the course of a full season. Jim Furtado did a study over the 1955-1997 period and had RC at a root mean square error (RMSE) of 25 runs, five back of his own model, which ranked the best in the set of models listed. The dynamic behavior can be seen graphically in this chart by Tom Tango, plotting a range of OBP versus a percentage of runners scored.
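If RMSE is unfamiliar, here is a small Python sketch of how it is computed for a run estimator at the team level. The function name is my own, and the three estimate/actual pairs are placeholder numbers I made up purely to show the mechanics; they are not from Furtado's study.

```python
import math

def rmse(estimated, actual):
    """Root mean square error between estimated and actual team runs."""
    if len(estimated) != len(actual) or not estimated:
        raise ValueError("need two non-empty lists of equal length")
    squared_errors = [(e - a) ** 2 for e, a in zip(estimated, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Placeholder team-seasons (hypothetical values, for illustration only).
estimated_runs = [700, 750, 800]
actual_runs = [710, 740, 830]

print(round(rmse(estimated_runs, actual_runs), 1))
```

Because the errors are squared before averaging, a few badly missed teams push RMSE up more than many small misses do.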
The Flaws
As RC has continued to develop, it has gotten more and more complex. The latest version no longer has the simplicity of A*B/C, as additions have turned the basic function into RC = (A+2.4C)(B+3C)/(9C)-.9C. Yet this added complexity has not solved the critical problem with RC: it is not an intuitive model of how baseball works, but rather a model that reflects the context and environment of a normal MLB season. In other words, RC does a fine job predicting “normal” major league teams, but struggles at the extremes because its basic formula is not actually grounded in baseball realism, but rather modeled based on MLB results.
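To see how the expanded form behaves next to plain A*B/C, here is a short Python sketch comparing the two. The function names and the input values are hypothetical; the only thing taken from the text is the pair of formulas. The first check is just arithmetic on my part: when A is .300 per opportunity and B is .375 per opportunity, the two forms land on the same number.

```python
def rc_basic_form(a, b, c):
    """The original product form: A * B / C."""
    return a * b / c

def rc_expanded_form(a, b, c):
    """The expanded form quoted in the text: (A + 2.4C)(B + 3C) / (9C) - .9C."""
    return (a + 2.4 * c) * (b + 3 * c) / (9 * c) - 0.9 * c

# Hypothetical inputs: 600 opportunities, A = .300 * C, B = .375 * C.
c = 600
a, b = 0.300 * c, 0.375 * c
print(round(rc_basic_form(a, b, c), 2), round(rc_expanded_form(a, b, c), 2))

# Hypothetical extreme: every opportunity is a home run (A = C, B = 4C).
a_hr, b_hr = c, 4 * c
print(round(rc_basic_form(a_hr, b_hr, c), 2),
      round(rc_expanded_form(a_hr, b_hr, c), 2))
```

In the first case both forms come out to about 67.5 runs; in the all-home-run case the expanded form gives a much smaller number than A*B/C, so the extra terms pull extreme inputs back toward the middle.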
The chart provided by Tom Tango linked above is a great example of this fundamental problem. If you can spot the grey line that represents RC’s scoring rate compared to the black line which represents an empirically determined scoring rate based on data clumped by games of a certain range of OBP, you can see that RC begins drastically overestimating actual scoring rate somewhere past the .400 OBP mark. Why? Again, the .400 OBP mark is not typical for your normal MLB team, but RC was designed with those teams in mind. This discrepancy is even more apparent when Tango groups games by OPS, shown here. In that chart, the grey line jumps significantly off the mark at around .900 OPS and is of course well off base past an OPS of 1.00.
This flaw is exacerbated when examining the HR in RC. Patriot had this to say on the mistreatment of the HR in RC:
“A home run always produces at least one run, no matter what. In RC, a team with 1 HR and 100 outs will be projected to score 1*4/101 runs, a far cry from the one run that we know will score. And in an offensive context where no outs are made, all runners will eventually score, and each event, be it a walk, a single, a home run–any on base event at all–will be worth precisely one run. In a 1.000 OBA context, RC puts a HR at 1*4/1 = 4 runs. This flaw is painfully obvious at that kind of extreme point, but the distorting effects begin long before that.”
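Patriot's two examples are easy to reproduce. Here is a minimal Python sketch using the basic formula (A = H + BB, B = TB, C = AB + BB); the function name is my own, and the inputs are exactly the hypothetical extremes from the quote.

```python
def basic_runs_created(h, bb, tb, ab):
    """Basic Runs Created: (H + BB) * TB / (AB + BB)."""
    return (h + bb) * tb / (ab + bb)

# One home run and 100 outs: 101 at-bats, 1 hit, 4 total bases.
one_hr_100_outs = basic_runs_created(h=1, bb=0, tb=4, ab=101)
print(round(one_hr_100_outs, 4))   # 1*4/101, far below the 1 run that must score

# A 1.000 OBA context: a single at-bat that is a home run, no outs at all.
hr_no_outs = basic_runs_created(h=1, bb=0, tb=4, ab=1)
print(hr_no_outs)                  # 1*4/1, versus the 1 run the event is worth
```

The same event is valued at a small fraction of a run in one context and at four runs in the other, which is the distortion Patriot is describing.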
We’ve seen that this distortion effect indeed sets in well before that. What is the end result? When measuring teams at the extremes, RC is too optimistic at the high end and not optimistic enough at the low end. This is an issue that Bill James himself recognized quite some time ago; as Patriot mentions, the home run issue is one of the reasons why James changed the B factor for 2005 from total bases to a set of separate weights for each hit event. Patriot describes how this has affected the linear weights of each event as compared to empirically determined linear weights, and you can read that in his piece.
What I’d like to point out is that, while James has tinkered with the formula to attempt to address concerns with RC at different run environments, he has not addressed the fundamental concern of RC’s design. Ultimately, all of these moves are patches to fix a greater issue with the formula. In Dan Fox’s piece, he mentions that the authors of Curve Ball: Baseball, Statistics, and the Role of Chance in the Game, Jim Albert and Jay Bennett, think that product models for run estimation such as RC are not effective for the extremes.
Do not use run estimators for individual players
The initial formula for RC seemed friendly enough to be applied to players. In fact, James did this in part to help in reconciling total team RC with actual team runs. However, there is a clear issue with applying run estimators such as RC at the individual level. Such estimators were not designed for individuals, and interactions between the various factors do not make sense within the context of a single player. Applying RC to an individual player assumes that the player is both getting on base and driving himself home, an impossibility outside of home runs. Thus, it models something more like the number of runs produced by a lineup made up entirely of that player, given the same number of opportunities.
It is important to mention here that non-linear weights run estimators all have this issue. When we later discuss the best of these estimators, BaseRuns, you’ll see that applying BaseRuns to individual hitters is also fundamentally flawed. It is not the fault of the run estimator because it simply is not designed to do this calculation.
However, one method that can be used to evaluate a player’s run contribution is to place him on a hypothetical team of players of a certain level/baseline and measure the difference in run production between the team with and without the player. This is the so-called “Theoretical Team” analysis, which originated from Bill James’ work and will be discussed in a later Friday lecture.
Conclusion
All of that being said, RC still holds a valuable and important place in the history of baseball statistical analysis. The introduction of RC’s basic formula A*B/C, while ultimately not effective outside of the MLB environment, influenced others in their work on run estimation. The basic formula was at the root of the design of BaseRuns by David Smyth, currently the best run estimator at the team level. The theoretical team model originated from work using RC as a team run estimator. In other words, while RC is not and should not be considered an ideal run estimator for present-day work, its fingerprints are on many designs in the field of run estimation to this day. It was a first step into the topic, but not the last.
References
1. Patriot. “Runs Created.” http://gosu02.tripod.com/id104.html
2. Fox, D. “A Brief History of Run Estimation: Runs Created.” Dan Agonistes. 07 Oct 2004. http://danagonistes.blogspot.com/2004/10/brief-history-of-run-estimation-runs.html
3. Tango, T. “How Runs are Really Created: Third Installment.” Tangotiger.net. http://www.tangotiger.net/rc3.html
4. Furtado, J. “Methods and Accuracy in Run Estimation Tools.” Baseball Think Factory. http://www.baseballthinkfactory.org/btf/scholars/furtado/articles/accuracy.htm
5. Albert, J., Bennett, J. Curve Ball: Baseball, Statistics, and the Role of Chance in the Game. Springer. ISBN: 978-0-387-00193-7
Reading Materials
Here’s the reading material for the weekend.
1. The next discussion on run estimators will be on linear weights of all kinds! I’ll deal with them in a general fashion, but you can check out Jim Furtado’s explanation of his own system, Extrapolated Runs, here.
2. BtB colleague Tommy Bennett has a look at GDP and MLB salary growth that was very intriguing. Check out the discussion thread over at The Book blog as well.



