BaseRuns: The Ultimate in Run Estimation (so far)
  • By Michael Jong
  • November 20th, 2009

In the last few Fundamentals sections, I’ve been discussing run estimators. First, we started with one of the originals, Runs Created (RC). Then, we discussed a little about linear weights. Both types of system had their share of positives and negatives. One weakness they share is that both were designed with certain run environments in mind (linear weights by definition must be based on a particular environment, while Runs Created was initially built by Bill James through observational analysis of the major league environment), so they falter when taken “out of their element.”

Today, we’re going to discuss the current premier run estimator, BaseRuns. BaseRuns is considered the best run estimation tool at the moment because its model is closer to reality than that of any other system. As Bill James mentioned, runs are not scored/created linearly, so linear weights is ultimately a system that approximates the environment it is built from (and it does this very well). RC is a dynamic estimator intended to model baseball reality, but its structure is flawed: rather than modeling run scoring, it ends up fitting it, much like linear weights. BaseRuns, on the other hand, offers us the best, most intuitive model for run scoring yet.

Here are the references/readings for the discussion:

1. Patriot’s excellent BaseRuns write-up, upon which most of this piece will be based.

2. Part 3 of Tom Tango’s run estimator series, particularly those charts I showed in our last two discussions.

3. TangoTiger’s wiki article on BaseRuns.

Let’s dive into BaseRuns.

The Model

David Smyth is credited with developing BaseRuns in the early 1990s. Initially, he attempted to work from the basic design of Runs Created shown here:

Runs = A*B/C

where A is the on-base factor, B is the advancement factor, and C is the opportunity factor. However, Smyth found in his work that this design did not model run scoring well. He changed the model to this basic structure:

Runs = A*[B/(B+C)] + D

where A is once again an on-base factor, B represents an advancement factor, C represents outs, and D represents home runs. You can see by this construction that the general run-scoring model is a measure of baserunners (A) multiplied by the percentage of baserunners who score (advancement B divided by total opportunities B+C), plus any home runs hit (D).

The simplest formula for the BaseRuns components is shown as follows.

A = H + W - HR
B = (1.4*TB - .6*H - 3*HR + .1*W)*1.02
C = AB - H
D = HR
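
As a minimal sketch, the simple version translates directly into a few lines of Python. The function name base_runs_simple and the sample team line are invented for illustration; they do not come from Patriot’s or Tango’s work.

```python
def base_runs_simple(ab, h, tb, hr, w):
    """Simple BaseRuns estimate from the four components listed above."""
    a = h + w - hr                                        # baserunners, home runs excluded
    b = (1.4 * tb - 0.6 * h - 3 * hr + 0.1 * w) * 1.02    # advancement factor
    c = ab - h                                            # outs
    d = hr                                                # each home run scores itself
    # Baserunners times the share of them that score, plus home runs.
    return a * b / (b + c) + d


# Hypothetical team-season line, made up purely for illustration.
estimate = base_runs_simple(ab=5500, h=1450, tb=2300, hr=160, w=550)
print(round(estimate, 1))
```

The sketch does no input checking, so a stat line with no advancement and no outs (B + C = 0) would raise a division error.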

A more complicated model including most recorded official statistics is as follows.

A = H + W + HBP - HR - .5*IW
B = (1.4*TB - .6*H - 3*HR + .1*(W + HBP - IW) + .9*(SB - CS - GDP))*1.1
C = AB - H + CS + GDP
D = HR
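
The fuller version can be sketched the same way. The function name base_runs_full and the choice to default the extra statistics to zero are assumptions made for this example, and both stat lines are invented. The two calls differ only in stolen bases, to show how the running game feeds the advancement factor.

```python
def base_runs_full(ab, h, tb, hr, w, hbp=0, iw=0, sb=0, cs=0, gdp=0):
    """Fuller BaseRuns estimate using the components listed above."""
    a = h + w + hbp - hr - 0.5 * iw                  # baserunners
    b = (1.4 * tb - 0.6 * h - 3 * hr
         + 0.1 * (w + hbp - iw)
         + 0.9 * (sb - cs - gdp)) * 1.1              # advancement factor
    c = ab - h + cs + gdp                            # outs, including outs on the bases
    d = hr                                           # home runs
    return a * b / (b + c) + d


# Hypothetical team line, made up purely for illustration.
team = dict(ab=5500, h=1450, tb=2300, hr=160, w=550, hbp=50, iw=40)

# The same team with two different (made-up) stolen base totals.
print(round(base_runs_full(**team, sb=120, cs=30, gdp=120), 1))
print(round(base_runs_full(**team, sb=40, cs=30, gdp=120), 1))
```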

Finally, there is a model used for pitchers.

A = H + W - HR
B = (1.4*TBe - .6*H - 3*HR + .1*W)*1.1
C = 3*IP
D = HR
where TBe = 1.12*H + 4*HR

Here, TBe serves as an estimate of total bases allowed by the pitcher. Many sources publish actual total bases allowed, and that figure can be used in its stead.
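
A rough sketch of the pitcher version follows. The function name pitcher_base_runs, the optional tb argument, and the sample pitching line are assumptions for illustration. The sketch falls back to TBe only when actual total bases allowed are not supplied, and it expects innings pitched as a true decimal (200 1/3 innings as 200.333, not the box-score shorthand 200.1).

```python
def pitcher_base_runs(ip, h, hr, w, tb=None):
    """BaseRuns allowed, using the pitcher components listed above."""
    if tb is None:
        tb = 1.12 * h + 4 * hr       # TBe: estimated total bases allowed
    a = h + w - hr                   # baserunners allowed
    b = (1.4 * tb - 0.6 * h - 3 * hr + 0.1 * w) * 1.1
    c = 3 * ip                       # outs recorded
    d = hr                           # home runs allowed
    return a * b / (b + c) + d


# Hypothetical pitching line, made up purely for illustration.
runs = pitcher_base_runs(ip=200, h=180, hr=20, w=60)
print(round(runs, 1))                # estimated runs allowed
print(round(9 * runs / 200, 2))      # the same estimate per nine innings
```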

The differences between RC and BaseRuns

The primary and perhaps most important difference between RC and BaseRuns is how each system handles home runs. Because RC does not separate home runs from other events, it handles them incorrectly. As Patriot explains in his article, in an environment where every plate appearance ends in a home run, RC values each home run at four runs, which makes no sense. At the other extreme, every home run must be worth at least one run, even in an environment where no other plate appearance results in a positive event. The example given in the article is a game in which one player hits a home run and the team otherwise records 27 outs. In this case, RC estimates that the team scores 0.14 runs, another impossibility. Linear weights has the same kind of problem. Each home run is valued at approximately 1.4 runs in a normal context. Carry that fixed value into a context where nobody else reaches base, however, and every home run is a solo shot that clearly cannot be worth 1.4 runs.

BaseRuns handles the home run separately from the on-base factor and deducts some amount from the advancement factors. It correctly values a solo home run as worth one run, regardless of the context provided.
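
The two extreme cases are easy to check numerically. The sketch below assumes the basic Runs Created formula, (H + W)*TB/(AB + W), which reproduces the 0.14 figure quoted above, and the simple BaseRuns formula from earlier in this piece. The function names are illustrative.

```python
def runs_created(ab, h, tb, w):
    """Basic Runs Created: (H + W) * TB / (AB + W)."""
    return (h + w) * tb / (ab + w)


def base_runs_simple(ab, h, tb, hr, w):
    """Simple BaseRuns: A * B / (B + C) + D."""
    a = h + w - hr
    b = (1.4 * tb - 0.6 * h - 3 * hr + 0.1 * w) * 1.02
    c = ab - h
    return a * b / (b + c) + hr


# Case 1: one solo home run and 27 outs (28 at-bats, no walks).
print(round(runs_created(28, 1, 4, 0), 2))          # 0.14
print(round(base_runs_simple(28, 1, 4, 1, 0), 2))   # 1.0

# Case 2: ten plate appearances, every one a home run.
print(runs_created(10, 10, 40, 0))                  # 40.0, four runs per home run
print(base_runs_simple(10, 10, 40, 10, 0))          # 10.0, one run per home run
```

In both cases the A factor is zero, so BaseRuns reduces to the home run count no matter what the advancement factor says.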

The Advantages

The biggest advantage of BaseRuns is obvious: it is the most intuitive model we have of how runs are actually scored. The basic structure makes perfect sense. Runners who get on base need to be advanced to score, and the advancement factor divided by the opportunities factor (Outs + Advanced Bases) comes out as the percentage of baserunners that score. Multiplying the baserunners by that percentage gives an estimate of how many runs score from players who get on base. Adding home runs separately accounts for the fact that each home run must by definition provide at least one run. Furthermore, removing home runs from the baserunner factor makes sense, as home runs do not actually put players on base. Basically, the fundamental structure of BaseRuns makes baseball sense, and that is why it works so well.

As evidenced by much of the research on the topic, it does indeed work very well, even in extreme environments. Check out this chart from Tango’s series. You can see that the scoring percentage estimated by BaseRuns and the actual empirical data match very closely, even as we move from extremely low run environments to extremely high ones.

An additional advantage comes from the customization element. Like RC, there are ways to adjust the factors accordingly to fit any new data points, because the factors are realities of baseball that are easy to understand. Patriot points out that the B factor of advancement can be changed depending on the environment. It can also be changed to fit an actual percentage of runners scored, though this would feel like “cheating” by tailoring the system to meet the dataset. Nevertheless, it is possible to tailor factors to include different points of data and to tinker with new formulas for the baserunners scoring rate.

Finally, because BaseRuns fits empirical data so well, even in the extremes, linear weights can be determined from BaseRuns by using a little differentiation. In fact, Tango has done this exact thing in determining linear weights for various run environments.
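
As a hedged illustration of the idea, the sketch below approximates those derivatives numerically: it nudges a stat line by a small fraction of one event in each direction and measures how the simple BaseRuns estimate moves. This is a finite-difference stand-in, not Tango’s actual procedure, and the stat line, the EVENTS table, and the function names are all invented for the example.

```python
# Change in (AB, H, TB, HR, W) caused by one additional event of each type.
EVENTS = {
    "single":   (1, 1, 1, 0, 0),
    "double":   (1, 1, 2, 0, 0),
    "triple":   (1, 1, 3, 0, 0),
    "home run": (1, 1, 4, 1, 0),
    "walk":     (0, 0, 0, 0, 1),
    "out":      (1, 0, 0, 0, 0),
}


def base_runs_simple(ab, h, tb, hr, w):
    """Simple BaseRuns: A * B / (B + C) + D."""
    a = h + w - hr
    b = (1.4 * tb - 0.6 * h - 3 * hr + 0.1 * w) * 1.02
    c = ab - h
    return a * b / (b + c) + hr


def linear_weights(line, step=1e-3):
    """Central-difference estimate of the run value of each event."""
    weights = {}
    for name, delta in EVENTS.items():
        up = [x + step * d for x, d in zip(line, delta)]
        down = [x - step * d for x, d in zip(line, delta)]
        change = base_runs_simple(*up) - base_runs_simple(*down)
        weights[name] = change / (2 * step)
    return weights


# Hypothetical team-season line (AB, H, TB, HR, W), made up for illustration.
line = (5500, 1450, 2300, 160, 550)
for name, value in linear_weights(line).items():
    print(f"{name:>8}: {value:+.3f}")
```

Feeding in a stat line from a different run environment yields a different set of weights, which is exactly what a single fixed linear weights table cannot offer.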

In short, BaseRuns is a dynamic estimator with what we perceive as the correct factors necessary for real baseball run scoring.

The Disadvantages

While BaseRuns is the best estimator we have at the moment, it is by no means perfect. One standing issue is its estimate of stranded runners in an inning. As mentioned on the wiki, BaseRuns can overestimate the number of runners stranded, putting it past three even though three is the most that can be stranded in an inning. Also described in the wiki, BaseRuns can overestimate runs scored in an OBP range of .500-.800, as can be seen in this chart. Even so, its estimates there are leagues closer to reality than what any one linear weights model or RC would give.

Finally, there is still an issue with applying BaseRuns to individual players. The model was designed to run on teams, so applying it directly to individuals is a misuse of the equation. Applied to hitters, it has the same problem as RC: one hitter’s on-base events (OBP) and advancement events (SLG) do not interact with each other, but rather with the on-base and advancement events of his teammates. However, just as with RC, a theoretical team model can fix this. BaseRuns is also acceptable for pitchers, as long as the result is read as runs allowed by the pitcher and the defense playing behind him.

Conclusions

BaseRuns stands as the best model we currently have available to estimate the impact of events in terms of runs. It is dynamic, unlike linear models, which do not model baseball reality. In its basic structure, it is a better model of actual run scoring than Runs Created or other dynamic models. Because of its accuracy, its applications are wide-ranging. The only thing left to do is find a better estimation of baserunner scoring rate, but the basic B/(B+C) is good enough to fit very well in all sorts of run environments. While linear weights are still useful for individual players, BaseRuns can help contribute to the system’s accuracy by helping determine the weights themselves. At a team level, there simply is no reason to use anything other than BaseRuns to estimate run production.
