
The Empirical Vanilla Test in MTG

You have probably heard of the vanilla test – it’s a framework, used mainly in Limited card evaluation, against which to compare a creature’s baseline power/toughness relative to its casting cost. Well, turns out it’s possible to create an “Empirical Vanilla Test,” based on factors such as P/T, release date, some keyword abilities and a few other aspects of a card. I’ll explain how it works, how to apply the test, and share a web app that makes it easy for you to apply the Empirical Vanilla Test yourself.

What is a card worth?

The act of building a Magic deck essentially consists of considering a large number of options and repeatedly asking “What is this card worth?” in a particular environment. A long tradition of strategy articles has developed approaches to answering that question, from the exchange of cards for life to a “system for describing every sort of resource (force) in the game completely independent of context.” With experience, every Magic player learns to evaluate cards relative to one another, in terms of card advantage, tempo, value for cost, etc.

There are a number of different approaches one could take to quantify the value of a card. You could, for example, tally the win percentage of decks featuring at least one copy of the card. Or you could look at how frequently the card appears in the best decks in a given format, its win rate in Limited, or how often deck builders have chosen to include it in their decks.

But what if there was a team of experienced Magic players, maybe even former Pro Tour competitors, who would actually playtest each card, evaluate its traits and abilities and use their collective experience and judgment to assign a value to each card? We could use those assigned values to make inferences about the relative strengths of the traits and abilities!

Well, this team exists: it’s called Magic R&D, and they assign casting costs.

How can we learn from this? Let’s start with the idea of a baseline reference.

Consider Stonework Puma versus Millennial Gargoyle:

[Card images: Stonework Puma, Millennial Gargoyle]

The Puma is a colorless pure vanilla 2/2; the Gargoyle is the same, but with flying. The difference in mana costs implies that R&D deems flying to be worth an additional {1}.

Now, consider a series of Llanowar Elves variants:

  • Leaf Gilder has the same single-mana ability as Llanowar Elves, but greater power, at 2/1. Its mana cost implies that +1 power on a creature is worth {1} additional mana to cast.
  • Wirewood Elf and Druid of the Cowl again tap for {G}, but have greater toughness, at 1/2 and 1/3, and both cost {1}{G}, from which we might infer that {1} should buy you somewhere between +0/+1 and +0/+2.
  • Druid of the Anima and Urborg Elf both imply that, ceteris paribus, the added flexibility afforded by optionally tapping for two other colors of mana is worth {1}{G} – {G} = {1}.
  • And so forth, with a litany of other abilities on creatures that differ from Llanowar Elves in just one way.
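The pairwise reasoning above can be sketched in a few lines of Python. This is only an illustration: the `Card` tuple and the `implied_cost` helper are hypothetical names, and the mana ability shared by both Elves is deliberately left out of the keyword sets because it does not differ between them.

```python
from typing import NamedTuple


class Card(NamedTuple):
    name: str
    mana_value: int
    power: int
    toughness: int
    keywords: frozenset


def implied_cost(base: Card, variant: Card) -> dict:
    """Describe what differs between two cards and the MV gap between them."""
    return {
        "power": variant.power - base.power,
        "toughness": variant.toughness - base.toughness,
        "keywords": sorted(variant.keywords - base.keywords),
        "mv_diff": variant.mana_value - base.mana_value,
    }


puma = Card("Stonework Puma", 3, 2, 2, frozenset())
gargoyle = Card("Millennial Gargoyle", 4, 2, 2, frozenset({"flying"}))
elves = Card("Llanowar Elves", 1, 1, 1, frozenset())
gilder = Card("Leaf Gilder", 2, 2, 1, frozenset())

print(implied_cost(puma, gargoyle))  # flying is the only difference
print(implied_cost(elves, gilder))   # +1 power is the only difference
```

Each comparison isolates a single difference, which is why the one-mana gap can be attributed to that difference alone.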

Some of these cost-implied values seem inconsistent. For example, there are a number of “bears” in white for two mana value, but also a lot of French vanilla bears with lifelink in white for two mana value. R&D is limited to integer casting costs in black-border Magic, and they may not be internally consistent over time, deliberately or accidentally, so some costs will be too high, and some too low. But imagine moving outside of the realm of Elf Druids to consider the entire universe of creatures that Wizards of the Coast has ever printed. Just as in our Llanowar Elves examples, there are many cases in which two cards, which differ in only one small way, will have been assigned different casting costs. By aggregating across many such small differences, we can “solve” this system of equations to derive implied costs that are, on average, correct.

The price of power (and toughness)

Let’s start by looking at the cost – or implied value – of one additional point of power or toughness. To do this, I want to start by focusing on creatures – specifically, a subset of French vanilla creatures, limited to what I’ll call “Coniferous Keywords.” Basically, Coniferous Keywords – as distinct from Evergreen or Deciduous Keywords – are those which appear on a reasonable number of cards and have no “knobs,” meaning that every time you see the keyword, it means exactly the same thing.¹ So, for example, ward, which always comes with a cost, and scry, which always comes with a number of cards to scry, are not Coniferous.

So, from the population of all Coniferous French vanilla creatures, let’s start with the most canonically basic, Grizzly Bears. There are 34 vanilla 2/2s in all of Magic, and their average mana value is 2.21.

What about creatures with one greater power than these Grizzlies? There are 17 such 3/2 vanillas, with an average MV of 2.94.

I assert that this implies that Magic R&D has determined, through extensive playtesting and observation of many, many games over the years, that the difference between 3/2 and 2/2 is 2.94 – 2.21 = 0.73 MV.

Let’s take a slightly more exotic example; that of the 2/2 French vanilla flyer, typified by Wind Drake. There are 11 such creatures, with an average MV of 2.91. If we took this set of creatures, incremented their toughness by one but changed nothing else, our comparison set would consist of six cards, with an average MV of 3.17. Again, I am going to assume that this suggests that the difference in value between a 2/3 flyer and a 2/2 flyer is 3.17 – 2.91 = 0.26.

Now, what if we consider the entire gamut of French vanilla creatures? The 4/3 vanillas, the vigilant 2/2s, the 1/1s with deathtouch, the 4/8 with flying, double strike, vigilance, trample and indestructible, etc.

If we match each card to all of the cards with identical abilities, but with +1/+0 or +0/+1, and aggregate the average MVs of both the lower base P/T and the higher base P/T, we get something like this:

This plot shows, for example, that across all French vanilla creatures in this matched set, the average MV of those with power 1 and any toughness is 1.54, and the average MV of otherwise similar creatures, but power 2, is 2.51. As you can see, the cost to buff power tends to be higher than for toughness (as indicated by greater arrow lengths).²

If we calculate a weighted average of all of these differences, we find that:

  • One additional point of power costs/is worth 0.78 MV.
  • One additional point of toughness costs/is worth 0.53 MV.
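Here is a minimal sketch of that matching-and-averaging step. The card rows are invented toy data, and weighting each matched pair by the number of cards involved is my assumption for illustration; the exact weighting behind the figures above is not spelled out.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy rows: (keywords, power, toughness, mana value).
cards = [
    ((), 2, 2, 2), ((), 2, 2, 2), ((), 2, 2, 3),
    ((), 3, 2, 3), ((), 3, 2, 3),
    (("flying",), 2, 2, 3), (("flying",), 3, 2, 4),
]

# Group mana values by identical (keywords, power, toughness).
groups = defaultdict(list)
for kw, p, t, mv in cards:
    groups[(kw, p, t)].append(mv)


def marginal_cost(groups, dp, dt):
    """Weighted average MV gap between each group and its +dp/+dt counterpart."""
    total, weight = 0.0, 0
    for (kw, p, t), mvs in groups.items():
        higher = groups.get((kw, p + dp, t + dt))
        if higher is None:
            continue  # no otherwise-identical card with the bigger body
        w = len(mvs) + len(higher)  # assumed weighting: cards in the pair
        total += w * (mean(higher) - mean(mvs))
        weight += w
    return total / weight if weight else float("nan")


power_cost = marginal_cost(groups, 1, 0)
print(round(power_cost, 2))
```

Calling `marginal_cost(groups, 0, 1)` gives the toughness counterpart; on this toy data it returns NaN because no matched +0/+1 pairs exist.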

Of course, you can do this for keywords, as well. For example, the average cost of a French vanilla X/X flyer is just about 0.75 MV greater than that of an otherwise identical X/X creature without flying. I’ll spare you the graph, though, because…

We have to go deeper…

You’ve probably been thinking: what about multicolored cards? Or those with multiple colored pips in their costs? Or cards printed at higher rarities? For all of these relevant factors, we can take advantage of a commonly-used statistical technique that lets us infer marginal costs from marginal differences, called linear regression.

For those who don’t know, linear regression is a statistical model that identifies relationships between independent variables (or “predictors”, or “features”) and a single dependent (or “response”) variable. I won’t really try to explain it here, but the basic idea is this: Every one-unit increase of [A THING] is associated with an X-unit increase in mana value. A one-unit increase in flying is going from not having flying to having flying. A one-unit increase in power is going from 0 to 1, or 1 to 2, etc.
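To make the “one-unit increase” idea concrete, here is a small pure-Python least-squares sketch. Everything in it is synthetic: the `TRUE` coefficients are made-up numbers used only to generate noise-free data, and `fit_ols` is a hypothetical helper, not the code used for this article (a statistics package would normally do this job).

```python
def fit_ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * yi for x, yi in zip(X, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Made-up coefficients: intercept, per power, per toughness, flying.
TRUE = (0.5, 0.7, 0.5, 0.75)
# Toy creatures as (power, toughness, has_flying).
rows = [(1, 1, 0), (2, 2, 0), (3, 2, 0), (2, 3, 0), (2, 2, 1), (4, 4, 1), (1, 3, 1)]
y = [TRUE[0] + TRUE[1] * p + TRUE[2] * t + TRUE[3] * f for p, t, f in rows]

coefs = fit_ols(rows, y)
print([round(c, 3) for c in coefs])
```

Because the toy mana values contain no noise, the fit recovers the generating coefficients; with real cards, each coefficient is instead the average change in MV associated with one more unit of that feature, holding the others fixed.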

This model includes most of the obvious factors that go into costing. First and most obviously, the model includes power and toughness, as well as indicators for all of the coniferous keywords. I also include quite a bit of information about the specific colored mana cost requirements, like how many {W}, {U}, {B}, {R} and {G} are required, hybrid mana and the number of distinct colors in the cost. I also include rarity indicators, and a time trend – measured in years since Alpha – under the theory that WotC has “made creatures better over the years.”

So, for example, Serra Angel would be represented as a 4/4 with indicators for flying and vigilance, two {W} in her cost, one color, zero hybrid, originally printed zero years after Alpha, at uncommon. Plumeveil, most recently printed at uncommon 25.3 years after Alpha, is a 4/4 with indicators for flash, defender and flying, costing three hybrid mana across two colors. The way I’ve specified the model, Plumeveil is listed as costing 1.5 {W} and 1.5 {U}. Repeat this process for every Coniferous French vanilla creature that has been printed, and line that matrix up with a column vector of their mana values, and you have a system of equations, which linear regression can then fit.
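As a rough sketch, the two encodings just described might look like this in Python. The column names, the `encode` helper and the short `KEYWORDS` tuple are all hypothetical; the real model has an indicator for every Coniferous keyword.

```python
# Illustrative subset of keyword indicators (assumption, not the full list).
KEYWORDS = ("flash", "defender", "flying", "vigilance")
RARITIES = ("uncommon", "rare", "mythic")  # common is the baseline


def encode(power, toughness, keywords, pips, hybrid, years_since_alpha, rarity):
    """Turn one creature into a flat feature dict (one row of the matrix)."""
    row = {
        "power": power,
        "toughness": toughness,
        "hybrid": hybrid,
        "years_since_alpha": years_since_alpha,
        "colors": sum(1 for n in pips.values() if n > 0),
    }
    for color in "WUBRG":
        row[color] = pips.get(color, 0)
    for kw in KEYWORDS:
        row[kw] = int(kw in keywords)
    for r in RARITIES:
        row[r] = int(rarity == r)
    return row


serra_angel = encode(4, 4, {"flying", "vigilance"}, {"W": 2}, 0, 0.0, "uncommon")
# Three hybrid {W/U} pips are split evenly between the two colors.
plumeveil = encode(4, 4, {"flash", "defender", "flying"},
                   {"W": 1.5, "U": 1.5}, 3, 25.3, "uncommon")

print(serra_angel)
print(plumeveil)
```

Stacking one such row per creature gives the predictor matrix, with the cards’ mana values as the response column.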

But what does it mean?

Instead of printing a table of numeric regression coefficients, I have chosen to plot them, along with their standard errors, for ease of comparison. I’ll run through each variable, from the top-down:

  • The intercept in this model is an imaginary 0/0 pure vanilla creature with no specific colored mana costs, a concept which doesn’t really make much sense, but serves as a baseline from which all other variables can deviate. Such a card, according to this model, would have an MV of 0.59. I’m not sure of the best way to translate this value into Magic terms, but I think it might be something like “what a card is worth.”
  • For each year that has elapsed since the printing of Alpha, modeled MVs have decreased by an average of 0.023. One implication of this is that a functional reprint of an Alpha card, printed today, would cost 0.7 MV less than the original.
  • Costs:
    • Hybrid: Adding hybrid costs, while holding total color costs constant, tends to increase MV, because hybrid mana is easier to produce than a specific {M}.
    • Colors: Increasing the number of distinct colors required to cast a creature tends to decrease MV, because it is harder to produce multiple colors.
    • As the number of pips in a creature’s cost increases, {G} >> {W} > {B} ≈ {R} ≈ {U} in terms of making creature spells cheaper, which aligns fairly well with each color’s strength in creatures. This reflects the fact that the same creature with equivalent {M} costs across colors would not be a balanced cycle.
  • Rarity: Common is the baseline, and cards at higher rarities are generally stronger (i.e. better for their MV), which is reflected in this model. Given the MV of a card printed at common, the same card at uncommon would cost 0.28 less, rare would be 0.69 less and mythic rare would be 1.17 less. This one is a little hard to wrap one’s mind around, but suffice to say, it should be the case that the coefficients for each rarity are ordinally 0 > uncommon > rare > mythic.
  • Keywords: Does the order of the Coniferous keywords look right to you? Some of these were hard for me to generate predictions about, a priori. For example, is lifelink more valuable/costly than trample, on average? On the other hand, some keywords are strictly better than others, and for those, the model returns coefficients that reflect that. For example:
    • Cost-reducing mechanics, like delve and convoke, each add substantially to a card’s MV, for obvious reasons.
    • Undying > persist, though both have a substantial impact on MV. They are similar mechanics, in that they bring a creature back from the graveyard, but undying is better (apart from combo potential) because it increases P/T.
    • Double strike >> first strike. Double strike is one of the most valuable keywords, according to the model, adding almost 1.5 MV on average. First strike is worth a bit less than half that, at 0.6.
    • Hexproof > shroud. Hexproof “replaced” shroud, apparently because many players played shroud as though it were hexproof, but hexproof is clearly just shroud without the downside, and thus it should cost more. Much more, according to the model. As noted above, I don’t count ward as Coniferous, because it is scalable, but one would expect ward’s implied mana value to be less than hexproof.
    • Flying > reach. Reach is just flying without the evasion, and thus should cost less, although the model finds a relatively minor difference.
    • Flying > vigilance, which is in line with Mark Rosewater’s assertion in DtW #715.
    • Fear ≈ intimidate. Fear is very close to being “intimidate for black,” and so I would expect the two keywords to be costed similarly. Indeed, their confidence intervals overlap almost entirely.
    • Defender < 0. Defender is the only downside mechanic on this list. Consider the set of all creatures with defender and power > 0. Wouldn’t you pay about {1} more to cast them without defender?
    • The relatively low inferred value for prowess is interesting, as Rosewater has noted in several places that prowess seems weak or underwhelming to many people when they are first exposed to it, but that its strength becomes apparent after several plays. If this model is to be believed, R&D seems to cost prowess as though it is relatively weak. I suspect its low cost is due to the fact that you need to do something else to activate it.
    • Most of the other inequalities aren’t clear to me, absent a model like this. Is flash better than vigilance? Depends on the card, the deck, the opponent, and the context, but the model suggests that on average, we’d expect a card with Flash to be costed higher than an identical creature with vigilance.
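The rarity and time-trend coefficients quoted in the list above are simple additive offsets, which a short sketch can make concrete. The function name and the 30-year gap are assumptions for illustration; the coefficient values are the ones stated above.

```python
# Offsets relative to a common printed in the Alpha year (values from the text).
RARITY_OFFSET = {"common": 0.0, "uncommon": -0.28, "rare": -0.69, "mythic": -1.17}
YEAR_TREND = -0.023  # change in modeled MV per year since Alpha


def adjust_xmv(xmv_common_at_alpha, rarity, years_since_alpha):
    """Shift a baseline expected MV for rarity and printing date."""
    return xmv_common_at_alpha + RARITY_OFFSET[rarity] + YEAR_TREND * years_since_alpha


baseline = 4.0  # hypothetical xMV of some common printed in Alpha
reprint_shift = adjust_xmv(baseline, "common", 30) - baseline
mythic_shift = adjust_xmv(baseline, "mythic", 0) - baseline

print(round(reprint_shift, 2))  # roughly the 0.7 MV discount mentioned above
print(round(mythic_shift, 2))
```

Because the model is linear, these shifts are the same whatever the baseline card is.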

Using the model

The model coefficients are interesting, and generally line up with my expectations. But beyond just getting an idea of the implied value of different abilities, there are many ways the model can be used.

For example, let’s consider a relatively new Coniferous French vanilla creature, Fleetfoot Dancer:

Intercept = 0.59
{R} + {G} + {W} = -0.22 – 0.60 – 0.34 = -1.16
3 colors × -0.07 = -0.21
Rare = -0.69
Trample + lifelink + haste = 0.35 + 0.39 + 0.49 = 1.23
4 power × 0.68 = 2.72
4 toughness × 0.53 = 2.12
28.7 years since Alpha × -0.023 = -0.66
Total expected mana value (xMV) = 3.94

So, this model suggests that, at the time and rarity it was printed, Fleetfoot Dancer was costed almost perfectly as a four-drop.
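The same arithmetic can be written as a dot product of coefficients and features. This is a sketch using only the coefficients quoted in the Fleetfoot Dancer example; the dictionary keys are hypothetical names, and the per-color coefficient is taken as -0.07, the value that reproduces the listed -0.21 contribution for three colors.

```python
# Coefficients quoted in the worked example (subset of the full model).
COEF = {
    "intercept": 0.59,
    "R": -0.22, "G": -0.60, "W": -0.34,
    "colors": -0.07,  # assumption: implied by the -0.21 total for 3 colors
    "rare": -0.69,
    "trample": 0.35, "lifelink": 0.39, "haste": 0.49,
    "power": 0.68, "toughness": 0.53,
    "years_since_alpha": -0.023,
}


def expected_mv(features):
    """Intercept plus the sum of coefficient times feature value."""
    return COEF["intercept"] + sum(COEF[name] * value for name, value in features.items())


fleetfoot_dancer = {
    "R": 1, "G": 1, "W": 1, "colors": 3, "rare": 1,
    "trample": 1, "lifelink": 1, "haste": 1,
    "power": 4, "toughness": 4, "years_since_alpha": 28.7,
}

xmv = expected_mv(fleetfoot_dancer)
print(round(xmv, 2))  # 3.94
```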

Some more examples

We can apply the same process to every creature, evaluating its cost on the basis of the variables included in the model. Here, for example, are the most over- or under-costed Coniferous French vanilla creatures, at the time they were first printed:

name P/T xMV MV diff
Merfolk of the Depths 4/2 3.6 6 2.4
Autochthon Wurm 9/14 12.9 15 2.1
Jedit Ojanen 5/5 5.3 7 1.7
Ramirez DePietro 4/3 4.3 6 1.7
Hawkeater Moth 1/2 2.4 4 1.6
Clinging Anemones 1/4 2.4 4 1.6
Caravan Hurda 1/5 3.6 5 1.4
Dawnstrike Paladin 2/4 3.6 5 1.4
Zephid 3/4 4.6 6 1.4
Trapjaw Kelpie 3/3 4.6 6 1.4
Kasimir the Lone Wolf 5/3 4.6 6 1.4
Tangle Spider 3/4 4.6 6 1.4
Steeple Roc 3/1 3.6 5 1.4
Plumeveil 4/4 4.7 3 -1.7
Force of Savagery 8/0 4.7 3 -1.7
Zetalpa, Primal Dawn 4/8 9.7 8 -1.7
Risen Sanctuary 8/8 8.8 7 -1.8
Goliath Sphinx 8/7 8.9 7 -1.9
Phyrexian Walker 0/3 2.1 0 -2.1
Yargle, Glutton of Urborg 9/3 7.1 5 -2.1
Ornithopter 0/2 2.4 0 -2.4
Kalonian Behemoth 9/9 9.6 7 -2.6
Fusion Elemental 8/8 7.6 5 -2.6
Eldrazi Devastator 8/9 10.7 8 -2.7
Gigantosaurus 10/10 8.4 5 -3.4
Impervious Greatwurm 16/16 19.7 10 -9.7

xMV is the “expected mana value,” based on the model, while diff is the actual mana value minus xMV. Positive differences mean that the model underpredicts the actual cost, i.e. that the card “fails the vanilla test.” As a sanity check, note that three of the notoriously expensive Legends legends (Jedit Ojanen, Ramirez DePietro and Kasimir the Lone Wolf) are on the overcosted side of this list.
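For readers who want to reproduce the diff column, here is a minimal sketch using four rows copied from the table above. The tuple layout and variable names are assumptions.

```python
# (name, xMV, actual MV), copied from the table above.
rows = [
    ("Merfolk of the Depths", 3.6, 6),
    ("Plumeveil", 4.7, 3),
    ("Gigantosaurus", 8.4, 5),
    ("Jedit Ojanen", 5.3, 7),
]

# diff = actual MV - expected MV; positive means overcosted per the model.
ranked = sorted(
    ((name, round(mv - xmv, 1)) for name, xmv, mv in rows),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, diff in ranked:
    verdict = "overcosted" if diff > 0 else "undercosted"
    print(f"{name}: {diff:+.1f} ({verdict})")
```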

And here are predictions for some frequently-reprinted creatures:

name P/T xMV MV diff
Serra Angel 4/4 5.5 5 -0.5
Llanowar Elves 1/1 1.1 1 -0.1
Air Elemental 4/4 5.4 5 -0.4
Birds of Paradise 0/1 0.6 1 0.4
Shivan Dragon 5/5 6.2 6 -0.2
Giant Spider 2/4 4.0 4 -0.0
Sengir Vampire 4/4 5.4 5 -0.4
Gravedigger 2/2 2.4 4 1.6
Mahamoti Djinn 5/6 6.8 6 -0.8
Bog Wraith 3/3 3.7 4 0.3
Drudge Skeletons 1/1 1.5 2 0.5
Sakura-Tribe Elder 1/1 0.9 2 1.1
Grizzly Bears 2/2 2.4 2 -0.4
Hill Giant 3/3 3.9 4 0.1
Juggernaut 5/3 5.3 4 -1.3
Merfolk of the Pearl Trident 1/1 1.5 1 -0.5
Ornithopter 0/2 2.4 0 -2.4
Solemn Simulacrum 2/2 2.1 4 1.9
Acidic Slime 2/2 1.7 5 3.3
Dragon Whelp 2/3 3.5 4 0.5
Mulldrifter 2/2 3.1 5 1.9
Prodigal Sorcerer 1/1 1.5 3 1.5
Royal Assassin 1/1 0.6 3 2.4

The model applies straightforwardly to Serra Angel, as its keywords are all part of the model: it was a slightly undercosted beater when it first appeared in Alpha. A 1/1 for {G} was almost exactly the going rate nearly 30 years ago when Llanowar Elves was first printed, and the mana-producing ability was obviously huge upside. The same goes for Shivan Dragon’s firebreathing and Sengir Vampire’s “hunger” ability.

Gravedigger offers a good example of how to think about implied mana value for creatures with more than just Coniferous abilities. Even 25 years ago, a vanilla 2/2 was only worth 2.4 xMV — this implies that R&D valued the regrowth effect at about 1.6 MV. Similarly, Juggernaut’s “attack each combat if able” is a downside worth about -1.3 MV, and Mulldrifter’s card draw and flexibility are worth about 1.9!

Of course, these examples show two things: If you would gladly pay more than 1.6 MV to bring a creature back from your graveyard, or more than 1.9 MV for two cards, then Gravedigger and Mulldrifter are very good. And if you can find a way to ameliorate Juggernaut’s downside, then it’s just undercosted.
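This reasoning amounts to comparing your own valuation of an ability with the residual the model leaves unexplained. A small sketch, with hypothetical function names and with xMV and MV figures taken from the table of frequently reprinted creatures:

```python
def implied_ability_value(mv, xmv):
    """MV left unexplained by the body and Coniferous keywords."""
    return round(mv - xmv, 1)


def is_bargain(my_valuation, mv, xmv):
    """True if you value the extra ability at more than R&D charged for it."""
    return my_valuation > implied_ability_value(mv, xmv)


gravedigger = implied_ability_value(4, 2.4)  # recursion effect
juggernaut = implied_ability_value(4, 5.3)   # forced-attack downside
mulldrifter = implied_ability_value(5, 3.1)  # two cards plus flexibility

print(gravedigger, juggernaut, mulldrifter)
# Suppose you would pay 2 mana to return a creature card to your hand:
print(is_bargain(2.0, 4, 2.4))
```

The 2.0 valuation is just an example input; the point is that the model supplies the price and the player supplies the worth.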

For the most extreme examples of creature abilities that make the vanilla test largely irrelevant, consider this list of most extreme model “misses” from the entire pool of creatures — not just the Coniferous French vanilla subset on which the model was fit:

name P/T xMV MV diff
Death’s Shadow 13/13 15.0 1 -14.0
Phyrexian Dreadnought 12/12 14.8 1 -13.8
Impervious Greatwurm 16/16 19.7 10 -9.7
Arixmethes, Slumbering Isle 12/12 12.9 4 -8.9
Eater of Days 9/8 11.2 4 -7.2
Leveler 10/10 11.8 5 -6.8
Jokulmorder 12/12 13.8 7 -6.8
Daemogoth Titan 11/10 10.6 4 -6.6
Phyrexian Soulgorger 8/8 9.3 3 -6.3
Etched Monstrosity 10/10 11.2 5 -6.2
Desecration Elemental 8/8 10.0 4 -6.0
Hunted Horror 7/7 7.9 2 -5.9
Hogaak, Arisen Necropolis 8/8 12.5 7 -5.5
Inferno Project 0/0 -0.7 7 7.7
Realm Seekers 0/0 -1.8 6 7.8
Shadow of Mortality 7/7 7.2 15 7.8
Spike Hatcher 0/0 -0.8 7 7.8
Gigantomancer 1/1 0.1 8 7.9
Progenitor Mimic 0/0 -1.9 6 7.9
Wiitigo 0/0 -2.0 6 8.0
Phantom Nishoba 0/0 -1.0 7 8.0
Ignition Team 0/0 -1.1 7 8.1
Arcbound Overseer 0/0 -0.3 8 8.3
Towering Titan 0/0 -2.4 6 8.4
Naya Soulbeast 0/0 -1.5 8 9.5
Sekki, Seasons’ Guide 0/0 -2.2 8 10.2

Death’s Shadow and Shadow of Mortality offer two interesting counterexamples: Death’s Shadow is ostensibly undercosted as a 13/13 for {B}, but in practice it’s never actually a 13/13, and really can’t be played early in the game. Shadow of Mortality is hugely overcosted as a 7/7 for {13}{B}{B}, but gets cheaper and cheaper as your life drains away.

This list reads like a set of challenges – to Stifle/Torpor Orb your way around Phyrexian Dreadnought’s downside, to run Towering Titan in your Doran, the Siege Tower Commander deck or to otherwise find a way to avoid the downsides and maximize the upsides, relative to casting cost.

The Empirical Vanilla Test Calculator

Finally, I’d like to share a little app I built that will let you do these calculations yourself!

The Empirical Vanilla Test Calculator (Check it out!)

When you open the app, it defaults to a colorless vanilla 2/2, released on today’s date. From there, you can modify any model-relevant attributes, like release date, colored and hybrid casting cost requirements, rarity, abilities, and power/toughness.

Here, for example, is Adult Gold Dragon:

 

And here is Plumeveil:

 

Evidently, the model “thinks” that Adult Gold Dragon is a bit overcosted (at rare – once it’s in your hand and on equal footing with your other cards, it’s reasonably efficient), while Plumeveil is a relative bargain.

Why don’t you try it yourself? Perhaps do a “Vanilla Test” for Jewel Thief, and see why, at common and with a free Treasure token, it was such a strong Limited card. Or, invent a card of your own, within reason,³ and see what the model thinks is a good starting point for costing.

 

 

 

Thoughts? Questions? Critiques? Suggestions? Leave a comment below or talk to me on Twitter @MtG_DS. I’d love to hear what you think about the strengths and weaknesses of the Empirical Vanilla Test, and whether you have ideas to improve it! Spend some time playing with the app – do the predictions generally seem reasonable? How do you think this could be useful for you?

1 I made an exception for protection, which could be “protection from” a variety of different things.

2 Don’t read too much into the values at toughness = 6 or toughness = 7. In the former case, a relatively large number of relevant cards are zero-power Walls. In the latter case, we have three different walls being compared to just a single 8-toughness wall, Wall of Stone. Nevertheless, both are included in the global average.

3 The model does a lot of things well, but does not do well with hypothetical cards that “bend the laws of Magic.” It will produce predictions for cards with 20 white pips, or negative toughness, or all of the available keywords checked, but those are not guaranteed to be realistic.

More generally, the model doesn’t account for interactions between components of a card. For example, on a creature with deathtouch, additional points of power aren’t worth as much as they are on a creature with trample or double strike. This interaction between power and keywords is not captured, although predictions are still reasonably accurate in those cases. Similarly, it doesn’t “know” that menace and defender are almost never going to be on the same card, or that flying (shroud) essentially makes reach (hexproof) redundant, so it’s not going to give sensible predictions in those situations.

If you have any questions about this, please ask me!

5 thoughts on “The Empirical Vanilla Test in MTG”

  1. It’s a cool project, though, as you acknowledge, with its limitations, which are themselves likely interesting and insightful.

    I think the biggest one you didn’t mention is that the value of one mana is non-linear. That is to say the difference between 1 and 2 mana is not the same as the difference between 6 and 7. At the same time, the value of power and toughness should decrease at higher values (simplistically, 7 and 8 power each kill a player in 3 hits).

    I’m actually surprised the lines in your first graph are so straight, but that may be because they cut off above about 5, which is about where I think these non-linearities kick in, and your sample sizes for high values will be smaller.

    This is revealed by the fact that most of your most undercosted cards have >=7 as a mana cost (and though good, they’re not broken). The same can go for the other outlying unusual costs, like the Gigantosaurus (the cost of 5 green symbols isn’t a linear function of comparing 1 and 2).

    You can also see this if you use the empirical vanilla test on actual vanilla creatures. A green 2/2 common would cost 1.7 today (seems reasonable), a green 3/3 common would cost 2.9 (yes, Jewel Thief is busted), a green 6/6 common would cost 6.5 (someone should tell Colossal Dreadmaw it’s broken).

    This is something I’ve seen discussed by Maro, who has at various points explained how the number of turns it takes before you’re expected to reach >5 lands increases by an increasing amount for each additional land (assuming no ramp; I think Frank Karsten may have done some relevant models too). That said, this wasn’t realized by early designers (recall the wurms of yore), but means that high mana value creatures have been increasingly pushed over time.

    What I’m curious about is whether there is some transformation or other way to account for this non-linearity (although sample sizes are often going to be smaller here and thus harder to fit, since modern high-cost cards may be less likely to have only coniferous keywords).

    Also, I’d be interested in approaches to this data that use subsets for which we do have a large sample size to test more specific hypotheses about interactions. (for example, has the cost of flying changed over time?)

    Hope you don’t mind the nerding out, you caught a fellow data scientist’s interest.

  2. This article is just superb (and the articles you linked are sending me on an MtG analysis rabbit hole). The rigour and care put into this is great. I’d be curious to see how potent of a (presumable) discount Legendary grants to creatures. Cheers, mate.
