Hashing Tokens: The What & The Why
As folks have become more familiar with the mechanisms & underlying design components of our token-based authentication model, one question has come up with regularity: what's up with the hashing? And... fair enough. For someone interpolating the ontological goals of token auth, it's entirely reasonable to find some initial befuddlement at this step of hashing an already entropy-rich stream of non-semiotic ("random," but that's not a word you'll hear us make use of, around here, absent the requisite "scare quote" buffer) content. In short: what's up with that?
Since I'm the one who insisted on the inclusion of hashing in the token auth model, the rest of our team has tasked me with providing a concise explanation of the what & the why of it. That's somewhat optimistic, as "concise" and I are not often terribly well-acquainted. Still, in this brief note I will do my best to provide some backstory without bogging things down in (my characteristically) needlessly detail-centric style of exposition. No promises; quoting from "Kill Bill: Volume II" - I promised to do my best, nothing more....
There's heaps of well-written, well-footnoted tutorials out there that do an excellent job of explaining the guts of hashing theory, and the various mathematical methods employed in commonly-used modern hash algorithms. I am neither qualified to replicate those efforts, nor convinced any value would be gained in my attempts to do so. Instead, I'll skim over the relevance of hashing to our token-based security model.
Hash functions - let us simply call them "hashes" for now, although their output is also often called a "hash," so we'll try to keep the two distinct in our usage - are just one-way transformations of a given input. To say they are one-way is to indicate that, for someone who has the full details of the algorithm being used to do the hashing (there's no "security through obscurity" here - although, as Jaron Lanier points out, in biology we call such an approach "biodiversity" and tout its robust abilities as a security technique, interestingly enough), it's trivially easy to compute the output of the hash. But to go in the "other direction" - to take a given output of a known hash function, and "go back to" the input - is mathematically difficult. At the extreme, we can argue that some hash functions make such an effort - reversing the hash - "computationally intractable," if not formally NP-complete in the level of computational effort required.
Short version: hash functions take something and make it into something else - that something else, in turn, can't easily be turned back around into the original input. It's a one-way transform.
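To make the one-way property concrete, here's a minimal sketch in Python. The choice of SHA-256 is purely illustrative - nothing above specifies which algorithm the token model actually uses - and the token value is a made-up placeholder:

```python
import hashlib

def hash_token(token: str) -> str:
    # One-way transform: input goes in, a fixed-length digest comes out.
    # Computing this direction is trivial; recovering `token` from the
    # digest is the part that's computationally intractable.
    return hashlib.sha256(token.encode("utf-8")).hexdigest()

digest = hash_token("example-token-0001")  # hypothetical token value
print(digest)  # 64 hex characters, deterministic for the same input
```

The same input always yields the same digest - which is what lets a hash stand in for a token as a credential - while nothing in the digest lets you walk back to the token.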
There's some other goodies involved in how hash functions work, such as minimisation of "collisions" - in particular, engineered collisions. Which is to say that two distinct inputs shouldn't produce the same output, save in exceptionally rare circumstances or corner-state events. That's not really a feature of hashing that's crucial to our work in the token auth model, so I won't belabor it further here (even though it's fascinating, and an engineered hash collision was at the root of the Flame malware's - Stuxnet's cousin - unprecedented approach to subverting Windows Update as part of its infection vector; but I digress, as usual).
Why Hash Tokens? Threat Models 101
Ok, so a hash takes something and makes it into something else which can't easily be "reversed" into the original. Good deal. What is the benefit of doing this to tokens, and then using the output hashed value as the "key" that enables connections to the network? To understand that, we are best served by stepping back to consider the threat modelling methodology chosen by our team, and the way decisions under that methodology come into existence.
To vastly oversimplify, there's two different ways of thinking about threats to security (as always, false dichotomies hinder understanding as much as they help it, but for now we'll go with it even so). A "theoretical" approach considers a given scenario and then iteratively imagines any conceivable way that any attacker might find a weakness in the system. This is powerful, of course, but always incomplete: the enumeration of avenues of attack is only as complete as the imaginative or experiential basis on which the defenders base their assumptions. Think of the Maginot Line: the defenders failed to consider (in any practical sense) the expedient of simply going around the Line. Their imaginative frame included all manner of frontal assaults, and they built a massively strong defensive bulwark... but they didn't consider all possible attack vectors. Indeed, it's probably true that considering _all_ attack vectors is theoretically impossible (a simple extension of Gödel's Incompleteness Theorems, if not Turing's halting-problem results as well). This approach to research, the "theoretical" approach, often produces breakthroughs in understanding attack models (one thinks of fuzzing, or "heap spraying" attack models) in a slow, steady drip... but it really doesn't cover the full terrain in doing so.
In (somewhat artificial) contrast is the "prioritised threat vector" approach to security modeling (I've made that term up whole-cloth, and thus hope not to have trod on the work of other, better-qualified thinkers in this area). In this approach, one focuses on real-world "failure models" for security: what actually goes wrong, in documented, in-the-wild examples of attacks on a given type of system. Of course, we can only do this if we have available a substantive data set of known attack vectors and their relative frequency; otherwise, we'd simply be shooting in the dark. But the power of this "prioritised" approach is that we tend to invest time & resources in attacks that are known to work, known to be used, and used by known adversaries. It's not perfect, definitionally so; but it helps to cut off the most severe risks & to ensure that resources are deployed where they are most likely to be useful. It can be argued that this approach is innately closer to a method of "minimising known attack surfaces" than is the theoretical approach: carve down those vulnerable attack surfaces as far as possible, and then what's left - the undefended space - can perhaps be approached via a more theoretical approach.
Indeed, all such efforts are - or should be - syncretic. This we take as axiomatic, even as we conjure dichotomies from thin air.
What's this got to do with hashing? Here goes: each exitnode is a world unto itself. It stands independent of all others, and has no "central node" to which it looks. Given that, a known attack vector against the network itself is to seek escalated user privileges on a given, specific exitnode (to "root the box," in the vernacular). This is not theoretical; it happens all the time, in the wild. Naturally, boxes are hardened against such attacks... but it is simply not true to say that such hardening is sufficient to ensure a box is secure. Hardening makes escalation attacks progressively more difficult/expensive for a well-resourced attacker... but not impossible. Ever.
The Dreaded 'Backtrace'
Ah, yes: "backtracing." A term of derision taken from n00b-speak, it's become a word that slips into use in security researchers' own writing. So be it. Let us consider the attacker who seeks, as is a common goal, to figure out who is visiting a given web resource from the data obtained by the webserver as part of that visit. The attacker checks the IP... and finds it leads to one of our network exitnodes. D'oh. That complicates things. What does a persistent attacker do?
As per above, how about if the attacker roots that box? Then what does the scenario look like? Well, if we used raw tokens as network auth credentials, the attacker would have a list of tokens and could watch which token visited which website... including her own website. That's not good, because - in theory - that attacker might herself be reselling tokens and thus have some information about the purchasers of specific tokens. Or let's say that attacker somehow has a canonical list of tokens and purchase data from some other, magical source. In that case, rooting a box provides a direct link between tokens and purchase data. That is a serious break in the security model.
In contrast, let us consider the case in which an attacker roots an exitnode, but in fact only has a list of hashed tokens visible on that box. There's no tokens on any production exitnodes - that's actually true in our deployed model. So the attacker knows that hashed-token-value xxx has connected to her website. What then? To reverse that back to something even theoretically useful - a token, which might possibly be then connected to purchase data - that hash must be reversed. As per above, that is not feasible.
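As a sketch of what that decoupling looks like in practice - with hypothetical token values, and SHA-256 standing in for whatever hash the deployed model actually uses - the exitnode only ever holds and compares hashes:

```python
import hashlib

def hash_token(token: str) -> str:
    return hashlib.sha256(token.encode("utf-8")).hexdigest()

# Client side: the raw token stays with the member; only its hash is
# used as the network credential.
raw_token = "token-alpha-0001"            # hypothetical token value
credential = hash_token(raw_token)

# Exitnode side: the box is provisioned (off-node, at minting time)
# with hashed values only. Rooting the box exposes this set and nothing
# more -- no raw tokens, nothing to link back to purchase data.
valid_credentials = {
    hash_token("token-alpha-0001"),       # stored as hashes on the node
    hash_token("token-beta-0002"),
}

def node_accepts(presented: str) -> bool:
    return presented in valid_credentials

print(node_accepts(credential))                     # valid member: accepted
print(node_accepts(hash_token("token-evil-9999")))  # unknown token: rejected
```

An attacker who roots this box learns only `valid_credentials` - a set of digests - and must reverse the hash to get anywhere, which is exactly the infeasible step.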
This is the layer of decoupling that is added by hashing the token value. It's one step, amongst many, that seeks to make this sort of backtracing expensive, difficult, time-consuming, and unlikely to succeed. It doesn't simply "assume" that exitnodes won't be rooted; rather, it iteratively models that scenario and protects against it from multiple angles. Of course, we work to keep our nodes from being rooted - but to put our full faith in that is a form of the "fail open" fallacy that's so deeply destructive in security-intensive contexts.
Hashing tokens is done to help move the token auth model itself towards "fail closed" behavior, even if an exitnode is rooted without our team's knowledge.
There's multiple limitations to this added decoupling step, as there almost always are in such matters. An attacker with a full list of all minted tokens (obtained somehow, magickally) could then compute a "rainbow table" of hashed outputs & reverse the hash that way. A known limitation, which we have buffered against by minting... a lot of tokens. We'll leave the number obscure, but say that it is arbitrarily large, in the Leibniz-calculus sense of the term. Without that master list of tokens from which to generate a rainbow table, building a generic lookup table across the consideration space of all possible hash values (something on the order of 2^50, give or take a few thousand orders of magnitude) remains computationally intractable.
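Strictly speaking, a rainbow table uses time/space trade-off chains; its degenerate form - a full precomputed lookup table - shows the attack most plainly, and that's what this sketch builds. The master list here is hypothetical, which is exactly the point: it's the one thing the attacker shouldn't be able to obtain.

```python
import hashlib

def hash_token(token: str) -> str:
    return hashlib.sha256(token.encode("utf-8")).hexdigest()

# The attacker's prerequisite: a master list of every minted token.
# (Hypothetical -- and deliberately unobtainable, since the minted
# space is made arbitrarily large.)
master_list = [f"token-{i:04d}" for i in range(1000)]

# Precompute digest -> token for every known token.
lookup = {hash_token(t): t for t in master_list}

# With the table in hand, "reversing" a hash seen on a rooted exitnode
# collapses to a dictionary lookup:
observed_on_node = hash_token("token-0042")
print(lookup.get(observed_on_node))  # recovers "token-0042"
```

Without the master list, the attacker is left enumerating the whole input space, which is where the computational intractability comes back into play.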
An "evil token reseller" could also create a compressed rainbow table covering only the tokens she has sold, and then use it to reverse hashed tokens from a rooted exitnode... but the statistical probability of a matched success - buyer buys from her, buyer connects to a specific node she roots, in the time period during which she's rooted it - becomes asymptotically small as the number of resellers, tokens, and exitnodes grows. The more, the merrier.
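A toy model of that shrinking probability - assuming, purely for illustration, that buyers choose resellers uniformly and sessions land on exitnodes uniformly and independently (neither assumption comes from our actual deployment data) - looks like this:

```python
# Toy model: for the evil reseller's match to succeed, an observed buyer
# must (a) have bought their token from this particular reseller, and
# (b) have their session exit via the one node she has rooted.
def match_probability(num_resellers: int, num_exitnodes: int) -> float:
    p_bought_here = 1.0 / num_resellers  # assumption: uniform reseller choice
    p_this_node = 1.0 / num_exitnodes    # assumption: uniform node selection
    return p_bought_here * p_this_node

# The product shrinks multiplicatively as either count grows:
for resellers, nodes in [(10, 100), (100, 1000), (1000, 10000)]:
    print(resellers, nodes, match_probability(resellers, nodes))
```

The "more, the merrier" effect in miniature: growing either population multiplies down the attacker's odds, without any change to the hashing step itself.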
Which is to say that adding hashing isn't a panacea that protects against all forms of attack against our token model. Obviously not. We do think, on balance, that adding the hashing step produces sufficient real-world security benefits to be worth the hassle of doing it. Well... I think so - and I pushed hashing into the model based on that conclusion. Hopefully, it's not too much of a hassle - although I'll say that the security uptick it engenders is both nonzero and (I argue) nontrivial.
Improvements To The Model?
There's always improvements possible in a real-world security framework, and hashing of tokens is no exception to that. We do think there's some tweaks, and perhaps some big jumps, possible as we expand out the benefits gained by this step in the process. Indeed, I hope others will contribute their analysis and suggestions to this thread, and thus help us do exactly that: continue to improve the model.
I am, infamously so perhaps, enamoured of the power and mathematical elegance of the field of hashing algorithms. They express something about the nature of our universe that I find... fascinating. They also provide serious, real-life benefits in environments where attackers have massive resources (viz., the NSA) and defenders are limited in their resource availability (viz., all the rest of us without .mil email addresses). They are, in short, a form of asymmetric mathematical warfare.
That resonates with me.
As such, we've shoehorned (or "cleverly included," take your pick) hashing into our protection model. Beyond this, I believe - at an intuitive level - that we've only scratched the surface in the ways which hashing can be used to substantively benefit member security. I can almost see the extensions... but they slip away, just in the corner of my eye.
Perhaps someone reading this will see them straight-on, and help my limited mind do likewise!