For as long as I can remember, HTTP cookies have been vilified as a grave threat to the privacy of online browsing. I don't quite agree - but I find the mechanism fascinating for another reason: it's a unique cautionary tale for engineers.
Cookies were devised by Lou Montulli, a Netscape engineer, sometime in 1994. Lou outlined his original design in a minimalistic, four-page proposal posted on netscape.com; based on that specification, the mechanism shipped in Netscape's browser several months later - and other vendors were quick to follow.
It wasn't until 1997 that the first reasonably detailed specification of the mechanism was attempted: RFC 2109. The document captured some of the status quo - but, confusingly, also tried to tweak the design, an effort that proved completely unsuccessful. For example, contrary to what the RFC implies, most browsers do not support multiple comma-delimited NAME=VALUE pairs in a single Set-Cookie header, do not recognize quoted-string cookie values, and do not use max-age to determine cookie lifetime.
Three years later, another, somewhat better structured effort to redesign cookies - RFC 2965 - proved to be equally futile. Meanwhile, browser vendors tweaked or extended the scheme in their own ways: for example, around 2002, Microsoft proposed httponly cookies as a security mechanism to slightly mitigate the impact of cross-site scripting flaws - a concept quickly embraced by the security community.
All these moves led to a peculiar outcome: there is no accurate, official account of cookie behavior in modern browsers, and the two relevant RFCs are out of touch with reality. Implementers and web developers are forced to discover compatible behaviors by trial and error - which makes securing modern web apps an exciting gamble.
In any case - well documented or not, cookies emerged as the canonical solution to the increasingly pressing problem of session management; and as web applications grew more complex and more sensitive, the humble cookie took the world by storm. With it came a flurry of fascinating security flaws.
What's a domain?
Perhaps the most striking issue - and an early sign of trouble - is the problem of domain scoping.
Unlike the more pragmatic approach employed for JavaScript DOM access, cookies can be set for any domain of which the setter is a member - say, foo.example.com is meant to be able to set a cookie for *.example.com. On the other hand, allowing example1.com to set cookies for example2.com is clearly undesirable, as it allows a variety of sneaky attacks: denial of service at the very least - and altering site preferences, modifying carts, or stealing personal data if you pick the target right.
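To make the asymmetry concrete, here is a minimal sketch of what a script on foo.example.com can and cannot do; the host names and cookie name are purely illustrative:

```typescript
// Running on https://foo.example.com/ - widening the scope to the parent domain is
// permitted, so every *.example.com host will receive this cookie:
document.cookie = "prefs=dark; domain=.example.com; path=/";

// Scoping to an unrelated site is not - the browser silently discards this cookie:
document.cookie = "prefs=dark; domain=example2.com; path=/";
```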
To that end, the original Netscape specification provided this elegant but blissfully simplistic advice:
"Only hosts within the specified domain can set a cookie for a domain and domains must have at least two (2) or three (3) periods in them to prevent domains of the form: ".com", ".edu", and "va.us". Any domain that fails within one of the seven special top level domains listed below only require two periods. Any other domain requires at least three. The seven special top level domains are: "COM", "EDU", "NET", "ORG", "GOV", "MIL", and "INT".
There are at least three problems with this scheme - two of which should have been obvious even back in the day:
- Some country-level registrars indeed mirror the top-level hierarchy (e.g., example.co.uk), in which case the three-period rule makes sense; but many others allow direct registrations (e.g., example.fr), or permit both approaches to coexist (say, example.jp and example.co.jp). In the end, the three-period rule managed to break cookies in a significant number of ccTLDs - and consequently, most implementations (Netscape included) largely disregarded the advice. For a while, you could set cookies for *.com.pl, effectively injecting data into the security context of other apps.
- The specification also missed the fact that websites are reachable by means other than their canonical DNS names; in particular, the rule permitted a website at http://1.2.3.4/ to set cookies for *.3.4, or a website at http://example.com.pl./ to set a cookie for *.com.pl. (note the trailing period).
- To add insult to injury, the Internet Assigned Numbers Authority eventually decided to roll out a wide range of new top-level domains, such as .biz, .info, or .jobs - and is now attempting to allow arbitrary gTLD registrations, making any sort of algorithmic approach untenable.
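For illustration only, here is a rough sketch of the period-counting heuristic described above - the function name and structure are mine, not taken from any real implementation - along with the blind spots just discussed:

```typescript
// The seven "special" TLDs need only two periods in the cookie domain; everything
// else needs at least three.
const SPECIAL_TLDS = new Set(["com", "edu", "net", "org", "gov", "mil", "int"]);

function domainAllowedByRule(cookieDomain: string): boolean {
  const periods = (cookieDomain.match(/\./g) ?? []).length;
  const lastLabel = cookieDomain.split(".").filter(Boolean).pop() ?? "";
  return SPECIAL_TLDS.has(lastLabel.toLowerCase()) ? periods >= 2 : periods >= 3;
}

domainAllowedByRule(".example.co.uk"); // true  - the case the rule was written for
domainAllowedByRule(".example.fr");    // false - breaks a perfectly legitimate ccTLD site
domainAllowedByRule(".com.pl.");       // true  - the trailing-period trick slips through
```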
Because of this design, all mainstream browsers had a history of embarrassing bugs in this area - and now ship with giant, frequently updated lists of real-world "public suffix" domains for which cookies should not be set (as well as an array of checks to exclude non-FQDN hostnames, IP addresses, and pathological DNS notations of all sorts).
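By contrast, a highly simplified sketch of the list-driven approach might look like this; the handful of suffixes shown is an illustrative excerpt, while the real list maintained at publicsuffix.org contains thousands of entries plus wildcard and exception rules:

```typescript
// A toy excerpt of a public suffix list - real browsers ship and regularly update a
// far larger data set.
const PUBLIC_SUFFIXES = new Set(["com", "pl", "com.pl", "co.uk", "appspot.com"]);

function cookieDomainAcceptable(domain: string): boolean {
  const d = domain.replace(/^\./, "").replace(/\.$/, "").toLowerCase();
  if (!d.includes(".")) return false;        // reject bare TLDs such as "com"
  if (/^\d+(\.\d+)*$/.test(d)) return false; // reject raw IP addresses and fragments
  return !PUBLIC_SUFFIXES.has(d);            // reject any registered public suffix
}
```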
8K ought to be enough for anybody
To make denial-of-service attacks a bit harder, most web servers limit the size of an HTTP request (sans payload) they are willing to process; these limits are very modest - for example, Apache rejects request headers over 8 kB, while IIS draws the line at 16 kB. This is perfectly fine under normal operating conditions - but the limit can be easily exceeded when the browser attempts to construct a request with a lot of previously set cookies attached.
The specification did not acknowledge this reality, offered no warning to implementers, and proposed no algorithm for discovering or working around such limits. In fact, it mandated minimum cookie-jar capacities well in excess of the limits enforced by HTTP servers:
"In general, user agents' cookie support should have no fixed limits. They should strive to store as many frequently-used cookies as possible. Furthermore, general-use user agents should provide each of the following minimum capabilities [...]:
* at least 300 cookies
* at least 4096 bytes per cookie (as measured by the size of the characters that comprise the cookie non-terminal in the syntax description of the Set-Cookie header)
* at least 20 cookies per unique host or domain name"
As should be apparent, the suggested minimum - 20 cookies of 4096 bytes each - allows HTTP request headers to balloon up to 80 kB.
In theory, this shouldn't matter, because a website can only DoS itself - except that quite a few popular sites rely on user-name.example.com subdomains for content compartmentalization. In such settings, any malicious user can set enough top-level (*.example.com) cookies to prevent visitors from ever being able to access any *.example.com site again.
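Here is a sketch of the attack as it could be mounted from a page under the malicious user's control; the host name is hypothetical, and the counts simply track the RFC minimums rather than any particular browser's limits:

```typescript
// Running on https://malicious-user.example.com/ - stuff the shared *.example.com scope
// with roughly 80 kB of cookie data, far beyond the 8-16 kB of request headers most
// servers accept. Until the cookies expire or are cleared, requests to any *.example.com
// host are rejected by the server with an error such as "400 Bad Request".
for (let i = 0; i < 20; i++) {
  document.cookie =
    `dos${i}=${"x".repeat(4000)}; domain=.example.com; path=/; ` +
    "expires=Fri, 01 Jan 2100 00:00:00 GMT";
}
```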
The only recourse domain owners have in this case is to request that their site be added to the aforementioned public suffix list; there are quite a few entries along these lines there already, including operaunite.com and appspot.com - but this approach obviously does not scale particularly well. The list is also not supported by all browsers, and is not officially mandated for new implementations.
Sending cookies to the farm upstate
In the RFC 2109 paragraph cited earlier, the specification pragmatically acknowledged that implementations will be forced to limit cookie jar sizes - but then provided no guidance on how the jars should be pruned once the maximum size is reached.
This led to an interesting issue: any implementation that enforces something similar to the aforementioned minimums is going to be vulnerable to a cross-domain denial-of-service attack. The attacker may use wildcard DNS entries (a.example.com, b.example.com, ...), or even just a couple of throw-away domains, to exhaust the global limit and have all sensitive cookies purged - logging the user out of the sites they depend on.
The not-so-special flags
Two special types of HTTP cookies are supported by all contemporary web browsers: secure, sent only over HTTPS (protecting the cookie from being leaked to, or tampered with by, rogue proxies); and httponly, exposed only to HTTP servers but not visible to JavaScript (protecting the cookie against cross-site scripting flaws).
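For reference, a minimal sketch of a server handing out such a cookie (Node-style; the session identifier and port are made up, and real frameworks provide higher-level helpers for this):

```typescript
import { createServer } from "node:http";

createServer((_req, res) => {
  // "secure"   - the browser will attach SID only to HTTPS requests;
  // "httponly" - document.cookie will never expose SID to page scripts.
  res.setHeader("Set-Cookie", "SID=0123456789abcdef; secure; httponly; path=/");
  res.end("hello");
}).listen(8080);
```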
Although these ideas appear to be straightforward, the way they were specified implicitly allowed a number of unintended possibilities - all of which plagued web browsers for years. Consider the following questions that were never elaborated on when introducing the flags:
- Should JavaScript be able to set httponly cookies via document.cookie?
- Should non-encrypted pages be able to set secure cookies?
- Should browsers hide jar-stored httponly cookies from APIs offered to plugins such as Flash or Java?
- Should browsers hide httponly Set-Cookie headers in server responses shared with XMLHttpRequest, Flash, or Java?
- Should it be possible to drop httponly or secure cookies by overflowing the "plain" cookie jar in the same domain, then replace them with vanilla lookalikes?
- Should it be possible to drop httponly or secure cookies by setting tons of httponly or secure cookies in other domains?
Each of these scenarios is a potential security risk, allowing cookies to be read or written from contexts other than intuitively expected. Yet, because the spec never dwelled on this, all of these vectors cropped up in various implementations - and some persist to this day.
At first sight, the list may appear inconsequential - but these weaknesses have profound consequences for web application design in certain environments. One striking example is rolling out HTTPS-only services that are intended to withstand rogue, active attackers on open wireless networks: if secure cookies can be injected on easy-to-intercept HTTP pages, it suddenly gets a whole lot harder to keep users safe.
If it tastes good, who cares where it comes from?
Cookies diverge from the JavaScript same-origin model in two fairly important ways:
- domain= scoping is significantly more relaxed than SOP, paying no attention to protocol, port number, or exact host name. This undermines the SOP-derived security model in many compartmentalized applications that also use cookie authentication. The approach also makes it unclear how to handle document.cookie access from non-HTTP URLs - historically leading to quite a few fascinating browser bugs. For example, you could set location.host while on a data: page, which had no effect except for confusing the cookie API.
- path= scoping is considerably stricter than what's offered by SOP - and therefore, it is completely useless as a security boundary, despite appearing to be a security mechanism (see the sketch after this list). Web developers misled by this often mistakenly rely on it for security compartmentalization; heck, even reputable security consultants get it completely wrong.
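To make the second point concrete, here is a small sketch of why path= provides no isolation; the paths are made up, and both documents are assumed to share the same protocol, host, and port:

```typescript
// Running on https://example.com/untrusted/ - a cookie scoped to path=/admin does not
// show up in this document's own document.cookie...
const frame = document.createElement("iframe");
frame.src = "/admin/"; // ...but the same-origin policy ignores paths entirely,
frame.onload = () => {
  // so the framing page can freely reach into the same-origin frame and read the
  // path-scoped cookies visible to it:
  console.log(frame.contentDocument?.cookie);
};
document.body.appendChild(frame);
```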
On top of this somewhat odd scoping scheme, scoping conflicts are essentially ignored in the specification; every cookie is identified by a name-domain-path tuple, allowing identically named but differently scoped cookies to coexist and apply to the same request, but without giving the server any way to differentiate between the duplicates.
This omission adds another interesting twist to the httponly and secure cookie cases; consider these two cookies:
Set on https://www.example.com/:
FOO=legitimate_value; secure; domain=www.example.com; path=/

Set on http://www.example.com/:
FOO=injected_over_http; domain=.example.com; path=/
The two cookies are considered distinct and are allowed to coexist. The server will receive both FOO values in a single Cookie header, their ordering browser-dependent and essentially unpredictable. Which one should be trusted? Well, roll the dice. But that's not all: in Microsoft Internet Explorer, the cookies will be stored separately, yet their values will be merged together in a single HTTP field.
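A brief sketch of the server-side consequence; the parser below is a naive illustration rather than any particular framework's code:

```typescript
// A naive Cookie header parser: later duplicates silently overwrite earlier ones,
// so whichever FOO the browser happened to send last wins the trust decision.
function parseCookieHeader(header: string): Map<string, string> {
  const jar = new Map<string, string>();
  for (const pair of header.split(";")) {
    const eq = pair.indexOf("=");
    if (eq < 0) continue;
    jar.set(pair.slice(0, eq).trim(), pair.slice(eq + 1).trim());
  }
  return jar;
}

parseCookieHeader("FOO=legitimate_value; FOO=injected_over_http").get("FOO");
// -> "injected_over_http" - and nothing in the header reveals which value was set
//    with "secure" over HTTPS and which was injected over plain HTTP.
```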
Character set murder mystery
The HTTP/1.0 RFC technically allowed high-bit characters in HTTP headers without further qualification; the HTTP/1.1 RFC later disallowed them. Neither document provided any guidance on how such characters should be handled when encountered: should they be rejected, transcoded to 7-bit, treated as ISO-8859-1, treated as UTF-8, or handled in some other way?
The specification for cookies further aggravated this problem, cryptically stating:
"If there is a need to place such data in the name or value, some encoding method such as URL style %XX encoding is recommended, though no encoding is defined or required."
In the end, you technically can use high-bit characters, and there's no sanctioned alternative - but their behavior may be unpredictable.
To wit, I have a two-year-old bug open with Mozilla (bug 418394): Firefox has a tendency to mangle high-bit values in HTTP cookies, permitting cookie separators (";") to suddenly materialize in place of UTF-8 in the middle of an otherwise sanitized cookie value; this has led to more than one web application vulnerability to date.
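The pragmatic workaround is the one the spec vaguely alludes to: keep cookie values strictly ASCII by %XX-encoding them on write and decoding them on read. A minimal sketch, with an arbitrary cookie name:

```typescript
const name = "greeting";

// Write: encodeURIComponent keeps the stored value 7-bit ASCII, so the browser never
// gets a chance to mangle high-bit characters or conjure up stray ";" separators.
document.cookie = `${name}=${encodeURIComponent("żółć")}; path=/`;

// Read: a simplistic lookup that assumes no identically named, differently scoped duplicates.
const raw = document.cookie
  .split("; ")
  .find((pair) => pair.startsWith(`${name}=`))
  ?.slice(name.length + 1);
const value = raw === undefined ? undefined : decodeURIComponent(raw);
```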
A session is forever
For some reason, presumably privacy concerns, the specification decided to distinguish between session cookies, meant to be non-persistent; and cookies with a specified expiration date, which may persist across browser sessions, are stored on disk, and may be subject to additional client-enforced restrictions, such as separate "allow" / "block" settings, or the P3P policy statements required by MSIE. On the topic of the longevity of the former class of cookies, the RFC says:
"Each session is relatively short-lived."
There was no mechanism to enforce this assertion - and today, the situation is largely reversed: with the emergence of portable computers with suspend functionality, and the increasing shift toward web-oriented computing, users tend to keep browsers open for weeks or months at a time. Session cookies may also be saved and then restored across auto-updates or software crashes, allowing them to persist for weeks, months, or years.
When session cookies routinely outlive many tokens with definite expiration dates, yet are still treated as the more secure and less privacy-invasive alternative, we obviously have a problem. We probably need to rethink the concept - and either ditch session cookies altogether, or impose reasonable no-use time limits after which such cookies are evicted from the cookie jar.
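One possible shape of such a no-use limit - purely hypothetical, with the names and the one-week figure chosen for illustration - might be:

```typescript
interface SessionCookie {
  value: string;
  lastSentAt: number; // timestamp of the last request this cookie was attached to
}

const MAX_IDLE_MS = 7 * 24 * 60 * 60 * 1000; // evict after a week of disuse

// Run by the browser whenever it assembles outgoing Cookie headers: any session cookie
// that has not been used recently is dropped, even if the browser was never closed.
function pruneIdleSessionCookies(jar: Map<string, SessionCookie>, now = Date.now()): void {
  for (const [key, cookie] of jar) {
    if (now - cookie.lastSentAt > MAX_IDLE_MS) jar.delete(key);
  }
}
```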
Closing words
I find it quite captivating to see the number of subtle problems caused by such a simple and seemingly harmless scheme. It is also unfortunate how poorly documented and fragile the design remains some 15 years later - and that the introduction of well-intentioned security mechanisms, such as httponly, only added to the misery. An IETF effort to document and clarify some of the security-critical aspects of the mechanism is only now underway - but it won't be able to fix them all.