How many valid JSON strings are there?
A valid JSON string is a string conforming to the simple grammar described at json.org.
In this scenario a string is a finite sequence of Unicode characters. Unicode provides us with an alphabet of some 1,114,112 possible characters, which we number in hexadecimal from U+0000 to U+10FFFF inclusive.
These characters originate in Unicode, but aside from serving as a basic lookup table — character U+0020 is a space, for example, and x is U+0078 — Unicode itself is not actually massively important to this question. The JSON grammar doesn't say anything about the use or misuse of reserved Unicode characters, the proper matching of surrogate pairs or anything like that. This makes our job a lot easier.
The various possible binary encodings of a Unicode string, such as UTF-8, are also not considered here. String length will be measured in characters, not in bytes. This also simplifies things.
There are 1,114,112N possible strings of length N ≥ 0, but not all of these are valid JSON. For example, there is only one string of length N = 0, '', but this is not valid JSON. There are 1,114,112 strings of length N = 1, only 10 of which are valid JSON: the single digits, '0' to '9'.
Some valid JSON strings for N = 4 would be 'null', 'true', '1234', '"ab"' and '[{}]'. There are many others. There are 10 × 10 × 10 × 10 = 10,000 strings which take the form of four digits but only 9,000 of these are valid JSON; JSON allows a lone '0', but disallows leading zeroes, such as in '0123'. "Stringified strings", of the form '"ab"' — double quote, character, character, double quote — allow over a million possibilities for each of the two characters, resulting in over a trillion possible combinations. These make up the vast majority of the possibilities when N ≥ 3.
JSON strings may also feature whitespace. There are four permissible whitespace characters. This means that there are some 16 possible strings of the form ' {} ' — that is, whitespace, left brace, right brace, whitespace — and 1,024 of the form ' [ { } ] ', and so on. We will therefore draw a distinction between a JSON value and a JSON element:
- A value always starts and ends with a non-whitespace character.
'true','[ ]'and'123.4567e+89'are all values. - An element may (or may not) be padded at the beginning and end with whitespace. All values are elements. Examples of elements which are not values are
'19 'and' {} '. It's the enumeration of all possible elements which we care about.
(Later, we'll find out what happens if we disallow whitespace.)
Table of values
| N | Number of valid JSON strings of length N |
|---|---|
| 0 | 0 |
| 1 | 10 |
| 2 | 183 |
| 3 | 1,116,690 |
| 4 | 1,241,178,737,201 |
| 5 | 1,382,769,886,828,373,987 |
| 6 | 1,540,513,509,987,425,647,994,799 |
| 7 | 1,716,252,210,190,833,344,214,413,058,975 |
| 8 | 1,912,038,829,837,307,338,535,532,996,384,125,503 |
| 9 | 2,130,160,395,481,217,702,782,516,195,462,638,598,742,492 |
| 10 | 2,373,164,833,092,220,366,519,238,536,719,956,333,503,580,977,677 |
A table of values up to N = 100 is here.
I've gone as far as N = 350 working locally. The calculation gets progressively slower as the numbers involved rise into the thousands of digits; the only real constraint on how far we can go is patience.
Specifics of the enumeration process
'null', 'true' and 'false' are all fixed-length tokens which are very easy to check for.
As mentioned, there are four possible whitespace characters. These are the space (U+0020), tab (U+0009), carriage return (U+000D) and line feed (U+000A). If there are N characters of whitespace, then we multiply the number of possibilities by 4N.
JSON numbers have several different formats. I found that the least confusing approach was to enumerate numbers of the form 123, 123.456, 123e456 and 123.456e789 separately. There are six different permitted exponent symbols: E, e, E+, e+, E- and e-. As these have differing lengths, the enumeration becomes more complicated. Leading zeroes are not permitted, which makes the enumeration more complicated again. That said, all of these can still be enumerated using fairly simple closed form expressions.
The fact that a JSON consumer would almost certainly interpret the expressions 1e9 and 1E9 identically doesn't affect our calculation. These are distinct strings so we count them separately. The same applies to apparently identical numbers such as 1e+9, 10E8, 1000000000.000000000 and so on.
The possibility of a leading unary minus sign doesn't make any of this significantly more complicated. JSON does permit '-0' but again, we don't care whether or not a JSON consumer would interpret this differently from '0'.
JSON strings — that is, stringified strings, the results you get from doing something like JSON.stringify('abc'), which results in the five-character string '"abc"' — are more complicated. The string "interior", between the two double quotes, consists of three different types of character:
Valid single characters are those from U+0020 to U+10FFFF inclusive, except for the double quote
"and the backslash\. There are 1,114,078 of these.There are also 8 simple backslash escape sequences. Each is 2 characters long, consisting of a backslash
\followed by one of eight possible characters:",/,\,b,f,n,rort. For example,\n. What these escape sequences actually mean/do is actually not important for this piece of work.Lastly there are hex escapes. These are 6 characters long, consisting of a backslash
\, auand 4 hexadecimal digits. For example,\u0020. As the hexadecimal digits are not case-sensitive, there are not 16 but 22 possible hexadecimal digits, and not 164 = 65,536 but 224 = 234,256 possible hex escape sequences. Again, these create distinct strings so we count them all, even if a consumer would almost certainly treat\uffffand\uFFFFidentically.
Putting it all together, we develop a fairly simple Fibonacci-esque recurrence relation for the number of possible string interiors of length N:
Or...
Memoization makes this pretty fast to compute. The number of operations is O(N), though of course the time taken for each operation is not really constant.
I imagine that there is a closed-form solution possible here, akin to the one for the Fibonacci sequence but six to ten times more ghastly and complicated. Unlike the Fibonacci sequence, there is no closed-form solution for this recurrence relation; the polynomial in question is too complicated.
As for those other formats...
The smallest array is of course '[]' (N = 2) and the smallest object is '{}' (N = 2). The smallest non-empty array is of the form [0] (N = 3) and the smallest non-empty object is of the form '{"":0}' (N = 6) — there are 10 of each of these.
We do not care about duplicate keys in objects — JSON does not either. '{"":1,"":2}' (N = 11) is valid JSON. Exactly how this should be interpreted, or even whether this is an error, is unspecified. The merit of this is debated but for this specific piece of work this is good because it would make the job unbelievably harder otherwise. (For an example of how omissions in the JSON specification can be exploited to increase parsing and serialisation performance, check out my fastjson project.)
Beyond this point, I think the simplest explanation of what I did is just the code itself...
Code
At present I do not have code to actually generate the distinct JSON strings in question. This would be doable but I do note that as N increases past 3 the task of generating them all becomes deeply impractical. Some option for restricting the space of allowable single string characters might be useful here.
Observations
N = 0
There are 0 valid JSON strings of length 0. JSON.parse('') throws an exception.
N = 1
There are 10 valid JSON strings of length 1. These are the single digits '0' to '9'.
Note that none of these elements have padding — they are all values as well as being elements.
N = 2
There are 183 valid JSON strings of length 2. 103 of these are values, which is to say they have no whitespace padding:
'{}','[]'and'""'(3)- Numbers
'10'to'99'(90) - Numbers
'-0'to'-9'(10)
The remaining 80 are the 10 values of length 1 padded with one space on either side. Let s represent a whitespace character:
'0s'through'9s'(10 × 4 = 40)'s0'through's9'(4 × 10 = 40)
N = 3
There are 1,116,690 valid JSON strings of length 3. 1,304 of these are smaller strings padded with space:
-
The 10 values of length 1, padded to length 3:
'Xss'(10 × 4 × 4 = 160)'sXs'(4 × 10 × 4 = 160)'ssX'(4 × 4 × 10 = 160)
-
The 103 values of length 2, padded to length 3:
'XXs'(103 × 4 = 412)'sXX'(4 × 103 = 412)
The remaining 1,115,386 are values of length 3:
- 1,114,078 strings of the form
'"x"' - 900 numbers of the form
'123'(again, remember leading zeroes are not permitted) - 100 numbers of the form
'1.2' - 200 numbers of the form
'1e2'or'1E2' - 90 numbers of the form
'-12' - 10 arrays of the form
'[4]' - 4 arrays of the form
'[ ]'(remember, there are four possible whitespace characters) - 4 objects of the form
'{ }'
N = 4
There are 1,241,178,737,201 valid JSON strings of length 4. 8,930,592 of these are smaller strings padded with space:
-
The 10 values of length 1, padded to length 4:
'Xsss'(10 × 4 × 4 × 4 = 640)'sXss'(4 × 10 × 4 × 4 = 640)'ssXs'(4 × 4 × 10 × 4 = 640)'sssX'(4 × 4 × 4 × 10 = 640)
-
The 103 values of length 2, padded to length 4:
'XXss'(103 × 4 × 4 = 1,648)'sXXs'(4 × 103 × 4 = 1,648)'ssXX'(4 × 4 × 103 = 1,648)
-
The 1,115,386 values of length 3, padded to length 4:
'XXXs'(1,115,386 × 4 = 4,461,544)'sXXX'(4 × 1,115,386 = 4,461,544)
The remaining 1,241,169,806,609 are values of length 4:
-
1,241,169,790,092 are "stringified strings", of which:
- 1,114,0782 = 1,241,169,790,084 are of the form
'"ab"'whereaandbare each single characters - 8 are 2-character escapes of the form
'"\\n"'
- 1,114,0782 = 1,241,169,790,084 are of the form
-
16,517 are not strings:
-
16,300 are numbers:
- 9,000 numbers of the form
'1234' - 1,900 numbers of the form
'1.23'or'12.3' - 3,800 numbers of the form
'1e23','1E23','12e3'or'12E3' - 400 numbers of the form
'1e+2','1E+2','1e-2'or'1E-2' - 900 numbers of the form
'-123' - 100 numbers of the form
'-1.2' - 200 numbers of the form
'-1e2'or'-1E2'
- 9,000 numbers of the form
-
199 are arrays:
- 16 empty arrays of the form
'[ ]' - 183 arrays with a single entry, of the form
'[XX]'- see the result for N = 2, above
- 16 empty arrays of the form
- 16 are objects (all empty, of the form
'{ }') - and 1 is
'null'! - and 1 is
'true'!
It should now be clear that the stringified strings are going to dominate these figures going forward.
Higher N
'false', the last fixed token, first appears at N = 5, naturally. This is also where we first encounter arrays with multiple entries such as '[1,2]' and numbers with both a decimal and an exponent such as '1.2E9'.
Our first non-empty objects, of which there are 10 of the form '{"":0}', appear at N = 6. Here we also start seeing numbers with both a decimal and a signed exponent, such as '9.9e-9'.
As N increases, the vast majority of JSON strings take the form of stringified strings and the vast majority of these in turn consist exclusively of single characters (not escape sequences). That is, JSON of the form '"abc..."'. There are 1,114,078N - 2 possible valid JSON strings of this form, easily exceeding all other partially escaped strings, let alone formats such as arrays and objects. In fact, 1,114,078N - 2 is a very reasonable approximation for the number of valid JSON strings of length N ≥ 3.
What if we disallow whitespace?
Well, here's the table:
| N | Number of valid JSON strings of length N, disallowing whitespace |
|---|---|
| 0 | 0 |
| 1 | 10 |
| 2 | 103 |
| 3 | 1,115,378 |
| 4 | 1,241,169,806,497 |
| 5 | 1,382,759,957,416,341,979 |
| 6 | 1,540,502,447,848,189,871,865,303 |
| 7 | 1,716,239,886,104,877,753,069,071,001,039 |
| 8 | 1,912,025,099,844,274,016,966,438,855,135,598,311 |
| 9 | 2,130,145,099,198,039,027,362,775,288,519,180,372,992,548 |
| 10 | 2,373,147,791,839,649,124,325,020,160,021,655,103,136,553,007,029 |
A table of values up to N = 100 is here.
The approximation of 1,114,078N - 2 remains accurate. The figures for N = 0 and N = 1 are the same as whitespace does not factor into those figures. Beyond this point, we find that there is no distinction between values and elements, and our calculations are a little simpler because we no longer need to count whitespaced elements from prior N:
N = 2
There are 103 valid JSON strings of length 2, disallowing whitespace:
'{}','[]'and'""'(3)- Numbers
'10'to'99'(90) - Numbers
'-0'to'-9'(10)
N = 3
There are 1,115,378 valid JSON strings of length 3, disallowing whitespace:
- 1,114,078 strings of the form
'"x"' - 900 numbers of the form
'123' - 100 numbers of the form
'1.2' - 200 numbers of the form
'1e2'or'1E2' - 90 numbers of the form
'-12' - 10 arrays of the form
'[4]' - 0 arrays of the form
'[ ]'(remember, whitespace is disallowed) - 0 objects of the form
'{ }'
N = 4
There are 1,241,169,806,497 valid JSON strings of length 4, disallowing whitespace:
-
1,241,169,790,092 are "stringified strings", of which:
- 1,114,0782 = 1,241,169,790,084 are of the form
'"ab"'whereaandbare each single characters - 8 are 2-character escapes of the form
'"\\n"'
- 1,114,0782 = 1,241,169,790,084 are of the form
-
16,405 are not strings:
-
16,300 are numbers:
- 9,000 numbers of the form
'1234' - 1,900 numbers of the form
'1.23'or'12.3' - 3,800 numbers of the form
'1e23','1E23','12e3'or'12E3' - 400 numbers of the form
'1e+2','1E+2','1e-2'or'1E-2' - 900 numbers of the form
'-123' - 100 numbers of the form
'-1.2' - 200 numbers of the form
'-1e2'or'-1E2'
- 9,000 numbers of the form
- 103 are arrays with a single entry of the form
'[XX]'- see the result for N = 2, above - 0 are objects (again,
'{ }'is disallowed) - and 1 is
'null'! - and 1 is
'true'!
An incomplete list of things which are not valid JSON
- Leading zeroes e.g.
'0451' - Omitted leading digit on a decimal e.g.
'.123' - Omitted trailing digits after a decimal e.g.
'456.' - Unary plus sign at the start of a number e.g.
'+7800' - Hexadecimal, octal or binary numeric literals such as
'0x10FFFF','0o755'or'0b11001010' - Other JavaScript-style string escape sequences such as
'\\v','\\x45','\\u{10FFFF}'(no, you can't hex-escape characters beyond U+FFFF at all, you just have to put them in the string as literals) - Trailing commas in arrays or objects
Note that for the purposes of this piece of work, most of these omissions make the counting a lot easier.