Anatomy of a PDF File
June 2023
How do you read a PDF file?
This introductory memo walks you through decoding its internal structure, using a very simple "Hello World" file similar to an example in the PDF specification.
Syntax Overview
The PDF specification distinguishes 4 domains :
- Objects : the basic building blocks,
- File structure : how objects are stored and accessed in a file,
- Document structure : how linked objects are interpreted to represent a document,
- Content streams : special objects that describe the appearance of a page.
The following explanations show how reading a PDF file makes use of these domains.
End of File
At the very end of the file sits a %%EOF line.
A few lines above it is a startxref keyword, which hints that something starts here.
In fact, the number immediately following this keyword is the file offset, in bytes, of a structure named the Cross-Reference.
This structure is an index that allows direct access to all parts (objects) and gives an entry point into the root of the document.
Why is the entry point located at the end of the document? This approach allows efficient incremental updates. More on that later.
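As a rough illustration of this first step, the sketch below scans the tail of a file for the last startxref keyword and returns the offset that follows it. The function name, the 1024-byte window and the sample bytes (including the offset 612) are assumptions made for the example, not values taken from the "Hello World" file.

```python
import re


def find_startxref(data: bytes, tail_size: int = 1024) -> int:
    """Return the byte offset written after the last startxref keyword."""
    tail = data[-tail_size:]
    # A file with incremental updates holds several startxref keywords;
    # the last one is the entry point of the newest revision.
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref found near the end of the file")
    return int(matches[-1])


# Hypothetical file ending, for illustration only.
sample = (
    b"...objects...\n"
    b"trailer\n<< /Size 8 /Root 1 0 R >>\n"
    b"startxref\n612\n%%EOF\n"
)
offset = find_startxref(sample)
print(offset)
```

A real reader would then seek to that offset and check whether it finds an xref keyword or an indirect object.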
Cross-Reference Table, and Trailer
When startxref points to an xref keyword,
it means that the Cross-Reference is implemented as a table, immediately followed by a trailer.
A table subsection starts with a line specifying the number of the first object mentioned, and the total number of the objects referenced in the subsection;
then lines of fixed-length strings (20 bytes) that specify the location of each object and its status (in use, or freed).
The subsection lists 8 indirect objects starting at index 0, so object #7 is mentioned on the 8th line and can be found at file offset 526.
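The layout described above can be sketched as follows. Each entry holds a 10-digit offset, a 5-digit generation number and a status letter (n for in use, f for freed). The function name and the two-entry subsection are made up for illustration, and the input is assumed to be already split into lines.

```python
def parse_xref_subsection(lines):
    """Parse one subsection: a 'first count' header followed by count entries."""
    first, count = (int(x) for x in lines[0].split())
    entries = {}
    for i, line in enumerate(lines[1:1 + count]):
        offset, generation, status = line.split()
        entries[first + i] = {
            "offset": int(offset),
            "generation": int(generation),
            "in_use": status == "n",  # 'f' marks a freed object
        }
    return entries


# Made-up subsection: two objects, numbered from 0.
subsection = [
    "0 2",
    "0000000000 65535 f",
    "0000000017 00000 n",
]
table = parse_xref_subsection(subsection)
print(table[1]["offset"])
```

Because every entry has the same length, a reader can also jump straight to the entry of a given object number instead of parsing the whole subsection.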
Indirect Objects
An N G obj line denotes an indirect object, where N is its object number (ID) and G is its generation number.
These indirection properties are an envelope that makes the object addressable. The payload itself is just a "regular" object,
enclosed between the obj and endobj keywords.
In the previous example, indirect object #7 contains a payload that is a dictionary defining 5 key-value pairs. The following section describes most of the object types.
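To make the envelope and payload distinction concrete, here is a minimal sketch that reads an indirect object at a known offset and separates the two. The regular expression is deliberately naive (it would be fooled by a stream whose bytes contain endobj), and the sample object is invented for the example rather than copied from the file discussed here.

```python
import re

# Envelope: "N G obj" ... "endobj"; the payload is whatever sits in between.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)


def read_indirect_object(data: bytes, offset: int):
    """Return (object number, generation number, raw payload bytes)."""
    match = OBJ_RE.match(data, offset)
    if match is None:
        raise ValueError(f"no indirect object at offset {offset}")
    number, generation, payload = match.groups()
    return int(number), int(generation), payload.strip()


# Hypothetical file fragment, for illustration only.
sample_file = b"%PDF-1.4\n7 0 obj\n<< /Type /Font /Subtype /Type1 >>\nendobj\n"
obj_offset = sample_file.index(b"7 0 obj")
print(read_indirect_object(sample_file, obj_offset))
```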
Object Types
There are atomic types :
- Boolean : true or false,
- Integer : for example 800,
- Real : for example -3.14,
- Literal String : characters enclosed in parentheses like (ABC),
- Hexadecimal String : digits enclosed in angle brackets like <414243> (3 ASCII bytes for "ABC"),
- Name : a symbol that begins with a slash like /Something,
- Comment : all characters between a % and the end of the line, like % some comment.
And there are collection types :
- Array : an ordered list of objects of any type (including other collections), written between brackets, like [true 800 (ABC) /Something],
- Dictionary : a map / associative array of unordered key-value pairs; all keys must be names, and the object is enclosed in double angle brackets like << /Key1 (Value1) /Key2 (Value2) >>. Note that the same separator (for example a space or a carriage return) may occur between a key and a value and between distinct pairs: a parser needs to keep a context in order to determine whether the next token is a key or a value.
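The key-or-value context mentioned for dictionaries can be sketched with a simple flag that flips after each token. This assumes the tokens between << and >> have already been split and that every value is a single token (no nested collections, no indirect references); the function name is hypothetical.

```python
def pair_dictionary_tokens(tokens):
    """Turn a flat token list into a dict by alternating key / value."""
    result = {}
    expecting_key = True
    key = None
    for token in tokens:
        if expecting_key:
            if not token.startswith("/"):
                raise ValueError(f"expected a name as key, got {token!r}")
            key = token
        else:
            # A value may itself be a name: only the context tells them apart.
            result[key] = token
        expecting_key = not expecting_key
    if not expecting_key:
        raise ValueError("dictionary ends with a key that has no value")
    return result


pairs = pair_dictionary_tokens(["/Key1", "(Value1)", "/Key2", "/Name2"])
print(pairs)
```

Here /Name2 is stored as a value even though it looks exactly like a key, which is the reason the parser cannot rely on the token shape alone.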
And there is a composite type for content :
- Stream : a dictionary immediately followed by a sequence of bytes enclosed between the stream and endstream keywords. It typically conveys either a sequence of commands that write content on a page or a blob used by such commands (font file, image).
Last but not least :
- Indirect reference : an ordered sequence of an object number, a generation number, and the R keyword, which references an indirect object, like 7 0 R for object #7 in its generation 0. This sequence is not enclosed in delimiters (unlike an array), so special care is needed when parsing it in order to group tokens correctly. For example, the array [3 0 R 4 0 R 5 0 R] does not begin with 2 integers and does not contain 9 items: it contains 3 indirect references to objects #3, #4 and #5.
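One way to handle the grouping described for indirect references is to look ahead two tokens whenever an integer is met. The sketch below works on an already tokenized array body; the function name and the ("ref", number, generation) tuple shape are choices made for this example.

```python
def group_references(tokens):
    """Collapse every 'N G R' triple into a single ('ref', N, G) item."""
    items = []
    i = 0
    while i < len(tokens):
        is_reference = (
            i + 2 < len(tokens)
            and tokens[i].isdigit()
            and tokens[i + 1].isdigit()
            and tokens[i + 2] == "R"
        )
        if is_reference:
            items.append(("ref", int(tokens[i]), int(tokens[i + 1])))
            i += 3
        else:
            items.append(tokens[i])
            i += 1
    return items


# Body of the array [3 0 R 4 0 R 5 0 R], split on whitespace.
refs = group_references("3 0 R 4 0 R 5 0 R".split())
print(len(refs))
```

Plain integers that are not followed by a generation number and R are left untouched, so an array such as [1 2 3] still yields three items.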
Filters
In this example the stream content is made of plain ASCII characters:
But very often a filter modifies the byte sequence. A filter may compress the data or encode it, and several may be chained to form a pipeline.
For example, a stream whose dictionary contains /Filter [/ASCII85Decode /FlateDecode] (besides the mandatory /Length attribute)
should first be decoded from ASCII base-85 into binary and then decompressed with the deflate algorithm.
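The filter pipeline can be sketched with the Python standard library: base64.a85decode for the base-85 step and zlib.decompress for the deflate step. The helper names, the DECODERS table and the sample content are assumptions for this example; the sample is built by encoding a short byte string in the reverse order so that the round trip can be checked.

```python
import base64
import zlib


def ascii85_decode(data: bytes) -> bytes:
    data = data.strip()
    # PDF marks the end of base-85 data with "~>"; drop it before decoding.
    if data.endswith(b"~>"):
        data = data[:-2]
    return base64.a85decode(data)


def flate_decode(data: bytes) -> bytes:
    return zlib.decompress(data)


DECODERS = {"/ASCII85Decode": ascii85_decode, "/FlateDecode": flate_decode}


def decode_stream(raw: bytes, filters) -> bytes:
    """Apply the filters in the order they are listed in /Filter."""
    for name in filters:
        raw = DECODERS[name](raw)
    return raw


# Illustrative payload: compress first, then armor in base-85.
original = b"BT /F1 24 Tf (Hello World) Tj ET"
encoded = base64.a85encode(zlib.compress(original)) + b"~>"
decoded = decode_stream(encoded, ["/ASCII85Decode", "/FlateDecode"])
print(decoded == original)
```

Note that a writer applies the filters in the reverse order of the array, which is why the array reads naturally from the decoder's point of view.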
Cross-Reference Stream
The most common type of Cross-Reference, as explained above, is a table. But since PDF 1.5 a cross-reference may be encoded as a Stream object:
- the dictionary is defined with /Type /XRef and contains the same /Root attribute that occurs in a trailer,
- and the stream content contains a structure specifying the location of indirect objects.
This mechanism adds a feature that was not possible with Cross-Reference tables, where all objects are accessed by their file offset in bytes: an indirect object may be located inside another indirect object. In that case the terminology says that the container is an Object Stream that contains compressed objects.
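As a hedged sketch of what that structure looks like once the stream has been decoded: in the PDF specification each entry is a fixed-size binary record of three big-endian fields, whose byte widths are given by a /W array in the stream dictionary (a detail this memo does not cover). The function name, the output field names and the sample bytes below are assumptions; the sketch also assumes a single run of objects numbered from first.

```python
def parse_xref_stream_entries(data: bytes, widths, first: int = 0):
    """Decode fixed-size binary entries; widths is the /W array, e.g. [1, 2, 1]."""
    entry_size = sum(widths)
    entries = {}
    for n, start in enumerate(range(0, len(data) - entry_size + 1, entry_size)):
        fields = []
        pos = start
        for width in widths:
            fields.append(int.from_bytes(data[pos:pos + width], "big"))
            pos += width
        kind, a, b = fields
        if widths[0] == 0:
            kind = 1  # an absent type field defaults to type 1
        if kind == 0:
            entry = {"type": "free", "next_free": a, "generation": b}
        elif kind == 1:
            entry = {"type": "in_use", "offset": a, "generation": b}
        else:
            # Type 2: the object lives inside an Object Stream.
            entry = {"type": "compressed", "container": a, "index": b}
        entries[first + n] = entry
    return entries


# Made-up decoded content: three entries of 1 + 2 + 1 bytes each.
raw_entries = bytes([0, 0, 0, 255, 1, 0, 17, 0, 2, 0, 5, 1])
xref = parse_xref_stream_entries(raw_entries, [1, 2, 1])
print(xref[2])
```

The third entry illustrates the new feature: object #2 has no file offset of its own, only the number of its container and its position inside it.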
Document Structure
The /Root attribute of the Trailer or Cross-Reference Stream indicates the reference of the /Catalog indirect object:
The Catalog object starts a tree of nested Pages (plural) objects. This hierarchy leads to Page (singular) objects.
A Page has dimensions (/MediaBox), content and associated resources like fonts.
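The traversal from the Catalog down to the Page objects can be sketched as a small recursive walk. Here the indirect objects are modeled as a Python dict keyed by (object number, generation), which is an assumption of this example; the /Pages and /Kids keys come from the PDF specification rather than from this memo, and the object numbers are made up.

```python
def iter_pages(objects, ref):
    """Yield (reference, dictionary) for every Page below the given node."""
    node = objects[ref]
    if node["/Type"] == "/Pages":
        # Intermediate node: descend into each child in order.
        for kid in node["/Kids"]:
            yield from iter_pages(objects, kid)
    else:
        yield ref, node


# Hypothetical, already parsed objects.
objects = {
    (1, 0): {"/Type": "/Catalog", "/Pages": (2, 0)},
    (2, 0): {"/Type": "/Pages", "/Kids": [(3, 0), (4, 0)], "/Count": 3},
    (3, 0): {"/Type": "/Page", "/MediaBox": [0, 0, 612, 792]},
    (4, 0): {"/Type": "/Pages", "/Kids": [(5, 0), (6, 0)], "/Count": 2},
    (5, 0): {"/Type": "/Page", "/MediaBox": [0, 0, 612, 792]},
    (6, 0): {"/Type": "/Page", "/MediaBox": [0, 0, 612, 792]},
}
root = (1, 0)  # would come from the /Root attribute
pages = list(iter_pages(objects, objects[root]["/Pages"]))
print([ref for ref, _ in pages])
```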
Incremental Updates
It is possible to build a new revision of a document without writing a whole new file: changes are appended to the original file.
Changes consist of new or modified objects, a Cross-Reference, and a startxref that points to it.
The Cross-Reference (either its trailer or its stream dictionary) contains a /Prev attribute that links the new revision to the previous Cross-Reference.
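A reader can rebuild the current state of the document by following the /Prev chain from the newest Cross-Reference and keeping the first location seen for each object. The sketch below models each Cross-Reference as an already parsed dict indexed by its file offset; that representation, the function name and the offsets are assumptions made for the example.

```python
def resolve_revisions(sections, startxref):
    """Merge Cross-Reference sections, newest first, following /Prev."""
    merged = {}
    visited = set()
    offset = startxref
    while offset is not None:
        if offset in visited:
            raise ValueError("/Prev chain loops back on itself")
        visited.add(offset)
        section = sections[offset]
        for number, location in section["entries"].items():
            # Keep the newest location: older revisions never override it.
            merged.setdefault(number, location)
        offset = section["trailer"].get("/Prev")
    return merged


# Hypothetical file with one update: object 3 modified, object 4 added.
sections = {
    400: {"entries": {1: 9, 2: 60, 3: 120}, "trailer": {"/Size": 4}},
    900: {"entries": {3: 700, 4: 800}, "trailer": {"/Size": 5, "/Prev": 400}},
}
merged = resolve_revisions(sections, 900)
print(merged[3])
```

The original bytes of object 3 at offset 120 are still present in the file; they are simply no longer referenced by the newest revision.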
Conclusion
This was an overview of the main concepts and syntactic elements. To go further, you can read chapter 7 of the freely available Adobe PDF 1.7 Specification or - if you can access it - the subsequent ISO 32000 Specification that superseded it.