Parse Prometheus Exposition format in Rust using PEG

Mike Taghavi

Let there be a Parser for the Prometheus exposition format in Rust.

I have been thinking lately about a simple service whose purpose is to collect metrics from multiple Prometheus endpoints and combine them into a single exposition. Of course, to do so, the metrics from each endpoint need a namespace, or simply a prefix, as well as optional key-value pairs added between the curly braces. There are already some repositories on GitHub that provide this functionality, implemented mostly in Go using the excellent official Prometheus library, which includes an expression parser.

There are several ways to implement this sort of service from scratch, assuming the official Prometheus library with its excellent parser is not available. It can be as simple as prepending a prefix to each prom line and appending some key-value pairs after the first open curly brace. Using a regex is another obvious alternative, and so is writing a simple tokenizer and a state machine.
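
As a point of reference, here is a minimal sketch of that naive string-manipulation route; the function and its behaviour are my own illustration, not code from any of those repositories:

// Naive approach: prefix the metric name and splice extra label pairs into each line.
// Comment lines and blanks are passed through untouched.
fn rewrite_line(line: &str, prefix: &str, extra: &str) -> String {
    if line.starts_with('#') || line.trim().is_empty() {
        return line.to_string();
    }
    match line.find('{') {
        // Labels already present: insert the extra pairs right after the '{'.
        Some(i) => format!("{}{}{{{},{}", prefix, &line[..i], extra, &line[i + 1..]),
        // No labels yet: add a brace pair between the name and the value.
        None => {
            let mut parts = line.splitn(2, ' ');
            let name = parts.next().unwrap_or("");
            let rest = parts.next().unwrap_or("");
            format!("{}{}{{{}}} {}", prefix, name, extra, rest)
        }
    }
}

Something like rewrite_line(line, "edge_", "instance=\"node1\"") covers the happy path, but the edge cases of the exposition format quickly make a real parser more appealing.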

Although this was appealing, the temptation to sprinkle some Rust into the text editor to get that special crust was extraordinarily high. The idea lingered for a while and had its Maillard reaction on my mind. Its doneness is now on par with how I like my steaks: medium rare.

I decided to use the excellent PEG parser generator pest ( https://pest.rs ) to build the parser. Together we are going to build that beautiful crust from the ground up, parsing prom lines and adding the functionality needed to manipulate them.

The Prometheus exposition format is defined as follows:

metric_name [
"{" label_name "=" `"` label_value `"` { "," label_name "=" `"` label_value `"` } [ "," ] "}"
] value [ timestamp ]

A line always consists of a metric name and a value; it can also carry an optional list of one or more key-value pairs, and an optional timestamp at the end.

Depending on the metric type, there can be either one line representing a single metric primitive or several lines representing a group associated with a particular metric primitive. Histograms, for example, are represented like this:

# A histogram, which has a pretty complex representation in the text format:
# HELP http_request_duration_seconds A histogram of the request duration.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.05"} 24054
http_request_duration_seconds_bucket{le="0.1"} 33444
http_request_duration_seconds_bucket{le="0.2"} 100392
http_request_duration_seconds_bucket{le="0.5"} 129389
http_request_duration_seconds_bucket{le="1"} 133988
http_request_duration_seconds_bucket{le="+Inf"} 144320
http_request_duration_seconds_sum 53423
http_request_duration_seconds_count 144320

This primitive type, along with summaries, requires the most care, as they are the most involved metric expositions available in Prometheus.

According to the documentation ( https://prometheus.io/docs/instrumenting/exposition_formats/ ) each metric group can start with three optional lines:

  • Comments
  • Help text
  • Type information

Observing the metrics generated by Prometheus clients, these lines are almost always included. This has the nice property that each metric group, whether a counter or a histogram, can then be converted into a block representing a particular metric primitive in the AST that the parser generates. By converting groups into blocks, it is easy to detect the kind of metric using the type information and directly parse the particular format that is expected.

A metric group can be represented in a single structure, with the metric type derived from the presence or absence of certain values inside it. This way, the metric key is stored once, and each line is generated from a vector that holds, per line, a vector of key-value pairs together with that line's value and optional timestamp. If a line has no pairs, an empty vector of key-value pairs is simply inserted into the structure.

// Value represents a metric group
#[derive(Debug)]
pub struct Value<'a> {
    pub prefix: Option<String>,
    pub description: Option<Desc<'a>>,
    pub key: String,
    pub pairs: Vec<Vec<(Cow<'a, str>, Cow<'a, str>)>>,
    pub values: Vec<(Cow<'a, str>, Option<Cow<'a, str>>)>,
    pub sum: Option<Segment<'a>>,
    pub count: Option<Segment<'a>>,
}

All metadata is stored in a `Desc` struct, so the complete view looks like this:

use std::borrow::Cow;

// Kind is the Prometheus metric type
#[derive(Debug, Clone)]
pub enum Kind {
    Untyped,
    Counter,
    Gauge,
    Histogram,
    Summary,
}

// Desc contains the metadata of a metric group,
// including comments, type information and
// the metric name defined by the client
#[derive(Debug, Clone)]
pub struct Desc<'a> {
    pub kind: Kind,
    pub name: Cow<'a, str>,
    pub help_desc: Option<Cow<'a, str>>,
    pub comment: Option<Cow<'a, str>>,
}

// Segment contains the optional pairs for the
// '_sum' and '_count' metric lines of a Histogram
// or Summary, with the associated value
#[derive(Default, Debug)]
pub struct Segment<'a> {
    pub value: Cow<'a, str>,
    pub pairs: Vec<(Cow<'a, str>, Cow<'a, str>)>,
}

// Value represents a metric group
#[derive(Debug)]
pub struct Value<'a> {
    // optional prefix to prepend to all lines in a block
    pub prefix: Option<String>,
    pub description: Option<Desc<'a>>,
    pub key: String,
    pub pairs: Vec<Vec<(Cow<'a, str>, Cow<'a, str>)>>,
    pub values: Vec<(Cow<'a, str>, Option<Cow<'a, str>>)>,
    pub sum: Option<Segment<'a>>,
    pub count: Option<Segment<'a>>,
}

With these structs, metric groups can be represented and later regenerated from a `Value`.
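
To make the "regenerated" part concrete, here is a minimal sketch of how lines could be rendered back out of a `Value`, prepending the optional prefix; the `render` function is an illustration of the idea, not the original implementation, and it leaves out the TYPE line and the `_sum`/`_count` segments for brevity:

use std::fmt::Write;

// Sketch only: emit the HELP line plus one exposition line per (pairs, value) entry.
fn render(v: &Value<'_>) -> String {
    let prefix = v.prefix.as_deref().unwrap_or("");
    let mut out = String::new();
    if let Some(desc) = &v.description {
        let _ = writeln!(
            out,
            "# HELP {}{} {}",
            prefix,
            desc.name,
            desc.help_desc.as_deref().unwrap_or("")
        );
    }
    // pairs[i] holds the labels for the line whose value/timestamp sit in values[i].
    for (pairs, (value, ts)) in v.pairs.iter().zip(v.values.iter()) {
        let labels: Vec<String> = pairs
            .iter()
            .map(|(k, val)| format!("{}=\"{}\"", k, val))
            .collect();
        let _ = write!(out, "{}{}", prefix, v.key);
        if !labels.is_empty() {
            let _ = write!(out, "{{{}}}", labels.join(","));
        }
        let _ = write!(out, " {}", value);
        if let Some(ts) = ts {
            let _ = write!(out, " {}", ts);
        }
        out.push('\n');
    }
    out
}

Zipping `pairs` with `values` is exactly what the empty-pairs convention above enables: every rendered line owns one entry in each vector.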

To feed the required information into this struct, a parser is needed. As mentioned, this implementation uses the excellent pest library. To bootstrap the grammar, a set of known tokens and keywords is needed.

This means tokenizing and parsing the metric lines into an AST, and later transforming that AST into the struct that carries the domain knowledge about what the parser produced. The grammar needs to detect strings, strings concatenated by "_", quoted strings, numbers (ints and floats), special keywords such as positive and negative infinity, 'TYPE', 'HELP' and the metric types, and a set of known tokens such as '{', '}', '#', and so on.

The basic keywords and tokens can be defined as follows:

hash = _{"#"}
posInf = {"+Inf"}
negInf = {"-Inf"}
NaN = {"NaN"}
lbrace = _{"{"}
rbrace = _{"}"}
typelit = _{"TYPE"}
helplit = _{"HELP"}
comma = _{","}
countertype = {"counter"}
gaugetype = {"gauge"}
histogramtype = {"histogram"}
summarytype = {"summary"}
untyped = {"untyped"}

These are followed by basic building blocks for parsing strings, numbers and identifiers:

alpha = _{'a'..'z' | 'A'..'Z'}
number = @{
    "-"?
    ~ ("0" | ASCII_NONZERO_DIGIT ~ ASCII_DIGIT*)
    ~ ("." ~ ASCII_DIGIT*)?
    ~ (^"e" ~ ("+" | "-")? ~ ASCII_DIGIT+)?
}
string = ${"\"" ~ inner ~ "\""}
inner = @{char*}
char = {
    !("\"" | "\\") ~ ANY
    | "\\" ~ ("\"" | "\\" | "/" | "b" | "f" | "n" | "r" | "t")
    | "\\" ~ ("u" ~ ASCII_HEX_DIGIT{4})
}
whitespace_or_newline = _{(" " | "\n")*}
ident = {alpha+}

ident can parse: `someidentifier`.

Let’s examine the following metric group:

# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027 1395066363000
http_requests_total{method="post",code="400"} 3 1395066363000

In a first attempt, let's exclude the comments and start with the metric line itself. From the inside out, the simple `ident` rule can be used to build a pair of (key="value"), that is, to parse a single (key, value) tuple from inside the braces.

pair = {ident ~ "="  ~ string}

pair can parse: `key="value"`

All the pairs inside the braces can then simply be defined with this rule:

pairs = {pair ~ (comma ~ pair)*}

pairs can parse `key="value",anotherkey="anothervalue"` and so on.

Now there are enough rules to parse a single metric line, expressed as:

a metric name, optionally followed by braces containing zero or more pairs, followed by a number, positive or negative infinity, or NaN, and finally an optional timestamp.
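
The metric name itself is matched by a `key` rule, one or more `ident`s joined by underscores, which also appears in the complete rule set at the end:

key = @{ident ~ ("_" ~ ident)*}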

promstmt = {key ~ (lbrace ~ (pairs)* ~ rbrace){0,1} ~ whitespace_or_newline ~ ((posInf | negInf | NaN | number) ~ whitespace_or_newline ){1,2}}

promstmt can parse:

  • `http_requests_total{method="post",code="400"} 4 1395066363000`
  • `http_requests_total 4 1395066363000`
  • `http_requests_total +Inf`
  • `http_requests_total -Inf`
  • `http_requests_total NaN`
  • `http_requests_total{} NaN`

The AST for the second metric line looks as follows; such a hierarchy is what is needed to build a `Value` and subsequently rebuild the metrics from scratch:

- promstmt
  - key: "http_requests_total"
  - pairs
    - pair
      - ident: "method"
      - string > inner: "post"
    - pair
      - ident: "code"
      - string > inner: "400"
  - number: "3"
  - number: "1395066363000"

At this point, a little celebration is expected.

To properly parse the metric groups, comments, help text and type information must be parsed as well. Those are simple formats requiring only a few lines of grammar. The most important piece comes after these lines are implemented: the rule that glues them together and creates a block out of a metric group. As mentioned earlier, parsing a block is easier and requires less code than implementing the state management in the parser itself. The next step is adding a new set of keywords to detect the comment section, which starts with '#':

helpkey = {key}
helpval = {inner}
typekey = {key}
typeval = {countertype | gaugetype | histogramtype | summarytype | untyped}

These rules target comment lines such as:

# A histogram, which has a pretty complex representation in the text format:
# HELP http_request_duration_seconds A histogram of the request duration.
# TYPE http_request_duration_seconds histogram

Observing these comment lines, a pattern emerges:

  • type information has the simple (metric_name, type_info) format
  • help text has the (metric_name, text) format
  • a generic comment has the (text) format

The most important part of implementing the comment rules is simply letting the 'text' section accept arbitrary text:

commentval = @{((ASCII_DIGIT| ASCII_NONZERO_DIGIT | ASCII_BIN_DIGIT | ASCII_OCT_DIGIT | ASCII_HEX_DIGIT | ASCII_ALPHA_LOWER | ASCII_ALPHA_UPPER | ASCII_ALPHA | ASCII_ALPHANUMERIC | !"\n" ~ ANY ))*}
helpexpr = {hash ~ whitespace_or_newline ~ helplit ~ whitespace_or_newline ~ helpkey ~ whitespace_or_newline ~ commentval}
typexpr = {hash ~ whitespace_or_newline ~ typelit ~ whitespace_or_newline ~ typekey ~ whitespace_or_newline ~ typeval }
genericomment = {hash ~ whitespace_or_newline ~ commentval}

The grammar is almost finished. Now the metric groups can be defined as blocks, followed by the main rule with which the input gets parsed:

block = {((helpexpr | typexpr | genericomment)~ NEWLINE?)+ ~ (promstmt ~ NEWLINE?)+}
statement = {SOI ~ block+ ~ EOI}
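
Before feeding it real input, here is a minimal sketch of how such a grammar is typically wired up with pest; the file name `prom.pest`, the struct name `PromParser` and the `main` body are placeholders of mine, not taken from the original code:

use pest::Parser;
use pest_derive::Parser;

// Derive the parser from the grammar file; the path is relative to src/.
#[derive(Parser)]
#[grammar = "prom.pest"]
struct PromParser;

fn main() {
    let input = "# HELP http_requests_total The total number of HTTP requests.\n\
                 # TYPE http_requests_total counter\n\
                 http_requests_total{method=\"post\",code=\"200\"} 1027 1395066363000\n";
    // Parse from the top-level `statement` rule and walk the resulting pairs.
    let parsed = PromParser::parse(Rule::statement, input).expect("invalid metrics");
    for pair in parsed.flatten() {
        println!("{:?}: {:?}", pair.as_rule(), pair.as_str());
    }
}

pest generates the `Rule` enum from the grammar, so `Rule::statement`, `Rule::promstmt` and friends are available to match on.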

Let’s feed it the metrics from Prometheus’s official documentation page:

# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027 1395066363000
http_requests_total{method="post",code="400"} 3 1395066363000

# Escaping in label values:
msdos_file_access_time_seconds{path="C:\\DIR\\FILE.TXT",error="Cannot find file:\n\"FILE.TXT\""} 1.458255915e9

# Minimalistic line:
metric_without_timestamp_and_labels 12.47

# A weird metric from before the epoch:
something_weird{problem="division by zero"} +Inf -3982045

# A histogram, which has a pretty complex representation in the text format:
# HELP http_request_duration_seconds A histogram of the request duration.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.05"} 24054
http_request_duration_seconds_bucket{le="0.1"} 33444
http_request_duration_seconds_bucket{le="0.2"} 100392
http_request_duration_seconds_bucket{le="0.5"} 129389
http_request_duration_seconds_bucket{le="1"} 133988
http_request_duration_seconds_bucket{le="+Inf"} 144320
http_request_duration_seconds_sum 53423
http_request_duration_seconds_count 144320

# Finally a summary, which has a complex representation, too:
# HELP rpc_duration_seconds A summary of the RPC duration in seconds.
# TYPE rpc_duration_seconds summary
rpc_duration_seconds{quantile="0.01"} 3102
rpc_duration_seconds{quantile="0.05"} 3272
rpc_duration_seconds{quantile="0.5"} 4773
rpc_duration_seconds{quantile="0.9"} 9001
rpc_duration_seconds{quantile="0.99"} 76656
rpc_duration_seconds_sum 1.7560473e+07
rpc_duration_seconds_count 2693

It generates the following AST:

- statement
  - block
    - helpexpr
      - helpkey > key: "http_requests_total"
      - commentval: "The total number of HTTP requests."
    - typexpr
      - typekey > key: "http_requests_total"
      - typeval > countertype: "counter"
    - promstmt
      - key: "http_requests_total"
      - pairs
        - pair
          - ident: "method"
          - string > inner: "post"
        - pair
          - ident: "code"
          - string > inner: "200"
      - number: "1027"
      - number: "1395066363000"
    - promstmt
      - key: "http_requests_total"
      - pairs
        - pair
          - ident: "method"
          - string > inner: "post"
        - pair
          - ident: "code"
          - string > inner: "400"
      - number: "3"
      - number: "1395066363000"
  - block
    - genericomment > commentval: "Escaping in label values:"
    - promstmt
      - key: "msdos_file_access_time_seconds"
      - pairs
        - pair
          - ident: "path"
          - string > inner: "C:\\\\DIR\\\\FILE.TXT"
        - pair
          - ident: "error"
          - string > inner: "Cannot find file:\\n\\\"FILE.TXT\\\""
      - number: "1.458255915e9"
  - block
    - genericomment > commentval: "Minimalistic line:"
    - promstmt
      - key: "metric_without_timestamp_and_labels"
      - number: "12.47"
  - block
    - genericomment > commentval: "A weird metric from before the epoch:"
    - promstmt
      - key: "something_weird"
      - pairs > pair
        - ident: "problem"
        - string > inner: "division by zero"
      - posInf: "+Inf"
      - number: "-3982045"
  - block
    - genericomment > commentval: "A histogram, which has a pretty complex representation in the text format:"
    - helpexpr
      - helpkey > key: "http_request_duration_seconds"
      - commentval: "A histogram of the request duration."
    - typexpr
      - typekey > key: "http_request_duration_seconds"
      - typeval > histogramtype: "histogram"
    - promstmt
      - key: "http_request_duration_seconds_bucket"
      - pairs > pair
        - ident: "le"
        - string > inner: "0.05"
      - number: "24054"
    - promstmt
      - key: "http_request_duration_seconds_bucket"
      - pairs > pair
        - ident: "le"
        - string > inner: "0.1"
      - number: "33444"
    - promstmt
      - key: "http_request_duration_seconds_bucket"
      - pairs > pair
        - ident: "le"
        - string > inner: "0.2"
      - number: "100392"
    - promstmt
      - key: "http_request_duration_seconds_bucket"
      - pairs > pair
        - ident: "le"
        - string > inner: "0.5"
      - number: "129389"
    - promstmt
      - key: "http_request_duration_seconds_bucket"
      - pairs > pair
        - ident: "le"
        - string > inner: "1"
      - number: "133988"
    - promstmt
      - key: "http_request_duration_seconds_bucket"
      - pairs > pair
        - ident: "le"
        - string > inner: "+Inf"
      - number: "144320"
    - promstmt
      - key: "http_request_duration_seconds_sum"
      - number: "53423"
    - promstmt
      - key: "http_request_duration_seconds_count"
      - number: "144320"
  - block
    - genericomment > commentval: "Finally a summary, which has a complex representation, too:"
    - helpexpr
      - helpkey > key: "rpc_duration_seconds"
      - commentval: "A summary of the RPC duration in seconds."
    - typexpr
      - typekey > key: "rpc_duration_seconds"
      - typeval > summarytype: "summary"
    - promstmt
      - key: "rpc_duration_seconds"
      - pairs > pair
        - ident: "quantile"
        - string > inner: "0.01"
      - number: "3102"
    - promstmt
      - key: "rpc_duration_seconds"
      - pairs > pair
        - ident: "quantile"
        - string > inner: "0.05"
      - number: "3272"
    - promstmt
      - key: "rpc_duration_seconds"
      - pairs > pair
        - ident: "quantile"
        - string > inner: "0.5"
      - number: "4773"
    - promstmt
      - key: "rpc_duration_seconds"
      - pairs > pair
        - ident: "quantile"
        - string > inner: "0.9"
      - number: "9001"
    - promstmt
      - key: "rpc_duration_seconds"
      - pairs > pair
        - ident: "quantile"
        - string > inner: "0.99"
      - number: "76656"
    - promstmt
      - key: "rpc_duration_seconds_sum"
      - number: "1.7560473e+07"
    - promstmt
      - key: "rpc_duration_seconds_count"
      - number: "2693"
  - EOI: ""

Here is the complete rule set:

alpha = _{'a'..'z' | 'A'..'Z'}
number = @{
    "-"?
    ~ ("0" | ASCII_NONZERO_DIGIT ~ ASCII_DIGIT*)
    ~ ("." ~ ASCII_DIGIT*)?
    ~ (^"e" ~ ("+" | "-")? ~ ASCII_DIGIT+)?
}
string = ${"\"" ~ inner ~ "\""}
inner = @{char*}
char = {
    !("\"" | "\\") ~ ANY
    | "\\" ~ ("\"" | "\\" | "/" | "b" | "f" | "n" | "r" | "t")
    | "\\" ~ ("u" ~ ASCII_HEX_DIGIT{4})
}
whitespace_or_newline = _{(" " | "\n")*}
hash = _{"#"}
posInf = {"+Inf"}
negInf = {"-Inf"}
NaN = {"NaN"}
lbrace = _{"{"}
rbrace = _{"}"}
typelit = _{"TYPE"}
helplit = _{"HELP"}
comma = _{","}
countertype = {"counter"}
gaugetype = {"gauge"}
histogramtype = {"histogram"}
summarytype = {"summary"}
untyped = {"untyped"}
ident = {alpha+}
key = @{ident ~ ("_" ~ ident)*}
pair = {ident ~ "=" ~ string}
pairs = {pair ~ (comma ~ pair)*}
helpkey = {key}
helpval = {inner}
typekey = {key}
typeval = {countertype | gaugetype | histogramtype | summarytype | untyped}
commentval = @{((ASCII_DIGIT| ASCII_NONZERO_DIGIT | ASCII_BIN_DIGIT | ASCII_OCT_DIGIT | ASCII_HEX_DIGIT | ASCII_ALPHA_LOWER | ASCII_ALPHA_UPPER | ASCII_ALPHA | ASCII_ALPHANUMERIC | !"\n" ~ ANY ))*}
helpexpr = {hash ~ whitespace_or_newline ~ helplit ~ whitespace_or_newline ~ helpkey ~ whitespace_or_newline ~ commentval}
typexpr = {hash ~ whitespace_or_newline ~ typelit ~ whitespace_or_newline ~ typekey ~ whitespace_or_newline ~ typeval }
genericomment = {hash ~ whitespace_or_newline ~ commentval}
promstmt = {key ~ (lbrace ~ (pairs)* ~ rbrace){0,1} ~ whitespace_or_newline ~ ((posInf | negInf | NaN | number) ~ whitespace_or_newline ){1,2}}
block = {((helpexpr | typexpr | genericomment)~ NEWLINE?)+ ~ (promstmt ~ NEWLINE?)+}
statement = {SOI ~ block+ ~ EOI}
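
What remains, turning this AST into the `Value` struct from earlier, is mostly a mechanical walk over the pest pairs. Here is a rough, deliberately simplified sketch of that step for a single `promstmt`; it ignores prefixes, descriptions and the `_sum`/`_count` segments, and the function is my own illustration rather than the original code:

use std::borrow::Cow;
use pest::iterators::Pair;

// Build a bare-bones Value from one `promstmt` node of the AST.
fn value_from_promstmt<'i>(stmt: Pair<'i, Rule>) -> Value<'i> {
    let mut value = Value {
        prefix: None,
        description: None,
        key: String::new(),
        pairs: Vec::new(),
        values: Vec::new(),
        sum: None,
        count: None,
    };
    let mut line_pairs = Vec::new();
    let mut numbers = Vec::new();
    for node in stmt.into_inner() {
        match node.as_rule() {
            Rule::key => value.key = node.as_str().to_string(),
            Rule::pairs => {
                for kv_pair in node.into_inner() {
                    let mut kv = kv_pair.into_inner();
                    let k = kv.next().map(|p| p.as_str()).unwrap_or("");
                    // string > inner: take the inner part to drop the quotes.
                    let v = kv
                        .next()
                        .and_then(|s| s.into_inner().next())
                        .map(|p| p.as_str())
                        .unwrap_or("");
                    line_pairs.push((Cow::from(k), Cow::from(v)));
                }
            }
            // Everything else on the line is the value or the timestamp.
            _ => numbers.push(Cow::from(node.as_str())),
        }
    }
    // Lines without labels still get an (empty) pairs entry, as described earlier.
    value.pairs.push(line_pairs);
    let mut numbers = numbers.into_iter();
    let val = numbers.next().unwrap_or(Cow::from(""));
    value.values.push((val, numbers.next()));
    value
}

In the real thing, one would walk a whole `block` instead, use `helpexpr` and `typexpr` to fill in the `Desc`, and route the `_sum` and `_count` lines into the `Segment` fields.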