Taming ElasticSearch Field Explosion


Tom Johnell

Over time, it’s easy for the number of fields within your ElasticSearch (ES) index to grow to what seems like an unmanageable size. Out of the box, ES is happy to consume and index whatever data you send its way. The downside is that the performance of your ES cluster will slowly degrade and surprises will crop up when you query your data. You can quickly rein in the problem of too many fields with a couple of tricks using explicit mappings and dynamic templates. In this post, we’ll go over how Handy fell into the trap of too many fields and how we got out of it.

Before diving into the incident, it might help to understand a bit of our data stack. Handy has developed an internal logging library that enables engineers to easily log just about anything. That data is filtered for sensitive information, serialized as JSON, and shipped off to FluentD over UDP. FluentD then ships the data to two main stores. The first is ES, which covers Handy’s real-time data needs for debugging, on-call, and some monitoring. The other is S3, where logs eventually make their way into Hive for long-term storage and Snowflake/Looker for data science.

More Fields == No Monitoring

Going back to Handy’s internal logging library: it really is flexible enough to log just about anything. Whether it’s a simple object with a few attributes or a giant hierarchy of data, the library will happily serialize the data and punt it over UDP. Our engineers have taken full advantage of that flexibility and have been logging all sorts of interesting data. The unfortunate result is that the number of fields within our production ES index slowly grew to an unmanageable size. To make matters worse, when we hit the default field count limit, we decided to increase the limit and punt the problem to a later date. In fact, we did that several times.
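
Raising the limit is a one-line settings change, which is part of what made it so tempting. It looks something like this (the index name and value here are placeholders, not our actual settings):

PUT https://elasticsearch:9200/my-index/_settings

{
  "index.mapping.total_fields.limit": 2000
}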

That was all well and good until 600 new fields were added in the span of a day and the ES REST API began returning the following error to our Grafana dashboards:

{
"type": "illegal_argument_exception",
"reason": "field expansion matches too many fields, limit: 4096, got: 4462"
}

The above error caused zero documents to be returned, rendering the Grafana monitoring useless, or, in the case of monitors alerting on the absence of data, triggering false alarms. And yes, you’re reading that right: we had 4,462 fields, more than quadruple the recommended maximum of 1,000.

A look at Handy’s main index pattern in Kibana prior to fixes

In ElasticSearch’s documentation, you’ll find an IMPORTANT tag regarding the index.mapping.total_fields.limit setting:

The limit is in place to prevent mappings and searches from becoming too large. Higher values can lead to performance degradations and memory issues, especially in clusters with a high load or few resources.

We decided it was finally time to figure out a medium-term solution to avoid these surprises. First things first, why were there so many fields?

Dynamic Field Mapping

By default, every ES index has dynamic field mappings enabled, which means every field and its associated type are created and mapped the first time a document containing that field is indexed. Additionally, by default, ES generates two fields for every new string value it encounters: one of type text for full-text search and highlighting, and another of type keyword for aggregations and exact matches.

As an example, if the following document were submitted to an ES index with dynamic field mappings enabled:

POST https://elasticsearch:9200/my-index/_doc
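
(The request body below is illustrative; the values are made up to match the field list that follows.)

{
  "first_name": "Jane",
  "last_name": "Doe",
  "age": 35,
  "address": {
    "address1": "123 Main St",
    "zip": "10011",
    "city": "New York",
    "state": "NY"
  }
}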

The following field mapping would be generated dynamically:
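
(Roughly the following, abridged; the exact output varies with the ES version.)

{
  "mappings": {
    "properties": {
      "address": {
        "properties": {
          "address1": {
            "type": "text",
            "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }
          },
          "city": {
            "type": "text",
            "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }
          },
          "state": {
            "type": "text",
            "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }
          },
          "zip": {
            "type": "text",
            "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }
          }
        }
      },
      "age": { "type": "long" },
      "first_name": {
        "type": "text",
        "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }
      },
      "last_name": {
        "type": "text",
        "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }
      }
    }
  }
}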

The full field names are:

first_name
first_name.keyword
last_name
last_name.keyword
address.address1
address.address1.keyword
address.zip
address.zip.keyword
address.city
address.city.keyword
address.state
address.state.keyword
age

So, with a small JSON object, we’ve suddenly created 13 fields in ES. You can imagine how many fields that might lead to over the span of years of development and various loggers. Of special note is how each nested attribute within the JSON object has a corresponding field created, e.g. address.zip.

Objects of Unknown Size

Again, back to Handy’s internal logging library, you might see something along the lines of the following within a service:
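
(A hypothetical sketch; Handy::Logger.info and scrub_sensitive stand in for Handy’s internal tooling, whose actual interface isn’t shown here.)

# Log every inbound request header, scrubbed of auth headers & cookies.
# Handy::Logger and scrub_sensitive are illustrative stand-ins.
Handy::Logger.info(
  "inbound_request",
  request_headers: scrub_sensitive(request.headers.to_h)
)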

The above code looks relatively innocent: it takes the headers from the request and logs them. Under the hood, the service shoots a JSON object off to FluentD, and the JSON eventually makes its way into ES. Our tooling has very minimal validation (e.g. the properties passed cannot be complex objects), so a hash (or JSON object) of many header strings is perfectly fine.

A small example of what the above might look like when it makes its way into FluentD:
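
(The header names and values below are illustrative, not an actual Handy payload.)

{
  "message": "inbound_request",
  "request_headers": {
    "HTTP_ACCEPT": "application/json",
    "HTTP_ACCEPT_ENCODING": "gzip, deflate, br",
    "HTTP_ACCEPT_LANGUAGE": "en-US,en;q=0.9",
    "HTTP_CACHE_CONTROL": "no-cache",
    "HTTP_CONNECTION": "keep-alive",
    "HTTP_HOST": "api.example.com",
    "HTTP_REFERER": "https://www.example.com/bookings",
    "HTTP_USER_AGENT": "Mozilla/5.0",
    "HTTP_X_FORWARDED_FOR": "203.0.113.10",
    "HTTP_X_FORWARDED_PROTO": "https",
    "HTTP_X_REQUEST_ID": "0f6a2b1c-4d5e-4f60-8a7b-9c0d1e2f3a4b",
    "HTTP_VERSION": "HTTP/1.1",
    "CONTENT_TYPE": "application/json",
    "CONTENT_LENGTH": "211"
  }
}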

There’s an obvious problem with the above. The headers sent by the client are theoretically unbounded, so over time, the number of corresponding ES fields will grow. Couple that with the fact that each text attribute creates two ES fields, text & keyword, and the above code will generate twice as many ES fields as there are request headers (28 fields in the case above)!

That is, in fact, what was happening at Handy. We were logging every request header we received (with some scrubbing for auth headers & cookies) as two ES fields. At the time of fixing this, it was not uncommon for there to be over 420 request header fields within a single index, nearly half the recommended number of total ES fields.

Nested Objects

Similar to request headers, issues arise when nested objects of unknown depth are logged.

More example code:
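
(Another hypothetical sketch, with the same stand-in logger interface as before.)

# `events` is the raw payload from a payment processor webhook,
# nested objects and all, logged under a single `events` property.
Handy::Logger.info("payment_processor_events", events: events)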

On the face of it, this looks like the code is logging a single field called events. However, what if the shape of the data were as follows:
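
(The values below are illustrative; the field names mirror the list that follows.)

{
  "events": [
    {
      "event": "charge.succeeded",
      "properties": {
        "charge": {
          "id": 90123,
          "created_at": "2019-06-01T12:00:00Z",
          "status": "succeeded",
          "account": { "customer_id": 1542 }
        }
      }
    },
    {
      "event": "transfer.created",
      "properties": {
        "transfer": {
          "id": 90124,
          "created_at": "2019-06-01T12:00:05Z",
          "status": "pending",
          "account": { "customer_id": 1542 }
        }
      }
    }
  ]
}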

In the above, there are nested fields with different names and differing properties; it looks like a bundle of events from some kind of payment processor. Although charge and transfer have similar properties, they will be mapped as entirely separate fields:

events.event
events.event.keyword
events.properties.charge.id
events.properties.charge.created_at
events.properties.charge.status
events.properties.charge.status.keyword
events.properties.charge.account.customer_id
events.properties.transfer.id
events.properties.transfer.status
events.properties.transfer.status.keyword
events.properties.transfer.created_at
events.properties.transfer.account.customer_id

The above issues forced us to think about what we should allow engineers to actually log. A Wild West approach of allowing anything and just having the mapping figure itself out dynamically allows for situations like the above to occur, where an explosion of new fields is caused by a single logger.

Stop The Bleeding

There’s an obvious way to solve this problem: disable dynamic field mappings and define a static list of fields that are accepted for a particular index pattern. Any new property that an engineer adds to a logger would simply be ignored until they explicitly add the field to the index. However, this solution presents its own set of challenges. Requiring an explicit mapping is a process change, and in order to let our engineers keep moving fast, we would prefer to automate it, or at the very least provide tooling good enough that logging data stays roughly as easy as it is today. We recognize static mappings are a solid long-term solution, but to prevent an engineer from breaking our Grafana monitoring right now, we wanted an interim solution to stop the bleeding.

Solution: Explicit Mappings + Dynamic Templates

A nice middle ground would be some kind of static mapping for certain fields as well as restrictions on the depth of objects.

Request Headers

Dynamic mapping within ES is not all or nothing. An index can have most of its fields mapped dynamically while a certain subset is defined explicitly, and that explicit subset is where restrictions can be put in place to avoid field explosion. In the case of request headers, there are certain headers we definitely want indexed and searchable, while others are more informational and rarely searched. Handy fixed the problem of creating a new field for every header by explicitly defining the headers we care about and not indexing the rest.

Note: As of version 7.3, ES offers a field type called flattened that would have been perfect for this use case. Handy is currently running an older version, but will very likely move to that type once we upgrade. I will provide examples for both.

Adding the static mapping:

PUT https://elasticsearch:9200/my-index
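
(A sketch of the request body. The three headers shown are examples; Handy’s real allow-list is longer, and the mapping-type wrapper required by pre-7.x ES is omitted.)

{
  "mappings": {
    "properties": {
      "request_headers": {
        "dynamic": false,
        "properties": {
          "HTTP_ACCEPT": {
            "type": "text",
            "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }
          },
          "HTTP_USER_AGENT": {
            "type": "text",
            "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }
          },
          "HTTP_X_REQUEST_ID": {
            "type": "text",
            "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }
          }
        }
      }
    }
  }
}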

The above only shows a subset of the headers that Handy has allow-listed, but you get the idea. The important attribute above is the dynamic: false defined within the request_headers object. That attribute disables dynamic mapping, therefore requiring all fields that appear within the request_headers object to be defined explicitly in order to be indexed. Additionally, notice the fields entry for each property defining the keyword field. This maintains compatibility with the fields that existed prior to the change, meaning any aggregations that took place, for instance, on request_headers.HTTP_ACCEPT.keyword will continue to work. The fields property is how multi-fields are defined. The above change immediately eliminated over 300 fields!


A look at Handy’s request header fields (only 88!)

For those running ES >= 7.3, using a flattened field type across request headers would likely be a much better solution.

The mapping could be created like so:

PUT https://elasticsearch:9200/my-index
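
(Again a sketch; only the request_headers field is shown.)

{
  "mappings": {
    "properties": {
      "request_headers": {
        "type": "flattened"
      }
    }
  }
}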

The above would only create a single field when given a large object of request_headers. As of writing this, the flattened field type allows for the following operations:

  • term, terms, and terms_set
  • prefix
  • range
  • match and multi_match
  • query_string and simple_query_string
  • exists
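
For example, a single key within the flattened object can be queried directly with a term query (the value shown is hypothetical):

GET https://elasticsearch:9200/my-index/_search

{
  "query": {
    "term": {
      "request_headers.HTTP_ACCEPT": "application/json"
    }
  }
}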

Nested Objects

In addition to some explicit mappings, we wanted some protection against an engineer logging large, deeply nested objects. We did not want to enforce that restriction at the point of logging because we want that data available through our data pipeline via FluentD -> S3 -> Snowflake.

The solution was to use Dynamic Templates. Dynamic templates allow for applying field mappings based on a combination of detected data type, full dotted path, and field name. In the case of deeply nested objects, we care about detected data type (object) and the dot-path (no more than X number of dots).

It’s worth noting that ES does offer the ability to set a maximum depth for fields within an index, and it explicitly calls out these settings as a means to prevent field explosion; they are listed in the Mapping documentation. The index.mapping.depth.limit setting actually sounded perfect for this use case, but the problem is that it configures ES to reject any document that exceeds that depth. In our case, that is not what we wanted. We still want to capture and index the document, just not the fields beyond a certain depth.
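
For reference, that setting is an ordinary index setting; it would look something like this (we did not end up applying it):

PUT https://elasticsearch:9200/my-index/_settings

{
  "index.mapping.depth.limit": 2
}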

Here are the settings we applied via a dynamic template to avoid indexing deeply nested objects:

PUT https://elasticsearch:9200/my-index
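
(A sketch of the request body. The template name is arbitrary, the pre-7.x mapping-type wrapper is omitted, and the path_match pattern is what controls the depth cutoff.)

{
  "mappings": {
    "dynamic_templates": [
      {
        "disable_deeply_nested_objects": {
          "match_mapping_type": "object",
          "path_match": "*.*.*",
          "mapping": {
            "type": "object",
            "enabled": false
          }
        }
      }
    ]
  }
}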

The above template applies a mapping to any field whose dotted path matches the path_match pattern and whose detected type matches the match_mapping_type of object. When a field matches, the mapping defined in the template is applied. Most importantly, in this example, that mapping sets the enabled property to false, ensuring the object is not indexed. Putting it all together, this dynamic template disables indexing of any object nested more than two levels deep. Going back to our previous example with the following fields:

events.event
events.event.keyword
events.properties.charge.id
events.properties.charge.created_at
events.properties.charge.status
events.properties.charge.status.keyword
events.properties.charge.account.customer_id
events.properties.transfer.id
events.properties.transfer.status
events.properties.transfer.status.keyword
events.properties.transfer.created_at
events.properties.transfer.account.customer_id

After applying the dynamic template, only the following fields would be indexed:

events.event
events.event.keyword

This dynamic template puts the onus on the engineer to extract any critical fields and log them at a higher level; otherwise, they’re not indexed at all. The above change eliminated thousands of fields from Handy’s main production ES index. Admittedly, this isn’t a perfect solution. As it currently stands, useful information from older loggers in the system may now be unindexed and no longer searchable due to the depth of its fields, but that was a trade-off we were willing to make.

In Conclusion

The biggest lesson learned when trying to resolve our internal monitoring incident was that there can be a happy place between a 100% dynamic mapping and 100% explicit mapping. Handy has decided to forego indexing every single request header and every triply nested field to greatly reduce the number of fields within our production index, and now we can focus on a longer term solution with significantly less risk of having too many fields crop up again. Ultimately, we may decide an explicit mapping for every field is the way to go, or even more dramatically, we may realize ES isn’t the right tool for what we need.

I’m curious if other companies that rely on ES have run into similar situations, so please reach out if you’ve had a similar experience and found a nice middle-ground. Also, if you find these sorts of problems interesting you might also be a great candidate for Handy’s Foundation Engineering team. Please check out the position and apply if the description excites you!