Elasticsearch ICU now understands emoji!


And that simple change in Elasticsearch 6.4 may have a bigger impact on your indices than you might think.

Elasticsearch 6.4 ships with Lucene 7.4. That is a one-liner in the official release notes, but if you look closer, this new version brings updated ICU data and real support for emoji. And that’s a game changer 😎 (for some!).

International Components for Unicode (ICU) is an open source project of mature C/C++ and Java libraries for Unicode support, software internationalization and software globalization. It’s used everywhere (on your computer, your phone, and probably even your connected fridge).

Every use of icu_tokenizer is affected. That means everyone using the must-have icu_tokenizer should probably reindex everything, because “🍕” is now a token!

The new behavior of icu_tokenizer

With this simple query we are testing Elasticsearch ICU Tokenizer:

GET /_analyze
{
  "tokenizer": "icu_tokenizer",
  "text": "I live in 🇫🇷 and I'm 👩‍🚀"
}

The 👩‍🚀 emoji is peculiar because it’s a combination of the more classic 👩 and 🚀 emoji. The flag of France is also special: it’s the combination of 🇫 and 🇷. So we are not just talking about splitting Unicode code points properly here, but about really understanding emoji.
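To make the "combination" concrete, here is a small Python sketch (not part of the original setup) that lists the code points behind those two emoji using only the standard `unicodedata` module. The `describe` helper is a name made up for this illustration.

```python
import unicodedata

# Emoji written as escapes so the snippet survives any file encoding.
ASTRONAUT = "\U0001F469\u200D\U0001F680"   # woman + zero width joiner + rocket
FRANCE = "\U0001F1EB\U0001F1F7"            # regional indicators F + R


def describe(text):
    """Return (code point, Unicode name) pairs for each code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


for label, sequence in (("woman astronaut", ASTRONAUT), ("flag of France", FRANCE)):
    print(label)
    for code_point, name in describe(sequence):
        print(f"  {code_point}  {name}")
```

A tokenizer that only splits on code points would break each of these into several pieces. Keeping them as one token means recognizing the whole sequence as a single emoji.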

Let’s compare the tokens returned by this _analyze call on Elasticsearch 6.3 and on Elasticsearch 6.4:

Elasticsearch 6.3

  • I
  • live
  • in
  • and
  • I'm

Emoji are just dropped like punctuation.

Elasticsearch 6.4

  • I
  • live
  • in
  • 🇫🇷
  • and
  • I'm
  • 👩‍🚀

Emoji are kept and understood!

More tokens means more relevant search results! Here is how to take advantage of this new capability.

Now that you have the tokens, you can search for meaning and relevance inside the emoji world. Thousands of new words, new meanings and ways to communicate are available to you.

Your users will be able to search for a pizza place by typing “pizza”, or “🍕”, or both.

To add this capability to your indices, use the CLDR annotations for each emoji and add them as synonyms via a custom token filter. Here is an example:

PUT /emoji-capable
{
  "settings": {
    "analysis": {
      "filter": {
        "english_emoji": {
          "type": "synonym",
          "synonyms_path": "analysis/cldr-emoji-annotation-synonyms-en.txt"
        }
      },
      "analyzer": {
        "english_with_emoji": {
          "tokenizer": "icu_tokenizer",
          "filter": [
            "english_emoji"
          ]
        }
      }
    }
  }
}

You may wonder how to populate the cldr-emoji-annotation-synonyms-en.txt file? Wonder no more! Everything is already done and it looks like this:

🤧 => 🤧, face, gesundheit, sneeze, sneezing face
🧞 => 🧞, djinn, genie
🕺 => 🕺, dance, man, man dancing
👂 => 👂, body, ear
🐅 => 🐅, tiger
🍺 => 🍺, bar, beer, drink, mug
🆘 => 🆘, help, sos, SOS button
👩‍🚒 => 👩‍🚒, firefighter, firetruck, woman
🇮🇪 => 🇮🇪, Ireland

I took the time to generate properly formatted, Elasticsearch-compatible synonym files for all the languages and emoji supported by CLDR, and you can find them on GitHub.
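If you are curious what producing such lines involves, here is a minimal Python sketch. It is not the generator used for the published files. The `ANNOTATIONS` dictionary is a hand-written stand-in for data you would extract from CLDR, and `synonym_line` and `build_synonym_file` are hypothetical helper names.

```python
# Hand-written sample standing in for annotations extracted from CLDR (assumption).
ANNOTATIONS = {
    "\U0001F405": ["tiger"],                      # tiger
    "\U0001F198": ["sos", "help", "SOS button"],  # SOS button
}


def synonym_line(emoji, annotations):
    """Format one explicit mapping: the emoji maps to itself plus its keywords."""
    # Deduplicate, then sort case-insensitively for a stable output.
    keywords = sorted(set(annotations), key=lambda word: (word.casefold(), word))
    return f"{emoji} => {', '.join([emoji, *keywords])}"


def build_synonym_file(annotations):
    """Return the whole file content, one mapping per line."""
    lines = [synonym_line(emoji, words) for emoji, words in annotations.items()]
    return "\n".join(lines) + "\n"


print(build_synonym_file(ANNOTATIONS), end="")
```

The emoji is repeated on the right-hand side of `=>` so that the original token is kept alongside its keywords instead of being replaced by them.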

That way when you search for 🐅, “tiger” is also searched, and the other way around.

A complete Emoji Search example

Let’s build a simple index, add some documents and search for them (gosh, I wish found.no/play was still working, it was like a JS Fiddle but for Elasticsearch!):

The index with English-based analysis

PUT /tweets
{
  "settings": {
    "number_of_replicas": 0,
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "english_emoji": {
          "type": "synonym",
          "synonyms": [
            "🐅 => 🐅, tiger",
            "🍺 => 🍺, bar, beer, drink, mug",
            "🍍 => 🍍, fruit, pineapple",
            "🍕 => 🍕, cheese, pizza, slice"
          ]
        },
        "english_stop": {
          "type":       "stop",
          "stopwords":  "_english_"
        },
        "english_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["example"]
        },
        "english_stemmer": {
          "type":       "stemmer",
          "language":   "english"
        },
        "english_possessive_stemmer": {
          "type":       "stemmer",
          "language":   "possessive_english"
        }
      },
      "analyzer": {
        "english_with_emoji": {
          "tokenizer": "icu_tokenizer",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_emoji",
            "english_stop",
            "english_keywords",
            "english_stemmer"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "english_with_emoji"
        }
      }
    }
  }
}

Add some documents

POST /tweets/_doc
{
  "author": "NotFunny",
  "content": "Pineapple on pizza, are you kidding me? #peopleAreCrazy"
}

POST /tweets/_doc
{
  "author": "JulFactor93",
  "content": "🍍🍕 is the best #food"
}

Now search for them, first with words, then with emoji:

GET /tweets/_search
{
  "query": {
    "match": {
      "content": "pineapple pizza"
    }
  }
}

GET /tweets/_search
{
  "query": {
    "match": {
      "content": "🍍🍕"
    }
  }
}

Both searches return our two documents: the analyzer expands each emoji into its annotations at index time and at search time, so emoji and words match each other.

{
  "took": 16,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.653994,
    "hits": [
      {
        "_index": "tweets",
        "_type": "_doc",
        "_id": "-PJ172YBCoXstbpG75QM",
        "_score": 0.653994,
        "_source": {
          "author": "JulFactor93",
          "content": "🍍🍕 is the best #food"
        }
      },
      {
        "_index": "tweets",
        "_type": "_doc",
        "_id": "-fJ172YBCoXstbpG9pRv",
        "_score": 0.3971361,
        "_source": {
          "author": "NotFunny",
          "content": "Pineapple on pizza, are you kidding me? #peopleAreCrazy"
        }
      }
    ]
  }
}
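To see why both queries hit both tweets, here is a toy model in Python. It is only a sketch: it is not how Lucene scores or matches, it skips stop words and stemming, and the token lists are written by hand on the assumption that the tokenizer emits one token per word and per emoji. `analyze`, `search` and `DOCS` are names invented for this illustration.

```python
PINEAPPLE = "\U0001F34D"
PIZZA = "\U0001F355"

# Same mappings as the english_emoji filter in the index settings.
SYNONYMS = {
    PINEAPPLE: [PINEAPPLE, "fruit", "pineapple"],
    PIZZA: [PIZZA, "cheese", "pizza", "slice"],
}

# Hand-tokenized documents (assumption: one token per word or emoji).
DOCS = {
    "NotFunny": ["Pineapple", "on", "pizza", "are", "you", "kidding", "me", "peopleAreCrazy"],
    "JulFactor93": [PINEAPPLE, PIZZA, "is", "the", "best", "food"],
}


def analyze(tokens):
    """Lowercase each token, then expand emoji into themselves plus their keywords."""
    expanded = set()
    for token in tokens:
        token = token.lower()
        expanded.update(SYNONYMS.get(token, [token]))
    return expanded


def search(query_tokens):
    """Return authors whose analyzed content shares a term with the analyzed query."""
    query_terms = analyze(query_tokens)
    return sorted(author for author, tokens in DOCS.items() if query_terms & analyze(tokens))


print(search(["pineapple", "pizza"]))
print(search([PINEAPPLE, PIZZA]))
```

The emoji tweet is indexed with "pineapple" and "pizza" among its terms, and the emoji query is expanded to the same words, which is why the two searches overlap with both documents.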

Final words

This change is great news for me as my Elasticsearch plugin Emoji Search is no longer needed! It’s a relief, since building and shipping it is no fun at all. The documentation doesn’t really help, and you have to compile your plugin for each Elasticsearch release.

Supporting emoji is now easier than ever. If you take a look at this technical article from 2016, your eyes will bleed a little: we had to use a whitespace tokenizer, some char filters and hacks everywhere. Elasticsearch 6.4 improves the search engine’s Unicode support, and we will enjoy it.

Happy emoji searching 🔎!