Parsing Keywords in Lisp with Speed of C

Lisp is not a single language. It is a family of languages. There are numerous Lisps, serving various purposes, from being embedded in tiny hardware devices like uLisp, to being used as an embeddable scripting language like Guile or ECL, to run on Java platform such as Clojure or ABCL. There is experimenting, language implementation platform, Racket, and a good application development Lisps with optimizing compilers such as SBCL or CCL.

Emacs Lisp

The most popular, at least the most used of all Lisps in current use, seems to be Emacs Lisp though, a built-in Lisp for scripting the GNU Emacs editor. GNU Emacs itself has evolved from Gosling's Emacs, which was a text editor with a pseudo-lisp, so called, "mock lisp" as its scripting language, into a full-fledged "lisp implementation". For lots of people, me included, Emacs is the entrance into the Lisp world. It was my first encounter with Lisp, and thus far still a Lisp I am most familiar with.

I am really in love with Emacs as the application. Personally, for years now, I think it would be a great CS teaching tool. Everything is moldable, it is straightforward to write a function, evaluate it, execute it, and step through it when debugging. At the same time while stepping through, one can switch to input or output buffers and see how input and output to a program are consumed or generated, all while program is executed one expression at the time.

With that convenience, it is just natural that we would like to use Emacs Lisp for writing more general programs than just extending Emacs. I am not sure if GNU Emacs is there, but we can use it to write scripts, and technically, it would be possible to use it as a framework for custom applications, though it is quite technical and involved process.

Anyway, here I would just like to see how Emacs Lisp compares to Common Lisp and how both compares to C, as well as how we can use C from both. It is just curiosity, do not take it as partisan writing for either of those three languages.

List and Hash in Emacs Lisp

First let us read the keywords into a list:

(defvar *keyword-list*
  (with-temp-buffer
    (insert-file-contents-literally "C23-keywords")
    (cl-loop while (not (eobp))
             with lines do
             (push
              (buffer-substring-no-properties (pos-bol) (pos-eol))
              lines)
             (forward-line)
             finally (cl-return lines))))

The simplest way to know if an element is in a set, is to check if a keyword is in a list (set) with member function:

(defun ckeywords-list-benchmark ()
  (benchmark-run-compiled 1
    (let (throwaway)
      (garbage-collect)
      (cl-loop
       repeat 1000000 do
       (setf throwaway nil)
       (cl-loop for word in *keyword-list*
                when
                (cl-member word *keyword-list* :test #'string=) do
                (setf throwaway word))))))

As expected this is slow. It took so long that I eventually pressed C-g to cancel the computation. I have of course compiled the file with the "native compiler" (GCC), before the test.

We can do better with a hash table:

(defvar *keyword-table*
  (let ((ht (make-hash-table :test #'equal)))
    (dolist (keyword *keyword-list*)
      (setf (gethash keyword ht) keyword))
    ht))

(defun ckeywords-table-benchmark ()
  (benchmark-run 1
    (let (throwaway)
      (garbage-collect)
      (cl-loop repeat 1000000 do
               (setf throwaway nil)
               (cl-loop for word in *keyword-list*
                        when (gethash word *keyword-table*) do
                        (setf throwaway word)))
      (print throwaway))))

;; (0.4601723 1 0.06320889999999224)

I think this is the best we can do with pure Elisp, out of the box so to say. I am not familiar with any perfect hash library for Emacs, nor about some other library that might do better than the hash lookup. If you know about some, let me know.

I do not think it is bad, it is ~4x slower than the C version. But it is OK for the convenience a relatively general-purpose programming Lisp implementation offers directly from the text editor. As said, the test is very synthetic. In real-life applications, we usually do not look up millions of strings in a loop.

As a little regression, Emacs Lisp has pcase, a pattern matching library, which is built-in and even preloaded into Emacs runtime. It is good, and we can easily construct code to test all the 59 keywords. Something like this for example:

(defun match-pcase (string)
  (cl-macrolet ((matchstring (s)
                  `(pcase ,s
                     ,@(mapcar (lambda (w) (list w t)) *speakers*))))
    (matchstring string)))

It expands into a cond-form which means we do basically a linear probing on string equality. It is slower than hash table version.

In regard to this test, I would like to mention string-case, a Common Lisp library which generates tests first on the string length and then constructs a decision tree for each. It does very well in Common Lisp, which will be described later.

I have ported it to Emacs Lisp, but since we do not have a compiler and as sophisticated type inference engine, it does not matter at all since compiler can't optimize away all the functions call as SBCL can. If you compare it to the original string-case for CL, on x86, they emit some optimized assembly. SBCL emits optimized code for functions, based on the involved types, similarly as GCC can do for C functions (say strlen or memcpy). This is not something we can do in Emacs, at least not as of now. I believe they are working on the native compiler and type inference engine, but I have no idea how far they have come with that work. I see more function declarations every time I look up doc strings, but I really do not know where the work is going.

My port is quite bad too, it is probably possible to make it more efficient, but I have not invested the time. If someone is interested in giving it a run and a look, the code is in string-case.el in the repo for this blog post.

I did though try the main idea of that library: to group functions based on length and produce a cond form for each group. The worst case, when all strings are of equal length, should be the same as code produced by pcase. On average, where we have several groups of various lengths, we should be able to cut off all but one group and than do less string equality tests. I have done that, but I don't see any speedup worth talking about. I guess in practice, equal already checks the string length before comparing characters, so ending up doing some extra tests for length does not matter (checking string length in Elisp is a constant time operation).

Emacs Modules

When I was at it, I did play a bit with Emacs modules too. Emacs modules are a way for a third-party developer to call into C, either for speed or just to expose functionality to Lisp.

#define BUFFER_SIZE 250
#define _unused(var) _##var __attribute__((unused))

/* This is of course not a proper way to do things. In real code
   we would check for errors and convert utf8 string to proper encoding.
   For this example, we are assuming ascii text and
   are not checking for errors. For proper handling, see the manual or some
   other code.  */
static emacs_value c23_keyword_p(emacs_env *env, ptrdiff_t _unused (nargs),
                                 emacs_value args[], void* _unused (data)) {
  static char buffer[BUFFER_SIZE];
  emacs_value keyword = env->make_global_ref (env, *args);
  ptrdiff_t keyword_length = 0;
  if (!env->copy_string_contents(env, keyword, 0, &keyword_length))
    return Qnil;
  if (keyword_length > (ptrdiff_t) BUFFER_SIZE - 1)
    return Qnil;
  env->copy_string_contents (env, keyword, buffer, &keyword_length);
  return in_c23_p (buffer, (size_t)keyword_length-1) ? Qt : Qnil;
}

To call it from Elisp:

(module-load (expand-file-name "ckeywords-module.dll"))

(declare-function c23-keyword-p "c23-keywords.c")

(defun ckeywords-module-benchmark ()
  (benchmark-run 1
    (let (throwaway)
      (garbage-collect)
      (cl-loop repeat 1000000 do
               (setf throwaway nil)
               (cl-loop for keyword in *keyword-list*
                        when (c23-keyword-p keyword) do
                        (setf throwaway keyword)))
      (print throwaway))))

I get 4.1 seconds (4.112878299999999 1 0.032796099999999884).

I am not an expert in writing Emacs modules, but I think it will do. The code is in ckeywords-module.c.

The slowness is not the code, but the strategy. Calling low-level functions like this is not the best strategy, since we are paying the cost for packing and unpacking Emacs pointers 59 million times. Obviously, a better way of using a module is to write the test completely in C and export just the function to call the test, i.e. a high-level function call. In other words, making lots of calls to module API adds overhead.

If you check the code in ckeywords-module.c, c23_keywords_test does exactly that. There we can measure just the double loop, as we did in a pure C program in ckeywords.c:

  beg = clock();

  for (int i = 0; i < 1000000; i++)
    for (int w = 0; w < NKWDS; w++) {
      k = keywords[w];
      if (in_c23_set (k.string, k.length))
        temp = k.string;
    }

  end = clock();

The result is hovering about ~0.08 ~ 0.1 seconds, same as for the statically linked C program. The conclusion I draw from this experiment is that Emacs modules are good for exposing the functionality to Emacs Lisp. However, when it comes to execution speed, modules can be used if we do not do too many interactions with Emacs runtime. Otherwise, the overhead for packing and unpacking data to and from Emacs runtime adds up noticeably. In other words, we can get the raw C speed for the code that runs relatively independent from Emacs core, at in hot loops.