Add Jinja template support by ochafik · Pull Request #11016 · ggml-org/llama.cpp

Subset of #9639 with just the Jinja templating support.

Proper tool support (grammar constraints, lazy grammar triggering, tool call parsing & stop reason) will come in a follow-up PR.

  • Copies minja.hpp & chat-template.hpp from google/minja (created for this 😅) at this commit
  • Adds --jinja flag to llama-server, llama-cli, llama-run
  • Adds --chat-template-file flag to llama-server, llama-cli (related: Added chat template support to llama-run #11215)
  • Loads tokenizer.chat_template (or tokenizer.chat_template.tool_use if defined, only when the request has tools).
  • Dual testing in test-chat-template.cpp of the legacy ad hoc templating and the Jinja route. Wherever the expected outputs diverge, the Jinja expectations should be more correct (note that templates are run with trim_blocks = true, lstrip_blocks = true)
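
The template selection rule from the list above can be sketched roughly as follows. This is illustrative Python, not the C++ code in this PR: the function name `select_chat_template` and the plain dict standing in for the GGUF metadata are assumptions made for the example. Only the two metadata key names come from the description.

```python
from typing import Optional

# Metadata keys named in the PR description.
DEFAULT_KEY = "tokenizer.chat_template"
TOOL_USE_KEY = "tokenizer.chat_template.tool_use"


def select_chat_template(metadata: dict, has_tools: bool) -> Optional[str]:
    """Pick the template source for a request (hypothetical helper).

    The tool_use variant is used only when the request carries tools
    AND the model actually defines that variant; otherwise the default
    template is used.
    """
    if has_tools and metadata.get(TOOL_USE_KEY):
        return metadata[TOOL_USE_KEY]
    return metadata.get(DEFAULT_KEY)


# Placeholder template strings, for illustration only.
metadata = {
    DEFAULT_KEY: "<default template>",
    TOOL_USE_KEY: "<tool_use template>",
}

print(select_chat_template(metadata, has_tools=True))   # <tool_use template>
print(select_chat_template(metadata, has_tools=False))  # <default template>
```

A model that ships only tokenizer.chat_template falls back to that default even when the request has tools.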

Example usage:

# Launch in background
./build/bin/llama-server \
  -hfr bartowski/Qwen2.5-7B-Instruct-GGUF \
  -hff Qwen2.5-7B-Instruct-Q4_K_M.gguf \
  --jinja &

curl http://localhost:8080/v1/chat/completions \
  -d '{
    "model": "gpt-3.5-turbo",
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "ipython",
          "description": "Runs code in an ipython interpreter and returns the result of the execution after 60 seconds.",
          "parameters": {
            "type": "object",
            "properties": {
              "code": {
                "type": "string",
                "description": "The code to run in the ipython interpreter."
              }
            },
            "required": ["code"]
          }
        }
      }
    ],
    "messages": [
      {
        "role": "user",
        "content": "Print a hello world message with python (using single quotes '"'"' for strings)."
      }
    ]
  }'

Output:

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "<tool_call>\n{\"name\": \"ipython\", \"arguments\": {\"code\": \"print('Hello world!')\"}}\n</tool_call>",
        "role": "assistant"
      }
    }
  ],
  "created": 1736811609,
  "model": "gpt-3.5-turbo",
  "system_fingerprint": "b4494-a57bb94e",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 25,
    "prompt_tokens": 205,
    "total_tokens": 230
  },
  "id": "chatcmpl-5YJXFVhvjoMDlLx1asuWNdSO3JVWWsUF",
  "timings": {
    "prompt_n": 1,
    "prompt_ms": 155.151,
    "prompt_per_token_ms": 155.151,
    "prompt_per_second": 6.445333900522716,
    "predicted_n": 25,
    "predicted_ms": 419.714,
    "predicted_per_token_ms": 16.78856,
    "predicted_per_second": 59.56437002339688
  }
}
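
Since tool call parsing is left to a follow-up PR, the tool call in the response above arrives as plain text inside `message.content`. Until then, a client could extract it itself. The sketch below is one possible approach and is not part of this PR: `extract_tool_calls` is a hypothetical helper, and it assumes the `<tool_call>…</tool_call>` wrapping shown in the example output, which other models' templates may not use.

```python
import json
import re

# Matches the <tool_call> ... </tool_call> wrapping seen in the example
# output. Assumption: the payload between the tags is a single JSON object.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def extract_tool_calls(content: str) -> list:
    """Return the JSON payloads of all well-formed tool calls in content."""
    calls = []
    for match in TOOL_CALL_RE.finditer(content):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            # Skip payloads that are not valid JSON rather than failing.
            continue
    return calls


# The "content" string from the example response, copied verbatim.
content = "<tool_call>\n{\"name\": \"ipython\", \"arguments\": {\"code\": \"print('Hello world!')\"}}\n</tool_call>"

calls = extract_tool_calls(content)
print(calls[0]["name"])               # ipython
print(calls[0]["arguments"]["code"])  # print('Hello world!')
```

Malformed payloads are skipped silently here; a real client would probably want to log them or surface the raw text instead.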

TODO:

  • Add cross-testing in test-chat-template.cpp (note that minja is tested against a lot of templates in its own repo)
  • Add some instructions here
  • Add more server tests to exercise the template overrides.