Subset of #9639 with just the Jinja templating support.
Proper tool support (grammar constraints, lazy grammar triggering, tool call parsing & stop reason) will come in a follow-up PR.
- Copies `minja.hpp` & `chat-template.hpp` from google/minja (created for this 😅) at this commit
- Adds `--jinja` flag to llama-server, llama-cli, llama-run
- Adds `--chat-template-file` flag to llama-server, llama-cli (related: "Added chat template support to llama-run", #11215)
- Loads `tokenizer.chat_template` (or `tokenizer.chat_template.tool_use` if defined, only when the request has tools).
- Dual testing in `test-chat-template.cpp` of the legacy ad hoc templating & the jinja route. Wherever the expected outputs diverge, the jinja expectations should be more correct (note that templates are run w/ `trim_blocks = true, lstrip_blocks = true`)
  - Sent "Refactor test-chat-template.cpp" (#11224) separately
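
The template-selection rule from the list above can be sketched as follows. This is a minimal illustration in Python, not the C++ code in this PR: the `select_chat_template` helper and the plain-dict `metadata` argument are hypothetical, and only the two key names come from the description.

```python
from typing import Optional


def select_chat_template(metadata: dict, has_tools: bool) -> Optional[str]:
    """Pick the template source for one request (hypothetical helper).

    `metadata` stands in for the model's key/value metadata; how the real
    code reads it is not shown here.
    """
    tool_use = metadata.get("tokenizer.chat_template.tool_use")
    # The tool_use variant is only considered when the request carries tools
    # and the model actually defines it.
    if has_tools and tool_use is not None:
        return tool_use
    # Otherwise fall back to the default template (None if the model has none).
    return metadata.get("tokenizer.chat_template")
```

Under these assumptions, a request without tools always gets the default template, even when the model also ships a `tool_use` variant.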
Example usage:
    # Launch in background
    ./build/bin/llama-server \
      -hfr bartowski/Qwen2.5-7B-Instruct-GGUF \
      -hff Qwen2.5-7B-Instruct-Q4_K_M.gguf \
      --jinja &

    curl http://localhost:8080/v1/chat/completions \
      -d '{
        "model": "gpt-3.5-turbo",
        "tools": [
          {
            "type": "function",
            "function": {
              "name": "ipython",
              "description": "Runs code in an ipython interpreter and returns the result of the execution after 60 seconds.",
              "parameters": {
                "type": "object",
                "properties": {
                  "code": {
                    "type": "string",
                    "description": "The code to run in the ipython interpreter."
                  }
                },
                "required": ["code"]
              }
            }
          }
        ],
        "messages": [
          {
            "role": "user",
            "content": "Print a hello world message with python (using single quotes '"'"' for strings)."
          }
        ]
      }'
Output:
    {
      "choices": [
        {
          "finish_reason": "stop",
          "index": 0,
          "message": {
            "content": "<tool_call>\n{\"name\": \"ipython\", \"arguments\": {\"code\": \"print('Hello world!')\"}}\n</tool_call>",
            "role": "assistant"
          }
        }
      ],
      "created": 1736811609,
      "model": "gpt-3.5-turbo",
      "system_fingerprint": "b4494-a57bb94e",
      "object": "chat.completion",
      "usage": {
        "completion_tokens": 25,
        "prompt_tokens": 205,
        "total_tokens": 230
      },
      "id": "chatcmpl-5YJXFVhvjoMDlLx1asuWNdSO3JVWWsUF",
      "timings": {
        "prompt_n": 1,
        "prompt_ms": 155.151,
        "prompt_per_token_ms": 155.151,
        "prompt_per_second": 6.445333900522716,
        "predicted_n": 25,
        "predicted_ms": 419.714,
        "predicted_per_token_ms": 16.78856,
        "predicted_per_second": 59.56437002339688
      }
    }

TODO:
- Add cross-testing in test-chat-template.cpp (note that minja is tested against a lot of templates in its own repo)
- Add some instructions here
- Add more server tests to exercise the template overrides.
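
Since tool call parsing is deferred to the follow-up PR, the example response above carries the call as plain text in `content`. A client could extract it roughly as sketched below. This assumes the `<tool_call>...</tool_call>` wrapper seen in that output (other templates may emit other formats), and `extract_tool_calls` is a hypothetical client-side helper, not part of llama.cpp.

```python
import json
import re

# Matches one <tool_call>...</tool_call> span and captures the text inside it.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(content: str) -> list:
    """Return the JSON objects found inside <tool_call> wrappers."""
    calls = []
    for match in TOOL_CALL_RE.finditer(content):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            # Skip payloads that are not valid JSON instead of failing.
            continue
    return calls


# The "content" string from the example response above.
content = (
    "<tool_call>\n"
    "{\"name\": \"ipython\", \"arguments\": {\"code\": \"print('Hello world!')\"}}\n"
    "</tool_call>"
)
calls = extract_tool_calls(content)
print(calls[0]["name"])               # ipython
print(calls[0]["arguments"]["code"])  # print('Hello world!')
```

Content without a wrapper yields an empty list, so the same helper can be applied to ordinary assistant replies.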