News
- 2025-03-08: Added Parler-TTS text-to-speech model. Added real time WebSocket API for audio transcription and voice chat. Added BERT embeddings.
- 2024-09-30: Added support for the Llama 3.2, Phi3, Qwen2 and gte_qwen2 models. Added embeddings endpoint. Added CPU offloading.
- 2024-08-03: Added Llama 3.1 model support.
- 2024-05-21: Added Llama 3 model support.
- 2024-01-20: Added Mixtral model support. Added fast Whisper based speech to text transcription.
- 2023-10-21: Added CUDA support in the Windows version and Mistral model support. Speculative sampling is supported. Added BNF grammar and JSON schema sampling.
- 2023-08-07: The GPU version and model conversion utilities are now freely available.
- 2023-07-21: The MPT and Llama 2 models are supported.
- 2023-06-10: New ts_chat utility to chat with language models. The Falcon and RedPajama-INCITE models are supported.
- 2023-03-26: The NLLB200 and flan UL2 models have been added. An HTML GUI is now available in ts_server.
Introduction
ts_server is a web server providing a REST API to large language models. They can be used, for example, for text completion, question answering, classification, chat, translation, image generation, audio transcription and speech synthesis. It has the following characteristics:
- Everything is included in a single binary. There are very few external dependencies (Python is not needed), so installation is easy.
- Supports many Transformer variants (GPT-J, GPT-NeoX, GPT-Neo, OPT, Fairseq GPT, M2M100, CodeGen, GPT2, T5, RWKV, LLAMA, Falcon, MPT, Llama 3.2, Mistral, Mixtral, Qwen2, Phi3, Whisper, Parler-TTS) and Stable Diffusion.
- Integrated REST JSON API for text completion, translation, image generation, audio transcription and speech synthesis. It is used by textsynth.com.
- Integrated WebSocket API for real time audio transcription and voice chat (experimental).
- Integrated HTML GUI for testing.
- Very high performance for small and large batches on CPU and GPU. Support of dynamic batching to handle a large number of simultaneous requests.
- Efficient custom 8, 4 and 3 bit quantization. Our quantized models are thoroughly evaluated on several standard tasks to ensure good performance.
- Larger models work optimally on lower cost GPUs (e.g. RTX 3090, RTX A6000) thanks to efficient quantization.
- Support of speculative sampling for even faster inference.
- Support of grammar based sampling to constrain the model output according to a BNF grammar or a JSON schema.
- Uses the LibNC library for simple tensor manipulation using the C language.
- Simple command line tools (ts_test, ts_sd, ts_chat, ts_audiototext) are provided to test the various models.
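As an illustration of the REST JSON API listed above, here is a minimal Python sketch that builds a text completion request without sending it, so it can be inspected without a running server. The base URL `http://localhost:8080`, the `/v1/engines/<name>/completions` path, the `prompt`, `max_tokens` and `schema` field names, and the engine name `llama2_7B_q4` are all assumptions modeled on the public textsynth.com API and are not stated on this page. Check the ts_server documentation for the exact endpoint and parameters.

```python
import json
import urllib.request


def build_completion_request(base_url, engine, prompt, max_tokens=100, schema=None):
    """Build (but do not send) a text completion request.

    The URL path and JSON field names are assumptions, not taken from this page.
    """
    body = {"prompt": prompt, "max_tokens": max_tokens}
    if schema is not None:
        # Assumed parameter name for JSON schema constrained sampling.
        body["schema"] = schema
    url = f"{base_url.rstrip('/')}/v1/engines/{engine}/completions"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical server address and engine name.
req = build_completion_request(
    "http://localhost:8080", "llama2_7B_q4", "Once upon a time", max_tokens=50
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# To actually send it (requires a running server):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

Keeping request construction separate from sending makes it easy to log or unit-test the payload before pointing it at a real server.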
Download
- Linux version ts_server_free-2025-03-09.tar.gz (Changelog).
- Windows version ts_server_free-2024-09-30-win64.zip (Changelog).
Documentation
Benchmarks
Text generation
100 tokens are generated with a batch size of 1 and 50 input tokens:
| Model(3) | Epyc 7313 (6) (tokens/s) | RTX A6000 (tokens/s) | RTX 4090 (tokens/s) |
|---|---|---|---|
| gptj_6B_q4 | 21.5 | 132 | 164 |
| flan_t5_xxl_q4 | 25 | 130 | 158 |
| llama2_7B_q4 | 23 | 115 | 144 |
| llama2_13B_q4 | 12.0 | 69.3 | 88 |
| gptneox_20B_q4 | 8.1 | 45.5 | 59 |
| llama2_70B_q4 | 2.5 | 15.2 | - |
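To make the throughput figures easier to interpret, the sketch below converts a few of them into the time needed to generate 100 tokens. The numbers are copied from the table above. The function name is illustrative, and the calculation assumes a constant generation rate with prompt processing time ignored.

```python
# Single-request throughput in tokens/s, copied from the table above.
TOKENS_PER_SECOND = {
    "llama2_7B_q4": {"Epyc 7313": 23.0, "RTX A6000": 115.0, "RTX 4090": 144.0},
    "llama2_13B_q4": {"Epyc 7313": 12.0, "RTX A6000": 69.3, "RTX 4090": 88.0},
}


def generation_time(tokens, tokens_per_second):
    """Seconds needed to generate `tokens` tokens at a constant rate."""
    return tokens / tokens_per_second


for model, rates in TOKENS_PER_SECOND.items():
    for device, rate in rates.items():
        seconds = generation_time(100, rate)
        print(f"{model} on {device}: {seconds:.2f} s per 100 tokens")
```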
8 simultaneous requests generating 100 tokens with 50 input tokens (equivalent to a batch size of 8):
| Model(3) | RTX A6000 (tokens/s) |
|---|---|
| llama2_7B_q4 | 783 |
| llama2_13B_q4 | 492 |
| llama2_70B_q4 | 118 |
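The benefit of batching can be estimated by comparing these figures with the single-request RTX A6000 column of the previous table. The sketch below assumes the batched figures are aggregate tokens/s over all 8 requests. The helper name is illustrative.

```python
# (single-request tokens/s, 8-request tokens/s) on the RTX A6000,
# copied from the two text generation tables.
A6000 = {
    "llama2_7B_q4": (115.0, 783.0),
    "llama2_13B_q4": (69.3, 492.0),
    "llama2_70B_q4": (15.2, 118.0),
}


def batching_summary(single, batched, batch_size=8):
    """Return (tokens/s seen by each request, aggregate throughput gain)."""
    per_request = batched / batch_size
    gain = batched / single
    return per_request, gain


for model, (single, batched) in A6000.items():
    per_request, gain = batching_summary(single, batched)
    print(f"{model}: {per_request:.1f} tokens/s per request, {gain:.1f}x aggregate throughput")
```

Under that assumption, each request runs only slightly slower than it would alone, while total throughput rises by a factor of roughly 7 to 8.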
Text to image
A single 512x512 image is generated using 50 time steps.
| Model(3) | RTX A6000 (seconds) | RTX 4090 (seconds) |
|---|---|---|
| stable diffusion 1.4 | 1.82 | 1.21 |
| stable diffusion 2.1 | 1.67 | 1.19 |
Available Models
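We provide here model files that can be used with the TextSynth Server. Each model was evaluated with the lm-evaluation-harness with the TextSynth server on a single RTX A6000 GPU.
Language Models:
| Model | Size (GB) | lambada (ppl) | lambada (acc) | | | | | Average |
|---|---|---|---|---|---|---|---|---|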
| bloom_560M | 1.1 | 29.176 | 36.8% | 35.8% | 51.4% | 63.7% | 36.0% | 44.7% |
| codegen_6B_mono_q4 | 4.4 | 69.409 | 28.0% | 35.7% | 51.1% | 60.2% | 38.0% | 42.6% |
| codegen_6B_mono_q8 | 7.7 | 67.262 | 28.1% | 35.8% | 50.8% | 60.1% | 39.1% | 42.8% |
| fairseq_gpt_13B | 26.2 | 3.567 | 71.9% | 72.7% | 67.5% | 77.6% | 70.1% | 71.9% |
| fairseq_gpt_13B_q4 | 7.9 | 3.646 | 71.2% | 72.5% | 67.6% | 77.4% | 70.6% | 71.9% |
| fairseq_gpt_13B_q8 | 14.2 | 3.565 | 71.8% | 72.7% | 67.2% | 77.7% | 70.0% | 71.9% |
| flan_t5_base | 0.5 | 12.891 | 54.2% | 36.5% | 54.7% | 65.8% | 62.1% | 54.7% |
| flan_t5_base_q8 | 0.3 | 13.098 | 54.2% | 36.4% | 54.2% | 65.7% | 61.8% | 54.5% |
| flan_t5_small | 0.2 | 23.343 | 46.7% | 29.2% | 50.0% | 62.4% | 47.9% | 47.2% |
| flan_t5_small_q8 | 0.1 | 23.449 | 46.7% | 29.2% | 49.7% | 62.4% | 48.2% | 47.2% |
| flan_t5_xxl_q4 | 6.5 | 3.010 | 77.7% | 71.5% | 73.4% | 77.6% | 71.8% | 74.4% |
| flan_t5_xxl_q8 | 12.0 | 3.049 | 77.8% | 72.1% | 75.1% | 77.8% | 73.1% | 75.2% |
| flan_ul2_20B_q4 | 11.3 | - | 74.1% | 24.3% | 51.1% | 49.9% | 78.8% | 55.6% |
| flan_ul2_20B_q8 | 20.9 | - | 74.4% | 24.4% | 52.0% | 50.6% | 77.3% | 55.7% |
| gpt2_117M | 0.3 | 40.110 | 32.9% | 31.1% | 52.1% | 62.9% | 27.3% | 41.3% |
| gpt2_345M | 0.7 | 18.272 | 43.5% | 39.4% | 53.3% | 67.7% | 43.1% | 49.4% |
| gpt2_345M_q8 | 0.5 | 18.452 | 43.1% | 39.4% | 53.1% | 67.5% | 41.9% | 49.0% |
| gpt2_774M | 1.6 | 12.966 | 47.8% | 45.4% | 55.6% | 70.4% | 48.5% | 53.5% |
| gpt2_774M_q8 | 1.0 | 12.928 | 47.9% | 45.4% | 55.3% | 70.3% | 48.2% | 53.4% |
| gpt2_1558M | 3.1 | 10.637 | 51.3% | 50.8% | 58.4% | 70.8% | 53.2% | 56.9% |
| gpt2_1558M_q8 | 1.8 | 10.655 | 51.2% | 50.8% | 58.6% | 70.8% | 53.2% | 56.9% |
| gptj_6B | 12.1 | 4.124 | 69.0% | 66.2% | 64.8% | 75.5% | 66.9% | 68.5% |
| gptj_6B_q4 | 3.8 | 4.153 | 68.9% | 65.7% | 63.9% | 74.4% | 67.0% | 68.0% |
| gptj_6B_q8 | 6.6 | 4.122 | 69.1% | 66.2% | 64.4% | 75.4% | 66.4% | 68.3% |
| gptneox_20B | 41.1 | 3.657 | 72.6% | 71.4% | 65.5% | 77.5% | 73.3% | 72.0% |
| gptneox_20B_q4 | 12.2 | 3.711 | 72.0% | 69.3% | 64.8% | 76.7% | 70.8% | 70.7% |
| gptneox_20B_q8 | 22.1 | 3.659 | 72.6% | 71.3% | 65.8% | 77.3% | 72.9% | 72.0% |
| llama_7B | 13.5 | 3.463 | 73.6% | 76.2% | 70.4% | 78.1% | 75.4% | 74.7% |
| llama_7B_q4 | 4.0 | 3.549 | 73.2% | 75.5% | 70.4% | 78.0% | 74.7% | 74.4% |
| llama_7B_q8 | 7.3 | 3.453 | 73.7% | 76.1% | 70.2% | 78.0% | 75.5% | 74.7% |
| llama_13B_q4 | 7.6 | 3.130 | 77.1% | 78.6% | 72.2% | 78.3% | 77.8% | 76.8% |
| llama_13B_q8 | 14.0 | 3.178 | 76.5% | 79.1% | 73.2% | 79.1% | 77.1% | 77.0% |
| llama_30B_q4 | 18.7 | 2.877 | 77.5% | 82.4% | 75.7% | 80.2% | 80.2% | 79.2% |
| llama_30B_q8 | 34.8 | 2.853 | 77.7% | 82.7% | 76.3% | 80.3% | 80.4% | 79.5% |
| llama_65B_q4 | 37.2 | 2.760 | 78.5% | 83.9% | 76.6% | 81.4% | 83.2% | 80.7% |
| opt_125M | 0.3 | 26.028 | 37.9% | 31.3% | 50.2% | 63.2% | 23.4% | 41.2% |
| opt_30B_q4 | 17.8 | 3.656 | 71.5% | 72.1% | 68.0% | 77.4% | 69.9% | 71.8% |
| opt_30B_q8 | 32.6 | 3.628 | 71.6% | 72.3% | 68.2% | 77.7% | 71.4% | 72.3% |
| opt_66B_q4 | 38.2 | 3.308 | 73.4% | 74.4% | 68.4% | 78.5% | 75.0% | 73.9% |
| pythia_deduped_70M | 0.1 | 96.126 | 25.6% | 28.3% | 54.4% | 60.4% | 13.1% | 36.3% |
| pythia_deduped_160M | 0.3 | 26.380 | 36.9% | 32.3% | 51.4% | 63.8% | 23.2% | 41.5% |
| pythia_deduped_410M | 0.8 | 10.827 | 51.7% | 40.8% | 54.0% | 67.2% | 43.0% | 51.4% |
| pythia_deduped_410M_q8 | 0.5 | 10.729 | 51.8% | 40.7% | 53.8% | 67.1% | 42.7% | 51.2% |
| pythia_deduped_1B | 2.0 | 7.273 | 58.5% | 49.0% | 54.5% | 71.0% | 49.9% | 56.6% |
| pythia_deduped_1B_q8 | 1.2 | 7.286 | 58.4% | 49.0% | 54.9% | 70.9% | 49.0% | 56.5% |
| pythia_deduped_1.4B | 2.8 | 6.546 | 63.1% | 52.2% | 57.1% | 72.7% | 52.6% | 59.5% |
| pythia_deduped_1.4B_q8 | 1.6 | 6.577 | 63.3% | 52.1% | 55.7% | 73.1% | 53.0% | 59.4% |
| pythia_deduped_2.8B | 5.6 | 4.787 | 67.1% | 61.6% | 60.9% | 74.4% | 65.5% | 65.9% |
| pythia_deduped_2.8B_q8 | 3.1 | 4.778 | 66.9% | 61.5% | 61.2% | 74.5% | 65.6% | 66.0% |
| pythia_deduped_6.9B | 13.7 | 4.195 | 69.1% | 65.7% | 63.9% | 75.1% | 66.1% | 68.0% |
| pythia_deduped_6.9B_q4 | 4.3 | 4.344 | 68.3% | 65.0% | 62.5% | 75.3% | 66.3% | 67.5% |
| pythia_deduped_6.9B_q8 | 7.5 | 4.187 | 69.4% | 65.7% | 63.6% | 75.5% | 66.8% | 68.2% |
| pythia_deduped_12B | 23.7 | 3.854 | 70.9% | 69.2% | 63.9% | 76.3% | 70.8% | 70.2% |
| pythia_deduped_12B_q4 | 7.2 | 4.187 | 69.2% | 68.5% | 63.1% | 76.4% | 69.6% | 69.4% |
| pythia_deduped_12B_q8 | 12.8 | 3.857 | 70.9% | 69.2% | 64.2% | 76.1% | 70.9% | 70.3% |
| rwkv_14B | 28.3 | 3.819 | 71.6% | 70.2% | 63.1% | 77.5% | 47.2% | 65.9% |
| rwkv_14B_q4 | 8.5 | 4.076 | 68.3% | 69.8% | 63.1% | 77.1% | 45.0% | 64.7% |
| rwkv_14B_q8 | 15.3 | 3.806 | 71.9% | 70.2% | 63.0% | 77.5% | 47.1% | 65.9% |
| rwkv_7B | 16 | 4.396 | 67.5% | 65.6% | 61.9% | 75.6% | 39.7% | 62.1% |
| rwkv_7B_q4 | 4.6 | 4.939 | 64.7% | 64.8% | 61.2% | 75.4% | 38.4% | 60.9% |
| rwkv_7B_q8 | 8.0 | 4.395 | 67.5% | 65.6% | 61.6% | 75.9% | 40.2% | 62.2% |
| RedPajama-INCITE-7B_q4 | 4.3 | 4.006 | 71.0% | 69.7% | 64.6% | 76.3% | 71.7% | 70.7% |
| RedPajama-INCITE-7B_q8 | 7.5 | 3.910 | 71.4% | 70.4% | 64.3% | 77.0% | 71.9% | 71.0% |
| falcon_40B_q4 | 24.6 | 2.844 | 77.6% | 82.5% | 76.2% | 82.2% | 78.8% | 79.5% |
| falcon_40B_q8 | 45.0 | 2.799 | 77.9% | 82.7% | 76.7% | 82.2% | 80.4% | 80.0% |
| falcon_7B | 14.4 | 3.359 | 75.0% | 76.2% | 67.3% | 79.4% | 72.1% | 74.0% |
| falcon_7B_q4 | 4.6 | 3.444 | 73.9% | 75.8% | 67.5% | 79.7% | 71.6% | 73.7% |
| falcon_7B_q8 | 7.9 | 3.368 | 75.0% | 76.2% | 66.9% | 79.5% | 71.9% | 73.9% |
| mpt_30B_q4 | 17.8 | 3.219 | 78.9% | 79.4% | 70.1% | 79.8% | 79.8% | 77.6% |
| mpt_30B_q8 | 32.6 | 3.062 | 80.7% | 79.8% | 70.7% | 80.0% | 79.9% | 78.2% |
| mpt_7B_q4 | 4.3 | 3.949 | 73.1% | 75.7% | 67.4% | 79.0% | 75.9% | 74.2% |
| mpt_7B_q8 | 7.5 | 3.850 | 73.2% | 76.2% | 68.5% | 79.1% | 76.4% | 74.7% |
| llama2_7B | 13.5 | 3.428 | 74.5% | 76.2% | 69.7% | 78.4% | 77.2% | 75.2% |
| llama2_7B_q4 | 4.0 | 3.487 | 73.5% | 75.5% | 69.9% | 77.6% | 77.8% | 74.9% |
| llama2_13B | 26.0 | 3.051 | 77.2% | 79.6% | 72.1% | 78.9% | 79.3% | 77.4% |
| llama2_13B_q4 | 7.6 | 3.109 | 77.0% | 79.0% | 72.6% | 79.5% | 78.9% | 77.4% |
| llama2_70B_q4 | 39.3 | 2.646 | 80.6% | 84.0% | 78.7% | 82.0% | 83.4% | 81.7% |
| llama2_7B_q3 | 3.2 | 3.566 | 72.7% | 74.1% | 68.0% | 77.6% | 77.5% | 74.0% |
| llama2_13B_q3 | 6.1 | 3.148 | 76.5% | 77.9% | 71.4% | 78.4% | 77.8% | 76.4% |
| llama2_70B_q3 | 30.8 | 2.638 | 79.9% | 82.9% | 77.7% | 81.7% | 82.6% | 80.9% |
| mistral_7B | 14.5 | 3.178 | 76.2% | 81.0% | 74.2% | 80.4% | 80.9% | 78.5% |
| mistral_7B_q4 | 4.3 | 3.412 | 74.9% | 80.1% | 73.9% | 80.7% | 80.3% | 78.0% |
| mistral_7B_q8 | 7.8 | 3.174 | 76.0% | 81.0% | 73.6% | 80.4% | 80.7% | 78.3% |
| mixtral_47B_q3 | 19.3 | 2.851 | 76.8% | 82.2% | 75.6% | 81.3% | 79.8% | 79.1% |
| mixtral_47B_q4 | 26.5 | 2.811 | 78.6% | 83.3% | 76.0% | 82.6% | 80.4% | 80.2% |
| mixtral_47B_q8 | 49.7 | 2.790 | 79.3% | 83.9% | 78.1% | 82.0% | 80.7% | 80.8% |
| llama3_8B | 16.1 | 3.107 | 76.8% | 79.1% | 73.1% | 79.7% | 80.7% | 77.9% |
| llama3_8B_q4 | 5.5 | 3.291 | 75.2% | 78.2% | 73.5% | 78.8% | 80.4% | 77.2% |
| llama3_70B | 141.1 | 2.597 | 80.6% | 84.9% | 80.1% | 82.3% | 84.0% | 82.4% |
| llama3_70B_q4 | 41.7 | 2.619 | 80.4% | 84.4% | 80.3% | 82.1% | 83.1% | 82.1% |
| llama3.1_8B | 16.1 | 3.150 | 76.6% | 78.8% | 73.9% | 79.9% | 80.8% | 78.0% |
| llama3.1_70B | 141.1 | 2.670 | 80.1% | 84.9% | 79.4% | 83.0% | 83.7% | 82.2% |
| llama3.1_70B_q4 | 41.8 | 2.713 | 79.9% | 84.4% | 79.4% | 82.6% | 83.4% | 81.9% |
| llama3.1_70B_q3 | 31.1 | 2.865 | 78.0% | 83.0% | 78.4% | 82.0% | 83.6% | 81.0% |
| llama3.1_405B_q4 | 232.4 | 2.454 | 81.6% | 87.0% | 82.4% | 83.8% | 83.8% | 83.7% |
| qwen2_7B | 15.2 | 3.647 | 72.3% | 78.3% | 72.3% | 79.9% | 80.9% | 76.8% |
| qwen2_7B_q4 | 5.3 | 3.712 | 72.0% | 77.8% | 71.3% | 79.7% | 81.7% | 76.5% |
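The last column of each row appears to be the arithmetic mean of the five percentage columns that precede it. This is an inference from the numbers, not something the page states. The sketch below checks it on three rows copied from the table. Small differences are expected because the listed values are rounded.

```python
# (five score columns in %, listed last column in %), copied from the table above.
ROWS = {
    "bloom_560M": ([36.8, 35.8, 51.4, 63.7, 36.0], 44.7),
    "llama_7B": ([73.6, 76.2, 70.4, 78.1, 75.4], 74.7),
    "flan_ul2_20B_q4": ([74.1, 24.3, 51.1, 49.9, 78.8], 55.6),
}


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


for name, (scores, listed) in ROWS.items():
    print(f"{name}: computed {mean(scores):.2f}, listed {listed}")
```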
Chat Models:
| Model | Size (GB) | MMLU |
|---|---|---|
| llama3_8B_instruct | 16.1 | 67.3% |
| llama3_8B_instruct_q4 | 5.5 | 65.7% |
| llama2_7B_chat_q4 | 3.9 | 45.3% |
| llama2_13B_chat_q4 | 7.6 | 51.2% |
| llama2_70B_chat_q4 | 39.3 | 61.1% |
| mistral_7B_instruct_q4 | 3.9 | 53.0% |
| mixtral_47B_instruct_q4 | 26.5 | 67.6% |
| llama3.1_8B_instruct | 16.1 | 68.6% |
| llama3.1_8B_instruct_q4 | 5.6 | 67.1% |
| llama3.1_70B_instruct_q4 | 41.8 | 82.4% |
| phi3_mini_4k_instruct | 7.6 | 70.1% |
| phi3_mini_4k_instruct_q4 | 2.3 | 67.8% |
| phi3.5_mini_instruct | 7.7 | 67.7% |
| phi3.5_mini_instruct_q4 | 2.4 | 65.9% |
| qwen2_7B_instruct | 15.2 | 70.3% |
| qwen2_7B_instruct_q4 | 5.3 | 68.7% |
| llama3.3_70B_instruct_q4 | 41.8 | 81.9% |
Translation Models:
| Model | Size (GB) | Description |
|---|---|---|
| m2m100_1_2B_q8 | 1.6 | Translation between 100 languages |
| nllb200_1.3B_q8 | 2.0 | Translation between 200 languages |
| nllb200_3.3B_q8 | 4.6 | Translation between 200 languages |
| madlad400_7B_q4 | 5.7 | Translation between 400 languages |
| madlad400_3B_q4 | 2.2 | Translation between 400 languages |
Embeddings Models:
| Model | Size (GB) | Description |
|---|---|---|
| gte_qwen2_1.5B_instruct_q8 | 1.9 | Qwen2 GTE embeddings |
| bge_large_en_v1.5_q8 | 0.4 | BGE-Large EN v1.5 embeddings |
Text-to-Image Models:
| Model | Size (GB) | Description |
|---|---|---|
| sd_v1.4 | 2.1 | Stable Diffusion text-to-image version 1.4 |
| sd_v2.1 | 2.6 | Stable Diffusion text-to-image version 2.1 |
Audio Models:
| Model | Size (GB) | Description |
|---|---|---|
| whisper_large_v3_q8 | 1.8 | Whisper large v3 speech-to-text transcription |
| parler_tts_large_v1_q8 | 1.1 | Parler-TTS text-to-speech model |
| dac_mono | 0.3 | Descript Audio Codec (used with Parler-TTS) |
SHA256 of all the models: sha256.txt.
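A downloaded model file can be checked against the published digests with a few lines of Python. This is a minimal sketch that assumes sha256.txt uses the common `sha256sum` layout (`<hex digest>  <file name>` per line), which this page does not confirm. The file name `example_model.bin` is a hypothetical stand-in.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksums(text):
    """Parse sha256sum-style lines into a {file name: digest} dict."""
    checksums = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            digest, name = parts
            # sha256sum marks binary mode with a leading '*'.
            checksums[name.strip().lstrip("*")] = digest.lower()
    return checksums


def verify(path, name, checksums):
    """True if the file at `path` matches the digest listed for `name`."""
    expected = checksums.get(name)
    return expected is not None and sha256_of_file(path) == expected


# Self-check on a throwaway file standing in for a downloaded model.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example_model.bin")
    with open(path, "wb") as f:
        f.write(b"hello")
    listing = f"{hashlib.sha256(b'hello').hexdigest()}  example_model.bin\n"
    ok = verify(path, "example_model.bin", parse_checksums(listing))
    print("checksum ok:", ok)
```

Reading in chunks keeps memory use constant, which matters for model files of tens of gigabytes.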
Notes:
1. Some models have restrictive licenses. In particular, OPT, Vicuna and NLLB200 cannot be used commercially. BLOOM, Stable Diffusion, Llama 2, Llama 3, Llama 3.1 can be used commercially but have use limitations.
2. For the larger models we don't provide the unquantized version when it is too large for consumer GPUs or when the quantized version gives the same performance as the unquantized version.
3. The q8 suffix indicates that the model was 8 bit quantized. The q4 suffix indicates that the model was 4 bit quantized. The q3 suffix indicates that the model was 3 bit quantized. Unquantized models use either float16 or bfloat16 parameters.
4. File size on disk (1 GB = 10^9 bytes). The amount of CPU or GPU RAM needed to run the model is close to this value.
5. lambada perplexity (ppl) values are comparable only for models using the same tokenizer, so the lambada accuracy (acc) should be used when comparing all models.
6. The speed is measured on an AMD Epyc 7313 CPU using 16 threads (ts_test -T 16).
7. MMLU was evaluated using 5 shots.
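The notes on quantization and file size allow a rough back-of-the-envelope estimate, sketched below with 1 GB taken as 10^9 bytes. It assumes that an unquantized file stores 2 bytes per parameter (float16 or bfloat16) with negligible overhead, so the effective bits per parameter of a quantized file is only an approximation. The GPU memory sizes (48 GB for the RTX A6000, 24 GB for the RTX 4090) are nominal figures that do not come from this page, and the fit check ignores any extra memory needed at run time.

```python
GB = 10**9  # 1 GB = 10^9 bytes, as in the notes


def parameter_count(unquantized_size_gb):
    """Approximate parameter count, assuming 2 bytes per parameter."""
    return unquantized_size_gb * GB / 2


def bits_per_parameter(quantized_size_gb, n_params):
    """Effective storage cost of a quantized file, in bits per parameter."""
    return quantized_size_gb * GB * 8 / n_params


def fits(model_size_gb, memory_gb):
    """Rough check: required RAM is taken to be close to the file size."""
    return model_size_gb < memory_gb


# File sizes copied from the language model table.
n = parameter_count(13.5)  # llama2_7B
print(f"llama2_7B: about {n / 1e9:.2f} billion parameters")
print(f"llama2_7B_q4: about {bits_per_parameter(4.0, n):.2f} bits per parameter")
print(f"llama2_7B_q3: about {bits_per_parameter(3.2, n):.2f} bits per parameter")

# llama2_70B_q4 is 39.3 GB on disk.
print("fits in 48 GB:", fits(39.3, 48))
print("fits in 24 GB:", fits(39.3, 24))
```

The last check is consistent with the missing RTX 4090 entry for llama2_70B_q4 in the benchmark table, although the page does not give the reason for that gap.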
Fabrice Bellard - https://bellard.org/