Qwen3.5 Fine-tuning Guide | Unsloth Documentation


Qwen3.5 Fine-tuning Guide

Learn how to fine-tune Qwen3.5 LLMs locally with Unsloth.

You can now fine-tune the Qwen3.5 model family (0.8B, 2B, 4B, 9B, 27B, 35B‑A3B, 122B‑A10B) with Unsloth. Support includes both vision and text fine-tuning. Qwen3.5‑35B‑A3B bf16 LoRA works on 74GB VRAM.

  • Unsloth makes Qwen3.5 train 1.5× faster and uses 50% less VRAM than FA2 setups.

  • Qwen3.5 bf16 LoRA VRAM use: 0.8B: 3GB • 2B: 5GB • 4B: 10GB • 9B: 22GB • 27B: 56GB

  • Fine-tune 0.8B, 2B and 4B bf16 LoRA via our free Google Colab notebooks:

  • If you want to preserve reasoning ability, mix reasoning-style examples with direct answers (keep a minimum of 75% reasoning). Otherwise you can omit reasoning entirely.

  • Full fine-tuning (FFT) works as well, but note it will use about 4× more VRAM.

  • Qwen3.5 is powerful for multilingual fine-tuning as it supports 201 languages.

  • After fine-tuning, you can export to GGUF (for llama.cpp/Ollama/LM Studio/etc.) or vLLM

If you’re on an older version (or fine-tuning locally), update first:

pip install --upgrade --force-reinstall --no-cache-dir unsloth unsloth_zoo

Please use transformers v5 for Qwen3.5. Older versions will not work. Unsloth automatically uses transformers v5 by default now (except for Colab environments).

If training seems slower than usual, it’s because Qwen3.5 uses custom Mamba Triton kernels. Compiling those kernels can take longer than normal, especially on T4 GPUs.

QLoRA (4-bit) training is not recommended on the Qwen3.5 models, whether MoE or dense, due to higher-than-normal quantization error.

MoE fine-tuning (35B, 122B)

For MoE models like Qwen3.5‑35B‑A3B / 122B‑A10B / 397B‑A17B:

  • Best to use bf16 setups (e.g. LoRA or full fine-tuning) (MoE QLoRA 4‑bit is not recommended due to BitsandBytes limitations).

  • Unsloth’s MoE kernels are enabled by default and can use different backends; you can switch with UNSLOTH_MOE_BACKEND.

  • Router-layer fine-tuning is disabled by default for stability.

  • Qwen3.5‑122B‑A10B bf16 LoRA works on 256GB VRAM. If you're using multiple GPUs, add device_map = "balanced" or follow our multi-GPU guide.

Below is a minimal SFT recipe (works for “text-only” fine-tuning). See also our vision fine-tuning section.
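A minimal sketch of such a recipe, assuming Unsloth's `FastLanguageModel` API and TRL's `SFTTrainer`. The checkpoint name, dataset, and hyperparameters below are illustrative placeholders, not prescribed values:

```python
from unsloth import FastLanguageModel
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset

# Load the base model in bf16 (QLoRA / 4-bit is not recommended for Qwen3.5).
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Qwen3.5-4B",      # placeholder: pick your Qwen3.5 variant
    max_seq_length = 2048,
    load_in_4bit = False,       # bf16 LoRA
)

# Attach LoRA adapters to the usual attention and MLP projections.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    lora_alpha = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
)

dataset = load_dataset("your_dataset", split = "train")  # placeholder dataset

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    args = SFTConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        learning_rate = 2e-4,
        max_steps = 60,
        output_dir = "outputs",
    ),
)
trainer.train()
```

If your dataset preserves reasoning traces, keep them in the formatted text so the 75% reasoning mix mentioned above is reflected in what the trainer actually sees.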

Qwen3.5 is a “Causal Language Model with Vision Encoder” (a unified VLM), so ensure you have the usual vision dependencies installed (torchvision, pillow) if needed, and keep Transformers up to date — use the latest Transformers for Qwen3.5.

If you'd like to do GRPO, it works in Unsloth if you disable fast vLLM inference and use Unsloth inference instead. Follow our Vision RL notebook examples.

If you OOM:

  • Drop per_device_train_batch_size to 1 and/or reduce max_seq_length.
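For instance, a low-memory training configuration might look like the following sketch (values are illustrative; gradient accumulation is one common way to recover the effective batch size after dropping the per-device batch):

```python
from trl import SFTConfig

args = SFTConfig(
    per_device_train_batch_size = 1,  # smallest per-device batch
    gradient_accumulation_steps = 8,  # keep effective batch size at 8
    output_dir = "outputs",
)
# Also pass a smaller max_seq_length (e.g. 1024) when loading the model,
# since shorter sequences reduce activation memory.
```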

Loader example for MoE (bf16 LoRA):
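A sketch of the loader, assuming `FastLanguageModel.from_pretrained` and the 35B‑A3B checkpoint name (swap in 122B‑A10B as needed):

```python
import torch
from unsloth import FastLanguageModel

# bf16 LoRA setup for an MoE checkpoint (QLoRA 4-bit is not recommended).
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Qwen3.5-35B-A3B",  # placeholder: your MoE variant
    max_seq_length = 2048,
    dtype = torch.bfloat16,
    load_in_4bit = False,
    # device_map = "balanced",  # uncomment when sharding across multiple GPUs
)
```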

Once loaded, you’ll attach LoRA adapters and train similarly to the SFT example above.

Unsloth supports vision fine-tuning for the multimodal Qwen3.5 models. Use the Qwen3.5 notebooks below and change the model names to your desired Qwen3.5 model.

Disabling Vision / Text-only fine-tuning:

When fine-tuning vision models, we now allow you to select which parts of the model to fine-tune. You can choose to fine-tune only the vision layers, only the language layers, or only the attention / MLP layers. All are enabled by default.
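Layer selection can be sketched with Unsloth's `FastVisionModel` flags; freezing the vision layers gives text-only fine-tuning of the VLM (model name is a placeholder):

```python
from unsloth import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Qwen3.5-4B",  # placeholder: your Qwen3.5 variant
    load_in_4bit = False,
)

model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers     = False,  # freeze vision encoder -> text-only
    finetune_language_layers   = True,
    finetune_attention_modules = True,
    finetune_mlp_modules       = True,
    r = 16,
    lora_alpha = 16,
)
```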

In order to fine-tune or train Qwen3.5 with multi-images, view our multi-image vision guide.

Saving / export fine-tuned model

You can view our specific inference / deployment guides for llama.cpp, vLLM, llama-server, Ollama, LM Studio or SGLang.

Unsloth supports saving directly to GGUF:
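For example, assuming the `save_pretrained_gguf` helper and a placeholder output directory:

```python
# Merge the LoRA adapters and export as GGUF; quantization_method
# selects the llama.cpp quantization type (e.g. "q8_0", "q4_k_m").
model.save_pretrained_gguf("qwen3.5-finetune", tokenizer,
                           quantization_method = "q8_0")
```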

Or push GGUFs to Hugging Face:
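A sketch using the `push_to_hub_gguf` helper (repo name and token are placeholders):

```python
model.push_to_hub_gguf(
    "your-username/qwen3.5-finetune",  # placeholder Hugging Face repo
    tokenizer,
    quantization_method = "q4_k_m",
    token = "hf_...",                  # your Hugging Face write token
)
```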

If the exported model behaves worse in another runtime, Unsloth flags the most common cause: wrong chat template / EOS token at inference time (you must use the same chat template you trained with).

vLLM version 0.16.0 does not support Qwen3.5. Wait for 0.17.0 or try the nightly release.

To save to 16-bit for vLLM, use:
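A sketch, assuming the `save_pretrained_merged` helper (the output directory is a placeholder):

```python
# Merge LoRA adapters into the base weights and save in 16-bit,
# ready to serve with vLLM.
model.save_pretrained_merged("qwen3.5-merged", tokenizer,
                             save_method = "merged_16bit")
```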

To save just the LoRA adapters, either use:
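The standard PEFT-style save works for adapters only (directory name is a placeholder):

```python
# Saves only the LoRA adapter weights plus the tokenizer,
# not the merged base model.
model.save_pretrained("lora_adapters")
tokenizer.save_pretrained("lora_adapters")
```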

Or use our builtin function:
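That is, a sketch using the same `save_pretrained_merged` helper with the LoRA-only save method:

```python
model.save_pretrained_merged("lora_adapters", tokenizer,
                             save_method = "lora")
```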

For more details read our inference guides:
