Local GitHub Copilot with Lemonade Server on Windows

This guide is for Windows, there's a Linux version here.

Perhaps you, like me, saw the specs for AMD Ryzen AI Max (Strix Halo) processors and thought 'cool I can run a local LLM as a coding assistant' on a general purpose PC. Then once it arrived you looked at it and thought 'actually I have absolutely no idea how to do that'.

So here's the quickstart guide that I wish I had when I first unwrapped my new Framework Desktop.

1. Prerequisites

A Strix Halo - I'm assuming you've already completed this step, if not let's pause while you go shopping and wait for delivery.
Windows 11.
VSCode- Again, I assume you've got this installed and setup.
GitHub Copilot - You'll need the Copilot Chat extension, installed and working which will also need you to sign up for at least the free plan.

2. Get AMD Adrenalin

Adrenalin is the AMD equivalent of Nvidia Geforce, ~~a bloated mess of marketing dark patterns and~~ a convenient utility for updating drivers and managing configuration. If this isn't installed by your manufacturer, it didn't come with my Framework Desktop, you can grab it from the link above. Next make sure you have the latest chipset driver and software versions via Settings > System > Manage Updates.

For me that's AMD Software: Adrenalin Edition Version 26.1.1 and AMD Ryzen Chipset Driver: 7.11.26.2142.

3. Configuring CPU/GPU Memory split

The Strix Halo has a unified memory architecture which means that it's (up to) 128GB of memory is available to both the CPU and the GPU, kind of, in fact up to 96GB can be reserved for the GPU on Windows. You can configure the memory configuration in Adrenalin via Performance > Tuning > Variable Graphics Memory.

It's tempting at this point to head for the Custom option and reserve 96GB for the GPU but that can actually result in problems loading models as explained in the Lemonade FAQ.

On Windows, the GPU can access both unified RAM and dedicated GPU RAM, but the CPU is blocked from accessing dedicated GPU RAM. For this reason, allocating too much dedicated GPU RAM can interfere with model loading, which requires the CPU to access a substantial amount unified RAM.

So instead let's set the Dedicated Graphics Memory/Remaining System Memory to a 64GB/64GB split. With this configuration the GPU can still use up to 96GB but we avoid starving the CPU of memory.

4. Get Lemonade

What's that?

Lemonade Server is an Open Source project from AMD that bundles everything you need to run LLMs locally. It also includes an HTTP API that is OpenAI compatible and web UI and CLI for downloading, loading, unloading models and checking server stats and status.

Install

We can download an msi installer from lemonade-server.ai but I prefer to install it using winget.

# Install Lemonade
winget install AMD.LemonadeServer

# Reload PowerShell to refresh your path
pwsh

winget install

Now we should have the lemonade cli available, let's confirm by checking its version.

lemonade -v

lemonade-server version

Download a model

Great, let's use the pull command to download a model. We're going to start with Qwen3-Coder-30B-A3B-Instruct-GGUF, it's not the most powerful or modern model but it's coding focused, a reasonable size (~18GB) and supports tool calling which we need for Copilot.

lemonade pull Qwen3-Coder-30B-A3B-Instruct-GGUF

Download Qwen3 Coder

OK, that's going to take a while to download so let's take a moment to talk about a couple of important concepts.

Context Size

This is the maximum number of tokens that the model can process at any one time, you can think of it as the models working memory. It's measured in tokens which, in English, map to approximately 4 characters and includes both the input you send to the model and the response it returns. As the context grows the processing and memory costs grow which in turn is going to mean more latency, however a larger context allows the model to provide better quality responses. We can control the maximum size of the context when running a model as well as setting it globally for the Lemonade server.

Modality, Recipes and Backends

Lemonade supports several modalities (types of data processing) currently including Text generation, Speech-to-text, Text-to-speech and Image generation. For coding assistance we're interested in the Text generation modality. For each modality there is one or more recipe available and each recipe is supported by one or more backend which will run on either the CPU, GPU or NPU. We're only interested in running on the GPU, CPU is too slow and NPU is too small.

We can use the cli to get a list of supported recipes and see which versions you have installed. You can check out all the supported configurations in the Lemonade readme but the tldr; is that we're only interested in the llamacpp recipe on either the vulkan or rocm backend.

To check what recipe and backends are supported and installed we can use the cli.

lemonade backends

List available recipes

The Action column shows the command we can use to install each Backend, let's start with llamacpp and rocm

Running the model

OK, now we've got our model downloaded and we understand a little about how it's going to run, let's spin up a server and try chatting to it.

lemonade load Qwen3-Coder-30B-A3B-Instruct-GGUF --ctx-size 131072

Load Qwen3 Coder

Lemonade starts a server, loads the model and we also see a notification popup and an icon appear in the system tray to show that the server is running.

We can see details of the server and the model we've just loaded via the status command.

lemonade status

You can see that the model is loaded to the gpu using the llamacpp recipe and the server is running on port 13305.

And now we can access the UI for the server at http://localhost:13305/.

On the left you can see the Model Manager which shows that our Qwen3-Coder-30B-A3B-Instruct-GGUF model is loaded and we can browse some pre-selected models available for download. On the right we can chat to that model via Lemonade Chat, let's get it to write some code:

Write a single Python file that implements Conway's Game of Life running in the terminal. The grid should be 40×20, initialized randomly. Each generation should render as a cleared and redrawn frame. Use O for live cells and . for dead cells. The simulation should run at roughly 10 frames per second. Include the generation count displayed above the grid. The program should run indefinitely until the user presses Ctrl+C.

Now we've got our server running and a model loaded let's move on to getting it connected to Copilot. Fortunately the Lemonade team have a VSCode extension for exactly that purpose. We can search for and install it direectly from the extensions marketplace inside VSCode.

With the extension installed we can now select our locally running model from Copilot Chat -> Model Selector -> Manage Models

And now we can ask our local model to do anything we would ask GitHub Copilot to do. For example here's me asking it to review some horrible Go code I wrote a long time ago.

And the result.

In the next post I'll write up some tips on better models, customising how the model is run for better performance and how to get this working in WSL.