Zero-Cost Code Reviews: Self-Hosted Qwen2.5-Coder on GitLab CI

How we eliminated API costs while keeping our code private and reviews comprehensive

Every merge request now gets an instant AI code review. The catch? It costs us exactly $0 in API fees.

If you’re using Claude, ChatGPT, or any cloud-based LLM for code reviews, you’re probably familiar with the pain points:

  • API costs add up fast — At $0.10–0.50 per review, 100 daily MRs means $300–1,500/month
  • Privacy concerns — Your proprietary code leaves your network and hits third-party servers
  • Rate limits — Nothing kills CI pipeline velocity like hitting API throttling
  • Vendor lock-in — Pricing changes or service outages are outside your control

We solved all of these by running Qwen2.5-Coder:7b locally on a Mac Mini, integrated directly into our GitLab CI pipeline. Here’s exactly how we did it — including the batching system, prompt engineering, and optimizations that make it production-ready.

The Architecture

GitLab CI Pipeline
  ┌─────────┐      ┌─────────┐      ┌─────────┐
  │  Build  │ ───► │ ai_bot  │ ───► │  Tests  │
  │  Stage  │      │   Job   │      │  Stage  │
  └─────────┘      └────┬────┘      └─────────┘
                        │
                        │  HTTP request (port 11434)
                        ▼
Mac Mini (in-house server)
  Ollama service
  └── qwen2.5-coder:7b (~5 GB RAM)
        • Code-specialized (trained on 5.5T tokens)
        • Apache 2.0 license (commercial use OK)
        • 128K context window
  IP: internal network | Auto-start: launchd
                        │
                        │  JSON response + GitLab API
                        ▼
Merge Request
  ## 🤖 AI Code Review Summary
  ### ✅ APPROVE WITH COMMENTS
  **Critical:** None
  **Important:** 2 issues found
  **Suggestions:** 1 improvement
  *Powered by Ollama (qwen2.5-coder:7b) • 8.5s • $0*

Why Mac Mini?

Apple Silicon is remarkably efficient for LLM inference:

  • M1/M2 chips — Unified memory architecture eliminates GPU memory bottlenecks
  • Power efficiency — ~$5/month in electricity vs $50+ for a GPU server
  • Quiet operation — Can sit under a desk without dedicated cooling
  • One-time cost — $599–999 vs ongoing cloud API fees

Why Ollama?

Ollama makes running local LLMs trivially easy:

  • One-command setup — `ollama pull qwen2.5-coder:7b` is all it takes
  • OpenAI-compatible API — Drop-in replacement for existing integrations
  • Model management — Easy switching between models for testing
  • Automatic optimization — Handles memory management and batching

Why Qwen2.5-Coder?

We tested Llama3.1, CodeLlama, DeepSeek Coder, and Qwen2.5-Coder. Here’s why Qwen won:

Qwen2.5-Coder at 7B parameters outperforms CodeLlama-34B on code-generation benchmarks (CodeLlama-34B scores 48.8% on HumanEval) while running roughly 5x faster. The Apache 2.0 license means no commercial restrictions.

Implementation Deep-Dive

Step 1: Infrastructure Setup

Install Ollama on Mac Mini:

# Install Ollama
# (this script targets Linux; on macOS, use the Ollama.app download or `brew install ollama`)
curl -fsSL https://ollama.ai/install.sh | sh
# Pull the model (4.7GB download)
ollama pull qwen2.5-coder:7b
# Verify it works
ollama run qwen2.5-coder:7b "Write a hello world in Python"

Configure auto-start (launchd):

Create /Library/LaunchDaemons/com.ollama.ollama.plist:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.ollama.ollama</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/local/bin/ollama</string>
        <string>serve</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <true/>
    <key>EnvironmentVariables</key>
    <dict>
        <key>OLLAMA_HOST</key>
        <string>0.0.0.0:11434</string>
    </dict>
</dict>
</plist>

Then load it:

sudo launchctl load /Library/LaunchDaemons/com.ollama.ollama.plist
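
Because OLLAMA_HOST is bound to 0.0.0.0, the API is reachable from other machines on the internal network. A quick smoke test from a CI runner (substitute your Mac Mini's internal IP) might look like this:

# List the models the server has pulled; confirms the HTTP API is reachable remotely
curl -s http://<mac-mini-ip>:11434/api/tags | jq '.models[].name'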

Step 2: GitLab CI Job

Add this to your .gitlab-ci.yml:

ai_bot:
  stage: build          # Or any stage that runs on MRs
  needs: []             # No dependencies - runs in parallel
  image: alpine:3.21
  rules:
    # Skip for main branch pushes
    - if: '$CI_COMMIT_BRANCH == "main"'
      when: never
    # Run on all merge requests
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
      when: on_success
  before_script:
    - apk add --no-cache git curl bash jq
  script:
    - chmod +x scripts/ollama_review.sh
    - bash scripts/ollama_review.sh
  tags:
    - inhouse           # Runs on self-hosted runner with network access
  allow_failure: true   # Don't block MRs if AI review fails
  timeout: 10m

Key decisions:

  • allow_failure: true — AI review is advisory, not blocking
  • tags: [inhouse] — Ensures the job runs on runners that can reach the Mac Mini
  • timeout: 10m — Large diffs may take longer; prevents hanging jobs

The Secret Sauce: Optimizations That Make It Work

A naive implementation would hit context limits, produce inconsistent output, and miss important issues. Here are the optimizations that make our system production-ready.

Optimization 1: Token Budget Management

Local inference has practical context limits: even though the model advertises a 128K window, a 7B model on a Mac Mini gets slow and memory-hungry long before that, so we cap the context (NUM_CTX=8192 below) and budget prompt tokens carefully:

# Token budget allocation (24K chars ≈ 8K tokens)
MAX_PROMPT_CHARS=24000 # Total prompt limit
MAX_ADDED_LINES_SIZE=18000 # ~6K tokens for code changes
MAX_ASANA_CONTEXT_SIZE=3000 # ~1K tokens for task context
# Remaining ~1K for instructions, patterns, format

Why this matters: Without budgeting, a 500-line diff would exceed context and produce garbage output.
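
One way to enforce the budget (a sketch, not our exact script) is a plain character-length check before the prompt is assembled:

# Truncate the added-lines blob if it blows the character budget
# (characters are a rough proxy for tokens: ~3 chars per token for code)
if [ "${#ADDED_LINES}" -gt "$MAX_ADDED_LINES_SIZE" ]; then
    ADDED_LINES="${ADDED_LINES:0:$MAX_ADDED_LINES_SIZE}
... [diff truncated to stay within the token budget]"
fi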

Optimization 2: Batched Reviews

Large MRs can have 50+ files. Instead of one massive prompt, we batch:

# Batching configuration
BATCH_SIZE=5 # Files per LLM call
MAX_BATCHES=10 # Maximum batches (50 files coverage)
BATCH_DELAY=2 # Seconds between batches (avoid overload)

How it works:

50 files changed

Split into 10 batches of 5 files each

Batch 1: Files 1-5 → LLM call → JSON result
Batch 2: Files 6-10 → LLM call → JSON result
...
Batch 10: Files 46-50 → LLM call → JSON result

Merge all results into single review
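
A sketch of the batching loop, assuming a hypothetical review_batch helper that builds the prompt and calls the model for one group of files:

# Group changed files into batches of BATCH_SIZE and review each group
CHANGED_FILES=$(git diff --name-only "origin/$CI_MERGE_REQUEST_TARGET_BRANCH_NAME"...HEAD)
batch_num=0
echo "$CHANGED_FILES" | xargs -n "$BATCH_SIZE" | while read -r batch; do
    batch_num=$((batch_num + 1))
    [ "$batch_num" -gt "$MAX_BATCHES" ] && break
    review_batch $batch        # one LLM call per space-separated file group
    sleep "$BATCH_DELAY"       # pause between calls so the Mac Mini isn't overloaded
done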

Merge logic: Accumulate issues from all batches, use worst verdict:

merge_reviews() {
    # Collect all critical/important/suggestions
    # Verdict priority: REQUEST_CHANGES > APPROVE_WITH_COMMENTS > APPROVE
    # Combine all reasons
}
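
If each batch's JSON reply is written to a file (batch_1.json, batch_2.json, …), the merge can be a single jq pass. A minimal sketch under that assumption:

merge_reviews() {
    # Combine per-batch results into one review object, keeping the worst verdict
    jq -s '{
        critical:    [.[].critical    | select(. != "none")],
        important:   [.[].important   | select(. != "none")],
        suggestions: [.[].suggestions | select(. != "none")],
        verdict: (if any(.[]; .verdict == "REQUEST_CHANGES") then "REQUEST_CHANGES"
                  elif any(.[]; .verdict == "APPROVE_WITH_COMMENTS") then "APPROVE_WITH_COMMENTS"
                  else "APPROVE" end),
        reason: [.[].reason] | join("; ")
    }' batch_*.json
}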

Optimization 3: Added Lines Only Format

We don’t send the entire diff. We extract ONLY added lines with precise locations:

# Input: raw git diff
# Output: structured format
[src/auth/login.ex:45] +   def authenticate(user, password) do
[src/auth/login.ex:46] +     case verify_password(user, password) do
[src/auth/login.ex:47] +       {:ok, _} -> {:ok, user}
[src/auth/login.ex:48] +       {:error, _} -> {:error, :invalid_credentials}
[src/auth/login.ex:49] +     end
[src/auth/login.ex:50] +   end
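
Producing that format is mostly a matter of walking the diff's hunk headers. A minimal sketch, assuming a `git diff -U0` against the MR target branch:

# Emit "[file:line] + code" for every added line in the diff
extract_added_lines() {
    git diff -U0 "origin/$CI_MERGE_REQUEST_TARGET_BRANCH_NAME"...HEAD | awk '
        /^\+\+\+ b\// { file = substr($0, 7); next }    # path of the new file
        /^@@/ {                                         # hunk header: @@ -a,b +c,d @@
            split($0, h, " "); split(h[3], plus, ",")
            line = substr(plus[1], 2); next             # first added line number
        }
        /^\+/ { printf "[%s:%d] %s\n", file, line, $0; line++ }
    '
}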

Why [file:line] format:

  1. LLM can reference exact locations in its response
  2. We can validate LLM output against real line numbers
  3. Enables line-specific comments on MRs (future feature)

Optimization 4: Triggered Pattern Injection

Not all code needs the same review rules. We detect patterns in the diff and inject relevant checks:

load_triggered_patterns() {
    local diff="$1"
    local patterns=""
    # Security patterns
    if echo "$diff" | grep -qE "password|secret|token|api_key"; then
        patterns+="⚠️ SECRETS: Check for hardcoded credentials\n"
    fi
    # Financial patterns
    if echo "$diff" | grep -qE "Float\.|amount.*\*|price.*\*"; then
        patterns+="💰 MONEY: Use Decimal, NEVER Float for currency\n"
    fi
    # Database patterns
    if echo "$diff" | grep -qE "INSERT|UPDATE|DELETE|Repo\.(insert|update)"; then
        patterns+="📝 DATABASE: Check for SQL injection, use parameterized queries\n"
    fi
    # API patterns
    if echo "$diff" | grep -qE "fetch|axios|http|curl"; then
        patterns+="🔌 HTTP: Verify timeout handling and error responses\n"
    fi
    # Async patterns
    if echo "$diff" | grep -qE "async|await|Promise|Task\.async"; then
        patterns+="⏳ ASYNC: Check for unhandled rejections/errors\n"
    fi
    # -e so the "\n" escapes become real newlines in the prompt
    echo -e "$patterns"
}

Result: The LLM gets context-specific review criteria instead of generic advice.

Optimization 5: LLM Parameters Tuning

Default LLM parameters produce verbose, inconsistent output. We tuned for code review:

# LLM parameters
TEMPERATURE=0.1 # Low = deterministic, consistent format
REPEAT_PENALTY=1.5 # High = prevents verbose repetition
MAX_TOKENS=600 # Short = forces concise output
NUM_CTX=8192 # Context window for full understanding
# System prompt (10%+ quality improvement per research)
SYSTEM_PROMPT="You are an expert code reviewer. You identify ONLY real bugs,
security issues, and critical problems. You never flag style preferences or
valid patterns as issues. You output ONLY the requested format."

Optimization 6: Semantic Annotations

We pre-annotate important changes to guide LLM attention:

annotate_diff_semantically() {
    # Read the diff on stdin and prefix notable added lines with a tag
    while IFS= read -r line; do
        case "$line" in
            +*def\ *|+*function\ *)
                echo "[NEW FUNC] $line" ;;
            +*Repo.insert*|+*INSERT*)
                echo "[DB MUTATION] $line" ;;
            +*password*|+*secret*)
                echo "[SECURITY] $line" ;;
            +*send_email*|+*notify*)
                echo "[NOTIFICATION] $line" ;;
            *)
                echo "$line" ;;
        esac
    done
}

Before: the LLM might miss a buried password assignment.
After: `[SECURITY] + password = params["password"]` gets flagged.
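
In the pipeline the annotator simply sits between diff extraction and prompt assembly, e.g. (reusing the hypothetical extract_added_lines helper from the earlier sketch):

# Tag notable lines before they go into the prompt
ANNOTATED_LINES=$(extract_added_lines | annotate_diff_semantically)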

Optimization 7: Module Context Extraction

For languages with rich type systems (Elixir, TypeScript), we extract context:

extract_module_context() {
    local file="$1"
    # Extract module name
    grep -m1 "^defmodule" "$file"
    # Extract type specs
    grep "^\s*@spec" "$file" | head -8
    # Extract function signatures
    grep "^\s*def " "$file" | head -15
    # Extract struct definition
    grep "defstruct" "$file"
}

This helps the LLM understand the codebase structure without sending entire files.

Prompt Engineering: The Complete Prompt

Here’s the actual prompt structure we use:

PROMPT="${ASANA_CONTEXT}
Review these code changes. Each line shows [file:line] and the code.
CHANGES TO REVIEW (Batch $batch_num of $total_batches):
$ADDED_LINES_WITH_NUMBERS
CHECKS TO APPLY:
$TRIGGERED_PATTERNS
RULES:
1. Only review the NEWLY ADDED lines shown above
2. Use the EXACT [file:line] from the line you're reviewing
3. If no issues with the NEW code, say \"None found\"
4. Consider task context (if provided) to understand business intent
FOCUS ON:
- Bugs or logic errors in the new code
- Security issues introduced by these specific changes
- Missing error handling for new code paths
- Potential crashes or exceptions
SEVERITY GUIDE:
- CRITICAL: Security vulnerabilities, data loss, crashes
- IMPORTANT: Logic errors, missing error handling
- SUGGESTIONS: Style improvements, minor optimizations
DO NOT flag as issues:
- Patterns that follow existing conventions
- Configuration constants (timeouts, limits are intentional)
- Valid language idioms
OUTPUT JSON:
{\"critical\":\"none\",\"important\":\"none\",\"suggestions\":\"none\",
\"verdict\":\"APPROVE\",\"reason\":\"clean code\"}"

Why JSON Output?

We request JSON format for structured parsing:

curl "http://$OLLAMA_HOST/api/generate" -d '{
"model": "qwen2.5-coder:7b",
"prompt": "...",
"format": "json", # ← Request JSON output
"options": {
"temperature": 0.1,
"num_predict": 600,
"repeat_penalty": 1.5
}
}'

Benefits:

  • Predictable parsing (no regex gymnastics)
  • Easy verdict extraction for CI status
  • Simple merge of batch results
  • Clean separation of issue categories
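
With streaming disabled, the verdict can be pulled straight out of the reply. A sketch (REQUEST_BODY stands in for the JSON payload shown above):

# /api/generate wraps the model's output in a "response" string field
RESPONSE=$(curl -s "http://$OLLAMA_HOST/api/generate" -d "$REQUEST_BODY")
VERDICT=$(echo "$RESPONSE" | jq -r '.response | fromjson | .verdict')
echo "AI verdict: $VERDICT"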

Integrations: External Context

Task Management (Asana/Jira)

We extract task context from MR descriptions:

# Extract task URL from MR description
TASK_URL=$(echo "$MR_DESCRIPTION" | grep -oE 'https://app.asana.com/[0-9/]+')
# Assumption: the task GID is the last numeric path segment of classic Asana task URLs
TASK_GID=$(basename "$TASK_URL")
# Fetch task details via API
TASK_CONTEXT=$(curl -s -H "Authorization: Bearer $ASANA_PAT" \
  "https://app.asana.com/api/1.0/tasks/$TASK_GID?opt_fields=name,notes")
# Include in prompt
ASANA_SECTION="
## Task Context
**Task:** $TASK_NAME
**Description:** $TASK_NOTES
Use this context to understand the business intent of the changes.
"

Why this helps: LLM can distinguish between “bug fix” and “new feature” behavior.

Cross-Module Impact Detection

For monorepos, detect when changes affect multiple modules:

detect_cross_impact() {
    local changed_files="$1"
    if echo "$changed_files" | grep -q "auth/"; then
        echo "⚡ AUTH changes may affect: user sessions, permissions"
    fi
    if echo "$changed_files" | grep -q "database/"; then
        echo "⚡ DB schema changes require migrations"
    fi
    if echo "$changed_files" | grep -q "api/"; then
        echo "⚡ API changes may break clients - check versioning"
    fi
}

Real Results

Sample Review Output

## AI Code Review Summary
### CRITICAL
None found
### IMPORTANT
1. **Missing error handling** [src/api/users.ex:67]
The `Repo.insert` call doesn't handle `{:error, changeset}`.
Pattern match on the result to handle validation failures.
2. **Potential nil access** [src/api/users.ex:82]
`user.profile.name` may crash if profile is nil.
Use `user.profile && user.profile.name` or safe navigation.
### SUGGESTIONS
- Consider extracting lines 45-60 into a separate function for testability.
### VERDICT
APPROVE WITH COMMENTS - Good implementation, address error handling before merge.
---
*Review: Batch 1/2 | Model: qwen2.5-coder:7b | Time: 6.2s | Tokens: 1,250+180*
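
Posting that summary back to the MR is a single GitLab API call from the CI job. A sketch: CI_API_V4_URL, CI_PROJECT_ID and CI_MERGE_REQUEST_IID are predefined GitLab CI variables, while GITLAB_API_TOKEN (a project access token with api scope stored as a CI/CD variable) and REVIEW_MARKDOWN (the rendered summary) are names assumed here:

# Create a comment (note) on the merge request containing the review
curl --request POST \
    --header "PRIVATE-TOKEN: $GITLAB_API_TOKEN" \
    --data-urlencode "body=$REVIEW_MARKDOWN" \
    "$CI_API_V4_URL/projects/$CI_PROJECT_ID/merge_requests/$CI_MERGE_REQUEST_IID/notes"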

Limitations & Honest Assessment

Where It Excels

  • Pattern matching — Catches common mistakes reliably
  • Style consistency — Enforces coding standards
  • Security basics — SQL injection, XSS, hardcoded secrets
  • Speed — 5–15 seconds vs hours waiting for human review

Where It Falls Short

  • Complex business logic — Can’t understand domain-specific rules
  • Architecture decisions — Won’t catch design flaws
  • Context awareness — Doesn’t know codebase history
  • Nuanced judgment — Sometimes flags valid patterns

Our Recommendation

Use AI review as a first-pass filter, not a replacement for human review.

The AI catches low-hanging fruit so humans can focus on architecture and logic.

Getting Started

Quick Start (5 minutes)

  1. Install Ollama on any machine with 8GB+ RAM:

     curl -fsSL https://ollama.ai/install.sh | sh
     ollama pull qwen2.5-coder:7b

  2. Add a minimal CI job:

     ai_review:
       image: alpine:3.21
       script:
         - apk add curl jq git
         - |
           DIFF=$(git diff origin/main -- '*.py' '*.js' '*.ts')
           curl -s http://YOUR_OLLAMA_HOST:11434/api/generate \
             -d "{\"model\":\"qwen2.5-coder:7b\",\"prompt\":\"Review:\\n$DIFF\"}" \
             | jq -r '.response'

  3. Iterate — Add batching, patterns, and GitLab posting as needed.

Conclusion

Switching from cloud APIs to local Qwen2.5-Coder was one of the best infrastructure decisions we made. The key insights:

  1. Budget tokens carefully — Context limits are real constraints
  2. Batch large reviews — Don’t try to review 50 files in one call
  3. Inject relevant context — Triggered patterns beat generic rules
  4. Tune LLM parameters — Low temperature + high repeat penalty = consistent output
  5. Use structured output — JSON is easier to parse than free text

The result: Zero ongoing API costs, complete data privacy, and instant feedback on every MR.

Is the AI perfect? No. But it catches enough issues to be valuable, and the price (after hardware) is unbeatable.

Have questions? Found improvements? Leave a comment below.

Tags: #DevOps #AI #CodeReview #GitLabCI #Ollama #LLM #MachineLearning #SelfHosted #Qwen