How we eliminated API costs while keeping our code private and reviews comprehensive
Every merge request now gets an instant AI code review. The catch? It costs us exactly $0 in API fees.
If you’re using Claude, ChatGPT, or any cloud-based LLM for code reviews, you’re probably familiar with the pain points:
- API costs add up fast — At $0.10–0.50 per review, 100 daily MRs means $300–1,500/month
- Privacy concerns — Your proprietary code leaves your network and hits third-party servers
- Rate limits — Nothing kills CI pipeline velocity like hitting API throttling
- Vendor lock-in — Pricing changes or service outages are outside your control
We solved all of these by running Qwen2.5-Coder:7b locally on a Mac Mini, integrated directly into our GitLab CI pipeline. Here’s exactly how we did it — including the batching system, prompt engineering, and optimizations that make it production-ready.
The Architecture
╔═══════════════════════════════════════════════════════════════════╗
║ GitLab CI Pipeline ║
║ ║
║ ┌──────────┐ ┌──────────┐ ┌──────────┐ ║
║ │ Build │ ───► │ ai_bot │ ───► │ Tests │ ║
║ │ Stage │ │ Job │ │ Stage │ ║
║ └──────────┘ └────┬─────┘ └──────────┘ ║
║ │ ║
╚══════════════════════════╪════════════════════════════════════════╝
│
│ HTTP Request
│ (port 11434)
▼
╔═══════════════════════════════════════════════════════════════════╗
║ Mac Mini (In-House Server) ║
║ ║
║ ┌───────────────────────────────────────────────────────────┐ ║
║ │ Ollama Service │ ║
║ │ │ ║
║ │ ┌───────────────────────────────────────────────────┐ │ ║
║ │ │ qwen2.5-coder:7b (~5GB RAM) │ │ ║
║ │ │ │ │ ║
║ │ │ • Code-specialized (5.5T tokens training) │ │ ║
║ │ │ • Apache 2.0 license (commercial OK) │ │ ║
║ │ │ • 128K context window │ │ ║
║ │ └───────────────────────────────────────────────────┘ │ ║
║ │ │ ║
║ └───────────────────────────────────────────────────────────┘ ║
║ ║
║ IP: Internal Network | Auto-start: launchd ║
╚══════════════════════════╤════════════════════════════════════════╝
│
│ JSON Response
│ + GitLab API
▼
╔═══════════════════════════════════════════════════════════════════╗
║ Merge Request ║
║ ║
║ ┌───────────────────────────────────────────────────────────┐ ║
║ │ ## 🤖 AI Code Review Summary │ ║
║ │ │ ║
║ │ ### ✅ APPROVE WITH COMMENTS │ ║
║ │ │ ║
║ │ **Critical:** None │ ║
║ │ **Important:** 2 issues found │ ║
║ │ **Suggestions:** 1 improvement │ ║
║ │ │ ║
║ │ ───────────────────────────────────────────────────── │ ║
║ │ *Powered by Ollama (qwen2.5-coder:7b) • 8.5s • $0* │ ║
║ └───────────────────────────────────────────────────────────┘ ║
║ ║
╚═══════════════════════════════════════════════════════════════════╝

Simple Flow Diagram
┌────────────┐ ┌────────────┐ ┌────────────┐
│ │ │ │ │ │
│ GitLab │────►│ Ollama │────►│ MR Gets │
│ CI Job │ │ (Local) │ │ Review │
│ │ │ │ │ │
└────────────┘ └────────────┘ └────────────┘
      MR              Qwen2.5           AI Comment
    Created          Coder:7b             Posted

┌─────────────────────────────────┐
│ 💰 Cost: $0 🔒 Private ⚡ Fast │
└─────────────────────────────────┘

Why Mac Mini?
Apple Silicon is remarkably efficient for LLM inference:
- M1/M2 chips — Unified memory architecture eliminates GPU memory bottlenecks
- Power efficiency — ~$5/month in electricity vs $50+ for a GPU server
- Quiet operation — Can sit under a desk without dedicated cooling
- One-time cost — $599–999 vs ongoing cloud API fees
Why Ollama?
Ollama makes running local LLMs trivially easy:
- One-command setup — ollama pull qwen2.5-coder:7b
- OpenAI-compatible API — Drop-in replacement for existing integrations (see the example after this list)
- Model management — Easy switching between models for testing
- Automatic optimization — Handles memory management and batching
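For example, the compatibility layer can be exercised with a plain curl once Ollama is running locally. A quick sketch, not part of our pipeline:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder:7b",
    "messages": [
      {"role": "user", "content": "Review this function for bugs: def add(a, b): return a - b"}
    ]
  }'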
Why Qwen2.5-Coder?
We tested Llama3.1, CodeLlama, DeepSeek Coder, and Qwen2.5-Coder. Here’s why Qwen won:
Qwen2.5-Coder at 7B parameters outperforms CodeLlama-34B (48.8%) while running 5x faster. The Apache 2.0 license means no commercial restrictions.
Implementation Deep-Dive
Step 1: Infrastructure Setup
Install Ollama on Mac Mini:
# Install Ollama (the curl installer script targets Linux; on a Mac use Homebrew
# or download the app from ollama.com)
brew install ollama

# Pull the model (4.7GB download)
ollama pull qwen2.5-coder:7b

# Verify it works
ollama run qwen2.5-coder:7b "Write a hello world in Python"
Configure auto-start (launchd):
Create /Library/LaunchDaemons/com.ollama.ollama.plist:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
"http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>com.ollama.ollama</string>
<key>ProgramArguments</key>
<array>
<string>/usr/local/bin/ollama</string>
<string>serve</string>
</array>
<key>RunAtLoad</key>
<true/>
<key>KeepAlive</key>
<true/>
<key>EnvironmentVariables</key>
<dict>
<key>OLLAMA_HOST</key>
<string>0.0.0.0:11434</string>
</dict>
</dict>
</plist>

sudo launchctl load /Library/LaunchDaemons/com.ollama.ollama.plist
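With the service loaded, it's worth confirming the API is reachable from another machine on the network, since that's exactly the call the CI runner will make. A quick check, with mac-mini.internal standing in for your Mac Mini's internal hostname or IP:

# Lists the installed models if Ollama is up and reachable
curl http://mac-mini.internal:11434/api/tags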
Step 2: GitLab CI Job
Add this to your .gitlab-ci.yml:
ai_bot:
stage: build # Or any stage that runs on MRs
needs: [] # No dependencies - runs in parallel
image: alpine:3.21
rules:
# Skip for main branch pushes
- if: '$CI_COMMIT_BRANCH == "main"'
when: never
# Run on all merge requests
- if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
when: on_success
before_script:
- apk add --no-cache git curl bash jq
script:
- chmod +x scripts/ollama_review.sh
- bash scripts/ollama_review.sh
tags:
- inhouse # Runs on self-hosted runner with network access
allow_failure: true # Don't block MRs if AI review fails
  timeout: 10m

Key decisions:
- allow_failure: true — AI review is advisory, not blocking
- tags: [inhouse] — Ensures the job runs on runners that can reach the Mac Mini
- timeout: 10m — Large diffs may take longer; prevents hanging jobs
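Everything else lives in scripts/ollama_review.sh, which the next sections walk through piece by piece. As orientation, here is a heavily trimmed sketch of its overall shape. OLLAMA_HOST, GITLAB_TOKEN, and the mac-mini.internal hostname are assumptions (CI/CD variables we define ourselves, not GitLab built-ins); the rest uses standard GitLab predefined variables.

#!/usr/bin/env bash
# Trimmed sketch of scripts/ollama_review.sh (not the full production script)
set -euo pipefail

OLLAMA_HOST="${OLLAMA_HOST:-mac-mini.internal:11434}"   # CI/CD variable
MODEL="qwen2.5-coder:7b"

# 1. Collect the MR diff against the target branch
git fetch origin "$CI_MERGE_REQUEST_TARGET_BRANCH_NAME" --depth=50
DIFF=$(git diff "origin/${CI_MERGE_REQUEST_TARGET_BRANCH_NAME}...HEAD")

# 2. Build the prompt (budgeting, batching, and pattern injection come later)
PROMPT="Review these changes and answer in JSON: $DIFF"

# 3. Ask the local model (jq builds the JSON body, so the diff is safely escaped)
RESPONSE=$(jq -n --arg model "$MODEL" --arg prompt "$PROMPT" \
    '{model: $model, prompt: $prompt, stream: false, format: "json"}' \
  | curl -s "http://$OLLAMA_HOST/api/generate" -d @-)
REVIEW=$(echo "$RESPONSE" | jq -r '.response')

# 4. Post the review as an MR comment via the GitLab API
curl -s --request POST \
  --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
  --data-urlencode "body=## 🤖 AI Code Review Summary

$REVIEW" \
  "$CI_API_V4_URL/projects/$CI_PROJECT_ID/merge_requests/$CI_MERGE_REQUEST_IID/notes"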
The Secret Sauce: Optimizations That Make It Work
A naive implementation would hit context limits, produce inconsistent output, and miss important issues. Here are the optimizations that make our system production-ready.
Optimization 1: Token Budget Management
The model may advertise a 128K context window, but locally we run it with an 8K context (NUM_CTX below) to keep memory use and latency in check, so we budget tokens carefully:
# Token budget allocation (24K chars ≈ 8K tokens)
MAX_PROMPT_CHARS=24000 # Total prompt limit
MAX_ADDED_LINES_SIZE=18000 # ~6K tokens for code changes
MAX_ASANA_CONTEXT_SIZE=3000 # ~1K tokens for task context
# Remaining ~1K for instructions, patterns, format

Why this matters: Without budgeting, a 500-line diff would exceed context and produce garbage output.
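Enforcing the budget can be as simple as clamping each piece before the prompt is assembled. A minimal sketch (truncate_to_budget is a hypothetical helper, not the production code):

# Clamp a payload to its character budget, marking the cut
truncate_to_budget() {
  local text="$1" limit="$2"
  if [ "${#text}" -gt "$limit" ]; then
    printf '%s\n[... truncated to fit the token budget ...]\n' \
      "$(printf '%s' "$text" | head -c "$limit")"
  else
    printf '%s\n' "$text"
  fi
}

ADDED_LINES=$(truncate_to_budget "$ADDED_LINES" "$MAX_ADDED_LINES_SIZE")
ASANA_CONTEXT=$(truncate_to_budget "$ASANA_CONTEXT" "$MAX_ASANA_CONTEXT_SIZE")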
Optimization 2: Batched Reviews
Large MRs can have 50+ files. Instead of one massive prompt, we batch:
# Batching configuration
BATCH_SIZE=5 # Files per LLM call
MAX_BATCHES=10 # Maximum batches (50 files coverage)
BATCH_DELAY=2   # Seconds between batches (avoid overload)

How it works:
50 files changed
↓
Split into 10 batches of 5 files each
↓
Batch 1: Files 1-5 → LLM call → JSON result
Batch 2: Files 6-10 → LLM call → JSON result
...
Batch 10: Files 46-50 → LLM call → JSON result
↓
Merge all results into single review
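A condensed sketch of that loop, assuming the changed file paths sit one per line in $CHANGED_FILES and that review_batch (a hypothetical helper) builds the prompt for one batch and returns the model's JSON; merge_reviews is shown just below:

batch_num=0
results=()
while read -r -a batch_files; do
  batch_num=$((batch_num + 1))
  [ "$batch_num" -gt "$MAX_BATCHES" ] && break

  # Review one batch of files and keep its JSON result
  results+=("$(review_batch "$batch_num" "${batch_files[@]}")")
  sleep "$BATCH_DELAY"
done < <(echo "$CHANGED_FILES" | xargs -n "$BATCH_SIZE")

FINAL_REVIEW=$(merge_reviews "${results[@]}")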
Merge logic: Accumulate issues from all batches, use worst verdict:

merge_reviews() {
# Collect all critical/important/suggestions
# Verdict priority: REQUEST_CHANGES > APPROVE_WITH_COMMENTS > APPROVE
# Combine all reasons
}

Optimization 3: Added Lines Only Format
We don’t send the entire diff. We extract ONLY added lines with precise locations:
# Input: raw git diff
# Output: structured format

[src/auth/login.ex:45] + def authenticate(user, password) do
[src/auth/login.ex:46] + case verify_password(user, password) do
[src/auth/login.ex:47] + {:ok, _} -> {:ok, user}
[src/auth/login.ex:48] + {:error, _} -> {:error, :invalid_credentials}
[src/auth/login.ex:49] + end
[src/auth/login.ex:50] + end

Why [file:line] format:
- LLM can reference exact locations in its response
- We can validate LLM output against real line numbers
- Enables line-specific comments on MRs (future feature)
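A minimal sketch of that extraction, assuming a standard unified diff and the busybox/GNU awk available in the alpine image (the production script also annotates and filters, which is omitted here):

git diff "origin/$CI_MERGE_REQUEST_TARGET_BRANCH_NAME...HEAD" | awk '
  /^\+\+\+ b\// { file = substr($0, 7); next }     # track the current file
  /^@@/  { match($0, /\+[0-9]+/)                   # hunk header: new-file start line
           line = substr($0, RSTART + 1, RLENGTH - 1); next }
  /^\+/  { printf "[%s:%d] %s\n", file, line, $0; line++; next }  # added line
  /^-/   { next }                                  # removed line: no line advance
  /^ /   { line++ }                                # context line advances the counter
'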
Optimization 4: Triggered Pattern Injection
Not all code needs the same review rules. We detect patterns in the diff and inject relevant checks:
load_triggered_patterns() {
  local diff="$1"
  local patterns=""

  # Security patterns
  if echo "$diff" | grep -qE "password|secret|token|api_key"; then
    patterns+="⚠️ SECRETS: Check for hardcoded credentials\n"
  fi

  # Financial patterns
  if echo "$diff" | grep -qE "Float\.|amount.*\*|price.*\*"; then
    patterns+="💰 MONEY: Use Decimal, NEVER Float for currency\n"
  fi

  # Database patterns
  if echo "$diff" | grep -qE "INSERT|UPDATE|DELETE|Repo\.(insert|update)"; then
    patterns+="📝 DATABASE: Check for SQL injection, use parameterized queries\n"
  fi

  # API patterns
  if echo "$diff" | grep -qE "fetch|axios|http|curl"; then
    patterns+="🔌 HTTP: Verify timeout handling and error responses\n"
  fi

  # Async patterns
  if echo "$diff" | grep -qE "async|await|Promise|Task\.async"; then
    patterns+="⏳ ASYNC: Check for unhandled rejections/errors\n"
  fi

  # -e so the \n separators print as real newlines
  echo -e "$patterns"
}
Result: The LLM gets context-specific review criteria instead of generic advice.
Optimization 5: LLM Parameters Tuning
Default LLM parameters produce verbose, inconsistent output. We tuned for code review:
# LLM parameters
TEMPERATURE=0.1 # Low = deterministic, consistent format
REPEAT_PENALTY=1.5 # High = prevents verbose repetition
MAX_TOKENS=600 # Short = forces concise output
NUM_CTX=8192          # Context window for full understanding

# System prompt (10%+ quality improvement per research)
SYSTEM_PROMPT="You are an expert code reviewer. You identify ONLY real bugs,
security issues, and critical problems. You never flag style preferences or
valid patterns as issues. You output ONLY the requested format."
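Here is how those settings can be passed to Ollama's /api/generate, sketched with jq so the multi-line prompt and system prompt stay valid JSON (illustrative, not the exact production call):

jq -n \
  --arg system "$SYSTEM_PROMPT" \
  --arg prompt "$PROMPT" \
  '{
    model: "qwen2.5-coder:7b",
    system: $system,
    prompt: $prompt,
    format: "json",
    stream: false,
    options: {
      temperature: 0.1,
      repeat_penalty: 1.5,
      num_predict: 600,
      num_ctx: 8192
    }
  }' | curl -s "http://$OLLAMA_HOST/api/generate" -d @-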
Optimization 6: Semantic Annotations
We pre-annotate important changes to guide LLM attention:
annotate_diff_semantically() {
  # Read the diff line by line on stdin and tag notable added lines
  while IFS= read -r line; do
    case "$line" in
      +*def\ *|+*function\ *)
        echo "[NEW FUNC] $line" ;;
      +*Repo.insert*|+*INSERT*)
        echo "[DB MUTATION] $line" ;;
      +*password*|+*secret*)
        echo "[SECURITY] $line" ;;
      +*send_email*|+*notify*)
        echo "[NOTIFICATION] $line" ;;
      *)
        echo "$line" ;;
    esac
  done
}

Before: the LLM might miss a buried password variable.
After: [SECURITY] + password = params["password"] gets flagged.
Optimization 7: Module Context Extraction
For languages with rich type systems (Elixir, TypeScript), we extract context:
extract_module_context() {
  local file="$1"

  # Extract module name
  grep -m1 "^defmodule" "$file"

  # Extract type specs
  grep "^\s*@spec" "$file" | head -8

  # Extract function signatures
  grep "^\s*def " "$file" | head -15

  # Extract struct definition
  grep "defstruct" "$file"
}
This helps the LLM understand the codebase structure without sending entire files.
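A sketch of how that context can be stitched into a batch's prompt (batch_files and the concatenation order are illustrative):

MODULE_CONTEXT=""
for file in "${batch_files[@]}"; do
  # Only extract context from files that still exist on this branch
  [ -f "$file" ] || continue
  MODULE_CONTEXT+="### Context for $file
$(extract_module_context "$file")

"
done
PROMPT="${MODULE_CONTEXT}${PROMPT}"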
Prompt Engineering: The Complete Prompt
Here’s the actual prompt structure we use:
PROMPT="${ASANA_CONTEXT}Review these code changes. Each line shows [file:line] and the code.CHANGES TO REVIEW (Batch $batch_num of $total_batches):
$ADDED_LINES_WITH_NUMBERSCHECKS TO APPLY:
$TRIGGERED_PATTERNSRULES:
1. Only review the NEWLY ADDED lines shown above
2. Use the EXACT [file:line] from the line you're reviewing
3. If no issues with the NEW code, say \"None found\"
4. Consider task context (if provided) to understand business intentFOCUS ON:
- Bugs or logic errors in the new code
- Security issues introduced by these specific changes
- Missing error handling for new code paths
- Potential crashes or exceptionsSEVERITY GUIDE:
- CRITICAL: Security vulnerabilities, data loss, crashes
- IMPORTANT: Logic errors, missing error handling
- SUGGESTIONS: Style improvements, minor optimizationsDO NOT flag as issues:
- Patterns that follow existing conventions
- Configuration constants (timeouts, limits are intentional)
- Valid language idiomsOUTPUT JSON:
{\"critical\":\"none\",\"important\":\"none\",\"suggestions\":\"none\",
\"verdict\":\"APPROVE\",\"reason\":\"clean code\"}"
Why JSON Output?
We request JSON format for structured parsing:
curl "http://$OLLAMA_HOST/api/generate" -d '{
"model": "qwen2.5-coder:7b",
"prompt": "...",
"format": "json", # ← Request JSON output
"options": {
"temperature": 0.1,
"num_predict": 600,
"repeat_penalty": 1.5
}
}'Benefits:
- Predictable parsing (no regex gymnastics)
- Easy verdict extraction for CI status
- Simple merge of batch results
- Clean separation of issue categories
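Downstream parsing is then a couple of jq calls. A sketch, assuming the request was sent with "stream": false (as above) and the raw curl output is in $RESPONSE:

# Ollama wraps the model's text in a "response" field; here that text is itself JSON
REVIEW_JSON=$(echo "$RESPONSE" | jq -r '.response')
VERDICT=$(echo "$REVIEW_JSON" | jq -r '.verdict // "APPROVE"')

echo "AI verdict: $VERDICT"
# The job is advisory (allow_failure: true), but we can still signal REQUEST_CHANGES
[ "$VERDICT" != "REQUEST_CHANGES" ] || exit 1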
Integrations: External Context
Task Management (Asana/Jira)
We extract task context from MR descriptions:
# Extract task URL from MR description
TASK_URL=$(echo "$MR_DESCRIPTION" | grep -oE 'https://app.asana.com/[0-9/]+')

# Derive the task gid (assumed here to be the last URL path segment)
TASK_GID=$(basename "$TASK_URL")

# Fetch task details via API
TASK_CONTEXT=$(curl -H "Authorization: Bearer $ASANA_PAT" \
  "https://app.asana.com/api/1.0/tasks/$TASK_GID?opt_fields=name,notes")

# Include in prompt (TASK_NAME and TASK_NOTES are parsed out of TASK_CONTEXT with jq)
ASANA_SECTION="
## Task Context
**Task:** $TASK_NAME
**Description:** $TASK_NOTES

Use this context to understand the business intent of the changes.
"
Why this helps: LLM can distinguish between “bug fix” and “new feature” behavior.
Cross-Module Impact Detection
For monorepos, detect when changes affect multiple modules:
detect_cross_impact() {
  local changed_files="$1"

  if echo "$changed_files" | grep -q "auth/"; then
    echo "⚡ AUTH changes may affect: user sessions, permissions"
  fi

  if echo "$changed_files" | grep -q "database/"; then
    echo "⚡ DB schema changes require migrations"
  fi

  if echo "$changed_files" | grep -q "api/"; then
    echo "⚡ API changes may break clients - check versioning"
  fi
}
Real Results
Sample Review Output
## AI Code Review Summary

### CRITICAL
None found

### IMPORTANT
1. **Missing error handling** [src/api/users.ex:67]
   The `Repo.insert` call doesn't handle `{:error, changeset}`.
   Pattern match on the result to handle validation failures.

2. **Potential nil access** [src/api/users.ex:82]
   `user.profile.name` may crash if profile is nil.
   Use `user.profile && user.profile.name` or safe navigation.

### SUGGESTIONS
- Consider extracting lines 45-60 into a separate function for testability.

### VERDICT
APPROVE WITH COMMENTS - Good implementation, address error handling before merge.

---
*Review: Batch 1/2 | Model: qwen2.5-coder:7b | Time: 6.2s | Tokens: 1,250+180*
Limitations & Honest Assessment
Where It Excels
- Pattern matching — Catches common mistakes reliably
- Style consistency — Enforces coding standards
- Security basics — SQL injection, XSS, hardcoded secrets
- Speed — 5–15 seconds vs hours waiting for human review
Where It Falls Short
- Complex business logic — Can’t understand domain-specific rules
- Architecture decisions — Won’t catch design flaws
- Context awareness — Doesn’t know codebase history
- Nuanced judgment — Sometimes flags valid patterns
Our Recommendation
Use AI review as a first-pass filter, not a replacement for human review.
The AI catches low-hanging fruit so humans can focus on architecture and logic.
Getting Started
Quick Start (5 minutes)
- Install Ollama on any machine with 8GB+ RAM:
curl -fsSL https://ollama.ai/install.sh | sh   # on macOS: brew install ollama
ollama pull qwen2.5-coder:7b

- Add minimal CI job:

ai_review:
  image: alpine:3.21
  script:
    - apk add curl jq git
    # NB: a production script should JSON-escape the diff (e.g. with jq -Rs)
    - |
      DIFF=$(git diff origin/main -- '*.py' '*.js' '*.ts')
      curl -s http://YOUR_OLLAMA_HOST:11434/api/generate \
        -d "{\"model\":\"qwen2.5-coder:7b\",\"prompt\":\"Review:\\n$DIFF\"}" \
        | jq -r '.response'

- Iterate — Add batching, patterns, and GitLab posting as needed.
Conclusion
Switching from cloud APIs to local Qwen2.5-Coder was one of the best infrastructure decisions we made. The key insights:
- Budget tokens carefully — Context limits are real constraints
- Batch large reviews — Don’t try to review 50 files in one call
- Inject relevant context — Triggered patterns beat generic rules
- Tune LLM parameters — Low temperature + high repeat penalty = consistent output
- Use structured output — JSON is easier to parse than free text
The result: Zero ongoing API costs, complete data privacy, and instant feedback on every MR.
Is the AI perfect? No. But it catches enough issues to be valuable, and the price (after hardware) is unbeatable.
Have questions? Found improvements? Leave a comment below.
Tags: #DevOps #AI #CodeReview #GitLabCI #Ollama #LLM #MachineLearning #SelfHosted #Qwen