How we eliminated API costs while keeping our code private and reviews comprehensive
Every merge request now gets an instant AI code review. The catch? It costs us exactly $0 in API fees.
If you’re using Claude, ChatGPT, or any cloud-based LLM for code reviews, you’re probably familiar with the pain points:
- API costs add up fast — At $0.10–0.50 per review, 100 daily MRs means $300–1,500/month
- Privacy concerns — Your proprietary code leaves your network and hits third-party servers
- Rate limits — Nothing kills CI pipeline velocity like hitting API throttling
- Vendor lock-in — Pricing changes or service outages are outside your control
We solved all of these by running Qwen2.5-Coder:7b locally on a Mac Mini, integrated directly into our GitLab CI pipeline. Here’s exactly how we did it — including the batching system, prompt engineering, and optimizations that make it production-ready.
The Architecture
╔═══════════════════════════════════════════════════════════════════╗
║ GitLab CI Pipeline ║
║ ║
║ ┌──────────┐ ┌──────────┐ ┌──────────┐ ║
║ │ Build │ ───► │ ai_bot │ ───► │ Tests │ ║
║ │ Stage │ │ Job │ │ Stage │ ║
║ └──────────┘ └────┬─────┘ └──────────┘ ║
║ │ ║
╚══════════════════════════╪════════════════════════════════════════╝
│
│ HTTP Request
│ (port 11434)
▼
╔═══════════════════════════════════════════════════════════════════╗
║ Mac Mini (In-House Server) ║
║ ║
║ ┌───────────────────────────────────────────────────────────┐ ║
║ │ Ollama Service │ ║
║ │ │ ║
║ │ ┌───────────────────────────────────────────────────┐ │ ║
║ │ │ qwen2.5-coder:7b (~5GB RAM) │ │ ║
║ │ │ │ │ ║
║ │ │ • Code-specialized (5.5T tokens training) │ │ ║
║ │ │ • Apache 2.0 license (commercial OK) │ │ ║
║ │ │ • 128K context window │ │ ║
║ │ └───────────────────────────────────────────────────┘ │ ║
║ │ │ ║
║ └───────────────────────────────────────────────────────────┘ ║
║ ║
║ IP: Internal Network | Auto-start: launchd ║
╚══════════════════════════╤════════════════════════════════════════╝
│
│ JSON Response
│ + GitLab API
▼
╔═══════════════════════════════════════════════════════════════════╗
║ Merge Request ║
║ ║
║ ┌───────────────────────────────────────────────────────────┐ ║
║ │ ## 🤖 AI Code Review Summary │ ║
║ │ │ ║
║ │ ### ✅ APPROVE WITH COMMENTS │ ║
║ │ │ ║
║ │ **Critical:** None │ ║
║ │ **Important:** 2 issues found │ ║
║ │ **Suggestions:** 1 improvement │ ║
║ │ │ ║
║ │ ───────────────────────────────────────────────────── │ ║
║ │ *Powered by Ollama (qwen2.5-coder:7b) • 8.5s • $0* │ ║
║ └───────────────────────────────────────────────────────────┘ ║
║ ║
╚═══════════════════════════════════════════════════════════════════╝

Simple Flow Diagram
┌────────────┐ ┌────────────┐ ┌────────────┐
│ │ │ │ │ │
│ GitLab │────►│ Ollama │────►│ MR Gets │
│ CI Job │ │ (Local) │ │ Review │
│ │ │ │ │ │
└────────────┘ └────────────┘ └────────────┘
      MR              Qwen2.5           AI Comment
    Created          Coder:7b             Posted

┌─────────────────────────────────┐
│ 💰 Cost: $0 🔒 Private ⚡ Fast │
└─────────────────────────────────┘

Why Mac Mini?
Apple Silicon is remarkably efficient for LLM inference:
- M1/M2 chips — Unified memory architecture eliminates GPU memory bottlenecks
- Power efficiency — ~$5/month in electricity vs $50+ for a GPU server
- Quiet operation — Can sit under a desk without dedicated cooling
- One-time cost — $599–999 vs ongoing cloud API fees
Why Ollama?
Ollama makes running local LLMs trivially easy:
- One-command setup — ollama pull qwen2.5-coder:7b
- OpenAI-compatible API — Drop-in replacement for existing integrations (see the example after this list)
- Model management — Easy switching between models for testing
- Automatic optimization — Handles memory management and batching
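For example, the compatibility layer can be exercised with a plain curl once Ollama is running locally. A quick sketch, not part of our pipeline:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder:7b",
    "messages": [
      {"role": "user", "content": "Review this function for bugs: def add(a, b): return a - b"}
    ]
  }'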
Why Qwen2.5-Coder?
We tested Llama3.1, CodeLlama, DeepSeek Coder, and Qwen2.5-Coder. Here’s why Qwen won:
Qwen2.5-Coder at 7B parameters outperforms CodeLlama-34B (48.8%) while running 5x faster. The Apache 2.0 license means no commercial restrictions.
Implementation Deep-Dive
Step 1: Infrastructure Setup
Install Ollama on Mac Mini:
# Install Ollama (the curl installer script targets Linux; on a Mac use Homebrew
# or download the app from ollama.com)
brew install ollama

# Pull the model (4.7GB download)
ollama pull qwen2.5-coder:7b

# Verify it works
ollama run qwen2.5-coder:7b "Write a hello world in Python"
Configure auto-start (launchd):
Create /Library/LaunchDaemons/com.ollama.ollama.plist:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
"http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>com.ollama.ollama</string>
<key>ProgramArguments</key>
<array>
<string>/usr/local/bin/ollama</string>
<string>serve</string>
</array>
<key>RunAtLoad</key>
<true/>
<key>KeepAlive</key>
<true/>
<key>EnvironmentVariables</key>
<dict>
<key>OLLAMA_HOST</key>
<string>0.0.0.0:11434</string>
</dict>
</dict>
</plist>

sudo launchctl load /Library/LaunchDaemons/com.ollama.ollama.plist
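With the service loaded, it's worth confirming the API is reachable from another machine on the network, since that's exactly the call the CI runner will make. A quick check, with mac-mini.internal standing in for your Mac Mini's internal hostname or IP:

# Lists the installed models if Ollama is up and reachable
curl http://mac-mini.internal:11434/api/tags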
Step 2: GitLab CI Job
Add this to your .gitlab-ci.yml:
ai_bot:
stage: build # Or any stage that runs on MRs
needs: [] # No dependencies - runs in parallel
image: alpine:3.21
rules:
# Skip for main branch pushes
- if: '$CI_COMMIT_BRANCH == "main"'
when: never
# Run on all merge requests
- if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
when: on_success
before_script:
- apk add --no-cache git curl bash jq
script:
- chmod +x scripts/ollama_review.sh
- bash scripts/ollama_review.sh
tags:
- inhouse # Runs on self-hosted runner with network access
allow_failure: true # Don't block MRs if AI review fails
  timeout: 10m

Key decisions:
- allow_failure: true — AI review is advisory, not blocking
- tags: [inhouse] — Ensures the job runs on runners that can reach the Mac Mini
- timeout: 10m — Large diffs may take longer; prevents hanging jobs
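Everything else lives in scripts/ollama_review.sh, which the next sections walk through piece by piece. As orientation, here is a heavily trimmed sketch of its overall shape. OLLAMA_HOST, GITLAB_TOKEN, and the mac-mini.internal hostname are assumptions (CI/CD variables we define ourselves, not GitLab built-ins); the rest uses standard GitLab predefined variables.

#!/usr/bin/env bash
# Trimmed sketch of scripts/ollama_review.sh (not the full production script)
set -euo pipefail

OLLAMA_HOST="${OLLAMA_HOST:-mac-mini.internal:11434}"   # CI/CD variable
MODEL="qwen2.5-coder:7b"

# 1. Collect the MR diff against the target branch
git fetch origin "$CI_MERGE_REQUEST_TARGET_BRANCH_NAME" --depth=50
DIFF=$(git diff "origin/${CI_MERGE_REQUEST_TARGET_BRANCH_NAME}...HEAD")

# 2. Build the prompt (budgeting, batching, and pattern injection come later)
PROMPT="Review these changes and answer in JSON: $DIFF"

# 3. Ask the local model (jq builds the JSON body, so the diff is safely escaped)
RESPONSE=$(jq -n --arg model "$MODEL" --arg prompt "$PROMPT" \
    '{model: $model, prompt: $prompt, stream: false, format: "json"}' \
  | curl -s "http://$OLLAMA_HOST/api/generate" -d @-)
REVIEW=$(echo "$RESPONSE" | jq -r '.response')

# 4. Post the review as an MR comment via the GitLab API
curl -s --request POST \
  --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
  --data-urlencode "body=## 🤖 AI Code Review Summary

$REVIEW" \
  "$CI_API_V4_URL/projects/$CI_PROJECT_ID/merge_requests/$CI_MERGE_REQUEST_IID/notes"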
The Secret Sauce: Optimizations That Make It Work
A naive implementation would hit context limits, produce inconsistent output, and miss important issues. Here are the optimizations that make our system production-ready.
Optimization 1: Token Budget Management
The model may advertise a 128K context window, but locally we run it with an 8K context (NUM_CTX below) to keep memory use and latency in check, so we budget tokens carefully:
# Token budget allocation (24K chars ≈ 8K tokens)
MAX_PROMPT_CHARS=24000 # Total prompt limit
MAX_ADDED_LINES_SIZE=18000 # ~6K tokens for code changes
MAX_ASANA_CONTEXT_SIZE=3000 # ~1K tokens for task context
# Remaining ~1K for instructions, patterns, format

Why this matters: Without budgeting, a 500-line diff would exceed context and produce garbage output.
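Enforcing the budget can be as simple as clamping each piece before the prompt is assembled. A minimal sketch (truncate_to_budget is a hypothetical helper, not the production code):

# Clamp a payload to its character budget, marking the cut
truncate_to_budget() {
  local text="$1" limit="$2"
  if [ "${#text}" -gt "$limit" ]; then
    printf '%s\n[... truncated to fit the token budget ...]\n' \
      "$(printf '%s' "$text" | head -c "$limit")"
  else
    printf '%s\n' "$text"
  fi
}

ADDED_LINES=$(truncate_to_budget "$ADDED_LINES" "$MAX_ADDED_LINES_SIZE")
ASANA_CONTEXT=$(truncate_to_budget "$ASANA_CONTEXT" "$MAX_ASANA_CONTEXT_SIZE")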
Optimization 2: Batched Reviews
Large MRs can have 50+ files. Instead of one massive prompt, we batch:
# Batching configuration
BATCH_SIZE=5 # Files per LLM call
MAX_BATCHES=10 # Maximum batches (50 files coverage)
BATCH_DELAY=2   # Seconds between batches (avoid overload)

How it works:
50 files changed
↓
Split into 10 batches of 5 files each
↓
Batch 1: Files 1-5 → LLM call → JSON result
Batch 2: Files 6-10 → LLM call → JSON result
...
Batch 10: Files 46-50 → LLM call → JSON result
↓
Merge all results into single review
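A condensed sketch of that loop, assuming the changed file paths sit one per line in $CHANGED_FILES and that review_batch (a hypothetical helper) builds the prompt for one batch and returns the model's JSON; merge_reviews is shown just below:

batch_num=0
results=()
while read -r -a batch_files; do
  batch_num=$((batch_num + 1))
  [ "$batch_num" -gt "$MAX_BATCHES" ] && break

  # Review one batch of files and keep its JSON result
  results+=("$(review_batch "$batch_num" "${batch_files[@]}")")
  sleep "$BATCH_DELAY"
done < <(echo "$CHANGED_FILES" | xargs -n "$BATCH_SIZE")

FINAL_REVIEW=$(merge_reviews "${results[@]}")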
Merge logic: Accumulate issues from all batches, use worst verdict:

merge_reviews() {
# Collect all critical/important/suggestions
# Verdict priority: REQUEST_CHANGES > APPROVE_WITH_COMMENTS > APPROVE
# Combine all reasons
}

Optimization 3: Added Lines Only Format
We don’t send the entire diff. We extract ONLY added lines with precise locations:
# Input: raw git diff
# Output: structured format

[src/auth/login.ex:45] + def authenticate(user, password) do
[src/auth/login.ex:46] + case verify_password(user, password) do
[src/auth/login.ex:47] + {:ok, _} -> {:ok, user}
[src/auth/login.ex:48] + {:error, _} -> {:error, :invalid_credentials}
[src/auth/login.ex:49] + end
[src/auth/login.ex:50] + end

Why [file:line] format:
- LLM can reference exact locations in its response
- We can validate LLM output against real line numbers
- Enables line-specific comments on MRs (future feature)
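A minimal sketch of that extraction, assuming a standard unified diff and the busybox/GNU awk available in the alpine image (the production script also annotates and filters, which is omitted here):

git diff "origin/$CI_MERGE_REQUEST_TARGET_BRANCH_NAME...HEAD" | awk '
  /^\+\+\+ b\// { file = substr($0, 7); next }     # track the current file
  /^@@/  { match($0, /\+[0-9]+/)                   # hunk header: new-file start line
           line = substr($0, RSTART + 1, RLENGTH - 1); next }
  /^\+/  { printf "[%s:%d] %s\n", file, line, $0; line++; next }  # added line
  /^-/   { next }                                  # removed line: no line advance
  /^ /   { line++ }                                # context line advances the counter
'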
Optimization 4: Triggered Pattern Injection
Not all code needs the same review rules. We detect patterns in the diff and inject relevant checks:
load_triggered_patterns() {
  local diff="$1"
  local patterns=""

  # Security patterns
  if echo "$diff" | grep -qE "password|secret|token|api_key"; then
    patterns+="⚠️ SECRETS: Check for hardcoded credentials\n"
  fi

  # Financial patterns
  if echo "$diff" | grep -qE "Float\.|amount.*\*|price.*\*"; then
    patterns+="💰 MONEY: Use Decimal, NEVER Float for currency\n"
  fi

  # Database patterns
  if echo "$diff" | grep -qE "INSERT|UPDATE|DELETE|Repo\.(insert|update)"; then
    patterns+="📝 DATABASE: Check for SQL injection, use parameterized queries\n"
  fi

  # API patterns
  if echo "$diff" | grep -qE "fetch|axios|http|curl"; then
    patterns+="🔌 HTTP: Verify timeout handling and error responses\n"
  fi

  # Async patterns
  if echo "$diff" | grep -qE "async|await|Promise|Task\.async"; then
    patterns+="⏳ ASYNC: Check for unhandled rejections/errors\n"
  fi

  # -e so the \n separators print as real newlines
  echo -e "$patterns"
}
Result: The LLM gets context-specific review criteria instead of generic advice.
Optimization 5: LLM Parameters Tuning
Default LLM parameters produce verbose, inconsistent output. We tuned for code review:
# LLM parameters
TEMPERATURE=0.1 # Low = deterministic, consistent format
REPEAT_PENALTY=1.5 # High = prevents verbose repetition
MAX_TOKENS=600 # Short = forces concise output
NUM_CTX=8192          # Context window for full understanding

# System prompt (10%+ quality improvement per research)
SYSTEM_PROMPT="You are an expert code reviewer. You identify ONLY real bugs,
security issues, and critical problems. You never flag style preferences or
valid patterns as issues. You output ONLY the requested format."
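Here is how those settings can be passed to Ollama's /api/generate, sketched with jq so the multi-line prompt and system prompt stay valid JSON (illustrative, not the exact production call):

jq -n \
  --arg system "$SYSTEM_PROMPT" \
  --arg prompt "$PROMPT" \
  '{
    model: "qwen2.5-coder:7b",
    system: $system,
    prompt: $prompt,
    format: "json",
    stream: false,
    options: {
      temperature: 0.1,
      repeat_penalty: 1.5,
      num_predict: 600,
      num_ctx: 8192
    }
  }' | curl -s "http://$OLLAMA_HOST/api/generate" -d @-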
Optimization 6: Semantic Annotations
We pre-annotate important changes to guide LLM attention:
annotate_diff_semantically() {
  # Read the diff line by line on stdin and tag notable added lines
  while IFS= read -r line; do
    case "$line" in
      +*def\ *|+*function\ *)
        echo "[NEW FUNC] $line" ;;
      +*Repo.insert*|+*INSERT*)
        echo "[DB MUTATION] $line" ;;
      +*password*|+*secret*)
        echo "[SECURITY] $line" ;;
      +*send_email*|+*notify*)
        echo "[NOTIFICATION] $line" ;;
      *)
        echo "$line" ;;
    esac
  done
}

Before: the LLM might miss a buried password variable.
After: [SECURITY] + password = params["password"] gets flagged.
Optimization 7: Module Context Extraction
For languages with rich type systems (Elixir, TypeScript), we extract context:
extract_module_context() {
  local file="$1"

  # Extract module name
  grep -m1 "^defmodule" "$file"

  # Extract type specs
  grep "^\s*@spec" "$file" | head -8

  # Extract function signatures
  grep "^\s*def " "$file" | head -15

  # Extract struct definition
  grep "defstruct" "$file"
}
This helps the LLM understand the codebase structure without sending entire files.
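A sketch of how that context can be stitched into a batch's prompt (batch_files and the concatenation order are illustrative):

MODULE_CONTEXT=""
for file in "${batch_files[@]}"; do
  # Only extract context from files that still exist on this branch
  [ -f "$file" ] || continue
  MODULE_CONTEXT+="### Context for $file
$(extract_module_context "$file")

"
done
PROMPT="${MODULE_CONTEXT}${PROMPT}"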
Prompt Engineering: The Complete Prompt
Here’s the actual prompt structure we use:
PROMPT="${ASANA_CONTEXT}Review these code changes. Each line shows [file:line] and the code.CHANGES TO REVIEW (Batch $batch_num of $total_batches):
$ADDED_LINES_WITH_NUMBERSCHECKS TO APPLY:
$TRIGGERED_PATTERNSRULES:
1. Only review the NEWLY ADDED lines shown above
2. Use the EXACT [file:line] from the line you're reviewing
3. If no issues with the NEW code, say \"None found\"
4. Consider task context (if provided) to understand business intentFOCUS ON:
- Bugs or logic errors in the new code
- Security issues introduced by these specific changes
- Missing error handling for new code paths
- Potential crashes or exceptionsSEVERITY GUIDE:
- CRITICAL: Security vulnerabilities, data loss, crashes
- IMPORTANT: Logic errors, missing error handling
- SUGGESTIONS: Style improvements, minor optimizationsDO NOT flag as issues:
- Patterns that follow existing conventions
- Configuration constants (timeouts, limits are intentional)
- Valid language idiomsOUTPUT JSON:
{\"critical\":\"none\",\"important\":\"none\",\"suggestions\":\"none\",
\"verdict\":\"APPROVE\",\"reason\":\"clean code\"}"
Why JSON Output?
We request JSON format for structured parsing:
curl "http://$OLLAMA_HOST/api/generate" -d '{
"model": "qwen2.5-coder:7b",
"prompt": "...",
"format": "json", # ← Request JSON output
"options": {
"temperature": 0.1,
"num_predict": 600,
"repeat_penalty": 1.5
}
}'Benefits:
- Predictable parsing (no regex gymnastics)
- Easy verdict extraction for CI status
- Simple merge of batch results
- Clean separation of issue categories
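Downstream parsing is then a couple of jq calls. A sketch, assuming the request was sent with "stream": false (as above) and the raw curl output is in $RESPONSE:

# Ollama wraps the model's text in a "response" field; here that text is itself JSON
REVIEW_JSON=$(echo "$RESPONSE" | jq -r '.response')
VERDICT=$(echo "$REVIEW_JSON" | jq -r '.verdict // "APPROVE"')

echo "AI verdict: $VERDICT"
# The job is advisory (allow_failure: true), but we can still signal REQUEST_CHANGES
[ "$VERDICT" != "REQUEST_CHANGES" ] || exit 1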
Integrations: External Context
Task Management (Asana/Jira)
We extract task context from MR descriptions:
# Extract task URL from MR description
TASK_URL=$(echo "$MR_DESCRIPTION" | grep -oE 'https://app.asana.com/[0-9/]+')

# Derive the task gid (assumed here to be the last URL path segment)
TASK_GID=$(basename "$TASK_URL")

# Fetch task details via API
TASK_CONTEXT=$(curl -H "Authorization: Bearer $ASANA_PAT" \
  "https://app.asana.com/api/1.0/tasks/$TASK_GID?opt_fields=name,notes")

# Include in prompt (TASK_NAME and TASK_NOTES are parsed out of TASK_CONTEXT with jq)
ASANA_SECTION="
## Task Context
**Task:** $TASK_NAME
**Description:** $TASK_NOTES

Use this context to understand the business intent of the changes.
"
Why this helps: LLM can distinguish between “bug fix” and “new feature” behavior.
Cross-Module Impact Detection
For monorepos, detect when changes affect multiple modules:
detect_cross_impact() {
  local changed_files="$1"

  if echo "$changed_files" | grep -q "auth/"; then
    echo "⚡ AUTH changes may affect: user sessions, permissions"
  fi

  if echo "$changed_files" | grep -q "database/"; then
    echo "⚡ DB schema changes require migrations"
  fi

  if echo "$changed_files" | grep -q "api/"; then
    echo "⚡ API changes may break clients - check versioning"
  fi
}
Real Results
Sample Review Output
## AI Code Review Summary

### CRITICAL
None found

### IMPORTANT
1. **Missing error handling** [src/api/users.ex:67]
   The `Repo.insert` call doesn't handle `{:error, changeset}`.
   Pattern match on the result to handle validation failures.

2. **Potential nil access** [src/api/users.ex:82]
   `user.profile.name` may crash if profile is nil.
   Use `user.profile && user.profile.name` or safe navigation.

### SUGGESTIONS
- Consider extracting lines 45-60 into a separate function for testability.

### VERDICT
APPROVE WITH COMMENTS - Good implementation, address error handling before merge.

---
*Review: Batch 1/2 | Model: qwen2.5-coder:7b | Time: 6.2s | Tokens: 1,250+180*
Limitations & Honest Assessment
Where It Excels
- Pattern matching — Catches common mistakes reliably
- Style consistency — Enforces coding standards
- Security basics — SQL injection, XSS, hardcoded secrets
- Speed — 5–15 seconds vs hours waiting for human review
Where It Falls Short
- Complex business logic — Can’t understand domain-specific rules
- Architecture decisions — Won’t catch design flaws
- Context awareness — Doesn’t know codebase history
- Nuanced judgment — Sometimes flags valid patterns
Our Recommendation
Use AI review as a first-pass filter, not a replacement for human review.
The AI catches low-hanging fruit so humans can focus on architecture and logic.
Getting Started
Quick Start (5 minutes)
- Install Ollama on any machine with 8GB+ RAM:
curl -fsSL https://ollama.ai/install.sh | sh   # on macOS: brew install ollama
ollama pull qwen2.5-coder:7b

- Add minimal CI job:

ai_review:
  image: alpine:3.21
  script:
    - apk add curl jq git
    # NB: a production script should JSON-escape the diff (e.g. with jq -Rs)
    - |
      DIFF=$(git diff origin/main -- '*.py' '*.js' '*.ts')
      curl -s http://YOUR_OLLAMA_HOST:11434/api/generate \
        -d "{\"model\":\"qwen2.5-coder:7b\",\"prompt\":\"Review:\\n$DIFF\"}" \
        | jq -r '.response'

- Iterate — Add batching, patterns, and GitLab posting as needed.
Conclusion
Switching from cloud APIs to local Qwen2.5-Coder was one of the best infrastructure decisions we made. The key insights:
- Budget tokens carefully — Context limits are real constraints
- Batch large reviews — Don’t try to review 50 files in one call
- Inject relevant context — Triggered patterns beat generic rules
- Tune LLM parameters — Low temperature + high repeat penalty = consistent output
- Use structured output — JSON is easier to parse than free text
The result: Zero ongoing API costs, complete data privacy, and instant feedback on every MR.
Is the AI perfect? No. But it catches enough issues to be valuable, and the price (after hardware) is unbeatable.
Have questions? Found improvements? Leave a comment below.
Tags: #DevOps #AI #CodeReview #GitLabCI #Ollama #LLM #MachineLearning #SelfHosted #Qwen