In the AI-driven world of software development, data has become the new gold, and postmortem data represents some of the most valuable insights an organization can possess. Every incident, outage, and near-miss contains critical lessons about what can go wrong and how to prevent it. Yet we still apply these insights manually, relying on individual reviewer knowledge and ad-hoc knowledge sharing. Despite years of accumulated incident data, there is no automated solution to systematically prevent the same issues from recurring across different services.
At PayPay, we continuously innovate to improve our development workflows and maintain the highest standards of code quality, which is why we built GBB RiskBot, a code review system that leverages historical incident data to proactively identify potential risks in pull requests across our organization.
The Challenge: Scaling Code Review in a Fast-Growing Organization
PayPay has experienced tremendous growth over the past 7 years since its launch in 2018, and with that growth comes the challenge of maintaining code quality across an ever-expanding codebase. Our engineering teams work across dozens of repositories, each with unique characteristics and potential risk patterns.
Traditional code review processes, while effective, have limitations:
- Knowledge silos: Past incident knowledge is limited to project scope, with team member turnover causing critical context loss.
- Manual knowledge sharing: Sharing and applying knowledge across teams is done manually, creating inconsistent application of lessons learned.
- Potential recurring issues: Similar patterns can happen across different services due to lack of centralized incident awareness.
- Historical incident context varies: Each reviewer has different experience levels, leading to inconsistent risk assessment.
GBB RiskBot: An Automated Code Review Assistant
GBB RiskBot represents our approach to solving these challenges by automating risk assessment in code review. When a developer opens a pull request, the bot automatically analyzes the code changes against our historical incident database to identify potential risks.

How it works
The process involves two key systems.
Knowledge Base Ingestion
- Data Fetching: a cron job in GitHub Actions continuously detects newly created incident data from multiple sources
- Data Preprocessing: meaningful data is extracted from each source and normalized into our database format
- Vector Indexing: OpenAI embeddings, wrapped by LangChain, turn each record into a searchable vector stored in a vector database (ChromaDB)
- There are many vector databases to choose from (pgvector, Weaviate, Pinecone, FAISS, Chroma); we settled on ChromaDB for its low cost and easy PoC setup.
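The ingestion step can be sketched roughly as below. This is a minimal illustration, not the production code: the incident record shape, collection name, and persist directory are our own assumptions, and the `langchain-openai` / `langchain-chroma` packages plus an `OPENAI_API_KEY` would be required to actually run the indexing function.

```python
from dataclasses import dataclass

@dataclass
class Incident:
    incident_id: str
    title: str
    summary: str
    root_cause: str

def to_document(incident: Incident) -> tuple[str, dict]:
    """Normalize an incident record into (text, metadata) for vector indexing."""
    text = f"{incident.title}\n{incident.summary}\nRoot cause: {incident.root_cause}"
    metadata = {"incident_id": incident.incident_id, "title": incident.title}
    return text, metadata

def index_incidents(incidents: list[Incident]) -> None:
    # Imports kept local: these require langchain-openai / langchain-chroma
    # and a valid OPENAI_API_KEY. Names below are illustrative.
    from langchain_openai import OpenAIEmbeddings
    from langchain_chroma import Chroma

    texts, metadatas = zip(*(to_document(i) for i in incidents))
    Chroma.from_texts(
        texts=list(texts),
        embedding=OpenAIEmbeddings(model="text-embedding-ada-002"),
        metadatas=list(metadatas),
        collection_name="incidents",
        persist_directory="./chroma_db",
    )
```

Keeping normalization separate from indexing makes it easy to re-embed the whole knowledge base later (for example, when switching embedding models) without touching the preprocessing logic.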
Contextual Code Analysis
- A pull request (or a push of new changes) triggers similarity searches against the knowledge base
- Similarity search: each item in the given context (PR title, description, and all code changes) is converted to a vector via OpenAI embeddings
- A 1,000-character limit per item is enforced for performance, analysis quality, and cost control
- Each vector is queried against the vector database using cosine similarity
- The top-K most similar documents are retrieved
- RAG response generation: the retrieved facts and the code change are fed to ChatGPT (gpt-4o-mini) with a prompt template to generate a GitHub comment
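The steps above can be sketched as follows. Function names, the prompt wording, and the store configuration are illustrative assumptions, not the production implementation; `find_similar` assumes the Chroma store created at ingestion time with the same embedding model.

```python
ITEM_LIMIT = 1000  # characters per context item, for cost and quality control

def prepare_items(title: str, description: str, diffs: list[str]) -> list[str]:
    """Collect PR context items, truncating each to the per-item limit."""
    return [item[:ITEM_LIMIT] for item in [title, description, *diffs] if item]

def find_similar(item: str, k: int = 3) -> list[str]:
    # Requires langchain-openai / langchain-chroma and the persisted store;
    # collection name and directory mirror the ingestion sketch.
    from langchain_openai import OpenAIEmbeddings
    from langchain_chroma import Chroma

    store = Chroma(
        collection_name="incidents",
        embedding_function=OpenAIEmbeddings(model="text-embedding-ada-002"),
        persist_directory="./chroma_db",
    )
    return [doc.page_content for doc in store.similarity_search(item, k=k)]

def build_prompt(code_change: str, similar_incidents: list[str]) -> str:
    """Assemble the RAG prompt for gpt-4o-mini (wording is illustrative)."""
    facts = "\n".join(f"- {doc}" for doc in similar_incidents)
    return (
        "You are a code reviewer. Given these past incidents:\n"
        f"{facts}\n\n"
        f"And this code change:\n{code_change}\n\n"
        "Write a GitHub comment describing any potential risks."
    )
```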
Why is GPT-4o-mini sufficient?
Since the system relies on semantic search to retrieve relevant historical incidents, the LLM’s role is primarily to synthesize and present existing facts rather than perform complex reasoning. The heavy lifting—identifying similar code patterns and matching them to past incidents—is handled by the vector similarity search. The model doesn’t need to reason about what could go wrong, because it already has concrete examples of what did go wrong in similar contexts. It just needs to write up the findings in a readable format.
Cost
The operational cost of GBB RiskBot is primarily driven by OpenAI API usage across two main components: embedding generation for knowledge base indexing and chat model inference for code analysis.
OpenAI API Pricing
- Embedding (text-embedding-ada-002): $0.10 per 1M tokens
- Chat Model (gpt-4o-mini):
  - Input: $0.15 per 1M tokens
  - Output: $0.60 per 1M tokens
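Given these list prices, the cost of one bot run reduces to a simple per-token formula. The token counts in the usage example are made-up illustrative figures, not measurements from our system:

```python
# Back-of-the-envelope cost model using the list prices above (USD per 1M tokens).
EMBED_PER_M = 0.10      # text-embedding-ada-002
CHAT_IN_PER_M = 0.15    # gpt-4o-mini input
CHAT_OUT_PER_M = 0.60   # gpt-4o-mini output

def run_cost(embed_tokens: int, in_tokens: int, out_tokens: int) -> float:
    """Estimated USD cost of a single bot run."""
    return (
        embed_tokens * EMBED_PER_M
        + in_tokens * CHAT_IN_PER_M
        + out_tokens * CHAT_OUT_PER_M
    ) / 1_000_000

# Hypothetical run: 2,000 embedding tokens, 1,500 prompt tokens, 300 output tokens.
print(run_cost(2_000, 1_500, 300))  # ≈ $0.000605
```

Even a generously sized run stays well under a tenth of a cent, which is why per-PR analysis remains cheap at scale.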
Cost Factors
Knowledge Base Ingestion (one-shot database initialization + incremental updates):
- Initial historical incident data processing (embedding generation)
- New incident reports (if any) detected and indexed to the knowledge base
Per-PR Analysis (Ongoing cost):
- PR Context Size: pull requests with long descriptions require more input tokens
- Files and Lines of Code (LOC) Changed: more changed files and larger diffs increase token consumption
- Analysis Time: each item is compared against the whole vector database, so the more items to scan, the longer the analysis takes
Sample cost
Estimated cost of database initialization with 47 incident records: $0.001852
Estimated cost of analyzing a PR with one file changed: $0.000350
Last month (July), running the bot across 12 repositories for 380+ total runs cost a mere $0.59 USD. Given the OpenAI API pricing, this is a very cost-effective method compared to the potential cost of production incidents.
Success Metrics
To measure the effectiveness of an AI-powered code review system, we monitor a three-tier set of metrics that provides both leading indicators of system health and lagging indicators of business value.
Tier 1: Core Operational Metrics (Leading Indicators)
These real-time metrics provide immediate insights into system performance and help identify issues quickly:
- Issue Detection Rate: Percentage of analyzed PRs where potential risks are identified based on historical incidents. A rate that is too high suggests false positives and can cause bot fatigue, leading developers to ignore real alerts; a rate that is too low may indicate insufficient training data or false negatives.
- Distinct Incident Coverage: Number of unique historical incidents referenced in analyses. If results are skewed toward a handful of incidents, this signals a training data quality problem.
- Repository Coverage: Number of repositories where the bot is enabled.
- Knowledge Base Growth Rate: Tracks the continuous expansion of our incident database, ensuring the system’s learning capacity improves over time.
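These Tier 1 numbers can be derived straightforwardly from bot run logs. The log record shape below is our own assumption for illustration:

```python
def tier1_metrics(runs: list[dict]) -> dict:
    """Compute Tier 1 metrics from run logs.

    Each run is assumed to look like:
        {"repo": str, "incident_ids": [str, ...]}  # empty list if no risk found
    """
    flagged = [r for r in runs if r["incident_ids"]]
    distinct = {i for r in runs for i in r["incident_ids"]}
    return {
        "issue_detection_rate": len(flagged) / len(runs) if runs else 0.0,
        "distinct_incident_coverage": len(distinct),
        "repository_coverage": len({r["repo"] for r in runs}),
    }
```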
Tier 2: Developer Feedback (Direct Indicators)
Developers can give feedback on each analysis via GitHub emoji reactions. A daily automated workflow collects reactions from the past 7 days and stores detailed metrics in our analytics database for trend analysis.
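A collection step along these lines is one way to implement it. The fetch function follows the GitHub REST API endpoint for issue-comment reactions; the positive/negative grouping of emoji is our own assumption, and storage is left out:

```python
import json
import urllib.request
from collections import Counter

def fetch_reactions(owner: str, repo: str, comment_id: int, token: str) -> list[dict]:
    """List reactions on an issue comment via the GitHub REST API."""
    url = (
        f"https://api.github.com/repos/{owner}/{repo}"
        f"/issues/comments/{comment_id}/reactions"
    )
    req = urllib.request.Request(url, headers={
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    })
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Which reaction emoji count as positive vs. negative is our own grouping.
POSITIVE = {"+1", "heart", "hooray", "rocket"}
NEGATIVE = {"-1", "confused"}

def tally(reactions: list[dict]) -> dict:
    """Aggregate raw reaction records into sentiment counts."""
    counts = Counter(r["content"] for r in reactions)
    return {
        "positive": sum(v for k, v in counts.items() if k in POSITIVE),
        "negative": sum(v for k, v in counts.items() if k in NEGATIVE),
        "total": sum(counts.values()),
    }
```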
Tier 3: Business Impact (Lagging Indicators)
Long-term metrics that demonstrate ROI and organizational value:
- Incident Rate Reduction: Tracking whether the incident rate trend decreases over time across repositories using the bot.

Future Roadmap
To improve retrieval quality, we plan to upgrade the embedding model from text-embedding-ada-002 to text-embedding-3-large. We also want to experiment with approaches beyond RAG, such as CAG and mem0.
To improve search accuracy and reduce false positives, we want to add a rerank step after the cosine similarity search to further assess the relevance of the retrieved results to the input.
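A rerank step could be wired in along these lines. This is only a sketch of the shape of the change: the scoring function is injected, so it could be anything from simple lexical overlap to a cross-encoder model's prediction score; none of this reflects a decided design.

```python
from typing import Callable

def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 3,
) -> list[str]:
    """Re-score similarity-search candidates and keep only the best top_n.

    score_fn(query, candidate) returns a relevance score; in practice this
    could be a cross-encoder reranker rather than the toy function below.
    """
    ranked = sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)
    return ranked[:top_n]
```

Because the vector search already narrows the field to top-K candidates, even a relatively expensive reranker only has to score a handful of pairs per item.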
Conclusion
GBB RiskBot represents our commitment to leveraging technology to improve developer productivity and code quality at PayPay. By combining historical incident data with modern AI, we’ve created a system that not only catches potential issues but also educates developers and democratizes knowledge across our organization.
As we continue to scale PayPay’s engineering organization, tools like GBB RiskBot will play an increasingly important role in maintaining our high standards of code quality and operational excellence.

