12 min read
Dillon Browne

AI-Powered Code Reviews with LLMs

A practical guide to integrating LLMs into your CI/CD pipelines for automated code reviews, security scanning, and intelligent feedback—with real-world implementation patterns and cost analysis.

AI LLM CI/CD DevOps Automation Code Review GitHub Actions MLOps OpenAI Anthropic Infrastructure as Code Python

I’ve been watching the AI code review space explode over the past year, and I’ll be honest—most implementations I’ve seen are either glorified linters or expensive SaaS products that don’t integrate well with existing DevOps workflows. After building several production AI-powered CI/CD pipelines for enterprise clients, I’ve developed a battle-tested approach that delivers real value without breaking the bank.

Today, I’m seeing articles about “AI Code Reviews Revolutionizing Developer Workflows in 3 Minutes,” and while the hype is real, the implementation details are usually glossed over. Let’s fix that.

The Problem: Code Review Bottlenecks at Scale

At my last engagement, we had a platform team supporting 200+ microservices across 50+ development teams. Code reviews were the single biggest bottleneck in our deployment pipeline:

  • Average PR review time: 8-12 hours
  • Security issues discovered only in production: 15-20% of releases
  • Infrastructure misconfigurations: Caught too late in the deployment cycle
  • Inconsistent feedback: Different reviewers, different standards

Traditional automated tools (SonarQube, linters, SAST scanners) caught syntax issues and known vulnerabilities, but they missed context-aware problems:

  • “This Terraform change will work, but it’ll cost $50k/month more than the current approach”
  • “This database query is fine for 100 users, but you’re launching to 1M next week”
  • “You’re implementing rate limiting, but you already have it three layers up in the API gateway”

This is where LLMs shine—not replacing human reviewers, but augmenting them with intelligent, context-aware analysis.

Architecture: Multi-Model AI Review Pipeline

Here’s the architecture I’ve deployed across multiple organizations:

┌─────────────────┐
│   GitHub PR     │
│   (triggered)   │
└────────┬────────┘
         │
         ▼
┌─────────────────────────────────────────┐
│     GitHub Actions Workflow             │
│                                         │
│  1. Diff Analysis & Context Gathering   │
│  2. Repository Embedding Search (RAG)   │
│  3. Multi-Model LLM Review              │
│  4. Security & Cost Analysis            │
│  5. Comment Generation & Posting        │
└─────────────────────────────────────────┘
                    │
                    ▼
┌─────────────────────────────────────────┐
│         LLM Routing Layer               │
│      (OpenRouter / Self-Hosted)         │
│                                         │
│  • GPT-4o: Complex logic review         │
│  • Claude 3.5 Sonnet: Security analysis │
│  • Llama 3.1 70B: IaC/config review     │
└─────────────────────────────────────────┘
                    │
                    ▼
┌─────────────────────────────────────────┐
│       Knowledge Base (RAG)              │
│                                         │
│  • Vector DB (Pinecone/pgvector)        │
│  • Company coding standards             │
│  • Architecture decision records        │
│  • Historical incident postmortems      │
└─────────────────────────────────────────┘

Why Multi-Model?

Different LLMs excel at different tasks. After testing 15+ models across thousands of PRs, here’s what I found:

  • GPT-4o: Best for complex business logic, API design, and architectural decisions
  • Claude 3.5 Sonnet: Superior at security analysis and identifying subtle vulnerabilities
  • Llama 3.1 70B (self-hosted): Cost-effective for Infrastructure as Code and configuration reviews
  • DeepSeek Coder: Excellent for language-specific optimizations (Python, Go, TypeScript)

Total cost per PR review: $0.03 - $0.15 depending on PR size and models used.
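
To make the routing and the per-PR cost figure concrete, here is a rough sketch of how I think about dispatching concerns to models and estimating spend. The model identifiers and the per-1K-token price are illustrative placeholders, not measured rates.

import tiktoken

# Concern -> model routing table (identifiers are illustrative placeholders)
MODEL_BY_CONCERN = {
    "architecture": "gpt-4o",
    "security": "claude-3-5-sonnet-20241022",
    "iac": "llama-3.1-70b-instruct",   # self-hosted
    "language": "deepseek-coder",
}


def estimate_review_cost(pr_diff: str, concerns: list[str],
                         blended_price_per_1k_tokens: float = 0.005) -> float:
    """Back-of-the-envelope cost: diff tokens x number of model calls x price.

    The blended price is a placeholder; substitute each model's real rate.
    """
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = len(enc.encode(pr_diff))
    # Each concern sends the diff once, so cost scales with len(concerns)
    return len(concerns) * (tokens / 1000) * blended_price_per_1k_tokens

With the placeholder price above, a ~5K-token diff reviewed across four concerns comes out to roughly $0.10, squarely inside the range quoted above.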

Implementation: GitHub Actions Workflow

Here’s the core GitHub Actions workflow that powers the system:

name: AI Code Review

on:
  pull_request:
    types: [opened, synchronize]

jobs:
  ai-review:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          fetch-depth: 0  # Full history for context
      
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      
      - name: Install dependencies
        run: |
          pip install openai anthropic langchain pinecone-client tiktoken
      
      - name: Run AI Code Review
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          PINECONE_API_KEY: ${{ secrets.PINECONE_API_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          python .github/scripts/ai_code_review.py \
            --pr-number ${{ github.event.pull_request.number }} \
            --repo ${{ github.repository }}
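
The workflow delegates everything to .github/scripts/ai_code_review.py. That script isn't reproduced in full here, but the glue around the reviewer class looks roughly like the sketch below. It assumes requests is added to the pip install step, and the import path is whatever file you put the AICodeReviewer class (shown in the next section) into.

import argparse
import os

import requests  # assumed to be added to the workflow's pip install step

from ai_code_reviewer import AICodeReviewer  # hypothetical module path

GITHUB_API = "https://api.github.com"


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--pr-number", type=int, required=True)
    parser.add_argument("--repo", required=True)  # "owner/name"
    args = parser.parse_args()

    headers = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}
    pr_url = f"{GITHUB_API}/repos/{args.repo}/pulls/{args.pr_number}"

    # PR metadata and changed files (first page only; paginate for large PRs)
    pr = requests.get(pr_url, headers=headers).json()
    files = requests.get(f"{pr_url}/files", headers=headers).json()
    pr_context = {
        "title": pr["title"],
        "description": pr.get("body") or "",
        "files": [f["filename"] for f in files],
    }

    # Raw unified diff via GitHub's diff media type
    diff = requests.get(
        pr_url, headers={**headers, "Accept": "application/vnd.github.diff"}
    ).text

    # Run the multi-model review and post the result as a PR comment
    review_body = AICodeReviewer().analyze_pr(diff, pr_context)
    requests.post(
        f"{GITHUB_API}/repos/{args.repo}/issues/{args.pr_number}/comments",
        headers=headers,
        json={"body": review_body},
    )


if __name__ == "__main__":
    main()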

The Review Engine: Python + LangChain

Here’s the core review engine I’ve refined over dozens of implementations:

from langchain.chat_models import ChatOpenAI, ChatAnthropic
from langchain.prompts import ChatPromptTemplate
from langchain.schema import HumanMessage, SystemMessage
from langchain.vectorstores import Pinecone
from langchain.embeddings import OpenAIEmbeddings
import tiktoken
import os
import json

class AICodeReviewer:
    def __init__(self):
        self.gpt4 = ChatOpenAI(model="gpt-4o", temperature=0.1)
        self.claude = ChatAnthropic(model="claude-3-5-sonnet-20241022", temperature=0.1)
        self.embeddings = OpenAIEmbeddings()
        self.vector_store = Pinecone.from_existing_index(
            index_name="code-knowledge-base",
            embedding=self.embeddings
        )
        
    def analyze_pr(self, pr_diff: str, pr_context: dict) -> str:
        """
        Multi-stage PR analysis with different models for different concerns
        """
        # Stage 1: Gather relevant context from knowledge base (RAG)
        context_docs = self._get_relevant_context(pr_diff, pr_context)
        
        # Stage 2: Run the specialized analyses (sequential here; a parallel variant is sketched after the class)
        results = {
            "security": self._security_analysis(pr_diff, context_docs),
            "architecture": self._architecture_review(pr_diff, context_docs),
            "performance": self._performance_analysis(pr_diff, context_docs),
            "cost": self._cost_impact_analysis(pr_diff, pr_context),
        }
        
        # Stage 3: Synthesize findings
        final_review = self._synthesize_review(results, pr_context)
        
        return final_review
    
    def _get_relevant_context(self, pr_diff: str, pr_context: dict) -> list:
        """
        RAG: Retrieve relevant context from knowledge base
        """
        # Create search query from PR title, description, and changed files
        search_query = f"""
        PR: {pr_context['title']}
        Description: {pr_context['description']}
        Files changed: {', '.join(pr_context['files'])}
        """
        
        # Semantic search for relevant documentation, standards, and incidents
        relevant_docs = self.vector_store.similarity_search(
            search_query,
            k=5,
            filter={
                "type": {"$in": ["coding_standard", "architecture_decision", "incident_postmortem"]}
            }
        )
        
        return relevant_docs
    
    def _security_analysis(self, pr_diff: str, context_docs: list) -> dict:
        """
        Use Claude for security analysis (empirically best at this)
        """
        prompt = ChatPromptTemplate.from_messages([
            SystemMessage(content="""You are a senior security engineer reviewing code changes.
            Focus on:
            - Authentication/authorization issues
            - SQL injection, XSS, CSRF vulnerabilities
            - Secrets or credentials in code
            - Insecure dependencies
            - Data exposure risks
            
            Reference the provided company security standards and past incidents.
            Be specific about line numbers and provide remediation suggestions."""),
            HumanMessage(content=f"""
            Code changes:
            {pr_diff}
            
            Relevant security standards and incidents:
            {self._format_context_docs(context_docs, 'security')}
            
            Provide security analysis in JSON format:
            {{
                "critical_issues": [
                    {{"title": "", "line": 0, "description": ""}}
                ],
                "warnings": [],
                "recommendations": []
            }}
            """)
        ])
        
        response = self.claude.invoke(prompt.format_messages())
        return json.loads(response.content)
    
    def _architecture_review(self, pr_diff: str, context_docs: list) -> dict:
        """
        Use GPT-4o for architectural and design pattern analysis
        """
        prompt = ChatPromptTemplate.from_messages([
            SystemMessage(content="""You are a principal architect reviewing code changes.
            Evaluate:
            - Adherence to established patterns and standards
            - API design and interface contracts
            - Separation of concerns and modularity
            - Scalability implications
            - Integration with existing systems
            
            Reference architecture decision records (ADRs) and coding standards."""),
            HumanMessage(content=f"""
            Code changes:
            {pr_diff}
            
            Relevant ADRs and standards:
            {self._format_context_docs(context_docs, 'architecture')}
            
            Provide architectural analysis in JSON format:
            {{
                "concerns": [],
                "recommendations": []
            }}
            """)
        ])
        
        response = self.gpt4.invoke(prompt.format_messages())
        return json.loads(response.content)
    
    def _performance_analysis(self, pr_diff: str, context_docs: list) -> dict:
        """
        Analyze performance implications of code changes
        """
        prompt = ChatPromptTemplate.from_messages([
            SystemMessage(content="""You are a performance engineer reviewing code changes.
            Evaluate:
            - Algorithm complexity and efficiency
            - Database query optimization
            - Memory usage patterns
            - Potential bottlenecks
            - Caching opportunities
            
            Provide specific recommendations for performance improvements."""),
            HumanMessage(content=f"""
            Code changes:
            {pr_diff}
            
            Relevant performance standards:
            {self._format_context_docs(context_docs, 'performance')}
            
            Provide performance analysis in JSON format:
            {{
                "concerns": [],
                "recommendations": []
            }}
            """)
        ])
        
        response = self.gpt4.invoke(prompt.format_messages())
        return json.loads(response.content)
    
    def _cost_impact_analysis(self, pr_diff: str, pr_context: dict) -> dict:
        """
        Analyze infrastructure/cloud cost implications
        """
        # Check if PR contains IaC files
        iac_files = [f for f in pr_context['files'] 
                     if f.endswith(('.tf', '.yaml', '.yml', 'Dockerfile'))]
        
        if not iac_files:
            return {"impact": "none", "analysis": "No infrastructure changes detected"}
        
        prompt = ChatPromptTemplate.from_messages([
            SystemMessage(content="""You are a FinOps engineer analyzing infrastructure changes.
            Evaluate cost implications of:
            - New cloud resources (compute, storage, networking)
            - Scaling configurations
            - Data transfer patterns
            - Managed service usage
            
            Provide monthly cost estimates and optimization suggestions."""),
            HumanMessage(content=f"""
            Infrastructure changes:
            {pr_diff}
            
            Changed files: {', '.join(iac_files)}
            
            Provide cost analysis in JSON format:
            {{
                "impact": "increase | decrease | none",
                "estimate": "+$X/month",
                "analysis": ""
            }}
            """)
        ])
        
        response = self.gpt4.invoke(prompt.format_messages())
        return json.loads(response.content)
    
    def _format_context_docs(self, context_docs: list, doc_type: str) -> str:
        """
        Format context documents for inclusion in prompts
        """
        # Prefer docs whose metadata type matches the concern (e.g. 'architecture'
        # matches 'architecture_decision'); fall back to everything retrieved so
        # the prompt never goes out empty-handed.
        filtered_docs = [doc for doc in context_docs
                         if doc_type in doc.metadata.get('type', '')]
        if not filtered_docs:
            filtered_docs = context_docs
        return "\n\n".join([doc.page_content for doc in filtered_docs])
    
    def _synthesize_review(self, results: dict, pr_context: dict) -> str:
        """
        Combine all analyses into a coherent review comment
        """
        review_sections = []
        
        # Security findings
        if results['security']['critical_issues']:
            review_sections.append("## 🚨 Security Issues\n")
            for issue in results['security']['critical_issues']:
                review_sections.append(f"- **{issue['title']}** (Line {issue['line']})\n  {issue['description']}\n")
        
        # Architecture feedback
        if results['architecture']['recommendations']:
            review_sections.append("\n## 🏗️ Architecture Recommendations\n")
            for rec in results['architecture']['recommendations']:
                review_sections.append(f"- {rec}\n")
        
        # Performance feedback
        if results['performance']['recommendations']:
            review_sections.append("\n## ⚡ Performance Recommendations\n")
            for rec in results['performance']['recommendations']:
                review_sections.append(f"- {rec}\n")
        
        # Cost impact
        if results['cost']['impact'] != 'none':
            review_sections.append(f"\n## 💰 Cost Impact\n")
            review_sections.append(f"Estimated monthly cost change: **{results['cost']['estimate']}**\n")
            review_sections.append(f"{results['cost']['analysis']}\n")
        
        return "\n".join(review_sections)
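
The four analyses are independent LLM calls, so in practice they can run concurrently rather than back to back. A minimal sketch using the methods defined above, thread-based because the work is I/O-bound:

from concurrent.futures import ThreadPoolExecutor


def analyze_pr_concurrently(reviewer: AICodeReviewer, pr_diff: str, pr_context: dict) -> str:
    """Run the four independent analyses in parallel, then synthesize."""
    context_docs = reviewer._get_relevant_context(pr_diff, pr_context)
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {
            "security": pool.submit(reviewer._security_analysis, pr_diff, context_docs),
            "architecture": pool.submit(reviewer._architecture_review, pr_diff, context_docs),
            "performance": pool.submit(reviewer._performance_analysis, pr_diff, context_docs),
            "cost": pool.submit(reviewer._cost_impact_analysis, pr_diff, pr_context),
        }
        results = {name: future.result() for name, future in futures.items()}
    return reviewer._synthesize_review(results, pr_context)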

RAG Integration: Teaching the AI Your Codebase

The secret sauce is the knowledge base. Here’s what I embed into the vector database:

  1. Coding Standards - Company style guides, best practices
  2. Architecture Decision Records (ADRs) - Why we chose certain patterns
  3. Incident Postmortems - What went wrong and how to prevent it
  4. API Documentation - Internal service contracts and dependencies
  5. Cost Benchmarks - Historical spending data for infrastructure components

Here’s the knowledge base builder:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.schema import Document
from langchain.vectorstores import Pinecone
import glob

def build_knowledge_base():
    """
    Index all relevant documentation into vector database
    """
    embeddings = OpenAIEmbeddings()
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
    )
    
    documents = []
    
    # Index coding standards
    for file in glob.glob("docs/standards/**/*.md", recursive=True):
        with open(file, 'r') as f:
            content = f.read()
            chunks = text_splitter.split_text(content)
            for chunk in chunks:
                documents.append(Document(
                    page_content=chunk,
                    metadata={"type": "coding_standard", "source": file},
                ))

    # ADRs, incident postmortems, API docs, and cost benchmarks are indexed
    # the same way, with metadata types matching the reviewer's filter
    # ("architecture_decision", "incident_postmortem", ...)

    # Push everything into the index the reviewer queries at review time
    Pinecone.from_documents(
        documents,
        embeddings,
        index_name="code-knowledge-base",
    )