
πŸ—οΈ RepoRover System Architecture

RepoRover is an AI-powered code analysis platform that provides deep insights into GitHub repositories. The system is built on a modern, scalable architecture that combines FastAPI for the backend, AI models for code understanding, and a clean, responsive frontend.

🌟 Core Principles

  • Modular Design: Components are loosely coupled and follow the single responsibility principle
  • Extensible: Easy to add new analysis modules or integrate with different AI models
  • Real-time Processing: Provides immediate feedback during repository analysis
  • Scalable: Designed to handle repositories of various sizes efficiently

🧩 Core Components

1. Backend Services

  • FastAPI Application: Handles HTTP requests and serves the frontend
  • Background Task Queue: Manages long-running repository analysis tasks
  • API Endpoints:
    • /ingest: Start repository ingestion
    • /ingest/status/{task_id}: Check ingestion status
    • /query: Submit questions about the repository

2. AI Components

  • Dispatcher Agent: Orchestrates the analysis workflow
  • Semantic Memory Manager: Handles storage and retrieval of code knowledge
  • AI Model Integrations: Support for multiple AI providers (Gemini, Groq)
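Multi-provider support is commonly structured as a thin abstraction over the provider SDKs; the class and function names below are illustrative, not RepoRover's actual API:

```python
from abc import ABC, abstractmethod

class AIModel(ABC):
    """Common interface every provider integration implements."""
    @abstractmethod
    def complete(self, prompt: str) -> str: ...

class GeminiModel(AIModel):
    def complete(self, prompt: str) -> str:
        # Real implementation would call the Gemini SDK here.
        return f"[gemini] answer to: {prompt}"

class GroqModel(AIModel):
    def complete(self, prompt: str) -> str:
        # Real implementation would call the Groq SDK here.
        return f"[groq] answer to: {prompt}"

def get_model(provider: str) -> AIModel:
    # Provider selection in one place; adding a backend means one new class.
    return {"gemini": GeminiModel, "groq": GroqModel}[provider]()
```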

3. Frontend

  • Single Page Application: Built with vanilla JavaScript
  • Responsive UI: Using Tailwind CSS for styling
  • Real-time Updates: WebSocket-based updates for long-running tasks

4. Data Storage

  • Semantic Memory: Stores processed code information
  • Vector Database: For efficient similarity search of code patterns
  • Task Status Tracking: In-memory storage for monitoring analysis progress
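As a dependency-free toy sketch of how the three layers split responsibilities, plain dictionaries stand in here for the real backing stores (SQLite, NetworkX, and ChromaDB, per the ingestion diagram below):

```python
import math

entity_store = {}     # entity id -> details      (SQLite in practice)
knowledge_graph = {}  # entity id -> related ids  (NetworkX in practice)
vector_store = {}     # entity id -> embedding    (ChromaDB in practice)

def add_entity(eid, details, embedding, related=()):
    entity_store[eid] = details
    knowledge_graph[eid] = list(related)
    vector_store[eid] = embedding

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def most_similar(query_vec):
    # Vector database role: nearest-neighbour search over stored embeddings.
    return max(vector_store, key=lambda eid: cosine(query_vec, vector_store[eid]))
```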

πŸ”„ Ingestion Workflow

The ingestion process transforms a GitHub repository into a structured knowledge base that can be queried in natural language.

Trigger

  • User submits a GitHub repository URL through the web interface

Process Flow

  1. Repository Cloning

    • Clones the target repository locally
    • Scans the repository structure
    • Identifies different file types and their relationships
  2. Code Analysis

    • Parses source code files
    • Extracts functions, classes, and their documentation
    • Builds a semantic understanding of the codebase
    • Identifies dependencies between components
  3. Knowledge Base Population

    • Stores extracted information in the semantic memory
    • Generates vector embeddings for semantic search
    • Builds a knowledge graph of the codebase
```mermaid
graph TD
    A[Start: GitHub URL] --> B(Dispatcher Agent);
    B --> C{"Clones Repo & Scans Files"};
    C --> D[Architect Agent];
    D --> E[Librarian Agent];
    E --> F[Annotator Agent];

    subgraph Semantic Memory
        G[Entity Store - SQLite];
        H[Knowledge Graph - NetworkX];
        I[Vector Store - ChromaDB];
    end

    D -- Creates Code Entities & Relationships --> H;
    D -- Stores Code Details --> G;
    E -- Creates Doc Chunks --> I;
    E -- Stores Doc Details --> G;
    F -- Generates Summaries --> G;
    F -- Updates Embeddings --> I;

    F --> J[End: Ingestion Complete];
```
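The agent hand-offs in the diagram above can be sketched as a simple pipeline; the agent internals here are stubs (the real agents clone, parse, and embed), only the sequencing is taken from the diagram:

```python
def dispatcher(repo_url):
    # Clones the repo and scans its files (stubbed with a fixed file list).
    return {"repo": repo_url, "files": ["main.py", "utils.py"]}

def architect(scan):
    # Extracts code entities and their relationships.
    return {"entities": [{"name": f, "kind": "module"} for f in scan["files"]]}

def librarian(scan):
    # Chunks documentation for the vector store.
    return {"doc_chunks": [f"chunk of {f}" for f in scan["files"]]}

def annotator(entities, doc_chunks):
    # Generates summaries and refreshes embeddings.
    return {"summaries": [f"summary of {e['name']}" for e in entities["entities"]]}

def run_ingestion_pipeline(repo_url):
    scan = dispatcher(repo_url)
    entities = architect(scan)
    chunks = librarian(scan)
    return annotator(entities, chunks["doc_chunks"])
```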

πŸ’¬ Query Processing Workflow

Trigger

  • User submits a natural language question about the codebase

Process Flow

  1. Query Understanding

    • Analyzes the user's question
    • Identifies key concepts and intents
    • Determines relevant parts of the codebase to examine
  2. Context Retrieval

    • Searches the semantic memory for relevant code snippets
    • Retrieves related documentation and examples
    • Gathers contextual information about the code
  3. Response Generation

    • Formulates a comprehensive answer using AI
    • Includes relevant code examples
    • Provides additional context and suggestions
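A minimal, self-contained sketch of these three steps, with a keyword match standing in for real intent analysis and semantic search, and a formatted string standing in for LLM synthesis (the tiny in-memory knowledge base and all names are illustrative):

```python
KNOWLEDGE = {
    "auth": "def login(user): ...    # validates credentials",
    "ingest": "def ingest(url): ...  # clones and indexes a repo",
}

def understand(question):
    # Query understanding: pull out known concepts (toy keyword match).
    return [k for k in KNOWLEDGE if k in question.lower()]

def retrieve(concepts):
    # Context retrieval: fetch relevant snippets from semantic memory.
    return [KNOWLEDGE[c] for c in concepts]

def respond(question):
    # Response generation: the real system has an LLM synthesize the answer.
    snippets = retrieve(understand(question))
    return "Relevant code:\n" + "\n".join(snippets) if snippets else "No match."
```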

πŸš€ Deployment Architecture

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                 β”‚     β”‚                     β”‚     β”‚                  β”‚
β”‚  User's Browser β”œβ”€β”€β”€β”€β–Ίβ”‚  FastAPI Backend    │◄───►│  AI Models       β”‚
β”‚                 β”‚     β”‚  (Python)           β”‚     β”‚  (Gemini, Groq)  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                  β”‚
                                  β–Ό
                         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                         β”‚                   β”‚
                         β”‚  Semantic Memory  β”‚
                         β”‚  (ChromaDB)       β”‚
                         β”‚                   β”‚
                         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

πŸ”„ Data Flow

  1. Ingestion Path

    • GitHub Repo β†’ FastAPI β†’ Background Task β†’ AI Processing β†’ Semantic Memory
  2. Query Path

    • User Question β†’ FastAPI β†’ AI Model β†’ Semantic Memory β†’ Response Generation β†’ User
```mermaid
graph TD
    A[Start: User Question] --> B(Dispatcher Agent);
    B -- Assembles Cognitive Context --> C[Query Planner Agent];

    subgraph Cognitive Context
        D[Episodic Memory - History];
        E[Core Memory - Persona];
    end

    D --> B;
    E --> B;

    C -- Creates Plan --> F[Information Retriever Agent];
    F -- Executes Plan --> G((Semantic Memory));
    G -- Returns Data --> H[Synthesizer Agent];
    H -- Generates Response --> I[End: Final Answer];
```
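The context-assembly step in the diagram above might look like the following; the data structures and the toy planning rule are assumptions for illustration, not the actual agent logic:

```python
# Stand-ins for the two memory stores the dispatcher draws on.
episodic_memory = ["Q: what does ingest do? A: clones and indexes the repo"]
core_memory = {"persona": "helpful code-analysis assistant"}

def assemble_context(question):
    # Dispatcher Agent: bundle persona + history + question for the planner.
    return {
        "persona": core_memory["persona"],
        "history": list(episodic_memory),  # prior turns inform the plan
        "question": question,
    }

def plan(context):
    # Query Planner Agent: decide which stores to query (toy rule).
    if "how" in context["question"].lower():
        return ["vector_store", "knowledge_graph"]
    return ["entity_store"]
```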