# Vectra Architecture
Vectra is an enterprise RAG (Retrieval-Augmented Generation) solution designed to transform unstructured data into actionable knowledge via an intelligent chat interface.
## Overview
The architecture follows a distributed model consisting of a reactive API, an asynchronous worker for data ingestion, and a multi-provider AI stack.
```mermaid
graph TD
    User([User]) <--> Frontend[Frontend Vue.js/Quasar]
    Frontend <--> API[FastAPI Backend]
    API <--> Postgres[(PostgreSQL)]
    API <--> Redis[(Redis Cache)]
    API <--> Qdrant[(Qdrant Vector DB)]
    API -- WebSocket triggers --> Worker[Background Worker]
    Worker -- Ingestion --> Postgres
    Worker -- Vectorization --> Qdrant
    API -- LLM/Embed --> AI[Gemini / OpenAI / Mistral]
    Worker -- Embeddings --> AI
```
## Main Components
### 1. Backend API (FastAPI)
The heart of the system, responsible for real-time orchestration:
- Session Management: Authentication and conversation history.
- RAG Orchestration: Integration with LlamaIndex for chunking, indexing, and retrieval.
- WebSocket Manager: Real-time broadcast of responses and synchronization states.
- Semantic Cache: Stores results for semantically similar queries in Redis to cut latency and cost (see the sketch below).
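
A minimal sketch of how such a semantic cache could work, assuming a local Redis instance and `redis-py`. The key prefix, similarity threshold, and `embed` stand-in are illustrative, not Vectra's actual implementation:

```python
import json

import numpy as np
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

SIMILARITY_THRESHOLD = 0.92  # illustrative value, not Vectra's actual setting


def embed(text: str) -> np.ndarray:
    # Stand-in for the real embedding call (e.g. Gemini text-embedding-004);
    # a seeded random vector keeps the sketch self-contained and runnable.
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.standard_normal(768)


def cached_answer(query: str) -> str | None:
    """Return a stored answer if a semantically similar query was seen before."""
    q = embed(query)
    for key in r.scan_iter("semcache:*"):
        entry = json.loads(r.get(key))
        c = np.asarray(entry["embedding"])
        # Cosine similarity between the incoming query and the cached one.
        sim = float(q @ c / (np.linalg.norm(q) * np.linalg.norm(c)))
        if sim >= SIMILARITY_THRESHOLD:
            return entry["answer"]
    return None


def cache_answer(query: str, answer: str, ttl_s: int = 3600) -> None:
    """Store the query embedding and its answer under a TTL."""
    payload = {"embedding": embed(query).tolist(), "answer": answer}
    r.set(f"semcache:{abs(hash(query))}", json.dumps(payload), ex=ttl_s)
```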
### 2. Background Worker
An autonomous service dedicated to heavy tasks:
- Multi-Source Ingestion: Scanning and extracting data from various sources (Connectors).
- Vectorization Pipeline: Transforming documents into vectors via embedding models (Gemini text-embedding-004); see the sketch after this list.
- Real-Time Synchronization: Connected to the API via WebSocket to react instantly to user requests.
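
A minimal sketch of the vectorization pipeline, assuming the `google-generativeai` SDK for embeddings and `qdrant-client` for storage. The collection name, chunk sizes, and payload fields are assumptions, and the naive chunker stands in for the LlamaIndex chunking mentioned above:

```python
import uuid

import google.generativeai as genai
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

genai.configure(api_key="...")  # supply a real API key
client = QdrantClient(host="localhost", port=6333)
COLLECTION = "documents"  # hypothetical collection name


def chunk(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    # Naive fixed-window chunking; in Vectra this step is delegated to LlamaIndex.
    step = size - overlap
    return [text[i : i + size] for i in range(0, len(text), step)]


def embed(text: str) -> list[float]:
    # Gemini text-embedding-004, the embedding model named in this document.
    result = genai.embed_content(model="models/text-embedding-004", content=text)
    return result["embedding"]


def ingest(doc_id: str, text: str) -> None:
    """Chunk a document, embed each chunk, and upsert the vectors into Qdrant."""
    points = [
        PointStruct(
            id=str(uuid.uuid4()),
            vector=embed(part),
            payload={"doc_id": doc_id, "chunk_index": i, "text": part},
        )
        for i, part in enumerate(chunk(text))
    ]
    client.upsert(collection_name=COLLECTION, points=points)
```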
### 3. Persistence Layer
- PostgreSQL: Stores metadata, connector configurations, and document structure.
- Qdrant: High-performance vector database for low-latency semantic search (see the sketch below).
- Redis: Semantic cache and temporary storage.
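
To make the Qdrant role concrete, here is a sketch of collection setup and retrieval with `qdrant-client`. The collection name is an assumption, and the 768-dimension vector size is chosen to match text-embedding-004's output:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(host="localhost", port=6333)

# One-time setup: cosine distance over 768-dim vectors, matching the
# output dimension of text-embedding-004.
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)


def semantic_search(query_vector: list[float], top_k: int = 5):
    """Return the top-k most similar chunks for an already-embedded query."""
    return client.search(
        collection_name="documents",
        query_vector=query_vector,
        limit=top_k,
    )
```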
### 4. Artificial Intelligence
Vectra is model-agnostic but optimized for the Google Cloud suite:
- Chat Models: Gemini 1.5 Pro/Flash for complex reasoning.
- Embeddings: Gemini text-embedding-004 for state-of-the-art semantic representation.
- Reranking: Uses reranking models to refine result relevance.
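
The document does not name the reranking model, so this sketch uses a public cross-encoder from `sentence-transformers` as a stand-in for the rescoring step:

```python
from sentence_transformers import CrossEncoder

# Hypothetical model choice; any cross-encoder reranker follows the same pattern.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")


def rerank(query: str, passages: list[str], top_k: int = 3) -> list[str]:
    """Rescore retrieved passages against the query and keep the best ones."""
    scores = reranker.predict([(query, p) for p in passages])
    ranked = sorted(zip(scores, passages), key=lambda pair: pair[0], reverse=True)
    return [p for _, p in ranked[:top_k]]
```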
## Data Flow (RAG)
1. Request: The user asks a question via the frontend.
2. Cache: The API checks the Redis semantic cache for a previously answered, similar question.
3. Retrieval: On a cache miss, Vectra queries Qdrant for the most relevant passages.
4. Augmentation: The retrieved context is injected into the LLM prompt.
5. Generation: The LLM generates an accurate, source-cited response.
6. Streaming: The response is streamed back in chunks over WebSocket for a smooth user experience.
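
Putting the six steps together as illustrative Python: `cached_answer`, `embed`, `semantic_search`, `rerank`, and `cache_answer` are the hypothetical helpers sketched earlier in this document, and `llm_stream` stands in for a streaming LLM client; none of this is Vectra's actual code.

```python
from fastapi import WebSocket


async def answer(query: str, websocket: WebSocket) -> None:
    # 1-2. Request arrives; check the semantic cache first.
    if (hit := cached_answer(query)) is not None:
        await websocket.send_text(hit)
        return

    # 3. Retrieval: fetch candidate passages from Qdrant, then rerank them.
    hits = semantic_search(embed(query))
    passages = rerank(query, [h.payload["text"] for h in hits])

    # 4. Augmentation: inject the retrieved context into the LLM prompt.
    prompt = (
        "Answer using only the context below.\n\n"
        + "\n---\n".join(passages)
        + f"\n\nQuestion: {query}"
    )

    # 5-6. Generation and streaming: forward each LLM chunk over the WebSocket.
    parts: list[str] = []
    async for chunk in llm_stream(prompt):  # hypothetical streaming client
        parts.append(chunk)
        await websocket.send_text(chunk)

    cache_answer(query, "".join(parts))
```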