System Design

RAG Architecture

A local-first retrieval-augmented generation pipeline with semantic search and grounded responses.

Request Flow

User (question) → React (ChatWidget) → FastAPI (orchestration) → Weaviate (vector search) → Ollama (LLM generate)

RAG Pipeline Steps

1. User Query
   The user asks a question about their career through the React chat widget.

2. API Routing
   FastAPI receives the POST request and routes it to the ChatService.

3. Semantic Search
   Weaviate performs a near_text() search across all 4 collections using nomic-embed-text embeddings.

4. Context Assembly
   Top matches from each collection are assembled into a JSON-formatted BACKGROUND section.

5. LLM Generation
   Ollama llama3.2 generates a natural-language response grounded in the retrieved context (steps 3-5 are sketched in code after this list).

6. Response Delivery
   The answer and source attribution are returned to the user through the chat interface.
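Steps 3-5 correspond to the orchestration in services/chat_service.py. The sketch below is a minimal, illustrative version of that flow using the Weaviate v4 Python client and Ollama's REST API; the collection names (Jobs, Skills, Education, Projects), result limits, and prompt wording are assumptions, not the project's actual code.

```python
# Sketch of the retrieve -> assemble -> generate flow (steps 3-5).
# Collection names, prompt wording, and result limits are illustrative assumptions.
import json

import requests
import weaviate

COLLECTIONS = ["Jobs", "Skills", "Education", "Projects"]  # assumed names for the 4 collections
OLLAMA_URL = "http://localhost:11434/api/generate"


def retrieve_context(client: weaviate.WeaviateClient, question: str, per_collection: int = 3) -> dict:
    """Step 3: near_text() semantic search against every collection."""
    background = {}
    for name in COLLECTIONS:
        result = client.collections.get(name).query.near_text(query=question, limit=per_collection)
        background[name] = [obj.properties for obj in result.objects]
    return background


def answer(question: str) -> str:
    with weaviate.connect_to_local() as client:
        background = retrieve_context(client, question)

    # Step 4: assemble the JSON-formatted BACKGROUND section.
    prompt = (
        "Answer the question using only the BACKGROUND.\n\n"
        f"BACKGROUND:\n{json.dumps(background, indent=2, default=str)}\n\n"
        f"QUESTION: {question}"
    )

    # Step 5: ask llama3.2 for a response grounded in the retrieved context.
    resp = requests.post(
        OLLAMA_URL,
        json={"model": "llama3.2", "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]
```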

Project Structure

# RAG Backend Structure
rag/
  api/
    server.py                   # FastAPI app + endpoints
    models.py                   # Pydantic schemas
  services/
    chat_service.py             # RAG orchestration
  clients/
    weaviate_client.py          # Vector DB wrapper
    ollama_client.py            # LLM client wrapper
  core/
    settings.py                 # Configuration
  schema/
    schema.py                   # Collection definitions
  ingest/
    ingest_jobs_from_json.py    # Data loader
    candidate_data.json         # Career data

Core Components

Weaviate (v1.34.4)

  • text2vec-ollama module for embeddings
  • Multi-collection semantic search
  • Persistent volumes in Docker
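schema/schema.py presumably ties each collection to the text2vec-ollama module so Weaviate embeds objects through Ollama at ingest and query time. A minimal sketch for one collection, assuming a "Jobs" collection and illustrative property names:

```python
# Sketch of one collection definition using the text2vec-ollama module.
# The "Jobs" name and property list are assumptions; the real schema defines 4 collections.
import weaviate
from weaviate.classes.config import Configure, DataType, Property

with weaviate.connect_to_local() as client:
    if not client.collections.exists("Jobs"):
        client.collections.create(
            "Jobs",
            vectorizer_config=Configure.Vectorizer.text2vec_ollama(
                api_endpoint="http://host.docker.internal:11434",  # Ollama as reached from the Weaviate container
                model="nomic-embed-text",
            ),
            properties=[
                Property(name="title", data_type=DataType.TEXT),
                Property(name="company", data_type=DataType.TEXT),
                Property(name="description", data_type=DataType.TEXT),
            ],
        )
```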

Ollama (Local LLM)

  • llama3.2 for response generation
  • nomic-embed-text for embeddings
  • Runs locally on port 11434
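A quick way to confirm both models respond on port 11434 is to hit Ollama's public REST endpoints directly; the prompts below are placeholders.

```python
# Smoke-test the two Ollama models the pipeline depends on.
import requests

BASE = "http://localhost:11434"

# nomic-embed-text produces the vectors Weaviate stores for semantic search.
emb = requests.post(
    f"{BASE}/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": "Python backend engineer"},
    timeout=60,
).json()
print("embedding dimensions:", len(emb["embedding"]))

# llama3.2 generates the grounded answer text.
gen = requests.post(
    f"{BASE}/api/generate",
    json={"model": "llama3.2", "prompt": "Reply with one short sentence.", "stream": False},
    timeout=120,
).json()
print(gen["response"])
```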

FastAPI (Backend)

  • POST /api/chat endpoint
  • OpenAPI/Swagger at /docs
  • CORS configured for React dev
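A sketch of how api/server.py and api/models.py likely wire this together; the request/response field names, the placeholder answer function, and the http://localhost:5173 dev origin (Vite's default) are assumptions rather than the project's actual code.

```python
# Sketch of the POST /api/chat endpoint with CORS open to the React dev server.
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel


class ChatRequest(BaseModel):
    question: str


class ChatResponse(BaseModel):
    answer: str
    sources: list[str] = []


def answer_question(question: str) -> str:
    """Stand-in for the ChatService orchestration sketched under RAG Pipeline Steps."""
    return f"(placeholder answer for: {question})"


app = FastAPI(title="Career RAG API")
app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:5173"],  # Vite dev server default
    allow_methods=["*"],
    allow_headers=["*"],
)


@app.post("/api/chat", response_model=ChatResponse)
def chat(request: ChatRequest) -> ChatResponse:
    answer = answer_question(request.question)
    return ChatResponse(answer=answer, sources=["Jobs", "Projects"])  # illustrative source labels
```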

React Frontend (v19 + TypeScript)

  • Floating ChatWidget component
  • Vite 7 for fast HMR builds
  • Animations and responsive design

Running Locally

1. Start Ollama
   ollama run llama3.2

2. Start Weaviate
   docker-compose up -d

3. Ingest Data (see the loader sketch after this list)
   python ingest/ingest_jobs.py

4. Run API
   uvicorn api.server:app
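For step 3, a minimal sketch of what the data loader might do: read candidate_data.json and batch-insert records into a collection, letting the text2vec-ollama module compute the vectors. The JSON layout ({"jobs": [...]}) and the "Jobs" collection name are assumptions.

```python
# Sketch of the ingest step: load career records into Weaviate.
# JSON layout and collection name are assumptions; adapt to candidate_data.json.
import json
from pathlib import Path

import weaviate

DATA_FILE = Path("ingest/candidate_data.json")

with weaviate.connect_to_local() as client:
    jobs = client.collections.get("Jobs")
    records = json.loads(DATA_FILE.read_text())["jobs"]

    # Batch insert; the text2vec-ollama module vectorizes each object on the way in.
    with jobs.batch.dynamic() as batch:
        for record in records:
            batch.add_object(properties=record)

    print(f"Ingested {len(records)} records")
```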