System Design

RAG Architecture

A local-first retrieval-augmented generation pipeline with semantic search and grounded responses.

Request Flow

User (question) → React (ChatWidget) → FastAPI (orchestration) → Weaviate (vector search) → Ollama (LLM generate)

RAG Pipeline Steps

1. User Query
   The user asks a question about their career through the React chat widget.

2. API Routing
   FastAPI receives the POST request and routes it to the ChatService.

3. Semantic Search
   Weaviate performs a near_text() search across all 4 collections using nomic-embed-text embeddings.

4. Context Assembly
   Top matches from each collection are assembled into a JSON-formatted BACKGROUND section.

5. LLM Generation
   Ollama llama3.2 generates a natural-language response grounded in the retrieved context (steps 3-5 are sketched in code after this list).

6. Response Delivery
   The answer and source attribution are returned to the user through the chat interface.
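Steps 3-5 correspond to the orchestration in services/chat_service.py. The sketch below is a minimal, illustrative version of that flow using the Weaviate v4 Python client and Ollama's REST API; the collection names (Jobs, Skills, Education, Projects), result limits, and prompt wording are assumptions, not the project's actual code.

```python
# Sketch of the retrieve -> assemble -> generate flow (steps 3-5).
# Collection names, prompt wording, and result limits are illustrative assumptions.
import json

import requests
import weaviate

COLLECTIONS = ["Jobs", "Skills", "Education", "Projects"]  # assumed names for the 4 collections
OLLAMA_URL = "http://localhost:11434/api/generate"


def retrieve_context(client: weaviate.WeaviateClient, question: str, per_collection: int = 3) -> dict:
    """Step 3: near_text() semantic search against every collection."""
    background = {}
    for name in COLLECTIONS:
        result = client.collections.get(name).query.near_text(query=question, limit=per_collection)
        background[name] = [obj.properties for obj in result.objects]
    return background


def answer(question: str) -> str:
    with weaviate.connect_to_local() as client:
        background = retrieve_context(client, question)

    # Step 4: assemble the JSON-formatted BACKGROUND section.
    prompt = (
        "Answer the question using only the BACKGROUND.\n\n"
        f"BACKGROUND:\n{json.dumps(background, indent=2, default=str)}\n\n"
        f"QUESTION: {question}"
    )

    # Step 5: ask llama3.2 for a response grounded in the retrieved context.
    resp = requests.post(
        OLLAMA_URL,
        json={"model": "llama3.2", "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]
```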

Project Structure

# RAG Backend Structure
rag/
  api/
    server.py                   # FastAPI app + endpoints
    models.py                   # Pydantic schemas
  services/
    chat_service.py             # RAG orchestration
  clients/
    weaviate_client.py          # Vector DB wrapper
    ollama_client.py            # LLM client wrapper
  core/
    settings.py                 # Configuration
  schema/
    schema.py                   # Collection definitions
  ingest/
    ingest_jobs_from_json.py    # Data loader
    candidate_data.json         # Career data

Core Components

Weaviate (v1.34.4)

  • text2vec-ollama module for embeddings
  • Multi-collection semantic search
  • Persistent volumes in Docker
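schema/schema.py presumably ties each collection to the text2vec-ollama module so Weaviate embeds objects through Ollama at ingest and query time. A minimal sketch for one collection, assuming a "Jobs" collection and illustrative property names:

```python
# Sketch of one collection definition using the text2vec-ollama module.
# The "Jobs" name and property list are assumptions; the real schema defines 4 collections.
import weaviate
from weaviate.classes.config import Configure, DataType, Property

with weaviate.connect_to_local() as client:
    if not client.collections.exists("Jobs"):
        client.collections.create(
            "Jobs",
            vectorizer_config=Configure.Vectorizer.text2vec_ollama(
                api_endpoint="http://host.docker.internal:11434",  # Ollama as reached from the Weaviate container
                model="nomic-embed-text",
            ),
            properties=[
                Property(name="title", data_type=DataType.TEXT),
                Property(name="company", data_type=DataType.TEXT),
                Property(name="description", data_type=DataType.TEXT),
            ],
        )
```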

Ollama (Local LLM)

  • llama3.2 for response generation
  • nomic-embed-text for embeddings
  • Runs locally on port 11434
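A quick way to confirm both models respond on port 11434 is to hit Ollama's public REST endpoints directly; the prompts below are placeholders.

```python
# Smoke-test the two Ollama models the pipeline depends on.
import requests

BASE = "http://localhost:11434"

# nomic-embed-text produces the vectors Weaviate stores for semantic search.
emb = requests.post(
    f"{BASE}/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": "Python backend engineer"},
    timeout=60,
).json()
print("embedding dimensions:", len(emb["embedding"]))

# llama3.2 generates the grounded answer text.
gen = requests.post(
    f"{BASE}/api/generate",
    json={"model": "llama3.2", "prompt": "Reply with one short sentence.", "stream": False},
    timeout=120,
).json()
print(gen["response"])
```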

FastAPI (Backend)

  • POST /api/chat endpoint
  • OpenAPI/Swagger at /docs
  • CORS configured for React dev
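A sketch of how api/server.py and api/models.py likely wire this together; the request/response field names, the placeholder answer function, and the http://localhost:5173 dev origin (Vite's default) are assumptions rather than the project's actual code.

```python
# Sketch of the POST /api/chat endpoint with CORS open to the React dev server.
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel


class ChatRequest(BaseModel):
    question: str


class ChatResponse(BaseModel):
    answer: str
    sources: list[str] = []


def answer_question(question: str) -> str:
    """Stand-in for the ChatService orchestration sketched under RAG Pipeline Steps."""
    return f"(placeholder answer for: {question})"


app = FastAPI(title="Career RAG API")
app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:5173"],  # Vite dev server default
    allow_methods=["*"],
    allow_headers=["*"],
)


@app.post("/api/chat", response_model=ChatResponse)
def chat(request: ChatRequest) -> ChatResponse:
    answer = answer_question(request.question)
    return ChatResponse(answer=answer, sources=["Jobs", "Projects"])  # illustrative source labels
```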

React Frontend (v19 + TypeScript)

  • Floating ChatWidget component
  • Vite 7 for fast HMR builds
  • Animations and responsive design

Running Locally

1. Start Ollama
   ollama run llama3.2

2. Start Weaviate
   docker-compose up -d

3. Ingest Data (see the loader sketch after this list)
   python ingest/ingest_jobs.py

4. Run API
   uvicorn api.server:app
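For step 3, a minimal sketch of what the data loader might do: read candidate_data.json and batch-insert records into a collection, letting the text2vec-ollama module compute the vectors. The JSON layout ({"jobs": [...]}) and the "Jobs" collection name are assumptions.

```python
# Sketch of the ingest step: load career records into Weaviate.
# JSON layout and collection name are assumptions; adapt to candidate_data.json.
import json
from pathlib import Path

import weaviate

DATA_FILE = Path("ingest/candidate_data.json")

with weaviate.connect_to_local() as client:
    jobs = client.collections.get("Jobs")
    records = json.loads(DATA_FILE.read_text())["jobs"]

    # Batch insert; the text2vec-ollama module vectorizes each object on the way in.
    with jobs.batch.dynamic() as batch:
        for record in records:
            batch.add_object(properties=record)

    print(f"Ingested {len(records)} records")
```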