Self-hosted AI assistant with RAG

The challenge

Most AI assistant projects today follow a similar pattern: send user queries to a hosted LLM API, get responses back.

The client’s requirements made that impossible. We needed to build and run the entire AI stack – language model, embeddings, vector search, context assembly – on the client’s own servers with limited GPU resources.

Four constraints shaped our technical decisions.

Data privacy
User profiles, conversation history and source documents could not leave the client's servers. This ruled out hosted AI APIs entirely.
Hardware budget
The system had to run on a 6GB VRAM budget, which constrained the size of the models we could use.
Languages
The assistant had to handle many languages – switching naturally based on user input.
Live data
The assistant needed to work with daily updated records rather than static information.
Solution
Architecture: four layers

Application layer
Handles routing, authentication and session management.

AI platform
Runs the language models locally without external network calls.

Vector database
Stores document representations and performs similarity searches to retrieve context.

Real-time transport
Streams AI responses to the frontend to keep the interface responsive.

RAG pipeline flow

When a user sends a message, it first passes through a guardrail layer that checks for security threats before anything else happens. Once validated, the message reaches the language model, which decides what information it needs to answer. Depending on the query, the model calls specific retrieval tools – searching policy documents, job records or both.
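
A simplified sketch of that dispatch step is shown below. The tool names and the guardrail/model interfaces (is_safe, plan_tools, generate) are illustrative assumptions, not the production code:

```python
# Illustrative sketch of the flow above: the guardrail runs first, then only
# the retrieval tools the model asks for are executed. All names are assumed.

from typing import Callable

def search_policy_documents(query: str) -> list[str]:
    return []  # placeholder for the policy-document retrieval tool

def search_job_records(query: str) -> list[str]:
    return []  # placeholder for the job-record retrieval tool

# Registry of retrieval tools exposed to the model.
TOOLS: dict[str, Callable[[str], list[str]]] = {
    "search_policy_documents": search_policy_documents,
    "search_job_records": search_job_records,
}

def handle_message(message: str, guardrail, model) -> str:
    # 1. Security check before anything else happens.
    if not guardrail.is_safe(message):
        return "Sorry, I can't help with that request."

    # 2. The model decides which retrieval tools it needs for this query.
    requested = model.plan_tools(message)  # e.g. ["search_job_records"]

    # 3. Run the requested tools and collect their results as context.
    context: list[str] = []
    for name in requested:
        if name in TOOLS:
            context.extend(TOOLS[name](message))

    # 4. Generate the answer from the user message plus retrieved context.
    return model.generate(message, context=context)
```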

The retrieval works through vector similarity. The query is embedded using the multilingual model, then matched against stored vectors in PostgreSQL via cosine similarity. Results can be filtered by metadata (e.g. region). The retrieved context is then assembled into a prompt together with the user’s profile data and recent conversation history – all compressed to fit within the model’s token budget.
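
The sketch below illustrates the retrieval query with pgvector's cosine distance operator and a metadata filter. The schema (a document_chunks table with content, metadata and embedding columns) is an assumption for illustration, not the client's actual database:

```python
# A minimal sketch of the retrieval step, assuming psycopg 3 and pgvector.
# Table and column names are illustrative.

import psycopg

def retrieve_context(conn: psycopg.Connection,
                     query_embedding: list[float],
                     region: str,
                     limit: int = 5) -> list[str]:
    # pgvector accepts a vector literal such as '[0.1,0.2,...]' cast to ::vector.
    vector_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    sql = """
        SELECT content
        FROM document_chunks
        WHERE metadata->>'region' = %s          -- metadata filter (e.g. region)
        ORDER BY embedding <=> %s::vector       -- cosine distance, closest first
        LIMIT %s
    """
    with conn.cursor() as cur:
        cur.execute(sql, (region, vector_literal, limit))
        return [row[0] for row in cur.fetchall()]
```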

The model generates its response based on this combined context. The output streams to the frontend in real time via Mercure/SSE, so the user sees the answer as it is being written rather than waiting for the full response.
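
A minimal sketch of how generated tokens could be pushed to a Mercure hub as they arrive. The hub URL, JWT handling and topic naming are assumptions; the Mercure publish endpoint accepts form-encoded "topic" and "data" fields authenticated with a publisher JWT:

```python
# Illustrative sketch: publish each generated token as a Mercure update so
# the frontend renders the answer while it is being written.

import os
import requests

HUB_URL = os.environ.get("MERCURE_HUB_URL", "https://example.com/.well-known/mercure")
PUBLISHER_JWT = os.environ["MERCURE_PUBLISHER_JWT"]

def stream_answer(session_id: str, token_iterator) -> None:
    topic = f"chat/{session_id}"  # per-conversation topic (assumed naming)
    for token in token_iterator:
        requests.post(
            HUB_URL,
            data={"topic": topic, "data": token},
            headers={"Authorization": f"Bearer {PUBLISHER_JWT}"},
            timeout=5,
        )
```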

What we built

Document indexing.
Automated tools for loading and vectorizing policy files.

Record sync pipeline.
Daily import and cleaning of external API data.

Retrieval tools.
Specialized search functions available to the AI agent.

Session management.
Systems to handle user profiles and chat history within strict constraints.

Key technical challenges and how we solved them

Self-hosting an AI assistant on limited hardware means solving problems that hosted API solutions handle for you – or ignore entirely.

Model security, context budgeting, embedding quality and data processing all required custom engineering. Each constraint pushed us toward solutions that are more complex to build but give the client full control over cost, privacy and system behavior.

Production AI on limited resources
Most high-end models require significant hardware. We selected a compact model with strong multilingual support and efficient performance on limited hardware. The RAG architecture compensates for the smaller model size by injecting specific context at query time.
Prompt injection defense
Smaller models can be more susceptible to user manipulation. We implemented a multi-layer defense, including vector-based guardrails that check queries against blacklists/whitelists before they reach the model and specialized prompt hardening techniques.
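
A simplified illustration of the vector-based guardrail idea: the incoming query is embedded and compared against embeddings of known attack phrases before it ever reaches the model. The embedding function, blacklist contents and the similarity threshold are assumptions:

```python
# Sketch of a vector-based blacklist check; values and helpers are illustrative.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_blocked(query: str,
               embed,                                # callable: str -> np.ndarray
               blacklist_vectors: list[np.ndarray],
               threshold: float = 0.85) -> bool:
    query_vec = embed(query)
    # Reject queries that are semantically close to a known injection attempt
    # before the language model sees them.
    return any(cosine_similarity(query_vec, bad) >= threshold
               for bad in blacklist_vectors)
```
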
Context window management
To fit within the model's context window, every component was compressed. This included limiting conversation history, pre-summarizing documents at indexing time and stripping unnecessary fields from tool results.
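
A sketch of the history-trimming part of that budgeting. The token counter below is a rough approximation standing in for the model's real tokenizer, and the 4096-token budget is an assumed figure:

```python
# Keep the most recent conversation turns that fit within a fixed token budget.

def count_tokens(text: str) -> int:
    # Rough approximation; a real implementation would use the model's tokenizer.
    return max(1, len(text) // 4)

def trim_history(turns: list[str], budget: int = 4096) -> list[str]:
    kept: list[str] = []
    used = 0
    # Walk backwards so the most recent turns are kept first.
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept))
```
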
Embedding quality and data processing
Source records varied in quality and length. We implemented automated parsing and filtering to clean data, used structured summarization before vectorization and manually segmented large documents by topic to ensure better retrieval accuracy.
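
A simplified sketch of that indexing-time preparation: records are cleaned, long documents are split into topic-sized chunks, and each chunk is summarized before embedding. The summarize() and embed() helpers stand in for the locally hosted models and are assumptions, as is the chunk size:

```python
# Illustrative indexing pipeline: clean, segment, summarize, then embed.

import re

def clean(text: str) -> str:
    # Collapse runs of spaces/tabs but keep paragraph breaks intact.
    text = re.sub(r"[ \t]+", " ", text)
    return re.sub(r"\n{3,}", "\n\n", text).strip()

def split_by_topic(text: str, max_chars: int = 2000) -> list[str]:
    # Naive paragraph-based segmentation; the client's large documents were
    # segmented manually by topic, so this only approximates that step.
    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        if current and len(current) + len(paragraph) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += paragraph + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

def index_document(text: str, summarize, embed) -> list[tuple[str, list[float]]]:
    # Summarize each chunk before vectorization so noisy or very long source
    # records produce compact, retrieval-friendly embeddings.
    prepared = []
    for chunk in split_by_topic(clean(text)):
        summary = summarize(chunk)
        prepared.append((summary, embed(summary)))
    return prepared
```
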
The scope of our actions
Workshops

Together with the client, we participate in workshops to better understand their vision and ensure an ideal fit with their expectations.

Analysis

We start by gaining a deep understanding of the client's needs and project goals to deliver personalized solutions.

Technology audit

We conduct a detailed technology audit, identifying the best tools and solutions tailored to the project's needs.

Front-end web development

We create responsive and dynamic user interfaces, leveraging the latest front-end technologies.

Database

We design and implement efficient and scalable databases, providing a robust foundation for your website.

Manual testing

We conduct comprehensive manual testing, verifying the functionality of each feature on various devices available on the market, ensuring flawless operation and high product quality.

Launch

We deploy the website on the production server, ensuring a smooth process and minimizing potential disruptions.

Warranty service

We provide a 12-month warranty on the work performed, giving clients complete security and confidence in the functioning of the website after its launch.

Tech stack

A production RAG assistant running on limited private hardware, processing tens of thousands of records daily, serving multilingual queries from a shared vector space – with zero data leaving the infrastructure and zero per-token costs.

Backend framework
Modern enterprise Symfony 8 framework
AI platform
Ollama – local model hosting environment (Bedrock as an alternative)
Models
Multilingual language and embedding models
Database
PostgreSQL with pgvector – vector-enabled database solution
Communication
Server-Sent Events – real-time streaming and asynchronous message handling
Need a reliable tech partner? Let’s talk.
Jakub Kozak
Co-founder, Head of technology
Smultron Web Development
12 Slawkowska Street
31-014 Krakow, Poland
TAX ID: PL6762482785