Deploying Prompts with RAG

With Retrieval-Augmented Generation (RAG), you deploy a Manicode security prompt alongside context retrieved from your own sources: internal coding standards, framework documentation, prior security findings, or threat models. At query time, the most relevant context is fetched and injected into the prompt before it reaches the model. This keeps the model grounded in your organization's specifics without baking that content into a single, ever-growing system prompt.

When to Use RAG

A static system prompt (see API and Programmatic Usage) is the right default for most deployments. Reach for RAG when:

  • Your security context is large or fragmented — internal standards, multiple framework guides, and historical findings that would not fit cleanly in one prompt.
  • Context changes frequently — you want to update a knowledge base without re-publishing the prompt itself.
  • Relevance matters — a Django question should pull Django guidance, not the entire library of every framework you support.

If your security context fits comfortably in a system prompt and rarely changes, stick with the static approach. RAG adds an index, a retrieval step, and the operational work of keeping both healthy.

The End-to-End Flow

A RAG deployment has three stages:

  1. Index — Split your source content into chunks, embed each chunk into a vector, and store the vectors.
  2. Retrieve — Embed the incoming query and fetch the most similar chunks from the index.
  3. Inject — Compose the security prompt with the retrieved chunks, then send the result to the model.

Index:     Source docs ──► chunk ──► embed ──► vector store
Retrieve:  User query ──► embed ──► similarity search (over the vector store) ──► top-k chunks
Inject:    Manicode security prompt + top-k chunks ──► model ──► response

Stage 1: Indexing and Embedding

Index the content you want the model to draw on: internal secure-coding standards, framework references, past review findings, or threat models. Split each document into chunks, embed them, and store the vectors.

import os
from openai import OpenAI

client = OpenAI()

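def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into overlapping character windows so context isn't lost at boundaries."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # last window reached; avoids a trailing chunk that only repeats the overlap
        start = end - overlap
    return chunks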

def embed(texts: list[str]) -> list[list[float]]:
    """Return one embedding vector per input text."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return [item.embedding for item in response.data]

# Build the index from a directory of source documents
index = []  # in production, use a vector database
for filename in sorted(os.listdir("knowledge")):
    path = os.path.join("knowledge", filename)
    if not os.path.isfile(path):
        continue  # skip subdirectories
    with open(path, encoding="utf-8") as f:
        chunks = chunk_text(f.read())
    if not chunks:
        continue  # empty file: nothing to embed
    for chunk, vector in zip(chunks, embed(chunks)):
        index.append({"text": chunk, "vector": vector, "source": filename})

For anything beyond a prototype, store vectors in a dedicated vector database (e.g., pgvector, Pinecone, Weaviate, or Qdrant) rather than an in-memory list. These handle persistence, fast similarity search, and metadata filtering at scale.

Stage 2: Retrieving Context at Query Time

When a query arrives, embed it and find the most similar chunks. The example below uses cosine similarity over the in-memory index; a vector database exposes this as a single query.

import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query: str, k: int = 4) -> list[dict]:
    query_vector = embed([query])[0]
    scored = [
        {**entry, "score": cosine_similarity(query_vector, entry["vector"])}
        for entry in index
    ]
    scored.sort(key=lambda e: e["score"], reverse=True)
    return scored[:k]

Tune k to balance coverage against prompt size and cost. Start with 3–5 chunks and adjust based on retrieval quality (see Evaluating Retrieval Quality).
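One way to make the coverage-versus-size trade-off explicit is to cap the selection by both a minimum score and a size budget rather than by k alone. The sketch below is an illustration, not part of the Manicode tooling: `select_chunks`, `min_score`, and `max_chars` are hypothetical names, and the default thresholds are placeholders to tune against your own evaluation set. It expects entries shaped like those returned by the retrieval example above (a `text` and a `score` key).

```python
def select_chunks(scored: list[dict], k: int = 4,
                  min_score: float = 0.2, max_chars: int = 3000) -> list[dict]:
    """Pick up to k chunks, best first, dropping weak matches and respecting a size budget.

    min_score and max_chars are illustrative defaults, not recommended values.
    """
    selected = []
    used = 0
    for entry in sorted(scored, key=lambda e: e["score"], reverse=True):
        if len(selected) == k or entry["score"] < min_score:
            break  # entries are sorted, so everything after this is weaker
        size = len(entry["text"])
        if used + size > max_chars:
            continue  # does not fit the remaining budget; a smaller chunk still might
        selected.append(entry)
        used += size
    return selected
```

The budget here is counted in characters to match the chunking example; a token count from your model's tokenizer would be a closer proxy for cost.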

Stage 3: Injecting Context into the Prompt

Compose the Manicode security prompt with the retrieved chunks, then send the combined message to the model. Keep the security prompt as the authoritative instruction and present retrieved context as supporting reference material.

def load_prompt(framework: str) -> str:
    with open(f"prompts/{framework}-security.md") as f:
        return f.read()

def build_system_prompt(framework: str, query: str) -> str:
    security_prompt = load_prompt(framework)
    chunks = retrieve(query)
    context = "\n\n".join(
        f"[Source: {c['source']}]\n{c['text']}" for c in chunks
    )
    return (
        f"{security_prompt}\n\n"
        "## Retrieved Context\n\n"
        "Use the following project-specific references when they apply. "
        "If they conflict with the security guidance above, the security "
        "guidance takes precedence.\n\n"
        f"{context}"
    )

query = "Write a Django authentication view"
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": build_system_prompt("python", query)},
        {"role": "user", "content": query},
    ],
)

print(response.choices[0].message.content)

The retrieved context is clearly delimited and explicitly subordinated to the security prompt. This keeps the model's security behavior stable even when retrieval surfaces a marginally relevant or outdated chunk.

Operational Concerns

RAG introduces an index and a retrieval step that both need ongoing care.

Keeping the Index Fresh

The index is only as good as its last build. Stale chunks lead to outdated guidance.

  • Re-index whenever source documents change. Trigger a rebuild from CI when files in your knowledge directory are updated, the same way you propagate prompt updates in Team and Enterprise Deployment.
  • Store a content hash or timestamp per chunk so you can re-embed only what changed instead of rebuilding the whole index.
  • Embedding models change over time. If you upgrade the embedding model, re-embed the entire corpus — vectors from different models are not comparable.
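The hash-per-chunk idea can be sketched as follows. This is a minimal illustration under stated assumptions: `chunk_key` and `plan_reindex` are hypothetical helpers, and the set of existing keys stands in for whatever your vector store records per entry. Folding the embedding model name into the hash means a model upgrade invalidates every stored key, which forces the full re-embed described above.

```python
import hashlib

EMBEDDING_MODEL = "text-embedding-3-small"  # the model used in the indexing example

def chunk_key(text: str, model: str = EMBEDDING_MODEL) -> str:
    """Hash the chunk text together with the model name (hypothetical helper)."""
    return hashlib.sha256(f"{model}\n{text}".encode("utf-8")).hexdigest()

def plan_reindex(existing_keys: set[str], chunks: list[str],
                 model: str = EMBEDDING_MODEL) -> tuple[list[str], set[str]]:
    """Compare freshly chunked text against stored keys.

    Returns (chunks that still need embedding, stored keys that are now stale).
    """
    current = {chunk_key(c, model): c for c in chunks}
    to_embed = [c for key, c in current.items() if key not in existing_keys]
    stale = existing_keys - set(current)
    return to_embed, stale
```

A CI job could call `plan_reindex` for each changed file, embed only the returned chunks, and delete the stale keys from the store.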

Chunking Strategy

How you split documents has an outsized effect on retrieval quality.

  • Chunk on semantic boundaries — headings, sections, or functions — rather than arbitrary character counts when the source has structure.
  • Use overlap (as in the example above) so a concept that straddles a boundary isn't lost.
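  • Keep chunks focused. Chunks that are too large dilute relevance and waste tokens; chunks that are too small lose context. Start around 500–1,000 tokens and adjust. Note that the indexing example above measures chunk_size in characters, not tokens, so convert accordingly.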
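As an illustration of chunking on semantic boundaries, the sketch below splits on headings and falls back to fixed-size windows only when a section is too long. It assumes the source documents are Markdown with `#`-style headings, which the text above does not specify; `chunk_by_heading` and `max_chars` are hypothetical names. One known limitation: a line starting with `# ` inside a code sample (a Python comment, for instance) is also treated as a heading.

```python
import re

# A Markdown ATX heading: 1-6 '#' characters, spaces or tabs, then some text.
HEADING = re.compile(r"^#{1,6}[ \t]+\S", re.MULTILINE)

def chunk_by_heading(text: str, max_chars: int = 3000) -> list[str]:
    """Split at headings; oversized sections fall back to fixed-size windows."""
    starts = [m.start() for m in HEADING.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any preamble before the first heading
    ends = starts[1:] + [len(text)]
    chunks = []
    for begin, end in zip(starts, ends):
        section = text[begin:end].strip()
        if not section:
            continue
        # Sections within the limit pass through as a single chunk.
        for i in range(0, len(section), max_chars):
            chunks.append(section[i:i + max_chars])
    return chunks
```

Each chunk keeps its heading line, which gives the embedding some topical signal and makes retrieved context easier to read in the prompt.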

Evaluating Retrieval Quality

Treat retrieval as something you measure, not something you assume works.

  • Build a small evaluation set of representative queries paired with the chunks that should be retrieved.
  • Track whether the right chunks appear in the top k, and at what rank.
  • Re-run the evaluation after changing the embedding model, chunk size, or k.
  • Validate the model's final output the same way you would for any deployment — layer an output-guard prompt for defense in depth.
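A small harness for the first three points might look like the sketch below. It is a simplification under stated assumptions: expected results are matched by source filename rather than by individual chunk, `evaluate_retrieval` and the `expected_sources` field are hypothetical, and `retrieve_fn` is any callable with the same `(query, k)` signature as the retrieval example. It reports the hit rate in the top k and the mean reciprocal rank (MRR), which rewards correct results that appear earlier.

```python
from typing import Callable

def evaluate_retrieval(eval_set: list[dict],
                       retrieve_fn: Callable[[str, int], list[dict]],
                       k: int = 4) -> dict:
    """Compute hit rate and mean reciprocal rank over an evaluation set.

    Each case is {"query": str, "expected_sources": set of filenames}.
    """
    if not eval_set:
        raise ValueError("eval_set must contain at least one case")
    hits = 0
    reciprocal_ranks = []
    for case in eval_set:
        sources = [r["source"] for r in retrieve_fn(case["query"], k)]
        # Rank (1-based) of the first expected source, or None if absent.
        rank = next(
            (i for i, s in enumerate(sources, start=1)
             if s in case["expected_sources"]),
            None,
        )
        if rank is not None:
            hits += 1
        reciprocal_ranks.append(1 / rank if rank else 0.0)
    n = len(eval_set)
    return {"hit_rate": hits / n, "mrr": sum(reciprocal_ranks) / n}
```

Running this before and after a change to the embedding model, chunk size, or k gives a like-for-like comparison instead of an impression.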