Thread: RAG and Memory Systems

Summary

This thread documents the community’s exploration of Retrieval-Augmented Generation (RAG) and memory systems for LLM game engines. The conversation evolved from simple vector databases to sophisticated hybrid retrieval systems combining multiple search methods.

Core Challenge

The Problem: Context windows are limited, but game worlds are vast.

The Question: How do you give the LLM relevant information without overwhelming the context window?

Three Main Approaches

1. Vector Databases

  • Store embeddings of text chunks
  • Query by semantic similarity
  • Return most relevant results

2. Keyword/Tag Search

  • Traditional keyword matching
  • Tag-based filtering
  • Exact name matching

3. Knowledge Graphs

  • Entity relationships
  • Graph traversal
  • Connected information discovery

Consensus: Hybrid approach works best

Early RAG Discussions

Initial Vector Database Issues

User-vali98 (January 2024):

“from what Cohee has told me, vector DB and the likes suck”

User-50h100a:

“vector db is hard to use”

“more critically, vector db cannot interface directly with the llm”

User-underscore_x:

“chroma is good at retrieval. but you have to make it about retrieving information”

The Fundamental Issue

User-50h100a:

“vector db can only help you as a search tool… using a vector database for the purpose of, say, letting the LLM write a sentence describing the category of things it wants to recall and then finding relevant ‘memory’ summaries would be appropriate”

Key Insight: Vector DBs are tools, not solutions. They retrieve; you still need to integrate intelligently.

Vector Database Implementation

User-veritasr’s Approach (ReallmCraft)

Tech Stack:

  • ChromaDB for vector storage
  • sentence-transformers (all-MiniLM-L6-v2) for embeddings
  • 384-dimensional vectors

Embedding Model Choice:

  • Small, fast, good enough
  • Runs locally without issues
  • Reasonable accuracy for game content

Storage Strategy

What Gets Embedded:

  • Location descriptions
  • Character backstories
  • Lore entries
  • Item descriptions
  • Event histories

Metadata Stored Alongside:

{
  "id": "char_001",
  "name": "Vendrick the Cursed",
  "tags": ["player", "warrior", "cursed"],
  "type": "character",
  "last_seen": "2024-07-05",
  "relationships": ["npc_042", "location_15"]
}

The Semantic Similarity Problem

User-veritasr’s Critical Insight (February 2024)

“Semantic similarity isn’t meant to locate stuff like keywords or key phrases, it’s meant to find similar paragraphs of text.”

Implication: If you store “Vendrick is a warrior cursed by ancient magic,” querying “warrior” won’t match well because you’re matching a word to a paragraph.

Solution: Store Experiences, Not Descriptions

Wrong Approach:

Store: "The Sword of Kings is a legendary blade forged in dragon fire."

Better Approach:

Store: "Vendrick gripped the Sword of Kings, feeling the ancient dragon fire still warm within the blade. Its weight was familiar, comforting, a reminder of battles past."

Why: Narrative experiences provide richer semantic context than bare facts.

Hypothetical Questions Approach

Often Conflated With: HyDE (Hypothetical Document Embeddings). Strictly speaking, HyDE inverts the idea: it embeds a hypothetical answer generated from the query, while the technique below embeds hypothetical questions generated from the content. The RAG article referenced later in this thread lists them as separate techniques.

Technique:

  1. Take your content: “The Sword of Kings was forged in dragon fire”
  2. Generate hypothetical questions it could answer:
    • “What is the Sword of Kings?”
    • “How was the Sword of Kings created?”
    • “What makes the Sword of Kings special?”
  3. Store both questions and answers with embeddings
  4. At query time, match user question to hypothetical questions

User-veritasr (February 2024):

“This is the thing I was imagining the other day. Good to know it has a name”
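
A minimal sketch of the indexing step, reusing the ChromaDB and sentence-transformers stack described elsewhere in this thread; generate_questions() stands in for whatever LLM call produces the questions and is purely hypothetical:

import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
collection = chromadb.Client().create_collection("lore_questions")

def index_with_questions(entry_id, content, llm):
    # Hypothetical helper: ask an LLM for a few questions this chunk answers
    questions = llm.generate_questions(content)
    for i, question in enumerate(questions):
        collection.add(
            ids=[f"{entry_id}_q{i}"],
            embeddings=[model.encode(question).tolist()],
            documents=[content],  # embed the question, store the answer
            metadatas=[{"source": entry_id}],
        )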

Hybrid Retrieval System

User-veritasr’s Final Architecture

Phase 1: Exact Matching

# Direct name match - highest priority
exact_matches = db.query(name=user_input)
context.add(exact_matches, priority=1)

Phase 2: Tag-Based Boolean Search

# Extract keywords from input
keywords = extract_keywords(user_input)
 
# Find entries with matching tags
tag_matches = db.query(tags__contains=keywords)
context.add(tag_matches, priority=2)

Phase 3: Semantic Search

# Vector similarity search on descriptions
semantic_matches = vector_db.similarity_search(
    query=user_input,
    k=10,
    threshold=1.5
)
context.add(semantic_matches, priority=3)

Phase 4: Graph Traversal

# For each result, follow relationship links
for entity in current_results:
    related = db.query(id__in=entity.relationships)
    context.add(related, priority=4)

Phase 5: Ranking and Fusion

# RRF-style ranking
all_results = combine_results(
    exact_matches,
    tag_matches,
    semantic_matches,
    related_entities
)
 
# Sort by average rank across methods
ranked = reciprocal_rank_fusion(all_results)
 
# Add to context until token limit reached
context.fill_until_limit(ranked)

Context Window Management

Token Estimation

User-veritasr’s Method (February 2024):

import tiktoken
import math
 
class ContextBuilder:
    def __init__(self, prompt, messages, context_strings, max_length):
        self.context = context_strings
        self.prompt = prompt
        self.messages = messages
        # Padding for tokenizer differences (1/50 ratio)
        self.context_padding = math.floor(max_length / 50)
        self.max_length = max_length - self.context_padding
 
    def get_tokens(self, text, encoder="gpt-3.5-turbo"):
        encoding = tiktoken.encoding_for_model(encoder)
        return len(encoding.encode(text))

Padding Strategy: Reserve 1/50 of the token budget as a safety buffer for tokenizer differences between models.

Later Adjustment: “1 in 50 might be too high, guess maybe it should be 1 in 25”

Priority-Based Context Inclusion

Order of Importance:

  1. System prompt (always included)
  2. Current scene description
  3. Active characters in scene
  4. Recent messages (last N turns)
  5. Exact match results (from queries)
  6. Related entities
  7. Semantic matches (by rank)
  8. General world lore

Implementation:

def build_context(self, max_tokens):
    remaining = max_tokens
    context = []
 
    # Add prompt (always fits)
    context.append(self.prompt)
    remaining -= self.get_tokens(self.prompt)
 
    # Add messages (recent first)
    for msg in reversed(self.messages):
        tokens = self.get_tokens(msg)
        if tokens <= remaining:
            context.insert(1, msg)
            remaining -= tokens
        else:
            break
 
    # Add context items by priority
    for item in sorted(self.context, key=lambda x: x.priority):
        tokens = self.get_tokens(item.text)
        if tokens <= remaining:
            context.append(item.text)
            remaining -= tokens
        else:
            print(f"Unable to add {item.name}, {remaining} tokens left")
            break
 
    return context

User-veritasr’s Results

Typical Context Size: Under 3k tokens
Max Context Tested: 32k (with Mixtral MoE models)

“Pretty satisfied with that fact so far. Context window has yet to break 3k”

Memory Systems

Iterative Summarization

Pattern: Chain summaries together to maintain continuity

User-veritasr’s Approach:

  1. Break chat into message blocks (prompt + response pairs)
  2. Summarize each block into events
  3. Feed previous summary into next summary
  4. Store final summary + all intermediate summaries in vector DB
  5. Query summaries when needed
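
A rough sketch of that chain, assuming hypothetical llm.summarize() and summary_db helpers (the actual implementation isn't shown in the thread):

def summarize_history(message_blocks, llm, summary_db):
    previous_summary = ""
    for block in message_blocks:  # each block = prompt + response pair
        # Feed the previous summary in so each new one preserves continuity
        summary = llm.summarize(
            f"Previous events: {previous_summary}\n\nNew messages: {block}"
        )
        summary_db.save(block, summary)
        previous_summary = summary
    return previous_summary  # final chained summary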

Benefits:

  • Maintains narrative continuity
  • Captures fine-grained details
  • Doesn’t lose information over time

Trade-offs:

  • Takes longer (more LLM calls)
  • Requires background processing
  • Complexity in managing summary chains

Hash-Based Invalidation

Problem: User edits or deletes messages, summaries become stale

Solution (User-veritasr):

import hashlib

# Compute a stable hash of the message block (Python's built-in hash() is
# randomized per process, so it can't be persisted across runs)
block_hash = hashlib.sha256(message_content.encode("utf-8")).hexdigest()

# Only summarize if this block hasn't been summarized before
if block_hash not in summary_db:
    summary = summarize(block)
    summary_db.save(block_hash, summary)

Benefit: Re-summarize only when content actually changes

SummerWind Approach

User-underscore_x:

“I’m assuming you’re all coming at this from a low-context-local-model territory… chroma is good at retrieval.”

Their System:

  • Define proper template for CYOA continuation
  • Create text structure to track character progress
  • Pull text into variables and store somewhere
  • Create logic to push relevant variables based on scene
  • Build prompt dynamically

Outcome: “This is how I run summerwind and it’s pretty much a video game”
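
A hedged sketch of that variable-driven prompt assembly; the template and state keys are invented for illustration:

CYOA_TEMPLATE = """You are continuing a CYOA story.
Location: {location}
Party: {party}
Quest progress: {quest_state}

Story so far: {summary}
Player: {player_input}"""

def build_prompt(state, player_input):
    # Push only the variables relevant to the current scene into the template
    return CYOA_TEMPLATE.format(
        location=state["location"],
        party=", ".join(state["party"]),
        quest_state=state["quest_state"],
        summary=state["summary"],
        player_input=player_input,
    )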

Retrieval Techniques

RAKE-NLTK for Keyword Extraction

User-veritasr used RAKE (Rapid Automatic Keyword Extraction):

from rake_nltk import Rake

# rake-nltk relies on NLTK corpora; download once beforehand:
# nltk.download("stopwords"); nltk.download("punkt")
rake = Rake()
rake.extract_keywords_from_text(description)
keywords = rake.get_ranked_phrases()  # candidate phrases, highest RAKE score first

Application: Auto-generate tags for entries

Process:

  1. Extract keywords from description
  2. Run similarity search against all tags in DB
  3. Filter results above threshold (1.5)
  4. Suggest as tags for new entry
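
Putting the process together, a sketch under the thread's settings (tag_collection, the embedding model, and the reading of the 1.5 cutoff are assumptions; ChromaDB reports distances, where lower means more similar):

def suggest_tags(description, tag_collection, model, max_distance=1.5):
    rake = Rake()
    rake.extract_keywords_from_text(description)
    suggested = set()
    for phrase in rake.get_ranked_phrases()[:10]:
        hits = tag_collection.query(
            query_embeddings=[model.encode(phrase).tolist()],
            n_results=3,
        )
        # Keep tags whose distance to the keyword falls under the cutoff
        for tag, dist in zip(hits["documents"][0], hits["distances"][0]):
            if dist <= max_distance:
                suggested.add(tag)
    return sorted(suggested)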

Threshold Tuning

Initial Setting: 1.5 similarity threshold
Observation: Too restrictive for keywords, good for paragraphs
Adjustment: Different thresholds per content type

Context Extension Methods

User-50h100a’s Vision

“ultimately, if external state tracking is reliable, it could be used more broadly as a context extension method”

Concept: Instead of cramming everything into context:

  1. Store world state externally
  2. Use function calling or keywords to query state
  3. Inject query results mid-generation
  4. LLM continues with new information

Status: “In lieu of realtime backprop” (not yet possible with current models)
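
In code, the concept might look like this purely illustrative loop; the tool-call interface and query_world_state are hypothetical, since no model at the time supported the flow as described:

def generate_with_state_lookup(llm, world_db, prompt):
    response = llm.generate(prompt, tools=["query_world_state"])
    while response.wants_tool_call:
        # Resolve the model's query against externally tracked world state...
        result = world_db.lookup(response.tool_args["query"])
        # ...then inject the result and let generation continue
        response = llm.continue_with(tool_result=result)
    return response.text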

World Info Territory

User-50h100a:

“in some sense this is approaching WorldInfo territory” “as large worlds or maps or character memories are impractical to cram into the prompt every time”

SillyTavern’s World Info: Keyword-triggered lore injection
This Project’s Evolution: Semantic + keyword hybrid

Specific Implementation Details

ChromaDB Configuration

User-veritasr’s Setup:

import chromadb
from chromadb.config import Settings

# Legacy (pre-0.4) ChromaDB settings API, as used at the time;
# current releases use chromadb.PersistentClient(path=...) instead
client = chromadb.Client(Settings(
    chroma_db_impl="duckdb+parquet",
    persist_directory="./qdrant_data"
))
 
collection = client.create_collection(
    name="discord_chat",
    metadata={"hnsw:space": "cosine"}
)

Storage Path: ./qdrant_data (local persistence)
Distance Metric: Cosine similarity
Embedding Dimension: 384 (all-MiniLM-L6-v2)

Ranking and Fusion

Reciprocal Rank Fusion (RRF):

def reciprocal_rank_fusion(results_lists, k=60):
    """
    Combine multiple ranked result lists
    k is a constant (typically 60)
    """
    scores = {}
    for results in results_lists:
        for rank, item in enumerate(results, 1):
            if item.id not in scores:
                scores[item.id] = 0
            scores[item.id] += 1 / (k + rank)
 
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

Application: Combine exact matches, tag matches, and semantic matches into unified ranking

Tag vs Description Weighting

User-veritasr (February 2024):

“Technically this means that there’s a higher probability of things with tags getting higher rank than those without, but I’ll see how it performs.”

Observation: Tags + descriptions work better than description alone

Reason: Tags provide keyword signal, descriptions provide semantic signal

RAG Article Reference

User-veritasr shared (February 2024): https://pub.towardsai.net/advanced-rag-techniques-an-illustrated-overview-04d193d8fec6

Key Techniques Referenced:

  1. Hypothetical Questions: Generate questions for each chunk
  2. HyDE: Generate hypothetical response and use for search
  3. Chunk Optimization: Overlap chunks for better context
  4. Metadata Filtering: Pre-filter by metadata before similarity search
  5. Query Rewriting: Rephrase user query for better matches
  6. Re-ranking: Two-stage retrieval (broad then narrow)
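
As an example of technique 3, a simple overlapping chunker (the sizes are arbitrary placeholders):

def chunk_with_overlap(text, size=512, overlap=64):
    # Consecutive chunks share `overlap` characters so context isn't cut mid-thought
    chunks = []
    step = size - overlap
    for start in range(0, max(len(text) - overlap, 1), step):
        chunks.append(text[start:start + size])
    return chunks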

Memory Extension

User-veritasr’s Memory Extension (mentioned, not fully detailed):

  • Compared current text with stored summaries
  • Retrieved relevant “memories”
  • Pretty good results

Use Case: Long-running campaigns where events from weeks ago matter

Scene-Based Context Management

Current Scene Priority

Pattern: Prioritize information about current location and present characters

def gather_context(self, user_input):
    context = []
 
    # Current scene (always include)
    context.append(f"Location: {self.current_location.description}")
 
    # Characters in scene
    for char in self.current_location.characters:
        context.append(f"Present: {char.summary}")
 
    # Now do retrieval for additional context
    retrieved = self.retrieval_system.search(user_input)
    context.extend(retrieved)
 
    return context

Dynamic Context Updates

On Scene Change:

  1. Clear low-priority context
  2. Load new scene description
  3. Load characters in new scene
  4. Keep recent messages
  5. Re-run retrieval with new location context
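
A sketch of the scene-change handler implied by those steps; ContextItem and the priority values are assumptions based on the priority list earlier in this section:

def on_scene_change(self, new_location):
    # 1. Clear low-priority retrieved context; recent messages are kept elsewhere
    self.context = [c for c in self.context if c.priority <= 2]
    # 2-3. Load the new scene description and its characters
    self.current_location = new_location
    self.context.append(ContextItem(new_location.description, priority=1))
    for char in new_location.characters:
        self.context.append(ContextItem(char.summary, priority=2))
    # 5. Re-run retrieval with the new location as context
    self.context.extend(self.retrieval_system.search(new_location.name))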

Chat Context Management

User-monkeyrithms (February 2024):

“context window management for long chats… since mine is splintered into a bunch of different ‘rooms’, I haven’t actually stayed in one spot and just chatted the context all the way to max”

Their Approach: Scene-based context splitting

  • Each location is a separate context space
  • Moving between locations resets context
  • Summaries carry over key information

Issue: “if that does happen, I think my app will crash”
Solution: Needed to implement context pruning

Background Task Processing

User-veritasr’s Design:

# Use a separate LLM instance for background tasks
background_llm = LLM(endpoint="secondary_api")

# Generate summaries in background
async def summarize_in_background(messages):
    summary = await background_llm.summarize(messages)
    summary_db.save(summary)

# Fire-and-forget from the running event loop so gameplay isn't blocked, e.g.:
# asyncio.create_task(summarize_in_background(recent_messages))

Benefits:

  • Doesn’t interrupt player
  • Can use different/cheaper model
  • Processes during idle time

Preferred Model: Flan-T5 or similar text2text model (better at summarization)

Comparison: Vector vs World Info

Traditional World Info (SillyTavern)

  • Keyword triggers
  • Exact match required
  • Manual curation
  • Simple, reliable
  • No AI needed

Vector/Semantic Approach

  • Semantic similarity
  • Fuzzy matching
  • Can be automated
  • More complex
  • Requires embeddings

Hybrid (This Project)

  • Best of both worlds
  • Exact matches get priority
  • Semantic fills gaps
  • Graph traversal for relationships

Performance Considerations

Embedding Generation Speed

Local Models: Fast enough for real-time

  • sentence-transformers: ~100ms per chunk
  • Can batch process for better throughput

Vector Search Speed

ChromaDB Performance:

  • Search 10k entries: ~50-100ms
  • Acceptable for gameplay
  • Can index/cache for speed

Context Building Speed

Bottleneck: Token counting

  • tiktoken is reasonably fast
  • Can cache token counts
  • Pre-compute when possible
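
Caching token counts is straightforward with memoization; a minimal sketch assuming tiktoken:

from functools import lru_cache
import tiktoken

_encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

@lru_cache(maxsize=4096)
def cached_token_count(text: str) -> int:
    # Repeated texts (prompts, scene descriptions) hit the cache for free
    return len(_encoding.encode(text))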

Lessons Learned

What Worked

  1. Hybrid retrieval better than pure vector
  2. Exact name matching essential
  3. Priority-based context prevents important info from being cut
  4. Token padding (1/25 to 1/50) prevents overflow
  5. Store experiences not descriptions for better semantic matches

What Didn’t Work

  1. Pure vector search alone was too unreliable
  2. Keyword-only matching was too rigid
  3. Omitting similarity thresholds returned irrelevant results
  4. Using the same embedding strategy for all content types (descriptions vs. experiences)
  5. Ignoring entity relationships missed obvious connections

Ongoing Challenges

  1. Relevance ranking still imperfect
  2. Query formulation affects results
  3. Threshold tuning requires experimentation
  4. Model-specific embeddings don’t transfer well
  5. Balancing recall vs precision

Integration with Game Loop

Typical Flow

1. User inputs action
2. Extract entities mentioned (NPC names, locations, items)
3. Query RAG system with entities + action text
4. Retrieve relevant context
5. Build prompt with:
   - System instructions
   - Current scene
   - Retrieved context
   - Recent messages
   - User input
6. Generate response
7. Update world state
8. (Background) Summarize turn and update memory
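
Condensed into a driver function, the flow might read as follows (every component name here is hypothetical):

def game_turn(engine, user_input):
    entities = engine.extract_entities(user_input)       # step 2
    retrieved = engine.rag.search(user_input, entities)  # steps 3-4
    prompt = engine.build_prompt(retrieved, user_input)  # step 5
    response = engine.llm.generate(prompt)               # step 6
    engine.world.apply(response)                         # step 7
    engine.schedule_background_summary()                 # step 8 (background)
    return response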

User-veritasr’s Context Ranking Implementation

def gather_context(self):
    # Check all entity types
    location_tags = check_for_tags("location")
    location_desc = check_for_description("location")
    character_tags = check_for_tags("character")
    character_desc = check_for_description("character")
    # ... etc for events, items, lore
 
    # Rank each type
    location_rankings = rank_results(location_tags, location_desc, "location")
    character_rankings = rank_results(character_tags, character_desc, "character")
    # ... etc
 
    # Combine and sort
    overall = location_rankings + character_rankings + ...
    overall.sort(reverse=True)  # Highest scores first
 
    # Add to context until full
    for rank, item in overall:
        if item not in self.context:
            self.context.append(format_entry(item))

Result: “More relevant stuff will be added in before less relevant stuff”

Future Directions

  • Attention mechanism integration
  • Long-term memory consolidation
  • Automatic relevance feedback
  • Query expansion and refinement
  • Multi-modal embeddings (text + metadata)
  • Graph neural networks for relationships

Key Quotes

“Semantic similarity isn’t meant to locate stuff like keywords or key phrases, it’s meant to find similar paragraphs of text.” - User-veritasr

“vector db can only help you as a search tool” - User-50h100a

“chroma is good at retrieval. but you have to make it about retrieving information.” - User-underscore_x

“It sort of feels like the industry is catching up to what we were doing in here last year” - User-veritasr (July 2025)

Technical Resources Referenced

  • ChromaDB: Vector database used by veritasr
  • Qdrant: Alternative vector DB mentioned
  • sentence-transformers: Embedding models
  • all-MiniLM-L6-v2: Specific embedding model (384-dim)
  • tiktoken: Token counting library
  • RAKE-NLTK: Keyword extraction
  • Advanced RAG Techniques: TowardsAI article shared by veritasr