
Memory (RAG)

The agent's memory system provides intelligent conversation-context management through a hybrid approach: a short-term context window combined with long-term retrieval using Retrieval-Augmented Generation (RAG). This lets agents maintain coherent conversations while efficiently accessing relevant historical information.

Overview

The memory system consists of two complementary mechanisms:

  1. Short-Term Memory: Recent messages within a sliding window (configurable via MaxContextMessages)

  2. Long-Term Memory (RAG): Vector-based retrieval of historically relevant messages using embeddings

When RAG is enabled, the agent automatically:

  • Embeds all conversation messages into a vector store

  • Retrieves semantically relevant past messages when responding

  • Combines short-term context with retrieved long-term memories

  • Re-ranks results based on similarity, recency, and thread relevance

Memory Settings

Configure memory behavior through AgentMemorySettings:

public class AgentMemorySettings
{
    // Basic Settings
    public bool autoSave = true;                    // Auto-save conversations
    public bool generateTitle = true;                // Generate conversation titles
    public int maxContextMessages = 20;              // Max messages in context (10-100)
    
    // RAG Settings
    public bool useVectorStore = false;             // Enable long-term RAG retrieval
    public string summaryModelId;                    // Model for summarization
    public string embeddingModelId;                  // Model for embeddings
    public int retrievalTopK = 8;                    // Top K results (1-32)
    public float retrievalMinSim = 0.5f;            // Min similarity threshold (0.0-1.0)
}

Basic Settings

Auto Save

Automatically persists conversation state to the configured store:

Generate Title

Automatically generates descriptive titles for conversations:
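Both basic flags might be set together, as a minimal sketch using the AgentMemorySettings class shown earlier:

```csharp
// Persist conversations and title them automatically.
var settings = new AgentMemorySettings
{
    autoSave = true,       // save after each exchange
    generateTitle = true   // derive a short title from the conversation
};
```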

Max Context Messages

Controls the size of the short-term context window:

Considerations:

  • Higher values provide more context but increase token costs

  • Lower values save tokens but may lose important context

  • Recommended: 10-50 depending on use case

  • RAG can supplement reduced context windows
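One possible configuration, sketched with the field above (the value is a suggestion, not a default):

```csharp
// A mid-sized window; RAG can backfill older context if enabled.
var settings = new AgentMemorySettings
{
    maxContextMessages = 30
};
```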

RAG Settings

Use Vector Store

Enables semantic retrieval of historical messages:

When to enable:

  • Long conversations spanning multiple sessions

  • Knowledge retention across conversation boundaries

  • Agents requiring recall of specific past information

  • Applications with large conversation histories

When to disable:

  • Short, ephemeral conversations

  • Performance-critical applications

  • Limited embedding API budget
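Enabling retrieval takes two fields; a hedged sketch (the exact model ID string is illustrative):

```csharp
// Turn on long-term retrieval with a cost-effective embedding model.
var settings = new AgentMemorySettings
{
    useVectorStore = true,
    embeddingModelId = "OpenAI_TextEmbedding_3_Small"  // illustrative ID
};
```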

Embedding Model

Specifies the model for generating message embeddings:

Available Models:

  • OpenAI_TextEmbedding_3_Small - Fast, cost-effective (1536 dimensions)

  • OpenAI_TextEmbedding_3_Large - Higher accuracy (3072 dimensions)

  • Custom embedding models from supported providers

Retrieval Top K

Number of relevant messages to retrieve from vector store:

Tuning Guidelines:

  • Lower (1-5): Focused retrieval, lower token cost, may miss context

  • Medium (6-12): Balanced approach, recommended for most cases

  • Higher (13-32): Comprehensive retrieval, higher cost, more context

Retrieval Min Similarity

Minimum cosine similarity threshold for retrieved messages:

Tuning Guidelines:

  • High (0.7-1.0): Strict matching, highly relevant results only

  • Medium (0.5-0.7): Balanced relevance (recommended)

  • Low (0.25-0.5): Broader retrieval, may include less relevant messages

  • Very Low (<0.25): May retrieve noise
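The two retrieval knobs are usually tuned together; a sketch with the balanced values recommended above:

```csharp
// Balanced defaults: up to 8 matches, each at >= 0.5 cosine similarity.
var settings = new AgentMemorySettings
{
    retrievalTopK = 8,
    retrievalMinSim = 0.5f
};
```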

How Memory Works

Short-Term Context (Without RAG)

When RAG is disabled, the agent uses a simple sliding window:

Process:

  1. User sends a message

  2. Agent retrieves last maxContextMessages messages

  3. Messages are sent to the LLM

  4. Response is generated and added to history
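The window selection itself is simple; a sketch of the idea (Message is a placeholder type):

```csharp
// Keep only the newest maxContextMessages entries of the history.
List<Message> BuildWindow(List<Message> history, int maxContextMessages)
{
    int skip = Math.Max(0, history.Count - maxContextMessages);
    return history.GetRange(skip, history.Count - skip);
}
```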

Long-Term Retrieval (With RAG)

When RAG is enabled, the system performs hybrid retrieval:

Detailed Workflow

1. Message Embedding

Every message is automatically embedded and indexed:

2. Query Embedding

When user sends a new message, it's embedded for search:

3. Vector Search

Similar messages are retrieved from the vector store:
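Conceptually, the search scores every stored embedding against the query embedding by cosine similarity and keeps the best matches above the threshold. A brute-force sketch of the scoring function (a real vector store would use an index instead):

```csharp
// Cosine similarity between a query embedding and a stored embedding.
static float Cosine(float[] a, float[] b)
{
    float dot = 0f, na = 0f, nb = 0f;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (MathF.Sqrt(na) * MathF.Sqrt(nb));
}
```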

4. Re-Ranking

Results are re-ranked using a weighted scoring formula:

Factors:

  • Similarity (75%): Semantic relevance to query

  • Recency (20%): Preference for recent messages

  • Same Thread (5%): Bonus for messages in current conversation
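Assuming the three weights combine linearly, the score can be sketched as follows (the final comment checks the arithmetic against the "Weather" row in the example ranking):

```csharp
// Weighted re-ranking score; weights are the percentages documented above.
// sameThread contributes its 0.05 bonus only for the current conversation.
float Score(float similarity, float recency, bool sameThread) =>
    0.75f * similarity + 0.20f * recency + (sameThread ? 0.05f : 0f);

// Score(0.85f, 0.95f, false) = 0.6375 + 0.19 + 0 = 0.8275, i.e. ~0.83
```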

Example Ranking:

Message                        Similarity   Recency   Same Thread   Final Score
"API integration timeline"     0.92         0.8       Yes           0.91
"Project deadline is Friday"   0.88         0.9       Yes           0.90
"Weather is nice today"        0.85         0.95      No            0.83

5. Deduplication

Duplicate messages are removed based on content hash:
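A sketch of the idea, keying on the raw message content (the actual hash function used is an implementation detail):

```csharp
// Drop repeated message bodies, keeping the first occurrence of each.
static List<string> Deduplicate(IEnumerable<string> messages)
{
    var seen = new HashSet<string>();
    var unique = new List<string>();
    foreach (var m in messages)
        if (seen.Add(m))      // Add returns false if m was already present
            unique.Add(m);
    return unique;
}
```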

6. Context Assembly

Final context is assembled in order:

This structure provides:

  • Summary: High-level conversation overview

  • Short-term: Recent conversation flow

  • Long-term: Relevant historical information

  • Current: Immediate query
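In code, the assembly order above might look like this (all variable names are placeholders):

```csharp
// summary -> short-term window -> long-term memories -> current query
var context = new List<Message>();
context.Add(summaryMessage);          // high-level overview
context.AddRange(recentMessages);     // sliding-window history
context.AddRange(retrievedMemories);  // re-ranked, deduplicated RAG results
context.Add(currentUserMessage);      // the new query
```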

Configuration Examples

Minimal Memory (Token Efficient)

For short conversations or budget constraints:
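One possible configuration (values are suggestions, not defaults):

```csharp
// Token-efficient: small window, no vector store.
var settings = new AgentMemorySettings
{
    maxContextMessages = 10,
    useVectorStore = false
};
```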

Characteristics:

  • ✅ Minimal token usage

  • ✅ Fast response times

  • ❌ Limited context retention

  • ❌ No long-term memory

Balanced Memory (Recommended)

For most production applications:
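One possible configuration (the model ID string is illustrative; values are suggestions):

```csharp
// Balanced: moderate window plus semantic retrieval.
var settings = new AgentMemorySettings
{
    maxContextMessages = 20,
    useVectorStore = true,
    embeddingModelId = "OpenAI_TextEmbedding_3_Small",  // illustrative ID
    retrievalTopK = 8,
    retrievalMinSim = 0.5f
};
```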

Characteristics:

  • ✅ Good context retention

  • ✅ Semantic retrieval enabled

  • ✅ Reasonable token costs

  • ✅ Suitable for most use cases

Maximum Memory (Knowledge Intensive)

For applications requiring extensive context:
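One possible configuration (the model ID string is illustrative; values are suggestions):

```csharp
// Knowledge-intensive: large window, larger embeddings, broad retrieval.
var settings = new AgentMemorySettings
{
    maxContextMessages = 50,
    useVectorStore = true,
    embeddingModelId = "OpenAI_TextEmbedding_3_Large",  // illustrative ID
    retrievalTopK = 24,
    retrievalMinSim = 0.4f
};
```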

Characteristics:

  • ✅ Maximum context retention

  • ✅ Comprehensive retrieval

  • ✅ High-quality embeddings

  • ❌ Higher token costs

  • ❌ Slower response times

Custom Configuration

Adapt settings to specific needs:

Conversation Stores

Memory persistence is handled through ConversationStoreType:

Local File Store

Save conversations to local device storage:
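A hedged sketch of the wiring; the property and enum value names are assumptions based on the store type described here:

```csharp
// Keep conversations on the local device (names are assumptions).
agent.conversationStoreType = ConversationStoreType.LocalFile;
```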

Characteristics:

  • ✅ No external API required

  • ✅ Fast local access

  • ✅ Full data control

  • ❌ Not synchronized across devices

  • ❌ Limited to device storage

Use Cases:

  • Single-player games

  • Offline applications

  • Development/testing

Threads API Store (OpenAI)

Use OpenAI's Threads API for conversation persistence:

Characteristics:

  • ✅ Cloud synchronized

  • ✅ Built-in OpenAI integration

  • ✅ Scalable storage

  • ❌ Requires OpenAI API

  • ❌ Limited to OpenAI ecosystem

Use Cases:

  • Multi-device applications

  • OpenAI-based agents

  • Cloud-backed services

Conversations API Store (OpenAI)

Use OpenAI's newer Conversations API:

Similar to Threads API with enhanced features.

Realtime API Store (OpenAI)

For real-time voice/streaming applications:

Note: Specialized for real-time streaming scenarios.

Custom Store

Implement your own storage backend:
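One plausible shape for such a backend, sketched with an assumed interface name (the library's actual contract may differ): implement load, save, list, and delete against your own storage.

```csharp
// Assumed interface shape for a custom conversation store.
public interface IConversationStore
{
    Task<Conversation> LoadAsync(string conversationId);
    Task SaveAsync(Conversation conversation);
    Task DeleteAsync(string conversationId);
    Task<IReadOnlyList<string>> ListAsync();
}
```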

Use Cases:

  • Custom cloud backends (Firebase, AWS, Azure)

  • Specialized persistence requirements

  • Integration with existing systems

Memory Management APIs

Creating Conversations

Loading Conversations

Listing Conversations

Saving Conversations

Note: With autoSave = true, these are called automatically.

Deleting Conversations

Accessing Conversation Data
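As orientation for the operations above, a lifecycle sketch with hypothetical method names (the real API's signatures may differ):

```csharp
// All names below are illustrative placeholders, not the real API.
var conversation = await agent.CreateConversationAsync();            // create
conversation = await agent.LoadConversationAsync(conversation.Id);   // load
var ids = await agent.ListConversationsAsync();                      // list
await agent.SaveConversationAsync(conversation);  // save (automatic when autoSave = true)
await agent.DeleteConversationAsync(conversation.Id);                // delete
```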

RAG Performance Tuning

Optimizing Retrieval Quality

Problem: Irrelevant results

Problem: Missing relevant context

Problem: Too much noise in results
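Hedged starting points for each problem, using the settings documented earlier (values are suggestions):

```csharp
// Irrelevant results: raise the similarity floor.
settings.retrievalMinSim = 0.7f;

// Missing relevant context: fetch more candidates, or lower the floor.
settings.retrievalTopK = 12;

// Too much noise: tighten both knobs.
settings.retrievalTopK = 5;
settings.retrievalMinSim = 0.75f;
```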

Optimizing Token Usage

Reduce context window:

Smart retrieval:

Selective embedding:
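The first two levers can be sketched as a single fragment (values are suggestions):

```csharp
// Lower token usage: smaller window, fewer but stricter retrievals.
settings.maxContextMessages = 12;  // reduce the short-term window
settings.retrievalTopK = 5;        // retrieve fewer messages
settings.retrievalMinSim = 0.65f;  // keep only strong matches
```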

Optimizing Response Speed

Use faster embedding model:

Reduce retrieval count:

Disable RAG for simple queries:
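The three levers above, as one hedged fragment (the model ID string is illustrative):

```csharp
// Faster responses: cheap embeddings, fewer lookups, RAG off when unneeded.
settings.embeddingModelId = "OpenAI_TextEmbedding_3_Small";  // illustrative ID
settings.retrievalTopK = 4;
settings.useVectorStore = false;  // for simple, single-session queries
```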

Best Practices

1. Choose Appropriate Store Types

2. Enable RAG for Long Conversations

3. Balance Context Window and RAG

4. Monitor Token Usage

5. Implement Smart Caching

6. Handle Vector Store Initialization

7. Clean Up Old Conversations

Troubleshooting

RAG Not Working

Symptom: Retrieved messages don't seem relevant

Solutions:

  1. Check embedding model is configured:

  2. Verify vector store is initialized:

  3. Adjust similarity threshold:

  4. Increase retrieval count:
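The four checks above, as one hedged fragment (the model ID string is illustrative):

```csharp
// 1-2: make sure RAG is wired up at all.
settings.useVectorStore = true;
settings.embeddingModelId = "OpenAI_TextEmbedding_3_Small";  // illustrative ID
// 3-4: loosen retrieval so more candidates qualify.
settings.retrievalMinSim = 0.4f;
settings.retrievalTopK = 12;
```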

High Token Costs

Symptom: Token usage is higher than expected

Solutions:

  1. Reduce context window:

  2. Reduce retrieved messages:

  3. Increase similarity threshold:

  4. Disable RAG for short conversations:

Slow Response Times

Symptom: Responses take too long

Solutions:

  1. Use faster embedding model:

  2. Reduce retrieval operations:

  3. Disable RAG for simple queries:

Conversations Not Persisting

Symptom: Conversations don't save between sessions

Solutions:

  1. Enable auto-save:

  2. Verify store type is configured:

  3. Manually save conversations:

  4. Check for errors during save:
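A quick checklist in code; the store enum value and save method are assumed names, not the library's confirmed API:

```csharp
settings.autoSave = true;                                       // 1. auto-save on
agent.conversationStoreType = ConversationStoreType.LocalFile;  // 2. concrete store (assumed name)
await agent.SaveConversationAsync(conversation);                // 3. hypothetical manual save
```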

Memory Leaks with Large Conversations

Symptom: Memory usage grows over time

Solutions:

  1. Limit conversation history:

  2. Implement conversation rotation:

  3. Clear vector store periodically:
