Token-Saving Architecture

This explains how the AI Skills API reduces token consumption for your AI agents.

The Two Main Mechanisms

1. Smart RAG (Retrieval-Augmented Generation) - 60-80% Savings

Problem: Sending all skills/conventions with every query wastes 2000+ tokens.

Solution: Pre-computed embeddings + fast similarity search returns only the top 3 most relevant items.

# Instead of this (sends everything):
GET /context?project=/opt/home-server  # -> 50 skills = ~3000 tokens

# Do this (sends only relevant):
GET /context/rag?query=How+do+I+setup+Docker+Compose&project=/opt/home-server
# -> 3 skills + 2 conventions = ~600 tokens

How it works:

  • On startup, all skills/snippets are loaded into memory with their embeddings
  • Query is embedded and cosine similarity computed against all items
  • Top-K items above threshold returned in ~5ms for 1000 items
  • No database queries during retrieval - fully in-memory
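
Conceptually, the retrieval step looks like the sketch below (illustrative names only, assuming embeddings are kept as normalized numpy vectors; the real implementation lives inside the service):

import numpy as np

def retrieve_top_k(query_embedding, items, k=3, min_score=0.3):
    """Hypothetical sketch of the in-memory similarity search.

    items: list of (item, embedding) pairs loaded at startup,
    with embeddings already L2-normalized numpy arrays.
    """
    query = query_embedding / np.linalg.norm(query_embedding)
    scored = []
    for item, emb in items:
        score = float(np.dot(query, emb))  # cosine similarity (vectors normalized)
        if score >= min_score:             # drop weak matches (min_skill_score)
            scored.append((score, item))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:k]]  # top-K only, e.g. max_skills=3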

Configuration:

rag:
  max_skills: 3
  max_conventions: 2
  max_snippets: 2
  min_skill_score: 0.3

2. Conversation Compression - 50-75% Savings

Problem: Long conversations (10+ turns) can consume 8000+ tokens of history.

Solution: Summarize old turns, keep recent exchanges full.

# Send this to /compress endpoint:
{
  "messages": [
    {"role": "user", "content": "..."},  # turn 1
    {"role": "assistant", "content": "..."},
    # ... many more turns
    {"role": "user", "content": "..."},  # turn 10
  ]
}

# Get back:
{
  "messages": [
    {"role": "user", "content": "[CONVERSATION SUMMARY]\nUser asked about Docker setup, decided to use Traefik...[/CONVERSATION SUMMARY]"},
    {"role": "user", "content": "..."},  # turn 9 (full)
    {"role": "assistant", "content": "..."},  # turn 10 (full)
  ],
  "original_tokens": 8000,
  "compressed_tokens": 2000,
  "tokens_saved": 6000,
  "reduction_percent": 75.0
}

Strategies:

  • extractive (default): Fast LSA summarization, no model required
  • ollama: High-quality summaries using local phi-3-mini (requires Ollama running)
  • none: Disabled

Configuration:

compression:
  enabled: true
  strategy: "extractive"  # or "ollama"
  keep_last_n: 3
  max_tokens: 2000
  ollama_model: "phi3:mini"
  ollama_url: "http://localhost:11434"
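
As a rough sketch of what the extractive strategy does with keep_last_n — summarize everything before the last N messages into a single summary turn — here is a minimal version assuming an LSA summarizer such as the sumy library (the service's actual summarizer may differ):

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer

def compress_history(messages, keep_last_n=3, summary_sentences=5):
    """Hypothetical sketch: summarize old turns, keep the last N verbatim."""
    if len(messages) <= keep_last_n:
        return messages

    old, recent = messages[:-keep_last_n], messages[-keep_last_n:]
    text = "\n".join(f"{m['role']}: {m['content']}" for m in old)

    # Extractive LSA summarization (no LLM required); needs nltk punkt data.
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    sentences = LsaSummarizer()(parser.document, summary_sentences)
    summary = " ".join(str(s) for s in sentences)

    summary_msg = {
        "role": "user",
        "content": f"[CONVERSATION SUMMARY]\n{summary}\n[/CONVERSATION SUMMARY]",
    }
    return [summary_msg] + recent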

Integration Flow (Complete Example)

import time
import httpx

async def chat_with_llm(user_message: str, project: str | None = None, conversation: list | None = None):
    """Complete integration pattern"""

    async with httpx.AsyncClient() as client:
        # 1. Get relevant context (RAG)
        params = {"query": user_message, "max_skills": 3}
        if project:
            params["project"] = project
        context_resp = await client.get("http://helm:8675/context/rag", params=params)
        context = context_resp.json()
        # context contains: skills, conventions, snippets, estimated_tokens

        # 2. Build system prompt with context
        context_str = format_context(context)  # See agent/template/agent.py for full implementation
        system_prompt = f"{context_str}\n\nYou are a helpful assistant."

        # 3. Build messages array
        messages = [{"role": "system", "content": system_prompt}]
        if conversation:
            messages.extend(conversation[-4:])  # last few turns
        messages.append({"role": "user", "content": user_message})

        # 4. Call your LLM (OpenAI, Claude, Ollama, etc.)
        llm_response = await call_your_llm(messages)

        # 5. Update conversation history
        if conversation is None:
            conversation = []
        conversation.append({"role": "user", "content": user_message})
        conversation.append({"role": "assistant", "content": llm_response})

        # 6. Periodically compress (e.g. every 10 turns)
        if len(conversation) > 10:
            compress_resp = await client.post(
                "http://helm:8675/compress",
                json={"messages": conversation, "keep_last_n": 3}
            )
            compression = compress_resp.json()
            conversation = compression["messages"]
            print(f"Compressed: saved {compression['tokens_saved']} tokens ({compression['reduction_percent']}%)")

        # 7. Optionally store learnings in memory
        if project:
            await client.post(
                "http://helm:8675/memory",
                json={
                    "project": project,
                    "key": f"decision-{int(time.time())}",
                    "content": f"Decision: {llm_response[:200]}"
                }
            )

    return llm_response, conversation

Expected Savings Summary

| Component | Before | After | Token Savings |
|---|---|---|---|
| Context injection | 3000 tokens | 600 tokens | 80% |
| Conversation history (10 turns) | 8000 tokens | 2000 tokens | 75% |
| Repeat questions | 1500 tokens | 0 tokens | 100% (if using cache externally) |

Typical agent query: ~3500 tokens → ~1000 tokens (71% reduction)


What Was Removed (v1 → v2)

  • Semantic cache - Was broken (it embedded responses rather than prompts), removed for simplicity
  • Exact-match cache - Low value, use HTTP cache headers instead
  • Keyword-based compression - Replaced with real summarization

Performance Characteristics

  • RAG latency: 5-10ms for 1000 items (cold start loads embeddings once)
  • Compression: 100-500ms (extractive) or ~2s (ollama)
  • Memory usage: ~50MB for embedding cache (1000 skills)
  • Concurrent requests: Fully async, supports dozens of simultaneous requests

Tips for Best Results

  1. Seed relevant skills - Good skills = better RAG results. Use /skills and /snippets to build your knowledge base.
  2. Use project-specific conventions - Set project=/path/to/project to auto-load conventions for that codebase.
  3. Enable Ollama compression if you need higher-quality summaries (run ollama pull phi3:mini).
  4. Monitor /config to verify your settings are active.
  5. Cache the returned context in your agent if you call /context/rag repeatedly with the same queries (see the sketch below).
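
One way to apply tip 5 is to memoize /context/rag responses on the client (a sketch; the function name, TTL, and cache shape are illustrative):

import time
import httpx

_rag_cache = {}  # (query, project) -> (timestamp, response_json)

async def get_rag_context(client: httpx.AsyncClient, query: str, project: str | None = None, ttl: float = 300.0):
    """Sketch: memoize /context/rag responses for repeated queries."""
    key = (query, project)
    hit = _rag_cache.get(key)
    if hit and time.time() - hit[0] < ttl:
        return hit[1]

    params = {"query": query}
    if project:
        params["project"] = project
    resp = await client.get("http://helm:8675/context/rag", params=params)
    context = resp.json()
    _rag_cache[key] = (time.time(), context)
    return context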

Agent Template

We've created a ready-to-use template repository with a working agent integration. Clone it and start building:

git clone git.bouncypixel.com:helm/ai-agent-template.git
cd ai-agent-template
cp .env.example .env
docker compose up -d

See template/README.md for details.


2. RAG Context Selection (Moderate Win)

Before: Inject ALL skills/conventions (2000+ tokens)
After: Inject only top 3 relevant (400-600 tokens)

# Legacy endpoint - returns EVERYTHING
curl "http://localhost:8080/context?project=/opt/home-server"
# Returns: 50 skills, 10 conventions = ~3000 tokens

# RAG endpoint - returns only relevant
curl "http://helm:8675/context/rag?query=How+do+I+setup+Docker+Compose&project=/opt/home-server"
# Returns: 3 skills about Docker, 2 conventions = ~600 tokens

Savings: 60-80% on context injection


3. Conversation Compression (Moderate Win)

Before: Full conversation history sent every request
After: Old turns summarized, only recent kept full

# Compress a long conversation
curl -X POST http://helm:8675/compress \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [...],  # Your conversation history
    "keep_last_n": 3,
    "max_tokens": 2000
  }'

# Response:
{
  "messages": [...],  # Compressed version
  "original_tokens": 8000,
  "compressed_tokens": 2000,
  "tokens_saved": 6000,
  "reduction_percent": 75.0
}

Savings: 50-75% on conversation history


Integration Flow

# Your agent wrapper
import httpx

async def query_llm(prompt, conversation_history, project=None):
    async with httpx.AsyncClient() as client:
        # 1. Check semantic cache FIRST
        cache_resp = await client.post(
            "http://helm:8675/cache/semantic-lookup",
            json={"prompt": prompt, "model": "claude-3-opus"}
        )
        cache_result = cache_resp.json()

        if cache_result["hit"]:
            # No API call needed!
            return cache_result["response"]

        # 2. Get ONLY relevant context (not everything)
        params = {"query": prompt}
        if project:
            params["project"] = project
        context = await client.get("http://helm:8675/context/rag", params=params)

        # 3. Compress conversation history
        compressed = await client.post(
            "http://helm:8675/compress",
            json={"messages": conversation_history, "keep_last_n": 3}
        )

        # 4. Build final prompt with compressed history + relevant context
        final_prompt = f"""
        {context.json()['skills']}
        {context.json()['conventions']}

        {compressed.json()['messages']}

        User: {prompt}
        """

        # 5. Call LLM
        response = await call_llm_api(final_prompt)

        # 6. Store in semantic cache
        await client.post(
            "http://helm:8675/cache/semantic-store",
            json={
                "prompt": prompt,
                "response": response,
                "tokens_in": len(final_prompt.split()),
                "tokens_out": len(response.split())
            }
        )

    return response

Expected Savings

| Scenario | Before | After | Savings |
|---|---|---|---|
| Repeated question | 1500 tokens | 0 tokens (cache hit) | 100% |
| Similar question | 1500 tokens | 0 tokens (semantic match) | 100% |
| New question, known project | 3500 tokens | 1200 tokens | 65% |
| Long conversation (10+ turns) | 12000 tokens | 4000 tokens | 67% |

Real-world average: 50-70% reduction in token consumption


Why No Vector DB?

For your scale (single user, <1000 items):

| Approach | Query Time | Setup | Overhead |
|---|---|---|---|
| In-memory cosine sim | ~5ms | None | None |
| SQLite + embeddings | ~10ms | None | None |
| Qdrant/Chroma | ~2ms | Docker container | 500MB+ RAM |

Verdict: Vector DB adds complexity without meaningful benefit at your scale.
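
The in-memory numbers are easy to sanity-check with a quick micro-benchmark (a sketch; 1000 items and 384-dim embeddings are assumptions based on typical sentence-transformer models, and this times only the similarity scan — embedding the query adds a few more milliseconds):

import time
import numpy as np

# Simulated corpus: 1000 normalized 384-dim embeddings (typical MiniLM size).
rng = np.random.default_rng(0)
corpus = rng.standard_normal((1000, 384)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

query = rng.standard_normal(384).astype(np.float32)
query /= np.linalg.norm(query)

start = time.perf_counter()
scores = corpus @ query             # cosine similarity via dot product
top3 = np.argsort(scores)[-3:][::-1]
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"top-3 indices: {top3}, query time: {elapsed_ms:.2f} ms")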


New Endpoints

| Endpoint | Purpose |
|---|---|
| POST /cache/semantic-lookup | Find similar cached responses |
| POST /cache/semantic-store | Store with embedding for matching |
| GET /context/rag?query=... | RAG-based context selection |
| POST /compress | Summarize conversation history |
| GET /tokens/count?text=... | Count tokens in text |
| GET /cache/stats | Cache statistics |
| POST /cache/clear-old | Cleanup old cache entries |
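
A couple of these are handy for monitoring; for example (a sketch, assuming JSON responses — exact field names depend on the service):

import httpx

# Sketch: poke the token-count and cache-stats endpoints (sync client for brevity).
tokens = httpx.get(
    "http://helm:8675/tokens/count",
    params={"text": "How do I set up Docker Compose?"},
)
stats = httpx.get("http://helm:8675/cache/stats")
print("token count:", tokens.json())
print("cache stats:", stats.json())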

System Prompt for Agents

## Token Efficiency Protocol

You have access to local infrastructure that reduces API usage:

**Before responding to any request:**
1. Call `POST /cache/semantic-lookup` with the user's prompt
2. If hit (similarity >= 0.85), return cached response directly
3. If miss, call `GET /context/rag?query={prompt}` for relevant context only

**For long conversations:**
1. Call `POST /compress` every 5+ turns
2. Use compressed history for subsequent requests

**After providing valuable responses:**
1. Call `POST /cache/semantic-store` to cache for future
2. Call `skills/create_skill` if it's a reusable pattern

**Token budget awareness:**
- Keep responses concise
- Don't repeat injected context
- Reference skills by ID when possible

This infrastructure saves 50-70% on token consumption.