# Token-Saving Architecture

This document explains how the AI Skills API reduces token consumption for your AI agents.

## The Two Main Mechanisms

### 1. Smart RAG (Retrieval-Augmented Generation) - 60-80% Savings

**Problem:** Sending all skills/conventions with every query wastes 2000+ tokens.

**Solution:** Pre-computed embeddings + fast similarity search return only the top 3 most relevant items.

```http
# Instead of this (sends everything):
GET /context?project=/opt/home-server
# -> 50 skills = ~3000 tokens

# Do this (sends only relevant items):
GET /context/rag?query=How+do+I+setup+Docker+Compose&project=/opt/home-server
# -> 3 skills + 2 conventions = ~600 tokens
```

**How it works:**

- On startup, all skills/snippets are loaded into memory with their embeddings
- The query is embedded and cosine similarity is computed against all items
- The top-K items above the threshold are returned in ~5ms for 1000 items
- No database queries during retrieval - fully in-memory

**Configuration:**

```yaml
rag:
  max_skills: 3
  max_conventions: 2
  max_snippets: 2
  min_skill_score: 0.3
```
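To make the retrieval step concrete, here is a minimal sketch of in-memory cosine-similarity search over pre-computed embeddings. The function and item fields are illustrative, not the service's actual internals; the defaults mirror `max_skills` and `min_skill_score` from the configuration above.

```python
import numpy as np

def top_k_items(query_embedding: np.ndarray, items: list[dict],
                k: int = 3, min_score: float = 0.3) -> list[dict]:
    """Score every pre-embedded item against the query, keep the best k above threshold."""
    scored = []
    for item in items:  # each item: {"id": ..., "content": ..., "embedding": np.ndarray}
        emb = item["embedding"]
        score = float(np.dot(query_embedding, emb) /
                      (np.linalg.norm(query_embedding) * np.linalg.norm(emb)))
        if score >= min_score:
            scored.append((score, item))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:k]]
```

For ~1000 items this is a single pass over small vectors, which is why no vector database is needed.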
### 2. Conversation Compression - 50-75% Savings

**Problem:** Long conversations (10+ turns) can consume 8000+ tokens of history.

**Solution:** Summarize old turns; keep recent exchanges in full.

```python
# Send this to the /compress endpoint:
{
  "messages": [
    {"role": "user", "content": "..."},       # turn 1
    {"role": "assistant", "content": "..."},
    # ... many more turns
    {"role": "user", "content": "..."},       # turn 10
  ]
}

# Get back:
{
  "messages": [
    {"role": "user", "content": "[CONVERSATION SUMMARY]\nUser asked about Docker setup, decided to use Traefik...[/CONVERSATION SUMMARY]"},
    {"role": "user", "content": "..."},       # turn 9 (full)
    {"role": "assistant", "content": "..."},  # turn 10 (full)
  ],
  "original_tokens": 8000,
  "compressed_tokens": 2000,
  "tokens_saved": 6000,
  "reduction_percent": 75.0
}
```

**Strategies:**

- **extractive** (default): Fast LSA summarization, no model required
- **ollama**: Higher-quality summaries using a local phi-3-mini (requires Ollama running)
- **none**: Disabled

**Configuration:**

```yaml
compression:
  enabled: true
  strategy: "extractive"   # or "ollama"
  keep_last_n: 3
  max_tokens: 2000
  ollama_model: "phi3:mini"
  ollama_url: "http://localhost:11434"
```

---

## Integration Flow (Complete Example)

```python
import time

import httpx

# One shared async client for all calls to the service
client = httpx.AsyncClient()


async def chat_with_llm(user_message: str, project: str = None, conversation: list = None):
    """Complete integration pattern"""

    # 1. Get relevant context (RAG)
    context_resp = await client.get(
        "http://helm:8675/context/rag",
        params={"query": user_message, "project": project, "max_skills": 3}
    )
    context = context_resp.json()
    # context contains: skills, conventions, snippets, estimated_tokens

    # 2. Build system prompt with context
    context_str = format_context(context)  # see agent/template/agent.py for the full implementation
    system_prompt = f"{context_str}\n\nYou are a helpful assistant."

    # 3. Build messages array
    messages = [{"role": "system", "content": system_prompt}]
    if conversation:
        messages.extend(conversation[-4:])  # last few turns
    messages.append({"role": "user", "content": user_message})

    # 4. Call your LLM (OpenAI, Claude, Ollama, etc.)
    llm_response = await call_your_llm(messages)

    # 5. Update conversation history
    if conversation is None:
        conversation = []
    conversation.append({"role": "user", "content": user_message})
    conversation.append({"role": "assistant", "content": llm_response})

    # 6. Periodically compress (e.g., every 10 turns)
    if len(conversation) > 10:
        compress_resp = await client.post(
            "http://helm:8675/compress",
            json={"messages": conversation, "keep_last_n": 3}
        )
        compression = compress_resp.json()
        conversation = compression["messages"]
        print(f"Compressed: saved {compression['tokens_saved']} tokens ({compression['reduction_percent']}%)")

    # 7. Optionally store learnings in memory
    if project:
        await client.post(
            "http://helm:8675/memory",
            json={
                "project": project,
                "key": f"decision-{int(time.time())}",
                "content": f"Decision: {llm_response[:200]}"
            }
        )

    return llm_response, conversation
```
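A minimal driver for this wrapper could look like the following. The project path and questions are placeholders, and `format_context` / `call_your_llm` still have to be supplied (see the agent template).

```python
import asyncio

async def main():
    conversation = None
    for question in ["How do I set up Docker Compose?", "Should Traefik terminate TLS?"]:
        # Each call injects RAG context, tracks history, and compresses it when it grows
        answer, conversation = await chat_with_llm(
            question, project="/opt/home-server", conversation=conversation
        )
        print(answer)

asyncio.run(main())
```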
---

## Expected Savings Summary

| Component | Before | After | Token Savings |
|-----------|--------|-------|---------------|
| Context injection | 3000 tokens | 600 tokens | 80% |
| Conversation history (10 turns) | 8000 tokens | 2000 tokens | 75% |
| Repeat questions | 1500 tokens | 0 tokens | 100% (if using an external cache) |

**Typical agent query:** ~3500 tokens → ~1000 tokens (**71% reduction**)

---

## What Was Removed (v1 → v2)

- **Semantic cache** - Was broken (it embedded responses, not prompts); removed for simplicity
- **Exact-match cache** - Low value; use HTTP cache headers instead
- **Keyword-based compression** - Replaced with real summarization

---

## Performance Characteristics

- **RAG latency**: 5-10ms for 1000 items (cold start loads embeddings once)
- **Compression**: 100-500ms (extractive) or ~2s (ollama)
- **Memory usage**: ~50MB for the embedding cache (1000 skills)
- **Concurrent requests**: Fully async; supports dozens of simultaneous requests

---

## Tips for Best Results

1. **Seed relevant skills** - Good skills = better RAG results. Use `/skills` and `/snippets` to build your knowledge base.
2. **Use project-specific conventions** - Set `project=/path/to/project` to auto-load conventions for that codebase.
3. **Enable Ollama compression** if you need higher-quality summaries (run `ollama pull phi3:mini`).
4. **Monitor `/config`** to verify your settings are active.
5. **Cache embeddings** in your agent if you call `/context/rag` repeatedly (see the sketch after this list).
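One way to apply tip 5 is to memoize `/context/rag` responses on the agent side so the same query never triggers a second round-trip. This is a sketch under that interpretation; the helper name and cache policy are illustrative, not part of the API.

```python
import httpx

client = httpx.AsyncClient(base_url="http://helm:8675")
_context_cache: dict = {}  # (query, project) -> /context/rag response

async def get_rag_context(query: str, project: str | None = None) -> dict:
    """Return cached context when available; otherwise fetch and remember it."""
    key = (query, project)
    if key not in _context_cache:
        params = {"query": query, "max_skills": 3}
        if project:
            params["project"] = project
        resp = await client.get("/context/rag", params=params)
        _context_cache[key] = resp.json()
    return _context_cache[key]
```

A plain dict is enough for a single-user agent; swap in an LRU or TTL cache if the query space grows.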
---

## Agent Template

We've created a ready-to-use template repository with a working agent integration. Clone it and start building:

```bash
git clone git.bouncypixel.com:helm/ai-agent-template.git
cd ai-agent-template
cp .env.example .env
docker compose up -d
```

See [template/README.md](template/README.md) for details.

---

### 1. Semantic Cache

**Savings:** 80-90% on repeated questions

---

### 2. RAG Context Selection (Moderate Win)

**Before:** Inject ALL skills/conventions (2000+ tokens)
**After:** Inject only the top 3 relevant items (400-600 tokens)

```bash
# Legacy endpoint - returns EVERYTHING
curl "http://localhost:8080/context?project=/opt/home-server"
# Returns: 50 skills, 10 conventions = ~3000 tokens

# RAG endpoint - returns only relevant items
curl "http://helm:8675/context/rag?query=How+do+I+setup+Docker+Compose&project=/opt/home-server"
# Returns: 3 skills about Docker, 2 conventions = ~600 tokens
```

**Savings:** 60-80% on context injection

---

### 3. Conversation Compression (Moderate Win)

**Before:** Full conversation history sent with every request
**After:** Old turns summarized, only recent turns kept in full

```bash
# Compress a long conversation ("messages" is your full conversation history)
curl -X POST http://helm:8675/compress \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [...],
    "keep_last_n": 3,
    "max_tokens": 2000
  }'

# Response (compressed messages plus savings stats):
{
  "messages": [...],
  "original_tokens": 8000,
  "compressed_tokens": 2000,
  "tokens_saved": 6000,
  "reduction_percent": 75.0
}
```

**Savings:** 50-75% on conversation history

---

## Integration Flow

```python
# Your agent wrapper
import httpx

client = httpx.AsyncClient()


async def query_llm(prompt, conversation_history, project=None):
    # 1. Check the semantic cache FIRST
    cache_result = await client.post(
        "http://helm:8675/cache/semantic-lookup",
        json={"prompt": prompt, "model": "claude-3-opus"}
    )
    if cache_result.json()["hit"]:
        # No API call needed!
        return cache_result.json()["response"]

    # 2. Get ONLY the relevant context (not everything)
    context = await client.get(
        "http://helm:8675/context/rag",
        params={"query": prompt, "project": project}
    )

    # 3. Compress the conversation history
    compressed = await client.post(
        "http://helm:8675/compress",
        json={"messages": conversation_history, "keep_last_n": 3}
    )

    # 4. Build the final prompt: compressed history + relevant context
    final_prompt = f"""
    {context.json()['skills']}
    {context.json()['conventions']}
    {compressed.json()['messages']}

    User: {prompt}
    """

    # 5. Call the LLM (call_llm_api is your own OpenAI/Claude/Ollama wrapper)
    response = await call_llm_api(final_prompt)

    # 6. Store the result in the semantic cache
    await client.post(
        "http://helm:8675/cache/semantic-store",
        json={
            "prompt": prompt,
            "response": response,
            "tokens_in": len(final_prompt.split()),
            "tokens_out": len(response.split())
        }
    )

    return response
```

---

## Expected Savings

| Scenario | Before | After | Savings |
|----------|--------|-------|---------|
| Repeated question | 1500 tokens | 0 tokens (cache hit) | 100% |
| Similar question | 1500 tokens | 0 tokens (semantic match) | 100% |
| New question, known project | 3500 tokens | 1200 tokens | 65% |
| Long conversation (10+ turns) | 12000 tokens | 4000 tokens | 67% |

**Real-world average:** 50-70% reduction in token consumption

---

## Why No Vector DB?

For your scale (single user, <1000 items):

| Approach | Query Time | Setup | Overhead |
|----------|-----------|-------|----------|
| In-memory cosine sim | ~5ms | None | None |
| SQLite + embeddings | ~10ms | None | None |
| Qdrant/Chroma | ~2ms | Docker container | 500MB+ RAM |

**Verdict:** A vector DB adds complexity without meaningful benefit at this scale.

---

## New Endpoints

| Endpoint | Purpose |
|----------|---------|
| `POST /cache/semantic-lookup` | Find similar cached responses |
| `POST /cache/semantic-store` | Store with embedding for matching |
| `GET /context/rag?query=...` | RAG-based context selection |
| `POST /compress` | Summarize conversation history |
| `GET /tokens/count?text=...` | Count tokens in text |
| `GET /cache/stats` | Cache statistics |
| `POST /cache/clear-old` | Clean up old cache entries |

---

## System Prompt for Agents

```markdown
## Token Efficiency Protocol

You have access to local infrastructure that reduces API usage:

**Before responding to any request:**
1. Call `POST /cache/semantic-lookup` with the user's prompt
2. If hit (similarity >= 0.85), return the cached response directly
3. If miss, call `GET /context/rag?query={prompt}` for relevant context only

**For long conversations:**
1. Call `POST /compress` every 5+ turns
2. Use the compressed history for subsequent requests

**After providing valuable responses:**
1. Call `POST /cache/semantic-store` to cache for future use
2. Call `skills/create_skill` if it's a reusable pattern

**Token budget awareness:**
- Keep responses concise
- Don't repeat injected context
- Reference skills by ID when possible

This infrastructure saves 50-70% on token consumption.
```
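For the token-budget step, an agent can check how expensive a prompt is before sending it by calling `GET /tokens/count`. A minimal sketch, assuming the endpoint accepts the text as a query parameter; the response field name below is an assumption, so verify it against the actual schema.

```python
import httpx

async def prompt_token_cost(text: str) -> int:
    """Ask the service how many tokens a prompt will cost before sending it."""
    async with httpx.AsyncClient() as client:
        resp = await client.get("http://helm:8675/tokens/count", params={"text": text})
    # "count" is an assumed field name; check the real /tokens/count response.
    return resp.json()["count"]
```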