# Usage Guide: AI Skills API

This guide explains how to use the AI Skills API effectively in your projects and AI agent sessions.

## Table of Contents

1. [Understanding the Integration Pattern](#understanding-the-integration-pattern)
2. [RAG Context Retrieval](#rag-context-retrieval)
3. [Conversation Compression](#conversation-compression)
4. [Project Memory](#project-memory)
5. [Session Workflow](#session-workflow)
6. [Managing Skills](#managing-skills)
7. [Token Accounting](#token-accounting)
8. [Best Practices](#best-practices)
9. [Example Implementations](#example-implementations)

---

## Understanding the Integration Pattern

The API provides three core capabilities that work together:

1. **RAG (Retrieval-Augmented Generation)**: Before each LLM call, fetch relevant skills, conventions, and snippets based on your query. This injects relevant context without sending your entire knowledge base every time.
2. **Compression**: When conversation history grows long (>10 turns), compress old messages into summaries to stay within context windows.
3. **Memory**: Store decisions, configurations, and learnings per project for future reference.

**Expected savings**: 60-80% token reduction vs. sending everything.

---

## RAG Context Retrieval

### The `/context/rag` Endpoint

This is your primary integration point. It returns only the most relevant items from your knowledge base.

**Request:**

```
GET /context/rag?query={query}&project={project}
```

**Response:**

```json
{
  "skills": [
    {
      "id": "homelab-docker-compose",
      "name": "Docker Compose Standard",
      "category": "homelab",
      "content": "Always use docker-compose v3.8+. Include health checks, restart policies, and resource limits.",
      "relevance_score": 0.89
    }
  ],
  "conventions": [
    {
      "id": "conv-123",
      "name": "React Project Standards",
      "project": "/home/user/my-react-app",
      "content": "Use TypeScript, React 18+, and functional components with hooks.",
      "relevance_score": 0.76
    }
  ],
  "snippets": [
    {
      "id": "snippet-456",
      "name": "FastAPI CORS setup",
      "language": "python",
      "content": "app.add_middleware(CORSMiddleware, allow_origins=[\"*\"], ...)",
      "relevance_score": 0.82
    }
  ]
}
```

### How It Works

- Skills are globally available (your general knowledge base)
- Conventions are scoped to a project path or identifier (e.g., `/home/user/project1`)
- Snippets are globally available code examples
- Relevance scores are cosine similarity (0-1); items below 0.3 are typically filtered out
- Limits are configurable (default: 3 skills, 2 conventions, 2 snippets)

### Usage Pattern

```python
async def query_with_context(query: str, project: str = None):
    # 1. Fetch context
    context = await get_context(query, project)

    # 2. Build system prompt
    system_prompt = format_context(context)
    # system_prompt now contains:
    # ## Relevant Skills
    # ### Docker Compose Standard (relevance: 0.89)
    # Always use docker-compose v3.8+...
    # ...

    # 3. Inject into LLM call
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": query}
    ]
    response = await llm.chat(messages)
    return response
```
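The `get_context` and `format_context` helpers referenced above are not part of the API itself; you supply them in your agent. Below is a minimal sketch, assuming the `/context/rag` response shape shown above and the `http://helm:8675` base URL used throughout this guide. The section headings it emits are illustrative, not mandated by the API.

```python
import httpx

API_URL = "http://helm:8675"  # adjust to wherever your instance runs


async def get_context(query: str, project: str = None) -> dict:
    """Fetch relevant skills, conventions, and snippets for a query."""
    params = {"query": query}
    if project:
        params["project"] = project
    async with httpx.AsyncClient() as client:
        resp = await client.get(f"{API_URL}/context/rag", params=params)
        resp.raise_for_status()
        return resp.json()


def format_context(context: dict) -> str:
    """Render the RAG response as a markdown block for the system prompt."""
    sections = []
    for title, key in [("Relevant Skills", "skills"),
                       ("Project Conventions", "conventions"),
                       ("Code Snippets", "snippets")]:
        items = context.get(key, [])
        if not items:
            continue
        lines = [f"## {title}"]
        for item in items:
            # Field names match the documented /context/rag response
            lines.append(f"### {item['name']} (relevance: {item['relevance_score']:.2f})")
            lines.append(item["content"])
        sections.append("\n".join(lines))
    return "\n\n".join(sections)
```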
---

## Conversation Compression

### The `/compress` Endpoint

Compresses a list of conversation messages into a shorter representation.

**Request:**

```json
{
  "messages": [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi! How can I help?"},
    {"role": "user", "content": "I need to set up Docker Compose."},
    {"role": "assistant", "content": "Sure! Docker Compose uses a YAML file..."},
    ... (up to 20+ messages)
  ]
}
```

**Response:**

```json
{
  "messages": [
    {"role": "system", "content": "Summary of earlier conversation..."},
    {"role": "user", "content": "I need to set up Docker Compose."},
    {"role": "assistant", "content": "Sure! Docker Compose uses a YAML file..."}
  ],
  "tokens_saved": 245
}
```

### Compression Strategies

- **Extractive** (default): Uses LSA summarization to select key sentences. Fast (~100-500ms), no model required.
- **Ollama**: Uses `phi3:mini` for abstractive summaries. Better quality but slower (~2s). Requires Ollama running.

**Configure in `config.yaml`:**

```yaml
compression:
  enabled: true
  strategy: "extractive"  # or "ollama"
```

### Usage Pattern

```python
conversation = []

async def chat(query):
    # Add user message
    conversation.append({"role": "user", "content": query})

    # Call LLM (with context from RAG)
    response = await llm.chat(conversation)
    conversation.append({"role": "assistant", "content": response})

    # Compress when conversation gets long
    if len(conversation) >= 10:
        compressed = await compress_messages(conversation)
        conversation = compressed["messages"]
        print(f"Saved {compressed['tokens_saved']} tokens")

    return response
```

**Important**: Keep the most recent ~4-6 turns uncompressed. The compression endpoint preserves recent messages and compresses only the older ones.

---

## Project Memory

### The `/memory` Endpoints

Store and retrieve project-specific knowledge.

**Store:**

```
POST /memory
{
  "project": "my-project",
  "key": "architecture-decision-2024-01-15",
  "content": "We chose FastAPI over Flask for async support and automatic OpenAPI docs."
}
```

**Retrieve:**

```
GET /memory?project=my-project
```

**Update:**

```
PUT /memory/{id}
```

**Delete:**

```
DELETE /memory/{id}
```

### Usage Pattern

```python
# Store a decision after making it
await store_memory(
    project="/home/user/myapp",
    key="db-choice",
    content="Using PostgreSQL over MongoDB for relational data integrity"
)

# Retrieve past decisions at project start
resp = httpx.get("http://helm:8675/memory", params={"project": "/home/user/myapp"})
decisions = resp.json()["entries"]
# decisions = [{"id": "...", "key": "db-choice", "content": "...", ...}]
```

**When to use memory:**

- Architecture decisions
- Configuration choices (API keys, service URLs)
- Learned preferences ("User likes code examples")
- Debugging notes ("Issue with CORS on port 8080")

**When NOT to use memory:**

- Temporary conversation state (use compression instead)
- Large codebases (store in skills/snippets instead)
- Public documentation (should be in skills)
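The `store_memory` helper used in the pattern above is another thin wrapper you write yourself around the documented `/memory` endpoints. A minimal sketch follows; the `load_memories` name and the `raise_for_status` checks are illustrative choices, not part of the API.

```python
import httpx

API_URL = "http://helm:8675"


async def store_memory(project: str, key: str, content: str) -> dict:
    """Persist a decision or learning against a project identifier."""
    async with httpx.AsyncClient() as client:
        resp = await client.post(f"{API_URL}/memory", json={
            "project": project,
            "key": key,
            "content": content,
        })
        resp.raise_for_status()
        return resp.json()


async def load_memories(project: str) -> list:
    """Fetch all stored entries for a project (useful at session start)."""
    async with httpx.AsyncClient() as client:
        resp = await client.get(f"{API_URL}/memory", params={"project": project})
        resp.raise_for_status()
        # The retrieve example above reads results from the "entries" field
        return resp.json()["entries"]
```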
---

## Session Workflow

### Starting a New Session

1. **Define your project identifier** - a path or unique string:

   ```python
   PROJECT = "/home/user/myapp"  # or "my-discord-bot", "workspace-123"
   ```

2. **Load past memories** (optional but helpful):

   ```python
   memories = httpx.get("http://helm:8675/memory", params={"project": PROJECT}).json()["entries"]
   # Inject into system prompt or create context from them
   ```

3. **Begin conversation loop** - for each user query:
   - Call `GET /context/rag?query=...&project=PROJECT`
   - Inject context into the LLM prompt
   - Call the LLM
   - Store important outputs in memory if they represent decisions/learnings
   - Compress the conversation when it reaches ~10 turns

### Ending a Session

- Optionally store a session summary in memory:

  ```python
  await store_memory(PROJECT, "session-summary-2024-01-15", "Completed user auth flow, decided on JWT tokens")
  ```

- No cleanup is needed; conversation state lives in your agent, not on the server.

### Multi-Project Agents

If your agent works across multiple projects:

```python
# Switch project context mid-conversation
PROJECT = "/home/user/project1"  # current active project

# Each project has its own conventions and memories
context = await get_context(query, project=PROJECT)
```

---

## Managing Skills

Skills are your reusable knowledge base. Manage them via the API, MCP, or the seed script.

### Categories

Group skills by category (e.g., `homelab`, `dnd`, `python`, `devops`). Categories don't affect RAG retrieval but help with organization.

### Tags

Tags are keywords reserved for **future search** (not currently used by RAG, but planned for enhanced filtering).

```json
{
  "tags": ["docker", "compose", "infrastructure", "production"]
}
```

### Best Practices for Skills

- **Be specific**: "Docker Compose Production Patterns" > "Docker"
- **Include examples**: Show code snippets in the content
- **Keep it concise**: 1-3 paragraphs, focused on actionable guidance
- **Use markdown**: The API preserves formatting for injection into prompts
- **Version when updating**: If a skill changes significantly, create a new `id` (e.g., `docker-compose-v2`)

### Search Skills

```
GET /skills/search?q={query}
```

Returns matching skills by name/content similarity. Useful for manual exploration, but not needed in automated agents (use `/context/rag` instead).

---

## Token Accounting

### Count Tokens

```
GET /tokens/count?text={text}
```

Returns the token count (using tiktoken for GPT models, approximations for others).

**Use this to:**

- Track compression savings
- Pre-flight check prompts before sending them to the LLM
- Budget token usage per session

### Example: Measure RAG Savings

```python
full_context = load_all_skills()  # hypothetical: all your skills as text
full_tokens = count_tokens(full_context)

rag_context = get_context(query, project)  # only relevant items
rag_tokens = count_tokens(format_context(rag_context))

savings_pct = (1 - rag_tokens / full_tokens) * 100
print(f"RAG saved {savings_pct:.1f}% tokens")
```

---

## Best Practices

### 1. Always Use Project Scoping

Set the `project` parameter consistently. Even if you have one main project, use a consistent identifier:

```python
PROJECT = "/home/user/myapp"  # NOT "default" or None
context = await get_context(query, project=PROJECT)
```

This allows:

- Project-specific conventions
- Memory isolation between projects
- Future per-project analytics

### 2. Call RAG Before Every LLM Request

Even if the query seems unrelated, the cost is negligible (<5ms, ~50 tokens). The knowledge injected often improves responses.

### 3. Compress Proactively

Don't wait until the context window is full. Compress at ~10 messages:

```python
if len(conversation) >= 10:
    compressed = await compress_messages(conversation)
    conversation = compressed["messages"]
```

This keeps the compression quality high (summaries are more accurate with fewer messages).
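`compress_messages`, used here and in the Conversation Compression usage pattern, is likewise a wrapper you provide. A minimal sketch, assuming the `/compress` endpoint accepts a POST with the request body shown earlier and returns the documented `messages`/`tokens_saved` fields:

```python
import httpx

API_URL = "http://helm:8675"


async def compress_messages(messages: list) -> dict:
    """Ask the API to fold older turns into a summary message.

    Returns the response body, e.g. {"messages": [...], "tokens_saved": 245}.
    """
    async with httpx.AsyncClient() as client:
        # POST is assumed here; the guide documents the request/response
        # bodies for /compress but not the HTTP method.
        resp = await client.post(f"{API_URL}/compress", json={"messages": messages})
        resp.raise_for_status()
        return resp.json()
```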
### 4. Store Learnings, Not Everything

Memory is for **decisions** and **facts you want to recall**. Don't store:

- Every user query/response (that's what compression is for)
- Public documentation (put in skills instead)
- Transient state (keep in agent memory)

### 5. Version Your Skills

When a skill's guidance changes:

- **Minor update** (typo, clarification): update the existing skill's `content` in place
- **Major update** (different approach, breaking change): create a new `id` (e.g., `docker-compose-v2`) and optionally mark the old one as deprecated in its content

### 6. Use MCP in Claude Desktop

If you use Claude Desktop, add the MCP server (see `CLAUDE.md`). This gives you:

- Direct access to skills via Claude's tool calling
- No need to implement API calls manually
- The same token savings within Claude

### 7. Monitor Token Savings

Track metrics:

```python
from datetime import datetime

logs = []

def log_savings(tokens_before, tokens_after, operation):
    logs.append({
        "timestamp": datetime.now().isoformat(),
        "operation": operation,
        "tokens_before": tokens_before,
        "tokens_after": tokens_after,
        "savings": tokens_before - tokens_after
    })
    # Periodically upload or analyze these
```

---

## Example Implementations

### Minimal Agent

```python
import asyncio
import os

import httpx

API_URL = os.getenv("API_URL", "http://helm:8675")
PROJECT = os.getenv("PROJECT", "/default")

async def get_context(query):
    async with httpx.AsyncClient() as client:
        resp = await client.get(f"{API_URL}/context/rag",
                                params={"query": query, "project": PROJECT})
        return resp.json()

async def chat():
    conv = []
    while True:
        query = input("You: ")
        if query == "quit":
            break

        # Get context
        ctx = await get_context(query)
        system = format_context(ctx)  # your prompt-formatting helper

        # Call LLM (pseudo: call_llm is whatever client you use)
        response = call_llm(system, conv[-4:], query)

        conv.extend([
            {"role": "user", "content": query},
            {"role": "assistant", "content": response},
        ])
        print(f"Assistant: {response}")

asyncio.run(chat())
```

### Discord Bot with Context

```python
import os

import discord
import httpx
from discord.ext import commands

API_URL = "http://helm:8675"
PROJECT = "/home/user/discord-bot"

intents = discord.Intents.default()
intents.message_content = True  # required to read message text in discord.py 2.x
bot = commands.Bot(command_prefix="!", intents=intents)

@bot.event
async def on_message(message):
    if message.author == bot.user:
        return

    # RAG context
    async with httpx.AsyncClient() as client:
        resp = await client.get(f"{API_URL}/context/rag",
                                params={"query": message.content, "project": PROJECT})
        ctx = resp.json()

    # Build prompt
    system_prompt = format_context(ctx) + "\n\nYou are a helpful Discord bot."

    # Respond (using your LLM of choice)
    response = await generate_response(message.content, system_prompt)
    await message.reply(response)

    # Store in memory if it's a decision
    if "decision" in message.content.lower():
        async with httpx.AsyncClient() as client:
            await client.post(f"{API_URL}/memory", json={
                "project": PROJECT,
                "key": f"decision-{discord.utils.utcnow().timestamp()}",
                "content": response[:500]
            })

bot.run(os.getenv("DISCORD_TOKEN"))
```

---

## Need More Help?

- **Setup issues**: See `SETUP.md`
- **Template repo**: Clone `git.bouncypixel.com:helm/agentic-templates.git`
- **API reference**: Visit `http://helm:8675/docs` when the service is running
- **MCP tools**: See `CLAUDE.md` for Claude Desktop integration