# Usage Guide: AI Skills API

This guide explains how to use the AI Skills API effectively in your projects and AI agent sessions.

## Table of Contents

1. [Understanding the Integration Pattern](#understanding-the-integration-pattern)
2. [RAG Context Retrieval](#rag-context-retrieval)
3. [Conversation Compression](#conversation-compression)
4. [Project Memory](#project-memory)
5. [Session Workflow](#session-workflow)
6. [Managing Skills](#managing-skills)
7. [Token Accounting](#token-accounting)
8. [Best Practices](#best-practices)
9. [Example Implementations](#example-implementations)

---

## Understanding the Integration Pattern

The API provides three core capabilities that work together:

1. **RAG (Retrieval-Augmented Generation)**: Before each LLM call, fetch relevant skills, conventions, and snippets based on your query. This injects relevant context without sending your entire knowledge base every time.
2. **Compression**: When conversation history grows long (>10 turns), compress old messages into summaries to stay within context windows.
3. **Memory**: Store decisions, configurations, and learnings per project for future reference.

**Expected savings**: 60-80% token reduction vs. sending everything.

---

## RAG Context Retrieval

### The `/context/rag` Endpoint

This is your primary integration point. It returns only the most relevant items from your knowledge base.

**Request:**

```
GET /context/rag?query={query}&project={project}
```

**Response:**

```json
{
  "skills": [
    {
      "id": "homelab-docker-compose",
      "name": "Docker Compose Standard",
      "category": "homelab",
      "content": "Always use docker-compose v3.8+. Include health checks, restart policies, and resource limits.",
      "relevance_score": 0.89
    }
  ],
  "conventions": [
    {
      "id": "conv-123",
      "name": "React Project Standards",
      "project": "/home/user/my-react-app",
      "content": "Use TypeScript, React 18+, and functional components with hooks.",
      "relevance_score": 0.76
    }
  ],
  "snippets": [
    {
      "id": "snippet-456",
      "name": "FastAPI CORS setup",
      "language": "python",
      "content": "app.add_middleware(CORSMiddleware, allow_origins=[\"*\"], ...)",
      "relevance_score": 0.82
    }
  ]
}
```

### How It Works

- Skills are globally available (your general knowledge base)
- Conventions are scoped to a project path or identifier (e.g., `/home/user/project1`)
- Snippets are globally available code examples
- Relevance scores are cosine similarity (0-1); items below 0.3 are typically filtered out
- Limits are configurable (default: 3 skills, 2 conventions, 2 snippets)

### Usage Pattern

```python
async def query_with_context(query: str, project: str = None):
    # 1. Fetch context
    context = await get_context(query, project)

    # 2. Build system prompt
    system_prompt = format_context(context)
    # system_prompt now contains:
    #   ## Relevant Skills
    #   ### Docker Compose Standard (relevance: 0.89)
    #   Always use docker-compose v3.8+...
    #   ...

    # 3. Inject into LLM call
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": query}
    ]
    response = await llm.chat(messages)
    return response
```
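The pattern above (and the rest of this guide) relies on two small helpers, `get_context` and `format_context`, that are not defined here. A minimal sketch, assuming the `http://helm:8675` base URL used elsewhere in this guide and the response fields shown above; the exact layout of the formatted prompt is an illustrative choice, not a prescribed format:

```python
import httpx

API_URL = "http://helm:8675"  # assumed base URL, matching the examples in this guide

async def get_context(query: str, project: str = None) -> dict:
    """Fetch relevant skills, conventions, and snippets from /context/rag."""
    params = {"query": query}
    if project:
        params["project"] = project
    async with httpx.AsyncClient() as client:
        resp = await client.get(f"{API_URL}/context/rag", params=params)
        resp.raise_for_status()
        return resp.json()

def format_context(context: dict) -> str:
    """Render the RAG response as a markdown system prompt (layout is illustrative)."""
    sections = []
    for title, key in [("Relevant Skills", "skills"),
                       ("Project Conventions", "conventions"),
                       ("Code Snippets", "snippets")]:
        items = context.get(key, [])
        if not items:
            continue
        lines = [f"## {title}"]
        for item in items:
            lines.append(f"### {item['name']} (relevance: {item['relevance_score']:.2f})")
            lines.append(item["content"])
        sections.append("\n".join(lines))
    return "\n\n".join(sections)
```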
---

## Conversation Compression

### The `/compress` Endpoint

Compresses a list of conversation messages into a shorter representation.

**Request:**

```json
{
  "messages": [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi! How can I help?"},
    {"role": "user", "content": "I need to set up Docker Compose."},
    {"role": "assistant", "content": "Sure! Docker Compose uses a YAML file..."},
    ... (up to 20+ messages)
  ]
}
```

**Response:**

```json
{
  "messages": [
    {"role": "system", "content": "Summary of earlier conversation..."},
    {"role": "user", "content": "I need to set up Docker Compose."},
    {"role": "assistant", "content": "Sure! Docker Compose uses a YAML file..."}
  ],
  "tokens_saved": 245
}
```

### Compression Strategies

- **Extractive** (default): Uses LSA summarization to select key sentences. Fast (~100-500ms), no model required.
- **Ollama**: Uses `phi3:mini` for abstractive summaries. Better quality but slower (~2s). Requires Ollama running.

**Configure in `config.yaml`:**

```yaml
compression:
  enabled: true
  strategy: "extractive"  # or "ollama"
```

### Usage Pattern

```python
conversation = []

async def chat(query):
    global conversation  # reassigned below, so declare it explicitly

    # Add user message
    conversation.append({"role": "user", "content": query})

    # Call LLM (with context from RAG)
    response = await llm.chat(conversation)
    conversation.append({"role": "assistant", "content": response})

    # Compress when the conversation gets long
    if len(conversation) >= 10:
        compressed = await compress_messages(conversation)
        conversation = compressed["messages"]
        print(f"Saved {compressed['tokens_saved']} tokens")

    return response
```

**Important**: Keep the most recent ~4-6 turns uncompressed. The compression endpoint preserves recent messages and compresses only the older ones.

---

## Project Memory

### The `/memory` Endpoints

Store and retrieve project-specific knowledge.

**Store:**

```
POST /memory
{
  "project": "my-project",
  "key": "architecture-decision-2024-01-15",
  "content": "We chose FastAPI over Flask for async support and automatic OpenAPI docs."
}
```

**Retrieve:**

```
GET /memory?project=my-project
```

**Update:**

```
PUT /memory/{id}
```

**Delete:**

```
DELETE /memory/{id}
```

### Usage Pattern

```python
import httpx

# Store a decision after making it
await store_memory(
    project="/home/user/myapp",
    key="db-choice",
    content="Using PostgreSQL over MongoDB for relational data integrity"
)

# Retrieve past decisions at project start
resp = httpx.get("http://helm:8675/memory", params={"project": "/home/user/myapp"})
decisions = resp.json()["entries"]
# decisions = [{"id": "...", "key": "db-choice", "content": "...", ...}]
```
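The pattern above calls a `store_memory` helper that this guide does not define. A minimal async sketch against `POST /memory`, assuming the request body shown under **Store:** and the `http://helm:8675` host used throughout this guide:

```python
import httpx

API_URL = "http://helm:8675"  # assumed base URL, matching the examples in this guide

async def store_memory(project: str, key: str, content: str) -> dict:
    """Persist a decision or learning via POST /memory."""
    async with httpx.AsyncClient() as client:
        resp = await client.post(f"{API_URL}/memory", json={
            "project": project,
            "key": key,
            "content": content,
        })
        resp.raise_for_status()
        return resp.json()
```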
**When to use memory:**

- Architecture decisions
- Configuration choices (API keys, service URLs)
- Learned preferences ("User likes code examples")
- Debugging notes ("Issue with CORS on port 8080")

**When NOT to use memory:**

- Temporary conversation state (use compression instead)
- Large codebases (store in skills/snippets instead)
- Public documentation (should be in skills)

---

## Session Workflow

### Starting a New Session

1. **Determine the project identifier.** The AI should do this at the start of each session. **Recommended approach:** use the git remote origin URL as a stable identifier that follows you across machines.

    ```python
    # Detecting the git remote (the AI would use its shell tool)
    import subprocess

    try:
        project = subprocess.check_output(
            ["git", "remote", "get-url", "origin"]
        ).decode().strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        project = "fallback-identifier"  # or ask the user
    ```

    This ensures that if you work on the same repository from multiple machines (different file paths), the project context remains consistent: the same repo uses the same identifier everywhere. If the directory isn't a git repository, the AI should ask the user for a unique project identifier or fall back to a configured environment variable.

2. **Load past memories** (optional but helpful):

    ```python
    memories = httpx.get("http://helm:8675/memory", params={"project": PROJECT}).json()["entries"]
    # Inject into the system prompt or build context from them
    ```

3. **Begin the conversation loop.** For each user query:
    - Call `GET /context/rag?query=...&project=PROJECT`
    - Inject context into the LLM prompt
    - Call the LLM
    - Store important outputs in memory if they represent decisions/learnings
    - Compress the conversation when it reaches ~10 turns

### Ending a Session

- Optionally store a session summary in memory:

  ```python
  await store_memory(PROJECT, "session-summary-2024-01-15",
                     "Completed user auth flow, decided on JWT tokens")
  ```

- No cleanup needed; conversation state lives in your agent, not the server.

### Multi-Project Agents

If your agent works across multiple projects:

```python
# Switch project context mid-conversation
PROJECT = "git@github.com:company/project-a.git"  # stable identifier

# Each project has its own conventions and memories
context = await get_context(query, project=PROJECT)
```

---

## Managing Skills

Skills are your reusable knowledge base. Manage them via the API, MCP, or the seed script.

### Categories

Group skills by category (e.g., `homelab`, `dnd`, `python`, `devops`). Categories don't affect RAG retrieval but help with organization.

### Tags

Tags are keywords used for **future search** (not currently used by RAG, but planned for enhanced filtering).

```json
{
  "tags": ["docker", "compose", "infrastructure", "production"]
}
```

### Best Practices for Skills

- **Be specific**: "Docker Compose Production Patterns" > "Docker"
- **Include examples**: Show code snippets in the content
- **Keep it concise**: 1-3 paragraphs, focused on actionable guidance
- **Use markdown**: The API preserves formatting for injection into prompts
- **Version when updating**: If a skill changes significantly, create a new `id` (e.g., `docker-compose-v2`)

### Search Skills

```
GET /skills/search?q={query}
```

Returns matching skills by name/content similarity. Useful for manual exploration but not needed in automated agents (use `/context/rag` instead).

---

## Token Accounting

### Count Tokens

```
GET /tokens/count?text={text}
```

Returns the token count (using tiktoken for GPT models, approximations for others).

**Use this to:**

- Track compression savings
- Pre-flight check prompts before sending them to the LLM
- Budget token usage per session

### Example: Measure RAG Savings

```python
full_context = load_all_skills()  # hypothetical: all your skills text
full_tokens = count_tokens(full_context)

rag_context = await get_context(query, project)  # only relevant items
rag_tokens = count_tokens(format_context(rag_context))

savings_pct = (1 - rag_tokens / full_tokens) * 100
print(f"RAG saved {savings_pct:.1f}% tokens")
```
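The example above uses a `count_tokens` helper that is not defined in this guide. A minimal synchronous sketch wrapping `GET /tokens/count`; the `count` field name in the response is an assumption, so check the live `/docs` for the actual schema:

```python
import httpx

API_URL = "http://helm:8675"  # assumed base URL, matching the examples in this guide

def count_tokens(text: str) -> int:
    """Ask the API how many tokens a piece of text costs via GET /tokens/count."""
    resp = httpx.get(f"{API_URL}/tokens/count", params={"text": text})
    resp.raise_for_status()
    # "count" is an assumed field name; adjust to the response schema shown at /docs
    return resp.json()["count"]
```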
---

## Best Practices

### 1. Always Use Project Scoping

Set the `project` parameter consistently. Even if you have one main project, use a consistent identifier:

```python
PROJECT = "/home/user/myapp"  # NOT "default" or None
context = await get_context(query, project=PROJECT)
```

This allows:

- Project-specific conventions
- Memory isolation between projects
- Future per-project analytics

### 2. Call RAG Before Every LLM Request

Even if the query seems unrelated, the cost is negligible (<5ms, ~50 tokens). The knowledge injected often improves responses.

### 3. Compress Proactively

Don't wait until the context window is full. Compress at ~10 messages:

```python
if len(conversation) >= 10:
    compressed = await compress_messages(conversation)
    conversation = compressed["messages"]
```

This keeps the compression quality high (summaries are more accurate with fewer messages).

### 4. Store Learnings, Not Everything

Memory is for **decisions** and **facts you want to recall**. Don't store:

- Every user query/response (that's what compression is for)
- Public documentation (put it in skills instead)
- Transient state (keep it in agent memory)

### 5. Version Your Skills

When a skill's guidance changes:

- **Minor update** (typo, clarification): update the existing skill's `content` in place
- **Major update** (different approach, breaking change): create a new `id` (e.g., `docker-compose-v2`) and optionally mark the old one as deprecated in its content

### 6. Use MCP in Claude Desktop

If you use Claude Desktop, add the MCP server (see `CLAUDE.md`). This gives you:

- Direct access to skills via Claude's tool calling
- No need to implement API calls manually
- Same token savings within Claude

### 7. Monitor Token Savings

Track metrics:

```python
from datetime import datetime

logs = []

def log_savings(tokens_before, tokens_after, operation):
    logs.append({
        "timestamp": datetime.now().isoformat(),
        "operation": operation,
        "tokens_before": tokens_before,
        "tokens_after": tokens_after,
        "savings": tokens_before - tokens_after
    })
    # Periodically upload or analyze these
```

---

## Example Implementations

### Minimal Agent

```python
import asyncio
import os

import httpx

API_URL = os.getenv("API_URL", "http://helm:8675")
PROJECT = os.getenv("PROJECT", "/default")

async def get_context(query):
    async with httpx.AsyncClient() as client:
        resp = await client.get(f"{API_URL}/context/rag",
                                params={"query": query, "project": PROJECT})
        return resp.json()

async def chat():
    conv = []
    while True:
        query = input("You: ")
        if query == "quit":
            break

        # Get context
        ctx = await get_context(query)
        system = format_context(ctx)

        # Call LLM (pseudo)
        response = call_llm(system, conv[-4:], query)
        conv.extend([
            {"role": "user", "content": query},
            {"role": "assistant", "content": response}
        ])
        print(f"Assistant: {response}")

asyncio.run(chat())
```

### Discord Bot with Context

```python
import os

import discord
from discord.ext import commands
import httpx

# discord.py 2.x requires explicit intents; message content is needed to read messages
intents = discord.Intents.default()
intents.message_content = True
bot = commands.Bot(command_prefix="!", intents=intents)

API_URL = "http://helm:8675"
PROJECT = "/home/user/discord-bot"

@bot.event
async def on_message(message):
    if message.author == bot.user:
        return

    # RAG context
    async with httpx.AsyncClient() as client:
        resp = await client.get(f"{API_URL}/context/rag",
                                params={"query": message.content, "project": PROJECT})
        ctx = resp.json()

    # Build prompt
    system_prompt = format_context(ctx) + "\n\nYou are a helpful Discord bot."

    # Respond (using your LLM of choice)
    response = await generate_response(message.content, system_prompt)
    await message.reply(response)

    # Store in memory if it's a decision
    if "decision" in message.content.lower():
        async with httpx.AsyncClient() as client:
            await client.post(f"{API_URL}/memory", json={
                "project": PROJECT,
                "key": f"decision-{discord.utils.utcnow().timestamp()}",
                "content": response[:500]
            })

bot.run(os.getenv("DISCORD_TOKEN"))
```

---

## Need More Help?

- **Setup issues**: See `SETUP.md`
- **Template repo**: Clone `git.bouncypixel.com:helm/agentic-templates.git`
- **API reference**: Visit `http://helm:8675/docs` when the service is running
- **MCP tools**: See `CLAUDE.md` for Claude Desktop integration