Usage Guide: AI Skills API

This guide explains how to use the AI Skills API effectively in your projects and AI agent sessions.

Table of Contents

  1. Understanding the Integration Pattern
  2. RAG Context Retrieval
  3. Conversation Compression
  4. Project Memory
  5. Session Workflow
  6. Managing Skills
  7. Token Accounting
  8. Best Practices
  9. Example Implementations

Understanding the Integration Pattern

The API provides three core capabilities that work together:

  1. RAG (Retrieval-Augmented Generation): Before each LLM call, fetch relevant skills, conventions, and snippets based on your query. This injects relevant context without sending your entire knowledge base every time.

  2. Compression: When conversation history grows long (>10 turns), compress old messages into summaries to stay within context windows.

  3. Memory: Store decisions, configurations, and learnings per project for future reference.

Expected savings: a 60-80% token reduction compared with sending your full knowledge base and conversation history on every call.


RAG Context Retrieval

The /context/rag Endpoint

This is your primary integration point. It returns only the most relevant items from your knowledge base.

Request:

GET /context/rag?query={query}&project={project}

Response:

{
  "skills": [
    {
      "id": "homelab-docker-compose",
      "name": "Docker Compose Standard",
      "category": "homelab",
      "content": "Always use docker-compose v3.8+. Include health checks, restart policies, and resource limits.",
      "relevance_score": 0.89
    }
  ],
  "conventions": [
    {
      "id": "conv-123",
      "name": "React Project Standards",
      "project": "/home/user/my-react-app",
      "content": "Use TypeScript, React 18+, and functional components with hooks.",
      "relevance_score": 0.76
    }
  ],
  "snippets": [
    {
      "id": "snippet-456",
      "name": "FastAPI CORS setup",
      "language": "python",
      "content": "app.add_middleware(CORSMiddleware, allow_origins=[\"*\"], ...)",
      "relevance_score": 0.82
    }
  ]
}

How It Works

  • Skills are globally available (your general knowledge base)
  • Conventions are scoped to a project path or identifier (e.g., /home/user/project1)
  • Snippets are globally available code examples
  • Relevance scores are cosine similarity (0-1) - items below 0.3 are typically filtered out
  • Limits are configurable (default: 3 skills, 2 conventions, 2 snippets)

Usage Pattern

async def query_with_context(query: str, project: str | None = None):
    # 1. Fetch context
    context = await get_context(query, project)
    
    # 2. Build system prompt
    system_prompt = format_context(context)
    # system_prompt now contains:
    # ## Relevant Skills
    # ### Docker Compose Standard (relevance: 0.89)
    # Always use docker-compose v3.8+...
    # ...
    
    # 3. Inject into LLM call
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": query}
    ]
    response = await llm.chat(messages)
    
    return response
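
get_context() and format_context() in this pattern are client-side helpers, not API calls. A minimal sketch is shown below; the "Project Conventions" and "Code Snippets" headings are assumptions, so format the prompt however suits your agent:

import httpx

API_URL = "http://helm:8675"

async def get_context(query: str, project: str | None = None) -> dict:
    # Fetch only the relevant skills, conventions, and snippets
    params = {"query": query}
    if project:
        params["project"] = project
    async with httpx.AsyncClient() as client:
        resp = await client.get(f"{API_URL}/context/rag", params=params)
        resp.raise_for_status()
        return resp.json()

def format_context(ctx: dict) -> str:
    # Render the RAG response as a markdown system prompt
    sections = []
    for heading, key in (("Relevant Skills", "skills"),
                         ("Project Conventions", "conventions"),
                         ("Code Snippets", "snippets")):
        items = ctx.get(key, [])
        if not items:
            continue
        lines = [f"## {heading}"]
        for item in items:
            lines.append(f"### {item['name']} (relevance: {item['relevance_score']:.2f})")
            lines.append(item["content"])
        sections.append("\n".join(lines))
    return "\n\n".join(sections)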

Conversation Compression

The /compress Endpoint

Compresses a list of conversation messages into a shorter representation.

Request:

{
  "messages": [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi! How can I help?"},
    {"role": "user", "content": "I need to set up Docker Compose."},
    {"role": "assistant", "content": "Sure! Docker Compose uses a YAML file..."},
    ... (up to 20+ messages)
  ]
}

Response:

{
  "messages": [
    {"role": "system", "content": "Summary of earlier conversation..."},
    {"role": "user", "content": "I need to set up Docker Compose."},
    {"role": "assistant", "content": "Sure! Docker Compose uses a YAML file..."}
  ],
  "tokens_saved": 245
}

Compression Strategies

  • Extractive (default): Uses LSA summarization to select key sentences. Fast (~100-500ms), no model required.
  • Ollama: Uses phi3:mini for abstractive summaries. Better quality but slower (~2s). Requires Ollama running.

Configure in config.yaml:

compression:
  enabled: true
  strategy: "extractive"  # or "ollama"

Usage Pattern

conversation = []

async def chat(query):
    global conversation  # reassigned below, so the global declaration is required
    # Add user message
    conversation.append({"role": "user", "content": query})
    
    # Call LLM (with context from RAG)
    response = await llm.chat(conversation)
    conversation.append({"role": "assistant", "content": response})
    
    # Compress when conversation gets long
    if len(conversation) >= 10:
        compressed = await compress_messages(conversation)
        conversation = compressed["messages"]
        print(f"Saved {compressed['tokens_saved']} tokens")
    
    return response

Important: Keep the most recent ~4-6 turns uncompressed. The compression endpoint preserves recent messages and compresses only the older ones.
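
The compress_messages() helper used above is not part of the API client; a minimal sketch, assuming the endpoint accepts a POST with the request body shown earlier:

import httpx

API_URL = "http://helm:8675"

async def compress_messages(messages: list[dict]) -> dict:
    # POST the full history; the response contains the shortened
    # message list plus a tokens_saved count
    async with httpx.AsyncClient() as client:
        resp = await client.post(f"{API_URL}/compress", json={"messages": messages})
        resp.raise_for_status()
        return resp.json()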


Project Memory

The /memory Endpoints

Store and retrieve project-specific knowledge.

Store:

POST /memory
{
  "project": "my-project",
  "key": "architecture-decision-2024-01-15",
  "content": "We chose FastAPI over Flask for async support and automatic OpenAPI docs."
}

Retrieve:

GET /memory?project=my-project

Update:

PUT /memory/{id}

Delete:

DELETE /memory/{id}
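
For example, updating or removing an entry by id with httpx; the PUT body shape is an assumption, so check /docs for the exact schema:

import httpx

API_URL = "http://helm:8675"
entry_id = "..."  # an id returned by GET /memory

# Update an entry's content (body shape assumed; verify against /docs)
httpx.put(f"{API_URL}/memory/{entry_id}",
          json={"content": "Switched to SQLite for local dev"}).raise_for_status()

# Delete an entry
httpx.delete(f"{API_URL}/memory/{entry_id}").raise_for_status()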

Usage Pattern

# Store a decision after making it
await store_memory(
    project="/home/user/myapp",
    key="db-choice",
    content="Using PostgreSQL over MongoDB for relational data integrity"
)

# Retrieve past decisions at project start
resp = httpx.get("http://helm:8675/memory", params={"project": "/home/user/myapp"})
decisions = resp.json()["entries"]
# decisions = [{"id": "...", "key": "db-choice", "content": "...", ...}]
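
store_memory() here is a thin wrapper around POST /memory; a minimal sketch:

import httpx

API_URL = "http://helm:8675"

async def store_memory(project: str, key: str, content: str) -> dict:
    # Persist a decision or learning for later retrieval
    async with httpx.AsyncClient() as client:
        resp = await client.post(f"{API_URL}/memory", json={
            "project": project,
            "key": key,
            "content": content,
        })
        resp.raise_for_status()
        return resp.json()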

When to use memory:

  • Architecture decisions
  • Configuration choices (API keys, service URLs)
  • Learned preferences ("User likes code examples")
  • Debugging notes ("Issue with CORS on port 8080")

When NOT to use memory:

  • Temporary conversation state (use compression instead)
  • Large codebases (store in skills/snippets instead)
  • Public documentation (should be in skills)

Session Workflow

Starting a New Session

  1. Define your project identifier - we recommend using git remote origin for consistency across machines:

    # Option A: Auto-detect (recommended) - agent template does this automatically (sketch after this list)
    project = get_project_identifier()  # returns git remote origin if available, else env var, else dir name
    
    # Option B: Explicit project identifier (stable across machines)
    PROJECT = "git@github.com:username/repo.git"  # or "https://github.com/username/repo"
    
    # Option C: Use environment variable
    # export PROJECT="git@github.com:username/repo.git"
    

    Why git remote? If you work on the same repository from multiple machines (different file paths), using the git remote as the project identifier ensures your conventions and memories follow you. The same repo gets the same context regardless of where you clone it.

  2. Load past memories (optional but helpful):

    memories = httpx.get("http://helm:8675/memory", params={"project": PROJECT}).json()["entries"]
    # Inject into system prompt or create context from them
    
  3. Begin conversation loop - for each user query:

    • Call GET /context/rag?query=...&project=PROJECT
    • Inject context into LLM prompt
    • Call LLM
    • Store important outputs in memory if they represent decisions/learnings
    • Compress conversation when it reaches ~10 turns
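
A get_project_identifier() helper like the one referenced in step 1 (Option A) can be approximated as follows; this is a sketch of the fallback order, not the agent template's exact code:

import os
import subprocess
from pathlib import Path

def get_project_identifier() -> str:
    # Prefer the git remote origin, then the PROJECT env var, then the directory name
    try:
        remote = subprocess.run(
            ["git", "remote", "get-url", "origin"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        if remote:
            return remote
    except (subprocess.CalledProcessError, FileNotFoundError):
        pass
    return os.getenv("PROJECT") or Path.cwd().name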

Ending a Session

  • Optionally store a session summary in memory:

    await store_memory(PROJECT, "session-summary-2024-01-15", "Completed user auth flow, decided on JWT tokens")
    
  • No cleanup needed - conversation state lives in your agent, not the server

Multi-Project Agents

If your agent works across multiple projects:

# Switch project context mid-conversation
PROJECT = "git@github.com:company/project-a.git"  # stable identifier

# Each project has its own conventions and memories
context = await get_context(query, project=PROJECT)

Managing Skills

Skills are your reusable knowledge base. Manage them via API, MCP, or the seed script.

Categories

Group skills by category (e.g., homelab, dnd, python, devops). Categories don't affect RAG retrieval but help with organization.

Tags

Tags are keywords reserved for future search; they are not currently used by RAG retrieval, but enhanced tag-based filtering is planned.

{
  "tags": ["docker", "compose", "infrastructure", "production"]
}

Best Practices for Skills

  • Be specific: "Docker Compose Production Patterns" > "Docker"
  • Include examples: Show code snippets in the content
  • Keep it concise: 1-3 paragraphs, focus on actionable guidance
  • Use markdown: The API preserves formatting for injection into prompts
  • Version when updating: If a skill changes significantly, create a new id (e.g., docker-compose-v2)
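
As an illustration, a skill written to these guidelines might look like the payload below. The POST /skills endpoint and field names are assumptions based on the shape returned by /context/rag; check http://helm:8675/docs for the actual schema.

import httpx

# Hypothetical payload; verify the endpoint and schema against /docs
skill = {
    "id": "docker-compose-production",
    "name": "Docker Compose Production Patterns",
    "category": "homelab",
    "tags": ["docker", "compose", "production"],
    "content": "Always use docker-compose v3.8+. Include health checks, "
               "restart policies, and resource limits for every service.",
}
httpx.post("http://helm:8675/skills", json=skill).raise_for_status()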

Search Skills

GET /skills/search?q={query}

Returns matching skills by name/content similarity. Useful for manual exploration but not needed in automated agents (use /context/rag instead).
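
A quick manual check from Python (the response shape is not documented here, so just inspect it):

import httpx

resp = httpx.get("http://helm:8675/skills/search", params={"q": "docker compose"})
resp.raise_for_status()
print(resp.json())  # inspect the result, or see /docs for the schema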


Token Accounting

Count Tokens

GET /tokens/count?text={text}

Returns the token count (using tiktoken for GPT models, approximations for others).

Use this to:

  • Track compression savings
  • Pre-flight check prompts before sending to LLM
  • Budget token usage per session

Example: Measure RAG Savings

full_context = load_all_skills()  # hypothetical: all your skills text
full_tokens = count_tokens(full_context)

rag_context = get_context(query, project)  # only relevant items
rag_tokens = count_tokens(format_context(rag_context))

savings_pct = (1 - rag_tokens / full_tokens) * 100
print(f"RAG saved {savings_pct:.1f}% tokens")

Best Practices

1. Always Use Project Scoping

Set project parameter consistently. Even if you have one main project, use a consistent identifier:

PROJECT = "/home/user/myapp"  # NOT "default" or None
context = await get_context(query, project=PROJECT)

This allows:

  • Project-specific conventions
  • Memory isolation between projects
  • Future per-project analytics

2. Call RAG Before Every LLM Request

Even if the query seems unrelated, the cost is negligible (<5ms, ~50 tokens). The knowledge injected often improves responses.

3. Compress Proactively

Don't wait until context window is full. Compress at ~10 messages:

if len(conversation) >= 10:
    compressed = await compress_messages(conversation)
    conversation = compressed["messages"]

This keeps the compression quality high (summaries are more accurate with fewer messages).

4. Store Learnings, Not Everything

Memory is for decisions and facts you want to recall.

Don't store:

  • Every user query/response (that's what compression is for)
  • Public documentation (put in skills instead)
  • Transient state (keep in agent memory)

5. Version Your Skills

When a skill's guidance changes:

  • Minor update (typo, clarification): update the existing skill's content in place
  • Major update (different approach, breaking change): create a new id (e.g., docker-compose-v2) and optionally mark the old one as deprecated in its content

6. Use MCP in Claude Desktop

If you use Claude Desktop, add the MCP server (see CLAUDE.md). This gives you:

  • Direct access to skills via Claude's tool calling
  • No need to implement API calls manually
  • Same token savings within Claude

7. Monitor Token Savings

Track metrics:

from datetime import datetime

logs = []

def log_savings(tokens_before, tokens_after, operation):
    logs.append({
        "timestamp": datetime.now().isoformat(),
        "operation": operation,
        "tokens_before": tokens_before,
        "tokens_after": tokens_after,
        "savings": tokens_before - tokens_after
    })
    # Periodically upload or analyze these

Example Implementations

Minimal Agent

import asyncio, httpx, os

API_URL = os.getenv("API_URL", "http://helm:8675")
PROJECT = os.getenv("PROJECT", "/default")

async def get_context(query):
    async with httpx.AsyncClient() as client:
        resp = await client.get(f"{API_URL}/context/rag", params={"query": query, "project": PROJECT})
        return resp.json()

async def chat():
    conv = []
    while True:
        query = input("You: ")
        if query == "quit": break
        
        # Get context
        ctx = await get_context(query)
        system = format_context(ctx)
        
        # Call LLM (pseudo)
        response = call_llm(system, conv[-4:], query)
        
        conv.extend([{"role": "user", "content": query},
                     {"role": "assistant", "content": response}])
        
        print(f"Assistant: {response}")

asyncio.run(chat())

Discord Bot with Context

import os

import discord
from discord.ext import commands
import httpx

intents = discord.Intents.default()
intents.message_content = True  # required to read message content with discord.py 2.x
bot = commands.Bot(command_prefix="!", intents=intents)
API_URL = "http://helm:8675"
PROJECT = "/home/user/discord-bot"

@bot.event
async def on_message(message):
    if message.author == bot.user:
        return
    
    # RAG context
    async with httpx.AsyncClient() as client:
        resp = await client.get(f"{API_URL}/context/rag", params={"query": message.content, "project": PROJECT})
        ctx = resp.json()
    
    # Build prompt
    system_prompt = format_context(ctx) + "\n\nYou are a helpful Discord bot."
    
    # Respond (generate_response is a placeholder for your LLM call of choice)
    response = await generate_response(message.content, system_prompt)
    await message.reply(response)
    
    # Store in memory if it's a decision
    if "decision" in message.content.lower():
        async with httpx.AsyncClient() as client:
            await client.post(f"{API_URL}/memory", json={
                "project": PROJECT,
                "key": f"decision-{discord.utils.utcnow().timestamp()}",
                "content": response[:500]
            })

bot.run(os.getenv("DISCORD_TOKEN"))

Need More Help?

  • Setup issues: See SETUP.md
  • Template repo: Clone git.bouncypixel.com:helm/agentic-templates.git
  • API reference: Visit http://helm:8675/docs when the service is running
  • MCP tools: See CLAUDE.md for Claude Desktop integration