# Usage Guide: AI Skills API
This guide explains how to use the AI Skills API effectively in your projects and AI agent sessions.
## Table of Contents
- Understanding the Integration Pattern
- RAG Context Retrieval
- Conversation Compression
- Project Memory
- Session Workflow
- Managing Skills
- Token Accounting
- Best Practices
- Example Implementations
## Understanding the Integration Pattern

The API provides three core capabilities that work together:

- **RAG (Retrieval-Augmented Generation)**: Before each LLM call, fetch relevant skills, conventions, and snippets based on your query. This injects relevant context without sending your entire knowledge base every time.
- **Compression**: When conversation history grows long (>10 turns), compress old messages into summaries to stay within context windows.
- **Memory**: Store decisions, configurations, and learnings per project for future reference.

**Expected savings:** 60-80% token reduction vs. sending everything.
## RAG Context Retrieval

### The /context/rag Endpoint

This is your primary integration point. It returns only the most relevant items from your knowledge base.

**Request:**

```
GET /context/rag?query={query}&project={project}
```
**Response:**

```json
{
  "skills": [
    {
      "id": "homelab-docker-compose",
      "name": "Docker Compose Standard",
      "category": "homelab",
      "content": "Always use docker-compose v3.8+. Include health checks, restart policies, and resource limits.",
      "relevance_score": 0.89
    }
  ],
  "conventions": [
    {
      "id": "conv-123",
      "name": "React Project Standards",
      "project": "/home/user/my-react-app",
      "content": "Use TypeScript, React 18+, and functional components with hooks.",
      "relevance_score": 0.76
    }
  ],
  "snippets": [
    {
      "id": "snippet-456",
      "name": "FastAPI CORS setup",
      "language": "python",
      "content": "app.add_middleware(CORSMiddleware, allow_origins=[\"*\"], ...)",
      "relevance_score": 0.82
    }
  ]
}
```
### How It Works

- Skills are globally available (your general knowledge base)
- Conventions are scoped to a project path or identifier (e.g., `/home/user/project1`)
- Snippets are globally available code examples
- Relevance scores are cosine similarity (0-1); items below 0.3 are typically filtered out
- Limits are configurable (default: 3 skills, 2 conventions, 2 snippets)
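For intuition, the threshold-and-limit behavior can be mirrored client-side. A minimal sketch, purely illustrative since the server applies this filtering for you:

```python
def filter_by_relevance(items, threshold=0.3, limit=3):
    """Keep the highest-scoring items above the cutoff, mirroring the
    server-side defaults described above (illustrative only)."""
    kept = [i for i in items if i["relevance_score"] >= threshold]
    kept.sort(key=lambda i: i["relevance_score"], reverse=True)
    return kept[:limit]

skills = [
    {"id": "a", "relevance_score": 0.89},
    {"id": "b", "relevance_score": 0.25},  # below 0.3, so it is dropped
    {"id": "c", "relevance_score": 0.41},
]
top = filter_by_relevance(skills, threshold=0.3, limit=3)
# top keeps "a" and "c", highest score first
```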
### Usage Pattern

```python
async def query_with_context(query: str, project: str = None):
    # 1. Fetch context
    context = await get_context(query, project)

    # 2. Build system prompt
    system_prompt = format_context(context)
    # system_prompt now contains:
    # ## Relevant Skills
    # ### Docker Compose Standard (relevance: 0.89)
    # Always use docker-compose v3.8+...
    # ...

    # 3. Inject into LLM call
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": query}
    ]
    response = await llm.chat(messages)
    return response
```
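`format_context` is left to the client. A minimal sketch that renders a `/context/rag` response into the markdown layout shown in the comments above (the section titles are assumptions, not API output):

```python
def format_context(context: dict) -> str:
    """Render a /context/rag response as a markdown system prompt.
    Sketch only; adjust section titles and fields to taste."""
    sections = [("Relevant Skills", "skills"),
                ("Project Conventions", "conventions"),
                ("Code Snippets", "snippets")]
    parts = []
    for title, key in sections:
        items = context.get(key, [])
        if not items:
            continue  # skip empty sections entirely
        parts.append(f"## {title}")
        for item in items:
            parts.append(f"### {item['name']} (relevance: {item['relevance_score']:.2f})")
            parts.append(item["content"])
    return "\n".join(parts)
```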
## Conversation Compression

### The /compress Endpoint

Compresses a list of conversation messages into a shorter representation.
**Request:**

```json
{
  "messages": [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi! How can I help?"},
    {"role": "user", "content": "I need to set up Docker Compose."},
    {"role": "assistant", "content": "Sure! Docker Compose uses a YAML file..."},
    ... (up to 20+ messages)
  ]
}
```
**Response:**

```json
{
  "messages": [
    {"role": "system", "content": "Summary of earlier conversation..."},
    {"role": "user", "content": "I need to set up Docker Compose."},
    {"role": "assistant", "content": "Sure! Docker Compose uses a YAML file..."}
  ],
  "tokens_saved": 245
}
```
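A thin client helper for this endpoint might look like the following sketch; the base URL matches the other examples in this guide, and `should_compress` mirrors the ~10-turn threshold used throughout:

```python
def should_compress(conversation, threshold=10):
    """Mirror the ~10-turn compression threshold used in this guide."""
    return len(conversation) >= threshold

async def compress_messages(messages):
    """POST the history to /compress and return the body shown above
    ({"messages": [...], "tokens_saved": N}). Sketch only."""
    import httpx  # imported lazily so should_compress has no dependency
    async with httpx.AsyncClient() as client:
        resp = await client.post("http://helm:8675/compress",
                                 json={"messages": messages})
        resp.raise_for_status()
        return resp.json()
```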
### Compression Strategies

- **Extractive** (default): Uses LSA summarization to select key sentences. Fast (~100-500ms), no model required.
- **Ollama**: Uses `phi3:mini` for abstractive summaries. Better quality but slower (~2s). Requires Ollama running.

Configure in `config.yaml`:

```yaml
compression:
  enabled: true
  strategy: "extractive"  # or "ollama"
```
### Usage Pattern

```python
conversation = []

async def chat(query):
    # Add user message
    conversation.append({"role": "user", "content": query})

    # Call LLM (with context from RAG)
    response = await llm.chat(conversation)
    conversation.append({"role": "assistant", "content": response})

    # Compress when conversation gets long
    if len(conversation) >= 10:
        compressed = await compress_messages(conversation)
        # replace contents in place (assignment would rebind a local name)
        conversation[:] = compressed["messages"]
        print(f"Saved {compressed['tokens_saved']} tokens")
    return response
```
Important: Keep the most recent ~4-6 turns uncompressed. The compression endpoint preserves recent messages and compresses only the older ones.
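To illustrate the split the endpoint performs, here is a model of the behavior (the server's actual boundary logic may differ):

```python
def split_for_compression(messages, keep_recent=4):
    """Model of the server-side split: older messages become summary
    input, the most recent `keep_recent` pass through untouched."""
    if len(messages) <= keep_recent:
        return [], list(messages)  # nothing old enough to compress
    return messages[:-keep_recent], messages[-keep_recent:]

older, recent = split_for_compression([{"n": i} for i in range(10)], keep_recent=4)
# 6 older messages would be summarized; the last 4 are kept verbatim
```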
## Project Memory

### The /memory Endpoints

Store and retrieve project-specific knowledge.

**Store:**

```
POST /memory
{
  "project": "my-project",
  "key": "architecture-decision-2024-01-15",
  "content": "We chose FastAPI over Flask for async support and automatic OpenAPI docs."
}
```

**Retrieve:**

```
GET /memory?project=my-project
```

**Update:**

```
PUT /memory/{id}
```

**Delete:**

```
DELETE /memory/{id}
```
### Usage Pattern

```python
# Store a decision after making it
await store_memory(
    project="/home/user/myapp",
    key="db-choice",
    content="Using PostgreSQL over MongoDB for relational data integrity"
)

# Retrieve past decisions at project start
resp = httpx.get("http://helm:8675/memory", params={"project": "/home/user/myapp"})
decisions = resp.json()["entries"]
# decisions = [{"id": "...", "key": "db-choice", "content": "...", ...}]
```
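The `store_memory` helper used above is left to the client. A sketch, assuming the payload shape from the Store example and the base URL used elsewhere in this guide:

```python
def build_memory_payload(project, key, content):
    """Payload shape taken from the POST /memory example above."""
    return {"project": project, "key": key, "content": content}

async def store_memory(project, key, content):
    """Sketch of the helper used in the Usage Pattern above."""
    import httpx  # imported lazily so the pure helper above has no dependency
    async with httpx.AsyncClient() as client:
        resp = await client.post("http://helm:8675/memory",
                                 json=build_memory_payload(project, key, content))
        resp.raise_for_status()
        return resp.json()
```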
**When to use memory:**

- Architecture decisions
- Configuration choices (API keys, service URLs)
- Learned preferences ("User likes code examples")
- Debugging notes ("Issue with CORS on port 8080")

**When NOT to use memory:**

- Temporary conversation state (use compression instead)
- Large codebases (store in skills/snippets instead)
- Public documentation (should be in skills)
## Session Workflow

### Starting a New Session

- **Determine the project identifier.** The AI should do this at the start of each session. Recommended approach: use the git remote origin URL as a stable identifier that follows you across machines.

  ```python
  # Detecting the git remote (the AI would use its shell tool)
  import subprocess

  try:
      project = subprocess.check_output(["git", "remote", "get-url", "origin"]).decode().strip()
  except (subprocess.CalledProcessError, FileNotFoundError):
      project = "fallback-identifier"  # or ask the user
  ```

  This ensures that if you work on the same repository from multiple machines (different file paths), the project context remains consistent. The same repo uses the same identifier everywhere.

  If the directory isn't a git repository, the AI should ask the user for a unique project identifier or fall back to a configured environment variable.

- **Load past memories** (optional but helpful):

  ```python
  memories = httpx.get("http://helm:8675/memory", params={"project": PROJECT}).json()["entries"]
  # Inject into the system prompt or build context from them
  ```

- **Begin the conversation loop.** For each user query:
  - Call `GET /context/rag?query=...&project=PROJECT`
  - Inject the context into the LLM prompt
  - Call the LLM
  - Store important outputs in memory if they represent decisions/learnings
  - Compress the conversation when it reaches ~10 turns (call `/compress`)
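The identifier-resolution order described above can be written as a small pure helper (the function name and environment-variable fallback are illustrative):

```python
import os

def resolve_project(git_url, env_var="PROJECT"):
    """Prefer the git remote URL; fall back to a configured environment
    variable; return None to signal 'ask the user'. Illustrative only."""
    if git_url:
        return git_url
    return os.getenv(env_var)  # None means: prompt the user for an identifier
```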
### Ending a Session

- Optionally store a session summary in memory:

  ```python
  await store_memory(PROJECT, "session-summary-2024-01-15", "Completed user auth flow, decided on JWT tokens")
  ```

- No cleanup needed: conversation state lives in your agent, not the server.
### Multi-Project Agents

If your agent works across multiple projects:

```python
# Switch project context mid-conversation
PROJECT = "git@github.com:company/project-a.git"  # stable identifier

# Each project has its own conventions and memories
context = await get_context(query, project=PROJECT)
```
## Managing Skills

Skills are your reusable knowledge base. Manage them via the API, MCP, or the seed script.

### Categories

Group skills by category (e.g., `homelab`, `dnd`, `python`, `devops`). Categories don't affect RAG retrieval but help with organization.
### Tags

Tags are keywords reserved for future search (not currently used by RAG, but planned for enhanced filtering).

```json
{
  "tags": ["docker", "compose", "infrastructure", "production"]
}
```
### Best Practices for Skills

- **Be specific**: "Docker Compose Production Patterns" > "Docker"
- **Include examples**: Show code snippets in the content
- **Keep it concise**: 1-3 paragraphs, focused on actionable guidance
- **Use markdown**: The API preserves formatting for injection into prompts
- **Version when updating**: If a skill changes significantly, create a new `id` (e.g., `docker-compose-v2`)
### Search Skills

```
GET /skills/search?q={query}
```

Returns matching skills by name/content similarity. Useful for manual exploration but not needed in automated agents (use `/context/rag` instead).
## Token Accounting

### Count Tokens

```
GET /tokens/count?text={text}
```

Returns the token count (using tiktoken for GPT models, approximations for others).
Use this to:
- Track compression savings
- Pre-flight check prompts before sending to LLM
- Budget token usage per session
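A client wrapper might fall back to a rough local estimate when the service is unreachable. A sketch; the `count` response field, the base URL, and the ~4-characters-per-token heuristic are all assumptions:

```python
def approx_tokens(text: str) -> int:
    """Rough client-side estimate (~4 characters per token); an
    assumption for offline use, not what the endpoint computes."""
    return max(1, len(text) // 4)

def count_tokens(text: str) -> int:
    """Call GET /tokens/count; fall back to the local estimate if the
    service is unreachable (the 'count' field name is assumed)."""
    import httpx  # imported lazily; the estimator above works without it
    try:
        resp = httpx.get("http://helm:8675/tokens/count",
                         params={"text": text}, timeout=5)
        resp.raise_for_status()
        return resp.json()["count"]
    except httpx.HTTPError:
        return approx_tokens(text)
```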
### Example: Measure RAG Savings

```python
full_context = load_all_skills()  # hypothetical: all your skills text
full_tokens = count_tokens(full_context)

rag_context = get_context(query, project)  # only relevant items
rag_tokens = count_tokens(format_context(rag_context))

savings_pct = (1 - rag_tokens / full_tokens) * 100
print(f"RAG saved {savings_pct:.1f}% tokens")
```
## Best Practices

### 1. Always Use Project Scoping

Set the `project` parameter consistently. Even if you have one main project, use a consistent identifier:

```python
PROJECT = "/home/user/myapp"  # NOT "default" or None
context = await get_context(query, project=PROJECT)
```
This allows:
- Project-specific conventions
- Memory isolation between projects
- Future per-project analytics
### 2. Call RAG Before Every LLM Request
Even if the query seems unrelated, the cost is negligible (<5ms, ~50 tokens). The knowledge injected often improves responses.
### 3. Compress Proactively

Don't wait until the context window is full. Compress at ~10 messages:

```python
if len(conversation) >= 10:
    compressed = await compress_messages(conversation)
    conversation = compressed["messages"]
```

This keeps the compression quality high (summaries are more accurate with fewer messages).
### 4. Store Learnings, Not Everything

Memory is for decisions and facts you want to recall.

**Don't store:**

- Every user query/response (that's what compression is for)
- Public documentation (put it in skills instead)
- Transient state (keep it in agent memory)
### 5. Version Your Skills

When a skill's guidance changes:

- **Minor update** (typo, clarification): update the existing skill's `content` in place
- **Major update** (different approach, breaking change): create a new `id` (e.g., `docker-compose-v2`) and optionally mark the old one as deprecated in its content
### 6. Use MCP in Claude Desktop

If you use Claude Desktop, add the MCP server (see `CLAUDE.md`). This gives you:
- Direct access to skills via Claude's tool calling
- No need to implement API calls manually
- Same token savings within Claude
### 7. Monitor Token Savings

Track metrics:

```python
from datetime import datetime

logs = []

def log_savings(tokens_before, tokens_after, operation):
    logs.append({
        "timestamp": datetime.now().isoformat(),
        "operation": operation,
        "tokens_before": tokens_before,
        "tokens_after": tokens_after,
        "savings": tokens_before - tokens_after
    })

# Periodically upload or analyze these logs
```
## Example Implementations

### Minimal Agent

```python
import asyncio, httpx, os

API_URL = os.getenv("API_URL", "http://helm:8675")
PROJECT = os.getenv("PROJECT", "/default")

async def get_context(query):
    async with httpx.AsyncClient() as client:
        resp = await client.get(f"{API_URL}/context/rag",
                                params={"query": query, "project": PROJECT})
        return resp.json()

async def chat():
    conv = []
    while True:
        query = input("You: ")
        if query == "quit":
            break
        # Get context
        ctx = await get_context(query)
        system = format_context(ctx)
        # Call LLM (pseudo)
        response = call_llm(system, conv[-4:], query)
        conv.extend([{"role": "user", "content": query},
                     {"role": "assistant", "content": response}])
        print(f"Assistant: {response}")

asyncio.run(chat())
```
### Discord Bot with Context

```python
import os

import discord
from discord.ext import commands
import httpx

intents = discord.Intents.default()
intents.message_content = True  # required to read message text in discord.py 2.x
bot = commands.Bot(command_prefix="!", intents=intents)

API_URL = "http://helm:8675"
PROJECT = "/home/user/discord-bot"

@bot.event
async def on_message(message):
    if message.author == bot.user:
        return

    # RAG context
    async with httpx.AsyncClient() as client:
        resp = await client.get(f"{API_URL}/context/rag",
                                params={"query": message.content, "project": PROJECT})
        ctx = resp.json()

    # Build prompt
    system_prompt = format_context(ctx) + "\n\nYou are a helpful Discord bot."

    # Respond (using your LLM of choice)
    response = await generate_response(message.content, system_prompt)
    await message.reply(response)

    # Store in memory if it's a decision
    if "decision" in message.content.lower():
        async with httpx.AsyncClient() as client:
            await client.post(f"{API_URL}/memory", json={
                "project": PROJECT,
                "key": f"decision-{discord.utils.utcnow().timestamp()}",
                "content": response[:500]
            })

bot.run(os.getenv("DISCORD_TOKEN"))
```
## Need More Help?

- **Setup issues**: see `SETUP.md`
- **Template repo**: clone `git.bouncypixel.com:helm/agentic-templates.git`
- **API reference**: visit `http://helm:8675/docs` when the service is running
- **MCP tools**: see `CLAUDE.md` for Claude Desktop integration