# Usage Guide: AI Skills API
This guide explains how to use the AI Skills API effectively in your projects and AI agent sessions.
## Table of Contents
- Understanding the Integration Pattern
- RAG Context Retrieval
- Conversation Compression
- Project Memory
- Session Workflow
- Managing Skills
- Token Accounting
- Best Practices
- Example Implementations
## Understanding the Integration Pattern

The API provides three core capabilities that work together:

- **RAG (Retrieval-Augmented Generation)**: Before each LLM call, fetch relevant skills, conventions, and snippets based on your query. This injects relevant context without sending your entire knowledge base every time.
- **Compression**: When conversation history grows long (>10 turns), compress old messages into summaries to stay within context windows.
- **Memory**: Store decisions, configurations, and learnings per project for future reference.

**Expected savings:** 60-80% token reduction vs. sending everything.
## RAG Context Retrieval

### The /context/rag Endpoint

This is your primary integration point. It returns only the most relevant items from your knowledge base.

**Request:**

```
GET /context/rag?query={query}&project={project}
```
**Response:**

```json
{
  "skills": [
    {
      "id": "homelab-docker-compose",
      "name": "Docker Compose Standard",
      "category": "homelab",
      "content": "Always use docker-compose v3.8+. Include health checks, restart policies, and resource limits.",
      "relevance_score": 0.89
    }
  ],
  "conventions": [
    {
      "id": "conv-123",
      "name": "React Project Standards",
      "project": "/home/user/my-react-app",
      "content": "Use TypeScript, React 18+, and functional components with hooks.",
      "relevance_score": 0.76
    }
  ],
  "snippets": [
    {
      "id": "snippet-456",
      "name": "FastAPI CORS setup",
      "language": "python",
      "content": "app.add_middleware(CORSMiddleware, allow_origins=[\"*\"], ...)",
      "relevance_score": 0.82
    }
  ]
}
```
### How It Works

- Skills are globally available (your general knowledge base)
- Conventions are scoped to a project path or identifier (e.g., `/home/user/project1`)
- Snippets are globally available code examples
- Relevance scores are cosine similarity (0-1); items below 0.3 are typically filtered out
- Limits are configurable (default: 3 skills, 2 conventions, 2 snippets)
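For intuition, the threshold-and-limit behavior can be mirrored client-side. A minimal sketch, purely illustrative since the server applies this filtering for you:

```python
def filter_by_relevance(items, threshold=0.3, limit=3):
    """Keep the highest-scoring items above the cutoff, mirroring the
    server-side defaults described above (illustrative only)."""
    kept = [i for i in items if i["relevance_score"] >= threshold]
    kept.sort(key=lambda i: i["relevance_score"], reverse=True)
    return kept[:limit]

skills = [
    {"id": "a", "relevance_score": 0.89},
    {"id": "b", "relevance_score": 0.25},  # below 0.3, so it is dropped
    {"id": "c", "relevance_score": 0.41},
]
top = filter_by_relevance(skills, threshold=0.3, limit=3)
# top keeps "a" and "c", highest score first
```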
### Usage Pattern

```python
async def query_with_context(query: str, project: str = None):
    # 1. Fetch context
    context = await get_context(query, project)

    # 2. Build system prompt
    system_prompt = format_context(context)
    # system_prompt now contains:
    # ## Relevant Skills
    # ### Docker Compose Standard (relevance: 0.89)
    # Always use docker-compose v3.8+...
    # ...

    # 3. Inject into LLM call
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": query}
    ]
    response = await llm.chat(messages)
    return response
```
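`format_context` is left to the client. A minimal sketch that renders a `/context/rag` response into the markdown layout shown in the comments above (the section titles are assumptions, not API output):

```python
def format_context(context: dict) -> str:
    """Render a /context/rag response as a markdown system prompt.
    Sketch only; adjust section titles and fields to taste."""
    sections = [("Relevant Skills", "skills"),
                ("Project Conventions", "conventions"),
                ("Code Snippets", "snippets")]
    parts = []
    for title, key in sections:
        items = context.get(key, [])
        if not items:
            continue  # skip empty sections entirely
        parts.append(f"## {title}")
        for item in items:
            parts.append(f"### {item['name']} (relevance: {item['relevance_score']:.2f})")
            parts.append(item["content"])
    return "\n".join(parts)
```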
## Conversation Compression

### The /compress Endpoint

Compresses a list of conversation messages into a shorter representation.
**Request:**

```json
{
  "messages": [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi! How can I help?"},
    {"role": "user", "content": "I need to set up Docker Compose."},
    {"role": "assistant", "content": "Sure! Docker Compose uses a YAML file..."},
    ... (up to 20+ messages)
  ]
}
```
**Response:**

```json
{
  "messages": [
    {"role": "system", "content": "Summary of earlier conversation..."},
    {"role": "user", "content": "I need to set up Docker Compose."},
    {"role": "assistant", "content": "Sure! Docker Compose uses a YAML file..."}
  ],
  "tokens_saved": 245
}
```
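A thin client helper for this endpoint might look like the following sketch; the base URL matches the other examples in this guide, and `should_compress` mirrors the ~10-turn threshold used throughout:

```python
def should_compress(conversation, threshold=10):
    """Mirror the ~10-turn compression threshold used in this guide."""
    return len(conversation) >= threshold

async def compress_messages(messages):
    """POST the history to /compress and return the body shown above
    ({"messages": [...], "tokens_saved": N}). Sketch only."""
    import httpx  # imported lazily so should_compress has no dependency
    async with httpx.AsyncClient() as client:
        resp = await client.post("http://helm:8675/compress",
                                 json={"messages": messages})
        resp.raise_for_status()
        return resp.json()
```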
### Compression Strategies

- **Extractive** (default): Uses LSA summarization to select key sentences. Fast (~100-500ms), no model required.
- **Ollama**: Uses `phi3:mini` for abstractive summaries. Better quality but slower (~2s). Requires Ollama running.

Configure in `config.yaml`:

```yaml
compression:
  enabled: true
  strategy: "extractive"  # or "ollama"
```
### Usage Pattern

```python
conversation = []

async def chat(query):
    # Add user message
    conversation.append({"role": "user", "content": query})

    # Call LLM (with context from RAG)
    response = await llm.chat(conversation)
    conversation.append({"role": "assistant", "content": response})

    # Compress when conversation gets long
    if len(conversation) >= 10:
        compressed = await compress_messages(conversation)
        # replace contents in place (assignment would rebind a local name)
        conversation[:] = compressed["messages"]
        print(f"Saved {compressed['tokens_saved']} tokens")
    return response
```
Important: Keep the most recent ~4-6 turns uncompressed. The compression endpoint preserves recent messages and compresses only the older ones.
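To illustrate the split the endpoint performs, here is a model of the behavior (the server's actual boundary logic may differ):

```python
def split_for_compression(messages, keep_recent=4):
    """Model of the server-side split: older messages become summary
    input, the most recent `keep_recent` pass through untouched."""
    if len(messages) <= keep_recent:
        return [], list(messages)  # nothing old enough to compress
    return messages[:-keep_recent], messages[-keep_recent:]

older, recent = split_for_compression([{"n": i} for i in range(10)], keep_recent=4)
# 6 older messages would be summarized; the last 4 are kept verbatim
```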
## Project Memory

### The /memory Endpoints

Store and retrieve project-specific knowledge.

**Store:**

```
POST /memory
{
  "project": "my-project",
  "key": "architecture-decision-2024-01-15",
  "content": "We chose FastAPI over Flask for async support and automatic OpenAPI docs."
}
```

**Retrieve:**

```
GET /memory?project=my-project
```

**Update:**

```
PUT /memory/{id}
```

**Delete:**

```
DELETE /memory/{id}
```
### Usage Pattern

```python
# Store a decision after making it
await store_memory(
    project="/home/user/myapp",
    key="db-choice",
    content="Using PostgreSQL over MongoDB for relational data integrity"
)

# Retrieve past decisions at project start
resp = httpx.get("http://helm:8675/memory", params={"project": "/home/user/myapp"})
decisions = resp.json()["entries"]
# decisions = [{"id": "...", "key": "db-choice", "content": "...", ...}]
```
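The `store_memory` helper used above is left to the client. A sketch, assuming the payload shape from the Store example and the base URL used elsewhere in this guide:

```python
def build_memory_payload(project, key, content):
    """Payload shape taken from the POST /memory example above."""
    return {"project": project, "key": key, "content": content}

async def store_memory(project, key, content):
    """Sketch of the helper used in the Usage Pattern above."""
    import httpx  # imported lazily so the pure helper above has no dependency
    async with httpx.AsyncClient() as client:
        resp = await client.post("http://helm:8675/memory",
                                 json=build_memory_payload(project, key, content))
        resp.raise_for_status()
        return resp.json()
```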
**When to use memory:**

- Architecture decisions
- Configuration choices (API keys, service URLs)
- Learned preferences ("User likes code examples")
- Debugging notes ("Issue with CORS on port 8080")

**When NOT to use memory:**

- Temporary conversation state (use compression instead)
- Large codebases (store in skills/snippets instead)
- Public documentation (should be in skills)
## Session Workflow

### Starting a New Session

- **Determine the project identifier.** The AI should do this at the start of each session. Recommended approach: use the git remote origin URL as a stable identifier that follows you across machines.

  ```python
  # Detecting the git remote (the AI would use its shell tool)
  import subprocess

  try:
      project = subprocess.check_output(["git", "remote", "get-url", "origin"]).decode().strip()
  except (subprocess.CalledProcessError, FileNotFoundError):
      project = "fallback-identifier"  # or ask the user
  ```

  This ensures that if you work on the same repository from multiple machines (different file paths), the project context remains consistent. The same repo uses the same identifier everywhere.

  If the directory isn't a git repository, the AI should ask the user for a unique project identifier or fall back to a configured environment variable.

- **Load past memories** (optional but helpful):

  ```python
  memories = httpx.get("http://helm:8675/memory", params={"project": PROJECT}).json()["entries"]
  # Inject into the system prompt or build context from them
  ```

- **Begin the conversation loop.** For each user query:
  - Call `GET /context/rag?query=...&project=PROJECT`
  - Inject the context into the LLM prompt
  - Call the LLM
  - Store important outputs in memory if they represent decisions/learnings
  - Compress the conversation when it reaches ~10 turns (call `/compress`)
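The identifier-resolution order described above can be written as a small pure helper (the function name and environment-variable fallback are illustrative):

```python
import os

def resolve_project(git_url, env_var="PROJECT"):
    """Prefer the git remote URL; fall back to a configured environment
    variable; return None to signal 'ask the user'. Illustrative only."""
    if git_url:
        return git_url
    return os.getenv(env_var)  # None means: prompt the user for an identifier
```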
### Ending a Session

- Optionally store a session summary in memory:

  ```python
  await store_memory(PROJECT, "session-summary-2024-01-15", "Completed user auth flow, decided on JWT tokens")
  ```

- No cleanup needed: conversation state lives in your agent, not the server.
### Multi-Project Agents

If your agent works across multiple projects:

```python
# Switch project context mid-conversation
PROJECT = "git@github.com:company/project-a.git"  # stable identifier

# Each project has its own conventions and memories
context = await get_context(query, project=PROJECT)
```
## Managing Skills

Skills are your reusable knowledge base. Manage them via the API, MCP, or the seed script.

### Categories

Group skills by category (e.g., `homelab`, `dnd`, `python`, `devops`). Categories don't affect RAG retrieval but help with organization.
### Tags

Tags are keywords reserved for future search (not currently used by RAG, but planned for enhanced filtering).

```json
{
  "tags": ["docker", "compose", "infrastructure", "production"]
}
```
### Best Practices for Skills

- **Be specific**: "Docker Compose Production Patterns" > "Docker"
- **Include examples**: Show code snippets in the content
- **Keep it concise**: 1-3 paragraphs, focused on actionable guidance
- **Use markdown**: The API preserves formatting for injection into prompts
- **Version when updating**: If a skill changes significantly, create a new `id` (e.g., `docker-compose-v2`)
### Search Skills

```
GET /skills/search?q={query}
```

Returns matching skills by name/content similarity. Useful for manual exploration but not needed in automated agents (use `/context/rag` instead).
## Token Accounting

### Count Tokens

```
GET /tokens/count?text={text}
```

Returns the token count (using tiktoken for GPT models, approximations for others).
Use this to:
- Track compression savings
- Pre-flight check prompts before sending to LLM
- Budget token usage per session
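A client wrapper might fall back to a rough local estimate when the service is unreachable. A sketch; the `count` response field, the base URL, and the ~4-characters-per-token heuristic are all assumptions:

```python
def approx_tokens(text: str) -> int:
    """Rough client-side estimate (~4 characters per token); an
    assumption for offline use, not what the endpoint computes."""
    return max(1, len(text) // 4)

def count_tokens(text: str) -> int:
    """Call GET /tokens/count; fall back to the local estimate if the
    service is unreachable (the 'count' field name is assumed)."""
    import httpx  # imported lazily; the estimator above works without it
    try:
        resp = httpx.get("http://helm:8675/tokens/count",
                         params={"text": text}, timeout=5)
        resp.raise_for_status()
        return resp.json()["count"]
    except httpx.HTTPError:
        return approx_tokens(text)
```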
### Example: Measure RAG Savings

```python
full_context = load_all_skills()  # hypothetical: all your skills text
full_tokens = count_tokens(full_context)

rag_context = get_context(query, project)  # only relevant items
rag_tokens = count_tokens(format_context(rag_context))

savings_pct = (1 - rag_tokens / full_tokens) * 100
print(f"RAG saved {savings_pct:.1f}% tokens")
```
## Best Practices

### 1. Always Use Project Scoping

Set the `project` parameter consistently. Even if you have one main project, use a consistent identifier:

```python
PROJECT = "/home/user/myapp"  # NOT "default" or None
context = await get_context(query, project=PROJECT)
```
This allows:
- Project-specific conventions
- Memory isolation between projects
- Future per-project analytics
### 2. Call RAG Before Every LLM Request
Even if the query seems unrelated, the cost is negligible (<5ms, ~50 tokens). The knowledge injected often improves responses.
### 3. Compress Proactively

Don't wait until the context window is full. Compress at ~10 messages:

```python
if len(conversation) >= 10:
    compressed = await compress_messages(conversation)
    conversation = compressed["messages"]
```

This keeps the compression quality high (summaries are more accurate with fewer messages).
### 4. Store Learnings, Not Everything

Memory is for decisions and facts you want to recall.

**Don't store:**

- Every user query/response (that's what compression is for)
- Public documentation (put it in skills instead)
- Transient state (keep it in agent memory)
### 5. Version Your Skills

When a skill's guidance changes:

- **Minor update** (typo, clarification): update the existing skill's `content` in place
- **Major update** (different approach, breaking change): create a new `id` (e.g., `docker-compose-v2`) and optionally mark the old one as deprecated in its content
### 6. Use MCP in Claude Desktop

If you use Claude Desktop, add the MCP server (see `CLAUDE.md`). This gives you:
- Direct access to skills via Claude's tool calling
- No need to implement API calls manually
- Same token savings within Claude
### 7. Monitor Token Savings

Track metrics:

```python
from datetime import datetime

logs = []

def log_savings(tokens_before, tokens_after, operation):
    logs.append({
        "timestamp": datetime.now().isoformat(),
        "operation": operation,
        "tokens_before": tokens_before,
        "tokens_after": tokens_after,
        "savings": tokens_before - tokens_after
    })

# Periodically upload or analyze these logs
```
## Example Implementations

### Minimal Agent

```python
import asyncio, httpx, os

API_URL = os.getenv("API_URL", "http://helm:8675")
PROJECT = os.getenv("PROJECT", "/default")

async def get_context(query):
    async with httpx.AsyncClient() as client:
        resp = await client.get(f"{API_URL}/context/rag",
                                params={"query": query, "project": PROJECT})
        return resp.json()

async def chat():
    conv = []
    while True:
        query = input("You: ")
        if query == "quit":
            break
        # Get context
        ctx = await get_context(query)
        system = format_context(ctx)
        # Call LLM (pseudo)
        response = call_llm(system, conv[-4:], query)
        conv.extend([{"role": "user", "content": query},
                     {"role": "assistant", "content": response}])
        print(f"Assistant: {response}")

asyncio.run(chat())
```
### Discord Bot with Context

```python
import os

import discord
from discord.ext import commands
import httpx

intents = discord.Intents.default()
intents.message_content = True  # required to read message text in discord.py 2.x
bot = commands.Bot(command_prefix="!", intents=intents)

API_URL = "http://helm:8675"
PROJECT = "/home/user/discord-bot"

@bot.event
async def on_message(message):
    if message.author == bot.user:
        return

    # RAG context
    async with httpx.AsyncClient() as client:
        resp = await client.get(f"{API_URL}/context/rag",
                                params={"query": message.content, "project": PROJECT})
        ctx = resp.json()

    # Build prompt
    system_prompt = format_context(ctx) + "\n\nYou are a helpful Discord bot."

    # Respond (using your LLM of choice)
    response = await generate_response(message.content, system_prompt)
    await message.reply(response)

    # Store in memory if it's a decision
    if "decision" in message.content.lower():
        async with httpx.AsyncClient() as client:
            await client.post(f"{API_URL}/memory", json={
                "project": PROJECT,
                "key": f"decision-{discord.utils.utcnow().timestamp()}",
                "content": response[:500]
            })

bot.run(os.getenv("DISCORD_TOKEN"))
```
## Need More Help?

- **Setup issues**: see `SETUP.md`
- **Template repo**: clone `git.bouncypixel.com:helm/agentic-templates.git`
- **API reference**: visit `http://helm:8675/docs` when the service is running
- **MCP tools**: see `CLAUDE.md` for Claude Desktop integration