# Usage Guide: AI Skills API
This guide explains how to use the AI Skills API effectively in your projects and AI agent sessions.
## Table of Contents
1. [Understanding the Integration Pattern](#understanding-the-integration-pattern)
2. [RAG Context Retrieval](#rag-context-retrieval)
3. [Conversation Compression](#conversation-compression)
4. [Project Memory](#project-memory)
5. [Session Workflow](#session-workflow)
6. [Managing Skills](#managing-skills)
7. [Token Accounting](#token-accounting)
8. [Best Practices](#best-practices)
9. [Example Implementations](#example-implementations)
---
## Understanding the Integration Pattern
The API provides three core capabilities that work together:
1. **RAG (Retrieval-Augmented Generation)**: Before each LLM call, fetch relevant skills, conventions, and snippets based on your query. This injects relevant context without sending your entire knowledge base every time.
2. **Compression**: When conversation history grows long (>10 turns), compress old messages into summaries to stay within context windows.
3. **Memory**: Store decisions, configurations, and learnings per project for future reference.
**Expected savings**: 60-80% token reduction vs. sending everything.
---
## RAG Context Retrieval
### The `/context/rag` Endpoint
This is your primary integration point. It returns only the most relevant items from your knowledge base.
**Request:**
```
GET /context/rag?query={query}&project={project}
```
**Response:**
```json
{
  "skills": [
    {
      "id": "homelab-docker-compose",
      "name": "Docker Compose Standard",
      "category": "homelab",
      "content": "Always use docker-compose v3.8+. Include health checks, restart policies, and resource limits.",
      "relevance_score": 0.89
    }
  ],
  "conventions": [
    {
      "id": "conv-123",
      "name": "React Project Standards",
      "project": "/home/user/my-react-app",
      "content": "Use TypeScript, React 18+, and functional components with hooks.",
      "relevance_score": 0.76
    }
  ],
  "snippets": [
    {
      "id": "snippet-456",
      "name": "FastAPI CORS setup",
      "language": "python",
      "content": "app.add_middleware(CORSMiddleware, allow_origins=[\"*\"], ...)",
      "relevance_score": 0.82
    }
  ]
}
```
### How It Works
- Skills are globally available (your general knowledge base)
- Conventions are scoped to a project path or identifier (e.g., `/home/user/project1`)
- Snippets are globally available code examples
- Relevance scores are cosine similarity (0-1) - items below 0.3 are typically filtered out
- Limits are configurable (default: 3 skills, 2 conventions, 2 snippets)
### Usage Pattern
```python
async def query_with_context(query: str, project: str = None):
    # 1. Fetch context
    context = await get_context(query, project)

    # 2. Build system prompt
    system_prompt = format_context(context)
    # system_prompt now contains:
    #   ## Relevant Skills
    #   ### Docker Compose Standard (relevance: 0.89)
    #   Always use docker-compose v3.8+...
    #   ...

    # 3. Inject into LLM call
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": query},
    ]
    response = await llm.chat(messages)
    return response
```
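The `format_context` helper above is left to you. Here is a minimal sketch that renders the `/context/rag` response (using the field names from the example response above) into a markdown system prompt; the section titles are illustrative:
```python
def format_context(context: dict) -> str:
    """Render a /context/rag response as a markdown system prompt."""
    sections = []
    for title, key in [("Relevant Skills", "skills"),
                       ("Project Conventions", "conventions"),
                       ("Code Snippets", "snippets")]:
        items = context.get(key, [])
        if not items:
            continue
        lines = [f"## {title}"]
        for item in items:
            lines.append(f"### {item['name']} (relevance: {item['relevance_score']:.2f})")
            lines.append(item["content"])
        sections.append("\n".join(lines))
    return "\n\n".join(sections)
```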
---
## Conversation Compression
### The `/compress` Endpoint
Compresses a list of conversation messages into a shorter representation.
**Request:**
```json
{
  "messages": [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi! How can I help?"},
    {"role": "user", "content": "I need to set up Docker Compose."},
    {"role": "assistant", "content": "Sure! Docker Compose uses a YAML file..."},
    ... (up to 20+ messages)
  ]
}
```
**Response:**
```json
{
  "messages": [
    {"role": "system", "content": "Summary of earlier conversation..."},
    {"role": "user", "content": "I need to set up Docker Compose."},
    {"role": "assistant", "content": "Sure! Docker Compose uses a YAML file..."}
  ],
  "tokens_saved": 245
}
```
### Compression Strategies
- **Extractive** (default): Uses LSA summarization to select key sentences. Fast (~100-500ms), no model required.
- **Ollama**: Uses `phi3:mini` for abstractive summaries. Better quality but slower (~2s). Requires Ollama running.
**Configure in `config.yaml`:**
```yaml
compression:
  enabled: true
  strategy: "extractive"  # or "ollama"
```
### Usage Pattern
```python
conversation = []

async def chat(query):
    global conversation  # reassigned below after compression

    # Add user message
    conversation.append({"role": "user", "content": query})

    # Call LLM (with context from RAG)
    response = await llm.chat(conversation)
    conversation.append({"role": "assistant", "content": response})

    # Compress when conversation gets long
    if len(conversation) >= 10:
        compressed = await compress_messages(conversation)
        conversation = compressed["messages"]
        print(f"Saved {compressed['tokens_saved']} tokens")

    return response
```
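The `compress_messages` helper is a thin wrapper around the endpoint. A minimal sketch, assuming `/compress` accepts `POST` with the request/response bodies shown above:
```python
import httpx

API_URL = "http://helm:8675"

async def compress_messages(messages: list[dict]) -> dict:
    """Send the conversation to /compress and return the compressed result."""
    async with httpx.AsyncClient() as client:
        resp = await client.post(f"{API_URL}/compress", json={"messages": messages})
        resp.raise_for_status()
        return resp.json()  # {"messages": [...], "tokens_saved": ...}
```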
**Important**: Keep the most recent ~4-6 turns uncompressed. The compression endpoint preserves recent messages and compresses only the older ones.
---
## Project Memory
### The `/memory` Endpoints
Store and retrieve project-specific knowledge.
**Store:**
```
POST /memory
{
  "project": "my-project",
  "key": "architecture-decision-2024-01-15",
  "content": "We chose FastAPI over Flask for async support and automatic OpenAPI docs."
}
```
**Retrieve:**
```
GET /memory?project=my-project
```
**Update:**
```
PUT /memory/{id}
```
**Delete:**
```
DELETE /memory/{id}
```
### Usage Pattern
```python
# Store a decision after making it
await store_memory(
    project="/home/user/myapp",
    key="db-choice",
    content="Using PostgreSQL over MongoDB for relational data integrity"
)

# Retrieve past decisions at project start
resp = httpx.get("http://helm:8675/memory", params={"project": "/home/user/myapp"})
decisions = resp.json()["entries"]
# decisions = [{"id": "...", "key": "db-choice", "content": "...", ...}]
```
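A minimal `store_memory` helper matching the `POST /memory` request shape above:
```python
import httpx

API_URL = "http://helm:8675"

async def store_memory(project: str, key: str, content: str) -> dict:
    """Persist a project memory entry via POST /memory."""
    async with httpx.AsyncClient() as client:
        resp = await client.post(f"{API_URL}/memory", json={
            "project": project,
            "key": key,
            "content": content,
        })
        resp.raise_for_status()
        return resp.json()
```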
**When to use memory:**
- Architecture decisions
- Configuration choices (API keys, service URLs)
- Learned preferences ("User likes code examples")
- Debugging notes ("Issue with CORS on port 8080")
**When NOT to use memory:**
- Temporary conversation state (use compression instead)
- Large codebases (store in skills/snippets instead)
- Public documentation (should be in skills)
---
## Session Workflow
### Starting a New Session
1. **Define your project identifier** - a path or unique string:
```python
PROJECT = "/home/user/myapp" # or "my-discord-bot", "workspace-123"
```
2. **Load past memories** (optional but helpful):
```python
memories = httpx.get("http://helm:8675/memory", params={"project": PROJECT}).json()["entries"]
# Inject into system prompt or create context from them
```
3. **Begin conversation loop** - for each user query:
- Call `GET /context/rag?query=...&project=PROJECT`
- Inject context into LLM prompt
- Call LLM
- Store important outputs in memory if they represent decisions/learnings
- Compress conversation when it reaches ~10 turns
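A minimal sketch of that per-query loop, reusing the helpers from earlier sections (`get_context`, `format_context`, `compress_messages`) and a placeholder `llm.chat` client:
```python
conversation = []

async def handle_turn(query: str) -> str:
    global conversation

    # 1. Fetch relevant skills/conventions/snippets for this query
    ctx = await get_context(query, project=PROJECT)
    system_prompt = format_context(ctx)

    # 2. Call the LLM with the injected context plus conversation history
    conversation.append({"role": "user", "content": query})
    messages = [{"role": "system", "content": system_prompt}] + conversation
    response = await llm.chat(messages)
    conversation.append({"role": "assistant", "content": response})

    # 3. Compress once the history reaches ~10 turns
    if len(conversation) >= 10:
        compressed = await compress_messages(conversation)
        conversation = compressed["messages"]

    return response
```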
### Ending a Session
- Optionally store a session summary in memory:
```python
await store_memory(PROJECT, "session-summary-2024-01-15", "Completed user auth flow, decided on JWT tokens")
```
- No cleanup needed - conversation state lives in your agent, not the server
### Multi-Project Agents
If your agent works across multiple projects:
```python
# Switch project context mid-conversation
PROJECT = "/home/user/project1" # current active project
# Each project has its own conventions and memories
context = await get_context(query, project=PROJECT)
```
---
## Managing Skills
Skills are your reusable knowledge base. Manage them via API, MCP, or the seed script.
### Categories
Group skills by category (e.g., `homelab`, `dnd`, `python`, `devops`). Categories don't affect RAG retrieval but help with organization.
### Tags
Tags are keywords used for **future search** (not currently used by RAG, but planned for enhanced filtering).
```json
{
  "tags": ["docker", "compose", "infrastructure", "production"]
}
```
### Best Practices for Skills
- **Be specific**: "Docker Compose Production Patterns" > "Docker"
- **Include examples**: Show code snippets in the content
- **Keep it concise**: 1-3 paragraphs, focus on actionable guidance
- **Use markdown**: The API preserves formatting for injection into prompts
- **Version when updating**: If a skill changes significantly, create a new `id` (e.g., `docker-compose-v2`)
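For reference, creating a skill over the API might look like the sketch below. It assumes a `POST /skills` endpoint that accepts the same fields returned by `/context/rag` (`id`, `name`, `category`, `content`) plus `tags`; check the live `/docs` page for the exact schema:
```python
import httpx

API_URL = "http://helm:8675"

# Assumed request shape; verify against the OpenAPI docs at /docs
skill = {
    "id": "docker-compose-production",
    "name": "Docker Compose Production Patterns",
    "category": "homelab",
    "tags": ["docker", "compose", "production"],
    "content": "Always use docker-compose v3.8+. Include health checks, restart policies, and resource limits.",
}

resp = httpx.post(f"{API_URL}/skills", json=skill)
resp.raise_for_status()
```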
### Search Skills
```
GET /skills/search?q={query}
```
Returns matching skills by name/content similarity. Useful for manual exploration but not needed in automated agents (use `/context/rag` instead).
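For example, a quick manual lookup from a script or REPL:
```python
import httpx

resp = httpx.get("http://helm:8675/skills/search", params={"q": "docker compose"})
print(resp.json())  # matching skills with name/content and similarity info
```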
---
## Token Accounting
### Count Tokens
```
GET /tokens/count?text={text}
```
Returns the token count (using tiktoken for GPT models, approximations for others).
**Use this to:**
- Track compression savings
- Pre-flight check prompts before sending to LLM
- Budget token usage per session
### Example: Measure RAG Savings
```python
full_context = load_all_skills() # hypothetical: all your skills text
full_tokens = count_tokens(full_context)
rag_context = get_context(query, project) # only relevant items
rag_tokens = count_tokens(format_context(rag_context))
savings_pct = (1 - rag_tokens / full_tokens) * 100
print(f"RAG saved {savings_pct:.1f}% tokens")
```
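The `count_tokens` helper above can wrap the endpoint directly; the response field name is an assumption here, so confirm it against `/docs`:
```python
import httpx

API_URL = "http://helm:8675"

def count_tokens(text: str) -> int:
    """Count tokens via GET /tokens/count."""
    resp = httpx.get(f"{API_URL}/tokens/count", params={"text": text})
    return resp.json()["count"]  # assumed field name; check /docs
```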
---
## Best Practices
### 1. Always Use Project Scoping
Set `project` parameter consistently. Even if you have one main project, use a consistent identifier:
```python
PROJECT = "/home/user/myapp" # NOT "default" or None
context = await get_context(query, project=PROJECT)
```
This allows:
- Project-specific conventions
- Memory isolation between projects
- Future per-project analytics
### 2. Call RAG Before Every LLM Request
Even if the query seems unrelated, the cost is negligible (<5ms, ~50 tokens). The knowledge injected often improves responses.
### 3. Compress Proactively
Don't wait until context window is full. Compress at ~10 messages:
```python
if len(conversation) >= 10:
    compressed = await compress_messages(conversation)
    conversation = compressed["messages"]
```
This keeps the compression quality high (summaries are more accurate with fewer messages).
### 4. Store Learnings, Not Everything
Memory is for **decisions** and **facts you want to recall**.
Don't store:
- Every user query/response (that's what compression is for)
- Public documentation (put in skills instead)
- Transient state (keep in agent memory)
### 5. Version Your Skills
When a skill's guidance changes:
- **Minor update** (typo, clarification): update the existing skill's `content` in place
- **Major update** (different approach, breaking change): create a new `id` (e.g., `docker-compose-v2`) and optionally mark the old one as deprecated in its content
### 6. Use MCP in Claude Desktop
If you use Claude Desktop, add the MCP server (see `CLAUDE.md`). This gives you:
- Direct access to skills via Claude's tool calling
- No need to implement API calls manually
- Same token savings within Claude
### 7. Monitor Token Savings
Track metrics:
```python
from datetime import datetime

logs = []

def log_savings(tokens_before, tokens_after, operation):
    logs.append({
        "timestamp": datetime.now().isoformat(),
        "operation": operation,
        "tokens_before": tokens_before,
        "tokens_after": tokens_after,
        "savings": tokens_before - tokens_after,
    })

# Periodically upload or analyze these logs
```
---
## Example Implementations
### Minimal Agent
```python
import asyncio, httpx, os

API_URL = os.getenv("API_URL", "http://helm:8675")
PROJECT = os.getenv("PROJECT", "/default")

async def get_context(query):
    async with httpx.AsyncClient() as client:
        resp = await client.get(f"{API_URL}/context/rag",
                                params={"query": query, "project": PROJECT})
        return resp.json()

async def chat():
    conv = []
    while True:
        query = input("You: ")
        if query == "quit":
            break

        # Get context
        ctx = await get_context(query)
        system = format_context(ctx)

        # Call LLM (pseudo)
        response = call_llm(system, conv[-4:], query)

        conv.extend([{"role": "user", "content": query},
                     {"role": "assistant", "content": response}])
        print(f"Assistant: {response}")

asyncio.run(chat())
```
### Discord Bot with Context
```python
import os

import discord
import httpx
from discord.ext import commands

API_URL = "http://helm:8675"
PROJECT = "/home/user/discord-bot"

# discord.py 2.x requires explicit intents; reading message content needs its own intent
intents = discord.Intents.default()
intents.message_content = True
bot = commands.Bot(command_prefix="!", intents=intents)

@bot.event
async def on_message(message):
    if message.author == bot.user:
        return

    # RAG context
    async with httpx.AsyncClient() as client:
        resp = await client.get(f"{API_URL}/context/rag",
                                params={"query": message.content, "project": PROJECT})
        ctx = resp.json()

    # Build prompt
    system_prompt = format_context(ctx) + "\n\nYou are a helpful Discord bot."

    # Respond (using your LLM of choice)
    response = await generate_response(message.content, system_prompt)
    await message.reply(response)

    # Store in memory if it's a decision
    if "decision" in message.content.lower():
        async with httpx.AsyncClient() as client:
            await client.post(f"{API_URL}/memory", json={
                "project": PROJECT,
                "key": f"decision-{discord.utils.utcnow().timestamp()}",
                "content": response[:500],
            })

bot.run(os.getenv("DISCORD_TOKEN"))
```
---
## Need More Help?
- **Setup issues**: See `SETUP.md`
- **Template repo**: Clone `git.bouncypixel.com:helm/agentic-templates.git`
- **API reference**: Visit `http://helm:8675/docs` when the service is running
- **MCP tools**: See `CLAUDE.md` for Claude Desktop integration