# Usage Guide: AI Skills API

This guide explains how to use the AI Skills API effectively in your projects and AI agent sessions.

## Table of Contents

1. [Understanding the Integration Pattern](#understanding-the-integration-pattern)
2. [RAG Context Retrieval](#rag-context-retrieval)
3. [Conversation Compression](#conversation-compression)
4. [Project Memory](#project-memory)
5. [Session Workflow](#session-workflow)
6. [Managing Skills](#managing-skills)
7. [Token Accounting](#token-accounting)
8. [Best Practices](#best-practices)
9. [Example Implementations](#example-implementations)

---

## Understanding the Integration Pattern

The API provides three core capabilities that work together:

1. **RAG (Retrieval-Augmented Generation)**: Before each LLM call, fetch relevant skills, conventions, and snippets based on your query. This injects relevant context without sending your entire knowledge base every time.
2. **Compression**: When conversation history grows long (>10 turns), compress old messages into summaries to stay within context windows.
3. **Memory**: Store decisions, configurations, and learnings per project for future reference.

**Expected savings**: 60-80% token reduction vs. sending everything.

---
## RAG Context Retrieval

### The `/context/rag` Endpoint

This is your primary integration point. It returns only the most relevant items from your knowledge base.

**Request:**

```
GET /context/rag?query={query}&project={project}
```

**Response:**

```json
{
  "skills": [
    {
      "id": "homelab-docker-compose",
      "name": "Docker Compose Standard",
      "category": "homelab",
      "content": "Always use docker-compose v3.8+. Include health checks, restart policies, and resource limits.",
      "relevance_score": 0.89
    }
  ],
  "conventions": [
    {
      "id": "conv-123",
      "name": "React Project Standards",
      "project": "/home/user/my-react-app",
      "content": "Use TypeScript, React 18+, and functional components with hooks.",
      "relevance_score": 0.76
    }
  ],
  "snippets": [
    {
      "id": "snippet-456",
      "name": "FastAPI CORS setup",
      "language": "python",
      "content": "app.add_middleware(CORSMiddleware, allow_origins=[\"*\"], ...)",
      "relevance_score": 0.82
    }
  ]
}
```

### How It Works

- Skills are globally available (your general knowledge base)
- Conventions are scoped to a project path or identifier (e.g., `/home/user/project1`)
- Snippets are globally available code examples
- Relevance scores are cosine similarities (0-1); items below 0.3 are typically filtered out
- Limits are configurable (default: 3 skills, 2 conventions, 2 snippets)
### Usage Pattern

```python
async def query_with_context(query: str, project: str = None):
    # 1. Fetch context
    context = await get_context(query, project)

    # 2. Build system prompt
    system_prompt = format_context(context)
    # system_prompt now contains:
    # ## Relevant Skills
    # ### Docker Compose Standard (relevance: 0.89)
    # Always use docker-compose v3.8+...
    # ...

    # 3. Inject into LLM call
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": query}
    ]
    response = await llm.chat(messages)

    return response
```
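The `format_context` helper used above is yours to define, not part of the API. A minimal sketch that renders a `/context/rag` response into the markdown shown in the comments (the section titles and layout here are one reasonable choice, not a required format):

```python
def format_context(context: dict) -> str:
    """Render a /context/rag response as a markdown system prompt."""
    sections = [
        ("Relevant Skills", context.get("skills", [])),
        ("Project Conventions", context.get("conventions", [])),
        ("Code Snippets", context.get("snippets", [])),
    ]
    parts = []
    for title, items in sections:
        if not items:
            continue  # skip empty sections to save tokens
        parts.append(f"## {title}")
        for item in items:
            parts.append(f"### {item['name']} (relevance: {item['relevance_score']:.2f})")
            parts.append(item["content"])
    return "\n".join(parts)
```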
---

## Conversation Compression

### The `/compress` Endpoint

Compresses a list of conversation messages into a shorter representation.

**Request:**

```json
{
  "messages": [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi! How can I help?"},
    {"role": "user", "content": "I need to set up Docker Compose."},
    {"role": "assistant", "content": "Sure! Docker Compose uses a YAML file..."},
    ... (up to 20+ messages)
  ]
}
```

**Response:**

```json
{
  "messages": [
    {"role": "system", "content": "Summary of earlier conversation..."},
    {"role": "user", "content": "I need to set up Docker Compose."},
    {"role": "assistant", "content": "Sure! Docker Compose uses a YAML file..."}
  ],
  "tokens_saved": 245
}
```

### Compression Strategies

- **Extractive** (default): Uses LSA summarization to select key sentences. Fast (~100-500ms), no model required.
- **Ollama**: Uses `phi3:mini` for abstractive summaries. Better quality but slower (~2s). Requires Ollama running.

**Configure in `config.yaml`:**

```yaml
compression:
  enabled: true
  strategy: "extractive"  # or "ollama"
```
### Usage Pattern

```python
conversation = []

async def chat(query):
    global conversation  # reassigned below after compression

    # Add user message
    conversation.append({"role": "user", "content": query})

    # Call LLM (with context from RAG)
    response = await llm.chat(conversation)
    conversation.append({"role": "assistant", "content": response})

    # Compress when conversation gets long
    if len(conversation) >= 10:
        compressed = await compress_messages(conversation)
        conversation = compressed["messages"]
        print(f"Saved {compressed['tokens_saved']} tokens")

    return response
```
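`compress_messages` is likewise a thin wrapper you provide. A minimal sketch against the `/compress` endpoint above (using `http://helm:8675`, the example base URL from elsewhere in this guide):

```python
import httpx

async def compress_messages(messages: list[dict]) -> dict:
    """POST the conversation to /compress; returns {"messages": [...], "tokens_saved": N}."""
    async with httpx.AsyncClient() as client:
        resp = await client.post("http://helm:8675/compress", json={"messages": messages})
        resp.raise_for_status()
        return resp.json()
```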
**Important**: Keep the most recent ~4-6 turns uncompressed. The compression endpoint preserves recent messages and compresses only the older ones.

---
## Project Memory

### The `/memory` Endpoints

Store and retrieve project-specific knowledge.

**Store:**

```
POST /memory
{
  "project": "my-project",
  "key": "architecture-decision-2024-01-15",
  "content": "We chose FastAPI over Flask for async support and automatic OpenAPI docs."
}
```
**Retrieve:**

```
GET /memory?project=my-project
```

**Update:**

```
PUT /memory/{id}
```

**Delete:**

```
DELETE /memory/{id}
```
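Updating and deleting take the entry `id` from the retrieve response. A sketch, assuming the `PUT` body mirrors the `POST` body (check `/docs` for the exact schema):

```python
import httpx

entry_id = "..."  # taken from the GET /memory?project=... response

# Update an entry (assumption: PUT accepts the same fields as POST)
httpx.put(f"http://helm:8675/memory/{entry_id}", json={
    "project": "my-project",
    "key": "architecture-decision-2024-01-15",
    "content": "Updated decision text."
})

# Delete an entry
httpx.delete(f"http://helm:8675/memory/{entry_id}")
```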
### Usage Pattern

```python
import httpx

# Store a decision after making it (store_memory is sketched below)
await store_memory(
    project="/home/user/myapp",
    key="db-choice",
    content="Using PostgreSQL over MongoDB for relational data integrity"
)

# Retrieve past decisions at project start
resp = httpx.get("http://helm:8675/memory", params={"project": "/home/user/myapp"})
decisions = resp.json()["entries"]
# decisions = [{"id": "...", "key": "db-choice", "content": "...", ...}]
```
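The `store_memory` helper above is not provided by the API; a minimal sketch wrapping `POST /memory`:

```python
import httpx

async def store_memory(project: str, key: str, content: str) -> dict:
    """Persist a project-scoped memory entry via POST /memory."""
    async with httpx.AsyncClient() as client:
        resp = await client.post("http://helm:8675/memory", json={
            "project": project, "key": key, "content": content
        })
        resp.raise_for_status()
        return resp.json()
```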
**When to use memory:**

- Architecture decisions
- Configuration choices (API keys, service URLs)
- Learned preferences ("User likes code examples")
- Debugging notes ("Issue with CORS on port 8080")

**When NOT to use memory:**

- Temporary conversation state (use compression instead)
- Large codebases (store in skills/snippets instead)
- Public documentation (should be in skills)

---
## Session Workflow

### Starting a New Session

1. **Determine the project identifier.** The AI should do this at the start of each session. **Recommended approach:** Use the git remote origin URL as a stable identifier that follows you across machines.

   ```python
   # Detecting the git remote (the AI would use its shell tool)
   import subprocess

   try:
       project = subprocess.check_output(
           ["git", "remote", "get-url", "origin"]
       ).decode().strip()
   except (subprocess.CalledProcessError, FileNotFoundError):
       project = "fallback-identifier"  # or ask the user
   ```

   This ensures that if you work on the same repository from multiple machines (different file paths), the project context remains consistent. The same repo uses the same identifier everywhere.

   If the directory isn't a git repository, the AI should ask the user for a unique project identifier or fall back to a configured environment variable.
2. **Load past memories** (optional but helpful):

   ```python
   memories = httpx.get("http://helm:8675/memory", params={"project": PROJECT}).json()["entries"]
   # Inject into system prompt or create context from them
   ```

3. **Begin conversation loop** - for each user query (see the sketch after this list):

   - Call `GET /context/rag?query=...&project=PROJECT`
   - Inject context into LLM prompt
   - Call LLM
   - Store important outputs in memory if they represent decisions/learnings
   - Compress conversation when it reaches ~10 turns
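
A sketch of that loop, reusing the `get_context`, `format_context`, `compress_messages`, and `store_memory` helpers from earlier sections (`call_llm` is a stand-in for your provider):

```python
async def session_loop(project: str):
    conversation = []
    while True:
        query = input("You: ")
        if query == "quit":
            break

        # RAG context for this query
        ctx = await get_context(query, project)
        system_prompt = format_context(ctx)

        conversation.append({"role": "user", "content": query})
        response = call_llm(system_prompt, conversation)  # stand-in for your LLM call
        conversation.append({"role": "assistant", "content": response})
        print(f"Assistant: {response}")

        # Compress once the history gets long
        if len(conversation) >= 10:
            conversation = (await compress_messages(conversation))["messages"]
```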
### Ending a Session

- Optionally store a session summary in memory:

  ```python
  await store_memory(PROJECT, "session-summary-2024-01-15", "Completed user auth flow, decided on JWT tokens")
  ```

- No cleanup needed - conversation state lives in your agent, not the server
### Multi-Project Agents

If your agent works across multiple projects:

```python
# Switch project context mid-conversation
PROJECT = "git@github.com:company/project-a.git"  # stable identifier

# Each project has its own conventions and memories
context = await get_context(query, project=PROJECT)
```

---
## Managing Skills

Skills are your reusable knowledge base. Manage them via API, MCP, or the seed script.

### Categories

Group skills by category (e.g., `homelab`, `dnd`, `python`, `devops`). Categories don't affect RAG retrieval but help with organization.

### Tags

Tags are keywords used for **future search** (not currently used by RAG, but planned for enhanced filtering).

```json
{
  "tags": ["docker", "compose", "infrastructure", "production"]
}
```
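Creating a tagged skill might look like this - a sketch assuming a conventional `POST /skills` endpoint that accepts the same fields the RAG response returns (verify the exact schema at `http://helm:8675/docs`):

```python
import httpx

httpx.post("http://helm:8675/skills", json={
    "id": "homelab-docker-compose",
    "name": "Docker Compose Standard",
    "category": "homelab",
    "content": "Always use docker-compose v3.8+. Include health checks, restart policies, and resource limits.",
    "tags": ["docker", "compose", "infrastructure", "production"]
})
```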
### Best Practices for Skills

- **Be specific**: "Docker Compose Production Patterns" > "Docker"
- **Include examples**: Show code snippets in the content
- **Keep it concise**: 1-3 paragraphs, focus on actionable guidance
- **Use markdown**: The API preserves formatting for injection into prompts
- **Version when updating**: If a skill changes significantly, create a new `id` (e.g., `docker-compose-v2`)

### Search Skills

```
GET /skills/search?q={query}
```

Returns matching skills by name/content similarity. Useful for manual exploration but not needed in automated agents (use `/context/rag` instead).
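
A quick manual check from Python (the response shape is an assumption; inspect `resp.json()` to confirm):

```python
import httpx

resp = httpx.get("http://helm:8675/skills/search", params={"q": "docker"})
for skill in resp.json():  # assumption: a list of skill objects as in the RAG response
    print(skill["name"])
```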
---

## Token Accounting

### Count Tokens

```
GET /tokens/count?text={text}
```

Returns the token count (using tiktoken for GPT models, approximations for others).
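
A small client-side wrapper for it, matching the `count_tokens` helper name used in the example below (the `count` response field is an assumption; check `/docs`):

```python
import httpx

def count_tokens(text: str) -> int:
    """Ask the API how many tokens a string costs."""
    resp = httpx.get("http://helm:8675/tokens/count", params={"text": text})
    resp.raise_for_status()
    return resp.json()["count"]  # assumption: response is {"count": N}
```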
**Use this to:**

- Track compression savings
- Pre-flight check prompts before sending to LLM
- Budget token usage per session

### Example: Measure RAG Savings

```python
full_context = load_all_skills()  # hypothetical: all your skills text
full_tokens = count_tokens(full_context)

rag_context = await get_context(query, project)  # only relevant items
rag_tokens = count_tokens(format_context(rag_context))

savings_pct = (1 - rag_tokens / full_tokens) * 100
print(f"RAG saved {savings_pct:.1f}% tokens")
```
---

## Best Practices

### 1. Always Use Project Scoping

Set the `project` parameter consistently. Even if you have one main project, use a consistent identifier:

```python
PROJECT = "/home/user/myapp"  # NOT "default" or None
context = await get_context(query, project=PROJECT)
```

This allows:

- Project-specific conventions
- Memory isolation between projects
- Future per-project analytics

### 2. Call RAG Before Every LLM Request

Even if the query seems unrelated, the cost is negligible (<5ms, ~50 tokens). The knowledge injected often improves responses.
### 3. Compress Proactively

Don't wait until the context window is full. Compress at ~10 messages:

```python
if len(conversation) >= 10:
    compressed = await compress_messages(conversation)
    conversation = compressed["messages"]
```

This keeps the compression quality high (summaries are more accurate with fewer messages).
### 4. Store Learnings, Not Everything

Memory is for **decisions** and **facts you want to recall**.

Don't store:

- Every user query/response (that's what compression is for)
- Public documentation (put in skills instead)
- Transient state (keep in agent memory)

### 5. Version Your Skills

When a skill's guidance changes:

- **Minor update** (typo, clarification): update the existing skill's `content` in place
- **Major update** (different approach, breaking change): create a new `id` (e.g., `docker-compose-v2`) and optionally mark the old one as deprecated in its content
### 6. Use MCP in Claude Desktop

If you use Claude Desktop, add the MCP server (see `CLAUDE.md`). This gives you:

- Direct access to skills via Claude's tool calling
- No need to implement API calls manually
- Same token savings within Claude

### 7. Monitor Token Savings

Track metrics:

```python
from datetime import datetime

logs = []

def log_savings(tokens_before, tokens_after, operation):
    logs.append({
        "timestamp": datetime.now().isoformat(),
        "operation": operation,
        "tokens_before": tokens_before,
        "tokens_after": tokens_after,
        "savings": tokens_before - tokens_after
    })
    # Periodically upload or analyze these
```
---

## Example Implementations

### Minimal Agent

```python
import asyncio, httpx, os

API_URL = os.getenv("API_URL", "http://helm:8675")
PROJECT = os.getenv("PROJECT", "/default")

async def get_context(query):
    async with httpx.AsyncClient() as client:
        resp = await client.get(f"{API_URL}/context/rag", params={"query": query, "project": PROJECT})
        return resp.json()

async def chat():
    conv = []
    while True:
        query = input("You: ")
        if query == "quit":
            break

        # Get context
        ctx = await get_context(query)
        system = format_context(ctx)  # see the format_context sketch above

        # Call LLM (pseudo - swap in your provider)
        response = call_llm(system, conv[-4:], query)

        conv.extend([{"role": "user", "content": query},
                     {"role": "assistant", "content": response}])

        print(f"Assistant: {response}")

asyncio.run(chat())
```
### Discord Bot with Context

```python
import os

import discord
from discord.ext import commands
import httpx

# discord.py 2.x requires explicit intents; message_content is needed to read messages
intents = discord.Intents.default()
intents.message_content = True
bot = commands.Bot(command_prefix="!", intents=intents)
API_URL = "http://helm:8675"
PROJECT = "/home/user/discord-bot"

@bot.event
async def on_message(message):
    if message.author == bot.user:
        return

    # RAG context
    async with httpx.AsyncClient() as client:
        resp = await client.get(f"{API_URL}/context/rag", params={"query": message.content, "project": PROJECT})
        ctx = resp.json()

    # Build prompt
    system_prompt = format_context(ctx) + "\n\nYou are a helpful Discord bot."

    # Respond (using your LLM of choice)
    response = await generate_response(message.content, system_prompt)
    await message.reply(response)

    # Store in memory if it's a decision
    if "decision" in message.content.lower():
        async with httpx.AsyncClient() as client:
            await client.post(f"{API_URL}/memory", json={
                "project": PROJECT,
                "key": f"decision-{discord.utils.utcnow().timestamp()}",
                "content": response[:500]
            })

bot.run(os.getenv("DISCORD_TOKEN"))
```
---

## Need More Help?

- **Setup issues**: See `SETUP.md`
- **Template repo**: Clone `git.bouncypixel.com:helm/agentic-templates.git`
- **API reference**: Visit `http://helm:8675/docs` when the service is running
- **MCP tools**: See `CLAUDE.md` for Claude Desktop integration