# Usage Guide: AI Skills API
This guide explains how to use the AI Skills API effectively in your projects and AI agent sessions.
## Table of Contents
1. [Understanding the Integration Pattern](#understanding-the-integration-pattern)
2. [RAG Context Retrieval](#rag-context-retrieval)
3. [Conversation Compression](#conversation-compression)
4. [Project Memory](#project-memory)
5. [Session Workflow](#session-workflow)
6. [Managing Skills](#managing-skills)
7. [Token Accounting](#token-accounting)
8. [Best Practices](#best-practices)
9. [Example Implementations](#example-implementations)
---
## Understanding the Integration Pattern
The API provides three core capabilities that work together:
1. **RAG (Retrieval-Augmented Generation)**: Before each LLM call, fetch relevant skills, conventions, and snippets based on your query. This injects relevant context without sending your entire knowledge base every time.
2. **Compression**: When conversation history grows long (>10 turns), compress old messages into summaries to stay within context windows.
3. **Memory**: Store decisions, configurations, and learnings per project for future reference.
**Expected savings**: 60-80% token reduction vs. sending everything.
---
## RAG Context Retrieval
### The `/context/rag` Endpoint
This is your primary integration point. It returns only the most relevant items from your knowledge base.
**Request:**
```
GET /context/rag?query={query}&project={project}
```
**Response:**
```json
{
  "skills": [
    {
      "id": "homelab-docker-compose",
      "name": "Docker Compose Standard",
      "category": "homelab",
      "content": "Always use docker-compose v3.8+. Include health checks, restart policies, and resource limits.",
      "relevance_score": 0.89
    }
  ],
  "conventions": [
    {
      "id": "conv-123",
      "name": "React Project Standards",
      "project": "/home/user/my-react-app",
      "content": "Use TypeScript, React 18+, and functional components with hooks.",
      "relevance_score": 0.76
    }
  ],
  "snippets": [
    {
      "id": "snippet-456",
      "name": "FastAPI CORS setup",
      "language": "python",
      "content": "app.add_middleware(CORSMiddleware, allow_origins=[\"*\"], ...)",
      "relevance_score": 0.82
    }
  ]
}
```
### How It Works
- Skills are globally available (your general knowledge base)
- Conventions are scoped to a project path or identifier (e.g., `/home/user/project1`)
- Snippets are globally available code examples
- Relevance scores are cosine similarity (0-1) - items below 0.3 are typically filtered out
- Limits are configurable (default: 3 skills, 2 conventions, 2 snippets)
### Usage Pattern
```python
async def query_with_context(query: str, project: str = None):
    # 1. Fetch context
    context = await get_context(query, project)

    # 2. Build system prompt
    system_prompt = format_context(context)
    # system_prompt now contains:
    #   ## Relevant Skills
    #   ### Docker Compose Standard (relevance: 0.89)
    #   Always use docker-compose v3.8+...
    #   ...

    # 3. Inject into LLM call
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": query},
    ]
    response = await llm.chat(messages)
    return response
```
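The `format_context` helper above is left to you. Here is a minimal sketch that renders the `/context/rag` response (using the field names from the example response above) into a markdown system prompt; the section titles are illustrative:
```python
def format_context(context: dict) -> str:
    """Render a /context/rag response as a markdown system prompt."""
    sections = []
    for title, key in [("Relevant Skills", "skills"),
                       ("Project Conventions", "conventions"),
                       ("Code Snippets", "snippets")]:
        items = context.get(key, [])
        if not items:
            continue
        lines = [f"## {title}"]
        for item in items:
            lines.append(f"### {item['name']} (relevance: {item['relevance_score']:.2f})")
            lines.append(item["content"])
        sections.append("\n".join(lines))
    return "\n\n".join(sections)
```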
---
## Conversation Compression
### The `/compress` Endpoint
Compresses a list of conversation messages into a shorter representation.
**Request:**
```json
{
  "messages": [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi! How can I help?"},
    {"role": "user", "content": "I need to set up Docker Compose."},
    {"role": "assistant", "content": "Sure! Docker Compose uses a YAML file..."},
    ... (up to 20+ messages)
  ]
}
```
**Response:**
```json
{
  "messages": [
    {"role": "system", "content": "Summary of earlier conversation..."},
    {"role": "user", "content": "I need to set up Docker Compose."},
    {"role": "assistant", "content": "Sure! Docker Compose uses a YAML file..."}
  ],
  "tokens_saved": 245
}
```
### Compression Strategies
- **Extractive** (default): Uses LSA summarization to select key sentences. Fast (~100-500ms), no model required.
- **Ollama**: Uses `phi3:mini` for abstractive summaries. Better quality but slower (~2s). Requires Ollama running.
**Configure in `config.yaml`:**
```yaml
compression:
  enabled: true
  strategy: "extractive"  # or "ollama"
```
### Usage Pattern
```python
conversation = []

async def chat(query):
    global conversation  # reassigned below after compression

    # Add user message
    conversation.append({"role": "user", "content": query})

    # Call LLM (with context from RAG)
    response = await llm.chat(conversation)
    conversation.append({"role": "assistant", "content": response})

    # Compress when conversation gets long
    if len(conversation) >= 10:
        compressed = await compress_messages(conversation)
        conversation = compressed["messages"]
        print(f"Saved {compressed['tokens_saved']} tokens")

    return response
```
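The `compress_messages` helper is a thin wrapper around the endpoint. A minimal sketch, assuming `/compress` accepts `POST` with the request/response bodies shown above:
```python
import httpx

API_URL = "http://helm:8675"

async def compress_messages(messages: list[dict]) -> dict:
    """Send the conversation to /compress and return the compressed result."""
    async with httpx.AsyncClient() as client:
        resp = await client.post(f"{API_URL}/compress", json={"messages": messages})
        resp.raise_for_status()
        return resp.json()  # {"messages": [...], "tokens_saved": ...}
```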
**Important**: Keep the most recent ~4-6 turns uncompressed. The compression endpoint preserves recent messages and compresses only the older ones.
---
## Project Memory
### The `/memory` Endpoints
Store and retrieve project-specific knowledge.
**Store:**
```
POST /memory
{
  "project": "my-project",
  "key": "architecture-decision-2024-01-15",
  "content": "We chose FastAPI over Flask for async support and automatic OpenAPI docs."
}
```
**Retrieve:**
```
GET /memory?project=my-project
```
**Update:**
```
PUT /memory/{id}
```
**Delete:**
```
DELETE /memory/{id}
```
### Usage Pattern
```python
# Store a decision after making it
await store_memory(
    project="/home/user/myapp",
    key="db-choice",
    content="Using PostgreSQL over MongoDB for relational data integrity"
)

# Retrieve past decisions at project start
resp = httpx.get("http://helm:8675/memory", params={"project": "/home/user/myapp"})
decisions = resp.json()["entries"]
# decisions = [{"id": "...", "key": "db-choice", "content": "...", ...}]
```
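A minimal `store_memory` helper matching the `POST /memory` request shape above:
```python
import httpx

API_URL = "http://helm:8675"

async def store_memory(project: str, key: str, content: str) -> dict:
    """Persist a project memory entry via POST /memory."""
    async with httpx.AsyncClient() as client:
        resp = await client.post(f"{API_URL}/memory", json={
            "project": project,
            "key": key,
            "content": content,
        })
        resp.raise_for_status()
        return resp.json()
```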
**When to use memory:**
- Architecture decisions
- Configuration choices (API keys, service URLs)
- Learned preferences ("User likes code examples")
- Debugging notes ("Issue with CORS on port 8080")
**When NOT to use memory:**
- Temporary conversation state (use compression instead)
- Large codebases (store in skills/snippets instead)
- Public documentation (should be in skills)
---
## Session Workflow
### Starting a New Session
1. **Define your project identifier** - a path or unique string:
```python
PROJECT = "/home/user/myapp" # or "my-discord-bot", "workspace-123"
```
2. **Load past memories** (optional but helpful):
```python
memories = httpx.get("http://helm:8675/memory", params={"project": PROJECT}).json()["entries"]
# Inject into system prompt or create context from them
```
3. **Begin conversation loop** - for each user query:
- Call `GET /context/rag?query=...&project=PROJECT`
- Inject context into LLM prompt
- Call LLM
- Store important outputs in memory if they represent decisions/learnings
- Compress conversation when it reaches ~10 turns
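A minimal sketch of that per-query loop, reusing the helpers from earlier sections (`get_context`, `format_context`, `compress_messages`) and a placeholder `llm.chat` client:
```python
conversation = []

async def handle_turn(query: str) -> str:
    global conversation

    # 1. Fetch relevant skills/conventions/snippets for this query
    ctx = await get_context(query, project=PROJECT)
    system_prompt = format_context(ctx)

    # 2. Call the LLM with the injected context plus conversation history
    conversation.append({"role": "user", "content": query})
    messages = [{"role": "system", "content": system_prompt}] + conversation
    response = await llm.chat(messages)
    conversation.append({"role": "assistant", "content": response})

    # 3. Compress once the history reaches ~10 turns
    if len(conversation) >= 10:
        compressed = await compress_messages(conversation)
        conversation = compressed["messages"]

    return response
```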
### Ending a Session
- Optionally store a session summary in memory:
```python
await store_memory(PROJECT, "session-summary-2024-01-15", "Completed user auth flow, decided on JWT tokens")
```
- No cleanup needed - conversation state lives in your agent, not the server
### Multi-Project Agents
If your agent works across multiple projects:
```python
# Switch project context mid-conversation
PROJECT = "/home/user/project1" # current active project
# Each project has its own conventions and memories
context = await get_context(query, project=PROJECT)
```
---
## Managing Skills
Skills are your reusable knowledge base. Manage them via API, MCP, or the seed script.
### Categories
Group skills by category (e.g., `homelab`, `dnd`, `python`, `devops`). Categories don't affect RAG retrieval but help with organization.
### Tags
Tags are keywords used for **future search** (not currently used by RAG, but planned for enhanced filtering).
```json
{
  "tags": ["docker", "compose", "infrastructure", "production"]
}
```
### Best Practices for Skills
- **Be specific**: "Docker Compose Production Patterns" > "Docker"
- **Include examples**: Show code snippets in the content
- **Keep it concise**: 1-3 paragraphs, focus on actionable guidance
- **Use markdown**: The API preserves formatting for injection into prompts
- **Version when updating**: If a skill changes significantly, create a new `id` (e.g., `docker-compose-v2`)
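For reference, creating a skill over the API might look like the sketch below. It assumes a `POST /skills` endpoint that accepts the same fields returned by `/context/rag` (`id`, `name`, `category`, `content`) plus `tags`; check the live `/docs` page for the exact schema:
```python
import httpx

API_URL = "http://helm:8675"

# Assumed request shape; verify against the OpenAPI docs at /docs
skill = {
    "id": "docker-compose-production",
    "name": "Docker Compose Production Patterns",
    "category": "homelab",
    "tags": ["docker", "compose", "production"],
    "content": "Always use docker-compose v3.8+. Include health checks, restart policies, and resource limits.",
}

resp = httpx.post(f"{API_URL}/skills", json=skill)
resp.raise_for_status()
```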
### Search Skills
```
GET /skills/search?q={query}
```
Returns matching skills by name/content similarity. Useful for manual exploration but not needed in automated agents (use `/context/rag` instead).
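For example, a quick manual lookup from a script or REPL:
```python
import httpx

resp = httpx.get("http://helm:8675/skills/search", params={"q": "docker compose"})
print(resp.json())  # matching skills with name/content and similarity info
```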
---
## Token Accounting
### Count Tokens
```
GET /tokens/count?text={text}
```
Returns the token count (using tiktoken for GPT models, approximations for others).
**Use this to:**
- Track compression savings
- Pre-flight check prompts before sending to LLM
- Budget token usage per session
### Example: Measure RAG Savings
```python
full_context = load_all_skills() # hypothetical: all your skills text
full_tokens = count_tokens(full_context)
rag_context = get_context(query, project) # only relevant items
rag_tokens = count_tokens(format_context(rag_context))
savings_pct = (1 - rag_tokens / full_tokens) * 100
print(f"RAG saved {savings_pct:.1f}% tokens")
```
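The `count_tokens` helper above can wrap the endpoint directly; the response field name is an assumption here, so confirm it against `/docs`:
```python
import httpx

API_URL = "http://helm:8675"

def count_tokens(text: str) -> int:
    """Count tokens via GET /tokens/count."""
    resp = httpx.get(f"{API_URL}/tokens/count", params={"text": text})
    return resp.json()["count"]  # assumed field name; check /docs
```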
---
## Best Practices
### 1. Always Use Project Scoping
Set `project` parameter consistently. Even if you have one main project, use a consistent identifier:
```python
PROJECT = "/home/user/myapp" # NOT "default" or None
context = await get_context(query, project=PROJECT)
```
This allows:
- Project-specific conventions
- Memory isolation between projects
- Future per-project analytics
### 2. Call RAG Before Every LLM Request
Even if the query seems unrelated, the cost is negligible (<5ms, ~50 tokens). The knowledge injected often improves responses.
### 3. Compress Proactively
Don't wait until context window is full. Compress at ~10 messages:
```python
if len(conversation) >= 10:
    compressed = await compress_messages(conversation)
    conversation = compressed["messages"]
```
This keeps the compression quality high (summaries are more accurate with fewer messages).
### 4. Store Learnings, Not Everything
Memory is for **decisions** and **facts you want to recall**.
Don't store:
- Every user query/response (that's what compression is for)
- Public documentation (put in skills instead)
- Transient state (keep in agent memory)
### 5. Version Your Skills
When a skill's guidance changes:
- **Minor update** (typo, clarification): update the existing skill's `content` in place
- **Major update** (different approach, breaking change): create a new `id` (e.g., `docker-compose-v2`) and optionally mark the old one as deprecated in its content
### 6. Use MCP in Claude Desktop
If you use Claude Desktop, add the MCP server (see `CLAUDE.md`). This gives you:
- Direct access to skills via Claude's tool calling
- No need to implement API calls manually
- Same token savings within Claude
### 7. Monitor Token Savings
Track metrics:
```python
from datetime import datetime

logs = []

def log_savings(tokens_before, tokens_after, operation):
    logs.append({
        "timestamp": datetime.now().isoformat(),
        "operation": operation,
        "tokens_before": tokens_before,
        "tokens_after": tokens_after,
        "savings": tokens_before - tokens_after,
    })

# Periodically upload or analyze these logs
```
---
## Example Implementations
### Minimal Agent
```python
import asyncio, httpx, os

API_URL = os.getenv("API_URL", "http://helm:8675")
PROJECT = os.getenv("PROJECT", "/default")

async def get_context(query):
    async with httpx.AsyncClient() as client:
        resp = await client.get(f"{API_URL}/context/rag",
                                params={"query": query, "project": PROJECT})
        return resp.json()

async def chat():
    conv = []
    while True:
        query = input("You: ")
        if query == "quit":
            break

        # Get context
        ctx = await get_context(query)
        system = format_context(ctx)

        # Call LLM (pseudo)
        response = call_llm(system, conv[-4:], query)

        conv.extend([{"role": "user", "content": query},
                     {"role": "assistant", "content": response}])
        print(f"Assistant: {response}")

asyncio.run(chat())
```
### Discord Bot with Context
```python
import os

import discord
import httpx
from discord.ext import commands

API_URL = "http://helm:8675"
PROJECT = "/home/user/discord-bot"

# discord.py 2.x requires explicit intents; reading message content needs its own intent
intents = discord.Intents.default()
intents.message_content = True
bot = commands.Bot(command_prefix="!", intents=intents)

@bot.event
async def on_message(message):
    if message.author == bot.user:
        return

    # RAG context
    async with httpx.AsyncClient() as client:
        resp = await client.get(f"{API_URL}/context/rag",
                                params={"query": message.content, "project": PROJECT})
        ctx = resp.json()

    # Build prompt
    system_prompt = format_context(ctx) + "\n\nYou are a helpful Discord bot."

    # Respond (using your LLM of choice)
    response = await generate_response(message.content, system_prompt)
    await message.reply(response)

    # Store in memory if it's a decision
    if "decision" in message.content.lower():
        async with httpx.AsyncClient() as client:
            await client.post(f"{API_URL}/memory", json={
                "project": PROJECT,
                "key": f"decision-{discord.utils.utcnow().timestamp()}",
                "content": response[:500],
            })

bot.run(os.getenv("DISCORD_TOKEN"))
```
---
## Need More Help?
- **Setup issues**: See `SETUP.md`
- **Template repo**: Clone `git.bouncypixel.com:helm/agentic-templates.git`
- **API reference**: Visit `http://helm:8675/docs` when the service is running
- **MCP tools**: See `CLAUDE.md` for Claude Desktop integration