# AI Skills API
Local infrastructure for AI context management. Reduce token consumption by 60-80% through smart RAG, conversation compression, and reusable skills.
**API available at**: `http://helm:8675`
**Interactive docs**: `http://helm:8675/docs`
## Quick Links
- **[Setup Guide](SETUP.md)** - One-time deployment on your server
- **[Usage Guide](USAGE.md)** - How to integrate with your agents
- **[Template Repository](https://git.bouncypixel.com/helm/agentic-templates)** - Starter kit for new projects
## Key Features
- **Smart RAG**: Pre-computed embeddings, <5ms retrieval, returns only relevant skills/snippets
- **Conversation Compression**: Extractive summarization or Ollama (phi-3-mini) - saves 50-75% on history
- **Project Memory**: Store decisions and learnings per project
- **Simple API**: RESTful JSON API + MCP server for Claude Desktop
- **Zero-friction auth**: Optional API key (set-and-forget)
## Quick Start (5 minutes)
```bash
# 1. Deploy the service on helm (see SETUP.md for details)
docker compose up -d
# 2. Clone the template repo for your agent project
git clone https://git.bouncypixel.com/helm/agentic-templates.git my-agent
cd my-agent
cp .env.example .env
docker compose up -d
# 3. Your agent is now running with context management
```
See **[SETUP.md](SETUP.md)** for complete deployment instructions and **[USAGE.md](USAGE.md)** for integration patterns.
## Endpoints
| Endpoint | Description | Auth |
|----------|-------------|------|
| `GET /health` | Health check | No |
| `GET /config` | Show current config | Yes |
| `GET /skills` | List all skills | Yes |
| `GET /skills/{id}` | Get skill (increments usage) | Yes |
| `POST /skills` | Create skill | Yes |
| `PUT /skills/{id}` | Update skill | Yes |
| `DELETE /skills/{id}` | Delete skill | Yes |
| `GET /skills/search?q=query` | Search skills | Yes |
| `GET /snippets` | List snippets | Yes |
| `POST /snippets` | Create snippet | Yes |
| `DELETE /snippets/{id}` | Delete snippet | Yes |
| `GET /conventions` | List conventions | Yes |
| `GET /conventions?project=/path` | Get project conventions | Yes |
| `POST /conventions` | Create convention | Yes |
| `DELETE /conventions/{id}` | Delete convention | Yes |
| `GET /memory` | List memory entries | Yes |
| `GET /memory?project=name` | Get project memory | Yes |
| `POST /memory` | Create memory entry | Yes |
| `PUT /memory/{id}` | Update memory | Yes |
| `DELETE /memory/{id}` | Delete memory | Yes |
| `GET /context/rag?query=...` | **RAG context** (smart retrieval) | Yes |
| `POST /compress` | **Compress conversation** | Yes |
| `GET /tokens/count?text=...` | Count tokens | Yes |
| `POST /admin/clear-cache` | Clear RAG cache | Yes |

**Note**: Endpoints marked "Yes" require an API key only when auth is enabled (it is disabled by default).
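For example, a minimal search request with `httpx`. The `X-API-Key` header name here is an assumption; check the interactive docs at `/docs` for the exact auth scheme:
```python
import httpx

# Search skills by keyword. The X-API-Key header name is assumed;
# omit the headers argument entirely if auth is disabled (the default).
resp = httpx.get(
    "http://helm:8675/skills/search",
    params={"q": "docker deployment"},
    headers={"X-API-Key": "your-key"},
)
print(resp.json())
```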
## Integration Pattern
```python
import httpx
async def query_llm(prompt, conversation_history, project=None):
    async with httpx.AsyncClient(base_url="http://helm:8675") as client:
        # 1. Get relevant context (RAG) - biggest token saver
        params = {"query": prompt}
        if project:
            params["project"] = project
        context = (await client.get("/context/rag", params=params)).json()

        # Inject context into your LLM prompt
        system_prompt = f"{context['skills']}\n{context['conventions']}"

        # 2. Call LLM with context + conversation
        # (call_llm is your own model invocation, shown as a placeholder)
        response = call_llm(system_prompt, conversation_history, prompt)

        # 3. Store learnings in memory
        await client.post(
            "/memory",
            json={"project": project, "key": "decision", "content": response},
        )

        # 4. Periodically compress old conversation turns
        if len(conversation_history) > 10:
            await client.post(
                "/compress",
                json={"messages": conversation_history},
            )

    return response
```
**Expected savings**: 60-80% token reduction vs. sending everything.
See **[USAGE.md](USAGE.md)** for complete integration patterns, examples, and best practices.
## Template Repository
Want to get started quickly? Use the agent template:
```bash
# Clone the template
git clone https://git.bouncypixel.com/helm/agentic-templates.git my-agent
cd my-agent
cp .env.example .env
docker compose up -d
```
The template includes a working agent integration and docker-compose setup. See [USAGE.md](USAGE.md) for integration patterns.
## How It Works (Architecture)
### RAG Engine (Fast)
- All skills/snippets are loaded into memory at startup with pre-computed embeddings
- Queries embed once, compute cosine similarity against cached embeddings
- Returns top-K most relevant items (<5ms for 1000 items)
- No external API calls, no database queries per request
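As a rough sketch of that retrieval step, assuming embeddings are cached as a row-normalized NumPy matrix (the engine's actual code may differ):
```python
import numpy as np

def top_k(query_emb: np.ndarray, cached_embs: np.ndarray, k: int = 5) -> np.ndarray:
    """Indices of the k most similar cached items.

    Assumes query_emb and each row of cached_embs are L2-normalized,
    so a dot product equals cosine similarity.
    """
    scores = cached_embs @ query_emb       # cosine similarity per cached item
    return np.argsort(scores)[-k:][::-1]   # top-k indices, best first
```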
### Compression (Configurable)
- **Extractive** (default): Uses LSA summarization to pick key sentences - fast, no model
- **Ollama**: Sends to local phi-3-mini for high-quality summaries (~2s)
- Keeps recent turns full, replaces old with summary
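A sketch of how an agent might drive `/compress` itself; the `summary` response field is an assumption, so check `/docs` for the actual schema:
```python
import httpx

def compress_history(messages, keep_recent=4):
    # Keep the newest turns verbatim; summarize everything older.
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    resp = httpx.post("http://helm:8675/compress", json={"messages": old})
    summary = resp.json()["summary"]  # assumed response field
    return [{"role": "system", "content": f"Summary of earlier turns: {summary}"}] + recent
```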
### Memory Store
- Simple key-value per project
- Stores decisions, configurations, learnings
- Retrieved via `/memory?project=...`
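For example (the project name `my-agent` is illustrative):
```python
import httpx

# Store a decision for a project, then read back that project's memory
httpx.post(
    "http://helm:8675/memory",
    json={"project": "my-agent", "key": "db-choice", "content": "SQLite, for simplicity"},
)
entries = httpx.get("http://helm:8675/memory", params={"project": "my-agent"}).json()
```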
## MCP Server Integration
If you use Claude Desktop, add to your config:
```json
{
  "mcpServers": {
    "skills": {
      "command": "python",
      "args": ["/path/to/ai-skills-api/mcp/skills.py"],
      "env": {
        "SKILLS_API_URL": "http://helm:8675"
      }
    }
  }
}
```
Available tools:
- `search_skills`, `get_skill`, `list_skills`
- `get_context`, `get_conventions`, `get_snippets`
- `get_memory`, `add_memory`, `create_skill`
**Auto-coaching**: The MCP server sends instructions to the AI on connection, teaching it how and when to use these tools to learn and improve over time. This means the AI will proactively call `get_context()`, `add_memory()`, and `create_skill()` without you needing to explicitly tell it each time.
## Migration from v1
If you were using the old semantic cache:
- **Deleted**: Semantic cache endpoints and model
- **Migrate**: Existing skills/snippets are preserved (tags are now stored as JSON)
- **Upgrade**: Pull new image, restart, optionally enable auth
## Performance
- RAG latency: ~5ms (cached embeddings)
- Embedding model load: ~100MB RAM, ~2s cold start
- Compression: 100-500ms (extractive) or ~2s (Ollama)
- Supports 1000+ skills/snippets without degradation
## License
MIT
For detailed usage examples and API reference, see [USAGE.md](USAGE.md) and the interactive docs at `http://helm:8675/docs` when the service is running.