AI Skills API

Local infrastructure for AI context management. Reduce token consumption by 60-80% through smart RAG, conversation compression, and reusable skills.

API available at: http://helm:8675
Interactive docs: http://helm:8675/docs

Key Features

  • Smart RAG: Pre-computed embeddings, <5ms retrieval, returns only relevant skills/snippets
  • Conversation Compression: Extractive summarization or Ollama (phi-3-mini) - saves 50-75% on history
  • Project Memory: Store decisions and learnings per project
  • Simple API: RESTful JSON API + MCP server for Claude Desktop
  • Zero-friction auth: Optional API key (set-and-forget)

Quick Start (5 minutes)

# 1. Deploy the service on helm (see SETUP.md for details)
docker compose up -d

# 2. Clone the template repo for your agent project
git clone git.bouncypixel.com:helm/agentic-templates.git my-agent
cd my-agent
cp .env.example .env
docker compose up -d

# 3. Your agent is now running with context management

See SETUP.md for complete deployment instructions and USAGE.md for integration patterns.
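
To confirm the service is reachable before wiring anything up, hit the unauthenticated health endpoint:

curl http://helm:8675/health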

Endpoints

Endpoint                          Description                      Auth
GET /health                       Health check                     No
GET /config                       Show current config              Yes
GET /skills                       List all skills                  Yes
GET /skills/{id}                  Get skill (increments usage)     Yes
POST /skills                      Create skill                     Yes
PUT /skills/{id}                  Update skill                     Yes
DELETE /skills/{id}               Delete skill                     Yes
GET /skills/search?q=query        Search skills                    Yes
GET /snippets                     List snippets                    Yes
POST /snippets                    Create snippet                   Yes
DELETE /snippets/{id}             Delete snippet                   Yes
GET /conventions                  List conventions                 Yes
GET /conventions?project=/path    Get project conventions          Yes
POST /conventions                 Create convention                Yes
DELETE /conventions/{id}          Delete convention                Yes
GET /memory                       List memory entries              Yes
GET /memory?project=name          Get project memory               Yes
POST /memory                      Create memory entry              Yes
PUT /memory/{id}                  Update memory                    Yes
DELETE /memory/{id}               Delete memory                    Yes
GET /context/rag?query=...        RAG context (smart retrieval)    Yes
POST /compress                    Compress conversation            Yes
GET /tokens/count?text=...        Count tokens                     Yes
POST /admin/clear-cache           Clear RAG cache                  Yes

Note: Endpoints marked "Yes" require an API key only when auth is enabled (auth is disabled by default).
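
For example, searching skills from the shell. The X-API-Key header name below is an assumption for illustration - check SETUP.md for the header your deployment expects, and drop it when auth is disabled:

# Auth disabled (default)
curl "http://helm:8675/skills/search?q=docker"

# Auth enabled (header name assumed)
curl -H "X-API-Key: $SKILLS_API_KEY" "http://helm:8675/skills/search?q=docker"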

Integration Pattern

import httpx

async def query_llm(prompt, conversation_history, project=None):
    async with httpx.AsyncClient(base_url="http://helm:8675") as client:
        # 1. Get relevant context (RAG) - biggest token saver
        params = {"query": prompt}
        if project:
            params["project"] = project
        rag = await client.get("/context/rag", params=params)
        context = rag.json()

        # Inject context into your LLM prompt
        system_prompt = f"{context['skills']}\n{context['conventions']}"

        # 2. Call LLM with context + conversation (call_llm is your own wrapper)
        response = call_llm(system_prompt, conversation_history, prompt)

        # 3. Store learnings in memory
        await client.post(
            "/memory",
            json={"project": project, "key": "decision", "content": response},
        )

        # 4. Periodically compress old conversation turns
        if len(conversation_history) > 10:
            await client.post(
                "/compress",
                json={"messages": conversation_history},
            )

    return response
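
Calling the helper from a synchronous entry point (call_llm above is a stand-in for your own model call; the prompt and project name are placeholders):

import asyncio

history = []  # prior conversation turns, if any
answer = asyncio.run(query_llm("How should I structure the Dockerfile?", history, project="my-agent"))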

Expected savings: 60-80% token reduction vs. sending everything.

See USAGE.md for complete integration patterns, examples, and best practices.

Template Repository

Want to get started quickly? Use the agent template:

# Clone the template
git clone git.bouncypixel.com:helm/agentic-templates.git my-agent
cd my-agent
cp .env.example .env
docker compose up -d

The template includes a working agent integration and docker-compose setup. See USAGE.md for integration patterns.

How It Works (Architecture)

RAG Engine (Fast)

  • All skills/snippets are loaded into memory at startup with pre-computed embeddings
  • Each query is embedded once; cosine similarity against the cached embeddings ranks the matches (see the sketch after this list)
  • Returns top-K most relevant items (<5ms for 1000 items)
  • No external API calls, no database queries per request
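
A minimal sketch of that retrieval step, assuming sentence-transformers-style embeddings and numpy (illustrative, not the service's actual code):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small embedding model

# Startup: pre-compute one normalized embedding per skill/snippet
texts = ["skill one ...", "skill two ..."]  # loaded from the store
cached = model.encode(texts, normalize_embeddings=True)

def top_k(query: str, k: int = 5):
    # Embed the query once; a single matrix product then gives the cosine
    # similarity against every cached item (dot product of unit vectors).
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = cached @ q
    best = np.argsort(scores)[::-1][:k]
    return [(texts[i], float(scores[i])) for i in best]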

Compression (Configurable)

  • Extractive (default): LSA summarization picks key sentences - fast, no LLM required
  • Ollama: Sends the history to a local phi-3-mini for higher-quality summaries (~2s)
  • Keeps recent turns in full and replaces older ones with a summary (client-side pattern sketched below)
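
A sketch of applying that pattern from an agent; the summary field in the response is an assumed name - check the interactive docs for the actual schema:

import httpx

def compact_history(messages, keep_recent=4):
    # Summarize only the older turns; keep the tail verbatim.
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    if not old:
        return messages
    resp = httpx.post("http://helm:8675/compress", json={"messages": old})
    summary = resp.json()["summary"]  # assumed response field
    return [{"role": "system", "content": f"Earlier turns, summarized: {summary}"}] + recent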

Memory Store

  • Simple key-value per project
  • Stores decisions, configurations, learnings
  • Retrieved via /memory?project=... (see the example below)
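
For example, recording and reading back a project decision (field names follow the integration example above):

import httpx

# Store a decision for the project
httpx.post(
    "http://helm:8675/memory",
    json={"project": "my-agent", "key": "decision", "content": "Use extractive compression"},
)

# Retrieve everything remembered for that project
entries = httpx.get("http://helm:8675/memory", params={"project": "my-agent"}).json()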

MCP Server Integration

If you use Claude Desktop, add to your config:

{
  "mcpServers": {
    "skills": {
      "command": "python",
      "args": ["/path/to/ai-skills-api/mcp/skills.py"],
      "env": {
        "SKILLS_API_URL": "http://helm:8675"
      }
    }
  }
}

Available tools:

  • search_skills, get_skill, list_skills
  • get_context, get_conventions, get_snippets
  • get_memory, add_memory, create_skill

Auto-coaching: On connection, the MCP server sends the AI instructions on how and when to use these tools so it can learn and improve over time. In practice, the AI proactively calls get_context(), add_memory(), and create_skill() without being told each time.

Migration from v1

If you were using the old semantic cache:

  • Deleted: Semantic cache endpoints and model
  • Migrate: Stored skills/snippets carry over (tags are now stored as JSON)
  • Upgrade: Pull new image, restart, optionally enable auth

Performance

  • RAG latency: ~5ms (cached embeddings)
  • Embedding model load: ~100MB RAM, ~2s cold start
  • Compression: 100-500ms (extractive) or ~2s (ollama)
  • Supports 1000+ skills/snippets without degradation

License

MIT

For detailed usage examples and API reference, see USAGE.md and the interactive docs at http://helm:8675/docs when the service is running.