AI Skills API

Local infrastructure for AI context management. Reduce token consumption by 60-80% through smart RAG, conversation compression, and reusable skills.

API available at: http://helm:8675
Interactive docs: http://helm:8675/docs

Key Features

  • Smart RAG: Pre-computed embeddings, <5ms retrieval, returns only relevant skills/snippets
  • Conversation Compression: Extractive summarization or Ollama (phi-3-mini) - cuts conversation-history tokens by 50-75%
  • Project Memory: Store decisions and learnings per project
  • Simple API: RESTful JSON API + MCP server for Claude Desktop
  • Zero-friction auth: Optional API key (set-and-forget)

Quick Start (5 minutes)

# 1. Deploy the service on helm (see SETUP.md for details)
docker compose up -d

# 2. Clone the template repo for your agent project
git clone git.bouncypixel.com:helm/agentic-templates.git my-agent
cd my-agent
cp .env.example .env
docker compose up -d

# 3. Your agent is now running with context management

See SETUP.md for complete deployment instructions and USAGE.md for integration patterns.

Endpoints

Endpoint                          Description                      Auth
GET /health                       Health check                     No
GET /config                       Show current config              Yes
GET /skills                       List all skills                  Yes
GET /skills/{id}                  Get skill (increments usage)     Yes
POST /skills                      Create skill                     Yes
PUT /skills/{id}                  Update skill                     Yes
DELETE /skills/{id}               Delete skill                     Yes
GET /skills/search?q=query        Search skills                    Yes
GET /snippets                     List snippets                    Yes
POST /snippets                    Create snippet                   Yes
DELETE /snippets/{id}             Delete snippet                   Yes
GET /conventions                  List conventions                 Yes
GET /conventions?project=/path    Get project conventions          Yes
POST /conventions                 Create convention                Yes
DELETE /conventions/{id}          Delete convention                Yes
GET /memory                       List memory entries              Yes
GET /memory?project=name          Get project memory               Yes
POST /memory                      Create memory entry              Yes
PUT /memory/{id}                  Update memory                    Yes
DELETE /memory/{id}               Delete memory                    Yes
GET /context/rag?query=...        RAG context (smart retrieval)    Yes
POST /compress                    Compress conversation            Yes
GET /tokens/count?text=...        Count tokens                     Yes
POST /admin/clear-cache           Clear RAG cache                  Yes

Note: Endpoints marked "Yes" require API key if auth is enabled (default: disabled).
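
For a quick sanity check, here is a minimal Python call against the token counter. The X-API-Key header name is an assumption for the auth-enabled case; confirm the actual header in SETUP.md.

import httpx

# Hypothetical header name; omit entirely while auth is disabled (the default).
headers = {"X-API-Key": "your-key-here"}

count = httpx.get(
    "http://helm:8675/tokens/count",
    params={"text": "How many tokens is this?"},
    headers=headers,
).json()
print(count)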

Integration Pattern

import httpx

async def query_llm(prompt, conversation_history, project=None):
    async with httpx.AsyncClient(base_url="http://helm:8675") as client:
        # 1. Get relevant context (RAG) - biggest token saver
        resp = await client.get(
            "/context/rag",
            params={"query": prompt, "project": project},
        )
        context = resp.json()

        # Inject context into your LLM prompt
        system_prompt = f"{context['skills']}\n{context['conventions']}"

        # 2. Call LLM with context + conversation
        response = call_llm(system_prompt, conversation_history, prompt)

        # 3. Store learnings in memory
        await client.post(
            "/memory",
            json={"project": project, "key": "decision", "content": response},
        )

        # 4. Periodically compress old conversation turns
        if len(conversation_history) > 10:
            await client.post(
                "/compress",
                json={"messages": conversation_history},
            )

    return response

Expected savings: 60-80% token reduction vs. sending everything.

See USAGE.md for complete integration patterns, examples, and best practices.

Template Repository

Want to get started quickly? Use the agent template:

# Clone the template
git clone git.bouncypixel.com:helm/agentic-templates.git my-agent
cd my-agent
cp .env.example .env
docker compose up -d

The template includes a working agent integration and docker-compose setup. See USAGE.md for integration patterns.

How It Works (Architecture)

RAG Engine (Fast)

  • All skills/snippets are loaded into memory at startup with pre-computed embeddings
  • Each query is embedded once, then scored by cosine similarity against the cached embeddings (see the sketch after this list)
  • Returns top-K most relevant items (<5ms for 1000 items)
  • No external API calls, no database queries per request
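
As a rough sketch of that retrieval loop (the model name and in-memory corpus here are illustrative assumptions, not the service's actual internals):

from sentence_transformers import SentenceTransformer
import numpy as np

# Assumed embedding model for illustration; any small sentence-embedding model works.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Pre-computed once at startup. Normalized vectors make dot product == cosine similarity.
skill_texts = ["deploy with docker compose", "rotate API keys", "write pytest fixtures"]
skill_vecs = model.encode(skill_texts, normalize_embeddings=True)  # shape (N, dim)

def top_k(query: str, k: int = 2):
    q = model.encode(query, normalize_embeddings=True)  # embed the query once
    scores = skill_vecs @ q                             # score against every cached vector
    best = np.argsort(scores)[::-1][:k]                 # indices of the k best matches
    return [(skill_texts[i], float(scores[i])) for i in best]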

Compression (Configurable)

  • Extractive (default): Uses LSA summarization to pick key sentences - fast, no model
  • Ollama: Sends to local phi-3-mini for high-quality summaries (~2s)
  • Keeps recent turns in full and replaces older turns with the summary (sketched below)
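
A minimal client-side sketch of that pattern, assuming the /compress response body carries the summary (check the interactive docs for the exact shape):

import httpx

async def compact_history(history: list[dict], keep_recent: int = 4) -> list[dict]:
    old, recent = history[:-keep_recent], history[-keep_recent:]
    async with httpx.AsyncClient() as client:
        resp = await client.post("http://helm:8675/compress", json={"messages": old})
    summary = resp.json()  # assumed to include the summarized text
    # One summary message replaces the old turns; recent turns stay verbatim.
    return [{"role": "system", "content": f"Summary of earlier turns: {summary}"}] + recent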

Memory Store

  • Simple key-value per project
  • Stores decisions, configurations, learnings
  • Retrieved via /memory?project=... (example below)
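
For example, with synchronous httpx (the field names follow the integration pattern above):

import httpx

# Store a decision for a project...
httpx.post(
    "http://helm:8675/memory",
    json={"project": "my-agent", "key": "db-choice", "content": "SQLite, keep it simple"},
)

# ...then read the project's memory back later.
entries = httpx.get("http://helm:8675/memory", params={"project": "my-agent"}).json()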

MCP Server Integration

If you use Claude Desktop, add to your config:

{
  "mcpServers": {
    "skills": {
      "command": "python",
      "args": ["/path/to/ai-skills-api/mcp/skills.py"],
      "env": {
        "SKILLS_API_URL": "http://helm:8675"
      }
    }
  }
}

Available tools:

  • search_skills, get_skill, list_skills
  • get_context, get_conventions, get_snippets
  • get_memory, add_memory, create_skill

Migration from v1

If you were using the old semantic cache:

  • Deleted: Semantic cache endpoints and model
  • Migrate: Any stored skills/snippets remain (tags now JSON)
  • Upgrade: Pull new image, restart, optionally enable auth

Performance

  • RAG latency: ~5ms (cached embeddings)
  • Embedding model load: ~100MB RAM, ~2s cold start
  • Compression: 100-500ms (extractive) or ~2s (Ollama)
  • Supports 1000+ skills/snippets without degradation

License

MIT

For detailed usage examples and API reference, see USAGE.md and the interactive docs at http://helm:8675/docs when the service is running.