AI Skills API

Local infrastructure for AI context management. Reduce token consumption by 60-80% through smart RAG, conversation compression, and reusable skills.

API available at: http://helm:8675
Interactive docs: http://helm:8675/docs

Key Features

  • Smart RAG: Pre-computed embeddings, <5ms retrieval, returns only relevant skills/snippets
  • Conversation Compression: Extractive summarization or Ollama (phi-3-mini) - saves 50-75% on history
  • Project Memory: Store decisions and learnings per project
  • Simple API: RESTful JSON API + MCP server for Claude Desktop
  • Zero-friction auth: Optional API key (set-and-forget)

Quick Start (5 minutes)

# 1. Deploy the service on helm (see SETUP.md for details)
docker compose up -d

# 2. Clone the template repo for your agent project
git clone git.bouncypixel.com:helm/agentic-templates.git my-agent
cd my-agent
cp .env.example .env
docker compose up -d

# 3. Your agent is now running with context management

See SETUP.md for complete deployment instructions and USAGE.md for integration patterns.
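
To confirm the service is reachable before wiring anything up, hit the unauthenticated health endpoint:

curl http://helm:8675/health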

Endpoints

Endpoint                          Description                      Auth
GET /health                       Health check                     No
GET /config                       Show current config              Yes
GET /skills                       List all skills                  Yes
GET /skills/{id}                  Get skill (increments usage)     Yes
POST /skills                      Create skill                     Yes
PUT /skills/{id}                  Update skill                     Yes
DELETE /skills/{id}               Delete skill                     Yes
GET /skills/search?q=query        Search skills                    Yes
GET /snippets                     List snippets                    Yes
POST /snippets                    Create snippet                   Yes
DELETE /snippets/{id}             Delete snippet                   Yes
GET /conventions                  List conventions                 Yes
GET /conventions?project=/path    Get project conventions          Yes
POST /conventions                 Create convention                Yes
DELETE /conventions/{id}          Delete convention                Yes
GET /memory                       List memory entries              Yes
GET /memory?project=name          Get project memory               Yes
POST /memory                      Create memory entry              Yes
PUT /memory/{id}                  Update memory                    Yes
DELETE /memory/{id}               Delete memory                    Yes
GET /context/rag?query=...        RAG context (smart retrieval)    Yes
POST /compress                    Compress conversation            Yes
GET /tokens/count?text=...        Count tokens                     Yes
POST /admin/clear-cache           Clear RAG cache                  Yes

Note: Endpoints marked "Yes" require an API key only when auth is enabled (auth is disabled by default).
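
For example, searching skills from the shell. The X-API-Key header name below is an assumption for illustration - check SETUP.md for the header your deployment expects, and drop it when auth is disabled:

# Auth disabled (default)
curl "http://helm:8675/skills/search?q=docker"

# Auth enabled (header name assumed)
curl -H "X-API-Key: $SKILLS_API_KEY" "http://helm:8675/skills/search?q=docker"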

Integration Pattern

import httpx

async def query_llm(prompt, conversation_history, project=None):
    async with httpx.AsyncClient(base_url="http://helm:8675") as client:
        # 1. Get relevant context (RAG) - biggest token saver
        params = {"query": prompt}
        if project:
            params["project"] = project
        rag = await client.get("/context/rag", params=params)
        context = rag.json()

        # Inject context into your LLM prompt
        system_prompt = f"{context['skills']}\n{context['conventions']}"

        # 2. Call LLM with context + conversation (call_llm is your own wrapper)
        response = call_llm(system_prompt, conversation_history, prompt)

        # 3. Store learnings in memory
        await client.post(
            "/memory",
            json={"project": project, "key": "decision", "content": response},
        )

        # 4. Periodically compress old conversation turns
        if len(conversation_history) > 10:
            await client.post(
                "/compress",
                json={"messages": conversation_history},
            )

    return response
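
Calling the helper from a synchronous entry point (call_llm above is a stand-in for your own model call; the prompt and project name are placeholders):

import asyncio

history = []  # prior conversation turns, if any
answer = asyncio.run(query_llm("How should I structure the Dockerfile?", history, project="my-agent"))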

Expected savings: 60-80% token reduction vs. sending everything.

See USAGE.md for complete integration patterns, examples, and best practices.

Template Repository

Want to get started quickly? Use the agent template:

# Clone the template
git clone git.bouncypixel.com:helm/agentic-templates.git my-agent
cd my-agent
cp .env.example .env
docker compose up -d

The template includes a working agent integration and docker-compose setup. See USAGE.md for integration patterns.

How It Works (Architecture)

RAG Engine (Fast)

  • All skills/snippets are loaded into memory at startup with pre-computed embeddings
  • Each query is embedded once; cosine similarity against the cached embeddings ranks the matches (see the sketch after this list)
  • Returns top-K most relevant items (<5ms for 1000 items)
  • No external API calls, no database queries per request
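
A minimal sketch of that retrieval step, assuming sentence-transformers-style embeddings and numpy (illustrative, not the service's actual code):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small embedding model

# Startup: pre-compute one normalized embedding per skill/snippet
texts = ["skill one ...", "skill two ..."]  # loaded from the store
cached = model.encode(texts, normalize_embeddings=True)

def top_k(query: str, k: int = 5):
    # Embed the query once; a single matrix product then gives the cosine
    # similarity against every cached item (dot product of unit vectors).
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = cached @ q
    best = np.argsort(scores)[::-1][:k]
    return [(texts[i], float(scores[i])) for i in best]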

Compression (Configurable)

  • Extractive (default): LSA summarization picks key sentences - fast, no LLM required
  • Ollama: Sends the history to a local phi-3-mini for higher-quality summaries (~2s)
  • Keeps recent turns in full and replaces older ones with a summary (client-side pattern sketched below)
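
A sketch of applying that pattern from an agent; the summary field in the response is an assumed name - check the interactive docs for the actual schema:

import httpx

def compact_history(messages, keep_recent=4):
    # Summarize only the older turns; keep the tail verbatim.
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    if not old:
        return messages
    resp = httpx.post("http://helm:8675/compress", json={"messages": old})
    summary = resp.json()["summary"]  # assumed response field
    return [{"role": "system", "content": f"Earlier turns, summarized: {summary}"}] + recent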

Memory Store

  • Simple key-value per project
  • Stores decisions, configurations, learnings
  • Retrieved via /memory?project=... (see the example below)
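
For example, recording and reading back a project decision (field names follow the integration example above):

import httpx

# Store a decision for the project
httpx.post(
    "http://helm:8675/memory",
    json={"project": "my-agent", "key": "decision", "content": "Use extractive compression"},
)

# Retrieve everything remembered for that project
entries = httpx.get("http://helm:8675/memory", params={"project": "my-agent"}).json()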

MCP Server Integration

If you use Claude Desktop, add to your config:

{
  "mcpServers": {
    "skills": {
      "command": "python",
      "args": ["/path/to/ai-skills-api/mcp/skills.py"],
      "env": {
        "SKILLS_API_URL": "http://helm:8675"
      }
    }
  }
}

Available tools:

  • search_skills, get_skill, list_skills
  • get_context, get_conventions, get_snippets
  • get_memory, add_memory, create_skill

Auto-coaching: On connection, the MCP server sends the AI instructions on how and when to use these tools so it can learn and improve over time. In practice, the AI proactively calls get_context(), add_memory(), and create_skill() without being told each time.

Migration from v1

If you were using the old semantic cache:

  • Deleted: Semantic cache endpoints and model
  • Migrate: Stored skills/snippets carry over (tags are now stored as JSON)
  • Upgrade: Pull new image, restart, optionally enable auth

Performance

  • RAG latency: ~5ms (cached embeddings)
  • Embedding model load: ~100MB RAM, ~2s cold start
  • Compression: 100-500ms (extractive) or ~2s (ollama)
  • Supports 1000+ skills/snippets without degradation

License

MIT

For detailed usage examples and API reference, see USAGE.md and the interactive docs at http://helm:8675/docs when the service is running.