AI Skills API

Local infrastructure for AI context management. Reduce token consumption by 60-80% through smart RAG, conversation compression, and reusable skills.

Quick Start

# Copy config file (optional, uses defaults if missing)
cp config.yaml.example config.yaml  # customize if needed

# Run with Docker
docker compose up -d

# Or run locally
pip install -r requirements.txt
uvicorn main:app --reload

API available at http://helm:8675. Interactive docs at http://helm:8675/docs.

Key Features

  • Smart RAG: Pre-computed embeddings, <5ms retrieval, returns only relevant skills/snippets
  • Conversation Compression: Extractive summarization or Ollama (phi-3-mini) - saves 50-75% on history
  • Project Memory: Store decisions and learnings per project
  • Simple API: RESTful JSON API + MCP server for Claude Desktop
  • Zero-friction auth: Optional API key (set-and-forget)

Configuration

Create config.yaml (optional) to customize:

port: 8675
rag:
  max_skills: 3
  max_conventions: 2
  max_snippets: 2
compression:
  enabled: true
  strategy: "extractive"  # or "ollama" for phi-3-mini
auth:
  enabled: false  # set to true and change api_key

Or use environment variables (see config.py for full list).
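A quick way to confirm which settings are actually in effect is the GET /config endpoint. A minimal sketch using httpx (assumes auth is disabled, the default):

import httpx

# Fetch the effective configuration from a running instance
config = httpx.get("http://helm:8675/config").json()
print(config)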

Endpoints

Endpoint                             Description                     Auth
GET    /health                       Health check                    No
GET    /config                       Show current config             Yes
GET    /skills                       List all skills                 Yes
GET    /skills/{id}                  Get skill (increments usage)    Yes
POST   /skills                       Create skill                    Yes
PUT    /skills/{id}                  Update skill                    Yes
DELETE /skills/{id}                  Delete skill                    Yes
GET    /skills/search?q=query        Search skills                   Yes
GET    /snippets                     List snippets                   Yes
POST   /snippets                     Create snippet                  Yes
DELETE /snippets/{id}                Delete snippet                  Yes
GET    /conventions                  List conventions                Yes
GET    /conventions?project=/path    Get project conventions         Yes
POST   /conventions                  Create convention               Yes
DELETE /conventions/{id}             Delete convention               Yes
GET    /memory                       List memory entries             Yes
GET    /memory?project=name          Get project memory              Yes
POST   /memory                       Create memory entry             Yes
PUT    /memory/{id}                  Update memory                   Yes
DELETE /memory/{id}                  Delete memory                   Yes
GET    /context/rag?query=...        RAG context (smart retrieval)   Yes
POST   /compress                     Compress conversation           Yes
GET    /tokens/count?text=...        Count tokens                    Yes
POST   /admin/clear-cache            Clear RAG cache                 Yes

Note: Endpoints marked "Yes" require API key if auth is enabled (default: disabled).
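When auth is enabled, send the key with every request. A minimal sketch (the header name below is an assumption - check config.py for the header the server actually reads):

import httpx

API_KEY = "change-me"  # the value from config.yaml's auth section

# "X-API-Key" is illustrative; confirm the expected header in config.py
resp = httpx.get(
    "http://helm:8675/skills",
    headers={"X-API-Key": API_KEY},
)
resp.raise_for_status()
print(resp.json())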

Integration Pattern

import httpx

async def query_llm(prompt, conversation_history, project=None):
    async with httpx.AsyncClient(base_url="http://helm:8675") as client:
        # 1. Get relevant context (RAG) - biggest token saver
        resp = await client.get(
            "/context/rag",
            params={"query": prompt, "project": project}
        )
        context = resp.json()

        # Inject context into your LLM prompt
        system_prompt = f"{context['skills']}\n{context['conventions']}"

        # 2. Call LLM with context + conversation
        # (call_llm is your own LLM wrapper)
        response = call_llm(system_prompt, conversation_history, prompt)

        # 3. Store learnings in memory
        await client.post(
            "/memory",
            json={"project": project, "key": "decision", "content": response}
        )

        # 4. Periodically compress old conversation turns
        if len(conversation_history) > 10:
            await client.post(
                "/compress",
                json={"messages": conversation_history}
            )

    return response

Expected savings: 60-80% token reduction vs. sending everything.

Template Repository

Want to get started quickly? Use the agent template:

# Clone the template (on your Forgejo)
git clone git.bouncypixel.com:helm/ai-agent-template.git
cd ai-agent-template
cp .env.example .env
docker compose up -d

The template includes a working agent integration and docker-compose setup.

How It Works (Architecture)

RAG Engine (Fast)

  • All skills/snippets are loaded into memory at startup with pre-computed embeddings
  • Queries embed once, compute cosine similarity against cached embeddings
  • Returns top-K most relevant items (<5ms for 1000 items)
  • No external API calls, no database queries per request
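The retrieval core is just vectorized cosine similarity over the startup-time embedding matrix. A minimal sketch of the idea (not the project's actual code; embed() stands in for whatever embedding model is loaded at startup):

import numpy as np

def top_k(query_vec: np.ndarray, cached: np.ndarray, k: int = 3) -> np.ndarray:
    # Normalize so plain dot products equal cosine similarities
    q = query_vec / np.linalg.norm(query_vec)
    c = cached / np.linalg.norm(cached, axis=1, keepdims=True)
    sims = c @ q  # one matrix-vector product scores every item
    # argpartition is O(n); a full sort is unnecessary for top-K
    idx = np.argpartition(-sims, k)[:k]
    return idx[np.argsort(-sims[idx])]  # order the K winners by score

# cached = np.stack([embed(s.content) for s in skills])  # once, at startup
# best = top_k(embed(query), cached)                     # per request, ~ms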

Compression (Configurable)

  • Extractive (default): Uses LSA summarization to pick key sentences - fast, no model
  • Ollama: Sends to local phi-3-mini for high-quality summaries (~2s)
  • Keeps recent turns full, replaces old with summary
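The turn-window logic is straightforward. A sketch of its shape, assuming summarize() wraps whichever strategy is configured (extractive or Ollama):

def compress_history(messages: list[dict], keep_recent: int = 4) -> list[dict]:
    # Nothing to do until history outgrows the window
    if len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    # Collapse the old turns into one synthetic system message
    summary = summarize(" ".join(m["content"] for m in old))
    return [{"role": "system", "content": f"Summary of earlier turns: {summary}"}] + recent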

Memory Store

  • Simple key-value per project
  • Stores decisions, configurations, learnings
  • Retrieved via /memory?project=...
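For example (assuming auth is disabled, the default):

import httpx

base = "http://helm:8675"

# Store a decision for a project
httpx.post(f"{base}/memory", json={
    "project": "media-server",
    "key": "reverse-proxy",
    "content": "Standardized on Traefik v3 with file-based dynamic config.",
})

# Read it back later
entries = httpx.get(f"{base}/memory", params={"project": "media-server"}).json()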

MCP Server Integration

If you use Claude Desktop, add to your config:

{
  "mcpServers": {
    "skills": {
      "command": "python",
      "args": ["/path/to/ai-skills-api/mcp/skills.py"],
      "env": {
        "SKILLS_API_URL": "http://helm:8675"
      }
    }
  }
}

Available tools:

  • search_skills, get_skill, list_skills
  • get_context, get_conventions, get_snippets
  • check_cache (deprecated), get_memory, add_memory, create_skill

Migration from v1

If you were using the old semantic cache:

  • Deleted: Semantic cache endpoints and model
  • Migrate: Any stored skills/snippets remain (tags now JSON)
  • Upgrade: Pull new image, restart, optionally enable auth

Performance

  • RAG latency: ~5ms (cached embeddings)
  • Embedding model load: ~100MB RAM, ~2s cold start
  • Compression: 100-500ms (extractive) or ~2s (ollama)
  • Supports 1000+ skills/snippets without degradation

License

MIT

Example Usage

Create a skill

curl -X POST http://helm:8675/skills \
  -H "Content-Type: application/json" \
  -d '{
    "id": "homelab-docker-compose",
    "name": "Docker Compose Standard",
    "category": "homelab",
    "content": "Always use docker-compose v3.8+. Include health checks, restart policies, and resource limits.",
    "tags": ["docker", "compose", "infrastructure"]
  }'

Get context bundle

curl "http://helm:8675/context?project=/home/server/apps/media-server&skills=homelab-docker-compose,react-v2"

Count tokens

curl "http://helm:8675/tokens/count?text=How+do+I+configure+traefik"

Integration Pattern

In your agent's system prompt or pre-request hook:

  1. Call GET /context/rag?query={prompt}&project={current_project}
  2. Inject the returned content into the prompt
  3. After receiving a response, optionally POST /memory to record decisions
  4. When history grows long, POST /compress to replace old turns with a summary

This avoids re-sending your standards on every request and keeps conversation history short.

Database

SQLite database ai.db with tables:

  • skills - Reusable patterns and instructions
  • snippets - Code snippets
  • conventions - Project-specific conventions
  • memory - Project memory/notes
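To poke at the data directly (a sketch; assumes ai.db sits in the working directory, or wherever your compose file maps it):

import sqlite3

conn = sqlite3.connect("ai.db")
# List the tables the API created
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'"
).fetchall()
print(tables)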