AI Skills API
Local infrastructure for AI context management. Reduce token consumption by 60-80% through smart RAG, conversation compression, and reusable skills.
Quick Start
```bash
# Copy config file (optional, uses defaults if missing)
cp config.yaml.example config.yaml  # customize if needed

# Run with Docker
docker compose up -d

# Or run locally
pip install -r requirements.txt
uvicorn main:app --reload
```
API available at http://helm:8675
Docs at http://helm:8675/docs
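To confirm the service is up, hit the health endpoint, which never requires an API key (see Endpoints below). A minimal check with httpx:

```python
import httpx

# GET /health needs no API key, even when auth is enabled.
resp = httpx.get("http://helm:8675/health")
resp.raise_for_status()
print(resp.json())
```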
Key Features
- Smart RAG: Pre-computed embeddings, <5ms retrieval, returns only relevant skills/snippets
- Conversation Compression: Extractive summarization or Ollama (phi-3-mini) - saves 50-75% on history
- Project Memory: Store decisions and learnings per project
- Simple API: RESTful JSON API + MCP server for Claude Desktop
- Zero-friction auth: Optional API key (set-and-forget)
Configuration
Create config.yaml (optional) to customize:
```yaml
port: 8675

rag:
  max_skills: 3
  max_conventions: 2
  max_snippets: 2

compression:
  enabled: true
  strategy: "extractive"  # or "ollama" for phi-3-mini

auth:
  enabled: false  # set to true and change api_key
```
Or use environment variables (see config.py for full list).
Endpoints
| Endpoint | Description | Auth |
|---|---|---|
| `GET /health` | Health check | No |
| `GET /config` | Show current config | Yes |
| `GET /skills` | List all skills | Yes |
| `GET /skills/{id}` | Get skill (increments usage) | Yes |
| `POST /skills` | Create skill | Yes |
| `PUT /skills/{id}` | Update skill | Yes |
| `DELETE /skills/{id}` | Delete skill | Yes |
| `GET /skills/search?q=query` | Search skills | Yes |
| `GET /snippets` | List snippets | Yes |
| `POST /snippets` | Create snippet | Yes |
| `DELETE /snippets/{id}` | Delete snippet | Yes |
| `GET /conventions` | List conventions | Yes |
| `GET /conventions?project=/path` | Get project conventions | Yes |
| `POST /conventions` | Create convention | Yes |
| `DELETE /conventions/{id}` | Delete convention | Yes |
| `GET /memory` | List memory entries | Yes |
| `GET /memory?project=name` | Get project memory | Yes |
| `POST /memory` | Create memory entry | Yes |
| `PUT /memory/{id}` | Update memory | Yes |
| `DELETE /memory/{id}` | Delete memory | Yes |
| `GET /context/rag?query=...` | RAG context (smart retrieval) | Yes |
| `POST /compress` | Compress conversation | Yes |
| `GET /tokens/count?text=...` | Count tokens | Yes |
| `POST /admin/clear-cache` | Clear RAG cache | Yes |
Note: Endpoints marked "Yes" require API key if auth is enabled (default: disabled).
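When auth is enabled, authenticated endpoints expect the API key on every request. A minimal sketch with httpx is below; the `X-API-Key` header name is an assumption, so check config.py / main.py for the actual header or scheme used.

```python
import httpx

BASE_URL = "http://helm:8675"
# Hypothetical header name; confirm the real auth scheme in config.py / main.py.
HEADERS = {"X-API-Key": "your-api-key"}

# Authenticated request: list all skills
resp = httpx.get(f"{BASE_URL}/skills", headers=HEADERS)
resp.raise_for_status()
print(resp.json())
```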
Integration Pattern
```python
import httpx

SKILLS_API = "http://helm:8675"

async def query_llm(prompt, conversation_history, project=None):
    async with httpx.AsyncClient(base_url=SKILLS_API) as client:
        # 1. Get relevant context (RAG) - biggest token saver
        resp = await client.get(
            "/context/rag",
            params={"query": prompt, "project": project},
        )
        context = resp.json()

        # Inject context into your LLM prompt
        system_prompt = f"{context['skills']}\n{context['conventions']}"

        # 2. Call LLM with context + conversation (call_llm is your own LLM client)
        response = call_llm(system_prompt, conversation_history, prompt)

        # 3. Store learnings in memory
        await client.post(
            "/memory",
            json={"project": project, "key": "decision", "content": response},
        )

        # 4. Periodically compress old conversation turns
        if len(conversation_history) > 10:
            await client.post(
                "/compress",
                json={"messages": conversation_history},
            )

    return response
```
Expected savings: 60-80% token reduction vs. sending everything.
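To measure the savings on your own workload, compare token counts for the full context you used to inline against the trimmed RAG bundle via `GET /tokens/count`. A rough sketch (the exact response shape is not documented here, so inspect it at /docs):

```python
import httpx

BASE_URL = "http://helm:8675"

def count_tokens(text: str):
    # GET /tokens/count?text=... ; inspect the exact response shape at /docs.
    resp = httpx.get(f"{BASE_URL}/tokens/count", params={"text": text})
    resp.raise_for_status()
    return resp.json()

full_context = "...every skill, convention, and snippet pasted into the prompt..."
rag_bundle = "...only the top-K items returned by /context/rag..."
print("before:", count_tokens(full_context))
print("after: ", count_tokens(rag_bundle))
```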
Template Repository
Want to get started quickly? Use the agent template:
```bash
# Clone the template (on your Forgejo)
git clone git.bouncypixel.com:helm/ai-agent-template.git
cd ai-agent-template
cp .env.example .env
docker compose up -d
```
The template includes a working agent integration and docker-compose setup.
How It Works (Architecture)
RAG Engine (Fast)
- All skills/snippets are loaded into memory at startup with pre-computed embeddings
- Queries embed once, compute cosine similarity against cached embeddings
- Returns top-K most relevant items (<5ms for 1000 items)
- No external API calls, no database queries per request
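Conceptually, retrieval is one embedding call per query plus a cosine-similarity scan over the cached vectors. The sketch below illustrates that idea with numpy and sentence-transformers; the model name, example data, and function shape are illustrative assumptions, not the actual contents of rag.py.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative sketch; rag.py may use a different model and data layout.
model = SentenceTransformer("all-MiniLM-L6-v2")

# At startup: embed every skill/snippet once and keep the vectors in memory.
skills = [
    "Always use docker-compose v3.8+. Include health checks, restart policies, and resource limits.",
    "React components: functional style with hooks, no class components.",
    "Expose services through traefik labels instead of host ports.",
]
skill_vecs = model.encode(skills, normalize_embeddings=True)  # shape (N, d)

def retrieve(query: str, top_k: int = 3) -> list[str]:
    # Per request: one embedding + a dot product (cosine similarity on normalized vectors).
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = skill_vecs @ q
    best = np.argsort(scores)[::-1][:top_k]
    return [skills[i] for i in best]

print(retrieve("set up a new docker compose service"))
```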
Compression (Configurable)
- Extractive (default): Uses LSA summarization to pick key sentences - fast, no model
- Ollama: Sends to local phi-3-mini for high-quality summaries (~2s)
- Keeps recent turns full, replaces old with summary
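As a rough sketch of the extractive path: keep the last few turns verbatim and replace everything older with an LSA summary. The example below uses the sumy library to illustrate LSA summarization; whether compression.py uses sumy or another implementation is an assumption.

```python
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer

KEEP_RECENT = 4  # keep the last N turns verbatim

def compress_history(messages: list[dict]) -> list[dict]:
    # Illustrative only; the real strategy lives in compression.py.
    old, recent = messages[:-KEEP_RECENT], messages[-KEEP_RECENT:]
    if not old:
        return messages

    text = "\n".join(m["content"] for m in old)
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summary = " ".join(str(s) for s in LsaSummarizer()(parser.document, 5))

    return [{"role": "system", "content": f"Summary of earlier turns: {summary}"}] + recent
```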
Memory Store
- Simple key-value per project
- Stores decisions, configurations, learnings
- Retrieved via `GET /memory?project=...`
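For example, an agent can record a decision during one session and pull it back in the next, using the memory endpoints listed above (add the API key header if auth is enabled):

```python
import httpx

BASE_URL = "http://helm:8675"

# Record a decision for a project
httpx.post(f"{BASE_URL}/memory", json={
    "project": "media-server",
    "key": "reverse-proxy",
    "content": "Chose traefik over nginx for automatic certificate handling.",
})

# Later: retrieve everything remembered about that project
memory = httpx.get(f"{BASE_URL}/memory", params={"project": "media-server"}).json()
print(memory)
```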
MCP Server Integration
If you use Claude Desktop, add to your config:
```json
{
  "mcpServers": {
    "skills": {
      "command": "python",
      "args": ["/path/to/ai-skills-api/mcp/skills.py"],
      "env": {
        "SKILLS_API_URL": "http://helm:8675"
      }
    }
  }
}
```
Available tools:
`search_skills`, `get_skill`, `list_skills`, `get_context`, `get_conventions`, `get_snippets`, `check_cache` (deprecated), `get_memory`, `add_memory`, `create_skill`
Migration from v1
If you were using the old semantic cache:
- Deleted: Semantic cache endpoints and model
- Migrate: Any stored skills/snippets remain (tags now JSON)
- Upgrade: Pull new image, restart, optionally enable auth
Performance
- RAG latency: ~5ms (cached embeddings)
- Embedding model load: ~100MB RAM, ~2s cold start
- Compression: 100-500ms (extractive) or ~2s (ollama)
- Supports 1000+ skills/snippets without degradation
License
MIT
Example Usage
Create a skill
```bash
curl -X POST http://helm:8675/skills \
  -H "Content-Type: application/json" \
  -d '{
    "id": "homelab-docker-compose",
    "name": "Docker Compose Standard",
    "category": "homelab",
    "content": "Always use docker-compose v3.8+. Include health checks, restart policies, and resource limits.",
    "tags": ["docker", "compose", "infrastructure"]
  }'
```
Get context bundle
curl "http://helm:8675/context?project=/home/server/apps/media-server&skills=homelab-docker-compose,react-v2"
Check cache (v1 semantic cache; removed in v2, see Migration from v1)
```bash
curl -X POST http://helm:8675/cache/lookup \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "How do I configure traefik?",
    "model": "claude-3-opus"
  }'
```
Integration Pattern
In your agent's system prompt or pre-request hook:
- Call `GET /context?project={current_project}&skills={skill_ids}`
- Inject the returned content into the prompt
- Before sending to the LLM, check `POST /cache/lookup`
- After receiving the response, optionally `POST /cache/store`

This avoids re-sending your standards with every request and caches repeated queries.
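A pre-request hook along those lines might look like the sketch below. Note that these cache endpoints belong to the v1 flow described in this section, and any payload fields beyond `prompt` and `model` (such as `hit` and `response`) are assumptions.

```python
import httpx

BASE_URL = "http://helm:8675"

def ask(prompt: str, project: str, skill_ids: str) -> str:
    with httpx.Client(base_url=BASE_URL) as client:
        # 1. Pull the context bundle and inject it into the prompt.
        ctx = client.get("/context",
                         params={"project": project, "skills": skill_ids}).json()

        # 2. Check the cache before paying for an LLM call.
        #    The "hit"/"response" fields are assumed, not documented here.
        cached = client.post("/cache/lookup",
                             json={"prompt": prompt, "model": "claude-3-opus"}).json()
        if cached.get("hit"):
            return cached["response"]

        # 3. Call the LLM with the injected context (call_llm is your own client).
        answer = call_llm(ctx, prompt)

        # 4. Store the result so repeated queries skip the LLM next time.
        client.post("/cache/store",
                    json={"prompt": prompt, "model": "claude-3-opus", "response": answer})
        return answer
```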
Database
SQLite database ai.db with tables:
- `skills` - Reusable patterns and instructions
- `snippets` - Code snippets
- `conventions` - Project-specific conventions
- `cache` - LRU cache of LLM responses
- `memory` - Project memory/notes
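To inspect the data directly (for debugging or backups), the file can be opened with Python's standard sqlite3 module; the column layout is defined in models.py, so only the table names above are assumed here.

```python
import sqlite3

con = sqlite3.connect("ai.db")

# List the tables present in the database file.
tables = [row[0] for row in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
print(tables)  # expected to include: skills, snippets, conventions, cache, memory

# Row counts per table.
for t in tables:
    print(t, con.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0])

con.close()
```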