# AI Skills API

Local infrastructure for AI context management. Reduce token consumption by 60-80% through smart RAG, conversation compression, and reusable skills.

## Quick Start

```bash
# Copy config file (optional, uses defaults if missing)
cp config.yaml.example config.yaml  # customize if needed

# Run with Docker
docker compose up -d

# Or run locally
pip install -r requirements.txt
uvicorn main:app --reload
```

API available at `http://helm:8675`
Docs at `http://helm:8675/docs`

## Key Features

- **Smart RAG**: Pre-computed embeddings, <5ms retrieval, returns only relevant skills/snippets
- **Conversation Compression**: Extractive summarization or Ollama (phi-3-mini) - saves 50-75% on history
- **Project Memory**: Store decisions and learnings per project
- **Simple API**: RESTful JSON API + MCP server for Claude Desktop
- **Zero-friction auth**: Optional API key (set-and-forget)

## Configuration

Create `config.yaml` (optional) to customize:

```yaml
port: 8675
rag:
  max_skills: 3
  max_conventions: 2
  max_snippets: 2
compression:
  enabled: true
  strategy: "extractive"  # or "ollama" for phi-3-mini
auth:
  enabled: false  # set to true and change api_key
```

Or use environment variables (see `config.py` for the full list).

## Endpoints

| Endpoint | Description | Auth |
|----------|-------------|------|
| `GET /health` | Health check | No |
| `GET /config` | Show current config | Yes |
| `GET /skills` | List all skills | Yes |
| `GET /skills/{id}` | Get skill (increments usage) | Yes |
| `POST /skills` | Create skill | Yes |
| `PUT /skills/{id}` | Update skill | Yes |
| `DELETE /skills/{id}` | Delete skill | Yes |
| `GET /skills/search?q=query` | Search skills | Yes |
| `GET /snippets` | List snippets | Yes |
| `POST /snippets` | Create snippet | Yes |
| `DELETE /snippets/{id}` | Delete snippet | Yes |
| `GET /conventions` | List conventions | Yes |
| `GET /conventions?project=/path` | Get project conventions | Yes |
| `POST /conventions` | Create convention | Yes |
| `DELETE /conventions/{id}` | Delete convention | Yes |
| `GET /memory` | List memory entries | Yes |
| `GET /memory?project=name` | Get project memory | Yes |
| `POST /memory` | Create memory entry | Yes |
| `PUT /memory/{id}` | Update memory | Yes |
| `DELETE /memory/{id}` | Delete memory | Yes |
| `GET /context/rag?query=...` | **RAG context** (smart retrieval) | Yes |
| `POST /compress` | **Compress conversation** | Yes |
| `GET /tokens/count?text=...` | Count tokens | Yes |
| `POST /admin/clear-cache` | Clear RAG cache | Yes |

**Note**: Endpoints marked "Yes" require an API key if auth is enabled (default: disabled).

## Integration Pattern

```python
import httpx

async def query_llm(prompt, conversation_history, project=None):
    async with httpx.AsyncClient(base_url="http://helm:8675") as client:
        # 1. Get relevant context (RAG) - biggest token saver
        resp = await client.get(
            "/context/rag",
            params={"query": prompt, "project": project},
        )
        context = resp.json()

        # Inject context into your LLM prompt
        system_prompt = f"{context['skills']}\n{context['conventions']}"

        # 2. Call LLM with context + conversation (call_llm is your own client)
        response = call_llm(system_prompt, conversation_history, prompt)

        # 3. Store learnings in memory
        await client.post(
            "/memory",
            json={"project": project, "key": "decision", "content": response},
        )

        # 4. Periodically compress old conversation turns
        if len(conversation_history) > 10:
            await client.post(
                "/compress",
                json={"messages": conversation_history},
            )

    return response
```

**Expected savings**: 60-80% token reduction vs. sending everything.
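The same RAG call from the shell, as a minimal sketch. The `X-API-Key` header name is an assumption (the actual name is defined in `config.py`), and the response fields are inferred from the Python example above, so check `/docs` for the authoritative schema:

```bash
# Hypothetical request: header name and response shape are assumptions, not verified
curl -s "http://helm:8675/context/rag?query=traefik+reverse+proxy&project=/home/server/apps/media-server" \
  -H "X-API-Key: your-api-key"
# => {"skills": "...", "conventions": "...", "snippets": "..."}
```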
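A standalone compression call follows the same pattern. The `role`/`content` message shape below is an assumption based on common chat-history formats, and the response schema is not documented here, so inspect `/docs` before relying on specific fields:

```bash
# Hypothetical request: the message object shape is an assumption
curl -s -X POST http://helm:8675/compress \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "How do I configure traefik?"},
      {"role": "assistant", "content": "Create a traefik.yml, mount the Docker socket, ..."}
    ]
  }'
```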
## Template Repository

Want to get started quickly? Use the agent template:

```bash
# Clone the template (on your Forgejo)
git clone git.bouncypixel.com:helm/ai-agent-template.git
cd ai-agent-template
cp .env.example .env
docker compose up -d
```

The template includes a working agent integration and a docker-compose setup.

## How It Works (Architecture)

### RAG Engine (Fast)

- All skills/snippets are loaded into memory at startup with pre-computed embeddings
- Each query is embedded once, then cosine similarity is computed against the cached embeddings
- Returns the top-K most relevant items (<5ms for 1000 items)
- No external API calls, no database queries per request

### Compression (Configurable)

- **Extractive** (default): Uses LSA summarization to pick key sentences - fast, no model required
- **Ollama**: Sends history to a local phi-3-mini for higher-quality summaries (~2s)
- Keeps recent turns in full, replaces older ones with a summary

### Memory Store

- Simple key-value store per project
- Stores decisions, configurations, learnings
- Retrieved via `/memory?project=...`

## MCP Server Integration

If you use Claude Desktop, add to your config:

```json
{
  "mcpServers": {
    "skills": {
      "command": "python",
      "args": ["/path/to/ai-skills-api/mcp/skills.py"],
      "env": {
        "SKILLS_API_URL": "http://helm:8675"
      }
    }
  }
}
```

Available tools:

- `search_skills`, `get_skill`, `list_skills`
- `get_context`, `get_conventions`, `get_snippets`
- `check_cache` (deprecated), `get_memory`, `add_memory`, `create_skill`

## Migration from v1

If you were using the old semantic cache:

- **Deleted**: Semantic cache endpoints and model
- **Migrate**: Any stored skills/snippets remain (tags are now JSON)
- **Upgrade**: Pull the new image, restart, optionally enable auth

## Performance

- RAG latency: ~5ms (cached embeddings)
- Embedding model load: ~100MB RAM, ~2s cold start
- Compression: 100-500ms (extractive) or ~2s (Ollama)
- Supports 1000+ skills/snippets without degradation

## License

MIT

## Example Usage

### Create a skill

```bash
curl -X POST http://helm:8675/skills \
  -H "Content-Type: application/json" \
  -d '{
    "id": "homelab-docker-compose",
    "name": "Docker Compose Standard",
    "category": "homelab",
    "content": "Always use docker-compose v3.8+. Include health checks, restart policies, and resource limits.",
    "tags": ["docker", "compose", "infrastructure"]
  }'
```

### Get RAG context

```bash
curl "http://helm:8675/context/rag?query=docker+compose+standards&project=/home/server/apps/media-server"
```

## Database

SQLite database `ai.db` with tables:

- `skills` - Reusable patterns and instructions
- `snippets` - Code snippets
- `conventions` - Project-specific conventions
- `memory` - Project memory/notes
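To inspect the store directly, here is a minimal sketch, assuming `ai.db` sits in the working directory and the `skills` columns mirror the JSON fields from the create-skill example above (not verified against the actual schema):

```bash
# Assumption: column names match the API's JSON fields (id, name, category, ...)
sqlite3 ai.db "SELECT id, name, category FROM skills ORDER BY id LIMIT 5;"
```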