# AI Skills API
Local infrastructure for AI context management. Reduce token consumption by 60-80% through smart RAG, conversation compression, and reusable skills.
**API available at**: `http://helm:8675`
**Interactive docs**: `http://helm:8675/docs`
## Quick Links
- **[Setup Guide](SETUP.md)** - One-time deployment on your server
- **[Usage Guide](USAGE.md)** - How to integrate with your agents
- **[Template Repository](https://git.bouncypixel.com/helm/agentic-templates)** - Starter kit for new projects
## Key Features
- **Smart RAG**: Pre-computed embeddings, <5ms retrieval, returns only relevant skills/snippets
- **Conversation Compression**: Extractive summarization or Ollama (phi-3-mini) - saves 50-75% on history
- **Project Memory**: Store decisions and learnings per project
- **Simple API**: RESTful JSON API + MCP server for Claude Desktop
- **Zero-friction auth**: Optional API key (set-and-forget)
## Quick Start (5 minutes)
```bash
# 1. Deploy the service on helm (see SETUP.md for details)
docker compose up -d
# 2. Clone the template repo for your agent project
git clone https://git.bouncypixel.com/helm/agentic-templates.git my-agent
cd my-agent
cp .env.example .env
docker compose up -d
# 3. Your agent is now running with context management
```
See **[SETUP.md](SETUP.md)** for complete deployment instructions and **[USAGE.md](USAGE.md)** for integration patterns.
## Endpoints
| Endpoint | Description | Auth |
|----------|-------------|------|
| `GET /health` | Health check | No |
| `GET /config` | Show current config | Yes |
| `GET /skills` | List all skills | Yes |
| `GET /skills/{id}` | Get skill (increments usage) | Yes |
| `POST /skills` | Create skill | Yes |
| `PUT /skills/{id}` | Update skill | Yes |
| `DELETE /skills/{id}` | Delete skill | Yes |
| `GET /skills/search?q=query` | Search skills | Yes |
| `GET /snippets` | List snippets | Yes |
| `POST /snippets` | Create snippet | Yes |
| `DELETE /snippets/{id}` | Delete snippet | Yes |
| `GET /conventions` | List conventions | Yes |
| `GET /conventions?project=/path` | Get project conventions | Yes |
| `POST /conventions` | Create convention | Yes |
| `DELETE /conventions/{id}` | Delete convention | Yes |
| `GET /memory` | List memory entries | Yes |
| `GET /memory?project=name` | Get project memory | Yes |
| `POST /memory` | Create memory entry | Yes |
| `PUT /memory/{id}` | Update memory | Yes |
| `DELETE /memory/{id}` | Delete memory | Yes |
| `GET /context/rag?query=...` | **RAG context** (smart retrieval) | Yes |
| `POST /compress` | **Compress conversation** | Yes |
| `GET /tokens/count?text=...` | Count tokens | Yes |
| `POST /admin/clear-cache` | Clear RAG cache | Yes |
**Note**: Endpoints marked "Yes" require API key if auth is enabled (default: disabled).
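For example, calling these endpoints from Python (a minimal sketch; the `X-API-Key` header name is an assumption - check the interactive docs at `/docs` for the exact auth scheme, and omit the header entirely while auth is disabled):
```python
import httpx

API = "http://helm:8675"
# Assumption: header name for the optional API key - verify against /docs
HEADERS = {"X-API-Key": "your-key"}

# Health check requires no auth
print(httpx.get(f"{API}/health").json())

# Search skills (header needed only if auth is enabled)
results = httpx.get(
    f"{API}/skills/search",
    params={"q": "docker deployment"},
    headers=HEADERS,
).json()
print(results)
```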
## Integration Pattern
```python
import httpx

API = "http://helm:8675"

async def query_llm(prompt, conversation_history, project=None):
    async with httpx.AsyncClient(base_url=API) as client:
        # 1. Get relevant context (RAG) - biggest token saver
        params = {"query": prompt}
        if project:
            params["project"] = project
        context = (await client.get("/context/rag", params=params)).json()

        # Inject context into your LLM prompt
        system_prompt = f"{context['skills']}\n{context['conventions']}"

        # 2. Call LLM with context + conversation (call_llm is your own model call)
        response = call_llm(system_prompt, conversation_history, prompt)

        # 3. Store learnings in memory
        await client.post(
            "/memory",
            json={"project": project, "key": "decision", "content": response},
        )

        # 4. Periodically compress old conversation turns
        if len(conversation_history) > 10:
            await client.post("/compress", json={"messages": conversation_history})

    return response
```
**Expected savings**: 60-80% token reduction vs. sending everything.
See **[USAGE.md](USAGE.md)** for complete integration patterns, examples, and best practices.
## Template Repository
Want to get started quickly? Use the agent template:
```bash
# Clone the template
git clone https://git.bouncypixel.com/helm/agentic-templates.git my-agent
cd my-agent
cp .env.example .env
docker compose up -d
```
The template includes a working agent integration and docker-compose setup. See [USAGE.md](USAGE.md) for integration patterns.
## How It Works (Architecture)
### RAG Engine (Fast)
- All skills/snippets are loaded into memory at startup with pre-computed embeddings
- Queries embed once, compute cosine similarity against cached embeddings
- Returns top-K most relevant items (<5ms for 1000 items)
- No external API calls, no database queries per request (see the sketch below)
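This is the standard cached-embedding retrieval pattern. A minimal standalone sketch (not the service's actual code; the model name and example skills are illustrative, and it assumes `sentence-transformers` and `numpy` are installed):
```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative model choice - the service may use a different one
model = SentenceTransformer("all-MiniLM-L6-v2")

# At startup: embed every skill once, normalize, cache the matrix in memory
skills = [
    "Deploy FastAPI behind Traefik with docker compose",
    "Back up Postgres with pg_dump and cron",
    "Structured logging conventions for Python services",
]
cached = model.encode(skills, normalize_embeddings=True)  # shape: (n_skills, dim)

def top_k(query: str, k: int = 2) -> list[str]:
    # Per query: embed once, then one matrix-vector product yields cosine scores
    q = model.encode([query], normalize_embeddings=True)[0]
    best = np.argsort(cached @ q)[::-1][:k]
    return [skills[i] for i in best]

print(top_k("how do I back up the database?"))
```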
### Compression (Configurable)
- **Extractive** (default): LSA summarization picks key sentences - fast, no model needed
- **Ollama**: Sends to local phi-3-mini for high-quality summaries (~2s)
- Keeps recent turns full and replaces older ones with a summary, as sketched below
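A sketch of how a client might apply this hybrid (the request shape matches the integration pattern above; the `summary` response field is an assumption - check `/docs`):
```python
import httpx

API = "http://helm:8675"
KEEP_RECENT = 4  # recent turns to keep verbatim

def compact_history(messages: list[dict]) -> list[dict]:
    """Replace older turns with one summary message; keep recent turns full."""
    if len(messages) <= KEEP_RECENT:
        return messages
    old, recent = messages[:-KEEP_RECENT], messages[-KEEP_RECENT:]
    resp = httpx.post(f"{API}/compress", json={"messages": old})
    # Assumption: the endpoint returns a JSON body with a "summary" field
    summary = resp.json()["summary"]
    return [{"role": "system", "content": f"Summary of earlier turns: {summary}"}, *recent]
```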
### Memory Store
- Simple key-value per project
- Stores decisions, configurations, learnings
- Retrieved via `/memory?project=...`
**Project scoping**: Use a stable identifier for the `project` parameter (e.g., git remote URL like `https://github.com/username/repo`). This ensures your project's conventions and memories follow you across machines, even if file paths differ. The template agent auto-detects the git remote.
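For instance, a client can derive that identifier from the git remote and reuse it for every memory call (a sketch; the request fields match the integration pattern above):
```python
import subprocess
import httpx

API = "http://helm:8675"

def project_id() -> str:
    # Stable across machines: the git remote URL, not the local file path
    return subprocess.check_output(
        ["git", "remote", "get-url", "origin"], text=True
    ).strip()

project = project_id()

# Store a decision for this project
httpx.post(
    f"{API}/memory",
    json={"project": project, "key": "db-choice", "content": "Chose Postgres for concurrency"},
)

# Later, on any machine: retrieve everything remembered for this project
memories = httpx.get(f"{API}/memory", params={"project": project}).json()
```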
## MCP Server Integration
If you use Claude Desktop, add to your config:
```json
{
  "mcpServers": {
    "skills": {
      "command": "python",
      "args": ["/path/to/ai-skills-api/mcp/skills.py"],
      "env": {
        "SKILLS_API_URL": "http://helm:8675"
      }
    }
  }
}
```
Available tools:
- `search_skills`, `get_skill`, `list_skills`
- `get_context`, `get_conventions`, `get_snippets`
- `get_memory`, `add_memory`, `create_skill`
**Auto-coaching**: The MCP server sends instructions to the AI on connection, teaching it how and when to use these tools to learn and improve over time.
**Important**: The AI will **propose** creating skills/memories when it identifies valuable patterns, but **will not execute without your permission**. You'll see suggestions like "I could create a skill for this pattern. Should I?" and you can approve or decline. This gives you full control while still building the knowledge base.
## Migration from v1
If you were using the old semantic cache:
- **Deleted**: Semantic cache endpoints and model
- **Migrated**: Stored skills/snippets remain intact (tags are now stored as JSON)
- **Upgrade**: Pull the new image, restart, and optionally enable auth
## Performance
- RAG latency: ~5ms (cached embeddings)
- Embedding model load: ~100MB RAM, ~2s cold start
- Compression: 100-500ms (extractive) or ~2s (ollama)
- Supports 1000+ skills/snippets without degradation
## License
MIT
For detailed usage examples and API reference, see [USAGE.md](USAGE.md) and the interactive docs at `http://helm:8675/docs` when the service is running.