# AI Skills API

Local infrastructure for AI context management. Reduce token consumption by 60-80% through smart RAG, conversation compression, and reusable skills.

## Quick Start

```bash
# Copy config file (optional, uses defaults if missing)
cp config.yaml.example config.yaml # customize if needed

# Run with Docker
docker compose up -d

# Or run locally
pip install -r requirements.txt
uvicorn main:app --reload
```

The API is available at `http://helm:8675`; interactive docs are at `http://helm:8675/docs`.

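Once it is up, a quick reachability check against the unauthenticated `/health` endpoint, sketched with `httpx` (the same client used in the integration example below); the response body format is not documented here, so it is just printed:

```python
import httpx

# GET /health needs no API key (see the endpoint table below)
resp = httpx.get("http://helm:8675/health")
print(resp.status_code, resp.text)
```
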
## Key Features

- **Smart RAG**: Pre-computed embeddings, <5ms retrieval, returns only relevant skills/snippets
- **Conversation Compression**: Extractive summarization or Ollama (phi-3-mini) - saves 50-75% on history
- **Project Memory**: Store decisions and learnings per project
- **Simple API**: RESTful JSON API + MCP server for Claude Desktop
- **Zero-friction auth**: Optional API key (set-and-forget)

## Configuration

Create `config.yaml` (optional) to customize:

```yaml
port: 8675
rag:
  max_skills: 3
  max_conventions: 2
  max_snippets: 2
compression:
  enabled: true
  strategy: "extractive" # or "ollama" for phi-3-mini
auth:
  enabled: false # set to true and change api_key
```

Or use environment variables (see `config.py` for the full list).

## Endpoints

| Endpoint | Description | Auth |
|----------|-------------|------|
| `GET /health` | Health check | No |
| `GET /config` | Show current config | Yes |
| `GET /skills` | List all skills | Yes |
| `GET /skills/{id}` | Get skill (increments usage) | Yes |
| `POST /skills` | Create skill | Yes |
| `PUT /skills/{id}` | Update skill | Yes |
| `DELETE /skills/{id}` | Delete skill | Yes |
| `GET /skills/search?q=query` | Search skills | Yes |
| `GET /snippets` | List snippets | Yes |
| `POST /snippets` | Create snippet | Yes |
| `DELETE /snippets/{id}` | Delete snippet | Yes |
| `GET /conventions` | List conventions | Yes |
| `GET /conventions?project=/path` | Get project conventions | Yes |
| `POST /conventions` | Create convention | Yes |
| `DELETE /conventions/{id}` | Delete convention | Yes |
| `GET /memory` | List memory entries | Yes |
| `GET /memory?project=name` | Get project memory | Yes |
| `POST /memory` | Create memory entry | Yes |
| `PUT /memory/{id}` | Update memory | Yes |
| `DELETE /memory/{id}` | Delete memory | Yes |
| `GET /context/rag?query=...` | **RAG context** (smart retrieval) | Yes |
| `POST /compress` | **Compress conversation** | Yes |
| `GET /tokens/count?text=...` | Count tokens | Yes |
| `POST /admin/clear-cache` | Clear RAG cache | Yes |

**Note**: Endpoints marked "Yes" require an API key only when auth is enabled (it is disabled by default).

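The heaviest lifter is `GET /context/rag`. A minimal sketch of calling it with `httpx`; the `skills` and `conventions` fields mirror the integration example further down, and the query/project values are made up:

```python
import httpx

# Smart retrieval: only the skills/conventions relevant to this query come back
resp = httpx.get(
    "http://helm:8675/context/rag",
    params={"query": "set up a reverse proxy", "project": "/home/server/apps/media-server"},
)
context = resp.json()
print(context["skills"], context["conventions"])
```
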
## Integration Pattern

```python
import httpx

async def query_llm(prompt, conversation_history, project=None):
    async with httpx.AsyncClient(base_url="http://helm:8675") as api:
        # 1. Get relevant context (RAG) - biggest token saver
        context = (await api.get(
            "/context/rag",
            params={"query": prompt, "project": project},
        )).json()

        # Inject context into your LLM prompt
        system_prompt = f"{context['skills']}\n{context['conventions']}"

        # 2. Call LLM with context + conversation (call_llm is your own LLM wrapper)
        response = call_llm(system_prompt, conversation_history, prompt)

        # 3. Store learnings in memory
        await api.post(
            "/memory",
            json={"project": project, "key": "decision", "content": response},
        )

        # 4. Periodically compress old conversation turns
        if len(conversation_history) > 10:
            await api.post("/compress", json={"messages": conversation_history})

    return response
```

**Expected savings**: 60-80% token reduction vs. sending everything.

## Template Repository

Want to get started quickly? Use the agent template:

```bash
# Clone the template (on your Forgejo)
git clone git.bouncypixel.com:helm/ai-agent-template.git
cd ai-agent-template
cp .env.example .env
docker compose up -d
```

The template includes a working agent integration and docker-compose setup.

## How It Works (Architecture)

### RAG Engine (Fast)

- All skills/snippets are loaded into memory at startup with pre-computed embeddings
- Each query is embedded once, then scored by cosine similarity against the cached embeddings (sketched below)
- Returns the top-K most relevant items (<5ms for 1000 items)
- No external API calls, no database queries per request

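A rough sketch of that retrieval loop using numpy; `embed()` here is a toy bag-of-characters stand-in for whatever local embedding model the service actually loads, and the item texts are made up:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy stand-in for the real local embedding model."""
    v = np.zeros(256)
    for ch in text.lower():
        v[ord(ch) % 256] += 1.0
    return v

# Pre-computed once at startup: one L2-normalised vector per skill/snippet
items = ["skill: docker compose standard", "snippet: traefik labels"]
vectors = np.stack([embed(t) for t in items])
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

def top_k(query: str, k: int = 3) -> list[str]:
    q = embed(query)
    q /= np.linalg.norm(q)
    scores = vectors @ q                 # cosine similarity (all vectors are unit length)
    best = np.argsort(scores)[::-1][:k]  # indices of the k highest scores
    return [items[i] for i in best]

print(top_k("docker compose conventions"))
```
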
### Compression (Configurable)

- **Extractive** (default): Uses LSA summarization to pick key sentences - fast, no model
- **Ollama**: Sends to local phi-3-mini for high-quality summaries (~2s)
- Keeps recent turns full, replaces old turns with a summary (see the call sketch below)

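A client-side sketch of a compression call with `httpx`, assuming chat-style `role`/`content` messages; the `messages` payload matches the integration example above, and the response schema is not documented here:

```python
import httpx

history = [
    {"role": "user", "content": "How should I structure docker-compose files?"},
    {"role": "assistant", "content": "Use v3.8+, add health checks and restart policies..."},
    # ...older turns worth summarizing...
]

# POST /compress returns a compressed version of the conversation
resp = httpx.post("http://helm:8675/compress", json={"messages": history})
print(resp.json())  # inspect the returned structure for your setup
```
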
### Memory Store

- Simple key-value store per project
- Stores decisions, configurations, learnings
- Retrieved via `/memory?project=...` (example below)

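Writing and reading project memory, sketched with `httpx`; the write payload mirrors the integration example above, and the project name is made up:

```python
import httpx

base = "http://helm:8675"

# Store a decision for a project
httpx.post(f"{base}/memory", json={
    "project": "media-server",
    "key": "decision",
    "content": "Chose Traefik over nginx for automatic certificate renewal.",
})

# Later: pull back everything remembered for that project
print(httpx.get(f"{base}/memory", params={"project": "media-server"}).json())
```
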
## MCP Server Integration

If you use Claude Desktop, add this to your config:

```json
{
  "mcpServers": {
    "skills": {
      "command": "python",
      "args": ["/path/to/ai-skills-api/mcp/skills.py"],
      "env": {
        "SKILLS_API_URL": "http://helm:8675"
      }
    }
  }
}
```

Available tools:

- `search_skills`, `get_skill`, `list_skills`
- `get_context`, `get_conventions`, `get_snippets`
- `check_cache` (deprecated), `get_memory`, `add_memory`, `create_skill`

## Migration from v1

If you were using the old semantic cache:

- **Deleted**: Semantic cache endpoints and model
- **Migrate**: Any stored skills/snippets remain (tags now JSON)
- **Upgrade**: Pull new image, restart, optionally enable auth

## Performance

- RAG latency: ~5ms (cached embeddings)
- Embedding model load: ~100MB RAM, ~2s cold start
- Compression: 100-500ms (extractive) or ~2s (ollama)
- Supports 1000+ skills/snippets without degradation

## License

MIT

## Example Usage

### Create a skill

```bash
curl -X POST http://helm:8675/skills \
  -H "Content-Type: application/json" \
  -d '{
    "id": "homelab-docker-compose",
    "name": "Docker Compose Standard",
    "category": "homelab",
    "content": "Always use docker-compose v3.8+. Include health checks, restart policies, and resource limits.",
    "tags": ["docker", "compose", "infrastructure"]
  }'
```

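### Search skills

Finding that skill again via the search endpoint, sketched with `httpx` this time; a plain `curl "http://helm:8675/skills/search?q=docker"` works just as well, and the response schema is not documented here:

```python
import httpx

# GET /skills/search?q=... does a relevance search over stored skills
hits = httpx.get("http://helm:8675/skills/search", params={"q": "docker"}).json()
print(hits)
```
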
### Get context bundle

```bash
curl "http://helm:8675/context?project=/home/server/apps/media-server&skills=homelab-docker-compose,react-v2"
```

### Check cache (v1 only)

The semantic cache endpoints were removed in v2 (see "Migration from v1" above); this example applies only if you are still on v1:

```bash
curl -X POST http://helm:8675/cache/lookup \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "How do I configure traefik?",
    "model": "claude-3-opus"
  }'
```

## Integration Pattern (v1)

The original v1 flow, kept for reference (the cache endpoints it uses were removed in v2; see "Migration from v1" above). In your agent's system prompt or pre-request hook:

1. Call `GET /context?project={current_project}&skills={skill_ids}`
2. Inject the returned content into the prompt
3. Before sending to the LLM, check `POST /cache/lookup`
4. After receiving the response, optionally `POST /cache/store`

This avoids re-sending your standards with every request and caches repeated queries.

## Database

SQLite database `ai.db` with the following tables (a quick inspection sketch follows the list):

- `skills` - Reusable patterns and instructions
- `snippets` - Code snippets
- `conventions` - Project-specific conventions
- `cache` - LRU cache of LLM responses (v1; the cache model was removed in v2)
- `memory` - Project memory/notes

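A quick peek at what the service has stored, using Python's built-in sqlite3; the `skills` column names are assumed to match the API payload fields from "Create a skill" above, so adjust to the actual schema:

```python
import sqlite3

con = sqlite3.connect("ai.db")
# List the tables the service created
print([row[0] for row in con.execute("SELECT name FROM sqlite_master WHERE type='table'")])
# Peek at stored skills; column names assumed to mirror the API fields (id, name, category)
for row in con.execute("SELECT id, name, category FROM skills LIMIT 5"):
    print(row)
con.close()
```
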
|