# Token-Saving Architecture

This explains how the AI Skills API reduces token consumption for your AI agents.

## The Two Main Mechanisms

### 1. Smart RAG (Retrieval-Augmented Generation) - 60-80% Savings

**Problem:** Sending all skills and conventions with every query wastes 2000+ tokens.

**Solution:** Pre-computed embeddings plus a fast similarity search return only the top 3 most relevant items.

```text
# Instead of this (sends everything):
GET /context?project=/opt/home-server    # -> 50 skills = ~3000 tokens

# Do this (sends only relevant):
GET /context/rag?query=How+do+I+setup+Docker+Compose&project=/opt/home-server
# -> 3 skills + 2 conventions = ~600 tokens
```
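
From an agent, this is one plain HTTP call. A minimal sketch with `httpx` (the `skills` and `estimated_tokens` response fields are the ones described in the integration example below):

```python
import httpx

# Fetch only the context relevant to the current question.
resp = httpx.get(
    "http://helm:8675/context/rag",
    params={
        "query": "How do I setup Docker Compose",
        "project": "/opt/home-server",
        "max_skills": 3,
    },
)
context = resp.json()
print(f"{len(context['skills'])} skills, ~{context['estimated_tokens']} tokens")
```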

**How it works:**

- On startup, all skills and snippets are loaded into memory together with their embeddings
- The query is embedded and cosine similarity is computed against every item (see the sketch below)
- The top-K items above the score threshold are returned in ~5ms for 1000 items
- No database queries during retrieval - fully in-memory

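
The retrieval step is small enough to sketch. This is only an illustration of in-memory cosine-similarity top-K selection (not the service's actual code), assuming item embeddings are rows of a NumPy matrix; `k` and `min_score` correspond to `max_skills` and `min_skill_score` in the configuration below:

```python
import numpy as np

def top_k(query_emb: np.ndarray, item_embs: np.ndarray, k: int = 3, min_score: float = 0.3):
    """Return (index, score) pairs for the k most similar items above min_score."""
    # Normalize so the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    items = item_embs / np.linalg.norm(item_embs, axis=1, keepdims=True)
    scores = items @ q
    best = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in best if scores[i] >= min_score]
```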

**Configuration:**

```yaml
rag:
  max_skills: 3
  max_conventions: 2
  max_snippets: 2
  min_skill_score: 0.3
```

### 2. Conversation Compression - 50-75% Savings

**Problem:** Long conversations (10+ turns) can consume 8000+ tokens of history.

**Solution:** Summarize old turns and keep only the most recent exchanges in full.

```python
# Send this to the /compress endpoint:
{
    "messages": [
        {"role": "user", "content": "..."},       # turn 1
        {"role": "assistant", "content": "..."},
        # ... many more turns
        {"role": "user", "content": "..."},       # turn 10
    ]
}

# Get back:
{
    "messages": [
        {"role": "user", "content": "[CONVERSATION SUMMARY]\nUser asked about Docker setup, decided to use Traefik...[/CONVERSATION SUMMARY]"},
        {"role": "user", "content": "..."},       # turn 9 (full)
        {"role": "assistant", "content": "..."},  # turn 10 (full)
    ],
    "original_tokens": 8000,
    "compressed_tokens": 2000,
    "tokens_saved": 6000,
    "reduction_percent": 75.0
}
```

**Strategies:**

- **extractive** (default): Fast LSA summarization, no model required (sketched below)
- **ollama**: High-quality summaries using local phi-3-mini (requires Ollama running)
- **none**: Disabled
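
As an illustration (not the service's implementation), the core of the keep-recent-and-summarize idea fits in a few lines; `summarize()` is a hypothetical stand-in for whichever summarizer the configured strategy uses:

```python
def compress(messages: list[dict], keep_last_n: int = 3) -> list[dict]:
    """Replace all but the last keep_last_n messages with one summary message."""
    if len(messages) <= keep_last_n:
        return messages
    old, recent = messages[:-keep_last_n], messages[-keep_last_n:]
    summary = summarize("\n".join(m["content"] for m in old))  # hypothetical helper
    return [
        {"role": "user", "content": f"[CONVERSATION SUMMARY]\n{summary}\n[/CONVERSATION SUMMARY]"},
        *recent,
    ]
```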

**Configuration:**

```yaml
compression:
  enabled: true
  strategy: "extractive"   # or "ollama"
  keep_last_n: 3
  max_tokens: 2000
  ollama_model: "phi3:mini"
  ollama_url: "http://localhost:11434"
```

---

## Integration Flow (Complete Example)

```python
import time

import httpx


async def chat_with_llm(user_message: str, project: str = None, conversation: list = None):
    """Complete integration pattern"""
    async with httpx.AsyncClient() as client:

        # 1. Get relevant context (RAG)
        context_resp = await client.get(
            "http://helm:8675/context/rag",
            params={"query": user_message, "project": project, "max_skills": 3},
        )
        context = context_resp.json()
        # context contains: skills, conventions, snippets, estimated_tokens

        # 2. Build system prompt with context
        context_str = format_context(context)  # See agent/template/agent.py for the full implementation
        system_prompt = f"{context_str}\n\nYou are a helpful assistant."

        # 3. Build messages array
        messages = [{"role": "system", "content": system_prompt}]
        if conversation:
            messages.extend(conversation[-4:])  # last few turns
        messages.append({"role": "user", "content": user_message})

        # 4. Call your LLM (OpenAI, Claude, Ollama, etc.)
        llm_response = await call_your_llm(messages)

        # 5. Update conversation history
        if conversation is None:
            conversation = []
        conversation.append({"role": "user", "content": user_message})
        conversation.append({"role": "assistant", "content": llm_response})

        # 6. Periodically compress (e.g., every 10 turns)
        if len(conversation) > 10:
            compress_resp = await client.post(
                "http://helm:8675/compress",
                json={"messages": conversation, "keep_last_n": 3},
            )
            compression = compress_resp.json()
            conversation = compression["messages"]
            print(f"Compressed: saved {compression['tokens_saved']} tokens ({compression['reduction_percent']}%)")

        # 7. Optionally store learnings in memory
        if project:
            await client.post(
                "http://helm:8675/memory",
                json={
                    "project": project,
                    "key": f"decision-{int(time.time())}",
                    "content": f"Decision: {llm_response[:200]}",
                },
            )

        return llm_response, conversation
```
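
A possible way to drive the function above from a script - a sketch that still relies on the `format_context` and `call_your_llm` placeholders being defined:

```python
import asyncio

async def main():
    conversation = None
    for question in ["How do I set up Docker Compose?", "Put Traefik in front of it"]:
        answer, conversation = await chat_with_llm(
            question, project="/opt/home-server", conversation=conversation
        )
        print(answer)

asyncio.run(main())
```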

---

## Expected Savings Summary

| Component | Before | After | Token Savings |
|-----------|--------|-------|---------------|
| Context injection | 3000 tokens | 600 tokens | 80% |
| Conversation history (10 turns) | 8000 tokens | 2000 tokens | 75% |
| Repeat questions | 1500 tokens | 0 tokens | 100% (if using cache externally) |

**Typical agent query:** ~3500 tokens → ~1000 tokens (**71% reduction**)

---

## What Was Removed (v1 → v2)

- **Semantic cache** - Was broken (it embedded responses rather than prompts); removed for simplicity
- **Exact-match cache** - Low value; use HTTP cache headers instead
- **Keyword-based compression** - Replaced with real summarization

---

## Performance Characteristics

- **RAG latency**: 5-10ms for 1000 items (cold start loads embeddings once)
- **Compression**: 100-500ms (extractive) or ~2s (ollama)
- **Memory usage**: ~50MB for the embedding cache (1000 skills)
- **Concurrent requests**: fully async; handles dozens of simultaneous requests

---

## Tips for Best Results

1. **Seed relevant skills** - Good skills = better RAG results. Use `/skills` and `/snippets` to build your knowledge base.
2. **Use project-specific conventions** - Set `project=/path/to/project` to auto-load conventions for that codebase.
3. **Enable Ollama compression** if you need higher-quality summaries (run `ollama pull phi3:mini`).
4. **Monitor `/config`** to verify your settings are active.
5. **Cache embeddings** in your agent if you call `/context/rag` repeatedly (see the sketch after this list).
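
One way to apply tip 5 (a client-side sketch, not part of the API) is to memoize `/context/rag` responses so repeated identical queries in a session skip the HTTP round-trip:

```python
from functools import lru_cache

import httpx

@lru_cache(maxsize=256)
def get_rag_context(query: str, project: str | None = None) -> dict:
    """Fetch RAG context once per (query, project) and reuse it for repeats."""
    resp = httpx.get(
        "http://helm:8675/context/rag",
        params={"query": query, "project": project},
    )
    resp.raise_for_status()
    return resp.json()
```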

---

## Agent Template

We've created a ready-to-use template repository with a working agent integration. Clone it and start building:

```bash
git clone git.bouncypixel.com:helm/ai-agent-template.git
cd ai-agent-template
cp .env.example .env
docker compose up -d
```

See [template/README.md](template/README.md) for details.

### 1. Semantic Cache

**Savings:** 80-90% on repeated questions

---

### 2. RAG Context Selection (Moderate Win)

**Before:** Inject ALL skills/conventions (2000+ tokens)
**After:** Inject only the top 3 relevant items (400-600 tokens)

```bash
# Legacy endpoint - returns EVERYTHING
curl "http://localhost:8080/context?project=/opt/home-server"
# Returns: 50 skills, 10 conventions = ~3000 tokens

# RAG endpoint - returns only relevant
curl "http://helm:8675/context/rag?query=How+do+I+setup+Docker+Compose&project=/opt/home-server"
# Returns: 3 skills about Docker, 2 conventions = ~600 tokens
```

**Savings:** 60-80% on context injection

---

### 3. Conversation Compression (Moderate Win)

**Before:** Full conversation history sent with every request
**After:** Old turns summarized, only recent turns kept in full

```bash
# Compress a long conversation
curl -X POST http://helm:8675/compress \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [...],          # Your conversation history
    "keep_last_n": 3,
    "max_tokens": 2000
  }'

# Response:
{
  "messages": [...],            # Compressed version
  "original_tokens": 8000,
  "compressed_tokens": 2000,
  "tokens_saved": 6000,
  "reduction_percent": 75.0
}
```

**Savings:** 50-75% on conversation history

---

## Integration Flow

```python
import httpx


# Your agent wrapper
async def query_llm(prompt, conversation_history, project=None):
    async with httpx.AsyncClient() as client:

        # 1. Check semantic cache FIRST
        cache_result = await client.post(
            "http://helm:8675/cache/semantic-lookup",
            json={"prompt": prompt, "model": "claude-3-opus"},
        )
        if cache_result.json()["hit"]:
            # No API call needed!
            return cache_result.json()["response"]

        # 2. Get ONLY relevant context (not everything)
        context = await client.get(
            "http://helm:8675/context/rag",
            params={"query": prompt, "project": project},
        )

        # 3. Compress conversation history
        compressed = await client.post(
            "http://helm:8675/compress",
            json={"messages": conversation_history, "keep_last_n": 3},
        )

        # 4. Build final prompt with compressed history + relevant context
        final_prompt = f"""
{context.json()['skills']}
{context.json()['conventions']}

{compressed.json()['messages']}

User: {prompt}
"""

        # 5. Call LLM
        response = await call_llm_api(final_prompt)

        # 6. Store in semantic cache
        await client.post(
            "http://helm:8675/cache/semantic-store",
            json={
                "prompt": prompt,
                "response": response,
                "tokens_in": len(final_prompt.split()),
                "tokens_out": len(response.split()),
            },
        )

        return response
```

---

## Expected Savings

| Scenario | Before | After | Savings |
|----------|--------|-------|---------|
| Repeated question | 1500 tokens | 0 tokens (cache hit) | 100% |
| Similar question | 1500 tokens | 0 tokens (semantic match) | 100% |
| New question, known project | 3500 tokens | 1200 tokens | 65% |
| Long conversation (10+ turns) | 12000 tokens | 4000 tokens | 67% |

**Real-world average:** 50-70% reduction in token consumption

---

## Why No Vector DB?

For your scale (single user, <1000 items):

| Approach | Query Time | Setup | Overhead |
|----------|-----------|-------|----------|
| In-memory cosine sim | ~5ms | None | None |
| SQLite + embeddings | ~10ms | None | None |
| Qdrant/Chroma | ~2ms | Docker container | 500MB+ RAM |

**Verdict:** A vector DB adds complexity without meaningful benefit at your scale.
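
To sanity-check the in-memory latency figure yourself, here is a small benchmark sketch (it assumes 384-dimensional embeddings and NumPy; it is not the service's code):

```python
import time

import numpy as np

rng = np.random.default_rng(0)
items = rng.normal(size=(1000, 384)).astype(np.float32)   # 1000 fake skill embeddings
items /= np.linalg.norm(items, axis=1, keepdims=True)
query = rng.normal(size=384).astype(np.float32)
query /= np.linalg.norm(query)

start = time.perf_counter()
scores = items @ query                       # cosine similarity (pre-normalized vectors)
top3 = np.argsort(scores)[-3:][::-1]         # indices of the 3 best matches
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"top-3 indices: {top3}, elapsed: {elapsed_ms:.2f} ms")
```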

---

## New Endpoints

| Endpoint | Purpose |
|----------|---------|
| `POST /cache/semantic-lookup` | Find similar cached responses |
| `POST /cache/semantic-store` | Store with embedding for matching |
| `GET /context/rag?query=...` | RAG-based context selection |
| `POST /compress` | Summarize conversation history |
| `GET /tokens/count?text=...` | Count tokens in text |
| `GET /cache/stats` | Cache statistics |
| `POST /cache/clear-old` | Clean up old cache entries |

---

## System Prompt for Agents

```markdown
## Token Efficiency Protocol

You have access to local infrastructure that reduces API usage:

**Before responding to any request:**
1. Call `POST /cache/semantic-lookup` with the user's prompt
2. If hit (similarity >= 0.85), return cached response directly
3. If miss, call `GET /context/rag?query={prompt}` for relevant context only

**For long conversations:**
1. Call `POST /compress` every 5+ turns
2. Use compressed history for subsequent requests

**After providing valuable responses:**
1. Call `POST /cache/semantic-store` to cache for future
2. Call `skills/create_skill` if it's a reusable pattern

**Token budget awareness:**
- Keep responses concise
- Don't repeat injected context
- Reference skills by ID when possible

This infrastructure saves 50-70% on token consumption.
```