# Discord Voice Transcript Implementation Plan

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────────┐
│                     Discord Voice Channel                       │
│      ┌────────┐      ┌────────┐      ┌────────┐                 │
│      │ Player │      │ Player │      │ Player │                 │
│      └───┬────┘      └───┬────┘      └───┬────┘                 │
│          │               │               │                      │
└──────────┼───────────────┼───────────────┼──────────────────────┘
           │               │               │
           ▼               ▼               ▼
┌─────────────────────────────────────────────────────────────────┐
│           Discord Bot (discord.js + @discordjs/voice)           │
│  - Joins voice channel on /session start                        │
│  - Subscribes to each user's audio stream                       │
│  - Saves per-user .webm files to temp storage                   │
└─────────────────────────────────────────────────────────────────┘
           │
           │ (on /session stop)
           ▼
┌─────────────────────────────────────────────────────────────────┐
│       Whisper Service (Docker container on Ubuntu server)       │
│  - whisper.cpp or openai/whisper                                │
│  - Receives audio files via HTTP POST                           │
│  - Returns JSON with timestamps + transcript                    │
└─────────────────────────────────────────────────────────────────┘
           │
           ▼
┌─────────────────────────────────────────────────────────────────┐
│                         DnD Hub Server                          │
│  - Maps Discord user ID → character (existing logic)            │
│  - Stores voice segments in transcript_segments table           │
│  - Marks segments with source='voice'                           │
│  - AI recap includes both text + voice transcripts              │
└─────────────────────────────────────────────────────────────────┘
```

---

## Phase 1: Database Schema (2-3 hours)

### Migration: `apps/server/src/db/migrations/010_voice_transcripts.sql`

```sql
-- Add source column to distinguish text vs voice segments
ALTER TABLE transcript_segments ADD COLUMN source TEXT DEFAULT 'text';

-- Metadata for per-user voice recordings
CREATE TABLE voice_recordings (
  id INTEGER PRIMARY KEY,
  session_id INTEGER NOT NULL,
  discord_user_id TEXT NOT NULL,
  file_path TEXT NOT NULL,
  duration_ms INTEGER,
  recorded_at TEXT NOT NULL,
  processed_at TEXT,
  processing_status TEXT DEFAULT 'pending',
  FOREIGN KEY (session_id) REFERENCES sessions(id)
);

-- Indexes for efficient lookups
CREATE INDEX idx_voice_recordings_session ON voice_recordings(session_id);
CREATE INDEX idx_transcript_source ON transcript_segments(source);
```

**Time:** 30 min schema design + 30 min migration + 1 hr testing
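
With the `source` column in place, text and voice segments live in one table and can be interleaved chronologically when rendering a session. A minimal sketch of that merge in plain TypeScript, using sample data; the `Segment` shape is a hypothetical mirror of the columns above:

```typescript
// Hypothetical row shape mirroring transcript_segments columns.
interface Segment {
  startedAt: string; // ISO-8601 timestamp
  source: 'text' | 'voice';
  speaker: string;
  text: string;
}

// Merge segments from both sources into one chronological transcript.
// ISO-8601 strings sort correctly with plain string comparison.
function interleave(segments: Segment[]): Segment[] {
  return [...segments].sort((a, b) => a.startedAt.localeCompare(b.startedAt));
}

const merged = interleave([
  { startedAt: '2024-05-01T19:02:10Z', source: 'voice', speaker: 'Thia', text: 'I check the door for traps.' },
  { startedAt: '2024-05-01T19:01:55Z', source: 'text', speaker: 'DM', text: 'You reach a locked door.' },
]);
console.log(merged.map(s => `${s.source}:${s.speaker}`).join(','));
// → "text:DM,voice:Thia"
```

The string comparison only works because both sources store the same timestamp format, which is worth enforcing in the migration.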

---

## Phase 2: Whisper Docker Service (3-4 hours)

### Option A: whisper.cpp (recommended for performance)

```yaml
# Add to docker-compose.yml
services:
  whisper:
    image: ghcr.io/ggerganov/whisper.cpp:server
    ports:
      - "10888:10888"
    volumes:
      - ./whisper-models:/models
    environment:
      - WHISPER_MODEL=/models/ggml-large-v3.bin
    restart: unless-stopped
```

### Option B: OpenAI Whisper (simpler)

```yaml
# Note: verify this image name before relying on it — OpenAI does not
# publish an official Whisper container, so a community image (or a
# locally built one) may be needed here.
whisper:
  image: ghcr.io/openai/whisper:latest
  ports:
    - "10888:10888"
  volumes:
    - ./whisper-audio:/audio
```

**Setup steps:**
1. Add the whisper service to `docker-compose.yml`
2. Download the model (`ggml-large-v3.bin`, ~3 GB, best accuracy)
3. Test the API endpoint: `POST /inference` with an audio file
4. Create `apps/server/src/services/whisperClient.ts`

**whisperClient.ts:**
```typescript
import FormData from 'form-data';
import fetch from 'node-fetch';

const WHISPER_URL = process.env.WHISPER_BASE_URL ?? 'http://whisper:10888';

export async function transcribeAudio(audioBuffer: Buffer, speakerId: string) {
  const form = new FormData();
  form.append('file', audioBuffer, { filename: `${speakerId}.webm` });
  form.append('model', 'large-v3');
  form.append('response_format', 'verbose_json');
  form.append('word_timestamps', 'true');

  const res = await fetch(`${WHISPER_URL}/inference`, {
    method: 'POST',
    body: form
  });
  if (!res.ok) {
    throw new Error(`Whisper request failed: ${res.status} ${res.statusText}`);
  }

  return res.json(); // { text, segments: [{ start, end, text }] }
}
```

**Time:** 2 hrs Docker setup + 1-2 hrs client + testing
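
The exact shape of a `verbose_json` response varies between Whisper servers, so it is worth normalizing it at the client boundary before anything downstream touches the database. A sketch under that assumption — the field names here follow the return comment above and must be checked against the real server output:

```typescript
// Assumed raw response shape; verify against the actual Whisper server.
interface WhisperSegment { start: number; end: number; text: string }
interface WhisperResponse { text: string; segments?: WhisperSegment[] }

// Normalize: trim segment text, drop empty segments, and always return
// an array even when the server omits `segments` entirely.
function normalizeTranscript(raw: WhisperResponse): WhisperSegment[] {
  return (raw.segments ?? [])
    .map(s => ({ ...s, text: s.text.trim() }))
    .filter(s => s.text.length > 0);
}

const segs = normalizeTranscript({
  text: ' Hello there. ',
  segments: [
    { start: 0.0, end: 1.2, text: ' Hello there. ' },
    { start: 1.2, end: 1.5, text: '   ' },
  ],
});
console.log(segs.length, segs[0].text); // → 1 "Hello there."
```

Doing this once in `whisperClient.ts` keeps Phase 3's `processSingleRecording` free of defensive checks.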

---

## Phase 3: Voice Recording in Discord Bot (5-7 hours)

### Update `discordService.ts` with voice intents:

```typescript
import { joinVoiceChannel, VoiceConnectionStatus } from '@discordjs/voice';
import type { AudioReceiveStream } from '@discordjs/voice';

// Add GuildVoiceStates to the intents
const client = new Client({
  intents: [
    GatewayIntentBits.Guilds,
    GatewayIntentBits.GuildMembers,
    GatewayIntentBits.GuildVoiceStates,
    GatewayIntentBits.GuildMessages
  ]
});
```

### New service: `apps/server/src/services/voiceRecorder.ts`

```typescript
import {
  joinVoiceChannel,
  EndBehaviorType,
  VoiceConnectionStatus
} from '@discordjs/voice';
import type { VoiceConnection, AudioReceiveStream } from '@discordjs/voice';
import { createWriteStream, mkdirSync, readFileSync, type WriteStream } from 'fs';
import path from 'path';
import { db } from '../db/client.js';
import { sessionService } from './sessionService.js';
import { transcribeAudio } from './whisperClient.js';

const RECORDINGS_DIR = path.resolve('./data/voice-recordings');

class VoiceRecorder {
  private connection: VoiceConnection | null = null;
  private recordingStreams = new Map<string, WriteStream>();
  private sessionId: number | null = null;

  async joinChannel(guildId: string, channelId: string, sessionId: number) {
    this.sessionId = sessionId;
    this.connection = joinVoiceChannel({
      channelId,
      guildId,
      adapterCreator: getAdapter() // guild.voiceAdapterCreator from discord.js
    });

    this.connection.on(VoiceConnectionStatus.Ready, () => {
      console.log(`Voice connected for session ${sessionId}`);
    });

    // Subscribe to each user's audio when they first start speaking
    const receiver = this.connection.receiver;
    receiver.speaking.on('start', (userId) => {
      if (this.recordingStreams.has(userId)) return;
      const stream = receiver.subscribe(userId, {
        end: { behavior: EndBehaviorType.Manual }
      });
      this.startRecording(userId, stream);
    });
  }

  private startRecording(userId: string, stream: AudioReceiveStream) {
    const filePath = path.join(RECORDINGS_DIR, `${this.sessionId}_${userId}.webm`);
    mkdirSync(RECORDINGS_DIR, { recursive: true });

    // Note: the receiver emits raw Opus packets; they need to be wrapped in
    // a container (e.g. via prism-media) to produce a playable .webm file.
    const file = createWriteStream(filePath);
    stream.pipe(file);
    this.recordingStreams.set(userId, file);

    // Track in DB
    db.prepare(`
      INSERT INTO voice_recordings (session_id, discord_user_id, file_path, recorded_at)
      VALUES (?, ?, ?, datetime('now'))
    `).run(this.sessionId, userId, filePath);
  }

  async stopRecording() {
    // Close all file streams
    for (const stream of this.recordingStreams.values()) {
      stream.end();
    }
    this.recordingStreams.clear();

    // Leave voice channel
    this.connection?.destroy();
    this.connection = null;
  }

  async processRecordings(sessionId: number) {
    const recordings = db.prepare(`
      SELECT * FROM voice_recordings
      WHERE session_id = ? AND processing_status = 'pending'
    `).all(sessionId);

    for (const rec of recordings) {
      await this.processSingleRecording(rec);
    }
  }

  private async processSingleRecording(recording: any) {
    const audioBuffer = readFileSync(recording.file_path);
    const result = await transcribeAudio(audioBuffer, recording.discord_user_id);

    // Store transcript segments
    for (const segment of result.segments) {
      sessionService.appendSegment({
        sessionId: recording.session_id,
        guildId: /* get from session */,
        discordUserId: recording.discord_user_id,
        text: segment.text,
        startedAt: segment.start,
        endedAt: segment.end,
        confidence: segment.confidence,
        source: 'voice'
      });
    }

    // Mark as processed
    db.prepare(`
      UPDATE voice_recordings
      SET processing_status = 'completed', processed_at = datetime('now')
      WHERE id = ?
    `).run(recording.id);
  }
}

export const voiceRecorder = new VoiceRecorder();
```

**Time:** 4-5 hrs discord.js voice API + 2 hrs file handling + 1 hr testing
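
One detail to settle before wiring `processSingleRecording` into the DB: Whisper segment times are seconds relative to the start of the audio file, while `started_at` on a transcript segment presumably holds a wall-clock timestamp. A small conversion helper, assuming `recorded_at` marks the moment the recording file began:

```typescript
// Convert a Whisper segment offset (seconds from file start) into an
// absolute ISO-8601 timestamp. Assumes `recordedAt` is the wall-clock
// time at which the recording started.
function toAbsoluteTimestamp(recordedAt: string, offsetSeconds: number): string {
  const base = new Date(recordedAt).getTime();
  return new Date(base + Math.round(offsetSeconds * 1000)).toISOString();
}

const startedAt = toAbsoluteTimestamp('2024-05-01T19:00:00.000Z', 12.34);
console.log(startedAt); // → "2024-05-01T19:00:12.340Z"
```

Without this step, voice segments would sort by raw file offset and interleave incorrectly with the text segments' timestamps.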

---

## Phase 4: Discord Command Integration (2-3 hours)

### Update `/session start` command:

```typescript
// In discordService.ts handleSessionCommand
if (sub === 'start') {
  const result = sessionService.startSession(guildId, user.id);

  // Get the voice channel the user is in
  const member = await interaction.guild.members.fetch(interaction.user.id);
  const voiceChannel = member.voice.channel;

  if (voiceChannel) {
    await voiceRecorder.joinChannel(
      interaction.guildId!,
      voiceChannel.id,
      result.sessionId
    );
  }

  await interaction.reply({
    content: `Started session #${result.sessionId}. ${voiceChannel ? '🎤 Recording voice' : '📝 Text only'}`,
    ephemeral: false
  });
}
```

### Update `/session stop` command:

```typescript
if (sub === 'stop') {
  // `active` is the active session record looked up earlier in the handler
  await voiceRecorder.stopRecording();
  await voiceRecorder.processRecordings(active.id);
  sessionService.stopSession(guildId, user.id);
  await interaction.reply({ content: `Stopped session #${active.id}. Processing voice transcripts...`, ephemeral: false });
}
```

**Time:** 2 hrs integration + 1 hr testing

---

## Phase 5: UI for Voice Transcripts (4-6 hours)

### Update `CampaignDetailPage.tsx`:

```tsx
// Add filter toggles for transcript sources
const [showVoice, setShowVoice] = useState(true);
const [showText, setShowText] = useState(true);

// Filter segments
const filteredSegments = segments.filter(s =>
  (s.source === 'voice' && showVoice) || (s.source === 'text' && showText)
);

// Add an audio player for voice segments
{segment.source === 'voice' && (
  <audio controls src={`/api/voice/${segment.recording_id}`} />
)}
```

### New API endpoint: `apps/server/src/routes/voiceRoutes.ts`

```typescript
import path from 'path';
import { Router } from 'express';
import { db } from '../db/client.js';
import { requireAuth } from '../middleware.js';

export const voiceRoutes = Router();

voiceRoutes.get('/voice/:recordingId', requireAuth, (req, res) => {
  const recording = db
    .prepare('SELECT * FROM voice_recordings WHERE id = ?')
    .get(req.params.recordingId) as { file_path: string } | undefined;
  if (!recording) return res.status(404).send('Not found');

  res.sendFile(path.resolve(recording.file_path));
});
```

**Time:** 3 hrs UI components + 2 hrs API + 1 hr testing

---

## Phase 6: AI Recap Integration (2-3 hours)

### Update `recapService.ts`:

```typescript
// In buildContext, include voice segments alongside text
const segments = db
  .prepare(`
    SELECT sequence, speaker_label_snapshot AS speakerLabel,
           text, started_at AS startedAt, source
    FROM transcript_segments
    WHERE session_id = ?
    ORDER BY sequence ASC
  `)
  .all(sessionId);

// Update the prompt to mention voice transcripts
const prompt = `
Summarize this DnD session transcript.
Note: Some segments are from voice transcription (may have minor errors).

${segments.map(s =>
  `[${s.startedAt}] ${s.speakerLabel}: ${s.text}${s.source === 'voice' ? ' [voice]' : ''}`
).join('\n')}
`;
```

**Time:** 1-2 hrs updates + 1 hr testing
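
The inline template above is easy to get subtly wrong (stray `[voice]` tags on text segments, missing separators), so pulling the per-segment formatting into a pure function makes it unit-testable. A sketch using the aliased column names from the query above:

```typescript
// Row shape matching the SELECT aliases in the recap query above.
interface RecapSegment {
  startedAt: string;
  speakerLabel: string;
  text: string;
  source: 'text' | 'voice';
}

// Format one transcript line for the recap prompt; voice lines are tagged
// so the model knows they may contain transcription errors.
function formatSegment(s: RecapSegment): string {
  const tag = s.source === 'voice' ? ' [voice]' : '';
  return `[${s.startedAt}] ${s.speakerLabel}: ${s.text}${tag}`;
}

const line = formatSegment({
  startedAt: '19:02:10',
  speakerLabel: 'Thia',
  text: 'I check the door.',
  source: 'voice',
});
console.log(line); // → "[19:02:10] Thia: I check the door. [voice]"
```

The prompt body then becomes `segments.map(formatSegment).join('\n')`.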

---

## Summary Timeline

| Phase | Description | Hours | Cumulative |
|-------|-------------|-------|------------|
| 1 | DB schema + migration | 2-3 | 2-3 |
| 2 | Whisper Docker service | 3-4 | 5-7 |
| 3 | Voice recording service | 5-7 | 10-14 |
| 4 | Discord command integration | 2-3 | 12-17 |
| 5 | UI for voice transcripts | 4-6 | 16-23 |
| 6 | AI recap integration | 2-3 | 18-26 |

**Total: 18-26 hours** (roughly 3-4 full days)

---

## Dependencies to Install

```bash
# Server workspace
npm install @discordjs/voice form-data node-fetch -w @dnd-hub/server
npm install -D @types/node-fetch -w @dnd-hub/server
```

---

## Next Steps (Where to Pick Up)

1. **Phase 1**: Create `apps/server/src/db/migrations/010_voice_transcripts.sql`
2. **Phase 2**: Update `docker-compose.yml` with the whisper service
3. **Phase 2**: Create `apps/server/src/services/whisperClient.ts`

Then proceed through phases 3-6 sequentially.

---

## Key Decisions Made

- **Transcription**: whisper.cpp via Docker (local, free, runs on existing Ubuntu server)
- **Audio format**: per-user .webm tracks (better speaker separation)
- **Processing**: batch on /session stop (not real-time)
- **Storage**: voice files in `apps/server/data/voice-recordings/`
- **DB**: single `transcript_segments` table with a `source` column (voice vs text)