# Discord Voice Transcript Implementation Plan

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────────┐
│                      Discord Voice Channel                      │
│   ┌────────┐  ┌────────┐  ┌────────┐                            │
│   │ Player │  │ Player │  │ Player │                            │
│   └───┬────┘  └───┬────┘  └───┬────┘                            │
└───────┼───────────┼───────────┼─────────────────────────────────┘
        │           │           │
        ▼           ▼           ▼
┌─────────────────────────────────────────────────────────────────┐
│           Discord Bot (discord.js + @discordjs/voice)           │
│  - Joins voice channel on /session start                        │
│  - Subscribes to each user's audio stream                       │
│  - Saves per-user .webm files to temp storage                   │
└─────────────────────────────────────────────────────────────────┘
        │
        │ (on /session stop)
        ▼
┌─────────────────────────────────────────────────────────────────┐
│       Whisper Service (Docker container on Ubuntu server)       │
│  - whisper.cpp or openai/whisper                                │
│  - Receives audio files via HTTP POST                           │
│  - Returns JSON with timestamps + transcript                    │
└─────────────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────────────────┐
│                         DnD Hub Server                          │
│  - Maps Discord user ID → character (existing logic)            │
│  - Stores voice segments in transcript_segments table           │
│  - Marks segments with source='voice'                           │
│  - AI recap includes both text + voice transcripts              │
└─────────────────────────────────────────────────────────────────┘
```

---

## Phase 1: Database Schema (2-3 hours)

### Migration: `apps/server/src/db/migrations/010_voice_transcripts.sql`

```sql
-- Add source column to distinguish text vs voice
ALTER TABLE transcript_segments ADD COLUMN source TEXT DEFAULT 'text';

-- Table for voice recording metadata
CREATE TABLE voice_recordings (
  id INTEGER PRIMARY KEY,
  session_id INTEGER NOT NULL,
  discord_user_id TEXT NOT NULL,
  file_path TEXT NOT NULL,
  duration_ms INTEGER,
  recorded_at TEXT NOT NULL,
  processed_at TEXT,
  processing_status TEXT DEFAULT 'pending',
  FOREIGN KEY (session_id) REFERENCES sessions(id)
);
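-- Example (sketch, not part of the migration): with source populated, a recap
-- query can interleave text and voice segments in one pass; started_at is
-- assumed from the existing transcript_segments schema:
--   SELECT text, source FROM transcript_segments
--   WHERE session_id = ?
--   ORDER BY started_at;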
-- Indexes for efficient lookups
CREATE INDEX idx_voice_recordings_session ON voice_recordings(session_id);
CREATE INDEX idx_transcript_source ON transcript_segments(source);
```

**Time:** 30 min schema design + 30 min migration + 1 hr testing

---

## Phase 2: Whisper Docker Service (3-4 hours)

### Option A: whisper.cpp (recommended for performance)

```yaml
# Add to docker-compose.yml
services:
  whisper:
    image: ghcr.io/ggerganov/whisper.cpp:server
    ports:
      - "10888:10888"
    volumes:
      - ./whisper-models:/models
    environment:
      - WHISPER_MODEL=/models/ggml-large-v3.bin
    restart: unless-stopped
```

### Option B: OpenAI Whisper (simpler)

```yaml
whisper:
  image: ghcr.io/openai/whisper:latest
  ports:
    - "10888:10888"
  volumes:
    - ./whisper-audio:/audio
```

**Setup steps:**

1. Add whisper service to `docker-compose.yml`
2. Download model (`ggml-large-v3.bin`, ~3 GB, best accuracy)
3. Test API endpoint: `POST /inference` with audio file
4. Create `apps/server/src/services/whisperClient.ts`

**whisperClient.ts:**

```typescript
import FormData from 'form-data';
import fetch from 'node-fetch';

const WHISPER_URL = process.env.WHISPER_BASE_URL ??
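  // Fallback below assumes the docker-compose service name. For a quick smoke
  // test of the endpoint (step 3 above), curl it from the host; file name and
  // host port are examples:
  //   curl -F file=@sample.webm -F response_format=verbose_json \
  //        http://localhost:10888/inference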
  'http://whisper:10888';

export async function transcribeAudio(audioBuffer: Buffer, speakerId: string) {
  const form = new FormData();
  form.append('file', audioBuffer, { filename: `${speakerId}.webm` });
  form.append('model', 'large-v3');
  form.append('response_format', 'verbose_json');
  form.append('word_timestamps', 'true');

  const res = await fetch(`${WHISPER_URL}/inference`, { method: 'POST', body: form });
  return res.json(); // { text, segments: [{start, end, text}] }
}
```

**Time:** 2 hrs Docker setup + 1-2 hrs client + testing

---

## Phase 3: Voice Recording in Discord Bot (5-7 hours)

### Update `discordService.ts` with voice intents:

```typescript
import { joinVoiceChannel, VoiceConnectionStatus } from '@discordjs/voice';

// Add GuildVoiceStates to the existing intents
const client = new Client({
  intents: [
    GatewayIntentBits.Guilds,
    GatewayIntentBits.GuildMembers,
    GatewayIntentBits.GuildVoiceStates,
    GatewayIntentBits.GuildMessages
  ]
});
```

### New service: `apps/server/src/services/voiceRecorder.ts`

```typescript
import {
  joinVoiceChannel,
  EndBehaviorType,
  VoiceConnectionStatus,
  VoiceConnection,
  AudioReceiveStream
} from '@discordjs/voice';
import { createWriteStream, readFileSync, mkdirSync, WriteStream } from 'fs';
import path from 'path';
import { db } from '../db/client.js';
import { transcribeAudio } from './whisperClient.js';
import { sessionService } from './sessionService.js'; // existing service (path assumed)

const RECORDINGS_DIR = path.resolve('./data/voice-recordings');

class VoiceRecorder {
  private connection: VoiceConnection | null = null;
  private recordingStreams = new Map<string, WriteStream>();
  private sessionId: number | null = null;

  async joinChannel(guildId: string, channelId: string, sessionId: number) {
    this.sessionId = sessionId;
    this.connection = joinVoiceChannel({
      channelId,
      guildId,
      adapterCreator: getAdapter() // the guild's voiceAdapterCreator from discord.js
    });

    this.connection.on(VoiceConnectionStatus.Ready, () => {
      console.log(`Voice connected for session ${sessionId}`);
    });

    // Subscribe to each user the first time they speak
    this.connection.receiver.speaking.on('start', (userId) => {
      if (!this.recordingStreams.has(userId)) {
        const stream = this.connection!.receiver.subscribe(userId, {
          end: { behavior: EndBehaviorType.Manual }
        });
        this.startRecording(userId, stream);
      }
    });
  }
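  // NOTE (assumption): the subscribed stream emits raw Opus packets, which are
  // not a playable .webm on their own. Before Whisper can decode the files,
  // wrap the packets in a container on the way to disk, e.g. with the
  // prism-media package (sketch; the file would then be .ogg):
  //   stream
  //     .pipe(new prism.opus.OggLogicalBitstream({
  //       opusHead: new prism.opus.OpusHead({ channelCount: 2, sampleRate: 48000 }),
  //       pageSizeControl: { maxPackets: 10 }
  //     }))
  //     .pipe(file);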
  private startRecording(userId: string, stream: AudioReceiveStream) {
    mkdirSync(RECORDINGS_DIR, { recursive: true });
    const filePath = path.join(RECORDINGS_DIR, `${this.sessionId}_${userId}.webm`);
    const file = createWriteStream(filePath);
    stream.pipe(file);
    this.recordingStreams.set(userId, file);

    // Track in DB
    db.prepare(`
      INSERT INTO voice_recordings (session_id, discord_user_id, file_path, recorded_at)
      VALUES (?, ?, ?, datetime('now'))
    `).run(this.sessionId, userId, filePath);
  }

  async stopRecording() {
    // Close all file streams
    for (const stream of this.recordingStreams.values()) {
      stream.end();
    }
    this.recordingStreams.clear();

    // Leave voice channel
    this.connection?.destroy();
    this.connection = null;
  }

  async processRecordings(sessionId: number) {
    const recordings = db.prepare(`
      SELECT * FROM voice_recordings
      WHERE session_id = ? AND processing_status = 'pending'
    `).all(sessionId);

    for (const rec of recordings) {
      await this.processSingleRecording(rec);
    }
  }

  private async processSingleRecording(recording: any) {
    const audioBuffer = readFileSync(recording.file_path);
    const result = await transcribeAudio(audioBuffer, recording.discord_user_id);

    // Store transcript segments
    for (const segment of result.segments) {
      sessionService.appendSegment({
        sessionId: recording.session_id,
        guildId: /* get from session */,
        discordUserId: recording.discord_user_id,
        text: segment.text,
        startedAt: segment.start,
        endedAt: segment.end,
        confidence: segment.confidence,
        source: 'voice'
      });
    }

    // Mark as processed
    db.prepare(`
      UPDATE voice_recordings
      SET processing_status = 'completed', processed_at = datetime('now')
      WHERE id = ?
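      -- (sketch) a catch branch in processSingleRecording could run a similar
      -- UPDATE with processing_status = 'failed', so broken files can be
      -- retried instead of staying 'pending' forever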
    `).run(recording.id);
  }
}

export const voiceRecorder = new VoiceRecorder();
```

**Time:** 4-5 hrs discord.js voice API + 2 hrs file handling + 1 hr testing

---

## Phase 4: Discord Command Integration (2-3 hours)

### Update `/session start` command:

```typescript
// In discordService.ts handleSessionCommand
if (sub === 'start') {
  const result = sessionService.startSession(guildId, user.id);

  // Get the voice channel the user is in
  const member = await interaction.guild.members.fetch(interaction.user.id);
  const voiceChannel = member.voice.channel;

  if (voiceChannel) {
    await voiceRecorder.joinChannel(
      interaction.guildId!,
      voiceChannel.id,
      result.sessionId
    );
  }

  await interaction.reply({
    content: `Started session #${result.sessionId}. ${voiceChannel ? '🎤 Recording voice' : '📝 Text only'}`,
    ephemeral: false
  });
}
```

### Update `/session stop` command:

```typescript
if (sub === 'stop') {
  await voiceRecorder.stopRecording();
  // NOTE: transcription can outlast the interaction reply window; consider
  // interaction.deferReply() here, or kick off processing after replying
  await voiceRecorder.processRecordings(active.id);
  sessionService.stopSession(guildId, user.id);

  await interaction.reply({
    content: `Stopped session #${active.id}. Processing voice transcripts...`,
    ephemeral: false
  });
}
```

**Time:** 2 hrs integration + 1 hr testing

---

## Phase 5: UI for Voice Transcripts (4-6 hours)

### Update `CampaignDetailPage.tsx`:

```tsx
// Add filter toggle for transcript sources
const [showVoice, setShowVoice] = useState(true);
const [showText, setShowText] = useState(true);

// Filter segments
const filteredSegments = segments.filter(s =>
  (s.source === 'voice' && showVoice) ||
  (s.source === 'text' && showText)
);

// Add audio player for voice segments
{segment.source === 'voice' && (