Discord Voice Transcript Implementation Plan
Architecture Overview
┌─────────────────────────────────────────────────────────────────┐
│ Discord Voice Channel │
│ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │ Player │ │ Player │ │ Player │ │
│ └───┬────┘ └───┬────┘ └───┬────┘ │
│ │ │ │ │
└───────┼───────────┼───────────┼─────────────────────────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────┐
│ Discord Bot (discord.js + @discordjs/voice) │
│ - Joins voice channel on /session start │
│ - Subscribes to each user's audio stream │
│ - Saves per-user .webm files to temp storage │
└─────────────────────────────────────────────────────────────────┘
│
│ (on /session stop)
▼
┌─────────────────────────────────────────────────────────────────┐
│ Whisper Service (Docker container on Ubuntu server) │
│ - whisper.cpp or openai/whisper │
│ - Receives audio files via HTTP POST │
│ - Returns JSON with timestamps + transcript │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ DnD Hub Server │
│ - Maps Discord user ID → character (existing logic) │
│ - Stores voice segments in transcript_segments table │
│ - Marks segments with source='voice' │
│ - AI recap includes both text + voice transcripts │
└─────────────────────────────────────────────────────────────────┘
Phase 1: Database Schema (2-3 hours)
Migration: apps/server/src/db/migrations/010_voice_transcripts.sql
-- Add source column to distinguish text vs voice
ALTER TABLE transcript_segments ADD COLUMN source TEXT DEFAULT 'text';
-- Table for voice recording metadata
CREATE TABLE voice_recordings (
id INTEGER PRIMARY KEY,
session_id INTEGER NOT NULL,
discord_user_id TEXT NOT NULL,
file_path TEXT NOT NULL,
duration_ms INTEGER,
recorded_at TEXT NOT NULL,
processed_at TEXT,
processing_status TEXT DEFAULT 'pending',
FOREIGN KEY (session_id) REFERENCES sessions(id)
);
-- Index for efficient lookups
CREATE INDEX idx_voice_recordings_session ON voice_recordings(session_id);
CREATE INDEX idx_transcript_source ON transcript_segments(source);
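With the `source` column in place, text and voice segments can be interleaved into one session timeline. A minimal sort helper sketching that (names like `interleaveSegments` are hypothetical, not existing code):

```typescript
// Hypothetical helper: merge text and voice segments into one chronological
// timeline. Assumes each row carries an ISO-8601 startedAt and a source tag.
interface Segment {
  startedAt: string;
  source: 'text' | 'voice';
  text: string;
}

export function interleaveSegments(segments: Segment[]): Segment[] {
  // ISO-8601 timestamps sort chronologically as plain strings.
  return [...segments].sort((a, b) => a.startedAt.localeCompare(b.startedAt));
}
```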
Time: 30 min schema design + 30 min migration + 1 hr testing
Phase 2: Whisper Docker Service (3-4 hours)
Option A: whisper.cpp (recommended for performance)
# Add to docker-compose.yml
services:
whisper:
image: ghcr.io/ggerganov/whisper.cpp:server
ports:
- "10888:10888"
volumes:
- ./whisper-models:/models
environment:
- WHISPER_MODEL=/models/ggml-large-v3.bin
restart: unless-stopped
Option B: OpenAI Whisper (simpler)
whisper:
image: ghcr.io/openai/whisper:latest
ports:
- "10888:10888"
volumes:
- ./whisper-audio:/audio
Setup steps:
- Add whisper service to docker-compose.yml
- Download model (ggml-large-v3.bin, ~3 GB, best accuracy)
- Test API endpoint: POST /inference with an audio file
- Create apps/server/src/services/whisperClient.ts
whisperClient.ts:
import FormData from 'form-data';
import fetch from 'node-fetch';
const WHISPER_URL = process.env.WHISPER_BASE_URL ?? 'http://whisper:10888';
export async function transcribeAudio(audioBuffer: Buffer, speakerId: string) {
const form = new FormData();
form.append('file', audioBuffer, { filename: `${speakerId}.webm` });
form.append('model', 'large-v3');
form.append('response_format', 'verbose_json');
form.append('word_timestamps', 'true');
const res = await fetch(`${WHISPER_URL}/inference`, {
method: 'POST',
body: form
});
return res.json(); // { text, segments: [{start, end, text}] }
}
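Whisper's `verbose_json` segments carry `start`/`end` in seconds relative to the audio file, while text segments use absolute timestamps. A small conversion helper (hypothetical name, assuming `recorded_at` is an ISO timestamp as in the Phase 1 schema) keeps both sources on one clock:

```typescript
// Hypothetical helper: convert whisper's file-relative segment times into
// absolute ISO timestamps using the recording's recorded_at value, so voice
// segments line up with text segments in the transcript.
interface WhisperSegment {
  start: number; // seconds from start of audio file
  end: number;
  text: string;
}

export function toAbsoluteTimestamps(
  recordedAtIso: string,
  segments: WhisperSegment[]
): { startedAt: string; endedAt: string; text: string }[] {
  const base = Date.parse(recordedAtIso);
  return segments.map((s) => ({
    startedAt: new Date(base + s.start * 1000).toISOString(),
    endedAt: new Date(base + s.end * 1000).toISOString(),
    text: s.text.trim(),
  }));
}
```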
Time: 2 hrs Docker setup + 1-2 hrs client + testing
Phase 3: Voice Recording in Discord Bot (5-7 hours)
Update discordService.ts with voice intents:
import { joinVoiceChannel, VoiceConnectionStatus } from '@discordjs/voice';
import { Client, GatewayIntentBits } from 'discord.js';
// Add to intents
const client = new Client({
intents: [
GatewayIntentBits.Guilds,
GatewayIntentBits.GuildMembers,
GatewayIntentBits.GuildVoiceStates,
GatewayIntentBits.GuildMessages
]
});
New service: apps/server/src/services/voiceRecorder.ts
import { joinVoiceChannel, VoiceConnection, VoiceConnectionStatus, EndBehaviorType } from '@discordjs/voice';
import type { AudioReceiveStream } from '@discordjs/voice';
import { db } from '../db/client.js';
import { createWriteStream, mkdirSync, readFileSync, WriteStream } from 'fs';
import path from 'path';
import { transcribeAudio } from './whisperClient.js';
import { sessionService } from './sessionService.js'; // existing session logic (path assumed)
const RECORDINGS_DIR = path.resolve('./data/voice-recordings');
class VoiceRecorder {
private connection: VoiceConnection | null = null;
private recordingStreams = new Map<string, WriteStream>();
private sessionId: number | null = null;
async joinChannel(guildId: string, channelId: string, sessionId: number) {
this.sessionId = sessionId;
this.connection = joinVoiceChannel({
channelId,
guildId,
adapterCreator: getAdapter() // placeholder: supply the guild's voiceAdapterCreator from the discord.js client
});
this.connection.on(VoiceConnectionStatus.Ready, () => {
console.log(`Voice connected for session ${sessionId}`);
});
// Subscribe to each user as they start speaking
const receiver = this.connection.receiver;
receiver.speaking.on('start', (userId) => {
if (this.recordingStreams.has(userId)) return; // already recording this user
const stream = receiver.subscribe(userId, {
end: { behavior: EndBehaviorType.Manual } // keep the stream open until /session stop
});
this.startRecording(userId, stream);
});
}
private startRecording(userId: string, stream: AudioReceiveStream) {
mkdirSync(RECORDINGS_DIR, { recursive: true });
const filePath = path.join(RECORDINGS_DIR, `${this.sessionId}_${userId}.webm`);
// NOTE: the receiver emits raw Opus packets; wrap them in a container
// (e.g. prism-media's Ogg Opus writer) so whisper can decode the file.
const file = createWriteStream(filePath);
stream.pipe(file);
this.recordingStreams.set(userId, file);
// Track in DB
db.prepare(`
INSERT INTO voice_recordings (session_id, discord_user_id, file_path, recorded_at)
VALUES (?, ?, ?, datetime('now'))
`).run(this.sessionId, userId, filePath);
}
async stopRecording() {
// Close all file streams
for (const stream of this.recordingStreams.values()) {
stream.end();
}
this.recordingStreams.clear();
// Leave voice channel
this.connection?.destroy();
this.connection = null;
}
async processRecordings(sessionId: number) {
const recordings = db.prepare(`
SELECT * FROM voice_recordings
WHERE session_id = ? AND processing_status = 'pending'
`).all(sessionId);
for (const rec of recordings) {
await this.processSingleRecording(rec);
}
}
private async processSingleRecording(recording: any) {
const audioBuffer = readFileSync(recording.file_path);
const result = await transcribeAudio(audioBuffer, recording.discord_user_id);
// Store transcript segments
for (const segment of result.segments) {
sessionService.appendSegment({
sessionId: recording.session_id,
guildId: /* get from session */,
discordUserId: recording.discord_user_id,
text: segment.text,
startedAt: segment.start,
endedAt: segment.end,
confidence: segment.confidence, // verbose_json actually exposes avg_logprob; map it if a confidence is needed
source: 'voice'
});
}
// Mark as processed
db.prepare(`
UPDATE voice_recordings
SET processing_status = 'completed', processed_at = datetime('now')
WHERE id = ?
`).run(recording.id);
}
}
export const voiceRecorder = new VoiceRecorder();
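Since `startRecording` names files `${sessionId}_${userId}.webm`, a parser for that scheme is useful for reconciling orphaned files on disk with `voice_recordings` rows after a crash. A hypothetical helper (not part of the plan's API, Discord user IDs are numeric snowflake strings):

```typescript
// Hypothetical helper: parse `${sessionId}_${userId}.webm` back into its ids.
// Returns null for files that don't match the recording naming scheme.
export function parseRecordingName(
  fileName: string
): { sessionId: number; userId: string } | null {
  const m = /^(\d+)_(\d+)\.webm$/.exec(fileName);
  if (!m) return null;
  return { sessionId: Number(m[1]), userId: m[2] };
}
```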
Time: 4-5 hrs discord.js voice API + 2 hrs file handling + 1 hr testing
Phase 4: Discord Command Integration (2-3 hours)
Update /session start command:
// In discordService.ts handleSessionCommand
if (sub === 'start') {
const result = sessionService.startSession(guildId, user.id);
// Get the voice channel the user is in
const member = await interaction.guild.members.fetch(interaction.user.id);
const voiceChannel = member.voice.channel;
if (voiceChannel) {
await voiceRecorder.joinChannel(
interaction.guildId!,
voiceChannel.id,
result.sessionId
);
}
await interaction.reply({
content: `Started session #${result.sessionId}. ${voiceChannel ? '🎤 Recording voice' : '📝 Text only'}`,
ephemeral: false
});
}
Update /session stop command:
if (sub === 'stop') {
await voiceRecorder.stopRecording();
sessionService.stopSession(guildId, user.id);
// Reply before transcribing: interactions time out after ~3s, and
// processing a full session of audio can take minutes.
await interaction.reply({ content: `Stopped session #${sessionId}. Processing voice transcripts...`, ephemeral: false });
await voiceRecorder.processRecordings(sessionId);
}
Time: 2 hrs integration + 1 hr testing
Phase 5: UI for Voice Transcripts (4-6 hours)
Update CampaignDetailPage.tsx:
// Add filter toggle for transcript sources
const [showVoice, setShowVoice] = useState(true);
const [showText, setShowText] = useState(true);
// Filter segments
const filteredSegments = segments.filter(s =>
(s.source === 'voice' && showVoice) || (s.source === 'text' && showText)
);
// Add audio player for voice segments
{segment.source === 'voice' && (
<audio controls src={`/api/voice/${segment.recording_id}`} />
)}
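The source filter above can be pulled out of the component into a pure helper, which keeps it testable outside React (names hypothetical):

```typescript
// Hypothetical helper: apply the voice/text visibility toggles to a
// transcript segment list. Pure function, so it is trivially unit-testable.
interface TranscriptSegment {
  source: 'text' | 'voice';
  text: string;
}

export function visibleSegments(
  segments: TranscriptSegment[],
  showVoice: boolean,
  showText: boolean
): TranscriptSegment[] {
  return segments.filter(
    (s) =>
      (s.source === 'voice' && showVoice) || (s.source === 'text' && showText)
  );
}
```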
New API endpoint: apps/server/src/routes/voiceRoutes.ts
import { Router } from 'express';
import path from 'path';
import { requireAuth } from '../middleware.js';
import { db } from '../db/client.js';
export const voiceRoutes = Router();
voiceRoutes.get('/voice/:recordingId', requireAuth, (req, res) => {
const recording = db.prepare('SELECT * FROM voice_recordings WHERE id = ?').get(req.params.recordingId);
if (!recording) return res.status(404).send('Not found');
res.sendFile(path.resolve(recording.file_path));
});
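Because the route serves whatever path is stored in the DB, it is worth guarding against a corrupted or tampered `file_path` escaping the recordings directory before calling `res.sendFile`. A minimal containment check (hypothetical helper, directory name taken from Phase 3's `RECORDINGS_DIR`):

```typescript
import path from 'path';

// Hypothetical guard: resolve filePath and return it only if it stays inside
// baseDir; otherwise return null so the route can respond 404.
export function safeRecordingPath(
  baseDir: string,
  filePath: string
): string | null {
  const base = path.resolve(baseDir);
  const resolved = path.resolve(filePath);
  return resolved === base || resolved.startsWith(base + path.sep)
    ? resolved
    : null;
}
```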
Time: 3 hrs UI components + 2 hrs API + 1 hr testing
Phase 6: AI Recap Integration (2-3 hours)
Update recapService.ts:
// In buildContext, include voice segments
const segments = db
.prepare(`
SELECT sequence, speaker_label_snapshot AS speakerLabel,
text, started_at AS startedAt, source
FROM transcript_segments
WHERE session_id = ?
ORDER BY sequence ASC
`)
.all(sessionId);
// Update prompt to mention voice transcripts
const prompt = `
Summarize this DnD session transcript.
Note: Some segments are from voice transcription (may have minor errors).
${segments.map(s =>
`[${s.startedAt}] ${s.speakerLabel}: ${s.text}${s.source === 'voice' ? ' [voice]' : ''}`
).join('\n')}
`;
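The per-segment formatting inside the template can be extracted into a small pure function, which makes the `[voice]` tagging easy to test in isolation (helper name is hypothetical):

```typescript
// Hypothetical helper: render one transcript segment as a prompt line,
// tagging voice-derived segments so the model knows they may contain
// transcription errors.
interface PromptSegment {
  startedAt: string;
  speakerLabel: string;
  text: string;
  source: 'text' | 'voice';
}

export function promptLine(s: PromptSegment): string {
  const tag = s.source === 'voice' ? ' [voice]' : '';
  return `[${s.startedAt}] ${s.speakerLabel}: ${s.text}${tag}`;
}
```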
Time: 1-2 hrs updates + 1 hr testing
Summary Timeline
| Phase | Description | Hours | Cumulative |
|---|---|---|---|
| 1 | DB schema + migration | 2-3 | 2-3 |
| 2 | Whisper Docker service | 3-4 | 5-7 |
| 3 | Voice recording service | 5-7 | 10-14 |
| 4 | Discord command integration | 2-3 | 12-17 |
| 5 | UI for voice transcripts | 4-6 | 16-23 |
| 6 | AI recap integration | 2-3 | 18-26 |
Total: 18-26 hours (3-4 full days)
Dependencies to Install
# Server workspace
npm install @discordjs/voice form-data node-fetch -w @dnd-hub/server
npm install -D @types/node-fetch -w @dnd-hub/server
Next Steps (Where to Pick Up)
- Phase 1: Create apps/server/src/db/migrations/010_voice_transcripts.sql
- Phase 2: Update docker-compose.yml with whisper service
- Phase 2: Create apps/server/src/services/whisperClient.ts
Then proceed through phases 3-6 sequentially.
Key Decisions Made
- Transcription: whisper.cpp via Docker (local, free, runs on existing Ubuntu server)
- Audio format: Per-user .webm tracks (better speaker separation)
- Processing: Batch on /session stop (not real-time)
- Storage: Voice files in apps/server/data/voice-recordings/
- DB: Single transcript_segments table with source column (voice vs text)