dnd-hub/VOICE_TRANSCRIPT_PLAN.md
2026-03-16 22:15:15 -04:00


Discord Voice Transcript Implementation Plan

Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│ Discord Voice Channel                                           │
│   ┌────────┐  ┌────────┐  ┌────────┐                           │
│   │ Player │  │ Player │  │ Player │                           │
│   └───┬────┘  └───┬────┘  └───┬────┘                           │
│       │           │           │                                 │
└───────┼───────────┼───────────┼─────────────────────────────────┘
        │           │           │
        ▼           ▼           ▼
┌─────────────────────────────────────────────────────────────────┐
│ Discord Bot (discord.js + @discordjs/voice)                     │
│   - Joins voice channel on /session start                       │
│   - Subscribes to each user's audio stream                      │
│   - Saves per-user .webm files to temp storage                  │
└─────────────────────────────────────────────────────────────────┘
        │
        │ (on /session stop)
        ▼
┌─────────────────────────────────────────────────────────────────┐
│ Whisper Service (Docker container on Ubuntu server)             │
│   - whisper.cpp or openai/whisper                               │
│   - Receives audio files via HTTP POST                          │
│   - Returns JSON with timestamps + transcript                   │
└─────────────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────────────────┐
│ DnD Hub Server                                                  │
│   - Maps Discord user ID → character (existing logic)           │
│   - Stores voice segments in transcript_segments table          │
│   - Marks segments with source='voice'                          │
│   - AI recap includes both text + voice transcripts             │
└─────────────────────────────────────────────────────────────────┘

Phase 1: Database Schema (2-3 hours)

Migration: apps/server/src/db/migrations/010_voice_transcripts.sql

-- Add source column to distinguish text vs voice
ALTER TABLE transcript_segments ADD COLUMN source TEXT DEFAULT 'text';

-- Table for voice recording metadata
CREATE TABLE voice_recordings (
  id INTEGER PRIMARY KEY,
  session_id INTEGER NOT NULL,
  discord_user_id TEXT NOT NULL,
  file_path TEXT NOT NULL,
  duration_ms INTEGER,
  recorded_at TEXT NOT NULL,
  processed_at TEXT,
  processing_status TEXT DEFAULT 'pending',
  FOREIGN KEY (session_id) REFERENCES sessions(id)
);

-- Link voice segments to their source recording (Phase 5's audio player needs this)
ALTER TABLE transcript_segments ADD COLUMN recording_id INTEGER REFERENCES voice_recordings(id);

-- Indexes for efficient lookups
CREATE INDEX idx_voice_recordings_session ON voice_recordings(session_id);
CREATE INDEX idx_transcript_source ON transcript_segments(source);

Time: 30 min schema design + 30 min migration + 1 hr testing
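The processing_status column implies a small lifecycle (pending → completed, plus some failure path). A minimal TypeScript guard for that lifecycle could look like the sketch below; note that the 'failed' status is an assumption for transcription errors and is not in the migration above, which only defaults to 'pending':

```typescript
// Allowed processing_status transitions for voice_recordings rows.
// 'failed' is an assumed extra status for transcription errors; the
// migration above only establishes the 'pending' default.
type ProcessingStatus = 'pending' | 'completed' | 'failed';

const allowed: Record<ProcessingStatus, ProcessingStatus[]> = {
  pending: ['completed', 'failed'],
  failed: ['pending'], // allow retry by resetting to pending
  completed: [],       // terminal
};

export function canTransition(from: ProcessingStatus, to: ProcessingStatus): boolean {
  return allowed[from].includes(to);
}
```

Keeping the transition rules in one place makes it harder for a retry path added later to accidentally re-process completed recordings.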


Phase 2: Whisper Docker Service (3-4 hours)

Option A: whisper.cpp server (recommended)

# Add to docker-compose.yml
services:
  whisper:
    image: ghcr.io/ggerganov/whisper.cpp:server   # verify the current tag on ghcr.io
    ports:
      - "10888:10888"
    volumes:
      - ./whisper-models:/models
    environment:
      - WHISPER_MODEL=/models/ggml-large-v3.bin
    restart: unless-stopped

Option B: OpenAI Whisper (Python reference implementation)

Note: OpenAI does not publish an official Whisper Docker image, so this route
needs either a community image or a small custom Dockerfile (pip install
openai-whisper plus a thin HTTP wrapper). Sketch only:

  whisper:
    build: ./whisper   # hypothetical local Dockerfile wrapping openai-whisper
    ports:
      - "10888:10888"
    volumes:
      - ./whisper-audio:/audio

Setup steps:

  1. Add whisper service to docker-compose.yml
  2. Download model (ggml-large-v3.bin ~3GB, best accuracy)
  3. Test API endpoint: POST /inference with audio file
  4. Create apps/server/src/services/whisperClient.ts
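Step 3 can also be checked from the server side before any transcription work starts. A sketch of a reachability probe, assuming the whisper.cpp server answers on /inference as described above; the fetch function is injected so the check can be exercised without a live service:

```typescript
// Probes the Whisper service once and reports whether it answered at all.
// fetchFn is injected so this can be tested (or swapped) without a live
// container; any HTTP response — even a 4xx for an empty body — proves
// the service is up and routable.
type FetchLike = (
  url: string,
  init?: { method?: string },
) => Promise<{ ok: boolean; status: number }>;

export async function whisperReachable(
  baseUrl: string,
  fetchFn: FetchLike,
): Promise<boolean> {
  try {
    const res = await fetchFn(`${baseUrl}/inference`, { method: 'POST' });
    return res.status > 0;
  } catch {
    return false; // connection refused / DNS failure / timeout
  }
}
```

Running this at server startup (or in /session start) gives a clear error message instead of a failed transcription batch hours later.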

whisperClient.ts:

import FormData from 'form-data';
import fetch from 'node-fetch';

const WHISPER_URL = process.env.WHISPER_BASE_URL ?? 'http://whisper:10888';

export interface WhisperSegment {
  start: number; // seconds from start of audio
  end: number;
  text: string;
}

export interface WhisperResult {
  text: string;
  segments: WhisperSegment[];
}

export async function transcribeAudio(
  audioBuffer: Buffer,
  speakerId: string
): Promise<WhisperResult> {
  const form = new FormData();
  form.append('file', audioBuffer, { filename: `${speakerId}.webm` });
  form.append('model', 'large-v3');
  form.append('response_format', 'verbose_json');
  form.append('word_timestamps', 'true');

  const res = await fetch(`${WHISPER_URL}/inference`, {
    method: 'POST',
    body: form
  });
  if (!res.ok) {
    throw new Error(`Whisper request failed: ${res.status} ${res.statusText}`);
  }

  return (await res.json()) as WhisperResult;
}
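Whisper reports segment times as seconds relative to the start of the audio file, while transcript_segments stores absolute timestamps. The conversion, using the recording's recorded_at value, can be a small pure helper:

```typescript
// Converts a Whisper segment offset (seconds from the start of the
// recording) into an absolute ISO-8601 timestamp, anchored at the
// moment the recording began (voice_recordings.recorded_at).
export function toAbsoluteTimestamp(
  recordedAtIso: string,
  offsetSeconds: number,
): string {
  const baseMs = new Date(recordedAtIso).getTime();
  return new Date(baseMs + Math.round(offsetSeconds * 1000)).toISOString();
}
```

Keeping this conversion in one place avoids mixing relative and absolute times when voice and text segments are merged later.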

Time: 2 hrs Docker setup + 1-2 hrs client + testing


Phase 3: Voice Recording in Discord Bot (5-7 hours)

Update discordService.ts with voice intents:

import { joinVoiceChannel, VoiceConnectionStatus, EndBehaviorType } from '@discordjs/voice';
import type { AudioReceiveStream } from '@discordjs/voice';

// Add to intents
const client = new Client({
  intents: [
    GatewayIntentBits.Guilds,
    GatewayIntentBits.GuildMembers,
    GatewayIntentBits.GuildVoiceStates,
    GatewayIntentBits.GuildMessages
  ]
});

New service: apps/server/src/services/voiceRecorder.ts

import {
  joinVoiceChannel,
  VoiceConnectionStatus,
  EndBehaviorType,
  type AudioReceiveStream,
  type VoiceConnection,
  type DiscordGatewayAdapterCreator,
} from '@discordjs/voice';
import { createWriteStream, readFileSync, mkdirSync } from 'fs';
import type { WriteStream } from 'fs';
import path from 'path';
import { db } from '../db/client.js';
import { transcribeAudio } from './whisperClient.js';
import { sessionService } from './sessionService.js'; // existing appendSegment logic

const RECORDINGS_DIR = path.resolve('./data/voice-recordings');

class VoiceRecorder {
  private connection: VoiceConnection | null = null;
  private recordingStreams = new Map<string, WriteStream>();
  private sessionId: number | null = null;

  async joinChannel(
    guildId: string,
    channelId: string,
    sessionId: number,
    adapterCreator: DiscordGatewayAdapterCreator // pass guild.voiceAdapterCreator from discord.js
  ) {
    this.sessionId = sessionId;
    this.connection = joinVoiceChannel({ channelId, guildId, adapterCreator });

    this.connection.on(VoiceConnectionStatus.Ready, () => {
      console.log(`Voice connected for session ${sessionId}`);
    });

    // Subscribe to each user as they start speaking
    const receiver = this.connection.receiver;
    receiver.speaking.on('start', (userId) => {
      if (this.recordingStreams.has(userId)) return; // already recording this user
      const stream = receiver.subscribe(userId, {
        end: { behavior: EndBehaviorType.Manual },
      });
      this.startRecording(userId, stream);
    });
  }

  private startRecording(userId: string, stream: AudioReceiveStream) {
    mkdirSync(RECORDINGS_DIR, { recursive: true });
    const filePath = path.join(RECORDINGS_DIR, `${this.sessionId}_${userId}.webm`);

    // NOTE: the receive stream yields raw Opus packets; for a playable file
    // they need to be muxed into a container (e.g. prism-media's Ogg muxer)
    // rather than written straight to .webm. Direct pipe shown for brevity.
    const file = createWriteStream(filePath);
    stream.pipe(file);
    this.recordingStreams.set(userId, file);

    // Track in DB
    db.prepare(`
      INSERT INTO voice_recordings (session_id, discord_user_id, file_path, recorded_at)
      VALUES (?, ?, ?, datetime('now'))
    `).run(this.sessionId, userId, filePath);
  }

  async stopRecording() {
    // Close all file streams
    for (const stream of this.recordingStreams.values()) {
      stream.end();
    }
    this.recordingStreams.clear();

    // Leave voice channel
    this.connection?.destroy();
    this.connection = null;
  }

  async processRecordings(sessionId: number) {
    const recordings = db.prepare(`
      SELECT * FROM voice_recordings 
      WHERE session_id = ? AND processing_status = 'pending'
    `).all(sessionId);

    for (const rec of recordings) {
      await this.processSingleRecording(rec);
    }
  }

  private async processSingleRecording(recording: any) {
    const audioBuffer = readFileSync(recording.file_path);
    const result = await transcribeAudio(audioBuffer, recording.discord_user_id);

    // Resolve the guild from the session row rather than this.sessionId,
    // which stopRecording may already have cleared.
    // (Assumes the sessions table carries a guild_id column.)
    const session = db
      .prepare('SELECT guild_id FROM sessions WHERE id = ?')
      .get(recording.session_id) as { guild_id: string };

    // Store transcript segments. verbose_json segments expose start/end/text;
    // Whisper has no plain per-segment confidence field, only log-prob scores.
    for (const segment of result.segments) {
      sessionService.appendSegment({
        sessionId: recording.session_id,
        guildId: session.guild_id,
        discordUserId: recording.discord_user_id,
        text: segment.text,
        startedAt: segment.start,
        endedAt: segment.end,
        source: 'voice'
      });
    }

    // Mark as processed
    db.prepare(`
      UPDATE voice_recordings
      SET processing_status = 'completed', processed_at = datetime('now')
      WHERE id = ?
    `).run(recording.id);
  }
}

export const voiceRecorder = new VoiceRecorder();

Time: 4-5 hrs discord.js voice API + 2 hrs file handling + 1 hr testing
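startRecording builds file names from the session ID and the Discord user ID. Since the user ID arrives from an external source, the path construction deserves a defensive helper; a sketch (pure string logic, no Discord dependency):

```typescript
import path from 'node:path';

// Builds the per-user recording path, rejecting anything that is not a
// plain Discord snowflake (digits only) to rule out path traversal via
// a crafted user id.
export function recordingPath(
  dir: string,
  sessionId: number,
  discordUserId: string,
): string {
  if (!/^\d+$/.test(discordUserId)) {
    throw new Error(`unexpected Discord user id: ${discordUserId}`);
  }
  return path.join(dir, `${sessionId}_${discordUserId}.webm`);
}
```

Real Discord IDs are always numeric snowflakes, so the digits-only check costs nothing in the happy path.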


Phase 4: Discord Command Integration (2-3 hours)

Update /session start command:

// In discordService.ts handleSessionCommand
if (sub === 'start') {
  const result = sessionService.startSession(guildId, user.id);
  
  // Get the voice channel the user is in
  const member = await interaction.guild!.members.fetch(interaction.user.id);
  const voiceChannel = member.voice.channel;
  
  if (voiceChannel) {
    await voiceRecorder.joinChannel(
      interaction.guildId!,
      voiceChannel.id,
      result.sessionId,
      interaction.guild!.voiceAdapterCreator // adapter required by @discordjs/voice
    );
  }
  
  await interaction.reply({
    content: `Started session #${result.sessionId}. ${voiceChannel ? '🎤 Recording voice' : '📝 Text only'}`,
    ephemeral: false
  });
}

Update /session stop command:

if (sub === 'stop') {
  // Transcription can take minutes; defer so the interaction doesn't time out
  await interaction.deferReply();
  await voiceRecorder.stopRecording();
  await voiceRecorder.processRecordings(active.id);
  sessionService.stopSession(guildId, user.id);
  await interaction.editReply(`Stopped session #${active.id}. Voice transcripts processed.`);
}

Time: 2 hrs integration + 1 hr testing


Phase 5: UI for Voice Transcripts (4-6 hours)

Update CampaignDetailPage.tsx:

// Add filter toggle for transcript sources
const [showVoice, setShowVoice] = useState(true);
const [showText, setShowText] = useState(true);

// Filter segments
const filteredSegments = segments.filter(s => 
  (s.source === 'voice' && showVoice) || (s.source === 'text' && showText)
);

// Add audio player for voice segments
// (requires a recording_id column on transcript_segments linking back to
// voice_recordings — add it in the Phase 1 migration)
{segment.source === 'voice' && segment.recording_id && (
  <audio controls src={`/api/voice/${segment.recording_id}`} />
)}
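One display wrinkle worth handling here: voice segments are appended in a batch after /session stop, so their sequence numbers trail the text segments even though they were spoken earlier. Sorting the merged list by startedAt before rendering keeps the timeline honest; a sketch using the same field names as the snippets above:

```typescript
interface DisplaySegment {
  startedAt: string; // ISO-8601 timestamp
  source: 'text' | 'voice';
  text: string;
}

// Orders mixed text/voice segments chronologically for display, since
// batch-processed voice rows receive late sequence numbers despite
// having earlier spoken timestamps.
export function chronological(segments: DisplaySegment[]): DisplaySegment[] {
  return [...segments].sort(
    (a, b) => new Date(a.startedAt).getTime() - new Date(b.startedAt).getTime(),
  );
}
```

Copying before sorting keeps the original (sequence-ordered) array intact for any view that still wants insertion order.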

New API endpoint: apps/server/src/routes/voiceRoutes.ts

import { Router } from 'express';
import path from 'path';
import { db } from '../db/client.js';
import { requireAuth } from '../middleware.js';

export const voiceRoutes = Router();

voiceRoutes.get('/voice/:recordingId', requireAuth, (req, res) => {
  const recording = db
    .prepare('SELECT file_path FROM voice_recordings WHERE id = ?')
    .get(req.params.recordingId) as { file_path: string } | undefined;
  if (!recording) return res.status(404).send('Not found');
  
  res.sendFile(path.resolve(recording.file_path));
});

Time: 3 hrs UI components + 2 hrs API + 1 hr testing


Phase 6: AI Recap Integration (2-3 hours)

Update recapService.ts:

// In buildContext, include voice segments
const segments = db
  .prepare(`
    SELECT sequence, speaker_label_snapshot AS speakerLabel,
           text, started_at AS startedAt, source
    FROM transcript_segments
    WHERE session_id = ?
    ORDER BY sequence ASC
  `)
  .all(sessionId);

// Update prompt to mention voice transcripts
const prompt = `
Summarize this DnD session transcript.
Note: Some segments are from voice transcription (may have minor errors).

${context.segments.map(s => 
  `[${s.startedAt}] ${s.speakerLabel}: ${s.text}${s.source === 'voice' ? ' [voice]' : ''}`
).join('\n')}
`;
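The per-line formatting inside that template is simple enough to pull into a small, testable helper:

```typescript
interface PromptSegment {
  startedAt: string;
  speakerLabel: string;
  text: string;
  source: string; // 'text' or 'voice'
}

// Renders one transcript segment as a prompt line, tagging voice-derived
// segments so the model knows they may contain transcription errors.
export function promptLine(s: PromptSegment): string {
  const tag = s.source === 'voice' ? ' [voice]' : '';
  return `[${s.startedAt}] ${s.speakerLabel}: ${s.text}${tag}`;
}
```

Extracting it also gives one place to later redact or truncate overly long segments before they reach the model.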

Time: 1-2 hrs updates + 1 hr testing


Summary Timeline

Phase  Description                  Hours  Cumulative
1      DB schema + migration        2-3    2-3
2      Whisper Docker service       3-4    5-7
3      Voice recording service      5-7    10-14
4      Discord command integration  2-3    12-17
5      UI for voice transcripts     4-6    16-23
6      AI recap integration         2-3    18-26

Total: 18-26 hours (3-4 full days)


Dependencies to Install

# Server workspace
npm install @discordjs/voice form-data node-fetch -w @dnd-hub/server
npm install -D @types/node-fetch -w @dnd-hub/server

# @discordjs/voice also needs an Opus codec and an encryption library at runtime
npm install @discordjs/opus sodium-native -w @dnd-hub/server

Next Steps (Where to Pick Up)

  1. Phase 1: Create apps/server/src/db/migrations/010_voice_transcripts.sql
  2. Phase 2: Update docker-compose.yml with whisper service
  3. Phase 2: Create apps/server/src/services/whisperClient.ts

Then proceed through phases 3-6 sequentially.


Key Decisions Made

  • Transcription: whisper.cpp via Docker (local, free, runs on existing Ubuntu server)
  • Audio format: Per-user .webm tracks (better speaker separation)
  • Processing: Batch on /session stop (not real-time)
  • Storage: Voice files in apps/server/data/voice-recordings/
  • DB: Single transcript_segments table with source column (voice vs text)