# Discord Voice Transcript Implementation Plan

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────────┐
│                     Discord Voice Channel                       │
│      ┌────────┐      ┌────────┐      ┌────────┐                 │
│      │ Player │      │ Player │      │ Player │                 │
│      └───┬────┘      └───┬────┘      └───┬────┘                 │
│          │               │               │                      │
└──────────┼───────────────┼───────────────┼──────────────────────┘
           │               │               │
           ▼               ▼               ▼
┌─────────────────────────────────────────────────────────────────┐
│           Discord Bot (discord.js + @discordjs/voice)           │
│  - Joins voice channel on /session start                        │
│  - Subscribes to each user's audio stream                       │
│  - Saves per-user .webm files to temp storage                   │
└─────────────────────────────────────────────────────────────────┘
           │
           │ (on /session stop)
           ▼
┌─────────────────────────────────────────────────────────────────┐
│       Whisper Service (Docker container on Ubuntu server)       │
│  - whisper.cpp or openai/whisper                                │
│  - Receives audio files via HTTP POST                           │
│  - Returns JSON with timestamps + transcript                    │
└─────────────────────────────────────────────────────────────────┘
           │
           ▼
┌─────────────────────────────────────────────────────────────────┐
│                         DnD Hub Server                          │
│  - Maps Discord user ID → character (existing logic)            │
│  - Stores voice segments in transcript_segments table           │
│  - Marks segments with source='voice'                           │
│  - AI recap includes both text + voice transcripts              │
└─────────────────────────────────────────────────────────────────┘
```

---

## Phase 1: Database Schema (2-3 hours)

### Migration: `apps/server/src/db/migrations/010_voice_transcripts.sql`

```sql
-- Add source column to distinguish text vs voice segments
ALTER TABLE transcript_segments ADD COLUMN source TEXT DEFAULT 'text';

-- Metadata for per-user voice recordings
CREATE TABLE voice_recordings (
  id INTEGER PRIMARY KEY,
  session_id INTEGER NOT NULL,
  discord_user_id TEXT NOT NULL,
  file_path TEXT NOT NULL,
  duration_ms INTEGER,
  recorded_at TEXT NOT NULL,
  processed_at TEXT,
  processing_status TEXT DEFAULT 'pending',
  FOREIGN KEY (session_id) REFERENCES sessions(id)
);

-- Indexes for efficient lookups
CREATE INDEX idx_voice_recordings_session ON voice_recordings(session_id);
CREATE INDEX idx_transcript_source ON transcript_segments(source);
```

**Time:** 30 min schema design + 30 min migration + 1 hr testing
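
With the `source` column in place, text and voice segments live in one table and can be interleaved chronologically when rendering a session. A minimal sketch of that merge in plain TypeScript, using sample data; the `Segment` shape is a hypothetical mirror of the columns above:

```typescript
// Hypothetical row shape mirroring transcript_segments columns.
interface Segment {
  startedAt: string; // ISO-8601 timestamp
  source: 'text' | 'voice';
  speaker: string;
  text: string;
}

// Merge segments from both sources into one chronological transcript.
// ISO-8601 strings sort correctly with plain string comparison.
function interleave(segments: Segment[]): Segment[] {
  return [...segments].sort((a, b) => a.startedAt.localeCompare(b.startedAt));
}

const merged = interleave([
  { startedAt: '2024-05-01T19:02:10Z', source: 'voice', speaker: 'Thia', text: 'I check the door for traps.' },
  { startedAt: '2024-05-01T19:01:55Z', source: 'text', speaker: 'DM', text: 'You reach a locked door.' },
]);
console.log(merged.map(s => `${s.source}:${s.speaker}`).join(','));
// → "text:DM,voice:Thia"
```

The string comparison only works because both sources store the same timestamp format, which is worth enforcing in the migration.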

---

## Phase 2: Whisper Docker Service (3-4 hours)

### Option A: whisper.cpp (recommended for performance)

```yaml
# Add to docker-compose.yml
services:
  whisper:
    image: ghcr.io/ggerganov/whisper.cpp:server
    ports:
      - "10888:10888"
    volumes:
      - ./whisper-models:/models
    environment:
      - WHISPER_MODEL=/models/ggml-large-v3.bin
    restart: unless-stopped
```

### Option B: OpenAI Whisper (simpler)

```yaml
# Note: verify this image name before relying on it — OpenAI does not
# publish an official Whisper container, so a community image (or a
# locally built one) may be needed here.
whisper:
  image: ghcr.io/openai/whisper:latest
  ports:
    - "10888:10888"
  volumes:
    - ./whisper-audio:/audio
```

**Setup steps:**
1. Add the whisper service to `docker-compose.yml`
2. Download the model (`ggml-large-v3.bin`, ~3 GB, best accuracy)
3. Test the API endpoint: `POST /inference` with an audio file
4. Create `apps/server/src/services/whisperClient.ts`

**whisperClient.ts:**
```typescript
import FormData from 'form-data';
import fetch from 'node-fetch';

const WHISPER_URL = process.env.WHISPER_BASE_URL ?? 'http://whisper:10888';

export async function transcribeAudio(audioBuffer: Buffer, speakerId: string) {
  const form = new FormData();
  form.append('file', audioBuffer, { filename: `${speakerId}.webm` });
  form.append('model', 'large-v3');
  form.append('response_format', 'verbose_json');
  form.append('word_timestamps', 'true');

  const res = await fetch(`${WHISPER_URL}/inference`, {
    method: 'POST',
    body: form
  });
  if (!res.ok) {
    throw new Error(`Whisper request failed: ${res.status} ${res.statusText}`);
  }

  return res.json(); // { text, segments: [{ start, end, text }] }
}
```

**Time:** 2 hrs Docker setup + 1-2 hrs client + testing
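
The exact shape of a `verbose_json` response varies between Whisper servers, so it is worth normalizing it at the client boundary before anything downstream touches the database. A sketch under that assumption — the field names here follow the return comment above and must be checked against the real server output:

```typescript
// Assumed raw response shape; verify against the actual Whisper server.
interface WhisperSegment { start: number; end: number; text: string }
interface WhisperResponse { text: string; segments?: WhisperSegment[] }

// Normalize: trim segment text, drop empty segments, and always return
// an array even when the server omits `segments` entirely.
function normalizeTranscript(raw: WhisperResponse): WhisperSegment[] {
  return (raw.segments ?? [])
    .map(s => ({ ...s, text: s.text.trim() }))
    .filter(s => s.text.length > 0);
}

const segs = normalizeTranscript({
  text: ' Hello there. ',
  segments: [
    { start: 0.0, end: 1.2, text: ' Hello there. ' },
    { start: 1.2, end: 1.5, text: '   ' },
  ],
});
console.log(segs.length, segs[0].text); // → 1 "Hello there."
```

Doing this once in `whisperClient.ts` keeps Phase 3's `processSingleRecording` free of defensive checks.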

---

## Phase 3: Voice Recording in Discord Bot (5-7 hours)

### Update `discordService.ts` with voice intents:

```typescript
import { joinVoiceChannel, VoiceConnectionStatus } from '@discordjs/voice';
import type { AudioReceiveStream } from '@discordjs/voice';

// Add GuildVoiceStates to the intents
const client = new Client({
  intents: [
    GatewayIntentBits.Guilds,
    GatewayIntentBits.GuildMembers,
    GatewayIntentBits.GuildVoiceStates,
    GatewayIntentBits.GuildMessages
  ]
});
```

### New service: `apps/server/src/services/voiceRecorder.ts`

```typescript
import {
  joinVoiceChannel,
  EndBehaviorType,
  VoiceConnectionStatus
} from '@discordjs/voice';
import type { VoiceConnection, AudioReceiveStream } from '@discordjs/voice';
import { createWriteStream, mkdirSync, readFileSync, type WriteStream } from 'fs';
import path from 'path';
import { db } from '../db/client.js';
import { sessionService } from './sessionService.js';
import { transcribeAudio } from './whisperClient.js';

const RECORDINGS_DIR = path.resolve('./data/voice-recordings');

class VoiceRecorder {
  private connection: VoiceConnection | null = null;
  private recordingStreams = new Map<string, WriteStream>();
  private sessionId: number | null = null;

  async joinChannel(guildId: string, channelId: string, sessionId: number) {
    this.sessionId = sessionId;
    this.connection = joinVoiceChannel({
      channelId,
      guildId,
      adapterCreator: getAdapter() // guild.voiceAdapterCreator from discord.js
    });

    this.connection.on(VoiceConnectionStatus.Ready, () => {
      console.log(`Voice connected for session ${sessionId}`);
    });

    // Subscribe to each user's audio when they first start speaking
    const receiver = this.connection.receiver;
    receiver.speaking.on('start', (userId) => {
      if (this.recordingStreams.has(userId)) return;
      const stream = receiver.subscribe(userId, {
        end: { behavior: EndBehaviorType.Manual }
      });
      this.startRecording(userId, stream);
    });
  }

  private startRecording(userId: string, stream: AudioReceiveStream) {
    const filePath = path.join(RECORDINGS_DIR, `${this.sessionId}_${userId}.webm`);
    mkdirSync(RECORDINGS_DIR, { recursive: true });

    // Note: the receiver emits raw Opus packets; they need to be wrapped in
    // a container (e.g. via prism-media) to produce a playable .webm file.
    const file = createWriteStream(filePath);
    stream.pipe(file);
    this.recordingStreams.set(userId, file);

    // Track in DB
    db.prepare(`
      INSERT INTO voice_recordings (session_id, discord_user_id, file_path, recorded_at)
      VALUES (?, ?, ?, datetime('now'))
    `).run(this.sessionId, userId, filePath);
  }

  async stopRecording() {
    // Close all file streams
    for (const stream of this.recordingStreams.values()) {
      stream.end();
    }
    this.recordingStreams.clear();

    // Leave voice channel
    this.connection?.destroy();
    this.connection = null;
  }

  async processRecordings(sessionId: number) {
    const recordings = db.prepare(`
      SELECT * FROM voice_recordings
      WHERE session_id = ? AND processing_status = 'pending'
    `).all(sessionId);

    for (const rec of recordings) {
      await this.processSingleRecording(rec);
    }
  }

  private async processSingleRecording(recording: any) {
    const audioBuffer = readFileSync(recording.file_path);
    const result = await transcribeAudio(audioBuffer, recording.discord_user_id);

    // Store transcript segments
    for (const segment of result.segments) {
      sessionService.appendSegment({
        sessionId: recording.session_id,
        guildId: /* get from session */,
        discordUserId: recording.discord_user_id,
        text: segment.text,
        startedAt: segment.start,
        endedAt: segment.end,
        confidence: segment.confidence,
        source: 'voice'
      });
    }

    // Mark as processed
    db.prepare(`
      UPDATE voice_recordings
      SET processing_status = 'completed', processed_at = datetime('now')
      WHERE id = ?
    `).run(recording.id);
  }
}

export const voiceRecorder = new VoiceRecorder();
```

**Time:** 4-5 hrs discord.js voice API + 2 hrs file handling + 1 hr testing
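
One detail to settle before wiring `processSingleRecording` into the DB: Whisper segment times are seconds relative to the start of the audio file, while `started_at` on a transcript segment presumably holds a wall-clock timestamp. A small conversion helper, assuming `recorded_at` marks the moment the recording file began:

```typescript
// Convert a Whisper segment offset (seconds from file start) into an
// absolute ISO-8601 timestamp. Assumes `recordedAt` is the wall-clock
// time at which the recording started.
function toAbsoluteTimestamp(recordedAt: string, offsetSeconds: number): string {
  const base = new Date(recordedAt).getTime();
  return new Date(base + Math.round(offsetSeconds * 1000)).toISOString();
}

const startedAt = toAbsoluteTimestamp('2024-05-01T19:00:00.000Z', 12.34);
console.log(startedAt); // → "2024-05-01T19:00:12.340Z"
```

Without this step, voice segments would sort by raw file offset and interleave incorrectly with the text segments' timestamps.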

---

## Phase 4: Discord Command Integration (2-3 hours)

### Update `/session start` command:

```typescript
// In discordService.ts handleSessionCommand
if (sub === 'start') {
  const result = sessionService.startSession(guildId, user.id);

  // Get the voice channel the user is in
  const member = await interaction.guild.members.fetch(interaction.user.id);
  const voiceChannel = member.voice.channel;

  if (voiceChannel) {
    await voiceRecorder.joinChannel(
      interaction.guildId!,
      voiceChannel.id,
      result.sessionId
    );
  }

  await interaction.reply({
    content: `Started session #${result.sessionId}. ${voiceChannel ? '🎤 Recording voice' : '📝 Text only'}`,
    ephemeral: false
  });
}
```

### Update `/session stop` command:

```typescript
if (sub === 'stop') {
  // `active` is the active session record looked up earlier in the handler
  await voiceRecorder.stopRecording();
  await voiceRecorder.processRecordings(active.id);
  sessionService.stopSession(guildId, user.id);
  await interaction.reply({ content: `Stopped session #${active.id}. Processing voice transcripts...`, ephemeral: false });
}
```

**Time:** 2 hrs integration + 1 hr testing

---

## Phase 5: UI for Voice Transcripts (4-6 hours)

### Update `CampaignDetailPage.tsx`:

```tsx
// Add filter toggles for transcript sources
const [showVoice, setShowVoice] = useState(true);
const [showText, setShowText] = useState(true);

// Filter segments
const filteredSegments = segments.filter(s =>
  (s.source === 'voice' && showVoice) || (s.source === 'text' && showText)
);

// Add an audio player for voice segments
{segment.source === 'voice' && (
  <audio controls src={`/api/voice/${segment.recording_id}`} />
)}
```

### New API endpoint: `apps/server/src/routes/voiceRoutes.ts`

```typescript
import path from 'path';
import { Router } from 'express';
import { db } from '../db/client.js';
import { requireAuth } from '../middleware.js';

export const voiceRoutes = Router();

voiceRoutes.get('/voice/:recordingId', requireAuth, (req, res) => {
  const recording = db
    .prepare('SELECT * FROM voice_recordings WHERE id = ?')
    .get(req.params.recordingId) as { file_path: string } | undefined;
  if (!recording) return res.status(404).send('Not found');

  res.sendFile(path.resolve(recording.file_path));
});
```

**Time:** 3 hrs UI components + 2 hrs API + 1 hr testing

---

## Phase 6: AI Recap Integration (2-3 hours)

### Update `recapService.ts`:

```typescript
// In buildContext, include voice segments alongside text
const segments = db
  .prepare(`
    SELECT sequence, speaker_label_snapshot AS speakerLabel,
           text, started_at AS startedAt, source
    FROM transcript_segments
    WHERE session_id = ?
    ORDER BY sequence ASC
  `)
  .all(sessionId);

// Update the prompt to mention voice transcripts
const prompt = `
Summarize this DnD session transcript.
Note: Some segments are from voice transcription (may have minor errors).

${segments.map(s =>
  `[${s.startedAt}] ${s.speakerLabel}: ${s.text}${s.source === 'voice' ? ' [voice]' : ''}`
).join('\n')}
`;
```

**Time:** 1-2 hrs updates + 1 hr testing
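
The inline template above is easy to get subtly wrong (stray `[voice]` tags on text segments, missing separators), so pulling the per-segment formatting into a pure function makes it unit-testable. A sketch using the aliased column names from the query above:

```typescript
// Row shape matching the SELECT aliases in the recap query above.
interface RecapSegment {
  startedAt: string;
  speakerLabel: string;
  text: string;
  source: 'text' | 'voice';
}

// Format one transcript line for the recap prompt; voice lines are tagged
// so the model knows they may contain transcription errors.
function formatSegment(s: RecapSegment): string {
  const tag = s.source === 'voice' ? ' [voice]' : '';
  return `[${s.startedAt}] ${s.speakerLabel}: ${s.text}${tag}`;
}

const line = formatSegment({
  startedAt: '19:02:10',
  speakerLabel: 'Thia',
  text: 'I check the door.',
  source: 'voice',
});
console.log(line); // → "[19:02:10] Thia: I check the door. [voice]"
```

The prompt body then becomes `segments.map(formatSegment).join('\n')`.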

---

## Summary Timeline

| Phase | Description | Hours | Cumulative |
|-------|-------------|-------|------------|
| 1 | DB schema + migration | 2-3 | 2-3 |
| 2 | Whisper Docker service | 3-4 | 5-7 |
| 3 | Voice recording service | 5-7 | 10-14 |
| 4 | Discord command integration | 2-3 | 12-17 |
| 5 | UI for voice transcripts | 4-6 | 16-23 |
| 6 | AI recap integration | 2-3 | 18-26 |

**Total: 18-26 hours** (roughly 3-4 full days)

---

## Dependencies to Install

```bash
# Server workspace
npm install @discordjs/voice form-data node-fetch -w @dnd-hub/server
npm install -D @types/node-fetch -w @dnd-hub/server
```

---

## Next Steps (Where to Pick Up)

1. **Phase 1**: Create `apps/server/src/db/migrations/010_voice_transcripts.sql`
2. **Phase 2**: Update `docker-compose.yml` with the whisper service
3. **Phase 2**: Create `apps/server/src/services/whisperClient.ts`

Then proceed through phases 3-6 sequentially.

---

## Key Decisions Made

- **Transcription**: whisper.cpp via Docker (local, free, runs on existing Ubuntu server)
- **Audio format**: per-user .webm tracks (better speaker separation)
- **Processing**: batch on /session stop (not real-time)
- **Storage**: voice files in `apps/server/data/voice-recordings/`
- **DB**: single `transcript_segments` table with a `source` column (voice vs text)