Read the deep-dive: DEEP_DIVE.md — a full walkthrough of how the fingerprinting and matching work, the database choice I got wrong, and the honest weaknesses I'd fix before putting this in front of real data.
Audio ingestion, speaker identification, and deduplication pipeline.
Give SonicGen a YouTube channel handle and it ingests every video, transcribes with speaker diarization, lets you label speakers in seconds, stores voice embeddings for automatic future identification, and deduplicates clips via Shazam-style audio fingerprinting so you don't waste money re-transcribing duplicates.
Written by hand in Python. No coding agents.
- Ingest — point it at a YouTube handle. It pulls all video metadata via the YouTube Data API.
- Download — audio extracted via yt-dlp + FFmpeg, stored in Google Cloud Storage.
- Fingerprint & deduplicate — Shazam-style constellation hashing detects duplicate clips before they hit transcription. Alignment-offset matching (not just hash collision count) keeps accuracy high.
- Transcribe — AssemblyAI with speaker diarization. Output: timestamped transcript with Speaker A, Speaker B, etc.
- Label speakers — play a short clip of each speaker, label them. Takes seconds, especially for podcasts with 2 speakers.
- Extract speaker clips — isolates each speaker's audio segments from the full video, trimmed to avoid crosstalk.
- Voice embeddings — SpeechBrain ECAPA-TDNN encodes each speaker's clips into 192-dimensional embeddings, stored in Pinecone (vector database, cosine similarity).
- Automatic speaker identification — on future videos, new clips are encoded and matched against stored embeddings. Known speakers are identified automatically.
- Real-world problem: short-form reposts explode across platforms. SonicGen links clips back to their source and identifies who's speaking.
- Full pipeline: DSP (spectrograms, peak detection, landmark hashing), speaker diarization, voice embeddings, vector search, cloud storage, and state-driven batch processing.
- Production thinking: restartable pipelines with crash recovery, chunked inserts, noisy-hash filtering, rate limiting, and configurable thresholds.
- Vector database experience: stores and queries 192-dim speaker embeddings in Pinecone. The same embed-store-query pattern that powers RAG and semantic search.
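The embed-store-query pattern mentioned above can be illustrated without any external service. The sketch below is a minimal in-memory stand-in for the Pinecone lookup, not the project's actual code: `identify`, the `0.6` threshold, and the random 192-dimensional vectors are all hypothetical, chosen only to show how cosine similarity picks the closest stored speaker and rejects unknown voices.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def identify(query, stored, threshold=0.6):
    """Return (label, score) for the closest stored embedding.

    The label is None when the best score falls below `threshold`
    (a hypothetical cutoff, not a value taken from the project).
    """
    label, score = max(
        ((name, cosine(query, vec)) for name, vec in stored.items()),
        key=lambda pair: pair[1],
    )
    return (label if score >= threshold else None, score)

# Synthetic 192-dim vectors standing in for real ECAPA-TDNN embeddings.
rng = random.Random(0)

def random_vec():
    return [rng.gauss(0, 1) for _ in range(192)]

stored = {"speaker_01": random_vec(), "speaker_02": random_vec()}

# A slightly perturbed copy of speaker_01 plays the role of a new clip.
query = [x + 0.1 * rng.gauss(0, 1) for x in stored["speaker_01"]]
known = identify(query, stored)

# An unrelated vector should not clear the threshold.
unknown = identify(random_vec(), stored)
```

A hosted vector database does the same comparison at scale; the threshold is what separates "known speaker" from "new speaker", so it would need tuning on real embeddings.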
backend/youtube_api.py — Channel and video metadata ingestion via YouTube Data API v3
backend/download.py — yt-dlp download + GCS upload, state-driven batch processing
backend/extract_pipeline.py — End-to-end orchestrator (ingest, download, deduplicate)
backend/fingerprint_audio.py — DSP pipeline: STFT spectrograms, peak detection, constellation hashing, segmentation
backend/fingerprint_pipeline.py — Fingerprint orchestration and matching
backend/supabase_utils.py — Postgres/Supabase access, chunked inserts, candidate search
backend/admin_tools.py — Hash count maintenance, noisy-hash refresh
backend/transcribe.py — AssemblyAI transcription with speaker diarization
backend/label_speakers.py — Map generic speaker labels (Speaker A, B) to real names
backend/extract_speaker_clips.py — Extract per-speaker audio clips from diarized transcripts
backend/speaker_embeddings.py — Encode speaker clips with SpeechBrain ECAPA-TDNN, store in Pinecone
backend/identify_speaker.py — Query Pinecone to identify speakers in new clips, with accuracy evaluation
backend/utils/utils_audio.py — Audio loading and preprocessing
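To make the constellation-hashing step in the fingerprint module more concrete, here is a minimal sketch of how spectrogram peaks are commonly paired into landmark hashes. It assumes the peaks have already been extracted as `(time_frame, freq_bin)` tuples; `landmark_hashes`, `FAN_OUT`, `MAX_DT`, and the SHA-1 key format are hypothetical choices for illustration and may differ from what `backend/fingerprint_audio.py` does.

```python
import hashlib

FAN_OUT = 5    # hypothetical: max target peaks paired with each anchor peak
MAX_DT = 200   # hypothetical: max frame gap between anchor and target

def landmark_hashes(peaks):
    """Turn (time_frame, freq_bin) peaks into (hash, t_ref) pairs.

    Each hash encodes the two peak frequencies and the time gap between
    them, so it does not change when the whole clip is shifted in time.
    t_ref is the anchor peak's frame, kept for offset alignment later.
    """
    peaks = sorted(peaks)
    out = []
    for i, (t1, f1) in enumerate(peaks):
        paired = 0
        for t2, f2 in peaks[i + 1:]:
            dt = t2 - t1
            if dt <= 0:
                continue  # same frame: no usable time gap
            if dt > MAX_DT:
                break     # peaks are sorted, so later ones are even further away
            key = f"{f1}|{f2}|{dt}".encode()
            out.append((hashlib.sha1(key).hexdigest()[:16], t1))
            paired += 1
            if paired == FAN_OUT:
                break
    return out

# The same three peaks, once at the start of a clip and once 100 frames later.
original = landmark_hashes([(0, 10), (3, 40), (7, 25)])
shifted = landmark_hashes([(100, 10), (103, 40), (107, 25)])
```

Because only the gap between peaks enters the hash, `original` and `shifted` carry identical hash strings and differ only in `t_ref`, which is the property the matching stage relies on.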
videos — YouTube metadata + dedup status
- id UUID, youtube_id TEXT, title TEXT, duration INT
- original_video_id UUID (references duplicate's source)
- match_status TEXT (null → pending → fingerprinted/matched/too_short/flag)
fingerprints — Constellation hashes with timestamps
- hash TEXT, video_id UUID, t_ref INT
fingerprint_hash_counts — Track hash frequency for noisy-hash filtering
noisy_hashes — Common hashes excluded from matching (room tone, breathing, etc.)
Pinecone index — 192-dim ECAPA speaker embeddings with cosine similarity
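The hash-count and noisy-hash tables above suggest a frequency-based filter. The following sketch shows one plausible rule, assuming a hash is "noisy" when it appears in more than a fixed fraction of all videos; `find_noisy_hashes` and `max_video_fraction` are hypothetical names, and the project's real criterion is not stated here.

```python
from collections import Counter

def find_noisy_hashes(fingerprints, max_video_fraction=0.5):
    """Return hashes that occur in too many distinct videos.

    fingerprints: iterable of (hash, video_id) rows.
    max_video_fraction: hypothetical cutoff; a hash seen in more than this
    share of videos is treated as noise (room tone, breathing, and so on).
    """
    videos = set()
    seen = set()
    for h, video_id in fingerprints:
        videos.add(video_id)
        seen.add((h, video_id))  # count each hash at most once per video

    per_hash = Counter(h for h, _ in seen)
    limit = max_video_fraction * len(videos)
    return {h for h, n in per_hash.items() if n > limit}

rows = [
    ("aa", "v1"), ("aa", "v2"), ("aa", "v3"),  # present in every video
    ("bb", "v1"), ("cc", "v2"),
    ("aa", "v1"),                              # repeat within one video
]
noisy = find_noisy_hashes(rows)
```

Counting distinct videos rather than raw occurrences keeps a hash that repeats many times inside a single long video from being flagged.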
The dedup engine doesn't just count hash collisions. It checks whether the matching hashes agree on the same time offset between the query clip and the candidate video. That's the key insight of constellation-hash matching: a real match means many hashes align at a consistent time delta, not just that some hashes collide randomly.
- Threshold: 18+ aligned hashes AND 40%+ alignment ratio
- Noisy hashes (common across many videos) are filtered out
- Segmented matching scales coverage with clip length
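The offset-agreement idea and the thresholds above can be sketched as a histogram vote. This is an illustrative outline, not the project's implementation: `best_match` and the in-memory `index` are hypothetical, and the alignment ratio is assumed to mean "hashes at the winning offset divided by all colliding hashes for that video", which the text does not define.

```python
from collections import Counter, defaultdict

MIN_ALIGNED = 18   # from the threshold above
MIN_RATIO = 0.40   # from the threshold above

def best_match(query, index, noisy=frozenset()):
    """Find the candidate video whose hashes agree on one time offset.

    query: list of (hash, t_query) from the clip being checked.
    index: dict mapping hash -> list of (video_id, t_ref) rows.
    Returns (video_id, offset, aligned, ratio) or None.
    """
    offsets = defaultdict(Counter)  # video_id -> histogram of t_ref - t_query
    for h, t_query in query:
        if h in noisy:
            continue  # skip hashes flagged as too common
        for video_id, t_ref in index.get(h, ()):
            offsets[video_id][t_ref - t_query] += 1

    best = None
    for video_id, hist in offsets.items():
        total = sum(hist.values())
        offset, aligned = hist.most_common(1)[0]
        ratio = aligned / total  # assumed definition of "alignment ratio"
        if aligned >= MIN_ALIGNED and ratio >= MIN_RATIO:
            if best is None or aligned > best[2]:
                best = (video_id, offset, aligned, ratio)
    return best

# 30 hashes that line up at offset 100 in "orig" but scatter in "other".
index = {
    f"h{i}": [("orig", 100 + i), ("other", 7 * i)]
    for i in range(30)
}
query = [(f"h{i}", i) for i in range(30)]

match = best_match(query, index)
short = best_match(query[:10], index)
```

Both videos collide on all 30 hashes, yet only "orig" passes: its collisions pile up at a single offset, while "other" spreads them across 30 different offsets. The 10-hash query fails the aligned-count threshold even though every hash agrees.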
- Python 3.11+
- Supabase project (Postgres)
- Google Cloud Storage bucket
- YouTube Data API key
- AssemblyAI API key (for transcription)
- Pinecone account (for speaker embeddings)
Copy .env.example to .env and fill in your API keys:
cp .env.example .env
Ingest and deduplicate:
python backend/extract_pipeline.py
# Enter a YouTube handle like @SomeChannel
Transcribe with speaker diarization:
python backend/transcribe.py ./data/audio_files
Label speakers and extract clips:
python backend/label_speakers.py --transcripts ./data/transcripts --output ./data/labeled --map speaker_map.json
python backend/extract_speaker_clips.py --transcripts ./data/labeled --audio ./data/audio_files --output ./data/speaker_clips --speaker "John Doe"
Store speaker embeddings and identify future speakers:
python backend/speaker_embeddings.py --clips ./data/speaker_clips --label "speaker_01"
python backend/identify_speaker.py identify ./data/new_clip.mp3
- Python, NumPy, SciPy, librosa, pydub
- yt-dlp, FFmpeg, YouTube Data API v3
- AssemblyAI (transcription + diarization)
- SpeechBrain ECAPA-TDNN (speaker embeddings)
- Pinecone (vector database)
- Supabase / Postgres
- Google Cloud Storage
- Scale testing — stress-test speaker identification on larger corpora
- Adaptive tiered matching — compare 10%, 25%, 50%, 75% of hashes with early exit for unambiguous matches
- Dynamic original detection — automatically determine which video in a match pair is the true source
- Interactive frontend — web client to input a YouTube URL and see the original source
- FAISS integration — high-speed similarity search for large-scale dedup