SonicGen

Read the deep-dive: DEEP_DIVE.md — a full walkthrough of how the fingerprinting and matching work, the database choice I got wrong, and the honest weaknesses I'd fix before putting this in front of real data.

Audio ingestion, speaker identification, and deduplication pipeline.

Give SonicGen a YouTube channel handle and it ingests every video, transcribes each one with speaker diarization, and lets you label the speakers in seconds. It stores voice embeddings so those speakers are identified automatically in future videos, and it deduplicates clips with Shazam-style audio fingerprinting so you don't pay to re-transcribe duplicates.

Written by hand in Python. No coding agents.

The full pipeline

Ingest — point it at a YouTube handle. It pulls all video metadata via the YouTube Data API.
Download — audio extracted via yt-dlp + FFmpeg, stored in Google Cloud Storage.
Fingerprint & deduplicate — Shazam-style constellation hashing detects duplicate clips before they hit transcription. Alignment-offset matching (not just hash collision count) keeps accuracy high.
Transcribe — AssemblyAI with speaker diarization. Output: timestamped transcript with Speaker A, Speaker B, etc.
Label speakers — play a short clip of each speaker, label them. Takes seconds, especially for podcasts with 2 speakers.
Extract speaker clips — isolates each speaker's audio segments from the full video, trimmed to avoid crosstalk.
Voice embeddings — SpeechBrain ECAPA-TDNN encodes each speaker's clips into 192-dimensional embeddings, stored in Pinecone (vector database, cosine similarity).
Automatic speaker identification — on future videos, new clips are encoded and matched against stored embeddings. Known speakers are identified automatically.
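
The last step, matching a new clip's embedding against stored ones, can be sketched in plain Python. This is an illustration only, not the code in backend/identify_speaker.py: the function names, the 0.7 threshold, and the tiny 3-dimensional vectors (standing in for 192-dim ECAPA embeddings) are all assumptions, and the real pipeline delegates the search to Pinecone.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def identify_speaker(query, known, threshold=0.7):
    """Return (name, score) for the closest stored speaker.

    If the best score is below `threshold` (an assumed value), the name
    is None so the clip can be sent for manual labeling instead.
    """
    best_name, best_score = None, -1.0
    for name, embedding in known.items():
        score = cosine_similarity(query, embedding)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score

# Toy 3-dim embeddings; real ECAPA-TDNN embeddings have 192 dimensions.
known = {"Alice": [0.9, 0.1, 0.0], "Bob": [0.0, 0.2, 0.9]}
match = identify_speaker([0.8, 0.2, 0.1], known)    # close to Alice
unknown = identify_speaker([0.1, 0.9, 0.1], known)  # close to nobody
```

Returning None below the threshold is what keeps an unseen speaker from being silently assigned to the nearest known voice.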

Why this project matters

Real-world problem: short-form reposts explode across platforms. SonicGen links clips back to their source and identifies who's speaking.
Full pipeline: DSP (spectrograms, peak detection, landmark hashing), speaker diarization, voice embeddings, vector search, cloud storage, and state-driven batch processing.
Production thinking: restartable pipelines with crash recovery, chunked inserts, noisy-hash filtering, rate limiting, and configurable thresholds.
Vector database experience: stores and queries 192-dim speaker embeddings in Pinecone. The same embed-store-query pattern that powers RAG and semantic search.
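
As a rough idea of what "chunked inserts" means here: fingerprint rows are sent to the database in fixed-size batches rather than one giant request. The sketch below is a generic version under assumptions; the helper names and the batch size of 500 are hypothetical, and `insert_batch` stands in for whatever call backend/supabase_utils.py actually makes.

```python
from itertools import islice

def chunked(rows, size):
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

def insert_in_chunks(rows, insert_batch, size=500):
    """Call `insert_batch` once per chunk and return the number of rows sent.

    `insert_batch` is a placeholder for the real database call.
    """
    sent = 0
    for batch in chunked(rows, size):
        insert_batch(batch)
        sent += len(batch)
    return sent

# Stand-in for a database: just record each batch that would be inserted.
received = []
total = insert_in_chunks(range(1200), received.append, size=500)
```

Smaller batches keep each request under payload limits and mean a failure costs one batch instead of the whole video's fingerprints.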

Key files

Ingestion & download

backend/youtube_api.py — Channel and video metadata ingestion via YouTube Data API v3
backend/download.py — yt-dlp download + GCS upload, state-driven batch processing
backend/extract_pipeline.py — End-to-end orchestrator (ingest, download, deduplicate)

Fingerprinting & dedup

backend/fingerprint_audio.py — DSP pipeline: STFT spectrograms, peak detection, constellation hashing, segmentation
backend/fingerprint_pipeline.py — Fingerprint orchestration and matching
backend/supabase_utils.py — Postgres/Supabase access, chunked inserts, candidate search
backend/admin_tools.py — Hash count maintenance, noisy-hash refresh
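
For readers new to constellation hashing, here is a minimal sketch of the pairing step, assuming spectrogram peaks have already been found. It is not the code in backend/fingerprint_audio.py: the function name, the `fan_out` and `max_dt` parameters, and the "f1|f2|dt" hash format are illustrative assumptions.

```python
def constellation_hashes(peaks, fan_out=3, max_dt=50):
    """Pair each anchor peak with up to `fan_out` later peaks.

    `peaks` is a list of (time_frame, freq_bin) tuples. Each pair becomes
    a hash built from the two frequencies and their time gap, stored
    together with the anchor's time. Returns (hash_string, anchor_time)
    tuples.
    """
    peaks = sorted(peaks)
    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        paired = 0
        for t2, f2 in peaks[i + 1:]:
            dt = t2 - t1
            if dt > max_dt:
                break  # peaks are time-sorted, so the rest are even further
            if dt == 0:
                continue  # skip peaks in the same frame as the anchor
            hashes.append((f"{f1}|{f2}|{dt}", t1))
            paired += 1
            if paired == fan_out:
                break
    return hashes

peaks = [(0, 10), (5, 20), (12, 15), (80, 30)]
hashes = constellation_hashes(peaks)

# The same audio starting 100 frames later yields the same hash strings.
shifted = constellation_hashes([(t + 100, f) for t, f in peaks])
```

Because the hash encodes only the time gap between two peaks, not their absolute position, a clip cut from the middle of a video produces the same hashes as the original at a different anchor time.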

Speaker identification

backend/transcribe.py — AssemblyAI transcription with speaker diarization
backend/label_speakers.py — Map generic speaker labels (Speaker A, B) to real names
backend/extract_speaker_clips.py — Extract per-speaker audio clips from diarized transcripts
backend/speaker_embeddings.py — Encode speaker clips with SpeechBrain ECAPA-TDNN, store in Pinecone
backend/identify_speaker.py — Query Pinecone to identify speakers in new clips, with accuracy evaluation

Utilities

backend/utils/utils_audio.py — Audio loading and preprocessing

Data model

videos — YouTube metadata + dedup status

id UUID, youtube_id TEXT, title TEXT, duration INT
original_video_id UUID (for a duplicate, references the source video's id)
match_status TEXT (null → pending → fingerprinted/matched/too_short/flag)
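
The match_status progression can be read as a small state machine, which is what makes the pipeline restartable. The sketch below is inferred from the status list only: the transition table, the function names, and the assumption that all four outcomes (including flag) are final are hypothetical.

```python
# Hypothetical transition table inferred from the status list above.
# None represents a SQL NULL match_status.
ALLOWED = {
    None: {"pending"},
    "pending": {"fingerprinted", "matched", "too_short", "flag"},
}

def advance(current, new):
    """Return `new` if the move is allowed, otherwise raise ValueError."""
    if new not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition: {current!r} -> {new!r}")
    return new

def needs_processing(status):
    """Unfinished rows (null or pending) are what a restarted run picks up."""
    return status in ALLOWED

status = advance(None, "pending")
status = advance(status, "matched")
```

After a crash, a rerun only has to select rows where `needs_processing` is true; finished videos are never fingerprinted twice.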

fingerprints — Constellation hashes with timestamps

hash TEXT, video_id UUID, t_ref INT

fingerprint_hash_counts — Track hash frequency for noisy-hash filtering

noisy_hashes — Common hashes excluded from matching (room tone, breathing, etc.)
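
One plausible way to derive the noisy-hash list from hash frequencies is sketched below. This is an assumption about the approach, not the code in backend/admin_tools.py: the function names and the `max_videos` cutoff are hypothetical, and the real tables live in Postgres rather than in memory.

```python
from collections import defaultdict

def find_noisy_hashes(fingerprints, max_videos=3):
    """Return hashes that occur in more than `max_videos` distinct videos.

    `fingerprints` is an iterable of (hash, video_id) pairs. Repeats of a
    hash inside one video do not count toward the limit.
    """
    videos_per_hash = defaultdict(set)
    for hash_value, video_id in fingerprints:
        videos_per_hash[hash_value].add(video_id)
    return {h for h, vids in videos_per_hash.items() if len(vids) > max_videos}

def drop_noisy(query_hashes, noisy):
    """Remove (hash, time) pairs whose hash is on the noisy list."""
    return [(h, t) for h, t in query_hashes if h not in noisy]

fingerprints = [
    ("aa", "v1"), ("aa", "v2"), ("aa", "v3"),  # "aa" shows up everywhere
    ("bb", "v1"), ("bb", "v1"),                # "bb" repeats in one video only
]
noisy = find_noisy_hashes(fingerprints, max_videos=2)
clean = drop_noisy([("aa", 4), ("bb", 9)], noisy)
```

Counting distinct videos rather than raw occurrences means a hash that repeats inside a single video (a jingle, say) is not penalized, while one shared across many videos is.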

Pinecone index — 192-dim ECAPA speaker embeddings with cosine similarity

Matching logic

The dedup engine doesn't just count hash collisions. It checks whether the matching hashes agree on the same time offset between the query clip and the candidate video. That's the key insight of constellation-hash matching: a real match means many hashes align at a consistent time delta, not just that some hashes collide randomly.

Threshold: 18+ aligned hashes AND 40%+ alignment ratio
Noisy hashes (common across many videos) are filtered out
Segmented matching scales coverage with clip length
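
The offset-agreement check can be sketched as a histogram over time deltas. The thresholds (18 aligned hashes, 40% ratio) come from the list above; everything else is an assumption, including the function names and the definition of alignment ratio as aligned pairs divided by all colliding pairs, which may differ from the real implementation.

```python
from collections import Counter, defaultdict

MIN_ALIGNED = 18   # from the threshold above
MIN_RATIO = 0.40   # from the threshold above

def best_alignment(query, candidate):
    """Find the time offset most colliding hashes agree on.

    `query` and `candidate` are lists of (hash, time) pairs. Returns
    (offset, aligned_count, ratio), where ratio is assumed to mean
    aligned_count divided by the total number of colliding pairs.
    """
    times_by_hash = defaultdict(list)
    for h, t in candidate:
        times_by_hash[h].append(t)
    offsets = Counter()
    for h, t_query in query:
        for t_candidate in times_by_hash.get(h, ()):
            offsets[t_candidate - t_query] += 1
    total = sum(offsets.values())
    if total == 0:
        return None, 0, 0.0
    offset, aligned = offsets.most_common(1)[0]
    return offset, aligned, aligned / total

def is_match(query, candidate):
    """Apply both thresholds to the best alignment."""
    _, aligned, ratio = best_alignment(query, candidate)
    return aligned >= MIN_ALIGNED and ratio >= MIN_RATIO

source = [(f"h{i}", i * 3) for i in range(40)]
# A real repost: 25 hashes from the source, starting 30 frames in.
clip = [(f"h{i}", i * 3 - 30) for i in range(10, 35)]
# 25 hashes that collide with the source but at unrelated times.
scattered = [(f"h{i}", (i * 7) % 11) for i in range(25)]
```

Here `clip` and `scattered` both collide with `source` on 25 hashes, but only `clip` passes: its collisions all agree on an offset of 30, while the scattered ones never agree on any offset more than once.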

Setup

Requirements

Python 3.11+
Supabase project (Postgres)
Google Cloud Storage bucket
YouTube Data API key
AssemblyAI API key (for transcription)
Pinecone account (for speaker embeddings)

Configuration

Copy .env.example to .env and fill in your API keys:

cp .env.example .env

Running the pipeline

Ingest and deduplicate:

python backend/extract_pipeline.py
# Enter a YouTube handle like @SomeChannel

Transcribe with speaker diarization:

python backend/transcribe.py ./data/audio_files

Label speakers and extract clips:

python backend/label_speakers.py --transcripts ./data/transcripts --output ./data/labeled --map speaker_map.json
python backend/extract_speaker_clips.py --transcripts ./data/labeled --audio ./data/audio_files --output ./data/speaker_clips --speaker "John Doe"

Store speaker embeddings and identify future speakers:

python backend/speaker_embeddings.py --clips ./data/speaker_clips --label "speaker_01"
python backend/identify_speaker.py identify ./data/new_clip.mp3

Stack

Python, NumPy, SciPy, librosa, pydub
yt-dlp, FFmpeg, YouTube Data API v3
AssemblyAI (transcription + diarization)
SpeechBrain ECAPA-TDNN (speaker embeddings)
Pinecone (vector database)
Supabase / Postgres
Google Cloud Storage

Roadmap

Scale testing — stress-test speaker identification on larger corpora
Adaptive tiered matching — compare 10%, 25%, 50%, 75% of hashes with early exit for unambiguous matches
Dynamic original detection — automatically determine which video in a match pair is the true source
Interactive frontend — web client to input a YouTube URL and see the original source
FAISS integration — high-speed similarity search for large-scale dedup
