MLGalusha/SonicGen

SonicGen

Read the deep-dive: DEEP_DIVE.md — a full walkthrough of how the fingerprinting and matching work, the database choice I got wrong, and the honest weaknesses I'd fix before putting this in front of real data.

Audio ingestion, speaker identification, and deduplication pipeline.

Give SonicGen a YouTube channel handle and it ingests every video, deduplicates clips via Shazam-style audio fingerprinting so you don't pay to re-transcribe duplicates, and transcribes the rest with speaker diarization. You label each speaker once, which takes seconds, and SonicGen stores voice embeddings so those speakers are identified automatically in future videos.

Written by hand in Python. No coding agents.


The full pipeline

  1. Ingest — point it at a YouTube handle. It pulls all video metadata via the YouTube Data API.
  2. Download — audio extracted via yt-dlp + FFmpeg, stored in Google Cloud Storage.
  3. Fingerprint & deduplicate — Shazam-style constellation hashing detects duplicate clips before they hit transcription. Alignment-offset matching (not just hash collision count) keeps accuracy high.
  4. Transcribe — AssemblyAI with speaker diarization. Output: timestamped transcript with Speaker A, Speaker B, etc.
  5. Label speakers — play a short clip of each speaker, label them. Takes seconds, especially for podcasts with 2 speakers.
  6. Extract speaker clips — isolates each speaker's audio segments from the full video, trimmed to avoid crosstalk.
  7. Voice embeddings — SpeechBrain ECAPA-TDNN encodes each speaker's clips into 192-dimensional embeddings, stored in Pinecone (vector database, cosine similarity).
  8. Automatic speaker identification — on future videos, new clips are encoded and matched against stored embeddings. Known speakers are identified automatically.
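
Steps 7 and 8 come down to a nearest-neighbour lookup under cosine similarity. The sketch below is only an illustration of that idea using plain Python lists; it does not call SpeechBrain or Pinecone, and the `identify` function, the toy 3-dimensional vectors, and the `0.6` threshold are all assumptions made for the example (the real embeddings are 192-dimensional and the search runs in Pinecone).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def identify(query, enrolled, threshold=0.6):
    """Return (label, score) for the closest enrolled speaker.

    enrolled maps a speaker label to a list of stored embeddings.
    threshold is a hypothetical cutoff: below it the speaker is
    reported as unknown (label None) rather than guessed.
    """
    best_label, best_score = None, -1.0
    for label, vectors in enrolled.items():
        for vector in vectors:
            score = cosine(query, vector)
            if score > best_score:
                best_label, best_score = label, score
    if best_score < threshold:
        return None, best_score
    return best_label, best_score


# Toy data: two enrolled speakers, one embedding each.
enrolled = {"alice": [[1.0, 0.0, 0.0]], "bob": [[0.0, 1.0, 0.0]]}
known = identify([0.9, 0.1, 0.0], enrolled)
unknown = identify([0.0, 0.0, 1.0], enrolled)
```

Returning `None` below the threshold keeps an unseen voice from being forced onto the nearest known speaker; the right cutoff would have to be tuned on real clips.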

Why this project matters

  • Real-world problem: short-form reposts explode across platforms. SonicGen links clips back to their source and identifies who's speaking.
  • Full pipeline: DSP (spectrograms, peak detection, landmark hashing), speaker diarization, voice embeddings, vector search, cloud storage, and state-driven batch processing.
  • Production thinking: restartable pipelines with crash recovery, chunked inserts, noisy-hash filtering, rate limiting, and configurable thresholds.
  • Vector database experience: stores and queries 192-dim speaker embeddings in Pinecone. The same embed-store-query pattern that powers RAG and semantic search.

Key files

Ingestion & download

  • backend/youtube_api.py — Channel and video metadata ingestion via YouTube Data API v3
  • backend/download.py — yt-dlp download + GCS upload, state-driven batch processing
  • backend/extract_pipeline.py — End-to-end orchestrator (ingest, download, deduplicate)

Fingerprinting & dedup

  • backend/fingerprint_audio.py — DSP pipeline: STFT spectrograms, peak detection, constellation hashing, segmentation
  • backend/fingerprint_pipeline.py — Fingerprint orchestration and matching
  • backend/supabase_utils.py — Postgres/Supabase access, chunked inserts, candidate search
  • backend/admin_tools.py — Hash count maintenance, noisy-hash refresh
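
To make "constellation hashing" concrete, here is a minimal sketch of the pairing step, assuming spectrogram peaks have already been detected as `(time_frame, freq_bin)` tuples. It is not the code in `backend/fingerprint_audio.py`; the function name, `FAN_OUT`, `MAX_DT`, and the SHA-1 key format are hypothetical choices for illustration. Each anchor peak is paired with a few later peaks, and the hash depends only on the two frequencies and the time gap, so the same audio yields the same hashes wherever it starts.

```python
import hashlib

FAN_OUT = 5    # hypothetical: max target peaks paired with each anchor
MAX_DT = 200   # hypothetical: max frame gap between anchor and target


def constellation_hashes(peaks, fan_out=FAN_OUT, max_dt=MAX_DT):
    """Turn (time_frame, freq_bin) peaks into (hash, t_ref) pairs.

    t_ref is the anchor peak's time frame, matching the t_ref column
    in the fingerprints table.
    """
    peaks = sorted(peaks)  # sort by time, then frequency
    out = []
    for i, (t1, f1) in enumerate(peaks):
        paired = 0
        for t2, f2 in peaks[i + 1:]:
            dt = t2 - t1
            if dt <= 0:
                continue  # skip peaks in the same frame
            if dt > max_dt:
                break     # later peaks are even further away
            key = f"{f1}|{f2}|{dt}".encode()
            out.append((hashlib.sha1(key).hexdigest()[:16], t1))
            paired += 1
            if paired >= fan_out:
                break
    return out


# The same three peaks, once at the start and once shifted by 100 frames.
original = constellation_hashes([(0, 10), (5, 20), (9, 30)])
shifted = constellation_hashes([(100, 10), (105, 20), (109, 30)])
```

Here `original` and `shifted` contain identical hash strings, and every `t_ref` differs by exactly 100 frames, which is the property the matching step relies on.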

Speaker identification

  • backend/transcribe.py — AssemblyAI transcription with speaker diarization
  • backend/label_speakers.py — Map generic speaker labels (Speaker A, B) to real names
  • backend/extract_speaker_clips.py — Extract per-speaker audio clips from diarized transcripts
  • backend/speaker_embeddings.py — Encode speaker clips with SpeechBrain ECAPA-TDNN, store in Pinecone
  • backend/identify_speaker.py — Query Pinecone to identify speakers in new clips, with accuracy evaluation

Utilities

  • backend/utils/utils_audio.py — Audio loading and preprocessing

Data model

videos — YouTube metadata + dedup status

  • id UUID, youtube_id TEXT, title TEXT, duration INT
  • original_video_id UUID (set on a duplicate; references the source video it matched)
  • match_status TEXT (null → pending → fingerprinted/matched/too_short/flag)

fingerprints — Constellation hashes with timestamps

  • hash TEXT, video_id UUID, t_ref INT

fingerprint_hash_counts — Track hash frequency for noisy-hash filtering

noisy_hashes — Common hashes excluded from matching (room tone, breathing, etc.)
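
One way noisy-hash filtering could work, sketched under assumptions: count how many distinct videos each hash appears in, and treat hashes present in more than some fraction of the corpus as noise. The function names, the `max_video_fraction=0.5` cutoff, and the `min_videos` guard are hypothetical; the criteria SonicGen actually uses live in `backend/admin_tools.py` and may differ.

```python
from collections import defaultdict


def find_noisy_hashes(fingerprints, max_video_fraction=0.5, min_videos=3):
    """Return hashes that occur in too large a share of videos.

    fingerprints: iterable of (hash, video_id, t_ref) rows.
    Both thresholds are illustrative assumptions.
    """
    videos_per_hash = defaultdict(set)
    all_videos = set()
    for h, video_id, _t_ref in fingerprints:
        videos_per_hash[h].add(video_id)
        all_videos.add(video_id)
    if len(all_videos) < min_videos:
        return set()  # too small a corpus to judge what is "common"
    cutoff = max_video_fraction * len(all_videos)
    return {h for h, vids in videos_per_hash.items() if len(vids) > cutoff}


def drop_noisy(query, noisy):
    """Remove noisy hashes from a query's (hash, t_ref) list."""
    return [(h, t) for h, t in query if h not in noisy]


rows = [
    ("hum", "v1", 0), ("hum", "v2", 3), ("hum", "v3", 7), ("hum", "v4", 1),
    ("a", "v1", 5), ("b", "v2", 9),
]
noisy = find_noisy_hashes(rows)
filtered = drop_noisy([("hum", 0), ("a", 5)], noisy)
```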

Pinecone index — 192-dim ECAPA speaker embeddings with cosine similarity


Matching logic

The dedup engine doesn't just count hash collisions. It checks whether the matching hashes agree on the same time offset between the query clip and the candidate video. That's the key insight of constellation-hash matching: a real match means many hashes align at a consistent time delta, not just that some hashes collide randomly.

  • Threshold: 18+ aligned hashes AND 40%+ alignment ratio
  • Noisy hashes (common across many videos) are filtered out
  • Segmented matching scales coverage with clip length
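
The offset check can be sketched as a histogram of time deltas. This is an illustration, not the code in `backend/fingerprint_pipeline.py`: the function names are hypothetical, and "alignment ratio" is assumed here to mean the share of colliding hash pairs that fall on the single most common delta. The 18 and 40% thresholds are the ones listed above.

```python
from collections import Counter

MIN_ALIGNED = 18   # threshold from the list above
MIN_RATIO = 0.40   # threshold from the list above


def score_candidate(query, candidate):
    """Score a candidate video against a query clip.

    query, candidate: lists of (hash, t_ref) pairs.
    Returns (aligned_count, alignment_ratio, best_delta).
    """
    cand_times = {}
    for h, t in candidate:
        cand_times.setdefault(h, []).append(t)

    # For every colliding hash, record candidate time minus query time.
    deltas = Counter()
    for h, t_query in query:
        for t_cand in cand_times.get(h, ()):
            deltas[t_cand - t_query] += 1

    total = sum(deltas.values())
    if total == 0:
        return 0, 0.0, None
    best_delta, aligned = deltas.most_common(1)[0]
    return aligned, aligned / total, best_delta


def is_match(query, candidate):
    aligned, ratio, _ = score_candidate(query, candidate)
    return aligned >= MIN_ALIGNED and ratio >= MIN_RATIO


# Candidate: 30 hashes, one every 3 frames.
candidate = [(f"h{i}", i * 3) for i in range(30)]
# Real clip: hashes 10..29, starting 30 frames into the candidate.
clip = [(f"h{i}", i * 3 - 30) for i in range(10, 30)]
# Random collisions: same hashes, but timing that does not line up.
scrambled = [(f"h{i}", i * 7) for i in range(30)]
```

The real clip puts all 20 collisions on one delta and passes both thresholds. The scrambled query collides on all 30 hashes, yet every delta is different, so only one hash is "aligned" and it is rejected.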

Setup

Requirements

  • Python 3.11+
  • Supabase project (Postgres)
  • Google Cloud Storage bucket
  • YouTube Data API key
  • AssemblyAI API key (for transcription)
  • Pinecone account (for speaker embeddings)

Configuration

Copy .env.example to .env and fill in your API keys:

cp .env.example .env

Running the pipeline

Ingest and deduplicate:

python backend/extract_pipeline.py
# Enter a YouTube handle like @SomeChannel

Transcribe with speaker diarization:

python backend/transcribe.py ./data/audio_files

Label speakers and extract clips:

python backend/label_speakers.py --transcripts ./data/transcripts --output ./data/labeled --map speaker_map.json
python backend/extract_speaker_clips.py --transcripts ./data/labeled --audio ./data/audio_files --output ./data/speaker_clips --speaker "John Doe"

Store speaker embeddings and identify future speakers:

python backend/speaker_embeddings.py --clips ./data/speaker_clips --label "speaker_01"
python backend/identify_speaker.py identify ./data/new_clip.mp3

Stack

  • Python, NumPy, SciPy, librosa, pydub
  • yt-dlp, FFmpeg, YouTube Data API v3
  • AssemblyAI (transcription + diarization)
  • SpeechBrain ECAPA-TDNN (speaker embeddings)
  • Pinecone (vector database)
  • Supabase / Postgres
  • Google Cloud Storage

Roadmap

  1. Scale testing — stress-test speaker identification on larger corpora
  2. Adaptive tiered matching — compare 10%, 25%, 50%, 75% of hashes with early exit for unambiguous matches
  3. Dynamic original detection — automatically determine which video in a match pair is the true source
  4. Interactive frontend — web client to input a YouTube URL and see the original source
  5. FAISS integration — high-speed similarity search for large-scale dedup

About

Hand-written audio fingerprinting engine for duplicate detection and source matching.
