Jiehui Huang 黄杰辉

PhD Student · Dept. of CSE, HKUST
Supervised by Prof. Jiaya Jia

My research focuses on physically grounded interactive video generation and world modeling, controllable generation and editing, and multimodal understanding. My long-term goal is to build controllable multimodal "world simulators" that bridge virtual data and the real physical world. Such simulators would accelerate the evolution of embodied agents in both simulated and real environments and, ultimately, advance toward early realizations of digital immortality and machine consciousness. I also serve as an open-source maintainer of StarVLA WM4A. I previously received my M.S. from Sun Yat-sen University, advised by Xiaodan Liang and Shengcai Liao. Feel free to reach out via email.

Email Google Scholar GitHub HuggingFace

News

2026.02 🎉 UnityVideo was accepted by CVPR 2026.
2025.12 🎉 ConsistentID was accepted by IEEE TPAMI.
2025.12 Released UnityVideo, a unified multi-modal multi-task video generation model.
2025.11 🎉 One paper accepted by AAAI 2026.
2025.07 🎉 One paper accepted by ACM MM 2025.
2025.01 🎉 One paper accepted by IEEE Transactions on Instrumentation & Measurement.
2024.11 Awarded the China National Scholarship at Sun Yat-sen University.
2024.04 Released ConsistentID (✨ 900+ Stars), a high-fidelity customized portrait generation model.
2023.12 🎉 Two papers accepted by AAAI 2024 and Knowledge-Based Systems.
2023.11 🎉 One paper accepted by Neurocomputing.

Internships

2025.04 – present Kling Team, Kuaishou Technology · Collaboration with Xin Tao

2024.12 – 2025.03 Pixocial Technology · Collaboration with Haoxiang Li

2024.04 – 2024.09 Tencent Hunyuan Team, TEG, Shenzhen · Collaboration with Hu Ye

2023.11 – 2024.03 Lenovo Research Institute, Shenzhen

2023.07 – 2023.10 SenseTime Research, Shenzhen

Selected Publications

🤖 Agentic World Model

CVPR 2026

UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation

Jiehui Huang, Yuechen Zhang, Xu He, Yuan Gao, Zhi Cen, Bin Xia, Yan Zhou, Xin Tao, Pengfei Wan, Jiaya Jia

CVPR, 2026

UnityVideo is a unified framework that brings I2V, T2V, and video enhancement into joint training through a modality-adaptive switcher and an in-context learner, enabling mutual knowledge transfer across tasks. We also release OpenUni (1.3M pairs) and UniBench (30K samples) for building and evaluating unified video models.

Paper Project Code HuggingFace

arXiv 2025

ReCamDriving: LiDAR-Free Camera-Controlled Novel Trajectory Video Generation

Yaokun Li, Shuaixian Wang, Mantang Guo, Jiehui Huang, Taojun Ding, Mu Hu, Kaixuan Wang, Shaijie Shen, Guang Tan

arXiv, 2025

ReCamDriving achieves camera-controlled novel-trajectory video generation without LiDAR, using 3DGS renderings for structural guidance and precise camera control. We also construct ParaDrive, a dataset of 110K+ parallel-trajectory pairs built with a novel cross-trajectory data curation strategy.

Paper Project Code

arXiv 2026

Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

AgenticWM Team · Jiehui Huang (Contributor)

arXiv, 2026

A comprehensive survey of agentic world modeling organized around a "levels × laws" framework, which categorizes capabilities into three levels (Predictor, Simulator, Evolver) across four governing-law regimes (physical, digital, social, scientific). Synthesizing 400+ works spanning RL, video generation, and autonomous agents, we derive evaluation principles and architectural guidance for building systems that can simulate and reshape their environments.

Paper GitHub

Open-Source

StarVLA WM4A: World Model for Agents

StarVLA Team · Jiehui Huang (WM4A Maintainer)

Open-source project under StarVLA

WM4A (World Model for Agents) is an open-source embodied world model framework featuring plug-and-play backbones and action heads, unified training strategies, and a standardized benchmark interface. As a maintainer, I contribute to model development and ongoing releases.

Code HuggingFace

🎨 Controllable AIGC

TPAMI

ConsistentID: Portrait Generation with Multimodal Fine-Grained Identity Preserving

Jiehui Huang, Xiao Dong, Wenhui Song, Hanhui Li, Jun Zhou, Yuhao Cheng, Shutao Liao, Long Chen, Yiqiang Yan, Shengcai Liao, Xiaodan Liang

IEEE TPAMI, 2025

ConsistentID improves fine-grained facial customization with multimodal facial region descriptions and an ID-preservation network optimized via facial attention localization. We introduce FGID, the first large-scale fine-grained facial identity dataset capturing diverse identity-preserving details.

Paper Code Demo

ICML 2026

From Inpainting to Editing: Unlocking Robust Mask-Free Visual Dubbing via Generative Bootstrapping

Xu He, Haoxian Zhang, Hejia Chen, Changyuan Zheng, Jiehui Huang, et al.

ICML, 2026

X-Dub is a two-stage audio-visual dubbing framework that uses a mask-based inpainting model to generate pseudo-paired training data, then bootstraps a mask-free DiT editing model that operates on full video context. This eliminates masking artifacts while achieving state-of-the-art lip synchronization and visual fidelity for portrait video dubbing.

Paper

ACM MM 2025

LaVieID: Local Autoregressive Diffusion Transformers for Identity-Preserving Video Creation

Wenhui Song, Hanhui Li, Jiehui Huang, Panwen Hu, Yuhao Cheng, Long Chen, Yiqiang Yan, Xiaodan Liang

ACM MM, 2025

LaVieID tackles identity-preserving text-to-video generation with two components: a local router that extracts fine-grained facial cues as spatial structural guidance, and a temporal autoregressive module that models long-range inter-frame dependencies. Together they yield vivid, identity-consistent videos.

Code

AAAI 2026

Zero-shot 3D-Aware Trajectory-Guided Image-to-Video Generation via Test-Time Training

Ruicheng Zhang, Jun Zhou, Zunnan Xu, Zihao Liu, Jiehui Huang, Mingyang Zhang, Yu Sun, Xiu Li

AAAI, 2026

Zo3T enables zero-shot 3D-aware trajectory-guided image-to-video generation via lightweight test-time LoRA modules that adaptively guide generation without target-domain training. Noise score re-evaluation enforces trajectory fidelity during latent manipulation.

Paper Code

📊 Other Insightful Projects

KBS 2024 · Domain Baseline

TMBL: Transformer-based Multimodal Binding Learning for Multimodal Sentiment Analysis

Jiehui Huang, Jun Zhou, Zhenchao Tang, Jiaying Lin, Calvin Yu-Chian Chen

Knowledge-Based Systems, 2024

TMBL redesigns the Transformer with a CLIP-inspired cross-modal binding mechanism to reduce modal heterogeneity in multimodal sentiment analysis. CLS and position embeddings explicitly distinguish the modal spaces, and the model improves accuracy (ACC) by 6% over prior methods.

Paper Code

Neurocomputing 2024

Progressive Network based on Detail Scaling and Texture Extraction for Image Deraining

Jiehui Huang, Zhenchao Tang, Xuedong He, Jun Zhou, Defeng Zhou, Calvin Yu-Chian Chen

Neurocomputing, 2024

DTPNet is a progressive deraining framework that combines a detail scaling module with enhanced Transformer blocks for generalized feature extraction from degraded images. It achieves state-of-the-art results on SPA-Data, RainDrop, RID, and Rain100.

Paper Code

AAAI 2024

Comprehensive View Embedding Learning for Single-cell Multimodal Integration

Zhenchao Tang, Jiehui Huang, Guanxing Chen, Pengfei Wen, Calvin Yu-Chian Chen

AAAI, 2024

CoVEL performs single-cell multimodal integration through three-view embedding learning. It uses self-supervised contrastive learning to capture cross-modal regulatory relationships and fine-grained single-cell features, thereby bridging heterogeneous feature spaces.

Paper Code

Education

Hong Kong University of Science and Technology
Ph.D. in Artificial Intelligence, Dept. of CSE
2025.09 – present

Sun Yat-sen University
M.S. in Artificial Intelligence, School of Intelligent Systems Engineering
2022.09 – 2025.06

Honors & Awards

2025.06 Outstanding Graduate, Sun Yat-sen University

2024.11 China National Scholarship, Sun Yat-sen University

2023.10 First Prize Scholarship, Sun Yat-sen University

2021.11 China National Scholarship

2021.08 CIMC Siemens Cup China Intelligent Manufacturing Challenge, National First Prize

2020.08 RoboMaster Infantry Group, National First Prize

2020.02 Invention Patent: Non-blocking Controllable Projectile Launch System

Academic Service

Conference Reviewer: CVPR, ECCV, AAAI, ACM MM, AISTATS
Journal Reviewer: TPAMI, TVCG, TIP, TIM, Knowledge-Based Systems