Jiehui Huang 黄杰辉

PhD Student · Dept. of CSE, HKUST
Supervised by Prof. Jiaya Jia

My research focuses on physically grounded interactive video generation and world modeling, controllable generation and editing, and multimodal understanding. My long-term goal is to build controllable multimodal "world simulators" that bridge virtual data and the real physical world and accelerate the evolution of embodied agents across simulated and real environments, ultimately advancing toward early realizations of digital immortality and machine consciousness. I am also an open-source maintainer of StarVLA WM4A. I previously received my M.S. from Sun Yat-sen University, advised by Xiaodan Liang and Shengcai Liao. Feel free to reach out via email.

Internships

2025.04 – present Kling Team, Kuaishou Technology  ·  Collaboration with Xin Tao
2024.12 – 2025.03 Pixocial Technology  ·  Collaboration with Haoxiang Li
2024.04 – 2024.09 Tencent Hunyuan Team, TEG, Shenzhen  ·  Collaboration with Hu Ye
2023.11 – 2024.03 Lenovo Research Institute, Shenzhen
2023.07 – 2023.10 SenseTime Research, Shenzhen

Selected Publications

🤖 Agentic World Model
UnityVideo
CVPR 2026

UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation

Jiehui Huang, Yuechen Zhang, Xu He, Yuan Gao, Zhi Cen, Bin Xia, Yan Zhou, Xin Tao, Pengfei Wan, Jiaya Jia

CVPR, 2026

UnityVideo is a unified framework integrating I2V, T2V, and video enhancement into joint training via a modality-adaptive switcher and in-context learner, enabling mutual knowledge transfer across tasks. We release OpenUni (1.3M pairs) and UniBench (30K samples) for unified video model evaluation.

ReCamDriving
arXiv 2025

ReCamDriving: LiDAR-Free Camera-Controlled Novel Trajectory Video Generation

Yaokun Li, Shuaixian Wang, Mantang Guo, Jiehui Huang, Taojun Ding, Mu Hu, Kaixuan Wang, Shaojie Shen, Guang Tan

arXiv, 2025

ReCamDriving achieves camera-controlled novel-trajectory video generation without LiDAR by leveraging 3DGS renderings for structural guidance and precise camera control. We construct ParaDrive, a dataset with 110K+ parallel-trajectory pairs via a novel cross-trajectory data curation strategy.

Agentic World Modeling
arXiv 2026

Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

AgenticWM Team  ·  Jiehui Huang (Contributor)

arXiv, 2026

A comprehensive survey of agentic world modeling organized by a "levels × laws" framework — categorizing capabilities across three levels (Predictor, Simulator, Evolver) and four governing-law regimes (physical, digital, social, scientific). Synthesizing 400+ works spanning RL, video generation, and autonomous agents, we derive evaluation principles and architectural guidance for building systems that can simulate and reshape environments.

StarVLA WM4A
Open-Source

StarVLA WM4A: World Model for Agents

StarVLA Team  ·  Jiehui Huang (WM4A Maintainer)

Open-source project under StarVLA

WM4A (World Model for Agents) is an open-source embodied world model framework built on plug-and-play backbone networks, action heads, unified training strategies, and a standardized benchmark interface. I serve as a maintainer, contributing to model development and ongoing release iterations.

🎨 Controllable AIGC
ConsistentID
TPAMI

ConsistentID: Portrait Generation with Multimodal Fine-Grained Identity Preserving

Jiehui Huang, Xiao Dong, Wenhui Song, Hanhui Li, Jun Zhou, Yuhao Cheng, Shutao Liao, Long Chen, Yiqiang Yan, Shengcai Liao, Xiaodan Liang

IEEE TPAMI, 2025

ConsistentID improves fine-grained facial customization with multimodal facial region descriptions and an ID-preservation network optimized via facial attention localization. We introduce FGID, the first large-scale fine-grained facial identity dataset capturing diverse identity-preserving details.

X-Dub
ICML 2026

From Inpainting to Editing: Unlocking Robust Mask-Free Visual Dubbing via Generative Bootstrapping

Xu He, Haoxian Zhang, Hejia Chen, Changyuan Zheng, Jiehui Huang, et al.

ICML, 2026

X-Dub is a two-stage audio-visual dubbing framework that uses a mask-based inpainting model to generate pseudo-paired training data, then bootstraps a mask-free DiT editing model that operates on full video context. This eliminates masking artifacts while achieving state-of-the-art lip synchronization and visual fidelity for portrait video dubbing.

LaVieID
ACM MM 2025

LaVieID: Local Autoregressive Diffusion Transformers for Identity-Preserving Video Creation

Wenhui Song, Hanhui Li, Jiehui Huang, Panwen Hu, Yuhao Cheng, Long Chen, Yiqiang Yan, Xiaodan Liang

ACM MM, 2025

LaVieID tackles identity-preserving text-to-video generation with a local router extracting fine-grained facial cues for spatial structural guidance and a temporal autoregressive module that models long-range frame dependencies, enabling vivid and identity-consistent video generation.

Zo3T
AAAI 2026

Zero-shot 3D-Aware Trajectory-Guided Image-to-Video Generation via Test-Time Training

Ruicheng Zhang, Jun Zhou, Zunnan Xu, Zihao Liu, Jiehui Huang, Mingyang Zhang, Yu Sun, Xiu Li

AAAI, 2026

Zo3T enables zero-shot 3D-aware trajectory-guided image-to-video generation via lightweight test-time LoRA modules that adaptively guide generation without target-domain training. Noise score re-evaluation enforces trajectory fidelity during latent manipulation.

📊 Other Insightful Projects
TMBL
KBS 2024  ·  Domain Baseline

TMBL: Transformer-based Multimodal Binding Learning for Multimodal Sentiment Analysis

Jiehui Huang, Jun Zhou, Zhenchao Tang, Jiaying Lin, Calvin Yu-Chian Chen

Knowledge-Based Systems, 2024

TMBL redesigns the Transformer with a CLIP-inspired cross-modal binding mechanism to reduce modal heterogeneity in multimodal sentiment analysis. CLS and position embeddings explicitly distinguish the modal spaces, and the model achieves a 6% improvement in accuracy over prior methods.

DTPNet
Neurocomputing 2024

Progressive Network based on Detail Scaling and Texture Extraction for Image Deraining

Jiehui Huang, Zhenchao Tang, Xuedong He, Jun Zhou, Defeng Zhou, Calvin Yu-Chian Chen

Neurocomputing, 2024

DTPNet is a progressive deraining framework that combines a detail scaling module with enhanced Transformer blocks for generalized feature extraction from degraded images, achieving state-of-the-art results on SPA-Data, RainDrop, RID, and Rain100.

CoVEL
AAAI 2024

Comprehensive View Embedding Learning for Single-cell Multimodal Integration

Zhenchao Tang, Jiehui Huang, Guanxing Chen, Pengfei Wen, Calvin Yu-Chian Chen

AAAI, 2024

CoVEL performs single-cell multimodal integration via three-view embedding learning, which captures cross-modal regulatory relationships and fine-grained single-cell features through self-supervised contrastive learning and effectively bridges heterogeneous feature spaces.

Education

Hong Kong University of Science and Technology
Ph.D. in Artificial Intelligence, Dept. of CSE
2025.09 – present
Sun Yat-sen University
M.S. in Artificial Intelligence, School of Intelligent Systems Engineering
2022.09 – 2025.06

Honors & Awards

2025.06 Outstanding Graduate, Sun Yat-sen University
2024.11 China National Scholarship, Sun Yat-sen University
2023.10 First Prize Scholarship, Sun Yat-sen University
2021.11 China National Scholarship
2021.08 CIMC Siemens Cup China Intelligent Manufacturing Challenge  ·  National First Prize
2020.08 RoboMaster Infantry Group  ·  National First Prize
2020.02 Invention Patent: Non-blocking Controllable Projectile Launch System

Academic Service

Conference Reviewer: CVPR, ECCV, AAAI, ACM MM, AISTATS
Journal Reviewer: TPAMI, TVCG, TIP, TIM, Knowledge-Based Systems