PhD Student · Dept. of CSE, HKUST
Supervised by Prof. Jiaya Jia
My research focuses on physically grounded interactive video generation and world modeling, controllable generation and editing, and multimodal understanding. My long-term goal is to build controllable multimodal "world simulators" that bridge virtual data and the real physical world, accelerating the evolution of embodied agents across both simulated and real environments and ultimately advancing toward early realizations of digital immortality and machine consciousness. I also contribute to StarVLA WM4A as an open-source maintainer. Previously, I received my M.S. from Sun Yat-sen University, advised by Xiaodan Liang and Shengcai Liao. Feel free to reach out via email.
UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation
CVPR, 2026
UnityVideo is a unified framework integrating I2V, T2V, and video enhancement into joint training via a modality-adaptive switcher and in-context learner, enabling mutual knowledge transfer across tasks. We release OpenUni (1.3M pairs) and UniBench (30K samples) for unified video model evaluation.
ReCamDriving: LiDAR-Free Camera-Controlled Novel Trajectory Video Generation
arXiv, 2025
ReCamDriving achieves camera-controlled novel-trajectory video generation without LiDAR by leveraging 3DGS renderings for structural guidance and precise camera control. We construct ParaDrive, a dataset with 110K+ parallel-trajectory pairs via a novel cross-trajectory data curation strategy.
Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
arXiv, 2026
This comprehensive survey organizes agentic world modeling with a "levels × laws" framework, categorizing capabilities across three levels (Predictor, Simulator, Evolver) and four governing-law regimes (physical, digital, social, scientific). Synthesizing 400+ works spanning RL, video generation, and autonomous agents, we derive evaluation principles and architectural guidance for building systems that can simulate and reshape environments.
StarVLA WM4A: World Model for Agents
Open-source project under StarVLA
WM4A (World Model for Agents) is an open-source embodied world model framework built on plug-and-play backbone networks, action heads, unified training strategies, and a standardized benchmark interface. I serve as a maintainer, contributing to model development and ongoing release iterations.
ConsistentID: Portrait Generation with Multimodal Fine-Grained Identity Preserving
IEEE TPAMI, 2025
ConsistentID improves fine-grained facial customization with multimodal facial region descriptions and an ID-preservation network optimized via facial attention localization. We introduce FGID, the first large-scale fine-grained facial identity dataset capturing diverse identity-preserving details.
From Inpainting to Editing: Unlocking Robust Mask-Free Visual Dubbing via Generative Bootstrapping
ICML, 2026
X-Dub is a two-stage audio-visual dubbing framework that uses a mask-based inpainting model to generate pseudo-paired training data, then bootstraps a mask-free DiT editing model that operates on full video context. This eliminates masking artifacts while achieving state-of-the-art lip synchronization and visual fidelity for portrait video dubbing.
LaVieID: Local Autoregressive Diffusion Transformers for Identity-Preserving Video Creation
ACM MM, 2025
LaVieID tackles identity-preserving text-to-video generation with a local router that extracts fine-grained facial cues for spatial structural guidance and a temporal autoregressive module that models long-range frame dependencies, enabling vivid and identity-consistent video generation.
Zero-shot 3D-Aware Trajectory-Guided Image-to-Video Generation via Test-Time Training
AAAI, 2026
Zo3T enables zero-shot 3D-aware trajectory-guided image-to-video generation via lightweight test-time LoRA modules that adaptively guide generation without target-domain training. Noise score re-evaluation enforces trajectory fidelity during latent manipulation.
TMBL: Transformer-based Multimodal Binding Learning for Multimodal Sentiment Analysis
Knowledge-Based Systems, 2024
TMBL redesigns the Transformer with a CLIP-inspired cross-modal binding mechanism to reduce modality heterogeneity in multimodal sentiment analysis. CLS and position embeddings explicitly distinguish the modality spaces, and the model achieves a 6% improvement in accuracy (ACC) over prior methods.
Progressive Network based on Detail Scaling and Texture Extraction for Image Deraining
Neurocomputing, 2024
DTPNet is a progressive deraining framework with a detail scaling module and enhanced Transformer blocks for generalized feature extraction from degraded images, achieving state-of-the-art results on SPA-Data, RainDrop, RID, and Rain100.
Comprehensive View Embedding Learning for Single-cell Multimodal Integration
AAAI, 2024
CoVEL performs single-cell multimodal integration via three-view embedding learning. It captures cross-modal regulatory relationships and fine-grained single-cell features through self-supervised contrastive learning, effectively bridging heterogeneous feature spaces.
Conference Reviewer: CVPR, ECCV, AAAI, ACM MM, AISTATS
Journal Reviewer: TPAMI, TVCG, TIP, TIM, Knowledge-Based Systems