Hi, this is Xiaojie Xu (徐啸捷). I am an incoming Ph.D. student in Information Science and Technology at The University of Tokyo. My current research focuses on generative AI, including image, video, and multimodal generation. Representative works include:

Multimodal Generation: POSTA (visually appealing movie poster generation from text, CVPR 25), PreGenie (MLLM agents for text-image document understanding and presentation generation, EMNLP 25), Orchestrating Audio (MLLM agents for long-video understanding and audio generation, EMNLP 25)
Image/Video Generation: VBench++ (benchmarking video generative models, T-PAMI 25), BEV to Street View (street-view image generation from bird’s-eye-view maps, ICRA 24)

Previously, I did research at Shanda AI Research Tokyo, Tencent AI Lab, and NTU MMLab. Feel free to contact me about collaboration 🤠.

📝 Recent Publications

* indicates equal contribution. For a complete list of publications, please refer to my Google Scholar profile.

T-PAMI 2025

VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models

Ziqi Huang*, Fan Zhang*, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, Yaohui Wang, Xinyuan Chen, Ying-Cong Chen, Limin Wang, Dahua Lin, Yu Qiao, Ziwei Liu

IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), GitHub stars > 1k

EMNLP 2025, Findings