Publications

(2024). Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training. Preprint.

(2024). Sparkle: Mastering Basic Spatial Capabilities in Vision Language Models Elicits Generalization to Composite Spatial Reasoning. Preprint.

(2024). ITINERA: Integrating Spatial Optimization with Large Language Models for Open-domain Urban Itinerary Planning. In EMNLP 2024.

(2024). Parameter-Inverted Image Pyramid Networks. In NeurIPS 2024 (Spotlight).

(2024). Synergizing Spatial Optimization with Large Language Models for Open-domain Urban Itinerary Planning. In KDD UrbComp 2024 (Best Paper Award).

(2023). Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft. In CVPR 2024.

(2022). Video Background Music Generation: Dataset, Method and Evaluation. In ICCV 2023.

(2021). Video Background Music Generation with Controllable Music Transformer. In ACM MM 2021 (Best Paper Award).

(2021). Confidence-aware Non-repetitive Multimodal Transformers for TextCaps. In AAAI 2021.
