We are the DreamX team at Amap (Alibaba), driving cutting-edge research and production AI systems across large language models, reinforcement learning, agent systems, multimodal understanding, generative AI (image/video), world models, autonomous driving, and intelligent mobility. With 6,000+ GitHub stars across 30+ open-source research projects, our work has been published at top-tier venues including ICLR, CVPR, ACL, AAAI, SIGGRAPH, ICCV, EMNLP, and ACM MM.
We are always looking for talented interns and full-time researchers with strong coding skills and research experience. Please email us at cxxgtxy@gmail.com if you are interested.
🔥 News
2026.05.12 🎉 Training LLM Agents via Agent-Data Mutual Evolution is accepted by ACL 2026.
2026.05.12 🎉 Reinforced Parallel Map-Augmented Agent for Geolocalization is accepted by ACL 2026 Findings.
2026.05.11 💻 We released the 5B-Cam model and inference code -- A General-Purpose Interactive World Model.
2026.04.22 💻 We open-sourced Elucidating the SNR-t Bias of Diffusion Probabilistic Models (CVPR 2026).
2026.04.22 💻 We open-sourced Extending One-Step Image Generation from Class Labels to Text (CVPR 2026).
2026.04.10 💻 We open-sourced Let Skills Evolve Collectively with Agentic Evolver.
2026.04.10 💻 We open-sourced A General-Purpose Interactive World Model.
2026.04.01 🎉 Motion-Appearance Cascaded Experts for Music-Driven Dance Video Generation is accepted by SIGGRAPH 2026.
2026.03.23 💻 We open-sourced A Comprehensive Benchmark for Evaluating Interactive Response Capabilities of World Models.
2026.03.20 💻 We open-sourced Incentivizing Reasoning and Self-Reflection for VLA in Autonomous Driving.
2026.03.18 💻 We open-sourced Reinforcing Open-Vocabulary Action Recognition with Tools.
2026.03.11 💻 We open-sourced Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing.
2026.03.01 🎉 Beyond Generation: Advancing Image Editing Priors for Depth and Normal Estimation is accepted by CVPR 2026.
2026.02.28 🎉 Frequency-Aware Sparse Attention is accepted by ICLR 2026.
2026.02.27 🎉 Towards Close-up High-resolution Video-based Virtual Try-on is accepted by Findings of CVPR 2026.
2026.02.06 💻 We open-sourced A Scalable Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios.
2026.02.06 🎉 Reinforcing Open-Vocabulary Action Recognition with Tools is accepted by ICLR 2026.
2026.02.06 🎉 Incentivizing Reasoning and Self-Reflection for VLA in Autonomous Driving is accepted by ICLR 2026.
2026.02.06 🎉 Benchmarking Spatial Intelligence of Text-to-Image Models is accepted by ICLR 2026.
2026.02.06 🎉 Tree Search for LLM Agent Reinforcement Learning is accepted by ICLR 2026.
2026.02.06 🎉 Stochastic Self-Guidance for Training-Free Enhancement of Diffusion Models is accepted by ICLR 2026.
2026.02.05 🎉 Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation is accepted by ICLR 2026.
2026.02.04 💻 We open-sourced A GUI World Model via Renderable Code Generation.
2026.02.04 🎉 A Simple and Strong Reinforcement Learning Baseline for Model Reasoning is accepted by ICLR 2026.
2026.02.04 🎉 A Comprehensive Narrative-Centric Evaluation for Long Video Generation Models is accepted by ICLR 2026.
2026.02.04 🎉 Advancing End-To-End Pixel-Space Generative Modeling via Self-Supervised Pre-Training is accepted by ICLR 2026.
2026.02.04 🎉 Unified and Spatially-Controllable Visual Effects Generation is accepted by AAAI 2026.
2026.02.04 🎉 Adaptive Test-Time Search for Video Generation Beyond Semantic Dependency Constraints is accepted by AAAI 2026.
2026.02.02 🎉 A Benchmark for Perception-Aligned Video Motion Generation is accepted by ICCV 2025.
2026.01.31 🎉 Urban Socio-Semantic Segmentation with Vision-Language Reasoning is accepted by ICLR 2026.
2026.01.07 💻 We open-sourced Reinforced Parallel Map-Augmented Agent for Geolocalization.
2025.10.22 💻 We open-sourced Boosting MLLMs' Video Understanding via Counterfactual Video Generation.
2025.06.20 💻 We open-sourced A Simple and Advanced Diffusion Transformer Baseline for Scene Text Editing.
2025.05.21 💻 We open-sourced Reasoning Guided Universal Visual Grounding with Reinforcement Learning.
2025.04.07 💻 We open-sourced Realistic Image Quality and Aesthetic Scoring with Multimodal LLM.
A framework enabling LLM agent skills to evolve collectively from real interactions, with automatic deduplication, improvement, and verification across sessions, agents, and devices.
Adopts tree-search rollouts in place of independent chain-based rollouts for LLM agent RL, achieving superior performance with only a quarter of the rollout budget.
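The budget saving comes from sharing prefixes across rollouts rather than regenerating them per chain. A toy token-cost accounting (our own idealized sketch, not the paper's cost model; function names and the single-branch-point assumption are illustrative) shows where the roughly 4x reduction can come from:

```python
def chain_rollout_cost(prefix_len, branch_len, num_rollouts):
    """Independent chain rollouts: every rollout regenerates the
    shared prefix from scratch before its own continuation."""
    return num_rollouts * (prefix_len + branch_len)

def tree_rollout_cost(prefix_len, branch_len, num_rollouts):
    """Tree-search rollouts (idealized, one branch point): the shared
    prefix is generated once and branches fan out from it."""
    return prefix_len + num_rollouts * branch_len
```

With a long shared prefix of 90 tokens, 10-token continuations, and 8 rollouts, chains cost 800 generated tokens versus 170 for the tree, illustrating how tree rollouts can match a chain baseline on a fraction of the budget.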
A minimalist RL approach (Group Policy Gradient) that directly optimizes the original RL objective, eliminating critic/reference models and KL constraints while outperforming GRPO.
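The "minimalist" part is that the advantage is just the reward minus the group mean, so no critic, reference model, or KL term is needed. A minimal sketch of that idea (our own REINFORCE-style toy, not the released implementation; the function name and per-rollout scalar log-probs are illustrative):

```python
import statistics

def group_policy_gradient_loss(logprobs, rewards):
    """Toy group-relative policy-gradient surrogate: baseline each
    rollout's reward by the group mean. No learned critic, no frozen
    reference model, no KL penalty."""
    baseline = statistics.mean(rewards)
    advantages = [r - baseline for r in rewards]
    # Maximize E[A * log pi(y|x)] -> minimize its negation, averaged over the group.
    return -sum(a * lp for a, lp in zip(advantages, logprobs)) / len(rewards)
```

Rollouts rewarded above the group mean get their log-probabilities pushed up, below-mean rollouts pushed down, which is the whole optimization signal.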
Proposes difficulty-aware GRPO and multi-aspect question reformulation to boost math reasoning by targeting harder questions from both algorithmic and data perspectives.
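One way to make GRPO difficulty-aware is to scale the group-normalized advantages by how rarely the policy currently solves the question. The sketch below is our guess at that mechanism, not the paper's exact formula; the weight `1 - pass_rate` assumes binary 0/1 rewards and is purely illustrative:

```python
def difficulty_weighted_advantages(rewards):
    """Hypothetical difficulty-aware GRPO step: compute standard
    group-normalized advantages, then scale by w = 1 - pass_rate so
    hard questions (low pass rate) carry a larger gradient signal."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # guard: all-equal rewards give zero std
    weight = 1.0 - mean      # difficulty proxy under binary rewards
    return [weight * (r - mean) / std for r in rewards]
```

Easy questions (pass rate near 1) are down-weighted toward zero, concentrating updates on the harder questions the algorithmic side targets.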
A framework for training LLM agents via agent-data mutual evolution, using RL with failure-signal-driven task synthesis under changing training distributions.
A novel text editing framework for multi-line scene text in complex visual scenarios, with Condition Injection LoRA module and regional text perceptual loss.
An RL-based single-pass 3D scene editing framework using VGGT as geometry-aware reward model and GRPO to anchor 2D editing priors onto the 3D consistency manifold.
Leverages stochastic block-dropping to construct sub-networks for training-free guidance, surpassing CFG on text-to-image and text-to-video generation.
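The guidance signal here comes from the model itself: randomly skipping blocks yields a weaker sub-network whose prediction plays the role CFG gives to the unconditional branch. A minimal sketch under assumed interfaces (callable blocks, scalar activations, and the `subnetwork_guidance` name are all illustrative, not the released API):

```python
import random

def subnetwork_guidance(x, blocks, drop_prob=0.5, scale=2.0, rng=None):
    """Training-free guidance via stochastic block-dropping: run the
    full network, run a sub-network with blocks randomly skipped, then
    extrapolate from the weak prediction toward the full one (CFG-style)."""
    rng = rng or random.Random(0)

    def run(active_blocks):
        h = x
        for blk in active_blocks:
            h = blk(h)
        return h

    full = run(blocks)                                      # strong prediction
    sub = [b for b in blocks if rng.random() >= drop_prob]  # stochastic sub-network
    weak = run(sub)                                         # weak prediction
    return weak + scale * (full - weak)
```

Unlike CFG, no separately trained unconditional model is needed; the sub-network is constructed on the fly at sampling time.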
Elucidating the SNR-t bias of diffusion probabilistic models and proposing a differential correction method to improve generation quality across various diffusion models.
Unified self-supervised pretraining via masked latent modeling in VAE space, significantly improving diffusion model convergence and generation quality.
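The pretraining objective masks a subset of VAE latent tokens and asks the model to reconstruct them. A minimal sketch of the corruption step only (our own illustration of masked latent modeling in general; the function name, `None` mask token, and flat token list are assumptions, not the released pipeline):

```python
import random

def mask_latents(latents, mask_ratio=0.75, rng=None):
    """Replace a random subset of latent tokens with a mask token.
    The pretraining target is to reconstruct the masked positions."""
    rng = rng or random.Random(0)
    n = len(latents)
    num_masked = int(n * mask_ratio)
    masked_idx = set(rng.sample(range(n), num_masked))
    MASK = None  # placeholder mask token for this sketch
    corrupted = [MASK if i in masked_idx else z for i, z in enumerate(latents)]
    return corrupted, sorted(masked_idx)
```

Because the loss lives in the compact VAE latent space rather than pixel space, the same pretraining can warm-start the diffusion backbone and speed convergence.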
A cascaded expert framework that explicitly decouples motion generation from appearance synthesis for high-quality music-driven dance video generation, with the 70K-clip MA-Data dataset.
Combines contextual sub-motion decomposition with tool-augmented reinforcement learning for open-vocabulary action recognition using GRPO with hierarchical rewards.
A prompt-guided adaptive test-time search strategy that dynamically adjusts search space and reward for imaginative video generation with long-distance semantic dependencies.
Introduces DualityForge, a controllable diffusion framework generating counterfactual videos for contrastive training, reducing MLLM video hallucinations by 24%.
A VLM-based GUI world model that predicts dynamic transitions via renderable code generation, boosting Gemini-2.5-Flash by +9.5% on AndroidWorld navigation.
A general-purpose world model for interactive world simulation, generating diverse, high-fidelity worlds that users can explore, control, and transform with event prompts.
A vision-language-action model using rule-based RL to elicit reasoning and self-reflection for autonomous driving trajectory prediction with physics-grounded rewards.
A vision-language reasoning framework for urban socio-semantic segmentation that simulates human annotation via cross-modal recognition and multi-stage RL-based reasoning.
A 14,715-image UGC dataset with 10 fine-grained attributes for realistic image quality and aesthetic scoring; achieves SOTA on 5 public IQA/IAA benchmarks using next-token prediction.