You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
We are the DreamX team at Amap (Alibaba), focusing on delivering AI products and cutting-edge research in large language models, reinforcement learning, multimodal understanding, generative AI (image/video), world models, efficient inference, generative recommendation and intelligent mobility. Our work has been published at top-tier venues including ICLR, CVPR, ICCV, ACL, EMNLP, and AAAI.
We are always looking for talented interns and full-time researchers with strong coding skills and research experience. Please email us at cxxgtxy@gmail.com if you are interested.
🔥 News
2026.04.10 💻 We open-sourced -- Let Skills Evolve Collectively with Agentic Evolver.
2026.03.23 💻 We open-sourced -- A Comprehensive Benchmark for Evaluating Interactive Response Capabilities of World Models.
2026.03.11 💻 We open-sourced -- Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing.
2026.03.01 🎉 is accepted by CVPR 2026 -- Beyond Generation: Advancing Image Editing Priors for Depth and Normal Estimation.
2026.02.27 🎉 is accepted by Findings of CVPR 2026 -- Towards Close-up High-resolution Video-based Virtual Try-on.
2026.02.06 💻 We open-sourced -- A Scalable Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios.
2026.02.06 🎉 is accepted by ICLR 2026 -- Benchmarking Spatial Intelligence of Text-to-Image Models.
2026.02.06 🎉 is accepted by ICLR 2026 -- Tree Search for LLM Agent Reinforcement Learning.
2026.02.06 🎉 is accepted by ICLR 2026 -- Stochastic Self-Guidance for Training-Free Enhancement of Diffusion Models.
2026.02.05 🎉 is accepted by ICLR 2026 -- Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation.
2026.02.04 💻 We open-sourced -- A GUI World Model via Renderable Code Generation.
2026.02.04 🎉 is accepted by ICLR 2026 -- A Simple and Strong Reinforcement Learning Baseline for Model Reasoning.
2026.02.04 🎉 is accepted by ICLR 2026 -- A Comprehensive Narrative-Centric Evaluation for Long Video Generation Models.
2026.02.04 🎉 is accepted by ICLR 2026 -- Advancing End-To-End Pixel-Space Generative Modeling via Self-Supervised Pre-Training.
2026.02.04 🎉 is accepted by AAAI 2026.
2026.02.02 🎉 is accepted by ICCV 2025 -- A Benchmark for Perception-Aligned Video Motion Generation.
2026.01.31 🎉 is accepted by ICLR 2026 -- Urban Socio-Semantic Segmentation with Vision-Language Reasoning.
2026.01.07 💻 We open-sourced -- Reinforced Parallel Map-Augmented Agent for Geolocalization.
2025.06.20 💻 We open-sourced -- A Simple and Advanced Diffusion Transformer Baseline for Scene Text Editing.
2025.05.21 💻 We open-sourced -- Reasoning Guided Universal Visual Grounding with Reinforcement Learning.
2025.04.07 💻 We open-sourced -- Realistic Image Quality and Aesthetic Scoring with Multimodal LLM.
Adopts tree-search rollouts in place of independent chain-based rollouts for LLM agent RL, achieving superior performance with only a quarter of the rollout budget.
A minimalist RL approach (Group Policy Gradient) that directly optimizes the original RL objective, eliminating critic/reference models and KL constraints while outperforming GRPO.
Proposes difficulty-aware GRPO and multi-aspect question reformulation to boost math reasoning by targeting harder questions from both algorithmic and data perspectives.
A hierarchical sampling framework that identifies boundary-level problems and dynamically reallocates sampling budget toward high-utility problems for self-taught reasoners.
A novel text editing framework for multi-line scene text in complex visual scenarios, with Condition Injection LoRA module and regional text perceptual loss.
Leverages stochastic block-dropping to construct sub-networks for training-free guidance, surpassing CFG on text-to-image and text-to-video generation.
Unified self-supervised pretraining via masked latent modeling in VAE space, significantly improving diffusion model convergence and generation quality.
An RL-based single-pass 3D scene editing framework using VGGT as geometry-aware reward model and GRPO to anchor 2D editing priors onto the 3D consistency manifold.
A vision-language reasoning framework for urban socio-semantic segmentation that simulates human annotation via cross-modal recognition and multi-stage RL-based reasoning.
A 14,715-image UGC dataset with 10 fine-grained attributes for realistic image quality and aesthetic scoring; achieves SOTA on 5 public IQA/IAA benchmarks using next-token prediction.
A GRPO-based framework for reliable visual policy optimization in image quality assessment with uncertainty-aware dynamic optimization and perception-aware optimization.
A VLM-based GUI world model that predicts dynamic transitions via renderable code generation, boosting Gemini-2.5-Flash by +9.5% on AndroidWorld navigation.
Disentangled scenario factorization for multi-scenario route ranking with the first large-scale public MSDR dataset; deployed in AMap for online traffic.
Are Autoregressive Large Language Models Implicit Teachers for Diffusion Large Language Models? A framework for transferring alignment knowledge from AR-LLMs to Diffusion Models.