Expertise Need Not Monopolize:
Action-Specialized Mixture of Experts for Vision-Language-Action Learning

Weijie Shen1,2,8*, Yitian Liu1,3*, Yuhao Wu5,8*, Zhixuan Liang6,4*†,
Sijia Gu7, Dehui Wang1,2,8, Tian Nian1, Lei Xu3,10, Yusen Qin8, Jiangmiao Pang4, Xinping Guan2,9,
Xiaokang Yang1,3†, Yao Mu1,3,4†
*Equal contribution   †Corresponding authors.
1MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, 2School of Automation and Intelligent Sensing, Shanghai Jiao Tong University, 3School of Computer Science, Shanghai Jiao Tong University, 4Shanghai AI Laboratory, 5Tsinghua Shenzhen International Graduate School, Tsinghua University, 6The University of Hong Kong, 7Tongji University, 8D-Robotics, 9Key Laboratory of System Control and Information Processing, Ministry of Education of China, 10Shanghai Key Laboratory of Integrated Administration Technologies for Information Security

The AdaMoE-VLA Model

Our AdaMoE-VLA framework builds upon a pretrained flow-matching Vision-Language-Action (VLA) model π0, enhancing its action reasoning capacity through Mixture-of-Experts (MoE) integration. Our pipeline (Fig. a) processes multi-modal inputs through a vision-language model and a transformer-based action expert to predict continuous control. Each transformer block in the action expert incorporates an MoE layer (Fig. b) comprising shared experts, which inherit the original FFN weights to capture general manipulation patterns, and routed experts, dynamically selected by a router network to specialize in action-specific behaviors.
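The shared/routed split described above can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the module names, hidden sizes, and the use of a single shared expert with top-k softmax routing are assumptions for clarity.

```python
import torch
import torch.nn as nn


class MoEFeedForward(nn.Module):
    """Sketch of an MoE FFN layer: one shared expert (which would inherit
    the pretrained FFN weights) plus routed experts selected per token by
    a top-k router. Sizes and structure are illustrative."""

    def __init__(self, d_model=16, d_ff=32, n_routed=4, top_k=2):
        super().__init__()
        # Shared expert: captures general manipulation patterns.
        self.shared = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        # Routed experts: specialize in action-specific behaviors.
        self.routed = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_routed))
        self.router = nn.Linear(d_model, n_routed)
        self.top_k = top_k

    def forward(self, x):                         # x: (batch, tokens, d_model)
        out = self.shared(x)                      # always-on shared expert
        weights, idx = self.router(x).softmax(-1).topk(self.top_k, dim=-1)
        for k in range(self.top_k):
            for e, expert in enumerate(self.routed):
                mask = (idx[..., k] == e).unsqueeze(-1)   # tokens routed to e
                out = out + mask * weights[..., k:k + 1] * expert(x)
        return out
```

For efficiency a real implementation would dispatch tokens to experts rather than evaluate every expert on every token; the dense loop here is kept for readability.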

As shown in Fig. c, the vanilla MoE couples expert selection and weighting through a single router using top-k softmax gating. In contrast, our AdaMoE-VLA (Fig. d) introduces a decoupled architecture with an independent router for expert selection and a scale adapter for adaptive weighting. This design alleviates optimization conflicts between load balancing and task specialization, enabling more flexible and expressive expert collaboration for robust robotic control.

Experiments

Evaluations on simulation benchmarks

We evaluate our method on two simulation benchmarks: (1) four task suites from the LIBERO benchmark: LIBERO-Spatial, LIBERO-Object, LIBERO-Goal, and LIBERO-Long; and (2) 19 tasks from RoboTwin 2.0, where each task dataset contains 100 expert trajectories collected in clean environments and 400 collected in domain-randomized environments.

Our results demonstrate clear performance improvements of MoE over dense models, with particularly pronounced gains on large-scale datasets and long-horizon tasks. On the LIBERO benchmark, our AdaMoE-VLA achieves an average improvement of 1.8% over the baseline π0 model (94.2% → 96.0%) across all four task suites, as shown in Table 1. As detailed in Table 2, the improvements are more significant on the large-scale RoboTwin dataset, where we observe a substantial 9.3% performance gain (40.4% → 49.7%) across 19 manipulation tasks with 9500 demonstrations.

Notably, our method excels in both domain-randomized tasks and long-horizon sequential tasks. In domain-randomized scenarios with high environmental and object variation, the diverse expert specialization enables better handling of different lighting conditions, object properties, poses, and manipulation strategies across diverse configurations. The gains on long-horizon tasks are particularly pronounced, with our method achieving a 92% success rate on LIBERO-Long, demonstrating that MoE architectures can effectively decompose complex sequential manipulation into specialized sub-skills handled by different experts.


Meaningful Expert Specialization

Analysis of expert activation patterns reveals clear task-dependent specialization across different manipulation phases. The figure above shows the activation patterns of experts at a fixed layer L during various manipulation tasks, where expert usage intensity measures the proportion of tokens assigned to each expert at each frame. We observe distinct activation patterns that correlate with specific manipulation phases. For the same task, “put both the alphabet soup and the tomato sauce”, all experts show similar token load distributions, as illustrated in subfigures (a) and (b). Furthermore, across different tasks, experts exhibit consistent trends for certain atomic operations. For instance, in subfigures (a), (b), and (c), Expert 3 shows increased token utilization precisely when the policy performs target positioning and gripper release operations. The consistency of activation patterns across similar manipulation phases demonstrates that our experts capture meaningful behavioral primitives rather than arbitrary task divisions.
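The usage-intensity metric described above is straightforward to compute from the router's top-k indices. Below is a hedged sketch; the function name and tensor layout are our assumptions, not the paper's code.

```python
import torch


def expert_usage_intensity(topk_idx, n_experts):
    """Per-frame expert usage intensity: the fraction of token-to-expert
    assignments that each expert receives in a frame.
    topk_idx: integer tensor of shape (frames, tokens, top_k) holding the
    router's selected expert indices (illustrative layout)."""
    frames = topk_idx.shape[0]
    flat = topk_idx.reshape(frames, -1)       # all assignments in each frame
    counts = torch.stack(
        [(flat == e).sum(-1) for e in range(n_experts)], dim=-1)
    return counts.float() / flat.shape[-1]    # each row sums to 1
```

Plotting this quantity over frames for each expert yields curves like those in the activation-pattern figure.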


Effectiveness of Our Decoupled Architecture Design

To validate our decoupled expert selection and weighting mechanism, we conduct comprehensive ablation studies on LIBERO comparing three architectural variants:

  • Vanilla MoE: Traditional MoE with coupled selection and weighting using softmax router outputs
  • Concatenated Scale Adapter MoE (CSMoE): Router outputs and action tokens are concatenated and fed to a scale adapter that directly outputs expert weights
  • Additive Scale Adapter MoE (Our AdaMoE-VLA): Expert weights are computed as the sum of router weights and scale adapter weights
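The three gating variants above can be contrasted in a few lines. This sketch simplifies CSMoE (whose adapter actually consumes the concatenation of router outputs and action tokens) by taking the adapter output as given; variant names and signatures are illustrative assumptions.

```python
import torch


def expert_weights(variant, router_logits, adapter_out):
    """Sketch of the three gating variants in the ablation.
    router_logits, adapter_out: (tokens, n_experts). The adapter input
    pipeline is omitted; only the weight combination rule is shown."""
    probs = router_logits.softmax(-1)
    if variant == "vanilla":   # coupled: softmax router alone selects AND weights
        return probs
    if variant == "csmoe":     # adapter directly outputs the expert weights
        return adapter_out.softmax(-1)
    if variant == "adamoe":    # additive: router weight + scale-adapter weight
        return probs + adapter_out
    raise ValueError(f"unknown variant: {variant}")
```

In the additive design, the router can keep satisfying the load-balancing objective while the scale adapter freely adjusts how much each selected expert contributes.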

As shown in Table 3, our AdaMoE-VLA achieves the best overall performance across the LIBERO task suites, with an average improvement of 1.6% over vanilla MoE with load balancing. The concatenated approach shows moderate improvements, validating the importance of decoupling, while our additive design proves most effective.


Real-World Experiments

To validate the practical effectiveness of our AdaMoE-VLA approach, we conduct real-world robotic manipulation experiments using a dual-arm manipulation platform. Our experimental setup utilizes the ALOHA-Agilex system developed by AgileX Robotics, equipped with two Piper robotic arms that enable bimanual manipulation capabilities.

We design four representative manipulation tasks that cover diverse manipulation skills and evaluate our method's performance in real-world scenarios:

1) Place Cup: Precise positioning
2) Stack Plate: Stable stacking
3) Click Bell: Coordinated activation
4) Adjust Bottle: Fine orientation

Table 5 presents the success rates of our AdaMoE-VLA compared to the π0 baseline across all four real-world manipulation tasks. Our method demonstrates consistent improvements across all tasks, with particularly notable gains in complex manipulation scenarios requiring precise coordination.

Below are videos of our AdaMoE-VLA model demonstrating various robust behaviors across the four representative manipulation tasks. (Videos are sped up by 2×.)

Place Cup
Stack Plate
Click Bell
Adjust Bottle


BibTeX

@misc{shen2025expertiseneedmonopolizeactionspecialized,
  title={Expertise Need Not Monopolize: Action-Specialized Mixture of Experts for Vision-Language-Action Learning},
  author={Weijie Shen and Yitian Liu and Yuhao Wu and Zhixuan Liang and Sijia Gu and Dehui Wang and Tian Nian and Lei Xu and Yusen Qin and Jiangmiao Pang and Xinping Guan and Xiaokang Yang and Yao Mu},
  year={2025},
  eprint={2510.14300},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2510.14300},
}