Sparse Mixture of Experts (MoE) models are gaining traction because they can improve accuracy without a proportional increase in computational cost. Enormous compute has already been invested in training dense Large Language Models (LLMs), in which each transformer layer contains a single MLP. A promising way to boost the capacity of such pre-trained models is to upcycle them into sparse MoE models, expanding the architecture without training from scratch. However, upcycling methods that work at scale remain an open research question.
In a new paper Upcycling Large Language Models into Mixture of Experts, an NVIDIA research team introduces a new “virtual group” initialization technique to facilitate the transition of dense models into fine-grained MoE structures. They also propose a weight scaling approach that improves the loss of upcycled MoE models by 1.5%.
The core idea behind upcycling is to harness the knowledge embedded in pre-trained dense language models and convert them into large MoE architectures, reducing both training time and computational expense. This transformation maximizes the utility of dense checkpoints while expanding the model’s capacity. To achieve this, the researchers devised the “virtual group” initialization technique, which ensures that every shard of the original MLP is distinctly represented within the router’s top-K selection when the dense model is converted into an MoE configuration.
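In its simplest form, this conversion copies the pre-trained MLP into every expert and attaches a newly initialized router, with the virtual-group scheme and weight scaling refining the recipe for fine-grained experts. Below is a minimal, hypothetical PyTorch sketch of that baseline conversion; the class and parameter names are our own rather than the paper’s code, and the virtual-group sharding step is omitted.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpcycledMoE(nn.Module):
    """Minimal sketch of upcycling: every expert starts as an exact copy of the
    pre-trained dense MLP, and a freshly initialized router picks the top-K
    experts per token. The paper's virtual-group initialization and weight
    scaling (for fine-grained experts) are not reproduced here."""

    def __init__(self, dense_mlp: nn.Module, hidden_size: int,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        # Copy the pre-trained MLP weights into every expert.
        self.experts = nn.ModuleList(
            [copy.deepcopy(dense_mlp) for _ in range(num_experts)]
        )
        # The router is the only newly initialized component.
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [num_tokens, hidden_size]
        logits = self.router(x)                              # [tokens, experts]
        probs = F.softmax(logits, dim=-1)                    # softmax-then-topK
        gates, idx = torch.topk(probs, self.top_k, dim=-1)   # [tokens, top_k]
        # Note: the selected gates sum to less than 1, so the MoE output is
        # smaller in magnitude than the dense MLP's output at initialization;
        # the paper's weight scaling targets this kind of mismatch.
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += gates[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

# Hypothetical usage with a small two-layer MLP standing in for the dense model.
hidden = 16
dense_mlp = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(),
                          nn.Linear(4 * hidden, hidden))
moe = UpcycledMoE(dense_mlp, hidden_size=hidden, num_experts=8, top_k=2)
print(moe(torch.randn(5, hidden)).shape)  # torch.Size([5, 16])
```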
Their findings show that upcycling outperforms continued dense model training for an equivalent amount of compute, as demonstrated on both 2-billion- and 15-billion-parameter models. Depending on the target inference budget and the FLOPs available for upcycling, higher-FLOP architectures such as E8G1T2 can deliver better accuracy than MoE configurations that merely match the dense model’s FLOPs.
Additionally, the research highlights that upcycling calls for hyperparameter settings distinct from those used in fine-tuning. In the MoE router, for instance, applying softmax before selecting the top-K experts (softmax-then-topK) yielded better results than the reverse order (topK-then-softmax). And while finer-grained MoEs can further improve upcycling accuracy, they demand more careful weight scaling and sharding strategies, which in turn can reduce GPU FLOP utilization.
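To make the two routing orders concrete, the snippet below contrasts them on a batch of router logits. It is a hedged illustration using our own function names rather than the paper’s code; one possible intuition, noted in the comments, is that topK-then-softmax with top-1 routing collapses the gate to a constant 1.0 and thus discards the router’s confidence.

```python
import torch
import torch.nn.functional as F

def gates_softmax_then_topk(logits: torch.Tensor, k: int = 2):
    """Softmax over all experts first, then keep the K largest probabilities.
    The gate values retain the router's overall confidence (and sum to < 1)."""
    probs = F.softmax(logits, dim=-1)
    gates, indices = torch.topk(probs, k, dim=-1)
    return gates, indices

def gates_topk_then_softmax(logits: torch.Tensor, k: int = 2):
    """Pick the K largest logits first, then renormalize over only those K.
    With k=1 the gate is always exactly 1.0, so it carries no routing signal."""
    top_logits, indices = torch.topk(logits, k, dim=-1)
    gates = F.softmax(top_logits, dim=-1)
    return gates, indices

# Hypothetical router logits for 4 tokens over 8 experts.
logits = torch.randn(4, 8)
print(gates_softmax_then_topk(logits))
print(gates_topk_then_softmax(logits))
```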
In a practical demonstration, the team upcycled a Nemotron-4 15B model on 1T tokens and compared it against a version of the same model continuously trained on the same 1T tokens. The upcycled model achieved an MMLU score of 67.6%, outperforming the continuously trained model’s 65.3%. Through this research, the NVIDIA team hopes to provide valuable insights into upcycling billion-parameter MoE models at scale.
The paper Upcycling Large Language Models into Mixture of Experts is on arXiv.
Author: Hecate He | Editor: Chain Zhang