UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers

1Tsinghua University, 2Shanghai AI Laboratory
3The University of Hong Kong, 4The Chinese University of Hong Kong
ICML 2023


UPop is the first structured pruning framework for vision-language Transformers. It enables effective structured pruning across various multi-modal and uni-modal tasks, datasets, and model architectures. The video demonstrates that Unified Search removes the burden of repeated experiments (e.g., grid search) for finding optimal compression ratios across different modalities and structures. Furthermore, Progressive Pruning eliminates the weight gap between the searched model and the pruned subnet to be retrained, thereby achieving better convergence and performance.
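To illustrate the intuition behind Unified Search, here is a minimal sketch (my simplification, not the paper's exact algorithm): saliency scores for all prunable units across modalities and structures compete in a single global pool, so per-modality pruning ratios fall out of one global threshold instead of a per-structure grid search. The function name and score values are hypothetical.

```python
def unified_select(scores_by_group, keep_ratio):
    """Sketch of a unified search step: pool saliency scores from all
    compressible groups (e.g., vision attention heads, text FFN neurons),
    then keep the globally top-scoring units. The kept fraction per group
    is assigned automatically by the single global cutoff."""
    # Pool every score so modalities/structures compete jointly.
    pooled = sorted(
        (s for scores in scores_by_group.values() for s in scores),
        reverse=True,
    )
    k = max(1, round(keep_ratio * len(pooled)))
    threshold = pooled[k - 1]  # global cutoff for the top-k units
    # Binary keep-masks; kept fractions naturally differ per group.
    return {name: [s >= threshold for s in scores]
            for name, scores in scores_by_group.items()}
```

With a 0.5 keep ratio, a group whose units are globally strong keeps more than half of its units, while a weaker group keeps fewer: the ratio allocation is automatic rather than hand-tuned.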


Real-world data contains a vast amount of multimodal information, among which vision and language are the two most representative modalities. Moreover, increasingly heavy models, e.g., Transformers, have drawn researchers' attention to model compression. However, how to compress multimodal models, especially vision-language Transformers, remains under-explored. This paper proposes Unified and Progressive Pruning (UPop) as a universal vision-language Transformer compression framework, which incorporates 1) unifiedly searching multimodal subnets in a continuous optimization space from the original model, which enables automatic assignment of pruning ratios among compressible modalities and structures; and 2) progressively searching and retraining the subnet, which maintains convergence between the search and retraining phases to attain higher compression ratios. Experiments on various tasks, datasets, and model architectures demonstrate the effectiveness and versatility of the proposed UPop framework. The code is available at https://github.com/sdc17/UPop.
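The progressive part of the framework can be sketched as a pruning schedule (a hedged simplification; UPop's actual schedule and criteria may differ): rather than cutting the model to the target ratio in one step after the search, the pruned fraction grows gradually, so the weights of the model being searched never jump far from the subnet that is eventually retrained. The function name and linear schedule are assumptions for illustration.

```python
def progressive_mask(scores, step, total_steps, target_prune_ratio):
    """Sketch of progressive pruning: the fraction of pruned units grows
    linearly from 0 toward the target ratio over the search, pruning the
    currently weakest-scoring units at each step."""
    frac = min(step / total_steps, 1.0)                # linear schedule
    n_prune = round(target_prune_ratio * frac * len(scores))
    # Indices of the n_prune lowest-saliency units so far.
    weakest = sorted(range(len(scores)), key=scores.__getitem__)[:n_prune]
    mask = [True] * len(scores)
    for i in weakest:
        mask[i] = False                                # prune weakest units
    return mask
```

At step 0 nothing is pruned; by the final step the mask matches the target subnet, so retraining starts from weights that were already adapted to (nearly) that subnet, avoiding the weight gap a one-shot cut would create.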

Overall Performance

Overview of experimental results at 2x compression. The proposed UPop framework is efficient and effective on various tasks, datasets, and architectures. Bold indicates the best post-compression performance. Mask-based Pruning is extended from the SOTA pruning method ViT-Slimming (Chavan et al., 2022).

Overall Experimental Results

Detailed Experimental Results


Related Research

  • CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers. CrossGET is a universal token-ensemble framework for accelerating various vision-language Transformers. [ArXiv] [Paper] [Code]


Citation

@article{shi2023upop,
        title={Upop: Unified and progressive pruning for compressing vision-language transformers},
        author={Shi, Dachuan and Tao, Chaofan and Jin, Ying and Yang, Zhendong and Yuan, Chun and Wang, Jiaqi},
        journal={arXiv preprint arXiv:2301.13741},
        year={2023}
}