UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers

¹Tsinghua University, ²Shanghai AI Laboratory
³The University of Hong Kong, ⁴The Chinese University of Hong Kong
ICML 2023
✉ Corresponding authors

Abstract

Real-world data contains a vast amount of multimodal information, among which vision and language are the two most representative modalities. Moreover, increasingly heavy models, e.g., Transformers, have drawn researchers' attention to model compression. However, how to compress multimodal models, especially vision-language Transformers, is still under-explored. This paper proposes Unified and Progressive Pruning (UPop) as a universal vision-language Transformer compression framework, which incorporates 1) unifiedly searching multimodal subnets in a continuous optimization space from the original model, which enables automatic assignment of pruning ratios among compressible modalities and structures; 2) progressively searching and retraining the subnet, which maintains convergence between the search and the retraining to attain higher compression ratios. Experiments on various tasks, datasets, and model architectures demonstrate the effectiveness and versatility of the proposed UPop framework. The code is available at https://github.com/sdc17/UPop.
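The two ideas in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the function names, the ranking by mask magnitude, the linear ratio schedule, and the decay rule are all assumptions chosen for brevity, and retraining between steps is omitted. It only shows how ranking every mask entry in one pool lets per-structure pruning ratios emerge automatically, and how pruning can be applied gradually rather than in one shot.

```python
from typing import Dict, List

Masks = Dict[str, List[float]]


def unified_threshold(masks: Masks, prune_ratio: float) -> float:
    """Return the magnitude at or below which mask entries are pruned.

    All entries are ranked in a single pool, regardless of modality or
    structure, so no per-structure ratio has to be chosen by hand.
    """
    magnitudes = sorted(abs(v) for values in masks.values() for v in values)
    num_pruned = int(len(magnitudes) * prune_ratio)
    if num_pruned == 0:
        return float("-inf")  # nothing is pruned at this ratio
    return magnitudes[num_pruned - 1]


def progressive_step(masks: Masks, target_ratio: float,
                     step: int, total_steps: int) -> Masks:
    """Apply one step of a hypothetical progressive schedule.

    The pruning ratio grows linearly toward target_ratio (an assumption),
    and the selected entries are scaled toward zero instead of being
    removed at once. They reach exactly zero at the final step.
    Ties at the threshold are all pruned.
    """
    current_ratio = target_ratio * step / total_steps
    threshold = unified_threshold(masks, current_ratio)
    decay = 1.0 - step / total_steps
    return {
        name: [v * decay if abs(v) <= threshold else v for v in values]
        for name, values in masks.items()
    }
```

For example, starting from the made-up masks `{"vision_attn": [0.9, 0.1, 0.8, 0.7], "text_ffn": [0.2, 0.3, 0.95, 0.05]}` and running four steps toward a target ratio of 0.5 zeroes one entry of `vision_attn` and three of `text_ffn`, so the two structures end up with different pruning ratios even though only a single global ratio was specified.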

Overall Performance

Overview of experimental results at 2x compression. The proposed UPop framework is efficient and effective on various tasks, datasets, and architectures. Bold indicates the best post-compression performance. Mask-based Pruning is extended from the SOTA pruning method ViT-Slimming (Chavan et al., 2022).

Detailed Experimental Results

Compressing BLIP on the NLVR2 dataset. Bold indicates the best performance at the same compression ratio. Reduce indicates the compression times (e.g., 2x). The marker ✓ or ✗ indicates whether the model converges at the given compression times. The units of Params and FLOPs are M and G, respectively.
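As a reading aid for the Reduce column, the helpers below assume that "compression times" means the ratio of the original size to the compressed size. Whether that size is measured in parameters or FLOPs is not stated here, so the functions accept either quantity. The function names and the numbers in the usage note are hypothetical and not taken from the tables.

```python
def compression_times(original: float, compressed: float) -> float:
    """Ratio of original to compressed size (same unit for both)."""
    if original <= 0 or compressed <= 0:
        raise ValueError("sizes must be positive")
    return original / compressed


def retained_budget(original: float, times: float) -> float:
    """Size that remains after compressing `original` by `times`."""
    if original <= 0 or times <= 0:
        raise ValueError("size and compression times must be positive")
    return original / times
```

With these definitions, a model of 200 M parameters compressed to 100 M corresponds to 2x, and compressing the same model by 4x leaves a budget of 50 M.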

Compressing BLIP on the Image Caption task and the Visual Question Answering task. Higher is better for CIDEr, SPICE, test-dev, and test-std. The units of Params and FLOPs are M and G, respectively.

Compressing BLIP on the COCO and Flickr30K datasets of the Image-Text Retrieval task. Higher is better for R@1, R@5, and R@10. The units of Params and FLOPs are M and G, respectively.

Compressing CLIP on the COCO and Flickr30K datasets of the Image-Text Retrieval task.

UPop also works well on uni-modal tasks. Compressing DeiT on the ImageNet dataset. The units of Params and FLOPs are M and G, respectively. The superscript * indicates the performance of the deployable model if the original model is non-deployable. For a fair comparison, none of the reported experimental results, including those of UPop, use knowledge distillation.

UPop also achieves competitive performance on uni-modal tasks. The left and right subfigures illustrate the Accuracy-FLOPs and Accuracy-Parameter trade-offs, respectively. ∗ indicates the performance of the deployable model if the original model is non-deployable. Both subfigures show that the proposed UPop (marked with the blue triangle) achieves better performance on both trade-offs. Note that token-specific compression approaches reduce only FLOPs, not the number of parameters, so they appear as vertical lines in the Accuracy-Parameter trade-off figure.

UPop also achieves competitive performance on uni-modal tasks. Compressing Segmenter on the ADE20k dataset. The units of Params and FLOPs are M and G, respectively. SS and MS denote single-scale and multi-scale testing for the mIoU metric, respectively. Models marked with the superscript * are CNN-based, and models without it are Transformer-based.

Visualization

The proportion of each compressible component retained in the compressed BLIP model on the NLVR2 dataset. The six subfigures represent the original model and the compressed models at the 2x, 3x, 4x, 5x, and 10x compression ratios, respectively. In each subfigure, the horizontal axis represents the layer number, the vertical axis represents the compressible components corresponding to each ζi, and the number in each cell is the proportion of that component retained in that layer.
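The cell values in such a heatmap can be computed from binary pruning masks. The sketch below assumes each component stores one 0/1 mask per layer, where 1 means the unit is kept. This data layout and the function name are illustrative assumptions, not the format used in the released code.

```python
from typing import Dict, List


def retained_proportions(
    masks: Dict[str, List[List[int]]]
) -> Dict[str, List[float]]:
    """Map each component to its per-layer fraction of retained units.

    masks[name][layer] is a list of 0/1 flags, one per prunable unit.
    """
    return {
        name: [sum(layer) / len(layer) for layer in layers]
        for name, layers in masks.items()
    }
```

Each dictionary key then corresponds to one row of a subfigure and each list position to one layer column. For instance, a component whose two layers keep 2 of 4 and 1 of 4 units yields the row `[0.5, 0.25]`.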

The left subfigure: variation of compressible components as the compression ratio increases. The right subfigure: variation of layers as the compression ratio increases.

BibTeX

@article{shi2023upop,
  title={Upop: Unified and progressive pruning for compressing vision-language transformers},
  author={Shi, Dachuan and Tao, Chaofan and Jin, Ying and Yang, Zhendong and Yuan, Chun and Wang, Jiaqi},
  journal={arXiv preprint arXiv:2301.13741},
  year={2023}
}