Xiaowen Ma∗, Zhenliang Ni, Xinghao Chen
Huawei Noah’s Ark Lab
{maxiaowen9, nizhenliang2, xinghao.chen}@huawei.com
∗Equal contribution, Corresponding author.
Abstract
Mamba has shown great potential for computer vision due to its linear complexity in modeling the global context with respect to the input length. However, existing lightweight Mamba-based backbones cannot demonstrate performance that matches Convolution- or Transformer-based methods. We observe that simply modifying the scanning path in the image domain is not conducive to fully exploiting the potential of vision Mamba. In this paper, we first perform comprehensive spectral and quantitative analyses and verify that the Mamba block mainly models low-frequency information under a Convolution-Mamba hybrid architecture. Based on these analyses, we introduce a novel Laplace mixer to decouple the features in terms of frequency and feed only the low-frequency components into the Mamba block. In addition, considering the redundancy of the features and the different requirements for high-frequency details and low-frequency global information at different stages, we introduce a frequency ramp inception, i.e., we gradually reduce the input dimensions of the high-frequency branches, so as to efficiently trade off the high-frequency and low-frequency components at different layers. By integrating mobile-friendly convolution and the efficient Laplace mixer, we build a series of tiny hybrid vision Mamba models called TinyViM. The proposed TinyViM achieves impressive performance on several downstream tasks including image classification, semantic segmentation, object detection and instance segmentation. In particular, TinyViM outperforms Convolution-, Transformer- and Mamba-based models of similar scale, and its throughput is about 2-3 times higher than that of other Mamba-based models. Code is available at https://github.com/xwmaxwma/TinyViM.
1 Introduction
In recent years, Mamba has gained widespread attention and application in the field of vision due to its advantages in long-sequence modeling [8]. Mamba is a selective structured state-space model that can overcome the limitations of local perception in Convolutional neural networks [19] and the quadratic computational complexity of Transformers [3], thereby demonstrating strong capabilities in handling long-sequence data. Owing to its advanced context extraction capability, Mamba has been widely applied to image classification and has achieved advanced performance, with representative works such as Vim [54] and VMamba [30].
Vim [54] is one of the earliest works to apply Mamba to visual tasks, introducing a bidirectional state-space model to process 2D image data. By introducing the Cross-Scan Module (CSM) and 2D Selective Scanning (SS2D), VMamba [30] reduces the computational complexity from quadratic to linear while maintaining a global receptive field. MSVMamba [40] uses a multi-scale 2D scanning method to apply Mamba to both original and down-sampled feature maps, which not only helps to learn long-distance dependencies but also reduces computational costs. EfficientVMamba [35] is a lightweight vision Mamba variant that reduces computational complexity through dilation-based selective scanning. Although these methods inherit the linear complexity and global receptive field of Mamba, they do not show performance competitive with lightweight backbones based on Convolution [45, 51] and Transformer [39, 44, 15, 49, 48] in computationally limited and real-time deployment scenarios, as shown in Fig. 1. In fact, methods that simply modify the scan path in the image domain do not utilize the full potential of vision Mamba [30, 54]. In this paper, we rethink the design of lightweight vision Mamba from the perspective of frequency decoupling, which proves to be a reasonable and effective solution.
Through observation, we find that Mamba prefers to extract low-frequency features and ignores some high-frequency features. Many previous works [26, 39, 46] have verified the necessity of using Convolution in lightweight backbones. As shown in Fig. 2, we therefore construct a baseline using mobile-friendly Convolution and a vanilla Mamba block, and perform a spectral analysis of the features before and after the Mamba block. After applying the Mamba block, the low-frequency component in the center of the two-dimensional spectrum is highlighted, while the high-frequency component is suppressed. Therefore, to improve the efficiency of low-frequency modeling while keeping the high-frequency features, we introduce an efficient Laplace pyramid to decompose the features into high- and low-frequency components. We then input only the low-frequency components into the Mamba block to obtain a global receptive field, and for the high-frequency components, we use a reparameterized 3×3 depth-wise Convolution to enhance the high-frequency information [5, 6]. It is worth noting that the low-frequency feature is much smaller than the original input feature, which greatly reduces the computational complexity and improves the throughput of the Mamba block. Therefore, the strategy of decoupling the frequencies and enhancing each component separately effectively improves both the efficiency and the representation ability of the model.
In addition, previous models based on Convolution or Transformer have verified that: 1) deep neural networks have feature redundancy [16]; and 2) deep neural networks tend to require more high-frequency information in the shallow layers and more global information in the deep layers [41]. Therefore, we propose the frequency ramp inception to further improve the performance and efficiency of the model. Specifically, we split the feature map along the channel dimension at each stage. Early stages prioritize high frequencies, reflected in a higher channel ratio for the high-frequency branch, while later stages concentrate on low frequencies, allowing a greater channel allocation to the low-frequency branch. This distribution enables the network to effectively learn the features most suitable for each stage of processing.
By integrating frequency decoupling and frequency ramp inception, we propose a tiny and efficient Convolution-Mamba hybrid architecture, TinyViM. We then conduct extensive experiments to validate the effectiveness of TinyViM, which achieves state-of-the-art image classification performance on ImageNet and advanced performance on downstream tasks such as detection and segmentation. Specifically, TinyViM achieves higher classification accuracy than other Convolution- or Transformer-based backbones of similar scale. In addition, compared to other Mamba-based backbones such as EfficientVMamba-T, TinyViM has higher throughput and 2.7% higher accuracy, as shown in Fig. 1. Our contributions can be summarized as follows.
- We verify that Mamba mainly models low-frequency information under the Convolution-Mamba hybrid architecture, based on qualitative analysis and quantitative comparison.
- We introduce an efficient Laplace mixer, which improves the modeling efficiency and performance of vision Mamba through frequency decoupling and frequency ramp inception.
- We propose TinyViM by integrating the Laplace mixer and Convolution. Extensive experiments show that TinyViM outperforms other Convolution-, Transformer- and Mamba-based models of similar scale and achieves a better trade-off between performance and efficiency.
2 Related Work
2.1 Efficient Generic Vision Backbones
Convolution has been widely employed for designing efficient vision backbones due to its inductive bias and efficient parallel processing on GPUs. Early works such as the MobileNet series [22, 38, 21] propose depth-wise separable Convolutions and inverted residual blocks to reduce the parameter count and computational cost of plain Convolutional models [19, 42]. A series of subsequent works further improve efficiency through depth-wise dilated Convolutions [34], network pruning [17], neural architecture search [21], channel shuffling [52, 31] and structural reparameterization [44, 45, 46]. However, constrained by a limited receptive field, these efficient Convolutional networks lack global interactions between features, which degrades their performance. Although some works propose to increase the receptive field by enlarging the Convolutional kernel [28, 2], this imposes high memory access and computational costs and significantly slows down inference. Considering that Transformers can achieve global attention but introduce quadratic complexity with respect to the resolution of the input image, a series of recent works have attempted to combine the advantages of CNNs and Transformers to develop more efficient models [13, 26, 39, 44]. For example, EfficientFormer [26] only adds self-attention at the deeper stages to capture global context, while SwiftFormer [39] adds efficient additive attention at each stage of the model. The success of this hybrid design motivates us to explore whether a more advanced backbone can be constructed by integrating Convolution and Mamba.
2.2 State Space Models
As a typical architectural paradigm for sequence-to-sequence transformations, state-space models (SSMs) play a pivotal role in handling long-range dependencies in sequence data [10, 11, 43]. Early SSMs struggle to capture long-term dependencies and lack a unified understanding of memory [7]. To address these issues, Gu et al. enhance the ability of SSMs to capture extended dependencies using HiPPO initialization [9] and solve the continuous-time memory problem with a linear state space layer (LSSL) [11]. However, these SSMs are still limited by their computational requirements. Subsequent works propose a series of improvements, including complex diagonal structures [12, 14], multiple-input multiple-output support [43], and diagonal-plus-low-rank arithmetic [18], to improve computational efficiency and generalization across tasks. In particular, these strategies have been extended to large representation models [4, 32, 33]. Recently, Mamba [8] introduces a selection mechanism that outperforms Transformers on large-scale real data through a data-dependent SSM layer whose cost scales linearly with sequence length.
The success of Mamba has prompted the vision community to explore and apply it to various vision tasks, including classification [54, 30], detection [47], segmentation [55], deraining [25] and multimodal learning [37]. In particular, Vim [54] and VMamba [30] propose bidirectional scanning and cross-scanning strategies, respectively, to cope with the non-sequential structure of visual data. QuadMamba [47] proposes quad-tree scanning to capture local dependencies at different granularities. MSVMamba [40] proposes multi-scale two-dimensional scanning, which efficiently mitigates the long-range forgetting problem and reduces the computational cost. A highly related work is EfficientVMamba [36], which improves scanning efficiency via an atrous-based selective scan. However, these works tend to show low throughput and fail to compete with other advanced lightweight backbones based on Convolution or Transformers. This motivates us to design TinyViM, which obtains satisfactory speed and performance based on Laplace frequency decoupling.
3 Preliminaries
State Space Models (S4). Inspired by continuous systems, state space models (SSMs) are proposed in deep learning as general sequence models [10]. These models transform an input sequence $x(t)$ into an output sequence $y(t)$ by utilizing a learnable hidden state $h(t) \in \mathbb{R}^{N}$. The process can be denoted as follows:

\[
h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t) \tag{1}
\]

where $A \in \mathbb{R}^{N \times N}$ is the evolution parameter, $B \in \mathbb{R}^{N \times 1}$ and $C \in \mathbb{R}^{1 \times N}$ denote the learnable projection parameters, and $N$ is the state size.
Discretization. The above continuous-time SSMs are not well compatible with deep learning algorithms. Therefore, discretization is needed to align the model with the sampling frequency of the input signal and improve computational efficiency [11]. Following previous work [14], given the sampling time scale parameter $\Delta$, the above continuous SSMs are discretized through the zero-order hold rule, converting the continuous-time parameters $(A, B)$ to their corresponding discrete counterparts $(\overline{A}, \overline{B})$:

\[
\overline{A} = e^{\Delta A}, \qquad \overline{B} = (\Delta A)^{-1}\left(e^{\Delta A} - I\right) \cdot \Delta B \tag{2}
\]
Then, the discretized formulation of Eq. 1 is given by:

\[
h_t = \overline{A}\,h_{t-1} + \overline{B}\,x_t, \qquad y_t = C\,h_t \tag{3}
\]
where $h_t$ and $y_t$ denote the discrete hidden state and output at time step $t$. In order to improve the computational efficiency, the iterative process described in Eq. 3 can be performed in the parallel computing mode of a global convolution [8]:
\[
\overline{K} = \left(C\overline{B},\; C\overline{A}\,\overline{B},\; \ldots,\; C\overline{A}^{L-1}\overline{B}\right), \qquad y = x \circledast \overline{K} \tag{4}
\]

where $\circledast$ denotes the convolution operation, $L$ is the length of the input sequence, and $\overline{K} \in \mathbb{R}^{L}$ serves as the kernel of the SSMs.
Selective State Space Models (S6). Conventional SSMs (i.e., S4) capture sequence context with linear time complexity, but they are constrained by static parameterization and cannot perform content-based reasoning. To address this problem, the selective state space model (i.e., Mamba [8]) has been proposed, which allows the model to selectively propagate or forget information along the sequence length depending on the current token, by simply making the parameters of the SSM functions of the input. In S6, the parameters $\Delta$, $B$ and $C$ are computed directly from the input sequence $x$, thus enabling sequence-aware parameterization.
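To make the discretization and the recurrence-convolution equivalence above concrete, the following minimal NumPy sketch implements Eqs. 2-4 for a diagonal evolution parameter $A$; the shapes, random parameters and function names are illustrative only and do not correspond to any released Mamba code.

```python
# Minimal NumPy sketch of Eqs. (2)-(4) for a diagonal evolution parameter A:
# zero-order-hold discretization, the step-by-step recurrence of Eq. (3),
# and the equivalent global convolution of Eq. (4).
import numpy as np

def discretize(A, B, delta):
    # A_bar = exp(delta A), B_bar = (delta A)^{-1} (exp(delta A) - I) delta B
    A_bar = np.exp(delta * A)                      # element-wise, since A is diagonal
    B_bar = (A_bar - 1.0) / (delta * A) * (delta * B)
    return A_bar, B_bar

def ssm_recurrence(A_bar, B_bar, C, x):
    # Eq. (3): h_t = A_bar h_{t-1} + B_bar x_t,  y_t = C h_t
    h, y = np.zeros_like(A_bar), np.zeros(len(x))
    for t, x_t in enumerate(x):
        h = A_bar * h + B_bar * x_t
        y[t] = C @ h
    return y

def ssm_convolution(A_bar, B_bar, C, x):
    # Eq. (4): y = x (*) K_bar with K_bar = (C B_bar, C A_bar B_bar, ..., C A_bar^{L-1} B_bar)
    L = len(x)
    K = np.array([C @ (A_bar ** k * B_bar) for k in range(L)])
    return np.array([np.dot(K[: t + 1][::-1], x[: t + 1]) for t in range(L)])

N, L = 16, 64
A = -np.abs(np.random.randn(N))                    # stable (negative) diagonal A
B, C, x, delta = np.random.randn(N), np.random.randn(N), np.random.randn(L), 0.1
A_bar, B_bar = discretize(A, B, delta)
assert np.allclose(ssm_recurrence(A_bar, B_bar, C, x),
                   ssm_convolution(A_bar, B_bar, C, x))
```

The final assertion checks that the sequential recurrence and the global-convolution form produce identical outputs, which is the property that makes parallel training of SSMs possible.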
4 Method
4.1 Analysis of Vanilla Mamba Block
As analyzed in Sec. 3, Mamba can model global context with linear complexity with respect to the sequence length through pixel-by-pixel state transfer, which shows potential for lightweight model design. Moreover, considering the mobile-friendliness and localized feature extraction capability of Convolution, a natural intuition arises: can we build a lightweight and efficient visual backbone based on Convolution and Mamba blocks? Although some works [36] have attempted to do so, the performance and efficiency they exhibit are so far not competitive with Convolution- [45] or Transformer-based models [44, 39].
Baseline Construction. In response to this intuition, we first build a baseline architecture using Convolution (the local block in Fig. 3) and a vanilla Mamba block (applying the SS2D of VMamba [30] to adapt to visual data), i.e., the Baseline in Table 1. Note that the baseline structure is identical to the final structure of TinyViM-S, which is designed with reference to [39]; specific structural configurations can be found in the Supplementary Material. The only difference between the Baseline and TinyViM-S is that we replace the TinyViM blocks with SS2D-based vanilla Mamba blocks. Surprisingly, this version achieves a Top-1 accuracy of 79.1%, which is significantly better than the purely Convolutional (i.e., Conv only) version. This can be explained by the fact that Mamba improves the expressive ability of the model by capturing the global context at each stage. However, its low throughput limits the use of the model in real-time applications. We therefore attempt to improve the baseline based on qualitative and quantitative analyses.
Variants | GMACs | Throughput (im/s) | Top-1 (%)
Baseline | 0.96 | 1673 | 79.1 |
Conv only | 0.88 | 3212 | 77.5 |
Low only | 0.93 | 2574 | 79.0 |
High only | 0.96 | 1377 | 78.6 |
Low + High | 0.97 | 1509 | 79.1 |
Qualitative Analysis. We first perform a spectral analysis of the feature maps before and after the Mamba block in the baseline, as shown in Fig. 2. It shows that the Mamba block has a strong low-frequency capture ability under the Convolution-Mamba hybrid architecture, but loses high-frequency information such as local edges and textures. This low-frequency preference impairs the performance and efficiency of vision Mamba because 1) populating each stage with low-frequency information degrades the high-frequency component, weakening the fine-grained recognition ability of vision Mamba, and 2) additionally feeding the high-frequency component into a Mamba block impairs the efficiency and parallel computation of the model. A feasible strategy is therefore to decouple the high and low frequencies of the features and input only the low-frequency components into the Mamba block for hidden-state transfer. To this end, we decouple the high and low frequencies of the feature maps with a Laplace pyramid, which is more efficient than alternatives such as the cascaded wavelet transform, and verify the effectiveness of this strategy below.
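For reference, the spectral analysis described above can be approximated with a procedure of the following form; the exact hooks, normalization and color mapping used for Fig. 2 are not specified in the paper, so this is only a hedged sketch.

```python
# Hedged sketch of a 2-D spectral comparison of feature maps: take the per-channel
# 2-D FFT, shift the zero frequency to the center, and compare log-magnitude
# spectra averaged over channels. The paper's exact visualization may differ.
import torch

def log_spectrum(feat: torch.Tensor) -> torch.Tensor:
    """feat: (C, H, W) feature map -> (H, W) centered log-magnitude spectrum."""
    fft = torch.fft.fft2(feat.float())               # per-channel 2-D FFT
    fft = torch.fft.fftshift(fft, dim=(-2, -1))      # move low frequencies to the center
    return torch.log1p(fft.abs()).mean(dim=0)        # average the log-magnitude over channels

# Placeholder activations; in practice these would be hooked before/after the Mamba block.
feat_before = torch.randn(96, 14, 14)
feat_after = torch.randn(96, 14, 14)
diff = log_spectrum(feat_after) - log_spectrum(feat_before)
# A positive center and negative border in `diff` would indicate amplified low
# frequencies and suppressed high frequencies, matching the observation above.
print(diff[7, 7].item(), diff[0, 0].item())
```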
Quantitative Analysis. We next set the inputs of the Mamba block at each stage to be low-frequency only, high-frequency only, or high- and low-frequency in parallel, and explore the effect of the corresponding variants on classification accuracy. As shown in Table 1, when only low frequencies are used as inputs, the model achieves 79.0% accuracy, reduces the computational consumption and achieves about 1.5 times higher throughput. This verifies that using only low-frequency inputs to Mamba significantly improves the efficiency of the model without compromising classification accuracy. Therefore, based on the above qualitative and quantitative analyses, we design the Laplace mixer to optimize Mamba for the construction of a lightweight vision backbone.
4.2 Laplace Mixer
Previous work such as EfficientVMamba [36] uses dilation-based selective scanning to improve the efficiency of vanilla Mamba [8]. Despite reducing the number of input tokens, EfficientVMamba performs unsatisfactorily on various visual tasks [36]. We argue that the potential of Mamba in lightweight backbones is not fully exploited simply by modifying the scanning path in the image domain. Based on the analysis in Sec. 4.1, we reconstruct the Mamba block based on frequency decoupling and propose an elegant Laplace mixer. Furthermore, in a general visual backbone, the lower layers are more concerned with capturing high-frequency details, while the upper layers focus on modeling low-frequency global information [19]. Much like human perception, the lower layers capture basic visual features from high-frequency details and gradually aggregate local information into a global understanding of the input. In addition, feature redundancy has been shown to exist in existing networks [16], so separate low- and high-frequency processing for all channels is unnecessary. Therefore, we design a frequency ramp inception structure, which reduces the channel dimension of the high-frequency branch as the depth of the network increases.
Specifically, for the input feature $X \in \mathbb{R}^{H \times W \times D}$, where $H$, $W$ and $D$ denote the height, width and number of channels of $X$, we first split $X$ along the channel dimension into a low-frequency input $X_l \in \mathbb{R}^{H \times W \times \alpha D}$ and a high-frequency input $X_h \in \mathbb{R}^{H \times W \times (1-\alpha)D}$, where $\alpha$ denotes the partition coefficient that determines the number of channels in the low-frequency branch. Then, for the low-frequency branch, we adopt a simple Laplace pyramid architecture. Specifically, we average-pool $X_l$ to obtain the corresponding low-frequency component $X_{low}$ and shrink the number of input tokens. Next, we recover the resolution of $X_{low}$ via nearest-neighbor upsampling and subtract it pixel-by-pixel from $X_l$ to obtain the high-frequency component of $X_l$, i.e., $X_{high}$. This process can be formulated as

\[
X_{low} = \mathrm{AvgPool}(X_l), \qquad X_{high} = X_l - \mathrm{Up}(X_{low}) \tag{5}
\]

Then, we concatenate $X_{high}$ and $X_h$ along the channel dimension to obtain the high-frequency input, and apply a small depth-wise Convolution kernel, which usually tends to respond to the high frequencies of the input [5, 6], to enhance the high-frequency component. As for the low-frequency input $X_{low}$, we feed it into the SS2D block for cross-scanning to capture the global context. The frequency-separated processing can be described as

\[
Y_h = \mathrm{RepDW}_{3 \times 3}\!\left(\mathrm{Concat}(X_{high}, X_h)\right), \qquad Y_l = \mathrm{SS2D}(X_{low}) \tag{6}
\]

where $\mathrm{RepDW}_{3 \times 3}$ denotes the reparameterized depth-wise convolution. Finally, we sum $Y_h$ and $Y_l$ element-wise over the corresponding channels, followed by a convolution for fusion. Through the frequency decoupling process and frequency ramp inception, the Laplace mixer can effectively weigh and integrate the high- and low-frequency components at all layers. As a result, TinyViM achieves a better balance between accuracy and efficiency.
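To make the data flow of Eqs. 5-6 concrete, the following simplified PyTorch sketch shows one possible implementation of the Laplace mixer. The SS2D block is stubbed with a 1×1 convolution, and names such as LaplaceMixer, alpha and pool, the channel ordering of the concatenation, the upsampling of the low-frequency output and the 1×1 fusion layer are our assumptions rather than the released implementation.

```python
# Simplified sketch of the Laplace mixer under the assumptions stated above,
# for channels-first (B, D, H, W) features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LaplaceMixer(nn.Module):
    def __init__(self, dim: int, alpha: float = 0.5, pool: int = 2, ss2d: nn.Module = None):
        super().__init__()
        self.low_dim = int(dim * alpha)          # channels routed to the low-frequency branch
        self.pool = pool                         # down-sampling ratio of the Laplace pyramid
        # Stand-in for the SS2D/Mamba block that scans the pooled low-frequency tokens.
        self.ss2d = ss2d if ss2d is not None else nn.Conv2d(self.low_dim, self.low_dim, 1)
        # 3x3 depth-wise conv enhancing the high-frequency branch (reparameterized at inference).
        self.high_conv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.fuse = nn.Conv2d(dim, dim, 1)       # fusion after summing the branches

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_l, x_h = x[:, :self.low_dim], x[:, self.low_dim:]
        # Eq. (5): Laplace pyramid split of the low-frequency branch.
        x_low = F.avg_pool2d(x_l, self.pool)
        x_high = x_l - F.interpolate(x_low, size=x_l.shape[-2:], mode="nearest")
        # Eq. (6): global context on pooled low frequencies, depth-wise conv on high frequencies.
        y_low = F.interpolate(self.ss2d(x_low), size=x_l.shape[-2:], mode="nearest")
        y_high = self.high_conv(torch.cat([x_high, x_h], dim=1))
        # Add the low-frequency output back onto its corresponding channels, then fuse.
        y = torch.cat([y_high[:, :self.low_dim] + y_low, y_high[:, self.low_dim:]], dim=1)
        return self.fuse(y)

mixer = LaplaceMixer(dim=64, alpha=0.5)
print(mixer(torch.randn(2, 64, 28, 28)).shape)   # torch.Size([2, 64, 28, 28])
```

Note that the SS2D stub only operates on the pooled low-frequency tokens, which is the main source of the efficiency gain observed in Table 1.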
Model | Conference | Type | Params (M) | GMACs | Throughput (im/s) | Epochs | Top-1 (%) |
MobileOne-S1 [45] | CVPR’23 | CNN | 4.8 | 0.8 | 3545 | 300 | 75.9 |
SwiftFormer-S[39] | ICCV’23 | Hybrid | 6.1 | 1.0 | 2626 | 300 | 78.5 |
EfficientVMamba-T [36] | Arxiv’24 | Mamba | 6.1 | 1.0 | 1396 | 300 | 76.5 |
Vim-Ti [54] | ICML’24 | Mamba | 7.0 | 1.5 | - | 300 | 76.1 |
QuadMamba-Li [47] | NeurIPS’24 | Mamba | 5.4 | 0.8 | 1190 | 300 | 74.2 |
MSVMamba-N [40] | NeurIPS’24 | Mamba | 7.0 | 0.9 | 1283 | 300 | 77.3 |
TinyViM-S | - | CNN+Mamba | 5.6 | 0.9 | 2563 | 300 | 79.2 |
PoolFormer-S12 [50] | CVPR’22 | Pool | 12.0 | 2.0 | 1902 | 300 | 77.2 |
MobileOne-S3 [45] | CVPR’23 | CNN | 10.1 | 1.9 | 1900 | 300 | 78.1 |
MobileOne-S4 [45] | CVPR’23 | CNN | 14.8 | 3.0 | 1223 | 300 | 79.4 |
Agent-PVT-T [15] | ECCV’24 | ViT | 11.6 | 2.0 | 1447 | 300 | 78.4 |
EfficientVMamba-S [36] | Arxiv’24 | Mamba | 11.0 | 1.3 | 674 | 300 | 78.7 |
QuadMamba-T [47] | NeurIPS’24 | Mamba | 10.0 | 2.0 | 585 | 300 | 78.2 |
MSVMamba-M [40] | NeurIPS’24 | Mamba | 12.0 | 1.5 | 957 | 300 | 79.8 |
TinyViM-B | - | CNN+Mamba | 11.0 | 1.5 | 1851 | 300 | 81.2 |
PoolFormer-S36 [50] | CVPR’22 | Pool | 31.0 | 5.2 | 667 | 300 | 81.4 |
InceptionNext-T [51] | CVPR’24 | CNN | 28.0 | 4.2 | 987 | 300 | 82.3 |
Agent-PVT-S [15] | ECCV’24 | ViT | 20.6 | 4.0 | 686 | 300 | 82.4 |
FastViT-SA24 [44] | ICCV’23 | Hybrid | 20.6 | 3.8 | 861 | 300 | 82.6 |
EfficientVMamba-B [36] | Arxiv’24 | Mamba | 33.0 | 4.0 | 580 | 300 | 81.8 |
LocalVmamba-T [23] | Arxiv’24 | Mamba | 26.0 | 5.7 | - | 300 | 82.7 |
ViM-S [54] | ICML’24 | Mamba | 26.0 | 5.1 | - | 300 | 80.5 |
VMamba-T [30] | NeurIPS’24 | Mamba | 30.0 | 4.9 | 383 | 300 | 82.6 |
QuadMamba-S [47] | NeurIPS’24 | Mamba | 31.0 | 5.5 | 541 | 300 | 82.4 |
MSVmamba-T [40] | NeurIPS’24 | Mamba | 33.0 | 4.6 | 452 | 300 | 82.8 |
TinyViM-L | - | CNN+Mamba | 31.7 | 4.7 | 843 | 300 | 83.3 |
4.3 Overall Architecture
TinyViM employs a multi-scale backbone design similar to previous works [26, 39]. As shown in Fig. 3, given an image $I \in \mathbb{R}^{3 \times H \times W}$, where $H$ and $W$ denote the height and width of the input image, a simple stem is employed to output a $\frac{H}{4} \times \frac{W}{4} \times C_1$ feature map, where $C_1$ is the channel number of the first stage. Note that the stem is implemented with two convolutions with a stride of 2. Then, the output feature maps are fed into the first stage, which begins with Local Blocks. The local block consists of a reparameterized convolution and a feed-forward network (FFN, implemented by two consecutive convolutions), which achieve local feature extraction and channel mixing, respectively. The local block can be formulated as

\[
X' = X + \mathrm{RepConv}(X), \qquad Y = X' + \mathrm{FFN}(X') \tag{7}
\]

where $\mathrm{FFN}$ and $\mathrm{RepConv}$ denote the feed-forward network and the reparameterized convolution, respectively. The features are then fed into the TinyViM block for global context awareness. Similarly, the TinyViM block consists of a Laplace mixer and an FFN for global feature extraction and channel mixing, respectively. The TinyViM block can be described as

\[
X' = X + \mathrm{LaplaceMixer}(X) \tag{8}
\]

\[
Y = X' + \mathrm{FFN}(X') \tag{9}
\]

There is a Patch Embedding layer between two consecutive stages that increases the channel dimension and reduces the resolution of the feature maps. The resulting feature maps are subsequently fed into the second, third and fourth stages, producing $\frac{H}{8} \times \frac{W}{8} \times C_2$, $\frac{H}{16} \times \frac{W}{16} \times C_3$ and $\frac{H}{32} \times \frac{W}{32} \times C_4$ feature maps, respectively.
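The block layout of Eqs. 7-9 can be summarized by the hedged PyTorch sketch below. The residual connections, the expansion ratio of the ConvFFN, the plain (non-reparameterized) depth-wise convolution and the placeholder mixer are our simplifications and assumptions, not the exact TinyViM blocks; a LaplaceMixer such as the sketch in Sec. 4.2 would normally be passed in as the mixer.

```python
# Hedged sketch of the Local Block (Eq. 7) and TinyViM Block (Eqs. 8-9),
# plus a toy stage followed by a strided patch-embedding convolution.
import torch
import torch.nn as nn

class ConvFFN(nn.Sequential):
    # 1x1 -> GELU -> 1x1 channel-mixing FFN (the expansion ratio is an assumption)
    def __init__(self, dim: int, ratio: int = 4):
        super().__init__(nn.Conv2d(dim, dim * ratio, 1), nn.GELU(), nn.Conv2d(dim * ratio, dim, 1))

class LocalBlock(nn.Module):
    # Eq. (7): 3x3 depth-wise conv (stand-in for the reparameterized conv) + FFN
    def __init__(self, dim: int):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.ffn = ConvFFN(dim)

    def forward(self, x):
        x = x + self.dw(x)        # local feature extraction
        return x + self.ffn(x)    # channel mixing

class TinyViMBlock(nn.Module):
    # Eqs. (8)-(9): frequency-decoupled mixer (e.g. a Laplace mixer) + FFN
    def __init__(self, dim: int, mixer: nn.Module):
        super().__init__()
        self.mixer = mixer        # global token mixing
        self.ffn = ConvFFN(dim)

    def forward(self, x):
        x = x + self.mixer(x)     # Eq. (8)
        return x + self.ffn(x)    # Eq. (9)

# A stage stacks Local Blocks and ends with a TinyViM block; a strided convolution
# stands in for the patch embedding between stages.
stage = nn.Sequential(LocalBlock(64), LocalBlock(64),
                      TinyViMBlock(64, mixer=nn.Conv2d(64, 64, 1)),   # placeholder mixer
                      nn.Conv2d(64, 128, 3, stride=2, padding=1))     # patch embedding
print(stage(torch.randn(1, 64, 28, 28)).shape)   # torch.Size([1, 128, 14, 14])
```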
5 Experiments
We evaluate the proposed TinyViM on four downstream tasks: classification on the ImageNet-1K dataset [1], object detection and instance segmentation on the MS-COCO 2017 dataset [27], and semantic segmentation on the ADE20K dataset [53]. All experiments follow the common training settings in [39]. Due to page limits, we provide the dataset descriptions, the specific experimental setups, details of the model architecture, more ablation experiments and qualitative analyses in the Supplementary Material.
5.1 Image Classification
We compare TinyViM's image classification performance on ImageNet-1K [1] with advanced Convolution-, ViT- and Mamba-based methods, and the results are summarized in Table 2. With similar GMACs and throughput, TinyViM-S achieves a top-1 accuracy of 79.2%, outperforming SwiftFormer-S by 0.7%. TinyViM retains its performance advantage at base and large scales as well. For example, TinyViM-B achieves an accuracy of 81.2%, outperforming the convolution-based MobileOne-S4 by 1.8%, the ViT-based Agent-PVT-T by 2.8%, and the Mamba-based MSVMamba-M by 1.4%. These comparisons demonstrate the excellent performance of TinyViM.
In terms of computational efficiency, TinyViM-S achieves a throughput of 2574 images/s, significantly higher than other vision Mamba models. For example, the throughput of EfficientVMamba-T is only 1396 images/s, about half that of TinyViM-S. This efficiency advantage persists for TinyViM-B and TinyViM-L, whose throughputs are 2.7 and 1.5 times higher than those of the similarly scaled EfficientVMamba variants, respectively. Compared with other Convolutional and ViT-based models, TinyViM maintains its lead in performance while its throughput remains highly competitive. It should be emphasized that this competitiveness improves further on A100 GPUs, since Mamba benefits from a number of hardware-friendly optimizations on the A100 (e.g., Triton kernels) that are not yet supported on the V100 GPUs. In summary, the results show that TinyViM strikes a better balance between performance and efficiency than other state-of-the-art lightweight backbones and offers high value for real-time applications.
5.2 Object Detection and Instance Segmentation
We evaluate the proposed TinyViM for object detection and instance segmentation using the Mask R-CNN [20] framework. As shown in Table 3, TinyViM outperforms competitive models, including the pooling-based PoolFormer, the Transformer-based FastViT and SwiftFormer, and the Mamba-based EfficientVMamba, in both $AP^{box}$ and $AP^{mask}$. Specifically, compared to the recent SwiftFormer-L1, TinyViM-B achieves 1.1 and 0.6 improvements in $AP^{box}$ and $AP^{mask}$, respectively, with higher throughput. Compared to EfficientVMamba-S, TinyViM-B has more significant advantages, i.e., 3.0 and 2.0 improvements in $AP^{box}$ and $AP^{mask}$, respectively, with 1.7 times the throughput. When compared with larger models such as FastViT-SA24 and EfficientVMamba-B, TinyViM-L still holds an advantage in throughput and shows 2.5 and 0.8 improvements in $AP^{box}$, respectively. These results demonstrate the superiority of TinyViM when transferred to detection and instance segmentation tasks.
5.3 Semantic Segmentation
Table 3 shows the semantic segmentation results of TinyViM as a backbone in comparison with other state-of-the-art backbones, where Semantic FPN [24] serves as the decode head. Specifically, compared with recent methods such as SwiftFormer, TinyViM obtains 0.8% and 0.3% mIoU improvements at base and large scales, respectively. Compared with FastViT, TinyViM has more significant advantages, with 3.9% and 3.2% improvements, respectively. These results verify the effectiveness of TinyViM for semantic segmentation.
Backbone | Throughput (img/s) | AP^box | AP^box_50 | AP^box_75 | AP^mask | AP^mask_50 | AP^mask_75 | mIoU
PoolFormer-S12 [50] | 171 | 37.3 | 59.0 | 40.1 | 34.6 | 55.8 | 36.9 | 37.2
EfficientFormer-L1 [26] | 197 | 37.9 | 60.3 | 41.0 | 35.4 | 57.3 | 37.3 | 38.9
FastViT-SA12 [44] | 162 | 38.9 | 60.5 | 42.2 | 35.9 | 57.6 | 38.1 | 38.0
SwiftFormer-L1 [39] | 174 | 41.2 | 63.2 | 44.8 | 38.1 | 60.2 | 40.7 | 41.1
EfficientVMamba-S [36] | 104 | 39.3 | 61.8 | 42.6 | 36.7 | 58.9 | 39.2 | -
TinyViM-B | 180 | 42.3 | 64.2 | 46.3 | 38.7 | 61.1 | 41.3 | 41.9
PoolFormer-S36 [50] | 87 | 41.0 | 63.1 | 44.8 | 37.7 | 60.1 | 40.0 | 42.0
EfficientFormer-L3 [26] | 117 | 41.4 | 63.9 | 44.7 | 38.1 | 61.0 | 40.4 | 43.5
FastViT-SA24 [44] | 110 | 42.0 | 63.5 | 45.8 | 38.0 | 60.5 | 40.5 | 41.0
SwiftFormer-L3 [39] | 120 | 42.7 | 64.4 | 46.7 | 39.1 | 61.7 | 41.8 | 43.9
EfficientVMamba-B [36] | 90 | 43.7 | 66.2 | 47.9 | 40.2 | 63.3 | 42.9 | -
TinyViM-L | 111 | 44.5 | 66.4 | 48.6 | 40.7 | 63.6 | 43.8 | 44.2
5.4 Ablation Studies
Effect of Frequency Decoupling. To investigate the effectiveness of the frequency decoupling strategy, we perform spectral visualizations of the feature maps of the low-frequency and high-frequency branches, respectively. As shown in Fig. 4, by applying Mamba, the feature maps show higher concentrations at low frequencies; by applying the depth-wise convolution, the high-frequency components are effectively preserved. This shows that the Laplace mixer can effectively decouple the high- and low-frequency components of the features and enhance each of them separately.
Ablation of the Frequency Ramp Inception. We explore the effectiveness of the frequency ramp inception design, as shown in Table 4. We set up three experimental variants: in variant 1, we apply Laplace frequency decoupling to all channels of the feature map and input the low and high frequencies into Mamba and the depth-wise convolution, respectively; in variant 2, we adopt a non-ramp structure, i.e., the partition coefficient $\alpha$ of all four stages is set to 1/2; and in variant 3, we gradually increase $\alpha$ as the stage index increases. In the implementation of this paper, the $\alpha$ of the four stages are 0.25, 0.5, 0.5 and 0.75, respectively. It can be observed that, even at the small size, the throughput of variant 1 is significantly reduced and more GMACs are required. Variant 2, although more efficient, has lower accuracy because it does not adjust the weights of the low-frequency and high-frequency components at different stages. The frequency ramp used in variant 3 effectively weighs and integrates the high- and low-frequency information, thus helping the model to achieve a better balance between performance and efficiency.
Inception | Ramp | GMACs | Throughput (im/s) | Top-1 (%)
✗ | ✗ | 0.94 | 2478 | 79.2 |
✔ | ✗ | 0.93 | 2574 | 79.0 |
✔ | ✔ | 0.93 | 2563 | 79.2 |
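As a concrete reading of Table 4, the small sketch below spells out how the three variants map to per-stage partition coefficients; only the $\alpha$ values come from the paper, while the channel widths are hypothetical placeholders.

```python
# Per-stage partition coefficients (alpha = fraction of channels routed to the
# low-frequency / Mamba branch) for the three ablation variants in Table 4.
# Only the alpha values come from the paper; the channel widths are placeholders.
stage_dims = [48, 96, 192, 384]                  # hypothetical stage widths

variants = {
    "variant 1 (full decoupling)": [1.0, 1.0, 1.0, 1.0],
    "variant 2 (non-ramp)":        [0.5, 0.5, 0.5, 0.5],
    "variant 3 (frequency ramp)":  [0.25, 0.5, 0.5, 0.75],
}

for name, alphas in variants.items():
    split = [(int(d * a), d - int(d * a)) for d, a in zip(stage_dims, alphas)]
    print(f"{name}: low/high channels per stage = {split}")
```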
6 Conclusion
Motivated by Mamba's ability to model global context with linear complexity, we construct a tiny and efficient hybrid vision Mamba, TinyViM, by combining mobile-friendly convolution and Mamba. In contrast to previous vision Mamba efforts that simply modify scan paths in the image domain, whose performance cannot compete with Convolution- or Transformer-based models of similar scale, TinyViM builds a lightweight backbone from a frequency decoupling perspective. Specifically, through spectral analysis and quantitative experiments, we verify that the Mamba block mainly models low-frequency information in a Convolution-Mamba hybrid architecture. To this end, we design a frequency decoupling strategy to preserve the high-frequency information and improve the low-frequency modeling efficiency of Mamba. In particular, we introduce a frequency ramp inception to weigh the different strengths of the low-frequency and high-frequency components at different stages of the model, helping the model achieve a better balance between accuracy and efficiency. Extensive experiments show that TinyViM outperforms other Convolution-, Transformer- and Mamba-based models. In particular, TinyViM achieves higher throughput than other Mamba models, which underscores its value in real-time applications. In future work, we plan to use it as a lightweight backbone to replace the image encoder in SAM, exploring the potential of vision Mamba in multi-modal scenarios.
References
- Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
- Ding et al. [2022] Xiaohan Ding, Xiangyu Zhang, Jungong Han, and Guiguang Ding. Scaling up your kernels to 31x31: Revisiting large kernel design in CNNs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11963–11975, 2022.
- Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- Fu et al. [2022] Daniel Y. Fu, Tri Dao, Khaled K. Saab, Armin W. Thomas, Atri Rudra, and Christopher Ré. Hungry hungry hippos: Towards language modeling with state space models. arXiv preprint arXiv:2212.14052, 2022.
- Gavrikov and Keuper [2024] Paul Gavrikov and Janis Keuper. Can biases in ImageNet models explain generalization? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22184–22194, 2024.
- Geirhos et al. [2022] Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, and Wieland Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness, 2022.
- Graves [2012] Alex Graves. Long Short-Term Memory, pages 37–45. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012.
- Gu and Dao [2023] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
- Gu et al. [2020] Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, and Christopher Ré. HiPPO: Recurrent memory with optimal polynomial projections. Advances in Neural Information Processing Systems, 33:1474–1487, 2020.
- Gu et al. [2021a] Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021a.
- Gu et al. [2021b] Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, and Christopher Ré. Combining recurrent, convolutional, and continuous-time models with linear state space layers. Advances in Neural Information Processing Systems, 34:572–585, 2021b.
- Gu et al. [2022] Albert Gu, Karan Goel, Ankit Gupta, and Christopher Ré. On the parameterization and initialization of diagonal state space models. Advances in Neural Information Processing Systems, 35:35971–35983, 2022.
- Guo et al. [2022] Jianyuan Guo, Kai Han, Han Wu, Yehui Tang, Xinghao Chen, Yunhe Wang, and Chang Xu. CMT: Convolutional neural networks meet vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12175–12185, 2022.
- Gupta et al. [2022] Ankit Gupta, Albert Gu, and Jonathan Berant. Diagonal state spaces are as effective as structured state spaces. Advances in Neural Information Processing Systems, 35:22982–22994, 2022.
- Han et al. [2025] Dongchen Han, Tianzhu Ye, Yizeng Han, Zhuofan Xia, Siyuan Pan, Pengfei Wan, Shiji Song, and Gao Huang. Agent attention: On the integration of softmax and linear attention. In Computer Vision – ECCV 2024, pages 124–140, Cham, 2025. Springer Nature Switzerland.
- Han et al. [2020] Kai Han, Yunhe Wang, Qi Tian, Jianyuan Guo, Chunjing Xu, and Chang Xu. GhostNet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
- Han et al. [2016] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding, 2016.
- Hasani et al. [2022] Ramin Hasani, Mathias Lechner, Tsun-Hsuan Wang, Makram Chahine, Alexander Amini, and Daniela Rus. Liquid structural state-space models. arXiv preprint arXiv:2209.12951, 2022.
- He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- He et al. [2017] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
- Howard et al. [2019] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, Quoc V. Le, and Hartwig Adam. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
- Howard et al. [2017] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications, 2017.
- Huang et al. [2024] Tao Huang, Xiaohuan Pei, Shan You, Fei Wang, Chen Qian, and Chang Xu. LocalMamba: Visual state space model with windowed selective scan, 2024.
- Kirillov et al. [2019] Alexander Kirillov, Ross Girshick, Kaiming He, and Piotr Dollár. Panoptic feature pyramid networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6399–6408, 2019.
- Li et al. [2024] Dong Li, Yidi Liu, Xueyang Fu, Senyan Xu, and Zheng-Jun Zha. FourierMamba: Fourier learning integration with state space models for image deraining, 2024.
- Li et al. [2022] Yanyu Li, Geng Yuan, Yang Wen, Ju Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, and Jian Ren. EfficientFormer: Vision transformers at MobileNet speed. Advances in Neural Information Processing Systems, 35:12934–12949, 2022.
- Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV 2014, pages 740–755. Springer, 2014.
- Liu et al. [2023a] Shiwei Liu, Tianlong Chen, Xiaohan Chen, Xuxi Chen, Qiao Xiao, Boqian Wu, Tommi Kärkkäinen, Mykola Pechenizkiy, Decebal Mocanu, and Zhangyang Wang. More ConvNets in the 2020s: Scaling up kernels beyond 51x51 using sparsity, 2023a.
- Liu et al. [2023b] Xinyu Liu, Houwen Peng, Ningxin Zheng, Yuqing Yang, Han Hu, and Yixuan Yuan. EfficientViT: Memory efficient vision transformer with cascaded group attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14420–14430, 2023b.
- Liu et al. [2024] Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, and Yunfan Liu. VMamba: Visual state space model. arXiv preprint arXiv:2401.10166, 2024.
- Ma et al. [2018] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
- Ma et al. [2022] Xuezhe Ma, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan May, and Luke Zettlemoyer. Mega: Moving average equipped gated attention. arXiv preprint arXiv:2209.10655, 2022.
- Mehta et al. [2022] Harsh Mehta, Ankit Gupta, Ashok Cutkosky, and Behnam Neyshabur. Long range language modeling via gated state spaces. arXiv preprint arXiv:2206.13947, 2022.
- Mehta et al. [2019] Sachin Mehta, Mohammad Rastegari, Linda Shapiro, and Hannaneh Hajishirzi. ESPNetv2: A light-weight, power efficient, and general purpose convolutional neural network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
- Pei et al. [2024a] Xiaohuan Pei, Tao Huang, and Chang Xu. EfficientVMamba: Atrous selective scan for light weight visual Mamba. arXiv preprint arXiv:2403.09977, 2024a.
- Pei et al. [2024b] Xiaohuan Pei, Tao Huang, and Chang Xu. EfficientVMamba: Atrous selective scan for light weight visual Mamba, 2024b.
- Qiao et al. [2024] Yanyuan Qiao, Zheng Yu, Longteng Guo, Sihan Chen, Zijia Zhao, Mingzhen Sun, Qi Wu, and Jing Liu. VL-Mamba: Exploring state space models for multimodal learning, 2024.
- Sandler et al. [2018] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
- Shaker et al. [2023] Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, and Fahad Shahbaz Khan. SwiftFormer: Efficient additive attention for transformer-based real-time mobile vision applications. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 17425–17436, 2023.
- Shi et al. [2024] Yuheng Shi, Minjing Dong, and Chang Xu. Multi-scale VMamba: Hierarchy in hierarchy visual state space model. arXiv preprint arXiv:2405.14174, 2024.
- Si et al. [2022] Chenyang Si, Weihao Yu, Pan Zhou, Yichen Zhou, Xinchao Wang, and Shuicheng Yan. Inception transformer. In Advances in Neural Information Processing Systems, pages 23495–23509. Curran Associates, Inc., 2022.
- Simonyan and Zisserman [2015] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition, 2015.
- Smith et al. [2022] Jimmy T. H. Smith, Andrew Warrington, and Scott W. Linderman. Simplified state space layers for sequence modeling. arXiv preprint arXiv:2208.04933, 2022.
- Vasu et al. [2023a] Pavan Kumar Anasosalu Vasu, James Gabriel, Jeff Zhu, Oncel Tuzel, and Anurag Ranjan. FastViT: A fast hybrid vision transformer using structural reparameterization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023a.
- Vasu et al. [2023b] Pavan Kumar Anasosalu Vasu, James Gabriel, Jeff Zhu, Oncel Tuzel, and Anurag Ranjan. MobileOne: An improved one millisecond mobile backbone. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7907–7917, 2023b.
- Wang et al. [2024] Ao Wang, Hui Chen, Zijia Lin, Jungong Han, and Guiguang Ding. RepViT: Revisiting mobile CNN from ViT perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15909–15920, 2024.
- Xie et al. [2024] Fei Xie, Weijia Zhang, Zhongdao Wang, and Chao Ma. QuadMamba: Learning quadtree-based selective scan for visual state space model, 2024.
- [48] Yixing Xu, Chao Li, Dong Li, Xiao Sheng, Fan Jiang, Lu Tian, Ashish Sirasao, and Emad Barsoum. Enhancing vision transformer: Amplifying non-linearity in feedforward network module. In Forty-first International Conference on Machine Learning.
- Xu et al. [2023] Yixing Xu, Chao Li, Dong Li, Xiao Sheng, Fan Jiang, Lu Tian, and Ashish Sirasao. FDViT: Improve the hierarchical architecture of vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5950–5960, 2023.
- Yu et al. [2022] Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, and Shuicheng Yan. MetaFormer is actually what you need for vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10819–10829, 2022.
- Yu et al. [2024] Weihao Yu, Pan Zhou, Shuicheng Yan, and Xinchao Wang. InceptionNeXt: When Inception meets ConvNeXt. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5672–5683, 2024.
- Zhang et al. [2018] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
- Zhou et al. [2017] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20K dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- [54] Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, and Xinggang Wang. Vision Mamba: Efficient visual representation learning with bidirectional state space model. In Forty-first International Conference on Machine Learning.
- Zhu et al. [2024] Qinfeng Zhu, Yuanzhi Cai, Yuan Fang, Yihan Yang, Cheng Chen, Lei Fan, and Anh Nguyen. Samba: Semantic segmentation of remotely sensed images with state space model, 2024.
Supplementary Material
7 Dataset and Implementation Details
ImageNet is a large image classification dataset for computer vision research, containing about 1.2 million labeled images across about 1000 categories. All of our models are trained on the ImageNet-1K dataset for 300 epochs using the AdamW optimizer and an initial learning rate of . We use an image resolution of 224×224 for both training and testing. In addition, we use the same teacher model (i.e., the RegNetY-16GF model with a top-1 accuracy of 82.9%) as SwiftFormer for distillation. All experiments are conducted on 8 NVIDIA V100 GPUs, and the throughput is tested with the maximum power-of-two batch size that fits in memory.
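For reference, the throughput protocol described above (largest power-of-two batch size that fits in memory) can be approximated by a sketch of the following form; the warm-up and iteration counts and the out-of-memory probing loop are our assumptions rather than the exact benchmarking script.

```python
# Hedged sketch of a throughput measurement: probe the largest power-of-two
# batch size that fits in GPU memory, then time forward passes on random inputs.
import time
import torch

@torch.no_grad()
def throughput(model: torch.nn.Module, resolution: int = 224, iters: int = 50) -> float:
    model.eval().cuda()
    batch = 1024
    while batch >= 1:                       # find the largest power-of-two batch that fits
        try:
            x = torch.randn(batch, 3, resolution, resolution, device="cuda")
            model(x)
            break
        except RuntimeError:                # CUDA out-of-memory raises a RuntimeError subclass
            torch.cuda.empty_cache()
            batch //= 2
    for _ in range(10):                     # warm-up
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    return batch * iters / (time.time() - start)   # images per second
```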
Table 5: Architecture configurations of TinyViM-Small/Base/Large. Stem: [RepDW-3 ×2, stride=2], output 56×56; stage1: output 56×56; stages 2–4: each preceded by [RepDW-3 ×1, stride=2], outputs 28×28, 14×14 and 7×7; classifier: average pool, 1000-d fully-connected. GFLOPs: 0.9G / 1.5G / 4.7G; Params: 5.6M / 11.0M / 31.7M.
MS-COCO 2017 is a large-scale real-world image and annotation dataset that contains 118K training and 5K validation images covering 80 object classes. We use TinyViM as the backbone for feature extraction and apply the Mask R-CNN framework for object detection and instance segmentation. Similar to EfficientFormer, we finetune TinyViM for 12 epochs with an image resolution of 1333×800 and a batch size of 32. The AdamW optimizer is used with a learning rate of . We report the performance for object detection and instance segmentation in terms of mean average precision (mAP).
ADE20K is a challenging segmentation dataset, which contains about 20,000 images covering 150 categories. We apply the ImageNet-pretrained TinyViM as the backbone to extract image features and Semantic FPN as the decode head for segmentation. Following EfficientFormer and SwiftFormer, we train the segmentation model with an image size of and a batch size of 32 for 40K iterations. The AdamW optimizer with poly learning rate scheduling is applied and the initial learning rate is . Semantic segmentation performance is reported with the mean intersection over union (mIoU) metric.
8 Model Configuration
We give the detailed architectural configurations of the TinyViM variants in Table 5, which lists each building block of the model variants and the corresponding hyperparameters. We mainly place the proposed TinyViM blocks at the end of each stage to capture global context-enhanced features. For the third stage, we additionally place a TinyViM block in the middle of the stage to obtain a better balance between accuracy and efficiency; the effectiveness of this design of placing more and larger blocks in the third stage has been confirmed by previous works. In addition, we set the downsampling ratios in the TinyViM blocks of the four stages to 8, 4, 2 and 1, respectively. In other words, the resolution of the feature maps input to the Mamba block at each stage is 1/32 of the original image.
9 More Comparison Results
9.1 Qualitative results on MS-COCO
The detection and instance segmentation results on MS-COCO are visualized in Fig. 5. It can be observed that the people and animals in the images are accurately segmented, and the detection boxes are also accurately localized. These visualization results show that TinyViM achieves advanced performance in both instance segmentation and detection tasks.
9.2 Qualitative results on ADE20K
We visualize the segmentation results on ADE20K in Fig. 6. The segmentation results of TinyViM are very close to the labels; in particular, the categories and boundaries of objects are accurately segmented. Even for the relatively complex scene in the second image, TinyViM accurately segments a variety of tiny objects, including table lamps and pillows. For larger objects such as buildings or trees, TinyViM's segmentation is also satisfactory. These visualization results show that the frequency decomposition scheme of TinyViM can effectively extract the features of different objects, demonstrating the state-of-the-art performance of TinyViM in semantic segmentation.