Hybrid CNN-ViT Architecture for Improved Accuracy and Efficiency in Image Classification
Abstract
Recent advances in deep learning have shown that convolutional neural networks (CNNs) and Vision Transformers (ViTs) complement each other well in image classification. CNNs use hierarchical convolutional operations to capture local spatial information, whereas ViTs use self-attention to model long-range dependencies across an image. The hybrid CNN-ViT architecture presented in this paper combines the inductive biases of CNNs with the global contextual view that ViTs provide, with the aim of improving both classification accuracy and computational efficiency. The architecture first applies a CNN feature extractor to encode the local and mid-level features of an image, and then applies a lightweight transformer encoder to model global relationships among the tokens that represent those features. It also introduces a reduced-complexity attention mechanism and an efficient tokenization method to lower the computational cost. Experiments on benchmark image classification datasets show that the proposed hybrid architecture outperforms standalone CNN and ViT models in three respects: speed of training convergence across multiple runs, parameter efficiency, and overall classification accuracy across the evaluated datasets. The results indicate that the hybrid CNN-ViT architecture strikes a good balance between performance and resource utilization, making it well suited to real-time and edge-based applications. This work shows how hybrid deep learning architectures can mitigate the limitations of single-model architectures.
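To make the efficiency argument concrete, the following back-of-the-envelope sketch estimates how tokenizing a downsampled CNN feature map, rather than raw image patches, shrinks the cost of one self-attention layer. It is only an illustration under stated assumptions: the input resolution, embedding width, patch size, and CNN stride are hypothetical values chosen for the example, not figures from this paper, and the cost formula is the standard approximate count for dense self-attention, not the paper's reduced-complexity attention.

```python
def num_tokens(image_size: int, stride: int) -> int:
    """Tokens produced when a square image is reduced by `stride` per side."""
    if image_size % stride != 0:
        raise ValueError("image_size must be divisible by stride")
    side = image_size // stride
    return side * side


def attention_flops(n_tokens: int, dim: int) -> int:
    """Approximate multiply-adds for one dense self-attention layer.

    Counts the Q, K, V, and output projections (4 * N * d^2) plus the
    two N x N matrix products (2 * N^2 * d).
    """
    projections = 4 * n_tokens * dim * dim
    pairwise = 2 * n_tokens * n_tokens * dim
    return projections + pairwise


# All four values below are assumptions made for this example only.
IMAGE_SIZE = 224    # assumed input resolution
EMBED_DIM = 256     # assumed token width
PATCH_STRIDE = 16   # assumed patch size of a plain ViT
CNN_STRIDE = 32     # assumed total downsampling of the CNN feature extractor

vit_tokens = num_tokens(IMAGE_SIZE, PATCH_STRIDE)    # 14 x 14 grid
hybrid_tokens = num_tokens(IMAGE_SIZE, CNN_STRIDE)   # 7 x 7 grid

vit_cost = attention_flops(vit_tokens, EMBED_DIM)
hybrid_cost = attention_flops(hybrid_tokens, EMBED_DIM)

print(f"plain ViT tokens: {vit_tokens}, hybrid tokens: {hybrid_tokens}")
print(f"attention cost ratio (ViT / hybrid): {vit_cost / hybrid_cost:.2f}")
```

Because the pairwise term grows with the square of the token count, quartering the number of tokens cuts that term by a factor of 16. The estimate deliberately ignores the cost of the CNN feature extractor itself, so it says nothing about end-to-end efficiency.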