Hybrid CNN-ViT Architecture for Improved Accuracy and Efficiency in Image Classification

Shraddha Gugulothu, Jaikumar M. Patil, Dipak Wajgi, Hemantkumar Rishipal Turkar, Pallavi Wankhede, Purnima Niranjane

Published: Jun 2, 2026

Keywords:

Hybrid CNN-ViT, Image Classification, Deep Learning, Vision Transformer, Convolutional Neural Network, Feature Extraction, Self-Attention, Computational Efficiency, Edge Computing, Transfer Learning

Abstract

Recent advances in deep learning (DL) have shown that convolutional neural networks (CNNs) and Vision Transformers (ViTs) complement each other in image classification. CNNs use hierarchical convolutions to capture local spatial information, whereas ViTs capture long-range dependencies across an image through self-attention. The hybrid CNN-ViT architecture presented in this paper combines the inductive biases of CNNs with the global contextual view that ViTs provide in order to improve both classification accuracy and computational efficiency. A CNN feature extractor first encodes the local and mid-level features of an image; a lightweight transformer encoder then models the global relationships among the tokens that represent those features. The architecture also introduces reduced-complexity attention and an efficient tokenization method to lower computational cost. In experiments on benchmark image classification datasets, standalone CNN and ViT models perform worse than the proposed hybrid, which outperforms both in three areas: speed of training convergence across multiple runs, parameter efficiency, and overall classification accuracy across the evaluated datasets. The results indicate that the hybrid CNN-ViT architecture strikes a good balance between performance and resource utilization, making it well suited to real-time and edge-based applications. This work shows how hybrid deep learning architectures can mitigate the limitations of single-model architectures.
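
The efficiency argument can be illustrated with a small back-of-the-envelope calculation. The sketch below is not the authors' method: the abstract does not specify the tokenization scheme, so the input resolution (224), the embedding width (256), the ViT patch size (16), and the CNN stem's total stride (32) are all assumed values chosen for illustration. It only shows why feeding a transformer fewer tokens from a downsampling CNN stem lowers the quadratic cost of self-attention.

```python
def num_tokens(image_size: int, stride: int) -> int:
    """Tokens produced when a square image is reduced by a total spatial stride."""
    side = image_size // stride
    return side * side


def attention_cost(tokens: int, dim: int) -> int:
    """Approximate multiply-adds for the QK^T and AV products of one
    self-attention layer: 2 * N^2 * d (projections are ignored)."""
    return 2 * tokens * tokens * dim


IMAGE_SIZE = 224  # assumed input resolution, not from the paper
DIM = 256         # assumed embedding width, not from the paper

# Plain ViT-style tokenization with 16x16 patches (assumed baseline).
vit_tokens = num_tokens(IMAGE_SIZE, 16)

# Hypothetical CNN stem that downsamples by a total stride of 32
# before handing its feature map to the transformer encoder.
hybrid_tokens = num_tokens(IMAGE_SIZE, 32)

# How many times cheaper the attention products become.
ratio = attention_cost(vit_tokens, DIM) / attention_cost(hybrid_tokens, DIM)

print(f"ViT tokens: {vit_tokens}, hybrid tokens: {hybrid_tokens}")
print(f"Attention cost ratio (ViT / hybrid): {ratio:.1f}")
```

Under these assumed numbers, halving the feature-map side length cuts the token count by a factor of four and the attention products by a factor of sixteen. The figures reported in the paper may differ, since its actual stem and attention variant are not described here.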

How to Cite

Shraddha Gugulothu, Jaikumar M. Patil, Dipak Wajgi, Hemantkumar Rishipal Turkar, Pallavi Wankhede, Purnima Niranjane. (2026). Hybrid CNN-ViT Architecture for Improved Accuracy and Efficiency in Image Classification. Journal of Daoist Studies, 19(S1), 663–672. Retrieved from https://journalofdaoiststudies.org/index.php/journal/article/view/165

Issue

Vol. 19 No. S1 (2026): Journal of Daoist Studies

Section

Articles
