Publication:
NiNformer: a network in network transformer with token mixing generated gating function

dc.contributor.authorAbdullah, Abdullah Nazhat
dc.contributor.authorAydin, Tarkan
dc.contributor.institutionAbdullah, Abdullah Nazhat, Department of Computer Engineering, Bahçeşehir Üniversitesi, Istanbul, Turkey
dc.contributor.institutionAydin, Tarkan, Department of Computer Engineering, Bahçeşehir Üniversitesi, Istanbul, Turkey
dc.date.accessioned2025-10-05T14:28:51Z
dc.date.issued2025
dc.description.abstractThe attention mechanism is the primary component of the transformer architecture and has driven significant advances in deep learning across many domains and tasks. In computer vision, the attention mechanism was first incorporated in the vision transformer (ViT), and its use has since expanded to many vision tasks, such as classification, segmentation, object detection, and image generation. While the attention mechanism is highly expressive and capable, it has the disadvantage of being computationally expensive and requiring datasets of considerable size for effective optimization. To address these shortcomings, many designs have been proposed in the literature to reduce the computational burden and relax the data-size requirements; examples in the vision domain include the MLP-Mixer, the Conv-Mixer, and the Perceiver-IO, among many other attempts, each with its own advantages and disadvantages. This paper introduces a new computational block as an alternative to the standard ViT block. The proposed block reduces computational requirements by replacing the standard attention layers with a network-in-network structure, enhancing the static approach of the MLP-Mixer with dynamic learning of an element-wise gating function generated by a token-mixing process. Extensive experimentation shows that the proposed design outperforms the baseline architectures on multiple datasets in the image classification task of the vision domain.
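As an illustration of the block the abstract describes, below is a minimal PyTorch-style sketch in which an MLP-Mixer-style token-mixing MLP produces an element-wise sigmoid gate over the token embeddings. The module names (TokenMixingGate, NiNformerBlock), the sigmoid gate, and all hyperparameters are assumptions for illustration, not the authors' implementation.

# Minimal sketch of a NiNformer-style block as described in the abstract.
# Assumption: the gating signal comes from an MLP-Mixer-style token-mixing
# MLP and is applied as an element-wise sigmoid gate; names and sizes here
# are illustrative, not the authors' code.
import torch
import torch.nn as nn

class TokenMixingGate(nn.Module):
    """Generates an element-wise gate by mixing information across tokens."""
    def __init__(self, num_tokens: int, dim: int, hidden: int = 256):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # Token-mixing MLP operates along the token axis (as in MLP-Mixer).
        self.mix = nn.Sequential(
            nn.Linear(num_tokens, hidden),
            nn.GELU(),
            nn.Linear(hidden, num_tokens),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) -> mix across tokens -> gate values in (0, 1)
        g = self.mix(self.norm(x).transpose(1, 2)).transpose(1, 2)
        return torch.sigmoid(g)

class NiNformerBlock(nn.Module):
    """ViT-style block with attention replaced by a gated inner network."""
    def __init__(self, num_tokens: int, dim: int):
        super().__init__()
        self.gate = TokenMixingGate(num_tokens, dim)
        self.proj = nn.Linear(dim, dim)        # inner "network in network"
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(              # standard channel-mixing MLP
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.gate(x) * self.proj(x)    # dynamic element-wise gating
        return x + self.mlp(self.norm(x))      # residual channel mixing

# Example: a batch of 8 images tokenized into 64 patches of width 192.
block = NiNformerBlock(num_tokens=64, dim=192)
out = block(torch.randn(8, 64, 192))           # -> shape (8, 64, 192)

Because the gate is computed from the whole token sequence, each position's scaling depends on the input, which is the dynamic behaviour the abstract contrasts with the MLP-Mixer's static mixing weights.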
dc.identifier.doi10.1007/s00521-025-11226-1
dc.identifier.endpage13428
dc.identifier.issn14333058
dc.identifier.issn09410643
dc.identifier.issue19
dc.identifier.scopus2-s2.0-105003850350
dc.identifier.startpage13411
dc.identifier.urihttps://doi.org/10.1007/s00521-025-11226-1
dc.identifier.urihttps://hdl.handle.net/20.500.14719/6269
dc.identifier.volume37
dc.language.isoen
dc.publisherSpringer Science and Business Media Deutschland GmbH
dc.relation.oastatusAll Open Access
dc.relation.oastatusHybrid Gold Open Access
dc.relation.sourceNeural Computing and Applications
dc.subject.authorkeywordsComputer Vision
dc.subject.authorkeywordsDeep Learning
dc.subject.authorkeywordsNetwork In Network
dc.subject.authorkeywordsTransformer
dc.subject.indexkeywordsImage enhancement
dc.subject.indexkeywordsImage segmentation
dc.subject.indexkeywordsNetwork function virtualization
dc.subject.indexkeywordsObject detection
dc.subject.indexkeywordsObject recognition
dc.subject.indexkeywordsAttention mechanisms
dc.subject.indexkeywordsDeep learning
dc.subject.indexkeywordsGating functions
dc.subject.indexkeywordsImage generations
dc.subject.indexkeywordsIn networks
dc.subject.indexkeywordsMultiple tasks
dc.subject.indexkeywordsNetwork in network
dc.subject.indexkeywordsObjects detection
dc.subject.indexkeywordsOptimisations
dc.subject.indexkeywordsTransformer
dc.subject.indexkeywordsMixers (machinery)
dc.titleNiNformer: a network in network transformer with token mixing generated gating function
dc.typeArticle
dcterms.referencesVaswani, A., et al.: Attention is all you need. Advances in Neural Information Processing Systems, 2017-December, pp. 5999-6009 (2017)
dcterms.referencesBrown, T.B., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems, 2020-December (2020)
dcterms.referencesImproving Language Understanding by Generative Pre-Training (2018)
dcterms.referencesLLaMA: Open and Efficient Foundation Language Models (2023)
dcterms.referencesThe RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only (2023)
dcterms.referencesMistral 7B (2023)
dcterms.referencesAn Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (2020)
dcterms.referencesTolstikhin, I.O., et al.: MLP-Mixer: an all-MLP architecture for vision. Advances in Neural Information Processing Systems, 34, pp. 24261-24272 (2021)
dcterms.referencesTrans. Mach. Learn. Res. (2022)
dcterms.referencesLiu, Z., et al.: Swin Transformer: hierarchical vision transformer using shifted windows. Proceedings of the IEEE International Conference on Computer Vision, pp. 9992-10002 (2021)
dspace.entity.typePublication
local.indexed.atScopus
person.identifier.scopus-author-id58115008900
person.identifier.scopus-author-id35106687700
