A Low-Resolution Image is Worth 1x1 Words: Enabling Fine Image Super-Resolution with Transformers and TaylorShift

Nov 15, 2024 · Sanath Budakegowdanadoddi Nagaraju, Brian Bernhard Moser, Tobias Christian Nauen, Stanislav Frolov, Federico Raue, Andreas Dengel
Abstract
Transformer-based Super-Resolution (SR) models have recently advanced image reconstruction quality, yet challenges remain due to computational complexity and an over-reliance on large patch sizes, which constrain fine-grained detail enhancement. In this work, we propose TaylorIR to address these limitations by utilizing a patch size of 1x1, enabling pixel-level processing in any transformer-based SR model. To address the significant computational demands under the traditional self-attention mechanism, we employ the TaylorShift attention mechanism, a memory-efficient alternative based on Taylor series expansion, achieving full token-to-token interactions with linear complexity. Experimental results demonstrate that our approach achieves new state-of-the-art SR performance while reducing memory consumption by up to 60% compared to traditional self-attention-based transformers.
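To see why a 1x1 patch size is expensive under standard self-attention, it helps to count tokens. The sketch below is only illustrative arithmetic: the function name `attention_cost` and the 64x64 input size are assumptions made for this example, and it assumes global attention over all patches with no windowing.

```python
def attention_cost(height, width, patch_size):
    """Return (num_tokens, pairwise_scores) for one full self-attention layer."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    # One token per non-overlapping patch.
    num_tokens = (height // patch_size) * (width // patch_size)
    # Standard self-attention scores every token against every other token.
    return num_tokens, num_tokens ** 2


if __name__ == "__main__":
    for p in (8, 4, 1):
        tokens, scores = attention_cost(64, 64, p)
        print(f"patch {p}x{p}: {tokens} tokens, {scores} pairwise scores")
```

Going from 8x8 patches to 1x1 patches on this hypothetical input raises the token count from 64 to 4096 and the number of pairwise scores from 4096 to 16777216, which is the quadratic growth that a linear-complexity attention mechanism is meant to avoid.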
Publication: arXiv preprint (arXiv:2411.10231)

This work builds on the TaylorShift attention mechanism.
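The general idea of attention based on a Taylor series expansion can be illustrated with a small NumPy sketch. It assumes a second-order approximation exp(x) ≈ 1 + x + x²/2 in place of the softmax exponential; the function names are hypothetical, and the actual TaylorShift formulation may differ in scaling, normalization, and implementation details. The point is only that the same output can be computed without ever forming the N x N attention matrix.

```python
import numpy as np


def taylor_attention_direct(Q, K, V):
    """Quadratic-cost reference: builds the full N x N weight matrix."""
    S = Q @ K.T                       # (N, N) pairwise scores
    W = 1.0 + S + 0.5 * S**2          # 2nd-order Taylor approximation of exp(S)
    return (W @ V) / W.sum(axis=1, keepdims=True)


def taylor_attention_linear(Q, K, V):
    """Same result, with cost linear in the number of tokens N."""
    N, d = Q.shape
    # Append a column of ones to V so the row normaliser is computed alongside the output.
    V1 = np.concatenate([V, np.ones((N, 1))], axis=1)       # (N, dv + 1)
    # Feature maps whose inner product equals (q . k)^2.
    Q2 = np.einsum("ni,nj->nij", Q, Q).reshape(N, d * d)
    K2 = np.einsum("ni,nj->nij", K, K).reshape(N, d * d)
    out = V1.sum(axis=0, keepdims=True)      # zeroth-order term, broadcast over rows
    out = out + Q @ (K.T @ V1)               # first-order term
    out = out + 0.5 * (Q2 @ (K2.T @ V1))     # second-order term
    return out[:, :-1] / out[:, -1:]         # divide by the normaliser column


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q, K, V = (rng.normal(size=(16, 4)) for _ in range(3))
    print(np.allclose(taylor_attention_direct(Q, K, V),
                      taylor_attention_linear(Q, K, V)))
```

The linear variant trades the N x N matrix for intermediate tensors of size d² x (dv + 1), so it pays off when the sequence length N is large relative to the head dimension d, as is the case when every pixel is a token.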

For more information, see the paper PDF.

Associated Projects: SEmbedAI, SustAInML, Albatross

Citation

If you use this work, please cite our paper:

@misc{nagaraju2024lowresolutionimageworth1x1,
  title         = {A Low-Resolution Image is Worth 1x1 Words: Enabling Fine Image Super-Resolution with Transformers and TaylorShift},
  author        = {Sanath Budakegowdanadoddi Nagaraju and Brian Bernhard Moser and Tobias Christian Nauen and Stanislav Frolov and Federico Raue and Andreas Dengel},
  year          = {2024},
  eprint        = {2411.10231},
  archiveprefix = {arXiv},
  primaryclass  = {cs.CV}
}
Authors
Tobias Christian Nauen
PhD Student
I’m an artificial intelligence researcher at DFKI and RPTU Kaiserslautern-Landau. My research interests include efficient deep learning, transformer models, multimodal learning, and computer vision. My PhD project focuses on developing efficient transformer models for vision, language, and multimodal tasks.