SimPLR: A Simple and Plain Transformer for Scaling-Efficient Object Detection and Segmentation

Open Access
Publication date 02-2025
Journal Transactions on Machine Learning Research
Article number 3114
Volume | Issue number 2025
Number of pages 17
Organisations
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract
The ability to detect objects in images at varying scales has played a pivotal role in the design of modern object detectors. Despite considerable progress in removing hand-crafted components and simplifying the architecture with transformers, multi-scale feature maps and pyramid designs remain a key factor in their empirical success. In this paper, we show that shifting the multi-scale inductive bias into the attention mechanism can work well, resulting in a plain detector, ‘SimPLR’, whose backbone and detection head are both non-hierarchical and operate on single-scale features. We find through our experiments that SimPLR with scale-aware attention is a plain and simple architecture, yet competitive with multi-scale vision transformer alternatives. Compared to the multi-scale and single-scale state-of-the-art, our model scales better with higher-capacity (self-supervised) models and more pre-training data, allowing us to report consistently better accuracy and faster runtime for object detection, instance segmentation, as well as panoptic segmentation.
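The central idea above, moving the multi-scale inductive bias out of the feature pyramid and into the attention mechanism, can be illustrated with a minimal sketch. The code below is an assumption-laden toy, not the paper's implementation: it realizes "scale-aware attention" by letting each attention head attend over keys and values pooled from the same single-scale feature map at a head-specific stride, so the architecture stays plain while scale sensitivity lives inside attention. All function names and the stride assignment are hypothetical.

```python
import numpy as np

def pool2d(x, stride):
    # Average-pool an (H, W, C) feature map by `stride` (toy stand-in for
    # producing a coarser view of the same single-scale features).
    H, W, C = x.shape
    H2, W2 = H // stride, W // stride
    return x[:H2 * stride, :W2 * stride].reshape(H2, stride, W2, stride, C).mean(axis=(1, 3))

def scale_aware_attention(feat, num_heads=4, strides=(1, 2, 4, 8)):
    """Toy scale-aware attention: head h attends over the map pooled at strides[h].

    `feat` is a single-scale (H, W, C) map; no feature pyramid is built.
    """
    H, W, C = feat.shape
    assert num_heads == len(strides) and C % num_heads == 0
    d = C // num_heads
    q = feat.reshape(H * W, num_heads, d)          # queries, split across heads
    out = np.zeros_like(q)
    for h, s in enumerate(strides):
        # Keys/values for this head come from the coarser (pooled) view,
        # restricted to this head's channel slice.
        kv = pool2d(feat, s).reshape(-1, C)[:, h * d:(h + 1) * d]
        logits = q[:, h] @ kv.T / np.sqrt(d)
        w = np.exp(logits - logits.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)              # softmax over pooled positions
        out[:, h] = w @ kv
    return out.reshape(H, W, C)
```

Because every head reads from the same single-scale map, the backbone and head remain non-hierarchical; only the attention pattern differs per scale, which is the design trade-off the abstract describes.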
Document type Article
Language English
Published at https://doi.org/10.48550/arXiv.2310.05920
Published at https://openreview.net/forum?id=6LO1y8ZE0F
Other links https://github.com/kienduynguyen/SimPLR https://jmlr.org/tmlr/papers/index.html