Stitching Gaps: Fusing Situated Perceptual Knowledge with Vision Transformers for High-Level Image Classification

Open Access
Authors
Publication date 2024
Host editors
  • A. Salatino
  • M. Alam
  • F. Ongenae
  • S. Vahdati
  • A.-L. Gentile
  • T. Pellegrini
  • S. Jiang
Book title Knowledge Graphs in the Age of Language Models and Neuro-Symbolic AI
Book subtitle Proceedings of the 20th International Conference on Semantic Systems, 17–19 September 2024, Amsterdam, The Netherlands
ISBN (electronic)
  • 9781643685373
Series Studies on the Semantic Web
Event 20th International Conference on Semantic Systems
Pages (from-to) 68-87
Publisher Amsterdam: IOS Press
Organisations
  • Interfacultary Research - Institute for Logic, Language and Computation (ILLC)
Abstract
The increasing demand for automatic high-level image understanding, including the detection of abstract concepts (ACs) in images, presents a complex challenge both technically and ethically. This demand highlights the need for innovative and more interpretable approaches that reconcile traditional deep vision methods with the situated, nuanced knowledge humans use to interpret images at such high semantic levels. To bridge the gap between the deep vision and situated perceptual paradigms, this study leverages situated perceptual knowledge of cultural images to enhance performance and interpretability in AC image classification. We automatically extract perceptual semantic units from images, which we then model and integrate into the ARTstract Knowledge Graph (AKG). This resource captures situated perceptual semantics gleaned from over 14,000 cultural images labeled with ACs. Additionally, we enhance the AKG with high-level linguistic frames. To facilitate downstream tasks such as AC-based image classification, we compute Knowledge Graph Embeddings (KGEs). We experiment with relative representations [1] and with hybrid approaches that fuse these embeddings with vision transformer embeddings. Finally, for interpretability, we conduct post-hoc qualitative analyses by examining model similarities with training instances. Adopting the relative representation method significantly bolsters KGE-based AC image classification, while our hybrid methods outperform state-of-the-art approaches. The post-hoc interpretability analyses reveal the vision transformer's proficiency in capturing pixel-level visual attributes, in contrast with our method's efficacy in representing more abstract and semantic scene elements. Our results demonstrate the synergy and complementarity between the situated perceptual knowledge of KGEs and the sensory-perceptual understanding of deep visual models for AC image classification.
This work suggests strong potential for neurosymbolic methods of knowledge integration and robust image representation in intricate downstream visual comprehension tasks. All materials and code are available at https://github.com/delfimpandiani/Stitching-Gaps.
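To illustrate the fusion described in the abstract, the sketch below shows the general idea of relative representations [1] combined with a simple concatenation-based hybrid: each image is re-encoded as its cosine similarities to a fixed set of anchor instances, separately in the KGE space and the vision-transformer space, and the two relative views are concatenated. This is a minimal, hypothetical sketch assuming plain NumPy arrays of precomputed embeddings; function names and the concatenation strategy are illustrative, not the authors' exact implementation.

```python
import numpy as np

def relative_representation(embeddings, anchors):
    """Re-encode each row of `embeddings` as its cosine similarities to a
    fixed set of `anchors` (the relative-representation idea of [1]).

    embeddings: (n_samples, dim) array
    anchors:    (n_anchors, dim) array, same dim as embeddings
    returns:    (n_samples, n_anchors) similarity matrix
    """
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    anc = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    return emb @ anc.T

def fuse_relative(kge_emb, vit_emb, kge_anchors, vit_anchors):
    """Hybrid fusion sketch: compute relative representations in the KGE
    space and the vision-transformer space, then concatenate them so a
    downstream classifier sees both views of each image."""
    rel_kge = relative_representation(kge_emb, kge_anchors)
    rel_vit = relative_representation(vit_emb, vit_anchors)
    return np.concatenate([rel_kge, rel_vit], axis=1)
```

Because both views are expressed as similarities to the same anchor images, the concatenated representation is comparable across the two embedding spaces even though their raw dimensionalities differ.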
Document type Conference contribution
Language English
Published at https://doi.org/10.3233/SSW240008
Other links https://github.com/delfimpandiani/Stitching-Gaps