Jan-Martin O. Steitz and Stefan Roth,
in Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
2024.
Adapters provide an efficient and lightweight mechanism for adapting trained transformer models to a variety of different tasks. However, they have often been found to be outperformed by other adaptation mechanisms, including low-rank adaptation. In this paper, we provide an in-depth study of adapters, their internal structure, as well as various implementation choices. We uncover pitfalls for using …
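As context for readers unfamiliar with the technique, a bottleneck adapter can be sketched as a small down-project/up-project module added residually to a frozen transformer layer's output. This is a minimal illustrative sketch (function and parameter names are ours, not the paper's), written with numpy for self-containment:

```python
import numpy as np

def adapter(x, W_down, b_down, W_up, b_up):
    """Bottleneck adapter sketch: project the d-dimensional hidden
    state down to a small bottleneck, apply a nonlinearity, project
    back up, and add the result to the input (residual connection).
    Only these few parameters are trained; the base model stays frozen."""
    h = np.maximum(0.0, x @ W_down + b_down)  # ReLU bottleneck activation
    return x + h @ W_up + b_up                # residual connection
```

With hidden size d and bottleneck size r, the adapter adds only about 2·d·r parameters per layer, which is why the mechanism is considered lightweight.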
Jonas Pfeiffer, Gregor Geigle, Aishwarya Kamath, Jan-Martin O. Steitz, Stefan Roth, Ivan Vulić, and Iryna Gurevych,
in Findings of the Association for Computational Linguistics (ACL),
2022.
Recent advances in multimodal vision and language modeling have predominantly focused on the English language, mostly due to the lack of multilingual multimodal datasets to steer modeling efforts. In this work, we address this gap and provide xGQA, a new multilingual evaluation benchmark for the visual question answering task. We extend the established English GQA dataset to 7 typologically …
Jan-Martin O. Steitz, Jonas Pfeiffer, Iryna Gurevych, and Stefan Roth,
in Proc. of the 43rd DAGM German Conference on Pattern Recognition (GCPR),
2021, Best Paper Honorable Mention.
Reasoning over multiple modalities, e.g. in Visual Question Answering (VQA), requires an alignment of semantic concepts across domains. Despite the widespread success of end-to-end learning, today’s multimodal pipelines by and large leverage pre-extracted, fixed features from object detectors, typically Faster R-CNN, as representations of the visual world. The obvious downside is that the …
Jan-Martin O. Steitz, Faraz Saeedan, and Stefan Roth,
in Proc. of the 40th German Conference on Pattern Recognition (GCPR),
2018.
Motivated by the detection of prohibited objects in carry-on luggage as a part of avionic security screening, we develop a CNN-based object detection approach for multi-view X-ray image data. Our contributions are two-fold. First, we introduce a novel multi-view pooling layer to perform a 3D aggregation of 2D CNN-features extracted from each view. To that end, our pooling layer exploits the known …
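To illustrate the aggregation idea (not the paper's actual layer, which exploits the known scanner geometry to place per-view 2D features into a common 3D frame), assume each view's CNN features have already been backprojected into an aligned 3D grid; the pooling step then reduces to an element-wise aggregation across views. A minimal sketch, with max pooling as one illustrative choice of reduction:

```python
import numpy as np

def multi_view_pool(view_features):
    """Aggregate per-view feature volumes into one 3D feature volume.

    view_features: list of arrays of identical shape (C, D, H, W), one
    per X-ray view, assumed already resampled into a shared 3D grid
    using the known acquisition geometry. Element-wise max keeps, at
    each 3D location, the strongest response seen from any view.
    """
    return np.maximum.reduce(view_features)
```

Other reductions (mean, learned weighting) fit the same interface; the key property is that the output is view-count invariant.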