Visual Continual Learning: Multi-modal Foundation Models for Open World Object Detection


  • Jelena Epifanic Comenius University Bratislava
  • Markus Vincze TU Wien


Currently, artificial systems operate autonomously only to a limited extent and in narrow task settings. Cognitive robotics, as an interdisciplinary field, aims to endow robots with human-like cognitive abilities so that they can autonomously operate, explore, and learn from experience and interaction. Most importantly, they should be capable of adapting to dynamic and uncertain changes in the environment. To achieve this, machines should be able to learn continuously from dynamic data streams, distill knowledge, generalize effectively, and recall knowledge when needed. This ability of artificial systems, termed lifelong, incremental, or continual learning, is largely inspired by biological systems and the neuro-cognitive underpinnings of learning [1].

In this research, we focus on continual learning challenges in computer vision, specifically the Open World Object Detection (OWOD) problem [2]. For real-world robotics applications, it is important to equip robots with the capability to learn new, previously unseen object classes incrementally. An OWOD model can detect both known and unknown objects in a scene while being trained exclusively on known object data. Motivated by recent advances in multi-modal foundation model design, we first evaluated two state-of-the-art solutions, Grounding DINO and OWLv2, using the open-world evaluation protocol for 2D detection [2]. After running quantitative and qualitative analyses, we proposed strategies to improve the performance of these models on 2D image-based OWOD.

With the aim of further enhancing Human-Robot Interaction (HRI), we also address the 3D OWOD problem in this research. To improve robot navigation and interaction with its environment, we combine the strengths of both approaches: the abundance of available data for 2D detection and the precision and robustness of 3D detection. For the 3D image-based OWOD task, we extended the formulation of the OWOD problem to the 3D domain. Further, we introduced the Prompting 3D World (Pro3DW) framework, which integrates one of these two multi-modal foundation models with POPE [3], a zero-shot object pose estimation model designed to operate in in-the-wild scenarios. POPE [3] estimates the six degrees-of-freedom (6DoF) object pose between the object prompt and the target object. With our method, we plan to use an off-the-shelf 2D object detector, specifically a fine-tuned version of the pretrained Grounding DINO or OWLv2 model, to generate object prompts and object classifications, while POPE [3] is used to generate 3D bounding boxes.

Primarily interested in Cognitive Robotics and HRI, we posit that the use of large-scale pretrained models, which integrate different modalities such as vision and text, has the potential to improve continual learning in computer vision systems in both the 2D and 3D domains.


[1] L. Wang, X. Zhang, H. Su, and J. Zhu, “A comprehensive survey of continual learning: Theory, method and application,” (accessed May 10, 2024).

[2] K. J. Joseph, S. Khan, F. S. Khan, and V. N. Balasubramanian, “Towards open world object detection,” (accessed May 10, 2024).

[3] Z. Fan et al., “POPE: 6-DoF promptable pose estimation of any object, in any scene, with one reference,” (accessed May 10, 2024).