Acquiring Complex Concepts with Comparative Learning
Current vision and language models (VLMs) can acquire simple and complex notions in the form of linguistic expressions grounded in visual inputs. Traditional training setups for these models rely on noisy, internet-scraped examples, which can lead to inefficient and coarse learning. Drawing on literature from cognitive science and developmental psychology, learning frameworks employed by humans could serve as new training paradigms for VLMs: for instance, learning gradually more complex notions (Progressive Alignment) by comparing examples in a controlled environment (Comparative Learning) could foster compositionality as an ability. Current approaches either train monolithic networks on sparse concepts of varying complexity, or specialize task-specific modules with narrow knowledge. In this study we advance two questions: can we train a single, multi-task network to learn primitive concepts? Moreover, can we leverage Comparative Learning to acquire more complex notions, such as logical compositions of these primitives?