Visual reasoning is another basic task in CLIP-as-service. There are four basic visual reasoning skills: object recognition, object counting, color recognition, and spatial relation understanding. However magical it sounds and looks, the idea is fairly simple: pass the reasoning texts as prompts, then call the rank interface of clip_server. The server ranks the prompts and returns them sorted, each with a score.
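To make the prompt-then-rank idea concrete, here is a minimal sketch for the object counting skill. It assumes the `clip_client` and `docarray` packages are installed and that a clip_server instance is reachable. The server address, the helper names `counting_prompts` and `rank_prompts`, and the prompt wording are illustrative assumptions, not part of the demo itself.

```python
from typing import List, Tuple


def counting_prompts(singular: str, plural: str, max_count: int = 4) -> List[str]:
    """Build one candidate prompt per possible count (hypothetical helper)."""
    words = ["one", "two", "three", "four", "five", "six"][:max_count]
    prompts = []
    for i, word in enumerate(words):
        noun = singular if i == 0 else plural
        prompts.append(f"a photo of {word} {noun}")
    return prompts


def rank_prompts(
    image_uri: str,
    prompts: List[str],
    server: str = "grpc://0.0.0.0:51000",  # assumed address; point this at your own server
) -> List[Tuple[str, float]]:
    """Ask clip_server to rank text prompts against a single image."""
    # Imported lazily so the prompt helper above works without these packages.
    from clip_client import Client
    from docarray import Document

    client = Client(server=server)
    # The image is the query document; each prompt is attached as a match to be ranked.
    query = Document(uri=image_uri, matches=[Document(text=p) for p in prompts])
    result = client.rank([query])
    # Matches come back sorted; each carries a score under the "clip_score" key.
    return [(m.text, m.scores["clip_score"].value) for m in result[0].matches]
```

Calling `rank_prompts("path/or/url/to/image.jpg", counting_prompts("dog", "dogs"))` against a running server should return the prompts ordered from most to least likely, so the top entry is the model's guess at the count. The same pattern covers the other three skills by swapping in prompts about object names, colors, or spatial relations.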
In this demo, you can choose a picture or paste your own image URL into the text box to get a rough feel for how visual reasoning works. Feel free to add or remove prompts and observe how this affects the ranking results.
The model is ViT-L/14-336px, running on one GPU.