Benchmark#

To understand the zero-shot performance of CLIP and its limitations, we conducted a benchmark across a variety of computer vision datasets (the dataset details are in the appendix). Thanks to the open-source CLIP Benchmark toolkit, these results are easy to reproduce.

We hope this benchmark helps you better understand the performance of CLIP models and choose the best model for your application.

Select the right model#

In general, you can select the best model for your application by weighing several factors: disk usage, peak RAM and VRAM usage, queries per second (QPS), and, most importantly, zero-shot performance.

Based on our experiments, we recommend the ViT models over the RN models for most general applications. More specifically, the ViT-H-14::laion2b_s32b_b79k and ViT-g-14::laion2b_s12b_b42k models should be considered first, since they achieve the best or close to the best performance in most cases. However, if encoding speed is a concern, the other ViT models are worth considering: they offer higher QPS with decent performance. Ultimately, choose the model that best fits your requirements. For example, if you are labeling images for diabetic retinopathy, the ViT-B-32::laion2b_s34b_b79k model is probably the right choice, since it has the best top-1 accuracy of 0.734 on zero-shot classification of the Retinopathy dataset. If you are dealing with histopathologic images, the RN50::openai model is probably the right choice, since it has the best top-1 accuracy of 0.636 on zero-shot classification of the Patch Camelyon dataset.
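
If you need to automate this kind of trade-off, the choice can be expressed as a simple filter over the benchmark numbers. The sketch below is purely illustrative: the `Candidate` data class, the hand-copied subset of figures from the tables on this page, and the resource thresholds are our own constructions, not part of any CLIP API.

```python
# A minimal sketch of picking a model under resource constraints.
# The numbers are copied from the benchmark tables below; the data
# structure and thresholds are purely illustrative.
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    disk_mb: int
    peak_vram_gb: float
    image_qps: int
    retrieval_avg: float  # average top-5 recall on COCO Caption

candidates = [
    Candidate("ViT-B-32::laion2b_s34b_b79k", 577, 1.40, 285, 0.726),
    Candidate("ViT-L-14::laion2b_s32b_b82k", 1631, 2.03, 147, 0.775),
    Candidate("ViT-H-14::laion2b_s32b_b79k", 3762, 3.26, 91, 0.797),
    Candidate("ViT-g-14::laion2b_s12b_b42k", 5214, 4.00, 69, 0.788),
]

def pick(max_vram_gb: float, min_image_qps: int) -> Candidate:
    """Return the best-performing model that fits the given budget."""
    feasible = [c for c in candidates
                if c.peak_vram_gb <= max_vram_gb and c.image_qps >= min_image_qps]
    if not feasible:
        raise ValueError("no model satisfies the constraints")
    return max(feasible, key=lambda c: c.retrieval_avg)

# e.g. a 4 GB VRAM budget with at least 100 images/s favours ViT-L-14
print(pick(max_vram_gb=4.0, min_image_qps=100).name)
```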

The following sections show the performance of different models in detail on different datasets and tasks.

Size and efficiency#

We first present each model's size and efficiency in terms of encoding speed (QPS) and memory footprint (peak RAM and VRAM usage). All results were obtained on a single Nvidia TITAN RTX GPU (24 GB VRAM) with default server settings.

| Model | Disk Usage (MB) | Peak RAM Usage (GB) | Peak VRAM Usage (GB) | Text QPS | Image QPS |
| --- | --- | --- | --- | --- | --- |
| RN50::openai | 244 | 2.99 | 1.36 | 1019 | 269 |
| RN50::yfcc15m | 389 | 2.86 | 1.36 | 1083 | 262 |
| RN50::cc12m | 389 | 2.84 | 1.36 | 1064 | 264 |
| RN101::openai | 278 | 3.05 | 1.40 | 1047 | 222 |
| RN101::yfcc15m | 457 | 2.88 | 1.40 | 1107 | 223 |
| RN50x4::openai | 402 | 3.23 | 1.63 | 1047 | 218 |
| RN50x16::openai | 631 | 3.63 | 2.02 | 1038 | 121 |
| RN50x64::openai | 1291 | 4.08 | 2.98 | 985 | 59 |
| ViT-B-32::openai | 338 | 3.20 | 1.40 | 1064 | 286 |
| ViT-B-32::laion2b_e16 | 577 | 2.93 | 1.40 | 1120 | 292 |
| ViT-B-32::laion400m_e31 | 577 | 2.93 | 1.40 | 1080 | 287 |
| ViT-B-32::laion400m_e32 | 577 | 2.94 | 1.40 | 1092 | 289 |
| ViT-B-32::laion2b-s34b-b79k | 577 | 2.94 | 1.40 | 1102 | 285 |
| ViT-B-16::openai | 335 | 3.20 | 1.44 | 1064 | 260 |
| ViT-B-16::laion400m_e31 | 571 | 2.93 | 1.44 | 1099 | 262 |
| ViT-B-16::laion400m_e32 | 571 | 2.94 | 1.44 | 1082 | 268 |
| ViT-B-16-plus-240::laion400m_e31 | 795 | 3.03 | 1.59 | 1059 | 235 |
| ViT-B-16-plus-240::laion400m_e32 | 795 | 3.03 | 1.59 | 1043 | 239 |
| ViT-L-14::openai | 890 | 3.66 | 2.04 | 1040 | 140 |
| ViT-L-14::laion400m_e31 | 1631 | 3.43 | 2.03 | 1058 | 147 |
| ViT-L-14::laion400m_e32 | 1631 | 3.42 | 2.03 | 1061 | 146 |
| ViT-L-14::laion2b-s32b-b82k | 1631 | 3.43 | 2.03 | 1069 | 147 |
| ViT-L-14-336::openai | 891 | 3.74 | 2.23 | 1070 | 76 |
| ViT-H-14::laion2b-s32b-b79k | 3762 | 4.45 | 3.26 | 642 | 91 |
| ViT-g-14::laion2b-s12b-b42k | 5214 | 5.16 | 4.00 | 639 | 69 |
| M-CLIP/LABSE-Vit-L-14 | 3609 | 4.30 | 4.70 | 646 | 284 |
| M-CLIP/XLM-Roberta-Large-Vit-B-32 | 4284 | 5.37 | 1.68 | 656 | 139 |
| M-CLIP/XLM-Roberta-Large-Vit-B-16Plus | 4293 | 4.30 | 4.13 | 662 | 236 |
| M-CLIP/XLM-Roberta-Large-Vit-L-14 | 4293 | 4.30 | 4.97 | 1027 | 139 |
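
For context, the sketch below shows one rough way to measure image QPS and peak VRAM for a CLIP-style PyTorch encoder. It assumes a loaded model exposing an `encode_image` method and a batch of preprocessed images already on the GPU; it is not the exact script used to produce the table above, so absolute numbers may differ.

```python
import time
import torch

def measure(model, images, n_iters: int = 100):
    """Roughly estimate image QPS and peak VRAM (GB) for a CLIP-style encoder.

    Assumes `model` has an `encode_image` method and that `model` and
    `images` live on the same CUDA device. Illustrative only.
    """
    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        model.encode_image(images)          # warm-up run
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_iters):
            model.encode_image(images)
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
    qps = n_iters * images.shape[0] / elapsed
    peak_vram_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    return qps, peak_vram_gb
```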

Zero-shot performance#

In this section, we report the zero-shot performance of the models on classification and retrieval tasks across different datasets. In the following tables, the best result for each dataset is highlighted in bold (higher is better).

Zero-shot retrieval#

In the zero-shot retrieval benchmark, each model is evaluated on the following datasets: COCO Caption, Flickr 8k and Flickr 30k. In each of these datasets, every image is paired with five human-written descriptions. The results are reported as top-5 text-to-image retrieval recall (T2I), top-5 image-to-text retrieval recall (I2T), and their average. More specifically, the top-5 text-to-image recall of a text query is either 1 or 0: it is 1 if the input text matches a description of one of the top-5 retrieved images. The top-5 image-to-text recall of an image query reflects how many of the top-5 retrieved texts match that image's descriptions.
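
For reference, the following sketch shows how such recall numbers can be computed from a text-image similarity matrix. The `text_to_image` mapping is assumed to come from the dataset annotations, and the exact evaluation details of the CLIP Benchmark toolkit may differ slightly.

```python
import numpy as np

def text_to_image_recall_at_k(similarity: np.ndarray,
                              text_to_image: np.ndarray,
                              k: int = 5) -> float:
    """Top-k text-to-image recall.

    `similarity[i, j]` is the score between text query i and image j;
    `text_to_image[i]` is the index of the image described by text i
    (each image typically has five such texts). A query counts as a hit
    if its ground-truth image appears among the top-k retrieved images.
    """
    # indices of the k highest-scoring images for every text query
    topk = np.argsort(-similarity, axis=1)[:, :k]
    hits = (topk == text_to_image[:, None]).any(axis=1)
    return float(hits.mean())

# Image-to-text recall is the symmetric case: transpose the similarity
# matrix and check the retrieved texts against each image's captions.
```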

| Model | COCO Caption T2I | COCO Caption I2T | COCO Caption Avg | Flickr 8k T2I | Flickr 8k I2T | Flickr 8k Avg | Flickr 30k T2I | Flickr 30k I2T | Flickr 30k Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RN50::openai | 0.529 | 0.728 | 0.629 | 0.504 | 0.690 | 0.597 | 0.392 | 0.621 | 0.506 |
| RN50::yfcc15m | 0.361 | 0.534 | 0.447 | 0.238 | 0.394 | 0.316 | 0.146 | 0.278 | 0.212 |
| RN50::cc12m | 0.446 | 0.607 | 0.527 | 0.302 | 0.435 | 0.369 | 0.204 | 0.316 | 0.260 |
| RN101::openai | 0.555 | 0.745 | 0.650 | 0.523 | 0.694 | 0.608 | 0.415 | 0.629 | 0.522 |
| RN101::yfcc15m | 0.376 | 0.549 | 0.463 | 0.251 | 0.417 | 0.334 | 0.156 | 0.296 | 0.226 |
| RN50x4::openai | 0.581 | 0.767 | 0.674 | 0.558 | 0.729 | 0.643 | 0.451 | 0.671 | 0.561 |
| RN50x16::openai | 0.600 | 0.787 | 0.693 | 0.597 | 0.768 | 0.682 | 0.496 | 0.713 | 0.604 |
| RN50x64::openai | 0.599 | 0.803 | 0.701 | 0.629 | 0.790 | 0.709 | 0.534 | 0.756 | 0.645 |
| ViT-B-32::openai | 0.560 | 0.749 | 0.654 | 0.532 | 0.699 | 0.616 | 0.413 | 0.629 | 0.521 |
| ViT-B-32::laion2b_e16 | 0.647 | 0.795 | 0.721 | 0.622 | 0.760 | 0.691 | 0.507 | 0.687 | 0.597 |
| ViT-B-32::laion400m_e31 | 0.600 | 0.763 | 0.682 | 0.562 | 0.736 | 0.649 | 0.438 | 0.633 | 0.536 |
| ViT-B-32::laion400m_e32 | 0.600 | 0.765 | 0.682 | 0.562 | 0.736 | 0.649 | 0.437 | 0.634 | 0.536 |
| ViT-B-32::laion2b_s34b_b79k | 0.654 | 0.798 | 0.726 | 0.629 | 0.778 | 0.703 | 0.513 | 0.694 | 0.603 |
| ViT-B-16::openai | 0.584 | 0.767 | 0.676 | 0.564 | 0.727 | 0.646 | 0.452 | 0.671 | 0.561 |
| ViT-B-16::laion400m_e31 | 0.637 | 0.796 | 0.717 | 0.620 | 0.765 | 0.692 | 0.506 | 0.697 | 0.602 |
| ViT-B-16::laion400m_e32 | 0.636 | 0.796 | 0.716 | 0.620 | 0.767 | 0.694 | 0.508 | 0.697 | 0.603 |
| ViT-B-16-plus-240::laion400m_e31 | 0.660 | 0.809 | 0.735 | 0.642 | 0.788 | 0.715 | 0.533 | 0.725 | 0.629 |
| ViT-B-16-plus-240::laion400m_e32 | 0.662 | 0.811 | 0.736 | 0.644 | 0.791 | 0.718 | 0.535 | 0.727 | 0.631 |
| ViT-L-14::openai | 0.610 | 0.793 | 0.702 | 0.599 | 0.767 | 0.683 | 0.494 | 0.717 | 0.605 |
| ViT-L-14::laion400m_e31 | 0.680 | 0.821 | 0.750 | 0.675 | 0.806 | 0.741 | 0.570 | 0.751 | 0.661 |
| ViT-L-14::laion400m_e32 | 0.680 | 0.821 | 0.751 | 0.675 | 0.806 | 0.740 | 0.570 | 0.751 | 0.661 |
| ViT-L-14::laion2b_s32b_b82k | 0.711 | 0.840 | 0.775 | 0.712 | 0.824 | 0.768 | 0.620 | 0.789 | 0.704 |
| ViT-L-14-336::openai | 0.616 | 0.812 | 0.714 | 0.629 | 0.779 | 0.704 | 0.533 | 0.741 | 0.637 |
| ViT-H-14::laion2b_s32b_b79k | **0.734** | **0.861** | **0.797** | **0.746** | **0.856** | **0.801** | **0.657** | **0.823** | **0.740** |
| ViT-g-14::laion2b_s12b_b42k | 0.724 | 0.853 | 0.788 | 0.730 | 0.846 | 0.788 | 0.639 | 0.806 | 0.722 |

From the table, we observe that the ViT models generally outperform the RN models. More specifically, the ViT-H-14::laion2b_s32b_b79k and ViT-g-14::laion2b_s12b_b42k models achieve the best and second-best results on all zero-shot retrieval tasks. Among ViT models sharing the same base architecture, those pre-trained on larger datasets perform better (e.g., ViT-B-32::openai vs ViT-B-32::laion400m_e31 vs ViT-B-32::laion2b_s34b_b79k).

Zero-shot classification#

In the zero-shot classification benchmark, each model is evaluated on the following datasets: ImageNetV2, VOC2007 and 19 VTAB datasets. The results are shown in the table below. For each dataset, we report the top-1 accuracy, i.e., the fraction of images whose top-1 predicted class matches the true class. In the table, the columns from Caltech101 to SVHN belong to the VTAB natural group, EuroSAT to Retinopathy to the VTAB specialized group, and Clevr/count to KITTI/distance to the VTAB structured group.
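
As a rough illustration of how such numbers are produced, the sketch below builds zero-shot class embeddings from the class names with a single prompt template and measures top-1 accuracy. It assumes the open_clip package is installed and backs the benchmarked checkpoints; the actual benchmark uses per-dataset prompt ensembles, so this is a simplification rather than the exact evaluation script.

```python
import open_clip
import torch

# Load one of the benchmarked checkpoints (assumes open_clip is installed).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def zero_shot_top1(images: torch.Tensor, labels: torch.Tensor,
                   class_names: list[str]) -> float:
    """Top-1 accuracy for a batch of preprocessed images.

    Uses a single prompt template for simplicity; the real benchmark
    relies on per-dataset prompt ensembles.
    """
    with torch.no_grad():
        text = tokenizer([f"a photo of a {c}" for c in class_names])
        text_feat = model.encode_text(text)
        image_feat = model.encode_image(images)
        # cosine similarity: L2-normalise, then take the inner product
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
        pred = (image_feat @ text_feat.T).argmax(dim=-1)
    return (pred == labels).float().mean().item()
```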

| Model | ImageNetV2 | VOC2007 | Caltech101 | CIFAR-100 | DTD | Flowers102 | Pets | Sun397 | SVHN | EuroSAT | Resisc45 | Patch Camelyon | Retinopathy | Clevr/count | Clevr/distance | dSprites/location | dSprites/orientation | SmallNORB/azimuth | SmallNORB/elevation | DMLab | KITTI/distance |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RN50::openai | 0.529 | 0.650 | 0.772 | 0.403 | 0.415 | 0.660 | 0.857 | 0.894 | 0.303 | 0.408 | 0.453 | **0.636** | 0.171 | 0.217 | 0.148 | 0.034 | 0.014 | 0.056 | 0.110 | 0.145 | 0.170 |
| RN50::yfcc15m | 0.214 | 0.215 | 0.402 | 0.116 | 0.122 | 0.167 | 0.174 | 0.127 | 0.157 | 0.172 | 0.123 | 0.533 | 0.358 | 0.151 | 0.158 | 0.032 | 0.024 | 0.053 | 0.120 | 0.160 | **0.336** |
| RN50::cc12m | 0.224 | 0.438 | 0.582 | 0.178 | 0.135 | 0.095 | 0.331 | 0.123 | 0.102 | 0.148 | 0.117 | 0.535 | 0.293 | 0.184 | 0.222 | 0.031 | 0.025 | 0.047 | 0.096 | 0.161 | 0.155 |
| RN101::openai | 0.561 | 0.651 | 0.780 | 0.476 | 0.432 | 0.652 | 0.869 | 0.887 | 0.226 | 0.314 | 0.547 | 0.583 | 0.280 | 0.242 | 0.130 | 0.031 | 0.021 | 0.054 | 0.111 | 0.139 | 0.263 |
| RN101::yfcc15m | 0.221 | 0.243 | 0.469 | 0.125 | 0.117 | 0.210 | 0.177 | 0.128 | 0.137 | 0.151 | 0.099 | 0.479 | 0.584 | 0.109 | 0.159 | 0.031 | 0.019 | 0.055 | 0.097 | 0.153 | 0.252 |
| RN50x4::openai | 0.594 | 0.682 | 0.781 | 0.451 | 0.486 | 0.698 | 0.887 | 0.908 | 0.367 | 0.335 | 0.532 | 0.569 | 0.318 | 0.205 | 0.082 | 0.031 | 0.026 | 0.056 | 0.108 | 0.162 | 0.233 |
| RN50x16::openai | 0.643 | 0.680 | 0.810 | 0.522 | 0.524 | 0.724 | 0.898 | 0.917 | 0.409 | 0.433 | 0.589 | 0.625 | 0.715 | 0.195 | 0.213 | 0.030 | 0.026 | 0.050 | 0.116 | 0.146 | 0.229 |
| RN50x64::openai | 0.670 | 0.740 | 0.834 | 0.598 | 0.531 | 0.788 | 0.936 | 0.931 | 0.481 | 0.577 | 0.628 | 0.539 | 0.073 | 0.227 | 0.200 | 0.034 | 0.025 | 0.056 | 0.125 | 0.158 | 0.311 |
| ViT-B-32::openai | 0.559 | 0.764 | 0.815 | 0.643 | 0.443 | 0.664 | 0.873 | 0.913 | 0.135 | 0.504 | 0.537 | 0.623 | 0.447 | 0.232 | 0.164 | 0.037 | 0.024 | 0.061 | **0.127** | 0.193 | 0.274 |
| ViT-B-32::laion2b_e16 | 0.573 | 0.788 | 0.831 | 0.754 | 0.539 | 0.691 | 0.893 | 0.933 | 0.388 | 0.503 | 0.619 | 0.506 | 0.195 | 0.192 | 0.167 | 0.031 | 0.024 | 0.052 | 0.110 | 0.189 | 0.176 |
| ViT-B-32::laion400m_e31 | 0.523 | 0.731 | 0.818 | 0.678 | 0.521 | 0.659 | 0.856 | 0.918 | 0.220 | 0.470 | 0.510 | 0.549 | 0.259 | 0.155 | 0.161 | 0.033 | 0.021 | 0.053 | 0.117 | 0.173 | 0.122 |
| ViT-B-32::laion400m_e32 | 0.523 | 0.733 | 0.817 | 0.677 | 0.523 | 0.658 | 0.854 | 0.917 | 0.223 | 0.476 | 0.510 | 0.548 | 0.240 | 0.153 | 0.161 | 0.033 | 0.021 | 0.054 | 0.117 | 0.173 | 0.118 |
| ViT-B-32::laion2b_s34b_b79k | 0.581 | 0.791 | 0.839 | 0.755 | 0.557 | 0.716 | 0.909 | 0.937 | 0.410 | 0.482 | 0.610 | 0.598 | **0.734** | 0.153 | 0.189 | 0.029 | **0.034** | **0.062** | 0.113 | 0.159 | 0.262 |
| ViT-B-16::openai | 0.619 | 0.783 | 0.819 | 0.669 | 0.449 | 0.712 | 0.890 | 0.924 | 0.313 | 0.559 | 0.582 | 0.507 | 0.036 | 0.209 | 0.158 | 0.030 | 0.023 | 0.053 | 0.122 | 0.155 | 0.263 |
| ViT-B-16::laion400m_e31 | 0.594 | 0.767 | 0.838 | 0.712 | 0.513 | 0.694 | 0.892 | 0.939 | 0.380 | 0.503 | 0.585 | 0.593 | 0.062 | 0.289 | **0.245** | 0.031 | 0.030 | 0.059 | 0.100 | 0.152 | 0.200 |
| ViT-B-16::laion400m_e32 | 0.597 | 0.768 | 0.837 | 0.712 | 0.513 | 0.692 | 0.892 | 0.939 | 0.385 | 0.501 | 0.585 | 0.598 | 0.077 | 0.287 | **0.245** | 0.032 | 0.029 | 0.060 | 0.099 | 0.151 | 0.183 |
| ViT-B-16-plus-240::laion400m_e31 | 0.614 | 0.764 | 0.832 | 0.733 | 0.555 | 0.706 | 0.904 | 0.940 | 0.355 | 0.569 | 0.615 | 0.551 | 0.093 | 0.240 | 0.159 | 0.041 | 0.026 | 0.056 | 0.111 | 0.149 | 0.280 |
| ViT-B-16-plus-240::laion400m_e32 | 0.615 | 0.764 | 0.833 | 0.738 | 0.555 | 0.711 | 0.902 | 0.940 | 0.362 | 0.581 | 0.613 | 0.551 | 0.095 | 0.238 | 0.160 | **0.043** | 0.027 | 0.054 | 0.110 | 0.148 | 0.281 |
| ViT-L-14::openai | 0.698 | 0.783 | 0.835 | 0.758 | 0.554 | 0.792 | 0.932 | 0.937 | 0.571 | 0.626 | 0.633 | 0.520 | 0.733 | 0.194 | 0.161 | 0.032 | 0.023 | 0.045 | 0.115 | 0.163 | 0.218 |
| ViT-L-14::laion400m_e31 | 0.654 | 0.758 | 0.839 | 0.774 | 0.598 | 0.757 | 0.917 | 0.950 | 0.378 | 0.632 | 0.671 | 0.487 | 0.058 | 0.242 | 0.149 | 0.030 | 0.026 | 0.053 | 0.109 | 0.186 | 0.200 |
| ViT-L-14::laion400m_e32 | 0.654 | 0.756 | 0.839 | 0.774 | 0.605 | 0.756 | 0.919 | 0.950 | 0.380 | 0.622 | 0.675 | 0.493 | 0.061 | 0.243 | 0.149 | 0.030 | 0.026 | 0.053 | 0.110 | 0.186 | 0.203 |
| ViT-L-14::laion2b_s32b_b82k | 0.677 | 0.805 | **0.851** | 0.833 | 0.629 | 0.758 | 0.932 | 0.958 | 0.459 | 0.646 | 0.668 | 0.563 | 0.116 | 0.312 | 0.161 | 0.032 | 0.020 | 0.056 | 0.108 | **0.224** | 0.229 |
| ViT-L-14-336::openai | **0.709** | 0.781 | 0.837 | 0.744 | 0.556 | 0.783 | 0.937 | 0.940 | 0.560 | 0.615 | 0.638 | 0.608 | 0.733 | 0.200 | 0.158 | 0.032 | 0.024 | 0.046 | 0.113 | 0.158 | 0.262 |
| ViT-H-14::laion2b_s32b_b79k | **0.709** | 0.777 | 0.850 | **0.847** | 0.678 | **0.801** | **0.945** | 0.961 | 0.563 | **0.726** | 0.699 | 0.542 | 0.297 | 0.268 | 0.169 | 0.032 | 0.027 | 0.054 | 0.111 | 0.140 | 0.110 |
| ViT-g-14::laion2b_s12b_b42k | 0.696 | **0.811** | **0.851** | 0.839 | **0.682** | 0.776 | 0.943 | **0.962** | **0.603** | 0.648 | **0.718** | 0.560 | 0.580 | **0.332** | 0.175 | 0.036 | 0.031 | 0.060 | 0.115 | 0.190 | 0.138 |

From the table, we observe that the ViT models still outperform the RN models on most tasks, with two exceptions: the Patch Camelyon dataset, where RN50::openai has the best top-1 accuracy of 0.636, and the KITTI/distance dataset, where RN50::yfcc15m has the best result of 0.336. Similar to the retrieval results, the ViT-H-14::laion2b_s32b_b79k and ViT-g-14::laion2b_s12b_b42k models achieve the best or close to the best results on 12 of the 21 zero-shot classification tasks. All models tend to perform well on ImageNetV2, VOC2007, the VTAB natural datasets and the VTAB specialized datasets (except Retinopathy), whereas they perform poorly on the VTAB structured datasets. We do not observe any significant difference between ViT models sharing the same base architecture.

Appendix: Dataset descriptions#

  • COCO Caption [1]: The dataset contains over one and a half million captions describing over 330,000 images. For the training and validation images, five independent human-generated captions are provided.

  • Flickr 8k [2]: The dataset consists of 8,000 images that are each paired with five different captions which provide clear descriptions of the salient entities and events. The images were chosen from six different Flickr groups, and tend not to contain any well-known people or locations, but were manually selected to depict a variety of scenes and situations.

  • Flickr 30k [3]: The dataset is an extension of the Flickr 8k Dataset. It consists of 158,915 crowd-sourced captions describing 31,783 images.

  • ImageNetV2 [4]: ImageNetV2 contains three test sets with 10,000 new images each. Importantly, these test sets were sampled after a decade of progress on the original ImageNet dataset. This makes the new test data independent of existing models and guarantees that the accuracy scores are not affected by adaptive overfitting.

  • VOC2007 [5]: The training data provided consists of a set of images; each image has an annotation file giving a bounding box and object class label for each object in one of the twenty classes present in the image. Note that multiple objects from multiple classes may be present in the same image.

  • VTAB natural group [6]: The natural group represents classical vision problems. These tasks contain natural images captured using standard cameras. The classes may represent generic, fine-grained, or abstract objects.

    • Caltech101: The task consists in classifying pictures of objects (101 classes plus a background clutter class), including animals, airplanes, chairs, or scissors. The image size varies, but it typically ranges from 200-300 pixels per edge.

    • CIFAR-100: The task consists in classifying natural images (100 classes, with 500 training images each). Some examples include apples, bottles, dinosaurs, and bicycles. The image size is 32x32.

    • DTD: The task consists in classifying images of textural patterns (47 classes, with 120 training images each). Some of the textures are banded, bubbly, meshed, lined, or porous. The image size ranges between 300x300 and 640x640 pixels.

    • Flowers102: The task consists in classifying images of flowers present in the UK (102 classes, with between 40 and 248 training images per class). Azalea, Californian Poppy, Sunflower, or Petunia are some examples. Each image dimension has at least 500 pixels.

    • Pets: The task consists in classifying pictures of cat and dog breeds (37 classes with around 200 images each), including Persian cat, Chihuahua dog, English Setter dog, or Bengal cat. Image dimensions are typically 200 pixels or larger.

    • Sun397: The Sun397 task is a scenery benchmark with 397 classes and at least 100 images per class. Classes have a hierarchical structure and include cathedral, staircase, shelter, river, or archipelago. The images are (colour) 200x200 pixels or larger.

    • SVHN: This task consists in classifying images of Google’s street-view house numbers (10 classes, with more than 1000 training images each). The image size is 32x32 pixels.

  • VTAB specialized group: The specialized group also contains images of the world, but captured through specialist equipment. These images have different invariances from those in the natural tasks. Nonetheless, humans recognize the structures therein, so generic visual representations should also capture the visual concepts. It has two sub-groups: remote sensing and medical.

    • EuroSAT: The task consists in classifying Sentinel-2 satellite images into 10 different types of land use (Residential, Industrial, River, Highway, etc). The spatial resolution corresponds to 10 meters per pixel, and the image size is 64x64 pixels.

    • Resisc45: The Remote Sensing Image Scene Classification (RESISC) dataset is a scene classification task based on remote sensing images. There are 45 classes, containing 700 images each, including tennis court, ship, island, lake, parking lot, sparse residential, or stadium. The images are RGB, 256x256 pixels.

    • Patch Camelyon: The Patch Camelyon dataset contains 327,680 images of histopathologic scans of lymph node sections. The classification task consists in predicting the presence of metastatic tissue in a given image (i.e., two classes). All images are 96x96 pixels.

    • Retinopathy: The Diabetic Retinopathy dataset consists of image-label pairs with high-resolution retina images, and labels that indicate the presence of Diabetic Retinopathy (DR) on a 0-4 scale (No DR, Mild, Moderate, Severe, or Proliferative DR).

  • VTAB structured group: The structured group assesses comprehension of the structure of a scene, for example, object counting or 3D depth prediction. Most of these tasks are generated from simulated environments whose structure is easy for a human to determine, but whose domain differs greatly from datasets like ImageNet. These tasks are intended as a step towards useful representations for perceptual control.

    • Clevr/count: CLEVR is a visual question and answer dataset designed to evaluate algorithmic visual reasoning. We use just the images from this dataset, and create a synthetic task by setting the label equal to the number of objects in the images.

    • Clevr/distance: Another synthetic task we create from CLEVR consists of predicting the depth of the closest object in the image from the camera. The depths are bucketed into six bins.

    • dSprites/location: The dSprites dataset was originally designed to assess disentanglement properties of unsupervised learning algorithms. In particular, each image is a 2D shape where six factors are controlled: color, shape, scale, rotation, and the (x,y) center coordinates. Images have 64x64 black-and-white pixels. This task consists in predicting the x (horizontal) coordinate of the object. The locations are bucketed into 16 bins.

    • dSprites/orientation: Another task we create from dSprites consists in predicting the orientation of each object, bucketed into 16 bins.

    • SmallNORB/azimuth: The Small NORB dataset contains images of 3D-toys from 50 classes, including animals, human figures, airplanes, trucks, and cars. The image size is 640x480 pixels. In this case, we define labels depending on the azimuth (angle of horizontal deviation), in intervals of 20 degrees (18 classes).

    • SmallNORB/elevation: Another synthetic task we create from Small NORB consists in predicting the elevation in the image. There are 9 classes, corresponding to 9 different elevations ranging from 30 to 70 degrees, in intervals of 5 degrees.

    • DMLab: DMLab (DeepMind Lab) is a set of control environments focused on 3D navigation and puzzle-solving tasks. The DMLab dataset contains frames observed by the agent acting in the DeepMind Lab environment, annotated with the distance between the agent and various objects present in the environment. The goal is to evaluate the ability of a visual model to reason about distances from visual input in 3D environments. The dataset consists of 360x480 color images in 6 classes: {close, far, very far} x {positive reward, negative reward}.

    • KITTI/distance: The KITTI task consists in predicting the (binned) depth to the vehicle (car, van, or truck) in the image. There are 4 bins/classes.

[1] https://arxiv.org/pdf/1504.00325.pdf

[2] https://www.kaggle.com/datasets/adityajn105/flickr8k

[3] https://shannon.cs.illinois.edu/DenotationGraph/

[4] https://github.com/modestyachts/ImageNetV2

[5] http://host.robots.ox.ac.uk/pascal/VOC/voc2007/

[6] https://arxiv.org/pdf/1910.04867.pdf