Fine-tune Models#

Although CLIP-as-service provides a selection of pre-trained models, you can also fine-tune your own. This guide shows you how to use Finetuner to fine-tune models and use them in CLIP-as-service.

Prepare Training Data#

Finetuner accepts training data and evaluation data in the form of a DocumentArray. The training data for CLIP is a list of (text, image) pairs. Each pair is stored in a Document that wraps two chunks, one with the text modality and one with the image modality. You can push the resulting DocumentArray to the cloud using the push() method.

We use the fashion captioning dataset as a sample dataset in this tutorial. The following are examples of descriptions and image URLs from the dataset (the original dataset also includes a preview of each image).

| Description | Image URL |
| --- | --- |
| subtly futuristic and edgy this liquid metal cuff bracelet is shaped from sculptural rectangular link | https://n.nordstrommedia.com/id/sr3/58d1a13f-b6b6-4e68-b2ff-3a3af47c422e.jpeg |
| high quality leather construction defines a hearty boot one-piece on a tough lug sole | https://n.nordstrommedia.com/id/sr3/21e7a67c-0a54-4d09-a4a4-6a0e0840540b.jpeg |
| this shimmering tricot knit tote is traced with decorative whipstitching and diamond cut chain the two hallmark of the falabella line | https://n.nordstrommedia.com/id/sr3/1d8dd635-6342-444d-a1d3-4f91a9cf222b.jpeg |
| ... | ... |

You can use the following script to transform the first three entries of the dataset into a DocumentArray and push it to the cloud under the name fashion-sample.

from docarray import Document, DocumentArray

train_da = DocumentArray(
    [
        Document(
            chunks=[
                Document(
                    content='subtly futuristic and edgy this liquid metal cuff bracelet is shaped from sculptural rectangular link',
                    modality='text',
                ),
                Document(
                    uri='https://n.nordstrommedia.com/id/sr3/58d1a13f-b6b6-4e68-b2ff-3a3af47c422e.jpeg',
                    modality='image',
                ),
            ],
        ),
        Document(
            chunks=[
                Document(
                    content='high quality leather construction defines a hearty boot one-piece on a tough lug sole',
                    modality='text',
                ),
                Document(
                    uri='https://n.nordstrommedia.com/id/sr3/21e7a67c-0a54-4d09-a4a4-6a0e0840540b.jpeg',
                    modality='image',
                ),
            ],
        ),
        Document(
            chunks=[
                Document(
                    content='this shimmering tricot knit tote is traced with decorative whipstitching and diamond cut chain the two hallmark of the falabella line',
                    modality='text',
                ),
                Document(
                    uri='https://n.nordstrommedia.com/id/sr3/1d8dd635-6342-444d-a1d3-4f91a9cf222b.jpeg',
                    modality='image',
                ),
            ],
        ),
    ]
)
train_da.push('fashion-sample')


The full dataset has already been converted and pushed to the cloud as clip-fashion-train-data and clip-fashion-eval-data, which can be used directly in Finetuner.
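
If you want to inspect these datasets locally, you can pull them with DocumentArray.pull(), the counterpart of the push() method shown above. A minimal sketch, assuming you have access to the datasets on the cloud:

from docarray import DocumentArray

# Optional: pull the prepared datasets from the cloud to inspect them locally.
train_data = DocumentArray.pull('clip-fashion-train-data')
eval_data = DocumentArray.pull('clip-fashion-eval-data')
print(len(train_data), len(eval_data))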

Start Finetuner#

You can now create and run a fine-tuning job after logging in to the Jina ecosystem.
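
If you are not logged in yet, you can do so from Python; finetuner.login() opens a browser window for authentication:

import finetuner

# Log in to the Jina ecosystem before submitting a fine-tuning job.
finetuner.login()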

import finetuner

run = finetuner.fit(
    model='openai/clip-vit-base-patch32',
    run_name='clip-fashion',
    train_data='clip-fashion-train-data',
    eval_data='clip-fashion-eval-data',  # optional
    epochs=5,
    learning_rate=1e-5,
    loss='CLIPLoss',
    cpu=False,
)


After the job has started, you can use status() to check its progress.

import finetuner

run = finetuner.get_run('clip-fashion')
print(run.status())
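
If you prefer to wait programmatically instead of re-running the check by hand, you can poll in a loop. A minimal sketch, assuming status() returns a dict with a 'status' field as the printout above suggests:

import time

import finetuner

run = finetuner.get_run('clip-fashion')
# Poll every 30 seconds until the run finishes or fails (assumed status values).
while run.status()['status'] not in ('FINISHED', 'FAILED'):
    time.sleep(30)
print(run.status())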


When the status is FINISHED, you can download the tuned model to your local machine.

import finetuner

run = finetuner.get_run('clip-fashion')
run.save_artifact('clip-model')


You should now have a zip file named clip-fashion.zip containing the tuned model under the folder clip-model.
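
The next section assumes this archive has been extracted. A minimal sketch using Python's standard zipfile module, assuming the default paths above:

import zipfile

# Extract clip-fashion.zip into the clip-model folder.
with zipfile.ZipFile('clip-model/clip-fashion.zip') as zf:
    zf.extractall('clip-model')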

Use the Model#

After unzipping the model from the previous step, you will see a folder with the following structure:

.
└── clip-fashion/
    ├── config.yml
    ├── metrics.yml
    └── models/
        ├── clip-text/
        │   └── model.onnx
        ├── clip-vision/
        │   └── model.onnx
        └── input-map.yml


Since the tuned model generated by Finetuner contains richer information, such as metadata and config, we now transform it into the simpler structure used by CLIP-as-service.

• First, create a new folder named clip-fashion-cas (or a name of your choice). This folder will store the models to be used in CLIP-as-service.

• Second, copy the textual model clip-fashion/models/clip-text/model.onnx into clip-fashion-cas and rename it to textual.onnx.

• Similarly, copy the visual model clip-fashion/models/clip-vision/model.onnx into clip-fashion-cas and rename it to visual.onnx. These steps are also scripted in the sketch after this list.
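
A minimal Python sketch of the steps above, using only the standard library and assuming the folder names above:

import shutil
from pathlib import Path

src = Path('clip-fashion/models')
dst = Path('clip-fashion-cas')
dst.mkdir(exist_ok=True)

# Copy and rename the textual and visual ONNX models.
shutil.copy(src / 'clip-text' / 'model.onnx', dst / 'textual.onnx')
shutil.copy(src / 'clip-vision' / 'model.onnx', dst / 'visual.onnx')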

This is the expected structure of clip-fashion-cas:

.
└── clip-fashion-cas/
    ├── textual.onnx
    └── visual.onnx


To use the fine-tuned model, create a custom YAML file finetuned_clip.yml as shown below. Learn more about Flow YAML configuration and clip_server YAML configuration.

jtype: Flow
version: '1'
with:
  port: 51000
executors:
  - name: clip_o
    uses:
      jtype: CLIPEncoder
      metas:
        py_modules:
          - clip_server.executors.clip_onnx
      with:
        name: ViT-B/32
        model_path: 'clip-fashion-cas'  # path to clip-fashion-cas
    replicas: 1


Warning

Note that Finetuner currently only supports the ViT-B/32 CLIP model. The name in the YAML must match the model that was fine-tuned, or you will get incorrect output.

You can now start the clip_server with the fine-tuned model to get a performance boost:

python -m clip_server finetuned_clip.yml
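
Once the server is running, you can query it with clip_client as usual; a minimal sketch, assuming the server listens on the port configured in the YAML above:

from clip_client import Client

# Connect to the server started with the fine-tuned model.
c = Client('grpc://0.0.0.0:51000')

# Encode a caption in the style of the training data; returns an ndarray of embeddings.
vectors = c.encode(['subtly futuristic liquid metal cuff bracelet'])
print(vectors.shape)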


That’s it, enjoy 🚀