Server API#

CLIP-as-service is designed in a client-server architecture. A server is a long-running program that receives raw sentences and images from clients, and returns CLIP embeddings to the client. Additionally, clip_server is optimized for speed, low memory footprint and scalability.

  • Horizontal scaling: adding more replicas easily with one argument.

  • Vertical scaling: using the PyTorch JIT, ONNX, or TensorRT runtime to speed up single-GPU inference.

  • Supporting the gRPC, HTTP, and WebSocket protocols with their TLS counterparts, with or without compression.

This chapter introduces the API of the server.

Tip

You will need to install the server first (Python 3.7+): pip install clip-server.

Start server#

Start a PyTorch-backed server#

Unlike the client, the server only has a CLI entrypoint. To start a server, run the following in the terminal:

python -m clip_server

Note that it is an underscore _, not a dash -.

On the first run, the server downloads the pretrained model (PyTorch ViT-B/32 by default), loads it, and finally prints the address information of the server. This information is then used by clients.

../../_images/server-start.gif
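For example, once the server reports its address, a minimal client-side sketch (assuming the server runs locally with the default built-in config on port 51000) looks like this:

from clip_client import Client

# assumes the server above is running locally with the default port 51000
c = Client('grpc://0.0.0.0:51000')
r = c.encode(['hello world', 'https://picsum.photos/200'])
print(r.shape)  # (2, 512) with the default ViT-B-32::openai model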

Start an ONNX-backed server#

To use ONNX runtime for CLIP, you can run:

pip install "clip_server[onnx]"

python -m clip_server onnx-flow.yml

Start a TensorRT-backed server#

The nvidia-pyindex package needs to be installed first. It allows pip to fetch additional Python modules from the NVIDIA NGC™ PyPI repo:

pip install nvidia-pyindex
pip install "clip_server[tensorrt]"

python -m clip_server tensorrt-flow.yml

One may wonder where this onnx-flow.yml or tensorrt-flow.yml comes from. Is it a typo? Believe me, just run it. It should just work. I will explain these YAML files in the next section.

The procedure and UI of the ONNX and TensorRT runtimes look the same as the PyTorch runtime.

Model support#

The various CLIP models implemented by OpenAI, OpenCLIP, and MultilingualCLIP are supported. ViT-B-32::openai is used as the default model in all runtimes. Due to the limitations of some runtimes, not every runtime supports all models. Please also note that different models produce different output dimensions, which affects your downstream applications: switching from one model to another makes your embeddings incomparable and breaks downstream applications (a client-side sanity check is sketched after the table). Below is a list of the models supported by each runtime and their corresponding sizes. We include the disk usage (as a delta) and the peak RAM and VRAM usage (as deltas) when running a series of text and image encoding tasks with batch_size=8 using the PyTorch runtime on a single Nvidia TITAN RTX GPU (24GB VRAM).

| Model | PyTorch | ONNX | TensorRT | Output Dimension | Disk Usage (MB) | Peak RAM Usage (GB) | Peak VRAM Usage (GB) |
|---|---|---|---|---|---|---|---|
| RN50::openai | ✅ | ✅ | ✅ | 1024 | 244 | 2.99 | 1.36 |
| RN50::yfcc15m | ✅ | ✅ | ✅ | 1024 | 389 | 2.86 | 1.36 |
| RN50::cc12m | ✅ | ✅ | ✅ | 1024 | 389 | 2.84 | 1.36 |
| RN101::openai | ✅ | ✅ | ✅ | 512 | 278 | 3.05 | 1.40 |
| RN101::yfcc15m | ✅ | ✅ | ✅ | 512 | 457 | 2.88 | 1.40 |
| RN50x4::openai | ✅ | ✅ | ✅ | 640 | 402 | 3.23 | 1.63 |
| RN50x16::openai | ✅ | ✅ | ✅ | 768 | 631 | 3.63 | 2.02 |
| RN50x64::openai | ✅ | ✅ | ✅ | 1024 | 1291 | 4.08 | 2.98 |
| ViT-B-32::openai | ✅ | ✅ | ✅ | 512 | 338 | 3.20 | 1.40 |
| ViT-B-32::laion2b_e16 | ✅ | ✅ | ✅ | 512 | 577 | 2.93 | 1.40 |
| ViT-B-32::laion400m_e31 | ✅ | ✅ | ✅ | 512 | 577 | 2.93 | 1.40 |
| ViT-B-32::laion400m_e32 | ✅ | ✅ | ✅ | 512 | 577 | 2.94 | 1.40 |
| ViT-B-32::laion2b-s34b-b79k | ✅ | ✅ | ✅ | 512 | 577 | 2.94 | 1.40 |
| ViT-B-16::openai | ✅ | ✅ | ✅ | 512 | 335 | 3.20 | 1.44 |
| ViT-B-16::laion400m_e31 | ✅ | ✅ | ✅ | 512 | 571 | 2.93 | 1.44 |
| ViT-B-16::laion400m_e32 | ✅ | ✅ | ✅ | 512 | 571 | 2.94 | 1.44 |
| ViT-B-16-plus-240::laion400m_e31 | ✅ | ✅ | 🚧 | 640 | 795 | 3.03 | 1.59 |
| ViT-B-16-plus-240::laion400m_e32 | ✅ | ✅ | 🚧 | 640 | 795 | 3.03 | 1.59 |
| ViT-L-14::openai | ✅ | ✅ | ✅ | 768 | 890 | 3.66 | 2.04 |
| ViT-L-14::laion400m_e31 | ✅ | ✅ | ✅ | 768 | 1631 | 3.43 | 2.03 |
| ViT-L-14::laion400m_e32 | ✅ | ✅ | ✅ | 768 | 1631 | 3.42 | 2.03 |
| ViT-L-14::laion2b-s32b-b82k | ✅ | ✅ | ✅ | 768 | 1631 | 3.43 | 2.03 |
| ViT-L-14-336::openai | ✅ | ✅ | ✅ | 768 | 891 | 3.74 | 2.23 |
| ViT-H-14::laion2b-s32b-b79k | ✅ | ✅ | ✅ | 1024 | 3762 | 4.45 | 3.26 |
| ViT-g-14::laion2b-s12b-b42k | ✅ | ✅ | ✅ | 1024 | 5214 | 5.16 | 4.00 |
| M-CLIP/LABSE-Vit-L-14 | ✅ | ✅ | ✅ | 768 | 3609 | 4.30 | 4.70 |
| M-CLIP/XLM-Roberta-Large-Vit-B-32 | ✅ | ✅ | 🚧 | 512 | 4284 | 5.37 | 1.68 |
| M-CLIP/XLM-Roberta-Large-Vit-B-16Plus | ✅ | ✅ | 🚧 | 640 | 4293 | 4.30 | 4.13 |
| M-CLIP/XLM-Roberta-Large-Vit-L-14 | ✅ | ✅ | ✅ | 768 | 4293 | 4.30 | 4.97 |

✅ = Supported — 🚧 = Work in progress — ❌ = Not supported
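As a quick client-side sanity check against accidental model switches, one could assert the embedding width reported in the table above. This is only a sketch; the expected value and server address depend on your deployment:

from clip_client import Client

EXPECTED_DIM = 512  # ViT-B-32::openai per the table above; e.g. 768 for ViT-L-14::openai

c = Client('grpc://0.0.0.0:51000')
r = c.encode(['sanity check'])
assert r.shape[1] == EXPECTED_DIM, f'unexpected embedding size: {r.shape[1]}'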

Use custom model for ONNX#

You can also use your own model in the ONNX runtime by specifying the model name and the path to the ONNX model directory in the YAML file. The model directory should have the following structure:

.
└── custom-model/
    ├── textual.onnx
    └── visual.onnx

One may wonder how to produce the model described above. Fortunately, you can simply use the Finetuner to fine-tune your model on a custom dataset. Finetuner is a cloud service that makes fine-tuning simple and fast. By moving the process into the cloud, Finetuner handles all related complexity and infrastructure, making models performant and production-ready. Click here for detailed instructions.

YAML config#

You may notice that there is a YAML file in our last ONNX example. All configurations are stored in this file. In fact, python -m clip_server does not support any other argument besides a YAML file, so it is the only source of truth for your configs.

To load a YAML config from my.yml, simply do

python -m clip_server my.yml

One can also pipe the config via stdin:

cat my.yml | python -m clip_server -i

This can be very useful when using clip_server in a Docker container.

To answer the earlier doubt: clip_server ships three built-in YAML configs as part of the package resources. When you run python -m clip_server it loads the PyTorch config; python -m clip_server onnx-flow.yml loads the ONNX config; and python -m clip_server tensorrt-flow.yml loads the TensorRT config.

Let’s look at these three built-in YAML configs:

PyTorch:

jtype: Flow
version: '1'
with:
  port: 51000
executors:
  - name: clip_t
    uses:
      jtype: CLIPEncoder
      metas:
        py_modules:
          - clip_server.executors.clip_torch

ONNX:

jtype: Flow
version: '1'
with:
  port: 51000
executors:
  - name: clip_o
    uses:
      jtype: CLIPEncoder
      metas:
        py_modules:
          - clip_server.executors.clip_onnx

TensorRT:

jtype: Flow
version: '1'
with:
  port: 51000
executors:
  - name: clip_r
    uses:
      jtype: CLIPEncoder
      metas:
        py_modules:
          - clip_server.executors.clip_tensorrt

Basically, each YAML file defines a Jina Flow. The complete Jina Flow YAML syntax can be found here. General parameters of the Flow and Executor can be used here as well, but for now we only highlight the most important ones.
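Since each config is just a Jina Flow, you can also start it from Python instead of the CLI. A minimal sketch, assuming your config is saved as my.yml in the working directory (jina itself is installed as a dependency of clip-server):

from jina import Flow

# load the Flow definition from a YAML file and serve until interrupted
f = Flow.load_config('my.yml')
with f:
    f.block()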

Looking at the YAML file again, we can divide it into three subsections: the top-level with: block is the Flow config, each entry under executors: is an Executor config, and the (currently empty) with: under uses: holds the CLIP model config:

jtype: Flow
version: '1'
with:
  port: 51000
executors:
  - name: clip_t
    uses:
      jtype: CLIPEncoder
      with:
      metas:
        py_modules:
          - clip_server.executors.clip_torch

CLIP model config#

For all backends, you can set the following parameters via with (an example follows the table):

| Parameter | Description |
|---|---|
| name | The name of the model to be used. Default 'ViT-B-32::openai'. A list of available models can be found here. |
| num_worker_preprocess | The number of CPU workers to preprocess images and texts. Default is 4. |
| minibatch_size | The size of the minibatch for preprocessing and encoding. Default is 32. Reduce this number if you encounter OOM errors. |
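For instance, here is a sketch that picks a larger model from the table above and lowers the minibatch size; the values are illustrative, not recommendations:

jtype: Flow
version: '1'
with:
  port: 51000
executors:
  - name: clip_t
    uses:
      jtype: CLIPEncoder
      with:
        name: ViT-L-14::openai    # any supported model name from the table above
        num_worker_preprocess: 4
        minibatch_size: 16        # lower this if you hit OOM errors
      metas:
        py_modules:
          - clip_server.executors.clip_torch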

There are also runtime-specific parameters listed below:

PyTorch runtime:

| Parameter | Description |
|---|---|
| device | 'cpu' or 'cuda'. Default is None, which auto-detects the device. |
| jit | Whether to use JIT compilation. Default is False. |

ONNX runtime:

| Parameter | Description |
|---|---|
| device | 'cpu' or 'cuda'. Default is None, which auto-detects the device. |
| model_path | The path to the model to be used. If not specified, the model will be downloaded or loaded from the local cache. See here to learn how to finetune custom models. |

For example, to turn on JIT and force PyTorch to run on the CPU, one can do:

jtype: Flow
version: '1'
with:
  port: 51000
executors:
  - name: clip_t
    uses:
      jtype: CLIPEncoder
      with: 
        jit: True
        device: cpu
      metas:
        py_modules:
          - clip_server.executors.clip_torch

To use a custom model in the ONNX runtime, one can do:

jtype: Flow
version: '1'
with:
  port: 51000
executors:
  - name: clip_o
    uses:
      jtype: CLIPEncoder
      with:
        name: ViT-B/32
        model_path: 'custom-model'
      metas:
        py_modules:
          - clip_server.executors.clip_onnx

Warning

The model name should match the fine-tuned model, or you will get incorrect output.

Executor config#

The full list of configs for Executor can be found via jina executor --help. The most important one is probably replicas, which allows you to run multiple CLIP models in parallel to achieve horizontal scaling.

To scale to 4 CLIP replicas, simply add replicas: 4 to the executor:

jtype: Flow
version: '1'
with:
  port: 51000
executors:
  - name: clip_t
    replicas: 4
    uses:
      jtype: CLIPEncoder
      metas:
        py_modules:
          - clip_server.executors.clip_torch

Flow config#

Flow configs are the ones under the top-level with:. We can see that port: 51000 is configured there. Besides port, there are some common parameters you might need.

| Parameter | Description |
|---|---|
| protocol | Communication protocol between server and client. Can be grpc, http, or websocket. |
| cors | Only effective when protocol=http. If set, a CORS middleware is added to the FastAPI frontend to allow cross-origin access. |
| prefetch | Controls the maximum number of streamed requests inside the Flow at any given time. Default is None, which means no limit. Setting prefetch to a small number helps solve OOM problems, but may slow down streaming a bit. |

As an example, to set protocol and prefetch, one can modify the YAML as follows:

jtype: Flow
version: '1'
with:
  port: 51000
  protocol: websocket
  prefetch: 10
executors:
  - name: clip_t
    replicas: 4
    uses:
      jtype: CLIPEncoder
      metas:
        py_modules:
          - clip_server.executors.clip_torch

Environment variables#

To start a server with more verbose logging,

JINA_LOG_LEVEL=DEBUG python -m clip_server
../../_images/server-log.gif

To run clip_server on the 3rd GPU,

CUDA_VISIBLE_DEVICES=2 python -m clip_server

Serve on Multiple GPUs#

If you have multiple GPU devices, you can leverage them via CUDA_VISIBLE_DEVICES=RR. For example, if you have 3 GPUs and your Flow YAML says replicas: 5, then

CUDA_VISIBLE_DEVICES=RR python -m clip_server

will assign GPU devices in the following round-robin fashion:

| GPU device | Replica ID |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 2 |
| 0 | 3 |
| 1 | 4 |

You can also restrict the visible devices in the round-robin assignment via CUDA_VISIBLE_DEVICES=RR0:2, where 0:2 has the same meaning as a Python slice. This will create the following assignment (the full command is spelled out after the table):

| GPU device | Replica ID |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 0 | 2 |
| 1 | 3 |
| 0 | 4 |
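Spelled out, and assuming a Flow YAML with replicas: 5 is saved as flow.yml (the filename is arbitrary), the full command would be:

CUDA_VISIBLE_DEVICES=RR0:2 python -m clip_server flow.yml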

Tip

In practice, we found it unnecessary to run clip_server on multiple GPUs, for two reasons:

  • A single replica, even with the largest ViT-L/14-336px model, takes only 3.5GB of VRAM.

  • Real network traffic never utilizes the GPU at 100%.

Based on these two points, it makes more sense to have multiple replicas on a single GPU rather than spreading them across different GPUs, which wastes resources. clip_server scales pretty well by interleaving GPU time across multiple replicas.
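For example, to keep all replicas on the first GPU only (a sketch, assuming a Flow YAML with replicas: 4 such as the one shown earlier is saved as flow.yml):

CUDA_VISIBLE_DEVICES=0 python -m clip_server flow.yml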

Monitor with Prometheus and Grafana#

To monitor the performance of the service, you can enable the Prometheus metrics in the Flow YAML:

jtype: Flow
version: '1'
with:
  port: 51000
  monitoring: True
  port_monitoring: 9090
executors:
  - name: clip_t
    uses:
      jtype: CLIPEncoder
      metas:
        py_modules:
          - clip_server.executors.clip_torch
    monitoring: true
    port_monitoring: 9091

This enables Prometheus metrics on both the Gateway and the CLIP Executor.

Running it gives you:

../../_images/server-start-monitoring.gif

which exposes two additional metrics endpoints (a sample scrape config follows the list):

  • http://localhost:9090 for the Gateway

  • http://localhost:9091 for the CLIP Executor
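As a starting point, a minimal Prometheus scrape config sketch that collects both endpoints could look like the following; the job names and scrape interval are arbitrary:

# prometheus.yml (sketch)
scrape_configs:
  - job_name: 'clip-gateway'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'clip-executor'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9091']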

To visualize the metrics in Grafana, you can import this JSON file of an example dashboard. You will get something as follows:

../../_images/grafana-dashboard.png

For more information on monitoring a Flow, please read here.

Serve with TLS#

You can turn on TLS for HTTP and gRPC protocols. Your Flow YAML should be changed to the following:

jtype: Flow
version: '1'
with:
  port: 8443
  protocol: http
  cors: true
  uvicorn_kwargs:
    ssl_keyfile_password: blahblah
  ssl_certfile: cert.pem
  ssl_keyfile: key.pem

Here, protocol can be either http or grpc. cert.pem and key.pem are the two parts of a certificate: key.pem is the private key and cert.pem is the signed certificate. To generate them, you can run the following command in the terminal:

openssl req -newkey rsa:4096 -nodes -sha512 -x509 -days 3650 -out cert.pem -keyout key.pem -subj "/CN=demo-cas.jina.ai"

Note that if you are using protocol: grpc, then /CN=demo-cas.jina.ai must strictly match the IP address or domain name of your server. A mismatched IP or domain name will throw an exception.
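To double-check which common name a certificate actually carries, you can inspect it with openssl:

openssl x509 -in cert.pem -noout -subject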

Certificates and keys can also be generated via letsencrypt.org, a free SSL provider.

Warning

Note that not every port supports HTTPS. Commonly supported ports are: 443, 2053, 2083, 2087, 2096, 8443.

Warning

If you are using Cloudflare proxied DNS, please be aware:

  • you need to turn on gRPC support manually; please follow the guide here;

  • the free tier of Cloudflare has a 100-second hard limit on the timeout, meaning that sending a big batch to a CPU server may return a 524 error to the client side.

When the server is running successfully, you can connect to it from the client by using an https:// or grpcs:// server address, as follows:

from clip_client import Client

c = Client('grpcs://demo-cas.jina.ai:2096')

r = c.encode(
    [
        'First do it',
        'then do it right',
        'then do it better',
        'https://picsum.photos/200',
    ]
)

Serve in Docker Container#

You can run the server inside a Docker container. We provide a Dockerfile in the repository, which is CUDA-enabled with optimized package installation.

Build#

We have a list of pre-built images available on Docker Hub. If they are too big for you to download, you may consider building the image yourself as follows:

git clone https://github.com/jina-ai/clip-as-service.git
cd clip-as-service
docker build . -f Dockerfiles/server.Dockerfile --build-arg GROUP_ID=$(id -g ${USER}) --build-arg USER_ID=$(id -u ${USER}) -t jinaai/clip-server

Tip

The build arguments --build-arg GROUP_ID=$(id -g ${USER}) and --build-arg USER_ID=$(id -u ${USER}) are optional, but highly recommended, as they allow you to reuse the host's cache with the correct permissions.

Run#

docker run -p 51009:51000 -v $HOME/.cache:/home/cas/.cache --gpus all jinaai/clip-server

Here, 51009 is the public port on the host and 51000 is the in-container port defined inside the YAML. The argument -v $HOME/.cache:/home/cas/.cache leverages the host's cache and prevents you from downloading the same model again on the next start.

Due to the limitation of the terminal inside the Docker container, you will not see the classic Jina progress bar on start. Instead, you will face a few minutes of awkward silence while the model downloads, and then see the “Flow is ready to serve” dialog.

To pass a YAML config from the host, one can do:

cat my.yml | docker run -i -p 51009:51000 -v $HOME/.cache:/home/cas/.cache --gpus all jinaai/clip-server -i

The CLI usage is the same as described here.

Tip

You can enable debug logging via: docker run --env JINA_LOG_LEVEL=debug ...
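If you prefer Docker Compose, a rough equivalent of the docker run command above might look like the sketch below; the service name is arbitrary, and the GPU reservation syntax requires a Compose version that supports device reservations:

services:
  clip-server:
    image: jinaai/clip-server
    ports:
      - "51009:51000"                      # host:container, same mapping as above
    volumes:
      - "$HOME/.cache:/home/cas/.cache"    # reuse the host's model cache
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]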

Pre-built images#

We have prebuilt images with CUDA support.

The Docker image name always starts with jinaai/clip-server, followed by a tag composed of two parts (a pull example follows the list):

jinaai/clip-server:{version}{extra}
  • {version}: The version of clip_server. Possible values:

    • latest: the last release;

    • master: the master branch of the jina-ai/clip-as-service repository;

    • x.y.z: the release of a particular version;

    • x.y and x: the alias to the last x.y.z patch release, i.e. x.y = x.y.max(z);

  • {extra}: the extra dependency installed along with clip_server. Possible values:

    • (no suffix): PyTorch backend;

    • -onnx: ONNX backend;

    • -tensorrt: TensorRT backend;
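For example, following this scheme, pulling the latest release with the ONNX backend would be:

docker pull jinaai/clip-server:latest-onnx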

Image alias and updates#

| Event | Updated images | Aliases |
|---|---|---|
| On merge into main | jinaai/clip-server:master{extra} | |
| On x.y.z release | jinaai/clip-server:x.y.z{extra} | jinaai/clip-server:latest{python_version}{extra}, jinaai/clip-server:x.y{python_version}{extra}, jinaai/clip-server:x{python_version}{extra} |

3 images are built for each event listed above, i.e. taking the combinations of:

  • {extra} = ["", "-onnx", "-tensorrt"]

Image size on different tags#

Warning

Due to a known bug in shields.io/Docker Hub API, the following badge may show “invalid” status randomly.

Image Size