DEtection TRansformer on Onepanel

In this post, you will learn how easily you can add new deep learning training Workflows in Onepanel and use them directly from CVAT to train models on annotated data. We will use a recently released model from Facebook Research, DEtection TRansformer (DETR), as an example.

Object detection models are among the most widely used models in computer vision. Consequently, there is a lot of research going on in this area and we have some really good models for this task. One of the most popular models for object detection is Faster R-CNN. Faster R-CNN builds on top of Fast R-CNN and R-CNN to further improve the speed and accuracy of the model. However, Faster R-CNN consists of multiple components, which means it is not an end-to-end model. By now, it has been well established that end-to-end models make a lot of things easier, and in fact they have led to some great results in NLP tasks such as machine translation. In this blog, we are going to see how DEtection TRansformer, a recently introduced model from Facebook Research, provides a great alternative to Faster R-CNN as an end-to-end model, and how you can use it on Onepanel with just a few clicks.

DEtection TRansformer

This isn't the first paper to propose an end-to-end model for object detection. Previous approaches used sequential models such as recurrent neural networks to predict bounding boxes, but the results weren't on par with state-of-the-art models. Of course, a conventional fully connected network could predict a fixed set of boxes, but the number of objects in an image is usually not fixed. And to address the issue of permutation invariance (i.e. predicted boxes can come in any order), these approaches used a bipartite matching loss.

DETR uses a bipartite matching loss as well, but turns to Transformers instead of recurrent neural networks. The image below shows the architecture of the DEtection TRansformer.

As the paper mentions, DETR views object detection as a direct set prediction problem. Here is a brief summary of how it works.

  1. Pass the image through a pre-trained Convolutional Neural Network (e.g. ResNet-50) to generate a feature map.
  2. Pass the feature map through a Transformer, which consists of an encoder and a decoder.
  3. Pass the output of the Transformer's decoder through a feed forward network to generate the final predictions.
  4. For training, additionally pass this output through a loss function which performs bipartite matching between predicted bounding boxes and ground-truth boxes.
Source: End-to-End Object Detection with Transformers [paper]

The first step is straightforward. They use a ResNet-50 or ResNet-101 pre-trained on ImageNet to generate the feature maps, which can be achieved in a few lines of code using torchvision. Since a detailed explanation of the Transformer is beyond the scope of this blog, the following sections attempt to explain it briefly. For more information on the Transformer, check out this excellent blog post by Jay Alammar.
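For instance, a minimal sketch of this step might look like the following (shapes assume torchvision's standard ResNet-50; this is an illustration, not DETR's actual backbone code):

import torch
import torchvision

# ResNet-50 pre-trained on ImageNet, with the pooling and classification head removed
resnet = torchvision.models.resnet50(pretrained=True)
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])
backbone.eval()

image = torch.randn(1, 3, 800, 1066)   # a dummy batch of one image

with torch.no_grad():
    feature_map = backbone(image)

print(feature_map.shape)   # torch.Size([1, 2048, 25, 34]) -- spatial size reduced 32x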

Transformer


The Transformer is an encoder-decoder architecture that leverages self-attention layers to gather information from the whole sequence. Transformers have gained a lot of popularity lately and are used in many state-of-the-art models for NLP tasks such as machine translation.

For DETR, the encoder takes in the feature map combined with positional encodings as input. Positional encodings let the Transformer know where a given element sits in the original input sequence. For example, in machine translation it matters where the phrase "San Francisco" appears in the sentence "San Francisco is in California". Unlike an RNN, the Transformer does not consume its input sequentially, so it has no inherent notion of word order. Typically, positional encoding is achieved by summing the word embedding and a positional embedding, and the positional embeddings can be generated by repeating a pair of sines and cosines over positions. This post provides a great explanation of positional encoding in Transformers.
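As an illustration, here is a rough sketch of the classic 1D sinusoidal positional encoding (DETR itself uses a 2D variant over image rows and columns, but the idea is the same; d_model = 256 is the hidden size used in the paper):

import math
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)          # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions use sine
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions use cosine
    return pe

# the encodings are simply added to the feature embeddings
features = torch.randn(25 * 34, 256)               # 850 tokens, d_model = 256
encoder_input = features + sinusoidal_positional_encoding(25 * 34, 256)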

Coming back to the encoder: the encoder block first uses a 1 x 1 convolution to reduce the depth of the feature map. With a 1 x 1 kernel we can control the number of feature maps (the depth) without touching the spatial size of the input. The feature map is then flattened and passed through multiple encoder layers. Here, the encoder layers follow the standard structure: each has a self-attention layer followed by a feed forward network.
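A minimal sketch of this preparation step, assuming the ResNet-50 feature map from earlier (again an illustration, not DETR's exact code):

import torch
import torch.nn as nn

feature_map = torch.randn(1, 2048, 25, 34)         # (batch, channels, H, W)

# 1 x 1 convolution: depth goes from 2048 to 256, H and W are untouched
input_proj = nn.Conv2d(2048, 256, kernel_size=1)
projected = input_proj(feature_map)                # (1, 256, 25, 34)

# flatten H x W into a sequence of H*W tokens for the encoder
sequence = projected.flatten(2).permute(2, 0, 1)   # (850, 1, 256) = (seq_len, batch, d_model)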

The decoding part is also very similar to the standard architecture, with the major difference being parallel decoding. DETR decodes, say, N objects in parallel, contrary to the standard approach where a model like an RNN makes one prediction per time step. Each decoder layer has a self-attention layer, an encoder-decoder attention layer, and a feed forward network. The encoder-decoder attention helps the decoder focus on the relevant parts of the input.
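The parallel decoding idea can be sketched with PyTorch's stock nn.Transformer and N learned "object queries" (the real DETR implementation differs in details, e.g. how positional encodings are injected, so treat this as an illustration):

import torch
import torch.nn as nn

d_model, num_queries = 256, 100
transformer = nn.Transformer(d_model=d_model, nhead=8,
                             num_encoder_layers=6, num_decoder_layers=6)

# one learned embedding per predicted slot
query_embed = nn.Embedding(num_queries, d_model)

src = torch.randn(850, 1, d_model)        # flattened feature map (seq_len, batch, d_model)
tgt = query_embed.weight.unsqueeze(1)     # (100, 1, 256): all object queries at once

decoded = transformer(src, tgt)           # (100, 1, 256): N slots decoded in parallel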

Finally, the output is computed by a feed forward network and a linear projection layer. The FFN outputs the normalized center of a bounding box along with its height and width, whereas the linear projection layer predicts the class label using a softmax function.
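A minimal sketch of these heads, assuming the decoder output from the previous snippet (num_classes = 91 is just the COCO label count used as an example):

import torch
import torch.nn as nn

d_model, num_classes = 256, 91

bbox_head = nn.Sequential(                      # small FFN for box regression
    nn.Linear(d_model, d_model), nn.ReLU(),
    nn.Linear(d_model, d_model), nn.ReLU(),
    nn.Linear(d_model, 4), nn.Sigmoid(),        # normalized (cx, cy, w, h) in [0, 1]
)
class_head = nn.Linear(d_model, num_classes + 1)   # +1 for the "no object" class

decoded = torch.randn(100, 1, d_model)             # (num_queries, batch, d_model)
pred_boxes = bbox_head(decoded)                    # (100, 1, 4)
class_probs = class_head(decoded).softmax(-1)      # (100, 1, num_classes + 1)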

An important thing to note here is that since the model predicts a fixed set of N objects, where N is much larger than the typical number of objects in the ground-truth data, the authors use a special class to represent "no object was detected in this slot".

Loss

Since the number of predicted objects is much larger than the number of objects in the ground-truth data, the ground-truth vector is padded with "no object" entries. Using a pair-wise matching cost, predicted boxes are then matched one-to-one with target boxes such that the total cost is minimized. As the authors note, this process plays the same role as matching anchors to ground-truth objects in models such as SSD.
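The matching itself can be sketched with the Hungarian algorithm from SciPy; DETR's actual matching cost also includes a classification term and a generalized IoU term, but an L1-only cost is enough to illustrate the one-to-one assignment:

import torch
from scipy.optimize import linear_sum_assignment

pred_boxes = torch.rand(100, 4)    # N predicted boxes as normalized (cx, cy, w, h)
gt_boxes = torch.rand(3, 4)        # ground-truth boxes for one image

# pair-wise L1 cost between every prediction and every ground-truth box
cost = torch.cdist(pred_boxes, gt_boxes, p=1)      # (100, 3)

# minimum-cost one-to-one assignment; unmatched predictions count as "no object"
pred_idx, gt_idx = linear_sum_assignment(cost.numpy())
print(list(zip(pred_idx, gt_idx)))                 # 3 matched (prediction, ground-truth) pairs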

The loss function used here is a negative log-likelihood for the class label combined with a box loss. For the box loss, a combination of L1 loss and generalized IoU loss is used so that the loss is scale-invariant, since boxes can be both small and large.
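For the matched pairs, the box part of the loss might be sketched as follows, assuming boxes have already been converted to (x_min, y_min, x_max, y_max); recent torchvision versions provide generalized_box_iou, and the weights 5 and 2 follow the paper's defaults:

import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou

matched_pred = torch.tensor([[0.10, 0.10, 0.40, 0.50],
                             [0.55, 0.20, 0.90, 0.80]])
matched_gt = torch.tensor([[0.12, 0.08, 0.42, 0.52],
                           [0.50, 0.25, 0.95, 0.75]])

l1_loss = F.l1_loss(matched_pred, matched_gt)
# generalized_box_iou returns a pair-wise matrix; the diagonal holds the matched pairs
giou_loss = 1 - torch.diag(generalized_box_iou(matched_pred, matched_gt)).mean()

box_loss = 5 * l1_loss + 2 * giou_loss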

DETR on Onepanel

Now that we have some understanding of how the DEtection TRansformer works, let's see how we can use it on Onepanel. The following guide explains how you can create a Workflow for DETR and then train the model directly from CVAT with just a few clicks.

In other words, you can upload your frames to CVAT, annotate or pre-annotate them using the default models, and train DETR on these images with three clicks. The biggest benefit here is that since Onepanel runs on Kubernetes, you can train DETR on any machine you want, even an 8-GPU machine.

1. Requirements

Before we dive into the technical details of adding DETR support to CVAT on Onepanel, it's important to know what types of models and data can be used with CVAT.

You can think of the Execute training Workflow feature, which allows you to execute a Workflow (i.e. a model training Workflow) from CVAT, as a bridge between data in CVAT, be it annotated or just frames, and Onepanel Workflows. You can have Onepanel Workflows for model training, inference, and many other things. Let's say you have a training Workflow for some model X. You can then use this feature to train model X on annotated frames from CVAT: CVAT dumps the annotated data to cloud storage, and the Onepanel Workflow grabs the data from that same location. More details on how to actually create such Workflows are discussed in the following sections.

Now that we know how this feature works, it is safe to say that the only requirement is that your training code has to support the format that CVAT dumps data into. For example, if your training code accepts data in COCO format (i.e. JSON), then we need to export data from CVAT in COCO format. If your code does not accept data in any of the formats that CVAT supports, then you can't use that Workflow from CVAT, unless you just need the frames and not the annotations. In any case, you can still create a training Workflow and execute it on Onepanel, but not directly from CVAT: you would have to export the data from CVAT, upload it to cloud storage, execute the Workflow, and pass in the correct cloud storage path.

If your code supports one of the following formats, then you are good to go.

  1. MS COCO
  2. YOLO
  3. TF Detection API (TFRecords)
  4. MOT
  5. LabelMe
  6. DatuMaro

2. Uploading code to GitHub

Now that we know our code will work with CVAT, let's go ahead and create a Workflow for it. The first step is to upload your repository to GitHub. The Workflow we are about to create will clone this repository and execute the training command.

The first thing we need to do is clone DETR's GitHub repository. We are cloning it because we may need to make some changes to the code. If your code is stored locally, you'll have to upload it to GitHub. Also note that this code supports the MS COCO format, so we can use it directly from CVAT.

3. Running code in JupyterLab

Before we create a Workflow for this, we need to make sure the code works without any issues and understand how it works. We may also need to make some minor changes. You can do this locally or create a JupyterLab workspace on Onepanel.

Here, our goal is to have one script in this repository that takes the required inputs (i.e. epochs and other hyperparameters) from the user and starts training, inference, etc. For this example, our goal is to create a Workflow for DETR training, so we will focus on the training part only. You can have a flag in this script and run training or inference based on user input.

The only major difference between running this code locally and in a Workflow is that our annotated data will be dumped onto cloud storage, and that location will be mapped to a local directory. Hence, you need to update the directory structure in the code and fix it to a known directory (i.e. /mnt/data/datasets/). This is where we will mount our dataset from cloud storage.

Now, let's see if DETR has a main script which takes user inputs and runs training. If you look at main.py in the root directory, you will find that this script accepts a myriad of arguments from the user and runs training/inference. So, we don't have to create a new script.

If your code supports one of the dataset formats that CVAT can export to, then we only have to modify two things: the input and output paths.

a. Input paths

Since we will be mounting the dataset at a fixed location (i.e. /mnt/data/datasets/), you can hard-code this path in your code. We also need to look at the directory structure inside this directory. Since our code accepts data in COCO format, let's export data from CVAT in MS COCO format and take a look at its directory structure.

If you are familiar with the official COCO directory structure, or take a look at the DETR code, you will find that the exported data does not follow the official COCO directory structure. Since this code expects the official COCO directory structure, we need to modify some lines to make it work. Also, the file_name attribute in the JSON file points to frames on the machine where the data was exported, so it won't work on another machine (i.e. the Workflows we will be running).

So, we will need to make two changes.

1. Update the directory structure in the code (datasets/coco.py). This is very simple; we just need to change the following lines in datasets/coco.py

PATHS = {
    "train": (root / "train2017", root / "annotations" / f'{mode}_train2017.json'),
    "val": (root / "val2017", root / "annotations" / f'{mode}_val2017.json'),
}

to

PATHS = {
    "train": (root / "images", root / "annotations" / 'instances_default.json'),
    "val": (root / "images", root / "annotations" / 'instances_default.json'),
}

Please note that for simplicity we are using the train set as the validation set. That's not ideal: you can split the train set into train and val sets (a rough sketch follows below), or point to another dataset on cloud storage while executing the Workflow.
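If you do want a separate validation set, here is a rough, hypothetical sketch of splitting the exported COCO JSON into two files; the file names and the 80/20 split are arbitrary choices, and you would then point the "train" and "val" entries in PATHS at the resulting files:

import json
import random

with open("instances_default.json") as f:
    coco = json.load(f)

random.shuffle(coco["images"])
split = int(0.8 * len(coco["images"]))
splits = {"instances_train.json": coco["images"][:split],
          "instances_val.json": coco["images"][split:]}

for file_name, images in splits.items():
    image_ids = {img["id"] for img in images}
    subset = {**coco,
              "images": images,
              "annotations": [a for a in coco["annotations"] if a["image_id"] in image_ids]}
    with open(file_name, "w") as out:
        json.dump(subset, out)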

2. Write a function/script to update the file_name paths in the JSON file.

Since we will be mounting the dataset at /mnt/data/datasets/, we can update the image paths in the JSON accordingly. So, we will write a script (prepare_data.py) that does this, and we will execute it before we call main.py.

import json
import os


def update_img_paths(args):
    # read the COCO-format annotation file
    with open(args.file_path, "r") as file:
        json_data = json.load(file)

    # point each image's file_name at the mounted dataset directory
    for img in json_data['images']:
        img['file_name'] = os.path.join(args.prefix, os.path.basename(img['file_name']))

    # write the updated annotations back to the same file
    with open(args.file_path, "w") as file_out:
        json.dump(json_data, file_out)

Once this is done, we are good to go. We can now create the Onepanel Workflow, which can be used from CVAT or executed directly.

b. Output path

If we want Onepanel Workflows to store outputs on cloud storage, we can simply write outputs to /mnt/output/ and they will automatically be uploaded to cloud storage. For this to work, we just need to add an output artifact to our template, as discussed in the following section.
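For example, a minimal, hypothetical sketch of saving a checkpoint there (DETR's main.py already does the equivalent through its --output_dir argument, so in our case we only need to point --output_dir at /mnt/output/):

import torch
import torch.nn as nn

model = nn.Linear(4, 2)   # stand-in for the trained model
# anything written under /mnt/output/ is uploaded as an output artifact
torch.save(model.state_dict(), "/mnt/output/checkpoint.pth")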

4. Creating a workflow

Now, let's go ahead and actually create a Workflow. Click on WORKFLOWS and then click on the CREATE TEMPLATE button.

You will see a YAML editor as shown below.

Give this template an appropriate name. For this example, we will be using DETR Training.

We will use the MaskRCNN template as our starting point and modify it for our needs. You can find the complete template here.

Even though this might look cryptic, it isn't. Let's go through the following three steps to create a new template for DETR.

a. Update workflow parameters

The first thing you should do is add or remove parameters in the above template. How do you figure out which parameters to use? It's simple: use the arguments/parameters that you take from the user, plus, optionally, some system-related parameters. Examples are epochs, batch_size, etc. Again, this depends on your code. In our case, main.py accepts all of these hyperparameters as arguments. If your code doesn't have such an argument parser, you can pass all hyperparameters in a single string, as the base template does with its hyperparameters parameter, and parse it in your code.

First, we will update the source parameter to use the code that we just cloned. If your repository is private, please refer to our guide on Git integration to learn how you can use private repositories with Workflows. We also have to update the Docker image to use PyTorch with CUDA. Since we will be deploying this on Azure, we will use Standard_NC6 for sys-node-pool; this machine has a K80 GPU.

Next, we will remove hyperparameters, cvat-num-classes, and cvat-finetune-checkpoint as we don't need them.

arguments:
  parameters:
  - name: source
    value: https://github.com/onepanelio/detr.git
    displayName: Model source code
    type: hidden
    visibility: private

  - name: cvat-annotation-path
    value: annotation-dump/sample_dataset
    hint: Path to annotated data in default object storage (i.e S3). In CVAT, this parameter will be pre-populated.
    displayName: Dataset path
    visibility: private
    
  - name: cvat-output-path
    value: workflow-data/output/sample_output
    hint: Path to store output artifacts in default object storage (i.e s3). In CVAT, this parameter will be pre-populated.
    displayName: Workflow output path
    visibility: private
    
  - name: dump-format
    value: cvat_coco
    visibility: public
      
  - name: pytorch-image
    value: pytorch/pytorch:1.6.0-cuda10.1-cudnn7-runtime
    visibility: public

  - name: sys-node-pool
    value: Standard_NC6
    visibility: public

Here, if visibility is public, the parameter will be shown in CVAT. Notice that some parameters are prefixed with cvat-; these will be automatically populated by CVAT. dump-format specifies which format CVAT should dump data into. If this parameter is present in a Workflow and has a valid value (i.e. cvat_coco), it won't be shown in CVAT.

Basically, you don't have to specify any parameter to run this workflow from CVAT.

Now, let's go ahead and add some parameters that we might need for this model.

Let's add epochs and batch_size as we will be using them to run the training.

So, finally our parameters will look like this:

arguments:
  parameters:
  - name: source
    value: https://github.com/onepanelio/detr.git
    displayName: Model source code
    type: hidden
    visibility: private

  - name: cvat-annotation-path
    value: annotation-dump/sample_dataset
    hint: Path to annotated data in default object storage (i.e S3). In CVAT, this parameter will be pre-populated.
    displayName: Dataset path
    visibility: private
    
  - name: cvat-output-path
    value: workflow-data/output/sample_output
    hint: Path to store output artifacts in default object storage (i.e s3). In CVAT, this parameter will be pre-populated.
    displayName: Workflow output path
    visibility: private
    
  - name: dump-format
    value: cvat_coco
    visibility: public
      
  - name: pytorch-image
    value: pytorch/pytorch:1.6.0-cuda10.1-cudnn7-runtime
    visibility: public

  - name: sys-node-pool
    value: Standard_NC6
    visibility: public

  - name: epochs
    value: '50'
    visibility: public

  - name: batch-size
    value: '1'
    visibility: public

b. Update container block

Now, let's take a look at the second block of the base template.

entrypoint: main
templates:
- dag:
    tasks:
    - name: train-model
      template: tensorflow
  name: main
- container:
    args:
    - |
      apt-get update \
      && apt-get install -y git wget libglib2.0-0 libsm6 libxext6 libxrender-dev \
      && pip install -r requirements.txt \
      && pip install boto3 pyyaml google-cloud-storage \
      && git clone https://github.com/waleedka/coco \
      && cd coco/PythonAPI \
      && python setup.py build_ext install \
      && rm -rf build \
      && cd ../../ \
      && wget https://github.com/matterport/Mask_RCNN/releases/download/v2.0/mask_rcnn_coco.h5 \
      && python setup.py install && ls \
      && python samples/coco/cvat.py train --dataset=/mnt/data/datasets \
        --model=workflow_maskrcnn \
        --extras="{{workflow.parameters.extras}}"  \
        --ref_model_path="{{workflow.parameters.cvat-finetune-checkpoint}}"  \
        --num_classes="{{workflow.parameters.cvat-num-classes}}" \
      && cd /mnt/src/ \
      && python prepare_dataset.py /mnt/data/datasets/annotations/instances_default.json
    command:
    - sh
    - -c
    image: '{{workflow.parameters.tf-image}}'
    volumeMounts:
    - mountPath: /mnt/data
      name: data
    - mountPath: /mnt/output
      name: output
    workingDir: /mnt/src
  nodeSelector:
    beta.kubernetes.io/instance-type: '{{workflow.parameters.sys-node-pool}}'
  inputs:
    artifacts:
    - name: data
      path: /mnt/data/datasets/
      s3:
        key: '{{workflow.namespace}}/{{workflow.parameters.cvat-annotation-path}}'
    - git:
        repo: '{{workflow.parameters.source}}'
        revision: "no-boto"
      name: src
      path: /mnt/src/{{workflow.name}}
  name: tensorflow
  outputs:
    artifacts:
    - name: model
      optional: true
      path: /mnt/output
      s3:
        key: '{{workflow.namespace}}/{{workflow.parameters.cvat-output-path}}/{{workflow.name}}'

The first thing we are going to do is rename the template from tensorflow to detr; it's not required, though. Then, we will remove the no-boto branch from the git section, as we will be using the default (master) branch for DETR.

Lastly, we just need to update the commands that we execute to start training. The base template has ~20 lines of commands, but we won't need that many here. Let's remove those lines and write our own.

apt-get update \
    && apt-get install -y build-essential \
    && pip install cython pycocotools scipy \
    && python /mnt/src/prepare_data.py \
    && python /mnt/src/main.py --coco_path=/mnt/data/datasets/ --output_dir=/mnt/output/ --batch_size={{workflow.parameters.batch-size}} --epochs={{workflow.parameters.epochs}}

As discussed in the last section, we need to run the prepare_data.py script to fix the paths and then run main.py to start the training.

Finally, our updated block looks like this:

entrypoint: main
templates:
- dag:
    tasks:
    - name: train-model
      template: detr
  name: main
- container:
    args:
    - |
      apt-get update \
      && apt-get install -y build-essential \
      && pip install cython pycocotools scipy \
      && python /mnt/src/prepare_data.py \
      && python /mnt/src/main.py --coco_path=/mnt/data/datasets/ --output_dir=/mnt/output/ --batch_size={{workflow.parameters.batch-size}} --epochs={{workflow.parameters.epochs}}
    command:
    - sh
    - -c
    image: '{{workflow.parameters.pytorch-image}}'
    volumeMounts:
    - mountPath: /mnt/data
      name: data
    - mountPath: /mnt/output
      name: output
    workingDir: /mnt/src
  nodeSelector:
    beta.kubernetes.io/instance-type: '{{workflow.parameters.sys-node-pool}}'
  inputs:
    artifacts:
    - name: data
      path: /mnt/data/datasets/
      s3:
        key: '{{workflow.namespace}}/{{workflow.parameters.cvat-annotation-path}}'
    - git:
        repo: '{{workflow.parameters.source}}'
      name: src
      path: /mnt/src
  name: detr
  outputs:
    artifacts:
    - name: model
      optional: true
      path: /mnt/output
      s3:
        key: '{{workflow.namespace}}/{{workflow.parameters.cvat-output-path}}/{{workflow.name}}'

Note that we also renamed the Docker image parameter to pytorch-image.

We also attached some input and output artifacts. For inputs, we have the training data and the source code. For output, we will be dumping data from the /mnt/output/ directory to {{workflow.namespace}}/{{workflow.parameters.cvat-output-path}}/{{workflow.name}}.

Also notice that we select a node with the machine type that the user specified through the sys-node-pool parameter. What we are essentially doing here is running the PyTorch container on that machine, attaching the input artifacts (i.e. training data), and running the commands that perform the required tasks (i.e. training a model).

c. Update volume claims

Now, let's take a look at the final block.

volumeClaimTemplates:
- metadata:
    creationTimestamp: null
    name: data
  spec:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 200Gi
- metadata:
    creationTimestamp: null
    name: output
  spec:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 200Gi

As we can see, this block defines the volume claims. Based on your model and data, you can change this from 200 GB to whatever you need, but we will keep it as it is.

One last thing we need to do in order to use this template from CVAT is to add a label, as shown below: if you want to use a Workflow in CVAT, add a label with key=used-by and value=cvat.

With this, we have a final template for training DETR model.

5. Using it in CVAT

Now, we can use this template to train models from CVAT.

Click on the Actions menu under the CVAT task you want to train the model on and select Execute training Workflow.

Select the newly created template. In my case, it was DETR Training.

Modify the parameters if you want, but no changes are required. Just hit Submit and it will start the training by executing this Workflow.

You can find your trained model (i.e. the output) in the cvat-output-path directory on your cloud storage.


References

  1. Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko. End-to-End Object Detection with Transformers. arXiv:2005.12872, 2020.
  2. Jay Alammar. The Illustrated Transformer. https://jalammar.github.io/illustrated-transformer/
