
It uses an object detection model (in our example code[1] we used one from Roboflow Universe[2], but any object detection model should work) to get the bounding boxes. A crop of each detected box is then sent to CLIP, and the resulting feature vector is what Deep SORT uses to differentiate between and track instances across frames.

This is in contrast to the original Deep SORT[3], which requires you to train a separate custom "deep appearance descriptor" model for the tracker to use.
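The crop-and-embed step can be sketched roughly as follows. This is a minimal illustration, not the repo's actual code: `embed` here is a stand-in for CLIP's image encoder (a colour histogram so the sketch runs without model weights), and the box coordinates are made up. The key ideas it shows are cropping each detection, L2-normalising the appearance vectors, and building the cosine-distance matrix Deep SORT uses for association:

```python
import numpy as np

def crop_boxes(frame, boxes):
    """Crop each (x1, y1, x2, y2) detection box out of the frame."""
    return [frame[y1:y2, x1:x2] for (x1, y1, x2, y2) in boxes]

def embed(crop):
    # Stand-in for CLIP's image encoder: in the real pipeline each crop
    # would be preprocessed and passed through CLIP to get its feature
    # vector. A fixed-size histogram keeps this sketch dependency-free.
    hist, _ = np.histogram(crop, bins=32, range=(0, 256))
    v = hist.astype(np.float32)
    return v / (np.linalg.norm(v) + 1e-8)  # L2-normalise the appearance vector

def cosine_distance_matrix(track_features, det_features):
    """Deep SORT matches tracks to detections by cosine distance
    between appearance vectors (smaller = more similar)."""
    T = np.stack(track_features)  # (num_tracks, d), unit-norm rows
    D = np.stack(det_features)    # (num_dets, d), unit-norm rows
    return 1.0 - T @ D.T

# Toy frame and two hypothetical detection boxes
frame = np.random.default_rng(0).integers(0, 256, size=(240, 320, 3))
boxes = [(10, 20, 60, 90), (100, 40, 180, 140)]
feats = [embed(c) for c in crop_boxes(frame, boxes)]
dist = cosine_distance_matrix(feats, feats)
print(dist.shape)  # (2, 2); diagonal ~0 since each crop matches itself
```

Swapping the histogram for CLIP is what makes the approach "zero-shot": the embedding model is general-purpose, so no per-domain appearance descriptor has to be trained.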

[1] https://github.com/roboflow-ai/zero-shot-object-tracking

[2] https://universe.roboflow.com

[3] https://github.com/nwojke/deep_sort


