English | [简体中文](README_cn.md)

# MOT (Multi-Object Tracking)

## Table of Contents
- [Introduction](#introduction)
- [Model Zoo](#model-zoo)
- [Dataset Preparation](#dataset-preparation)
- [Installation](#installation)
- [Getting Started](#getting-started)
- [Citations](#citations)

## Introduction
PaddleDetection implements three multi-object tracking methods.
- [DeepSORT](https://arxiv.org/abs/1703.07402) (SORT with a Deep Association Metric) extends the original [SORT](https://arxiv.org/abs/1602.00763) (Simple Online and Realtime Tracking) algorithm to integrate appearance information based on a deep appearance descriptor (a minimal sketch of this association step follows this list). It adds a CNN model that extracts appearance features from the image patches cropped by the detector's bounding boxes. Here we use `JDE` as the detection model to generate boxes and select `PCBPyramid` as the ReID model. Loading boxes from saved detection result files is also supported.

- [JDE](https://arxiv.org/abs/1909.12605) (Joint Detection and Embedding) is a fast and high-performance multi-object tracker that learns the object detection task and the appearance embedding task simultaneously in a shared neural network.

- [FairMOT](https://arxiv.org/abs/2004.01888) accomplishes detection and re-identification in a single network to improve the inference speed. It presents a simple baseline consisting of two homogeneous branches that predict pixel-wise objectness scores and re-ID features. The achieved fairness between the two tasks allows FairMOT to reach high levels of detection and tracking accuracy.

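As a rough illustration of the association step DeepSORT adds on top of SORT, the sketch below matches detections to tracks by the cosine distance between ReID embeddings. It is a simplified, self-contained example, not PaddleDetection's implementation: the function name and threshold are invented for the example, and the real tracker additionally uses Kalman-filter motion gating and matching cascades.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate_by_appearance(track_feats, det_feats, max_cosine_distance=0.2):
    """Match tracks to detections via cosine distance between embeddings.

    track_feats: (num_tracks, dim) appearance embeddings, one per track.
    det_feats:   (num_dets, dim) embeddings produced by the ReID model.
    Returns a list of (track_idx, det_idx) matched pairs.
    """
    # L2-normalize so that 1 - dot product equals cosine distance.
    t = track_feats / np.linalg.norm(track_feats, axis=1, keepdims=True)
    d = det_feats / np.linalg.norm(det_feats, axis=1, keepdims=True)
    cost = 1.0 - t @ d.T  # (num_tracks, num_dets) cosine-distance matrix

    # Globally optimal one-to-one assignment, then gate out weak matches.
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_cosine_distance]
```
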
<div align="center">
  <img src="../../docs/images/mot16_jde.gif" width=500 />
</div>

## Model Zoo

### JDE on MOT-16 training set

| backbone | input shape | MOTA | IDF1 | IDS | FP | FN | FPS | download | config |
| :----------------- | :------- | :----: | :----: | :---: | :----: | :---: | :---: | :---: | :---: |
| DarkNet53 | 1088x608 | 73.2 | 69.4 | 1320 | 6613 | 21629 | - |[model](https://paddledet.bj.bcebos.com/models/mot/jde_darknet53_30e_1088x608.pdparams) | [config](https://github.com/PaddlePaddle/PaddleDetection/tree/develop/configs/mot/jde/jde_darknet53_30e_1088x608.yml) |
| DarkNet53 | 864x480 | 70.1 | 65.4 | 1341 | 6454 | 25208 | - |[model](https://paddledet.bj.bcebos.com/models/mot/jde_darknet53_30e_864x480.pdparams) | [config](https://github.com/PaddlePaddle/PaddleDetection/tree/develop/configs/mot/jde/jde_darknet53_30e_864x480.yml) |
| DarkNet53 | 576x320 | 63.1 | 64.6 | 1357 | 7083 | 32312 | - |[model](https://paddledet.bj.bcebos.com/models/mot/jde_darknet53_30e_576x320.pdparams) | [config](https://github.com/PaddlePaddle/PaddleDetection/tree/develop/configs/mot/jde/jde_darknet53_30e_576x320.yml) |

**Notes:**
 JDE was trained for 30 epochs on 8 GPUs with a mini-batch size of 4 per GPU.
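
For reference, the MOTA column in these tables follows the standard CLEAR MOT definition, so it is tied directly to the FP, FN and IDS columns:
```
MOTA = 1 - (FP + FN + IDS) / GT    # GT: total number of ground-truth boxes
```
IDF1 complements it by measuring how consistently object identities are preserved over time.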

### DeepSORT on MOT-16 training set

| backbone | input shape | MOTA | IDF1 | IDS | FP | FN | FPS | Detector | ReID | config |
| :---------| :------- | :----: | :----: | :--: | :----: | :---: | :---: |:-----: | :-----: | :-----: |
| DarkNet53 | 1088x608 | 72.2 | 60.5 | 998 | 8054 | 21644 | 5.07 |[JDE](https://paddledet.bj.bcebos.com/models/mot/jde_darknet53_30e_1088x608.pdparams)| [ReID](https://paddledet.bj.bcebos.com/models/mot/deepsort_pcb_pyramid_r101.pdparams)|[config](https://github.com/PaddlePaddle/PaddleDetection/tree/develop/configs/mot/deepsort/deepsort_pcb_pyramid_r101.yml) |

**Notes:**
 DeepSORT does not require training and is used for evaluation only. Before evaluating DeepSORT, you should first obtain detection results from a detection model (here we use JDE) and arrange them as follows:
```
det_results_dir
   |——————MOT16-02.txt
   |——————MOT16-04.txt
   |——————MOT16-05.txt
   |——————MOT16-09.txt
   |——————MOT16-10.txt
   |——————MOT16-11.txt
   |——————MOT16-13.txt
```
Each txt file contains the detection results for all frames extracted from one video, and each line describes a bounding box with the following format:
```
[frame_id][identity][bb_left][bb_top][width][height][conf][x][y][z]
```
**Notes:**
- `frame_id` is the frame number of the image.
- `identity` is the object id, with default value `-1`.
- `bb_left` is the X coordinate of the left edge of the object box.
- `bb_top` is the Y coordinate of the top edge of the object box.
- `width, height` are the width and height of the box in pixels.
- `conf` is the object score, with default value `1` (the results have already been filtered by the detection score threshold).
- `x, y, z` are only used in 3D tracking and default to `-1` in 2D (see the parsing sketch after this list).
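
For instance, one line of such a file can be decoded with a few lines of Python. This is an illustrative sketch, assuming the ten values are comma-separated as in the standard MOTChallenge format; adjust the delimiter if your result files differ:

```python
def parse_det_line(line, delimiter=','):
    """Decode one detection-result line into a dict (illustrative helper)."""
    frame_id, identity, bb_left, bb_top, width, height, conf, x, y, z = \
        (float(v) for v in line.strip().split(delimiter))
    return {
        'frame_id': int(frame_id),                # frame number of the image
        'identity': int(identity),                # -1 means no identity assigned
        'box': (bb_left, bb_top, width, height),  # top-left corner + size, in pixels
        'conf': conf,                             # detection score, already thresholded
    }
```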

### FairMOT on MOT-16 training set

| backbone | input shape | MOTA | IDF1 | IDS | FP | FN | FPS | download | config |
| :--------------| :------- | :----: | :----: | :----: | :----: | :----: | :------: | :----: |:-----: |
| DLA-34(paper) | 1088x608 | 83.3 | 81.9 | 544 | 3822 | 14095 | - | - | - |
| DLA-34 | 1088x608 | 83.7 | 83.3 | 435 | 3829 | 13764 | - | [model](https://paddledet.bj.bcebos.com/models/mot/fairmot_dla34_30e_1088x608.pdparams) | [config](https://github.com/PaddlePaddle/PaddleDetection/tree/develop/configs/mot/fairmot/fairmot_dla34_30e_1088x608.yml) |


### FairMOT on MOT-16 test set

| backbone | input shape | MOTA | IDF1 | IDS | FP | FN | FPS | download | config |
| :--------------| :------- | :----: | :----: | :----: | :----: | :----: | :------: | :----: |:-----: |
| DLA-34(paper) | 1088x608 | 74.9 | 72.8 | 1074 | - | - | 25.9 | - | - |
| DLA-34 | 1088x608 | 74.8 | 74.4 | 930 | 7038 | 37994 | - | [model](https://paddledet.bj.bcebos.com/models/mot/fairmot_dla34_30e_1088x608.pdparams) | [config](https://github.com/PaddlePaddle/PaddleDetection/tree/develop/configs/mot/fairmot/fairmot_dla34_30e_1088x608.yml) |

**Notes:**
 FairMOT was trained for 30 epochs on 8 GPUs with a mini-batch size of 6 per GPU.

## Dataset Preparation

### MOT Dataset
PaddleDetection uses the same training data as [JDE](https://github.com/Zhongdao/Towards-Realtime-MOT) and [FairMOT](https://github.com/ifzhang/FairMOT). Please refer to [PrepareMOTDataSet](../../docs/tutorials/PrepareMOTDataSet.md) to download and prepare all the training data, including **Caltech Pedestrian, CityPersons, CUHK-SYSU, PRW, ETHZ, MOT17 and MOT16**. **MOT15 and MOT20** can also be downloaded from the official MOT Challenge webpage. If you want to use these datasets, please **follow their licenses**.

### Data Format
All these datasets share the following structure:
```
Caltech
   |——————images
   |        └——————00001.jpg
   |        |—————— ...
   |        └——————0000N.jpg
   └——————labels_with_ids
            └——————00001.txt
            |—————— ...
            └——————0000N.txt
MOT17
   |——————images
   |        └——————train
   |        └——————test
   └——————labels_with_ids
            └——————train
```
Annotations of these datasets are provided in a unified format. Every image has a corresponding annotation text file. Given an image path, the annotation path can be generated by replacing the string `images` with `labels_with_ids` and replacing `.jpg` with `.txt`, as in the snippet below.
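
In Python, this mapping is just two string replacements. The function name here is a hypothetical helper, not part of PaddleDetection:

```python
def label_path(img_path):
    """Map 'MOT17/images/train/.../000001.jpg' to 'MOT17/labels_with_ids/train/.../000001.txt'."""
    return img_path.replace('images', 'labels_with_ids').replace('.jpg', '.txt')
```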

In the annotation text file, each line describes a bounding box with the following format:
```
[class] [identity] [x_center] [y_center] [width] [height]
```
**Notes:**
- `class` should be `0`. Only single-class multi-object tracking is currently supported.
- `identity` is an integer from `0` to `num_identities - 1` (`num_identities` is the total number of object instances in the dataset), or `-1` if the box has no identity annotation.
- `[x_center] [y_center] [width] [height]` are normalized by the width/height of the image, so they are floating-point numbers ranging from 0 to 1. A decoding sketch follows this list.
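
As an illustration (not part of PaddleDetection), the snippet below converts one annotation line back to an absolute pixel-space box for an image of known size:

```python
def decode_label_line(line, img_w, img_h):
    """Turn '[class] [identity] [x_center] [y_center] [width] [height]' into (x1, y1, x2, y2)."""
    cls, identity, xc, yc, w, h = (float(v) for v in line.split())
    x1, y1 = (xc - w / 2) * img_w, (yc - h / 2) * img_h  # top-left corner
    x2, y2 = (xc + w / 2) * img_w, (yc + h / 2) * img_h  # bottom-right corner
    return int(cls), int(identity), (x1, y1, x2, y2)
```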

### Dataset Directory

First, run the commands below to download `image_lists.zip` and unzip it in the `dataset/mot` directory:
```
wget https://dataset.bj.bcebos.com/mot/image_lists.zip
unzip -d dataset/mot image_lists.zip
```
Then download and unzip each dataset, and the final directory is as follows:
```
dataset/mot
  |——————image_lists
  |        |——————caltech.10k.val
  |        |——————caltech.all
  |        |——————caltech.train
  |        |——————caltech.val
  |        |——————citypersons.train
  |        |——————citypersons.val
  |        |——————cuhksysu.train
  |        |——————cuhksysu.val
  |        |——————eth.train
  |        |——————mot15.train
  |        |——————mot16.train
  |        |——————mot17.train
  |        |——————mot20.train
  |        |——————prw.train
  |        |——————prw.val
  |——————Caltech
  |——————Cityscapes
  |——————CUHKSYSU
  |——————ETHZ
  |——————MOT15
  |——————MOT16
  |——————MOT17
  |——————MOT20
  |——————PRW
```

## Installation

Install all the related dependencies for MOT:
```
pip install lap sklearn motmetrics openpyxl cython_bbox
# or
pip install -r requirements.txt
```
**Notes:**
 To install `cython_bbox` on Windows, please refer to this [tutorial](https://stackoverflow.com/questions/60349980/is-there-a-way-to-install-cython-bbox-for-windows).


## Getting Started

### 1. Training

Train FairMOT on 8 GPUs with the following command:

```bash
python -m paddle.distributed.launch --log_dir=./fairmot_dla34_30e_1088x608/ --gpus 0,1,2,3,4,5,6,7 tools/train.py -c configs/mot/fairmot/fairmot_dla34_30e_1088x608.yml
```
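
For quick single-GPU experiments, the same script can also be launched directly (note that the released results assume the 8-GPU schedule above, so accuracy will differ with this setup):

```bash
CUDA_VISIBLE_DEVICES=0 python tools/train.py -c configs/mot/fairmot/fairmot_dla34_30e_1088x608.yml
```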

### 2. Evaluation

Evaluate the tracking performance of FairMOT on the validation dataset on a single GPU with the following commands:

```bash
# use weights released in the PaddleDetection model zoo
CUDA_VISIBLE_DEVICES=0 python tools/eval_mot.py -c configs/mot/fairmot/fairmot_dla34_30e_1088x608.yml -o weights=https://paddledet.bj.bcebos.com/models/mot/fairmot_dla34_30e_1088x608.pdparams

# use a checkpoint saved during training
CUDA_VISIBLE_DEVICES=0 python tools/eval_mot.py -c configs/mot/fairmot/fairmot_dla34_30e_1088x608.yml -o weights=output/fairmot_dla34_30e_1088x608/model_final
```
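
To evaluate DeepSORT with the detection result files prepared as described in the Model Zoo section, a command along these lines should work. The `--det_results_dir` flag is shown here as an assumption; check the arguments of `tools/eval_mot.py` in your PaddleDetection version for the exact name:

```bash
# assumed invocation; verify the flag name in tools/eval_mot.py
CUDA_VISIBLE_DEVICES=0 python tools/eval_mot.py -c configs/mot/deepsort/deepsort_pcb_pyramid_r101.yml --det_results_dir det_results_dir
```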

## Citations
```
@article{wang2019towards,
  title={Towards Real-Time Multi-Object Tracking},
  author={Wang, Zhongdao and Zheng, Liang and Liu, Yixuan and Wang, Shengjin},
  journal={arXiv preprint arXiv:1909.12605},
  year={2019}
}

@inproceedings{Wojke2017simple,
  title={Simple Online and Realtime Tracking with a Deep Association Metric},
  author={Wojke, Nicolai and Bewley, Alex and Paulus, Dietrich},
  booktitle={2017 IEEE International Conference on Image Processing (ICIP)},
  year={2017},
  pages={3645--3649},
  organization={IEEE},
  doi={10.1109/ICIP.2017.8296962}
}

@inproceedings{Wojke2018deep,
  title={Deep Cosine Metric Learning for Person Re-identification},
  author={Wojke, Nicolai and Bewley, Alex},
  booktitle={2018 IEEE Winter Conference on Applications of Computer Vision (WACV)},
  year={2018},
  pages={748--756},
  organization={IEEE},
  doi={10.1109/WACV.2018.00087}
}
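@article{zhang2020fair,
  title={FairMOT: On the Fairness of Detection and Re-Identification in Multiple Object Tracking},
  author={Zhang, Yifu and Wang, Chunyu and Wang, Xinggang and Zeng, Wenjun and Liu, Wenyu},
  journal={arXiv preprint arXiv:2004.01888},
  year={2020}
}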
```