Pose Estimation Models
List of Pose Estimation Models
The table below shows the pose estimation models available for each task category.
| Category | Model | Documentation |
|---|---|---|
| Whole body | HRNet | |
| | PoseNet | |
| | MoveNet | |
Benchmarks
Inference Speed
The table below shows the frames per second (FPS) of each model type.
| Model | Type | Size | CPU (single) | CPU (multiple) | GPU (single) | GPU (multiple) |
|---|---|---|---|---|---|---|
| PoseNet | 50 | 225 | 64.46 | 51.95 | 136.31 | 89.37 |
| | 75 | 225 | 57.62 | 47.01 | 132.84 | 83.73 |
| | 100 | 225 | 44.70 | 37.60 | 132.73 | 81.24 |
| | resnet | 225 | 18.77 | 17.21 | 73.15 | 51.65 |
| HRNet (YOLO) | (v4tiny) | 256 × 192 (416) | 5.86 | 1.09 | 21.91 | 13.86 |
| MoveNet | SinglePose Lightning | 192 | 40.78 | 40.54 | 99.47 | – |
| | SinglePose Thunder | 256 | 25.13 | 24.87 | 92.05 | – |
| | MultiPose Lightning | 256 or multiple of 32 | 25.33 | 24.90 | 80.64 | 79.32 |
Hardware
The following hardware was used to conduct the FPS benchmarks:
- CPU: 2.8 GHz 4-Core Intel Xeon (2020, Cascade Lake) CPU and 16GB RAM
- GPU: NVIDIA A100, paired with 2.2 GHz 6-Core Intel Xeon CPU and 85GB RAM
Test Conditions
The following test conditions were followed:
- The `input.visual`, the model of interest, and `dabble.fps` nodes were used to perform inference on videos
- 2 videos were used to benchmark each model, one with only 1 human (single), and the other with multiple humans (multiple)
- Both videos are about 1 minute each, recorded at ~30 FPS, which translates to about 1,800 frames to process per video
- 1280×720 (HD ready) resolution was used, as a bridge between the 640×480 (VGA) of poorer-quality webcams and the 1920×1080 (Full HD) of CCTVs
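For intuition, a per-frame FPS reading can be derived from a rolling window of frame timestamps. The sketch below is an illustrative reimplementation of that idea, not the actual `dabble.fps` node; the class name and window size are our own choices.

```python
from collections import deque


class FPSCounter:
    """Rolling-average FPS counter (illustrative sketch only,
    not the actual dabble.fps implementation)."""

    def __init__(self, window: int = 30) -> None:
        # Keep only the most recent `window` frame timestamps.
        self.timestamps: deque = deque(maxlen=window)

    def update(self, now: float) -> float:
        """Record a frame timestamp (seconds) and return the FPS estimate."""
        self.timestamps.append(now)
        if len(self.timestamps) < 2:
            return 0.0
        elapsed = self.timestamps[-1] - self.timestamps[0]
        # N timestamps span N - 1 frame intervals.
        return (len(self.timestamps) - 1) / elapsed if elapsed > 0 else 0.0


# Feeding one frame every 1/30 s should report ~30 FPS.
counter = FPSCounter()
fps = 0.0
for i in range(31):
    fps = counter.update(i / 30)
print(round(fps))  # → 30
```

A rolling window smooths out per-frame jitter while still tracking sustained throughput changes, which is why benchmark pipelines typically report a windowed average rather than the instantaneous frame interval.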
Model Accuracy
The table below shows the performance of our pose estimation models using the keypoint evaluation metrics from COCO. A description of these metrics can be found here.
| Model | Type | Size | AP | AP (OKS=.50) | AP (OKS=.75) | AP (medium) | AP (large) | AR | AR (OKS=.50) | AR (OKS=.75) | AR (medium) | AR (large) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PoseNet | 50 | 225 | 5.2 | 15.5 | 2.7 | 0.8 | 11.8 | 9.6 | 22.7 | 7.1 | 1.4 | 20.7 |
| | 75 | 225 | 7.2 | 19.7 | 3.6 | 1.3 | 15.9 | 12.1 | 26.5 | 9.3 | 2.2 | 25.5 |
| | 100 | 225 | 7.7 | 20.8 | 4.4 | 1.5 | 17.1 | 12.6 | 27.7 | 10.1 | 2.3 | 26.5 |
| | resnet | 225 | 11.9 | 27.4 | 8.3 | 2.2 | 25.3 | 17.3 | 32.5 | 15.9 | 2.9 | 36.8 |
| HRNet (YOLO) | (v4tiny) | 256 × 192 (416) | 35.8 | 61.5 | 37.5 | 30.1 | 44.0 | 40.2 | 64.4 | 42.7 | 33.0 | 50.2 |
| MoveNet | singlepose_lightning | 256 × 256 | 7.3 | 15.7 | 5.7 | 1.3 | 15.4 | 8.8 | 17.6 | 7.7 | 1.1 | 19.2 |
| | singlepose_thunder | 256 × 256 | 11.6 | 21.3 | 10.7 | 3.0 | 23.1 | 13.1 | 22.5 | 12.8 | 2.8 | 27.1 |
| | multipose_lightning | 256 × 256 | 18.7 | 36.8 | 16.3 | 9.0 | 31.8 | 21.0 | 38.5 | 19.2 | 9.3 | 37.0 |
Dataset
The MS COCO (val 2017) dataset is used. We integrated the COCO API into the PeekingDuck pipeline for loading the annotations and evaluating the outputs from the models. All values are reported in percentages.
All images from the “person” category in the MS COCO (val 2017) dataset were processed.
Test Conditions
The following test conditions were followed:
- The tests were performed using `pycocotools` on the MS COCO dataset
- The evaluation metrics have been compared with the original repositories of the respective pose estimation models for consistency
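The AP and AR figures above are averaged over Object Keypoint Similarity (OKS) thresholds. As a rough illustration of the metric that `pycocotools` computes (this is our own minimal reimplementation, not PeekingDuck or COCO code), OKS for one person is a visibility-weighted mean of per-keypoint Gaussian scores, using the per-keypoint sigmas from the COCO keypoint evaluation:

```python
import math

# Per-keypoint sigmas from the COCO keypoint evaluation
# (nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles).
COCO_SIGMAS = [
    0.026, 0.025, 0.025, 0.035, 0.035, 0.079, 0.079, 0.072, 0.072,
    0.062, 0.062, 0.107, 0.107, 0.087, 0.087, 0.089, 0.089,
]


def oks(pred, gt, visibility, area):
    """Object Keypoint Similarity between one prediction and one annotation.

    pred, gt: lists of (x, y) keypoints; visibility: ground-truth v_i flags;
    area: ground-truth object segment area (the s^2 scale term).
    Each labelled keypoint contributes exp(-d_i^2 / (2 * s^2 * k_i^2)),
    with k_i = 2 * sigma_i, as in the COCO evaluation.
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, sigma in zip(pred, gt, visibility, COCO_SIGMAS):
        if v > 0:  # only labelled keypoints contribute
            d2 = (px - gx) ** 2 + (py - gy) ** 2
            k2 = (2 * sigma) ** 2
            total += math.exp(-d2 / (2 * area * k2))
            count += 1
    return total / count if count else 0.0


# A perfect prediction scores 1.0; errors decay the score toward 0.
gt = [(float(i), float(i)) for i in range(17)]
perfect = oks(gt, gt, [1] * 17, area=10_000.0)
print(perfect)  # → 1.0
```

AP at OKS=.50 then counts a detection as correct when its OKS with a ground-truth person exceeds 0.50, analogous to the IoU thresholds used in object detection.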
Keypoint IDs
Whole Body
| Keypoint | ID | Keypoint | ID |
|---|---|---|---|
| nose | 0 | left wrist | 9 |
| left eye | 1 | right wrist | 10 |
| right eye | 2 | left hip | 11 |
| left ear | 3 | right hip | 12 |
| right ear | 4 | left knee | 13 |
| left shoulder | 5 | right knee | 14 |
| right shoulder | 6 | left ankle | 15 |
| left elbow | 7 | right ankle | 16 |
| right elbow | 8 | | |
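When post-processing model outputs, it is convenient to have the table above as a mapping. The snippet below simply transcribes the keypoint IDs into a Python dict (the constant names are our own):

```python
# Whole-body keypoint IDs, transcribed from the table above.
KEYPOINT_IDS = {
    "nose": 0, "left eye": 1, "right eye": 2, "left ear": 3, "right ear": 4,
    "left shoulder": 5, "right shoulder": 6, "left elbow": 7, "right elbow": 8,
    "left wrist": 9, "right wrist": 10, "left hip": 11, "right hip": 12,
    "left knee": 13, "right knee": 14, "left ankle": 15, "right ankle": 16,
}

# Invert the mapping to recover names from the IDs a model emits.
ID_TO_NAME = {v: k for k, v in KEYPOINT_IDS.items()}

print(KEYPOINT_IDS["left wrist"])  # → 9
print(ID_TO_NAME[16])              # → right ankle
```

Indexing a model's 17×2 keypoint array with these IDs gives named access to individual joints, e.g. `keypoints[KEYPOINT_IDS["left wrist"]]`.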