Pose Estimation Models
List of Pose Estimation Models
The table below shows the pose estimation models available for each task category.
| Category | Model | Documentation |
|---|---|---|
| Whole body | HRNet | |
| | PoseNet | |
| | MoveNet | |
Benchmarks
Inference Speed
The table below shows the frames per second (FPS) of each model type.
| Model | Type | Size | CPU (single) | CPU (multiple) | GPU (single) | GPU (multiple) |
|---|---|---|---|---|---|---|
| PoseNet | 50 | 225 | 64.46 | 51.95 | 136.31 | 89.37 |
| | 75 | 225 | 57.62 | 47.01 | 132.84 | 83.73 |
| | 100 | 225 | 44.70 | 37.60 | 132.73 | 81.24 |
| | resnet | 225 | 18.77 | 17.21 | 73.15 | 51.65 |
| HRNet (YOLO) | (v4tiny) | 256 × 192 (416) | 5.86 | 1.09 | 21.91 | 13.86 |
| MoveNet | SinglePose Lightning | 192 | 40.78 | 40.54 | 99.47 | – |
| | SinglePose Thunder | 256 | 25.13 | 24.87 | 92.05 | – |
| | MultiPose Lightning | 256 or multiple of 32 | 25.33 | 24.90 | 80.64 | 79.32 |
Hardware
The following hardware was used to conduct the FPS benchmarks:
- CPU: 2.8 GHz 4-Core Intel Xeon (2020, Cascade Lake) CPU and 16GB RAM
- GPU: NVIDIA A100, paired with 2.2 GHz 6-Core Intel Xeon CPU and 85GB RAM
Test Conditions
The following test conditions were followed:
- The `input.visual`, the model of interest, and `dabble.fps` nodes were used to perform inference on videos
- 2 videos were used to benchmark each model, one with only 1 human (single), and the other with multiple humans (multiple)
- Both videos are about 1 minute each, recorded at ~30 FPS, which translates to about 1,800 frames to process per video
- 1280×720 (HD ready) resolution was used, as a bridge between the 640×480 (VGA) of poorer-quality webcams and the 1920×1080 (Full HD) of CCTVs
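For intuition, a per-frame FPS reading can be derived from a rolling window of frame timestamps. The sketch below is an illustrative reimplementation of that idea, not the actual `dabble.fps` node; the class name and window size are our own choices.

```python
from collections import deque


class FPSCounter:
    """Rolling-average FPS counter (illustrative sketch only,
    not the actual dabble.fps implementation)."""

    def __init__(self, window: int = 30) -> None:
        # Keep only the most recent `window` frame timestamps.
        self.timestamps: deque = deque(maxlen=window)

    def update(self, now: float) -> float:
        """Record a frame timestamp (seconds) and return the FPS estimate."""
        self.timestamps.append(now)
        if len(self.timestamps) < 2:
            return 0.0
        elapsed = self.timestamps[-1] - self.timestamps[0]
        # N timestamps span N - 1 frame intervals.
        return (len(self.timestamps) - 1) / elapsed if elapsed > 0 else 0.0


# Feeding one frame every 1/30 s should report ~30 FPS.
counter = FPSCounter()
fps = 0.0
for i in range(31):
    fps = counter.update(i / 30)
print(round(fps))  # → 30
```

A rolling window smooths out per-frame jitter while still tracking sustained throughput changes, which is why benchmark pipelines typically report a windowed average rather than the instantaneous frame interval.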
Model Accuracy
The table below shows the performance of our pose estimation models using the keypoint evaluation metrics from COCO. A description of these metrics can be found here.
| Model | Type | Size | AP | AP (OKS=.50) | AP (OKS=.75) | AP (medium) | AP (large) | AR | AR (OKS=.50) | AR (OKS=.75) | AR (medium) | AR (large) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PoseNet | 50 | 225 | 5.2 | 15.5 | 2.7 | 0.8 | 11.8 | 9.6 | 22.7 | 7.1 | 1.4 | 20.7 |
| | 75 | 225 | 7.2 | 19.7 | 3.6 | 1.3 | 15.9 | 12.1 | 26.5 | 9.3 | 2.2 | 25.5 |
| | 100 | 225 | 7.7 | 20.8 | 4.4 | 1.5 | 17.1 | 12.6 | 27.7 | 10.1 | 2.3 | 26.5 |
| | resnet | 225 | 11.9 | 27.4 | 8.3 | 2.2 | 25.3 | 17.3 | 32.5 | 15.9 | 2.9 | 36.8 |
| HRNet (YOLO) | (v4tiny) | 256 × 192 (416) | 35.8 | 61.5 | 37.5 | 30.1 | 44.0 | 40.2 | 64.4 | 42.7 | 33.0 | 50.2 |
| MoveNet | singlepose_lightning | 256 × 256 | 7.3 | 15.7 | 5.7 | 1.3 | 15.4 | 8.8 | 17.6 | 7.7 | 1.1 | 19.2 |
| | singlepose_thunder | 256 × 256 | 11.6 | 21.3 | 10.7 | 3.0 | 23.1 | 13.1 | 22.5 | 12.8 | 2.8 | 27.1 |
| | multipose_lightning | 256 × 256 | 18.7 | 36.8 | 16.3 | 9.0 | 31.8 | 21.0 | 38.5 | 19.2 | 9.3 | 37.0 |
Dataset
The MS COCO (val 2017) dataset is used. We integrated the COCO API into the PeekingDuck pipeline for loading the annotations and evaluating the outputs from the models. All values are reported in percentages.
All images from the “person” category in the MS COCO (val 2017) dataset were processed.
Test Conditions
The following test conditions were followed:
- The tests were performed using `pycocotools` on the MS COCO dataset
- The evaluation metrics have been compared with the original repositories of the respective pose estimation models for consistency
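The AP and AR figures above are averaged over Object Keypoint Similarity (OKS) thresholds. As a rough illustration of the metric that `pycocotools` computes (this is our own minimal reimplementation, not PeekingDuck or COCO code), OKS for one person is a visibility-weighted mean of per-keypoint Gaussian scores, using the per-keypoint sigmas from the COCO keypoint evaluation:

```python
import math

# Per-keypoint sigmas from the COCO keypoint evaluation
# (nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles).
COCO_SIGMAS = [
    0.026, 0.025, 0.025, 0.035, 0.035, 0.079, 0.079, 0.072, 0.072,
    0.062, 0.062, 0.107, 0.107, 0.087, 0.087, 0.089, 0.089,
]


def oks(pred, gt, visibility, area):
    """Object Keypoint Similarity between one prediction and one annotation.

    pred, gt: lists of (x, y) keypoints; visibility: ground-truth v_i flags;
    area: ground-truth object segment area (the s^2 scale term).
    Each labelled keypoint contributes exp(-d_i^2 / (2 * s^2 * k_i^2)),
    with k_i = 2 * sigma_i, as in the COCO evaluation.
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, sigma in zip(pred, gt, visibility, COCO_SIGMAS):
        if v > 0:  # only labelled keypoints contribute
            d2 = (px - gx) ** 2 + (py - gy) ** 2
            k2 = (2 * sigma) ** 2
            total += math.exp(-d2 / (2 * area * k2))
            count += 1
    return total / count if count else 0.0


# A perfect prediction scores 1.0; errors decay the score toward 0.
gt = [(float(i), float(i)) for i in range(17)]
perfect = oks(gt, gt, [1] * 17, area=10_000.0)
print(perfect)  # → 1.0
```

AP at OKS=.50 then counts a detection as correct when its OKS with a ground-truth person exceeds 0.50, analogous to the IoU thresholds used in object detection.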
Keypoint IDs
Whole Body
| Keypoint | ID | Keypoint | ID |
|---|---|---|---|
| nose | 0 | left wrist | 9 |
| left eye | 1 | right wrist | 10 |
| right eye | 2 | left hip | 11 |
| left ear | 3 | right hip | 12 |
| right ear | 4 | left knee | 13 |
| left shoulder | 5 | right knee | 14 |
| right shoulder | 6 | left ankle | 15 |
| left elbow | 7 | right ankle | 16 |
| right elbow | 8 | | |
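When post-processing model outputs, it is convenient to have the table above as a mapping. The snippet below simply transcribes the keypoint IDs into a Python dict (the constant names are our own):

```python
# Whole-body keypoint IDs, transcribed from the table above.
KEYPOINT_IDS = {
    "nose": 0, "left eye": 1, "right eye": 2, "left ear": 3, "right ear": 4,
    "left shoulder": 5, "right shoulder": 6, "left elbow": 7, "right elbow": 8,
    "left wrist": 9, "right wrist": 10, "left hip": 11, "right hip": 12,
    "left knee": 13, "right knee": 14, "left ankle": 15, "right ankle": 16,
}

# Invert the mapping to recover names from the IDs a model emits.
ID_TO_NAME = {v: k for k, v in KEYPOINT_IDS.items()}

print(KEYPOINT_IDS["left wrist"])  # → 9
print(ID_TO_NAME[16])              # → right ankle
```

Indexing a model's 17×2 keypoint array with these IDs gives named access to individual joints, e.g. `keypoints[KEYPOINT_IDS["left wrist"]]`.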