File size: 5,518 Bytes
a96d609 bbde80b 370a919 bbde80b |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 |
---
thumbnail: "https://motionbert.github.io/assets/teaser.gif"
tags:
- 3D Human Pose Estimation
- Skeleton-based Action Recognition
- Mesh Recovery
arxiv: "2210.06551"
---
# MotionBERT
This is the official PyTorch implementation of the paper *"[Learning Human Motion Representations: A Unified Perspective](https://arxiv.org/pdf/2210.06551.pdf)"*.
<img src="https://motionbert.github.io/assets/teaser.gif" alt="" style="zoom: 60%;" />
## Installation
```bash
conda create -n motionbert python=3.7 anaconda
conda activate motionbert
# Please install PyTorch according to your CUDA version.
conda install pytorch torchvision torchaudio pytorch-cuda=11.6 -c pytorch -c nvidia
pip install -r requirements.txt
```
## Getting Started
| Task | Document |
| --------------------------------- | ------------------------------------------------------------ |
| Pretrain | [docs/pretrain.md](docs/pretrain.md) |
| 3D human pose estimation | [docs/pose3d.md](docs/pose3d.md) |
| Skeleton-based action recognition | [docs/action.md](docs/action.md) |
| Mesh recovery | [docs/mesh.md](docs/mesh.md) |
## Applications
### In-the-wild inference (for custom videos)
Please refer to [docs/inference.md](docs/inference.md).
### Using MotionBERT for *human-centric* video representations
```python
'''
x: 2D skeletons
type = <class 'torch.Tensor'>
shape = [batch size * frames * joints(17) * channels(3)]
MotionBERT: pretrained human motion encoder
type = <class 'lib.model.DSTformer.DSTformer'>
E: encoded motion representation
type = <class 'torch.Tensor'>
shape = [batch size * frames * joints(17) * channels(512)]
'''
E = MotionBERT.get_representation(x)
```
> **Hints**
>
> 1. The model could handle different input lengths (no more than 243 frames). No need to explicitly specify the input length elsewhere.
> 2. The model uses 17 body keypoints ([H36M format](https://github.com/JimmySuen/integral-human-pose/blob/master/pytorch_projects/common_pytorch/dataset/hm36.py#L32)). If you are using other formats, please convert them before feeding to MotionBERT.
> 3. Please refer to [model_action.py](lib/model/model_action.py) and [model_mesh.py](lib/model/model_mesh.py) for examples of (easily) adapting MotionBERT to different downstream tasks.
> 4. For RGB videos, you need to extract 2D poses ([inference.md](docs/inference.md)), convert the keypoint format ([dataset_wild.py](lib/data/dataset_wild.py)), and then feed to MotionBERT ([infer_wild.py](infer_wild.py)).
>
## Model Zoo
<img src="https://motionbert.github.io/assets/demo.gif" alt="" style="zoom: 50%;" />
| Model | Download Link | Config | Performance |
| ------------------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | ---------------- |
| MotionBERT (162MB) | [HuggingFace](https://huggingface.co/walterzhu/MotionBERT/blob/main/checkpoint/pretrain/MB_release/latest_epoch.bin) | [pretrain/MB_pretrain.yaml](configs/pretrain/MB_pretrain.yaml) | - |
| MotionBERT-Lite (61MB) | [HuggingFace](https://huggingface.co/walterzhu/MotionBERT/blob/main/checkpoint/pretrain/MB_lite/latest_epoch.bin) | [pretrain/MB_lite.yaml](configs/pretrain/MB_lite.yaml) | - |
| 3D Pose (H36M-SH, scratch) | [HuggingFace](https://huggingface.co/walterzhu/MotionBERT/blob/main/checkpoint/pose3d/MB_train_h36m/best_epoch.bin) | [pose3d/MB_train_h36m.yaml](configs/pose3d/MB_train_h36m.yaml) | 39.2mm (MPJPE) |
| 3D Pose (H36M-SH, ft) | [HuggingFace](https://huggingface.co/walterzhu/MotionBERT/blob/main/checkpoint/pose3d/FT_MB_release_MB_ft_h36m/best_epoch.bin) | [pose3d/MB_ft_h36m.yaml](configs/pose3d/MB_ft_h36m.yaml) | 37.2mm (MPJPE) |
| Action Recognition (x-sub, ft) | [HuggingFace](https://huggingface.co/walterzhu/MotionBERT/blob/main/checkpoint/action/FT_MB_release_MB_ft_NTU60_xsub/best_epoch.bin) | [action/MB_ft_NTU60_xsub.yaml](configs/action/MB_ft_NTU60_xsub.yaml) | 97.2% (Top1 Acc) |
| Action Recognition (x-view, ft) | [HuggingFace](https://huggingface.co/walterzhu/MotionBERT/blob/main/checkpoint/action/FT_MB_release_MB_ft_NTU60_xview/best_epoch.bin) | [action/MB_ft_NTU60_xview.yaml](configs/action/MB_ft_NTU60_xview.yaml) | 93.0% (Top1 Acc) |
| Mesh (with 3DPW, ft) | [HuggingFace](https://huggingface.co/walterzhu/MotionBERT/blob/main/checkpoint/mesh/FT_MB_release_MB_ft_pw3d/best_epoch.bin) | [mesh/MB_ft_pw3d.yaml](configs/mesh/MB_ft_pw3d.yaml) | 88.1mm (MPVE) |
In most use cases (especially with finetuning), `MotionBERT-Lite` gives a similar performance with lower computation overhead.
## TODO
- [x] Scripts and docs for pretraining
- [x] Demo for custom videos
## Citation
If you find our work useful for your project, please consider citing the paper:
```bibtex
@article{motionbert2022,
title = {Learning Human Motion Representations: A Unified Perspective},
author = {Zhu, Wentao and Ma, Xiaoxuan and Liu, Zhaoyang and Liu, Libin and Wu, Wayne and Wang, Yizhou},
year = {2022},
journal = {arXiv preprint arXiv:2210.06551},
}
```
|