UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing
Abstract
Human pose plays a crucial role in the digital age. While recent works have achieved impressive progress in understanding and generating human poses, they often support only a single modality of control signals and operate in isolation, limiting their application in real-world scenarios. This paper presents UniPose, a framework employing Large Language Models (LLMs) to comprehend, generate, and edit human poses across various modalities, including images, text, and 3D SMPL poses. Specifically, we apply a pose tokenizer to convert 3D poses into discrete pose tokens, enabling seamless integration into the LLM within a unified vocabulary. To further enhance fine-grained pose perception, we equip UniPose with a mixture of visual encoders, among them a pose-specific visual encoder. Benefiting from a unified learning strategy, UniPose effectively transfers knowledge across different pose-relevant tasks, adapts to unseen tasks, and exhibits extended capabilities. This work serves as the first attempt at building a general-purpose framework for pose comprehension, generation, and editing. Extensive experiments highlight UniPose's competitive and even superior performance across various pose-relevant tasks.
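The abstract does not say how the pose tokenizer works internally, so the following is only a minimal sketch of one common way to turn continuous pose features into discrete tokens that share an id space with text: nearest-neighbour lookup in a codebook, with the resulting indices offset past the text vocabulary. Every name and size here (TEXT_VOCAB_SIZE, CODEBOOK_SIZE, LATENT_DIM, the random codebook, and the three functions) is a hypothetical stand-in and is not taken from UniPose.

```python
import math
import random

# Hypothetical sizes, chosen only for illustration (not from the paper).
TEXT_VOCAB_SIZE = 32000   # assumed size of the LLM's text vocabulary
CODEBOOK_SIZE = 8         # assumed number of discrete pose codes
LATENT_DIM = 4            # assumed dimension of one pose latent vector

random.seed(0)
# Stand-in codebook; a real one would be learned, not sampled at random.
CODEBOOK = [[random.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
            for _ in range(CODEBOOK_SIZE)]


def quantize(latent):
    """Return the index of the codebook entry closest to `latent`."""
    return min(range(CODEBOOK_SIZE),
               key=lambda i: math.dist(latent, CODEBOOK[i]))


def pose_to_token_ids(latents):
    """Map pose latents to token ids placed after the text vocabulary."""
    return [TEXT_VOCAB_SIZE + quantize(z) for z in latents]


def token_ids_to_codes(token_ids):
    """Recover the codebook vectors that the pose token ids refer to."""
    return [CODEBOOK[t - TEXT_VOCAB_SIZE] for t in token_ids]
```

Under these assumptions, pose ids occupy the range from TEXT_VOCAB_SIZE up to TEXT_VOCAB_SIZE + CODEBOOK_SIZE - 1, so a single embedding table and output head could cover both text and pose tokens. That is one plausible reading of "a unified vocabulary", not a description of the authors' implementation.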
Community
This paper introduces UniPose, a unified framework that uses an LLM to comprehend, generate, and edit human poses across diverse modalities (images, text, and 3D SMPL poses). UniPose employs a pose tokenizer to convert 3D poses into discrete tokens, enabling seamless integration into the LLM’s vocabulary. Additionally, it incorporates a mix of visual encoders, including a pose-specific encoder, to enhance fine-grained pose perception. Through a unified learning strategy, UniPose transfers knowledge across tasks and adapts to unseen tasks. As the first general-purpose framework for pose understanding, generation, and editing, UniPose performs competitively across various pose-related tasks.