E.T. Chat
arXiv | Project Page | GitHub
E.T. Chat is a novel time-sensitive Video-LLM that reformulates timestamp prediction as an embedding matching problem and serves as a strong baseline on E.T. Bench. It consists of a visual encoder, a frame compressor, and an LLM. A special token <vid> is introduced to trigger frame embedding matching for timestamp prediction.
This checkpoint was trained on a mixture of stage-2 and stage-3 data, which yields much better general chat capabilities at the cost of slightly sub-optimal grounding performance. It should be considered the default setting for this model.
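The embedding matching idea described above can be illustrated with a small toy sketch. This is not the E.T. Chat implementation: the function names `cosine_similarity` and `match_timestamp`, the use of cosine similarity as the matching score, and the `fps` sampling-rate parameter are all assumptions made for illustration. It only shows the general principle of picking the frame whose embedding best matches the embedding produced at a <vid> token, then converting that frame index into a timestamp.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_timestamp(vid_embedding, frame_embeddings, fps):
    """Return (frame_index, seconds) for the frame that best matches vid_embedding.

    Hypothetical helper: assumes frames were sampled uniformly at `fps` frames
    per second and that cosine similarity is the matching score.
    """
    if not frame_embeddings:
        raise ValueError("frame_embeddings must not be empty")
    scores = [cosine_similarity(vid_embedding, f) for f in frame_embeddings]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, best_index / fps


# Toy data: four 3-dimensional "frame embeddings" and one "<vid>" embedding.
frames = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.7, 0.7, 0.0],
]
query = [0.1, 0.9, 0.1]

index, seconds = match_timestamp(query, frames, fps=2.0)
print(f"best frame: {index}, timestamp: {seconds:.2f}s")
```

Framing the problem this way means the model selects a frame rather than generating timestamp digits as text, so the predicted time is always tied to a frame that actually exists in the input.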
Model Details
Model Description
- Developed by: Ye Liu
- Model type: Multi-modal Large Language Model
- Language(s): English
- License: BSD-3-Clause
Training Data
The stage-2+3 checkpoint of E.T. Chat was trained on the ET-Instruct-164K, VideoChatGPT, and LLaVA-1.5-Instruct datasets.
More Details
Please refer to our GitHub Repository for more details about this model.
Citation
Please cite our paper if you find this project helpful.
@inproceedings{liu2024etbench,
title={E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding},
author={Liu, Ye and Ma, Zongyang and Qi, Zhongang and Wu, Yang and Chen, Chang Wen and Shan, Ying},
booktitle={Neural Information Processing Systems (NeurIPS)},
year={2024}
}