
Model Card for AutoModel

- AutoModel is a multimodal model that supports image, text, and speech inputs...

3. Provide downloadable files

  • Model weights file (e.g., AutoModel.pth).
  • Configuration file (e.g., config.json).
  • Dependency file (e.g., requirements.txt).
  • Run script (e.g., run_model.py).

Users can download these files directly and run the model.
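
As an alternative to cloning the repository, individual files can also be fetched programmatically with huggingface_hub. This is only a sketch: the repo id zeroMN/AutoModel is taken from the clone URL later in this card, and the file names are those listed above.

from huggingface_hub import hf_hub_download

# Download weights and configuration from the Hub (repo id assumed from this card).
weights_path = hf_hub_download(repo_id="zeroMN/AutoModel", filename="AutoModel.pth")
config_path = hf_hub_download(repo_id="zeroMN/AutoModel", filename="config.json")
print(weights_path, config_path)

Once downloaded, the weights and configuration can be loaded as in the steps below.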

1. import torch
   from model import AutoModel, Config

2. config = Config(config_file="path/to/config.json")
   model = AutoModel(config)
   model.load_state_dict(torch.load("path/to/AutoModel.pth"))
   model.eval()

4. Limitations on automatically running the model

The Hugging Face Hub itself cannot automatically run uploaded models, but the interfaces provided by Spaces address this. A Space can host an inference service so that users can test the model without any local setup.
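
A minimal Space could be a small Gradio app like the sketch below. This is only an illustration: the file names, and the way inputs would be preprocessed and passed to AutoModel, are assumptions rather than the repository's actual demo code.

import gradio as gr
import torch
from model import AutoModel, Config

# Load the model once at startup (file names are assumptions).
config = Config(config_file="config.json")
model = AutoModel(config)
model.load_state_dict(torch.load("AutoModel.pth", map_location="cpu"))
model.eval()

def predict(image, text, audio):
    # Real preprocessing would convert the raw inputs into the tensors
    # expected by model(image, text, audio); this stub only reports
    # which inputs were provided.
    received = [name for name, value in
                (("image", image), ("text", text), ("audio", audio)) if value is not None]
    return "Received inputs: " + (", ".join(received) or "none")

demo = gr.Interface(
    fn=predict,
    inputs=[gr.Image(type="pil"), gr.Textbox(), gr.Audio(type="filepath")],
    outputs="text",
)

if __name__ == "__main__":
    demo.launch()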


Recommended approaches

  • Quick testing: use Hugging Face Spaces to create an online demo.
  • Advanced use: provide complete run instructions in the model card so users can run the model locally.

With these approaches, the model repository supports both online execution and offline deployment. The script below is a complete local inference example:

import os
import torch
from model import AutoModel, Config

def load_model(model_path, config_path):
    """
    Load the model weights and configuration.
    """
    # Load the configuration
    if not os.path.exists(config_path):
        raise FileNotFoundError(f"Config file not found: {config_path}")
    print(f"Loading config file: {config_path}")
    config = Config(config_file=config_path)

    # Initialize the model
    model = AutoModel(config)

    # Load the weights
    if not os.path.exists(model_path):
        raise FileNotFoundError(f"Model file not found: {model_path}")
    print(f"Loading model weights: {model_path}")
    state_dict = torch.load(model_path, map_location=torch.device("cpu"))
    model.load_state_dict(state_dict)
    model.eval()
    print("Model loaded successfully and set to evaluation mode.")

    return model, config


def run_inference(model, config):
    """
    Run inference with the model.
    """
    # Dummy example inputs
    image = torch.randn(1, 3, 224, 224)  # image input
    text = torch.randn(1, config.max_position_embeddings, config.hidden_size)  # text input
    audio = torch.randn(1, config.audio_sample_rate)  # audio input

    # Model inference
    outputs = model(image, text, audio)
    vqa_output, caption_output, retrieval_output, asr_output, realtime_asr_output = outputs

    # Print the results
    print("\nInference results:")
    print(f"VQA output shape: {vqa_output.shape}")
    print(f"Caption output shape: {caption_output.shape}")
    print(f"Retrieval output shape: {retrieval_output.shape}")
    print(f"ASR output shape: {asr_output.shape}")
    print(f"Realtime ASR output shape: {realtime_asr_output.shape}")

if __name__ == "__main__":
    # File paths
    model_path = "AutoModel.pth"
    config_path = "config.json"

    try:
        # Load the model
        model, config = load_model(model_path, config_path)

        # Run inference
        run_inference(model, config)
    except Exception as e:
        print(f"Run failed: {e}")

Model Description

-- AutoModel is a multimodal deep learning model designed to process and fuse data from three modalities: images, text, and audio. It supports a variety of downstream tasks, including Visual Question Answering (VQA), captioning, information retrieval, Automatic Speech Recognition (ASR), and real-time ASR.

-- The model employs separate encoders for each modality (image, text, audio) and combines their outputs through a fusion layer. It is built with PyTorch and leverages a modular architecture for flexible fine-tuning and deployment.
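
To make the described architecture concrete, here is a minimal sketch of a three-encoder fusion design. The module names, dimensions, and fusion strategy are illustrative assumptions, not the model's actual implementation.

import torch
import torch.nn as nn

class FusionSketch(nn.Module):
    """Illustrative three-encoder fusion module (not the actual AutoModel code)."""

    def __init__(self, hidden_size=768):
        super().__init__()
        # Placeholder per-modality encoders; the real model uses dedicated encoders.
        self.image_encoder = nn.Linear(3 * 224 * 224, hidden_size)
        self.text_encoder = nn.Linear(hidden_size, hidden_size)
        self.audio_encoder = nn.Linear(16000, hidden_size)
        # Fusion layer combines the three modality embeddings.
        self.fusion = nn.Linear(3 * hidden_size, hidden_size)

    def forward(self, image, text, audio):
        img_emb = self.image_encoder(image.flatten(1))
        txt_emb = self.text_encoder(text.mean(dim=1))  # pool over sequence length
        aud_emb = self.audio_encoder(audio)
        fused = self.fusion(torch.cat([img_emb, txt_emb, aud_emb], dim=-1))
        return fused

model = FusionSketch()
fused = model(torch.randn(1, 3, 224, 224), torch.randn(1, 512, 768), torch.randn(1, 16000))
print(fused.shape)  # torch.Size([1, 768])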

  • Developed by: Independent researcher
  • Funded by: Self-funded
  • Shared by: Independent researcher
  • Model type: Multimodal
  • Language(s) (NLP): English, Chinese (zh)
  • License: Apache-2.0
  • Finetuned from model: None

Model Sources

  • Repository: GitHub Repository Placeholder (Add link to code repository)
  • Paper [optional]:
  • Demo [optional]:

How to Use the Model


1. Clone the repository:

   git clone https://huggingface.co/zeroMN/AutoModel

2. Install the dependencies:

   pip install torch transformers

3. Import the model code:

   import torch
   from model import AutoModel, Config

4. Load the configuration and weights:

   config = Config(config_file="path/to/config.json")
   model = AutoModel(config)
   model.load_state_dict(torch.load("path/to/AutoModel.pth"))
   model.eval()

5. Run a forward pass on example inputs:

   image = torch.randn(1, 3, 224, 224)
   text = torch.randn(1, 512, 768)
   audio = torch.randn(1, 16000)

   outputs = model(image, text, audio)
   print(outputs)
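
If a GPU is available, the model and example inputs can be moved to it before the forward pass. This is a generic PyTorch sketch, assuming AutoModel behaves as a standard torch.nn.Module, as the loading code above suggests.

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Create the example inputs directly on the chosen device.
image = torch.randn(1, 3, 224, 224, device=device)
text = torch.randn(1, 512, 768, device=device)
audio = torch.randn(1, 16000, device=device)

with torch.no_grad():
    outputs = model(image, text, audio)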

Direct Use

-- AutoModel is intended for research and application development in multimodal tasks. It can process and integrate data from multiple input types (images, text, audio) for tasks like VQA, captioning, and ASR.

Downstream Use [optional]

-- AutoModel can be fine-tuned on specific datasets to optimize its performance for custom tasks in various domains, such as medical image-text analysis, video-audio subtitling, and real-time speech-to-text systems.
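
As an illustration of such fine-tuning, here is a minimal training-loop sketch that reuses the model loaded above. The toy random dataset, the assumption that the VQA head returns logits of shape (batch, num_classes), and the choice of loss are all placeholders for a real task setup.

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset of random tensors standing in for a real labeled dataset.
dataset = TensorDataset(
    torch.randn(32, 3, 224, 224),   # images
    torch.randn(32, 512, 768),      # text features
    torch.randn(32, 16000),         # audio
    torch.randint(0, 10, (32,)),    # labels (10 classes assumed)
)
loader = DataLoader(dataset, batch_size=8, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()

model.train()
for image, text, audio, label in loader:
    optimizer.zero_grad()
    vqa_output, *_ = model(image, text, audio)  # use the VQA head as an example
    loss = loss_fn(vqa_output, label)
    loss.backward()
    optimizer.step()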

Out-of-Scope Use

  • Tasks outside its multimodal capabilities (e.g., pure text processing without fusion).
  • Non-English language tasks (unless retrained with a multilingual tokenizer and data).

Bias, Risks, and Limitations

Recommendations

Users should be aware of potential biases in pre-trained encoders and datasets, such as demographic biases in images, text, or speech. Before deployment, it is recommended to evaluate the model's fairness and robustness in real-world settings.

How to Get Started with the Model

-- Use the code below to get started with the model:

from model import AutoModel, Config
import torch

# Load configuration and model
config = Config(config_file="path/to/config.json")
model = AutoModel(config)
model.load_state_dict(torch.load("path/to/AutoModel.pth"))
model.eval()

# Prepare inputs
image = torch.randn(1, 3, 224, 224)
text = torch.randn(1, 512, 768)
audio = torch.randn(1, 16000)

# Perform forward pass
outputs = model(image, text, audio)
print("Model outputs:", outputs)
