Unexpected Intel Xeon performance on the leaderboard

#35
by Aaaaadore - opened

Hi, we noticed that in the Intel Xeon results on the leaderboard, BF16 prefill latency is generally higher than FP32 prefill latency, which is unexpected. For example, for qwen1.5-7b:

leaderboard-qwen.png

We ran the benchmark on an AWS c7i.8xlarge instance with the optimum-benchmark tool, and there BF16 shows lower prefill latency and higher decode throughput than FP32:

| Configuration | Prefill (s) | Decode (tokens/s) |
|---------------|-------------|-------------------|
| fp32-eager    | 1.612       | 5.330             |
| bf16-eager    | 0.377       | 7.300             |
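
To make the gap concrete, here is a small sketch that turns the numbers in the table above into BF16-over-FP32 ratios. The dictionary layout and the `bf16_speedups` helper are ours, for illustration only; the figures are copied from our run and are not leaderboard data.

```python
# Numbers copied from our c7i.8xlarge run (see the table above).
results = {
    "fp32-eager": {"prefill_s": 1.612, "decode_tok_per_s": 5.330},
    "bf16-eager": {"prefill_s": 0.377, "decode_tok_per_s": 7.300},
}


def bf16_speedups(results):
    """Return how many times faster BF16 is than FP32 (values > 1 mean BF16 wins)."""
    fp32 = results["fp32-eager"]
    bf16 = results["bf16-eager"]
    return {
        # Prefill is a latency: lower is better, so divide FP32 by BF16.
        "prefill": fp32["prefill_s"] / bf16["prefill_s"],
        # Decode is a throughput: higher is better, so divide BF16 by FP32.
        "decode": bf16["decode_tok_per_s"] / fp32["decode_tok_per_s"],
    }


speedups = bf16_speedups(results)
for phase, ratio in speedups.items():
    print(f"{phase}: BF16 is {ratio:.2f}x faster than FP32")
```

In our run the prefill ratio is far above 1, which is the opposite of what the leaderboard shows for the same model.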

Could there be a mismatch in how the leaderboard results were collected? We ran the benchmark with the optimum-benchmark CLI, using the config below:

defaults:
  - benchmark
  - scenario: inference
  - launcher: process
  - backend: pytorch
  - _base_
  - _self_

name: cpu_pytorch_qwen

launcher:
  numactl: true
  numactl_kwargs:
    cpunodebind: 0
    membind: 0

backend:
  device: cpu
  # export: true
  no_weights: false # on multi-node machines, initializing weights in the benchmark could harm performance
  torch_dtype: bfloat16 # use bfloat16 on compatible Intel CPUs
  model: Qwen/Qwen1.5-7B

scenario:
  memory: true
  latency: true

  input_shapes:
    batch_size: 1
    sequence_length: 256

  generate_kwargs:
    max_new_tokens: 64
    min_new_tokens: 64
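
Since the config comment mentions "compatible Intel CPUs", one thing that may be worth comparing between our machine and the leaderboard machine is whether the CPU reports hardware BF16 support. The sketch below is only a possible diagnostic, not a confirmed explanation: it assumes a Linux host, and assumes that `amx_bf16` and `avx512_bf16` are the relevant flag names as they appear in `/proc/cpuinfo`. The helper names are ours.

```python
from pathlib import Path

# Assumption: these are the Linux /proc/cpuinfo flag names for hardware BF16 support.
BF16_FLAGS = ("amx_bf16", "avx512_bf16")


def parse_cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            return set(line.split(":", 1)[1].split())
    return set()


def bf16_support(cpuinfo_text):
    """Map each BF16-related flag to whether the CPU reports it."""
    flags = parse_cpu_flags(cpuinfo_text)
    return {name: name in flags for name in BF16_FLAGS}


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        print(bf16_support(cpuinfo.read_text()))
    else:
        print("/proc/cpuinfo not found; this check only works on Linux")
```

If the leaderboard machine reports neither flag, BF16 being slower than FP32 there would be less surprising; if it reports both, the difference probably lies elsewhere in the setup.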
