optimum/llm-perf-leaderboard · Unexpected Intel Xeon performance on the leaderboard

Hi, we noticed that in the Intel Xeon results on the leaderboard, BF16 prefill latency is generally higher than FP32, which is unexpected. Qwen1.5-7B is one example.

We ran the benchmark on an AWS c7i.8xlarge instance with the optimum-benchmark tool, and there BF16 shows lower prefill latency and higher decode throughput than FP32:

Config	Prefill (s)	Decode (tokens/s)
fp32-eager	1.612	5.330
bf16-eager	0.377	7.300
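
To make the gap easier to compare against the leaderboard entries, the table above can be reduced to two ratios. This is only a small sketch: the dictionary keys and the `speedups` helper are names made up for illustration, and the numbers are copied from the table.

```python
# Figures copied from the results table above.
results = {
    "fp32-eager": {"prefill_s": 1.612, "decode_tok_s": 5.330},
    "bf16-eager": {"prefill_s": 0.377, "decode_tok_s": 7.300},
}


def speedups(baseline, candidate):
    """Return how many times faster `candidate` is than `baseline`.

    Prefill is a latency (lower is better), so the ratio is baseline / candidate.
    Decode is a throughput (higher is better), so the ratio is candidate / baseline.
    """
    return {
        "prefill": baseline["prefill_s"] / candidate["prefill_s"],
        "decode": candidate["decode_tok_s"] / baseline["decode_tok_s"],
    }


ratios = speedups(results["fp32-eager"], results["bf16-eager"])
print(f"BF16 prefill speedup over FP32: {ratios['prefill']:.2f}x")
print(f"BF16 decode speedup over FP32: {ratios['decode']:.2f}x")
```

A ratio below 1.0 for the same pair on the leaderboard would correspond to the unexpected behaviour described here.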

Could there be a mismatch in how the leaderboard numbers were collected? We ran the benchmark with the optimum-benchmark CLI, using the config below:

defaults:
  - benchmark
  - scenario: inference
  - launcher: process
  - backend: pytorch
  - _base_
  - _self_

name: cpu_pytorch_qwen

launcher:
  numactl: true
  numactl_kwargs:
    cpunodebind: 0
    membind: 0

backend:
  device: cpu
  # export: true
  no_weights: false # on multi-node machines, initializing weights in the benchmark could harm performance
  torch_dtype: bfloat16 # use bfloat16 on compatible Intel CPUs
  model: Qwen/Qwen1.5-7B

scenario:
  memory: true
  latency: true

  input_shapes:
    batch_size: 1
    sequence_length: 256

  generate_kwargs:
    max_new_tokens: 64
    min_new_tokens: 64
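
One thing that may be worth comparing between the leaderboard machine and ours is whether each CPU reports native BF16 instructions, since without them BF16 could plausibly end up slower than FP32. Below is a minimal sketch, assuming a Linux host and the `/proc/cpuinfo` flag names `avx512_bf16` and `amx_bf16`. The `bf16_flags` helper is hypothetical and not part of optimum-benchmark.

```python
from pathlib import Path

# Assumption: these are the Linux /proc/cpuinfo flag names for native BF16 support.
BF16_FLAGS = {"avx512_bf16", "amx_bf16"}


def bf16_flags(cpuinfo_text):
    """Return the BF16-related CPU flags found in /proc/cpuinfo-style text."""
    found = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            # Each logical CPU repeats the flags line; take the union of all of them.
            found |= BF16_FLAGS & set(value.split())
    return found


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        flags = bf16_flags(cpuinfo.read_text())
        print(sorted(flags) if flags else "no native BF16 flags reported")
    else:
        print("/proc/cpuinfo not found (not a Linux host?)")
```

Running this on both machines would show whether the two sets of numbers come from CPUs with the same BF16 capabilities.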