
perf: Improve CPU performance on x64 and arm64 #1968

Open
matthewdouglas wants to merge 27 commits into main from cpu-perf-improvements

@matthewdouglas

Member

Improves CPU performance on x64 and arm64, focusing on the blockwise quantization/dequantization ops. The gains come mostly from better use of SIMD features, with a minor bump from some compile-flag adjustments.

Depending on op, dtype, and hardware, improvements range from 1.1x to over 20x, with the largest gains on fp16 and on x86-64 CPUs without AVX-512.

General changes:

  • Enabled IPO/LTO for builds on Linux/macOS. Enabling this on Windows is left for future work.
  • Added -fno-semantic-interposition on Linux.

More specific changes/benchmarks:


x86-64

dequantize_4bit

Without AVX-512

  • Replaced the branched decision tree with a direct LUT.
  • Used F16C intrinsics for fp32->fp16 conversions.
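
To illustrate the first bullet, here is a minimal scalar sketch in Python, not the actual C++ kernel. `decode_tree` and `dequantize_4bit_lut` are hypothetical names, `decode_tree` is a simplified stand-in for a branch-per-bit decoder, and the packing order (high nibble first) is an assumption. The point is only that a direct index replaces four data-dependent branches per value.

```python
# The 16 NF4 code values, indexed by the 4-bit code.
NF4 = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]


def decode_tree(code: int) -> float:
    """Branch-per-bit decode: one data-dependent branch per bit (illustrative)."""
    lo, hi = 0, 16
    for bit in (8, 4, 2, 1):
        mid = (lo + hi) // 2
        if code & bit:
            lo = mid
        else:
            hi = mid
    return NF4[lo]


def dequantize_4bit_lut(packed: bytes, absmax: list, blocksize: int = 64) -> list:
    """Direct-LUT decode: two table reads and two multiplies per packed byte.

    Assumes the high nibble holds the first element and that blocksize is even,
    so both nibbles of a byte share one absmax scale.
    """
    out = []
    for i, byte in enumerate(packed):
        scale = absmax[(2 * i) // blocksize]
        out.append(NF4[byte >> 4] * scale)
        out.append(NF4[byte & 0x0F] * scale)
    return out
```

Both decoders return the same value for every code; the LUT version simply has no branches for the CPU to mispredict.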

This affects CPUs without AVX-512 support, as well as Windows builds, which are not yet compiled with AVX-512. Speedups for fp32/fp16 range from 2.4x to 23x depending on shape; for bf16 they range from 3.5x to 11x.

Linux, Ryzen 7950X, PyTorch 2.12.0, blocksize=64, NF4, AVX-512 disabled

shape dtype before after speedup
2048x2048 fp32 2.1ms 0.09ms 23x
2048x2048 bf16 2.2ms 0.20ms 11x
2048x2048 fp16 2.5ms 0.11ms 23x
4096x4096 fp32 13.0ms 5.4ms 2.4x
4096x4096 bf16 11.2ms 3.2ms 3.5x
4096x4096 fp16 12.4ms 3.0ms 4.1x
4096x14336 fp32 44.7ms 18.8ms 2.4x
4096x14336 bf16 39.1ms 10.9ms 3.6x
4096x14336 fp16 43.3ms 9.9ms 4.4x
5120x13824 fp32 54.3ms 22.3ms 2.4x
5120x13824 bf16 47.1ms 13.0ms 3.6x
5120x13824 fp16 51.4ms 12.0ms 4.3x
7168x4096 fp32 23.1ms 9.6ms 2.4x
7168x4096 bf16 19.6ms 5.5ms 3.6x
7168x4096 fp16 21.6ms 5.1ms 4.2x
8192x8192 fp32 51.8ms 21.3ms 2.4x
8192x8192 bf16 44.8ms 12.7ms 3.5x
8192x8192 fp16 49.5ms 12.0ms 4.1x

With AVX-512

  • Vectorized nibble extraction in the existing AVX-512 path.
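
As a rough picture of what vectorized nibble extraction means, the Python sketch below splits eight packed bytes at a time using one shift and two masks on a 64-bit integer, instead of handling each byte separately. This is a SWAR analogy rather than the AVX-512 code from this PR; the function names and the high-nibble-first ordering are assumptions.

```python
LOW_NIBBLES = 0x0F0F0F0F0F0F0F0F  # selects the low 4 bits of each of 8 bytes


def nibbles_scalar(packed: bytes) -> list:
    """Reference: extract high then low nibble, one byte at a time."""
    out = []
    for b in packed:
        out.append(b >> 4)
        out.append(b & 0x0F)
    return out


def nibbles_swar(packed: bytes) -> list:
    """Extract nibbles for 8 bytes per step using whole-word shift and mask."""
    if len(packed) % 8 != 0:
        raise ValueError("length must be a multiple of 8 in this sketch")
    out = []
    for off in range(0, len(packed), 8):
        word = int.from_bytes(packed[off:off + 8], "little")
        # The shift moves bits across byte boundaries; the mask discards them.
        hi = ((word >> 4) & LOW_NIBBLES).to_bytes(8, "little")
        lo = (word & LOW_NIBBLES).to_bytes(8, "little")
        for h, l in zip(hi, lo):
            out.append(h)
            out.append(l)
    return out
```

A SIMD register applies the same shift-and-mask idea to 64 bytes per instruction, which is where the gain over per-byte extraction comes from.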

Improvements here range from negligible to 1.8x-2.3x, depending on shape and dtype. Note that these gains were measured on Zen4, which runs AVX-512 as 2x256 lanes; they may be more pronounced on Zen5+ or Intel chips.

Users whose CPUs also support the AVX-512BF16 extension are more likely to take the fused GEMM path for inference than to rely on dequantization. For example, the Zen4 CPU used for these benchmarks would take the AVX-512BF16 path. The dequantization path would still be used on older Intel AVX-512 CPUs.

Linux, Ryzen 7950X, PyTorch 2.12.0, blocksize=64, NF4

shape dtype before after speedup
2048x2048 fp32 0.07ms 0.04ms 1.8x
2048x2048 bf16 0.07ms 0.03ms 2.3x
2048x2048 fp16 0.07ms 0.03ms 2.3x
4096x4096 fp32 5.59ms 5.35ms 1.04x
4096x4096 bf16 2.95ms 2.77ms 1.06x
4096x4096 fp16 2.95ms 2.79ms 1.06x
8192x8192 fp32 22.1ms 22.3ms ~1x
8192x8192 bf16 11.9ms 12.1ms ~1x
8192x8192 fp16 11.6ms 11.6ms ~1x

dequantize_blockwise

  • Improved fp16 only, by using the F16C hardware intrinsics for conversions.

This brings fp16 performance in line with bf16.
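
For context on why a hardware conversion helps, the sketch below spells out in Python roughly what a software fp32->fp16 conversion has to do per element (exponent rebias, subnormal handling, round-to-nearest-even). F16C performs the equivalent in a single instruction, for example via intrinsics such as `_mm256_cvtps_ph`; which intrinsics this PR actually uses is not stated here, and the helper names below are hypothetical.

```python
import struct


def f32_to_f16_bits(x: float) -> int:
    """Round x to fp32, then convert to IEEE fp16 bits (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = (bits >> 16) & 0x8000
    exp = (bits >> 23) & 0xFF
    man = bits & 0x7FFFFF
    if exp == 0xFF:  # inf or NaN
        return sign | 0x7C00 | (0x200 if man else 0)
    e = exp - 127 + 15  # rebias the exponent for fp16
    if e >= 0x1F:  # too large for fp16: overflow to infinity
        return sign | 0x7C00
    if e <= 0:  # result is subnormal (or zero) in fp16
        if e < -10:
            return sign  # below half of the smallest subnormal
        man |= 0x800000  # make the implicit leading 1 explicit
        shift = 14 - e
        half = man >> shift
        rem = man & ((1 << shift) - 1)
        tie = 1 << (shift - 1)
    else:  # normal fp16 result
        half = (e << 10) | (man >> 13)
        rem = man & 0x1FFF
        tie = 0x1000
    # Round to nearest, ties to even; a carry may bump the exponent, which is correct.
    if rem > tie or (rem == tie and (half & 1)):
        half += 1
    return sign | half


def f16_bits_reference(x: float) -> int:
    """Reference using the struct module's half-precision ('e') format."""
    f32 = struct.unpack("<f", struct.pack("<f", x))[0]
    return struct.unpack("<H", struct.pack("<e", f32))[0]
```

The several branches and shifts per element are what make a software fp16 path slower than bf16, whose conversion is mostly a 16-bit truncation with rounding.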

shape dtype before after speedup
4096x4096 fp32 5.98ms 6.00ms ~1x
4096x4096 bf16 3.04ms 3.05ms ~1x
4096x4096 fp16 4.82ms 3.14ms 1.5x
4096x14336 fp32 19.8ms 20.8ms ~1x
4096x14336 bf16 10.1ms 10.6ms ~1x
4096x14336 fp16 15.9ms 10.5ms 1.5x
8192x8192 fp32 22.0ms 23.4ms ~1x
8192x8192 bf16 12.2ms 12.2ms ~1x
8192x8192 fp16 20.4ms 11.6ms 1.8x
2.9M fp32 0.06ms 0.05ms ~1x
2.9M bf16 0.04ms 0.05ms ~1x
2.9M fp16 0.38ms 0.06ms 6x
262K fp32 0.01ms 0.01ms ~1x
262K bf16 0.01ms 0.01ms ~1x
262K fp16 0.06ms 0.01ms 6x

quantize_blockwise

  • Used F16C hardware instructions for fp16 conversions.

fp16 input sees up to 1.3x improvement while fp32/bf16 are relatively flat.


ARM64

dequantize_4bit

Improved the existing NEON path by performing the LUT lookup with a vectorized table lookup (vqtbl4q_u8). This applies to all dtypes and gives a 1.1x-1.4x improvement overall.
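
The following Python sketch emulates the semantics of a 64-byte table lookup like vqtbl4q_u8 (16 byte indices into a 4x16-byte table, out-of-range indices yield 0) and shows one way it could serve as an NF4 LUT: 16 fp32 entries occupy exactly 64 bytes, so each 4-bit code expands to four byte indices. Whether the PR lays out its table this way is an assumption; `tbl4` and `lookup_four` are hypothetical names.

```python
import struct

NF4 = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]

# 16 fp32 values = 64 bytes, the size of a four-register byte table.
TABLE = struct.pack("<16f", *NF4)


def tbl4(table: bytes, indices: bytes) -> bytes:
    """Byte-wise lookup into a 64-byte table; indices >= 64 produce 0."""
    if len(table) != 64:
        raise ValueError("table must be exactly 64 bytes")
    return bytes(table[i] if i < 64 else 0 for i in indices)


def lookup_four(codes: list) -> list:
    """Decode four 4-bit codes to fp32 with a single 16-byte table lookup."""
    # Code n selects bytes 4n..4n+3, i.e. the n-th fp32 entry of the table.
    idx = bytes(4 * n + k for n in codes for k in range(4))
    return list(struct.unpack("<4f", tbl4(TABLE, idx)))
```

In this layout one lookup instruction yields four decoded fp32 values with no per-element branching or scalar loads.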

Apple M4, macOS 15, PyTorch 2.12.0, blocksize=64, NF4

shape dtype before after speedup
2048x2048 fp32 0.69ms 0.55ms 1.3x
2048x2048 bf16 1.19ms 1.10ms 1.1x
2048x2048 fp16 1.06ms 0.85ms 1.2x
4096x4096 fp32 6.09ms 5.07ms 1.2x
4096x4096 bf16 4.98ms 4.46ms 1.1x
4096x4096 fp16 4.87ms 3.57ms 1.4x
8192x8192 fp32 25.40ms 20.33ms 1.2x
8192x8192 bf16 20.55ms 18.73ms 1.1x
8192x8192 fp16 18.77ms 14.93ms 1.3x

dequantize_blockwise

Added a NEON-vectorized implementation for all dtypes: 1.3x-1.6x improvement for fp32 (the most common use case), 2.0x-2.6x for bf16, and 5.7x-7.8x for fp16.

Apple M4, macOS 15, PyTorch 2.12.0, blocksize=256

shape dtype before after speedup
2048x2048 fp32 1.18ms 0.72ms 1.6x
2048x2048 bf16 2.16ms 1.04ms 2.1x
2048x2048 fp16 5.98ms 1.04ms 5.8x
4096x4096 fp32 7.81ms 5.84ms 1.3x
4096x4096 bf16 7.59ms 2.88ms 2.6x
4096x4096 fp16 22.44ms 2.88ms 7.8x
8192x8192 fp32 34.08ms 23.79ms 1.4x
8192x8192 bf16 35.45ms 17.39ms 2.0x
8192x8192 fp16 98.51ms 17.31ms 5.7x
262K fp32 0.08ms 0.05ms 1.6x
917K fp32 0.40ms 0.31ms 1.3x
2.9M fp32 1.38ms 0.93ms 1.5x

quantize_blockwise

Extended existing fp32 NEON absmax reduction to bf16/fp16.

The fp32 performance is unchanged, while bf16/fp16 improve ~3x and 5-6x respectively.
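
For readers unfamiliar with the op, this is a minimal scalar sketch of blockwise absmax quantization in Python. The per-block `max(abs(v))` is the reduction that the NEON path vectorizes. The linear 256-entry codebook is a placeholder chosen for illustration, not the library's actual code map, and both function names mirror the op names only for readability.

```python
# Placeholder codebook: 256 evenly spaced values in [-1, 1] (an assumption).
CODE = [i / 127.5 - 1.0 for i in range(256)]


def quantize_blockwise(values: list, code: list, blocksize: int = 256):
    """Return (indices, absmax): one code index per value, one scale per block."""
    indices, absmax = [], []
    for start in range(0, len(values), blocksize):
        block = values[start:start + blocksize]
        amax = max(abs(v) for v in block)  # absmax reduction over the block
        absmax.append(amax)
        inv = 1.0 / amax if amax > 0.0 else 0.0
        for v in block:
            x = v * inv  # normalized into [-1, 1]
            # Nearest codebook entry (linear scan keeps the sketch simple).
            indices.append(min(range(len(code)), key=lambda i: abs(code[i] - x)))
    return indices, absmax


def dequantize_blockwise(indices: list, absmax: list, code: list, blocksize: int = 256) -> list:
    """Inverse mapping: codebook value times the block's absmax."""
    return [code[q] * absmax[i // blocksize] for i, q in enumerate(indices)]
```

With bf16/fp16 inputs, each element must also be widened to fp32 before the comparison, which is why vectorizing the reduction for those dtypes matters more than for fp32.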

Apple M4, macOS 15, PyTorch 2.12.0, blocksize=256

shape dtype before after speedup
4096x4096 fp32 6.67ms 6.62ms ~1x
4096x4096 bf16 18.00ms 6.20ms 2.9x
4096x4096 fp16 36.98ms 6.19ms 6.0x
8192x8192 fp32 27.01ms 27.01ms ~1x
8192x8192 bf16 76.95ms 27.84ms 2.8x
8192x8192 fp16 152.23ms 28.04ms 5.4x
262K fp32 0.10ms 0.10ms ~1x
262K bf16 0.29ms 0.10ms 2.9x
262K fp16 0.58ms 0.10ms 5.8x
2.9M fp32 1.13ms 1.12ms ~1x
2.9M bf16 3.22ms 1.17ms 2.8x
2.9M fp16 6.43ms 1.18ms 5.4x

@matthewdouglas

Member Author

cc @jiqing-feng for x86-64 visibility and @pdeep854 for arm64 visibility

@github-actions

github-actions Bot commented Jun 9, 2026


The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@matthewdouglas matthewdouglas added this to the v0.50.0 milestone Jun 9, 2026