perf: Improve CPU performance on x64 and arm64 by matthewdouglas · Pull Request #1968 · bitsandbytes-foundation/bitsandbytes

matthewdouglas · 2026-06-09T18:22:45Z

Improves CPU performance on x64 and arm64, focusing on the blockwise quantization/dequantization ops. The improvements come mostly from better use of SIMD features, with a minor bump from adjusted compile flags.

Depending on op, dtype, and hardware, improvements range from 1.1x to over 20x, with the largest gains on fp16 and on x86-64 CPUs without AVX-512.

General changes:

Enabled IPO/LTO for builds on Linux/macOS. Enabling this for Windows is left for future work.
Added `-fno-semantic-interposition` on Linux.

More specific changes/benchmarks:

x86-64

`dequantize_4bit`

Without AVX-512

Replaced the branched decision tree with a direct LUT.
Used the F16C intrinsic for fp32->fp16 conversions.

This affects CPUs without AVX-512 support, and Windows builds, which do not target AVX-512 yet. Improvements for fp32/fp16 range from 2.4x to 23x depending on shape; for bf16, from 3.5x to 11x.

Benchmark setup: Linux, Ryzen 7950X, PyTorch 2.12.0, blocksize=64, NF4, AVX-512 disabled

shape	dtype	before	after	speedup
2048x2048	fp32	2.1ms	0.09ms	23x
2048x2048	bf16	2.2ms	0.20ms	11x
2048x2048	fp16	2.5ms	0.11ms	23x
4096x4096	fp32	13.0ms	5.4ms	2.4x
4096x4096	bf16	11.2ms	3.2ms	3.5x
4096x4096	fp16	12.4ms	3.0ms	4.1x
4096x14336	fp32	44.7ms	18.8ms	2.4x
4096x14336	bf16	39.1ms	10.9ms	3.6x
4096x14336	fp16	43.3ms	9.9ms	4.4x
5120x13824	fp32	54.3ms	22.3ms	2.4x
5120x13824	bf16	47.1ms	13.0ms	3.6x
5120x13824	fp16	51.4ms	12.0ms	4.3x
7168x4096	fp32	23.1ms	9.6ms	2.4x
7168x4096	bf16	19.6ms	5.5ms	3.6x
7168x4096	fp16	21.6ms	5.1ms	4.2x
8192x8192	fp32	51.8ms	21.3ms	2.4x
8192x8192	bf16	44.8ms	12.7ms	3.5x
8192x8192	fp16	49.5ms	12.0ms	4.1x
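
To illustrate the "branched tree to direct LUT" change above, here is a minimal scalar sketch in Python. It is not the kernel from this PR. The NF4 values are the commonly published ones for the NF4 data type, and the nibble order (high nibble first) and the function names are assumptions made for the example.

```python
# NF4 code values as commonly published for the NF4 data type (assumed here;
# check the library source before relying on them).
NF4_LUT = (
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
)


def decode_branch_tree(code):
    """Hypothetical 'before': walk a binary tree, one branch per bit of the code."""
    lo, hi = 0, 16
    for bit in (8, 4, 2, 1):  # four data-dependent branches per element
        mid = (lo + hi) // 2
        if code & bit:
            lo = mid
        else:
            hi = mid
    return NF4_LUT[lo]


def dequantize_4bit_lut(packed, absmax, blocksize=64):
    """Sketch of the 'after': unpack two codes per byte, index the table, scale."""
    out = []
    for i, byte in enumerate(packed):
        # Assumption: the high nibble holds the first element of each pair.
        for j, code in enumerate((byte >> 4, byte & 0x0F)):
            scale = absmax[(2 * i + j) // blocksize]
            out.append(NF4_LUT[code] * scale)  # one indexed load, no branches
    return out
```

Both decoders return the same value for every code; the table form simply removes the per-element branches, which is what makes it straightforward to vectorize.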

With AVX-512

Vectorized nibble extraction in the existing AVX-512 path.

Improvements here range from negligible up to 1.8x-2.3x, depending on shape and dtype. One note: the AVX-512 gains measured here are modest on Zen4, which executes 512-bit operations as two 256-bit halves (2x256), but may be more pronounced on Zen5+ or Intel chips.

Users whose CPUs also support the AVX-512BF16 extension are more likely to take the fused GEMM path for inference than to rely on dequantization. For example, the Zen4 CPU used for these benchmarks would take the AVX-512BF16 path. Dequantization would still be used on older Intel AVX-512 CPUs, however.

Benchmark setup: Linux, Ryzen 7950X, PyTorch 2.12.0, blocksize=64, NF4

shape	dtype	before	after	speedup
2048x2048	fp32	0.07ms	0.04ms	1.8x
2048x2048	bf16	0.07ms	0.03ms	2.3x
2048x2048	fp16	0.07ms	0.03ms	2.3x
4096x4096	fp32	5.59ms	5.35ms	1.04x
4096x4096	bf16	2.95ms	2.77ms	1.06x
4096x4096	fp16	2.95ms	2.79ms	1.06x
8192x8192	fp32	22.1ms	22.3ms	~1x
8192x8192	bf16	11.9ms	12.1ms	~1x
8192x8192	fp16	11.6ms	11.6ms	~1x
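
For readers unfamiliar with the idea behind "vectorized nibble extraction", the sketch below shows the same shift-and-mask trick on a 64-bit integer standing in for a SIMD register. It is only an analogy in plain Python: the real path works on AVX-512 registers, and the function names here are made up for the example.

```python
LOW_NIBBLES = 0x0F0F0F0F0F0F0F0F  # 0x0F repeated in all eight byte lanes


def split_nibbles_scalar(packed):
    """Reference version: one shift and one mask per byte."""
    return bytes(b >> 4 for b in packed), bytes(b & 0x0F for b in packed)


def split_nibbles_wide(packed):
    """Same result, eight bytes per shift/mask (length must be a multiple of 8)."""
    hi, lo = bytearray(), bytearray()
    for off in range(0, len(packed), 8):
        word = int.from_bytes(packed[off:off + 8], "little")
        # Shifting the whole word moves each byte's high nibble into the low
        # position; the mask discards the bits that crossed a byte boundary.
        hi += ((word >> 4) & LOW_NIBBLES).to_bytes(8, "little")
        lo += (word & LOW_NIBBLES).to_bytes(8, "little")
    return bytes(hi), bytes(lo)
```

A 512-bit register applies the same two operations to 64 packed bytes at once, yielding 128 codes per step.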

`dequantize_blockwise`

Improves fp16 only, by using the F16C hardware intrinsics for conversions.

This brings fp16 performance in line with bf16.

shape	dtype	before	after	speedup
4096x4096	fp32	5.98ms	6.00ms	~1x
4096x4096	bf16	3.04ms	3.05ms	~1x
4096x4096	fp16	4.82ms	3.14ms	1.5x
4096x14336	fp32	19.8ms	20.8ms	~1x
4096x14336	bf16	10.1ms	10.6ms	~1x
4096x14336	fp16	15.9ms	10.5ms	1.5x
8192x8192	fp32	22.0ms	23.4ms	~1x
8192x8192	bf16	12.2ms	12.2ms	~1x
8192x8192	fp16	20.4ms	11.6ms	1.8x
2.9M	fp32	0.06ms	0.05ms	~1x
2.9M	bf16	0.04ms	0.05ms	~1x
2.9M	fp16	0.38ms	0.06ms	6x
262K	fp32	0.01ms	0.01ms	~1x
262K	bf16	0.01ms	0.01ms	~1x
262K	fp16	0.06ms	0.01ms	6x
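
As a rough picture of what the op computes and where the fp16 conversion sits, here is a scalar sketch. The linear codebook is hypothetical (the library's real 8-bit codebook differs), and `struct`'s `"e"` format stands in for the hardware float-to-half conversion; it rounds a Python float to the nearest half-precision value.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Hypothetical linear codebook over [-1, 1], used only for this example.
LINEAR_CODEBOOK = [i / 127.5 - 1.0 for i in range(256)]


def dequantize_blockwise(codes, absmax, codebook, blocksize=256, fp16_out=False):
    """out[i] = codebook[codes[i]] * absmax[block of i], optionally rounded to fp16."""
    out = []
    for i, q in enumerate(codes):
        value = codebook[q] * absmax[i // blocksize]
        # The per-element conversion below is the step a hardware
        # float-to-half instruction replaces in a real kernel.
        out.append(to_fp16(value) if fp16_out else value)
    return out
```

The lookup and scale are identical for every dtype; only the final conversion differs, which is consistent with fp32 and bf16 staying flat in the table above.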

`quantize_blockwise`

Used F16C hardware instructions for fp16 conversions.

fp16 input sees up to a 1.3x improvement, while fp32/bf16 are relatively flat.

ARM64

`dequantize_4bit`

Improved the existing NEON path by performing the LUT step with a vectorized table lookup (`vqtbl4q_u8`). This applies to all dtypes, for an overall 1.1x-1.4x improvement.

Benchmark setup: Apple M4, macOS 15, PyTorch 2.12.0, blocksize=64, NF4

shape	dtype	before	after	speedup
2048x2048	fp32	0.69ms	0.55ms	1.3x
2048x2048	bf16	1.19ms	1.10ms	1.1x
2048x2048	fp16	1.06ms	0.85ms	1.2x
4096x4096	fp32	6.09ms	5.07ms	1.2x
4096x4096	bf16	4.98ms	4.46ms	1.1x
4096x4096	fp16	4.87ms	3.57ms	1.4x
8192x8192	fp32	25.40ms	20.33ms	1.2x
8192x8192	bf16	20.55ms	18.73ms	1.1x
8192x8192	fp16	18.77ms	14.93ms	1.3x
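
The sketch below emulates the semantics of a 64-byte table lookup such as `vqtbl4q_u8` (each output byte is `table[index]`, or 0 when the index is out of range) and shows one way a 16-entry fp32 LUT fits in such a table. Whether the kernel in this PR lays out its table this way is an assumption; the helper names are made up for the example.

```python
import struct


def tbl_lookup_64(table, indices):
    """Emulate a 64-byte table lookup: out[i] = table[indices[i]], or 0 if index >= 64."""
    assert len(table) == 64
    return bytes(table[i] if i < 64 else 0 for i in indices)


def gather_fp32(lut_values, codes):
    """Fetch fp32 values for 4-bit codes by gathering the four bytes of each entry."""
    table = struct.pack("<16f", *lut_values)  # 16 floats * 4 bytes = 64 bytes
    # Code c occupies table bytes 4c .. 4c+3.
    idx = bytes(4 * c + k for c in codes for k in range(4))
    raw = tbl_lookup_64(table, idx)
    return list(struct.unpack("<%df" % len(codes), raw))
```

The hardware instruction handles 16 index bytes per call, so under this layout one lookup would produce four fp32 results with no per-element branching.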

`dequantize_blockwise`

Added a NEON vectorized implementation for all dtypes. 1.3-1.6x improvement for fp32 (the most common use case), 2.0-2.6x for bf16, and significantly stronger improvements for fp16 (5.7x-7.8x).

Benchmark setup: Apple M4, macOS 15, PyTorch 2.12.0, blocksize=256

shape	dtype	before	after	speedup
2048x2048	fp32	1.18ms	0.72ms	1.6x
2048x2048	bf16	2.16ms	1.04ms	2.1x
2048x2048	fp16	5.98ms	1.04ms	5.8x
4096x4096	fp32	7.81ms	5.84ms	1.3x
4096x4096	bf16	7.59ms	2.88ms	2.6x
4096x4096	fp16	22.44ms	2.88ms	7.8x
8192x8192	fp32	34.08ms	23.79ms	1.4x
8192x8192	bf16	35.45ms	17.39ms	2.0x
8192x8192	fp16	98.51ms	17.31ms	5.7x
262K	fp32	0.08ms	0.05ms	1.6x
917K	fp32	0.40ms	0.31ms	1.3x
2.9M	fp32	1.38ms	0.93ms	1.5x

`quantize_blockwise`

Extended existing fp32 NEON absmax reduction to bf16/fp16.

The fp32 performance is unchanged, while bf16 and fp16 improve by ~3x and 5-6x, respectively.

Benchmark setup: Apple M4, macOS 15, PyTorch 2.12.0, blocksize=256

shape	dtype	before	after	speedup
4096x4096	fp32	6.67ms	6.62ms	~1x
4096x4096	bf16	18.00ms	6.20ms	2.9x
4096x4096	fp16	36.98ms	6.19ms	6.0x
8192x8192	fp32	27.01ms	27.01ms	~1x
8192x8192	bf16	76.95ms	27.84ms	2.8x
8192x8192	fp16	152.23ms	28.04ms	5.4x
262K	fp32	0.10ms	0.10ms	~1x
262K	bf16	0.29ms	0.10ms	2.9x
262K	fp16	0.58ms	0.10ms	5.8x
2.9M	fp32	1.13ms	1.12ms	~1x
2.9M	bf16	3.22ms	1.17ms	2.8x
2.9M	fp16	6.43ms	1.18ms	5.4x
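
For context on the absmax reduction mentioned above, here is a scalar sketch of blockwise quantization: one max-of-absolute-values reduction per block, then a nearest-code search on the normalized values. The linear codebook and the function names are hypothetical, and this is not the NEON implementation from the PR.

```python
import bisect

# Hypothetical sorted codebook over [-1, 1], used only for this example.
LINEAR_CODEBOOK = [i / 127.5 - 1.0 for i in range(256)]


def nearest_code(codebook, x):
    """Index of the entry closest to x; the codebook must be sorted ascending."""
    pos = bisect.bisect_left(codebook, x)
    if pos == 0:
        return 0
    if pos == len(codebook):
        return len(codebook) - 1
    return pos if codebook[pos] - x < x - codebook[pos - 1] else pos - 1


def quantize_blockwise(values, codebook, blocksize=256):
    """Return (codes, absmax) with one scale per block of `blocksize` values."""
    codes, absmax = bytearray(), []
    for start in range(0, len(values), blocksize):
        block = values[start:start + blocksize]
        scale = max(abs(v) for v in block)  # the absmax reduction
        absmax.append(scale)
        for v in block:
            x = v / scale if scale else 0.0  # normalize into [-1, 1]
            codes.append(nearest_code(codebook, x))
    return bytes(codes), absmax
```

The reduction reads every input element once, so for bf16/fp16 inputs its cost is dominated by how quickly elements can be widened and compared, which is the part the vectorized path targets.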

matthewdouglas · 2026-06-09T18:26:08Z

cc @jiqing-feng for x86-64 visibility and @pdeep854 for arm64 visibility

github-actions · 2026-06-09T18:26:28Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

matthewdouglas added 25 commits June 1, 2026 19:12

perf: improve int8 on arm64 CPU

2cd8562

build: temporary add CPU build verbosity

62427bf

fix

f7f1cdb

Merge remote-tracking branch 'origin/main' into cpu-perf-improvements

f1f6650

cpu: skip _int_mm when not on avx512.

ec1e76d

MSVC optimization for CPU ops

45831bd

msvc improvement

4e2b611

cpu: enable openmp:experimental on windows; add back avx2/fma for linux-x64

6755f03

improve optim test perf

44e6da6

cpu perf: improvements for arm64 8bit blockwise quant/dequant (neon)

c5df7fa

cpu perf: ARM64 NEON improvements for blockwise quantization

6223b78

fix msvc arm64 build

2adea99

fix

144edc7

remove dead code

4021345

x86-64 cpu perf improvement

ed1db52

fix

e54ccf8

cpu: update tests

8cc47f0

x64 avx512 improvements, test improvements

365c6d8

Merge remote-tracking branch 'refs/remotes/origin/cpu-perf-improvements' into cpu-perf-improvements

49f4293

update build flags

d855010

update build flags

b8b328c

fix windows

2ebf821

Update build flag

7dad0e6

Update omp simd hints

77ae5fa

fix msvc

aebfb02

fix lint

efc6f4b

matthewdouglas added x64 CPU aarch64 labels Jun 9, 2026

matthewdouglas added this to the v0.50.0 milestone Jun 9, 2026

update build script

fbbf23e

matthewdouglas wants to merge 27 commits into main from cpu-perf-improvements
