# July 2024: BitNet performance comparison
Two implementations are provided:

- `IQ1_BN` - uses 1.625 bits-per-weight (bpw)
- `IQ2_BN` - uses 2.0 bpw

`IQ2_BN` is faster for prompt processing (PP) on both CPU and GPU, although the PP performance difference on CUDA is very minor. `IQ1_BN` can reach higher token generation (TG) performance on the Ryzen-7950X (given enough threads) because of the smaller model size, but it is always slower on the GPU and on the M2-Max CPU.
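As a point of reference (my arithmetic, not part of the original write-up): a ternary weight takes one of three values, so it carries at most log2(3) ≈ 1.585 bits of information, which puts `IQ1_BN` within about 2.5% of the theoretical minimum. Taking a block of 64 weights purely as an illustration (the smallest count for which 1.625 bpw is a whole number of bytes; the actual storage layout in the code may differ):

$$
\text{IQ1\_BN}: \frac{1.625 \times 64}{8} = 13 \ \text{bytes}, \qquad
\text{IQ2\_BN}: \frac{2.000 \times 64}{8} = 16 \ \text{bytes}
$$

The extra space of `IQ2_BN` presumably buys a layout that is cheaper to unpack, which is consistent with its higher PP numbers below.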
There is the unmerged PR 8151 in `llama.cpp` that implements Bitnet-1.58b for the CPU (`AVX2` and `ARM_NEON`, no GPU implementation). The following table compares performance between this repo and PR-8151 in `llama.cpp`. The CUDA results were obtained on an RTX-4080, the Metal results on a 30-core M2-Max GPU.
| model | size | backend | threads | test | t/s (llama.cpp) | t/s (this repo) | Speedup |
|---|---|---|---|---|---|---|---|
| 3B - IQ1_BN | 729.64 MiB | AVX2 | 16 | pp512 | 120.61 ± 0.48 | 423.19 ± 1.28 | 3.509 |
| | | NEON | 8 | pp512 | 46.64 ± 0.02 | 205.90 ± 0.88 | 4.415 |
| | | CUDA | 8 | pp512 | - | 10660 ± 170 | - |
| | | Metal | 8 | pp512 | - | 698.25 ± 1.91 | - |
| | | AVX2 | 2 | tg128 | 15.79 ± 0.01 | 22.13 ± 0.02 | 1.402 |
| | | AVX2 | 4 | tg128 | 28.64 ± 1.72 | 40.14 ± 0.04 | 1.402 |
| | | AVX2 | 8 | tg128 | 48.91 ± 0.08 | 61.79 ± 0.09 | 1.263 |
| | | AVX2 | 16 | tg128 | 57.73 ± 0.05 | 60.79 ± 0.05 | 1.053 |
| | | NEON | 2 | tg128 | 11.43 ± 0.04 | 16.87 ± 0.02 | 1.476 |
| | | NEON | 4 | tg128 | 21.11 ± 0.05 | 30.66 ± 0.11 | 1.452 |
| | | NEON | 8 | tg128 | 37.36 ± 0.07 | 55.21 ± 0.16 | 1.478 |
| | | CUDA | 8 | tg128 | - | 301.44 ± 0.12 | - |
| | | Metal | 8 | tg128 | - | 76.70 ± 0.07 | - |
| 3B - IQ2_BN | 873.65 MiB | AVX2 | 16 | pp512 | 151.39 ± 0.35 | 540.82 ± 2.48 | 3.572 |
| | | NEON | 8 | pp512 | 46.54 ± 0.03 | 242.05 ± 0.34 | 5.201 |
| | | CUDA | 8 | pp512 | - | 10800 ± 160 | - |
| | | Metal | 8 | pp512 | - | 723.19 ± 0.53 | - |
| | | AVX2 | 2 | tg128 | 18.93 ± 0.02 | 38.34 ± 0.08 | 2.026 |
| | | AVX2 | 4 | tg128 | 34.54 ± 0.06 | 56.29 ± 0.07 | 1.630 |
| | | AVX2 | 8 | tg128 | 52.97 ± 0.07 | 53.44 ± 0.08 | 1.009 |
| | | AVX2 | 16 | tg128 | 51.84 ± 0.25 | 53.46 ± 0.07 | 1.031 |
| | | NEON | 2 | tg128 | 11.40 ± 0.02 | 32.01 ± 0.27 | 2.808 |
| | | NEON | 4 | tg128 | 20.99 ± 0.00 | 56.45 ± 0.11 | 2.689 |
| | | NEON | 8 | tg128 | 37.28 ± 0.08 | 89.77 ± 0.70 | 2.408 |
| | | CUDA | 8 | tg128 | - | 322.10 ± 0.07 | - |
| | | Metal | 8 | tg128 | - | 110.39 ± 0.13 | - |
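The pp512 and tg128 rows have the shape of `llama-bench` output (pp512 = processing a 512-token prompt, tg128 = generating 128 tokens). A command along the following lines should reproduce a CPU row of the table; the exact invocation used for the numbers above is my assumption, not something stated here.

```bash
# Hedged sketch: benchmark a quantized Bitnet model on the CPU with llama-bench.
# -p 512 runs the pp512 test, -n 128 the tg128 test, -t sets the thread counts,
# -ngl 0 keeps all layers on the CPU (relevant for CUDA/Metal builds).
./bin/llama-bench -m quantized.gguf -p 512 -n 128 -t 2,4,8,16 -ngl 0
```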
We can make the following observations:

- For prompt processing this Bitnet-1.58b implementation is massively better than PR-8151 in `llama.cpp`, with gains between 3.4X and 5.2X!
- We get `PP-512 = 540 t/s` for the 2.0 bpw variant on the Ryzen-7950X, which costs less than $500. Hey, who needs a GPU?
- For a low number of threads (2), this implementation is also much faster than PR-8151 for TG, with speed gains between 1.4X and 2.8X. As we become memory bound on the Ryzen-7950X, the speed advantage there goes away for a sufficiently high number of threads. But on the M2-Max this implementation is 1.4X (1.625 bpw) or 2.4X (2.0 bpw) faster even at 8 threads.
- Looking at TG on the M2-Max, the GPU looks a bit like wasted silicon (90 vs. 110 t/s for TG-128 with the 2.0 bpw variant). If the GPU transistors had been spent on doubling the number of M2 CPU cores (and all memory bandwidth were given to the CPU), the CPU would be wiping the floor with the GPU.
- I'm of course kidding with the above. Still, it seems there are massive inefficiencies in the `llama.cpp` Metal implementation that start showing up when matrix multiplications become very fast, as is the case here. On the M2-Max the difference between CPU and GPU prompt processing speed is typically at least a factor of 7 in favor of the GPU, but here it is only around a factor of 3.
- It is worth noting that one needs to offload the token embeddings tensor to the GPU, else performance on CUDA/Metal is significantly lower. Bitnet uses the same tensor for token embeddings and for output. Mainline `llama.cpp` currently puts the token embeddings tensor on the CPU, which results in the matrix multiplication with the output tensor running on the CPU. This most likely affects other models as well (e.g., Gemma), but I haven't yet looked into this.
To reproduce these results:

- Clone https://huggingface.co/1bitLLM/bitnet_b1_58-3B
- Run the HF-to-GGUF conversion script with `--outtype f16 path_to_bitnet` to convert the model to GGUF (a consolidated command sketch follows this list)
- Run `./bin/llama-quantize path_to_bitnet/ggml-model-f16.gguf quantized.gguf [iq1_bn | iq2_bn]`. Note: no imatrix is required (and, if you provide one, it is ignored)
- Caveat: only the 3B Bitnet variant works. The smaller Bitnet models contain tensors whose number of columns is not a multiple of 32, so basically no `llama.cpp` quant will work for these.
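Putting the steps together, a minimal shell sketch. The conversion script name (`convert_hf_to_gguf.py`) and the final test command are assumptions on my part; adjust paths, the quant type, and thread counts to your setup.

```bash
# 1. Get the 3B Bitnet model (requires git-lfs)
git clone https://huggingface.co/1bitLLM/bitnet_b1_58-3B

# 2. Convert to an f16 GGUF (conversion script name assumed; use the converter shipped with this repo)
python3 convert_hf_to_gguf.py --outtype f16 bitnet_b1_58-3B

# 3. Quantize to IQ2_BN (use iq1_bn for the 1.625 bpw variant); no imatrix is needed
./bin/llama-quantize bitnet_b1_58-3B/ggml-model-f16.gguf quantized.gguf iq2_bn

# 4. Quick sanity check of the quantized model
./bin/llama-cli -m quantized.gguf -p "Once upon a time" -n 64
```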