Using WebGPU for react-translator with nllb-200-distilled-600M #1286

Open · 1 of 5 tasks
820391839201z opened this issue Apr 15, 2025 · 2 comments
Labels: bug (Something isn't working)

Comments


820391839201z commented Apr 15, 2025

System Info

transformers 3.4.0
Windows 11 24H2
Chrome 135
node.js environment

Environment/Platform

- [x] Website/web-app
- [ ] Browser extension
- [ ] Server-side (e.g., Node.js, Deno, Bun)
- [ ] Desktop app (e.g., Electron)
- [ ] Other (e.g., VSCode extension)

Description

I upgraded from transformers.js v2 to v3 in examples/react-translator to try WebGPU, loading Xenova/nllb-200-distilled-600M and printing the output to the console as a test.

`dtype: { encoder_model: 'q8', decoder_model_merged: 'q8' }, device: 'wasm'` produces the expected result:
[screenshot: correct translation]

`dtype: { encoder_model: 'q8', decoder_model_merged: 'q8' }, device: 'webgpu'` produces gibberish:
[screenshot: gibberish output]

In both cases above, the output max_length is set to 10.
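
For reference, a minimal sketch of how these options are passed, assuming the standard transformers.js v3 `pipeline()` API (the `MyTranslationPipeline` wrapper in the repro below boils down to a call like this):

```js
import { pipeline } from '@huggingface/transformers';

// Sketch only: switching device between 'wasm' and 'webgpu' is the sole
// difference between the two runs compared above.
const translator = await pipeline(
  'translation',
  'Xenova/nllb-200-distilled-600M',
  {
    dtype: { encoder_model: 'q8', decoder_model_merged: 'q8' },
    device: 'webgpu', // 'wasm' produces correct output here
    progress_callback: (x) => self.postMessage(x),
  },
);
```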

Additionally, `dtype: { encoder_model: 'fp16', decoder_model_merged: 'q8' }, device: 'wasm'` also works:
[screenshot: correct translation]
My GPU (NVIDIA Tesla P4) does not support fp16, so I cannot test this configuration on webgpu.

Moreover, `dtype: { encoder_model: 'fp32', decoder_model_merged: 'q8' }` fails to load (>4GB of WebAssembly memory) on both wasm and webgpu:
[screenshot: out-of-memory error]
I plan to convert and quantize nllb-200-distilled-1.3B to ONNX as well; how do I properly load it? Thank you.
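
A minimal sketch of how a locally converted model can be pointed at, using the transformers.js `env` local-model settings (the `/models/` base path and folder name here are hypothetical):

```js
import { env, pipeline } from '@huggingface/transformers';

// Hypothetical layout: /models/nllb-200-distilled-1.3B/onnx/*.onnx,
// with the tokenizer/config JSON files alongside.
env.allowRemoteModels = false;   // don't fall back to the Hugging Face Hub
env.localModelPath = '/models/'; // base path served by the web app

const translator = await pipeline('translation', 'nllb-200-distilled-1.3B', {
  dtype: { encoder_model: 'q8', decoder_model_merged: 'q8' },
  device: 'wasm',
});
```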

Reproduction

```js
// Get the cached tokenizer/model pair from the example's singleton wrapper,
// forwarding progress events to the main thread.
const [tokenizer, model] = await MyTranslationPipeline.getInstance(x => {
	self.postMessage(x);
});

const dummyText = '你好,世界!'; // "Hello, world!"
const tgt_lang = 'eng_Latn';
// Pick the token id of the target language code (index 1 skips the
// leading special token).
const tgtLangTokenId = tokenizer.encode(tgt_lang)[1];

const { input_ids, attention_mask } = tokenizer(dummyText, { max_length: 512, truncation: true });
const output = await model.generate({
	input_ids: input_ids,
	attention_mask: attention_mask,
	forced_bos_token_id: tgtLangTokenId, // force English as the target language
	max_length: 10,
	num_beams: 5,
	early_stopping: true,
});

// The output tensor holds BigInt64 token ids; convert them to plain
// numbers before decoding.
const bigIntArray = output[0].ort_tensor.cpuData;
const intArray = Array.from(bigIntArray, (bigIntValue) => Number(bigIntValue));

const finalOutput = tokenizer.decode(intArray, { skip_special_tokens: true });
console.log('Target: ', finalOutput);
```
820391839201z added the bug label on Apr 15, 2025
820391839201z (Author) commented Apr 22, 2025

Here is some additional information about the loaded models, in case anyone with a similar problem wants to analyze it.

| Encoder | Decoder | Total raw model size | WebAssembly.Memory used by onnxruntime-web | Behaviour |
| --- | --- | --- | --- | --- |
| q8 | q8 | 895MB | ~2877MB (not full yet) | No loading error. wasm produces the correct result; webgpu produces gibberish. |
| fp16 | q8 | 1305MB | ~4094MB (almost full) | No loading error. wasm produces the correct result; webgpu, tested later on an NVIDIA Turing GPU with fp16 enabled, still produces gibberish. |
| fp32 | q8 | ~2175MB | >3788MB (full; exceeds the 4096MB limit by an unknown amount) | The q8 decoder loads, but the fp32 encoder cannot; ONNX Runtime aborted. |
| fp32 | fp32 | ~3614MB | >3294MB (full; exceeds the 4096MB limit by an unknown amount) | Neither encoder nor decoder loads; worker.js just hangs without the ORT "aborted" message (error code 3244392712 instead). |
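
The 4096MB ceiling matches the 4GiB address space of 32-bit WebAssembly (65,536 pages of 64KiB each). A standalone sketch of that hard limit, independent of onnxruntime-web:

```js
// Probe the wasm32 memory ceiling directly, without onnxruntime-web.
// 65536 pages * 64KiB/page = 4GiB, the most a wasm32 module can address.
const memory = new WebAssembly.Memory({ initial: 1, maximum: 65536 });

try {
  // Try to grow to the full 4GiB. Browsers often refuse well before the
  // theoretical maximum, which is consistent with the load failures above.
  memory.grow(65535);
  console.log('grew to', memory.buffer.byteLength / 2 ** 20, 'MiB');
} catch (e) {
  // grow() throws a RangeError when the allocation cannot be satisfied.
  console.log('allocation failed before reaching 4GiB:', e);
}
```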

820391839201z (Author) commented May 7, 2025

@xenova
I realized that the problem is likely the q8 decoder with WebGPU: both nllb-200-distilled-600M and whisper produce gibberish with any encoder dtype once the decoder is q8.
[screenshot: gibberish output from both models]

I can upload this demo if you need it.
Meanwhile, could you add a q4 encoder/decoder here?
https://huggingface.co/Xenova/nllb-200-distilled-600M/tree/main/onnx
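
Assuming the q4 ONNX files were published, they could presumably be requested like any other dtype ('q4' is a supported transformers.js dtype value); a hypothetical, untested sketch:

```js
// Hypothetical: assumes Xenova/nllb-200-distilled-600M gains q4 ONNX weights.
const translator = await pipeline(
  'translation',
  'Xenova/nllb-200-distilled-600M',
  {
    dtype: { encoder_model: 'q4', decoder_model_merged: 'q4' },
    device: 'webgpu',
  },
);
```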

Thanks a lot! I appreciate how versatile v3 is with the new WebGPU support.
