Using WebGPU for react-translator with nllb-200-distilled-600M #1286

Open · 1 of 5 tasks
820391839201z opened this issue Apr 15, 2025 · 2 comments
Labels: bug (Something isn't working)

Comments


820391839201z commented Apr 15, 2025

System Info

transformers 3.4.0
Windows 11 24H2
Chrome 135
node.js environment

Environment/Platform

- [x] Website/web-app
- [ ] Browser extension
- [ ] Server-side (e.g., Node.js, Deno, Bun)
- [ ] Desktop app (e.g., Electron)
- [ ] Other (e.g., VSCode extension)

Description

I upgraded from transformers.js v2 to v3 in examples/react-translator to try WebGPU, loading Xenova/nllb-200-distilled-600M and printing the output to the console as a test.

`dtype: { encoder_model: 'q8', decoder_model_merged: 'q8' }, device: 'wasm'` produces the expected result:
[screenshot: correct translation]

`dtype: { encoder_model: 'q8', decoder_model_merged: 'q8' }, device: 'webgpu'` produces gibberish:
[screenshot: gibberish output]

In both cases above, the output max_length is set to 10.
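
For reference, a minimal sketch of how these options are passed, assuming the standard transformers.js v3 `pipeline()` API (the `MyTranslationPipeline` wrapper in the repro below boils down to a call like this):

```js
import { pipeline } from '@huggingface/transformers';

// Sketch only: switching device between 'wasm' and 'webgpu' is the sole
// difference between the two runs compared above.
const translator = await pipeline(
  'translation',
  'Xenova/nllb-200-distilled-600M',
  {
    dtype: { encoder_model: 'q8', decoder_model_merged: 'q8' },
    device: 'webgpu', // 'wasm' produces correct output here
    progress_callback: (x) => self.postMessage(x),
  },
);
```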

Additionally, `dtype: { encoder_model: 'fp16', decoder_model_merged: 'q8' }, device: 'wasm'` also works:
[screenshot: correct translation]
My GPU (NVIDIA Tesla P4) does not support fp16, so I cannot test this configuration on webgpu.

Moreover, `dtype: { encoder_model: 'fp32', decoder_model_merged: 'q8' }` fails to load (>4GB of WebAssembly memory) on both wasm and webgpu:
[screenshot: out-of-memory error]
I plan to convert and quantize nllb-200-distilled-1.3B to ONNX as well; how do I properly load it? Thank you.
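
A minimal sketch of how a locally converted model can be pointed at, using the transformers.js `env` local-model settings (the `/models/` base path and folder name here are hypothetical):

```js
import { env, pipeline } from '@huggingface/transformers';

// Hypothetical layout: /models/nllb-200-distilled-1.3B/onnx/*.onnx,
// with the tokenizer/config JSON files alongside.
env.allowRemoteModels = false;   // don't fall back to the Hugging Face Hub
env.localModelPath = '/models/'; // base path served by the web app

const translator = await pipeline('translation', 'nllb-200-distilled-1.3B', {
  dtype: { encoder_model: 'q8', decoder_model_merged: 'q8' },
  device: 'wasm',
});
```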

Reproduction

```js
// Get the cached tokenizer/model pair from the example's singleton wrapper,
// forwarding progress events to the main thread.
const [tokenizer, model] = await MyTranslationPipeline.getInstance(x => {
	self.postMessage(x);
});

const dummyText = '你好,世界!'; // "Hello, world!"
const tgt_lang = 'eng_Latn';
// Pick the token id of the target language code (index 1 skips the
// leading special token).
const tgtLangTokenId = tokenizer.encode(tgt_lang)[1];

const { input_ids, attention_mask } = tokenizer(dummyText, { max_length: 512, truncation: true });
const output = await model.generate({
	input_ids: input_ids,
	attention_mask: attention_mask,
	forced_bos_token_id: tgtLangTokenId, // force English as the target language
	max_length: 10,
	num_beams: 5,
	early_stopping: true,
});

// The output tensor holds BigInt64 token ids; convert them to plain
// numbers before decoding.
const bigIntArray = output[0].ort_tensor.cpuData;
const intArray = Array.from(bigIntArray, (bigIntValue) => Number(bigIntValue));

const finalOutput = tokenizer.decode(intArray, { skip_special_tokens: true });
console.log('Target: ', finalOutput);
```
820391839201z added the bug label on Apr 15, 2025
820391839201z (Author) commented Apr 22, 2025

Here is some additional information about the loaded models, in case anyone with a similar problem wants to analyze it.

| Encoder | Decoder | Total raw model size | WebAssembly.Memory used by onnxruntime-web | Behaviour |
| --- | --- | --- | --- | --- |
| q8 | q8 | 895MB | ~2877MB (not full yet) | No loading error. wasm produces the correct result; webgpu produces gibberish. |
| fp16 | q8 | 1305MB | ~4094MB (almost full) | No loading error. wasm produces the correct result; webgpu, tested later on an NVIDIA Turing GPU with fp16 enabled, still produces gibberish. |
| fp32 | q8 | ~2175MB | >3788MB (full; exceeds the 4096MB limit by an unknown amount) | The q8 decoder loads, but the fp32 encoder cannot; ONNX Runtime aborted. |
| fp32 | fp32 | ~3614MB | >3294MB (full; exceeds the 4096MB limit by an unknown amount) | Neither encoder nor decoder loads; worker.js just hangs without the ORT "aborted" message (error code 3244392712 instead). |
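
The 4096MB ceiling matches the 4GiB address space of 32-bit WebAssembly (65,536 pages of 64KiB each). A standalone sketch of that hard limit, independent of onnxruntime-web:

```js
// Probe the wasm32 memory ceiling directly, without onnxruntime-web.
// 65536 pages * 64KiB/page = 4GiB, the most a wasm32 module can address.
const memory = new WebAssembly.Memory({ initial: 1, maximum: 65536 });

try {
  // Try to grow to the full 4GiB. Browsers often refuse well before the
  // theoretical maximum, which is consistent with the load failures above.
  memory.grow(65535);
  console.log('grew to', memory.buffer.byteLength / 2 ** 20, 'MiB');
} catch (e) {
  // grow() throws a RangeError when the allocation cannot be satisfied.
  console.log('allocation failed before reaching 4GiB:', e);
}
```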

820391839201z (Author) commented May 7, 2025

@xenova
I realized that the problem is likely the q8 decoder with WebGPU: both nllb-200-distilled-600M and whisper produce gibberish with any encoder dtype once the decoder is q8.
[screenshot: gibberish output from both models]

I can upload this demo if you need it.
Meanwhile, could you add a q4 encoder/decoder here?
https://huggingface.co/Xenova/nllb-200-distilled-600M/tree/main/onnx
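
Assuming the q4 ONNX files were published, they could presumably be requested like any other dtype ('q4' is a supported transformers.js dtype value); a hypothetical, untested sketch:

```js
// Hypothetical: assumes Xenova/nllb-200-distilled-600M gains q4 ONNX weights.
const translator = await pipeline(
  'translation',
  'Xenova/nllb-200-distilled-600M',
  {
    dtype: { encoder_model: 'q4', decoder_model_merged: 'q4' },
    device: 'webgpu',
  },
);
```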

Thanks a lot! I appreciate how versatile v3 is with the new WebGPU support.
