System Info
transformers 3.4.0
Windows 11 24H2
Chrome 135
node.js environment
Environment/Platform
Website/web-app
Browser extension
Server-side (e.g., Node.js, Deno, Bun)
Desktop app (e.g., Electron)
Other (e.g., VSCode extension)
Description
I upgraded from transformers v2 to transformers v3 in examples/react-translator for WebGPU, loading xenova/nllb-200-distilled-600M and printing the output to the console as a test.
dtype: { encoder_model: 'q8', decoder_model_merged: 'q8' }, device: 'wasm' produces the expected results, while dtype: { encoder_model: 'q8', decoder_model_merged: 'q8' }, device: 'webgpu' produces gibberish. In both cases the output max_length is set to 10.
Additionally, dtype: { encoder_model: 'fp16', decoder_model_merged: 'q8' }, device: 'wasm' also works. My GPU (an NVIDIA Tesla P4) does not support fp16, so I cannot test this configuration on webgpu.
Moreover, dtype: { encoder_model: 'fp32', decoder_model_merged: 'q8' } fails to load (it exceeds the 4GB WebAssembly memory limit) on both wasm and webgpu.
I also plan to convert and quantize nllb-200-distilled-1.3B to ONNX; how should I load it properly? Thank you.
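For reference, here is roughly how the pipeline is created and called in my test (a minimal sketch based on the examples/react-translator setup; the input sentence and the language codes are placeholders for illustration):

```js
import { pipeline } from '@huggingface/transformers';

// Create the translation pipeline with per-module dtypes.
// Switching device between 'wasm' and 'webgpu' is the only change between the two cases above.
const translator = await pipeline('translation', 'xenova/nllb-200-distilled-600M', {
  dtype: { encoder_model: 'q8', decoder_model_merged: 'q8' },
  device: 'webgpu', // 'wasm' gives the expected output, 'webgpu' gives gibberish
});

// Run a short translation and print the result to the console.
const output = await translator('Hello, how are you today?', {
  src_lang: 'eng_Latn',
  tgt_lang: 'fra_Latn',
  max_length: 10,
});
console.log(output);
```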
Here is some additional information about the models I loaded, in case anyone runs into a similar problem.
Q8 encoder, Q8 decoder
encoder_model q8 onnx file size:
decoder_model_merged q8 onnx file size:
Total raw model size: 895MB
Memory used by the onnxruntime-web WebAssembly.Memory instance to load the model: ~2877MB (not yet at the 4GB limit)
FP16 encoder, Q8 decoder
encoder_model fp16 onnx file size:
Total raw model size: 1305MB
Memory used by the onnxruntime-web WebAssembly.Memory instance to load the model: ~4094MB (almost at the 4GB limit)
Behaviour:
No loading error. wasm produces the correct result. webgpu still produces gibberish (not fully tested yet; I will try again on an NVIDIA Turing GPU with FP16 enabled).
FP32 encoder, Q8 decoder
encoder_model fp32 onnx file size:
Total raw model size: ~2175MB
Memory used by the onnxruntime-web WebAssembly.Memory instance to load the model: >3788MB observed (full; the actual requirement exceeds the 4096MB limit by an unknown amount)
Behaviour:
The q8 decoder loaded, but the fp32 encoder could not be loaded; ONNX Runtime aborted.
FP32 encoder, FP32 decoder
decoder_model_merged fp32 onnx file size:
Total raw model size: ~3614MB
Memory used by the onnxruntime-web WebAssembly.Memory instance to load the model: >3294MB observed (full; the actual requirement exceeds the 4096MB limit by an unknown amount)
Behaviour:
Neither the encoder nor the decoder could be loaded; worker.js just hangs without showing the ORT abort message (error code 3244392712 instead).
@xenova
I realized that the problem is likely due to the q8 decoder with WebGPU.
Both nllb-200-distilled-600M and whisper produce gibberish when using any encoder dtype together with a q8 decoder.
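Concretely, I mean configurations of this shape (a sketch; the whisper case follows the same pattern with its own task and model id):

```js
// Any encoder dtype ('q8', 'fp16', ...) combined with a q8 decoder
// produces gibberish for me once device is 'webgpu':
const translator = await pipeline('translation', 'xenova/nllb-200-distilled-600M', {
  dtype: { encoder_model: 'fp16', decoder_model_merged: 'q8' },
  device: 'webgpu',
});
```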