Awww yes!

#2
by BoscoTheDog - opened

So much yes! Finally 128K context in the browser!

BoscoTheDog changed discussion title from Yes yes yes yes yes! to Awww yes!

It's possible to overdo it on the "yes".

Unsupported model type: phi3

I guess I need to be patient a little bit longer.

Hey! Indeed, we're still working on this, and we'll make an announcement once it's working 100%! To answer your questions:
1) Yes, this will be part of v3 w/ WebGPU acceleration
2) The model is split into two parts: ~830MB + 1.45GB. Both need to be below 2GB to be cacheable.
3) We're relying on MSFT's official ONNX export/implementation, which simplifies a lot for us! :)
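To make point 2 concrete, here is a minimal sketch of the size check, using the approximate figures quoted above. The helper name `allPartsCacheable` is hypothetical, and treating "2GB" as 2 * 1024^3 bytes is an assumption, since the thread does not say which convention applies.

```javascript
// Assumption: "2GB" is taken to mean 2 * 1024^3 bytes.
const CACHE_LIMIT_BYTES = 2 * 1024 ** 3;

// Hypothetical helper: true when every part is strictly below the limit.
function allPartsCacheable(partSizesInBytes, limit = CACHE_LIMIT_BYTES) {
    return partSizesInBytes.every((size) => size < limit);
}

// Approximate sizes from the thread (~830MB and 1.45GB), in decimal units.
const splitParts = [830e6, 1.45e9];
const singleFile = [830e6 + 1.45e9];

console.log(allPartsCacheable(splitParts)); // both parts fit under the limit
console.log(allPartsCacheable(singleFile)); // one combined file would not
```

This is only the arithmetic behind the split, not how the library or the browser cache actually enforces the limit.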

Super cool on all fronts. Thanks for explaining!

If I can help test, let me know.

I couldn't resist and tried to get it running with the latest version of v3, but only got 404s no matter which dtype I tried.

What will be the correct incantation? Is this close?

const generator = await pipeline('text-generation', 'Xenova/Phi-3-mini-128k-instruct', {
    dtype: 'q4', // fp32, fp16, q8, int8, uint8, q4, bnb4
    progress_callback: (data) => {
        if (data.status !== 'progress') return;
        setLoadProgress(data);
    },
});

You need to use this revision: https://huggingface.co/Xenova/Phi-3-mini-128k-instruct/discussions/3

By setting revision: 'refs/pr/3'.

You also need to set use_external_data_format: true, which has been introduced by the latest commits.
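Putting the two settings together with the pipeline call from the question, the options might look roughly like this. `buildPhi3LoadOptions` is a hypothetical helper made up for illustration. The option names come from this thread, and whether they remain valid depends on the v3 commit in use.

```javascript
// Hypothetical helper that collects the options mentioned in this thread.
function buildPhi3LoadOptions(onProgress) {
    return {
        dtype: 'q4',
        revision: 'refs/pr/3',          // load files from the PR branch
        use_external_data_format: true, // weights are stored in a separate data file
        progress_callback: (data) => {
            // Forward only download-progress events, as in the original snippet.
            if (data.status !== 'progress') return;
            onProgress(data);
        },
    };
}

// Usage sketch (assumes `pipeline` is imported from a v3 build of transformers.js
// and that `setLoadProgress` is your own state setter):
// const generator = await pipeline(
//     'text-generation',
//     'Xenova/Phi-3-mini-128k-instruct',
//     buildPhi3LoadOptions(setLoadProgress),
// );
```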

I can share some example code in a few hours, but I’m still getting erroneous output. Hopefully will get it working soon :)

Here's some HIGHLY EXPERIMENTAL WORK IN PROGRESS code:

import { env, AutoModelForCausalLM, AutoTokenizer } from '@xenova/transformers';

// disable proxying for now (much slower)
env.backends.onnx.wasm.proxy = false;

const model_id = 'Xenova/Phi-3-mini-128k-instruct';
const tokenizer = await AutoTokenizer.from_pretrained(model_id, {
    legacy: true, // TODO: update config
});

const prompt = `<|user|>
Tell me a joke<|end|>
<|assistant|>
`;

const inputs = tokenizer(prompt);

const model = await AutoModelForCausalLM.from_pretrained(model_id, {
    dtype: 'q4',
    // device: 'webgpu', // NOTE: webgpu produces incorrect results
    use_external_data_format: true,
    revision: 'refs/pr/3',
});

// { // warm up
//     const outputs = await model.generate({ ...inputs, max_new_tokens: 1 });
// }
{ // run + time execution
    const start = performance.now();
    const outputs = await model.generate({ ...inputs, max_new_tokens: 5 }); // TODO: increase max new tokens
    const end = performance.now();
    console.log(tokenizer.batch_decode(outputs, { skip_special_tokens: false }));
    console.log('Execution Time:', end - start);
}
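The timing block above prints elapsed milliseconds. A small sketch of turning that into a tokens-per-second figure follows. `tokensPerSecond` is a hypothetical helper and the numbers in the example are made up.

```javascript
// Hypothetical helper: throughput from a new-token count and two
// performance.now() timestamps (both in milliseconds).
function tokensPerSecond(newTokenCount, startMs, endMs) {
    const elapsedSeconds = (endMs - startMs) / 1000;
    if (elapsedSeconds <= 0) return NaN; // guard against an empty interval
    return newTokenCount / elapsedSeconds;
}

// Made-up example: 5 new tokens generated in 250 ms.
console.log(tokensPerSecond(5, 1000, 1250)); // 20
```

The first generate call may include one-off setup cost, which is presumably why the snippet above has a commented-out warm-up step. Timing a second call is likely to give a more representative number.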

NOTE: to get it working, you need to use the latest commit of transformers.js v3.

And just for now, you also need to replace this in src/models.js.

- const dtype = 'float32';
- const empty = [];
+ const dtype = 'float16';
+ const empty = new Uint16Array();

(Will not be necessary once we update the model).
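The patch swaps a plain array for a `Uint16Array`. A plausible reading (an assumption, the thread does not spell it out) is that float16 tensor data is handed over as raw 16-bit patterns rather than as JavaScript numbers. The sketch below decodes such patterns by hand to show what the bits mean. `float16BitsToNumber` is a hypothetical helper, not part of transformers.js.

```javascript
// Decode one IEEE 754 half-precision bit pattern (0..65535) into a JS number.
function float16BitsToNumber(bits) {
    const sign = (bits & 0x8000) ? -1 : 1;
    const exponent = (bits >> 10) & 0x1f; // 5 exponent bits
    const fraction = bits & 0x03ff;       // 10 fraction bits

    if (exponent === 0) {
        // Zero and subnormal values
        return sign * Math.pow(2, -14) * (fraction / 1024);
    }
    if (exponent === 0x1f) {
        // Infinity and NaN
        return fraction ? NaN : sign * Infinity;
    }
    // Normal values: implicit leading 1, exponent bias of 15
    return sign * Math.pow(2, exponent - 15) * (1 + fraction / 1024);
}

// Raw float16 bit patterns stored in a Uint16Array.
const raw = new Uint16Array([0x3c00, 0xc000, 0x3800]);
console.log(Array.from(raw, float16BitsToNumber)); // [ 1, -2, 0.5 ]
```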


WASM produces correct results, while WebGPU does not. Will continue to investigate.

Figured out the problem: the latest version of ONNXRuntime-web hadn't yet been published to NPM.

Here's a demo of phi-3-mini-128k-instruct running at ~20 tokens per second on an RTX 2080:

yup this model is a quanti... run on phone.

I trained this model:
NickyNicky/Phi-3-mini-4k-instruct_orpo_V2 --- >> https://huggingface.co/NickyNicky/Phi-3-mini-4k-instruct_orpo_V2

Then I quantized it with onnx but it gave me 10 GB, how did you compress it so much?

The latest versions of ONNXRuntime support two forms of 4-bit quantization (for certain weights):

You should also be able to quantize the other weights to fp16 or q8.
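As a rough back-of-the-envelope check on why the bit width matters so much, the sketch below estimates raw weight storage at different precisions. It assumes about 3.8 billion parameters for Phi-3-mini (the figure from Microsoft's model card, not stated in this thread) and ignores quantization scales, zero points, and any weights kept at higher precision, so real files will be somewhat larger. `estimateWeightBytes` is a hypothetical helper.

```javascript
// Hypothetical helper: raw storage for `paramCount` weights at `bitsPerWeight`.
function estimateWeightBytes(paramCount, bitsPerWeight) {
    return (paramCount * bitsPerWeight) / 8;
}

// Assumption: ~3.8 billion parameters (not stated in this thread).
const PARAM_COUNT = 3.8e9;

for (const [label, bits] of [['fp32', 32], ['fp16', 16], ['q8', 8], ['q4', 4]]) {
    const gigabytes = estimateWeightBytes(PARAM_COUNT, bits) / 1e9;
    console.log(`${label}: ~${gigabytes.toFixed(1)} GB`);
}
```

Under these assumptions 4-bit weights come to roughly 1.9 GB, which is in the same ballpark as the ~830MB + 1.45GB split mentioned earlier in the thread.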

Hope that helps!

Awesome!

Is this still needed?

- const dtype = 'float32';
- const empty = [];
+ const dtype = 'float16';
+ const empty = new Uint16Array();

// yep :-)

Whoop!

<s><|user|> Why is the sky blue?<|end|><|assistant|> The sky appears blue to
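The output above shows the chat markers the model expects. A minimal sketch of building such a prompt from a list of messages follows, matching the hand-written prompt in the experimental code earlier in the thread. `formatPhi3Prompt` is a hypothetical helper. If the tokenizer ships a chat template, using that instead is probably the safer choice.

```javascript
// Hypothetical helper: formats messages with the <|role|> ... <|end|> markers
// seen in this thread, then opens an assistant turn for the model to complete.
function formatPhi3Prompt(messages) {
    const turns = messages.map(
        ({ role, content }) => `<|${role}|>\n${content}<|end|>\n`,
    );
    return turns.join('') + '<|assistant|>\n';
}

const prompt = formatPhi3Prompt([
    { role: 'user', content: 'Tell me a joke' },
]);
console.log(JSON.stringify(prompt));
// "<|user|>\nTell me a joke<|end|>\n<|assistant|>\n"
```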

@Xenova How did you split it into 2 pieces below 2 GB ?

Would you say the 128K context version is now ready for implementation? Or are there still workarounds needed?

When can I use this model on the web?
I'm still getting "Unsupported model type: phi3".

@webjjin You need to install transformers.js v3 from the dev branch:

npm install xenova/transformers.js#v3

See here for example code: https://github.com/xenova/transformers.js/blob/e32d4ebb6fe715e6634335123c07a96d0dc62ac8/examples/webgpu-chat/src/worker.js

I remember Xenova saying he had early access to those files. Perhaps you can download them via/from the demo?

Thank you for the reply, but I'd better wait for the stable version of transformers.js v3.

I just tried today, with the following steps:

  1. Cloned the transformers.js project locally.
  2. Switched to the v3 branch.
  3. Installed dependencies and built the project: npm install and npm run build.
  4. Copied the dist folder to a test project folder.
  5. Tried running Phi-3 with this sample code:
import { pipeline, env } from './dist/transformers.js';

const model_id = 'Xenova/Phi-3-mini-4k-instruct';
env.backends.onnx.wasm.proxy = false;

const pipe = await pipeline('text-generation', model_id, {
  dtype: "q4",
  device: 'webgpu',
  use_external_data_format: true,
});

But I'm blocked at this error:

Error: Can't create a session. ERROR_CODE: 1, ERROR_MESSAGE: Deserialize tensor model.layers.1.mlp.up_proj.MatMul.weight_Q4 failed.Failed to load external data file "./model_q4.onnx_data", error: Module.MountedFiles is not available.
    at We (ort.webgpu.min.js:22:13223)
    at Pd (ort.webgpu.min.js:2309:19615)

Any idea how I could solve this?

I'm having problems with loading this model too, but with the external data. I'm using the latest v3 alpha of @huggingface/transformers.

I've also tried "microsoft/Phi-3-mini-4k-instruct-onnx-web" with dtype: q4f16, it never loads the external data.

model.layers.18.mlp.up_proj.MatMul.weight_Q4 failed.Failed to load external data file ""model_q4.onnx_data"", error: Module.MountedFiles is not available.
or
model.layers.18.mlp.up_proj.MatMul.weight_Q4 failed.Failed to load external data file ""model_q4f16.onnx_data"", error: Module.MountedFiles is not available.

Those files definitely exist, so I suspect a bug somewhere in transformers.js v3. The doubled double quotes make me extra suspicious.

You wouldn't happen to be on Safari or Firefox, would you?

And what are your environment variables? E.g.

env.allowLocalModels = false;
env.allowRemoteModels = true;
env.useBrowserCache = true;

No, and I've already successfully created and run this using ONNX directly (which is a lot more hassle!) in my browser https://github.com/benc-uk/onnx-webgpu/tree/main/phi-chat

I'm using exactly those env settings
