Support sharding in model builder #249

Open · wants to merge 8 commits into main
Conversation

wangyems (Member) commented Apr 3, 2024

1. Support sharding in the model builder
2. Add Mixtral export with MoE

Usage example:

python builder.py -m mistralai/Mixtral-8x7B-v0.1 -e cuda -p fp16 -o ./example-models/mixtral_rank_0 --extra_options world_size=2 rank=0
python builder.py -m mistralai/Mixtral-8x7B-v0.1 -e cuda -p fp16 -o ./example-models/mixtral_rank_1 --extra_options world_size=2 rank=1

We can document this usage in the README once multi-GPU inference support in the GenAI tool is ready.
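
For context, a minimal sketch of what the world_size/rank options imply for the exported weights: each rank exports only its slice of the sharded tensors. The helper name and slicing logic below are illustrative assumptions, not the builder's actual code.

import numpy as np

def shard_tensor(weight: np.ndarray, world_size: int, rank: int, axis: int) -> np.ndarray:
    # Hypothetical helper: return this rank's slice of `weight`, split evenly
    # along `axis` (assumes the dimension is divisible by world_size).
    shard_size = weight.shape[axis] // world_size
    start = rank * shard_size
    return np.take(weight, range(start, start + shard_size), axis=axis)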

@@ -568,6 +611,10 @@ def make_add_bias(self, add, name, root_input, **kwargs):
else:
self.make_add(name, add_bias_inputs, dtype=self.io_dtype, shape=shape)

def make_all_reduce(self, name, root_input):
output = f"{name}/output_0"
Review comment (Contributor):
Is it possible to add the value info for the node's output?
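
For reference, a hedged sketch of attaching value info with onnx.helper; the dtype and symbolic dims here are assumptions, and the builder may already have its own helper for this:

from onnx import TensorProto, helper

# Illustrative only: declare the AllReduce output's type and shape so it
# appears in the graph's value_info without running shape inference.
value_info = helper.make_tensor_value_info(
    output,                                            # f"{name}/output_0" from above
    TensorProto.FLOAT16,                               # assuming io_dtype is fp16
    ["batch_size", "sequence_length", "hidden_size"],  # assumed symbolic dims
)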

@@ -1038,27 +1100,34 @@ def make_mlp_proj(self, layer_id, mlp, root_input):
# Mul
# |
# DownProjMatMul
if mlp is None:
Review comment (Contributor):
Why is this needed?

fc2_add_name = f"/model/layers.{layer_id}/mlp/fc2/Add"
self.make_add_bias(mlp.fc2.bias.detach().numpy(), fc2_add_name, root_input=f"{fc2_matmul_name}/output_0")

# Assign output 0 of MLP layer as output of last layer
self.mlp_attrs["output_0"] = f"{fc2_add_name}/output_0"

def make_block_sparse_moe(self, layer_id, bsm, root_input):
if bsm is None:
Review comment (Contributor):
Can you draw what the subgraph looks like as a comment? It will help to see it visually for documentation purposes.
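
As a starting point, a rough guess at that subgraph based on the node names in this function (the PR author should confirm): root_input feeds both the gate path and the MoE op.

#           root_input
#          /          \
#    gate/MatMul       |
#         |            |
#    gate/Reshape      |
#          \          /
#       MoE / ShardedMoE
#    (+ expert weights 1-3)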

w3_list.append(torch.reshape(bsm.experts[i].w3.weight, (hidden_size, inter_size)))

moe_expert_1_name = f"model.layers.{layer_id}.moe.weight_1"
moe_expert_2_name = f"model.layers.{layer_id}.moe.weight_2"
Review comment (Contributor):
Can the expert weight names be named as model.layers.{layer_id}.moe.experts.{expert_num}.weight instead?

self.make_external_tensor(moe_experts_weight2.astype(self.to_numpy_dtype[self.io_dtype]), moe_expert_2_name)
self.make_external_tensor(moe_experts_weight3.astype(self.to_numpy_dtype[self.io_dtype]), moe_expert_3_name)

bias_ph = "" # Placeholder for bias
Review comment (Contributor):
Suggested change:
-bias_ph = "" # Placeholder for bias
+bias_name = "" # Placeholder for bias

output = f"{moe_name}/output_0"
if self.world_size > 1:
self.make_node("ShardedMoE", inputs=inputs, outputs=[output], name=moe_name, domain="com.microsoft",
k=top_k, activation_type=activation_type, normalize_routing_weights=normalize_routing_weights, tensor_shards=self.world_size)
Review comment (Contributor):
Can the logic to create an MoE node be factored out into one node creation function? Something like this:

op_type = f"{'Sharded' if self.world_size > 1 else ''}MoE"
kwargs = {"tensor_shards": self.world_size} if self.world_size > 1 else {}
self.make_node(
    op_type, inputs=inputs, outputs=[output], name=moe_name, domain="com.microsoft",
    k=top_k, activation_type=activation_type, normalize_routing_weights=normalize_routing_weights,
    **kwargs,
)

@@ -1709,6 +1894,8 @@ def create_model(model_name, input_path, output_dir, precision, execution_provid
onnx_model = MistralModel(config, io_dtype, precision, execution_provider, cache_dir, extra_options)
elif config.architectures[0] == "PhiForCausalLM":
onnx_model = PhiModel(config, io_dtype, precision, execution_provider, cache_dir, extra_options)
elif config.architectures[0] == "MixtralForCausalLM":
Review comment (Contributor):
Can you add the new architecture such that the alphabetical order is maintained? This helps quickly identify which architectures are currently supported.

@@ -1801,6 +1988,8 @@ def get_args():
The filename for each component will be '<filename>_<component-name>.onnx' (ex: '<filename>_encoder.onnx', '<filename>_decoder.onnx').
config_only = Generate config and pre/post processing files only.
Use this option when you already have your optimized and/or quantized ONNX model.
world_size = Number of GPUs to use for distributed inference. Default is 1.
Review comment (Contributor):
Can you add examples that use these extra options in the model builder README?

gate_reshape_name = f"/model/layers.{layer_id}/moe/gate/Reshape"
self.make_reshape(gate_reshape_name, [f"{gate_name}/output_0", f"{concat_name}/output_0"], dtype=self.io_dtype, shape=['num_rows', num_experts])

moe_name = f"/model/layers.{layer_id}/moe"
Review comment (Contributor):
Can you define this as basename in the beginning and then modify the above node names to use it when defining their names (e.g. f"{basename}/gate/MatMul")? This allows us to change the basename if needed without needing to manually update all of the other node names in that subgraph.
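
A sketch of the suggested refactor (names follow the reviewer's example, not the merged code):

basename = f"/model/layers.{layer_id}/moe"
gate_name = f"{basename}/gate/MatMul"
gate_reshape_name = f"{basename}/gate/Reshape"
moe_name = basename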
