Support Barrier Execution Mode (Spark >= 2.4) in Horovod on Spark #3982

max-509 · 2023-09-13T17:23:31Z

Is your feature request related to a problem? Please describe.
When I use Horovod on Spark to train a distributed DL model, Horovod performs extra steps to transfer the data to the Horovod processes:

1. The DataFrame partitions are saved to distributed storage (for example, HDFS) using Petastorm.
2. The partitions are read back from that storage through a client (for example, hadoop) and delivered to the Horovod processes.

These steps can slow down processing when working with big data.

Describe the solution you'd like
I suggest using Barrier Execution Mode, which was introduced in Spark 2.4. Horovod can repartition the DataFrame to the number of executors and use mapInPandas() to convert each Spark DataFrame partition into an iterator of pd.DataFrame. Enabling Arrow will increase the conversion speed. Horovod can then convert the iterator of pd.DataFrame into a dataloader for the specific DL framework. This logic can be wrapped into the [Torch|Keras|Lightning]Estimator, or exposed through a new function horovod.spark.run_on_dataframe(), similar to horovod.spark.run().
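To make the proposal concrete, here is a minimal sketch of the data path, not an existing Horovod API. The names run_on_dataframe, iter_minibatches and train_partition are hypothetical, and the Horovod rendezvous and framework-specific dataloader are left to the user-supplied train_partition callable. Note that mapInPandas() itself requires Spark 3.0 or later, so the sketch assumes that version even though barrier mode dates from Spark 2.4.

```python
from typing import Any, Callable, Iterable, Iterator, List


def iter_minibatches(frames: Iterable[Any], batch_size: int) -> Iterator[Any]:
    """Slice each incoming pandas DataFrame into chunks of at most batch_size rows.

    `frames` is the iterator that mapInPandas() hands to the task; a
    framework-specific dataloader could be built on top of this generator.
    """
    for pdf in frames:
        for start in range(0, len(pdf), batch_size):
            yield pdf.iloc[start:start + batch_size]


def run_on_dataframe(df: Any,
                     num_workers: int,
                     train_partition: Callable[[int, Iterator[Any]], int],
                     batch_size: int = 256) -> List[Any]:
    """Hypothetical entry point: train on `df` without writing it to storage.

    `train_partition(rank, batches)` is a user callable (an assumption of this
    sketch) that consumes the mini-batches and returns the number of rows seen.
    """

    def _train(frames: Iterator[Any]) -> Iterator[Any]:
        # Imports are local so the function can be shipped to executors.
        import pandas as pd
        from pyspark import BarrierTaskContext

        ctx = BarrierTaskContext.get()
        ctx.barrier()  # block until every task of the barrier stage has started
        rank = ctx.partitionId()
        rows = train_partition(rank, iter_minibatches(frames, batch_size))
        # mapInPandas() must yield DataFrames matching the declared schema.
        yield pd.DataFrame({"rank": [rank], "rows": [rows]})

    return (
        df.repartition(num_workers)          # one partition per worker
          .mapInPandas(_train, schema="rank int, rows long")
          .rdd.barrier()                     # schedule all tasks together
          .mapPartitions(lambda it: it)
          .collect()
    )
```

The barrier stage ensures that all num_workers tasks are launched at the same time, which collective operations such as allreduce need; without it, Spark may run the partitions one after another if the cluster has fewer free slots than partitions.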

Describe alternatives you've considered
As I understand it, Databricks uses a similar design for HorovodRunner, and XGBoost on PySpark uses a similar approach.

Additional context
This feature could increase the popularity of Horovod on Spark. In this presentation, Uber engineers describe this problem.
If you support this idea but don’t have time to implement it, I can start implementing it as a contribution to Horovod.

I'll be waiting for your feedback.

max-509 added the enhancement label Sep 13, 2023
