Is your feature request related to a problem? Please describe.
When I use Horovod on Spark to train a distributed DL model, Horovod performs extra steps to transfer data to the Horovod processes:
The DataFrame partitions are saved to distributed storage (for example, HDFS) using Petastorm.
The partitions are then read back from that storage through a client (for example, hadoop) to deliver the data to the Horovod processes.
These extra steps can slow processing down when we work with big data.
Describe the solution you'd like
I suggest using Barrier Execution Mode, which was introduced in Spark 2.4. Horovod can repartition the DataFrame to the number of executors and use mapInPandas() to convert each Spark DataFrame partition into an iterator of pd.DataFrame. Enabling Arrow will increase conversion speed. Horovod can then convert the iterator of pd.DataFrame into a dataloader for the specific DL framework. This logic can be wrapped into [Torch|Keras|Lightning]Estimator, or exposed through a new function horovod.spark.run_on_dataframe(), similar to horovod.spark.run().
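To make the proposal more concrete, here is a rough sketch of the data path I have in mind. It is only an illustration under several assumptions: run_on_dataframe() and batches_from_partition() are hypothetical names that do not exist in Horovod today, the barrier=True argument of mapInPandas() requires PySpark 3.5 or newer as far as I know, and the actual Horovod initialization and training loop are left out.

```python
from typing import Iterator, List, Tuple

import numpy as np
import pandas as pd


def batches_from_partition(
    frames: Iterator[pd.DataFrame],
    feature_cols: List[str],
    label_col: str,
) -> Iterator[Tuple[np.ndarray, np.ndarray]]:
    """Turn the pandas chunks of one partition into (features, labels) arrays.

    Hypothetical helper: a framework-specific dataloader could wrap this
    generator instead of reading Petastorm files from distributed storage.
    """
    for pdf in frames:
        if pdf.empty:
            # Skip empty chunks so the training loop never sees a 0-row batch.
            continue
        x = pdf[feature_cols].to_numpy(dtype=np.float32)
        y = pdf[label_col].to_numpy(dtype=np.float32)
        yield x, y


def run_on_dataframe(df, num_workers: int, feature_cols: List[str], label_col: str):
    """Hypothetical entry point, NOT an existing Horovod API.

    `df` is expected to be a pyspark.sql.DataFrame. Each of the `num_workers`
    partitions is streamed to one Python worker as an iterator of pd.DataFrame.
    """

    def train_partition(frames: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
        rows_seen = 0
        for x, y in batches_from_partition(frames, feature_cols, label_col):
            # Placeholder: a real implementation would run a training step
            # on (x, y) here after initializing Horovod in this worker.
            rows_seen += len(x)
        # mapInPandas must yield DataFrames that match the declared schema.
        yield pd.DataFrame({"rows_seen": [rows_seen]})

    return df.repartition(num_workers).mapInPandas(
        train_partition,
        schema="rows_seen long",
        barrier=True,  # assumption: available since PySpark 3.5
    )
```

With this shape, the partitions go from the executors straight into the training processes, so the intermediate write to and read from distributed storage is not needed. The returned DataFrame is lazy, so an action such as collect() would be required to actually start the barrier stage.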
Describe alternatives you've considered
As I understand it, Databricks uses a similar design for HorovodRunner, and XGBoost on PySpark uses a similar approach.
Additional context
This feature could make Horovod on Spark more popular. Uber engineers also describe this problem in this presentation.
If you support this idea but don't have time to implement it, I can start implementing it as a contribution to Horovod.
I'll be waiting for your feedback.