
pyspark should upload / import packages from local by default #739

Open
kdzhao opened this issue Nov 11, 2021 · 3 comments

Comments


kdzhao commented Nov 11, 2021

Is your feature request related to a problem? Please describe.

When pyspark starts the connection to the Spark cluster (Livy), it should load the packages installed in the local environment by default (or at least offer a way to specify that), so users can use these packages in the Spark session as well.

For example, in the PySpark kernel, if I do:

%%local
import matplotlib

It loads successfully. This is expected because %%local uses the matplotlib package installed on the JupyterLab machine.

But if I do:

import matplotlib

Starting Spark application
ID      YARN Application ID     Kind    State   Spark UI        Driver log      Current session?
32      application_1636505612345_0200  pyspark idle    Link    Link    ✔
SparkSession available as 'spark'.
An error was encountered:
No module named 'matplotlib'
Traceback (most recent call last):
ModuleNotFoundError: No module named 'matplotlib'

As we can see, it errors out. It can't find the package on the Spark cluster, because in this case the code runs on the cluster.

Describe the solution you'd like

People may ask: why not just install these packages on the Spark cluster? Well, most of the time end users don't have permission to do that. If there were a way for the PySpark kernel to upload the packages when it starts the Spark session, that would be really helpful! For example, a config set before starting the session in which users can specify which packages to upload.
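As a partial workaround today, sparkmagic's %%configure magic can pass session configuration before the Spark session starts. A minimal sketch, assuming the dependencies have already been packed into an archive (e.g. with conda-pack) and uploaded somewhere the cluster can read — the S3 path and property choices below are illustrative, not part of this request:

%%configure -f
{
  "conf": {
    "spark.yarn.dist.archives": "s3://my-bucket/pyspark_env.tar.gz#environment",
    "spark.yarn.appMasterEnv.PYSPARK_PYTHON": "./environment/bin/python",
    "spark.executorEnv.PYSPARK_PYTHON": "./environment/bin/python"
  }
}

Note that this still requires the archive to already be on S3/HDFS; the point of this request is to let the kernel upload packages from the local machine automatically.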

@nicolaslazo

Hmmm, maybe what you're looking for is sc.install_pypi_package, a sparsely documented function referenced in the AWS EMR documentation?

Another option could be to use the %%bash IPython magic to call pip directly
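For reference, on EMR Notebooks the first option would look roughly like this, run in a regular Spark cell rather than %%local (the package name is just an example; I believe this is an EMR-specific feature rather than part of sparkmagic itself):

# Install a notebook-scoped library into the current Spark session, then import it
sc.install_pypi_package("matplotlib")
import matplotlib

The install only lives for the duration of that Spark session.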


gloisel commented Dec 21, 2021

I'll extend the above request slightly, though I believe the solution @kdzhao suggested would cover it.

In my case, I have several notebooks that I use on my EMR clusters, and they reuse some key functions that I would ideally like to define in only one place.

Ideally, the %run magic could be adapted to take a Python script located on the JupyterLab machine and run it against the session on the EMR cluster.
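Until something like that exists, one workaround sketch is to read the script locally and ship its source over with %%send_to_spark, then exec it in the session. The file path here is hypothetical:

%%local
# Read the shared script from the JupyterLab machine into a string
with open("shared/helpers.py") as f:
    helpers_src = f.read()

%%send_to_spark -i helpers_src -t str -n helpers_src

# Then, in a regular (cluster) cell, define the shared functions on the Spark driver:
exec(helpers_src)

This only works if everything the script imports already exists on the cluster, which is why a proper upload mechanism would still be needed.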


Michallote commented May 13, 2024

I also think this is a big problem. Take, for example, the following scenario: I would like to have my functions and core codebase managed in a git repository, simplified somewhat to:

src/my_module.py

For example, consider the following cells in the notebook:

%%local
s = "Some string I want to send over"

%%send_to_spark -i s -t str -n s

This will successfully send the string.

What I want to achieve (or something similar) is to import things in my local environment and send them over:

%%local
from src.my_module import foo_function

and then send foo_function to Spark.
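A rough sketch of how that could work today with the existing magics, using inspect to grab the function's source (foo_function and src.my_module are from the example above; this is a workaround, not the feature being requested):

%%local
import inspect
from src.my_module import foo_function
# Capture the function's source text so it can be shipped as a plain string
foo_src = inspect.getsource(foo_function)

%%send_to_spark -i foo_src -t str -n foo_src

# In a cluster cell, recreate the function on the Spark driver:
exec(foo_src)

This only recreates the function definition; any packages it imports still have to be present on the cluster.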

Getting this to work is crucial for proper git management, instead of the monolithic trash we end up with these days.
