
[BUG] SparkMagic pyspark kernel magic(%%sql) hangs when running with Papermill. #833

Open
edwardps opened this issue Sep 6, 2023 · 17 comments
Labels
kind:bug An unexpected error or issue with sparkmagic

Comments

@edwardps (Contributor) commented Sep 6, 2023

I initially reported this as a Papermill issue (though I am not quite sure it is one). I am copying that issue to the SparkMagic community to see if any experts can provide advice on unblocking. Please feel free to close this if it is not a SparkMagic issue. Thanks in advance.

Describe the bug
Our use case is to use SparkMagic wrapper kernels with Papermill notebook execution.
Most functions work as expected except the %%sql magic, which gets stuck during execution. SparkMagic works properly when executed interactively in JupyterLab; the issue only happens for the %%sql magic when running with Papermill.

From the debug log (attached), I can see that the %%sql logic was executed and the response was retrieved. The execution state returned to idle at the end, but the output of the %%sql cell was not updated properly and the following cells were not executed.

The following content was printed by Papermill, which shows that %%sql executed properly. This content was not rendered into the cell output.

msg_type: display_data
content: {'data': {'text/plain': '<IPython.core.display.HTML object>',
                   'text/html': '<styled HTML table of the result DataFrame:
                                  database | tableName     | isTemporary
                                  default  | movie_reviews | False       >'},
          'metadata': {}, 'transient': {}}

To Reproduce

conda create --name py310 python=3.10
conda activate py310

pip install sparkmagic
pip install papermill

# install kernelspecs
SITE_PACKAGES_LOC=$(pip show sparkmagic | grep Location | awk '{print $2}')
cd $SITE_PACKAGES_LOC
jupyter-kernelspec install sparkmagic/kernels/sparkkernel --user
jupyter-kernelspec install sparkmagic/kernels/pysparkkernel --user
jupyter-kernelspec install sparkmagic/kernels/sparkrkernel --user

jupyter nbextension enable --py --sys-prefix widgetsnbextension 

pip install notebook==6.5.1  # downgraded from 7.0.3 to 6.5.1 due to ModuleNotFoundError: No module named 'notebook.utils'

# Run the papermill job (notebook is also uploaded)
# Before running this, an EMR cluster is needed and the sparkmagic configuration must be in place
# (a minimal config sketch follows this block).
# If it's not possible/easy to create one, please comment with any testing/verification needed and I can help.
# You can also check the uploaded papermill debug log.
papermill pm_sparkmagic_test.ipynb output1.ipynb --kernel pysparkkernel  --log-level DEBUG 
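
For reference, here is a minimal sketch of the kind of ~/.sparkmagic/config.json mentioned above, assuming a Livy endpoint on the EMR primary node. The hostname, port, and auth values are placeholders, and sparkmagic falls back to its built-in defaults for any keys not listed.

import json
import os

# Hedged sketch: minimal config pointing the pyspark wrapper kernel at Livy.
config = {
    "kernel_python_credentials": {
        "username": "",
        "password": "",
        "url": "http://<emr-primary-node-dns>:8998",  # placeholder Livy URL
        "auth": "None",
    }
}

os.makedirs(os.path.expanduser("~/.sparkmagic"), exist_ok=True)
with open(os.path.expanduser("~/.sparkmagic/config.json"), "w") as f:
    json.dump(config, f, indent=2)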

The following is the list of packages that are likely most relevant. I also attached a text file containing all the packages.

pip list | grep 'papermill\|sparkmagic\|autovizwidget\|hdijupyterutils\|ipykernel\|ipython\|ipywidgets\|mock\|nest-asyncio\|nose\|notebook\|numpy\|pandas\|requests\|requests-kerberos\|tornado\|ansiwrap\|click\|entrypoints\|nbclient\|nbformat\|pyyaml\|requests\|tenacity\|tqdm\|jupyter\|ipython'|sort
ansiwrap                  0.8.4
autovizwidget             0.20.5
click                     8.1.7
entrypoints               0.4
hdijupyterutils           0.20.5
ipykernel                 6.25.2
ipython                   8.15.0
ipython-genutils          0.2.0
ipywidgets                8.1.0
jupyter                   1.0.0
jupyter_client            8.3.1
jupyter-console           6.6.3
jupyter_core              5.3.1
jupyter-events            0.7.0
jupyterlab                4.0.5
jupyterlab-pygments       0.2.2
jupyterlab_server         2.24.0
jupyterlab-widgets        3.0.8
jupyter-lsp               2.2.0
jupyter_server            2.7.3
jupyter_server_terminals  0.4.4
nbclient                  0.8.0
nbformat                  5.9.2
nest-asyncio              1.5.5
notebook                  6.5.1
notebook_shim             0.2.3
numpy                     1.25.2
pandas                    1.5.3
papermill                 2.4.0
requests                  2.31.0
requests-kerberos         0.14.0
sparkmagic                0.20.5
tenacity                  8.2.3
tornado                   6.3.3
tqdm                      4.66.1

Expected behavior
The %%sql cell should not hang, and the following cells should proceed to execute.

Screenshots
Output notebook of papermill:
[screenshot]

Expected output (from JupyterLab):
[screenshot]

Versions:

  • SparkMagic (0.20.5)
  • Livy (N/A)
  • Spark (N/A)

Additional context
log and other files.zip contains:

1. log - papermill debugging log
2. my_test_env_requirements.txt - full list of packages in the conda env
3. pm_sparkmagic_test.ipynb - the notebook executed in JupyterLab; it is also the input of the papermill job
4. output1.ipynb - output notebook from the papermill job
@dosubot dosubot bot added the kind:bug An unexpected error or issue with sparkmagic label Sep 6, 2023
dosubot bot commented Sep 6, 2023

🤖

Hi there,

Thank you for the detailed report. This seems to be a complex issue related to the interaction between SparkMagic, Livy, and Papermill.

From your description, it appears that the %%sql magic command is executed successfully, but the output is not rendered properly in the cell, and the execution of subsequent cells is blocked. This might be due to the asynchronous nature of SparkMagic's interaction with Livy.

When you run a cell with SparkMagic in JupyterLab, it sends a request to Livy and then polls for the result. This works fine in an interactive environment like JupyterLab, but when you're running a notebook non-interactively with Papermill, it might not wait long enough for the result to come back from Livy before moving on to the next cell.

You might want to look into adjusting the Livy session timeout or the polling interval in SparkMagic's configuration to see if that helps. However, please note that this is a common issue with running SparkMagic in non-interactive environments and there might not be a straightforward solution.
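
If you want to experiment with that, here is a hedged sketch of raising the retry and startup-timeout settings in ~/.sparkmagic/config.json; the key names come from sparkmagic's example_config.json, and the values are only illustrative.

import json
import os

path = os.path.expanduser("~/.sparkmagic/config.json")
with open(path) as f:
    cfg = json.load(f)

cfg["retry_policy"] = "configurable"
cfg["configurable_retry_policy_max_retries"] = 30      # poll the statement more times
cfg["livy_session_startup_timeout_seconds"] = 180      # wait longer for the session

with open(path, "w") as f:
    json.dump(cfg, f, indent=2)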

As for the codebase, SparkMagic handles the execution of %%sql magic commands through the execute_sqlquery method in the SparkMagicBase class. This method is responsible for executing the SQL query on the remote Spark cluster. It creates an SQLQuery object using the parameters provided by the user and then calls the run_sqlquery method of the spark_controller object to execute the SQL query on the remote Spark cluster.

The SQLQuery object has an execute method that emits a SQL execution start event, converts the SQL query to a command based on the session kind (PySpark, Spark, or SparkR), executes the command, and then emits a SQL execution end event. If the execution is successful, it converts the result to a DataFrame and returns it. If an error occurs during the execution, it emits a SQL execution end event with the error details and raises the error.

I hope this information helps. If you have any further questions or need more clarification, please let me know.

Best,
Dosu


@edwardps (Contributor, Author) commented Sep 6, 2023

Thanks Dosu for the quick reply.

I actually tried to troubleshoot by adding more logging in command.py. Ideally, each retry adds one line to the notebook cell output area. I noticed that sometimes the notebook cell does not show all of the retry logs before the statement goes from the running to the available state. This might align with what you mentioned about the timing issue.

But when I changed configurable_retry_policy_max_retries, it did not help and I saw similar behavior. I do not fully understand how Papermill behaves in this case.

but when you're running a notebook non-interactively with Papermill, it might not wait long enough for the result to come back from Livy before moving on to the next cell.

Actually, it did not move to the next cell; it got stuck in the %%sql magic cell.

dosubot bot commented Sep 6, 2023

🤖

Hello,

Thank you for providing a detailed description of the issue you're experiencing. This seems to be related to the asynchronous nature of the SparkMagic %%sql command. When running in Jupyter, the command is designed to execute asynchronously, allowing the notebook to continue executing subsequent cells while the Spark job runs in the background. However, when running in Papermill, the execution model is synchronous, meaning it waits for each cell to finish executing before moving on to the next one.

To resolve this issue, you can try to modify the SparkMagic %%sql command to run synchronously. This can be done by modifying the execute_sqlquery method in the SparkMagicBase class in the sparkmagicsbase.py file. You would need to add a call to the wait_for_idle method after the run_sqlquery method call in the execute_sqlquery method. This will force the command to wait for the Spark job to finish before returning.

Here's how you can modify the execute_sqlquery method:

def execute_sqlquery(
    self,
    cell,
    samplemethod,
    maxrows,
    samplefraction,
    session,
    output_var,
    quiet,
    coerce,
):
    sqlquery = self._sqlquery(cell, samplemethod, maxrows, samplefraction, coerce)
    df = self.spark_controller.run_sqlquery(sqlquery, session)

    # Add this line to wait for the Spark job to finish
    self.spark_controller.wait_for_idle()

    if output_var is not None:
        self.shell.user_ns[output_var] = df
    if quiet:
        return None
    else:
        return df

Please note that this modification will make the %%sql command run synchronously in all environments, not just in Papermill. If you want to maintain the asynchronous behavior in Jupyter, you might need to add some logic to detect the execution environment and only call wait_for_idle when running in Papermill.

I hope this helps! Let me know if you have any other questions or if there's anything else I can do to assist you.


@edwardps (Contributor, Author) commented Sep 6, 2023

Awesome! Thanks a lot.
I will give it a try and see if it could fix the issue.

@edwardps (Contributor, Author) commented Sep 7, 2023

Hi Dosu,

sparkmagicsbase.py 
def execute_sqlquery(
    ...
    # Add this line to wait for the Spark job to finish
    self.spark_controller.wait_for_idle()

Since spark_controller does not have wait_for_idle, I made the change in command.py instead. Between lines 57 and 58, I added the following lines:

            session.wait_for_idle()
            print('session.wait_for_idle...')

I also added a log line for the session status.

print(f'session status -> {status}')

I can observe the session waiting for some time, and then there is no further log in the notebook. In the Papermill backend log, I see more entries until the session is idle again.

In the output notebook:

[screenshot]

In the papermill log, the session reaches idle.

[screenshot]

It seems papermill does not wait for wait_for_idle to complete. Any suggestions?

dosubot bot commented Sep 7, 2023

🤖

Hello,

Thank you for providing detailed information about the issue you're experiencing. It seems like the problem might be due to the asynchronous nature of the execute method in the SQLQuery class. This method sends a request to the Spark cluster and returns immediately, without waiting for the Spark job to complete. This can cause issues when running cells in Jupyter notebooks using Papermill, as the notebook might try to execute the next cell before the previous one has finished executing.

To resolve this issue, you can modify the execute method to wait for the Spark job to complete before returning. This can be done by polling the status of the Spark job in a loop until it's done.

Here's an example of how you can modify the execute method:

# Note: `import time` is needed at the top of sqlquery.py for time.sleep below.
def execute(self, session):
    self._spark_events.emit_sql_execution_start_event(
        session.guid,
        session.kind,
        session.id,
        self.guid,
        self.samplemethod,
        self.maxrows,
        self.samplefraction,
    )
    command_guid = ""
    try:
        command = self.to_command(session.kind, session.sql_context_variable_name)
        command_guid = command.guid
        (success, records_text, mimetype) = command.execute(session)

        # Poll the status of the command until it's done
        while True:
            statement = session.http_client.get_statement(session.id, command_guid)
            if statement['state'] in ['available', 'error']:
                break
            time.sleep(1)  # Wait for a second before polling the status again

        if not success:
            raise BadUserDataException(records_text)
        result = records_to_dataframe(records_text, session.kind, self._coerce)
    except Exception as e:
        self._spark_events.emit_sql_execution_end_event(
            session.guid,
            session.kind,
            session.id,
            self.guid,
            command_guid,
            False,
            e.__class__.__name__,
            str(e),
        )
        raise

    else:
        self._spark_events.emit_sql_execution_end_event(
            session.guid,
            session.kind,
            session.id,
            self.guid,
            command_guid,
            True,
            "",
            "",
        )
        return result

This modification will make the execute method block until the Spark job is done, which should resolve the issue of cells in Jupyter notebooks not executing in the correct order when run with Papermill.

Please note that this is a simplified example and the actual implementation might need to be adjusted based on the specifics of your Spark cluster and the rest of your code. For example, you might want to add error handling for cases where the Spark job fails or takes too long to complete.
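
To cover the "takes too long" case, here is a hedged sketch of the same polling loop with a deadline; the helper name is hypothetical, and the get_statement call simply mirrors the snippet above rather than sparkmagic's actual API.

import time

def wait_for_statement(session, command_guid, poll_interval=1.0, max_wait=600.0):
    # Poll until the statement leaves the running states or the deadline passes.
    deadline = time.time() + max_wait
    while True:
        statement = session.http_client.get_statement(session.id, command_guid)
        state = statement["state"]
        if state in ("available", "error", "cancelled"):
            return statement
        if time.time() > deadline:
            raise TimeoutError(
                f"Statement {command_guid} still '{state}' after {max_wait}s"
            )
        time.sleep(poll_interval)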

I hope this helps! Let me know if you have any other questions or need further clarification.

Best,
Dosu


@edwardps (Contributor, Author) commented Sep 7, 2023

Thanks Dosu.
I added the waiting code in sqlquery.py. The following is the output from JupyterLab; the new log line ("waiting in sqlquery execute....") is shown in the cell output.

[screenshot]

When executing with papermill, the log does not show up in the notebook cell output. It seems the cell output updates stopped at some point, and the subsequent cell did not run either.
[screenshot]

It seems like the problem might be due to the asynchronous nature of the execute method in the SQLQuery class.

Can you share how the asynchronous mode works? It would be great if you could share the code that calls the Livy API asynchronously; I am trying to understand how the execution pattern of the %%sql magic differs.

The following piece works. I am wondering whether we can use the same call pattern to run the %%sql logic.

sqlContext = HiveContext(sqlContext)

dbs = sqlContext.sql("show databases")
dbs.show()

tables = sqlContext.sql("show tables")
tables.show()

Execution log:
From the log, we can see that all the cell logic and sparkmagic logic are executed, but the cell output is not updated properly and the subsequent cell does not execute.
log3.log

dosubot bot commented Sep 7, 2023

🤖

Hello,

Thank you for providing a detailed description of your issue. It seems like you're experiencing a problem with the asynchronous execution of the %%sql magic command in SparkMagic when running with Papermill.

The %%sql magic command in SparkMagic is designed to execute SQL queries against the Spark context. When you run this command, it first checks if a session has been started, parses the arguments provided in the command line, and then executes the SQL query using the execute_sqlquery method. This method creates an SQLQuery object with the provided parameters and runs it using the run_sqlquery method of the spark_controller. The SQLQuery object then converts itself to a Command object based on the session kind and the SQL context variable name. The command is a string that will be executed in the Spark session. The SQLQuery object then executes the Command object in the Spark session. If the command execution is successful, it converts the result to a dataframe and returns it. If the command execution fails, it raises an exception.

As for the asynchronous execution in SparkMagic, it is handled by sending commands to the Livy server and waiting for the result. The actual execution of the commands is done asynchronously on the Livy server, allowing the Jupyter notebook to continue with other tasks.

Now, regarding your issue with Papermill, it seems like Papermill is not waiting for the wait_for_idle to complete. This could be due to the way Papermill handles cell execution. Papermill executes each cell in the notebook individually and waits for the cell to complete before moving on to the next one. If a cell takes too long to execute, Papermill might time out and stop the execution. This could be why the %%sql command is hanging when running with Papermill.

One possible solution could be to increase the cell execution timeout in Papermill. You can do this by passing the --request-timeout option when running Papermill. For example:

papermill pm_sparkmagic_test.ipynb output1.ipynb --kernel pysparkkernel  --log-level DEBUG --request-timeout 600

This will set the cell execution timeout to 600 seconds (10 minutes). You can adjust this value as needed.

I hope this helps! If you have any other questions or need further clarification, please let me know.


@edwardps (Contributor, Author) commented Sep 7, 2023

--request-timeout is not supported by papermill. --execution-timeout is supported; I tried that, but it did not work.

 --execution-timeout INTEGER     Time in seconds to wait for each cell before
                                  failing execution (default: forever)

Can you share how the pyspark code is executed? Does it follow the async mode?
I see that the following code succeeds reliably. Can you suggest whether we can follow the same pattern for %%sql?

sqlContext = HiveContext(sqlContext)

dbs = sqlContext.sql("show databases")
dbs.show()

tables = sqlContext.sql("show tables")
tables.show()

@devstein (Collaborator) commented Sep 7, 2023

Hi @edwardps, glad Dosu (the bot) was able to help you narrow down the cause. I believe the bot is correct that the issue has something to do with the async execution of the cell magic.

I actually encountered this issue when implementing ipykernel >= 6 support and left a comment on a related papermill issue; however, the final implementation fixed the papermill issue for me 🤔

Maybe try changing your ipykernel version and see if that helps. I'd try both downgrading to v5 and upgrading to v7.

@edwardps (Contributor, Author) commented Sep 7, 2023

Thanks @devstein. I will give it a try and get back soon.
Do you happen to know the difference in call pattern between %%sql and running pyspark code?

BTW, Dosu works awesome!

dosubot bot commented Sep 8, 2023

🤖

Hello,

The %%sql magic command in SparkMagic and direct PySpark code have different execution patterns. The %%sql magic command uses the Livy REST interface to submit the SQL query to the Spark cluster and retrieve the results. This process is asynchronous, meaning the command returns immediately after submitting the query, and the actual execution happens in the Spark cluster. This could be the reason why you're experiencing the hanging issue when using Papermill.
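
For illustration, here is a minimal sketch of that submit-then-poll pattern against Livy's REST API; the endpoint URL and session id are placeholders, and this is plain requests code to show the underlying HTTP flow rather than sparkmagic's own client.

import time
import requests

LIVY_URL = "http://<livy-host>:8998"   # placeholder Livy endpoint
SESSION_ID = 0                         # placeholder: id of an existing Livy session

# Submit a statement to the session.
resp = requests.post(
    f"{LIVY_URL}/sessions/{SESSION_ID}/statements",
    json={"code": 'spark.sql("show tables").show()'},
)
statement_id = resp.json()["id"]

# Poll the statement until Livy reports it has finished.
while True:
    st = requests.get(
        f"{LIVY_URL}/sessions/{SESSION_ID}/statements/{statement_id}"
    ).json()
    if st["state"] in ("available", "error", "cancelled"):
        print(st.get("output"))
        break
    time.sleep(1)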

On the other hand, direct PySpark code gives you more control over the execution process. You can create a SparkSession, execute transformations and actions on RDDs/DataFrames, and manage resources manually.

To modify the %%sql magic command to follow the same execution pattern as the direct PySpark code, you would need to modify the SparkMagic codebase. Specifically, you would need to modify the execute method in the SQLQuery class in the sparkmagic.livyclientlib.sqlquery module. This method is responsible for executing the SQL query and retrieving the results.

However, modifying the SparkMagic codebase might not be the best solution to resolve the hanging issue. Instead, you could try the following:

  1. Increase the Livy session timeout in your SparkMagic configuration. This could prevent the Livy session from timing out while waiting for the SQL query to complete.

  2. Use direct PySpark code instead of the %%sql magic command. This gives you more control over the execution process and might help to avoid the hanging issue.

  3. Check the resources of your Spark cluster. The hanging issue could be due to insufficient resources in the Spark cluster, especially if you are running large SQL queries.

  4. Check the logs of your Spark cluster and the Livy server. They might contain useful information about why the SQL query is hanging.

I hope this helps! Let me know if you have any other questions.

Best,
Dosu


@devstein (Collaborator) commented Sep 8, 2023

@edwardps Sounds good, let me know.

BTW, Dosu works awesome!

Thank you! Let me know if you have other open source projects that could benefit from it.

^I'm still teaching it when it should and shouldn't respond 😅

@edwardps (Contributor, Author) commented Sep 8, 2023

FYI - @devstein

ipykernel                 5.5.6
ipython                   7.33.0
ipython-genutils          0.2.0
ipywidgets                7.7.0

Downgraded ipykernel to 5.5.6 (ipykernel has no v7), but did not have any luck.

Do you happen to know call pattern difference between %%sql and running pyspark code?

Is it possible to have %%sql follow the call pattern used by pyspark code execution (since the pyspark code cell works properly)?

@edwardps (Contributor, Author) commented Sep 8, 2023

FYI - @devstein
Made some progress on the troubleshooting: I narrowed this down to the visualization.

If I comment out the following two lines, it works.
https://github.com/jupyter-incubator/sparkmagic/blob/master/sparkmagic/sparkmagic/kernels/wrapperkernel/sparkkernelbase.py#L118-L119

        if conf.use_auto_viz():
            self._register_auto_viz()

From my observation, the plotly-based visualization is then disabled.
[screenshot]
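
As a possible workaround (untested here), auto-viz can also be turned off through the config file instead of editing sparkkernelbase.py; "use_auto_viz" is the key that conf.use_auto_viz() reads. A hedged sketch:

import json
import os

# Set "use_auto_viz": false so the wrapper kernel skips _register_auto_viz() at startup.
path = os.path.expanduser("~/.sparkmagic/config.json")
cfg = json.load(open(path))
cfg["use_auto_viz"] = False
json.dump(cfg, open(path, "w"), indent=2)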

@devstein (Collaborator) commented Sep 9, 2023

Thanks @edwardps, that's helpful... that means it's failing to execute this code while the kernel is being created.

I ran into this issue when I tried to make the _register_auto_viz() function async. It's curious that it no longer works.

Is there any difference in behavior if the first cell isn't %%sql?

For example, put the magic in its own cell after a plain Python first cell:

Cell 1:
print("start the spark session")

Cell 2:
%%sql
show tables

@edwardps (Contributor, Author) commented:

Is there any difference in behavior if the first cell isn't %%sql?
There is not much difference; the %%sql cell still gets stuck.
