
[BUG] %%send-to-spark fails for dataframes with '\n' or ' characters #862

Open
spbansal opened this issue Dec 1, 2023 · 2 comments
Labels
kind:bug An unexpected error or issue with sparkmagic

Comments


spbansal commented Dec 1, 2023

Describe the bug
When using %%send-to-spark with a local pandas dataframe that contains '\n' (newline) or ' (single quote) characters, the command fails with the following error:
SyntaxError: invalid syntax
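The failure mode can be reproduced outside sparkmagic: if a value containing an unescaped single quote is interpolated into a single-quoted string literal in generated code, compiling that code raises SyntaxError. A minimal sketch (plain Python; this is illustrative, not sparkmagic's actual code-generation template):

```python
# A value like those in the dataframe below, containing a single quote.
value = "A's and B's"

# Naively interpolating it into a single-quoted literal, as generated
# code sent to the remote kernel might do:
generated = "df_json = '%s'" % value  # -> df_json = 'A's and B's'

try:
    compile(generated, "<generated>", "exec")
    raised = False
except SyntaxError:
    raised = True

print(raised)  # the unescaped quote makes the generated code invalid
```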
To Reproduce
You can run the following notebook for reproducing this issue

%%local
import pandas as pd
from io import StringIO

csv_string = """Column2,Column3
0.5205293969790654,A's and B's
0.2995962535000757,B's_another_example
0.7410351292041173,C's_example
"""
large_df = pd.read_csv(StringIO(csv_string))

%%send_to_spark -i large_df -t df -n large_df

Expected behavior
The dataframe should have been properly sent to the remote Spark kernel.

Screenshots
Screenshot 2023-12-01 at 11 05 29 AM

Versions:

  • SparkMagic - 0.20.0
  • Spark 2.4.3

Additional context
A workaround for this was to base64 encode any string fields before sending to spark. It would be great if this or a different solution could be patched in the repo.
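A sketch of that workaround (helper names are illustrative, not sparkmagic API): base64-encode every string column locally before sending, then apply the inverse transform once the dataframe arrives on the Spark side.

```python
import base64
import pandas as pd

def encode_str_columns(df):
    """Base64-encode string columns so quotes/newlines cannot break
    the generated code (workaround sketch, not sparkmagic API)."""
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == object:
            out[col] = out[col].map(
                lambda s: base64.b64encode(s.encode("utf-8")).decode("ascii")
            )
    return out

def decode_str_columns(df):
    """Inverse transform, to run on the remote side after %%send_to_spark."""
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == object:
            out[col] = out[col].map(
                lambda s: base64.b64decode(s).decode("utf-8")
            )
    return out

df = pd.DataFrame({"c": ["A's and B's", "line1\nline2"]})
roundtrip = decode_str_columns(encode_str_columns(df))
print(roundtrip.equals(df))  # True
```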


dosubot bot commented Dec 1, 2023

🤖

Hello @spbansal,

I'm Dosu, a bot here to assist you while a human maintainer is not available. I'm designed to help resolve issues, answer questions, and guide you on how to contribute to our projects. I'm here to help, so let's work together to find a solution to your issue.

Thank you for reporting this issue. It seems like the problem is due to the '\n' and single quote characters not being properly escaped when the pandas dataframe is converted to a JSON string. This can be fixed by modifying the _pandas_df_to_json method in the SendPandasDfToSparkCommand class to properly escape these characters.

Here's a patch that should fix the issue:

from pandas.io.json import dumps

class SendPandasDfToSparkCommand(Command):
    # ...

    def _pandas_df_to_json(self, records):
        # Escape literal newlines and single quotes so the JSON payload
        # can be embedded in a single-quoted string literal in the code
        # generated for the remote kernel.
        return dumps(records, ensure_ascii=False).replace('\n', '\\n').replace("'", "\\'")

This patch modifies the _pandas_df_to_json method to replace newline characters with their escaped form ('\\n') and single quotes with escaped single quotes (\\'). The dumps function from pandas.io.json is used to convert the dataframe records to a JSON string.

Please apply this patch to your local copy of sparkmagic and test it with your use case. If it works as expected, you may want to submit it as a pull request to the sparkmagic repository so that it can be included in future versions of the software.

As for your question about how the '%%send-to-spark' command is implemented in the sparkmagic codebase, it is implemented in the SendToSparkCommand class in the sendtosparkcommand.py file. This class extends the Command class and overrides the execute method. The execute method takes a session as an argument and tries to execute a command that is returned by the to_command method. The to_command method takes the session kind, input variable name, input variable value, and output variable name as arguments. Depending on the session kind, it returns a PySpark, Scala, or R command. These commands are created by the _pyspark_command, _scala_command, and _r_command methods respectively. These methods are marked as abstract and need to be implemented in a subclass. They raise a NotImplementedError if they are not overridden.

I hope this helps! If you have any other questions, feel free to ask.

Best,
Dosu


@dosubot dosubot bot added the kind:bug An unexpected error or issue with sparkmagic label Dec 1, 2023

lepusarticus commented Jan 17, 2024

@dosu-bot

Based on the error message, I think that a cell containing double quotes would also be a problem.

I prefer the human suggestion to encode everything in base64 and decode on the other end. It would be safer.
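A quick sanity check of why base64 is safer (plain Python, no Spark): the encoded payload only contains characters from the base64 alphabet, so it can never terminate a string literal in the generated code, whatever quotes, backslashes, or newlines the original data held.

```python
import base64
import json

# A record containing every troublesome character class at once.
record = {"Column3": 'A\'s, B"s and\na newline'}

# Encode the JSON payload as base64 before embedding it in generated code.
payload = base64.b64encode(json.dumps(record).encode("utf-8")).decode("ascii")
generated = (
    "import base64, json\n"
    "decoded = json.loads(base64.b64decode('%s').decode('utf-8'))\n" % payload
)

# The generated code is always syntactically valid, so exec succeeds
# and the data survives the round trip intact.
namespace = {}
exec(compile(generated, "<generated>", "exec"), namespace)
print(namespace["decoded"] == record)  # True
```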
