{"payload":{"feedbackUrl":"https://github.com/orgs/community/discussions/53140","repo":{"id":17165658,"defaultBranch":"master","name":"spark","ownerLogin":"apache","currentUserCanPush":false,"isFork":false,"isEmpty":false,"createdAt":"2014-02-25T08:00:08.000Z","ownerAvatar":"https://avatars.githubusercontent.com/u/47359?v=4","public":true,"private":false,"isOrgOwned":true},"refInfo":{"name":"","listCacheKey":"v0:1717355095.0","currentOid":""},"activityList":{"items":[{"before":"8cf3195aed8ac0e92734a3b8a55b5332ae0e3832","after":"bc187013da821eba0ffff2408991e8ec6d2749fe","ref":"refs/heads/master","pushedAt":"2024-06-02T08:52:27.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"HyukjinKwon","name":"Hyukjin Kwon","path":"/HyukjinKwon","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/6477701?s=80&v=4"},"commit":{"message":"[SPARK-48220][PYTHON] Allow passing PyArrow Table to createDataFrame()\n\n### What changes were proposed in this pull request?\n- Add support for passing a PyArrow Table to `createDataFrame()`.\n- Document this on the **Apache Arrow in PySpark** user guide page.\n- Fix an issue with timestamp and struct columns in `toArrow()`.\n\n### Why are the changes needed?\nThis seems like a logical next step after the addition of a `toArrow()` DataFrame method in #45481.\n\n### Does this PR introduce _any_ user-facing change?\nUsers will have the ability to pass PyArrow Tables to `createDataFrame()`. There are no changes to the parameters of `createDataFrame()`. The only difference is that `data` can now be a PyArrow Table.\n\n### How was this patch tested?\nMany tests were added, for Spark Classic and Spark Connect. I ran the tests locally with older versions of PyArrow installed (going back to 10.0).\n\n### Was this patch authored or co-authored using generative AI tooling?\nNo\n\nCloses #46529 from ianmcook/SPARK-48220.\n\nAuthored-by: Ian Cook \nSigned-off-by: Hyukjin Kwon ","shortMessageHtmlLink":"[SPARK-48220][PYTHON] Allow passing PyArrow Table to createDataFrame()"}},{"before":"1cecdc7596e078b4917f456bfbd2435ff9022f2f","after":"8cf3195aed8ac0e92734a3b8a55b5332ae0e3832","ref":"refs/heads/master","pushedAt":"2024-06-02T03:41:02.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"sunchao","name":"Chao Sun","path":"/sunchao","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/506679?s=80&v=4"},"commit":{"message":"[SPARK-48392][CORE][FOLLOWUP] Add `--load-spark-defaults` flag to decide whether to load `spark-defaults.conf`\n\n### What changes were proposed in this pull request?\n\nFollowing the discussions in #46709, this PR adds a flag `--load-spark-defaults` to control whether `spark-defaults.conf` should be loaded when `--properties-file` is specified. By default, the flag is turned off.\n\n### Why are the changes needed?\n\nIdeally we should avoid behavior change and reduce user disruptions. 
## 2024-06-02 03:41 UTC · push to `master` · Chao Sun

**[SPARK-48392][CORE][FOLLOWUP] Add `--load-spark-defaults` flag to decide whether to load `spark-defaults.conf`**

**What changes were proposed in this pull request?** Following the discussions in #46709, this PR adds a flag `--load-spark-defaults` to control whether `spark-defaults.conf` should be loaded when `--properties-file` is specified. By default, the flag is turned off.

**Why are the changes needed?** Ideally we should avoid behavior changes and reduce user disruption. For this reason, a flag is preferred.

**Does this PR introduce any user-facing change?** No.

**How was this patch tested?** N/A.

**Was this patch authored or co-authored using generative AI tooling?** No.

Closes #46782 from sunchao/SPARK-48392-followup. Authored-by: Chao Sun. Signed-off-by: Chao Sun.

## 2024-06-01 06:28 UTC · push to `branch-3.5` · Jungtaek Lim

**[SPARK-48481][SQL][SS] Do not apply OptimizeOneRowPlan against streaming Dataset** — cherry-picked from commit 1cecdc7596e078b4917f456bfbd2435ff9022f2f on `master`; see the next entry for the full description. Closes #46820 from HeartSaVioR/SPARK-48481. Authored-by: Jungtaek Lim. Signed-off-by: Jungtaek Lim.

## 2024-06-01 06:15 UTC · push to `master` · Jungtaek Lim

**[SPARK-48481][SQL][SS] Do not apply OptimizeOneRowPlan against streaming Dataset**

**What changes were proposed in this pull request?** This PR proposes to exclude streaming Datasets from the target of OptimizeOneRowPlan.

**Why are the changes needed?** The rule should not be applied to a streaming source, since the number of rows it sees is only for the current microbatch; it does not mean the streaming source will ever produce at most one row during the lifetime of the query. Suppose batch 0 runs with empty data in streaming source A, which triggers the rule on an Aggregate, and batch 1 runs with several rows in streaming source A, which no longer triggers the rule. In that scenario the query could fail, because a stateful operator is expected to be planned for every batch, whereas here it is planned only "selectively".

**Does this PR introduce any user-facing change?** Yes, but the behavior can be reverted with a new config, `spark.sql.streaming.optimizeOneRowPlan.enabled`, although there should be only rare cases where users have to turn the config on.

**How was this patch tested?** New UT.

**Was this patch authored or co-authored using generative AI tooling?** No.

Closes #46820 from HeartSaVioR/SPARK-48481. Authored-by: Jungtaek Lim. Signed-off-by: Jungtaek Lim.
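If the pre-fix behavior is ever needed, the description above names an escape hatch. A hedged sketch of toggling it from PySpark (the rate-source query is only a stand-in for a real streaming aggregation and is not started here):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Opt back into applying OptimizeOneRowPlan to streaming plans
# (most users should leave this config unset).
spark.conf.set("spark.sql.streaming.optimizeOneRowPlan.enabled", "true")

# A streaming aggregation like this is the kind of query the fix protects:
# the stateful aggregate must be planned in every microbatch.
counts = spark.readStream.format("rate").load().groupBy().count()
```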
## 2024-05-31 23:45 UTC · push to `master` · Gengliang Wang

**[SPARK-48490][CORE] Unescapes any literals for message of MessageWithContext**

**What changes were proposed in this pull request?** The PR aims to unescape any literals in the `message` of `MessageWithContext`.

**Why are the changes needed?** For example, before this PR:

```
logInfo("This is a log message\nThis is a new line \t other msg")
```

outputs:

```
24/05/31 22:53:27 INFO PatternLoggingSuite: This is a log message
This is a new line 	 other msg
```

But:

```
logInfo(log"This is a log message\nThis is a new line \t other msg")
```

outputs:

```
24/05/31 22:53:59 ERROR PatternLoggingSuite: This is a log message\nThis is a new line \t other msg
```

Obviously, the latter is not the expected result.

**Does this PR introduce any user-facing change?** Yes, it fixes a bug.

**How was this patch tested?** Added a new UT; passes GA.

**Was this patch authored or co-authored using generative AI tooling?** No.

Closes #46824 from panbingkun/SPARK-48490. Authored-by: panbingkun. Signed-off-by: Gengliang Wang.

## 2024-05-31 22:56 UTC · push to `branch-3.5` · Wenchen Fan

**[SPARK-48391][CORE] Using addAll instead of add function in fromAccumulatorInfos method of TaskMetrics Class** — cherry-picked from commit 3cd35f8cb6462051c621cf49de54b9c5692aae1d on `master`; see the next entry for the full description. Closes #46705 from monkeyboy123/fromAccumulators-accelerate. Authored-by: Dereck Li. Signed-off-by: Wenchen Fan.
## 2024-05-31 22:56 UTC · push to `master` · Wenchen Fan

**[SPARK-48391][CORE] Using addAll instead of add function in fromAccumulatorInfos method of TaskMetrics Class**

**What changes were proposed in this pull request?** Use `addAll` instead of the `add` function in the `fromAccumulators` method of TaskMetrics.

**Why are the changes needed?** To improve performance. In the `fromAccumulators` method of TaskMetrics we should use `tm._externalAccums.addAll` instead of `tm._externalAccums.add`, as `_externalAccums` is an instance of `CopyOnWriteArrayList`.

**Does this PR introduce any user-facing change?** Yes.

**How was this patch tested?** No tests.

**Was this patch authored or co-authored using generative AI tooling?** No.

Closes #46705 from monkeyboy123/fromAccumulators-accelerate. Authored-by: Dereck Li. Signed-off-by: Wenchen Fan.

## 2024-05-31 21:01 UTC · push to `master` · Wenchen Fan

**[SPARK-48465][SQL] Avoid no-op empty relation propagation**

**What changes were proposed in this pull request?** Narrow down the `PropagateEmptyRelation` matching pattern from "non-`LocalRelation` `LeafNode`" to "`LogicalQueryStage` that contains a direct `QueryStageExec`", which still allows partial-aggregate empty relations to be propagated.

**Why are the changes needed?** We should avoid no-op empty relation propagation in AQE: if we convert an empty `QueryStageExec` to an empty relation, it will be wrapped into a new query stage and executed, produce an empty result, and trigger empty relation propagation again. This issue is currently not exposed because AQE will try to reuse the shuffle.

**Does this PR introduce any user-facing change?** No.

**How was this patch tested?** Existing tests.

**Was this patch authored or co-authored using generative AI tooling?** No.

Closes #46814 from liuzqt/SPARK-48465. Authored-by: Ziqi Liu. Signed-off-by: Wenchen Fan.
## 2024-05-31 16:50 UTC · push to `master` · Gengliang Wang

**[SPARK-47578][R] Migrate RPackageUtils with variables to structured logging framework**

**What changes were proposed in this pull request?** Migrate logging with variables in the Spark `RPackageUtils` module to the structured logging framework. This transforms log* calls of APIs like:

```
def logWarning(msg: => String): Unit
```

to:

```
def logWarning(entry: LogEntry): Unit
```

**Why are the changes needed?** To enhance Apache Spark's logging system by implementing structured logging.

**Does this PR introduce any user-facing change?** Yes, Spark core logs will contain additional MDC.

**How was this patch tested?** Compiler and Scala style checks, as well as code review.

**Was this patch authored or co-authored using generative AI tooling?** Brief but appropriate use of GitHub Copilot.

Closes #46815 from dtenedor/log-migration-r-package-utils. Authored-by: Daniel Tenedorio. Signed-off-by: Gengliang Wang.

## 2024-05-31 16:23 UTC · push to `master` · Wenchen Fan

**[SPARK-48430][SQL] Fix map value extraction when map contains collated strings**

**What changes were proposed in this pull request?** The following queries return unexpected results:

```
select collation(map('a', 'b' collate utf8_binary_lcase)['a']);
select collation(element_at(map('a', 'b' collate utf8_binary_lcase), 'a'));
```

Both return `UTF8_BINARY` instead of `UTF8_BINARY_LCASE`. The error was introduced by changes in https://github.com/apache/spark/pull/46661, which tried to solve another problem with the `RaiseError` expression (please refer to that PR for details). Partially revert the changes of the mentioned PR and fix the `RaiseError` issue by explicitly setting the `UTF8_BINARY` collation.

**Why are the changes needed?** To fix wrong results of the mentioned queries.

**Does this PR introduce any user-facing change?** Yes, it fixes the explained error.

**How was this patch tested?** Added a test to `CollationSQLExpressionsSuite`.

**Was this patch authored or co-authored using generative AI tooling?** No.

Closes #46758 from nikolamand-db/SPARK-48430. Authored-by: Nikola Mandic. Signed-off-by: Wenchen Fan.
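The two queries quoted above can be replayed from PySpark to confirm the fix, assuming a Spark version that supports string collations:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Both should report UTF8_BINARY_LCASE once SPARK-48430 is in place.
spark.sql(
    "select collation(map('a', 'b' collate utf8_binary_lcase)['a'])"
).show(truncate=False)

spark.sql(
    "select collation(element_at(map('a', 'b' collate utf8_binary_lcase), 'a'))"
).show(truncate=False)
```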
## 2024-05-31 16:10 UTC · push to `master` · Wenchen Fan

**[SPARK-48476][SQL] fix NPE error message for null delmiter csv**

**What changes were proposed in this pull request?** In this pull request I propose that we throw a proper error code when the user specifies null as a delimiter for CSV. Currently we throw an NPE.

**Why are the changes needed?** To make Spark more user friendly.

**Does this PR introduce any user-facing change?** Yes, users will now get the INVALID_DELIMITER_VALUE.NULL_VALUE error class when they specify null for the CSV delimiter.

**How was this patch tested?** Unit test.

**Was this patch authored or co-authored using generative AI tooling?** No.

Closes #46810 from milastdbx/dev/milast/fixNPEForDelimiterCSV. Authored-by: milastdbx. Signed-off-by: Wenchen Fan.
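For context, the CSV delimiter is normally supplied through the `delimiter`/`sep` read option. A hedged sketch of ordinary usage (the file path is hypothetical), with the new error class noted for the null case:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Normal usage: an explicit, non-null delimiter.
df = spark.read.option("delimiter", ",").csv("/tmp/example.csv", header=True)

# Per the description above, supplying a null delimiter now fails with the
# INVALID_DELIMITER_VALUE.NULL_VALUE error class instead of a raw
# NullPointerException.
```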
## 2024-05-31 14:38 UTC · push to `branch-3.4` · YangJie

**[SPARK-48484][SQL] Fix: V2Write use the same TaskAttemptId for different task attempts** — cherry-picked from commit 67d11b1992aaa100d0e1fa30b0e5c33684c93a89 on `master`; see the master entry below for the full description. Closes #46811 from jackylee-ch/fix_v2write_use_same_directories_for_different_task_attempts. Lead-authored-by: jackylee-ch. Co-authored-by: Kent Yao. Signed-off-by: yangjie01.

## 2024-05-31 14:38 UTC · push to `branch-3.5` · YangJie

**[SPARK-48484][SQL] Fix: V2Write use the same TaskAttemptId for different task attempts** — cherry-picked from commit 67d11b1992aaa100d0e1fa30b0e5c33684c93a89 on `master`; see the master entry below for the full description. Closes #46811 from jackylee-ch/fix_v2write_use_same_directories_for_different_task_attempts. Lead-authored-by: jackylee-ch. Co-authored-by: Kent Yao. Signed-off-by: yangjie01.

## 2024-05-31 14:37 UTC · push to `master` · YangJie

**[SPARK-48484][SQL] Fix: V2Write use the same TaskAttemptId for different task attempts**

**What changes were proposed in this pull request?** After #40064, we always get the same TaskAttemptId for different task attempts that have the same partitionId. This would lead different task attempts to write to the same directory.

**Does this PR introduce any user-facing change?** No.

**How was this patch tested?** GA.

**Was this patch authored or co-authored using generative AI tooling?** No.

Closes #46811 from jackylee-ch/fix_v2write_use_same_directories_for_different_task_attempts. Lead-authored-by: jackylee-ch. Co-authored-by: Kent Yao. Signed-off-by: yangjie01.
## 2024-05-31 05:33 UTC · push to `branch-3.4` · Kent Yao

**[SPARK-48172][SQL][FOLLOWUP] Fix escaping issues in JDBCDialects** — cherry-picked from commit 4360ec733d248b62798a191301e2b671f7bcfbd5 on `master`; see the master entry below for the full description. Closes #46806. Closes #46807 from mihailom-db/FixOracleMaster. Authored-by: Mihailo Milosevic. Signed-off-by: Kent Yao.

## 2024-05-31 05:33 UTC · push to `branch-3.5` · Kent Yao

**[SPARK-48172][SQL][FOLLOWUP] Fix escaping issues in JDBCDialects** — cherry-picked from commit 4360ec733d248b62798a191301e2b671f7bcfbd5 on `master`; see the master entry below for the full description. Closes #46806. Closes #46807 from mihailom-db/FixOracleMaster. Authored-by: Mihailo Milosevic. Signed-off-by: Kent Yao.
## 2024-05-31 05:33 UTC · push to `master` · Kent Yao

**[SPARK-48172][SQL][FOLLOWUP] Fix escaping issues in JDBCDialects**

**What changes were proposed in this pull request?** Removal of stripMargin from the code in `DockerJDBCIntegrationV2Suite`.

**Why are the changes needed?** https://github.com/apache/spark/pull/46588 was merged to master/3.5/3.4 and broke the daily jobs for `OracleIntegrationSuite`. Upon inspection, it was noted that 3.4 and 3.5 are run with JDK 8 while master is run with JDK 21, and stripMargin was behaving differently in those cases. Upon removing stripMargin and splitting the `INSERT INTO` statements into multiple lines, all integration tests passed.

**Does this PR introduce any user-facing change?** No, only loading of the test data was changed to follow language requirements.

**How was this patch tested?** The existing suite was aborted in the job and is now running.

**Was this patch authored or co-authored using generative AI tooling?** No.

Closes #46806. Closes #46807 from mihailom-db/FixOracleMaster. Authored-by: Mihailo Milosevic. Signed-off-by: Kent Yao.

## 2024-05-31 01:35 UTC · push to `master` · Hyukjin Kwon

**[SPARK-48474][CORE] Fix the class name of the log in `SparkSubmitArguments` & `SparkSubmit`**

**What changes were proposed in this pull request?** The PR aims to fix the class name used in the logs of `SparkSubmitArguments` & `SparkSubmit`.

**Why are the changes needed?** We should display class names in the logs that match our understanding, rather than the anonymous class names automatically generated by Scala.

```
sh bin/spark-shell --verbose
```

Before and after screenshots are attached to the PR.

**Does this PR introduce any user-facing change?** Yes, only for logs.

**How was this patch tested?** Manual test.

**Was this patch authored or co-authored using generative AI tooling?** No.

Closes #46808 from panbingkun/SPARK-48474. Authored-by: panbingkun. Signed-off-by: Hyukjin Kwon.
## 2024-05-31 01:34 UTC · push to `master` · Hyukjin Kwon

**[SPARK-48467][BUILD] Upgrade Maven to 3.9.7**

**What changes were proposed in this pull request?** The PR aims to upgrade `maven` from `3.9.6` to `3.9.7`.

**Why are the changes needed?** See the release notes at https://maven.apache.org/docs/3.9.7/release-notes.html (screenshot attached to the PR).

**Does this PR introduce any user-facing change?** No.

**How was this patch tested?** Manual test; passes GA.

**Was this patch authored or co-authored using generative AI tooling?** No.

Closes #46798 from panbingkun/SPARK-48467. Authored-by: panbingkun. Signed-off-by: Hyukjin Kwon.

## 2024-05-31 00:54 UTC · push to `master` · Hyukjin Kwon

**[SPARK-47716][SQL] Avoid view name conflict in SQLQueryTestSuite semantic sort test case**

**What changes were proposed in this pull request?** In SQLQueryTestSuite, the test case "Test logic for determining whether a query is semantically sorted" can sometimes fail with the error

```
Cannot create table or view `main`.`default`.`t1` because it already exists.
```

if run concurrently with other SQL test cases that also create tables with the same name. Fix it by putting it in a schema with a unique name.

**Why are the changes needed?** Fix a flaky test issue.

**Does this PR introduce any user-facing change?** No.

**How was this patch tested?** The test itself.

**Was this patch authored or co-authored using generative AI tooling?** No.

Closes #45855 from jchen5/sql-sort-test. Authored-by: Jack Chen. Signed-off-by: Hyukjin Kwon.

## 2024-05-31 00:53 UTC · push to `master` · Wenchen Fan

**[SPARK-48419][SQL] Foldable propagation replace foldable column should use origin column name**

**What changes were proposed in this pull request?** Fix the optimizer rule `FoldablePropagation` so that it does not change the column name; use the original name.

**Why are the changes needed?** Fix a bug.

**Does this PR introduce any user-facing change?** Before the fix, the plan before optimization is:

```
'Project ['x, 'y, 'z]
+- 'Project ['a AS x, str AS Y, 'b AS z]
   +- LocalRelation , [a, b]
```

and after optimization:

```
Project [x, str AS Y, z]
+- Project [a#0 AS x#112, str AS Y#113, b#1 AS z#114]
   +- LocalRelation , [a, b]
```

The column name `y` is replaced by `Y`, which changes the plan schema. After the fix, the query plan schema is still `y`:

```
Project [x, str AS y, z]
+- Project [a#0 AS x#112, str AS Y#113, b#1 AS z#114]
   +- LocalRelation , [a, b]
```

**How was this patch tested?** Added UT.

**Was this patch authored or co-authored using generative AI tooling?** No.

Closes #46742 from KnightChess/fix-foldable-propagation. Authored-by: KnightChess <981159963@qq.com>. Signed-off-by: Wenchen Fan.
## 2024-05-30 23:36 UTC · push to `master` · Hyukjin Kwon

**[SPARK-48461][SQL] Replace NullPointerExceptions with error class in AssertNotNull expression**

**What changes were proposed in this pull request?** This PR replaces `NullPointerException`s with a new error class in the `AssertNotNull` expression.

**Why are the changes needed?** We bring the advantages of the Spark error class framework to this case, enabling better user experiences and error classification.

**Does this PR introduce any user-facing change?** Yes, see above.

**How was this patch tested?** This PR includes unit test coverage.

**Was this patch authored or co-authored using generative AI tooling?** GitHub Copilot.

Closes #46793 from dtenedor/fix-npe. Authored-by: Daniel Tenedorio. Signed-off-by: Hyukjin Kwon.

## 2024-05-30 23:30 UTC · push to `master` · Hyukjin Kwon

**[SPARK-48446][SS][DOCS] Update SS doc of dropDuplicates to use the right syntax**

**What changes were proposed in this pull request?** This PR fixes the wrong usage of `dropDuplicates` and `dropDuplicatesWithinWatermark` in the Structured Streaming Programming Guide.

**Why are the changes needed?** Previously the syntax in the guide was wrong, so users would see an error if they used the example directly.

**Does this PR introduce any user-facing change?** No.

**How was this patch tested?** Made sure that the updated examples conform to the API doc and can run out of the box.

**Was this patch authored or co-authored using generative AI tooling?** No.

Closes #46797 from eason-yuchen-liu/dropduplicate-doc. Authored-by: Yuchen Liu. Signed-off-by: Hyukjin Kwon.
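A hedged PySpark sketch of the corrected call pattern (column names are illustrative, `dropDuplicatesWithinWatermark` requires Spark 3.5+, and the rate source only stands in for a real event stream):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A streaming source with an event-time column.
events = (spark.readStream.format("rate").load()
          .withColumnRenamed("timestamp", "eventTime"))

# Deduplicate on a subset of columns: the subset is passed as a list.
deduped = events.dropDuplicates(["value"])

# Or bound the deduplication state with a watermark.
deduped_within_wm = (events
                     .withWatermark("eventTime", "10 minutes")
                     .dropDuplicatesWithinWatermark(["value"]))
```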
## 2024-05-30 23:22 UTC · push to `master` · Hyukjin Kwon

**[SPARK-48475][PYTHON] Optimize _get_jvm_function in PySpark**

**What changes were proposed in this pull request?** This is a performance optimization for PySpark. For context, `sc._jvm` is a `JVMView` object in Py4J. It has an overloaded `__getattr__` implementation ([source](https://github.com/py4j/py4j/blob/master/py4j-python/src/py4j/java_gateway.py#L1741)). Accessing `.functions` internally sends a command to the Py4J server. The server then searches through a list of imports to find a package that contains the `functions` class ([source](https://github.com/py4j/py4j/blob/master/py4j-java/src/main/java/py4j/reflection/TypeUtil.java#L249)) and will eventually find `org.apache.spark.sql.functions`. The failed reflection attempts are much more expensive than the final successful one. Instead, we can directly use the fully qualified class name and avoid all failed reflection attempts.

**Why are the changes needed?** It improves performance in PySpark when building large `DataFrame`s.

**Does this PR introduce any user-facing change?** No.

**How was this patch tested?** Existing tests. The following code can verify the performance improvement:

```
import pyspark.sql.functions as F
from datetime import datetime

for i in range(5):
    T = datetime.now()
    df = spark.range(0, 10).agg(F.array([F.sum(F.col("id")) for i in range(0, 500)]))
    print(datetime.now() - T)
```

In a local PySpark shell, the time consumption before/after the optimization is about 1s/0.5s.

**Was this patch authored or co-authored using generative AI tooling?** No.

Closes #46809 from chenhao-db/optimize_get_jvm_function. Authored-by: Chenhao Li. Signed-off-by: Hyukjin Kwon.

## 2024-05-30 21:10 UTC · push to `master` · Wenchen Fan

**[SPARK-48468] Add LogicalQueryStage interface in catalyst**

**What changes were proposed in this pull request?** Add a `LogicalQueryStage` interface in catalyst; `org.apache.spark.sql.execution.adaptive.LogicalQueryStage` inherits from `logical.LogicalQueryStage`.

**Why are the changes needed?** Make LogicalQueryStage visible in logical rewrites.

**Does this PR introduce any user-facing change?** No.

**How was this patch tested?** Existing tests.

**Was this patch authored or co-authored using generative AI tooling?** No.

Closes #46799 from liuzqt/SPARK-48468. Authored-by: Ziqi Liu. Signed-off-by: Wenchen Fan.

## 2024-05-30 21:07 UTC · push to `master` · Wenchen Fan

**[SPARK-48477][SQL][TESTS] Use withSQLConf in tests: Refactor CollationSuite, CoalesceShufflePartitionsSuite, SQLExecutionSuite**

**What changes were proposed in this pull request?** Use withSQLConf in tests when it is appropriate.

**Why are the changes needed?** Enforce good practice for setting configs in test cases.

**Does this PR introduce any user-facing change?** No.

**How was this patch tested?** Existing UT.

**Was this patch authored or co-authored using generative AI tooling?** No.

Closes #46812 from amaliujia/sql_config_4. Authored-by: Rui Wang. Signed-off-by: Wenchen Fan.
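`withSQLConf` is a Scala test helper that scopes a config change to a block and restores the previous value afterwards. A rough Python analogue of the pattern, written here for illustration only (it is not a Spark API):

```python
from contextlib import contextmanager
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

@contextmanager
def sql_conf(pairs):
    """Set SQL confs for the duration of a block, then restore the old values."""
    old = {k: spark.conf.get(k, None) for k in pairs}
    try:
        for k, v in pairs.items():
            spark.conf.set(k, v)
        yield
    finally:
        for k, v in old.items():
            if v is None:
                spark.conf.unset(k)
            else:
                spark.conf.set(k, v)

# The setting applies only inside the block, mirroring what withSQLConf
# enforces in the Scala test suites.
with sql_conf({"spark.sql.shuffle.partitions": "4"}):
    df = spark.range(100)
    df.groupBy((df.id % 2).alias("k")).count().show()
```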
## 2024-05-30 16:50 UTC · push to `master` · Herman van Hovell

**[SPARK-48008][1/2] Support UDAFs in Spark Connect**

**What changes were proposed in this pull request?** This PR changes Spark Connect to support defining and registering `Aggregator[IN, BUF, OUT]` UDAFs. The mechanism is similar to supporting scalar UDFs: on the client side, we serialize and send the `Aggregator` instance to the server, where the data is deserialized into an `Aggregator` instance recognized by Spark Core. With this PR we now have two `Aggregator` interfaces defined, one in the Connect API and one in Core. They define exactly the same abstract methods and share the same `SerialVersionUID`, so the Java serialization engine can map one to the other. It is very important to keep these two definitions in sync. The second part of this effort will add the `Aggregator.toColumn` API (currently not implemented due to dependencies on Spark Core).

**Why are the changes needed?** Spark Connect does not have UDAF support. We need to fix that.

**Does this PR introduce any user-facing change?** Yes, Connect users can now define an Aggregator and register it:

```scala
val agg = new Aggregator[INT, INT, INT] { ... }
spark.udf.register("agg", udaf(agg))
val ds: Dataset[Data] = ...
val aggregated = ds.selectExpr("agg(i)")
```

**How was this patch tested?** Added new tests.

**Was this patch authored or co-authored using generative AI tooling?** No.

Closes #46245 from xupefei/connect-udaf. Authored-by: Paddy Xu. Signed-off-by: Herman van Hovell.
## 2024-05-30 16:48 UTC · push to `master` · Wenchen Fan

**[SPARK-48292][CORE] Revert [SPARK-39195][SQL] Spark OutputCommitCoordinator should abort stage when committed file not consistent with task status**

**What changes were proposed in this pull request?** Revert #36564 according to the discussion in https://github.com/apache/spark/pull/36564#discussion_r1607575927. When Spark commits a task, it commits to the committedTaskPath `${outputpath}/_temporary//${appAttempId}/${taskId}`. So in #36564's case, since before #38980 each task's job ID date was not the same, when a task wrote data successfully but failed to send back the TaskSuccess RPC, the task rerun would commit to a different committedTaskPath, causing duplicated data. After #38980, the TaskId is the same for different attempts of the same task; when a rerun task commits, it commits to the same committedTaskPath, and the Hadoop CommitProtocol handles such cases so data won't be duplicated. Note: the taskAttemptPath is not the same, since that path contains the taskAttemptId.

**Why are the changes needed?** No longer needed.

**Does this PR introduce any user-facing change?** No.

**How was this patch tested?** Existing UT.

**Was this patch authored or co-authored using generative AI tooling?** No.

Closes #46696 from AngersZhuuuu/SPARK-48292. Authored-by: Angerszhuuuu. Signed-off-by: Wenchen Fan.

## 2024-05-30 09:31 UTC · push to `master` · YangJie

**[SPARK-47361][SQL] Derby: Calculate suitable precision and scale for DECIMAL type**

**What changes were proposed in this pull request?** When storing `decimal(p, s)` to Derby, if `p > 31`, `s` is wrongly hardcoded to `5`, which is the assumed default scale of Derby's DECIMAL. Actually, 0 is the default scale and 5 is the default precision: https://db.apache.org/derby/docs/10.13/ref/rrefsqlj15260.html. This PR calculates a suitable scale to make room for the precision.

**Why are the changes needed?** Avoid precision loss.

**Does this PR introduce any user-facing change?** Yes, but Derby is rare in production environments, and the new mapping is compatible with most use cases.

**How was this patch tested?** New tests.

**Was this patch authored or co-authored using generative AI tooling?** No.

Closes #46776 from yaooqinn/SPARK-48439. Authored-by: Kent Yao. Signed-off-by: yangjie01.
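A hedged sketch of the kind of write that exercises this mapping; the in-memory Derby URL and table name are made up, and the Derby driver jar must be on the Spark classpath:

```python
from decimal import Decimal
from pyspark.sql import SparkSession
from pyspark.sql.types import DecimalType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# A decimal wider than Derby's 31-digit limit: Spark's Derby dialect must pick
# a narrower precision/scale, and after this change the scale is derived
# instead of being hardcoded to 5.
schema = StructType([StructField("amount", DecimalType(38, 10))])
df = spark.createDataFrame(
    [(Decimal("12345678901234567890.0123456789"),)], schema)

(df.write.format("jdbc")
    .option("url", "jdbc:derby:memory:demo;create=true")  # hypothetical DB
    .option("dbtable", "amounts")
    .option("driver", "org.apache.derby.jdbc.EmbeddedDriver")
    .mode("overwrite")
    .save())
```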
## 2024-05-30 09:16 UTC · push to `master` · Kent Yao

**[SPARK-48471][CORE] Improve documentation and usage guide for history server**

**What changes were proposed in this pull request?** In this PR, we improve the documentation and usage guide for the history server by:
- identifying and printing **unrecognized options** specified by users;
- obtaining and printing all history-server-related configurations dynamically, instead of using an incomplete, outdated hardcoded list;
- ensuring all configurations are documented in the usage guide.

**Why are the changes needed?**
- Revise the help guide for the history server to make it more user-friendly. Configuration missing from the help guide is not always reachable in our official documentation; e.g. `spark.history.fs.safemodeCheck.interval` has been missing from the doc since it was added in 1.6.
- Misuse shall be reported to users.

**Does this PR introduce any user-facing change?** No, the print style stays as-is, with more items listed.

**How was this patch tested?** By comparing the output before and after the change.

Without this PR, `./sbin/start-history-server.sh` prints only a short, hardcoded list of options:

```
Usage: ./sbin/start-history-server.sh [options]
24/05/30 15:37:23 INFO SignalUtils: Registering signal handler for TERM
24/05/30 15:37:23 INFO SignalUtils: Registering signal handler for HUP
24/05/30 15:37:23 INFO SignalUtils: Registering signal handler for INT

Options:
  --properties-file FILE   Path to a custom Spark properties file.
                           Default is conf/spark-defaults.conf.

Configuration options can be set by setting the corresponding JVM system property.
History Server options are always available; additional options depend on the provider.

History Server options:

  spark.history.ui.port               Port where server will listen for connections (default 18080)
  spark.history.acls.enable           Whether to enable view acls for all applications (default false)
  spark.history.provider              Name of history provider class (defaults to file system-based provider)
  spark.history.retainedApplications  Max number of application UIs to keep loaded in memory (default 50)

FsHistoryProvider options:

  spark.history.fs.logDirectory       Directory where app logs are stored (default: file:/tmp/spark-events)
  spark.history.fs.update.interval    How often to reload log data from storage (in seconds, default: 10)
```

With this PR, an unrecognized option is reported:

```
Unrecognized options: --conf spark.history.ui.port=10000
Usage: HistoryServer [options]

Options:
  --properties-file FILE   Path to a custom Spark properties file.
                           Default is conf/spark-defaults.conf.
```

and `sbin/start-history-server.sh --help` now prints a dynamically generated list of History Server and FsHistoryProvider options with their descriptions and defaults, including `spark.history.custom.executor.log.url`, `spark.history.custom.executor.log.url.applyIncompleteApplication`, `spark.history.kerberos.enabled`/`keytab`/`principal`, `spark.history.provider`, `spark.history.retainedApplications`, `spark.history.store.hybridStore.diskBackend`/`enabled`/`maxMemoryUsage`, `spark.history.store.maxDiskUsage`/`path`/`serializer`, `spark.history.ui.acls.enable`, `spark.history.ui.admin.acls`/`admin.acls.groups`, `spark.history.ui.port`, and `spark.history.fs.cleaner.enabled`/`interval`/`maxAge` (the complete listing, with per-option descriptions and defaults, is part of the PR description).

**Was this patch authored or co-authored using generative AI tooling?** No.

Closes #46802 from yaooqinn/SPARK-48471. Authored-by: Kent Yao. Signed-off-by: Kent Yao.

*Older activity is available on subsequent pages of the feed.*