-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[O] Optimize CharSequence.lines() #5278
Conversation
@hykilpikonna nice catch! |
@@ -1403,14 +1403,14 @@ public inline fun CharSequence.splitToSequence(regex: Regex, limit: Int = 0): Se | |||
* | |||
* The lines returned do not include terminating line separators. | |||
*/ | |||
public fun CharSequence.lineSequence(): Sequence<String> = splitToSequence("\r\n", "\n", "\r") | |||
public fun CharSequence.lineSequence(): Sequence<String> = toString().replace("\r\n", "\n").replace('\r', '\n').splitToSequence('\n') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CharSequence
interface neither require implementing toString
, nor constraint developers to do that in a particular way:
public interface CharSequence { |
In standard library, there's only one class implementing that interface, StringBuidler
, and it's toString
implementation returns underlying string, but in general, we can't rely on that.
A random example of how CharSequence's implementation may override toString
: https://github.com/h2oai/h2o-3/blob/776b53716559e065ccf2af08d1cbb5b8d2c883e7/h2o-algos/src/main/java/hex/grep/Grep.java#L80
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's only one class
Besides String
, of course
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although, it seems to be a problem with how CharSequence's contract is defined in Kotlin (there's no explicit common declaration ), and Java's CharSequence specify toString
's behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting... the reason why I used toString is because there doesn't seem to exist a function CharSequence.replace(String, String)
or even CharSequence.replace(Char, Char)
The only replace function I can find under CharSequence uses Regex replace, but it performs worse than a JVM implementation of replace.
However, if I copy the String.replace() algorithm in the following JVM implementation and set the signature to take in a CharSequence
instead of String
, the replacement works and performs well as intended.
public actual fun String.replace(oldValue: String, newValue: String, ignoreCase: Boolean = false): String { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might have two options for optimizing this:
- We can create a new
String.lines()
implementation separate fromCharSequence.lines()
. - We can either create a new
CharSequence.replace(String, String)
interface or create a helper function for replace, and use this replacement function forCharSequence.lines()
.
On JVM, both approaches should perform equally well, but this might not be the case on other platforms. So I've tested the performance of the second approach on different platforms below.
On Kotlin Native, compiling the JVM CharSequence.replace
into native is slightly slower than using the native String.replace
(but still better than splitting by three strings):
(test sample size reduced by 10x because Kotlin Native is very slow and memory-intensive in processing strings)
On Kotlin nodejs, compiling the JVM CharSequence.replace
function into js is slightly faster than using the js String.replace
:
Please let me know what you think is the best approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing to keep in mind on the lineSequence
implementation is that it performs lazy splitting as expected for sequence, whereas using toString().replace(...).splitToSequence(...)
would perform lazy splitting, but eager replaces on a whole string.
This may be wasteful when only part of the resulting sequence is used afterward.
Example benchmark:
private val string = (0..1000).joinToString("\n") { "line_$it" }
private val hugeString = (0..100000000).joinToString("\n") { "line_$it" }
@State(Scope.Benchmark)
class StringLinesBenchmark {
@Benchmark
fun stdLibLineSequenceNormalString() = string.lineSequence().take(100).toList()
@Benchmark
fun newLineSequenceNormalString() = string.newLineSequence().take(100).toList()
@Benchmark
fun stdLibLineSequenceHugeString() = hugeString.lineSequence().take(100).toList()
@Benchmark
fun newLineSequenceHugeString() = hugeString.newLineSequence().take(100).toList()
}
benchmarks summary:
Benchmark Mode Cnt Score Error Units
StringLinesBenchmark.newLineSequenceHugeString avgt 5 866.933 ± 20.053 ms/op
StringLinesBenchmark.newLineSequenceNormalString avgt 5 0.005 ± 0.001 ms/op
StringLinesBenchmark.stdLibLineSequenceHugeString avgt 5 0.015 ± 0.001 ms/op
StringLinesBenchmark.stdLibLineSequenceNormalString avgt 5 0.014 ± 0.001 ms/op
In this case, stdlib implementation performance doesn't really rely on input string lines count as long as only part of lines sequence result is used later.
But it is worth noting that for a string with 1000 lines, where only 10% of them were later used for processing, new implementation was much faster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hykilpikonna, as @Shykial mentioned, lineSequence
will no longer be lazy after using toString
and replace
. It's also worth mentioning that multiple string transformations will increase object allocation rate which in some scenarios may not be tolerable.
However, it seems like performance could be improved without hurting the laziness (but it needs to be verified) by switching from rangesDelimitedBy
to DelimitedRangesSequence
with a custom getNextMatch
callback. Currently, we use CharSequence.findAnyOf
to find where a line splits. Instead, we can scan the char sequence seeking for either \n
, or \r
(optionally followed by the \n
).
The main caveat here is that there are not so many tests dedicated to line splitting, so before changing anything the test coverage needs to be improved.
To further experiment with performance improvements, I would recommend writing benchmarks like the one in the message above. It would be nice to cover various scenarios (different receiver types, strings using different delimiters and having different structure, different operations on a resulting sequence, like first, last, count, etc) to better understand trade offs. To ease benchmarks configuration, you can clone https://github.com/fzhinkin/benchmarks-template, it's KMP project using kotlinx-benchmark with various benchmarking targets being already set up.
Hey! @hykilpikonna, are you considering proceeding with this PR? Please let me know if additional any help is needed here. |
Splitting lines by doing two string replacements and then splitting by one character is way faster than splitting by three substrings.
https://youtrack.jetbrains.com/issue/KT-66715/Performance-faster-alternative-to-String.lines
Profiling of the two methods:
Detailed time usage:
In-process time output: