Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use '$' instead of '.' to separate module from identifiers in symbols #13050

Open
wants to merge 1 commit into
base: trunk
Choose a base branch
from

Conversation

tmcgilchrist
Copy link
Contributor

Currently, OCaml's native code compiler (ocamlopt) mangles names using the following format camlModule.Submodule.itentifier_stamp. This scheme causes issues with LLDB on macOS(#12933) breaking the ability to set breakpoints and needs to be worked around in the MSVC port (#12640). Changing to $ everywhere provides a consistent naming scheme across platforms, supports all targeted assemblers and debuggers (GDB/LLDB). See #12933 (comment)

There are five places that are impacted by this change:

  1. Generated OCaml names become camlModule$Submodule$identifier_stamp
  2. OCaml built in C identifiers _caml_function do not change eg _caml_alloc
  3. Backend specific names like caml_hot$code_end and caml_system$frametable change to use $. This impacts any users of function sections (#8526) who will need to update linker scripts.
  4. caml_system__code_begin and caml_system__code_end still use double underscores because they are referenced from C.
  5. Linux perf's OCaml demangling support needs updating ( See background discussion)

This change fixes the issue with setting breakpoints in LLDB on macOS (#12933), keeps the original fixes (#8998, #11321) from #11430.

@tmcgilchrist tmcgilchrist force-pushed the name_mangling branch 3 times, most recently from a95e311 to 2d02684 Compare March 26, 2024 06:03
runtime/s390x.S Outdated Show resolved Hide resolved
@tmcgilchrist tmcgilchrist force-pushed the name_mangling branch 3 times, most recently from d108ac0 to 705eda9 Compare April 2, 2024 01:19
Changes Outdated Show resolved Hide resolved
@copy
Copy link
Contributor

copy commented Apr 2, 2024

Note that this will break existing demangling implementations, producing non-ascii characters (due to the use of $xx encoding).

At least on Linux, I'd suggest keeping the existing . separator (in my testing, setting breakpoints in symbols containing . works in both lldb and gdb on Linux).

@tmcgilchrist
Copy link
Contributor Author

Note that this will break existing demangling implementations, producing non-ascii characters (due to the use of $xx encoding).

Could you provide more detail about which demangling implementations you mean? I am aware that Linux perf needs updating.

The intention with this change is to have a common demangling scheme across operating systems, $ was chosen because it supports Windows MSVC, macOS LLDB and Linux (gdb and lldb).

@copy
Copy link
Contributor

copy commented Apr 3, 2024

Could you provide more detail about which demangling implementations you mean? I am aware that Linux perf needs updating.

Here is another one: https://github.com/mstange/samply/blob/bb0dbdf13f10dc22036ef2dc4ead35b9647717d6/samply-symbols/src/demangle_ocaml.rs
I'd assume all demangling implementations are broken.

The intention with this change is to have a common demangling scheme across operating systems, $ was chosen because it supports Windows MSVC, macOS LLDB and Linux (gdb and lldb).

While uniformity is nice, this causes a fair amount of breakage and makes demangling more complicated. And . seems to work fine on Linux (and maybe future version of macOS too).

@Octachron Octachron added this to the 5.3 milestone Apr 3, 2024
@xavierleroy
Copy link
Contributor

I see the hot potato was assigned to me...

With @let-def's advice and Wikipedia's wisdom (https://en.m.wikipedia.org/wiki/Unambiguous_finite_automaton), I coded an automaton-based check for ambiguities in name mangling schemas: https://gist.github.com/xavierleroy/ace2f70d664d738970aa3034e0751d3d . The verdict is as follows:

Encoding with . separator : unambiguous
Encoding with $ separator : ambiguous, example: Z$ff$ff
Encoding with $ separator and ASCII chars only : unambiguous
Encoding with $ separator and decimal escapes: unambiguous

So, yes, the change proposed in this PR re-introduces an ambiguity in name mangling that the switch to . as a separator managed to avoid. However, the ambiguity happens only if bytes above 0x80 are used, e.g. in the past with Latin-1 characters in identifiers, and in the future with UTF8-encoded Unicode characters in identifiers (#12664, https://github.com/whitequark/ocaml-m17n/ ). To be future-proof, a change of encoding of special characters should be considered, from $xx (two hex digits) to e.g. $ddd (three decimal digits).

@tmcgilchrist
Copy link
Contributor Author

To be future-proof, a change of encoding of special characters should be considered, from $xx (two hex digits) to e.g. $ddd (three decimal digits).

@xavierleroy would you want to see that escaping change made in this PR? It's not clear how large a change that is, I am currently looking at the code.

@copy I wasn't aware of that project, thank you for the link. Not having used it did that project break with the change from __ to . separators? Would it be sufficient to contribute a fix to the demangler there if this change is accepted? If we get this change right (including accomodating Unicode identifiers) then it shouldn't need to change again.

@copy
Copy link
Contributor

copy commented Apr 4, 2024

Not having used it did that project break with the change from __ to . separators?

No, the . was already printed as-is in the original implementation.

Would it be sufficient to contribute a fix to the demangler there if this change is accepted?

Yes, that would be appreciated.

FWIW, I tested perf report with this change. Non-ascii characters are printed as <XX> (e.g. Stdlib.Map<AD>d_436), so at least the terminal isn't messed up.

To be future-proof, a change of encoding of special characters should be considered, from $xx (two hex digits) to e.g. $ddd (three decimal digits).

One thing to keep in mind is that demanglers may want to keep support for older versions of OCaml. I wonder if overloading $ this way unnecessarily makes the demangling algorithm more complicated that necessary.

@tmcgilchrist
Copy link
Contributor Author

73c979d introduces octal escaping. For this small example program:

(* fib.ml *)
let rec fib n =
  if n = 0 then 0
  else if n = 1 then 1
  else fib (n-1) + fib (n-2)

(*
   With octal escaping this becomes "_camlFib$$136$052_272"
   With hexidecimal escaping this was "_camlFib$$5e$2a_272"
*)

let (^*) a r = Printf.printf "fib(%d) = %d" a r

let main () =
  let a = 10 in
  let r = fib a in
  a ^* r

let () = main ()

Not sure if there are other places that need updating?

@xavierleroy
Copy link
Contributor

73c979d introduces octal escaping.

Are you using a 36-bit computer? Because otherwise there is no reason whatsoever to use octal. OCaml uses decimal and hexadecimal for character escapes.

@tmcgilchrist
Copy link
Contributor Author

a36d62399a actually does decimal escaping (leaving the 36bit computer era behind 😀 ). For the same example program, with hexidecimal escaping this was _camlFib$$5e$2a_272 this now becomes _camlFib$$094$042_272.

@xavierleroy
Copy link
Contributor

a36d62399a actually does decimal escaping (leaving the 36bit computer era behind 😀 ).

I'm reassured :-)

@shindere
Copy link
Contributor

shindere commented Apr 22, 2024 via email

@shindere
Copy link
Contributor

shindere commented Apr 22, 2024 via email

@shindere
Copy link
Contributor

shindere commented Apr 22, 2024 via email

@shindere
Copy link
Contributor

Thinking about this further: am I correct that it should be possible to patch the demanglers so that they can (roughly) accomodate the two mangling schemes? Based on that a name can't contain both a . and a $, I'd expect it to be possible to figure out which mangling scheme has been used and thus know how to demangle?

@copy
Copy link
Contributor

copy commented Apr 23, 2024

Apologies for the naïve question: are there that many demanglers around?

I'm not aware of any other demanglers except the ones already mentioned in this thread.

I also do understand that none of the approaches will be ideal and still
we will have to choose one, or to make the mangling configurable but I
am not sure that's something we want to do.

My suggestion would be to keep using . on Linux, report the llvm issue upstream (note that . is already handled correctly in lldb on Linux), and eventually switch back to . on macOS.

Thinking about this further: am I correct that it should be possible to patch the demanglers so that they can (roughly) accomodate the two mangling schemes? Based on that a name can't contain both a . and a $, I'd expect it to be possible to figure out which mangling scheme has been used and thus know how to demangle?

4.14 and 5.0 symbols don't contain ., but use the old mangling scheme. The following may be sufficient:

  • If symbol contains .: Hex escapes (5.1)
  • If symbol doesn't contain $ or character after first $ matches [0-9]: Hex escapes (4.14/5.0)
  • Otherwise: Decimal escapes, except for the first $ (this PR)

But I prefer my suggestion above, for the following reasons:

  • It doesn't break existing demanglers, which might take a while to get updated
  • It keeps the demangling algorithm simple, which is implemented in C in perf
  • It doesn't overload $xxx in symbols between versions of OCaml
  • . is a little more readable in mangled symbols
  • . can later be used to separate submodules

Anyway, it's not a hill I will die on.

@tmcgilchrist tmcgilchrist force-pushed the name_mangling branch 2 times, most recently from 55ab2f3 to bea5f93 Compare April 30, 2024 05:40
@NickBarnes
Copy link
Contributor

NickBarnes commented May 14, 2024

The Rust mangling system, I think wisely, includes a mangling convention version number in every mangled identifier. If we don't do that now, we should consider it the next time we look at name mangling. Maybe one day we will end up with something like _O0$Mod$Submod$fun$line$uniq (most languages seem to go for _<L>... where L identifies the language). In any case, a demangler should be careful about false positives, as it may be mistaken or misinformed about the source language of any given symbol, or out-of-date with respect to mangling conventions in any given source language. As @shindere says, I think it's fairly easy to accommodate both mangling conventions in a single demangler (which is particularly good as this change brings all platforms into line: any demangler which can't handle $ would currently be broken on MSVC).

@damiendoligez
Copy link
Member

My suggestion would be to keep using . on Linux, report the llvm issue upstream (note that . is already handled correctly in lldb on Linux), and eventually switch back to . on macOS.

In my experience, reporting this kind of bug to Apple is just a waste of time...

Change mangling of OCaml long identifiers from `camlModule.name_NNN`
to `camlModule$name_NNN`. Also changes the encoding of special
characters from $xx (two hex digits) to e.g. $ddd (three decimal
digits). The previous mangling schema, using `.`, conflicted with LLDB
on macOS and MASM in the MSVC port. Mangled names are now consistent
across all ports.
@tmcgilchrist
Copy link
Contributor Author

@xavierleroy Thank you for the ambiguity code and suggestions for additional improvements. I have taken the opportunity to add you as a contributor to this fix (if you prefer otherwise please let me know).

In my experience, reporting this kind of bug to Apple is just a waste of time...

My perspective on this is, I need to work with the functionality available from lldb and Apple. LLDB doesn't currently support breakpoints with . in them and even when/if a fix is available that won't necessarily flow down to older macOS versions. I've also had reports of GDB not working with setting breakpoints with . in them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants