Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stamps of identifiers are not unique #13036

Open
samsa1 opened this issue Mar 18, 2024 · 5 comments
Open

Stamps of identifiers are not unique #13036

samsa1 opened this issue Mar 18, 2024 · 5 comments

Comments

@samsa1
Copy link
Contributor

samsa1 commented Mar 18, 2024

The documentation of Ident.rename function states that it returns a identifier with a fresh stamp.

ocaml/typing/ident.mli

Lines 37 to 40 in e72d835

val rename: t -> t
(** Creates an identifier with the same name as the input, a fresh
stamp, and no scope.
@raise [Fatal_error] if called on a persistent / predef ident. *)

However it is not the case because when compiling a module we use a certain stamp generator and another one is used when compiling another module. Thus we can have stamp collisions between different modules.

For example when modifying the function rename in ident.ml to

let rename = function
  | Local { name; stamp = _ }
  | Scoped { name; stamp; scope = _ } ->
      incr currentstamp;
      if name = "H" then Format.printf "H(%i -> %i)\n" stamp !currentstamp;
      Local { name; stamp = !currentstamp }
  | id ->
      Misc.fatal_errorf "Ident.rename %s" (name id)

We clearly see examples when compiling the whole compiler of modules being compiled with rename reducing the stamp of a identifier. Thus showing that the invariant for generating a fresh stamp is broken.

Is the fact that the stamp is not fresh (even if the documentation says so) an accepted behavior or something uncontrolled ?

PS : As a consequence, during the stap OCAMLOPT ocamldoc/odoc_ast.cmx of compilation the comparison function between identifiers is called on 2 identifiers with the same stamp and different names.

@lpw25
Copy link
Contributor

lpw25 commented Mar 18, 2024

Stamps from other modules are lazyily freshened on import via a substitution. I would say the documentation is correct. If you find a place where the stamps are not being freshened appropriately on import that is a bug. Feel free to open a more specific issue if you have such an example.

@lpw25 lpw25 closed this as completed Mar 18, 2024
@lthls
Copy link
Contributor

lthls commented Mar 18, 2024

The issue here is that we have a very strong hint that the implementation of stamp freshening is wrong. When building the freshening substitution for a given cmi, we expect the substitution to be from old idents to new idents (or paths in general), but functor types introduce some kind of alpha-renaming between already fresh idents that gets incorporated in the rest of the substitution. If the parameter of a functor type has the same name as a regular module being imported, and their stamps happen to match by chance, then one of them will erase the information about the other and substitution will have the wrong paths.

I'll try to create a standalone repro, it might take a few days but I'm fairly certain that we have a real bug so I'll re-open the issue.

@lthls lthls reopened this Mar 18, 2024
@lthls
Copy link
Contributor

lthls commented Mar 19, 2024

After a bit more investigating, I believe that we cannot cause the compiler to produce wrong code, but the reasoning is subtle and I think it would help to have someone else confirming my analysis.

Here is what I understand about this:

  • Substitutions (Subst.t) are used for two different purposes. Some are freshenings, binding all identifiers from a given scope to fresh identifiers (or paths, or whatever). Some are just regular local substitutions, typically used to implement destructive substitution (type t := ty, module M := Mexpr) but also in a few other places (functor types introduce such substitutions, I believe).
  • Composition of two local substitutions produces a local substitution, and that's what Subst.compose implements
  • Composition of a freshening with either another freshening or a local substitution should produce a freshening. This currently also uses Subst.compose, but the implementation has a weird twist: the keys of the right-hand substitution end up in the keys of the result freshening, despite the fact that they're not made of identifiers from the input scope of the freshening. So we end up with a map with mixed keys, and this is the issue that @samsa1 reported (when the freshening is for importing foo.cmi into bar.ml, the keys of the freshening are identifiers with stamps from Foo which can conflict with fresh identifiers generated for Bar). However, the freshening is only used for looking up names from Foo so as long as the keys from Bar do not override the keys from Foo it works as intended. And because the implementation of Subst.compose overrides the values of the right-hand side with the ones from the left-hand side for conflicting keys, this produces the intended result.

Note that the distinction between freshenings and local substitutions is something that I came up with myself to understand the issue, it is not reflected in any way in the code or comments. I think it would make sense to start doing this distinction, and as a bonus we could use that to trim freshenings a bit (no need to keep the useless bindings from the right-hand side), and to make comparison of Ident.t straight-forward (no need to compare the names anymore, if all comparisons are between idents from the same compilation unit the stamps are enough).

Finally, a small example that can trigger a conflict of identifiers:

(* lib.mli *)
module M : sig type t_m val x_m : t_m end
module N := M
module F (M : sig type t_m2 val x_m2 : t_m2 val y_m : M.t_m end) :
  sig val y : M.t_m2 val z : N.t_m end

(* main.ml *)
(* Adjusting the stamps. The exact number to trigger the conflict might be slightly different. *)
let x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15, x16, x17, x18 = assert false

(* Any use of Lib may trigger a conflict between the two modules M.
   The following code was just an attempt to trigger a bug out of it *)
module R = Lib.F(struct type t_m2 = bool let x_m2 = false let y_m = Lib.M.x_m end)

let y : bool = R.y
let z : Lib.M.t_m = R.z
let x : Lib.M.t_m = Lib.M.x_m

@gasche
Copy link
Member

gasche commented Mar 19, 2024

A term has free and bound variables. A (parallel) substitution can be classified according to (1) the free variables in the term that it takes as input, and (2) the free variables in the term that it returns as output. For example, the single-binding substitution [x |-> y] goes from a term with free variables in any V,{x} (V is a set of free variables that does not contain x, and the comma is disjoint union) and returns a term with free variables in V, {y}. (It is also correct to describe then by listing only a subset of free variables in the input term, with the convention that any other free variable is left unchanged -- then that substitution goes from {x} to {y}.)

In addition, OCaml substitutions "freshen" the bound variables along the way: if you apply the substitution above to the term fun z -> (x, z), you will get not fun z -> (y, z) but rather fun z' -> (y, z') for some fresh z'. This freshening happens when the substitution rho traverses the term recursively, and it hits a binder (here fun z ->), it will traverse the terms under the binders with an extended substitution rho, z |-> z' for some fresh z'.

This is important/useful for two reasons:

  1. This guarantees that the substitution is capture-avoiding: if I was to ask to substitute y |-> z in the term fun z -> (y, z), I would get the correct result fun z' -> (z, z') instead of the incorrect result fun z -> (z, z).
  2. If we are combining the result of the substitution with the input of the substitution in some larger term -- the substitution creates a modified copy, that is kept alongside the original -- then we need to do this to (try to) guarantee that bound variables are unique, that the same binder does not occur twice in different binding positions. (The type-theory jargon for this is "the Barendregt convention".)

This is all a precise and long way to say that you are only partly right. It is not correct to think of "local substitutions" and "freshenings" as two distinct kinds of substitutions, all substitutions freshen the binders during their recursive traversal. In particular, it is possible to "freshen" a term by just applying the identity substitution (the empty substitution, going from {} to {}) to it. This is what .cmi loading uses. We probably wouldn't want to have a dedicated code path for freshenings, given that almost all of the logic is shared and this is a fair amount of not-so-trivial code (due to the fact that some parts of the substitutions/freshenings are done lazily.)

Regarding the original question of @samsa1: we are doing separate compilation, so there is no way to avoid the fact that different compilation units could use the same stamps for their names. If module C depends A and B, I could have compiled A and B both in the empty environment (instead of having one in the environment when compiling the other), so they can start from the same initial stamp, and C may observe the same stamp used for a name in A and a name in B. As Leo mentioned, this is fine as long as we don't consider the signature unmarshalled from the .cmi as added to the current typing environment, but only a freshened copy of it. The freshening/substitution code of course will see the unmarshalled version (and again this freshening/substitution may be done lazily, over time), but not the rest of the type-checker.

In other words: "fresh stamp" is relative to the typing environment of the current compilation unit, and .cmi signatures are not part of it, only their freshening is.

@lthls
Copy link
Contributor

lthls commented Mar 20, 2024

@gasche I think you're missing my point. You mention in your message substitutions and freshening, but I think you're not noticing that not only are we using the same code for both, we are actually doing partial substitution of freshened identifiers in the middle of freshening. That causes us to build maps with keys coming from different compilation units, and if we have some overlap we will silently lose one of the entries from the map (and replace it with an unrelated entry).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants