Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix some memory bugs in runtime_events_consumer.c #13091

Open
wants to merge 4 commits into
base: trunk
Choose a base branch
from

Conversation

eutro
Copy link
Contributor

@eutro eutro commented Apr 10, 2024

This PR fixes a few memory bugs surrounding runtime_events_loc in runtime_events_consumer.c, as mentioned in #13089, specifically in caml_runtime_events_create_cursor.

These bugs are:

  • The code path for Runtime_events.create_cursor None (the pid < 0 branch in the C code) allocates runtime_events_loc but reassigns it, making the old pointer unreachable without deallocating it:

runtime_events_loc = caml_stat_alloc_noexc(RING_FILE_NAME_MAX_LEN);
if (runtime_events_loc == NULL) {
caml_stat_free(cursor);
return E_ALLOC_FAIL;
}
/* If pid < 0 then we create a cursor for the current process */
if (pid < 0) {
runtime_events_loc = caml_runtime_events_current_location();
if( runtime_events_loc == NULL ) {
caml_stat_free(cursor);
return E_NO_CURRENT_RING;
}
} else {

  • The function frees runtime_events_loc after it is introduced if and only if the function returns with an error.1 This is incorrect, both because:

This PR:

  • Fixes these by moving runtime_events_loc's allocation into the relevant branch, rewriting the function to be single-exit after cursor allocation succeeds, and by having both frees state their explicit guards:

*cursor_res = cursor;
ret = E_SUCCESS;
free_events_loc:
if (should_free_events_loc) {
caml_stat_free(runtime_events_loc);
}
free_cursor:
if (ret != E_SUCCESS) {
caml_stat_free(cursor);
}
return ret;
}

  • Deduplicates the two identical path-freeing if-blocks in the caml_ml_runtime_events_create_cursor wrapper which calls it, for clarity. This does not change the behaviour of the program.
  • Adds a test which tries to reach these cases, though only the double-free (and Windows on the commit before #13089) actually causes a failure of the test.
    • This chmod 000s the ring buffer file in order to cause subsequent read attempts to fail.
    • It uses a slight workaround to find the correct PID and .events file, so it works on Windows (rather than skipping), because I wanted to actually test #13089.
    • The testsuite does fail on this commit.

Footnotes

  1. This is the correct behaviour for cursor, since the caller takes ownership of the returned cursor iff the function returns successfully

@eutro
Copy link
Contributor Author

eutro commented Apr 10, 2024

It looks like, perhaps unsurprisingly, not all platforms work with the new test (though MSVC just had a quite disappointing network error), so I intend to remove it later, since I can't think of a better way to force failure.

@nojb nojb closed this Apr 10, 2024
@nojb nojb reopened this Apr 10, 2024
@@ -100,28 +100,31 @@ caml_runtime_events_create_cursor(const char_os* runtime_events_path, int pid,

struct caml_runtime_events_cursor *cursor =
caml_stat_alloc_noexc(sizeof(struct caml_runtime_events_cursor));
int should_free_events_loc;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather see that variable initialized to zero at the declaration point, so that future changes in this routine do not risk using an uninitialized value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following in that style should int ret; get initialised to something since it is being returned at the end of the function?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That wouldn't hurt either.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made ret uninitialised specifically so the compiler would emit a warning if it isn't assigned before jumping; and should_free_events_loc uninitialised because it's to be initialised where runtime_events_loc is.

A change that would accidentally use these with the default initialisation is already wrong, so my reasoning is that it's better to have an obviously invalid value that the compiler would yell at us about.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is relying upon the C compiler being always correct in figuring out whether a variable is initialized or not. It turns out they aren't always, with both false positives and false negatives. I wouldn't put too much trust in the compiler.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drive-by comment: do we need a boolean to tell us to free runtime_events_loc, or could we just free exactly when the variable is not NULL?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drive-by comment: do we need a boolean to tell us to free runtime_events_loc, or could we just free exactly when the variable is not NULL?

It's not the NULL case we're worried about, but the case where it's initialised to caml_runtime_events_current_location(), which isn't a copy.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I looked out of curiosity, wondering if just making a copy in that case would make things nicer. My quick skim suggests that the function is sprawling and somewhat of a mess, and I think that this is going to bite us again in the future. So I would encourage you to refactor it as you see fit -- for example maybe the system-dependent bits could be factored out into their own helper functions -- to make the code nicer, and not just safer. Your current change is okay but it does not make the code nicer to read.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've refactored caml_runtime_events_create_cursor and made runtime_events_loc be a new string every time to make reasoning easier.

Runtime_events.pause ()

(* workaround for finding the events file even on Windows, where
[Unix.getpid] doesn't match the one used to open the file *)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain what the difference is on Windows? The header file says:

[pid] is the process id (or equivalent) of the startup OCaml process.

https://github.com/ocaml/ocaml/blob/trunk/otherlibs/runtime_events/caml/runtime_events_consumer.h#L31

If there is a known difference it would be useful to state that in the header file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is not with Windows or Runtime_events, but with the way the Unix library handles PIDs on Windows (as Windows HANDLEs casted to int). Unix.getpid() doesn't actually return the PID of the process (see #4034), whereas the ring buffer file does use the actual PID. I mention possible fixes in this footnote on #13089.

- Introduce `format_runtime_events_loc`, allocating a new string each time

- Introduce `cursor_map_ring_file`, which now also closes `ring_fd`, and closes
handles on Windows on failure

- Use `memset` to zero-initialise cursor callbacks
@eutro
Copy link
Contributor Author

eutro commented Apr 16, 2024

I've spotted and fixed another bug, notably Runtime_events.create_cursor None never closing the file descriptor on Unix.

On trunk this crashes after opening too many file descriptors, causing create_cursor None to fail and trigger the existing double-free bug:

let () =
  Runtime_events.start ();
  try
    for _ = 1 to 1024 (* or whatever your [ulimit -n] is *) do
      Runtime_events.(create_cursor None |> free_cursor)
    done
  with _ ->
    Runtime_events.(create_cursor None |> free_cursor)

I also perform cleanup on the Windows handles if the mapping fails.

@NickBarnes
Copy link
Contributor

I'll review this.

@MisterDA
Copy link
Contributor

Should the ring file be also marked non-inheritable on Windows and close-on-exec on Unix?

Copy link
Contributor

@NickBarnes NickBarnes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clearly an improvement on the previous code. A few stylistic quibbles.

struct caml_runtime_events_cursor *cursor =
caml_stat_alloc_noexc(sizeof(struct caml_runtime_events_cursor));
/** Return a new string with the path of the ring file */
static runtime_events_error format_runtime_events_loc(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that (because C is antediluvian) enumeration constants such as E_PATH_FAILURE are of type int, so here you are forcing an implicit conversion into the enum type, and then at the call site there's an implicit conversion back to int. So maybe this return type should be int? For this reason, I never name enum types (so I never have variables or slots of enum types).

}
failed2:
CloseHandle(cursor->ring_handle);
failed1:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest using fail labels which reflect the failure, e.g. fail_file_mapping and fail_map_view here.

if (cursor->ring_file_handle == INVALID_HANDLE_VALUE) {
caml_stat_free(cursor);
caml_stat_free(runtime_events_loc);
return E_OPEN_FAILURE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency, please set ret and then goto fail_create_file here (putting that label right before the return ret).

static runtime_events_error
cursor_map_ring_file(struct caml_runtime_events_cursor *cursor,
char_os *runtime_events_loc) {
int ret = 0;
#ifdef _WIN32
cursor->ring_file_handle = CreateFile(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strong preference for not setting any fields in cursor until we know we're succeeding. Use local variables and then copy them in right before return E_SUCCESS.

failed1:
CloseHandle(cursor->ring_file_handle);
return ret;
#else
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The stylistic distinction between the Windows and non-Windows sides of this function is quite jarring.

}

ret = E_SUCCESS;
/* fallthrough */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no fall-through.

*cursor_res = cursor;
ret = E_SUCCESS;
/* fallthrough */
failed2:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same remarks as for previous functions:

  • send all failure cases to the failure section;
  • use distinct failure labels for each failure case (even if there is no undo action between a pair of failure labels);
  • name each failure label according to the action which has failed;
  • only initialize local variables until you know you have succeeded.

if (ret != E_SUCCESS) goto failed1;

ret = cursor_map_ring_file(cursor, runtime_events_loc);
if (ret != E_SUCCESS) goto failed2;

cursor->current_positions =
caml_stat_alloc(cursor->metadata->max_domains * sizeof(uint64_t));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this allocation fails, you get an exception and all this careful failure-path code is stymied. Use caml_stat_alloc_noexc and then handle the failure case in the same way as all the others.

failed2:
caml_stat_free(runtime_events_loc);
failed1:
if (ret != E_SUCCESS) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If ret is E_SUCCESS here then surely something has gone badly wrong?

*cursor_res = cursor;
ret = E_SUCCESS;
/* fallthrough */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't, though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants