
Runtime doesn't utilize available RAM and crashes #102332

Open · talweiss1982 opened this issue May 16, 2024 · 30 comments

talweiss1982 commented May 16, 2024

Description

Hi,
I'm trying to load a huge amount of data in a single .NET process on a machine with 4 TB of RAM and ~128 CPUs, running Ubuntu 22.04 with the latest .NET 8 SDK (updated today; I think it's 8.0.5). The code reads 464 CSV files, each 3 GB, line by line, parsing each row and generating an object from it. Each object is then placed into one of 5 concurrent dictionaries according to its type; each dictionary is initialized with a capacity of 500 million, which is enough that it never needs to grow. The dictionaries are there to remove duplicates and merge their data; the deduplicated objects are then written out to CSV files. I don't go over the ~2 billion object limit of a dictionary. I don't read entire files or keep any file in memory, only their buffers (default size 4 KB), and I don't process more than 10 files in parallel.

Yet the machine never utilizes more than 1.5 TB of memory. I investigated the Ubuntu configuration and it doesn't seem to have any per-process limits configured. Moreover, I tried running 4 of these processes and still didn't get above 1.8 TB. I checked dotnet-counters, and the GC heap size doesn't seem to grow above 300 GB. I'm using server GC and tried setting the heap count, but it seems to be capped at 128: I can only reduce it; setting it to a higher value is ignored. I configured 1 TB of huge pages (2 MB size) and set the .NET flag to use huge pages, as I read they will be used when the GC reaches its hard limit; as a result, .NET crashes with a heap initialization error, and I can't even build my project because dotnet build crashes with the same error. I tried running with the GC's low-latency-for-long-duration mode, but after processing 90% of the data a gen2 GC is invoked and afterwards the process crashes. This means the runtime thinks it has reached its limit even though there are 2.5 TB of free RAM. I also printed the GC's available memory, and it shows 4 TB when the process starts. I tried asking the GC to retain VM memory; it didn't help.

Please advise what voodoo I need to practice to make my .NET process utilize its God-given RAM pool?
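
In rough outline, the loading code does this (a simplified sketch; the Row type, names, and parsing are illustrative, not my actual code):

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

// Illustrative row type; the real parsing and merge logic are omitted.
record Row(string Key, int TypeId, string Payload);

class Ingest
{
    // 5 dictionaries, each pre-sized so they never need to grow.
    static readonly ConcurrentDictionary<string, Row>[] Buckets =
        Enumerable.Range(0, 5)
                  .Select(_ => new ConcurrentDictionary<string, Row>(
                      concurrencyLevel: Environment.ProcessorCount,
                      capacity: 500_000_000))
                  .ToArray();

    static void Main(string[] args)
    {
        var files = Directory.EnumerateFiles(args[0], "*.csv");

        // No more than 10 files in flight; each file is streamed line by line
        // with the StreamReader default buffer, never read into memory whole.
        Parallel.ForEach(files, new ParallelOptions { MaxDegreeOfParallelism = 10 }, file =>
        {
            using var reader = new StreamReader(file);
            string? line;
            while ((line = reader.ReadLine()) != null)
            {
                var parts = line.Split(',');
                var row = new Row(parts[0], int.Parse(parts[1]), parts[2]);
                // Dedup by key; the real code also merges duplicate rows' data.
                Buckets[row.TypeId % Buckets.Length][row.Key] = row;
            }
        });
    }
}
```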

Reproduction Steps

Use an Ubuntu 22.04 machine with 4 TB of RAM and .NET SDK 8.0.5. Read 464 CSV files, each 3 GB in size, parsing them line by line and shoving the rows into concurrent dictionaries; make sure to split the data across enough dictionaries that no single one reaches 2 billion objects, and set each dictionary's capacity in advance. Watch as the GC heap size fails to utilize the whole RAM and the process crashes.

Expected behavior

I expect the GC heap to grow in size to utilize all the RAM; I don't mind GC pauses or slowness.

Actual behavior

The GC heap size doesn't seem to grow above 300 GB, causing GC pressure and random crashes.

Regression?

I didn't try other versions.

Known Workarounds

I wish

Configuration

.NET 8.0.5 running on Ubuntu 22.04, x64 architecture.

Other information

There are two issues: (1) the GC heap count seems to be capped at 128, and I believe each heap's size is also capped, so I can't get to 4 TB; (2) trying to use huge pages seems to crash the runtime.
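(For reference, the huge-pages flag here is DOTNET_GCLargePages=1, which, as far as I understand, is only honored when a heap hard limit such as DOTNET_GCHeapHardLimit is also set; that interaction may be relevant to the initialization crash.)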

@dotnet-policy-service bot added the untriaged (New issue has not been triaged by the area owner) label May 16, 2024
mangod9 (Member) commented May 16, 2024

Thanks @talweiss1982 for reporting the issue. Do you have any specific GC config set for your application? I assume this isn't a regression from a previous release?

talweiss1982 (Author) commented May 16, 2024 via email

talweiss1982 (Author) commented May 16, 2024 via email

mangod9 (Member) commented May 16, 2024

Can you try setting the DOTNET_GCRegionRange=10700000000 env var to check if that makes any difference?

mangod9 (Member) commented May 16, 2024

That sets the initial GC reservation size to ~1 TB. The config should be looking at the available physical memory to make that determination, but I'm wondering if there is something off with that logic.
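(For reference, the arithmetic: the value is parsed as hex, so 10700000000 is 0x10700000000 = 1,129,576,398,848 bytes, i.e. roughly 1.1 TB.)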

mangod9 removed the untriaged label May 16, 2024
mangod9 added this to the 9.0.0 milestone May 16, 2024
cshung (Member) commented May 16, 2024

Is it possible to do some live debugging together? A crash dump will likely be too big, and traces are unlikely to be useful for solving this crash.

neon-sunset (Contributor) commented:

/offtopic
I know this is unrelated, but if you are not using Sep and NonBlocking.ConcurrentDictionary, you might want to try them, as they will reduce the memory footprint per item (NonBlocking has a different, but not worse, performance profile compared to the out-of-the-box ConcurrentDictionary and consumes less memory per entry).

NonBlocking.CDict: https://github.com/VSadov/NonBlocking
Sep: https://github.com/nietras/Sep/

talweiss1982 (Author) commented May 17, 2024 via email

talweiss1982 (Author) commented May 17, 2024 via email

talweiss1982 (Author) commented May 17, 2024 via email

talweiss1982 (Author) commented:

After setting DOTNET_GCRegionRange=10700000000, dotnet-counters shows:
Press p to pause, r to resume, q to quit.
Status: Running

Name Current Value
[System.Runtime]
% Time in GC since last GC (%) 15
Allocation Rate (B / 1 sec) 8,200
CPU Usage (%) 0.78
Exception Count (Count / 1 sec) 0
GC Committed Bytes (MB) 292,488.401
GC Fragmentation (%) 5.099
GC Heap Size (MB) 277,783.096
Gen 0 GC Budget (MB) 11,502
Gen 0 GC Count (Count / 1 sec) 0
Gen 0 Size (B) 0
Gen 1 GC Count (Count / 1 sec) 0
Gen 1 Size (B) 1.457e+10
Gen 2 GC Count (Count / 1 sec) 0
Gen 2 Size (B) 2.4346e+11
IL Bytes Jitted (B) 875,351
LOH Size (B) 2.8e+10
Monitor Lock Contention Count (Count / 1 sec) 0
Number of Active Timers 2
Number of Assemblies Loaded 64
Number of Methods Jitted 10,115
POH (Pinned Object Heap) Size (B) 167,360
ThreadPool Completed Work Item Count (Count / 1 sec) 0
ThreadPool Queue Length 0
ThreadPool Thread Count 12
Time paused by GC (ms / 1 sec) 0
Time spent in JIT (ms / 1 sec) 0
Working Set (MB) 288,691.38

The error:
Fatal error. Failed to create RW mapping for RX memory

talweiss1982 (Author) commented:

The SDK is 8.0.105 (I remembered something with a 5).

ayende (Contributor) commented May 19, 2024

I think that I know what the problem is.
You are running into the vm.max_map_count limit. Try sysctl -w vm.max_map_count=262144 or something like that (you may need sudo), and then run your code.

Basically, what I think is happening is that you are using a lot of memory mappings, not just a lot of memory. And by default there is a limit of 64K of those.

At some point the JIT needs to emit some code, or the runtime needs to allocate more memory, and there aren't enough memory mappings left for that.
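
For reference, checking the current limit and making the change persist across reboots looks something like this (standard sysctl usage; pick a value that fits your workload):

cat /proc/sys/vm/max_map_count              # kernel default is 65530
sudo sysctl -w vm.max_map_count=262144      # takes effect immediately
echo "vm.max_map_count=262144" | sudo tee /etc/sysctl.d/99-max-map-count.conf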

See also:
https://ravendb.net/articles/production-postmortem-the-big-server-that-couldnt-handle-the-load
https://ayende.com/blog/197635-A/production-postmortem-out-of-memory-on-a-clear-sky

talweiss1982 (Author) commented:

> sysctl -w vm.max_map_count=262144

Will try, thanks Oren.

talweiss1982 (Author) commented:

@ayende it seems to have helped; I'm able to get past a 288 GB working set:
Press p to pause, r to resume, q to quit.
Status: Running

Name Current Value
[System.Runtime]
% Time in GC since last GC (%) 5
Allocation Rate (B / 1 sec) 5.0745e+09
CPU Usage (%) 16.122
Exception Count (Count / 1 sec) 0
GC Committed Bytes (MB) 337,193.353
GC Fragmentation (%) 7.262
GC Heap Size (MB) 308,255.841
Gen 0 GC Budget (MB) 4,529
Gen 0 GC Count (Count / 1 sec) 2
Gen 0 Size (B) 0
Gen 1 GC Count (Count / 1 sec) 0
Gen 1 Size (B) 1.7257e+08
Gen 2 GC Count (Count / 1 sec) 0
Gen 2 Size (B) 3.0788e+11
IL Bytes Jitted (B) 818,897
LOH Size (B) 2.4e+10
Monitor Lock Contention Count (Count / 1 sec) 0
Number of Active Timers 2
Number of Assemblies Loaded 67
Number of Methods Jitted 9,668
POH (Pinned Object Heap) Size (B) 167,360
ThreadPool Completed Work Item Count (Count / 1 sec) 2
ThreadPool Queue Length 0
ThreadPool Thread Count 12
Time paused by GC (ms / 1 sec) 82.403
Time spent in JIT (ms / 1 sec) 0
Working Set (MB) 336,538.276

ayende (Contributor) commented May 19, 2024

That being said, I would say that this is a runtime bug in the error reporting.

talweiss1982 (Author) commented:

> Can you try setting the DOTNET_GCRegionRange=10700000000 env var to check if that makes any difference?

Now it crashed at 1.164 TB. What should the value be if I want to utilize 3 TB?

mangod9 (Member) commented May 20, 2024

The config setting is the hex value in bytes, so you should be able to change it accordingly.
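For example: 3 TB = 3 × 2^40 = 3,298,534,883,328 bytes = 0x30000000000, so DOTNET_GCRegionRange=30000000000 would request a ~3 TB initial reservation.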

talweiss1982 (Author) commented May 20, 2024 via email

janvorli (Member) commented:

> That being said, I would say that this is a runtime bug in the error reporting.

Do you mean that the error message is not pointing to the source of the problem (which I am going to fix) or something else?

ayende (Contributor) commented May 20, 2024

Yes, it should give a clearer error message.

talweiss1982 (Author) commented:

I'm actually still getting OOM with these settings:
DOTNET_GCRegionRange=32000000000
sysctl -w vm.max_map_count=524240

I didn't get to 3 TB. I think the GC is allocating a large amount of unmanaged memory; when profiling locally on a single file, I saw the GC allocate 400 MB of native memory, and the number was growing rapidly.
My numbers before the crash are:

Name Current Value
[System.Runtime]
% Time in GC since last GC (%) 25
Allocation Rate (B / 1 sec) 8,200
CPU Usage (%) 17.025
Exception Count (Count / 1 sec) 0
GC Committed Bytes (MB) 2,278,842.536
GC Fragmentation (%) 1.611
GC Heap Size (MB) 2,242,855.335
Gen 0 GC Budget (MB) 26,843
Gen 0 GC Count (Count / 1 sec) 0
Gen 0 Size (B) 0
Gen 1 GC Count (Count / 1 sec) 0
Gen 1 Size (B) 4.1449e+09
Gen 2 GC Count (Count / 1 sec) 0
Gen 2 Size (B) 2.1622e+12
IL Bytes Jitted (B) 961,600
LOH Size (B) 8.918e+10
Monitor Lock Contention Count (Count / 1 sec) 0
Number of Active Timers 2
Number of Assemblies Loaded 67
Number of Methods Jitted 10,877
POH (Pinned Object Heap) Size (B) 130,680
ThreadPool Completed Work Item Count (Count / 1 sec) 0
ThreadPool Queue Length 0
ThreadPool Thread Count 12
Time paused by GC (ms / 1 sec) 0
Time spent in JIT (ms / 1 sec) 0
Working Set (MB) 2,318,343.016

janvorli (Member) commented:

@talweiss1982 is it still crashing with "Failed to create RW mapping for RX memory", or in some other way? Maybe 524240 is still too low for your case.

talweiss1982 (Author) commented:

It crashes with out-of-memory. I think the GC is allocating quite a bit of native memory; how can I monitor the amount of memory the GC itself is allocating?

cshung (Member) commented May 20, 2024

You can use the CommittedUsage event here. The event will give you a breakdown of the large chunks of native memory the GC acquired from the OS (which may or may not be used by managed objects yet).

If you collect a GCCollectOnly trace, you should see this show up in the Events view of PerfView.

If you get the OOM trapped in a debugger, you can also use the MAddress command to see a breakdown of all the native memory used.
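
For example, one standard way to collect these on Linux (substitute your process id; dotnet-trace and dotnet-dump are the usual tools here):

dotnet-trace collect --process-id <pid> --profile gc-collect
dotnet-dump collect --process-id <pid>
# then, inside `dotnet-dump analyze <dump-file>`:
maddress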

janvorli (Member) commented:

@cshung, I assume the CommittedUsage doesn't include the card tables and similar GC bookkeeping stuff, right? I guess those might be quite large for such a large heap size.

cshung (Member) commented May 20, 2024

> @cshung, I assume the CommittedUsage doesn't include the card tables and similar GC bookkeeping stuff, right? I guess those might be quite large for such a large heap size.

It does - the last field is total_bookkeeping_committed, which includes everything listed here.

talweiss1982 (Author) commented:

This time the OOM was due to me exceeding the max capacity of the dictionary. I think I'm good; I was able to load 2.4 TB of data.

mangod9 (Member) commented May 21, 2024

Ok, good to know that you have been able to get things to work. Can you please confirm the config settings you had to apply to get a ~3 TB working set? We will need to look into why the GC isn't defaulting to the higher region range. Thx!

cshung (Member) commented May 21, 2024

@mangod9 is right, you shouldn't have to set the DOTNET_GCRegionRange manually.

Can we take a look at what happened around here - this code should be run exactly once during process startup.

https://github.com/dotnet/runtime/blob/362a95d53d7266ee9b7452e351d07845dcc1edd2/src/coreclr/gc/gc.cpp#L48493-L48517

It should be set automatically by the formula there; something must be wrong with that logic if a manual configuration helped the situation.
