Build Process Potentially Leaking Memory #118

Open
stevekinney opened this issue Jul 8, 2014 · 7 comments

Comments

@stevekinney
Member

@jugglinmike I noticed the site was down today, but the server was up. So I ran the deploy.sh script and got this error that I think we talked about in #65.

I'm going to try to power cycle the server in the meantime.

Tracing dependencies for: socket.io-client
Compressed CSS output to 78%.
Compressed CSS output to 78%.
Compressed CSS output to 78%.
FATAL ERROR: Evacuation Allocation failed - process out of memory
./deploy.sh: line 19:  2475 Aborted                 (core dumped) grunt build
@stevekinney
Member Author

Update: Power cycling worked, which leads me to believe that we do have a memory leak on our hands.

@jugglinmike
Contributor

Hi Steve,

A little more information about the system in its failing state will help determine where to look. The next time you experience this failure, could you run the following commands and share the contents of the created files?

$ COLUMNS=512 top -bcn 1 > top.txt
$ free -t -m > free.txt
$ ps -aeF > ps.txt

Also, knowing the time since the last deployment would help @mzgoddard and me estimate severity. Do you know how long it had been since you last deployed prior to the incident you reported here?

@stevekinney
Member Author

@jugglinmike So, it looks like we're coming across a daily memory leak issue. I rebooted the server yesterday and it was out of memory again today.

Here is the console message:

[Image: screen shot 2014-09-18 at 9 00 41 am]

The last deployment was the last merge into master. But it's run out of memory since the last time I rebooted the server, which was yesterday.

Thoughts?

/cc @escoleman3 @kgotchet @jlefeber @mzgoddard

@jugglinmike
Contributor

@stevekinney The next time this happens (tomorrow morning, by the sound of it) and before rebooting the server, could you grab the stats I mentioned in my previous comment?

@stevekinney
Member Author

Yup, I couldn't log in because the key on the server was from my CEE iMac, which I don't have anymore. So, I need @escoleman3 to pop in my personal key. I rebooted because someone needed to use it in the next two hours.

@stevekinney
Member Author

So, @jugglinmike—the server went down twice today. I believe @escoleman3 reset it once this morning. I'm including the information you requested.

https://gist.github.com/stevekinney/be2a2de91aa864306577

@jugglinmike
Contributor

Thanks, @stevekinney. @mzgoddard and I have run through the data, and we think we understand the problem. This is our theory:

It looks like the "top" server is failing occasionally and leaving its child processes (the activity servers) orphaned. The forever module is correctly restarting the top-level server, and it is spawning new activity servers. This repeats over time, until the environment is filled with orphaned ("zombie") servers that are still running and holding memory.
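
One way to check this theory against a process listing is to look for activity servers whose parent is PID 1, which is where orphaned processes usually end up on a traditional Linux init. The following is only a sketch: `findOrphans` is a hypothetical helper, the `ps -eo pid,ppid,args` column layout is an assumption (it is not the `ps -aeF` format requested earlier in this thread), and the pattern used to recognize an activity server is a guess.

```javascript
// Hypothetical helper: list processes that have been reparented to PID 1.
// `psOutput` is the text printed by `ps -eo pid,ppid,args`, header included.
// `pattern` is a RegExp that matches the command line of interest.
function findOrphans(psOutput, pattern) {
  return psOutput
    .split('\n')
    .slice(1) // skip the header line
    .map(function (line) {
      // Columns: PID, PPID, then the full command line.
      var match = /^\s*(\d+)\s+(\d+)\s+(.*)$/.exec(line);
      return match && {
        pid: Number(match[1]),
        ppid: Number(match[2]),
        command: match[3]
      };
    })
    .filter(function (proc) {
      // Drop unparseable lines, then keep only matching children of PID 1.
      return proc && proc.ppid === 1 && pattern.test(proc.command);
    });
}

// Possible usage (the pattern is an assumption about how the servers are named):
//   var execSync = require('child_process').execSync;
//   var listing = execSync('ps -eo pid,ppid,args', { encoding: 'utf8' });
//   console.log(findOrphans(listing, /activity/));
```

If the number of matches grows after each restart of the top-level server, that would be consistent with the theory above. Note that some init systems reparent orphans to a process other than PID 1, so an empty result does not rule the theory out.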

This highlights two separate problems:

  1. The top-level server is failing on a regular basis
  2. The children are left running

#1 is likely caused by a memory leak, and resolving it may require additional forensics. #2 can be resolved by maintaining a list of child process IDs on disk and killing those processes on startup.

#1 is definitely the trickier problem, but if we've interpreted all this correctly, resolving #2 will result in improved application behavior: the app will continue to fail intermittently, but it will immediately restart itself cleanly. The site will suffer little downtime (and it will be resolved automatically), but it will kick active users and lose saved activity results.
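
The PID-list idea proposed for #2 could be sketched roughly as follows. This is not the project's actual fix: the file location, the JSON format, and the helper names (`readPids`, `writePids`, `recordChild`, `killStaleChildren`) are all assumptions made for illustration.

```javascript
var fs = require('fs');

// Hypothetical location for the PID list; not taken from the project.
var PID_FILE = '/tmp/activity-servers.pids';

// Read the PIDs recorded by a previous run.
// A missing or corrupt file is treated as an empty list.
function readPids(file) {
  try {
    var pids = JSON.parse(fs.readFileSync(file, 'utf8'));
    if (!Array.isArray(pids)) return [];
    // Only positive integers are valid targets; 0 and negative values
    // would address whole process groups when passed to process.kill.
    return pids.filter(function (pid) {
      return Number.isInteger(pid) && pid > 0;
    });
  } catch (err) {
    return [];
  }
}

function writePids(file, pids) {
  fs.writeFileSync(file, JSON.stringify(pids));
}

// Record a newly spawned child so that a later startup can find it.
function recordChild(file, pid) {
  var pids = readPids(file);
  pids.push(pid);
  writePids(file, pids);
}

// On startup: signal every child left over from the previous run,
// then clear the list. Returns the PIDs that were actually signalled.
function killStaleChildren(file) {
  var killed = readPids(file).filter(function (pid) {
    try {
      process.kill(pid, 'SIGTERM');
      return true;
    } catch (err) {
      // Typically ESRCH (process already gone); other errors such as
      // EPERM are also skipped in this sketch.
      return false;
    }
  });
  writePids(file, []);
  return killed;
}

// Possible usage in the top-level server:
//   killStaleChildren(PID_FILE);        // before spawning anything
//   recordChild(PID_FILE, child.pid);   // after each child is spawned
```

One caveat with this design is PID reuse: if the machine has been running long enough, a recorded PID could belong to an unrelated process by the time the top-level server restarts, so a real implementation may want to verify the command line before sending a signal.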

I'm going to begin work on a fix for #2 tomorrow, as it seems to be the low-hanging fruit here.

Does this make sense to you?
