Queue is never restarted after SSL error #14

Open
bloudermilk opened this issue Aug 1, 2014 · 7 comments

Comments

@bloudermilk

We're seeing intermittent SSL errors that produce the following log output:

2014-08-01T18:31:59.274859+00:00 app[web.2]: apnagent:agent-live [278ms] (gateway) error: 139718328477472:error:14094416:SSL routines:SSL3_READ_BYTES:sslv3 alert certificate unknown:../deps/openssl/openssl/ssl/s3_pkt.c:1275:SSL alert number 46
2014-08-01T18:31:59.275014+00:00 app[web.2]: Gateway error [Error: 139718328477472:error:14094416:SSL routines:SSL3_READ_BYTES:sslv3 alert certificate unknown:../deps/openssl/openssl/ssl/s3_pkt.c:1275:SSL alert number 46

We don't know why the error occurs, given that the certs work fine 99% of the time. The main issue, though, is that after the error the agent's queue is never restarted and/or the connection to the gateway is never re-established, so our application eventually runs out of memory as the queue backs up.

@bloudermilk
Author

It seems that the agent receives the gateway:close event when this happens (I think so, but it's hard to tell because we're using Node clustering), so I've added some code to my server to restart workers when this event is triggered. Am I correct in assuming that the agent isn't automatically reconnected after this event?
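For anyone wanting to try the same workaround, here is a minimal sketch of the idea rather than the exact code from my server. It assumes the agent is an EventEmitter that emits gateway:close as described above. The helper names exitOnGatewayClose and respawnWorkers are hypothetical; only the Node cluster and events APIs are real.

```javascript
const cluster = require('cluster');

// Hypothetical helper (not part of apnagent): run in each worker.
// When the agent reports that its gateway closed, exit the worker so the
// master can fork a fresh one with a fresh agent and an empty queue.
// `exit` is injectable so the helper can be exercised without killing the process.
function exitOnGatewayClose(agent, exit = (code) => process.exit(code)) {
  agent.once('gateway:close', () => {
    console.error('gateway closed; exiting worker so it can be replaced');
    exit(1);
  });
}

// Hypothetical helper: run in the master process.
// The cluster module emits 'exit' whenever a worker dies; fork a replacement.
function respawnWorkers(clusterModule = cluster) {
  clusterModule.on('exit', (worker, code, signal) => {
    console.error(`worker ${worker.id} exited (code=${code}, signal=${signal}); forking a new one`);
    clusterModule.fork();
  });
}
```

In the master you would call respawnWorkers() once before forking, and in each worker call exitOnGatewayClose(agent) right after creating the agent. Note that any notifications still sitting in the dead worker's queue are lost with this approach.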

Edit: I can say with reasonable confidence that the gateway:close events I saw were indeed triggered as part of the SSL failure. The logic in the gateway.close handler only reconnects when connected is set to true, so the only case in which it should trigger the agent's gateway:close event is when the tls.connect handler was never called. I am not seeing any unauthorized events in our logs.

@bloudermilk
Author

Restarting the workers as I mentioned above is working for us as a temporary solution, but I have a feeling the logic for the gateway.close handler should be changed so that this case triggers a reconnection instead of the agent being closed.
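To illustrate the change I have in mind, here is a rough sketch under stated assumptions, not apnagent's actual source. SketchAgent, connectFn, schedule, and the closing flag are all hypothetical names. The gateway:reconnect event name is also an assumption; gateway:close and gateway:error are the events mentioned in this thread, and secureConnect is the standard TLS socket event. The point is that the close handler decides based on whether a shutdown was requested, not on whether the handshake ever completed.

```javascript
const { EventEmitter } = require('events');

// Hypothetical stand-in for the agent; it only models the reconnect decision.
class SketchAgent extends EventEmitter {
  // connectFn: assumed to open a TLS socket to the gateway and return it.
  // schedule: how to delay the retry; defaults to a 1 second timer.
  constructor(connectFn, schedule = (fn) => setTimeout(fn, 1000)) {
    super();
    this.connectFn = connectFn;
    this.schedule = schedule;
    this.connected = false;
    this.closing = false; // true only after the caller asks to shut down
  }

  connect() {
    const socket = this.connectFn();
    socket.once('secureConnect', () => {
      this.connected = true;
      this.emit('gateway:connect');
    });
    socket.once('error', (err) => this.emit('gateway:error', err));
    socket.once('close', () => this.onGatewayClose());
  }

  onGatewayClose() {
    this.connected = false;
    if (this.closing) {
      // Intentional shutdown: report it and stop.
      this.emit('gateway:close');
      return;
    }
    // Unintentional close, including a failed TLS handshake where
    // `connected` was never set to true: retry instead of giving up.
    this.emit('gateway:reconnect');
    this.schedule(() => this.connect());
  }

  close() {
    this.closing = true;
  }
}
```

With this shape, an SSL alert during the handshake leads to error, then close, then a scheduled reconnect, so the queue gets a working connection again without restarting the process.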

@bloudermilk
Author

After updating the logic in the gateway.close handler, apnagent can now recover from these SSL errors gracefully. I'm running an updated version on my fork. Let me know if you're interested in a PR.

@logicalparadox
Owner

Looked at your fork. Looks great. Is there any way to add a test to simulate this behavior? Either way, a PR would be greatly appreciated for both #14 and #15.

@bloudermilk
Author

@logicalparadox I'll take a look at the test harness and see if I can simulate both! Thanks for the response.

@logicalparadox
Owner

Cool, let me know if you have questions. Also, I like that you exposed debug!

@olilavoie

We're having the same problem! Push notifications work when our Node app is freshly started, but after ~1 hour the console can't send any pushes and we receive a gateway:error with an empty error and msg object.
