-
Notifications
You must be signed in to change notification settings - Fork 810
[etcd] Report errors connecting to etcd endpoint #3007
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
If the agent can't reach etcd for any reason (eg. when it gets a 'Connection refused' error), it should report the service check as CRITICAL, so we don't have to know of all possible exceptions in advance.
|
||
def test_closed_port(self): | ||
self.assertRaises(Exception, | ||
lambda: self.run_check({"instances": [{"url": "http://localhost:9999"}]})) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I chose port 9999 as a port that is likely closed. Not sure if this is a good assumption. If needed, we could check if the port is closed at the start of the test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me, could you just document this in a comment please?
FYI: I checked the
It doesn't look like it ran the etcd tests. |
Hey @pbitty! Thank you for your Pull Request! This looks like a good change, we'll get around to reviewing it soon. I'll run the testing issues by the rest of the team and see if they know what happened there, I know we've been updating the tests to make them run faster and more efficiently. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @pbitty, I've added 2 small comments, could you address them?
We've fixed the issue that was making Travis skipping tests that should run, and the tests are running and passing now
checks.d/etcd.py
Outdated
message="Timeout when hitting %s" % url, | ||
tags=["url:{0}".format(url)]) | ||
raise | ||
except: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you catch the Exception
class here instead? It doesn't make sense to catch KeyboardInterrupt
s here for instance.
|
||
def test_closed_port(self): | ||
self.assertRaises(Exception, | ||
lambda: self.run_check({"instances": [{"url": "http://localhost:9999"}]})) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me, could you just document this in a comment please?
@olivielpeau Np! Changes addressed. |
Thanks @pbitty! |
If the agent can't reach etcd for any reason (eg. when it gets a 'Connection refused' error), it should report the service check as CRITICAL, so we don't have to know of all possible exceptions in advance.
* master: (254 commits) Reduce the maximum amount of argument to pylint. [forwarder] stop flushing after 10s [etcd] Report errors connecting to etcd endpoint (#3007) Postfix check should pass raise_on_empty_output=False [status] Silence requests exception [psutil] Only set `psutil.PROCFS_PATH` once in the collector (#3013) [ci] fix check name detection of test files (#3021) [ci][rabbitmq] Increase wait timeout (#3022) [core] SpooledTemporaryFile for subprocess output (#3002) [collector] isolate system checks (#3001) [ci] Fix Travis jobs timing out (#3017) [mongo] Use db.current_op instead of manually querying (#3016) [ci] fix bad citizens detection (#3020) [ci] add debug logs (#3019) [mongo] use `currentOp` for mongodb 3.2+ Use proxy for API key check in info page (#3012) [mongo] Add MongoDB 3.2 support [mongo] Add MongoDB 3.2 to travis-ci config [packaging] 5.11.0 nightlies (#3009) [core] hard-deprecate start/stop/restart/status (#3004) ...
If the agent can't reach etcd for any reason (eg. when it gets a 'Connection refused' error), it should report the service check as CRITICAL, so we don't have to know of all possible exceptions in advance.
This PR makes the
etcd
check send aCRITICAL
status if it encouters any errorswhen talking to etcd.
Motivation
When creating some forced failure scenarios in our infrastructure, but stopping
etcd
processes,I noticed that the agent was not reporting the 'Connection refused' error to Datadog.