Skip to content

Conversation

pbitty
Copy link
Contributor

@pbitty pbitty commented Nov 8, 2016

This PR makes the etcd check send a CRITICAL status if it encouters any errors
when talking to etcd.

Motivation

When creating some forced failure scenarios in our infrastructure, but stopping etcd processes,
I noticed that the agent was not reporting the 'Connection refused' error to Datadog.

If the agent can't reach etcd for any reason (eg. when it gets a
'Connection refused' error), it should report the service check as
CRITICAL, so we don't have to know of all possible exceptions
in advance.

def test_closed_port(self):
self.assertRaises(Exception,
lambda: self.run_check({"instances": [{"url": "http://localhost:9999"}]}))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose port 9999 as a port that is likely closed. Not sure if this is a good assumption. If needed, we could check if the port is closed at the start of the test.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me, could you just document this in a comment please?

@pbitty
Copy link
Contributor Author

pbitty commented Nov 8, 2016

FYI: I checked the etcd test output and found this:

$ rake ci:run
Comparing 22693a01bf8785272a4c36b74ba9fc2edfaf184e with master
Git diff: 
checks.d/etcd.py
tests/checks/integration/test_etcd.py
Skipping etcd tests, not affected by the change

It doesn't look like it ran the etcd tests.

@gmmeyer
Copy link
Contributor

gmmeyer commented Nov 10, 2016

Hey @pbitty! Thank you for your Pull Request! This looks like a good change, we'll get around to reviewing it soon.

I'll run the testing issues by the rest of the team and see if they know what happened there, I know we've been updating the tests to make them run faster and more efficiently.

Copy link
Member

@olivielpeau olivielpeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @pbitty, I've added 2 small comments, could you address them?

We've fixed the issue that was making Travis skipping tests that should run, and the tests are running and passing now

checks.d/etcd.py Outdated
message="Timeout when hitting %s" % url,
tags=["url:{0}".format(url)])
raise
except:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you catch the Exception class here instead? It doesn't make sense to catch KeyboardInterrupts here for instance.


def test_closed_port(self):
self.assertRaises(Exception,
lambda: self.run_check({"instances": [{"url": "http://localhost:9999"}]}))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me, could you just document this in a comment please?

@pbitty
Copy link
Contributor Author

pbitty commented Nov 14, 2016

@olivielpeau Np! Changes addressed.

@olivielpeau olivielpeau merged commit b5ef0b6 into DataDog:master Nov 14, 2016
@olivielpeau
Copy link
Member

Thanks @pbitty!

@olivielpeau olivielpeau modified the milestones: 5.11.0, Triage Nov 14, 2016
wholroyd pushed a commit to wholroyd/dd-agent that referenced this pull request Nov 15, 2016
If the agent can't reach etcd for any reason (eg. when it gets a
'Connection refused' error), it should report the service check as
CRITICAL, so we don't have to know of all possible exceptions
in advance.
degemer added a commit that referenced this pull request Nov 15, 2016
* master: (254 commits)
  Reduce the maximum amount of argument to pylint.
  [forwarder] stop flushing after 10s
  [etcd] Report errors connecting to etcd endpoint (#3007)
  Postfix check should pass raise_on_empty_output=False
  [status] Silence requests exception
  [psutil] Only set `psutil.PROCFS_PATH` once in the collector (#3013)
  [ci] fix check name detection of test files (#3021)
  [ci][rabbitmq] Increase wait timeout (#3022)
  [core] SpooledTemporaryFile for subprocess output (#3002)
  [collector] isolate system checks (#3001)
  [ci] Fix Travis jobs timing out (#3017)
  [mongo] Use db.current_op instead of manually querying (#3016)
  [ci] fix bad citizens detection (#3020)
  [ci] add debug logs (#3019)
  [mongo] use `currentOp` for mongodb 3.2+
  Use proxy for API key check in info page (#3012)
  [mongo] Add MongoDB 3.2 support
  [mongo] Add MongoDB 3.2 to travis-ci config
  [packaging] 5.11.0 nightlies (#3009)
  [core] hard-deprecate start/stop/restart/status (#3004)
  ...
@pbitty pbitty deleted the etcd_conn_refused_check branch November 21, 2016 20:47
wholroyd pushed a commit to wholroyd/dd-agent that referenced this pull request Jan 23, 2017
If the agent can't reach etcd for any reason (eg. when it gets a
'Connection refused' error), it should report the service check as
CRITICAL, so we don't have to know of all possible exceptions
in advance.
@masci masci modified the milestones: 5.11.0, 5.12.0 Jan 24, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants