
Conversation

@mrunalp (Member) commented Oct 24, 2017

This reverts commit b198c57.

Signed-off-by: Mrunal Patel <[email protected]>
@mrunalp (Member, Author) commented Oct 24, 2017

/test ami

@k8s-ci-robot added the size/M (PR changes 30-99 lines, ignoring generated files) and cncf-cla: yes labels Oct 24, 2017
@mrunalp (Member, Author) commented Oct 24, 2017

@rhatdan @runcom @cevich PTAL
I believe this should fix the CI. Great catch by @rhatdan.

@runcom (Member) commented Oct 24, 2017

LGTM if it makes the tests happy again, great catch Dan!

@TomSweeneyRedHat (Contributor) commented
LGTM, fingers crossed for the tests.

@cevich (Contributor) commented Oct 24, 2017

@rhatdan @mrunalp Great find, but I wouldn't consider this a long-term fix. As discussed on IRC, this was originally added to prevent an OOM while compiling cri-o. While that may have been subsequently resolved by jamming more memory onto the VMs, I think it's a mistake to assume no test ever needs swap on the test subject. Also, if a test subject already has swap set up, this change will not resolve the issue (the playbook does nothing in that case).

A better long-term solution would be to fix the few (one) specific tests that need to control their memory placement, i.e. lock the pages into main memory, preventing them from being swapped. IIRC, there's a command or system call that's used to do this...can't remember what it's called at the moment...@mrunalp and I will keep looking for it...

@cevich (Contributor) commented Oct 24, 2017

@rhatdan @mrunalp found it...this is probably what the test should do:

  1. Making sure it's not setting its cgroup memory limit higher than available memory (exclusive of any swap).
  2. In the test, before hogging up memory, call Go's equivalent of mlockall() (see the sketch after this list).
  3. If the OOM test still doesn't work...there's probably an important bug that needs fixing (somewhere)
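
For concreteness, a minimal sketch of item 2, assuming the test is written in Go: the mlockall(2) wrapper is `unix.Mlockall` in golang.org/x/sys/unix (that package and call are real; the surrounding scaffolding is illustrative, not code from this repo):

```go
// Minimal sketch: lock this process's pages into RAM via mlockall(2)
// so the kernel cannot swap them out during the OOM test.
package main

import (
	"fmt"
	"log"

	"golang.org/x/sys/unix"
)

func main() {
	// MCL_CURRENT locks pages already mapped; MCL_FUTURE locks pages
	// mapped from now on (i.e. the memory the test is about to hog).
	if err := unix.Mlockall(unix.MCL_CURRENT | unix.MCL_FUTURE); err != nil {
		log.Fatalf("mlockall: %v", err)
	}
	defer unix.Munlockall() // release the locks when the test exits

	fmt.Println("pages locked; allocations now count against RAM only")
}
```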

This approach avoids putting the cart before the horse (for the benefit of a single test). It also better exercises this situation under real-world conditions (the vast majority of customer setups I've seen use swap). Perhaps this test is even playing in the realm of the OOM killer firing off against something important, since the test cannot predict what other consumers are doing. An unlikely situation to be sure, but one that creates an even harder-to-debug problem.

@cevich (Contributor) commented Oct 24, 2017

Ohh...another thing. At the end, the test could check its majflt (see proc(5)) from /proc/pid/stat. Compare the count before and after testing to verify it has not increased after calling mlockall().
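
A minimal sketch of that before/after check, again assuming Go; majflt is field 12 of /proc/self/stat per proc(5), and the `majorFaults` helper name is hypothetical:

```go
// Hypothetical helper: read this process's major page-fault count
// (majflt, field 12 of /proc/self/stat, see proc(5)).
package main

import (
	"fmt"
	"log"
	"os"
	"strconv"
	"strings"
)

func majorFaults() (uint64, error) {
	data, err := os.ReadFile("/proc/self/stat")
	if err != nil {
		return 0, err
	}
	stat := string(data)
	// Field 2 (comm) is parenthesized and may contain spaces, so split
	// after the closing paren; majflt is then the 10th remaining field.
	rest := stat[strings.LastIndexByte(stat, ')')+2:]
	return strconv.ParseUint(strings.Fields(rest)[9], 10, 64)
}

func main() {
	before, err := majorFaults()
	if err != nil {
		log.Fatal(err)
	}
	// ... call mlockall() and run the memory-hogging test here ...
	after, _ := majorFaults()
	fmt.Printf("majflt before=%d after=%d\n", before, after)
}
```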

@mrunalp (Member, Author) commented Oct 24, 2017

I checked that the RHEL AMI job has passed the OOM test :)

@mrunalp (Member, Author) commented Oct 24, 2017

@cevich

> Making sure it's not setting its cgroup memory limit higher than available memory (exclusive of any swap).

Our limit is just under 5M. This test is about the container not wreaking havoc on the entire system :)

> In the test, before hogging up memory, call Go's equivalent of mlockall().

Real-world Go programs don't typically call this.

@runcom (Member) commented Oct 24, 2017

🎉

@cevich (Contributor) commented Oct 24, 2017

> Our limit is just under 5M. This test is about the container not wreaking havoc on the entire system :)

Oh good, that's very reasonable.

Hmmm, you sure there's not a systemd/kernel bug here? I would think that a limit should be honored whether or not swap is available. Otherwise, what's the point in having a limit at all?

If there is a bug, ignoring it in the "framework" is the most difficult way to silence it, just ask @edsantiago. For things like this, it's way easier to either fix the underlying problem, or skip the test.

> Real-world Go programs don't typically call this.

Tests != real world, ever, by definition (though sometimes they simulate it).
I'm just suggesting using better simulation tools 😀

...wait...doesn't that mean if reality is a simulation, then, like, maybe, that, then we could also, like, be in, like, some universal unittest...whoa...deep...

@cevich (Contributor) commented Oct 24, 2017

Re: cgroup bug: From the kernel docs, though I don't fully parse all of what it says:

> In extreme cases, with many concurrent allocations and a complete breakdown of reclaim progress within the group, the high boundary can be exceeded. But even then it's mostly better to satisfy the allocation from the slack available in other groups or the rest of the system than killing the group. Otherwise, memory.max is there to limit this type of spillover and ultimately contain buggy or even malicious applications.
>
> Setting the original memory.limit_in_bytes below the current usage was subject to a race condition, where concurrent charges could cause the limit setting to fail. memory.max on the other hand will first set the limit to prevent new charges, and then reclaim and OOM kill until the new limit is met - or the task writing to memory.max is killed.
>
> The combined memory+swap accounting and limiting is replaced by real control over swap space.
>
> The main argument for a combined memory+swap facility in the original cgroup design was that global or parental pressure would always be able to swap all anonymous memory of a child group, regardless of the child's own (possibly untrusted) configuration. However, untrusted groups can sabotage swapping by other means - such as referencing its anonymous memory in a tight loop - and an admin can not assume full swappability when overcommitting untrusted jobs.
>
> For trusted jobs, on the other hand, a combined counter is not an intuitive userspace interface, and it flies in the face of the idea that cgroup controllers should account and limit specific physical resources. Swap space is a resource like all others in the system, and that's why unified hierarchy allows distributing it separately.

So, to me that reads like swap limiting needs to be handled specially by the application when setting up these limits for the cgroup(s) of a child. This is also the likely reason why docker has separate --memory-swap and --memory-swappiness options to docker run (man docker-run).
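
For illustration only (not something this PR touches): under the cgroup v2 "unified hierarchy" the quote describes, the memory and swap caps are separate files, so a sketch of pinning a group to the ~5M limit with no swap spillover might look like this; the /sys/fs/cgroup/demo path is hypothetical and assumed to already exist:

```go
// Illustrative sketch (not what this PR does): under cgroup v2, memory
// and swap limits live in separate files, so a group can be capped at
// ~5M of RAM while being denied any swap spillover.
package main

import (
	"log"
	"os"
	"path/filepath"
)

func main() {
	cg := "/sys/fs/cgroup/demo" // hypothetical, pre-created cgroup

	// Cap physical memory at 5 MiB.
	if err := os.WriteFile(filepath.Join(cg, "memory.max"), []byte("5242880"), 0o644); err != nil {
		log.Fatal(err)
	}
	// Forbid swap entirely, so memory.max is enforced against RAM alone.
	if err := os.WriteFile(filepath.Join(cg, "memory.swap.max"), []byte("0"), 0o644); err != nil {
		log.Fatal(err)
	}
}
```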

Without other details (I dunno the original situation), this sounds like a bug that's being uncovered by the OOM-test failure, and something that absolutely should NOT be overlooked/ignored. Though I also reserve the right to be dead wrong, it would be nice if someone tried to convince me 😀

@cevich (Contributor) commented Oct 24, 2017

@mrunalp thx for the chat. So to summarize, this isn't a cri-o bug because...

  1. k8s is responsible for passing down all the cgroup memory-limiting options into cri-o.
  2. The test that's failing is an 'integration test', which means k8s isn't in charge.
  3. There should be a test for this swap limit vs memory limit enforcement in the e2e tests, but there isn't.

This PR is just a quick fix to get CRI-O out the door; later we can consider bringing it back in and/or fixing the test so swap doesn't get in its way.

@runcom runcom merged commit e95f75e into cri-o:master Oct 24, 2017
@cevich (Contributor) commented Oct 24, 2017

@runcom @mrunalp Ugh, okay, just realized: this isn't actually going to fix the problem, because you've reverted one implementation of swap-enabling for an older (but worse) one. Meaning: swap is still turned on. It's probably just accidentally not working because of Origin-CI image caching...I've got an idea for a better fix...

@cevich (Contributor) commented Oct 24, 2017

...opened #1063
