Collaborator

@mattjj mattjj commented Oct 27, 2022

The idea here is that if you know what data structure(s) a leaked tracer is in, it might help you understand what the side-effect was that caused the problem.

With this code:

import jax
jax.config.update('jax_check_tracer_leaks', True)

class A:
  def __init__(self, dct):
    self.dct = dct

a = A({})

def sketch(x):
  a.dct['hi'] = [lambda: x]
  return x

jax.vmap(sketch)(jax.numpy.arange(3))

instead of just this error:

Traced<ShapedArray(int32[])>with<BatchTrace(level=1/0)> with
  val = DeviceArray([0, 1, 2], dtype=int32)
  batch_dim = 0
This Tracer was created on line /usr/local/google/home/mattjj/packages/jax/qiao26.py:14 (<module>)

we now get this error:

Exception: Leaked trace MainTrace(1,BatchTrace). Leaked tracer(s):

Traced<ShapedArray(int32[])>with<BatchTrace(level=1/0)> with
  val = DeviceArray([0, 1, 2], dtype=int32)
  batch_dim = 0
This BatchTracer with object id 140443431328480 was created on line:
  /usr/local/google/home/mattjj/packages/jax/qiao26.py:14 (<module>)
<BatchTracer 140443431328480> is referred to by <function 140443431341072> closed-over variable x
<function 140443431341072> is referred to by <list 140443431333184>[0]
<list 140443431333184> is referred to by <dict 140445269059648>['hi']
<dict 140445269059648> is referred to by <A 140445263828256>.dct
<A 140445263828256> is referred to by __main__.a

That is, when the leak checker finds (via the gc) a Tracer that has outlived its trace, we attempt to walk back up that Tracer's referrer chain and describe the objects we encounter (including the index or attribute name under which each reference is held). In the example code, that might help us notice that the global object a, and our update to a.dct, are the problem.
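
For illustration only (this is not the code in this PR): a minimal sketch of that referrer walk, built on gc.get_referrers and handling only dict, list, and tuple parents. The names Leaked, build_registry, container_parents, describe_link, and why_alive are hypothetical, and the sketch assumes CPython's gc module.

```python
import gc

class Leaked:
  """Hypothetical stand-in for a Tracer that outlived its trace."""

def build_registry():
  # Build the containers inside a function so that no global name refers
  # directly to the Leaked instance or to the inner list.
  return {'hi': [Leaked()]}

def container_parents(obj):
  # Keep only the container types this sketch can describe. This also drops
  # frames and other internal referrers that gc.get_referrers may report.
  return [p for p in gc.get_referrers(obj) if type(p) in (dict, list, tuple)]

def describe_link(parent, child):
  """Describe `parent`, with the key or index under which it holds `child`."""
  label = f'<{type(parent).__name__} {id(parent)}>'
  if isinstance(parent, dict):
    for key, value in parent.items():
      if value is child:
        return f'{label}[{key!r}]'
  else:
    for index, value in enumerate(parent):
      if value is child:
        return f'{label}[{index}]'
  return label

def why_alive(obj, max_depth=10):
  """Follow one chain of referrers from `obj`, returning one line per hop."""
  lines = []
  child = obj
  for _ in range(max_depth):
    parents = container_parents(child)
    if not parents:
      break
    parent = parents[0]  # arbitrary choice when there are several referrers
    del parents  # so this list is not reported as a referrer on the next hop
    lines.append(f'<{type(child).__name__} {id(child)}> is referred to by '
                 + describe_link(parent, child))
    child = parent
  return lines

registry = build_registry()
for line in why_alive(registry['hi'][0], max_depth=2):
  print(line)
```

The depth is capped at 2 here because, past the outer dict, the chain depends on how the script is run (module globals and so on). A real checker also has to describe instances, modules, and closures, which this sketch leaves out.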

We should also add info about which transformation or primitive the Tracer corresponded to (i.e. point to the jax.vmap call on the last line of the example file). That may be in a follow-up PR though.

@mattjj mattjj requested a review from LenaMartens October 27, 2022 21:19
@mattjj mattjj force-pushed the leak-checker-improvements branch 4 times, most recently from fe6d68c to 69c696f Compare October 28, 2022 01:25
@mattjj mattjj self-assigned this Oct 28, 2022
@mattjj mattjj changed the title Leak checker improvements leak checker: explain why a tracer is alive by printing a reference chain Oct 28, 2022
@mattjj mattjj force-pushed the leak-checker-improvements branch from 69c696f to b26ff93 Compare October 28, 2022 16:47
Contributor

@LenaMartens LenaMartens left a comment

nice! cool error message

        getattr(parents(parent)[0], '__dict__', None) is parents(child)[0]):
      parent = parents(parent)[0]
    elif type(parent) is cell_type:
      parent = parents(parents(parent)[0])[0]
Contributor

could you add a comment why this is needed for closures? it skips the closure and goes to the function instead?

Collaborator Author

Yes, that's the idea: a little shortcut. Without this we get this printed:

<DynamicJaxprTracer 139764213674128> is referred to by <cell 139764213646192>
<cell 139764213646192> is referred to by <tuple 139764213646096>[0]
<tuple 139764213646096> is referred to by <function 139764213640064>
<function 139764213640064> is referred to by <dict 139766096500800>['hi']
<dict 139766096500800> is referred to by <A 139766089094432>.dct
<A 139766089094432> is referred to by __main__.a

but with it we get:

<DynamicJaxprTracer 140139354034480> is referred to by <function 140139405110576> (foo) closed-over variable x
<function 140139405110576> is referred to by <dict 140141242837056>['hi']
<dict 140141242837056> is referred to by <A 140141237605664>.dct
<A 140141237605664> is referred to by __main__.a

I'll add a comment.

Collaborator Author

    # For namespaces (like modules and class instances) and closures, the
    # references may form a simple chain: e.g. instance refers to its own
    # __dict__ which refers to child, or function refers to its __closure__
    # which refers to cells which refer to child. In these cases, we can provide
    # a more intuitive description by collapsing the chain into a single
    # parent->child jump. We do that by setting `parent` here to be a
    # grandparent (or great-grandparent) of `child`, and then handling that case
    # in _why_alive_container_info.
    # To prevent this collapsing behavior, just comment out this code block.

    elif type(parent) is cell_type:
      parent = parents(parents(parent)[0])[0]
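
As a small illustration of the closure case described in that comment (a sketch, not the PR's code): once the walk has jumped from the cell to the function, the "closed-over variable x" label can be recovered by pairing the function's free-variable names with its closure cells. The helper names closed_over_name and make_closure are hypothetical.

```python
def closed_over_name(fn, value):
  """Return the name of the free variable of `fn` whose cell holds `value`."""
  closure = fn.__closure__ or ()
  # co_freevars and __closure__ are parallel: the i-th name labels the i-th cell.
  for name, cell in zip(fn.__code__.co_freevars, closure):
    try:
      contents = cell.cell_contents
    except ValueError:  # the cell is empty (variable not yet assigned)
      continue
    if contents is value:
      return name
  return None

def make_closure(x):
  return lambda: x

marker = object()
fn = make_closure(marker)
print(closed_over_name(fn, marker))  # x
```

Comparing with `is` rather than `==` matters here, since the goal is to find the cell holding this exact object and not one that merely compares equal.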

line = f'<{type(child).__name__} {id(child)}> is referred to by '
Contributor

for my understanding: why is the object ID useful when debugging? IIUC there will be no cycles in this trace, or might there be cycles?

Collaborator Author

There shouldn't be cycles in the printed result.

I found printing the id useful because I tend to cross-reference ids against print statements I put elsewhere or against interactive pdb spelunking. Also it's a (meaningful) way to distinguish different instances of e.g. a dict.

@google-ml-butler google-ml-butler bot added kokoro:force-run pull ready Ready for copybara import and testing labels Oct 28, 2022
@mattjj mattjj force-pushed the leak-checker-improvements branch 3 times, most recently from 67e27f3 to 7178528 Compare October 28, 2022 21:12
@mattjj mattjj force-pushed the leak-checker-improvements branch from 7178528 to 6ebf44a Compare October 28, 2022 21:12
@copybara-service copybara-service bot merged commit 8dea82e into jax-ml:main Oct 28, 2022
@mattjj mattjj deleted the leak-checker-improvements branch October 31, 2022 12:49