Skip to content

Conversation

@rayanht
Copy link
Contributor

@rayanht rayanht commented Sep 12, 2022

Description

This diff enables torchrl to seamlessly use the MLFlow Tracking API through its internal Logger API. These changes are consistent with previous integrations such as the one with W&B.

A test suite for the new integration has been included in the diff.

Motivation and Context

close #395

Types of changes

What types of changes does your code introduce? Remove all that do not apply:

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds core functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (update in the documentation)
  • Example (update in the folder of examples)

Checklist

Go over all the following points, and put an x in all the boxes that apply.
If you are unsure about any of these, don't hesitate to ask. We are here to help!

  • I have read the CONTRIBUTION guide (required)
  • My change requires a change to the documentation.
  • I have updated the tests accordingly (required for a bug fix or a new feature).
  • I have updated the documentation accordingly.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Sep 12, 2022
@rayanht
Copy link
Contributor Author

rayanht commented Sep 12, 2022

Just realised I need to update the examples to add a case for the MLFlow logger

@vmoens vmoens changed the title feature(logging): implement MLFlow logging integration [Logging]: implement MLFlow logging integration Sep 13, 2022
@vmoens vmoens added the enhancement New feature or request label Sep 13, 2022
@rayanht
Copy link
Contributor Author

rayanht commented Sep 14, 2022

Alright so running:

$ python ppo.py logger=mlflow

We get the following generated directory:

Screenshot 2022-09-14 at 14 34 30

Which we can then point the mlflow UI to using:

$ mlflow ui --backend-store-uri ppo_logging

Screenshot 2022-09-14 at 14 36 17

Screenshot 2022-09-14 at 14 36 37

Unfortunately MLFlow doesn't support the preview of video files directly in the browser but there's nothing we can do about that.

Copy link
Collaborator

@vmoens vmoens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely wonderful, thanks for this!
Before landing: Can we add mlflow to the packages installed in the CI (linux and linux_stable)?

EDIT: @rayanht I did it, waiting for the tests to pass

@rayanht
Copy link
Contributor Author

rayanht commented Sep 19, 2022

I'm investigating the test failure, might have something to do with how directories are managed in my implementation.

@vmoens
Copy link
Collaborator

vmoens commented Sep 21, 2022

I'm investigating the test failure, might have something to do with how directories are managed in my implementation.

Let me know if I can help!

@rayanht
Copy link
Contributor Author

rayanht commented Sep 21, 2022

@vmoens I can't repro the test failure locally, is there any way we can get get the full log on Circle CI? rn I'm not even sure what package is failing to import because the stacktrace is cut off:

FAILED test/test_loggers.py::TestMLFlowLogger::test_log_video[None] - ImportE... <---- 
FAILED test/test_loggers.py::TestMLFlowLogger::test_log_video[steps1] - Excep...
=== 2 failed, 8079 passed, 170 skipped, 7914 warnings in 1349.69s (0:22:29) ====

@vmoens
Copy link
Collaborator

vmoens commented Sep 21, 2022

There are 2 errors

test/test_loggers.py::TestMLFlowLogger::test_log_video[None] FAILED      [ 38%]
____________________ TestMLFlowLogger.test_log_video[None] _____________________
Traceback (most recent call last):
  File "/home/circleci/project/test/test_loggers.py", line 148, in test_log_video
    logger.log_video(
  File "/home/circleci/project/torchrl/trainers/loggers/mlflow.py", line 109, in log_video
    torchvision.io.write_video(filename=f.name, video_array=video, fps=fps)
  File "/home/circleci/project/env/lib/python3.8/site-packages/torchvision/io/video.py", line 83, in write_video
    _check_av_available()
  File "/home/circleci/project/env/lib/python3.8/site-packages/torchvision/io/video.py", line 42, in _check_av_available
    raise av
ImportError: PyAV is not installed, and is necessary for the video operations in torchvision.
See https://github.com/mikeboers/PyAV#installation for instructions on how to
install PyAV on your system.

This can be solved by adding av to the pip section of environment.yaml in the .circleci folders (not in the one with opt deps of course).

test/test_loggers.py::TestMLFlowLogger::test_log_video[steps1] FAILED    [ 38%]
___________________ TestMLFlowLogger.test_log_video[steps1] ____________________
Traceback (most recent call last):
  File "/home/circleci/project/test/test_loggers.py", line 139, in test_log_video
    logger = MLFlowLogger(exp_name=exp_name, tracking_uri=log_dir_uri)
  File "/home/circleci/project/torchrl/trainers/loggers/mlflow.py", line 56, in __init__
    super().__init__(exp_name=exp_name, log_dir=tracking_uri)
  File "/home/circleci/project/torchrl/trainers/loggers/common.py", line 23, in __init__
    self.experiment = self._create_experiment()
  File "/home/circleci/project/torchrl/trainers/loggers/mlflow.py", line 69, in _create_experiment
    return mlflow.start_run(experiment_id=self.id)
  File "/home/circleci/project/env/lib/python3.8/site-packages/mlflow/tracking/fluent.py", line 271, in start_run
    raise Exception(
Exception: Run with UUID 8d99f1ba38d6419584647c3ef8f21026 is already active. To start a new run, first end the current run with mlflow.end_run(). To start a nested run, call start_run with nested=True

This is probably due to a failure to close the session in the previous test.
What we should do is use pytest to make sure that mlflow.end_run() is called at the end of the function, see this for an example or this.

@rayanht
Copy link
Contributor Author

rayanht commented Sep 21, 2022

Cool will amend, thanks! Out of curiosity did you manage to repro locally or did I miss something on the Circle CI UI?

@vmoens
Copy link
Collaborator

vmoens commented Sep 21, 2022

I just looked at the errors, I did not try it locally

@rayanht rayanht force-pushed the feature/mlflow_tracking branch from 914aa99 to 3a0d4a8 Compare September 21, 2022 16:35
@vmoens vmoens merged commit 2b9fbe1 into pytorch:main Sep 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature Request] Integrate MLFlow in the loggers backends

3 participants