Skip to content

Conversation

@AkshitaB
Copy link
Contributor

@AkshitaB AkshitaB commented Sep 22, 2022

Changes proposed in this pull request:

  • Adds the following abstractions: RemoteClient, RemoteStepCache, RemoteWorkspace. The RemoteClient models the beaker client (in fact, for beaker, it's mostly a thin wrapper right now).
  • GCS and Beaker step caches and workspaces inherit from above.
  • WandBWorkspace is too different to easily fit into the RemoteWorkspace abstractions, leaving that refactoring for a later date.

Cleanup:

  • Address TODOs
  • Test with larger config.
  • Test BeakerExecutor with GCSWorkspace.
  • Add docstrings

Testing with BeakerExecutor:

workspace:
  type: "gs"
  workspace: "allennlp-tango-bucket"

@AkshitaB AkshitaB changed the title GCSWorkspace preliminary work GCSWorkspace Dec 6, 2022
@AkshitaB AkshitaB requested review from dirkgr and epwalsh and removed request for epwalsh December 6, 2022 07:25
Copy link
Member

@epwalsh epwalsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the new abstractions. My main question is about remote locking. It seems like GCSFs does have the notion of "committing" a file, similar to Beaker (https://gcsfs.readthedocs.io/en/latest/api.html#gcsfs.core.GCSFile.commit), but it doesn't look like we use that feature? From what I could tell, the "locking" mechanism in GCSClient works by checking for the existence of some placeholder file? Does gcs_fs.touch(placeholder_file) fail if another process beats it by touching that file first?

General comments:

  • I think tango/integrations/(beaker|gs)/common.py and ...util.py can be merged into the same file.

It stores the results of steps on some RemoteWorkspace.
It also keeps a limited in-memory cache as well as a local backup on disk, so fetching a
step's resulting subsequent times should be fast.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
step's resulting subsequent times should be fast.
step's result subsequent times should be fast.

@AkshitaB
Copy link
Contributor Author

AkshitaB commented Dec 7, 2022

My main question is about remote locking. It seems like GCSFs does have the notion of "committing" a file, similar to Beaker (https://gcsfs.readthedocs.io/en/latest/api.html#gcsfs.core.GCSFile.commit), but it doesn't look like we use that feature?

That commit() works for files rather than folders. But perhaps I can utilize it recursively (Looking into it). From what I understand, beaker also performs commit at the folder level, right?

From what I could tell, the "locking" mechanism in GCSClient works by checking for the existence of some placeholder file?

The placeholder file is for creation of any folder, since you can't create an empty folder in gcs.

  • I think tango/integrations/(beaker|gs)/common.py and ...util.py can be merged into the same file.

Agreed.

@dirkgr
Copy link
Member

dirkgr commented Dec 7, 2022

I have not looked at the code, just your comments on here. In general, lock files don't work on a datastore like GCS, because they are eventually consistent. Different machines might see different states for a while. But maybe GCS has some magic to make sure states are consistent across machines (and operations like creating a file appear atomic)?

@dirkgr
Copy link
Member

dirkgr commented Dec 7, 2022

Oh wow! https://cloud.google.com/storage/docs/consistency
They do promise consistency for file listings. So lock files should work. That's way better than S3!

@AkshitaB AkshitaB requested a review from epwalsh January 27, 2023 22:50
Copy link
Member

@epwalsh epwalsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is cool!! And fast!

My only comments are about naming. I feel like we should call this integration "gcs" instead of "gs"? Or possibly "gc" for Google Cloud, since we're not just using GCS.. we're also using Datastore.


def __init__(
self,
workspace: str,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it make more sense to call this "bucket"?

@AkshitaB
Copy link
Contributor Author

My only comments are about naming. I feel like we should call this integration "gcs" instead of "gs"? Or possibly "gc" for Google Cloud, since we're not just using GCS.. we're also using Datastore.

@epwalsh My main reason for calling it "gs" rather than "gcs" was to keep things uniform with gsutil. "gs://workspace-bucket" is a valid location that people can copy-paste and use with cached-path or gsutil directly.

@epwalsh
Copy link
Member

epwalsh commented Jan 31, 2023

Fair enough!

Copy link
Member

@dirkgr dirkgr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's entirely possible that I missed it and it's already there, but we need retries on all network calls, GCS and Datastore.

@AkshitaB
Copy link
Contributor Author

AkshitaB commented Feb 1, 2023

It's entirely possible that I missed it and it's already there, but we need retries on all network calls, GCS and Datastore.

Cloud storage and Datastore both have retries already implemented. The default configuration is here . Do you mean that we should have more control over the retries? We can customize them.

@dirkgr
Copy link
Member

dirkgr commented Feb 1, 2023

Cloud storage and Datastore both have retries already implemented. The default configuration is here . Do you mean that we should have more control over the retries? We can customize them.

No, if we already have retries set up, this is fine. I like their default settings, too.

@AkshitaB AkshitaB requested a review from dirkgr February 2, 2023 01:02
Copy link
Member

@dirkgr dirkgr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is one big feature!

@AkshitaB AkshitaB merged commit 379095d into main Feb 8, 2023
@AkshitaB AkshitaB deleted the gcs-work branch February 8, 2023 03:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants