Skip to content

Conversation

rob-guo
Copy link
Contributor

@rob-guo rob-guo commented Apr 18, 2023

Fixes #98918

@pytorch-bot
Copy link

pytorch-bot bot commented Apr 18, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/99453

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 7b096fb:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Apr 18, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

@rob-guo
Copy link
Contributor Author

rob-guo commented Apr 19, 2023

First time contributing to pytorch after playing around with it recently. If there's anything I missed in the PR process, please feel free to let me know!

@mikaylagawarecki mikaylagawarecki added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Apr 25, 2023
Copy link
Collaborator

@albanD albanD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me!
Suggesting below a small comment to clarify why we do this.

Copy link
Collaborator

@albanD albanD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

Copy link
Contributor

@malfet malfet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the fix, but this workaround would require to store entire file in memory before it can be saved on disk. Wouldn't it be better to use FIleIO instead of BytesIO

@albanD
Copy link
Collaborator

albanD commented Apr 25, 2023

I think the test file encoding is not correct and so the string is not parsed properly.

@rob-guo
Copy link
Contributor Author

rob-guo commented Apr 25, 2023

It turned out to be a quirk of the with syntax for Python 3.8 and prior not supporting parentheses (ref).

Updated PR and verified working locally using Python 3.8.16. 🤞

Thanks for the quick insight and feedback all!

@rob-guo
Copy link
Contributor Author

rob-guo commented Apr 25, 2023

Thank you for the fix, but this workaround would require to store entire file in memory before it can be saved on disk. Wouldn't it be better to use FIleIO instead of BytesIO

Thanks for the tip! Will update. I initially went with BytesIO per the impl sketch in #98918 calling for storing the data in a stream before writing to disk and (CPython's) FileIO doesn't do any buffering.

Copy link
Collaborator

@albanD albanD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM
Thanks for the quick update

@albanD albanD added release notes: python_frontend python frontend release notes category topic: improvements topic category labels Apr 25, 2023
@albanD
Copy link
Collaborator

albanD commented Apr 25, 2023

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Apr 25, 2023
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

@albanD
Copy link
Collaborator

albanD commented Apr 25, 2023

@pytorchbot merge -r

I think the CI issues are not related to your change.

@pytorchmergebot
Copy link
Collaborator

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Successfully rebased support-non-ascii-model-file-paths onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout support-non-ascii-model-file-paths && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the support-non-ascii-model-file-paths branch from e890003 to 7b096fb Compare April 25, 2023 22:48
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

# PyTorchFileWriter only supports ascii filename.
# For filenames with non-ascii characters, we rely on Python
# for writing out the file.
self.file_stream = io.FileIO(self.name, mode='w')

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mode=w is text mode, it may corrupt binary files on Windows due to newline code conversion.
I think this should be mode='wb'.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the mode option is confusing here but should be doing the right thing:

>>> import io
>>> f = io.FileIO("tmp", 'w')
>>> f
<_io.FileIO name='tmp' mode='wb' closefd=True>

Note that the internal mode of the FileIO object is wb, different from the user-specified mode of w.

Separately, The writelines() method that FileIO inherits from BaseIO does not add any newlines per the linked documentation.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I was confusing it with open arguments. You are right.

@Apfelkuchenbemme
Copy link

I was about to open a separate issue about saving with non-ascii characters in the path, but does your fix include the torch.save method?

Tested with torch 2.0.1,
on Windows 11 Education, build 22000.1817,
this example below saves .pt files correctly if torch.save is called with _use_new_zipfile_serialization=False; however, if called with _use_new_zipfile_serialization=True, it will save them with a wrong character in the filename if the directory to the files does not contain non-ascii characters and will straight up throw a RuntimeError if the directory to the files does contain non-ascii characters, regardless of the filename itself:

import torch
from   os.path import isdir
from   os import mkdir

if __name__ == "__main__":
    tensor = torch.tensor([0.0, 1.0], dtype=torch.float32)

    # Saving both of these works:
    dir_does_work = "Both_Work"
    if not isdir(dir_does_work):
        mkdir(dir_does_work)

    torch.save(tensor, "/".join([dir_does_work, "tensor_False.pt"]), _use_new_zipfile_serialization=False)
    torch.save(tensor, "/".join([dir_does_work, "tensor_True.pt"]), _use_new_zipfile_serialization=True)

    # Saving both of these "works" too, however filename #2 comes out wrong:
    #   tensör_False.pt <= correct
    #   tensör_True.pt <= incorrect
    torch.save(tensor, "tensör_False.pt", _use_new_zipfile_serialization=False)
    torch.save(tensor, "tensör_True.pt", _use_new_zipfile_serialization=True)

    # First one will get saved, second one throws:
    #   RuntimeError: Parent directory Only_One_Wörks does not exist.
    dir_does_not_work = "Only_One_Wörks"
    if not isdir(dir_does_not_work):
        mkdir(dir_does_not_work)

    torch.save(tensor, "/".join([dir_does_not_work, "tensor_False.pt"]), _use_new_zipfile_serialization=False)
    torch.save(tensor, "/".join([dir_does_not_work, "tensor_True.pt"]), _use_new_zipfile_serialization=True)

@albanD
Copy link
Collaborator

albanD commented May 12, 2023

Hi,

Yes this covers torch.save but didn't make it to 2.0.1 I'm afraid as it was finalized before we did this.
You can use the nightly to get this fix though!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/trunk Trigger trunk jobs on your pull request Merged open source release notes: python_frontend python frontend release notes category topic: improvements topic category triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pth name is garbled when there is Chinese character
8 participants