Support non-ASCII characters in model file paths #99453

rob-guo · 2023-04-18T20:14:35Z

Fixes #98918

pytorch-bot · 2023-04-18T20:14:38Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/99453

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 7b096fb:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

linux-foundation-easycla · 2023-04-18T20:14:39Z

The committers listed above are authorized under a signed CLA.

✅ login: rob-guo (2315e02)

rob-guo · 2023-04-19T13:17:01Z

First time contributing to pytorch after playing around with it recently. If there's anything I missed in the PR process, please feel free to let me know!

albanD

Sounds good to me!
Suggesting below a small comment to clarify why we do this.

torch/serialization.py

albanD

Thanks

malfet

Thank you for the fix, but this workaround would require storing the entire file in memory before it can be saved to disk. Wouldn't it be better to use FileIO instead of BytesIO?

albanD · 2023-04-25T17:25:40Z

I think the test file encoding is not correct and so the string is not parsed properly.

rob-guo · 2023-04-25T18:02:40Z

It turned out to be a quirk of the with syntax: Python 3.8 and earlier do not support parentheses around multiple context managers (ref).

Updated PR and verified working locally using Python 3.8.16. 🤞

Thanks for the quick insight and feedback all!
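
For readers hitting the same parser error, the difference can be sketched as follows. This is an illustration only, not the test code from this PR; the file names are hypothetical.

```python
import os
import tempfile

# Hypothetical scratch files, used only to have two context managers to open.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "src.bin")
dst_path = os.path.join(tmp_dir, "dst.bin")

with open(src_path, "wb") as f:
    f.write(b"\x00\x01\x02")

# Parenthesized form, rejected with a SyntaxError by the Python 3.8 parser:
#
#     with (
#         open(src_path, "rb") as src,
#         open(dst_path, "wb") as dst,
#     ):
#         ...

# Equivalent form that also parses on Python 3.8: a backslash continuation.
with open(src_path, "rb") as src, \
        open(dst_path, "wb") as dst:
    dst.write(src.read())

with open(dst_path, "rb") as f:
    copied = f.read()
```

Nesting two separate with statements is another option that parses on every supported version.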

rob-guo · 2023-04-25T18:18:49Z

Thank you for the fix, but this workaround would require storing the entire file in memory before it can be saved to disk. Wouldn't it be better to use FileIO instead of BytesIO?

Thanks for the tip! Will update. I initially went with BytesIO because the implementation sketch in #98918 called for storing the data in a stream before writing it to disk, and (CPython's) FileIO doesn't do any buffering.
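
The trade-off between the two stream types can be sketched with the standard library alone. The chunk list and the file name below are made up for illustration and are not taken from the PR.

```python
import io
import os
import tempfile

# Stand-in for the records a serializer would emit one at a time.
chunks = [b"header", b"payload-1", b"payload-2"]

# BytesIO: every chunk accumulates in memory until the caller copies it out.
mem = io.BytesIO()
for chunk in chunks:
    mem.write(chunk)
in_memory_size = len(mem.getvalue())  # the whole payload is resident here

# FileIO: each write() is handed to the OS file descriptor, so Python keeps
# no buffer of its own for the data.
path = os.path.join(tempfile.mkdtemp(), "modèl.pt")  # hypothetical non-ASCII name
with io.FileIO(path, mode="w") as raw:
    for chunk in chunks:
        raw.write(chunk)

on_disk_size = os.path.getsize(path)
```

Both paths produce the same bytes; the difference is only where they sit before reaching the disk.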

albanD

SGTM
Thanks for the quick update

albanD · 2023-04-25T19:03:53Z

@pytorchbot merge

pytorchmergebot · 2023-04-25T19:05:44Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

pytorchmergebot · 2023-04-25T19:21:00Z

Merge failed

Reason: 3 jobs have failed, first few of them are: trunk / macos-12-py3-arm64 / test (default, 1, 3, macos-m1-12), trunk / macos-12-py3-arm64 / test (default, 2, 3, macos-m1-12), trunk / macos-12-py3-arm64-mps / test (default, 1, 1)

Details for Dev Infra team

Raised by workflow job

albanD · 2023-04-25T22:46:22Z

@pytorchbot merge -r

I think the CI issues are not related to your change.

pytorchmergebot · 2023-04-25T22:48:08Z

@pytorchbot successfully started a rebase job. Check the current status here

Co-authored-by: albanD <[email protected]>

pytorchmergebot · 2023-04-25T22:48:14Z

Successfully rebased support-non-ascii-model-file-paths onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout support-non-ascii-model-file-paths && git pull --rebase)

pytorchmergebot · 2023-04-25T22:49:21Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

nagadomi · 2023-04-26T02:16:14Z

torch/serialization.py

+            # PyTorchFileWriter only supports ascii filename.
+            # For filenames with non-ascii characters, we rely on Python
+            # for writing out the file.
+            self.file_stream = io.FileIO(self.name, mode='w')
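
The idea behind the comment in this hunk can be sketched without torch. The helper below is hypothetical and is not the code in torch/serialization.py; it only shows an ASCII check that decides between passing a filename and passing a Python-opened stream.

```python
import io
import os
import tempfile


def open_sink(name):
    """Return the path itself for ASCII names, else an open FileIO stream.

    Hypothetical helper for illustration; not part of PyTorch.
    """
    name = str(name)
    try:
        name.encode("ascii")
    except UnicodeEncodeError:
        # Non-ASCII path: let Python open the file and hand over a stream.
        return io.FileIO(name, mode="w")
    # ASCII-only path: a native writer could take the filename directly.
    return name


tmp_dir = tempfile.mkdtemp()
ascii_sink = open_sink("tensor.pt")
unicode_sink = open_sink(os.path.join(tmp_dir, "tensör.pt"))

sink_kinds = (type(ascii_sink).__name__, type(unicode_sink).__name__)
unicode_sink.close()
```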


mode=w is text mode, so it may corrupt binary files on Windows due to newline conversion.
I think this should be mode='wb'.

rob-guo

I think the mode option is confusing here, but it should be doing the right thing:

>>> import io
>>> f = io.FileIO("tmp", 'w')
>>> f
<_io.FileIO name='tmp' mode='wb' closefd=True>

Note that the internal mode of the FileIO object is wb, which differs from the user-specified mode of w.

Separately, the writelines() method that FileIO inherits from IOBase does not add any newlines, per the linked documentation.

nagadomi

Sorry, I was confusing it with the arguments to open(). You are right.
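
A short check of the behavior discussed in this thread, using only the standard library. The scratch file name is arbitrary.

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "raw.bin")
payload = b"line1\nline2\r\n"

with io.FileIO(path, "w") as raw:
    reported_mode = raw.mode  # FileIO reports a binary mode even for 'w'
    try:
        raw.write("text")  # str is rejected; FileIO takes bytes-like objects
        accepts_str = True
    except TypeError:
        accepts_str = False
    raw.write(payload)

# Read the bytes back untouched to confirm no newline translation happened.
with open(path, "rb") as f:
    round_trip = f.read()
```

Newline translation is a feature of the text layer that open() adds in text mode; FileIO sits below that layer.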

Apfelkuchenbemme · 2023-05-12T09:09:46Z

I was about to open a separate issue about saving with non-ascii characters in the path, but does your fix include the torch.save method?

Tested with torch 2.0.1,
on Windows 11 Education, build 22000.1817,
the example below saves .pt files correctly when torch.save is called with _use_new_zipfile_serialization=False. With _use_new_zipfile_serialization=True, the outcome depends on where the non-ASCII characters are. If only the filename contains them, the file is saved under a garbled name. If the directory contains them, torch.save throws a RuntimeError, regardless of the filename itself:

import torch
from os import mkdir
from os.path import isdir

if __name__ == "__main__":
    tensor = torch.tensor([0.0, 1.0], dtype=torch.float32)

    # Saving both of these works:
    dir_does_work = "Both_Work"
    if not isdir(dir_does_work):
        mkdir(dir_does_work)

    torch.save(tensor, "/".join([dir_does_work, "tensor_False.pt"]), _use_new_zipfile_serialization=False)
    torch.save(tensor, "/".join([dir_does_work, "tensor_True.pt"]), _use_new_zipfile_serialization=True)

    # Saving both of these "works" too, however filename #2 comes out wrong:
    #   tensör_False.pt <= correct
    #   tensÃ¶r_True.pt <= incorrect
    torch.save(tensor, "tensör_False.pt", _use_new_zipfile_serialization=False)
    torch.save(tensor, "tensör_True.pt", _use_new_zipfile_serialization=True)

    # First one will get saved, second one throws:
    #   RuntimeError: Parent directory Only_One_Wörks does not exist.
    dir_does_not_work = "Only_One_Wörks"
    if not isdir(dir_does_not_work):
        mkdir(dir_does_not_work)

    torch.save(tensor, "/".join([dir_does_not_work, "tensor_False.pt"]), _use_new_zipfile_serialization=False)
    torch.save(tensor, "/".join([dir_does_not_work, "tensor_True.pt"]), _use_new_zipfile_serialization=True)
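
The garbled name in the second case matches what one gets when UTF-8 bytes are read back as Windows-1252. That the native writer does this is an assumption, not something confirmed in this thread; the snippet below only reproduces the symptom.

```python
name = "tensör_True.pt"

# Encode as UTF-8, then decode as if the bytes were Windows-1252 (cp1252).
garbled = name.encode("utf-8").decode("cp1252")
```

The single character ö becomes two bytes in UTF-8, and each byte maps to its own character in cp1252, which is why the garbled name is one character longer.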

albanD · 2023-05-12T14:21:09Z

Hi,

Yes, this covers torch.save, but I'm afraid it didn't make it into 2.0.1, as that release was finalized before this change landed.
You can use the nightly to get this fix though!

pytorchbot added the open source label Apr 18, 2023

mikaylagawarecki requested review from malfet and albanD April 25, 2023 15:10

mikaylagawarecki added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Apr 25, 2023

albanD reviewed Apr 25, 2023

albanD approved these changes Apr 25, 2023

malfet approved these changes Apr 25, 2023

albanD approved these changes Apr 25, 2023

albanD added the release notes: python_frontend (python frontend release notes category) and topic: improvements (topic category) labels Apr 25, 2023

pytorch-bot bot added the ciflow/trunk label (Trigger trunk jobs on your pull request) Apr 25, 2023

pytorchmergebot added the merging label Apr 25, 2023

rob-guo and others added 5 commits April 25, 2023 22:48

Support non-ASCII characters in model file paths

c3a675b

Explicitly set UTF-8 source file encoding for Python 3.8

d836ddd

Add clarification comment to torch/serialization.py

67f798d

Co-authored-by: albanD <[email protected]>

support python 3.8 syntax

19727d7

use FileIO instead of BytesIO for streaming files

7b096fb

pytorchmergebot force-pushed the support-non-ascii-model-file-paths branch from e890003 to 7b096fb Compare April 25, 2023 22:48

pytorchmergebot added Merged and removed merging labels Apr 26, 2023

pytorchmergebot closed this in 111358d Apr 26, 2023

nagadomi reviewed Apr 26, 2023
