[DataLoader] Adding StackDataset
#101338
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/101338
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit c5bbb0d.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
torch/utils/data/dataset.py
Outdated
```
@@ -206,6 +208,54 @@ def __len__(self):
        return self.tensors[0].size(0)


class StackDataset(Dataset[Union[Tuple[T_co, ...], Dict[str, T_co]]]):
```
Another thing to call out: it might be better to define a TypeVar that contains either `Tuple[T_co, ...]` or `Dict[str, T_co]`.
Not sure I get the idea. Do you mean:

```python
T_td = Union[Tuple[T_co, ...], Dict[str, T_co]]

class StackDataset(Dataset[T_td]):
```

instead of:

```python
class StackDataset(Dataset[Union[Tuple[T_co, ...], Dict[str, T_co]]]):
```
No. Let's do something like `T_stack = TypeVar('T_stack', Tuple[T_co, ...], Dict[str, T_co])`. Using `Union` means the output can be either `Tuple` or `Dict`. But using a `TypeVar` is template-style: it only allows one type per dataset instance.
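For reference, a minimal sketch of the constrained-TypeVar approach being suggested; the helper aliases `T_tuple`/`T_dict` are illustrative, and the exact declarations in the merged patch may differ:

```python
from typing import Dict, Tuple, TypeVar

from torch.utils.data import Dataset

T_co = TypeVar('T_co', covariant=True)

T_tuple = Tuple[T_co, ...]  # tuple-style sample
T_dict = Dict[str, T_co]    # dict-style sample

# A constrained TypeVar: each StackDataset instance is parameterized by
# exactly one of the two forms, unlike a Union, which would allow either
# form at every call site.
T_stack = TypeVar('T_stack', T_tuple, T_dict)


class StackDataset(Dataset[T_stack]):
    ...
```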
ok. added.
torch/utils/data/dataset.py
Outdated
```python
tmp = list(kwargs.values())
if any(len(tmp[0]) != len(dataset) for dataset in tmp):  # type: ignore[arg-type]
```
nit: compute the length once and reuse it:

```python
self._length = ...
if any(self._length != ...
```
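As an illustration of this nit, here is a sketch of how the constructor might cache the length once and reuse it for the mismatch check. The structure is inferred from the diff above; names, error messages, and details of the merged code may differ:

```python
from typing import Dict, Tuple, Union

from torch.utils.data import Dataset


class StackDataset(Dataset):
    # Holds either the positional datasets (tuple-style) or the keyword
    # datasets (dict-style), never both.
    datasets: Union[Tuple[Dataset, ...], Dict[str, Dataset]]

    def __init__(self, *args: Dataset, **kwargs: Dataset) -> None:
        if args:
            if kwargs:
                raise ValueError("Pass either positional or keyword datasets, not both")
            self._length = len(args[0])  # type: ignore[arg-type]
            if any(self._length != len(dataset) for dataset in args):  # type: ignore[arg-type]
                raise ValueError("Size mismatch between datasets")
            self.datasets = args
        elif kwargs:
            tmp = list(kwargs.values())
            # Cache the reference length once instead of recomputing
            # len(tmp[0]) on every iteration of the generator expression.
            self._length = len(tmp[0])  # type: ignore[arg-type]
            if any(self._length != len(dataset) for dataset in tmp):  # type: ignore[arg-type]
                raise ValueError("Size mismatch between datasets")
            self.datasets = kwargs
        else:
            raise ValueError("At least one dataset should be passed")

    def __len__(self) -> int:
        return self._length
```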
```python
        else:
            raise ValueError("At least one dataset should be passed")

    def __getitem__(self, index):
```
This needs a type annotation using the TypeVar.
But the output is already typed via the generic `Dataset` base class.
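Continuing the constructor sketch above, `__getitem__` could look roughly like the following for both call styles, assuming `self.datasets` holds either the positional tuple or the keyword dict set up in `__init__` (illustrative only, not the exact merged code):

```python
    def __getitem__(self, index):
        if isinstance(self.datasets, dict):
            # dict-style: keys are the keyword-argument names passed to __init__
            return {k: dataset[index] for k, dataset in self.datasets.items()}
        # tuple-style: one element per positional dataset, in order
        return tuple(dataset[index] for dataset in self.datasets)
```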
Overall LGTM. I only have one question about this use case: do we expect all datasets to have the same size, or should we just support the smallest length, like TorchData's `zip` operation?

cc: @NivekT
Please fix the `mypy` linting error.
I think we should be good to go. Thanks!
Looks like this requires all Datasets that are passed in to have the same length and keys, which seems fine with me.
@pytorchbot merge -r
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased.
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 mandatory check(s) failed. Dig deeper by viewing the failures on hud.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
New dataset class added by #101338 was missed in the documentation. Pull Request resolved: #101927. Approved by: https://github.com/kit1980
Torch's list of wrapping datasets currently has `TensorDataset`, `ConcatDataset`, and `ChainDataset`. `TensorDataset` is useful for stacking sets of tensors but can't work with objects that lack a `.size()` method. This PR proposes `StackDataset`, similar to `TensorDataset` but for the general case, like `ConcatDataset`. Possible uses of `StackDataset` include multimodal networks with different inputs such as image+text, or stacking non-tensor input together with the property to predict. Pull Request resolved: #101338. Approved by: https://github.com/ejguan, https://github.com/NivekT
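A hedged usage sketch of the new class. It assumes `StackDataset` is exported from `torch.utils.data` like the other wrapping datasets, and that plain tensors and lists are accepted because only `__getitem__`/`__len__` are required of each input; treat both as illustrative assumptions rather than documented behavior:

```python
import torch
from torch.utils.data import DataLoader, StackDataset

# Map-style inputs: anything with __getitem__ and __len__ of the same length.
images = torch.randn(4, 3, 32, 32)             # image-like tensors
captions = [f"caption {i}" for i in range(4)]  # non-tensor input
labels = torch.randint(0, 2, (4,))

# Tuple-style: each sample is (image, caption, label)
tuple_ds = StackDataset(images, captions, labels)

# Dict-style: each sample is {"image": ..., "caption": ..., "label": ...}
dict_ds = StackDataset(image=images, caption=captions, label=labels)

sample = dict_ds[0]
print(sample["caption"], sample["label"])

# Works with DataLoader like any other map-style dataset.
loader = DataLoader(dict_ds, batch_size=2)
```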