Skip to content

Conversation

Flamefire
Copy link
Collaborator

Remove an optimization potentially using a transposed matrix as input for linalg_lu_factor_ex_out.

Depending on whether the input memory layout is contiguous or not this may lead to slightly different results which may cause larger differences in subsequent steps ultimately leading to test failures in e.g. test_vmapvjp_linalg_tensorsolve_cpu_float32 & test_vmapvjpvjp_linalg_tensorsolve_cpu_float32.

The intended optimization no longer applies after 59bc76f so this code can be removed too resolving the accuracy issues observed in those tests.

With this change the code path used for the "regular" and "vmap" cases are identical: A batched tensor is iterated over in the batch dimension in lu_solve and lu_factor
Prior to this it might not be the case as either tensor would/could have been non-contiguous leading to using a transposed tensor for the LU factorization instead.

So the (CPU) results should now be identical.

Fixes #151440
Fixes #114868

Copy link

pytorch-bot bot commented Jun 3, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/154983

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit 80295cd with merge base 81dbeb0 (image):

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@Flamefire Flamefire changed the title Solve accuracy fix Avoid differing results in linalg.(tensor_)solve Jun 3, 2025
@Flamefire Flamefire added the release notes: nn release notes category label Jun 3, 2025
@lezcano
Copy link
Collaborator

lezcano commented Jun 3, 2025

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 3, 2025
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@lezcano
Copy link
Collaborator

lezcano commented Jun 3, 2025

the mps failures are real. @malfet is there anything that needs changing in the mps path?

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 3 jobs have failed, first few of them are: trunk / macos-py3-arm64 / test (mps, 1, 1, macos-m2-15), trunk / macos-py3-arm64 / test (mps, 1, 1, macos-m1-14), trunk / macos-py3-arm64 / test (mps, 1, 1, macos-m1-13)

Details for Dev Infra team Raised by workflow job

Copy link
Collaborator

@albanD albanD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.
Is the MPS failure still a problem? Can you rebase to get fresh CI signal?

@Flamefire
Copy link
Collaborator Author

@pytorchbot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Successfully rebased solve-accuracy-fix onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout solve-accuracy-fix && git pull --rebase)

Copy link
Contributor

github-actions bot commented Sep 8, 2025

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Sep 8, 2025
@Flamefire
Copy link
Collaborator Author

@pytorchbot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Successfully rebased solve-accuracy-fix onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout solve-accuracy-fix && git pull --rebase)

@Flamefire
Copy link
Collaborator Author

Looks like the MPS implementation either differs or has a workaround to account for the potential difference in the current implementation.

Is there anyone that knows enough about the MPS backend to clarify or investigate this?

@github-actions github-actions bot closed this Oct 8, 2025
@Flamefire Flamefire reopened this Oct 9, 2025
@Flamefire
Copy link
Collaborator Author

@pytorchbot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

Remove an optimization potentially using a transposed matrix as input
for `linalg_lu_factor_ex_out`.

Depending on whether the input memory layout is contiguous or not this
may lead to slightly different results which may cause larger
differences in subsequent steps ultimately leading to test failures in
e.g. `test_vmapvjp_linalg_tensorsolve_cpu_float32` & `test_vmapvjpvjp_linalg_tensorsolve_cpu_float32`.

The intended optimization no longer applies after 59bc76f so this code
can be removed too resolving the accuracy issues observed in those tests.

Fixes #151440
@pytorchmergebot
Copy link
Collaborator

Successfully rebased solve-accuracy-fix onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout solve-accuracy-fix && git pull --rebase)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request open source release notes: nn release notes category Stale

Projects

None yet

5 participants