fix(upload_large_folder): avoid infinite retries on non‑retryable commit errors #3642
`upload_large_folder` was getting stuck in an infinite retry loop when commits failed with non-retryable errors, most notably the 403 "storage patterns tripped our internal systems" storage-limit error. The worker logic would keep putting the same files back into the commit queue and continuously rehash/retry them. Fixes #3325 (infinite retry loop in `upload_large_folder` when encountering HTTP 403 storage limit errors).
This PR introduces a small abort mechanism in the large upload scheduler:
- `LargeUploadStatus` now tracks a `fatal_error` and an `aborted` flag.
- The COMMIT worker inspects `HfHubHTTPError` and treats the 403 storage-limit message as a fatal error instead of something to retry.
- When such an error occurs, we mark the upload as aborted, stop re-queuing items, and re-raise the underlying exception once workers are done.
As a result, storage‑limit issues now fail fast with a clear error instead of looping forever, while transient errors still benefit from the existing retry/backoff behaviour.
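The abort mechanism can be sketched roughly as below. This is an illustrative sketch, not the actual `huggingface_hub` implementation: the names `_is_fatal`, `commit_worker`, `finish`, and the `do_commit` callable are hypothetical, and matching on the error text is an assumption based on the 403 message quoted above.

```python
import queue
import threading


class LargeUploadStatus:
    """Shared state for the upload workers (sketch of the fields added by this PR)."""

    def __init__(self):
        self.lock = threading.Lock()
        self.aborted = False      # set once a fatal (non-retryable) error is seen
        self.fatal_error = None   # the exception to re-raise after workers stop


def _is_fatal(exc: Exception) -> bool:
    # Assumption: detect the non-retryable 403 by its error message.
    return "storage patterns tripped our internal systems" in str(exc)


def commit_worker(status: LargeUploadStatus, commit_queue: queue.Queue, do_commit):
    """Process commit items until the queue drains or a fatal error aborts the upload."""
    while not status.aborted:
        try:
            item = commit_queue.get_nowait()
        except queue.Empty:
            return  # queue drained
        try:
            do_commit(item)  # hypothetical commit call
        except Exception as exc:
            if _is_fatal(exc):
                with status.lock:
                    status.aborted = True
                    status.fatal_error = exc
                # Do NOT re-queue: fail fast instead of looping forever.
            else:
                # Transient error: keep the existing retry behaviour
                # (real code would add backoff here).
                commit_queue.put(item)


def finish(status: LargeUploadStatus):
    """Once all workers are done, surface the fatal error to the caller."""
    if status.fatal_error is not None:
        raise status.fatal_error


if __name__ == "__main__":
    status = LargeUploadStatus()
    q = queue.Queue()
    q.put("shard-0")

    def fatal_commit(item):
        raise RuntimeError("403 Forbidden: storage patterns tripped our internal systems")

    commit_worker(status, q, fatal_commit)
    print("aborted:", status.aborted)
```

The key design point is that the worker distinguishes fatal from transient failures at the point where it would otherwise re-queue the item, so transient errors keep their retry path while the storage-limit 403 stops the whole upload.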