Conversation

@PredictiveManish PredictiveManish commented Dec 22, 2025

Details:
`upload_large_folder` was getting stuck in an infinite retry loop when a commit failed with a non-retryable error, most notably the 403 "storage patterns tripped our internal systems" storage-limit error. The worker logic kept putting the same files back into the commit queue and rehashing/retrying them indefinitely. Fixes #3325.

  • Fixes #3325: Infinite retry loop in upload_large_folder when encountering HTTP 403 storage limit errors.

This PR introduces a small abort mechanism in the large-upload scheduler:

  • LargeUploadStatus now tracks a fatal_error and an aborted flag.

  • The COMMIT worker inspects HfHubHTTPError and treats the 403 storage‑limit message as a fatal error instead of something to retry.

  • When such an error occurs, the upload is marked as aborted, items are no longer re-queued, and the underlying exception is re-raised once all workers have finished.

As a result, storage‑limit issues now fail fast with a clear error instead of looping forever, while transient errors still benefit from the existing retry/backoff behaviour.
