Labels: bug, triage
Bug Description
Bug Report: Retry policy is completely broken - delay ignored AND attempt counter reset
Description
There are two bugs in the workflow retry mechanism that together cause infinite immediate retries:
- Bug #1: The `delay` parameter from the retry policy is never applied - retries happen immediately
- Bug #2: The `attempts` counter is discarded when processing retry events - `maximum_attempts` is never reached
Environment
- llama-index-core: 0.14.10
- llama-index-workflows: 2.11.5
- Python version: 3.12
Expected Behavior
When a step fails, the retry should wait for the specified delay (5 seconds in this case) before retrying.
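Under those semantics (a constant-delay policy, one delay between each pair of consecutive attempts), the expected wall-clock cost of exhausting retries is easy to state. This is an illustrative helper, not part of llama-index:

```python
def expected_retry_runtime(delay: float, maximum_attempts: int) -> float:
    """Total time spent waiting for a constant-delay retry policy:
    one delay between each pair of consecutive attempts."""
    return delay * (maximum_attempts - 1)

# With delay=5 and maximum_attempts=3, the step should run at
# t=0, t=5, and t=10 before giving up.
print(expected_retry_runtime(5, 3))  # -> 10
```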
Actual Behavior
The retry mechanism is completely broken:
- Retries happen immediately without any delay (Bug #1)
- The attempt counter resets to 0 on each retry, so `maximum_attempts` is never reached (Bug #2)
- Together these create an infinite loop of immediate retries that runs until the process is killed
- With `num_workers=4` (the default), this rapidly overwhelms resources
Severity: Critical
Any step with a retry policy that throws an exception will cause an infinite loop.
Root Cause
There are two bugs in workflows/runtime/control_loop.py:
Bug #1: Delay not passed to queue_tick (line ~176)
```python
if isinstance(command, CommandQueueEvent):
    self.queue_tick(
        TickAddEvent(
            event=command.event,
            step_name=command.step_name,
            attempts=command.attempts,
            first_attempt_at=command.first_attempt_at,
        )
    )  # <-- Missing: delay=command.delay
```
The `CommandQueueEvent.delay` is set correctly when scheduling a retry (line ~427), but it's never passed to `queue_tick()`.
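Passing `delay` only helps if `queue_tick` actually defers the enqueue. The usual asyncio pattern for that - sketched here with a toy broker and hypothetical names, not the real `control_loop.py` code - is `loop.call_later`:

```python
import asyncio

class TinyBroker:
    """Toy stand-in for the control loop's tick queue (hypothetical names)."""
    def __init__(self) -> None:
        self.ticks: asyncio.Queue = asyncio.Queue()

    def queue_tick(self, tick: object, delay: float = 0.0) -> None:
        # A zero delay enqueues immediately; otherwise defer the enqueue.
        if delay <= 0:
            self.ticks.put_nowait(tick)
        else:
            asyncio.get_running_loop().call_later(delay, self.ticks.put_nowait, tick)

async def demo() -> float:
    broker = TinyBroker()
    start = asyncio.get_running_loop().time()
    broker.queue_tick("retry-tick", delay=0.2)
    await broker.ticks.get()  # only becomes available after ~0.2s
    return asyncio.get_running_loop().time() - start

print(f"{asyncio.run(demo()):.1f}")  # roughly 0.2
```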
Bug #2: Attempts counter discarded (line ~617-618)
```python
def _process_add_event_tick(tick: TickAddEvent, init: BrokerState, now_seconds: float):
    # ...
    subcommands = _add_or_enqueue_event(
        EventAttempt(event=tick.event),  # <-- BUG: attempts & first_attempt_at DISCARDED!
        step_name,
        state.workers[step_name],
        now_seconds,
    )
```
The `TickAddEvent` contains `attempts` and `first_attempt_at`, but when creating `EventAttempt`, only `event` is passed. The attempt counter resets to 0 on every retry, so `maximum_attempts` is never reached.
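The effect of discarding `tick.attempts` can be shown in isolation with a toy model of the broker (hypothetical dataclasses, not the real `workflows` internals): when each retry rebuilds the attempt record from the event alone, the counter restarts at 0 and a `maximum_attempts` check can never fire.

```python
from dataclasses import dataclass

@dataclass
class Tick:            # toy stand-in for TickAddEvent
    event: str
    attempts: int = 0

@dataclass
class EventAttempt:    # toy stand-in for the broker's attempt record
    event: str
    attempts: int = 0

def retries_until_exhausted(preserve_attempts: bool, maximum_attempts: int = 3,
                            safety_cap: int = 10) -> int:
    """Simulate the retry loop; return the number of step executions."""
    tick = Tick(event="start")
    runs = 0
    while runs < safety_cap:
        # Buggy path mirrors the report: only `event` is copied over.
        attempt = EventAttempt(event=tick.event,
                               attempts=tick.attempts if preserve_attempts else 0)
        runs += 1
        if attempt.attempts + 1 >= maximum_attempts:
            break  # retries exhausted
        # Step fails -> schedule a retry carrying the incremented counter.
        tick = Tick(event=tick.event, attempts=attempt.attempts + 1)
    return runs

print(retries_until_exhausted(preserve_attempts=True))   # -> 3
print(retries_until_exhausted(preserve_attempts=False))  # -> 10 (hits the safety cap)
```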
Combined Effect
- Step fails → retry scheduled with `attempts=1`, `delay=5`
- `delay` is ignored (Bug #1) → retry happens immediately
- `attempts` is discarded (Bug #2) → counter resets to 0
- Step executes with `attempts=0`, fails → retry scheduled with `attempts=1`
- Infinite loop of immediate retries
Suggested Fixes
Fix for Bug #1:
```python
if isinstance(command, CommandQueueEvent):
    self.queue_tick(
        TickAddEvent(
            event=command.event,
            step_name=command.step_name,
            attempts=command.attempts,
            first_attempt_at=command.first_attempt_at,
        ),
        delay=command.delay,  # <-- Add this parameter
    )
```
Fix for Bug #2:
```python
subcommands = _add_or_enqueue_event(
    EventAttempt(
        event=tick.event,
        attempts=tick.attempts,  # <-- Add this
        first_attempt_at=tick.first_attempt_at,  # <-- Add this
    ),
    step_name,
    state.workers[step_name],
    now_seconds,
)
```
Version
0.14.10
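With both fixes applied, the Combined Effect timeline above should collapse to a finite, spaced-out schedule. A minimal virtual-clock simulation (an illustrative helper, independent of llama-index) shows the attempt times the reproduction below expects:

```python
def simulate_fixed_schedule(delay: float, maximum_attempts: int) -> list[float]:
    """Virtual-clock simulation of a step that always fails, assuming the
    fixed behavior: attempts are preserved and each retry waits `delay`."""
    now = 0.0
    attempts = 0
    times: list[float] = []
    while True:
        times.append(now)      # step executes (and fails)
        attempts += 1
        if attempts >= maximum_attempts:
            return times       # retries exhausted; error propagates
        now += delay           # delay honored before the next attempt

print(simulate_fixed_schedule(delay=2, maximum_attempts=3))  # -> [0.0, 2.0, 4.0]
```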
Steps to Reproduce
"""
Minimal reproduction script for llama-index workflow retry policy bugs.
This script demonstrates two bugs in the workflows package:
1. Bug #1: Retry delay is ignored - retries happen immediately instead of waiting
2. Bug #2: Attempt counter resets to 0 on each retry - maximum_attempts is never reached
Expected behavior:
- 3 attempts total, with 2 second delays between each
- Total runtime: ~4 seconds (2 delays × 2 seconds)
Actual behavior:
- Infinite immediate retries until manually stopped
- No delay between retries
- Attempt counter never increases past 1
Run with: uv run python -m research.litellm.reproduce
"""
import asyncio
import time
from importlib.metadata import version
from llama_index.core.workflow import Workflow, Context, StartEvent, StopEvent, step
from llama_index.core.workflow.retry_policy import ConstantDelayRetryPolicy
# Get package versions
LLAMA_INDEX_VERSION = version("llama-index-core")
try:
WORKFLOWS_VERSION = version("llama-index-workflows")
except Exception:
try:
WORKFLOWS_VERSION = version("workflows")
except Exception:
WORKFLOWS_VERSION = "unknown"
# Track execution times to verify delay (or lack thereof)
execution_times: list[float] = []
start_time: float = 0
class BugReproductionWorkflow(Workflow):
@step(retry_policy=ConstantDelayRetryPolicy(delay=2, maximum_attempts=3))
async def failing_step(self, ctx: Context, ev: StartEvent) -> StopEvent:
global execution_times
current_time = time.time()
execution_times.append(current_time)
attempt_num = len(execution_times)
elapsed = current_time - start_time
# Calculate time since last attempt
if len(execution_times) > 1:
time_since_last = current_time - execution_times[-2]
else:
time_since_last = 0
print(f"Attempt #{attempt_num} at {elapsed:.3f}s (gap: {time_since_last:.3f}s)")
# Safety: stop after 10 attempts to prevent infinite loop
if attempt_num >= 10:
print("\n⚠️ STOPPING after 10 attempts to prevent infinite loop!")
print("This demonstrates Bug #2: maximum_attempts=3 was never reached.\n")
return StopEvent(result="forced_stop")
# Always fail to trigger retry
raise Exception(f"Simulated failure on attempt {attempt_num}")
async def main():
global start_time
print("=" * 60)
print("Retry Policy Bug Reproduction")
print("=" * 60)
print(f"\nVersions:")
print(f" - llama-index-core: {LLAMA_INDEX_VERSION}")
print(f" - llama-index-workflows: {WORKFLOWS_VERSION}")
print("\nConfiguration:")
print(" - delay: 2 seconds")
print(" - maximum_attempts: 3")
print("\nExpected: 3 attempts, ~4 seconds total (2 delays × 2s)")
print("Actual: Watch what happens...\n")
print("-" * 60)
workflow = BugReproductionWorkflow(timeout=30)
start_time = time.time()
try:
result = await workflow.run()
print(f"\nWorkflow completed with result: {result}")
except Exception as e:
print(f"\nWorkflow failed with: {e}")
total_time = time.time() - start_time
print("-" * 60)
print(f"\nResults:")
print(f" - Total attempts: {len(execution_times)}")
print(f" - Total time: {total_time:.3f}s")
if len(execution_times) > 1:
gaps = [execution_times[i] - execution_times[i-1]
for i in range(1, len(execution_times))]
avg_gap = sum(gaps) / len(gaps)
print(f" - Average gap between attempts: {avg_gap:.3f}s")
print(f" - Expected gap: 2.000s")
print("\n" + "=" * 60)
if len(execution_times) > 3:
print("❌ BUG CONFIRMED: More than 3 attempts occurred!")
print(" Bug #2: Attempt counter resets, maximum_attempts ignored")
if len(execution_times) > 1:
gaps = [execution_times[i] - execution_times[i-1]
for i in range(1, len(execution_times))]
if all(g < 1.0 for g in gaps):
print("❌ BUG CONFIRMED: No delay between retries!")
print(" Bug #1: Delay parameter is not being applied")
print("=" * 60)
if __name__ == "__main__":
asyncio.run(main())
Then run the script:

```shell
uv run python -m research.litellm.reproduce
```
Relevant Logs/Tracebacks
```
uv run python -m research.litellm.reproduce
============================================================
Retry Policy Bug Reproduction
============================================================

Versions:
  - llama-index-core: 0.14.10
  - llama-index-workflows: 2.11.5

Configuration:
  - delay: 2 seconds
  - maximum_attempts: 3

Expected: 3 attempts, ~4 seconds total (2 delays × 2s)
Actual: Watch what happens...

------------------------------------------------------------
Attempt #1 at 0.001s (gap: 0.000s)
Attempt #2 at 0.001s (gap: 0.000s)
Attempt #3 at 0.001s (gap: 0.000s)
Attempt #4 at 0.001s (gap: 0.000s)
Attempt #5 at 0.001s (gap: 0.000s)
Attempt #6 at 0.001s (gap: 0.000s)
Attempt #7 at 0.002s (gap: 0.000s)
Attempt #8 at 0.002s (gap: 0.000s)
Attempt #9 at 0.002s (gap: 0.000s)
Attempt #10 at 0.002s (gap: 0.000s)

⚠️ STOPPING after 10 attempts to prevent infinite loop!
This demonstrates Bug #2: maximum_attempts=3 was never reached.

Workflow completed with result: forced_stop
------------------------------------------------------------

Results:
  - Total attempts: 10
  - Total time: 0.002s
  - Average gap between attempts: 0.000s
  - Expected gap: 2.000s

============================================================
❌ BUG CONFIRMED: More than 3 attempts occurred!
   Bug #2: Attempt counter resets, maximum_attempts ignored
❌ BUG CONFIRMED: No delay between retries!
   Bug #1: Delay parameter is not being applied
============================================================
```