[Bug]: Retry policy delay is ignored in workflow step execution #20403

@gringParis

Description

Bug Description

Bug Report: Retry policy is completely broken - delay ignored AND attempt counter reset

Description

There are two bugs in the workflow retry mechanism that together cause infinite immediate retries:

  1. Bug #1: The delay parameter from the retry policy is never applied - retries happen immediately
  2. Bug #2: The attempts counter is discarded when processing retry events, so maximum_attempts is never reached

Environment

  • llama-index-core: 0.14.10
  • llama-index-workflows: 2.11.5
  • Python version: 3.12

Expected Behavior

When a step fails, the retry should wait for the specified delay (5 seconds in this case) before retrying.
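For reference, the intended schedule under a constant-delay policy can be sketched in a few lines of plain Python (an illustration of the expected semantics only, not the workflows library code):

```python
# Expected attempt timeline for a constant-delay retry policy
# (illustration of intended semantics, not the actual workflows code).
delay = 5.0          # seconds to wait between attempts
maximum_attempts = 3

# attempt i should start roughly i * delay seconds after the first one
expected_starts = [i * delay for i in range(maximum_attempts)]
print(expected_starts)  # [0.0, 5.0, 10.0]
```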

Actual Behavior

The retry mechanism is completely broken:

  1. Retries happen immediately, without any delay (Bug #1)
  2. The attempt counter resets to 0 on each retry, so maximum_attempts is never reached (Bug #2)
  3. This creates an infinite loop of immediate retries that runs until the process is killed
  4. With num_workers=4 (the default), this rapidly overwhelms resources

Severity: Critical

Any step that has a retry policy and raises an exception will enter an infinite loop of immediate retries.

Root Cause

There are two bugs in workflows/runtime/control_loop.py:

Bug #1: Delay not passed to queue_tick (line ~176)

if isinstance(command, CommandQueueEvent):
    self.queue_tick(
        TickAddEvent(
            event=command.event,
            step_name=command.step_name,
            attempts=command.attempts,
            first_attempt_at=command.first_attempt_at,
        )
    )  # <-- Missing: delay=command.delay

The CommandQueueEvent.delay is set correctly when scheduling a retry (line ~427), but it's never passed to queue_tick().

Bug #2: Attempts counter discarded (line ~617-618)

def _process_add_event_tick(tick: TickAddEvent, init: BrokerState, now_seconds: float):
    # ...
    subcommands = _add_or_enqueue_event(
        EventAttempt(event=tick.event),  # <-- BUG: attempts & first_attempt_at DISCARDED!
        step_name,
        state.workers[step_name],
        now_seconds,
    )

The TickAddEvent contains attempts and first_attempt_at, but when creating EventAttempt, only event is passed. The attempt counter resets to 0 on every retry, so maximum_attempts is never reached.

Combined Effect

  1. Step fails → retry scheduled with attempts=1, delay=5
  2. delay is ignored (Bug #1) → retry happens immediately
  3. attempts is discarded (Bug #2) → counter resets to 0
  4. Step executes with attempts=0, fails → retry scheduled with attempts=1
  5. Infinite loop of immediate retries
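The combined control flow can be modelled as a toy loop in plain Python (an illustration of the sequence above, not the actual control_loop.py code; `hard_stop` stands in for killing the process):

```python
# Toy model of the broken retry loop (illustration only, not the actual
# control_loop.py code). `hard_stop` stands in for killing the process.
def broken_retry(max_attempts: int, hard_stop: int) -> int:
    attempts = 0
    executions = 0
    while executions < hard_stop:
        executions += 1           # step runs and fails
        scheduled = attempts + 1  # a retry is scheduled with attempts + 1 ...
        attempts = 0              # ... but the counter is discarded (Bug #2)
        if attempts >= max_attempts:
            break                 # never reached, so max_attempts never fires
        # the delay is also never applied (Bug #1), so the loop spins here
    return executions

print(broken_retry(max_attempts=3, hard_stop=10))  # 10 executions, not 3
```

With the suggested fixes applied, `attempts` would carry forward each iteration and the loop would terminate after `max_attempts` executions, each separated by the configured delay.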

Suggested Fixes

Fix for Bug #1:

if isinstance(command, CommandQueueEvent):
    self.queue_tick(
        TickAddEvent(
            event=command.event,
            step_name=command.step_name,
            attempts=command.attempts,
            first_attempt_at=command.first_attempt_at,
        ),
        delay=command.delay,  # <-- Add this parameter
    )

Fix for Bug #2:

subcommands = _add_or_enqueue_event(
    EventAttempt(
        event=tick.event,
        attempts=tick.attempts,           # <-- Add this
        first_attempt_at=tick.first_attempt_at,  # <-- Add this
    ),
    step_name,
    state.workers[step_name],
    now_seconds,
)

Version

0.14.10

Steps to Reproduce

"""
Minimal reproduction script for llama-index workflow retry policy bugs.

This script demonstrates two bugs in the workflows package:
1. Bug #1: Retry delay is ignored - retries happen immediately instead of waiting
2. Bug #2: Attempt counter resets to 0 on each retry - maximum_attempts is never reached

Expected behavior:
- 3 attempts total, with 2 second delays between each
- Total runtime: ~4 seconds (2 delays × 2 seconds)

Actual behavior:
- Infinite immediate retries until manually stopped
- No delay between retries
- Attempt counter never increases past 1

Run with: uv run python -m research.litellm.reproduce
"""

import asyncio
import time
from importlib.metadata import version

from llama_index.core.workflow import Workflow, Context, StartEvent, StopEvent, step
from llama_index.core.workflow.retry_policy import ConstantDelayRetryPolicy

# Get package versions
LLAMA_INDEX_VERSION = version("llama-index-core")
try:
    WORKFLOWS_VERSION = version("llama-index-workflows")
except Exception:
    try:
        WORKFLOWS_VERSION = version("workflows")
    except Exception:
        WORKFLOWS_VERSION = "unknown"

# Track execution times to verify delay (or lack thereof)
execution_times: list[float] = []
start_time: float = 0


class BugReproductionWorkflow(Workflow):
    @step(retry_policy=ConstantDelayRetryPolicy(delay=2, maximum_attempts=3))
    async def failing_step(self, ctx: Context, ev: StartEvent) -> StopEvent:
        global execution_times
        
        current_time = time.time()
        execution_times.append(current_time)
        attempt_num = len(execution_times)
        
        elapsed = current_time - start_time
        
        # Calculate time since last attempt
        if len(execution_times) > 1:
            time_since_last = current_time - execution_times[-2]
        else:
            time_since_last = 0
        
        print(f"Attempt #{attempt_num} at {elapsed:.3f}s (gap: {time_since_last:.3f}s)")
        
        # Safety: stop after 10 attempts to prevent infinite loop
        if attempt_num >= 10:
            print("\n⚠️  STOPPING after 10 attempts to prevent infinite loop!")
            print("This demonstrates Bug #2: maximum_attempts=3 was never reached.\n")
            return StopEvent(result="forced_stop")
        
        # Always fail to trigger retry
        raise Exception(f"Simulated failure on attempt {attempt_num}")


async def main():
    global start_time
    
    print("=" * 60)
    print("Retry Policy Bug Reproduction")
    print("=" * 60)
    print(f"\nVersions:")
    print(f"  - llama-index-core: {LLAMA_INDEX_VERSION}")
    print(f"  - llama-index-workflows: {WORKFLOWS_VERSION}")
    print("\nConfiguration:")
    print("  - delay: 2 seconds")
    print("  - maximum_attempts: 3")
    print("\nExpected: 3 attempts, ~4 seconds total (2 delays × 2s)")
    print("Actual: Watch what happens...\n")
    print("-" * 60)
    
    workflow = BugReproductionWorkflow(timeout=30)
    start_time = time.time()
    
    try:
        result = await workflow.run()
        print(f"\nWorkflow completed with result: {result}")
    except Exception as e:
        print(f"\nWorkflow failed with: {e}")
    
    total_time = time.time() - start_time
    
    print("-" * 60)
    print(f"\nResults:")
    print(f"  - Total attempts: {len(execution_times)}")
    print(f"  - Total time: {total_time:.3f}s")
    
    if len(execution_times) > 1:
        gaps = [execution_times[i] - execution_times[i-1] 
                for i in range(1, len(execution_times))]
        avg_gap = sum(gaps) / len(gaps)
        print(f"  - Average gap between attempts: {avg_gap:.3f}s")
        print(f"  - Expected gap: 2.000s")
    
    print("\n" + "=" * 60)
    if len(execution_times) > 3:
        print("❌ BUG CONFIRMED: More than 3 attempts occurred!")
        print("   Bug #2: Attempt counter resets, maximum_attempts ignored")
    if len(execution_times) > 1:
        gaps = [execution_times[i] - execution_times[i-1] 
                for i in range(1, len(execution_times))]
        if all(g < 1.0 for g in gaps):
            print("❌ BUG CONFIRMED: No delay between retries!")
            print("   Bug #1: Delay parameter is not being applied")
    print("=" * 60)


if __name__ == "__main__":
    asyncio.run(main())
  1. Run the script:

uv run python -m research.litellm.reproduce

Relevant Logs/Tracebacks

uv run python -m research.litellm.reproduce
============================================================
Retry Policy Bug Reproduction
============================================================

Versions:
  - llama-index-core: 0.14.10
  - llama-index-workflows: 2.11.5

Configuration:
  - delay: 2 seconds
  - maximum_attempts: 3

Expected: 3 attempts, ~4 seconds total (2 delays × 2s)
Actual: Watch what happens...

------------------------------------------------------------
Attempt #1 at 0.001s (gap: 0.000s)
Attempt #2 at 0.001s (gap: 0.000s)
Attempt #3 at 0.001s (gap: 0.000s)
Attempt #4 at 0.001s (gap: 0.000s)
Attempt #5 at 0.001s (gap: 0.000s)
Attempt #6 at 0.001s (gap: 0.000s)
Attempt #7 at 0.002s (gap: 0.000s)
Attempt #8 at 0.002s (gap: 0.000s)
Attempt #9 at 0.002s (gap: 0.000s)
Attempt #10 at 0.002s (gap: 0.000s)

⚠️  STOPPING after 10 attempts to prevent infinite loop!
This demonstrates Bug #2: maximum_attempts=3 was never reached.


Workflow completed with result: forced_stop
------------------------------------------------------------

Results:
  - Total attempts: 10
  - Total time: 0.002s
  - Average gap between attempts: 0.000s
  - Expected gap: 2.000s

============================================================
❌ BUG CONFIRMED: More than 3 attempts occurred!
   Bug #2: Attempt counter resets, maximum_attempts ignored
❌ BUG CONFIRMED: No delay between retries!
   Bug #1: Delay parameter is not being applied
============================================================
