[Bug]: Retry policy delay is ignored in workflow step execution #20403

@gringParis

Description

Bug Description

Bug Report: Retry policy is completely broken - delay ignored AND attempt counter reset

Description

There are two bugs in the workflow retry mechanism that together cause infinite immediate retries:

  1. Bug #1: The delay parameter from the retry policy is never applied - retries happen immediately
  2. Bug #2: The attempts counter is discarded when processing retry events, so maximum_attempts is never reached

Environment

  • llama-index-core: 0.14.10
  • llama-index-workflows: 2.11.5
  • Python version: 3.12

Expected Behavior

When a step fails, the retry should wait for the specified delay (5 seconds in this case) before retrying.
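For reference, the intended schedule under a constant-delay policy can be sketched in a few lines of plain Python (an illustration of the expected semantics only, not the workflows library code):

```python
# Expected attempt timeline for a constant-delay retry policy
# (illustration of intended semantics, not the actual workflows code).
delay = 5.0          # seconds to wait between attempts
maximum_attempts = 3

# attempt i should start roughly i * delay seconds after the first one
expected_starts = [i * delay for i in range(maximum_attempts)]
print(expected_starts)  # [0.0, 5.0, 10.0]
```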

Actual Behavior

The retry mechanism is completely broken:

  1. Retries happen immediately, without any delay (Bug #1)
  2. The attempt counter resets to 0 on each retry, so maximum_attempts is never reached (Bug #2)
  3. This creates an infinite loop of immediate retries that runs until the process is killed
  4. With num_workers=4 (the default), this rapidly overwhelms resources

Severity: Critical

Any step that has a retry policy and raises an exception will enter an infinite loop of immediate retries.

Root Cause

There are two bugs in workflows/runtime/control_loop.py:

Bug #1: Delay not passed to queue_tick (line ~176)

if isinstance(command, CommandQueueEvent):
    self.queue_tick(
        TickAddEvent(
            event=command.event,
            step_name=command.step_name,
            attempts=command.attempts,
            first_attempt_at=command.first_attempt_at,
        )
    )  # <-- Missing: delay=command.delay

The CommandQueueEvent.delay is set correctly when scheduling a retry (line ~427), but it's never passed to queue_tick().

Bug #2: Attempts counter discarded (line ~617-618)

def _process_add_event_tick(tick: TickAddEvent, init: BrokerState, now_seconds: float):
    # ...
    subcommands = _add_or_enqueue_event(
        EventAttempt(event=tick.event),  # <-- BUG: attempts & first_attempt_at DISCARDED!
        step_name,
        state.workers[step_name],
        now_seconds,
    )

The TickAddEvent contains attempts and first_attempt_at, but when creating EventAttempt, only event is passed. The attempt counter resets to 0 on every retry, so maximum_attempts is never reached.

Combined Effect

  1. Step fails → retry scheduled with attempts=1, delay=5
  2. delay is ignored (Bug #1) → retry happens immediately
  3. attempts is discarded (Bug #2) → counter resets to 0
  4. Step executes with attempts=0, fails → retry scheduled with attempts=1
  5. Infinite loop of immediate retries
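The combined control flow can be modelled as a toy loop in plain Python (an illustration of the sequence above, not the actual control_loop.py code; `hard_stop` stands in for killing the process):

```python
# Toy model of the broken retry loop (illustration only, not the actual
# control_loop.py code). `hard_stop` stands in for killing the process.
def broken_retry(max_attempts: int, hard_stop: int) -> int:
    attempts = 0
    executions = 0
    while executions < hard_stop:
        executions += 1           # step runs and fails
        scheduled = attempts + 1  # a retry is scheduled with attempts + 1 ...
        attempts = 0              # ... but the counter is discarded (Bug #2)
        if attempts >= max_attempts:
            break                 # never reached, so max_attempts never fires
        # the delay is also never applied (Bug #1), so the loop spins here
    return executions

print(broken_retry(max_attempts=3, hard_stop=10))  # 10 executions, not 3
```

With the suggested fixes applied, `attempts` would carry forward each iteration and the loop would terminate after `max_attempts` executions, each separated by the configured delay.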

Suggested Fixes

Fix for Bug #1:

if isinstance(command, CommandQueueEvent):
    self.queue_tick(
        TickAddEvent(
            event=command.event,
            step_name=command.step_name,
            attempts=command.attempts,
            first_attempt_at=command.first_attempt_at,
        ),
        delay=command.delay,  # <-- Add this parameter
    )

Fix for Bug #2:

subcommands = _add_or_enqueue_event(
    EventAttempt(
        event=tick.event,
        attempts=tick.attempts,           # <-- Add this
        first_attempt_at=tick.first_attempt_at,  # <-- Add this
    ),
    step_name,
    state.workers[step_name],
    now_seconds,
)

Version

0.14.10

Steps to Reproduce

"""
Minimal reproduction script for llama-index workflow retry policy bugs.

This script demonstrates two bugs in the workflows package:
1. Bug #1: Retry delay is ignored - retries happen immediately instead of waiting
2. Bug #2: Attempt counter resets to 0 on each retry - maximum_attempts is never reached

Expected behavior:
- 3 attempts total, with 2 second delays between each
- Total runtime: ~4 seconds (2 delays × 2 seconds)

Actual behavior:
- Infinite immediate retries until manually stopped
- No delay between retries
- Attempt counter never increases past 1

Run with: uv run python -m research.litellm.reproduce
"""

import asyncio
import time
from importlib.metadata import version

from llama_index.core.workflow import Workflow, Context, StartEvent, StopEvent, step
from llama_index.core.workflow.retry_policy import ConstantDelayRetryPolicy

# Get package versions
LLAMA_INDEX_VERSION = version("llama-index-core")
try:
    WORKFLOWS_VERSION = version("llama-index-workflows")
except Exception:
    try:
        WORKFLOWS_VERSION = version("workflows")
    except Exception:
        WORKFLOWS_VERSION = "unknown"

# Track execution times to verify delay (or lack thereof)
execution_times: list[float] = []
start_time: float = 0


class BugReproductionWorkflow(Workflow):
    @step(retry_policy=ConstantDelayRetryPolicy(delay=2, maximum_attempts=3))
    async def failing_step(self, ctx: Context, ev: StartEvent) -> StopEvent:
        global execution_times
        
        current_time = time.time()
        execution_times.append(current_time)
        attempt_num = len(execution_times)
        
        elapsed = current_time - start_time
        
        # Calculate time since last attempt
        if len(execution_times) > 1:
            time_since_last = current_time - execution_times[-2]
        else:
            time_since_last = 0
        
        print(f"Attempt #{attempt_num} at {elapsed:.3f}s (gap: {time_since_last:.3f}s)")
        
        # Safety: stop after 10 attempts to prevent infinite loop
        if attempt_num >= 10:
            print("\n⚠️  STOPPING after 10 attempts to prevent infinite loop!")
            print("This demonstrates Bug #2: maximum_attempts=3 was never reached.\n")
            return StopEvent(result="forced_stop")
        
        # Always fail to trigger retry
        raise Exception(f"Simulated failure on attempt {attempt_num}")


async def main():
    global start_time
    
    print("=" * 60)
    print("Retry Policy Bug Reproduction")
    print("=" * 60)
    print(f"\nVersions:")
    print(f"  - llama-index-core: {LLAMA_INDEX_VERSION}")
    print(f"  - llama-index-workflows: {WORKFLOWS_VERSION}")
    print("\nConfiguration:")
    print("  - delay: 2 seconds")
    print("  - maximum_attempts: 3")
    print("\nExpected: 3 attempts, ~4 seconds total (2 delays × 2s)")
    print("Actual: Watch what happens...\n")
    print("-" * 60)
    
    workflow = BugReproductionWorkflow(timeout=30)
    start_time = time.time()
    
    try:
        result = await workflow.run()
        print(f"\nWorkflow completed with result: {result}")
    except Exception as e:
        print(f"\nWorkflow failed with: {e}")
    
    total_time = time.time() - start_time
    
    print("-" * 60)
    print(f"\nResults:")
    print(f"  - Total attempts: {len(execution_times)}")
    print(f"  - Total time: {total_time:.3f}s")
    
    if len(execution_times) > 1:
        gaps = [execution_times[i] - execution_times[i-1] 
                for i in range(1, len(execution_times))]
        avg_gap = sum(gaps) / len(gaps)
        print(f"  - Average gap between attempts: {avg_gap:.3f}s")
        print(f"  - Expected gap: 2.000s")
    
    print("\n" + "=" * 60)
    if len(execution_times) > 3:
        print("❌ BUG CONFIRMED: More than 3 attempts occurred!")
        print("   Bug #2: Attempt counter resets, maximum_attempts ignored")
    if len(execution_times) > 1:
        gaps = [execution_times[i] - execution_times[i-1] 
                for i in range(1, len(execution_times))]
        if all(g < 1.0 for g in gaps):
            print("❌ BUG CONFIRMED: No delay between retries!")
            print("   Bug #1: Delay parameter is not being applied")
    print("=" * 60)


if __name__ == "__main__":
    asyncio.run(main())
  1. Run the script:

uv run python -m research.litellm.reproduce

Relevant Logs/Tracebacks

uv run python -m research.litellm.reproduce
============================================================
Retry Policy Bug Reproduction
============================================================

Versions:
  - llama-index-core: 0.14.10
  - llama-index-workflows: 2.11.5

Configuration:
  - delay: 2 seconds
  - maximum_attempts: 3

Expected: 3 attempts, ~4 seconds total (2 delays × 2s)
Actual: Watch what happens...

------------------------------------------------------------
Attempt #1 at 0.001s (gap: 0.000s)
Attempt #2 at 0.001s (gap: 0.000s)
Attempt #3 at 0.001s (gap: 0.000s)
Attempt #4 at 0.001s (gap: 0.000s)
Attempt #5 at 0.001s (gap: 0.000s)
Attempt #6 at 0.001s (gap: 0.000s)
Attempt #7 at 0.002s (gap: 0.000s)
Attempt #8 at 0.002s (gap: 0.000s)
Attempt #9 at 0.002s (gap: 0.000s)
Attempt #10 at 0.002s (gap: 0.000s)

⚠️  STOPPING after 10 attempts to prevent infinite loop!
This demonstrates Bug #2: maximum_attempts=3 was never reached.


Workflow completed with result: forced_stop
------------------------------------------------------------

Results:
  - Total attempts: 10
  - Total time: 0.002s
  - Average gap between attempts: 0.000s
  - Expected gap: 2.000s

============================================================
❌ BUG CONFIRMED: More than 3 attempts occurred!
   Bug #2: Attempt counter resets, maximum_attempts ignored
❌ BUG CONFIRMED: No delay between retries!
   Bug #1: Delay parameter is not being applied
============================================================
