Feature request: Cost reporting via dynamic metadata in EPP #2019

@fcfort

FYI I also put the design in a globally readable and commentable Google Doc: https://docs.google.com/document/d/10_u1Pvb3MD2Wii6NB50OVIGQVN3p_YEuV0j7PBdIOiY/edit?tab=t.0 if that's easier to leave comments on.

Overview

The Gateway API Inference Extension project aims to optimize self-hosting Generative Models on GKE. A key component of this system is the Endpoint Picker (EPP), which intelligently routes requests to appropriate model server backends. For advanced routing and load balancing, particularly in features like prefix sharding, the data plane needs to be aware of the "cost" associated with processing a request.

This proposal defines a plugin for the EPP server that lets users declaratively configure how this cost is calculated and reported, without code changes to the EPP binary. The plugin is configured through the existing --config-file or --config-text command-line flags of the EPP server.

Proposed API

apiVersion: config.apix.gateway-api-inference-extension.sigs.k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
  - name: input-tokens-cost-reporter
    type: cost-reporter
    parameters:
      # Defines where in dynamic metadata to return the data
      metric:
        namespace: envoy.lb  # Defaults to envoy.lb if omitted
        # What key to use in the provided namespace for the value from expression
        name: x-gateway-inference-request-cost      
        # Specifies the source of data for the CEL expression.
        dataSource: responseBody
        # The CEL expression to calculate the cost.
        expression: |
          (has(responseBody.usage.prompt_tokens) ? responseBody.usage.prompt_tokens : 0) +
          (has(responseBody.usage.completion_tokens) ? responseBody.usage.completion_tokens : 0)
        # Optional: CEL expression to determine if this metric should be calculated/reported
        condition: "has(responseBody.usage)"

Detailed design

Data plane

The initial implementation will only support parsing the response body. The cost reporting logic will be triggered in the response processing path, specifically when the response body is handled.
The plugin will implement the ResponseStreaming and ResponseComplete interfaces defined in pkg/epp/requestcontrol/plugins.go.

For each configured metric, the condition, if provided, is evaluated first. If the condition is met (or absent), the expression is evaluated against the dataSource. The result is expected to be an integer. If the result is not an integer, or if evaluation fails, the defaultValue is used if one is provided. An evaluation failure will not fail the request; a warning is logged instead.
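The evaluation order described above can be sketched in Go as follows. This is only an illustration, not the proposed implementation: `evalFunc` is a hypothetical stand-in for a compiled CEL program, and the `metric` type and its field names are invented for the example. Treating a failed condition as "not met" is also an assumption, since the design does not say what happens in that case.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// evalFunc is a hypothetical stand-in for a compiled CEL program: it receives
// the decoded data source and returns a value or an error.
type evalFunc func(data map[string]any) (any, error)

// metric holds the pieces described in the text; all names are illustrative.
type metric struct {
	condition    evalFunc // optional; nil means "always evaluate the expression"
	expression   evalFunc
	defaultValue *int64 // optional fallback
}

// cost returns the value to report and whether anything should be reported.
// It never returns an error, so a bad expression cannot fail the request.
func (m metric) cost(data map[string]any) (int64, bool) {
	if m.condition != nil {
		out, err := m.condition(data)
		if err != nil {
			// Assumption: a failing condition is treated as "not met".
			log.Printf("warning: condition evaluation failed: %v", err)
			return 0, false
		}
		if met, isBool := out.(bool); !isBool || !met {
			return 0, false
		}
	}
	out, err := m.expression(data)
	if err == nil {
		if v, isInt := out.(int64); isInt {
			return v, true
		}
		err = fmt.Errorf("expression returned %T, want int64", out)
	}
	log.Printf("warning: cost expression failed: %v", err)
	if m.defaultValue != nil {
		return *m.defaultValue, true
	}
	return 0, false
}

func main() {
	body := map[string]any{
		"usage": map[string]any{"prompt_tokens": int64(12), "completion_tokens": int64(30)},
	}
	fallback := int64(0)
	m := metric{
		condition: func(d map[string]any) (any, error) {
			_, ok := d["usage"]
			return ok, nil
		},
		expression: func(d map[string]any) (any, error) {
			u, ok := d["usage"].(map[string]any)
			if !ok {
				return nil, errors.New("usage is not an object")
			}
			p, _ := u["prompt_tokens"].(int64)
			c, _ := u["completion_tokens"].(int64)
			return p + c, nil
		},
		defaultValue: &fallback,
	}
	v, ok := m.cost(body)
	fmt.Println(v, ok)
}
```

Keeping the error inside `cost` and returning only a value and a "report it" flag is one way to make sure an evaluation failure can never propagate into the request path.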

We will avoid buffering the entire response. The CEL expression will be evaluated individually on each chunk of the response body in the streaming case. The first successful evaluation will stop subsequent evaluation of the response body.
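A rough Go sketch of this per-chunk strategy is shown below. It assumes the streamed body consists of server-sent events whose `data:` lines each carry a complete JSON object, and that usage appears in one of them. `chunkScanner` and `usageCost` are hypothetical names, and `usageCost` stands in for the CEL expression from the example configuration.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// chunkScanner evaluates a cost function on each response body chunk and
// stops after the first success. It keeps no reference to earlier chunks.
type chunkScanner struct {
	eval func(body map[string]any) (int64, bool)
	cost int64
	done bool
}

// observe inspects a single chunk. Assumption: every "data:" line in the
// chunk holds a complete JSON object (no event is split across chunks).
func (s *chunkScanner) observe(chunk []byte) {
	if s.done {
		return // first successful evaluation already happened
	}
	for _, line := range bytes.Split(chunk, []byte("\n")) {
		payload := bytes.TrimSpace(bytes.TrimPrefix(bytes.TrimSpace(line), []byte("data:")))
		var body map[string]any
		if json.Unmarshal(payload, &body) != nil {
			continue // blank line, "[DONE]", or anything that is not a JSON object
		}
		if v, ok := s.eval(body); ok {
			s.cost, s.done = v, true
			return
		}
	}
}

// usageCost plays the role of the CEL expression: prompt plus completion tokens.
func usageCost(body map[string]any) (int64, bool) {
	usage, ok := body["usage"].(map[string]any)
	if !ok {
		return 0, false
	}
	prompt, _ := usage["prompt_tokens"].(float64) // encoding/json decodes numbers as float64
	completion, _ := usage["completion_tokens"].(float64)
	return int64(prompt + completion), true
}

func main() {
	chunks := []string{
		`data: {"choices":[{"delta":{"content":"Hi"}}]}` + "\n\n",
		`data: {"choices":[],"usage":{"prompt_tokens":12,"completion_tokens":30}}` + "\n\n",
		"data: [DONE]\n\n",
	}
	s := &chunkScanner{eval: usageCost}
	for _, c := range chunks {
		s.observe([]byte(c))
	}
	fmt.Println(s.cost, s.done)
}
```

One consequence of evaluating chunks independently is that a JSON object split across two chunks is never seen whole. The sketch ignores such fragments rather than buffering them.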

The calculated cost value is added to the ext-proc response to instruct Envoy to set the dynamic metadata in the specified namespace with the specified name.
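As an illustration of the shape of that metadata, the sketch below builds the namespace-to-key nesting as a plain Go map. The `dynamicMetadata` helper is hypothetical. In the real ext-proc response the dynamic metadata field is a `google.protobuf.Struct`, so a map like this would still need to be converted, which the sketch leaves out to stay within the standard library.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// dynamicMetadata nests the cost under namespace -> name, mirroring the
// metric.namespace and metric.name settings from the example configuration.
func dynamicMetadata(namespace, name string, cost int64) map[string]any {
	if namespace == "" {
		namespace = "envoy.lb" // default from the proposed API
	}
	return map[string]any{
		// protobuf Struct numbers are doubles, so the value is stored as float64.
		namespace: map[string]any{name: float64(cost)},
	}
}

func main() {
	md := dynamicMetadata("", "x-gateway-inference-request-cost", 42)
	out, err := json.Marshal(md)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
	// prints {"envoy.lb":{"x-gateway-inference-request-cost":42}}
}
```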

Configuration Loading

The EPP server will parse the cost-reporter plugin's parameters from the YAML provided via --config-file or --config-text.
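A minimal sketch of decoding and validating the plugin parameters might look like the following. It assumes the `parameters` block reaches the plugin already converted to JSON. The type names, the `parseParameters` helper, and the choice of required fields are all illustrative rather than part of the proposal.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// metricConfig mirrors the fields under "metric" in the proposed API.
type metricConfig struct {
	Namespace  string `json:"namespace"`
	Name       string `json:"name"`
	DataSource string `json:"dataSource"`
	Expression string `json:"expression"`
	Condition  string `json:"condition"` // optional
}

// parameters mirrors the plugin's "parameters" block.
type parameters struct {
	Metric metricConfig `json:"metric"`
}

// parseParameters decodes the parameters, applies the documented default
// namespace, and rejects configurations the initial implementation could not
// serve. Which fields are required is an assumption of this sketch.
func parseParameters(raw []byte) (*parameters, error) {
	var p parameters
	if err := json.Unmarshal(raw, &p); err != nil {
		return nil, fmt.Errorf("decoding cost-reporter parameters: %w", err)
	}
	m := &p.Metric
	if m.Namespace == "" {
		m.Namespace = "envoy.lb"
	}
	if m.Name == "" || m.Expression == "" {
		return nil, errors.New("metric.name and metric.expression are required")
	}
	if m.DataSource != "responseBody" {
		return nil, fmt.Errorf("unsupported dataSource %q: only responseBody is supported", m.DataSource)
	}
	return &p, nil
}

func main() {
	raw := []byte(`{"metric":{
		"name":"x-gateway-inference-request-cost",
		"dataSource":"responseBody",
		"expression":"responseBody.usage.prompt_tokens",
		"condition":"has(responseBody.usage)"}}`)
	p, err := parseParameters(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(p.Metric.Namespace, p.Metric.Name)
}
```

Validating at load time means a misconfigured dataSource or a missing expression is reported when the EPP starts, not on the first request.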

CEL Environment

The EPP binary would take a dependency on the github.com/google/cel-go library. When the plugin is enabled, it will initialize a CEL environment on a per-request basis and, for each metric defined in the configuration, compile the expression and condition strings. The environment will be configured to understand each supported dataSource; the initial implementation will only support responseBody. The expression is assumed to produce an integer.
