Judge Modules

The judge modules are the prompt-driven comparison and scoring units implemented in src/autopipeline/components/modules/judge.py.

They are documented together because both classes share the same client stack, the same prompt-adapter selection logic, and the same schema-validation path.

Module Group

Included Registry Entries

| Registry key | Class | Primary output |
| --- | --- | --- |
| `pairwise-judge` | `PairwiseJudge` | normalized winner: `Image A`, `Image B`, `Tie`, or `Failed` |
| `viescore` | `VIEscorePipe` | single numeric score or pairwise winner derived from scores |
Shared Infrastructure

build_client(...)

build_client(...) resolves a backend from CLIENT_REGISTRY and forwards backend-specific fields to the client constructor.

Recognized client config keys

| Key | Meaning |
| --- | --- |
| `backend` | registry key such as `api`, `google`, or `vllm` |
| `model_name` | model identifier passed to the backend |
| `ip_address`, `port` | used by the vLLM client |
| `base_url`, `api_key` | used by HTTP- or SDK-based remote clients |
| extra kwargs | forwarded to the selected client |
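
The dispatch can be sketched as follows; the client classes and registry contents here are illustrative stand-ins, not the real implementations:

```python
# Sketch of build_client(...) dispatch on the "backend" key.
# APIClient, VLLMClient, and the registry contents are stand-ins.
from typing import Any, Dict

class APIClient:
    def __init__(self, model_name: str, base_url: str = "", api_key: str = "", **kwargs: Any):
        self.model_name = model_name
        self.base_url = base_url

class VLLMClient:
    def __init__(self, model_name: str, ip_address: str = "", port: int = 0, **kwargs: Any):
        self.model_name = model_name
        self.endpoint = f"{ip_address}:{port}"

CLIENT_REGISTRY = {"api": APIClient, "vllm": VLLMClient}

def build_client(client_cfg: Dict[str, Any]):
    cfg = dict(client_cfg)
    backend = cfg.pop("backend")           # registry key, e.g. "api" or "vllm"
    client_cls = CLIENT_REGISTRY[backend]  # resolve the backend class
    return client_cls(**cfg)               # forward backend-specific fields
```

Unrecognized keys simply flow through `**kwargs`, which is what lets each backend accept its own connection fields.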

ClientPipe

Both judge modules inherit from ClientPipe.

```python
ClientPipe(
    prompt_template: PromptTemplate,
    client_cfg: Dict[str, Any],
)
```

What ClientPipe does

ClientPipe is responsible for:

  1. building the backend client
  2. selecting google_style or openai_style payload formatting
  3. resolving optional input and output schemas from prompt metadata
  4. validating or cleaning input before a model call
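
A toy sketch of those four responsibilities; every internal detail here (the stubbed client, the style rule, the cleaning rule) is invented for illustration, not lifted from the source:

```python
# Illustrative-only sketch of the four ClientPipe responsibilities.
from typing import Any, Dict, Optional

class ClientPipeSketch:
    def __init__(self, prompt_template: Dict[str, Any], client_cfg: Dict[str, Any]):
        # 1. build the backend client (stubbed here)
        self.client = {"backend": client_cfg.get("backend", "api")}
        # 2. select payload formatting based on the backend family (assumed rule)
        self.style = "google_style" if client_cfg.get("backend") == "google" else "openai_style"
        # 3. resolve optional input/output schemas from prompt metadata
        meta = prompt_template.get("metadata", {})
        self.input_schema: Optional[str] = meta.get("input_schema")
        self.output_schema: Optional[str] = meta.get("output_schema")

    def clean_input(self, input_dict: Dict[str, Any]) -> Dict[str, Any]:
        # 4. validate or clean input before a model call
        # (toy rule: keep only fields the judge schemas care about)
        if not self.input_schema:
            return input_dict
        allowed = {"instruction", "input_image", "edited_images"}
        return {k: v for k, v in input_dict.items() if k in allowed}
```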

Shared constructor parameters

| Parameter | Required | Meaning |
| --- | --- | --- |
| `prompt_template` | Yes | Prompt asset loaded from `PromptAssetStore`. |
| `client_cfg` | Yes | Backend connection config and prompt metadata. |
Class

PairwiseJudge

PairwiseJudge is the canonical A/B evaluator.

Constructor

```python
PairwiseJudge(
    prompt_template: PromptTemplate,
    client_cfg: Dict[str, Any],
)
```

It adds no extra constructor arguments beyond ClientPipe.

Public Methods

| Method | Purpose |
| --- | --- |
| `_is_valid_winner(winner_str)` | Normalize common winner aliases into `Image A`, `Image B`, or `Tie`. |
| `__call__(input_dict, **kwargs)` | Execute the comparison prompt and return a normalized result dict. |

Call Signature

```python
PairwiseJudge.__call__(
    input_dict: Dict[str, Any],
    **kwargs,
)
```

Expected input fields

If the prompt declares PairJudgeInput, the input schema is:

| Field | Type | Meaning |
| --- | --- | --- |
| `instruction` | `str` | Edit instruction being judged. |
| `input_image` | `Any` | Source image. |
| `edited_images` | `list[Any]` | Exactly two candidate edited images. |

Output format

The return value is always a dict:

```
{
    "type": "pairwise_comparison",
    "value": "Image A" | "Image B" | "Tie" | "Failed",
    "meta": {
        "raw_response": "<model text>"
    }
}
```

Winner normalization rules

The following raw values are accepted:

  • image_a, image a, a -> Image A
  • image_b, image b, b -> Image B
  • tie, equal, both, none -> Tie

Anything else is treated as invalid and retried.
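
The normalization table above amounts to a case-insensitive alias lookup. A stand-in sketch (not the source of `_is_valid_winner`):

```python
# Stand-in reimplementation of the winner-normalization rules listed above.
from typing import Optional

_ALIASES = {
    "image_a": "Image A", "image a": "Image A", "a": "Image A",
    "image_b": "Image B", "image b": "Image B", "b": "Image B",
    "tie": "Tie", "equal": "Tie", "both": "Tie", "none": "Tie",
}

def normalize_winner(winner_str: str) -> Optional[str]:
    # Case-insensitive lookup; anything unrecognized returns None,
    # which the caller treats as invalid and retries.
    return _ALIASES.get(winner_str.strip().lower())
```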

Failure behavior

If the model output cannot be parsed or does not validate after retries:

  • value stays "Failed"
  • the last raw response is still preserved in meta.raw_response

This is a soft failure, not an exception.
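
Because the failure is a sentinel value rather than an exception, callers have to check for it explicitly. A minimal sketch (the helper name is hypothetical):

```python
# Guarding against the "Failed" sentinel in a judge result dict.
from typing import Any, Dict, Optional

def winner_or_none(result: Dict[str, Any]) -> Optional[str]:
    # "Failed" signals a soft failure; the raw response is still
    # available in result["meta"]["raw_response"] for debugging.
    value = result["value"]
    return None if value == "Failed" else value
```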

Class

VIEscorePipe

VIEscorePipe uses the same client path but converts model output into numeric scores.

Constructor

```python
VIEscorePipe(
    prompt_template: PromptTemplate,
    client_cfg: Dict[str, Any],
)
```

It also adds no extra constructor arguments beyond ClientPipe.

Public Methods

| Method | Purpose |
| --- | --- |
| `_parse_viescore(json_response)` | Convert prompt output into a float score. |
| `score_single_input(messages)` | Retry until one image has a valid score or the retry budget is exhausted. |
| `__call__(input_dict, **kwargs)` | Score one or two edited images and return either a scalar or a pairwise winner. |

Call Signature

```python
VIEscorePipe.__call__(
    input_dict: Dict[str, Any],
    **kwargs,
)
```

Expected input fields

The runtime input must contain:

| Field | Type | Meaning |
| --- | --- | --- |
| `instruction` | `str` | Edit instruction. |
| `input_image` | `Any` | Source image. |
| `edited_images` | `list[Any]` | One or two edited images. |

The code asserts that the list length is either 1 or 2.

Output format

For one edited image:

```
{
    "type": "single_score",
    "value": <float> | "Failed",
    "meta": {
        "vie_score": [<float | None>],
        "raw_response": {"edited_image_0": "<model text>"}
    }
}
```

For two edited images:

```
{
    "type": "pairwise_comparison",
    "value": "Image A" | "Image B" | "Tie" | "Failed",
    "meta": {
        "vie_score": [score_a, score_b],
        "raw_response": {
            "edited_image_0": "<model text>",
            "edited_image_1": "<model text>"
        }
    }
}
```
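
How the two scores collapse into a pairwise value can be sketched as below. The exact comparison rule is an assumption (higher score wins, equal scores tie); only the "any unparsed score means `Failed`" behavior is stated in the failure section:

```python
# Assumed rule for deriving the pairwise winner from two VIEscores.
from typing import Optional

def scores_to_winner(score_a: Optional[float], score_b: Optional[float]) -> str:
    if score_a is None or score_b is None:
        return "Failed"        # any unparsed score keeps the soft failure
    if score_a > score_b:
        return "Image A"
    if score_b > score_a:
        return "Image B"
    return "Tie"
```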

Score parsing behavior

_parse_viescore(...) supports two prompt families:

| Prompt family | Expected score field | Parsing rule |
| --- | --- | --- |
| UnicEdit-style prompts | scalar | cast directly to float |
| EditScore-style v2 prompts | list | choose index based on prompt-id family |

For v2 prompts:

  • instruction_following -> score[0]
  • visual_quality -> min(score)
  • otherwise -> score[1]
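
Those rules can be sketched as a small parser; the prompt-id substrings used for matching are assumptions, not the real identifiers:

```python
# Stand-in for the _parse_viescore(...) selection rules described above.
from typing import List, Union

def parse_viescore(score: Union[float, int, List[float]], prompt_id: str) -> float:
    if not isinstance(score, list):
        return float(score)            # UnicEdit-style: scalar, cast directly
    if "instruction_following" in prompt_id:
        return float(score[0])         # v2: first element
    if "visual_quality" in prompt_id:
        return float(min(score))       # v2: worst sub-score
    return float(score[1])             # v2 default: second element
```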

Failure behavior

If any score cannot be parsed:

  • that image keeps None in meta.vie_score
  • the overall result remains "Failed"
  • raw responses are still returned

Minimal Config Example

```yaml
metric_configs:
  pair-judge:
    pipe_name: pairwise-judge
    default_config: ${pipes_default.pairwise-judge}
    init_config:
      backend: api
      model_name: gpt-4o
      api_key: ${client_config.api_key}
      base_url: ${client_config.base_url}
      prompt_info:
        prompt_id: vlm/assessment/visual_consistency/pairwise
        version: v1
```

For viescore, replace pipe_name with viescore and point prompt_info at a .../viescore/... prompt asset.
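
A sketch of that substitution, assuming the same nesting as the pair-judge example; the metric key and `default_config` reference are illustrative, and the elided prompt path must be replaced with a real viescore asset:

```yaml
metric_configs:
  vie-score:                       # illustrative metric key
    pipe_name: viescore
    default_config: ${pipes_default.viescore}
    init_config:
      backend: api
      model_name: gpt-4o
      api_key: ${client_config.api_key}
      base_url: ${client_config.base_url}
      prompt_info:
        prompt_id: # fill in a .../viescore/... prompt asset path
        version: v1
```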

Extension Notes

  • Add a new client if transport changes.
  • Add a new prompt adapter if payload shape changes.
  • Reuse PairwiseJudge if the semantics are still "pick A, B, or Tie."
  • Reuse VIEscorePipe if the semantics are still "produce per-image scalar scores."
  • Only create a new judge module when the return type or retry logic genuinely differs from these two patterns.