VLM-as-a-Judge Pipeline
vlm-as-a-judge is a pairwise pipeline. It does not produce a metric bundle for a single edited image. Instead, it answers a more direct question:
Which of two candidate edited images is better?
Typical use cases
- comparing model outputs on a benchmark
- reward evaluation
- constructing pairwise preference data
- letting a VLM or LLM return a winner directly instead of hand-authoring complex score logic
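For the preference-data use case, the typical loop is: run the judge on each candidate pair and keep only the decisive verdicts. A minimal sketch, assuming a hypothetical `judge` callable that returns one of the normalized winner values (this is an illustration, not the pipeline's API):

```python
# Sketch of building pairwise preference data from judge verdicts.
# `judge` is a hypothetical stand-in for a call to the vlm-as-a-judge
# pipeline; it returns "Image A", "Image B", "Tie", or "Failed".

def build_preference_pairs(samples, judge):
    """Turn decisive judge verdicts into (chosen, rejected) pairs."""
    pairs = []
    for s in samples:
        img_a, img_b = s["edited_images"]
        winner = judge(s["instruction"], s["input_image"], img_a, img_b)
        if winner == "Image A":
            pairs.append({"chosen": img_a, "rejected": img_b})
        elif winner == "Image B":
            pairs.append({"chosen": img_b, "rejected": img_a})
        # "Tie" and "Failed" verdicts carry no preference and are skipped.
    return pairs
```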
Starter configs
- openai.yaml
- gemini.yaml
- vllm.yaml
- eval_if.yaml
- eval_vc.yaml
- eval_vq.yaml
Input and output contract
The key input fields are:
- instruction
- input_image
- edited_images
edited_images should contain the two candidate images to compare.
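Concretely, a single input record might look like the following. The field names match the contract above; the instruction text and paths are illustrative, not from a real benchmark:

```python
# Illustrative input record for the pairwise judge; field names follow
# the input contract, values are made up for this example.
record = {
    "instruction": "Replace the red car with a blue bicycle.",
    "input_image": "data/inputs/0001.png",
    "edited_images": [                     # the two candidates to compare
        "data/outputs/model_a/0001.png",   # Image A
        "data/outputs/model_b/0001.png",   # Image B
    ],
}
```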
The key output fields are:
- winner
- raw_responses
The normalized winner values are:
- Image A
- Image B
- Tie
- Failed
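A judge model's raw text rarely arrives in exactly these four forms, so a normalization step maps free-form responses onto them. A minimal sketch of such a mapping (an illustration only, not the pipeline's actual parser):

```python
import re

def normalize_winner(raw_response: str) -> str:
    """Map a free-form judge response onto the normalized winner values."""
    text = raw_response.strip().lower()
    mentions_a = re.search(r"\bimage a\b", text) is not None
    mentions_b = re.search(r"\bimage b\b", text) is not None
    if mentions_a and not mentions_b:
        return "Image A"
    if mentions_b and not mentions_a:
        return "Image B"
    if re.search(r"\btie\b", text):
        return "Tie"
    return "Failed"  # unparseable or contradictory responses
```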
Minimal example
cd <PROJECT_ROOT>
autopipeline eval \
--bmk openedit \
--pipeline-config-path <PROJECT_ROOT>/configs/pipelines/vlm_as_a_judge/openai.yaml \
--user-config <PROJECT_ROOT>/configs/pipelines/user_config.yaml \
--save-path <PROJECT_ROOT>/data/reward_eval_results
How it differs from annotation
annotation:
- is better when you want structured metric scores
- supports metric-driven ranking and filtering
vlm-as-a-judge:
- is better when you want a direct pairwise winner
- is closer to preference modeling or benchmark ranking
When not to use it
If your real requirement is to preserve fine-grained metric scores for each scope rather than collapsing the decision into a single winner, start with the human-centric or object-centric pipelines instead.