
GEditBench v2: A Human-Aligned Benchmark for General Image Editing

Zhangqi Jiang1,2, Zheng Sun2, Xianfang Zeng2, Yufeng Yang2, Xuanyang Zhang2

Yongliang Wu3, Wei Cheng2, Gang Yu2, Xu Yang3*, Bihan Wen1*

1Nanyang Technological University, 2StepFun, 3Southeast University

Project leader; *Corresponding author

Recent advances in image editing have enabled models to handle complex instructions with impressive realism. However, existing evaluation frameworks lag behind: current benchmarks suffer from narrow task coverage, while standard metrics fail to adequately capture visual consistency, i.e., the preservation of identity, structure, and semantic coherence between edited and original images. To address these limitations, we introduce GEditBench v2, a comprehensive benchmark with 1,200 real-world user queries spanning 23 tasks, including a dedicated open-set category for unconstrained, out-of-distribution editing instructions beyond the predefined tasks. Furthermore, we propose PVC-Judge, an open-source pairwise assessment model for visual consistency, trained via two novel region-decoupled preference data synthesis pipelines. In addition, we construct VCReward-Bench from expert-annotated preference pairs to assess how well PVC-Judge aligns with human judgments of visual consistency. Experiments show that PVC-Judge achieves state-of-the-art evaluation performance among open-source models and even surpasses GPT-5.1 on average. Finally, by benchmarking 16 frontier editing models, we show that GEditBench v2 enables more human-aligned evaluation, reveals critical limitations of current models, and provides a reliable foundation for advancing precise image editing.

1. GEditBench v2: 22 predefined tasks plus an open-set category for evaluating editing models in real-world scenarios.

2. VCReward-Bench: 3,506 expert-annotated preference pairs for evaluating how well assessment models judge the visual consistency of image edits.

3. PVC-Judge: a pairwise assessment model for evaluating the visual consistency of edits.

4. AutoPipeline: two novel region-decoupled preference data synthesis pipelines.

Teaser

Overview of GEditBench v2

Overview of GEditBench v2 benchmark teaser

Benchmark Samples

Open-Set Instructions

Benchmark case for open-set image editing instructions

Object Reference

Benchmark case for object reference image editing

Relation Change

Benchmark case for relation change image editing

Camera Motion

Benchmark case for camera motion image editing

Text Editing

Benchmark case for text editing

Chart Editing

Benchmark case for chart editing

Leaderboard

Model Arena on GEditBench v2

Models are ranked by Overall Elo score from pairwise comparisons. Instruction Following and Visual Quality are assessed by GPT-4o, while Visual Consistency is evaluated by PVC-Judge. Confidence intervals are computed from 1,000 bootstrap iterations; a minimal sketch of the Elo fitting and bootstrap procedure follows the table. *Arena Elo scores were recorded on March 26, 2026, from Artificial Analysis. Our overall Elo ranking achieves a strong Spearman's rank correlation (ρ = 0.929, p < 2 × 10⁻⁷) with the Arena ranking, validating that our automated evaluation ecosystem reliably aligns with human preferences.

| Rank | Model | Source | Samples | Instruction Following | Visual Quality | Visual Consistency | Overall | Arena Elo* | Arena Rank |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Nano Banana Pro (26-03-04) | Closed | 1,156 | 1,126 (-13/+15) | 1,066 (-9/+10) | 1,108 (-11/+11) | 1,096 (-6/+6) | 1,251 | #2 |
| 2 | Seedream 4.5 (26-03-11) | Closed | 1,190 | 1,111 (-12/+12) | 1,142 (-11/+11) | 1,030 (-11/+12) | 1,089 (-7/+7) | 1,196 | #3 |
| 3 | GPT Image 1.5 (26-03-04) | Closed | 1,081 | 1,260 (-13/+15) | 1,149 (-12/+12) | 846 (-13/+13) | 1,071 (-7/+6) | 1,270 | #1 |
| 4 | FLUX.2 [klein] 9B | Open | 1,200 | 1,083 (-13/+12) | 1,025 (-11/+10) | 1,019 (-10/+9) | 1,039 (-6/+6) | 1,166 | #4 |
| 5 | Qwen-Image-Edit-2511 | Open | 1,200 | 1,095 (-10/+10) | 1,060 (-11/+11) | 972 (-9/+10) | 1,038 (-6/+6) | 1,164 | #5 |
| 6 | FLUX.2 [klein] 4B | Open | 1,200 | 1,007 (-12/+12) | 1,019 (-10/+10) | 1,070 (-10/+10) | 1,031 (-6/+6) | 1,107 | #10 |
| 7 | FLUX.2 [dev] Turbo | Open | 1,200 | 1,068 (-12/+12) | 936 (-10/+10) | 1,064 (-11/+10) | 1,021 (-6/+6) | 1,153 | #6 |
| 8 | Qwen-Image-Edit-2509 | Open | 1,200 | 1,033 (-10/+11) | 1,062 (-10/+12) | 955 (-9/+9) | 1,014 (-5/+6) | 1,142 | #7 |
| 9 | Qwen-Image-Edit | Open | 1,200 | 991 (-10/+10) | 1,073 (-11/+12) | 971 (-11/+11) | 1,010 (-6/+6) | 1,088 | #12 |
| 10 | FLUX.2 [dev] | Open | 1,200 | 1,037 (-12/+13) | 965 (-10/+10) | 1,018 (-11/+11) | 1,006 (-7/+7) | 1,137 | #8 |
| 11 | LongCat-Image-Edit | Open | 1,200 | 1,018 (-10/+11) | 968 (-10/+9) | 1,017 (-10/+9) | 1,001 (-6/+5) | 1,111 | #9 |
| 12 | Step1X-Edit-v1p2 | Open | 1,200 | 909 (-12/+12) | 1,007 (-12/+11) | 1,067 (-11/+11) | 996 (-6/+7) | 1,093 | #11 |
| 13 | GLM-Image | Open | 1,200 | 787 (-13/+14) | 1,023 (-11/+11) | 1,109 (-13/+14) | 979 (-6/+6) | 930 | #14 |
| 14 | OmniGen V2 | Open | 1,200 | 807 (-13/+12) | 910 (-12/+12) | 929 (-13/+13) | 888 (-7/+7) | 919 | #15 |
| 15 | FLUX.1 Kontext [dev] | Open | 1,200 | 849 (-13/+13) | 900 (-13/+14) | 840 (-14/+13) | 869 (-7/+8) | 1,017 | #13 |
| 16 | Bagel | Open | 1,200 | 820 (-13/+13) | 694 (-17/+16) | 987 (-13/+14) | 851 (-8/+8) | 915 | #16 |
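The leaderboard statistics can be reproduced with standard tools. Below is a minimal sketch assuming battle records of the form (model_a, model_b, winner); the Bradley-Terry-style Elo fit, the toy battle data, and all names here are illustrative, not the benchmark's exact implementation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import spearmanr

def fit_elo(battles, models, scale=400.0, base=1000.0):
    """Fit Bradley-Terry strengths on the Elo scale by maximum likelihood."""
    idx = {m: i for i, m in enumerate(models)}

    def neg_log_lik(r):
        nll = 0.0
        for a, b, winner in battles:
            # P(a beats b) under the Elo/Bradley-Terry model
            p = 1.0 / (1.0 + 10.0 ** ((r[idx[b]] - r[idx[a]]) / scale))
            nll -= np.log(p if winner == a else 1.0 - p)
        return nll

    r = minimize(neg_log_lik, np.zeros(len(models)), method="BFGS").x
    return {m: base + r[idx[m]] - r.mean() for m in models}  # center at base

def bootstrap_ci(battles, models, n_boot=1000, alpha=0.05, seed=0):
    """Percentile confidence intervals from resampling battles with replacement."""
    rng = np.random.default_rng(seed)
    samples = {m: [] for m in models}
    for _ in range(n_boot):
        picks = rng.integers(len(battles), size=len(battles))
        elo = fit_elo([battles[i] for i in picks], models)
        for m in models:
            samples[m].append(elo[m])
    return {m: (np.quantile(s, alpha / 2), np.quantile(s, 1 - alpha / 2))
            for m, s in samples.items()}

# Hypothetical toy battles among three models
battles = [("A", "B", "A"), ("A", "C", "A"), ("B", "C", "B"),
           ("A", "B", "B"), ("A", "C", "A"), ("B", "C", "C")] * 10
models = ["A", "B", "C"]
elo = fit_elo(battles, models)
ci = bootstrap_ci(battles, models, n_boot=100)

# Rank agreement with an external (here: made-up) Arena ordering
ours = sorted(models, key=lambda m: -elo[m])
arena = ["A", "B", "C"]
rho, p = spearmanr([ours.index(m) for m in models],
                   [arena.index(m) for m in models])
print(elo, ci, (rho, p))
```

Because Bradley-Terry strengths are only identified up to a shift, the fit is centered at the base rating; resampling whole battles (rather than per-model scores) is what gives the asymmetric intervals shown in the table.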

AutoPipeline

AutoPipeline for structured editing consistency evaluation

AutoPipeline organizes evaluation into task-adaptive pipelines. For object-centric edits, it spatially decouples edited and non-edited regions before applying region-specific metrics; for human-centric edits, it routes face-, body-, and hair-related changes into dedicated evaluation branches with rubric-aware outputs. A minimal code sketch of both ideas follows the figure.

AutoPipeline object-centric and human-centric pipeline overview
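Below is a minimal sketch of the spatial decoupling idea, assuming a binary edit mask is available (e.g., from grounding the instruction's target region). The function names, the keyword router, and the metric choices are illustrative stand-ins, not the paper's exact pipeline.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def region_decoupled_scores(src, edited, mask):
    """src, edited: HxWx3 uint8 images; mask: HxW bool, True = edited region."""
    keep = ~mask
    # Non-edited region: should remain faithful to the source image.
    _, ssim_map = ssim(src, edited, channel_axis=2, full=True)
    consistency = float(ssim_map.mean(axis=2)[keep].mean())
    # Edited region: measure how much actually changed (a semantic metric
    # such as CLIP similarity to the instruction would slot in here).
    delta = np.abs(src.astype(float) - edited.astype(float)).mean(axis=2)
    edit_magnitude = float(delta[mask].mean()) / 255.0
    return {"consistency": consistency, "edit_magnitude": edit_magnitude}

# Human-centric edits are instead routed to dedicated branches; a naive
# keyword router stands in for the pipeline's task-aware parsing.
BRANCHES = {"face": "face_identity_branch", "body": "body_pose_branch",
            "hair": "hair_attribute_branch"}

def route_human_edit(instruction):
    for key, branch in BRANCHES.items():
        if key in instruction.lower():
            return branch
    return "generic_consistency_branch"
```

The key design point is that the two regions are scored against different objectives: the non-edited region is rewarded for staying close to the source, while the edited region is rewarded for actually realizing the instructed change.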

PVC-Judge

Alignment with Human Judgments

We evaluate PVC-Judge on EditScore Reward-Bench and our VCReward-Bench. PVC-Judge achieves strong alignment with human preferences, setting a new state of the art among open-source assessment models and even outperforming GPT-5.1 on average. A minimal sketch of how such pairwise alignment can be scored follows the figure.

PVC-Judge performance on EditScore Reward-Bench and VCReward-Bench
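Below is a minimal sketch of scoring a pairwise judge against expert preference pairs, in the spirit of VCReward-Bench. `judge` is a hypothetical callable that returns "a" or "b"; it stands in for PVC-Judge's actual interface.

```python
from typing import Callable, Iterable, Tuple

Pair = Tuple[str, str, str, str]  # (source_img, edit_a, edit_b, human_choice)

def pairwise_accuracy(judge: Callable[[str, str, str], str],
                      pairs: Iterable[Pair]) -> float:
    """Fraction of pairs where the judge matches the human-preferred edit."""
    correct = total = 0
    for src, a, b, human in pairs:
        # Query both presentation orders to control for position bias;
        # count the pair correct only if both agree with the human label.
        fwd = judge(src, a, b)
        rev = "a" if judge(src, b, a) == "b" else "b"
        total += 1
        correct += int(fwd == rev == human)
    return correct / max(total, 1)
```

Requiring agreement across both presentation orders is a common guardrail for pairwise judges, since many assessment models systematically favor whichever candidate appears first.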

Qualitative Analysis


Struggle with Small Faces

Qualitative analysis case on struggle with small faces

Weak Perception of Inter-Object Relations

Qualitative analysis case on weak perception of inter-object relations

Open-Set Edits

Qualitative analysis case on open-set edits