
ParserGrounderPipe

ParserGrounderPipe is the region-discovery front end for AutoPipeline. It is registered as parser-grounder and implemented in src/autopipeline/components/modules/parser_grounder.py.

This is the module that turns a natural-language edit instruction into structured edit targets and pixel-space bounding boxes. If you need region-aware metrics such as sam_clip_cls_sim, sam_dino_cls_sim, or any mask-based object-centric measurement, this pipe is usually the first prerequisite.

Class
Overview

Registry Entry

| Field | Value |
| --- | --- |
| Registry key | `parser-grounder` |
| Class | `ParserGrounderPipe` |
| Selected from | `parser_grounder_config` |
| Return type | `tuple[dict, list]` on success; `(None, None)` on failure |
Constructor

```python
ParserGrounderPipe(
    config,
    prompt_template_dict: Dict[str, PromptTemplate],
)
```

Parameters

| Parameter | Type | Required | Meaning |
| --- | --- | --- | --- |
| `config` | `dict` | Yes | Parser-grounder config block containing `instruction_parser` and `general_grounder`. |
| `prompt_template_dict` | `dict[str, PromptTemplate]` | Yes | Prompt assets keyed by `instruction_parser` and `general_grounder`. |

Required config shape

The constructor expects both sub-blocks to exist:

  • config["instruction_parser"]["init_config"]
  • config["general_grounder"]["init_config"]

Each init_config is passed to build_client(...), so it usually contains:

  • backend
  • model_name
  • backend-specific connection fields such as base_url, api_key, ip_address, or port
  • prompt_info
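As a quick sanity check before constructing the pipe, the required shape can be validated up front. This helper is illustrative, not part of the module:

```python
def check_parser_grounder_config(config: dict) -> None:
    """Illustrative helper: assert the config shape the constructor expects.

    Only the two `init_config` sub-blocks are strictly required; the
    fields inside each one depend on the chosen backend.
    """
    for block in ("instruction_parser", "general_grounder"):
        if block not in config:
            raise KeyError(f"missing config block: {block}")
        if "init_config" not in config[block]:
            raise KeyError(f"missing {block}.init_config")
```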

Initialization behavior

During construction the pipe:

  1. builds one client for instruction parsing
  2. builds one client for visual grounding
  3. selects google_style or openai_style prompt serialization from the parser backend
  4. resolves optional output schemas from prompt metadata
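Step 3 can be pictured as a check on the parser backend name. The exact mapping below is an assumption for illustration; the real decision lives inside the constructor:

```python
def select_serialization_style(parser_backend: str) -> str:
    # Assumed mapping: Google-family backends get google_style prompts,
    # everything else (vllm, openai, ...) gets openai_style. This is a
    # sketch, not the module's actual logic.
    return "google_style" if "gemini" in parser_backend.lower() else "openai_style"
```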
Methods

Method Reference

The underscore-prefixed methods are internal helpers; `__call__` is the public entry point.

| Method | Purpose |
| --- | --- |
| `_get_schema(schema_name=None)` | Resolve a schema class from `schemas.pipeline_io`. |
| `_check_format_by_schema(raw_data, schema_class)` | Validate model output against the resolved schema. |
| `_prepare_grounding_inputs(input_dict, objects_dict, edit_task_type)` | Decide which image or images to ground and which labels belong to each image. |
| `_valied_grounding_output(grounding_output, object_list)` | Verify that grounding output is a list and only contains expected labels. |
| `__call__(input_dict, **kwargs)` | Run the full parse-then-ground pipeline and return structured objects plus flattened coordinates. |
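The check described for `_valied_grounding_output` can be sketched as a standalone function. The `{label, bbox}` item shape is an assumption for illustration:

```python
def is_valid_grounding_output(grounding_output, object_list) -> bool:
    """Illustrative re-implementation of the documented check: the output
    must be a list, and every returned label must come from the expected
    object list."""
    if not isinstance(grounding_output, list):
        return False
    allowed = set(object_list)
    return all(item.get("label") in allowed for item in grounding_output)
```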
Signature

Call Signature

```python
ParserGrounderPipe.__call__(
    input_dict: Dict[str, Any],
    **kwargs,
)
```
Input / Output

Runtime Input Contract

Common input fields

| Field | Required | Meaning |
| --- | --- | --- |
| `instruction` | Yes | Natural-language edit instruction. |
| `edit_task` | Yes | Normalized task type used to route grounding behavior. |
| `input_image` | Usually | Reference image before editing. |
| `edited_image` | Task-dependent | Edited image after applying the instruction. |
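A minimal input sketch, assuming images are passed as file paths (the actual image type depends on the pipeline's loader, and the paths here are made up):

```python
input_dict = {
    "instruction": "replace the red car with a blue bicycle",
    "edit_task": "SUBJECT_REPLACE",
    "input_image": "samples/scene_before.png",   # illustrative path
    "edited_image": "samples/scene_after.png",   # illustrative path
}
```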

Task routing behavior

_prepare_grounding_inputs(...) is where task-specific routing happens.

| `edit_task` family | Images sent to grounding | Labels used for grounding |
| --- | --- | --- |
| SUBJECT_ADD | `edited_image` | `edited_objects` |
| SUBJECT_REMOVE, COLOR_ALTER, MATERIAL_ALTER | `input_image` | `edited_objects` |
| SUBJECT_REPLACE | `input_image`, `edited_image` | `edited_objects`, `generated_objects` |
| OBJECT_EXTRACTION, OREF, SIZE_ADJUSTMENT | `input_image`, `edited_image` | `edited_objects` on both sides |
| PS_HUMAN, MOTION_CHANGE | `input_image`, `edited_image` | `edited_subjects` on both sides |
| CREF | `input_image` | `edited_subjects` |

If you add a new task family, this routing table is one of the first places that must change.
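The routing above can be sketched as a plain lookup. This is a simplification of `_prepare_grounding_inputs(...)` with hypothetical return values (image keys plus one label key per image):

```python
# Illustrative routing table mirroring the documented behavior.
ROUTING = {
    "SUBJECT_ADD": (["edited_image"], ["edited_objects"]),
    "SUBJECT_REMOVE": (["input_image"], ["edited_objects"]),
    "COLOR_ALTER": (["input_image"], ["edited_objects"]),
    "MATERIAL_ALTER": (["input_image"], ["edited_objects"]),
    "SUBJECT_REPLACE": (["input_image", "edited_image"],
                        ["edited_objects", "generated_objects"]),
    "OBJECT_EXTRACTION": (["input_image", "edited_image"],
                          ["edited_objects", "edited_objects"]),
    "OREF": (["input_image", "edited_image"],
             ["edited_objects", "edited_objects"]),
    "SIZE_ADJUSTMENT": (["input_image", "edited_image"],
                        ["edited_objects", "edited_objects"]),
    "PS_HUMAN": (["input_image", "edited_image"],
                 ["edited_subjects", "edited_subjects"]),
    "MOTION_CHANGE": (["input_image", "edited_image"],
                      ["edited_subjects", "edited_subjects"]),
    "CREF": (["input_image"], ["edited_subjects"]),
}

def route_grounding(edit_task: str):
    """Return (image keys to ground, label key per image) for a task family."""
    return ROUTING[edit_task]
```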

Return Value

On success, the pipe returns:

```python
(
    extraction_json_response,
    all_coords,
)
```

extraction_json_response

A parsed object dictionary returned by the instruction parser, augmented with:

  • `bboxes`: a list of bbox lists, one list per grounded image

Typical keys include:

  • edited_objects
  • generated_objects
  • edited_subjects
  • edit_attributes

depending on the prompt schema that was used.
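An illustrative shape of the augmented response for a SUBJECT_REPLACE edit; the values are made up, and `bboxes` holds one bbox list per grounded image:

```python
extraction_json_response = {
    "edited_objects": ["red car"],          # parsed from the instruction
    "generated_objects": ["blue bicycle"],  # parsed from the instruction
    "bboxes": [
        [[120, 88, 340, 260]],   # boxes grounded in input_image
        [[118, 90, 330, 255]],   # boxes grounded in edited_image
    ],
}
```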

all_coords

A flattened list of (x1, y1, x2, y2) tuples in image pixel coordinates. Downstream modules often consume this directly.
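Because each entry is a pixel-space `(x1, y1, x2, y2)` tuple, downstream code can consume `all_coords` directly; a trivial illustrative helper:

```python
def box_sizes(all_coords):
    """Compute (width, height) in pixels for each (x1, y1, x2, y2) box."""
    return [(x2 - x1, y2 - y1) for (x1, y1, x2, y2) in all_coords]
```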

Coordinate Ordering Semantics

The coordinate order is not arbitrary. For multi-image tasks, the code flattens coordinates in the same order that images were processed. Several downstream metrics assume that:

  • the first half belongs to the reference-side image
  • the second half belongs to the edited-side image

This convention is especially important for:

  • sam_clip_cls_sim
  • sam_dino_cls_sim

If you change the ordering logic, you can silently break object-to-object pairing downstream.
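A sketch of the half-split convention that `sam_clip_cls_sim`-style metrics rely on, assuming an even-length coordinate list from a two-image task:

```python
def split_reference_edited(all_coords):
    """Split a flattened coordinate list into (reference-side, edited-side)
    halves, per the ordering convention described above."""
    if len(all_coords) % 2 != 0:
        raise ValueError("expected an even number of boxes for a two-image task")
    mid = len(all_coords) // 2
    return all_coords[:mid], all_coords[mid:]
```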

Config

Minimal Config Example

```yaml
parser_grounder_config:
  instruction_parser:
    default_config: ${pipes_default.instruction_parser}
    init_config:
      backend: vllm
      model_name: Qwen3-4B-Instruct-2507
      prompt_info:
        prompt_id: llm/instruction_parsing/basic_object
        version: v1

  general_grounder:
    default_config: ${pipes_default.general_grounder}
    init_config:
      backend: vllm
      model_name: Qwen3-VL-8B-Instruct
      prompt_info:
        prompt_id: vlm/grounding/general_grounding
        version: v1
```

This matches the current object-centric pipeline configs under configs/pipelines/object_centric/.

Failure Mode

Failure and Retry Behavior

The pipe is intentionally retry-heavy because both parsing and grounding are prompt-driven.

Instruction parsing failure

If the parser response:

  • cannot be parsed as JSON, or
  • does not satisfy the expected output schema

the pipe retries until instruction_parser.retries is exhausted.

Grounding failure

Grounding is considered invalid when:

  • the grounding JSON is not a list
  • a returned label is not present in the expected object list
  • no valid pixel coordinates can be extracted
  • one of the grounded image slots ends up with an empty bbox list

If grounding still fails after the retry budget, the pipe returns:

```python
(None, None)
```
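Callers should therefore treat `(None, None)` as a terminal failure. A minimal guard, where `pipe` stands for an already-constructed `ParserGrounderPipe` and the wrapper name is hypothetical:

```python
def run_parser_grounder(pipe, input_dict):
    # Convert the (None, None) failure sentinel into an explicit error
    # instead of letting it propagate into downstream metrics.
    response, coords = pipe(input_dict)
    if response is None and coords is None:
        raise RuntimeError("grounding failed after exhausting retries")
    return response, coords
```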
Extension

Extension Notes

  • Add a new task family by updating _prepare_grounding_inputs(...) first.
  • Keep prompt schemas aligned with schemas.pipeline_io, otherwise retries will mask real incompatibilities.
  • Reuse this pipe if your change is only a new parsing prompt or grounding prompt.
  • Create a new pipe only if the runtime is no longer "parse instruction, then ground visual targets."