DINOv3Pipe

DINOv3Pipe is registered as dino-v3-pipe and implemented in src/autopipeline/components/modules/dino_pipe.py.

Its public surface mirrors CLIPPipe, but the semantics are more structure-oriented. In practice this pipe is used for:

  • patch-structure consistency
  • object-level DINO CLS similarity after SAM isolation

Registry Entry

| Field | Value |
| --- | --- |
| Registry key | dino-v3-pipe |
| Class | DINOv3Pipe |
| Main mixins | DINOv3Mixin, MaskProcessor |
| Optional dependency | SAMSegmentationMixin via nested sam config |

Constructor

DINOv3Pipe(**kwargs)

Supported init kwargs

| Key | Required | Default | Meaning |
| --- | --- | --- | --- |
| model_path | Yes in practice | none | DINOv3 checkpoint path. The current mixin expects a valid string path. |
| device | No | auto | Torch device for DINO inference. |
| sam | No | not set | Nested SAM config used only for sam_dino_cls_sim. |

Derived attributes

During initialization the mixin sets:

  • input_image_size
  • patch_size

based on EMBED_MODEL_RESOLUTION.

If sam is present, the constructor creates a SAMSegmentationMixin instance and stores it as self.sam_block.

Public Methods

| Method | Purpose |
| --- | --- |
| _calc_self_sim_matrix(features) | Normalize patch features and compute a self-similarity matrix. |
| calc_structure_similarity(ref_image, edited_image, bool_mask) | Compare reference and edited self-similarity matrices. |
| _pad_image_to_target_size(cropped_image, bg_color) | Square-pad and resize a subject crop to the DINO input size. |
| calc_object_pad_cls_sim(ref_image, edited_image, coords, bg_color) | Compare DINO CLS features for paired object crops. |
| __call__(...) | Dispatch between structure similarity and SAM-backed CLS similarity. |
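
The square-pad step can be illustrated with a short Pillow sketch. This is not the pipe's own code: the function name pad_to_square, the centered placement, and the target_size default of 224 are assumptions made for illustration (the real input size is derived from EMBED_MODEL_RESOLUTION).

```python
from PIL import Image


def pad_to_square(cropped, target_size=224, bg_color=(255, 255, 255)):
    """Center a crop on a square canvas, then resize it to target_size.

    target_size=224 is a placeholder, not the pipe's confirmed input size.
    """
    side = max(cropped.size)  # the longer edge defines the square canvas
    canvas = Image.new("RGB", (side, side), bg_color)
    # Assumption: the crop is centered; the padding is filled with bg_color.
    offset = ((side - cropped.width) // 2, (side - cropped.height) // 2)
    canvas.paste(cropped.convert("RGB"), offset)
    return canvas.resize((target_size, target_size))
```

Padding before resizing keeps the object's aspect ratio intact, so a tall or wide crop is not stretched when it is brought to the model's square input.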

Call Signature

DINOv3Pipe.__call__(
    ref_image: Image.Image,
    edited_image: Image.Image,
    coords: List[Tuple[int, int, int, int]] = None,
    mask_mode: str = None,
    metric: str = "dinov3_structure_similarity",
    **kwargs,
)

Runtime Inputs

| Argument | Required | Meaning |
| --- | --- | --- |
| ref_image | Yes | Reference image. |
| edited_image | Yes | Edited image. |
| coords | Metric-dependent | Boxes used for masking or object pairing. |
| mask_mode | Only for structure mode | Region polarity, normally derived from pipeline scope. |
| metric | Yes | dinov3_structure_similarity or sam_dino_cls_sim. |

Extra runtime kwargs

| Key | Used by | Default | Meaning |
| --- | --- | --- | --- |
| patch_mask_threshold | structure mode | 0.1 | Threshold for patch selection after resizing the region mask. |
| bg_color | SAM-backed CLS mode | (255, 255, 255) | Fill color for isolated object crops. |
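
To make the role of patch_mask_threshold concrete, here is a minimal NumPy sketch of one way a resized region mask could be reduced to a per-patch boolean mask. The function name, the average-pooling reduction, and the strict greater-than comparison are all assumptions for illustration; the pipe's actual reduction may differ.

```python
import numpy as np


def region_mask_to_patch_mask(region_mask, patch_size, threshold=0.1):
    """Reduce a pixel mask to one boolean per patch (hypothetical helper).

    region_mask: (H, W) array with values in [0, 1], already resized to the
    model input size, with H and W divisible by patch_size.
    """
    mask = np.asarray(region_mask, dtype=float)
    h, w = mask.shape
    rows, cols = h // patch_size, w // patch_size
    # Split the mask into patch_size x patch_size cells and average each one.
    cells = mask.reshape(rows, patch_size, cols, patch_size)
    coverage = cells.mean(axis=(1, 3))
    # Assumption: a patch is kept when its coverage exceeds the threshold.
    return (coverage > threshold).reshape(-1)
```

Under this reading, a threshold of 0.1 keeps any patch whose area is more than 10% inside the region, so patches that only graze the region boundary are dropped.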

Supported Metrics

| Metric | What it measures | Return type | Better direction |
| --- | --- | --- | --- |
| dinov3_structure_similarity | similarity of patch self-similarity structure | float | higher is better |
| sam_dino_cls_sim | DINO CLS similarity on SAM-isolated paired object crops | float or None | higher is better |

dinov3_structure_similarity

For this metric the pipe:

  1. builds a resized region mask
  2. converts it to a patch mask
  3. extracts DINO patch embeddings
  4. normalizes patch embeddings
  5. computes reference and edited self-similarity matrices
  6. measures MSE between those matrices
  7. converts the loss into a bounded score with 1 / (1 + loss * 100)
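
Steps 4 to 7 can be sketched in NumPy as follows. This is an illustration rather than the pipe's implementation: it assumes cosine-style self-similarity on L2-normalized features and that the same patch mask is applied to both images. The function names are hypothetical.

```python
import numpy as np


def self_similarity(features):
    """Cosine self-similarity between all pairs of patch features.

    features: (num_patches, dim) array of patch embeddings.
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    normalized = features / np.clip(norms, 1e-12, None)  # avoid division by zero
    return normalized @ normalized.T


def structure_score(ref_feats, edit_feats, patch_mask, scale=100.0):
    """Masked self-similarity MSE mapped to a score in (0, 1]."""
    if not patch_mask.any():
        return 0.0  # documented fallback when no valid patches survive masking
    ref_sim = self_similarity(ref_feats[patch_mask])
    edit_sim = self_similarity(edit_feats[patch_mask])
    loss = float(np.mean((ref_sim - edit_sim) ** 2))
    # Documented conversion: 1 / (1 + loss * 100).
    return 1.0 / (1.0 + loss * scale)
```

Because the loss is non-negative, the score equals 1.0 only when the two self-similarity matrices match exactly, and it decays toward 0 as they diverge. The factor of 100 makes the score sensitive to small MSE values.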

sam_dino_cls_sim

This metric uses the same coordinate pairing convention as sam_clip_cls_sim:

  • first half of coords -> reference objects
  • second half of coords -> edited objects

Each paired crop is segmented with SAM, padded to a square, encoded with DINO, and compared by cosine similarity.
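
The pairing convention can be expressed as a small helper. The name pair_coords is hypothetical, and treating an empty list like the odd-length case is an assumption; only the odd-length behavior is stated in this document.

```python
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]


def pair_coords(coords: List[Box]) -> Optional[List[Tuple[Box, Box]]]:
    """Pair the i-th reference box with the i-th edited box."""
    # Assumption: empty input is rejected the same way as odd-length input.
    if not coords or len(coords) % 2 != 0:
        return None
    half = len(coords) // 2
    # First half -> reference objects, second half -> edited objects.
    return list(zip(coords[:half], coords[half:]))
```

For example, four boxes [r0, r1, e0, e1] yield the pairs (r0, e0) and (r1, e1), so callers must pass reference and edited boxes in matching order.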

Return Value

The pipe returns a single float score, except in object mode, where it can return None.

Depending on the metric branch:

  • structure mode returns 0.0 if no valid patches survive masking
  • object mode returns None if paired crops cannot be produced
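
Because 0.0 is a real score while None means that no score could be computed, downstream aggregation should treat the two differently. A minimal sketch of such a caller follows; summarize_scores is a hypothetical helper, not part of the pipe.

```python
from statistics import mean
from typing import Iterable, Optional, Tuple


def summarize_scores(
    scores: Iterable[Optional[float]],
) -> Tuple[Optional[float], int]:
    """Average the valid scores and count the samples that returned None."""
    results = list(scores)
    # Keep 0.0 (a valid score) but drop None (no score was produced).
    valid = [s for s in results if s is not None]
    skipped = len(results) - len(valid)
    return (mean(valid) if valid else None), skipped
```

Filtering with `is not None` rather than a truthiness check matters here, since `if s` would also discard legitimate 0.0 scores from structure mode.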

Minimal Config Examples

Structure-preservation usage

metric_configs:
  dinov3_structure_similarity:
    pipe_name: dino-v3-pipe
    default_config: ${pipes_default.dino-v3-pipe}
    init_config:
      scope: edit_area
    runtime_params:
      patch_mask_threshold: 0.1

Object-reference usage with SAM

metric_configs:
  sam_dino_cls_sim:
    pipe_name: dino-v3-pipe
    default_config: ${pipes_default.dino-v3-pipe}
    init_config:
      sam: ${pipes_default.sam-pipe}
      scope: edit_area
    runtime_params:
      bg_color: !tuple [255, 255, 255]

Failure Modes

The important soft-failure paths are:

  • empty patch mask in structure mode -> returns 0.0
  • odd number of coords in object mode -> returns None
  • empty SAM crop -> returns None

The most important init-time precondition is:

  • model_path must be a valid DINO checkpoint path

Unlike CLIPMixin, the current DINOv3Mixin does not fall back to a safe default for model_path, so it must be set explicitly.

Extension Notes

  • Extend this pipe when the backbone remains DINO-like and the output semantics stay feature-based.
  • Keep the patch-mask and coordinate conventions aligned with MaskProcessor and ParserGrounderPipe.
  • If you introduce a new DINO metric, add it as a new branch inside __call__ and document the score direction explicitly.