CLIPPipe

CLIPPipe is registered as clip-pipe and implemented in src/autopipeline/components/modules/clip_pipe.py.

It provides two related but distinct CLIP-based metrics:

  • patch-level semantic distance outside or inside an edited region
  • object-level CLS similarity on SAM-isolated subject crops

Registry Entry

| Field | Value |
| --- | --- |
| Registry key | clip-pipe |
| Class | CLIPPipe |
| Main mixins | CLIPMixin, MaskProcessor |
| Optional dependency | SAMSegmentationMixin via nested sam config |

Constructor

CLIPPipe(**kwargs)

Supported init kwargs

| Key | Required | Default | Meaning |
| --- | --- | --- | --- |
| model_path | No | openai/clip-vit-base-patch32 | CLIP vision checkpoint. |
| device | No | auto | Torch device for CLIP inference. |
| sam | No | not set | Nested SAM config used only for sam_clip_cls_sim. |

Derived attributes

During initialization the pipe also derives img_input_size and patch_size from EMBED_MODEL_RESOLUTION.
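
To illustrate how the two derived values relate, the sketch below computes the patch grid for a square input. The lookup table is a hypothetical stand-in for EMBED_MODEL_RESOLUTION, whose real structure is not shown on this page; the 224 and 32 values are those of the default openai/clip-vit-base-patch32 checkpoint.

```python
from typing import Tuple

# Hypothetical stand-in for EMBED_MODEL_RESOLUTION; the real table's
# keys and layout are not documented on this page.
EXAMPLE_RESOLUTION = {
    "openai/clip-vit-base-patch32": {"img_input_size": 224, "patch_size": 32},
}


def patch_grid(img_input_size: int, patch_size: int) -> Tuple[int, int]:
    """Return (patches per side, total patches) for a square input."""
    if img_input_size % patch_size != 0:
        raise ValueError("input size must be a multiple of the patch size")
    per_side = img_input_size // patch_size
    return per_side, per_side * per_side


cfg = EXAMPLE_RESOLUTION["openai/clip-vit-base-patch32"]
print(patch_grid(cfg["img_input_size"], cfg["patch_size"]))  # (7, 49)
```

With the default checkpoint, a region mask therefore ends up as a 7 x 7 patch selection map.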

If sam is provided, the constructor creates:

self.sam_block = SAMSegmentationMixin(**dict(kwargs["sam"]))

Because sam_block is created only when sam is present, sam_clip_cls_sim requires a valid nested SAM config at init time.

Public Methods

| Method | Purpose |
| --- | --- |
| calc_emd(ref_image, edited_image, bool_mask) | Compute Earth Mover's Distance over CLIP patch features. |
| _pad_image_to_target_size(cropped_image, bg_color) | Square-pad and resize an isolated object crop to the CLIP input size. |
| calc_object_pad_cls_sim(ref_image, edited_image, coords, bg_color) | Compare CLIP CLS features on paired, SAM-isolated object crops. |
| __call__(...) | Dispatch between emd and sam_clip_cls_sim. |

Call Signature

CLIPPipe.__call__(
    ref_image: Image.Image,
    edited_image: Image.Image,
    coords: List[Tuple[int, int, int, int]] = None,
    mask_mode: str = None,
    metric: str = "emd",
    **kwargs,
)

Runtime Inputs

| Argument | Required | Meaning |
| --- | --- | --- |
| ref_image | Yes | Reference image. |
| edited_image | Yes | Edited image. |
| coords | Metric-dependent | Region boxes used for masking or object pairing. |
| mask_mode | Only for emd | inner or outer, usually derived from pipeline scope. |
| metric | No (defaults to emd) | emd or sam_clip_cls_sim. |

Extra runtime kwargs

| Key | Used by | Default | Meaning |
| --- | --- | --- | --- |
| patch_mask_threshold | emd | 0.1 | Threshold when converting resized masks into patch masks. |
| bg_color | sam_clip_cls_sim | (255, 255, 255) | Fill color for isolated object crops. |

Supported Metrics

| Metric | What it measures | Return type | Better direction |
| --- | --- | --- | --- |
| emd | Patch-level CLIP feature distance | float or None | lower is better |
| sam_clip_cls_sim | CLS cosine similarity on SAM-isolated object crops | float or None | higher is better |
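
As a reminder of what "CLS cosine similarity" means numerically, here is a minimal pure-Python sketch. The toy vectors stand in for CLS embeddings; the pipe itself presumably works on torch tensors, and the function name is illustrative only.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy embeddings: identical direction scores 1, orthogonal scores 0.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

A score near 1 means the two object crops embed in nearly the same direction, which is why higher is better for this metric.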

emd

For emd, the pipe:

  1. resizes the region mask to the CLIP input resolution
  2. converts the mask into a patch-level selection map
  3. extracts reference features from the whole reference image
  4. extracts edited features from only the selected edited patches
  5. computes EMD with ot.emd2(...)
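
Steps 1 and 2 can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipe's actual code: it assumes the threshold is compared against the fraction of masked pixels in each patch, that the comparison is strict, and that outer is simply the complement of inner. The function name is hypothetical.

```python
from typing import List


def mask_to_patch_mask(
    mask: List[List[int]],
    patch_size: int,
    threshold: float = 0.1,
    mask_mode: str = "inner",
) -> List[List[bool]]:
    """Turn a square 0/1 pixel mask into a per-patch selection map."""
    if mask_mode not in ("inner", "outer"):
        raise ValueError("mask_mode must be 'inner' or 'outer'")
    per_side = len(mask) // patch_size
    area = patch_size * patch_size
    selected = []
    for py in range(per_side):
        row = []
        for px in range(per_side):
            # Count masked pixels that fall inside this patch.
            covered = sum(
                mask[y][x]
                for y in range(py * patch_size, (py + 1) * patch_size)
                for x in range(px * patch_size, (px + 1) * patch_size)
            )
            # Assumption: a patch is "inside" when coverage exceeds the threshold.
            inside = covered / area > threshold
            row.append(inside if mask_mode == "inner" else not inside)
        selected.append(row)
    return selected


toy_mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
print(mask_to_patch_mask(toy_mask, patch_size=2, mask_mode="outer"))
```

If no entry of the resulting map is True, there is nothing to compare, which corresponds to the case where the pipe returns None.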

This metric is typically used with scope: unedit_area.
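
For context, ot.emd2(a, b, M) from the POT library returns the optimal value of the discrete optimal transport problem below. Here a and b are the weight vectors over the n reference features and the m selected edited features, and M is the pairwise cost matrix. Which weights and cost function the pipe passes is not specified on this page, so treat those as open assumptions.

```latex
\mathrm{EMD}(a, b, M) \;=\; \min_{\gamma \ge 0} \; \sum_{i=1}^{n} \sum_{j=1}^{m} \gamma_{ij} \, M_{ij}
\quad \text{subject to} \quad
\sum_{j=1}^{m} \gamma_{ij} = a_i, \qquad \sum_{i=1}^{n} \gamma_{ij} = b_j
```

Because the value is a transport cost, smaller values mean the selected edited patches are closer to the reference features.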

sam_clip_cls_sim

For sam_clip_cls_sim, the pipe:

  1. splits coords into two equal halves
  2. treats the first half as reference boxes
  3. treats the second half as edited-image boxes
  4. uses SAM to isolate each object crop
  5. pads each crop to a square
  6. compares CLIP CLS embeddings

The coordinate list therefore must have even length and preserve reference/edit pairing order.
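
The pairing convention in steps 1 to 3 can be sketched as below. The helper name is hypothetical, and returning None for an empty list is an assumption; the odd-length case follows the soft-failure behavior described on this page.

```python
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]


def pair_boxes(coords: List[Box]) -> Optional[List[Tuple[Box, Box]]]:
    """Pair the i-th reference box with the i-th edited-image box."""
    if not coords or len(coords) % 2 != 0:
        return None  # no valid reference/edit pairing possible
    half = len(coords) // 2
    # First half: reference boxes. Second half: edited-image boxes.
    return list(zip(coords[:half], coords[half:]))


coords = [(0, 0, 10, 10), (20, 20, 30, 30), (1, 1, 11, 11), (21, 21, 31, 31)]
for ref_box, edit_box in pair_boxes(coords):
    print(ref_box, "<->", edit_box)
```

Note that interleaving the boxes (ref, edit, ref, edit) would still have even length but would pair the wrong objects, so ordering has to be ensured by the caller.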

Return Value

The pipe returns a single float score on success.

It returns None when the required region information cannot produce a valid comparison.
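
Since None means "no measurement" rather than a score of zero, code that aggregates per-sample results may want to skip those entries. The helper below is a hypothetical caller-side sketch, not part of the pipe.

```python
from typing import Iterable, Optional


def mean_valid(scores: Iterable[Optional[float]]) -> Optional[float]:
    """Average the scores that are not None; None if nothing is valid."""
    valid = [s for s in scores if s is not None]
    if not valid:
        return None
    return sum(valid) / len(valid)


# The middle sample produced no valid comparison and is ignored.
print(mean_valid([0.5, None, 1.5]))  # 1.0
```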

Minimal Config Examples

Patch-distance usage

metric_configs:
  emd:
    pipe_name: clip-pipe
    default_config: ${pipes_default.clip-pipe}
    init_config:
      scope: unedit_area
    runtime_params:
      patch_mask_threshold: 0.1

Object-pair usage with SAM

metric_configs:
  sam_clip_cls_sim:
    pipe_name: clip-pipe
    default_config: ${pipes_default.clip-pipe}
    init_config:
      sam: ${pipes_default.sam-pipe}
      scope: edit_area
    runtime_params:
      bg_color: !tuple [255, 255, 255]

Failure Modes

The most important soft-failure paths are:

  • no valid patch selected after mask-to-patch conversion -> returns None
  • coords length is odd for object pairing -> returns None
  • SAM crop is empty for any paired object -> returns None

Two practical preconditions are not guarded by a custom error path:

  • sam_clip_cls_sim assumes self.sam_block exists
  • coords must follow the reference-half then edited-half ordering

If either assumption is violated, fix the config or the caller's coords ordering at the source rather than patching downstream results.
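
Because the first precondition has no custom error path, a caller could check it before constructing the pipe. The function below is a hypothetical caller-side guard; it assumes init_config is a plain dict carrying the nested sam entry. The ordering precondition cannot be checked mechanically, since a wrongly ordered list still has even length.

```python
from typing import Any, Dict


def require_sam_config(metric: str, init_config: Dict[str, Any]) -> None:
    """Fail early if sam_clip_cls_sim is requested without a SAM config."""
    if metric == "sam_clip_cls_sim" and not init_config.get("sam"):
        raise ValueError(
            "sam_clip_cls_sim needs a nested 'sam' entry in init_config"
        )


require_sam_config("emd", {})  # fine: emd does not use SAM
require_sam_config("sam_clip_cls_sim", {"sam": {"placeholder": True}})  # fine
```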

Extension Notes

  • Extend this pipe when the feature backbone remains CLIP-based.
  • Add a new metric branch if you want a new way to compare CLIP features.
  • Reuse MaskProcessor conventions so scope continues to map cleanly into inner and outer.
  • If you need a different object isolator than SAM, make that change explicit in the constructor and config surface.