CLIPPipe
CLIPPipe is registered as clip-pipe and implemented in src/autopipeline/components/modules/clip_pipe.py.
It provides two related but distinct CLIP-based metrics:
- patch-level semantic distance outside or inside an edited region
- object-level CLS similarity on SAM-isolated subject crops
Registry Entry
| Field | Value |
|---|---|
| Registry key | clip-pipe |
| Class | CLIPPipe |
| Main mixins | CLIPMixin, MaskProcessor |
| Optional dependency | SAMSegmentationMixin via nested sam config |
Constructor
CLIPPipe(**kwargs)
Supported init kwargs
| Key | Required | Default | Meaning |
|---|---|---|---|
| `model_path` | No | `openai/clip-vit-base-patch32` | CLIP vision checkpoint. |
| `device` | No | `auto` | Torch device for CLIP inference. |
| `sam` | No | not set | Nested SAM config used only for `sam_clip_cls_sim`. |
Derived attributes
During initialization the pipe also derives `img_input_size` and `patch_size` from `EMBED_MODEL_RESOLUTION`.
If sam is provided, the constructor creates:
self.sam_block = SAMSegmentationMixin(**dict(kwargs["sam"]))
That means sam_clip_cls_sim requires a valid nested SAM config at init time.
Public Methods
| Method | Purpose |
|---|---|
| `calc_emd(ref_image, edited_image, bool_mask)` | Compute Earth Mover's Distance over CLIP patch features. |
| `_pad_image_to_target_size(cropped_image, bg_color)` | Square-pad and resize an isolated object crop to the CLIP input size. |
| `calc_object_pad_cls_sim(ref_image, edited_image, coords, bg_color)` | Compare CLIP CLS features on paired, SAM-isolated object crops. |
| `__call__(...)` | Dispatch between `emd` and `sam_clip_cls_sim`. |
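The square-pad step can be pictured with a short sketch. This is not the pipe's actual `_pad_image_to_target_size`; `pad_to_square_and_resize` is a hypothetical helper, and centering the crop, the `target_size=224` default (the input size of the default ViT-B/32 checkpoint), and Pillow's default resampling are all assumptions.

```python
from PIL import Image


def pad_to_square_and_resize(crop, target_size=224, bg_color=(255, 255, 255)):
    """Center `crop` on a square canvas filled with `bg_color`, then resize.

    Hypothetical helper: centering and default resampling are assumptions,
    not confirmed behavior of CLIPPipe.
    """
    crop = crop.convert("RGB")
    side = max(crop.size)  # the longer edge defines the square canvas
    canvas = Image.new("RGB", (side, side), bg_color)
    offset = ((side - crop.width) // 2, (side - crop.height) // 2)
    canvas.paste(crop, offset)
    # Resize the padded square to the model input resolution.
    return canvas.resize((target_size, target_size))
```

Padding before resizing keeps the object's aspect ratio intact, which a direct resize of a non-square crop would distort.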
Call Signature
CLIPPipe.__call__(
    ref_image: Image.Image,
    edited_image: Image.Image,
    coords: List[Tuple[int, int, int, int]] = None,
    mask_mode: str = None,
    metric: str = "emd",
    **kwargs,
)
Runtime Inputs
| Argument | Required | Meaning |
|---|---|---|
| `ref_image` | Yes | Reference image. |
| `edited_image` | Yes | Edited image. |
| `coords` | Metric-dependent | Region boxes used for masking or object pairing. |
| `mask_mode` | Only for `emd` | `inner` or `outer`, usually derived from the pipeline scope. |
| `metric` | No (defaults to `emd`) | `emd` or `sam_clip_cls_sim`. |
Extra runtime kwargs
| Key | Used by | Default | Meaning |
|---|---|---|---|
| `patch_mask_threshold` | `emd` | `0.1` | Threshold when converting resized masks into patch masks. |
| `bg_color` | `sam_clip_cls_sim` | `(255, 255, 255)` | Fill color for isolated object crops. |
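To make the role of `patch_mask_threshold` concrete, here is a minimal sketch of a mask-to-patch conversion. `mask_to_patch_mask` is a hypothetical helper. It assumes the mask is already resized to the CLIP input resolution, that coverage is the fraction of masked pixels per patch, and that a patch is selected with a strict `>` comparison; the source confirms none of these details.

```python
import numpy as np


def mask_to_patch_mask(bool_mask, patch_size=32, threshold=0.1):
    """Return a (grid_h, grid_w) boolean map of selected patches.

    Hypothetical helper: a patch counts as selected when the fraction of
    masked pixels inside it exceeds `threshold` (an assumed rule).
    """
    mask = np.asarray(bool_mask, dtype=np.float32)
    h, w = mask.shape
    if h % patch_size or w % patch_size:
        raise ValueError("mask size must be a multiple of patch_size")
    grid_h, grid_w = h // patch_size, w // patch_size
    # Split the mask into patch blocks and average inside each block.
    blocks = mask.reshape(grid_h, patch_size, grid_w, patch_size)
    coverage = blocks.mean(axis=(1, 3))
    return coverage > threshold
```

With a 224x224 input and 32-pixel patches this yields a 7x7 selection map. A lower threshold lets patches that only graze the region boundary count as selected.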
Supported Metrics
| Metric | What it measures | Return type | Better direction |
|---|---|---|---|
| `emd` | Patch-level CLIP feature distance | `float` or `None` | lower is better |
| `sam_clip_cls_sim` | CLS cosine similarity on SAM-isolated object crops | `float` or `None` | higher is better |
emd
For emd, the pipe:
- resizes the region mask to the CLIP input resolution
- converts the mask into a patch-level selection map
- extracts reference features from the whole reference image
- extracts edited features from only the selected edited patches
- computes EMD with `ot.emd2(...)`
This metric is typically used with scope: unedit_area.
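The final step can be sketched as follows. The source only states that `ot.emd2(...)` is called, so the cosine-distance ground cost and the uniform patch weights below are assumptions, and both function names are hypothetical. The sketch needs the POT package (`pip install pot`), imported lazily so the cost-matrix helper works without it.

```python
import numpy as np


def cosine_cost_matrix(ref_feats, edit_feats):
    """Pairwise cosine distance between two sets of patch features."""
    ref = np.asarray(ref_feats, dtype=np.float64)
    edit = np.asarray(edit_feats, dtype=np.float64)
    ref = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    edit = edit / np.linalg.norm(edit, axis=1, keepdims=True)
    return np.ascontiguousarray(1.0 - ref @ edit.T)


def emd_between_patch_sets(ref_feats, edit_feats):
    """EMD between two patch sets with uniform weights (an assumed choice)."""
    import ot  # POT: Python Optimal Transport

    n_ref, n_edit = len(ref_feats), len(edit_feats)
    a = np.full(n_ref, 1.0 / n_ref)    # uniform mass on reference patches
    b = np.full(n_edit, 1.0 / n_edit)  # uniform mass on edited patches
    cost = cosine_cost_matrix(ref_feats, edit_feats)
    return float(ot.emd2(a, b, cost))
```

The two sets do not need the same number of patches, which matters here because the edited side keeps only the selected patches.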
sam_clip_cls_sim
For sam_clip_cls_sim, the pipe:
- splits `coords` into two equal halves
- treats the first half as reference boxes
- treats the second half as edited-image boxes
- uses SAM to isolate each object crop
- pads each crop to a square
- compares CLIP CLS embeddings
The coordinate list therefore must have even length and preserve reference/edit pairing order.
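The pairing convention and the similarity measure can be illustrated with a small sketch. Both helpers are hypothetical, not the pipe's API. Returning `None` for an empty list is an assumption, and the sketch does not cover how scores from several pairs are combined, because the source does not describe that.

```python
import numpy as np


def split_paired_coords(coords):
    """Pair the i-th reference box with the i-th edited-image box.

    Returns None for an odd-length list, mirroring the documented soft
    failure; returning None for an empty list is an assumption.
    """
    if not coords or len(coords) % 2 != 0:
        return None
    half = len(coords) // 2
    return list(zip(coords[:half], coords[half:]))


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    u = np.asarray(u, dtype=np.float64)
    v = np.asarray(v, dtype=np.float64)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Two reference boxes followed by their two edited-image counterparts.
boxes = [(0, 0, 10, 10), (20, 20, 40, 40), (1, 1, 11, 11), (22, 22, 42, 42)]
pairs = split_paired_coords(boxes)
```

Here `pairs[0]` holds the first reference box and the first edited-image box, so reordering either half silently changes which objects are compared.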
Return Value
The pipe returns a single float score on success.
It returns None when the required region information cannot produce a valid comparison.
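Because `None` is a valid return value, code that averages scores over many samples needs to handle it explicitly. The sketch below shows one option, skipping invalid samples. `mean_valid_score` is a hypothetical helper and is not part of the pipe.

```python
from statistics import mean
from typing import Iterable, Optional


def mean_valid_score(scores: Iterable[Optional[float]]) -> Optional[float]:
    """Average the scores that are not None; return None if none are valid."""
    valid = [s for s in scores if s is not None]
    return mean(valid) if valid else None
```

Skipping invalid samples changes the effective sample size, so it may be worth reporting the number of `None` results alongside the mean.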
Minimal Config Examples
Patch-distance usage
metric_configs:
  emd:
    pipe_name: clip-pipe
    default_config: ${pipes_default.clip-pipe}
    init_config:
      scope: unedit_area
    runtime_params:
      patch_mask_threshold: 0.1
Object-pair usage with SAM
metric_configs:
  sam_clip_cls_sim:
    pipe_name: clip-pipe
    default_config: ${pipes_default.clip-pipe}
    init_config:
      sam: ${pipes_default.sam-pipe}
      scope: edit_area
    runtime_params:
      bg_color: !tuple [255, 255, 255]
Failure Modes
The most important soft-failure paths are:
- no valid patch selected after mask-to-patch conversion -> returns `None`
- `coords` length is odd for object pairing -> returns `None`
- SAM crop is empty for any paired object -> returns `None`
Two practical preconditions are not guarded by a custom error path:
- `sam_clip_cls_sim` assumes `self.sam_block` exists
- `coords` must follow the reference-half then edited-half ordering
If either assumption is violated, extension code should fix the config rather than patching downstream results.
Extension Notes
- Extend this pipe when the feature backbone remains CLIP-based.
- Add a new metric branch if you want a new way to compare CLIP features.
- Reuse `MaskProcessor` conventions so `scope` continues to map cleanly into `inner` and `outer`.
- If you need a different object isolator than SAM, make that change explicit in the constructor and config surface.