
Semantic Consistency Mixins

semantic_consistency.py, implemented at src/autopipeline/components/primitives/semantic_consistency.py, contains the reusable embedding backbones for semantic similarity.

These mixins provide the feature-extraction layer used by CLIPPipe, DINOv3Pipe, and BodyAppearancePipe.

Primitive Group

Included Classes

| Class | Primary role | Typical consumers |
| --- | --- | --- |
| CLIPMixin | CLIP patch and CLS embeddings | clip-pipe |
| DINOv3Mixin | DINOv3 patch and CLS embeddings | dino-v3-pipe, body-appearance-pipe |

CLIPMixin

Class Signature

CLIPMixin(**kwargs)

Constructor Parameters

| Key | Required | Default | Meaning |
| --- | --- | --- | --- |
| model_path | No | openai/clip-vit-base-patch32 | CLIP vision checkpoint. |
| device | No | auto | Torch device used for inference. |

Derived Attributes

The mixin derives:

  • img_input_size
  • patch_size

from EMBED_MODEL_RESOLUTION, with a fallback of (224, 32).
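The derivation above amounts to a dictionary lookup keyed on the checkpoint name, with a fallback. A minimal sketch of that pattern; the contents and key format of EMBED_MODEL_RESOLUTION here are assumptions, not the real table:

```python
# Hypothetical resolution table; real entries live in the source module.
EMBED_MODEL_RESOLUTION = {
    "clip-vit-base-patch32": (224, 32),
    "clip-vit-large-patch14": (224, 14),
}

def resolve_resolution(model_path: str, fallback=(224, 32)):
    """Map a checkpoint path to (img_input_size, patch_size), falling back
    to a default when the checkpoint name is not in the table."""
    key = model_path.split("/")[-1]
    return EMBED_MODEL_RESOLUTION.get(key, fallback)
```

For example, `resolve_resolution("openai/clip-vit-base-patch32")` yields `(224, 32)`, and an unknown checkpoint falls back to the default.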

Public Methods

| Method | Purpose |
| --- | --- |
| get_features(image, mask=None) | Return patch embeddings, optionally filtered by a boolean mask. |
| get_cls_feature(image) | Return the CLS embedding. |

Input / Output Contract

get_features(image, mask=None)

  • input:
    • PIL image
    • optional boolean mask over patches
  • output:
    • patch embeddings with the CLS token removed
    • if mask is provided, returns only embeddings[mask, :]
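A minimal numpy sketch of this contract; the real mixin operates on torch tensors, and `filter_patches` is an illustrative name, not the actual implementation:

```python
import numpy as np

def filter_patches(embeddings: np.ndarray, mask=None) -> np.ndarray:
    """Drop the CLS token (row 0), then optionally keep only masked patches.

    `embeddings` has shape (1 + num_patches, dim); `mask` is a boolean
    array over the patches. Illustrative sketch of the documented contract.
    """
    patches = embeddings[1:, :]      # remove the CLS token
    if mask is not None:
        patches = patches[mask, :]   # keep only patches where mask is True
    return patches
```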

get_cls_feature(image)

  • input: PIL image
  • output: one CLS embedding tensor

DINOv3Mixin

Class Signature

DINOv3Mixin(**kwargs)

Constructor Parameters

| Key | Required | Default | Meaning |
| --- | --- | --- | --- |
| model_path | Yes in practice | None in code | DINOv3 checkpoint path. |
| device | No | auto | Torch device used for inference. |

Important Implementation Caveat

Although the code uses kwargs.get("model_path", None), it immediately calls:

model_path.split("/")[-1]

So model_path is effectively required.
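A minimal reproduction of the caveat: with `model_path` omitted, `kwargs.get` returns `None`, and the subsequent `.split` call raises `AttributeError` at construction time.

```python
kwargs = {}  # model_path omitted, as a misconfigured init_config would do
model_path = kwargs.get("model_path", None)
try:
    model_name = model_path.split("/")[-1]  # None.split(...) -> AttributeError
except AttributeError as exc:
    print(f"DINOv3Mixin would fail here: {exc}")
```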

Derived Attributes

The mixin derives:

  • input_image_size
  • patch_size

from EMBED_MODEL_RESOLUTION, with a fallback of (224, 16).

Public Methods

| Method | Purpose |
| --- | --- |
| get_features(image, mask=None) | Return DINO patch embeddings after removing special tokens. |
| get_cls_feature(image) | Return the CLS embedding. |

Input / Output Contract

get_features(image, mask=None)

  • input:
    • PIL image
    • optional boolean mask over patches
  • output:
    • patch embeddings with both the CLS token and the DINO register tokens removed
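Stripping the special tokens reduces to slicing off a fixed-size prefix of the token sequence. A numpy sketch of that step; the register-token count here is an assumption (DINO variants with registers commonly prepend four), and the real mixin works on torch tensors:

```python
import numpy as np

NUM_REGISTER_TOKENS = 4  # assumed count; check the actual DINOv3 checkpoint

def strip_special_tokens(tokens: np.ndarray) -> np.ndarray:
    """Remove the CLS token and the register tokens from a DINO token sequence.

    `tokens` has shape (1 + NUM_REGISTER_TOKENS + num_patches, dim); the
    returned array contains only the patch embeddings.
    """
    return tokens[1 + NUM_REGISTER_TOKENS:, :]
```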

get_cls_feature(image)

  • input: PIL image
  • output: one CLS embedding tensor

Minimal Config Example

CLIP-based module init

init_config:
  model_path: openai/clip-vit-base-patch32
  device: cuda

DINO-based module init

init_config:
  model_path: ${user_config.model_paths.dino_v3_path}
  device: cuda

Failure Semantics

Neither mixin adds a local recovery layer around:

  • Hugging Face model loading
  • processor construction
  • tensor indexing with incompatible masks

These failures bubble up to the module layer.
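Because the mixins do not catch these errors, any recovery belongs to the consuming module. A hedged sketch of what that might look like; `build_mixin`, the exception types caught, and the re-raise policy are illustrative, not the real module-layer API:

```python
def load_backbone(build_mixin, init_config: dict):
    """Construct an embedding mixin, translating load failures into a
    single module-level error. Illustrative only."""
    try:
        return build_mixin(**init_config)
    except (OSError, ValueError) as exc:  # e.g. missing or bad checkpoint
        raise RuntimeError(f"backbone load failed: {exc}") from exc
```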

Extension Notes

  • Use this file to change feature backbones or token-handling logic.
  • Keep actual score formulas in the module layer.
  • If you add a new embedding backbone, follow the same split:
    • patch-level feature method
    • CLS feature method
    • explicit input-size / patch-size metadata
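The split above can be sketched as a skeleton for a new backbone mixin; `MyBackboneMixin` and the fallback values are hypothetical names chosen for illustration:

```python
class MyBackboneMixin:
    """Hypothetical new embedding backbone following the documented split:
    patch-level features, CLS feature, and explicit size metadata."""

    def __init__(self, **kwargs):
        self.model_path = kwargs.get("model_path")
        self.device = kwargs.get("device", "auto")
        # Explicit input-size / patch-size metadata, with a fallback.
        self.img_input_size, self.patch_size = (224, 16)

    def get_features(self, image, mask=None):
        """Patch-level feature method (special tokens removed)."""
        raise NotImplementedError

    def get_cls_feature(self, image):
        """CLS feature method."""
        raise NotImplementedError
```

Score formulas stay out of this class and live in the module layer, as the notes above prescribe.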