AI: Add a CLIP-powered semantic search #941

Open
opened 2026-02-20 00:02:13 -05:00 by deekerman · 9 comments
Owner

Originally created by @flozi00 on GitHub (May 12, 2021).

As a user, I want to perform a CLIP-powered semantic search using a local model so that I can find pictures via natural language queries without relying on external cloud services.

Implementation Steps:

  1. Model plumbing
    • Add ModelTypeClip and default config in internal/ai/vision/models.go; expose ONNX loader using existing runtime glue.
    • Add a text encoder runner (shared tokenizer) and image encoder preprocessing (224/256 px). Cache model files under ModelsPath/clip/<name>.
  2. Indexing
    • Worker: add CLIP embedding generation to internal/workers/vision.go and internal/workers/meta.go, respecting the Run schedule. Generate embeddings in batches; optional matryoshka down-projection (https://arxiv.org/pdf/2205.13147) to 64-d via config.
    • Persist embeddings via the repository internal/entity/clip_embedding.go, with migrations per DB driver. In Phase 1, SQLite gets persistence only (search disabled).
  3. Search pipeline
    • Extend search.Query to accept ClipQuery and ClipWeight (float).
    • For DBs with vector ops: ORDER BY distance using driver-specific SQL; then apply existing ordering as secondary.
    • Phase 1: no Go fallback scorer; SQLite and MariaDB 10.5 return “CLIP search not supported” unless a vector backend is present. A future phase may add a Go fallback scorer if requested.

Model Comparison

| Criterion (priority ↓) | UForm3 multi-base | OpenCLIP ViT-B/32 | OpenCLIP ViT-L/14 | SigLIP B/16 | SigLIP L/16 |
|---|---|---|---|---|---|
| Params / size | 206M (~800 MB fp16 ONNX) | 151M (~600 MB) | 428M (~1.6 GB) | 203M (~900 MB) | 652M (~2.5 GB) |
| Embedding dim | 64–768 (matryoshka; default 768) | 512 | 768 | 768 | 768 |
| Languages | 20+ multilingual | English-centric | English-centric | Multilingual (WebLI) | Multilingual (WebLI) |
| Speed on CPU (relative) | 2–4× faster vs competitors; ONNX native | fast | slow | medium | slow |
| Accuracy (ImageNet zero-shot) | ~75% (vendor) | ~63% | ~75% | ~82% | ~83% |
| ONNX availability | Yes (official) | Yes (hf/convert) | Yes (hf/convert) | Yes (OpenCLIP) | Yes (OpenCLIP) |
| License | Apache-2.0 | Apache-2.0 | Apache-2.0 | Apache-2.0 | Apache-2.0 |
| GPU optional | Yes | Yes | Recommended | Yes | Recommended |
| Fit for NAS (RAM <1.5 GB) | Yes (small/base) | Yes | Risky | Borderline | No |

Code Map

  • API: internal/api/photos_search.go and internal/api/photos_search_geo.go (extend request struct & pipeline).
  • Services: internal/ai/vision (model load/run), internal/entity (embedding store).
  • Config: internal/config/options.go (flags/env), vision.yml.
  • CLI: internal/commands/*.go.
  • Workers: internal/workers/vision.go, internal/workers/meta.go.

Acceptance Criteria

  • Users can toggle CLIP search on/off via configuration, choosing a compact UForm-based ONNX model by default.
  • A background worker generates and stores CLIP embeddings without breaking existing filters or other functionality.
  • When enabled and supported by the database (e.g., MariaDB 11.8+ with VECTOR indexes), text queries return images ranked by vector similarity, combined with the existing search pipeline.
  • On databases without vector support, such as older versions of MariaDB and SQLite, embeddings can be stored, but CLIP searches are not possible with the initial implementation.

Documentation & References

  • https://github.com/unum-cloud/UForm
  • https://github.com/xyb/uform-image-search
  • https://onnx.ai/
  • https://openai.com/blog/clip/
  • https://mariadb.org/projects/mariadb-vector/
  • https://github.com/pgvector/pgvector
  • https://github.com/sqliteai/sqlite-vector

@lastzero commented on GitHub (May 14, 2021):

Looks cool, but we're drowning in work right now. Maybe later this year?


@flozi00 commented on GitHub (May 14, 2021):

If you are okay with it, I could provide an Elasticsearch-based API with simple CRUD that you could use for search. It would then be only a small change on your side, plus another service in Docker Compose.


@lastzero commented on GitHub (May 14, 2021):

You're most welcome to contribute! Be aware that even merging a pull request is going to take some time right now :/


@flozi00 commented on GitHub (May 22, 2021):

Do you have a channel where I can contact you?
I'd really like to provide some APIs with useful features; maybe we could discuss them.


@lastzero commented on GitHub (May 23, 2021):

You can find us on matrix if you follow the link for our community chat.


@tknobi commented on GitHub (Feb 3, 2022):

Hello all,

since this is my first interaction with PhotoPrism, I would like to take this opportunity to thank you @lastzero (and all the other contributors :) ) for this wonderful piece of software. I have tested many different photo galleries, PhotoPrism runs by far the most stable, fastest and offers (almost) everything I need.

Now about this issue: After getting in touch with CLIP and once again desperately trying to find an image by its content in PhotoPrism, I had the same idea as @flozi00 (Thanks for that!) and I want to thank by contributing a CLIP based content search.

As already mentioned, the CLIP model can only be run from Python. So I propose providing a CLIP API that can encode an image or text into CLIP embeddings via a REST interface. I will keep the Python-side code minimal, so it stays maintainable for Go developers and needs no separate test infrastructure.

To store the embeddings, SQL is not suitable, because there is (to my knowledge) no efficient way to perform a nearest-neighbor search. However, based on my research (including this great [post](https://towardsdatascience.com/milvus-pinecone-vespa-weaviate-vald-gsi-what-unites-these-buzz-words-and-what-makes-each-9c65a3bd0696)) and experience, I wouldn't use an Elasticsearch instance, as I think it adds too much overhead. Instead, I would recommend qdrant, which requires only a single container of ~42 MB.


@xyb commented on GitHub (Nov 16, 2025):

UForm is also a good choice: the model is small, yet inference is very fast.


@xyb commented on GitHub (Nov 28, 2025):

The official documentation for UForm and USearch is somewhat outdated, so I created a demo to show how to use UForm in practice: https://github.com/xyb/uform-image-search


@lastzero commented on GitHub (Nov 28, 2025):

@xyb Thanks a lot! I updated the issue description above to include guidance on the implementation steps and acceptance criteria. It also includes the links you shared.

Reference
starred/photoprism#941