AI: Add a CLIP-powered semantic search #941

Open
opened 2026-02-20 00:02:13 -05:00 by deekerman · 9 comments
Owner

Originally created by @flozi00 on GitHub (May 12, 2021).

As a user, I want to perform a CLIP-powered semantic search using a local model so that I can find pictures via natural language queries without relying on external cloud services.

Implementation Steps:

  1. Model plumbing
    • Add ModelTypeClip and default config in internal/ai/vision/models.go; expose ONNX loader using existing runtime glue.
    • Add a text encoder runner (shared tokenizer) and image encoder preprocessing (224/256 px). Cache model files under ModelsPath/clip/<name>.
  2. Indexing
    • Worker: add CLIP embedding generation to internal/workers/vision.go and internal/workers/meta.go, respecting the Run schedule. Generate embeddings in batches; optional matryoshka down-projection (https://arxiv.org/pdf/2205.13147) to 64-d via config.
    • Persist embeddings via the repository internal/entity/clip_embedding.go, with migrations per DB driver. In Phase 1, SQLite gets persistence only (search disabled).
  3. Search pipeline
    • Extend search.Query to accept ClipQuery and ClipWeight (float).
    • For DBs with vector ops: ORDER BY distance using driver-specific SQL; then apply existing ordering as secondary.
    • Phase 1: no Go fallback scorer; SQLite and MariaDB 10.5 return “CLIP search not supported” unless a vector backend is present. A future phase may add a Go fallback scorer if requested.

Model Comparison

| Criterion (priority ↓) | UForm3 multi-base | OpenCLIP ViT-B/32 | OpenCLIP ViT-L/14 | SigLIP B/16 | SigLIP L/16 |
|---|---|---|---|---|---|
| Params / size | 206M (~800 MB fp16 ONNX) | 151M (~600 MB) | 428M (~1.6 GB) | 203M (~900 MB) | 652M (~2.5 GB) |
| Embedding dim | 64–768 (matryoshka; default 768) | 512 | 768 | 768 | 768 |
| Languages | 20+ multilingual | English-centric | English-centric | Multilingual (WebLI) | Multilingual (WebLI) |
| Speed on CPU (relative) | 2–4× faster vs competitors; ONNX native | fast | slow | medium | slow |
| Accuracy (ImageNet zero-shot) | ~75% (vendor) | ~63% | ~75% | ~82% | ~83% |
| ONNX availability | Yes (official) | Yes (hf/convert) | Yes (hf/convert) | Yes (OpenCLIP) | Yes (OpenCLIP) |
| License | Apache-2.0 | Apache-2.0 | Apache-2.0 | Apache-2.0 | Apache-2.0 |
| GPU optional | Yes | Yes | Recommended | Yes | Recommended |
| Fit for NAS (RAM <1.5 GB) | Yes (small/base) | Yes | Risky | Borderline | No |

Code Map

  • API: internal/api/photos_search.go and internal/api/photos_search_geo.go (extend request struct & pipeline).
  • Services: internal/ai/vision (model load/run), internal/entity (embedding store).
  • Config: internal/config/options.go (flags/env), vision.yml.
  • CLI: internal/commands/*.go.
  • Workers: internal/workers/vision.go, internal/workers/meta.go.

Acceptance Criteria

  • Users can toggle CLIP search on/off via configuration, choosing a compact UForm-based ONNX model by default.
  • A background worker generates and stores CLIP embeddings without breaking existing filters or other functionality.
  • When enabled and supported by the database (e.g., MariaDB 11.8+ with VECTOR indexes), text queries return images ranked by vector similarity, combined with the existing search pipeline.
  • On databases without vector support, such as older versions of MariaDB and SQLite, embeddings can be stored, but CLIP searches are not possible with the initial implementation.

Documentation & References

  • https://github.com/unum-cloud/UForm
  • https://github.com/xyb/uform-image-search
  • https://onnx.ai/
  • https://openai.com/blog/clip/
  • https://mariadb.org/projects/mariadb-vector/
  • https://github.com/pgvector/pgvector
  • https://github.com/sqliteai/sqlite-vector

@lastzero commented on GitHub (May 14, 2021):

Looks cool, but we're drowning in work right now. Maybe later this year?


@flozi00 commented on GitHub (May 14, 2021):

If you are okay with it, I could provide an Elasticsearch-based API with simple CRUD that you could use for search. It would then be only a small change on your side, plus another service in Docker Compose.


@lastzero commented on GitHub (May 14, 2021):

You're most welcome to contribute! Be aware that even merging a pull request is going to take some time right now :/


@flozi00 commented on GitHub (May 22, 2021):

Do you have a channel where I can contact you?
I'd really like to provide some APIs with useful features; maybe we could discuss them.


@lastzero commented on GitHub (May 23, 2021):

You can find us on matrix if you follow the link for our community chat.


@tknobi commented on GitHub (Feb 3, 2022):

Hello all,

since this is my first interaction with PhotoPrism, I would like to take this opportunity to thank you @lastzero (and all the other contributors :) ) for this wonderful piece of software. I have tested many different photo galleries, PhotoPrism runs by far the most stable, fastest and offers (almost) everything I need.

Now about this issue: After getting in touch with CLIP and once again desperately trying to find an image by its content in PhotoPrism, I had the same idea as @flozi00 (Thanks for that!) and I want to thank by contributing a CLIP based content search.

As already mentioned, the CLIP model can only be run from Python. So I propose providing a CLIP API that can encode an image or text into CLIP embeddings via a REST interface. I will keep the Python-side code minimal, so it stays maintainable for Go developers and needs no separate test infrastructure.

To store the embeddings, SQL is not suitable, because there is (to my knowledge) no efficient way to perform a nearest-neighbor search. However, based on my research (including this great [post](https://towardsdatascience.com/milvus-pinecone-vespa-weaviate-vald-gsi-what-unites-these-buzz-words-and-what-makes-each-9c65a3bd0696)) and experience, I wouldn't use an Elasticsearch instance, as I think it adds too much overhead. Instead, I would recommend qdrant, which requires only a single container of ~42 MB.


@xyb commented on GitHub (Nov 16, 2025):

UForm is also a good choice: the model is small, yet inference is very fast.


@xyb commented on GitHub (Nov 28, 2025):

The official documentation for UForm and USearch is somewhat outdated, so I created a demo to show how to use UForm in practice: https://github.com/xyb/uform-image-search


@lastzero commented on GitHub (Nov 28, 2025):

@xyb Thanks a lot! I updated the issue description above to include guidance on the implementation steps and acceptance criteria. It also includes the links you shared.

Reference
starred/photoprism#941