Mirror of https://github.com/photoprism/photoprism.git (synced 2026-03-02 22:57:18 -05:00)
Faces: Improve performance and strategy for manual tagging #1679
Originally created by @maxime1992 on GitHub (Jan 17, 2023).
Originally assigned to: @theshadow27, @graciousgrey, @omerdduran on GitHub.
Let me start by saying this is a new issue that summarizes a closed, but still occurring, issue. I'm creating this new issue based on @lastzero's comment.
1. What is not working as documented?
For faces that were not added automatically, tagging people manually is slow to extremely slow. With some photos or persons (unsure which), it takes only a few seconds (rare), but once the issue occurs, it usually takes between 1 and 2 minutes to tag one person. I recently decided to move all my photos to PhotoPrism, so I've got quite a lot already, and quite a lot more to come. I have more than a couple hundred pictures with unrecognized faces waiting to be tagged, but I'm not able to wait 1 to 2 minutes for each; it would be unmanageable.
It's documented as a known issue. I have tried the command `docker compose exec photoprism photoprism faces audit --fix`, but it didn't help fix the issue.
2. How can we reproduce it?
I have no idea how exactly to reproduce this, but both @pjft and myself have shared parts of our database by email directly to help debug it. See pjft comment and mine. Best case scenario it's enough to debug, if not I'm more than happy to provide more info.
3. What behavior do you expect?
When a face is detected but unknown, it shouldn't take more than a few seconds to add it manually.
4. What could be the cause of your problem?
Really unsure, sorry.
5. Can you provide us with example files for testing, error logs, or screenshots?
As stated above, this has been done by email as it includes personal data.
6. Which software versions do you use?
(a) PhotoPrism Architecture & Build Number: AMD64, ARM64, ARMv7,...
Build 221118-e58fee0fb
(b) Database Type & Version: MariaDB, MySQL, SQLite,...
MariaDB
(c) Operating System Types & Versions: Linux, Windows, Android,...
Linux
(d) Browser Types & Versions: Firefox, Chrome, Safari on iPhone,...
Brave and Chrome
(e) Ad Blockers, Browser Plugins, and/or Firewall Software?
Probably irrelevant here. (I've tried to turn them off, doesn't change anything).
7. On what kind of device is PhotoPrism installed?
This is especially important if you are reporting a performance, import, or indexing issue. You can skip this if you're reporting a problem you found in our public demo, or if it's a completely unrelated issue, such as incorrect page layout.
(a) Device / Processor Type: Raspberry Pi 4, Intel Core i7-3770, AMD Ryzen 7 3800X,...
Intel(R) Core(TM) i3-3227U CPU @ 1.90GHz
(b) Physical Memory & Swap Space in GB
4 GB and 4 GB
(c) Storage Type: HDD, SSD, RAID, USB, Network Storage,...
SSD
(d) Anything else that might be helpful to know?
8. Do you use a Reverse Proxy, Firewall, VPN, or CDN?
Yes, I use SWAG which itself uses NGINX, but it's probably unrelated here as it used to work perfectly with that config.
I'll summarize what was found in the previous issue:
@mdmarwil commented on GitHub (Jan 30, 2023):
I am having this same issue and the command `docker compose exec photoprism photoprism faces audit --fix` does not help.

My background is mostly in Python, but I tried to do some digging through the code.
The warning message `faces: failed removing merged clusters for subject` comes from a code block in `internal/query/faces.go`. Just prior to this block, `MatchMarkers()` is called. It is trying to match one Face to 4 other Faces with the same `subj_uid`, but it looks like the distance of the embeddings of the 4 other Faces is too far from the first Face to be considered a match, so none complete.

Going back to that block of code: it falls into the else branch with the warning message, although in this case a more accurate message would state that there were no orphaned clusters to remove.
@dalfry commented on GitHub (Feb 5, 2023):
Running into this issue. Tried running the audit fix command with no change. I have the merge failed error for one single subject. Library has 100k+ files. Spent a week tagging faces. Would prefer not to reset faces. :-)
BTW, Thanks for all your work! Photoprism is awesome! and I am a sponsor as well.
TRAC[2023-02-05T19:31:33Z] config: defaults file /etc/photoprism/defaults.yml does not exist
INFO[2023-02-05T19:31:33Z] config: case-insensitive file system detected
DEBU[2023-02-05T19:31:33Z] config: running on 'Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz', 4.1 GB memory detected
DEBU[2023-02-05T19:31:33Z] settings: loaded from /photoprism/storage/config/settings.yml
DEBU[2023-02-05T19:31:33Z] config: successfully initialized [35.329727ms]
INFO[2023-02-05T19:31:33Z] migrate: executing pre migrations
DEBU[2023-02-05T19:31:33Z] migrate: found 2 previous migrations
DEBU[2023-02-05T19:31:33Z] migrate: 20221015-100000 skipped
DEBU[2023-02-05T19:31:33Z] migrate: 20221015-100100 skipped
INFO[2023-02-05T19:31:34Z] migrate: executing main migrations
DEBU[2023-02-05T19:31:34Z] migrate: found 21 previous migrations
DEBU[2023-02-05T19:31:34Z] migrate: 20211121-094727 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20211124-120008 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220329-030000 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220329-040000 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220329-050000 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220329-060000 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220329-061000 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220329-070000 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220329-071000 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220329-080000 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220329-081000 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220329-083000 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220329-090000 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220329-091000 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220329-093000 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220421-200000 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220521-000001 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220521-000002 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220521-000003 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220927-000100 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20221002-000100 skipped
TRAC[2023-02-05T19:31:34Z] migrate: migrations migrated
TRAC[2023-02-05T19:31:34Z] migrate: auth_sessions migrated
TRAC[2023-02-05T19:31:34Z] migrate: countries migrated
TRAC[2023-02-05T19:31:34Z] migrate: labels migrated
TRAC[2023-02-05T19:31:35Z] migrate: photos_labels migrated
TRAC[2023-02-05T19:31:35Z] migrate: reactions migrated
TRAC[2023-02-05T19:31:35Z] migrate: files_share migrated
TRAC[2023-02-05T19:31:35Z] migrate: files_sync migrated
TRAC[2023-02-05T19:31:35Z] migrate: details migrated
TRAC[2023-02-05T19:31:35Z] migrate: photos_albums migrated
TRAC[2023-02-05T19:31:35Z] migrate: auth_users_shares migrated
TRAC[2023-02-05T19:31:35Z] migrate: places migrated
TRAC[2023-02-05T19:31:35Z] migrate: lenses migrated
TRAC[2023-02-05T19:31:35Z] migrate: keywords migrated
TRAC[2023-02-05T19:31:35Z] migrate: auth_users migrated
TRAC[2023-02-05T19:31:35Z] migrate: folders migrated
TRAC[2023-02-05T19:31:36Z] migrate: duplicates migrated
TRAC[2023-02-05T19:31:36Z] migrate: photos_users migrated
TRAC[2023-02-05T19:31:36Z] migrate: albums migrated
TRAC[2023-02-05T19:31:36Z] migrate: albums_users migrated
TRAC[2023-02-05T19:31:40Z] migrate: photos_keywords migrated
TRAC[2023-02-05T19:31:40Z] migrate: links migrated
TRAC[2023-02-05T19:31:40Z] migrate: subjects migrated
TRAC[2023-02-05T19:31:40Z] migrate: markers migrated
TRAC[2023-02-05T19:31:40Z] migrate: auth_users_settings migrated
TRAC[2023-02-05T19:31:40Z] migrate: services migrated
TRAC[2023-02-05T19:31:40Z] migrate: errors migrated
TRAC[2023-02-05T19:31:40Z] migrate: passwords migrated
TRAC[2023-02-05T19:31:40Z] migrate: cells migrated
TRAC[2023-02-05T19:31:40Z] migrate: cameras migrated
TRAC[2023-02-05T19:31:40Z] migrate: categories migrated
TRAC[2023-02-05T19:31:40Z] migrate: faces migrated
TRAC[2023-02-05T19:31:40Z] migrate: auth_users_details migrated
TRAC[2023-02-05T19:31:41Z] migrate: files migrated
TRAC[2023-02-05T19:31:41Z] migrate: photos migrated
DEBU[2023-02-05T19:31:41Z] migrate: completed in 7.796722162s
INFO[2023-02-05T19:31:41Z] faces: found 104 subjects
INFO[2023-02-05T19:31:41Z] faces: found no invalid marker subjects
INFO[2023-02-05T19:31:41Z] faces: found no invalid marker faces
INFO[2023-02-05T19:32:10Z] faces: found no ambiguous subjects
INFO[2023-02-05T19:32:10Z] faces: found no orphan face clusters
INFO[2023-02-05T19:32:10Z] faces: found no orphan people
INFO[2023-02-05T19:32:10Z] completed in 36.75042936s
INFO[2023-02-05T19:32:10Z] closed database connection
@lastzero commented on GitHub (Feb 23, 2023):
I now have a clearer idea of the underlying problem and think we can improve the situation by not matching manually labeled faces if they do not meet the constraints set for automatic clustering (more specifically, cluster core and core distance).
When testing with the pictures we had been using during development, a more aggressive approach for manually labeled faces seemed to work well, but this is only feasible for faces that the model can categorize well. For example, all these faces were matched after labeling a single face in the edit dialog:
@pjft commented on GitHub (Feb 24, 2023):
Thank you for the update - this looks promising! What consequences would this change bring about? Meaning, what will change by not matching manually labeled faces if they do not meet the constraints set for automatic clustering?
Just curious to understand the "why the original decision, what problems did it solve" and "what consequences does it have".
Not urgent, by any means - just intellectual curiosity. :)
Thank you, and happy Friday!
@lastzero commented on GitHub (Feb 24, 2023):
@pjft We can't tell what the final effect will be until it's implemented. It is expected that you will have to tag a few more faces that belong to the same person before the automatic matching starts, so basically how most other apps work.
Since these changes require extensive testing, we will probably wait until after the next stable release to avoid delaying it even further.
The sample faces we use for testing can be downloaded here:
@pjft commented on GitHub (Feb 24, 2023):
Of course - appreciate the time and effort in looking into this.
Let us know how we can help.
@pjft commented on GitHub (May 19, 2023):
Just wanted to say that this seems to be solved in the latest builds!
Awesome work :) It's a breeze to label hundreds of clusters now.
@lastzero commented on GitHub (May 19, 2023):
@pjft Thank you! That's good to know, since the feedback we've received on the optimizations hasn't been very enthusiastic so far. Most users didn't seem to notice (or care about) the changes when they tested our development preview builds.
@pjft commented on GitHub (May 19, 2023):
I have no issues whatsoever now, whereas I recall having to wait a minute or so before being able to move forward after a while.
Only comment I have is that, for some reason, the "audit fix" and "optimize faces" commands seem to cause more long-lasting issues than the ones I experience without them. Unsure if temporary or not, but I have not dug into them as it's painful to go back and redo all the work 😅
@lastzero commented on GitHub (May 19, 2023):
We will change how manually labeled faces are clustered and compared in one of the upcoming releases. This should solve the remaining issues.
@pjft commented on GitHub (May 19, 2023):
Awesome:) if you need any testing at any point, by all means tag me directly. I'm always happy to help if I can. I appreciate all the work you and @graciousgrey put into making Photoprism great! Thank you so much.
@berturu commented on GitHub (Jul 24, 2023):
Unfortunately, I still experience a delay when tagging new faces. The delay ranges from 2-3 seconds to possibly a full minute at times. Running Build 1.2306.25.
@lastzero commented on GitHub (Jul 24, 2023):
Unfortunately, we haven't had time to work on this yet. We will reference this issue in the related commits and mention it in the release notes once the improvements are available. Alternatively, feel free to work on this and provide us with a pull request that we just need to merge. Thank you! ❤️
@judej commented on GitHub (Aug 30, 2023):
+1 on the perf issue. When I manually tag faces, it can take about 20 seconds before I can tag another one. If I try tagging before that, it gives me a "busy, try again later" error.
I have found that if I navigate in the reverse order when tagging, sometimes, the tagging is faster - sub second. I get to tag about 4 or so pics and then it slows down again.
not sure if this helps...
@pabsi commented on GitHub (Oct 2, 2023):
Hi there @lastzero , the Photoprism team, and its awesome community,
I was about to raise a new issue, but I think this one fits my usecase.
I have a huge number of photos for which I have to manually tag faces. It seems that TensorFlow does not play nicely with children growing from the baby stage into toddlers, then kids, then teenagers. An obvious face change, but it does require more manual intervention from me. Also, going to public places (a zoo, a museum, a restaurant, etc.) catches lots of unwanted people.

I often find myself having to edit many faces per photo, and I find this process extremely tedious and slow. In my humble opinion, ideally we should be able to make all changes in one photo (add, remove, modify face tags) and, when leaving the photo, commit the changes to the DB. Again, in my humble opinion, the way it works now (make a change, wait for it to be committed) is very slow and underperforming.
Unsure if this GitHub issue is for the same purpose, but thought I'd mention it here first, before raising a new issue unnecessarily.
Thank you so much for this amazing piece of software 🙏
@richardpickett commented on GitHub (Oct 6, 2023):
I'm having this same issue. 26,000+ photos, mostly Asian family. I'm happy to provide as much info as possible, if it's helpful in resolving this issue.
running `photoprism/photoprism:ubuntu` on k8s with `mariadb:10.11`, version `230923-e59851350-Linux-AMD64-Plus`.

Sometimes manually tagging will take 6+ minutes (update: 8+ minutes), with usual wait times around 2 minutes. During the wait I can see 20+ messages like this:
`time="2023-10-06T15:53:24Z" level=warning msg="faces: failed removing merged clusters for subject js1xgms3dgksr08g"`

Each with a different subject name.
I've run:

- `photoprism migrations run -f` - no output, completes in less than a minute
- `photoprism index --cleanup` - (a) warnings on mp4s that are too big, (b) errors on jpgs making thumbnails (unexpected EOF while decoding), and (c) a lot of `failed removing merged clusters for subject <****>`

I'm adept with MySQL, k8s, bash, etc., so I'm completely comfortable with low-level debugging to help resolve this - just warn me before we do something that may potentially be destructive (also, I would like to know where the DB backup saves so I can copy it off as needed).
@stephanvierkant commented on GitHub (Jan 18, 2024):
I'm having the same issue. It takes about 30-60 seconds to add faces. While it would be great to improve the performance, could it be possible to have it done by a background process? I think it shouldn't block the front end.
@graciousgrey commented on GitHub (Jan 18, 2024):
We will take care of this as soon as we have released the new authentication options we are currently working on :)
@trop1kal commented on GitHub (Jan 28, 2024):
Awesome! I notice the delay when manually tagging faces as well.
Takes upwards of 2-3mins in most cases.
You guys rock! TY in advance!
@niclan commented on GitHub (Mar 13, 2024):
Same, takes minutes with CPU usage on the server ~30%+ and then peaking at 100% right before finishing. And sometimes it takes just seconds.
It seems like the server is blocking while this is going on.
A bit under 150K images.
@nodecentral commented on GitHub (May 20, 2024):
Just wanted to add to this, and share how inconsistent it can be when manually tagging (or de-tagging) faces. As such a key part of the product for me (and to help my aging parents remember people from their past) - anything that can be done to speed things up would be amazing..
The face tagging feature is the most important feature for a number of people - thank you so much for making it available
@theshadow27 commented on GitHub (Dec 20, 2024):
This has been bugging me for a while, but it's now taking 200-300 seconds per face. That's enough time to start looking through the code.
Background
The call that starts the blocking operation, and subsequently fails resulting in a 429, is `PUT /api/v1/markers/:marker_uid` (the controller is `internal/api/markers:182`, `func UpdateMarker(router *gin.RouterGroup)`). The 429 error is returned on line 191 when `mutex.UpdatePeople.Start()` is already locked. The next line uses `defer mutex.UpdatePeople.Stop()` to hold the mutex until the `UpdateMarker` function returns.

The issue is that the remainder of the function can take a VERY long time to return - upwards of 3 minutes. The application exhibits both CPU utilization (100% of 1 core) and DB traffic (distinct spikes of DB traffic and queries). As best I can tell from the logs, when `SaveForm` reports `changed` (which will almost always be true when adding a name to a face that did not have one before), we do some stuff:

1. `get.Faces().Optimize()` (when `marker.FaceID != "" && marker.SubjUID != "" && marker.SubjSrc == entity.SrcManual`, which is again always true from what I can gather based on the front end, since we passed the marker and subject in with the call and the src is always 'manual')
2. `query.UpdateSubjectCovers(true)` - seems like a single, well optimized query
3. `entity.UpdateSubjectCounts(true)` - also a single, well optimized query
4. `query.PhotoByUID(file.PhotoUID)` - also a single query
5. `p.UpdateAndSaveTitle()` - slightly more complicated, but nothing alarming

Based on log line timing, I suspect that most of the time is being spent in (1), `Faces().Optimize()`, which is invoked on every single successful call to PUT markers (`faces_optimize.go:Optimize()`).

The `Optimize()` function is scary for something routinely invoked by a front-end client. First off, it starts with a loop to 11 (although there is an early abort, this is a lot of effort). Up to 11 times, until no progress is made, we:

1. Query `ManuallyAddedFaces(false, false)`, basically `SELECT * FROM faces WHERE face_hidden = false AND face_src = "manual" AND subj_uid <> '' AND face_kind <= 1 ORDER BY subj_uid, samples DESC`. Remember - up to 10 times! Why?!? We are not manually adding new faces; the mutex has locked us out of doing so.
2. For each of the `n` manually added faces from the DB (mine has `n = 322` rows), call `merge[0].match(faces[j].embedding())` (will come to this later). If that returns `ok`, print a log line and append the face to the merge set.
3. Call `query.MergeFaces` - more on this next. It returns `Merged`, which is an `int`: a count of the number merged.
4. Finally, `log.Infof("faces: merged %s", english.Plural(res.Merged, "cluster", "clusters"))` so that we know that PhotoPrism is ready to accept the next name (3 minutes later).

In summary, we loop up to 11 * 322 times, calling `match` up to `len(faces) - len(subjects)` times (like 322 - 135 = 187 times for me) and `MergeFaces` `len(faces) - 2*len(subjects)` times (per the log lines, this seems to be about 121-122 times).

The 10 times is important - I am moderately confident that this loop never exits early (at least on my setup). Looking at the DB traffic for each name-adding operation: there are 11 distinct spikes of "UPDATE" statements (peak = 220/second).
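As a rough model of that control flow (a hypothetical sketch, not PhotoPrism's actual code; `optimizePasses` and `mergedPerPass` are invented names), the outer loop only exits early when a pass reports zero merges:

```go
package main

import "fmt"

// optimizePasses models the outer loop in Optimize(): it runs up to
// 11 passes (i <= 10) and exits early only when a pass reports that
// nothing was merged. mergedPerPass stands in for the Merged count
// that each pass claims.
func optimizePasses(mergedPerPass func(pass int) int) int {
	passes := 0
	for i := 0; i <= 10; i++ {
		passes++
		if mergedPerPass(i) == 0 {
			break // early abort: no progress on this pass
		}
	}
	return passes
}

func main() {
	// An honest counter that reports no work left exits after one pass.
	fmt.Println(optimizePasses(func(pass int) int { return 0 }))
	// A counter that always claims progress forces all 11 passes,
	// matching the 11 query spikes observed above.
	fmt.Println(optimizePasses(func(pass int) int { return 121 }))
}
```

With an inflated `Merged` count, the early abort never fires and every tag pays for all 11 passes.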
What happens in `MergeFaces`

Yikes. Remember this happens essentially 11 * (number of subjects) times for each `PUT` API call:

1. A new merged face is created (`entity.NewFace`) and saved via `FindFirstOrCreate`.
2. `MatchMarkers` does another query for all markers with the face IDs, running `Match` against each one - and potentially calling `SetFace`, which then calls `ResolveCollision`, which AGAIN matches all the embeddings and might write to the DB. In my case, this is rare, but it does happen. This means that most of the cycles are wasted.
3. `PurgeOrphanFaces` - this seems like an over-eager operation considering how many times it is run.

The end result is thousands of `UPDATE` operations on the DB, along with substantial CPU time spent in `Match()`, for each `PUT` API call.

Why Optimize Every Call?
TBH I have no clue. I put a lot of the source through GPT-4o and asked; its guess is as good as mine. I do agree that it does not need to run on each tagging operation. Perhaps it means less efficient clusters for the tagging session - fine. I can enter 50 names in the time it takes for one `Optimize` cycle to complete!

How to not `Optimize` on every call

- Instead of running `Optimize()` every time a face is tagged, queue the optimization task and run it periodically (e.g., every few minutes) or when a batch of changes is detected.
- Only trigger the `Optimize` call when a specific metric, which is inexpensive to calculate, indicates that the matching would benefit from optimization. Candidates include a `GROUP BY subject_uid` count in the `faces` table.
- Run `Optimize()` asynchronously, updating only the affected clusters/subjects.
- Run `Optimize` only when there have been no API calls in some time (5 minutes, or user-adjustable).
If it is necessary to `Optimize` on every call...

- Use `VEC_DISTANCE` in MariaDB 11.7 (and 10.12) to reduce the number of markers that need to be `Match`ed - I believe there is a separate ticket #4669 for this.

In conclusion
Regarding @lastzero's comment:

I am happy to help, but would appreciate some thoughts on the best strategy first so that it does not end up as wasted effort. Do you feel that it is critical to `Optimize` on every `PUT` API call? If so, why? If not, what is your preferred pattern for hiding this latency from the user? My goal is to contribute something usable, not add to the pile of 30-something unmerged PRs.

Edit: The loop actually runs 11 times, not 10 (because it's `j <= 10`). This matches the number of huge query spikes seen during the `Optimize` run in DBeaver.

@theshadow27 commented on GitHub (Dec 20, 2024):
To further investigate, I used DBeaver to export the faces table before and after one of these multi-minute, CPU-intensive `Optimize()` calls. The net diff was a single row (the face that was changed), which had only a single field added (`subject_uid`).

From this, I have to assume that the clusters aren't actually being merged in a meaningful way in the long-running `Optimize()` calls, despite the logs reporting that they are. Perhaps this is a bug; if the clusters were actually merged, there would not be repeated `"faces: merged 121 clusters [2m30.469009483s]"` lines in the log...

@theshadow27 commented on GitHub (Dec 21, 2024):
I think I found the bug causing the Optimize loop to not terminate early (and run all 11 times, `i := 0; i <= 10; i++`).

The early termination code checks to see whether any faces were merged on the last run through the loop. Basically, if a call to `query.MergeFaces(merge, false)` with mergeable faces (clusters that `Match` with `merge[0]`) does not error, we assume that all faces were merged, with `result.Merged += len(merge)`.

This is a flawed assumption because:

- Given a valid (same `subj_uid` and embedding `Match`) set of mergeable faces, `MergeFaces` is not guaranteed to actually eliminate any (or all) of the supplied Faces.
- In `MatchMarkers`, the "old, suboptimal" cluster `face_id` is not replaced when `m.Match(marker.Embeddings())` is not `ok`; in this case we `// Ignore` on face.go:L265.
- Without the `face_id` being replaced with the new super-cluster, `PurgeOrphanFaces` cannot delete the old clusters.
- There is an `if removed > 0` call in faces.go:L216, which suggests we should get an error log when this happens, and we don't see any such errors. Except:
- In the `PurgeOrphanFaces` function, there appears to be a subtle bug: if no rows are affected by the delete, i.e. `else if result.RowsAffected > 0` is not true, then `len(ids)` (the batch size) is added to the `affected` counter anyway, so `MergeFaces` returns as if it did useful work.

Because the daemon runs this function every 15 minutes, it is highly likely that all face clusters which can be merged have already been merged. In a static library in steady state that has at least 2 clusters which could be merged according to the embeddings matching, but where the smaller cluster can't be removed because the new cluster doesn't match the face, this loop will run 11 times, checking every cluster against all others with the same `subj_uid` each time.

This could be fixed in 2 ways:

1. Fix `PurgeOrphanFaces` so that if no rows are affected, it returns 0. This will trigger the Warnf in `MergeFaces` and the Errorf in `Optimize` each time, but at least `result.Merged` won't be incremented, so the 11x loop won't run unnecessarily. The downside (besides the log output) is that if `PurgeOrphanFaces` or `MergeFaces` is used elsewhere and relies on the broken behavior, those other consumers may no longer work as expected. GitHub doesn't seem to think that either function has other consumers, but it might have missed something.
2. Since we query `query.ManuallyAddedFaces` each time, we could track the number of face clusters returned (`n`) and terminate if `n >= prev_n` (i.e., the number of clusters increased or stayed the same). The downside is that it would require an extra query of all faces (with their embeddings).

@lastzero if you let me know whether you prefer fix (1) or fix (2), I am happy to submit a PR.
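Fix (1) boils down to counting only the rows the database actually deleted. A hypothetical sketch (`purgeBatches` and `dbDelete` are invented stand-ins, not PhotoPrism's API):

```go
package main

import "fmt"

// purgeBatches models the inner loop of PurgeOrphanFaces. dbDelete
// returns the RowsAffected count of the DELETE for one batch of IDs.
func purgeBatches(batches [][]string, dbDelete func(ids []string) int64) int64 {
	var affected int64
	for _, ids := range batches {
		rows := dbDelete(ids)
		// Before the fix, len(ids) was added even when rows == 0,
		// making MergeFaces look like it did useful work.
		affected += rows
	}
	return affected
}

func main() {
	noop := func(ids []string) int64 { return 0 } // DELETE matches nothing
	fmt.Println(purgeBatches([][]string{{"a", "b"}, {"c"}}, noop))
	// With affected == 0, result.Merged stays at 0 and the 11-pass
	// loop in Optimize can terminate early.
}
```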
There are a few other things I noticed while tracing this code:
In the `Optimize` function, the `else if len(merge) == 1 { merge = nil }` condition is probably a bug. This is trying to reset the set of mergeable faces for the case where it's the same subject but the faces don't match. In doing so, it skips a potential face to merge. Consider clusters A, B, C, D, E, F, all of the same subject. To fix this, change the code to `else if len(merge) == 1 { merge = entity.Faces{faces[j]} }`.

A more aggressive refactoring could also improve performance significantly: rather than querying all faces each time through the loop, we should get the list of faces at the start, then only query for face IDs with a `created_at` constraint, fetching only the new faces. Better yet (though more involved), all of the operations could be kept in memory, with an append-only log of 'added' and 'removed' entries to write to the DB once at the end. When there are hundreds of faces, this repeated querying consumes significant DB resources.

Edit: One further (but schema-changing) optimization would be to add a column `optimized_at` to the `faces` table. Whenever a face cluster was matched/added to a marker, this column would be cleared. In the `Optimize` loop, following an optimization run, each remaining cluster that was processed would have `optimized_at` set. Add one more argument to `ManuallyAddedFaces(hidden, ignored, optimized)` that only returns Faces where at least one face with that `subj_uid` has an empty `optimized_at` value... The query would be something like... Or thereabouts. In this way, we'd only be processing faces where at least one marker had changed, rather than doing the same work over and over.
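The effect of the `merge = nil` reset described above can be illustrated with a toy grouping loop (a simplified sketch of the reset behavior, not the exact branch structure in `Optimize`; `groupMerges` and the string-based `match` are invented):

```go
package main

import "fmt"

// groupMerges walks faces of one subject and collects mergeable sets.
// With buggy == true it mimics `merge = nil`: a face that fails to
// match the current set is dropped instead of seeding the next set.
func groupMerges(faces []string, match func(a, b string) bool, buggy bool) [][]string {
	var groups [][]string
	var merge []string
	for _, f := range faces {
		if len(merge) == 0 || match(merge[0], f) {
			merge = append(merge, f)
			continue
		}
		if len(merge) > 1 {
			groups = append(groups, merge)
		}
		if buggy {
			merge = nil // bug: f is lost as a merge candidate
		} else {
			merge = []string{f} // fix: f seeds the next merge set
		}
	}
	if len(merge) > 1 {
		groups = append(groups, merge)
	}
	return groups
}

func main() {
	sameGroup := func(a, b string) bool { return a[0] == b[0] }
	faces := []string{"a1", "a2", "b1", "b2"}
	fmt.Println(len(groupMerges(faces, sameGroup, true)))  // 1: {b1,b2} is missed
	fmt.Println(len(groupMerges(faces, sameGroup, false))) // 2: both sets found
}
```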
@theshadow27 commented on GitHub (Dec 30, 2024):
Have a PR going. No confirmed fix yet. https://github.com/photoprism/photoprism/pull/4691
BTW, tests are failing in WSL off `develop` - probably not your native env.

@theshadow27 commented on GitHub (Dec 30, 2024):
(from the PR for visibility by @lastzero )
This issue is deeper than initially suspected - the proposed fix (1) is necessary but not sufficient.
After implementing the proposed fix (1), the loop is still running all 11 times. It seems that when there are several clusters of the same size (in my case 6) it is possible for each iteration to attempt to merge them, resulting in one new cluster being created and one old cluster being dropped.
Trying to grok what is going on here...
When the new cluster is created, its ID comes from the embeddings... the centroid of all possible merge candidates. Somehow, this results in a new face hash (rounding issues? order of operations?) and the creation of a new automatic cluster which replaces the old one. Since `MatchMarkers` eagerly asserts ownership over the markers, it is happy to take them from the other automatic cluster. The database is updated by adding the new cluster and removing the old cluster (effectively a nop)... and the cycle repeats.

I assumed that the intent of the md5 `face_id` was to prevent exactly this problem, and I initially speculated that a stable sort order from `ManuallyAddedFaces` might fix it, but thinking about it some more, I can see how it would not: each time, the centroid is shifted slightly, resulting in a new `face_id` - one that will be very, very close, but not identical, to the previous one.

Initial thought: one way around this would be to do the match in-memory, i.e. `MatchMarkers` should store the results but not actually write anything. Then the result could be compared to the previous marker set, and if it is the same set of markers, not persist itself or take them over. Given the structure of the code, this is a complicated change. I suppose it could be done with less friction inside a `TRANSACTION` (with rollback) at the expense of additional DB overhead. I am not sure of the best way to proceed here. If you have any suggestions @lastzero, I would be happy to try them with my problematic db.
In the meantime, I will add a parameter to `Optimize(subj_uid)` that can be passed through to `ManuallyAddedFaces` for the special case of `markers.UpdateMarker`, when the `subj_uid` is known in advance. Because of the locking here, this will only ever result in that subject being optimized (it's the only face cluster to have changed, and the `merge` set is all the same `subj_uid`). At least that will cut down on the amount of work this loop has to do while the user is blocked.

Edit: Noticed another problem (bug?). The sort in `ManuallyAddedFaces` orders by `samples`, with the assumption that this is somehow connected to reality. However, it seems that this value is never updated - not with Optimize, not with `photoprism faces update -f`, not with `photoprism faces audit --fix`. Once it's there, it's set (as far as I have observed). In some spot checks, I have several cases where `samples` is something like 7-10, but there are 25+ markers with that `face_id` meeting the minimum score and size. Unless I'm missing something, and the intent is not to denote how many markers are matched to the cluster?

@lastzero commented on GitHub (Dec 31, 2024):
@theshadow27 Thanks for taking a look and sharing your insights! 🧐
Our current plan is to offload (most of) the calculations to the new MariaDB vector functions in an upcoming release:
This should yield better results and be much faster than our own vector comparisons, which were intended as a temporary solution until we have a proper vector database/library in place. Unfortunately, as we are currently performing a major upgrade of our frontend, I do not have any code or a more specific timeline/roadmap for this yet:
The edge cases that need to be handled/investigated (see your comments and what I've commented in other places) further encourage us to go this way. If you like, you are welcome to experiment with the vector functions and work on a proof of concept? ✨
Of course, if you find improvements to the current code that we can test, merge, and release within a reasonable timeframe, we'd be happy to do that as well - but my feeling is that the effort required for a fast and completely bug-free implementation is beyond what anyone could maintain/contribute... 🤔
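To illustrate the kind of calculation that would be offloaded to the database: face matching ultimately boils down to comparing embedding vectors against a cluster's matching radius in application code. Below is a minimal, hypothetical Go sketch of that comparison - the function and type names are illustrative and are not PhotoPrism's actual API:

```go
package main

import (
	"fmt"
	"math"
)

// euclideanDist returns the Euclidean distance between two face
// embedding vectors. This is an illustrative stand-in for the kind of
// in-process vector comparison that MariaDB's vector functions could
// replace, not the project's actual code.
func euclideanDist(a, b []float64) float64 {
	sum := 0.0
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return math.Sqrt(sum)
}

// matches reports whether an embedding falls inside a cluster's
// matching radius (the threshold value here is hypothetical).
func matches(embedding, center []float64, threshold float64) bool {
	return euclideanDist(embedding, center) <= threshold
}

func main() {
	center := []float64{0.1, 0.2, 0.3}
	candidate := []float64{0.1, 0.25, 0.3}
	fmt.Println(matches(candidate, center, 0.1)) // distance 0.05 is within the radius
}
```

Doing this per marker in Go scales linearly with the number of stored faces, which is one reason pushing the distance computation into an indexed database function is attractive.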
@theshadow27 commented on GitHub (Dec 31, 2024):
@lastzero thanks for getting back, and happy new year! 🥂
First off, IMHO the current implementation/"temporary" solution is VERY good at meeting its functional requirements - better than Apple Photos for sure. In the rare situation where it's confused a subject, it's straightforward to eject them and fix it. New markers are identified and matched quickly and accurately. Youth sports are a brutal use case, and I sincerely appreciate the effort that went into the current version! The only issue I have is with the nonfunctional UX requirement of not having excessive delays while naming many markers in quick succession.
To be clear: the current state of the draft PR offers a significant improvement to the UX; by limiting the `Optimize` function to the face supplied in the POST, it almost never triggers the 11-iteration loop, unless you are specifically working on one of those subjects that has problematic clusters. It is not a big PR (40 lines, 12 lines functional), and I don't believe it changes the behavior at all for the worker/CLI usage - a huge plus for building confidence after years of stable deployments.
Regarding the Vector implementation - I'm already thinking about it, but the scope will be much wider than this PR, with impacts from DBSCAN up to everywhere that `Match` is used (at least 11 places) - likely requiring substantial rework. In addition to schema updates, a Vector implementation requires a minimum DB version that could break a lot of existing embedded/hosted installs without manual intervention during the upgrade. To keep backwards compatibility, some version of the current implementation strategy needs to be encapsulated behind an interface, further expanding the scope of refactoring, and potentially running with two different Gorm entities depending on a flag. I don't know enough about Gorm to know if that's OK or not. This is a big project: lots of new code, lots of testing, bug hunting, and so on.
As such, my suggestion is to try to get the minimum change for maximum UX improvement merged sooner rather than later - there are quite a few users who have reported this or derivative issues going back to October 2022. It doesn't fix the root cause, but it hides it well enough to make a big difference for those who are impacted. What do you think?
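The subject-scoped approach described above can be sketched roughly as follows - a simplified illustration under assumed names; the real entity fields and function signatures in PhotoPrism differ:

```go
package main

import "fmt"

// Face is a simplified stand-in for a face cluster; the field names
// are illustrative and do not match PhotoPrism's entity package.
type Face struct {
	ID      string
	SubjUID string
}

// optimize stands in for the merge/re-cluster pass and returns how
// many clusters were processed (the actual logic is elided).
func optimize(faces []Face) int {
	return len(faces)
}

// optimizeSubject restricts the work to clusters belonging to a single
// subject, mirroring the idea of passing subj_uid through from the
// marker update instead of re-optimizing every subject. An empty
// subjUID keeps the existing worker/CLI behavior unchanged.
func optimizeSubject(all []Face, subjUID string) int {
	if subjUID == "" {
		return optimize(all)
	}
	var scoped []Face
	for _, f := range all {
		if f.SubjUID == subjUID {
			scoped = append(scoped, f)
		}
	}
	return optimize(scoped)
}

func main() {
	all := []Face{{"f1", "s1"}, {"f2", "s1"}, {"f3", "s2"}}
	fmt.Println(optimizeSubject(all, "s1")) // processes only subject s1's clusters
}
```

The key design point is that the interactive POST path supplies a known `subj_uid`, so only that subject's clusters are touched while the user waits, while batch callers pass nothing and keep the full pass.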
P.S. If you're not comfortable rolling it out everywhere, it is not much more to hide it behind a flag (like `PHOTOSTREAM_EXPERIMENTAL_FACES_OPTIMIZE`) that would eliminate nearly all of the risk for "normal" users doing routine upgrades.
P.P.S. I am working on another branch that has a more substantial diff: a complete rewrite of the `Optimize` function for how I think it should work - breaking out each `subj_uid` into its own struct and operating on them independently, spending more effort on finding mergeable clusters within a subject, going back to ground-truth tagged markers instead of cluster-of-clusters, and only re-fetching faces when that particular subject had a change. This will be several hundred lines of diff, so I don't expect it to be accepted without substantial testing that I will likely not finish on this holiday 😄.
@lastzero commented on GitHub (Mar 26, 2025):
@theshadow27 Thank you so much for your patience while we completed the frontend upgrade! 🙏
The last batch of UI/UX enhancements was released last week, so I'm happy to report that your PR is now merged and ready to be tested with our development preview build:
Any help with that is much appreciated :)
@Coloradohusky commented on GitHub (Apr 6, 2025):
While I am not OP, I have definitely noticed speedups with manual tagging when on the latest preview build, `250405-eed7bfff7` - speeds went from around 30 seconds to a minute+ before, to around 2-10 seconds now (although in certain cases it can still take 45 seconds, it is much better overall).
@lastzero commented on GitHub (Apr 7, 2025):
@Coloradohusky Thanks for testing! While the API could also be faster due to other performance improvements we've implemented in the meantime, it's definitely great to hear :)
@mat1990dj commented on GitHub (Jan 12, 2026):
I have to say that after the last November update (251130), this is happening again. I ran `photoprism faces audit --fix` and `photoprism faces index`, and I have a LOT to fix. After every change I need to wait more than one minute until the next action on faces:
```
time="2026-01-09T14:31:57Z" level=info msg="subject: added person 'Olivia Valles'"
time="2026-01-09T14:32:59Z" level=info msg="changes successfully saved"
```
@davidetoile commented on GitHub (Jan 12, 2026):
I confirm, same problem on my side...
@mat1990dj commented on GitHub (Jan 12, 2026):
Additionally, I think we should also have a screen that shows detected faces ungrouped, with the ability to multi-select them and then assign a person.