Mirror of https://github.com/photoprism/photoprism.git (synced 2026-03-02 22:57:18 -05:00)
Faces: Improve performance and strategy for manual tagging #1679
Originally created by @maxime1992 on GitHub (Jan 17, 2023).
Originally assigned to: @theshadow27, @graciousgrey, @omerdduran on GitHub.
Let me start by saying this is a new issue that summarizes a closed, but still occurring, issue. I'm creating this new issue based on @lastzero's comment.
1. What is not working as documented?
For faces that were not added automatically, tagging people manually is slow to extremely slow. With some photos or persons (unsure which), it takes only a few seconds (rare), but once the issue occurs, it usually takes between 1 and 2 minutes to tag one person. I recently decided to move all my photos to PhotoPrism, so I've got quite a lot already, and quite a lot more to come. I have more than a couple hundred pictures with unrecognized faces waiting to be tagged, but I'm not able to wait 1 to 2 minutes for each; it would be unmanageable.
It's documented as a known issue. I have tried the command `docker compose exec photoprism photoprism faces audit --fix`, but it didn't help fix the issue.
2. How can we reproduce it?
I have no idea how exactly to reproduce this, but both @pjft and myself have shared parts of our database by email directly to help debug it. See pjft comment and mine. Best case scenario it's enough to debug, if not I'm more than happy to provide more info.
3. What behavior do you expect?
When a face is detected but unknown, it shouldn't take more than a few seconds to add it manually.
4. What could be the cause of your problem?
Really unsure, sorry.
5. Can you provide us with example files for testing, error logs, or screenshots?
As stated above, this has been done by email as it includes personal data.
6. Which software versions do you use?
(a) PhotoPrism Architecture & Build Number: AMD64, ARM64, ARMv7,...
Build 221118-e58fee0fb
(b) Database Type & Version: MariaDB, MySQL, SQLite,...
MariaDB
(c) Operating System Types & Versions: Linux, Windows, Android,...
Linux
(d) Browser Types & Versions: Firefox, Chrome, Safari on iPhone,...
Brave and Chrome
(e) Ad Blockers, Browser Plugins, and/or Firewall Software?
Probably irrelevant here. (I've tried to turn them off, doesn't change anything).
7. On what kind of device is PhotoPrism installed?
This is especially important if you are reporting a performance, import, or indexing issue. You can skip this if you're reporting a problem you found in our public demo, or if it's a completely unrelated issue, such as incorrect page layout.
(a) Device / Processor Type: Raspberry Pi 4, Intel Core i7-3770, AMD Ryzen 7 3800X,...
Intel(R) Core(TM) i3-3227U CPU @ 1.90GHz
(b) Physical Memory & Swap Space in GB
4 GB and 4 GB
(c) Storage Type: HDD, SSD, RAID, USB, Network Storage,...
SSD
(d) Anything else that might be helpful to know?
8. Do you use a Reverse Proxy, Firewall, VPN, or CDN?
Yes, I use SWAG which itself uses NGINX, but it's probably unrelated here as it used to work perfectly with that config.
I'll summarize what was found in the previous issue:
@mdmarwil commented on GitHub (Jan 30, 2023):
I am having this same issue and the command `docker compose exec photoprism photoprism faces audit --fix` does not help.

My background is mostly in Python, but I tried to do some digging through the code.
The warning message `faces: failed removing merged clusters for subject` comes from a code block in `internal/query/faces.go`. Just prior to this block, `MatchMarkers()` is called. It is trying to match one Face to 4 other Faces with the same `subj_uid`, but it looks like the distance of the embeddings of the 4 other Faces is too far from the first Face to be considered a match, so none complete.

Going back to that block of code: it falls into the else branch with the warning message, although in this case a more accurate message would state that there were no orphaned clusters to remove.
@dalfry commented on GitHub (Feb 5, 2023):
Running into this issue. Tried running the audit fix command with no change. I have the merge failed error for one single subject. Library has 100k+ files. Spent a week tagging faces. Would prefer not to reset faces. :-)
BTW, Thanks for all your work! Photoprism is awesome! and I am a sponsor as well.
TRAC[2023-02-05T19:31:33Z] config: defaults file /etc/photoprism/defaults.yml does not exist
INFO[2023-02-05T19:31:33Z] config: case-insensitive file system detected
DEBU[2023-02-05T19:31:33Z] config: running on 'Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz', 4.1 GB memory detected
DEBU[2023-02-05T19:31:33Z] settings: loaded from /photoprism/storage/config/settings.yml
DEBU[2023-02-05T19:31:33Z] config: successfully initialized [35.329727ms]
INFO[2023-02-05T19:31:33Z] migrate: executing pre migrations
DEBU[2023-02-05T19:31:33Z] migrate: found 2 previous migrations
DEBU[2023-02-05T19:31:33Z] migrate: 20221015-100000 skipped
DEBU[2023-02-05T19:31:33Z] migrate: 20221015-100100 skipped
INFO[2023-02-05T19:31:34Z] migrate: executing main migrations
DEBU[2023-02-05T19:31:34Z] migrate: found 21 previous migrations
DEBU[2023-02-05T19:31:34Z] migrate: 20211121-094727 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20211124-120008 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220329-030000 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220329-040000 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220329-050000 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220329-060000 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220329-061000 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220329-070000 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220329-071000 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220329-080000 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220329-081000 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220329-083000 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220329-090000 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220329-091000 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220329-093000 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220421-200000 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220521-000001 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220521-000002 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220521-000003 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220927-000100 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20221002-000100 skipped
TRAC[2023-02-05T19:31:34Z] migrate: migrations migrated
TRAC[2023-02-05T19:31:34Z] migrate: auth_sessions migrated
TRAC[2023-02-05T19:31:34Z] migrate: countries migrated
TRAC[2023-02-05T19:31:34Z] migrate: labels migrated
TRAC[2023-02-05T19:31:35Z] migrate: photos_labels migrated
TRAC[2023-02-05T19:31:35Z] migrate: reactions migrated
TRAC[2023-02-05T19:31:35Z] migrate: files_share migrated
TRAC[2023-02-05T19:31:35Z] migrate: files_sync migrated
TRAC[2023-02-05T19:31:35Z] migrate: details migrated
TRAC[2023-02-05T19:31:35Z] migrate: photos_albums migrated
TRAC[2023-02-05T19:31:35Z] migrate: auth_users_shares migrated
TRAC[2023-02-05T19:31:35Z] migrate: places migrated
TRAC[2023-02-05T19:31:35Z] migrate: lenses migrated
TRAC[2023-02-05T19:31:35Z] migrate: keywords migrated
TRAC[2023-02-05T19:31:35Z] migrate: auth_users migrated
TRAC[2023-02-05T19:31:35Z] migrate: folders migrated
TRAC[2023-02-05T19:31:36Z] migrate: duplicates migrated
TRAC[2023-02-05T19:31:36Z] migrate: photos_users migrated
TRAC[2023-02-05T19:31:36Z] migrate: albums migrated
TRAC[2023-02-05T19:31:36Z] migrate: albums_users migrated
TRAC[2023-02-05T19:31:40Z] migrate: photos_keywords migrated
TRAC[2023-02-05T19:31:40Z] migrate: links migrated
TRAC[2023-02-05T19:31:40Z] migrate: subjects migrated
TRAC[2023-02-05T19:31:40Z] migrate: markers migrated
TRAC[2023-02-05T19:31:40Z] migrate: auth_users_settings migrated
TRAC[2023-02-05T19:31:40Z] migrate: services migrated
TRAC[2023-02-05T19:31:40Z] migrate: errors migrated
TRAC[2023-02-05T19:31:40Z] migrate: passwords migrated
TRAC[2023-02-05T19:31:40Z] migrate: cells migrated
TRAC[2023-02-05T19:31:40Z] migrate: cameras migrated
TRAC[2023-02-05T19:31:40Z] migrate: categories migrated
TRAC[2023-02-05T19:31:40Z] migrate: faces migrated
TRAC[2023-02-05T19:31:40Z] migrate: auth_users_details migrated
TRAC[2023-02-05T19:31:41Z] migrate: files migrated
TRAC[2023-02-05T19:31:41Z] migrate: photos migrated
DEBU[2023-02-05T19:31:41Z] migrate: completed in 7.796722162s
INFO[2023-02-05T19:31:41Z] faces: found 104 subjects
INFO[2023-02-05T19:31:41Z] faces: found no invalid marker subjects
INFO[2023-02-05T19:31:41Z] faces: found no invalid marker faces
INFO[2023-02-05T19:32:10Z] faces: found no ambiguous subjects
INFO[2023-02-05T19:32:10Z] faces: found no orphan face clusters
INFO[2023-02-05T19:32:10Z] faces: found no orphan people
INFO[2023-02-05T19:32:10Z] completed in 36.75042936s
INFO[2023-02-05T19:32:10Z] closed database connection
@lastzero commented on GitHub (Feb 23, 2023):
I now have a clearer idea of the underlying problem and think we can improve the situation by not matching manually labeled faces if they do not meet the constraints set for automatic clustering (more specifically, cluster core and core distance).
When testing with the pictures we had been using during development, a more aggressive approach for manually labeled faces seemed to work well, but this is only feasible for faces that the model can categorize well. For example, all these faces were matched after labeling a single face in the edit dialog:
@pjft commented on GitHub (Feb 24, 2023):
Thank you for the update - this looks promising! What consequences would this change bring about? Meaning, what will change by not matching manually labeled faces if they do not meet the constraints set for automatic clustering?
Just curious to understand the "why the original decision, what problems did it solve" and "what consequences does it have".
Not urgent, by any means - just intellectual curiosity. :)
Thank you, and happy Friday!
@lastzero commented on GitHub (Feb 24, 2023):
@pjft We can't tell what the final effect will be until it's implemented. It is expected that you will have to tag a few more faces that belong to the same person before the automatic matching starts, so basically how most other apps work.
Since these changes require extensive testing, we will probably wait until after the next stable release to avoid delaying it even further.
The sample faces we use for testing can be downloaded here:
@pjft commented on GitHub (Feb 24, 2023):
Of course - appreciate the time and effort in looking into this.
Let us know how we can help.
@pjft commented on GitHub (May 19, 2023):
Just wanted to say that this seems to be solved in the latest builds!
Awesome work :) It's a breeze to label hundreds of clusters now.
@lastzero commented on GitHub (May 19, 2023):
@pjft Thank you! That's good to know, since the feedback we've received on the optimizations hasn't been very enthusiastic so far. Most users didn't seem to notice (or care about) the changes when they tested our development preview builds.
@pjft commented on GitHub (May 19, 2023):
I have no issues whatsoever now, whereas I recall having to wait a minute or so before being able to move forward after a while.
Only comment I have is that, for some reason, the "audit fix" and "optimize faces" commands seem to cause more long-lasting issues than the ones I experience without them. Unsure if temporary or not, but I have not dug into them as it's painful to go back and redo all the work 😅
@lastzero commented on GitHub (May 19, 2023):
We will change how manually labeled faces are clustered and compared in one of the upcoming releases. This should solve the remaining issues.
@pjft commented on GitHub (May 19, 2023):
Awesome:) if you need any testing at any point, by all means tag me directly. I'm always happy to help if I can. I appreciate all the work you and @graciousgrey put into making Photoprism great! Thank you so much.
@berturu commented on GitHub (Jul 24, 2023):
Unfortunately, I still experience a delay when tagging new faces. The delay ranges from 2-3 seconds to possibly a full minute at times. Running Build 1.2306.25.
@lastzero commented on GitHub (Jul 24, 2023):
Unfortunately, we haven't had time to work on this yet. We will reference this issue in the related commits and mention it in the release notes once the improvements are available. Alternatively, feel free to work on this and provide us with a pull request that we just need to merge. Thank you! ❤️
@judej commented on GitHub (Aug 30, 2023):
+1 on the perf issue. When I manually tag faces, it can take about 20 seconds before I can tag another one. If I try tagging before that, it gives me a "busy, try again later" error.
I have found that if I navigate in the reverse order when tagging, sometimes, the tagging is faster - sub second. I get to tag about 4 or so pics and then it slows down again.
not sure if this helps...
@pabsi commented on GitHub (Oct 2, 2023):
Hi there @lastzero , the Photoprism team, and its awesome community,
I was about to raise a new issue, but I think this one fits my usecase.
I have a huge number of photos for which I have to manually tag faces. It seems that TensorFlow does not play nicely with children growing from the baby stage into toddlers, then kids, then teenagers. An obvious face change, but it does require more manual intervention from me. Also, going to public places (a zoo, a museum, a restaurant, etc.) catches lots of unwanted people.

I often find myself having to edit many faces per photo, and I find this process extremely tedious and slow. In my humble opinion, ideally we should be able to make all changes in one photo (add, remove, modify face tags) and, when leaving the photo, commit the changes to the DB. Again, in my humble opinion, the way it works now (make a change, wait for it to be committed) is very slow and underperforming.
Unsure if this GitHub issue is for the same purpose, but thought I'd mention it here first, before raising a new issue unnecessarily.
Thank you so much for this amazing piece of software 🙏
@richardpickett commented on GitHub (Oct 6, 2023):
I'm having this same issue. 26,000+ photos, mostly Asian family. I'm happy to provide as much info as possible, if it's helpful in resolving this issue.
running `photoprism/photoprism:ubuntu` on k8s with `mariadb:10.11`, version `230923-e59851350-Linux-AMD64-Plus`.

Sometimes manually tagging will take 6+ minutes (update: 8+ minutes), with usual wait times around 2 minutes. During the wait I can see 20+ messages like this:
`time="2023-10-06T15:53:24Z" level=warning msg="faces: failed removing merged clusters for subject js1xgms3dgksr08g"`

Each with a different subject name.
I've run:

- `photoprism migrations run -f` - no output, completes in less than a minute
- `photoprism index --cleanup` - (a) warnings on mp4s that are too big, (b) errors on jpgs making thumbnails (unexpected EOF while decoding), and (c) a lot of `failed removing merged clusters for subject <****>`

I'm adept with MySQL, k8s, bash, etc., so I'm completely comfortable with low-level debugging to help resolve this - just warn me before we do something that may potentially be destructive (also, I would like to know where the DB backup saves so I can copy it off as needed).
@stephanvierkant commented on GitHub (Jan 18, 2024):
I'm having the same issue. It takes about 30-60 seconds to add faces. While it would be great to improve the performance, could it be possible to have it done by a background process? I think it shouldn't block the front end.
@graciousgrey commented on GitHub (Jan 18, 2024):
We will take care of this as soon as we have released the new authentication options we are currently working on :)
@trop1kal commented on GitHub (Jan 28, 2024):
Awesome! I notice the delay when manually tagging faces as well.
Takes upwards of 2-3mins in most cases.
You guys rock! TY in advance!
@niclan commented on GitHub (Mar 13, 2024):
Same, takes minutes with CPU usage on the server ~30%+ and then peaking at 100% right before finishing. And sometimes it takes just seconds.
It seems like the server is blocking while this is going on.
A bit under 150K images.
@nodecentral commented on GitHub (May 20, 2024):
Just wanted to add to this, and share how inconsistent it can be when manually tagging (or de-tagging) faces. As such a key part of the product for me (and to help my aging parents remember people from their past) - anything that can be done to speed things up would be amazing..
The face tagging feature is the most important feature for a number of people - thank you so much for making it available
@theshadow27 commented on GitHub (Dec 20, 2024):
This has been bugging me for a while, but it's now taking 200-300 seconds per face. That's enough time to start looking through the code.
Background
The call that starts the blocking operation, and subsequently fails resulting in a 429, is `PUT /api/v1/markers/:marker_uid` (the controller is `internal/api/markers:182`, `func UpdateMarker(router *gin.RouterGroup)`). The 429 error is returned on line 191 when `mutex.UpdatePeople.Start()` is already locked. The next line uses `defer mutex.UpdatePeople.Stop()` to hold the mutex until the `UpdateMarker` function returns.

The issue is that the remainder of the function can take a VERY long time to return - upwards of 3 minutes. The application exhibits both CPU utilization (100% of 1 core) and DB traffic (distinct spikes of DB traffic and queries). As best I can tell from the logs, when `SaveForm` reports `changed` (which will almost always be true when adding a name to a face that did not have one before), we do some stuff:

1. `get.Faces().Optimize()` (when `marker.FaceID != "" && marker.SubjUID != "" && marker.SubjSrc == entity.SrcManual`, which is again always true from what I can gather based on the front end, since we passed the marker and subject in with the call and the src is always 'manual')
2. `query.UpdateSubjectCovers(true)` - seems like a single, well optimized query
3. `entity.UpdateSubjectCounts(true)` - also a single, well optimized query
4. `query.PhotoByUID(file.PhotoUID)` - also a single query
5. `p.UpdateAndSaveTitle()` - slightly more complicated, but nothing alarming

Based on log line timing, I suspect that most of the time is being spent in (1), `Faces().Optimize()`, which is invoked on every single successful call to PUT markers (`faces_optimize.go:Optimize()`).

The `Optimize()` function is scary for something routinely invoked by a front-end client. First off, it starts with a loop to 11 (although there is an early abort, this is a lot of effort). Up to 11 times, until no progress is made, we:

1. Query `ManuallyAddedFaces(false, false)`, basically `SELECT * FROM faces WHERE face_hidden = false AND face_src = "manual" AND subj_uid <> '' AND face_kind <= 1 ORDER BY subj_uid, samples DESC`. Remember - up to 10 times! Why?!? We are not manually adding new faces; the mutex has locked us out of doing so.
2. For each of the `n` manually added faces from the DB (mine has `n = 322` rows), call `merge[0].match(faces[j].embedding())` (will come to this later). If that returns `ok`, print a log line and append the face to the merge set.
3. Call `query.MergeFaces` - more on this next. It returns `Merged`, which is an `int`: a count of the number merged.
4. Finally, `log.Infof("faces: merged %s", english.Plural(res.Merged, "cluster", "clusters"))` so that we know that PhotoPrism is ready to accept the next name (3 minutes later).

In summary, we loop up to 11 * 322 times, calling `match` up to `len(faces) - len(subjects)` times (like 322 - 135 = 187 times for me) and `MergeFaces` `len(faces) - 2*len(subjects)` times (per the log lines, this seems to be about 121-122 times).

The 10 times is important - I am moderately confident that this loop never exits early (at least on my setup). Looking at the DB traffic for each name-adding operation: there are 11 distinct spikes of "UPDATE" statements (peak = 220/second).
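As a rough model of that control flow (a hypothetical sketch, not PhotoPrism's actual code; `optimizePasses` and `mergedPerPass` are invented names), the outer loop only exits early when a pass reports zero merges:

```go
package main

import "fmt"

// optimizePasses models the outer loop in Optimize(): it runs up to
// 11 passes (i <= 10) and exits early only when a pass reports that
// nothing was merged. mergedPerPass stands in for the Merged count
// that each pass claims.
func optimizePasses(mergedPerPass func(pass int) int) int {
	passes := 0
	for i := 0; i <= 10; i++ {
		passes++
		if mergedPerPass(i) == 0 {
			break // early abort: no progress on this pass
		}
	}
	return passes
}

func main() {
	// An honest counter that reports no work left exits after one pass.
	fmt.Println(optimizePasses(func(pass int) int { return 0 }))
	// A counter that always claims progress forces all 11 passes,
	// matching the 11 query spikes observed above.
	fmt.Println(optimizePasses(func(pass int) int { return 121 }))
}
```

With an inflated `Merged` count, the early abort never fires and every tag pays for all 11 passes.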
What happens in `MergeFaces`

Yikes. Remember this happens essentially 11 * (number of subjects) times for each `PUT` API call:

1. A new merged face is created (`entity.NewFace`) and saved via `FindFirstOrCreate`.
2. `MatchMarkers` does another query for all markers with the face IDs, running `Match` against each one - and potentially calling `SetFace`, which then calls `ResolveCollision`, which AGAIN matches all the embeddings and might write to the DB. In my case, this is rare, but it does happen. This means that most of the cycles are wasted.
3. `PurgeOrphanFaces` - this seems like an over-eager operation considering how many times it is run.

The end result is thousands of `UPDATE` operations on the DB, along with substantial CPU time spent in `Match()`, for each `PUT` API call.

Why Optimize Every Call?
TBH I have no clue. I put a lot of the source through GPT-4o and asked; its guess is as good as mine. I do agree that it does not need to run on each tagging operation. Perhaps it means less efficient clusters for the tagging session - fine. I can enter 50 names in the time it takes for one `Optimize` cycle to complete!

How to not `Optimize` on every call

- Instead of running `Optimize()` every time a face is tagged, queue the optimization task and run it periodically (e.g., every few minutes) or when a batch of changes is detected.
- Only trigger the `Optimize` call when a specific metric, which is inexpensive to calculate, indicates that the matching would benefit from optimization. Candidates include a `GROUP BY subject_uid` count in the `faces` table.
- Run `Optimize()` asynchronously, updating only the affected clusters/subjects.
- Run `Optimize` only when there have been no API calls in some time (5 minutes, or user-adjustable).
If it is necessary to `Optimize` on every call...

- Use `VEC_DISTANCE` in MariaDB 11.7 (and 10.12) to reduce the number of markers that need to be `Match`ed - I believe there is a separate ticket #4669 for this.

In conclusion
Regarding @lastzero's comment:

I am happy to help, but would appreciate some thoughts on the best strategy first so that it does not end up as wasted effort. Do you feel that it is critical to `Optimize` on every `PUT` API call? If so, why? If not, what is your preferred pattern for hiding this latency from the user? My goal is to contribute something usable, not add to the pile of 30-something unmerged PRs.

Edit: The loop actually runs 11 times, not 10 (because it's `j <= 10`). This matches the number of huge query spikes seen during the `Optimize` run in DBeaver.

@theshadow27 commented on GitHub (Dec 20, 2024):
To further investigate, I used DBeaver to export the faces table before and after one of these multi-minute, CPU-intensive `Optimize()` calls. The net diff was a single row (the face that was changed), which had only a single field added (`subject_uid`).

From this, I have to assume that the clusters aren't actually being merged in a meaningful way in the long-running `Optimize()` calls, despite the logs reporting that they are. Perhaps this is a bug; if the clusters were actually merged, there would not be repeated `"faces: merged 121 clusters [2m30.469009483s]"` lines in the log...

@theshadow27 commented on GitHub (Dec 21, 2024):
I think I found the bug causing the Optimize loop to not terminate early (and run all 11 times, `i := 0; i <= 10; i++`).

The early termination code checks to see whether any faces were merged on the last run through the loop. Basically, if a call to `query.MergeFaces(merge, false)` with mergeable faces (clusters that `Match` with `merge[0]`) does not error, we assume that all faces were merged, with `result.Merged += len(merge)`.

This is a flawed assumption because:

- Given a valid (same `subj_uid` and embedding `Match`) set of mergeable faces, `MergeFaces` is not guaranteed to actually eliminate any (or all) of the supplied Faces.
- In `MatchMarkers`, the "old, suboptimal" cluster `face_id` is not replaced when `m.Match(marker.Embeddings())` is not `ok`; in this case we `// Ignore` on face.go:L265.
- Without the `face_id` being replaced with the new super-cluster, `PurgeOrphanFaces` cannot delete the old clusters.
- There is an `if removed > 0` call in faces.go:L216, which suggests we should get an error log when this happens, and we don't see any such errors. Except:
- In the `PurgeOrphanFaces` function, there appears to be a subtle bug: if no rows are affected by the delete, i.e. `else if result.RowsAffected > 0` is not true, then `len(ids)` (the batch size) is added to the `affected` counter anyway, so `MergeFaces` returns as if it did useful work.

Because the daemon runs this function every 15 minutes, it is highly likely that all face clusters which can be merged have already been merged. In a static library in steady state that has at least 2 clusters which could be merged according to the embeddings matching, but where the smaller cluster can't be removed because the new cluster doesn't match the face, this loop will run 11 times, checking every cluster against all others with the same `subj_uid` each time.

This could be fixed in 2 ways:

1. Fix `PurgeOrphanFaces` so that if no rows are affected, it returns 0. This will trigger the Warnf in `MergeFaces` and the Errorf in `Optimize` each time, but at least `result.Merged` won't be incremented, so the 11x loop won't run unnecessarily. The downside (besides the log output) is that if `PurgeOrphanFaces` or `MergeFaces` is used elsewhere and relies on the broken behavior, those other consumers may no longer work as expected. GitHub doesn't seem to think that either function has other consumers, but it might have missed something.
2. Since we query `query.ManuallyAddedFaces` each time, we could track the number of face clusters returned (`n`) and terminate if `n >= prev_n` (i.e., the number of clusters increased or stayed the same). The downside is that it would require an extra query of all faces (with their embeddings).

@lastzero if you let me know whether you prefer fix (1) or fix (2), I am happy to submit a PR.
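Fix (1) boils down to counting only the rows the database actually deleted. A hypothetical sketch (`purgeBatches` and `dbDelete` are invented stand-ins, not PhotoPrism's API):

```go
package main

import "fmt"

// purgeBatches models the inner loop of PurgeOrphanFaces. dbDelete
// returns the RowsAffected count of the DELETE for one batch of IDs.
func purgeBatches(batches [][]string, dbDelete func(ids []string) int64) int64 {
	var affected int64
	for _, ids := range batches {
		rows := dbDelete(ids)
		// Before the fix, len(ids) was added even when rows == 0,
		// making MergeFaces look like it did useful work.
		affected += rows
	}
	return affected
}

func main() {
	noop := func(ids []string) int64 { return 0 } // DELETE matches nothing
	fmt.Println(purgeBatches([][]string{{"a", "b"}, {"c"}}, noop))
	// With affected == 0, result.Merged stays at 0 and the 11-pass
	// loop in Optimize can terminate early.
}
```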
There are a few other things I noticed while tracing this code:
In the `Optimize` function, the `else if len(merge) == 1 { merge = nil }` condition is probably a bug. This is trying to reset the set of mergeable faces for the case where it's the same subject but the faces don't match. In doing so, it skips a potential face to merge. Consider clusters A, B, C, D, E, F, all of the same subject. To fix this, change the code to `else if len(merge) == 1 { merge = entity.Faces{faces[j]} }`.

A more aggressive refactoring could also improve performance significantly: rather than querying all faces each time through the loop, we should get the list of faces at the start, then only query for face IDs with a `created_at` constraint, fetching only the new faces. Better yet (though more involved), all of the operations could be kept in memory, with an append-only log of 'added' and 'removed' entries to write to the DB once at the end. When there are hundreds of faces, this repeated querying consumes significant DB resources.

Edit: One further (but schema-changing) optimization would be to add a column `optimized_at` to the `faces` table. Whenever a face cluster was matched/added to a marker, this column would be cleared. In the `Optimize` loop, following an optimization run, each remaining cluster that was processed would have `optimized_at` set. Add one more argument to `ManuallyAddedFaces(hidden, ignored, optimized)` that only returns Faces where at least one face with that `subj_uid` has an empty `optimized_at` value... The query would be something like... Or thereabouts. In this way, we'd only be processing faces where at least one marker had changed, rather than doing the same work over and over.
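The effect of the `merge = nil` reset described above can be illustrated with a toy grouping loop (a simplified sketch of the reset behavior, not the exact branch structure in `Optimize`; `groupMerges` and the string-based `match` are invented):

```go
package main

import "fmt"

// groupMerges walks faces of one subject and collects mergeable sets.
// With buggy == true it mimics `merge = nil`: a face that fails to
// match the current set is dropped instead of seeding the next set.
func groupMerges(faces []string, match func(a, b string) bool, buggy bool) [][]string {
	var groups [][]string
	var merge []string
	for _, f := range faces {
		if len(merge) == 0 || match(merge[0], f) {
			merge = append(merge, f)
			continue
		}
		if len(merge) > 1 {
			groups = append(groups, merge)
		}
		if buggy {
			merge = nil // bug: f is lost as a merge candidate
		} else {
			merge = []string{f} // fix: f seeds the next merge set
		}
	}
	if len(merge) > 1 {
		groups = append(groups, merge)
	}
	return groups
}

func main() {
	sameGroup := func(a, b string) bool { return a[0] == b[0] }
	faces := []string{"a1", "a2", "b1", "b2"}
	fmt.Println(len(groupMerges(faces, sameGroup, true)))  // 1: {b1,b2} is missed
	fmt.Println(len(groupMerges(faces, sameGroup, false))) // 2: both sets found
}
```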
@theshadow27 commented on GitHub (Dec 30, 2024):
Have a PR going. No confirmed fix yet. https://github.com/photoprism/photoprism/pull/4691
BTW, tests are failing in WSL off `develop` - probably not your native env.

@theshadow27 commented on GitHub (Dec 30, 2024):
(from the PR for visibility by @lastzero )
This issue is deeper than initially suspected - the proposed fix (1) is necessary but not sufficient.
After implementing the proposed fix (1), the loop is still running all 11 times. It seems that when there are several clusters of the same size (in my case 6) it is possible for each iteration to attempt to merge them, resulting in one new cluster being created and one old cluster being dropped.
Trying to grok what is going on here...
When the new cluster is created, its ID comes from the embeddings... the centroid of all possible merge candidates. Somehow, this results in a new face hash (rounding issues? order of operations?) and the creation of a new automatic cluster which replaces the old one. Since `MatchMarkers` eagerly asserts ownership over the markers, it is happy to take them from the other automatic cluster. The database is updated by adding the new cluster and removing the old cluster (effectively a nop)... and the cycle repeats.

I assumed that the intent of the md5 `face_id` was to prevent exactly this problem, and I initially speculated that a stable sort order from `ManuallyAddedFaces` might fix it, but thinking about it some more, I can see how it would not: each time, the centroid is shifted slightly, resulting in a new `face_id` - one that will be very, very close, but not identical, to the previous one.

Initial thought: one way around this would be to do the match in-memory, i.e. `MatchMarkers` should store the results but not actually write anything. Then the result could be compared to the previous marker set, and if it is the same set of markers, not persist itself or take them over. Given the structure of the code, this is a complicated change. I suppose it could be done with less friction inside a `TRANSACTION` (with rollback) at the expense of additional DB overhead. I am not sure of the best way to proceed here. If you have any suggestions @lastzero, I would be happy to try them with my problematic db.
In the meantime, I will add a parameter to `Optimize(subj_uid)` that can be passed through to `ManuallyAddedFaces` for the special case of `markers.UpdateMarker`, when the `subj_uid` is known in advance. Because of the locking here, this will only ever result in that subject being optimized (it's the only face cluster to have changed, and the `merge` set is all the same `subj_uid`). At least that will cut down on the amount of work this loop has to do while the user is blocked.

Edit: Noticed another problem (bug?). The sort in `ManuallyAddedFaces` orders by `samples`, with the assumption that this is somehow connected to reality. However, it seems that this value is never updated - not with Optimize, not with `photoprism faces update -f`, not with `photoprism faces audit --fix`. Once it's there, it's set (as far as I have observed). In some spot checks, I have several cases where `samples` is something like 7-10, but there are 25+ markers with that `face_id` meeting the minimum score and size. Unless I'm missing something, and the intent is not to denote how many markers are matched to the cluster?

@lastzero commented on GitHub (Dec 31, 2024):
@theshadow27 Thanks for taking a look and sharing your insights! 🧐
Our current plan is to offload (most of) the calculations to the new MariaDB vector functions in an upcoming release:
This should yield better results and be much faster than our own vector comparisons, which were intended as a temporary solution until we have a proper vector database/library in place. Unfortunately, as we are currently performing a major upgrade of our frontend, I do not have any code or a more specific timeline/roadmap for this yet:
The edge cases that need to be handled/investigated (see your comments and what I've commented in other places) further encourage us to go this way. If you like, you are welcome to experiment with the vector functions and work on a proof of concept? ✨
Of course, if you find improvements to the current code that we can test, merge, and release within a reasonable timeframe, we'd be happy to do that as well - but my feeling is that the effort required for a fast and completely bug-free implementation is beyond what anyone could maintain/contribute... 🤔
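To illustrate the kind of calculation that would be offloaded to the database: face matching ultimately boils down to comparing embedding vectors against a cluster's matching radius in application code. Below is a minimal, hypothetical Go sketch of that comparison - the function and type names are illustrative and are not PhotoPrism's actual API:

```go
package main

import (
	"fmt"
	"math"
)

// euclideanDist returns the Euclidean distance between two face
// embedding vectors. This is an illustrative stand-in for the kind of
// in-process vector comparison that MariaDB's vector functions could
// replace, not the project's actual code.
func euclideanDist(a, b []float64) float64 {
	sum := 0.0
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return math.Sqrt(sum)
}

// matches reports whether an embedding falls inside a cluster's
// matching radius (the threshold value here is hypothetical).
func matches(embedding, center []float64, threshold float64) bool {
	return euclideanDist(embedding, center) <= threshold
}

func main() {
	center := []float64{0.1, 0.2, 0.3}
	candidate := []float64{0.1, 0.25, 0.3}
	fmt.Println(matches(candidate, center, 0.1)) // distance 0.05 is within the radius
}
```

Doing this per marker in Go scales linearly with the number of stored faces, which is one reason pushing the distance computation into an indexed database function is attractive.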
@theshadow27 commented on GitHub (Dec 31, 2024):
@lastzero thanks for getting back, and happy new year! 🥂
First off, IMHO the current implementation/"temporary" solution is VERY good at meeting its functional requirements - better than Apple Photos for sure. In the rare situation where it's confused a subject, it's straightforward to eject them and fix it. New markers are identified and matched quickly and accurately. Youth sports are a brutal use case, and I sincerely appreciate the effort that went into the current version! The only issue I have is with the nonfunctional UX requirement of not having excessive delays while naming many markers in quick succession.
To be clear: the current state of the draft PR offers a significant improvement to the UX; by limiting the `Optimize` function to the face supplied in the POST, it almost never triggers the 11-iteration loop, unless you are specifically working on one of those subjects that has problematic clusters. It is not a big PR (40 lines, 12 lines functional), and I don't believe it changes the behavior at all for the worker/CLI usage - a huge plus for building confidence after years of stable deployments.
Regarding the Vector implementation - I'm already thinking about it, but the scope will be much wider than this PR, with impacts from DBSCAN up to everywhere that `Match` is used (at least 11 places) - likely requiring substantial rework. In addition to schema updates, a Vector implementation requires a minimum DB version that could break a lot of existing embedded/hosted installs without manual intervention during the upgrade. To keep backwards compatibility, some version of the current implementation strategy needs to be encapsulated behind an interface, further expanding the scope of refactoring, and potentially running with two different Gorm entities depending on a flag. I don't know enough about Gorm to know if that's OK or not. This is a big project: lots of new code, lots of testing, bug hunting, and so on.
As such, my suggestion is to try to get the minimum change for maximum UX improvement merged sooner rather than later - there are quite a few users who have reported this or derivative issues going back to October 2022. It doesn't fix the root cause, but it hides it well enough to make a big difference for those who are impacted. What do you think?
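The subject-scoped approach described above can be sketched roughly as follows - a simplified illustration under assumed names; the real entity fields and function signatures in PhotoPrism differ:

```go
package main

import "fmt"

// Face is a simplified stand-in for a face cluster; the field names
// are illustrative and do not match PhotoPrism's entity package.
type Face struct {
	ID      string
	SubjUID string
}

// optimize stands in for the merge/re-cluster pass and returns how
// many clusters were processed (the actual logic is elided).
func optimize(faces []Face) int {
	return len(faces)
}

// optimizeSubject restricts the work to clusters belonging to a single
// subject, mirroring the idea of passing subj_uid through from the
// marker update instead of re-optimizing every subject. An empty
// subjUID keeps the existing worker/CLI behavior unchanged.
func optimizeSubject(all []Face, subjUID string) int {
	if subjUID == "" {
		return optimize(all)
	}
	var scoped []Face
	for _, f := range all {
		if f.SubjUID == subjUID {
			scoped = append(scoped, f)
		}
	}
	return optimize(scoped)
}

func main() {
	all := []Face{{"f1", "s1"}, {"f2", "s1"}, {"f3", "s2"}}
	fmt.Println(optimizeSubject(all, "s1")) // processes only subject s1's clusters
}
```

The key design point is that the interactive POST path supplies a known `subj_uid`, so only that subject's clusters are touched while the user waits, while batch callers pass nothing and keep the full pass.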
P.S. If you're not comfortable rolling it out everywhere, it is not much more to hide it behind a flag (like `PHOTOSTREAM_EXPERIMENTAL_FACES_OPTIMIZE`) that would eliminate nearly all of the risk for "normal" users doing routine upgrades.
P.P.S. I am working on another branch that has a more substantial diff: a complete rewrite of the `Optimize` function for how I think it should work - breaking out each `subj_uid` into its own struct and operating on them independently, spending more effort on finding mergeable clusters within a subject, going back to ground-truth tagged markers instead of cluster-of-clusters, and only re-fetching faces when that particular subject had a change. This will be several hundred lines of diff, so I don't expect it to be accepted without substantial testing that I will likely not finish on this holiday 😄.
@lastzero commented on GitHub (Mar 26, 2025):
@theshadow27 Thank you so much for your patience while we completed the frontend upgrade! 🙏
The last batch of UI/UX enhancements was released last week, so I'm happy to report that your PR is now merged and ready to be tested with our development preview build:
Any help with that is much appreciated :)
@Coloradohusky commented on GitHub (Apr 6, 2025):
While I am not OP, I have definitely noticed speedups with manual tagging when on the latest preview build, `250405-eed7bfff7` - speeds went from around 30 seconds to a minute+ before, to around 2-10 seconds now (although in certain cases it can still take 45 seconds, it is much better overall).
@lastzero commented on GitHub (Apr 7, 2025):
@Coloradohusky Thanks for testing! While the API could also be faster due to other performance improvements we've implemented in the meantime, it's definitely great to hear :)
@mat1990dj commented on GitHub (Jan 12, 2026):
I have to say that after the last November update (251130), this is happening again. I ran `photoprism faces audit --fix` and `photoprism faces index`, and I have a LOT to fix. After every change I need to wait more than one minute until the next action on faces:
```
time="2026-01-09T14:31:57Z" level=info msg="subject: added person 'Olivia Valles'"
time="2026-01-09T14:32:59Z" level=info msg="changes successfully saved"
```
@davidetoile commented on GitHub (Jan 12, 2026):
I confirm, same problem on my side...
@mat1990dj commented on GitHub (Jan 12, 2026):
Additionally, I think we should also have a screen that shows detected faces ungrouped, with the ability to multi-select them and then assign a person.