mirror of
https://github.com/photoprism/photoprism.git
synced 2026-03-02 22:57:18 -05:00
Import: Better handling of truncated WebDAV uploads. #2094
Originally created by @kevincox on GitHub (May 11, 2024).
Describe what problem this solves and why this would be valuable to many users
Sometimes a WebDAV upload won't complete. Maybe the network dropped, maybe the battery ran out, or maybe PhotoPrism was restarted.
The exact behaviour depends on the upload/sync tool you are using, but in my scenario (FolderSync) what happens is this:
This results in two items in my PhotoPrism library. Ideally this common scenario could be handled so that only the complete file ends up in the library.
Describe the solution you'd like
When a file is imported and there are existing files with the same original name and a smaller size PhotoPrism should check if the existing file is a prefix of the new file. If so the old file should be replaced.
Possible tweaks:
Describe alternatives you've considered
Additional context
The files I was uploading were JPEG files from the Google Camera app. The truncated files were handled relatively well by PhotoPrism, with the image simply grey after some point (where the truncation happened). However, needing to review all of the uploads later for files with the same name and deleting the smaller one was annoying, tedious and error prone. In some cases I may not have realized that this happened at all and would have been left with these truncated versions filling my library.
@phemmer commented on GitHub (Jun 23, 2024):
I created a similar issue a few weeks ago (didn't notice this issue at the time): #4310. Ignore the title; that's what it was changed to. The issue was about handling incomplete WebDAV uploads, with multiple proposed solutions. It didn't seem to attract much interest, and was instead closed by disabling auto-import.
@lastzero commented on GitHub (Jun 23, 2024):
Kindly note that you had opened a bug ticket, not a feature request / improvement / enhancement issue. I then did my best to provide feedback and agreed that this feature should not be enabled in its current form. In addition, I agreed that the current limitations and potential risks must be better documented. Following that assessment, we worked overtime to publish a new release and improved documentation without any help from our community.
Note that you are now again getting very fast feedback on a Sunday evening, to let you and everyone else know that we would appreciate contributions to further improve this. I really don't know how you got the impression that we are not interested.
My suggestion would be to create new GitHub issues covering each aspect that needs improvement, as follow-up tasks to your original bug report (which, according to the template we provide, should focus on existing features that do not work as documented, rather than features you would like to see changed, added or improved). Please let us know if you need help scoping the issues / acceptance criteria. Thank you very much! ❤️
@kevincox commented on GitHub (Jun 23, 2024):
Yeah, the tickets are slightly different but definitely related. I think we should have a good story about uploads on unreliable networks. The best approach may be a native upload tool that can have an explicit commit and can check if files have been uploaded before starting again. However the widely compatible WebDAV is still great and it would be nice to have a good solution.
Firstly, it seems from the other ticket that the PHOTOPRISM_AUTO_IMPORT timer should be reset upon any write, not just at the start or end of an upload request. That way, even very slow networks should be able to send a packet every few minutes the vast majority of the time. But this is likely a separate improvement from both this issue and https://github.com/photoprism/photoprism/issues/4310.
I think this issue should focus on a solid story for this. I like some of the ideas proposed in both this issue and https://github.com/photoprism/photoprism/issues/4310. I think the best ideas so far are:
@lastzero commented on GitHub (Jun 24, 2024):
@kevincox Thanks for the summary and ideas!
This is already possible with the current API and the current default settings (i.e. with auto-import disabled), as API clients can trigger an import when the upload is complete (ideally, each app should use its own subfolder for uploads, just like the upload functionality in our web app):
github.com/photoprism/photoprism@b70a360b4a/internal/api/import.go (L31-L35)
That is correct. However, I could not find a solution to this in the time I had available. The existing WebDAV request handler was originally developed to check permissions before users can upload or access files. So it is always called before the upload starts, but not at regular intervals while files are being uploaded:
github.com/photoprism/photoprism@b70a360b4a/internal/server/webdav.go (L73-L77)
Getting status updates might be possible through the logging callback or by providing a custom FileSystem or LockSystem implementation:
I have started working in this direction by adding a function that returns a LockSystem when implementing the fix for #4310. However, due to our high workload and many other tasks, I have not been able to further evaluate the suitability of the LockSystem for this purpose and, if it is suitable, develop our own implementation (or find one that we can use):
github.com/photoprism/photoprism@b70a360b4a/internal/mutex/webdav.go (L12-L13)
What sounds simple is not trivial in practice, as it can be difficult to match newly uploaded (re-uploaded) files with existing originals. The use of file names, often in the format IMG_1234.JPG, is unreliable and could lead to other pictures getting replaced. So this is not an option.
Since the new (complete) and the old (incomplete) file are different by definition, using SHA1 hashes doesn't work either.
The only option I see is to match the files using metadata intended for this purpose, e.g. the same unique IDs we currently support for stacking:
Unfortunately, not many files have a unique ID and the metadata may also be missing if the files are incomplete.
Note that using the filename works if you do not use the import functionality, i.e. upload directly to the originals folder, which is why auto-index is still enabled by default.
Some WebDAV clients might do this automatically (e.g. macOS Finder and maybe PhotoSync?), so it's worth investigating this further.
Another idea would be to check the Content-Length header if the client sends one. However, this would require implementing an "after-upload" event handler (see above for how this could be done) to compare the actual file size with the expected file size.
You could also check whether uploaded files have an EOF byte at the end. However, this would make it impossible to upload broken files, which some users might consider a bug.
Finally, even if you know that some files are incomplete, you need to decide on the consequences (because the upload could freeze and never be completed or attempted again), for example:
I hope that's enough feedback and inspiration for now, because I need to move on to completing OpenID Connect support. Some users are already complaining because they're not seeing any progress, and that's just as frustrating for us as it is for them.
If you have questions, please also feel free to ask in our Contributors and/or Community Chat!
@kevincox commented on GitHub (Jun 24, 2024):
I put more detail in the original post. I think it should be reliable to use file names to narrow the selection, but then check for prefixes before replacing. Even if two Apple devices try to upload IMG_0001.JPG at the same time after a truncated upload, only the first one should match the prefix; once the truncated file has been replaced with the first one, the "prefix" is now the full first file, so the second won't match.
What do you mean by EOF byte? You mean that the file type itself indicates that it is complete? I think this would work for most media files as they typically can indicate this. I also agree that we should be able to import corrupted files (even if manually). This is why I was recommending the approach of importing the truncated file eagerly, then when the full file comes in detecting it and "upgrading" the prefix to the whole file.
@lastzero commented on GitHub (Jun 24, 2024):
Yes, please share more details about this approach. I'm not sure I understand it correctly yet. Importing files from a (possibly single) directory means that there is no reliable or usable path information and the library may contain thousands of files with the same original name, i.e. the file name before copying/moving a file from the import to the originals folder.
Yes, exactly. There may be file types for which this is not reliable or possible.
@kevincox commented on GitHub (Jun 24, 2024):
The system I am thinking of is basically this:
If two different files with the same name are uploaded at the same time, it is safe as long as steps 4 and 5 are performed atomically, because once the first file replaces the truncated one, the prefixes will no longer match.
@lastzero commented on GitHub (Jun 24, 2024):
Binary comparison of all files with the same name (up to the size of the original file if it is smaller) might be extremely slow and cause a lot of I/O, which is especially problematic for users with files on a network share or slow devices like a Raspberry Pi. I would therefore advise against doing this by default for every file upload or import.
The performance overhead could possibly be managed if we were to create an additional checksum (hash) that only covers e.g. the first 256 KB or 512 KB of a file. However, this is just an idea and would have to be checked for feasibility and performance before implementation.
@kevincox commented on GitHub (Jun 24, 2024):
Yeah, this is the downside, but I doubt it will be a huge issue in most cases.
So I think pairing this with an age limit, a candidate limit (sorted by age or something), or both could ensure that we never hit the rare extreme cases.
The precomputed checksum is a good idea as well. The only downside is for files smaller than the selected size. However, those can just be checked in their entirety as a fallback, and they may be more likely to be truncated anyway. (Who takes 256 KB photos these days? 😛)
There may be other filtering that can be done. For example, in formats with a clear end marker, complete files could be tagged as "complete" and never considered as candidates (at the small risk of missing some trailing data that shouldn't be there but may be).
In the end, missing a few possible matches is likely acceptable because the behaviour falls back to importing both files. That would require manual work, but no data would be lost.
@lastzero commented on GitHub (Jun 24, 2024):
While this may be true in general, keep in mind that we have a very diverse community of users, most of whom have a different setup than you. If network storage is used (and no partial hash is available in the index database), a range request still needs to be performed for each matching file.
Even though S3 generally supports range requests, some of these files may have been moved to cold storage to reduce hosting costs, so retrieving the first few bytes could result in a huge performance penalty. In addition, remote file system drivers/abstractions such as davfs2 may not support range queries at all:
@kevincox commented on GitHub (Jun 24, 2024):
Yeah, there are always infinite use cases. Probably making a few simple things configurable, even just PHOTOPRISM_TRUNCATION_DETECTION_MAX_CANDIDATES, could make it palatable for most users. Set it to 0 to disable entirely, 1 for low cost, or 100 if you have fast local storage.