Import: Better handling of truncated WebDAV uploads. #2094

Open
opened 2026-02-20 01:06:02 -05:00 by deekerman · 11 comments
Owner

Originally created by @kevincox on GitHub (May 11, 2024).

**Describe what problem this solves and why this would be valuable to many users**

Sometimes a WebDAV upload won't complete. Maybe the network dropped, maybe the battery ran out, or maybe PhotoPrism was restarted.

The exact behaviour will depend on the upload/sync tool that you are using, but in my scenario ([FolderSync](https://foldersync.io/)) what happens is this:

  1. Upload partial file.
  2. Partial file is imported.
  3. Upload complete file.
  4. Complete file is imported.

This results in two items in my PhotoPrism library. Ideally this common scenario could be handled so that only the complete file ends up in the library.

**Describe the solution you'd like**

When a file is imported and there are existing files with the same original name and a smaller size, PhotoPrism should check whether the existing file is a prefix of the new file. If so, the old file should be replaced.

Possible tweaks:

  1. What if there are *tons* of files with the same original name and a smaller size? Maybe this should be limited in some way? Maybe only consider the most recent 100, or only files uploaded in the past 30 days?
  2. Are there common situations where a non-truncated file is a prefix of another file? Maybe we do this if the original is damaged in some way? A recency check like above could also mitigate this.

**Describe alternatives you've considered**

  1. If a WebDAV upload connection drops delete the file.
    • Pro: Simple.
    • Con: This would prevent resuming the upload.
  2. If a WebDAV upload connection drops flag that as "unfinished" until a new connection overwrites/completes the file.
    • Con: What if the upload did complete? Then the next upload would never come.
  3. Detect corrupt files and don't import them.
    • Pro: Upload can be resumed forever as the file will sit in the imports folder even past an import.
    • Con: The photo can't be managed in PhotoPrism until the full upload occurs.
    • Con: What if the full file never comes? It may be better to import the partial file. Would this need an alternative process?
  4. Import the file but don't delete it from the upload folder.
    • Pro: Upload can be resumed forever as the file will sit in the imports folder even past an import.
    • Pro: Can be managed immediately.
    • Con: If the full file never comes, the upload folder will be cluttered (possibly causing issues if someone wants to upload a file with the same name).

**Additional context**

The files I was uploading were JPEG files from the Google Camera app. The truncated files were handled relatively well by PhotoPrism, with the image just grey after the point where the truncation happened. However, needing to review all of the uploads later for files with the same name and deleting the smaller one was annoying, tedious and error-prone. In some cases I may not have realized that this happened and had these truncated versions filling my library.


@phemmer commented on GitHub (Jun 23, 2024):

I created a similar issue a few weeks ago (didn't notice this issue at the time): #4310. Ignore the title, that's what it was changed to. The issue was about handling incomplete webdav uploads, with multiple proposed solutions. Didn't seem to be of much interest, and was instead closed by disabling auto-import.


@lastzero commented on GitHub (Jun 23, 2024):

Kindly note that you had opened a bug ticket, not a feature request / improvement / enhancement issue. I've then done my very best to provide feedback and agreed that this feature should not be enabled in its current form. In addition, I've agreed that the current limitations and potential risks must be better documented. Following that assessment, we worked overtime to publish a new release and improved documentation without any help from our community.

Note that you now again get very fast feedback on a Sunday evening to let you and everyone else know that we would appreciate contributions to further improve this. I really don't know how you got the impression that we are not interested?

My suggestion would be to create new GitHub issues to cover each aspect that needs improvement as follow-up tasks to your original bug report (which, according to the template we provide, should focus on existing features that do not work as documented vs features you would like to see changed, added or improved). Please let us know if you don't know how to scope the issues / acceptance criteria so we can help. Thank you very much! ❤️


@kevincox commented on GitHub (Jun 23, 2024):

Yeah, the tickets are slightly different but definitely related. I think we should have a good story for uploads on unreliable networks. The best approach may be a native upload tool that can have an explicit commit and can check if files have been uploaded before starting again. However, the widely compatible WebDAV is still great and it would be nice to have a good solution.

Firstly, it seems from the other ticket that the `PHOTOPRISM_AUTO_IMPORT` timer should be reset upon any write, not just the start or end of an upload request. That way, even very slow networks should be able to send a packet every few minutes the vast majority of the time. But this is likely a separate improvement to both this issue and https://github.com/photoprism/photoprism/issues/4310.

I think this issue should focus on a solid story for this. I like some of the ideas proposed in both this issue and https://github.com/photoprism/photoprism/issues/4310. I think the best ideas so far are:

  1. Detect re-uploads of imported truncated files and "upgrade" them.
    • Pro: Needs no support from the upload tool.
    • Pro: May work well in more complex cases than just a lost connection (e.g. an upload while the source is being written, a temporary disk error, ...).
  2. Use some scheme (such as hidden files) to "hide" files from import until they are completely uploaded.
    • Pro: Avoids any importing of incomplete files.
    • Pro: Not PhotoPrism-specific.
    • Con: Requires support for the uploading tool (upload first as hidden then rename).

@lastzero commented on GitHub (Jun 24, 2024):

@kevincox Thanks for the summary and ideas!

> The best approach may be a native upload tool that can have an explicit commit and can check if files have been uploaded before starting again.

This is already possible with the current API and the current default settings (i.e. with auto-import disabled), as API clients can trigger an import when the upload is complete (ideally, each app should use its own subfolder for uploads, just like the upload functionality in our web app):

https://github.com/photoprism/photoprism/blob/b70a360b4ab7ec66f71df564e40f3a43a5c6ee4b/internal/api/import.go#L31-L35

> Firstly it seems from the other ticket that the `PHOTOPRISM_AUTO_IMPORT` timer should be reset upon any write [...]

That is correct. However, I could not find a solution to this in the time I had available. The existing WebDAV request handler was originally developed to check permissions before users can upload or access files. So it is always called before the upload starts, but not at regular intervals while files are being uploaded:

https://github.com/photoprism/photoprism/blob/b70a360b4ab7ec66f71df564e40f3a43a5c6ee4b/internal/server/webdav.go#L73-L77

Getting status updates might be possible through the logging callback or by providing a custom FileSystem or LockSystem implementation:

  • https://pkg.go.dev/golang.org/x/net/webdav#Handler

I have started working in this direction by adding a function that returns a `LockSystem` when implementing the fix for #4310. However, due to our high workload and many other tasks, I have not been able to further evaluate the suitability of the `LockSystem` for this purpose and, if it is suitable, develop our own implementation (or find one that we can use):

https://github.com/photoprism/photoprism/blob/b70a360b4ab7ec66f71df564e40f3a43a5c6ee4b/internal/mutex/webdav.go#L12-L13

> Detect re-uploads of imported truncated files and "upgrade" them.

What sounds simple is not trivial in practice, as it can be difficult to match newly uploaded (re-uploaded) files with existing originals. The use of file names, often in the format IMG_1234.JPG, is unreliable and could lead to other pictures getting replaced. So this is not an option.

Since the new (complete) and the old (incomplete) file are different by definition, using SHA1 hashes doesn't work either.

The only option I see is to match the files using metadata intended for this purpose, e.g. the same unique IDs we currently support for stacking:

  • https://docs.photoprism.app/user-guide/settings/library/#stacks

Unfortunately, not many files have a unique ID and the metadata may also be missing if the files are incomplete.

*Note that using the filename works if you do not use the import functionality, i.e. upload directly to the originals folder, which is why auto-index is still enabled by default.*

> Use some scheme (such as hidden files) to "hide" files from import until they are completely uploaded.

Some WebDAV clients might do this automatically (e.g. macOS Finder and maybe PhotoSync?), so it's worth investigating this further.

Another idea would be to check the `Content-Length` header if the client sends one. However, this would require implementing an "after-upload" event handler (see above for how this could be done) to compare the actual file size with the expected file size.

You could also check whether uploaded files have an `EOF` byte at the end. However, this would make it impossible to upload broken files, which some users might consider a bug.
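For JPEG specifically, such a check is straightforward, since a well-formed file begins with the SOI marker (0xFF 0xD8) and ends with the EOI marker (0xFF 0xD9). A minimal sketch with a hypothetical helper name — note that some writers append trailing metadata after EOI, so this is only a heuristic, not a definitive completeness test:

```go
package main

// jpegLooksComplete reports whether data starts with the JPEG
// Start-Of-Image marker and ends with the End-Of-Image marker. A
// truncated upload usually cuts the stream before the final EOI bytes.
// Heuristic only: files with data appended after EOI would be rejected.
func jpegLooksComplete(data []byte) bool {
	n := len(data)
	return n >= 4 &&
		data[0] == 0xFF && data[1] == 0xD8 && // SOI at the start
		data[n-2] == 0xFF && data[n-1] == 0xD9 // EOI at the end
}
```

Other container formats (e.g. MP4/ISO-BMFF) would each need their own end-of-file logic, which is why this can't be a single generic check.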

Finally, even if you know that some files are incomplete, you need to decide on the consequences (because the upload could freeze and never be completed or attempted again), for example:

  • Skip the incomplete files and the related sidecar files when importing (after the safety delay) and import the others.
  • Do not import files in the import subfolder (what happens if the main import folder was used to upload files? what happens if the file is uploaded again but to a different folder?)

I hope that's enough feedback and inspiration for now, because I need to move on to completing OpenID Connect support. Some users are already complaining because they're not seeing any progress, and that's just as frustrating for us as it is for them.

If you have questions, please also feel free to ask in our Contributors and/or Community Chat!


@kevincox commented on GitHub (Jun 24, 2024):

> it can be difficult to match newly uploaded (re-uploaded) files with existing originals. The use of file names, often in the format IMG_1234.JPG, is unreliable and could lead to other pictures getting replaced.

I put more detail in the original post. I think it should be reliable to use file names to narrow the selection, but then check for prefixes before replacing. Even if two Apple devices try to upload IMG_0001.JPG at the same time after a truncated upload, only the first one should match the prefix; after the truncated file is replaced with the first one, the "prefix" is now the full first file, so the second won't match.

> You could also check whether uploaded files have an `EOF` byte at the end.

What do you mean by `EOF` byte? Do you mean that the file type itself indicates that it is complete? I think this would work for most media files, as they can typically indicate this. I also agree that we should be able to import corrupted files (even if manually). This is why I was recommending the approach of importing the truncated file eagerly, then, when the full file comes in, detecting it and "upgrading" the prefix to the whole file.


@lastzero commented on GitHub (Jun 24, 2024):

> I put more detail in the original post. I think it should be reliable to use file names to narrow the selection but then check for prefixes before replacing.

Yes, please share more details about this approach. I'm not sure I understand it correctly yet. Importing files from a (possibly single) directory means that there is no reliable or usable path information, and the library may contain thousands of files with the same original name, i.e. the file name before copying/moving a file from the *import* to the *originals* folder.

> You mean that the file type itself indicates that it is complete

Yes, exactly. There may be file types for which this is not reliable or possible.


@kevincox commented on GitHub (Jun 24, 2024):

The system I am thinking is basically this:

  1. Whenever a file is imported.
  2. Search for existing candidates.
    • The simplest search would be "any file with the same original name".
    • Maybe also consider age?
  3. Check if any candidate is a prefix of the new file.
  4. If so then update that candidate to be the new file.

If two different files with the same name are uploaded at the same time, it is safe as long as steps 3 and 4 are performed atomically, because once the first file replaces the truncated one, the prefixes will no longer match.
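The flow above can be sketched with a minimal in-memory stand-in for the index (all names here are hypothetical; a real implementation would query the database and replace files on disk):

```go
package main

import (
	"bytes"
	"sync"
)

// library maps original file names to stored versions; a stand-in for the
// index so the proposed flow can be shown without PhotoPrism internals.
type library struct {
	mu    sync.Mutex
	files map[string][][]byte
}

// importFile runs the proposed steps: find candidates with the same
// original name, check whether one is a prefix of the upload, and if so
// replace it; otherwise store the upload as a new item. The prefix check
// and the replacement happen under one lock, so two concurrent uploads of
// different files with the same name cannot both "upgrade" one candidate.
func (l *library) importFile(name string, data []byte) (upgraded bool) {
	l.mu.Lock()
	defer l.mu.Unlock()
	for i, old := range l.files[name] {
		if len(old) < len(data) && bytes.HasPrefix(data, old) {
			l.files[name][i] = data // upgrade the truncated candidate
			return true
		}
	}
	l.files[name] = append(l.files[name], data) // import as a new item
	return false
}
```

After the first complete file replaces the truncated one, a second, different file with the same name no longer matches any prefix and is imported alongside it, which is the safety property described above.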


@lastzero commented on GitHub (Jun 24, 2024):

Binary comparison of all files with the same name (up to the size of the original file if it is smaller) might be extremely slow and cause a lot of I/O, which is especially problematic for users with files on a network share or slow devices like a Raspberry Pi. I would therefore advise against doing this by default for every file upload or import.

The performance overhead could possibly be managed if we were to create an additional checksum (hash) that only covers e.g. the first 256 KB or 512 KB of a file. However, this is just an idea and would have to be checked for feasibility and performance before implementation.
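A sketch of such a partial hash (the function name is made up and the 256 KB limit is the figure floated above; as noted, feasibility and performance would need to be checked first):

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
)

// partialHashLimit is the illustrative 256 KB cutoff from the discussion;
// the right value would have to be determined by benchmarking.
const partialHashLimit = 256 * 1024

// partialHash returns the SHA-1 of at most the first partialHashLimit
// bytes of data, so candidates could be pre-filtered via the index
// without re-reading whole files. Files smaller than the limit are hashed
// in their entirety (the fallback case mentioned below).
func partialHash(data []byte) string {
	if len(data) > partialHashLimit {
		data = data[:partialHashLimit]
	}
	sum := sha1.Sum(data)
	return hex.EncodeToString(sum[:])
}
```

Two files sharing the same first 256 KB get the same partial hash regardless of their total length, which is exactly what makes it usable as a cheap pre-filter for the prefix check.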


@kevincox commented on GitHub (Jun 24, 2024):

Yeah, this is the downside but I doubt it will be a huge issue for most cases.

  1. Most comparisons will fail quickly.
  2. Most of the time there will not be a large number (or any) matching files. Although I do concede that some devices have collision-prone file naming schemes.
  3. Importing is typically a once-per-file and expensive operation anyway.

So I think pairing this with either an age limit, candidate limit (sorted by age or something) or both could ensure that we never hit the rare extreme cases.

The precomputed checksum is a good idea as well. The only downside is for files smaller than the selected size. However, those can just be checked in their entirety as a fallback, and they may be more likely to be truncated anyway. (Who takes 256 KB photos these days? 😛)

There may be other filtering that can be done; for example, in formats with a clear end marker, files could be tagged as "complete" and never considered as candidates (at the small risk of missing some trailing data that shouldn't be there but might be).

In the end, missing a few possible matches is likely acceptable because the behaviour falls back to importing both files. That would require manual work, but no data would be lost.


@lastzero commented on GitHub (Jun 24, 2024):

While this may be true in general, keep in mind that we have a very diverse community of users, most of whom have a different setup than you. If network storage is used (and no partial hash is available in the index database), a range request still needs to be performed for each matching file.

Even though S3 generally supports range requests, some of these files may have been moved to cold storage to reduce hosting costs, so retrieving the first few bytes could result in a huge performance penalty. In addition, remote file system drivers/abstractions such as davfs2 may not support range requests at all:

  • https://savannah.nongnu.org/support/?107649


@kevincox commented on GitHub (Jun 24, 2024):

Yeah, there are always infinite use cases. Probably making a few simple things configurable, even just `PHOTOPRISM_TRUNCATION_DETECTION_MAX_CANDIDATES`, could make it palatable for most users. Set it to `0` to disable the check entirely, `1` for low cost, or `100` if you have fast local storage.
