mirror of
https://github.com/photoprism/photoprism.git
synced 2026-03-02 22:57:18 -05:00
AI: Generate captions with Clip Interrogator #1785
Originally created by @lastzero on GitHub (May 30, 2023).
Discussed in https://github.com/photoprism/photoprism/discussions/3419
Originally posted by sfxworks May 20, 2023

I was looking around a lot for a way to handle and autotag images using something better than the tensor model. I made a Stable Diffusion interrogator wrapper/endpoint at https://github.com/sfxworks/interrogator-http and a script that (roughly) calls the API to add a description and label the images it goes through. You'll need your session ID, as well as an additional token that I found but couldn't locate in the documentation, as noted by my `t=small-token-here` example.
You can use multiple containers, though it could use some fine-tuning when you do so, since the iteration doesn't handle errors yet.
I am currently using model ViT-H-14/laion2b_s32b_b79k that runs against Stable Diffusion v2. More information here https://github.com/pharmapsychotic/clip-interrogator
This is a rough script. A lot more work needs to be done, such as handling labels a bit better (maybe joining them instead of just splitting by word), getting video thumbnails instead of skipping them, and trying other methods. Note that it takes about 2 minutes to interrogate an image to get these sorts of results, and I currently have it running on a GTX 1070 with some performance args to reduce memory (which also reduce quality). So your mileage may vary.
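The original script is not reproduced in this thread, so the following is only a minimal sketch of the "call the API to add a description" step. It assumes the `PUT /api/v1/photos/<uid>` endpoint, a JSON body with `Description` and `DescriptionSrc` fields, and an `X-Session-ID` header; the function name, the `"manual"` source value, and the example values are hypothetical. It only builds the request and does not send it.

```python
import json
import urllib.request


def build_description_update(base_url, photo_uid, caption, session_id):
    """Build (but do not send) a PUT request that sets a photo's description.

    Assumptions: endpoint path, field names, and the X-Session-ID header
    are not confirmed by this thread and may differ between versions.
    """
    payload = {
        "Description": caption,
        # Assumed source value; kept as a single word without spaces.
        "DescriptionSrc": "manual",
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/photos/{photo_uid}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-Session-ID": session_id,
        },
        method="PUT",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; the caption itself would come from whatever interrogator endpoint is in use.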
@felipemeres commented on GitHub (Jun 9, 2023):
The recently released RAM might be more efficient for the task:
https://huggingface.co/spaces/xinyu1205/Recognize_Anything-Tag2Text
@ahdsr commented on GitHub (Aug 24, 2023):
Can something like this be implemented with Photoprism?
@hrdwdmrbl commented on GitHub (Jan 20, 2024):
This again and again! I think you'll have to keep updating the model again and again every 6 months or so. It would be for the best!
@sfxworks commented on GitHub (Apr 22, 2024):
Heya, back in action and going the gpt vision route w/ localai https://localai.io/features/gpt-vision/
that way if someone wants to they can also send things to OpenAI, or a local model.
I believe the model @felipemeres referenced can be used here as well, as a lot is configurable.
Give me a bit to write something up
@sfxworks commented on GitHub (Apr 23, 2024):
Needs some cleanup, but I've added descriptions to about 1500 images so far overnight. Mine hallucinates a little (this was 455, not 500, and he was wearing a back brace, not a knee brace).

I might try to add some grammar constraints to make sure it returns output in a certain format, so I can also ask for tags and a better title. Maybe even album organization. For now, though, it works well with https://localai.io/, a locally deployable API that is compatible with OpenAI's, so you can also send your images upstream instead of analyzing them locally if you want.
@sfxworks commented on GitHub (Apr 23, 2024):
(Also, if `DescriptionSrc` has any spaces, the PUT is reported as successful, even in the PhotoPrism logs, but the value is not saved. That stumped me for a bit.)
@hrdwdmrbl commented on GitHub (Apr 24, 2024):
@sfxworks Awesome work! If I sponsor photoprism, that wouldn't include you, would it?
@lastzero commented on GitHub (Apr 24, 2024):
We give free memberships (and whatever else we can afford) for larger and/or regular contributions, provided they can be merged and released.
@sfxworks commented on GitHub (Apr 30, 2024):
Hey, I am a donator (though with some chaos during the holidays I am behind on my membership, but I don't mind resuming it if needed). Either way, thanks for the acknowledgement! It just feels good using this vs. something like a cloud service, and it's how open source should be.
I've made this alternative script that uses ollama. They also have an openai api compatible service but it doesn't work with vision/llava (but it does use my radeon card so I wanted to set it up)
I've got to make tweaks still to handle scenarios where it may not return proper json and try again, but here's what I have so far.
The goal was to work on the other properties, like title, keywords, and the privacy flag.
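The "try again when the model does not return proper JSON" step mentioned above could be sketched roughly as follows. This is not the author's script: `ask_model` is a hypothetical callable standing in for whatever ollama or LocalAI call is used, and the required keys are illustrative assumptions.

```python
import json


def generate_metadata(ask_model, prompt,
                      required_keys=("title", "keywords"), max_attempts=3):
    """Call ask_model(prompt) until it returns a JSON object with the
    required keys, or give up after max_attempts tries.

    ask_model is a hypothetical callable returning the raw model text.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        raw = ask_model(prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as err:
            # Not valid JSON at all: remember why and ask again.
            last_error = err
            continue
        if isinstance(data, dict) and all(key in data for key in required_keys):
            return data
        last_error = ValueError(f"attempt {attempt}: required keys missing")
    raise RuntimeError(
        f"no usable JSON after {max_attempts} attempts"
    ) from last_error
```

Capping the number of attempts keeps a single stubborn image from stalling a long overnight run; the caller can log the failure and move on.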
@lastzero commented on GitHub (May 1, 2024):
Note that with our latest release, you can use standard Bearer Authorization (or X-Auth-Token) headers with script/app specific access tokens:
https://docs.photoprism.app/developer-guide/api/#client-authentication
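As a rough illustration of the two header styles named above, a script could build its request headers with a small helper like this. The helper name and the `scheme` parameter are hypothetical; only the header names come from the comment, and the sketch deliberately returns one header or the other, never both.

```python
def auth_headers(access_token, scheme="bearer"):
    """Return request headers for one of the two authentication styles.

    scheme is a hypothetical switch: "bearer" or "x-auth-token".
    """
    if not access_token:
        raise ValueError("access_token must not be empty")
    if scheme == "bearer":
        return {"Authorization": f"Bearer {access_token}"}
    if scheme == "x-auth-token":
        return {"X-Auth-Token": access_token}
    raise ValueError(f"unknown scheme: {scheme!r}")
```

The returned dictionary can be passed as the `headers` argument of `urllib.request.Request` or of any other HTTP client.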
@sfxworks commented on GitHub (Sep 7, 2024):
Hey all,
Sorry for the delay. Had a very bad family emergency that ended with grieving.
As we stand today, there are models now that can understand video.
https://medium.com/@manish.thota1999/an-experiment-to-unlock-ollamas-potential-video-question-answering-e2b4d1bfb5ba
https://github.com/ollama/ollama/issues/3184
On top of this, I was trying to get subtitle generation working via whisper, which can be run locally, through a local endpoint, or through a service.
The current plan is to use a grammar to properly format the returned metadata of a photo as a suggestion/default.
I'll give this some work this week.
@sfxworks commented on GitHub (Oct 30, 2024):
Resuming work on this while leaving room for #1090
Also, I have a separate addition now that generates subtitles for videos based on how VLC media player searched for them. Once LlavaNext is runnable locally (localAI work needs to be complete) videos will also be able to be titled/tagged/described
@seeschloss commented on GitHub (Nov 25, 2024):
Maybe I should just open an issue, but I've been scripting using `X-Auth-Token` and just tried to use `Authorization: Bearer` instead, according to the doc you linked here, and it doesn't seem to work as I expected.
The same query, even a simple `/api/v1/photos?count=1`, returns results (one result in this case) using `X-Auth-Token` with a token I retrieved from my browser, but with a token generated from the command line and given the "*" scope I can't get anything more than: `{"code":400,"error":"Unable to do that"}`. I couldn't find any way to get better information than that; the debug log doesn't give any other information.
I think it might be due to the fact that this token isn't linked to a user, but then I have no idea how to do anything with it.
@lastzero commented on GitHub (Nov 25, 2024):
Note that you can only use `X-Auth-Token: <access_token>` or `Authorization: Bearer <access_token>`, not both at the same time.
Does it work if you use an app password or a token associated with a user? Are all API endpoints failing or just this particular one?
Do you see any errors or other hints in the debug/trace logs?
@seeschloss commented on GitHub (Nov 25, 2024):
The logs (with trace enabled) only say:
When the same request with an `X-Auth-Token` says:
I'm indeed using just one method or the other, not both at the same time.
As for the other methods, I had not tried them because they seemed more complex for my simple curl-based script, but looking again I see that app passwords can be used as a Bearer token, and they do work fine! Which I think confirms my suspicions that the error with the access_token is due to it not being linked to a specific user.
Edit: And I forgot to answer the other question: a few endpoints do work with an access_token, such as `/api/v1/config` and `/api/v1/status`, but most of them answer either "Unable to do that" (400) or "Permission denied" (403).
@sfxworks commented on GitHub (Feb 4, 2025):
I tried to take a go at it today. The new auth system is either confusing me or not working as expected.
I made quite a few tokens, both through the web page and through the CLI.
Of the combinations I've tried, only Bearer auth token mode with a new "apps and devices" entry can get the photo list, though it cannot download the photos directly.
Creating an auth entry with no username, just to get the X-Auth-Token, doesn't allow me to use the search endpoint at all.
So at this time, I'm not sure how to get photos anymore. Can anyone explain? Looking over https://docs.photoprism.app/developer-guide/api/oauth2/#access-tokens and https://docs.photoprism.app/developer-guide/api/search/#get-apiv1photos doesn't reveal any rule that I'm failing to follow.
@lastzero commented on GitHub (Feb 17, 2025):
@sfxworks Indeed, you might not be able to access pictures with a client-only access token, only other API endpoints, e.g. for monitoring. If you want to find and update pictures, you should bind access to a user account, e.g., by using a user app password or specifying a username when creating a token:
@sfxworks commented on GitHub (Feb 18, 2025):
Ah gotcha. I'll give that a try tonight
@hiimhoanganan commented on GitHub (May 26, 2025):
The only reason I chose PhotoPrism to manage my media is because it has built-in local AI and can be expanded in the future. I just want to say that your work is amazing, and I’m really looking forward to seeing this project completed.