AI: Integrate a model for Optical Character Recognition (OCR) #709

Open
opened 2026-02-19 23:14:52 -05:00 by deekerman · 16 comments
Owner

Originally created by @Shamshala on GitHub (Jan 16, 2021).

Hi, when I came across the "Scans" mark I was wondering what its main purpose is, and after I found it out in the quick guide an idea popped into my mind: wouldn't it be possible to also include photos/scans of documents? And since image classification/object detection is already implemented and face recognition is considered (/on the way), what about OCR? 😈 It would probably be helpful for reading street signs and other things.


Support for more built-in AI/ML models depends on [upgrading the TensorFlow library we use from v1.15 to v2.x](https://github.com/photoprism/photoprism/issues/222) first.

In addition, we have [started working on a microservice](https://github.com/photoprism/photoprism-vision) that will allow us to use advanced computer vision models through a REST API (this also includes support for faster inference when indexing new pictures, e.g. with Nvidia hardware):

- https://github.com/photoprism/photoprism-vision

It seems possible (and we have already talked about it internally) to have an API endpoint for OCR there as well. Any contributions to this would be much appreciated! 🤗

Related Issues:

- https://github.com/photoprism/photoprism/issues/222
- https://github.com/photoprism/photoprism/issues/1090
- https://github.com/photoprism/photoprism/issues/4356
- https://github.com/photoprism/photoprism/issues/3438
- https://github.com/photoprism/photoprism/issues/536

@lastzero commented on GitHub (Jan 16, 2021):

Already tried, not that easy. Need to extract text areas first, can't just run OCR on a complete image.


@FadingArabChristians commented on GitHub (Dec 29, 2021):

Has this been added? It would be absolutely perfect for my use case atm.


@graciousgrey commented on GitHub (Dec 30, 2021):

No, this has not yet been added.

You can find an overview of what is planned next on our roadmap: https://github.com/photoprism/photoprism/projects/5.

Ideas about libraries for text area extraction and OCR are very welcome :)


@mbethke commented on GitHub (Sep 26, 2022):

Having quite a few photos of random stuff with more or less amusing text on them in my collection, I'd love that feature! So I've been playing around with [DeepDetect](https://github.com/jolibrain/deepdetect) a little, and apart from eating hefty chunks of memory it performs quite well. I set up a container similar to their [quickstart instructions](https://www.deepdetect.com/quickstart-server) (just using docker-compose), added their [word-detect model](https://www.deepdetect.com/models), and tried it with a few images from the web, like

```
curl -X POST "http://localhost:8080/predict" -d '{ "service":"word_detect", "parameters":{"output":{"bbox":true,"confidence_threshold":0.3}},"data":["https://www.leanderarchitectural.co.uk/wp-content/uploads/2015/07/LlandudnoFP2.jpg"] }'
```

It spits out a JSON file with the bounding boxes, which can be visualized with:

```
convert -strokewidth 1 -stroke black -fill '#ff00ff30' -draw "$(jq <detected.json '.body.predictions[].classes[].bbox | [.xmin,.ymin,.xmax,.ymax] | @sh' | awk '{gsub("\"",""); print "rectangle",$1","$2,$3","$4}')" img.jpg out.png
```

That will probably still need some work like coalescing overlapping boxes before the results can be fed to tesseract or something, but maybe it's a start?
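The "coalescing overlapping boxes" step mentioned above could be sketched in plain Python. This is only an illustration: the `(xmin, ymin, xmax, ymax)` tuples mirror DeepDetect's bbox fields, and the padding value is an arbitrary guess rather than a tuned parameter.

```python
# Sketch: merge overlapping (or nearly touching) word boxes into larger
# regions before handing them to an OCR engine such as Tesseract.

def overlaps(a, b, pad=5):
    """True if boxes a and b intersect once each is padded by `pad` pixels."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return not (ax2 + pad < bx1 or bx2 + pad < ax1 or
                ay2 + pad < by1 or by2 + pad < ay1)

def merge(a, b):
    """Smallest box covering both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3]))

def coalesce(boxes, pad=5):
    """Merge boxes repeatedly until no two remaining boxes overlap."""
    boxes = list(boxes)
    changed = True
    while changed:
        changed = False
        remaining = []
        while boxes:
            cur = boxes.pop()
            for i, other in enumerate(boxes):
                if overlaps(cur, other, pad):
                    boxes[i] = merge(cur, other)  # fold cur into other
                    changed = True
                    break
            else:
                remaining.append(cur)  # cur touches nothing else
        boxes = remaining
    return boxes
```

Each merged region could then be cropped and fed to Tesseract individually.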


@daniellandau commented on GitHub (Feb 18, 2023):

Didn't notice this issue before I opened #3206 above. As said there:

> I tested EasyOCR (https://github.com/JaidedAI/EasyOCR) on some samples from my library, and while it didn't necessarily provide a perfect transcription usable as a document itself in all cases, it definitely found enough words on almost all images I tried to be usable for a text-to-image search.

I've used Tesseract with manually drawn boxes in the past, and EasyOCR seems to perform comparatively very well when just given a full image, so "can't just run OCR on a complete image" doesn't seem to be totally true anymore. EasyOCR gives both the bounding boxes and the text in each of them, but for PhotoPrism I'd be happy to just have the raw text searchable so I can find the image and then read the actual text myself from the image.
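For illustration, here is a minimal sketch of how EasyOCR-style results (a list of `(bounding_box, text, confidence)` tuples, which is the shape `readtext()` returns) could be reduced to the raw searchable string described above. The results below are faked so the helper runs standalone, and the 0.4 threshold is an arbitrary placeholder.

```python
# Sketch: reduce EasyOCR-style results to one searchable text blob.

def searchable_text(results, min_confidence=0.4):
    """Join recognized fragments above a confidence threshold, in given order."""
    words = [text.strip() for _, text, conf in results if conf >= min_confidence]
    return " ".join(w for w in words if w)

# Faked output in EasyOCR's (bbox, text, confidence) shape; bboxes omitted.
fake_results = [
    (None, "LLANDUDNO", 0.98),
    (None, "STATION", 0.91),
    (None, "q#~", 0.12),  # low-confidence noise, filtered out
]
print(searchable_text(fake_results))  # LLANDUDNO STATION
```

The joined string could then be stored alongside the photo's other metadata and matched by the existing search.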


@ant0nwax commented on GitHub (Mar 2, 2023):

I'm currently looking for two additional labels that are missing for my daily use: Drawings and Paintings. I used them a lot on my former photo cloud, and the AI there worked better. OCR for text stored in photos would also be a nice integration for the developers to add.


@ALEEF02 commented on GitHub (Aug 26, 2023):

What's the latest on OCR? I'm willing to put development time into this personally if needed. Searching for text in images was one of the main uses I got out of Google Photos and I'd love to see that functionality here.


@Bobbyjohnsonz commented on GitHub (Sep 10, 2023):

> What's the latest on OCR? I'm willing to put development time into this personally if needed. Searching for text in images was one of the main uses I got out of Google Photos and I'd love to see that functionality here.

Agreed. I believe Mylio does this out of the box, but it is paid. The implementation seems doable locally, no problem. This is a deciding factor for me. It's such a powerful tool.


@ant0nwax commented on GitHub (Sep 10, 2023):

Do I understand correctly that it would be possible to integrate OCR into PhotoPrism, but that this has nothing to do with integrating Mylio? Is there perhaps an existing open-source OCR, given that Mylio and Google Photos are not open source? Could you name an open-source OCR that already works with photos?


@ant0nwax commented on GitHub (Sep 10, 2023):

My research on AI:

**Question>** There are open source OCR tools to read text inside of images and make it searchable... https://www.hitechnectar.com/blogs/open-source-ocr-tools/ — this link lists some open source OCR tools. Which of these would work best for integrating into an on-premises PhotoPrism solution?

**Answer>** Based on the search results, the Tesseract OCR engine seems to be the best open source OCR tool for integrating into an on-premises PhotoPrism solution. Tesseract is sponsored by Google and is considered one of the most accurate, freely available open-source systems available [1](https://www.hitechnectar.com/blogs/open-source-ocr-tools/). However, other open source OCR tools like Ocrad, EasyOCR, and OpenCV can also be used for basic OCR tasks and can be trained with your own data [2](https://www.affinda.com/tech-ai/6-top-open-source-ocr-tools-an-honest-review) [3](https://towardsdatascience.com/5-open-source-tools-you-can-use-to-train-and-deploy-an-ocr-project-8f204dec862b) [5](https://pdf.wondershare.com/ocr/ocr-software-open-source.html). It is worth noting that PhotoPrism is an AI-powered photos app for the decentralized web that makes use of the latest technologies to tag and find pictures automatically without getting in your way [6](https://github.com/photoprism/photoprism).

@Fireflaker commented on GitHub (Jan 23, 2024):

Earlier I posted this in a discussion: [HELP - I have many screenshots with text. How to implement OCR or integrate paperless-ngx for better image search? #4011](https://github.com/photoprism/photoprism/discussions/4011)

> It looks like implementing OCR is no trivial task and is not on the roadmap 😔. However, an OCR search capability similar to the one in Google Photos is essential for me. I tried paperless-ngx, but it writes to the originals dataset and is impractical.
>
> ...

May I ask for some recommendations on the best way to acquire OCR capability right now?

> I considered modifying the indexing process to send each photo to a locally running OCR API server (Windows or Linux, on a different computer with a better GPU) and adding the resulting text to properties; or using paperless-ngx to process OCR locally. May I request some guidance on where to get started?
>
> Alternatively, are there better ways to approach this, like a read-only version of paperless-ngx that can run alongside PhotoPrism in Docker/TrueNAS?

I am a student, but I would love to contribute what I can.
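The indexing hook described above could look roughly like this. Everything here is hypothetical: the base64-JSON payload shape, the `{"text": ...}` response format, and the `ocr_text` property name are placeholders for whatever local OCR server and metadata field are actually used.

```python
# Sketch: during indexing, send a photo to a local OCR service and keep
# the returned text as a searchable property on the photo's metadata.

import base64
import json

def build_ocr_request(image_bytes, languages=("en",)):
    """Encode an image as a JSON payload for a hypothetical POST /ocr endpoint."""
    return json.dumps({
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "languages": list(languages),
    })

def apply_ocr_result(photo_properties, response_json):
    """Merge the (hypothetical) {'text': ...} response into the photo's metadata."""
    text = json.loads(response_json).get("text", "").strip()
    if text:
        photo_properties["ocr_text"] = text
    return photo_properties
```

A real integration would POST `build_ocr_request(...)` to the OCR server and call `apply_ocr_result` on the response before the photo record is saved.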


@graciousgrey commented on GitHub (Jan 25, 2024):

Thank you very much for offering your help! We are currently focusing on adding additional authentication options, so unfortunately we can't look into this in depth at this time. However, we will do our best to answer your questions. Please bear with us if it takes us a while to get back to you.

A good starting point would be to do some research and test the available libraries and create a decision matrix to find out which one would integrate best with PhotoPrism. Some points to consider are:

  • Are they actively maintained and do they provide documentation?
  • What input do they require? Can they work with the thumbnails already generated by PhotoPrism or would we need to create additional ones?
  • What does the output look like?
  • Which library provides the most accurate results?
  • How large are the models?
  • How resource intensive are they? Could they be run on low-power devices, such as a Raspberry Pi?
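One way to organize that comparison is a small weighted decision matrix. The criteria below follow the list above, but the weights and per-library scores are made-up placeholders to show the method, not real benchmark results.

```python
# Sketch: weighted decision matrix for comparing OCR library candidates.
# All weights and 1-5 scores are illustrative placeholders.

CRITERIA_WEIGHTS = {
    "maintained_and_documented": 2.0,
    "works_on_existing_thumbnails": 1.5,
    "output_quality": 3.0,
    "model_size": 1.0,
    "runs_on_low_power": 1.5,
}

def score(candidate_scores):
    """Weighted sum of per-criterion scores for one library."""
    return sum(CRITERIA_WEIGHTS[c] * s for c, s in candidate_scores.items())

candidates = {
    "tesseract": {"maintained_and_documented": 5, "works_on_existing_thumbnails": 3,
                  "output_quality": 3, "model_size": 4, "runs_on_low_power": 4},
    "easyocr":   {"maintained_and_documented": 4, "works_on_existing_thumbnails": 4,
                  "output_quality": 4, "model_size": 2, "runs_on_low_power": 2},
}

# Rank candidates by total weighted score, best first.
ranked = sorted(candidates, key=lambda name: score(candidates[name]), reverse=True)
```

Filling in real scores after testing each library would turn this into the decision matrix suggested above.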

@Dobeemixer commented on GitHub (Dec 30, 2024):

Any updates on this? Searching for text in photos seems to be a big deal.


@lastzero commented on GitHub (Jan 13, 2025):

@Dobeemixer We will be happy to take another look at this once the upgraded user interface and new viewer are released:

- #3168
- #1307

Of course, if you know a good solution (or would like to find one and build a proof-of-concept), we are also happy to accept contributions. This would shorten the time to get it into a stable release.


@Brtrnd commented on GitHub (Nov 19, 2025):

I'll assume this can be used to achieve OCR:
https://github.com/photoprism/photoprism/discussions/4983
I guess one would ask the LLM whether there is text in the image and return a binary answer. If positive, an OCR engine could be let loose on the document.
Of course, OCR on unstructured documents is hard, so it's very probable it will read most of the words but mess them up if there's formatting.
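The two-stage idea above (a cheap binary text/no-text check gating a full OCR pass) can be sketched with stubbed stages. Here `has_text` and `run_ocr` are placeholders standing in for real LLM and OCR calls.

```python
# Sketch: gate the expensive OCR call behind a binary text-detection check.

def ocr_pipeline(image, has_text, run_ocr):
    """Run OCR only on images the detector says contain text."""
    if not has_text(image):
        return None           # no text detected: skip OCR entirely
    return run_ocr(image)     # text present: extract it

# Stubbed usage: pretend images are dicts with precomputed answers.
img = {"contains_text": True, "text": "STOP"}
result = ocr_pipeline(
    img,
    has_text=lambda i: i["contains_text"],
    run_ocr=lambda i: i["text"],
)
# result == "STOP"
```

In practice, `has_text` would be a single yes/no prompt to the vision model, so only the (typically small) fraction of a library that actually contains text pays the OCR cost.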


@lastzero commented on GitHub (Nov 20, 2025):

@Brtrnd Absolutely! LLMs can do this, and it generally works well — depending on the language of the text and the model. The only remaining issue is that you might prefer to put the text in a new, dedicated OCR metadata field rather than in the Caption. Feel free to suggest alternative/better field names, and let us know what else you need for this feature request to be resolved.
