mirror of
https://github.com/Mintplex-Labs/anything-llm.git
synced 2026-03-02 22:57:05 -05:00
[BUG]: Document processing API is not online (bulk file uploads) #1873
Originally created by @rthwm on GitHub (Dec 29, 2024).
How are you running AnythingLLM?
Docker (local)
What happened?
When uploading bulk files into documents, I receive an error message "document processing api is not online" randomly on different files as they're being uploaded.
In my experimentation, I had selected 8 PDF files, each over 300 MB. One of the 8 failed with the above error. If I wait for the other 7 to complete and then re-upload the one that failed, it uploads successfully.
In small batches, this is manageable, as I can pinpoint the one that failed and re-upload it. However, in bulk testing, multiple files will fail and it's impossible to keep track of which ones succeeded and which ones failed, so the only solution I've found is to delete all the files, then re-upload them 4 to 6 at a time (which takes HOURS when uploading hundreds of documents).
It appears as if the API that manages the upload is limited in the number of documents it can process at one time, and/or if it tries to start an upload while the API is busy handling other files, it fails as "not online".
If a file fails, the system doesn't appear to retry the upload. It just errors, and the user must track which file failed and re-submit it after the queue has finished. This is next to impossible with bulk files.
A) When uploading files in bulk or uploading large files, it would be nice to control how many documents the processor handles at once. For example, if I am uploading 1,500 PDF files, a setting to limit the processor to no more than 4 documents at a time (to minimize failures and make it easier to track which files failed on upload).
B) It would be nice if a log file or report were produced after a bulk upload, listing which files failed and which were successful. This would make it easier to identify which files need to be re-uploaded.
C) During the upload process, if a file fails to upload because the API is unavailable, have the system automatically retry it: either move the file to the bottom of the queue and retry, or retry immediately and fail after X attempts.
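Until something like this exists in the product, suggestions (A) through (C) can be roughly approximated client-side with a small script that drives uploads itself. Below is a minimal sketch; the `upload_fn` you plug in (e.g. an HTTP POST to your instance's document-upload endpoint with your API key) is an assumption on the reader's part and not part of AnythingLLM itself:

```python
import concurrent.futures
import time

def upload_with_retry(upload_fn, path, max_attempts=3, backoff_s=2.0):
    """Call upload_fn(path) up to max_attempts times; return the result,
    or the last exception if every attempt fails (suggestion C)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return upload_fn(path)
        except Exception as exc:  # e.g. "document processing API is not online"
            if attempt == max_attempts:
                return exc
            time.sleep(backoff_s * attempt)  # linear backoff between attempts

def bulk_upload(upload_fn, paths, max_parallel=4, max_attempts=3, backoff_s=2.0):
    """Upload with at most max_parallel files in flight (suggestion A).
    Returns (succeeded, failed) path lists -- a failure report for
    re-running only the failures (suggestion B)."""
    succeeded, failed = [], []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = {
            pool.submit(upload_with_retry, upload_fn, p, max_attempts, backoff_s): p
            for p in paths
        }
        for fut in concurrent.futures.as_completed(futures):
            path = futures[fut]
            (failed if isinstance(fut.result(), Exception) else succeeded).append(path)
    return succeeded, failed
```

The `failed` list can then be written to a file and fed back into `bulk_upload` for a second pass, which covers the "re-upload only what failed" workflow without deleting everything.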
Thank you.
Are there known steps to reproduce?
Windows running the Docker version: upload 100+ large (30 MB+) documents into the document manager.
@timothycarambat commented on GitHub (Dec 30, 2024):
When you upload this many files are you using the built-in CPU embedder or something external like ollama or openai?
@rthwm commented on GitHub (Dec 30, 2024):
Built in embedder.
@timothycarambat commented on GitHub (Dec 30, 2024):
Then this constraint is likely arising from resource limits, as the local embedder runs on CPU only and, depending on the document chunk throughput, could be crashing or failing to allocate. It's unrelated to the retry mechanism proposed, but swapping to something like Ollama or OpenAI may alleviate it, as embedding can then be done off-machine or use the GPU on the device.
@rthwm commented on GitHub (Dec 31, 2024):
I've switched it over to Ollama and am rebuilding the embeddings now (going to take a while). Once this completes, I will try uploading another batch of PDFs and see what happens. I'll post back whether this fixed the issue or not.
@rthwm commented on GitHub (Jan 1, 2025):
Alright, after switching to Ollama, I am still getting "document processing API is not online" while doing bulk uploads. Granted, there don't seem to be nearly as many of these errors, but in a batch upload of around 900 pdf/txt files, I've seen the API-offline error come up about 6 times so far and counting. The next issue (as described initially): once the upload finishes, I will have to delete everything I just uploaded, as I can't isolate which files failed versus which ones were successful. The failures do seem to be related to the number of documents being processed at once / the CollectorApi being busy.
@rthwm commented on GitHub (Jan 1, 2025):
Another item to note: I went through the log files to see if I could isolate an error containing the words "document processing API is not online". Interestingly enough, there is no log entry with this exact phrase. Searching for "not online" produces no results. The only reference in the logs (which I can't fully confirm is for this exact error; it is repeated a few times through the logs on different files) is:
2025-01-01 13:02:15 [backend] info: [CollectorApi] Document Cook_better_food.pdf uploaded processed and successfully. It is now available in documents.
2025-01-01 13:02:15 [backend] info: [CollectorApi] fetch failed
2025-01-01 13:02:15 [backend] info: [CollectorApi] fetch failed
2025-01-01 13:02:15 [backend] info: [CollectorApi] fetch failed
2025-01-01 13:02:15 [backend] info: [TELEMETRY SENT] {"event":"document_uploaded","distinctId":"08fe1348-286a-4313-9d72-f6d357f86f90","properties":{"runtime":"docker"}}
This portion of the log may not be fully relevant to the error I am seeing on the front-end, as the front-end error doesn't correlate to any direct reference in the backend logs that I can see. It would be nice if the error message were changed from "document processing API is not online" to "document processing API is offline", as that would make searching the logs for failures related to "offline" a little easier. Even so, I've gone through the logs line by line (searching for the word "failed") and can't find anything that directly shows this specific error (API is not online) is even happening.
From the front-end, I see 3 different errors at random times.
What I am unsure about in the logs: when I see (for example) "2025-01-01 13:02:15 [backend] info: [CollectorApi] fetch failed", is this a log entry for error #3 only, or the recorded entry for errors #2 and #3 together?
For this upload experiment, I uploaded 961 files (half PDF, the other half TXT); 829 were successfully uploaded. This would indicate that 132 files failed to process/upload because of one of the 3 errors previously mentioned. There is no easy method that I have found to isolate which files failed due to error #1 versus errors #2 or #3 (which I understand is separate from, but related to, the API-not-online issue).
@zxjhellow2 commented on GitHub (Feb 20, 2025):
Did you solve this?
@MatisAgr commented on GitHub (Mar 3, 2025):
Hello, I have the same problem.
I also get blocked uploads after sending several files, whether via the API or through the interface.
I tried sending everything at once, and also sending the files one by one.
Either way, I get a "fetch failed" and a "Document Processor Unavailable" in the interface, and the only solution is to restart the Docker container before new files can be uploaded.
I'm using the built-in embedder.
In the API response:
Document processing API is not online. Document XY.xlsx will not be processed automatically.
@timothycarambat commented on GitHub (Mar 3, 2025):
This would seem to indicate one specific file is the issue, not all of them at the same time. Can you determine which file is the one causing the error? If it is the PPTX file you are using, can you replicate that with the same file consistently? It may just be an issue with PPTX
@MatisAgr commented on GitHub (Mar 4, 2025):
I tried again with the same file and indeed it worked.
After analysis, I think the API gets overloaded and goes offline. I made a small program that sends the files in a loop until the API responds. The API seems to stay offline for a period of time; after about ten minutes the files manage to go through before it blocks again.
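The "send in a loop until the API responds" program described above can be sketched as a simple poll-and-retry helper. This is an illustration of the approach, not the commenter's actual code; the timings and the `upload_fn` callable (e.g. a POST to your instance's upload endpoint) are assumptions:

```python
import time

def send_until_accepted(upload_fn, path, poll_s=30.0, max_wait_s=1800.0):
    """Retry upload_fn(path) until it succeeds or max_wait_s elapses,
    sleeping poll_s between attempts while the collector is down."""
    deadline = time.monotonic() + max_wait_s
    while True:
        try:
            return upload_fn(path)
        except Exception:  # e.g. "Document processing API is not online"
            if time.monotonic() >= deadline:
                raise  # give up after max_wait_s
            time.sleep(poll_s)
```

With the roughly ten-minute recovery window reported above, a `poll_s` of 30–60 seconds would keep the loop from hammering a collector that is still restarting.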
@timothycarambat commented on GitHub (Mar 4, 2025):
Well, the collector is single-threaded, and if you are uploading documents that require binaries to parse (PPTX, Word, PDF) or that have to run OCR (images, scanned PDFs), then what will likely occur is an OOM based on the machine/container resources. I believe that is the root cause here, since that would crash the collector, which would then become unresponsive.
I suppose this is also possible with many, many large text files, since they need to be opened, read, and then processed. It is just simple IO, but it can still cause issues during processing.