mirror of
https://github.com/Mintplex-Labs/anything-llm.git
synced 2026-03-02 22:57:05 -05:00
Sitemap collector tries downloading/parsing image files. #34
Originally created by @Mike-Benoit on GitHub (Jun 14, 2023).
Sitemap collector tries downloading/parsing image files (e.g. PNG).
It appears it only excludes PDF files, but perhaps it should only be including htm/html files instead?
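To illustrate the allow-list idea suggested above, here is a minimal sketch in Python. It is not the collector's actual code: the names `HTML_EXTENSIONS`, `is_html_candidate`, and `filter_sitemap_urls` are hypothetical, and it assumes the sitemap URLs have already been extracted into a list of strings.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical allow-list. The empty string covers extensionless paths
# such as "/docs/" or "/about", which are usually HTML pages.
HTML_EXTENSIONS = {"", ".htm", ".html"}


def is_html_candidate(url: str) -> bool:
    """Return True if the URL path looks like an HTML page, judged by extension only."""
    path = urlparse(url).path  # drops the query string and fragment
    ext = posixpath.splitext(path)[1].lower()
    return ext in HTML_EXTENSIONS


def filter_sitemap_urls(urls):
    """Keep only the URLs that pass the extension allow-list, preserving order."""
    return [u for u in urls if is_html_candidate(u)]
```

An extension check like this is only a heuristic: a path such as `/release-1.2` would be dropped because `.2` looks like an extension. Checking the response's `Content-Type` header before parsing would likely be more robust, at the cost of one request per URL.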
@timothycarambat commented on GitHub (Jun 14, 2023):
@skidvis Any idea on the best angle to attack this?
@Mike-Benoit commented on GitHub (Jun 14, 2023):
Curiously, I'm not sure why it avoids PDF documents, as those can be parsed. Perhaps two sitemap modes, one with PDFs and one without, could be handy.
@timothycarambat commented on GitHub (Jun 14, 2023):
@Mike-Benoit What sitemap are you using, so I can test this?
@skidvis commented on GitHub (Jun 14, 2023):
@timothycarambat There's a PR to block images.
@Mike-Benoit PDFs are only parsed in the hotdir folder. Using the sitemap feature to parse them would require downloading them to a temp folder somewhere, parsing them, and then deleting them. That feature currently does not exist, and it would also be problematic storage-wise if there were many large PDFs.
If enough people ask for this, it may be worth implementing, but currently the hotdir should be where PDFs are handled.
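For anyone curious what the download, parse, delete flow described above might look like, here is a rough sketch in Python. It is not an existing feature of the project: `fetch_bytes` and `parse_remote_pdf` are hypothetical names, and the `parse` argument stands in for whatever PDF parser would actually be used.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen


def fetch_bytes(url: str) -> bytes:
    """Download the whole response body into memory (fine for a sketch, not for huge files)."""
    with urlopen(url, timeout=30) as resp:
        return resp.read()


def parse_remote_pdf(url, parse, fetch=fetch_bytes):
    """Download one PDF into a temp folder, run `parse` on it, then delete it.

    `parse` is a caller-supplied callable that takes a pathlib.Path and returns
    the parsed result; it is an assumption standing in for a real PDF parser.
    """
    # The directory and everything in it is removed when the block exits,
    # even if `parse` raises, so only one PDF is on disk at a time.
    with tempfile.TemporaryDirectory() as tmp:
        pdf_path = Path(tmp) / "download.pdf"
        pdf_path.write_bytes(fetch(url))
        return parse(pdf_path)
```

Handling one file at a time keeps disk usage bounded by the largest single PDF, which addresses part of the storage concern, though the download time and memory use for very large files would still need thought.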
@timothycarambat commented on GitHub (Jun 15, 2023):
Resolved by #56.