Sitemap collector tries downloading/parsing image files. #34

Closed
opened 2026-02-28 04:21:25 -05:00 by deekerman · 5 comments

Originally created by @Mike-Benoit on GitHub (Jun 14, 2023).

The sitemap collector tries to download and parse image files (e.g., PNG).

It appears to exclude only PDF files, but perhaps it should include only htm/html files instead?
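
To illustrate the allow-list idea raised above, here is a minimal sketch of filtering sitemap URLs by file extension before fetching them. It is not the collector's actual code: the names `HTML_EXTENSIONS`, `looks_like_html_page`, and `filter_sitemap_urls` are hypothetical, and treating extensionless paths (such as `/about`) as HTML pages is an assumption.

```python
import posixpath
from urllib.parse import urlparse

# Assumption: extensionless paths (e.g. "/about" or "/") are treated as HTML pages.
HTML_EXTENSIONS = {"", ".htm", ".html"}


def looks_like_html_page(url: str) -> bool:
    """Return True if the URL path has no extension or an htm/html extension."""
    path = urlparse(url).path  # drops the query string and fragment
    ext = posixpath.splitext(path)[1].lower()
    return ext in HTML_EXTENSIONS


def filter_sitemap_urls(urls):
    """Keep only the URLs that look like HTML pages (hypothetical helper)."""
    return [u for u in urls if looks_like_html_page(u)]


if __name__ == "__main__":
    sample = [
        "https://example.com/about",
        "https://example.com/logo.png",
        "https://example.com/docs/index.html",
        "https://example.com/report.pdf",
    ]
    print(filter_sitemap_urls(sample))
```

An extension check is only a heuristic. A URL without an extension can still serve an image, so checking the response's `Content-Type` header would be a more robust complement.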

deekerman 2026-02-28 04:21:25 -05:00
  • closed this issue
  • added the bug label

@timothycarambat commented on GitHub (Jun 14, 2023):

@skidvis Any idea on the best angle to attack this?


@Mike-Benoit commented on GitHub (Jun 14, 2023):

Curiously, I'm not sure why it avoids PDF documents, since those can be parsed. Perhaps separate "sitemap with PDFs" and "sitemap without PDFs" options could be handy.


@timothycarambat commented on GitHub (Jun 14, 2023):

@Mike-Benoit Which sitemap are you using, so I can test this?


@skidvis commented on GitHub (Jun 14, 2023):

@timothycarambat There's a PR to block images.

@Mike-Benoit PDFs are only parsed from the hotdir folder. Using the sitemap feature to parse them would require downloading them to a temp folder, parsing them, and then deleting them. That feature does not currently exist, and it could also be problematic storage-wise if there were many large PDFs.

If enough people ask for this, it may be worth implementing, but for now the hotdir is where PDFs should be handled.
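
The download, parse, delete flow described above could look roughly like the sketch below. This is only an illustration of the idea, not an existing feature: `parse_remote_pdf` is a hypothetical name, and the `fetch` and `parse` callables are placeholders for whatever HTTP client and PDF parser the collector would actually use.

```python
import os
import tempfile
from typing import Callable


def parse_remote_pdf(
    url: str,
    fetch: Callable[[str], bytes],  # hypothetical: returns the raw PDF bytes for a URL
    parse: Callable[[str], str],    # hypothetical: extracts text from a PDF file path
) -> str:
    """Download a PDF to a temp file, parse it, and always delete the file afterwards."""
    fd, tmp_path = tempfile.mkstemp(suffix=".pdf")
    try:
        # Write the downloaded bytes to disk and close the file before parsing.
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(fetch(url))
        return parse(tmp_path)
    finally:
        # Remove the temp file even if fetching or parsing failed.
        os.remove(tmp_path)
```

Handling one file at a time this way keeps at most one PDF on disk, which limits the storage concern, though a very large PDF would still be held in memory by `fetch` in this sketch.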


@timothycarambat commented on GitHub (Jun 15, 2023):

resolved by #56
