starred/anything-llm

mirror of https://github.com/Mintplex-Labs/anything-llm.git synced 2026-03-02 22:57:05 -05:00

Sitemap collector tries downloading/parsing image files. #34

Closed

opened 2026-02-28 04:21:25 -05:00 by deekerman · 5 comments

deekerman commented

2026-02-28 04:21:25 -05:00

Owner

Originally created by @Mike-Benoit on GitHub (Jun 14, 2023).

The sitemap collector tries downloading and parsing image files (e.g., PNG).

It appears to exclude only PDF files, but perhaps it should include only htm/html files instead?
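
As an illustration of the allow-list idea suggested here, the following is a minimal sketch in Python. It is not the project's actual collector code: the names `ALLOWED_SUFFIXES`, `is_probably_html`, and `filter_sitemap_urls` are hypothetical, and treating extensionless paths as pages is an assumption made for the example.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical allow-list: extensionless paths (e.g. "/about") are assumed to be pages.
ALLOWED_SUFFIXES = {"", ".htm", ".html"}


def is_probably_html(url: str) -> bool:
    """Heuristic check based only on the URL path's file extension."""
    path = urlparse(url).path
    suffix = PurePosixPath(path).suffix.lower()
    return suffix in ALLOWED_SUFFIXES


def filter_sitemap_urls(urls):
    """Keep only URLs whose extension is on the allow-list, preserving order."""
    return [u for u in urls if is_probably_html(u)]
```

An extension check is only a heuristic: a path such as `/release-1.2` would be rejected because `.2` looks like an extension, so a real collector might also need to inspect the response's `Content-Type` header.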

deekerman closed this issue and added the bug label on 2026-02-28 04:21:25 -05:00

deekerman commented

2026-02-28 04:21:37 -05:00

Author

Owner

@timothycarambat commented on GitHub (Jun 14, 2023):

@skidvis Any idea on the best angle to attack this?

deekerman commented

2026-02-28 04:21:40 -05:00

Author

Owner

@Mike-Benoit commented on GitHub (Jun 14, 2023):

I'm curious why it avoids PDF documents, since those can be parsed. Perhaps two options, a sitemap with PDFs and a sitemap without PDFs, could be handy.

deekerman commented

2026-02-28 04:21:40 -05:00

Author

Owner

@timothycarambat commented on GitHub (Jun 14, 2023):

@Mike-Benoit What sitemap are you using, so I can test this?

deekerman commented

2026-02-28 04:21:40 -05:00

Author

Owner

@skidvis commented on GitHub (Jun 14, 2023):

@timothycarambat There's a PR to block images.

@Mike-Benoit PDFs are only parsed in the hotdir folder. Using the sitemap feature to parse them would require downloading them to a temp folder somewhere, parsing them, and then deleting them. That feature does not currently exist, and it would also be problematic storage-wise if there were many large PDFs.

If enough people ask for this, it may be worth implementing, but for now the hotdir is where PDFs should be handled.
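
The download, parse, delete flow described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a feature of the project: `fetch_bytes`, `process_remote_file`, and the caller-supplied `parse` callback are hypothetical names, and the actual PDF parsing is left to whatever the caller passes in.

```python
import tempfile
import urllib.request
from pathlib import Path
from typing import Callable


def fetch_bytes(url: str) -> bytes:
    """Download the whole response body into memory (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def process_remote_file(
    url: str,
    parse: Callable[[Path], str],
    fetch: Callable[[str], bytes] = fetch_bytes,
) -> str:
    """Download one file to a temp folder, parse it, then delete it."""
    # TemporaryDirectory removes the folder and its contents when the block
    # exits, even if parse() raises.
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "download.pdf"
        target.write_bytes(fetch(url))
        return parse(target)
```

Handling one file at a time keeps disk use bounded by the largest single PDF rather than by the whole sitemap, which addresses part of the storage concern, though not the download time or bandwidth.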

deekerman commented

2026-02-28 04:21:40 -05:00

Author

Owner

@timothycarambat commented on GitHub (Jun 15, 2023):

Resolved by #56.
