Mirror of https://github.com/Mintplex-Labs/anything-llm.git (synced 2026-03-02 22:57:05 -05:00)
[FEAT]: Scraping of Authenticated Pages #1925
Originally created by @XarHD on GitHub (Jan 17, 2025).
How are you running AnythingLLM?
Docker (local)
What happened?
I tried to use the Bulk Site Scraper tool to scrape my MediaWiki installation on localhost. I was able to point the scraper at the correct address, but no matter how many child levels I specify or how many pages I set as the scraping limit, the results are the same. In the last experiment I asked it to scrape 3 child levels and stop at 120 pages; the results popup told me it had successfully scraped 81 pages, but in the list of documents to embed in the workspace there are always only four: one that refers to the main address; one that refers to the main page of the wiki (http://localhost/wiki/index.php/Main_Page); and two that refer to two sub-pages that appear on the side menu. The main page has more than two subpages on the side menu, as well as at least three subpages in the main portion of the page.
Is this an issue, or am I doing something wrong with the scraper?
Are there known steps to reproduce?
No response
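For readers unfamiliar with how "child levels" and a page limit usually interact, here is a minimal, hypothetical sketch of a depth- and count-limited breadth-first crawl. This is an illustration of the general technique only, not AnythingLLM's actual implementation; the link graph stands in for real HTTP fetches.

```python
# Generic depth- and count-limited breadth-first crawl (illustrative only).
from collections import deque

def crawl(start, links, max_depth, max_pages):
    """links: url -> list of child urls (stands in for real page fetches).
    'max_depth' is the number of child levels below the start page;
    'max_pages' caps how many unique pages are kept."""
    seen = {start}               # URLs already queued, so each page is kept once
    queue = deque([(start, 0)])  # (url, depth) pairs, breadth-first order
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:   # don't expand links past the depth limit
            continue
        for child in links.get(url, []):
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return visited

# A tiny example graph with cross-links between pages:
links = {"root": ["a", "b"], "a": ["b", "c"], "b": ["root"]}
crawl("root", links, max_depth=3, max_pages=120)
# -> ["root", "a", "b", "c"]
```

Because each URL is deduplicated, a crawler can report far more link traversals than the number of unique documents it ultimately produces.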
@timothycarambat commented on GitHub (Jan 17, 2025):
The collector logs should show exactly what links were found - what do they say?
@XarHD commented on GitHub (Jan 17, 2025):
With the caveat that I made a mistake (this is the Desktop version of AnythingLLM, not the Docker version - apologies), I went into the storage/logs folder but couldn't find any collector log material (or any other logs, for that matter) created or modified today, even though I ran the Bulk Link Scraper today. I did find a file labeled LOG in the Session Storage folder, updated today, but its content just reads:
The file it refers to, 000004.log, was last updated yesterday evening and contains a series of entries like this:
going on for several more lines.
Inside AnythingLLM, the Event Logs do not include logs for the link scraper - it reports "workspace created", followed by "workspace_thread_created" and then "workspace_documents_added" (when I tried to add the four documents it seemed to have scraped, to see what they contained), but the link scraper activity (which occurred between the creation of the workspace and the adding of documents) isn't recorded.
Is there anywhere else I should check? Sorry, I have only been using AnythingLLM for a few days.
For context, I am a writer, and my goal is to load all the background information on my book and the world it's set in, so that I can chat with the LLM to pull out any information I need or identify potential inconsistencies. I can upload single documents, but all the material I have fills 7000 pages of a MediaWiki installation, so it's very time-consuming to transfer everything to documents. I thought of creating a read-only account for the SQL database, but my installation of MediaWiki via XAMPP seems to have issues with the control panel of the SQL database: while I can use the wiki just fine, I cannot at this time create new users to access the SQL database, which is why I need an alternate way to get the info out of the wiki and into AnythingLLM.
Thank you!
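Since the content lives in MediaWiki, one workaround that needs neither link-crawling nor SQL access is MediaWiki's built-in Action API, which can enumerate every page title directly. A hedged sketch that only builds the query URL (the `/theworld/` path is an assumption about this particular XAMPP install; a private wiki would also require an authenticated session before the request succeeds):

```python
# Build a MediaWiki Action API URL for listing all pages (action=query&list=allpages).
# The api.php location is an assumption about the user's local install.
from urllib.parse import urlencode

API = "http://localhost/theworld/api.php"

def allpages_url(api=API, limit=500, continue_from=None):
    """Return the URL that lists up to `limit` page titles; pass the
    'apcontinue' value from the previous response to fetch the next batch."""
    params = {"action": "query", "list": "allpages",
              "aplimit": limit, "format": "json"}
    if continue_from:
        params["apcontinue"] = continue_from
    return f"{api}?{urlencode(params)}"
```

Fetching each URL in a loop (following `apcontinue` until it is absent) yields every page title, after which `action=parse` or `action=query&prop=revisions` can pull the page content.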
@timothycarambat commented on GitHub (Jan 17, 2025):
You should see logs by running the app in debug mode
https://docs.anythingllm.com/installation-desktop/debug#anythingllm-debug-mode-on-linux
But you can also check your storage/logs folder for a collector-DATE.log file in /Users/<usr>/.config/anythingllm-desktop/storage/
@XarHD commented on GitHub (Jan 18, 2025):
I found the collector log (for some reason it didn't appear in the log folder yesterday when I checked, but it's there today). Here's the report:
The end result on the frontend is that only four pages appear in the document picker, with the following names:
@timothycarambat commented on GitHub (Jan 18, 2025):
http://localhost/theworld/index.php?title=Special:UserLogin&returnto=Category%3AWorld+Templates
Is this an authenticated service? Seems like a login page or something?
@XarHD commented on GitHub (Jan 19, 2025):
It's a private MediaWiki instance run via XAMPP. It does have a login page, although my computer has a saved cookie so I'm not required to log in every time. I assumed the scraper would have the same access from the same computer, but perhaps not?
@timothycarambat commented on GitHub (Jan 20, 2025):
Correct, we do not borrow or hijack your current browser session (for obvious reasons). However, private web scraping is definitely something we can enable, so that web scraping from the desktop client has authenticated access to your protected pages.
On Docker, however, this is harder: enabling session sharing would require the user to specify some kind of Chrome session data location, which is vastly more complex.
Our current solution for Docker users to scrape protected pages is via the Chrome Extension that can connect to your instance.
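To illustrate why the crawler lands on Special:UserLogin: a plain HTTP fetch carries none of the browser's cookies. A scraper with authenticated-page support would attach the user's session cookie to each request, roughly like this (the cookie name and value are placeholders, and this is a generic sketch, not AnythingLLM's API):

```python
# Attach a saved session cookie to an outgoing request (illustrative only).
from urllib.request import Request

def authed_request(url, session_cookie):
    """Build a request that presents the browser's session cookie,
    so the wiki serves the page instead of redirecting to the login form."""
    return Request(url, headers={"Cookie": session_cookie})

req = authed_request(
    "http://localhost/theworld/index.php?title=Main_Page",
    "theworld_session=PLACEHOLDER",  # copied from the logged-in browser
)
```

Without that header, the server has no way to associate the request with the logged-in browser session, which is exactly the failure mode in the collector log above.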
@XarHD commented on GitHub (Jan 20, 2025):
Thank you. Is it something that can be done with the current version of the desktop AnythingLLM? If so, how?
@timothycarambat commented on GitHub (Jan 20, 2025):
@XarHD No, which is why I renamed the issue to be a feature request. It is something I know we can accommodate, but it's not live right now.