Local Large Language Models are missing the latest insights. 🧐 This little side quest tackles that head-on!
The goal here is to create a streamlined process for fetching up-to-date documentation from the web and using open-source embedding models to build a vector database. Think of it as giving your local models a continuous knowledge boost! 💪
For instance, I imparted the PydanticAI documentation as RAG context to my locally running StarCoder (a state-of-the-art LLM for code) to do some cool "vibe coding" 🎶
The core of the web scraping process relies on Crawl4AI 🕷️, an impressive open-source framework for LLM applications. I didn't build it myself, but it's a great tool to work with... 🤓
Workflow
- Scrapes relevant documentation:
  - Provide a list of URLs (preferably start pages).
  - It then uses `sitemap.xml` to collect all URLs of the site and saves them as `.md` files.
  - Always respect `robots.txt` best practices. [Soon to be added]
- Enables embedding creation:
  - These Markdown files can then be used to build knowledge embeddings for Retrieval-Augmented Generation (RAG) applications.
  - IBM Granite-Embedding is used, as it works well for many use cases.
- Interact in the terminal with a model served via the Ollama framework, such as Gemma3:

  ```bash
  ollama serve
  ollama run gemma3
  ```
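The sitemap-based URL discovery step above can be sketched in plain Python. The snippet below parses an illustrative `sitemap.xml` fragment (not the output of any particular site) and pulls out every page URL:

```python
import xml.etree.ElementTree as ET

# A minimal sitemap.xml fragment, in the format most documentation sites serve.
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/intro</loc></url>
  <url><loc>https://example.com/docs/setup</loc></url>
</urlset>"""

def extract_urls(sitemap_xml: str) -> list[str]:
    """Collect every <loc> entry from a sitemap document."""
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(sitemap_xml)
    return [loc.text for loc in root.findall(".//sm:loc", ns)]

urls = extract_urls(SITEMAP_XML)
# → ['https://example.com/docs/intro', 'https://example.com/docs/setup']
```

Each discovered URL can then be handed to the crawler, which writes the page content out as a `.md` file.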
Installation: 🛠️
- As a package 📦
  - Use Poetry for dependency management by referring to the `pyproject.toml`:

    ```bash
    poetry install
    poetry shell
    ```

  - Alternatively, install via the `requirements.txt` file:

    ```bash
    python -m venv venv
    source venv/bin/activate  # On Linux/macOS
    pip install -r requirements.txt
    pip install .
    ```
- Running with Docker 🐳
Navigate to the directory containing the Dockerfile and build the Docker image for a more isolated and reproducible running environment:
```bash
docker build -t crawl4ai-doc-scraper .
```
Running the Documentation Scraper ⚙️
Once you have the environment set up (either locally or via Docker), you can run the `main.py` script to start scraping.
The script accepts the following arguments:
- `url_list`: A list of URLs to scrape documentation from.
- `-g get_all_pages true`: This flag tells Crawl4AI to crawl and scrape all sub-pages found on the initial URLs. If set to `false`, only the content of the initial URLs (start pages) will be scraped.
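A command-line interface of this shape (a positional list of URLs plus a `-g` flag) can be modelled with `argparse`. This is an illustrative sketch of the interface described above, not the project's actual `main.py`:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Scrape documentation sites to Markdown."
    )
    # One or more start-page URLs, collected into args.url_list.
    parser.add_argument("url_list", nargs="+", help="start-page URLs to scrape")
    # Whether to follow every sub-page discovered via sitemap.xml.
    parser.add_argument("-g", "--get-all-pages", choices=["true", "false"],
                        default="false",
                        help="crawl all sub-pages of each start URL")
    return parser

args = build_parser().parse_args(
    ["https://langchain-ai.github.io/langgraph", "-g", "true"]
)
# args.url_list == ["https://langchain-ai.github.io/langgraph"]
# args.get_all_pages == "true"
```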
Example Usage:
```bash
# Local execution (after activating the virtual environment)
python3 main.py "https://langchain-ai.github.io/langgraph" "https://docs.smith.langchain.com/" -g true

# Docker execution (as shown in the Docker installation step)
docker run crawl4ai-doc-scraper python3 main.py "https://langchain-ai.github.io/langgraph" "https://docs.smith.langchain.com/" -g true
```

License

This project uses the MIT License. You are free to use, modify, and distribute it according to the terms of this license. See the LICENSE file for the full text.
While this is a personal side quest, feel free to reach out if you have ideas or suggestions! Future improvements could include:
- More sophisticated content filtering during scraping.
- Automated embedding generation after scraping.
- Integration with specific local LLM workflows.
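The "automated embedding generation" idea above can be sketched end-to-end. The toy hash-based `embed` function below is only a stand-in for a real model such as IBM Granite-Embedding; the chunk-indexing and cosine-similarity retrieval are the parts that carry over to a real RAG pipeline:

```python
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    """Toy bag-of-words embedding; swap in a real model (e.g. Granite-Embedding)."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]  # unit-normalised

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are unit-normalised, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

# Index the scraped Markdown chunks (here: three toy strings).
chunks = [
    "Crawl4AI saves each documentation page as Markdown.",
    "Granite-Embedding turns Markdown chunks into vectors.",
    "Ollama serves local models such as Gemma3.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(query: str) -> str:
    """Return the indexed chunk most similar to the query."""
    q = embed(query)
    return max(index, key=lambda item: cosine(q, item[1]))[0]
```

In a real setup the index would live in a vector database and the retrieved chunks would be prepended to the prompt sent to the local model.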
This setup demonstrates a practical approach to keeping local LLMs informed with the latest web documentation. By leveraging the power of Crawl4AI and creating a consistent environment, we can enhance our learning, coding, and exploration with local AI models. Happy exploring! 🗺️