Tiny Web Crawler
A simple and efficient web crawler for Python.
Features
- Crawl web pages and extract links starting from a root URL recursively
- Concurrent workers and custom delay
- Handle relative and absolute URLs
- Designed with simplicity in mind, making it easy to use and extend for various web crawling tasks
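The relative and absolute URL handling listed above can be illustrated with the standard library alone. This is a minimal sketch, not the crawler's actual implementation; the helper name `resolve_links` and the sample hrefs are hypothetical and chosen only for illustration.

```python
from urllib.parse import urljoin, urlparse


def resolve_links(base_url, hrefs):
    """Resolve raw href values against base_url and keep only http(s) links."""
    resolved = []
    for href in hrefs:
        # urljoin leaves absolute URLs untouched and joins relative ones onto the base
        absolute = urljoin(base_url, href)
        if urlparse(absolute).scheme in ("http", "https"):
            resolved.append(absolute)
    return resolved


links = resolve_links(
    "https://github.com/solutions/",
    ["ci-cd", "/features", "https://githubuniverse.com/", "mailto:someone@example.com"],
)
print(links)
```

Resolving every link against the page it was found on gives the crawler a uniform set of absolute URLs to queue, and filtering on the scheme drops non-web links such as `mailto:`.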
Installation
Install using pip:
pip install tiny-web-crawler
Usage
from tiny_web_crawler import Spider
from tiny_web_crawler import SpiderSettings

settings = SpiderSettings(
    root_url = 'http://github.com',
    max_links = 2
)

spider = Spider(settings)
spider.start()

# Set workers and delay (default: delay is 0.5 sec and verbose is True)
# If you do not want delay, set delay=0

settings = SpiderSettings(
    root_url = 'https://github.com',
    max_links = 5,
    max_workers = 5,
    delay = 1,
    verbose = False
)

spider = Spider(settings)
spider.start()
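To give a feel for what the `max_workers` and `delay` settings control, here is a rough sketch of a worker pool that pauses before each fetch. It is an assumption about the general technique, not a description of how tiny-web-crawler is implemented; `crawl_batch`, `fetch_with_delay`, and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real HTTP request so the example runs offline.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_with_delay(url, delay, fetch):
    """Sleep for `delay` seconds, then call the supplied fetch function."""
    if delay > 0:
        time.sleep(delay)
    return url, fetch(url)


def crawl_batch(urls, max_workers, delay, fetch):
    """Fetch a batch of URLs with at most `max_workers` threads running at once."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fetch_with_delay, url, delay, fetch) for url in urls]
        # Collect results in submission order as a {url: links} mapping
        return dict(future.result() for future in futures)


def fake_fetch(url):
    """Hypothetical stand-in for an HTTP request: returns made-up child links."""
    return [url + "/a", url + "/b"]


results = crawl_batch(
    ["https://example.com", "https://example.org"],
    max_workers=2,
    delay=0.01,
    fetch=fake_fetch,
)
print(results)
```

In this sketch a larger `max_workers` lets more pages be fetched in parallel, while a larger `delay` slows each worker down, which is the usual way to stay polite toward the target site.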
Output Format
Crawled output sample for https://github.com
{
    "http://github.com": {
        "urls": [
            "http://github.com/",
            "https://githubuniverse.com/",
            "..."
        ],
        "https://github.com/solutions/ci-cd": {
            "urls": [
                "https://github.com/solutions/ci-cd/",
                "https://githubuniverse.com/",
                "..."
            ]
        }
    }
}
Contributing
Thank you for considering contributing.
- If you are a first-time contributor, you can pick a good-first-issue and get started.
- Please feel free to ask questions.
- Before starting work on an issue, please get it assigned to you so that multiple people do not end up working on the same issue.
- We are working toward our first major release. Please check this issue and see if anything interests you.
Dev setup
- Install poetry in your system: pipx install poetry
- Clone the repo you forked
- Create a venv or use poetry shell
- Run poetry install --with dev
- Run pre-commit install
- Run pre-commit install --hook-type pre-push
Before raising a PR, please make sure these checks are covered:
- An issue that the PR addresses exists or has been created
- Tests are written for the changes
- All lint checks and tests pass
