# link-guardian 🔗🛡️
A powerful Rust CLI tool to scan GitHub repositories and websites for broken links.
Perfect for:
- CI/CD pipelines to catch broken documentation links
- Maintaining website link health
## Features
- Scan GitHub repositories (README.md)
- Scan websites with configurable crawl depth
- Detect broken links (404, timeouts, SSL errors, etc.)
- Detect redirects (301, 302)
- Human-readable table output
- JSON output for scripting/CI
- Proper exit codes for CI integration
- Blazing fast concurrent link checking (500 concurrent requests by default, configurable)
- Polite crawling with delays
## Installation

### Prerequisites

- Rust 1.70 or newer (install from rust-lang.org)

### Build from source

```bash
# Clone the repository (or navigate to the project folder)
cd link-guardian

# Build in release mode (optimized)
cargo build --release

# The binary will be at target/release/link-guardian
```
### Install globally (optional)

```bash
cargo install --path .

# Now you can run 'link-guardian' from anywhere
```
## Usage

### Basic Commands

#### Scan a GitHub repository

```bash
# Check links in a GitHub repo's README
link-guardian github https://github.com/rust-lang/rust

# With JSON output
link-guardian github https://github.com/rust-lang/rust --json

# With custom concurrency (default: 500)
link-guardian github https://github.com/rust-lang/rust --concurrency 1000
```
#### Scan a website

```bash
# Scan just the homepage
link-guardian site https://example.com

# Scan homepage + all linked pages (depth 2)
link-guardian site https://example.com --max-depth 2

# With JSON output
link-guardian site https://example.com --json

# With custom concurrency for faster checking
link-guardian site https://example.com --concurrency 1000 --max-depth 2
```
### Command-line Options

```bash
link-guardian --help
```

```text
Commands:
  github  Scan a GitHub repository for broken links
  site    Scan a website for broken links
  help    Print this message or the help of the given subcommand(s)

Options:
  -h, --help     Print help
  -V, --version  Print version
```
### GitHub subcommand

```bash
link-guardian github --help
```

```text
Scan a GitHub repository for broken links in README and docs/

Usage: link-guardian github [OPTIONS] <REPO_URL>

Arguments:
  <REPO_URL>  GitHub repository URL (e.g., https://github.com/user/repo)

Options:
      --json                       Output results in JSON format instead of a table
      --concurrency <CONCURRENCY>  Maximum concurrent link checks [default: 500]
  -h, --help                       Print help
```
### Site subcommand

```bash
link-guardian site --help
```

```text
Scan a website for broken links

Usage: link-guardian site [OPTIONS] <WEBSITE_URL>

Arguments:
  <WEBSITE_URL>  Website URL to scan (e.g., https://example.com)

Options:
      --json                       Output results in JSON format instead of a table
      --max-depth <MAX_DEPTH>      Maximum crawl depth [default: 1]
      --concurrency <CONCURRENCY>  Maximum concurrent link checks [default: 500]
  -h, --help                       Print help
```
## Output Examples

### Table Output (default)

```text
🔍 Scanning website: https://example.com
📊 Max crawl depth: 1
📄 Crawled 1 page(s)
   5 links found on https://example.com
🌐 Checking 5 unique link(s)...

URL                            STATUS       MESSAGE
=========================================================================================================
https://example.com/about      ✅ OK        HTTP 200
https://example.com/contact    ✅ OK        HTTP 200
https://example.com/old-page   🔀 REDIRECT  HTTP 301 -> /new-page
https://example.com/missing    ❌ BROKEN    HTTP 404
https://example.com/timeout    ⏱️ TIMEOUT   Request timed out

📊 Summary:
   ✅ OK: 2
   ❌ Broken: 3
   📋 Total: 5
```
### JSON Output

```bash
link-guardian site https://example.com --json
```

```json
[
  {
    "url": "https://example.com/about",
    "status": "ok",
    "message": "HTTP 200"
  },
  {
    "url": "https://example.com/old-page",
    "status": "redirect",
    "redirect": "https://example.com/new-page",
    "message": "HTTP 301 -> https://example.com/new-page"
  },
  {
    "url": "https://example.com/missing",
    "status": "broken",
    "message": "HTTP 404"
  }
]
```

## Exit Codes
Perfect for CI/CD integration:

- `0`: All links are OK (success)
- `1`: Broken links detected (failure)
- `2`: Internal error or invalid usage
### Example CI Usage

```bash
#!/bin/bash
# In your CI script
link-guardian github https://github.com/youruser/yourrepo

if [ $? -eq 1 ]; then
  echo "❌ Broken links detected!"
  exit 1
else
  echo "✅ All links are healthy!"
fi
```
## Project Structure

```text
link-guardian/
├── Cargo.toml           # Project metadata and dependencies
├── README.md            # This file
└── src/
    ├── main.rs          # Entry point, orchestrates everything
    ├── cli.rs           # Command-line parsing (clap)
    ├── checker/
    │   ├── mod.rs       # Checker module exports
    │   ├── http.rs      # HTTP link checking logic
    │   ├── markdown.rs  # Extract links from Markdown
    │   └── html.rs      # Extract links from HTML
    ├── github/
    │   ├── mod.rs       # GitHub module exports
    │   └── fetch.rs     # Fetch files from GitHub repos
    └── crawl/
        ├── mod.rs       # Crawl module exports
        └── queue.rs     # Website crawling with BFS
```
## How It Works

### For GitHub Repositories

- Parse the GitHub URL to extract `owner/repo`
- Fetch `README.md` from `raw.githubusercontent.com`
- Parse Markdown and extract all HTTP/HTTPS links
- Check each link concurrently (up to 500 at a time by default)
- Report results
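The first two steps above boil down to string manipulation on the repository URL. This is a minimal, self-contained sketch of that idea, not the tool's actual code; the function name `raw_readme_url` is hypothetical, and it assumes the default branch is `main` (many older repos use `master`):

```rust
// Hypothetical sketch: turn a GitHub repo URL into the raw README URL.
fn raw_readme_url(repo_url: &str) -> Option<String> {
    // Strip the scheme and host, leaving "owner/repo[/...]".
    let path = repo_url
        .strip_prefix("https://github.com/")
        .or_else(|| repo_url.strip_prefix("http://github.com/"))?;

    // Take the first two path segments as owner and repo.
    let mut parts = path.trim_end_matches('/').splitn(3, '/');
    let owner = parts.next()?;
    let repo = parts.next()?;
    if owner.is_empty() || repo.is_empty() {
        return None;
    }

    // raw.githubusercontent.com serves file contents directly.
    // "main" is an assumption here; a robust version would fall back
    // to "master" when the first fetch 404s.
    Some(format!(
        "https://raw.githubusercontent.com/{owner}/{repo}/main/README.md"
    ))
}

fn main() {
    let url = raw_readme_url("https://github.com/rust-lang/rust").unwrap();
    println!("{url}");
}
```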
### For Websites
- Fetch the starting URL
- Extract all links from the HTML
- If max-depth > 1, crawl same-domain links (breadth-first)
- Collect all unique links found across all pages
- Check each link concurrently (up to 500 at a time by default)
- Report results
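The breadth-first crawl described above can be sketched with a queue of `(url, depth)` pairs. This is an illustrative stub, not the code in `crawl/queue.rs`: the `site` map stands in for real HTTP fetching and HTML parsing, and same-domain filtering is omitted:

```rust
use std::collections::{HashMap, HashSet, VecDeque};

// BFS crawl sketch. `site` is a stub site map (page -> links on that page)
// standing in for fetch + HTML parse; returns every unique link found.
fn crawl(start: &str, max_depth: u32, site: &HashMap<&str, Vec<&str>>) -> HashSet<String> {
    let mut found: HashSet<String> = HashSet::new();
    let mut visited: HashSet<String> = HashSet::new();
    let mut queue: VecDeque<(String, u32)> = VecDeque::new();
    queue.push_back((start.to_string(), 1));

    while let Some((url, depth)) = queue.pop_front() {
        // Skip pages we have already crawled (insert returns false on dupes).
        if !visited.insert(url.clone()) {
            continue;
        }
        for link in site.get(url.as_str()).into_iter().flatten() {
            found.insert(link.to_string());
            // Only follow links to new pages while under the depth limit.
            if depth < max_depth {
                queue.push_back((link.to_string(), depth + 1));
            }
        }
    }
    found
}

fn main() {
    let mut site = HashMap::new();
    site.insert("/", vec!["/a", "/b"]);
    site.insert("/a", vec!["/c"]);

    // Depth 1 only collects links on the start page; depth 2 follows them.
    println!("depth 1: {} links", crawl("/", 1, &site).len());
    println!("depth 2: {} links", crawl("/", 2, &site).len());
}
```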
### Link Checking

For each link:

- Make an HTTP `HEAD` request (lightweight, no body)
- Categorize the response:
  - 200-299: ✅ OK
  - 300-399: 🔀 Redirect
  - 404/410: ❌ Broken
  - Timeout: ⏱️ Timeout
  - SSL errors: 🔒 SSL Error
  - DNS errors: 🌐 DNS Error
  - Other: ⚠️ Error
## Development

### Running Tests

```bash
# Run all tests
cargo test

# Run tests with output
cargo test -- --nocapture

# Run a specific test
cargo test test_check_valid_link
```
### Running in Development

```bash
# Run without building a release binary
cargo run -- github https://github.com/rust-lang/rust
cargo run -- site https://example.com --max-depth 2
```

### Code Style

```bash
# Format code
cargo fmt

# Check for common mistakes
cargo clippy
```
## Learning Resources for Rust Beginners
The code is heavily commented to teach Rust concepts. Look for:
- Function-level comments: Explain what each function does
- Inline comments: Explain tricky Rust concepts
- "BEGINNER NOTES" sections: Deep dives into Rust concepts
Key Rust concepts used in this project:
- Modules: Organizing code into namespaces
- async/await: Concurrent programming for network I/O
- `Result<T, E>`: Type-safe error handling
- `Option`: Representing values that might not exist
- Ownership: Who owns data and when it's freed
- Borrowing: Temporary access to data without owning it
- Traits: Like interfaces in other languages
- Pattern matching: The `match` keyword for control flow
- Iterators: Processing sequences of items efficiently
## Troubleshooting

### "Failed to fetch README.md"

- The repository might use `master` instead of `main` as its default branch
- The repository might not have a README.md
- Check the URL is correct: `https://github.com/owner/repo`
### "SSL certificate error"
- Some websites have invalid or expired SSL certificates
- This is reported as a "broken" link for safety
### "Too many redirects"
- The URL might have a redirect loop
- Default limit is 5 redirects
### Rate Limiting
- GitHub's raw.githubusercontent.com has rate limits
- For heavy usage, consider using the GitHub API with authentication
- Websites might rate-limit or block rapid requests
## Future Enhancements (Stretch Goals)
- Use GitHub API (octocrab) for better repo access
- Colored terminal output
- Progress bars for long scans
- Configurable ignore patterns (skip certain URLs)
- Support for other platforms (GitLab, Bitbucket)
- Retry logic for transient failures
- HTML report generation
- Recursive docs/ folder scanning for GitHub repos
## Contributing
This is a learning project! Contributions welcome:
- Fork the repository
- Create a feature branch
- Make your changes (keep the teaching style!)
- Add tests
- Submit a pull request
## License
MIT License - see LICENSE file for details
## Credits
Built with:
- `clap` - Command-line parsing
- `tokio` - Async runtime
- `reqwest` - HTTP client
- `scraper` - HTML parsing
- `pulldown-cmark` - Markdown parsing
- `url` - URL parsing
- `serde` - Serialization
- `anyhow` - Error handling
Made with ❤️ for Rust learners