pii-hound is a lightning-fast, dependency-free CLI tool built in Go that sniffs out unprotected Personally Identifiable Information (PII) and Developer Secrets in your databases, data warehouses, and cloud storage.
It connects to your data sources, samples records (up to a configurable limit), and uses a combination of heuristic column-name matching and regex data sampling to identify risks before they become liabilities.
## Key Features
- Multi-Source: Support for PostgreSQL, MySQL, Snowflake, BigQuery, SQLite, AWS S3, and Google Cloud Storage.
- File Support: Scans CSV, JSON (Array and JSON Lines), Excel (.xlsx, .xlsm), Parquet, and Plain Text (.env, .txt, .log, .yaml) formats.
- Project Configuration: Use a `.pii-hound.yaml` file to set project-wide policies and ignore specific false positives.
- High-Risk File Detection: Automatically flags dangerous files like `.env`, `id_rsa`, and `credentials.json` by name.
- Custom Rules: Define your own PII and Secrets patterns using a simple YAML configuration with high-performance Aho-Corasick keyword matching.
- Secrets Detection: Sniffs out AWS Keys, GitHub Tokens, and Private Keys.
- PII Detection: Detects SSNs, Credit Cards (with Luhn validation), Emails, IP Addresses, and Phone Numbers.
- CI/CD Ready: Machine-readable JSON and SARIF output, plus a `--fail-on-pii` flag to block risky deployments.
- Git Integration: Use the `--diff` flag to scan only files that have changed in your repository.
- GitHub Integration: Upload SARIF results directly to GitHub's Security tab.
- Intelligence: High-performance Reservoir Sampling for large files and random database sampling.
- Lightning Fast: Concurrent, streaming architecture designed to handle gigabytes of data without high memory usage.
## Installation

### macOS (Homebrew)

```bash
brew tap saddledata/homebrew-tap
brew install pii-hound
```

### Docker

```bash
docker run --rm saddledata/pii-hound --help
```

### Binary Downloads
Download the latest pre-compiled binaries for Linux, Windows, or macOS from the Releases page.
## Usage

### Scan a Database

```bash
# PostgreSQL (quote the URI!)
pii-hound scan "postgres://user:pass@localhost:5432/db?sslmode=disable"

# MySQL
pii-hound scan "mysql://user:pass@tcp(localhost:3306)/db"

# Snowflake
pii-hound scan "snowflake://user:pass@account/MY_DB/MY_SCHEMA?warehouse=COMPUTE_WH"

# BigQuery
pii-hound scan "bigquery://my-project/my_dataset"

# SQLite
pii-hound scan "./my-app.db"
```
### Scan Cloud Storage

```bash
# AWS S3 (requires local AWS credentials)
pii-hound scan "s3://my-bucket/exports/*.csv"

# Google Cloud Storage (requires GOOGLE_APPLICATION_CREDENTIALS)
pii-hound scan "gs://my-bucket/backups/*.json"
```
### GitHub Action

The fastest way to use pii-hound in your CI/CD pipeline is with the official GitHub Action.

```yaml
- name: Scan for PII
  uses: saddledata/pii-hound@v0
  with:
    path: './data'
    fail-on-pii: true
    sarif: 'pii-results.sarif'

- name: Upload SARIF
  uses: github/codeql-action/upload-sarif@v3
  if: always()
  with:
    sarif_file: 'pii-results.sarif'
```
### Scan Local Files
...
```bash
# Scan multiple files (wildcards supported)
pii-hound scan ./data/*.csv ./backups/*.xlsx ./logs/*.parquet

# Scan your environment and config files
pii-hound scan .env ./config/*.yaml docker-compose.yml
```
### CI/CD Integration
Block your pipeline if PII is detected in your export folder:
```bash
pii-hound scan "./exports/*.csv" --fail-on-pii
```

### Git 'Changed Files' Only
Speed up your scans by only checking files that have changed in git (staged, unstaged, or since a base branch):
```bash
# Scan all local changes
pii-hound scan --diff

# Scan changed files compared to main branch
pii-hound scan --diff --base origin/main
```

### GitHub Actions (SARIF)
Generate a SARIF report to see PII findings in your PRs and Security tab:
```bash
pii-hound scan ./data --sarif > pii-results.sarif
```

Then use the `github/codeql-action/upload-sarif` action to upload the results.

### Custom Rules

Define proprietary PII patterns or sensitive keywords in a `rules.yaml` file:
```yaml
rules:
  - name: "Internal Project ID"
    type: "PII"
    risk: "HIGH"
    regex: "PRJ-[0-9]{5}"
    heuristic: "project_id|proj_code"
  - name: "Sensitive Keywords"
    type: "Sensitive Keyword"
    risk: "MEDIUM"
    keywords: ["AcmeCorp", "SecretProjectX"]
```
Then run the scan with the `--rules` flag:

```bash
pii-hound scan ./data.csv --rules rules.yaml
```

### Configuration & Ignore
pii-hound automatically looks for a `.pii-hound.yaml` file in your current directory. You can use this to set global limits, random sampling, and ignore specific files or columns that produce false positives.

Example `.pii-hound.yaml`:
```yaml
limit: 500
random: true
fail_on_pii: true

# Ignore specific false positives
ignore:
  - source: "legacy_data.csv"
    column: "fake_ssn"
  - source: "test_users.json"
    type: "Email Address"
  - source: "logs/*" # Use wildcards for sources

# Define custom rules inline
rules:
  - name: "Internal ID"
    regex: "INT-[0-9]{4}"
```
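To make the ignore semantics concrete, here is a minimal Go sketch of how entries like the ones above could be evaluated. It is not pii-hound's actual implementation: the `IgnoreRule` type, the `Matches` and `Suppressed` functions, and the choice of `path.Match` for wildcards are all assumptions made for illustration.

```go
package main

import "path"

// IgnoreRule mirrors one entry of the `ignore` list in the example config.
// The struct and its matching semantics are assumptions, not pii-hound's API.
type IgnoreRule struct {
	Source string // file or table name; may contain wildcards such as "logs/*"
	Column string // optional; empty matches any column
	Type   string // optional; empty matches any finding type
}

// Matches reports whether a finding would be suppressed by this rule.
func (r IgnoreRule) Matches(source, column, findingType string) bool {
	ok, err := path.Match(r.Source, source)
	if err != nil || !ok {
		return false // malformed pattern or a different source
	}
	if r.Column != "" && r.Column != column {
		return false
	}
	if r.Type != "" && r.Type != findingType {
		return false
	}
	return true
}

// Suppressed reports whether any rule in the list matches the finding.
func Suppressed(rules []IgnoreRule, source, column, findingType string) bool {
	for _, r := range rules {
		if r.Matches(source, column, findingType) {
			return true
		}
	}
	return false
}
```

In this sketch `*` does not cross a `/`, so `logs/*` covers `logs/app.log` but not `logs/2024/app.log`. Whether pii-hound's wildcards behave the same way is not stated here, so check against a real scan before relying on it.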
## CLI Flags
| Flag | Shorthand | Description |
|---|---|---|
| `--limit` | `-l` | Maximum rows/objects to sample per table/file (default: 1000). |
| `--random` | | Sample rows randomly (uses Reservoir Sampling for files). |
| `--diff` | | Only scan files that have changed in git. |
| `--base` | | Base git ref to compare against (used with `--diff`). |
| `--json` | | Output report in machine-readable JSON format. |
| `--sarif` | | Output report in SARIF format for GitHub Security. |
| `--fail-on-pii` | | Exit with code 1 if any PII or Secrets are detected. |
| `--config` | | Path to a YAML configuration file. |
| `--rules` | | Path to a YAML configuration file (legacy alias for `--config`). |
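The `--random` flag mentions Reservoir Sampling for files. As a rough illustration of the technique (Algorithm R), the Go sketch below samples up to `k` lines from a stream while holding at most `k` lines in memory. The function names `sampleLines` and `sampleString` are hypothetical, and nothing here is taken from pii-hound's source.

```go
package main

import (
	"bufio"
	"io"
	"math/rand"
	"strings"
)

// sampleLines returns up to k lines chosen uniformly at random from r
// using Algorithm R. It never holds more than k lines at once.
func sampleLines(r io.Reader, k int, rng *rand.Rand) ([]string, error) {
	if k <= 0 {
		return nil, nil
	}
	reservoir := make([]string, 0, k)
	sc := bufio.NewScanner(r)
	seen := 0
	for sc.Scan() {
		seen++
		if len(reservoir) < k {
			// Fill phase: keep the first k lines unconditionally.
			reservoir = append(reservoir, sc.Text())
			continue
		}
		// Replacement phase: line number `seen` is kept with probability k/seen.
		if j := rng.Intn(seen); j < k {
			reservoir[j] = sc.Text()
		}
	}
	return reservoir, sc.Err()
}

// sampleString is a convenience wrapper for in-memory data with a fixed seed.
func sampleString(data string, k int, seed int64) ([]string, error) {
	return sampleLines(strings.NewReader(data), k, rand.New(rand.NewSource(seed)))
}
```

Because each line is examined once and then discarded unless it lands in the reservoir, memory use depends on `k` rather than on file size. Note that `bufio.Scanner` rejects very long lines by default; a real scanner would need a larger buffer or a different reader.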
## What does it detect?
pii-hound uses a dual-engine approach. It first checks for suspicious column names (e.g., `ssn`, `cc_num`, `apikey`) and then evaluates the actual data inside those columns.
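The two stages can be sketched in Go as follows. The patterns, the `Finding` type, and `inspectColumn` are hypothetical and far simpler than a real rule set; the point is only to show a column-name check followed by a check of the sampled values.

```go
package main

import "regexp"

// Hypothetical patterns for illustration only; not pii-hound's built-in rules.
var (
	// Stage 1: column names such as "ssn", "user_ssn", or "social_security".
	ssnColumn = regexp.MustCompile(`(?i)(^|_)(ssn|social_security)(_|$)`)
	// Stage 2: values shaped like a US SSN, e.g. 123-45-6789.
	ssnValue = regexp.MustCompile(`^\d{3}-\d{2}-\d{4}$`)
)

// Finding summarises what both stages saw for one column.
type Finding struct {
	Column    string
	NameHit   bool // the column name looked suspicious
	ValueHits int  // how many sampled values matched the data pattern
}

// inspectColumn runs the name heuristic, then tests each sampled value.
func inspectColumn(name string, samples []string) Finding {
	f := Finding{Column: name, NameHit: ssnColumn.MatchString(name)}
	for _, v := range samples {
		if ssnValue.MatchString(v) {
			f.ValueHits++
		}
	}
	return f
}
```

A caller could treat a column with both a name hit and value hits as a stronger signal than either alone. How pii-hound actually weighs the two is not described here.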
### High Risk

- High-Risk Files: Dangerous files flagged by name (e.g., `.env`, `id_rsa`, `sa-key.json`, `terraform.tfstate`).
- Social Security Numbers (US): Matches standard patterns.
- Credit Card Numbers: Validated via the Luhn Algorithm to reduce false positives.
- Developer Secrets: AWS Access Keys, GitHub Tokens, RSA/SSH Private Keys, and generic `PASSWORD=...` patterns.
- Email Addresses: Standard RFC-compliant detection.
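For readers unfamiliar with the Luhn check mentioned above, here is a small Go sketch of the checksum. The function name `luhnValid` and the 12 to 19 digit length bound are assumptions for illustration and are not taken from pii-hound's code.

```go
package main

// luhnValid reports whether the digits in s pass the Luhn checksum.
// Spaces and dashes are skipped; any other non-digit rejects the input.
func luhnValid(s string) bool {
	sum, count := 0, 0
	double := false // the rightmost digit is not doubled
	for i := len(s) - 1; i >= 0; i-- {
		c := s[i]
		if c == ' ' || c == '-' {
			continue
		}
		if c < '0' || c > '9' {
			return false
		}
		d := int(c - '0')
		if double {
			d *= 2
			if d > 9 {
				d -= 9 // same as summing the two digits of the product
			}
		}
		sum += d
		double = !double
		count++
	}
	// Assumed length bound for payment card numbers.
	return count >= 12 && count <= 19 && sum%10 == 0
}
```

A digit string that merely looks like a card number, such as `1234-5678-9012-3456`, fails the checksum, while the well-known test number `4111 1111 1111 1111` passes. The check filters out most random 16-digit values but cannot prove that a number belongs to a real card.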
### Medium Risk
- IP Addresses: Both IPv4 and IPv6 detection.
- Phone Numbers: Matches various international and US formats.
- Person Names: Heuristic detection for common column names (e.g., `first_name`, `last_name`, `fullname`).
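As an illustration of telling IPv4 and IPv6 values apart, the Go sketch below validates a whole field value with the standard library's `net/netip` package. The `classifyIP` helper is hypothetical, and pii-hound's own detection may work differently (for example, by finding addresses embedded in free text).

```go
package main

import "net/netip"

// classifyIP returns "IPv4" or "IPv6" when s is a well-formed address,
// and the empty string otherwise. It validates the whole value, so text
// that merely contains an address is not matched.
func classifyIP(s string) string {
	addr, err := netip.ParseAddr(s)
	if err != nil {
		return ""
	}
	if addr.Is4() {
		return "IPv4"
	}
	return "IPv6"
}
```

Parsing rather than pattern-matching rejects look-alikes such as `999.1.1.1`, which a loose regex for dotted quads would accept.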
## Found PII? Automate your protection.
pii-hound is an open-source project maintained by the team at Saddle Data.
Finding PII is only half the battle. If you want to automatically mask, hash, and protect this data before it gets synced to your data warehouse, check out Saddle Data's Governance Control Center.
With Saddle Data, you can tag these columns once, and our Execution Circuit Breakers will automatically inject Hash/Mask transformations into every data pipeline you build, ensuring zero-trust compliance on autopilot.
Learn more about Saddle Data Governance
## Contributing
Pull requests are welcome! If you want to add a new detector (e.g., Passport numbers) or a new source connector (e.g., MongoDB), please open an issue first to discuss the changes.