Show HN: Narada – Open-source secrets classification model

6 points by sanketsaurav 3 months ago · 2 comments · 2 min read


Hey HN! We're the team behind Autofix Bot (YC W20's DeepSource)[1]. We're open-sourcing Narada (https://huggingface.co/deepsource/Narada-3.2-3B-v1), a fine-tuned Llama3.2-3B-Instruct model that dramatically reduces false positives in secrets detection tools. The model achieves 97% precision with 96% recall on our evaluation set. It's fast enough for CI/CD (3B parameters), works with any regex-based tool, and is MIT-licensed.

Traditional regex-based secrets scanners (Gitleaks, TruffleHog, detect-secrets) face a fundamental tradeoff: crank up sensitivity and drown in false positives flagging things like "YOUR_API_KEY_HERE", or tune it down and miss real credentials. We kept hearing from security teams that they couldn't trust their scanning tools because of the noise – developers would just ignore the alerts.

Regex is great at fast pattern matching, but terrible at understanding context. So instead of trying to make regex smarter, we built a hybrid system: regex does the initial high-recall sweep, then a fine-tuned 3B model filters out false positives by actually understanding the code context.
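
To make the two-stage idea concrete, here is a minimal Python sketch of the control flow. The patterns, the `Candidate` type, and `stand_in_classifier` are hypothetical and not taken from Narada or any scanner; in the real setup the second stage would be the fine-tuned model, not a string heuristic.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical, deliberately broad patterns: stage one favors recall.
PATTERNS = {
    "generic_api_key": re.compile(
        r"""(?i)api[_-]?key\s*[:=]\s*["']([^"']{8,})["']"""
    ),
    "aws_access_key_id": re.compile(r"\b(AKIA[0-9A-Z]{16})\b"),
}


@dataclass
class Candidate:
    rule: str      # name of the pattern that matched
    line_no: int   # 1-indexed line number in the scanned source
    value: str     # the captured candidate secret


def regex_sweep(source: str) -> list[Candidate]:
    """Stage 1: collect every regex match, noise included."""
    found = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for rule, pattern in PATTERNS.items():
            for match in pattern.finditer(line):
                found.append(Candidate(rule, line_no, match.group(1)))
    return found


def filter_candidates(
    candidates: list[Candidate],
    is_true_positive: Callable[[Candidate], bool],
) -> list[Candidate]:
    """Stage 2: keep only what the classifier accepts."""
    return [c for c in candidates if is_true_positive(c)]


def stand_in_classifier(candidate: Candidate) -> bool:
    # Assumption: a crude placeholder check standing in for the model call.
    lowered = candidate.value.lower()
    return not any(t in lowered for t in ("your_", "example", "changeme"))


SAMPLE = (
    'api_key = "YOUR_API_KEY_HERE"\n'
    'api_key = "9f8a7b6c5d4e3f2a1b0c"\n'
)
kept = filter_candidates(regex_sweep(SAMPLE), stand_in_classifier)
```

Keeping the classifier behind a plain callable means the regex stage stays untouched when the stand-in is swapped for a model-backed function.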

Technical approach:

- Started with teacher-student architecture using DeepSeek R1 as teacher
- Curated ~8K diverse secrets from Samsung's CredData dataset, relabeled for consistency
- Generated synthetic edge cases using Gemini 2.5 Pro and Claude Sonnet 4
- Fine-tuned on ~900 examples with deterministic outputs (not chain-of-thought)

Integration is straightforward – run your existing regex tool, feed candidates to Narada with ±20 lines of context, get structured JSON output with true/false positive classification and reasoning.
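
The two mechanical pieces of that integration, cutting the context window and reading the verdict, might look roughly like this in Python. This is a sketch under assumptions: the JSON field names `classification` and `reasoning` and the label strings are hypothetical, so check the model card for the actual prompt and output schema. The model call itself is left out.

```python
import json

CONTEXT_LINES = 20  # the post describes ±20 lines of context


def context_window(source: str, line_no: int, radius: int = CONTEXT_LINES) -> str:
    """Return the candidate line (1-indexed) plus up to `radius` lines on each side."""
    lines = source.splitlines()
    if not 1 <= line_no <= len(lines):
        raise ValueError(f"line_no {line_no} is outside the file")
    start = max(0, line_no - 1 - radius)   # clamp at the top of the file
    end = min(len(lines), line_no + radius)  # clamp at the bottom
    return "\n".join(lines[start:end])


def parse_verdict(raw: str) -> tuple[bool, str]:
    """Parse the model's JSON reply into (is_real_secret, reasoning).

    Assumption: the reply looks like
    {"classification": "true_positive" | "false_positive", "reasoning": "..."}.
    """
    data = json.loads(raw)
    label = data["classification"]
    if label not in ("true_positive", "false_positive"):
        raise ValueError(f"unexpected classification: {label!r}")
    return label == "true_positive", str(data.get("reasoning", ""))
```

Failing loudly on an unexpected label is a deliberate choice here: in CI it is usually safer to surface a malformed reply than to silently treat it as a false positive.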

We built this as part of Autofix Bot's secrets detection agent, and it outperformed static-only tools significantly in our benchmarks [2]. Figured the security community would benefit from having this available as an open-source building block. Would love to hear your feedback and learn what other edge cases you encounter.

[1] https://autofix.bot

[2] https://autofix.bot/benchmarks#benchmarks-secrets-detection

[3] https://autofix.bot/news/narada-secrets-detection-classifica...

micksmix 2 months ago

I'm curious how Kingfisher would do against the proprietary dataset: https://github.com/mongodb/kingfisher

Any chance you could try it and share the results? Full disclosure: I built Kingfisher.

  • dolftax 2 months ago

    Jai here, from the Autofix Bot team. We published results of the initial benchmark run[1] comparing Gitleaks, detect-secrets, and TruffleHog ~3 weeks ago. In the meantime, we've put together a significantly improved dataset, and we're planning to rerun those benchmarks shortly; we'll add Kingfisher to the list and share the results here.

    Btw, we use Kingfisher's validation system internally to generate request/expected_response pairs for a given secret, as the last step of the pipeline. We don't run the validation queries ourselves because of rate limits. Instead, we include this information in a structured format in the response, so it can be executed on the client side or by the user integrating via the API. Thanks for building it :)

    [1] https://autofix.bot/benchmarks/#benchmarks-secrets-detection
