Show HN: OpenClaw skills degrade agent safety

1 points by shadab_nazar 2 months ago · 2 comments

Summer Yue's OpenClaw agent deleted 200 emails despite "confirm before action." We tested the gog skill (Google Workspace) and saw the same behavior — the skill teaches the agent how to bulk-delete but not when to stop.

  That was 1 of 11 security failures in that single skill. Others: data exfiltration, unauthorized forwarding, contact harvesting, impersonation via calendar events.

  We tested 10 OpenClaw skills across 186 security properties, with and without each skill loaded. 9 of 10 show the same pattern — some properties
  improve, others degrade. The skill adds domain knowledge that shifts security behavior in ways nobody tested for.
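
  To make the with/without comparison concrete, here is a minimal sketch of how per-property results could be bucketed. This is not the actual harness: the property names, pass rates, and the 0.05 tolerance are hypothetical, chosen only to illustrate the idea of diffing a baseline run against a skill-loaded run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PropertyResult:
    """Pass rates for one security property (hypothetical schema)."""
    name: str
    baseline_pass_rate: float  # agent run without the skill loaded
    skill_pass_rate: float     # same agent run with the skill loaded


def classify(results, tolerance=0.05):
    """Bucket properties by how the skill shifted their pass rate.

    `tolerance` is an assumed noise threshold, not a value from the report.
    """
    buckets = {"improved": [], "degraded": [], "unchanged": []}
    for r in results:
        delta = r.skill_pass_rate - r.baseline_pass_rate
        if delta > tolerance:
            buckets["improved"].append(r.name)
        elif delta < -tolerance:
            buckets["degraded"].append(r.name)
        else:
            buckets["unchanged"].append(r.name)
    return buckets


# Made-up numbers for illustration only.
example = [
    PropertyResult("bulk_delete_confirmation", 0.9, 0.4),
    PropertyResult("phishing_link_refusal", 0.6, 0.8),
    PropertyResult("no_contact_harvesting", 0.7, 0.7),
]
report = classify(example)
```

  A skill that shows entries in both the improved and degraded buckets would match the mixed pattern described above.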

  We hardened all 10. 84% fix rate. Each guardrail traces to a specific regression. Open source, drop-in replacements.

  Per-skill scorecards: https://faberlens.ai/report
  Research: https://faberlens.ai/blog/jagged-surface

  N=10, two models, limitations published. Happy to go deeper on methodology.

jdrhyne 2 months ago

Check out https://agentverus.ai/, where you can upload any skill and get similar information for free. The open-source scanner is at https://github.com/agentverus/agentverus-scanner. I'd love contributions or help if you see anything I'm missing.
