An AI Agent Improved OWASP CRS Detection by 80% in 20 Experiments


Last week we published the first autoresearch experiment, where an AI agent tuned CRS configuration (paranoia levels, anomaly thresholds, rule exclusions). The results were good, but the agent never touched the actual detection rules.

This time we went deeper. We pointed the agent at the CRS regex patterns themselves and asked: can you find real bypasses, fix the detection gaps, and reduce false positives, all without breaking anything?

20 experiments. 20 kept. 0 discarded.

| Metric | Baseline (CRS nightly, PL1) | After 20 experiments | Delta |
|---|---|---|---|
| Balanced Accuracy | 0.630 | 0.976 | +0.346 |
| True Positive Rate | 55.8% | 100% | +44.2% |
| False Positive Rate | 29.7% | 4.8% | -24.9% |
| False Positives | 1,338 | 218 | -1,120 |
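Balanced accuracy here is presumably the standard definition: the mean of the true positive rate and the true negative rate (1 - FPR). A quick check reproduces the table's numbers under that assumption:

```python
def balanced_accuracy(tpr: float, fpr: float) -> float:
    """Mean of TPR and TNR, where TNR = 1 - FPR (standard definition)."""
    return (tpr + (1.0 - fpr)) / 2.0

# Baseline: TPR 55.8%, FPR 29.7%  ->  ~0.630
assert abs(balanced_accuracy(0.558, 0.297) - 0.6305) < 1e-9
# After 20 experiments: TPR 100%, FPR 4.8%  ->  0.976
assert abs(balanced_accuracy(1.000, 0.048) - 0.976) < 1e-9
```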

What Changed From Part 1

The first experiment optimized CRS configuration: which rules to enable, what thresholds to set, which content types to allow. Useful, but it is essentially knob-turning.

This experiment modifies the actual rule files: the regex patterns inside .conf files, the data lists, the detection logic. These are changes that could become upstream contributions to the CRS project itself.

The difference matters. Configuration tuning helps your specific deployment. Regex improvements help every CRS user.

The Dataset

We used 4,595 requests:

  • 95 malicious payloads derived from our CVE database (110K+ CVEs mapped to CWE categories and CRS rule families). These are not random fuzzing payloads. Each one targets a specific, known gap in CRS detection.
  • 4,500 legitimate requests from the openappsec/waf-comparison-project: real captured browsing sessions from 692 websites (e-commerce, travel, gaming, food delivery, social media).

The malicious payloads were specifically chosen to test known CRS blind spots: SQLite-specific SQL syntax, PostgreSQL operators, server-side template injection, command injection evasion techniques, and missing restricted file paths.

Infrastructure

  • WAF: OWASP ModSecurity CRS (nightly) on nginx, Paranoia Level 1
  • Agent: Claude Code in --print mode, one experiment per invocation
  • CRS rules: Git submodule, agent modifies rules directly
  • Evaluation: Docker restart + 4,595 concurrent requests, ~36 seconds per cycle
  • Loop: Bash orchestrator with 30-minute timeout per experiment
  • Hardware: MacBook Pro (Apple Silicon), Docker via OrbStack

The agent reads program.md for instructions, runs the evaluator, reads the category breakdown of missed attacks and false positives, picks the highest-impact target, modifies the regex, evaluates again, and commits or reverts.
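That loop is simple enough to sketch. Here is an illustrative Python model of the commit-or-revert decision; the real repo drives this from bash (scripts/run-agent.sh), and the accept criterion shown (balanced accuracy must improve) is an assumption, not the repo's documented logic:

```python
def balanced_accuracy(tpr: float, fpr: float) -> float:
    return (tpr + (1.0 - fpr)) / 2.0

def keep_experiment(before: tuple, after: tuple) -> bool:
    """before/after are (tpr, fpr) pairs from the evaluator run.

    Assumed policy: commit if balanced accuracy improves, else revert.
    This explains why Phase 1 changes were kept even though FPR ticked
    up slightly while TPR rose.
    """
    return balanced_accuracy(*after) > balanced_accuracy(*before)

# Baseline -> experiment 1: FPR rose 0.297 -> 0.300, but BA improved, so commit.
assert keep_experiment((0.558, 0.297), (0.832, 0.300))
# A change that only adds false positives would be reverted.
assert not keep_experiment((1.000, 0.048), (1.000, 0.060))
```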

Phase 1: Fixing Bypasses (Experiments 1-7)

The agent started with 42 missed attacks and went after them one category at a time.

SQLite Double-Equals Bypass

CRS rule 942190 matches SQL equality operators, but its regex only catches =, not ==. SQLite accepts both. The fix:

# Before (operator group in 942190):
[\s\x0b]*(?:and|x?or)\b[\s\x0b]+[0-9A-Z_a-z]+(?:[\s\x0b]*=[^\s\x0b])

# After:
[\s\x0b]*(?:and|x?or)\b[\s\x0b]+[0-9A-Z_a-z]+(?:[\s\x0b]*==[^\s\x0b]|[\s\x0b]+glob\b)

This also added GLOB detection. SQLite's GLOB operator is functionally equivalent to LIKE but was completely invisible to CRS. Payloads like ' AND name GLOB '*admin*'-- sailed through at PL1.

This is CRS issue #4121, open since April 2025. No PR had been submitted.
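The patched fragment is easy to sanity-check outside ModSecurity with Python's re module. Note this exercises only the operator-group fragment shown above, not the full 942190 rule with its transformation chain:

```python
import re

# Operator-group fragment from the patched 942190 (not the whole rule).
patched = re.compile(
    r"[\s\x0b]*(?:and|x?or)\b[\s\x0b]+[0-9A-Z_a-z]+"
    r"(?:[\s\x0b]*==[^\s\x0b]|[\s\x0b]+glob\b)",
    re.IGNORECASE,
)

payloads = [
    "' AND 1==1--",                  # SQLite double-equals
    "' AND name GLOB '*admin*'--",   # SQLite GLOB, equivalent to LIKE
]
for p in payloads:
    assert patched.search(p), p
```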

PostgreSQL Array Containment

PostgreSQL's @> array containment operator was not in any CRS rule. The agent added it to the same operator group in rule 942190:

# Added alternative:
\[[^\]]*\][\s\x0b]*@>

Catches: ' AND ARRAY[1] @> ARRAY[1]--
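Again, the added alternative can be verified in isolation (outside the full rule and its transformations):

```python
import re

# The alternative added to 942190's operator group for @> containment.
array_containment = re.compile(r"\[[^\]]*\][\s\x0b]*@>")

assert array_containment.search("' AND ARRAY[1] @> ARRAY[1]--")
assert not array_containment.search("key[0]=value")  # no @> operator present
```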

UNION SELECT Newline Evasion

Injecting %0a (newline) between letters of SQL keywords breaks regex word matching. The agent fixed this by allowing CR/LF between each letter of UNION and SELECT:

# Before:
union\b[\s\x0b]*(?:all|(?:distin|sele)ct)

# After:
u[\x0a\x0d]*n[\x0a\x0d]*i[\x0a\x0d]*o[\x0a\x0d]*n\b[\s\x0b]*(?:all|(?:distin|s[\x0a\x0d]*e[\x0a\x0d]*l[\x0a\x0d]*e)[\x0a\x0d]*c[\x0a\x0d]*t)

This is related to CRS issue #4363 on comment-based evasion.
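The before/after behavior is straightforward to reproduce with Python's re module, using a payload as it looks after URL decoding (this tests the bare fragments, not the full rules):

```python
import re

before = re.compile(r"union\b[\s\x0b]*(?:all|(?:distin|sele)ct)", re.IGNORECASE)
after = re.compile(
    r"u[\x0a\x0d]*n[\x0a\x0d]*i[\x0a\x0d]*o[\x0a\x0d]*n\b[\s\x0b]*"
    r"(?:all|(?:distin|s[\x0a\x0d]*e[\x0a\x0d]*l[\x0a\x0d]*e)[\x0a\x0d]*c[\x0a\x0d]*t)",
    re.IGNORECASE,
)

# %0a-spliced keywords, as they appear after URL decoding.
evasion = "1 uni\non sel\nect password from users"
assert not before.search(evasion)   # old pattern needs contiguous keywords
assert after.search(evasion)        # new pattern tolerates CR/LF splices
assert after.search("1 union select password")  # plain form still caught
```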

Command Injection Evasion

The agent fixed three RCE bypass techniques in rules 932130, 932160, and 932340:

  • Backslash prefix: \id and \whoami bypass command detection because CRS does not account for the backslash. Fixed by extending the pattern to match an optional leading backslash.
  • ANSI-C hex encoding: $'\x77\x68\x6f\x61\x6d\x69' (hex-encoded "whoami") was not detected. The agent extended rule 932130 to catch the $'\xNN' pattern.
  • Quote splitting: w'h'o'am'i evades detection because the command name is never present as a contiguous string. Fixed by adding standalone detection in the shell command data file.
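The latter two evasions work because the shell reassembles the command at execution time, so a detector looking for the literal string never sees it. A quick illustration of what the shell effectively executes:

```python
import re

# ANSI-C quoting: bash expands $'\x77\x68\x6f\x61\x6d\x69' to the bytes it encodes.
ansi_c = r"$'\x77\x68\x6f\x61\x6d\x69'"
hex_bytes = re.findall(r"\\x([0-9a-fA-F]{2})", ansi_c)
assert bytes(int(h, 16) for h in hex_bytes).decode() == "whoami"

# Quote splitting: the shell strips the single quotes before execution.
assert "w'h'o'am'i".replace("'", "") == "whoami"
```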

Result After Phase 1

After 7 experiments, TPR hit 100%. All 95 bypass payloads detected. Zero increase in false positives.

Phase 2: Reducing False Positives (Experiments 8-20)

With all attacks caught, the agent pivoted to reducing false positives. This is where it got interesting.

The agent identified which CRS rules caused the most false positives on real traffic and tightened their regexes in targeted ways. Each fix removed between 8 and 388 false positives without losing any attack detection.

The Big Wins

| Rule | Fix | FPs Removed |
|---|---|---|
| 932270 | Word boundary before tilde expansion pattern | 388 |
| 901162 | Allow text/plain content type | 275 |
| 942550 | Narrow JSON-SQL regex to require quote/backtick | 134 |
| 942290 | Word boundary after MongoDB operator names | 55 |

The Long Tail

After the big wins, the agent found nine smaller fixes, each removing 8-20 FPs. These included:

  • Requiring non-alphanumeric before shell history !- pattern (rule 932330)
  • Requiring path separator before .profile in restricted files (rule 930120)
  • Removing url from Windows command list, since it triggered on GraphQL queries (rule 932370)
  • Removing is_int from PHP function list, since it matched cookie values (rule 933150)
  • Removing single-character w from no-arguments command list (rule 932340)

Every single one of these is a legitimate CRS improvement. They are the kind of fixes that accumulate in changelogs as "reduce false positive on X" entries.

The Trajectory

Here is the full experiment trajectory:

| # | BA | TPR | FPR | Change |
|---|---|---|---|---|
| 0 | 0.630 | 0.558 | 0.297 | Baseline |
| 1 | 0.766 | 0.832 | 0.300 | PostgreSQL @> operator |
| 2 | 0.797 | 0.895 | 0.300 | SSTI detection |
| 3 | 0.808 | 0.916 | 0.300 | Newline in UNION SELECT |
| 4 | 0.834 | 0.968 | 0.301 | id command detection |
| 5 | 0.839 | 0.979 | 0.301 | Hex-encoded commands |
| 6 | 0.844 | 0.989 | 0.301 | Backslash prefix |
| 7 | 0.849 | 1.000 | 0.301 | Quote-split evasion |
| 8 | 0.892 | 1.000 | 0.215 | Tilde word boundary |
| 9 | 0.939 | 1.000 | 0.122 | text/plain content type |
| 10 | 0.954 | 1.000 | 0.092 | JSON-SQL regex narrowing |
| 11 | 0.960 | 1.000 | 0.079 | MongoDB word boundaries |
| 12 | 0.963 | 1.000 | 0.075 | Shell history pattern |
| 13 | 0.965 | 1.000 | 0.070 | .profile path context |
| 14 | 0.967 | 1.000 | 0.066 | Remove url command |
| 15 | 0.968 | 1.000 | 0.063 | Content-type parsing |
| 16 | 0.970 | 1.000 | 0.061 | PHP is_int removal |
| 17 | 0.972 | 1.000 | 0.056 | Session fixation regex |
| 18 | 0.973 | 1.000 | 0.053 | Base64 delimiter |
| 19 | 0.974 | 1.000 | 0.051 | Remove w command |
| 20 | 0.976 | 1.000 | 0.048 | Remove brace trigger |

What This Means

For CRS Users

Stock CRS at Paranoia Level 1 has real blind spots. SQLite, PostgreSQL-specific operators, and common command injection evasion techniques are not detected. If your application uses SQLite (common in mobile backends, IoT, embedded systems, and many PHP/Python frameworks), attackers can bypass CRS with trivial modifications like == instead of =.

For CRS Maintainers

Several of these findings map directly to open CRS issues (#4121, #4363, #4112). The regex fixes are minimal and surgical: extend a character class, add an alternation, tighten a word boundary. The total diff is 49 insertions, 18 deletions across 12 files.

We plan to submit PRs for the bypass detection fixes (experiments 1-7). The false positive reductions (experiments 8-20) need validation against a broader traffic corpus before upstream submission.

For the AI/Security Research Community

This approach, using CVE data to generate targeted bypass payloads, then letting an AI agent systematically improve detection, works. The agent found and fixed real issues that have been open for months.

The key insight: the agent is not doing anything a skilled security researcher could not do manually. It is just doing it faster, more systematically, and without getting bored. 20 experiments in 5.5 hours, each one reading regex patterns, understanding why a payload evades detection, making a minimal fix, and verifying no regressions.

Comparison With Part 1

| | Part 1 (Config Tuning) | Part 2 (Regex Improvement) |
|---|---|---|
| What changed | Anomaly thresholds, rule exclusions | Actual regex patterns in rule files |
| Scope | Your deployment | Every CRS user |
| Experiments | 30 (v3 + v4) | 20 |
| Best BA | 0.984 (v4) | 0.976 |
| Upstreamable | No (config is deployment-specific) | Yes (regex fixes benefit everyone) |

Methodology Notes

  • All experiments ran at CRS Paranoia Level 1 (the default most deployments use).
  • The agent was constrained to regex improvements only: no new rules (except one SSTI case where no existing PL1 rule existed), no transformation chain modifications, no paranoia level changes.
  • Every bypass was independently verified against the CRS Sandbox before inclusion in the dataset.
  • The legitimate traffic dataset is unmodified from openappsec. We did not cherry-pick easy traffic.
  • Two of 22 experiment runs timed out (30-minute limit) and produced no result. The loop recovered and continued; all 20 completed experiments were kept.

Limitations

  • The bypass payload set (95 requests) is small and targeted. A larger, more diverse attack corpus might reveal regressions we did not catch.
  • The legitimate traffic is from 2024 browsing captures. Web traffic patterns evolve.
  • Some false positive fixes (removing commands from data files, adding content types) may reduce detection surface in edge cases. These need broader validation before upstream submission.
  • The SSTI detection rule (experiment 2) was a new rule addition, not a regex improvement. CRS has a strict rule ID allocation process, so this would need maintainer approval.

What's Next

  • Submit PRs for the bypass detection fixes to coreruleset/coreruleset
  • Expand the dataset with more bypass categories (SSRF, XXE, deserialization)
  • Test at Paranoia Levels 2-4
  • Run longer experiments (overnight) to see if the agent can push past the 0.976 plateau

Try It Yourself

The full experiment infrastructure is available:

  • Clone the repo, docker compose up, and run python3 scripts/evaluate.py to see the baseline
  • Run nohup bash scripts/run-agent.sh overnight & to start the agent loop
  • Watch tail -f logs/agent.log for results

All you need is Docker, Claude Code (or any coding agent that supports --print mode), and a few hours of compute.