Ask HN: Regex on a File or Stream

1 point by buzzdenver 2 years ago · 6 comments · 1 min read


I just ran into the seemingly simple problem of matching a multi-line regex against a 3 GB text file. What is the right tool for this? grep and perl both failed, running into PCRE limits.

jepler 2 years ago

Maybe some other PCRE-compatible implementation offers streaming. For instance, https://www.intel.com/content/www/us/en/developer/articles/t... says it has this feature, but of course given who it's from it may be tied to a single brand of CPU.

The GitHub repo seems to be https://github.com/intel/hyperscan

zaktoo2 2 years ago

Could you paste the regex portion of it please? Possibly some efficiencies to be gained there. You could also split the file into smaller chunks and then check the boundaries of the chunks.
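The chunk-and-check-boundaries idea could look roughly like the sketch below. It is only an illustration, not something tested on a 3 GB file: it uses the pattern the OP gives further down the thread, and the names `find_in_chunks`, `chunk_size`, and `max_match_len` are made up for this example. It assumes no single match is longer than `max_match_len` bytes.

```python
import io
import re

# Pattern taken from the OP's reply in this thread, compiled for bytes.
PATTERN = re.compile(rb"Authorization: Basic (.*)\ngrant_type=refresh_token")


def find_in_chunks(stream, chunk_size=1 << 20, max_match_len=4096):
    """Yield the captured group of every match in a binary stream.

    Reads fixed-size chunks and carries a tail of the previous buffer
    forward, so a match that straddles a chunk boundary is still seen.
    Assumption: no match is longer than max_match_len bytes.
    """
    tail = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf = tail + chunk
        last_end = 0
        for m in PATTERN.finditer(buf):
            yield m.group(1)
            last_end = m.end()
        # Keep at most max_match_len trailing bytes, but never bytes that
        # were already part of a reported match (avoids duplicates).
        keep_from = max(last_end, len(buf) - max_match_len)
        tail = buf[keep_from:]
```

Memory use stays around `chunk_size + max_match_len` bytes regardless of file size. Because it only calls `read()`, the same function works on a pipe, for example `find_in_chunks(sys.stdin.buffer)`.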

  • buzzdenverOP 2 years ago

    Yes, breaking it up would work, but that is not a solution for streams.

    The regex is dead simple: /Authorization: Basic (.*)\ngrant_type=refresh_token/. Since "." does not match a newline, I'm basically looking for two consecutive lines that conform to a template.

    Specific cases can be transformed with some grep/awk magic, but IMO the concept of pattern matching against a stream is interesting regardless.
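For this particular two-line template, matching against a stream can be sketched without any multi-line regex at all: read line by line and remember whether the previous line was an Authorization line. This is a minimal sketch under that assumption; `match_line_pairs` is a hypothetical helper name, and it only covers patterns that split cleanly at the newline.

```python
import re

# First half of the template: capture everything after "Basic " up to end of line.
AUTH = re.compile(r"Authorization: Basic (.*)$")
# Second half: the following line must start with this text.
GRANT_PREFIX = "grant_type=refresh_token"


def match_line_pairs(lines):
    """Yield the captured credential whenever an Authorization line is
    immediately followed by a line starting with GRANT_PREFIX.

    `lines` is any iterable of text lines (an open file, sys.stdin, ...),
    so only one line is held in memory at a time.
    """
    pending = None  # capture from the previous line, if it matched AUTH
    for line in lines:
        line = line.rstrip("\n")
        if pending is not None and line.startswith(GRANT_PREFIX):
            yield pending
        m = AUTH.search(line)
        pending = m.group(1) if m else None


# Example use on a pipe (not executed here):
#   import sys
#   for token in match_line_pairs(sys.stdin):
#       print(token)
```

The general problem, an arbitrary regex over a stream, needs an engine that keeps match state between reads; this sketch sidesteps that by hand-coding the state for one pattern.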

    • zaktoo2 2 years ago

      I missed the part where it was a stream. Also, is the grant_type guaranteed to be immediately after the token?

cvalka 2 years ago

https://github.com/VirusTotal/yara

burntsushi 2 years ago

ripgrep should be able to handle it with the -U/--multiline flag.
