Ask HN: Why is there no regular expression search for the web?

10 points by bluegene 15 years ago · 11 comments · 1 min read

What's keeping Google/other search engines from implementing Regular expression search in particular?

lacker 15 years ago

The main problem is that you would need a totally different indexing system.

Roughly, search engines work in two phases: retrieval and scoring. Retrieval is when you figure out which of the billions of documents in the index are the top few thousand that could be worthy of being search results. Scoring is when you look at each of those documents in more detail to figure out the actual top ten.

Scoring based on regular expressions wouldn't be too tough. Retrieval is the killer. Typically retrieval works based on "posting lists", which are basically per-word indices recording which documents contain that word. To retrieve based on regular expressions, you would need posting lists for individual characters or short sequences of characters. That would take a lot more space.
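To make the space argument concrete, here is a minimal sketch in Python comparing a word-level posting list with a character-trigram posting list. The toy corpus and the function names are assumptions made up for illustration; a real search engine's index is far more elaborate.

```python
from collections import defaultdict

# Hypothetical toy corpus: document id -> text (not from any real index).
DOCS = {
    1: "regex search for the web",
    2: "web search engines use posting lists",
    3: "posting lists map words to documents",
}

def build_word_index(docs):
    """Word-level posting lists: word -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def build_trigram_index(docs):
    """Character-level posting lists: every 3-character window -> set of doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        lowered = text.lower()
        for i in range(len(lowered) - 2):
            index[lowered[i:i + 3]].add(doc_id)
    return index

word_index = build_word_index(DOCS)
trigram_index = build_trigram_index(DOCS)

# Even on three short documents the trigram index has far more keys
# than the word index, which is the extra space described above.
print(len(word_index), len(trigram_index))
```

The trigram index can answer substring questions such as "which documents contain `ost`?" that the word index cannot, but it pays for that with many more posting lists.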

You might be able to hack together some hybrid that would use existing posting lists, for example by requiring that the regular expression contain a word within it. But pure regular expressions would require a different index. That sort of added complexity is not worth it for the feature.

curtis 15 years ago

One problem is that there's no easy way to build a regular expression index for the web. In the general case the only way to do regex search is to scan the entire content.

It might be practical to do a hybrid search -- a conventional word- or phrase-based search to return a limited set of documents that can then be brute-force searched using a regular expression. This could be especially handy for programmers searching for code samples, a position I often find myself in.
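A rough sketch of that hybrid idea in Python, under stated assumptions: the documents, the tokenizer, and the `hybrid_search` helper are all hypothetical, and the keyword step stands in for whatever conventional retrieval a real engine would run.

```python
import re
from collections import defaultdict

# Hypothetical toy documents: id -> text.
DOCS = {
    1: "Use foo_bar(x) to parse input in Python",
    2: "Python tips: call foo_baz(y) before exit",
    3: "Cooking tips for pasta",
}

def build_word_index(docs):
    """Conventional word-level posting lists: word -> set of doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index

def hybrid_search(docs, index, keyword, pattern):
    """Narrow by keyword using the index, then brute-force the regex
    over only the candidate documents."""
    regex = re.compile(pattern)
    candidates = sorted(index.get(keyword.lower(), set()))
    return [doc_id for doc_id in candidates if regex.search(docs[doc_id])]

INDEX = build_word_index(DOCS)

# The keyword keeps the candidate set small; the regex runs on candidates only.
python_hits = hybrid_search(DOCS, INDEX, "python", r"foo_ba[rz]\(\w\)")
tips_hits = hybrid_search(DOCS, INDEX, "tips", r"foo_\w+")
print(python_hits, tips_hits)
```

The cost of the regex scan is bounded by how many documents the keyword step returns, so a very common keyword would bring back most of the brute-force cost.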

seiji 15 years ago

Regex capable code search: http://google.com/codesearch
- curtis 15 years ago
  
  True, but codesearch only searches codebases. Suppose I want to search for mixed content and code. A lot of my programming-related searches lead me to StackOverflow, a mailing list entry, or a blog post.
- crasshopper 15 years ago
  
  You can't search for symbols though.

zck 15 years ago

Imagine the added complexity that you would require to do that -- you'd need to have more hardware than a general-purpose search engine. It's also complicated to precalculate anything, as there isn't a list of regexes that are more likely to be entered, unlike text (a dictionary).

Who would use the regex search? Only programmers. So your market is tiny compared to a general-purpose search engine.

So more expensive queries that are harder to code up for many fewer people? Sounds like a losing bet.

brudgers 15 years ago

Sounds like a promising idea for a YC application.

petervandijck 15 years ago

pg asks:
- how will you make money?
- how will you implement this cheaply enough?
- who will really use this? what are they doing now instead?
- brudgers 15 years ago
  
  Initially, with a subscription model for people with an interest in search results relevant to their purpose rather than suffering through search results relevant to the purposes of advertising sales. This will allow controlled growth to match resources to volume.
  [edit] The right model might be a sort of meta-search engine - feed the regex to something like Wikipedia to determine plausible keywords and then return aggregated search results based on the keywords. At prototype and small scale, actual search results could be aggregated from other search engines such as Google or Bing.
  [edit] Interestingly, Wikipedia already has regex capability built into AutoWikiBrowser.
  http://en.wikipedia.org/wiki/Wikipedia:AutoWikiBrowser/Regul...
  - petervandijck 15 years ago
    
    pg then asks: how many people are willing to pay a subscription for this? How do you know that?
    And then pg says: I worry.. I worry..
