Show HN: No Trash Search

notrashsearch.github.io

137 points by rickdeveloper 4 years ago · 98 comments

rickdeveloperOP 4 years ago

I built this website a couple of months ago because I was annoyed by how hard it was to find useful things on Google. As "Google no longer producing high quality search results in significant categories" [0] is currently #1 on the front page I figured I'd share this project again. I hope it's useful to some people.

'No Trash Search' is very focused on STEM and not "for daily use". It's surprisingly good when you're looking for certain kinds of information. Under the hood it's little more than a programmable search engine [1] with a whitelist of ~120 sites.

[0] https://news.ycombinator.com/item?id=29772136

[1] http://programmablesearchengine.google.com
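
For illustration, the effect of such a whitelist can be approximated for a small allowlist by OR-ing `site:` operators onto the query (Programmable Search applies its list server-side; the domains below are just examples, and query-length limits make this inline form impractical at ~120 sites):

```python
def whitelist_query(query, domains):
    """Restrict a web search to an allowlist by OR-ing site: operators.

    Only practical for short lists; Google's Programmable Search does
    the equivalent filtering server-side for larger whitelists.
    """
    sites = " OR ".join(f"site:{d}" for d in domains)
    return f"{query} ({sites})"

print(whitelist_query("binary heap", ["stackoverflow.com", "en.wikipedia.org"]))
# binary heap (site:stackoverflow.com OR site:en.wikipedia.org)
```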

  • throwawayboise 4 years ago

    > Under the hood it's little more than a programmable search engine [1] with a whitelist of ~120 sites

    So back to what web search was in the 1990s, roughly: an index from a curated selection of sites.

  • rdiddly 4 years ago

    120 sites is pretty hilarious and sad. "Here you go, the worthwhile part of the internet!"

  • BlueTemplar 4 years ago

    While I can understand the appeal, restricting your search engine to only ~120 websites out of hundreds of millions (?) is basically giving up on the Web.

    (BTW, any good search engines these days that aren't indirectly using Google or Bing ?)

    • version_five 4 years ago

      > restricting your search engine to only ~120 websites out of hundreds of millions (?) is basically giving up on the Web.

      Sure - the web is now a cesspool optimized for advertising and attention. The traditional search engines made a lot more sense at the dawn of the internet when it was more about discovery. Now, for the most part, it's closer to an information retrieval tool, where a finite list of established sites have the bulk of what one is looking for. It only makes sense to have a tool that lets one navigate the established, legit internet, and not have to deal with all the crap.

      That doesn't mean there is no use case for google as it is, but some more focused competition is a no brainer.

    • narrator 4 years ago

      There's http://yandex.com . It's great if you want to search controversial subject matter and see results that Google wouldn't give you. The reverse image search is also amazing.

      • imglorp 4 years ago

        The reverse image search in particular is very, very good.

        Far better than Bing or Google. It's not obvious why theirs is so terrible, unless that product is not a moneymaker for them, in which case it explains everything.

      • BlueTemplar 4 years ago

        I should have mentioned : ideally from the EU.

        Big Russian or Chinese software is even more out of the question than the GAFAMs (if they're big, they definitely have authorities messing with the results).

        Hmm, what about Baltic or Ukrainian or Israeli search engines ?

      • quocanh 4 years ago

        Which results are different than Google's?

        • jhugo 4 years ago

          Most. Yandex is great, especially for programming searches. It generally ranks GitHub, Stack Overflow and other content-heavy sites highly. Google has been taken over by weird clones of GitHub and SO lately, Yandex has no such trash.

          It completely boggles my mind that the useless GitHub and SO clones rank first page on Google. Do engineers at Google not use their own product?

          • skinkestek 4 years ago

            Regarding Stack Overflow, there is a fair chance they can congratulate themselves:

            If I am right, they played stupid games and won stupid prizes. More specifically, they have allowed rampant deletionism for years, so while I am fairly certain the questions and answers originated on Stack Overflow, it wouldn't surprise me if a good number of those aren't visible on Stack Overflow anymore, which would explain why they rank higher in Google.

            Done right this would actually be a service.

            Sadly, some of them seem to mix together various questions and answers on the same page to generate text matches for unusual queries.

          • ramphastidae 4 years ago

            Engineers at Google build what the ads and sales teams tell them to.

          • msrenee 4 years ago

            Don't have time to mess with it right now, but does it normally return about half results in Russian or is that something my phone/browser is doing?

            • jhugo 4 years ago

              I usually get about 10-25% in Russian.

              • msrenee 4 years ago

                Thanks. I've got a basic understanding of Russian and have some settings on my phone to read and write in Cyrillic, so I wasn't sure if it was something to do with that or just the way the site works.

          • hiptobecubic 4 years ago

            Frankly, no. It's kind of a running joke that you can't Google any of your problems at Google because everything is internal.

          • imglorp 4 years ago

            > Google has been taken over by weird clones of GitHub and SO lately

            Do you have an example search leading to a GitHub clone?

            • vgalin 4 years ago

              French is my mother tongue, but I've quickly learned during my studies that using English keywords in my STEM-related searches would simply lead me to better (and more abundant) results.

              A few weeks/months ago, however, while I was trying to solve an issue with a colleague who would search using French keywords, I noticed that some websites featured on the first page of the Google results were off.

              In short, they were machine-translated versions of Stack Overflow threads. And they would appear in most of the searches using French keywords.

              Those websites also appeared, rarely, in my searches when I used English keywords, but I mostly never bothered opening them. Now I notice them every time.

              Some examples: when searching for "wget set http proxy" on Google, the fourth result leads me to qastack.fr, and the ninth to it-swarm-fr.com; both are websites featuring scraped and machine-translated threads from Stack Overflow.

              When searching deliberately in French for "Eclipse CDT stdout ne s'affiche pas" ("Eclipse CDT stdout not displayed [in console]"), the first result leads me to askcodez.com and the fourth one to qastack.fr (askcodez is the same kind of site as the other two).

              I have yet to stumble upon GitHub clones, however.

              • imglorp 4 years ago

                One huge help here is uBlacklist, which has filter lists for search engine results. Of course, the Chrome version will be crippled more as Google feels the knife in its revenue artery, so FF is advised!

                https://github.com/iorate/uBlacklist

                • vgalin 4 years ago

                  I read that you can achieve similar results with uBlock Origin, but I indeed ended up using uBlacklist as you can block websites on the go and because it supports multiple search engines. It is working perfectly, one more reason to use FF I guess.
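
As a sketch, a uBlacklist subscription list is a plain text file of match patterns, one per line; rules like the following (targeting the clone domains mentioned upthread; exact pattern syntax per uBlacklist's documentation, so adjust subdomain wildcards as needed) would hide those results:

```
*://qastack.fr/*
*://it-swarm-fr.com/*
*://askcodez.com/*
```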

            • jhugo 4 years ago

              I don't have an example search, although I'll try to remember to update this comment the next time it happens. On average I come across these things at least once a day, but it depends what I'm working on. It tends to be when searching for more obscure bugs, for which there is a GitHub issue but it's not ranked highly on Google for whatever reason, but these spam sites are ranked highly.

              GitMemory is probably the most well-known example; it's just a thin layer over the GitHub API with a completely garbage UI, yet it often ranks higher than GitHub itself.

        • cpach 4 years ago

          Try searching for movie name + torrent for example

          • beckman466 4 years ago

            yep it's always a bunch of movie subscription sites instead of the torrent. it's almost like Google's search engine is predominantly focused on collecting advertising dollars...?

    • 1vuio0pswjnm7 4 years ago

      "(BTW, any good search engines these days that aren't indirectly using Google or Bing ?)"

      The code for Gigablast is open-source, including the crawler.

      I could be wrong, but I do not think search.marginalia.nu or wiby.me use Google or Bing.

      The comment about "hundreds of millions" is interesting. Assume hypothetically a search engine claimed to be searching millions of sites for a given query, but in truth it was actually only searching 120 sites that it had determined answered this query (i.e., was the most popular answer source) for the majority of users. How would a user verify the search engine's claim about searching millions of sites was true? What if the search engine only allowed the user to retrieve a maximum of about 230 results, no matter how many sites it claimed to search?

      • jerf 4 years ago

        "How would a user verify the search engine's claim about searching millions of sites was true."

        Search for things specifically on those pages, by very specific phrases and such.

        Of course you have to find them yourself first for that verification.

        I can say having set up some very teeny tiny websites here and there that the googlebot is hooked up to a lot of stuff. I'm not even sure how it found a couple of them as quickly as it did. Things like "if someone adds an RSS feed to Feed.ly" seem to do the trick. None of them were sites trying to "hide" or anything and I expected them to be found eventually, but they got found much faster than I expected. Or maybe they just scan new domain registrations, though it seemed to me it wasn't that that triggered it.

        • 1vuio0pswjnm7 4 years ago

          Imagine searching for something that is quite common that will produce a large number of results but the user can only retrieve, say, 230 results total. How does the user verify that all of the "millions of sites" that contain results were actually searched when the user submitted her query.

          A search engine can tell users some large number of sites were searched at the time of the user's query and some large number of results exist, but what if it does not allow the user to actually view all the results.

          To put it another way, the question is not what Google has discovered about the www,^1 but what Google is willing to let the user search and retrieve. If retrieving the 963rd result for a common string is not allowed, then it is impossible for the user to verify that the site containing that result was searched when the user submitted her query. Even if the search produced a 963rd result, what difference does it make if the user cannot retrieve it. What is the point of the search engine locating the 963rd result if it never has to show this result to the user querying a common string.

          1. What Google has discovered about the www^2 and what Google users are able to discover about the www through Google may be two different things.^3 Google has its own interests to pursue in the name of online advertising and these may conflict with users' interests. "Censorship" is one concept that often draws negative connotations but there are many more subtle forms of filtering and manipulation that are possible here, including unintentional ones.

          2. The most important focus would be what is "popular".

          3. Some users might care less about what is "popular". Such users would, by and large, be less interesting to an advertising company. Individual interests might become subverted in favour of "popular" interests, to the extent they conflict. An advertising company (that runs a search engine) will favour the larger audience.

      • imachine1980_ 4 years ago

        Gigablast results tend to be full of trash, in my short experience with it.

        • 1vuio0pswjnm7 4 years ago

          All the search engines have trash. I retrieve results from a variety of search engines and mix them into a simplified SERP with zero cruft that can be read very quickly. Some call searching multiple search engines "meta-search". The main differences with mine are 1. it is all done client side (there is no remote "meta-search" engine) and 2. searches can be "continued" where they left off at any time. This allows one to avoid rate limits. There are always trash results, every search engine has them in their SERPs, but I find that the more results and the more varied the results, the better the chance of finding useful, non-trash ones. Gigablast allows returning at least 100 results at a time. Few search engines allow 100 results at a time anymore. Google still allows it but will not allow a user to retrieve more than 200-something results total.
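
The merge step of such a client-side meta-search can be sketched as interleaving per-engine result lists and dropping duplicate URLs (engine names and URLs below are dummies; the "continue where you left off" part is omitted):

```python
from itertools import zip_longest

def merge_serps(*result_lists):
    """Interleave results from several engines and drop duplicate URLs,
    keeping first-seen order -- a minimal client-side meta-search merge."""
    seen, merged = set(), []
    for batch in zip_longest(*result_lists):
        for url in batch:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

google = ["https://a.com", "https://spam.biz", "https://b.org"]
gigablast = ["https://b.org", "https://c.net"]
print(merge_serps(google, gigablast))
# ['https://a.com', 'https://b.org', 'https://spam.biz', 'https://c.net']
```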

    • ColinHayhurst 4 years ago

      Try Mojeek https://blog.mojeek.com/2021/03/to-track-or-not-to-track.htm... Disclosure: team member. Feedback good or bad appreciated

    • fuckcensorship 4 years ago

      Check out marginalia[1], made by another user on HN.

      [1]: https://search.marginalia.nu/

      • marginalia_nu 4 years ago

        Yeah I do my own crawling, and offer results from around 200k sites (although it's indexed 700k domains, most of which are crap).

    • blaerk 4 years ago

      I think https://www.qwant.com/ uses its own index. I just started using it, so I can't really say much about it other than it seems alright compared to DDG and Google(?)

  • dataflow 4 years ago

    You might want to add cppreference.com to your list of programming sites.

  • SilasX 4 years ago

    FYI, I think this is just the case where you should prefix the submission title with “Show HN:”. Can mods update it so it shows with the others? @dang?

    https://news.ycombinator.com/show

    https://news.ycombinator.com/showhn.html

  • DantesKite 4 years ago

    Hey I was looking for something like this. Thanks.

pacificmint 4 years ago

I need to buy a 3 hole punch, and when searching for reviews yesterday I had the same problem of lots of hits with affiliate links and low quality sites.

I searched for “3 hole punch review” [1] here, and the results have zero relevancy.

First one is a Chinese cell phone company, second a Wikipedia page for an episode of The Office, third a thesaurus page with synonyms for 'colorful', and fourth a link to the Wikipedia page for Yellow Submarine.

I can’t even imagine how you get there from “3 hole punch review”

[1] https://notrashsearch.github.io/?q=3+hole+punch+review

  • jaytaylor 4 years ago

    Now your comment above is the first search result for "3 hole punch" on NTS!

    Nice SEO campaign ;)

    If only I could get NTS to whitelist my domain name (myfirstnamelastname dot com), the Big-G has hated it seemingly since even before I acquired it > ten years ago, even though it's ad-free and totally benign. Good thing I mostly just host go pkgs with it and use it for my email.

    p.s. OP this is amazing! Would love an article explaining any backstory and details on how you made this (or setup / configured it).

    • rickdeveloperOP 4 years ago

      > If only I could get NTS to whitelist my domain name

      There is a form [0] on the about page that allows people to suggest websites to add :)

      > p.s. OP this is amazing! Would love an article explaining any backstory and details on how you made this (or setup / configured it).

      Thanks! I think this is gonna be disappointing from an engineering perspective, and certainly not article worthy :) As further explained in my other comments, the website is basically a wrapper around google programmable search [1] where I whitelisted a set of sites I found useful personally, plus some suggestions from other users. It's really easy to set up.

      As to why, I will quote some other comments of mine:

      "I built this website a couple of months ago because I was annoyed by how hard it was to find useful things on Google."

      "to find things more easily while programming or studying (I study biology, cs and ai; and philosophy in my free time, so expect the best results for queries related to those subjects). ... When I'm not doing those things, I just use Google or DDG because they have better results for day-to-day queries."

      Let me know if you have other questions!

      [0] https://docs.google.com/forms/d/e/1FAIpQLSdf8lAoShQz7Wjl9h60...

      [1] https://developers.google.com/custom-search/docs/overview

  • rickdeveloperOP 4 years ago

    "It's a feature not a bug" (TM)

    The site uses a whitelist of URLs to (attempt to) keep results relevant to science and programming. In the context in which I'm using this search engine, I have no interest in (reviews on) 3 hole punches. (That's not to say I never do, but in that case I'd use Google, Reddit, etc.) The fact that results don't show up here means that they also won't show up when I'm not looking for them, which is 100% of the time when I'm using this search engine. That's a plus for me personally.

    Best case would be to have relevant results in a single search engine, but that's not what I intended when building this site.

  • yukinon 4 years ago

    From OP, "'No Trash Search' is very focused on STEM and not 'for daily use'". I would assume your 3 hole punch review search query falls outside of what this specializes in.

  • babalulu 4 years ago

    And as of this minute, your comment is the number one result for that search.

  • pseingatl 4 years ago

    The three-hole punch is only standard in the US and Canada. Everywhere else, it's a two-hole punch. This might explain the anomaly.

baal80spam 4 years ago

First test search "python random" - returned just what I would expect instead of a multitude of low-quality blogs like Google Search does. +1 from me!

etchalon 4 years ago

This is the approach I imagine Apple would take if they were to ever launch a Search Engine. A large corpus of handpicked sites.

quickthrower2 4 years ago

This is good, but also sad in a way. It means Jo Blogg's blog won't get discovered and they may have had some good information on the topic.

One way to improve is a "bring your own list" feature, and the ability to include vetted lists. Maybe some kind of web of trust - if your friends have whitelisted a site, it is whitelisted for you too. If you find a problem with that site, you let your friend know to remove it. If they don't respond, you can remove that friend from your trusted person list (maybe they got hacked?). Then maybe you can 'follow' a few lists of famous trusted people (e.g. paulg etc.) to build up a bigger slice of the internet you can search.

A spammer will want to come in then and create something that white lists their spam sites, but they need to convince you to add their list! And when you see the spam you can just unfollow them. They can't succeed.
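
The whitelist-union part of this idea is simple to sketch (friend names and domains here are made up):

```python
def effective_whitelist(own, friends):
    """Union your allowlist with those of currently-trusted friends.

    `friends` maps friend name -> set of domains; dropping a friend
    (e.g. a hacked account) removes their contribution on the next call.
    """
    combined = set(own)
    for domains in friends.values():
        combined |= domains
    return combined

friends = {"paulg": {"ycombinator.com"}, "alice": {"arxiv.org"}}
print(effective_whitelist({"stackoverflow.com"}, friends))
```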

rank0 4 years ago

The first two results for “Rust awesome” were two ads which took up my entire screen on mobile. Both travel related.

The GitHub repo was third and had to be scrolled to.

Seems pretty trash to me.

schleck8 4 years ago

If you are annoyed by Google, take a look at kagi and neeva, they are new takes on search engines

- https://neeva.com

- https://kagi.com

  • skinkestek 4 years ago

    Haven't tried Neeva but I can vouch for Kagi.

    Also search.marginalia.nu puts a smile on my face almost every time I use it :-)

    (I should try Neeva, I keep hearing good things about it.)

  • twofornone 4 years ago

    Why does neeva require an email sign in?

GlitchMr 4 years ago

Pretty nifty project. I'm curious what whitelist you're using; would it be possible to list the allowed domains somewhere publicly?

One suggestion that I have is to remove w3schools.com from the whitelist. MDN is a much better source for information about web development.

throw_me_up 4 years ago

Great job! How are the ads implemented and do they cover the costs? I'm thinking of building a similar search engine for a completely different domain. I am a bit concerned about paying for it though.

  • rickdeveloperOP 4 years ago

    Thanks!

    Ads are added automatically by Google. The whole thing is little more than a wrapper around the 'Programmable Search Element Control API' which is an HTML element you can just insert into any site, like an iframe. Unfortunately this is the only way to make Programmable Search available at scale as the API is restricted to either 10 sites or 10K queries / day, even when paid!

    There is a paid version for the HTML plugin, but that would leak the API key and so it wouldn't work as a business.

    There is an option to get a share of the revenue generated by a search engine. Maybe it's time for me to figure out how that works.

    I was thinking of making a hosted, ad free, customizable version where people upload their own keys. Not sure if people would like that.

    As a side-note, it's super easy to remove ads with 1 line of CSS, but I wasn't sure how Google would feel about that so it's not in the online version. TamperMonkey is an extension that allows people to insert their own CSS on different websites. Hmm.

    You can view all offerings in the docs [0].

    [0] https://developers.google.com/custom-search/docs/overview#su...

    • throw_away 4 years ago

      It would be cool if I could take my existing browser history, aggregate by domain, sort by frequency & then create the necessary xml for the programmable search. Maybe with a pick & choose UI so I could decide which sites I wanted.

      Right now, looking at your allow-list config, it feels a bit custom to you, but if I had an easy way to limit search to the sites I myself know and trust, I could see how that would be useful.

      I know I could probably pick it out of my browser's history UI & poke it into Google's Programmable Search UI, but that seems like a hassle and a half.
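
A rough sketch of that conversion, assuming a flat list of history URLs as input; the Annotations XML shape follows Google's Programmable Search bulk-upload docs, but the label name is engine-specific, so `_cse_myengine` below is a placeholder:

```python
from collections import Counter
from urllib.parse import urlparse

def history_to_annotations(history_urls, top_n=10):
    """Aggregate browser-history URLs by domain, keep the most frequent,
    and emit Annotations XML for Programmable Search (label name is a
    placeholder; real labels are tied to your engine ID)."""
    counts = Counter(urlparse(u).netloc for u in history_urls)
    rows = [
        f'  <Annotation about="{dom}/*">\n'
        f'    <Label name="_cse_myengine"/>\n'
        f'  </Annotation>'
        for dom, _ in counts.most_common(top_n)
    ]
    return "<Annotations>\n" + "\n".join(rows) + "\n</Annotations>"

history = ["https://stackoverflow.com/q/1", "https://stackoverflow.com/q/2",
           "https://news.ycombinator.com/item?id=3"]
xml = history_to_annotations(history)
print(xml)
```

A "pick & choose" UI would just filter `counts.most_common()` before emitting the XML.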

      • ffhhj 4 years ago

        That's a good idea! Not OP, but I'm creating a faster search engine for programming queries, depending on the tech searched it will also point to curated sites that could have the answer. Will try to implement your idea as well. Thanks!

      • rickdeveloperOP 4 years ago

        Good idea! Would you pay for a premium version where you can customize the whitelist (and additional features) if it were available? Bing charges around $5 per 1000 searches, so I guess it would cost about the same. (Google's API is limited to 10K searches / day, even when paid)

        • throw_away 4 years ago

          I think just the history->xml conversion would be a bit of a hard sell, but I would be willing to pay for a sub if there were some additional features like:

          * easy add to my filter list (like maybe a browser plugin so I can see that the current site isn't in my filter, but I can click a button and now it's in my filter & opposite for remove for when sites start to suck)

          * stats on which sites I visit after searching

          * aggregate bing+google filtered searches

          * curated site lists for different topics, top 100, etc. Maybe like a temporary search using these sets such that I can try them without affecting my own filters. Maybe sharing lists w friends

          * some sort of search anonymization/log deletion feature

          * integration with browser search on desktop & mobile

          * search flags like duckduckgo so I can easily switch filter sets by typing like /news or /nerdshit in the query

          * integration with archive.ph & wayback machine

    • yanmaani 4 years ago

      How about using the Bing API? Isn't that more open?

      With caching, I think you might be able to reduce the load.

      Also, why is w3fools in the list? It's an awful site.

      • rickdeveloperOP 4 years ago

        Yes! Maybe I'll build a premium customizable version out of it. Do you think that'd be useful? Bing charges around $5 per 1000 searches, so the premium version would be around the same (with caching covering hosting, maybe).

        • yanmaani 4 years ago

          I think I might be willing to pay some amount of money for it, but not very much. I think a lot of other people would be willing to pay decently.

          If you want to be cynical, just do Bing/DDG searches over Tor, and scrape that into the cache. This is $0/1000 searches, though it obviously violates some ToS somewhere. Unless they want to block Tor, you should be good.

  • tyingq 4 years ago

    It appears to be Google's custom search that you use to embed search on your own site.

    https://programmablesearchengine.google.com/about/

llaolleh 4 years ago

I agree with this approach. You can't crawl everything from the getgo - focus on a very specific domain catered to a specific set of users.

razemio 4 years ago

Googled "Best smartphone 2021", which returned crappy results. Maybe I am missing what "significant categories" actually means?

  • version_five 4 years ago

    I agree those results are not helpful.

    The poster did say it was mostly for STEM subjects though...

    More importantly though, I think "Best smartphone 2021" is really a search that has been conditioned on the crap google gives back now. At best you might expect to find a "best smartphones" listicle or something.

    This is just a whitelisted search, so in my 5 min playing with it, it looks like popular or consumer queries are more likely to just provide reddit or wikipedia links, while more technical searches land on SO or documentation sites.

    I think with a little tuning, this approach is great. Given the modern internet and all the crap there is, a manual whitelist of sites that are actually legit is always going to be superior to an algorithmic approach.

    • razemio 4 years ago

      I agree. It is a very promising approach to fine tune Google for your own needs.

  • rickdeveloperOP 4 years ago

    (obviously: this is subjective, so what's significant to me may not be to you.)

    Honestly I just created this search engine for myself to find things more easily while programming or studying (I study biology, cs and ai; and philosophy in my free time, so expect the best results for queries related to those subjects). I think those subjects also appeal to the HN audience, that's why I shared it here. When I'm not doing those things, I just use Google or DDG because they have better results for day-to-day queries.

    That being said, I'm definitely interested in helping improve other people's search as well (reason I'm posting at all), so let me know if you have suggestions for sites to add!

    • razemio 4 years ago

      Thanks for sharing. My initial comment sounded a bit harsh. I am sorry for that. I will look into fine tuning it for my needs. It is an interesting approach to a very annoying problem.

      • rickdeveloperOP 4 years ago

        NP :) It's exciting to have people using & commenting on something I built.

        > I will look into fine tuning it for my needs. It is an interesting approach to a very annoying problem.

        If a premium version were available with a customizable whitelist, would you pay for that? The API is around $5 / 1000 searches so it would cost about the same.

        • razemio 4 years ago

          Yes, a product which cleans my search from most referral sites is something I would pay for.

  • jhugo 4 years ago

    How would you define a good result for that query? It seems like the only thing you could expect from that query is a page of advertising, since the "best X" without any criteria is a concept that doesn't exist in reality.

    • razemio 4 years ago

      Somewhat true. A perfect answer for such a query would be a page like tomshardware where the best smartphones for multiple categories have been nominated. A normal Google search actually does return that, after some scrolling. This, however, is of course heavily based on my browsing history.

  • marginalia_nu 4 years ago

    Which search engine returns good results for such a query?

    • razemio 4 years ago

      There is none to my knowledge. That is why I tried it. Google returns at least links from tomshardware, wired and Android Authority after a few scrolls.

      • marginalia_nu 4 years ago

        The point I was attempting to make is that what you are looking for doesn't exist within the realm of authentic websites made by people. The only ones making content to match such a query are deliberately attempting to match it because there is so little quality content. So what you get is at best the sort of useless content mills that seem to crop up on Google.

dzhiurgis 4 years ago

So many people have listed their search engines that I feel at this point we need an aggregator

Supermancho 4 years ago

I did a search for "eternal crusade".

The blob of ads is still the top result. This is not the "no trash search" I'm looking for.

UltimateFloofy 4 years ago

so you've used google programmable search to make googling better?

bruhhh 4 years ago

The GitHub repo is just a few HTML pages; I don't see any source code. I won't trust this search engine...

  • rickdeveloperOP 4 years ago

    Glad you're interested in the source code!

    As explained in my other comment, this website is a wrapper around google programmable search. The actual searching happens on Google servers, and I can see why people have problems with that. The code you see on the website is the same as the repo, though. It's actually hosted by GitHub! You can verify this by opening the web inspector in any browser or looking at the `.github.io` portion of the URL.

    You can learn more about Programmable Search here: https://developers.google.com/custom-search/docs/overview. NoTrashSearch uses the 'Programmable Search Element Control API', which is documented here: https://developers.google.com/custom-search/docs/element and can be used with very little code!
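
For reference, Google's documented embed really is tiny; something like this, with `YOUR_ENGINE_ID` standing in for a real `cx` value:

```html
<!-- Programmable Search Element: the cx parameter is your engine ID -->
<script async src="https://cse.google.com/cse.js?cx=YOUR_ENGINE_ID"></script>
<div class="gcse-search"></div>
```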

    • version_five 4 years ago

      I think your site is great, I've thought about something like this before but didn't realize how simply it could be implemented.

      Stupid question though: where is the list of whitelisted sites? Is that something you set up separately with Google? I scanned through the code and expected to find a list somewhere, but obviously you do it in a different way.

  • vizzah 4 years ago

    Nothing should be in the repo. There is no source. Check author's link [1].
