Ask HN: I've Built a DHT Torrent Sniffer and Search Engine. Should I Release?
Recently I was researching DHTs and developed a DHT sniffer in Go which connects to some known DHT routers and sniffs all the announcements. I quickly added ZincSearch, and it is now basically a search engine which can search for hashes, names, or files contained in the torrents. It is able to index around 5-10k announcements per second, so the index grows quite fast.
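For readers wondering what sniffing an announcement involves at the wire level, here is a minimal sketch in Python (not the OP's Go code, which isn't shown in this thread). It assumes the message layout from BEP 5: DHT messages are bencoded dictionaries, and an `announce_peer` query carries the torrent's 20-byte `info_hash` in its arguments. The names `bdecode`, `announced_info_hash`, and `SAMPLE` are made up for illustration, and the sample packet is modeled on the example in the BEP.

```python
# Minimal bencode decoder plus a helper that pulls the info-hash out of a
# KRPC "announce_peer" query (layout assumed from BEP 5). Illustrative only.

def bdecode(data: bytes, pos: int = 0):
    """Decode one bencoded value starting at pos; return (value, next_pos)."""
    lead = data[pos:pos + 1]
    if lead == b"i":                      # integer: i<digits>e
        end = data.index(b"e", pos)
        return int(data[pos + 1:end]), end + 1
    if lead == b"l":                      # list: l<items>e
        pos += 1
        items = []
        while data[pos:pos + 1] != b"e":
            item, pos = bdecode(data, pos)
            items.append(item)
        return items, pos + 1
    if lead == b"d":                      # dict: d<key><value>...e
        pos += 1
        result = {}
        while data[pos:pos + 1] != b"e":
            key, pos = bdecode(data, pos)
            value, pos = bdecode(data, pos)
            result[key] = value
        return result, pos + 1
    if lead.isdigit():                    # byte string: <length>:<bytes>
        colon = data.index(b":", pos)
        length = int(data[pos:colon])
        start = colon + 1
        if start + length > len(data):
            raise ValueError("truncated byte string")
        return data[start:start + length], start + length
    raise ValueError(f"invalid bencode at offset {pos}")


def announced_info_hash(packet: bytes):
    """Return the hex info-hash if packet is an announce_peer query, else None."""
    try:
        msg, _ = bdecode(packet)
    except (ValueError, TypeError, RecursionError):
        return None                       # malformed or hostile input
    if not isinstance(msg, dict):
        return None
    if msg.get(b"y") != b"q" or msg.get(b"q") != b"announce_peer":
        return None
    args = msg.get(b"a")
    if not isinstance(args, dict):
        return None
    info_hash = args.get(b"info_hash")
    if not isinstance(info_hash, bytes) or len(info_hash) != 20:
        return None
    return info_hash.hex()


# Hypothetical sample packet, modeled on the announce_peer example in BEP 5.
SAMPLE = (
    b"d1:ad2:id20:abcdefghij0123456789"
    b"9:info_hash20:mnopqrstuvwxyz123456"
    b"4:porti6881e5:token8:aoeusnthe"
    b"1:q13:announce_peer1:t2:aa1:y1:qe"
)

if __name__ == "__main__":
    print(announced_info_hash(SAMPLE))
```

A real sniffer would receive such packets on a UDP socket while participating in the DHT; that part is left out here.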
Now, I am thinking about releasing it as open source for others to study, but I'm not sure if I should, because it might be used for "evil".

I had been working on this successfully for a couple of years in the past before I got tired of it and moved on. I still think it's a magnificent idea, to be able to host your own torrent site and to decentralise the last centralised bit of BitTorrent.

I've built a small CLI utility to search the database: https://sr.ht/~rakoo/magneticos/. It also asks the DHT and known trackers for the liveness of the swarms, so I can sort by seeders. It's simpler and more useful than running a full webserver with ports and all when I only need SSH access. It has served me well and will be more than enough for a long time.

The main thing that seems to be missing, in my experience, is moderation. If someone can manage to make a decentralized forum-like moderation list, then we wouldn't have any use for centralized trackers anymore. I think it could probably be done by using public GPG keys as identities, then keeping changes in a torrent-distributed git repo where user-generated data is saved in GPG-signed git commits. Then the only centralization left would be what branch/torrent people introduce to new users.

> I think it could probably be done by using public GPG keys as identities, then keeping changes in a torrent-distributed git repo where user-generated data is saved in GPG signed git commits.

Congratulations, you won some HN bingo squares by introducing both PKI and blockchain to the discussion :-D But, in all seriousness: moderation against what?

Not against. For. For curation. For categorization and deduplication. And for trust. You can only express so much with titles and seed/leech counts. Trackers that were known to be the best - like What.CD and waffles.fm - were private and heavily moderated.
You had to be invited, usually by passing a quick interview to prove you knew how to avoid common data management pitfalls like transcoding from a lossy codec. That made those trackers into carefully curated collections of high-quality content. Even public trackers benefit from some moderation. Most people only want to be browsing porn intentionally, so having a category for it lets you browse for movies or whatever without running into it. It's all about separating the signal from the noise. It's also helpful to have some commentary on a torrent. Maybe the subtitles are forced. Maybe the software has a virus or includes adware, but ended up getting a critical mass of seeds/leeches anyway. Maybe there is some similar content worth talking about. Have you ever used a public decentralized platform that wasn't full of trolls? I haven't.

Yes, moderation is as simple as not downloading something you object to. As for identifying false metadata, that falls outside of moderation.

>> then we wouldn't have any use for centralized trackers anymore.

So we finally invented eMule? Centralized servers that provide selection, moderation, quality control, information, IMDb scores, etc. are exactly why the BitTorrent protocol won against all the decentralised protocols.

I'm not familiar with eMule, but as far as I can tell, its credit system is only based on seeding or uploading new content. That can be useful, but it isn't the moderation I was thinking of. See my reply to the sibling comment.

I don't remember any credit system in eMule; people upload and share just out of generosity. I guess it also contributed to BitTorrent's success - uploaders probably get some share of tracker site ad revenue.

Thanks for writing that note on the repository -- it was thoughtful and well explained -- reading through your small manifesto that was linked and that's fun too.

I think the manifesto was Aaron Swartz's.

Oh, you're right, I didn't even notice -- aaronsw.com... First time I've read it.
https://github.com/boramalper/magnetico Someone else did this a while back, and the universe continues to exist. Let it rip.

DHT has been around for so long now that whatever bad actors/evil use cases you're imagining have already happened. It sounds like a cool project, and I'd be interested to see it.

I accidentally read this as "Let it R.I.P."

I totally agree, unroll it! While the file sharing is distributed, the centralised web-based indexing is still a game of whack-a-mole.

How is what OP made not basically going to be the p1rat3 bay 2.0?

It already exists. BTDigg: https://en.m.wikipedia.org/wiki/BTDigg And sibling comments show there is a community of folks doing this. People also do it to Tor’s DHT to discover hidden services:
https://donncha.is/2013/05/trawling-tor-hidden-services/ You can’t really stop open source tech like this from existing by choosing not to open source your implementation. You just end up staying out of the conversation yourself. The community will still build it if there is a desire for it.

One is a repository of code that you or I can run on our own machines. The other is a publicly accessible website that got a lot of public attention because it could easily be used by anyone, including novices, to pirate content.

Moderation. TPB (and other trackers) are essentially forums.

I wrote a similar solution 8 years ago. I repurposed the system to identify IPs owned by the government, and notified them if a malicious copy of Windows (among other things) was seeded by them. Meaning there was a chance that an unknown actor had a backdoor in my government's network. If you wanna discuss, I'm happy to talk and even contribute towards a commercial solution: majestic.hn@fastmail.com. I didn't pursue this opportunity for money at that time, but I had my fair share of "shoutz". Wouldn't be bad to pick this up again.

Just AGPL it, I hear it's an effective ward against Alphabet.

Very good suggestion, considering the sniffer could be used by copyright claims lawyers.

Can't see how AGPL would stop anyone using it. Doesn't it just prevent them from modifying the source code without sharing their changes?

AGPL fixes a "bug" in GPL that many tech companies have exploited to not release modified source code as the GPL requires. Simply put, GPL says that you have to release the complete source code (including any changes you have made) of GPL-licensed code only if you distribute it to other users. Many tech companies thus avoided GPL code. But with the growth of Software-as-a-Service, where an application only runs on the server and is accessed through a browser or an app, many of these same companies created web applications with GPL code that they customised. However, if a user demanded the source code of the GPL code, along with the modifications they made, the tech companies refused to provide it, claiming they weren't "distributing" the application (as in giving you the whole application to run on your computer). And since they weren't doing that, they claimed they had no legal obligation to release the complete source code. AGPL fixes this - it recognizes SaaS web applications too as a "distribution" of GPL software. So if source code is licensed under AGPL, anyone who uses it to create web applications and makes them available to the public is now also legally obliged to provide the complete source code if any user requests it. (And of course, as with GPL, the user is free to use the source code as they want, to improve it, and even to start competing SaaS services.) That is why the AGPL is currently the best GNU license to ensure that your open source code always remains open source.

How would someone know if SaaS uses some AGPL code?

All AGPL / GPL licensed code requires that the AGPL / GPL license also be distributed with the application. Moreover, AGPL / GPL also requires public attribution (acknowledgement / credit) of those who have created and contributed to the source code. Of course, if somebody wants to be dishonest and unethical, they can violate these terms to hide the fact that they are using AGPL code in their SaaS application.
There's not much you can do about it in such cases. But if anyone (like a current or former employee) leaks this info to the public, they can face legal trouble and bad PR in the future, as it would be akin to using pirated software.

In general, you don't, but that's a different question.

It only prevents them from modifying the code, not sharing their changes, and making the service available to another company. You are still free to make private changes to AGPL code and run it for yourself (or your company).

I'm not considering "anyone" here, I'm considering copyright/license lawyers. Would be kinda dumb for them to violate a license agreement (AGPL) while hunting for license violations.

AGPL wouldn't stop copyright lawyers from using it or even making changes and not sharing. It would only stop another company from making changes, not sharing them, and selling the service to lawyers.

Of course copyright lawyers have to prove in court the means by which they hunted down some violation. They would therefore have to disclose any modifications they made to an AGPL software, therefore giving insights into how they hunt, therefore allowing for better countermeasures. Don't be naive about lawyers, they are aware of what AGPL means.

> They would therefore have to disclose any modifications they made to an AGPL software

That's just not what the license says.

I have been operating a BitTorrent search engine for a few years (it is closed now). Here is some advice if you want to launch it as a public website:

1. There is no shortage of people opening the kind of torrent indexer you have in mind. In the 2010s there were 5-10 such sites launched each year, and even nowadays there are at least a few new indexers yearly. Almost all are closed within a year or two (at most). There are reasons for that, some less obvious than others.

2. Most likely you will close your site after some time as well. Here is why:

2.1. It will be difficult to find hosting that will tolerate it.
Forget about Linode or DigitalOcean or any similar hosting providers - they will suspend your server (first) and account (later) after receiving a few automated DMCA emails. You can use some "second-tier" providers for some time, but eventually even they will get tired of you; you will be out the moment they receive the first paper letter from some law firm that represents a movie company or the like. You can use proxy servers or some other arrangement for some time, but eventually you will decide that it isn't worth the effort.

2.2. The same is true for domain providers. Yes, there are a few that are more resilient than others, but don't expect that they will protect you and your domain indefinitely for $15 per year. If your site is popular enough, the only remaining option may be to register a new domain every few months and hop between them. Eventually you will get tired of that. It may take half a year or a few years, but you will.

2.3. There is no money in torrenting nowadays. Forget about ads or donations - even thepiratebay cannot earn money that way, and you will not be able to either.

2.4. Sadly, torrents are not as popular nowadays as they were 5-10-15 years ago. Some people still use them, but in most developed countries that's more the exception than the rule. The rule is Netflix, Spotify and the like. As a result, if you imagine a large and happy community of users around your site - just don't, most likely there won't be any.

2.5. Don't expect to just launch a site and rest. You will spend at least 5-10-20 hours per week on its maintenance (fixing bugs, importing/cleaning data, adding new features, etc). It will feel fine at first, but more and more tiring as the months and years go by. Eventually you will get bored and stop any maintenance. Users don't like old/unmaintained sites, so they will be less and less interested in it. Eventually you will decide that it isn't worth the effort to run it at all.
Don't underestimate possible legal problems. Tech people tend to simply brush them aside ("I will use a VPN/Cloudflare/Onion, no one will find me"), but here are my 5 cents:

1. Most likely, your site will not gain any significant popularity and you will decide to close it on your own. Meanwhile you will just be getting some automated DMCA emails/complaints from time to time, that's it.

2. Less likely, your site will be somewhat popular, and it will become an attractive target for copyright owners and law firms (not as attractive as thepiratebay, but still). It usually takes just a month or two for some interested law enforcement agency to find out who you are (forget about all the privacy protection methods you know; it is very easy to slip and eventually you will slip; you need to be lucky every time, they need to be lucky just once). Depending on what country you live in, you will end up with a fine of a few thousand dollars (best case) or tens/hundreds of thousands + a small jail sentence + a criminal record. Not bad for a hobby torrent site with no income and a minimal community, is it?

Just to reiterate this one more time - it is VERY easy for any interested government agency [in your country] to find out who you are. Most of those 1-2 months will be spent waiting for replies to their letters and other similar bureaucracy, not on some CSI stuff.

Here is how it might go. First they will check the site domain and IP address and ask your hosting/domain providers to reveal your payment details (name/address on credit card). That's enough in 95% of all cases. Even if you use Cloudflare to hide your site IP and even if you use bitcoins to pay for everything, that will help only for some time; eventually Cloudflare will reveal your server IP and the bitcoin transactions will be tracked back to your credit card. That's just a question of time and whether you are "wanted" enough or not.
If despite all this you still decide to go forward - don't forget to import all the available torrent data on the internet (there are dozens of dumps here and there); that way you will have a MUCH larger database than by DHT scraping alone. This is a good place to start: https://archive.org/details/torrent_metadata_archive_sample

Creators of such sites (including me) tend to focus too much on the number of torrents, no matter if they are active or dead. Regular users are interested mainly in active torrents. Plus they want to see the current number of seeders/leechers, which is very difficult to keep up-to-date for a large database. Plus they want to see a torrent creation/upload date, which you cannot get from DHT (you can record the day you found a torrent, but it will work only for newer torrents, not for historical ones).

Of course, you can just provide code for users to run on their own computers, but don't expect that anyone will really use it (maybe just a few people here and there, I really mean it). Everyone who is really hardcore enough to run something on their computers to obtain torrents will just use Jackett (https://github.com/Jackett/Jackett). It can search through a huge number of torrents, which no local DHT scraping/search engine can provide.

I mean, it already exists. [1] Always fun to see what my neighbors behind the same NAT download.

I thought streaming had spelled the demise of torrenting, so was v surprised to check out my local hood (quiet suburban, skewing older, few ppl <40yrs) and see how it's very much alive and well. I guess I was naive. (Not condemning anyone, as I did my share of peering back in the day, but streaming services are convenient and honestly don't cost that much all things considered. Torrents (tho I prefer IRC) are good for finding obscure stuff that's unavailable anywhere else, but that's not what my neighbors are downloading; and it's not like any of them can't afford to pay.) Interesting dynamic.
This showed up in my news feed, and I wondered if it was a coincidence that I just learned about that site from your comment and then the next day I see they're getting spammed by idiots abusing DMCA: https://torrentfreak.com/i-know-what-you-download-overwhelme...

Checked it for my Transmission host's static IP, which hasn't downloaded anything for a while, and it has some completely random stuff I never downloaded or searched for. It even shows items from today which for sure I haven't requested. Could it be some DHT spoofing, or is my Transmission relaying some DHT requests?

It gives quite fascinating results for me. Surprisingly, among dozens of torrents only a couple are porn. Most of my neighbors using BitTorrent are Russians and Ukrainians (I'm in Cyprus), judging by file names and seeder distribution, plus a couple of English-speaking people (probably Brits) and a single person seeding content in Greek. At least half of the Russian-speaking neighbors seed IT, engineering and English learning materials. And I seed more than everyone else on the same subnet together. Wish I could message them.

Ah, haven’t checked that one for a while. Surprised to find a tech-minded neighbor who has downloaded kali-linux-2022.3-live-everything-amd64.iso. And no pr0n this time round.

You'd better have a good firewall since you have such a tech-minded neighbor. )

Having seen other indexers before, I suspect your implementation isn't spec-compliant or well-behaved (perhaps spoofing node IDs? causing more traffic than necessary?). If you want to build an indexer you should write a normal implementation and then use http://bittorrent.org/beps/bep_0051.html

Related comment: You should absolutely open source this. Working with the DHT is a lot of fun, and a great learning exercise. I adapted my local “torrent roulette” application to an Electron app that can be shared. My local version downloads the files, but the one I share only fetches the torrent's metadata (easy to adapt it back to my roulette approach).
I call it Taboo: https://github.com/retrohacker/taboo The amount of “evil” on the DHT is pretty low. I’ve run mine for a long time, and very rarely get anything evil. I suspect it’s because of how poorly BitTorrent plays with privacy tools like VPN and Tor; IIUC it’s easy to leak identifying information with BT and it’s high bandwidth. Nearly every “evil” file I’ve found is either a honeypot with not-evil content or password encrypted (maybe to remove the plausible deniability of a random download?). I don’t know if the encrypted files actually contain evil content; I don’t bother trying to crack them and promptly gshred them.

What you will find:

* a lot of content illegal under US copyright law
* a lot of porn (also illegal under US copyright law)
* a metric tonne of fascinating content from other cultures you’d otherwise not be exposed to (also probably illegal under US copyright law)

There is very little “legal” content on the DHT, but most of it is falling on the wrong side of intellectual property law. Sometimes password/credit card dumps. I once found some very sketchy schematics of Eastern European military equipment. But that stuff is also really rare in my experience.

If you do play with these systems, I’ll leave the same warning I left on the Taboo repo:

> Note: I AM NOT A LAWYER! To my knowledge, there aren't any other systems doing this that you can run on your laptop. I suspect the nuance of how Taboo works isn't going to be appreciated by your local law enforcement. If you don't want to test the legality of Taboo in court, I'd strongly recommend either: running a VPN (less safe) or not using Taboo (most safe). If you want to use Taboo with a VPN, may I suggest putting some money in an envelope and sending it to Mullvad?

Adding this on after reading your comments elsewhere in the threads: If you're worried about Intellectual Property enforcers using this for evil, I wouldn't worry too much about it. BTDigg already exists.
Not that what you're doing isn't novel or exciting (great work on this BTW) - but DHT indexing is an art that's already being practiced. Cat is out of the bag.

I like how clearly you distinguish "evil" and "not legal" :)

If I may suggest, it would be helpful to have a section in the README about how to build/run the application. If I'm not familiar with Electron, the repository isn't very helpful, even though I might be interested in trying it out.

I managed to build and run it, and filed an issue for a problem I ran into. Hope it's valid, thanks for making this app!

Hey, thanks! Let me know what you think of it after using it. It was a fun project to work on.

Although I managed to run the application, it didn't catch any torrents. The announcements and traffic stats in the console showed 0 all the time. To be fair I ran it for 30 minutes max. Do I need to add trackers in some config file or so?

I would recommend writing down the worst and best case scenarios that could happen with your software. Then, if you find that the negatives outweigh the positives, whether through severity or quantity, don't release it.

Please do. This old-ish timer is kind of blown away by the idea that you shouldn't. Nothing evil about being a modern archivist/librarian, despite what big companies would tell you.

I imagine any bad actors who store IPs of torrent seeders have done so a long time ago already, so your software will not do any harm that hasn't been done already. Go for it and open-source it.

Would you mind explaining why you chose ZincSearch? Curious why you picked it over some of the other non-ElasticSearch/OpenSearch alternatives (Meilisearch, Typesense).

I'll add to the chorus of people saying "yes, release it". If you're worried about blowback as a result of "evil" uses / users, is there a way to release it (somewhat) anonymously, so it's difficult to trace back to you?

Which evil usages are you concerned about? I think it would be very useful for the public.
copyright holder bots

DHT has existed for 17 years - the cat's out of the bag. The anti-piracy companies have built their own crawlers. They match swarm IPs and then notify ISPs, who voluntarily hassle you on their behalf.

An IP doesn't equal a person in the USA.

And outside of the US? In Germany there's an "efficient" business of law firms acting on behalf of the copyright holders, subpoenaing the ISP to get the owner of the IP/connection and sending them a cease-and-desist with a hefty fee. Then there's the next part, law firms specializing in "Here's the letter you have to send to say you agree to pay them without admitting guilt, that'll be € please!". I think the demand is usually around 1000 Euro, so it's way cheaper to just pay for a VPN service, configure qBittorrent to only use your VPN interface, and torrent whatever the hell you want.

Is it basically btdig.com? Btdig is awesome. If this is like an open source version in Go then that is a huge contribution.

Isn't btdig also open source? I see a "fork me" button on the front page and it leads to a repo: https://github.com/btdigg-org/dhtcrawler2 It is 9 years old and in Erlang. An advantage here would be people self-hosting their own based on this project.

torrent-paradise [0] is a Go project which seems to do the same as your project and has existed since 2019. It’s since gone down but remains up on IPFS [1], though its index hasn’t updated since January. [0] https://github.com/urbanguacamole/torrent-paradise [1] https://cloudflare-ipfs.com/ipfs/QmQjsKamNFZRvCMXDvZXQmRYjsm...

> remains up on IPFS [1] but its index hasn’t updated since January.

But only the "about" page remains up, right? Because both https://cloudflare-ipfs.com/ipns/torrent-paradise.ml and https://cloudflare-ipfs.com/ipns/12D3KooWB3GY1u6zMLqnf3MJ8zh... show a command-line(!) error. If my mental model of IPFS is correct, it means no one is seeding the underlying IPFS ... CIDs? ...
that those links point to.

Consider that this is already available as open source, in Go even.

Please release it. We are in dire need of good torrent search without ad-trackers or other nasty stuff.

That would be very useful... Just release the code and build instructions.

Magnetissimo did this too, and he's still around.

At the very least, please do a write-up for us on how you're achieving 5-10k announcements a second.

I was planning to start learning Go, so I'd definitely be interested to learn from your project :)

You should def release it, what else can you do with it? Just let it rot and fade away?

Yes please. You are not responsible for anything "evil" users of the software might do. This is not even about enabling bad stuff, this is just the natural progression of technology.

Nuclear and biological weapons are also part of the "natural progression of technology", but it's widely agreed that they shouldn't be released to anybody and everybody, and the people distributing those technologies have a moral responsibility. If you think your software would be of more use to "evil" than not, then don't release it widely.

Yeah, it is everyone's moral responsibility to consider the effects of our actions as far as we're able to.

This is a good example of our hubris as developers. We like to think our project will have some impact on the world when in reality you’re extremely lucky if anybody notices, much less cares. ;)

I think you should. From my understanding, use of DHT is already dead in the eyes of most torrenters.

it's a tool.
What people use it for is their business.

yeah release it please do!

>Now, I am thinking about releasing it as open-source for others to study, but not sure if I should, because it might be used for "evil".

For evil? I wouldn't worry about that. Not now. You should put prominent warnings it's not "consumer grade" or whatever, but I think there's more value in sharing your code than there is risk someone will perform an attack they otherwise couldn't.

Conversely, the second amendment wasn't just meant to apply to guns -- in America, "arms" can absolutely mean "cyber".

Now, to be fair... the NRA is basically a way for boomers old enough to get "reoccurring income" to write off donations to the GRU on their taxes at this point, but there was a time in this country[0] when people who had a recurring subscription were sent a video called "Stop! Don't touch!!" (or something to that effect), which was meant to be their first lesson on guns -- and make no mistake, when you "cyber", you're reaching into the toolbox.

There was a period in the 2000s where it was EXTREMELY difficult to get some of these tools up and running, and then, in parallel, you could also experience hardware or driver issues, and people quite rightfully used to scare the everloving shit out of the type of person who would bring esoteric knowledge to light too quickly. Even if you knew exactly which commands to run, in which order, you had to deal with stuff like the fact the drivers for Airport literally won't go into monitor mode, which is when you tell the radio in your laptop to store not just the packets addressed to it, but any packets that happen to... drift on by.

I really cannot emphasize enough how hilarious it was to me that it took until about twenty goddamn twenty[-1] for some folks to realize the main benefit to "cyber" is that it's remote -- since I was about twelve years old, I've run into people who do...
very rude things with the computer, paired with wielding the fact that in the united states, possession of what is now being called "CSAM" was what's called a "strict liability" offense -- pair that with gatekeeping access to the title "security researcher" and it led to a very uneven playing field. (They'd also do things like say oh, gee, you can't get a security clearance if you engage in software piracy, while also telling folks it's probably better to just torrent stuff than use some shady torrent site... and mentioning they have a security clearance in another thread or whatever.) Now, if this program was, say, a script that spiders through a hard drive using the Luhn algorithm[1] to suss out if the drive contains PII or automates bringing down one's wireless interface, changing the MAC address, then bringing it back up... that might be something you might not want to give to your enemies by putting onto Github or whatever. Myself, I usually still stick to the Pirate Bay -- I've still got a backlog of stuff I haven't watched... it feels like just a short time ago I was getting all emo I had no one to watch "Cats"[2] with, ha-HA!! But this? I think you're good to go, and I thank you for taking the time to learn the version control system and share your code. If anything, you'll be aiding public health. Streaming services are abusive... I own very few movies since I've moved around so much -- it became a running gag with my exes -- but it's like back when cable briefly wasn't going to have commercials, then within pretty much one generation they added them right back in addition to collecting the subscription money[3]... those sorts of people should be shown that video from the 80s of the one KGB defector explaining that America is unique, and there is nowhere else to defect to. (He was right, and if you abuse your access folks might be unwelcoming.) --
[-1] I purposefully waited about two years to make that joke... at least two times. [0] (I'm posting from my home) [1] I was told it's used to verify something is a CC # but apparently it can get some false positives -- that wasn't mentioned last time I looked it up, hehe: https://en.wikipedia.org/wiki/Luhn_algorithm [2] Did you people forget that they called it Redphone because it was supposed to REDUCE tensions? [3] https://web.archive.org/web/20150501092025/nytimes.com/1981/...
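Since footnote [1] mentions the Luhn algorithm and its false positives, here is a minimal sketch of the check in Python. It only validates the checksum, so any digit string that happens to sum correctly passes, which is why it flags things that aren't card numbers. The function name `luhn_valid` is just for illustration.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    # Keep ASCII digits only, so spaces and dashes are ignored.
    digits = [int(ch) for ch in number if "0" <= ch <= "9"]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0


if __name__ == "__main__":
    print(luhn_valid("4111 1111 1111 1111"))  # common test card number
    print(luhn_valid("79927398713"))          # example from the Wikipedia article
    print(luhn_valid("0000 0000 0000 0000"))  # passes too: a false positive
```

The all-zeros string passing is the kind of false positive the footnote refers to: the checksum says nothing about whether a number was ever issued.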