Show HN: Torrenthistory – an open and legal torrent indexer

23 points by retonato 5 years ago · 30 comments

retonatoOP 5 years ago

There have been quite a lot of torrent indexers already, but most of them get shut down sooner or later. So I decided to create one more, simple enough (so it can run on its own on a relatively cheap VPS) and legal (so it would be much more difficult to shut it down). Most of the site data can be freely downloaded. Is there something you think any decent torrent indexer should have?

antman 5 years ago

Could the data be retained on something like IPFS so that cloning of the tracker could happen independent of its existence?
- retonatoOP 5 years ago
  
  I thought about some kind of distributed storage, but the technology is not there yet (=I cannot implement it easily, and users cannot use it easily). The site uses a regular PostgreSQL database. Most of the data, though, is published on the Internet Archive in a standard format (https://torrenthistory.org/docs/open-data/). There are torrents of it as well, which users can download and seed. I will upload new archives monthly.
- 177tcca 5 years ago
  
  Pin it now!
  - retonatoOP 5 years ago
    
    Sorry, I'm new to HN (as author). What do you mean by "pin"?

retonatoOP 5 years ago

Thanks again for all the feedback. I added all recent feature suggestions on GitHub (https://github.com/retonato/torrenthistory-public/issues). You can vote for them there (just give a thumbs up to the issue you like) and see the current priority here: https://torrenthistory.org/vote/

tal8d 5 years ago

It would be very nice if you could search based on file name, as you can with btdig.com. If you could somehow provide BRE without melting down, that would definitely put you way ahead of the other DHT index attempts.

retonatoOP 5 years ago

At the very beginning (of January) search was done even through filenames, but then:
1. It turned out that such a search requires quite a lot of disk space for the index.
2. It is not very useful without showing the actual matched filenames. That requires more code, which I didn't have time to write back then.
I added it here for tracking, sooner or later that should be implemented: https://github.com/retonato/torrenthistory-public/issues/1
- retonatoOP 5 years ago
  
  What is BRE? Search by regular expressions?
  - tal8d 5 years ago
    
    Yup, the most resource constrained form (PCRE2 would be the Cadillac-DoS-my-webapp option). I don't know how abnormal my problem is, but I've run into several cases where I needed a file that had a very specific naming structure - but a couple of substrings needed to be masked because that was the information I was looking for. That can easily be handled with two fixed string passes on a local file system, but not so much at the scale we're talking here.
    As far as the indexing issues you mentioned, I'm not a python guy, so I can't recommend a drop-in solution. But I have indexed pretty massive string datasets, and you definitely want to select an index method that was specifically devised for the key/value datatype you intend to ingest. So hashmaps are probably out :) It would also pay to adapt either the implementation or the data itself. For example, say you had a bunch of font file names you wanted to index (XyzSans.otf, AsdfSans.otf, AsdfSerif.otf, etc): a prefix tree would be a pretty good fit, especially if you reversed all the strings.
    https://en.wikipedia.org/wiki/Trie
    https://blog.burntsushi.net/transducers
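
The reversed-string prefix tree suggested above could look roughly like the following. This is only a minimal sketch: the `ReversedTrie` class and its method names are made up for illustration, and a plain dict-of-dicts is used instead of a memory-efficient structure such as the transducers from the second link.

```python
class ReversedTrie:
    """Prefix tree over reversed strings, so names with a common
    suffix (e.g. "Sans.otf") share the same path from the root."""

    def __init__(self):
        self.root = {}

    def insert(self, name):
        node = self.root
        for ch in reversed(name):
            node = node.setdefault(ch, {})
        node[None] = name  # terminal marker keeps the original string

    def ends_with(self, suffix):
        """Return all stored names that end with `suffix`, sorted."""
        node = self.root
        for ch in reversed(suffix):
            if ch not in node:
                return []
            node = node[ch]
        # Walk the subtree and collect every terminal below this node.
        found, stack = [], [node]
        while stack:
            current = stack.pop()
            for key, child in current.items():
                if key is None:
                    found.append(child)
                else:
                    stack.append(child)
        return sorted(found)


trie = ReversedTrie()
for name in ["XyzSans.otf", "AsdfSans.otf", "AsdfSerif.otf"]:
    trie.insert(name)

print(trie.ends_with("Sans.otf"))
```

Reversing the strings means the shared extension and family suffix are stored once near the root, which is why this layout suits file names better than indexing them left to right.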
    
    retonatoOP 5 years ago
    
    In principle it is possible, but there would be some limitations:
    1. PostgreSQL has supported indexes that speed up regex searches since v9.3 (the current version is 13): https://www.postgresql.org/docs/12/pgtrgm.html . However, they make searches faster only in some (simpler) cases; in the others we are back to a plain old full table scan.
    2. Speaking of full table scan, it is not that bad of an option, the average list of file paths for a torrent file is approximately 6 KB, that's 300 GB per 50 mln torrents, that's completely within reach of some VPS providers (like BuyVM). Still, up to a few minutes per one pass.
    3. So, unless I find some efficient technology for regex search in large volumes of text, I would be able to implement only "offline" search (submit query, receive link to search results in 5-30 minutes, depending on server load).
    4. Another option would be to provide the file listings for download as csv files (torrent_id, filepath) and let users search them from the command line (or with some text viewer/editor, though the most popular ones still load the whole contents into RAM). The compressed size would be around 1 KB per torrent file, which is about 50 GB per 50 mln torrents.
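
As a rough illustration of option 4, a downloaded listing could be searched with a regex row by row, so the whole file never has to fit in RAM. This is a sketch under assumptions: the two-column (torrent_id, filepath) layout is taken from the comment above, while the `search_listing` function and the sample rows are hypothetical.

```python
import csv
import io
import re


def search_listing(lines, pattern):
    """Yield (torrent_id, filepath) rows whose filepath matches `pattern`.

    `lines` is any iterable of CSV text lines (an open file works),
    so rows are processed one at a time instead of loaded at once.
    """
    regex = re.compile(pattern)
    for row in csv.reader(lines):
        if len(row) != 2:
            continue  # skip malformed rows
        torrent_id, filepath = row
        if regex.search(filepath):
            yield torrent_id, filepath


# Hypothetical sample data standing in for a downloaded listing.
sample = io.StringIO(
    "1,fonts/XyzSans.otf\n"
    "1,fonts/readme.txt\n"
    "2,AsdfSerif.otf\n"
)
matches = list(search_listing(sample, r"S(ans|erif)\.otf$"))
print(matches)
```

For a real compressed listing, the same function should work on a file opened in text mode, for example with `gzip.open(path, "rt", newline="")`. It is still a full scan, so the time per pass grows with the size of the listing.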
    
    tal8d 5 years ago
    
    Yeah, it isn't something that is feasible if you can't tailor the infrastructure to the underlying data. An SQL backend is about as generic as it gets, hence the poor results. Enterprise DB deployments get away with it because they can justify throwing a lot more hardware behind a centralized generic data store.
    Give that second link a closer look if you change your mind. It demonstrates a way of indexing 1.6 billion 80B strings in 24GB, and then returning the result of a fixed string search in 100ms. That is the reward you get when you venture outside of the LAMP stack: less resource consumption, greater performance, increased utility.
    
    retonatoOP 5 years ago
    
    Thank you, I will

bartvk 5 years ago

What is this useful for? Genuine question.

retonatoOP 5 years ago

1. Checking whether some torrent exists at all (by searching for words in its name).
2. Searching for the same files in other torrents: when you click on a file size on a torrent page, it shows all other torrents that contain files of the same size. It is a pretty good method of searching for larger files.
3. Serving as an open dataset that contains a lot of torrent metadata, which you can use in your torrent-related projects.
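
The same-size lookup described above can be sketched as a simple inverted index from file size to torrents. This is not the site's actual code: the function names and the sample data are hypothetical, and the real site keeps its data in PostgreSQL rather than in memory.

```python
from collections import defaultdict


def build_size_index(torrents):
    """Map file size in bytes -> set of torrent ids containing such a file."""
    index = defaultdict(set)
    for torrent_id, files in torrents.items():
        for _path, size in files:
            index[size].add(torrent_id)
    return index


def torrents_sharing_size(index, size, exclude=None):
    """Return sorted ids of torrents with a file of exactly `size` bytes."""
    return sorted(index.get(size, set()) - {exclude})


# Hypothetical metadata: torrent id -> list of (path, size in bytes).
torrents = {
    "a": [("movie.mkv", 1_500_000_123), ("info.nfo", 512)],
    "b": [("renamed.mkv", 1_500_000_123)],
    "c": [("other.nfo", 512), ("clip.mp4", 42_000_000)],
}
index = build_size_index(torrents)
print(torrents_sharing_size(index, 1_500_000_123, exclude="a"))
```

Small files collide on size all the time (both `.nfo` files above are 512 bytes), which is why the method is described as good mainly for larger files.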
- retonatoOP 5 years ago
  
  Also, just to have a torrent indexer that will be there forever. Other large indexers had 40-50 mln torrents each, and when they were shut down, all that information disappeared almost overnight. In my case, the cost of running the site is very small, and legal threats shouldn't be a problem either (come on, it is really not possible to download anything from there). And even if something happens to me, others can continue where I left off: since almost all site data is open, anyone can create a similar site in a month or so.

baldgeek 5 years ago

not sure what definition of "legal" they are using. I clicked Catalog, then 70, then 80, and found lots of "interesting" titles

mikewhy 5 years ago

> This site and all information on it is legal. There are no:
>
> - files
> - links to files
> - torrent files / info hashes
> - magnet links
>
> You cannot find them anywhere - neither on the site, nor in its source code, nor in the database backups, which are published here.
>
> You cannot download anything from here (really). There are just folder names, file names and their sizes (+some other metadata).
(HN's formatting is seriously bad)
- retonatoOP 5 years ago
  
  That's right, you can see torrent names and file names, and you can search through them, but you cannot download anything. There are neither torrent files nor infohashes there.

thr0aweh 5 years ago

Can files be more specifically identified - more than just filename+size? I.e. hash(es)/checksum(s) or something like that?

implying 5 years ago

Torrent seed files only contain hashes for "pieces" (typically 1-4MB each), which can cross file boundaries. There's no way to pull an md5 for a specific file without actually snatching it.
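
To make the "pieces can cross file boundaries" point concrete, here is a small sketch of v1-style piece hashing: the files are treated as one concatenated byte stream, cut into fixed-size pieces, and each piece gets a SHA-1 digest. The `piece_hashes` helper and the tiny piece length are for illustration only.

```python
import hashlib


def piece_hashes(files, piece_length):
    """SHA-1 digests of fixed-size pieces cut from the concatenated files."""
    stream = b"".join(files)
    return [
        hashlib.sha1(stream[i:i + piece_length]).hexdigest()
        for i in range(0, len(stream), piece_length)
    ]


file_a = b"A" * 6
file_b = b"B" * 6
# With 4-byte pieces the stream splits into AAAA | AABB | BBBB,
# so the middle piece mixes the end of file_a with the start of file_b.
hashes = piece_hashes([file_a, file_b], piece_length=4)
print(len(hashes))
```

Because the middle digest depends on both files, the piece hashes alone do not give a checksum for either file on its own.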
- m-p-3 5 years ago
  
  BitTorrent v2 also includes a per file hash tree https://blog.libtorrent.org/2020/09/bittorrent-v2/
  so I assume that if you end up with the same file in multiple torrents, you'll be able to grab the files from seeds of these other torrents.
  - retonatoOP 5 years ago
    
    It would be interesting to see if that standard replaces BitTorrent v1. For now it is not nearly as popular (everyone just uses v1, and I don't know of any site that has a significant number of v2 torrents).
- toomuchtodo 5 years ago
  
  Pull the files, derive the hashes, and then throw away the files? You only need to retrieve each file once to derive and store the per file hashes.
  - retonatoOP 5 years ago
    
    I thought about that; it is technically possible. More than that, it is relatively cheap (incoming traffic on most VPSes is free). Still, I decided that searching by file size is much more cost-effective. It is almost as good as searching by hash for larger files. It requires little to no code and time to implement. It is available even for torrents without seeds. It doesn't involve downloading files (which, in some people's opinion, is illegal as such, even if you don't actually see the downloaded files).

elastolin 5 years ago

Why does hosting a magnet link make a torrent index 'illegal'? After all, you're not hosting any actual content on your server.

retonatoOP 5 years ago

That's a very good question. I think that the debate in filesharing/linking court cases has moved from the "link vs torrent vs file" issue to the "does this site/person make it possible to bypass copyright protection, and, if so, to what extent?" issue.
Part of the torrent community still holds to the old (and comfortable, I must say) idea that files and links (no matter in which form, http or magnet) are something fundamentally different. Files can be illegal, links cannot. Links are just text, infohashes are just numbers and so on.
The reality, as I see it, is different:
1. If something makes bypassing copyright protection as easy as making 2 mouse clicks and waiting a few minutes, that "something" is illegal, and the person who operates it also commits a crime. Indirectly, by facilitation (=providing aid), but still.
2. Even if the law in this area is too old or too general to specifically mention such cases, court practice is still moving little by little in that direction (=files and links don't matter, those are technical details; bypassing copyright does matter).
- hehetrthrthrjn 5 years ago
  
  So even showing the infohash would be illegal? That's all you really need.
  - retonatoOP 5 years ago
    
    In my opinion, even showing the infohash would be illegal. The same goes for anything that provides enough information for downloading with a generic torrent client.
- elastolin 5 years ago
  
  That makes total sense. A more pragmatic and less technical interpretation. I don't condone it, but it seems that's the way we're headed.
  It does seem like a slippery slope, though. Someone could argue that your site facilitates bypassing copyright for merely letting users know a particular torrent even exists. They would simply look it up by name in a torrent index that does share magnet links.
  Thanks for the insight, and good luck with the website!
