Rspamd – Fast, free and open-source spam filtering system

64 points by cebka 11 years ago · 25 comments

pjsg 11 years ago

To me, the most important things about anti-spam systems are the false positive / false negative numbers.

This information ought to be top and center, and it isn't.

cebkaOP 11 years ago

This information is merely interesting for some academic researches on some very specific messages corpus. But in the real world, I cannot efficiently evaluate the accuracy because it depends on zillions of parameters. Moreover, since rspamd uses not only statistics but also a number of other sources, such as DNS lists, SPF, DKIM, hash databases and so on, it is practically impossible to give a definite figure for its precision.
- pbowyer 11 years ago
  
  > This information is merely interesting for some academic researches on some very specific messages corpus.
  No. It gives a very rough guide to "how much trouble is this spam filter going to be?". If you can say that X000 users have found only Y% of their email was misclassified, and I can compare against other spam filters, that's really useful.
  So yes, too many variables for one to be accurate, but good enough to gauge average-performance across a tribe of users.
  - cebkaOP 11 years ago
    
    Several years ago I compared rspamd with SA and Kaspersky antispam on a random stream of users' messages, and I got almost the same rate of false positives and false negatives for all three products. However, over the years spammers have been getting much smarter (image spam, valid DKIM, valid SPF and other clever tricks).
    Regarding statistics, rspamd uses an OSBF-Bayes classifier with 5-gram input (so it is not naive Bayes). I used the following academic paper as a reference: http://osbf-lua.luaforge.net/papers/osbf-eddc.pdf. This algorithm is also used by the crm114 spam classifier. However, the Bayes classifier is a very small part of rspamd (unlike dspam, for example) and it could be almost useless if you have, let's say, 50 million user accounts. Rspamd is targeted at systems of this scale.
- pjsg 11 years ago
  
  If I am to invest the time and energy to switch my current anti-spam solution over to this, then I want to have some level of assurance that it will be more effective than my current scheme.
  I agree that spam is a moving target and that is why anti-spam systems need constant updating. My current system (over the last 30 days) rejected 87% (around 45k emails) and accepted 13%. Of that 13% (6600) around 300 were classified as spam by the Bayesian classifier in Thunderbird. Around 80 were manually classified as spam and added to Thunderbird's rules. The Thunderbird classifier probably classified 2 ham messages as spam. I don't know of any ham->spam errors in the initial filtering phase.
  Should rspamd be expected to do better, about the same, or worse?
  - cebkaOP 11 years ago
    
    From what you are saying, I conclude that you are giving a very high score to the statistical classifier (or relying solely on statistics). This is not an option for a system with millions of users (their accept/reject rate is close to 70/30 percent, as we cannot rely on Bayes at all). Therefore, I have never evaluated Bayes as a standalone classifier. Nevertheless, I'm using OSB-Bayes as the statistical algorithm for rspamd, and it has been proven to be a good classifier.
  - zobzu 11 years ago
    
    For similar systems (i.e. small but doing good manual classification when all else fails) I suspect that if more people used razor (or, again, something similar) we'd achieve better results (razor allows for sharing this data automatically).
- hobarrera 11 years ago
  
  > This information is merely interesting for some academic researches
  It most definitely is not. It's the most important factor when choosing a spam filter.
  False positives are extremely harmful (they can result in loss of communication, which is what you want to avoid the most). A significant number of false positives is what makes the difference between useful and useless.
  Nobody wants to tell their users "check your spam mailbox (the one with dozens of spam messages) for ham every once in a while".
  - EmanueleAina 11 years ago
    
    As I see it, unless you can guarantee that you give zero false positives (which, knowing how certain users compose their mail, is arguably impossible) you still have to do it.
    Also, I suppose that the false positive/negative rate can only be given for a well-defined corpus, and I'm not sure there is one that is a good representation of current and future spam trends, so in the end giving those numbers could be very misleading.
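
The OSB feature extraction that cebka mentions earlier in this thread can be illustrated with a short sketch. This is not rspamd's code. It assumes that "5-gram input" means a sliding window of five tokens in which the newest token is paired with each of the up to four tokens before it, and the function name `osb_features` is made up for the example.

```python
from typing import Iterator, List, Tuple


def osb_features(tokens: List[str], window: int = 5) -> Iterator[Tuple[str, str, int]]:
    """Yield orthogonal sparse bigrams as (earlier_token, later_token, gap).

    gap is the number of tokens skipped between the two members of the pair.
    The window size of 5 is an assumption taken from the "5-gram" remark.
    """
    for i, later in enumerate(tokens):
        # Pair the current token with each earlier token still inside the window.
        for distance in range(1, window):
            j = i - distance
            if j < 0:
                break
            yield (tokens[j], later, distance - 1)


if __name__ == "__main__":
    for feature in osb_features("buy cheap pills online now".split()):
        print(feature)
```

Each feature would then be hashed and counted per class by the Bayesian part. The point of the pairs is that they keep some word-order context that single-token (naive) features lose.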

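The 30-day figures pjsg quotes above can be turned into rough error rates. The sketch below uses only those approximate numbers and makes two assumptions that are not stated in the thread: the 2 misclassified ham messages are among the 300 flagged by Thunderbird, and everything rejected in the first stage really was spam.

```python
# Approximate counts taken from pjsg's comment (30 days of mail).
rejected = 45_000              # rejected by the initial filter
accepted = 6_600               # passed the initial filter
flagged_by_thunderbird = 300   # marked as spam by the Bayesian classifier
manually_marked_spam = 80      # spam that reached the inbox
ham_flagged_as_spam = 2        # probable false positives (assumed part of the 300)

# Spam that slipped past the initial filter.
spam_past_first_stage = (flagged_by_thunderbird - ham_flagged_as_spam) + manually_marked_spam
total_spam = rejected + spam_past_first_stage
total_ham = accepted - spam_past_first_stage

first_stage_fn_rate = spam_past_first_stage / total_spam
end_to_end_fn_rate = manually_marked_spam / total_spam
fp_rate = ham_flagged_as_spam / total_ham

print(f"first-stage false negatives: {first_stage_fn_rate:.3%}")
print(f"end-to-end false negatives:  {end_to_end_fn_rate:.3%}")
print(f"false positives on ham:      {fp_rate:.3%}")
```

Under those assumptions the whole pipeline misses well under 1% of spam and misfiles a few hundredths of a percent of ham, which is the baseline any replacement would have to match.
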
I have just been using OpenBSD's spamd and I haven't had any problems with spam. And I like to know I'm messing with the spammers too.

hobarrera 11 years ago

Indeed. I was getting a few dozen spam messages per week. Setting up spamd reduced that to less than a dozen per year. Regrettably, 2015 is the year this seems to be ending, since I've already gotten about 8-10 spam messages this year (with the majority being this last month).

fensipens 11 years ago

How does rspamd compare to mopher (http://mopher.org) in terms of functionality?

I see some interesting things like the surbl module but other than that this seems to be more like mimedefang (or that's at least what I've picked up from the landing page).

Also do you consider supporting multiple database drivers or will you stick with sqlite3?

cebkaOP 11 years ago

They are not related at all. From what I can see, mopher is a milter (so it might be compared with another project of mine called 'rmilter') and it can work with SpamAssassin. Hence, if something can work with SpamAssassin, it should be able to talk to rspamd to some extent.
Multiple database drivers are planned for rspamd 1.0 (along with personal statistics and an advanced rules planner). The tricky part is that rspamd currently uses a non-blocking model, which is poorly supported by database drivers (excluding redis and some others). However, rspamd has a concept of asynchronous threads executed in a thread pool, so something like a MySQL query could be executed within this thread pool with no delay for the processing of other filters.
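
The thread-pool idea described above can be sketched as follows. This is not rspamd's implementation: Python's `asyncio` stands in for the non-blocking event loop, an in-memory `sqlite3` query stands in for the blocking MySQL call, and the names `blocking_lookup`, `non_blocking_filter` and `scan` are hypothetical.

```python
import asyncio
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def blocking_lookup(sender: str) -> int:
    """Blocking database query (stand-in for a driver with no async support)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE reputation (sender TEXT PRIMARY KEY, score INTEGER)")
        conn.execute("INSERT INTO reputation VALUES (?, ?)", ("spammer@example.com", 10))
        row = conn.execute(
            "SELECT score FROM reputation WHERE sender = ?", (sender,)
        ).fetchone()
        return row[0] if row else 0
    finally:
        conn.close()


async def non_blocking_filter(name: str) -> str:
    """Placeholder for a filter that never blocks the event loop."""
    await asyncio.sleep(0)
    return name


async def scan(sender: str, pool: ThreadPoolExecutor):
    loop = asyncio.get_running_loop()
    # Hand the blocking call to the pool; the event loop stays free meanwhile.
    db_future = loop.run_in_executor(pool, blocking_lookup, sender)
    score, *filters = await asyncio.gather(
        db_future, non_blocking_filter("spf"), non_blocking_filter("dkim")
    )
    return score, filters


def main():
    with ThreadPoolExecutor(max_workers=4) as pool:
        return asyncio.run(scan("spammer@example.com", pool))


if __name__ == "__main__":
    print(main())
```

The other filters keep running on the event loop while the query waits in a worker thread, which is the "no delay for other filters" property the comment describes.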

k2enemy 11 years ago

I'd be really curious to hear feedback from anyone that has used this. I'm currently using dspam and still deal with a lot of spam hitting my inbox.

darklajid 11 years ago

I'm a happy dspam user (most stuff won't reach me), but looked into this one in the past.
Reasons for me not to give it a try:
- Rule based mostly (which I think of as 'SA')
- No db support, as far as I could tell. My dspam keeps everything in a postgresql db and I can easily backup/restore that with all my other stuff (dovecot/postfix virtual users, for example)
- ~Easy~ to integrate into anything. Look for 'how can I make dovecot-antispam integrate with dspam' and that's been done a thousand times (and works nicely). I haven't found a decent number of rspamd resources
That said: My whole post basically says that I didn't try it (for reasons that were important to me). Their site looks interesting and in the end I guess I'd love to hear about successful dspam->rspamd migrations as well.
- jwise0 11 years ago
  
  I was a very happy dspam user for about three days -- SpamAssassin's accuracy on my setup is terrible, and dspam performed pretty well (and increasingly so with training).
  Then, dspam started segfaulting, and none of my e-mail was delivered. I looked into what was going on, and it appeared that the dspam hash database had somehow become corrupted; and since dspam is completely unmaintained these days, it was unlikely that whatever bug I tripped upon would ever be fixed.
  Sigh. I also would like to hear user reports about rspamd! I am getting sick of the false negative rate that I'm getting from SpamAssassin.
- scott_karana 11 years ago
  
  > - No db support, as far as I could tell. My dspam keeps everything in a postgresql db and I can easily backup/restore that with all my other stuff (dovecot/postfix virtual users, for example)
  Rspamd appears to use sqlite3: https://rspamd.com/doc/workers/fuzzy_storage.html
  - darklajid 11 years ago
    
    Hey.
    Thanks, but.. That's not quite what I had in mind. For one, somedb-only (sqlite or anything else) is usually not enough. I would hesitate to introduce a system that just supports mysql when everything else is using postgresql for me, for example. And on top of that, this schema is .. limited. My dspam setup learns and can do that for each and every user (though system wide training seems to be the norm, as far as I can tell). This is really just a storage engine as far as I can tell and not really comparable.
    That said: I guess I would give rspamd a try if I saw a lot of positive reviews/reports. It's just that it certainly doesn't do the same thing as dspam. It's quite a different animal.

spacenerf 11 years ago

I've been using it on my own machine for a while and have used it for ingress/egress filtering at a small ISP (handling ~5k emails/day). Accuracy is reasonably good & it's very light. If there's something more particular you want to know ask away. ;)

vizzah 11 years ago

I use ASSP and, even without Bayesian filters (which I don't like because they produce too many false positives), I get only an occasional phishing e-mail in my numerous inboxes, even with e-mail addresses published on sites.

rennpferd 11 years ago

Works like a charm, my setup is postfix + dovecot + rmilter + rspamd.
- zobzu 11 years ago
  
  seems like the perfect setup du jour, should probably document it :)
  - rennpferd 11 years ago
    
    Yeah, should've documented it while I was setting it up...

OrNix 11 years ago

Easy to use, fast and flexible antispam.

feld 11 years ago

I have not yet had a chance to use this, but it's the only modern anti-spam software that I am remotely interested in deploying.
