Spam with no purpose?

Tue Apr 6 04:30:18 UTC 2004

On Sun, 4 Apr 2004, Michel Py wrote:

> Indeed; notice I did write "Bayesian-like" and not "Bayesian" and
> never mentioned anything about good ones or not-as-good ones.

Right, but if we're going to talk about Bayesian filtering in
general, there's little sense in constraining the discussion to
"not-as-good" Bayesian filters. The not-as-good filters are obviously
doomed to extinction unless they improve and become good ones.

> I understand this too. However, I think the point you are missing
> here is the difference between "what could be done" and "what
> people have". 

I don't see why that matters to a general discussion about the limits
of, or "attacks" against, Bayesian filters.

> penetration rate than messages that don't feature it. The proof is
> in the pudding. And as I said earlier, expect the "bunch of
> dictionary words" to mutate into a more sophisticated animal that
> includes correct grammar.

However, if we ignore "probe" mails, these emails will still have a
spam payload somewhere in them, otherwise they're not spam. That spam
payload should, in theory, stick out like a sore thumb in Bayesian
terms. The added text will eventually just cause the Bayesian filter
to score those phrases towards 0.5, i.e. no indicator either way,
and, once again, a good Bayesian filter will only consider phrases
that are good indicators of spam or non-spam. I.e. it drops all
phrases whose probability P falls in x <= P <= y, where x and y are
arbitrary thresholds (e.g. 0.1 and 0.9). If we add in the probe
emails, these will just help weight common text towards 0.5.

The problem at the moment is that *not enough* spammers are using the
extraneous-added-text attack for filters to class common text towards
0.5, and hence prune it from affecting the outcome via x <= P <= y.
As more spam starts to use this attack, the (half-decent) Bayesian
filters will become increasingly immune to it.
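
To illustrate the drift towards 0.5, here is a small Python sketch.
The per-token estimate used (spam frequency over spam plus non-spam
frequency) is an assumed, simplified one, and all the counts are
invented:

```python
def token_spam_prob(spam_with, spam_total, ham_with, ham_total):
    """Simplified per-token estimate: relative frequency in spam
    divided by the sum of the spam and non-spam frequencies."""
    spam_rate = spam_with / spam_total
    ham_rate = ham_with / ham_total
    if spam_rate + ham_rate == 0:
        return 0.5  # token never seen: no indication either way
    return spam_rate / (spam_rate + ham_rate)


def is_pruned(p, x=0.1, y=0.9):
    """True when P falls in x <= P <= y and is therefore ignored."""
    return x <= p <= y


# Hypothetical corpus: a common word appears in 400 of 1000 non-spam
# mails. Early on, only 10 of 1000 spams pad themselves with it;
# later, 400 of 1000 do.
early = token_spam_prob(10, 1000, 400, 1000)
later = token_spam_prob(400, 1000, 400, 1000)

print(early, is_pruned(early))  # strong non-spam indicator, still counted
print(later, is_pruned(later))  # neutral, pruned from the outcome
```

While few spams carry the word, it counts as evidence of non-spam and
the padding helps the spammer. Once enough spams carry it, it lands
inside [x, y] and stops counting at all.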

> What you and I do or could do (on a small scale) in terms of spam
> filtering is largely irrelevant.

I don't see why it is irrelevant. What you or I or others use today
for our spam filtering is potentially what you or I or others will
use tomorrow to protect joe-six-pack customers.

I give friends, family and some others email - what I find works well
for me, I eventually apply to their email too if I can. If I had to
protect customers from spam, I would try to apply the experience
gained from using filtering solutions in more personal situations
or, if I lacked direct experience, I would try to go by the
experience of others.

> have made tremendous progress in terms of filtering, it is equally
> true that the spammers have made tremendous progress in defeating
> our counter-measures, resulting in end-users getting unprecedented
> and still increasing amounts of spam.

Right.

> The measuring metric here is _not_ that we successfully filter 90%
> or 95% or 99.99% of spam; this is meaningless. The meaningful
> metric is: how many spams does joe-six-pack get a day.

If you pick "90%" or "95%", then you can indeed try to imply a 
percentage metric is meaningless. However, I'm pretty sure that those 
who receive email via services I admin are much happier that those 
services catch x% of spam than 0%.

> There is no difference between a) joe-six-pack getting 50 spams a
> day and us canceling 450 a day and b) joe-six-pack getting 50 spams
> a day and us canceling 9950 a day.

If you wish to compare 90% against 95%, yes. I wonder though if we're
anywhere near a 90% filter rate (at least not for any useful filtering
service that doesn't have a similarly large false-positive rate).
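
For reference, the filter rates implied by the two quoted cases can
be worked out with a few lines of Python (the function name is made
up here; the counts are the ones quoted above):

```python
def filter_rate(delivered, cancelled):
    """Fraction of all incoming spam that was cancelled."""
    return cancelled / (delivered + cancelled)


# a) 50 spams delivered, 450 cancelled
case_a = filter_rate(50, 450)
# b) 50 spams delivered, 9950 cancelled
case_b = filter_rate(50, 9950)

print(f"{case_a:.1%}")  # 90.0%
print(f"{case_b:.1%}")  # 99.5%
```

The delivered count is the same in both cases; only the filter rate
differs.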

> Michel.

regards,
-- 
Paul Jakma	paul at clubi.ie	paul at jakma.org	Key ID: 64A2FF6A
	warning: do not ever send email to spam at dishone.st
Fortune:
Cats are smarter than dogs.  You can't make eight cats pull a sled through
the snow.