2006.06.06 NANOG-NOTES network-level spam behaviour

Wed Jun 7 11:18:43 UTC 2006

2006.06.06 Nick Feamster, Network-level spam behaviour
[slides are at:
http://www.nanog.org/mtg-0606/pdf/nick-feamster.pdf]

Spam
unsolicited commercial email
feb 2005, 90% of all email is spam
common filtering techniques are
 content based
 DNS blacklist queries are significant fraction
  of DNS traffic today.  (DNSbls)
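
For readers unfamiliar with DNSBLs, here is a minimal
sketch of how a lookup name is formed: the IPv4 octets
are reversed and the blacklist zone is appended, and an
A-record answer for that name means "listed".  The zone
"dnsbl.example.org" and the function name are
placeholders, not anything from the talk, and no DNS
query is actually sent.

```python
import ipaddress


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name a mail server would look up to check `ip`
    against the DNSBL `zone` (hypothetical helper, IPv4 only)."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


# "dnsbl.example.org" is a placeholder zone, not a real blacklist.
print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
# prints: 99.2.0.192.dnsbl.example.org
```

One such query per blacklist per incoming connection is
why these lookups add up to a lot of DNS traffic.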

Using IP address based spam black lists isn't so
useful.
How spammers evade blacklists will be discussed
as well.

Problems with content-based filters
...uh oh, some technical glitches...

Content-based properties are malleable
low cost of evasion
 altering content based on scripts is too easy
customized emails are easy to generate
 content based filters need fuzzy hashes over
  content, etc.
high cost to filter maintainers
 as content changes, filters need to be updated.
 constantly tweaking SpamAssassin rules is a pain.

false positives are always an issue.

Content-based filters are applied at the destination
 too little, too late -- wasted network bandwidth,
  storage, etc.; many users receive and store the
  same spam content.

Network level spam filtering is robust (hypothesis)
network-level properties are more fixed
 hosting or upstream ISP (AS number)
 botnet membership
 location in the network
 IP address block
 country?

are there common ISPs that host the spammers, for
example?
Avoid receiving mail from machines that are part
of botnets.

Challenge--which properties are most useful for
 distinguishing spam traffic from legitimate email?

very little if anything is known about these
characteristics yet!

Randy gave a lightning talk last NANOG about some
of this.

Some properties listed.

Spamming techniques
mostly botnets, of course
other techniques too
we're trying to quantify this
 coordination
 characteristics
how we're doing this
 correlations with Bobax victims
  from georgia tech botnet sinkhole
other possibilities: heuristics
 distance of client IP from the MX record
 coordinated, low-bandwidth sending

looked at pcaps coming in from hijacked command
and control station from bots trying to talk to
it; spamming bots, Bobax drone botnet, exclusively
used to send spam.

Collection
two domains instrumented with MailAvenger (both on
 the same network)
 sinkhole domain 1
  continuous spam collection since aug 2004
  no real email addresses--sink everything
  10 million + pieces of spam
 sinkhole domain #2
  recently registered Nov 2005
  "clean control" domain posted at a few places
  not much spam yet--perhaps being too conservative
  contact page with random email contact, look at
   who crawls, and then who spams the unique email
   addresses

Monitoring BGP route advertisements from same network

Also capturing traceroutes, DNSBL results, passive
TCP host fingerprinting, simultaneous with spam arrival
(results in this talk focus on BGP + spam only)

MailAvenger is not an MTA; it sits in front of the
MTA (it forks to sendmail or postfix) and does things
like DNSBL lookups, adding headers, and passive OS
fingerprinting as the spam is arriving.
Also logged BGP routes from same network that got
the spam; see connectivity to the spamming machine
at the time.

Picture of collection up at MIT network.

Mail Collection: MailAvenger
X-Avenger header.
best guess at operating system, p0f, DNSBL
lookups, traceroutes back to mail relay at the
time the mail was sent (used for debugging BGP)

distribution across IP space
plot /24 prefix vs how much spam coming from it.
steeper lines mean more spam from that part
of the IP space; you can see where spam is
coming from.  bunch comes from apnic, cable
modem space, etc.
few interesting things to note; still redoing
legitimate mail characteristics.
from georgia tech mail machines, it's legit plus
spam, need to split out better.
between 90.* and 180.*, legitimate mail mainly.

Is IP-based blacklisting enough?
Probably not: more than half of spamming client IPs
appear less than twice.

Roughly 50% of the IPs showed up less than twice;
but that's a single sinkhole domain, would help
more across multiple domains.

emphasizes need to collaborate across multiple
domains to build blacklists; any one domain
won't see repeated patterns of IPs.
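
The "seen only once" measurement, and why pooling logs
across domains helps, can be illustrated with a small
sketch.  The helper name and the addresses are
hypothetical; real input would be the client IPs logged
at each sinkhole.

```python
from collections import Counter


def fraction_seen_once(client_ips):
    """Fraction of distinct client IPs that appear exactly once in the log."""
    counts = Counter(client_ips)
    if not counts:
        return 0.0
    once = sum(1 for n in counts.values() if n == 1)
    return once / len(counts)


# Made-up logs from two domains; 198.51.100.2 hits both.
domain_a = ["198.51.100.1", "198.51.100.2", "198.51.100.3"]
domain_b = ["198.51.100.2", "198.51.100.4"]

print(fraction_seen_once(domain_a))             # prints: 1.0
print(fraction_seen_once(domain_a + domain_b))  # prints: 0.75
```

In the toy data every IP looks like a one-off to domain
A alone, but the pooled log exposes the repeat sender.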

Distribution across ASes
40% of spam coming from the US

BGP spectrum agility
Log IP addresses of SMTP relays
Join with BGP route advertisements seen at network
where spam trap is co-located.
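
A minimal sketch of that join, under stated assumptions:
BGP updates have already been reduced to
(prefix, announce time, withdraw time) epochs, spam
arrivals are (relay IP, timestamp) pairs, and "short-
lived" is taken as at most 600 seconds.  The Epoch
type, function names, threshold, and sample data are
all illustrative, not the authors' code.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class Epoch:
    """One interval during which a prefix was announced (hypothetical type)."""
    prefix: ipaddress.IPv4Network
    announced: float  # seconds since some reference time
    withdrawn: float


def active_route(ip, received_at, epochs):
    """Most specific epoch covering `ip` that was live at `received_at`."""
    addr = ipaddress.IPv4Address(ip)
    live = [e for e in epochs
            if addr in e.prefix and e.announced <= received_at < e.withdrawn]
    return max(live, key=lambda e: e.prefix.prefixlen, default=None)


def spam_in_short_epochs(spam_log, epochs, max_duration=600):
    """Spam arrivals whose covering route lasted at most `max_duration` s."""
    hits = []
    for ip, ts in spam_log:
        route = active_route(ip, ts, epochs)
        if route and route.withdrawn - route.announced <= max_duration:
            hits.append((ip, str(route.prefix)))
    return hits


# Made-up data: a /8 announced for 500 s, and a long-lived /8.
epochs = [
    Epoch(ipaddress.ip_network("61.0.0.0/8"), 1000, 1500),
    Epoch(ipaddress.ip_network("12.0.0.0/8"), 0, 1_000_000),
]
spam_log = [("61.5.6.7", 1200), ("12.1.1.1", 1200), ("61.9.9.9", 5000)]
print(spam_in_short_epochs(spam_log, epochs))
# prints: [('61.5.6.7', '61.0.0.0/8')]
```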

A small club of persistent players appears to be using
this technique
61.0.0.0/8 AS4678
66.0.0.0/8 AS21562
82.0.0.0/8 AS8717
somewhere between 1% and 10% of all spam (some clearly
intentional, others might be flapping)

about 10 minute announcement time of the /8 while
spam is flooded out.
Might be interesting to couple this with route
hijacking alerting to filter out if this is
really a hijacking vs a flapping legitimate route.

A slightly different pattern;
announce-spam-withdraw on a minute-by-minute basis.
really really egregious!

Why such big prefixes?
flexibility: client IPs can be scattered throughout
 dark space within a large /8
  same sender usually returns with different IP
   addresses
visibility: route typically won't be filtered (nice
 and short prefix length)

Characteristics of IP-agile senders
IP addresses are widely distributed across the /8 space
IP addresses typically appear only once at the sinkhole
Depending on which /8, 60-80% of these IP addresses
 were not reachable by traceroute when we spot-checked
some IP addresses were in allocated, albeit unannounced
 space
Some AS paths associated with the routes contained
 reserved AS numbers

Odd AS numbers injected, usually well-known to make
it look more legitimate.

Length of short-lived BGP epochs
10% of spam coming from short-lived BGP events

Spam from Botnets
Example: Bobax
 approximate size: 100k bots

one sinkhole domain--this is ONLY stuff that is
verifiable as coming from bots via command and
control hijacked IPs, intersect the single sinkhole
domain, so much smaller data subset, but well
correlated and verified.

Proportionally less spam from bots in 61-90
range; that tends to be where BGP route hijacks
happen instead.

Most Bot IP addresses do not return
65% of bots only send mail to a domain once over
 18 months.
Some hang around for a *long* time.
About 20% stick around for several months.

collaborative spam filtering seems to be helping
track bot IP addresses.

Most bots send low volumes of spam
most bot IP addresses send very little spam regardless
of how long they have been spamming

Effectiveness of blacklisting:
only about half of the IPs spamming from short-lived
BGP are listed in any blacklist
spam from IP-agile senders tends to be listed in fewer
blacklists

Looking at 8 different spam blacklists, checking when
the spam arrives at the sinkhole.

Known Bobax drones listed in more DNSbls than the
BGP agile senders.
About 90-95% of the Bobax bot drones are listed
in one or more DNSBLs.

Suggests some of the spamming bots are listed more
than other techniques--that is, bots are easier to
identify than BGP-agile spammers or spammers using
other techniques.

Harvesting
tracking web-based harvesting
 register domain, set up MX record
 post, link to page with randomly generated email addresses
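
One way the unique-address trick might be implemented,
as a sketch only: hand every page fetch a fresh random
address and remember who fetched it, so later spam to
that address can be tied back to the crawl.  The class
name, the "trap.example.org" domain, and the sample IP
and timestamp are all hypothetical.

```python
import secrets


class HoneypotAddressBook:
    """Issue one-off email addresses and remember which crawl saw each."""

    def __init__(self, domain):
        self.domain = domain.lower()
        self._issued = {}  # address -> (crawler_ip, crawl_time)

    def issue(self, crawler_ip, crawl_time):
        """Return a fresh random address to embed in the page being served."""
        address = f"{secrets.token_hex(8)}@{self.domain}"
        self._issued[address] = (crawler_ip, crawl_time)
        return address

    def lookup(self, rcpt_to):
        """Map a spam recipient address back to the crawl that harvested it."""
        return self._issued.get(rcpt_to.lower())


# "trap.example.org" is a placeholder domain.
book = HoneypotAddressBook("trap.example.org")
addr = book.issue("203.0.113.7", "2006-01-16T04:00:00Z")
print(book.lookup(addr))  # the (crawler IP, crawl time) pair recorded above
```

Comparing the recorded crawler IP with the IP that
later delivers mail to the address gives the
crawler-versus-sender comparison.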

Example Phish:
 a flood of email for a phishing attack for paypal.com
 all to: addresses harvested in a single crawl on
  January 16th 2006
 emails received from IPs different from those who
  crawl.
 X-mailer headers totally different.

Lessons for better spam filters:
effective spam filtering requires a better notion of
 end-host identity
distribution of spamming IP addresses is highly
 skewed
detection based on network-wide, aggregate behaviour
 may be more fruitful than focusing on individual IPs
 large, emergent properties.

two critical pieces of the puzzle
 botnet detection
 securing the internet's routing infrastructure

compare distributions of spam to legitimate mail,
see if certain spaces are more likely to send spam
than legitimate mail.

Questions:
Q: Steve Bellovin, columbia university
bots from strange ASes, is tunnelling taking
place from bots to BGP speakers?
A: Not sure if there's evidence or not; some data
from  TORS??
but TORS latency may be too high.

Q: Fingerprinting to try to identify who is doing
things, see how many hosts are actually doing
this?
Many addresses being used, how many hosts
 does it actually represent?
A: Not sure, haven't checked that.
Haven't checked on aliasing, since not much
was seen from a single IP.
NAT'ing?
What about hosts hopping? (same host using multiple
IPs?)
Not sure, they didn't do that correlation.

Q: Randy Bush, IIJ, they did do OS fingerprinting,
so some of that is in the paper.
didn't do anything with the traceroutes, though.

Q: Matt asks what the difference between the two
domains was; was one of them a recognizable word
or name, or were they both random character strings?
A: they were both random character strings, but one
of them had been used to host a real website for a
while, which might explain why it gets such a huge
volume of spam compared to the other.
Q: Matt points out that for some networks, receiving
spam is actually a good thing, as it helps balance
out traffic ratios, which helps during peering
negotiations.

Q: Randy Bush, IIJ, responding to Matt about traffic
ratios: only those backbones who are on ADSL should
care which way traffic goes.  :P

Curious to work with large networks, see if filters
could be installed to detect it, and possibly take
action.