Technical details

A brief explanation of how the tagged email addresses work follows. Details have been anonymized.

The tag encodes two 32-bit numbers: the Unix epoch time and the client's IPv4 address. These are concatenated, then encrypted using a standard library. The encryption is not required, but was added because the project was initially covert. The encrypted binary string is then encoded in base 32. The first implementation used a nearly standard base 64 encoding, but spammers fold case in nearly all instances, which corrupts a case-sensitive encoding. Hexadecimal would also survive case folding, but at 4 bits per character it is only 80% as efficient as base 32 at 5 bits per character. Since a smaller tag is desirable, a base 32 encoder was written and used.
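To make the packing concrete, here is a minimal sketch in Python. It is not the code actually used: the function names make_tag and read_tag are hypothetical, the address 192.0.2.1 is a documentation placeholder, and the encryption step is left out entirely because the cipher and key are not described here. It uses Python's standard base 32 alphabet, lowercased and with the padding stripped, which may differ from the custom encoder mentioned above.

```python
import base64
import ipaddress
import struct
from datetime import datetime, timezone


def make_tag(epoch_seconds, client_ip):
    """Pack a Unix timestamp and an IPv4 address into a case-insensitive tag."""
    ip_as_int = int(ipaddress.IPv4Address(client_ip))
    # Two unsigned 32-bit integers, big-endian: 8 bytes in total.
    payload = struct.pack(">II", epoch_seconds, ip_as_int)
    # ASSUMPTION: the encryption step would be applied to `payload` here.
    # It is omitted because the cipher and key are not specified.
    encoded = base64.b32encode(payload).decode("ascii")
    # Drop the '=' padding and lowercase; base 32 survives case folding.
    return encoded.rstrip("=").lower()


def read_tag(tag):
    """Recover (epoch_seconds, client_ip) from a tag, whatever its case."""
    # Restore the padding that make_tag stripped, then decode.
    padded = tag.upper() + "=" * (-len(tag) % 8)
    payload = base64.b32decode(padded)
    epoch_seconds, ip_as_int = struct.unpack(">II", payload)
    return epoch_seconds, str(ipaddress.IPv4Address(ip_as_int))


# Example: a page load at 2006-06-23 15:07:51 UTC from a placeholder address.
harvested_at = int(datetime(2006, 6, 23, 15, 7, 51, tzinfo=timezone.utc).timestamp())
tag = make_tag(harvested_at, "192.0.2.1")
print(tag, len(tag))
print(read_tag(tag.upper()))  # still decodes after case folding
```

Eight bytes come out as 13 base 32 characters, against 16 for hexadecimal, which is where the size advantage comes from.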

The resulting tagged addresses look (something) like this:


The 'aa' is an arbitrary prefix and encodes no data. I use a number of different prefixes in each page delivered, both to increase the number of tagged addresses in the spammers' database and to glean additional information on whether spammers use all the addresses they harvest, and in what order, if any.

The above email address, if used, will reveal exactly when and from where it was originally harvested; in this case, the IP address of the local machine, and the UTC time at which I loaded the page earlier today:
2006-06-23 15:07:51

By combining this with the logs from the MTA daemon, we have both harvest and spam data in the same record:

2006-06-23 15:07:51,aa,,2006-06-23 15:09:22,,mailhost.example.org,spam-from@example.org

(that's: time harvested, address prefix, harvest IP, time spammed, spammer IP, spammer HELO, spammer FROM envelope sender)
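As an illustration of what such a record allows, the sketch below splits the example line into named fields and computes the delay between harvest and spam. The field names and the parse_record helper are hypothetical labels chosen for this example, and the empty IP fields are kept empty, as in the anonymized record.

```python
import csv
from datetime import datetime

# Hypothetical field names, following the order listed in the text.
FIELDS = [
    "harvest_time", "prefix", "harvest_ip",
    "spam_time", "spammer_ip", "helo", "mail_from",
]
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_record(line):
    """Split one combined record and add the harvest-to-spam delay in seconds."""
    values = next(csv.reader([line]))
    record = dict(zip(FIELDS, values))
    harvested = datetime.strptime(record["harvest_time"], TIME_FORMAT)
    spammed = datetime.strptime(record["spam_time"], TIME_FORMAT)
    record["delay_seconds"] = int((spammed - harvested).total_seconds())
    return record


line = ("2006-06-23 15:07:51,aa,,2006-06-23 15:09:22,,"
        "mailhost.example.org,spam-from@example.org")
record = parse_record(line)
print(record["prefix"], record["helo"], record["delay_seconds"])
```

For the record shown, the delay between the two timestamps is 91 seconds.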

