coder . cl

spam detection, phase 1

published: 22-04-2011 / updated: 22-04-2011
posted in: development, programming, projects, sysadmin, tips
by Daniel Molina Wegener

It is very tedious to see your mailboxes fill up with spam. If you are a webmaster or system administrator, you will notice that the web site you administer is routinely scraped by bots looking for certain URLs and email addresses. That is probably the largest source of spam on the internet. You know what spam is, and this experiment applies only to spam sent by email.

Spam is the use of electronic messaging systems (including most broadcast media, digital delivery systems) to send unsolicited bulk messages indiscriminately. While the most widely recognized form of spam is e-mail spam, the term is applied to similar abuses in other media: instant messaging spam, Usenet newsgroup spam, Web search engine spam, spam in blogs, wiki spam, online classified ads spam, mobile phone messaging spam, Internet forum spam, junk fax transmissions, social networking spam, television advertising and file sharing network spam.

This experiment requires a fake email account, a well-known internet domain (one that is constantly scraped for email addresses) and the proper software development tools, something flexible like Linux rather than closed platforms like Windows. If your domain is a target of spam, that is, of someone who will “send the same message indiscriminately to (large numbers of recipients) on the Internet”, this experiment will probably help with your common spam problems and let you build the proper filters for your spam filter daemon or service.

Start by creating the fake account, for example spam-booby@example.org. This email address will be scraped by the internet bots that harvest email addresses. Set up a mail client to fetch and review that mailbox, which will be heavily hit by those spam bots. On your well-known domain site, for example www.example.org, put the fake email address that will be scraped in a hidden HTML anchor element:

<a href='mailto:spam-booby@example.org'
   style='display:none;'
   title='just fall here fu**ing bot'>
  spam-booby@example.org
</a>
<a href='mailto:spam-booby@example.org'
   style='visibility:hidden;'
   title='just fall here fu**ing bot'>
  spam-booby@example.org
</a>

Those mail links will not be displayed in your customers' web browsers, so you will be safe, but many bots are not smart enough to process the visibility: hidden or display: none CSS properties. If the bots are smart enough to process those CSS properties, just use a JavaScript trick to hide the links, for example with jQuery:

<a href='mailto:spam-booby@example.org'
   class='please-hide-me'
   title='just fall here fu**ing bot'>
  spam-booby@example.org
</a>

<script type='text/javascript'><!--//
$(document).ready(function () {
    $('a.please-hide-me').hide();
});
//--></script>

That will ensure the email links are hidden from your customers using a real browser. Wait one week, and you will be collecting spam in the fake spam-booby@example.org account. Also, I expect that spam bot authors are such poor software developers that they will not take RFC 2606, which reserves the example domains, into account, and will use the email addresses in this article in their spamming tasks.
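Before waiting a week, it may be worth checking that the trap address can really be read from the raw markup, which is all that a CSS-unaware scraper looks at. The following is only a minimal sketch using the Python standard library; the names MailtoCollector, harvestable_addresses and PAGE are made up for this illustration, and real bots may well work differently (many probably just run a regular expression over the page).

```python
from html.parser import HTMLParser


class MailtoCollector(HTMLParser):
    """Collect mailto: targets from raw HTML, ignoring CSS and scripts."""

    def __init__(self):
        super().__init__()
        self.addresses = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().startswith("mailto:"):
                # Drop the scheme and any ?subject=... query part.
                address = value[len("mailto:"):].split("?", 1)[0]
                self.addresses.append(address)


def harvestable_addresses(html):
    """Return the addresses a CSS-unaware scraper could read from the markup."""
    collector = MailtoCollector()
    collector.feed(html)
    collector.close()
    return collector.addresses


# Hypothetical page fragment, same shape as the hidden anchor shown earlier.
PAGE = """
<a href='mailto:spam-booby@example.org'
   style='display:none;'
   title='trap'>
  spam-booby@example.org
</a>
"""

print(harvestable_addresses(PAGE))
```

The parser never evaluates the style attribute, so the hidden address is reported just like a visible one, which is exactly the behaviour the trap relies on.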

This is the first phase. In the second phase, I will try to process the data collected from the captured emails. I will begin this experiment with a certain domain, and I hope that you will enjoy the results. For the second phase I’ve chosen Python as the primary language, along with some tools to fetch the email from the fake account, some Bayesian filter classifiers and a key/value database to speed up the mail processing functions. I started collecting data today :)
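The second phase is not written yet, so the following is only a rough idea of what its first step might look like, not the actual implementation. It assumes that every message landing in the trap account is spam, parses one raw message with the standard email package and counts its tokens in a Counter, which here stands in for the key/value database. The names tokens_from_message, update_spam_counts and SAMPLE are hypothetical, and the tokenizer is deliberately naive.

```python
import re
from collections import Counter
from email import message_from_string
from email.policy import default

# Naive tokenizer: runs of at least three letters, digits, $, ' or -.
TOKEN_RE = re.compile(r"[a-z0-9$'-]{3,}")


def tokens_from_message(raw_message):
    """Lower-case word tokens from the subject and plain-text body."""
    msg = message_from_string(raw_message, policy=default)
    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""
    text = "{} {}".format(msg.get("Subject", ""), body)
    return TOKEN_RE.findall(text.lower())


def update_spam_counts(counts, raw_message):
    """Add one trapped message to the per-token spam counts."""
    # Count each token once per message (one common convention).
    counts.update(set(tokens_from_message(raw_message)))
    return counts


# Made-up message, shaped like something the trap account could receive.
SAMPLE = (
    "From: someone@example.com\n"
    "To: spam-booby@example.org\n"
    "Subject: Cheap watches\n"
    "\n"
    "Buy cheap watches now\n"
)

spam_counts = Counter()  # stand-in for the key/value database
update_spam_counts(spam_counts, SAMPLE)
print(spam_counts["cheap"], spam_counts["watches"])
```

A Bayesian classifier would also need token counts from legitimate mail to compare against; the trap account only supplies the spam side.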
