Awk.Info

"Cause a little auk awk
goes a long way."

About awk.info
 »  table of contents
 »  featured topics
 »  page tags


About Awk
 »  advocacy
 »  learning
 »  history
 »  Wikipedia entry
 »  mascot
Implementations
 »  Awk (rarely used)
 »  Nawk (the-one-true, old)
 »  Gawk (widely used)
 »  Mawk
 »  Xgawk (gawk + xml + ...)
 »  Spawk (SQL + awk)
 »  Jawk (Awk in Java JVM)
 »  QTawk (extensions to gawk)
 »  Runawk (a runtime tool)
 »  platform support
Coding
 »  one-liners
 »  ten-liners
 »  tips
 »  the Awk 100
Community
 »  read our blog
 »  read/write the awk wiki
 »  discussion news group

Libraries
 »  Gawk
 »  Xgawk
 »  the Lawker library
Online doc
 »  reference card
 »  cheat sheet
 »  manual pages
 »  FAQ

Reading
 »  articles
 »  books:

WHAT'S NEW?

Mar 01: Michael Sanders demos an X-windows GUI for AWK.

Mar 01: Awk100#24: A. Lahm and E. de Rinaldis' patent search, in AWK

Feb 28: Tim Menzies asks this community to write an AWK cookbook.

Feb 28: Arnold Robbins announces a new debugger for GAWK.

Feb 28: Awk100#23: Premysl Janouch offers an IRC bot, in AWK

Feb 28: Updated: the AWK FAQ

Feb 28: Tim Menzies offers a tiny content management system, in Awk.

Jan 31: Comment system added to awk.info. For example, see the discussion at the bottom of ?keys2awk

Jan 31: Martin Cohen shows that Gawk can handle massively long strings (300 million characters).

Jan 31: The AWK FAQ is being updated. For comments, corrections, or extensions, please mail tim@menzies.us

Jan 31: Martin Cohen finds Awk on the Android platform.

Jan 31: Aleksey Cheusov released a new version of runawk.

Jan 31: Hirofumi Saito contributes a candidate Awk mascot.

Jan 31: Michael Sanders shows how to quickly build an AWK GUI for windows.

Jan 31: Hyung-Hwan Chung offers QSE, an embeddable Awk Interpreter.

[More ...]


categories: Mail,Apr,2009,Admin

Awk and Mail

These pages focus on using Awk to implement filters on Unix mail files.


categories: Mail,Apr,2009,StevenH

Shell Statistical Spam Filter and Whitelist

Contents

About

Author

Steven Hauser.

Origin

http://www.tc.umn.edu/~hause011/article/Statistical_spam_filter.html

A Client-Side Unix Shell/AWK Filter with a Self-Updating Email Address "Whitelist"

I now use a "Statistical Spam Filter". Wow, the scummy sewer of internet mail is cleansed, refreshed, and usable again. Just using the delete button was getting too difficult: I got 8 to 10 spam messages for every good piece of mail. As a spam detector I am not as good a filter as you might think; just the subject and address are not always enough, and I would occasionally open a spam to my great annoyance.

My interpretation of Paul Graham's Spam Article

My filter was inspired by Paul Graham's article about a Naive Bayesian spam filter, "A Plan for Spam". He basically says that you gather statistics on how often tokens show up in two bodies of mail (spam and good) and then calculate a statistical value that a single mail is spam by looking at the tokens in it. The more mail in the good and spam mail bodies, the better the filter is "trained". Jeez, he made it sound so easy. And it is. I slapped an anti-spam tool together as a ksh and awk script for use as a personal filter on a Unix-type system. To implement it I put it in the ~/.forward file. The code is at the bottom of the article, less than 100 lines; the total code for the filter and training script is less than 200 lines.

This filter differs in lots of ways from Paul Graham's article. I took out some of the biases he describes and simplified it; maybe it is too simple. What I find most interesting is that the differences do not seem to matter much: I still filter out 96+% of spam. I got those results with a spam sample of at least 500 emails and a good email sample of at least 700 emails. With smaller training samples or a different mail mix it may not get as good results, or it may be better. Note: I later changed the training body to be closer to the proportion of real spam to good mail, which is much more spam than good mail (about 8-10 spam to every good mail received), and the anti-spam tool worked better.

How the Spam Filter Works in Unix

First I run the training script on two bodies of mail, ~/Mail/received (good mail) and ~/Mail/junk (saved spam mail). The ~/Mail/received file is already created on my Unix box and holds mail that I have read and not deleted. The training script finds all the tokens in the emails and gives each a probability depending on how often it is found in the "spam mail" and the "good mail". The training script also creates the whitelist of addresses from the "good mail." As mail flows through the system, the training script "learns" each time it is run.

I run the actual spam filter script from the .forward file, which allows a user to process mail before it hits the inbox. (Look up "man forward" at the shell prompt for further information on the .forward file.) The script first checks the whitelist for a good address; if one is found, the mail passes the filter. If the address is not found, the mail is passed to the statistical spam filter: the tokens are checked and the email is given a spaminess value. Above a certain value the email is classified as spam and put in the ~/Mail/junk file; below that value it passes to /var/spool/mail/mylogin, where I read it as god intended email to be read, with a creaky old Unix client. However, I can still read it with any other client I want, POP or IMAP.
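
For reference, the whole hookup is a single piped command in ~/.forward; the line below (with an illustrative path) is the same one shown in the Code section later in the article:

"| /Your/path/to/bin/spamfilter"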

Testing the Spam Filter

I included a little test script below that I used to check my results. I just split emails into files and run them one at a time and check the value the filter gives.

Testing on email that has been used to train the filter will give results that are very good but not valid, so I tested on email not seen by the training script. The filter gets much better at filtering as the training sample gets bigger, just like the other statistical spam filters. For example, at lower sample sizes (trained with 209 good mails and 301 spam mails) the filter was pretty bad: when the average spam value cutoff was raised to .51 so that no good mail was blocked, 44% of the spam email passed through on a set of 320 spam and 683 good emails. Even so, that means 56% of the spam was blocked. Small sample sizes are not perfect, but they are usable, and I began using the mail filter with a sample set of about 600 good mails and 300 spam. As the training sample increased, the results improved. As I changed the mail mix to reflect the real spam proportions it got even better, with around 96-98% of the spam blocked. I think the lower early results were because of the proportions of spam to good email in the training body; they should reflect the real proportion received on the system the filter runs on.

Paul Graham or others may have superior filters and better mathematics for anti-spam algorithms, but I am not sure that it matters all that much; the amount of spam that gets through is small enough not to bother me.

Filter Performance

I used gawk in the filter and checked it with the gawk profiler to look for performance problems. The largest performance constraint is creating the spam-probability associative array in memory, the key-value pairs of tokens and the spam values I assign to them. Creating this associative array is more than 95% of the current time to process an email through the filter, and it gets worse as the set of tokens gets larger. Perl and other language users can get around this performance problem with DBM file interfaces, currently not available to my gawk filter.
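
As an aside, here is a minimal sketch of one way to run the gawk profiler, assuming a gawk recent enough to support the --profile option (older releases shipped a separate pgawk binary for this). The file names are illustrative, since the article's filter embeds its awk program inside a ksh script:

  # Run the filter body under the profiler; when it finishes, gawk writes
  # an execution-count listing of the program to awkprof.out.
  gawk --profile -f spamfilter.awk < message.txt
  less awkprof.out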

White List Filter Improves Performance and Cuts Errors

I added a "whitelist" of good email addresses, a feature that helps keep good email from a bad classification and improves performance by a huge amount (at least a factor of 100) by not having to run the message through the rest of the filter. The whitelist is not one of the "challenge-response" things that annoy me so much that I toss any such email away; it simply learns from the email used to train the filter, saving the addresses of email that has passed the filter and landed in my "received" file. I figure that if I receive a good email from someone, chances are 100% that I want to receive email from that address. Note there is a place in the whitelist script to get rid of commonly forged email addresses, like your own address.

Why Differences With Bayes Filters Do Not Matter

The main concept put forward by Paul Graham holds true and seems ungodly robust: applying statistics to filter spam works very well compared to lame rule sets and blacklists. My program just proves the robustness of the solution; apparently any half-baked formula (like the one I used) works as long as the base probability of the tokens is computed.
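
As a worked example of that base probability (the spaminess function in the training script below), take some illustrative numbers: a token seen 30 times in a spam body of 10,000 tokens and 5 times in a good body of 20,000 tokens gets a spam probability of (30/10000) / ((5/20000) + (30/10000)) = 0.003 / 0.00325 ≈ 0.92, well inside the "interesting" range.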

Here are some of the many differences between this filter and the filter in the Paul Graham article in no particular order of importance:

  • I do not lower-case the tokens; one result is that the token-frequency threshold for inclusion in the spam-probability associative array is three instead of five. I think that "case" is an important token characteristic.
  • "Number of mails" is replaced with "number of tokens." My explanation is that I am looking at a token frequency in an interval of a stream of tokens. It seems simpler to think of it that way, instead of number of mails. And when I tried "number of mails" I got the same result values on the messages for the formula I used.
  • "Interesting tokens" were tokens in the message with a spam statistic greater than 0.95 or less than 0.05. Easy to implement. I did not figure out the fifteen most interesting tokens, the limit used by Paul Graham. As a result, most of my mail has more than 15 interesting tokens and a few have fewer, which could be a weakness, but it does not seem to matter too much.
  • Paul Graham's Naive Bayesian formula goes to 0 or 1 pretty quickly, which is fine; I tried it out in awk too. But now I just sum the "interesting token" probabilities and divide by the number of "interesting tokens" per message (see the sketch after this list). Yes, it is just an average of the probability of "interesting tokens"; it is easy to implement and it spreads the values over a 0-1 interval, spam towards 1 and good mail towards 0. I did this to get some spam filtering in place as soon as possible. Even with a small sample of mail I was able to adjust the average-probability cutoff up to keep all the good mail and still get rid of a good proportion of spam. As I acquired more sample mail the filter caught more spam and I adjusted the cutoff down.
  • I have a "training" program that generates the token probabilities and the address "whitelist", to be run as a batch job at intervals (like once a day or week), and a separate filter program run out of ".forward".
  • I did not double the frequency value of the "good tokens" to bias them in the base spam probability calculation of each token.
  • Tokens not seen by the filter before are ignored. Paul Graham gives them a 0.4 probability of spaminess. Most other ways of scoring unknown tokens would end up being ignored by my formula anyway, since such tokens would have a probability outside the "interesting token" ranges.
  • I noticed that Paul Graham ignores HTML comments. When I looked at some of the spam I found out why: some spammers load the recipient address and common words into HTML comments spread through the text to get past rule filters. The statistical spam filter seems to find them anyway, so I include tags, comments, everything.

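The scoring step just described, summing the probabilities of the "interesting" tokens and dividing by their count, boils down to a few lines of awk. This is only a distilled sketch of what the spamly function in the Code section does (the spamprobability file name, which the real script reads from ~/Mail, and the 0.06/0.94/0.50 thresholds come from that script); it is for illustration, not a drop-in replacement:

  # prob[] maps each trained token to its spam probability; load it from
  # the spamprobability file produced by the training script.
  BEGIN {
    while ((getline line < "spamprobability") > 0) {
      split(line, f); prob[f[1]] = f[2]
    }
    close("spamprobability")
  }
  # Keep only "interesting" tokens and accumulate their probabilities.
  {
    for (i = 1; i <= NF; i++)
      if (($i in prob) && (prob[$i] <= 0.06 || prob[$i] >= 0.94)) {
        total += prob[$i]
        count++
      }
  }
  # The score is just the average; above 0.50 the message is called spam.
  END {
    if (count > 0 && total / count > 0.50)
      print "spam"
    else
      print "good"
  }

Run it as, say, nawk -f score.awk < message (score.awk being whatever you name the sketch) to see how a single mail scores.
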
Code

Test SpamFilter

Note: Do not test mail that has been used to train the filter; test mail not seen by the training program.

#!/bin/ksh
filter_test () {
  # Split a file of unix email into many mail files with this
  # (the loop below expects the resulting good* files under test/):
  cat ~/Mail/rece* |csplit -k -f good -n 4 - '/^From /' {900}

  # Run a modified filter that displays the spam value for each mail file.
  # I just commented out the last part of the filter and added a 
  # print statement of the Subject line and spam value the filter found.
  for I in test/good*
  do
     cat $I | [filter_program-that_shows_the_value_only]
  done | sort -n 
}

Train SpamFilter.

Call from the command line or in a crontab file.
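For instance, a weekly crontab entry along these lines would retrain the filter every Sunday at 3 a.m. (the installed path and script name are illustrative):

0 3 * * 0 /Your/home/bin/train_spamfilter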

#!/bin/ksh
number_of_tokens (){
  zcat $1 | cat $2 - | wc -w
}

 # Note: Get rid of addresses that are commonly forged at the
 #       "My-Own-Address" string.
address_white_list (){
  zcat $1 | 
  cat $2 - | 
  egrep '^From |^Return-Path: ' | 
  nawk '{print tolower($2)}'| 
  nawk '{gsub ("<",""); gsub (">","");print;}'| 
  grep -v 'My-Own-Address'| 
  sort -u > ~/Mail/address_whitelist
}

 # Create a hash with probability of spaminess per token.
 #       Words only in good hash get .01, words only in spam hash get .99
spaminess () {
nawk 'BEGIN {goodnum=ENVIRON["GOODNUM"]; junknum=ENVIRON["JUNKNUM"];}
       FILENAME ~ "spamwordfrequency" {bad_hash[$1]=$2}
       FILENAME ~ "goodwordfrequency" {good_hash[$1]=$2}

    END    {
    for (word in good_hash) {
        if (word in bad_hash) { print word, 
            (bad_hash[word]/junknum)/ \
            ((good_hash[word]/goodnum)+(bad_hash[word]/junknum)) }
        else { print word, "0.01"}
    }
    for (word in bad_hash) {
        if (word in good_hash) { done="already"}
        else { print word, "0.99"}
    }}' ~/Mail/spamwordfrequency ~/Mail/goodwordfrequency 

}

 # Print list of word frequencies
frequency (){
  nawk ' { for (i = 1; i <= NF; i++)
        freq[$i]++ }
    END    {
    for (word in freq){
        if (freq[word] > 2) {
          printf "%s\t%d\n", word, freq[word];
        }
    } 
  }'
}
 # Note: I store the email in compressed files to keep my storage space small,
 #       so I have the gzipped mail that I run through the filter training 
 #       script as well as current uncompressed "good" and spam files.
 #       
prepare_data () {
  export JUNKNUM=$(number_of_tokens '/Your/home/Mail/*junk*.gz' '/Your/home/Mail/junk')
  export GOODNUM=$(number_of_tokens '/Your/home/Mail/*received*.gz' '/Your/home/Mail/received')
  address_white_list '/Your/home/Mail/*received*.gz' '/Your/home/Mail/received'

  echo $JUNKNUM $GOODNUM

  zcat ~/Mail/*junk*.gz | cat ~/Mail/junk - |
    frequency|
    sort -nr -k 2,2 > ~/Mail/spamwordfrequency
  zcat ~/Mail/*received*.gz | cat ~/Mail/received - |
    frequency|
    sort -nr -k 2,2 > ~/Mail/goodwordfrequency

  spaminess| 
    sort -nr -k 2,2 > ~/Mail/spamprobability
  # Clean up files
  rm ~/Mail/spamwordfrequency ~/Mail/goodwordfrequency 
}

 #########
 # Main

prepare_data
exit

Spamfilter using statistical filtering.

Inspired by the Paul Graham article "A Plan for Spam" www.paulgraham.com

Implement in the .forward file like so:

"| /Your/path/to/bin/spamfilter"

If the mail is spam it is put in a spam file; otherwise it is put in the good mail file.

#!/bin/ksh
spamly () {
/usr/bin/nawk '

   { message[k++]=$0; }

   END { if (k==0) {exit;} # empty message or was in the whitelist.

         good_mail_file="/usr/spool/mail/your_user";
         spam_mail_file="/Your/home/Mail/junk";
         spam_probability_file="/Your/home/Mail/spamprobability";
         total_tokens=0.01;

         while (getline < spam_probability_file)
            bad_hash[$1]=$2; close(spam_probability_file);

         for (line in message){ 
           token_number=split(message[line],tokens);
           for (i = 1; i <= token_number; i++){   # split() fills tokens[1..token_number]
             if (tokens[i] in bad_hash) { 
               if (bad_hash[tokens[i]] <= 0.06 || bad_hash[tokens[i]] >= 0.94){
                  total_tokens+=1;
                  spamtotal+=bad_hash[tokens[i]];
                }
              }
            }
         }

         if (spamtotal/total_tokens > 0.50) { 
            for (j = 0; j < k; j++){ print message[j] >> spam_mail_file}   # message indices run 0..k-1
            print "\n\n" >> spam_mail_file;
         }
         else {
            for (j = 0; j < k; j++){ print message[j] >> good_mail_file}
            print "\n\n" >> good_mail_file;
         }
   }'
}

 # Check whitelist for good address. 
 # if in whitelist then put in good_mail_file
 #   else Pass message through filter.
whitelister () {
  /usr/bin/nawk '
      BEGIN { whitelist_file="/Your/home/Mail/address_whitelist";
              good_mail_file="/usr/spool/mail/your_user";
              found="no";
              while (getline < whitelist_file)
              whitelist[$1]="address"; close(whitelist_file);
      }
      { message[k++]=$0;}
      /^From / {sender=tolower($2); 
            gsub ("<","",sender);
            gsub (">","",sender); 
            if (whitelist[sender]) { found="yes";}
      }
      /^Return-Path: / {sender=tolower($2); 
            gsub ("<","",sender);
            gsub (">","",sender); 
            if (whitelist[sender]) { found="yes";}
      }
      END { if (found=="yes") { 
               for (j = 0; j < k; j++){ print message[j] >> good_mail_file}   # message indices run 0..k-1
               print "\n\n" >> good_mail_file;
            }
            else {
               for (j = 0; j < k; j++){ print message[j];}
            }
      }'
}

 #####################################
 # Main
 # The mail is first checked by the white list, if it is not found in the
 # white list it is piped to the spam filter.
whitelister | spamly 
exit

categories: Mail,Apr,2009,ArnoldR

Mail Sort

Contents

Author

Arnold Robbins

Download

Download from LAWKER.

Description

Sorts a Unix style mailbox by "thread", in date+subject order.

This is a script I use quite a lot. It requires gawk, although with some work it could be ported to standard awk. The timezone offset from GMT has to be adjusted to one's local offset, although I could probably eliminate that if I wanted to work on it hard enough.

This took me a while to write and get right, but it's been working flawlessly for a few years now. The script uses the Message-ID header to detect and remove duplicates. It requires GNU Awk for the time/date functions and for an efficiency hack in string concatenation, but could be made to run on a POSIX awk with some work.
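
A typical invocation, assuming the script below is saved as mailsort.awk (the mailbox name is illustrative):

gawk -f mailsort.awk mybox

The sorted messages are written to a file named OUTPUT in the current directory (see the END rule), and any duplicate Message-IDs are reported on standard error.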

Code

Main

BEGIN {
       TRUE = 1
       FALSE = 0

       split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", months, " ")
       for (i in months)
               Month[months[i]] = i    # map name to number

       MonthDays[1] = 31
       MonthDays[2] = 28       # not used
       MonthDays[3] = 31
       MonthDays[4] = 30
       MonthDays[5] = 31
       MonthDays[6] = 30
       MonthDays[7] = 31
       MonthDays[8] = 31
       MonthDays[9] = 30
       MonthDays[10] = 31
       MonthDays[11] = 30
       MonthDays[12] = 31

       In_header = FALSE
       Body = ""

       LocalOffset = 2 # We are two hours ahead of GMT

       # These keep --lint happier
       Debug = 0
       MessageNum = 0
       Duplicates = FALSE
}

/^From / {
       In_header = TRUE
       if (MessageNum)
               Text[MessageNum] = Body
       MessageNum++
       Body = ""
 # print MessageNum
}

In_header && /^Date: / {
       Date[MessageNum] = compute_date($0)
}

In_header && /^Subject: / {
       Subject[MessageNum] = canonacalize_subject($0)
}

In_header && /^Message-[Ii][Dd]: / {
       if (NF == 1) {
               getline junk
               $0 = $0 RT junk # Preserve original input text!
       }

       # Note: Do not use $0 directly; it's needed as the Body text
       # later on.

       line = tolower($0)
       split(line, linefields)

       message_id = linefields[2]
       Mesg_ID[MessageNum] = message_id        # needed for disambiguating message
       if (message_id in Message_IDs) {
               printf("Message %d is duplicate of %s (%s)\n",
                       MessageNum, Message_IDs[message_id],
                       message_id) > "/dev/stderr"
               Message_IDs[message_id] = (Message_IDs[message_id] ", " MessageNum)
               Duplicates++
       } else {
               Message_IDs[message_id] = MessageNum ""
       }
}


In_header && /^$/ {
       In_header = FALSE
       # map subject and date to index into text

       if (Debug && (Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]) in SubjectDateId) {
               printf(\
       ("Message %d: Subject <%s> Date <%s> Message-ID <%s> already in" \
       " SubjectDateId (Message %d, s: <%s>, d <%s> i <%s>)!\n"),
               MessageNum, Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum],
               SubjectDateId[Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]],
               Subject[SubjectDateId[Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]]],
               Date[SubjectDateId[Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]]],
               Mesg_ID[SubjectDateId[Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]]]) \
                       > "/dev/stderr"
       }

       SubjectDateId[Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]] = MessageNum

       if (Debug) {
               printf("\tMessage Num = %d, length(SubjectDateId) = %d\n",
                       MessageNum, length(SubjectDateId)) > "/dev/stderr"
               if (MessageNum != length(SubjectDateId) && ! Printed1) {
                       Printed1++
                       printf("---> Message %d <---\n", MessageNum) > "/dev/stderr"
               }
       }

       # build up mapping of subject to earliest date for that subject
       if (! (Subject[MessageNum] in FirstDates) ||
           FirstDates[Subject[MessageNum]] > Date[MessageNum])
               FirstDates[Subject[MessageNum]] = Date[MessageNum]
}

{
       Body = Body ($0 "\n")
}

END {
       Text[MessageNum] = Body # get last message

       if (Debug) {
               printf("length(SubjectDateId) = %d, length(Subject) = %d, length(Date) = %d\n",
                       length(SubjectDateId), length(Subject), length(Date))
               printf("length(FirstDates) = %d\n", length(FirstDates))
       }

       # Create new array to sort by thread. Subscript is
       # earliest date, subject, actual date
       for (i in SubjectDateId) {
               n = split(i, t, SUBSEP)
               if (n != 3) {
                       printf("yowsa! n != 3 (n == %d)\n", n) > "/dev/stderr"
                       exit 1
               }
               # now have subject, date, message-id in t
               # create index into Text
               Thread[FirstDates[t[1]], i] = SubjectDateId[i]
       }

       n = asorti(Thread, SortedThread)        # Shazzam!

       if (Debug) {
               printf("length(Thread) = %d, length(SortedThread) = %d\n",
                       length(Thread), length(SortedThread))
       }
       if (n != MessageNum && ! Duplicates) {
               printf("yowsa! n != MessageNum (n == %d, MessageNum == %d)\n",
                       n, MessageNum) > "/dev/stderr"
	#               exit 1
       }

       if (Debug) {
               for (i = 1; i <= n; i++)
                       printf("SortedThread[%d] = %s, Thread[SortedThread[%d]] = %d\n",
                               i, SortedThread[i], i, Thread[SortedThread[i]]) > "DUMP1"
               close("DUMP1")
               if (Debug ~ /exit/)
                       exit 0
       }

       for (i = 1; i <= MessageNum; i++) {
               if (Debug) {
                       printf("Date[%d] = %s\n",
                               i, strftime("%c", Date[i]))
                       printf("Subject[%d] = %s\n", i, Subject[i])
               }

               printf("%s", Text[Thread[SortedThread[i]]]) > "OUTPUT"
       }
       close("OUTPUT")

       close("/dev/stderr")    # shuts up --lint
}

compute_date

Pull apart a date string and convert to timestamp.

function compute_date(date_rec,         fields, year, month, day,
                                       hour, min, sec, tzoff, timestamp)
{
       split(date_rec, fields, "[:, ]+")
       if (fields[2] ~ /Sun|Mon|Tue|Wed|Thu|Fri|Sat/) {  # day name present => fields shift by one
               # Date: Thu, 05 Jan 2006 17:11:26 -0500
               year = fields[5]
               month = Month[fields[4]]
               day = fields[3] + 0
               hour = fields[6]
               min = fields[7]
               sec = fields[8]
               tzoff = fields[9] + 0
       } else {
               # Date: 05 Jan 2006 17:11:26 -0500
               year = fields[4]
               month = Month[fields[3]]
               day = fields[2] + 0
               hour = fields[5]
               min = fields[6]
               sec = fields[7]
               tzoff = fields[8] + 0
       }
       if (tzoff == "GMT" || tzoff == "gmt")
               tzoff = 0
       tzoff /= 100    # assume offsets are in whole hours
       tzoff = -tzoff

       # crude compensation for timezone
       # mktime() wants a local time:
       #       hour + tzoff yields GMT
       #       GMT + LocalOffset yields local time
       hour += tzoff + LocalOffset

       # if moved into next day, reset other values
       if (hour > 23) {
               hour %= 24
               day++
               if (day > days_in_month(month, year)) {
                       day = 1
                       month++
                       if (month > 12) {
                               month = 1
                               year++
                       }
               }
       }

       timestamp = mktime(sprintf("%d %d %d %d %d %d -1",
                               year, month, day, hour, min, sec))

       # timestamps can be 9 or 10 digits.
       # canonicalize them into 11 digits with leading zeros
       return sprintf("%011d", timestamp)
}

days_in_month

How many days in the given month?

function days_in_month(month, year)
{
       if (month != 2)
               return MonthDays[month]

       if (year % 4 == 0 && (year % 100 != 0 || year % 400 == 0))
               return 29       # century years are leap only if divisible by 400

       return 28
}

canonacalize_subject

Trim out "Re:", white space.

function canonacalize_subject(subj_line)
{
       subj_line = tolower(subj_line)
       sub(/^subject: +/, "", subj_line)
       sub(/^(re: *)+/, "", subj_line)
       sub(/[[:space:]]+$/, "", subj_line)
       gsub(/[[:space:]]+/, " ", subj_line)

       return subj_line
}
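
For example, a header line of "Subject: Re: Re: Meeting  notes " canonicalizes to "meeting notes": the line is lower-cased, the "subject:" tag and any leading "re:" prefixes are stripped, trailing whitespace is removed, and interior runs of whitespace collapse to a single space.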

Copyright

Copyright 2007, 2008, Arnold David Robbins arnold@skeeve.com
