About awk.info
» table of contents
» featured topics
» page tags
|
|
|
|
|
|
Mar 01: Michael Sanders demos an X-windows GUI for AWK.
Mar 01: Awk100#24: A. Lahm and E. de Rinaldis' patent search, in AWK
Feb 28: Tim Menzies asks this community to write an AWK cookbook.
Feb 28: Arnold Robbins announces a new debugger for GAWK.
Feb 28: Awk100#23: Premysl Janouch offers a IRC bot, In AWK
Feb 28: Updated: the AWK FAQ
Feb 28: Tim Menzies offers a tiny content management system, in Awk.
Jan 31: Comment system added to awk.info. For example, see discussion bottom of ?keys2awk
Jan 31: Martin Cohen shows that Gawk can handle massively long strings (300 million characters).
Jan 31: The AWK FAQ is being updated. For comments/ corrections/ extensions, please mail tim@menzies.us
Jan 31: Martin Cohen finds Awk on the Android platform.
Jan 31: Aleksey Cheusov released a new version of runawk.
Jan 31: Hirofumi Saito contributes a candidate Awk mascot.
Jan 31: Michael Sanders shows how to quickly build an AWK GUI for windows.
Jan 31: Hyung-Hwan Chung offers QSE, an embeddable Awk Interpreter.
These pages focused on using Awk to implement filters on Unix mail files.
http://www.tc.umn.edu/~hause011/article/Statistical_spam_filter.html
I now use a "Statistical Spam Filter". Wow, the scummy sewer of internet mail is cleansed, refreshed and usable again. Just using the delete button was getting too difficult, I got 8 to 10 spam for every good piece of mail. As a spam detector I am not as good a filter as you might think, just the subject and address is not always enough, an anti-spam tool I am not, I would occasionally open a spam to my great annoyance.
My filter was inspired by Paul Graham's article about a Naive Bayesian spam filter. The article is at "A Plan for Spam". He basically says that you get statistics on how often tokens show up in two bodies of mail, (spam and good,) and then calculate the a statistical value that a single mail is spam by looking at the tokens in it. The more mail in the good and spam mail bodies, the better the filter is "trained". Jeez, he made it sound so easy. And it is. I slapped an anti-spam tool together as a ksh and awk script for use as a personal filter on a Unix type system. To implement it I put it in the ~/.forward file. The code is at the bottom of the article, less than 100 The total code for the filter and training script is less than 200
This filter differs in lots of ways from the Paul Graham article. I took out some of the biases he describes and simplified it, maybe it is too simple. What I find most interesting is that the differences do not seem to matter much, I still filter out 96+% of spams. I got those results with a spam sample that is at least 500 emails and a good email sample that is at least 700 emails. With smaller training samples or a different mail mix it may not get as good results, or it may be better. Note: I later changed the training body to be more like the proportion of real spam to good mail, which is much more spam than good mail, about 8-10 spam to every good mail received and the anti-spam tool worked better.
First I run the training script on two bodies of mail, ~/Mail/received (good mail) and ~/Mail/junk (saved spam mail.) The ~/Mail/received file is already created on my unix box and holds mail that I have read and not deleted. The training script finds all the tokens in the emails and gives them a probability depending on how the token is found in the "spam mail" and the "good mail". The training script also creates the whitelist of addresses from the "good mail." As the mail flows through the system the training script will then "learn" each time it is run.
I run the actual spam filter script from the .forward file which allows a user to process mail before it hits your inbox. (Look up "man forward" at the shell prompt for further information on the .forward file.) The script first checks the whitelist for a good address, if it is found it passes the filter. If the address is not found it is passed to the statistical spam filter, the tokens are checked and the email is given a spaminess value. Above a certain value the email is classified as spam and put in the ~/Mail/junk file, below the value it passes to /var/spool/mail/mylogin where I read it as god intended email to be read, with a creaky old unix client. However, I can still read it with any other client I want, POP or IMAP.
I included a little test script below that I used to check my results. I just split emails into files and run them one at a time and check the value the filter gives.
Testing on email that has been used to train the filter will give results that are very good and not valid, so I tested on email not seen by the training script. The filter does get much better at filtering as the training sample gets bigger, just like the other statistical spam filters. For example, at lower sample sizes (trained with 209 good mails, 301 spammails) the filter was pretty bad. When the average spam value cutoff was raised to .51 so no good mail was blocked, 44% of the spam email passed through on a set of 320 spam and 683 good email. Even so, that means %56 of the spam was blocked. Small sample sizes are not perfect, but are usable and I began using the mail filter with a sample set of about 600 good mail and 300 spam. As the training sample increased the results improved. As I changed the mail mix to reflect the real spam proportions it got even better, around 96-98% of the spam blocked. I think the lower early results were because of the proportions of spam to good email, they should reflect the real proportion received on the system used by the filter.
Paul Graham or others may have superior filters and better mathematics for anti-spam algorithms but I am not sure that it matters all that much, the amount of spam that gets through is small enough not to bother me.
I used gawk in the filter and checked it with the gawk profiler to look for performance problems. The largest performance constraint is creating the spam-probability associative array in memory, the key-value pairs of tokens and the spam value I assign to them. Creating this associative array is more that 95% of the current time to process an email through the filter and gets worse when the set of tokens gets larger. Perl and other language users can get around this performance problem with DBM file interfaces, currently not available to my gawk filter.
I added a "whitelist" of good email addresses, a feature that helps keep good email from a bad classification and improves performance by a huge amount (at least a magnitude of 100) by not having to further filter the message. The white list is not one of the "challenge-response" things that annoys me so much that I toss any such email away, it simply learns from the email used to train the filter, it saves addresses that are from email that has passed the filter and gets in my "received" file. I figure that if I receive a good email from someone, chances are 100% that I want to receive email from that address. Note there is a place in the white list script to get rid of commonly forged email addresses, like your own address.
The main concept put forward by Paul Graham holds true and seems ungodly robust: applying statistics to filter spam works very well compared to lame rule sets and black lists. My program just proves the robustness of the solution; apparently any half-baked formula (like what I used) seems to work as long as the base probability of the tokens is computed.
Here are some of the many differences between this filter and the filter in the Paul Graham article in no particular order of importance:
Note: Do not test mail that has been used to train the filter, test mail not seen by the training program.
#!/bin/ksh
filter_test () {
# Split a file of unix email into many mail files with this:
cat ~/Mail/rece* |csplit -k -f good -n 4 - '/^From /' {900}
# Run a modified filter that displays the spam value for each mail file.
# I just commented out the last part of the filter and added a
# print statement of the Subject line and spam value the filter found.
for I in test/good*
do
cat $I | [filter_program-that_shows_the_value_only]
done | sort -n
}
Call from the command line or in a crontab file.
#!/bin/ksh
number_of_tokens (){
zcat $1 | cat $2 - | wc -w
}
# Note: Get rid of addresses that are commonly forged at the
# "My-Own-Address" string.
address_white_list (){
zcat $1 |
cat $2 - |
egrep '^From |^Return-Path: ' |
nawk '{print tolower($2)}'|
nawk '{gsub ("<",""); gsub (">","");print;}'|
grep -v 'My-Own-Address'|
sort -u > ~/Mail/address_whitelist
}
# Create a hash with probability of spaminess per token.
# Words only in good hash get .01, words only in spam hash get .99
spaminess () {
nawk 'BEGIN {goodnum=ENVIRON["GOODNUM"]; junknum=ENVIRON["JUNKNUM"];}
FILENAME ~ "spamwordfrequency" {bad_hash[$1]=$2}
FILENAME ~ "goodwordfrequency" {good_hash[$1]=$2}
END {
for (word in good_hash) {
if (word in bad_hash) { print word,
(bad_hash[word]/junknum)/ \
((good_hash[word]/goodnum)+(bad_hash[word]/junknum)) }
else { print word, "0.01"}
}
for (word in bad_hash) {
if (word in good_hash) { done="already"}
else { print word, "0.99"}
}}' ~/Mail/spamwordfrequency ~/Mail/goodwordfrequency
}
# Print list of word frequencies
frequency (){
nawk ' { for (i = 1; i <= NF; i++)
freq[$i]++ }
END {
for (word in freq){
if (freq[word] > 2) {
printf "%s\t%d\n", word, freq[word];
}
}
}'
}
# Note: I store the email in compressed files to keep my storage space small,
# so I have the gzipped mail that I run through the filter training
# script as well as current uncompressed "good" and spam files.
#
prepare_data () {
export JUNKNUM=$(number_of_tokens '/Your/home/Mail/*junk*.gz' '/Your/home/Mail/junk')
export GOODNUM=$(number_of_tokens '/Your/home/Mail/*received*.gz' '/Your/home//Mail/received')
address_white_list '/Your/home/Mail/*received*.gz' '/Your/home/Mail/received'
echo $JUNKNUM $GOODNUM
zcat ~/Mail/*junk*.gz | cat ~/Mail/junk - |
frequency|
sort -nr -k 2,2 > ~/Mail/spamwordfrequency
zcat ~/Mail/*received*.gz | cat ~/Mail/received - |
frequency|
sort -nr -k 2,2 > ~/Mail/goodwordfrequency
spaminess|
sort -nr -k 2,2 > ~/Mail/spamprobability
# Clean up files
rm ~/Mail/spamwordfrequency ~/Mail/goodwordfrequency
}
#########
# Main
prepare_data
exit
Inspired by the Paul Graham article "A Plan for Spam" www.paulgraham.com
Implement in the .forward file like so:
"| /Your/path/to/bin/spamfilter"
If mail is spam then put in a spam file else put in the good mail file.
#!/bin/ksh
spamly () {
/usr/bin/nawk '
{ message[k++]=$0; }
END { if (k==0) {exit;} # empty message or was in the whitelist.
good_mail_file="/usr/spool/mail/your_user";
spam_mail_file="/Your/home/Mail/junk";
spam_probability_file="/Your/home/Mail/spamprobability";
total_tokens=0.01;
while (getline < spam_probability_file)
bad_hash[$1]=$2; close(spam_probability_file);
for (line in message){
token_number=split(message[line],tokens);
for (i = 0; i <= token_number; i++){
if (tokens[i] in bad_hash) {
if (bad_hash[tokens[i]] <= 0.06 || bad_hash[tokens[i]] >= 0.94){
total_tokens+=1;
spamtotal+=bad_hash[tokens[i]];
}
}
}
}
if (spamtotal/total_tokens > 0.50) {
for (j = 0; j <= k; j++){ print message[j] >> spam_mail_file}
print "\n\n" >> spam_mail_file;
}
else {
for (j = 0; j <= k; j++){ print message[j] >> good_mail_file}
print "\n\n" >> good_mail_file;
}
}'
}
# Check whitelist for good address.
# if in whitelist then put in good_mail_file
# else Pass message through filter.
whitelister () {
/usr/bin/nawk '
BEGIN { whitelist_file="/Your/home/Mail/address_whitelist";
good_mail_file="/usr/spool/mail/your_user";
found="no";
while (getline < whitelist_file)
whitelist[$1]="address"; close(whitelist_file);
}
{ message[k++]=$0;}
/^From / {sender=tolower($2);
gsub ("\<","",sender);
gsub ("\>","",sender);
if (whitelist[sender]) { found="yes";}
}
/^Return-Path: / {sender=tolower($2);
gsub ("\<","",sender);
gsub ("\>","",sender);
if (whitelist[sender]) { found="yes";}
}
END { if (found=="yes") {
for (j = 0; j <= k; j++){ print message[j] >> good_mail_file}
print "\n\n" >> good_mail_file;
}
else {
for (j = 0; j <= k; j++){ print message[j];}
}
}'
}
#####################################
# Main
# The mail is first checked by the white list, if it is not found in the
# white list it is piped to the spam filter.
whitelister | spamly
exit
Download from LAWKER.
Sorts a Unix style mailbox by "thread", in date+subject order.
This is a script I use quite a lot. It requires gawk although with some work could be ported to standard awk. The timezone offset from GMT has to be adjust to one's local offset, although I could probably eliminate that if I wanted to work on it hard enough.
This took me a while to write and get right, but it's been working flawlessly for a few years now. The script uses Message-ID header to detect and remove duplicates. It requires GNU Awk for time/date functions and for efficiency hack in string concatenation but could be made to run on a POSIX awk with some work.
BEGIN {
TRUE = 1
FALSE = 0
split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", months, " ")
for (i in months)
Month[months[i]] = i # map name to number
MonthDays[1] = 31
MonthDays[2] = 28 # not used
MonthDays[3] = 31
MonthDays[4] = 30
MonthDays[5] = 31
MonthDays[6] = 30
MonthDays[7] = 31
MonthDays[8] = 31
MonthDays[9] = 30
MonthDays[10] = 31
MonthDays[11] = 30
MonthDays[12] = 31
In_header = FALSE
Body = ""
LocalOffset = 2 # We are two hours ahead of GMT
# These keep --lint happier
Debug = 0
MessageNum = 0
Duplicates = FALSE
}
/^From / {
In_header = TRUE
if (MessageNum)
Text[MessageNum] = Body
MessageNum++
Body = ""
# print MessageNum
}
In_header && /^Date: / {
Date[MessageNum] = compute_date($0)
}
In_header && /^Subject: / {
Subject[MessageNum] = canonacalize_subject($0)
}
In_header && /^Message-[Ii][Dd]: / {
if (NF == 1) {
getline junk
$0 = $0 RT junk # Preserve original input text!
}
# Note: Do not use $0 directly; it's needed as the Body text
# later on.
line = tolower($0)
split(line, linefields)
message_id = linefields[2]
Mesg_ID[MessageNum] = message_id # needed for disambiguating message
if (message_id in Message_IDs) {
printf("Message %d is duplicate of %s (%s)\n",
MessageNum, Message_IDs[message_id],
message_id) > "/dev/stderr"
Message_IDs[message_id] = (Message_IDs[message_id] ", " MessageNum)
Duplicates++
} else {
Message_IDs[message_id] = MessageNum ""
}
}
In_header && /^$/ {
In_header = FALSE
# map subject and date to index into text
if (Debug && (Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]) in SubjectDateId) {
printf(\
("Message %d: Subject <%s> Date <%s> Message-ID <%s> already in" \
" SubjectDateId (Message %d, s: <%s>, d <%s> i <%s>)!\n"),
MessageNum, Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum],
SubjectDateId[Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]],
Subject[SubjectDateId[Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]]],
Date[SubjectDateId[Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]]],
Mesg_ID[SubjectDateId[Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]]]) \
> "/dev/stderr"
}
SubjectDateId[Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]] = MessageNum
if (Debug) {
printf("\tMessage Num = %d, length(SubjectDateId) = %d\n",
MessageNum, length(SubjectDateId)) > "/dev/stderr"
if (MessageNum != length(SubjectDateId) && ! Printed1) {
Printed1++
printf("---> Message %d <---\n", MessageNum) > "/dev/stderr"
}
}
# build up mapping of subject to earliest date for that subject
if (! (Subject[MessageNum] in FirstDates) ||
FirstDates[Subject[MessageNum]] > Date[MessageNum])
FirstDates[Subject[MessageNum]] = Date[MessageNum]
}
{
Body = Body ($0 "\n")
}
END {
Text[MessageNum] = Body # get last message
if (Debug) {
printf("length(SubjectDateId) = %d, length(Subject) = %d, length(Date) = %d\n",
length(SubjectDateId), length(Subject), length(Date))
printf("length(FirstDates) = %d\n", length(FirstDates))
}
# Create new array to sort by thread. Subscript is
# earliest date, subject, actual date
for (i in SubjectDateId) {
n = split(i, t, SUBSEP)
if (n != 3) {
printf("yowsa! n != 3 (n == %d)\n", n) > "/dev/stderr"
exit 1
}
# now have subject, date, message-id in t
# create index into Text
Thread[FirstDates[t[1]], i] = SubjectDateId[i]
}
n = asorti(Thread, SortedThread) # Shazzam!
if (Debug) {
printf("length(Thread) = %d, length(SortedThread) = %d\n",
length(Thread), length(SortedThread))
}
if (n != MessageNum && ! Duplicates) {
printf("yowsa! n != MessageNum (n == %d, MessageNum == %d)\n",
n, MessageNum) > "/dev/stderr"
# exit 1
}
if (Debug) {
for (i = 1; i <= n; i++)
printf("SortedThread[%d] = %s, Thread[SortedThread[%d]] = %d\n",
i, SortedThread[i], i, Thread[SortedThread[i]]) > "DUMP1"
close("DUMP1")
if (Debug ~ /exit/)
exit 0
}
for (i = 1; i <= MessageNum; i++) {
if (Debug) {
printf("Date[%d] = %s\n",
i, strftime("%c", Date[i]))
printf("Subject[%d] = %s\n", i, Subject[i])
}
printf("%s", Text[Thread[SortedThread[i]]]) > "OUTPUT"
}
close("OUTPUT")
close("/dev/stderr") # shuts up --lint
}
Pull apart a date string and convert to timestamp.
function compute_date(date_rec, fields, year, month, day,
hour, min, sec, tzoff, timestamp)
{
split(date_rec, fields, "[:, ]+")
if ($2 ~ /Sun|Mon|Tue|Wed|Thu|Fri|Sat/) {
# Date: Thu, 05 Jan 2006 17:11:26 -0500
year = fields[5]
month = Month[fields[4]]
day = fields[3] + 0
hour = fields[6]
min = fields[7]
sec = fields[8]
tzoff = fields[9] + 0
} else {
# Date: 05 Jan 2006 17:11:26 -0500
year = fields[4]
month = Month[fields[3]]
day = fields[2] + 0
hour = fields[5]
min = fields[6]
sec = fields[7]
tzoff = fields[8] + 0
}
if (tzoff == "GMT" || tzoff == "gmt")
tzoff = 0
tzoff /= 100 # assume offsets are in whole hours
tzoff = -tzoff
# crude compensation for timezone
# mktime() wants a local time:
# hour + tzoff yields GMT
# GMT + LocalOffset yields local time
hour += tzoff + LocalOffset
# if moved into next day, reset other values
if (hour > 23) {
hour %= 24
day++
if (day > days_in_month(month, year)) {
day = 1
month++
if (month > 12) {
month = 1
year++
}
}
}
timestamp = mktime(sprintf("%d %d %d %d %d %d -1",
year, month, day, hour, min, sec))
# timestamps can be 9 or 10 digits.
# canonicalize them into 11 digits with leading zeros
return sprintf("%011d", timestamp)
}
How many days in the given month?
function days_in_month(month, year)
{
if (month != 2)
return MonthDays[month]
if (year % 4 == 0 && year % 400 != 0)
return 29
return 28
}
Trim out "Re:", white space.
function canonacalize_subject(subj_line)
{
subj_line = tolower(subj_line)
sub(/^subject: +/, "", subj_line)
sub(/^(re: *)+/, "", subj_line)
sub(/[[:space:]]+$/, "", subj_line)
gsub(/[[:space:]]+/, " ", subj_line)
return subj_line
}
Copyright 2007, 2008, Arnold David Robbins arnold@skeeve.com