About awk.info
» table of contents
» featured topics
» page tags
|
|
|
|
|
|
Mar 01: Michael Sanders demos an X-windows GUI for AWK.
Mar 01: Awk100#24: A. Lahm and E. de Rinaldis' patent search, in AWK
Feb 28: Tim Menzies asks this community to write an AWK cookbook.
Feb 28: Arnold Robbins announces a new debugger for GAWK.
Feb 28: Awk100#23: Premysl Janouch offers a IRC bot, In AWK
Feb 28: Updated: the AWK FAQ
Feb 28: Tim Menzies offers a tiny content management system, in Awk.
Jan 31: Comment system added to awk.info. For example, see discussion bottom of ?keys2awk
Jan 31: Martin Cohen shows that Gawk can handle massively long strings (300 million characters).
Jan 31: The AWK FAQ is being updated. For comments/ corrections/ extensions, please mail tim@menzies.us
Jan 31: Martin Cohen finds Awk on the Android platform.
Jan 31: Aleksey Cheusov released a new version of runawk.
Jan 31: Hirofumi Saito contributes a candidate Awk mascot.
Jan 31: Michael Sanders shows how to quickly build an AWK GUI for windows.
Jan 31: Hyung-Hwan Chung offers QSE, an embeddable Awk Interpreter.
2009: assoc Prof, LCSEE, WVU email: tim@menzies.us web site: http://menzies.us.
Tim has been a Great Auk since Feb'09.
From awk.freeshell.org:
It's a bit embarassing to note that the exact origins of each are a bit hazy. This whole section requires further work, including the addition of links pointing to source repositories and binary distribution points.
Historical list of Awk implementations.
by T. Menzies
"The Enlightened Ones say that....
Awk is a good old-fashioned UNIX filtering tool invented in the 1970s. The language is simple and Awk programs are generally very short. Awk is useful when the overheads of more sophisticated approaches is not worth the bother. Also, the cost of learning Awk is very low.
But aren't there better scripting languages? Faster? Well, maybe yes and maybe no.
And Awk is old (mid-70s). Aren't modern languages more productive? Well again, maybe yes and maybe no. One measure of the productivity of a language is how lines of code are required to code up one business level `function point'. Compared to many popular languages, GAWK scores very highly:
loc/fp language
------ --------
6, excel 5
13, sql
21, awk <================
21, perl
21, eiffel
21, clos
21, smalltalk
29, delphi
29, visual basic 5
49, ada 95
49, ai shells
53, c++
53, java
64, lisp
71, ada 83
71, fortran 95
80, 3rd generation default
91, ansi cobol 85
91, pascal
107, 2nd generation default
107, algol 68
107, cobol
107, fortran
128, c
320, 1st generation default
640, machine language
3200, natural language
Anyway, there are other considerations. Awk is real succinct, simple enough to teach, and easy enough to recode in C (if you want raw speed). For example, here's the complete listing of someone's Awk spell-checking program.
BEGIN {while (getline<"Usr.Dict.Words") dict[$0]=1}
!dict[$1] {print $1}
Sure, there's about a gazillion enhancements you'd like to make on this one but you gotta say, this is real succinct.
Awk is the cure for late execution of software syndrome (a.k.a. LESS). The symptoms of LESS are a huge time delay before a new idea is executable. Awk programmers can hack up usable systems in the time it takes other programmers to boot their IDE. And, as a result of that easy exploration, it is possible to find loopholes missed by other analyst that lead to the innovative better solution to the problems (e.g. see Ronald Loui's O(nlogn) clustering tool).
Certainly, we can drool over the language features offered by more advanced languages like pointers, generic iterators, continuations, etc etc. And Awk's lack of data structures (except num, string, and array) requires some discipline to handle properly.
But experienced Awk programmers know that the cleverer the program, the smaller the audience gets. If it is possible for to explain something succinctly in a simple language like Awk, then it is also possible that more folks will read that code.
Finally, at this may be the most important point, it might be misguided to argue about Awk vs LanguageX in terms of the specifics of those languages. Awk programmers can't over-elaborate their solutions- they are forced to code the solution in the simplest manner possible. This push to simplicity, to the essence of the problem, can be an insightful process. Coding in Awk is like preserving fruit- you boil off everything that is superfluous, that needlessly bloats the material what you are working with. It is amazing how little code is required to code the core of an idea (e.g. see Darius Bacon's LISP interpreter, written in Awk).
Tim Menzies writes...
Anyone interested in collaborating on an Awk Cook Book? That we get published with (say) O'Reilly?
After browsing the excellent entries from the PHP cookbook by David Sklar and Adam Trachtenberg, I think the following topics look relevant.
But what other topics are useful/ simple/ practical/ etc?
I propose the following timetable:
So, according to that timetable, the first step is for everyone to jump in and propose topics. Looking forward to your input!
1 = Strings =
2
3 * Accessing Substrings
4 * Replacing Substrings
5 * Processing a String One Character at a Time
6 * Reversing a String by Word or Character
7 * Expanding and Compressing Tabs
8 * Controlling Case
9 * Interpolating Functions and Expressions Within Strings
10 * Trimming Blanks from a String
11 * Parsing Comma-Separated Data
12 * Parsing Fixed-Width Delimited Data
13 * Taking Strings Apart
14 * Wrapping Text at a Certain Line Length
15 * Storing Binary Data in Strings
16 * Building a String From Parts
17 * Substituting Variables Into Strings
18 * Converting Between Characters and Values
19 * Managing Whitespace
20 * Word-wrapping Lines of Text
21 * Validating an Email Address
22
23 = Numbers =
24
25 * Checking Whether a String Contains a Valid Number
26 * Comparing Floating-Point Numbers
27 * Rounding Floating-Point Numbers
28 * Operating on a Series of Integers
29 * Generating Random Numbers Within a Range
30 * Generating Biased Random Numbers
31 * Taking Logarithms
32 * Calculating Exponents
33 * Formatting Numbers
34 * Printing Correct Plurals
35 * Calculating Trigonometric Functions
36 * Doing Trigonometry in Degrees, not Radians
37 * Handling Very Large or Very Small Numbers
38 * Converting Between Bases
39 * Calculating Using Numbers in Bases Other Than Decimal
40
41 = Dates and Times =
42
43 * Finding the Current Date and Time
44 * Converting Time and Date Parts to an Epoch Timestamp
45 * Converting an Epoch Timestamp to Time and Date Parts
46 * Printing a Date or Time in a Specified Format
47 * Finding the Difference of Two Dates
48 * Finding the Difference of Two Dates with Julian Days
49 * Finding the Day in a Week, Month, Year, or the Week Number in a Year
50 * Validating a Date
51 * Parsing Dates and Times from Strings
52 * Adding to or Subtracting from a Date
53 * Calculating Time with Time Zones
54 * Accounting for Daylight Saving Time
55 * Generating a High-Precision Time
56 * Generating Time Ranges
57 * Using Non-Gregorian Calendars
58 * Program: Calendar
59
60 = Arrays =
61
62 * Specifying an Array Not Beginning at Element 0
63 * Storing Multiple Elements per Key in an Array
64 * Initializing an Array to a Range of Integers
65 * Iterating Through an Array
66 * Deleting Elements from an Array
67 * Changing Array Size
68 * Appending One Array to Another
69 * Turning an Array into a String
70 * Printing an Array with Commas
71 * Checking if a Key Is in an Array
72 * Checking if an Element Is in an Array
73 * Finding the Position of an Element in an Array
74 * Finding Elements That Pass a Certain Test
75 * Finding the Largest or Smallest Valued Element in an Array
76 * Reversing an Array
77 * Sorting an Array
78 * Sorting an Array by a Computable Field
79 * Sorting Multiple Arrays
80 * Sorting an Array Using a Method Instead of a Function
81 * Randomizing an Array
82 * Shuffling a Deck of Cards
83 * Removing Duplicate Elements from an Array
84 * Finding the Union, Intersection, or Difference of Two Arrays
85 * Finding All Element Combinations of an Array
86 * Finding All Permutations of an Array
87 * Program: Printing an Array in a Horizontally Columned HTML Table
88
89 = Hashes =
90
91 * Introduction
92 * Adding an Element to a Hash
93 * Testing for the Presence of a Key in a Hash
94 * Deleting from a Hash
95 * Traversing a Hash
96 * Printing a Hash
97 * Retrieving from a Hash in Insertion Order
98 * Hashes with Multiple Values Per Key
99 * Inverting a Hash
100 * Sorting a Hash
101 * Merging Hashes
102 * Finding Common or Different Keys in Two Hashes
103 * Hashing References
104 * Presizing a Hash
105 * Finding the Most Common Anything
106 * Representing Relationships Between Data
107 * Program: dutree
108
109 = Variables =
110
111 * Avoiding == Versus = Confusion
112 * Establishing a Default Value
113 * Exchanging Values Without Using Temporary Variables
114 * Creating a Dynamic Variable Name
115 * Using Static Variables
116 * Sharing Variables Between Processes
117 * Encapsulating Complex Data Types as a String
118 * Dumping Variable Contents as Strings
119
120 = Functions =
121
122 * Accessing Function Parameters
123 * Setting Default Values for Function Parameters
124 * Passing Values by Reference
125 * Using Named Parameters
126 * Creating Functions That Take a Variable Number of Arguments
127 * Returning Values by Reference
128 * Returning More Than One Value
129 * Skipping Selected Return Values
130 * Returning Failure
131 * Calling Variable Functions
132 * Accessing a Global Variable Inside a Function
133 * Creating Dynamic Functions
134
135 = XML =
136
137 * Generating XML Manually
138 * Generating XML with the DOM
139 * Parsing XML with the DOM
140 * Parsing XML with SAX
141 * Transforming XML with XSLT
142 * Sending XML-RPC Requests
143 * Receiving XML-RPC Requests
144 * Sending SOAP Requests
145 * Receiving SOAP Requests
146 * Exchanging Data with WDDX
147 * Reading RSS Feeds
148
149 = Regular Expressions =
150
151 * Switching From ereg to preg
152 * Matching Words
153 * Finding the nth Occurrence of a Match
154 * Choosing Greedy or Nongreedy Matches
155 * Matching a Valid Email Address
156 * Finding All Lines in a File That Match a Pattern
157 * Capturing Text Inside HTML Tags
158 * Escaping Special Characters in a Regular Expression
159 * Reading Records with a Pattern Separator
160
161
162 = Files =
163 * Creating or Opening a Local File
164 * Creating a Temporary File
165 * Opening a Remote File
166 * Reading from Standard Input
167 * Reading a File into a String
168 * Counting Lines, Paragraphs, or Records in a File
169 * Processing Every Word in a File
170 * Reading a Particular Line in a File
171 * Processing a File Backward by Line or Paragraph
172 * Picking a Random Line from a File
173 * Randomizing All Lines in a File
174 * Processing Variable Length Text Fields
175 * Reading Configuration Files
176 * Reading from or Writing to a Specific Location in a File
177 * Removing the Last Line of a File
178 * Modifying a File in Place Without a Temporary File
179 * Flushing Output to a File
180 * Writing to Standard Output
181 * Writing to Many Filehandles Simultaneously
182 * Escaping Shell Metacharacters
183 * Passing Input to a Program
184 * Reading Standard Output from a Program
185 * Reading Standard Error from a Program
186 * Locking a File
187 * Reading and Writing Compressed Files
188 * Program: Unzip
189
190 = Directories =
191
192 * Getting and Setting File Timestamps
193 * Getting File Information
194 * Changing File Permissions or Ownership
195 * Splitting a Filename into Its Component Parts
196 * Deleting a File
197 * Copying or Moving a File
198 * Processing All Files in a Directory
199 * Getting a List of Filenames Matching a Pattern
200 * Processing All Files in a Directory
201 * Making New Directories
202 * Removing a Directory and Its Contents
203 * Program: Web Server Directory Listing
204 * Program: Site Search
TINY TIM is a tiny web-site manager written in AWK. For a live demo of the site, see http://at.ttoy.net/?tinytim. The site supports runtime content generation; e.g. the quote shown top right of the demo site is auto-generated each time you refresh the page.
The site was written to demonstrate that a little AWK goes a long way. At the time of this writing, the current system is under 100 lines of code (excluding a seperate formatter, that is another 170 lines of code). It took longer to write this doco and the various HTML/CSS theme files, than the actual code itself (fyi: 6 hours for the themes/doc and 3 hours for the code).
TINY TIM has the following features:
In a web accessible directory, type
svn export http://knit.googlecode.com/svn/branches/0.2/tinytim/
In the resulting directory, perform the local juju required to make index.cgi web-runnable (e.g. on my ISP, chmod u+rx index.cgi).
Follow the directions in the next section to customize the site.
TINY TIM is controlled by the following index.cgi file. To select a theme, comment out all but one of the last lines (using the "#" character). For a screen-shots of the current themes, see below.
#!/bin/bash
[ -n "$1" ] && export QUERY_STRING="$1"
tinytim() {
cat content/* themes/$1/theme.txt |
gawk -f lib/tinytim.awk |
sed 's/^<pre>/<script type="syntaxhighlighter" class="brush: cpp"><![CDATA[/' |
sed 's/^<\/pre>/<\/script>/'
}
#tinytim auklet
#tinytim trendygreen
tinytim wink
Notes:
Themes are defined in the sub-directory themes/themename. Each theme is defined by a theme.txt file that holds:
To write a new theme:
The following themes are defined in the directory themes.
Auklet:
Trendygreen (adapted from GetTemplates):
Wink:
The first entry in the content defines strings that can slip into the theme templates. For example, the following slots define the title of a site; the name of formatter script that renders each page; the url of the home directory of the site; a menu to add top of each page; a footer to add to the bottom of each page; and a web-accessible directory for storing images.
``title`` Just another Tiny Tim demo
``formatter`` lib/markup.awk
``description`` (simple cms)
``home`` http://at.ttoy.net
``menu`` <a href="?index">Home</a> |
<a href="?contact">Contact</a> |
<a href="?about">About</a>
``footer`` <p>Powered by <a href="?tinytim">TINY TIM</a>.
© 2010 by Tim Menzies
``images`` http://at.ttoy.net/img
Note the following important convention. TINY TIM auto-generates some of its own strings. The names of these strings start with an uppercase letter. To avoid confusion of your strings with those that are auto-generated, it is best to start your strings with a lower-case letter (e.g. like all those in the above example.
Google offers a nice free site-specific search engine. It takes a few days for the spiders to find the site but after that, it works fine. To set this up, follow the instructions at Google custom search, then
For example, look for google-search in the current templates and content/0config.txt.
After the first entry, the rest of the entries in the content/* define the pages of a site. Each entry must begin with the magic string
For example, this site contains a missing page report. This page is defined as follows. In the following definition of that page, the name is "404"; the tags are "Admin Feb10" and the title is "Sorry".
#12345#################################################################################### 404 Admin Feb10 Sorry I have bad news: <center> [img/404book.jpg] </center>
The contents can contain HTML and MARKUP tags.
MARKUP is a shorthand for writing HTML pages based on MARKDOWN:
Also, in MARKUP, major, minor, sub-, and sub-sub- headings are two line paragraphs where the second line contains two or more "=", "-", "+", "_" (respectively). MARKUP collects these headings as a table of contents, which is added to the top of the page.
Note that MARKUP is separate to TINY TIM. To change the formatting of pages, write your own AWK code and change the string ``formatter`` in the first entry of content/0config.txt.
If a `` entry contains a semi-colon (e.g. ``quotes;``) then it is a plugin. Plugin content is generated at runtime. To write a plugin, modify the file lib/plugins.awk. Currently, that file looks like this:
function slotsPlugIns(str,slots, tmp) {
split(str,tmp,";")
if (tmp[1]=="quotes")
return quotes(str,slots)
return str
}
function quotes(str,slots, n,tmp) {
srand(systime() + PROCINFO["pid"])
n=split(slots["quotes"],tmp,"\n")
return tmp[int(rand()*n) + 1]
}
The function slotsPlugIns is a "traffic-cop" who decides what plugin to call (in the above, there is only one current plugin: quotes).
Each plugin function (e.g. quotes) is passed the string from the template (see str) and an array of key/value pairs holding all the defined string values (see slots). These functions must return a string to be inserted into the rendered HTML.
In the example above, quotes just returns a random quote. It assumes that the predefined strings includes a set of quotes, one per line:
``quotes`` Small things with great love. <br>-- Mother Teresa
It's hard work to it look effortless.<br>-- Katarina Witt
"God bless us every one!".<br>-- Tiny Tim
The quote generated by this plug in can be view, top right of this page.
These pages are focused on Functional Gawk (a.k.a. "Funky").
Funky is enabled by a new feature added to Gawk 3.2: indirect functions. For example:
function foo() { print "foo" }
function bar() { print "bar" }
BEGIN {
the_func = "foo"
@the_func() # calls foo()
the_func = "bar"
@the_func() # calls bar()
}
At the time of this writing, Gawk 3.2 is pre-release and indirect functions can be accessed using the gawk-devel CVS tree:
cvs -d:pserver:anonymous@cvs.sv.gnu.org:/sources/gawk co gawk-devel
Indirect functions enable a new view on library management in Gawk and, perhaps, a way to emulate functional abstraction in languages like Lisp.
So, anyone care to try, say:
all( fun, array [,max]
collect( fun, array1, array2 [,max])
select( fun, array1, array2 [,max])
reject( fun, array1, array2 [,max])
detect( fun, array [,max])
inject( fun, array, carry [,max])
All these functions return the size of array or array2
An interesting new feature in Gawk 3.1.7 is indirect functions. This allows the function name to be a variable, passed as an argument to an array, and called using the syntax
@fun(arg1,arg2,...)
This enables a new kind of funcational programming style in Gawk. For example, generic enumeration patterns can be coded once, then called many different ways with different function names passed as arguments.
This document illustrates this style of programming.
For example, here are some standard enumeration functions:
Applies the function fun to all items in the array. If called with the max argument, then they are iterated in the order i=1 .. max, otherwise we use for(i in a).
Applies fun to each item in array1 and collects the results in array2.
Find all the items in array1 that satisfies fun and add them to array2.
Find all the items in array1 that do not satisfy fun and add them to array2.
Return the first item found in array that satisfies fun. If no such item is found, then return the magic global value Fail.
(This one is a little tricky.) The result of applying fun to each item in array is carried into the processing of the next item. Initially, the carried value is carry. This function returns the final carry.
To illusrate the above, consider the following functions. Each of these are defined for one array item.
function odd(x) { return (x % 2) == 1 }
function show(x) { print "[" x "]" }
function mult(x,y) { return x * y }
function halve(x) { return x/2 }
function do_all( arr) {
split("22 23 24 25 26 27 28",arr)
all("show",arr)
}
When we run this ...
eg/enum1
gawk317="$HOME/opt/gawk/bin/gawk"
$gawk317 -f ../enumerate.awk --source 'BEGIN { do_all() }'
we see every item in arr printed using the above show function ...
eg/enum1.out
[25] [26] [27] [28] [22] [23] [24]
function do_collect( max,arr1,arr2,i) {
max=split("22 23 24 25 26 27 28",arr1)
collect("halve",arr1,arr2,max)
for(i=1;i<=max;i++) print arr2[i]
}
When we run this ...
eg/enum2
gawk317="$HOME/opt/gawk/bin/gawk"
$gawk317 -f ../enumerate.awk --source 'BEGIN { do_collect() }'
we see every item in arr divided in two ...
eg/enum2.out
11 11.5 12 12.5 13 13.5 14
function do_select( all,less,arr1,arr2,i) {
all = split("22 23 24 25 26 27 28",arr1)
less = select("odd",arr1,arr2,all)
for(i=1;i<=less;i++) print arr2[i]
}
When we run this ...
eg/enum3
gawk317="$HOME/opt/gawk/bin/gawk"
$gawk317 -f ../enumerate.awk --source 'BEGIN { do_select() }'
we see every item in arr that satisfies odd....
eg/enum3.out
23 25 27
function do_reject( all,less,arr1,arr2,i) {
all = split("22 23 24 25 26 27 28",arr1)
less = reject("odd",arr1,arr2,all)
for(i=1;i<=less;i++) print arr2[i]
}
When we run this ...
eg/enum4
gawk317="$HOME/opt/gawk/bin/gawk"
$gawk317 -f ../enumerate.awk --source 'BEGIN { do_reject() }'
we see every item in arr that do not satisfies odd....
eg/enum4.out
22 24 26 28
function do_detect( all,arr1) {
all = split("22 23 24 25 26 27 28",arr1)
print detect("odd",arr1,all)
}
When we run this ...
eg/enum5
gawk317="$HOME/opt/gawk/bin/gawk"
$gawk317 -f ../enumerate.awk --source 'BEGIN { do_detect() }'
we see the first item in arr that satisfies odd....
eg/enum5.out
23
function do_inject( all,less,arr1,arr2,i) {
split("1 2 3 4 5",arr1)
print inject("mult",arr1,1)
}
When we run this ...
eg/enum6
gawk317="$HOME/opt/gawk/bin/gawk"
$gawk317 -f ../enumerate.awk --source 'BEGIN { do_inject() }'
we see every the result of multiplying every item in arr by its predecessor.
eg/enum6.out
120
Note one design principle in the following: any newly generated arrays have indexes 1..max where max is the number of elements in that array.
function all (fun,a,max, i) {
if (max)
for(i=1;i<=max;i++) @fun(a[i])
else
for(i in a) @fun(a[i])
}
function collect (fun,a,b,max, i) {
if (max)
for(i=1;i<=max;i++) {n++; b[i]= @fun(a[i]) }
else
for(i in a) {n++; b[i]= @fun(a[i])}
return n
}
function select (fun,a,b,max, i,n) {
if (max)
for(i=1;i<=max;i++) {
if (@fun(a[i])) {n++; b[n]= a[i] }}
else
for(i in a) {
if (@fun(a[i])) {n++; b[n]= a[i] }}
return n
}
function reject (fun,a,b,max, i,n) {
if (max)
for(i=1;i<=max;i++) {
if (! @fun(a[i])) {n++; b[n]= a[i] }}
else
for(i in a) {
if (! @fun(a[i])) {n++; b[n]= a[i] }}
return n
}
BEGIN {Fail="someUnLIKELYSymbol"}
function detect (fun,a,max, i) {
if (max)
for(i=1;i<=max;i++) {
if (@fun(a[i])) return a[i] }
else
for(i in a) {
if (@fun(a[i])) return a[i] }
return Fail
}
function inject (fun,a,carry,max, i) {
if (max)
for(i=1;i<=max;i++)
carry = @fun(a[i],carry)
else
for(i in a)
carry = @fun(a[i],carry)
return carry
}
The above code does not pass around any state information that the fum functions can use. So all their deliberations are either with the current array values (integers or strings) or with global state. It might be worthwhile writing new versions of the above with one more argument, to carry that sate.
The code that renders the awk.info web site can "pretty print" awk code. For example:
To enable that pretty print, add some html syntax inside your code and apply the following conventions.
Note that if you want to see your "looking pretty", then you could could see how it looks using our preview tool:
http://awk.info/?awk:urlWithoutHTTPprefix
For exmaple, the file http://menzies.us/tmp/xx.awk can be previewed using http://awk.info/?awk:menzies.us/tmp/xx.awk
Once you've got it "looking pretty", please consider contributing that code to awk.info, so our code library can grow. To do so, either email mail@awk.info with the URL of your pretty code or zip up the files and email them across.
The first paragraph of the file will be ignored. Use this first para for copyright notices or comments about down-in-the weeds trivia. Note: the first para ends with one blank line.
The next paragraph should start with
#.H1 <join>Title</join>
The code could should be topped and tailed as follows:
#<pre> code #</pre>
All other comment lines should start with a single "#" at front-of-line. These comment characters will be stripped away by the awk.info renderer.
Awk.info's renderer adopts the following html shorthand. If a line starts with
#.WORD other words
this this is replaced with
<WORD> other words</WORD>
If no other words follow #.WORD then the line becomes just <WORD>
Awk.info's renderer supports a few HTML extensions:
That's it. Now you can pretty print your code on the web just be adding a little html in the comments.
Ideally, all code in our code repository comes with unit tests:
Accordingly code offered to this site can contain unit tests, using the methods described in this page.
But before going on, we stress that awk.info gratefully accepts awk contributions in any form. That is, including unit tests with code is optional.
If your code is in directory yourcode then create a sub-directory yourcode/eg
Write a test in a file yourcode/eg/yourtest. Divide that test into two parts:
# assumes
# - the LAWKER trunk has been checked out and
# - .bash_profile contains: export Lawker="$HOME/svns/lawker/fridge"
. $Lawker/lib/bash/setup
gawk -f join.awk --source '
BEGIN { split("tim tom tam",a)
print join(a,2)
}'
Write the expected output of that test case in yourcode/eg/yourtest.out
The above file conventions mean that an automatic tool can run over the entire code base and perform a regression test (checking if all the tests generate all the *.out files.
Another advantage of the above scheme is that you can use the tests to document your code.
To show the test case, add the following into your .awk file:
#.BODY yourcode/eg/yourtest #.CODE yourcode/eg/yourtest.out
Then zip the directory yourcode (including yourcode/eg) and send it to awk.info. Once we install those files on our site then when awk.info displays that file, the test case trivia is hidden and the users only see the essential details. For an example of this, see http://awk.info/?gawk/array/join.awk.
by T. Menzies
Imagine Gawk as a kind of a cut-down C language with four tricks:
What to all these do? Well....
You don't need to define variables- they appear as your use them.
There are only three types: stings, numbers, and arrays.
To ensure a number is a number, add zero to it.
x=x+0
To ensure a string is a string, add an empty string to it.
x= x "" "the string you really want to add"
To ensure your variables aren't global, use them within a function and add more variables to the call. For example if a function is passed two variables, define it with two PLUS the local variables:
function haslocals(passed1,passed2, local1,local2,local3) {
passed1=passes1+1 # changes externally
local1=7 # only changed locally
}
Note that its good practice to add white space between passed and local variables.
Gawk programs can contain functions AND pattern/action pairs.
If the pattern is satisfied, the action is called.
/^\.P1/ { if (p != 0) print ".P1 after .P1, line", NR;
p = 1;
}
/^\.P2/ { if (p != 1) print ".P2 with no preceding .P1, line", NR;
p = 0;
}
END { if (p != 0) print "missing .P2 at end" }
Two magic patterns are BEGIN and END. These are true before and after all the input files are read. Use END of end actions (e.g. final reports) and BEGIN for start up actions such as initializing default variables, setting the field separator, resetting the seed of the random number generator:
BEGIN {
while (getline < "Usr.Dict.Words") #slurp in dictionary
dict[$0] = 1
FS=","; #set field seperator
srand(); #reset random seed
Round=10; #always start globals with U.C.
}
The default action is {print $0}; i.e. print the whole line.
The default pattern is 1; i.e. true.
Patterns are checked, top to bottom, in source-code order.
Patterns can contain regular expressions. In the above example /^\.P1/ means "front of line followed by a full stop followed by P1". Regular expressions are important enough for their own section.
Ok, so now we know enough to explain an simple report function. How does hist.awk work in the following?
% cat /etc/passwd | grep -v \# | cut -d: -f 6|sort |
uniq -c | sort -r -n | Gawk -f hist.awk
************************** 26 /var/empty
** 2 /var/virusmails
** 2 /var/root
* 1 /var/xgrid/controller
* 1 /var/xgrid/agent
* 1 /var/teamsserver
* 1 /var/spool/uucp
* 1 /var/spool/postfix
* 1 /var/spool/cups
* 1 /var/pcast/server
* 1 /var/pcast/agent
* 1 /var/imap
* 1 /Library/WebServer
hist.awk reads the maximum width from line one (when NR==1), then scales it to some maximum width value. For each line, it then prints the line ($0) with some stars at front.
NR==1 { Width = Width ? Width : 40 ; sets Width if it is missing
Scale = $1 > Width ? $1 / Width : 1
}
{ Stars=int($1*Scale);
print str(Width - Stars," ") str(Stars,"*") $0
}
# note that, in the following "tmp" is a local variable
function str(n,c, tmp) { # returns a string, size "n", of all "c"
while((n--) > 0 ) tmp= c tmp
return tmp
}
Do you know what these mean?
Well, the first two are leading and trailing blank spaces on a line and the last one is the definition of an IEEE-standard number written as a regular expression. Once we know that, we can do a bunch of common tasks like trimming away white space around a string:
function trim(s, t) {
t=s;
sub(/^[ \t\n]*/,"",t);
sub(/[ \t\n]*$/,"",t);
return t
}
or recognize something that isn't a number:
if ( $i !~ /^[+-]?([0-9]+[.]?[0-9]*|[.][0-9]+)([eE][+-]?[0-9]+)?$/ )
{print "ERROR: " $i " not a number}
Regular expressions are an astonishingly useful tool supported by many languages (e.g. Awk, Perl, Python, Java). The following notes review the basics. For full details, see http://www.gnu.org/manual/Gawk-3.1.1/html_node/Regexp.html#Regexp.
Syntax: Here's the basic building blocks of regular expressions:
c
matches the character c (assuming c is a character with no special meaning in regexps).
\c
matches the literal character c; e.g. tabs and newlines are \t and \n respectively.
.
matches any character except newline.
^
matches the beginning of a line or a string.
$
matches the end of a line or a string.
[abc...]
matches any of the characters ac... (character class).
[^ac...]
matches any character except abc... and newline (negated character class).
r*
matches zero or more r's.
And that's enough to understand our trim function shown above. The regular expression /[ \t]*$/ means trailing whitespace; i.e. zero-or-more spaces or tabs followed by the end of line.
But that's only the start of regular expressions. There's lots more. For example:
r+
matches one or more r's.
r?
matches zero or one r's.
r1|r2
matches either r1 or r2 (alternation).
r1r2
matches r1, and then r2 (concatenation).
(r)
matches r (grouping).
Now we can read ^[+-]?([0-9]+[.]?[0-9]*|[.][0-9]+)([eE][+-]?[0-9]+)?$ like this:
^[+-]? ...
Numbers begin with zero or one plus or minus signs.
...[0-9]+...
Simple numbers are just one or more numbers.
...[.]?[0-9]*...
which may be followed by a decimal point and zero or more digits.
...|[.][0-9]+...
Alternatively, a number can have zero leading numbers and just start with a decimal point.
.... ([eE]...)?$
Also, there may be an exponent added
...[+-]?[0-9]+)?$
and that exponent is a positive or negative bunch of digits.
Gawk has arrays, but they are only indexed by strings. This can be very useful, but it can also be annoying. For example, we can count the frequency of words in a document (ignoring the icky part about printing them out):
Gawk '{for(i=1;i <=NF;i++) freq[$i]++ }' filename
The array will hold an integer value for each word that occurred in the file. Unfortunately, this treats foo'',Foo'', and foo,'' as different words. Oh well. How do we print out these frequencies? Gawk has a specialfor'' construct that loops over the values in an array. This script is longer than most command lines, so it will be expressed as an executable script:
#!/usr/bin/awk -f
{for(i=1;i <=NF;i++) freq[$i]++ }
END{for(word in freq) print word, freq[word] }
You can find out if an element exists in an array at a certain index with the expression:
index in array
This expression tests whether or not the particular index exists, without the side effect of creating that element if it is not present.
You can remove an individual element of an array using the delete statement:
delete array[index]
It is not an error to delete an element which does not exist.
Gawk has a special kind of for statement for scanning an array:
for (var in array)
body
This loop executes body once for each different value that your program has previously used as an index in array, with the variable var set to that index.
There order in which the array is scanned is not defined.
To scan an array in some numeric order, you need to use keys 1,2,3,... and store somewhere that the array is N long. Then you can do the Here are some useful array functions. We begin with the usual stack stuff. These stacks have items 1,2,3,.... and position 0 is reserved for the size of the stack
function top(a) {return a[a[0]]}
function push(a,x, i) {i=++a[0]; a[i]=x; return i}
function pop(a, x,i) {
i=a[0]--;
if (!i) {return ""} else {x=a[i]; delete a[i]; return x}}
The pop function can be used in the usual way:
BEGIN {push(a,1); push(a,2); push(a,3);
while(x=pop(a)) print x
3
2
1
We can catch everything in an array to a string:
function a2s(a, i,s) {
s="";
for (i in a) {s=s " " i "= [" a[i]"]\n"};
return s}
BEGIN {push(L,1); push(L,2); push(L,3);
print a2s(L);}
0= [3]
1= [1]
2= [2]
3= [3]
And we can go the other way and convert a string into an array using the built in split function. These pod files were built using a recursive include function that seeks patterns of the form:
^=include file
This function splits likes on space characters into the array `a' then looks for =include in a[1]. If found, it calls itself recursively on a[2]. Otherwise, it just prints the line:
function rinclude (line, x,a) {
split(line,a,/ /);
if ( a[1] ~ /^\=include/ ) {
while ( ( getline x < a[2] ) > 0) rinclude(x);
close(a[2])}
else {print line}
}
Note that the third argument of the split function can be any regular expression.
By the way, here's a nice trick with arrays. To print the lines in a files in a random order:
BEGIN {srand()}
{Array[rand()]=$0}
END {for(I in Array) print $0}
Short, heh? This is not a perfect solution. Gawk can only generate 1,000,000 different random numbers so the birthday theorem cautions that there is a small chance that the lines will be lost when different lines are written to the same randomly selected location. After some experiments, I can report that you lose around one item after 1,000 inserts and 10 to 12 items after 10,000 random inserts. Nothing to write home about really. But for larger item sets, the above three liner is not what you want to use. For exampl,e 10,000 to 12,000 items (more than 10%) are lost after 100,000 random inserts. Not good!
saya(array [,label,sep,before,after,eq])
Array printing function. Contents printed, sorted on key.
Size of the array
The most common usage is to just use the first two arguments; e.g.
saya(a,"name") ==> name[1] = tim name[2] = menzies
For other usages, see the examples, below.
function saya(a,s, sep0,b4,after,eq, c,m,n,key,val,i,j,tmp,sep) {
sep0 = sep0 ? sep0 : "\n"
b4 = b4 ? b4 : "\n"
after = after ? after : "\n"
eq = eq ? eq : " = "
pre = s ? s"[" : ""
post = s ? "]" : ""
m = asorti(a,b)
printf("%s",b4)
for(i=1;i<=m;i++) {
key=b[i]
val=a[b[i]]
printf("%s", sep pre )
n=split(key,tmp,SUBSEP)
c = ""
for(j=1;j<=n;j++) {
printf("%s", c tmp[j] )
c=","
}
printf("%s", post eq val )
sep=sep0;
};
printf("%s",after)
return m
}
gawk -f saya.awk --source '
BEGIN {
A["fname" ] = "tim"
A["lname" ] = "menzies"
A["address"] = "usa"
saya(A,"",", ","[","]")
print ""
saya(A,"message")
B[2,3,9] = 100
B[10,1,11] = 200
B[1,3,10] = 300
saya(B,"b")
}'
[address = usa, fname = tim, lname = menzies] message[address] = usa message[fname] = tim message[lname] = menzies b[1,3,10] = 300 b[10,1,11] = 200 b[2,3,9] = 100
Tim Menzies
join(a [,start,end,sep])
Joins at array into a string
If sep is set to the magic value SUBSEP then internally, join adds nothing between the items.
A string of a's contents.
gawk -f join.awk --source '
BEGIN { split("tim tom tam",a)
print join(a,2)
}'
tom tam
function join(a,start,end,sep, result,i) {
sep = sep ? start : " "
start = start ? start : 1
end = end ? end : sizeof(a)
if (sep == SUBSEP) # magic value
sep = ""
result = a[start]
for (i = start + 1; i <= end; i++)
result = result sep a[i]
return result
}
In earlier gawks, length(a) did not work in functions. Hence....
function sizeof(a, i,n) { for(i in a) n++ ; return n }
Arnold Robbins, then Tim Menzies
These pages focus on macro pre-processors (a natural application for Awk).
Download from LAWKER.
Generates random signtures. Signatures and generation code included in same file so installation is just a matter of calling one file.
Most of the file is a large "here" document. Paragraph 1 of that document is always added to the signatures, followed one of the folowing paragraphs, selected at radonom.
To add to the signtures, include them in the here document, with one preceeding blank line.
pick1() {
gawk 'BEGIN { srand(); RS="" }
NR==1 { print $0 "\n" }
NR>1 { Recs[rand()] = $0 }
END { for ( R in Recs ) {print Recs[R]; exit}}
' $1
}
cat << SoMEI_mpOSSIblE_sYMBOl | pick1
tim.menzies {
title: dr (Ph.D.) and associate professor;
align: csee, west virginia university;
cell: esb 841A;
url: http://menzies.us;
fyi: unless marked "URGENT", i usually won't get 2 your email b4 5pm;
}
Doing a job RIGHT the first time gets the job done. Doing the job WRONG
fourteen times gives you job security.
Rome did not create a great empire by having meetings, they did it by
killing all those who opposed them.
INDECISION is the key to FLEXIBILITY.
"When a subject becomes totally obsolete we make it a required
course." Peter Drucker
I saw two shooting stars last night but they were only satellites .
Its wrong to wish on space hardware. I wish, I wish, I wish you cared.
-- Billy Bragg
Then, in 1995, came the most amazing event in the
history of programming languages: the introduction
of Java. -- Programming Languages: Principles and Practice
Suburbia is where the developer bulldozes out the trees, then names
the streets after them. --Bill Vaughan
Instant gratification takes too long.
-- Carrie Fisher
Complexity is easy. Simplicity is hard.
--Unknown
Tim Menzies
To rearrange the items in the input list:
nshuffle(Array)
To rearrange the items in a copy of the input list:
shuffle(Array,Copy)
The above calls assumes that array item zero stores the length of the array. If this is not the case, use:
shuffles(Array,Copy)
Download from LAWKER.
Suppose we want to shuffle items an array into a random order. This shuffle sort do so in linear time and memory.
The algorithm comes from the dawn of computer time but I first heard of it from Bart Massey (at Portland State). Thank Bart for the clarity of the explanation and blame me for any silliness in the implementation.
A simple way to shuffle an input array of elements is to:
This algorithm is clearly correct. However, the algorithm requires time quadratic in the size of the list, and 2x space.
We can easily reduce the time complexity to O(N). The only thing done with the input array is to select random elements from it, the order of the elements in it is irrelevant. Therefore, instead of closing the hole left by a removed element by shifting elements, we'll close it by moving the first remaining element of the input array to fill the gap.
Note an important invariant of the algorithm:
the number of elements left in the input array + the number of elements in the output array ------------------------------------------------ = the number of elements initially passed in.
This means that once an element is removed from the input array and the hole filled, there is a fresh hole created right at the beginning of the input array. Let us put the newly removed element in that hole. Now we can dispense with the output array altogether, and just return the input array. Now the space complexity is just x+1.
This code assumes that the array "a" stores its size at "a[0]".
function nshuffle(a, i,j,n,tmp) {
n=a[0]; # a has items at 1...n
for(i=1;i<=n;i++) {
j=i+round(rand()*(n-i));
tmp=a[j];
a[j]=a[i];
a[i]=tmp;
};
return n;
}
function round(x) { return int(x + 0.5) }
nshuffle is fast, but rearranges the order of items in the original list. shuffle generates a new copy of the list with the items in a random order.
function shuffle(a,b) {
for(i in a) b[i]=a[i];
nshuffle(b);
}
nshuffle also assumes that the list is stores the list size at position zero. If this is not the case, use shuffles.
function shuffles(a,b, c,n) {
for(i in a) {n++; c[i]=a[i]};
c[0]=n;
shuffle(c,b);
}
By number of loop iterations
One way to use the above is to run down a list in a random order. For example:
BEGIN {
if (ShuffleDemo) {
if (Seed) { srand(Seed) } else { srand() };
s2i(ShuffleDemo,L1," ");
shuffles(L1,L2);
while(Item =pop(L2)) print Item;
}
}
function s2i(str,a,sep, n,i,tmp) {
n=split(str,tmp,sep);
for(i=1;i<=n;i++) a[i]=tmp[i];
return n;
}
function pop(a, x,i) {
i=a[0]--;
if (!i) {return ""} else {x=a[i]; delete a[i]; return x}
}
The above can be run using
gawk -f shuffle.awk -v ShuffleDemo="aa bb cc dd"
If you run this twice, you'll see two different orderings. Here's one:
cc aa dd bb
And here's another:
dd bb cc aa
If you are generating the above lists very quickly, then be aware that srand() initializes its random number generator using CPU time in seconds. So, if you are calling the above command line many times per second, you can get repeated outputs.
The fix is to supply a seed from the Bash $RANDOM variable:
gawk -f shuffle.awk -v ShuffleDemo="aa bb cc dd" -v Seed=$RANDOM
much faster than once a second, the above call will generate (far) fewer repeats.
If you want to repeat some prior run (say, during debugging), set the Seed variable on the command line using (e.g.)
gawk -f shuffle.awk -v ShuffleDemo="aa bb cc dd" -v Seed=23
This will always print out the same ordering.
Tim Menzies
awkwords --title "Title" file > file.html
awkwords file > file.html
This code requires gawk and bash. To download:
wget http://lawker.googlecode.com/svn/fridge/lib/bash/awkwords chmod +x awkwords
To test the code, apply it to itself:
AwkWords is a simple-to-use markup language for writing documentation for programs whose comment lines start with "#" and whose comments contain HTML code.
For example, awk.info?tools/awkwords shows the html generated from this bash script.
When used with the --title option, a stand alone web page is generated (to control the style of that page, see the CSS function, dicussed below). When used without --title it generated some html suitable for inclusion into other pages.
Also, AwkWords finds all the <h2>, <h3>, <h4>, <h5>, <h6>, <h7>, <h8>, <h9> headings and copies them to a table of contents at the front of the file. Note that AwkWords assumes that the file contains only one <h1> heading- this is printed before the table of contents.
AwkWords adds some short cuts for HTML markup, as well as including nested contents (see below: "including nested content"). This is useful for including, say, program output along with the actual program.
Awkwords is divided into three functions: unhtml fixes the printing of pre-formatted blocks; toc adds the table of contents while includes handles the details of the extra mark-up.
unhtml() { cat $1| gawk '
BEGIN {IGNORECASE=1}
/^<PRE>/ {In=1; print; next}
/^<\/PRE>/ {In=0; print; next}
In {gsub("<","\\<",$0); print; next }
{print $0 }'
}
toc() { cat $1 | gawk '
BEGIN { IGNORECASE = 1 }
/^<[h]1>/ { Header=$0; next}
/^[<]h[23456789]>/ {
T++ ;
Toc[T] = gensub(/(.*)<h(.*)>[ \t]*(.*)[ \t]*<\/h(.*)>(.*)/,
"<""h\\2><""font color=black>\\•</font></a> <""a href=#" T ">\\3</a></h\\4>",
"g",$0)
Pre="<a name="T"></a>" }
{ Line[++N] = Pre $0; Pre="" }
END { print Header;
print "<" "h2>Contents</h2>"
print "<" "div id=\"htmltoc\">"
for(I=1;I<=T;I++) print Toc[I]
print "<" "/div><!--- htmltoc --->"
print "<" "div id=\"htmlbody\">"
for(I=1;I<=N;I++) print Line[I]
print "</" "div><!--- htmlbody --->"
}'
}
The xpand function controls recursive inclusion of content. Note that
includes() { cat $1 | gawk '
function xpand(pre, tmp) {
if ($1 ~ "^#.IN") xpands($2,pre)
else if ($1 ~ "^#.BODY" ) xpandsBody($2,pre)
else if ($1 ~ "^#.LISTING") {
print "<" "pre>"
xpands($2,1) # <===== note the recursive call with "1"
print "<" "/pre>" }
else if ($1 ~ "^#.CODE") {
print "<" "p>" $2 "\n<" "pre>"
xpands($2,1) # <===== note the recursive call with "1"
print "<" "/pre>" }
else if ($1 ~ "^#.URL") {
tmp = $2; $1=$2="";
print "<" "a href=\""tmp"\">" trim($0) "</a>"
}
else if ($1 ~ "^#.TO") {
tmp = $2; $1=$2="";
print "<" "a href=\"mailto:"tmp"\">" trim($0) "</a>"
}
else
xpand1(pre)
}
The xpand1 function controls the printing of a single line. If we are formatting verbatim text, we must remove the start-of-html character "<". Otherwise, we expand any html shortcuts.
function xpand1(pre) {
if (pre)
gsub("<","\\<",$0) # <=== remove start-of-html-character
else {
$0= xpandHtml($0) # <=== expand html short cuts
sub(/^#/,"",$0) }
print $0
}
The function xpandHtml controls the html short cuts
function xpandHtml( str,tag) {
if ($0 ~ /^#\.H1/) {
$1=""
return "<" "h""1><join>" $0 "</join></" "h1>" }
if (sub(/^#\./,"",$1)) {
tag=$1; $1=""
return "<" tag ">" (($0 ~ /^[ \t]*$/) ? "" : $0"</"tag">")
}
return $0
}
The rest of the code is just some book-keeping and managing the recursive addition of content.
function xpands(f,pre) {
if (newFile(f)) {
while((getline <f) > 0) xpand(pre)
close(f) }
}
function xpandsBody(f,pre, using) {
if (newFile(f)) {
while((getline <f) >0) {
if ( !using && ($0 ~ /^[\t ]*$/) ) using = 1
if ( using ) xpand(pre)}
close(f) }
}
function newFile(f) { return ++Seen[f]==1 }
function trim (s) { sub(/^[ \t]*/,"",s); sub(/[ \t]*$/,"",s); return s }
BEGIN { IGNORECASE=1 }
{ xpand() }'
}
If used to generate a full web page, then the following styles are added. Note that the htmltoc class controls the appearance of the table of contents.
css() {
echo "<""STYLE type=\"text/css\">"
cat<<-'EOF'
div.htmltoc h2 { font-size: medium; font-weight: normal;
margin: 0 0 0 0; margin-left: 30px;}
div.htmltoc h3 { font-size: medium; font-weight: normal;
margin: 0 0 0 0; margin-left: 60px;}
div.htmltoc h4 { font-size: medium; font-weight: normal;
margin: 0 0 0 0; margin-left: 90px;}
div.htmltoc h5 { font-size: medium; font-weight: normal;
margin: 0 0 0 0; margin-left: 120px;}
div.htmltoc h6 { font-size: medium; font-weight: normal;
margin: 0 0 0 0; margin-left: 150px;}
div.htmltoc h7 { font-size: medium; font-weight: normal;
margin: 0 0 0 0; margin-left: 180px; }
</STYLE>
EOF
}
main() { cat $1 | includes | unhtml | toc; }
if [ $1 == "--title" ]
then
echo "<""html><""head><""title>$2</title>`css`</head><""body>";
shift 2
main $1
echo "<""/body><""/html>"
else
main $1
fi
There's no checking for valid input (e.g. pre-formatting tags that never close).
If the input file contains no html mark up, the results are pretty messy.
Recursive includes fail silently if the referenced file does not exist.
I don't like the way I need a seperate pass to do "unhtml". I tried making it work within the code but it got messy.
./nbc -[hlx] train test
Download from Lawker
Is less more? Can a few lines of gawk developed in a day or two stand in for a sophisticated state-of-the-art JAVA package? In the general case there may be software engineering advantages to working with rich languages like JAVA. However, in the specific case of a Naive Bayes classifier for discrete data, it is interesting to test if less is indeed more.
A Naive Bayes classifier collects frequency counts of old events, grouped into "classes". Then, if a new event arrives without a classification, it checks through the old list of classes looking for the one with the highest frequency counts for this new event.
The method is called "naive" This assumption allows us to collect frequency counts just on each attribute value (and not pairs, or triples, or quads of values).
In practice, this "naive" strategy works remarkably well- often performs as well as other schemes that try to model interactions between frequencies.
Hence, we call this system a not-so-naive Bayes Classifier.
In summary, the performance on this simple gawk-based Naive Bayes classifier is quite remarkable.
The following table compares classification accuracies between nbc.awk and WEKA's weka.classifiers.NaiveBayes.
All the data sets were discrete so no kernel estimation was used.
Results come from a 10-way cross-val (but no initial randomization of data set order).
The table is sorted by increase mean difference. Nbc.awk does better than WEKA Bayes on the datasets shown at the bottom of the table.
mean significant
difference std. difference?
data in accuracy dev. (alpha=0.01)
-----------------------------------------------
soybean | -3.02 | 2.02 | y
iris | -2.14 | 4.51 | n
zoo | -0.38 | 6.54 | n
primary-tumor | -0.30 | 3.28 | n
audiology | -0.27 | 2.67 | n
mushroom | -0.25 | 0.23 | n
splice | 0.00 | 0.00 | n
kr-vs-kp | 0.00 | 0.00 | n
breast-cancer | 0.00 | 0.00 | n
contact-lenses | 0.00 | 0.00 | n
vote | 0.27 | 1.47 | n
lymph | 0.73 | 5.01 | n
breast-w | 1.63 | 1.32 | y
credit-a | 7.40 | 5.60 | y
letter | 9.44 | 1.40 | y
On the whole, nbc.awk works as well as WEKA Bayes.
The following table compares the runtimes of nbc.awk (awk) vs WEKA BAYES (java) measured in seconds.
Each lines show total times for ten training+test runs (one for each item in the cross val). E.g. letter actually ran in time 4.92 seconds (on average) and this was called 10 times.
Note: the time for dividing files for the x-val is not shown.
The table is sorted on the ratio of awk vs java runtimes. Ratios less than one mean awk ran faster than java. Sampler.awk does better than Weka Bayes on the datasets shown at the bottom of the table (below the middle line).
runtimes (secs) |
-----------------|---------------------
data awk java ratio| insts attrs classes
--------------------------------|---------------------
letter 49.2 17.6 2.8 20,000 17 27
mushroom 10.1 5.9 1.7 8,124 23 3
kr-vs-kp 8.1 5.1 1.6 3,916 37 3
splice 11.3 7.8 1.4 3,190 62 4
soybean 4.2 3.4 1.2 683 36 20
------------------------------------------------------
audiology 2.9 3.4 0.9 226 70 25
primary-tumor 1.3 2.8 0.5 339 18 23
vote 1.0 2.4 0.4 435 17 3
contact-lenses 0.6 2.0 0.3 24 5 4
breast-cancer 0.7 2.4 0.3 286 10 3
credit-a 1.1 3.3 0.3 690 16 3
breast-w 1.0 3.5 0.3 699 10 3
lymph 0.6 2.4 0.2 148 19 5
iris 0.5 2.5 0.2 150 5 4
zoo 0.6 2.4 0.2 101 18 8
------------------------------------------------------
total 93.1 66.8 1.4
All up, the awk-based learner was 40% slower than the JAVA. For larger data sets, JAVA was always faster. However, for smaller datasets (under 1000 instances) the awk version was nearly as fast or faster.
We have run this small Awk script on 100s of megabytes of data, without crashes or core dumps. The code is very memory effecient- unlike the WEKA which loads all the data into RAM.
It is hardly surprising that a state-of-the-art tool kit built and optimized by JAVA gurus can out-perform awk code on large examples. However, what is surprising is that an 32 line AWK script built and debugged in a weekend often works nearly as well, or better.
Perhaps "nbc" is not-so-naive after all.
To check the download, unzip the contents.zip then
chmod +x nbc
./nbc nbceg.train nbceg.test |
gawk -F, '{print $0 "\t " ($1 !=$2 ? " <== bad" : "")}'
This should print:
malign_lymph,malign_lymph metastases,metastases malign_lymph,malign_lymph metastases,metastases malign_lymph,metastases <== bad malign_lymph,malign_lymph malign_lymph,malign_lymph metastases,metastases metastases,metastases metastases,metastases malign_lymph,malign_lymph metastases,metastases malign_lymph,malign_lymph
Here is the nbc.awk code called by the Bash script (shown below).
BEGIN {
#Internal globals:
Total=0 # count of all instances
# Classes # table of class names/frequencies
# Freg # table of counters for values in attributes in classes
# Seen # table of counters for values in attributes
# Attributes # table of number of values per attribute
}
Pass==1 {train()}
Pass==2 {print $NF "," classify()}
function train( i,c) {
Total++;
c=$NF;
Classes[c]++;
for(i=1;i<=NF;i++) {
if ($i=="?") continue;
Freq[c,i,$i]++
if (++Seen[i,$i]==1) Attributes[i]++}
}
function classify( i,temp,what,like,c) {
like = -100000; # smaller than any log
for(c in Classes) {
temp=log(Classes[c]/Total); #uses logs to stop numeric errors
for(i=1;i<NF;i++) {
if ( $i=="?" ) continue;
temp += log((Freq[c,i,$i]+1)/(Classes[c]+Attributes[i]));
};
if ( temp >= like ) {like = temp; what=c}
};
return what;
}
copyleft() { cat<<EOF
nbc: a naive bayes classifier
Copyright (C) 2004 Tim Menzies
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation, version 2.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
EOF
}
usage() { cat<<-EOF
Usage: nbc [FLAGs] TRAIN TEST
Naive bayes classifier.
TRAIN and TEST are comma-separated data files with the same
number of columns. The last column of each is the class
symbol. This classifier learns from TRAIN and then tries
to classify the examples in TEST.
Flags:
-h print this help text
-l copyright notice
-x run an example
EOF
exit
}
nbcDemo() {
main nbceg.train nbceg.test
}
main() {
gawk -F, -f nbc.awk Pass=1 $1 Pass=2 $2
}
demo=""
while getopts "hlx" flag
do case "$flag" in
l) copyleft; exit;;
h) usage; exit ;;
x) demo="nbcDemo";;
esac
done
shift $(($OPTIND - 1))
if [ -n "$demo" ]
then $demo
exit
else main $1 $2
fi
Tim Menzies
Tim Menzies and Andrian Marcus:
Severis is a set of Awk, bash, sed, etc scripts for finding predictors of high severity issues in text reports. Test engineers write such issue reports whenever they encounter anomalies in the code they are inspecting.
Severis was designed to be an audit tool for test engineers, a second "look over the shoulder" to alert a senior engineer if a junior test engineer was doing something strange.
At least for the text issue reports studied by Severis, very simple tools were enough to determine the terms that predicting for different issue severities.
blog comments powered by Disqus