About awk.info
Mar 01: Michael Sanders demos an X-windows GUI for AWK.
Mar 01: Awk100#24: A. Lahm and E. de Rinaldis' patent search, in AWK
Feb 28: Tim Menzies asks this community to write an AWK cookbook.
Feb 28: Arnold Robbins announces a new debugger for GAWK.
Feb 28: Awk100#23: Premysl Janouch offers an IRC bot, in AWK
Feb 28: Updated: the AWK FAQ
Feb 28: Tim Menzies offers a tiny content management system, in Awk.
Jan 31: Comment system added to awk.info. For example, see the discussion at the bottom of ?keys2awk
Jan 31: Martin Cohen shows that Gawk can handle massively long strings (300 million characters).
Jan 31: The AWK FAQ is being updated. For comments/ corrections/ extensions, please mail tim@menzies.us
Jan 31: Martin Cohen finds Awk on the Android platform.
Jan 31: Aleksey Cheusov released a new version of runawk.
Jan 31: Hirofumi Saito contributes a candidate Awk mascot.
Jan 31: Michael Sanders shows how to quickly build an AWK GUI for windows.
Jan 31: Hyung-Hwan Chung offers QSE, an embeddable Awk Interpreter.
Arnold Robbins, an Atlanta native, is a professional programmer and technical author. He has worked with Unix systems since 1980, when he was introduced to a PDP-11 running a version of Sixth Edition Unix.
He has been a heavy AWK user since 1987, when he became involved with gawk, the GNU project's version of AWK. As a member of the POSIX 1003.2 balloting group, he helped shape the POSIX standard for AWK. He is currently the maintainer of gawk and its documentation.
Since late 1997, he and his family have been living happily in Israel.
Arnold Robbins writes in Feb 2010:
I am pleased to announce the availability of a test version of gawk. This version uses a byte-code execution engine, and most importantly, it includes a debugger that works at the level of awk statements! The distribution is available at http://www.skeeve.com/gawk/gawk-3.1.7-bc-d.tar.gz.
This version is the same as 3.1.7, but with a new execution engine and a debugging version of gawk named, rather imaginatively, "dgawk". There is a story here. Circa 2003, a gentleman by the name of John Haque developed the byte-code execution engine and debugger, in the context of a development gawk version, somewhere between 3.1.3 and 3.1.4.
I never integrated the changes as they were massive and I was busy, and I wasn't able to review them. The changes languished, and John disappeared.
Last fall, Stephen Davies, one of my portability team members, agreed to take on the task of bringing the code into the present. With modest help from me, he succeeded. We then went through additional work to get this version portable to some of the more esoteric systems that gawk supports (64-bit Linux, z/OS and VMS).
I thought it was ready for release at the end of December, until another one of my testers found a severe memory leak in the byte code version. It was a bear to track down, and once again Stephen came through. The debugger uses the readline library, and it is purposely similar to GDB. There is only minimal documentation on the debugger; I'd love to have someone volunteer to write a chapter for the gawk manual that explains it fully.
./dgawk -f ../share/awk/round.awk
dgawk> help
backtrace   backtrace [N]
break       break [[filename:]N|function]
clear       clear [[filename :]N|function]
continue    continue [COUNT] - continue program being debugged.
delete      delete [breakpoints] [range]
disable     disable [breakpoints] [range]
display     display var - print value of variable
down        down [N] - move N frames down the stack.
dump        dump [filename] - dump bytecode.
enable      enable [once|del] [breakpoints] [range]
finish      finish - execute until selected stack frame returns.
frame       frame [N] - select and print stack frame number N.
help        help - print list of commands.
ignore      ignore N COUNT - set ignore-count of breakpoint
info        info topic
list        list [-|+|[filename:]lineno|function|range]
next        next [COUNT] - step program
nexti       nexti [COUNT] - step one instruction
print       print var [var] - print value of a variable
quit        quit - exit debugger.
return      return [value]
run         run - start executing program.
set         set var = value - assign value to a scalar
step        step [COUNT] - step program
stepi       stepi [COUNT] - step one instruction
tbreak      tbreak [[filename:]N|function]
trace       trace on|off - print instruction
undisplay   undisplay [N] - remove variable(s)
until       until [[filename:]N|function]
unwatch     unwatch [N] - remove variable(s) from watch list.
up          up [N] - move N frames up the stack.
watch       watch var - set a watchpoint for a variable.
Here is the debugger printing a function definition:
dgawk> list
1 # round.awk --- do normal rounding
2 #
3 # Arnold Robbins, arnold@skeeve.com, Public Domain
4 # August, 1996
5
6 function round(x, ival, aval, fraction)
7 {
8 ival = int(x) # integer part, int() truncates
9
10 # see if fractional part
11 if (ival == x) # no fraction
12 return ival # ensure no decimals
13
14 if (x < 0) {
15 aval = -x # absolute value
Aharon Robbins, the GNU Awk maintainer, answers some questions from Tim Menzies.
Q: What is your favorite programming language (besides gawk)? And why?
A: It depends for what. A long time ago I was a big Korn shell junkie, although these days I would do most high level things in a mixture of bash and awk, with awk doing the heavy lifting.
For lower level things I prefer C++, although I have something of a love/hate relationship with the language. It's possible to write completely unreadable and unmaintainable code in it. It's also possible to write beautiful, clear, absolutely amazing code in it.
I find that going back to C after working daily in C++ is hard, although I do it for gawk maintenance. For new programs I would work in C++, not C. For something big, I'd use the Qt framework for support and portability.
I've been recently living in the C# world for my day job. The development environment is very addictive, but C# hasn't seduced me away from C++.
Q: The open source world is a fascinating development paradigm. I'm therefore very curious to know what prompted you to write gawk?
A: I didn't write it from scratch. I got involved shortly after picking up and reading the Aho, Weinberger & Kernighan book in late 1987 when it came out.
New awk wasn't widely available. I had been involved with USENET since around 1983, and knew about the GNU project. I also had a strong interest in compilers and interpreters, so I got in touch with the GNU project to see if they had an awk clone and to see if I could get involved in upgrading it to "new" awk.
It turned out that they already had a volunteer, David Trueman, who was working on it, but he was happy to have help. He and I worked together until circa 1993 or 1994 when he had to stop being involved, and I became the sole maintainer.
It was a lot of fun. The number of emails of the "I could not get my work done without gawk" sort was amazing; Unix awk would often roll over and die on some of the data sets people were running through gawk.
Things really got shaken down when gawk became part of GNU/Linux distributions; then people were using it as the only awk, instead of alongside Unix awk.
Q: In retrospect, what are the best/worst features of gawk?
A: The best feature is the pattern/action paradigm. The implicit read-a-record loop is wonderful. This is the language's data-driven nature, as opposed to the imperative nature of most languages.
Associative arrays rank second; they are quite powerful.
There are some warts inherited from Unix awk and left unspecified by POSIX. These are relatively minor.
The lack of an explicit concatenation operator is an obvious one.
The lack of real multi-dimensional arrays is another.
There are features just in gawk that in retrospect seem to have been a waste of time, such as bringing out to the awk level the possibility to internationalize a program. I don't think anyone uses that.
IGNORECASE was a huge pain to get right; if I'd known how long it would take, I wouldn't have bothered.
The biggest "lack" is that there isn't an easy, standard way to provide extensibility; there are way too many things in the C library today (and even yesterday) that the awk programmer just can't get to. (Like the chdir system call!) I hope to eventually provide some better mechanisms for this, but I don't know how much actual filling in I can do also.
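To illustrate two of the warts mentioned in this answer, here is a minimal sketch (not part of the interview; the names grid and cell_key are hypothetical). Concatenation has no operator and is written by juxtaposition, and a "multi-dimensional" subscript is really a single string whose parts are glued together with SUBSEP.

```awk
# Build the single string key that awk uses for a[r, c].
function cell_key(r, c) {
    return r SUBSEP c
}

BEGIN {
    name = "g" "awk"            # concatenation by juxtaposition
    print name

    grid[2, 3] = "x"            # stored under the key "2" SUBSEP "3"
    if (cell_key(2, 3) in grid)
        print "found"

    # Recover the two subscripts from the combined key.
    for (key in grid) {
        split(key, idx, SUBSEP)
        print idx[1], idx[2], grid[key]
    }
}
```

Because the key is one flat string, there is no way to ask for "all of row 2" other than scanning every key and splitting it, which is the limitation the answer refers to.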
Q: Under what circumstances would you recommend/not recommend it?
A: Gawk is good for small to medium level programs that have to process text and/or do simple numeric work (summing up columns, averaging, VERY simple statistics work). It has a central place in traditional Unix / Linux shell scripting when portability is a must.
But I wouldn't care to try to write a military air traffic command and control system in gawk, for example. :-)
Q: Gawk has a reputation of being slow...
A: "Slow" compared to what? As far as I've seen, gawk is always faster than Unix awk. Michael Brennan's mawk is even faster, but until recently it has been unmaintained, and it lacks many important, modern features.
Relative to C? Of course. So what? You have to write 5 - 10 times as much C as you do awk to do the same or less. (I remember one program I wrote in C at around 1200 lines and rewrote in under 300 lines of awk, and the awk was clearer and did more.)
Relative to perl? It depends. I have had emails telling me that gawk was faster than perl for what the users were doing. And if not, do I care? Not really - perl is a write-only language, and don't get me started on Perl 6. :-)
All that said, this got me to thinking about a possible bottleneck that I'll be investigating in the near future.
Q: Awk also has a reputation of not being suitable for "real" projects. Is that reputation deserved?
A: I don't think that contention is true: it may be that scripting languages in general have such a reputation - Ronald Loui has written about this, but I don't think the contention is true for scripting languages either.
As is always the case, the answer is "it depends". What is the scale of what you're trying to do? Who is the customer? When Rick Adams was still running UUNET, he used a suite of awk programs to do his accounting. That's as "real" a project as you can get: billing your (hundreds or thousands of) customers for their resource usage. And he used gawk, since Unix awk would just roll over and die. (Unix awk has gotten better as a result of the "competition", but that's a different story. :-)
Q: Are you aware of any landmark projects that use gawk?
A: GNU/Linux. :-)
Not really. Gawk "just works", and that in and of itself is a testimony to its quality and value.
Q: Looking a decade into the future, can you see gawk disappearing? Why (not)?
A: I don't think so. The bigger question is will I still be involved with it 10 years from now? I don't know.
I still have some things I'd like to see happen with it that are interesting and valuable and may even end up being relatively unique. I just have to find the time (or some other volunteers :-) to work on them.
Q: Currently, how are you filling your time?
A: I have a full time job as a software engineer with Intel. I have a wife and four wonderful children, as well as a dog. That's enough right there to keep me busy.
I am the series editor for the Prentice Hall Open Source Software Development Series which also takes some of my time.
And I still try to do some gawk work in between everything else!
In this exchange from comp.lang.awk, Jason Quinn discusses his super-for loop trick. Arnold Robbins then chimes in to say that, with indirect functions, super-for loops could become a generic tool.
Jason Quinn writes:
#shows an example of a superfor loop
BEGIN {
#define loop maximums
loopmax[1]=4
loopmax[2]=6
loopmax[3]=8
loopmax[4]=10
loopmax[5]=12
loopmax[6]=20
#call the loop
superfor(6)
}
function superfor(loopdepth, zz) { # zz is a local variable
currloopnum++
#start of prologue
#end of prologue
for(loopcounter[currloopnum]=1;
loopcounter[currloopnum]<=loopmax[currloopnum];
loopcounter[currloopnum]++) {
if ( loopdepth==1 ) {
#start of superfor body
for (zz=1;zz<=currloopnum;zz++) {
printf loopcounter[zz] FS
}
print ""
#end of superfor body
}
else if ( loopdepth>1 )
superfor(loopdepth-1)
}
#start of epilog
#end of epilog
loopdepth++ ; currloopnum--
}
Arnold Robbins replies:
function superfor(loopdepth, prologue, body, epilogue, zz)
{
    currloopnum++
    @prologue()
    for (loopcounter[currloopnum] = 1;
         loopcounter[currloopnum] <= loopmax[currloopnum];
         loopcounter[currloopnum]++) {
        if (loopdepth == 1) {
            @body()
        }
        else if (loopdepth > 1)
            superfor(loopdepth - 1, prologue,
                     body, epilogue)
    }
    @epilogue()
    loopdepth++ ; currloopnum--
}
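For readers who have not seen the @name() notation, the following is a small sketch of an indirect function call in isolation. It assumes a gawk recent enough to support indirect calls (4.0 or later); the function names shout, whisper and apply are made up for this illustration and are not part of the exchange above.

```awk
# Requires gawk with indirect function call support.
function shout(s)   { return toupper(s) }
function whisper(s) { return tolower(s) }

# fname holds the NAME of the function to call, as a string.
function apply(fname, s) {
    return @fname(s)
}

BEGIN {
    print apply("shout", "Hello")
    print apply("whisper", "Hello")
}
```

In the super-for loop, prologue, body and epilogue play the role of fname here: the caller passes three function names, and the loop skeleton never needs to be edited.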
Andrew Eaton wrote at comp.lang.awk:
I just started with awk and sed, I am more of a perl/C/C++ person. I have a quick question regarding the pipe. In Awk, I am trying to use this construct.
while ((getline < "somedata.txt") > 0)
{print | "mv"} #or could be "mv -v" for verbose.
Is it possible that "print" is no longer printing the value of getline, if so how do I correct it?
Arnold Robbins comments:
The problem here is that `mv' doesn't read standard input, it only processes command lines. Assuming that your data is something like:
oldfile newfile
You can do things two ways:
# build the command and execute it
while ((getline < "somedata.txt") > 0) {
command = "mv " $1 " " $2
system(command)
}
close("somedata.txt")
or this way:
# send commands to the shell
while ((getline < "somedata.txt") > 0) {
printf("mv %s %s\n", $1, $2) | "sh"
}
close("somedata.txt")
close("sh")
The latter is more efficient.
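Both variants paste the file names straight into a shell command, so a name containing a space or a quote would be split or misread by the shell. One possible safeguard, sketched here with a hypothetical helper called shell_quote (it is not part of the original reply), is to wrap each name in single quotes before building the command. The sketch only prints the command instead of running it.

```awk
# Wrap s in single quotes for the shell; an embedded ' becomes '\''.
function shell_quote(s) {
    gsub(/'/, "'\\''", s)
    return "'" s "'"
}

BEGIN {
    old = "my notes.txt"
    new = "it's done.txt"
    command = "mv " shell_quote(old) " " shell_quote(new)
    print command       # shown, not executed
}
```

Note that a data file holding such names could not be split on whitespace; it would need a different field separator (for example FS = "\t"), which is an assumption beyond the example data shown above.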
by Arnold Robbins
From the Gawk Manual.
The sed utility is a stream editor, a program that reads a stream of data, makes changes to it, and passes it on. It is often used to make global changes to a large file or to a stream of data generated by a pipeline of commands. While sed is a complicated program in its own right, its most common use is to perform global substitutions in the middle of a pipeline:
command1 < orig.data | sed 's/old/new/g' | command2 > result
Here, s/old/new/g tells sed to look for the regexp old on each input line and globally replace it with the text new, i.e., all the occurrences on a line. This is similar to awk's gsub function.
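For comparison, the same global substitution written with gsub might look like the sketch below. The wrapper name replace_all is hypothetical; gsub itself is standard awk. As with sed, an & in the replacement text stands for the matched text, so a literal ampersand would need escaping.

```awk
# Return a copy of line with every match of pat replaced by repl.
function replace_all(line, pat, repl) {
    gsub(pat, repl, line)   # modifies the local copy only
    return line
}

BEGIN {
    print replace_all("old wine in old bottles", "old", "new")
}
```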
The following program, awksed.awk, accepts at least two command-line arguments: the pattern to look for and the text to replace it with. Any additional arguments are treated as data file names to process. If none are provided, the standard input is used:
# awksed.awk --- do s/foo/bar/g using just print
# Thanks to Michael Brennan for the idea
function usage()
{
print "usage: awksed pat repl [files...]" > "/dev/stderr"
exit 1
}
BEGIN {
# validate arguments
if (ARGC < 3)
usage()
RS = ARGV[1]
ORS = ARGV[2]
# don't use arguments as files
ARGV[1] = ARGV[2] = ""
}
# look ma, no hands!
{
if (RT == "")
printf "%s", $0
else
print
}
The program relies on gawk's ability to have RS be a regexp, as well as on the setting of RT to the actual text that terminates the record.
The idea is to have RS be the pattern to look for. gawk automatically sets $0 to the text between matches of the pattern. This is text that we want to keep, unmodified. Then, by setting ORS to the replacement text, a simple print statement outputs the text we want to keep, followed by the replacement text.
There is one wrinkle to this scheme, which is what to do if the last record doesn't end with text that matches RS. Using a print statement unconditionally prints the replacement text, which is not correct. However, if the file did not end in text that matches RS, RT is set to the null string. In this case, we can print $0 using printf.
The BEGIN rule handles the setup, checking for the right number of arguments and calling usage if there is a problem. Then it sets RS and ORS from the command-line arguments and sets ARGV[1] and ARGV[2] to the null string, so that they are not treated as file names.
The usage function prints an error message and exits. Finally, the single rule handles the printing scheme outlined above, using print or printf as appropriate, depending upon the value of RT.
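The RS/RT mechanism can also be tried without any input file. The sketch below is only an illustration under stated assumptions: it needs gawk (regexp RS and RT are gawk extensions), the function name replace_stream is made up, and it reads its text from a shell printf command rather than from files named on the command line.

```awk
# Read the output of cmd with RS set to pat, and rebuild the text
# with repl wherever a record terminator (RT) was actually found.
function replace_stream(cmd, pat, repl,    out, saved_rs) {
    saved_rs = RS
    RS = pat
    out = ""
    while ((cmd | getline) > 0)
        out = out $0 (RT == "" ? "" : repl)   # no RT: last record, add nothing
    close(cmd)
    RS = saved_rs
    return out
}

BEGIN {
    printf "%s", replace_stream("printf 'an old cat, an old hat\\n'", "old", "new")
}
```

The final record (" hat" plus the newline) ends at end of input rather than at a match, so RT is empty and no replacement text is appended, which is exactly the wrinkle described above.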
join(a [,start,end,sep])
Joins an array into a string.
If sep is set to the magic value SUBSEP then internally, join adds nothing between the items.
A string of a's contents.
gawk -f join.awk --source '
BEGIN { split("tim tom tam",a)
print join(a,2)
}'
tom tam
function join(a,start,end,sep,    result,i) {
    sep   = sep   ? sep   : " "        # default separator is one space
    start = start ? start : 1
    end   = end   ? end   : sizeof(a)
    if (sep == SUBSEP) # magic value
        sep = ""
    result = a[start]
    for (i = start + 1; i <= end; i++)
        result = result sep a[i]
    return result
}
In earlier gawks, length(a) did not work in functions. Hence....
function sizeof(a, i,n) { for(i in a) n++ ; return n }
Arnold Robbins, then Tim Menzies
gawk -f pschoose
Download from LAWKER
Pagerange : list of pages from command line.
Pages : array with broken out list.
At end: "(n in Pages)" is true if page n should be printed
Set up the list of pages to print.
function set_pagerange( n, m, i, j, f, g)
{
delete Pages
n = split(Pagerange, f, ",")
for (i = 1; i <= n; i++) {
if (index(f[i], "-") != 0) { # a range
m = split(f[i], g, "-")
if (m != 2 || g[1] >= g[2]) {
printf("bad list of pages: %s\n",
f[i]) > "/dev/stderr"
exit 1
}
for (j = g[1]; j <= g[2]; j++)
Pages[j] = 1
} else
Pages[f[i]] = 1
}
}
BEGIN {
# constants
TRUE = 1
FALSE = 0
if (ARGC != 3) {
print "usage: pschoose range-spec file\n" > "/dev/stderr"
exit 1
}
Pagerange = ARGV[1]
delete ARGV[1]
set_pagerange()
}
NR == 1, /^%%Page:/ {
if (! /^%%Page/) {
Prolog[++nprolog] = $0
next
}
}
/^%%Trailer/ || In_trailer {
In_trailer = TRUE
Epilog[++nepilog] = $0
next
}
/^%%Page: / {
++Npage
line = 0
}
# for all non-special lines
{
# only save it if we will want to print it
if (Npage in Pages)
Page[Npage, ++line] = $0
}
END {
# print the prologue
for (i = 1; i in Prolog; i++)
print Prolog[i]
# print the actual body
for (i = 1; i <= Npage; i++) {
if (i in Pages) {
for (j = 1; (i, j) in Page; j++) {
print Page[i, j]
}
}
}
# print the epilog
for (i = 1; i in Epilog; i++)
print Epilog[i]
}
Arnold Robbins
gawk -f psrev.awk
Download from LAWKER
Reverse the pages in a PostScript file.
BEGIN {
# constants
TRUE = 1
FALSE = 0
# Initialize global booleans
Twoup = FALSE
# process command line flags
for (i = 1; i in ARGV && ARGV[i] ~ /^-/; i++) {
if (ARGV[i] == "-2")
Twoup = TRUE
else
printf("psrev: unrecognized option %s\n",
ARGV[i]) > "/dev/stderr"
delete ARGV[i]
}
}
NR == 1, /^%%Page:/ {
if (! /^%%Page/) {
Prolog[++nprolog] = $0
next
}
}
/^%%Trailer/ || In_trailer {
In_trailer = TRUE
Epilog[++nepilog] = $0
next
}
/^%%Page: / {
++Npage
line = 0
}
# for all non-special lines
{
Page[Npage, ++line] = $0
}
END {
# print the prologue
for (i = 1; i in Prolog; i++)
print Prolog[i]
# print the actual body
if (Twoup) {
hasodd = (Npage %2 == 1)
if (hasodd) {
# print last page
for (j = 1; (Npage, j) in Page; j++)
print Page[Npage, j]
# make a fake last page for psnup
printf "%%%%Page: %d %d\n", Npage+1, Npage+1
printf "showpage\n"
print "%%BeginPageSetup"
print "BP"
print "%%EndPageSetup"
print "EP"
}
lastpage = (hasodd ? Npage - 1 : Npage)
for (i = lastpage; i > 0; i -= 2) {
for (k = i - 1; k <= i; k++)
for (j = 1; (k, j) in Page; j++)
print Page[k, j]
}
} else {
# regular 1 up printing
for (i = Npage; i > 0; i--)
for (j = 1; (i, j) in Page; j++)
print Page[i, j]
}
# print the epilog
for (i = 1; i in Epilog; i++)
print Epilog[i]
}
Arnold Robbins
awk [-v Dictionaries="sysdict1 sysdict2 ..."] -f spell.awk -- \
[=suffixfile1 =suffixfile2 ...] [+dict1 +dict2 ...] \
[-strip] [-verbose] [file(s)]
Download from LAWKER.
This program is an example par excellence of the power of awk. Yes, if written in "C", it would run faster. But goodness me, it would be much longer to code. These few lines implement a powerful spell checker, with user-specifiable exception lists. The built-in dictionary is constructed from a list of standard Unix spelling dictionaries, overridable on the command line.
It also offers some tips on how to structure larger-than-ten-line awk programs. In the code below, note the:
(And to write even larger programs, divided into many files, see runawk.)
Dictionaries are simple text files, with one word per line. Unlike those for Unix spell(1), the dictionaries need not be sorted, and there is no dependence on the locale in this program that can affect which exceptions are reported, although the locale can affect their reported order in the exception list. A default list of dictionaries can be supplied via the environment variable DICTIONARIES, but that can be overridden on the command line.
For the purposes of this program, words are located by replacing ASCII control characters, digits, and punctuation (except apostrophe) with ASCII space (32). What remains are the words to be matched against the dictionary lists. Thus, files in ASCII and ISO-8859-n encodings are supported, as well as Unicode files in UTF-8 encoding.
All word matching is case insensitive (subject to the workings of tolower()).
In this simple version, which is intended to support multiple languages, no attempt is made to strip word suffixes, unless the -strip option is supplied.
Suffixes are defined as regular expressions, and may be supplied from suffix files (one per name) named on the command line, or from an internal default set of English suffixes. Comments in the suffix file run from sharp (#) to end of line. Each suffix regular expression should end with $, to anchor the expression to the end of the word. Each suffix expression may be followed by a list of one or more strings that can replace it, with the special convention that "" represents an empty string. For example:
ies$ ie ies y # flies -> fly, series -> series, ties -> tie
ily$ y ily # happily -> happy, wily -> wily
nnily$ n # funnily -> fun
Although it is permissible to include the suffix in the replacement list, it is not necessary to do so, since words are looked up before suffix stripping.
Suffixes are tested in order of decreasing length, so that the longest matches are tried first.
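As a rough illustration of how one suffix rule expands a word into lookup candidates, consider the sketch below. The function name candidates and its string return value are assumptions made for this example; the program itself collects candidates in an array inside strip_suffixes.

```awk
# Apply one suffix rule to word: strip the part matched by regexp and
# append each replacement in turn. Returns the candidates separated
# by spaces, or "" when the suffix does not match.
function candidates(word, regexp, replacements,    n, i, parts, stem, out) {
    if (!match(word, regexp))
        return ""
    stem = substr(word, 1, RSTART - 1)
    n = split(replacements, parts, " ")
    if (n == 0)                     # no replacement list: bare stem
        return stem
    out = ""
    for (i = 1; i <= n; i++) {
        if (parts[i] == "\"\"")     # "" in a suffix file means empty
            parts[i] = ""
        out = out (i > 1 ? " " : "") stem parts[i]
    }
    return out
}

BEGIN {
    print candidates("flies", "ies$", "ie ies y")
}
```

Each candidate would then be looked up in the dictionary; "flies" is accepted as soon as "fly" is found.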
The default output is just a sorted list of unique spelling exceptions, one per line. With the -verbose option, output lines instead take the form
filename:linenumber:exception
Some Unix text editors recognize such lines, and can use them to move quickly to the indicated location.
BEGIN { initialize() }
{ spell_check_line() }
END { report_exceptions() }
function get_dictionaries( files, key)
{
if ((Dictionaries == "") && ("DICTIONARIES" in ENVIRON))
Dictionaries = ENVIRON["DICTIONARIES"]
if (Dictionaries == "") # Use default dictionary list
{
DictionaryFiles["/usr/dict/words"]++
DictionaryFiles["/usr/local/share/dict/words.knuth"]++
}
else # Use system dictionaries from command line
{
split(Dictionaries, files)
for (key in files)
DictionaryFiles[files[key]]++
}
}
function initialize()
{
NonWordChars = "[^" \
"'" \
"ABCDEFGHIJKLMNOPQRSTUVWXYZ" \
"abcdefghijklmnopqrstuvwxyz" \
"\200\201\202\203\204\205\206\207\210\211\212\213\214\215\216\217" \
"\220\221\222\223\224\225\226\227\230\231\232\233\234\235\236\237" \
"\240\241\242\243\244\245\246\247\250\251\252\253\254\255\256\257" \
"\260\261\262\263\264\265\266\267\270\271\272\273\274\275\276\277" \
"\300\301\302\303\304\305\306\307\310\311\312\313\314\315\316\317" \
"\320\321\322\323\324\325\326\327\330\331\332\333\334\335\336\337" \
"\340\341\342\343\344\345\346\347\350\351\352\353\354\355\356\357" \
"\360\361\362\363\364\365\366\367\370\371\372\373\374\375\376\377" \
"]"
get_dictionaries()
scan_options()
load_dictionaries()
load_suffixes()
order_suffixes()
}
function load_dictionaries( file, word)
{
for (file in DictionaryFiles)
{
## print "DEBUG: Loading dictionary " file > "/dev/stderr"
while ((getline word < file) > 0)
Dictionary[tolower(word)]++
close(file)
}
}
function load_suffixes( file, k, line, n, parts)
{
if (NSuffixFiles > 0) # load suffix regexps from files
{
for (file in SuffixFiles)
{
## print "DEBUG: Loading suffix file " file > "/dev/stderr"
while ((getline line < file) > 0)
{
sub(" *#.*$", "", line) # strip comments
sub("^[ \t]+", "", line) # strip leading whitespace
sub("[ \t]+$", "", line) # strip trailing whitespace
if (line == "")
continue
n = split(line, parts)
Suffixes[parts[1]]++
Replacement[parts[1]] = parts[2]
for (k = 3; k <= n; k++)
Replacement[parts[1]]= Replacement[parts[1]] " " parts[k]
}
close(file)
}
}
else # load default table of English suffix regexps
{
split("'$ 's$ ed$ edly$ es$ ing$ ingly$ ly$ s$", parts)
for (k in parts)
{
Suffixes[parts[k]] = 1
Replacement[parts[k]] = ""
}
}
}
function order_suffixes( i, j, key)
{
# Order suffixes by decreasing length
NOrderedSuffix = 0
for (key in Suffixes)
OrderedSuffix[++NOrderedSuffix] = key
for (i = 1; i < NOrderedSuffix; i++)
for (j = i + 1; j <= NOrderedSuffix; j++)
if (length(OrderedSuffix[i]) < length(OrderedSuffix[j]))
swap(OrderedSuffix, i, j)
}
function report_exceptions( key, sortpipe)
{
sortpipe= Verbose ? "sort -f -t: -u -k1,1 -k2n,2 -k3" : "sort -f -u -k1"
for (key in Exception)
print Exception[key] | sortpipe
close(sortpipe)
}
function scan_options( k)
{
for (k = 1; k < ARGC; k++)
{
if (ARGV[k] == "-strip")
{
ARGV[k] = ""
Strip = 1
}
else if (ARGV[k] == "-verbose")
{
ARGV[k] = ""
Verbose = 1
}
else if (ARGV[k] ~ /^=/) # suffix file
{
NSuffixFiles++
SuffixFiles[substr(ARGV[k], 2)]++
ARGV[k] = ""
}
else if (ARGV[k] ~ /^[+]/) # private dictionary
{
DictionaryFiles[substr(ARGV[k], 2)]++
ARGV[k] = ""
}
}
# Remove trailing empty arguments (for nawk)
while ((ARGC > 0) && (ARGV[ARGC-1] == ""))
ARGC--
}
function spell_check_line( k, word)
{
## for (k = 1; k <= NF; k++) print "DEBUG: word[" k "] = \"" $k "\""
gsub(NonWordChars, " ") # eliminate nonword chars
for (k = 1; k <= NF; k++)
{
word = $k
sub("^'+", "", word) # strip leading apostrophes
sub("'+$", "", word) # strip trailing apostrophes
if (word != "")
spell_check_word(word)
}
}
function spell_check_word(word, key, lc_word, location, w, wordlist)
{
lc_word = tolower(word)
## print "DEBUG: spell_check_word(" word ") -> tolower -> " lc_word
if (lc_word in Dictionary) # acceptable spelling
return
else # possible exception
{
if (Strip)
{
strip_suffixes(lc_word, wordlist)
## for (w in wordlist) print "DEBUG: wordlist[" w "]"
for (w in wordlist)
if (w in Dictionary)
break
if (w in Dictionary)
return
}
## print "DEBUG: spell_check():", word
location = Verbose ? (FILENAME ":" FNR ":") : ""
if (lc_word in Exception)
Exception[lc_word] = Exception[lc_word] "\n" location word
else
Exception[lc_word] = location word
}
}
function strip_suffixes(word, wordlist, ending, k, n, regexp)
{
## print "DEBUG: strip_suffixes(" word ")"
split("", wordlist)
for (k = 1; k <= NOrderedSuffix; k++)
{
regexp = OrderedSuffix[k]
## print "DEBUG: strip_suffixes(): Checking \"" regexp "\""
if (match(word, regexp))
{
word = substr(word, 1, RSTART - 1)
if (Replacement[regexp] == "")
wordlist[word] = 1
else
{
split(Replacement[regexp], ending)
for (n in ending)
{
if (ending[n] == "\"\"")
ending[n] = ""
wordlist[word ending[n]] = 1
}
}
break
}
}
## for (n in wordlist) print "DEBUG: strip_suffixes() -> \"" n "\""
}
function swap(a, i, j, temp)
{
temp = a[i]
a[i] = a[j]
a[j] = temp
}
Arnold Robbins and Nelson H.F. Beebe in "Classic Shell Scripting", O'Reilly Books
You know the song:
99 bottles of beer on the wall, 99 bottles of beer. Take one down and pass it around, 98 bottles of beer on the wall.
98 bottles of beer on the wall, 98 bottles of beer. Take one down and pass it around, 97 bottles of beer on the wall.
97 bottles of beer on the wall, 97 bottles of beer. Take one down and pass it around, 96 bottles of beer on the wall.
....
But how do you code it? Here's Wilhelm Weske's version. It is kind of fun but it's a little hard to read:
#!/usr/bin/awk -f
BEGIN{
split( \
"no mo"\
"rexxN"\
"o mor"\
"exsxx"\
"Take "\
"one dow"\
"n and pas"\
"s it around"\
", xGo to the "\
"store and buy s"\
"ome more, x bot"\
"tlex of beerx o"\
"n the wall" , s,\
"x"); for( i=99 ;\
i>=0; i--){ s[0]=\
s[2] = i ; print \
s[2 + !(i) ] s[8]\
s[4+ !(i-1)] s[9]\
s[10]", " s[!(i)]\
s[8] s[4+ !(i-1)]\
s[9]".";i?s[0]--:\
s[0] = 99; print \
s[6+!i]s[!(s[0])]\
s[8] s[4 +!(i-2)]\
s[9]s[10] ".\n";}}
Osamu Aoki has a more maintainable version. Note how all the screen I/O is localized via functions that return strings, rather than printing straight to the screen. This is very useful for maintenance purposes or for including code as libraries in other Awk programs.
BEGIN {
for(i = 99; i >= 0; i--) {
print ubottle(i), "on the wall,", lbottle(i) "."
print action(i), lbottle(inext(i)), "on the wall."
print
}
}
function ubottle(n) {
return \
sprintf("%s bottle%s of beer", n ? n : "No more", n - 1 ? "s" : "")
}
function lbottle(n) {
return \
sprintf("%s bottle%s of beer", n ? n : "no more", n - 1 ? "s" : "")
}
function action(n) {
return \
sprintf("%s", n ? "Take one down and pass it around," : \
"Go to the store and buy some more,")
}
function inext(n) {
return n ? n - 1 : 99
}
Osamu's version is very similar to how it'd be done in C or other languages and it does not take full advantage of Awk's features. So Arnold Robbins wrote a third version that is more data driven. Most of the work is done in a pre-processor and the actual runtime just dumps text decided before the run. This solution might take more time (to do the setup) but it does allow for the simple switching of the interface (just change the last 10 lines).
BEGIN {
# Setup
take = "Take one down, pass it around"
buy = "Go to the store and buy some more"
Instruction[0] = buy
Next[0] = 99
Count[0, 1] = "No more"
Count[0, 0] = "no more"
for (i = 99; i >= 1; i--) {
Instruction[i] = take
Next[i] = i - 1
Count[i, 0] = Count[i, 1] = (i "")
Bottles[i] = "bottles"
}
Bottles[1] = "bottle"
Bottles[0] = "bottles"
# Execution
for (i = 99; i >= 0; i--) {
printf("%s %s of beer on the wall, %s %s of beer.\n",
Count[i, 1],
Bottles[i],
Count[i, 0],
Bottles[i])
printf("%s, %s %s of beer on the wall.\n\n",
Instruction[i],
Count[Next[i], 0],
Bottles[Next[i]])
}
}
I'll drink to that.
Download from LAWKER.
Sorts a Unix style mailbox by "thread", in date+subject order.
This is a script I use quite a lot. It requires gawk, although with some work it could be ported to standard awk. The timezone offset from GMT has to be adjusted to one's local offset, although I could probably eliminate that if I wanted to work on it hard enough.
This took me a while to write and get right, but it's been working flawlessly for a few years now. The script uses the Message-ID header to detect and remove duplicates. It requires GNU Awk for the time/date functions and for an efficiency hack in string concatenation, but could be made to run on a POSIX awk with some work.
BEGIN {
TRUE = 1
FALSE = 0
split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", months, " ")
for (i in months)
Month[months[i]] = i # map name to number
MonthDays[1] = 31
MonthDays[2] = 28 # not used
MonthDays[3] = 31
MonthDays[4] = 30
MonthDays[5] = 31
MonthDays[6] = 30
MonthDays[7] = 31
MonthDays[8] = 31
MonthDays[9] = 30
MonthDays[10] = 31
MonthDays[11] = 30
MonthDays[12] = 31
In_header = FALSE
Body = ""
LocalOffset = 2 # We are two hours ahead of GMT
# These keep --lint happier
Debug = 0
MessageNum = 0
Duplicates = FALSE
}
/^From / {
In_header = TRUE
if (MessageNum)
Text[MessageNum] = Body
MessageNum++
Body = ""
# print MessageNum
}
In_header && /^Date: / {
Date[MessageNum] = compute_date($0)
}
In_header && /^Subject: / {
Subject[MessageNum] = canonacalize_subject($0)
}
In_header && /^Message-[Ii][Dd]: / {
    if (NF == 1) {
        getline junk
        $0 = $0 RT junk    # Preserve original input text!
    }
    # Note: Do not use $0 directly; it's needed as the Body text
    # later on.
    line = tolower($0)
    split(line, linefields)
    message_id = linefields[2]
    Mesg_ID[MessageNum] = message_id    # needed for disambiguating message
    if (message_id in Message_IDs) {
        printf("Message %d is duplicate of %s (%s)\n",
            MessageNum, Message_IDs[message_id],
            message_id) > "/dev/stderr"
        Message_IDs[message_id] = (Message_IDs[message_id] ", " MessageNum)
        Duplicates++
    } else {
        Message_IDs[message_id] = MessageNum ""
    }
}
In_header && /^$/ {
    In_header = FALSE
    # map subject and date to index into text
    if (Debug && (Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]) in SubjectDateId) {
        printf(\
            ("Message %d: Subject <%s> Date <%s> Message-ID <%s> already in" \
            " SubjectDateId (Message %d, s: <%s>, d <%s> i <%s>)!\n"),
            MessageNum, Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum],
            SubjectDateId[Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]],
            Subject[SubjectDateId[Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]]],
            Date[SubjectDateId[Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]]],
            Mesg_ID[SubjectDateId[Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]]]) \
            > "/dev/stderr"
    }
    SubjectDateId[Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]] = MessageNum
    if (Debug) {
        printf("\tMessage Num = %d, length(SubjectDateId) = %d\n",
            MessageNum, length(SubjectDateId)) > "/dev/stderr"
        if (MessageNum != length(SubjectDateId) && ! Printed1) {
            Printed1++
            printf("---> Message %d <---\n", MessageNum) > "/dev/stderr"
        }
    }
    # build up mapping of subject to earliest date for that subject
    if (! (Subject[MessageNum] in FirstDates) ||
        FirstDates[Subject[MessageNum]] > Date[MessageNum])
        FirstDates[Subject[MessageNum]] = Date[MessageNum]
}
{
    Body = Body ($0 "\n")
}
END {
    Text[MessageNum] = Body    # get last message
    if (Debug) {
        printf("length(SubjectDateId) = %d, length(Subject) = %d, length(Date) = %d\n",
            length(SubjectDateId), length(Subject), length(Date))
        printf("length(FirstDates) = %d\n", length(FirstDates))
    }
    # Create new array to sort by thread. Subscript is
    # earliest date, subject, actual date
    for (i in SubjectDateId) {
        n = split(i, t, SUBSEP)
        if (n != 3) {
            printf("yowsa! n != 3 (n == %d)\n", n) > "/dev/stderr"
            exit 1
        }
        # now have subject, date, message-id in t
        # create index into Text
        Thread[FirstDates[t[1]], i] = SubjectDateId[i]
    }
    n = asorti(Thread, SortedThread)    # Shazzam!
    if (Debug) {
        printf("length(Thread) = %d, length(SortedThread) = %d\n",
            length(Thread), length(SortedThread))
    }
    if (n != MessageNum && ! Duplicates) {
        printf("yowsa! n != MessageNum (n == %d, MessageNum == %d)\n",
            n, MessageNum) > "/dev/stderr"
        # exit 1
    }
    if (Debug) {
        for (i = 1; i <= n; i++)
            printf("SortedThread[%d] = %s, Thread[SortedThread[%d]] = %d\n",
                i, SortedThread[i], i, Thread[SortedThread[i]]) > "DUMP1"
        close("DUMP1")
        if (Debug ~ /exit/)
            exit 0
    }
    for (i = 1; i <= MessageNum; i++) {
        if (Debug) {
            printf("Date[%d] = %s\n",
                i, strftime("%c", Date[i]))
            printf("Subject[%d] = %s\n", i, Subject[i])
        }
        printf("%s", Text[Thread[SortedThread[i]]]) > "OUTPUT"
    }
    close("OUTPUT")
    close("/dev/stderr")    # shuts up --lint
}
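The END block relies on asorti() comparing the Thread subscripts as strings, which is why each key starts with the thread's earliest date and why the timestamps are zero-padded to a fixed width. As a rough sketch of the same ordering idea, the following uses only POSIX awk plus sort(1) in place of asorti(). The function name thread_order and the tab-separated "timestamp, subject" input format are made up for illustration and are not part of the script.

```sh
#!/bin/sh
# Sketch only: group messages into threads and order the threads by their
# earliest message. Input lines are "timestamp<TAB>subject" (assumed format).
thread_order() {
    awk -F'\t' '
    {
        date[NR] = sprintf("%011d", $1)   # fixed width: string order == numeric order
        subj[NR] = $2
        # remember the earliest date seen for each subject
        if (!($2 in first) || first[$2] > date[NR])
            first[$2] = date[NR]
    }
    END {
        # key: earliest date of the thread, subject, date of this message
        for (i = 1; i <= NR; i++)
            print first[subj[i]] "\t" subj[i] "\t" date[i]
    }' | LC_ALL=C sort | cut -f2,3
}

printf '300\tbeta\n100\talpha\n200\tbeta\n400\talpha\n' | thread_order
```

With this input, both "alpha" messages come out first (their thread started at 100), followed by the two "beta" messages, each thread in date order.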
Pull apart a date string and convert to timestamp.
function compute_date(date_rec,    fields, year, month, day,
                      hour, min, sec, tzoff, timestamp)
{
    split(date_rec, fields, "[:, ]+")
    if ($2 ~ /Sun|Mon|Tue|Wed|Thu|Fri|Sat/) {
        # Date: Thu, 05 Jan 2006 17:11:26 -0500
        year = fields[5]
        month = Month[fields[4]]
        day = fields[3] + 0
        hour = fields[6]
        min = fields[7]
        sec = fields[8]
        tzoff = fields[9] + 0
    } else {
        # Date: 05 Jan 2006 17:11:26 -0500
        year = fields[4]
        month = Month[fields[3]]
        day = fields[2] + 0
        hour = fields[5]
        min = fields[6]
        sec = fields[7]
        tzoff = fields[8] + 0
    }
    if (tzoff == "GMT" || tzoff == "gmt")
        tzoff = 0
    tzoff /= 100    # assume offsets are in whole hours
    tzoff = -tzoff
    # crude compensation for timezone
    # mktime() wants a local time:
    #    hour + tzoff yields GMT
    #    GMT + LocalOffset yields local time
    hour += tzoff + LocalOffset
    # if moved into next day, reset other values
    if (hour > 23) {
        hour %= 24
        day++
        if (day > days_in_month(month, year)) {
            day = 1
            month++
            if (month > 12) {
                month = 1
                year++
            }
        }
    }
    timestamp = mktime(sprintf("%d %d %d %d %d %d -1",
                       year, month, day, hour, min, sec))
    # timestamps can be 9 or 10 digits.
    # canonicalize them into 11 digits with leading zeros
    return sprintf("%011d", timestamp)
}
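To make the hour arithmetic concrete, here is a small sketch that applies the same steps to the sample header from the comments, using only POSIX awk (so without mktime()). The function name local_fields is invented for this example, LocalOffset is fixed at 2 as in the BEGIN block, and only a rollover into the next day that stays within the same month is handled.

```sh
#!/bin/sh
# Sketch only: split a "Date:" header the same way and shift the hour
# from the sender's zone to local time (LocalOffset = 2 is an assumption).
local_fields() {
    awk -v LocalOffset=2 '{
        split($0, f, "[:, ]+")
        # f[2]=weekday f[3]=day f[4]=month f[5]=year f[6..8]=h,m,s f[9]=zone
        tzoff = -(f[9] / 100)            # -0500 becomes +5 hours to reach GMT
        hour = f[6] + tzoff + LocalOffset
        day = f[3] + 0
        if (hour > 23) {                 # moved into the next day
            hour %= 24
            day++
        }
        printf "%s %s %02d %02d:%s:%s\n", f[5], f[4], day, hour, f[7], f[8]
    }'
}

echo 'Date: Thu, 05 Jan 2006 17:11:26 -0500' | local_fields
# prints: 2006 Jan 06 00:11:26
```

Here 17:11 at -0500 is 22:11 GMT, and adding the two-hour local offset gives 24, which wraps to hour 0 of the following day.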
How many days in the given month?
function days_in_month(month, year)
{
    if (month != 2)
        return MonthDays[month]
    # Gregorian rule: every 4th year, except century years not divisible by 400
    if ((year % 4 == 0 && year % 100 != 0) || year % 400 == 0)
        return 29
    return 28
}
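For reference, the full Gregorian rule treats a year as a leap year when it is divisible by 4, except for century years, which must also be divisible by 400. A quick way to check such a test against known years is a throwaway shell wrapper like the one below. The name feb_days is hypothetical and the snippet is only a sketch, not part of the script.

```sh
#!/bin/sh
# Sketch only: print the number of days in February for a given year.
feb_days() {
    awk -v year="$1" 'BEGIN {
        leap = (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0)
        print (leap ? 29 : 28)
    }'
}

for y in 1900 2000 2007 2008; do
    echo "$y: $(feb_days "$y")"
done
```

The years 1900 and 2000 are the interesting cases: both are divisible by 4, but only 2000 is a leap year.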
Trim out "Re:", white space.
function canonacalize_subject(subj_line)
{
    subj_line = tolower(subj_line)
    sub(/^subject: +/, "", subj_line)
    sub(/^(re: *)+/, "", subj_line)
    sub(/[[:space:]]+$/, "", subj_line)
    gsub(/[[:space:]]+/, " ", subj_line)
    return subj_line
}
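The point of this normalization is that a reply and the message it answers end up with the same subject key, so they land in the same thread. A minimal sketch of that effect, runnable from the shell, might look like this. The wrapper name canon_subject is made up, and the sketch uses [ \t] where the function above uses [[:space:]], on the assumption that some older awks lack POSIX character classes.

```sh
#!/bin/sh
# Sketch only: apply the same substitutions to each Subject: line on stdin.
canon_subject() {
    awk '{
        s = tolower($0)
        sub(/^subject: +/, "", s)    # drop the header name
        sub(/^(re: *)+/, "", s)      # drop any run of leading "re:" markers
        sub(/[ \t]+$/, "", s)        # trim trailing white space
        gsub(/[ \t]+/, " ", s)       # squeeze runs of white space
        print s
    }'
}

printf '%s\n' 'Subject: Sorting a mailbox' \
              'Subject: Re: RE:  Sorting a   mailbox ' | canon_subject
```

Both input lines come out as "sorting a mailbox".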
Copyright 2007, 2008, Arnold David Robbins arnold@skeeve.com