Awk.Info

"Cause a little auk awk
goes a long way."


categories: Who,Feb,2009,ArnoldR

Arnoldr= Arnold Robbins

Arnold Robbins, an Atlanta native, is a professional programmer and technical author. He has worked with Unix systems since 1980, when he was introduced to a PDP-11 running a version of Sixth Edition Unix.

He has been a heavy AWK user since 1987, when he became involved with gawk, the GNU project's version of AWK. As a member of the POSIX 1003.2 balloting group, he helped shape the POSIX standard for AWK. He is currently the maintainer of gawk and its documentation.

Since late 1997, he and his family have been living happily in Israel.


categories: Feb,2010,ArnoldR

New Awk Debugger

Arnold Robbins writes in Feb 2010:

I am pleased to announce the availability of a test version of gawk. This version uses a byte-code execution engine, and most importantly, it includes a debugger that works at the level of awk statements! The distribution is available at http://www.skeeve.com/gawk/gawk-3.1.7-bc-d.tar.gz.

This version is the same as 3.1.7, but with a new execution engine and a debugging version of gawk named, rather imaginatively, "dgawk". There is a story here. Circa 2003, a gentleman by the name of John Haque developed the byte-code execution engine and debugger, in the context of a development gawk version, somewhere between 3.1.3 and 3.1.4.

I never integrated the changes as they were massive and I was busy, and I wasn't able to review them.

The changes languished, and John disappeared.

Last fall, Stephen Davies, one of my portability team members, agreed to take on the task of bringing the code into the present. With modest help from me, he succeeded. We then went through additional work to get this version portable to some of the more esoteric systems that gawk supports (64 bit Linux, z/OS and VMS).

I thought it was ready for release at the end of December, until another one of my testers found a severe memory leak in the byte code version. It was a bear to track down, and once again Stephen came through. The debugger uses the readline library, and it is purposely similar to GDB. There is only minimal documentation on the debugger; I'd love to have someone volunteer to write a chapter for the gawk manual that explains it fully.

Example

./dgawk -f ../share/awk/round.awk 
dgawk> help
backtrace      backtrace [N] 
break          break [[filename:]N|function] 
clear          clear [[filename :]N|function] 
continue       continue [COUNT] - continue program being debugged.
delete         delete [breakpoints] [range] 
disable        disable [breakpoints] [range] 
display        display var - print value of variable
down           down [N] - move N frames down the stack.
dump           dump [filename] - dump bytecode.
enable         enable [once|del] [breakpoints] [range] 
finish         finish - execute until selected stack frame returns.
frame          frame [N] - select and print stack frame number N.
help           help - print list of commands.
ignore         ignore N COUNT - set ignore-count of breakpoint
info           info topic 
list           list [-|+|[filename:]lineno|function|range] 
next           next [COUNT] - step program
nexti          nexti [COUNT] - step one instruction
print          print var [var] - print value of a variable
quit           quit - exit debugger.
return         return [value] 
run            run - start executing program.
set            set var = value - assign value to a scalar
step           step [COUNT] - step program
stepi          stepi [COUNT] - step one instruction
tbreak         tbreak [[filename:]N|function] 
trace          trace on|off - print instruction
undisplay      undisplay [N] - remove variable(s)
until          until [[filename:]N|function] 
unwatch        unwatch [N] - remove variable(s) from watch list.
up             up [N] - move N frames up the stack.
watch          watch var - set a watchpoint for a variable.

Here is the debugger printing a function definition:

dgawk> list
1       # round.awk --- do normal rounding
2       #
3       # Arnold Robbins, arnold@skeeve.com, Public Domain
4       # August, 1996
5
6       function round(x,   ival, aval, fraction)
7       {
8          ival = int(x)    # integer part, int() truncates
9
10         # see if fractional part
11         if (ival == x)   # no fraction
12            return ival   # ensure no decimals
13
14         if (x < 0) {
15            aval = -x     # absolute value

categories: Aug,2009,ArnoldR

Interview with Aharon Robbins

Aharon Robbins, the maintainer of GNU Awk, answers some questions from Tim Menzies.

Q: What is your favorite programming language (besides gawk)? And why?

    A: It depends for what. A long time ago I was a big Korn shell junkie, although these days I would do most high level things in a mixture of bash and awk, with awk doing the heavy lifting.

    For lower level things I prefer C++, although I have something of a love/hate relationship with the language. It's possible to write completely unreadable and unmaintainable code in it. It's also possible to write beautiful, clear, absolutely amazing code in it.

    I find that going back to C after working daily in C++ is hard, although I do it for gawk maintenance. For new programs I would work in C++, not C. For something big, I'd use the Qt framework for support and portability.

    I've been recently living in the C# world for my day job. The development environment is very addictive, but C# hasn't seduced me away from C++.

Q: The open source world is a fascinating development paradigm. I'm therefore very curious to know what prompted you to write gawk?

    A: I didn't write it from scratch. I got involved shortly after picking up and reading the Aho, Weinberger & Kernighan book in late 1987 when it came out.

    New awk wasn't widely available. I had been involved with USENET since around 1983, and knew about the GNU project. I also had a strong interest in compilers and interpreters, so I got in touch with the GNU project to see if they had an awk clone and to see if I could get involved in upgrading it to "new" awk.

    It turned out that they already had a volunteer, David Trueman, who was working on it, but he was happy to have help. He and I worked together until circa 1993 or 1994 when he had to stop being involved, and I became the sole maintainer.

    It was a lot of fun. The number of emails of the "I could not get my work done without gawk" sort was amazing; Unix awk would often roll over and die on some of the data sets people were running through gawk.

    Things really got shaken down when gawk became part of GNU/Linux distributions; then people were using it as the only awk, instead of alongside Unix awk.

Q: In retrospect, what are the best/worst features of gawk?

    A: The best feature is the pattern/action paradigm. The implicit read-a-record loop is wonderful. This is the language's data-driven nature, as opposed to the imperative nature of most languages.

    Associative arrays rank second; they are quite powerful.

    There are some warts inherited from Unix awk and left unspecified by POSIX. These are relatively minor.

    The lack of an explicit concatenation operator is an obvious one.

    The lack of real multi-dimensional arrays is another.

    There are features just in gawk that in retrospect seem to have been a waste of time, such as bringing out to the awk level the possibility to internationalize a program. I don't think anyone uses that.

    IGNORECASE was a huge pain to get right; if I'd known how long it would take, I wouldn't have bothered.

    The biggest "lack" is that there isn't an easy, standard way to provide extensibility; there are way too many things in the C library today (and even yesterday) that the awk programmer just can't get to. (Like the chdir system call!) I hope to eventually provide some better mechanisms for this, but I don't know how much of the actual filling in I can do as well.
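Two of the warts named in this answer, the missing concatenation operator and the missing multi-dimensional arrays, are easy to see in a few lines. The sketch below is only an illustration and is not part of the interview; the variable names (joined, grid, key, idx) are made up for the example.

```awk
BEGIN {
    # Concatenation has no operator of its own: two expressions placed side
    # by side are joined. Unary minus and subtraction bind tighter than
    # concatenation, so negative numbers are best parenthesized first.
    joined = (-12) " " (-24)
    print joined                         # prints: -12 -24

    # Arrays have a single string subscript. grid[2, 3] is shorthand for
    # grid[2 SUBSEP 3], which only simulates a second dimension.
    grid[2, 3] = "x"
    key = 2 SUBSEP 3
    if (key in grid)
        print "grid[2, 3] is stored under one combined key"

    # Recover the two "dimensions" by splitting the key on SUBSEP.
    for (k in grid) {
        split(k, idx, SUBSEP)
        print idx[1], idx[2], grid[k]    # prints: 2 3 x
    }
}
```

Without the parentheses, an expression such as -12 " " -24 is parsed as -12 followed by (" " - 24), which is why explicit grouping is the safer habit.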

Q: Under what circumstances would you recommend/not recommend it?

    A: Gawk is good for small to medium level programs that have to process text and/or do simple numeric work (summing up columns, averaging, VERY simple statistics work). It has a central place in traditional Unix / Linux shell scripting when portability is a must.

    But I wouldn't care to try to write a military air traffic command and control system in gawk, for example. :-)

Q: Gawk has a reputation of being slow...

    A: "Slow" compared to what? As far as I've seen, gawk is always faster than Unix awk. Michael Brennan's mawk is even faster, but until recently it has been unmaintained, and it lacks many important, modern features.

    Relative to C? Of course. So what? You have to write 5 - 10 times as much C as you do awk to do the same or less. (I remember one program I wrote in C at around 1200 lines and rewrote in under 300 lines of awk, and the awk was clearer and did more.)

    Relative to perl? It depends. I have had emails telling me that gawk was faster than perl for what the users were doing. And if not, do I care? Not really - perl is a write-only language, and don't get me started on Perl 6. :-)

    All that said, this got me to thinking about a possible bottleneck that I'll be investigating in the near future.

Q: Awk also has a reputation of not being suitable for "real" projects. Is that reputation deserved?

    A: I don't think that contention is true: it may be that scripting languages in general have such a reputation - Ronald Loui has written about this, but I don't think the contention is true for scripting languages either.

    As is always the case, the answer is "it depends". What is the scale of what you're trying to do? Who is the customer? When Rick Adams was still running UUNET, he used a suite of awk programs to do his accounting. That's as "real" a project as you can get: billing your (hundreds or thousands of) customers for their resource usage. And he used gawk, since Unix awk would just roll over and die. (Unix awk has gotten better as a result of the "competition", but that's a different story. :-)

Q: Are you aware of any landmark projects that use gawk?

    A: GNU/Linux. :-)

    Not really. Gawk "just works", and that in and of itself is a testimony to its quality and value.

Q: Looking a decade into the future, can you see gawk disappearing? Why (not)?

    A: I don't think so. The bigger question is will I still be involved with it 10 years from now? I don't know.

    I still have some things I'd like to see happen with it that are interesting and valuable and may even end up being relatively unique. I just have to find the time (or some other volunteers :-) to work on them.

Q: Currently, how are you filling your time?

    A: I have a full time job as a software engineer with Intel. I have a wife and four wonderful children, as well as a dog. That's enough right there to keep me busy.

    I am the series editor for the Prentice Hall Open Source Software Development Series which also takes some of my time.

    And I still try to do some gawk work in between everything else!


categories: Funky,Tips,Mar,2009,ArnoldR

Super-For Loops

In this exchange from comp.lang.awk, Jason Quinn discusses his super-for loop trick. Arnold Robbins then chimes in to say that, with indirect functions, super-for loops could become a generic tool.

Jason Quinn writes:

  • Frequently when programming, situations arise for me where I need a nested number of for-loops. Such a case arose for me again just recently while I was inventing a dice game. Anyway, here is the implementation that I ended up using to create a "super-for" loop in AWK (a little trickier than C).
  • This simple example merely lists all possible outcomes of rolling 4, 6, 8, 10, 12, and 20 sided dice at once. A super-for loop requires an array to specify the loop indices... here we have 6 dice and the number of sides determines the indices. The code is easily modified for an arbitrary number of dice (which is the whole point).
  • I identify three parts of a super-for, which I call the prologue, body, and epilog. Under most circumstances, I think only the main body would get used.
  • For example:
    #shows an example of a superfor loop
    BEGIN {
    	#define loop maximums
    	loopmax[1]=4
    	loopmax[2]=6
    	loopmax[3]=8
    	loopmax[4]=10
    	loopmax[5]=12
    	loopmax[6]=20
    	#call the loop
    	superfor(6)
    }
    function superfor(loopdepth, zz) { # zz is a local variable
            currloopnum++
    
            #start of prologue
            #end of prologue
    
            for(loopcounter[currloopnum]=1; 
                loopcounter[currloopnum]<=loopmax[currloopnum]; 
                loopcounter[currloopnum]++) {
                    if ( loopdepth==1 ) {
                            #start of superfor body
                            for (zz=1;zz<=currloopnum;zz++) {
                                    printf "%s%s", loopcounter[zz], FS
                                    }
                            print ""
                            #end of superfor body
                            }
                    else if ( loopdepth>1 )
                            superfor(loopdepth-1)
                    }
    
            #start of epilog
            #end of epilog
    
            loopdepth++ ; currloopnum--
            }
    

Arnold Robbins replies:

  • I think this would make a great application for indirect function calls. For example:
    function superfor(loopdepth, prologue, body, epilogue,     zz)
    {
            currloopnum++
    
            @prologue()
    
            for(loopcounter[currloopnum]=1; 
                loopcounter[currloopnum]<=loopmax[currloopnum];
                loopcounter[currloopnum]++) {
                    if ( loopdepth==1 ) {
                            @body()
                    }
                    else if ( loopdepth>1 )
                            superfor(loopdepth-1, prologue,
                                     body, epilogue)
                    }
    
            @epilogue()
    
            loopdepth++ ; currloopnum--
    }
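To see the indirect-call version run end to end, here is a minimal sketch built around Arnold's function. It assumes a gawk with indirect function calls (the @name() syntax is a gawk extension, available in released versions since 4.0). The callback names noop and show, the out array, and the two small "dice" are assumptions made for this illustration.

```awk
# Callback used for the empty prologue and epilogue.
function noop() { }

# Body callback: record and print the current value of every loop counter.
function show(    zz, line) {
    for (zz = 1; zz <= currloopnum; zz++)
        line = line (zz > 1 ? " " : "") loopcounter[zz]
    out[++nout] = line
    print line
}

# prologue, body and epilogue hold function NAMES, called via @name().
function superfor(loopdepth, prologue, body, epilogue) {
    currloopnum++
    @prologue()
    for (loopcounter[currloopnum] = 1;
         loopcounter[currloopnum] <= loopmax[currloopnum];
         loopcounter[currloopnum]++) {
        if (loopdepth == 1)
            @body()
        else if (loopdepth > 1)
            superfor(loopdepth - 1, prologue, body, epilogue)
    }
    @epilogue()
    currloopnum--
}

BEGIN {
    loopmax[1] = 2; loopmax[2] = 3    # a 2-sided and a 3-sided "die"
    superfor(2, "noop", "show", "noop")
}
```

This prints the six combinations from "1 1" through "2 3", one per line. Swapping in a different body only means passing a different function name.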
    

categories: Tips,Apr,2009,ArnoldR

Moving Files with Awk

Andrew Eaton wrote at comp.lang.awk:

I just started with awk and sed; I am more of a perl/C/C++ person. I have a quick question regarding the pipe. In Awk, I am trying to use this construct.

while ((getline < "somedata.txt") > 0)
            {print | "mv"} #or could be "mv -v" for verbose. 

Is it possible that "print" is no longer printing the value of getline, if so how do I correct it?

Arnold Robbins comments:

The problem here is that `mv' doesn't read standard input, it only processes command lines. Assuming that your data is something like:

oldfile newfile

You can do things two ways:

# build the command and execute it
while ((getline < "somedata.txt") > 0) {
          command = "mv " $1 " " $2
          system(command)
}
close("somedata.txt")

or this way:

# send commands to the shell
while ((getline < "somedata.txt") > 0) {
          printf("mv %s %s\n", $1, $2) | "sh"
}
close("somedata.txt")
close("sh")

The latter is more efficient.
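One caveat that applies to both variants: the file names reach the shell unquoted, so a name containing a character such as ; or $ would be interpreted by sh. A possible guard, sketched here under the assumption that names contain no whitespace (the data file is still split on blanks), is to quote each name before building the command. The helpers shquote and mv_command are hypothetical names, not part of the original reply.

```awk
# Wrap s in single quotes for sh, rewriting each embedded ' as '\''
# so that the shell sees it as a literal character.
function shquote(s) {
    gsub(/'/, "'\\''", s)
    return "'" s "'"
}

# Build (but do not run) the mv command for one "oldfile newfile" pair.
function mv_command(src, dst) {
    return "mv " shquote(src) " " shquote(dst)
}

BEGIN {
    # Dry run: print the command instead of passing it to system() or "sh".
    print mv_command("draft;v1.txt", "final.txt")
    # prints: mv 'draft;v1.txt' 'final.txt'
}
```

In the loops above, mv_command($1, $2) would take the place of the hand-built command string.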


categories: Sed,Tips,Apr,2009,ArnoldR

AwkSed: A Simple Stream Editor

by Arnold Robbins

From the Gawk Manual.

The sed utility is a stream editor, a program that reads a stream of data, makes changes to it, and passes it on. It is often used to make global changes to a large file or to a stream of data generated by a pipeline of commands. While sed is a complicated program in its own right, its most common use is to perform global substitutions in the middle of a pipeline:

command1 < orig.data | sed 's/old/new/g' | command2 > result

Here, s/old/new/g tells sed to look for the regexp old on each input line and globally replace it with the text new, i.e., all the occurrences on a line. This is similar to awk's gsub function.
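For comparison, the conventional gsub route looks like the sketch below, which should work in any POSIX awk. The wrapper name replace_all is a hypothetical helper used only for this illustration.

```awk
# Return a copy of text with every match of the regexp pat replaced by repl.
# Note that gsub treats & in repl as "the matched text".
function replace_all(text, pat, repl) {
    gsub(pat, repl, text)    # text is a local copy; the caller's value is untouched
    return text
}

BEGIN {
    print replace_all("old gold, old cold", "old", "new")
    # prints: new gnew, new cnew
}
```

As with sed, the pattern matches anywhere on the line, including inside longer words.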

The following program, awksed.awk, accepts at least two command-line arguments: the pattern to look for and the text to replace it with. Any additional arguments are treated as data file names to process. If none are provided, the standard input is used:

# awksed.awk --- do s/foo/bar/g using just print
#    Thanks to Michael Brennan for the idea

function usage()
{
  print "usage: awksed pat repl [files...]" > "/dev/stderr"
  exit 1
}

BEGIN {
    # validate arguments
    if (ARGC < 3)
        usage()

    RS = ARGV[1]
    ORS = ARGV[2]

    # don't use arguments as files
    ARGV[1] = ARGV[2] = ""
}

# look ma, no hands!
{
    if (RT == "")
        printf "%s", $0
    else
        print
}

The program relies on gawk's ability to have RS be a regexp, as well as on the setting of RT to the actual text that terminates the record.

The idea is to have RS be the pattern to look for. gawk automatically sets $0 to the text between matches of the pattern. This is text that we want to keep, unmodified. Then, by setting ORS to the replacement text, a simple print statement outputs the text we want to keep, followed by the replacement text.

There is one wrinkle to this scheme, which is what to do if the last record doesn't end with text that matches RS. Using a print statement unconditionally prints the replacement text, which is not correct. However, if the file did not end in text that matches RS, RT is set to the null string. In this case, we can print $0 using printf.

The BEGIN rule handles the setup, checking for the right number of arguments and calling usage if there is a problem. Then it sets RS and ORS from the command-line arguments and sets ARGV[1] and ARGV[2] to the null string, so that they are not treated as file names.

The usage function prints an error message and exits. Finally, the single rule handles the printing scheme outlined above, using print or printf as appropriate, depending upon the value of RT.


categories: Timm,Arrays,Function,Feb,2009,ArnoldR

join

Synopsis

join(a [,start,end,sep])

Description

Joins an array into a string.

Arguments

a
input array
start
Index for where to start in the array a. Default=1.
end
Index for where to stop in the array a. Default=size of array.
sep
(OPTIONAL) What to write between each item. Defaults to blank space.

If sep is set to the magic value SUBSEP then internally, join adds nothing between the items.

Returns

A string of a's contents.

Example

gawk/array/eg/join »

gawk -f join.awk --source '
BEGIN { split("tim tom tam",a)
        print join(a,2)
}'

gawk/array/eg/join.out »

tom tam

Source

function join(a,start,end,sep,    result,i) {
    sep   = sep   ? sep   : " "
    start = start ? start : 1
    end   = end   ? end   : sizeof(a)
    if (sep == SUBSEP) # magic value
       sep = ""
    result = a[start]
    for (i = start + 1; i <= end; i++)
        result = result sep a[i]
    return result
}

Helper

In earlier gawks, length(a) did not work in functions. Hence....

function sizeof(a,   i,n) { for(i in a) n++ ; return n }
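The optional arguments are easiest to understand by trying them. The sketch below carries its own copy of join and sizeof so that it runs standalone. It assumes that an omitted sep falls back to a single space and a supplied sep is used as given, which is what the argument description implies. The array name names is made up for the example.

```awk
function join(a, start, end, sep,    result, i) {
    sep   = sep   ? sep   : " "       # default separator: one space
    start = start ? start : 1
    end   = end   ? end   : sizeof(a)
    if (sep == SUBSEP)                # magic value: nothing between items
        sep = ""
    result = a[start]
    for (i = start + 1; i <= end; i++)
        result = result sep a[i]
    return result
}

function sizeof(a,    i, n) { for (i in a) n++; return n }

BEGIN {
    split("tim tom tam", names)
    print join(names)                  # tim tom tam
    print join(names, 2)               # tom tam
    print join(names, 1, 2, "-")       # tim-tom
    print join(names, 1, 3, SUBSEP)    # timtomtam
}
```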

Change Log

  • Jan 24'08: defaults extended to include start,stop
  • Jan 24'08: Sizeof added to handle old gawk bug

Author

Arnold Robbins, then Tim Menzies


categories: Ps,Apr,2009,ArnoldR

pschoose.awk

Contents

Synopsis

Download

Description

Details

Code

Author

Synopsis

gawk -f pschoose.awk range-spec file

Download

Download from LAWKER

Description

Pulls out a range of pages from a PostScript file and prints just those.

Details

Pagerange : list of pages from command line.

Pages : array with broken out list.

At end: "(n in Pages)" is true if page n should be printed

Code

# Set up the list of pages to print.
function set_pagerange(        n, m, i, j, f, g)
{
	delete Pages

	n = split(Pagerange, f, ",")
	for (i = 1; i <= n; i++) {
		if (index(f[i], "-") != 0) { # a range
			m = split(f[i], g, "-")
			if (m != 2 || g[1] >= g[2]) {
				printf("bad list of pages: %s\n",
					f[i]) > "/dev/stderr"
				exit 1
			}
			for (j = g[1]; j <= g[2]; j++)
				Pages[j] = 1
		} else
			Pages[f[i]] = 1
	}
}

BEGIN {
	# constants
	TRUE = 1
	FALSE = 0

	if (ARGC != 3) {
		print "usage: pschoose range-spec file\n" > "/dev/stderr"
		exit 1
	}
	Pagerange = ARGV[1]
	delete ARGV[1]
	set_pagerange()
}

NR == 1, /^%%Page:/ {
	if (! /^%%Page/) {
		Prolog[++nprolog] = $0
		next
	}
}

/^%%Trailer/ || In_trailer {
	In_trailer = TRUE
	Epilog[++nepilog] = $0
	next
}

/^%%Page: /	{
	++Npage
	line = 0
}

# for all non-special lines
{
	# only save it if we will want to print it
	if (Npage in Pages)
		Page[Npage, ++line] = $0
}

END {
	# print the prologue
	for (i = 1; i in Prolog; i++)
		print Prolog[i]

	# print the actual body
	for (i = 1; i <= Npage; i++) {
		if (i in Pages) {
			for (j = 1; (i, j) in Page; j++) {
				print Page[i, j]
			}
		}
	}

	# print the epilog
	for (i = 1; i in Epilog; i++)
		print Epilog[i]
}
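The range-spec handling is the part most worth trying in isolation. The sketch below is a standalone variant of set_pagerange, not the author's code: parse_pages is a hypothetical name, it takes the spec and the target array as arguments instead of using the globals Pagerange and Pages, and it returns 0 for a bad range rather than exiting.

```awk
# Fill pages so that (n in pages) is true for every page named in spec.
# spec is a comma-separated list of page numbers and low-high ranges.
function parse_pages(spec, pages,    n, m, i, j, f, g) {
    split("", pages)                      # portable way to empty an array
    n = split(spec, f, ",")
    for (i = 1; i <= n; i++) {
        if (index(f[i], "-") != 0) {      # a range such as 3-5
            m = split(f[i], g, "-")
            if (m != 2 || g[1] + 0 >= g[2] + 0)
                return 0                  # malformed or backwards range
            for (j = g[1] + 0; j <= g[2] + 0; j++)
                pages[j] = 1
        } else
            pages[f[i]] = 1
    }
    return 1
}

BEGIN {
    if (parse_pages("1,3-5,9", want))
        for (p = 1; p <= 10; p++)
            if (p in want)
                print p                   # prints 1, 3, 4, 5 and 9, one per line
}
```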

Author

Arnold Robbins


categories: Ps,Apr,2009,ArnoldR

psrev.awk

Contents

Synopsis

Download

Description

Code

Author

Synopsis

gawk -f psrev.awk

Download

Download from LAWKER

Description

Reverse the pages in a PostScript file.

Code

BEGIN {
	# constants
	TRUE = 1
	FALSE = 0

	# Initialize global booleans
	Twoup = FALSE

	# process command line flags
	for (i = 1; i in ARGV && ARGV[i] ~ /^-/; i++) {
		if (ARGV[i] == "-2")
			Twoup = TRUE
		else
			printf("psrev: unrecognized option %s\n",
				ARGV[i]) > "/dev/stderr"
		delete ARGV[i]
	}
}

NR == 1, /^%%Page:/ {
	if (! /^%%Page/) {
		Prolog[++nprolog] = $0
		next
	}
}

/^%%Trailer/ || In_trailer {
	In_trailer = TRUE
	Epilog[++nepilog] = $0
	next
}

/^%%Page: /	{
	++Npage
	line = 0
}

# for all non-special lines
{
	Page[Npage, ++line] = $0
}

END {
	# print the prologue
	for (i = 1; i in Prolog; i++)
		print Prolog[i]

	# print the actual body
	if (Twoup) {
		hasodd = (Npage %2 == 1)
		if (hasodd) {
			# print last page
			for (j = 1; (Npage, j) in Page; j++)
				print Page[Npage, j]
			# make a fake last page for psnup
			printf "%%%%Page: %d %d\n", Npage+1, Npage+1
			printf "showpage\n"
			print "%%BeginPageSetup"
			print "BP"
			print "%%EndPageSetup"
			print "EP"
		}
		lastpage = (hasodd ? Npage - 1 : Npage)
		for (i = lastpage; i > 0; i -= 2) {
			for (k = i - 1; k <= i; k++)
				for (j = 1; (k, j) in Page; j++)
					print Page[k, j]
		}
	} else {
		# regular 1 up printing
		for (i = Npage; i > 0; i--)
			for (j = 1; (i, j) in Page; j++)
				print Page[i, j]
	}

	# print the epilog
	for (i = 1; i in Epilog; i++)
		print Epilog[i]
}

Author

Arnold Robbins


categories: Top10,Awk100,Mar,2009,NelsonB,Spell,ArnoldR

spell.awk

Contents

Synopsis

awk [-v Dictionaries="sysdict1 sysdict2 ..."] -f spell.awk -- \
    [=suffixfile1 =suffixfile2 ...] [+dict1 +dict2 ...] \
    [-strip] [-verbose] [file(s)]

Download

Download from LAWKER.

Description

Why Study This Code?

This program is an example par excellence of the power of awk. Yes, if written in "C", it would run faster. But goodness me, it would be much longer to code. These few lines implement a powerful spell checker, with user-specifiable exception lists. The built-in dictionary is constructed from a list of standard Unix spelling dictionaries, overridable on the command line.

It also offers some tips on how to structure larger-than-ten-line awk programs. In the code below, note the following:

  • The code is hundreds of lines long. Yes folks, it's true: Awk is not just a tool for writing one-liners.
  • The code is well-structured. Note, for example, how the BEGIN block is used to initialize the system from files/functions.
  • The code uses two tricks that encourage function reuse:
    • Much of the functionality has been moved out of PATTERN-ACTION and into functions.
    • The number of globals is restricted: note the frequent use of local variables in functions.
  • There is an example, in scan_options, of how to parse command-line arguments.
  • The use of "print pipes" in report_exceptions shows how to link Awk code to other commands.

(And to write even larger programs, divided into many files, see runawk.)

Dictionaries

Dictionaries are simple text files, with one word per line. Unlike those for Unix spell(1), the dictionaries need not be sorted, and there is no dependence on the locale in this program that can affect which exceptions are reported, although the locale can affect their reported order in the exception list. A default list of dictionaries can be supplied via the environment variable DICTIONARIES, but that can be overridden on the command line.

For the purposes of this program, words are located by replacing ASCII control characters, digits, and punctuation (except apostrophe) with ASCII space (32). What remains are the words to be matched against the dictionary lists. Thus, files in ASCII and ISO-8859-n encodings are supported, as well as Unicode files in UTF-8 encoding.

All word matching is case insensitive (subject to the workings of tolower()).
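The word-location rule can be sketched in a few lines. This is a simplified illustration, not the program's own code: extract_words is a hypothetical helper, and it keeps only ASCII letters and the apostrophe, whereas the real NonWordChars set also keeps bytes 128-255.

```awk
# Fill words[1..n] with the lowercased words found in line; return n.
function extract_words(line, words,    k, m, n, parts, w) {
    split("", words)                     # start with an empty result
    gsub(/[^A-Za-z']/, " ", line)        # non-word characters become spaces
    n = split(line, parts)               # split on runs of whitespace
    for (k = 1; k <= n; k++) {
        w = parts[k]
        gsub(/^'+|'+$/, "", w)           # strip leading/trailing apostrophes
        if (w != "")
            words[++m] = tolower(w)      # matching is case insensitive
    }
    return m + 0
}

BEGIN {
    n = extract_words("Don't panic: 42 'quoted' words!", found)
    for (i = 1; i <= n; i++)
        print found[i]                   # don't, panic, quoted, words
}
```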

In this simple version, which is intended to support multiple languages, no attempt is made to strip word suffixes, unless the -strip option is supplied.

Suffixes

Suffixes are defined as regular expressions, and may be supplied from suffix files (one per name) named on the command line, or from an internal default set of English suffixes. Comments in the suffix file run from sharp (#) to end of line. Each suffix regular expression should end with $, to anchor the expression to the end of the word. Each suffix expression may be followed by a list of one or more strings that can replace it, with the special convention that "" represents an empty string. For example:

	ies$	ie ies y	# flies -> fly, series -> series, ties -> tie
	ily$	y ily		# happily -> happy, wily -> wily
	nnily$	n		# funnily -> fun

Although it is permissible to include the suffix in the replacement list, it is not necessary to do so, since words are looked up before suffix stripping.

Suffixes are tested in order of decreasing length, so that the longest matches are tried first.
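To make the replacement-list convention concrete, here is a minimal sketch that applies one suffix rule to one word. It mirrors the idea behind strip_suffixes but is not the program's code: candidates is a hypothetical name, and the rule is passed in directly rather than read from the Suffixes and Replacement tables.

```awk
# Apply one rule (suffix regexp plus replacement list) to word.
# Fills out[] with the candidate spellings as keys; returns how many
# replacements were applied, or 0 if the suffix does not match.
function candidates(word, regexp, repl, out,    stem, n, k, ending) {
    split("", out)
    if (!match(word, regexp))
        return 0
    stem = substr(word, 1, RSTART - 1)   # the word with the suffix removed
    if (repl == "") {                    # no list: the bare stem is the candidate
        out[stem] = 1
        return 1
    }
    n = split(repl, ending)
    for (k = 1; k <= n; k++) {
        if (ending[k] == "\"\"")         # "" stands for the empty string
            ending[k] = ""
        out[stem ending[k]] = 1
    }
    return n
}

BEGIN {
    candidates("flies", "ies$", "ie ies y", c)
    for (w in c)
        print w                          # flie, flies and fly, in no fixed order
}
```

Each candidate would then be looked up in the dictionary, and the word is accepted if any one of them is found.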

Output

The default output is just a sorted list of unique spelling exceptions, one per line. With the -verbose option, output lines instead take the form

	filename:linenumber:exception

Some Unix text editors recognize such lines, and can use them to move quickly to the indicated location.

Code

Top-Level

BEGIN	{ initialize() }
	    { spell_check_line() }
END	    { report_exceptions() }

get_dictionaries

function get_dictionaries(        files, key)
{
    if ((Dictionaries == "") && ("DICTIONARIES" in ENVIRON))
	Dictionaries = ENVIRON["DICTIONARIES"]
    if (Dictionaries == "")	# Use default dictionary list
    {
	DictionaryFiles["/usr/dict/words"]++
	DictionaryFiles["/usr/local/share/dict/words.knuth"]++
    }
    else			# Use system dictionaries from command line
    {
	split(Dictionaries, files)
	for (key in files)
	    DictionaryFiles[files[key]]++
    }
}

Initialize

function initialize()
{
   NonWordChars = "[^" \
	"'" \
	"ABCDEFGHIJKLMNOPQRSTUVWXYZ" \
	"abcdefghijklmnopqrstuvwxyz" \
	"\200\201\202\203\204\205\206\207\210\211\212\213\214\215\216\217" \
	"\220\221\222\223\224\225\226\227\230\231\232\233\234\235\236\237" \
	"\240\241\242\243\244\245\246\247\250\251\252\253\254\255\256\257" \
	"\260\261\262\263\264\265\266\267\270\271\272\273\274\275\276\277" \
	"\300\301\302\303\304\305\306\307\310\311\312\313\314\315\316\317" \
	"\320\321\322\323\324\325\326\327\330\331\332\333\334\335\336\337" \
	"\340\341\342\343\344\345\346\347\350\351\352\353\354\355\356\357" \
	"\360\361\362\363\364\365\366\367\370\371\372\373\374\375\376\377" \
	"]"
    get_dictionaries()
    scan_options()
    load_dictionaries()
    load_suffixes()
    order_suffixes()
}

load_dictionaries

function load_dictionaries(        file, word)
{
    for (file in DictionaryFiles)
    {
	## print "DEBUG: Loading dictionary " file > "/dev/stderr"
	while ((getline word < file) > 0)
	    Dictionary[tolower(word)]++
	close(file)
    }
}

load_suffixes

function load_suffixes(        file, k, line, n, parts)
{
    if (NSuffixFiles > 0)		# load suffix regexps from files
    {
	for (file in SuffixFiles)
	{
	    ## print "DEBUG: Loading suffix file " file > "/dev/stderr"
	    while ((getline line < file) > 0)
	    {
		sub(" *#.*$", "", line)		# strip comments
		sub("^[ \t]+", "", line)	# strip leading whitespace
		sub("[ \t]+$", "", line)	# strip trailing whitespace
		if (line == "")
		    continue
		n = split(line, parts)
		Suffixes[parts[1]]++
		Replacement[parts[1]] = parts[2]
		for (k = 3; k <= n; k++)
		  Replacement[parts[1]]= Replacement[parts[1]] " " parts[k]
	    }
	    close(file)
	}
    }
    else	      # load default table of English suffix regexps
    {
	split("'$ 's$ ed$ edly$ es$ ing$ ingly$ ly$ s$", parts)
	for (k in parts)
	{
	    Suffixes[parts[k]] = 1
	    Replacement[parts[k]] = ""
	}
    }
}

order_suffixes

function order_suffixes(        i, j, key)
{
    # Order suffixes by decreasing length
    NOrderedSuffix = 0
    for (key in Suffixes)
	OrderedSuffix[++NOrderedSuffix] = key
    for (i = 1; i < NOrderedSuffix; i++)
	for (j = i + 1; j <= NOrderedSuffix; j++)
	    if (length(OrderedSuffix[i]) < length(OrderedSuffix[j]))
		swap(OrderedSuffix, i, j)
}

report_exceptions

function report_exceptions(        key, sortpipe)
{
  sortpipe= Verbose ? "sort -f -t: -u -k1,1 -k2n,2 -k3" : "sort -f -u -k1"
  for (key in Exception)
  print Exception[key] | sortpipe
  close(sortpipe)
}

scan_options

function scan_options(        k)
{
    for (k = 1; k < ARGC; k++)
    {
	if (ARGV[k] == "-strip")
	{
	    ARGV[k] = ""
	    Strip = 1
	}
	else if (ARGV[k] == "-verbose")
	{
	    ARGV[k] = ""
	    Verbose = 1
	}
	else if (ARGV[k] ~ /^=/)	# suffix file
	{
	    NSuffixFiles++
	    SuffixFiles[substr(ARGV[k], 2)]++
	    ARGV[k] = ""
	}
	else if (ARGV[k] ~ /^[+]/)	# private dictionary
	{
	    DictionaryFiles[substr(ARGV[k], 2)]++
	    ARGV[k] = ""
	}
    }

    # Remove trailing empty arguments (for nawk)
    while ((ARGC > 0) && (ARGV[ARGC-1] == ""))
        ARGC--
}

spell_check_line

function spell_check_line(        k, word)
{
    ## for (k = 1; k <= NF; k++) print "DEBUG: word[" k "] = \"" $k "\""
    gsub(NonWordChars, " ")		# eliminate nonword chars
    for (k = 1; k <= NF; k++)
    {
	word = $k
	sub("^'+", "", word)		# strip leading apostrophes
	sub("'+$", "", word)		# strip trailing apostrophes
	if (word != "")
	    spell_check_word(word)
    }
}

spell_check_word

function spell_check_word(word,        key, lc_word, location, w, wordlist)
{
    lc_word = tolower(word)
    ## print "DEBUG: spell_check_word(" word ") -> tolower -> " lc_word
    if (lc_word in Dictionary)		# acceptable spelling
	return
    else				# possible exception
    {
	if (Strip)
	{
	    strip_suffixes(lc_word, wordlist)
	    ## for (w in wordlist) print "DEBUG: wordlist[" w "]"
	    for (w in wordlist)
		if (w in Dictionary)
		    break
	    if (w in Dictionary)
		return
	}
	## print "DEBUG: spell_check():", word
	location = Verbose ? (FILENAME ":" FNR ":") : ""
	if (lc_word in Exception)
	    Exception[lc_word] = Exception[lc_word] "\n" location word
	else
	    Exception[lc_word] = location word
    }
}

strip_suffixes

function strip_suffixes(word, wordlist,        ending, k, n, regexp)
{
    ## print "DEBUG: strip_suffixes(" word ")"
    split("", wordlist)
    for (k = 1; k <= NOrderedSuffix; k++)
    {
	regexp = OrderedSuffix[k]
	## print "DEBUG: strip_suffixes(): Checking \"" regexp "\""
	if (match(word, regexp))
	{
	    word = substr(word, 1, RSTART - 1)
	    if (Replacement[regexp] == "")
		wordlist[word] = 1
	    else
	    {
		split(Replacement[regexp], ending)
		for (n in ending)
		{
		    if (ending[n] == "\"\"")
			ending[n] = ""
		    wordlist[word ending[n]] = 1
		}
	    }
	    break
	}
    }
     ## for (n in wordlist) print "DEBUG: strip_suffixes() -> \"" n "\""
}

swap

function swap(a, i, j,        temp)
{
    temp = a[i]
    a[i] = a[j]
    a[j] = temp
}

Author

Arnold Robbins and Nelson H.F. Beebe in "Classic Shell Scripting", O'Reilly Books


categories: Apr,2009,WilhelmW,OsamuA,ArnoldR

99 Bottles of Beer

You know the song:

    99 bottles of beer on the wall, 99 bottles of beer. Take one down and pass it around, 98 bottles of beer on the wall.

    98 bottles of beer on the wall, 98 bottles of beer. Take one down and pass it around, 97 bottles of beer on the wall.

    97 bottles of beer on the wall, 97 bottles of beer. Take one down and pass it around, 96 bottles of beer on the wall.

    ....

But how do you code it? Here's Wilhelm Weske's version. It is kind of fun, but it's a little hard to read:

#!/usr/bin/awk -f

        BEGIN{
       split( \
       "no mo"\
       "rexxN"\
       "o mor"\
       "exsxx"\
       "Take "\
      "one dow"\
     "n and pas"\
    "s it around"\
   ", xGo to the "\
  "store and buy s"\
  "ome more, x bot"\
  "tlex of beerx o"\
  "n the wall" , s,\
  "x"); for( i=99 ;\
  i>=0; i--){ s[0]=\
  s[2] = i ; print \
  s[2 + !(i) ] s[8]\
  s[4+ !(i-1)] s[9]\
  s[10]", " s[!(i)]\
  s[8] s[4+ !(i-1)]\
  s[9]".";i?s[0]--:\
  s[0] = 99; print \
  s[6+!i]s[!(s[0])]\
  s[8] s[4 +!(i-2)]\
  s[9]s[10] ".\n";}}

Osamu Aoki has a more maintainable version. Note how all the screen I/O is localized: the functions return strings rather than printing straight to the screen. This is very useful for maintenance, and for including the code as a library in other Awk programs.

BEGIN { 
   for(i = 99; i >= 0; i--) {
      print ubottle(i), "on the wall,", lbottle(i) "."
      print action(i), lbottle(inext(i)), "on the wall."
      print
   }
}
function ubottle(n) {
   return \
     sprintf("%s bottle%s of beer", n ? n : "No more", n - 1 ? "s" : "")
}
function lbottle(n) {
   return \
     sprintf("%s bottle%s of beer", n ? n : "no more", n - 1 ? "s" : "")
}
function action(n) {
   return \
      sprintf("%s", n ? "Take one down and pass it around," : \
                         "Go to the store and buy some more,")
}
function inext(n) {
   return n ? n - 1 : 99
}
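
To see why returning strings helps, here is a minimal sketch (not part of Osamu's program) that reuses one string-returning function for three different outputs. The names `bottles` and `demo_bottles` are made up for this illustration, and the shell wrapper is only there so the program can be pasted into a terminal.

```shell
# Hypothetical demo: bottles() returns text instead of printing it,
# so the caller decides where the text goes.
demo_bottles() {
    awk '
    function bottles(n) {
        # same pluralisation trick as above: n - 1 is zero only for n == 1
        return sprintf("%s bottle%s of beer", n ? n : "no more", n - 1 ? "s" : "")
    }
    BEGIN {
        print bottles(2) " on the wall."        # plain text output
        printf("<li>%s</li>\n", bottles(1))     # same string, HTML output
        print toupper(bottles(0))               # same string, post-processed
    }'
}
demo_bottles
```

This prints "2 bottles of beer on the wall.", then "<li>1 bottle of beer</li>", then "NO MORE BOTTLES OF BEER". Had the function printed directly, the last two uses would not be possible without rewriting it.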

Osamu's version is very similar to how it would be done in C or other languages, and it does not take full advantage of Awk's features. So Arnold Robbins wrote a third version that is more data driven. Most of the work is done in a setup phase that fills lookup tables, and the execution phase just prints text that was decided during setup. This solution might take more time (to do the setup), but it does allow for simple switching of the interface (just change the execution loop at the end).

BEGIN {
        # Setup
        take = "Take one down, pass it around"
        buy = "Go to the store and buy some more"

        Instruction[0] = buy
        Next[0] = 99
        Count[0, 1] = "No more"
        Count[0, 0] = "no more"

        for (i = 99; i >= 1; i--) {
                Instruction[i] = take
                Next[i] = i - 1
                Count[i, 0] = Count[i, 1] = (i "")
                Bottles[i] = "bottles"
        }
        Bottles[1] = "bottle"
        Bottles[0] = "bottles"
        # Execution
        for (i = 99; i >= 0; i--) {
                printf("%s %s of beer on the wall, %s %s of beer.\n",
                        Count[i, 1],
                        Bottles[i],
                        Count[i, 0],
                        Bottles[i])
                printf("%s, %s %s of beer on the wall.\n\n",
                        Instruction[i],
                        Count[Next[i], 0],
                        Bottles[Next[i]])
        }
}
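
As a rough illustration of "switching the interface", here is a cut-down sketch with 3 bottles instead of 99. The setup tables follow the same idea as Arnold's, but only the execution loop differs: it emits one tab-separated row per verse. The wrapper name `demo_tables`, the variable `top`, the single-index `Count` table and the tab-separated format are all inventions for this example.

```shell
# Hypothetical demo: same kind of tables, different output loop.
demo_tables() {
    awk '
    BEGIN {
        # Setup: fill the lookup tables, cut down to 3 bottles
        top = 3
        Instruction[0] = "Go to the store and buy some more"
        Next[0] = top
        Count[0] = "no more"
        Bottles[0] = "bottles"
        for (i = top; i >= 1; i--) {
            Instruction[i] = "Take one down, pass it around"
            Next[i] = i - 1
            Count[i] = (i "")
            Bottles[i] = (i == 1) ? "bottle" : "bottles"
        }
        # Execution: a different "interface", one tab-separated row per verse
        for (i = top; i >= 0; i--)
            printf("%s %s\t%s\t%s %s\n", Count[i], Bottles[i],
                   Instruction[i], Count[Next[i]], Bottles[Next[i]])
    }'
}
demo_tables
```

The setup half never needs to know which output format is in use. That is the point of deciding the text before the run.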

I'll drink to that.


categories: Mail,Apr,2009,ArnoldR

Mail Sort

Author

Arnold Robbins

Download

Download from LAWKER.

Description

Sorts a Unix-style mailbox by "thread", in date+subject order.

This is a script I use quite a lot. It requires gawk, although with some work it could be ported to standard awk. The timezone offset from GMT has to be adjusted to one's local offset, although I could probably eliminate that if I wanted to work on it hard enough.

This took me a while to write and get right, but it's been working flawlessly for a few years now. The script uses the Message-ID header to detect and remove duplicates. It requires GNU Awk for its time/date functions and for an efficiency hack in string concatenation, but could be made to run on a POSIX awk with some work.
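
The duplicate detection boils down to remembering each Message-ID in an associative array. The following is a simplified sketch of that idea only, in plain POSIX awk. It counts messages by their Message-ID line rather than by "From " line, and the names `Seen` and `demo_dups` and the sample addresses are hypothetical.

```shell
# Hypothetical demo: flag a Message-ID that has been seen before.
demo_dups() {
    printf '%s\n' \
        "Message-ID: <a@example.com>" \
        "Message-Id: <b@example.com>" \
        "Message-ID: <A@example.com>" |
    awk '
    /^Message-[Ii][Dd]: / {
        n++
        id = tolower($2)            # compare IDs case-insensitively
        if (id in Seen)
            printf("message %d duplicates message %d (%s)\n", n, Seen[id], id)
        else
            Seen[id] = n            # remember first message with this ID
    }'
}
demo_dups
```

With the three sample headers this reports "message 3 duplicates message 1 (<a@example.com>)".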

Code

Main

BEGIN {
       TRUE = 1
       FALSE = 0

       split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", months, " ")
       for (i in months)
               Month[months[i]] = i    # map name to number

       MonthDays[1] = 31
       MonthDays[2] = 28       # not used
       MonthDays[3] = 31
       MonthDays[4] = 30
       MonthDays[5] = 31
       MonthDays[6] = 30
       MonthDays[7] = 31
       MonthDays[8] = 31
       MonthDays[9] = 30
       MonthDays[10] = 31
       MonthDays[11] = 30
       MonthDays[12] = 31

       In_header = FALSE
       Body = ""

       LocalOffset = 2 # We are two hours ahead of GMT

       # These keep --lint happier
       Debug = 0
       MessageNum = 0
       Duplicates = FALSE
}

/^From / {
       In_header = TRUE
       if (MessageNum)
               Text[MessageNum] = Body
       MessageNum++
       Body = ""
 # print MessageNum
}

In_header && /^Date: / {
       Date[MessageNum] = compute_date($0)
}

In_header && /^Subject: / {
       Subject[MessageNum] = canonacalize_subject($0)
}

In_header && /^Message-[Ii][Dd]: / {
       if (NF == 1) {
               getline junk
               $0 = $0 RT junk # Preserve original input text!
       }

       # Note: Do not use $0 directly; it's needed as the Body text
       # later on.

       line = tolower($0)
       split(line, linefields)

       message_id = linefields[2]
       Mesg_ID[MessageNum] = message_id        # needed for disambiguating message
       if (message_id in Message_IDs) {
               printf("Message %d is duplicate of %s (%s)\n",
                       MessageNum, Message_IDs[message_id],
                       message_id) > "/dev/stderr"
               Message_IDs[message_id] = (Message_IDs[message_id] ", " MessageNum)
               Duplicates++
       } else {
               Message_IDs[message_id] = MessageNum ""
       }
}


In_header && /^$/ {
       In_header = FALSE
       # map subject and date to index into text

       if (Debug && (Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]) in SubjectDateId) {
               printf(\
       ("Message %d: Subject <%s> Date <%s> Message-ID <%s> already in" \
       " SubjectDateId (Message %d, s: <%s>, d <%s> i <%s>)!\n"),
               MessageNum, Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum],
               SubjectDateId[Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]],
               Subject[SubjectDateId[Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]]],
               Date[SubjectDateId[Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]]],
               Mesg_ID[SubjectDateId[Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]]]) \
                       > "/dev/stderr"
       }

       SubjectDateId[Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]] = MessageNum

       if (Debug) {
               printf("\tMessage Num = %d, length(SubjectDateId) = %d\n",
                       MessageNum, length(SubjectDateId)) > "/dev/stderr"
               if (MessageNum != length(SubjectDateId) && ! Printed1) {
                       Printed1++
                       printf("---> Message %d <---\n", MessageNum) > "/dev/stderr"
               }
       }

       # build up mapping of subject to earliest date for that subject
       if (! (Subject[MessageNum] in FirstDates) ||
           FirstDates[Subject[MessageNum]] > Date[MessageNum])
               FirstDates[Subject[MessageNum]] = Date[MessageNum]
}

{
       Body = Body ($0 "\n")
}

END {
       Text[MessageNum] = Body # get last message

       if (Debug) {
               printf("length(SubjectDateId) = %d, length(Subject) = %d, length(Date) = %d\n",
                       length(SubjectDateId), length(Subject), length(Date))
               printf("length(FirstDates) = %d\n", length(FirstDates))
       }

       # Create new array to sort by thread. Subscript is
       # earliest date, subject, actual date
       for (i in SubjectDateId) {
               n = split(i, t, SUBSEP)
               if (n != 3) {
                       printf("yowsa! n != 3 (n == %d)\n", n) > "/dev/stderr"
                       exit 1
               }
               # now have subject, date, message-id in t
               # create index into Text
               Thread[FirstDates[t[1]], i] = SubjectDateId[i]
       }

       n = asorti(Thread, SortedThread)        # Shazzam!

       if (Debug) {
               printf("length(Thread) = %d, length(SortedThread) = %d\n",
                       length(Thread), length(SortedThread))
       }
       if (n != MessageNum && ! Duplicates) {
               printf("yowsa! n != MessageNum (n == %d, MessageNum == %d)\n",
                       n, MessageNum) > "/dev/stderr"
	#               exit 1
       }

       if (Debug) {
               for (i = 1; i <= n; i++)
                       printf("SortedThread[%d] = %s, Thread[SortedThread[%d]] = %d\n",
                               i, SortedThread[i], i, Thread[SortedThread[i]]) > "DUMP1"
               close("DUMP1")
               if (Debug ~ /exit/)
                       exit 0
       }

       for (i = 1; i <= MessageNum; i++) {
               if (Debug) {
                       printf("Date[%d] = %s\n",
                               i, strftime("%c", Date[i]))
                       printf("Subject[%d] = %s\n", i, Subject[i])
               }

               printf("%s", Text[Thread[SortedThread[i]]]) > "OUTPUT"
       }
       close("OUTPUT")

       close("/dev/stderr")    # shuts up --lint
}

compute_date

Pull apart a date string and convert to timestamp.

function compute_date(date_rec,         fields, year, month, day,
                                       hour, min, sec, tzoff, timestamp)
{
       split(date_rec, fields, "[:, ]+")
       if ($2 ~ /Sun|Mon|Tue|Wed|Thu|Fri|Sat/) {
               # Date: Thu, 05 Jan 2006 17:11:26 -0500
               year = fields[5]
               month = Month[fields[4]]
               day = fields[3] + 0
               hour = fields[6]
               min = fields[7]
               sec = fields[8]
               tzoff = fields[9] + 0
       } else {
               # Date: 05 Jan 2006 17:11:26 -0500
               year = fields[4]
               month = Month[fields[3]]
               day = fields[2] + 0
               hour = fields[5]
               min = fields[6]
               sec = fields[7]
               tzoff = fields[8] + 0
       }
       if (tzoff == "GMT" || tzoff == "gmt")
               tzoff = 0
       tzoff /= 100    # assume offsets are in whole hours
       tzoff = -tzoff

       # crude compensation for timezone
       # mktime() wants a local time:
       #       hour + tzoff yields GMT
       #       GMT + LocalOffset yields local time
       hour += tzoff + LocalOffset

       # if moved into next day, reset other values
       if (hour > 23) {
               hour %= 24
               day++
               if (day > days_in_month(month, year)) {
                       day = 1
                       month++
                       if (month > 12) {
                               month = 1
                               year++
                       }
               }
       }

       timestamp = mktime(sprintf("%d %d %d %d %d %d -1",
                               year, month, day, hour, min, sec))

       # timestamps can be 9 or 10 digits.
       # canonicalize them into 11 digits with leading zeros
       return sprintf("%011d", timestamp)
}
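
The field numbers used in compute_date are easier to follow once you see what the split produces. This small sketch (the wrapper name `demo_date_fields` and the array `f` are made up here) splits the sample header from the comment above with the same separator regexp, then repeats the timezone arithmetic for the hour.

```shell
# Hypothetical demo: show how "[:, ]+" breaks up a Date: header.
demo_date_fields() {
    awk 'BEGIN {
        n = split("Date: Thu, 05 Jan 2006 17:11:26 -0500", f, "[:, ]+")
        for (i = 1; i <= n; i++)
            printf("%d=%s%s", i, f[i], (i < n) ? " " : "\n")
        tzoff = -((f[9] + 0) / 100)     # -0500 becomes 5
        printf("GMT hour: %d\n", f[6] + tzoff)
    }'
}
demo_date_fields
```

The first line of output is "1=Date 2=Thu 3=05 4=Jan 5=2006 6=17 7=11 8=26 9=-0500", which matches the indices used in the first branch of the function. The second line is "GMT hour: 22".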

days_in_month

How many days in the given month?

function days_in_month(month, year)
{
       if (month != 2)
               return MonthDays[month]

       if (year % 4 == 0 && (year % 100 != 0 || year % 400 == 0))
               return 29

       return 28
}
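
For reference, here is a standalone check of the Gregorian leap-year rule (every fourth year, except century years that are not divisible by 400). The helper `is_leap` and the wrapper `demo_leap` are hypothetical names for this sketch and are not part of the script.

```shell
# Hypothetical demo: days in February for a few sample years.
demo_leap() {
    awk '
    function is_leap(year) {
        # every 4th year, except centuries not divisible by 400
        return (year % 4 == 0 && (year % 100 != 0 || year % 400 == 0)) ? 1 : 0
    }
    BEGIN {
        n = split("1900 2000 2008 2009", y, " ")
        for (i = 1; i <= n; i++)
            print y[i], (is_leap(y[i]) ? 29 : 28)
    }'
}
demo_leap
```

This prints "1900 28", "2000 29", "2008 29" and "2009 28". The century years 1900 and 2000 are the cases that a simpler test tends to get wrong.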

canonacalize_subject

Trim out "Re:", white space.

function canonacalize_subject(subj_line)
{
       subj_line = tolower(subj_line)
       sub(/^subject: +/, "", subj_line)
       sub(/^(re: *)+/, "", subj_line)
       sub(/[[:space:]]+$/, "", subj_line)
       gsub(/[[:space:]]+/, " ", subj_line)

       return subj_line
}
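
A quick way to check the canonicalization is to feed it a messy subject line. This sketch repeats the same substitutions inline. As a portability assumption it uses `[ \t]` in place of `[[:space:]]`, so that it also runs on awks without POSIX character classes. The wrapper name `demo_subject` and the sample subject are invented.

```shell
# Hypothetical demo: normalise one Subject: header.
demo_subject() {
    printf '%s\n' "Subject:  Re: RE: Gawk   debugger  " |
    awk '{
        s = tolower($0)
        sub(/^subject: +/, "", s)   # drop the header name
        sub(/^(re: *)+/, "", s)     # drop any run of leading "re:"
        sub(/[ \t]+$/, "", s)       # trim trailing blanks
        gsub(/[ \t]+/, " ", s)      # squeeze inner blanks
        print "<" s ">"
    }'
}
demo_subject
```

The output is "<gawk debugger>", so replies and the original message end up with the same thread key.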

Copyright

Copyright 2007, 2008, Arnold David Robbins arnold@skeeve.com
