"Cause a little auk awk
goes a long way."

About Awk
 »  advocacy
 »  learning
 »  history
 »  Wikipedia entry
 »  mascot
 »  Awk (rarely used)
 »  Nawk (the-one-true, old)
 »  Gawk (widely used)
 »  Mawk
 »  Xgawk (gawk + xml + ...)
 »  Spawk (SQL + awk)
 »  Jawk (Awk in Java JVM)
 »  QTawk (extensions to gawk)
 »  Runawk (a runtime tool)
 »  platform support
 »  one-liners
 »  ten-liners
 »  tips
 »  the Awk 100

Mar 01: Michael Sanders demos an X-windows GUI for AWK.

Mar 01: Awk100#24: A. Lahm and E. de Rinaldis' patent search, in AWK

Feb 28: Tim Menzies asks this community to write an AWK cookbook.

Feb 28: Arnold Robbins announces a new debugger for GAWK.

Feb 28: Awk100#23: Premysl Janouch offers an IRC bot, in AWK.

Feb 28: Updated: the AWK FAQ

Feb 28: Tim Menzies offers a tiny content management system, in Awk.

Jan 31: Comment system added to the site; for example, see the discussion at the bottom of ?keys2awk.

Jan 31: Martin Cohen shows that Gawk can handle massively long strings (300 million characters).

Jan 31: The AWK FAQ is being updated. For comments/corrections/extensions, please mail

Jan 31: Martin Cohen finds Awk on the Android platform.

Jan 31: Aleksey Cheusov released a new version of runawk.

Jan 31: Hirofumi Saito contributes a candidate Awk mascot.

Jan 31: Michael Sanders shows how to quickly build an AWK GUI for windows.

Jan 31: Hyung-Hwan Chung offers QSE, an embeddable Awk Interpreter.

[More ...]


categories: Top10,Mar,2009,Admin

Top 10

The Top 10 page highlights the "best" (most impressive, most insightful, most fun, most visited) pages on this site.

categories: Top10,Papers,Misc,WhyAwk,Jan,2009,Ronl


Why GAWK for AI?

by R. Loui

ACM Sigplan Notices, Volume 31, Number 8, August 1996

Most people are surprised when I tell them what language we use in our undergraduate AI programming class. That's understandable. We use GAWK. GAWK, Gnu's version of Aho, Weinberger, and Kernighan's old pattern scanning language, isn't even viewed as a programming language by most people. Like PERL and TCL, it is usually viewed as a "scripting language." It has no objects; it is not functional; it does no built-in logic programming. Their surprise turns to puzzlement when I confide that (a) the students are allowed to use any language they want, and (b) with a single exception, the best work consistently results from those working in GAWK. (footnote: The exception was a PASCAL programmer who is now an NSF graduate fellow getting a Ph.D. in mathematics at Harvard.) Programmers in C, C++, and LISP haven't even been close (we have not seen work in PROLOG or JAVA).

There are some quick answers that have to do with the pragmatics of undergraduate programming. Then there are more instructive answers that might be valuable to those who debate programming paradigms or to those who study the history of AI languages. And there are some deep philosophical answers that expose the nature of reasoning and symbolic AI. I think the answers, especially the last ones, can be even more surprising than the observed effectiveness of GAWK for AI.

First it must be confessed that PERL programmers can cobble together AI projects well, too. Most of GAWK's attractiveness is reproduced in PERL, and the success of PERL forebodes some of the success of GAWK. Both are powerful string-processing languages that allow the programmer to exploit many of the features of a UNIX environment. Both provide powerful constructions for manipulating a wide variety of data in reasonably efficient ways. Both are interpreted, which can reduce development time. Both have short learning curves. The GAWK manual can be consumed in a single lab session and the language can be mastered by the next morning by the average student. GAWK's automatic initialization, implicit coercion, I/O support and lack of pointers forgive many of the mistakes that young programmers are likely to make. Those who have seen C but not mastered it are happy to see that GAWK retains some of the same sensibilities while adding what must be regarded as a spoonful of syntactic sugar. Some will argue that PERL has superior functionality, but for quick AI applications, the additional functionality is rarely missed. In fact, PERL's terse syntax is not friendly when regular expressions begin to proliferate and strings contain fragments of HTML, WWW addresses, or shell commands. PERL provides new ways of doing things, but not necessarily ways of doing new things.
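Those forgiving defaults are easy to demonstrate. The sketch below is my illustration, not code from the article: awk variables need no declaration, start life as 0 or "", and strings coerce to numbers on demand.

```shell
# Variables are auto-initialized and strings coerce to numbers,
# so a running total needs no declarations or conversions.
printf '3 apples\n4 pears\n' | awk '{ total += $1 } END { print total }'
# prints 7
```

The first field of each line is coerced to a number and summed; a beginner's uninitialized-variable or type-mismatch error simply cannot happen here.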

In the end, despite minor differences, both PERL and GAWK minimize programmer time. Neither really provides the programmer the setting in which to worry about minimizing run-time.

There are further simple answers. Probably the best is the fact that increasingly, undergraduate AI programming is involving the Web. Oren Etzioni (University of Washington, Seattle) has for a while been arguing that the "softbot" is replacing the mechanical engineers' robot as the most glamorous AI test bed. If the artifact whose behavior needs to be controlled in an intelligent way is the software agent, then a language that is well-suited to controlling the software environment is the appropriate language. That would imply a scripting language. If the robot is KAREL, then the right language is "turn left; turn right." If the robot is Netscape, then the right language is something that can generate netscape -remote 'openURL(...)' with elan.

Of course, there are deeper answers. Jon Bentley found two pearls in GAWK: its regular expressions and its associative arrays. GAWK asks the programmer to use the file system for data organization and the operating system for debugging tools and subroutine libraries. There is no issue of user-interface. This forces the programmer to return to the question of what the program does, not how it looks. There is no time spent programming a binsort when the data can be shipped to /bin/sort in no time. (footnote: I am reminded of my IBM colleague Ben Grosof's advice for Palo Alto: Don't worry about whether it's highway 101 or 280. Don't worry if you have to head south for an entrance to go north. Just get on the highway as quickly as possible.)
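Both pearls, and the habit of shipping work to /bin/sort, fit in a few lines. The word-count sketch below is my illustration of the style, not code from the article:

```shell
# Tally words in an associative array, then let sort(1) do the
# ordering rather than coding a sort inside awk.
printf 'the cat saw the dog\n' |
awk '{ for (i = 1; i <= NF; i++) count[$i]++ }
     END { for (w in count) print count[w], w }' |
sort -rn
```

The most frequent word ("the", twice) surfaces first; the ordering itself is delegated to the operating system, exactly in the spirit of the /bin/sort remark above.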

There are some similarities between GAWK and LISP that are illuminating. Both provided a powerful uniform data structure (the associative array implemented as a hash table for GAWK and the S-expression, or list of lists, for LISP). Both were well-supported in their environments (GAWK being a child of UNIX, and LISP being the heart of lisp machines). Both have trivial syntax and find their power in the programmer's willingness to use the simple blocks to build a complex approach.

Deeper still is the nature of AI programming. AI is about functionality and exploratory programming. It is about bottom-up design and the building of ambitions as greater behaviors can be demonstrated. Woe be to the top-down AI programmer who finds that the bottom-level refinements, "this subroutine parses the sentence," cannot actually be implemented. Woe be to the programmer who perfects the data structures for that heap sort when the whole approach to the high-level problem needs to be rethought, and the code is sent to the junk heap the next day.

AI programming requires high-level thinking. There have always been a few gifted programmers who can write high-level programs in assembly language. Most however need the ambient abstraction to have a higher floor.

Now for the surprising philosophical answers. First, AI has discovered that brute-force combinatorics, as an approach to generating intelligent behavior, does not often provide the solution. Chess, neural nets, and genetic programming show the limits of brute computation. The alternative is clever program organization. (footnote: One might add that the former are the AI approaches that work, but that is easily dismissed: those are the AI approaches that work in general, precisely because cleverness is problem-specific.) So AI programmers always want to maximize the content of their program, not optimize the efficiency of an approach. They want minds, not insects. Instead of enumerating large search spaces, they define ways of reducing search, ways of bringing different knowledge to the task. A language that maximizes what the programmer can attempt rather than one that provides tremendous control over how to attempt it, will be the AI choice in the end.

Second, inference is merely the expansion of notation. No matter whether the logic that underlies an AI program is fuzzy, probabilistic, deontic, defeasible, or deductive, the logic merely defines how strings can be transformed into other strings. A language that provides the best support for string processing in the end provides the best support for logic, for the exploration of various logics, and for most forms of symbolic processing that AI might choose to call "reasoning" instead of "logic." The implication is that PROLOG, which saves the AI programmer from having to write a unifier, saves perhaps two dozen lines of GAWK code at the expense of strongly biasing the logic and representational expressiveness of any approach.
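The claim is concrete: if inference is string transformation, a toy "logic" needs only an associative array of facts and a rewrite loop. The sketch below is mine, not Loui's unifier, and the fact and rule are invented for illustration:

```shell
# Facts are strings; a rule maps one token to another; inference
# is repeated rewriting until no new strings appear.
awk 'BEGIN {
  fact["socrates is human"] = 1
  rule["human"] = "mortal"          # "X is human" yields "X is mortal"
  do {
    changed = 0
    split("", new)                  # clear the set of fresh facts
    for (f in fact) {
      split(f, w, " ")
      if (w[3] in rule) {
        g = w[1] " " w[2] " " rule[w[3]]
        if (!(g in fact)) new[g] = 1
      }
    }
    for (g in new) { fact[g] = 1; changed = 1 }
  } while (changed)
  for (f in fact) print f
}' | sort
```

New facts are collected in a separate array and merged after each pass, because adding to an array while iterating it with for-in is unspecified in awk; the loop stops when a pass derives nothing new.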

I view these last two points as news not only to the programming language community, but also to much of the AI community that has not reflected on the past decade's lessons.

In the puny language, GAWK, which Aho, Weinberger, and Kernighan thought not much more important than grep or sed, I find lessons in AI's trends, AI's history, and the foundations of AI. What I have found not only surprising but also hopeful, is that when I have approached the AI people who still enjoy programming, some of them are not the least bit surprised.

categories: Top10,Misc,Papers,WhyAwk,Apr,2009,Ronl

In Praise of Scripting: Real Programming Pragmatism

by Ronald P. Loui
Associate Professor of CSE
Washington University in St. Louis

(Pre-publication draft; copyright reserved by author. A subsequent version of this document appeared as IEEE Computer, vol. 41, no. 7, July 2008).

    This article's main purpose is to review the changes in programming practices known collectively as the "rise of scripting," as predicted in 1998 IEEE COMPUTER by Ousterhout. This attempts to be both brief and definitive, drawing on many of the essays that have appeared in online forums. The main new idea is that programming language theory needs to move beyond semantics and take language pragmatics more seriously.

To the credit of this journal, it had the courage to publish the signal paper on scripting, John Ousterhout's "Scripting: Higher Level Programming for the 21st Century" in 1998. Today, that document rolls forward with an ever-growing list of positive citations. More importantly, every major observation in that paper seems now to be entrenched in software practice today; every benefit claimed for scripting appears to be genuine (flexibility of typelessness, rapid turnaround of interpretation, higher level semantics, development speed, appropriateness for gluing components and internet programming, ease of learning and increase in amount of casual programming).

Interestingly, IEEE COMPUTER also just printed one of the most canonical attacks on scripting, by one Diomidis Spinellis, 2005, "Java Makes Scripting Languages Irrelevant?" Part of what makes this attack interesting is that the author seems unconvinced of his own title; the paper concludes with more text devoted to praising scripting languages than it expends in its declaration of Java's progress toward improved usability. It is unclear what is a better recommendation for scripting: the durability of Ousterhout's text or the indecisiveness of this recent critic's.

The real shock is that the academic programming language community continues to reject the sea change in programming practices brought about by scripting. Enamored of the object-oriented paradigm, especially in the undergraduate curriculum, unwilling to accept the LAMP (Linux-Apache-MySQL-Perl/Python/Php) tool set, and firmly believing that more programming theory leads to better programming practice, the academics seem blind to the facts on the ground. The ACM flagship, COMMUNICATIONS OF THE ACM for example, has never published a paper recognizing the scripting philosophy, and the references throughout the ACM Digital Library to scripting are not encouraging.

Part of the problem is that scripting has risen in the shadow of object-oriented programming and highly publicized corporate battles between Sun, Netscape, and Microsoft with their competing software practices. Scripting has been appearing language by language, including object-oriented scripting languages now. Another part of the problem is that scripting is only now mature enough to stand up against its legitimate detractors. Today, there are answers to many of the persistent questions about scripting: is there a scripting language appropriate for the teaching of CS1 (the first programming course for majors in the undergraduate computing curriculum)? Is there a scripting language for enterprise or real-time applications? Is there a way for scripting practices to scale to larger software engineering projects?

I intend to review the recent history briefly for those who have not yet joined the debate, then present some of the answers that scripting advocates now give to those nagging questions. Finally, I will describe how a real pragmatism of academic interest in programming languages would have better prepared the academic computing community to see the changes that have been afoot.

1996-1998 are perhaps the most interesting years in the phylogeny of scripting. In those years, perl "held the web together", and together with a new POSIX awk and GNU gawk, was shipping with every new Linux. Meanwhile javascript was being deployed furiously (javascript bearing no important relation to java, having been renamed from "livescript" for purely corporate purposes, apparently a sign of Netscape's solidarity with Sun, and even renamed "jscript" under Microsoft). Also, a handoff from tcl/tk to python was taking place as the language of choice for GUI developers who would not yield to Microsoft's VisualBasic. Php appeared in those years, though it would take another round of development before it would start displacing server-side perl, cold fusion, and asp. Every one of these languages is now considered a classic, even prototypical, scripting language.

Already by mid-decade, the shift from scheme to java as the dominant CS1 language was complete, and the superiority of c++ over c was unquestioned in industry. But java applets were not well supported in browsers, so the appeal of "write once, run everywhere" quickly became derided as "write once, debug everywhere." Web page forms, which used the common gateway interface (cgi) were proliferating, and systems programming languages like c became recognized as overkill for server-side programming. Developers quickly discovered the main advantage of perl for cgi forms processing, especially in the dot-com setting: it minimized the programmer's write-time. What about performance? The algorithms were simple, network latency masked small delays, and database performance was built into the database software. It turned out that the bottleneck was the programming. Even at run-time, the network and disk properties were the problems, not the cpu processing. What about maintenance? The developers and management were both happy to rewrite code for redesigned services rather than deal with legacy code. Scripting, it turns out, was so powerful and programmer-friendly that it was easier to create new scripts from scratch than to modify old programs. What about user interface? After all, by 1990, most of the programming effort had become the writing of the GUI, and the object-oriented paradigm had much of its momentum in the inheritance of interface widget behaviors. Surprisingly, the interface that most programmers needed could be had in a browser. The html/javascript/cgi trio became the GUI, and if more was needed, then ambitious client-side javascript was more reliable than the browser's java virtual machine. Moreover, the server-side program was simply a better way to distribute automation in a heterogeneous internet than the downloadable client-side program, regardless of whether the download was in binary or bytecode.

Although there was not agreement on what exact necessary and sufficient properties characterized scripting and distinguished it from "more serious" programming, several things were clear:

  • scripting permitted rapid development, often regarded as merely "rapid prototyping," but subsequently recognized as a kind of agile programming;
  • scripting was the kind of high-level programming that had always been envisioned, in the ascent from low-level assembly language programming to higher levels of abstraction: it was concise, and it removed programmers from concerning themselves with many performance and memory management details;
  • scripting was well suited to the majority of a programming task, usually the accumulation, extraction, and transformation of data, followed eventually by its presentation, so that only the performance-critical portion of a project had to be written in a more cumbersome, high-performance language;
  • it was easier to get things right when source code was short, when behavior was determined by code that fit on a page, all types were easily coerced into strings for trace-printing, code fragments could be interpreted, identifiers were short, and when the programmer could turn ideas into code quickly without losing focus.

This last point was extremely counterintuitive. Strong typing, naming regimen, and verbosity were motivated mainly by a desire to help the programmer avoid errors. But the programmer who had to generate too many keystrokes and consult too many pages, who had to search through many different files to discover semantics, and who had to follow too many rules, who had to sustain motivation and concentration over a long period of time, was a distracted and consequently inefficient programmer. Just as vast libraries did not deliver the promise of greater reusability, and virtual machines did not deliver the promise of platform-independence, the language's promise to discipline the programmer quite simply did not reduce the tendency of humans to err. It exchanged one kind of frequent error for another.

Scripting languages became the favorite tools of the independent-minded programmers: the "hackers" yes, but also the gifted and genius programmers who tended to drive a project's design and development. As Paul Graham noted (in a column reprinted in "Hackers and Painters"), one of the lasting and legitimate benefits of java is that it permits managers to level the playing field and extract considerable productivity from the less talented and less motivated programmers (hence, more disposable). There was a corollary to this difference between the mundane and the liberating:

  • scripting was not enervating but was actually renewing: programmers who viewed code generation as tedious and tiresome in contrast viewed scripting as rewarding self-expression or recreation.

The distinct features of scripting languages that produce these effects are usually enumerated as semantic features, starting with low I/O specification costs, the use of implicit coercion and weak typing, automatic variable initialization with optional declaration, predominant use of associative arrays for storage and regular expressions for pattern matching, reduced syntax, and powerful control structures. But the main reason for the productivity gains may be found in the name "scripting" itself. To script an environment is to be powerfully embedded in that environment. In the same way that the dolphin reigns over the open ocean, lisp is a powerful language for those who would customize their emacs, javascript is feral among browsers, and gawk and perl rule the linux jungle.
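Several of those features show up at once in even a trivial pattern-action program. The log-tallying sketch below is my example, not one from the essay; it uses regular-expression patterns, an associative array, implicit initialization, and no declarations:

```shell
# Each regex pattern guards an action; the counters spring into
# existence on first use, with no setup code at all.
printf 'GET /a\nPOST /b\nGET /c\n' |
awk '/^GET/  { n["get"]++ }
     /^POST/ { n["post"]++ }
     END     { print n["get"], n["post"] }'
# prints: 2 1
```

The I/O loop, the field splitting, and the variable initialization are all implicit; only the intent remains on the page, which is the productivity point being made above.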

There is even a hint of AI in the idea of scripting: the scripting language is the way to get high level control, to automate by capturing the intentions and routines normally provided by the human. If recording and replaying macros is a kind of autopilot, then scripting is a kind of proxy for human decisionmaking. Nowhere is this clearer than in simple server-side php, or in sysadmin shell scripting.

So where do we stand now? While it may have been risky for Ousterhout to proclaim scripting on the rise in 1998, it would be folly to dismiss the success of scripting today. It is even possible that java will yield its position of dominance in the near future. (By the time this essay is printed, LAMP and AJAX might be the new darlings of the tech press; see recent articles in Business Week, this IEEE COMPUTER, and even James Gosling's blog where he concedes he wanted to write a scripting language when he was handed the java project. Java is very much in full retreat.) Is scripting ready to fill the huge vacuum that would be produced?

I personally believe that CS1 java is the greatest single mistake in the history of computing curricula. I believe this because of the empirical evidence, not because I have an a priori preference (I too voted to shift from scheme to java in our CS1, over a decade ago, so I am complicit in the java debacle). I reported in SIGPLAN 1996 ("Why gawk for AI?") that only the scripting programmers could generate code fast enough to keep up with the demands of the artificial intelligence laboratory class. Even though students were allowed to choose any language they wanted, and many had to unlearn the java ways of doing things in order to benefit from scripting, there were few who could develop ideas into code effectively and rapidly without scripting. In the intervening decade, little has changed. We actually see more scripting, as students are happy to compress images so that they can script their computer vision projects rather than stumble around in c and c++. In fact, students who learn to script early are empowered throughout their college years, especially in the crucial UNIX and web environments. Those who learn only java are stifled by enterprise-sized correctness and the chimerae of just-in-time compilation, swing, JRE, JINI, etc. Young programmers need to practice and produce, and to learn through mistakes why discipline is needed. They need to learn design patterns by solving problems, not by reading interfaces to someone else's black box code. It is imperative that programmers learn to be creative and inventive, and they need programming tools that support code exploration rather than code production.

What scripting language could be used for CS1? My personal preferences are gawk, javascript, php, and asp, mainly because of their very gentle learning curves. I don't think perl would be a disaster; its imperfection would create many teaching moments. But there is emerging consensus in the scripting community that python is the right choice for freshman programming. Ruby would also be a defensible choice. Python and ruby have the enviable properties that almost no one dislikes them, and almost everyone respects them. Both languages support a wide variety of programming styles and paradigms and satisfy practitioners and theoreticians equally. Both languages are carefully enough designed that "correct" programming practices can be demonstrated and high standards of code quality can be enforced. The fact that Google stands by python is an added motivation for undergraduate majors.

But do scripting solutions scale? What about the performance gap when the polynomial, or worse the exponential, algorithm faces large n, and the algorithm is written in an interpreted or weakly compiled language? What about software engineering in the large, on big projects? There has been a lot of discussion about scalability of scripts recently. In the past, debates have simply ended with the concession that large systems would have to be rewritten in c++, or a similar language, once the scripting had served its prototyping duty.

The enterprise question is the easier of the two. Just as the individual programmer reaps benefits from a division of labor among tools, writing most of the code in scripts, and writing all bottleneck code in a highly optimizable language, the group of programmers benefits from the use of multiple paradigms and multiple languages. In a recent large project, we used vhdl for fpga's with a lot of gawk to configure the vhdl. We used python and php to generate dynamic html with svg and javascript for the interfaces. We used c and c++ for high performance communications wrappers, which communicated xml to higher level scripts that managed databases and processes. We saw sysadmin and report-generation in perl, ruby, and gawk, data scrubbing in perl and gawk, user scripting in bash, tcl, and gawk, and prototyping in perl and gawk. Only one module was written in java (because that programmer loved java): it was late, it was slow, it failed, and it was eventually rewritten in c++. In retrospect, neither the high performance components nor the lightweight code components were appropriate for the java language. Does scripting scale to enterprise software? I would not manage a project that did not include a lot of scripting, to minimize the amount of "hard" programming, to increase flexibility and reduce delivery time at all stages, to take testing to a higher level, and to free development resources for components where performance is actually critical. I nearly weep when I think about the text processing that was written in c under my managerial watch, because the programmer did not know perl. We write five hundred line scripts in gawk that would be ten thousand line modules in java or c++. In view of the fact that there are much better scripting tools for most of what gets programmed in java and c++, perhaps the question is whether java and c++ scale.

How about algorithmic complexity? Don't scripting languages take too long to perform nested loops? The answer here is that a cpu-bound tight loop such as a matrix multiplication is indeed faster in a language like c. But such bottlenecks are easy to identify and indeed easy to rewrite in c. True system bottlenecks are things like paging, chasing pointers on disk, process initialization, garbage collection, fragmentation, cache mismanagement, and poor data organization. Often, we see that better data organization was unimplemented because it would have required more code, code that would have been attempted in an "easier" programming language like a scripting language, but which was too difficult to attempt in a "harder" programming language. We saw this in the AI class with heuristic search and computer vision, where brute force is better in c, but complex heuristics are better than brute force, and scripting is better for complex heuristics. When algorithms are exponential, it usually doesn't matter what language you use because most practical n will incur too great a cost. Again, the solution is to write heuristics, and scripting is the top dog in that house. Cpu's are so much faster than disks these days that a single extra disk read can erase the CPU advantage of using compiled c instead of interpreted gawk. In any case, java is hardly the first choice for those who have algorithmic bottlenecks.

The real reason why academics were blindsided by scripting is their lack of practicality. Academic computing was generally late to adopt Wintel architectures, late to embrace cgi programming, and late to accept Linux in the same decade that brought scripting's rise. Academia understandably holds industry at a distance. Still, there is a purely intellectual reason why programming language courses are only now warming to scripting. The historical concerns of programming language theory have been syntax and semantics. Java's amazing contribution to computer science is that it raised so many old-fashioned questions that tickled the talents of existing programming language experts: e.g., how can it be compiled? But there are new questions that can be asked, too, such as what a particular language is well-suited to achieve inexpensively, quickly, or elegantly, especially with the new mix of platforms. The proliferation of scripting languages represents a new age of innovation in programming practice.

Linguists recognize something above syntax and semantics, and they call it "pragmatics". Pragmatics has to do with more abstract social and cognitive functions of language: situations, speakers and hearers, discourse, plans and actions, and performance. We are entering an era of comparative programming language study when the issues are higher-level, social, and cognitive too.

My old friend, Michael Scott, has a popular textbook called PROGRAMMING LANGUAGE PRAGMATICS. But it is a fairly traditional tome concerned with parameter passing, types, and bindings (it's hard to see why it merits "pragmatics" in its title, even as it goes to second edition with a chapter on scripting added!). A real programming pragmatics would ask questions like:

  • how well does each language mate to the other UNIX tools?
  • what is the propensity in each language for programmers at various expertise levels to produce a memory leak?
  • what is the likelihood in each language that unmodified code will still function in five years?
  • what is the demand of a programmer's concentration, what is the load on her short-term memory of ontology, and what is the support for visual metaphor in each language?

There have been programming language "shootouts" and "scriptometers" on the internet that have sought to address some of the questions that are relevant to the choice of scripting language, but they have been just first steps. For example, one site reports on the shortest script in each scripting language that can perform a simple task. But absolute brevity for trivial tasks, such as "print hello world" is not as illuminating as typical brevity for real tasks, such as xml parsing.

Pragmatic questions are not the easiest questions for mathematically-inclined computer scientists to address. They refer by their nature to people, their habits, their sociology, and the technological demands of the day. But it is the importance of such questions that makes programmers choose scripting languages. Ousterhout declared scripting on the rise, but perhaps so too are programming language pragmatics.


I have to thank Charlie Comstock for contributing many ideas and references over the past two years that have shaped my views, especially the commitment to the idea of pragmatics.

About the Author

Prof. Dr. Loui and his students are the usual winners of the department programming contest and have contributed to current gnu releases of gawk and malloc. He has lectured on AI for two decades on five continents, taught AI programming for two decades, and is currently funded on a project delivering hardware and software on U.S. government contracts.


categories: Games,Top10,TenLiners,Mar,2009,BrianK




echo Goal | gawk -f story.awk [ -v Grammar=FILE ] [ -v Seed=NUMBER ] 
echo Goal | gawk -f storyp.awk [ -v Grammar=FILE ] [ -v Seed=NUMBER ] 


Download from LAWKER.


This code inputs a set of productions and outputs a string of words that satisfy the production rules.

This page describes two versions of that system: story.awk and storyp.awk. The former selects productions at random with equal probability. The latter allows the user to bias the selection by adding weights at the end of line, after each production.


-v Grammar=FILE
Sets the FILE containing the productions. Defaults to "grammar".
-v Seed=NUM
Sets the seed for the random number generator. Defaults to "1". A useful idiom for generating different random text on each run is Seed=$RANDOM.


A Short Example

This grammar...

Sentence -> Nounphrase Verbphrase   
Nounphrase -> the boy              
Nounphrase -> the girl           
Verbphrase -> Verb Modlist Adverb 
Verb -> runs                    
Verb -> walks                  
Modlist ->                    
Modlist -> very Modlist      
Adverb -> quickly           
Adverb -> slowly           
... and this input ...
for i in 1 2 3 4 5 6 7 8 9 10; do
	echo Sentence |
	gawk -f ../story.awk -v Grammar=english.rules -v Seed=$i
done
... generates these sentences:
the boy runs very slowly
the girl runs slowly
the boy runs very slowly
the girl walks very very quickly
the boy runs quickly
the girl walks very very slowly
the boy walks very very very very very very quickly
the boy walks very quickly
the girl runs slowly
the girl runs very quickly

A Longer Example

Here is Gahan Wilson's sci-fi plot generator ...

Using the above, we can generate the following stories:

 Earth scientists invent giant bugs who want Our Women,  And Take
 A Few And Leave

 Earth is Attacked By tiny lunar superbeings who  Under Stand and
 Are Not radioactive and can not be killed by the Navy but They Die
 From Catching A Cold

 Earth scientists invent enormous bugs who are Friendly and and
 They Get Married And Live Happily Forever After

 Earth is Struck By A Giant cloud and Magically Saved

 Earth scientists invent giant bugs who  Under Stand and Are Not
 radioactive and can not be killed by the Air Force so They Kill

 Earth is Attacked By enormous extra Galactic blobs who  Under Stand
 and Are Not radioactive and can be killed by the Air Force

 Earth scientists discover enormous blobs who  Under Stand and Are
 Not radioactive and can be killed by a Crowd Of Peasants

 Earth falls Into Sun and  Some  Resuced

 Earth is Struck By A Giant comet but Is Saved

 Earth is Struck By A Giant comet and Is Destroyed

This is generated from the following code:

for i in 1 2 3 4 5 6 7 8 9 10; do
	echo Start |
	gawk -f ../story.awk -v Grammar=scifi.rules -v Seed=$i
done

running on the following grammar:

Start      -> Earth IsStressed
IsStressed -> Catestrophes 
IsStressed -> Science 
IsStressed -> Attack 
IsStressed -> Collision

Catestrophes -> Catestrophe and PossibleMegaDeath

Catestrophe -> burnsUp 
Catestrophe -> freezes
Catestrophe -> fallsIntoSun

Collision -> isStruckByAGiant Floater AndThen

Floater -> comet
Floater -> asteroid
Floater -> cloud

AndThen -> butIsSaved
AndThen -> andIsDestroyed
AndThen -> andMagicallySaved

PossibleMegaDeath -> everybodyDies
PossibleMegaDeath -> Some GoOn 

SomeSaved ->  somePeople
SomeSaved ->  everybody
SomeSaved ->  almostEverybody
GoOn -> dies
GoOn -> Resuced
GoOn -> Saved
Rescued -> isRescuedBy Sizes Extraterestrial Beings
Saved   -> butIsSavedBy SomeOne scientists the  Science

SomeOne -> earth
SomeOne -> extraterestrial

Science -> scientists DoSomething Sizes Beings Whichetc

DoSomething -> invent
DoSomething -> discover

Attack -> isAttackedBy Sizes Extraterestrial Beings Whichetc

Sizes -> tiny 
Sizes -> giant 
Sizes -> enormous
Extraterestrial -> martian
Extraterestrial -> lunar
Extraterestrial -> extraGalactic

Beings -> bugs
Beings -> reptiles
Beings -> blobs
Beings -> superbeings

Whichetc -> who WantSomething

WantSomething -> WantWomen
WantSomething -> areFriendly  and DenoumentOrHappyEnding
WantSomething -> UnderStand ButEtc

Understand -> areFriendly butMisunderstood
Understand -> misunderstandUs
Understand -> understandUsAllTooWell
Understand -> hungry

DenoumentOrHappyEnding -> Denoument
DenoumentOrHappyEnding -> HappyEnding
Dine -> Hungry and eat us Denoument?

WhichEtc -> 
Hungry -> lookUponUsAsASourceOfNourishment

WantWomen -> wantOurWomen, AndTakeAFewAndLeave

ButEtc -> AndAre radioactive and TryToKill

AndAre -> andAre
AndAre -> andAreNot

Killers -> Killer 
Killers -> Killer and Killer

Killer -> aCrowdOfPeasants
Killer -> theArmy
Killer -> theNavy
Killer -> theAirForce
Killer -> theMarines
Killer -> theCoastGuard
Killer -> theAtomBomb

TryToKill -> can be killed by Killers
TryToKill -> can not be killed by Killers SoEtc

SoEtc -> butTheyDieFromCatchingACold
SoEtc -> soTheyKillUs
SoEtc -> soTheyPutUsUnderABenignDictatorShip
SoEtc -> soTheyEatUs
SoEtc -> soScientistsInventAWeapon Which
SeEtc -> but Denoument

Which -> whichTurnsThemIntoDisgustingLumps
Which -> whichKillsThem
Which -> whichFails SoEtc

Denomument? ->  
Denomument? -> Denoument  

Denoument ->  aCuteLittleKidConvincesThemPeopleAreOk Ending
Denoument -> aPriestTalksToThemOfGod Ending
Denoument -> theyFallInLoveWithThisBeautifulGirl EndSadOrHappy

EndSadOrHappy -> Ending
EndSadOrHappy -> HappyEnding

Ending -> andTheyDie
Ending -> andTheyLeave
Ending -> andTheyTurnIntoDisgustingLumps

HappyEnding -> andTheyGetMarriedAndLiveHappilyForeverAfter

Biasing the Story

Here is a grammar suitable for storyp.awk. Note the number at the end of each line, which biases how often that production is selected. For example, "runs" and "slowly" are nine times more likely than the other Verb and Adverb choices.

Sentence -> Nounphrase Verbphrase   1
Nounphrase -> the boy               0.75
Nounphrase -> the girl              0.25
Verbphrase -> Verb Modlist Adverb   1
Verb -> runs                        0.9
Verb -> walks                       0.1
Modlist ->                          0.5
Modlist -> very Modlist             0.5
Adverb -> quickly                   0.1
Adverb -> slowly                    0.9
The following code executes the biased story generation:
for((i=1;i<=10;i++)); do echo Sentence ;  done |
gawk -f ../storyp.awk -v Grammar=englishp.rules 

This produces the following output. Note that, usually, we run slowly.

the boy runs very slowly 
the boy runs slowly 
the girl runs very slowly 
the boy runs slowly 
the boy runs slowly 
the girl walks very slowly 
the boy walks slowly 
the girl runs slowly 
the boy runs slowly 
the boy runs slowly 



Here is story.awk:

BEGIN {
    srand(Seed ? Seed : 1)
    Grammar = Grammar ? Grammar : "grammar"
    while ((getline < Grammar) > 0)
        if ($2 == "->") {
            i = ++lhs[$1]              # count lhs
            rhscnt[$1, i] = NF-2       # how many in rhs
            for (j = 3; j <= NF; j++)  # record them
                rhslist[$1, i, j-2] = $j
        } else
            if ($0 !~ /^[ \t]*$/)
                print "illegal production: " $0
}
{   if ($1 in lhs) {  # nonterminal to expand
        gen($1)
        printf("\n")
    } else
        print "unknown nonterminal: " $0
}
function gen(sym,    i, j) {
    if (sym in lhs) {       # a nonterminal
        i = int(lhs[sym] * rand()) + 1        # pick a random production
        for (j = 1; j <= rhscnt[sym, i]; j++) # expand its rhs
            gen(rhslist[sym, i, j])
    } else {
        gsub(/[A-Z]/, " &", sym)   # split CamelCase tokens into words
        printf("%s ", sym)
    }
}
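The gsub() in gen() is what turns run-together grammar tokens like fallsIntoSun into readable text, by inserting a space before each capital letter:

```shell
awk 'BEGIN { s = "fallsIntoSun"; gsub(/[A-Z]/, " &", s); print s }'
# prints: falls Into Sun
```

This is also why unmatched nonterminals such as UnderStand show up in the sample stories as " Under Stand".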


Storyp.awk is almost the same as story.awk, except that each line of the grammar is assumed to end in a number that biases how often that production gets selected.

BEGIN {
    srand(Seed ? Seed : 1)
    Grammar = Grammar ? Grammar : "grammar"
    while ((getline < Grammar) > 0)
        if ($2 == "->") {
            i = ++lhs[$1]              # count lhs
            rhsprob[$1, i] = $NF       # 0 <= probability <= 1
            rhscnt[$1, i] = NF-3       # how many in rhs
            for (j = 3; j < NF; j++)   # record them
                rhslist[$1, i, j-2] = $j
        } else
            print "illegal production: " $0
    for (sym in lhs)                   # turn weights into running sums
        for (i = 2; i <= lhs[sym]; i++)
            rhsprob[sym, i] += rhsprob[sym, i-1]
}
{   if ($1 in lhs) {  # nonterminal to expand
        gen($1)
        printf("\n")
    } else
        print "unknown nonterminal: " $0
}
function gen(sym,    i, j) {
    if (sym in lhs) {       # a nonterminal
        j = rand()          # pick a production, biased by weight
        for (i = 1; i <= lhs[sym] && j > rhsprob[sym, i]; i++)
            ;
        for (j = 1; j <= rhscnt[sym, i]; j++) # expand its rhs
            gen(rhslist[sym, i, j])
    } else
        printf("%s ", sym)
}
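The selection depends on the running sums computed at startup: rand() is compared against cumulative probabilities, so a 0.9 weight wins roughly nine times in ten. The trick in isolation (the names here are illustrative, not storyp.awk's):

```shell
awk 'BEGIN {
    srand(1)
    prob[1] = 0.9; prob[2] = 0.9 + 0.1   # cumulative sums
    word[1] = "runs"; word[2] = "walks"
    for (t = 1; t <= 1000; t++) {
        r = rand()
        for (i = 1; i <= 2 && r > prob[i]; i++) ;
        count[word[i]]++
    }
    print (count["runs"] > 800)          # prints: 1
}'
```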


The code comes from the book "The AWK Programming Language" by Alfred Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988).

The scifi grammar was written by Tim Menzies, 2009, and is based on Gahan Wilson's sci-fi plot generator: "The Science Fiction Horror Movie Pocket Computer" ( in "The Year's Best Science Fiction No. 5", edited by Harry Harrison and Brian Aldiss, Sphere, London, 1972).

categories: Top10,TenLiners,Mar,2009,ScottP

Predicting Gender



echo name | gawk -f gender.awk


Download from LAWKER


The following code predicts gender, given a first name.

This code is an excellent example of rule-based programming in Awk.

For a full description of the code, see


                                          { sex = "m" } # Assume male.

/^.*[aeiy]$/                              { sex = "f" }  # Female names ending in a/e/i/y.
/^All?[iy]((ss?)|z)on$/                   { sex = "f" }  # Allison (and variations)
/^.*een$/                                 { sex = "f" }  # Cathleen, Eileen, Maureen,...
/^[^S].*r[rv]e?y?$/                       { sex = "m" }  # Barry, Larry, Perry,...
/^[^G].*v[ei]$/                           { sex = "m" }  # Clive, Dave, Steve,...
/^[^BD].*(b[iy]|y|via)nn?$/               { sex = "f" }  # Carolyn,Gwendolyn,Vivian,...
/^[^AJKLMNP][^o][^eit]*([glrsw]ey|lie)$/  { sex = "m" }  # Dewey, Stanley, Wesley,...
/^[^GKSW].*(th|lv)(e[rt])?$/              { sex = "f" }  # Heather, Ruth, Velvet,...
/^[CGJWZ][^o][^dnt]*y$/                   { sex = "m" }  # Gregory, Jeremy, Zachary,...
/^.*[Rlr][abo]y$/                         { sex = "m" }  # Leroy, Murray, Roy,...
/^[AEHJL].*il.*$/                         { sex = "f" }  # Abigail, Jill, Lillian,...
/^.*[Jj](o|o?[ae]a?n.*)$/                 { sex = "f" }  # Janet, Jennifer, Joan,...
/^.*[GRguw][ae]y?ne$/                     { sex = "m" }  # Duane, Eugene, Rene,...
/^[FLM].*ur(.*[^eotuy])?$/                { sex = "f" }  # Fleur, Lauren, Muriel,...
/^[CLMQTV].*[^dl][in]c.*[ey]$/            { sex = "m" }  # Lance, Quincy, Vince,...
/^M[aei]r[^tv].*([^cklnos]|([^o]n))$/     { sex = "f" }  # Margaret, Marylou, Miriam,...
/^.*[ay][dl]e$/                           { sex = "m" }  # Clyde, Kyle, Pascale,...
/^[^o]*ke$/                               { sex = "m" }  # Blake, Luke, Mike,...
/^[CKS]h?(ar[^lst]|ry).+$/                { sex = "f" }  # Carol, Karen, Sharon,...
/^[PR]e?a([^dfju]|qu)*[lm]$/              { sex = "f" }  # Pam, Pearl, Rachel,...
/^.*[Aa]nn.*$/                            { sex = "f" }  # Annacarol, Leann, Ruthann,...
/^.*[^cio]ag?h$/                          { sex = "f" }  # Deborah, Leah, Sarah,...
/^[^EK].*[grsz]h?an(ces)?$/               { sex = "f" }  # Frances, Megan, Susan,...
/^[^P]*([Hh]e|[Ee][lt])[^s]*[ey].*[^t]$/  { sex = "f" }  # Ethel, Helen, Gretchen,...
/^[^EL].*o(rg?|sh?)?(e|ua)$/              { sex = "m" }  # George, Joshua, Theodore,..
/^[DP][eo]?[lr].*se$/                     { sex = "f" }  # Delores, Doris, Precious,...
/^[^JPSWZ].*[denor]n.*y$/                 { sex = "m" }  # Anthony, Henry, Rodney,...
/^K[^v]*i.*[mns]$/                        { sex = "f" }  # Karin, Kim, Kristin,...
/^Br[aou][cd].*[ey]$/                     { sex = "m" }  # Bradley, Brady, Bruce,...
/^[ACGK].*[deinx][^aor]s$/                { sex = "f" }  # Agnes, Alexis, Glynis,...
/^[ILW][aeg][^ir]*e$/                     { sex = "m" }  # Ignace, Lee, Wallace,...
/^[^AGW][iu][gl].*[drt]$/                 { sex = "f" }  # Juliet, Mildred, Millicent,...
/^[ABEIUY][euz]?[blr][aeiy]$/             { sex = "m" }  # Ari, Bela, Ira,...
/^[EGILP][^eu]*i[ds]$/                    { sex = "f" }  # Iris, Lois, Phyllis,...
/^[ART][^r]*[dhn]e?y$/                    { sex = "m" }  # Randy, Timothy, Tony,...
/^[BHL].*i.*[rtxz]$/                      { sex = "f" }  # Beatriz, Bridget, Harriet,...
/^.*oi?[mn]e$/                            { sex = "m" }  # Antoine, Jerome, Tyrone,...
/^D.*[mnw].*[iy]$/                        { sex = "m" }  # Danny, Demetri, Dondi,...
/^[^BG](e[rst]|ha)[^il]*e$/               { sex = "m" }  # Pete, Serge, Shane,...
/^[ADFGIM][^r]*([bg]e[lr]|il|wn)$/        { sex = "f" }  # Angel, Gail, Isabel,...

                                          { print sex }  # Output prediction
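The pattern-action style above is easy to try in miniature. Here are two of the rules, applied to one name per line (a sketch excerpted from the full rule set):

```shell
printf 'Maureen\nBarry\n' |
awk '{ sex = "m" }                      # default: assume male
     /^.*een$/           { sex = "f" }  # Cathleen, Eileen, Maureen, ...
     /^[^S].*r[rv]e?y?$/ { sex = "m" }  # Barry, Larry, Perry, ...
     { print $0 ": " sex }'
# prints:
# Maureen: f
# Barry: m
```

Every rule fires in order, so later patterns override earlier guesses; the final action prints whatever survived.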


by Scott Pakin, August 1991

categories: Eliza,Top10,AwkLisp,Interpreters,Dsl,Mar,2009,DariusB


Download from


awk [-v profiling=1] -f awklisp [optional-Lisp-source-files]

The -v profiling=1 option turns call-count profiling on.

If you want to use it interactively, be sure to include '-' (for the standard input) among the source files. For example:

gawk -f awklisp startup numbers lists -



This program arose out of one-upmanship. At my previous job I had to use MapBasic, an interpreter so astoundingly slow (around 100 times slower than GWBASIC) that one must wonder if it itself is implemented in an interpreted language. I still wonder, but it clearly could be: a bare-bones Lisp in awk, hacked up in a few hours, ran substantially faster. Since then I've added features and polish, in the hope of taking over the burgeoning market for stately language implementations.

This version tries to deal with as many of the essential issues in interpreter implementation as is reasonable in awk (though most would call this program utterly unreasonable from start to finish, perhaps...). Awk's impoverished control structures put error recovery and tail-call optimization out of reach, in that I can't see a non-painful way to code them. The scope of variables is dynamic because that was easier to implement efficiently. Subject to all those constraints, the language is as Schemely as I could make it: it has a single namespace with uniform evaluation of expressions in the function and argument positions, and the Scheme names for primitives and special forms.

The rest of this file is a reference manual. My favorite tutorial would be The Little LISPer (see section 5, References); don't let the cute name and the cartoons turn you off, because it's a really excellent book with some mind-stretching material towards the end. All of its code will work with awklisp, except for the last two chapters. (You'd be better off learning with a serious Lisp implementation, of course.)

For more details on the implementation, see the Implementation notes (below).




(define fib
  (lambda (n)
    (if (< n 2)
        1
        (+ (fib (- n 1))
           (fib (- n 2))))))
(fib 20)

Command line:

gawk -f awklisp startup numbers  lists fib.lsp
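Assuming the conventional base case (fib returns 1 when n < 2), the answer awklisp should print can be cross-checked in plain awk:

```shell
awk 'function fib(n) { return n < 2 ? 1 : fib(n - 1) + fib(n - 2) }
     BEGIN { print fib(20) }'
# prints: 10946
```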




Here are the standard ELIZA dialogue patterns:

(define rules
  '(((hello)
     (How do you do -- please state your problem))
    ((I want)
     (What would it mean if you got -R-)
     (Why do you want -R-)
     (Suppose you got -R- soon))
    ((if)
     (Do you really think its likely that -R-)
     (Do you wish that -R-)
     (What do you think about -R-)
     (Really-- if -R-))
    ((I was)
     (Were you really?)
     (Perhaps I already knew you were -R-)
     (Why do you tell me you were -R- now?))
    ((I am)
     (In what way are you -R-)
     (Do you want to be -R-))
    ((because)
     (Is that the real reason?)
     (What other reasons might there be?)
     (Does that reason seem to explain anything else?))
    ((I feel)
     (Do you often feel -R-))
    ((I felt)
     (What other feelings do you have?))
    ((yes)
     (You seem quite positive)
     (You are sure)
     (I understand))
    ((no)
     (Why not?)
     (You are being a bit negative)
     (Are you saying no just to be negative?))
    ((someone)
     (Can you be more specific?))
    ((everyone)
     (Surely not everyone)
     (Can you think of anyone in particular?)
     (Who for example?)
     (You are thinking of a special person))
    ((perhaps)
     (You do not seem quite certain))
    ((are)
     (Did you think they might not be -R-)
     (Possibly they are -R-))
    (()
     (Very interesting)
     (I am not sure I understand you fully)
     (What does that suggest to you?)
     (Please continue)
     (Go on)
     (Do you feel strongly about discussing such things?))))

Command line:

gawk -f awklisp startup numbers  lists eliza.lsp -


> (eliza)
Hello-- please state your problem 
> (I feel sick)
Do you often feel sick 
> (I am in love with awk)
In what way are you in love with awk 
> (because it is so easy to use)
Is that the real reason? 
> (I was laughed at by the other kids at space camp)
Were you really? 
> (everyone hates me)
Can you think of anyone in particular? 
> (everyone at space camp)
Surely not everyone 
> (perhaps not tina fey)
You do not seem quite certain 
> (I want her to laugh at me)
What would it mean if you got her to laugh at me 

Expressions and their evaluation

Lisp evaluates expressions, which can be simple (atoms) or compound (lists).

An atom is a string of characters, which can be letters, digits, and most punctuation; the characters may -not- include spaces, quotes, parentheses, brackets, '.', '#', or ';' (the comment character). In this Lisp, case is significant ( X is different from x ).

  • Atoms: atom 42 1/137 + ok? hey:names-with-dashes-are-easy-to-read
  • Not atoms: don't-include-quotes (or spaces or parentheses)

A list is a '(', followed by zero or more objects (each of which is an atom or a list), followed by a ')'.

  • Lists: () (a list of atoms) ((a list) of atoms (and lists))
  • Not lists: ) ((()) (two) (lists)

The special object nil is both an atom and the empty list. That is, nil = (). A non-nil list is called a -pair-, because it is represented by a pair of pointers, one to the first element of the list (its -car-), and one to the rest of the list (its -cdr-). For example, the car of ((a list) of stuff) is (a list), and the cdr is (of stuff). It's also possible to have a pair whose cdr is not a list; the pair with car A and cdr B is printed as (A . B).
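Internally, awklisp keeps pairs in two parallel arrays, car[] and cdr[] (see the implementation notes below); a cons allocates a fresh index into both. The idea in miniature, in plain awk:

```shell
awk 'function cons(a, d) { n++; car[n] = a; cdr[n] = d; return n }
     BEGIN {
         p = cons("A", "B")        # the pair (A . B)
         print car[p], cdr[p]      # prints: A B
     }'
```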

That's the syntax of programs and data. Now let's consider their meaning. You can use Lisp like a calculator: type in an expression, and Lisp prints its value. If you type 25, it prints 25. If you type (+ 2 2), it prints 4. In general, Lisp evaluates a particular expression in a particular environment (set of variable bindings) by following this algorithm:

  • If the expression is a number, return that number.
  • If the expression is a non-numeric atom (a -symbol-), return the value of that symbol in the current environment. If the symbol is currently unbound, that's an error.
  • Otherwise the expression is a list. If its car is one of the symbols: quote, lambda, if, begin, while, set!, or define, then the expression is a -special- -form-, handled by special rules. Otherwise it's just a procedure call, handled like this: evaluate each element of the list in the current environment, and then apply the operator (the value of the car) to the operands (the values of the rest of the list's elements). For example, to evaluate (+ 2 3), we first evaluate each of its subexpressions: the value of + is (at least in the initial environment) the primitive procedure that adds, the value of 2 is 2, and the value of 3 is 3. Then we call the addition procedure with 2 and 3 as arguments, yielding 5. For another example, take (- (+ 2 3) 1). Evaluating each subexpression gives the subtraction procedure, 5, and 1. Applying the procedure to the arguments gives 4.

We'll see all the primitive procedures in the next section. A user-defined procedure is represented as a list of the form (lambda <parameters> <body>), such as (lambda (x) (+ x 1)). To apply such a procedure, evaluate its body in the environment obtained by extending the current environment so that the parameters are bound to the corresponding arguments. Thus, to apply the above procedure to the argument 41, evaluate (+ x 1) in the same environment as the current one except that x is bound to 41.

If the procedure's body has more than one expression -- e.g., (lambda () (write 'Hello) (write 'world!)) -- evaluate them each in turn, and return the value of the last one.

We still need the rules for special forms. They are:

  • The value of (quote <x>) is <x>. There's a shorthand for this form: '. E.g., the value of '(+ 2 2) is (+ 2 2), -not- 4.
  • (lambda <parameters> <body>) returns itself: e.g., the value of (lambda (x) x) is (lambda (x) x).
  • To evaluate (if <test-expr> <then-exp> <else-exp>), first evaluate <test-expr>. If the value is true (non-nil), then return the value of <then-exp>, otherwise return the value of <else-exp>. (<else-exp> is optional; if it's left out, pretend there's a nil there.) Example: (if nil 'yes 'no) returns no.
  • To evaluate (begin <expr-1> <expr-2>...), evaluate each of the subexpressions in order, returning the value of the last one.
  • To evaluate (while <test> <expr-1> <expr-2>...), first evaluate <test>. If it's nil, return nil. Otherwise, evaluate <expr-1>, <expr-2>,... in order, and then repeat.
  • To evaluate (set! <variable> <expr>), evaluate <expr>, and then set the value of <variable> in the current environment to the result. If the variable is currently unbound, that's an error. The value of the whole set! expression is the value of <expr>.
  • (define <variable> <expr>) is like set!, except it's used to introduce new bindings, and the value returned is <variable>.

It's possible to define new special forms using the macro facility provided in the startup file. The macros defined there are:

  • (let ((<var> <expr>)...) <body>...)
    Bind each <var> to its corresponding <expr> (evaluated in the current environment), and evaluate <body> in the resulting environment.
  • (cond (<test-expr> <result-expr>...)... (else <result-expr>...))
    where the final else clause is optional. Evaluate each <test-expr> in turn, and for the first non-nil result, evaluate its <result-expr>. If none are non-nil, and there's no else clause, return nil.
  • (and <expr>...)
    Evaluate each <expr> in order, until one returns nil; then return nil. If none are nil, return the value of the last <expr>.
  • (or <expr>...)
    Evaluate each <expr> in order, until one returns non-nil; return that value. If all are nil, return nil.

Built-in procedures

List operations:

  • (null? <x>) returns true (non-nil) when <x> is nil.
  • (atom? <x>) returns true when <x> is an atom.
  • (pair? <x>) returns true when <x> is a pair.
  • (car <pair>) returns the car of <pair>.
  • (cdr <pair>) returns the cdr of <pair>.
  • (cadr <pair>) returns the car of the cdr of <pair>. (i.e., the second element.)
  • (cddr <pair>) returns the cdr of the cdr of <pair>.
  • (cons <x> <y>) returns a new pair whose car is <x> and whose cdr is <y>.
  • (list <x>...) returns a list of its arguments.
  • (set-car! <pair> <x>) changes the car of <pair> to <x>.
  • (set-cdr! <pair> <x>) changes the cdr of <pair> to <x>.
  • (reverse! <list>) reverses <list> in place, returning the result.

Arithmetic operations:

  • (number? <x>) returns true when <x> is a number.
  • (+ <n> <n>) returns the sum of its arguments.
  • (- <n> <n>) returns the difference of its arguments.
  • (* <n> <n>) returns the product of its arguments.
  • (quotient <n> <n>) returns the quotient. Rounding is towards zero.
  • (remainder <n> <n>) returns the remainder.
  • (< <n1> <n2>) returns true when <n1> is less than <n2>.

I/O operations:

  • (write <x>) writes <x> followed by a space.
  • (newline) writes the newline character.
  • (read) reads the next expression from standard input and returns it.

Meta-operations:

  • (eval <x>) evaluates <x> in the current environment, returning the result.
  • (apply <proc> <list>) calls <proc> with arguments <list>, returning the result.

Miscellany:

  • (eq? <x> <y>) returns true when <x> and <y> are the same object. Be careful using eq? with lists, because (eq? (cons <x> <y>) (cons <x> <y>)) is false.
  • (put <x> <y> <z>)
  • (get <x> <y>) returns the last value <z> that was put for <x> and <y>, or nil if there is no such value.
  • (symbol? <x>) returns true when <x> is a symbol.
  • (gensym) returns a new symbol distinct from all symbols that can be read.
  • (random <n>) returns a random integer between 0 and <n>-1 (if <n> is positive).
  • (error <x>...) writes its arguments and aborts with error code 1.

Implementation Notes


Since the code should be self-explanatory to anyone knowledgeable about Lisp implementation, these notes assume you know Lisp but not interpreters. I haven't got around to writing up a complete discussion of everything, though.

The code for an interpreter can be pretty low on redundancy -- this is natural because the whole reason for implementing a new language is to avoid having to code a particular class of programs in a redundant style in the old language. We implement what that class of programs has in common just once, then use it many times. Thus an interpreter has a different style of code, perhaps denser, than a typical application program.

Data representation

Conceptually, a Lisp datum is a tagged pointer, with the tag giving the datatype and the pointer locating the data. We follow the common practice of encoding the tag into the two lowest-order bits of the pointer. This is especially easy in awk, since arrays with non-consecutive indices are just as efficient as dense ones (so we can use the tagged pointer directly as an index, without having to mask out the tag bits). (But, by the way, mawk accesses negative indices much more slowly than positive ones, as I found out when trying a different encoding.)

This Lisp provides three datatypes: integers, lists, and symbols. (A modern Lisp provides many more.)

For an integer, the tag bits are zero and the pointer bits are simply the numeric value; thus, N is represented by N*4. This choice of the tag value has two advantages. First, we can add and subtract without fiddling with the tags. Second, negative numbers fit right in. (Consider what would happen if N were represented by 1+N*4 instead, and we tried to extract the tag as N%4, where N may be either positive or negative. Because of this problem and the above-mentioned inefficiency of negative indices, all other datatypes are represented by positive numbers.)
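Concretely, the tag can be tested and stripped with plain arithmetic. A sketch of the encoding (the operations, not awklisp's actual function names):

```shell
awk 'BEGIN {
    n = -7 * 4             # represent the integer -7: tag bits are 00
    print (n % 4 == 0)     # a zero tag marks a number: prints 1
    print n / 4            # recover the value: prints -7
}'
```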

The evaluation/saved-bindings stack

The following is from an email discussion; it doesn't develop everything from first principles but is included here in the hope it will be helpful.

Hi. I just took a look at awklisp, and remembered that there's more to your question about why we need a stack -- it's a good question. The real reason is because a stack is accessible to the garbage collector.

We could have had apply() evaluate the arguments itself, and stash the results into variables like arg0 and arg1 -- then the case for ADD would look like

if (proc == ADD) return is(a_number, arg0) + is(a_number, arg1)

The obvious problem with that approach is how to handle calls to user-defined procedures, which could have any number of arguments. Say we're evaluating ((lambda (x) (+ x 1)) 42). (lambda (x) (+ x 1)) is the procedure, and 42 is the argument.

A (wrong) solution could be to evaluate each argument in turn, and bind the corresponding parameter name (like x in this case) to the resulting value (while saving the old value to be restored after we return from the procedure). This is wrong because we must not change the variable bindings until we actually enter the procedure -- for example, with that algorithm ((lambda (x y) y) 1 x) would return 1, when it should return whatever the value of x is in the enclosing environment. (The eval_rands()-type sequence would be: eval the 1, bind x to 1, eval the x -- yielding 1 which is *wrong* -- and bind y to that, then eval the body of the lambda.)

Okay, that's easily fixed -- evaluate all the operands and stash them away somewhere until you're done, and *then* do the bindings. So the question is where to stash them. How about a global array? Like

for (i = 0; arglist != NIL; ++i) {
    global_temp[i] = eval(car[arglist])
    arglist = cdr[arglist]
}

followed by the equivalent of extend_env(). This will not do, because the global array will get clobbered in recursive calls to eval(). Consider (+ 2 (* 3 4)) -- first we evaluate the arguments to the +, like this: global_temp[0] gets 2, and then global_temp[1] gets the eval of (* 3 4). But in evaluating (* 3 4), global_temp[0] gets set to 3 and global_temp[1] to 4 -- so the original assignment of 2 to global_temp[0] is clobbered before we get a chance to use it. By using a stack[] instead of a global_temp[], we finesse this problem.
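The clobbering is easy to reproduce in plain awk (a hypothetical toy, not awklisp code): the inner call overwrites g[0] before the outer expression reads it back.

```shell
awk 'function mul(a, b) { g[0] = a; g[1] = b; return g[0] * g[1] }
     BEGIN {
         g[0] = 2            # stash the first operand of the +
         g[1] = mul(3, 4)    # evaluating the second operand clobbers g[0]
         print g[0] + g[1]   # prints 15, not the intended 14
     }'
```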

You may object that we can solve that by just making the global array local, and that's true; lots of small local arrays may or may not be more efficient than one big global stack, in awk -- we'd have to try it out to see. But the real problem I alluded to at the start of this message is this: the garbage collector has to be able to find all the live references to the car[] and cdr[] arrays. If some of those references are hidden away in local variables of recursive procedures, we're stuck. With the global stack, they're all right there for the gc().

(In C we could use the local-arrays approach by threading a chain of pointers from each one to the next; but awk doesn't have pointers.)

(You may wonder how the code gets away with having a number of local variables holding lisp values, then -- the answer is that in every such case we can be sure the garbage collector can find the values in question from some other source. That's what this comment is about:

  # All the interpretation routines have the precondition that their
  # arguments are protected from garbage collection.

In some cases where the values would not otherwise be guaranteed to be available to the gc, we call protect().)

Oh, there's another reason why apply() doesn't evaluate the arguments itself: it's called by do_apply(), which handles lisp calls like (apply car '((x))) -- where we *don't* want the x to get evaluated by apply().


  • Harold Abelson and Gerald J. Sussman, with Julie Sussman. Structure and Interpretation of Computer Programs. MIT Press, 1985.
  • John Allen. Anatomy of Lisp. McGraw-Hill, 1978.
  • Daniel P. Friedman and Matthias Felleisen. The Little LISPer. Macmillan, 1989.

Roger Rohrbach wrote a Lisp interpreter, in old awk (which has no procedures!), called walk. It can't do as much as this Lisp, but it certainly has greater hack value. Cooler name, too. It's available at


Eval doesn't check the syntax of expressions. This is a probably-misguided attempt to bump up the speed a bit that also simplifies some of the code. The macroexpander in the startup file would be the best place to add syntax-checking.


Darius Bacon


Copyright (c) 1994, 2001 by Darius Bacon.

Permission is granted to anyone to use this software for any purpose on any computer system, and to redistribute it freely, subject to the following restrictions:

  1. The author is not responsible for the consequences of use of this software, no matter how awful, even if they arise from defects in it.
  2. The origin of this software must not be misrepresented, either by explicit claim or by omission.
  3. Altered versions must be plainly marked as such, and must not be misrepresented as being the original software.

categories: Awk100,Top10,Interpreters,Dsl,Apr,2009,HenryS

Amazing Awk Assembler

Download from

Download from LAWKER.


"aaa" (the Amazing Awk Assembler) is a primitive assembler written entirely in awk and sed. It was done for fun, to establish whether it was possible. It is; it works. It's quite slow, the input syntax is eccentric and rather restricted, and error-checking is virtually nonexistent, but it does work. Furthermore it's very easy to adapt to a new machine, provided the machine falls into the generic "8-bit-micro" category. It is supplied "as is", with no guarantees of any kind. I can't be bothered to do any more work on it right now, but even in its imperfect state it may be useful to someone.

aaa is the mainline shell file.

aux is a subdirectory with machine-independent stuff. Anon, 6801, and 6809 are subdirectories with machine-dependent stuff, choice specified by a -m option (default is "anon"). Actually, even the stuff that is supposedly machine-independent does have some machine-dependent assumptions; notably, it knows that bytes are 8 bits (not serious) and that the byte is the basic unit of instructions (more serious). These would have to change for the 68000 (going to 16-bit "bytes" might be sufficient) and maybe for the 32016 (harder).

aaa thinks that the machine subdirectories and the aux subdirectory are in the current directory, which is almost certainly wrong.

abst is an abstract for a paper. "card", in each machine directory, is a summary card for the slightly-eccentric input language. There is no real manual at present; sorry.

try.s is a sample piece of 6809 input; it is semantic trash, purely for test purposes. The assembler produces try.a, try.defs, and try.x as outputs from "aaa try.s". try.a is an internal file that looks somewhat like an assembly listing. try.defs is another internal file that looks somewhat like a symbol table. These files are preserved because of possible usefulness; tmp[123] are non-preserved temporaries. try.x is the Intel-hex output. try.x.good is identical to try.x and is a saved copy for regression testing of new work.

01pgm.s is a self-programming program for a 68701, based on the one in the Motorola ap note. 01pgm.x.good is another regression-test file.

If your C library (used by awk) has broken "%02x" so it no longer means "two digits of hex, *zero-filled*" (as some SysV libraries have), you will have to fall back from aux/hex to aux/hex.argh, which does it the hard way. Oh yes, you'll note that aaa feeds settings into awk on the command line; don't assume your awk won't do this until you try it.
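Both behaviours that paragraph relies on are easy to verify on your own system before running aaa. This is a hedged sketch (any POSIX-conformant awk should behave this way):

```sh
# 1. Is "%02x" zero-filled, two-digit hex?
echo x | awk '{ printf "%02x\n", 10 }'                    # should print: 0a

# 2. Can settings be fed to awk on the command line?
#    A var=value operand is assigned before reading input.
echo x | awk '{ printf "%0" width "x\n", 10 }' width=4    # should print: 000a
```

If the first command prints ` a` or `0xa` instead of `0a`, your library's `%02x` is broken and you need the `aux/hex.argh` fallback.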


Henry Spencer

categories: Top10,Wp,Dsl,Mar,2009,JesusG




awk -f markdown.awk file.txt > file.html


Download from LAWKER.


(Note: this code was originally called txt2html.awk by its author, but that caused a name clash inside LAWKER. Hence, I've taken the liberty of renaming it. --Timm)

The following code implements a subset of John Gruber's Markdown language: a widely-used, ultra-lightweight markup language for HTML.

  • Paragraphs: denoted by a leading blank line.
  • Images:
    ![alt text](/path/img.jpg "Title")
  • Emphasis: **to be in italics**
  • Code: `<code>` spans are delimited by backticks.
  • Headings (Setex style): the heading text is underlined with "=" (level 1), "-" (level 2), or "_" (level 3):

    Level 1 Header
    ==============
    Level 2 Header
    --------------
    Level 3 Header
    ______________
  • Headings (Atx style):

    Number of leading "#" characters codes the heading level:

    # Level 1 Header
    #### Level 4 Header
  • Unordered lists:

    - List item 1
    - List item 2

    Note: the beginning and end of a list are automatically inferred, maybe not always correctly.
  • Ordered lists: denoted by a number at start-of-line.

    1 A numbered list item


The following code demonstrates an "exception style" of Awk programming. Note how all the processing relating to each mark-up tag is localized in one rule (the "exceptions" being the prior text and environment state carried between rules). The modularity of the following code should make it easily hackable.
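As a minimal, hypothetical illustration of that style (a two-rule sketch, not the full converter), the default rule accumulates text, and each marker rule "throws" by flushing what has accumulated:

```sh
awk '
/^=+$/ { print "<h1>" text "</h1>"; text = ""; next }  # marker rule flushes
/^$/   { if (text) print "<p>" text "</p>"; text = ""; next }
       { text = text $0 }                              # default: accumulate
END    { if (text) print "<p>" text "</p>" }
' <<'EOF'
Title
=====
hello world
EOF
# prints:
# <h1>Title</h1>
# <p>hello world</p>
```

The full program below is the same shape, just with more marker rules.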


BEGIN {
	env = "none";
	text = "";
}


/^!\[.+\] *\(.+\)/ {
	split($0, a, /\] *\(/);
	split(a[1], b, /\[/);
	imgtext = b[2];
	split(a[2], b, /\)/);
	imgaddr = b[1];
	print "<p><img src=\"" imgaddr "\" alt=\"" imgtext "\" title=\"\" /></p>\n";
	text = "";
	next;
}


/\] *\(/ {
	do {
		na = split($0, a, /\] *\(/);
		split(a[1], b, "[");
		linktext = b[2];
		nc = split(a[2], c, ")");
		linkaddr = c[1];
		text = text b[1] "<a href=\"" linkaddr "\">" linktext "</a>" c[2];
		for (i = 3; i <= nc; i++)
			text = text ")" c[i];
		for (i = 3; i <= na; i++)
			text = text "](" a[i];
		$0 = text;
		text = "";
	} while (na > 2);
}


/`/ {
	while (match($0, /`/) != 0) {
		if (env == "code") {
			sub(/`/, "</code>");
			env = pcenv;
		} else {
			sub(/`/, "<code>");
			pcenv = env;
			env = "code";
		}
	}
}

/\*\*/ {
	while (match($0, /\*\*/) != 0) {
		if (env == "emph") {
			sub(/\*\*/, "</emph>");
			env = peenv;
		} else {
			sub(/\*\*/, "<emph>");
			peenv = env;
			env = "emph";
		}
	}
}

Setex-style Headers

(Plus h3 with underscores.)

/^=+$/ {
	print "<h1>" text "</h1>\n";
	text = "";
	next;
}

/^-+$/ {
	print "<h2>" text "</h2>\n";
	text = "";
	next;
}

/^_+$/ {
	print "<h3>" text "</h3>\n";
	text = "";
	next;
}

Atx-style headers

/^#/ {
	match($0, /#+/);
	n = RLENGTH;
	if (n > 6)
		n = 6;
	print "<h" n ">" substr($0, RLENGTH + 1) "</h" n ">\n";
	text = "";
	next;
}

Unordered Lists

/^[-*+] / {
	if (env == "none") {
		env = "ul";
		print "<ul>";
	}
	print "<li>" substr($0, 3) "</li>";
	text = "";
	next;
}

/^[0-9]./ {
	if (env == "none") {
		env = "ol";
		print "<ol>";
	}
	print "<li>" substr($0, 3) "</li>";
	text = "";
	next;
}


/^[ \t]*$/ {
	if (env != "none") {
		if (text)
			print text;
		text = "";
		print "</" env ">\n";
		env = "none";
	}
	if (text)
		print "<p>" text "</p>\n";
	text = "";
	next;
}


// {
	text = text $0;
}

END {
	if (env != "none") {
		if (text)
			print text;
		text = "";
		print "</" env ">\n";
		env = "none";
	}
	if (text)
		print "<p>" text "</p>\n";
	text = "";
}


Does not implement the full Markdown syntax.


Jesus Galan (yiyus) 2006

<yiyu DOT jgl AT gmail DOT com>

categories: Top10,Awk100,Papers,Os,Apr,2009,YungC


Awk-Linux Educational Operating Systems


Teaching operating systems.


Yung-Pin Cheng



Software Engineering Lab. Department of Computer Science and Information Engineering National Taiwan Normal University




Educators of Operating Systems


Most well-known instructional operating systems are complex, particularly if their companion software is taken into account. It takes considerable time and effort to craft these systems, and their complexity may introduce maintenance and evolution problems. In this project, a courseware called Awk-Linux is proposed. The basic hardware functions provided by Awk-Linux include timer interrupt and page-fault interrupt, which are simulated through program instrumentation over user programs.

A major advantage of the use of Awk for this tool is platform independence. Awk-Linux is comparatively easy to build, and it does not depend on any hardware simulator or platform. Stable Awk versions run on many platforms, so this tool can be readily and easily ported to other machines. The same cannot be said for other, more complex operating-systems courseware, which may be much harder to port to new environments.

In practice, using Awk-Linux is very simple for the instructor and students:

  • Course projects based on Awk-Linux provide source code extracted and simplified from a Linux kernel.
  • Results of our study indicate that the projects helped students better understand the inner workings of operating systems.


Gawk under cygwin or Linux


Windows (CYGWIN required) or Linux


C programming language


Status 3 (Released)


3 (Free/public domain)




Yung-Pin Cheng and Janet Mei-Chuen Lin, "Awk-Linux: A Lightweight Operating Systems Courseware," IEEE Transactions on Education, vol. 51, no. 4, pp. 461-467, 2008.


categories: Top10,Awk100,Mar,2009,NelsonB,Spell,ArnoldR




awk [-v Dictionaries="sysdict1 sysdict2 ..."] -f spell.awk -- \
    [=suffixfile1 =suffixfile2 ...] [+dict1 +dict2 ...] \
    [-strip] [-verbose] [file(s)]


Download from LAWKER.


Why Study This Code?

This program is an example par excellence of the power of awk. Yes, if written in "C", it would run faster. But goodness me, it would be much longer to code. These few lines implement a powerful spell checker, with user-specifiable exception lists. The built-in dictionary is constructed from a list of standard Unix spelling dictionaries, overridable on the command line.

It also offers some tips on how to structure larger-than-ten-line awk programs. In the code below, note:

  • The code is hundreds of lines long. Yes folks, it's true: Awk is not just a tool for writing one-liners.
  • The code is well-structured. Note, for example, how the BEGIN block is used to initialize the system from files/functions.
  • The code uses two tricks that encourage function reuse:
    • Much of the functionality has been moved out of PATTERN-ACTION pairs and into functions.
    • The number of globals is restricted: note the frequent use of local variables in functions.
  • There is an example, in scan_options, of how to parse command-line arguments.
  • The use of "print pipes" in report_exceptions shows how to link Awk code to other commands.

(And to write even larger programs, divided into many files, see runawk.)


Dictionaries are simple text files, with one word per line. Unlike those for Unix spell(1), the dictionaries need not be sorted, and there is no dependence on the locale in this program that can affect which exceptions are reported, although the locale can affect their reported order in the exception list. A default list of dictionaries can be supplied via the environment variable DICTIONARIES, but that can be overridden on the command line.

For the purposes of this program, words are located by replacing ASCII control characters, digits, and punctuation (except apostrophe) with ASCII space (32). What remains are the words to be matched against the dictionary lists. Thus, files in ASCII and ISO-8859-n encodings are supported, as well as Unicode files in UTF-8 encoding.
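The word-location step can be tried in isolation. This sketch uses a simplified ASCII-only character class (the program's actual NonWordChars class also admits the high ISO-8859/UTF-8 bytes):

```sh
echo "It's 42 miles--mostly uphill!" |
awk '{
	gsub(/[^A-Za-z'\'']/, " ")      # digits/punctuation (not apostrophe) -> space
	for (k = 1; k <= NF; k++)
		print tolower($k)       # matching is case insensitive
}'
# prints (one per line): it's  miles  mostly  uphill
```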

All word matching is case insensitive (subject to the workings of tolower()).

In this simple version, which is intended to support multiple languages, no attempt is made to strip word suffixes, unless the -strip option is supplied.


Suffixes are defined as regular expressions, and may be supplied from suffix files (one per name) named on the command line, or from an internal default set of English suffixes. Comments in the suffix file run from sharp (#) to end of line. Each suffix regular expression should end with $, to anchor the expression to the end of the word. Each suffix expression may be followed by a list of one or more strings that can replace it, with the special convention that "" represents an empty string. For example:

	ies$	ie ies y	# flies -> fly, series -> series, ties -> tie
	ily$	y ily		# happily -> happy, wily -> wily
	nnily$	n		# funnily -> fun

Although it is permissible to include the suffix in the replacement list, it is not necessary to do so, since words are looked up before suffix stripping.
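For example, the `ies$` rule above expands one unmatched word into three dictionary candidates. The lookup step can be sketched standalone, mirroring what strip_suffixes does for a single rule:

```sh
awk 'BEGIN {
	word = "flies"; regexp = "ies$"; replacements = "ie ies y"
	if (match(word, regexp)) {
		stem = substr(word, 1, RSTART - 1)   # "fl"
		n = split(replacements, ending)
		for (k = 1; k <= n; k++)
			print stem ending[k]         # candidate spellings
	}
}'
# prints (one per line): flie  flies  fly
```

Each candidate is then looked up in the dictionary; "flies" is accepted because the candidate "fly" is found.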

Suffixes are tested in order of decreasing length, so that the longest matches are tried first.


The default output is just a sorted list of unique spelling exceptions, one per line. With the -verbose option, output lines instead take the form filename:linenumber:word.


Some Unix text editors recognize such lines, and can use them to move quickly to the indicated location.



BEGIN	{ initialize() }
	    { spell_check_line() }
END	    { report_exceptions() }


function get_dictionaries(        files, key) {
    if ((Dictionaries == "") && ("DICTIONARIES" in ENVIRON))
	Dictionaries = ENVIRON["DICTIONARIES"]
    if (Dictionaries == "")	# Use default dictionary list
	DictionaryFiles["/usr/dict/words"]++
    else {			# Use system dictionaries from command line
	split(Dictionaries, files)
	for (key in files)
	    DictionaryFiles[files[key]]++
    }
}


function initialize() {
    NonWordChars = "[^" \
	"'" \
	"abcdefghijklmnopqrstuvwxyz" \
	"\200\201\202\203\204\205\206\207\210\211\212\213\214\215\216\217" \
	"\220\221\222\223\224\225\226\227\230\231\232\233\234\235\236\237" \
	"\240\241\242\243\244\245\246\247\250\251\252\253\254\255\256\257" \
	"\260\261\262\263\264\265\266\267\270\271\272\273\274\275\276\277" \
	"\300\301\302\303\304\305\306\307\310\311\312\313\314\315\316\317" \
	"\320\321\322\323\324\325\326\327\330\331\332\333\334\335\336\337" \
	"\340\341\342\343\344\345\346\347\350\351\352\353\354\355\356\357" \
	"\360\361\362\363\364\365\366\367\370\371\372\373\374\375\376\377" \
	"]"
    get_dictionaries()
    scan_options()
    load_dictionaries()
    load_suffixes()
    order_suffixes()
}


function load_dictionaries(        file, word) {
    for (file in DictionaryFiles) {
	## print "DEBUG: Loading dictionary " file > "/dev/stderr"
	while ((getline word < file) > 0)
	    Dictionary[tolower(word)]++
	close(file)
    }
}


function load_suffixes(        file, k, line, n, parts) {
    if (NSuffixFiles > 0) {		# load suffix regexps from files
	for (file in SuffixFiles) {
	    ## print "DEBUG: Loading suffix file " file > "/dev/stderr"
	    while ((getline line < file) > 0) {
		sub(" *#.*$", "", line)		# strip comments
		sub("^[ \t]+", "", line)	# strip leading whitespace
		sub("[ \t]+$", "", line)	# strip trailing whitespace
		if (line == "")
		    continue
		n = split(line, parts)
		Suffixes[parts[1]]++
		Replacement[parts[1]] = parts[2]
		for (k = 3; k <= n; k++)
		    Replacement[parts[1]] = Replacement[parts[1]] " " parts[k]
	    }
	    close(file)
	}
    }
    else {	# load default table of English suffix regexps
	split("'$ 's$ ed$ edly$ es$ ing$ ingly$ ly$ s$", parts)
	for (k in parts) {
	    Suffixes[parts[k]] = 1
	    Replacement[parts[k]] = ""
	}
    }
}


function order_suffixes(        i, j, key) {
    # Order suffixes by decreasing length
    NOrderedSuffix = 0
    for (key in Suffixes)
	OrderedSuffix[++NOrderedSuffix] = key
    for (i = 1; i < NOrderedSuffix; i++)
	for (j = i + 1; j <= NOrderedSuffix; j++)
	    if (length(OrderedSuffix[i]) < length(OrderedSuffix[j]))
		swap(OrderedSuffix, i, j)
}


function report_exceptions(        key, sortpipe) {
    sortpipe = Verbose ? "sort -f -t: -u -k1,1 -k2n,2 -k3" : "sort -f -u -k1"
    for (key in Exception)
	print Exception[key] | sortpipe
    close(sortpipe)
}


function scan_options(        k) {
    for (k = 1; k < ARGC; k++) {
	if (ARGV[k] == "-strip") {
	    ARGV[k] = ""
	    Strip = 1
	}
	else if (ARGV[k] == "-verbose") {
	    ARGV[k] = ""
	    Verbose = 1
	}
	else if (ARGV[k] ~ /^=/) {	# suffix file
	    NSuffixFiles++
	    SuffixFiles[substr(ARGV[k], 2)]++
	    ARGV[k] = ""
	}
	else if (ARGV[k] ~ /^[+]/) {	# private dictionary
	    DictionaryFiles[substr(ARGV[k], 2)]++
	    ARGV[k] = ""
	}
    }

    # Remove trailing empty arguments (for nawk)
    while ((ARGC > 0) && (ARGV[ARGC-1] == ""))
	ARGC--
}


function spell_check_line(        k, word) {
    ## for (k = 1; k <= NF; k++) print "DEBUG: word[" k "] = \"" $k "\""
    gsub(NonWordChars, " ")		# eliminate nonword chars
    for (k = 1; k <= NF; k++) {
	word = $k
	sub("^'+", "", word)		# strip leading apostrophes
	sub("'+$", "", word)		# strip trailing apostrophes
	if (word != "")
	    spell_check_word(word)
    }
}


function spell_check_word(word,        key, lc_word, location, w, wordlist) {
    lc_word = tolower(word)
    ## print "DEBUG: spell_check_word(" word ") -> tolower -> " lc_word
    if (lc_word in Dictionary)		# acceptable spelling
	return
    else {				# possible exception
	if (Strip) {
	    strip_suffixes(lc_word, wordlist)
	    ## for (w in wordlist) print "DEBUG: wordlist[" w "]"
	    for (w in wordlist)
		if (w in Dictionary)
		    return
	}
	## print "DEBUG: spell_check():", word
	location = Verbose ? (FILENAME ":" FNR ":") : ""
	if (lc_word in Exception)
	    Exception[lc_word] = Exception[lc_word] "\n" location word
	else
	    Exception[lc_word] = location word
    }
}


function strip_suffixes(word, wordlist,        ending, k, n, regexp) {
    ## print "DEBUG: strip_suffixes(" word ")"
    split("", wordlist)
    for (k = 1; k <= NOrderedSuffix; k++) {
	regexp = OrderedSuffix[k]
	## print "DEBUG: strip_suffixes(): Checking \"" regexp "\""
	if (match(word, regexp)) {
	    word = substr(word, 1, RSTART - 1)
	    if (Replacement[regexp] == "")
		wordlist[word] = 1
	    else {
		split(Replacement[regexp], ending)
		for (n in ending) {
		    if (ending[n] == "\"\"")
			ending[n] = ""
		    wordlist[word ending[n]] = 1
		}
	    }
	    break
	}
    }
    ## for (n in wordlist) print "DEBUG: strip_suffixes() -> \"" n "\""
}


function swap(a, i, j,        temp) {
    temp = a[i]
    a[i] = a[j]
    a[j] = temp
}


Arnold Robbins and Nelson H.F. Beebe in "Classic Shell Scripting", O'Reilly Books

categories: Top10,Boris,Awk100,Feb,2009,Ronl



Demonstration to DoD of a clustering algorithm suitable for streaming data.

Source code


Live demo


Ronald Loui and a programmer named Boris.


Washington University in St. Louis, CS Dept.




This is an evolutionary algorithm and visualization of a clustering algorithm that could be turned from O(n^4) to O(n log n) with a few judicious uses of constants. Later developments added other interactive devices, including progress meters and point-and-click behavior.


Ronald Loui



The code is an excellent example of the power of Awk as a prototyping tool: after getting the code running with the least development time, a quirk was observed that allowed a reduction from O(n^4) to O(n log n).

  • Two of the n's are lost (n^2) by noticing that when there is a swap, the delta in the scoring function falls off with the squared distance from the point of the swap. So if you pick a constant, such as 10, 20, or 100, based on the expected size of your clusters, you can stop calculating the scoring function once you get past that constant.
  • The other n comes from either fixing the size of the matrix and occasionally flushing new candidates in and out, or else from sampling over a subset of the n when you calculate the score.
  • The n log n remains because there is a sort every now and then.




Intended for fast servers, 1+ GHz.





Development Effort

One weekend.

Maintenance Effort





2 (In-house use)






Feb 2009.


M. Looks, A. Levine, G. A. Covington, R. P. Loui, J. W. Lockwood, and Y. H. Cho, "Streaming Hierarchical Clustering for Concept Mining," IEEE Aerospace Conference, 3-10 March 2007, pp. 1-12. DOI 10.1109/AERO.2007.352792.
