"Cause a little auk awk
goes a long way."


Mar 01: Michael Sanders demos an X-windows GUI for AWK.

Mar 01: Awk100#24: A. Lahm and E. de Rinaldis' patent search, in AWK

Feb 28: Tim Menzies asks this community to write an AWK cookbook.

Feb 28: Arnold Robbins announces a new debugger for GAWK.

Feb 28: Awk100#23: Premysl Janouch offers an IRC bot, in AWK

Feb 28: Updated: the AWK FAQ

Feb 28: Tim Menzies offers a tiny content management system, in Awk.

Jan 31: Comment system added to the site. For example, see the discussion at the bottom of ?keys2awk

Jan 31: Martin Cohen shows that Gawk can handle massively long strings (300 million characters).

Jan 31: The AWK FAQ is being updated. For comments/corrections/extensions, please mail

Jan 31: Martin Cohen finds Awk on the Android platform.

Jan 31: Aleksey Cheusov released a new version of runawk.

Jan 31: Hirofumi Saito contributes a candidate Awk mascot.

Jan 31: Michael Sanders shows how to quickly build an AWK GUI for windows.

Jan 31: Hyung-Hwan Chung offers QSE, an embeddable Awk Interpreter.

[More ...]


categories: Interpreters,Apr,2009,Admin

Writing Interpreters

These pages focus on language interpreters, written in Awk.

categories: Awk100,Interpreters,Apr,2009,HenryS

AASL: A Parser Generator in Awk


Download from LAWKER


aaslg [ -x ] [ file ... ]
aaslr [ -x ] table [ file ... ]


Aaslg and aaslr implement the Amazing Awk Syntax Language, AASL (pronounced ``hassle''). Aaslg (pronounced ``hassling'') takes an AASL specification from the concatenation of the file(s) (default standard input) and emits the corresponding AASL table on standard output. Aaslr parses the contents of the file(s) (default standard input) according to the AASL table in file table, emitting the table's output on standard output.

Both take a -x option to turn on verbose and cryptic debugging output. Both look in a library directory for pieces of the AASL system; the AASLDIR environment variable, if present, overrides the default notion of the location of this directory.

Aaslr expects input to consist of input tokens, one per line. For simple tokens, the line is just the text of the token. For metatokens like ``identifier'', the line is the metatoken's name, a tab, and the text of the token. [xxx discuss `#' lines]

Aaslr output, in the absence of syntax errors, consists of the input tokens plus action tokens, which are lines consisting of `#!' followed immediately by an identifier. If the syntax of the input does not match that specified in the AASL table, aaslr emits complaint(s) on standard error and attempts to repair the input into a legal form; see ``ERROR REPAIR'' below. Unless errors have cascaded to the point where aaslr gives up (in which case it emits the action token ``#!aargh'' to inform later passes of this), the output will always conform to the AASL syntax given in the table.

Normally, a complete program using AASL consists of three passes, the middle one being an invocation of aaslr. The first pass is a lexical analyzer, which breaks free-form input down into input tokens in some suitable way. The third pass is a semantics interpreter, which typically responds to input tokens by momentarily remembering them and to action tokens by executing some action, often using the remembered value of the previous input token. Aaslg is in fact implemented using AASL, following this structure; it implements the -x option by just passing it to aaslr.

AASL Specifications

An AASL specification consists of class definitions, text definitions, and rules, in arbitrary order (except that class definitions must precede use of the classes they define). A `#' (not enclosed in a string) begins a comment; characters from it to the end of the line are ignored. An identifier follows the same rules as a C identifier, except that in most contexts it can be at most 16 characters long. A string is enclosed in double quotes ("") and generally follows C syntax. Most strings denote input tokens, and references to ``input token'' as part of AASL specification syntax should be read as ``string denoting input token''.

A class definition is an identifier enclosed in angle brackets (<>) followed by one or more input tokens followed by a semicolon (;). It gives a name to a set of input tokens. Classes whose names start with capital letters are user abbreviations; see below. Classes whose names start with lowercase letters are special classes, used for internal purposes. The current special classes are:

trivial: tokens which the parser can discard at will, in the expectation that they might be inserted erroneously; see ``ERROR REPAIR'' for details.
lineterm: tokens which terminate a logical line for purposes of resynchronization in error repair; see ``ERROR REPAIR'' for details.

For example, the class definitions used for AASL itself are:

<trivial> "," ";"   ;
<lineterm> ";" ;
<endmarker> "EOF"   ;

When AASL error repair is invoked, the parser sometimes needs to generate input tokens. In the case of a metatoken, the parser knows the token's name but needs to generate a text for it as well. A text definition consists of an input token, an arrow (->), and a string specifying what text should be generated for that token. For example, the text definitions used for AASL itself are:

"id" -> "___"
"string" -> "\"___\""

The rules of a specification define the syntax that the parser should accept. The order of rules is not significant, except that the first rule is considered to be the top level of the specification. The specification is executed by calling the first rule; when execution of that rule terminates, execution of the specification terminates. If the user wishes this to occur only at end of input, he should arrange for the lexical analyzer to produce an endmarker token (conventionally ``EOF'') at the end of the input, and should write the first rule to require that token at the end.

Note that an input token may be recognized considerably before it is accepted, but the parser emits it to the output only on acceptance.

A rule consists of an identifier naming it, a colon (:), a sequence of items which is the body of the rule, and a semicolon (;). When a rule is called, it is executed by executing the individual items of the body in order (as modified by control structures) until either one of them explicitly terminates execution of the rule or the last item is executed.

An item which is an input token requires that that token appear in the input at that point, and accepts it (causing it to be emitted as output).

An item which is an identifier denotes a call to another rule, which executes the body of that rule and then returns to the caller. It is an error to call a nonexistent rule.

An item which is an identifier preceded by `!' causes that identifier to be emitted as an action token; the identifier has no other significance.

An item which is `<<' causes execution of the current rule to terminate immediately, returning to the calling rule.

An item which is `>>' causes the execution of the innermost enclosing loop (see below) to terminate immediately, with execution continuing after the end of that loop. The loop must be within the same rule.

An item which is an identifier preceded by `@%&!' causes an internal semantic action to be executed within the parser; this is normally needed only for bizarre situations like C's typedef. [xxx should give details I suppose]

A choice is a sequence of branches enclosed in parentheses (()) and separated by vertical bars (|). The first of the branches that can be executed, is, after which execution continues after the end of the choice.

A loop is a sequence of branches enclosed in braces ({}) and separated by vertical bars (|). The first of the branches that can be executed, is, and this is done repeatedly until the loop is terminated by `>>', after which execution continues after the end of the loop. (A loop can also be terminated by `<<' terminating execution of the whole rule.)

A branch is just a sequence of items, like a rule body, except that it must begin with either an input token or a lookahead. If it begins with an input token, it can be executed only when that token is the next token in the input, and execution starts with acceptance of that token.

A lookahead specifies conditions for execution of a branch based on recognizing but not accepting input token(s). The simplest form is just an input token enclosed in brackets ([]), in which case execution of that branch is possible only when that token is the next token in the input. The brackets can also contain multiple input tokens separated by commas, in which case the parser looks for any of those tokens. If a user-abbreviation class name appears, either by itself or as an element of a comma-separated list, it stands for the list of tokens given in its definition.

If a lookahead's brackets contain only a `*', this is a default branch, executable regardless of the state of the input.

As a very special case, a lookahead's brackets can contain two input tokens separated by slash (/), in which case that branch is executable only when those two tokens, in sequence, are next in the input. Warning: this is implemented by a delicate perversion of the error-repair machinery, and if the first of those tokens is not then accepted, the parser will die in convulsions. A further restriction is that the same input token may not appear as the first token of a double lookahead and as a normal lookahead token in the same choice/loop.

Certain simple choice/loop structures appear frequently, and there are abbreviations for them:

abbreviation	    expansion
( items ?)	        ( items  |  [*] )
{ items ?}	        { items  |  [*] >> }
( ! [look] items ?) ( [look]  |  items )
{ ! [look] items ?} { [look] >>  |  items }

For example, here are the rules of the AASL specification for AASL, minus the actions (which add considerable clutter and are unintelligible without the third pass):

	       rules: {
				   "id" ":" contents ";"
				   | "<" "id" ">" {"string" ?} ";"
				   | "string" "->" "string"
				   | "EOF" >>
				   } ;
	       contents: {
				   "<<"
				   | "id"
				   | "!" "id"
				   | "@%&!" "id"
				   | "string"
				   | "(" branches ")"
				   | "{" branches "}"
				   | [*] >>
				   } ;
	       branches: (
				   "!" "[" look "]" contents "?"
				   | [*] branch (
				   ["|"] {"|" branch ?}
				   | "?" !endbranch
				   | [*] ) ) ;
	       branch: (
				   "string" contents
				   | "[" look "]" contents ) ;
	       look: (
				   ["string"/"/"] "string" "/" "string"
				   | "*"
				   | [*] looker {"," looker ?} ) ;
	       looker: ( "string" | "id" ) ;

Error Repair

When the input token is not one of those desired, either because the item being executed is an input token and a different token appears on the input, or because none of the branches of a choice/loop is executable, error repair is invoked to try to fix things up. Sometimes it can actually guess right and fix the error, but more frequently it merely supplies a legal output so that later passes will not be thrown into chaos by a minor syntax error.

The general error-repair strategy of an AASL parser is to give the parser what it wants and then attempt to resynchronize the input with the parser.

[xxx long discussion of how ``what it wants'' is determined when there are multiple possibilities]

Resynchronization is performed in three stages. The first stage attempts to resynchronize within a logical line, and is applied only if neither the input token nor the desired token is a line terminator (a member of the ``lineterm'' class). If the input token is trivial (a member of the ``trivial'' class), it is discarded. Otherwise it is retained, in hopes that it will be the next token that the parser asks for.

Either way, an error message is produced, indicating what was desired, what was seen, and what was handed to the parser. If too many of these messages have been produced for a single line, the parser gives up, produces a last despairing message, emits a ``#!aargh'' action token to alert later passes, and exits. Barring this disaster, parsing then continues. If the parser at some point is willing to accept the input token, it is accepted and error repair terminates. If a line terminator is seen in input, or the parser requests one, before the parser is willing to accept the input token, the second phase begins.

The second stage of resynchronization attempts to line both input and parser up on a line terminator. If the desired token is a line terminator and the input token is not, input is discarded until a line terminator appears. If the desired token is not a line terminator and the input token is, the input token is retained and parsing continues until the parser asks for a line terminator. Either way, the third phase then begins.

The third stage of resynchronization attempts to reconcile line terminators. If the desired and input tokens are identical, the input token is accepted and error repair terminates. If they are not identical and the input token is trivial (yes, line terminators can be trivial, and ones like `;' probably should be), the input token is discarded. If the desired token is the endmarker, then the input token is discarded. Otherwise, the input token continues to be retained in hopes that it will eventually be accepted. [xxx this needs more thought] In any case, the second phase begins again.


Files

all in $AASLDIR:
interp  table interpreter
lex     first pass of aaslg
syn     AASL table for aaslg
sem     third pass of aaslg

See Also

awk(1), yacc(1)


``error-repair disaster'' means that the first token of a double lookahead could not be accepted and error repair was invoked on it.


Written at University of Toronto by Henry Spencer, somewhat in the spirit of S/SL (see ACM TOPLAS April 1982).


Some of the restrictions on double lookahead are annoying.

Most of the C string escapes are recognized but disregarded, with only a backslashed double-quote interpreted properly during text generation.

Error repair needs further tuning; it has an annoying tendency to infinite-loop in certain odd situations (although the messages/line limit eventually breaks the loop).

Complex choices/loops with many branches can result in very long lines in the table.


The implementation of AASL was fairly straightforward, with AASL itself used to describe its own syntax. An AASL specification is compiled into a table, which is then processed by a table-walking interpreter. The interpreter expects input to be tokens, one per line, much like the output of a traditional scanner. A complete program using AASL (for example, the AASL table generator) is normally three passes: the scanner, the parser (tables plus interpreter), and a semantics pass. The first set of tables was generated by hand for bootstrapping.

Apart from the minor nuisance of repeated iterations of language design, the biggest problem of implementing AASL was the question of semantic actions. Inserting awk semantic routines into the table interpreter, in the style of yacc, would not be impossible, but it seemed clumsy and inelegant. Awk's lack of any provision for compile-time initialization of tables strongly suggested reading them in at run time, rather than taking up space with a huge BEGIN action whose only purpose was to initialize the tables. This makes insertions into the interpreter's code awkward.

The problem was solved by a crucial observation: traditional compilers (etc.) merge a two-step process, first validating a token stream and inserting semantic action cookies into it, then interpreting the stream and the cookies to interface to semantics. For example, yacc's grammar notation can be viewed as inserting fragments of C code into a parsed output, and then interpreting that output. This approach yields an extremely natural pass structure for an AASL parser, with the parser's output stream being (in the absence of syntax errors) a copy of its input stream with annotations. The following semantic pass then processes this, momentarily remembering normal tokens and interpreting annotations as operations on the remembered values. (The semantic pass is, in fact, a classic pattern+action awk program, with a pattern and an action for each annotation, and a general ``save the value in a variable'' action for normal tokens.)

The one difficulty that arises with this method is when the language definition involves feedback loops between semantics and parsing, an obvious example being C's typedef. Dealing with this really does require some embedding of semantics into the interpreter, although with care it need not be much: the in-parser code for recognizing C typedefs, including the complications introduced by block structure and nested redeclarations of type names, is about 40 lines of awk. The in-parser actions are invoked by a special variant of the AASL ``emit semantic annotation'' syntax.

A side benefit of top-down parsing is that the context of errors is known, and it is relatively easy to implement automatic error recovery. When the interpreter is faced with an input token that does not appear in the list of possibilities in the parser table, it gives the parser one of the possibilities anyway, and then uses simple heuristics to try to adjust the input to resynchronize. The result is that the parser, and subsequent passes, always see a syntactically-correct program. (This approach is borrowed from S/SL and its predecessors.) Although the detailed error-recovery algorithm is still experimental, and the current one is not entirely satisfactory when a complex AASL specification does certain things, in general it deals with minor syntax errors simply and cleanly without any need for complicating the specification with details of error recovery. Knowing the context of errors also makes it much easier to generate intelligible error messages automatically.

The AASL implementation is not large. The scanner is 78 lines of awk, the parser is 61 lines of AASL (using a fairly low-density paragraphing style and a good many comments), and the semantics pass is 290 lines of awk. The table interpreter is 340 lines, about half of which (and most of the complexity) can be attributed to the automatic error recovery.

As an experiment with a more ambitious AASL specification, one for ANSI C was written. This occupies 374 lines excluding comments and blank lines, and with the exception of the messy details of C declarators is mostly a fairly straightforward transcription of the syntax given in the ANSI standard. Generating tables for this takes about three minutes of CPU time on a Sun 3/180; the tables are about 10K bytes.

The performance of the resulting ANSI C parser is not impressive: in very round numbers, averaged over a large program, it parses about one line of C per CPU second. (The scanner, 164 lines of awk, accounts for a negligible fraction of this.) Some attention to optimization of both the tables and the interpreter might speed this up somewhat, but remarkable improvements are unlikely. As things stand, in the absence of better awk implementations or a rewrite of the table interpreter in C, it's a cute toy, possibly of some pedagogical value, but not a useful production tool. On the other hand, there does not appear to be any fundamental reason for the performance shortfall: it's purely the result of the slow execution of awk programs.

Lessons From AASL

The scanner would be much faster with better regular-expression matching, because it can use regular expressions to determine whether a string is a plausible token but must use substr to extract the string first. Nawk functions would be very handy for modularizing code, especially the complicated and seldom-invoked error-recovery procedure. A switch statement modelled on the pattern+action scheme would be useful in several places.
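The match-then-extract dance this paragraph describes can be made concrete. Below is a small sketch in modern awk (not AASL's actual scanner; the token classes chosen here are illustrative): the regular expression only tests the string, and substr() with RSTART/RLENGTH must pull the token text out separately.

```shell
echo 'count = 42' | awk '{
  line = $0
  while (length(line) > 0) {
    sub(/^[ \t]+/, "", line)                      # skip blanks
    if (line == "") break
    if (match(line, /^[A-Za-z_][A-Za-z_0-9]*/))   # identifier?
      printf "id\t%s\n", substr(line, RSTART, RLENGTH)
    else if (match(line, /^[0-9]+/))              # number?
      printf "num\t%s\n", substr(line, RSTART, RLENGTH)
    else {                                        # single-character operator
      match(line, /^./)
      printf "op\t%s\n", substr(line, 1, 1)
    }
    line = substr(line, RSTART + RLENGTH)         # consume the token
  }
}'
```

With regular-expression *extraction* (as in later gawk's match() third argument), the substr() calls would disappear.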

Another troublesome issue is that arrays are second-class citizens in awk (and continue to be so in nawk): there is no array assignment. This lack leads to endless repetitions of code like:

for (i in array) 
    arraystack[i ":" sp] = array[i] 

whenever block structuring or a stack is desired. Nawk's multi-dimensional arrays supply some syntactic sugar for this but don't really fix the problem. Not only is this code clumsy, it is woefully inefficient compared to something like

arraystack[sp] = array 

even if the implementation is very clever. This significantly reduces the usefulness of arrays as symbol tables and the like, a role for which they are otherwise very well suited.
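The copy-loop idiom above can be exercised directly. Here is a runnable sketch (the names arraystack and sp follow the text; split("", array) is used to clear the array portably) that pushes an array, clobbers it, and restores it:

```shell
awk 'BEGIN {
  array["x"] = 1; array["y"] = 2
  sp = 0
  # "push": copy the array element-by-element under compound keys
  for (i in array) arraystack[i ":" sp] = array[i]
  sp++

  split("", array)                 # clobber the working array
  array["x"] = 99

  # "pop": restore the saved copy from the stack
  sp--
  for (k in arraystack) {
    split(k, part, ":")
    if (part[2] == sp) array[part[1]] = arraystack[k]
  }
  print array["x"], array["y"]
}'
```

An array assignment `arraystack[sp] = array` would replace both loops with single statements, which is exactly the point being made.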

It would also be of some use if there were some way to initialize arrays as constant tables, or alternatively a guarantee that the BEGIN action would be implemented cleverly and would not occupy space after it had finished executing.

A minor nuisance that surfaces constantly is that getting an error message out to the standard-error descriptor is painfully clumsy: one gets to choose between putting error messages out to a temporary file and having a shell "wrapper" process them later, or piping them into "cat >&2" (!).
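The second workaround looks like this in practice (a sketch; later gawk added the special file /dev/stderr, which avoids the subprocess entirely):

```shell
awk 'BEGIN {
  # the pipe trick: cat inherits our stderr, so the message lands there
  print "oops: something went wrong" | "cat >&2"
  close("cat >&2")
  print "normal output"
}'
```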

The multi-pass input-driven structure that awk naturally lends itself to produces very clean and readable code with different phases neatly separated, but creates substantial difficulties when feedback loops appear. (In the case of AASL, this perhaps says more about language design than about awk.)


Henry Spencer.

categories: Interpreters,May,2009,SteveJ

Brainfuck to C

(Editor's note: One of the benefits of gawk is its ability to quickly code filters that convert artifacts from one form to another. For example, here's a BrainFuck to C translator.)

About BrainFuck

(From Wikipedia.)

The BrainFuck programming language is an esoteric programming language noted for its extreme minimalism. It is a Turing tarpit, designed to challenge and amuse programmers, and is not suitable for practical use.

Urban Muller created BrainFuck in 1993 with the intention of designing a language which could be implemented with the smallest possible compiler, inspired by the 1024-byte compiler for the FALSE programming language. Several BrainFuck compilers have been made smaller than 200 bytes. The classic distribution is Muller's version 2, containing a compiler for the Amiga, an interpreter, example programs, and a readme document.

The language consists of eight commands:

>   increment the data pointer (to point to the next cell to the right).
<   decrement the data pointer (to point to the next cell to the left).
+   increment (increase by one) the byte at the data pointer.
-   decrement (decrease by one) the byte at the data pointer.
.   output the value of the byte at the data pointer.
,   accept one byte of input, storing its value in the byte at the data pointer.
[   if the byte at the data pointer is zero, then instead of moving the instruction pointer forward to the next command, jump it forward to the command after the matching ] command.
]   if the byte at the data pointer is nonzero, then instead of moving the instruction pointer forward to the next command, jump it back to the command after the matching [ command.

A Brainfuck program is a sequence of these commands, possibly interspersed with other characters (which are ignored). The commands are executed sequentially, except as noted below; an instruction pointer begins at the first command, and each command it points to is executed, after which it normally moves forward to the next command. The program terminates when the instruction pointer moves past the last command.
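The execution model just described is small enough to implement directly. Here is a hedged sketch of a Brainfuck interpreter in awk (the `,` input command is omitted for brevity; the tape and pointer names are this sketch's own), run on a program that computes 8*8+1 = 65 and prints it as a character:

```shell
awk -v prog='++++++++[>++++++++<-]>+.' 'BEGIN {
  n = length(prog); c = 0              # c: data pointer; b[]: the tape
  for (ip = 1; ip <= n; ip++) {
    op = substr(prog, ip, 1)
    if      (op == ">") c++
    else if (op == "<") c--
    else if (op == "+") b[c]++
    else if (op == "-") b[c]--
    else if (op == ".") printf "%c", b[c]
    else if (op == "[" && b[c] == 0)   # skip forward past the matching ]
      for (d = 1; d > 0; ) {
        ip++; t = substr(prog, ip, 1)
        if (t == "[") d++; else if (t == "]") d--
      }
    else if (op == "]" && b[c] != 0)   # jump back to just after the matching [
      for (d = 1; d > 0; ) {
        ip--; t = substr(prog, ip, 1)
        if (t == "]") d++; else if (t == "[") d--
      }
  }
  print ""                             # final newline
}'
```

This prints A (ASCII 65). Non-command characters fall through all the branches and are ignored, as the language requires.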

The Translator

I wrote a BrainFuck to C translator in awk. It only took a few minutes, and I noticed that no awk version of this existed.

I haven't run it through its paces (I just wrote a few small BrainFuck programs to test it out), so if you find a bug, please let me know.

#!/sw/bin/awk -f
# a brainfuck to C translator.
# Needs a recent version of gawk; if on OS X,
# try using Fink's version.
# steve jenson

BEGIN {
  print "#include <stdio.h>\n";
  print "int main() {";
  print "  int c = 0;";
  print "  static int b[30000];\n";
}

{
  # Note: the order in which these are
  # substituted is very important.
  gsub(/\]/, "  }\n");
  gsub(/\[/, "  while(b[c] != 0) {\n");
  gsub(/\+/, "  ++b[c];\n");
  gsub(/\-/, "  --b[c];\n");
  gsub(/>/, "  ++c;\n");
  gsub(/</, "  --c;\n");
  gsub(/\./, "  putchar(b[c]);\n");
  gsub(/,/, "  b[c] = getchar();\n");
  print $0
}

END {
  print "\n  return 0;";
  print "}";
}
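A session with the translator might look like this (the filename bf2c.awk is chosen here for illustration; the test program computes 8*8+1 = 65, the ASCII code for A):

```shell
# save the translator above as bf2c.awk
printf '++++++++[>++++++++<-]>+.\n' > hello.bf
awk -f bf2c.awk hello.bf > hello.c
cc -o hello hello.c && ./hello     # should print A
```

Note that comments in the .bf file must be stripped first, since (as the author mentions below) any character the gsub() calls don't recognize passes straight through into the generated C.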


  • You can blame this on Evan Martin, his recent post wherein half the universe decided to weigh in about PHP had a mention of mod_bf. Am I easily distracted or what?
  • Last update, I swear. Darius said that I don't need to initialize b if I declare it static. Also, I realized that my previous version wouldn't understand if you had multiple operators on the same line since it was matching records and not fields. This version works on all of the programs I tried in the BrainFuck archive (as long as you strip comments).


Steve Jenson

categories: Eliza,Top10,AwkLisp,Interpreters,Dsl,Mar,2009,DariusB


AwkLisp: a Lisp Interpreter in Awk

Download from


awk [-v profiling=1] -f awklisp [optional-Lisp-source-files]

The -v profiling=1 option turns call-count profiling on.

If you want to use it interactively, be sure to include '-' (for the standard input) among the source files. For example:

gawk -f awklisp startup numbers lists -



This program arose out of one-upmanship. At my previous job I had to use MapBasic, an interpreter so astoundingly slow (around 100 times slower than GWBASIC) that one must wonder if it itself is implemented in an interpreted language. I still wonder, but it clearly could be: a bare-bones Lisp in awk, hacked up in a few hours, ran substantially faster. Since then I've added features and polish, in the hope of taking over the burgeoning market for stately language implementations.

This version tries to deal with as many of the essential issues in interpreter implementation as is reasonable in awk (though most would call this program utterly unreasonable from start to finish, perhaps...). Awk's impoverished control structures put error recovery and tail-call optimization out of reach, in that I can't see a non-painful way to code them. The scope of variables is dynamic because that was easier to implement efficiently. Subject to all those constraints, the language is as Schemely as I could make it: it has a single namespace with uniform evaluation of expressions in the function and argument positions, and the Scheme names for primitives and special forms.

The rest of this file is a reference manual. My favorite tutorial would be The Little LISPer (see section 5, References); don't let the cute name and the cartoons turn you off, because it's a really excellent book with some mind-stretching material towards the end. All of its code will work with awklisp, except for the last two chapters. (You'd be better off learning with a serious Lisp implementation, of course.)

For more details on the implementation, see the Implementation notes (below).




(define fib
  (lambda (n)
    (if (< n 2)
        n
        (+ (fib (- n 1))
           (fib (- n 2))))))
(fib 20)

Command line:

gawk -f awklisp startup numbers  lists fib.lsp




Here are the standard ELIZA dialogue patterns:

(define rules
    '(((hello)
       (How do you do -- please state your problem))
      ((I want)
       (What would it mean if you got -R-)
       (Why do you want -R-)
       (Suppose you got -R- soon))
      ((if)
       (Do you really think its likely that -R-)
       (Do you wish that -R-)
       (What do you think about -R-)
       (Really-- if -R-))
      ((I was)
       (Were you really?)
       (Perhaps I already knew you were -R-)
       (Why do you tell me you were -R- now?))
      ((I am)
       (In what way are you -R-)
       (Do you want to be -R-))
      ((because)
       (Is that the real reason?)
       (What other reasons might there be?)
       (Does that reason seem to explain anything else?))
      ((I feel)
       (Do you often feel -R-))
      ((I felt)
       (What other feelings do you have?))
      ((yes)
       (You seem quite positive)
       (You are sure)
       (I understand))
      ((no)
       (Why not?)
       (You are being a bit negative)
       (Are you saying no just to be negative?))
      ((someone)
       (Can you be more specific?))
      ((everyone)
       (Surely not everyone)
       (Can you think of anyone in particular?)
       (Who for example?)
       (You are thinking of a special person))
      ((perhaps)
       (You do not seem quite certain))
      ((are)
       (Did you think they might not be -R-)
       (Possibly they are -R-))
      (()
       (Very interesting)
       (I am not sure I understand you fully)
       (What does that suggest to you?)
       (Please continue)
       (Go on)
       (Do you feel strongly about discussing such things?))))

Command line:

gawk -f awklisp startup numbers  lists eliza.lsp -


> (eliza)
Hello-- please state your problem 
> (I feel sick)
Do you often feel sick 
> (I am in love with awk)
In what way are you in love with awk 
> (because it is so easy to use)
Is that the real reason? 
> (I was laughed at by the other kids at space camp)
Were you really? 
> (everyone hates me)
Can you think of anyone in particular? 
> (everyone at space camp)
Surely not everyone 
> (perhaps not tina fey)
You do not seem quite certain 
> (I want her to laugh at me)
What would it mean if you got her to laugh at me 

Expressions and their evaluation

Lisp evaluates expressions, which can be simple (atoms) or compound (lists).

An atom is a string of characters, which can be letters, digits, and most punctuation; the characters may -not- include spaces, quotes, parentheses, brackets, '.', '#', or ';' (the comment character). In this Lisp, case is significant ( X is different from x ).

  • Atoms: atom 42 1/137 + ok? hey:names-with-dashes-are-easy-to-read
  • Not atoms: don't-include-quotes (or spaces or parentheses)

A list is a '(', followed by zero or more objects (each of which is an atom or a list), followed by a ')'.

  • Lists: () (a list of atoms) ((a list) of atoms (and lists))
  • Not lists: ) ((()) (two) (lists)

The special object nil is both an atom and the empty list. That is, nil = (). A non-nil list is called a -pair-, because it is represented by a pair of pointers, one to the first element of the list (its -car-), and one to the rest of the list (its -cdr-). For example, the car of ((a list) of stuff) is (a list), and the cdr is (of stuff). It's also possible to have a pair whose cdr is not a list; the pair with car A and cdr B is printed as (A . B).
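Reading this syntax is mostly a matter of splitting parentheses out as their own tokens. A few lines of awk (a sketch, not awklisp's actual reader) illustrate the idea:

```shell
printf '(a (list 42) of-atoms)\n' | awk '{
  gsub(/\(/, " ( ")        # make each paren a word of its own
  gsub(/\)/, " ) ")
  n = split($0, tok, /[ \t]+/)
  for (i = 1; i <= n; i++)
    if (tok[i] != "") print tok[i]
}'
```

This prints one token per line: the parens, then each atom. A reader then only needs to match `(` against `)` to build the list structure.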

That's the syntax of programs and data. Now let's consider their meaning. You can use Lisp like a calculator: type in an expression, and Lisp prints its value. If you type 25, it prints 25. If you type (+ 2 2), it prints 4. In general, Lisp evaluates a particular expression in a particular environment (set of variable bindings) by following this algorithm:

  • If the expression is a number, return that number.
  • If the expression is a non-numeric atom (a -symbol-), return the value of that symbol in the current environment. If the symbol is currently unbound, that's an error.
  • Otherwise the expression is a list. If its car is one of the symbols: quote, lambda, if, begin, while, set!, or define, then the expression is a -special- -form-, handled by special rules. Otherwise it's just a procedure call, handled like this: evaluate each element of the list in the current environment, and then apply the operator (the value of the car) to the operands (the values of the rest of the list's elements). For example, to evaluate (+ 2 3), we first evaluate each of its subexpressions: the value of + is (at least in the initial environment) the primitive procedure that adds, the value of 2 is 2, and the value of 3 is 3. Then we call the addition procedure with 2 and 3 as arguments, yielding 5. For another example, take (- (+ 2 3) 1). Evaluating each subexpression gives the subtraction procedure, 5, and 1. Applying the procedure to the arguments gives 4.
We'll see all the primitive procedures in the next section. A user-defined procedure is represented as a list of the form (lambda <parameters> <body>), such as (lambda (x) (+ x 1)). To apply such a procedure, evaluate its body in the environment obtained by extending the current environment so that the parameters are bound to the corresponding arguments. Thus, to apply the above procedure to the argument 41, evaluate (+ x 1) in the same environment as the current one except that x is bound to 41.

If the procedure's body has more than one expression -- e.g., (lambda () (write 'Hello) (write 'world!)) -- evaluate them each in turn, and return the value of the last one.
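That extend-then-evaluate rule might be sketched like so (Python, with a dictionary standing in for the environment; the helper names are hypothetical, and awklisp actually uses a binding stack rather than copied dictionaries):

```python
# Applying (lambda (x) (+ x 1)) to 41: extend the current environment so
# the parameters are bound to the arguments, then evaluate the body there.

def apply_lambda(params, body, env, args):
    extended = dict(env)                  # the current environment, extended...
    extended.update(zip(params, args))    # ...with parameters bound to arguments
    return body(extended)                 # evaluate the body in that environment

env = {"x": 99}
# the body (+ x 1), modeled as a Python function of the environment:
result = apply_lambda(["x"], lambda e: e["x"] + 1, env, [41])
print(result)        # 42: inside the call, x is 41
print(env["x"])      # the enclosing binding of x is untouched
```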

We still need the rules for special forms. They are:

  • The value of (quote <x>) is <x>. There's a shorthand for this form: '. E.g., the value of '(+ 2 2) is (+ 2 2), -not- 4.
  • (lambda <parameters> <body>) returns itself: e.g., the value of (lambda (x) x) is (lambda (x) x).
  • To evaluate (if <test-expr> <then-exp> <else-exp>), first evaluate <test-expr>. If the value is true (non-nil), then return the value of <then-exp>, otherwise return the value of <else-exp>. (<else-exp> is optional; if it's left out, pretend there's a nil there.) Example: (if nil 'yes 'no) returns no.
  • To evaluate (begin <expr-1> <expr-2>...), evaluate each of the subexpressions in order, returning the value of the last one.
  • To evaluate (while <test> <expr-1> <expr-2>...), first evaluate <test>. If it's nil, return nil. Otherwise, evaluate <expr-1>, <expr-2>,... in order, and then repeat.
  • To evaluate (set! <variable> <expr>), evaluate <expr>, and then set the value of <variable> in the current environment to the result. If the variable is currently unbound, that's an error. The value of the whole set! expression is the value of <expr>.
  • (define <variable> <expr>) is like set!, except it's used to introduce new bindings, and the value returned is <variable>.

It's possible to define new special forms using the macro facility provided in the startup file. The macros defined there are:

  • (let ((<var> <expr>)...) <body>)
    Bind each <var> to its corresponding <expr> (evaluated in the current environment), and evaluate <body> in the resulting environment.
  • (cond (<test-expr> <result-expr>...)... (else <result-expr>...))
    where the final else clause is optional. Evaluate each <test-expr> in turn, and for the first non-nil result, evaluate its <result-expr>. If none are non-nil, and there's no else clause, return nil.
  • (and <expr>...)
    Evaluate each <expr> in order, until one returns nil; then return nil. If none are nil, return the value of the last <expr>.
  • (or <expr>...)
    Evaluate each <expr> in order, until one returns non-nil; return that value. If all are nil, return nil.
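Macros like these reduce to the core special forms. For instance, (and ...) can be rewritten as nested ifs; the expansion below is illustrative, not copied from the startup file (the zero-argument case, returning true, is a conventional choice the text leaves unspecified):

```python
# Expand (and e1 e2 ... en) into nested ifs:
#   (and)            -> t (true, by convention)
#   (and e)          -> e
#   (and e rest...)  -> (if e (and rest...) nil)
# Expressions are modeled as Python lists and strings, as a sketch only.

def expand_and(exprs):
    if not exprs:
        return "t"
    if len(exprs) == 1:
        return exprs[0]
    return ["if", exprs[0], expand_and(exprs[1:]), "nil"]

print(expand_and(["a", "b", "c"]))
# -> ['if', 'a', ['if', 'b', 'c', 'nil'], 'nil']
```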

Built-in procedures

List operations:

  • (null? <x>) returns true (non-nil) when <x> is nil.
  • (atom? <x>) returns true when <x> is an atom.
  • (pair? <x>) returns true when <x> is a pair.
  • (car <pair>) returns the car of <pair>.
  • (cdr <pair>) returns the cdr of <pair>.
  • (cadr <pair>) returns the car of the cdr of <pair>. (i.e., the second element.)
  • (cddr <pair>) returns the cdr of the cdr of <pair>.
  • (cons <x> <y>) returns a new pair whose car is <x> and whose cdr is <y>.
  • (list <x>...) returns a list of its arguments.
  • (set-car! <pair> <x>) changes the car of <pair> to <x>.
  • (set-cdr! <pair> <x>) changes the cdr of <pair> to <x>.
  • (reverse! <list>) reverses <list> in place, returning the result.
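reverse! is the one deliberately destructive traversal in this group: in a pointer representation it works by redirecting cdr links as it walks the list. A Python sketch over two parallel arrays, in the spirit of awklisp's car[] and cdr[] (indices play the role of pointers, with 0 standing in for nil):

```python
# In-place list reversal by flipping cdr pointers.
NIL = 0
car = {1: "a", 2: "b", 3: "c"}
cdr = {1: 2, 2: 3, 3: NIL}        # the list (a b c), starting at cell 1

def reverse_in_place(p):
    prev = NIL
    while p != NIL:
        nxt = cdr[p]      # remember the rest of the list
        cdr[p] = prev     # flip this cell's cdr to point backwards
        prev = p
        p = nxt
    return prev           # the old last cell is the new head

head = reverse_in_place(1)

# walk the result to confirm it is now (c b a)
out, p = [], head
while p != NIL:
    out.append(car[p])
    p = cdr[p]
print(out)   # ['c', 'b', 'a']
```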

Numbers:

  • (number? <x>) returns true when <x> is a number.
  • (+ <n> <n>) returns the sum of its arguments.
  • (- <n> <n>) returns the difference of its arguments.
  • (* <n> <n>) returns the product of its arguments.
  • (quotient <n> <n>) returns the quotient. Rounding is towards zero.
  • (remainder <n> <n>) returns the remainder.
  • (< <n1> <n2>) returns true when <n1> is less than <n2>.

I/O:

  • (write <x>) writes <x> followed by a space.
  • (newline) writes the newline character.
  • (read) reads the next expression from standard input and returns it.

Meta-operations:

  • (eval <x>) evaluates <x> in the current environment, returning the result.
  • (apply <proc> <list>) calls <proc> with arguments <list>, returning the result.

Miscellany:

  • (eq? <x> <y>) returns true when <x> and <y> are the same object. Be careful using eq? with lists, because (eq? (cons <x> <y>) (cons <x> <y>)) is false.
  • (put <x> <y> <z>) stores <z> under the key pair <x> and <y> (a property-list operation; see get).
  • (get <x> <y>) returns the last value <z> that was put for <x> and <y>, or nil if there is no such value.
  • (symbol? <x>) returns true when <x> is a symbol.
  • (gensym) returns a new symbol distinct from all symbols that can be read.
  • (random <n>) returns a random integer between 0 and <n>-1 (if <n> is positive).
  • (error <x>...) writes its arguments and aborts with error code 1.

Implementation Notes


Since the code should be self-explanatory to anyone knowledgeable about Lisp implementation, these notes assume you know Lisp but not interpreters. I haven't got around to writing up a complete discussion of everything, though.

The code for an interpreter can be pretty low on redundancy -- this is natural because the whole reason for implementing a new language is to avoid having to code a particular class of programs in a redundant style in the old language. We implement what that class of programs has in common just once, then use it many times. Thus an interpreter has a different style of code, perhaps denser, than a typical application program.

Data representation

Conceptually, a Lisp datum is a tagged pointer, with the tag giving the datatype and the pointer locating the data. We follow the common practice of encoding the tag into the two lowest-order bits of the pointer. This is especially easy in awk, since arrays with non-consecutive indices are just as efficient as dense ones (so we can use the tagged pointer directly as an index, without having to mask out the tag bits). (But, by the way, mawk accesses negative indices much more slowly than positive ones, as I found out when trying a different encoding.)

This Lisp provides three datatypes: integers, lists, and symbols. (A modern Lisp provides many more.)

For an integer, the tag bits are zero and the pointer bits are simply the numeric value; thus, N is represented by N*4. This choice of the tag value has two advantages. First, we can add and subtract without fiddling with the tags. Second, negative numbers fit right in. (Consider what would happen if N were represented by 1+N*4 instead, and we tried to extract the tag as N%4, where N may be either positive or negative. Because of this problem and the above-mentioned inefficiency of negative indices, all other datatypes are represented by positive numbers.)
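The arithmetic behind that choice can be checked directly. A Python sketch of the integer case only (the tag values for the other datatypes are not restated here):

```python
# Sketch of the tagged-pointer scheme: the two low-order bits are the tag,
# and for integers the tag is 0, so N is represented simply as N*4.

def make_number(n):
    return n * 4                # tag bits 00; works for negative n too

def numeric_value(x):
    return x // 4               # exact, since a number's x % 4 == 0

# Addition needs no untagging: (a*4) + (b*4) == (a+b)*4.
a, b = make_number(2), make_number(-5)
print(numeric_value(a + b))     # -3
print((a + b) % 4)              # 0: the sum is still tagged as a number
```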

The evaluation/saved-bindings stack

The following is from an email discussion; it doesn't develop everything from first principles but is included here in the hope it will be helpful.

Hi. I just took a look at awklisp, and remembered that there's more to your question about why we need a stack -- it's a good question. The real reason is because a stack is accessible to the garbage collector.

We could have had apply() evaluate the arguments itself, and stash the results into variables like arg0 and arg1 -- then the case for ADD would look like

if (proc == ADD) return is(a_number, arg0) + is(a_number, arg1)

The obvious problem with that approach is how to handle calls to user-defined procedures, which could have any number of arguments. Say we're evaluating ((lambda (x) (+ x 1)) 42). (lambda (x) (+ x 1)) is the procedure, and 42 is the argument.

A (wrong) solution could be to evaluate each argument in turn, and bind the corresponding parameter name (like x in this case) to the resulting value (while saving the old value to be restored after we return from the procedure). This is wrong because we must not change the variable bindings until we actually enter the procedure -- for example, with that algorithm ((lambda (x y) y) 1 x) would return 1, when it should return whatever the value of x is in the enclosing environment. (The eval_rands()-type sequence would be: eval the 1, bind x to 1, eval the x -- yielding 1 which is *wrong* -- and bind y to that, then eval the body of the lambda.)
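The wrong and right orders can be contrasted in a few lines. A hedged Python sketch of evaluating ((lambda (x y) y) 1 x) with x bound to 42 in the enclosing environment (the helper names are illustrative, not awklisp's):

```python
# Why binding parameters one at a time, before all operands are evaluated,
# is wrong: later operands see the new bindings instead of the old ones.
# An operand is either a literal int or a symbol (string) to look up.

def broken_apply(params, operands, env):
    # WRONG: eval an operand, bind it, eval the next operand...
    for p, o in zip(params, operands):
        env[p] = env[o] if isinstance(o, str) else o
    return env[params[-1]]          # the lambda's body is just y

def correct_apply(params, operands, env):
    # RIGHT: evaluate all operands first, *then* do the bindings.
    values = [env[o] if isinstance(o, str) else o for o in operands]
    env = dict(env)
    env.update(zip(params, values))
    return env[params[-1]]

env = {"x": 42}
print(broken_apply(["x", "y"], [1, "x"], dict(env)))   # 1  -- wrong
print(correct_apply(["x", "y"], [1, "x"], dict(env)))  # 42 -- right
```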

Okay, that's easily fixed -- evaluate all the operands and stash them away somewhere until you're done, and *then* do the bindings. So the question is where to stash them. How about a global array? Like

for (i = 0; arglist != NIL; ++i) {
    global_temp[i] = eval(car[arglist])
    arglist = cdr[arglist]
}
followed by the equivalent of extend_env(). This will not do, because the global array will get clobbered in recursive calls to eval(). Consider (+ 2 (* 3 4)) -- first we evaluate the arguments to the +, like this: global_temp[0] gets 2, and then global_temp[1] gets the eval of (* 3 4). But in evaluating (* 3 4), global_temp[0] gets set to 3 and global_temp[1] to 4 -- so the original assignment of 2 to global_temp[0] is clobbered before we get a chance to use it. By using a stack[] instead of a global_temp[], we finesse this problem.
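The stack fix can be sketched in Python (illustrative names; awklisp's stack[] works on the same principle): each eval() call remembers where its own arguments start, so recursive calls push and pop above that point without clobbering anything.

```python
# One global stack, safe under recursion: each call's frame is the slice
# of the stack above the base index it recorded on entry.
stack = []

def eval_expr(expr, env):
    if isinstance(expr, int):
        return expr
    op, rands = expr[0], expr[1:]
    base = len(stack)                     # where this call's args will start
    for r in rands:
        stack.append(eval_expr(r, env))   # may recurse and grow the stack
    args = stack[base:]                   # this frame's arguments, intact
    del stack[base:]                      # pop the frame
    return env[op](*args)

env = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}
print(eval_expr(["+", 2, ["*", 3, 4]], env))   # 14: the 2 survives the
                                               # nested (* 3 4) evaluation
```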

You may object that we can solve that by just making the global array local, and that's true; lots of small local arrays may or may not be more efficient than one big global stack, in awk -- we'd have to try it out to see. But the real problem I alluded to at the start of this message is this: the garbage collector has to be able to find all the live references to the car[] and cdr[] arrays. If some of those references are hidden away in local variables of recursive procedures, we're stuck. With the global stack, they're all right there for the gc().

(In C we could use the local-arrays approach by threading a chain of pointers from each one to the next; but awk doesn't have pointers.)

(You may wonder how the code gets away with having a number of local variables holding lisp values, then -- the answer is that in every such case we can be sure the garbage collector can find the values in question from some other source. That's what this comment is about:

  # All the interpretation routines have the precondition that their
  # arguments are protected from garbage collection.

In some cases where the values would not otherwise be guaranteed to be available to the gc, we call protect().)

Oh, there's another reason why apply() doesn't evaluate the arguments itself: it's called by do_apply(), which handles lisp calls like (apply car '((x))) -- where we *don't* want the x to get evaluated by apply().

References

  • Harold Abelson and Gerald J. Sussman, with Julie Sussman. Structure and Interpretation of Computer Programs. MIT Press, 1985.
  • John Allen. Anatomy of Lisp. McGraw-Hill, 1978.
  • Daniel P. Friedman and Matthias Felleisen. The Little LISPer. Macmillan, 1989.

Roger Rohrbach wrote a Lisp interpreter, in old awk (which has no procedures!), called walk. It can't do as much as this Lisp, but it certainly has greater hack value. Cooler name, too. It's available at


Eval doesn't check the syntax of expressions. This is a probably-misguided attempt to bump up the speed a bit; it also simplifies some of the code. The macroexpander in the startup file would be the best place to add syntax-checking.


Darius Bacon


Copyright (c) 1994, 2001 by Darius Bacon.

Permission is granted to anyone to use this software for any purpose on any computer system, and to redistribute it freely, subject to the following restrictions:

  1. The author is not responsible for the consequences of use of this software, no matter how awful, even if they arise from defects in it.
  2. The origin of this software must not be misrepresented, either by explicit claim or by omission.
  3. Altered versions must be plainly marked as such, and must not be misrepresented as being the original software.

categories: Awk100,Top10,Interpreters,Dsl,Apr,2009,HenryS

Amazing Awk Assembler

Download from LAWKER.


"aaa" (the Amazing Awk Assembler) is a primitive assembler written entirely in awk and sed. It was done for fun, to establish whether it was possible. It is; it works. It's quite slow, the input syntax is eccentric and rather restricted, and error-checking is virtually nonexistent, but it does work. Furthermore it's very easy to adapt to a new machine, provided the machine falls into the generic "8-bit-micro" category. It is supplied "as is", with no guarantees of any kind. I can't be bothered to do any more work on it right now, but even in its imperfect state it may be useful to someone.

aaa is the mainline shell file.

aux is a subdirectory with machine-independent stuff. Anon, 6801, and 6809 are subdirectories with machine-dependent stuff, choice specified by a -m option (default is "anon"). Actually, even the stuff that is supposedly machine-independent does have some machine-dependent assumptions; notably, it knows that bytes are 8 bits (not serious) and that the byte is the basic unit of instructions (more serious). These would have to change for the 68000 (going to 16-bit "bytes" might be sufficient) and maybe for the 32016 (harder).

aaa thinks that the machine subdirectories and the aux subdirectory are in the current directory, which is almost certainly wrong.

abst is an abstract for a paper. "card", in each machine directory, is a summary card for the slightly-eccentric input language. There is no real manual at present; sorry.

try.s is a sample piece of 6809 input; it is semantic trash, purely for test purposes. The assembler produces try.a, try.defs, and try.x as outputs from "aaa try.s". try.a is an internal file that looks somewhat like an assembly listing. try.defs is another internal file that looks somewhat like a symbol table. These files are preserved because of possible usefulness; tmp[123] are non-preserved temporaries. try.x is the Intel-hex output. try.x.good is identical to try.x and is a saved copy for regression testing of new work.

01pgm.s is a self-programming program for a 68701, based on the one in the Motorola ap note. 01pgm.x.good is another regression-test file.

If your C library (used by awk) has broken "%02x" so it no longer means "two digits of hex, *zero-filled*" (as some SysV libraries have), you will have to fall back from aux/hex to aux/hex.argh, which does it the hard way. Oh yes, you'll note that aaa feeds settings into awk on the command line; don't assume your awk won't do this until you try it.


Henry Spencer

categories: Interpreters,Apr,2009,DavidL

Awk A*

Programmers often take awk "as is", never thinking to use it as a lab in which we can explore other language extensions. This is, of course, only one way to treat the Awk code base.

An alternate approach is to treat the Awk code base as a reusable library of parsers, regular expression engines, etc., and to make modifications to the language. This second approach was taken by David Ladd and J. Christopher Ramming in their A* system.

They write:

    While there are a number of systems that will help one construct full-blown metaprograms such as compilers and interpreters, we wanted something with extremely low overhead. We set out to build something with the property that it would help even inexperienced users build simple meta-programs in a matter of minutes with a few lines of code. A* is the result; it is more than anything else an engineering exercise, as most of its ideas are not new. It is the arrangement of these ideas and the purpose to which they are directed that distinguish A* from other tools.

    A* is an experimental language designed to facilitate the creation of language-processing tools. It is analogous either to an interpreted yacc with Awk as its statement language, or to a version of Awk which processes programs rather than records. A* offers two principal advantages over the combination of lex, yacc, and C:

    1. a high-level interpreted base language
    2. built-in parse tree construction.
    A* programmers are thus able to accomplish many useful tasks with little code.

Reference: D.A. Ladd and J.C. Ramming, "A*: a language for implementing language processors," IEEE Transactions on Software Engineering, vol. 21, no. 11, Nov. 1995, pp. 894-901.
