About awk.info
» table of contents
» featured topics
» page tags
|
|
|
|
|
|
Mar 01: Michael Sanders demos an X-windows GUI for AWK.
Mar 01: Awk100#24: A. Lahm and E. de Rinaldis' patent search, in AWK
Feb 28: Tim Menzies asks this community to write an AWK cookbook.
Feb 28: Arnold Robbins announces a new debugger for GAWK.
Feb 28: Awk100#23: Premysl Janouch offers a IRC bot, In AWK
Feb 28: Updated: the AWK FAQ
Feb 28: Tim Menzies offers a tiny content management system, in Awk.
Jan 31: Comment system added to awk.info. For example, see discussion bottom of ?keys2awk
Jan 31: Martin Cohen shows that Gawk can handle massively long strings (300 million characters).
Jan 31: The AWK FAQ is being updated. For comments/ corrections/ extensions, please mail tim@menzies.us
Jan 31: Martin Cohen finds Awk on the Android platform.
Jan 31: Aleksey Cheusov released a new version of runawk.
Jan 31: Hirofumi Saito contributes a candidate Awk mascot.
Jan 31: Michael Sanders shows how to quickly build an AWK GUI for windows.
Jan 31: Hyung-Hwan Chung offers QSE, an embeddable Awk Interpreter.
These pages focus on language interpreters, written in Awk.
Download from LAWKER
aaslg [ -x ] [ file ... ] aaslr [ -x ] table [ file ... ]
Aaslg and aaslr implement the Amazing Awk Syntax Language, AASL (pro- nounced ``hassle''). Aaslg (pronounced ``hassling'') takes an AASL specification from the concatenation of the file(s) (default standard input) and emits the corresponding AASL table on standard output. Aaslr parses the contents of the file(s) (default standard input) according to the AASL table in file table, emitting the table's output on standard output.
Both take a -x option to turn on verbose and cryptic debugging output. Both look in a library directory for pieces of the AASL system; the AASLDIR environment variable, if present, overrides the default notion of the location of this directory.
Aaslr expects input to consist of input tokens, one per line. For sim- ple tokens, the line is just the text of the token. For metatokens like ``identifier'', the line is the metatoken's name, a tab, and the text of the token. [xxx discuss `#' lines]
Aaslr output, in the absence of syntax errors, consists of the input tokens plus action tokens, which are lines consisting of `#!' followed immediately by an identifier. If the syntax of the input does not match that specified in the AASL table, aaslr emits complaint(s) on standard error and attempts to repair the input into a legal form; see ``ERROR REPAIR'' below. Unless errors have cascaded to the point where aaslr gives up (in which case it emits the action token ``#!aargh'' to inform later passes of this), the output will always conform to the AASL syntax given in the table.
Normally, a complete program using AASL consists of three passes, the middle one being an invocation of aaslr. The first pass is a lexical analyzer, which breaks free-form input down into input tokens in some suitable way. The third pass is a semantics interpreter, which typi- cally responds to input tokens by momentarily remembering them and to action tokens by executing some action, often using the remembered value of the previous input token. Aaslg is in fact implemented using AASL, following this structure; it implements the -x option by just passing it to aaslr.
An AASL specification consists of class definitions, text definitions, and rules, in arbitrary order (except that class definitions must pre- cede use of the classes they define). A `#' (not enclosed in a string) begins a comment; characters from it to the end of the line are ignored. An identifier follows the same rules as a C identifier, except that in most contexts it can be at most 16 characters long. A string is enclosed in double quotes ("") and generally follows C syn- tax. Most strings denote input tokens, and references to ``input token'' as part of AASL specification syntax should be read as ``string denoting input token''.
A class definition is an identifier enclosed in angle brackets (<>) followed by one or more input tokens followed by a semicolon (;). It gives a name to a set of input tokens. Classes whose names start with capital letters are user abbreviations; see below. Classes whose names start with lowercase letters are special classes, used for internal purposes. The current special classes are:
For example, the class definitions used for AASL itself are:
<trivial> "," ";" ; <lineterm> ";" ; <endmarker> "EOF" ;
When AASL error repair is invoked, the parser sometimes needs to gener- ate input tokens. In the case of a metatoken, the parser knows the token's name but needs to generate a text for it as well. A text defi- nition consists of an input token, an arrow (->), and a string specify- ing what text should be generated for that token. For example, the text definitions used for AASL itself are:
"id" -> "___" "string" -> "\"___\""
The rules of a specification define the syntax that the parser should accept. The order of rules is not significant, except that the first rule is considered to be the top level of the specification. The spec- ification is executed by calling the first rule; when execution of that rule terminates, execution of the specification terminates. If the user wishes this to occur only at end of input, he should arrange for the lexical analyzer to produce an endmarker token (conventionally ``EOF'') at the end of the input, and should write the first rule to require that token at the end.
Note that an input token may be recognized considerably before it is accepted, but the parser emits it to the output only on acceptance.
A rule consists of an identifier naming it, a colon (:), a sequence of items which is the body of the rule, and a semicolon (;). When a rule is called, it is executed by executing the individual items of the body in order (as modified by control structures) until either one of them explicitly terminates execution of the rule or the last item is exe- cuted.
An item which is an input token requires that that token appear in the input at that point, and accepts it (causing it to be emitted as out- put).
An item which is an identifier denotes a call to another rule, which executes the body of that rule and then returns to the caller. It is an error to call a nonexistent rule.
An item which is an identifier preceded by `!' causes that identifier to be emitted as an action token; the identifier has no other signifi- cance.
An item which is `<<' causes execution of the current rule to terminate immediately, returning to the calling rule.
An item which is `>>' causes the execution of the innermost enclosing loop (see below) to terminate immediately, with execution continuing after the end of that loop. The loop must be within the same rule.
An item which is an identifier preceded by `@%&!' causes an internal semantic action to be executed within the parser; this is normally needed only for bizarre situations like C's typedef. [xxx should give details I suppose]
A choice is a sequence of branches enclosed in parentheses (()) and separated by vertical bars (|). The first of the branches that can be executed, is, after which execution continues after the end of the choice.
A loop is a sequence of branches enclosed in braces ({}) and separated by vertical bars (|). The first of the branches that can be executed, is, and this is done repeatedly until the loop is terminated by `>>', after which execution continues after the end of the loop. (A loop can also be terminated by `<<' terminating execution of the whole rule.)
A branch is just a sequence of items, like a rule body, except that it must begin with either an input token or a lookahead. If it begins with an input token, it can be executed only when that token is the next token in the input, and execution starts with acceptance of that token.
A lookahead specifies conditions for execution of a branch based on recognizing but not accepting input token(s). The simplest form is just an input token enclosed in brackets ([]), in which case execution of that branch is possible only when that token is the next token in the input. The brackets can also contain multiple input tokens sepa- rated by commas, in which case the parser looks for any of those tokens. If a user-abbreviation class name appears, either by itself or as an element of a comma-separated list, it stands for the list of tokens given in its definition.
If a lookahead's brackets contain only a `*', this is a default branch, executable regardless of the state of the input.
As a very special case, a lookahead's brackets can contain two input tokens separated by slash (/), in which case that branch is executable only when those two tokens, in sequence, are next in the input. Warn- ing: this is implemented by a delicate perversion of the error-repair machinery, and if the first of those tokens is not then accepted, the parser will die in convulsions. A further restriction is that the same input token may not appear as the first token of a double lookahead and as a normal lookahead token in the same choice/loop.
Certain simple choice/loop structures appear frequently, and there are abbreviations for them:
abbreviation expansion
( items ?) ( items | [*] )
{ items ?} { items | [*] >> }
( ! [look] items ?) ( [ look] | items )
{ ! [look] items ?} { [ look] >> | items }
For example, here are the rules of the AASL specification for AASL, minus the actions (which add considerable clutter and are unintelligi- ble without the third pass):
rules: {
"id" ":" contents ";"
| "<" "id" ">" {"string" ?} ";"
| "string" "->" "string"
| "EOF" >>
};
contents: {
">>"
| "<<"
| "id"
| "!" "id"
| "@%&!" "id"
| "string"
| "(" branches ")"
| "{" branches "}"
| [*] >>
};
branches: (
"!" "[" look "]" contents "?"
| [*] branch (
["|"] {"|" branch ?}
| "?" !endbranch
| [*]
)
);
branch: (
"string" contents
| "[" look "]" contents
);
look: (
["string"/"/"] "string" "/" "string"
| "*"
| [*] looker {"," looker ?}
);
looker: ( "string" | "id" ) ;
When the input token is not one of those desired, either because the item being executed is an input token and a different token appears on the input, or because none of the branches of a choice/loop is exe- cutable, error repair is invoked to try to fix things up. Sometimes it can actually guess right and fix the error, but more frequently it merely supplies a legal output so that later passes will not be thrown into chaos by a minor syntax error.
The general error-repair strategy of an AASL parser is to give the parser what it wants and then attempt to resynchronize the input with the parser.
[xxx long discussion of how ``what it wants'' is determined when there are multiple possibilities]
Resynchronization is performed in three stages. The first stage attempts to resynchronize within a logical line, and is applied only if neither the input token nor the desired token is a line terminator (a member of the ``lineterm'' class). If the input token is trivial (a member of the ``trivial'' class), it is discarded. Otherwise it is retained, in hopes that it will be the next token that the parser asks for.
Either way, an error message is produced, indicating what was desired, what was seen, and what was handed to the parser. If too many of these messages have been produced for a single line, the parser gives up, produces a last despairing message, emits a ``#!aargh'' action token to alert later pases, and exits. Barring this disaster, parsing then con- tinues. If the parser at some point is willing to accept the input token, it is accepted and error repair terminates. If a line termina- tor is seen in input, or the parser requests one, before the parser is willing to accept the input token, the second phase begins.
The second stage of resynchronization attempts to line both input and parser up on a line terminator. If the desired token is a line termi- nator and the input token is not, input is discarded until a line ter- minator appears. If the desired token is not a line terminator and the input token is, the input token is retained and parsing continues until the parser asks for a line terminator. Either way, the third phase then begins.
The third stage of resynchronization attempts to reconcile line termi- nators. If the desired and input tokens are identical, the input token is accepted and error repair terminates. If they are not identical and the input token is trivial (yes, line terminators can be trivial, and ones like `;' probably should be), the input token is discarded. If the desired token is the endmarker, then the input token is discarded. Otherwise, the input token continues to be retained in hopes that it will eventually be accepted. [xxx this needs more thought] In any case, the second phase begins again.
all in $AASLDIR: interp table interpreter lex first pass of aaslg syn AASL table for aaslg sem third pass of aaslg
awk(1), yacc(1)
``error-repair disaster'' means that the first token of a double looka- head could not be accepted and error repair was invoked on it.
Written at University of Toronto by Henry Spencer, somewhat in the spirit of S/SL (see ACM TOPLAS April 1982).
Some of the restrictions on double lookahead are annoying.
Most of the C string escapes are recognized but disregarded, with only a backslashed double-quote interpreted properly during text generation.
Error repair needs further tuning; it has an annoying tendency to infi- nite-loop in certain odd situations (although the messages/line limit eventually breaks the loop).
Complex choices/loops with many branches can result in very long lines in the table.
The implementation of AASL was fairly straight forward, with AASL itself used to describe its own syntax. An AASL specification is compiled into a table, which is then processed by a table-walking interpreter. The interpreter expects input to be as tokens, one per line, much likethe output of a traditional scanner. A complete program using AASL (for example, the AASL table generator) is normally three passes: thescanner,the parser (tables plus interpreter), and a semantics pass. The first set of tables was generated byhand for bootstrapping.
Apart from the minor nuisance of repeated iterations of language design, the biggest problem ofimplementing AASL wasthe question of semantic actions. Inserting awk semantic routines into the table interpreter, in the style of yacc,would not be impossible, but it seemed clumsy and inelegant. Awks lack of anyprovision for compile time initialization of tables strongly suggested reading them in at run time, rather than taking up space with a huge BEGIN action whose only purpose was to initialize the tables. This makes insertions into the interpreters code awkward.
The problem was solved by a crucial observation: traditional compilers (etc.) merge a two-stepprocess, first validating a token stream and inserting semantic action cookiesinto it, then interpreting thestream and the cookies to interface to semantics. Forexample, yaccs grammar notation can be viewed asinserting fragments of C code into a parsed output, and then interpreting that output. This approach yieldsan extremely natural pass structure for an AASL parser,with the parsersoutput stream being (in the absenceof syntax errors) a copy of its input stream with annotations. The following semantic pass then processesthis, momentarily remembering normal tokens and interpreting annotations as operations on the remembered values. (The semantic pass is, in fact, a classic pattern+action awk program, with a pattern and anaction for each annotation, and a general save the value in a variableaction for normal tokens.)
The one difficulty that arises with this method is when the language definition involves feedbackloops between semantics and parsing, an obvious example being Cs typedef.Dealing with this reallydoes require some imbedding of semantics into the interpreter,although with care it need not be much: thein-parser code for recognizing C typedefs, including the complications introduced by block structure andnested redeclarations of type names, is about 40 lines of awk.The in-parser actions are invoked by a special variant of the AASL emit semantic annotationsyntax.
Aside benefit of top-down parsing is that the context of errors is known, and it is relatively easy to implement automatic error recovery. When the interpreter is faced with an input token that does not appearin the list of possibilities in the parser table, it givesthe parser one of the possibilities anyway, and then usessimple heuristics to try to adjust the input to resynchronize. The result is that the parser,and subsequentpasses, always see a syntactically-correct program. (This approach is borrowed from S/SL and its predecessors.) Although the detailed error-recovery algorithm is still experimental, and the current one is notentirely satisfactory when a complex AASL specification does certain things, in general it deals with minorsyntax errors simply and cleanly without anyneed for complicating the specification with details of errorrecovery.Knowing the context of errors also makes it much easier to generate intelligible error messagesautomatically.
The AASL implementation is not large. The scanner is 78 lines of awk,the parser is 61 lines of AASL (using a fairly low-density paragraphing style and a good manycomments), and the semantics pass is 290 lines of awk. The table interpreter is 340 lines, about half of which (and most of the complexity) can be attributed to the automatic error recovery.
As an experiment with a more ambitious AASL specification, one for ANSI C was written. This occupies 374 lines excluding comments and blank lines, andwith the exception of the messy details of Cdeclaratorsis mostly a fairly straightforward transcription of the syntax given in the ANSI standard. Generating tables for this takes about three minutes of CPU time on a Sun 3/180; the tables are about 10K bytes.
The performance of the resulting ANSI C parser is not impressive: in very round numbers, averagedoveralarge program, it parses about one line of C per CPU second. (The scanner,164 lines of awk, accounts for a negligible fraction of this.) Some attention to optimization of both the tables and the interpreter might speed this up somewhat, but remarkable improvements are unlikely. As things stand in the absence of better awk implementations or a rewrite of the table interpreter in C, its a cute toy, possibly of some pedagogical value, but not a useful production tool. On the other hand, there does not appear to be any fundamental reason for the performance shortfall: itspurely the result of the slowexecution of awk programs.
The scanner would be much faster with better regular-expression matching, because it can use regular expressions to determine whether a string is a plausible token but must use substr to extract the string first. Nawk functions would be very handy for modularizing code, especially the complicated and seldom-invoked error-recovery procedure. A switch statement modelled on the pattern+action scheme would be useful in several places.
Another troublesome issue is that arrays are second-class citizens in awk (and continue to be so in nawk): there is no array assignment. This lack leads to endless repetitions of code like:
for (i in array)
arraystack[i ":" sp] = array[i]
whenever block structuring or a stack is desired. Nawk's multi-dimensional arrays supply some syntactic sugar for this but don't really fix the problem. Not only is this code clumsy, it is woefully inefficient compared to something like
arraystack[sp] = array
even if the implementation is very clever. This significantly reduces the usefulness of arrays as symboltables and the like, a role for which they are otherwise very well suited.
It would also be of some use if there were some way to initialize arrays as constant tables, or alternatively a guarantee that the BEGIN action would be implemented cleverly and would not occupy space after it had finished executing.
A minor nuisance that surfaces constantly is that getting an error message out to the standard-error descriptor is painfully clumsy: one gets to choose between putting error messages out to a temporary file and having a shell "wrapper" process them later, or piping them into "cat >&2" (!).
The multi-pass input-driven structure that awk naturally lends itself to produces very clean and readable code with different phases neatly separated, but creates substantial difficulties when feedback loops appear. (In the case of AASL,this perhaps says more about language design than about awk.)
Henry Spencer.
(Editor's note: One of the benefits of gawk is its ability to quickly code filters that convert artifacts from one form to another. For example, here's a BrainFuck to C translator.)
(From Wikipeidia.)
The BrainFuck programming language is an esoteric programming language noted for its extreme minimalism. It is a Turing tarpit, designed to challenge and amuse programmers, and is not suitable for practical use
Urban Muller created BrainFuck in 1993 with the intention of designing a language which could be implemented with the smallest possible compiler, inspired by the 1024-byte compiler for the FALSE programming language. Several BrainFuck compilers have been made smaller than 200 bytes. The classic distribution is Muller's version 2, containing a compiler for the Amiga, an interpreter, example programs, and a readme document.
The language consists of eight commands:
A Brainfuck program is a sequence of these commands, possibly interspersed with other characters (which are ignored). The commands are executed sequentially, except as noted below; an instruction pointer begins at the first command, and each command it points to is executed, after which it normally moves forward to the next command. The program terminates when the instruction pointer moves past the last command.
I wrote a BrainFuck to C translator in awk. It only takes a few minutes and I noticed that no awk version of this existed.
I haven't run it through it's paces (I just wrote a few small BrainFuck programs to test it out) so if you find a bug, please let me know.
#!/sw/bin/awk -f
# a brainfuck to C translator.
# Needs a recent version of gawk, if on OS X,
# try using Fink's version.
#
# steve jenson
BEGIN {
print "#include <stdio.h>\n";
print "int main() {";
print " int c = 0;";
print " static int b[30000];\n";
}
{
#Note: the order in which these are
#substituted is very important.
gsub(/\]/, " }\n");
gsub(/\[/, " while(b[c] != 0) {\n");
gsub(/\+/, " ++b[c];\n");
gsub(/\-/, " --b[c];\n");
gsub(/>/, " ++c;\n");
gsub(/</, " --c;\n");
gsub(/\./, " putchar(b[c]);\n");
gsub(/\,/, " b[c] = getchar();\n");
print $0
}
END {
print "\n return 0;";
print "}";
}
Steve Johnson http://saladwithsteve.com/
awk [-v profiling=1] -f awklisp [optional-Lisp-source-files]
The -v profiling=1 option turns call-count profiling on.
If you want to use it interactively, be sure to include '-' (for the standard input) among the source files. For example:
gawk -f awklisp startup numbers lists -
This program arose out of one-upmanship. At my previous job I had to use MapBasic, an interpreter so astoundingly slow (around 100 times slower than GWBASIC) that one must wonder if it itself is implemented in an interpreted language. I still wonder, but it clearly could be: a bare-bones Lisp in awk, hacked up in a few hours, ran substantially faster. Since then I've added features and polish, in the hope of taking over the burgeoning market for stately language implementations.
This version tries to deal with as many of the essential issues in interpreter implementation as is reasonable in awk (though most would call this program utterly unreasonable from start to finish, perhaps...). Awk's impoverished control structures put error recovery and tail-call optimization out of reach, in that I can't see a non-painful way to code them. The scope of variables is dynamic because that was easier to implement efficiently. Subject to all those constraints, the language is as Schemely as I could make it: it has a single namespace with uniform evaluation of expressions in the function and argument positions, and the Scheme names for primitives and special forms.
The rest of this file is a reference manual. My favorite tutorial would be The Little LISPer (see section 5, References); don't let the cute name and the cartoons turn you off, because it's a really excellent book with some mind-stretching material towards the end. All of its code will work with awklisp, except for the last two chapters. (You'd be better off learning with a serious Lisp implementation, of course.)
For more details on the implementation, see the Implementation notes (below).
Code:
(define fib
(lambda (n)
(if (< n 2)
1
(+ (fib (- n 1))
(fib (- n 2))))))
(fib 20)
Comamnd line:
gawk -f awklisp startup numbers lists fib.lsp
Output:
10946
Here are the standard ELIZA dialogue patterns:
(define rules
'(((hello)
(How do you do -- please state your problem))
((I want)
(What would it mean if you got -R-)
(Why do you want -R-)
(Suppose you got -R- soon))
((if)
(Do you really think its likely that -R-)
(Do you wish that -R-)
(What do you think about -R-)
(Really-- if -R-))
((I was)
(Were you really?)
(Perhaps I already knew you were -R-)
(Why do you tell me you were -R- now?))
((I am)
(In what way are you -R-)
(Do you want to be -R-))
((because)
(Is that the real reason?)
(What other reasons might there be?)
(Does that reason seem to explain anything else?))
((I feel)
(Do you often feel -R-))
((I felt)
(What other feelings do you have?))
((yes)
(You seem quite positive)
(You are sure)
(I understand))
((no)
(Why not?)
(You are being a bit negative)
(Are you saying no just to be negative?))
((someone)
(Can you be more specific?))
((everyone)
(Surely not everyone)
(Can you think of anyone in particular?)
(Who for example?)
(You are thinking of a special person))
((perhaps)
(You do not seem quite certain))
((are)
(Did you think they might not be -R-)
(Possibly they are -R-))
(()
(Very interesting)
(I am not sure I understand you fully)
(What does that suggest to you?)
(Please continue)
(Go on)
(Do you feel strongly about discussing such things?))))
Command line:
gawk -f awklisp startup numbers lists eliza.lsp -
Interaction:
> (eliza) Hello-- please state your problem > (I feel sick) Do you often feel sick > (I am in love with awk) In what way are you in love with awk > (because it is so easy to use) Is that the real reason? > (I was laughed at by the other kids at space camp) Were you really? > (everyone hates me) Can you think of anyone in particular? > (everyone at space camp) Surely not everyone > (perhaps not tina fey) You do not seem quite certain > (I want her to laugh at me) What would it mean if you got her to laugh at me
Lisp evaluates expressions, which can be simple (atoms) or compound (lists).
An atom is a string of characters, which can be letters, digits, and most punctuation; the characters may -not- include spaces, quotes, parentheses, brackets, '.', '#', or ';' (the comment character). In this Lisp, case is significant ( X is different from x ).
A list is a '(', followed by zero or more objects (each of which is an atom or a list), followed by a ')'.
The special object nil is both an atom and the empty list. That is, nil = (). A non-nil list is called a -pair-, because it is represented by a pair of pointers, one to the first element of the list (its -car-), and one to the rest of the list (its -cdr-). For example, the car of ((a list) of stuff) is (a list), and the cdr is (of stuff). It's also possible to have a pair whose cdr is not a list; the pair with car A and cdr B is printed as (A . B).
That's the syntax of programs and data. Now let's consider their meaning. You can use Lisp like a calculator: type in an expression, and Lisp prints its value. If you type 25, it prints 25. If you type (+ 2 2), it prints 4. In general, Lisp evaluates a particular expression in a particular environment (set of variable bindings) by following this algorithm:
If the procedure's body has more than one expression -- e.g., (lambda () (write 'Hello) (write 'world!)) -- evaluate them each in turn, and return the value of the last one.
We still need the rules for special forms. They are:
It's possible to define new special forms using the macro facility provided in the startup file. The macros defined there are:
(let ((<var> <expr>)...) <body>...)Bind each <var> to its corresponding <expr> (evaluated in the current environment), and evaluate <body> in the resulting environment.
(cond (<test-expr> <result-expr>...)... (else <result-expr>...))where the final else clause is optional. Evaluate each <test-expr> in turn, and for the first non-nil result, evaluate its <result-expr>. If none are non-nil, and there's no else clause, return nil.
(and <expr>...)Evaluate each <expr> in order, until one returns nil; then return nil. If none are nil, return the value of the last <expr>.
(or <expr>...)Evaluate each <expr> in order, until one returns non-nil; return that value. If all are nil, return nil.
Since the code should be self-explanatory to anyone knowledgeable about Lisp implementation, these notes assume you know Lisp but not interpreters. I haven't got around to writing up a complete discussion of everything, though.
The code for an interpreter can be pretty low on redundancy -- this is natural because the whole reason for implementing a new language is to avoid having to code a particular class of programs in a redundant style in the old language. We implement what that class of programs has in common just once, then use it many times. Thus an interpreter has a different style of code, perhaps denser, than a typical application program.
Conceptually, a Lisp datum is a tagged pointer, with the tag giving the datatype and the pointer locating the data. We follow the common practice of encoding the tag into the two lowest-order bits of the pointer. This is especially easy in awk, since arrays with non-consecutive indices are just as efficient as dense ones (so we can use the tagged pointer directly as an index, without having to mask out the tag bits). (But, by the way, mawk accesses negative indices much more slowly than positive ones, as I found out when trying a different encoding.)
This Lisp provides three datatypes: integers, lists, and symbols. (A modern Lisp provides many more.)
For an integer, the tag bits are zero and the pointer bits are simply the numeric value; thus, N is represented by N*4. This choice of the tag value has two advantages. First, we can add and subtract without fiddling with the tags. Second, negative numbers fit right in. (Consider what would happen if N were represented by 1+N*4 instead, and we tried to extract the tag as N%4, where N may be either positive or negative. Because of this problem and the above-mentioned inefficiency of negative indices, all other datatypes are represented by positive numbers.)
The following is from an email discussion; it doesn't develop everything from first principles but is included here in the hope it will be helpful.
Hi. I just took a look at awklisp, and remembered that there's more to your question about why we need a stack -- it's a good question. The real reason is because a stack is accessible to the garbage collector.
We could have had apply() evaluate the arguments itself, and stash the results into variables like arg0 and arg1 -- then the case for ADD would look like
if (proc == ADD) return is(a_number, arg0) + is(a_number, arg1)
The obvious problem with that approach is how to handle calls to user-defined procedures, which could have any number of arguments. Say we're evaluating ((lambda (x) (+ x 1)) 42). (lambda (x) (+ x 1)) is the procedure, and 42 is the argument.
A (wrong) solution could be to evaluate each argument in turn, and bind the corresponding parameter name (like x in this case) to the resulting value (while saving the old value to be restored after we return from the procedure). This is wrong because we must not change the variable bindings until we actually enter the procedure -- for example, with that algorithm ((lambda (x y) y) 1 x) would return 1, when it should return whatever the value of x is in the enclosing environment. (The eval_rands()-type sequence would be: eval the 1, bind x to 1, eval the x -- yielding 1 which is *wrong* -- and bind y to that, then eval the body of the lambda.)
Okay, that's easily fixed -- evaluate all the operands and stash them away somewhere until you're done, and *then* do the bindings. So the question is where to stash them. How about a global array? Like
for (i = 0; arglist != NIL; ++i) {
global_temp[i] = eval(car[arglist])
arglist = cdr[arglist]
}
followed by the equivalent of extend_env(). This will not do, because the global array will get clobbered in recursive calls to eval(). Consider (+ 2 (* 3 4)) -- first we evaluate the arguments to the +, like this: global_temp[0] gets 2, and then global_temp[1] gets the eval of (* 3 4). But in evaluating (* 3 4), global_temp[0] gets set to 3 and global_temp[1] to 4 -- so the original assignment of 2 to global_temp[0] is clobbered before we get a chance to use it. By using a stack[] instead of a global_temp[], we finesse this problem.
You may object that we can solve that by just making the global array local, and that's true; lots of small local arrays may or may not be more efficient than one big global stack, in awk -- we'd have to try it out to see. But the real problem I alluded to at the start of this message is this: the garbage collector has to be able to find all the live references to the car[] and cdr[] arrays. If some of those references are hidden away in local variables of recursive procedures, we're stuck. With the global stack, they're all right there for the gc().
(In C we could use the local-arrays approach by threading a chain of pointers from each one to the next; but awk doesn't have pointers.)
(You may wonder how the code gets away with having a number of local variables holding lisp values, then -- the answer is that in every such case we can be sure the garbage collector can find the values in question from some other source. That's what this comment is about:
# All the interpretation routines have the precondition that their # arguments are protected from garbage collection.
In some cases where the values would not otherwise be guaranteed to be available to the gc, we call protect().)
Oh, there's another reason why apply() doesn't evaluate the arguments itself: it's called by do_apply(), which handles lisp calls like (apply car '((x))) -- where we *don't* want the x to get evaluated by apply().
Roger Rohrbach wrote a Lisp interpreter, in old awk (which has no procedures!), called walk . It can't do as much as this Lisp, but it certainly has greater hack value. Cooler name, too. It's available at http://www-2.cs.cmu.edu/afs/cs/project/ai-repository/ai/lang/lisp/impl/awk/0.html
Eval doesn't check the syntax of expressions. This is a probably-misguided attempt to bump up the speed a bit, that also simplifies some of the code. The macroexpander in the startup file would be the best place to add syntax- checking.
Darius Bacon dairus@wry.me
Copyright (c) 1994, 2001 by Darius Bacon.
Permission is granted to anyone to use this software for any purpose on any computer system, and to redistribute it freely, subject to the following restrictions:
Download from LAWKER.
"aaa" (the Amazing Awk Assembler) is a primitive assembler written entirely in awk and sed. It was done for fun, to establish whether it was possible. It is; it works. It's quite slow, the input syntax is eccentric and rather restricted, and error-checking is virtually nonexistent, but it does work. Furthermore it's very easy to adapt to a new machine, provided the machine falls into the generic "8-bit-micro" category. It is supplied "as is", with no guarantees of any kind. I can't be bothered to do any more work on it right now, but even in its imperfect state it may be useful to someone.
aaa is the mainline shell file.
aux is a subdirectory with machine-independent stuff. Anon, 6801, and 6809 are subdirectories with machine-dependent stuff, choice specified by a -m option (default is "anon"). Actually, even the stuff that is supposedly machine-independent does have some machine-dependent assumptions; notably, it knows that bytes are 8 bits (not serious) and that the byte is the basic unit of instructions (more serious). These would have to change for the 68000 (going to 16-bit "bytes" might be sufficient) and maybe for the 32016 (harder).
aaa thinks that the machine subdirectories and the aux subdirectory are in the current directory, which is almost certainly wrong.
abst is an abstract for a paper. "card", in each machine directory, is a summary card for the slightly-eccentric input language. There is no real manual at present; sorry.
try.s is a sample piece of 6809 input; it is semantic trash, purely for test purposes. The assembler produces try.a, try.defs, and try.x as outputs from "aaa try.s". try.a is an internal file that looks somewhat like an assembly listing. try.defs is another internal file that looks somewhat like a symbol table. These files are preserved because of possible usefulness; tmp[123] are non-preserved temporaries. try.x is the Intel-hex output. try.x.good is identical to try.x and is a saved copy for regression testing of new work.
01pgm.s is a self-programming program for a 68701, based on the one in the Motorola ap note. 01pgm.x.good is another regression-test file.
If your C library (used by awk) has broken "%02x" so it no longer means "two digits of hex, *zero-filled*" (as some SysV libraries have), you will have to fall back from aux/hex to aux/hex.argh, which does it the hard way. Oh yes, you'll note that aaa feeds settings into awk on the command line; don't assume your awk won't do this until you try it.
Henry Spencer
Programmers often take awk "as is", never thinking to use it as a lab in which we can explore other language extensions. This is of course, only one way to treat the Awk code base.
An alternate approach is to treat the Awk code base as a reusable library of parsers, regular expression engines, etc etc and to make modifications to the lanugage. This second approach was take by David Ladd and J. Christopher Raming in their A* system.
They write:
A* is an experimental language designed to facilitate the creation of language-processing tools. It is analogous either to an interpreted yacc with Awk as its statement language, or to a version of Awk which processes programs rather than records. A* offers two principal advantages over the combination of lex, yacc, and C:
Reference: A*: a language for implementing language processors Ladd, D.A.; Ramming, J.C.; Software Engineering, IEEE Transactions on Volume 21, Issue 11, Nov. 1995 Page(s):894 - 901
blog comments powered by Disqus