Awk.Info

"Cause a little auk awk
goes a long way."


WHAT'S NEW?

Mar 01: Michael Sanders demos an X-windows GUI for AWK.

Mar 01: Awk100#24: A. Lahm and E. de Rinaldis' patent search, in AWK

Feb 28: Tim Menzies asks this community to write an AWK cookbook.

Feb 28: Arnold Robbins announces a new debugger for GAWK.

Feb 28: Awk100#23: Premysl Janouch offers an IRC bot, in AWK

Feb 28: Updated: the AWK FAQ

Feb 28: Tim Menzies offers a tiny content management system, in Awk.

Jan 31: Comment system added to awk.info. For example, see the discussion at the bottom of ?keys2awk

Jan 31: Martin Cohen shows that Gawk can handle massively long strings (300 million characters).

Jan 31: The AWK FAQ is being updated. For comments/corrections/extensions, please mail tim@menzies.us

Jan 31: Martin Cohen finds Awk on the Android platform.

Jan 31: Aleksey Cheusov released a new version of runawk.

Jan 31: Hirofumi Saito contributes a candidate Awk mascot.

Jan 31: Michael Sanders shows how to quickly build an AWK GUI for windows.

Jan 31: Hyung-Hwan Chung offers QSE, an embeddable Awk Interpreter.

[More ...]


categories: Awk100,Jan,2009,Admin

The Awk 100

Goals

Awk is being used all around the world for real programming problems, but the news is not getting out.

We are aiming to create a database of at least one hundred Awk programs which will:

  • Identify the tasks that Awk is really being used for
  • Enable analysis of the benefits of the language for practical programming
  • Serve as an information exchange for applications

Contribute

If you, or your colleagues or friends, have written a program which has been used for purposes small or large, why not take five minutes to record the facts, so that others can see what you've done?

To contribute, fill in this template and mail it to mail@awk.info with the subject line Awk 100 contribution.

Current Listing

(Recent additions are shown first.)

  1. A. Lahm and E. de Rinaldis' Patent Matrix
    • PatentMatrix is an automated tool to survey patents related to large sets of genes or proteins. The tool allows a rapid survey of patents associated with genes or proteins in a particular area of interest as defined by keywords. It can be efficiently used to evaluate the IP-related novelty of scientific findings and to rank genes or proteins according to their IP position.
  2. P Janouch's AWK IRC agent:
    • VitaminA IRC bot is an experiment on what can be done with GNU AWK. It's a very simple though powerful scripting language. Using the coprocess feature, plugins can be implemented very easily and in a language-independent way as a side-effect. The project runs only on Unix-derived systems.
  3. Stephen Jungels' music player:
    • Plaiter (pronounced "player") is a command line front end to command line music players. What does Plaiter do that (say) mpg123 can't already? It queues tracks, first of all. Secondly, it understands commands like play, pause, stop, next and prev. Finally, unlike most of the command line music players out there, Plaiter can handle a play list with more than one type of audio file, selecting the proper helper app to handle each type of file you throw at it.
  4. Dan at sourceforge's Jawk system:
    • Awk, implemented in the Java virtual machine. Very useful for extending lightweight scripting in Awk with (e.g.) network and GUI facilities from Java.
  5. Axel T. Schreiner's OOC system:
    • ooc is an awk program which reads class descriptions and performs the routine coding tasks necessary to do object-oriented coding in ANSI C.
  6. Ladd and Raming's Awk A-star system:
    • Programmers often take awk "as is", never thinking to use it as a lab in which we can explore other language extensions. That is, of course, only one way to treat the Awk code base. An alternative approach is to treat the Awk code base as a reusable library of parsers, regular-expression engines, and so on, and to modify the language itself. This second approach was taken by David Ladd and J. Christopher Raming in their A* system.
  7. Henry Spencer's Amazing Awk Syntax Language system:
    • Aaslg and aaslr implement the Amazing Awk Syntax Language, AASL (pronounced ``hassle''). Aaslg (pronounced ``hassling'') takes an AASL specification from the concatenation of the file(s) (default standard input) and emits the corresponding AASL table on standard output.
    • The AASL implementation is not large. The scanner is 78 lines of awk, the parser is 61 lines of AASL (using a fairly low-density paragraphing style and a good many comments), and the semantics pass is 290 lines of awk. The table interpreter is 340 lines, about half of which (and most of the complexity) can be attributed to the automatic error recovery.
    • As an experiment with a more ambitious AASL specification, one for ANSI C was written. This occupies 374 lines excluding comments and blank lines, and with the exception of the messy details of C declarators is mostly a fairly straightforward transcription of the syntax given in the ANSI standard.
  8. Jurgen Kahrs (and others) XMLgawk system:
    • XMLgawk is an experimental extension of the GNU Awk interpreter. It includes a small XML parsing library which is built upon the Expat XML parser.
    • The same tool that can load the XML shared library can also add other libraries (e.g. PostgreSQL).
  9. Henry Spencer's Amazing Awk Assembler
    • "aaa" (the Amazing Awk Assembler) is a primitive assembler written entirely in awk and sed. It was done for fun, to establish whether it was possible. It is; it works. Using "aaa", it's very easy to adapt to a new machine, provided the machine falls into the generic "8-bit-micro" category.
  10. Ronald Loui's AI programming lab.
    • For many years, Ronald Loui has taught AI using Awk. He writes:
      • Most people are surprised when I tell them what language we use in our undergraduate AI programming class. That's understandable. We use GAWK.
      • A repeated observation in this class is that only the scripting programmers can generate code fast enough to keep up with the demands of the class. Even though students were allowed to choose any language they wanted, and many had to unlearn the Java ways of doing things in order to benefit from scripting, there were few who could develop ideas into code effectively and rapidly without scripting.
      • What I have found not only surprising but also hopeful, is that when I have approached the AI people who still enjoy programming, some of them are not the least bit surprised.
  11. Henry Spencer's Amazing Awk Formatter.
    • Awf may not be lightning fast, and it has certain restrictions, but it does a decent job on most manual pages and simple -ms documents, and isn't subject to AT&T's brain-damaged licensing that denies many System V users any text formatter at all. It is also a text formatter that is simple enough to be tinkered with, for people who want to experiment.
  12. Yung-Pin Cheng's Awk-Linux Courseware.
    • The stable and cross-platform nature of Awk enabled the simple creation of a robust toolkit for teaching operating system concepts to university students. The toolkit is much simpler and easier to port to new platforms than alternative, more elaborate courseware tools.
    • This work was the basis for a quite prestigious publication in the IEEE Transactions on Education journal, 2008, Vol 51, Issue 4. Who said Awk was an old-fashioned tool?
  13. Jon Bentley's m1 micro macro processor.
    • Supports the essential operations of defining strings and replacing strings in text by their definitions. All in 110 lines. A little awk goes a long way.
  14. Arnold Robbins and Nelson Beebe's classic spell checker
    • A powerful spell checker, and a case-study on how to best write applications using hundreds of lines of Awk.
  15. Jim Hart's awk++
    • An object-oriented Awk.
  16. Wolfgang Zekol's Yawk
    • A wiki written in Awk
  17. Darius Bacon: AwkLisp
    • A Lisp interpreter written in Awk
  18. Bill Poser: Name
    • Generate TeX code for a bilingual dictionary.
  19. Ronald Loui: Faster clustering
    • Demonstration to DoD of a clustering algorithm suitable for streaming data
  20. Peter Krumin: Get YouTube videos
    • Download YouTube videos
  21. Jim Hart: Sudoku
    • Solve sudoku puzzles using the same strategies as a person would, not by brute force.
  22. Ronald Loui: Anne's Negotiation Game
    • Research on a model of negotiation incorporating search, dialogue, and changing expectations.
  23. Ronald Loui: Baseball Sim
    • A baseball simulator for investigating the efficiency of batting lineups.
  24. Ronald Loui: Argcol
    • A tool inspired by fmt that could be used while working in vi to maintain a multi-column pro-con argument format.

categories: Awk100,Feb,2010,ALahm

PatentMatrix: survey gene/protein patents

(From Source Code Biol Med. 2007 Sep 6;2:4. by A. Lahm, E. de Rinaldis)

BACKGROUND: The number of patents associated with genes and proteins and the amount of information contained in each patent often present a real obstacle to the rapid evaluation of the novelty of findings associated with genes from an intellectual property (IP) perspective. This assessment, normally carried out by expert patent professionals, can therefore become cumbersome and time consuming. Here we present PatentMatrix, a novel software tool for the automated analysis of patent sequence text entries.

METHODS AND RESULTS: PatentMatrix is written in the Awk language and requires installation of the Derwent GENESEQ(TM) patent sequence database under the sequence retrieval system SRS. The software works by taking as input two files: i) a list of genes or proteins with the associated GENESEQ(TM) patent sequence accession numbers; ii) a list of keywords describing the research context of interest (e.g. 'lung', 'cancer', 'therapeutics', 'diagnostics'). The GENESEQ(TM) database is interrogated through the SRS system and each patent entry of interest is screened for the occurrence of user-defined keywords. Moreover, the software extracts the basic information useful for a preliminary assessment of the IP coverage of each patent from the GENESEQ(TM) database. As output, two tab-delimited files are generated which provide the user with a detailed and an aggregated view of the results. An example is given where the IP position of five genes is evaluated in the context of 'development of antibodies for cancer treatment'.

CONCLUSION: PatentMatrix allows a rapid survey of patents associated with genes or proteins in a particular area of interest as defined by keywords. It can be efficiently used to evaluate the IP-related novelty of scientific findings and to rank genes or proteins according to their IP position.
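
PatentMatrix itself is not reproduced here, but the keyword-screening step at its heart is easy to picture in Awk. The sketch below shows that idea only, against hypothetical flat files (one lowercase keyword per line; one patent text entry per line); it is not the actual PatentMatrix code, which works against the GENESEQ(TM) database through SRS.

 # screen.awk -- hypothetical sketch of the keyword-screening idea
 # usage: awk -f screen.awk keywords.txt entries.txt
 NR == FNR { kw[tolower($0)] = 1; next }    # first file: the keywords
 {
     line = tolower($0)
     for (k in kw)
         if (index(line, k))                # entry mentions this keyword
             print FNR "\t" k "\t" $0       # tab-delimited report line
 }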


categories: Awk100,Irc,Feb,2010,PremyslJ

VitaminA IRC Bot

Purpose: A modular IRC bot written in GNU AWK.
Developers: Premysl Janouch
Country: Czech Republic
Domain: IRC bot
Contact: see Developers
Email: p.janouch@gmail.com
Awk: GNU AWK
Shell: Bourne shell
Platform: POSIX-compatible
Lines: 1000
Current: Released (3)
Use: Free/Public Domain (3)
Users: N/A
DateDeployed: 2010
Dated: February 2010
Url: http://vitamina.googlecode.com

Note: A regular release is planned in something like a month. I've typed Current: Released so you won't have to update the page.
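
The "coprocess feature" mentioned in the Awk 100 entry is gawk's two-way pipe operator |&, which is what makes language-independent plugins cheap. A minimal sketch of the idea, with a hypothetical plugin path and line protocol (not VitaminA's actual code):

 # gawk only: drive a plugin over a two-way pipe
 BEGIN {
     plugin = "./plugins/echo.sh"                # hypothetical helper program
     print "PRIVMSG #chan :hello" |& plugin      # hand an IRC event to it
     plugin |& getline reply                     # read back its response
     if (reply != "") print reply
     close(plugin)
 }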


categories: Awk100,Interpreters,Apr,2009,HenryS

AASL: Parser Generator in Awk

Download

Download from LAWKER

Synopsis

aaslg [ -x ] [ file ... ]
aaslr [ -x ] table [ file ... ]

Description

Aaslg and aaslr implement the Amazing Awk Syntax Language, AASL (pronounced ``hassle''). Aaslg (pronounced ``hassling'') takes an AASL specification from the concatenation of the file(s) (default standard input) and emits the corresponding AASL table on standard output. Aaslr parses the contents of the file(s) (default standard input) according to the AASL table in file table, emitting the table's output on standard output.

Both take a -x option to turn on verbose and cryptic debugging output. Both look in a library directory for pieces of the AASL system; the AASLDIR environment variable, if present, overrides the default notion of the location of this directory.

Aaslr expects input to consist of input tokens, one per line. For simple tokens, the line is just the text of the token. For metatokens like ``identifier'', the line is the metatoken's name, a tab, and the text of the token. [xxx discuss `#' lines]

Aaslr output, in the absence of syntax errors, consists of the input tokens plus action tokens, which are lines consisting of `#!' followed immediately by an identifier. If the syntax of the input does not match that specified in the AASL table, aaslr emits complaint(s) on standard error and attempts to repair the input into a legal form; see ``ERROR REPAIR'' below. Unless errors have cascaded to the point where aaslr gives up (in which case it emits the action token ``#!aargh'' to inform later passes of this), the output will always conform to the AASL syntax given in the table.

Normally, a complete program using AASL consists of three passes, the middle one being an invocation of aaslr. The first pass is a lexical analyzer, which breaks free-form input down into input tokens in some suitable way. The third pass is a semantics interpreter, which typically responds to input tokens by momentarily remembering them and to action tokens by executing some action, often using the remembered value of the previous input token. Aaslg is in fact implemented using AASL, following this structure; it implements the -x option by just passing it to aaslr.
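
In shell terms, such a three-pass program is simply a pipeline. A sketch with hypothetical pass names (mylex.awk, mytable, mysem.awk):

 # pass 1 emits tokens one per line; pass 2 checks them against the
 # table and inserts #!action tokens; pass 3 acts on the annotations
 awk -f mylex.awk <input | aaslr mytable | awk -f mysem.awk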

AASL Specifications

An AASL specification consists of class definitions, text definitions, and rules, in arbitrary order (except that class definitions must precede use of the classes they define). A `#' (not enclosed in a string) begins a comment; characters from it to the end of the line are ignored. An identifier follows the same rules as a C identifier, except that in most contexts it can be at most 16 characters long. A string is enclosed in double quotes ("") and generally follows C syntax. Most strings denote input tokens, and references to ``input token'' as part of AASL specification syntax should be read as ``string denoting input token''.

A class definition is an identifier enclosed in angle brackets (<>) followed by one or more input tokens followed by a semicolon (;). It gives a name to a set of input tokens. Classes whose names start with capital letters are user abbreviations; see below. Classes whose names start with lowercase letters are special classes, used for internal purposes. The current special classes are:

trivial
tokens which the parser can discard at will, in the expectation that they might be inserted erroneously; see ``ERROR REPAIR'' for details.
lineterm
tokens which terminate a logical line for purposes of resynchronization in error repair; see ``ERROR REPAIR'' for details.
endmarker
xxx

For example, the class definitions used for AASL itself are:

<trivial> "," ";"   ;
<lineterm> ";" ;
<endmarker> "EOF"   ;

When AASL error repair is invoked, the parser sometimes needs to generate input tokens. In the case of a metatoken, the parser knows the token's name but needs to generate a text for it as well. A text definition consists of an input token, an arrow (->), and a string specifying what text should be generated for that token. For example, the text definitions used for AASL itself are:

"id" -> "___"
"string" -> "\"___\""

The rules of a specification define the syntax that the parser should accept. The order of rules is not significant, except that the first rule is considered to be the top level of the specification. The specification is executed by calling the first rule; when execution of that rule terminates, execution of the specification terminates. If the user wishes this to occur only at end of input, he should arrange for the lexical analyzer to produce an endmarker token (conventionally ``EOF'') at the end of the input, and should write the first rule to require that token at the end.

Note that an input token may be recognized considerably before it is accepted, but the parser emits it to the output only on acceptance.

A rule consists of an identifier naming it, a colon (:), a sequence of items which is the body of the rule, and a semicolon (;). When a rule is called, it is executed by executing the individual items of the body in order (as modified by control structures) until either one of them explicitly terminates execution of the rule or the last item is executed.

An item which is an input token requires that that token appear in the input at that point, and accepts it (causing it to be emitted as output).

An item which is an identifier denotes a call to another rule, which executes the body of that rule and then returns to the caller. It is an error to call a nonexistent rule.

An item which is an identifier preceded by `!' causes that identifier to be emitted as an action token; the identifier has no other significance.

An item which is `<<' causes execution of the current rule to terminate immediately, returning to the calling rule.

An item which is `>>' causes the execution of the innermost enclosing loop (see below) to terminate immediately, with execution continuing after the end of that loop. The loop must be within the same rule.

An item which is an identifier preceded by `@%&!' causes an internal semantic action to be executed within the parser; this is normally needed only for bizarre situations like C's typedef. [xxx should give details I suppose]

A choice is a sequence of branches enclosed in parentheses (()) and separated by vertical bars (|). The first of the branches that can be executed, is, after which execution continues after the end of the choice.

A loop is a sequence of branches enclosed in braces ({}) and separated by vertical bars (|). The first of the branches that can be executed, is, and this is done repeatedly until the loop is terminated by `>>', after which execution continues after the end of the loop. (A loop can also be terminated by `<<' terminating execution of the whole rule.)

A branch is just a sequence of items, like a rule body, except that it must begin with either an input token or a lookahead. If it begins with an input token, it can be executed only when that token is the next token in the input, and execution starts with acceptance of that token.

A lookahead specifies conditions for execution of a branch based on recognizing but not accepting input token(s). The simplest form is just an input token enclosed in brackets ([]), in which case execution of that branch is possible only when that token is the next token in the input. The brackets can also contain multiple input tokens separated by commas, in which case the parser looks for any of those tokens. If a user-abbreviation class name appears, either by itself or as an element of a comma-separated list, it stands for the list of tokens given in its definition.

If a lookahead's brackets contain only a `*', this is a default branch, executable regardless of the state of the input.

As a very special case, a lookahead's brackets can contain two input tokens separated by slash (/), in which case that branch is executable only when those two tokens, in sequence, are next in the input. Warning: this is implemented by a delicate perversion of the error-repair machinery, and if the first of those tokens is not then accepted, the parser will die in convulsions. A further restriction is that the same input token may not appear as the first token of a double lookahead and as a normal lookahead token in the same choice/loop.

Certain simple choice/loop structures appear frequently, and there are abbreviations for them:

abbreviation	        expansion
( items ?)	        ( items | [*] )
{ items ?}	        { items | [*] >> }
( ! [look] items ?)	( [look] | items )
{ ! [look] items ?}	{ [look] >> | items }

For example, here are the rules of the AASL specification for AASL, minus the actions (which add considerable clutter and are unintelligible without the third pass):

	       rules: {
				   "id" ":" contents ";"
				   | "<" "id" ">" {"string" ?} ";"
				   | "string" "->" "string"
				   | "EOF" >>
	       };
	       contents: {
				   ">>"
				   | "<<"
				   | "id"
				   | "!" "id"
				   | "@%&!" "id"
				   | "string"
				   | "(" branches ")"
				   | "{" branches "}"
				   | [*] >>
	       };
	       branches: (
				   "!" "[" look "]" contents "?"
				   | [*] branch (
				   ["|"] {"|" branch ?}
				   | "?" !endbranch
				   | [*]
				   )
	       );
	       branch: (
				   "string" contents
				   | "[" look "]" contents
	       );
	       look: (
				   ["string"/"/"] "string" "/" "string"
				   | "*"
				   | [*] looker {"," looker ?}
	       );
	       looker: ( "string" | "id" ) ;

Error Repair

When the input token is not one of those desired, either because the item being executed is an input token and a different token appears on the input, or because none of the branches of a choice/loop is executable, error repair is invoked to try to fix things up. Sometimes it can actually guess right and fix the error, but more frequently it merely supplies a legal output so that later passes will not be thrown into chaos by a minor syntax error.

The general error-repair strategy of an AASL parser is to give the parser what it wants and then attempt to resynchronize the input with the parser.

[xxx long discussion of how ``what it wants'' is determined when there are multiple possibilities]

Resynchronization is performed in three stages. The first stage attempts to resynchronize within a logical line, and is applied only if neither the input token nor the desired token is a line terminator (a member of the ``lineterm'' class). If the input token is trivial (a member of the ``trivial'' class), it is discarded. Otherwise it is retained, in hopes that it will be the next token that the parser asks for.

Either way, an error message is produced, indicating what was desired, what was seen, and what was handed to the parser. If too many of these messages have been produced for a single line, the parser gives up, produces a last despairing message, emits a ``#!aargh'' action token to alert later passes, and exits. Barring this disaster, parsing then continues. If the parser at some point is willing to accept the input token, it is accepted and error repair terminates. If a line terminator is seen in input, or the parser requests one, before the parser is willing to accept the input token, the second phase begins.

The second stage of resynchronization attempts to line both input and parser up on a line terminator. If the desired token is a line terminator and the input token is not, input is discarded until a line terminator appears. If the desired token is not a line terminator and the input token is, the input token is retained and parsing continues until the parser asks for a line terminator. Either way, the third phase then begins.

The third stage of resynchronization attempts to reconcile line terminators. If the desired and input tokens are identical, the input token is accepted and error repair terminates. If they are not identical and the input token is trivial (yes, line terminators can be trivial, and ones like `;' probably should be), the input token is discarded. If the desired token is the endmarker, then the input token is discarded. Otherwise, the input token continues to be retained in hopes that it will eventually be accepted. [xxx this needs more thought] In any case, the second phase begins again.

Files

all in $AASLDIR:
interp  table interpreter
lex     first pass of aaslg
syn     AASL table for aaslg
sem     third pass of aaslg

See Also

awk(1), yacc(1)

Diagnostics

``error-repair disaster'' means that the first token of a double lookahead could not be accepted and error repair was invoked on it.

History

Written at University of Toronto by Henry Spencer, somewhat in the spirit of S/SL (see ACM TOPLAS April 1982).

Bugs

Some of the restrictions on double lookahead are annoying.

Most of the C string escapes are recognized but disregarded, with only a backslashed double-quote interpreted properly during text generation.

Error repair needs further tuning; it has an annoying tendency to infinite-loop in certain odd situations (although the messages/line limit eventually breaks the loop).

Complex choices/loops with many branches can result in very long lines in the table.

Assessment

The implementation of AASL was fairly straightforward, with AASL itself used to describe its own syntax. An AASL specification is compiled into a table, which is then processed by a table-walking interpreter. The interpreter expects its input as tokens, one per line, much like the output of a traditional scanner. A complete program using AASL (for example, the AASL table generator) is normally three passes: the scanner, the parser (tables plus interpreter), and a semantics pass. The first set of tables was generated by hand for bootstrapping.

Apart from the minor nuisance of repeated iterations of language design, the biggest problem of implementing AASL was the question of semantic actions. Inserting awk semantic routines into the table interpreter, in the style of yacc, would not be impossible, but it seemed clumsy and inelegant. Awk's lack of any provision for compile-time initialization of tables strongly suggested reading them in at run time, rather than taking up space with a huge BEGIN action whose only purpose was to initialize the tables. This makes insertions into the interpreter's code awkward.

The problem was solved by a crucial observation: traditional compilers (etc.) merge a two-step process, first validating a token stream and inserting semantic action cookies into it, then interpreting the stream and the cookies to interface to semantics. For example, yacc's grammar notation can be viewed as inserting fragments of C code into a parsed output, and then interpreting that output. This approach yields an extremely natural pass structure for an AASL parser, with the parser's output stream being (in the absence of syntax errors) a copy of its input stream with annotations. The following semantic pass then processes this, momentarily remembering normal tokens and interpreting annotations as operations on the remembered values. (The semantic pass is, in fact, a classic pattern+action awk program, with a pattern and an action for each annotation, and a general ``save the value in a variable'' action for normal tokens.)

The one difficulty that arises with this method is when the language definition involves feedback loops between semantics and parsing, an obvious example being C's typedef. Dealing with this really does require some imbedding of semantics into the interpreter, although with care it need not be much: the in-parser code for recognizing C typedefs, including the complications introduced by block structure and nested redeclarations of type names, is about 40 lines of awk. The in-parser actions are invoked by a special variant of the AASL ``emit semantic annotation'' syntax.

A side benefit of top-down parsing is that the context of errors is known, and it is relatively easy to implement automatic error recovery. When the interpreter is faced with an input token that does not appear in the list of possibilities in the parser table, it gives the parser one of the possibilities anyway, and then uses simple heuristics to try to adjust the input to resynchronize. The result is that the parser, and subsequent passes, always see a syntactically-correct program. (This approach is borrowed from S/SL and its predecessors.) Although the detailed error-recovery algorithm is still experimental, and the current one is not entirely satisfactory when a complex AASL specification does certain things, in general it deals with minor syntax errors simply and cleanly without any need for complicating the specification with details of error recovery. Knowing the context of errors also makes it much easier to generate intelligible error messages automatically.

The AASL implementation is not large. The scanner is 78 lines of awk, the parser is 61 lines of AASL (using a fairly low-density paragraphing style and a good many comments), and the semantics pass is 290 lines of awk. The table interpreter is 340 lines, about half of which (and most of the complexity) can be attributed to the automatic error recovery.

As an experiment with a more ambitious AASL specification, one for ANSI C was written. This occupies 374 lines excluding comments and blank lines, and with the exception of the messy details of C declarators is mostly a fairly straightforward transcription of the syntax given in the ANSI standard. Generating tables for this takes about three minutes of CPU time on a Sun 3/180; the tables are about 10K bytes.

The performance of the resulting ANSI C parser is not impressive: in very round numbers, averaged over a large program, it parses about one line of C per CPU second. (The scanner, 164 lines of awk, accounts for a negligible fraction of this.) Some attention to optimization of both the tables and the interpreter might speed this up somewhat, but remarkable improvements are unlikely. As things stand, in the absence of better awk implementations or a rewrite of the table interpreter in C, it's a cute toy, possibly of some pedagogical value, but not a useful production tool. On the other hand, there does not appear to be any fundamental reason for the performance shortfall: it's purely the result of the slow execution of awk programs.

Lessons From AASL

The scanner would be much faster with better regular-expression matching, because it can use regular expressions to determine whether a string is a plausible token but must use substr to extract the string first. Nawk functions would be very handy for modularizing code, especially the complicated and seldom-invoked error-recovery procedure. A switch statement modelled on the pattern+action scheme would be useful in several places.

Another troublesome issue is that arrays are second-class citizens in awk (and continue to be so in nawk): there is no array assignment. This lack leads to endless repetitions of code like:

for (i in array) 
    arraystack[i ":" sp] = array[i] 

whenever block structuring or a stack is desired. Nawk's multi-dimensional arrays supply some syntactic sugar for this but don't really fix the problem. Not only is this code clumsy, it is woefully inefficient compared to something like

arraystack[sp] = array 

even if the implementation is very clever. This significantly reduces the usefulness of arrays as symbol tables and the like, a role for which they are otherwise very well suited.

It would also be of some use if there were some way to initialize arrays as constant tables, or alternatively a guarantee that the BEGIN action would be implemented cleverly and would not occupy space after it had finished executing.

A minor nuisance that surfaces constantly is that getting an error message out to the standard-error descriptor is painfully clumsy: one gets to choose between putting error messages out to a temporary file and having a shell "wrapper" process them later, or piping them into "cat >&2" (!).
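
In code, the piping workaround is just:

 print "some error message" | "cat 1>&2"    # route the message to stderr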

The multi-pass input-driven structure that awk naturally lends itself to produces very clean and readable code with different phases neatly separated, but creates substantial difficulties when feedback loops appear. (In the case of AASL, this perhaps says more about language design than about awk.)

Author

Henry Spencer.


categories: Awk100,Top10,Interpreters,Dsl,Apr,2009,HenryS

Amazing Awk Assembler

Download from

Download from LAWKER.

Description

"aaa" (the Amazing Awk Assembler) is a primitive assembler written entirely in awk and sed. It was done for fun, to establish whether it was possible. It is; it works. It's quite slow, the input syntax is eccentric and rather restricted, and error-checking is virtually nonexistent, but it does work. Furthermore it's very easy to adapt to a new machine, provided the machine falls into the generic "8-bit-micro" category. It is supplied "as is", with no guarantees of any kind. I can't be bothered to do any more work on it right now, but even in its imperfect state it may be useful to someone.

aaa is the mainline shell file.

aux is a subdirectory with machine-independent stuff. Anon, 6801, and 6809 are subdirectories with machine-dependent stuff, choice specified by a -m option (default is "anon"). Actually, even the stuff that is supposedly machine-independent does have some machine-dependent assumptions; notably, it knows that bytes are 8 bits (not serious) and that the byte is the basic unit of instructions (more serious). These would have to change for the 68000 (going to 16-bit "bytes" might be sufficient) and maybe for the 32016 (harder).

aaa thinks that the machine subdirectories and the aux subdirectory are in the current directory, which is almost certainly wrong.

abst is an abstract for a paper. "card", in each machine directory, is a summary card for the slightly-eccentric input language. There is no real manual at present; sorry.

try.s is a sample piece of 6809 input; it is semantic trash, purely for test purposes. The assembler produces try.a, try.defs, and try.x as outputs from "aaa try.s". try.a is an internal file that looks somewhat like an assembly listing. try.defs is another internal file that looks somewhat like a symbol table. These files are preserved because of possible usefulness; tmp[123] are non-preserved temporaries. try.x is the Intel-hex output. try.x.good is identical to try.x and is a saved copy for regression testing of new work.

01pgm.s is a self-programming program for a 68701, based on the one in the Motorola ap note. 01pgm.x.good is another regression-test file.

If your C library (used by awk) has broken "%02x" so it no longer means "two digits of hex, *zero-filled*" (as some SysV libraries have), you will have to fall back from aux/hex to aux/hex.argh, which does it the hard way. Oh yes, you'll note that aaa feeds settings into awk on the command line; don't assume your awk won't do this until you try it.
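
Feeding settings in on the command line is standard awk behaviour: any var=value argument is treated as an assignment, performed when the argument list reaches it. A generic illustration with hypothetical names:

 awk -f prog.awk mode=6809 try.s    # 'mode' is set before try.s is read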

Author

Henry Spencer


categories: Awk100,Oo,Dsl,Mar,2009,Jimh

Awk++

Contents

Synopsis

 gawk -f awkpp file-name-of-awk++-program
This command is platform independent and sends the translated program to standard output (stdout). See Running awk++ for variations.

This is an updated revision (#21), released August 1, 2009. In this new version:

  • The code no longer needs a shell script or batch file to launch awkpp
  • Multiple inheritance improved
  • Added configuration items at the top of the program
This document may be copied only as part of an awk++ distribution and in unmodified form.

Download

Download awkpp21.zip from LAWKER

Description

Awk++ is a preprocessor; that is, it reads in a program written in the awk++ language and outputs a new program. However, it is different from awka. The output from the awk++ preprocessor is awk code, not C or an executable program. So, some version of AWK, such as awk or gawk, has to be used to run the preprocessed program. awka can be used, in a second step, to turn the preprocessed awk++ program into an executable, if desired.

OO in AWK++

The awk++ language provides object oriented programming for AWK that includes:

  • classes
  • class properties (persistent object variables)
  • methods
  • inheritance, including multiple inheritance

Awk++ adds new keywords to standard Awk:

  • class
  • method
  • prop
  • property
  • attr
  • attribute
  • elem
  • element
  • var
  • variable

Syntax

Samples:

 a = class1.new[(optional parameters)]   # similar to Ruby
 b = a.get("aProperty")
 a.delete

 class class1 {
 property aProperty
 method new([optional parameters]) {
 # put initialization stuff here
 }

 method get(propName) {
 if(propName == "aProperty")
 return aProperty ### Note the use of 'return'. It behaves
 ### exactly the same as in an AWK function.
 }
 }

Details

To define a class (similar to C++ but no public/private):

class class_name {.....}

To define a class with inheritance:

class class_name : inherited_class_name [ : inherited_class_name...] {.....}

To add local/private variables (persistent variables; syntax is unique to awk++):

class class_name {
 attribute|attr|property|prop|element|elem|variable|var variable_name
 ..... }

To help programmers who are used to other OO languages, "attribute", "property", "element", and "variable", along with their 4-letter abbreviations, are interchangeable.

Note: these persistent variables cannot be accessed directly. The programmer must define method(s) to return them, if their values are to be made available to code that's outside the class.

To add methods

class class_name {
 attribute variable_name1

 method method_name(parameters) {
 ...any awk code....
 }
 ..other method definitions...
 }

To create an object

 object_variable = class_name.new[(optional parameters)]
(runs the method named "new", if it exists; returns the object ID)

To call an object method

object_variable.method_name(parameters)

The dot isn't used for concatenation in awk/gawk, so it's a natural choice for the separator between the object and method.

To reclaim the memory used by an object, use the delete method, i.e.:

object_variable.delete

but don't define delete() in your classes. awk++ recognizes delete() as a special method and will take care of deleting the object. Deleting objects is only necessary, though, if they hold a lot of data. Overhead for objects themselves is insignificant.

Naming and behavior rules:

  • Class names must obey the same rules as user defined function names.
  • Method names must follow the same rules as AWK user defined function names.
  • Class "local" variables (properties, attributes, etc.) must follow the same naming rules as AWK variables.
  • Objects are number variables, so they must obey number variable rules. However, the values in variables holding objects should never be changed, as they are simply identifiers. Performing math operations on them is meaningless.

Syntax notes

OO syntax goals:

  • easy to parse and match to awk code using an awk program as the "preprocessor"
  • easy to understand
  • easy to remember
  • easy and fast to type
  • distinct from existing AWK syntax

The OO syntax is based partly on C++, partly on Javascript, partly on Ruby and partly on the book "The Object-Oriented Thought Process". It isn't lifted in toto from one language because other languages provide features that gawk can't accomplish or have syntax that is hard to parse.

Multiple Inheritance

In awk++, if a method is called that isn't in the object's class and there are inherited classes (superclasses) specified, the inherited classes are called in left to right order until one of them returns a value. That value becomes the result of the method call. This is the way awk++ resolves the diamond problem. As a programmer, you control the sequence in which superclasses are called by the left to right order of the list of inherited classes in the class definition.

There are two important things to note.

  1. The search will proceed up through as many ancestors as it takes to find a matching method.
  2. A "match" is made when a value is returned. If a superclass has a matching method that returns nothing, the search will continue. Thus, it's possible that more than one method could be executed, resulting in unintended consequences. Be careful!

Calls to undefined methods do nothing and return nothing, silently.
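
A minimal sketch of that lookup order, using hypothetical class names:

 class dog : pet : animal { ..... }

 # given d = dog.new, the call d.speak() searches dog first, then pet,
 # then animal, and the first method that returns a value supplies the result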

Running awk++

The command to preprocess an awk++ program looks like this:

gawk -f awkpp file-name-of-awk++-program

or, if the "she-bang" line (line 1 in awkpp) has the right path to gawk, and awkpp is executable and in a directory in PATH:

awkpp file-name-of-awk++-program

To run the output program immediately:

gawk -f awkpp -r file-name-of-awk++-program [awk options] data-files-to-be-processed

or

awkpp -r file-name-of-awk++-program [awk options] data-files-to-be-processed
When running an awk++ program immediately, standard input (stdin) cannot be used for data. One or more data file paths must be listed on the command line.

Bugs

There is a bug in the standard AWK distributions that affects the preprocessor. Additionally, the preprocessor uses the third (array) argument of gawk's match() function. So, it's best to use GAWK to run the preprocessor.

On the other hand, the AWK code created by translating awk++ is intended to work with all versions of AWK. If you find otherwise, please notify the developer(s).
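
The array argument referred to is gawk's three-argument form of match(), which captures parenthesized groups; a quick illustration (hypothetical pattern):

 # gawk only: match() can capture subgroups into an array
 if (match($0, /class[ \t]+([A-Za-z_][A-Za-z0-9_]*)/, m))
     print "class name:", m[1]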

Copyright

Copyright (c) 2008, 2009 Jim Hart, jhart@mail.avcnet.org. All rights reserved. The awk++ code is licensed under the GNU General Public License (GPL), any version. awk++ documentation, including this page, may be copied only in unmodified form, subject to fair use guidelines.

Author

Jim Hart, jhart@mail.avcnet.org

categories: Awk100,Oo,Dsl,May,2009,AlexS

Awk + ANSI-C = OO

Description

ooc is an awk program which reads class descriptions and performs the routine coding tasks necessary to do object-oriented coding in ANSI C.

The tool is exceptionally well documented in Object-Oriented Programming with ANSI-C.

Download

Download a 2002 copy of this code from LAWKER.

Or go to the author's web site.

Description

ooc is a technique to do object-oriented programming (classes, methods, dynamic linkage, simple inheritance, polymorphism, persistent objects, method existence testing, message forwarding, exception handling, etc.) using ANSI-C.

ooc is a preprocessor to simplify the coding task by converting class descriptions and method implementations into ANSI-C as required by the technique. You implement the algorithms inside the methods and the ooc preprocessor produces the boilerplate.

ooc consists of a shell script driving a modular awk script (with provisions for debugging), a set of reports -- code generation templates -- interpreted by the script, and the source of a root class to provide basic functionality. Everything is designed to be changed if desired. There are manual pages, lots of examples, among them a calculator based on curses and X11, and you can ask me about the book.

ooc as a technique requires an ANSI-C system -- classic C would necessitate substantial changes. The preprocessor needs a healthy Bourne-Shell and "new" awk as described in Aho, Weinberger, and Kernighan's book.

ooc was developed primarily to teach about object-oriented programming without having to learn a new language. If you see how it is done in a familiar setting, it is much easier to grasp the concepts and to know what miracles to expect from the technique and what not. Conceivably, the preprocessor can be used for production programming but this was not the original intent. Being able to roll your own object-oriented coding techniques has its possibilities, however...

Technical Details

Most sources should be viewed with tab stops set at 4 characters.

The original system ran on NeXTSTEP 3.2 and older, ESIX (System V) 4.0.4, and Linux 0.99.pl4-49. This rerelease was tested on MacOS X version 10.1.2 and Solaris version 5.8. You need to review paths in the script 'ooc/ooc' before running anything. Make sure the first line of this script points to a Bourne-style shell. Also make sure that the first line of '09/munch' points to a (new) awk.

The rereleased 'ooc' awk-programs have been tested with GNU awk versions 3.0.1 and 3.0.3. Previous versions did not support AWKPATH properly (but this is not essential).

The makefiles could be smarter but they are naive enough for all systems. This is a heterogeneous system -- set the environment variable $OSTYPE to an architecture-specific name. 'make' in the current directory will create everything by calling 'make' in the various subdirectories. Each 'makefile' includes 'make/Makefile.$OSTYPE', review your 'make/Makefile.$OSTYPE' before you start.

The following make calls are supported throughout:

make [all]	create examples
make test	[make and] run examples
make clean	remove all but sources
make depend	make dependencies (if makefile.$OSTYPE supports it)

Make dependencies can be built with the -MM option of the GNU C compiler. They are stored in a file 'depend' in each subdirectory. They should apply to all systems. 'makefile.$OSTYPE' may include a target 'depend' to recreate 'depend' -- check 'makefile.darwin1.4' for an example.
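
With the GNU C compiler, such a target can be as simple as the following sketch (the real rule lives in makefile.darwin1.4):

 depend:
	gcc -MM *.c > depend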

Contents

The following is a walk through the file hierarchy in the order of the book:

makefile
dispatch standard make calls to known directories
make/
Makefile: boilerplate code for makefiles
01/*
chapter 1: abstract data types
  • sets: Set demo
  • bags: Bag demo: Set with reference count
02/*
chapter 2: dynamic linkage
  • strings: String demo
  • atoms: Atom demo: unique String
03/*
chapter 3: manipulating expressions with dyn. linkage
  • postfix: postfix output of expression
  • value: expression evaluation
  • infix: infix output of expression
04/*
chapter 4: inheritance
  • points: Point demo
  • circles: Circle demo: Circle: Point with radius
05/*
chapter 5: symbol table with inheritance
  • value: expression evaluation with vars, consts, functions
06/*
chapter 6: class hierarchy and meta classes
  • any: objects that do not differ from any object
07/*
chapter 7: ooc preprocessor; use ooc -7
  • points: Point demo: PointClass is a new metaclass
  • circles: Circle demo: Circle is a new class
  • queue: Queue demo: List is an abstract base class
  • stack: Stack demo: another subclass of List
08/*
chapter 8: dynamic type checking; use ooc -8
  • circles: Circle demo: nothing changed
  • list: List demo: traps insertion of numbers or strings
09/*
chapter 9: automatic initialization; use ooc -9
  • munch: awk program to collect class list from nm -p output
  • circles: Circle demo: no more init calls
  • list: List demo: no more init calls
10/*
chapter 10: respondsTo method; use ooc -10
  • cmd: Filter demo: how flags and options are handled
  • wc: word count filter
  • sort: sorting filter, adds sort method to List
11/*
chapter 11: class methods
  • value: expression evaluator, based on class hierarchy
  • value: x memory reclamation enabled
12/*
chapter 12: persistent objects
  • value: expression evaluator, with save and load
13/*
chapter 13: exception handling
  • value: expression evaluator with exception handler
  • except: Exception demo
14/*
chapter 14: message forwarding
  • makefile.etc: (naive) generated rules for the library
  • Xapp: resources for X11-based programs
  • hello: LineOut demo: hello, world
  • button: Button demo
  • run: terminal-oriented calculator
  • cbutton: Crt demo: hello, world changes into a button
  • crun: curses-based calculator
  • xhello: XLineOut demo: hello, world
  • xbutton: XButton demo with XawBox and XawForm
  • xrun: X11-based calculator with callbacks
man/*
manual pages
  • *.1: tools
  • *.2: functions
  • *.3: some classes
  • *.4: classes in chapter 14
ooc/*
ooc preprocessor
  • ooc: command script; review 'home' 'OOCPATH' 'AWKPATH'
  • awk/*.awk: modules
  • awk/*.dbg: debugging modules
  • rep/*.rep: reports
  • rep-*/*.rep: reports for early chapters

Copyright

Copyright (c) 1993

While you may use this software package, neither I nor my employers can be made responsible for whatever problems you might cause or encounter.

While you may give away this package and/or software derived with it, you should not charge for it, you should not claim that ooc is your work, and I have published my own book about ooc before you did.

The same restrictions apply to whoever might get this package from you.

Author

Axel T. Schreiner, http://www.cs.rit.edu/~ats/

categories: Awk100,Music,Tools,June,2009,StephenJ

Plaiter: a music player

Synopsis

plaiter [options] [file, playlist, directory or stream ...]

Download

Download from LAWKER or, for the latest version, from SourceForge

Description

Plaiter (pronounced "player") is a command line front end to command line music players. It uses shell scripting to try to create the command line music player that Plait would have used if it already existed. It complements Plait but is also quite useful on its own, especially if you already use mpg123 or similar programs and find yourself wanting more features.

What does Plaiter do that (say) mpg123 can't already? It queues tracks, first of all. Secondly, it understands commands like play, pause, stop, next and prev. Finally, unlike most of the command line music players out there, Plaiter can handle a play list with more than one type of audio file, selecting the proper helper app to handle each type of file you throw at it.

Plaiter will automatically configure itself to use ogg123, mpg123, and/or mpg321, if they are installed on your system. If you have a helper application that plays other types of audio, Plaiter can be configured to use it as well.

Like many of us, Plaiter is part daemon and part controller. The controller builds a play list from the files you provide on the command line and forwards commands to the daemon. The daemon reads commands and executes them by running helper applications.

Options

--daemon,-d
daemon mode
--queue,-q
add tracks to queue
--enqueue
add tracks to queue
--random
random shuffle
--play
play
--pause
toggle pause mode
--stop,-s
stop
--latch [on|off]
toggle or set stop after current track
--next,-n [n]
skip forward [n tracks]
--prev [n]
skip backward [n tracks]
--search
search in playlist
--rsearch
reverse search in playlist
--reset,-r
play track 1
--loop [on|off]
toggle or set loop mode
--quit
quit daemon
--status
show status
--list,-l
show playlist
--help
show help
--version
show version
-v
be verbose

Copyright

Copyright (C) 2005, 2006 by Stephen Jungels. Released under the GPL.

Author

Written by Stephen Jungels (sjungels@gmail.com)


categories: Awk100,Macros,Tools,Mar,2009,JonB

m1 : A Micro Macro Processor

Contents

Synopsis

awk -f m1.awk [file...]

Download

Download from LAWKER.

Description

M1 is a simple macro language that supports the essential operations of defining strings and replacing strings in text by their definitions. It also provides facilities for file inclusion and for conditional expansion of text. It is not designed for any particular application, so it is mildly useful across several applications, including document preparation and programming. This paper describes the evolution of the program; the final version is implemented in about 110 lines of Awk.

M1 copies its input file(s) to its output unchanged except as modified by certain "macro expressions." The following lines define macros for subsequent processing:

 @comment Any text
 @@                     same as @comment
 @define name value
 @default name value    set if name undefined
 @include filename
 @if varname            include subsequent text if varname != 0
 @unless varname        include subsequent text if varname == 0
 @fi                    terminate @if or @unless
 @ignore DELIM          ignore input until line that begins with DELIM
 @stderr stuff          send diagnostics to standard error

A definition may extend across many lines by ending each line with a backslash, thus quoting the following newline.

Any occurrence of @name@ in the input is replaced in the output by the corresponding value.

@name at beginning of line is treated the same as @name@.

Applications

Form Letters

We'll start with a toy example that illustrates some simple uses of m1. Here's a form letter that I've often been tempted to use:

@default MYNAME Jon Bentley 
@default TASK respond to your special offer 
@default EXCUSE the dog ate my homework 
Dear @NAME@: 
    Although I would dearly love to @TASK@, 
I am afraid that I am unable to do so because @EXCUSE@. 
I am sure that you have been in this situation 
many times yourself. 
            Sincerely, 
            @MYNAME@ 

If that file is named sayno.mac, it might be invoked with this text:

@define NAME Mr. Smith 
@define TASK subscribe to your magazine 
@define EXCUSE I suddenly forgot how to read 

Recall that a @default takes effect only if its variable was not previously @defined.

Troff Pre-Processing

I've found m1 to be a handy Troff preprocessor. Many of my text files (including this one) start with m1 definitions like:

@define ArrayFig @StructureSec@.2 
@define HashTabFig @StructureSec@.3 
@define TreeFig @StructureSec@.4 
@define ProblemSize 100 

Even a simple form of arithmetic would be useful in numeric sequences of definitions. The longer m1 variables get around Troff's dreadful two-character limit on string names; these variables are also available to Troff preprocessors like Pic and Eqn. Various forms of the @define, @if, and @include facilities are present in some of the Troff-family languages (Pic and Troff) but not others (Tbl); m1 provides a consistent mechanism.

I include figures in documents with lines like this:

@define FIGNUM @FIGMFMOVIE@ 
@define FIGTITLE The Multiple Fragment heuristic. 
@FIGSTART@ 
<PS> <@THISDIR@/mfmovie.pic</PS>
@FIGEND@ 

The two @defines are a hack to supply the two parameters of number and title to the figure. The figure might be set off by horizontal lines or enclosed in a box, the number and title might be printed at the top or the bottom, and the figures might be graphs, pictures, or animations of algorithms. All figures, though, are presented in the consistent format defined by FIGSTART and FIGEND.

Awk Library Management

I have also used m1 as a preprocessor for Awk programs. The @include statement allows one to build simple libraries of Awk functions (though some, but not all, Awk implementations provide this facility by allowing multiple program files). File inclusion was used in an earlier version of this paper to include individual functions in the text and then wrap them all together into the complete m1 program. The conditional statements allow one to customize a program with macros rather than run-time if statements, which can reduce both run time and compile time.

Controlling Experiments

The most interesting application for which I've used this macro language is unfortunately too complicated to describe in detail. The job for which I wrote the original version of m1 was to control a set of experiments. The experiments were described in a language with a lexical structure that forced me to make substitutions inside text strings; that was the original reason that substitutions are bracketed by at-signs. The experiments are currently controlled by text files that contain descriptions in the experiment language, data extraction programs written in Awk, and graphical displays of data written in Grap; all the programs are tailored by m1 commands.

Most experiments are driven by short files that set a few key parameters and then @include a large file with many @defaults. Separate files describe the fields of shared databases:

 @define N ($1) 
 @define NODES ($2) 
 @define CPU ($3) 
 ... 

These files are @included in both the experiment files and in Troff files that display data from the databases. I had tried to conduct a similar set of experiments before I built m1, and got mired in muck. The few hours I spent building the tool were paid back handsomely in the first days I used it.

The Substitution Function

M1 uses a fast substitution function. The idea is to process the string from left to right, searching for the first substitution to be made. We then make the substitution, and rescan the string starting at the fresh text. We implement this idea by keeping two strings: the text processed so far is in L (for Left), and unprocessed text is in R (for Right). Here is the pseudocode for dosubs:

L = Empty 
R = Input String 
while R contains an "@" sign do 
	let R = A @ B; set L = L A and R = B 
	if R contains no "@" then 
		L = L "@" 
		break 
	let R = A @ B; set M = A and R = B 
	if M is in SymTab then 
		R = SymTab[M] R 
	else 
		L = L "@" M 
		R = "@" R 
return L R 

Possible Extensions

There are many ways in which the m1 program could be extended. Here are some of the biggest temptations to "creeping creaturism":

  • A long definition with a trail of backslashes might be more graciously expressed by a @longdefine statement terminated by an @longend.
  • An @undefine statement would remove a definition from the symbol table.
  • I've been tempted to add parameters to macros, but so far I have gotten around the problem by using an idiom sketched just after this list.
  • It would be easy to add stack-based arithmetic and strings to the language by adding @push and @pop commands that read and write variables.
  • As soon as you try to write interesting macros, you need to have mechanisms for quoting strings (to postpone evaluation) and for forcing immediate evaluation.
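
The parameter idiom is the same hack used earlier for figures: to "pass" parameters, @define them immediately before expanding the body that refers to them. A hypothetical sketch (the names greeting.mac and GREETNAME are ours):

	@define GREETNAME world
	@include greeting.mac

where greeting.mac contains a line such as

	Hello, @GREETNAME@.

Each "call" simply re-@defines the parameters before the next @include.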

Code

The following code is short (around 100 lines), significantly shorter than other macro processors; see, for instance, Chapter 8 of Kernighan and Plauger [1981]. The program uses several techniques that can be applied in many Awk programs.

  • Symbol tables are easy to implement with Awk's associative arrays.
  • The program makes extensive use of Awk's string-handling facilities: regular expressions, string concatenation, gsub, index, and substr.
  • Awk's file handling makes the dofile procedure straightforward.
  • The readline function and pushback mechanism associated with buffer are of general utility.

error

function error(s) {
	print "m1 error: " s | "cat 1>&2"; exit 1
}

dofile

function dofile(fname,  savefile, savebuffer, newstring) {
	if (fname in activefiles)
		error("recursively reading file: " fname)
	activefiles[fname] = 1
	savefile = file; file = fname
	savebuffer = buffer; buffer = ""
	while (readline() != EOF) {
		if (index($0, "@") == 0) {
			print $0
		} else if (/^@define[ \t]/) {
			dodef()
		} else if (/^@default[ \t]/) {
			if (!($2 in symtab))
				dodef()
		} else if (/^@include[ \t]/) {
			if (NF != 2) error("bad include line")
			dofile(dosubs($2))
		} else if (/^@if[ \t]/) {
			if (NF != 2) error("bad if line")
			if (!($2 in symtab) || symtab[$2] == 0)
				gobble()
		} else if (/^@unless[ \t]/) {
			if (NF != 2) error("bad unless line")
			if (($2 in symtab) && symtab[$2] != 0)
				gobble()
		} else if (/^@fi([ \t]?|$)/) { # Could do error checking here
		} else if (/^@stderr[ \t]?/) {
			print substr($0, 9) | "cat 1>&2"
		} else if (/^@(comment|@)[ \t]?/) {
		} else if (/^@ignore[ \t]/) { # Dump input until $2
			delim = $2
			l = length(delim)
			while (readline() != EOF)
				if (substr($0, 1, l) == delim)
					break
		} else {
			newstring = dosubs($0)
			if ($0 == newstring || index(newstring, "@") == 0)
				print newstring
			else
				buffer = newstring "\n" buffer
		}
	}
	close(fname)
	delete activefiles[fname]
	file = savefile
	buffer = savebuffer
}

readline

Put the next input line into $0, taking it from the pushback string "buffer" if that is non-empty. Return "EOF" or "" (null string).

function readline(  i, status) {
	status = ""
	if (buffer != "") {
		i = index(buffer, "\n")
		$0 = substr(buffer, 1, i-1)
		buffer = substr(buffer, i+1)
	} else {
		# Hume: special case for non v10: if (file == "/dev/stdin")
		if (getline <file <= 0)
			status = EOF
	}
	# Hack: allow @Mname at start of line w/o closing @
	if ($0 ~ /^@[A-Z][a-zA-Z0-9]*[ \t]*$/)
		sub(/[ \t]*$/, "@")
	return status
}

gobble

function gobble(  ifdepth) {
	ifdepth = 1
	while (readline() != EOF) {
		if (/^@(if|unless)[ \t]/)
			ifdepth++
		if (/^@fi[ \t]?/ && --ifdepth <= 0)
			break
	}
}

dosubs

function dosubs(s,  l, r, i, m) {
	if (index(s, "@") == 0)
		return s
	l = ""	# Left of current pos; ready for output
	r = s	# Right of current; unexamined at this time
	while ((i = index(r, "@")) != 0) {
		l = l substr(r, 1, i-1)
		r = substr(r, i+1)	# Currently scanning @
		i = index(r, "@")
		if (i == 0) {
			l = l "@"
			break
		}
		m = substr(r, 1, i-1)
		r = substr(r, i+1)
		if (m in symtab) {
			r = symtab[m] r
		} else {
			l = l "@" m
			r = "@" r
		}
	}
	return l r
}

dodef

function dodef(fname,  str, x) {
	name = $2
	sub(/^[ \t]*[^ \t]+[ \t]+[^ \t]+[ \t]*/, "")  # OLD BUG: last * was +
	str = $0
	while (str ~ /\\$/) {
		if (readline() == EOF)
			error("EOF inside definition")
		# OLD BUG: sub(/\\$/, "\n" $0, str)
		x = $0
		sub(/^[ \t]+/, "", x)
		str = substr(str, 1, length(str)-1) "\n" x
	}
	symtab[name] = str
}

BEGIN

BEGIN {
	EOF = "EOF"
	if (ARGC == 1)
		dofile("/dev/stdin")
	else if (ARGC >= 2) {
		for (i = 1; i < ARGC; i++)
			dofile(ARGV[i])
	} else
		error("usage: m1 [fname...]")
}

Bugs

M1 is three steps lower than m4. You'll probably miss something you have learned to expect.

History

M1 was documented in the 1997 sed & awk book by Dale Dougherty & Arnold Robbins (ISBN 1-56592-225-5) but may have been written earlier.

This page was adapted from 131.191.66.141:8181/UNIX_BS/sedawk/examples/ch13/m1.pdf (download from LAWKER).

Author

Jon L. Bentley.


categories: Wp,Awk100,Wp,Tools,Apr,2009,HenryS

awf

The amazingly workable (text) formatter

Synopsis

awf -macros [ file ] ...

Download

Download from LAWKER. Type "make r" to run a regression test, formatting the manual page (awf.1) and comparing it to a preformatted copy (awf.1.out). Type "make install" to install it. Pathnames may need changing.

Description

Awf formats the text from the input file(s) (standard input if none) in an imitation of nroff's style with the -man or -ms macro packages. The -macro option is mandatory and must be `-man' or `-ms'.
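
For example, to format awf's own manual page (the awf.1 file mentioned in the Download section) for reading on a terminal:

	awf -man awf.1 | more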

Awf is slow and has many restrictions, but does a decent job on most manual pages and simple -ms documents, and isn't subject to AT&T's brain-damaged licensing that denies many System V users any text formatter at all. It is also a text formatter that is simple enough to be tinkered with, for people who want to experiment.

Awf implements the following raw nroff requests:

.\"  .ce  .fi  .in  .ne  .pl  .sp
.ad  .de  .ft  .it  .nf  .po  .ta
.bp  .ds  .ie  .ll  .nr  .ps  .ti
.br  .el  .if  .na  .ns  .rs  .tm

and the following in-text codes:

\$   \%   \*   \c   \f   \n   \s

plus the full list of nroff/troff special characters in the original V7 troff manual.

Many restrictions are present; the behavior in general is a subset of nroff's. Of particular note are the following:

  • Point sizes do not exist; .ps and \s are ignored.
  • Conditionals implement only numeric comparisons on \n(.$, string comparisons between a macro parameter and a literal, and n (always true) and t (always false).
  • The implementation of strings is generally primitive.
  • Expressions in (e.g.) .sp are fairly general, but the |, &, and : operators do not exist, and the implementation of \w requires that quote (') be used as the delimiter and simply counts the characters inside (so that, e.g., \w'\(bu' equals 4).

White space at the beginning of lines, and imbedded white space within lines, is dealt with properly. Sentence terminators at ends of lines are understood to imply extra space afterward in filled lines. Tabs are implemented crudely and not quite correctly, although in most cases they work as expected. Hyphenation is done only at explicit hyphens, emdashes, and nroff discretionary hyphens.

MAN Macros

The -man macro set implements the full V7 manual macros, plus a few semi-random oddballs. The full list is:

.B   .DT  .IP  .P   .RE  .SM
.BI  .HP  .IR  .PD  .RI  .TH
.BR  .I   .LP  .PP  .RS  .TP
.BY  .IB  .NB  .RB  .SH  .UC

.BY and .NB each take a single string argument (respectively, an indication of authorship and a note about the status of the manual page) and arrange to place it in the page footer.

MS Macros

The -ms macro set is a substantial subset of the V7 manuscript macros. The implemented macros are:

.AB  .CD  .ID  .ND  .QP  .RS  .UL
.AE  .DA  .IP  .NH  .QS  .SH  .UX
.AI  .DE  .LD  .NL  .R   .SM
.AU  .DS  .LG  .PP  .RE  .TL
.B   .I   .LP  .QE  .RP  .TP

Size changes are recognized but ignored, as are .RP and .ND. .UL just prints its argument in italics. .DS/.DE does not do a keep, nor do any of the other macros that normally imply keeps.

Assignments to the header/footer string variables are recognized and implemented, but there is otherwise no control over header/footer formatting. The DY string variable is available. The PD, PI, and LL number registers exist and can be changed.

Output

The only output format supported by awf, in its distributed form, is that appropriate to a dumb terminal, using overprinting for italics (via underlining) and bold. The nroff special characters are printed as some vague approximation (it's sometimes very vague) to their correct appearance.

Awf's knowledge of the output device is established by a device file, which is read before the user's input. It is sought in awf's library directory, first as dev.term (where term is the value of the TERM environment variable) and, failing that, as dev.dumb. The device file uses special internal commands to set up resolution, special characters, fonts, etc., and more normal nroff commands to set up page length etc.

Files

All in /usr/lib/awf (this can be overridden by the AWFLIB environment variable):

common     common device-independent initialization
dev.*      device-specific initialization
mac.m*     macro packages
pass1      macro substituter
pass2.base central formatter
pass2.m*   macro-package-specific bits of formatter
pass3      line and page composer

See Also

awk(1), nroff(1), man(7), ms(7)

Diagnostics

Unlike nroff, awf complains whenever it sees unknown commands and macros. All diagnostics (these and some internal ones) appear on standard error at the end of the run.

Author

Written at University of Toronto by Henry Spencer, more or less as a supplement to the C News project.

Copyright

Copyright 1990 University of Toronto. All rights reserved. Written by Henry Spencer. This software is not subject to any license of the American Telephone and Telegraph Company or of the Regents of the University of California.

Permission is granted to anyone to use this software for any purpose on any computer system, and to alter it and redistribute it freely, subject to the following restrictions:

  1. The author is not responsible for the consequences of use of this software, no matter how awful, even if they arise from flaws in it.
  2. The origin of this software must not be misrepresented, either by explicit claim or by omission. Since few users ever read sources, credits must appear in the documentation.
  3. Altered versions must be plainly marked as such, and must not be misrepresented as being the original software. Since few users ever read sources, credits must appear in the documentation.
  4. This notice may not be removed or altered.

Bugs

There are plenty, but what do you expect for a text formatter written entirely in (old) awk?

The -ms stuff has not been checked out very thoroughly.


categories: Awk100,May,2009,Dab

Jawk: Awk in Java

Download

Download from Source Forge.

Description

Jawk parses, analyzes, and interprets and/or compiles AWK scripts. Compilation targets the JVM.

Jawk runs on any platform which supports, at minimum, J2SE 5.

Usage

To use, simply download the application, copy the release jar to jawk.jar, and execute the following command:
java -jar jawk.jar {command-line-arguments}

To view the command line argument usage summary, execute

java -jar jawk.jar -h

The output of this command is shown below:

java ... org.jawk.Awk [-F fs_val] [-f script-filename] 
                      [-o output-filename] [-c] [-z] [-Z] 
                      [-d dest-directory] [-S] [-s] [-x] [-y] [-r] 
                      [-ext] [-ni] [-t] [-v name=val]... 
                      [script] [name=val | input_filename]...

 -F fs_val = Use fs_val for FS.
 -f filename = Use contents of filename for script.
 -v name=val = Initial awk variable assignments.

 -t = (extension) Maintain array keys in sorted order.
 -c = (extension) Compile to intermediate file. (default: a.ai)
 -o = (extension) Specify output file.
 -z = (extension) Compile for JVM. (default: AwkScript.class)
 -Z = (extension) Compile for JVM and execute it. (default: AwkScript.class)
 -d = (extension) Compile to destination directory.  (default: pwd)
 -S = (extension) Write the syntax tree to file. (default: syntax_tree.lst)
 -s = (extension) Write the intermediate code to file. (default: avm.lst)
 -x = (extension) Enable _sleep, _dump as keywords, and exec as a builtin func.
                  (Note: exec enabled only in interpreted mode.)
 -y = (extension) Enable _INTEGER, _DOUBLE, and _STRING casting keywords.
 -r = (extension) Do NOT hide IllegalFormatExceptions for [s]printf.
-ext= (extension) Enable user-defined extensions. (default: not enabled)
-ni = (extension) Do NOT process stdin or ARGC/V through input rules.
                  (Useful for blocking extensions.)
                  (Note: -ext & -ni available only in interpreted mode.)

 -h or -? = (extension) This help screen.

Extensions

Jawk addresses a drawback of standard Awk. For example, in standard Awk it is impossible to create a socket or display a simple GUI without external assistance, either from the shell or via extensions to Awk itself (e.g., gawk). To overcome this limitation, an extension facility has been added to Jawk.

The Jawk extension facility allows arbitrary Java code to be called as Awk functions in a Jawk script. These extensions can come from the user (developer) or from third-party providers (e.g., the Jawk project team). Jawk extensions are opt-in: the -ext flag is required to use them, and extensions must be explicitly registered to the Jawk instance via the -Djawk.extensions property (except for core extensions bundled with Jawk).

Jawk extensions also support blocking. You can think of blocking as a tool for extension event management. A Jawk script can block on a collection of blockable services, such as socket input availability, database triggers, user input, GUI dialog input response, or a simple fixed timeout. Together with the -ni option, action rules can then act on block events instead of input text, turning a powerful AWK construct originally intended for text processing into a means of processing blockable events. A sample enhanced echo server script is included in this article. It uses blocking to handle socket events, standard input from the user, and timeout events, all within a 47-line script (including comments).

Example

The example script implements a simple echo server which also allows broadcast messaging via stdin input from the server process:
## to run: java ... -jar jawk.jar -ext -ni -f {filename}
BEGIN {
	css = CServerSocket(7777);
	print "(echo server socket created)"
}
## note: default input processing disabled by -ni
$0 = SocketAcceptBlock(css,
	SocketInputBlock(sockets,
		SocketCloseBlock(css, sockets,
			StdinBlock(
				Timeout(1000)))));
				## note: default action { print } disabled by -ni
# $1 = "SocketAccept", $2 = socket handle
$1 == "SocketAccept" {
	socket = SocketAccept($2)
	sockets[socket] = 1
}

# $1 = "SocketInput", $2 = socket handle
$1 == "SocketInput" {
	## echo server action:

	socket = $2
	line = SocketRead(socket)
	SocketWrite(socket, line)
}

# $1 = "SocketClose", $2 = socket handle
$1 == "SocketClose" {
	socket = $2
	SocketClose(socket)
	delete sockets[socket]
}
## display a . for every second the server is running
$0 == "Timeout" {
	printf "."
}
## stdin block is last because StdinGetline writes directly to $0
## $0 == "Stdin"
$0 == "Stdin" {
	## broadcast message to all sockets
	retcode = StdinGetline()
	if (retcode != 1)
		exit
	for (socket in sockets)
		SocketWrite(socket, "From server : " $0)
	print "(message sent)"
}

Each extension function used in the script above is covered in some detail below:

  • CServerSocket - Creates a character-based server socket. SocketRead for character-based sockets returns lines of text (with newlines stripped), while SocketRead returns blocks of bytes (converted to a String) for sockets accepted by ServerSocket. Use character-based sockets for interactive or line-based input, and use ordinary sockets to achieve high throughput, since arbitrary byte blocks are returned. To create a client socket, use CSocket for character-based sockets, or Socket for byte-block-based sockets.
  • SocketAcceptBlock/SocketInputBlock/SocketCloseBlock/StdinBlock/Timeout - Each of these extensions is a blocking extension, blocking for particular events, such as a server socket is ready to accept an incoming socket, or a connected socket has input to be read, or a certain amount of time has elapsed, etc. Socket*Block extension functions come from SocketExtension, StdinBlock comes from StdinExtension, and Timeout comes from CoreExtension. Each Socket*Block extension returns a string of the format:
    extension-label-prefix OFS parameter
    
    while StdinBlock and Timeout return
    extension-label-prefix
    
  • SocketAccept/SocketRead/SocketWrite/SocketClose - Socket operations, as the names of the extension functions suggest. Each will block until it is able to complete the operation.
  • StdinGetline - Get a line of input from stdin. If there is no stdin, block until input is available. This is why blocking is a valuable tool: the script can wait for other events while waiting for stdin, bringing AWK out of the focused text-processing domain and turning it into a powerful event-processing language.

As stated by the comments, -ni disables stdin processing (as provided by Jawk itself, not the StdinExtension) and the default blank rule of { print }. Disabling stdin processing is paramount for extension processing because, otherwise, it would be confusing, if not completely impossible, to multiplex extension blocking with Jawk's default stdin processing. And disabling the default blank rule allows for easy-to-read blocking statements (like the one provided in the sample script) without the weird side effect of printing the result.

Author

Dan: ddaglas at users.sourceforge.net.


categories: Xgawk,XML,Awk100,Apr,2009,JurgenK

XMLgawk

Editor's note: Programmers often take awk "as is", never thinking to use it as a lab in which they can explore other language extensions. An alternate approach is to treat the Awk code base as a reusable library of parsers, regular expression engines, and so on, and to make modifications to the language. This second approach is taken in the Awk A* project and, as shown here, in XMLgawk.

IMHO, XMLgawk is one of the most exciting new innovations seen in Gawk for many years. It shows that Awk is more than "just" a text processor: rather, it is also a candidate technology for modern XML-based web applications.

Purpose

Extends standard gawk with built-in XML processing.

Developers

Main developers: Jurgen Kahrs and Andrew Schorr.

Conceptual guidance: Manuel Collado.

MS Windows build expert: Victor Paeza.

Contributor of ideas for new features: Peter Saveliev.

Domain

XML processing, plus libraries for other extensions to Gawk.

Description

XMLgawk is an experimental extension of the GNU Awk interpreter. It includes a small XML parsing library which is built upon the Expat XML parser. The parsing library is a very thin layer on top of Expat (implementing a pull-interface) and can also be used without GNU Awk to read XML data files.

Both XMLgawk and its XML puller library require only an ANSI C compatible compiler (GCC works, as do most vendors' ANSI C compilers) and a 'make' program.

XMLgawk provides the following functionality:

  • AWK's way of reading data line by line is supplemented by reading XML files node by node (see the sketch below).
  • XMLgawk can load .awk files as well as shared libraries.
  • Adds support for an @include directive in the source code. This is the same feature provided by the current igawk script.
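
To give the flavor of node-by-node reading, here is a minimal sketch of an XML outline printer, based on our reading of the XMLgawk documentation (the @load directive and the XMLMODE, XMLSTARTELEM, and XMLDEPTH variables belong to the xgawk XML extension, as we understand it):

	@load "xml"
	BEGIN        { XMLMODE = 1 }   # parse input as XML, one node per record
	XMLSTARTELEM { printf("%*s%s\n", 2*XMLDEPTH, "", XMLSTARTELEM) }

Every XML node arrives as one record; XMLSTARTELEM holds the element name when the node is a start tag, and XMLDEPTH its nesting depth, so the output is an indented outline of the document's element structure.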

Current

3=Released

Use

3=Free/public domain.

Date Deployed

November 2003.

Dated

April 28, 2009.

Url


categories: Games,Awk100,Apr,2009,Ronl

Soccer

Purpose

AI Programming lab class challenge.

Installation

Download from LAWKER. Look at the first line of each file for something that looks like this:

#!/usr/bin/gawk -f
Replace this with the full path to the local version of Gawk.

Developers

Ronald Loui (programmer and designer)

Organization

Washington University in St. Louis

Country

USA

Domain

Text-based game simulation.

Contact

Ronald P. Loui

Email

r.p.loui@gmail.com

Description

Ronald Loui writes: Most people are surprised when I tell them what language we use in our undergraduate AI programming class. That's understandable. We use GAWK, Gnu's version of Aho, Weinberger, and Kernighan's old pattern scanning language.

This code manages a CGI interface to a process that simulates a soccer game, polling for inputs from two student programs.

A repeated observation in this class is that only the scripting programmers can generate code fast enough to keep up with the demands of the class. Even though students were allowed to choose any language they wanted, and many had to unlearn the Java ways of doing things in order to benefit from scripting, there were few who could develop ideas into code effectively and rapidly without scripting.

In the puny language GAWK, which Aho, Weinberger, and Kernighan thought not much more important than grep or sed, I find lessons in AI's trends, AI's history, and the foundations of AI. What I have found not only surprising but also hopeful is that when I have approached the AI people who still enjoy programming, some of them are not the least bit surprised.

Awk

Was written for gawk in 1995 but should run on almost any awk dialect; some CSS positioning commands will not work in all browsers; try IE6.

Platform

Was written on Redhat Linux with multiple hardware platforms in mind.

Uses

Intended to be run on a close server to minimize delays.

Lines

605 lines in main cgi with several small aux control programs.

DevelopmentEffort

Minimal compared to development effort, but potentially will require CSS changes for new browsers.

MaintenanceEffort

Number of person-months since, including enhancements

Current

2=Evaluation.

Users

50 students in artificial intelligence project classes had to use some version of this code over seven years

DateDeployed

October 2004

Dated

April 2009


categories: Top10,Awk100,Papers,Os,Apr,2009,YungC

Awk-Linux

Awk-Linux Educational Operating Systems

Purpose

Teaching operating systems.

Developers

Yung-Pin Cheng

Email

ypc@csie.ntnu.edu.tw

Organization

Software Engineering Lab. Department of Computer Science and Information Engineering National Taiwan Normal University

Country

TAIWAN

Domain

Educators of Operating Systems

Description

Most well-known instructional operating systems are complex, particularly if their companion software is taken into account. It takes considerable time and effort to craft these systems, and their complexity may introduce maintenance and evolution problems. In this project, a courseware called Awk-Linux is proposed. The basic hardware functions provided by Awk-Linux include timer interrupt and page-fault interrupt, which are simulated through program instrumentation over user programs.

A major advantage of the use of Awk for this tool is platform independence. Awk-Linux can be crafted relatively easily, and it does not depend on any hardware simulator or platform. Stable Awk versions run on many platforms, so this tool can be readily and easily ported to other machines. The same cannot be said for other, more complex operating systems courseware that may be much harder to port to new environments.

In practice, using Awk-Linux is very simple for the instructor and students:

  • Course projects based on Awk-Linux provide source code extracted and simplified from a Linux kernel.
  • Results of our study indicate that the projects helped students better understand the inner workings of operating systems.

Awk

Gawk under cygwin or Linux

Platform

Windows (CYGWIN required) or Linux

Uses

C programming language

Current

Status 3 (Released)

Use

3(Free/public domain)

DateDeployed

2004

References

Yung-Pin Cheng and Janet Mei-Chuen Lin, "Awk-Linux: A Lightweight Operating Systems Courseware," IEEE Transactions on Education, vol. 51, issue 4, pp. 461-467, 2008.

Url

www.csie.ntnu.edu.tw/~ypc/awklinux.htm


categories: Top10,Awk100,Mar,2009,NelsonB,Spell,ArnoldR

spell.awk

Synopsis

awk [-v Dictionaries="sysdict1 sysdict2 ..."] -f spell.awk -- \
    [=suffixfile1 =suffixfile2 ...] [+dict1 +dict2 ...] \
    [-strip] [-verbose] [file(s)]

Download

Download from LAWKER.

Description

Why Study This Code?

This program is an example par excellence of the power of awk. Yes, if written in "C", it would run faster. But goodness me, it would be much longer to code. These few lines implement a powerful spell checker, with user-specifiable exception lists. The built-in dictionary is constructed from a list of standard Unix spelling dictionaries, overridable on the command line.

It also offers some tips on how to structure larger-than-ten-line awk programs. In the code below, note:

  • The code is hundreds of lines long. Yes folks, it's true: Awk is not just a tool for writing one-liners.
  • The code is well-structured. Note, for example, how the BEGIN block is used to initialize the system from files/functions.
  • The code uses two tricks that encourage function reuse:
    • Much of the functionality has been moved out of PATTERN-ACTION and into functions.
    • The number of globals is restricted: note the frequent use of local variables in functions.
  • There is an example, in scan_options, of how to parse command line arguments;
  • The use of "print pipes" in report_exceptions shows how to link Awk code to other commands.

(And to write even larger programs, divided into many files, see runawk.)

Dictionaries

Dictionaries are simple text files, with one word per line. Unlike those for Unix spell(1), the dictionaries need not be sorted, and there is no dependence on the locale in this program that can affect which exceptions are reported, although the locale can affect their reported order in the exception list. A default list of dictionaries can be supplied via the environment variable DICTIONARIES, but that can be overridden on the command line.

For the purposes of this program, words are located by replacing ASCII control characters, digits, and punctuation (except apostrophe) with ASCII space (32). What remains are the words to be matched against the dictionary lists. Thus, files in ASCII and ISO-8859-n encodings are supported, as well as Unicode files in UTF-8 encoding.

All word matching is case insensitive (subject to the workings of tolower()).

In this simple version, which is intended to support multiple languages, no attempt is made to strip word suffixes, unless the -strip option is supplied.
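
A typical invocation (the file names here are hypothetical) checks two chapters against the default dictionaries plus a private word list, with suffix stripping and verbose reporting:

	awk -f spell.awk -- +project.dict -strip -verbose chapter1.txt chapter2.txt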

Suffixes

Suffixes are defined as regular expressions, and may be supplied from suffix files (one per name) named on the command line, or from an internal default set of English suffixes. Comments in the suffix file run from sharp (#) to end of line. Each suffix regular expression should end with $, to anchor the expression to the end of the word. Each suffix expression may be followed by a list of one or more strings that can replace it, with the special convention that "" represents an empty string. For example:

	ies$	ie ies y	# flies -> fly, series -> series, ties -> tie
	ily$	y ily		# happily -> happy, wily -> wily
	nnily$	n		# funnily -> fun

Although it is permissible to include the suffix in the replacement list, it is not necessary to do so, since words are looked up before suffix stripping.

Suffixes are tested in order of decreasing length, so that the longest matches are tried first.

Output

The default output is just a sorted list of unique spelling exceptions, one per line. With the -verbose option, output lines instead take the form

	filename:linenumber:exception

Some Unix text editors recognize such lines, and can use them to move quickly to the indicated location.

Code

Top-Level

BEGIN	{ initialize() }
	    { spell_check_line() }
END	    { report_exceptions() }

get_dictionaries

function get_dictionaries(        files, key)
{
    if ((Dictionaries == "") && ("DICTIONARIES" in ENVIRON))
	Dictionaries = ENVIRON["DICTIONARIES"]
    if (Dictionaries == "")	# Use default dictionary list
    {
	DictionaryFiles["/usr/dict/words"]++
	DictionaryFiles["/usr/local/share/dict/words.knuth"]++
    }
    else			# Use system dictionaries from command line
    {
	split(Dictionaries, files)
	for (key in files)
	    DictionaryFiles[files[key]]++
    }
}

Initialize

function initialize()
{
   NonWordChars = "[^" \
	"'" \
	"ABCDEFGHIJKLMNOPQRSTUVWXYZ" \
	"abcdefghijklmnopqrstuvwxyz" \
	"\200\201\202\203\204\205\206\207\210\211\212\213\214\215\216\217" \
	"\220\221\222\223\224\225\226\227\230\231\232\233\234\235\236\237" \
	"\240\241\242\243\244\245\246\247\250\251\252\253\254\255\256\257" \
	"\260\261\262\263\264\265\266\267\270\271\272\273\274\275\276\277" \
	"\300\301\302\303\304\305\306\307\310\311\312\313\314\315\316\317" \
	"\320\321\322\323\324\325\326\327\330\331\332\333\334\335\336\337" \
	"\340\341\342\343\344\345\346\347\350\351\352\353\354\355\356\357" \
	"\360\361\362\363\364\365\366\367\370\371\372\373\374\375\376\377" \
	"]"
    get_dictionaries()
    scan_options()
    load_dictionaries()
    load_suffixes()
    order_suffixes()
}

load_dictionaries

function load_dictionaries(        file, word)
{
    for (file in DictionaryFiles)
    {
	## print "DEBUG: Loading dictionary " file > "/dev/stderr"
	while ((getline word < file) > 0)
	    Dictionary[tolower(word)]++
	close(file)
    }
}

load_suffixes

function load_suffixes(        file, k, line, n, parts)
{
    if (NSuffixFiles > 0)		# load suffix regexps from files
    {
	for (file in SuffixFiles)
	{
	    ## print "DEBUG: Loading suffix file " file > "/dev/stderr"
	    while ((getline line < file) > 0)
	    {
		sub(" *#.*$", "", line)		# strip comments
		sub("^[ \t]+", "", line)	# strip leading whitespace
		sub("[ \t]+$", "", line)	# strip trailing whitespace
		if (line == "")
		    continue
		n = split(line, parts)
		Suffixes[parts[1]]++
		Replacement[parts[1]] = parts[2]
		for (k = 3; k <= n; k++)
		  Replacement[parts[1]]= Replacement[parts[1]] " " parts[k]
	    }
	    close(file)
	}
    }
    else	      # load default table of English suffix regexps
    {
	split("'$ 's$ ed$ edly$ es$ ing$ ingly$ ly$ s$", parts)
	for (k in parts)
	{
	    Suffixes[parts[k]] = 1
	    Replacement[parts[k]] = ""
	}
    }
}

order_suffixes

function order_suffixes(        i, j, key)
{
    # Order suffixes by decreasing length
    NOrderedSuffix = 0
    for (key in Suffixes)
	OrderedSuffix[++NOrderedSuffix] = key
    for (i = 1; i < NOrderedSuffix; i++)
	for (j = i + 1; j <= NOrderedSuffix; j++)
	    if (length(OrderedSuffix[i]) < length(OrderedSuffix[j]))
		swap(OrderedSuffix, i, j)
}

report_exceptions

function report_exceptions(        key, sortpipe)
{
    sortpipe = Verbose ? "sort -f -t: -u -k1,1 -k2n,2 -k3" : "sort -f -u -k1"
    for (key in Exception)
        print Exception[key] | sortpipe
    close(sortpipe)
}

scan_options

function scan_options(        k)
{
    for (k = 1; k < ARGC; k++)
    {
	if (ARGV[k] == "-strip")
	{
	    ARGV[k] = ""
	    Strip = 1
	}
	else if (ARGV[k] == "-verbose")
	{
	    ARGV[k] = ""
	    Verbose = 1
	}
	else if (ARGV[k] ~ /^=/)	# suffix file
	{
	    NSuffixFiles++
	    SuffixFiles[substr(ARGV[k], 2)]++
	    ARGV[k] = ""
	}
	else if (ARGV[k] ~ /^[+]/)	# private dictionary
	{
	    DictionaryFiles[substr(ARGV[k], 2)]++
	    ARGV[k] = ""
	}
    }

    # Remove trailing empty arguments (for nawk)
    while ((ARGC > 0) && (ARGV[ARGC-1] == ""))
        ARGC--
}

spell_check_line

function spell_check_line(        k, word)
{
    ## for (k = 1; k <= NF; k++) print "DEBUG: word[" k "] = \"" $k "\""
    gsub(NonWordChars, " ")		# eliminate nonword chars
    for (k = 1; k <= NF; k++)
    {
	word = $k
	sub("^'+", "", word)		# strip leading apostrophes
	sub("'+$", "", word)		# strip trailing apostrophes
	if (word != "")
	    spell_check_word(word)
    }
}

spell_check_word

function spell_check_word(word,        key, lc_word, location, w, wordlist)
{
    lc_word = tolower(word)
    ## print "DEBUG: spell_check_word(" word ") -> tolower -> " lc_word
    if (lc_word in Dictionary)		# acceptable spelling
	return
    else				# possible exception
    {
	if (Strip)
	{
	    strip_suffixes(lc_word, wordlist)
	    ## for (w in wordlist) print "DEBUG: wordlist[" w "]"
	    for (w in wordlist)
		if (w in Dictionary)
		    break
	    if (w in Dictionary)
		return
	}
	## print "DEBUG: spell_check():", word
	location = Verbose ? (FILENAME ":" FNR ":") : ""
	if (lc_word in Exception)
	    Exception[lc_word] = Exception[lc_word] "\n" location word
	else
	    Exception[lc_word] = location word
    }
}

strip_suffixes

function strip_suffixes(word, wordlist,        ending, k, n, regexp)
{
    ## print "DEBUG: strip_suffixes(" word ")"
    split("", wordlist)
    for (k = 1; k <= NOrderedSuffix; k++)
    {
	regexp = OrderedSuffix[k]
	## print "DEBUG: strip_suffixes(): Checking \"" regexp "\""
	if (match(word, regexp))
	{
	    word = substr(word, 1, RSTART - 1)
	    if (Replacement[regexp] == "")
		wordlist[word] = 1
	    else
	    {
		split(Replacement[regexp], ending)
		for (n in ending)
		{
		    if (ending[n] == "\"\"")
			ending[n] = ""
		    wordlist[word ending[n]] = 1
		}
	    }
	    break
	}
    }
     ## for (n in wordlist) print "DEBUG: strip_suffixes() -> \"" n "\""
}

swap

function swap(a, i, j,        temp)
{
    temp = a[i]
    a[i] = a[j]
    a[j] = temp
}

Author

Arnold Robbins and Nelson H.F. Beebe in "Classic Shell Scripting", O'Reilly Books


categories: Yawk,Awk100,Feb,2009,WolfganZ

Yawk

Purpose

Run a WIKI using Gawk.

Download

Download from LAWKER or Wolfgan Zekol's web site.

Url

For a live demo, see the Yawk home page.

Developers

Wolfgan Zekol.

Domain

Web application.

Contact

Wolfgan Zekol.

Email

dag@awk-scripting.de

Description

Yawk is "yet another wiki klone", one among a lot of others. Yawk was written because the available wikis were missing some formatting capabilities or used strange formatting rules (and you might not like mine) or imposed too much requirements for understanding a wiki (mysql database installation with or without php installed).

Awk

Gawk 3.1.4 or later.

Platform

CGI

Lines

6000 lines.

Current

Status 3=Released.

Use

3=Free/public domain.

DateDeployed

2004

Dated

2009


categories: AwkLisp,Awk100,Feb,2009,DariusB

AwkLisp

Purpose

Code up a LISP/Scheme interpreter in Awk.

For more details..

See awklisp.

Developers

1

Domain

Domain-specific language.

Contact

Darius Bacon, darius@wry.me

Email

darius@wry.me

Description

At my previous job I had to use MapBasic, an interpreter so astoundingly slow (around 100 times slower than GWBASIC) that one must wonder if it itself is implemented in an interpreted language. I still wonder, but it clearly could be: a bare-bones Lisp in awk, hacked up in a few hours, ran substantially faster.

Awk

Awk/Gawk

Lines

350

Current

1=Prototype

Use

1=Personal use.

DateDeployed

1994

Dated

2009


categories: Name,Awk100,Feb,2009,BillP

Name

Not a single program.

Purpose

Generate TeX code for a bilingual dictionary from a flat file database. This system has been used to generate multiple editions of dictionaries for several dialects of Carrier, the endangered language of a large portion of the central interior of British Columbia.

Developers

Bill Poser

Organization

Country

Canada

Domain

linguistics - dictionary publishing

Contact

Bill Poser

Email

billposer@alum.mit.edu

Description

A dictionary database consists of four flat files containing records in which fields are identified by tags, in a format isomorphic to Standard Dictionary Format. The four files contain: main entries, example sentences with translations, verb roots, and verb stems. This provides a modest degree of relativization. Awk scripts controlled by a makefile do the bulk of the work of generating TeX code for printing dictionaries containing front matter, a Carrier-English section, an English-Carrier section, a topical index, an alphabetical root list, a list of roots sorted by English gloss, an alphabetical list of verb stems, a list of verb stems sorted by root, an alphabetical list of affixes, a list of affixes sorted by English gloss, a list of scientific names, a list of placenames, and credits for illustrations.

Awk

gawk

Shell

The awk scripts are executed from a makefile.

Platform

GNU/Linux on x86.

Uses

The awk scripts are executed from a makefile by GNU make. The other program used extensively is the sort utility msort.

Lines

5500

DevelopmentEffort

The first usable version took no more than a day (plus the time to create the TeX template into which the generated code is inserted).

MaintenanceEffort

Pure maintenance due to changes in environment, bit rot, etc. has been just about nil. The effort devoted to adding features is very difficult to estimate, as it has taken place at irregular intervals over a period of 15 years.

Current

3=Released, I guess. The code is mature but not really released, since the author is the only one who normally uses it.

Use

1=Personal use.

Users

1

DateDeployed

June 1993.

References

A paper describing these databases and the process for generating dictionaries from them is available: Lexical Databases for Carrier

Url

Some information about the resulting dictionaries: http://www.ydli.org/products/dicts.htm


categories: Top10,Boris,Awk100,Feb,2009,Ronl

Boris

Purpose

Demonstration to DoD of a clustering algorithm suitable for streaming data.

Source code

gawk/awk100/boris

Live demo

http://www.cse.wustl.edu/~loui/boris.cgi.

Developers

Ronald Loui and a programmer named Boris.

Organization

Washington University in St. Louis, CS Dept.

Country

USA

Domain

This is an evolutionary algorithm and a visualization of a clustering algorithm that could be turned from O(n^4) to O(n log n) with a few judicious uses of constants. Later developments added other interactive devices, including progress meters and mouse-and-click behavior.

Contact

Ronald Loui

Email

r.p.loui@gmail.com

Description

The code is an excellent example of the power of Awk as a prototyping tool: after getting the code running with minimal development time, a quirk was observed in the code that allowed a reduction from O(n^4) to O(n log n).

  • Two of the n's are lost (the n^2) by noticing that when there is a swap, the delta in the scoring function falls off with the squared distance from the point of the swap. So if you just set a constant, such as 10 or 20 or 100, based on the expected size of your clusters, then you can stop calculating the scoring function when you get past that constant (see the sketch after this list).
  • The other n comes either from fixing the size of the matrix and occasionally flushing new candidates in and out, or from sampling over a subset of the n points when you calculate the score.
  • The n log n remains because there is a sort every now and then.
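
A hypothetical sketch of the windowing trick, in Awk (this is our illustration, not the original code; order, sim, W, and n are assumed globals):

	# Score change near position i if the items at positions i and j
	# were exchanged. Each pair's contribution falls off with the
	# squared distance, so only neighbors within W positions matter.
	# Call once around i and once around j, then sum the two deltas.
	function swap_delta(i, j,    k, lo, hi, d, delta) {
		delta = 0
		lo = (i - W > 1) ? i - W : 1
		hi = (i + W < n) ? i + W : n
		for (k = lo; k <= hi; k++) {
			if (k == i || k == j)
				continue
			d = (k - i) * (k - i)
			delta += (sim[order[j], order[k]] - sim[order[i], order[k]]) / d
		}
		return delta
	}

With W fixed at 10, 20, or 100, each proposed swap costs O(W) work instead of a full rescoring.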

Awk

Gawk

Platform

Intended for fast servers, 1+ ghz.

Uses

Html.

Lines

158.

Development Effort

One weekend.

Maintenance Effort

None.

Current

2=Evaluation.

Use

2=in-House use.

Users

5

DateDeployed

2004.

Dated

Feb 2009.

References

M. Looks, A. Levine, G.A. Covington, R.P. Loui, J.W. Lockwood, and Y.H. Cho, "Streaming Hierarchical Clustering for Concept Mining," IEEE Aerospace Conference, 3-10 March 2007, pp. 1-12. Digital Object Identifier 10.1109/AERO.2007.352792.


categories: WWW,Awk100,Jan,2009,PeterK

Get_YouTube_Vids

Purpose

Download videos from youtube.

Source code

gawk/www/get_youtube_vids.awk

Developers

Peter Krumin: Downloading YouTube Videos With Gawk

Domain

World wide web, slurping, file sharing.

Contact

Peter Krumin

Description

How to download YouTube videos.

Awk

Gawk

Lines

331 lines

Current

3=Released

Use

1=Personal use

DateDeployed

July 2007

Dated

Sat Feb 21 19:46:10 EST 2009

Url

Downloading YouTube Videos With Gawk


categories: Sudoku,Awk100,Jan,2009,Jimh

sudoku

This is an Awk 100 program.

Submitted by

Jim Hart

Purpose

Solve sudoku puzzles using the same strategies as a person would, not by brute force.

Source

gawk/awk100/sudoku

Developers

Jim Hart

Country

US

Domain

command line games

Contact

Jim Hart

Email

jhart50@gmail.com

Description

see Purpose

AWK versions

gawk

Platform

Mac OS X, PowerPC

Lines

529

Development Effort

1

Maintenance Effort

0

Date Deployed

/2006


categories: Negotiate,Awk100,Jan,2009,Ronl

Anne's Negotiation Game

An Awk100 program.

Purpose

Research on a model of negotiation incorporating search, dialogue, and changing expectations

Source code

See gawk/awk100/negotiate.

Developers

Ronald Loui (programmer and designer), Anne Jump (adversary)

Organization

National Science Foundation grant at Washington University in St. Louis

Country

USA

Domain

Prototype of a new idea for cognitive modelling (in artificial intelligence/economics/organizational behavior)

Contact

Ronald P. Loui

Email

r.p.loui@gmail.com

Description

The program generates a game board upon which players take turns searching or declaring according to a protocol. It is based on the same game bimatrix made famous by people like von Neumann and Nash, but invents a new approach to negotiation based on process instead of solution.

Awk

Was written for gawk in 1997 but should run on almost any awk dialect

Platform

Was written on Redhat Linux with multiple hardware platforms in mind

Uses

Was intended to be self-contained

Lines

658 lines, of which 39 are comments

DevelopmentEffort

One day, 6-8 hours total

MaintenanceEffort

Two revisions are available, mainly to permit programs to negotiate instead of humans, and to provide a web-based dashboard to monitor the events

CurrentStatus

2=Evaluation

Use

2=in-House use

Users

50 students in artificial intelligence project classes had to use some version of this code over three years

DateDeployed

October 1997

Dated

January 2008

References

There is a draft article (unpublished), and several talks, e.g.

The paper in Harper and Wheeler, Probability and Inference: Essays in Honour of Henry E. Kyburg Jr., College Publications, 23 April 2007 (ISBN-10: 1904987184, ISBN-13: 978-1904987185), also refers to the theory implemented here. Diana Moore's thesis on negotiation and a draft article (http://citeseer.ist.psu.edu/11983.html) contain some precursor ideas.

Url

http://www.cs.wustl.edu/~loui/313f97/anne4.expl.html


categories: Baseballsim,Awk100,Jan,2009,Ronl

Baseball sim

This is an Awk 100 program.

Purpose

A quick and dirty baseball simulator for investigating the efficiency of batting lineups

Source

See gawk/awk100/baseballsim.

Developers

Ronald P. Loui

Organization

Washington University in St. Louis

Country

USA

Domain

Research/Decision Support

Contact

Ronald P. Loui

Email

r.p.loui@gmail.com

Description

This was written for the AI course, and for several investigations, including the determination of whether it is a good idea to bat the pitcher in the 8th spot. One hypothesis that emerges from this program that deserves further study is that the most potent offense is one that spreads rather than concentrates the batting threats.

Awk

Gawk around 2002

Platform

Linux around 2002

Uses

None

Lines

409

DevelopmentEffort

Approximately one day

MaintenanceEffort

Further simulators were developed for improved domain modeling and for successive addition of functionality; no other code maintenance was required.

CurrentStatus

1=Prototype

Use

1=Personal use

Users

About 50 students used this program over three years in AI classes, and two undergraduate theses and one Master's thesis on evolutionary computing made use of this simulator.

DateDeployed

October 2002

Dated

January 2009

References

None, but see Tony LaRussa's comments on batting order while managing the St. Louis Cardinals


categories: Argcol,Awk100,Jan,2009,Ronl

Argcol

An Awk100 program.

Purpose

A tool inspired by fmt that could be used while working in vi to maintain a multi-column pro-con argument format.

Source code

See gawk/awk100/argcol.

Developers

Mark Foltz, Ronald Loui, Thieu Dang, Jeremy Frens

Organization

Washington University in St. Louis

Country

USA

Domain

Application/text support for text editor.

Contact

Ronald Loui

Email

r.p.loui@gmail.com

Awk

Gawk circa 1994, Solaris and MS-DOS-based awk such as mawk.

Platform

Solaris and MS-DOS

Uses

Vi and variants such as stevie.

Lines

278

DevelopmentEffort

One week.

MaintenanceEffort

No maintenance; eventually rewritten as a CGI/web program in the Room 5 project.

Current

4=No longer supported

Use

3=Free/public domain

Users

2

DateDeployed

May 1994

Dated

Jan 2009

References

"Progress on Room 5: a testbed for public interactive semi-formal legal argumentation," in Proceedings of the 6th International Conference on Artificial Intelligence and Law, Melbourne, Australia, 1997, pp. 207-214. ISBN 0-89791-924-6.
