About awk.info
Mar 01: Michael Sanders demos an X-windows GUI for AWK.
Mar 01: Awk100#24: A. Lahm and E. de Rinaldis' patent search, in AWK
Feb 28: Tim Menzies asks this community to write an AWK cookbook.
Feb 28: Arnold Robbins announces a new debugger for GAWK.
Feb 28: Awk100#23: Premysl Janouch offers an IRC bot, in AWK
Feb 28: Updated: the AWK FAQ
Feb 28: Tim Menzies offers a tiny content management system, in Awk.
Jan 31: Comment system added to awk.info. For example, see the discussion at the bottom of ?keys2awk
Jan 31: Martin Cohen shows that Gawk can handle massively long strings (300 million characters).
Jan 31: The AWK FAQ is being updated. For comments, corrections, or extensions, please mail tim@menzies.us
Jan 31: Martin Cohen finds Awk on the Android platform.
Jan 31: Aleksey Cheusov released a new version of runawk.
Jan 31: Hirofumi Saito contributes a candidate Awk mascot.
Jan 31: Michael Sanders shows how to quickly build an AWK GUI for windows.
Jan 31: Hyung-Hwan Chung offers QSE, an embeddable Awk Interpreter.
Awk is being used all around the world for real programming problems, but the news is not getting out.
We are aiming to create a database of at least one hundred Awk programs which will:
If you, or your colleagues or friends have written a program which has been used for purposes small or large, why not take five minutes to record the facts, so that others can see what you've done?
To contribute, fill in this template and mail it to mail@awk.info with the subject line Awk 100 contribution.
(Recent additions are shown first.)
(From Source Code Biol Med. 2007 Sep 6;2:4. by A. Lahm, E. de Rinaldis)
BACKGROUND: The number of patents associated with genes and proteins and the amount of information contained in each patent often present a real obstacle to the rapid evaluation of the novelty of findings associated to genes from an intellectual property (IP) perspective. This assessment, normally carried out by expert patent professionals, can therefore become cumbersome and time consuming. Here we present PatentMatrix, a novel software tool for the automated analysis of patent sequence text entries.
METHODS AND RESULTS: PatentMatrix is written in the Awk language and requires installation of the Derwent GENESEQ(TM) patent sequence database under the sequence retrieval system SRS. The software works by taking as input two files: i) a list of genes or proteins with the associated GENESEQ(TM) patent sequence accession numbers; ii) a list of keywords describing the research context of interest (e.g. 'lung', 'cancer', 'therapeutics', 'diagnostics'). The GENESEQ(TM) database is interrogated through the SRS system and each patent entry of interest is screened for the occurrence of user-defined keywords. Moreover, the software extracts the basic information useful for a preliminary assessment of the IP coverage of each patent from the GENESEQ(TM) database. As output, two tab-delimited files are generated which provide the user with a detailed and an aggregated view of the results. An example is given where the IP position of five genes is evaluated in the context of 'development of antibodies for cancer treatment'.
CONCLUSION: PatentMatrix allows a rapid survey of patents associated with genes or proteins in a particular area of interest as defined by keywords. It can be efficiently used to evaluate the IP-related novelty of scientific findings and to rank genes or proteins according to their IP position.
A modular IRC bot written in GNU AWK.
Premysl Janouch
Czech Republic
IRC Bot
See Developers.
p.janouch@gmail.com
GNU AWK
Bourne shell
POSIX-compatible
1000
Released (3)
Free/Public Domain (3)
N/A
2010
February 2010
http://vitamina.googlecode.com
Note: A regular release is planned in something like a month. I've typed Current: Released so you won't have to update the page.
Download from LAWKER
aaslg [ -x ] [ file ... ]
aaslr [ -x ] table [ file ... ]
Aaslg and aaslr implement the Amazing Awk Syntax Language, AASL (pronounced ``hassle''). Aaslg (pronounced ``hassling'') takes an AASL specification from the concatenation of the file(s) (default standard input) and emits the corresponding AASL table on standard output. Aaslr parses the contents of the file(s) (default standard input) according to the AASL table in file table, emitting the table's output on standard output.
Both take a -x option to turn on verbose and cryptic debugging output. Both look in a library directory for pieces of the AASL system; the AASLDIR environment variable, if present, overrides the default notion of the location of this directory.
Aaslr expects input to consist of input tokens, one per line. For simple tokens, the line is just the text of the token. For metatokens like ``identifier'', the line is the metatoken's name, a tab, and the text of the token. [xxx discuss `#' lines]
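The token-per-line format described above can be illustrated with a small first-pass scanner. This is only a sketch, not the lexer shipped with AASL: the function name tokenize, the choice of ``id'' and ``string'' as metatoken names, and the rule that any other character is a simple token are assumptions made for the example. It uses match(), so it needs a "new" (POSIX) awk, and it does not handle backslash escapes inside strings or `#' comments.

```shell
# tokenize: hypothetical first-pass scanner. Reads free-form text on
# standard input and writes one token per line: simple tokens as their
# own text, metatokens as name, tab, text. Ends with an "EOF" endmarker.
tokenize() {
    awk '
    {
        line = $0
        while (line != "") {
            if (match(line, /^[ \t]+/)) {
                n = RLENGTH                     # white space: skip it
            } else if (match(line, /^[A-Za-z_][A-Za-z0-9_]*/)) {
                n = RLENGTH                     # metatoken "id"
                print "id\t" substr(line, 1, n)
            } else if (match(line, /^"[^"]*"/)) {
                n = RLENGTH                     # metatoken "string"
                print "string\t" substr(line, 1, n)
            } else {
                n = 1                           # any other character is a simple token
                print substr(line, 1, 1)
            }
            line = substr(line, n + 1)
        }
    }
    END { print "EOF" }
    '
}
```

For instance, piping the line `x: "a" ;` through tokenize yields five lines: an id line for x, a colon, a string line for "a", a semicolon, and EOF.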
Aaslr output, in the absence of syntax errors, consists of the input tokens plus action tokens, which are lines consisting of `#!' followed immediately by an identifier. If the syntax of the input does not match that specified in the AASL table, aaslr emits complaint(s) on standard error and attempts to repair the input into a legal form; see ``ERROR REPAIR'' below. Unless errors have cascaded to the point where aaslr gives up (in which case it emits the action token ``#!aargh'' to inform later passes of this), the output will always conform to the AASL syntax given in the table.
Normally, a complete program using AASL consists of three passes, the middle one being an invocation of aaslr. The first pass is a lexical analyzer, which breaks free-form input down into input tokens in some suitable way. The third pass is a semantics interpreter, which typically responds to input tokens by momentarily remembering them and to action tokens by executing some action, often using the remembered value of the previous input token. Aaslg is in fact implemented using AASL, following this structure; it implements the -x option by just passing it to aaslr.
An AASL specification consists of class definitions, text definitions, and rules, in arbitrary order (except that class definitions must precede use of the classes they define). A `#' (not enclosed in a string) begins a comment; characters from it to the end of the line are ignored. An identifier follows the same rules as a C identifier, except that in most contexts it can be at most 16 characters long. A string is enclosed in double quotes ("") and generally follows C syntax. Most strings denote input tokens, and references to ``input token'' as part of AASL specification syntax should be read as ``string denoting input token''.
A class definition is an identifier enclosed in angle brackets (<>) followed by one or more input tokens followed by a semicolon (;). It gives a name to a set of input tokens. Classes whose names start with capital letters are user abbreviations; see below. Classes whose names start with lowercase letters are special classes, used for internal purposes. The current special classes are:
For example, the class definitions used for AASL itself are:
<trivial> "," ";" ;
<lineterm> ";" ;
<endmarker> "EOF" ;
When AASL error repair is invoked, the parser sometimes needs to generate input tokens. In the case of a metatoken, the parser knows the token's name but needs to generate a text for it as well. A text definition consists of an input token, an arrow (->), and a string specifying what text should be generated for that token. For example, the text definitions used for AASL itself are:
"id" -> "___"
"string" -> "\"___\""
The rules of a specification define the syntax that the parser should accept. The order of rules is not significant, except that the first rule is considered to be the top level of the specification. The specification is executed by calling the first rule; when execution of that rule terminates, execution of the specification terminates. If the user wishes this to occur only at end of input, he should arrange for the lexical analyzer to produce an endmarker token (conventionally ``EOF'') at the end of the input, and should write the first rule to require that token at the end.
Note that an input token may be recognized considerably before it is accepted, but the parser emits it to the output only on acceptance.
A rule consists of an identifier naming it, a colon (:), a sequence of items which is the body of the rule, and a semicolon (;). When a rule is called, it is executed by executing the individual items of the body in order (as modified by control structures) until either one of them explicitly terminates execution of the rule or the last item is executed.
An item which is an input token requires that that token appear in the input at that point, and accepts it (causing it to be emitted as output).
An item which is an identifier denotes a call to another rule, which executes the body of that rule and then returns to the caller. It is an error to call a nonexistent rule.
An item which is an identifier preceded by `!' causes that identifier to be emitted as an action token; the identifier has no other significance.
An item which is `<<' causes execution of the current rule to terminate immediately, returning to the calling rule.
An item which is `>>' causes the execution of the innermost enclosing loop (see below) to terminate immediately, with execution continuing after the end of that loop. The loop must be within the same rule.
An item which is an identifier preceded by `@%&!' causes an internal semantic action to be executed within the parser; this is normally needed only for bizarre situations like C's typedef. [xxx should give details I suppose]
A choice is a sequence of branches enclosed in parentheses (()) and separated by vertical bars (|). The first branch that can be executed is executed, after which execution continues after the end of the choice.
A loop is a sequence of branches enclosed in braces ({}) and separated by vertical bars (|). The first branch that can be executed is executed, and this is done repeatedly until the loop is terminated by `>>', after which execution continues after the end of the loop. (A loop can also be terminated by `<<' terminating execution of the whole rule.)
A branch is just a sequence of items, like a rule body, except that it must begin with either an input token or a lookahead. If it begins with an input token, it can be executed only when that token is the next token in the input, and execution starts with acceptance of that token.
A lookahead specifies conditions for execution of a branch based on recognizing but not accepting input token(s). The simplest form is just an input token enclosed in brackets ([]), in which case execution of that branch is possible only when that token is the next token in the input. The brackets can also contain multiple input tokens separated by commas, in which case the parser looks for any of those tokens. If a user-abbreviation class name appears, either by itself or as an element of a comma-separated list, it stands for the list of tokens given in its definition.
If a lookahead's brackets contain only a `*', this is a default branch, executable regardless of the state of the input.
As a very special case, a lookahead's brackets can contain two input tokens separated by slash (/), in which case that branch is executable only when those two tokens, in sequence, are next in the input. Warning: this is implemented by a delicate perversion of the error-repair machinery, and if the first of those tokens is not then accepted, the parser will die in convulsions. A further restriction is that the same input token may not appear as the first token of a double lookahead and as a normal lookahead token in the same choice/loop.
Certain simple choice/loop structures appear frequently, and there are abbreviations for them:
abbreviation            expansion
( items ?)              ( items | [*] )
{ items ?}              { items | [*] >> }
( ! [look] items ?)     ( [look] | items )
{ ! [look] items ?}     { [look] >> | items }
For example, here are the rules of the AASL specification for AASL, minus the actions (which add considerable clutter and are unintelligible without the third pass):
rules: {
    "id" ":" contents ";"
  | "<" "id" ">" {"string" ?} ";"
  | "string" "->" "string"
  | "EOF" >>
};
contents: {
    ">>"
  | "<<"
  | "id"
  | "!" "id"
  | "@%&!" "id"
  | "string"
  | "(" branches ")"
  | "{" branches "}"
  | [*] >>
};
branches: (
    "!" "[" look "]" contents "?"
  | [*] branch (
        ["|"] {"|" branch ?}
      | "?" !endbranch
      | [*]
    )
);
branch: (
    "string" contents
  | "[" look "]" contents
);
look: (
    ["string"/"/"] "string" "/" "string"
  | "*"
  | [*] looker {"," looker ?}
);
looker: ( "string" | "id" ) ;
When the input token is not one of those desired, either because the item being executed is an input token and a different token appears on the input, or because none of the branches of a choice/loop is executable, error repair is invoked to try to fix things up. Sometimes it can actually guess right and fix the error, but more frequently it merely supplies a legal output so that later passes will not be thrown into chaos by a minor syntax error.
The general error-repair strategy of an AASL parser is to give the parser what it wants and then attempt to resynchronize the input with the parser.
[xxx long discussion of how ``what it wants'' is determined when there are multiple possibilities]
Resynchronization is performed in three stages. The first stage attempts to resynchronize within a logical line, and is applied only if neither the input token nor the desired token is a line terminator (a member of the ``lineterm'' class). If the input token is trivial (a member of the ``trivial'' class), it is discarded. Otherwise it is retained, in hopes that it will be the next token that the parser asks for.
Either way, an error message is produced, indicating what was desired, what was seen, and what was handed to the parser. If too many of these messages have been produced for a single line, the parser gives up, produces a last despairing message, emits a ``#!aargh'' action token to alert later passes, and exits. Barring this disaster, parsing then continues. If the parser at some point is willing to accept the input token, it is accepted and error repair terminates. If a line terminator is seen in input, or the parser requests one, before the parser is willing to accept the input token, the second phase begins.
The second stage of resynchronization attempts to line both input and parser up on a line terminator. If the desired token is a line terminator and the input token is not, input is discarded until a line terminator appears. If the desired token is not a line terminator and the input token is, the input token is retained and parsing continues until the parser asks for a line terminator. Either way, the third phase then begins.
The third stage of resynchronization attempts to reconcile line terminators. If the desired and input tokens are identical, the input token is accepted and error repair terminates. If they are not identical and the input token is trivial (yes, line terminators can be trivial, and ones like `;' probably should be), the input token is discarded. If the desired token is the endmarker, then the input token is discarded. Otherwise, the input token continues to be retained in hopes that it will eventually be accepted. [xxx this needs more thought] In any case, the second phase begins again.
all in $AASLDIR:
interp    table interpreter
lex       first pass of aaslg
syn       AASL table for aaslg
sem       third pass of aaslg
awk(1), yacc(1)
``error-repair disaster'' means that the first token of a double lookahead could not be accepted and error repair was invoked on it.
Written at University of Toronto by Henry Spencer, somewhat in the spirit of S/SL (see ACM TOPLAS April 1982).
Some of the restrictions on double lookahead are annoying.
Most of the C string escapes are recognized but disregarded, with only a backslashed double-quote interpreted properly during text generation.
Error repair needs further tuning; it has an annoying tendency to infinite-loop in certain odd situations (although the messages/line limit eventually breaks the loop).
Complex choices/loops with many branches can result in very long lines in the table.
The implementation of AASL was fairly straightforward, with AASL itself used to describe its own syntax. An AASL specification is compiled into a table, which is then processed by a table-walking interpreter. The interpreter expects its input as tokens, one per line, much like the output of a traditional scanner. A complete program using AASL (for example, the AASL table generator) is normally three passes: the scanner, the parser (tables plus interpreter), and a semantics pass. The first set of tables was generated by hand for bootstrapping.
Apart from the minor nuisance of repeated iterations of language design, the biggest problem of implementing AASL was the question of semantic actions. Inserting awk semantic routines into the table interpreter, in the style of yacc, would not be impossible, but it seemed clumsy and inelegant. Awk's lack of any provision for compile-time initialization of tables strongly suggested reading them in at run time, rather than taking up space with a huge BEGIN action whose only purpose was to initialize the tables. This makes insertions into the interpreter's code awkward.
The problem was solved by a crucial observation: traditional compilers (etc.) merge a two-step process, first validating a token stream and inserting semantic action "cookies" into it, then interpreting the stream and the cookies to interface to semantics. For example, yacc's grammar notation can be viewed as inserting fragments of C code into a parsed output, and then interpreting that output. This approach yields an extremely natural pass structure for an AASL parser, with the parser's output stream being (in the absence of syntax errors) a copy of its input stream with annotations. The following semantic pass then processes this, momentarily remembering normal tokens and interpreting annotations as operations on the remembered values. (The semantic pass is, in fact, a classic pattern+action awk program, with a pattern and an action for each annotation, and a general "save the value in a variable" action for normal tokens.)
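A semantic pass of the kind described above might look like the following sketch. It is not taken from the AASL sources: the function name semantics, the metatoken name ``num'', and the annotations #!push, #!add and #!print are invented for illustration. Only the `#!' prefix and the ``#!aargh'' give-up token come from the aaslr documentation.

```shell
# semantics: hypothetical third pass. Reads the annotated token stream
# (input tokens plus "#!name" action tokens) on standard input.
semantics() {
    awk '
    BEGIN { FS = "\t" }
    # One pattern and one action per annotation.
    /^#!push$/  { stack[++sp] = last; next }
    /^#!add$/   { stack[sp - 1] += stack[sp]; sp--; next }
    /^#!print$/ { print stack[sp--]; next }
    /^#!aargh$/ { exit 1 }          # the parser gave up; stop here
    # General action for normal tokens: remember the text of the token.
    # Metatoken lines are name, tab, text; simple tokens are just text.
    { last = (NF > 1) ? $2 : $1 }
    '
}
```

Fed the stream num 3, #!push, +, num 4, #!push, #!add, #!print (one token per line, with a tab between num and its value), this sketch prints 7.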
The one difficulty that arises with this method is when the language definition involves feedback loops between semantics and parsing, an obvious example being C's typedef. Dealing with this really does require some embedding of semantics into the interpreter, although with care it need not be much: the in-parser code for recognizing C typedefs, including the complications introduced by block structure and nested redeclarations of type names, is about 40 lines of awk. The in-parser actions are invoked by a special variant of the AASL "emit semantic annotation" syntax.
A side benefit of top-down parsing is that the context of errors is known, and it is relatively easy to implement automatic error recovery. When the interpreter is faced with an input token that does not appear in the list of possibilities in the parser table, it gives the parser one of the possibilities anyway, and then uses simple heuristics to try to adjust the input to resynchronize. The result is that the parser, and subsequent passes, always see a syntactically correct program. (This approach is borrowed from S/SL and its predecessors.) Although the detailed error-recovery algorithm is still experimental, and the current one is not entirely satisfactory when a complex AASL specification does certain things, in general it deals with minor syntax errors simply and cleanly without any need for complicating the specification with details of error recovery. Knowing the context of errors also makes it much easier to generate intelligible error messages automatically.
The AASL implementation is not large. The scanner is 78 lines of awk, the parser is 61 lines of AASL (using a fairly low-density paragraphing style and a good many comments), and the semantics pass is 290 lines of awk. The table interpreter is 340 lines, about half of which (and most of the complexity) can be attributed to the automatic error recovery.
As an experiment with a more ambitious AASL specification, one for ANSI C was written. This occupies 374 lines excluding comments and blank lines, and with the exception of the messy details of C declarators is mostly a fairly straightforward transcription of the syntax given in the ANSI standard. Generating tables for this takes about three minutes of CPU time on a Sun 3/180; the tables are about 10K bytes.
The performance of the resulting ANSI C parser is not impressive: in very round numbers, averaged over a large program, it parses about one line of C per CPU second. (The scanner, 164 lines of awk, accounts for a negligible fraction of this.) Some attention to optimization of both the tables and the interpreter might speed this up somewhat, but remarkable improvements are unlikely. As things stand in the absence of better awk implementations or a rewrite of the table interpreter in C, it's a cute toy, possibly of some pedagogical value, but not a useful production tool. On the other hand, there does not appear to be any fundamental reason for the performance shortfall: it's purely the result of the slow execution of awk programs.
The scanner would be much faster with better regular-expression matching, because it can use regular expressions to determine whether a string is a plausible token but must use substr to extract the string first. Nawk functions would be very handy for modularizing code, especially the complicated and seldom-invoked error-recovery procedure. A switch statement modelled on the pattern+action scheme would be useful in several places.
Another troublesome issue is that arrays are second-class citizens in awk (and continue to be so in nawk): there is no array assignment. This lack leads to endless repetitions of code like:
for (i in array)
    arraystack[i ":" sp] = array[i]
whenever block structuring or a stack is desired. Nawk's multi-dimensional arrays supply some syntactic sugar for this but don't really fix the problem. Not only is this code clumsy, it is woefully inefficient compared to something like
arraystack[sp] = array
even if the implementation is very clever. This significantly reduces the usefulness of arrays as symbol tables and the like, a role for which they are otherwise very well suited.
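The copy loop quoted above can be wrapped into a pair of push/pop helpers. The sketch below is an illustration of that idiom, not code from AASL: the function names push, pop and scope_demo are invented, and it assumes an awk with user-defined functions (nawk or later). It makes the cost visible: pop has to scan every saved key to find those belonging to the current level.

```shell
# scope_demo: hypothetical demonstration of saving and restoring a whole
# array through keys of the form index ":" level.
scope_demo() {
    awk '
    # Snapshot array[] at a new stack level (the loop from the text).
    function push(   i) {
        sp++
        for (i in array)
            arraystack[i ":" sp] = array[i]
    }
    # Throw away the current contents of array[] and restore the
    # snapshot taken at level sp, removing it from arraystack[].
    function pop(   i, k, n, suffix) {
        for (i in array)
            delete array[i]
        suffix = ":" sp
        for (k in arraystack) {
            n = length(k) - length(suffix)
            if (n >= 0 && substr(k, n + 1) == suffix) {
                array[substr(k, 1, n)] = arraystack[k]
                delete arraystack[k]
            }
        }
        sp--
    }
    BEGIN {
        array["x"] = "outer"
        push()                      # enter an inner block
        array["x"] = "inner"
        array["y"] = "temp"
        pop()                       # leave it again
        state = ("y" in array) ? "y kept" : "y gone"
        print array["x"], state
    }
    '
}
```

Calling scope_demo prints "outer y gone": the inner assignment to x and the inner-only element y both disappear when the block is popped.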
It would also be of some use if there were some way to initialize arrays as constant tables, or alternatively a guarantee that the BEGIN action would be implemented cleverly and would not occupy space after it had finished executing.
A minor nuisance that surfaces constantly is that getting an error message out to the standard-error descriptor is painfully clumsy: one gets to choose between putting error messages out to a temporary file and having a shell "wrapper" process them later, or piping them into "cat >&2" (!).
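The "cat >&2" workaround mentioned above can be sketched as follows. The function names warn and warn_demo are invented for the example, and the rule that empty lines trigger a warning is only there to give it something to complain about. Later implementations such as gawk also accept print > "/dev/stderr", which avoids the extra process.

```shell
# warn_demo: hypothetical filter that copies non-empty lines to standard
# output and reports empty ones on standard error.
warn_demo() {
    awk '
    function warn(msg) {
        # The shell command sends its standard output to standard error,
        # so whatever awk pipes into it ends up there.
        print "warning: " msg | "cat 1>&2"
    }
    {
        if (NF == 0)
            warn("empty line " NR)
        else
            print
    }
    END { close("cat 1>&2") }       # flush the pipe and wait for cat to finish
    '
}
```

Because the messages travel through a separate process, their ordering relative to standard output is not guaranteed when both go to the same terminal.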
The multi-pass input-driven structure that awk naturally lends itself to produces very clean and readable code with different phases neatly separated, but creates substantial difficulties when feedback loops appear. (In the case of AASL, this perhaps says more about language design than about awk.)
Henry Spencer.
Download from LAWKER.
"aaa" (the Amazing Awk Assembler) is a primitive assembler written entirely in awk and sed. It was done for fun, to establish whether it was possible. It is; it works. It's quite slow, the input syntax is eccentric and rather restricted, and error-checking is virtually nonexistent, but it does work. Furthermore it's very easy to adapt to a new machine, provided the machine falls into the generic "8-bit-micro" category. It is supplied "as is", with no guarantees of any kind. I can't be bothered to do any more work on it right now, but even in its imperfect state it may be useful to someone.
aaa is the mainline shell file.
aux is a subdirectory with machine-independent stuff. Anon, 6801, and 6809 are subdirectories with machine-dependent stuff, choice specified by a -m option (default is "anon"). Actually, even the stuff that is supposedly machine-independent does have some machine-dependent assumptions; notably, it knows that bytes are 8 bits (not serious) and that the byte is the basic unit of instructions (more serious). These would have to change for the 68000 (going to 16-bit "bytes" might be sufficient) and maybe for the 32016 (harder).
aaa thinks that the machine subdirectories and the aux subdirectory are in the current directory, which is almost certainly wrong.
abst is an abstract for a paper. "card", in each machine directory, is a summary card for the slightly-eccentric input language. There is no real manual at present; sorry.
try.s is a sample piece of 6809 input; it is semantic trash, purely for test purposes. The assembler produces try.a, try.defs, and try.x as outputs from "aaa try.s". try.a is an internal file that looks somewhat like an assembly listing. try.defs is another internal file that looks somewhat like a symbol table. These files are preserved because of possible usefulness; tmp[123] are non-preserved temporaries. try.x is the Intel-hex output. try.x.good is identical to try.x and is a saved copy for regression testing of new work.
01pgm.s is a self-programming program for a 68701, based on the one in the Motorola ap note. 01pgm.x.good is another regression-test file.
If your C library (used by awk) has broken "%02x" so it no longer means "two digits of hex, *zero-filled*" (as some SysV libraries have), you will have to fall back from aux/hex to aux/hex.argh, which does it the hard way. Oh yes, you'll note that aaa feeds settings into awk on the command line; don't assume your awk won't do this until you try it.
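Doing the hex conversion "the hard way" might look like the sketch below. It is not a copy of aux/hex.argh, whose contents are not shown here: the names hexbytes, hex2 and upper are invented, and the input is assumed to be one decimal value in the range 0 to 255 per line. The upper setting is passed as a variable assignment on the awk command line, the mechanism the note above refers to; such assignments take effect before the input that follows them is read, not in BEGIN.

```shell
# hexbytes: hypothetical helper. Reads decimal byte values, one per line,
# and prints each as two zero-filled hex digits without using "%02x".
# An optional first argument of 1 selects upper-case digits.
hexbytes() {
    awk '
    function hex2(n,   digits) {
        digits = (upper == 1) ? "0123456789ABCDEF" : "0123456789abcdef"
        n = int(n) % 256
        # High nibble, then low nibble, each looked up in the digit string.
        return substr(digits, int(n / 16) + 1, 1) substr(digits, n % 16 + 1, 1)
    }
    { print hex2($1) }
    ' upper="${1:-0}" -
}
```

For example, `echo 10 | hexbytes` prints 0a and `echo 171 | hexbytes 1` prints AB.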
Henry Spencer
gawk -f awkpp file-name-of-awk++-program
This command is platform independent and sends the translated program to standard output (stdout). See Running awk++ for variations.
This is an updated revision (#21), released August 1, 2009. In this new version:
Download awkpp21.zip from LAWKER
Awk++ is a preprocessor; that is, it reads in a program written in the awk++ language and outputs a new program. However, it differs from awka: the output from the awk++ preprocessor is awk code, not C or an executable program. So, some version of AWK, such as awk or gawk, has to be used to run the preprocessed program. awka can be used, in a second step, to turn the preprocessed awk++ program into an executable, if desired.
The awk++ language provides object oriented programming for AWK that includes:
Awk++ adds new keywords to standard Awk:
a = class1.new[(optional parameters)]   *** similar to Ruby
b = a.get("aProperty")
a.delete
class class1 {
    property aProperty
    method new([optional parameters]) {
        # put initialization stuff here
    }
    method get(propName) {
        if (propName == "aProperty")
            return aProperty   ### Note the use of 'return'. It behaves
                               ### exactly the same as in an AWK function.
    }
}
To define a class (similar to C++ but no public/private):
class class_name {.....}
To define a class with inheritance:
class class_name : inherited_class_name [ : inherited_class_name...] {.....}
To add local/private variables (persistent variables; syntax is unique to awk++):
class class_name {
attribute|attr|property|prop|element|elem|variable|var variable_name
..... }
To help programmers who are used to other OO languages, "attribute", "property", "element", and "variable", along with their 4-letter abbreviations, are interchangeable.
Note: these persistent variables cannot be accessed directly. The programmer must define method(s) to return them, if their values are to be made available to code that's outside the class.
To add methods
class class_name {
attribute variable_name1
method method_name(parameters) {
...any awk code....
}
..other method definitions...
}
To create an object
object_variable = class_name.new[(optional parameters)]
(runs the method named "new", if it exists; returns the object ID)
To call an object method
object_variable.method_name(parameters)
The dot isn't used for concatenation in awk/gawk, so it's a natural choice for the separator between the object and method.
To reclaim the memory used by an object, use the delete method, i.e.:
object_variable.delete
but don't define delete() in your classes. awk++ recognizes delete() as a special method and will take care of deleting the object. Deleting objects is only necessary, though, if they hold a lot of data. Overhead for objects themselves is insignificant.
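Since the preprocessor emits ordinary awk, object state has to live in ordinary awk data. The sketch below is NOT awk++ output and makes no claim about how awk++ actually translates classes. It is a hand-written illustration, with invented names (prop, point_new, point_get, point_delete), of one way an object ID, persistent properties, an accessor method and deletion can be expressed in plain awk.

```shell
# objects_demo: hypothetical hand-coded equivalent of new / get / delete,
# using one associative array keyed by (object ID, property name).
objects_demo() {
    awk '
    # Constructor: allocate a fresh ID and store the properties under it.
    function point_new(x, y,   id) {
        id = ++next_id
        prop[id, "x"] = x
        prop[id, "y"] = y
        return id
    }
    # Accessor: properties are reached only through a function.
    function point_get(id, name) {
        return prop[id, name]
    }
    # Reclaim every property stored for one object.
    function point_delete(id,   k, parts) {
        for (k in prop) {
            split(k, parts, SUBSEP)
            if (parts[1] == id)
                delete prop[k]
        }
    }
    BEGIN {
        a = point_new(3, 4)
        b = point_new(5, 6)
        print point_get(a, "x"), point_get(b, "y")
        point_delete(a)
        state = ((a, "x") in prop) ? "a still stored" : "a reclaimed"
        print state
    }
    '
}
```

Running objects_demo prints "3 6" followed by "a reclaimed"; object b is untouched by the deletion of a.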
OO syntax goals:
The OO syntax is based partly on C++, partly on Javascript, partly on Ruby and partly on the book "The Object-Oriented Thought Process". It isn't lifted in toto from one language because other languages provide features that gawk can't accomplish or have syntax that is hard to parse.
In awk++, if a method is called that isn't in the object's class and there are inherited classes (superclasses) specified, the inherited classes are called in left to right order until one of them returns a value. That value becomes the result of the method call. This is the way awk++ resolves the diamond problem. As a programmer, you control the sequence in which superclasses are called by the left to right order of the list of inherited classes in the class definition.
There are two important things to note.
Calls to undefined methods do nothing and return nothing, silently.
The command to preprocess an awk++ program looks like this:
gawk -f awkpp file-name-of-awk++-program
or, if the "she-bang" line (line 1 in awkpp) has the right path to gawk, and awkpp is executable and in a directory in PATH,
awkpp file-name-of-awk++-program
To run the output program immediately,
gawk -f awkpp -r file-name-of-awk++-program [awk options] data-files-to-be-processed
or
awkpp -r file-name-of-awk++-program [awk options] data-files-to-be-processed
When running an awk++ program immediately, standard input (stdin) cannot be used for data. One or more data file paths must be listed on the command line.
There is a bug in the standard AWK distributions that affects the preprocessor. Additionally, the preprocessor uses the 3rd array option of the match() function. So, it's best to use GAWK to run the preprocessor.
On the other hand, the AWK code created by translating awk++ is intended to work with all versions of AWK. If you find otherwise, please notify the developer(s).
Copyright (c) 2008, 2009 Jim Hart, jhart@mail.avcnet.org All rights reserved. The awk++ code is licensed under the GNU Public license (GPL) any version. awk++ documentation, including this page, may be copied only in unmodified form, subject to fair use guidelines.
ooc is an awk program which reads class descriptions and performs the routine coding tasks necessary to do object-oriented coding in ANSI C.
The tool is exceptionally well documented in Object oriented programming with ANSI-C.
Download a 2002 copy of this code from LAWKER.
Or go to the author's web site.
ooc is a technique to do object-oriented programming (classes, methods, dynamic linkage, simple inheritance, polymorphisms, persistent objects, method existence testing, message forwarding, exception handling, etc.) using ANSI-C.
ooc is a preprocessor to simplify the coding task by converting class descriptions and method implementations into ANSI-C as required by the technique. You implement the algorithms inside the methods and the ooc preprocessor produces the boilerplate.
ooc consists of a shell script driving a modular awk script (with provisions for debugging), a set of reports -- code generation templates -- interpreted by the script, and the source of a root class to provide basic functionality. Everything is designed to be changed if desired. There are manual pages, lots of examples, among them a calculator based on curses and X11, and you can ask me about the book.
ooc as a technique requires an ANSI-C system -- classic C would necessitate substantial changes. The preprocessor needs a healthy Bourne-Shell and "new" awk as described in Aho, Weinberger, and Kernighan's book.
ooc was developed primarily to teach about object-oriented programming without having to learn a new language. If you see how it is done in a familiar setting, it is much easier to grasp the concepts and to know what miracles to expect from the technique and what not. Conceivably, the preprocessor can be used for production programming but this was not the original intent. Being able to roll your own object-oriented coding techniques has its possibilities, however...
Most sources should be viewed with tab stops set at 4 characters.
The original system ran on NeXTSTEP 3.2 and older, ESIX (System V) 4.0.4, and Linux 0.99.pl4-49. This rerelease was tested on MacOS X version 10.1.2 and Solaris version 5.8. You need to review paths in the script 'ooc/ooc' before running anything. Make sure the first line of this script points to a Bourne-style shell. Also make sure that the first line of '09/munch' points to a (new) awk.
The rereleased 'ooc' awk-programs have been tested with GNU awk versions 3.0.1 and 3.0.3. Previous versions did not support AWKPATH properly (but this is not essential).
The makefiles could be smarter but they are naive enough for all systems. This is a heterogeneous system -- set the environment variable $OSTYPE to an architecture-specific name. 'make' in the current directory will create everything by calling 'make' in the various subdirectories. Each 'makefile' includes 'make/Makefile.$OSTYPE'; review your 'make/Makefile.$OSTYPE' before you start.
The following make calls are supported throughout:
make [all]     create examples
make test      [make and] run examples
make clean     remove all but sources
make depend    make dependencies (if makefile.$OSTYPE supports it)
Make dependencies can be built with the -MM option of the GNU C compiler. They are stored in a file 'depend' in each subdirectory. They should apply to all systems. 'makefile.$OSTYPE' may include a target 'depend' to recreate 'depend' -- check 'makefile.darwin1.4' for an example.
The following is a walk through the file hierarchy in the order of the book:
Copyright (c) 1993
While you may use this software package, neither I nor my employers can be made responsible for whatever problems you might cause or encounter.
While you may give away this package and/or software derived with it, you should not charge for it, you should not claim that ooc is your work, and I have published my own book about ooc before you did.
The same restrictions apply to whoever might get this package from you.
plaiter [options] [file, playlist, directory or stream ...]
Download from LAWKER or, for the latest version, from SourceForge
Plaiter (pronounced "player") is a command line front end to command line music players. It uses shell scripting to try to create the command line music player that Plait would have used if it already existed. It complements Plait but is also quite useful on its own, especially if you already use mpg123 or similar programs and find yourself wanting more features.
What does Plaiter do that (say) mpg123 can't already? It queues tracks, first of all. Secondly, it understands commands like play, pause, stop, next and prev. Finally, unlike most of the command line music players out there, Plaiter can handle a play list with more than one type of audio file, selecting the proper helper app to handle each type of file you throw at it.
Plaiter will automatically configure itself to use ogg123, mpg123, and/or mpg321, if they are installed on your system. If you have a helper application that plays other types of audio, Plaiter can be configured to use it as well.
Like many of us, Plaiter is part daemon and part controller. The controller builds a play list from the files you provide on the command line and forwards commands to the daemon. The daemon reads commands and executes them by running helper applications.
Copyright (C) 2005, 2006 by Stephen Jungels. Released under the GPL.
Written by Stephen Jungels (sjungels@gmail.com)
awk -f m1.awk [file...]
Download from LAWKER.
M1 is a simple macro language that supports the essential operations of defining strings and replacing strings in text by their definitions. It also provides facilities for file inclusion and for conditional expansion of text. It is not designed for any particular application, so it is mildly useful across several applications, including document preparation and programming. This paper describes the evolution of the program; the final version is implemented in about 110 lines of Awk.
M1 copies its input file(s) to its output unchanged except as modified by certain "macro expressions." The following lines define macros for subsequent processing:
@comment Any text
@@                     same as @comment
@define name value
@default name value    set if name undefined
@include filename
@if varname            include subsequent text if varname != 0
@unless varname        include subsequent text if varname == 0
@fi                    terminate @if or @unless
@ignore DELIM          ignore input until line that begins with DELIM
@stderr stuff          send diagnostics to standard error
A definition may extend across many lines by ending each line with a backslash, thus quoting the following newline.
Any occurrence of @name@ in the input is replaced in the output by the corresponding value.
@name at beginning of line is treated the same as @name@.
We'll start with a toy example that illustrates some simple uses of m1. Here's a form letter that I've often been tempted to use:
@default MYNAME Jon Bentley
@default TASK respond to your special offer
@default EXCUSE the dog ate my homework
Dear @NAME@:
Although I would dearly love to @TASK@,
I am afraid that I am unable to do so because @EXCUSE@.
I am sure that you have been in this situation
many times yourself.
Sincerely,
@MYNAME@
If that file is named sayno.mac, it might be invoked with this text:
@define NAME Mr. Smith
@define TASK subscribe to your magazine
@define EXCUSE I suddenly forgot how to read
Recall that a @default takes effect only if its variable was not previously @defined.
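The interaction between @define and @default can be sketched in a few lines of Awk. This is only an illustration of the rule just stated, not the m1 code itself: the helper names define_macro() and default_macro() are invented for this example, although the symtab array mirrors the one used in the m1 source.

```awk
# Sketch of @define / @default semantics.
# define_macro() and default_macro() are hypothetical helpers, not part of m1.
function define_macro(name, value) {
    symtab[name] = value                 # @define always overwrites
}
function default_macro(name, value) {
    if (!(name in symtab))               # @default applies only if unset
        symtab[name] = value
}
BEGIN {
    define_macro("TASK", "subscribe to your magazine")
    default_macro("TASK", "respond to your special offer")   # ignored
    default_macro("EXCUSE", "the dog ate my homework")       # applied
    print symtab["TASK"]
    print symtab["EXCUSE"]
}
```

Because the definitions are processed before the defaults, TASK keeps its defined value while EXCUSE falls back to its default.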
I've found m1 to be a handy Troff preprocessor. Many of my text files (including this one) start with m1 definitions like:
@define ArrayFig @StructureSec@.2
@define HashTabFig @StructureSec@.3
@define TreeFig @StructureSec@.4
@define ProblemSize 100
Even a simple form of arithmetic would be useful in numeric sequences of definitions. The longer m1 variables get around Troff's dreadful two-character limit on string names; these variables are also available to Troff preprocessors like Pic and Eqn. Various forms of the @define, @if, and @include facilities are present in some of the Troff-family languages (Pic and Troff) but not others (Tbl); m1 provides a consistent mechanism.
I include figures in documents with lines like this:
@define FIGNUM @FIGMFMOVIE@
@define FIGTITLE The Multiple Fragment heuristic.
@FIGSTART@
<PS> <@THISDIR@/mfmovie.pic</PS>
@FIGEND@
The two @defines are a hack to supply the two parameters of number and title to the figure. The figure might be set off by horizontal lines or enclosed in a box, the number and title might be printed at the top or the bottom, and the figures might be graphs, pictures, or animations of algorithms. All figures, though, are presented in the consistent format defined by FIGSTART and FIGEND.
I have also used m1 as a preprocessor for Awk programs. The @include statement allows one to build simple libraries of Awk functions (though some, but not all, Awk implementations provide this facility by allowing multiple program files). File inclusion was used in an earlier version of this paper to include individual functions in the text and then wrap them all together into the complete m1 program. The conditional statements allow one to customize a program with macros rather than run-time if statements, which can reduce both run time and compile time.
The most interesting application for which I've used this macro language is unfortunately too complicated to describe in detail. The job for which I wrote the original version of m1 was to control a set of experiments. The experiments were described in a language with a lexical structure that forced me to make substitutions inside text strings; that was the original reason that substitutions are bracketed by at-signs. The experiments are currently controlled by text files that contain descriptions in the experiment language, data extraction programs written in Awk, and graphical displays of data written in Grap; all the programs are tailored by m1 commands.
Most experiments are driven by short files that set a few key parameters and then @include a large file with many @defaults. Separate files describe the fields of shared databases:
@define N ($1)
@define NODES ($2)
@define CPU ($3)
...
These files are @included in both the experiment files and in Troff files that display data from the databases. I had tried to conduct a similar set of experiments before I built m1, and got mired in muck. The few hours I spent building the tool were paid back handsomely in the first days I used it.
M1 uses a fast substitution function. The idea is to process the string from left to right, searching for the first substitution to be made. We then make the substitution, and rescan the string starting at the fresh text. We implement this idea by keeping two strings: the text processed so far is in L (for Left), and unprocessed text is in R (for Right). Here is the pseudocode for dosubs:
L = Empty
R = Input String
while R contains an "@" sign do
    let R = A @ B; set L = L A and R = B
    if R contains no "@" then
        L = L "@"
        break
    let R = A @ B; set M = A and R = B
    if M is in SymTab then
        R = SymTab[M] R
    else
        L = L "@" M
        R = "@" R
return L R
There are many ways in which the m1 program could be extended. Here are some of the biggest temptations to "creeping creaturism":
The following code is short (around 100 lines), which is significantly shorter than other macro processors; see, for instance, Chapter 8 of Kernighan and Plauger [1981]. The program uses several techniques that can be applied in many Awk programs.
function error(s) {
    print "m1 error: " s | "cat 1>&2"; exit 1
}

function dofile(fname, savefile, savebuffer, newstring) {
    if (fname in activefiles)
        error("recursively reading file: " fname)
    activefiles[fname] = 1
    savefile = file; file = fname
    savebuffer = buffer; buffer = ""
    while (readline() != EOF) {
        if (index($0, "@") == 0) {
            print $0
        } else if (/^@define[ \t]/) {
            dodef()
        } else if (/^@default[ \t]/) {
            if (!($2 in symtab))
                dodef()
        } else if (/^@include[ \t]/) {
            if (NF != 2) error("bad include line")
            dofile(dosubs($2))
        } else if (/^@if[ \t]/) {
            if (NF != 2) error("bad if line")
            if (!($2 in symtab) || symtab[$2] == 0)
                gobble()
        } else if (/^@unless[ \t]/) {
            if (NF != 2) error("bad unless line")
            if (($2 in symtab) && symtab[$2] != 0)
                gobble()
        } else if (/^@fi([ \t]?|$)/) { # Could do error checking here
        } else if (/^@stderr[ \t]?/) {
            print substr($0, 9) | "cat 1>&2"
        } else if (/^@(comment|@)[ \t]?/) {
        } else if (/^@ignore[ \t]/) { # Dump input until $2
            delim = $2
            l = length(delim)
            while (readline() != EOF)
                if (substr($0, 1, l) == delim)
                    break
        } else {
            newstring = dosubs($0)
            if ($0 == newstring || index(newstring, "@") == 0)
                print newstring
            else
                buffer = newstring "\n" buffer
        }
    }
    close(fname)
    delete activefiles[fname]
    file = savefile
    buffer = savebuffer
}
Put next input line into global string "buffer". Return "EOF" or "" (null string).
function readline( i, status) {
    status = ""
    if (buffer != "") {
        i = index(buffer, "\n")
        $0 = substr(buffer, 1, i-1)
        buffer = substr(buffer, i+1)
    } else {
        # Hume: special case for non v10: if (file == "/dev/stdin")
        # Parenthesized so the redirection cannot be misread as a comparison
        if ((getline < file) <= 0)
            status = EOF
    }
    # Hack: allow @Mname at start of line w/o closing @
    if ($0 ~ /^@[A-Z][a-zA-Z0-9]*[ \t]*$/)
        sub(/[ \t]*$/, "@")
    return status
}
function gobble( ifdepth) {
    ifdepth = 1
    while (readline() != EOF) {
        if (/^@(if|unless)[ \t]/)
            ifdepth++
        if (/^@fi[ \t]?/ && --ifdepth <= 0)
            break
    }
}

function dosubs(s, l, r, i, m) {
    if (index(s, "@") == 0)
        return s
    l = "" # Left of current pos; ready for output
    r = s # Right of current; unexamined at this time
    while ((i = index(r, "@")) != 0) {
        l = l substr(r, 1, i-1)
        r = substr(r, i+1) # Currently scanning @
        i = index(r, "@")
        if (i == 0) {
            l = l "@"
            break
        }
        m = substr(r, 1, i-1)
        r = substr(r, i+1)
        if (m in symtab) {
            r = symtab[m] r
        } else {
            l = l "@" m
            r = "@" r
        }
    }
    return l r
}

function dodef(fname, str, x) {
    name = $2
    sub(/^[ \t]*[^ \t]+[ \t]+[^ \t]+[ \t]*/, "") # OLD BUG: last * was +
    str = $0
    while (str ~ /\\$/) {
        if (readline() == EOF)
            error("EOF inside definition")
        # OLD BUG: sub(/\\$/, "\n" $0, str)
        x = $0
        sub(/^[ \t]+/, "", x)
        str = substr(str, 1, length(str)-1) "\n" x
    }
    symtab[name] = str
}

BEGIN {
    EOF = "EOF"
    if (ARGC == 1)
        dofile("/dev/stdin")
    else if (ARGC >= 2) {
        for (i = 1; i < ARGC; i++)
            dofile(ARGV[i])
    } else
        error("usage: m1 [fname...]")
}
M1 is three steps lower than m4. You'll probably miss something you have learned to expect.
M1 was documented in the 1997 sedawk book by Dale Dougherty & Arnold Robbins (ISBN 1-56592-225-5) but may have been written earlier.
This page was adapted from 131.191.66.141:8181/UNIX_BS/sedawk/examples/ch13/m1.pdf (download from LAWKER).
Jon L. Bentley.
The amazingly workable (text) formatter
awf -macros [ file ] ...
Download from LAWKER. Type "make r" to run a regression test, formatting the manual page (awf.1) and comparing it to a preformatted copy (awf.1.out). Type "make install" to install it. Pathnames may need changing.
Awf formats the text from the input file(s) (standard input if none) in an imitation of nroff's style with the -man or -ms macro packages. The -macro option is mandatory and must be `-man' or `-ms'.
Awf is slow and has many restrictions, but does a decent job on most manual pages and simple -ms documents, and isn't subject to AT&T's brain-damaged licensing that denies many System V users any text formatter at all. It is also a text formatter that is simple enough to be tinkered with, for people who want to experiment.
Awf implements the following raw nroff requests:
.\"  .ce  .fi  .in  .ne  .pl  .sp
.ad  .de  .ft  .it  .nf  .po  .ta
.bp  .ds  .ie  .ll  .nr  .ps  .ti
.br  .el  .if  .na  .ns  .rs  .tm
and the following in-text codes:
\$ \% \* \c \f \n \s
plus the full list of nroff/troff special characters in the original V7 troff manual.
Many restrictions are present; the behavior in general is a subset of nroff's. Of particular note are the following:
White space at the beginning of lines, and imbedded white space within lines, is dealt with properly. Sentence terminators at ends of lines are understood to imply extra space afterward in filled lines. Tabs are implemented crudely and not quite correctly, although in most cases they work as expected. Hyphenation is done only at explicit hyphens, emdashes, and nroff discretionary hyphens.
The -man macro set implements the full V7 manual macros, plus a few semi-random oddballs. The full list is:
.B   .DT  .IP  .P   .RE  .SM
.BI  .HP  .IR  .PD  .RI  .TH
.BR  .I   .LP  .PP  .RS  .TP
.BY  .IB  .NB  .RB  .SH  .UC
.BY and .NB each take a single string argument (respectively, an indication of authorship and a note about the status of the manual page) and arrange to place it in the page footer.
The -ms macro set is a substantial subset of the V7 manuscript macros. The implemented macros are:
.AB  .CD  .ID  .ND  .QP  .RS  .UL
.AE  .DA  .IP  .NH  .QS  .SH  .UX
.AI  .DE  .LD  .NL  .R   .SM
.AU  .DS  .LG  .PP  .RE  .TL
.B   .I   .LP  .QE  .RP  .TP
Size changes are recognized but ignored, as are .RP and .ND. .UL just prints its argument in italics. .DS/.DE does not do a keep, nor do any of the other macros that normally imply keeps.
Assignments to the header/footer string variables are recognized and implemented, but there is otherwise no control over header/footer formatting. The DY string variable is available. The PD, PI, and LL number registers exist and can be changed.
The only output format supported by awf, in its distributed form, is that appropriate to a dumb terminal, using overprinting for italics (via underlining) and bold. The nroff special characters are printed as some vague approximation (it's sometimes very vague) to their correct appearance.
Awf's knowledge of the output device is established by a device file, which is read before the user's input. It is sought in awf's library directory, first as dev.term (where term is the value of the TERM environment variable) and, failing that, as dev.dumb. The device file uses special internal commands to set up resolution, special characters, fonts, etc., and more normal nroff commands to set up page length etc.
All in /usr/lib/awf (this can be overridden by the AWFLIB environment variable):
common        common device-independent initialization
dev.*         device-specific initialization
mac.m*        macro packages
pass1         macro substituter
pass2.base    central formatter
pass2.m*      macro-package-specific bits of formatter
pass3         line and page composer
awk(1), nroff(1), man(7), ms(7)
Unlike nroff, awf complains whenever it sees unknown commands and macros. All diagnostics (these and some internal ones) appear on standard error at the end of the run.
Written at University of Toronto by Henry Spencer, more or less as a supplement to the C News project.
Copyright 1990 University of Toronto. All rights reserved. Written by Henry Spencer. This software is not subject to any license of the American Telephone and Telegraph Company or of the Regents of the University of California.
Permission is granted to anyone to use this software for any purpose on any computer system, and to alter it and redistribute it freely, subject to the following restrictions:
There are plenty, but what do you expect for a text formatter written entirely in (old) awk?
The -ms stuff has not been checked out very thoroughly.
Download from SourceForge.
Jawk runs on any platform which supports, at minimum, J2SE 5.
java -jar jawk.jar {command-line-arguments}
To view the command line argument usage summary, execute
java -jar jawk.jar -h
The output of this command is shown below:
java ... org.jawk.Awk [-F fs_val] [-f script-filename]
[-o output-filename] [-c] [-z] [-Z]
[-d dest-directory] [-S] [-s] [-x] [-y] [-r]
[-ext] [-ni] [-t] [-v name=val]...
[script] [name=val | input_filename]...
-F fs_val = Use fs_val for FS.
-f filename = Use contents of filename for script.
-v name=val = Initial awk variable assignments.
-t = (extension) Maintain array keys in sorted order.
-c = (extension) Compile to intermediate file. (default: a.ai)
-o = (extension) Specify output file.
-z = (extension) | Compile for JVM. (default: AwkScript.class)
-Z = (extension) | Compile for JVM and execute it. (default: AwkScript.class)
-d = (extension) | Compile to destination directory. (default: pwd)
-S = (extension) Write the syntax tree to file. (default: syntax_tree.lst)
-s = (extension) Write the intermediate code to file. (default: avm.lst)
-x = (extension) Enable _sleep, _dump as keywords, and exec as a builtin func.
(Note: exec enabled only in interpreted mode.)
-y = (extension) Enable _INTEGER, _DOUBLE, and _STRING casting keywords.
-r = (extension) Do NOT hide IllegalFormatExceptions for [s]printf.
-ext= (extension) Enable user-defined extensions. (default: not enabled)
-ni = (extension) Do NOT process stdin or ARGC/V through input rules.
(Useful for blocking extensions.)
(Note: -ext & -ni available only in interpreted mode.)
-h or -? = (extension) This help screen.
The Jawk extension facility allows arbitrary Java code to be called as Awk functions in a Jawk script. These extensions can come from the user (developer) or from 3rd party providers (i.e., the Jawk project team). Jawk extensions are opt-in: the -ext flag is required to use them, and extensions must be explicitly registered to the Jawk instance via the -Djawk.extensions property (except for core extensions bundled with Jawk).
Jawk extensions also support blocking, which you can think of as a tool for extension event management. A Jawk script can block on a collection of blockable services, such as socket input availability, database triggers, user input, GUI dialog input response, or a simple fixed timeout. Together with the -ni option, action rules can then act on block events instead of input text. This leverages a powerful AWK construct that was originally intended for text processing but can now be used to process blockable events. A sample enhanced echo server script is included in this article. It uses blocking to handle socket events, standard input from the user, and timeout events, all within the 47-line script (including comments).
## to run: java ... -jar jawk.jar -ext -ni -f {filename}
BEGIN {
    css = CServerSocket(7777);
    print "(echo server socket created)"
}

## note: default input processing disabled by -ni
$0 = SocketAcceptBlock(css,
        SocketInputBlock(sockets,
        SocketCloseBlock(css, sockets,
        StdinBlock(
        Timeout(1000)))));
## note: default action { print } disabled by -ni

# $1 = "SocketAccept", $2 = socket handle
$1 == "SocketAccept" {
    socket = SocketAccept($2)
    sockets[socket] = 1
}

# $1 = "SocketInput", $2 = socket handle
$1 == "SocketInput" {
    ## echo server action:
    socket = $2
    line = SocketRead(socket)
    SocketWrite(socket, line)
}

# $1 = "SocketClose", $2 = socket handle
$1 == "SocketClose" {
    socket = $2
    SocketClose(socket)
    delete sockets[socket]
}

## display a . for every second the server is running
$0 == "Timeout" {
    printf "."
}

## stdin block is last because StdinGetline writes directly to $0
## $0 == "Stdin"
$0 == "Stdin" {
    ## broadcast message to all sockets
    retcode = StdinGetline()
    if (retcode != 1)
        exit
    for (socket in sockets)
        SocketWrite(socket, "From server : " $0)
    print "(message sent)"
}
Each extension function used in the script above is covered in some detail below:
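As the comments in the script indicate, SocketAcceptBlock, SocketInputBlock, and SocketCloseBlock return a string of the form extension-label-prefix OFS parameter, while StdinBlock and Timeout return extension-label-prefix.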
As stated by the comments, -ni disables stdin processing (as provided by Jawk itself, not the StdinExtension) and the default blank rule of { print }. Disabling stdin processing is essential to extension processing because, otherwise, it would be confusing, if not completely impossible, to multiplex extension blocking with Jawk's default stdin processing. And disabling the default blank rule allows for easy-to-read blocking statements (like the one provided in the sample script) without the weird side effect of printing the result.
Dan: ddaglas at users.sourceforge.net.
(Editor's note: Programmers often take awk "as is", never thinking to use it as a lab in which they can explore other language extensions. An alternate approach is to treat the Awk code base as a reusable library of parsers, regular expression engines, etc., and to make modifications to the language. This second approach is taken in the Awk A* project and, as shown here, in XMLgawk. IMHO, XMLgawk is one of the most exciting new innovations seen in Gawk for many years. It shows that Awk is more than "just" a text processor: rather, it is also a candidate technology for modern XML-based web applications.)
Extends standard gawk with built-in XML processing.
Main developers: Jurgen Kahrs and Andrew Schorr.
Conceptual guidance: Manuel Collado.
MS Windows build expert: Victor Paeza.
Contributor of ideas for new features: Peter Saveliev.
XML processing, plus libraries for other extensions to Gawk.
XMLgawk is an experimental extension of the GNU Awk interpreter. It includes a small XML parsing library which is built upon the Expat XML parser. The parsing library is a very thin layer on top of Expat (implementing a pull-interface) and can also be used without GNU Awk to read XML data files.
Both XMLgawk and its XML puller library only require an ANSI C compatible compiler (GCC works, as do most vendors' ANSI C compilers) and a 'make' program.
XMLgawk provides functionality including:
3=Released
3=Free/public domain.
November 2003.
April 28, 2009.
AI Programming lab class challenge .
Download from LAWKER. Look at the first line of each file for something that looks like this:
#!/usr/bin/gawk -f
Replace this with the full path to the local version of Gawk.
Ronald Loui (programmer and designer)
Washington University in St. Louis
USA
Text-based game simulation.
Ronald P. Loui
r.p.loui@gmail.com
Ronald Loui writes: Most people are surprised when I tell them what language we use in our undergraduate AI programming class. That's understandable. We use GAWK. GAWK, Gnu's version of Aho, Weinberger, and Kernighan's old pattern scanning language
A repeated observation in this class is that only the scripting programmers can generate code fast enough to keep up with the demands of the class. Even though students were allowed to choose any language they wanted, and many had to unlearn the java ways of doing things in order to benefit from scripting, there were few who could develop ideas into code effectively and rapidly without scripting.
In the puny language, GAWK, which Aho, Weinberger, and Kernighan thought not much more important than grep or sed, I find lessons in AI's trends, AI's history, and the foundations of AI. What I have found not only surprising but also hopeful, is that when I have approached the AI people who still enjoy programming, some of them are not the least bit surprised.
Was written for gawk in 1995 but should run on almost any awk dialect; some css positioning commands will not work in all browsers; try IE6.
Was written on Redhat Linux with multiple hardware platforms in mind.
Intended to be run on close server to minimize delays.
605 lines in main cgi with several small aux control programs.
Minimal compared to development effort, but potentially will require css for new browsers.
Number of person-months since, including enhancements
2=Evaluation.
50 students in artificial intelligence project classes had to use some version of this code over seven years
October 2004
April 2009
Awk-Linux Educational Operating Systems
Teaching operating systems.
Yung-Pin Cheng
ypc@csie.ntnu.edu.tw
Software Engineering Lab. Department of Computer Science and Information Engineering National Taiwan Normal University
TAIWAN
Educators of Operating Systems
Most well-known instructional operating systems are complex, particularly if their companion software is taken into account. It takes considerable time and effort to craft these systems, and their complexity may introduce maintenance and evolution problems. In this project, a courseware called Awk-Linux is proposed. The basic hardware functions provided by Awk-Linux include timer interrupt and page-fault interrupt, which are simulated through program instrumentation over user programs.
A major advantage of using Awk for this tool is platform independence. Awk-Linux can be crafted relatively easily, and it does not depend on any hardware simulator or platform. Stable Awk versions run on many platforms, so this tool can be readily ported to other machines. The same cannot be said for other, more complex operating systems courseware that may be much harder to port to new environments.
In practice, using Awk-Linux is very simple for the instructor and students:
Gawk under cygwin or Linux
Windows (CYGWIN required) or Linux
C programming language
Status 3 (Released)
3(Free/public domain)
2004
Yung-Pin Cheng, Janet Mei-Chuen Lin, Awk-Linux: A Lightweight Operating Systems Courseware, IEEE Transactions on Education, vol. 51, issue 4, pp. 461-467, 2008.
www.csie.ntnu.edu.tw/~ypc/awklinux.htm
awk [-v Dictionaries="sysdict1 sysdict2 ..."] -f spell.awk -- \
[=suffixfile1 =suffixfile2 ...] [+dict1 +dict2 ...] \
[-strip] [-verbose] [file(s)]
Download from LAWKER.
This program is an example par excellence of the power of awk. Yes, if written in "C", it would run faster. But goodness me, it would be much longer to code. These few lines implement a powerful spell checker, with user-specifiable exception lists. The built-in dictionary is constructed from a list of standard Unix spelling dictionaries, overridable on the command line.
It also offers some tips on how to structure larger-than-ten-line awk programs. In the code below, note the:
(And to write even larger programs, divided into many files, see runawk.)
Dictionaries are simple text files, with one word per line. Unlike those for Unix spell(1), the dictionaries need not be sorted, and there is no dependence on the locale in this program that can affect which exceptions are reported, although the locale can affect their reported order in the exception list. A default list of dictionaries can be supplied via the environment variable DICTIONARIES, but that can be overridden on the command line.
For the purposes of this program, words are located by replacing ASCII control characters, digits, and punctuation (except apostrophe) with ASCII space (32). What remains are the words to be matched against the dictionary lists. Thus, files in ASCII and ISO-8859-n encodings are supported, as well as Unicode files in UTF-8 encoding.
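A minimal sketch of this word-location step follows. It is an illustration under simplifying assumptions rather than the program's own code: the helper extract_words() is hypothetical, and the bracket expression here keeps only ASCII letters and the apostrophe, whereas the real program also keeps the bytes 128 to 255 so that ISO-8859-n and UTF-8 text survives.

```awk
# Sketch: locate words by replacing non-word characters with spaces.
# extract_words() is a hypothetical helper; only ASCII letters and the
# apostrophe are treated as word characters in this simplified version.
function extract_words(line, words,    n, i, k, parts, w) {
    split("", words)                     # clear the result array
    gsub(/[^'A-Za-z]/, " ", line)        # blank out digits, punctuation, controls
    n = split(line, parts, " ")
    k = 0
    for (i = 1; i <= n; i++) {
        w = parts[i]
        gsub(/^'+|'+$/, "", w)           # strip leading and trailing apostrophes
        if (w != "")
            words[++k] = w
    }
    return k                             # number of words found
}
BEGIN {
    n = extract_words("It's 3 o'clock: 'time' to go-go!", words)
    for (i = 1; i <= n; i++)
        print words[i]
}
```

Embedded apostrophes (as in "It's") survive, while quoting apostrophes around a word are dropped, so the words handed to the dictionary lookup are clean.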
All word matching is case insensitive (subject to the workings of tolower()).
In this simple version, which is intended to support multiple languages, no attempt is made to strip word suffixes, unless the -strip option is supplied.
Suffixes are defined as regular expressions, and may be supplied from suffix files (one per name) named on the command line, or from an internal default set of English suffixes. Comments in the suffix file run from sharp (#) to end of line. Each suffix regular expression should end with $, to anchor the expression to the end of the word. Each suffix expression may be followed by a list of one or more strings that can replace it, with the special convention that "" represents an empty string. For example:
ies$ ie ies y # flies -> fly, series -> series, ties -> tie
ily$ y ily # happily -> happy, wily -> wily
nnily$ n # funnily -> fun
Although it is permissible to include the suffix in the replacement list, it is not necessary to do so, since words are looked up before suffix stripping.
Suffixes are tested in order of decreasing length, so that the longest matches are tried first.
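The suffix rules above can be sketched as follows. This is a simplified illustration, not the program's own suffix-stripping code: the function names add_suffix(), sort_by_length(), and candidates() and the arrays Order and Repl are all hypothetical, and the sketch simply returns the candidate base words for the first (longest) matching suffix instead of looking them up in a dictionary.

```awk
# Sketch: suffix stripping with longest-first matching.
# All function and array names here are hypothetical.
function add_suffix(regexp, replacements) {
    Order[++NSuffix] = regexp
    Repl[regexp] = replacements
}
function sort_by_length(    i, j, t) {   # longest regexps first
    for (i = 1; i < NSuffix; i++)
        for (j = i + 1; j <= NSuffix; j++)
            if (length(Order[i]) < length(Order[j])) {
                t = Order[i]; Order[i] = Order[j]; Order[j] = t
            }
}
# Return space-separated candidate words, or "" if no suffix matches.
function candidates(word,    i, k, n, parts, base, rep, out) {
    for (i = 1; i <= NSuffix; i++) {
        if (word ~ Order[i]) {
            base = word
            sub(Order[i], "", base)      # strip the matched suffix
            n = split(Repl[Order[i]], parts, " ")
            if (n == 0)
                return base              # no replacement list: bare stem
            out = ""
            for (k = 1; k <= n; k++) {
                rep = parts[k]
                if (rep == "\"\"")       # "" stands for the empty string
                    rep = ""
                if (k > 1)
                    out = out " "
                out = out base rep
            }
            return out
        }
    }
    return ""
}
BEGIN {
    add_suffix("ies$", "ie ies y")
    add_suffix("ily$", "y ily")
    add_suffix("nnily$", "n")
    sort_by_length()
    print candidates("flies")
    print candidates("funnily")
}
```

Sorting by decreasing length is what makes "funnily" match nnily$ (giving "fun") before the shorter ily$ rule gets a chance to produce "funny"-style candidates.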
The default output is just a sorted list of unique spelling exceptions, one per line. With the -verbose option, output lines instead take the form
filename:linenumber:exception
Some Unix text editors recognize such lines, and can use them to move quickly to the indicated location.
BEGIN { initialize() }
{ spell_check_line() }
END { report_exceptions() }
function get_dictionaries( files, key)
{
    if ((Dictionaries == "") && ("DICTIONARIES" in ENVIRON))
        Dictionaries = ENVIRON["DICTIONARIES"]
    if (Dictionaries == "") # Use default dictionary list
    {
        DictionaryFiles["/usr/dict/words"]++
        DictionaryFiles["/usr/local/share/dict/words.knuth"]++
    }
    else # Use system dictionaries from command line
    {
        split(Dictionaries, files)
        for (key in files)
            DictionaryFiles[files[key]]++
    }
}

function initialize()
{
    NonWordChars = "[^" \
        "'" \
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ" \
        "abcdefghijklmnopqrstuvwxyz" \
        "\200\201\202\203\204\205\206\207\210\211\212\213\214\215\216\217" \
        "\220\221\222\223\224\225\226\227\230\231\232\233\234\235\236\237" \
        "\240\241\242\243\244\245\246\247\250\251\252\253\254\255\256\257" \
        "\260\261\262\263\264\265\266\267\270\271\272\273\274\275\276\277" \
        "\300\301\302\303\304\305\306\307\310\311\312\313\314\315\316\317" \
        "\320\321\322\323\324\325\326\327\330\331\332\333\334\335\336\337" \
        "\340\341\342\343\344\345\346\347\350\351\352\353\354\355\356\357" \
        "\360\361\362\363\364\365\366\367\370\371\372\373\374\375\376\377" \
        "]"
    get_dictionaries()
    scan_options()
    load_dictionaries()
    load_suffixes()
    order_suffixes()
}
function load_dictionaries(        file, word)
{
    for (file in DictionaryFiles)
    {
        ## print "DEBUG: Loading dictionary " file > "/dev/stderr"
        while ((getline word < file) > 0)
            Dictionary[tolower(word)]++
        close(file)
    }
}
function load_suffixes(        file, k, line, n, parts)
{
    if (NSuffixFiles > 0)       # load suffix regexps from files
    {
        for (file in SuffixFiles)
        {
            ## print "DEBUG: Loading suffix file " file > "/dev/stderr"
            while ((getline line < file) > 0)
            {
                sub(" *#.*$", "", line)         # strip comments
                sub("^[ \t]+", "", line)        # strip leading whitespace
                sub("[ \t]+$", "", line)        # strip trailing whitespace
                if (line == "")
                    continue
                n = split(line, parts)
                Suffixes[parts[1]]++
                Replacement[parts[1]] = parts[2]
                for (k = 3; k <= n; k++)
                    Replacement[parts[1]] = Replacement[parts[1]] " " parts[k]
            }
            close(file)
        }
    }
    else                        # load default table of English suffix regexps
    {
        split("'$ 's$ ed$ edly$ es$ ing$ ingly$ ly$ s$", parts)
        for (k in parts)
        {
            Suffixes[parts[k]] = 1
            Replacement[parts[k]] = ""
        }
    }
}
function order_suffixes(        i, j, key)
{
    # Order suffixes by decreasing length
    NOrderedSuffix = 0
    for (key in Suffixes)
        OrderedSuffix[++NOrderedSuffix] = key
    for (i = 1; i < NOrderedSuffix; i++)
        for (j = i + 1; j <= NOrderedSuffix; j++)
            if (length(OrderedSuffix[i]) < length(OrderedSuffix[j]))
                swap(OrderedSuffix, i, j)
}
function report_exceptions(        key, sortpipe)
{
    sortpipe = Verbose ? "sort -f -t: -u -k1,1 -k2n,2 -k3" : "sort -f -u -k1"
    for (key in Exception)
        print Exception[key] | sortpipe
    close(sortpipe)
}
function scan_options(        k)
{
    for (k = 1; k < ARGC; k++)
    {
        if (ARGV[k] == "-strip")
        {
            ARGV[k] = ""
            Strip = 1
        }
        else if (ARGV[k] == "-verbose")
        {
            ARGV[k] = ""
            Verbose = 1
        }
        else if (ARGV[k] ~ /^=/)        # suffix file
        {
            NSuffixFiles++
            SuffixFiles[substr(ARGV[k], 2)]++
            ARGV[k] = ""
        }
        else if (ARGV[k] ~ /^[+]/)      # private dictionary
        {
            DictionaryFiles[substr(ARGV[k], 2)]++
            ARGV[k] = ""
        }
    }
    # Remove trailing empty arguments (for nawk)
    while ((ARGC > 0) && (ARGV[ARGC-1] == ""))
        ARGC--
}
function spell_check_line(        k, word)
{
    ## for (k = 1; k <= NF; k++) print "DEBUG: word[" k "] = \"" $k "\""
    gsub(NonWordChars, " ")             # eliminate nonword chars
    for (k = 1; k <= NF; k++)
    {
        word = $k
        sub("^'+", "", word)            # strip leading apostrophes
        sub("'+$", "", word)            # strip trailing apostrophes
        if (word != "")
            spell_check_word(word)
    }
}
function spell_check_word(word,        key, lc_word, location, w, wordlist)
{
    lc_word = tolower(word)
    ## print "DEBUG: spell_check_word(" word ") -> tolower -> " lc_word
    if (lc_word in Dictionary)          # acceptable spelling
        return
    else                                # possible exception
    {
        if (Strip)
        {
            strip_suffixes(lc_word, wordlist)
            ## for (w in wordlist) print "DEBUG: wordlist[" w "]"
            for (w in wordlist)
                if (w in Dictionary)
                    break
            if (w in Dictionary)
                return
        }
        ## print "DEBUG: spell_check():", word
        location = Verbose ? (FILENAME ":" FNR ":") : ""
        if (lc_word in Exception)
            Exception[lc_word] = Exception[lc_word] "\n" location word
        else
            Exception[lc_word] = location word
    }
}
function strip_suffixes(word, wordlist,        ending, k, n, regexp)
{
    ## print "DEBUG: strip_suffixes(" word ")"
    split("", wordlist)
    for (k = 1; k <= NOrderedSuffix; k++)
    {
        regexp = OrderedSuffix[k]
        ## print "DEBUG: strip_suffixes(): Checking \"" regexp "\""
        if (match(word, regexp))
        {
            word = substr(word, 1, RSTART - 1)
            if (Replacement[regexp] == "")
                wordlist[word] = 1
            else
            {
                split(Replacement[regexp], ending)
                for (n in ending)
                {
                    if (ending[n] == "\"\"")
                        ending[n] = ""
                    wordlist[word ending[n]] = 1
                }
            }
            break
        }
    }
    ## for (n in wordlist) print "DEBUG: strip_suffixes() -> \"" n "\""
}
function swap(a, i, j,        temp)
{
    temp = a[i]
    a[i] = a[j]
    a[j] = temp
}
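To see why order_suffixes() sorts by decreasing length before strip_suffixes() runs, the following stand-alone sketch may help. It is an illustration, not part of the original program: the function name strip_longest is hypothetical, it uses only a subset of the default suffix table (the apostrophe forms are left out to keep shell quoting simple), and it ignores replacement endings.

```shell
# Hypothetical helper: strip the longest matching suffix from one word,
# using a subset of the default suffix regexps (apostrophe forms omitted).
strip_longest() {
    awk -v word="$1" '
        BEGIN {
            n = split("ed$ edly$ es$ ing$ ingly$ ly$ s$", suffix)
            # Exchange sort by decreasing regexp length, as in order_suffixes()
            for (i = 1; i < n; i++)
                for (j = i + 1; j <= n; j++)
                    if (length(suffix[i]) < length(suffix[j])) {
                        tmp = suffix[i]; suffix[i] = suffix[j]; suffix[j] = tmp
                    }
            # The first (longest) regexp that matches wins
            for (k = 1; k <= n; k++)
                if (match(word, suffix[k])) {
                    print substr(word, 1, RSTART - 1)
                    exit
                }
            print word                  # no suffix matched
        }'
}

strip_longest amazingly    # prints "amaz"
strip_longest walked       # prints "walk"
```

Without the ordering step, "amazingly" could match ly$ before ingly$ and yield "amazing" instead. In the real program a suffix file can also supply replacement endings for each regexp, which this sketch does not model.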
Arnold Robbins and Nelson H.F. Beebe in "Classic Shell Scripting", O'Reilly Books
Run a WIKI using Gawk.
Download from LAWKER or Wolfgang Zekoll's web site.
For a live demo, see the Yawk home page.
Wolfgang Zekoll.
Web application.
Wolfgang Zekoll.
dag@awk-scripting.de
Yawk is "yet another wiki klone", one among many. It was written because the available wikis lacked some formatting capabilities, used strange formatting rules (and you might not like mine), or imposed too many requirements just to get a wiki running (for example, a MySQL database installation, with or without PHP).
Gawk 3.1.4 or later.
CGI
6000 lines.
3=Released.
3=Free/public domain.
2004
2009
Code up a LISP/Scheme interpreter in Awk.
See awklisp.
1
Domain-specific language.
Darius Bacon
darius@wry.me
At my previous job I had to use MapBasic, an interpreter so astoundingly slow (around 100 times slower than GWBASIC) that one must wonder if it itself is implemented in an interpreted language. I still wonder, but it clearly could be: a bare-bones Lisp in awk, hacked up in a few hours, ran substantially faster.
Awk/Gawk
350
1=Prototype
1=Personal use.
1994
2009
Not a single program.
Generate TeX code for a bilingual dictionary from a flat file database. This system has been used to generate multiple editions of dictionaries for several dialects of Carrier, the endangered language of a large portion of the central interior of British Columbia.
Bill Poser
Canada
linguistics - dictionary publishing
Bill Poser
billposer@alum.mit.edu
A dictionary database consists of four flat files containing records in which fields are identified by tags, in a format isomorphic to Standard Dictionary Format. The four files contain main entries, example sentences with translations, verb roots, and verb stems. This provides a modest degree of relativization. Awk scripts controlled by a makefile do the bulk of the work of generating TeX code for printing dictionaries containing front matter, a Carrier-English section, an English-Carrier section, a topical index, an alphabetical root list, a list of roots sorted by English gloss, an alphabetical list of verb stems, a list of verb stems sorted by root, an alphabetical list of affixes, a list of affixes sorted by English gloss, a list of scientific names, a list of placenames, and credits for illustrations.
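As a rough illustration of the general approach (tagged flat-file records in, TeX out), here is a minimal sketch. Everything specific in it is an assumption made for the example and is not taken from the actual system: the record layout (blank-line-separated records with one "tag value" field per line), the tags hw and gl, the \entry macro, the function name records_to_tex, and the placeholder data.

```shell
# Hypothetical helper: turn tagged records into TeX macro calls.
# Assumed layout: records separated by blank lines, one "tag value" per line.
# The tags "hw"/"gl" and the \entry macro are invented for illustration.
records_to_tex() {
    awk '
        BEGIN { RS = ""; FS = "\n" }        # paragraph mode: one record per block
        {
            split("", field)                # clear fields left from the previous record
            for (i = 1; i <= NF; i++) {
                tag = $i
                sub(/ .*/, "", tag)         # keep the text before the first space
                value = $i
                sub(/^[^ ]+ +/, "", value)  # keep the text after the tag
                field[tag] = value
            }
            printf "\\entry{%s}{%s}\n", field["hw"], field["gl"]
        }'
}

# Placeholder data only; not real dictionary entries.
printf 'hw word1\ngl gloss one\n\nhw word2\ngl gloss two\n' | records_to_tex
```

Setting RS to the empty string makes awk treat each blank-line-separated block as one record, which suits tagged multi-line entries. A makefile rule could then run a script like this once per output section.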
gawk
The awk scripts are executed from a make file.
GNU/Linux on x86.
The awk scripts are executed from a makefile by GNU make. The other program used extensively is the sort utility msort.
5500
The first usable version took no more than a day (plus the time to create the TeX template into which the generated code is inserted).
Pure maintenance due to changes in environment, bit rot, etc. has been just about nil. The effort devoted to adding features is very difficult to estimate, as it has taken place at irregular intervals over a period of 15 years.
3=Released, I guess. The code is mature but not really released since the author is the only one who normally uses it.
1=Personal use
1
June 1993.
A paper describing these databases and the process for generating dictionaries from them is available: Lexical Databases for Carrier
Some information about the resulting dictionaries: http://www.ydli.org/products/dicts.htm
Demonstration to DoD of a clustering algorithm suitable for streaming data.
http://www.cse.wustl.edu/~loui/boris.cgi.
Ronald Loui and a programmer named Boris.
Washington University in St. Louis, CS Dept.
USA
This is an evolutionary algorithm and visualization of a clustering algorithm that could be turned from O(n^4) to O(nlogn) with a few judicious uses of constants. Later developments added other interactive devices, including progress meters and mouse-and-click behavior.
Ronald Loui
r.p.loui@gmail.com
The code is an excellent example of the power of Awk as a prototyping tool: after getting the code running, with the least development time, a quirk was observed in the code that allowed a reduction from O(n^4) to O(nlogn).
Gawk
Intended for fast servers, 1+ GHz.
HTML.
158.
One weekend.
None.
2=Evaluation.
2=in-House use.
5
2004.
Feb 2009.
Looks, M.; Levine, A.; Covington, G.A.; Loui, R.P.; Lockwood, J.W.; Cho, Y.H. Streaming Hierarchical Clustering for Concept Mining. 2007 IEEE Aerospace Conference, 3-10 March 2007, pp. 1-12. Digital Object Identifier 10.1109/AERO.2007.352792
Download videos from YouTube.
Peteris Krumins: Downloading YouTube Videos With Gawk
World wide web, slurping, file sharing.
Peteris Krumins
How to download YouTube videos.
Gawk
331 lines
3=Released
1=Personal use
July 2007
Sat Feb 21 19:46:10 EST 2009
Downloading YouTube Videos With Gawk
This is an Awk 100 program.
Jim Hart
Solve sudoku puzzles using the same strategies as a person would, not by brute force.
Jim Hart
US
Jim Hart
jhart50@gmail.com
see Purpose
gawk
Mac OS X, PowerPC
529
1
0
/2006
An Awk100 program.
Research on a model of negotiation incorporating search, dialogue, and changing expectations
Ronald Loui (programmer and designer), Anne Jump (adversary)
National Science Foundation grant at Washington University in St. Louis
USA
Prototype of a new idea for cognitive modelling (in artificial intelligence/economics/organizational behavior)
Ronald P. Loui
r.p.loui@gmail.com
The program generates a game board upon which players take turns searching or declaring according to a protocol. It is based on the same game bimatrix made famous by people like von Neumann and Nash, but invents a new approach to negotiation based on process instead of solution.
Was written for gawk in 1997 but should run on almost any awk dialect
Was written on Redhat Linux with multiple hardware platforms in mind
Was intended to be self-contained
658 lines, of which 39 are comments
One day, 6-8 hours total
Two revisions are available, mainly to permit programs to negotiate instead of humans, and to provide a web-based dashboard to monitor the events
2=Evaluation
2=in-House use
50 students in artificial intelligence project classes had to use some version of this code over three years
October 1997
January 2008
There is a draft article (unpublished), and several talks, e.g.
The paper in Harper and Wheeler, Probability and Inference: Essays in Honour of Henry E. Kyburg Jr. (Paperback), Publisher: College Publications (23 April 2007) ISBN-10: 1904987184 ISBN-13: 978-1904987185 also refers to the theory implemented here. Diana Moore's thesis on negotiation and draft article http://citeseer.ist.psu.edu/11983.html contains some precursor ideas.
http://www.cs.wustl.edu/~loui/313f97/anne4.expl.html
This is an Awk 100 program.
A quick and dirty baseball simulator for investigating the efficiency of batting lineups
Ronald P. Loui
Washington University in St. Louis
USA
Research/Decision Support
Ronald P. Loui
r.p.loui@gmail.com
This was written for the AI course, and for several investigations, including the determination of whether it is a good idea to bat the pitcher in the 8th spot. One hypothesis that emerges from this program that deserves further study is that the most potent offense is one that spreads rather than concentrates the batting threats.
Gawk around 2002
Linux around 2002
None
409
Approximately one day
Further simulators were developed for improved domain modeling and for successive addition of functionality; no other code maintenance was required.
1=Prototype
1=Personal use
About 50 students used this program over three years in AI classes, and two undergraduate theses and one Master's thesis on evolutionary computing made use of this simulator.
October 2002
January 2009
None, but see Tony LaRussa's comments on batting order while managing the St. Louis Cardinals
An Awk100 program.
A tool inspired by fmt that could be used while working in vi to maintain a multi-column pro-con argument format.
See gawk/awk100/argcol.
Mark Foltz, Ronald Loui, Thieu Dang, Jeremy Frens
Washington University in St. Louis
USA
Application/text support for text editor.
Ronald Loui
r.p.loui@gmail.com
Gawk circa 1994, Solaris and MS-DOS-based awk such as mawk.
Solaris and MS-DOS
Vi and variants such as stevie.
278
One week.
No maintenance, eventually rewritten as cgi/web program in Room5 project.
4=No longer supported
3=Free/public domain
2
May 1994
Jan 2009
Progress on Room 5: a testbed for public interactive semi-formal legal argumentation. Proceedings of the 6th International Conference on Artificial Intelligence and Law, Melbourne, Australia, 1997, pp. 207-214. ISBN 0-89791-924-6