Awk.Info

"Cause a little auk awk
goes a long way."

WHAT'S NEW?

Mar 01: Michael Sanders demos an X-windows GUI for AWK.

Mar 01: Awk100#24: A. Lahm and E. de Rinaldis' patent search, in AWK

Feb 28: Tim Menzies asks this community to write an AWK cookbook.

Feb 28: Arnold Robbins announces a new debugger for GAWK.

Feb 28: Awk100#23: Premysl Janouch offers an IRC bot, in AWK

Feb 28: Updated: the AWK FAQ

Feb 28: Tim Menzies offers a tiny content management system, in Awk.

Jan 31: Comment system added to awk.info. For example, see the discussion at the bottom of ?keys2awk

Jan 31: Martin Cohen shows that Gawk can handle massively long strings (300 million characters).

Jan 31: The AWK FAQ is being updated. For comments/ corrections/ extensions, please mail tim@menzies.us

Jan 31: Martin Cohen finds Awk on the Android platform.

Jan 31: Aleksey Cheusov released a new version of runawk.

Jan 31: Hirofumi Saito contributes a candidate Awk mascot.

Jan 31: Michael Sanders shows how to quickly build an AWK GUI for windows.

Jan 31: Hyung-Hwan Chung offers QSE, an embeddable Awk Interpreter.


categories: Sitemap,Apr,2009,Admin

Featured Topics

These pages are grouped into the topics listed below (latest one shown first):


categories: Sitemap,Apr,2009,Admin

Table of Contents

210 pages.

300 million characters (in an Awk string)
99 bottles of beer
A Tale of Two Tawks
A Web Server in Awk
Advocacy
Amazing Awk Assembler
Amazing Awk Formatter
An Awk Dungeon Adventure Game
Argcol
Arnold Robbins
array.awk
Automated Results Verification
Awk + Ansi-C = OO
Awk and Mail
Awk Cookbook Project
Awk for AI
Awk for Chemical Engineers
Awk for Engineering
Awk for Mechanical Engineers
Awk for system programming
Awk Games
Awk Mug
Awk on Android
Awk snake
Awk's Equivalent to VI's J
Awk++
Awk-Linux
Awk.info
Awk.info Gaining Popularity
Awk100
Awkbot:
Awklisp
awkwords
Baseball Sim
Brainfuck to C
Building Interpreters with A*
Checkers
Coding
Columnate
Community
Contact
Convert Code Comments to Latex
Correlation between numbers
Credits
Databases
Davinci mascot
Debugger and Assertion Checker
Domain-Specific Languages
Ed Morton
Eliza
Errata: WHINY_USERS slows down Awk
Ethiopian Multiplication
Explaining Awk One Liners
Fast Clustering
Faster Hashing in Mawk
Finite State Machine Generator
Forloops
Format Shell Scripts
Four Keys to Awk
Functional Challenge
Functional Enumeration
Functional Gawk
Generating random sigs
Get YouTube Vids
Getline
GetXML.awk
Graph
Great Auks
Handy One Liners
Hiding Email Address
History
Holiday date routines
How to call Awk from "C" with Libmawk
How to contribute
How to Read Minds
In praise of scripting
Interpreters
Interview with Arnold Robbins
Intrusion alert normalization
IRC agent in AWK
Issue report mining
Jawk = Java + Awk
Jim Hart
join.awk
Language Analysis
Learning Awk
levenshtein.awk
Lexical and Grammar Analysis
List of Tags
m1 : simple macros
m5 : macro processor
Macro pre-processors
Mail sort
Markdown
Mascot
Mastermind
Mastermind (again)
Mawk: faster than C, C++, Java, Perl, Ruby....
md2html : Update to Markdown.awk
MicroTracer
Mike Langman
Monty Hall Problem
Moving Files with Awk
Music analysis workbench
Music tools and Awk
MySql
Naive Bayes Classifier
Negotiate
Network monitoring in Awk
New AWK debugger
New Awk Mascot ('AWK-eye the Dwarf)?
New mascot
NoSQL
One Liners
OO tools in Awk
Operating Systems and Awk
Parallel Awk
Parser Generator
Patent search for genes or proteins
Playing music
Postscript tricks
Predicting Gender
Pretty print
Print an Array
Print Some Postscript Pages
Printing ranges
Processing Bitmaps in Gawk
Project Tools
QSE: an embeddable Awk Interpreter
QTawk
Quicksort
Quicksort2
Random Numbers in Gawk
Reading RSS feeds
Regular Expression Matching Can Be Simple And Fast
Resistor Calculation
Reverse Postscript pages
Ronald Loui
Rot13 in Awk
runawk
Runawk 0.16
Runawk 0.17
Runawk 0.18
Runawk 0.19
Samples of AWK
Sed in Awk
Sed to Awk
Sed-clones (in Awk)
Shorten my pipes
Shuffle.awk
Simple Awk GUIs in Windows
Simple Stream Editor
Simulations Unicast Applications
Soccer
Sorting
Sorting Arrays Via the Shell
Spam Filtering
Spawk for SuSE Linux
Spawk in GoogleCode
spell.awk
spellcheck.awk
Spreadsheets and Awk.
SQL Powered AWK
Steffen Schuler
Sudoku
Super-For loops
Sys Admin tricks in Awk
SysAdmins: Awk is Your Friend
Table of Contents
Teaching Awk
Template-driven programming
Ten Liners
Tex-to-bilingual Dictionary
Text Mining
Text Munging
The Awk Book's Code
The Secret WHINY_USERS flag
The TinyTim Content Management System
Tic-tac-toe
Tim Menzies
Top 10 pages
Top 10 posters
Top 10 posters last year
Top 10 subjects
Top 10 subjects last year
Topics
Towers of Hanoi
UML: sequence diagrams
Unit tests
Using Awk for Databases
Using field names to reference columns
Verification
Visual Awk
Waclaw Sierpinski's Triangle
Why Gawk?
Widen bitmaps, using Gawk
Word-processing in Awk
Writing SciFi
Xgawk for Windows
XML and Awk
XML: Checking for Well-Formedness
XML: Dealing with DTDs
XML: Display components
XML: printing an outline
XML: pulling out data
XMLgawk
xmlparse.awk
Xmonth: Gawk+X-windows GUI
Yawk
Zipf's Law

categories: Sitemap,Apr,2009,Admin

Page Tags

Number Tag
42 Admin
24 Tools
24 Awk100
19 Timm
13 Tips
13 TenLiners
12 ArnoldR
11 Top10
11 Papers
10 XML
10 Ronl
10 Misc
10 June
10 Games
9 Who
9 Dsl
8 Learn
7 Wp
7 Project
7 EdM
7 Databases
6 TextMining
6 Sept
6 Interpreters
5 Xgawk
5 WhyAwk
5 Steffen
5 Runawk
5 Os
5 Mascot
5 JurgenK
5 AlexC
4 Verification
4 SysAdmin
4 Sorting
4 Sed
4 PanosP
4 Newsgroup
4 Jimh
4 HenryS
4 Funky
4 Engineering
3 Spawk
3 Sitemap
3 Ps
3 Oo
3 OneLiners
3 Music
3 Mawk
3 Mail
3 Macros
3 Contribute
3 Arrays
2 Ysa
2 TedD
2 Spell
2 Sigs
2 ScottS
2 MichealS
2 MichaelS
2 MartinC
2 JonB
2 JesusG
2 Graphics
2 GrantC
2 GUI
2 Function
2 Eliza
2 DonaldM
2 DavidL
2 DariusB
2 BrianK
2 AwkLisp
2 Anon
2 AaronH
1 Zazzle
1 YungC
1 Yawk
1 YasumasaS
1 WolfganZ
1 WmM
1 WimVB
1 WillW
1 WilhelmW
1 Web
1 WWW
1 VictorA
1 VenkatesanS
1 TimS
1 TimM
1 TiborP
1 TerryB
1 Sudoku
1 StevenH
1 SteveL
1 SteveJ
1 SteveC
1 StephenJ
1 Stats
1 ScottP
1 SallyF
1 RussC
1 Rss
1 PremyslJ
1 PierreG
1 PhilipB
1 PeterW
1 PeterK
1 PeterI
1 PPuri
1 OsamuA
1 News
1 NelsonB
1 Negotiate
1 Name
1 MikhailA
1 Mikel
1 MartinF
1 MarkB
1 M0J0
1 LotharS
1 Libmawk
1 KimD
1 KennyM
1 JuergenK
1 JohnF
1 JohnD
1 JiirL
1 JanisP
1 JanW
1 JamesL
1 JMellander
1 Irc
1 HyungC
1 HiroS
1 HermannP
1 GregoryG
1 Getline
1 GerardH
1 Forloop
1 Errata
1 EricP
1 EisaA
1 DickL
1 DebbieF
1 DavidH
1 Dates
1 DataMining
1 DanN
1 Dab
1 Cookbook
1 CarloS
1 CMS
1 BrianJ
1 BrendanO
1 Boris
1 BobO
1 BillP
1 Baseballsim
1 BalkhisB
1 Awk
1 Argcol
1 April
1 Android
1 AlfredA
1 AlexS
1 AlexR
1 AlanL
1 ALahm

categories: Awk100,Jan,2009,Admin

The Awk 100

Goals

Awk is being used all around the world for real programming problems, but the news is not getting out.

We are aiming to create a database of at least one hundred Awk programs which will:

  • Identify the tasks that Awk is really being used for
  • Enable analysis of the benefits of the language for practical programming
  • Serve as an information exchange for applications

Contribute

If you, or your colleagues or friends have written a program which has been used for purposes small or large, why not take five minutes to record the facts, so that others can see what you've done?

To contribute, fill in this template and mail it to mail@awk.info with the subject line Awk 100 contribution.

Current Listing

(Recent additions are shown first.)

  1. A. Lahm and E. de Rinaldis' Patent Matrix
    • PatentMatrix is an automated tool to survey patents related to large sets of genes or proteins. The tool allows a rapid survey of patents associated with genes or proteins in a particular area of interest as defined by keywords. It can be efficiently used to evaluate the IP-related novelty of scientific findings and to rank genes or proteins according to their IP position.
  2. P Janouch's AWK IRC agent:
    • VitaminA IRC bot is an experiment on what can be done with GNU AWK. It's a very simple though powerful scripting language. Using the coprocess feature, plugins can be implemented very easily and in a language-independent way as a side-effect. The project runs only on Unix-derived systems.
  3. Stephen Jungels' music player:
    • Plaiter (pronounced "player") is a command line front end to command line music players. What does Plaiter do that (say) mpg123 can't already? It queues tracks, first of all. Secondly, it understands commands like play, pause, stop, next and prev. Finally, unlike most of the command line music players out there, Plaiter can handle a play list with more than one type of audio file, selecting the proper helper app to handle each type of file you throw at it.
  4. Dan at sourceforge's Jawk system:
    • Awk, implemented in the Java virtual machine. Very useful for extending lightweight scripting in Awk with (e.g.) network and GUI facilities from Java.
  5. Axel T. Schreiner's OOC system:
    • ooc is an awk program which reads class descriptions and performs the routine coding tasks necessary to do object-oriented coding in ANSI C.
  6. Ladd and Ramming's Awk A-star system:
    • Programmers often take awk "as is", never thinking to use it as a lab in which we can explore other language extensions. This is, of course, only one way to treat the Awk code base. An alternate approach is to treat the Awk code base as a reusable library of parsers, regular expression engines, etc., and to make modifications to the language. This second approach was taken by David Ladd and J. Christopher Ramming in their A* system.
  7. Henry Spencer's Amazing Awk Syntax Language system:
    • Aaslg and aaslr implement the Amazing Awk Syntax Language, AASL (pronounced ``hassle''). Aaslg (pronounced ``hassling'') takes an AASL specification from the concatenation of the file(s) (default standard input) and emits the corresponding AASL table on standard output.
    • The AASL implementation is not large. The scanner is 78 lines of awk, the parser is 61 lines of AASL (using a fairly low-density paragraphing style and a good many comments), and the semantics pass is 290 lines of awk. The table interpreter is 340 lines, about half of which (and most of the complexity) can be attributed to the automatic error recovery.
    • As an experiment with a more ambitious AASL specification, one for ANSI C was written. This occupies 374 lines excluding comments and blank lines, and with the exception of the messy details of C declarators is mostly a fairly straightforward transcription of the syntax given in the ANSI standard.
  8. Jurgen Kahrs' (and others') XMLgawk system:
    • XMLgawk is an experimental extension of the GNU Awk interpreter. It includes a small XML parsing library which is built upon the Expat XML parser.
    • The same tool that can load the XML shared library can also add other libraries (e.g. PostgreSQL).
  9. Henry Spencer's Amazing Awk Assembler
    • "aaa" (the Amazing Awk Assembler) is a primitive assembler written entirely in awk and sed. It was done for fun, to establish whether it was possible. It is; it works. Using "aaa", it's very easy to adapt to a new machine, provided the machine falls into the generic "8-bit-micro" category.
  10. Ronald Loui's AI programming lab.
    • For many years, Ronald Loui has taught AI using Awk. He writes:
      • Most people are surprised when I tell them what language we use in our undergraduate AI programming class. That's understandable. We use GAWK.
      • A repeated observation in this class is that only the scripting programmers can generate code fast enough to keep up with the demands of the class. Even though students were allowed to choose any language they wanted, and many had to unlearn the Java ways of doing things in order to benefit from scripting, there were few who could develop ideas into code effectively and rapidly without scripting.
      • What I have found not only surprising but also hopeful, is that when I have approached the AI people who still enjoy programming, some of them are not the least bit surprised.
  11. Henry Spencer's Amazing Awk Formatter.
    • Awf may not be lightning fast, and it has certain restrictions, but it does a decent job on most manual pages and simple -ms documents, and isn't subject to AT&T's brain-damaged licensing that denies many System V users any text formatter at all. It is also a text formatter that is simple enough to be tinkered with, for people who want to experiment.
  12. Yung-Pin Cheng's Awk-Linux courseware.
    • The stable and cross-platform nature of Awk enabled the simple creation of a robust toolkit for teaching operating system concepts to university students. The toolkit is much simpler, and easier to port to new platforms, than alternative, more elaborate courseware tools.
    • This work was the basis for a quite prestigious publication in the IEEE Transactions on Education journal, 2008, Vol 51, Issue 4. Who said Awk was an old-fashioned tool?
  13. Jon Bentley's m1 micro macro processor.
    • Supports the essential operations of defining strings and replacing strings in text by their definitions. All in 110 lines. A little awk goes a long way.
  14. Arnold Robbins and Nelson Beebe's classic spell checker
    • A powerful spell checker, and a case-study on how to best write applications using hundreds of lines of Awk.
  15. Jim Hart's awk++
    • An object-oriented Awk.
  16. Wolfgan Zekol's Yawk
    • WIKI written in Awk
  17. Darius Bacon: AwkLisp
    • LISP written in Awk
  18. Bill Poser: Name
    • Generate TeX code for a bilingual dictionary.
  19. Ronald Loui: Faster clustering
    • Demonstration to DoD of a clustering algorithm suitable for streaming data
  20. Peter Krumin: Get YouTube videos
    • Download YouTube videos
  21. Jim Hart: Sudoku
    • Solve sudoku puzzles using the same strategies as a person would, not by brute force.
  22. Ronald Loui: Anne's Negotiation Game
    • Research on a model of negotiation incorporating search, dialogue, and changing expectations.
  23. Ronald Loui: Baseball Sim
    • A baseball simulator for investigating the efficiency of batting lineups.
  24. Ronald Loui: Argcol
    • A tool inspired by fmt that could be used while working in vi to maintain a multi-column pro-con argument format.
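As a rough illustration of the kind of task item 13 describes (defining strings and replacing strings in text by their definitions), here is a minimal sketch in Awk, wrapped in a shell function. It is not Bentley's m1: the function name mini_macro, the "@define NAME value" directive and the "@NAME@" reference form are assumptions made for this example, and unlike a full macro processor it does not rescan replaced text.

```sh
# Hypothetical mini macro expander (illustration only, not m1 itself).
mini_macro() {
  awk '
    # A definition line: remember everything after the name as the value
    $1 == "@define" {
      name = $2
      val = $0
      sub(/^@define[ \t]+[^ \t]+[ \t]*/, "", val)
      defs[name] = val
      next
    }
    # Any other line: replace each @NAME@ that has a definition
    {
      line = $0
      out = ""
      while (match(line, /@[A-Za-z_][A-Za-z0-9_]*@/)) {
        key = substr(line, RSTART + 1, RLENGTH - 2)
        ref = substr(line, RSTART, RLENGTH)
        out = out substr(line, 1, RSTART - 1) ((key in defs) ? defs[key] : ref)
        line = substr(line, RSTART + RLENGTH)
      }
      print out line
    }
  '
}

# Usage: undefined references are passed through unchanged.
printf '@define WHO world\nhello @WHO@! @NOPE@\n' | mini_macro
# prints: hello world! @NOPE@
```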

categories: SysAdmin,Oct,2009,Admin

Sys Admin

These pages focus on sys admin tools in Awk.


categories: Top10,Mar,2009,Admin

Top 10

The Awk.info Top 10 pages highlight the "best" (most impressive, most insightful, most fun, most visited) pages on this site.


categories: Who,Feb,2009,Admin

Credits

Awk.info is maintained by the international awk community. There are many ways you can contribute and get listed below.


categories: Who,Feb,2009,Jimh

Jimh= Jim Hart

Author of awk++.

Jim has been a Great Auk since Feb'09.


categories: Who,Feb,2009,Ronl

Ronl= Ronald Loui

2009: consultant, CycCorp.


categories: Who,Feb,2009,Timm

Timm= Tim Menzies

2009: assoc Prof, LCSEE, WVU email: tim@menzies.us web site: http://menzies.us.

Tim has been a Great Auk since Feb'09.


categories: Who,Feb,2009,EdM

Ed= Ed Morton

2009: frequent poster to comp.lang.awk


categories: Who,Jan,2009,Mikel

Mikel= Mike Langman

From: Tim Menzies <tim@menzies.us>
To: mikelangman@blueyonder.co.uk
Subject: auk images

I write to see if you would be gracious enough to grant us usage rights for your auk paintings to use on this site, in exchange for appropriate credit such as:

  • your name + links to your site on every page of this site;
  • a link with an image that takes our users to your site.

From: Mike Langman <mikelangman@blueyonder.co.uk>
Date: Mon, Jan 19, 2009 at 2:55 AM
Subject: Re: auk images

I normally charge for the use of images but as there is no money involved please carry on using the images and include a link to my website as suggested.

Many thanks for asking.

- Mike


categories: Who,Feb,2009,ArnoldR

Arnoldr= Arnold Robbins

Arnold Robbins, an Atlanta native, is a professional programmer and technical author. He has worked with Unix systems since 1980, when he was introduced to a PDP-11 running a version of Sixth Edition Unix.

He has been a heavy AWK user since 1987, when he became involved with gawk, the GNU project's version of AWK. As a member of the POSIX 1003.2 balloting group, he helped shape the POSIX standard for AWK. He is currently the maintainer of gawk and its documentation.

Since late 1997, he and his family have been living happily in Israel.


categories: Who,Feb,2009,Steffen

Steffen= Steffen Schuler

2009: gnu utils developer and monitor of comp.lang.awk


categories: Who,Feb,2009,Admin

Great Auks: awk.info's ringmasters

Some must lead, some must follow, and some have to fix the typos.

A Great Auk is someone with write permission to our repository. Since the source for this web site is stored in that repository, it also means that they are webmasters of this site. So they (try to):

  1. keep the code and pages in a (somewhat) consistent form,
  2. encourage code documentation and test suites,
  3. watch comp.lang.awk for cool stuff to add to this site,
  4. write little demo programs,
  5. handle queries about this site,
  6. work the issue reports,
  7. etc.

If you want to be a Great Auk, please start contributing to this site using any of the usual methods. Once it is clear that you know what you are doing and that you play nice with others, then you should ask a current Great Auk to nominate you. Then, all the current Great Auks will vote on giving you write access.

The current Great Auks are


categories: Misc,WhyAwk,Jan,2009,Admin

Awk Advocacy

"Because easy is not wrong." - Anon

From various sources:

Quotes:

  • "Listen to people who program, not to people who want to tell you how to program."
    - Ronald P. Loui
  • "Good design is as little design as possible."
    - Dieter Rams
  • "When we have on occasion rewritten an Awk program in a conventional programming language like C or C++, the result was usually much longer, and much harder to debug."
    - Arnold Robbins & Nelson Beebe

From Project Management Advice:

  • More programming theory does not make better programmers.
  • Don't let old/compiler people tell you what language to use.
  • If there is already a way of doing something, do not invent a harder way.

From Awk programming:

  • Awk is a simple and elegant pattern scanning and processing language.
  • Awk is also the most portable scripting language in existence.
  • But why use it rather than Perl (or PHP or Ruby or...):
    • Awk is simpler (especially important if deciding which to learn first);
    • Awk syntax is far more regular (another advantage for the beginner, even without considering syntax-highlighting editors);
    • You may already know Awk well enough for the task at hand;
    • You may have only Awk installed;
    • Awk can be smaller, thus much quicker to execute for small programs.

From Awk as a Major Systems Programming Language:

  • Effective use of its data structures and its stream-oriented structure takes some adjustment for C programmers, but the results can be quite striking.

According to Ramesh Natarajan:

  • AWK is a superb language for testing algorithms and applications with some complexity, especially where the problem can be broken into chunks which can be streamed as part of a pipe. It's an ideal tool for augmenting the features of shell programming as it is ubiquitous; found in some form on almost all Unix/Linux/BSD systems. Many problems dealing with text, log lines or symbol tables are handily solved or at the very least prototyped with awk along with the other tools found on Unix/Linux systems.

From the NoSQL pages:

  • (A language like Perl is) a good programming language for writing self-contained programs, but pre-compilation and long start-up time are worth paying only if once the program has loaded it can do everything in one go. This contrasts sharply with the Operator-stream Paradigm, where operators are chained together in pipelines of two, three or more programs. The overhead associated with initializing (say) Perl at every stage of the pipeline makes pipelining inefficient. A better way of manipulating structured ASCII files is to use the AWK programming language, which is much smaller, more specialized for this task, and is very fast at startup.
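
To make the operator-stream idea concrete, here is a small sketch of three Awk "operators" chained in a pipeline over a tab-separated table with a header row. The function names (select_rows, project_cols, total_col) and the sample table are invented for this illustration; they are not part of NoSQL or of any Awk distribution.

```sh
# Each operator reads a table on stdin and writes a table on stdout.

# Keep the header plus rows whose third column is at least $1.
select_rows() { awk -F'\t' -v min="$1" 'NR == 1 || $3 + 0 >= min'; }

# Keep only columns 1 and 3.
project_cols() { awk -F'\t' 'BEGIN { OFS = "\t" } { print $1, $3 }'; }

# Sum the second column, skipping the header row.
total_col() { awk -F'\t' 'NR > 1 { sum += $2 } END { print "total", sum }'; }

# Usage: three short-lived awk processes, one per pipeline stage.
printf 'name\tdept\tsales\nann\teast\t120\nbob\twest\t80\ncy\teast\t200\n' |
  select_rows 100 | project_cols | total_col
# prints: total 320
```

Each stage starts a fresh interpreter, which is why the quoted passage cares so much about start-up cost.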

categories: Misc,Jan,2009,Admin

Community

To join our community, consider contributing to this site.

For a list of authors of this site, see our credits pages.

The Awk Wiki.

USENET discussion group: comp.lang.awk.


categories: Misc,Jan,2009,Admin

Contact

For discussions on Awk, see the Awk discussion group.

For comments/ complaints/ corrections/ extensions to this site, contact mail@awk.info.


categories: Misc,Jan,2009,Admin

Welcome to the Awk Community Portal

Awk is a stable, cross platform computer language named for its authors Alfred Aho, Peter Weinberger & Brian Kernighan. They write: "Awk is a convenient and expressive programming language that can be applied to a wide variety of computing and data-manipulation tasks".

In Classic Shell Scripting, Arnold Robbins & Nelson Beebe confess their Awk bias: "We like it. A lot. The simplicity and power of Awk often make it just the right tool for the job."

Besides the Bourne shell, Awk is the only other scripting language available in the standard Unix environment. Implementations of AWK exist as installed software for almost all other operating systems.

Awk is a mature language: it was first implemented in the 1970s. As a tool from the golden age, it is sometimes called primitive. It is more accurate to call it elemental, so tightly focused is the language on what it does best: quickly converting this into that.

Consequently, throughout history, Awk has been the language of choice for many famous scientists such as Leonardo da Vinci.



categories: Misc,Jan,2009,Admin

Code

LAWKER is a repository of Awk code divided into:

fridge
Fresh code (for the current trunk). Best place to start is fridge/gawk.
block
Place to chop up and experiment with code. Usually, avoid this one.
freezer
Frozen code. Place to store tags. Currently empty, but we plan to grow this one.
wiki
Wiki pages. Useful for documentation but, where possible, use the in-line pretty print method, described below.

How to contribute to LAWKER

See How to Contribute.

How to report bugs

Use our issue tracking system.


categories: Mascot,Misc,Jan,2009,Admin

Mascot

Missing: the Awk Mascot

Many communities have a mascot, a banner that they proudly wave high. So where's the Awk mascot?

I made one up, but you gotta say, it is kinda lame:

So if you have any ideas for such a mascot, please email mail@awk.info with the subject line "suggestion for mascot".

Not to stifle anyone's creativity, but the mascot might be based on the mantra "less, but better" or "easy is not wrong" or "a little awk goes a long way".

Current Offerings

Chris Johnson

Chris writes "more of a logo rather than a mascot":

Other Mascots

Lisp: Aliens

Perl: Camel

Linux: Tux

Java: Duke


categories: Top10,Papers,Misc,WhyAwk,Jan,2009,Ronl

GAWK for AI

by R. Loui

ACM Sigplan Notices, Volume 31, Number 8, August 1996

Most people are surprised when I tell them what language we use in our undergraduate AI programming class. That's understandable. We use GAWK. GAWK, Gnu's version of Aho, Weinberger, and Kernighan's old pattern scanning language isn't even viewed as a programming language by most people. Like PERL and TCL, most prefer to view it as a `scripting language.' It has no objects; it is not functional; it does no built-in logic programming. Their surprise turns to puzzlement when I confide that (a) while the students are allowed to use any language they want; (b) with a single exception, the best work consistently results from those working in GAWK. (footnote: The exception was a PASCAL programmer who is now an NSF graduate fellow getting a Ph.D. in mathematics at Harvard.) Programmers in C, C++, and LISP haven't even been close (we have not seen work in PROLOG or JAVA).

There are some quick answers that have to do with the pragmatics of undergraduate programming. Then there are more instructive answers that might be valuable to those who debate programming paradigms or to those who study the history of AI languages. And there are some deep philosophical answers that expose the nature of reasoning and symbolic AI. I think the answers, especially the last ones, can be even more surprising than the observed effectiveness of GAWK for AI.

First it must be confessed that PERL programmers can cobble together AI projects well, too. Most of GAWK's attractiveness is reproduced in PERL, and the success of PERL forebodes some of the success of GAWK. Both are powerful string-processing languages that allow the programmer to exploit many of the features of a UNIX environment. Both provide powerful constructions for manipulating a wide variety of data in reasonably efficient ways. Both are interpreted, which can reduce development time. Both have short learning curves. The GAWK manual can be consumed in a single lab session and the language can be mastered by the next morning by the average student. GAWK's automatic initialization, implicit coercion, I/O support and lack of pointers forgive many of the mistakes that young programmers are likely to make. Those who have seen C but not mastered it are happy to see that GAWK retains some of the same sensibilities while adding what must be regarded as a spoonful of syntactic sugar. Some will argue that PERL has superior functionality, but for quick AI applications, the additional functionality is rarely missed. In fact, PERL's terse syntax is not friendly when regular expressions begin to proliferate and strings contain fragments of HTML, WWW addresses, or shell commands. PERL provides new ways of doing things, but not necessarily ways of doing new things.

In the end, despite minor differences, both PERL and GAWK minimize programmer time. Neither really provides the programmer the setting in which to worry about minimizing run-time.

There are further simple answers. Probably the best is the fact that increasingly, undergraduate AI programming is involving the Web. Oren Etzioni (University of Washington, Seattle) has for a while been arguing that the "softbot" is replacing the mechanical engineers' robot as the most glamorous AI test bed. If the artifact whose behavior needs to be controlled in an intelligent way is the software agent, then a language that is well-suited to controlling the software environment is the appropriate language. That would imply a scripting language. If the robot is KAREL, then the right language is turn left; turn right. If the robot is Netscape, then the right language is something that can generate Netscape -remote 'openURL(http://cs.wustl.edu/~loui)' with elan.

Of course, there are deeper answers. Jon Bentley found two pearls in GAWK: its regular expressions and its associative arrays. GAWK asks the programmer to use the file system for data organization and the operating system for debugging tools and subroutine libraries. There is no issue of user-interface. This forces the programmer to return to the question of what the program does, not how it looks. There is no time spent programming a binsort when the data can be shipped to /bin/sort in no time. (footnote: I am reminded of my IBM colleague Ben Grosof's advice for Palo Alto: Don't worry about whether it's highway 101 or 280. Don't worry if you have to head south for an entrance to go north. Just get on the highway as quickly as possible.)
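
A small sketch may help readers who have not met the two "pearls" together. The word counter below uses a regular expression to pick out words, an associative array to tally them, and the system's sort utility instead of a hand-written sorting routine. The function name word_freq and the sample input are made up for this illustration.

```sh
# Hypothetical example: word frequencies via regex + associative array + sort.
word_freq() {
  awk '
    {
      line = tolower($0)
      # match() finds the next run of letters; RSTART and RLENGTH locate it
      while (match(line, /[a-z]+/)) {
        count[substr(line, RSTART, RLENGTH)]++
        line = substr(line, RSTART + RLENGTH)
      }
    }
    END {
      # No sorting code here: ship the tallies to the external sort command
      cmd = "sort -k1,1nr -k2,2"
      for (w in count) print count[w], w | cmd
      close(cmd)
    }
  '
}

# Usage
printf 'the cat saw the dog\nThe dog ran\n' | word_freq
# prints:
# 3 the
# 2 dog
# 1 cat
# 1 ran
# 1 saw
```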

There are some similarities between GAWK and LISP that are illuminating. Both provided a powerful uniform data structure (the associative array implemented as a hash table for GAWK and the S-expression, or list of lists, for LISP). Both were well-supported in their environments (GAWK being a child of UNIX, and LISP being the heart of lisp machines). Both have trivial syntax and find their power in the programmer's willingness to use the simple blocks to build a complex approach.

Deeper still, is the nature of AI programming. AI is about functionality and exploratory programming. It is about bottom-up design and the building of ambitions as greater behaviors can be demonstrated. Woe be to the top-down AI programmer who finds that the bottom-level refinements, `this subroutine parses the sentence,' cannot actually be implemented. Woe be to the programmer who perfects the data structures for that heap sort when the whole approach to the high-level problem needs to be rethought, and the code is sent to the junk heap the next day.

AI programming requires high-level thinking. There have always been a few gifted programmers who can write high-level programs in assembly language. Most however need the ambient abstraction to have a higher floor.

Now for the surprising philosophical answers. First, AI has discovered that brute-force combinatorics, as an approach to generating intelligent behavior, does not often provide the solution. Chess, neural nets, and genetic programming show the limits of brute computation. The alternative is clever program organization. (footnote: One might add that the former are the AI approaches that work, but that is easily dismissed: those are the AI approaches that work in general, precisely because cleverness is problem-specific.) So AI programmers always want to maximize the content of their program, not optimize the efficiency of an approach. They want minds, not insects. Instead of enumerating large search spaces, they define ways of reducing search, ways of bringing different knowledge to the task. A language that maximizes what the programmer can attempt rather than one that provides tremendous control over how to attempt it, will be the AI choice in the end.

Second, inference is merely the expansion of notation. No matter whether the logic that underlies an AI program is fuzzy, probabilistic, deontic, defeasible, or deductive, the logic merely defines how strings can be transformed into other strings. A language that provides the best support for string processing in the end provides the best support for logic, for the exploration of various logics, and for most forms of symbolic processing that AI might choose to call "reasoning" instead of "logic." The implication is that PROLOG, which saves the AI programmer from having to write a unifier, saves perhaps two dozen lines of GAWK code at the expense of strongly biasing the logic and representational expressiveness of any approach.
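
To illustrate the claim that a logic "merely defines how strings can be transformed into other strings", here is a minimal forward-chaining sketch in Awk. It is not taken from the essay: the rule syntax ("if A and B then C"), the function name forward_chain and the sample facts are assumptions made for this example, and it handles only ground propositions (no variables, so no unifier).

```sh
# Hypothetical example: propositional forward chaining over plain strings.
forward_chain() {
  awk '
    # Rule lines look like:  if A and B then C
    $1 == "if" {
      rule = $0
      sub(/^if +/, "", rule)
      idx = index(rule, " then ")
      if (idx == 0) next                     # malformed rule: ignore it
      n++
      premises[n] = substr(rule, 1, idx - 1)
      conclusion[n] = substr(rule, idx + 6)  # 6 is the length of " then "
      next
    }
    # Any other non-empty line is a known fact, stored as a string key
    NF { known[$0] = 1 }
    END {
      # Keep applying rules until a full pass derives nothing new
      do {
        changed = 0
        for (i = 1; i <= n; i++) {
          if (conclusion[i] in known) continue
          k = split(premises[i], p, / and /)
          ok = 1
          for (j = 1; j <= k; j++) if (!(p[j] in known)) ok = 0
          if (ok) {
            known[conclusion[i]] = 1
            changed = 1
            print "derived: " conclusion[i]
          }
        }
      } while (changed)
    }
  '
}

# Usage
printf 'socrates is a man\nif socrates is a man then socrates is mortal\n' | forward_chain
# prints: derived: socrates is mortal
```

Facts are just array keys and inference is just string lookup and string splitting, which is the point the paragraph above is making.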

I view these last two points as news not only to the programming language community, but also to much of the AI community that has not reflected on the past decade's lessons.

In the puny language, GAWK, which Aho, Weinberger, and Kernighan thought not much more important than grep or sed, I find lessons in AI's trends, AI's history, and the foundations of AI. What I have found not only surprising but also hopeful, is that when I have approached the AI people who still enjoy programming, some of them are not the least bit surprised.


categories: Top10,Misc,Papers,WhyAwk,Apr,2009,Ronl

In Praise of Scripting: Real Programming Pragmatism

by Ronald P. Loui
Associate Professor of CSE
Washington University in St. Louis

(Pre-publication draft; copyright reserved by author. A subsequent version of this document appeared as IEEE Computer, vol. 41, no. 7, July 2008).

    This article's main purpose is to review the changes in programming practices known collectively as the "rise of scripting," as predicted in 1998 IEEE COMPUTER by Ousterhout. This attempts to be both brief and definitive, drawing on many of the essays that have appeared in online forums. The main new idea is that programming language theory needs to move beyond semantics and take language pragmatics more seriously.

To the credit of this journal, it had the courage to publish the signal paper on scripting, John Ousterhout's "Scripting: Higher Level Programming for the 21st Century" in 1998. Today, that document rolls forward with an ever-growing list of positive citations. More importantly, every major observation in that paper seems now to be entrenched in software practice today; every benefit claimed for scripting appears to be genuine (flexibility of typelessness, rapid turnaround of interpretation, higher level semantics, development speed, appropriateness for gluing components and internet programming, ease of learning and increase in amount of casual programming).

Interestingly, IEEE COMPUTER also just printed one of the most canonical attacks on scripting, by one Diomidis Spinellis, 2005, "Java Makes Scripting Languages Irrelevant?" Part of what makes this attack interesting is that the author seems unconvinced of his own title; the paper concludes with more text devoted to praising scripting languages than it expends in its declaration of Java's progress toward improved usability. It is unclear what is a better recommendation for scripting: the durability of Ousterhout's text or the indecisiveness of this recent critic's.

The real shock is that the academic programming language community continues to reject the sea change in programming practices brought about by scripting. Enamored of the object-oriented paradigm, especially in the undergraduate curriculum, unwilling to accept the LAMP (Linux-Apache-MySQL-Perl/Python/Php) tool set, and firmly believing that more programming theory leads to better programming practice, the academics seem blind to the facts on the ground. The ACM flagship, COMMUNICATIONS OF THE ACM for example, has never published a paper recognizing the scripting philosophy, and the references throughout the ACM Digital Library to scripting are not encouraging.

Part of the problem is that scripting has risen in the shadow of object-oriented programming and highly publicized corporate battles between Sun, Netscape, and Microsoft with their competing software practices. Scripting has been appearing language by language, including object-oriented scripting languages now. Another part of the problem is that scripting is only now mature enough to stand up against its legitimate detractors. Today, there are answers to many of the persistent questions about scripting: is there a scripting language appropriate for the teaching of CS1 (the first programming course for majors in the undergraduate computing curriculum)? Is there a scripting language for enterprise or real-time applications? Is there a way for scripting practices to scale to larger software engineering projects?

I intend to review the recent history briefly for those who have not yet joined the debate, then present some of the answers that scripting advocates now give to those nagging questions. Finally, I will describe how a real pragmatism of academic interest in programming languages would have better prepared the academic computing community to see the changes that have been afoot.

1996-1998 are perhaps the most interesting years in the phylogeny of scripting. In those years, perl "held the web together", and together with a new POSIX awk and GNU gawk, was shipping with every new Linux. Meanwhile javascript was being deployed furiously (javascript bearing no important relation to java, having been renamed from "livescript" for purely corporate purposes, apparently a sign of Netscape's solidarity with Sun, and even renamed "jscript" under Microsoft). Also, a handoff from tcl/tk to python was taking place as the language of choice for GUI developers who would not yield to Microsoft's VisualBasic. Php appeared in those years, though it would take another round of development before it would start displacing server-side perl, cold fusion, and asp. Every one of these languages is now considered a classic, even prototypical, scripting language.

Already by mid-decade, the shift from scheme to java as the dominant CS1 language was complete, and the superiority of c++ over c was unquestioned in industry. But java applets were not well supported in browsers, so the appeal of "write once, run everywhere" quickly became derided as "write once, debug everywhere." Web page forms, which used the common gateway interface (cgi) were proliferating, and systems programming languages like c became recognized as overkill for server-side programming. Developers quickly discovered the main advantage of perl for cgi forms processing, especially in the dot-com setting: it minimized the programmer's write-time. What about performance? The algorithms were simple, network latency masked small delays, and database performance was built into the database software. It turned out that the bottleneck was the programming. Even at run-time, the network and disk properties were the problems, not the cpu processing. What about maintenance? The developers and management were both happy to rewrite code for redesigned services rather than deal with legacy code. Scripting, it turns out, was so powerful and programmer-friendly that it was easier to create new scripts from scratch than to modify old programs. What about user interface? After all, by 1990, most of the programming effort had become the writing of the GUI, and the object-oriented paradigm had much of its momentum in the inheritance of interface widget behaviors. Surprisingly, the interface that most programmers needed could be had in a browser. The html/javascript/cgi trio became the GUI, and if more was needed, then ambitious client-side javascript was more reliable than the browser's java virtual machine. Moreover, the server-side program was simply a better way to distribute automation in a heterogeneous internet than the downloadable client-side program, regardless of whether the download was in binary or bytecode.

Although there was not agreement on what exact necessary and sufficient properties characterized scripting and distinguished it from "more serious" programming, several things were clear:

  • scripting permitted rapid development, often regarded as merely "rapid prototyping," but subsequently recognized as a kind of agile programming;
  • scripting was the kind of high-level programming that had always been envisioned, in the ascent from low-level assembly language programming to higher levels of abstraction: it was concise, and it removed programmers from concerning themselves with many performance and memory management details;
  • scripting was well suited to the majority of a programming task, usually the accumulation, extraction, and transformation of data, followed eventually by its presentation, so that only the performance-critical portion of a project had to be written in a more cumbersome, high-performance language;
  • it was easier to get things right when source code was short, when behavior was determined by code that fit on a page, all types were easily coerced into strings for trace-printing, code fragments could be interpreted, identifiers were short, and when the programmer could turn ideas into code quickly without losing focus.

This last point was extremely counterintuitive. Strong typing, naming regimen, and verbosity were motivated mainly by a desire to help the programmer avoid errors. But the programmer who had to generate too many keystrokes and consult too many pages, who had to search through many different files to discover semantics, and who had to follow too many rules, who had to sustain motivation and concentration over a long period of time, was a distracted and consequently inefficient programmer. Just as vast libraries did not deliver the promise of greater reusability, and virtual machines did not deliver the promise of platform-independence, the language's promise to discipline the programmer quite simply did not reduce the tendency of humans to err. It exchanged one kind of frequent error for another.

Scripting languages became the favorite tools of the independent-minded programmers: the "hackers" yes, but also the gifted and genius programmers who tended to drive a project's design and development. As Paul Graham noted (in a column reprinted in "Hackers and Painters"), one of the lasting and legitimate benefits of java is that it permits managers to level the playing field and extract considerable productivity from the less talented and less motivated programmers (hence, more disposable). There was a corollary to this difference between the mundane and the liberating:

  • scripting was not enervating but was actually renewing: programmers who viewed code generation as tedious and tiresome in contrast viewed scripting as rewarding self-expression or recreation.

The distinct features of scripting languages that produce these effects are usually enumerated as semantic features, starting with low I/O specification costs, the use of implicit coercion and weak typing, automatic variable initialization with optional declaration, predominant use of associative arrays for storage and regular expressions for pattern matching, reduced syntax, and powerful control structures. But the main reason for the productivity gains may be found in the name "scripting" itself. To script an environment is to be powerfully embedded in that environment. In the same way that the dolphin reigns over the open ocean, lisp is a powerful language for those who would customize their emacs, javascript is feral among browsers, and gawk and perl rule the linux jungle.
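Several of the features just listed can be seen together in a few lines of awk. The following is only a sketch: the function name `word_freq` and the sample text are made up for illustration. It relies on automatic initialization (no declarations), an associative array indexed by strings, a regular expression to clean tokens, and implicit number-to-string coercion on output.

```shell
# Hypothetical helper: count word frequencies in text read from stdin.
word_freq() {
  awk '
    {
      for (i = 1; i <= NF; i++) {
        w = tolower($i)
        gsub(/[^a-z]/, "", w)      # regular expression strips punctuation
        if (w != "") count[w]++    # count[w] starts out as 0 without any declaration
      }
    }
    END {
      for (w in count) print count[w], w   # the number is coerced to a string on output
    }
  ' | sort -k1,1nr -k2,2
}

printf 'The cat saw the dog.\nThe dog ran.\n' | word_freq
```

The input and output specification costs are close to zero here: awk splits lines into fields on its own, and `sort` is attached with a pipe rather than a library call.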

There is even a hint of AI in the idea of scripting: the scripting language is the way to get high level control, to automate by capturing the intentions and routines normally provided by the human. If recording and replaying macros is a kind of autopilot, then scripting is a kind of proxy for human decisionmaking. Nowhere is this clearer than in simple server-side php, or in sysadmin shell scripting.

So where do we stand now? While it may have been risky for Ousterhout to proclaim scripting on the rise in 1998, it would be folly to dismiss the success of scripting today. It is even possible that java will yield its position of dominance in the near future. (By the time this essay is printed, LAMP and AJAX might be the new darlings of the tech press; see recent articles in Business Week, this IEEE COMPUTER, and even James Gosling's blog where he concedes he was wanting to write a scripting language when he was handed the java project. Java is very much in full retreat.) Is scripting ready to fill the huge vacuum that would be produced?

I personally believe that CS1 java is the greatest single mistake in the history of computing curricula. I believe this because of the empirical evidence, not because I have an a priori preference (I too voted to shift from scheme to java in our CS1, over a decade ago, so I am complicit in the java debacle). I reported in SIGPLAN 1996 ("Why gawk for AI?") that only the scripting programmers could generate code fast enough to keep up with the demands of the artificial intelligence laboratory class. Even though students were allowed to choose any language they wanted, and many had to unlearn the java ways of doing things in order to benefit from scripting, there were few who could develop ideas into code effectively and rapidly without scripting. In the intervening decade, little has changed. We actually see more scripting, as students are happy to compress images so that they can script their computer vision projects rather than stumble around in c and c++. In fact, students who learn to script early are empowered throughout their college years, especially in the crucial UNIX and web environments. Those who learn only java are stifled by enterprise-sized correctness and the chimerae of just-in-time compilation, swing, JRE, JINI, etc. Young programmers need to practice and produce, and to learn through mistakes why discipline is needed. They need to learn design patterns by solving problems, not by reading interfaces to someone else's black box code. It is imperative that programmers learn to be creative and inventive, and they need programming tools that support code exploration rather than code production.

What scripting language could be used for CS1? My personal preferences are gawk, javascript, php, and asp, mainly because of their very gentle learning curves. I don't think perl would be a disaster; its imperfection would create many teaching moments. But there is emerging consensus in the scripting community that python is the right choice for freshman programming. Ruby would also be a defensible choice. Python and ruby have the enviable properties that almost no one dislikes them, and almost everyone respects them. Both languages support a wide variety of programming styles and paradigms and satisfy practitioners and theoreticians equally. Both languages are carefully enough designed that "correct" programming practices can be demonstrated and high standards of code quality can be enforced. The fact that Google stands by python is an added motivation for undergraduate majors.

But do scripting solutions scale? What about the performance gap when the polynomial, or worse the exponential, algorithm faces large n, and the algorithm is written in an interpreted or weakly compiled language? What about software engineering in the large, on big projects? There has been a lot of discussion about scalability of scripts recently. In the past, debates have simply ended with the concession that large systems would have to be rewritten in c++, or a similar language, once the scripting had served its prototyping duty.

The enterprise question is the easier of the two. Just as the individual programmer reaps benefits from a division of labor among tools, writing most of the code in scripts, and writing all bottleneck code in a highly optimizable language, the group of programmers benefits from the use of multiple paradigms and multiple languages. In a recent large project, we used vhdl for fpga's with a lot of gawk to configure the vhdl. We used python and php to generate dynamic html with svg and javascript for the interfaces. We used c and c++ for high performance communications wrappers, which communicated xml to higher level scripts that managed databases and processes. We saw sysadmin and report-generation in perl, ruby, and gawk, data scrubbing in perl and gawk, user scripting in bash, tcl, and gawk, and prototyping in perl and gawk. Only one module was written in java (because that programmer loved java): it was late, it was slow, it failed, and it was eventually rewritten in c++. In retrospect, neither the high performance components nor the lightweight code components were appropriate for the java language. Does scripting scale to enterprise software? I would not manage a project that did not include a lot of scripting, to minimize the amount of "hard" programming, to increase flexibility and reduce delivery time at all stages, to take testing to a higher level, and to free development resources for components where performance is actually critical. I nearly weep when I think about the text processing that was written in c under my managerial watch, because the programmer did not know perl. We write five hundred line scripts in gawk that would be ten thousand line modules in java or c++. In view of the fact that there are much better scripting tools for most of what gets programmed in java and c++, perhaps the question is whether java and c++ scale.

How about algorithmic complexity? Don't scripting languages take too long to perform nested loops? The answer here is that a cpu-bound tight loop such as a matrix multiplication is indeed faster in a language like c. But such bottlenecks are easy to identify and indeed easy to rewrite in c. True system bottlenecks are things like paging, chasing pointers on disk, process initialization, garbage collection, fragmentation, cache mismanagement, and poor data organization. Often, we see that better data organization was unimplemented because it would have required more code, code that would have been attempted in an "easier" programming language like a scripting language, but which was too difficult to attempt in a "harder" programming language. We saw this in the AI class with heuristic search and computer vision, where brute force is better in c, but complex heuristics are better than brute force, and scripting is better for complex heuristics. When algorithms are exponential, it usually doesn't matter what language you use because most practical n will incur too great a cost. Again, the solution is to write heuristics, and scripting is the top dog in that house. Cpu's are so much faster than disks these days that a single extra disk read can erase the CPU advantage of using compiled c instead of interpreted gawk. In any case, java is hardly the first choice for those who have algorithmic bottlenecks.

The real reason why academics were blindsided by scripting is their lack of practicality. Academic computing was generally late to adopt Wintel architectures, late to embrace cgi programming, and late to accept Linux in the same decade that brought scripting's rise. Academia understandably holds industry at a distance. Still, there is a purely intellectual reason why programming language courses are only now warming to scripting. The historical concerns of programming language theory have been syntax and semantics. Java's amazing contribution to computer science is that it raised so many old-fashioned questions that tickled the talents of existing programming language experts: e.g., how can it be compiled? But there are new questions that can be asked, too, such as what a particular language is well-suited to achieve inexpensively, quickly, or elegantly, especially with the new mix of platforms. The proliferation of scripting languages represents a new age of innovation in programming practice.

Linguists recognize something above syntax and semantics, and they call it "pragmatics". Pragmatics has to do with more abstract social and cognitive functions of language: situations, speakers and hearers, discourse, plans and actions, and performance. We are entering an era of comparative programming language study when the issues are higher-level, social, and cognitive too.

My old friend, Michael Scott, has a popular textbook called PROGRAMMING LANGUAGE PRAGMATICS. But it is a fairly traditional tome concerned with parameter passing, types, and bindings (it's hard to see why it merits "pragmatics" in its title, even as it goes to second edition with a chapter on scripting added!). A real programming pragmatics would ask questions like:

  • how well does each language mate to the other UNIX tools?
  • what is the propensity in each language for programmers at various expertise levels to produce a memory leak?
  • what is the likelihood in each language that unmodified code will still function in five years?
  • what is the demand on a programmer's concentration, what is the load on her short-term memory of ontology, and what is the support for visual metaphor in each language?

There have been programming language "shootouts" and "scriptometers" on the internet that have sought to address some of the questions that are relevant to the choice of scripting language, but they have been just first steps. For example, one site reports on the shortest script in each scripting language that can perform a simple task. But absolute brevity for trivial tasks, such as "print hello world" is not as illuminating as typical brevity for real tasks, such as xml parsing.

Pragmatic questions are not the easiest questions for mathematically-inclined computer scientists to address. They refer by their nature to people, their habits, their sociology, and the technological demands of the day. But it is the importance of such questions that makes programmers choose scripting languages. Ousterhout declared scripting on the rise, but perhaps so too are programming language pragmatics.

Acknowledgements

I have to thank Charlie Comstock for contributing many ideas and references over the past two years that have shaped my views, especially the commitment to the idea of pragmatics.

About the Author

Prof. Dr. Loui and his students are the usual winners of the department programming contest and have contributed to current gnu releases of gawk and malloc. He has lectured on AI for two decades on five continents, taught AI programming for two decades, and is currently funded on a project delivering hardware and software on U.S. government contracts.


categories: Misc,Jan,2009,Timm

History

Recipe for a Language

  • 1 part egrep
  • 1 part snobol
  • 2 parts ed
  • 3 parts C
  • Blend all parts well using lex and yacc. Document minimally and release.
  • After eight years, add another part egrep and two more parts C. Document very well and release.

    Historical Notes

    From awk.freeshell.org:

    • 1977-1985: awk and nawk (now also known as 'old awk' or 'the old true awk'): the original version of the language, lacking many of the features that make it fun to play with now
    • 1985-1996: The GNU implementation, Gawk, was written in 1986 by Paul Rubin and Jay Fenlason, with advice from Richard Stallman. John Woods contributed parts of the code as well. In 1988 and 1989, David Trueman, with help from Arnold Robbins, thoroughly reworked Gawk for compatibility with the newer Awk.
    • 1996: BWK awk was released under an open license. Huzzah!
    • Sometime before the present, mawk, xgawk, jawk, awkcc, Kernighan's nameless awk-to-C++ compiler, awka, tawk and busybox awk came to be.

    It's a bit embarrassing to note that the exact origins of each are a bit hazy. This whole section requires further work, including the addition of links pointing to source repositories and binary distribution points.

    Awk Implementations

    Historical list of Awk implementations.

    Awk's Authors: Interviews


    categories: Misc,WhyAwk,Jan,2009,Timm

    Why Gawk?

    by T. Menzies

    "The Enlightened Ones say that....

    • You should never use C if you can do it with a script;
    • You should never use a script if you can do it with awk;
    • Never use awk if you can do it with sed;
    • Never use sed if you can do it with grep."

    Awk is a good old-fashioned UNIX filtering tool invented in the 1970s. The language is simple and Awk programs are generally very short. Awk is useful when the overhead of more sophisticated approaches is not worth the bother. Also, the cost of learning Awk is very low.

    But aren't there better scripting languages? Faster? Well, maybe yes and maybe no.

    And Awk is old (mid-70s). Aren't modern languages more productive? Well again, maybe yes and maybe no. One measure of the productivity of a language is how many lines of code are required to code up one business level `function point'. Compared to many popular languages, GAWK scores very highly:

    loc/fp   language
    ------   --------
    
        6,   excel 5
       13,   sql
       21,   awk       <================
       21,   perl
       21,   eiffel
       21,   clos
       21,   smalltalk
       29,   delphi
       29,   visual basic 5
       49,   ada 95
       49,   ai shells
       53,   c++
       53,   java
       64,   lisp
       71,   ada 83
       71,   fortran 95
       80,   3rd generation default
       91,   ansi cobol 85
       91,   pascal
      107,   2nd generation default
      107,   algol 68
      107,   cobol
      107,   fortran
      128,   c
      320,   1st generation default
      640,   machine language
     3200,   natural language
    

    Anyway, there are other considerations. Awk is real succinct, simple enough to teach, and easy enough to recode in C (if you want raw speed). For example, here's the complete listing of someone's Awk spell-checking program.

    BEGIN     {while ((getline word < "Usr.Dict.Words") > 0) dict[word]=1}
    !dict[$1] {print $1}
    

    Sure, there's about a gazillion enhancements you'd like to make on this one but you gotta say, this is real succinct.
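As one possible direction for those enhancements, here is a hedged sketch that checks every word on a line rather than only the first, ignores case and punctuation, and reports each unknown word once along with the line where it first appears. The function name `spell` and the one-word-per-line dictionary layout are assumptions made for illustration.

```shell
# Hypothetical helper: spell DICTFILE < text
# DICTFILE is assumed to hold one word per line.
spell() {
  awk -v dictfile="$1" '
    BEGIN {
      # Load the dictionary; the "> 0" test stops at end of file or on a read error.
      while ((getline word < dictfile) > 0) dict[tolower(word)] = 1
      close(dictfile)
    }
    {
      for (i = 1; i <= NF; i++) {
        w = tolower($i)
        gsub(/[^a-z]/, "", w)        # drop punctuation and digits
        if (w != "" && !(w in dict) && !(w in seen)) {
          seen[w] = 1                # report each unknown word only once
          print NR ": " w
        }
      }
    }
  '
}

# Example (file names are placeholders):
#   spell Usr.Dict.Words < essay.txt
```

It is still short, but note the trade-off: stripping every non-letter also mangles contractions and hyphenated words, so a real checker would need a more careful tokenizer.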

    Awk is the cure for late execution of software syndrome (a.k.a. LESS). The symptoms of LESS are a huge time delay before a new idea is executable. Awk programmers can hack up usable systems in the time it takes other programmers to boot their IDE. And, as a result of that easy exploration, it is possible to find loopholes missed by other analysts that lead to innovative, better solutions to the problems (e.g. see Ronald Loui's O(nlogn) clustering tool).

    Certainly, we can drool over the language features offered by more advanced languages like pointers, generic iterators, continuations, etc etc. And Awk's lack of data structures (except num, string, and array) requires some discipline to handle properly.

    But experienced Awk programmers know that the cleverer the program, the smaller the audience gets. If it is possible to explain something succinctly in a simple language like Awk, then it is also possible that more folks will read that code.

    Finally, and this may be the most important point, it might be misguided to argue about Awk vs LanguageX in terms of the specifics of those languages. Awk programmers can't over-elaborate their solutions; they are forced to code the solution in the simplest manner possible. This push to simplicity, to the essence of the problem, can be an insightful process. Coding in Awk is like preserving fruit: you boil off everything that is superfluous, that needlessly bloats the material you are working with. It is amazing how little code is required to code the core of an idea (e.g. see Darius Bacon's LISP interpreter, written in Awk).


    categories: SysAdmin,Papers,WhyAwk,Apr,2009,HenryS

    Awk: A Systems Programming Language?

    In the Proceedings of the Winter Usenix Conference (Dallas '91), Henry Spencer wrote in Awk As A Major Systems Programming Language that...

      ...even experienced Unix programmers often don't know awk, or know it but view it as a counterpart of sed: useful "glue" for sticking things together in shell programming, but quite unsuited for major programming tasks. This is a major underestimate of a very powerful tool, and has hampered the development of support software that would make awk much more useful.

      There is no fundamental reason why awk programs have to be small "glue" programs: even the "old" awk is a powerful programming language in its own right. Effective use of its data structures and its stream-oriented structure takes some adjustment for C programmers, but the results can be quite striking.

      On the other hand, getting there can be a bit painful, and improvements in both the language and its support tools would help.

    In 2009, Arnold Robbins comments:

      The paper is still interesting, although some bits are outdated (we now have a profiler, for instance).

    categories: Verification,Jul,2009,Admin

    Awk and Verification

    These pages focus on program verification tools, written in Awk.


    categories: Databases,Jul,2009,Admin

    Awk and Databases

    These pages focus on databases and Awk.


    categories: Games,Apr,2009,Admin

    Awk Games

    These pages focus on games, written in Awk.


    categories: Games,Nov,2009,PPuri

    An Awk Dungeon Adventure Game

    Contents

    Synopsis

    Download

    About

    Comments

    Author

    Synopsis

     gawk -f game.awk
    

    Download

    Download from LAWKER.

    About

    I wrote a small text-adventure game in awk - just to stretch the perception of awk, and show that it can be used as a programming language.

    This game is small, but gives a taste of the fantasy adventure games of the 80's - like Zork from Infocom.

    In this adventure, you are in a cave complex, and need to find the hidden gold to win. The adventure lets you move around, search, pick up objects, and use them. It uses a menu - not free-form entries.

    Here is the awk code:

    function intro() {
    	print
    	print "You are a brave adventurer. You have entered a hidden"
    	print "cave just outside town, that is rumored to hold gold!"
    	print "To win this adventure, you need to get the gold."
    }
     
    function invent() {
    	if (coin || axe || sword)
    		print "You are carrying: "
    	if (coin) print "coin"
    	if (axe) print "big, rusty battle axe"
    	if (sword) print "small sword"
    }
    function input( x ) {
    	printf( "\nCOMMAND> ")
    	# quit cleanly at end of input instead of re-prompting forever
    	if ((getline x) <= 0) exit
    	return x
    }
    function cave() {
    	print
    	print "You are standing in a cave. Sunlight gleams behind you"
    	print "from the entrance. In front of you, is a wooden door."
    	print "You see an opening to the left, and one to the right."
    	print
    	invent()
    	print
    	print "What do you want to do? "
    	print
    	print "(o)pen wooden door"
    	print "go (l)eft"
    	print "go (r)ight"
    	print "leave thru the (e)ntrance"
    	if (sword) print "break door with your (s)word"
    	if (axe) print "break door with your (a)xe"
    	print "(y)ell Open Sesame"
    	print "e(x)amine area"
    	print "read (i)ntroduction"
    	x = input()
    	if (x=="o") {print "The wooden door is shut tight."; cave()}
    	if (x=="l") {deadend()}
    	if (x=="r") {cave2()}
    	if (x=="e") {print "You decide to quit. Goodbye!";exit}
    	if (sword&&x=="s") {print "your sword breaks!";sword=0;cave()}
    	if (axe&&x=="a") {
    		print "You chop down the door and find the gold!!"
    		print "Great job, bold adventurer!"
    		print "This is the end of this adventure, but"
    		print "you have a promising career ahead of you!"
    		exit;
    	}
    	if (x=="y") {
    		print "A band of evil goblins passing by the entrance"
    		print "hear you, enter the cave, and kill you"
    		exit;
    	}
    	if (x=="x") {print "You find nothing";cave()}
    	if (x=="i") {intro();cave()}
    	print "What do you want to do?";cave()
    }
     
    function deadend() {
    	print
    	print "You are in a dead end"
    	print
    	invent()
    	print
    	print "What do you want to do? "
    	print
    	print "go (b)ack"
    	print "e(x)amine area"
    	print "read (i)ntroduction"
    	x= input();
    	if (x=="b") {cave()}
    	if (x=="x") {print "You find a sword!";sword=1;deadend()}
    	if (x=="i") {intro();deadend()}
    	print "What do you want to do?";deadend()
    }
     
    function cave2() {
    	print
    	print "You are in another cave."
    	print "You can go back, or explore a niche to the left."
    	print
    	invent()
    	print
    	print "What do you want to do? "
    	print
    	print "go (b)ack"
    	print "enter (n)iche"
    	if (rubble) print "(s)earch rubble"
    	print "e(x)amine area"
    	print "read (i)ntroduction"
    	x = input()
    	if (x=="b") {cave()}
    	if (x=="n") {niche()}
    	if (rubble&&x=="s"&&!coin) {print "you found a coin!";coin=1;cave2()}
    	if (rubble&&x=="s"&&coin) {print "you found nothing!";cave2()}
    	if (x=="x") {print "You see a pile of rubble";rubble=1;cave2()}
    	if (x=="i") {intro();cave2()}
    	print "What do you want to do?";cave2()
    }
     
    function niche() {
    	print
    	print "You are in a niche."
    	print "There is a dwarf here!"
    	print
    	invent()
    	print
    	print "What do you want to do? "
    	print
    	print "go (b)ack"
    	print "(t)alk to dwarf"
    	if (!sword&&!axe) print "(f)ight dwarf"
    	if (sword) print "fight dwarf with (s)word"
    	if (axe) print "fight dwarf with (a)xe"
    	if (coin) print "(o)ffer coin to dwarf"
    	print "e(x)amine area"
    	print "read (i)ntroduction"
    	x = input()
    	if (x=="b") {cave2()}
    	if (x=="t") {print "The dwarf grunts";niche()}
    	if (x=="f") {print "The dwarf kills you";exit}
    	if (x=="s") {print "The dwarf kills you";exit}
    	if (x=="a") {print "The dwarf kills you";exit}
    	if (coin&&x=="o") {print "The dwarf takes the coin and gives you an axe!";coin=0;axe=1;niche()}
    	if (x=="x") {print "You find nothing";niche()}
    	if (x=="i") {intro();niche()}
    	print "What do you want to do?";niche()
    }
     
    BEGIN { intro(); cave() }
    

    Comments

    This is one of the longest awk programs that I have written. Notice that it is function-driven. I have created functions to give the introduction, and the inventory, and I have created functions for each room.

    The awk program is kicked off by the BEGIN section, which runs intro() and cave() to put you in the first room.

    Each object is represented by a variable of the same name (e.g. sword for the sword) and is either 0 (off) or 1 (on), depending on whether you have the object.

    Each function prints descriptions and gives options, depending on the settings of these boolean variables.
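
The flag-driven menus can be sketched in isolation. The helper below is hypothetical (it is not part of the game); it builds the option list for the niche from three inventory flags, using the same conditions as the niche() function above.

```awk
# Hypothetical helper, not in the original game: list the actions
# available in the niche for a given set of inventory flags (0 or 1).
function options(sword, axe, coin,    s) {
    s = "go (b)ack"
    if (!sword && !axe) s = s ", (f)ight dwarf"
    if (sword)          s = s ", fight dwarf with (s)word"
    if (axe)            s = s ", fight dwarf with (a)xe"
    if (coin)           s = s ", (o)ffer coin to dwarf"
    return s
}
BEGIN {
    print options(0, 0, 0)   # empty-handed
    print options(1, 0, 1)   # carrying the sword and the coin
}
```

Passing the flags as parameters, rather than reading globals as the game does, is only a convenience that makes the helper easy to try with different inventories.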

    Author

    Praveen Puri has been a programmer and full-time trader. He is the author of Stock Trading Riches which teaches his stock trading system.


    categories: Games,Top10,TenLiners,Mar,2009,BrianK

    Story.awk

    Contents

    Synopsis

    echo Goal | gawk -f story.awk [ -v Grammar=FILE ] [ -v Seed=NUMBER ] 
    echo Goal | gawk -f storyp.awk [ -v Grammar=FILE ] [ -v Seed=NUMBER ] 
    

    Download

    Download from LAWKER.

    Description

    This code inputs a set of productions and outputs a string of words that satisfy the production rules.

    This page describes two versions of that system: story.awk and storyp.awk. The former selects productions at random with equal probability. The latter lets the user bias the selection by adding a weight at the end of each line, after the production.

    Options

    -v Grammar=FILE
    Sets the FILE containing the productions. Defaults to "grammar".
    -v Seed=NUM
    Sets the seed for the random number generator. Defaults to "1". A useful idiom for generating random text is to use Seed=$RANDOM

    Examples

    A Short Example

    This grammar ...

    Sentence -> Nounphrase Verbphrase   
    Nounphrase -> the boy              
    Nounphrase -> the girl           
    Verbphrase -> Verb Modlist Adverb 
    Verb -> runs                    
    Verb -> walks                  
    Modlist ->                    
    Modlist -> very Modlist      
    Adverb -> quickly           
    Adverb -> slowly           
    
    ... and this input ...
    for i in 1 2 3 4 5 6 7 8 9 10;do
    	echo Sentence | 
    	gawk -f ../story.awk -v Grammar=english.rules -v Seed=$i | 
    	fmt
    done
    
    ... generates these sentences:
    the boy runs very slowly
    the girl runs slowly
    the boy runs very slowly
    the girl walks very very quickly
    the boy runs quickly
    the girl walks very very slowly
    the boy walks very very very very very very quickly
    the boy walks very quickly
    the girl runs slowly
    the girl runs very quickly
    

    A Longer Example

    Here is Gahan Wilson's sci-fi plot generator ...

    Using the above, we can generate the following stories:

    
     Earth scientists invent giant bugs who want Our Women,  And Take
     A Few And Leave
    
     Earth is Attacked By tiny lunar superbeings who  Under Stand and
     Are Not radioactive and can not be killed by the Navy but They Die
     From Catching A Cold
    
     Earth scientists invent enormous bugs who are Friendly and and
     They Get Married And Live Happily Forever After
    
     Earth is Struck By A Giant cloud and Magically Saved
    
     Earth scientists invent giant bugs who  Under Stand and Are Not
     radioactive and can not be killed by the Air Force so They Kill
     Us
    
     Earth is Attacked By enormous extra Galactic blobs who  Under Stand
     and Are Not radioactive and can be killed by the Air Force
    
     Earth scientists discover enormous blobs who  Under Stand and Are
     Not radioactive and can be killed by a Crowd Of Peasants
    
     Earth falls Into Sun and  Some  Resuced
    
     Earth is Struck By A Giant comet but Is Saved
    
     Earth is Struck By A Giant comet and Is Destroyed
    

    This is generated from the following code:

    for i in 1 2 3 4 5 6 7 8 9 10;do
    	echo
    	echo Start | 
    	gawk -f ../story.awk -v Grammar=scifi.rules -v Seed=$i | 
    	fmt
    done
    

    running on the following grammar:

    Start      -> Earth IsStressed
    IsStressed -> Catestrophes 
    IsStressed -> Science 
    IsStressed -> Attack 
    IsStressed -> Collision
    
    Catestrophes -> Catestrophe and PossibleMegaDeath
    
    Catestrophe -> burnsUp 
    Catestrophe -> freezes
    Catestrophe -> fallsIntoSun
    
    Collision -> isStruckByAGiant Floater AndThen
    
    Floater -> comet
    Floater -> asteroid
    Floater -> cloud
    
    AndThen -> butIsSaved
    AndThen -> andIsDestroyed
    AndThen -> andMagicallySaved
    
    
    PossibleMegaDeath -> everybodyDies
    PossibleMegaDeath -> Some GoOn 
    
    SomeSaved ->  somePeople
    SomeSaved ->  everybody
    SomeSaved ->  almostEverybody
      
    GoOn -> dies
    GoOn -> Resuced
    GoOn -> Saved
     
    Rescued -> isRescuedBy Sizes Extraterestrial Beings
    Saved   -> butIsSavedBy SomeOne scientists the  Science
    
    SomeOne -> earth
    SomeOne -> extraterestrial
    
    Science -> scientists DoSomething Sizes Beings Whichetc
    
    DoSomething -> invent
    DoSomething -> discover
    
    Attack -> isAttackedBy Sizes Extraterestrial Beings Whichetc
    
    Sizes -> tiny 
    Sizes -> giant 
    Sizes -> enormous
     
    Extraterestrial -> martian
    Extraterestrial -> lunar
    Extraterestrial -> extraGalactic
    
    Beings -> bugs
    Beings -> reptiles
    Beings -> blobs
    Beings -> superbeings
    
    Whichetc -> who WantSomething
    
    WantSomething -> WantWomen
    WantSomething -> areFriendly  and DenoumentOrHappyEnding
    WantSomething -> UnderStand ButEtc
    
    Understand -> areFriendly butMisunderstood
    Understand -> misunderstandUs
    Understand -> understandUsAllTooWell
    Understand -> hungry
    
    DenoumentOrHappyEnding -> Denoument
    DenoumentOrHappyEnding -> HappyEnding
     
    Dine -> Hungry and eat us Denoument?
    
    WhichEtc -> 
    Hungry -> lookUponUsAsASourceOfNourishment
    
    WantWomen -> wantOurWomen, AndTakeAFewAndLeave
    
    ButEtc -> AndAre radioactive and TryToKill
    
    AndAre -> andAre
    AndAre -> andAreNot
    
    Killers -> Killer 
    Killers -> Killer and Killer
    
    Killer -> aCrowdOfPeasants
    Killer -> theArmy
    Killer -> theNavy
    Killer -> theAirForce
    Killer -> theMarines
    Killer -> theCoastGuard
    Killer -> theAtomBomb
    
    TryToKill -> can be killed by Killers
    TryToKill -> can not be killed by Killers SoEtc
    
    SoEtc -> butTheyDieFromCatchingACold
    SoEtc -> soTheyKillUs
    SoEtc -> soTheyPutUsUnderABenignDictatorShip
    SoEtc -> soTheyEatUs
    SoEtc -> soScientistsInventAWeapon Which
    SeEtc -> but Denoument
    
    Which -> whichTurnsThemIntoDisgustingLumps
    Which -> whichKillsThem
    Which -> whichFails SoEtc
    
    Denomument? ->  
    Denomument? -> Denoument  
    
    Denoument ->  aCuteLittleKidConvincesThemPeopleAreOk Ending
    Denoument -> aPriestTalksToThemOfGod Ending
    Denoument -> theyFallInLoveWithThisBeautifulGirl EndSadOrHappy
    
    EndSadOrHappy -> Ending
    EndSadOrHappy -> HappyEnding
    
    Ending -> andTheyDie
    Ending -> andTheyLeave
    Ending -> andTheyTurnIntoDisgustingLumps
    
    HappyEnding -> andTheyGetMarriedAndLiveHappilyForeverAfter
    

    Biasing the Story

    Here is a grammar suitable for storyp.awk. Note the number at the end of each line, which biases how often a production is selected. For example, "runs" and "slowly" are nine times more likely than the other Verb and Adverb.

    Sentence -> Nounphrase Verbphrase   1
    Nounphrase -> the boy               0.75
    Nounphrase -> the girl              0.25
    Verbphrase -> Verb Modlist Adverb   1
    Verb -> runs                        0.9
    Verb -> walks                       0.1
    Modlist ->                          0.5
    Modlist -> very Modlist             0.5
    Adverb -> quickly                   0.1
    Adverb -> slowly                    0.9
    
    The following code executes the biased story generation:
    for((i=1;i<=10;i++)); do echo Sentence ;  done |
    gawk -f ../storyp.awk -v Grammar=englishp.rules 
    

    This produces the following output. Note that, usually, we run slowly.

    the boy runs very slowly 
    the boy runs slowly 
    the girl runs very slowly 
    the boy runs slowly 
    the boy runs slowly 
    the girl walks very slowly 
    the boy walks slowly 
    the girl runs slowly 
    the boy runs slowly 
    the boy runs slowly 
    

    Code

    Story.awk

    BEGIN { 
        srand(Seed ? Seed : 1) 
    	Grammar = Grammar ? Grammar : "grammar"
    	while ((getline < Grammar) > 0)
    	    if ($2 == "->") {
    		    i = ++lhs[$1]              # count lhs
    		    rhscnt[$1, i] = NF-2       # how many in rhs
    		    for (j = 3; j <= NF; j++)  # record them
    		        rhslist[$1, i, j-2] = $j
    	    } else
    		     if ($0 !~ /^[ \t]*$/)
            	    print "illegal production: " $0
    }
    {   if ($1 in lhs) {  # nonterminal to expand
            gen($1)
            printf("\n")
        } else 
            print "unknown nonterminal: " $0   
    }
    function gen(sym,    i, j) {
        if (sym in lhs) {       # a nonterminal
            i = int(lhs[sym] * rand()) + 1   # random production
            for (j = 1; j <= rhscnt[sym, i]; j++) # expand rhs's
                gen(rhslist[sym, i, j])
        } else {
            gsub(/[A-Z]/," &",sym)
            printf("%s ", sym) }
    }
    

    Storyp.awk

    Storyp.awk is almost the same as story.awk but it is assumed that each line ends in a number that will bias how often that production gets selected.

    BEGIN {
        srand(Seed ? Seed : 1) 
        Grammar = Grammar ? Grammar : "grammar"
        while ((getline < Grammar) > 0)
            if ($2 == "->") {
                i = ++lhs[$1]              # count lhs
                rhsprob[$1, i] = $NF       # 0 <= probability <= 1
                rhscnt[$1, i] = NF-3       # how many in rhs
                for (j = 3; j < NF; j++)   # record them
                   rhslist[$1, i, j-2] = $j
            } else
                print "illegal production: " $0
        for (sym in lhs)
             for (i = 2; i <= lhs[sym]; i++)
                rhsprob[sym, i] += rhsprob[sym, i-1]
    }
    {   if ($1 in lhs) {  # nonterminal to expand
             gen($1)
             printf("\n")
         } else 
             print "unknown nonterminal: " $0   
    }
    function gen(sym,    i, j) {
        if (sym in lhs) {       # a nonterminal
            j = rand()          # random production
            for (i = 1; i <= lhs[sym] && j > rhsprob[sym, i]; i++) ;       
            for (j = 1; j <= rhscnt[sym, i]; j++) # expand rhs's
                gen(rhslist[sym, i, j])
        } else
            printf("%s ", sym)
    }
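
The weighted choice in storyp.awk relies on running totals of the weights: a uniform random draw selects the first production whose running total it does not exceed. A minimal sketch of that idea follows, using the Verb weights from the biased grammar above. The names pick, word, w and cum are assumptions made for this illustration, and the draw is passed in as a parameter (instead of calling rand()) so the result is repeatable.

```awk
# pick(r): index of the production selected by a draw r in [0,1).
# Walks the running totals in cum[] until r no longer exceeds them.
function pick(r,    i) {
    for (i = 1; i < n && r > cum[i]; i++) ;
    return i
}
BEGIN {
    # Weights copied from the Verb rules: runs 0.9, walks 0.1
    word[1] = "runs";  w[1] = 0.9
    word[2] = "walks"; w[2] = 0.1
    n = 2
    cum[1] = w[1]
    for (i = 2; i <= n; i++)
        cum[i] = cum[i-1] + w[i]     # running totals: 0.9, 1.0
    print word[pick(0.05)]           # a low draw lands in the "runs" band
    print word[pick(0.95)]           # a high draw lands in the "walks" band
}
```

Unlike the loop in storyp.awk, this sketch stops at the last production even when the draw exceeds every running total, which can happen if the weights sum to slightly less than 1.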
    

    Author

    The code comes from Alfred Aho, Brian Kernighan, and Peter Weinberger from the book "The AWK Programming Language", Addison-Wesley, 1988.

    The scifi grammar was written by Tim Menzies, 2009, and is based on Gahan Wilson's sci-fi plot generator: "The Science Fiction Horror Movie Pocket Computer" ( in "The Year's Best Science Fiction No. 5", edited by Harry Harrison and Brian Aldiss, Sphere, London, 1972).


    categories: TenLiners,Mar,2009,DonaldM

    The Monty Hall Problem

    Donald 'Paddy' McCarthy has a nice Awk solution to the Monty Hall Problem, which he describes as follows:

    • The contestant is in front of three doors that he cannot see behind.
    • The three doors conceal one prize and the rest are booby prizes, arranged randomly.
    • The host asks the contestant to choose a door.
    • The host then goes behind the doors, where only he can see what is concealed, and always opens one of the doors not chosen by the contestant, revealing a booby prize.
    • The host then asks the contestant if he would like either to stick with his previous choice, or switch and choose the other remaining closed door.

    It turns out that if the contestant follows a strategy of always switching when asked, then he will maximise his chances of winning. Donald's simulator shows that:

    • A strategy of never switching wins 1/3rd of the time.
    • A strategy of randomly switching wins 1/2 of the time.
    • A strategy of always switching wins 2/3rds of the time.
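
These fractions can also be checked exactly, without simulation, by enumerating every combination of prize door and first choice. The sketch below is not part of Donald's program. It assumes three doors and uses the fact that, once the host has opened a booby-prize door, switching wins exactly when the first choice was wrong.

```awk
# Count winning cases over all 9 equally likely (prize, choice) pairs.
function wins(strategy,    prize, choice, n) {
    n = 0
    for (prize = 0; prize < 3; prize++)
        for (choice = 0; choice < 3; choice++) {
            if (strategy == "keep"   && choice == prize) n++   # first pick was right
            if (strategy == "switch" && choice != prize) n++   # first pick was wrong
        }
    return n
}
BEGIN {
    printf "keep:   %d/9\n", wins("keep")
    printf "switch: %d/9\n", wins("switch")
    # A coin flip between the two strategies averages their win counts.
    printf "random: %.1f/9\n", (wins("keep") + wins("switch")) / 2
}
```

This prints 3/9, 6/9 and 4.5/9, matching the 1/3, 2/3 and 1/2 listed above.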

    Code

    BEGIN {
    	srand()
    	doors = 3
    	iterations = 10000
    	# Behind a door: 
    	EMPTY = "empty"; PRIZE = "prize"
    	# Algorithm used
        KEEP = "keep"; SWITCH="switch"; RAND="random"; 
    }
    function monty_hall( choice, algorithm ) { # Set up doors
      for ( i=0; i<doors; i++ ) {
    		door[i] = EMPTY
    	}
    	door[int(rand()*doors)] = PRIZE # One door with prize
    
      chosen = door[choice]
      delete door[choice]
    
      #if you didn't choose the prize first time around then
      # that will be the alternative
    	alternative = (chosen == PRIZE) ? EMPTY : PRIZE 
    
    	if( algorithm == KEEP) {
    		return chosen
    	} 
    	if( algorithm == SWITCH) {
    		return alternative
    	} 
    	return rand() <0.5 ? chosen : alternative
    }
    function simulate(algo){
    	prizecount = 0
    	for(j=0; j< iterations; j++){
    		if( monty_hall( int(rand()*doors), algo) == PRIZE) { 
    			prizecount ++ 
    		}
    	}
    	printf "  Algorithm %7s: prize count = %i, = %6.2f%%\n", \
    		algo, prizecount,prizecount*100/iterations
    }
    BEGIN {
    	print "\nMonty Hall problem simulation:"
    	print doors, "doors,", iterations, "iterations.\n"
    	simulate(KEEP)
    	simulate(SWITCH)
    	simulate(RAND)
    }
    

    Sample Output

    gawk -f montyHall.awk
    
    Monty Hall problem simulation:
    3 doors, 10000 iterations.
    
      Algorithm    keep: prize count = 3411, =  34.11%
      Algorithm  switch: prize count = 6655, =  66.55%
      Algorithm  random: prize count = 4991, =  49.91%
    

    categories: Top10,TenLiners,Mar,2009,ScottP

    Predicting Gender

    Contents

    Synopsis

    echo name | gawk -f gender.awk
    

    Download

    Download from LAWKER

    Description

    The following code predicts gender, given a first name.

    This code is an excellent example of rule-based programming in Awk.
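
The key idea is that every rule is tried in order and the last matching rule wins. The sketch below shows this with only three of the patterns from the program, wrapped in a hypothetical guess() function so that it runs without input; with so few rules its answers can differ from those of the full program.

```awk
# Hypothetical wrapper around a small subset of the rules.
# Later matches overwrite earlier ones, as in the full rule list.
function guess(name,    sex) {
    sex = "m"                                       # assume male
    if (name ~ /[aeiy]$/)           sex = "f"       # names ending in a/e/i/y
    if (name ~ /^[^S].*r[rv]e?y?$/) sex = "m"       # Barry, Larry, Perry,...
    if (name ~ /^[^G].*v[ei]$/)     sex = "m"       # Clive, Dave, Steve,...
    return sex
}
BEGIN {
    n = split("John Maria Barry Steve", names, " ")
    for (i = 1; i <= n; i++)
        print names[i], guess(names[i])
}
```

"Barry" and "Steve" first match the a/e/i/y rule and are then overridden by a later, more specific rule.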

    For a full description of the code, see

    Code

                                              { sex = "m" } # Assume male.
    
    /^.*[aeiy]$/                              { sex = "f" }  # Female names endng in a/e/i/y.
    /^All?[iy]((ss?)|z)on$/                   { sex = "f" }  # Allison (and variations)
    /^.*een$/                                 { sex = "f" }  # Cathleen, Eileen, Maureen,...
    /^[^S].*r[rv]e?y?$/                       { sex = "m" }  # Barry, Larry, Perry,...
    /^[^G].*v[ei]$/                           { sex = "m" }  # Clive, Dave, Steve,...
    /^[^BD].*(b[iy]|y|via)nn?$/               { sex = "f" }  # Carolyn,Gwendolyn,Vivian,...
    /^[^AJKLMNP][^o][^eit]*([glrsw]ey|lie)$/  { sex = "m" }  # Dewey, Stanley, Wesley,...
    /^[^GKSW].*(th|lv)(e[rt])?$/              { sex = "f" }  # Heather, Ruth, Velvet,...
    /^[CGJWZ][^o][^dnt]*y$/                   { sex = "m" }  # Gregory, Jeremy, Zachary,...
    /^.*[Rlr][abo]y$/                         { sex = "m" }  # Leroy, Murray, Roy,...
    /^[AEHJL].*il.*$/                         { sex = "f" }  # Abigail, Jill, Lillian,...
    /^.*[Jj](o|o?[ae]a?n.*)$/                 { sex = "f" }  # Janet, Jennifer, Joan,...
    /^.*[GRguw][ae]y?ne$/                     { sex = "m" }  # Duane, Eugene, Rene,...
    /^[FLM].*ur(.*[^eotuy])?$/                { sex = "f" }  # Fleur, Lauren, Muriel,...
    /^[CLMQTV].*[^dl][in]c.*[ey]$/            { sex = "m" }  # Lance, Quincy, Vince,...
    /^M[aei]r[^tv].*([^cklnos]|([^o]n))$/     { sex = "f" }  # Margaret, Marylou, Miriam,...
    /^.*[ay][dl]e$/                           { sex = "m" }  # Clyde, Kyle, Pascale,...
    /^[^o]*ke$/                               { sex = "m" }  # Blake, Luke, Mike,...
    /^[CKS]h?(ar[^lst]|ry).+$/                { sex = "f" }  # Carol, Karen, Sharon,...
    /^[PR]e?a([^dfju]|qu)*[lm]$/              { sex = "f" }  # Pam, Pearl, Rachel,...
    /^.*[Aa]nn.*$/                            { sex = "f" }  # Annacarol, Leann, Ruthann,...
    /^.*[^cio]ag?h$/                          { sex = "f" }  # Deborah, Leah, Sarah,...
    /^[^EK].*[grsz]h?an(ces)?$/               { sex = "f" }  # Frances, Megan, Susan,...
    /^[^P]*([Hh]e|[Ee][lt])[^s]*[ey].*[^t]$/  { sex = "f" }  # Ethel, Helen, Gretchen,...
    /^[^EL].*o(rg?|sh?)?(e|ua)$/              { sex = "m" }  # George, Joshua, Theodore,..
    /^[DP][eo]?[lr].*se$/                     { sex = "f" }  # Delores, Doris, Precious,...
    /^[^JPSWZ].*[denor]n.*y$/                 { sex = "m" }  # Anthony, Henry, Rodney,...
    /^K[^v]*i.*[mns]$/                        { sex = "f" }  # Karin, Kim, Kristin,...
    /^Br[aou][cd].*[ey]$/                     { sex = "m" }  # Bradley, Brady, Bruce,...
    /^[ACGK].*[deinx][^aor]s$/                { sex = "f" }  # Agnes, Alexis, Glynis,...
    /^[ILW][aeg][^ir]*e$/                     { sex = "m" }  # Ignace, Lee, Wallace,...
    /^[^AGW][iu][gl].*[drt]$/                 { sex = "f" }  # Juliet, Mildred, Millicent,...
    /^[ABEIUY][euz]?[blr][aeiy]$/             { sex = "m" }  # Ari, Bela, Ira,...
    /^[EGILP][^eu]*i[ds]$/                    { sex = "f" }  # Iris, Lois, Phyllis,...
    /^[ART][^r]*[dhn]e?y$/                    { sex = "m" }  # Randy, Timothy, Tony,...
    /^[BHL].*i.*[rtxz]$/                      { sex = "f" }  # Beatriz, Bridget, Harriet,...
    /^.*oi?[mn]e$/                            { sex = "m" }  # Antoine, Jerome, Tyrone,...
    /^D.*[mnw].*[iy]$/                        { sex = "m" }  # Danny, Demetri, Dondi,...
    /^[^BG](e[rst]|ha)[^il]*e$/               { sex = "m" }  # Pete, Serge, Shane,...
    /^[ADFGIM][^r]*([bg]e[lr]|il|wn)$/        { sex = "f" }  # Angel, Gail, Isabel,...
    
                                              { print sex }  # Output prediction
    

    Author

    by Scott Pakin, August 1991

    categories: Games,TenLiners,Eliza,Apr,2009,JuergenK

    Eliza

    Contents

    Synopsis

    gawk -f eliza.awk

    Download

    Download from LAWKER.

    Description

    From Wikipedia: ELIZA was a computer program and an early example of (by modern standards) primitive natural language processing. ELIZA operated by processing users' responses to scripts, the most famous of which was DOCTOR, a simulation of a Rogerian psychotherapist. In this mode, ELIZA mostly rephrased the user's statements as questions and posed those to the 'patient.' ELIZA was written by Joseph Weizenbaum between 1964 and 1966.
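
The rephrasing trick can be sketched with a single keyword. The respond() function below is a hypothetical, cut-down illustration rather than the program's own logic: it looks for "I FEEL", takes the rest of the sentence as the subject, and drops it into a template with a "_" placeholder. Both reply strings are taken from the answer table in the code below.

```awk
# Hypothetical one-keyword responder: find the keyword, take what
# follows it as the subject, and substitute it for "_" in a template.
function respond(said,    q, key, pos, subj, out) {
    q = toupper(said)
    key = "I FEEL "
    pos = index(q, key)
    if (pos == 0)
        return "CAN YOU ELABORATE ON THAT ?"    # no keyword recognized
    subj = substr(q, pos + length(key))
    out = "DO YOU OFTEN FEEL _ ?"
    sub(/_/, subj, out)                         # insert subject into answer
    return out
}
BEGIN {
    print respond("I feel tired")
    print respond("Dogs remind me of something")
}
```

The real program keeps many keywords and several answers per keyword, and chooses among the answers at random.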

    Example dialogue

    This is a Gawk implementation of the famous ELIZA program by Joseph Weizenbaum.
    To end, type SHUT UP
    
    HI, IM ELIZA, TELL ME YOUR PROBLEM
    I hate dogs
    CAN YOU ELABORATE ON THAT ?
    Dogs remind me of Mylie Cyrus
    CAN YOU ELABORATE ON THAT ?
    You're kidding, right? Mylie Cyrus is one of the biggest stars on the planet
    WHAT MAKES YOU THINK I AM KIDDING, RIGHT? MYLIE CYRUS IS ONE OF THE BIGGEST STARS ON THE PLANET ?
    Cause you did not know about Mylie
    WE WERE DISCUSSING YOU -- NOT ME
    No,  I was discussing Mylie
    Oh, I ID NOT KNOW ABOUT MYLIE
    Well aren't you stupid
    Oh, I TUPID
    can you get her number?
    WE WERE DISCUSSING YOU -- NOT ME
    sorry
    PLEASE DON'T APOLOGIZE !
    but you sound so angry
    YOU'RE NOT REALLY TALKING ABOUT ME, ARE YOU ?
    i think you are  a robot
    Oh, I RE  A ROBOT
    yes
    I UNDERSTAND
    no you don't
    YOU'RE NOT REALLY TALKING ABOUT ME, ARE YOU ?
    hey- is this eliza?
    WHAT'S ABOUT ME ?
    I want to take you to dinner
    YOU'RE NOT REALLY TALKING ABOUT ME, ARE YOU ?
    I feel you are really stupid
    WE WERE DISCUSSING YOU -- NOT ME
    SHUT UP
    WELL, PLEASE PAY YOUR BILL. ITS EXACTLY ... $101
    

    Code

    Set up

    BEGIN {
    	SetUpEliza() 
    	print "This is a Gawk implementation of the "\
              "famous ELIZA program by Joseph Weizenbaum. "\
    	      "To end, type SHUT UP\n";
    	print ElizaSays("");
    }
    { print ElizaSays($0) }
    

    ElizaSays

    function ElizaSays(YouSay) {
     if (YouSay == "") {
       cost = 0
       answer = "HI, IM ELIZA, TELL ME YOUR PROBLEM"
     } else {
       q = toupper(YouSay)
       gsub("'", "", q)
       if(q == qold) {
         answer = "PLEASE DONT REPEAT YOURSELF !"
       } else {
         if (index(q, "SHUT UP") > 0) {
           answer = "WELL, PLEASE PAY YOUR BILL. ITS EXACTLY ... $"\
                    int(100*rand()+30+cost/100)
         } else {
           qold = q
           w = "-"                 # no keyword recognized yet
           for (i in k) {          # search for keywords
             if (index(q, i) > 0) {
               w = i
               break
             }
           }
           if (w == "-") {         # no keyword, take old subject
             w    = wold
             subj = subjold
           } else {                # find subject
             subj = substr(q, index(q, w) + length(w)+1)
             wold = w
             subjold = subj        #  remember keyword and subject
           }
           for (i in conj)
              gsub(i, conj[i], q)   # conjugation
           # from all answers to this keyword, select one randomly
           answer = r[indices[int(split(k[w], indices) * rand()) + 1]]
           # insert subject into answer
           gsub("_", subj, answer)
         }
       }
     }
     cost += length(answer) # for later payment : 1 cent per character
     return answer
    }
    

    SetUpEliza

    function SetUpEliza() {
     srand()
     wold = "-"
     subjold = " "
    
     # table for conjugation
     conj[" ARE "     ] = " AM "
     conj["WERE "     ] = "WAS "
     conj[" YOU "     ] = " I "
     conj["YOUR "     ] = "MY "
     conj[" IVE "     ] =\
     conj[" I HAVE "  ] = " YOU HAVE "
     conj[" YOUVE "   ] =\
     conj[" YOU HAVE "] = " I HAVE "
     conj[" IM "      ] =\
     conj[" I AM "    ] = " YOU ARE "
     conj[" YOURE "   ] =\
     conj[" YOU ARE " ] = " I AM "
    
     # table of all answers
     r[1]   = "DONT YOU BELIEVE THAT I CAN  _"
     r[2]   = "PERHAPS YOU WOULD LIKE TO BE ABLE TO _ ?"
     r[3]   = "YOU WANT ME TO BE ABLE TO _ ?"
     r[4]   = "PERHAPS YOU DONT WANT TO _ "
     r[5]   = "DO YOU WANT TO BE ABLE TO _ ?"
     r[6]   = "WHAT MAKES YOU THINK I AM _ ?"
     r[7]   = "DOES IT PLEASE YOU TO BELIEVE I AM _ ?"
     r[8]   = "PERHAPS YOU WOULD LIKE TO BE _ ?"
     r[9]   = "DO YOU SOMETIMES WISH YOU WERE _ ?"
     r[10]  = "DONT YOU REALLY _ ?"
     r[11]  = "WHY DONT YOU _ ?"
     r[12]  = "DO YOU WISH TO BE ABLE TO _ ?"
     r[13]  = "DOES THAT TROUBLE YOU ?"
     r[14]  = "TELL ME MORE ABOUT SUCH FEELINGS"
     r[15]  = "DO YOU OFTEN FEEL _ ?"
     r[16]  = "DO YOU ENJOY FEELING _ ?"
     r[17]  = "DO YOU REALLY BELIEVE I DONT _ ?"
     r[18]  = "PERHAPS IN GOOD TIME I WILL _ "
     r[19]  = "DO YOU WANT ME TO _ ?"
     r[20]  = "DO YOU THINK YOU SHOULD BE ABLE TO _ ?"
     r[21]  = "WHY CANT YOU _ ?"
     r[22]  = "WHY ARE YOU INTERESTED IN WHETHER OR NOT I AM _ ?"
     r[23]  = "WOULD YOU PREFER IF I WERE NOT _ ?"
     r[24]  = "PERHAPS IN YOUR FANTASIES I AM _ "
     r[25]  = "HOW DO YOU KNOW YOU CANT _ ?"
     r[26]  = "HAVE YOU TRIED ?"
     r[27]  = "PERHAPS YOU CAN NOW _ "
     r[28]  = "DID YOU COME TO ME BECAUSE YOU ARE _ ?"
     r[29]  = "HOW LONG HAVE YOU BEEN _ ?"
     r[30]  = "DO YOU BELIEVE ITS NORMAL TO BE _ ?"
     r[31]  = "DO YOU ENJOY BEING _ ?"
     r[32]  = "WE WERE DISCUSSING YOU -- NOT ME"
     r[33]  = "Oh, I _"
     r[34]  = "YOU'RE NOT REALLY TALKING ABOUT ME, ARE YOU ?"
     r[35]  = "WHAT WOULD IT MEAN TO YOU, IF YOU GOT _ ?"
     r[36]  = "WHY DO YOU WANT _ ?"
     r[37]  = "SUPPOSE YOU SOON GOT _"
     r[38]  = "WHAT IF YOU NEVER GOT _ ?"
     r[39]  = "I SOMETIMES ALSO WANT _"
     r[40]  = "WHY DO YOU ASK ?"
     r[41]  = "DOES THAT QUESTION INTEREST YOU ?"
     r[42]  = "WHAT ANSWER WOULD PLEASE YOU THE MOST ?"
     r[43]  = "WHAT DO YOU THINK ?"
     r[44]  = "ARE SUCH QUESTIONS IN YOUR MIND OFTEN ?"
     r[45]  = "WHAT IS IT THAT YOU REALLY WANT TO KNOW ?"
     r[46]  = "HAVE YOU ASKED ANYONE ELSE ?"
     r[47]  = "HAVE YOU ASKED SUCH QUESTIONS BEFORE ?"
     r[48]  = "WHAT ELSE COMES TO MIND WHEN YOU ASK THAT ?"
     r[49]  = "NAMES DON'T INTEREST ME"
     r[50]  = "I DONT CARE ABOUT NAMES -- PLEASE GO ON"
     r[51]  = "IS THAT THE REAL REASON ?"
     r[52]  = "DONT ANY OTHER REASONS COME TO MIND ?"
     r[53]  = "DOES THAT REASON EXPLAIN ANYTHING ELSE ?"
     r[54]  = "WHAT OTHER REASONS MIGHT THERE BE ?"
     r[55]  = "PLEASE DON'T APOLOGIZE !"
     r[56]  = "APOLOGIES ARE NOT NECESSARY"
     r[57]  = "WHAT FEELINGS DO YOU HAVE WHEN YOU APOLOGIZE ?"
     r[58]  = "DON'T BE SO DEFENSIVE"
     r[59]  = "WHAT DOES THAT DREAM SUGGEST TO YOU ?"
     r[60]  = "DO YOU DREAM OFTEN ?"
     r[61]  = "WHAT PERSONS APPEAR IN YOUR DREAMS ?"
     r[62]  = "ARE YOU DISTURBED BY YOUR DREAMS ?"
     r[63]  = "HOW DO YOU DO ... PLEASE STATE YOUR PROBLEM"
     r[64]  = "YOU DON'T SEEM QUITE CERTAIN"
     r[65]  = "WHY THE UNCERTAIN TONE ?"
     r[66]  = "CAN'T YOU BE MORE POSITIVE ?"
     r[67]  = "YOU AREN'T SURE ?"
     r[68]  = "DON'T YOU KNOW ?"
     r[69]  = "WHY NO _ ?"
     r[70]  = "DON'T SAY NO, IT'S ALWAYS SO NEGATIVE"
     r[71]  = "WHY NOT ?"
     r[72]  = "ARE YOU SURE ?"
     r[73]  = "WHY NO ?"
     r[74]  = "WHY ARE YOU CONCERNED ABOUT MY _ ?"
     r[75]  = "WHAT ABOUT YOUR OWN _ ?"
     r[76]  = "CAN'T YOU THINK ABOUT A SPECIFIC EXAMPLE ?"
     r[77]  = "WHEN ?"
     r[78]  = "WHAT ARE YOU THINKING OF ?"
     r[79]  = "REALLY, ALWAYS ?"
     r[80]  = "DO YOU REALLY THINK SO ?"
     r[81]  = "BUT YOU ARE NOT SURE YOU _ "
     r[82]  = "DO YOU DOUBT YOU _ ?"
     r[83]  = "IN WHAT WAY ?"
     r[84]  = "WHAT RESEMBLANCE DO YOU SEE ?"
     r[85]  = "WHAT DOES THE SIMILARITY SUGGEST TO YOU ?"
     r[86]  = "WHAT OTHER CONNECTION DO YOU SEE ?"
     r[87]  = "COULD THERE REALLY BE SOME CONNECTIONS ?"
     r[88]  = "HOW ?"
     r[89]  = "YOU SEEM QUITE POSITIVE"
     r[90]  = "ARE YOU SURE ?"
     r[91]  = "I SEE"
     r[92]  = "I UNDERSTAND"
     r[93]  = "WHY DO YOU BRING UP THE TOPIC OF FRIENDS ?"
     r[94]  = "DO YOUR FRIENDS WORRY YOU ?"
     r[95]  = "DO YOUR FRIENDS PICK ON YOU ?"
     r[96]  = "ARE YOU SURE YOU HAVE ANY FRIENDS ?"
     r[97]  = "DO YOU IMPOSE ON YOUR FRIENDS ?"
     r[98]  = "PERHAPS YOUR LOVE FOR FRIENDS WORRIES YOU"
     r[99]  = "DO COMPUTERS WORRY YOU ?"
     r[100] = "ARE YOU TALKING ABOUT ME IN PARTICULAR ?"
     r[101] = "ARE YOU FRIGHTENED BY MACHINES ?"
     r[102] = "WHY DO YOU MENTION COMPUTERS ?"
     r[103] = "WHAT DO YOU THINK MACHINES HAVE TO DO WITH YOUR PROBLEMS ?"
     r[104] = "DON'T YOU THINK COMPUTERS CAN HELP PEOPLE ?"
     r[105] = "WHAT IS IT ABOUT MACHINES THAT WORRIES YOU ?"
     r[106] = "SAY, DO YOU HAVE ANY PSYCHOLOGICAL PROBLEMS ?"
     r[107] = "WHAT DOES THAT SUGGEST TO YOU ?"
     r[108] = "I SEE"
     r[109] = "IM NOT SURE I UNDERSTAND YOU FULLY"
     r[110] = "COME COME ELUCIDATE YOUR THOUGHTS"
     r[111] = "CAN YOU ELABORATE ON THAT ?"
     r[112] = "THAT IS QUITE INTERESTING"
     r[113] = "WHY DO YOU HAVE PROBLEMS WITH MONEY ?"
     r[114] = "DO YOU THINK MONEY IS EVERYTHING ?"
     r[115] = "ARE YOU SURE THAT MONEY IS THE PROBLEM ?"
     r[116] = "I THINK WE WANT TO TALK ABOUT YOU, NOT ABOUT ME"
     r[117] = "WHAT'S ABOUT ME ?"
     r[118] = "WHY DO YOU ALWAYS BRING UP MY NAME ?"
     # table for looking up answers that
     # fit to a certain keyword
     k["CAN YOU"]      = "1 2 3"
     k["CAN I"]        = "4 5"
     k["YOU ARE"]      =\
     k["YOURE"]        = "6 7 8 9"
     k["I DONT"]       = "10 11 12 13"
     k["I FEEL"]       = "14 15 16"
     k["WHY DONT YOU"] = "17 18 19"
     k["WHY CANT I"]   = "20 21"
     k["ARE YOU"]      = "22 23 24"
     k["I CANT"]       = "25 26 27"
     k["I AM"]         =\
     k["IM "]          = "28 29 30 31"
     k["YOU "]         = "32 33 34"
     k["I WANT"]       = "35 36 37 38 39"
     k["WHAT"]         =\
     k["HOW"]          =\
     k["WHO"]          =\
     k["WHERE"]        =\
     k["WHEN"]         =\
     k["WHY"]          = "40 41 42 43 44 45 46 47 48"
     k["NAME"]         = "49 50"
     k["CAUSE"]        = "51 52 53 54"
     k["SORRY"]        = "55 56 57 58"
     k["DREAM"]        = "59 60 61 62"
     k["HELLO"]        =\
     k["HI "]          = "63"
     k["MAYBE"]        = "64 65 66 67 68"
     k[" NO "]         = "69 70 71 72 73"
     k["YOUR"]         = "74 75"
     k["ALWAYS"]       = "76 77 78 79"
     k["THINK"]        = "80 81 82"
     k["LIKE"]         = "83 84 85 86 87 88 89"
     k["YES"]          = "90 91 92"
     k["FRIEND"]       = "93 94 95 96 97 98"
     k["COMPUTER"]     = "99 100 101 102 103 104 105"
     k["-"]            = "106 107 108 109 110 111 112"
     k["MONEY"]        = "113 114 115"
     k["ELIZA"]        = "116 117 118"
    }
    

    Author

    Juergen Kahrs


    categories: TenLiners,Apr,2009,AlanL

    Towers of Hanoi

    Contents

    Synopsis

    Description

    Options

    Example

    Details

    Globals

    Code

    Author

    Synopsis

    gawk -f hanoi.awk [-n Disks]

    Description

    The objective is to move N discs from stack 0 to stack 1, always putting a smaller disc on top of a larger one or on an empty stack.
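
The recursive solution moves n-1 discs out of the way, moves the largest disc, then moves the n-1 discs back on top, so the number of moves satisfies moves(n) = 2*moves(n-1) + 1. The sketch below only counts moves. The moves() function is an illustration and is not part of hanoi.awk.

```awk
# Number of single-disc moves needed for n discs:
# two sub-problems of size n-1, plus one move of the largest disc.
function moves(n) {
    return n == 0 ? 0 : 2 * moves(n - 1) + 1
}
BEGIN {
    for (n = 1; n <= 5; n++)
        print n, moves(n)
}
```

For 4 discs this gives 15 moves, which agrees with the example below: 16 snapshots, one for the starting position and one after each move.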

    Options

    -n
    Number of disks, defaults to 5.

    Example

    gawk -f hanoi.awk -n 4
    0 4321
    1 
    2 
    
    0 432
    1 
    2 1
    
    0 43
    1 2
    2 1
    
    0 43
    1 21
    2 
    
    0 4
    1 21
    2 3
    
    0 41
    1 2
    2 3
    
    0 41
    1 
    2 32
    
    0 4
    1 
    2 321
    
    0 
    1 4
    2 321
    
    0 
    1 41
    2 32
    
    0 2
    1 41
    2 3
    
    0 21
    1 4
    2 3
    
    0 21
    1 43
    2 
    
    0 2
    1 43
    2 1
    
    0 
    1 432
    2 1
    
    0 
    1 4321
    2 
    

    Details

    Globals

    sp[i]
    stack pointer for the ith stack = next free space
    stack[i,j]
    value of stack i at position j
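
A short trace shows how the two arrays work together. The push and pop definitions are the ones given under "Standard stuff" below; the BEGIN block is only an illustration and relies on awk treating an unset sp[i] as 0.

```awk
function push(i,v) { stack[i,sp[i]++]=v }      # store v, then advance the pointer
function pop(i)    { return stack[i,--sp[i]] } # step the pointer back, then fetch

BEGIN {
    push(0, 3); push(0, 2); push(0, 1)   # stack 0 holds 3 2 1 (bottom to top)
    print sp[0]                          # next free slot on stack 0
    v = pop(0)                           # take the top disc off stack 0
    push(1, v)                           # and put it on stack 1
    print sp[0], sp[1], stack[1, 0]
}
```

The first line printed is 3. The second is "2 1 1": two discs remain on stack 0, one disc sits on stack 1, and that disc is disc 1.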

    Code

    Main:

    BEGIN {
      n = arg("-n",5)
      for (j=0; j<n; j++) push(0,n-j)
      showstacks()
      hanoi(n,0,1,2)
    }
    

    function hanoi(n,a,b,c) {
      if (n==1) {
        move(a,b)
      } else {
        hanoi(n-1,a,c,b)
        move(a,b)
        hanoi(n-1,c,b,a)
      }
    }
    function move(i,j) {
      push(j,pop(i))
      showstacks()
    }
    

    Showing the stack:

    function showstacks(  i,j) {
      for (i=0; i<=2; i++) {
        printf "%s ", i
        for (j=0; j<sp[i]; j++) printf "%s", stack[i,j]
        print "" }
      print ""
    }
    

    Standard stuff:

    function arg(tag,dflt,   i) {   # "default" is a reserved word in gawk
      for(i in ARGV) 
    	if (ARGV[i] ~ tag) 
    		return ARGV[i+1]
      return dflt
    }
    function push(i,v) { stack[i,sp[i]++]=v }
    function pop(i)    { return stack[i,--sp[i]] }
    

    Author

    Alan Linton, 2001


    categories: Games,TenLiners,Apr,2009,Anon

    Mind-Reading Machine

    Contents

    Synopsis

    gawk -f readminds.awk

    (then type "h" or "t").

    Download

    Download from LAWKER.

    Description

    Theory

    Shannon's 1953 memo, A Mind-Reading(?) Machine, describes a machine built out of relays at Bell Labs.

      This machine is a somewhat simplified model of a machine designed by D.W. Hagelbarger. It plays what is essentially the old game of matching pennies or "odds and evens". This game has been discussed from the game theoretic angle by von Neumann and Morgenstern, and from the psychological point of view by Edgar Allen Poe in "The Purloined Letter". Oddly enough, the machine is aimed more nearly at Poe's method of play than von Neumann's.

    The machine took advantage of the difficulty of generating truly random behavior in wetware by using a small (8-state) Markov model to predict its human opponents.

    Practice

    We implement a 1970's version of this 1950's algorithm, using AWK instead of mechanical relays.

    Our Markov model is based on behavior over the last two rounds, with hpa and hpb recording the history of the player's plays, and hca and hcb recording the history of the computer's guesses. The possible cases are: the player won or lost two rounds ago, changed plays or stayed with the same play, and won or lost the last round, for a total of 2^3 = 8 histories, with any bias towards changing or staying in the upcoming round kept in the tally array.
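
    To make the eight histories concrete, here is a small sketch that lists every key the tally array can be indexed by. It assumes the key layout used in the "consult model" code further down (three 0/1 flags joined by "/"); the wrapper name history_keys is invented for this example.

```shell
# Sketch: enumerate the 2^3 history keys used to index tally.
history_keys() {
  awk 'BEGIN {
    # a = (hpa != hca)  player won two rounds ago
    # b = (hpa != hpb)  player changed plays between the two rounds
    # c = (hpb != hcb)  player won the last round
    for (a = 0; a <= 1; a++)
      for (b = 0; b <= 1; b++)
        for (c = 0; c <= 1; c++)
          print a "/" b "/" c
  }'
}
history_keys
```

    Each printed key, from 0/0/0 to 1/1/1, owns one counter in tally.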

    If the player has repeated their behavior for a given history at least twice, we guess according to their predicted behavior. After the first observation, we guess with a 75%/25% split, weighted towards the bias. If the player hasn't shown any bias (or during the first two rounds of the game), we guess at hazard.
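
    The guessing rule can be sketched as a single function of the tally value. This is an illustration rather than the game's own code: the name predict is hypothetical, and the third argument stands in for rand() so that the result is repeatable.

```shell
# Sketch: predict TALLY LAST_PLAY R, where R (0 <= R < 1) replaces rand().
predict() {
  awk -v t="$1" -v last="$2" -v r="$3" 'BEGIN {
    t += 0; last += 0; r += 0                        # force numeric values
    if      (t < -1)  g = !last                      # strong bias: player changes
    else if (t == -1) g = (r < 0.75) ? !last : last  # weak bias: 75% change
    else if (t == 0)  g = (r < 0.5) ? 1 : 0          # no bias: coin flip
    else if (t == 1)  g = (r < 0.75) ? last : !last  # weak bias: 75% stay
    else              g = last                       # strong bias: player stays
    print g
  }'
}
predict 2 1 0.9     # player tends to stay, so guess the last play again
predict -1 1 0.1    # player tends to change, so guess the opposite
```

    A positive tally means the player has tended to repeat their last play after this history; a negative one means they have tended to switch.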

    Code

    Begin

    BEGIN	{
     print "+---------------------------------------------------------+"
     print "| An AWKward mind-reading machine                         |"
     print "|         (this retrogame inspired by the Bell Labs Memo: |"
     print "|          Shannon, 1953, 'A Mind-Reading (?) Machine')   |"
     print "+---------------------------------------------------------+"
     print "Shall we play a game?"
     print "Tell me either 'heads' or 'tails'."
     print "If I guess what you picked, I win.  Otherwise, you win."
     print "The match goes for 100 rounds, or someone gets 20."
     printf "your play? "
    }
    

    set seed

    BEGIN	{ "date +%s" | getline seed; srand(seed) }
    

    consult model

    BEGIN	{ t = 0 }
    NR > 2	{
    	hist = (hpa!=hca)"/"(hpa!=hpb)"/"(hpb!=hcb)   # "case" is a reserved word in newer gawk
    	t = tally[hist]
    	}
    
    t < -1	{ guess=!hpb }
    t == -1 { guess=(int(rand()+.75)?!hpb:hpb) }
    t == 0	{ guess=int(rand()+.5) }
    t == 1  { guess=(int(rand()+.75)?hpb:!hpb) }
    t > 1	{ guess= hpb }
    
    

    get input

    /^[hH]/		{ play=1 }
    /^[tT]/		{ play=0 }
    /^[^hHtT]/	{ printf "heads or tails? "; next }
    

    report results

    We also report the results of the round to the player (in case they wish to update their internal models). En passant, we update pw and cw, the number of player (resp. computer) wins.

    	{
    	printf "You played " (play?"heads":"tails")
    	printf "; I guessed " (guess?"heads":"tails")
    	printf ".  "(play==guess?"I":"You")" win. "
    	print "("(pw+=(play!=guess))"-"(cw+=(play==guess))")"
    	}
    

    update model

    After finishing a round, we update the history with the results, including updating tally according to the player's behavior. Again, we wait for two rounds before touching the tally counters, at which point the history will have been fully initialized.

    NR > 2	{ tally[hist] += (hpb == play ? 1 : -1) }
    	{
    	hpa = hpb; hpb = play
    	hca = hcb; hcb = guess
    	}
    

    check for victory

    At the end of each round, if we haven't met a victory condition, we prompt for the next round.

    cw+pw==100	{ printf (cw>pw?"I":"You")" won the match "
    		  print  "by "(cw>pw?cw-pw:pw-cw)" games."
    		  exit }
    pw-cw==20	{ print "You win -- up by 20"; exit }
    cw-pw==20	{ print "I win -- up by 20"; exit }
    		{ printf "? " }
    
    

    end

    END	{ 
    	print " T H A N K   Y O U   F O R   P L A Y I N G "
    	}
    

    Copyright

    Copyright (c) 2009 the authors listed at the following URL, and/or the authors of referenced articles or incorporated external code: http://en.literateprograms.org/Mind_reading_machine_(AWK)?action=history&offset=20070207160312

    Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

    The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.


    categories: Games,TenLiners,Apr,2009,Ysa

    mastermind.awk

    Contents

    Synopsis

    Download

    Description

    Example

    Code

    See Also

    Author

    Synopsis

    gawk -f mastermind.awk

    Download

    Download from LAWKER.

    Description

    The aim of the game is to guess a 4-digit number whose digits are all different, each taken from 0,1,2,3,4,5,6,7,8,9 (the first digit is never 0). A "hit" is the right digit in the right position and a "blow" is the right digit in the wrong position.

    You lose the game if you fail to guess after 10 rounds.
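
    As a rough sketch of the scoring rule (not the game's own code, which packs hits and blows into a single number), the following counts hits and blows for one guess. The name hit_blow is invented here, and the sketch assumes the secret has no repeated digits, as in the game.

```shell
# Sketch: hit_blow SECRET GUESS
hit_blow() {
  awk -v secret="$1" -v guess="$2" 'BEGIN {
    n = length(secret)
    for (i = 1; i <= n; i++) pos[substr(secret, i, 1)] = i   # digit -> position in the secret
    for (i = 1; i <= n; i++) {
      d = substr(guess, i, 1)
      if (!(d in pos)) continue      # digit does not occur in the secret
      if (pos[d] == i) hit++         # right digit, right position
      else             blow++        # right digit, wrong position
    }
    printf "%d Hit %d Blow\n", hit, blow
  }'
}
hit_blow 1320 1234
hit_blow 1320 1340
```

    With the secret 1320 from the example session below, the two calls reproduce the replies given there for the guesses 1234 and 1340.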

    Example

     +++  Hit & Blow  +++   <Push Enter>
    
    [ 1] >> 1234
                  ##  1 Hit  2 Blow
    [ 2] >> 1256
                  ##  1 Hit  1 Blow
    [ 3] >> 1789
                  ##  1 Hit  0 Blow
    [ 4] >> 1243 
                  ##  1 Hit  2 Blow
    [ 5] >> 1340
                  ##  3 Hit  0 Blow
    [ 6] >> 1320
    
      Congratulations !!  (1320)
    

    Code

    BEGIN{ 
    	srand();  
    	c=1;  
    	print "\n\n +++  Hit & Blow  +++   <Push Enter>\n";
    	q[z=p=int(9*rand())+1]=1;  
    	for(i=2; i<=4;) 
    		if(q[p=int(10*rand())]<1){ 
    			q[p]=i++;  
    			z=z*10+p; }
    }
    

    Note that 1023 and 9876 are the smallest and largest 4-digit integers with no repeated digits.

    { if((n=int($0+0))>=1023 && n<=9876) { 
    		++c;
    		v=0;  
    		for(i=4; i>0; n=int(n/10)) 
    			v+=(q[p=n%10]==i--)?10:(q[p]>0)?1:0;
    		if (v==40) exit; 
    		else printf("%16s %2d Hit %2d Blow\n", "##", v/10, v%10);
    	}
    	if (c>10) exit; 
    	else printf("[%2d] >> ", c);
    }
    END{ 
    	printf("\n  %s  (%d)\n", (v==40)?"Congratulations !!":"Over times", z); 
    }
    

    See Also

    mastermind2.awk.

    Author

    The author's name is YSA.


    categories: Games,May,2009,SteveL

    Mastermind2.awk

    Contents

    Synopsis

    Download

    Description

    Example

    Code

    See Also

    Author

    Synopsis

    gawk -f mastermind2.awk [breaker]

    Download

    Download from LAWKER.

    Description

    This is an interactive game of Mastermind, played against the evil computer.

    The game shows off the recursive power of the awk language. It also demonstrates a winning technique for Mastermind.

    The game has two roles, breaker and maker of mastermind codes. A 5-digit code, each digit 0 to 9, must be broken. The maker responds with one + for every correct digit in the correct position and one - for every correct digit in the wrong position. A code breaker (human or this program) must use those clues to determine the code. A score is kept; low score wins.
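
    The breaker's technique, as far as the comments in the code below describe it, is to propose only codes that would have earned the same +- reply for every earlier guess. A cut-down sketch of that idea follows. It uses 3-digit codes to stay small; the name consistent and its "GUESS=REPLY" argument format are made up for this example.

```shell
# Sketch: consistent "GUESS=REPLY ..." prints the first 3-digit code that fits every clue.
consistent() {
  awk -v clues="$1" 'BEGIN {
    n = split(clues, c, " ")
    for (k = 1; k <= n; k++) { split(c[k], kv, "="); g[k] = kv[1]; r[k] = kv[2] }
    for (code = 0; code < 1000; code++) {
      cand = sprintf("%03d", code)
      ok = 1
      for (k = 1; k <= n && ok; k++)
        if (grade(cand, g[k]) != r[k]) ok = 0   # candidate contradicts an earlier reply
      if (ok) { print cand; exit }
    }
    print "no code fits"
  }
  function grade(val, guess,   i, t, out) {
    for (i = 1; i <= 3; i++)                    # exact matches first
      if (substr(val, i, 1) == substr(guess, i, 1)) {
        out = out "+"
        val   = substr(val, 1, i-1) "x" substr(val, i+1)
        guess = substr(guess, 1, i-1) "y" substr(guess, i+1)
      }
    for (i = 1; i <= 3; i++) {                  # then right digit, wrong place
      t = index(val, substr(guess, i, 1))
      if (t) {
        out = out "-"
        val = substr(val, 1, t-1) "u" substr(val, t+1)
      }
    }
    return out
  }'
}
consistent "012="          # no digit of 012 is in the code
consistent "012= 333=+"    # ...and the code holds exactly one 3
```

    Every clue shrinks the set of codes that can still be right, so proposing only consistent codes never wastes a guess on a code already ruled out.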

    Example

    In the following example, the goal is "12345".

    gawk -f mastermind2.awk  br
    I'll start, I'll break your code, you respond with +-
    my guess #1 12413 ++--
    my guess #2 12531 ++--
    my guess #3 13211 +--
    my guess #4 14523 +----
    my guess #5 15432 +----
    my guess #6 12345 +++++
    

    Code

    BEGIN{ 
    	srand();  
    	if (index(ARGV[1],"br")) {
    		print "I'll start, I'll break your code, you respond with +-"
    		ARGV[1] = ""
    		mscore += breaker(randguess())
    	}
    	do {
    		printscore()
    		print "Guess my code 5 digits 0 to 9"
    		yscore += maker(randguess())
    		printscore()
    		print "I'll break your code, you respond with +-"
    		mscore += breaker(randguess())
    	} while (1)
    }
    END{ 
    	printscore()
    }
    function printscore() {
    	print("\nlow score wins! my score =", mscore, "yours =", yscore)
    }
    function randguess() {
    	return incr(int((10*10*10*10*10)*rand()))
    }
    function smudge(ins,n,ch) {
    	return substr(ins, 1, n-1) ch substr(ins, n+1)
    }
    function grade(val, guess, 	i, rtn, t){ 
    # return + for exact hits, - for "close" for all 5 digits
    	for (i = 1;i < 6; i++) {
    		if (substr(val, i, 1) == substr(guess, i, 1)) {
    			#exact match
    			rtn = rtn "+"
    			val = smudge(val, i, "x");
    			guess = smudge(guess, i, "y");
    			#print i, val, guess, rtn
    		}
    	}
    	for (i = 1;i < 6; i++) {
    		t = index(val, substr(guess, i, 1))
    		if (t) {
    			rtn = rtn "-"
    			val = smudge(val, t, "u")
    			guess = smudge(guess, i, "v");
    			#print t, i, val, guess, rtn
    		}
    	}
    	return rtn
    }
    #passed guess and old guess array
    #A good guess matches all previous scores with the new guess
    function checkguess(g, oldg,	i,score) {
    	#print "guess " g
    	for (i in oldg) {
    		if (g == i) return 2 #bad, repeated guess
    		if (grade(g,i) != oldg[i]) return 1 #reject this guess
    	}
    	return 0 #success, this is an ok guess
    }
    function incr(old,	new) {
    	new = sprintf("%05d",old + 1)
    	#print "old new", old, new
    	return substr(new, length(new) -4)
    }
    function alignres(res, 	tem) {
    	for (i=1;i<=length(res);i++) {
    		if (substr(res, i, 1) == "+") tem = "+" tem
    		else tem = tem "-"
    	}
    	#print "alignres ",res, tem
    	return tem
    }
    function breaker(g1,	guess, res, hisinput, tries){
    	guess = g1
    	do {
    		printf("my guess #%d %s ", ++tries, guess)
    		do {
    			if (getline hisinput <= 0) {
    				print "whoa, some error, giving up"
    				exit
    			}
    			if (!match(hisinput, /^[-+]*$/)) {
    				print "invalid response, use only +-"
    			}
    		} while (RSTART == 0)
    		hisinput = alignres(hisinput)
    		res[guess] = hisinput
    		#print "hisinput ", hisinput, res[guess]
    		#for (i in res) print "res[" i "]=" res[i]
    		if (res[guess] == "+++++") return tries
    		# make another guess
    		do {
    			guess = incr(guess)
    			r = checkguess(guess, res)
    		} while (r == 1)
    	} while (g1 != guess)
    	print "you must have made a mistake, no answer is possible"
    	exit
    }
    function maker(original,	his, tries)
    { 
    	#print original," cheater!"
    	do {
    		if (getline his <= 0) {
    			print "whoa, some error, giving up"
    			exit
    		}
    		res = grade(original, his)
    		print "try " ++tries " results",res
    		if (res == "+++++") return tries
    	} while (1)
    }
    

    See Also

    mastermind.awk.

    Author

    Steve Calfee, USA.


    categories: Games,May,2009,AaronH

    Checkers Programming in Gawk

    In early 2004, Aaron Hawley threw himself into a programming contest held by the University of Vermont Computer Science Student Association. The contest was a variation on checkers where competitors had their artificial computer players compete in a "virtual tournament".

    It made for an interesting problem, and he chose to make it more interesting by writing his checkers player in Awk (specifically, GNU Awk). He wasn't able to submit a working version then because of a technical problem, and the contest itself was never finalized due to a lack of submissions.

    Recently, he overcame the technical problems and finally put together a working version (not to be confused with winning). The heuristic used in this checker player is not a winning strategy, but at least it plays. There is also the full build distribution, that shows what a large Awk project looks like, and some tricks on how to survive (hint: GNU Makefiles).


    categories: Games,May,2009,AaronH

    Tic-Tac-Toe

    Contents

    Synopsis

    Description

    Example

    Code:

    Author

    Synopsis

    To let the computer play first, run:

    awk -f 15.awk -v start=1
    

    To play first, run:

    awk -f 15.awk -v start=2
    

    Description

    Each move is one square (in the range 1..9).
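
    The script is called 15.awk and sets winning_sum = 15, which suggests (this is an assumption; the page does not say so) that the squares are numbered as in a 3x3 magic square, where every row, column and diagonal adds up to 15. A small check of that property, with the invented name magic_lines:

```shell
# Sketch: the eight tic-tac-toe lines of the Lo Shu magic square all sum to 15.
magic_lines() {
  awk 'BEGIN {
    split("2 7 6 9 5 1 4 3 8", sq, " ")   # the square, read row by row
    # rows, columns and diagonals as cell indices 1..9
    n = split("1,2,3 4,5,6 7,8,9 1,4,7 2,5,8 3,6,9 1,5,9 3,5,7", line, " ")
    for (k = 1; k <= n; k++) {
      split(line[k], c, ",")
      s = sq[c[1]] + sq[c[2]] + sq[c[3]]
      printf "%d+%d+%d = %d\n", sq[c[1]], sq[c[2]], sq[c[3]], s
    }
  }'
}
magic_lines
```

    Under that numbering, completing a line on the board is the same as holding three numbers that total 15.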

    Example

     gawk -f 15.awk -v start=1
    6
    9
    1
    3
    8
    I win!
    

    Code:

    BEGIN {
        winning_sum = 15;
        max_play = 9;
        used[0] = 1;
        my_sum = your_sum = 0;
        if (start == 1) {
            answer = ftw(used, my_sum);
    	used[answer] = 1;
    	my_sum += answer;
    	print answer;
        }
        halted=0;
    }
    
    ! /^[1-9]$/ {
        print "Illegal play: " $0;
        next;   # skip the main rule when the input is not a single digit 1..9
    }
    
    {
        if ($0 in used) {
            print "Illegal play: " $0;
        } else {
            used[$0] = 1;
            your_sum += $0;
            if (your_sum == winning_sum) {
                print "You win!";
    	    halted=1
    	    exit 0
            } else if (your_sum > winning_sum && my_sum > winning_sum) {
                print "Draw";
    	    halted=1;
    	    exit 2;
            } else {
    	    answer = block = winning_sum - your_sum;
    	    winning_move = ftw(used, my_sum);
                if (block > max_play \
    		|| block <= 0 \
    		|| block in used) {
                    answer = winning_move;
    	    }
    	    while (answer <= 0 || answer > max_play || answer in used) {
    		answer++;
    	    }
                my_sum += answer;
                used[answer] = 1;
                print answer;
                if (my_sum == winning_sum) {
                    print "I win!";
    		halted=1;
    		exit 1;
                }
            }
        }
    }
    
    END {
        if (halted == 1) {
    	exit;
        }
        if (your_sum != winning_sum && my_sum != winning_sum) {
    	print "I win by forfeit";
    	exit 1;
        }
    }
    
    function ftw(used, sum) {
        strlst = "";
        for (v in used) {
            strlst = strlst "" v;
        }
        to_win = try(strlst, max_play "", sum);
        if (to_win == "") {
            return -1;
        }
        return substr(to_win, 1, 1);
    }
    
    function try(used, hunches, sum) {
        curr_sum = strsum(hunches) + sum;
        curr_hunch = substr(hunches, 1, 1);
        next_hunch = curr_hunch - 1;
        if (hunches == "") {
            return "";
        } else if (curr_hunch < 1) {
            return substr(hunches, 2);
        } else if (index(used, curr_hunch) || curr_sum > winning_sum) {
            return try(used, next_hunch "" substr(hunches, 2), sum);
        } else if (curr_sum == winning_sum) {
            return hunches;
        }
        return try(curr_hunch "" used, next_hunch "" hunches, sum);
    }
    
    function strsum(str) {
        s = 0;
        str_length = length(str);
        for (i = 1; i <= str_length; i++) {
            s += substr(str, i, 1);
        }
        return s;
    }
    

    Author

    Aaron S. Hawley


    categories: Sorting,TenLiners,Apr,2009,Ysa

    Quicksort.awk

    Contents

    Synopsis

    cat numbers | gawk -f quicksort.awk

    Download

    Download from LAWKER.

    Description

    Some Awk implementations come with built-in sort routines (e.g. Gawk's asort and asorti functions). But it can be useful to code these yourself, especially if you are doing data structure tricks.

    Quicksort selects a pivot and divides the data into values above and below the pivot. Sorting then recurses on these sub-lists.
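
    One partitioning pass can be sketched on its own. To keep the output repeatable this illustration always takes the first element as the pivot, whereas the code below picks a random one; the name partition_demo and the sample numbers are made up.

```shell
# Sketch: a single partition step around the pivot A[1].
partition_demo() {
  awk 'BEGIN {
    n = split("5 3 8 1 9 2", A, " ")
    last = 1                                   # slots 2..last hold values below the pivot
    for (i = 2; i <= n; i++)
      if (A[i] + 0 < A[1] + 0) { last++; t = A[last]; A[last] = A[i]; A[i] = t }
    t = A[1]; A[1] = A[last]; A[last] = t      # drop the pivot between the two groups
    for (i = 1; i <= n; i++) out = out (i > 1 ? " " : "") A[i]
    print out "  (pivot now at position " last ")"
  }'
}
partition_demo
```

    After the pass the pivot sits in its final place; quicksort then repeats the same step on the values to its left and to its right.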

    Code

    Loading the data

    BEGIN { RS = ""; FS = "\n" }
          { A[NR] = $0 } 
    END {
    	qsort(A, 1, NR)
    	for (i = 1; i <= NR; i++) {
    		print A[i]
    		if (i == NR) break
    		print ""
    	}
    }
    

    Sorting the data

    function qsort(A, left, right,   i, last) {
    	if (left >= right)
    		return
    	swap(A, left, left+int((right-left+1)*rand()))
    	last = left
    	for (i = left+1; i <= right; i++)
    		if (A[i] < A[left])
    			swap(A, ++last, i)
    	swap(A, left, last)
    	qsort(A, left, last-1)
    	qsort(A, last+1, right)
    }
    function swap(A, i, j,   t) {
    	t = A[i]; A[i] = A[j]; A[j] = t
    }
    

    See also

    quicksort2.awk

    Authors

    Alfred Aho, Peter Weinberger, Brian Kernighan, 1988.


    categories: May,2009,TerryB

    QTawk

    The QTAwk utility is an extension to standard Awk that makes it possible to handle simple data-reformatting jobs easily with just a few lines of code.

    Differences to standard Awk:

    • Various changes to regular expressions;
    • Expanded operators;
    • More predefined patterns (more than just BEGIN, END);
    • Multi-dimensional arrays;
    • Integer arrays indexed as integers;
    • Various new keywords; e.g. the "local" keyword allows the declaration and use of local variables within compound statements, including user-defined functions;
    • More math and string functions;
    • The ability to pass commands as strings and execute them;
    • More I/O functions;
    • Various new globals.

    For More Information...

    QTawk website.

    Download

    Download here.

    Author

    Terry D. Boldt

    categories: Nov,2009,Admin

    Awk.info Gaining Popularity

    Nov 28, 2009

    This site is moving up the page rankings:

    • Four months ago, typing "awk" into Google resulted in pages of output where the first mention of awk.info did not appear till half-way down page five.
    • Today, the same query finds "awk.info" on page two (in position 18).

    Other indicators also look good. Since the site was launched (Feb 15, 2009), the number of visits has been steadily increasing.

    These 19,268 visits come from 2,765 cities.

    Apart from Granville, West Virginia (where this site is administered), the three cities with the most visits are:
    • London, England: 389 visits;
    • Thessaloniki, Greece: 491 visits;
    • Athens, Greece: 640 visits.

    (BTW: Anyone got any ideas why these cities visit here so often?)

    In other news, Website Outlook reports that:

    • This site is now worth $1423.5 USD.
    • And the daily ad revenue stream from awk.info would be $1.95.

    To put that report in perspective, the same source notes that:

    • rottentomatoes.com is worth $3,300,000.
    • And the daily revenue stream from that site would be $4523.19.

    categories: Nov,2009,ScottS

    AWKBOT

    URL: http://www.blisted.org/wiki/projects/awkbot.

    Awkbot is a small bot written in 100% GNU Awk. It requires GNU Awk version 3.1.1.

    Awkbot can search Google and can search the awk man page for descriptions of functions and built-in variables.

    The tool accepts a simple configuration file, and has a small wrapper written in sh for automatic restarts.

    The goal of the tool is to (eventually) become a clone of info bot, with awk adaptations, to prove to those fools in #perl on freenode that awk really is a programming language.

    AWKBot uses mysql.awk to connect to, and query, a MySQL database where it stores information you give it and recalls it later. It also uses this to track karma points, and maybe more in the future. It similarly uses some interesting pipelining to do IPC, to support awkpaste.

    Author

    Scott S. McCoy


    categories: Oct,2009,Zazzle

    Awk Mug

    Zazzle.com is offering their great "I love Awk mug", starting at $12.


    categories: Oct,2009,JohnD

    Parallel Awk

    From John David Duncan's parallel-awk.org site.

    Parallel Awk is an effort to link Awk with MPI, enabling the everyday analysis of large plain-text files to be parallelized, allowing rapid prototyping of parallel applications, preserving the syntax and style of Awk, and hiding the details of MPI.

    Awk and MPI

    The Awk programming language, first developed at Bell Labs in 1977, is a standard part of Unix operating system distributions. It is a compact language, commonly used in systems administration and in commercial (as opposed to scientific) computing. The half dozen books about awk include the original slim and very readable Awk book by Aho, Kernighan, and Weinberger. Awk is standardized in POSIX, and the most actively maintained current implementation is GNU awk. While awk, like sed, is perhaps most often used for "one-liners," its regular expression handling and rich C-like syntax make it well-suited for many small applications and domain-specific languages.

    MPI is a standard Message Passing Interface for parallel computing created by the MPI Forum, implemented in two widely-used free distributions (LAM/MPI and MPICH) and in optimized versions provided by many hardware vendors. MPI libraries are often linked with Fortran or C code in scientific computing tasks, such as matrix calculations, and run on supercomputers or Beowulf clusters. For some of these applications, runtime is actually greater than development time; nonetheless, a language for rapid prototyping is a handy tool to have around.

    Example: Calculating Pi

    # pi.awk: approximate pi by integrating f(x) = 4/(1+x^2)
    # n = number of intervals to calculate 
    #
    # e.g.: mpiexec -n 4 mpawk -v n=10000 -f pi.awk 
    
    BEGIN {
        h = 1/n
        for(i = RANK+1 ; i <= n ; i += SIZE) {
            x = h * (i - 0.5)
            sum +=  4 / (1 + x^2)
        }
        pi = reduce(sum(h * sum))
        if(!RANK) printf("n=%d, pi is %1.20f\n",n,pi)
    }
    

    pi.awk requires about 20% as many lines of code as its equivalents in C or Fortran. The output is printed by the process with RANK = 0 and looks like this:

    sh% mpiexec -n 4 mpawk -v n=100000 -f pi.awk
    n=100000, pi is 3.14159265359811668006
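
    For readers without an MPI setup, the way the loop shares out the work can be imitated in plain awk. This sketch is not Parallel Awk: it runs the ranks one after another in a single process and adds the partial results itself, where pi.awk would call reduce. The name pi_by_ranks and its two arguments are invented for the illustration.

```shell
# Sketch: pi_by_ranks N SIZE (midpoint rule, intervals dealt out round-robin)
pi_by_ranks() {
  awk -v n="$1" -v size="$2" 'BEGIN {
    h = 1 / n
    for (rank = 0; rank < size; rank++) {   # each pass plays the role of one process
      sum = 0
      for (i = rank + 1; i <= n; i += size) {
        x = h * (i - 0.5)
        sum += 4 / (1 + x * x)
      }
      pi += h * sum                         # stands in for the sum across processes
    }
    printf "n=%d, pi is %.10f\n", n, pi
  }'
}
pi_by_ranks 10000 4
```

    Rank r handles intervals r+1, r+1+SIZE, r+1+2*SIZE and so on, so every interval is covered exactly once whatever SIZE is.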
    

    Status

    The latest beta release of Parallel Awk is version 0.8. In this release, any Awk expression (including numbers, strings, and arrays) can be sent from one process to another using the functions send and recv. The comm_split() function, an interface to MPI_Comm_split, allows the creation of intra-communicators, while a companion function comm_set() is used to set the default MPI communicator implicitly used for all other MPI operations. Supported collective operations include reduce(), which can be applied to both numeric and string expressions, and barrier(). A function called assign() is used to divide the lines of input among the set of processes, as can a hash() function that is applied to array keys or other strings.


    categories: Dec,2009,DanN

    Waclaw Sierpinski's Triangle

    Contents

    Synopsis

    Example

    Code

    Author

    Synopsis

    gawk -f  wst.awk [-v X=anychar] iterations
    

    Example

     gawk -f wst.awk  -v X=* 2
                   *
                  * *
                 *   *
                * * * *
               *       *
              * *     * *
             *   *   *   *
            * * * * * * * *
           *               *
          * *             * *
         *   *           *   *
        * * * *         * * * *
       *       *       *       *
      * *     * *     * *     * *
     *   *   *   *   *   *   *   *
    * * * * * * * * * * * * * * * *
    

    Code

    BEGIN {
        n = ARGV[1] + 0 # iterations
        if (n !~ /^[0-9]+$/) { exit(1) }
        if (n == 0) { width = 3 }
        row = split("X,X X,X   X,X X X X",A,",") # seed the array
        for (i=1; i<=n; i++) { # build triangle
          width = length(A[row])
          for (j=1; j<=row; j++) {
            str = A[j]
          # if (n <= 9) { gsub(/[^ ]/,i,str) } # show structure
            A[j+row] = sprintf("%-*s %-*s",width,str,width,str)
          }
          row *= 2
        }
        for (j=1; j<=row; j++) { # print triangle
          if (X != "") { gsub(/X/,substr(X,1,1),A[j]) }
          sub(/ +$/,"",A[j])
          printf("%*s%s\n",width-j+1,"",A[j])
        }
        exit(0)
    }
    

    Author

    Dan Nielsen


    categories: ,Web,Dec,2009,MichealS

    A Web Server in Awk

    Contents

    Server.awk - a simple, single user, web server built with gawk.

    Download

    Download from LAWKER.

    About

    This code creates an html menu of local applications which you can season to taste. The usage requires two steps...

    1. run: 'gawk -f server.awk'
    2. open browser at: http://localhost:8080

    This code is based on the examples in the TCP/IP Internetworking With `gawk' manual and is licensed under GPL 3.0. For updates to this code, see http://topcat.hypermart.net/index.html.

    Code

    Set up

    BEGIN { 
      x        = 1                         # script exits if x < 1 
      port     = 8080                      # port number 
      host     = "/inet/tcp/" port "/0/0"  # host string 
      url      = "http://localhost:" port  # server url 
      status   = 200                       # 200 == OK 
      reason   = "OK"                      # server response 
      RS = ORS = "\r\n"                    # header line terminators 
      doc      = Setup()                   # html document 
      len      = length(doc) + length(ORS) # length of document 
      while (x) { 
         if ($1 == "GET") RunApp(substr($2, 2)) 
         if (! x) break   
         print "HTTP/1.0", status, reason |& host 
         print "Connection: Close"        |& host 
         print "Pragma: no-cache"         |& host 
         print "Content-length:", len     |& host 
         print ORS doc                    |& host 
         close(host)     # close client connection 
         host |& getline # wait for new client request 
      } 
      # server terminated... 
      doc = Bye() 
      len = length(doc) + length(ORS) 
      print "HTTP/1.0", status, reason |& host 
      print "Connection: Close"        |& host 
      print "Pragma: no-cache"         |& host 
      print "Content-length:", len     |& host 
      print ORS doc                    |& host 
      close(host) 
    } 
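
    What the loop above writes to each client can be reproduced without any networking. The following sketch prints the same header lines and body to standard output instead of the |& socket; the name http_response is made up, and the CRs are stripped at the end only to make the result easy to read in a terminal.

```shell
# Sketch: http_response BODY (same lines as the server sends, written to stdout)
http_response() {
  awk -v doc="$1" 'BEGIN {
    ORS = "\r\n"                        # header lines end in CRLF
    len = length(doc) + length(ORS)     # body plus the CRLF that print appends
    print "HTTP/1.0", 200, "OK"
    print "Connection: Close"
    print "Pragma: no-cache"
    print "Content-length:", len
    print ORS doc                       # an empty line ends the headers, then the body
  }'
}
http_response "<html>hi</html>" | tr -d '\r'
```

    The Content-length value counts the trailing CRLF because print adds ORS after the document.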
    

    HTML Menu

    function Setup() { 
      tmp = "<html>\
      <head><title>Simple gawk server</title></head>\
      <body>\
      <p><a href=" url "/xterm>xterm</a>\
      <p><a href=" url "/xcalc>xcalc</a>\
      <p><a href=" url "/xload>xload</a>\
      <p><a href=" url "/exit>terminate script</a>\
      </body>\
      </html>" 
      return tmp 
    } 
    

    Saying Good-bye

    function Bye() { 
      tmp = "<html>\
      <head><title>Simple gawk server</title></head>\
      <body><p>Script Terminated...</body>\
      </html>" 
      return tmp 
    } 
    

    Running Applications

    function RunApp(app) { 
      if (app == "xterm")  {system("xterm&"); return} 
      if (app == "xcalc" ) {system("xcalc&"); return} 
      if (app == "xload" ) {system("xload&"); return} 
      if (app == "exit")   {x = 0} 
    }
    

    Author

    Michael Sanders


    categories: ,Rss,Dec,2009,TimM

    Reading RSS Feeds

    Contents

    Synopsis

     myrss("rss;url;N" [,between])
    

    Download

    Download from LAWKER.

    About

    The function myrss("rss;url;N") returns the first N items from an rss feed found in url.

    This code is a nice example of the brevity of Awk. I've used many PHP and Perl-based RSS readers and this code is by far the simplest, the shortest, and the easiest to modify.

    The function optionally accepts a between string that is printed between each item. The following example prints a "<li>" between each RSS item; i.e. it converts a text string into an HTML list.

    The code is designed to be customized. Quirks in the RSS stream, or quirks in the formatting, are handled by a set of separate my functions that can be quickly altered to return the desired strings.

    Notes

    The code uses a slurp function that reads the entire stream as one string using wget, then splits it into an array on the > character (the separator passed to slurp in the code below).
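
    The splitting trick can be tried offline. This sketch replaces the wget call with a hard-coded two-item feed (invented for the example) and pulls out the dates the same way the code below does; the name rss_dates is hypothetical.

```shell
# Sketch: split a feed on ">" and read the text that follows each <pubDate> tag.
rss_dates() {
  awk 'BEGIN {
    feed = "<item><title>First post</title><pubDate>Wed, 02 Dec 2009 10:00:00 +0000</pubDate></item>"
    feed = feed "<item><title>Second post</title><pubDate>Tue, 01 Dec 2009 09:00:00 +0000</pubDate></item>"
    n = split(feed, all, ">")          # each piece ends just before a ">"
    for (i = 1; i <= n; i++)
      if (all[i] ~ /^<pubDate/) {      # an opening tag with no text in front of it
        split(all[i + 1], tmp, " ")    # next piece: "Wed, 02 Dec 2009 ...</pubDate"
        print tmp[3] " " tmp[2]        # month, then day
      }
  }'
}
rss_dates
```

    Because every piece ends where a tag closes, the text belonging to a tag is always found in the piece after the one holding the tag name.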

    After a few simplifications, the approach turns out to be very fast. For example, using

    wget -O -
    
    is faster than
    wget -O tmpfile; cat tmpfile
    

    Also, version one of this code split the RSS feed using the disjunction [<>]. This proved to be much slower than slurping the feed in (splitting on "\n") and then splitting it again on the single character ">".

    The above two optimizations changed the runtimes for the following example from 0.9 seconds to 0.88 seconds. This is very fast considering that just wgetting the RSS feed takes 0.08 seconds.

    Example

     % gawk -f myrss.awk --source 'BEGIN {
       print "<ul>"
       print myrss("rss;lawker.blogspot.com/feeds/posts/default?alt=rss;5","<li>\n")
       print "</ul>"
     }'
    

    This generates the following list from the AWK.INFO rss feed

    • Dec 02 Awk.info now a top-20 website.
    • Dec 02 Praveen Puri offers a Zork-clone, in Awk.
    • Dec 01 Ed Morton sorts out everything (using Awk)
    • Dec 01 Is this the smartest (smallest) formatter ever written?
    • Nov 30 Gregory Grefenstette implements Norvig's spell checker.

    Code

    Top-Level Driver

    function myrss(rss, between, tmp) {
      split(rss,tmp,";");
      return myrss1(tmp[2],tmp[3],between);
    }
    

    Main Worker

    function myrss1(feed,max,  between,  n,all,sep,out,date,url,txt,seen) {
      n = slurp("wget -q -O - http://" feed,">",all);
      for(i=1;i<=n; i++) {
        if (all[i] ~ /^<pubDate/) 
          date = myDate(all[i+1])
        else if (all[i] ~ /^<description/) 
          txt = myText(all[i+1])
        else if (all[i] ~ /^<enclosure/) {
          url = myUrl(all[i]);
          out = out sep myReport(url,date,txt);
          sep = between ? between : "\n";
          if (++seen >= max) 
              return out;
        }}
      return out;
    }
    

    Helper Functions

    slurp reads an entire file into an array.

    function slurp(com,sep,all) { slurp0(com); return split($0,all,sep)     }
    function slurp0(com)        { RS=""; FS="\n"; com | getline; close(com) }
    

    Formatting Functions

    Most of the formatting control is isolated in the following functions. Change these to change the appearance of the feeds.

    function myDate(str, tmp)  { split(str,tmp," "); return tmp[3] " " tmp[2]} 
    function myText(str)            { sub(/<.*/,"",str); return str }
    function myUrl(str)             { sub(/<.*/,"",str);    return str }
    function myReport(url,dat,txt) { return "<a href=\""url"\">"dat"</a>" txt}
    

    Author

    Tim Menzies


    categories: ,Engineering,Nov,2009,GrantC

    rcalc

    Contents

    Synopsis

    Download

    About

    Example

    Code

    Author

    Synopsis

    #eg
     gawk -v target=89000 -f rcalc.awk 
    

    Download

    Download from LAWKER.

    About

    Calculates pairs of resistor values from the E24 series that make up an arbitrary value.

    When designing and building electronic projects I mostly use 1% resistors that come in the E24 series (24 values per decade).

    Frequently there's a need for some arbitrary value (between 10R and 1M in this script) resistor that can be made with a series or parallel combination of two standard values.

    This script searches the E24 standard value space for pairs of resistors that will produce or come close to the desired arbitrary resistor value.
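
    The two formulas behind the search are short enough to try by hand. This sketch evaluates one pair both ways and reports the error against the target, as the Error column in the example does; the name pair_value and its argument order are made up for the illustration.

```shell
# Sketch: pair_value TARGET RA RB
pair_value() {
  awk -v target="$1" -v a="$2" -v b="$3" 'BEGIN {
    series   = a + b                  # resistors end to end
    parallel = a * b / (a + b)        # resistors side by side
    printf "series   %10.2f  %+.2f%%\n", series,   100 * (series   - target) / target
    printf "parallel %10.2f  %+.2f%%\n", parallel, 100 * (parallel - target) / target
  }'
}
pair_value 89000 200000 160000
pair_value 89000 56000 33000
```

    The first call reproduces the 88888.89 parallel row of the example output; the second reproduces the exact 56000 + 33000 series match.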

    Example

    $ gawk -v target=89000 -f rcalc.awk
           Result         Ra      Rb  Connect    Error
           88800.00    82000    6800  series    -0.22%
           88888.89   200000  160000  parallel  -0.12%
           89000.00    56000   33000  series
           89000.00    62000   27000  series
           89130.43   820000  100000  parallel  +0.15%
           89137.93   470000  110000  parallel  +0.15%
           89189.19   220000  150000  parallel  +0.21%
    

    Code

    BEGIN {
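         # header columns use the same widths as the row formats below
         printf "%8s  %7s %7s  %-8s  %s\n", "Result", "Ra", "Rb", "Connect", "Error"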
    
         max_error = 0.005         # +/- 0.5%
         max_multiplier = 10000       # try four decades
    
         format = "%8.2f  %7d %7d  %-8s  %+4.2f%%"
         formnz = "%8.2f  %7d %7d  %-8s"
    
         limit_hi = target * (1 + max_error)
         limit_lo = target * (1 - max_error)
    
         $0 = "10 11 12 13 15 16 18 20 22 24 27 30 33 36 39 43 47 51 56 62 68 75 82 91"
    
         for (i = 1; i < 25; i++) {
           e24[i] = $i
         }
         for (u = 1; u < 25; u++) {
           for (v = 1; v < 25; v++) {
                for (i = 1; i <= max_multiplier; i *= 10) {
                     x = e24[u] * i
                     if (x == target) {
                       continue
                     }
                     for (j = 1; j <= max_multiplier; j *= 10) {
                       y = e24[v] * j
                       if (y == target) {
                            continue
                       }
                       combo(e24[u] * i, e24[v] * j)
                     }
                }
           }
         }
         exit      # skip file reader
    }
    function combo(a, b,   c) {
         # parallel
         c = a * b / (a + b)
         combo2(a, b, c, "parallel")
         # series
         c = a + b
         combo2(a, b, c, "series")
    }
    function combo2(a, b, c, d,   e, f) {
         # avoid duplicates and ignore result when error too big
         if (a < b || c < limit_lo || c > limit_hi) { return }
         e = 100 * (c - target) / target      # percentage error
         f = (e == 0 ? formnz : format)       # select output format
         result[n++] = sprintf(f, c, a, b, d, e)
    }
    END {
         # sort by result value, print list
         n = asort(result, sort_result)
         for (i = 1; i <= n; i++) {
           print sort_result[i]
         }
    }
    

    Author

    Copyright (c) 2009 Grant Coady <http://bugsplatter.id.au> GPLv2


    categories: ,TextMining,Oct,2009,JohnF

    Zipf's Law

    These notes come from John Fry's Counting with Awk lecture in his subject Linguistics 115: Corpus Linguistics, Fall 2007, SJSU.

    Much research has reported that human writings follow well-defined laws. For example, natural language text and software programs conform tightly to simple and regular statistical models. One such model, Zipf's Law, states that multiplying a word's rank r by its frequency f produces (roughly) a constant value C; i.e., r times f is a constant. The frequency f of a word is obtained by counting the number of times it occurs in a text, and r is obtained by ranking all the words by frequency (1. the; 2. and; 3. I; etc.). Example of Zipf's Law for five words in the London-Lund corpus of spoken conversation:

    r  X     f   = C 
    35 very  836 = 29,260 
    45 see   674 = 30,330 
    55 which 563 = 30,965 
    65 get   469 = 30,485 
    75 out   422 = 31,650 
    
    Another way of expressing Zipf's Law is to say that frequency is reciprocally proportional to rank. For example, the 2nd-ranked word ("and") appears half as often as the 1st-ranked word ("the"). More generally, the nth-ranked word appears 1/n as often as "the".
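
    The reciprocal form can be sketched as a small calculation: given the frequency of the top-ranked word, the predicted frequency at rank r is that frequency divided by r. The wrapper name zipf_predict and its arguments are invented for this illustration and are not part of the lecture notes.

```shell
# Hypothetical helper: zipf_predict TOP_FREQ MAX_RANK
# Prints each rank and the frequency Zipf's Law predicts for it (TOP_FREQ / rank).
zipf_predict() {
  awk -v f1="$1" -v max="$2" 'BEGIN {
      for (r = 1; r <= max; r++)
          printf "%d %.0f\n", r, f1 / r
  }'
}
```

    Comparing these predictions with the measured FREQ column of a ranked list gives a quick check of how closely a corpus follows the law.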

    Here is a short awk program, saved as ~jfry/zipf.awk, that reads in a ranked frequency list and computes r times f.

    BEGIN {printf "%20s%7s%7s%10s\n", "WORD","RANK","FREQ","C"} 
          {printf "%20s%7d%7d%10d\n", $2, NR, $1, NR*$1} 
    

    This program can be run with

    awk -f ~jfry/zipf.awk 
    

    Testing Zipf's Law on Shakespeare:

    $ tr A-Z a-z < shakespeare.txt | tr -sc a-z '\n' | sort | 
    uniq -c | sort -rn | awk -f ~jfry/zipf.awk 
    WORD RANK  FREQ      C WORD RANK  FREQ      C 
    the     1 27378  27378 s i    17  7721 131257 
    and     2 26084  52168 for    18  7655 137790 
    i       3 22538  67614 be     19  6897 131043 
    to      4 19771  79084 his    20  6859 137180 
    of      5 17481  87405 he     21  6679 140259 
    a       6 14725  88350 your   22  6657 146454 
    you     7 13826  96782 this   23  6608 151984 
    my      8 12489  99912 but    24  6277 150648 
    that    9 11318 101862 have   25  5902 147550 
    in     10 11112 111120 as     26  5749 149474 
    is     11  9319 102509 thou   27  5549 149823 
    d      12  8960 107520 him    28  5205 145740 
    not    13  8512 110656 so     29  5058 146682 
    with   14  7791 109074 will   30  5008 150240 
    me     15  7777 116655 what   31  4808 149048 
    it     16  7725 123600 thy    32  4034 129088 
    

    Testing Zipf's Law on newswire

    $ cd /corpora/newswire/data 
    $ zcat -r .|grep -v '^<' | tr A-Z a-z|tr -sc a-z '\n' | sort| 
    uniq -c | sort -rn | awk -f /home/jfry/zipf.awk 
    WORD RANK FREQ    C WORD RANK FREQ    C 
    the     1 142M 142M by     16 14M  224M 
    to      2  60M 120M he     17 13M  235M 
    of      3  60M 180M at     18 13M  244M 
    a       4  53M 214M as     19 12M  230M 
    and     5  51M 257M from   20 10M  216M 
    in      6  51M 307M be     21  9M  201M 
    s       7  28M 202M his    22  9M  205M 
    for     8  22M 178M has    23  9M  208M 
    that    9  21M 195M have   24  9M  217M 
    said   10  19M 199M but    25  8M  212M 
    on     11  19M 214M are    26  8M  218M 
    is     12  16M 200M an     27  8M  225M 
    with   13  15M 197M will   28  7M  207M 
    was    14  14M 203M i      29  7M  213M 
    it     15  14M 211M not    30  7M  217M 
    

    categories: ,Mawk,Oct,2009,JMellander

    Faster Hashing in Mawk

    J. Mellander reports in comp.lang.awk how to make Mawk's hashing run 20+ times faster.

    Recently, for a project, I had the occasion to use mawk - I have a list of ~12,000,000 Unix timestamps to nanosecond precision that I needed to match against the first field of every record in a number of huge files. Gawk couldn't handle the number of records, and so I used mawk, as being more memory thrifty. The program was a one-liner like this:

    mawk 'FNR==NR {x[$1]++;next} $1 in x' timestamp_file log_file
    

    which works perfectly, but the run time seemed excessive - many hours per log file - which made me think that the hashing function was causing many collisions, and thus hash chaining.....

    When stuck in a slow meeting, I started looking at the mawk source code, specifically the hashing functions, of which there are 2: hash() in hash.c & ahash() in array.c

    I was surprised to find that the hashing functions in both cases essentially just sum the bytes of the key to create the hash - this means that 123, 321, 213, etc. would all hash to the same location and cause collisions, and hash chaining.
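
    The collision problem can be reproduced in awk itself. The sketch below is only a simplified model: sum_hash adds up the bytes of the key (it is not mawk's actual code), and fnv1 is a 32-bit FNV-1 written with plain arithmetic so that it runs without gawk's bit functions. The name hash_demo and the ASCII-only lookup table are assumptions of this example.

```shell
# Hypothetical demo: hash_demo KEY...
# Prints each (ASCII) key followed by a byte-sum hash and a 32-bit FNV-1 hash.
hash_demo() {
  awk '
  function xor8(a, b,   r, bit) {          # XOR of two values in 0..255, no gawk extensions
      r = 0
      for (bit = 1; bit < 256; bit *= 2)
          if (int(a / bit) % 2 != int(b / bit) % 2) r += bit
      return r
  }
  function sum_hash(s,   i, h) {           # simplified model of an additive hash
      h = 0
      for (i = 1; i <= length(s); i++) h += ORD[substr(s, i, 1)]
      return h
  }
  function fnv1(s,   i, h, lo) {
      h = 2166136261                       # FNV offset basis
      for (i = 1; i <= length(s); i++) {
          # h * 16777619 mod 2^32; 16777619 = 2^24 + 403, split so doubles stay exact
          h = ((h % 256) * 16777216 + h * 403) % 4294967296
          lo = h % 256
          h = h - lo + xor8(lo, ORD[substr(s, i, 1)])   # XOR only touches the low byte
      }
      return h
  }
  BEGIN {
      for (i = 1; i < 128; i++) ORD[sprintf("%c", i)] = i   # ASCII lookup table
      for (i = 1; i < ARGC; i++)
          printf "%s %d %.0f\n", ARGV[i], sum_hash(ARGV[i]), fnv1(ARGV[i])
  }' "$@"
}
```

    Running hash_demo 123 321 213 prints the same byte sum (150) for all three keys, while the FNV-1 column differs for each of them.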

    Modifying the hashing to a more efficient hash caused an enormous gain in efficiency, as in this test:

    $ wc -l j
    2999999 j
    
    $ time mawk-1.3.3/mawk '{x[$1]++}' j >/dev/null
    
    real    2m24.362s
    user    2m20.174s
    sys     0m0.663s
    
    $ time mawk-1.3.3a/mawk '{x[$1]++}' j >/dev/null
    
    real    0m6.607s
    user    0m6.146s
    sys     0m0.241s
    

    mawk-1.3.3a has the below modifications. In hash.c I replaced the 'hash' function with:

    /*
    FNV-1 hash function, per
    en.wikipedia.org/wiki/Fowler-Noll-Vo_hash_function
    */
    unsigned hash(s)
    register char *s ;
    {
    	register unsigned h = 2166136261 ;
    	while (*s) h = (h * 16777619) ^ *s++ ;
    	return h ;
    }
    

    and in array.c replaced 'ahash' with:

    /*
    FNV-1 hash function, per
    en.wikipedia.org/wiki/Fowler-Noll-Vo_hash_function
    */
    static unsigned ahash(sval)
    STRING* sval ;
    {
    	register unsigned h = 2166136261 ;
    	register char *s = sval->str;
    
    	while (*s) h = (h * 16777619) ^ *s++ ;
    	return h ; 
    }
    

    categories: ,Mawk,Oct,2009,BrendanO

    Mawk: faster than C, C++, Java, Perl, Ruby,...

    Brendan O'Connor writes in his blog:

      When one of these new fangled 'Big Data' sets comes your way, the very first thing you have to do is data munging: shuffling around file formats, renaming fields and the like. Once you're dealing with hundreds of megabytes of data, even simple operations can take plenty of time.

      For one recent ad-hoc task I had - reformatting 1GB of textual feature data into a form Matlab and R can read - I tried writing implementations in several languages, with help from my classmate Elijah.

      To be clear, the problem is to take several files of (item name, feature name, value) triples, like:

      000794107-10-K-19960401 limited 1
      000794107-10-K-19960401 colleges 1
      000794107-10-K-19960401 code 2
      ...
      004334108-10-K-19961230 recognition 1
      004334108-10-K-19961230 gross 8
      ...
      
      And then rename items and features into sequential numbers as a sparse matrix: (i, j, value) triples. Items should count up from inside each file; but features should be shared across files, so they need a shared counter. Finally, we need to write a mapping of feature IDs back to their names for later inspection; this can just be a list.

      Since [AWK is] a standardized language, many implementations exist. One of them, MAWK, is incredibly efficient. It outperforms all other languages, including statically typed compiled ones like Java and C++! It wins on both LOC and performance criteria - a rare feat indeed, transcending the usual competition of slow-but-easy scripting languages versus fast-but-hard compiled languages.

      All the code, results, and data can be obtained at github.com/brendano/awkspeed. I'd love to see results for more languages.
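
    The renaming task described in the quote can be sketched in a few lines of awk. This is an illustration written for this page, not the code from the repository: the name to_sparse, the choice to write all triples to standard output, and the separate vocabulary file are assumptions.

```shell
# Hypothetical helper: to_sparse VOCAB_FILE TRIPLE_FILE...
# Writes "i j value" triples to stdout and the feature names, in ID order, to VOCAB_FILE.
to_sparse() {
  vocab=$1; shift
  awk -v vocab="$vocab" '
  FNR == 1      { split("", item); nitems = 0 }                # item IDs restart in each file
  !($1 in item) { item[$1] = ++nitems }
  !($2 in feat) { feat[$2] = ++nfeats; name[nfeats] = $2 }     # feature IDs shared across files
                { print item[$1], feat[$2], $3 }
  END           { for (j = 1; j <= nfeats; j++) print name[j] > vocab }
  ' "$@"
}
```

    Line j of the vocabulary file then holds the name of feature j, which is the "mapping of feature IDs back to their names" that the task asks for.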

    Editor's note: one reply to this blog entry, by Eric Young, optimized Brendan's Ruby solution and re-ran all the tests. Eric reported the following runtimes. Note that they confirm Brendan's results: mawk runs faster than everything else.

     33.8s     mawk
     36.3s     gcc c
     51.0s     java
     67.0s     perl Fletch.pl
     71.7s     python
     87.8s     perl
     95.8s     nawk
    101.4s     gawk
    114.0s     gcc
    133.0s     ruby1.9 eay.rb
    136.8s     ruby1.8 eay.rb
    327.6s     ruby1.8
    372.9s     ruby1.9
    

    categories: ,Aug,2009,ArnoldR

    Interview with Aharon Robbins

    Aharon Robbins, the maintainer of GNU Awk, answers some questions from Tim Menzies.

    Q: What is your favorite programming language (besides gawk)? And why?

      A: It depends for what. A long time ago I was a big Korn shell junkie, although these days I would do most high level things in a mixture of bash and awk, with awk doing the heavy lifting.

      For lower level things I prefer C++, although I have something of a love/hate relationship with the language. It's possible to write completely unreadable and unmaintainable code in it. It's also possible to write beautiful, clear, absolutely amazing code in it.

      I find that going back to C after working daily in C++ is hard, although I do it for gawk maintenance. For new programs I would work in C++, not C. For something big, I'd use the Qt framework for support and portability.

      I've been recently living in the C# world for my day job. The development environment is very addictive, but C# hasn't seduced me away from C++.

    Q: The open source world is a fascinating development paradigm. I'm therefore very curious to know what prompted you to write gawk?

      A: I didn't write it from scratch. I got involved shortly after picking up and reading the Aho, Weinberger & Kernighan book in late 1987 when it came out.

      New awk wasn't widely available. I had been involved with USENET since around 1983, and knew about the GNU project. I also had a strong interest in compilers and interpreters, so I got in touch with the GNU project to see if they had an awk clone and to see if I could get involved in upgrading it to "new" awk.

      It turned out that they already had a volunteer, David Trueman, who was working on it, but he was happy to have help. He and I worked together until circa 1993 or 1994 when he had to stop being involved, and I became the sole maintainer.

      It was a lot of fun. The number of emails of the "I could not get my work done without gawk" sort was amazing; Unix awk would often roll over and die on some of the data sets people were running through gawk.

      Things really got shaken down when gawk became part of GNU/Linux distributions; then people were using it as the only awk, instead of alongside Unix awk.

    Q: In retrospect, what are the best/worst features of gawk?

      A: The best feature is the pattern/action paradigm. The implicit read-a-record loop is wonderful. This is the language's data-driven nature, as opposed to the imperative nature of most languages.

      Associative arrays rank second; they are quite powerful.

      There are some warts inherited from Unix awk and left unspecified by POSIX. These are relatively minor.

      The lack of an explicit concatenation operator is an obvious one.

      The lack of real multi-dimensional arrays is another.

      There are features just in gawk that in retrospect seem to have been a waste of time, such as bringing out to the awk level the possibility to internationalize a program. I don't think anyone uses that.

      IGNORECASE was a huge pain to get right; if I'd known how long it would take, I wouldn't have bothered.

      The biggest "lack" is that there isn't an easy, standard way to provide extensibility; there are way too many things in the C library today (and even yesterday) that the awk programmer just can't get to. (Like the chdir system call!) I hope to eventually provide some better mechanisms for this, but I don't know how much actual filling in I can do also.

    Q: Under what circumstances would you recommend/not recommend it?

      A: Gawk is good for small to medium level programs that have to process text and/or do simple numeric work (summing up columns, averaging, VERY simple statistics work). It has a central place in traditional Unix / Linux shell scripting when portability is a must.

      But I wouldn't care to try to write a military air traffic command and control system in gawk, for example. :-)

    Q: Gawk has a reputation of being slow...

      A: "Slow" compared to what? As far as I've seen, gawk is always faster than Unix awk. Michael Brennan's mawk is even faster, but until recently it has been unmaintained, and it lacks many important, modern features.

      Relative to C? Of course. So what? You have to write 5 - 10 times as much C as you do awk to do the same or less. (I remember one program I wrote in C at around 1200 lines and rewrote in under 300 lines of awk, and the awk was clearer and did more.)

      Relative to perl? It depends. I have had emails telling me that gawk was faster than perl for what the users were doing. And if not, do I care? Not really - perl is a write-only language, and don't get me started on Perl 6. :-)

      All that said, this got me to thinking about a possible bottleneck that I'll be investigating in the near future.

    Q: Awk also has a reputation of not being suitable for "real" projects. Is that reputation deserved?

      A: I don't think that contention is true: it may be that scripting languages in general have such a reputation - Ronald Loui has written about this, but I don't think the contention is true for scripting languages either.

      As is always the case, the answer is "it depends". What is the scale of what you're trying to do? Who is the customer? When Rick Adams was still running UUNET, he used a suite of awk programs to do his accounting. That's as "real" a project as you can get: billing your (hundreds or thousands of) customers for their resource usage. And he used gawk, since Unix awk would just roll over and die. (Unix awk has gotten better as a result of the "competition", but that's a different story. :-)

    Q: Are you aware of any landmark projects that use gawk?

      A: GNU/Linux. :-)

      Not really. Gawk "just works", and that in and of itself is a testimony to its quality and value.

    Q: Looking a decade into the future, can you see gawk disappearing? Why (not)?

      A: I don't think so. The bigger question is will I still be involved with it 10 years from now? I don't know.

      I still have some things I'd like to see happen with it that are interesting and valuable and may even end up being relatively unique. I just have to find the time (or some other volunteers :-) to work on them.

    Q: Currently, how are you filling your time?

      A: I have a full time job as a software engineer with Intel. I have a wife and four wonderful children, as well as a dog. That's enough right there to keep me busy.

      I am the series editor for the Prentice Hall Open Source Software Development Series which also takes some of my time.

      And I still try to do some gawk work in between everything else!


    categories: Mawk,Libmawk,Aug,2009,TiborP

    How to Call Awk from "C" with Libmawk

    Libmawk is a fork of mawk 1.3.3 restructured for embedding. This means the user gets libmawk.h and libmawk.so and can embed the awk scripting language in any application written in C.

    The project can be downloaded here.

    Features

    Libmawk has the following main features:

    • load and run multiple awk scripts independently, in parallel
    • scripts do not read stdin but a memory buffer the embedding process can fill from time to time
    • running scripts in (mostly) non-blocking manner - that is, the script will not block if the process can not provide new input
    • all these without threads or fork()
    • call awk functions from C, using vararg for smooth integration
    • call C functions from awk scripts
    • resolve existing awk variables from C - read or write variables

    License

    Since mawk is licensed under the GNU GPL v2 and libmawk is a fork of mawk, libmawk is licensed under the GNU GPL v2 too.

    Author

    Tibor Palinkas


    categories: Mascot,Jul,2009,DickL

    New Awk Mascot: 'AWK-eye the Dwarf?

    by Dick L.

    I write to suggest that the Awk mascot's name is Hawk-eye (usually spoken as 'AWK-eye with a silent H).

    I suggest 'AWK-eye is a DWARF, based on the following analogy:

    • My good friend hAWK-eye is a dwarf, from the first ages, long ago. The dwarves are renowned for their skill in mining and metalwork.
    • My friend is known as Hawk-eye, because even among dwarves, he can mine precious metals and jewels from the dross with great ease and precision. And, having found these precious things, he is able to quickly fashion them into all manner of things, both practical and beautiful.
    • As one from the first age, he is sometimes called primitive. I prefer to call him elemental, so tightly focused is he on what he does best: mining and making from the mountains of text I throw to him. Like all dwarves, he is small - but sinewy and unbreakable.
    • He carries with him his tools of trade - a strong sieve of subtle REGEXP for separating the jewels and metal from the dross, and his hammer that he uses to fashion the gold, silver and jewels into useful and beautiful objects. In appearance he is old and gnarly, but don't be put off - he knows his stuff and works willingly.

    I can't draw, but 'AWK-eye looks about half way between Gimli from Lord of the Rings, and Doc from the Disney Snow White and Seven Dwarves. (He has been known to sing "hi ho, hi ho, it's off to work I go". He likes to work!)

    I know many spirits and sprites from the first age - LISP, APL, Assembler, Basic, Fortran and Algol. However, I have lost contact with most of these old friends, but ask 'AWK-eye to do new work most weeks. Why?

    • He is small enough that I can take him everywhere. No esoteric installs and fancy GUIs and other bloat.
    • He is so focused on doing the work that I often need done: simple mining and re-fashioning of text, voluminous text.
    • I can say what I want so simply. 'AWK-eye speaks filtering and fashioning. Usually a few lines to 'AWK-eye accomplish what would take ten or twenty lines with the new age creatures.

    Yes, I love python, and javascript and all those creatures of later ages. And for some projects, functions as first class citizens, objects and the works are just what I want.

    But for many daily jobs, 'AWK-eye is on the sweet spot of enough expressiveness to do the job, but not so much as to be hard to remember, and is small enough I have him everywhere.


    categories: News,Mar,2009,Admin

    The Awk Book's Code

    Brian Kernighan has granted permission for this site to host the code from the original Awk book:

    • The AWK Programming Language
    • by Alfred V. Aho, Brian W. Kernighan, and Peter J. Weinberger,
    • Addison-Wesley, 1988.
    • ISBN 0-201-07981-X.

    The code can be viewed here.


    categories: Runawk,Project,Tools,Nov,2009,AlexC

    Runawk 0.18 Released

    Download

    http://sourceforge.net/projects/runawk

    About

    runawk is a small wrapper for the AWK interpreter that helps one write standalone AWK scripts. Its main feature is to provide a module/library system for AWK which is somewhat similar to Perl's "use" command. It also allows you to select a preferred AWK interpreter and to set up the environment for your scripts. It also provides other helpful features; for example, it includes numerous useful modules.

    Major Changes IN RUNAWK-0.18.0

    Makefile:

    • "install-dirs" target has been renamed to "installdirs"
    • At compile time MODULESDIR can contain a *list* of colon-separated directories, e.g. /usr/local/share/runawk:/usr/local/share/awk
    • Support for multiply applied options, e.g. -vvv for increasing verbosity level. If option without arguments is multiply applied, getarg() function returns a number of times it was applied, not just 0 or 1.

    New modules:

    • init_getopt.awk using alt_getopt.awk and used by power_getopt.awk. Its goal is to initialize the `short_opts' and `long_opts' variables but not run the `getopt' function.
    • heapsort.awk : heapsort :-)
    • quicksort.awk : quicksort :-)
    • sort.awk : either heapsort or quicksort, the default is heapsort. Unfortunately GAWK's asort() and asorti() functions do *not* satisfy my needs. Another (and more important) reason is portability.

    Improvements, clean-ups and fixes in regression tests.

    Also, runawk-0.18.0 was successfully tested on the following platforms: NetBSD-5.0/x86, NetBSD-2.0/alpha, OpenBSD-4.5/x86, FreeBSD-7.1/x86, FreeBSD-7.1/sparc, Linux/x86 and Darwin/ppc.

    Author

    Aleksey Cheusov


    categories: Runawk,Project,Tools,Sept,2009,AlexC

    New release: RUNAWK 0.17

    What is RUNAWK?

    RUNAWK is a small wrapper for the AWK interpreter that helps one write standalone AWK scripts. Its main feature is to provide a module/library system for AWK which is somewhat similar to Perl's "use" command. It also allows you to select a preferred AWK interpreter and to set up the environment for your scripts. RUNAWK makes programming AWK easy and efficient. RUNAWK also provides many useful AWK modules.

    Sources

    Major Changes

    Version 0.17.0, by Aleksey Cheusov, Sat, 12 Sep 2009

    runawk:

    • ADDED: new option for runawk for #use'ing modules: -f. runawk can also be used for oneliners! ;-)
            runawk -f abs.awk -e 'BEGIN {print abs(-123); exit}'
      
    • In a multilined code passed to runawk using option -e, spaces are allowed before #directives.
    • After inventing the alt_getopt.awk module there is no reason for the heuristic that detects whether to add `-' to AWK arguments or not, so I've removed this heuristic. Use the alt_getopt.awk module or another "smart" module for handling options correctly!

    alt_getopt.awk and power_getopt.awk:

    • FIX: for "abc:" short options specifier BSD and GNU getopt(3) accept "-acb" and understand it as "-a -cb", they also accept "-ac b" and also translate it to "-a -cb". Now alt_getopt.awk and power_getopt.awk work the same way.

    power_getopt.awk:

    • -h option doesn't print usage information, --help (and its short synonym) does.

    New modules:

    • shquote.awk, implementing shquote() function.
      shquote(str):
        `shquote' transforms the string `str' by adding shell escape and quoting characters so that it can be passed as an argument to the system() and popen() functions and will have the correct value after being evaluated by the shell.
      Inspired by NetBSD's shquote(3) from libc.
    • runcmd.awk, implementing functions runcmd1() and xruncmd1()
      runcmd1(CMD, OPTS, FILE):
        wrapper for function system() that runs a command CMD with options OPTS and one filename FILE. Unlike system(CMD " " OPTS " " FILE), the function runcmd1() correctly handles FILE and CMD containing spaces, single quotes, double quotes, tildes, etc.
      xruncmd1(FILE):
        safe wrapper for runcmd1(). awk exits with an error if running the command failed.
    • isnum.awk, implementing trivial isnum() function, see the source code.
    • alt_join.awk, implementing the following functions:
      join_keys(HASH, SEP):
        returns string consisting of all keys from HASH separated by SEP.
      join_values(HASH, SEP):
        returns string consisting of all values from HASH separated by SEP.
      join_by_numkeys (ARRAY, SEP [, START [, END]]):
        returns string consisting of all values from ARRAY separated by SEP. Indices from START (default: 1) to END (default: +inf) are analysed. Collecting values is stopped on index absent in ARRAY.
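
    The idea behind shquote() can be sketched as follows. This is not the code of shquote.awk; it is a minimal illustration, under the assumption that wrapping the string in single quotes and escaping embedded single quotes is enough. The names shquote_sketch and quote are invented for the example.

```shell
# Hypothetical sketch, not the code of shquote.awk: shquote_sketch STRING
# Prints STRING wrapped in single quotes, with embedded single quotes escaped.
shquote_sketch() {
  awk '
  function quote(s,   q, n, parts, i, out) {
      q = "\047"                          # a single quote character
      n = split(s, parts, q)              # pieces between embedded quotes
      out = parts[1]
      for (i = 2; i <= n; i++)
          out = out q "\\" q q parts[i]   # close quote, escaped quote, reopen
      return q out q
  }
  BEGIN { print quote(ARGV[1]) }' "$1"
}
```

    The quoted result can be concatenated into a command line for system() without the shell splitting it on spaces or expanding characters such as $ and ~.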

    categories: Runawk,Project,Tools,Apr,2009,AlexC

    New release: Runawk 0.16

    In comp.lang.awk, Aleksey Cheusov writes:

    I've made runawk-0.16.0 release. This release has lots of important improvements and additions. Sources are available from

    What is runawk?

    RUNAWK is a small wrapper for the AWK interpreter that helps one write standalone programs in AWK. It provides MODULES for AWK similar to PERL's "use" command and other powerful features. Dozens of ready-to-use modules are also provided.

    (For more information, see details from the last release.)

    Major changes in this release

    Lots of demo programs for most runawk modules were created and they are in examples/ subdirectory now.

    New MEGA module ;-) power_getopt.awk See the documentation and demo program examples/demo_power_getopt. It makes options handling REALLY easy (see below).

    New modules:

    • embed_str.awk
    • has_suffix.awk
    • has_prefix.awk
    • readfile.awk
    • modinfo.awk

    Minor fixes and improvements in dirname.awk and basename.awk. Now they are fully compatible with dirname(1) and basename(1)

    RUNAWK sets the following environment variables for the child awk subprocess:

    • RUNAWK_MODC - A number of modules (-f filename) passed to AWK
    • RUNAWK_MODV_<n> - Full path to the module #n, where n is in [0..RUNAWK_MODC) range.
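
    A script started by runawk could inspect these variables through ENVIRON. The snippet below is a small sketch that relies only on the two variable names given above; the wrapper name list_runawk_modules is invented for the example.

```shell
# Hypothetical helper: list_runawk_modules
# Prints the module paths found in RUNAWK_MODC / RUNAWK_MODV_<n>, one per line.
list_runawk_modules() {
  awk 'BEGIN {
      n = ENVIRON["RUNAWK_MODC"] + 0            # 0 when the variable is unset
      for (i = 0; i < n; i++)                   # n is in the [0..RUNAWK_MODC) range
          print ENVIRON["RUNAWK_MODV_" i]
  }'
}
```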

    RUNAWK sets RUNAWK_ART_STDIN environment variable for the child awk subprocess to 1 if additional/artificial `-' was added to the list to awk's arguments.

    Makefile:

    • bmake-isms were removed. The Makefile is now fully compatible with FreeBSD make.
    • CLEANFILES target is used instead of hand-made rules
    • Minor fix in 'test_all' target

    Power_GetOpt.awk

    The most powerful feature of this release is the power_getopt.awk module. It provides a very powerful and very easy way to handle options. Everything is in the usage message; you shouldn't have to do anything else. I think the example below is easy to follow.

    Example Code

    % cat 1.awk
    #!/usr/bin/env runawk
    
    #use "power_getopt.awk"
    
    #.begin-str help
    # power_getopt - program demonstrating a power of power_getopt.awk module
    # usage: power_getopt [OPTIONS]
    # OPTIONS:
    #    -h|--help                  display this screen
    #    -f|--flag                  flag
    #       --long-flag             long flag only
    #    -s                         short flag only
    #    =F|--FLAG           flag with value
    #.end-str
    
    BEGIN {
            print "f         --- " getarg("f")
            print "flag      --- " getarg("flag")
            print "long-flag --- " getarg("long-flag")
            print "s         --- " getarg("s")
            print "F         --- " getarg("F", "default1")
            print "FLAG      --- " getarg("FLAG", "default2")
    
            exit 0
    }
    

    ./1.awk

    % ./1.awk
    f         --- 0
    flag      --- 0
    long-flag --- 0
    s         --- 0
    F         --- default1
    FLAG      --- default2
    

    ./1.awk -h

    % ./1.awk -h
    power_getopt - program demonstrating a power of power_getopt.awk module
    usage: power_getopt [OPTIONS]
    OPTIONS:
       -h|--help                  display this screen
       -f|--flag                  flag
          --long-flag             long flag only
       -s                         short flag only
       -F|--FLAG           flag with value
    

    ./1.awk -f

    % ./1.awk -f
    f         --- 1
    flag      --- 1
    long-flag --- 0
    s         --- 0
    F         --- default1
    FLAG      --- default2
    

    ./1.awk -F value

    % ./1.awk -F value
    f         --- 0
    flag      --- 0
    long-flag --- 0
    s         --- 0
    F         --- value
    FLAG      --- value
    

    ./1.awk --FLAG=value

    % ./1.awk --FLAG=value
    f         --- 0
    flag      --- 0
    long-flag --- 0
    s         --- 0
    F         --- value
    FLAG      --- value
    

    categories: Mascot,Sept,2009,PanosP

    Killer Awk Snake

    Panos Papadopoulos offers the latest entry in our Awk mascot competition:

    Scary, yes?


    categories: Mascot,Apr,2009,VenkatesanS

    New Mascot

    Venkatesan Satish offers a new entry in our Awk mascot competition:


    categories: Wp,Apr,2009,Admin

    Word Processing in Awk

    These pages focus on word processing tools in Awk.


    categories: Interpreters,Apr,2009,Admin

    Writing Interpreters

    These pages focus on language interpreters, written in Awk.


    categories: Awk100,Interpreters,Apr,2009,HenryS

    AASL: Parser Generator in Awk

    Download

    Download from LAWKER

    Synopsis

    aaslg [ -x ] [ file ... ]
    aaslr [ -x ] table [ file ... ]
    

    Description

    Aaslg and aaslr implement the Amazing Awk Syntax Language, AASL (pronounced ``hassle''). Aaslg (pronounced ``hassling'') takes an AASL specification from the concatenation of the file(s) (default standard input) and emits the corresponding AASL table on standard output. Aaslr parses the contents of the file(s) (default standard input) according to the AASL table in file table, emitting the table's output on standard output.

    Both take a -x option to turn on verbose and cryptic debugging output. Both look in a library directory for pieces of the AASL system; the AASLDIR environment variable, if present, overrides the default notion of the location of this directory.

    Aaslr expects input to consist of input tokens, one per line. For simple tokens, the line is just the text of the token. For metatokens like ``identifier'', the line is the metatoken's name, a tab, and the text of the token. [xxx discuss `#' lines]

    Aaslr output, in the absence of syntax errors, consists of the input tokens plus action tokens, which are lines consisting of `#!' followed immediately by an identifier. If the syntax of the input does not match that specified in the AASL table, aaslr emits complaint(s) on standard error and attempts to repair the input into a legal form; see ``ERROR REPAIR'' below. Unless errors have cascaded to the point where aaslr gives up (in which case it emits the action token ``#!aargh'' to inform later passes of this), the output will always conform to the AASL syntax given in the table.

    Normally, a complete program using AASL consists of three passes, the middle one being an invocation of aaslr. The first pass is a lexical analyzer, which breaks free-form input down into input tokens in some suitable way. The third pass is a semantics interpreter, which typically responds to input tokens by momentarily remembering them and to action tokens by executing some action, often using the remembered value of the previous input token. Aaslg is in fact implemented using AASL, following this structure; it implements the -x option by just passing it to aaslr.
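
    The third pass described above might look like the following minimal sketch, which assumes aaslr-style output on standard input. The action names ``decl'' and ``use'' and the shell function name are invented for illustration; they are not taken from the AASL distribution, and `#' comment lines are not handled.

```shell
# Hypothetical third pass: reads aaslr-style output on standard input.
# The action names "decl" and "use" are invented for this sketch.
semantics() {
  awk -F'\t' '
    /^#!/ {                        # action token: "#!" then an identifier
      action = substr($0, 3)
      if (action == "aargh") { print "parser gave up"; exit 1 }
      if (action == "decl") print "declare " last
      if (action == "use")  print "use " last
      next
    }
    { last = (NF > 1) ? $2 : $1 }  # input token: remember its text
  '
}

# Six sample lines: metatokens carry a name, a tab, and the token text.
printf 'id\tcount\n#!decl\n=\nid\tcount\n#!use\n;\n' | semantics
```

    Run on the six sample lines, the sketch prints ``declare count'' and then ``use count'': each action operates on the most recently remembered input token.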

    AASL Specifications

    An AASL specification consists of class definitions, text definitions, and rules, in arbitrary order (except that class definitions must precede use of the classes they define). A `#' (not enclosed in a string) begins a comment; characters from it to the end of the line are ignored. An identifier follows the same rules as a C identifier, except that in most contexts it can be at most 16 characters long. A string is enclosed in double quotes ("") and generally follows C syntax. Most strings denote input tokens, and references to ``input token'' as part of AASL specification syntax should be read as ``string denoting input token''.

    A class definition is an identifier enclosed in angle brackets (<>) followed by one or more input tokens followed by a semicolon (;). It gives a name to a set of input tokens. Classes whose names start with capital letters are user abbreviations; see below. Classes whose names start with lowercase letters are special classes, used for internal purposes. The current special classes are:

    trivial
    tokens which the parser can discard at will, in the expectation that they might be inserted erroneously; see ``ERROR REPAIR'' for details.
    lineterm
    tokens which terminate a logical line for purposes of resynchronization in error repair; see ``ERROR REPAIR'' for details.
    endmarker
    xxx

    For example, the class definitions used for AASL itself are:

    <trivial> "," ";"   ;
    <lineterm> ";" ;
    <endmarker> "EOF"   ;
    

    When AASL error repair is invoked, the parser sometimes needs to generate input tokens. In the case of a metatoken, the parser knows the token's name but needs to generate a text for it as well. A text definition consists of an input token, an arrow (->), and a string specifying what text should be generated for that token. For example, the text definitions used for AASL itself are:

    "id" -> "___"
    "string" -> "\"___\""
    

    The rules of a specification define the syntax that the parser should accept. The order of rules is not significant, except that the first rule is considered to be the top level of the specification. The specification is executed by calling the first rule; when execution of that rule terminates, execution of the specification terminates. If the user wishes this to occur only at end of input, he should arrange for the lexical analyzer to produce an endmarker token (conventionally ``EOF'') at the end of the input, and should write the first rule to require that token at the end.

    Note that an input token may be recognized considerably before it is accepted, but the parser emits it to the output only on acceptance.

    A rule consists of an identifier naming it, a colon (:), a sequence of items which is the body of the rule, and a semicolon (;). When a rule is called, it is executed by executing the individual items of the body in order (as modified by control structures) until either one of them explicitly terminates execution of the rule or the last item is executed.

    An item which is an input token requires that that token appear in the input at that point, and accepts it (causing it to be emitted as output).

    An item which is an identifier denotes a call to another rule, which executes the body of that rule and then returns to the caller. It is an error to call a nonexistent rule.

    An item which is an identifier preceded by `!' causes that identifier to be emitted as an action token; the identifier has no other significance.

    An item which is `<<' causes execution of the current rule to terminate immediately, returning to the calling rule.

    An item which is `>>' causes the execution of the innermost enclosing loop (see below) to terminate immediately, with execution continuing after the end of that loop. The loop must be within the same rule.

    An item which is an identifier preceded by `@%&!' causes an internal semantic action to be executed within the parser; this is normally needed only for bizarre situations like C's typedef. [xxx should give details I suppose]

    A choice is a sequence of branches enclosed in parentheses (()) and separated by vertical bars (|). The first branch that can be executed is executed, after which execution continues after the end of the choice.

    A loop is a sequence of branches enclosed in braces ({}) and separated by vertical bars (|). The first branch that can be executed is executed, and this is done repeatedly until the loop is terminated by `>>', after which execution continues after the end of the loop. (A loop can also be terminated by `<<' terminating execution of the whole rule.)

    A branch is just a sequence of items, like a rule body, except that it must begin with either an input token or a lookahead. If it begins with an input token, it can be executed only when that token is the next token in the input, and execution starts with acceptance of that token.

    A lookahead specifies conditions for execution of a branch based on recognizing but not accepting input token(s). The simplest form is just an input token enclosed in brackets ([]), in which case execution of that branch is possible only when that token is the next token in the input. The brackets can also contain multiple input tokens separated by commas, in which case the parser looks for any of those tokens. If a user-abbreviation class name appears, either by itself or as an element of a comma-separated list, it stands for the list of tokens given in its definition.

    If a lookahead's brackets contain only a `*', this is a default branch, executable regardless of the state of the input.

    As a very special case, a lookahead's brackets can contain two input tokens separated by slash (/), in which case that branch is executable only when those two tokens, in sequence, are next in the input. Warning: this is implemented by a delicate perversion of the error-repair machinery, and if the first of those tokens is not then accepted, the parser will die in convulsions. A further restriction is that the same input token may not appear as the first token of a double lookahead and as a normal lookahead token in the same choice/loop.

    Certain simple choice/loop structures appear frequently, and there are abbreviations for them:

    abbreviation	    expansion
    ( items ?)	        ( items  |  [*] )
    { items ?}	        { items  |  [*] >> }
    ( ! [look] items ?) ( [ look]  |  items )
    { ! [look] items ?} { [ look] >>  |  items }
    

    For example, here are the rules of the AASL specification for AASL, minus the actions (which add considerable clutter and are unintelligible without the third pass):

    	       rules: {
    				   "id" ":" contents ";"
    				   | "<" "id" ">" {"string" ?} ";"
    				   | "string" "->" "string"
    				   | "EOF" >>
    	       };
    	       contents: {
    				   ">>"
    				   | "<<"
    				   | "id"
    				   | "!" "id"
    				   | "@%&!" "id"
    				   | "string"
    				   | "(" branches ")"
    				   | "{" branches "}"
    				   | [*] >>
    	       };
    	       branches: (
    				   "!" "[" look "]" contents "?"
    				   | [*] branch (
    				   ["|"] {"|" branch ?}
    				   | "?" !endbranch
    				   | [*]
    				   )
    	       );
    	       branch: (
    				   "string" contents
    				   | "[" look "]" contents
    	       );
    	       look: (
    				   ["string"/"/"] "string" "/" "string"
    				   | "*"
    				   | [*] looker {"," looker ?}
    	       );
    	       looker: ( "string" | "id" ) ;
    

    Error Repair

    When the input token is not one of those desired, either because the item being executed is an input token and a different token appears on the input, or because none of the branches of a choice/loop is executable, error repair is invoked to try to fix things up. Sometimes it can actually guess right and fix the error, but more frequently it merely supplies a legal output so that later passes will not be thrown into chaos by a minor syntax error.

    The general error-repair strategy of an AASL parser is to give the parser what it wants and then attempt to resynchronize the input with the parser.

    [xxx long discussion of how ``what it wants'' is determined when there are multiple possibilities]

    Resynchronization is performed in three stages. The first stage attempts to resynchronize within a logical line, and is applied only if neither the input token nor the desired token is a line terminator (a member of the ``lineterm'' class). If the input token is trivial (a member of the ``trivial'' class), it is discarded. Otherwise it is retained, in hopes that it will be the next token that the parser asks for.

    Either way, an error message is produced, indicating what was desired, what was seen, and what was handed to the parser. If too many of these messages have been produced for a single line, the parser gives up, produces a last despairing message, emits a ``#!aargh'' action token to alert later passes, and exits. Barring this disaster, parsing then continues. If the parser at some point is willing to accept the input token, it is accepted and error repair terminates. If a line terminator is seen in input, or the parser requests one, before the parser is willing to accept the input token, the second stage begins.

    The second stage of resynchronization attempts to line both input and parser up on a line terminator. If the desired token is a line terminator and the input token is not, input is discarded until a line terminator appears. If the desired token is not a line terminator and the input token is, the input token is retained and parsing continues until the parser asks for a line terminator. Either way, the third stage then begins.

    The third stage of resynchronization attempts to reconcile line terminators. If the desired and input tokens are identical, the input token is accepted and error repair terminates. If they are not identical and the input token is trivial (yes, line terminators can be trivial, and ones like `;' probably should be), the input token is discarded. If the desired token is the endmarker, then the input token is discarded. Otherwise, the input token continues to be retained in hopes that it will eventually be accepted. [xxx this needs more thought] In any case, the second stage begins again.

    Files

    all in $AASLDIR:
    interp  table interpreter
    lex     first pass of aaslg
    syn     AASL table for aaslg
    sem     third pass of aaslg
    

    See Also

    awk(1), yacc(1)

    Diagnostics

    ``error-repair disaster'' means that the first token of a double lookahead could not be accepted and error repair was invoked on it.

    History

    Written at University of Toronto by Henry Spencer, somewhat in the spirit of S/SL (see ACM TOPLAS April 1982).

    Bugs

    Some of the restrictions on double lookahead are annoying.

    Most of the C string escapes are recognized but disregarded, with only a backslashed double-quote interpreted properly during text generation.

    Error repair needs further tuning; it has an annoying tendency to infinite-loop in certain odd situations (although the messages/line limit eventually breaks the loop).

    Complex choices/loops with many branches can result in very long lines in the table.

    Assessment

    The implementation of AASL was fairly straightforward, with AASL itself used to describe its own syntax. An AASL specification is compiled into a table, which is then processed by a table-walking interpreter. The interpreter expects input to be as tokens, one per line, much like the output of a traditional scanner. A complete program using AASL (for example, the AASL table generator) is normally three passes: the scanner, the parser (tables plus interpreter), and a semantics pass. The first set of tables was generated by hand for bootstrapping.

    Apart from the minor nuisance of repeated iterations of language design, the biggest problem of implementing AASL was the question of semantic actions. Inserting awk semantic routines into the table interpreter, in the style of yacc, would not be impossible, but it seemed clumsy and inelegant. Awk's lack of any provision for compile-time initialization of tables strongly suggested reading them in at run time, rather than taking up space with a huge BEGIN action whose only purpose was to initialize the tables. This makes insertions into the interpreter's code awkward.

    The problem was solved by a crucial observation: traditional compilers (etc.) merge a two-step process, first validating a token stream and inserting semantic action cookies into it, then interpreting the stream and the cookies to interface to semantics. For example, yacc's grammar notation can be viewed as inserting fragments of C code into a parsed output, and then interpreting that output. This approach yields an extremely natural pass structure for an AASL parser, with the parser's output stream being (in the absence of syntax errors) a copy of its input stream with annotations. The following semantic pass then processes this, momentarily remembering normal tokens and interpreting annotations as operations on the remembered values. (The semantic pass is, in fact, a classic pattern+action awk program, with a pattern and an action for each annotation, and a general ``save the value in a variable'' action for normal tokens.)

    The one difficulty that arises with this method is when the language definition involves feedback loops between semantics and parsing, an obvious example being C's typedef. Dealing with this really does require some imbedding of semantics into the interpreter, although with care it need not be much: the in-parser code for recognizing C typedefs, including the complications introduced by block structure and nested redeclarations of type names, is about 40 lines of awk. The in-parser actions are invoked by a special variant of the AASL ``emit semantic annotation'' syntax.

    A side benefit of top-down parsing is that the context of errors is known, and it is relatively easy to implement automatic error recovery. When the interpreter is faced with an input token that does not appear in the list of possibilities in the parser table, it gives the parser one of the possibilities anyway, and then uses simple heuristics to try to adjust the input to resynchronize. The result is that the parser, and subsequent passes, always see a syntactically-correct program. (This approach is borrowed from S/SL and its predecessors.) Although the detailed error-recovery algorithm is still experimental, and the current one is not entirely satisfactory when a complex AASL specification does certain things, in general it deals with minor syntax errors simply and cleanly without any need for complicating the specification with details of error recovery. Knowing the context of errors also makes it much easier to generate intelligible error messages automatically.

    The AASL implementation is not large. The scanner is 78 lines of awk, the parser is 61 lines of AASL (using a fairly low-density paragraphing style and a good many comments), and the semantics pass is 290 lines of awk. The table interpreter is 340 lines, about half of which (and most of the complexity) can be attributed to the automatic error recovery.

    As an experiment with a more ambitious AASL specification, one for ANSI C was written. This occupies 374 lines excluding comments and blank lines, and with the exception of the messy details of C declarators is mostly a fairly straightforward transcription of the syntax given in the ANSI standard. Generating tables for this takes about three minutes of CPU time on a Sun 3/180; the tables are about 10K bytes.

    The performance of the resulting ANSI C parser is not impressive: in very round numbers, averaged over a large program, it parses about one line of C per CPU second. (The scanner, 164 lines of awk, accounts for a negligible fraction of this.) Some attention to optimization of both the tables and the interpreter might speed this up somewhat, but remarkable improvements are unlikely. As things stand in the absence of better awk implementations or a rewrite of the table interpreter in C, it's a cute toy, possibly of some pedagogical value, but not a useful production tool. On the other hand, there does not appear to be any fundamental reason for the performance shortfall: it's purely the result of the slow execution of awk programs.

    Lessons From AASL

    The scanner would be much faster with better regular-expression matching, because it can use regular expressions to determine whether a string is a plausible token but must use substr to extract the string first. Nawk functions would be very handy for modularizing code, especially the complicated and seldom-invoked error-recovery procedure. A switch statement modelled on the pattern+action scheme would be useful in several places.

    Another troublesome issue is that arrays are second-class citizens in awk (and continue to be so in nawk): there is no array assignment. This lack leads to endless repetitions of code like:

    for (i in array) 
        arraystack[i ":" sp] = array[i] 
    

    whenever block structuring or a stack is desired. Nawk's multi-dimensional arrays supply some syntactic sugar for this but don't really fix the problem. Not only is this code clumsy, it is woefully inefficient compared to something like

    arraystack[sp] = array 
    

    even if the implementation is very clever. This significantly reduces the usefulness of arrays as symbol tables and the like, a role for which they are otherwise very well suited.
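
    For illustration, the save-and-restore pattern could be packaged as a pair of functions, as in the minimal sketch below. It assumes a nawk-compatible awk (user-defined functions, delete, and comma subscripts); the names sym, stack, push, and pop are invented for this example and do not come from the AASL sources.

```shell
# Sketch: block-structured symbol table built from element-by-element copies.
scope_demo() {
  awk '
    # Save every element of sym under the new stack depth.
    function push(    i) { sp++; for (i in sym) stack[sp, i] = sym[i] }
    # Empty sym, then restore the elements saved at the current depth.
    function pop(    i, k) {
      for (i in sym) delete sym[i]
      for (i in stack) {
        split(i, k, SUBSEP)
        if (k[1] == sp) { sym[k[2]] = stack[i]; delete stack[i] }
      }
      sp--
    }
    BEGIN {
      sym["x"] = "int"
      push()                        # enter an inner block
      sym["x"] = "char"; sym["y"] = "float"
      pop()                         # leave it again
      print sym["x"], (("y" in sym) ? "y kept" : "y gone")
    }
  '
}

scope_demo
```

    The sketch prints ``int y gone'': the outer binding of x is restored and the inner y disappears. Every push and pop still walks the whole array, which is the inefficiency complained about above.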

    It would also be of some use if there were some way to initialize arrays as constant tables, or alternatively a guarantee that the BEGIN action would be implemented cleverly and would not occupy space after it had finished executing.

    A minor nuisance that surfaces constantly is that getting an error message out to the standard-error descriptor is painfully clumsy: one gets to choose between putting error messages out to a temporary file and having a shell "wrapper" process them later, or piping them into "cat >&2" (!).
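
    The "cat >&2" workaround can be sketched as follows. The helper name warn, the shell function name, and the two-field input format are invented for this example.

```shell
# Sketch: send diagnostics to standard error by piping them to "cat 1>&2".
check() {
  awk '
    function warn(msg) { print "warning: " msg | "cat 1>&2" }
    NF != 2 { warn("line " NR ": expected 2 fields, got " NF); next }
    { print $1 + $2 }
    END { close("cat 1>&2") }       # flush the pipe before exiting
  '
}

printf '1 2\noops\n3 4\n' | check
```

    The sums 3 and 7 go to standard output, while the complaint about line 2 goes to standard error, so redirecting one stream leaves the other untouched.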

    The multi-pass input-driven structure that awk naturally lends itself to produces very clean and readable code with different phases neatly separated, but creates substantial difficulties when feedback loops appear. (In the case of AASL, this perhaps says more about language design than about awk.)

    Author

    Henry Spencer.


    categories: Interpreters,May,2009,SteveJ

    Brainfuck to C

    (Editor's note: One of the benefits of gawk is its ability to quickly code filters that convert artifacts from one form to another. For example, here's a BrainFuck to C translator.)

    About BrainFuck

    (From Wikipedia.)

    The BrainFuck programming language is an esoteric programming language noted for its extreme minimalism. It is a Turing tarpit, designed to challenge and amuse programmers, and is not suitable for practical use.

    Urban Müller created BrainFuck in 1993 with the intention of designing a language which could be implemented with the smallest possible compiler, inspired by the 1024-byte compiler for the FALSE programming language. Several BrainFuck compilers have been made smaller than 200 bytes. The classic distribution is Müller's version 2, containing a compiler for the Amiga, an interpreter, example programs, and a readme document.

    The language consists of eight commands:

    >
    increment the data pointer (to point to the next cell to the right).
    <
    decrement the data pointer (to point to the next cell to the left).
    +
    increment (increase by one) the byte at the data pointer.
    -
    decrement (decrease by one) the byte at the data pointer.
    .
    output the value of the byte at the data pointer.
    ,
    accept one byte of input, storing its value in the byte at the data pointer.
    [
    if the byte at the data pointer is zero, then instead of moving the instruction pointer forward to the next command, jump it forward to the command after the matching ] command.
    ]
    if the byte at the data pointer is nonzero, then instead of moving the instruction pointer forward to the next command, jump it back to the command after the matching [ command.

    A Brainfuck program is a sequence of these commands, possibly interspersed with other characters (which are ignored). The commands are executed sequentially, except as noted above; an instruction pointer begins at the first command, and each command it points to is executed, after which it normally moves forward to the next command. The program terminates when the instruction pointer moves past the last command.

    The Translator

    I wrote a BrainFuck to C translator in awk. It only took a few minutes, and I had noticed that no awk version of this existed.

    I haven't put it through its paces (I just wrote a few small BrainFuck programs to test it out), so if you find a bug, please let me know.

    #!/sw/bin/awk -f
    # a brainfuck to C translator.
    # Needs a recent version of gawk, if on OS X,
    # try using Fink's version.
    #
    # steve jenson
    
    BEGIN {
      print "#include <stdio.h>\n";
      print "int main() {";
      print "  int c = 0;";
      print "  static int b[30000];\n";
    }
    
    {
      #Note: the order in which these are
      #substituted is very important.
      gsub(/\]/, "  }\n");
      gsub(/\[/, "  while(b[c] != 0) {\n");
      gsub(/\+/, "  ++b[c];\n");
      gsub(/\-/, "  --b[c];\n");
      gsub(/>/, "  ++c;\n");
      gsub(/</, "  --c;\n");
      gsub(/\./, "  putchar(b[c]);\n");
      gsub(/\,/, "  b[c] = getchar();\n");
      print $0
    }
    
    END {
      print "\n  return 0;";
      print "}";
    }
    

    Updates

    • You can blame this on Evan Martin: his recent post, wherein half the universe decided to weigh in about PHP, mentioned mod_bf. Am I easily distracted or what?
    • Last update, I swear. Darius said that I don't need to initialize b if I declare it static. Also, I realized that my previous version wouldn't understand if you had multiple operators on the same line since it was matching records and not fields. This version works on all of the programs I tried in the BrainFuck archive (as long as you strip comments).
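
    The translator prints each line after substitution, so any commentary in a BrainFuck source file would be copied into the generated C. One way to strip it first is sketched below; this filter is not part of the original post, and the function name is invented.

```shell
# Sketch: delete every character that is not one of the eight commands,
# so that commentary in a program is not copied into the generated C.
strip_bf() {
  awk '{ gsub(/[^][<>+.,-]/, ""); if ($0 != "") print }'
}

printf 'add two: ++ then loop [>+<-] end\n' | strip_bf
```

    For the sample line this leaves ++[>+<-]. Note that a comment containing a command character, such as a period or comma, is still treated as code, as it would be by any BrainFuck implementation.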

    Author

    Steve Jenson http://saladwithsteve.com/


    categories: Oo,May,2009,Admin

    OO tools in AWK

    These pages focus on object-oriented tools in Awk.


    categories: Dsl,Mar,2009,Admin

    Domain-Specific Languages

    These pages focus on domain-specific languages (a.k.a. "little languages") written in Awk.

    These little languages can range from the simple to the quite intricate. For example, LAWKER contains code for

    • Simple:
      • Graph- a simple ascii graph generator;
      • Markdown- an ultra lightweight HTML markup language;
    • Intricate:
      • Awk++- enables object-oriented programming in Awk;
      • AwkLisp- a fully functioning LISP interpreter, written in Awk.

    Interestingly, without comments, the LISP interpreter is only three times longer than the HTML markup language. This comments either on the power of Awk, the regularity of LISP's core semantics, or both.


    categories: Dsl,Mar,2009,BrianK,PeterW,AlfredA

    Graph.awk

    Synopsis

    gawk -f graph.awk graphFile

    Description

    A processor for a little language, specialized for graph-drawing.

    The code reads data that includes a specification of a graph. The output is the data plotted in the specified area.

    For example, here is an input specification:

    label here's some stuff
    bottom ticks 1 5 10 
    left ticks 1 2 10 20
    range 1 1 10 22
    height 10
    width 30
    1 2 *
    2 4 * 
    3 6 *
    4 8 *
    7 14 +
    8 12 +
    9 10 +
    mb 0.9 11 =
    

    It produces the following output:

          |----------------------|
    20    -                 = =  =
          |       = =  = =       |
          =  = =         +  +    |
    10    -                   +  |
          |    *  *              |
          |  *                   |
    2     *---------|------------|
         1         5            10
             here's some stuff    
    

    Code

    Initialization

    Set frame dimensions: height and width; offset for x and y axes.

    BEGIN {                
        ht = 24; wid = 80  
        ox = 6; oy = 2     
        number = "^[-+]?([0-9]+[.]?[0-9]*|[.][0-9]+)" \
                                "([eE][-+]?[0-9]+)?$"
    }
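
    As a quick check of the number pattern set in the BEGIN block, the following sketch applies it to a few strings. The shell function name and the sample inputs are chosen for illustration only.

```shell
# Sketch: try the "number" pattern from the BEGIN block on a few strings.
classify() {
  awk '
    BEGIN {
      number = "^[-+]?([0-9]+[.]?[0-9]*|[.][0-9]+)" \
                              "([eE][-+]?[0-9]+)?$"
    }
    { print $1, (($1 ~ number) ? "number" : "not a number") }
  '
}

printf '12\n-3.5\n.5e-3\n1e\nmb\n' | classify
```

    The first three strings are accepted. The string 1e is rejected because the exponent needs at least one digit, and mb is rejected outright, which is what lets lines starting with a keyword fall through to the later rules.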
    

    Handling patterns

    Skip comments

    /^[ \t]*#/     { next } 
    

    Simple tags

    $1 == "height" { ht = $2;  next }
    $1 == "width"  { wid = $2; next }
    $1 == "label"  {                       # for bottom
        sub(/^ *label */, "")
        botlab = $0
        next
    }
    $1 == "bottom" && $2 == "ticks" {     # ticks for x-axis
        for (i = 3; i <= NF; i++) bticks[++nb] = $i
        next
    }
    $1 == "left" && $2 == "ticks" {       # ticks for y-axis
        for (i = 3; i <= NF; i++) lticks[++nl] = $i
        next
    }
    $1 == "range" {                       # xmin ymin xmax ymax
        xmin = $2; ymin = $3; xmax = $4; ymax = $5
        next
    }
    

    Handling numerics.

    $1 ~ number && $2 ~ number {  # pair of numbers
        nd++                      # count number of data points
        x[nd] = $1; y[nd] = $2
        ch[nd] = $3               # optional plotting character
        next
    }
    $1 ~ number && $2 !~ number { # single number
        nd++                      # count number of data points
        x[nd] = nd; y[nd] = $1; ch[nd] = $2
        next
    }
    

    Line functions, defined by a slope "m" and a y-intercept "b".

    $1 == "mb" {  # m b [mark]
        expand()
        for (i = xmin; i <= xmax; i++) {
            nd++; x[nd] = i; y[nd] = $2*i + $3; ch[nd] = $4
        }
        next
    }
    

    Final case: input error.

    { print "?? line " NR ": ["$0"]" >"/dev/stderr" }
    

    Draw the graph

    END { expand();   frame(); ticks(); label(); data(); draw() }
    

    Functions

    Expand the "x" and "y" boundaries to include all points.

    function expand(note) { if (xmin == "") expand1(note) }
    function expand1(note) {
        xmin = xmax = x[1]
        ymin = ymax = y[1]
        for (i = 2; i <= nd; i++) {
            if (x[i] < xmin) xmin = x[i]
            if (x[i] > xmax) xmax = x[i]
            if (y[i] < ymin) ymin = y[i]
            if (y[i] > ymax) ymax = y[i] }
    }
    

    Draw the frame around the graph.

    function frame() {        
        for (i = ox; i < wid; i++) plot(i, oy, "-")     # bottom
        for (i = ox; i < wid; i++) plot(i, ht-1, "-")   # top
        for (i = oy; i < ht; i++) plot(ox, i, "|")      # left
        for (i = oy; i < ht; i++) plot(wid-1, i, "|")   # right
    }
    

    Create tick marks for both axes.

    function ticks(    i) {   
        for (i = 1; i <= nb; i++) {
            plot(xscale(bticks[i]), oy, "|")
            splot(xscale(bticks[i])-1, 1, bticks[i])
        }
        for (i = 1; i <= nl; i++) {
            plot(ox, yscale(lticks[i]), "-")
            splot(0, yscale(lticks[i]), lticks[i])
        }
    }
    

    Center labels under x-axis.

    function label() {        
        splot(int((wid + ox - length(botlab))/2), 0, botlab)
    }
    

    Create data points.

    function data(    i) {    
        for (i = 1; i <= nd; i++)
            plot(xscale(x[i]),yscale(y[i]),ch[i]=="" ? "*" : ch[i])
        for(i in mark) print mark[i]
    }
    

    Print graph from array.

    function draw(    i, j) { 
        for (i = ht-1; i >= 0; i--) {
            for (j = 0; j < wid; j++)
                printf("%s", ((j,i) in array) ? array[j,i] : " ")  # "%s" keeps a "%" mark from being read as a format
            printf("\n")
        }
    }
    

    Scale x-values, y-values.

    function xscale(x) {      
        return int((x-xmin)/(xmax-xmin) * (wid-1-ox) + ox + 0.5)
    }
    function yscale(y) {      
        return int((y-ymin)/(ymax-ymin) * (ht-1-oy) + oy + 0.5)
    }
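
    As a worked example, the sketch below evaluates the two scaling functions with the settings from the sample specification (range 1 1 10 22, height 10, width 30) and the default offsets ox = 6 and oy = 2. The shell function name is invented.

```shell
# Sketch: evaluate xscale() and yscale() for the sample specification.
scale_demo() {
  awk '
    function xscale(x) { return int((x-xmin)/(xmax-xmin) * (wid-1-ox) + ox + 0.5) }
    function yscale(y) { return int((y-ymin)/(ymax-ymin) * (ht-1-oy) + oy + 0.5) }
    BEGIN {
      ht = 10; wid = 30; ox = 6; oy = 2
      xmin = 1; ymin = 1; xmax = 10; ymax = 22
      print xscale(1), yscale(1)     # lower-left corner of the frame
      print xscale(10), yscale(22)   # upper-right corner of the frame
      print xscale(5), yscale(10)    # where the 5 and 10 ticks meet
    }
  '
}

scale_demo
```

    The corners map to columns 6 and 29 and rows 2 and 9, that is, from the offsets to wid-1 and ht-1. The tick value 10 on the y-axis lands on row 5, which matches its position in the sample output.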
    

    Put one character into array.

    function plot(x, y, c) {  
        array[x,y] = c
    }
    

    Put string "s" into array.

    function splot(x, y, s,    i, n) { 
        n = length(s)
        for (i = 0; i < n; i++)
            array[x+i, y] = substr(s, i+1, 1)
    }
    

    Author

    This code comes from the original Awk book by Alfred Aho, Peter Weinberger & Brian Kernighan and contains some small modifications by Tim Menzies.


    categories: Dsl,April,2009,MartinF

    UML in Awk

    Contents

    Synopsis

    Description

    Example

    Code

    Author

    Synopsis

    gawk -f uml.sh  file.sdml >  sequence_diagram
    

    Description

    This program turns SDML into simple ASCII-text UML sequence diagrams. SDML is an extremely simple UML Sequence Diagram Markup Language, specified as follows:

    • Lines starting with a [ are a comma separated list of actors (bar headers)
    • Events are defined easily by the following symbols:
      >
      rightward event
      <
      leftward event
      -
      extension of the previous event
    • Actors can be skipped with a |
    • Text on a line after a # is a comment
    • Lines starting with a @ are text lines
    • Lines starting with a " are indented text lines
    • Lines starting with a : are comma separated list of parameter assignment lines. Parameters are:
      EP
      Event Padding (spaces on each side)
      ES
      Event Spacing (lines below)
      EA
      Events Above (put event text above arrows)
      HP
      Header Padding (spaces on each side)
      HS
      Header Spacing (lines below)
      LM
      Left Margin (spaces on the left)
      TSM
      Text Spacing Margin (lines above & below)
      TD
      Text Dots (instead of bars in text margins)
      SS
      Enable Single Arrow Spans (|---A-->|, not |-A-+-A>|)

    Example

    Given this input:

        [Client, Proxy, DNS, Server
        Query Name->
        Answer IP<-
        http GET >->
        <<-html
    

    this code generates:

        Client          Proxy           DNS         Server
           |              |              |             |
           |----------Query Name-------->|             |
           |<---------Answer IP----------|             |
           |--http GET -->|----------http GET -------->|
           |<----html-----|<-----------html------------|
    

    Code

    if [ "$1" = "--awkprog" ] ; then
    
    cat - <<"EOF"
    
    BEGIN {
      EFS="[|<>-]";
      AFS="[<>-]";
      RAFS="[{}RL]";
      FS= EFS;
      ARROWS = 2 ; # Arrowhead constant
      ST=1;
    
      ARG["EP"] = 1;  # Event Padding
      ARG["ES"] = 0;  # Event Spacing (lines below)
      ARG["EA"] = 0;  # Events Above
    
      ARG["HP"] = 2;  # Header Padding
      ARG["HS"] = 1;  # Header Spacing (lines below)
    
      ARG["LM"] = 0;  # Left Margin
    
      ARG["SP"] = 2;  # Start Row Padding (For continuous operation)
    
      ARG["TSM"] = 1; # Text Spacing Margin (lines above & below)
      ARG["TD"] = 1;  # Text Dots (instead of bars in text margins)
      ARG["SS"] = 1;  # Enable Single Arrow Spans (|---A-->|, not |-A-+-A>|)
    }
    
    function padding(outter, inner, extra    ,p,m) {
      p = (outter - inner);
      m = p % 2 ;
      p =  ((p - m)/2) + (extra ? m:0);
      if(p<0) return 0;
      return p;
    }
    function pad(char, count    ,i,r) {
      for(i=1 ; i <= count ; i++) { r = r char };
      return r;
    }
    function ltrim(s) { gsub(/^[     ]*/, "", s) ; return s; }
    
    function center(string, width, padchar, favor    ,p,r,sw) {
      sw = length(string);
      p = padding(width, sw, favor=="r"?1:0);
      r = pad(padchar, p);
      r = r string;
      p = padding(width, sw, favor=="r"?0:1);
      return r pad(padchar, p);
    }
    
    function getevent_rev(row, field   ,p) {
      for(p=field-1; p>0; p--) { # search to the left
        if(RF_s[row,p] !~ AFS) return "";
        if(RF_f[row,p] != "") return RF_f[row,p];
      }
      return "";
    }
    function getevent_for(row, field   ,n) {
      for(n=field+1; n <= R_nf[row]; n++) { # search to the right
        if(RF_s[row,n-1] !~ AFS) return "";
        if(RF_f[row,n] != "") return RF_f[row,n];
      }
      return "";
    }
    
    function rlarrow(arrow, prevarrow) {
      if(arrow == ">") return "R";
      if(arrow == "<") return "L";
      if(arrow == "R" || arrow=="L") return arrow;
      return prevarrow;
    }
    
    function debug_events(s) {
      for(r=1; r <= NRS; r++) debug_row(r, s);
    }
    
    function debug_row(r, s) {
      if(!DEBUG_ROW) return;
      printf("Row["r"]/Stage["s"]:  ");
      for(f=1; f <= R_nf[r]; f++) {
        printf(f"="RF_f[r,f]"("RF_s[r,f]") ");
      }
      printf("\n");
    }
    
    function print_bars(num, char    ,i,out) {
      if(char == "") char = "|";
      while(num--) {
        # Center the bars under the Headers
        out = pad(" ", F_width[0]);
        for(i=1; i<= NH; i++) {
          out = out  char pad(" ", F_width[i]);
        }
        print out;
      }
    }
    
    function print_event(r, type   ,i,bar,out,aspad,span_width,arrow){
      out = pad(" ", F_width[0]);
    
      for(i=1; i<= MAXNF; i++) {
    
        out = out "|";
    
        arrow=" ";
        if(type == "both" || type == "arrow") {
          if(RF_s[r,i] == "{") arrow = "<";
          if(RF_s[r,i] ~ /[}RL]/)  arrow = "-";
        }
        out = out arrow;
    
    
        aspad = "-"; # arrow or space pad
        if(RF_s[r,i] == "|" || RF_s[r,i] == ""|| type == "event") aspad = " ";
    
        span_width = F_width[i];
        if(ARG["SS"]) while(RF_s[r,i] == "R" || RF_s[r,i+1] == "L") {
          span_width += 1 + F_width[++i]; # include bar
        }
    
        event ="";
        if(type == "both" || type == "event") {
          event = RF_f[r,i];
        }
        out = out center(event, span_width - ARROWS, aspad, i>MAXNF/2? "r":"l");
    
    
        if(type == "both" || type == "arrow") {
          if(RF_s[r,i] == "}") arrow = ">";
          if(RF_s[r,i] ~ /[{RL]/) arrow = "-";
        }
        out = out arrow;
      }
      out = out "|";
      print out;
    }
    
    function print_sd(start_row) {
     print "         1         2         3         4         5         6"
     print "123456789012345678901234567890123456789012345678901234567890"
      if(start_row!=1) { for(i=0; i<ARG["SP"];i++) print ""; }
    
      for(j=start_row; j<= NRS; j++) {
    
        if(R_ltype[j] == "Header") {
          NH = R_nf[j];
          out = pad(" ", ARG["HP"]+ARG["LM"]);
          i =1;
          out = out RF_f[j,i];
          hp = ARG["HP"] + ARG["LM"] + RF_l[j,i]; # header pointer (last char)
          bp = F_width[0] + 1 + F_width[i] + 1; # bar pointer
     print "HP:" hp " BP: "bp
          for(i=2; i<= NH; i++) {
            l = int(RF_l[j,i]/2); r = RF_l[j,i] -l; # Header left & right
            lp = (bp - hp) - (l + 1); # left padding
            out = out pad(" ", lp) RF_f[j,i];
            hp = bp + r - 1;
            bp = bp + F_width[i] + 1;
     print "HP:" hp " BP: "bp " LP:"lp " r:"r" l:"l
          }
    
          print out;
          print_bars(ARG["HS"]);
        }
    
        if(R_ltype[j] == "Text") {
          if(R_ltype[j-1] != "Text") {
            if(ARG["TD"]) { 
              print_bars(ARG["TSM"], ".");
            } else {
              for(l=0;l<ARG["TSM"]; l++) print "";
            }
          }
    
          if(T_type[j] == "indent") printf(pad(" ", F_width[0]));
          print RF_f[j,1];
    
          if(R_ltype[j+1] != "Text") {
            if(ARG["TD"]) { 
              print_bars(ARG["TSM"], ".");
            } else {
              for(l=0;l<ARG["TSM"]; l++) print "";
            }
          }
        }
    
        if(R_ltype[j] == "Event") {
          if (ARG["EA"]) {
            print_event(j, "event");
            print_event(j, "arrow");
          } else print_event(j, "both");
          print_bars(ARG["ES"]);
        }
    
      }
      return j;
    }
    
    
    /^[     ]*#/ {next} # we don't want bars for comment only lines!
    /#/ { sub(/#.*$/, ""); }  # strip trailing comments (sub() edits $0 in place and returns a count)
    
    /^:/ {
     print "Argument Variable Assignment" $0
      i = split(substr($0,2), v, /,/);
      for(;i>0;i--) {
        j = split(v[i], kv, "=");
        if(j==1) { ARG[kv[1]]= ""; }
        if(j==2) { ARG[kv[1]]=kv[2]; }
      }
     for(k in ARG) { printf("ARG["k"]='"ARG[k]"' "); } ; print "";
      next ;
    }
    
    {
      NRS++; # NRSequences
    }
    
    /^;/ { ST=print_sd(ST); next; }  # Allow continuous operation
    
    /^@/ {
     print "text line"
      R_ltype[NRS] = "Text";
      T_type[NRS] = "left";
      sub(/^@/,"");
      RF_f[NRS,1]=$0;
      next;
    }
    
    /^"/ {
     print "text line"
      R_ltype[NRS] = "Text";
      T_type[NRS] = "indent";
      sub(/^"/,"");
      RF_f[NRS,1]=$0;
      next;
    }
    
    /^\[/ {
     print "Event Headers (Titles)" $0
      R_ltype[NRS] = "Header";
    
      sub(/^\[/,"");
      FS=","; $0 = $0; # resplit line
      R_nf[NRS] = NF;
      if(MAXNF < R_nf[NRS]-1) MAXNF= R_nf[NRS]-1; # print MAXNF;
      for(i=1; i<= NF; i++) {
        f= ltrim($i);
        RF_f[NRS,i]=f;
        RF_l[NRS,i]= length(f);
        RF_s[NRS,i]= ",";
      }
      for(i=1; i<= NF; i++) {
        F_width[i] = padding(RF_l[NRS,i] + 2*ARG["HP"], 1, 1) +\
                     padding(RF_l[NRS,i+1] + 2*ARG["HP"], 1, 0)\
                     -1; # Do not include width of bar
        if(F_width[i] < 2*ARG["HP"])  F_width[i] = 2*ARG["HP"];
    
     print padding(RF_l[NRS,i] + 2*ARG["HP"], 1, 1) " "\
           padding(RF_l[NRS,i+1] + 2*ARG["HP"], 1, 0);
      }
      F_width[0] = padding(RF_l[NRS,1] + 2*ARG["HP"], 1, 1);
     print padding(RF_l[NRS,1] + 2*ARG["HP"],1,0);
      if(F_width[0] < ARG["HP"])  F_width["0"] = ARG["HP"];
      F_width[0] += ARG["LM"];
     for(i=0; i<= MAXNF; i++) printf("FW["i"]="F_width[i]" "); print ""
    
      FS=EFS;
      next;
    }
    
    {
     print "Event Line: " $0 ; DEBUG_ROW=1;
      R_ltype[NRS] = "Event";
    
      stl=0;
      for(i=1; i<= NF; i++) {
        f = $i;
        l = length(f);
        stl += l +1;
        s = substr($0, stl, 1);
    
        RF_f[NRS,i]= f;
        RF_s[NRS,i]= s;
      }
      R_nf[NRS] = NF;
      debug_row(NRS, 1);
    
      # Fill in missing (assumed) fields
      for(i=1; i<= R_nf[NRS]; i++) {
        if (RF_f[NRS,i]=="") RF_f[NRS,i] = getevent_rev(NRS, i);
        if (RF_f[NRS,i]=="") RF_f[NRS,i] = getevent_for(NRS, i);
      }
      debug_row(NRS, 2);
    
      # ->  <-   ->>  >->  <-<  <<-
      # >-  -<        >>-  -<<
      # R>  <L   R>>  >R>  <L<  <<L
    
      for(i=1; i<= R_nf[NRS]; ) {
        if(RF_s[NRS,i] ~ AFS) {
          if(RF_s[NRS,i] == "-") { # left tail
            for(n=i+1; n<= R_nf[NRS]; n++) {
              if(RF_s[NRS,n]==">") {
                pi=i; i=n;  RF_s[NRS,n]="}";
                for(n--; n>=pi; n--) RF_s[NRS,n]="R"; n= R_nf[NRS];
              } else if(RF_s[NRS,n]=="<") {
                pi=i; i=n;  RF_s[NRS,pi]="{";
                for(; n>pi; n--) RF_s[NRS,n]="L"; n= R_nf[NRS];
              }
            }
            i++;
          } else if(RF_s[NRS,i+1] != "-") { # singleton
            RF_s[NRS,i]= RF_s[NRS,i]==">" ? "}":"{";
            i++;
          } else {
            rl= rlarrow(RF_s[NRS,i], "");
            for(n=i+1; n<= R_nf[NRS] && RF_s[NRS,n] ~ AFS; n++) {
              rl= rlarrow(RF_s[NRS,n], rl);
            }
            n--;
            if (RF_s[NRS,n] == "-") { # right tail
              if (rl=="R") RF_s[NRS,n--]="}";
              for(; n>=i && RF_s[NRS,n] == "-"; n--) RF_s[NRS,n]=rl;
              if (rl=="L") RF_s[NRS,n]="{"; else RF_s[NRS,n]="R";
            } else if (RF_s[NRS,n-1] != "-") { # singleton
              RF_s[NRS,n]= RF_s[NRS,n]==">" ? "}":"{";
            } else { # double ended -
              if(RF_s[NRS,i]=="<") { # trumps no matter what
                RF_s[NRS,i]="{";
                for(i++; i<= R_nf[NRS] && RF_s[NRS,i]=="-"; i++) {
                  RF_s[NRS,i]="L";
                }
              } else {
                for(n=i+1; n<= R_nf[NRS] && RF_s[NRS,n] =="-"; n++) ;
                if(RF_s[NRS,n]==">") {
                  RF_s[NRS,n]="}";
                  for(n--; n>i && RF_s[NRS,n]=="-"; n--) {
                    RF_s[NRS,n]="R";
                  }
                } else { # >-<  # > is on the right and trumps
                  for(; i<= R_nf[NRS] && RF_s[NRS,i]=="-"; i++) {
                    RF_s[NRS,i]="R";
                  }
                  RF_s[NRS,i]="}";
                }
              }
            }
          }
        } else i++;
      }
    
      debug_row(NRS, 3);
    
    
      # ~ we need to test this with multi shifts (arrow/bar/arrow)
      shift = 0;
      for(i=1; i<= R_nf[NRS]+1; i++) {
        if(RF_s[NRS,i-1] ~ RAFS && RF_s[NRS,i] !~ RAFS) shift++;
        if(shift) RF_f[NRS,i-shift]=RF_f[NRS,i];
      }
      R_nf[NRS] = R_nf[NRS] - shift;
      debug_row(NRS, 4);
    
      # Trim empty trailing fields
      for(i= R_nf[NRS]; i>0 && RF_f[NRS,i]==""; i--) R_nf[NRS]--;
      debug_row(NRS, 5);
    
      # Get event wlength and adjust the max length of each event
      for(i=1; i<= R_nf[NRS]; i++) {
        RF_l[NRS,i]= length(RF_f[NRS,i]);
        if(RF_l[NRS,i] > E_ml[i]) E_ml[i] = RF_l[NRS,i];
      }
    
      # Adjust the max width of each column (headers/events)
      if(MAXNF < R_nf[NRS]) MAXNF= R_nf[NRS]; # print MAXNF;
      for(i=1; i<= MAXNF; i++) {
        w = E_ml[i] + 2 * ARG["EP"] + ARROWS;
        if (F_width[i] < w)  F_width[i] = w;
       printf("FW:"F_width[i]" W:"w" ");
      }
     print ""
    }
    
    END { ST=print_sd(ST); }
    
    
    EOF
    exit
    fi
    
    
    Usage()
    {
      cat - <<-EOF
    
      use(v1.0): $0 file.sdml >  sequence_diagram
    
      This program will turn SDML into simple ascii text uml sequence
      diagrams.  SDML is an extremely simplistic uml Sequence Diagram
      Markup Language.  SDML is specified as:
    
      .Lines starting with a [ are a comma separated list
        of actors (bar headers)
      .Events are defined easily by the following symbols:
        >  rightward event
        <  leftward event
        -  extension of the previous event
      .Actors can be skipped with a |
      .Text on a line after a # is a comment
      .Lines starting with a @ are text lines
      .Lines starting with a " are indented text lines
      .Lines starting with a : are comma separated list of
        parameter assignment lines.  Parameters are:
    
        EP  Event Padding (spaces on each side)
        ES  Event Spacing (lines below)
        EA  Events Above (put event text above arrows)
    
        HP  Header Padding (spaces on each side)
        HS  Header Spacing (lines below)
    
        LM  Left Margin (spaces on the left)
    
        TSM Text Spacing Margin (lines above & below)
        TD  Text Dots (instead of bars in text margins)
        SS  Enable Single Arrow Spans (|---A-->|, not |-A-+-A>|)
    
      Example SDML Input:
    
        [Client, Proxy, DNS, Server
        Query Name->
        Answer IP<-
        http GET >->
        <<-html
    
      Sequence Diagram Output:
    
        Client          Proxy           DNS         Server
           |              |              |             |
           |----------Query Name-------->|             |
           |<---------Answer IP----------|             |
           |--http GET -->|----------http GET -------->|
           |<----html-----|<-----------html------------|
    
      Copyright:  Martin Fick <mogulguy@yahoo.com>, Date: 2008-02-15
      License:    None.  This is released into the public domain: do
                  as you wish.
    
    EOF
    exit
    }
    
    [ "$1" = "--help"  -o  "$1" = "-h"  -o  "$1" = "-u" ] &&  Usage
    
    
    # Hack to attempt to make this somewhat portable
    
    
    AWK_PROG="`"$0" --awkprog`"
    
    AWK=awk  # default (should work most places)
    [ -x /usr/bin/nawk ] && AWK=/usr/bin/nawk # solaris
    
    $AWK "$AWK_PROG" "$@"
    

    Author

    Martin Fick


    categories: Eliza,Top10,AwkLisp,Interpreters,Dsl,Mar,2009,DariusB

    AWKLISP v1.2

    Download from

    Synopsis

    awk [-v profiling=1] -f awklisp [optional-Lisp-source-files]

    The -v profiling=1 option turns call-count profiling on.

    If you want to use it interactively, be sure to include '-' (for the standard input) among the source files. For example:

    gawk -f awklisp startup numbers lists -

    Description

    Overview

    This program arose out of one-upmanship. At my previous job I had to use MapBasic, an interpreter so astoundingly slow (around 100 times slower than GWBASIC) that one must wonder if it itself is implemented in an interpreted language. I still wonder, but it clearly could be: a bare-bones Lisp in awk, hacked up in a few hours, ran substantially faster. Since then I've added features and polish, in the hope of taking over the burgeoning market for stately language implementations.

    This version tries to deal with as many of the essential issues in interpreter implementation as is reasonable in awk (though most would call this program utterly unreasonable from start to finish, perhaps...). Awk's impoverished control structures put error recovery and tail-call optimization out of reach, in that I can't see a non-painful way to code them. The scope of variables is dynamic because that was easier to implement efficiently. Subject to all those constraints, the language is as Schemely as I could make it: it has a single namespace with uniform evaluation of expressions in the function and argument positions, and the Scheme names for primitives and special forms.

    The rest of this file is a reference manual. My favorite tutorial would be The Little LISPer (see section 5, References); don't let the cute name and the cartoons turn you off, because it's a really excellent book with some mind-stretching material towards the end. All of its code will work with awklisp, except for the last two chapters. (You'd be better off learning with a serious Lisp implementation, of course.)

    For more details on the implementation, see the Implementation notes (below).

    Examples

    fib.lsp

    Code:

    (define fib
      (lambda (n)
        (if (< n 2)
            1
            (+ (fib (- n 1))
               (fib (- n 2))))))
    (fib 20)
    

    Command line:

    gawk -f awklisp startup numbers  lists fib.lsp
    

    Output:

    10946
    

    Eliza

    Here are the standard ELIZA dialogue patterns:

    (define rules
      '(((hello)
         (How do you do -- please state your problem))
        ((I want)
         (What would it mean if you got -R-)
         (Why do you want -R-)
         (Suppose you got -R- soon))
        ((if)
         (Do you really think its likely that -R-)
         (Do you wish that -R-)
         (What do you think about -R-)
         (Really-- if -R-))
        ((I was)
         (Were you really?)
         (Perhaps I already knew you were -R-)
         (Why do you tell me you were -R- now?))
        ((I am)
         (In what way are you -R-)
         (Do you want to be -R-))
        ((because)
         (Is that the real reason?)
         (What other reasons might there be?)
         (Does that reason seem to explain anything else?))
        ((I feel)
         (Do you often feel -R-))
        ((I felt)
         (What other feelings do you have?))
        ((yes)
         (You seem quite positive)
         (You are sure)
         (I understand))
        ((no)
         (Why not?)
         (You are being a bit negative)
         (Are you saying no just to be negative?))
        ((someone)
         (Can you be more specific?))
        ((everyone)
         (Surely not everyone)
         (Can you think of anyone in particular?)
         (Who for example?)
         (You are thinking of a special person))
        ((perhaps)
         (You do not seem quite certain))
        ((are)
         (Did you think they might not be -R-)
         (Possibly they are -R-))
        (()
         (Very interesting)
         (I am not sure I understand you fully)
         (What does that suggest to you?)
         (Please continue)
         (Go on)
         (Do you feel strongly about discussing such things?))))
    

    Command line:

    gawk -f awklisp startup numbers  lists eliza.lsp -

    Interaction:

    > (eliza)
    Hello-- please state your problem 
    > (I feel sick)
    Do you often feel sick 
    > (I am in love with awk)
    In what way are you in love with awk 
    > (because it is so easy to use)
    Is that the real reason? 
    > (I was laughed at by the other kids at space camp)
    Were you really? 
    > (everyone hates me)
    Can you think of anyone in particular? 
    > (everyone at space camp)
    Surely not everyone 
    > (perhaps not tina fey)
    You do not seem quite certain 
    > (I want her to laugh at me)
    What would it mean if you got her to laugh at me 
    

    Expressions and their evaluation

    Lisp evaluates expressions, which can be simple (atoms) or compound (lists).

    An atom is a string of characters, which can be letters, digits, and most punctuation; the characters may -not- include spaces, quotes, parentheses, brackets, '.', '#', or ';' (the comment character). In this Lisp, case is significant ( X is different from x ).

    • Atoms: atom 42 1/137 + ok? hey:names-with-dashes-are-easy-to-read
    • Not atoms: don't-include-quotes (or spaces or parentheses)

    A list is a '(', followed by zero or more objects (each of which is an atom or a list), followed by a ')'.

    • Lists: () (a list of atoms) ((a list) of atoms (and lists))
    • Not lists: ) ((()) (two) (lists)

    The special object nil is both an atom and the empty list. That is, nil = (). A non-nil list is called a -pair-, because it is represented by a pair of pointers, one to the first element of the list (its -car-), and one to the rest of the list (its -cdr-). For example, the car of ((a list) of stuff) is (a list), and the cdr is (of stuff). It's also possible to have a pair whose cdr is not a list; the pair with car A and cdr B is printed as (A . B).
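
    To make the car/cdr picture concrete, here is a minimal sketch that stores pairs in two parallel awk arrays and prints them back in list notation. The names cons, is_pair, show and ncells, and the "#n" form of a pair reference, are assumptions made for this illustration; awklisp's own representation uses tagged numbers instead.

```shell
# Sketch only: pairs held in two parallel arrays, car[] and cdr[].
pair_demo() {
  awk '
    function cons(a, d,   p) { p = "#" (++ncells); car[p] = a; cdr[p] = d; return p }
    function is_pair(x) { return (x in car) }
    function show(x,   s) {
      if (!is_pair(x)) return x                 # an atom, including nil
      s = "(" show(car[x])
      for (x = cdr[x]; is_pair(x); x = cdr[x])  # walk the cdr chain
        s = s " " show(car[x])
      if (x != "nil") s = s " . " show(x)       # the cdr was not a list
      return s ")"
    }
    BEGIN {
      inner = cons("a", cons("list", "nil"))
      whole = cons(inner, cons("of", cons("stuff", "nil")))
      print show(car[whole])
      print show(cdr[whole])
      print show(cons("A", "B"))
    }'
}

pair_demo
```

    The three lines printed are the car of ((a list) of stuff), its cdr, and a pair whose cdr is not a list.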

    That's the syntax of programs and data. Now let's consider their meaning. You can use Lisp like a calculator: type in an expression, and Lisp prints its value. If you type 25, it prints 25. If you type (+ 2 2), it prints 4. In general, Lisp evaluates a particular expression in a particular environment (set of variable bindings) by following this algorithm:

    • If the expression is a number, return that number.
    • If the expression is a non-numeric atom (a -symbol-), return the value of that symbol in the current environment. If the symbol is currently unbound, that's an error.
    • Otherwise the expression is a list. If its car is one of the symbols: quote, lambda, if, begin, while, set!, or define, then the expression is a -special- -form-, handled by special rules. Otherwise it's just a procedure call, handled like this: evaluate each element of the list in the current environment, and then apply the operator (the value of the car) to the operands (the values of the rest of the list's elements). For example, to evaluate (+ 2 3), we first evaluate each of its subexpressions: the value of + is (at least in the initial environment) the primitive procedure that adds, the value of 2 is 2, and the value of 3 is 3. Then we call the addition procedure with 2 and 3 as arguments, yielding 5. For another example, take (- (+ 2 3) 1). Evaluating each subexpression gives the subtraction procedure, 5, and 1. Applying the procedure to the arguments gives 4.

    We'll see all the primitive procedures in the next section. A user-defined procedure is represented as a list of the form (lambda <parameters> <body>), such as (lambda (x) (+ x 1)). To apply such a procedure, evaluate its body in the environment obtained by extending the current environment so that the parameters are bound to the corresponding arguments. Thus, to apply the above procedure to the argument 41, evaluate (+ x 1) in the same environment as the current one except that x is bound to 41.

    If the procedure's body has more than one expression -- e.g., (lambda () (write 'Hello) (write 'world!)) -- evaluate them each in turn, and return the value of the last one.
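
    The number rule and the procedure-call rule can be sketched in a few lines of awk. This is a toy, not awklisp's evaluator: it has no environment, symbols, or special forms, it looks the operator up directly instead of evaluating it, and it only knows the two-argument primitives +, - and *. The names lisp_calc, eval_expr and apply_op are made up for the example.

```shell
# Sketch only: evaluate numbers and nested calls to +, - and *.
lisp_calc() {
  printf '%s\n' "$1" | awk '
    function apply_op(o, a, b) {
      if (o == "+") return a + b
      if (o == "-") return a - b
      if (o == "*") return a * b
      print "unsupported operator: " o > "/dev/stderr"
      exit 1
    }
    function eval_expr(   tok, o, args, n) {
      tok = T[pos++]
      if (tok != "(") return tok + 0          # a number evaluates to itself
      o = T[pos++]                            # operator symbol in the car position
      n = 0
      while (pos <= ntok && T[pos] != ")")
        args[++n] = eval_expr()               # evaluate every operand first ...
      pos++                                   # skip the closing ")"
      return apply_op(o, args[1], args[2])    # ... then apply the operator
    }
    {
      gsub(/[()]/, " & ")                     # put spaces around parentheses
      ntok = split($0, T, " ")
      pos = 1
      print eval_expr()
    }'
}

lisp_calc "(- (+ 2 3) 1)"
```

    The operands are collected into a local array before the operator is applied, which matches the order described above: evaluate each element, then apply.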

    We still need the rules for special forms. They are:

    • The value of (quote <x>) is <x>. There's a shorthand for this form: '. E.g., the value of '(+ 2 2) is (+ 2 2), -not- 4.
    • (lambda <parameters> <body>...) returns itself: e.g., the value of (lambda (x) x) is (lambda (x) x).
    • To evaluate (if <test-expr> <then-exp> <else-exp>), first evaluate <test-expr>. If the value is true (non-nil), then return the value of <then-exp>, otherwise return the value of <else-exp>. (<else-exp> is optional; if it's left out, pretend there's a nil there.) Example: (if nil 'yes 'no) returns no.
    • To evaluate (begin <expr-1> <expr-2>...), evaluate each of the subexpressions in order, returning the value of the last one.
    • To evaluate (while <test> <expr-1> <expr-2>...), first evaluate <test>. If it's nil, return nil. Otherwise, evaluate <expr-1>, <expr-2>,... in order, and then repeat.
    • To evaluate (set! <variable> <expr>), evaluate <expr>, and then set the value of <variable> in the current environment to the result. If the variable is currently unbound, that's an error. The value of the whole set! expression is the value of <expr>.
    • (define <variable> <expr>) is like set!, except it's used to introduce new bindings, and the value returned is <variable>.

    It's possible to define new special forms using the macro facility provided in the startup file. The macros defined there are:

    • (let ((<var> <expr>)...)
        <body>...)
      Bind each <var> to its corresponding <expr> (evaluated in the current environment), and evaluate <body> in the resulting environment.
    • (cond (<test-expr> <result-expr>...)... (else <result-expr>...))
      where the final else clause is optional. Evaluate each <test-expr> in turn, and for the first non-nil result, evaluate its <result-expr>. If none are non-nil, and there's no else clause, return nil.
    • (and <expr>...)
      Evaluate each <expr> in order, until one returns nil; then return nil. If none are nil, return the value of the last <expr>.
    • (or <expr>...)
      Evaluate each <expr> in order, until one returns non-nil; return that value. If all are nil, return nil.

    Built-in procedures

    List operations:

    • (null? <x>) returns true (non-nil) when <x> is nil.
    • (atom? <x>) returns true when <x> is an atom.
    • (pair? <x>) returns true when <x> is a pair.
    • (car <pair>) returns the car of <pair>.
    • (cdr <pair>) returns the cdr of <pair>.
    • (cadr <pair>) returns the car of the cdr of <pair>. (i.e., the second element.)
    • (cddr <pair>) returns the cdr of the cdr of <pair>.
    • (cons <x> <y>) returns a new pair whose car is <x> and whose cdr is <y>.
    • (list <x>...) returns a list of its arguments.
    • (set-car! <pair> <x>) changes the car of <pair> to <x>.
    • (set-cdr! <pair> <x>) changes the cdr of <pair> to <x>.
    • (reverse! <list>) reverses <list> in place, returning the result.

    Numbers:

    • (number? <x>) returns true when <x> is a number.
    • (+ <n> <n>) returns the sum of its arguments.
    • (- <n> <n>) returns the difference of its arguments.
    • (* <n> <n>) returns the product of its arguments.
    • (quotient <n> <n>) returns the quotient. Rounding is towards zero.
    • (remainder <n> <n>) returns the remainder.
    • (< <n1> <n2>) returns true when <n1> is less than <n2>.
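
    As an illustration of rounding towards zero, the sketch below computes a quotient with awk's int(), which truncates. The remainder is defined here so that n = d*quotient + remainder, which gives it the sign of the dividend; that convention is an assumption of this example, since the text above only specifies the quotient. The name quot_rem is hypothetical.

```shell
# Sketch only: truncating quotient and a matching remainder.
quot_rem() {
  awk -v n="$1" -v d="$2" '
    function quotient(n, d)  { return int(n / d) }              # int() truncates toward zero
    function remainder(n, d) { return n - d * quotient(n, d) }  # so n == d*q + r
    BEGIN { print quotient(n, d), remainder(n, d) }'
}

quot_rem 7 2
quot_rem -7 2
```

    The second call prints -3 -1 rather than -4 1, which is what distinguishes truncation from rounding down.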

    I/O:

    • (write <x>) writes <x> followed by a space.
    • (newline) writes the newline character.
    • (read) reads the next expression from standard input and returns it.

    Meta-operations:

    • (eval <x>) evaluates <x> in the current environment, returning the result.
    • (apply <proc> <list>) calls <proc> with arguments <list>, returning the result.

    Miscellany:

    • (eq? <x> <y>) returns true when <x> and <y> are the same object. Be careful using eq? with lists, because (eq? (cons <x> <y>) (cons <x> <y>)) is false.
    • (put <x> <y> <z>)
    • (get <x> <y>) returns the last value <z> that was put for <x> and <y>, or nil if there is no such value.
    • (symbol? <x>) returns true when <x> is a symbol.
    • (gensym) returns a new symbol distinct from all symbols that can be read.
    • (random <n>) returns a random integer between 0 and <n>-1 (if <n> is positive).
    • (error <x>...) writes its arguments and aborts with error code 1.
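
    The put/get pair behaves like a two-key table. A minimal awk sketch of that behaviour, using awk's built-in multi-subscript arrays, might look like this; the array name prop and the sample keys are invented for the example, and nothing here is claimed about how awklisp stores these values internally.

```shell
# Sketch only: a two-key property table.
prop_demo() {
  awk '
    function put(x, y, z) { prop[x, y] = z }
    function get(x, y)    { return ((x, y) in prop) ? prop[x, y] : "nil" }
    BEGIN {
      put("apple", "color", "red")
      put("apple", "color", "green")   # a later put replaces the earlier value
      print get("apple", "color")
      print get("apple", "size")       # never put, so nil
    }'
}

prop_demo
```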

    Implementation Notes

    Overview

    Since the code should be self-explanatory to anyone knowledgeable about Lisp implementation, these notes assume you know Lisp but not interpreters. I haven't got around to writing up a complete discussion of everything, though.

    The code for an interpreter can be pretty low on redundancy -- this is natural because the whole reason for implementing a new language is to avoid having to code a particular class of programs in a redundant style in the old language. We implement what that class of programs has in common just once, then use it many times. Thus an interpreter has a different style of code, perhaps denser, than a typical application program.

    Data representation

    Conceptually, a Lisp datum is a tagged pointer, with the tag giving the datatype and the pointer locating the data. We follow the common practice of encoding the tag into the two lowest-order bits of the pointer. This is especially easy in awk, since arrays with non-consecutive indices are just as efficient as dense ones (so we can use the tagged pointer directly as an index, without having to mask out the tag bits). (But, by the way, mawk accesses negative indices much more slowly than positive ones, as I found out when trying a different encoding.)

    This Lisp provides three datatypes: integers, lists, and symbols. (A modern Lisp provides many more.)

    For an integer, the tag bits are zero and the pointer bits are simply the numeric value; thus, N is represented by N*4. This choice of the tag value has two advantages. First, we can add and subtract without fiddling with the tags. Second, negative numbers fit right in. (Consider what would happen if N were represented by 1+N*4 instead, and we tried to extract the tag as N%4, where N may be either positive or negative. Because of this problem and the above-mentioned inefficiency of negative indices, all other datatypes are represented by positive numbers.)
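
    A small sketch of the N*4 encoding follows. The function names make_int, int_value and tag_of are made up for the illustration. The last two lines show the problem with the alternative 1+N*4 encoding: awk's % takes the sign of the dividend, so a negative value does not yield the tag 1.

```shell
# Sketch only: integers stored as N*4, so the low tag bits are zero.
tag_demo() {
  awk '
    function make_int(n)  { return n * 4 }
    function int_value(p) { return p / 4 }
    function tag_of(p)    { return p % 4 }
    BEGIN {
      sum = make_int(5) + make_int(-7)   # plain addition, no tag fiddling
      print int_value(sum)
      kind = (tag_of(sum) == 0) ? "integer" : "other"
      print kind
      pos = (1 + 2 * 4) % 4              # alternative encoding, N = 2
      neg = (1 + (-1) * 4) % 4           # alternative encoding, N = -1
      print pos
      print neg
    }'
}

tag_demo
```

    With tag bits of zero, negative integers still pass the tag test; with a tag of 1, the value for N = -1 reports -3 instead of 1.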

    The evaluation/saved-bindings stack

    The following is from an email discussion; it doesn't develop everything from first principles but is included here in the hope it will be helpful.

    Hi. I just took a look at awklisp, and remembered that there's more to your question about why we need a stack -- it's a good question. The real reason is because a stack is accessible to the garbage collector.

    We could have had apply() evaluate the arguments itself, and stash the results into variables like arg0 and arg1 -- then the case for ADD would look like

    if (proc == ADD) return is(a_number, arg0) + is(a_number, arg1)
    

    The obvious problem with that approach is how to handle calls to user-defined procedures, which could have any number of arguments. Say we're evaluating ((lambda (x) (+ x 1)) 42). (lambda (x) (+ x 1)) is the procedure, and 42 is the argument.

    A (wrong) solution could be to evaluate each argument in turn, and bind the corresponding parameter name (like x in this case) to the resulting value (while saving the old value to be restored after we return from the procedure). This is wrong because we must not change the variable bindings until we actually enter the procedure -- for example, with that algorithm ((lambda (x y) y) 1 x) would return 1, when it should return whatever the value of x is in the enclosing environment. (The eval_rands()-type sequence would be: eval the 1, bind x to 1, eval the x -- yielding 1 which is *wrong* -- and bind y to that, then eval the body of the lambda.)
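
    The difference between the two orders can be seen in a toy model of ((lambda (x y) y) 1 x), with x bound to 10 in the enclosing environment. Everything here (the env, param, arg and val arrays, the value and reset_env helpers, and the choice of 10) is an assumption for illustration; the body of the lambda is just y, so the result is whatever y ends up bound to.

```shell
# Sketch only: bind-as-you-go versus evaluate-everything-then-bind.
bind_demo() {
  awk '
    function value(e) { return (e ~ /^-?[0-9]+$/) ? e + 0 : env[e] }
    function reset_env() { split("", env); env["x"] = 10 }
    BEGIN {
      param[1] = "x"; param[2] = "y"     # (lambda (x y) y)
      arg[1] = "1";   arg[2] = "x"       # called with 1 and x

      reset_env()
      for (i = 1; i <= 2; i++) env[param[i]] = value(arg[i])   # x is rebound too early
      print "bind as you go: " env["y"]

      reset_env()
      for (i = 1; i <= 2; i++) val[i] = value(arg[i])          # evaluate all operands
      for (i = 1; i <= 2; i++) env[param[i]] = val[i]          # then bind
      print "evaluate, then bind: " env["y"]
    }'
}

bind_demo
```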

    Okay, that's easily fixed -- evaluate all the operands and stash them away somewhere until you're done, and *then* do the bindings. So the question is where to stash them. How about a global array? Like

    for (i = 0; arglist != NIL; ++i) {
        global_temp[i] = eval(car[arglist])
        arglist = cdr[arglist]
    }

    followed by the equivalent of extend_env(). This will not do, because the global array will get clobbered in recursive calls to eval(). Consider (+ 2 (* 3 4)) -- first we evaluate the arguments to the +, like this: global_temp[0] gets 2, and then global_temp[1] gets the eval of (* 3 4). But in evaluating (* 3 4), global_temp[0] gets set to 3 and global_temp[1] to 4 -- so the original assignment of 2 to global_temp[0] is clobbered before we get a chance to use it. By using a stack[] instead of a global_temp[], we finesse this problem.
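
    The clobbering can be reproduced with a toy evaluator for (+ 2 (* 3 4)). This is a simplified sketch, not awklisp's code: the expression is stored as a small tree in the hypothetical arrays op, kid1, kid2 and num, and only two-argument + and * are handled.

```shell
# Sketch only: a shared temp array is clobbered by recursion, a stack is not.
stack_demo() {
  awk '
    function combine(o, a, b) { return (o == "+") ? a + b : a * b }
    function eval_temp(n) {
      if (op[n] == "") return num[n]
      global_temp[0] = eval_temp(kid1[n])
      global_temp[1] = eval_temp(kid2[n])   # a nested call can overwrite global_temp[0]
      return combine(op[n], global_temp[0], global_temp[1])
    }
    function eval_stack(n,   v, a, b) {
      if (op[n] == "") return num[n]
      v = eval_stack(kid1[n]); stack[++sp] = v
      v = eval_stack(kid2[n]); stack[++sp] = v
      b = stack[sp--]; a = stack[sp--]      # each call pops only what it pushed
      return combine(op[n], a, b)
    }
    BEGIN {
      # (+ 2 (* 3 4)): node 1 is the +, node 3 is the *
      op[1] = "+"; kid1[1] = 2; kid2[1] = 3
      num[2] = 2
      op[3] = "*"; kid1[3] = 4; kid2[3] = 5
      num[4] = 3; num[5] = 4
      print "global_temp: " eval_temp(1)
      print "stack: " eval_stack(1)
    }'
}

stack_demo
```

    The global_temp version reports 15 because the 2 was replaced by 3 during the inner call; the stack version reports 14.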

    You may object that we can solve that by just making the global array local, and that's true; lots of small local arrays may or may not be more efficient than one big global stack, in awk -- we'd have to try it out to see. But the real problem I alluded to at the start of this message is this: the garbage collector has to be able to find all the live references to the car[] and cdr[] arrays. If some of those references are hidden away in local variables of recursive procedures, we're stuck. With the global stack, they're all right there for the gc().

    (In C we could use the local-arrays approach by threading a chain of pointers from each one to the next; but awk doesn't have pointers.)

    (You may wonder how the code gets away with having a number of local variables holding lisp values, then -- the answer is that in every such case we can be sure the garbage collector can find the values in question from some other source. That's what this comment is about:

      # All the interpretation routines have the precondition that their
      # arguments are protected from garbage collection.
    

    In some cases where the values would not otherwise be guaranteed to be available to the gc, we call protect().)

    Oh, there's another reason why apply() doesn't evaluate the arguments itself: it's called by do_apply(), which handles lisp calls like (apply car '((x))) -- where we *don't* want the x to get evaluated by apply().

    References

    • Harold Abelson and Gerald J. Sussman, with Julie Sussman. Structure and Interpretation of Computer Programs. MIT Press, 1985.
    • John Allen. Anatomy of Lisp. McGraw-Hill, 1978.
    • Daniel P. Friedman and Matthias Felleisen. The Little LISPer. Macmillan, 1989.

    Roger Rohrbach wrote a Lisp interpreter, in old awk (which has no procedures!), called walk. It can't do as much as this Lisp, but it certainly has greater hack value. Cooler name, too. It's available at http://www-2.cs.cmu.edu/afs/cs/project/ai-repository/ai/lang/lisp/impl/awk/0.html

    Bugs

    Eval doesn't check the syntax of expressions. This is a probably-misguided attempt to bump up the speed a bit, that also simplifies some of the code. The macroexpander in the startup file would be the best place to add syntax-checking.

    Author

    Darius Bacon darius@wry.me

    Copyright

    Copyright (c) 1994, 2001 by Darius Bacon.

    Permission is granted to anyone to use this software for any purpose on any computer system, and to redistribute it freely, subject to the following restrictions:

    1. The author is not responsible for the consequences of use of this software, no matter how awful, even if they arise from defects in it.
    2. The origin of this software must not be misrepresented, either by explicit claim or by omission.
    3. Altered versions must be plainly marked as such, and must not be misrepresented as being the original software.

    categories: Awk100,Top10,Interpreters,Dsl,Apr,2009,HenryS

    Amazing Awk Assembler

    Download

    Download from LAWKER.

    Description

    "aaa" (the Amazing Awk Assembler) is a primitive assembler written entirely in awk and sed. It was done for fun, to establish whether it was possible. It is; it works. It's quite slow, the input syntax is eccentric and rather restricted, and error-checking is virtually nonexistent, but it does work. Furthermore it's very easy to adapt to a new machine, provided the machine falls into the generic "8-bit-micro" category. It is supplied "as is", with no guarantees of any kind. I can't be bothered to do any more work on it right now, but even in its imperfect state it may be useful to someone.

    aaa is the mainline shell file.

    aux is a subdirectory with machine-independent stuff. The subdirectories anon, 6801, and 6809 hold machine-dependent stuff; the choice is specified by a -m option (default is "anon"). Actually, even the stuff that is supposedly machine-independent does have some machine-dependent assumptions; notably, it knows that bytes are 8 bits (not serious) and that the byte is the basic unit of instructions (more serious). These would have to change for the 68000 (going to 16-bit "bytes" might be sufficient) and maybe for the 32016 (harder).

    aaa thinks that the machine subdirectories and the aux subdirectory are in the current directory, which is almost certainly wrong.

    abst is an abstract for a paper. "card", in each machine directory, is a summary card for the slightly-eccentric input language. There is no real manual at present; sorry.

    try.s is a sample piece of 6809 input; it is semantic trash, purely for test purposes. The assembler produces try.a, try.defs, and try.x as outputs from "aaa try.s". try.a is an internal file that looks somewhat like an assembly listing. try.defs is another internal file that looks somewhat like a symbol table. These files are preserved because of possible usefulness; tmp[123] are non-preserved temporaries. try.x is the Intel-hex output. try.x.good is identical to try.x and is a saved copy for regression testing of new work.
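    For readers who have not met Intel hex: each data record is a colon, a byte count, a 16-bit load address, a record type (00 for data), the data bytes, and a checksum equal to the two's complement of the sum of all the preceding bytes, everything written in hexadecimal. The sketch below builds one such record in awk. It is not taken from aaa; the function name ihex and its decimal-argument calling convention are invented for illustration.

```shell
ihex() {
  # usage: ihex ADDRESS BYTE...   (all arguments in decimal)
  awk 'BEGIN {
      addr = ARGV[1] + 0
      n = ARGC - 2                              # number of data bytes
      sum = n + int(addr / 256) + addr % 256    # record type 00 adds nothing
      rec = sprintf(":%02X%04X00", n, addr)
      for (i = 2; i < ARGC; i++) {
          rec = rec sprintf("%02X", ARGV[i] + 0)
          sum += ARGV[i]
      }
      # checksum: low byte of the negated sum
      print rec sprintf("%02X", (256 - sum % 256) % 256)
  }' "$@"
}
ihex 48 2 51 122
```

    The call shown prints :0300300002337A1E, that is, three bytes (02 33 7A) loaded at address 0030.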

    01pgm.s is a self-programming program for a 68701, based on the one in the Motorola ap note. 01pgm.x.good is another regression-test file.

    If your C library (used by awk) has broken "%02x" so it no longer means "two digits of hex, *zero-filled*" (as some SysV libraries have), you will have to fall back from aux/hex to aux/hex.argh, which does it the hard way. Oh yes, you'll note that aaa feeds settings into awk on the command line; don't assume your awk won't do this until you try it.
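    What "the hard way" might look like can be sketched briefly. This is a guess at the idea, not the contents of aux/hex.argh; the name hex2 is made up here. It avoids printf formatting altogether by looking each digit up in a string.

```shell
hex2() {
  # usage: hex2 N   (0 <= N <= 255); prints two zero-filled hex digits
  awk 'function hex2(n,   digits) {
           digits = "0123456789abcdef"
           return substr(digits, int(n / 16) + 1, 1) substr(digits, n % 16 + 1, 1)
       }
       BEGIN { print hex2(ARGV[1] + 0) }' "$1"
}
hex2 10
```

    hex2 10 prints 0a, the same result a working "%02x" would give.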

    Author

    Henry Spencer


    categories: Wp,Dec,2009,PeterI

    Convert Comments to Latex

    This is adoC, version 1.1. Generates Latex files from source code comments.

    Download

    Download from LAWKER or http://www.sect.mce.hw.ac.uk

    Synopsis

    Usage: adoc [options] files_to_parse

    Options:

    -f
    one file per section
    -s
    sorted per function, variable, etc...
    -t title
    specify title
    --
    end of arguments

    About

    adoC is a source code documenting system written in awk and shell script. It produces documentation in LaTeX format which resembles the Unix man pages. The documentation is generated from comment sections in the source code. The comment sections are marked by two special character sequences and internally divided into sub-parts by keywords. The system can be used with almost any kind of programming language.

    The idea is based on ROBODoc http://www.xs4all.nl/~rfsber/Robo/robodoc.html

    Requirements

    The system requires a working gawk and LaTeX installation. For the LaTeX document the "refart.sty" style should be installed.

    Example

    adoC is documented by itself.

    For the detailed documentation about the system and its implementation execute the following:
    	$ adoc -s -t "adoc" adoc > doc.tex
    	$ latex doc
    	$ makeindex doc
    	$ latex doc
    	$ makeindex doc
    	$ latex doc
    	$ latex doc
    	$ dvips doc
    
    The created documentation can be downloaded in PDF format from here.

    Reporting Bugs

    Send bug reports, suggestions, and criticism by e-mail to peteri@carme.sect.mce.hw.ac.uk

    LICENSE

    GPL v2.0. Share and enjoy.

    Author

    Peter Ivanyi and Roman Putanowicz

    categories: Wp,Dsl,Jul,2009,JesusG

    md2html : Update to Markdown.awk

    Jesus Galan (yiyus) (yiyu DOT jgl AT gmail DOT com) has updated his markdown system.

    His new md2html.awk code adds several extensions and fixes numerous bugs.

    For more on this new code, see his history of a rewrite.

    Download

    Download from LAWKER.


    categories: Top10,Wp,Dsl,Mar,2009,JesusG

    Markdown.awk

    Contents

    Synopsis

    awk -f markdown.awk file.txt > file.html

    Download

    Download from LAWKER.

    Description

    (Note: this code was originally called txt2html.awk by its author but that caused a name clash inside LAWKER. Hence, I've taken the liberty of renaming it. --Timm)

    The following code implements a subset of John Gruber's Markdown language: a widely-used, ultra light-weight markup language for HTML.

    • Paragraphs: denoted by a leading blank line.
    • Images:
      ![alt text](/path/img.jpg "Title")
    • Emphasis: **To be in italics**
    • Code: `<code>` spans are delimited by backticks.
    • Headings (Setext style):
      Level 1 Header 
      =============== 
      
      Level 2 Header
      --------------
      
      Level 3 Header 
      ______________
      
    • Headings (Atx style):

      The number of leading "#" characters sets the heading level:

      # Level 1 Header
      #### Level 4 Header
      
    • Unordered lists:

      - List item 1
      - List item 2

      Note: the beginning and end of a list are inferred automatically, and perhaps not always correctly.

    • Ordered lists: denoted by a number at start-of-line.

      1 A numbered list item
      

    Code

    The following code demonstrates an "exception style" of Awk programming. Note how all the processing relating to each mark-up tag is localized (except for carrying around prior text and environments). The modularity of the following code should make it easily hackable.

    Globals

    BEGIN {
    	env = "none";
    	text = "";
    }
    

    Images

    /^!\[.+\] *\(.+\)/ {
    	split($0, a, /\] *\(/);
    	split(a[1], b, /\[/);
    	imgtext = b[2];
    	split(a[2], b, /\)/);
    	imgaddr = b[1];
    	print "<p><img src=\"" imgaddr "\" alt=\"" imgtext "\" title=\"\" /></p>\n";
    	text = "";
    	next;
    }
    

    Links

    /\] *\(/ {
    	do {
    		na = split($0, a, /\] *\(/);
    		split(a[1], b, "[");
    		linktext = b[2];
    		nc = split(a[2], c, ")");
    		linkaddr = c[1];
    		text = text b[1] "<a href=\"" linkaddr "\">" linktext "</a>" c[2];
    		for(i = 3; i <= nc; i++)
    			text = text ")" c[i];
    		for(i = 3; i <= na; i++)
    			text = text "](" a[i];
    		$0 = text;
    		text = "";
    	}
    	while (na > 2);
    }
    

    Code

    /`/ {
    	while (match($0, /`/) != 0) {
    		if (env == "code") {
    			sub(/`/, "</code>");
    			env = pcenv;
    		}
    		else {
    			sub(/`/, "<code>");
    			pcenv = env;
    			env = "code";
    		}
    	}
    }
    

    Emphasis

    /\*\*/ {
    	while (match($0, /\*\*/) != 0) {
    		if (env == "emph") {
    			# closing marker: replace the "**" itself
    			sub(/\*\*/, "</emph>");
    			env = peenv;
    		}
    		else {
    			sub(/\*\*/, "<emph>");
    			peenv = env;
    			env = "emph";
    		}
    	}
    }
    

    Setext-style Headers

    (Plus h3 with underscores.)

    /^=+$/ {
    	print "<h1>" text "</h1>\n";
    	text = "";
    	next;
    }
    
    /^-+$/ {
    	print "<h2>" text "</h2>\n";
    	text = "";
    	next;
    }
    
    /^_+$/ {
    	print "<h3>" text "</h3>\n";
    	text = "";
    	next;
    }
    

    Atx-style headers

    /^#/ {
    	match($0, /#+/);
    	n = RLENGTH;
    	if(n > 6)
    		n = 6;
    	print "<h" n ">" substr($0, RLENGTH + 1) "</h" n ">\n";
    	next;
    }
    

    Unordered Lists

    /^[-*+]/ {
    	if (env == "none") {
    		env = "ul";
    		print "<ul>";
    	}
    	print "<li>" substr($0, 3) "</li>";
    	text = "";
    	next;
    }
    
    /^[0-9]./ {
    	if (env == "none") {
    		env = "ol";
    		print "<ol>";
    	}
    	print "<li>" substr($0, 3) "</li>";
    	next;
    }
    

    Paragraphs

    /^[ \t]*$/ {
    	if (env != "none") {
    		if (text)
    			print text;
    		text = "";
    		print "</" env ">\n";
    		env = "none";
    	}
    	if (text)
    		print "<p>" text "</p>\n";
    	text = "";
    	next;
    }
    

    Default

    // {
    	text = text $0;
    }
    

    End

    END {
            if (env != "none") {
                    if (text)
                            print text;
                    text = "";
                    print "</" env ">\n";
                    env = "none";
            }
            if (text)
                    print "<p>" text "</p>\n";
            text = "";
    }
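
    The exception style used in the program above can be seen in miniature below. The sketch keeps one special-case rule (Atx headers), the blank-line rule, and the default rule. It is a cut-down illustration rather than part of markdown.awk, and the function name mini_md is invented for it.

```shell
mini_md() {
  awk '
  # Special case: deal with the line completely, then skip the other rules.
  /^#/ {
      match($0, /#+/)
      n = RLENGTH; if (n > 6) n = 6
      print "<h" n ">" substr($0, RLENGTH + 2) "</h" n ">"
      next
  }
  # A blank line flushes the paragraph collected so far.
  /^[ \t]*$/ { if (text) print "<p>" text "</p>"; text = ""; next }
  # Default: no rule claimed the line, so carry it forward.
  { text = (text ? text " " : "") $0 }
  END { if (text) print "<p>" text "</p>" }'
}
printf '# Title\nsome\ntext\n\n' | mini_md
```

    Each rule that recognizes its own markup ends with next, so only unclaimed lines ever reach the default rule; adding a new tag means adding one more rule ahead of it.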
    

    Bugs

    Does not implement the full Markdown syntax.

    Author

    Jesus Galan (yiyus) 2006

    <yiyu DOT jgl AT gmail DOT com>

    categories: Awk100,Oo,Dsl,Mar,2009,Jimh

    Awk++

    Contents

    Synopsis

     gawk -f awkpp file-name-of-awk++-program
    
    This command is platform independent and sends the translated program to standard output (stdout). See Running awk++ for variations.

    This is an updated revision (#21), released August 1, 2009. In this new version:

    • The code no longer needs a shell script or batch file to launch awkpp
    • Multiple inheritance improved
    • Configuration items added at the top of the program

    This document may be copied only as part of an awk++ distribution and in unmodified form.

    Download

    Download awkpp21.zip from LAWKER

    Description

    Awk++ is a preprocessor; that is, it reads in a program written in the awk++ language and outputs a new program. However, it differs from awka: the output of the awk++ preprocessor is awk code, not C or an executable program. So, some version of AWK, such as awk or gawk, has to be used to run the preprocessed program. awka can be used, in a second step, to turn the preprocessed awk++ program into an executable, if desired.

    OO in AWK++

    The awk++ language provides object oriented programming for AWK that includes:

    • classes
    • class properties (persistent object variables)
    • methods
    • inheritance, including multiple inheritance

    Awk++ adds new keywords to standard Awk:

    • class
    • method
    • prop
    • property
    • attr
    • attribute
    • elem
    • element
    • var
    • variable

    Syntax

    Samples:

     a = class1.new[(optional parameters)] *** similar to Ruby
     b = a.get("aProperty")
     a.delete
    
     class class1 {
         property aProperty
         method new([optional parameters]) {
             # put initialization stuff here
         }
    
         method get(propName) {
             if (propName == "aProperty")
                 return aProperty   ### Note the use of 'return'. It behaves
                                    ### exactly the same as in an AWK function.
         }
     }
    

    Details

    To define a class (similar to C++ but no public/private):

    class class_name {.....}
    

    To define a class with inheritance:

    class class_name : inherited_class_name [ : inherited_class_name...] {.....}
    

    To add local/private variables (persistent variables; syntax is unique to awk++):

    class class_name {
     attribute|attr|property|prop|element|elem|variable|var variable_name
     ..... }
    

    To help programmers who are used to other OO languages, "attribute", "property", "element", and "variable", along with their 4-letter abbreviations, are interchangeable.

    Note: these persistent variables cannot be accessed directly. The programmer must define method(s) to return them, if their values are to be made available to code that's outside the class.

    To add methods

    class class_name {
     attribute variable_name1
    
     method method_name(parameters) {
     ...any awk code....
     }
     ..other method definitions...
     }
    

    To create an object

     object_variable = class_name.new[(optional parameters)]
    
    (runs the method named "new", if it exists; returns the object ID)

    To call an object method

    object_variable.method_name(parameters)
    

    The dot isn't used for concatenation in awk/gawk, so it's a natural choice for the separator between the object and method.

    To reclaim the memory used by an object, use the delete method, i.e.:

    object_variable.delete
    

    but don't define delete() in your classes. awk++ recognizes delete() as a special method and will take care of deleting the object. Deleting objects is only necessary, though, if they hold a lot of data. Overhead for objects themselves is insignificant.

    Naming and behavior rules:

    • Class names must obey the same rules as user defined function names.
    • Method names must follow the same rules as AWK user defined function names.
    • Class "local" variables (properties, attributes, etc.) must follow the same naming rules as AWK variables.
    • Objects are number variables, so they must obey number variable rules. However, the values in variables holding objects should never be changed, as they are simply identifiers. Performing math operations on them is meaningless.
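
    One way to picture "objects are numbers" in plain awk is sketched below: hand-written code of the kind a preprocessor could emit, with each object an integer ID and its properties kept in an array keyed by ID and property name. This is an assumption made for illustration, not awk++'s actual output, and every name in it (class1_new, class1_get, obj_delete, prop[]) is hypothetical.

```shell
objects_demo() {
  awk '
  function class1_new(value) {           # plays the role of class1.new(value)
      nobjects++
      prop[nobjects, "aProperty"] = value
      return nobjects                    # the object is just this number
  }
  function class1_get(id, name) {        # plays the role of a.get(name)
      return prop[id, name]
  }
  function obj_delete(id,   k, part) {   # plays the role of a.delete
      for (k in prop) {
          split(k, part, SUBSEP)
          if (part[1] == id) delete prop[k]
      }
  }
  BEGIN {
      a = class1_new("red"); b = class1_new("blue")
      print a, class1_get(a, "aProperty")
      print b, class1_get(b, "aProperty")
      obj_delete(a)
      msg = ((a, "aProperty") in prop) ? "a still stored" : "a deleted"
      print msg
  }'
}
objects_demo
```

    In this model, doing arithmetic on a or b would simply make the variable name a different object (or none), which is one way to read the warning above.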

    Syntax notes

    OO syntax goals:

    • easy to parse and match to awk code using an awk program as the "preprocessor"
    • easy to understand
    • easy to remember
    • easy and fast to type
    • distinct from existing AWK syntax

    The OO syntax is based partly on C++, partly on Javascript, partly on Ruby and partly on the book "The Object-Oriented Thought Process". It isn't lifted in toto from one language because other languages provide features that gawk can't accomplish or have syntax that is hard to parse.

    Multiple Inheritance

    In awk++, if a method is called that isn't in the object's class and there are inherited classes (superclasses) specified, the inherited classes are called in left to right order until one of them returns a value. That value becomes the result of the method call. This is the way awk++ resolves the diamond problem. As a programmer, you control the sequence in which superclasses are called by the left to right order of the list of inherited classes in the class definition.

    There are two important things to note.

    1. The search will proceed up through as many ancestors as it takes to find a matching method.
    2. A "match" is made when a value is returned. If a superclass has a matching method that returns nothing, the search will continue. Thus, it's possible that more than one method could be executed resulting in unintended consequences. Be careful!

    Calls to undefined methods do nothing and return nothing, silently.
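
    The lookup rule described in this section can be modelled in a few lines of plain awk. The sketch below is a model of the described behaviour, not awk++'s implementation: the class names C, A and B, the parents[] table, and the functions send() and run() are invented for the example, and the model does not distinguish "method not defined" from "method returned nothing".

```shell
inherit_demo() {
  awk '
  # run() stands in for generated method bodies; "" means "returned nothing".
  function run(class, method) {
      if (class == "A" && method == "name") { print "A.name ran"; return "" }
      if (class == "B" && method == "name") { print "B.name ran"; return "B" }
      return ""
  }
  # Try the class itself, then each superclass from left to right,
  # stopping at the first call that returns a value.
  function send(class, method,   r, i, n, sup) {
      r = run(class, method)
      if (r != "") return r
      n = split(parents[class], sup, " ")
      for (i = 1; i <= n; i++) {
          r = send(sup[i], method)
          if (r != "") return r
      }
      return ""
  }
  BEGIN {
      parents["C"] = "A B"               # models: class C : A : B
      print "result: " send("C", "name")
      print "result: [" send("C", "missing") "]"
  }'
}
inherit_demo
```

    Both A.name and B.name run before a result comes back, because A's method returns nothing; the call to the undefined method yields an empty result without any error.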

    Running awk++

    The command to preprocess an awk++ program looks like this:

    gawk -f awkpp file-name-of-awk++-program
    
    or, if the "she-bang" line (line 1 in awkpp) has the right path to gawk, and awkpp is executable and in a directory in PATH,
    awkpp file-name-of-awk++-program
    
    To run the output program immediately,
    gawk -f awkpp -r file-name-of-awk++-program [awk options] data-files-to-be-processed
    
    or
    awkpp -r file-name-of-awk++-program [awk options] data-files-to-be-processed
    
    When running an awk++ program immediately, standard input (stdin) cannot be used for data. One or more data file paths must be listed on the command line.

    Bugs

    There is a bug in the standard AWK distributions that affects the preprocessor. Additionally, the preprocessor uses the 3rd array option of the match() function. So, it's best to use GAWK to run the preprocessor.

    On the other hand, the AWK code created by translating awk++ is intended to work with all versions of AWK. If you find otherwise, please notify the developer(s).

    Copyright

    Copyright (c) 2008, 2009 Jim Hart, jhart@mail.avcnet.org All rights reserved. The awk++ code is licensed under the GNU Public license (GPL) any version. awk++ documentation, including this page, may be copied only in unmodified form, subject to fair use guidelines.

    Author

    Jim Hart, jhart@mail.avcnet.org

    categories: Awk100,Oo,Dsl,May,2009,AlexS

    Awk + ANSI-C = OO

    Description

    ooc is an awk program which reads class descriptions and performs the routine coding tasks necessary to do object-oriented coding in ANSI C.

    The tool is exceptionally well documented in Object oriented programming with ANSI-C.

    Download

    Download a 2002 copy of this code from LAWKER.

    Or go to the author's web site.

    Description

    ooc is a technique to do object-oriented programming (classes, methods, dynamic linkage, simple inheritance, polymorphisms, persistent objects, method existence testing, message forwarding, exception handling, etc.) using ANSI-C.

    ooc is a preprocessor to simplify the coding task by converting class descriptions and method implementations into ANSI-C as required by the technique. You implement the algorithms inside the methods and the ooc preprocessor produces the boilerplate.

    ooc consists of a shell script driving a modular awk script (with provisions for debugging), a set of reports -- code generation templates -- interpreted by the script, and the source of a root class to provide basic functionality. Everything is designed to be changed if desired. There are manual pages, lots of examples, among them a calculator based on curses and X11, and you can ask me about the book.

    ooc as a technique requires an ANSI-C system -- classic C would necessitate substantial changes. The preprocessor needs a healthy Bourne-Shell and "new" awk as described in Aho, Weinberger, and Kernighan's book.

    ooc was developed primarily to teach about object-oriented programming without having to learn a new language. If you see how it is done in a familiar setting, it is much easier to grasp the concepts and to know what miracles to expect from the technique and what not. Conceivably, the preprocessor can be used for production programming but this was not the original intent. Being able to roll your own object-oriented coding techniques has its possibilities, however...

    Technical Details

    Most sources should be viewed with tab stops set at 4 characters.

    The original system ran on NeXTSTEP 3.2 and older, ESIX (System V) 4.0.4, and Linux 0.99.pl4-49. This rerelease was tested on MacOS X version 10.1.2 and Solaris version 5.8. You need to review paths in the script 'ooc/ooc' before running anything. Make sure the first line of this script points to a Bourne-style shell. Also make sure that the first line of '09/munch' points to a (new) awk.

    The rereleased 'ooc' awk-programs have been tested with GNU awk versions 3.0.1 and 3.0.3. Previous versions did not support AWKPATH properly (but this is not essential).

    The makefiles could be smarter but they are naive enough for all systems. This is a heterogeneous system -- set the environment variable $OSTYPE to an architecture-specific name. 'make' in the current directory will create everything by calling 'make' in the various subdirectories. Each 'makefile' includes 'make/Makefile.$OSTYPE', review your 'make/Makefile.$OSTYPE' before you start.

    The following make calls are supported throughout:

    make [all]	create examples
    make test	[make and] run examples
    make clean	remove all but sources
    make depend	make dependencies (if makefile.$OSTYPE supports it)
    

    Make dependencies can be built with the -MM option of the GNU C compiler. They are stored in a file 'depend' in each subdirectory. They should apply to all systems. 'makefile.$OSTYPE' may include a target 'depend' to recreate 'depend' -- check 'makefile.darwin1.4' for an example.

    Contents

    The following is a walk through the file hierarchy in the order of the book:

    makefile
    dispatch standard make calls to known directories
    make/
    Makefile: boilerplate code for makefiles
    01/*
    chapter 1: abstract data types
    • sets: Set demo
    • bags: Bag demo: Set with reference count
    02/*
    chapter 2: dynamic linkage
    • strings: String demo
    • atoms: Atom demo: unique String
    03/*
    chapter 3: manipulating expressions with dyn. linkage
    • postfix: postfix output of expression
    • value: expression evaluation
    • infix: infix output of expression
    04/*
    chapter 4: inheritance
    • points: Point demo
    • circles: Circle demo: Circle: Point with radius
    05/*
    chapter 5: symbol table with inheritance
    • value: expression evaluation with vars, consts, functions
    06/*
    chapter 6: class hierarchy and meta classes
    • any: objects that do not differ from any object
    07/*
    chapter 7: ooc preprocessor; use ooc -7
    • points: Point demo: PointClass is a new metaclass
    • circles: Circle demo: Circle is a new class
    • queue: Queue demo: List is an abstract base class
    • stack: Stack demo: another subclass of List
    08/*
    chapter 8: dynamic type checking; use ooc -8
    • circles: Circle demo: nothing changed
    • list: List demo: traps insertion of numbers or strings
    09/*
    chapter 9: automatic initialization; use ooc -9
    • munch: awk program to collect class list from nm -p output
    • circles: Circle demo: no more init calls
    • list: List demo: no more init calls
    10/*
    chapter 10: respondsTo method; use ooc -10
    • cmd: Filter demo: how flags and options are handled
    • wc: word count filter
    • sort: sorting filter, adds sort method to List
    11/*
    chapter 11: class methods
    • value: expression evaluator, based on class hierarchy
    • value: x memory reclamation enabled
    12/*
    chapter 12: persistent objects
    • value: expression evaluator, with save and load
    13/*
    chapter 13: exception handling
    • value: expression evaluator with exception handler
    • except: Exception demo
    14/*
    chapter 14: message forwarding
    • makefile.etc: (naive) generated rules for the library
    • Xapp: resources for X11-based programs
    • hello: LineOut demo: hello, world
    • button: Button demo
    • run: terminal-oriented calculator
    • cbutton: Crt demo: hello, world changes into a
    • crun: curses-based calculator
    • xhello: XLineOut demo: hello, world
    • xbutton: XButton demo with XawBox and XawForm
    • xrun: X11-based calculator with callbacks
    man/*
    manual pages
    • *.1: tools
    • *.2: functions
    • *.3: some classes
    • *.4: classes in chapter 14
    ooc/*
    ooc preprocessor
    • ooc: command script; review 'home' 'OOCPATH' 'AWKPATH'
    • awk/*.awk: modules
    • awk/*.dbg: debugging modules
    • rep/*.rep: reports
    • rep-*/*.rep: reports for early chapters

    Copyright

    Copyright (c) 1993

    While you may use this software package, neither I nor my employers can be made responsible for whatever problems you might cause or encounter.

    While you may give away this package and/or software derived with it, you should not charge for it, you should not claim that ooc is your work, and I have published my own book about ooc before you did.

    The same restrictions apply to whoever might get this package from you.

    Author

    Axel T. Schreiner, http://www.cs.rit.edu/~ats/

    categories: Interpreters,Apr,2009,DavidL

    Awk A*

    Programmers often take awk "as is", never thinking to use it as a lab in which to explore language extensions. This is, of course, only one way to treat the Awk code base.

    An alternate approach is to treat the Awk code base as a reusable library of parsers, regular expression engines, and so on, and to make modifications to the language. This second approach was taken by David Ladd and J. Christopher Ramming in their A* system.

    They write:

      While there are a number of systems that will help one construct full-blown metaprograms such as compilers and interpreters, we wanted something with extremely low overhead. We set out to build something with the property that it would help even inexperienced users build simple meta-programs in a matter of minutes with a few lines of code. A* is the result; it is more than anything else an engineering exercise, as most of its ideas are not new. It is the arrangement of these ideas and the purpose to which they are directed that distinguish A* from other tools.

      A* is an experimental language designed to facilitate the creation of language-processing tools. It is analogous either to an interpreted yacc with Awk as its statement language, or to a version of Awk which processes programs rather than records. A* offers two principal advantages over the combination of lex, yacc, and C:

      1. a high-level interpreted base language
      2. built-in parse tree construction.
    A* programmers are thus able to accomplish many useful tasks with little code.

    Reference: D.A. Ladd and J.C. Ramming. "A*: a language for implementing language processors." IEEE Transactions on Software Engineering, 21(11):894-901, Nov. 1995.


    categories: Funky,Mar,2009,Timm

    Funky: Functional Gawk

    These pages are focused on Functional Gawk (a.k.a. "Funky").

    Funky is enabled by a new feature added to Gawk 3.2: indirect functions. For example:

    function foo() { print "foo" }
    function bar() { print "bar" }
    
    BEGIN {
        the_func = "foo"
        @the_func()     # calls foo()
        the_func = "bar"
        @the_func()     # calls bar()
    }
    

    At the time of this writing, Gawk 3.2 is pre-release and indirect functions can be accessed using the gawk-devel CVS tree:

    cvs -d:pserver:anonymous@cvs.sv.gnu.org:/sources/gawk co gawk-devel
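
    As one illustration of what indirect calls allow, here is a small map() helper that applies a function, chosen by name, to every element of an array. It is a sketch only: map(), double() and square() are made-up names, and it assumes a gawk on the PATH that accepts the @name() syntax shown above.

```shell
funky_map() {
  gawk '
  function double(x) { return 2 * x }
  function square(x) { return x * x }
  # Apply the function named by f to a[1..n], in place.
  function map(f, a, n,   i) {
      for (i = 1; i <= n; i++) a[i] = @f(a[i])
  }
  BEGIN {
      n = split("1 2 3", nums, " ")
      map("double", nums, n)
      map("square", nums, n)
      print nums[1], nums[2], nums[3]
  }'
}
```

    Calling funky_map doubles and then squares each element, printing 4 16 36. The function to apply is ordinary string data, so it could just as well come from a command-line argument or an input field.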
    

    categories: Funky,Mar,2009,Timm

    The Functional Challenge

    Indirect functions enable a new view on library management in Gawk and, perhaps, a way to emulate functional abstraction in languages like Lisp.

    So, anyone care to try, say:


    categories: Sed,Tips,Apr,2009,Admin

    Sed-clones (in Awk)

    These pages focus on Sed-like stream editors, written in Awk.


    categories: SysAdmin,Oct,2009,M0J0

    Shorten Your Pipes

    m0j0 writes in his blog...

    I was lurking around on twitter during my lunch hour (yes, even freelancers need a lunch hour), and @bitprophet tweeted thusly:

      Get syslog-owned log names from syslog.conf:
      grep -v "^#" syslog.conf | 
      awk "{print $2}" | egrep -v "^(\*|\|)" | 
      sed "/^$/ d" | sed "s/^-//"
      

    Followed by this:

      Interested to see if anyone can shorten my previous tweet's command line, outside of using 'cut' instead of the awk bit.

    I happen to love puzzles like this, and my lunch was almost immediately followed by a long, boring conference call.

    @bitprophet's pipeline above is translated by my brain into the English:

    Find non-commented lines, grab the second space-delimited field, then filter out the ones that start with "*" or "|", then delete any blank lines, and strip any leading "-" from the result.

    My brain usually attempts to think of the English version of the solution *first*, and then try to emulate that in the code/command I write. So, the issue here is we want to find file paths (and apparently sockets are ok, too, as "@" is a valid leading character in the initial definition of the problem). If it's a file path, we want to see it in a form that would be suitable for passing it to something like "ls -l", which means leading symbols like "-" and "|" should be omitted.

    In a syslog.conf file, the main meat is the area where you specify the warning levels, and the file you want messages at that warning level sent to (this is a simplistic explanation, but good enough to understand the solution I came up with). The file is also littered with comments. Here's the file on my Mac:

    *.err;kern.*;auth.notice;authpriv,remoteauth,install.none;mail.crit        /dev/console
    *.notice;authpriv,remoteauth,ftp,install.none;kern.debug;mail.crit    /var/log/system.log
    
    # Send messages normally sent to the console also to the serial port.
    # To stop messages from being sent out the serial port, comment out this line.
    #*.err;kern.*;auth.notice;authpriv,remoteauth.none;mail.crit        /dev/tty.serial
    
    # The authpriv log file should be restricted access; these
    # messages shouldn't go to terminals or publically-readable
    # files.
    auth.info;authpriv.*;remoteauth.crit            /var/log/secure.log
    
    lpr.info                        /var/log/lpr.log
    mail.*                            /var/log/mail.log
    ftp.*                            /var/log/ftp.log
    
    install.*                        /var/log/install.log
    install.*                        @127.0.0.1:32376
    local0.*                        /var/log/appfirewall.log
    local1.*                        /var/log/ipfw.log
    stuff.*                            -/boo
    things.*                        |/var/log
    *.emerg                            *
    

    So, in English, my brain parses the problem like this:

      Skip blank lines, commented lines, and lines where the file name is "*", and give me everything else, but strip off characters "-" and "|" before sending it to the screen.

    And here's my awk one-liner for doing that:

    awk '$0 !~ /^$|^#/ && $2 !~ /^\*/ {sub(/^-|^\|/,"",$2);print $2}' syslog.conf
    

    Knowing a few key things about awk will help parse the above:

    Awk automatically breaks up each line of input into fields. If you don't tell it what to use as a delimiter, it splits on any run of spaces or tabs. If you have a CSV file, you'd likely use "awk -F," to tell awk to use a comma. For /etc/passwd, use "awk -F:". From there, you can reference the first field as $1, the second as $2, etc. $0 represents the whole line. There are more, but that's enough for this example.
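
    A quick illustration of default versus explicit field splitting. The sample lines and the function names second_field and login_and_shell are made up for the example.

```shell
# Default splitting: any run of spaces or tabs separates fields.
second_field() { awk '{ print $2 }'; }
# Explicit delimiter: split passwd-style records on ":".
login_and_shell() { awk -F: '{ print $1, $NF }'; }

printf 'mail.*     /var/log/mail.log\n' | second_field
printf 'root:x:0:0:root:/root:/bin/sh\n' | login_and_shell
```

    The first command prints /var/log/mail.log however many spaces separate the two columns; the second prints root /bin/sh, with $NF standing for the last field.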

    Though I think most sysadmins can get a lot done with simple usage like "awk -F: '{print $2}'", sometimes more power is needed, and awk delivers. It supports extended regular expressions, and lets you check a field (or the whole line, $0, as I do above) against a regex as a precondition for performing some action with the line or a field on that line. So, in the above awk command, I check that the line is neither empty nor a comment. I then use a logical AND to also check that field 2 does not start with "*". Any line that fails either test is skipped.

    Another nice thing about awk is that it actually is a Turing-complete programming language. After I check the lines of input against the rules mentioned above, I immediately know that I definitely want at least some portion of $2 in the remaining lines. What I *don't* want are preceding characters like "-" or "|". I need to strip them from the file name. I use awk's built in "sub()" function to handle that, and with that out of the way I call "print" to send the result to the screen.
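
    To see the one-liner at work without touching a real configuration, it can be run against a few representative lines. The function name log_targets and the sample input are made up for this illustration; the awk program is the one given above.

```shell
log_targets() {
  awk '$0 !~ /^$|^#/ && $2 !~ /^\*/ {sub(/^-|^\|/,"",$2);print $2}' "$@"
}

printf '%s\n' '# a comment' 'mail.*      /var/log/mail.log' '' \
    'stuff.*     -/boo' 'things.*    |/var/log' \
    'install.*   @127.0.0.1:32376' '*.emerg     *' | log_targets
```

    This prints /var/log/mail.log, /boo, /var/log and @127.0.0.1:32376, one per line: the comment, the blank line and the "*" target are dropped, and the leading "-" and "|" are stripped.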


    categories: Sed,Tips,Oct,2009,EdM

    Sed in Awk

    Writing in comp.lang.awk Ed Morton ports numerous complex sed expressions to Awk:

    A comp.lang.awk author asks the question:

      I have a file that has a series of lists

      (qqq)
      aaa 111
      bbb 222
      

      and I want to make it look like

      aaa 111 (qqq)
      bbb 222 (qqq)
      

    IMHO the clearest sed solution given was:

    sed -e '
       /^([^)]*)/{
          h; # remember the (qqq) part
          d
       }
    
       / [1-9][0-9]*$/{
          G; # strap the (qqq) part to the list
          s/\n/ /
       }
    ' yourfile
    

    while the awk one was:

    awk '/^\(/{ h=$0;next } { print $0,h }' file
    

    As I've said repeatedly, sed is an excellent tool for simple substitutions on a single line. For anything else you should use awk, perl, etc.

    Having said that, let's take a look at the awk equivalents for the posted sed examples below that are not simple substitutions on a single line so people can judge for themselves (i.e. quietly - this is not a contest and not a religious war!) which code is clearer, more consistent, and more obvious. When reading this, just imagine yourself having to figure out what the given script does in order to debug or enhance it or write your own similar one later.

    Note that in awk as in shell there are many ways to solve a problem so I'm trying to stick to the solutions that I think would be the most useful to a beginner since that's who'd be reading an examples page like this, and without using any GNU awk extensions. Also note I didn't test any of this but it's all pretty basic stuff so it should mostly be right.

    For those who know absolutely nothing about awk, I think all you need to know to understand the scripts below is that, like sed, it loops through input files evaluating conditions against the current input record (a line by default) and executing the actions you specify (printing the current input record if none specified) if those conditions are true, and it has the following pre-defined symbols:

    NR = Number of Records read so far
    NF = Number of Fields in current record
    FS = the Field Separator
    RS = the Record Separator
    BEGIN = a pattern that's only true before processing any input
    END = a pattern that's only true after processing all input.
    

    Oh, and setting RS to the NULL string (-v RS='') tells awk to read paragraphs instead of lines as individual records, and setting FS to the NULL string (-v FS='') tells awk to treat each individual character as a field (POSIX leaves the behavior of a NULL FS unspecified, but gawk and mawk, among others, support it).
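
    As a quick illustration of paragraph mode, the following sketch prints the first line of each blank-line-separated paragraph along with its line count. The function name first_lines and the output format are made up for this example.

    ```shell
    # first_lines: for every paragraph on stdin, print its number,
    # its first line, and how many lines it holds.
    # RS="" selects paragraph mode; FS="\n" makes each line one field.
    first_lines() {
        awk 'BEGIN { RS = ""; FS = "\n" } { print NR ": " $1 " [" NF "]" }'
    }

    printf 'alpha\nbeta\n\ngamma\n' | first_lines
    ```

    Here NR counts paragraphs rather than lines, and NF counts the lines inside the current paragraph.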

    For more info on awk, see http://www.awk.info.

    Introductory Examples

    Double space a file:

      Sed:

      sed G
      

      Awk

      awk '{print $0 "\n"}'
      

    Double space a file which already has blank lines in it. Output file should contain no more than one blank line between lines of text.

      Sed:

      sed '/^$/d;G'
      

      Awk:

      awk 'NF{print $0 "\n"}'
      

    Triple space a file

      Sed:

      sed 'G;G'
      

      Awk:

      awk '{print $0 "\n\n"}'
      

    Undo double-spacing (assumes even-numbered lines are always blank):

      Sed:

      sed 'n;d'
      

      Awk:

      awk 'NF'
      

    Insert a blank line above every line which matches "regex":

      Sed:

      sed '/regex/{x;p;x;}'
      

      Awk:

      awk '{print (/regex/ ? "\n" : "") $0}'
      

    Insert a blank line below every line which matches "regex":

      Sed:

      sed '/regex/G'
      

      Awk:

      awk '{print $0 (/regex/ ? "\n" : "")}'
      

    Insert a blank line above and below every line which matches "regex":

      Sed:

      sed '/regex/{x;p;x;G;}'
      

      Awk:

      awk '{print (/regex/ ? "\n" $0 "\n" : $0)}'
      

    Numbering

    Number each line of a file (simple left alignment). Using a tab (see note on '\t' at end of file) instead of space will preserve margins:

      Sed:

      sed = filename | sed 'N;s/\n/\t/'
      

      Awk:

      awk '{print NR "\t" $0}'
      

    Number each line of a file (number on left, right-aligned):

      Sed:

      sed = filename | sed 'N; s/^/     /; s/ *\(.\{6,\}\)\n/\1  /'
      

      Awk:

      awk '{printf "%6s  %s\n",NR,$0}'
      

    Number each line of file, but only print numbers if line is not blank:

      Sed:

      sed '/./=' filename | sed '/./N; s/\n/ /'
      

      Awk:

      awk '{print (NF ? NR "\t" : "") $0}'
      

    Count lines (emulates "wc -l")

      Sed:

      sed -n '$='
      

      Awk:

      awk 'END{print NR}'
      

    Text Conversion and Substitution

    Align all text flush right on a 79-column width:

      Sed:

      sed -e :a -e 's/^.\{1,78\}$/ &/;ta'  # set at 78 plus 1 space
      

      Awk:

      awk '{printf "%79s\n",$0}'
      

    Center all text in the middle of 79-column width. In method 1, spaces at the beginning of the line are significant, and trailing spaces are appended at the end of the line. In method 2, spaces at the beginning of the line are discarded in centering the line, and no trailing spaces appear at the end of lines.

      Sed:

      sed  -e :a -e 's/^.\{1,77\}$/ & /;ta'                     # method 1
      sed  -e :a -e 's/^.\{1,77\}$/ &/;ta' -e 's/\( *\)\1/\1/'  # method 2
      

      Awk:

      awk '{printf "%"int((79+length)/2)"s\n",$0}'
      

    Reverse order of lines (emulates "tac"). A bug/feature in sed v1.5 causes blank lines to be deleted.

      Sed:

      sed '1!G;h;$!d'               # method 1
      sed -n '1!G;h;$p'             # method 2
      

      Awk:

      awk '{a[NR]=$0} END{for (i=NR;i>=1;i--) print a[i]}'
      

    Reverse each character on the line (emulates "rev")

      Sed:

      sed '/\n/!G;s/\(.\)\(.*\n\)/&\2\1/;//D;s/.//'
      

      Awk:

      awk -v FS='' '{for (i=NF;i>=1;i--) printf "%s",$i; print ""}'
      

    Join pairs of lines side-by-side (like "paste")

      Sed:

      sed '$!N;s/\n/ /'
      

      Awk:

      awk '{printf "%s%s",$0,(NR%2 ? " " : "\n")}'
      

    If a line ends with a backslash, append the next line to it

      Sed:

      sed -e :a -e '/\\$/N; s/\\\n//; ta'
      

      Awk:

      awk '{printf "%s",(sub(/\\$/,"") ? $0 : $0 "\n")}'
      

    if a line begins with an equal sign, append it to the previous line and replace the "=" with a single space

      Sed:

      sed -e :a -e '$!N;s/\n=/ /;ta' -e 'P;D'
      

      Awk:

      awk '{printf "%s%s",(sub(/^=/," ") || NR==1 ? "" : "\n"),$0} END{print ""}'
      

    Add a blank line every 5 lines (after lines 5, 10, 15, 20, etc.)

      Sed:

      gsed '0~5G'                  # GNU sed only
      sed 'n;n;n;n;G;'             # other seds
      

      Awk:

      awk '{print $0} !(NR%5){print ""}'
      

    Selective Printing of Certain Lines

    Print first 10 lines of file (emulates behavior of "head")

      Sed:

      sed 10q
      

      Awk:

      awk '{print $0} NR==10{exit}'
      

    Print first line of file (emulates "head -1")

      Sed:

      sed q
      

      Awk:

      awk 'NR==1{print $0; exit}'
      

    Print the last 10 lines of a file (emulates "tail")

      Sed:

      sed -e :a -e '$q;N;11,$D;ba'
      

      Awk:

      awk '{a[NR]=$0} END{for (i=(NR>10 ? NR-9 : 1);i<=NR;i++) print a[i]}'
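
      Storing the whole file just to print its tail can be wasteful on large inputs. A minimal sketch of an alternative, assuming a ring buffer indexed by NR modulo the number of lines wanted (the names last_n, n and buf are illustrative):

      ```shell
      # last_n N: print the last N lines of stdin, holding at most N lines in memory
      last_n() {
          awk -v n="$1" '
              { buf[NR % n] = $0 }                      # overwrite the oldest slot
              END {
                  start = (NR > n) ? NR - n + 1 : 1     # first line still buffered
                  for (i = start; i <= NR; i++)
                      print buf[i % n]
              }'
      }

      printf '1\n2\n3\n4\n5\n' | last_n 2
      ```

      The example prints lines 4 and 5. Inputs shorter than N are printed in full.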
      

    Print the last 2 lines of a file (emulates "tail -2")

      Sed:

      sed '$!N;$!D'
      

      Awk:

      awk '{a[NR]=$0} END{for (i=(NR>2 ? NR-1 : 1);i<=NR;i++) print a[i]}'
      

    Print the last line of a file (emulates "tail -1")

      Sed:

      sed '$!d'                    # method 1
      sed -n '$p'                  # method 2
      

      Awk:

      awk 'END{print $0}'
      

    Print the next-to-the-last line of a file

      Sed:

      sed -e '$!{h;d;}' -e x  # for 1-line files, print blank line
      sed -e '1{$q;}' -e '$!{h;d;}' -e x  # for 1-line files, print the line
      sed -e '1{$d;}' -e '$!{h;d;}' -e x  # for 1-line files, print nothing
      

      Awk:

      awk '{prev=curr; curr=$0} END{print prev}'
      

    Print only lines which match regular expression (emulates "grep")

      Sed:

      sed -n '/regexp/p'           # method 1
      sed '/regexp/!d'             # method 2
      

      Awk:

      awk '/regexp/'
      

    Print only lines which do NOT match regexp (emulates "grep -v")

      Sed:

      sed -n '/regexp/!p'          # method 1, corresponds to above
      sed '/regexp/d'              # method 2, simpler syntax
      

      Awk:

      awk '!/regexp/'
      

    Print the line immediately before a regexp, but not the line containing the regexp

      Sed:

      sed -n '/regexp/{g;1!p;};h'
      

      Awk:

      awk 'NR>1 && /regexp/{print prev} {prev=$0}'
      

    Print the line immediately after a regexp, but not the line containing the regexp

      Sed:

      sed -n '/regexp/{n;p;}'
      

      Awk:

      awk 'found{print $0} {found=(/regexp/ ? 1 : 0)}'
      

    Print 1 line of context before and after regexp, with line number indicating where the regexp occurred (similar to "grep -A1 -B1")

      Sed:

      sed -n -e '/regexp/{=;x;1!p;g;$!N;p;D;}' -e h
      

      Awk:

      awk 'found    {print preLine "\n" hitLine "\n" $0;   found=0}
            /regexp/ {preLine=prev;   hitLine=NR " " $0;    found=1}
            {prev=$0}'
      

    Grep for AAA and BBB and CCC (in any order)

      Sed:

      sed '/AAA/!d; /BBB/!d; /CCC/!d'
      

      Awk:

      awk '/AAA/&&/BBB/&&/CCC/'
      

    Grep for AAA and BBB and CCC (in that order)

      Sed:

      sed '/AAA.*BBB.*CCC/!d'
      

      Awk:

      awk '/AAA.*BBB.*CCC/'
      

    Grep for AAA or BBB or CCC (emulates "egrep")

      Sed:

      sed -e '/AAA/b' -e '/BBB/b' -e '/CCC/b' -e d    # most seds
      gsed '/AAA\|BBB\|CCC/!d'                        # GNU sed only
      

      Awk:

      awk '/AAA|BBB|CCC/'
      

    Print paragraph if it contains AAA (blank lines separate paragraphs). Sed v1.5 must insert a 'G;' after 'x;' in the next 3 scripts below

      Sed:

      sed -e '/./{H;$!d;}' -e 'x;/AAA/!d;'
      

      Awk:

      awk -v RS='' '/AAA/'
      

    Print paragraph if it contains AAA and BBB and CCC (in any order)

      Sed:

      sed -e '/./{H;$!d;}' -e 'x;/AAA/!d;/BBB/!d;/CCC/!d'
      

      Awk:

      awk -v RS='' '/AAA/&&/BBB/&&/CCC/'
      

    Print paragraph if it contains AAA or BBB or CCC

      Sed:

      sed -e '/./{H;$!d;}' -e 'x;/AAA/b' -e '/BBB/b' -e '/CCC/b' -e d
      gsed '/./{H;$!d;};x;/AAA\|BBB\|CCC/b;d'         # GNU sed only
      

      Awk:

      awk -v RS='' '/AAA|BBB|CCC/'
      

    Print only lines of 65 characters or longer

      Sed:

      sed -n '/^.\{65\}/p'
      

      Awk:

      awk 'length($0)>=65'
      

    Print only lines of less than 65 characters

      Sed:

      sed -n '/^.\{65\}/!p'        # method 1, corresponds to above
      sed '/^.\{65\}/d'            # method 2, simpler syntax
      

      Awk:

      awk 'length($0)<65'
      

    Print section of file from regular expression to end of file

      Sed:

      sed -n '/regexp/,$p'
      

      Awk:

      awk '/regexp/{found=1} found'
      

    Print section of file based on line numbers (lines 8-12, inclusive)

      Sed:

      sed -n '8,12p'               # method 1
      sed '8,12!d'                 # method 2
      

      Awk:

      awk 'NR>=8 && NR<=12'
      

    Print line number 52

      Sed:

      sed -n '52p'                 # method 1
      sed '52!d'                   # method 2
      sed '52q;d'                  # method 3, efficient on large files
      

      Awk:

      awk 'NR==52{print $0; exit}'
      

    Beginning at line 3, print every 7th line

      Sed:

      gsed -n '3~7p'               # GNU sed only
      sed -n '3,${p;n;n;n;n;n;n;}' # other seds
      

      Awk:

      awk '!((NR-3)%7)'
      

    print section of file between two regular expressions (inclusive)

      Sed:

      sed -n '/Iowa/,/Montana/p'             # case sensitive
      

      Awk:

      awk '/Iowa/,/Montana/'
      

    Print all lines of FileID up to the 1st line containing "string"

      Sed:

      sed '/string/q' FileID
      

      Awk:

      awk '{print $0} /string/{exit}'
      

    Print all lines of FileID from the 1st line containing "string" until eof

      Sed:

      sed '/string/,$!d' FileID
      

      Awk:

      awk '/string/{found=1} found'
      

    Print all lines of FileID from the 1st line containing "string1" until the 1st line containing "string2" [boundaries inclusive]

      Sed:

      sed '/string1/,$!d;/string2/q' FileID
      

      Awk:

      awk '/string1/{found=1} found{print $0} found && /string2/{exit}'
      

    Selective Deletion of Certain Lines

    Print all of file EXCEPT section between 2 regular expressions

      Sed:

      sed '/Iowa/,/Montana/d'
      

      Awk:

      awk '/Iowa/,/Montana/{next} {print $0}' file
      

    Delete duplicate, consecutive lines from a file (emulates "uniq"). First line in a set of duplicate lines is kept, rest are deleted.

      Sed:

      sed '$!N; /^\(.*\)\n\1$/!P; D'
      

      Awk:

      awk 'NR==1 || $0!=prev{print $0} {prev=$0}'
      

    Delete duplicate, nonconsecutive lines from a file. Beware not to overflow the buffer size of the hold space, or else use GNU sed.

      Sed:

      sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'
      

      Awk:

      awk '!a[$0]++'
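
      The awk version is terse, so a sketch of the same idea spelled out may help. The array (called seen here, an illustrative name) counts how often each line has appeared so far, and a line is printed only while its count is still zero:

      ```shell
      # dedupe: keep the first occurrence of every line, in input order
      dedupe() {
          awk '{
              if (seen[$0] == 0)   # first time this exact line is met
                  print $0
              seen[$0]++           # remember it for later lines
          }'
      }

      printf 'b\na\nb\nc\na\n' | dedupe
      ```

      The example prints b, a and c. Memory use grows with the number of distinct lines, since each one becomes an array key.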
      

    Delete all lines except duplicate lines (emulates "uniq -d").

      Sed:

      sed '$!N; s/^\(.*\)\n\1$/\1/; t; D'
      

      Awk:

      awk '$0==prev{print $0} {prev=$0}'      # works only on consecutive
      awk 'a[$0]++'                           # works on non-consecutive
      

    Delete the first 10 lines of a file

      Sed:

      sed '1,10d'
      

      Awk:

      awk 'NR>10'
      

    Delete the last line of a file

      Sed:

      sed '$d'
      

      Awk:

      awk 'NR>1{print prev} {prev=$0}'
      

    Delete the last 2 lines of a file

      Sed:

      sed 'N;$!P;$!D;$d'
      

      Awk:

      awk 'NR>2{print prev[2]} {prev[2]=prev[1]; prev[1]=$0}'    # method 1
      awk '{a[NR]=$0} END{for (i=1;i<=NR-2;i++) print a[i]}'     # method 2
      awk -v num=2 'NR>num{print prev[num]}
          {for (i=num;i>1;i--) prev[i]=prev[i-1]; prev[1]=$0}'    # method 3
      

    Delete the last 10 lines of a file

      Sed:

      sed -e :a -e '$d;N;2,10ba' -e 'P;D'   # method 1
      sed -n -e :a -e '1,10!{P;N;D;};N;ba'  # method 2
      

      Awk:

      awk -v num=10 'NR>num{print prev[num]}
          {for (i=num;i>1;i--) prev[i]=prev[i-1]; prev[1]=$0}'    # same as deleting last 2, method 3 above
      

    Delete every 8th line

      Sed:

      gsed '0~8d'                           # GNU sed only
      sed 'n;n;n;n;n;n;n;d;'                # other seds
      

      Awk:

      awk 'NR%8'
      

    Delete lines matching pattern

      Sed:

      sed '/pattern/d'
      

      Awk:

      awk '!/pattern/'
      

    Delete ALL blank lines from a file (same as "grep '.' ")

      Sed:

      sed '/^$/d'                           # method 1
      sed '/./!d'                           # method 2
      

      Awk:

      awk '!/^$/'                             # method 1
      awk '/./'                               # method 2
      

    Delete all CONSECUTIVE blank lines from file except the first; also deletes all blank lines from top and end of file (emulates "cat -s")

      Sed:

      sed '/./,/^$/!d'
      

      Awk:

      awk '/./,/^$/'
      

    Delete all leading blank lines at top of file

      Sed:

      sed '/./,$!d'
      

      Awk:

      awk 'NF{found=1} found'
      

    Delete all trailing blank lines at end of file

      Sed:

      sed -e :a -e '/^\n*$/{$d;N;ba' -e '}'  # works on all seds
      sed -e :a -e '/^\n*$/N;/\n$/ba'        # ditto, except for gsed 3.02.*
      

      Awk:

      awk '{a[NR]=$0} NF{nbNr=NR} END{for (i=1;i<=nbNr;i++) print a[i]}'
      

    Delete the last line of each paragraph

      Sed:

      sed -n '/^$/{p;h;};/./{x;/./p;}'
      

      Awk:

      awk -v FS='\n' -v RS='' '{for (i=1;i<NF;i++) print $i; print ""}'
      

    Special Applications

    Get Usenet/e-mail message header

      Sed:

      sed '/^$/q'        # deletes everything after first blank line
      

      Awk:

      awk '{print $0} /^$/{exit}'
      

    Get Usenet/e-mail message body

      Sed:

      sed '1,/^$/d'              # deletes everything up to first blank line
      

      Awk:

      awk 'found{print $0} /^$/{found=1}'
      

    Get Subject header, but remove initial "Subject: " portion

      Sed:

      sed '/^Subject: */!d; s///;q'
      

      Awk:

      awk 'sub(/^Subject: */,""){print $0; exit}'
      

    Parse out the address proper. Pulls out the e-mail address by itself from the 1-line return address header (see preceding script)

      Sed:

      sed 's/ *(.*)//; s/>.*//; s/.*[:<] *//'
      

      Awk:

      awk '{sub(/ *\(.*\)/,""); sub(/>.*/,""); sub(/.*[:<] */,""); print $0}'
      

    Add a leading angle bracket and space to each line (quote a message)

      Sed:

      sed 's/^/> /'
      

      Awk:

      awk '{print "> " $0}'
      

    Delete leading angle bracket & space from each line (unquote a message)

      Sed:

      sed 's/^> //'
      

      Awk:

      awk '{sub(/^> /,""); print $0}'
      

    categories: Jul,2009,RussC

    Awk's RE Match Very Fast

    (This page is a summary of Russ Cox's excellent article Regular Expression Matching Can Be Simple and Fast.)

    Russ Cox writes that Awk's regular expression library can be dramatically faster than that used in Perl, Ruby, and Python:

      This is a tale of two approaches to regular expression matching. One of them is in widespread use in the standard interpreters for many languages, including Perl. The other is used only in a few places, notably most implementations of awk and grep. The two approaches have wildly different performance characteristics.

      Let's use superscripts to denote string repetition, so that a?^3a^3 is shorthand for a?a?a?aaa. This lets us define timing experiments that use regular expressions to match a?^na^n against the string a^n.

      If we conduct those experiments, Perl requires over sixty seconds to match a 29-character string. The other approach, labeled Thompson NFA for reasons that will be explained later, requires twenty microseconds to match the string. That's not a typo. ... the Thompson NFA implementation is a million times faster than Perl when running on a minuscule 29-character string. This trend grows as we increase "n": the Thompson NFA handles a 100-character string in under 200 microseconds, while Perl would require over 10^15 years. (Perl is only the most conspicuous example of a large number of popular programs that use the same algorithm; the above graph could have been Python, or PHP, or Ruby, or many other languages.)

    For some details of his results, see the following graph. Note that the y-axis is logarithmic (increases by a power of ten for each tick) so these differences are really big differences:

    The reason for these differences is very technical, but Cox's article offers an excellent and clear description of the details. In short, the RE matcher used in Perl, Ruby, and Python is a recursive backtracking algorithm that follows only one candidate match state at a time and returns to an earlier choice whenever that candidate fails. A Thompson NFA, as used in Awk and Grep, instead tracks all the states a match could be in at once. With Thompson's NFA, each input character is examined only once and the sets of states can be cached as they are discovered, thus removing the backtrack-on-failure process.
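
    Cox's experiment is easy to reproduce on a small scale. The sketch below builds the pattern a?^na^n and the string a^n inside awk and performs a single match. The function name pathological and the choice n=20 are assumptions made for this example, and the time taken will depend on which awk implementation is installed.

    ```shell
    # pathological N: match a?^N a^N, written out longhand, against N copies of "a"
    pathological() {
        awk -v n="$1" 'BEGIN {
            for (i = 1; i <= n; i++) { opt = opt "a?"; lit = lit "a" }
            re = "^" opt lit "$"      # n=3 gives ^a?a?a?aaa$
            print ((lit ~ re) ? "match" : "no match")
        }'
    }

    pathological 20
    ```

    The string always matches, because every a? can match the empty string. What differs between matchers is how long they take to find that out as n grows.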

    And what is the lesson here? Next time someone tells you Awk is old-fashioned, cough politely and mention that at least in some aspects, certain supposedly-more-modern languages do not offer all the support provided by dear-"old"- Awk.


    categories: Errata,,Nov,2009,HermannP

    Errata: WHINY_USERS slows down Awk

    On Nov 30'09, Hermann Peifer found and fixed a bug in an older version of the test code at the end of http://awk.info/?tip/whinyUsers .

    With the older, incorrect, version it was reported that keeping all Awk arrays sorted had very little impact on performance.

    With Hermann's fix, we can now show that sorting slows down processing by 15% (at least, for the example explored on that page.)

    Thanks to Hermann for that correction.


    categories: Tips,Sept,2009,EdM

    The Secret WHINY_USERS Flag

    (Editor's note: On Nov 30'09, Hermann Peifer found and fixed a bug in an older version of the test code at the end of this file.)

    Writing in comp.lang.awk, Ed Morton reveals the secret WHINY_USERS flag.

    "Nag" asked:

      Hi,

      I am creating a file like...

      awk '{
       ....
       ...
       ..
       printf"%4s %4s\n",$1,$2 > "file1"
      
      }' input
      

      How can I sort file1 within awk code?

    Ed Morton writes:

      There's also the undocumented WHINY_USERS flag for GNU awk that allows for sorted processing of arrays:
      $ cat file
      2
      1
      4
      3
      $ gawk '{a[$0]}END{for (i in a) print i}' file
      4
      1
      2
      3
      $ WHINY_USERS=1 gawk '{a[$0]}END{for (i in a) print i}' file
      1
      2
      3
      4
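
    WHINY_USERS is undocumented and specific to GNU awk, so it may not be available everywhere. A more portable sketch, assuming only that a sort command is on the PATH, is to pipe the output through it from within the awk program (the function name sorted_keys is illustrative):

    ```shell
    # sorted_keys: print the distinct input lines in numeric order
    sorted_keys() {
        awk '{ a[$0] }
             END {
                 for (i in a) print i | "sort -n"   # awk starts sort once and feeds it
                 close("sort -n")                    # flush and wait for sort to finish
             }'
    }

    printf '2\n1\n4\n3\n' | sorted_keys
    ```

    The example prints 1, 2, 3, 4. For the original question, the same pipe can write to a file by using a command string such as "sort > file1".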
      

    Execution Cost

    Your editor coded up the following test for the runtime costs of WHINY_USERS. The following code is called twice (once with, and once without setting WHINY_USERS):

    runWhin() {
    WHINY_USERS=1 gawk -v M=1000000 --source '
            BEGIN { 
                    M = M ? M : 50
                    N = M
                    print N
                    while(N-- > 0) {
                            key = rand()" "rand()" "rand()" "rand()" "rand() 
                            A[key] = M - N
                    }
                    for(i in A)
                            N++
            }' 
    }
    runNoWhin() {
    gawk -v M=1000000 --source '
            BEGIN { 
                    M = M ? M : 50
                    N = M
                    print N
                    while(N-- > 0) {
                            key = rand()" "rand()" "rand()" "rand()" "rand() 
                            A[key] = M - N
                    }
                    for(i in A)
                            N++
            }' 
    }
    time runWhin
    time runNoWhin
    

    And the results? Sorting added about 15% to the runtime:

    % bash whiny.sh
    1000000
    
    real    0m18.897s
    user    0m15.826s
    sys     0m2.445s
    1000000
    
    real    0m16.345s
    user    0m13.469s
    sys     0m2.435s
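
    The 15% figure follows from the wall-clock ("real") times above. As a quick check, awk itself can do the arithmetic (the function name overhead and its variable names are made up for this sketch):

    ```shell
    # overhead SLOW FAST: percentage by which SLOW exceeds FAST
    overhead() {
        awk -v slow="$1" -v fast="$2" \
            'BEGIN { printf "%.1f%%\n", (slow / fast - 1) * 100 }'
    }

    overhead 18.897 16.345    # real times with and without WHINY_USERS
    ```

    This prints 15.6%.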
    

    categories: Tips,Aug,2009,EdM

    Print Ranges

    In comp.lang.awk, Ed Morton offers advice on how to print ranges of Awk records.

    Problem

    Suppose you are looking to extract a section of code from a text file based on two regular expressions.

    Say the file looks like this:

    newspaper
    magazing
    hiking
    hiking trails in the city
    muir hike
    black mountain hike
    summer meados hike
    end hiking
    phone
    cell
    skype

    and you want to extract

    hiking trails in the city
    muir hike
    black mountain hike
    summer meados hike
    
    The following range pattern won't work right:
    awk '/hiking/,/end hiking/{print}' myfile
    
    since that returns some spurious information.

    What to do?

    Solution

    Personally, I rarely if ever use

    /start/,/end/
    

    as I'm never immediately sure what it'd output for input such as:

    start
    a
    start
    b
    end
    c
    end
    

    and whenever you want to do something just slightly different with the selection you need to change the script a lot.

    Not being sure of the semantics is probably a catch 22 since I rarely use it but the benefit of using that syntax vs spelling it out:

    /start/{f=1} f; /end/{f=0}
    

    just doesn't really seem worthwhile, and then if you want to do something extra like test for some other condition over the block this:

    /start/{f=1} f&&cond; /end/{f=0}
    

    is about as brief as:

    /start/,/end/{if (cond) print}
    

    and if you want to exclude the start (or end) of the block you're printing then you just move the "f" test to the obvious place and you don't need to duplicate the condition:

    f; /start/{f=1} /end/{f=0}
    
    vs
    /start/,/end/{if (!/start/) print}
    

    and note the different semantics now. This:

    f; /start/{f=1} /end/{f=0}
    

    will exclude the line at the start of the block you're printing, whereas this:

    /start/,/end/{if (!/start/) print}
    

    will exclude that line plus every other occurrence of "start" within the block which is probably not what you'd want. To simply exclude only the first line of the block but stay with the /start/,/end/ approach you'd need to do something like:

    /start/,/end/{if (nr++) print; if (/end/) nr=0}
    

    (which is getting fairly obscure.)
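
    To make the difference concrete, here is a small sketch that runs two of the flag-based scripts on the start/a/start/b/end/c/end input shown earlier. The function names are illustrative.

    ```shell
    # inclusive: print each block including its start and end lines
    inclusive() { awk '/start/{f=1} f; /end/{f=0}'; }

    # no_start: same idea, but the line that opens a block is not printed
    no_start()  { awk 'f; /start/{f=1} /end/{f=0}'; }

    # sample: the seven-line test input from the text
    sample() { printf 'start\na\nstart\nb\nend\nc\nend\n'; }

    sample | inclusive
    echo ---
    sample | no_start
    ```

    The first script prints start, a, start, b, end. The second prints a, start, b, end: only the line that opened the block is dropped, while the inner "start" is kept.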


    categories: Databases,Tips,Jul,2009,VictorA

    Using Awk for Databases

    Contents

    Download

    Download all the following example code and support data files from LAWKER

    General Information

    Introduction

    This page contains a set of sample Awk scripts to manage different kinds of databases. In all cases, we'll use a text editor such as edit.exe to create and edit the data files, and Awk scripts will be used to query and manipulate the data.

    OK, so it's not a fancy GUI-based system, but this method is flexible and the scripts execute relatively quickly. Also, your data won't be locked in some company's proprietary binary file format. There is also the benefit of portability: If your PC can run DOS, you can also run these scripts on your PC. Awk is also available on Linux and on other operating systems.

    This page assumes that you are already familiar with database terms like 'record', 'field', and 'search keyword'.

    Introduction to Awk

    Awk is an interpreted programming language that is designed for managing and converting data files and generating reports from the data.

    Awk will automatically read an input file and parse it into records and fields, one record at a time. A typical Awk script will then manipulate the fields using predefined variables like $1 (the first field), $2 (the second field), etc.

    To use Awk, you create an Awk script, and then run it with the Awk program (gawk.exe in this case). Many Awk scripts are small, and the language lends itself to writing "one-time use" programs.

    Using the Scripts

    All the files on this page are available in the ZIP archive at this link. Feel free to reuse and customize them.

    You will need the GNU Awk program gawk.exe to be installed on your QuickPAD Pro. See the programming page for instructions on installing GNU Awk.

    Here is the general format of a gawk command line:

    	gawk -f SCRIPT DATAFILE
    
    where SCRIPT is the name of the file that contains the Awk script and DATAFILE is the name of the text file that contains the input data.

    That command line will not modify the input file and all the output will be directed to the screen.

    If a script creates a new data file (for example, a sort script), the command line will be:

    	gawk -f SCRIPT DATAFILE > NEWFILE
    
    where NEWFILE is the name of the new data file that will be created.

    If you use a particular script often and get tired of typing in a long command line, you can create a batch file to execute the long command line for you.

    We are currently limited to 64K files for our data. We can work around this restriction by using the chop utility program that is described in the software page.

    Index Card Databases

    Card File

    In this section we demonstrate some Awk scripts to manage an index card database. This type of database can be used for any type of simple text list, such as lists of books, music CDs, recipes, quotations, etc.

    Our information will be stored into 'cards'. Each card will have a 'title' and a 'body':

    	Title of Card
    	-------------------------
    	Free-formatted field of 
    	information about this 
    	particular card, but
    	without any blank lines.
    
    Let's take this information and store it in a text file. To keep things simple, the cards within the file are separated with a blank line, and the first line of each card will be the title.

    For example, let's create a sample card file called 'cards.txt' and use it to store a list of our goals.

    	Write a book and become famous
    	This is a long range
    	goal. I need a good book
    	idea first. And writing
    	skills.
    
    	Solve the problems of society
    	This might take
    	a little longer
    	than expected.
    
    	Take out the garbage
    	It's stinking up
    	the garage.
    

    Let's begin with an Awk script to print out the titles of all the cards in the file. Here is the script called 'titles':

    	# titles - Print the titles of all the cards in the
    	# index card file.
    
    	BEGIN { RS = ""; FS = "\n" }
    	        { print $1 }
    

    Here is a sample run:

    	[B:\] gawk -f titles cards.txt
    	Write a book and become famous
    	Solve the problems of society
    	Take out the garbage
    	[B:\]
    

    Another useful script is one that can be used for searching the data file, ignoring uppercase and lowercase distinctions. The following script called 'search' will display the cards that contain the keyword 'write'.

    	# search - Print the index card that contains a string
    
    	BEGIN   { RS = ""; FS = "\n"; IGNORECASE=1 }
    
    	/write/ { print $0, "\n" }
    

    Here is a sample run:

    	[B:\] gawk -f search cards.txt
    	Write a book and become famous
    	This is a long range
    	goal. I need a good book
    	idea first. And writing
    	skills.
    
    	[B:\]
    

    To search for other strings, edit the 'search' script and replace 'write' with another search keyword.
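
    Editing the script for every query gets tedious. One possible variation, sketched here as a POSIX shell function, passes the keyword in with -v instead. The names search_cards and key are hypothetical. It compares lower-cased copies with index(), so it should also run on awks that lack gawk's IGNORECASE.

    ```shell
    # search_cards KEYWORD: print every card on stdin that contains KEYWORD,
    # ignoring case; cards are paragraphs separated by blank lines
    search_cards() {
        awk -v key="$1" '
            BEGIN { RS = ""; key = tolower(key) }
            index(tolower($0), key) { print $0 "\n" }'
    }

    printf 'Write a book\nNeeds an idea.\n\nTake out the garbage\nIt stinks.\n' |
        search_cards WRITE
    ```

    The example prints only the first card. Note that index() looks for a plain substring, not a regular expression.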

    Sorting the cards based on the titles would also be a useful operation. Here is a script called 'sort' which reads the entire data file into an array and then uses the QuickSort algorithm to sort it:

    	# sort - Sort index card file by the card titles
    
    	BEGIN { RS = ""; FS = "\n" }
    
    	      { A[NR] = $0 } 
    
    	END   {
    		qsort(A, 1, NR)
    		for (i = 1; i <= NR; i++) {
    			print A[i]
    			if (i == NR) break
    			print ""
    		}
    	      }
    
    	# QuickSort
    	# Source: "The AWK Programming Language", by Aho, et.al., p.161
    	function qsort(A, left, right,   i, last) {
    		if (left >= right)
    			return
    		swap(A, left, left+int((right-left+1)*rand()))
    		last = left
    		for (i = left+1; i <= right; i++)
    			if (A[i] < A[left])
    				swap(A, ++last, i)
    		swap(A, left, last)
    		qsort(A, left, last-1)
    		qsort(A, last+1, right)
    	}
    	function swap(A, i, j,   t) {
    		t = A[i]; A[i] = A[j]; A[j] = t
    	}
    

    And here is a sample run:

    	[B:\] gawk -f sort cards.txt > new.txt
    	[B:\] rename cards.txt cards.bak
    	[B:\] rename new.txt cards.txt
    	[B:\] type cards.txt
    	Solve the problems of society
    	This might take
    	a little longer
    	than expected.
    
    	Take out the garbage
    	It's stinking up
    	the garage.
    
    	Write a book and become famous
    	This is a long range
    	goal. I need a good book
    	idea first. And writing
    	skills.
    	[B:\]
    
    Note that we renamed our old data file to cards.bak, instead of deleting the file. It's always good to keep backups of old databases.

    However, the 'sort' script had some trouble with large files because it reads in all the cards into an array in RAM. In my tests, the largest file I was able to sort was only about 100K.
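
    One way around the memory limit is to let an external sort program do the work. The sketch below is written as a POSIX shell function and rests on two assumptions: a sort command is available, and no card contains a tab character. The name sort_cards is hypothetical.

    ```shell
    # sort_cards: sort blank-line-separated cards on stdin by their text.
    # Stage 1 joins the lines of each card with tabs (one card per line),
    # stage 2 sorts those lines, stage 3 turns the tabs back into newlines
    # and puts a blank line between cards.
    sort_cards() {
        awk 'BEGIN { RS = ""; FS = "\n"; OFS = "\t" } { $1 = $1; print }' |
        sort |
        awk 'BEGIN { FS = "\t"; OFS = "\n" }
             NR > 1 { print "" }
                    { $1 = $1; print }'
    }

    printf 'Write a book\nLong range.\n\nSolve society\nTakes longer.\n' | sort_cards
    ```

    Because the title is the first thing on each flattened line, the cards come out ordered by title. Neither awk stage holds more than one card in memory at a time.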

    "Flash Cards" for Memorization

    Index cards can also be used for memorization. The title of the card can contain a question and the body of the card contains the answer that you want to memorize.

    Let's write a program that randomly chooses a card from a card file, displays its title, waits for the user to press the 'Enter' key, and then displays the body of that card.

    First, we need a text file which contains the questions and answers that we want to memorize. Let's name the file 'question.txt'. Note that the answer can contain multiple lines:

    	What is your name?
    	My name is
    	Sir Lancelot
    	of Camelot.
    
    	What is your quest?
    	To seek the
    	Holy Grail.
    
    	What is your favorite color?
    	Blue.
    

    Here is the Awk script called 'memorize'. It will read the data file into an array, randomly shuffle the array, and then it will loop through the array and display each question and answer.

    	# memorize - randomly display an index card title, ask user to
    	# press return, then display the corresponding body of the card
    
    	BEGIN { RS=""; FS="\n" }
    
    	      { A[NR] = $0 } 
    
    	END   {
    		RS="\n"; FS=" "
    		shuffle(A, NR)
    		for (i = 1; i <= NR; i++) {
    			print "\nQUESTION: ", substr(A[i], 1, index(A[i], "\n")-1)
    			printf "\nPress return for the answer: "
    			getline < "-"
    			print "\nANSWER: "
    			print substr(A[i], index(A[i], "\n")+1)
    			if (i == NR) break
    			printf "\nPress return to continue, or 'q' to quit: "
    			getline < "-"
    			if ($1 == "q") break
    		}
    	      }
    
    	# Shuffle the array
    	function shuffle(A, n,   i, j, t) {
    		srand()
    		# Moses/Oakford (Fisher-Yates) shuffle algorithm
    		for (i = n; i > 1; i--) {
    			j = int(i * rand()) + 1   # random index in 1..i
    			t = A[j]; A[j] = A[i]; A[i] = t
    		}
    	}
    

    Here is a sample run. The script goes through the shuffled cards until it has shown them all or the user enters 'q' to quit.

    	[B:\] gawk -f memorize question.txt
    
    	QUESTION:  What is your quest?
    
    	Press return for the answer:
    
    	ANSWER:
    	To seek the
    	Holy Grail.
    
    	Press return to continue, or 'q' to quit:
    
    	QUESTION:  What is your favorite color?
    
    	Press return for the answer:
    
    	ANSWER:
    	Blue.
    
    	Press return to continue, or 'q' to quit:
    
    	QUESTION:  What is your name?
    
    	Press return for the answer:
    
    	ANSWER:
    	My name is
    	Sir Lancelot
    	of Camelot.
    	[B:\] gawk -f memorize question.txt
    	
    	QUESTION:  What is your favorite color?
    	
    	Press return for the answer:
    
    	ANSWER:
    	Blue.
    
    	Press return to continue, or 'q' to quit: q
    	[B:\] 
    

    Custom Databases

    Address Book

    The databases above used a simple 'index card' analogy. That data model works fine for simple lists with free form data, but there are also cases where we need to manage records with specialized data fields.

    Let's create a data file and some scripts for an 'address book' database. Our data file will be a text file where every line is one record. Within a line of the file, the data will be separated into fields.

    When choosing a delimiter for our fields, we need to make sure that it won't appear accidentally within a field itself. For example, an address book has fields like name, company name, address, etc., and in this case, each of those fields can contain spaces within them (e.g. "ACME Mail Order Company"). Therefore, we can't use a space to separate the fields of the line.

    Instead, let's use commas to separate the fields, and we'll need a rule that commas cannot appear within a field.

    Here is a sample data file called 'address.txt':

    John Robinson,Koren Inc.,978 4th Ave,Boston,MA 01760,617-696-0987
    Phyllis Chapman,GVE Corp.,34 Sea Drive,Amesbury,MA 01881,781-879-0900
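
    Because the "no commas inside a field" rule is easy to break by accident, it may help to check the file before relying on it. The following is a minimal sketch, assuming a POSIX shell and that every record should have six fields, as in address.txt; the function name 'check_fields' is hypothetical:

```shell
# check_fields FILE [COUNT] - report records that do not have COUNT fields
check_fields() {
    awk -F, -v want="${2:-6}" '
        NF != want {
            printf "%s:%d: expected %d fields, found %d\n", FILENAME, FNR, want, NF
            bad = 1
        }
        END { exit (bad ? 1 : 0) }      # non-zero exit status if any record failed
    ' "$1"
}
```

    A record such as "Chapman, Phyllis,GVE Corp.,..." would then be reported as having seven fields instead of six.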
    
    Here is the script called 'labels' which will print all the data and format it like mailing labels:
    	# labels - Format the addresses for printing labels
    	# Source: blocklist.awk from "Sed & Awk", by Dale Dougherty, p.148
    
    	BEGIN { FS = "," }
    
    	{
    	        print ""        # blank line
    	        print $1        # name
    	        print $2        # company
    	        print $3        # street
    	        print $4, $5    # city, state zip
    	}
    
    This is the sample run:
    	[B:\] gawk -f labels address.txt
    	
    	John Robinson
    	Koren Inc.
    	978 4th Ave
    	Boston MA 01760
    	
    	Phyllis Chapman
    	GVE Corp.
    	34 Sea Drive
    	Amesbury MA 01881	
    	[B:\] 
    

    It may also be useful to extract just the phone numbers from our data file. Here is the script called 'phones' which will extract only the names and phone numbers from the data file:

    	# phones
    	# Source: phonelist.awk, from "Sed & Awk", by Dale Dougherty, p.148
    
    	BEGIN { FS="," }
    
    	{ print $1 ", " $6 }
    
    Here is a sample run:
    	[B:\] gawk -f phones address.txt
    	John Robinson, 617-696-0987
    	Phyllis Chapman, 781-879-0900
    	[B:\] 
    
    We'll also need a script to search our data file for a name. Here is a script called 'searchad' which will search for the string 'robinson':
    	# searchad - Return the record that matches a string
    
    	BEGIN { FS = ","; IGNORECASE=1 }
    
    	/robinson/ {
    	        print ""        # blank line
    	        print $1        # name
    	        print $2        # company
    	        print $3        # street
    	        print $4, $5    # city, state zip
    	}
    

    Here is a sample run:

    	[B:\] gawk -f searchad address.txt
    
    	John Robinson
    	Koren Inc.
    	978 4th Ave
    	Boston MA 01760
    	[B:\] 
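
    The search string in 'searchad' is hard-coded, and IGNORECASE is specific to gawk. As a rough sketch of one way around both limits, the search term can be passed in with -v and the comparison done on lowercased text with tolower(), which should work in any POSIX awk. The function name 'search_addr' and its argument order are assumptions made for this example:

```shell
# search_addr TERM FILE - print, as a label, every record containing TERM
search_addr() {
    awk -F, -v term="$1" '
        # index() does a plain substring match, so TERM is not a regular expression
        index(tolower($0), tolower(term)) {
            print ""        # blank line
            print $1        # name
            print $2        # company
            print $3        # street
            print $4, $5    # city, state zip
        }
    ' "$2"
}
```

    For example:

    	$ search_addr ROBINSON address.txt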
    

    Grading Program

    Awk can also be used for mathematical computation of fields. Let's demonstrate this with a data file called 'grades.txt' that contains grades of students.

    	Allen Mona 70 77 85 83 70 89
    	Baker John 85 92 78 94 88 91
    	Jones Andrea 89 90 85 94 90 95
    	Smith Jasper 84 88 80 92 84 82
    	Turner Dunce 64 80 60 60 61 62
    	Wells Ellis 90 98 89 96 96 92
    

    Here is a longer script that will take all the grades, average them equally, and compute the final average and the final grade for each student. At the end, it will compute some statistics about the entire class. Here is the script called 'grades'.

    	# grades -- average student grades and determine
    	# letter grade as well as class averages
    	# Source: "Sed & Awk", by Dale Dougherty, p.192
    
    	# set output field separator to tab.
    	BEGIN { OFS = "\t" }
    
    	# action applied to all input lines
    	{
    		# add up the grades
    		total = 0
    		for (i = 3; i <= NF; ++i)
    			total += $i
    		# calculate average
    		avg = total / (NF - 2)
    		# assign student's average to element of array
    		class_avg[NR] = avg
    		# determine letter grade
    		if (avg >= 90) grade="A"
    		else if (avg >= 80) grade="B"
    		else if (avg >= 70) grade="C"
    		else if (avg >= 60) grade="D"
    		else grade="F"
    		# increment counter for letter grade array
    		++class_grade[grade]
    		# print student name, average, and letter grade
    		print $1 " " $2, avg, grade
    	}
    
    	# print out class statistics
    	END  {
    		# calculate class average
    		for (x = 1; x <= NR; x++)
    			class_avg_total += class_avg[x]
    		class_average = class_avg_total / NR
    		# determine how many above/below average
    		for (x = 1; x <= NR; x++)
    			if (class_avg[x] >= class_average)
    				++above_average
    			else
    				++below_average
    		# print results
    		print ""
    		print "Class Average: ", class_average
    		print "At or Above Average: ", above_average
    		print "Below Average: ", below_average
    		# print number of students per letter grade
    		for (letter_grade in class_grade)
    			print letter_grade ":", class_grade[letter_grade]
    	}
    

    Here is a sample run:

    	[B:\] gawk -f grades grades.txt
    	Allen Mona      79      C
    	Baker John      88      B
    	Jones Andrea    90.5    A
    	Smith Jasper    85      B
    	Turner Dunce    64.5    D
    	Wells Ellis     93.5    A
    
    	Class Average:  83.4167
    	At or Above Average:    4
    	Below Average:  2
    	A:      2
    	B:      2
    	C:      1
    	D:      1
    	[B:\]
    

    Another useful script is the following program that computes a histogram of the grades. It is hardcoded to only read the third column ($3), but you can edit it and change it to read any of the columns in the input file. Here is the script called 'histo':

    	# histogram
    	# Source: "The AWK Programming Language", by Aho, et.al., p.70
    
    	     { x[int($3/10)]++ } # use the third column of input data
    
    	END  {
    	        for (i = 0; i < 10; i++)
    	                printf(" %2d - %2d: %3d %s\n",
    	                       10*i, 10*i+9, x[i], rep(x[i],"*"))
    	        printf("100:      %3d %s\n", x[10], rep(x[10],"*"))
    	     }
    
    	function rep(n, s,   t) {   # return string of n s's
    	        while (n--> 0)
    	                t = t s
    	        return t
    	}
    
    And here is the sample run:
    	[B:\] gawk -f histo grades.txt
    	  0 -  9:   0
    	 10 - 19:   0
    	 20 - 29:   0
    	 30 - 39:   0
    	 40 - 49:   0
    	 50 - 59:   0
    	 60 - 69:   1 *
    	 70 - 79:   1 *
    	 80 - 89:   3 ***
    	 90 - 99:   1 *
    	100:        0	
    	[B:\]
    

    The output shows that there were six grades, and that half of them were in the 80-89 range.
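
    Rather than editing the script each time, the column number can be handed to awk from the command line. This is only a sketch of that idea, assuming a POSIX shell; the wrapper name 'histo_col' and the variable 'col' are not part of the original 'histo' script:

```shell
# histo_col FILE [COLUMN] - histogram of the given column (default: column 3)
histo_col() {
    awk -v col="${2:-3}" '
        { x[int($col / 10)]++ }         # $col is the field whose number is in col
        END {
            for (i = 0; i < 10; i++)
                printf(" %2d - %2d: %3d %s\n", 10*i, 10*i + 9, x[i], rep(x[i], "*"))
            printf("100:      %3d %s\n", x[10], rep(x[10], "*"))
        }
        function rep(n, s,   t) {       # return a string of n copies of s
            while (n-- > 0)
                t = t s
            return t
        }
    ' "$1"
}
```

    For example, to chart the fourth column instead of the third:

    	$ histo_col grades.txt 4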

    Checkbook Program

    This program takes a data file which lists your checkbook entries and your deposits, and calculates the totals.

    Here is what a sample input file called 'checks.txt' looks like:

    	check	1021
    	to	Champagne Unlimited
    	amount	123.10
    	date	1/1/87
    
    	deposit	
    	amount	500.00
    	date	1/1/87
    
    	check	1022
    	date	1/2/87
    	amount	45.10
    	to	Getwell Drug Store
    	tax	medical
    
    	check	1023
    	amount	125.00
    	to	International Travel
    	date	1/3/87
    
    	check	1024
    	amount	50.00
    	to	Carnegie Hall
    	date	1/3/87
    	tax	charitable contribution
    
    	check	1025
    	to	American Express
    	amount	75.75
    	date	1/5/87
    

    Here is the script called 'check' which will calculate the totals:

    	# check - print total deposits and checks
    	# Source: "The AWK Programming Language", by Aho, et.al., p.87
    
    	BEGIN { RS=""; FS="\n" }
    
    	/(^|\n)deposit/ { deposits += field("amount"); next }
    	/(^|\n)check/   { checks += field("amount"); next }
    
    	END   { printf("Deposits: $%.2f, Checks: $%.2f\n", 
    		       deposits, checks)
    	      }
    
    	function field(name,   i, f) {
    		for (i = 1; i <= NF; i++) {
    			split($i, f, "\t")
    			if (f[1] == name)
    				return f[2]
    		}
    		printf("Error: no field %s in record\n%s\n", name, $0)
    	}
    

    And this is a sample run:

    	[B:\] gawk -f check checks.txt
    	Deposits: $500.00, Checks: $418.95
    	[B:\]
    

    Importing and Exporting Data

    Importing Data for use by Awk

    Awk works well with data files that are stored in text files. Awk assumes that the data file is organized into records, within each record the data is divided into fields, and there are unique characters in the file that are used as the field separators and record separators.

    By default, Awk assumes that newline characters are the record separators and whitespace characters (spaces and tabs) are the field separators. It is also possible to redefine the field separators to other characters, like a comma or a tab character, which means that Awk can process the commonly used "comma separated" and "tab separated" format for data files.

    But note that if a file uses newline characters as record separators, it means that a newline cannot appear within a field. For example, a data file with one record per line cannot contain a text field (e.g. a "notes" field) that contains free form text with newline characters within it. That would confuse Awk unless we added special code to handle that notes field.

    The same restrictions apply to the field separators. If a file is defined to be comma separated, it means that no field is allowed to contain comma characters within it (e.g. a Name field that contains "Alvarado, Victor") because Awk would parse that as two fields, not one.

    That is why tab separated files tend to be used more often. That way, the fields are allowed to contain spaces and commas.
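
    As a small illustration (a sketch, assuming a POSIX shell whose printf turns \t into a tab; the wrapper name 'tsv_demo' is invented), setting FS to a single tab lets a field keep its spaces and commas:

```shell
# tsv_demo - show that tab-separated fields may contain spaces and commas
tsv_demo() {
    printf 'Alvarado, Victor\tACME Mail Order Company\n' |
    awk 'BEGIN { FS = "\t" }
         { printf "%d fields: [%s] [%s]\n", NF, $1, $2 }'
}
```

    Here awk sees two fields, "Alvarado, Victor" and "ACME Mail Order Company", even though both contain spaces and the first contains a comma.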

    Another way to format data for use by Awk is to use the "multiline" format, which is what we used for our index card databases above. Awk will treat each line as a field, and a blank line is the record separator.

    Exporting Data to Microsoft Excel

    To export data to Excel, all we need to do is to convert the data file into tab-delimited format, and store it in a text file with a *.xls extension. When that file is opened in Microsoft Windows, Excel will open it automatically as if it were a spreadsheet.

    As an example, let's export our grades.txt file to Excel. Here is our 'grades.txt' file:

    	Allen Mona 70 77 85 83 70 89
    	Baker John 85 92 78 94 88 91
    	Jones Andrea 89 90 85 94 90 95
    	Smith Jasper 84 88 80 92 84 82
    	Turner Dunce 64 80 60 60 61 62
    	Wells Ellis 90 98 89 96 96 92
    

    The file uses spaces as the field separator, so we'll need a script that will convert the field separators into tabs. Here is a script called 'conv2xls':

    	# conv2xls - Convert a data file into tab-separated format
    
    	BEGIN {
    	        FS=" "     # input field separator is whitespace (the default)
    	        OFS="\t"   # output field separator is a tab
    	      }
    
    	      { print $1, $2, $3, $4, $5, $6, $7, $8 }
    

    And here is the sample run, where we store the tab-delimited output into a text file called grades.xls:

    	[B:\] gawk -f conv2xls grades.txt > grades.xls
    	[B:\]
    
    Here are the contents of the 'grades.xls' text file:
    	Allen   Mona    70      77      85      83      70      89
    	Baker   John    85      92      78      94      88      91
    	Jones   Andrea  89      90      85      94      90      95
    	Smith   Jasper  84      88      80      92      84      82
    	Turner  Dunce   64      80      60      60      61      62
    	Wells   Ellis   90      98      89      96      96      92
    

    We can then copy the grades.xls text file to a Windows PC, double-click on it, and Excel will open it as if it were a spreadsheet.

    You can then do a "Save As" in Excel to save it as the regular Excel binary format.
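
    The conversion script above lists eight fields by name, so it only fits files with exactly eight columns. A more general variant, offered as a sketch (the function name 'to_tsv' is an assumption, and a POSIX shell is assumed), relies on the fact that assigning to any field makes awk rebuild the record with OFS between the fields:

```shell
# to_tsv [FILE...] - convert whitespace-separated fields to tab-separated
to_tsv() {
    awk 'BEGIN { OFS = "\t" }
         { $1 = $1; print }   # assigning to a field rebuilds $0 using OFS
    ' "$@"
}
```

    For example:

    	$ to_tsv grades.txt > grades.xls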

    Exporting Data to a Web Page

    To export our data to a web page, we will need a script that will input our data file and generate HTML.

    Let's start with our 'grades.txt' data file:

    	Allen Mona 70 77 85 83 70 89
    	Baker John 85 92 78 94 88 91
    	Jones Andrea 89 90 85 94 90 95
    	Smith Jasper 84 88 80 92 84 82
    	Turner Dunce 64 80 60 60 61 62
    	Wells Ellis 90 98 89 96 96 92
    

    Here is a script called 'html' that will do the conversion. Note that the data will appear as rows of a table in HTML.

    	# html - Convert a data file into an HTML web page with a table
    	
    	BEGIN {
    		print "<HTML><HEAD><TITLE>Grades Database</TITLE></HEAD>"
    		print "<BODY BGCOLOR=\"#ffffff\">"
    		print "<CENTER><H1>Grades Database</H1></CENTER>"
    		print "<HR noshade size=4 width=75%>"
    		print "<P><CENTER><TABLE BORDER>"
    		printf "<TR><TH>Last<TH>First"
    		print "<TH>G1<TH>G2<TH>G3<TH>G4<TH>G5<TH>G6"
    	      }
    	
    	      { # Print the data in table rows
    		printf "<TR><TD>%s<TD>%s", $1, $2
    		printf "<TD>%s<TD>%s<TD>%s", $3, $4, $5
    		printf "<TD>%s<TD>%s<TD>%s\n", $6, $7, $8
    	      }
    	
    	END   {
    		print "</TABLE></CENTER><P>"
    		print "<HR noshade size=4 width=75%>"
    		print "</BODY></HTML>"
    	      }
    

    Here is the sample run. The output will be placed in a file called 'grades.htm'.

    	[B:\] gawk -f html grades.txt > grades.htm
    	[B:\]
    

    This is what the resulting 'grades.htm' file looks like:

    	<HTML><HEAD><TITLE>Grades Database</TITLE></HEAD>
    	<BODY BGCOLOR="#ffffff">
    	<CENTER><H1>Grades Database</H1></CENTER>
    	<HR noshade size=4 width=75%>
    	<P><CENTER><TABLE BORDER>
    	<TR><TH>Last<TH>First<TH>G1<TH>G2<TH>G3<TH>G4<TH>G5<TH>G6
    	<TR><TD>Allen<TD>Mona<TD>70<TD>77<TD>85<TD>83<TD>70<TD>89
    	<TR><TD>Baker<TD>John<TD>85<TD>92<TD>78<TD>94<TD>88<TD>91
    	<TR><TD>Jones<TD>Andrea<TD>89<TD>90<TD>85<TD>94<TD>90<TD>95
    	<TR><TD>Smith<TD>Jasper<TD>84<TD>88<TD>80<TD>92<TD>84<TD>82
    	<TR><TD>Turner<TD>Dunce<TD>64<TD>80<TD>60<TD>60<TD>61<TD>62
    	<TR><TD>Wells<TD>Ellis<TD>90<TD>98<TD>89<TD>96<TD>96<TD>92
    	</TABLE></CENTER><P>
    	<HR noshade size=4 width=75%>
    	</BODY></HTML>
    

    And here is a link to the grades.htm file so you can see what the web page looks like in your browser.
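
    The 'html' script prints each field as it is, so a field containing <, > or & would be read by the browser as markup. As a hedged sketch of one way to guard against that, the function below escapes those three characters and loops over however many fields a record has. It prints only the table rows, not the header and footer, and the names 'to_html_rows' and 'esc' are made up for this example:

```shell
# to_html_rows [FILE...] - print one HTML table row per input record
to_html_rows() {
    awk '
        function esc(s) {               # escape the characters special to HTML
            gsub(/&/, "\\&amp;", s)     # must come first; \\& is a literal ampersand
            gsub(/</, "\\&lt;", s)
            gsub(/>/, "\\&gt;", s)
            return s
        }
        {
            row = "<TR>"
            for (i = 1; i <= NF; i++)
                row = row "<TD>" esc($i)
            print row
        }
    ' "$@"
}
```

    For the grades.txt file it prints the same rows as before, since those fields contain none of the three characters.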

    Exporting Data to a Palm Pilot

    First, we will need to install a database program on the Palm. There are several database programs to choose from, but let's use the freeware database program called Pilot-DB (available here from PalmGear).

    Next, we will need the freeware DOS tools that come with Pilot-DB to help us create the PDB data file. The DB-tools package is available here at PalmGear. You can download it and install it on your Windows PC. Those are DOS tools, but they were compiled to run in DOS under Windows, so we can't run them on the QuickPAD Pro. (Note: DB-tools is an open source project, so the source code is available.)

    The DB-tools package contains a program called 'csv2pdb.exe'. It will do the conversion into a Palm PDB file.

    Let's use the 'grades.txt' data file as an example:

    	Allen Mona 70 77 85 83 70 89
    	Baker John 85 92 78 94 88 91
    	Jones Andrea 89 90 85 94 90 95
    	Smith Jasper 84 88 80 92 84 82
    	Turner Dunce 64 80 60 60 61 62
    	Wells Ellis 90 98 89 96 96 92
    

    Before we can run the 'csv2pdb.exe' program we first need to convert our data into "csv" (comma separated values) format. We can do that with the following awk script called 'conv2csv':

    	# conv2csv - Convert a data file into comma-separated format
    
    	BEGIN {
    	        FS=" "     # input field separator is whitespace (the default)
    	        OFS=","    # output field separator is a comma
    	      }
    
    	      { print $1, $2, $3, $4, $5, $6, $7, $8 }
    

    Here is the command line to create the comma-delimited data file, which we will call 'grades.csv':

    	[B:\] gawk -f conv2csv grades.txt > grades.csv
    	[B:\]
    

    This is what the 'grades.csv' file looks like:

    	Allen,Mona,70,77,85,83,70,89
    	Baker,John,85,92,78,94,88,91
    	Jones,Andrea,89,90,85,94,90,95
    	Smith,Jasper,84,88,80,92,84,82
    	Turner,Dunce,64,80,60,60,61,62
    	Wells,Ellis,90,98,89,96,96,92
    

    Next, we need to create an "info" file which will describe the format of our data. The 'csv2pdb.exe' program will need this information for the conversion to Palm format.

    The info file will give our database a title and describe the fields of each record. In grades.csv, the first field is the student's last name, the second field is the student's first name, and the other six fields are the grades. Here is the resulting info file called 'grades.ifo':

    	title "GradesDB"
    	field "Last" string 38
    	field "First" string 38
    	field "G1" integer 14
    	field "G2" integer 14
    	field "G3" integer 14
    	field "G4" integer 14
    	field "G5" integer 14
    	field "G6" integer 14
    	option backup on
    

    The numbers at the end of the lines are the field widths in pixels; we can make a guess for the field widths, and then fine-tune them on the Palm Pilot. The last line will set the backup bit on the PDB file so that it will be backed up at every hotsync.

    From this point on, the rest of the steps must be done on your Windows PC.

    On Your Windows PC

    Now we create the PDB file on our PC with this command line:

    	C:\> csv2pdb -i grades.ifo grades.csv grades.pdb
    	C:\>

    It will create a new file called 'grades.pdb' in the current directory. This is the Palm database file.

    The last step is to install the PDB file to the Palm Pilot: in the Windows Explorer double-click on the PDB file and then hotsync your Palm Pilot as usual.

    Here is a screen shot of the Palm Pilot running Pilot-DB with our grades database. (Make sure you have selected the blank unnamed view from the menu at the top-right corner of the screen.)

    As you can see, storing data as text files gives you a lot of flexibility in manipulating the data and exporting it to other formats.

    Author

    Victor Alvarado


    categories: Tips,Jul,2009,Admin

    Random Numbers in Gawk

    (Summarized and extended from a recent discussion at comp.lang.awk.)

    Background

    A standard idiom in Gawk is to reset the random number generator in a BEGIN block.

    BEGIN { srand() }
    

    Sadly, when called with no arguments, this "reseeding" uses time-in-seconds. So if the same "random" task runs multiple times in the same second, it will get the same random number seed.

    Houston, We Have a Problem

    "Ben" writes:

    I have a Gawk script that puts random comments into a file. It is run 3 times in a row in quick succession. I found that seeding the random number generator using gawk did not work because all 3 times it was run was done within the same second (and it uses the time).

    I was wondering if anyone could give me some suggestions as to what can be done to get around this problem.

    Solution #1: Persistent Memory

    Kenny McCormack writes:

    When last I ran into this problem, what I did was to save the last value returned by rand() to a file, then on the next run, read that in and use that value as the arg to srand(). Worked well.

    (Editor's comment: Kenny's solution does work well but incurs the cost of maintaining and reading/writing that "last value" file.)
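
    Kenny did not post code, so the following is only a sketch of how his idea might look. The seed file name, the function name 'seeded_rand', and the choice to save a whole number scaled up from rand() (since an implementation may truncate the seed to an integer) are all assumptions of this example:

```shell
# seeded_rand [SEEDFILE] - print one random number, carrying the seed between runs
seeded_rand() {
    awk -v seedfile="${1:-seed.txt}" '
        BEGIN {
            # Use the saved value as the seed if there is one, else the clock.
            if ((getline seed < seedfile) > 0)
                srand(seed)
            else
                srand()
            close(seedfile)

            print rand()

            # Save a whole number drawn from this run as the seed for the next run.
            print int(rand() * 1000000000) > seedfile
            close(seedfile)
        }
    '
}
```

    Because each run seeds itself from the previous run's output rather than from the clock, three runs within the same second no longer share a seed.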

    Solution #2: Use Bash

    Tim Menzies writes:

    How about setting the seed using the BASH $RANDOM variable:

    gawk -v Seed=$RANDOM --source 'BEGIN { srand(Seed ? Seed : 1) }' 
    

    If referenced multiple times in a second, it always generates a different number.

    In the above usage, if we have a seed, use it. Else, no seed so start all "random" at the same place. If you prefer to use the default "seed from time-in-seconds" then use:

    BEGIN { if (Seed) { srand(Seed) } else { srand() } }
    

    (Editor's comment: Tim's solution incurs the overhead of additional command-line syntax. However, it does allow the process calling Gawk to control the seed. This is important when trying to, say, debug code by recreating the sequence of random numbers that lead to the bug.)

    Solution #3: Query the OS

    Thomas Weidenfeller writes:

    Is that good enough (random enough) for your task?

    BEGIN {
            "od -tu4 -N4 -A n /dev/random" | getline
            srand(0+$0)
    }
    

    (Editor's comment: Nice. Thomas' solution reminds us that "Gawk" can access a whole host of operating system facilities.)

    Solution #4: Use the Process Id

    Aharon Robbins writes:

    You could do something like add PROCINFO["pid"] to the value of the time, or use that as the seed.

    $ gawk 'BEGIN { srand(systime() + PROCINFO["pid"]); print rand() }'
    0.405889
    $ gawk 'BEGIN { srand(systime() + PROCINFO["pid"]); print rand() }'
    0.671906
    

    (Editor's comment: Aharon's solution is the fastest of those shown here. For example, on Mac OS X, his solution takes 6ms to run:

    $ time gawk 'BEGIN { srand(systime() + PROCINFO["pid"]) }'
    
    real    0m0.006s
    user    0m0.002s
    sys     0m0.004s
    

    while Thomas' solution is somewhat slower:

    $ time gawk 'BEGIN { "od -tu4 -N4 -A n /dev/random" | getline; srand($0+0) }'
    
    real    0m0.039s
    user    0m0.004s
    sys     0m0.034s
    

    Note that while Aharon's solution is the fastest, it does not let some master process set the seed for the Gawk process (e.g. as in Tim's approach).)

    Conclusion

    If you want raw speed, use Aharon's approach.

    If you want seed control, see Tim's approach.


    categories: Funky,Tips,Mar,2009,ArnoldR

    Super-For Loops

    In this exchange from comp.lang.awk, Jason Quinn discusses his super-for loop trick. Arnold Robbins then chimes in to say that, with indirect functions, super-for loops could become a generic tool.

    Jason Quinn writes:

    • Frequently when programming, situations arise for me where I need a nested number of for-loops. Such a case arose for me again just recently while I was inventing a dice game. Anyway, here is the implementation that I ended up using to create a "super-for" loop in AWK (a little trickier than C).
    • This simple example merely lists all possible outcomes of rolling 4, 6, 8, 10, 12, and 20 sided dice at once. A super-for loop requires an array to specify the loop indices... here we have 6 dice and the number of sides determines the indices. The code is easily modified for an arbitrary number of dice (which is the whole point).
    • I identify three parts of a super-for which I called the prologue, body, and epilog. Under most circumstances, I think the main body only would get used.
    • For example:
      #shows an example of a superfor loop
      BEGIN {
      	#define loop maximums
      	loopmax[1]=4
      	loopmax[2]=6
      	loopmax[3]=8
      	loopmax[4]=10
      	loopmax[5]=12
      	loopmax[6]=20
      	#call the loop
      	superfor(6)
      }
      function superfor(loopdepth, zz) { # zz is a local variable
              currloopnum++
      
              #start of prologue
              #end of prologue
      
              for(loopcounter[currloopnum]=1; 
                  loopcounter[currloopnum]<=loopmax[currloopnum]; 
                  loopcounter[currloopnum]++) {
                      if ( loopdepth==1 ) {
                              #start of superfor body
                              for (zz=1;zz<=currloopnum;zz++) {
                                      printf loopcounter[zz] FS
                                      }
                              print ""
                              #end of superfor body
                              }
                      else if ( loopdepth>1 )
                              superfor(loopdepth-1)
                      }
      
              #start of epilog
              #end of epilog
      
              loopdepth++ ; currloopnum--
              }
      

    Arnold Robbins replies:

    • I think this would make a great application for indirect function calls. For example:
      function superfor(loopdepth, prologue, body, epilogue,     zz)
      {
              currloopnum++
      
              @prologue()
      
              for(loopcounter[currloopnum]=1; 
                  loopcounter[currloopnum]<=loopmax [currloopnum]; 
                  loopcounter[currloopnum]++) {
                      if ( loopdepth==1 ) {
                              @body()
                      }
                      else if ( loopdepth>1 )
                              superfor(loopdepth-1, prologue, 
                                       body, epilogue)
                      }
      
              @epilogue()
      
              loopdepth++ ; currloopnum--
      }
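
    To make the idea concrete, here is a small sketch of how Arnold's version might be called. It assumes gawk 4.0 or later, since indirect function calls are a gawk extension; the helper functions 'nothing' and 'show', the two small dice, and the shell wrapper 'dice_demo' are all invented for this example:

```shell
# dice_demo - list every outcome of rolling a 2-sided and a 3-sided die (needs gawk)
dice_demo() {
    gawk '
        function nothing() { }             # placeholder prologue and epilogue
        function show(   zz) {             # body: print the current loop counters
            for (zz = 1; zz <= currloopnum; zz++)
                printf "%s%s", loopcounter[zz], (zz < currloopnum ? " " : "\n")
        }
        function superfor(loopdepth, prologue, body, epilogue) {
            currloopnum++
            @prologue()
            for (loopcounter[currloopnum] = 1;
                 loopcounter[currloopnum] <= loopmax[currloopnum];
                 loopcounter[currloopnum]++) {
                if (loopdepth == 1)
                    @body()
                else if (loopdepth > 1)
                    superfor(loopdepth - 1, prologue, body, epilogue)
            }
            @epilogue()
            currloopnum--
        }
        BEGIN {
            loopmax[1] = 2
            loopmax[2] = 3
            superfor(2, "nothing", "show", "nothing")
        }
    '
}
```

    The function names are passed as strings, so a different body can be plugged in without touching superfor itself.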
      

    categories: Tips,Aug,2009,JanisP

    Using Field Names to Reference Columns

    In comp.lang.awk, Janis Papanagnou comments on how Awk can read CSV files whose first line names the columns.

    Problem

    Suppose you have a CSV file with headers for field names. Gawk can use those headers as field names, which makes the code more intuitive and easier to work with. Given that awk is expected to work on tabular data, this seems to be a good alternative to bare field numbers.

    Solution

    Try this shell script:
    #!/bin/sh
    awk -F, -v cols="${1:?}" '
       BEGIN {
         n=split(cols,col)
         for (i=1; i<=n; i++) s[col[i]]=i
       }
       NR==1 {
         for (f=1; f<=NF; f++)
           if ($f in s) c[s[$f]]=f
         next
       }
       { sep=""
         for (f=1; f<=n; f++) {
           printf("%s%s",sep,$c[f])
           sep=FS
         }
         print ""
       }
    '
    

    This script can be called with an arbitrary list of column names as defined in the first line of your data file and separated by the same field separator as your data.

    For example, suppose the above code is in bycolname.sh and we have data, say in a file called data.csv (a name chosen here for illustration), that looks like this:

    hello,world,region_name,foo,bar,xyz,dummy
    11111,22222,aspac,77777,8888888,xyz,zzzzz
    21111,22222,ASPAC,77777,8888888,xyz,zzzzz
    31111,22222,ASPAC,77777,8888888,XYZ,zzzzz
    41111,22222,aspac,77777,8888888,XYZ,zzzzz
    

    Now, calling this command (the script reads its data from standard input)...

    sh bycolname.sh world,hello < data.csv
    
    ... would produce:
    22222,11111
    22222,21111
    22222,31111
    22222,41111
    

    Bugs

    A nonexistent column name expands to $0 (the whole record), which may be surprising if there's an unnoticed typo in your field list.
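
    One way to avoid that surprise, shown here only as a sketch, is to check after reading the header line that every requested name was found, and to stop with an error otherwise. The function name 'bycolname_strict' and the wording of the message are assumptions; the rest follows the script above:

```shell
# bycolname_strict COLS - like bycolname.sh, but reject unknown column names
bycolname_strict() {
    awk -F, -v cols="${1:?}" '
        BEGIN {
            n = split(cols, col)
            for (i = 1; i <= n; i++) s[col[i]] = i
        }
        NR == 1 {
            for (f = 1; f <= NF; f++)
                if ($f in s) c[s[$f]] = f
            # every requested name must have been found in the header line
            for (i = 1; i <= n; i++)
                if (!(i in c)) {
                    print "unknown column: " col[i] > "/dev/stderr"
                    bad = 1
                }
            if (bad) exit 1
            next
        }
        {
            sep = ""
            for (f = 1; f <= n; f++) {
                printf "%s%s", sep, $c[f]
                sep = FS
            }
            print ""
        }
    '
}
```

    With this variant a misspelled name such as "wrold" produces an error message and a non-zero exit status instead of whole records.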


    categories: Getline,Tips,Jan,2009,EdM

    Use (and Abuse) of Getline

    by Ed Morton (and friends)

    The following summary, composed to address the recurring issue of getline (mis)use, was based primarily on information from the book "Effective Awk Programming", Third Edition, by Arnold Robbins (http://www.oreilly.com/catalog/awkprog3), with review and additional input from many of the comp.lang.awk regulars, including

    • Steve Calfee,
    • Martin Cohen,
    • Manuel Collado,
    • Jürgen Kahrs,
    • Kenny McCormack,
    • Janis Papanagnou,
    • Anton Treuenfels,
    • Thomas Weidenfeller,
    • John LaBadie and
    • Edward Rosten.

    Getline

    getline is fine when used correctly (see below for a list of those cases), but it's best avoided by default because:

    1. It allows people to stick to their preconceived ideas of how to program rather than learning the easier way that awk was designed to read input. It's like C programmers continuing to do procedural programming in C++ rather than learning the new paradigm and the supporting language constructs.
    2. It has many insidious caveats that come back to bite you either immediately or in future. The succeeding discussion captures some of those and explains when getline IS appropriate.

    As "Effective Awk Programming", Third Edition, by Arnold Robbins (http://www.oreilly.com/catalog/awkprog3), the source for much of this discussion, says:

      "The getline command is used in several different ways and should not be used by beginners. ... come back and study the getline command after you have reviewed the rest ... and have a good knowledge of how awk works."

    Variants

    The following summarises the eight variants of getline applications, listing which variables are set by each one:

    Variant                 Variables Set 
    -------                 -------------
    getline                 $0, ${1...NF}, NF, FNR, NR, FILENAME 
    getline var             var, FNR, NR, FILENAME 
    getline < file          $0, ${1...NF}, NF 
    getline var < file      var 
    command | getline       $0, ${1...NF}, NF 
    command | getline var   var 
    command |& getline      $0, ${1...NF}, NF 
    command |& getline var  var 
    

    The "command |& ..." variants are GNU awk (gawk) extensions. gawk also populates the ERRNO builtin variable if getline fails.

    Although calling getline is very rarely the right approach (see below), if you need to do it, the safest ways to invoke getline are:

    if/while ( (getline var < file) > 0) 
    if/while ( (command | getline var) > 0) 
    if/while ( (command |& getline var) > 0) 
    

    since those do not affect any of the builtin variables and they allow you to correctly test for getline succeeding or failing. If you need the input record split into separate fields, just call "split()" to do that.
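
    A minimal sketch of the first of those forms follows, assuming a POSIX shell and a comma-separated input file; the function name 'first_fields' and the variable names are chosen for this example only. It reads a file line by line into a variable, splits each line by hand, and tells "end of file" (0) apart from "could not read" (-1):

```shell
# first_fields FILE - print the first comma-separated field of every line in FILE
first_fields() {
    awk -v file="$1" '
        BEGIN {
            # getline var < file sets only "line"; $0, NF and NR are left alone
            while ((status = (getline line < file)) > 0) {
                split(line, part, ",")      # split by hand instead of using $1..$NF
                print part[1]
            }
            if (status < 0) {               # -1 means the file could not be read
                print "cannot read " file > "/dev/stderr"
                exit 1
            }
            close(file)
        }
    '
}
```

    Testing only for a non-zero result, as in "while (getline line < file)", would loop forever on an unreadable file, because -1 counts as true.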

    Caveats

    Users of getline have to be aware of the following non-obvious effects of using it:

    1. Normally FILENAME is not set within a BEGIN section, but a non-redirected call to getline will set it.
    2. Calling "getline < FILENAME" is NOT the same as calling "getline". The second form will read the next record from FILENAME while the first form opens FILENAME a second time as a separate input stream and so starts again from its first record.
    3. Calling getline without a var to be set will update $0, the fields, and NF so they will have a different value for subsequent processing than they had for prior processing in the same condition/action block.
    4. Many of the getline variants above set some but not all of the builtin variables, so you need to be very careful that it's setting the ones you need/expect it to.
    5. According to POSIX, `getline < expression' is ambiguous if expression contains unparenthesized operators other than `$'; for example, `getline < dir "/" file' is ambiguous because the concatenation operator is not parenthesized. You should write it as `getline < (dir "/" file)' if you want your program to be portable to other awk implementations.
    6. In POSIX-compliant awks (e.g. gawk --posix) a failure of getline (e.g. trying to read from a non-readable file) will be fatal to the program, otherwise it won't.
    7. Unredirected getline can defeat the simple and usual rule to handle input file transitions:
      FNR==1 { ... start of file actions ... }
      
      File transitions can occur at getlines, so FNR==1 needs to also be checked after each unredirected (from a specific file name) getline. e.g. if you want to print the first line of each of these files:
      $ cat file1 
      a 
      b 
      $ cat file2 
      c 
      d 
      
      you'd normally do:
      $ awk 'FNR==1{print}' file1 file2 
      a 
      c 
      
      but if a "getline" snuck in, it could have the unexpected consequence of skipping the test for FNR==1 and so not printing the first line of the second file.
      $ awk 'FNR==1{print}/b/{getline}' file1 file2 
      a 
      
    8. Using getline in the BEGIN section to skip lines makes your program difficult to apply to multiple files. e.g. with data like...
      some header line 
      ---------------- 
      data line 1 
      data line 2 
      ... 
      data line 10000 
      
      you may consider using...
      BEGIN { getline header; getline } 
      { whatever_using_header_and_data_on_the_line() } 
      
      instead of...
      FNR == 1 { header = $0 } 
      FNR < 3 { next } 
      { whatever_using_header_and_data_on_the_line() } 
      
      but the getline version would not work on multiple files since the BEGIN section would only be executed once, before the first file is processed, whereas the non-getline version would work as-is. This is one example of the common case where the getline command itself isn't directly causing the problem, but the type of design you can end up with if you select a getline approach is not ideal.
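
    To make the multiple-file point concrete, here is a small sketch of the non-getline design wrapped in a shell function (show_with_header is a hypothetical name). It assumes every input file starts with a header line followed by a dashed line, as in the sample data above:

```shell
# show_with_header FILE...: hypothetical helper, POSIX awk only
show_with_header() {
    awk '
    FNR == 1 { header = $0 }    # first line of each file: remember its header
    FNR < 3  { next }           # skip the header and the dashed line
    { print header ": " $0 }    # every data line, tagged with its own header
    ' "$@"
}
```

    Because FNR restarts at 1 for each file, the header is picked up again at every file transition without any extra code.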

    Applications

    getline is an appropriate solution for the following:

    1. Reading from a pipe, e.g.:
      command = "ls" 
      while ( (command | getline var) > 0) { 
          print var 
      } 
      close(command) 
      
    2. Reading from a coprocess, e.g.:
      command = "LC_ALL=C sort" 
      n = split("abcdefghijklmnopqrstuvwxyz", a, "") 
      for (i = n; i > 0; i--) 
           print a[i] |& command 
      close(command, "to") 
      while ((command |& getline var) > 0) 
          print "got", var 
      close(command) 
      
    3. In the BEGIN section, reading some initial data that's referenced during processing multiple subsequent input files, e.g.:
      BEGIN { 
         while ( (getline var < ARGV[1]) > 0) { 
                data[var]++ 
         } 
         close(ARGV[1]) 
         ARGV[1]="" 
       } 
       $0 in data 
      
    4. Recursive-descent parsing of an input file or files, e.g.:
      awk 'function read(file) { 
                  while ( (getline < file) > 0) { 
                      if ($1 == "include") { 
                           read($2) 
                      } else { 
                           print > ARGV[2] 
                      } 
                  } 
                  close(file) 
            } 
            BEGIN{ 
               read(ARGV[1]) 
               ARGV[1]="" 
               close(ARGV[2]) 
           }1' file1 tmp 
      

    In all other cases, it's clearest, simplest, least error-prone, and easiest to maintain to let awk's normal text-processing read the records. In the case of application 3, whether to use the BEGIN+getline approach or just collect the data within the awk condition/action part after testing for the first file is largely a style choice.

    Application 1 above calls the UNIX command "ls" to list the current directory contents, then prints the result one line at a time.

    Application 2 above writes the letters of the alphabet in reverse order, one per line, down the two-way pipe to the UNIX "sort" command. It then closes the write end of the pipe, so that sort receives an end-of-file indication. This causes sort to sort the data and write the sorted data back to the gawk program. Once all of the data has been read, gawk terminates the coprocess and exits. Closing the write end is particularly necessary in order to use the UNIX "sort" utility as part of a coprocess since sort must read all of its input data before it can produce any output. The sort program does not receive an end-of-file indication until gawk closes the write end of the pipe. Other programs can be invoked as just:

    command = "program" 
    do { 
          print data |& command 
          command |& getline var 
    } while (data left to process) 
    close(command) 
    

    Note that calling close() with a second argument is also gawk-specific.

    Application 3 above reads every record of the first file passed as an argument to awk into an array and then, for every subsequent file passed as an argument, prints every record from that file that matches any of the records that appeared in the first file (and so are stored in the "data" array). This could alternatively have been implemented as:

    # fails if first file is empty 
    NR==FNR{ data[$0]++; next } 
    $0 in data 
    

    or:

    FILENAME==ARGV[1] { data[$0]++; next } 
    $0 in data 
    

    or:

    FILENAME=="specificFileName" { data[$0]++; next } 
    $0 in data 
    

    or (gawk only):

    ARGIND==1 { data[$0]++; next } 
    $0 in data 
    

    Application 4 above not only expands all the lines that say "include subfile", but by writing the result to a tmp file, resetting ARGV[1] (the highest level input file) and not resetting ARGV[2] (the tmp file), it then lets awk do any normal record parsing on the result of the expansion since that's now stored in the tmp file. If you don't need that, just do the "print" to stdout and remove any other references to a tmp file or ARGV[2]. In this case, since it's convenient to use $1 and $2, and no other part of the program references any builtin variables, getline was used without populating an explicit variable. This method is limited in its recursion depth to the total number of open files the OS permits at one time.

    Tips

    The following tips may help if, after reading the above, you discover you have an appropriate application for getline or if you're looking for an alternative solution to using getline:

    1. If you need to distinguish between a normal EOF and a read or open error, you have to use gawk's ERRNO variable or code it as:
      if/while ( (e = (getline var < file)) > 0) { ... }
      close(file)
      if (e < 0) some_error_handling
    2. Don't forget to close() any file you open for reading. The common idiom for getline and other methods of opening files/streams is:
      cmd="some command" 
      do something with cmd 
      close(cmd) 
      
    3. A common misapplication of getline is to just skip a few lines of an input file. The following discusses how to do that without using getline with all that implies as discussed above. This discussion builds on the common awk idiom to "decrement a variable to zero" by putting the decrement of the variable as the second term in an "and" clause with the first part being the variable itself, so the decrement only occurs if the variable is non-zero:
      • Print the Nth record after some pattern:
        awk 'c&&!--c;/pattern/{c=N}' file 
      • Print every record except the Nth record after some pattern:
        awk 'c&&!--c{next}/pattern/{c=N}' file 
      • Print the N records after some pattern:
        awk 'c&&c--;/pattern/{c=N}' file 
      • Print every record except the N records after some pattern:
        awk 'c&&c--{next}/pattern/{c=N}' file

    As a worked example, suppose you want to print $0 for the second record following the record that contains some pattern, e.g. the number 3:

    $ cat file 
    line 1 
    line 2 
    line 3 
    line 4 
    line 5 
    line 6 
    line 7 
    line 8 
    $ awk '/3/{getline;getline;print}' file 
    line 5 
    

    That works just fine. Now let's see the concise way to do it without getline:

    $ awk 'c&&!--c;/3/{c=2}' file 
    line 5

    It's not quite so obvious at a glance what that does, but it uses an idiom that most awk programmers could do well to learn and it is briefer and avoids all those getline caveats.
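
    For readers who find the terse form hard to follow, the same logic can be spelled out longhand. This is only an illustrative sketch (nth_after is a hypothetical name); it is meant to behave like c&&!--c;/pattern/{c=N} for positive N:

```shell
# nth_after N PATTERN [FILE...]: hypothetical helper, POSIX awk only
nth_after() {
    n=$1; pat=$2; shift 2
    awk -v n="$n" -v pat="$pat" '
    c > 0 {                   # the leading "c" in c&&!--c: act only while armed
        c--
        if (c == 0) print     # !--c is true exactly when the countdown hits zero
    }
    $0 ~ pat { c = n }        # arm (or re-arm) the counter on every match
    ' "$@"
}
```

    With the eight-line sample file above, "nth_after 2 3 file" prints "line 5" and "nth_after 5 3 file" prints "line 8".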

    Now let's say we want to print the 5th line after the pattern instead of the 2nd line. Then we'd have:

    $ awk '/3/{getline;getline;getline;getline;getline;print}' file 
    line 8 
    $ awk 'c&&!--c;/3/{c=5}' file 
    line 8
    

    i.e. we have to add a whole series of additional getline calls to the getline version, as opposed to just changing the counter from 2 to 5 for the non-getline version. In reality, you'd probably completely rewrite the getline version to use a loop:

    $ awk '/3/{for (c=1;c<=5;c++) getline; print}' file 
    line 8

    Still not as concise as the non-getline version, has all the getline caveats and required a redesign of the code just to change a counter.

    Now let's say we also have to print the word "Eureka" if the number 4 appears in the input file. With the getline version, you now have to do something like:

    $ awk '/3/{for (c=1;c<=5;c++) { getline; if ($0 ~ /4/) print "Eureka!" } 
    print}' file 
    Eureka! 
    line 8

    whereas with the non-getline version you just have to do:

    $ awk 'c&&!--c;/3/{c=5}/4/{print "Eureka!"}' file 
    Eureka! 
    line 8

    i.e. with the getline version, you have to work around the fact that you're now processing records outside of the normal awk work-loop, whereas with the non-getline version you just have to drop your test for "4" into the normal place and let awk's normal record processing deal with it like it always does. Actually, if you look closely at the above you'll notice we just unintentionally introduced a bug in the getline version. Consider what would happen in both versions if 3 and 4 appear on the same line. The non-getline version would behave correctly, but to fix the getline version, you'd need to duplicate the condition somewhere, e.g. perhaps something like this:

    $ awk '/3/{for (c=1;c<=5;c++) { if ($0 ~ /4/) print "Eureka!"; getline } 
    if ($0 ~ /4/) print "Eureka!"; print}' file 
    Eureka! 
    line 8 
    

    Now consider how the above would behave when there aren't 5 lines left in the input file or when the last line of the file contains both a 3 and a 4. i.e. there are still design questions to be answered and bugs that will appear at the limits of the input space.

    Ignoring those bugs since this is not intended as a discussion on debugging getline programs, let's say you no longer need to print the 5th record after the number 3 but still have to do the Eureka on 4. With the getline version, you'd strip out the test for 3 and the getline stuff to be left with:

    $ awk '{if ($0 ~ /4/) print "Eureka!"}' file 
    Eureka!
    

    which you'd then presumably rewrite as:

    $ awk '/4/{print "Eureka!"}' file 
    Eureka! 
    

    which is what you get just by removing everything involving the test for 3 and counter in the non-getline version (i.e. "c&&!--c;/3/{c=5}"):

    $ awk '/4/{print "Eureka!"}' file 
    Eureka! 
    

    i.e. again, one small requirement change required a complete redesign of the getline code, but just the absolute minimum necessary tweak to the non-getline version.

    So, what you see above in the getline case was significant redesign required for every tiny requirement change, much larger amounts of handwritten code required, insidious bugs introduced during development and challenging design questions at the limits of your input space, whereas the non-getline version always had less code, was much easier to modify as requirements changed, and was much more obvious, predictable, and correct in how it would behave at the limits of the input space.
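
    Returning to tip 1, the "save the return value" pattern can be sketched as follows. The function name count_lines and the message text are made up for illustration, and the error is printed to standard output only to keep the sketch portable:

```shell
# count_lines FILE: hypothetical helper, POSIX awk only
count_lines() {
    awk 'BEGIN {
        file = ARGV[1]
        # e keeps the last return value of getline: >0 record, 0 EOF, <0 error
        while ((e = (getline line < file)) > 0)
            n++
        close(file)
        if (e < 0) {
            print "cannot read " file
            exit 1
        }
        print n + 0
    }' "$1"
}
```

    A readable file yields its record count, while a missing file yields the "cannot read" message and a non-zero exit status instead of being silently treated as empty.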


    categories: Forloop,Tips,Jan,2009,Jimh

    Never write for(i=1;i<=n;i++).. again?

    by Jim Hart

    I've written this kind of thing

    n = split(something,arr,/re/)
    for(i=1;i<=n;i++) {
       print arr[i]
    }
    

    so often, it's tedious. I like this better:

    n = split(something,arr,/re/)
    while(n--) {
       print arr[++i]   # i must start at 0 (or be unset)
    }
    

    Easier to type. And, in cases where front-to-back or back-to-front doesn't matter, it's even simpler:

    # copy a number indexed array, assuming n contains the number of
    # elements
    
    while(n) { arr2[n] = arr1[n]; n-- }
    

    And, yes,

    for(i in arr1) arr2[i] = arr1[i]
    

    works, too. But, some loops don't involve arrays. :-)
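
    A quick way to check both loop directions is the sketch below. It assumes split() filled indices 1..n and that the counter i starts at zero, which is why the loop body uses ++i (loop_demo is a hypothetical name):

```shell
# loop_demo: hypothetical helper, POSIX awk only
loop_demo() {
    awk 'BEGIN {
        n = split("a b c", arr, " ")
        i = 0
        while (n--) print arr[++i]    # front to back: a, b, c

        n = split("a b c", arr, " ")
        while (n) print arr[n--]      # back to front: c, b, a
    }'
}
```

    Running loop_demo prints a, b, c and then c, b, a, one per line.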

    Want more?

    This tip has been discussed on comp.lang.awk.


    categories: Tips,Apr,2009,ArnoldR

    Moving Files with Awk

    Andrew Eaton wrote at comp.lang.awk:

    I just started with awk and sed, I am more of a perl/C/C++ person. I have a quick question regarding the pipe. In Awk, I am trying to use this construct.

    while ((getline < "somedata.txt") > 0)
                {print | "mv"} #or could be "mv -v" for verbose. 
    

    Is it possible that "print" is no longer printing the value of getline, if so how do I correct it?

    Arnold Robbins comments:

    The problem here is that `mv' doesn't read standard input, it only processes command lines. Assuming that your data is something like:

    oldfile newfile
    

    You can do things two ways:

    # build the command and execute it
    while ((getline < "somedata.txt") > 0) {
              command = "mv " $1 " " $2
              system(command)
    }
    close("somedata.txt")
    

    or this way:

    # send commands to the shell
    while ((getline < "somedata.txt") > 0) {
              printf("mv %s %s\n", $1, $2) | "sh"
    }
    close("somedata.txt")
    close("sh")
    

    The latter is more efficient.
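
    The second approach can also be written without getline, letting awk's main loop read the list and leaving the pipe to the shell. This is only a sketch: gen_mv is a hypothetical name, and it assumes file names contain no whitespace or shell metacharacters, since nothing is quoted:

```shell
# gen_mv LISTFILE...: hypothetical helper, POSIX awk only
# Prints one "mv old new" command per two-field input line.
gen_mv() {
    awk 'NF == 2 { printf("mv %s %s\n", $1, $2) }' "$@"
}
```

    Run "gen_mv somedata.txt" on its own to inspect the commands first, then "gen_mv somedata.txt | sh" to execute them.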


    categories: Sed,Tips,Apr,2009,ArnoldR

    AwkSed: A Simple Stream Editor

    by Arnold Robbins

    From the Gawk Manual.

    The sed utility is a stream editor, a program that reads a stream of data, makes changes to it, and passes it on. It is often used to make global changes to a large file or to a stream of data generated by a pipeline of commands. While sed is a complicated program in its own right, its most common use is to perform global substitutions in the middle of a pipeline:

    command1 < orig.data | sed 's/old/new/g' | command2 > result
    

    Here, s/old/new/g tells sed to look for the regexp old on each input line and globally replace it with the text new, i.e., all the occurrences on a line. This is similar to awk's gsub function.
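
    For comparison, a plain gsub version of the same substitution might look like the sketch below (awk_subst is a hypothetical name). It assumes the replacement text contains no "&" or backslashes, which gsub and the -v option treat specially:

```shell
# awk_subst PAT REPL [FILE...]: hypothetical helper, POSIX awk only
awk_subst() {
    pat=$1; repl=$2; shift 2
    # gsub with two arguments rewrites every match of pat in $0
    awk -v pat="$pat" -v repl="$repl" '{ gsub(pat, repl); print }' "$@"
}
```

    For example, "command1 < orig.data | awk_subst old new | command2 > result" would take the place of the sed stage in the pipeline shown earlier.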

    The following program, awksed.awk, accepts at least two command-line arguments: the pattern to look for and the text to replace it with. Any additional arguments are treated as data file names to process. If none are provided, the standard input is used:

    # awksed.awk --- do s/foo/bar/g using just print
    #    Thanks to Michael Brennan for the idea
    
    function usage()
    {
      print "usage: awksed pat repl [files...]" > "/dev/stderr"
      exit 1
    }
    
    BEGIN {
        # validate arguments
        if (ARGC < 3)
            usage()
    
        RS = ARGV[1]
        ORS = ARGV[2]
    
        # don't use arguments as files
        ARGV[1] = ARGV[2] = ""
    }
    
    # look ma, no hands!
    {
        if (RT == "")
            printf "%s", $0
        else
            print
    }
    

    The program relies on gawk's ability to have RS be a regexp, as well as on the setting of RT to the actual text that terminates the record.

    The idea is to have RS be the pattern to look for. gawk automatically sets $0 to the text between matches of the pattern. This is text that we want to keep, unmodified. Then, by setting ORS to the replacement text, a simple print statement outputs the text we want to keep, followed by the replacement text.

    There is one wrinkle to this scheme, which is what to do if the last record doesn't end with text that matches RS. Using a print statement unconditionally prints the replacement text, which is not correct. However, if the file did not end in text that matches RS, RT is set to the null string. In this case, we can print $0 using printf.

    The BEGIN rule handles the setup, checking for the right number of arguments and calling usage if there is a problem. Then it sets RS and ORS from the command-line arguments and sets ARGV[1] and ARGV[2] to the null string, so that they are not treated as file names.

    The usage function prints an error message and exits. Finally, the single rule handles the printing scheme outlined above, using print or printf as appropriate, depending upon the value of RT.


    categories: Sed,Tips,Apr,2009,JamesL

    s2a: sed to Awk

    Contents

    Download

    Description

    Bugs

    Author

    Code

    Download

    Download from LAWKER.

    Description

    The s2a project is a sed to awk conversion utility written in awk. As input it takes sed scripts, and it outputs an equivalent awk script.

    This version should be fully functional as far as the following sed commands are concerned: a,d,s,p,q,c,i,n. Commands to be implemented in the future: {},=,h,g,N,P,r,x,y,l,H,G,D,b,t,:
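
    To give a feel for the kind of translation involved, here are two hand-written sed/awk pairs. They are illustrations only, not actual s2a output, and the function names are hypothetical:

```shell
# Delete line 2.
sed_2d() { sed '2d' "$@"; }
awk_2d() { awk 'NR==2{next} {print}' "$@"; }

# Print matching lines twice (sed p without -n).
sed_p()  { sed '/b/p' "$@"; }
awk_p()  { awk '/b/{print} {print}' "$@"; }
```

    In both pairs the trailing {print} rule plays the role of sed's automatic printing of the pattern space at the end of each cycle.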

    Bugs

    $ is not a valid line address. Also, line continuation with '\' is not implemented.

    Author

    James Lyons, Feb 2008.

    For more excellent awk code, visit Lyons' awk.dsplab web site.

    Code

    BEGIN{RS=";|\n"; FS=""; var=1;}
    {
        i=1; case1=""; case2="";
        while($i==" ")i++;
        if($i=="\\"||$i=="/"||$i~/[0-9]/) case1=matchaddr();
        if($i==","){i++; case2=matchaddr()};
    # handle sed commands
    ####################################################################################################
        if($i == "d"){ a1=a2="next;";
        }else if($i == "p"){ a1=a2="print;";
        }else if($i == "a"){ rest="";
            for(c=i+2;c<=NF;c++) rest=rest$c;
            a1=a2="$0=$0\"\\n"rest"\";"; 
        }else if($i == "q"){ a1=a2="print; exit;"; 
        }else if($i == "n"){ a1=a2="print; if(getline <= 0) next;"
        }else if($i == "s"){
            re=substr($0, i); p=substr(re,2,1); match(re,"s"p"((\\"p"|.)*)"p"((\\"p"|.)*)"p"([a-zA-Z])?",tmp);
            tmp[3]=gensub(/\\[0-9]/,"\\\\&","g",tmp[3]); 
            tmp[1]=gensub(/\\\(/,"(","g",tmp[1]); tmp[1]=gensub(/\\\)/,")","g",tmp[1]);
            if(tmp[3]=="") a1=a2="$0=gensub(/"tmp[1]"/,\""tmp[3]"\",1);";
            else a1=a2="$0=gensub(/"tmp[1]"/,\""tmp[3]"\",\""tmp[5]"\");";
        }else if($i == "c"){ rest="";
            for(c=i+2;c<=NF;c++) rest=rest$c;
            a1="$0=\""rest"\";"; 
            a2="next;";
        }else if($i == "i"){ rest="";
            for(c=i+2;c<=NF;c++) rest=rest$c;
            a1=a2="$0=\""rest"\\n\"$0;"; 
        }else{
            print "ERROR: invalid syntax. Unknown command in expression "$0" (expr number "NR")"; exit;
        }
    ####################################################################################################
    # output awk commands
        if(case1=="" && case2=="") print "{"a1"}";
        else if(case1~/^[0-9]/ && case2=="") print "NR=="case1"{"a1"}";
        else if(case2 == "") print "/"case1"/{"a1"}";
        else if(case1~/^[0-9]/ && case2~/^[0-9]/) print "temp"var"==1&&NR=="case2"{temp"var"=0;"a2"}temp"var"==1{"a2"}NR=="case1"{temp"var"=1;"a1"}";
        else if(case1~/^[0-9]/)  print "temp"var"==1&&/"case2"/{temp"var"=0;"a2"}temp"var"==1{"a2"}NR=="case1"{temp"var"=1;"a1"}";
        else if(case2~/^[0-9]/)  print "temp"var"==1&&NR=="case2"{temp"var"=0;"a2"}temp"var"==1{"a2"}/"case1"/{temp"var"=1;"a1"}";
        else print "temp"var"==1&&/"case2"/{temp"var"=0;"a2"}temp"var"==1{"a2"}/"case1"/{temp"var"=1;"a1"}";
        var++;
    }
    
    function matchaddr(){
        str=substr($0, i); p=1;
        if($i == "\\"){ p=substr(str,2,1); match(str,p"([^"p"]*)"p,arr); i++}
        else if($i == "/"){ p=substr(str,1,1); match(str,p"([^"p"]*)"p,arr); }
        else { match(str,/^([0-9]*)/,arr) };
        i += RLENGTH;
        return arr[1];
    }
    END{print "{print}";}
    

    categories: Papers,Jul,2009,JiirL

    Visual Awk

    Reference: Visual AWK: A Model for Text Processing by Demonstration, by Jürgen Landauer and Masahito Hirakawa. 11th International IEEE Symposium on Visual Languages, 1995

    Download

    Download from LAWKER.

    Abstract

    Programming by Demonstration (PBD) systems often have problems with control structure inference and user-intended generalization. We propose a new solution for these weaknesses based on concepts of AWK and present a prototype system for text processing. It utilizes vertical demonstration, extensive visual feedback, and program visualization via spreadsheets to achieve improved usability and expressive power.

    Introduction

    In text editing users are often confronted with reformatting tasks which involve large portions of texts, sometimes consisting of hundreds of lines. For example, let us assume we want to create mailing labels out of a given address list. The task seems to be easy to automate since all paragraphs are similarly structured, containing a name, an address, and a phone number each. However, both the built-in find and replace function and the macro recorder of the editor prove to be not flexible enough to handle the task, because their facilities for specifying search patterns and for dealing with special cases and exceptions are limited.

    On the other hand, most current end-users estimate solving such tasks with one of today's programming languages as too difficult for them. Programming by Demonstration (PBD) is a promising remedy here since, by contrast, it promises nearly unlimited programming power through ease of learning and usage. Therefore, a variety of PBD systems were proposed for this application domain in the past. But PBD is not yet very widespread in commercial text editors because of some serious weaknesses.

    This paper examines these weaknesses and presents a new approach for the solution of the deficiencies of PBD. We introduce Visual AWK, a prototype text processing system developed at the Information Systems Lab of Hiroshima University based on the programming language AWK which incorporates the new design approach. Extensive visual feedback and program visualization via spreadsheets improve both usability and expressive power.

    Visual AWK is aimed at users without previous knowledge in programming, but with experience in text editor use. The application domain is semi-structured texts, that is, texts that consist of equally structured entities, for instance lines or paragraphs, but may contain a few syntactically classifiable sets of exceptions with a different structure.


    categories: Papers,Verification,Jul,2009,GerardH

    MicroTrace

    by Gerard Holzmann

    Description

    Micro-tracer is a little awk-script for verifying state machines; quite possibly the world's smallest working verifier. Some comments on the working of the script, plus a sample input for the X.21 protocol, are given below.

    Reproduce and use freely, at your own risk of course. The micro-tracer was first described in this report:

    • Gerard Holzmann, X.21 Analysis Revisited: the Micro-Tracer, AT&T Bell Laboratories, Technical Memorandum 11271-8710230-12, October 23, 1987. (PDF)

    Code

    This script was written to show how little code is needed to write a working verifier for safety properties. The hard problem in writing a practical verifier is to make the search efficient, to support a useful logic, and a sensible specification language... (see the Spin homepage.)

    $1 == "init"	{	proc[$2] = $3	}
    $1 == "inp"	{	move[$2,$3]=move[$2,$3] $1 "/" $4 "/" $5 "/" $6 "/;" }
    $1 == "out"	{	move[$2,$3]=move[$2,$3] $1 "/" $4 "/" $5 "/" $6 "/;" }
    END		{	verbose=0; for (i in proc) signal[i] = "-"
    			run(mkstate(state))
    			for (i in space) nstates++;
    			print nstates " states, " deadlocks " deadlocks"
    		}
    
    function run(state,  i,str,moved)	# 1 parameter, 3 local vars
    {
    	if (space[state]++) return	# been here before
    
    	level++; moved=0
    	for (i in proc)
    	{	str = move[i,proc[i]]
    		while (str)
    		{	v = substr(str, 1, index(str, ";"))
    			sub(v, "", str)
    			split(v, arr, "/")
    			if (arr[1] == "inp" && arr[3] == signal[arr[4]])
    			{	Level[level] = i " " proc[i] " -> " v
    				proc[i] = arr[2]
    				run(mkstate(k))
    				unwrap(state); moved=1
    			} else if (arr[1] == "out")
    			{	Level[level] = i " " proc[i] " -> " v
    				proc[i] = arr[2]; signal[arr[4]] = arr[3]
    				run(mkstate(k))
    				unwrap(state); moved=1
    	}	}	}
    	if (!moved)
    	{	deadlocks++
    		print "deadlock " deadlocks ":"
    		for (i in proc) print "\t" i, proc[i], signal[i]
    		if (verbose)
    			for (i = 1; i < level; i++) print i, Level[i]
    	}
    	level--
    }
    function mkstate(state, m)
    {	state = ""
    	for (m in proc) state = state " " proc[m] " " signal[m]
    	return state
    }
    function unwrap(state, m)
    {	split(state, arr, " "); nxt=0
    	for (m in proc) { proc[m] = arr[++nxt]; signal[m] = arr[++nxt] }
    }
    

    The first three lines of the script deal with the input. Data are stored in two arrays. The initial state of machine A is stored in array element proc[A]. The transitions that machine A can make from state s are stored in move[A,s]. All data are stored as strings, and most arrays are also indexed with strings. All valid moves for A in state s, for instance, are concatenated into the same array element move[A,s], and later unwound as needed in function run().
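
    The storage scheme can be seen in isolation with the sketch below, which reuses the input rule from the script and then unwinds one array element by hand. The function name show_moves and the choice of dte/state01 are only for illustration, and substr() is used to drop each move where the original uses sub():

```shell
# show_moves: hypothetical helper, reads transition rules on stdin
show_moves() {
    awk '
    $1 == "inp" || $1 == "out" {
        # append this move to the string kept for (machine, state)
        move[$2,$3] = move[$2,$3] $1 "/" $4 "/" $5 "/" $6 "/;"
    }
    END {
        str = move["dte","state01"]
        print str
        while (str) {
            v = substr(str, 1, index(str, ";"))   # one move, up to and including ";"
            str = substr(str, length(v) + 1)      # drop it from the front
            split(v, arr, "/")
            print arr[1], arr[2], arr[3], arr[4]
        }
    }'
}
```

    Feeding it the two rules "inp dte state01 state08 u dte" and "out dte state01 state02 d dce" prints the concatenated string "inp/state08/u/dte/;out/state02/d/dce/;" followed by one line per unwound move.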

    The line starting with END is executed when the end of the input file has been reached and the complete protocol specification has been read. It initializes the signals and calls the symbolic execution routine run().

    The program contains three function definitions: run(), mkstate(), and unwrap(). The global system state, state, is represented as a concatenation of strings encoding process and signal states. The function mkstate() creates the composite, and the function unwrap() restores the arrays proc and signal to the contents that correspond to the description in state. (The recursive step in run() alters their contents.) Function run() uses three local variables, but only one real parameter state that is passed by the calling routine.
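
    The save-and-restore role of mkstate() and unwrap() can be sketched on its own as below. This is a simplified variant with local variables added, and state_roundtrip is a hypothetical name. Like the original, it assumes that two "for (m in proc)" loops over the same keys visit them in the same order, which awk implementations provide in practice but the language does not promise:

```shell
# state_roundtrip: hypothetical helper, POSIX awk only
state_roundtrip() {
    awk '
    function mkstate(   m, s) {
        s = ""
        for (m in proc) s = s " " proc[m] " " signal[m]
        return s
    }
    function unwrap(state,   m, arr, nxt) {
        split(state, arr, " "); nxt = 0
        for (m in proc) { proc[m] = arr[++nxt]; signal[m] = arr[++nxt] }
    }
    BEGIN {
        proc["dte"] = "state01"; signal["dte"] = "-"
        proc["dce"] = "state01"; signal["dce"] = "-"
        saved = mkstate()                              # snapshot before the move
        proc["dte"] = "state02"; signal["dce"] = "d"   # simulate one move
        unwrap(saved)                                  # backtrack to the snapshot
        print proc["dte"], signal["dte"], proc["dce"], signal["dce"]
    }'
}
```

    After the round trip the arrays are back to their initial contents, so the function prints "state01 - state01 -".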

    The analyzer runs by inspecting the possible moves for each process in turn, checking for valid inp or out moves, and performing a complete depth-first search. Any state that has no successors is flagged as a deadlock. A backtrace of transitions leading into a deadlock is maintained in array Level and can be printed when a deadlock is found.

    The first line in run() is a complete state space handler. The composite state is used to index a large array space. If the array element was indexed before it returns a count larger than zero: the state was analyzed before, and the search can be truncated.
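
    The "been here before" test is the familiar awk post-increment idiom. A minimal sketch, with made-up names count_states and visit, applied to arbitrary input lines standing in for composite states:

```shell
# count_states: hypothetical helper, reads one state string per line on stdin
count_states() {
    awk '
    function visit(state) {
        if (space[state]++) return 0   # count was already non-zero: seen before
        return 1                       # first visit
    }
    { if (visit($0)) fresh++; else repeats++ }
    END { printf "%d new, %d repeated\n", fresh, repeats }
    '
}
```

    For the input lines a, b, a this reports "2 new, 1 repeated". The post-increment returns the old count, so the first visit tests as false and every later visit as true.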

    After the analysis completes, the contents of array space is available for other types of probing. In this case, the micro tracer just counts the number of states and prints it as a statistic, together with the number of deadlocks found.

    A Sample Application -- X21

    The transition rules are based on the classic two-process model for the call establishment phase of CCITT Recommendation X.21. Interface signal pairs T, C and R, I are combined. Each possible combination of values on these line pairs is represented by a distinct lower-case ASCII character below. Note that since the lines are modeled as true signals, the receiving process can indeed miss signals if the sending process changes them rapidly and does not wait for the peer process to respond.

    Transition rules for the `dte' process.

    inp dte state01 state08 u dte
    inp dte state01 state18 m dte
    inp dte state02 state03 v dte
    inp dte state02 state15 u dte
    inp dte state02 state19 m dte
    inp dte state04 state19 m dte
    inp dte state05 state19 m dte
    inp dte state05 state6A r dte
    inp dte state07 state19 m dte
    inp dte state07 state6B r dte
    inp dte state08 state19 m dte
    inp dte state09 state10B q dte
    inp dte state09 state19 m dte
    inp dte state10 state19 m dte
    inp dte state10 state6C r dte
    inp dte state10B state19 m dte
    inp dte state10B state6C r dte
    inp dte state11 state12 n dte
    inp dte state11 state19 m dte
    inp dte state12 state19 m dte
    inp dte state14 state19 m dte
    inp dte state15 state03 v dte
    inp dte state15 state19 m dte
    inp dte state16 state17 m dte
    inp dte state17 state21 l dte
    inp dte state18 state01 l dte
    inp dte state18 state19 m dte
    inp dte state20 state21 l dte
    inp dte state6A state07 q dte
    inp dte state6A state19 m dte
    inp dte state6B state07 q dte
    inp dte state6B state10 q dte
    inp dte state6B state19 m dte
    inp dte state6C state11 l dte
    inp dte state6C state19 m dte
    out dte state01 state02 d dce
    out dte state01 state14 i dce
    out dte state01 state21 b dce
    out dte state02 state16 b dce
    out dte state03 state04 e dce
    out dte state04 state05 c dce
    out dte state04 state16 b dce
    out dte state05 state16 b dce
    out dte state07 state16 b dce
    out dte state08 state09 c dce
    out dte state08 state15 d dce
    out dte state08 state16 b dce
    out dte state09 state16 b dce
    out dte state10 state16 b dce
    out dte state10B state16 b dce
    out dte state11 state16 b dce
    out dte state12 state16 b dce
    out dte state14 state01 a dce
    out dte state14 state16 b dce
    out dte state15 state16 b dce
    out dte state18 state16 b dce
    out dte state19 state20 b dce
    out dte state21 state01 a dce
    out dte state6A state16 b dce
    out dte state6B state16 b dce
    out dte state6C state16 b dce
    

    Transition rules for the `dce' process.

    inp dce state01 state02 d dce
    inp dce state01 state14 i dce
    inp dce state01 state21 b dce
    inp dce state02 state16 b dce
    inp dce state03 state04 e dce
    inp dce state04 state05 c dce
    inp dce state04 state16 b dce
    inp dce state05 state16 b dce
    inp dce state07 state16 b dce
    inp dce state08 state09 c dce
    inp dce state08 state15 d dce
    inp dce state08 state16 b dce
    inp dce state09 state16 b dce
    inp dce state10 state16 b dce
    inp dce state10B state16 b dce
    inp dce state11 state16 b dce
    inp dce state12 state16 b dce
    inp dce state14 state01 a dce
    inp dce state14 state16 b dce
    inp dce state15 state16 b dce
    inp dce state18 state16 b dce
    inp dce state19 state20 b dce
    inp dce state21 state01 a dce
    inp dce state6A state16 b dce
    inp dce state6B state16 b dce
    inp dce state6C state16 b dce
    out dce state01 state08 u dte
    out dce state01 state18 m dte
    out dce state02 state03 v dte
    out dce state02 state15 u dte
    out dce state02 state19 m dte
    out dce state04 state19 m dte
    out dce state05 state19 m dte
    out dce state05 state6A r dte
    out dce state07 state19 m dte
    out dce state07 state6B r dte
    out dce state08 state19 m dte
    out dce state09 state10B q dte
    out dce state09 state19 m dte
    out dce state10 state19 m dte
    out dce state10 state6C r dte
    out dce state10B state19 m dte
    out dce state10B state6C r dte
    out dce state11 state12 n dte
    out dce state11 state19 m dte
    out dce state12 state19 m dte
    out dce state14 state19 m dte
    out dce state15 state03 v dte
    out dce state15 state19 m dte
    out dce state16 state17 m dte
    out dce state17 state21 l dte
    out dce state18 state01 l dte
    out dce state18 state19 m dte
    out dce state20 state21 l dte
    out dce state6A state07 q dte
    out dce state6A state19 m dte
    out dce state6B state07 q dte
    out dce state6B state10 q dte
    out dce state6B state19 m dte
    out dce state6C state11 l dte
    out dce state6C state19 m dte
    

    Initialization

    init dte state01
    init dce state01
    

    Error Listings (verbose mode)

    For each step number, the error listings give the name of the executing machine followed by its state and an arrow. After the arrow comes the transition rule: inp or out, the new state, the required or provided signal value, and the signal name.

    deadlock 1:
    	dce state21 b
    	dte state16 l
    1 dce state01 -> out/state08/u/dte/;
    2 dce state08 -> out/state19/m/dte/;
    3 dte state01 -> inp/state18/m/dte/;
    4 dte state18 -> inp/state19/m/dte/;
    5 dte state19 -> out/state20/b/dce/;
    6 dce state19 -> inp/state20/b/dce/;
    7 dce state20 -> out/state21/l/dte/;
    8 dte state20 -> inp/state21/l/dte/;
    9 dte state21 -> out/state01/a/dce/;
    10 dce state21 -> inp/state01/a/dce/;
    11 dce state01 -> out/state08/u/dte/;
    12 dce state08 -> out/state19/m/dte/;
    13 dte state01 -> inp/state18/m/dte/;
    14 dte state18 -> out/state16/b/dce/;
    15 dce state19 -> inp/state20/b/dce/;
    16 dce state20 -> out/state21/l/dte/;
    deadlock 2:
    	dce state03 b
    	dte state16 v
    1 dce state01 -> out/state08/u/dte/;
    2 dce state08 -> out/state19/m/dte/;
    3 dte state01 -> inp/state18/m/dte/;
    4 dte state18 -> inp/state19/m/dte/;
    5 dte state19 -> out/state20/b/dce/;
    6 dce state19 -> inp/state20/b/dce/;
    7 dce state20 -> out/state21/l/dte/;
    8 dte state20 -> inp/state21/l/dte/;
    9 dte state21 -> out/state01/a/dce/;
    10 dce state21 -> inp/state01/a/dce/;
    11 dce state01 -> out/state08/u/dte/;
    12 dce state08 -> out/state19/m/dte/;
    13 dte state01 -> out/state21/b/dce/;
    14 dce state19 -> inp/state20/b/dce/;
    15 dte state21 -> out/state01/a/dce/;
    16 dte state01 -> inp/state18/m/dte/;
    17 dce state20 -> out/state21/l/dte/;
    18 dce state21 -> inp/state01/a/dce/;
    19 dce state01 -> out/state18/m/dte/;
    20 dte state18 -> inp/state19/m/dte/;
    21 dce state18 -> out/state01/l/dte/;
    22 dte state19 -> out/state20/b/dce/;
    23 dte state20 -> inp/state21/l/dte/;
    24 dce state01 -> out/state08/u/dte/;
    25 dce state08 -> inp/state16/b/dce/;
    26 dte state21 -> out/state01/a/dce/;
    27 dte state01 -> inp/state08/u/dte/;
    28 dce state16 -> out/state17/m/dte/;
    29 dce state17 -> out/state21/l/dte/;
    30 dce state21 -> inp/state01/a/dce/;
    31 dce state01 -> out/state08/u/dte/;
    32 dte state08 -> out/state15/d/dce/;
    33 dce state08 -> inp/state15/d/dce/;
    34 dce state15 -> out/state03/v/dte/;
    35 dte state15 -> inp/state03/v/dte/;
    36 dte state03 -> out/state04/e/dce/;
    37 dte state04 -> out/state05/c/dce/;
    38 dte state05 -> out/state16/b/dce/;
    deadlock 3:
    	dce state03 b
    	dte state20 v
    1 dce state01 -> out/state08/u/dte/;
    2 dce state08 -> out/state19/m/dte/;
    3 dte state01 -> inp/state18/m/dte/;
    4 dte state18 -> inp/state19/m/dte/;
    5 dte state19 -> out/state20/b/dce/;
    6 dce state19 -> inp/state20/b/dce/;
    7 dce state20 -> out/state21/l/dte/;
    8 dte state20 -> inp/state21/l/dte/;
    9 dte state21 -> out/state01/a/dce/;
    10 dce state21 -> inp/state01/a/dce/;
    11 dce state01 -> out/state08/u/dte/;
    12 dce state08 -> out/state19/m/dte/;
    13 dte state01 -> out/state21/b/dce/;
    14 dce state19 -> inp/state20/b/dce/;
    15 dte state21 -> out/state01/a/dce/;
    16 dte state01 -> inp/state18/m/dte/;
    17 dce state20 -> out/state21/l/dte/;
    18 dce state21 -> inp/state01/a/dce/;
    19 dce state01 -> out/state18/m/dte/;
    20 dte state18 -> inp/state19/m/dte/;
    21 dce state18 -> out/state01/l/dte/;
    22 dte state19 -> out/state20/b/dce/;
    23 dte state20 -> inp/state21/l/dte/;
    24 dce state01 -> out/state08/u/dte/;
    25 dce state08 -> inp/state16/b/dce/;
    26 dte state21 -> out/state01/a/dce/;
    27 dte state01 -> inp/state08/u/dte/;
    28 dce state16 -> out/state17/m/dte/;
    29 dce state17 -> out/state21/l/dte/;
    30 dce state21 -> inp/state01/a/dce/;
    31 dce state01 -> out/state18/m/dte/;
    32 dte state08 -> out/state15/d/dce/;
    33 dte state15 -> inp/state19/m/dte/;
    34 dce state18 -> out/state01/l/dte/;
    35 dce state01 -> inp/state02/d/dce/;
    36 dce state02 -> out/state03/v/dte/;
    37 dte state19 -> out/state20/b/dce/;
    deadlock 4:
    	dce state21 b
    	dte state16 -
    1 dte state01 -> out/state02/d/dce/;
    2 dte state02 -> out/state16/b/dce/;
    3 dce state01 -> inp/state21/b/dce/;
    307 states, 4 deadlocks
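
    Because the verbose listing is so regular, it can be post-processed with Awk itself. The following is only a sketch, not part of the original tool: it assumes the step lines look exactly like those above, repeats the "deadlock 4" trace as sample input, and counts the inp and out steps that lead to each deadlock.

```shell
# Summarize a verbose error listing: steps per deadlock, split into inp/out.
# The here-document repeats the "deadlock 4" trace shown above.
summary=$(awk '
/^[ \t]*deadlock [0-9]+:/ { id = $2; sub(/:$/, "", id); next }  # "4:" becomes "4"
$4 == "->" {                      # a numbered step line
    split($5, rule, "/")          # rule[1] = inp or out, rule[2] = new state, ...
    steps[id]++
    kind[id, rule[1]]++
}
END {
    for (id in steps)
        printf "deadlock %s: %d steps (%d out, %d inp)\n",
               id, steps[id], kind[id, "out"], kind[id, "inp"]
}' <<'EOF'
deadlock 4:
    dce state21 b
    dte state16 -
1 dte state01 -> out/state02/d/dce/;
2 dte state02 -> out/state16/b/dce/;
3 dce state01 -> inp/state21/b/dce/;
307 states, 4 deadlocks
EOF
)
echo "$summary"
```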
    

    categories: Papers,Verification,Jul,2009,MikhailA

    An AWK Debugger and Assertion Checker

    From "AUI - the Debugger and Assertion Checker for the Awk Programming Language" by Mikhail Auguston, Subhankar Banerjee, Manish Mamnani, Ghulam Nabi, Juris Reinfelds, Ugis Sarkans, and Ivan Strnad. Proceedings of the 1996 International Conference on Software Engineering: Education and Practice (SE:EP '96)

    Download from LAWKER.

    Abstract

    This paper describes the design of Awk User Interface (AUI). AUI is a graphical programming environment for editing, running, testing and debugging of Awk programs. The AUI environment supports tracing of Awk programs, setting breakpoints, and inspection of variable values.

    An assertion language for describing the relationship between the input and output of an Awk program is provided. Assertions can be checked after the program run and, if violated, informative and readable messages can be generated. The assertions and debugging rules for the Awk program are written in a separate text file. Assertions are useful not only for testing and debugging but can also be considered a means of formal program specification and documentation.

    Example

    The input file contains a list of all states of U.S.A. There are 50 records separated by newlines, one for each of the states. The number of fields in a record is variable. The first field is the name of the state, and the subsequent fields are names of neighbor states. Fields are separated by tabs. For example, the first two records in the database are

    Alabama Mississippi Tennessee Georgia Florida 
    Alaska 
    

    The task is to color the U.S.A. map in such a way that any two neighboring states are in different colors. We will do it in a greedy manner (without backtracking), assigning to every state the first possible color. The Awk program for this task is the following:

    # Greedy map coloring 
    BEGIN { FS= "\t"; OFS= "\t" # fields separated by tabs 
    		color[0]= "yellow"  # color names 
    		color[1]= "blue" 
    		color[2]= "red" 
    		color[3]= "green" 
    		color[4]= "black" 
    } 
    { 		i=0 
    		while (a[$1,i] ) i++ # find first acceptable color for 
    		                     # state $1 
    		print $1"\t" color[i] # assign that color 
    		for (j=2; j<=NF; j++) a[$j,i]=1	# make that color 
                                                # unacceptable for 
                                                # states $2..$NF 
    } 
    

    We can check the correctness of the coloring using the following assertion:

    /* Checks the correctness of map coloring - any two neighbor
       states should be colored in different colors */
    	FOREACH r1: RECORD FROM FILE input 
    		(EXISTS r2: RECORD FROM FILE output 
    			(r1.$1 == r2.$1 AND 
     			FOREACH i IN 2..FIELD_NUM(r1) 
    				(EXISTS r3: RECORD FROM FILE output 
    					(r3.$1 == r1.$i AND r3.$2 != r2.$2) 
    				) 
    			) 
    		)		 
    SAY "Map colored correctly" 
    ONFAIL  SAY r1.$1 "and" r1.$i "are of the same color" 
            SAY "although they are neighboring states" 
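
    For readers without AUI, roughly the same check can be written as an ordinary two-file Awk pass. This is a sketch in plain Awk, not AUI assertion syntax; the file names and the three-state sample (in which A and C deliberately clash) are made up for illustration.

```shell
# Check a coloring in plain Awk: read the coloring first, then the neighbor lists.
dir=$(mktemp -d)
printf 'A\tB\tC\nB\tA\nC\tA\n' > "$dir/input"              # state, then its neighbors
printf 'A\tyellow\nB\tblue\nC\tyellow\n' > "$dir/output"   # state, then its color

report=$(awk '
BEGIN { FS = "\t" }
FNR == NR { color[$1] = $2; next }        # first file: remember each color
!($1 in color) { print $1 " has no color"; bad++; next }
{
    for (i = 2; i <= NF; i++) {
        if ($i == "") continue            # tolerate a trailing tab
        if (!($i in color)) {
            print $i " has no color"; bad++
        } else if (color[$i] == color[$1]) {
            print $1 " and " $i " are of the same color"; bad++
        }
    }
}
END { if (!bad) print "Map colored correctly" }
' "$dir/output" "$dir/input")
echo "$report"
rm -rf "$dir"
```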
    

    categories: Papers,Verification,Jul,2009,BalkhisB

    Automated Result Verification with Awk

    Source

    From B.A. Bakar and T. Janowski, "Automated Result Verification with AWK", Sixth IEEE International Conference on Complex Computer Systems (ICECCS'00), p. 188, 2000

    Download

    Download from LAWKER.

    Abstract

    The goal of result-verification is to prove that one execution run of a program satisfies its specification. Compared with implementation-verification, result-verification has a larger scope for applications in practice, gives more opportunities for automation and, based on the execution record not the implementation, is particularly suitable for complex systems.

    This paper proposes a technical framework to apply this technique in practice. We show how to write formal result-based specifications, how to generate a verifier program to check a given specification and to carry out result-verification according to the generated program.

    The execution result is written as a text file, the verifier is written in AWK (special-purpose language for text processing) and verification is done automatically by the AWK interpreter; given the verifier and the execution result as inputs.

    In this paper...

    In this paper we propose a technical framework to carry out automated result-verification in practice. Its main features are:
    • The execution result is a simple text file. Many programs produce such (log) files during their normal operations, for administrative purposes. A general technique to record exactly the information needed for verification, is to introduce a program wrapper.
    • The execution result is given as input to the verifier program, which does the actual verification. Given the execution result in a text file, we consider result-verification as the text-processing task. Accordingly, the verifier is written in AWK, which is a special-purpose language for text processing, implemented for most computing platforms. Verification is done by the AWK interpreter, given the execution result and the verifier program as inputs.
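
    To make the idea concrete, here is a small sketch of a verifier. The log format ("in:" and "out:" lines written by a hypothetical wrapper around a sorting program) and the specification (the output is sorted and uses exactly the input items) are assumptions chosen for illustration; they are not taken from the paper.

```shell
# Result verification sketch: check one recorded run of a sorting program.
verify() {
    awk '
    $1 == "in:"  { for (i = 2; i <= NF; i++) seen[$i]++; n_in = NF - 1 }
    $1 == "out:" {
        ok = (NF - 1 == n_in)                          # same number of items
        for (i = 2; i <= NF; i++) {
            if (i > 2 && $i + 0 < $(i-1) + 0) ok = 0   # must be non-decreasing
            if (--seen[$i] < 0) ok = 0                 # must come from the input
        }
        print (ok ? "run satisfies the specification" : "run violates the specification")
    }'
}
printf 'in: 3 1 2\nout: 1 2 3\n' | verify
printf 'in: 3 1 2\nout: 1 3 3\n' | verify
```

    The verifier never looks at the sorting program, only at the record of one run, which is the point of result-verification.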

    categories: Funky,Mar,2009,Timm

    Functional Enumeration in Gawk 3.1.7

    Contents

    Synopsis

    all( fun, array [,max])

    collect( fun, array1, array2 [,max])

    select( fun, array1, array2 [,max])

    reject( fun, array1, array2 [,max])

    detect( fun, array [,max])

    inject( fun, array, carry [,max])

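    collect, select and reject return the size of array2. detect returns the item it found (or Fail), inject returns the final carry, and all returns nothing.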

    Description

    An interesting new feature in Gawk 3.1.7 is indirect functions. This allows the function name to be a variable, passed as an argument to a function, and called using the syntax

    @fun(arg1,arg2,...)    
    

    This enables a new kind of functional programming style in Gawk. For example, generic enumeration patterns can be coded once, then called many different ways with different function names passed as arguments.

    This document illustrates this style of programming.

    Enumerators

    For example, here are some standard enumeration functions:

    all(fun,array [,max])

    Applies the function fun to all items in the array. If called with the max argument, then they are iterated in the order i=1 .. max, otherwise we use for(i in a).

    collect(fun,array1,array2 [,max])

    Applies fun to each item in array1 and collects the results in array2.

    select(fun,array1,array2 [,max])

    Find all the items in array1 that satisfy fun and add them to array2.

    reject(fun,array1,array2 [,max])

    Find all the items in array1 that do not satisfy fun and add them to array2.

    detect(fun,array [,max])

    Return the first item found in array that satisfies fun. If no such item is found, then return the magic global value Fail.

    inject(fun,array,carry [,max])

    (This one is a little tricky.) The result of applying fun to each item in array is carried into the processing of the next item. Initially, the carried value is carry. This function returns the final carry.

    Sample Functions

    To illustrate the above, consider the following functions. Each of these is defined for one array item.

    function odd(x)    { return (x % 2) == 1 }
    function show(x)   { print "[" x "]" }
    function mult(x,y) { return x * y }
    function halve(x)  { return x/2 }
    

    Using the Functions

    • All-ing...
      function do_all(   arr) { 
          split("22 23 24 25 26 27 28",arr)
          all("show",arr)
      }
      

      When we run this ...

      eg/enum1

      gawk317="$HOME/opt/gawk/bin/gawk"
      $gawk317 -f ../enumerate.awk --source 'BEGIN { do_all() }'
      

      we see every item in arr printed using the above show function ...

      eg/enum1.out

      [25]
      [26]
      [27]
      [28]
      [22]
      [23]
      [24]
      
    • Collect-ing...
      function do_collect(        max,arr1,arr2,i) {
          max=split("22 23 24 25 26 27 28",arr1)
          collect("halve",arr1,arr2,max)
          for(i=1;i<=max;i++) print arr2[i]
      }
      

      When we run this ...

      eg/enum2

      gawk317="$HOME/opt/gawk/bin/gawk"
      $gawk317 -f ../enumerate.awk --source 'BEGIN { do_collect() }'
      

      we see every item in arr divided in two ...

      eg/enum2.out

      11
      11.5
      12
      12.5
      13
      13.5
      14
      
    • Select-ing...
      function do_select(        all,less,arr1,arr2,i) {
          all  = split("22 23 24 25 26 27 28",arr1)
          less = select("odd",arr1,arr2,all)
          for(i=1;i<=less;i++) print arr2[i]
      }
      

      When we run this ...

      eg/enum3

      gawk317="$HOME/opt/gawk/bin/gawk"
      $gawk317 -f ../enumerate.awk --source 'BEGIN { do_select() }'
      

      we see every item in arr that satisfies odd....

      eg/enum3.out

      23
      25
      27
      
    • Reject-ing...
      function do_reject(        all,less,arr1,arr2,i) {
          all  = split("22 23 24 25 26 27 28",arr1)
          less = reject("odd",arr1,arr2,all)
          for(i=1;i<=less;i++) print arr2[i]
      }
      

      When we run this ...

      eg/enum4

      gawk317="$HOME/opt/gawk/bin/gawk"
      $gawk317 -f ../enumerate.awk --source 'BEGIN { do_reject() }'
      

      we see every item in arr that does not satisfy odd....

      eg/enum4.out

      22
      24
      26
      28
      
    • Detect-ing...
      function do_detect(        all,arr1) {
          all  = split("22 23 24 25 26 27 28",arr1)
          print detect("odd",arr1,all)   
      }
      

      When we run this ...

      eg/enum5

      gawk317="$HOME/opt/gawk/bin/gawk"
      $gawk317 -f ../enumerate.awk --source 'BEGIN { do_detect() }'
      

      we see the first item in arr that satisfies odd....

      eg/enum5.out

      23
      
    • Inject-ing...
      function do_inject(        all,less,arr1,arr2,i) {
          split("1 2 3 4 5",arr1)
          print inject("mult",arr1,1)
      }
      

      When we run this ...

      eg/enum6

      gawk317="$HOME/opt/gawk/bin/gawk"
      $gawk317 -f ../enumerate.awk --source 'BEGIN { do_inject() }'
      

      we see the result of multiplying together every item in arr (each item is multiplied into the value carried over from its predecessors).

      eg/enum6.out

      120
      

    Code

    Note one design principle in the following: any newly generated arrays have indexes 1..max where max is the number of elements in that array.

    all

    function all (fun,a,max,   i) {
    	if (max) 
    		for(i=1;i<=max;i++) @fun(a[i]) 
    	else  
    		for(i in a) @fun(a[i])
    }
    

    collect

    function collect (fun,a,b,max,   i,n) {
    	if (max)
    	    for(i=1;i<=max;i++) {n++; b[i]= @fun(a[i]) }
    	else
    	    for(i in a) {n++; b[i]= @fun(a[i])}
    	return n
    }
    

    select

    function select (fun,a,b,max,   i,n) {
    	if (max)
    		for(i=1;i<=max;i++) {
    		    if (@fun(a[i])) {n++; b[n]= a[i] }}
    	else
    		for(i in a) {
    		    if (@fun(a[i])) {n++; b[n]= a[i] }}
    	return n
    }
    

    reject

    function reject (fun,a,b,max,   i,n) {
    	if (max)
    		for(i=1;i<=max;i++) {
    		    if (! @fun(a[i])) {n++; b[n]= a[i] }}
    	else
    		for(i in a) {
    		    if (! @fun(a[i])) {n++; b[n]= a[i] }}
    	return n
    }
    

    detect

    BEGIN {Fail="someUnLIKELYSymbol"}
    function detect (fun,a,max,   i) {
    	if (max)
    		for(i=1;i<=max;i++) {
    			if (@fun(a[i])) return a[i] }
    	else	
    		for(i in a) {
    			if (@fun(a[i])) return a[i] }
    	return Fail
    }
    

    inject

    function inject (fun,a,carry,max,   i) {
    	if (max)
    		for(i=1;i<=max;i++)
    			 carry = @fun(a[i],carry) 
    	else
    		for(i in a)
    			 carry = @fun(a[i],carry) 
    	return carry
    }
    

    Bugs

    The above code does not pass around any state information that the fun functions can use. So all their deliberations are either with the current array values (integers or strings) or with global state. It might be worthwhile writing new versions of the above with one more argument, to carry that state.
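
    One possible shape for such a version is sketched below. The names select2, over and select_over are hypothetical, and the sketch needs a gawk that supports indirect calls; it simply hands one extra state argument to fun on every call.

```shell
# A state-carrying variant of select (requires gawk with indirect function calls).
select_over() {
    gawk '
    # select2: like select, plus a "state" argument passed to fun on every call
    function select2(fun, a, b, state, max,    i, n) {
        for (i = 1; i <= max; i++)
            if (@fun(a[i], state)) b[++n] = a[i]
        return n + 0
    }
    function over(x, limit) { return (x > limit) }
    BEGIN {
        max = split("22 23 24 25 26 27 28", arr1)
        n = select2("over", arr1, arr2, 25, max)   # 25 is the carried state
        for (i = 1; i <= n; i++) print arr2[i]
    }'
}
```

    Calling select_over prints the items of arr1 that are greater than the carried limit of 25.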

    Author

    Tim Menzies

    categories: Contribute,Jan,2009,Admin

    How to Contribute

    This web site is a front end to a repository of Awk code. The site and the code are maintained by the international awk community (which includes you) so there are many ways you can contribute:

    Link to this site from your home page

    Using this logo, link to http://awk.info:

    (By the way, our current logo is pretty lame. Want to contribute a better one? Please, be our guest!)

    Improve a Page

    Found a Typo? A Rendering Problem? Want to clarify something?

    Want to add some links?

    See the above instructions.

    How to Write Pages for this Site

    1. Write the page.
    2. Test the page by placing it on a publicly readable site, then see if it renders ok.
    3. Email the url of that page to mail@awk.info. Do NOT send the page.

    When writing a page, please follow these guidelines:

    • Do not use <hr> tags: these are reserved for dividing pages in a multi-page view.
    • Use only one <h1> tag at the top of page. Everything else should be <h2> or below.
    • Try to avoid tricky CSS/HTML styling. Vanilla HTML is best.
    • The page you write will end up being rendered as the middle pane of this site (around 550 pixels wide). So don't write wide pages.
    • If you include code samples, note that our CSS wraps pre-formatted code if it gets too wide. For example, at the time of this writing, the following pre-formatted text gets ugly after about 75 characters:
              1         2         3         4         5         6         7
    012345678901234567890123456789012345678901234567890123456789012345678901234567890
    

    Contributing Code

    To contribute code, zip up the directory and mail it to

    Coding Standards

    All function and file names are global to our code so please ensure your new function/file name does not clobber an old one.

    Optionally, you might consider adding:

    Add a Library Function File

    In the language of this site, a function file is a 100% standalone file containing one or more functions with no dependencies on other files. Note that if your function file depends on other files, then it becomes a package (see below).

    Functions are stored in a file called myfunc.awk.

    Add a Package

    In the language of this site, a package is a file that depends on other files (and the other files may depend on yet others, recursively).

    Following a recent discussion in comp.lang.awk, we say that these dependencies are commented with

    #use file.awk 
    

    where file.awk is some file (e.g. a file in the current directory).

    Note that file.awk will be loaded before the file containing the reference to #use file.awk.
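
    A loader that honours this convention could work out the load order along the following lines. This is a sketch only, not the site's actual tooling: the three files are throwaway examples, and the load function is a hypothetical helper that lists every dependency before the file that names it.

```shell
# Print the load order implied by "#use" comments, dependencies first.
dir=$(mktemp -d)
printf '#use b.awk\n#use c.awk\nfunction a() { return b() c() }\n' > "$dir/a.awk"
printf '#use c.awk\nfunction b() { return c() }\n'                 > "$dir/b.awk"
printf 'function c() { return 1 }\n'                               > "$dir/c.awk"

order=$(cd "$dir" && awk '
function load(file,    line, part) {
    if (file in seen) return             # already handled (also guards against cycles)
    seen[file] = 1
    while ((getline line < file) > 0)
        if (line ~ /^#use[ \t]+/) {
            split(line, part, /[ \t]+/)  # part[2] is the file named after "#use"
            load(part[2])                # list the dependency before its dependent
        }
    close(file)
    print file
}
BEGIN { load("a.awk") }')
echo "$order"
rm -rf "$dir"
```

    For the sample files this lists c.awk, then b.awk, then a.awk, so each file appears after everything it uses.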


    categories: Contribute,Jan,2009,Timm

    Pretty Print AWK Code

    The code that renders the awk.info web site can "pretty print" awk code. For example:

    To enable that pretty print, add some html syntax inside your code and apply the following conventions.

    Preview Engine

    If you want to see your code "looking pretty", you can check how it looks using our preview tool:

    http://awk.info/?awk:urlWithoutHTTPprefix
    

    For example, the file http://menzies.us/tmp/xx.awk can be previewed using http://awk.info/?awk:menzies.us/tmp/xx.awk

    Contributing Pretty Code

    Once you've got it "looking pretty", please consider contributing that code to awk.info, so our code library can grow. To do so, either email mail@awk.info with the URL of your pretty code or zip up the files and email them across.

    HTML-based Commenting Conventions

    The first paragraph of the file will be ignored. Use this first para for copyright notices or comments about down-in-the weeds trivia. Note: the first para ends with one blank line.

    The next paragraph should start with

    #.H1 <join>Title</join>

    The code should be topped and tailed as follows:

    #<pre>
    code
    #</pre>
    

    All other comment lines should start with a single "#" at front-of-line. These comment characters will be stripped away by the awk.info renderer.

    Awk.info's renderer adopts the following html shorthand. If a line starts with

    #.WORD other words 
    

    then this is replaced with

    <WORD> other words</WORD>
    

    If no other words follow #.WORD then the line becomes just <WORD>
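
    The shorthand is simple enough to imitate in a few lines of Awk. The following is a rough sketch of the rule just described, not the awk.info renderer itself; the render function is hypothetical and ignores the special #.IN, #.CODE and #.BODY extensions.

```shell
# Expand "#.WORD other words" into "<WORD> other words</WORD>".
render() {
    awk '
    /^#\.[A-Za-z0-9]+/ {
        tag = substr($1, 3)                       # "#.H2" becomes "H2"
        rest = $0
        sub(/^#\.[A-Za-z0-9]+[ \t]*/, "", rest)   # the other words, if any
        if (rest == "") print "<" tag ">"
        else            print "<" tag "> " rest "</" tag ">"
        next
    }
    { sub(/^#/, ""); print }                      # ordinary line: strip a leading "#"
    '
}
printf '#.H2 Synopsis\n#.P\n# plain comment\n' | render
```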

    Awk.info's renderer supports a few HTML extensions:

    • #.IN path includes a file found in the LAWKER repository at some path inside the trunk.
    • #.CODE path includes the contents of path, wrapped in <pre> tags, and prefixed by the path.
    • #.BODY path is the same as #.CODE but it skips the first paragraph (this is useful when the first paragraph includes tedious details you want to hide from the user).
    • Note that, for #.IN, #.CODE, #.BODY, the path must appear after a single space.

    That's it. Now you can pretty print your code on the web just by adding a little html in the comments.


    categories: Contribute,Jan,2009,Timm

    Show Unit Tests

    Ideally, all code in our code repository comes with unit tests:

    • Either demo scripts to show off functionality
    • Or a regression suite that checks that new changes do not mess up existing code.

    Accordingly, code offered to this site can contain unit tests, using the methods described in this page.

    But before going on, we stress that awk.info gratefully accepts awk contributions in any form. That is, including unit tests with code is optional.

    Files

    If your code is in directory yourcode then create a sub-directory yourcode/eg

    Write a test in a file yourcode/eg/yourtest. Divide that test into two parts:

    1. In the first paragraph of that file, write any tedious set up required to get the system ready for the test.
    2. In the second, third, etc paragraph, write the code that shows the test
    3. For example, in the following code, the real test comes after some low-level environmental set up:
      # assumes
      # - the LAWKER trunk has been checked out and
      # - .bash_profile contains: export Lawker="$HOME/svns/lawker/fridge"
      . $Lawker/lib/bash/setup
      
      gawk -f join.awk --source '
      BEGIN { split("tim tom tam",a)
              print join(a,2)
      }'
      

    Write the expected output of that test case in yourcode/eg/yourtest.out

    Regression Tests

    The above file conventions mean that an automatic tool can run over the entire code base and perform a regression test (checking if all the tests generate all the *.out files).
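
    Such a tool need not be complicated. The sketch below is one possible runner, not the one used by awk.info: the regress function is hypothetical, it assumes each test is a shell script that is run from inside the eg directory, and it compares the test's output with the matching .out file.

```shell
# Run every eg/<test> that has an eg/<test>.out and compare the two.
regress() {
    dir=$1; pass=0; fail=0
    for expected in "$dir"/eg/*.out; do
        [ -f "$expected" ] || continue            # directory has no tests
        name=$(basename "${expected%.out}")
        if (cd "$dir/eg" && sh "$name") 2>&1 | cmp -s - "$expected"
        then pass=$((pass + 1))
        else fail=$((fail + 1)); echo "FAILED: $dir/eg/$name"
        fi
    done
    echo "$pass passed, $fail failed"
}
```

    For example, regress yourcode prints one FAILED line per mismatch and ends with a summary count.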

    Displaying the Tests (and Output)

    Another advantage of the above scheme is that you can use the tests to document your code.

    To show the test case, add the following into your .awk file:

     #.BODY       yourcode/eg/yourtest
     #.CODE       yourcode/eg/yourtest.out
    

    Then zip the directory yourcode (including yourcode/eg) and send it to awk.info. Once we install those files on our site then when awk.info displays that file, the test case trivia is hidden and the users only see the essential details. For an example of this, see http://awk.info/?gawk/array/join.awk.


    categories: Learn,Jan,2009,Admin

    Learning Awk

    Short Overviews

    The following list is sorted by newbie-ness (so best to start at the top):

    Longer Tutorials

    The following list is sorted by the number of times this material is tagged at delicious.com (most tagged at top):

    Other Stuff


    categories: Learn,Jan,2009,Ronl

    Teaching Awk

    (For tutorial material on Awk, see the Learning Awk page.)

    R. Loui loui@ai.wustl.edu is Associate Professor of Computer Science, at Washington University in St. Louis. He has published in AI Journal, Computational Intelligence, ACM SIGART, AI Magazine, AI and Law, the ACM Computing Surveys Symposium on AI, Cognitive Science, Minds and Machines, Journal of Philosophy.

    Whenever Ronald Loui teaches GAWK, he gives the students the choice of learning PERL instead. Ninety percent will choose GAWK after looking at a few simple examples of each language (samples shown below). Those who choose PERL do so because someone told them to learn PERL.

    After one laboratory, more than half of the GAWK students are confident with their GAWK skills and can begin designing. Almost no student can become confident in PERL that quickly.

    After a week, 90% of those who have attempted GAWK have mastered it, compared to fewer than 50% of PERL students attaining similar facility with the language (it would be unfair to require one to `master' PERL).

    By the end of the semester, over 90% who have attempted GAWK have succeeded, and about two-thirds of those who have attempted PERL have succeeded.

    To be fair, within a year, half of the GAWK programmers have also studied PERL. Most are doing so in order to read PERL and will not switch to writing PERL. No one who learns PERL migrates to GAWK.

    PERL and GAWK appear to have similar programming, development, and debugging cycle times.

    Finally, there seems to be a small advantage for GAWK over PERL, after a year, for the programmer's willingness to begin a new program. That is, both GAWK and PERL programmers tend to enjoy writing a lot of programs, but GAWK has the slight edge here.


    categories: Learn,Jan,2009,Timm

    Four Keys to Gawk

    by T. Menzies

    Imagine Gawk as a kind of a cut-down C language with four tricks:

    1. self-initializing variables
    2. pattern-based programming
    3. regular expressions
    4. associative arrays.

    What do all these do? Well....

    Self-initializing variables.

    You don't need to define variables; they appear as you use them.

    There are only three types: strings, numbers, and arrays.

    To ensure a number is a number, add zero to it.

    x=x+0
    

    To ensure a string is a string, add an empty string to it.

    x= x "" "the string you really want to add"
    

    To ensure your variables aren't global, use them within a function and add more variables to the call. For example if a function is passed two variables, define it with two PLUS the local variables:

     function haslocals(passed1,passed2,         local1,local2,local3) {
            passed1=passed1+1  # changes only the local copy (scalars are passed by value)
            local1=7           # only changed locally
     }
    

    Note that it's good practice to add white space between passed and local variables.
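
    A short experiment shows why the extra parameters matter. In this sketch (the function names careless and careful are made up), both functions use a variable called tmp, but only the one that lists tmp among its parameters leaves the global tmp alone.

```shell
# Locals versus globals: only careless() overwrites the global tmp.
leak=$(awk '
function careless(x)          { tmp = x * 2; return tmp }   # tmp is global here
function careful(x,     tmp)  { tmp = x * 2; return tmp }   # tmp is local here
BEGIN {
    tmp = "untouched"
    careful(21);  print "after careful:  tmp = " tmp
    careless(21); print "after careless: tmp = " tmp
}')
echo "$leak"
```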

    Pattern-based programming

    Gawk programs can contain functions AND pattern/action pairs.

    If the pattern is satisfied, the action is called.

     /^\.P1/ { if (p != 0) print ".P1 after .P1, line", NR;
               p = 1;
             }
     /^\.P2/ { if (p != 1) print ".P2 with no preceding .P1, line", NR;
               p = 0;
             }
     END     { if (p != 0) print "missing .P2 at end" }
    

    Two magic patterns are BEGIN and END. These are true before and after all the input files are read. Use END for end actions (e.g. final reports) and BEGIN for start up actions such as initializing default variables, setting the field separator, resetting the seed of the random number generator:

     BEGIN {
            while ((getline < "Usr.Dict.Words") > 0) #slurp in dictionary; stops at EOF or if the file is missing
                    dict[$0] = 1
            FS=",";                            #set field separator
            srand();                           #reset random seed
            Round=10;                          #always start globals with U.C.
     }
    

    The default action is {print $0}; i.e. print the whole line.

    The default pattern is 1; i.e. true.

    Patterns are checked, top to bottom, in source-code order.

    Patterns can contain regular expressions. In the above example /^\.P1/ means "front of line followed by a full stop followed by P1". Regular expressions are important enough for their own section.
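
    The defaults are easiest to see in a tiny program. In the sketch below, run on three made-up input lines, the first rule has no action, the second has no pattern, and the rules fire in source-code order for every line.

```shell
# Default action, default pattern, and top-to-bottom rule order.
demo=$(printf '.P1\ntext\n.P2\n' | awk '
/^\.P1/                                  # pattern only: default action prints the line
        { n++ }                          # action only: default pattern matches every line
/text/  { print "seen text on line " NR }
END     { print n " lines" }')
echo "$demo"
```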

    A Small Example

    Ok, so now we know enough to explain a simple report function. How does hist.awk work in the following?

     
    % cat /etc/passwd | grep -v \# | cut -d: -f 6|sort |
                        uniq -c | sort -r -n | gawk -f hist.awk
    
                  **************************  26 /var/empty
                                          **   2 /var/virusmails
                                          **   2 /var/root
                                           *   1 /var/xgrid/controller
                                           *   1 /var/xgrid/agent
                                           *   1 /var/teamsserver
                                           *   1 /var/spool/uucp
                                           *   1 /var/spool/postfix
                                           *   1 /var/spool/cups
                                           *   1 /var/pcast/server
                                           *   1 /var/pcast/agent
                                           *   1 /var/imap
                                           *   1 /Library/WebServer
    

    hist.awk reads the largest count from line one (when NR==1), then works out a scale factor so that the longest bar fits within the maximum width. For each line, it then prints the line ($0) with some stars at front.

    NR==1  { Width = Width ? Width : 40   # sets Width if it is missing
             Scale = $1 > Width ? $1 / Width : 1   # counts per star
           }
           { Stars=int($1/Scale);   # shrink large counts to fit within Width
             print str(Width - Stars," ") str(Stars,"*") $0 
           }
    
    # note that, in the following "tmp" is a local variable
    function str(n,c, tmp) { # returns a string, size "n", of all  "c" 
        while((n--) > 0 ) tmp= c tmp 
        return tmp 
    }
    

    Regular Expressions

    Do you know what these mean?

    • /^[ \t\n]*/
    • /[ \t\n]*$/
    • /^[+-]?([0-9]+[.]?[0-9]*|[.][0-9]+)([eE][+-]?[0-9]+)?$/

    Well, the first two are leading and trailing blank spaces on a line and the last one is the definition of an IEEE-standard number written as a regular expression. Once we know that, we can do a bunch of common tasks like trimming away white space around a string:

      function trim(s,     t) {
        t=s;
        sub(/^[ \t\n]*/,"",t);
        sub(/[ \t\n]*$/,"",t);
        return t
     }
    

    or recognize something that isn't a number:

    if ( $i !~ /^[+-]?([0-9]+[.]?[0-9]*|[.][0-9]+)([eE][+-]?[0-9]+)?$/ ) 
        {print "ERROR: " $i " not a number"}
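
    To see the two ideas working together, here is a small sketch that trims each input line and then tests the result against the number pattern. The three input lines are invented sample data; trim is the function shown above.

```shell
# Assumed sample input: two numbers and one word, padded with blanks and tabs.
trim_out=$(printf '  42  \n\t-3.5e2\n abc \n' | awk '
function trim(s,    t) {
  t = s
  sub(/^[ \t\n]*/, "", t)     # strip leading white space
  sub(/[ \t\n]*$/, "", t)     # strip trailing white space
  return t
}
{
  v = trim($0)
  if (v ~ /^[+-]?([0-9]+[.]?[0-9]*|[.][0-9]+)([eE][+-]?[0-9]+)?$/)
    print "[" v "] is a number"
  else
    print "[" v "] is not a number"
}')
echo "$trim_out"
```

    The brackets in the output make it easy to confirm that no white space survives around the trimmed value.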
    

    Regular expressions are an astonishingly useful tool supported by many languages (e.g. Awk, Perl, Python, Java). The following notes review the basics. For full details, see http://www.gnu.org/manual/Gawk-3.1.1/html_node/Regexp.html#Regexp.

    Syntax: Here's the basic building blocks of regular expressions:

    c
    matches the character c (assuming c is a character with no special meaning in regexps).

    \c
    matches the literal character c; e.g. tabs and newlines are \t and \n respectively.

    .
    matches any single character (in Gawk, this includes newline).

    ^
    matches the beginning of a line or a string.

    $
    matches the end of a line or a string.

    [abc...]
    matches any of the characters abc... (character class).

    [^abc...]
    matches any character except abc... (negated character class).

    r*
    matches zero or more r's.

    And that's enough to understand our trim function shown above. The regular expression /[ \t\n]*$/ means trailing whitespace; i.e. zero-or-more spaces, tabs, or newlines followed by the end of line.

    More Syntax:

    But that's only the start of regular expressions. There's lots more. For example:

    r+
    matches one or more r's.

    r?
    matches zero or one r's.

    r1|r2
    matches either r1 or r2 (alternation).

    r1r2
    matches r1, and then r2 (concatenation).

    (r)
    matches r (grouping).

    Now we can read ^[+-]?([0-9]+[.]?[0-9]*|[.][0-9]+)([eE][+-]?[0-9]+)?$ like this:

    ^[+-]? ...
    Numbers begin with zero or one plus or minus signs.

    ...[0-9]+...
    Simple numbers are just one or more digits.

    ...[.]?[0-9]*...
    which may be followed by a decimal point and zero or more digits.

    ...|[.][0-9]+...
    Alternatively, a number can have no leading digits and just start with a decimal point.

    .... ([eE]...)?$
    Also, there may be an exponent added

    ...[+-]?[0-9]+)?$
    and that exponent is a positive or negative bunch of digits.

    Associative arrays

    Gawk has arrays, but they are only indexed by strings. This can be very useful, but it can also be annoying. For example, we can count the frequency of words in a document (ignoring the icky part about printing them out):

    gawk '{for(i=1;i <=NF;i++) freq[$i]++ }' filename
    

    The array will hold an integer value for each word that occurred in the file. Unfortunately, this treats "foo", "Foo", and "foo," as different words. Oh well. How do we print out these frequencies? Gawk has a special "for" construct that loops over the indices of an array. This script is longer than most command lines, so it will be expressed as an executable script:

     #!/usr/bin/awk -f
      {for(i=1;i <=NF;i++) freq[$i]++ }
      END{for(word in freq) print word, freq[word]  }
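
    One possible way around the "foo"/"Foo"/"foo," problem is to normalize each word before counting it. The sketch below is not part of the original script: it assumes that lower-casing and stripping everything except letters and digits is an acceptable definition of "the same word", and it pipes the result through sort only to make the output order predictable.

```shell
# Assumed sample input: the same word in three different spellings.
freq_out=$(printf 'foo Foo foo,\nbar foo\n' | awk '
{
  for (i = 1; i <= NF; i++) {
    word = tolower($i)               # "Foo"  becomes "foo"
    gsub(/[^a-z0-9]/, "", word)      # "foo," becomes "foo"
    if (word != "") freq[word]++
  }
}
END { for (word in freq) print word, freq[word] }' | sort)
echo "$freq_out"
```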
    

    You can find out if an element exists in an array at a certain index with the expression:

    index in array
    

    This expression tests whether or not the particular index exists, without the side effect of creating that element if it is not present.
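
    The difference between testing with "in" and simply referencing an element can be seen in a short sketch. The array name freq and the helper count are made up for this illustration.

```shell
in_out=$(awk '
function count(a,    k, n) { for (k in a) n++; return n + 0 }
BEGIN {
  freq["apple"] = 2
  if ("pear" in freq) print "pear found"
  else                print "pear missing"
  print "elements after in-test:", count(freq)
  if (freq["pear"] == "") print "pear reads as empty"   # this reference creates freq["pear"]
  print "elements after reference:", count(freq)
}')
echo "$in_out"
```

    The "in" test leaves the array at one element, while the plain reference quietly grows it to two.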

    You can remove an individual element of an array using the delete statement:

    delete array[index]
    

    It is not an error to delete an element which does not exist.

    Gawk has a special kind of for statement for scanning an array:

     for (var in array)
            body
    

    This loop executes body once for each different value that your program has previously used as an index in array, with the variable var set to that index.

    The order in which the array is scanned is not defined.

    To scan an array in some numeric order, you need to use keys 1,2,3,... and store somewhere that the array is N long. Then you can loop over the keys from 1 to N.

    Here are some useful array functions. We begin with the usual stack stuff. These stacks have items 1,2,3,... and position 0 is reserved for the size of the stack:

     function top(a)        {return a[a[0]]}
     function push(a,x,  i) {i=++a[0]; a[i]=x; return i}
     function pop(a,   x,i) {
       i=a[0];  # only shrink the stack if it is not already empty
       if (!i) {return ""} else {a[0]--; x=a[i]; delete a[i]; return x}}
    

    The pop function can be used in the usual way:

     BEGIN {push(a,1); push(a,2); push(a,3);
            while(x=pop(a)) print x }
     3
     2
     1
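
    Because the stack keeps its size in position 0, it can also be walked in numeric order without popping anything. A minimal sketch, reusing the push function from above with made-up items:

```shell
stack_out=$(awk '
function push(a, x,    i) { i = ++a[0]; a[i] = x; return i }
BEGIN {
  push(s, "x"); push(s, "y"); push(s, "z")
  for (i = 1; i <= s[0]; i++)    # s[0] holds the number of items
    print i, s[i]
}')
echo "$stack_out"
```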
    

    We can dump everything in an array into a string:

     function a2s(a,  i,s) {
            s=""; 
            for (i in a) {s=s " " i "= [" a[i]"]\n"}; 
            return s}
    
      BEGIN {push(L,1); push(L,2); push(L,3);
            print a2s(L);}
      0= [3]
      1= [1]
      2= [2]
      3= [3]
    

    And we can go the other way and convert a string into an array using the built in split function. These pod files were built using a recursive include function that seeks patterns of the form:

    ^=include file

    This function splits lines on space characters into the array `a' then looks for =include in a[1]. If found, it calls itself recursively on a[2]. Otherwise, it just prints the line:

     function rinclude (line,    x,a) {
       split(line,a,/ /);
       if ( a[1] ~ /^\=include/ ) { 
         while ( ( getline x < a[2] ) > 0) rinclude(x);
         close(a[2])}
       else {print line}
     }
    

    Note that the third argument of the split function can be any regular expression.
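
    For instance, a run of digits can serve as the separator. In this small sketch the input string and the array name parts are invented for illustration:

```shell
split_out=$(awk 'BEGIN {
  n = split("a1b22c333d", parts, /[0-9]+/)   # any run of digits separates items
  for (i = 1; i <= n; i++)
    printf "%s%s", parts[i], (i < n ? " " : "\n")
}')
echo "$split_out"
```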

    By the way, here's a nice trick with arrays. To print the lines in a file in a random order:

     BEGIN {srand()}
           {Array[rand()]=$0}
     END   {for(I in Array) print Array[I]}
    

    Short, heh? This is not a perfect solution. Gawk can only generate 1,000,000 different random numbers so the birthday paradox cautions that there is a small chance that lines will be lost when different lines are written to the same randomly selected location. After some experiments, I can report that you lose around one item after 1,000 inserts and 10 to 12 items after 10,000 random inserts. Nothing to write home about really. But for larger item sets, the above three-liner is not what you want to use. For example, 10,000 to 12,000 items (more than 10%) are lost after 100,000 random inserts. Not good!
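
    One way to avoid losing lines is a different technique altogether: tag every line with a random number, sort on the tag, then cut the tag off again. This is only a sketch (it swaps the array trick for the external sort and cut commands, and the 1,000 generated test lines are an assumption), but two lines that draw the same random number merely tie in the sort; neither is overwritten.

```shell
# Generate 1000 distinct test lines, shuffle them, then count what survived.
shuffled=$(awk 'BEGIN { for (i = 1; i <= 1000; i++) print "line " i }' |
  awk 'BEGIN { srand() } { printf "%.10f\t%s\n", rand(), $0 }' |
  sort -n | cut -f 2-)
kept=$(printf '%s\n' "$shuffled" | sort -u | wc -l | tr -d ' ')
echo "distinct lines kept: $kept"
```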


    categories: OneLiners,Learn,Jan,2009,Admin

    Awk one-liners

    Awk is famous for how much it can do in one line.

    This site has many samples of that capability. And if you have any more to add, please send them in.


    categories: OneLiners,Learn,Jan,2009,EricP

    Handy One-Liners For Awk (v0.22)

    Eric Pement
    pemente@northpark.edu

    Latest version of this file is usually at:
    http://www.student.northpark.edu/pemente/awk/awk1line.txt

    USAGE

    Unix:     awk '/pattern/ {print "$1"}'    # standard Unix shells
    DOS/Win:  awk '/pattern/ {print "$1"}'    # okay for DJGPP compiled
              awk "/pattern/ {print \"$1\"}"  # required for Mingw32
    

    Most of my experience comes from versions of GNU awk (gawk) compiled for Win32. Note in particular that DJGPP compilations permit the awk script to follow Unix quoting syntax '/like/ {"this"}'. However, the user must know that single quotes under DOS/Windows do not protect the redirection arrows (<, >) nor do they protect pipes (|). Both are special symbols for the DOS/CMD command shell and their special meaning is ignored only if they are placed within "double quotes." Likewise, DOS/Win users must remember that the percent sign (%) is used to mark DOS/Win environment variables, so it must be doubled (%%) to yield a single percent sign visible to awk.

    If I am sure that a script will NOT need to be quoted in Unix, DOS, or CMD, then I normally omit the quote marks. If an example is peculiar to GNU awk, the command 'gawk' will be used. Please notify me if you find errors or new commands to add to this list (total length under 65 characters). I usually try to put the shortest script first.

    File Spacing

    Double space a file

     awk '1;{print ""}'
     awk 'BEGIN{ORS="\n\n"};1'
    

    Double space a file which already has blank lines in it. Output file should contain no more than one blank line between lines of text. NOTE: On Unix systems, DOS lines which have only CRLF (\r\n) are often treated as non-blank, and thus 'NF' alone will return TRUE.

    awk 'NF{print $0 "\n"}'
    

    Triple space a file

    awk '1;{print "\n"}'

    Numbering and Calculations

    Precede each line by its line number FOR THAT FILE (left alignment). Using a tab (\t) instead of space will preserve margins.

    awk '{print FNR "\t" $0}' files*

    Precede each line by its line number FOR ALL FILES TOGETHER, with tab.

    awk '{print NR "\t" $0}' files*

    Number each line of a file (number on left, right-aligned). Double the percent signs if typing from the DOS command prompt.

    awk '{printf("%5d : %s\n", NR,$0)}'

    Number each line of file, but only print numbers if line is not blank. Remember caveats about Unix treatment of \r (mentioned above).

    awk 'NF{$0=++a " :" $0};{print}'
     awk '{print (NF? ++a " :" :"") $0}'
    

    Count lines (emulates "wc -l")

    awk 'END{print NR}'

    Print the sums of the fields of every line

    awk '{s=0; for (i=1; i<=NF; i++) s=s+$i; print s}'

    Add all fields in all lines and print the sum

    awk '{for (i=1; i<=NF; i++) s=s+$i}; END{print s}'

    Print every line after replacing each field with its absolute value

     awk '{for (i=1; i<=NF; i++) if ($i < 0) $i = -$i; print }'
     awk '{for (i=1; i<=NF; i++) $i = ($i < 0) ? -$i : $i; print }'
    

    Print the total number of fields ("words") in all lines

     awk '{ total = total + NF }; END {print total}' file

    Print the total number of lines that contain "Beth"

     awk '/Beth/{n++}; END {print n+0}' file

    Print the largest first field and the line that contains it. Intended for finding the longest string in field #1.

    awk '$1 > max {max=$1; maxline=$0}; END{ print max, maxline}'

    Print the number of fields in each line, followed by the line

    awk '{ print NF ":" $0 } '

    Print the last field of each line

    awk '{ print $NF }'

    Print the last field of the last line

    awk '{ field = $NF }; END{ print field }'

    Print every line with more than 4 fields

    awk 'NF > 4'

    Print every line where the value of the last field is > 4

    awk '$NF > 4'

    Text Conversion and Substitution

    IN UNIX ENVIRONMENT: convert DOS newlines (CR/LF) to Unix format

    awk '{sub(/\r$/,"");print}'   # assumes EACH line ends with Ctrl-M

    IN UNIX ENVIRONMENT: convert Unix newlines (LF) to DOS format

    awk '{sub(/$/,"\r");print}'

    IN DOS ENVIRONMENT: convert Unix newlines (LF) to DOS format

    awk 1

    IN DOS ENVIRONMENT: convert DOS newlines (CR/LF) to Unix format. Cannot be done with DOS versions of awk, other than gawk:

    gawk -v BINMODE="w" '1' infile >outfile

    Use "tr" instead.

     tr -d \r <infile >outfile # GNU tr version 1.22 or higher

    Delete leading whitespace (spaces, tabs) from front of each line; aligns all text flush left

    awk '{sub(/^[ \t]+/, ""); print}'

    Delete trailing whitespace (spaces, tabs) from end of each line

    awk '{sub(/[ \t]+$/, "");print}'
    

    Delete BOTH leading and trailing whitespace from each line

    awk '{gsub(/^[ \t]+|[ \t]+$/,"");print}'
    awk '{$1=$1;print}'           # also removes extra space between fields
    

    Insert 5 blank spaces at beginning of each line (make page offset)

    awk '{sub(/^/, "     ");print}'
    

    Align all text flush right on a 79-column width

    awk '{printf "%79s\n", $0}' file*
    

    Center all text on a 79-character width

    awk '{l=length();s=int((79-l)/2); printf "%"(s+l)"s\n",$0}' file*
    

    Substitute (find and replace) "foo" with "bar" on each line

    awk '{sub(/foo/,"bar");print}'           # replaces only 1st instance
    gawk '{$0=gensub(/foo/,"bar",4);print}'  # replaces only 4th instance
    awk '{gsub(/foo/,"bar");print}'          # replaces ALL instances in a line
    

    Substitute "foo" with "bar" ONLY for lines which contain "baz"

    awk '/baz/{gsub(/foo/, "bar")};{print}'
    

    Substitute "foo" with "bar" EXCEPT for lines which contain "baz"

    awk '!/baz/{gsub(/foo/, "bar")};{print}'
    

    Change "scarlet" or "ruby" or "puce" to "red"

    awk '{gsub(/scarlet|ruby|puce/, "red"); print}'
    

    Reverse order of lines (emulates "tac")

    awk '{a[i++]=$0} END {for (j=i-1; j>=0;) print a[j--] }' file*
    

    If a line ends with a backslash, append the next line to it (fails if there are multiple lines ending with backslash...)

    awk '/\\$/ {sub(/\\$/,""); getline t; print $0 t; next}; 1' file*
    

    Print and sort the login names of all users

    awk -F ":" '{ print $1 | "sort" }' /etc/passwd
    

    Print the first 2 fields, in opposite order, of every line

    awk '{print $2, $1}' file
    

    Switch the first 2 fields of every line

    awk '{temp = $1; $1 = $2; $2 = temp; print}' file
    

    Print every line, deleting the second field of that line

    awk '{ $2 = ""; print }'
    

    Print in reverse order the fields of every line

    awk '{for (i=NF; i>0; i--) printf("%s ",$i);printf ("\n")}' file
    

    Remove duplicate, consecutive lines (emulates "uniq")

    awk 'a !~ $0; {a=$0}'
    

    Remove duplicate, nonconsecutive lines

    awk '! a[$0]++'                     # most concise script
    awk '!($0 in a) {a[$0];print}'      # most efficient script
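
    The concise version works because a[$0]++ yields the old count (zero the first time a line is seen) and only then increments it, so the negated pattern is true exactly once per distinct line. A small sketch with made-up input, using the hypothetical array name seen:

```shell
# Assumed sample input: "a" and "b" each appear twice, not on adjacent lines.
dedup_out=$(printf 'a\nb\na\nc\nb\n' | awk '!seen[$0]++')
echo "$dedup_out"
```

    Only the first occurrence of each line is printed, and the original order is kept.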
    

    Concatenate every 5 lines of input, using a comma separator between fields

    awk 'ORS=NR%5?",":"\n"' file
    

    Selective Printing of Certain Lines

    Print first 10 lines of file (emulates behavior of "head")

    awk 'NR < 11'
    

    Print first line of file (emulates "head -1")

    awk 'NR>1{exit};1'
    

    Print the last 2 lines of a file (emulates "tail -2")

    awk '{y=x "\n" $0; x=$0};END{print y}'
    

    Print the last line of a file (emulates "tail -1")

    awk 'END{print}'
    

    Print only lines which match regular expression (emulates "grep")

    awk '/regex/'
    

    Print only lines which do NOT match regex (emulates "grep -v")

    awk '!/regex/'
    

    Print the line immediately before a regex, but not the line containing the regex

    awk '/regex/{print x};{x=$0}'
     awk '/regex/{print (x=="" ? "match on line 1" : x)};{x=$0}'
    

    Print the line immediately after a regex, but not the line containing the regex

    awk '/regex/{getline;print}'
    

    Grep for AAA and BBB and CCC (in any order)

    awk '/AAA/; /BBB/; /CCC/'
    

    Grep for AAA and BBB and CCC (in that order)

    awk '/AAA.*BBB.*CCC/'
    

    Print only lines of 65 characters or longer

    awk 'length > 64'
    

    Print only lines of less than 65 characters

    awk 'length < 65'
    

    Print section of file from regular expression to end of file

    awk '/regex/,0'
    awk '/regex/,EOF'
    

    Print section of file based on line numbers (lines 8-12, inclusive)

    awk 'NR==8,NR==12'
    

    Print line number 52

    awk 'NR==52'
    awk 'NR==52 {print;exit}'          # more efficient on large files
    

    Print section of file between two regular expressions (inclusive)

    awk '/Iowa/,/Montana/'             # case sensitive
    

    Selective Deletion of Certain Lines:

    Delete ALL blank lines from a file (same as "grep '.' ")

    awk NF
    awk '/./'
    

    Credits and Thanks

    Special thanks to Peter S. Tillier for helping me with the first release of this FAQ file.

    For additional syntax instructions, including the way to apply editing commands from a disk file instead of the command line, consult:

    • "sed & awk, 2nd Edition," by Dale Dougherty and Arnold Robbins O'Reilly, 1997
    • "UNIX Text Processing," by Dale Dougherty and Tim O'Reilly Hayden Books, 1987
    • "Effective awk Programming, 3rd Edition." by Arnold Robbins O'Reilly, 2001

    To fully exploit the power of awk, one must understand "regular expressions." For detailed discussion of regular expressions, see

    • "Mastering Regular Expressions, 2d edition" by Jeffrey Friedl (O'Reilly, 2002).

    The manual ("man") pages on Unix systems may be helpful (try "man awk", "man nawk", "man regexp", or the section on regular expressions in "man ed"), but man pages are notoriously difficult. They are not written to teach awk use or regexps to first-time users, but as a reference text for those already acquainted with these tools.

    USE OF '\t' IN awk SCRIPTS: For clarity in documentation, we have used the expression '\t' to indicate a tab character (0x09) in the scripts. All versions of awk, even the UNIX System 7 version should recognize the '\t' abbreviation.


    categories: OneLiners,Learn,Jan,2009,Admin

    Explaining Pement's One Liners

    Peteris Krumins explaining Eric Pement's Awk one-liners:


    categories: TenLiners,Learn,Jan,2009,Admin

    Awk ten-liners

    Awk is famous for how much it can do in (around) 10 lines. Here are some samples of that capability.

    (And if you have any more to add, please send them in.)


    categories: TenLiners,Learn,Jan,2009,Ronl

    Some Gawk (and PERL) Samples

    by R. Loui

    Here are a few short programs that do the same thing in each language. When reading these examples, the question to ask is `how many language features do I need to understand in order to understand the syntax of these examples'.

    Some of these are longer than they need to be since they don't exploit some (e.g.) command line trick to wrap the code in for each line do X. And that is the point- for teach-ability, the preferred language is the one you need to know LESS about before you can be useful in it.

    hello world

    PERL:

     print "hello world\n"
    

    GAWK:

     BEGIN { print "hello world" }
    

    One plus one

    PERL

     $x= $x+1;
    

    GAWK

     x= x+1
    

    Printing

    PERL

     print $x, $y, $z;
    

    GAWK

     print x,y,z
    

    Printing the first field in a file

    PERL

     while (<>) { 
       split(/ /);
       print "@_[0]\n" 
     }
    
    

    GAWK

     { print $1 }
    

    Printing lines, reversing fields

    PERL

     while (<>) { 
      split(/ /);
      print "@_[1] @_[0]\n" 
     }
    

    GAWK

     { print $2, $1 }
    

    Concatenation of variables

    PERL

     $command = "cat $fname1 $fname2 > $fname3"
    

    GAWK

     command = "cat " fname1 " " fname2 " > " fname3
    

    Looping

    PERL:

     for (1..10) { print $_,"\n" }
    

    GAWK:

     BEGIN { 
      for (i=1; i<=10; i++) print i
     }
    

    Pairs of numbers

    PERL:

     for (1..10) { print "$_ ",$_-1 }
     print "\n"
    

    GAWK:

     BEGIN { 
      for (i=1; i<=10; i++) printf i " " i-1
      print ""
     }
    

    List of words into a hash

    PERL

      foreach $x ( split(/ /,"this is not stored linearly") ) 
      { print "$x\n" }
    

    GAWK

     BEGIN { 
      split("this is not stored linearly",temp)
      for (i in temp) print temp[i]
     }
    

    Printing a hash in some key order

    PERL

     $n = split(/ /,"this is not stored linearly");
     for $i (0..$n-1) { print "$i @_[$i]\n" }
     print "\n";
     for $i (@_) { print ++$j," ",$i,"\n" }
    

    AWK

     BEGIN { 
      n = split("this is not stored linearly",temp)
      for (i=1; i<=n; i++) print i, temp[i]
      print ""
      for (i in temp) print i, temp[i]
     }
    

    Printing all lines in a file

    PERL

     open file,"/etc/passwd";
     while (<file>) { print $_ }
    

    GAWK

      BEGIN { 
      while ((getline < "/etc/passwd") > 0) print   # "> 0" stops on error as well as end of file
     }
    

    Printing a string

    PERL

     $x = "this " . "that " . "\n";
     print $x
    

    GAWK

     BEGIN {
      x = "this " "that " "\n" ; printf x
     }
    

    Building and printing an array

    PERL

     $assoc{"this"} = 4;
     $assoc{"that"} = 4;
     $assoc{"the other thing"} = 15;
     for $i (keys %assoc) { print "$i $assoc{$i}\n" }
    

    GAWK

     BEGIN {
       assoc["this"] = 4
       assoc["that"] = 4
       assoc["the other thing"] = 15
       for (i in assoc) print i,assoc[i]
     }
    

    Sorting an array

    PERL

     split(/ /,"this will be sorted once in an array");
     foreach $i (sort @_) { print "$i\n" }
    

    GAWK

     BEGIN {
      split("this will be sorted once in an array",temp," ")
      for (i in temp) print temp[i] | "sort"
      close("sort")   # sort prints its output once the pipe is closed
     }
    

    Sorting an array (#2)

    GAWK

     BEGIN {
      split("this will be sorted once in an array",temp," ")
      n=asort(temp)
      for (i=1;i<=n;i++) print temp[i] 
     }
    

    Print all lines, vowels changed to stars

    PERL

     while (<STDIN>) {
      s/[aeiou]/*/g;
      print $_
     }
    
    

    GAWK

     {gsub(/[aeiou]/,"*"); print }
    

    Report from file

    PERL

     #!/pkg/gnu/bin/perl
     # this is a comment
     #
     open(stream1,"w | ");
     while ($line = <stream1>) {
       ($user, $tty, $login, $junk) = split(/ +/, $line, 4);
       print "$user $login ",substr($line,49)
     }
    

    GAWK

    #!/pkg/gnu/bin/gawk -f
     # this is a comment
     #
     BEGIN {
       while ("w" | getline) {
         user = $1; tty = $2; login = $3
         print user, login, substr($0,49)
       }
     }
    

    Web Slurping

    PERL

     open(stream1,"lynx -dump 'cs.wustl.edu/~loui' | ");
     while ($line = <stream1>) {
       if ($flag && $line =~ /[0-9]/) { print $line }
       if ($line =~ /References/) { $flag = 1 }
     }
    
    

    GAWK

     BEGIN {
      com = "lynx -dump 'cs.wustl.edu/~loui' &> /dev/stdout"
      while (com | getline line) {
        if (flag && line ~ /[0-9]/) { print line }
        if (line ~ /References/) { flag = 1 }
      }
     }
    

    categories: Arrays,Apr,2009,Timm

    saya

    Synopsis

    saya(array [,label,sep,before,after,eq])

    Description

    Array printing function. Contents printed, sorted on key.

    Arguments

    array
    An array.
    label
    (OPTIONAL) A prefix before every item.
    sep
    (OPTIONAL) A string to print between each item. Defaults to new line.
    before
    (OPTIONAL) A string to print before the array. Defaults to new line.
    after
    (OPTIONAL) A string to print after the array. Defaults to new line.
    eq
    (OPTIONAL) A string to print between each key/value pair. Defaults to " = ".

    Returns

    Size of the array

    Notes

    The most common usage is to just use the first two arguments; e.g.

    saya(a,"name") ==>
    
    name[1] = tim
    name[2] = menzies
    

    For other usages, see the examples, below.

    Source

    function saya(a,s, sep0,b4,after,eq,   c,m,n,key,val,i,j,tmp,sep,b,pre,post) {
    	sep0  = sep0  ? sep0  : "\n"
    	b4    = b4    ? b4    : "\n"
    	after = after ? after : "\n"
    	eq    = eq    ? eq    : " = "
    	pre   = s     ? s"["  : ""
    	post  = s     ? "]"   : ""
    	m     = asorti(a,b)
    	printf("%s",b4)
    	for(i=1;i<=m;i++)  {
    		key=b[i]
    		val=a[b[i]]
    		printf("%s", sep pre  )
    		n=split(key,tmp,SUBSEP)
    		c = ""
    		for(j=1;j<=n;j++)	{	
    			printf("%s", c tmp[j]  )
    			c=","
    		}
    		printf("%s", post eq val )
    		sep=sep0;
    	};
    	printf("%s",after)
    	return m
    }
    

    Example

    gawk/array/eg/saya »

    gawk -f saya.awk --source '
    BEGIN { 	
    	A["fname"  ] = "tim"
    	A["lname"  ] = "menzies"
    	A["address"] = "usa"
    	saya(A,"",", ","[","]")
    	print ""
    	saya(A,"message")
    	B[2,3,9]   = 100
    	B[10,1,11] = 200
    	B[1,3,10]  = 300
    	saya(B,"b")
    }'
    
    

    gawk/array/eg/saya.out »

    [address = usa, fname = tim, lname = menzies]
    
    message[address] = usa
    message[fname] = tim
    message[lname] = menzies
    
    b[1,3,10] = 300
    b[10,1,11] = 200
    b[2,3,9] = 100
    

    Author

    Tim Menzies


    categories: Timm,Arrays,Function,Feb,2009,ArnoldR

    join

    Synopsis

    join(a [,start,end,sep])

    Description

    Joins an array into a string

    Arguments

    a
    input array
    start
    Index for where to start in the array a. Default=1.
    end
    Index for where to stop in the array a. Default=size of array
    sep
    (OPTIONAL) What to write between each item. Defaults to blank space.

    If sep is set to the magic value SUBSEP then internally, join adds nothing between the items.

    Returns

    A string of a's contents.

    Example

    gawk/array/eg/join »

    gawk -f join.awk --source '
    BEGIN { split("tim tom tam",a)
            print join(a,2)
    }'
    

    gawk/array/eg/join.out »

    tom tam
    

    Source

    function join(a,start,end,sep,    result,i) {
        sep   = sep   ? sep   :  " "
        start = start ? start : 1
        end   = end   ? end   : sizeof(a)
        if (sep == SUBSEP) # magic value
           sep = ""
        result = a[start]
        for (i = start + 1; i <= end; i++)
            result = result sep a[i]
        return result
    }
    

    Helper

    In earlier gawks, length(a) did not work in functions. Hence....

    function sizeof(a,   i,n) { for(i in a) n++ ; return n }
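
    Putting join and sizeof together, the sketch below exercises the optional arguments. It assumes the sep default reads "sep ? sep : ..." (so that a caller's separator is actually used) and adds "+ 0" in sizeof so an empty array yields a number; the sample names are the same as in the example above.

```shell
join_out=$(awk '
function sizeof(a,    i, n) { for (i in a) n++; return n + 0 }
function join(a, start, end, sep,    result, i) {
  sep   = sep   ? sep   : " "
  start = start ? start : 1
  end   = end   ? end   : sizeof(a)
  if (sep == SUBSEP) sep = ""        # magic value: join with nothing
  result = a[start]
  for (i = start + 1; i <= end; i++) result = result sep a[i]
  return result
}
BEGIN {
  split("tim tom tam", a)
  print join(a)                 # all defaults
  print join(a, 1, 2, "-")      # explicit range and separator
  print join(a, 2, 3, SUBSEP)   # no separator at all
}')
echo "$join_out"
```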
    

    Change Log

    • Jan 24'08: defaults extended to include start,stop
    • Jan 24'08: Sizeof added to handle old gawk bug

    Author

    Arnold Robbins, then Tim Menzies


    categories: Arrays,Function,Feb,2009,Admin

    array

    Synopsis

    array(a)

    Description

    Ensure that an array is empty

    Arguments

    a
    input array

    Example

    gawk/array/eg/array »

    gawk -f array.awk --source '
    BEGIN { array(A);
            A[1]=2;
    	print length(A);
    	array(A);
    	print length(A);
    }'
    

    gawk/array/eg/array.out »

    1
    0
    

    Source

    function array(a) { split("",a,"") }
    

    categories: Sorting,Tools,Nov,2009,EdM

    Sorting in Awk

    Contents

    Download

    About

    Code

    selSort

    keySort

    genSort

    Main Loop

    Author

    Download

    Download from LAWKER.

    About

    Below is a script I wrote to demonstrate how to use arrays, functions, numerical vs string comparison, etc.

    It also provides a framework for people to implement sorting algorithms for comparison. I've implemented a couple and I'm hoping others will contribute more in the same style.

    I deliberately put in very few comments. Given some small amount of reading of awk manuals, I think the only parts that are hard to understand are the actual sorting algorithms, and those should be well documented already given a reference. The exception is my made-up "Key Sort", but I think that's very easy to understand.

    Code

    selSort

    Selection Sort, O(n^2): http://en.wikipedia.org/wiki/Selection_sort

    function selSort(keyArr,outArr,   swap,thisIdx,minIdx,cmpIdx,numElts) {
      for (thisIdx in keyArr) {
          outArr[++numElts] = thisIdx
      }
      for (thisIdx=1; thisIdx<=numElts; thisIdx++) {
          minIdx = thisIdx
          for (cmpIdx=thisIdx + 1; cmpIdx <= numElts; cmpIdx++) {
              if (keyArr[outArr[minIdx]] > keyArr[outArr[cmpIdx]]) {
                  minIdx = cmpIdx
              }
          }
          if (thisIdx != minIdx) {
              swap = outArr[thisIdx]
              outArr[thisIdx] = outArr[minIdx]
              outArr[minIdx] = swap
          }
      }
      return numElts+0
    }
    

    keySort

    Key Sort O(n^2): made up by Ed Morton for simplicity.

    function keySort(keyArr,outArr,   \
                    occArr,thisIdx,thisKey,cmpIdx,outIdx,numElts) {
      for (thisIdx in keyArr) {
          thisKey = keyArr[thisIdx]
          outIdx=++occArr[thisKey]  # start at 1 plus num occurrences
          for (cmpIdx in keyArr) {
              if (thisKey > keyArr[cmpIdx]) {
                  outIdx++
              }
          }
          outArr[outIdx] = thisIdx
          numElts++
      }
      return numElts+0
    }
    

    genSort

    This code demonstrates the use of arrays, functions, and string vs numeric comparisons in awk. It also provides a framework for people to implement various sorting algorithms in awk such as those listed at http://en.wikipedia.org/wiki/Sorting_algorithm

    Traverses the input array, storing its indices in the output array in sorted order of the input array elements. e.g.

     in:  inArr["foo"]="b"; inArr["bar"]="a"; inArr["xyz"]="b"
          outArr[] is empty
    
     out: inArr["foo"]="b"; inArr["bar"]="a"; inArr["xyz"]="b"
          outArr[1]="bar"; outArr[2]="foo"; outArr[3]="xyz"
    

    Can sort on specific fields given a field number and field separator.

    sortType of "n" means sort by numerical comparison, sort by string comparison otherwise.
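
    The reason the sort type matters is that awk compares strings character by character but numbers by value. The sketch below uses the same coercion idioms as genSort (appending "" forces a string, adding 0 forces a number) on two made-up values:

```shell
cmp_out=$(awk 'BEGIN {
  a = "10"; b = "9"
  if ((a "") < (b ""))   print "string:  10 sorts before 9"   # "1" < "9"
  if ((a + 0) > (b + 0)) print "numeric: 10 sorts after 9"    # 10 > 9
}')
echo "$cmp_out"
```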

    function genSort(sortAlg,sortType,inArr,outArr,fldNum,fldSep,           \
                  keyArr,thisIdx,thisArr) {
      if (fldNum) {
          if (sortType == "n") {
              for (thisIdx in inArr) {
                  split(inArr[thisIdx],thisArr,fldSep)
                  keyArr[thisIdx] = thisArr[fldNum]+0
              }
          } else {
              for (thisIdx in inArr) {
                  split(inArr[thisIdx],thisArr,fldSep)
                  keyArr[thisIdx] = thisArr[fldNum]""
              }
          }
      } else {
          if (sortType == "n") {
              for (thisIdx in inArr) {
                  keyArr[thisIdx] = inArr[thisIdx]+0
              }
          } else {
              for (thisIdx in inArr) {
                  keyArr[thisIdx] = inArr[thisIdx]""
              }
          }
      }
      if (sortAlg ~ /^sel/) {
          numElts = selSort(keyArr,outArr)
      } else {
          numElts = keySort(keyArr,outArr)
      }
      return numElts
    }
    

    Main Loop

     { inArr[NR]=$0 }
    # Output
    END {
      numElts = genSort(sortAlg,sortType,inArr,outArr,fldNum,FS)
      for (outIdx=1;outIdx<=numElts;outIdx++) {
          print inArr[outArr[outIdx]]
      }
    }
    

    Author

    Ed Morton


    categories: ,Dec,2009,KennyM,EdM

    Awk's Equivalent to VI's J

    A recent discussion in comp.lang.awk demonstrated a very cute, and very succinct, awk trick.

    Neil Harris wanted to clean up this output:

    host1name.com 
    10.10.10.1 
    host2name.com 
    10.10.10.2 
    host3name.com 
    10.10.10.3 
    

    He was using an uppercase J in vi to manually move the hostname's IP address up onto the same line as its hostname. But he wanted to automate the task with awk.

    Kenny McCormack offered:

    ORS=NR%2?" ":"\n" 
    

    (Yes, that is the whole program.)

    Ed Morton offered a more elegant version:

    ORS=NR%2?FS:RS 
    

    Finally, Kenny McCormack commented:

    • I'm 98% sure that I personally invented the basic idea (ORS=... as the pattern, with no action - i.e., default action).
    • Ed's enhancement was using FS and RS instead of hardcoding space and newline. It's nice for two reasons:
      1. Saves a few golf strokes
      2. Is more "portable" (or "logical", if you look at it that way) in that if FS and RS had been assigned non-default values, they would be used.
    • Also, as he says, it is a very instructive 14 characters of AWK code.
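
    A quick way to try the trick is to feed it a few host/address pairs. In this sketch the input is a shortened copy of the listing above, without trailing blanks. On odd lines ORS becomes FS (a space), on even lines it becomes RS (a newline), and since both values are non-empty strings the pattern is always true, so every line is printed.

```shell
# Assumed sample input: two hostname/address pairs, one item per line.
paired=$(printf 'host1name.com\n10.10.10.1\nhost2name.com\n10.10.10.2\n' |
  awk 'ORS=NR%2?FS:RS')
echo "$paired"
```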

    categories: ,Sorting,Dec,2009,DebbieF

    Sorting Arrays Via the Shell

    Contents

    Synopsis

    Download

    Notes

    Code

    Example

    Main driver

    Author

    Synopsis

     o(array [,string,control])
    
    • If string is supplied, it is printed as a prefix to each list item.
    • If control is an integer, the array's contents are printed 1 to control.
    • If control is a string, it is passed as an argument to a UNIX sort command.

    Download

    Download from LAWKER.

    Notes

    Much has been written in comp.lang.awk and awk.info about using Awk code to sort Awk arrays. While all that code is clever and good, I wondered if a little shell scripting would simplify the task.

    On the plus side:

    • The code is very short (11 lines!).
    • The code's functionality is very easy to modify. If the third argument is a string, it is passed to a UNIX sort command. This command supports a very large list of control options.

    On the negative side:

    • This code is operating system dependent. It only works on Mac, UNIX, LINUX, and Windoze (with Cygwin installed).
    • When this code runs, it forks a sub-process. So it may be a little slower to run than the other methods documented in comp.lang.awk and awk.info.

    All that said, I use this code all the time- it is very useful during debugging to dump the contents of the internal structures in my Awk code.

    By the way, if you want to see an even shorter sort routine (that uses a platform independent shell programming trick), check out David Long's amazing quicksort.

    Code

    Example

    Input:

    function odemo(  a,b,i,n) {
    	n = split("watermelon,banana,apple,grape",a,/,/);
    	print "\nEG1"; o(a,"fruit")
    	print "\nEG2"; o(a,"fruit",3)
    	print "\nEG3"; o(a,"fruit","-k 6")
    	for(i in a)
    		b[a[i]] = i
    	print "\nEG4"; o(b,"fruit")
    	print "\nEG5"; o(b,"fruit","-r -k 2")
    }
    

    Output:

    gawk -f o.awk --source "BEGIN { odemo() }"
    

    Print the array, no control string. Defaults to sorting on the index.

    EG1
    fruit[ 1 ]      =        [ watermelon ]
    fruit[ 2 ]      =        [ banana ]
    fruit[ 3 ]      =        [ apple ]
    fruit[ 4 ]      =        [ grape ]
    

    Print array, passing a numeric control string. Prints only the first three items, sorted on the index.

    EG2
    fruit[ 1 ]      =        [ watermelon ]
    fruit[ 2 ]      =        [ banana ]
    fruit[ 3 ]      =        [ apple ]
    

    Print array, sorted on the contents.

    EG3
    fruit[ 3 ]      =        [ apple ]
    fruit[ 2 ]      =        [ banana ]
    fruit[ 4 ]      =        [ grape ]
    fruit[ 1 ]      =        [ watermelon ]
    

    Print an array with strings for keys. Prints in array label order.

    
    EG4
    fruit[ apple ]  =        [ 3 ]
    fruit[ banana ] =        [ 2 ]
    fruit[ grape ]  =        [ 4 ]
    fruit[ watermelon ]     =        [ 1 ]
    

    Print an array with strings for keys. Prints in reverse array label order.

    EG5
    fruit[ watermelon ]     =        [ 1 ]
    fruit[ grape ]  =        [ 4 ]
    fruit[ banana ] =        [ 2 ]
    fruit[ apple ]  =        [ 3 ]
    

    Main driver

    The code is short, yes?

    function o(a, str,control,   i,com) {   # com is local, so it cannot leak into the caller
       if (control ~ /^[0-9]/) 
          for(i=1;i<=control;i++)
             print str "[ " i " ]\t=\t [ " a[i] " ]"
       else  {
          com = control ? control : " -n -k 2" 
          com = "sort " com  " #" rand(); # ensure com is unique
          for(i in a)
             print str "[ " i " ]\t=\t [ " a[i] " ]" | com;
          close(com);
    }}
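
    The core idiom used by o() can be tried on its own. The following is a minimal sketch, not part of o.awk: it pipes array values into an external sort and then calls close() so that the sorted output appears before awk carries on. The fruit list is borrowed from the example above; the variable name cmd is an assumption.

```sh
out=$(awk 'BEGIN {
    n = split("watermelon,banana,apple,grape", a, ",")
    cmd = "sort"                      # sort options could be appended here
    for (i = 1; i <= n; i++)
        print a[i] | cmd              # every print feeds the same sort process
    close(cmd)                        # wait for sort to finish and emit its output
}')
printf '%s\n' "$out"
```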
    

    Author

    Debbie Forbes


    categories: Sorting,TenLiners,Apr,2009,Awk

    quicksort2.awk

    Contents

    Synopsis

    Download

    Description

    Code

    Bugs

    See also

    Copyright

    Author

    Synopsis

    cat numbers | gawk -f quicksort2.awk

    Download

    Download from LAWKER.

    Description

    Quicksort divides the input data around a randomly selected pivot, then recurses on the divided data.

    In quicksort2, the pivot is selected from the first line of input. Each data division is handled by a different UNIX pipe and recursive gawk processes are called on the divided data.

    Yes, this is not the fastest way to do it but (in theory anyway) it should be able to handle very big data sets.

    Code

    BEGIN   { 
             recurse1 = "gawk -f quicksort2.awk #" rand()
             recurse2 = "gawk -f quicksort2.awk #" rand()
            }
    NR == 1 { pivot=$0; next }
    NR > 1  { if($0 < pivot) print | recurse1
              if($0 > pivot) print | recurse2
            }
    END     { close(recurse1)
              if(NR > 0) print pivot
              close(recurse2)
            }
    

    Bugs

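    The output ignores repeated input values. I thought it was a problem with repeating the name of the pipes (hence the "rand()" labelling) but that did not fix the issue. (Note that a line equal to the pivot satisfies neither $0 < pivot nor $0 > pivot in the code above, so it is sent to neither pipe and never printed.)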
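
    The effect of the two strict comparisons can be checked with a small sketch of the partition step alone. This is an illustration under assumptions, not a patch to quicksort2.awk: the counters less, same and more are hypothetical names, and no recursion is performed.

```sh
out=$(printf '%s\n' 5 3 5 8 5 1 | awk '
NR == 1    { pivot = $0; same = 1; next }   # the first line is the pivot
$0 < pivot { less++; next }                 # would go to recurse1
$0 > pivot { more++; next }                 # would go to recurse2
           { same++ }                       # equal to the pivot: matched by neither test
END        { print less + 0, same + 0, more + 0 }')
printf '%s\n' "$out"
```

    Here three of the six lines equal the pivot. Counting them as shown, and printing the pivot that many times in END, would be one way to keep duplicates in the output.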

    See also

    quicksort.awk

    Copyright

    Copyright (c) 2009 by David Long.

    Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

    The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

    Author

    Original version: David Long, 2004. Tim Menzies added some modifications in 2009 to call recursive Gawk pipes on both sides of the pivot.


    categories: Tools,Nov,2009,PierreG

    levenshtein.awk

    Contents

    Synopsis

    Download

    Notes

    Code

    levdist

    Demo code

    Unit tests

    Author

    Synopsis

    gawk -f levenshtein.awk --source 'BEGIN {
            print levdist("kitten", "sitting")}' 
    

    (The above code should print "3").

    Download

    Download from LAWKER.

    Notes

    The Levenshtein edit distance calculation is useful for comparing text strings for similarity, such as would be done with a spell checker.

    Hi_saito (from awk.freeshell.org) has written what looks like a straightforward implementation of the reference algorithm described in the above-linked Wikipedia article. hi_saito's code is linked to rather than included outright because no licensing terms appear on the page.

    Gnomon (from awk.freeshell.org) is planning to write a more compact (and hopefully speedier) implementation that will appear here soon. The plan is to compute and retain only those values that are necessary to calculate the edit distance, rather than calculating the entire NxM matrix. The lazy-evaluation method, which can post substantial speed improvements, probably requires more effort and code complexity than the performance gains would be worth; still, for short strings, the lazy code could perhaps be modeled via recursion by executing from the end of the string rather than the beginning. If experiments are run, the results will also appear here.

    Here is the abovementioned streamlined implementation. There were eleven previous versions, all of which were benchmarked across gawk, mawk and busybox awk. The approaches started with a naive implementation and explored table-based, recursive (with no, single and shared memoization) and lazy models. As expected, the lazy version was incredibly fiddly and not pleasant to read or pursue. Findings will appear here later, but for now, here's the code.

    Code

    levdist

    function levdist(str1, str2,    l1, l2, tog, arr, i, j, a, b, c) {
            if (str1 == str2) {
                    return 0
            } else if (str1 == "" || str2 == "") {
                    return length(str1 str2)
            } else if (substr(str1, 1, 1) == substr(str2, 1, 1)) {
                    a = 2
                    while (substr(str1, a, 1) == substr(str2, a, 1)) a++
                    return levdist(substr(str1, a), substr(str2, a))
            } else if (substr(str1, l1=length(str1), 1) == substr(str2, l2=length(str2), 1)) {
                    b = 1
                    while (substr(str1, l1-b, 1) == substr(str2, l2-b, 1)) b++
                    return levdist(substr(str1, 1, l1-b), substr(str2, 1, l2-b))
            }
            for (i = 0; i <= l2; i++) arr[0, i] = i
            for (i = 1; i <= l1; i++) {
                    arr[tog = ! tog, 0] = i
                    for (j = 1; j <= l2; j++) {
                            a = arr[! tog, j  ] + 1
                            b = arr[  tog, j-1] + 1
                            c = arr[! tog, j-1] + (substr(str1, i, 1) != substr(str2, j, 1))
                            arr[tog, j] = (((a<=b)&&(a<=c)) ? a : ((b<=a)&&(b<=c)) ? b : c)
                    }
            }
            return arr[tog, j-1]
    }
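
    For cross-checking results, a naive full-table version is handy. The sketch below is an assumption-laden illustration, not one of the eleven benchmarked versions: it fills the whole (length+1) by (length+1) table that levdist() above avoids storing, and the name lev_full is hypothetical.

```sh
out=$(awk '
function lev_full(s, t,    m, n, i, j, d, cost, best) {
    m = length(s); n = length(t)
    for (i = 0; i <= m; i++) d[i, 0] = i          # delete every character of s
    for (j = 0; j <= n; j++) d[0, j] = j          # insert every character of t
    for (i = 1; i <= m; i++)
        for (j = 1; j <= n; j++) {
            cost = (substr(s, i, 1) != substr(t, j, 1))
            best = d[i-1, j] + 1                                  # deletion
            if (d[i, j-1] + 1 < best) best = d[i, j-1] + 1        # insertion
            if (d[i-1, j-1] + cost < best) best = d[i-1, j-1] + cost   # substitution or match
            d[i, j] = best
        }
    return d[m, n]
}
BEGIN { print lev_full("kitten", "sitting"), lev_full("Saturday", "Sunday"), lev_full("", "abc") }')
printf '%s\n' "$out"
```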
    

    Demo code

    Run demo.awk using gawk -f levenshtein.awk -f demo.awk.

    #demo.awk
    BEGIN {OFS = "\t"}
    {words[NR] = $0}
    END {
       max = 0
       for (i = 1; i in words; i++) {
          for (j = i + 1; j in words; j++) {
             new = levdist(words[i], words[j])
             print words[i], words[j], new
             if (new > max) {
                max = new
                bestpair = (words[i] " - " words[j] ": " new)
             }
          }
       }
       print bestpair
    }
    

    Unit tests

    Run utests.awk using gawk -f levenshtein.awk -f utests.awk.

    #utests.awk
    function testlevdist(str1, str2, correctval,    testval) {
        testval = levdist(str1, str2)
        if (testval == correctval) {
            printf "%s:\tCorrect distance between '%s' and '%s'\n", testval, str1, str2
            return 1
        } else {
            printf "MISMATCH on words '%s' and '%s' (wanted %s, got %s)\n", str1, str2, correctval, testval
            return 0
        }
    }
    BEGIN {
        testlevdist("kitten",    "sitting",   3)
        testlevdist("Saturday",  "Sunday",    3)
        testlevdist("acc",       "ac",    1)
        testlevdist("foo",       "four",      2)
        testlevdist("foo",       "foo",       0)
        testlevdist("cow",       "cat",       2)
        testlevdist("cat",       "moocow",    5)
        testlevdist("cat",       "cowmoo",    5)
        testlevdist("sebastian", "sebastien", 1)
        testlevdist("more",      "cowbell",   5)
        testlevdist("freshpack", "freshpak",  1)
        testlevdist("freshpak",  "freshpack", 1)
    }
    

    Author

    pierre.gaston <a.t> gmail.com


    categories: Tools,Nov,2009,Admin

    Columnate

    Contents

    Synopsis

    Download

    About

    Code

    Author

    Synopsis

    #e.g.
    gawk -F: -f columnate.awk /etc/passwd
    

    Download

    Download from LAWKER.

    About

    This script columnates the input file, so that columns line up as they do with the column(1) command. Its output is like that of column -t. First, awk reads the whole file, keeps track of the maximum width of each field, and saves all the lines/records. At the END, the lines are printed in columnated format. If your terminal is not too narrow, you'll get a handsome display of the file.

    Code

    {   line[NR] = $0    # saves the line
        for (f=1; f<=NF; f++) {
            len = length($f)
            if (len>max[f])
                max[f] = len }  # an array of maximum field widths
    }
    END {
        for(nr=1; nr<=NR; nr++) {
            nf = split(line[nr], fields)
            for (f=1; f<nf; f++)
                printf "%-*s", max[f]+2, fields[f]
            print fields[f] }     # the last field need not be padded
    }
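
    A quick way to see the effect is to feed the same logic two short records. The sketch below is a variation under assumptions rather than the script itself: the sample data is made up, and the field width is built into the format string instead of using %-*s, in case an awk other than gawk is used.

```sh
out=$(printf '%s\n' 'a:bb:c' 'long:x:y' | awk -F: '
{   line[NR] = $0                                  # save the record
    for (f = 1; f <= NF; f++)
        if (length($f) > max[f]) max[f] = length($f)
}
END {
    for (nr = 1; nr <= NR; nr++) {
        nf = split(line[nr], fields)
        for (f = 1; f < nf; f++) {
            fmt = "%-" (max[f] + 2) "s"            # left-justify, two spaces of gutter
            printf fmt, fields[f]
        }
        print fields[nf]                           # the last field is not padded
    }
}')
printf '%s\n' "$out"
```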
    

    Author

    h-67-101-152-180.nycmny83.dynamic.covad.net


    categories: Graphics,Sept,2009,TedD

    WidenBmp.awk

    Contents

    Background

    My boss wants to put NOAA weather radar images in a looping presentation that is displayed as 720 video on the 1040 LCD TV in the atrium. He couldn't figure out how to download the various layers needed, so he gave me the task. Of course, I had a sample composite image for him in half an hour. It looked terrible on the TV: the writing came out as just a blur and the county and state lines (single pixel mostly) were essentially invisible. Obviously, I could make my own 'cities' overlay, but no tools I had would convert the 'counties' image to any usable vector format for line resizing.

    That afternoon, I wrote a gawk script that widens the lines in a 256 color BMP version of the image - I can convert it back to a transparent background GIF later.

    That script was presented on awk.info on July 30, 2009. This is an updated and extended version.

    The script widens lines in .bmp files to make them more visible when converted to TV video images. For the complete conversion, it is also necessary to mung the line colors to get rid of interpolated colors and to give some lines more contrast, but that is done elsewhere.

    This script is gawk specific.

    Code

    Bytes2Number

    This function converts byte strings (binary numbers) into their corresponding numeric strings so that they can be processed as gawk numbers. The lookup table (CharString) is a global variable. This code assumes that binary numbers are big-endian (most significant byte first) - it is up to the calling program to order the bytes.

    On the first use, the (global) LUT is created, then left for later use. It consists of a list of characters from \000 to \377 in order - the (index value minus 1) of a character multiplied by the power of 256 corresponding to its position in the string is the byte's numerical weight. The function doesn't care about the length of the byte string (within the integer limits of the gawk version and port).

    function Bytes2Number( String,  x, y, z, Number ) {
    	if( !CharString ) {
    		for( x = 0; x <= 255; x++ ) CharString = CharString sprintf( "%c", x )
    	}
    	x = split( String, Scratch, "" )
    	Number = 0
    	for( y = 1; y <= x; y++ ) {
    		z = index( CharString, Scratch[ y ] ) -1
    
    		Number = Number + z * (256^(x - y))
    	}
    	return Number	# Note that Number is a regular gawk scalar variable.
    }
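
    The weighting scheme can be tried with printable bytes. The sketch below is a simplified variant under stated assumptions, not the function above: the hypothetical bytes2num() uses substr() instead of a null-separator split, and its table holds bytes 1 to 255 only, so index() gives the byte value directly.

```sh
out=$(awk '
# Big-endian byte string to number. The table omits byte 0, so a NUL byte
# is simply not found by index() and contributes 0.
function bytes2num(s,    i, n, num) {
    if (!table)
        for (i = 1; i <= 255; i++) table = table sprintf("%c", i)
    n = length(s)
    for (i = 1; i <= n; i++)
        num += index(table, substr(s, i, 1)) * 256 ^ (n - i)
    return num + 0
}
BEGIN { print bytes2num("BM"), bytes2num("A") }')
printf '%s\n' "$out"
```

    "B" is byte 66 and "M" is byte 77, so "BM" read big-endian is 66 * 256 + 77.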
    

    RealSize

    Uses a brute force approach to factor the image size into width and height numbers that actually match the real image size. It searches around the nominal values for a pair of numbers that, when multiplied together, produce the known size of the image in pixels.

    function RealSize( Wide, High, Pixels,  x, y ) {
    	for( x = Wide - 5; x <= Wide +5; x++ ) {
    		for( y = High - 5; y <= High + 5; y++ ) {
    			if( x * y == Pixels ) {
    				Width = x
    				Height = y	
    			}
    		}
    	}	
    }
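
    A sketch of the same search, run on made-up numbers: the 330000-pixel count matches the 600 by 550 image described in the earlier article, while the nominal 598 by 552 is purely hypothetical. Unlike RealSize(), this hypothetical real_size() returns the first match as a string instead of setting the globals Width and Height.

```sh
out=$(awk '
function real_size(wide, high, pixels,    x, y) {
    for (x = wide - 5; x <= wide + 5; x++)
        for (y = high - 5; y <= high + 5; y++)
            if (x * y == pixels) return x " " y    # first pair whose product fits
    return "no match"
}
BEGIN { print real_size(598, 552, 330000) }')
printf '%s\n' "$out"
```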
    

    BEGIN

    It is necessary to tell gawk to read/write the file as binary, especially under Windows where ^Z in files is a killer. Setting BINMODE to 3 will also work, but it throws error messages.

    Setting FS to null causes gawk to make each byte a separate field.

    Testing indicates that, in Windows at least, it is necessary to specify RS, even though it would appear redundant to set it to \n - not doing so results in 0A0D being replaced with 0A in the output, with the loss of one byte for each occurrence. The value is arbitrary - it has been tested using one of the line colors.

    BEGIN{
    	BINMODE = "rw"
    	FS= ""
        # The next two lines are not strictly necessary- 
        # there are here for clarity.
    	Header = ""
    	ByteCount = 0
    	RS = "\n"
    }
    

    For Each Record...

    Read the file into an array. If there are multiple lines, that is, if RS appears in the file, insert the record separator back into the array at the end of each line for which RT exists.

    {
    	for( x = 1; x <= NF; x++ ) Bytes[ ++ByteCount ] = $(x)	
    	if( RT ) { Bytes[ ++ByteCount ] = RT }
    }
    

    END

    Closing FILENAME here allows overwriting the original file - if that is desired, comment out the next line (which creates a new filename for the output).

    Regarding image parameters: Width and Height are in pixels; Depth is the number of bytes per pixel; Data is the zero based index of the actual image in the file; Size refers to the bytes in the file, not the image; ImgSize is the number of pixels in the image. Unfortunately, Width and Height may be wrong: RealSize() calculates the actual values as found from the data block.

    Once the image parameters are set, the two arrays for the image can be built: one to contain an unmodified copy (A) and one to contain a copy to be modified (B). These arrays are indexed by line and dots (Height, Width); data are complete pixels. The C array is used to determine the background color: it uses the pixel data as indexes and the count of the number of copies of that pixel as values - the largest value represents the most common color, and assuming that the image is mostly background, therefore the background color. This assumption will be true for almost all line art.

    When performing line widening: for each pixel that is not part of the background, copy its color to the surrounding pixels (the code below covers all eight neighbours, diagonals included), provided that they are background. This approach prevents one line from encroaching on another, but does not prevent the ends of lines that do not intersect other lines from growing by one pixel on each pass through the program for each free end. u and v are the rows after and before, and w and z (z has been reused) the columns after and before, the pixel in work (defined by x and y).

    END{
    	if( !OutFile ) OutFile = FILENAME
    	close( FILENAME )
    	sub( /[bB][mM][pP]$/, "widened.bmp" Arr[1], OutFile )
    	Width = Bytes2Number( Bytes[ 22 ] Bytes[ 21 ] Bytes[ 20 ] Bytes[ 19 ] )
    	Height = Bytes2Number( Bytes[ 26 ] Bytes[ 25 ] Bytes[ 24 ] Bytes[ 23 ] )
    	Data = Bytes2Number( Bytes[ 14 ] Bytes[ 13 ] Bytes[ 12 ] Bytes[ 11 ] )
    	Size = Bytes2Number( Bytes[ 6 ] Bytes[ 5 ] Bytes[ 4 ] Bytes[ 3 ] )
    	Depth = Bytes2Number( Bytes[ 30 ] Bytes[ 29 ] ) / 8
    	ImgSize = Bytes2Number( Bytes[ 38 ] Bytes[ 37 ] Bytes[ 36 ] Bytes[ 35 ] )
    	RealSize( Width, Height, ImgSize / Depth )
        # Output the header in its original form to the target file.
    	for( x = 1; x <= Data; x++ ) Header = Header Bytes[ x ]
    	printf( "%s", Header ) > OutFile
        # Build the two arrays
    	for( x = 1; x <= Height; x++) {
    		for( y = 1; y <= Width; y++ ) {
    			S = ""
                # Values for the A & B array entries are strings of 
                # bytes representing the color of the pixel, either directly or 
                # as a pointer into a palette.
    			for( z = 1; z <= Depth; z++ ) S = S Bytes[ ++Data ]
    			A[x,y] = S
    			B[x,y] = S
    			C[ S ]++
    		}
    	}
    	
    	z = 0
        # Bkg is the (assumed) background color.  
        # The code is a simple maximum value loop.
    	for( x in C ) {
    		y = C[x]
    		if( y > z ) {
    			Bkg = x
    			z = y
    		}
    	}
       # Begin the actual line widening code.
       # Pixels are compared with == and != rather than ~ and !~, so that
       # bytes that happen to be regular expression metacharacters are safe.
    	for( x = 1; x <= Height; x++) {
    		for( y = 1; y <= Width; y++ ) {
    			if( A[x,y] != Bkg ) {
    					u = x + 1
    					v = x - 1
    					w = y + 1
    					z = y - 1
    					if( B[u,y] == Bkg ) B[u,y] = A[x,y]
    					if( B[v,y] == Bkg ) B[v,y] = A[x,y]
    					if( B[x,w] == Bkg ) B[x,w] = A[x,y]
    					if( B[x,z] == Bkg ) B[x,z] = A[x,y]
    					if( B[u,w] == Bkg ) B[u,w] = A[x,y]
    					if( B[u,z] == Bkg ) B[u,z] = A[x,y]
    					if( B[v,w] == Bkg ) B[v,w] = A[x,y]
    					if( B[v,z] == Bkg ) B[v,z] = A[x,y]
    			}
    		}
    	}
    	for( x = 1; x <= Height; x++) {
    		for( y = 1; y <= Width; y++ ) {
    			printf( "%s", B[x,y] ) > OutFile
    		}
    	}
    }
    

    Note the final nested for loops in the above code. After the B array has been modified, the target file can be completed by reading that array out to the file pixel by pixel. The array cannot be output during processing because pixels that have already been through the processor can still be changed.
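
    The two-array idea is easier to see on a text grid than on a bitmap. The following sketch makes several assumptions: each character stands for one pixel, "." is taken as the background (passed in as the hypothetical variable bkg rather than found by counting), and all eight neighbours are filled. Detection reads the untouched copy A, so freshly colored pixels in B do not spread further in the same pass.

```sh
out=$(printf '%s\n' '.....' '..#..' '.....' | awk -v bkg=. '
{   h = NR; w = length($0)
    for (c = 1; c <= w; c++) A[NR, c] = B[NR, c] = substr($0, c, 1)
}
END {
    for (r = 1; r <= h; r++)
        for (c = 1; c <= w; c++)
            if (A[r, c] != bkg)                       # a line pixel in the original
                for (dr = -1; dr <= 1; dr++)
                    for (dc = -1; dc <= 1; dc++)
                        if (((r + dr, c + dc) in B) && B[r + dr, c + dc] == bkg)
                            B[r + dr, c + dc] = A[r, c]
    for (r = 1; r <= h; r++) {                        # read the modified copy back out
        row = ""
        for (c = 1; c <= w; c++) row = row B[r, c]
        print row
    }
}')
printf '%s\n' "$out"
```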

    Author

    Ted Davis tdavis@mst.edu.


    categories: Graphics,Jul,2009,TedD

    Processing Binary (BMP) files in Gawk

    by Ted Davis

    Updates

    (For an update to this page, see WidenBmp.awk).

    Description

    My boss wants to put NOAA weather radar images in a looping presentation that is displayed as 720 video on the 1040 LCD TV in the atrium. He couldn't figure out how to download the various layers needed, so he gave me the task. Of course, I had a sample composite image for him in half an hour. It looked terrible on the TV: the writing came out as just a blur and the county and state lines (single pixel mostly) were essentially invisible. Obviously, I could make my own 'cities' overlay, but no tools I had would convert the 'counties' image to any usable vector format for line resizing.

    This afternoon, I wrote a gawk script that widens the lines in a 256 color BMP version of the image - I can convert it back to a transparent background GIF later.

    The power and range of gawk never ceases to amaze me - a 42 line (pretty printed) program was all it took.

    The script uses FS="" to convert the entire file into 331 078 single byte fields. The first 1078 went into a header string that was printf()ed to the outfile. The rest went into a pair of 550 row by 600 column arrays. Then I looked at each pixel in the A array, and if it was not the background color, made the four surrounding pixels in the B array the same color, provided they were background color (not part of an existing line). Then I read out the array in order and printf()ed it to the outfile. The resulting overlay should be readable after changing the colors to make the dark lines brighter and moving its location in the stack to be on top of the other images.

    There is one known flaw that I have no intention of addressing: lines that do not intersect other lines grow longer by one pixel for each pass through the program.

    Code Fragments

    While the actual code is proprietary, the following code snippets show most of the idioms required to handle binaries.

    function Bytes2Number( String,  x, y, z, Number ) {
          x = split( String, Scratch, "" )
          Number = 0
          for( y = 1; y <= x; y++ ) {
                  z = index( CharString, Scratch[ y ] ) -1
                  Number = Number + z * (256^(x - y))
          }
          return Number
    }
    

    The following code initializes the CharString variable needed by Bytes2Number.

    BEGIN{
         for( x = 0; x <= 255; x++ ) {
              CharString = CharString sprintf( "%c", x )
         }
    

    The above code generates the list of bytes for the Bytes2Number function.

         FS= ""
         RS = "ABC"   # a string: a bare /ABC/ here would be evaluated as a match against $0
    }
    

    Note that the string "ABC" does not appear in any of the image files processed by this code. Hence, the above lines mean that the whole image ends up in one record.

    The next block analyzes the header to extract useful information.

        {     Width   = Bytes2Number( $22 $21 $20 $19 )
              Height  = Bytes2Number( $26 $25 $24 $23 )
              Data    = Bytes2Number( $14 $13 $12 $11 )
              Size    = Bytes2Number( $6 $5 $4 $3 )
              Depth   = Bytes2Number( $30 $29 ) / 8
              ImgSize = Bytes2Number( $38 $37 $36 $35 )
               ....
        }
    

    (note: I found that the image size in the header may be wrong, notably in files resized by Paint Shop Pro. Calculating it proved more reliable.)
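
    Where handling raw bytes inside awk is awkward, the same header fields can be read by first turning the file into decimal byte values with od. This is a swapped-in technique, not the author's method. The sketch below builds a fake 26-byte header (a made-up "BM" followed by filler, then a width of 600 and a height of 550 stored least significant byte first) and decodes the two fields at the byte positions used above; le32 is a hypothetical helper name.

```sh
out=$(printf 'BMxxxxxxxxxxxxxxxx\130\002\000\000\046\002\000\000' |
      od -An -v -tu1 |
      awk '
function le32(p) {            # p is the 1-based position of the least significant byte
    return b[p] + 256 * b[p + 1] + 65536 * b[p + 2] + 16777216 * b[p + 3]
}
{ for (i = 1; i <= NF; i++) b[++n] = $i }    # collect every byte value in file order
END { print le32(19), le32(23) }')
printf '%s\n' "$out"
```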


    categories: Databases,Spawk,Dec,2009,PanosP

    Spawk for SUSE Linux

    I've just installed the openSUSE Milestone 8 (11.2) in a virtual machine in my PC.

    In about half an hour, I've also downloaded the MySQL, gawk and SPAWK (SQL + AWK) sources, then compiled and built the SPAWK libraries (/usr/lib/libspawk.so and /usr/lib/libspawk_r.so).

    I've tested the module and it worked just fine, so I've uploaded the binary tarball for this distro to the SPAWK project (http://code.google.com/p/spawk/downloads/list).

    Have a Happy New Year!
    -- Panos Papadopoulos to tim


    categories: Databases,Spawk,Jul,2009,PanosP

    SPAWK moves to GoogleCode

    Panos I. Papadopoulos reports that he has moved the SPAWK project (SQL and AWK) to Mercurial and spawk.googlecode.com.

    He has also written extensive tutorial notes at the SPAWK wiki.


    categories: Spawk,Databases,Jul,2009,PanosP

    SQL Powered AWK

    Website

    http://sites.google.com/site/spawkinfo.

    Author

    Panos I. Papadopoulos (panos1962@gmail.com).

    Description

    SPAWK is an elegant collection of functions for accessing and updating MySQL databases from within GNU awk programs. The SPAWK module consists of a single awk extension library, namely libspawk.so, which may be loaded in awk programs using the standard extension awk function:

    BEGIN {
       extension("libspawk.so", "dlload")
       ...
    

    A Short Example

    Here's a short example of using SPAWK (for more details, see http://sites.google.com/site/spawkinfo/Home/manual).

    When calling spawk_select, SPAWK sends the query already given (perhaps some spawk_query calls preceded the spawk_select) to the current server (recall that a "server", from SPAWK's point of view, is a connection to the actual MySQL server mysqld). After calling spawk_select, the server is ready to return the results to the awk process via spawk_data, spawk_first or spawk_last calls. Alternatively, at any time we can clear the results' set and release the server with a spawk_clear function call.

    The main data receiver is the spawk_data function. It is usually called with one or two arguments: the first is an array used as the data transfer vehicle, and the optional second argument holds the null valued columns. spawk_data returns the number of columns of each returned data row, or zero if there are no more data to return (EOD). spawk_first takes the same arguments and returns the same values as spawk_data, but it fetches only the next available data row and then releases the server, so the rest of the data is lost. spawk_last is similar, but the row returned is the last row of the results' set. By the way, spawk_last is less efficient than spawk_first; actually, there is no particular reason to call spawk_last at all! Let's see some examples:

    BEGIN {
         extension("libspawk.so", "dlload")
         SPAWKINFO["database"] = "information_schema"
         spawk_select("SELECT TABLE_SCHEMA, TABLE_NAME FROM TABLES")
         while (spawk_data(data))
              print data[0]
         exit(0)
    }
    

    Things need to be explained:

    • extension is used to load the SPAWK module.
    • The SPAWKINFO array is used to specify the default database (schema) to connect to. Index "database" denotes the default database.
    • spawk_select is used to execute the desired query.
    • spawk_data is used repeatedly to get the results, row by row. Here spawk_data returns 2 (the number of selected columns) while there are results to be returned, and 0 on EOD.

    categories: Macros,Tools,Mar,2009,Timm

    Macros

    These pages focus on macro pre-processors (a natural application for Awk).


    categories: Tools,Jul,2009,WmM

    Finite State Machine Generator

    Contents

    Download

    Download from LAWKER

    Usage

    In general, specify the state machine in FILE.fsm and define the action functions in FILE_actions.c. Then run fsm.awk, and compile and link fsm.c, fsm_FILE.c and any driver file. That's it.

    Multiple fsms may be built and run in the same application using the function fsm_allocFsm(). Moreover, calls to fsm() may be nested using the same state machine as long as a different context is used. fsm_allocFsm() returns a context number that must be stored and passed to fsm() on each invocation. In the provided sample, the context is stored in myContext in test_driver.c.

    Fsm() may be called either by polling for events or from inside an interrupt service routine. If fsm() is called from an interrupt service routine, it must be protected from nested calls using the same context. Interrupting calls using other contexts is permitted.

    Note that the function fsminit() is called only once and should not be called for each fsm. If there are special requirements for a given fsm, an appropriate init function should be provided and called for that particular fsm.

    Currently, fsm traceEnable is set to true and cannot be disabled (without changing fsm_allocFsm()). An array is maintained within each fsm context wherein each state and event are recorded for each call to fsm().

    DESCRIPTION

    Fsm.awk is an awk script designed to read a finite state machine (fsm) specification and produce C files which implement that fsm. The file fsm.c, included in the distribution, provides the actual state transition function, and the user provides the state transition "action" functions and any special initialization.

    The fsm distribution consists mainly of fsm.awk and fsm.c, although there are a number of header files for declarations - doesn't get much simpler than that.

    Typically, the fsm specification is named in the form fsm_name.fsm, but may be named any legal filename. The action functions may be placed in any number of files by any name the user chooses. Each function should return either true or false so that the appropriate next state may be chosen.

    The chief benefits of using fsm.awk are easy to read, consistent state machine specifications and reuse of existing, tested code. Multiple tables and multiple users are happily accommodated. It's not hi-tech, but it provides an easy avenue to generalization and consistency where fsms are required.

    This distribution represents a rewrite of an earlier version written many years ago - rewritten with newer versions of awk and gcc in mind. Consequently, it has not been tested using other compiler suites. There are no known bugs, but, it IS a rewrite.

    Although a good candidate for C++, C was used because C++ was not being used in any of the systems currently using fsm-gen. Maybe a C++ version will be in a subsequent release.

    Building the Sample FSM

    The distribution provides the following files:

    COPYING and      FSF licenses
    COPYING.LESSER
    filelist         the "packing list"
    fsm.awk          the code generator
    fsm.c            the context and transition code
    fsm.h            definitions for the API
    makefile         simple makefile for the test driver code
    utils.h          error and utility definitions
    test.fsm         a sample fsm specification named "test"
    test_actions.c   action functions for the sample
    

    To build the sample,

    1. Download the .zip
    2. extract the files from the zip - unzip contents.zip
    3. build the example fsm - ./fsm.awk test.fsm. This step will produce fsm_test.c and fsm_test.h.
    4. compile and link the executable (test) using make
    5. run the sample - the executable produced by the makefile is "test". See the section THE EXAMPLE FSM below for information on using the example.

      When fsm.awk is run, (run via fsm.awk fsmName.fsm) it produces two files, fsm_fsmName.c and fsm_fsmName.h. Fsm_fsmName.c will contain an array of struct fsm_s tagged as fsm_fsmName, eg.,

      struct fsm_s fsm_fsmName [STATES_COUNT][EVENTS_COUNT].
      

      In the fsm distribution, the files fsm_test.c, fsm_test.h and test_actions.c may be built as an executable sample.

      The file fsm.c should be compiled and linked with the final executable as it contains the C code necessary to read the generated tables and update context.

      Building the example should compile error free with the exception of a warning about using "gets()" in the sample driver. Hey - it's just a driver for a test.

    Example FSM Specification File

    In its purest form, an fsm specifies state, event, action, new state. For example, a rudimentary ftp server might be specified as follows:

    # current     event     action          next 
    # state                                  state
    # --------------+----------+---------------+------------
    IDLE            CONN_REQ   makeConnection  CONNECTED
    CONNECTED       GET_REQ    sendBuffer      SENDING
    SENDING         FILE_SENT  closeFile       IDLE
    

    It is useful on occasion to make the next state depend on the success or failure of the action function. Here, "ok" and "fail" mean "true" and "false", respectively. For example, as each buffer is sent it would be useful to specify a different state if sendBuffer() returns fail (indicating EOF).

    # current     event     action     next         next 
    # state                             state        state
    #                                    ok           fail
    # --------------+----------+----------+---------+-----
    CONNECTED       GET_REQ    sendBuffer SENDING   IDLE
    

    State, event, action, and new state may be specified according to the same rules as C variables/functions. In the above table, the words CONNECTED, GET_REQ, SENDING, and IDLE are used to generate #defines, and the action sendBuffer is the name of a user supplied function.

    The file test.fsm illustrates several idioms:

    • an event may be a single event or a comma separated list of events that all result in the same action and same next state. For example, the specification
      # current     event     action     next         next 
      # state                             state        state
      #                                    ok           fail
      # --------------+----------+----------+---------+-----
      S1              EVENT_1,EVENT_2    action_1   S2        S3
      

      means, when receiving event EVENT_1 or EVENT_2 in state S1, execute action action_1 and go to state S2 if the return value of action_1() is true; go to state S3 if the return value of action_1() is false.
    • note that all events must be specified for each state. See the example specification file, test.fsm.
    • an action specified as "-" means, "do nothing". fsm.awk will generate a NULL in the state transition tables which will be treated as "do nothing". When so specified, the next state will always be the next-state-ok state.
    • an action specified as fsm_invalid_event will call the function fsm_invalid_event(void), which always returns false. This function may be edited to suit the situation at hand. When fsm_invalid_event is specified, both next states may be left unspecified - fsm.awk will generate next state information as being the current state (i.e., no change in the current state).
    • a fail next state specified as "-" means the fail next state is the same as the success next state. That is, in the specification
      # current     event     action     next         next 
      # state                             state        state
      #                                    ok           fail
      # --------------+----------+----------+---------+-----
      S1              EVENT_1    action_1   S2        -
      
      means, when receiving event EVENT_1 in state S1, execute action action_1 and go to state S2 irrespective of the return value of action_1().
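
    The rules in this list can be sketched in a few lines of Awk. The sketch is not taken from fsm.awk: the helper name expand() is hypothetical, and it assumes that a list of events is written with commas and no spaces. It expands one specification line into one transition per event and resolves a "-" fail state to the ok state. The "-" action and fsm_invalid_event cases are not handled.

```awk
# expand(): hypothetical helper, not part of fsm.awk.
# Input:  one spec line "state events action next_ok [next_fail]".
# Output: one "state event action next_ok next_fail" line per listed event.
function expand(line,    f, n, ev, i, ok, fail, out) {
    n = split(line, f, " ")                      # split on runs of whitespace
    ok = f[4]
    fail = (n < 5 || f[5] == "-") ? ok : f[5]    # "-" means "same as ok"
    n = split(f[2], ev, ",")                     # assumed: commas, no spaces
    for (i = 1; i <= n; i++)
        out = out sprintf("%s %s %s %s %s\n", f[1], ev[i], f[3], ok, fail)
    return out
}

BEGIN {
    printf "%s", expand("S1 EVENT_1,EVENT_2 action_1 S2 S3")
    printf "%s", expand("S1 EVENT_3 action_2 S2 -")
}
```

    The first call yields one transition for EVENT_1 and one for EVENT_2; the second shows S2 as both the ok and the fail state.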

    The Example FSM

    Included in the distribution are test.fsm and test_actions.c which implement a very simple state machine called "test". After the executable "test" is produced (via make), it may be used to show the behavior of the fsm.

    The example fsm was built and tested with gcc version 4.0.2 and awk version 3.1.4.

    Example Output from the Sample

    On running "test", first the line "testing fsm test" is printed, then a line indicating the initial state. It then asks for the next event. All events in the example are the lowercase letters 'a' thru 'd', entered from the keyboard. A special event 'z' will cause the trace to be dumped. Entering 'q' will cause test to exit. Note that to keep the example simple, other than special events 'z' and 'q', there is no checking of input for being outside the known set of events. A sample session might look like this:

    $>
    $> ./test
    testing fsm test
    
    starting in state 1
    next event: a
    got a (0)  ----> called fsm_s2_ab ----> ,went to state 0
    next event: d
    got d (3)  ----> invalid eventwent to state 0
    next event: b
    got b (1)  ----> called fsm_s1_b ----> ,went to state 1
    next event: c
    got c (2)  ----> went to state 1
    next event: z
    trace index is 4
    event      state
    0          0
    3          0
    1          1
    2          1
    0          0 <-- next/oldest
    0          0
    0          0
    0          0
    
    next event: q
    bye
    $>
    

    Copyright

    Copyright 2008 Wm Miller

    This file is part of fsm-gen, and is distributed under the terms of the GNU Lesser General Public License.

    Copies of the GNU General Public License and the GNU Lesser General Public License are included with this distribution in the files COPYING and COPYING.LESSER, respectively.

    Fsm-gen is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

    Fsm-gen is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

    You should have received a copy of the GNU Lesser General Public License along with fsm-gen. If not, see http://www.gnu.org/licenses.

    Author

    Wm Miller. The author may be contacted at wmmsf at users.sourceforge.net.

    categories: Sigs,Tools,Apr,2009,Anon

    Hiding Email Address

    Contents

    Synopsis

    Download

    Description

    Code

    Author

    Synopsis

    gawk -f cryptosig.awk tim@menzies.us

    Download

    Download from LAWKER.

    Description

    Generates a one-line Awk program that prints your email address from a seemingly jumbled string. This program can then become your email sig, and only the Awk cognoscenti can generate a reply.

    Example

    % gawk -f cryptosig.awk tim@menzies.us
    BEGIN{a="7059631863556476595569007169";while(a){printf("%c",46+substr(a,1,2));a=substr(a,3)}}
    

    This can be tested as follows:

    echo 'BEGIN{a="7059631863556476595569007169";while(a){printf("%c",46+substr(a,1,2));a=substr(a,3)}}' | gawk -f -
    

    or

    gawk -f cryptosig.awk tim@menzies.us | gawk -f -
    

    both of which should print "tim@menzies.us".

    Code

    BEGIN {
      for (i=0; i<=255; i++) {           # build table of char=value pairs
        ord_arr[sprintf("%c",i)] = i     # character = ordinal value
      }
      for (i=1; i<=ARGC-1; i++) {
        str = ""
        for (j=1; j<=length(ARGV[i]); j++) {
          str = sprintf("%s%02d",str,ord_arr[substr(ARGV[i],j,1)]-46)
        }
        printf("BEGIN{a=\"%s\";while(a){printf(\"%%c\",46+substr(a,1,2));a=substr(a,3)}}\n",str)
      }
      exit(0)
    }
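
    To make the encoding easier to follow, here is a minimal round-trip sketch. The function names encode() and decode() are hypothetical; the offset of 46 and the two-digit width are taken from the code above. With two digits per character, the round trip is only guaranteed for characters whose ordinal values lie between 46 and 145.

```awk
# encode(): each character becomes a two-digit number, its ordinal minus 46.
function encode(s,    i, out, ord) {
    for (i = 0; i <= 255; i++) ord[sprintf("%c", i)] = i   # char -> ordinal
    for (i = 1; i <= length(s); i++)
        out = out sprintf("%02d", ord[substr(s, i, 1)] - 46)
    return out
}

# decode(): the loop that the generated one-liner runs.
function decode(a,    out) {
    while (a != "") {
        out = out sprintf("%c", 46 + substr(a, 1, 2))      # two digits -> char
        a = substr(a, 3)
    }
    return out
}

BEGIN {
    code = encode("tim@menzies.us")
    print code            # the digit string shown in the example above
    print decode(code)    # the original address
}
```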
    

    Author

    BEGIN{a="535170696159626207061118755158656500536563";
          while(a){
              printf("%c",46+substr(a,1,2));a=substr(a,3)};
          print("")
    }
    

    categories: Sigs,Tools,Apr,2009,Timm

    Random Signatures

    Synopsis

    chmod +x sigs; ./sigs

    Download

    Download from LAWKER.

    Description

    Generates random signatures. The signatures and the generation code are included in the same file, so installation is just a matter of calling one file.

    Most of the file is a large "here" document. Paragraph 1 of that document is always added to the signature, followed by one of the following paragraphs, selected at random.

    To add to the signatures, include them in the here document, with one preceding blank line.

    Code

    Pick1

    pick1() {
        gawk 'BEGIN { srand(); RS=""    }
              NR==1 { print $0 "\n"     }
              NR>1  { Recs[rand()] = $0 }
              END   { for ( R in Recs ) {print Recs[R]; exit}}
            ' $1
    }
    

    The Signatures

    cat << SoMEI_mpOSSIblE_sYMBOl | pick1
    tim.menzies {
      title:   dr (Ph.D.) and associate professor;
      align:   csee, west virginia university;
      cell:   esb 841A; 
      url:   http://menzies.us;
      fyi:   unless marked "URGENT", i usually won't get 2 your email b4 5pm; 
    }
    
    Doing a job RIGHT the first time gets the job done. Doing the job WRONG
    fourteen times gives you job security.
    
    Rome did not create a great empire by having meetings, they did it by
    killing all those who opposed them.
    
    INDECISION is the key to FLEXIBILITY.
    
    "When a subject becomes totally obsolete we make it a required
    course."  Peter Drucker
    
    I saw two shooting stars last night but they were only satellites .
    Its wrong to wish on space hardware. I wish, I wish, I wish you cared.
    -- Billy Bragg
    
    Then, in 1995, came the most amazing event in the
    history of programming languages: the introduction
    of Java.  -- Programming Languages: Principles and Practice
    
    Suburbia is where the developer bulldozes out the trees, then names
    the streets after them. --Bill Vaughan
    
    Instant gratification takes too long.
    -- Carrie Fisher
    
    Complexity is easy. Simplicity is hard.
    --Unknown
    

    Author

    Tim Menzies


    categories: Stats,Tools,May,2009,TimS

    Correlate.awk

    Contents

    Synopsis

    Notes

    Example

    Code

    Author

    Synopsis

    cat data | gawk -f correlate.awk 
    

    Notes

    This script calculates the correlation between two columns of numbers.

    For more Sherwood scripts, see Some useful Awk scripts.

    Example

    cat <<EOF | gawk -f correlate.awk
    1	1.417600305
    2	2.265271781
    3	3.241368347
    4	4.367711955
    5	5.390612315
    6	6.296879718
    7	7.43218197
    8	8.117831008
    9	9.338019481
    10	10.01823657
    EOF
    

    This outputs

    NR=10
    ssx=82.5
    ssy=79.0584
    ssxy=80.6985
    r=0.999227
    

    Code

    {   xy+=($1*$2); 
    	x+=$1; 
    	y+=$2; 
    	x2+=($1*$1); 
    	y2+=($2*$2);
    } 
    END { 
    	print "NR=" NR; 
    	ssx=x2-((x*x)/NR); 
    	print "ssx=" ssx; 
    	ssy=y2-((y*y)/NR); 
    	print "ssy=" ssy; 
    	ssxy = xy - ((x*y)/NR); 
    	print "ssxy=" ssxy; 
    	r=ssxy/sqrt(ssx*ssy); 
    	print "r=" r; 
    }
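
    The same sums can be wrapped in a function and checked on data whose correlation is known in advance. This is a sketch only: the function name correlation() and the sample values are made up for illustration. It computes r = ssxy / sqrt(ssx * ssy) exactly as the script above does.

```awk
# correlation(): hypothetical wrapper around the sums used in correlate.awk.
# xs and ys are arrays indexed 1..n.
function correlation(xs, ys, n,    i, x, y, xy, x2, y2, ssx, ssy, ssxy) {
    for (i = 1; i <= n; i++) {
        x  += xs[i];          y  += ys[i]
        x2 += xs[i] * xs[i];  y2 += ys[i] * ys[i]
        xy += xs[i] * ys[i]
    }
    ssx  = x2 - (x * x) / n
    ssy  = y2 - (y * y) / n
    ssxy = xy - (x * y) / n
    return ssxy / sqrt(ssx * ssy)
}

BEGIN {
    n = split("1 2 3 4", xs, " ")
    split("2 4 6 8", ys, " ")
    print correlation(xs, ys, n)    # y rises linearly with x, so r is 1
    split("8 6 4 2", ys, " ")
    print correlation(xs, ys, n)    # y falls linearly with x, so r is -1
}
```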
    

    Author

    Tim Sherwood


    categories: ,Music,Tools,June,2009,Admin

    Music and Awk

    These pages focus on music players and music analysis tools in Awk.


    categories: Project,Tools,Mar,2009,Admin

    Project Tools

    These pages focus on tools for larger Gawk programs; e.g. ways to load multiple files or auto-generate documentation straight from the source code.


    categories: Databases,Oct,2009,ScottS

    A MySql Client

    Download

    Download from LAWKER.

    Code

    Set Up

    BEGIN {
        if (!mysql["path"]) {
            mysql["path"] = "/usr/bin/mysql"
        }
        if (mysql["user"]) mysql["user"] = "-u" mysql["user"]
        if (mysql["pass"]) mysql["pass"] = "-p" mysql["pass"]
    
        if (!mysql["tempfile_command"]) {
            mysql["tempfile_command"] = "mktemp /tmp/__mysql.awk.XXXXXX"
        }
        mysql["resource_id"] = 1
        __mysql_dequote["r"]  = "\r"
        __mysql_dequote["n"]  = "\n"
        __mysql_dequote["t"]  = "\t"
        __mysql_dequote["\\"] = "\\"
    }
    

    Main Functions

    function mysql_db (db)      { mysql["database"] = db    }
    function mysql_path (path)  { mysql["path"]     = path  }
    
    function mysql_tempfile_command (command) {
        mysql["tempfile_command"] = command
    }
    function mysql_login (username, password, host, args) {
        mysql["user"] = "-u" username
        mysql["pass"] = "-p" password
        if (host) mysql["host"] = "-h" host
        if (args) mysql["args"] = args
    }
    function mysql_query (query    ,input,key,i,call,resource) {
        resource = mysql["resource_id"]++
        mysql["tempfile_command"] | getline mysql[resource]
        close(mysql["tempfile_command"])
        call = sprintf("%s %s %s %s %s %s > %s",
                mysql["path"], mysql["user"], mysql["pass"], mysql["host"],
                            mysql["args"], mysql["database"],
                mysql[resource])
        print query | call
        close(call)
        if (getline input < mysql[resource]) {
            for (i = split(input, key, "\t"); i > 0; i--)
                mysql[resource, i] = key[i]
        }
        return resource
    }
    function mysql_fetch_assoc (resource,row  ,input,i,fields) {
        fields = 0
        if (getline input < mysql[resource]) {
            fields = mysql_split(row, input)
            for (i = 1; i <= fields; i++)
                row[mysql[resource, i]] = row[i]
        }
        return fields
    }
    function mysql_split (row, input,   r,i) {
         r = split(input, row, "\t")
         # split() fills row[1]..row[r], so start at 1 to avoid creating row[0]
         for (i = 1; i <= r; i++) {
             row[i] = mysql_dequote(row[i])
         }
         return r
    }
    function mysql_fetch_row (resource,row  ,input,r,i) {
        if (getline input < mysql[resource]) {
            return mysql_split(row, input)
        }
        return 0
    }
    function mysql_index (resource, id) {
        return mysql[resource, id]
    }
    function mysql_finish (resource, i) {
        close(mysql[resource])
        system(sprintf("rm %s", mysql[resource]))
        delete mysql[resource]
        # remove the cached column names without creating new elements
        for (i = 1; (resource, i) in mysql; i++)
            delete mysql[resource, i]
    }
    function mysql_cleanup (  i) {
        # finish every resource that has not been finished yet
        for (i = 1; i < mysql["resource_id"]; i++)
            if (i in mysql)
                mysql_finish(i)
    }
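
    The way mysql_query and mysql_fetch_assoc cooperate can be shown without a database. The code above assumes that the mysql client writes tab-separated rows with a header line of column names. The sketch below applies the same mapping to two in-memory lines instead of a temp file; the function name fetch_assoc_demo() and the sample data are hypothetical.

```awk
# fetch_assoc_demo(): hypothetical stand-in for mysql_query + mysql_fetch_assoc.
# header is the first output line (column names), line is one data row.
function fetch_assoc_demo(header, line, row,    n, i, name) {
    split(header, name, "\t")      # column names, as cached by mysql_query
    n = split(line, row, "\t")     # field values, indexed 1..n
    for (i = 1; i <= n; i++)
        row[name[i]] = row[i]      # row is now keyed by position and by name
    return n
}

BEGIN {
    n = fetch_assoc_demo("id\tname", "7\tAda", row)
    print n, row[1], row["id"], row["name"]
}
```

    Each value is reachable both as row[1], row[2], ... and under its column name, which is how mysql_fetch_assoc fills the row array.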
    

    Support Utils

    Scan a string for mysql escaped tokens and replace them with the appropriate character. This is a fairly slow operation for large strings but it's necessary.

    function mysql_dequote (string, result,i,l,c) {
        result = ""
        l = length(string)
        for (i = 1; i <= l; i++) {
            c = substr(string, i, 1)
            if (c == "\\") {
                # This simply shouldn't happen...
                ## if ((i + 1) == l) continue;
                c = substr(string, ++i, 1)
                result = result __mysql_dequote[c]
            }
            else {
                result = result c
            }
        }
        return result
    }
    function mysql_quote (string,   result) {
        gsub(/\\/, "\\\\", string)
        gsub(/'/, "\\'", string)
        return "'" string "'"
    }
    

    Copyright

    "THE BEER-WARE LICENSE" (Revision 43) borrowed from FreeBSD's jail.c: wrote this file. As long as you retain this notice you can do whatever you want with this stuff. If we meet some day, and you think this stuff is worth it, you can buy me a beer in return.

    Author

    Scott S. McCoy


    categories: Databases,Jul,2009,CarloS

    NoSQL

    By Carlo Strozzi (carlo@strozzi.it).

    NoSQL is a fast, portable, relational database management system without arbitrary limits (other than memory and processor speed) that runs under, and interacts with, the UNIX Operating System. It uses the "Operator-Stream Paradigm" described in Unix Review (March, 1991, page 24, "A 4GL Language"), where a number of "operators" each perform a unique function on the data. These operators are written in Awk and C and are designed to be lightweight: each has a small memory footprint and starts up quickly.

    The main reason for turning the original RDB system into NoSQL is precisely that the former is written entirely in Perl. Perl is a good programming language for writing self-contained programs, but its pre-compilation phase and long start-up time are worth paying only if, once the program has loaded, it can do everything in one go. This contrasts sharply with the Operator-Stream Paradigm, where operators are chained together in pipelines of two, three or more programs. The overhead associated with initializing Perl at every stage of the pipeline makes pipelining Perl inefficient. A better way of manipulating structured ASCII files is to use the AWK programming language, which is much smaller than Perl, is more specialized for this task, and is very fast at startup.

    For more information on NoSQL, see the NoSQL home page.


    categories: Awk100,,Music,Tools,June,2009,StephenJ

    Plaiter: a music player

    Synopsis

    plaiter [options] [file, playlist, directory or stream ...]
    

    Download

    Download from LAWKER or, for the latest version, from SourceForge

    Description

    Plaiter (pronounced "player") is a command line front end to command line music players. It uses shell scripting to try to create the command line music player that Plait would have used if it already existed. It complements Plait but is also quite useful on its own, especially if you already use mpg123 or similar programs and find yourself wanting more features.

    What does Plaiter do that (say) mpg123 can't already? It queues tracks, first of all. Secondly, it understands commands like play, pause, stop, next and prev. Finally, unlike most of the command line music players out there, Plaiter can handle a play list with more than one type of audio file, selecting the proper helper app to handle each type of file you throw at it.

    Plaiter will automatically configure itself to use ogg123, mpg123, and/or mpg321, if they are installed on your system. If you have a helper application that plays other types of audio, Plaiter can be configured to use it as well.

    Like many of us, Plaiter is part daemon and part controller. The controller builds a play list from the files you provide on the command line and forwards commands to the daemon. The daemon reads commands and executes them by running helper applications.

    Options

    --daemon,-d        daemon mode
    --queue,-q         add tracks to queue
    --enqueue          add tracks to queue
    --random           random shuffle
    --play             play
    --pause            toggle pause mode
    --stop,-s          stop
    --latch [on|off]   toggle or set stop after current track
    --next,-n [n]      skip forward [n tracks]
    --prev [n]         skip backward [n tracks]
    --search           search in playlist
    --rsearch          reverse search in playlist
    --reset,-r         play track 1
    --loop [on|off]    toggle or set loop mode
    --quit             quit daemon
    --status           show status
    --list,-l          show playlist
    --help             show help
    --version          show version
    -v                 be verbose

    Copyright

    Copyright (C) 2005, 2006 by Stephen Jungels. Released under the GPL.

    Author

    Written by Stephen Jungels (sjungels@gmail.com)


    categories: ,Music,Tools,June,2009,DavidH

    Humdrum

    Download

    http://www.music-cog.ohio-state.edu/HumdrumDownload/downloading.html.

    Description

    The Humdrum Toolkit provides a set of free software tools intended to assist in music research. The toolkit is suitable for use in a wide variety of computer-based musical tasks.

    The Humdrum web site contains a comprehensive collection of over 200 web pages providing both detailed and summary information concerning all aspects of the Humdrum Toolkit.

    About 15% of the code is written in C, another 15% in kornshell, and about 2% using the LEX lexical parser and YACC compiler-compiler. The bulk of the code is written in AWK.

    Tasks and questions that Humdrum can address include:

    • Determine the rhyme scheme for a vocal text.
    • Identify any French sixth chords.
    • Locate instances of the pitch sequence D-S-C-H in Shostakovich's music.
    • Are German drinking songs more likely to be in triple meter?
    • Determine whether Haydn tends to avoid V-IV progressions.
    • Locate any doubled seventh scale degrees.

    (For a longer list of such questions, see the Humdrum sample problems page.)

    Author

    David Huron

    For more information

    Go to http://www.music-cog.ohio-state.edu/Humdrum/.


    categories: TenLiners,Tools,June,2009,Timm

    shuffle.awk

    Synopsis

    To rearrange the items in the input list:

     nshuffle(Array)
    

    To rearrange the items in a copy of the input list:

     shuffle(Array,Copy)
    

    The above calls assume that array item zero stores the length of the array. If this is not the case, use:

     shuffles(Array,Copy)
    

    Download

    Download from LAWKER.

    Description

    Suppose we want to shuffle the items of an array into a random order. This shuffle does so in linear time and memory.

    The algorithm comes from the dawn of computer time but I first heard of it from Bart Massey (at Portland State). Thank Bart for the clarity of the explanation and blame me for any silliness in the implementation.

    The Slow Way

    A simple way to shuffle an input array of elements is to:

    • Allocate an output array of the same size.
    • Copy items selected at random from the input to the output array.
    • Compact the input array by sliding the first part of the array down to fill the hole left by the removed item.

    This algorithm is clearly correct. However, the algorithm requires time quadratic in the size of the list, and 2x space.

    The Better Way

    We can easily reduce the time complexity to O(N). The only thing done with the input array is to select random elements from it, so the order of its elements is irrelevant. Therefore, instead of closing the hole left by a removed element by shifting elements, we'll close it by moving the first remaining element of the input array to fill the gap.

    Note an important invariant of the algorithm:

       the number of elements left in the input array 
     + the number of elements in the output array 
     ------------------------------------------------
     = the number of elements initially passed in.  
    

    This means that once an element is removed from the input array and the hole filled, there is a fresh hole created right at the beginning of the input array. Let us put the newly removed element in that hole. Now we can dispense with the output array altogether, and just return the input array. Now the space complexity is just x+1.

    Code

    This code assumes that the array "a" stores its size at "a[0]".

    function nshuffle(a,  i,j,n,tmp) {
      n=a[0]; # a has items at 1...n
      for(i=1;i<=n;i++) {
        # pick j uniformly from i..n; int() gives every index the same odds
        j=i+int(rand()*(n-i+1));
        tmp=a[j];
        a[j]=a[i];
        a[i]=tmp;
      };
      return n;
    }
    

    nshuffle is fast, but rearranges the order of items in the original list. shuffle generates a new copy of the list with the items in a random order.

    function shuffle(a,b,   i) {
      for(i in a) b[i]=a[i];
      nshuffle(b);
    }
    

    nshuffle also assumes that the list stores its size at position zero. If this is not the case, use shuffles.

    function shuffles(a,b,   c,n,i) {
      for(i in a) {n++; c[i]=a[i]};
      c[0]=n;
      shuffle(c,b);
    }
    

    Correctness proof

    By number of loop iterations

    Base case:
    When i = 0 the 0 array elements in a below i form a shuffled list of 0 elements. All remaining elements are candidates for append.
    Inductive case:
    Assume that i = k and that the sequence of elements in a below k are a random subsequence of the input values of length k. Now every possible remaining candidate is equally likely to occur at position k in this iteration. Thus at the end of the iteration i = k + 1 and the sequence of elements in a below k + 1 are a random subsequence of the input values of length k + 1.
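
    A quick empirical check can complement the proof. The sketch below is not part of shuffle.awk: it carries its own copy of nshuffle that draws j with int(rand()*(n-i+1)), shuffles a three-item list many times with a fixed seed, and prints how often each item lands in position 1. Under a uniform shuffle each share should be close to 1/3. The trial count and the seed are arbitrary choices.

```awk
# Self-contained copy of the in-place shuffle; a[0] holds the item count.
function nshuffle(a,    i, j, n, tmp) {
    n = a[0]
    for (i = 1; i <= n; i++) {
        j = i + int(rand() * (n - i + 1))    # uniform over i..n
        tmp = a[j]; a[j] = a[i]; a[i] = tmp
    }
    return n
}

BEGIN {
    srand(1)                                  # fixed seed: repeatable run
    trials = 30000
    for (t = 1; t <= trials; t++) {
        a[0] = 3; a[1] = "x"; a[2] = "y"; a[3] = "z"
        nshuffle(a)
        first[a[1]]++                         # who ended up in position 1?
    }
    for (k in first)
        printf "%s %.3f\n", k, first[k] / trials
}
```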

    Examples

    Random orders

    One way to use the above is to run down a list in a random order. For example:

    BEGIN {
      if (ShuffleDemo) {
      		if (Seed) { srand(Seed) } else { srand() };
      		s2i(ShuffleDemo,L1," ");
      		shuffles(L1,L2);
      		while(Item =pop(L2)) print Item;
      }
    }
    function s2i(str,a,sep,   n,i,tmp) {
      n=split(str,tmp,sep);
      for(i=1;i<=n;i++) a[i]=tmp[i];
      return n;
    }
    function pop(a,   x,i) {
      i=a[0]--;  
      if (!i) {return ""} else {x=a[i]; delete a[i]; return x}
    } 
    

    The above can be run using

     gawk -f shuffle.awk  -v ShuffleDemo="aa bb cc dd"
    

    If you run this twice, you'll see two different orderings. Here's one:

     cc
     aa
     dd
     bb
    

    And here's another:

     dd
     bb
     cc
     aa
    

    Fast sampling

    If you are generating the above lists very quickly, then be aware that srand() seeds its random number generator from the current time of day, in seconds. So, if you are calling the above command line many times per second, you can get repeated outputs.

    The fix is to supply a seed from the Bash $RANDOM variable:

     gawk -f shuffle.awk -v ShuffleDemo="aa bb cc dd" -v Seed=$RANDOM
    

    Even when called much faster than once a second, the above call will generate (far) fewer repeats.

    Repeats

    If you want to repeat some prior run (say, during debugging), set the Seed variable on the command line using (e.g.)

     gawk -f shuffle.awk -v ShuffleDemo="aa bb cc dd" -v Seed=23
    

    This will always print out the same ordering.
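
    The reason a fixed seed repeats a run can be checked directly: seeding twice with the same value restarts the same pseudo-random sequence. A minimal sketch (the variable names are illustrative only):

```awk
BEGIN {
    srand(23); first  = rand()     # first draw after seeding with 23
    srand(23); second = rand()     # reseed with 23 and draw again
    result = (first == second) ? "same" : "different"
    print result                   # prints "same"
}
```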

    Author

    Tim Menzies


    categories: Runawk,Project,Tools,Mar,2009,AlexC

    runawk - wrapper for AWK interpreter

    (Note: see recent update.)

    Download from...

    Download from LAWKER or a tar file or from SourceForge.

    NAME

    runawk - wrapper for AWK interpreter

    SYNOPSIS

    runawk [options] program_file

    runawk -e program

    DESCRIPTION

    After years of using AWK for programming I've found that, despite its simplicity and limitations, AWK is good enough for scripting a wide range of tasks. AWK is not as powerful as its bigger counterparts like Perl, Ruby, Tcl and others, but it has its own advantages, such as compactness, simplicity and availability on almost all UNIX-like systems. I personally also like its data-driven nature and token orientation, a very useful technique for simple text processing utilities.

    But! Unfortunately, awk interpreters lack some important features and sometimes do not work as well as they should.

    Problems I see (some of them, of course)

    1. AWK lacks support for modules. Even if I create small programs, I often want to use functions created earlier and already used in other scripts. That is, it would be great to organise functions into so-called libraries (modules).

    2. In order to pass arguments to a #!/usr/bin/awk -f script (not to the awk interpreter), it is necessary to prepend the list of arguments with -- (two minus signs). In my view, this looks bad.

      Example:

      awk_program:

          #!/usr/bin/awk -f
      
          BEGIN {
             for (i=1; i < ARGC; ++i){
                printf "ARGV [%d]=%s\n", i, ARGV [i]
             }
          }

      Shell session:

          % awk_program --opt1 --opt2
          /usr/bin/awk: unknown option --opt1 ignored
          /usr/bin/awk: unknown option --opt2 ignored
      
          % awk_program -- --opt1 --opt2
          ARGV [1]=--opt1
          ARGV [2]=--opt2
          %

      In my opinion awk_program script should work like this

          % awk_program --opt1 --opt2
          ARGV [1]=--opt1
          ARGV [2]=--opt2
          %

      It is possible using runawk.

    3. When a #!/usr/bin/awk -f script handles arguments (options) and wants to read from stdin, it is necessary to add /dev/stdin (or `-') explicitly as the last argument.

      Example:

      awk_program:

          #!/usr/bin/awk -f
      
          BEGIN {
             if (ARGV [1] == "--flag"){
                flag = 1
                ARGV [1] = "" # to not read file named "--flag"
             }
          }
          {
             print "flag=" flag " $0=" $0
          }

      Shell session:

          % echo test | awk_program -- --flag
          % echo test | awk_program -- --flag /dev/stdin
          flag=1 $0=test
          %

      Ideally awk_program should work like this

          % echo test | awk_program --flag
          flag=1 $0=test
          %

    runawk was created to solve all these problems

    OPTIONS

    -h|--help

    Display help information.

    -V|--version

    Display version information.

    -d|--debug

    Turn on a debugging mode in which runawk prints the argument list with which the real awk interpreter will be run.

    -i|--with-stdin

    Always add stdin file name to a list of awk arguments

    -I|--without-stdin

    Do not add stdin file name to a list of awk arguments

    -e|--execute program

    Specify program. If -e is not specified program is read from program_file.

    DETAILS/INTERNALS

    Standalone script

    Under UNIX-like OS-es you can use runawk by beginning your script with

       #!/usr/local/bin/runawk

    line or something like this instead of

       #!/usr/bin/awk -f

    or similar.

    AWK modules

    In order to activate modules you should add them into awk script like this

      #use "module1.awk"
      #use "module2.awk"

    That is, the line that specifies a module name is treated as a comment by a normal AWK interpreter but is processed specially by runawk.

    Note that #use should begin at column 0: no spaces are allowed before it and no spaces are allowed between # and use.

    Also note that AWK modules can themselves "use" other modules, and so forth. All of them are collected in depth-first order and each one is added to the list of awk interpreter arguments, prepended with the -f option. That is, the #use directive is *NOT* similar to #include in the C programming language: runawk's module code is not inserted in place of #use. Runawk's modules are closer to Perl's "use" command. If a module is mentioned more than once, only one -f will be added for it, i.e. duplicates are removed automatically.

    The position of a #use directive in a source file does matter, i.e. the earlier a module is mentioned, the earlier -f will be generated for it.

    Example:

      file prog:
         #!/usr/local/bin/runawk
    
         #use "A.awk"
         #use "B.awk"
         #use "E.awk"
    
         PROG code
         ...
      file B.awk:
         #use "A.awk"
         #use "C.awk"
         B code
         ...
      file C.awk:
         #use "A.awk"
         #use "D.awk"
    
         C code
         ...
    A.awk and D.awk don't contain #use directive.

    If you run

      runawk prog file1 file2

    or

      /path/to/prog file1 file2

    the following command

      awk -f A.awk -f D.awk -f C.awk -f B.awk -f E.awk -f prog -- file1 file2

    will actually run.

    You can check this by running

      runawk -d prog file1 file2
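
    The ordering rule (depth-first, each module after the modules it uses, duplicates dropped) can be sketched in Awk itself. This is an illustration of the rule only, not runawk's implementation: the deps table, the visit() function and the order variable are hypothetical, with the table hard-coding the #use lines from the example above.

```awk
# visit(): post-order walk. A file is emitted only after everything it uses,
# and the seen table makes sure each file is emitted once.
function visit(file,    n, i, uses) {
    if (file in seen) return
    seen[file] = 1
    n = split(deps[file], uses, " ")     # modules named by #use, in file order
    for (i = 1; i <= n; i++)
        visit(uses[i])
    order = order " -f " file
}

BEGIN {
    deps["prog"]  = "A.awk B.awk E.awk"
    deps["B.awk"] = "A.awk C.awk"
    deps["C.awk"] = "A.awk D.awk"
    visit("prog")
    print "awk" order    # awk -f A.awk -f D.awk -f C.awk -f B.awk -f E.awk -f prog
}
```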

    Module search strategy

    Modules are first searched for in the directory where the main program (or the module in which the #use directive is specified) is placed. If a module is not found there, the AWKPATH environment variable is checked. AWKPATH holds a colon-separated list of search directories. Finally, the module is searched for in the system runawk modules directory, by default PREFIX/share/runawk, but this can be changed at build time.

    An absolute path of the module can also be specified.

    AWK interpreter and its arguments

    In order to pass arguments to an AWK script correctly, runawk treats arguments beginning with a `-' (minus) sign specially. The following command

      runawk prog2 -x -f=file -o=output file1 file2

    or

      /path/to/prog2 -x -f=file -o=output file1 file2

    will actually run

      awk -f prog2 -- -x -f=file -o=output file1 file2

    so the -x, -f and -o options are passed to awk's ARGV/ARGC variables together with file1 and file2. If all arguments begin with `-' (minus), runawk adds the stdin filename to the end of the argument list (unless the -I option is specified), i.e. running

      runawk prog3 --value=value

    or

      /path/to/prog3 --value=value

    will actually run the following

      awk -f prog3 -- --value=value /dev/stdin

    Program as an argument

    Like some other interpreters, runawk can take the script from the command line, like this

     /path/to/runawk -e '
     #use "alt_assert.awk"
    
     {
       assert($1 >= 0 && $1 <= 10, "Bad value: " $1)
    
       # your code below
       ...
     }'

    Selecting a preferred AWK interpreter

    You may prefer one AWK interpreter over another for a particular script. You can select it with the #interp directive, like this

      file prog:
         #!/usr/local/bin/runawk
    
         #use "A.awk"
         #use "B.awk"
    
         #interp "/usr/pkg/bin/nbawk"
    
         # your code here
         ...

    The reason may be efficiency for a particular task, useful but non-standard extensions, or anything else.

    Note that the #interp directive must also begin in column 0: no spaces are allowed before it or between # and interp.

    Setting environment

    In some cases you may want to run AWK interpreter with a specific environment. For example, your script may be oriented to process ASCII text only. In this case you can run AWK with LC_CTYPE=C environment and use regexp ranges.

    runawk provides the #env directive for this. The string inside the double quotes is passed to the putenv(3) libc function.

    Example:

      file prog:
         #!/usr/local/bin/runawk
    
         #env "LC_ALL=C"
    
         $1 ~ /^[A-Z]+$/ { # A-Z is valid if LC_CTYPE=C
             print $1
         }

    EXIT STATUS

    If the AWK interpreter exits normally, runawk exits with its exit status. If the AWK interpreter was killed by a signal, runawk exits with exit status 128+signal.

    ENVIRONMENT

    AWKPATH

    Colon separated list of directories where awk modules are searched.

    RUNAWK_AWKPROG

    Sets the path to the AWK interpreter, used by default, i.e. this variable overrides the compile-time default. Note that #interp directive overrides this.

    AUTHOR/LICENSE

    Copyright (c) 2007-2008 Aleksey Cheusov <vle@gmx.net>

    Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

    The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

    BUGS/FEEDBACK

    Please send any comments, questions, bug reports etc. to me by e-mail or (even better) register them at sourceforge project home. Feature requests are also welcomed.


    categories: Awk100,Macros,Tools,Mar,2009,JonB

    m1 : A Micro Macro Processor

    Synopsis

    awk -f m1.awk [file...]
    

    Download

    Download from LAWKER.

    Description

    M1 is a simple macro language that supports the essential operations of defining strings and replacing strings in text by their definitions. It also provides facilities for file inclusion and for conditional expansion of text. It is not designed for any particular application, so it is mildly useful across several applications, including document preparation and programming. This paper describes the evolution of the program; the final version is implemented in about 110 lines of Awk.

    M1 copies its input file(s) to its output unchanged except as modified by certain "macro expressions." The following lines define macros for subsequent processing:

     @comment Any text
     @@                     same as @comment
     @define name value
     @default name value    set if name undefined
     @include filename
     @if varname            include subsequent text if varname != 0
     @unless varname        include subsequent text if varname == 0
     @fi                    terminate @if or @unless
     @ignore DELIM          ignore input until line that begins with DELIM
     @stderr stuff          send diagnostics to standard error
    

    A definition may extend across many lines by ending each line with a backslash, thus quoting the following newline.

    Any occurrence of @name@ in the input is replaced in the output by the corresponding value.

    @name at beginning of line is treated the same as @name@.

    Applications

    Form Letters

    We'll start with a toy example that illustrates some simple uses of m1. Here's a form letter that I've often been tempted to use:

    @default MYNAME Jon Bentley 
    @default TASK respond to your special offer 
    @default EXCUSE the dog ate my homework 
    Dear @NAME@: 
        Although I would dearly love to @TASK@, 
    I am afraid that I am unable to do so because @EXCUSE@. 
    I am sure that you have been in this situation 
    many times yourself. 
                Sincerely, 
                @MYNAME@ 
    

    If that file is named sayno.mac, it might be invoked with this text:

    @define NAME Mr. Smith 
    @define TASK subscribe to your magazine 
    @define EXCUSE I suddenly forgot how to read 
    

    Recall that a @default takes effect only if its variable was not previously @defined.

    Troff Pre-Processing

    I've found m1 to be a handy Troff preprocessor. Many of my text files (including this one) start with m1 definitions like:

    @define ArrayFig @StructureSec@.2 
    @define HashTabFig @StructureSec@.3 
    @define TreeFig @StructureSec@.4 
    @define ProblemSize 100 
    

    Even a simple form of arithmetic would be useful in numeric sequences of definitions. The longer m1 variables get around Troff's dreadful two-character limit on string names; these variables are also available to Troff preprocessors like Pic and Eqn. Various forms of the @define, @if, and @include facilities are present in some of the Troff-family languages (Pic and Troff) but not others (Tbl); m1 provides a consistent mechanism.

    I include figures in documents with lines like this:

    @define FIGNUM @FIGMFMOVIE@ 
    @define FIGTITLE The Multiple Fragment heuristic. 
    @FIGSTART@ 
    <PS> <@THISDIR@/mfmovie.pic</PS>
    @FIGEND@ 
    

    The two @defines are a hack to supply the two parameters of number and title to the figure. The figure might be set off by horizontal lines or enclosed in a box, the number and title might be printed at the top or the bottom, and the figures might be graphs, pictures, or animations of algorithms. All figures, though, are presented in the consistent format defined by FIGSTART and FIGEND.

    Awk Library Management

    I have also used m1 as a preprocessor for Awk programs. The @include statement allows one to build simple libraries of Awk functions (though some, but not all, Awk implementations provide this facility by allowing multiple program files). File inclusion was used in an earlier version of this paper to include individual functions in the text and then wrap them all together into the complete m1 program. The conditional statements allow one to customize a program with macros rather than run-time if statements, which can reduce both run time and compile time.

    Controlling Experiments

    The most interesting application for which I've used this macro language is unfortunately too complicated to describe in detail. The job for which I wrote the original version of m1 was to control a set of experiments. The experiments were described in a language with a lexical structure that forced me to make substitutions inside text strings; that was the original reason that substitutions are bracketed by at-signs. The experiments are currently controlled by text files that contain descriptions in the experiment language, data extraction programs written in Awk, and graphical displays of data written in Grap; all the programs are tailored by m1 commands.

    Most experiments are driven by short files that set a few key parameters and then @include a large file with many @defaults. Separate files describe the fields of shared databases:

     @define N ($1) 
     @define NODES ($2) 
     @define CPU ($3) 
     ... 
    

    These files are @included in both the experiment files and in Troff files that display data from the databases. I had tried to conduct a similar set of experiments before I built m1, and got mired in muck. The few hours I spent building the tool were paid back handsomely in the first days I used it.

    The Substitution Function

    M1 uses a fast substitution function. The idea is to process the string from left to right, searching for the first substitution to be made. We then make the substitution, and rescan the string starting at the fresh text. We implement this idea by keeping two strings: the text processed so far is in L (for Left), and unprocessed text is in R (for Right). Here is the pseudocode for dosubs:

    L = Empty 
    R = Input String 
    while R contains an "@" sign do 
    	let R = A @ B; set L = L A and R = B 
    	if R contains no "@" then 
    		L = L "@" 
    		break 
    	let R = A @ B; set M = A and R = B 
    	if M is in SymTab then 
    		R = SymTab[M] R 
    	else 
    		L = L "@" M 
    		R = "@" R 
    return L R
    

    Possible Extensions

    There are many ways in which the m1 program could be extended. Here are some of the biggest temptations to "creeping creaturism":

    • A long definition with a trail of backslashes might be more graciously expressed by a @longdefine statement terminated by a @longend.
    • An @undefine statement would remove a definition from the symbol table.
    • I've been tempted to add parameters to macros, but so far I have gotten around the problem by using an idiom described in the next section.
    • It would be easy to add stack-based arithmetic and strings to the language by adding @push and @pop commands that read and write variables.
    • As soon as you try to write interesting macros, you need to have mechanisms for quoting strings (to postpone evaluation) and for forcing immediate evaluation.

    Code

    The following code is short (around 100 lines), which is significantly shorter than other macro processors; see, for instance, Chapter 8 of Kernighan and Plauger [1981]. The program uses several techniques that can be applied in many Awk programs.

    • Symbol tables are easy to implement with Awk's associative arrays.
    • The program makes extensive use of Awk's string-handling facilities: regular expressions, string concatenation, gsub, index, and substr.
    • Awk's file handling makes the dofile procedure straightforward.
    • The readline function and pushback mechanism associated with buffer are of general utility.

    error

    function error(s) {
    	print "m1 error: " s | "cat 1>&2"; exit 1
    }
    

    dofile

    function dofile(fname,  savefile, savebuffer, newstring) {
    	if (fname in activefiles)
    		error("recursively reading file: " fname)
    	activefiles[fname] = 1
    	savefile = file; file = fname
    	savebuffer = buffer; buffer = ""
    	while (readline() != EOF) {
    		if (index($0, "@") == 0) {
    			print $0
    		} else if (/^@define[ \t]/) {
    			dodef()
    		} else if (/^@default[ \t]/) {
    			if (!($2 in symtab))
    				dodef()
    		} else if (/^@include[ \t]/) {
    			if (NF != 2) error("bad include line")
    			dofile(dosubs($2))
    		} else if (/^@if[ \t]/) {
    			if (NF != 2) error("bad if line")
    			if (!($2 in symtab) || symtab[$2] == 0)
    				gobble()
    		} else if (/^@unless[ \t]/) {
    			if (NF != 2) error("bad unless line")
    			if (($2 in symtab) && symtab[$2] != 0)
    				gobble()
    		} else if (/^@fi([ \t]?|$)/) { # Could do error checking here
    		} else if (/^@stderr[ \t]?/) {
    			print substr($0, 9) | "cat 1>&2"
    		} else if (/^@(comment|@)[ \t]?/) {
    		} else if (/^@ignore[ \t]/) { # Dump input until $2
    			delim = $2
    			l = length(delim)
    			while (readline() != EOF)
    				if (substr($0, 1, l) == delim)
    					break
    		} else {
    			newstring = dosubs($0)
    			if ($0 == newstring || index(newstring, "@") == 0)
    				print newstring
    			else
    				buffer = newstring "\n" buffer
    		}
    	}
    	close(fname)
    	delete activefiles[fname]
    	file = savefile
    	buffer = savebuffer
    }
    

    readline

    Put next input line into global string "buffer". Return "EOF" or "" (null string).

    function readline(  i, status) {
    	status = ""
    	if (buffer != "") {
    		i = index(buffer, "\n")
    		$0 = substr(buffer, 1, i-1)
    		buffer = substr(buffer, i+1)
    	} else {
    		# Hume: special case for non v10: if (file == "/dev/stdin")
    		if (getline <file <= 0)
    			status = EOF
    	}
    	# Hack: allow @Mname at start of line w/o closing @
    	if ($0 ~ /^@[A-Z][a-zA-Z0-9]*[ \t]*$/)
    		sub(/[ \t]*$/, "@")
    	return status
    }
    

    gobble

    function gobble(  ifdepth) {
    	ifdepth = 1
    	while (readline() != EOF) {
    		if (/^@(if|unless)[ \t]/)
    			ifdepth++
    		if (/^@fi[ \t]?/ && --ifdepth <= 0)
    			break
    	}
    }
    

    dosubs

    function dosubs(s,  l, r, i, m) {
    	if (index(s, "@") == 0)
    		return s
    	l = ""	# Left of current pos; ready for output
    	r = s	# Right of current; unexamined at this time
    	while ((i = index(r, "@")) != 0) {
    		l = l substr(r, 1, i-1)
    		r = substr(r, i+1)	# Currently scanning @
    		i = index(r, "@")
    		if (i == 0) {
    			l = l "@"
    			break
    		}
    		m = substr(r, 1, i-1)
    		r = substr(r, i+1)
    		if (m in symtab) {
    			r = symtab[m] r
    		} else {
    			l = l "@" m
    			r = "@" r
    		}
    	}
    	return l r
    }
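
    To watch the rescan at work, the function can be exercised on its own. The sketch below repeats dosubs with added comments and supplies a made-up symbol table (A and B are hypothetical macro names) plus a shell wrapper, m1_subst, that is not part of m1.

```shell
#!/bin/sh
# Hypothetical wrapper: run dosubs on one line with a tiny symbol table.
m1_subst() {
  printf '%s\n' "$1" | awk '
  function dosubs(s,  l, r, i, m) {
    if (index(s, "@") == 0)
      return s
    l = ""   # already processed
    r = s    # still to scan
    while ((i = index(r, "@")) != 0) {
      l = l substr(r, 1, i-1)
      r = substr(r, i+1)
      i = index(r, "@")
      if (i == 0) {          # lone @ : copy it and stop
        l = l "@"
        break
      }
      m = substr(r, 1, i-1)
      r = substr(r, i+1)
      if (m in symtab) {
        r = symtab[m] r      # rescan the replacement text
      } else {
        l = l "@" m          # not a macro: keep text, retry from closing @
        r = "@" r
      }
    }
    return l r
  }
  BEGIN { symtab["A"] = "@B@!"; symtab["B"] = "hi" }   # hypothetical definitions
  { print dosubs($0) }'
}

m1_subst 'pre @A@ post @nope@ end'    # pre hi! post @nope@ end
```

    The value of A itself contains @B@, which is expanded because the replacement is pushed back onto the unscanned text; the undefined name nope passes through untouched.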
    

    dodef

    function dodef(fname,  str, x) {
    	name = $2
    	sub(/^[ \t]*[^ \t]+[ \t]+[^ \t]+[ \t]*/, "")  # OLD BUG: last * was +
    	str = $0
    	while (str ~ /\\$/) {
    		if (readline() == EOF)
    			error("EOF inside definition")
    		# OLD BUG: sub(/\\$/, "\n" $0, str)
    		x = $0
    		sub(/^[ \t]+/, "", x)
    		str = substr(str, 1, length(str)-1) "\n" x
    	}
    	symtab[name] = str
    }
    

    BEGIN

    BEGIN {	
        EOF = "EOF"
    	if (ARGC == 1)
    		dofile("/dev/stdin")
    	else if (ARGC >= 2) {
    		for (i = 1; i < ARGC; i++)
    			dofile(ARGV[i])
    	} else
    		error("usage: m1 [fname...]")
    }
    

    Bugs

    M1 is three steps lower than m4. You'll probably miss something you have learned to expect.

    History

    M1 was documented in the 1997 sedawk book by Dale Dougherty & Arnold Robbins (ISBN 1-56592-225-5) but may have been written earlier.

    This page was adapted from 131.191.66.141:8181/UNIX_BS/sedawk/examples/ch13/m1.pdf (download from LAWKER).

    Author

    Jon L. Bentley.


    categories: Macros,Tools,Mar,2009,WillW

    m5 - macro processor

    Download

    Download from LAWKER.

    Synopsis

    m5 [ -Dname ] [ -Dname=def ] [-c] [ -dp char ] 
       [ -o file ] [-sp char ] [ file ... ]
     
    [g|n]awk -f m5.awk X [ -Dname ] [ -Dname=def ]  [-c]  [ -dp char ] 
                         [ -o file ] [ -sp char ] [ file ... ]
    

    Description

    M5 is a Bourne shell script for invoking m5.awk, which actually performs the macro processing. m5, unlike many macro processors, does not directly interpret its input. Instead it uses a two-pass approach in which the first pass translates the input to an awk program, and the second pass executes the awk program to produce the final output. Details of usage are provided below.

    This two-pass system means that macros can contain awk commands, to be executed on the second pass. This greatly extends the expressiveness of the m5 macro system.

    As noted in the synopsis above, its invocation may require specification of awk, gawk, or nawk, depending on the version of awk available on your system. This choice is further complicated on some systems, e.g. Sun, which have both awk (original awk) and nawk (new awk). Other systems appear to have new awk, but have named it just awk. New awk should be used, regardless of what it has been named. The macro processor translator will not work with original awk because the translator uses, for example, the built-in function match().

    Options

    The following options are supported:

    -Dname
    Following the cpp convention, define name as 1 (one). This is the same as if a -Dname=1 appeared as an option or #name=1 appeared as an input line. Names specified using -D are awk variables defined just before main is invoked.
    -Dname=def
    Define name as "def". This is the same as if #name="def" appeared as an input line. Names specified using -D are awk variables defined just before main is invoked.
    X
    Yes, that really is a capital "X". The version of nawk on Sun Solaris 2.5.1 apparently does its own argument processing before passing the arguments on to the awk program. In this case, X (and all succeeding options) are believed by nawk to be file names and are passed on to the macro processor translator (m5.awk) for its own argument processing. Without the X, Sun nawk attempts to process succeeding options (e.g., -Dname) as valid nawk arguments or files, thus causing an error. This may not be a problem for all awks.
    -c
    Compile only. The output program is still produced, but the final output is not.
    -dp char
    The directive prefix character (default is #).
    -o file
    The output program file (default is a.awk).
    -sp char
    The substitution prefix character (default is $).

    Usage

    Overview

    The program that performs the first pass noted above is called the m5 translator and is named m5.awk. The input to the translator may be either standard input or one or more files listed on the command line. An input line with the directive prefix character (# by default) in column 1 is treated as a directive statement in the MP directive language (awk). All other input lines are processed as text lines. Simple macros are created using awk assignment statements and their values referenced using the substitution prefix character ($ by default). The backslash (\) is the escape character; its presence forces the next character to literally appear in the output. This is most useful when forcing the appearance of the directive prefix character, the substitution prefix character, and the escape character itself.

    Macro Substitution

    All input lines are scanned for macro references that are indicated by the substitution prefix character. Assuming the default value of that character, macro references may be of the form $var, $(var), $(expr), $[str], $var[expr], or $func(args). These are replaced by an awk variable, awk variable, awk expression, awk array reference to the special array M[], regular awk array reference, or awk function call, respectively. These are, in effect, macros. The MP translator checks for proper nesting of parentheses and double quotes when translating $(expr) and $func(args) macros, and checks for proper nesting of square brackets and double quotes when translating $[expr] and $var[expr] macros. The substitution prefix character indicates a macro reference unless it is (i) escaped (e.g., \$abc), (ii) followed by a character other than A-Z, a-z, (, or [ (e.g., $@), or (iii) inside a macro reference (e.g., $($abc); probably an error).

    An understanding of the implementation of macro substitution will help in its proper usage. When a text line is encountered, it is scanned for macros, embedded in an awk print statement, and copied to the output program. For example, the input line

    The quick $fox jumped over the lazy $dog.
    

    is transformed into

    print "The quick " fox " jumped over the lazy " dog "."
    

    Obviously the use of this transformation technique relies completely on the presence of the awk concatenation operator (one or more blanks).
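
    The transformation can be imitated for the simplest macro form, $var. The sketch below is not the m5 translator: it ignores $(expr), $[str], array and function references and the backslash escape, and it assumes the text contains no double quotes. The function name m5_line is hypothetical.

```shell
#!/bin/sh
# Hypothetical one-line translator: turn a text line with $var references
# into an awk print statement built from string literals and variables.
m5_line() {
  printf '%s\n' "$1" | awk '
  {
    out  = "print \""
    rest = $0
    while (match(rest, /\$[A-Za-z][A-Za-z0-9_]*/)) {
      # close the literal, emit the variable name, reopen the literal
      out  = out substr(rest, 1, RSTART - 1) "\" " \
             substr(rest, RSTART + 1, RLENGTH - 1) " \""
      rest = substr(rest, RSTART + RLENGTH)
    }
    print out rest "\""
  }'
}

m5_line 'The quick $fox jumped over the lazy $dog.'
# print "The quick " fox " jumped over the lazy " dog "."
```

    The blanks placed around each variable name are what make awk concatenate the pieces when the generated statement runs.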

    Macros Containing Macros

    As already noted, a macro reference inside another macro reference will not result in substitution and will probably cause an awk execution-time error. Furthermore, a substitution prefix character in the substituted string is also generally not significant because the substitution prefix character is detected at translation time, and macro values are assigned at execution time. However, macro references of the form $[expr] provide a simple nested referencing capability. For example, if $[abc] is in a text line, or in a directive line and not on the left hand side of an assignment statement, it is replaced by eval(M["abc"]). When the output program is executed, the m5 runtime routine eval() substitutes the value of M["abc"], examining it for further macro references of the form $[str] (where "str" denotes an arbitrary string). If one is found, substitution and scanning proceed recursively. Function type macro references may result in references to other macros, thus providing an additional form of nested referencing.
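
    The recursive scan can be tried in isolation. This is a compact sketch in the spirit of the runtime eval() routine, not the routine itself; it assumes $ as the substitution prefix, and the names expand and m5_eval_demo are invented for the example.

```shell
#!/bin/sh
# Hypothetical demonstration of nested $[name] references resolved at run time.
m5_eval_demo() {
  awk '
  function expand(s,   i, j, name, out) {
    out = ""
    while ((i = index(s, "$[")) > 0) {
      j = index(substr(s, i), "]")          # first ] at or after the $[
      if (j == 0) break                     # unterminated reference: stop
      name = substr(s, i + 2, j - 3)        # text between $[ and ]
      out = out substr(s, 1, i - 1) expand(M[name])
      s = substr(s, i + j)                  # continue after the ]
    }
    return out s
  }
  BEGIN {
    M["fox"] = "$[q] $[b] $[f]"
    M["q"] = "quick"; M["b"] = "brown"; M["f"] = "fox"
    print "The " expand(M["fox"]) "."
  }'
}

m5_eval_demo    # The quick brown fox.
```

    A macro whose value refers to itself would recurse without end in this sketch, so definitions are assumed to be acyclic.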

    Directive Lines

    Except for the include directive, when a directive line is detected, the directive prefix is removed, the line is scanned for macros, and then the line is copied to the output program (as distinct from the final output). Any valid awk construct, including the function statement, is allowed in a directive line. Further information on writing awk programs may be found in Aho, Kernighan, and Weinberger, Dougherty and Robbins, and Robbins.

    Include Directive

    A single non-awk directive has been provided: the include directive. Assuming that # is the directive prefix, #include(filename) directs the MP translator to immediately read from the indicated file, processing lines from it in the normal manner. This processing mode makes the include directive the only type of directive to take effect at translation time. Nested includes are allowed. Include directives must appear on a line by themselves. More elaborate types of file processing may be directly programmed using appropriate awk statements in the input file.

    Main Program and Functions

    The MP translator builds the resulting awk program in one of two ways, depending on the form of the first input line. If that line begins with "function", it is assumed that the user is providing one or more functions, including the function "main" required by m5. If the first line does not begin with "function", then the entire input file is translated into awk statements that are placed inside "main". If some input lines are inside functions, and others are not, awk will detect this and complain. The MP by design has little awareness of the syntax of directive lines (awk statements), and as a consequence syntax errors in directive lines are not detected until the output program is executed.

    Output

    Finally, unless the -c (compile only) option is specified on the command line, the output program is executed to produce the final output (directed by default to standard output). The version of awk specified in ARGV[0] (a built-in awk variable containing the command name) is used to execute the program. If ARGV[0] is null, awk is used.

    EXAMPLE

    Understanding this example requires recognition that macro substitution is a two-step process: (i) the input text is translated into an output awk program, and (ii) the awk program is executed to produce the final output with the macro substitutions actually accomplished. The examples below illustrate this process. # and $ are assumed to be the directive and substitution prefix characters. This example was successfully executed using awk on a Cray C90 running UNICOS 10.0.0.3, gawk on a Gateway E-3200 running SuSE Linux Version 6.0, and nawk on a Sun Ultra 2 Model 2200 running Solaris 2.5.1.

    Input Text

    #function main() {
    
       Example 1: Simple Substitution
       ------------------------------
    #  br = "brown"
       The quick $br fox.
    
       Example 2: Substitution inside a String
       ---------------------------------------
    #  r = "row"
       The quick b$(r)n fox.
    
       Example 3: Expression Substitution
       ----------------------------------
    #  a = 4
    #  b = 3
       The quick $(2*a + b) foxes.
    
       Example 4: Macros References inside a Macro
       -------------------------------------------
    #  $[fox] = "\$[q] \$[b] \$[f]"
    #  $[q] = "quick"
    #  $[b] = "brown"
    #  $[f] = "fox"
       The $[fox].
    
       Example 5: Array Reference Substitution
       ---------------------------------------
    #  x[7] = "brown"
    #  b = 3
       The quick $x[2*b+1] fox.
    
       Example 6: Function Reference Substitution
       ------------------------------------------
       The quick $color(1,2) fox.
    
       Example 7: Substitution of Special Characters
       ---------------------------------------------
    \#  The \$ quick \\ brown $# fox. $$
    #}
    #include(testincl.m5)
    

    Included File testincl.m5

    #function color(i,j) {
       The lazy dog.
    #  if (i == j)
    #     return "blue"
    #  else
    #     return "brown"
    #}
    

    Output Program

    function main() {
       print
       print "   Example 1: Simple Substitution"
       print "   ------------------------------"
       br = "brown"
       print "   The quick " br " fox."
       print
       print "   Example 2: Substitution inside a String"
       print "   ---------------------------------------"
       r = "row"
       print "   The quick b" r "n fox."
       print
       print "   Example 3: Expression Substitution"
       print "   ----------------------------------"
       a = 4
       b = 3
       print "   The quick " 2*a + b " foxes."
       print
       print "   Example 4: Macros References inside a Macro"
       print "   -------------------------------------------"
       M["fox"] = "$[q] $[b] $[f]"
       M["q"] = "quick"
       M["b"] = "brown"
       M["f"] = "fox"
       print "   The " eval(M["fox"]) "."
       print
       print "   Example 5: Array Reference Substitution"
       print "   ---------------------------------------"
       x[7] = "brown"
       b = 3
       print "   The quick " x[2*b+1] " fox."
       print
       print "   Example 6: Function Reference Substitution"
       print "   ------------------------------------------"
       print "   The quick " color(1,2) " fox."
       print
       print "   Example 7: Substitution of Special Characters"
       print "   ---------------------------------------------"
       print "\#  The \$ quick \\ brown $# fox. $$"
    }
    function color(i,j) {
       print "   The lazy dog."
       if (i == j)
          return "blue"
       else
          return "brown"
    }
    
    function eval(inp   ,isplb,irb,out,name) {
    
       splb = SP "["
       out = ""
    
       while( isplb = index(inp, splb) ) {
          irb = index(inp, "]")
          if ( irb == 0 ) {
             out = out substr(inp,1,isplb+1)
             inp = substr( inp, isplb+2 )
          } else {
             name = substr( inp, isplb+2, irb-isplb-2 )
             sub( /^ +/, "", name )
             sub( / +$/, "", name )
             out = out substr(inp,1,isplb-1) eval(M[name])
             inp = substr( inp, irb+1 )
          }
       }
    
       out = out inp
    
       return out
    }
    BEGIN {
       SP = "$"
       main()
       exit
    }
    

    Final Output

       Example 1: Simple Substitution
       ------------------------------
       The quick brown fox.
    
       Example 2: Substitution inside a String
       ---------------------------------------
       The quick brown fox.
    
       Example 3: Expression Substitution
       ----------------------------------
       The quick 11 foxes.
    
       Example 4: Macros References inside a Macro
       -------------------------------------------
       The quick brown fox.
    
       Example 5: Array Reference Substitution
       ---------------------------------------
       The quick brown fox.
    
       Example 6: Function Reference Substitution
       ------------------------------------------
       The lazy dog.
       The quick brown fox.
    
       Example 7: Substitution of Special Characters
       ---------------------------------------------
    #  The $ quick \ brown $# fox. $$
    

    File

    a.awk is the default output program file.

    See Also

    awk(1), cpp(1), gawk(1), m4(1), nawk(1), vi(1)

    Author

    William A. Ward, Jr., School of Computer and Information Sciences, University of South Alabama, Mobile, Alabama, July 23, 1999.


    categories: Wp,Project,Tools,Mar,2009,Timm

    AWKWORDS

    Synopsis

    awkwords --title "Title" file > file.html

    awkwords file > file.html

    Download

    This code requires gawk and bash. To download:

    wget  http://lawker.googlecode.com/svn/fridge/lib/bash/awkwords
    chmod +x awkwords
    

    To test the code, apply it to itself:

    • ./awkwords --title "Does this work?" awkwords > awkwards.html

    Description

    AwkWords is a simple-to-use markup language for writing documentation for programs whose comment lines start with "#" and whose comments contain HTML code.

    For example, awk.info?tools/awkwords shows the html generated from this bash script.

    When used with the --title option, a stand-alone web page is generated (to control the style of that page, see the CSS function, discussed below). When used without --title, it generates HTML suitable for inclusion in other pages.

    Also, AwkWords finds all the <h2>, <h3>, <h4>, <h5>, <h6>, <h7>, <h8>, <h9> headings and copies them to a table of contents at the front of the file. Note that AwkWords assumes that the file contains only one <h1> heading; this is printed before the table of contents.

    AwkWords adds some short cuts for HTML markup, as well as including nested contents (see below: "including nested content"). This is useful for including, say, program output along with the actual program.

    Extra Markup

    Short cuts for HTML

    #.XX
    This is replaced by <XX>.
    #.XX words
    This is replaced by <XX>words</XX>. Note that this tag won't work properly if the source text spills over more than one line.
    #.TO url words
    This is replaced by a link to mail to url.
    #.URL url words
    This is replaced by a link to url.
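
    The two generic shortcuts can be approximated in a few lines of awk. This is a sketch of the rule as described, not the code used by awkwords: it ignores the special tags (#.TO, #.URL, #.IN and so on), and the function name expand_tags is hypothetical.

```shell
#!/bin/sh
# Hypothetical filter: "#.XX" becomes <XX>, "#.XX words" becomes <XX>words</XX>.
expand_tags() {
  awk '
  /^#\.[A-Za-z0-9]+/ {
    tag = substr($1, 3)              # text after "#."
    $1 = ""                          # drop the tag; $0 is rebuilt from the rest
    sub(/^ +/, "")
    if ($0 == "") print "<" tag ">"
    else          print "<" tag ">" $0 "</" tag ">"
    next
  }
  { print }'
}

printf '#.h2 Short cuts\n#.p\nplain text\n' | expand_tags
```

    Because each input line is handled on its own, the closing tag is always emitted on the same line, which matches the caveat that text spilling over more than one line is not wrapped properly.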

    Including nested content:

    #.IN file
    This line is replaced by the contents of file.
    #.LISTING file
    This line is replaced by the name of the file, followed by a verbatim display of file (no formatting).
    #.CODE file
    This line is replaced by the name of the file, followed verbatim by file (no formatting).
    #.BODY file
    This line is replaced by file, less the lines before the first blank line.
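
The "#.XX words" short cut can be sketched on its own as follows. This is a simplified stand-in for the xpandHtml function shown later, not the AwkWords code itself: the function name expand_shortcut is hypothetical, it trims the space after the tag (the original does not), and it makes no attempt to handle #.TO, #.URL or the include directives.

```shell
# Hypothetical helper: expand "#.XX" and "#.XX words" lines read from stdin.
expand_shortcut() {
  awk '
    /^#\.[A-Za-z0-9]+/ {
        tag = substr($1, 3)          # the text after the "#." marker
        $1 = ""                      # drop the marker; $0 is rebuilt with a leading blank
        sub(/^[ \t]+/, "")           # trim that blank
        if ($0 == "") print "<" tag ">"
        else          print "<" tag ">" $0 "</" tag ">"
        next
    }
    { print }                        # every other line passes through untouched
  '
}

printf '#.H2 Usage notes\n' | expand_shortcut
```

Because the closing tag is added on the same line, a heading that wraps onto a second line would be closed too early, which is the restriction noted above.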

    Programmer's Guide

    Awkwords is divided into three functions: unhtml fixes the printing of pre-formatted blocks; toc adds the table of contents while includes handles the details of the extra mark-up.

    Functions

    unhtml

    unhtml() { cat $1| gawk '
      BEGIN {IGNORECASE=1}
      /^<PRE>/   {In=1; print; next}
      /^<\/PRE>/ {In=0; print; next}
      In         {gsub("<","\\&lt;",$0); print; next }
                 {print $0 }'
    }
    

    toc

    toc() { cat $1 | gawk '
     BEGIN             { IGNORECASE = 1 }
     /^<[h]1>/         { Header=$0; next}
     /^[<]h[23456789]>/  { 
           T++ ;
          Toc[T]  = gensub(/(.*)<h(.*)>[ \t]*(.*)[ \t]*<\/h(.*)>(.*)/,
          "<""h\\2><""font color=black>\\&bull;</font></a> <""a href=#" T ">\\3</a></h\\4>",
                    "g",$0)
    		Pre="<a name="T"></a>" }
         { Line[++N] = Pre $0; Pre="" }
     END { print Header;
           print "<" "h2>Contents</h2>"
           print "<" "div id=\"htmltoc\">"
           for(I=1;I<=T;I++) print Toc[I]	
           print "<" "/div><!--- htmltoc --->"
           print "<" "div id=\"htmlbody\">"
           for(I=1;I<=N;I++) print Line[I]
           print "</" "div><!--- htmlbody --->"		
         }'
    }
    

    includes

    The xpand function controls recursive inclusion of content. Note that

    • The last act of this function must be to call xpand1.
    • When including verbatim text, the recursive call to xpands must pass "1" as the second parameter.
    includes() { cat $1 | gawk '
    function xpand(pre,  tmp) {
       if      ($1 ~ "^#.IN")    xpands($2,pre) 
       else if ($1 ~ "^#.BODY" ) xpandsBody($2,pre)
       else if ($1 ~ "^#.LISTING")  {
      	    print "<" "pre>"
    	    xpands($2,1)     # <===== note the recursive call with "1"
    	    print "<" "/pre>" } 
       else if ($1 ~ "^#.CODE")  {
      	    print "<" "p>" $2 "\n<" "pre>"
    	    xpands($2,1)     # <===== note the recursive call with "1"
    	    print "<" "/pre>" } 
       else if ($1 ~ "^#.URL") {
    	    tmp = $2; $1=$2="";
    	    print "<" "a href=\""tmp"\">" trim($0) "</a>"
    	    }
       else if ($1 ~ "^#.TO") {
    	    tmp = $2; $1=$2="";
    	    print "<" "a href=\"mailto:"tmp"\">" trim($0) "</a>"
    	    }
       else 
    	xpand1(pre)
    }
    

    The xpand1 function controls the printing of a single line. If we are formatting verbatim text, we must remove the start-of-html character "<". Otherwise, we expand any html shortcuts.

    function xpand1(pre) {
       if (pre)
            gsub("<","\\&lt;",$0)  # <=== remove start-of-html-character
       else {
            $0= xpandHtml($0)      # <=== expand html short cuts
            sub(/^#/,"",$0) }
       print $0 
    }
    

    The function xpandHtml controls the html short cuts

    function xpandHtml(    str,tag) {
       if ($0 ~ /^#\.H1/) {         
    	   $1=""
    	   return "<" "h""1><join>" $0 "</join></" "h1>" }
       if (sub(/^#\./,"",$1)) {
    	   tag=$1;  $1=""
    	   return "<" tag ">"  (($0 ~ /^[ \t]*$/) ? "" : $0"</"tag">")
       }
       return $0
    }
    

    The rest of the code is just some book-keeping and managing the recursive addition of content.

    function xpands(f,pre) {
         if (newFile(f)) {
    	  while((getline <f) > 0) xpand(pre)
              close(f) }
    }
    function xpandsBody(f,pre, using) {
         if (newFile(f)) { 
    	  while((getline <f) >0) {
    	    if ( !using && ($0 ~ /^[\t ]*$/) ) using = 1
    	    if ( using ) xpand(pre)}
    	  close(f) }
    }
    function newFile(f) { return ++Seen[f]==1 }
    function trim (s)   { sub(/^[ \t]*/,"",s);  sub(/[ \t]*$/,"",s); return s } 
    
    BEGIN { IGNORECASE=1 }
          { xpand()      }'
    }
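
The book-keeping above boils down to two ideas: read the included file with getline inside a recursive function, and use a Seen array so that each file is expanded at most once. A minimal sketch of just that part follows, assuming only the #.IN directive; the names expand_includes, seen_once and expand are hypothetical, and the real code also handles #.BODY, #.LISTING and the other directives.

```shell
# Hypothetical helper: print FILE with every "#.IN other" line replaced by other's contents.
expand_includes() {
  awk '
    function seen_once(f) { return ++Seen[f] == 1 }   # true only on the first visit
    function expand(f,   line, parts) {
        if (!seen_once(f)) return                     # guards against include loops
        while ((getline line < f) > 0) {
            if (line ~ /^#\.IN[ \t]/) {
                split(line, parts, /[ \t]+/)
                expand(parts[2])                      # recurse into the named file
            } else
                print line
        }
        close(f)
    }
    BEGIN { expand(ARGV[1]) }
  ' "$1"
}
```

As in AwkWords, a file that does not exist makes getline fail straight away, so the include is skipped without any message.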
    

    CSS styles

    If used to generate a full web page, then the following styles are added. Note that the htmltoc id (set on the div printed by toc) controls the appearance of the table of contents.

    css() { 
          echo "<""STYLE type=\"text/css\">"
          cat<<-'EOF'
             div#htmltoc h2 { font-size: medium; font-weight: normal; 
                              margin: 0 0 0 0; margin-left: 30px;}
             div#htmltoc h3 { font-size: medium; font-weight: normal; 
                              margin: 0 0 0 0; margin-left: 60px;}
             div#htmltoc h4 { font-size: medium; font-weight: normal; 
                              margin: 0 0 0 0; margin-left: 90px;}
             div#htmltoc h5 { font-size: medium; font-weight: normal; 
                              margin: 0 0 0 0; margin-left: 120px;}
             div#htmltoc h6 { font-size: medium; font-weight: normal; 
                              margin: 0 0 0 0; margin-left: 150px;}
             div#htmltoc h7 { font-size: medium; font-weight: normal; 
                              margin: 0 0 0 0; margin-left: 180px; }
          </STYLE>
    EOF
    }
    

    Main command line

    main() { cat $1 | includes | unhtml | toc; }
    
    if [ "$1" = "--title" ]
    then 
         echo "<""html><""head><""title>$2</title>`css`</head><""body>"; 
         shift 2
         main $1
         echo "<""/body><""/html>"
    else 
         main $1
    fi 
    

    Bugs

    There's no checking for valid input (e.g. pre-formatting tags that never close).

    If the input file contains no html mark up, the results are pretty messy.

    Recursive includes fail silently if the referenced file does not exist.

    I don't like the way I need a separate pass to do "unhtml". I tried making it work within the code but it got messy.

    Author

    Tim Menzies

    categories: Wp,Awk100,Wp,Tools,Apr,2009,HenryS

    awf

    The amazingly workable (text) formatter

    Synopsis

    awf -macros [ file ] ...

    Download

    Download from LAWKER. Type "make r" to run a regression test, formatting the manual page (awf.1) and comparing it to a preformatted copy (awf.1.out). Type "make install" to install it. Pathnames may need changing.

    Description

    Awf formats the text from the input file(s) (standard input if none) in an imitation of nroff's style with the -man or -ms macro packages. The -macros option is mandatory and must be `-man' or `-ms'.

    Awf is slow and has many restrictions, but does a decent job on most manual pages and simple -ms documents, and isn't subject to AT&T's brain-damaged licensing that denies many System V users any text formatter at all. It is also a text formatter that is simple enough to be tinkered with, for people who want to experiment.

    Awf implements the following raw nroff requests:

    .\"  .ce  .fi  .in  .ne  .pl  .sp
    .ad  .de  .ft  .it  .nf  .po  .ta
    .bp  .ds  .ie  .ll  .nr  .ps  .ti
    .br  .el  .if  .na  .ns  .rs  .tm
    

    and the following in-text codes:

    \$   \%   \*   \c   \f   \n   \s
    

    plus the full list of nroff/troff special characters in the original V7 troff manual.

    Many restrictions are present; the behavior in general is a subset of nroff's. Of particular note are the following:

    • Point sizes do not exist; .ps and \s are ignored.
    • Conditionals implement only numeric comparisons on \n(.$, string comparisons between a macro parameter and a literal, and n (always true) and t (always false).
    • The implementation of strings is generally primitive.
    • Expressions in (e.g.) .sp are fairly general, but the |, &, and : operators do not exist, and the implementation of \w requires that quote (') be used as the delimiter and simply counts the characters inside (so that, e.g., \w'\(bu' equals 4).

    White space at the beginning of lines, and imbedded white space within lines, is dealt with properly. Sentence terminators at ends of lines are understood to imply extra space afterward in filled lines. Tabs are implemented crudely and not quite correctly, although in most cases they work as expected. Hyphenation is done only at explicit hyphens, emdashes, and nroff discretionary hyphens.

    MAN Macros

    The -man macro set implements the full V7 manual macros, plus a few semi-random oddballs. The full list is:

    .B   .DT  .IP  .P   .RE  .SM
    .BI  .HP  .IR  .PD  .RI  .TH
    .BR  .I   .LP  .PP  .RS  .TP
    .BY  .IB  .NB  .RB  .SH  .UC
    

    .BY and .NB each take a single string argument (respectively, an indication of authorship and a note about the status of the manual page) and arrange to place it in the page footer.

    MS Macros

    The -ms macro set is a substantial subset of the V7 manuscript macros. The implemented macros are:

    .AB  .CD  .ID  .ND  .QP  .RS  .UL
    .AE  .DA  .IP  .NH  .QS  .SH  .UX
    .AI  .DE  .LD  .NL  .R   .SM
    .AU  .DS  .LG  .PP  .RE  .TL
    .B   .I   .LP  .QE  .RP  .TP
    

    Size changes are recognized but ignored, as are .RP and .ND. .UL just prints its argument in italics. .DS/.DE does not do a keep, nor do any of the other macros that normally imply keeps.

    Assignments to the header/footer string variables are recognized and implemented, but there is otherwise no control over header/footer formatting. The DY string variable is available. The PD, PI, and LL number registers exist and can be changed.

    Output

    The only output format supported by awf, in its distributed form, is that appropriate to a dumb terminal, using overprinting for italics (via underlining) and bold. The nroff special characters are printed as some vague approximation (it's sometimes very vague) to their correct appearance.

    Awf's knowledge of the output device is established by a device file, which is read before the user's input. It is sought in awf's library directory, first as dev.term (where term is the value of the TERM environment variable) and, failing that, as dev.dumb. The device file uses special internal commands to set up resolution, special characters, fonts, etc., and more normal nroff commands to set up page length etc.

    Files

    All in /usr/lib/awf (this can be overridden by the AWFLIB environment variable):

    common     common device-independent initialization
    dev.*      device-specific initialization
    mac.m*     macro packages
    pass1      macro substituter
    pass2.base central formatter
    pass2.m*   macro-package-specific bits of formatter
    pass3      line and page composer
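
The device-file search described under Output, combined with the AWFLIB override, amounts to the lookup sketched below. This is an illustration of the search order only, not code taken from awf; the function name find_device_file is hypothetical.

```shell
# Hypothetical helper: print the device file awf would be expected to pick.
find_device_file() {
  lib=${AWFLIB:-/usr/lib/awf}              # AWFLIB overrides the default library directory
  if [ -n "$TERM" ] && [ -r "$lib/dev.$TERM" ]; then
    echo "$lib/dev.$TERM"                  # terminal-specific file wins
  elif [ -r "$lib/dev.dumb" ]; then
    echo "$lib/dev.dumb"                   # otherwise fall back to the dumb terminal
  else
    echo "no device file found in $lib" >&2
    return 1
  fi
}
```

Dropping a new dev.term file into a private directory and pointing AWFLIB at it is therefore enough to experiment with a different output device.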
    

    See Also

    awk(1), nroff(1), man(7), ms(7)

    Diagnostics

    Unlike nroff, awf complains whenever it sees unknown commands and macros. All diagnostics (these and some internal ones) appear on standard error at the end of the run.

    Author

    Written at University of Toronto by Henry Spencer, more or less as a supplement to the C News project.

    Copyright

    Copyright 1990 University of Toronto. All rights reserved. Written by Henry Spencer. This software is not subject to any license of the American Telephone and Telegraph Company or of the Regents of the University of California.

    Permission is granted to anyone to use this software for any purpose on any computer system, and to alter it and redistribute it freely, subject to the following restrictions:

    1. The author is not responsible for the consequences of use of this software, no matter how awful, even if they arise from flaws in it.
    2. The origin of this software must not be misrepresented, either by explicit claim or by omission. Since few users ever read sources, credits must appear in the documentation.
    3. Altered versions must be plainly marked as such, and must not be misrepresented as being the original software. Since few users ever read sources, credits must appear in the documentation.
    4. This notice may not be removed or altered.

    Bugs

    There are plenty, but what do you expect for a text formatter written entirely in (old) awk?

    The -ms stuff has not been checked out very thoroughly.


    categories: Tools,May,2009,AlexR

    Linking Awk to Spreadsheets

    Axel Reinhold's MacroCALC (mc) is an interactive, macro-programmable spreadsheet calculator. mc has no graphic features, so it can also run on terminals. It uses a convenient, well-known user interface and has some special features especially interesting in the UNIX environment.

    mc has an elaborate interface to the operating system via piping. That is, mc and Unix tools like Awk can be easily integrated.

    A "cell" statement has the syntax:

    cell < command
    
    (and "command" is any Unix script, e.g. using Awk). When such a cell is entered, it will:
    • execute the command
    • put the command's output into the range of cells starting with cell as the upper-left corner.

    The output is read line by line into the rows of the range. The columns, which have to be separated by "tab" in the output of the command, are placed into the columns of the range.

    At the end of the data a special cell value designated 'EOF' (end of file) is placed in the cell below the data. This offers great flexibility based upon the Unix operating system's piping mechanism.
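
A command suitable for such a cell only has to print one line per row with tab-separated columns. The sketch below is a hypothetical example of one (the function name dept_totals and the comma-separated input layout are assumptions, not part of MacroCALC): it totals the second column of its input per key.

```shell
# Hypothetical helper: read "key,amount" lines, print "key<TAB>total" rows.
dept_totals() {
  awk -F, 'BEGIN { OFS = "\t" }                  # tabs separate the output columns
           { total[$1] += $2 }
           END   { for (k in total) print k, total[k] }' "$@" | sort
}

printf 'ops,10\ndev,5\nops,2\n' | dept_totals
```

Saved as a script, a command like this could appear on the right-hand side of a cell statement; each output line would fill one row of the range and each tab-separated value one column.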

    For more details, see the MacroCALC home page.


    categories: Ps,Apr,2009,Admin

    Postscript Tricks

    These pages focus on postscript tricks, written in Awk.


    categories: Ps,Apr,2009,ArnoldR

    pschoose.awk

    Contents

    Synopsis

    Download

    Description

    Details

    Code

    Author

    Synopsis

    gawk -f pschoose.awk range-spec file

    Download

    Download from LAWKER

    Description

    Pulls out a range of pages from a PostScript file and prints just those.

    Details

    Pagerange : list of pages from command line.

    Pages : array with broken out list.

    At end: "(n in Pages)" is true if page n should be printed
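
To see how a page list turns into the Pages array and the "(n in Pages)" test, here is a stand-alone sketch of the same idea. The function name expand_pages and its second argument (the highest page number to test) are assumptions made for the demonstration, and the error checks of the real program are left out.

```shell
# Hypothetical helper: expand_pages RANGE-SPEC MAXPAGE prints the selected page numbers.
expand_pages() {
  awk -v spec="$1" -v max="$2" 'BEGIN {
      n = split(spec, f, ",")
      for (i = 1; i <= n; i++) {
          if (split(f[i], g, "-") == 2)                 # a range such as 3-5
              for (j = g[1] + 0; j <= g[2] + 0; j++) Pages[j] = 1
          else                                          # a single page
              Pages[f[i] + 0] = 1
      }
      for (p = 1; p <= max + 0; p++)
          if (p in Pages) out = out (out == "" ? "" : " ") p
      print out
  }'
}

expand_pages "1,3-5,9" 10
```

Using the page number as the array subscript makes the later membership test a single lookup, however the ranges were written on the command line.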

    Code

    Set up the list of pages to print.
    function set_pagerange(        n, m, i, j, f, g)
    {
    	delete Pages
    
    	n = split(Pagerange, f, ",")
    	for (i = 1; i <= n; i++) {
    		if (index(f[i], "-") != 0) { # a range
    			m = split(f[i], g, "-")
    			if (m != 2 || g[1] >= g[2]) {
    				printf("bad list of pages: %s\n",
    					f[i]) > "/dev/stderr"
    				exit 1
    			}
    			for (j = g[1]; j <= g[2]; j++)
    				Pages[j] = 1
    		} else
    			Pages[f[i]] = 1
    	}
    }
    
    BEGIN {
    	# constants
    	TRUE = 1
    	FALSE = 0
    
    	if (ARGC != 3) {
    		print "usage: pschoose range-spec file\n" > "/dev/stderr"
    		exit 1
    	}
    	Pagerange = ARGV[1]
    	delete ARGV[1]
    	set_pagerange()
    }
    
    NR == 1, /^%%Page:/ {
    	if (! /^%%Page/) {
    		Prolog[++nprolog] = $0
    		next
    	}
    }
    
    /^%%Trailer/ || In_trailer {
    	In_trailer = TRUE
    	Epilog[++nepilog] = $0
    	next
    }
    
    /^%%Page: /	{
    	++Npage
    	line = 0
    }
    
    # for all non-special lines
    {
    	# only save it if we will want to print it
    	if (Npage in Pages)
    		Page[Npage, ++line] = $0
    }
    
    END {
    	# print the prologue
    	for (i = 1; i in Prolog; i++)
    		print Prolog[i]
    
    	# print the actual body
    	for (i = 1; i <= Npage; i++) {
    		if (i in Pages) {
    			for (j = 1; (i, j) in Page; j++) {
    				print Page[i, j]
    			}
    		}
    	}
    
    	# print the epilog
    	for (i = 1; i in Epilog; i++)
    		print Epilog[i]
    }
    

    Author

    Arnold Robbins


    categories: Ps,Apr,2009,ArnoldR

    psrev.awk

    Contents

    Synopsis

    Download

    Description

    Code

    Author

    Synopsis

    gawk -f psrev.awk

    Download

    Download from LAWKER

    Description

    Reverse the pages in a postscript file.

    Code

    BEGIN {
    	# constants
    	TRUE = 1
    	FALSE = 0
    
    	# Initialize global booleans
    	Twoup = FALSE
    
    	# process command line flags
    	for (i = 1; i in ARGV && ARGV[i] ~ /^-/; i++) {
    		if (ARGV[i] == "-2")
    			Twoup = TRUE
    		else
    			printf("psrev: unrecognized option %s\n",
    				ARGV[i]) > "/dev/stderr"
    		delete ARGV[i]
    	}
    }
    
    NR == 1, /^%%Page:/ {
    	if (! /^%%Page/) {
    		Prolog[++nprolog] = $0
    		next
    	}
    }
    
    /^%%Trailer/ || In_trailer {
    	In_trailer = TRUE
    	Epilog[++nepilog] = $0
    	next
    }
    
    /^%%Page: /	{
    	++Npage
    	line = 0
    }
    
    # for all non-special lines
    {
    	Page[Npage, ++line] = $0
    }
    
    END {
    	# print the prologue
    	for (i = 1; i in Prolog; i++)
    		print Prolog[i]
    
    	# print the actual body
    	if (Twoup) {
    		hasodd = (Npage %2 == 1)
    		if (hasodd) {
    			# print last page
    			for (j = 1; (Npage, j) in Page; j++)
    				print Page[Npage, j]
    			# make a fake last page for psnup
    			printf "%%%%Page: %d %d\n", Npage+1, Npage+1
    			printf "showpage\n"
    			print "%%BeginPageSetup"
    			print "BP"
    			print "%%EndPageSetup"
    			print "EP"
    		}
    		lastpage = (hasodd ? Npage - 1 : Npage)
    		for (i = lastpage; i > 0; i -= 2) {
    			for (k = i - 1; k <= i; k++)
    				for (j = 1; (k, j) in Page; j++)
    					print Page[k, j]
    		}
    	} else {
    		# regular 1 up printing
    		for (i = Npage; i > 0; i--)
    			for (j = 1; (i, j) in Page; j++)
    				print Page[i, j]
    	}
    
    	# print the epilog
    	for (i = 1; i in Epilog; i++)
    		print Epilog[i]
    }
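
The order in which the END block emits pages is easier to follow with page numbers alone. The sketch below reproduces only that ordering logic; the function name page_order is hypothetical, and the second argument stands in for the -2 flag.

```shell
# Hypothetical helper: page_order NPAGES [1] prints the output page order (1 = two-up mode).
page_order() {
  awk -v n="$1" -v twoup="${2:-0}" 'BEGIN {
      if (twoup) {
          odd = (n % 2 == 1)
          if (odd) out = n " " (n + 1)       # last page, then the fake filler page
          last = odd ? n - 1 : n
          for (i = last; i > 0; i -= 2)      # remaining pages, pair by pair
              out = out (out == "" ? "" : " ") (i - 1) " " i
      } else
          for (i = n; i > 0; i--)            # plain reversal
              out = out (out == "" ? "" : " ") i
      print out
  }'
}

page_order 5 1
```

In two-up mode the pairs are reversed but the pages inside each pair keep their order, so a later two-up pass still puts the lower-numbered page first on each sheet.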
    

    Author

    Arnold Robbins


    categories: TenLiners,Apr,2009,PhilipB

    indent.awk

    Contents

    Synopsis

    gawk -f indent.awk file.sh

    Download

    Download from LAWKER

    Description

    This is part of Phil's AWK tutorial at http://www.bolthole.com/AWK.html. This program adjusts the indentation level based on which keywords are found in each line it encounters.

    Code

    doindent

    function doindent(){
    	tmpindent=indent;
    	if(indent<0){
    		print "ERROR; indent level == " indent
    	}
    	while(tmpindent >0){
    		printf("    ");
    		tmpindent-=1;
    	}
    }
    

    Out-denting

    $1 == "done" 	{ indent -=1; }
    $1 == "fi" 	{ indent -=1; }
    $0 ~ /}/	{ if(indent!=0) indent-=1;  }
    

    Worker

    This is the 'default' action, that actually prints a line out. This gets called AS WELL AS any other matching clause, in the order they appear in this program. An "if" match is run AFTER we run this clause. A "done" match is run BEFORE we run this clause.

    		{ 
    		  doindent();
    		  print $0;
    		}
    

    In-denting

    $0 ~ /if.*;[ ]*then/	{ indent+=1; }
    $0 ~ /for.*;[ ]*do/	{ indent+=1; }
    $0 ~ /while.*;[ ]*do/	{ indent+=1; }
    
    $1 == "then"		{ indent+=1; }
    $1 == "do"		{ indent+=1; }
    $0 ~ /{$/		{ indent+=1; }
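
The whole program depends on awk running every matching rule in the order it appears. A cut-down sketch of that ordering, covering only if/for/while and fi/done (the function name indent_sketch is hypothetical), looks like this:

```shell
# Hypothetical helper: re-indent simple shell control structures read from stdin.
indent_sketch() {
  awk '
    $1 == "fi" || $1 == "done"                { indent-- }   # out-dent BEFORE printing
    { for (i = 0; i < indent; i++) printf "    "; print }    # print at the current level
    /if.*;[ ]*then/ || /(for|while).*;[ ]*do/ { indent++ }   # in-dent AFTER printing
  '
}

printf 'if true; then\necho hi\nfi\n' | indent_sketch
```

Moving the print rule above or below the other two would shift every opening or closing keyword by one level, which is why the rule order matters here.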
    

    Author

    Philip Brown phil@bolthole.com


    categories: Newsgroup,Jan,2009,Steffen

    Top posters at comp.lang.awk

    For the 7 day period ending Monday April 27, 2009.

    posts  kbytes  name                address
       13    28.4  roby                elleroroberto@katamail.com
        7    11.6  Steffen Schuler     schuler.steffen@gmail.com
        4    10.9  pmarin              pacogeek@gmail.com
        3     9.7  Ed Morton           mortonspam@gmail.com
        3     5.2  Janis Papanagnou    janis_papanagnou@hotmail.com
        3     5.1  nag                 visitnag@gmail.com
        2     6.5  Tim Menzies         menzies.tim@gmail.com
        2     6.1  r.p.loui@gmail.com  r.p.loui@gmail.com
        2     5.8  Hermann Peifer      peifer@gmx.net
        2     5.7  kielhd              kielhd@freenet.de
       41    95.0  Total for top 10

    Totals for the newsgroup

    • 19 posters
    • 52 articles
    • 115.1 kbytes

    The top 10 accounted for

    • 52.6% of the posters
    • 78.8% of the articles
    • 82.5% of the bytes

    Averages

    • 2.7 articles / poster
    • 2.2 kbytes / article
    • 6.1 kbytes / poster


    Provided as a public service by Steffen Schuler

    categories: Newsgroup,Jan,2009,Steffen

    Top subjects at comp.lang.awk

    For the 7 day period ending Monday April 27, 2009.

    posts  kbytes  subject
       10    33.5  OS-variables in awk
        9    17.9  user functions with variable number of parameters
        5     8.9  File infos
        3     8.5  Interpreter Informations
        3     5.0  Log/History Files
        3     4.9  Help with an input file
        3     4.8  gawk can't run an awk program...
        3     4.6  Log/History File
        2     5.6  pgawk.exe.stackdump
        2     4.7  OT: Re: Interpreter Informations

    52 articles on 18 subjects

    • 38 were followups (73.1%)
    • 0 were crossposts (0.0%)

    115.1 kbytes total

    • headers: 54.4kb (47.3%)
    • quoted text: 32.9kb (28.6%)
    • original text: 27.2kb (23.6%)
    • signatures: 0.6kb (0.5%)

    Averages

    • 2.9 articles / subject
    • 2.2 kbytes / article
    • 6.4 kbytes / subject


    Provided as a public service by Steffen Schuler

    categories: Newsgroup,Jan,2009,Steffen

    Top posters at comp.lang.awk

    For the 365 day period ending Sunday April 26, 2009.

    posts  kbytes  name                address
      156   530.8  Ed Morton           mortonspam@gmail.com
      156   388.3  Janis Papanagnou    janis_papanagnou@hotmail.com
      146   256.1  pk                  pk@pk.invalid
      109   306.6  Ed Morton           morton@lsupcaemnt.com
       84   146.5  Steffen Schuler     schuler.steffen@gmail.com
       83   139.4  Kenny McCormack     gazelle@shell.xmission.com
       77   174.1  Aharon Robbins      arnold@skeeve.com
       64   162.2  Dave B              daveb@addr.invalid
       54   194.9  r.p.loui@gmail.com  r.p.loui@gmail.com
       50   107.7  Hermann Peifer      peifer@gmx.eu
      979  2406.6  Total for top 10

    Totals for the newsgroup

    • 271 posters
    • 2272 articles
    • 5542.5 kbytes

    The top 10 accounted for

    • 3.7% of the posters
    • 43.1% of the articles
    • 43.4% of the bytes

    Averages

    • 8.4 articles / poster
    • 2.4 kbytes / article
    • 20.5 kbytes / poster


    Provided as a public service by Steffen Schuler

    categories: Newsgroup,Jan,2009,Steffen

    Top subjects at comp.lang.awk

    For the 365 day period ending Sunday April 26, 2009.

    posts  kbytes  subject
       61   219.6  changing a field without recompiling the record
       44    71.3  Top 10 subjects comp.lang.awk
       42    88.1  GAWK: A fix for "missing file is a fatal error"
       34    59.6  Top 10 posters comp.lang.awk
       30    75.3  Indirect function calls patch for gawk available
       29    65.0  gawk for windows: system() does not yield exit status
       26    67.1  split field by delimiter
       24    63.6  Is there an simple way to initialise arrays in bulk?
       23    63.5  Sed1liners in Awk?
       23    62.6  Gawk match() and numbers in scientific notation

    2272 articles on 389 subjects

    • 1865 were followups (82.1%)
    • 8 were crossposts (0.4%)

    5540.0 kbytes total

    • headers: 2356.9kb (42.5%)
    • quoted text: 1591.2kb (28.7%)
    • original text: 1531.2kb (27.6%)
    • signatures: 60.7kb (1.1%)

    Averages

    • 5.8 articles / subject
    • 2.4 kbytes / article
    • 14.2 kbytes / subject


    Provided as a public service by Steffen Schuler

    categories: Dates,June,2009,BobO

    holidays.awk

    Synopsis

    [gn]awk -f holidays.awk  "opts" holidayfile
    

    Download

    Download from LAWKER.

    Description

    Job scheduling around holidays has always been a pain. To prevent messing around with crons several times a year, I used to place a "holidays" file in, for example, /usr/local/bin. The file contained the holiday date in yyyymmdd format, followed by the holiday name. (See Dateplus program for easy date manipulation.) That worked, but every year I had to refresh the file with those dates that fall on, for example, the last Monday in May. This meant remembering to edit the holidays file after the company calendar was set for the year.

    Then, I came across the American Secular Holidays web site by Marcos J. Montes. Montes cites Claus Tondering as his primary source, and Timothy Barmann and Bobby Cossum for their contributions in simplifying the equations used in the algorithms. This is significant, for these algorithms provide a robust yet elegant method for identifying whether a given date is a holiday without constantly updating a configuration file.

    To make these algorithms and routines as portable as possible (as long as the porting OS has nawk or gawk), I rewrote the whole thing in [gn]awk. Now practically any program with access to AWK can avail itself of these holiday date capabilities. The AWK version of the program can return the nth business day, a multi-line yyyymmdd date list, or a single line of yyyymmdd holiday dates. With those, you can easily determine whether the date you have is a holiday or specific business day.

    In the following code, none of my holiday work is possible without the algorithms presented by Montes, Tondering, Barmann, and Cossum. The holidays file and the logic to process that, are my contributions.

    Options

    --
    Allows passing script opts and args.
    -B
    Return true if today is a business day.
    -b nn
    Return the nn'th business day as yyyymmdd (nn may also be specified as "last" for the last business day of the month, or -n (minus n) for the nth business day from the end of the month).
    -d n.Www.OoA
    Return nth weekday (Www = Sun-Sat, and "n" may also be given as "last" for the last "Www" day of the month).
    -d n.w.OoA
    Alternate syntax for the above, except that the ".w" is 0-6 for Sun-Sat. The ".OoA" is an optional on-or-after day of the month: the date wanted must fall on or after the OoA'th day of the month.
    -H
    Full formal documentation (functions only when the current working directory is the program directory).
    -h
    Summary help (Usage).
    -l
    Return multiline yyyymmdd date list.
    -s
    Return a single line of yyyymmdd's.
    -t
    Test resultant date against today (works with -b and -d options).
    -y yyyy
    Use yyyy for the year.
    -m mm
    Use the mm that follows as the month (for business day calculations).
    holidayfile
    Calculation directives file (neither used nor needed with "-d" option). The file lays out as follows.

    The Holidays File

    Although second to the algorithms, the holidays file is central to this system. The file's directives allow for the handling of, for example, the Friday after U.S. Thanksgiving Day (Thursday). For those organizations and companies that grant a Friday holiday when a day like Christmas or New Year's Day falls on a Saturday, or give a Monday holiday when those holidays fall on a Sunday, the holidays file provides the necessary vehicle.

    After a brief description of the holidays file layout, I'll discuss the file itself, and see how three holidays are handled: Memorial Day, Thanksgiving Day (including the Friday after), and Christmas.

    The file itself is a simple ASCII file available to all programs. It contains values that allow the calling program to calculate holidays either by given (fixed) month and day, or by day of a given week. The general layout is as follows:

      # Mm N.Day Adj Holiday name # Comments
    
        Mm           = Month number (leading zeros NOT required)
        N.day        = Nth day (1-5 and "last") "." weekday (0-6)
                       (Not every part is required.)
        Adj          = Can be either a +|- n days,
                       or weekday followed immediately by a +|- n days,
        Holiday name = How you want it spelled out--your call.
        Comments     = ignored.
    

    Leading white space is ignored, as is everything following and including the octothorpe (#-sign). Here are the entries for the three holidays:

      #-----------------------------------------------------------------------#
      # Mm N.Day.OnOrA Adj Holiday name                     # Comments        #
      # -- ----------- --- -------------------------------- ----------------- #
        05 last.1          Memorial Day                     # Last Mon in May
        11 4.4             Thanksgiving Day (US)            # 4th Thu in Nov
        11 4.4          +1 Thanksgiving Day II (US)
        12 25              Christmas Day                    # M-F
        12 25          6-1 Christmas Day (pre-holiday obs)  # Sat?  Use Fri
        12 25          0+1 Christmas Day (post-holiday obs) # Sun?  Use Mon
    

    Memorial Day

    Memorial Day is the last Monday in May. In the table the month is "05" (again, leading zero is unnecessary). The last Monday is specified by the word "last" and not a 5 because the last Monday may not be the 5th Monday (there is no 5th Monday in May, 2003). Monday is identified by the 1 following the dot (".1"). This is based on the 0-6 convention for representing Sunday through Saturday.
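
A "last.1" entry can be resolved by finding the weekday of the month's final day and stepping back to the wanted weekday. The sketch below does this with Sakamoto's day-of-week method, which is a substitute chosen for brevity and not necessarily the algorithm holidays.awk uses; the names last_weekday, dow and mdays are hypothetical.

```shell
# Hypothetical helper: last_weekday YEAR MONTH WEEKDAY prints the day of the month
# (WEEKDAY uses the 0-6 convention for Sunday through Saturday).
last_weekday() {
  awk -v y="$1" -v m="$2" -v w="$3" '
    # Day of week by the Sakamoto method: 0=Sunday .. 6=Saturday.
    function dow(y, m, d,   t) {
        split("0 3 2 5 0 3 5 1 4 6 2 4", t, " ")
        if (m < 3) y--
        return (y + int(y/4) - int(y/100) + int(y/400) + t[m] + d) % 7
    }
    function mdays(y, m) {
        if (m == 2) return (y % 4 == 0 && (y % 100 != 0 || y % 400 == 0)) ? 29 : 28
        return (m == 4 || m == 6 || m == 9 || m == 11) ? 30 : 31
    }
    BEGIN {
        y += 0; m += 0; w += 0
        last = mdays(y, m)
        print last - (dow(y, m, last) - w + 7) % 7    # step back to the wanted weekday
    }'
}

last_weekday 2003 5 1    # Memorial Day 2003: prints 26
```

For May 2003 the answer is the 26th, which is only the fourth Monday of that month, matching the reason given above for writing "last" instead of 5.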

    Thanksgiving Day

    Thanksgiving Day (U.S. observance) is the fourth Thursday in November. November is identified by the "11". The fourth (nth) day is the first "4". Thursday is the ".4". Same method as was used for Memorial Day. The day after Thanksgiving, Friday, is a little tricky.

    Friday After Thanksgiving [Thurs]Day

    Contrary to what you might think, you cannot specify:

      11 4.5 Thanksgiving Day II (Friday)
    

    since the fourth Friday might not follow the fourth Thursday of a given month. Consider Thanksgiving Day, 2002--the fourth Thursday was November 28. The fourth Friday fell on the 22nd. So, to accurately capture the Friday after Thanksgiving Day, specify the same parameters for Thanksgiving, and an adjustment of +1:

      11 4.4 +1 Thanksgiving Day II (Friday)
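
The difference between "fourth Friday" and "fourth Thursday plus one day" can be checked with a small sketch. As before, the day-of-week routine is Sakamoto's method, used here as a stand-in for whatever holidays.awk does internally, and the function name nth_weekday is hypothetical. The adjustment is simply added to the day number, so the sketch does not handle a result that spills into the next month.

```shell
# Hypothetical helper: nth_weekday YEAR MONTH N WEEKDAY [ADJ] prints yyyymmdd.
nth_weekday() {
  awk -v y="$1" -v m="$2" -v n="$3" -v w="$4" -v adj="${5:-0}" '
    # Day of week by the Sakamoto method: 0=Sunday .. 6=Saturday.
    function dow(y, m, d,   t) {
        split("0 3 2 5 0 3 5 1 4 6 2 4", t, " ")
        if (m < 3) y--
        return (y + int(y/4) - int(y/100) + int(y/400) + t[m] + d) % 7
    }
    BEGIN {
        y += 0; m += 0
        first = 1 + (w - dow(y, m, 1) + 7) % 7     # first such weekday in the month
        day = first + 7 * (n - 1) + adj            # nth one, then the adjustment
        printf "%04d%02d%02d\n", y, m, day
    }'
}

nth_weekday 2002 11 4 4 +1   # Friday after Thanksgiving 2002: prints 20021129
nth_weekday 2002 11 4 5      # fourth Friday of November 2002: prints 20021122
```

The two calls give different dates for 2002, which is exactly the case described above.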
    

    Christmas

    Christmas is December 25. Like New Year's Day (January 1) and Independence Day (July 4), Christmas is a fixed date. Simply specifying "12 25 Christmas Day" in the holidays file returns "yyyy1225". However, with many companies, if Christmas falls on a Saturday (day 6), the Friday before is observed by adjusting it by -1. If it falls on a Sunday (day 0), the Monday following is observed by adjusting it by +1. Hence, the three entries:

      12 25     Christmas Day                     # M-F
      12 25 6-1 Christmas Day (pre-holiday obs)   # Sat? Use Fri
      12 25 0+1 Christmas Day (post-holiday obs)  # Sun? Use Mon
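
An adjustment such as "6-1" or "0+1" reads as "if the date falls on this weekday, shift it by this many days". A minimal sketch of that rule follows, again using Sakamoto's day-of-week method as a stand-in; the function name observed is hypothetical, and the sketch ignores shifts that would cross a month boundary.

```shell
# Hypothetical helper: observed YEAR MONTH DAY "RULES" prints the observed date as yyyymmdd.
# Each rule is one weekday digit (0-6) followed by a signed offset, e.g. "6-1 0+1".
observed() {
  awk -v y="$1" -v m="$2" -v d="$3" -v rules="$4" '
    # Day of week by the Sakamoto method: 0=Sunday .. 6=Saturday.
    function dow(y, m, d,   t) {
        split("0 3 2 5 0 3 5 1 4 6 2 4", t, " ")
        if (m < 3) y--
        return (y + int(y/4) - int(y/100) + int(y/400) + t[m] + d) % 7
    }
    BEGIN {
        y += 0; m += 0; d += 0
        wd = dow(y, m, d)
        n = split(rules, r, " ")
        for (i = 1; i <= n; i++)
            if (substr(r[i], 1, 1) + 0 == wd) {    # rule applies to this weekday
                d += substr(r[i], 2)               # apply the signed offset
                break
            }
        printf "%04d%02d%02d\n", y, m, d
    }'
}

observed 2010 12 25 "6-1 0+1"   # Christmas 2010 fell on a Saturday: prints 20101224
observed 2011 12 25 "6-1 0+1"   # Christmas 2011 fell on a Sunday: prints 20111226
```

When no rule matches, the date is returned unchanged, which corresponds to the plain "12 25" entry for Monday through Friday.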
    

    New Year's Day

    New Year's Day is a fixed date, January 1, and like Christmas and Independence Day, it can be observed on the Friday before a Saturday occurrence or the Monday after a Sunday occurrence simply by setting it up like the Christmas example above. However, some organizations use a post-holiday observance of New Year's Day when it falls on a Saturday simply so the holiday falls in the correct year. You can do that by specifying New Year's Day as follows:

      01 01     New Year's Day                    # M-F
      01 01 6+2 New Year's Day (post-holiday obs) # Sat? Use Mon
      01 01 0+1 New Year's Day (post-holiday obs) # Sun? Also Mon
    

    Remember, the "6" in our "6+2" means the actual date, January 1st, falls on a Saturday (day 6 in the 0-6 day-numbering scheme), so adjust that date by +2 days (i.e., Saturday's date (01/01) plus two days gives 01/03).

    Daylight Savings Time

    While the program is incapable of handling Daylight Savings dates in Iran where DST starts on the first day of Farvardin and ends the first day of Mehr, holidays.awk (v1.22) is capable of handling at least one set of unique Daylight Savings Time (DST) dates. In the Falkland Islands, DST begins on the first Sunday on or after September 8th and ends on the first Sunday on or after April 6th. Those exceptions (starting on or after a date in the month) are handled by specifying a holidays line like this:

      04 1.0.6 Falklands ST  # 1st Sun on/after Apr 6
      09 1.0.8 Falklands DST # 1st Sun on/after Sep 8
    

    The ".6" in our "1.0.6" means Standard Time (ST) begins on the first Sunday (1.0) in April that falls on or after the 6th of April. Likewise, the ".8" in our "1.0.8" means DST begins on the first Sunday in September that comes on or after the 8th of September.
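
The on-or-after form starts counting from the given day of the month instead of from the 1st. A sketch of that calculation, with Sakamoto's day-of-week method standing in for the program's own routine and a hypothetical function name nth_on_or_after:

```shell
# Hypothetical helper: nth_on_or_after YEAR MONTH N WEEKDAY START prints the day of the month.
nth_on_or_after() {
  awk -v y="$1" -v m="$2" -v n="$3" -v w="$4" -v start="$5" '
    # Day of week by the Sakamoto method: 0=Sunday .. 6=Saturday.
    function dow(y, m, d,   t) {
        split("0 3 2 5 0 3 5 1 4 6 2 4", t, " ")
        if (m < 3) y--
        return (y + int(y/4) - int(y/100) + int(y/400) + t[m] + d) % 7
    }
    BEGIN {
        y += 0; m += 0; start += 0
        first = start + (w - dow(y, m, start) + 7) % 7   # first match on/after START
        print first + 7 * (n - 1)
    }'
}

nth_on_or_after 2009 9 1 0 8   # "09 1.0.8" for 2009: prints 13
nth_on_or_after 2009 4 1 0 6   # "04 1.0.6" for 2009: prints 12
```

If the start day itself is the wanted weekday, the offset is zero and the start day is returned, which is what "on or after" requires.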

    Since Daylight Savings dates are not usually holidays, you can also retrieve the Daylight Savings Time dates via the -d option and bypass the need for the holidays file altogether. Here are Daylight Savings Times for the United States (begins the second Sunday in March) and the Falklands (begins on the first Sunday on/after September 8).

      holidays.awk -- -d 2.0   -m 3
      holidays.awk -- -d 1.0.8 -m 9
    

    You can even set up a cron job to test for Daylight Savings Time and perform some action if true.

      05 00 * 03 * [ `/usr/local/bin/holidays.awk -- -d 2.0 -m 3 -t` -eq 1 ] \
        && ... Some action ...
    

    Examples

    Calculating Business Days

    I incorporated the business day calculation into my date routines because of a need to run a given process on the second business day of the month. Once the holidays are known, business day calculation is relatively simple--just grab the month's days and remove holidays, Saturdays and Sundays. For example, to provide the second business day, just pass a "-b 2" option to the program:

       bizday=`nawk -f holidays.awk -- -b 2 holidays`
       if [ `date "+%Y%m%d"` = $bizday ]; then
          echo "Today is the 2nd business day of the month."
          # Do whatever
       fi
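
    The rule described above (take the month's days, then drop Saturdays, Sundays and holidays) can be sketched as follows. Everything here is an assumption made for illustration: the name nth_bizday, passing holidays as a space-separated list of day numbers, and the conventional zero/non-zero exit status. Holidays.awk itself reads a holidays file and uses its own exit-status convention.

```shell
# Hypothetical sketch: print the day of the month of the Nth business day.
nth_bizday() {   # usage: nth_bizday YEAR MONTH N "HOLIDAY_DAYS"
  awk -v y="$1" -v m="$2" -v n="$3" -v hol="$4" '
    # Day of week for a Gregorian date: 0=Sun .. 6=Sat
    function dow(y, m, d,    t) {
      split("0 3 2 5 0 3 5 1 4 6 2 4", t, " ")
      if (m < 3) y--
      return (y + int(y / 4) - int(y / 100) + int(y / 400) + t[m] + d) % 7
    }
    # Number of days in the month, allowing for leap years
    function mdays(y, m,    t) {
      split("31 28 31 30 31 30 31 31 30 31 30 31", t, " ")
      if (m == 2 && ((y % 4 == 0 && y % 100 != 0) || y % 400 == 0)) return 29
      return t[m] + 0
    }
    BEGIN {
      y += 0; m += 0; n += 0
      k = split(hol, h, " ")
      for (i = 1; i <= k; i++) skip[h[i] + 0] = 1
      last = mdays(y, m)
      for (d = 1; d <= last; d++) {
        w = dow(y, m, d)
        if (w == 0 || w == 6 || (d in skip)) continue   # weekend or holiday
        if (++count == n) { print d; exit 0 }
      }
      exit 1   # the month has fewer than N business days
    }'
}

nth_bizday 2003 6 2 ""    # June 1, 2003 is a Sunday → 3
nth_bizday 2003 7 4 "4"   # July 4 is skipped as a holiday → 7
```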
    

    The last business day, and business days offset from the last business day (negative numbers), are also available in holidays.awk. To retrieve the last business day of the month, specify the "last" option argument (optarg) for the -b option (i.e., "-b last"). For the next-to-last business day of the month, provide "-b -1" as an option and optarg.

    Holidays.awk is a well-behaved program in that it uses exit status to indicate success or failure. As indicated in the documentation, for all options except business day (-b), a zero exit status means the program completed successfully and non-zero indicates failure. However, with the business day option, non-zero indicates success because the exit status is the day of the month on which the business day falls. Therefore, use the holidays.awk exit status as the test comparand:

       nawk -f holidays.awk -- -b last holidays > /dev/null 2>&1
    
       if [ $? -eq `date +%d` ]; then
          echo "Today is the last business day of the month."
          # Do whatever
       fi
    

    You can also combine -b with -m and -y to return the nth business day for a given month and year. If you request a business day (positive or negative) that is not found in the month, you receive an error message, and a 0 exit status indicating an error.

    For those needing only an indication that today is a given business day, you can use the -t option in conjunction with -b. For example, using Unix cron (scheduler) we combine those options to set up a job to run only on the second business day of the month with as little as the following:

        00 02 2-5 * * /usr/local/bin/holidays.awk -- -b 2 -t \
          || some_program > some_program.out 2>&1
     

    In this example, no holidays file is specified because we use the default, /usr/local/bin/holidays (you can change the program to point to wherever you wish to locate the file). No nawk -f is used because the first line of holidays.awk uses the shebang syntax (#!/usr/bin/nawk -f) to execute itself. (Obviously, the program must have the necessary execution permissions to run this way.) With the -t option, holidays.awk returns true or false (which is not the same as success or failure), only running the called program if the day is, indeed, the second business day of the month.

    Returning The Nth Weekday

    There appears to be as much interest in determining the nth weekday as there is in business days, so I added an option to holidays.awk to return that. To get the first Monday in the current month, simply pass a "-d 1.Mon" option to the program:

       fst_monday=`nawk -f holidays.awk -- -d 1.Mon`
    

    An alternative syntax is also provided:

    nawk -f holidays.awk -- -d 1.1
    

    You can expand this to report the first Monday in any month and year like this.

       yyyy=2005
       for mm in 1 2 3 4 5 6 7 8 9 10 11 12
       do
          nawk -f holidays.awk -- -y $yyyy -m $mm -d 1.1
       done
    
    

    For the last Sunday in a month use

       nawk -f holidays.awk -- -d last.Sun
    

    For those preferring a simpler syntax: If your OS recognizes the #! (shebang) syntax, you can place a #!/usr/bin/nawk -f (or gawk) at the start of holidays.awk, thereby allowing you to skip the [gn]awk -f during invocation and simply call it like this:

       holidays.awk -- -d last.Sun
       holidays.awk -- -d last.0
       holidays.awk -- -d 5.0
    

    Testing with Holidays.sh

    Holidays.sh executes holidays.awk, providing examples of holiday and business day testing. Provided the holidays file is located properly, executing holidays.sh on June 21, 2003 displays:

       Today's no holiday, get busy. :-((
       20030101 Wed. New Year's Day
       20030120 Mon. M.L.King Jr. Birthday
       20030526 Mon. Memorial Day
       20030704 Fri. Independence Day
       20030901 Mon. Labor Day
       20031127 Thu. Thanksgiving Day (US)
       20031128 Fri. Thanksgiving Day II (US)
       20031225 Thu. Christmas Day
       Today is NOT the 2nd business day (20030603) of the month.
       Today is NOT the last business day (20030630) of the month.
       Today is NOT the next-to-the-last business day (20030627) of the month.
    

    As a real acid test, I include the next-to-last and last business days of every month from 2000 to 2010. The holidays.sh script concludes with a report for all holidays for the 21st century.

    For more information

    See http://www.orlandokuntao.com/mf_holidays.html

    Copyright

    Copyright (c) 1995-2005 by Bob Orlando. All rights reserved.

    Permission to use, copy, modify and distribute this software and its documentation for any purpose and without fee is hereby granted, provided that the above copyright notice appear in all copies, and that both the copyright notice and this permission notice appear in supporting documentation, and that the name of Bob Orlando not be used in advertising or publicity pertaining to distribution of the software without specific, written prior permission. Bob Orlando makes no representations about the suitability of this software for any purpose. It is provided "as is" without express or implied warranty.

    Bob Orlando disclaims all warranties with regard to this software, including all implied warranties of merchantability and fitness. In no event shall Bob Orlando be liable for any special, indirect or consequential damages or any damages whatsoever resulting from loss of use, data or profits, whether in an action of contract, negligence or other tortious action, arising out of or in connection with the use or performance of this software.

    Author

    Bob Orlando


    categories: DataMining,May,2009,Timm

    NBC: a not-so-naive Bayes Classifier

    Contents

    Synopsis

     ./nbc -[hlx] train test
    

    Download

    Download from Lawker

    Description

    Is less more? Can a few lines of gawk developed in a day or two stand in for a sophisticated state-of-the-art JAVA package? In the general case there may be software engineering advantages to working with rich languages like JAVA. However, in the specific case of a Naive Bayes classifier for discrete data, it is interesting to test if less is indeed more.

    A Naive Bayes classifier collects frequency counts of old events, grouped into "classes". Then, if a new event arrives without a classification, it checks through the old list of classes looking for the one with the highest frequency counts for this new event.

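    The method is called "naive" because it assumes that the attributes are statistically independent of one another within each class. This assumption allows us to collect frequency counts just on each attribute value (and not pairs, or triples, or quads of values).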

    In practice, this "naive" strategy works remarkably well: it often performs as well as other schemes that try to model interactions between frequencies.

    Hence, we call this system a not-so-naive Bayes Classifier.

    Summary of Results

    In summary, the performance of this simple gawk-based Naive Bayes classifier is quite remarkable.

    • Not surprisingly, nbc.awk's accuracies are very similar to those of a standard Bayes classifier (from the WEKA system). That is, this implementation is adequate.
    • Quite surprisingly, the awk version does not run much slower than Java. In fact, for under 1000 instances, this implementation is nearly as fast as, or faster than, a much more mature JAVA-based tool.

    Details: Accuracy

    The following table compares classification accuracies between nbc.awk and WEKA's weka.classifiers.NaiveBayes.

    All the data sets were discrete so no kernel estimation was used.

    Results come from a 10-way cross-val (but no initial randomization of data set order).

    The table is sorted by increasing mean difference. Nbc.awk does better than WEKA Bayes on the datasets shown at the bottom of the table.

                    mean                significant 
                    difference   std.   difference? 
          data      in accuracy  dev.   (alpha=0.01)
     -----------------------------------------------
            soybean | -3.02 |   2.02 |  y
               iris | -2.14 |   4.51 |  n
                zoo | -0.38 |   6.54 |  n
      primary-tumor | -0.30 |   3.28 |  n
          audiology | -0.27 |   2.67 |  n
           mushroom | -0.25 |   0.23 |  n
             splice |  0.00 |   0.00 |  n
           kr-vs-kp |  0.00 |   0.00 |  n
      breast-cancer |  0.00 |   0.00 |  n
     contact-lenses |  0.00 |   0.00 |  n
               vote |  0.27 |   1.47 |  n
              lymph |  0.73 |   5.01 |  n
           breast-w |  1.63 |   1.32 |  y
           credit-a |  7.40 |   5.60 |  y
             letter |  9.44 |   1.40 |  y
    

    On the whole, nbc.awk works as well as WEKA Bayes.

    Details: Runtimes

    The following table compares the runtimes of nbc.awk (awk) vs WEKA BAYES (java) measured in seconds.

    Each line shows the total time for ten training+test runs (one for each item in the cross-val). E.g. letter actually ran in 4.92 seconds (on average), and this was repeated 10 times.

    Note: the time for dividing files for the x-val is not shown.

    The table is sorted on the ratio of awk vs java runtimes. Ratios less than one mean awk ran faster than java. Nbc.awk runs faster than WEKA Bayes on the datasets shown at the bottom of the table (below the middle line).

                     runtimes (secs) | 
                    -----------------|---------------------
               data  awk   java ratio| insts  attrs classes
     --------------------------------|---------------------
             letter  49.2  17.6  2.8  20,000   17      27
           mushroom  10.1   5.9  1.7   8,124   23       3
           kr-vs-kp   8.1   5.1  1.6   3,916   37       3
             splice  11.3   7.8  1.4   3,190   62       4
            soybean   4.2   3.4  1.2     683   36      20
     ------------------------------------------------------
          audiology   2.9   3.4  0.9     226   70      25
      primary-tumor   1.3   2.8  0.5     339   18      23
               vote   1.0   2.4  0.4     435   17       3
     contact-lenses   0.6   2.0  0.3      24    5       4
      breast-cancer   0.7   2.4  0.3     286   10       3
           credit-a   1.1   3.3  0.3     690   16       3
           breast-w   1.0   3.5  0.3     699   10       3
              lymph   0.6   2.4  0.2     148   19       5
               iris   0.5   2.5  0.2     150    5       4
                zoo   0.6   2.4  0.2     101   18       8
     ------------------------------------------------------
              total  93.1  66.8  1.4
    

    Overall, the awk-based learner was 40% slower than the JAVA tool. For larger data sets, JAVA was always faster. However, for smaller datasets (under 1000 instances) the awk version was nearly as fast or faster.

    Details: Memory

    We have run this small Awk script on 100s of megabytes of data, without crashes or core dumps. The code is very memory efficient, unlike WEKA, which loads all the data into RAM.

    Discussion

    It is hardly surprising that a state-of-the-art tool kit built and optimized by JAVA gurus can out-perform awk code on large examples. However, what is surprising is that a 32-line AWK script built and debugged in a weekend often works nearly as well, or better.

    Perhaps "nbc" is not-so-naive after all.

    Options

    -x
    run an example
    -h
    print help text
    -l
    print legal notice

    Installation

    To check the download, unzip contents.zip, then run:

     chmod +x nbc
     ./nbc nbceg.train nbceg.test  | 
     gawk -F, '{print $0  "\t " ($1 !=$2 ? " <== bad" : "")}' 
    

    This should print:

     malign_lymph,malign_lymph        
     metastases,metastases    
     malign_lymph,malign_lymph        
     metastases,metastases    
     malign_lymph,metastases   <== bad
     malign_lymph,malign_lymph        
     malign_lymph,malign_lymph        
     metastases,metastases    
     metastases,metastases    
     metastases,metastases    
     malign_lymph,malign_lymph        
     metastases,metastases    
     malign_lymph,malign_lymph        
    

    Awk code

    Here is the nbc.awk code called by the Bash script (shown below).

     BEGIN {
      #Internal globals:
         Total=0    # count of all instances
       # Classes    # table of class names/frequencies
       # Freq       # table of counters for values in attributes in classes
       # Seen       # table of counters for values in attributes
       # Attributes # table of number of values per attribute
       }
    
     Pass==1 {train()}
     Pass==2 {print $NF "," classify()}
    
     function train(    i,c) { 
       Total++;
       c=$NF;
       Classes[c]++;
       for(i=1;i<=NF;i++) {
         if ($i=="?") continue;
         Freq[c,i,$i]++
         if (++Seen[i,$i]==1) Attributes[i]++}
     }
    
     function classify(         i,temp,what,like,c) {  
       like = -100000; # smaller than any log
       for(c in Classes) {  
         temp=log(Classes[c]/Total); #uses logs to stop numeric errors
         for(i=1;i<NF;i++) {  
           if ( $i=="?" ) continue;
           temp += log((Freq[c,i,$i]+1)/(Classes[c]+Attributes[i]));
         };
         if ( temp >= like ) {like = temp; what=c}
       };
       return what;
     }
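
    Read as a formula, the score that classify() computes for each class c can be written as below. The notation is ours, introduced only to summarize the code above: v_i stands for the value in column i of the test row, the sum skips any column whose value is "?", and the array names are the ones used in nbc.awk.

```latex
\mathrm{score}(c) \;=\; \log\frac{\mathrm{Classes}[c]}{\mathrm{Total}}
  \;+\; \sum_{\substack{i=1 \\ v_i \neq\, ?}}^{NF-1}
  \log\frac{\mathrm{Freq}[c,i,v_i] + 1}{\mathrm{Classes}[c] + \mathrm{Attributes}[i]},
\qquad
\mathrm{what} \;=\; \arg\max_{c}\, \mathrm{score}(c)
```

    The "+1" in the numerator and the Attributes[i] term in the denominator mean that a value never seen with class c still gets a small non-zero likelihood, so the code never takes log(0).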
    

    Bash Code

    Copyright

    copyleft() { cat<<EOF 
    nbc: a naive bayes classifier
    Copyright (C) 2004 Tim Menzies
    
    This program is free software; you can redistribute it and/or
    modify it under the terms of the GNU General Public License
    as published by the Free Software Foundation, version 2.
    
    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.
    
    You should have received a copy of the GNU General Public License
    along with this program; if not, write to the Free Software
    Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.
    EOF
    }
    

    Usage

     usage() { cat<<-EOF
    	Usage: nbc  [FLAGs] TRAIN TEST
    	Naive bayes classifier.
    
    	TRAIN and TEST are comma-separated data files with the same
    	number of columns. The last column of each is the class
    	symbol. This classifier learns from TRAIN and then tries
    	to classify the examples in TEST.
    
    	Flags: 
    	  -h        print this help text
    	  -l        copyright notice
    	  -x	    run an example
    	EOF
    	exit
     } 
    

    Demo code

     nbcDemo() { 
    	main nbceg.train nbceg.test
     } 
    

    Main

     main() {
    	gawk -F, -f nbc.awk Pass=1 "$1" Pass=2 "$2"
     }
    

    Command-line Processing

     demo=""
     while getopts "hlx" flag
     do case "$flag" in
            l)  copyleft; exit;;
            h)  usage; exit ;;
            x)  demo="nbcDemo";;
        esac
     done
     shift $(($OPTIND - 1))
     if [ -n "$demo" ]
     then $demo
          exit
     else  main "$1" "$2"
     fi
    

    Author

    Tim Menzies


    categories: Os,Apr,2009,Admin

    Awk and Operating Systems

    These pages focus on Awk and operating systems.


    categories: SysAdmin,Oct,2009,BrianJ

    SysAdmins: Awk is Your Friend

    Brian Jones writes at linux.com:

    The nice thing about humans is that they're at least somewhat predictable. Given the choice between having data randomly strewn about, and having it in some predictable pattern, humans will generally choose predictable patterns (Microsoft filesystem management issues notwithstanding). These patterns are what make awk, a pattern-matching programming language, a wonderful tool for systems administrators, database administrators, and even command-line junkies who use their box mainly for pleasure. The notion of being able to write a one-line command to do almost anything draws ever closer with awk in your tool belt. For most things administrators use awk for, it's an extremely simple language. As you get into writing more advanced awk scripts, at some point it becomes a bit cumbersome, and you realize that Perl is also your friend. But for now, let's focus on how awk can get you the most bang for your keyboard strokes, shall we?

    The first thing you should know is that awk is actually a rather powerful language. Entire books have been written about its use. If you're so inclined, you can write extremely complex 1000-line scripts using awk. However, as a systems administrator (the intended audience for this article), 99% of your use of awk will consist of relatively short scripts, and one-off one-liners typed right on the command line. Here's an example of a common use of awk:

    [jonesy@newhotness jonesy]$ cat access_log | 
         awk '{print $1}' | sort | uniq -c | sort -rn
    

    The above one-liner uses awk to slim down the amount of data coming from the web server's access log. The access log is space-delimited, and I only want to see the first field (hence "print $1"). Once I have that data, I want to sort it, then I have "uniq -c" provide a count of each occurrence for each unique value, and then I produce a reverse sort based on the numeric count provided by "uniq". The result has the number of hits in the left column, and the host in the right column, and the most frequent visitors are at the top of the list. Give it a shot! Even if you're hosted by an ISP, you should be able to access this log.
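
    The counting step can also be done inside awk itself with an associative array, leaving only the final sort to the shell. This is a sketch of an alternative, not the article's method; count_hosts is a hypothetical function name, and the three sample log lines are made up.

```shell
# Hypothetical helper: count occurrences of the first field, busiest first.
count_hosts() {
  awk '{ hits[$1]++ } END { for (h in hits) print hits[h], h }' "$@" | sort -rn
}

# Made-up sample input standing in for a real access_log
printf '%s\n' \
  '10.0.0.1 - - "GET /"' \
  '10.0.0.2 - - "GET /a"' \
  '10.0.0.1 - - "GET /b"' | count_hosts
# → 2 10.0.0.1
#   1 10.0.0.2
```

    Unlike "uniq -c", this version does not need its input sorted first, since the array accumulates counts in whatever order the hosts appear.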

    Awk is perfect for ripping data into smaller chunks, to make it more bite-size for other applications or manipulation. To use it on the command line on files that are not space-delimited, you can use the "-F" flag, and indicate a delimiter. This is useful for tearing apart /etc/passwd and /etc/shadow files. For example:

    [jonesy@tux jonesy]$ cat /etc/passwd | awk -F: '{print $5}' | awk -F, '{print NF}'
    

    I actually used something kinda similar to that during a NIS to LDAP migration to see if the gecos field ($5 in /etc/passwd) had consistent enough data to be useful. One of the tests is to see how consistent the number of datapoints held in the gecos field is from record to record. To figure out the number of fields in each record's gecos field, I tell awk to use ":" as the delimiter, and, based on that, print the fifth field. I then pipe that to another awk one-liner, which uses an awk built-in variable, "NF" and a different delimiter (gecos is generally comma-delimited, if it's even used for useful data).
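
    The same check can be sketched with a single awk process, since split returns the number of pieces it produced. The function name gecos_field_counts and the two sample records are assumptions for illustration.

```shell
# Hypothetical helper: print how many comma-separated items each gecos field holds.
gecos_field_counts() {
  awk -F: '{ print split($5, parts, ",") }' "$@"
}

# Made-up passwd-style records
printf '%s\n' \
  'jonesy:x:12000:13:Brian K. Jones,LUSER,101B,NONE:/home/jonesy:/bin/bash' \
  'daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin' | gecos_field_counts
# → 4
#   1
```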

    Awk in Scripts

    When one-liners just aren't enough for you, you can store a whole bunch of awk one-liners in a file, and call awk with "-f script" to tell it which file to read its commands from. Additionally, since awk needs to act on some data, you should also tag on something to take care of feeding awk the data it so desperately needs. For example, if I have a script called "getuname", which looks like this:

    BEGIN { FS=":" }
          {print $1}
    

    I can now call that script, feeding it anything that I know ahead of time has the user name as the first field in a given record. So I can say "awk -f getuname < /etc/passwd", or "ypcat passwd | awk -f getuname". There are two rather important things I did in this script that will save you some headaches. First, notice the "BEGIN" statement. This statement exists to give you some space to do some tasks before awk starts reading any data. In this example, I want awk to know, before it processes any data, that it should use a colon as its field separator. Sure, I could've called awk differently to get around this, i.e., "awk -F: -f getuname < /etc/passwd", but this way is shorter, and that's the point! It should also be noted that, if you have the need, you can also have an "END" section to your script, which will perform any actions, once, after the last data record has been processed.

    On the second line, I've just called a simple awk "action" statement, just like on the command line, with one important exception: I didn't use single quotes around it. The quotes are only there on the command line to protect the program from the shell; in a script file, awk would've tried to parse them as part of the program and choked. I know, because it happened while I was testing this script. Bad admin!
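
    To see BEGIN, a main action and END working together, here is a small sketch that extends the getuname idea with an END section. The function name list_users and the sample records are made up for the example; the program is passed inline (hence the single quotes) rather than stored in a file.

```shell
# Hypothetical helper: print user names, then a one-line summary.
list_users() {
  awk 'BEGIN { FS = ":" }                        # runs once, before any input
             { print $1 }                        # runs for every record
       END   { print NR " records processed" }   # runs once, after the last record
  ' "$@"
}

printf 'root:x:0:0\njonesy:x:12000:13\n' | list_users
# → root
#   jonesy
#   2 records processed
```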

    Built-in Goodness

    Awk has some built-in functions, like most scripting languages, which make life a bit easier. It also has some built-in variables that awk keeps track of for you -- and you get their values for free, just for asking, which is nice. The most useful variable I've had the pleasure to use as an admin is the "NF" variable, which will tell you, based on the field separator given (space by default), how many fields are in the current record. Conversely, the most useful function I've used as an awk scripter is the "split" function, which can break a single field into another array of separate fields. First, here's a quick example of NF in action:

    cat /etc/passwd | awk -F: '{print NF}'
    

    This is the lazy man's way to get the users' shells from the /etc/passwd file without having to remember how many fields are in the file. But wait! This doesn't print the last field in the record! It prints the number of fields in the record! Simple enough -- add a "$" to the front of "NF", and you'll get what you're looking for. Pipe the output to a couple of "sort" and "uniq" commands like we did earlier with the web log, and you'll get a snapshot of what the most commonly used shells are.
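
    Put together, the "$NF" version and the sort/uniq pipeline might look like the sketch below. The function name shell_counts and the sample records are assumptions; on a real system you would pass /etc/passwd as the argument.

```shell
# Hypothetical helper: tally the last field (the login shell) of passwd-style input.
shell_counts() {
  awk -F: '{ print $NF }' "$@" | sort | uniq -c | sort -rn
}

# Made-up sample records: two bash users, one sh user
printf '%s\n' \
  'root:x:0:0:root:/root:/bin/bash' \
  'jonesy:x:12000:13:Brian:/home/jonesy:/bin/bash' \
  'daemon:x:1:1:daemon:/usr/sbin:/bin/sh' | shell_counts
```

    The most common shell appears on the first line of output, preceded by its count; the exact column padding depends on your "uniq".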

    Now let's have a look at the split function. Let's say you use your gecos field to store a bunch of datapoints, and the datapoints within the gecos field are comma-delimited. This is not nearly so contrived as it might sound -- this happens in more than two environments I've done work in. Here's what it might look like:

    jonesy:x:12000:13:Brian K. Jones,LUSER,101B,NONE:/home/jonesy:/bin/bash
    

    Now let's say your PHB comes along and says he's tired of referring to me as "jonesy" and wants to know my real name. You can use awk's "split" function to help you here, and the code for doing so is fairly short:

    BEGIN { FS=":" }
          {
            gfields = split ( $5, gecos, ",")
            chunkname = split ( gecos[1], fullname, " " )
            print fullname[chunkname], fullname[1]
          }
    

    Let's translate that into English, shall we? Of course, you now know what the BEGIN statement does here -- nothing new. We'll start by looking at the "gfields" line, where I use "split" to break up the 5th field of the record, (the gecos field), using the comma as a delimiter, and storing all of the resulting fields in an array called "gecos". This can be counterintuitive, as you may be tempted to think that the resulting array is called "gfields". However, the "gfields" variable actually holds the value returned by "split": the number of elements stored in the "gecos" array. You get a look at how this works in the following two lines. "chunkname" represents the number of fields in the "fullname" array. The "fullname" array is created by splitting the first field of the "gecos" array (in this case, the field holding my full name), using a space as the delimiter. On the next line, I reference "fullname[chunkname]", which will print the last name of the person, even if (as in my case) they have a middle name or initial. Then I print the very first field in the fullname array, so the output generated by this script acting on my passwd record would be "Jones Brian".
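
    Running the script on the sample record makes the two return values of split visible. The sketch below adds one extra print line for that purpose and wraps the program in a hypothetical function, last_first; otherwise it is the same code.

```shell
# Hypothetical wrapper around the split example, with the counts printed too.
last_first() {
  awk 'BEGIN { FS = ":" }
  {
    gfields   = split($5, gecos, ",")            # number of items in the gecos field
    chunkname = split(gecos[1], fullname, " ")   # number of words in the full name
    print "gfields=" gfields, "chunkname=" chunkname
    print fullname[chunkname], fullname[1]
  }' "$@"
}

printf '%s\n' 'jonesy:x:12000:13:Brian K. Jones,LUSER,101B,NONE:/home/jonesy:/bin/bash' | last_first
# → gfields=4 chunkname=3
#   Jones Brian
```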

    In conclusion

    Whew! That was a mouthful. Awk has so many cool little hacks and built-in features that there has been more than one book published just on Awk. Undoubtedly, I'll utilize some of these features in future articles that involve putting together sysadmin solutions using various scripts as duct tape.


    categories: XML,Apr,2009,Admin

    XML

    These pages focus on XML tools and Awk.


    categories: XML,June,2009,SteveC

    xmlparse.awk

    Contents

    Synopsis

    Download

    Description:

    Code

    Author

    A simple XML parser for awk

    Synopsis

    awk -f xmlparse.awk [FILESPEC]...

    Download

    From LAWKER.

    Description:

    This script is a simple XML parser for (modern variants of) awk. Input in XML format is saved to two arrays, "type" and "item".

    The term, "item", as used here, refers to a distinct XML element, such as a tag, an attribute name, an attribute value, or data.

    The indexes into the arrays are the sequence number that a particular item was encountered. For example, the third item's type is described by type[3], and its value is stored in item[3].

    The "type" array contains the type of the item encountered for each sequence number. Types are expressed as a single word: "error" (invalid item or other error), "begin" (open tag), "attrib" (attribute name), "value" (attribute value), "end" (close tag), and "data" (data between tags).

    The "item" array contains the value of the item encountered for each sequence number. For types "begin" and "end", the item value is the name of the tag. For "error", the value is the text of the error message. For "attrib", the value is the attribute name. For "value", the value is the attribute value. For "data", the value is the raw data.

    WARNING: XML-quoted values ("entities") in the data and attribute values are *NOT* unquoted; they are stored as-is.

    Code

    BEGIN {
    

    In XML, literal "<" and ">" are only valid as tag delimiters; to include a "<" or ">" as data, they must be quoted: "&lt;" and "&gt;". So we know that if we encounter a ">", we have reached the end of a tag. This makes a convenient end-of-record marker, as the end-of-tag delimiter marks a special event, whereas a new-line is simply whitespace in XML.

            RS = ">";
    
            lineno = 1;
            sptr = 0;
    }
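
    The effect of using ">" as the record separator can be tried in isolation. In this sketch (split_on_gt is a hypothetical name, and the input is a made-up fragment), each record awk sees ends just before a ">", so a tag and the data that precedes it arrive together in one record:

```shell
# Hypothetical demo: show the records awk sees when RS is ">".
split_on_gt() {
  awk 'BEGIN { RS = ">" } { print NR ": " $0 }'
}

printf '<a>hi</a>' | split_on_gt
# → 1: <a
#   2: hi</a
```

    Record 2 shows why the parser looks for "<" inside each record: whatever comes before it ("hi") is data, and whatever follows is the tag.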
    
    Count input lines.

    {
            data = $0;
            lineno += gsub( /\n/, "", data );
            data = "";
    }
    

    Special modes of operation. These handle special XML sections, such as literal character data containing XML meta-characters ("cdata" sections), comments, and processing instructions ("pi") for other document processors.

    "Cdata" sections are terminated by the sequence, "]]>".

    ( mode == "cdata" ) {
            if ( $0 ~ /\]\]$/ ) {
                    sub( /\]\]$/, "", $0 );
                    mode = "";
            };
            item[idx] = item[idx] RS $0;
            next;
    }
    

    Comment sections are terminated by the sequence, "-->".

    ( mode == "comment" ) {
            if ( $0 ~ /--$/ ) {
                    sub( /--$/, "", $0 );
                    mode = "";
            };
            item[idx] = item[idx] RS $0;
            next;
    }
    
    Processing instruction sections are terminated by the sequence, "?>".

    ( mode == "pi" ) {
            if ( $0 ~ /\?$/ ) {
                    sub( /\?$/, "", $0 );
                    mode = "";
            };
            item[idx] = item[idx] RS $0;
            next;
    }
    
    ( !mode ) {
            mline = 0;
    

    Our record separator is the end-of-tag marker, ">". If we've encountered an end-of-tag marker, we should have a beginning-of-tag marker ("<") somewhere in the input record. If not, either there is a spurious end-of-tag marker, or the record was terminated by the end-of-file.

            p = index( $0, "<" );
    

    Any data preceding the beginning-of-tag marker is raw data. If no beginning-of-tag marker is present, everything in the input is data.

            if ( !p || ( p > 1 )) {
                    idx += 1;
                    type[idx] = "data";
                    item[idx] = ( p ? substr( $0, 1, ( p - 1 )) : $0 );
                    if ( !p ) next;
                    $0 = substr( $0, p );
            };
    

    Recognize special XML sections. Sections are not processed as XML, but handled specially. If the section ends within the current input record, we continue processing XML in the next record; otherwise, we enter a special mode and perform special processing.

    Character data ("cdata") sections contain literal character data containing XML meta-characters that should not be processed. Character data sections begin with the sequence, "<![CDATA[" and end with "]]>". This section may span input records.

            if ( $0 ~ /^<!\[[Cc][Dd][Aa][Tt][Aa]\[/ ) {
                    idx += 1;
                    type[idx] = "cdata";
                    $0 = substr( $0, 10 );
                    if ( $0 ~ /\]\]$/ ) sub( /\]\]$/, "", $0 );
                    else {
                            mode = "cdata";
                            mline = lineno;
                    };
                    item[idx] = $0;
                    next;
            }
    

    Comments begin with the sequence, "<!--" and end with "-->". This section may span input records.

            else if ( $0 ~ /^<!--/ ) {
                    idx += 1;
                    type[idx] = "comment";
                    $0 = substr( $0, 5 );
                    if ( $0 ~ /--$/ ) sub( /--$/, "", $0 );
                    else {
                            mode = "comment";
                            mline = lineno;
                    };
                    item[idx] = $0;
                    next;
            }
    

    Declarations begin with the sequence, "<!" and end with ">". This section may *NOT* span input records.

            else if ( $0 ~ /^<!/ ) {
                    idx += 1;
                    type[idx] = "decl";
                    $0 = substr( $0, 3 );
                    item[idx] = $0;
                    next;
            }
    

    Processing instructions ("pi") begin with the sequence, "<?" and end with "?>". This section may span input records.

            else if ( $0 ~ /^<\?/ ) {
                    idx += 1;
                    type[idx] = "pi";
                    $0 = substr( $0, 3 );
                    if ( $0 ~ /\?$/ ) sub( /\?$/, "", $0 );
                    else {
                            mode = "pi";
                            mline = lineno;
                    };
                    item[idx] = $0;
                    next;
            };
    

    Beyond this point, we're dealing strictly with a tag.

            idx += 1;
    

    A tag that begins with "</" is a close tag: it closes a tag-enclosed block.

            if ( substr( $0, 1, 2 ) == "</" ) {
                    type[idx] = "end";
                    tag = $0 = substr( $0, 3 );
            }
    

    A tag that begins simply with "<" is an open tag: it starts a tag-enclosed block. Note that a stand-alone tag (one that ends with "/>") will be handled later, and will appear as an open tag and close tag, with no data between.

            else {
                    type[idx] = "begin";
                    tag = $0 = substr( $0, 2 );
            };
    

    The tag name is saved in "tag" so that we can retrieve it later should we find that the tag is stand-alone and need to save a close tag item.

            sub( /[ \n\t/].*$/, "", tag );
            tag = toupper( tolower( tag ));
            item[idx] = tag;
    

    Validate the tag name. If invalid, indicate so and exit.

            if ( tag !~ /^[A-Za-z][-+_.:0-9A-Za-z]*$/ )
            {
                    type[idx] = "error";
                    item[idx] = "line " lineno ": " tag ": invalid tag name";
                    exit( 1 );
            }
    

    If an open tag is encountered, its name is recorded on the stack. If a close tag is encountered, its name is compared against the name on the top of the stack. If the names differ, an error is generated (XML does not allow overlapping tags).

            if ( type[idx] == "begin" ) {
                    sptr += 1;
                    lstack[sptr] = lineno;
                    tstack[sptr] = tag;
            }
            else if ( type[idx] == "end" ) {
                    if ( tag != tstack[sptr] ) {
                            type[idx] = "error";
                            item[idx] = "line " lineno ": " tag \
                                        ": unexpected close tag, expecting " \
                                            tstack[sptr];
                            exit( 1 );
                    };
                    delete tstack[sptr];
                    sptr -= 1;
            };
    
            sub( /[^ \n\t/]*[ \n\t]*/, "", $0 );
    

    Beyond this point, we're dealing with the tag attributes, if any, and/or the stand-alone end-of-tag marker.

            while ( $0 ) {
    

    If $0 contains only a slash (/), then the tag we're processing is stand-alone (it ended with "/>"), so we generate a close tag, but no data between the open and close tags.

                    if ( $0 == "/" )
                    {
                            idx += 1;
                            type[idx] = "end";
                            item[idx] = tag;
                            delete lstack[sptr];
                            delete tstack[sptr];
                            sptr -= 1;
                            break;
                    };
    

    The attribute name is determined. Note that the attribute name is also saved to "attrib" so that we can reference it should the attribute not include a value. If the attribute does not include a value, its name is given as its value.

                    idx += 1;
                    type[idx] = "attrib";
                    attrib = $0;
                    sub( /=.*$/, "", attrib );
                    attrib = tolower( attrib );
    
                    item[idx] = attrib;
    

    Validate the attribute name. If invalid, indicate so and exit.

                    if ( attrib !~ /^[A-Za-z][-+_0-9A-Za-z]*$/ )
                    {
                            type[idx] = "error";
                            item[idx] = "line " lineno ": " attrib \
                                            ": invalid attribute name";
                            exit( 1 );
                    }
    
                    sub( /^[^=]*/, "", $0 );
    

    Each attribute must have a value. If one isn't explicit in the input, we assign it one equal to the name of the attribute itself. Attribute values in the input may be in one of three forms: enclosed in double quotes ("), enclosed in single quotes/apostrophes ('), or a single word.

                    idx += 1;
                    type[idx] = "value";
    
                    if ( substr( $0, 1, 1 ) == "=" ) {
                            if ( substr( $0, 2, 1 ) == "\"" ) {
                                    item[idx] = substr( $0, 3 );
                                    sub( /".*$/, "", item[idx] );
                                    sub( /^="[^"]*"/, "", $0 );
                            }
                            else if ( substr( $0, 2, 1 ) == "'" ) {
                                    item[idx] = substr( $0, 3 );
                                    sub( /'.*$/, "", item[idx] );
                                    sub( /^='[^']*'/, "", $0 );
                            }
                            else {
                                    item[idx] = substr( $0, 2 );
                                    sub( /[ \n\t/].*$/, "", item[idx] );
                                    sub( /^=[^ \n\t/]*/, "", $0 );
                            };
                    }
                    else item[idx] = attrib;
    
                    sub( /^[ \n\t]*/, "", $0 );
    
            };
    
            attrib = "";
            tag = "";
            next;
    }
    
    END {
    

    If mode is defined, the input stream ended without terminating an XML section. Thus, the input contains invalid XML.

            if ( mode ) {
                    idx += 1;
                    type[idx] = "error";
                    if ( mode == "cdata" ) mode = "character data";
                    else if ( mode == "pi" ) mode = "processing instruction";
                    item[idx] = "line " mline ": unterminated " mode;
            };
    

    If an open tag occurred with no corresponding close tag, we have invalid XML.

            for ( n = sptr; n; n -= 1 ) {
                    idx += 1;
                    type[idx] = "error";
                    item[idx] = "line " lstack[n] ": " \
                                    tstack[n] ": unclosed tag";
            };
    }
    

    The following simple examples demonstrate the use of the accumulated data from the XML input stream.

    END {
    
    If errors occurred, generate appropriate messages and exit without further processing.
            if ( type[idx] == "error" ) {
                    for ( n = idx; n && ( type[n] == "error" ); n -= 1 );
                    for ( n += 1; n <= idx; n += 1 ) print "ERROR:", item[n];
                    exit 1;
            };
    
    # Print simplified XML. If output completes successfully and the stack
    # is not empty, close tags are generated for each tag on the stack.
    #       in_tag = 0;
    #
    #       for ( n = 1; n <= idx; n += 1 ) {
    #
    #               if ( type[n] == "attrib" ) printf( " %s", item[n] );
    #
    #               else if ( type[n] == "begin" ) {
    #                       if ( in_tag ) printf( ">" );
    #                       else in_tag = 1;
    #                       printf( "<%s", item[n] );
    #               }
    #
    #               else if ( type[n] == "cdata" ) {
    #                       if ( in_tag ) {
    #                               printf( ">" );
    #                               in_tag = 0;
    #                       };
    #                       printf( "<![CDATA[%s]]>", item[n] );
    #               }
    #
    #               else if ( type[n] == "comment" ) {
    #                       if ( in_tag ) {
    #                               printf( ">" );
    #                               in_tag = 0;
    #                       };
    #                       printf( "<!--%s-->", item[n] );
    #               }
    #
    #               else if ( type[n] == "data" ) {
    #                       if ( in_tag ) {
    #                               printf( ">" );
    #                               in_tag = 0;
    #                       };
    #                       printf( "%s", item[n] );
    #               }
    #
    #               else if ( type[n] == "decl" ) {
    #                       if ( in_tag ) {
    #                               printf( ">" );
    #                               in_tag = 0;
    #                       }
    #                       printf( "<!%s>", item[n] );
    #               }
    #
    #               else if ( type[n] == "end" ) {
    #                       if ( in_tag ) {
    #                               printf( "/>" );
    #                               in_tag = 0;
    #                       }
    #                       else printf( "</%s>", item[n] );
    #               }
    #
    #               else if ( type[n] == "error" ) {
    #                       if ( in_tag ) {
    #                               printf( ">" );
    #                               in_tag = 0;
    #                       };
    #                       print "";
    #                       print "<!-- ERROR:", item[n], "-->";
    #                       break;
    #               }
    #
    #               else if ( type[n] == "pi" ) {
    #                       if ( in_tag ) {
    #                               printf( ">" );
    #                               in_tag = 0;
    #                       };
    #                       printf( "<?%s?>", item[n] );
    #               }
    #
    #               else if ( type[n] == "value" ) {
    #                       if ( item[n] ~ /"/ ) printf( "='%s'", item[n] );
    #                       else printf( "=\"%s\"", item[n] );
    #               };
    #       };
    #
    #       if ( in_tag ) printf( ">" );
    #
    #       for ( n = sptr; n; n -= 1 ) printf( "</%s>", tstack[n] );
    

    # Print an object tree, identifying tags and attributes. Nesting is
    # emphasized by indenting.

    #       indent = "";
    #       for ( n = 1; n <= idx; n += 1 ) {
    #               if ( type[n] == "attrib" ) print indent "attrib", item[n];
    #               else if ( type[n] == "begin" ) {
    #                       print indent "begin", item[n];
    #                       indent = indent "  ";
    #               }
    #               else if ( type[n] == "end" ) {
    #                       indent = substr( indent, 3 );
    #                       print indent "end", item[n];
    #               }
    #               else if ( type[n] == "error" ) print "ERROR:", item[n];
    #               else print indent type[n];
    #       };
    

    Print in a linear format suitable for parsing by shell scripts. Multi-line values have their newlines replaced with the character sequence "\n" (backslash, n) to ensure the entire name/value pair occurs on a single line. All occurrences of backslashes (\) in the original value are themselves backslash quoted.

            for ( n = 1; n <= idx; n += 1 ) {
                    value = item[n];
                    gsub( /\\/, "\\\\", value );
                    gsub( /\n/, "\\n", value );
                    print type[n], value;
            };
    
            for ( n = sptr; n; n -= 1 ) print "end", tstack[n];
    

    Print attribute values and data in a linear format suitable for searching (e.g. with grep). Attributes are represented as:

          [TAG/]...TAG/ATTRIB=VALUE
    
    Data is represented as:
          [TAG/]...TAG: DATA
    

    Note that all tag names are displayed in upper-case. All attribute names are displayed in lower-case.

    Multi-line values have their newlines replaced with the character sequence "\n" (backslash, n) to ensure the entire name/value pair occurs on a single line. All occurrences of backslashes (\) in the original value are themselves backslash quoted.

    #       sptr = 0;
    #       for ( n = 1; n <= idx; n += 1 ) {
    #               if ( type[n] == "attrib" ) {
    #                       lead = stack[1];
    #                       for ( m = 2; m <= sptr; m += 1 ) \
    #                               lead = lead "/" stack[m];
    #                       lead = lead "/" item[n] "=";
    #               }
    #               else if ( type[n] == "begin" ) stack[++sptr] = item[n];
    #               else if (( type[n] == "cdata" ) || ( type[n] == "data" )) {
    #                       lead = stack[1];
    #                       for ( m = 2; m <= sptr; m += 1 ) \
    #                               lead = lead "/" stack[m];
    #                       lead = lead ": ";
    #               }
    #               else if ( type[n] == "end" ) sptr -= 1;
    #               if (( type[n] == "data" ) || ( type[n] == "value" )) {
    #                       value = item[n];
    #                       gsub( /\\/, "\\\\", value );
    #                       gsub( /\n/, "\\n", value );
    #                       print lead value;
    #               };
    #       };
    
    }
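
As an illustration of consuming the parser's output, the sketch below post-processes the linear format printed above (one "type value" pair per line) into the searchable path format. It is not part of the original program: the function name xml_paths and the sample input are assumptions made for the example, and it handles only the begin, end, attrib, value, and data item types.

```shell
# Hypothetical helper: convert "type value" lines into
# TAG/TAG/attrib=value and TAG/TAG: data lines.
xml_paths() {
    awk '
    {
        kind = $1
        value = $0
        sub(/^[^ ]+ ?/, "", value)      # everything after the type word
    }
    kind == "begin"  { stack[++sptr] = value; next }
    kind == "end"    { if (sptr > 0) delete stack[sptr--]; next }
    kind == "attrib" { name = value; next }
    kind == "value"  { print path() "/" name "=" value; next }
    kind == "data"   { print path() ": " value; next }

    # Join the names of the currently open tags with slashes.
    function path(   p, i) {
        p = stack[1]
        for (i = 2; i <= sptr; i++) p = p "/" stack[i]
        return p
    }'
}

# Sample input, written by hand in the linear format.
printf '%s\n' 'begin HTML' 'begin A' 'attrib href' 'value x.html' \
    'data link text' 'end A' 'end HTML' | xml_paths
```

The output can then be searched with grep, for example for every line that mentions the href attribute.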
    

    Author

    Steve Coile


    categories: Awk100,May,2009,Dab

    Jawk: Awk in Java

    Download

    Download from Source Forge.

    Description

    Jawk parses, analyzes, and interprets and/or compiles AWK scripts. Compilation is targeted for the JVM.

    Jawk runs on any platform which supports, at minimum, J2SE 5.

    Usage

    To use, simply download the application, copy the release jar to the jawk.jar file and execute the following command:
    java -jar jawk.jar {command-line-arguments}
    

    To view the command line argument usage summary, execute

    java -jar jawk.jar -h
    
    The output of this command is shown below:
    java ... org.jawk.Awk [-F fs_val] [-f script-filename] 
                          [-o output-filename] [-c] [-z] [-Z] 
                          [-d dest-directory] [-S] [-s] [-x] [-y] [-r] 
                          [-ext] [-ni] [-t] [-v name=val]... 
                          [script] [name=val | input_filename]...
    
     -F fs_val = Use fs_val for FS.
     -f filename = Use contents of filename for script.
     -v name=val = Initial awk variable assignments.
    
     -t = (extension) Maintain array keys in sorted order.
     -c = (extension) Compile to intermediate file. (default: a.ai)
     -o = (extension) Specify output file.
     -z = (extension) | Compile for JVM. (default: AwkScript.class)
     -Z = (extension) | Compile for JVM and execute it. (default: AwkScript.class)
     -d = (extension) | Compile to destination directory.  (default: pwd)
     -S = (extension) Write the syntax tree to file. (default: syntax_tree.lst)
     -s = (extension) Write the intermediate code to file. (default: avm.lst)
     -x = (extension) Enable _sleep, _dump as keywords, and exec as a builtin func.
                      (Note: exec enabled only in interpreted mode.)
     -y = (extension) Enable _INTEGER, _DOUBLE, and _STRING casting keywords.
     -r = (extension) Do NOT hide IllegalFormatExceptions for [s]printf.
    -ext= (extension) Enable user-defined extensions. (default: not enabled)
    -ni = (extension) Do NOT process stdin or ARGC/V through input rules.
                      (Useful for blocking extensions.)
                      (Note: -ext & -ni available only in interpreted mode.)
    
     -h or -? = (extension) This help screen.
    

    Extensions

    Jawk addresses a drawback of standard Awk. For example, in standard Awk it is impossible to create a socket or display a simple GUI without external assistance, either from the shell or via extensions to Awk itself (e.g., gawk). To overcome this limitation, Jawk adds an extension facility.

    The Jawk extension facility allows arbitrary Java code to be called as Awk functions in a Jawk script. These extensions can come from the user (developer) or third-party providers (e.g., the Jawk project team). Jawk extensions are opt-in: the -ext flag is required to use them, and extensions must be explicitly registered with the Jawk instance via the -Djawk.extensions property (except for core extensions bundled with Jawk).

    Also, Jawk extensions support blocking. You can think of blocking as a tool for extension event management. A Jawk script can block on a collection of blockable services, such as socket input availability, database triggers, user input, GUI dialog input response, or a simple fixed timeout. Together with the -ni option, action rules can then act on block events instead of input text. This puts a powerful AWK construct, originally intended for text processing, to work on blockable events. A sample enhanced echo server script is included in this article. It uses blocking to handle socket events, standard input from the user, and timeout events, all within the 47-line script (including comments).

    Example

    The example script implements a simple echo server which also allows broadcast messaging via stdin input from the server process:
    ## to run: java ... -jar jawk.jar -ext -ni -f {filename}
    BEGIN {
    	css = CServerSocket(7777);
    	print "(echo server socket created)"
    }
    ## note: default input processing disabled by -ni
    $0 = SocketAcceptBlock(css,
    	SocketInputBlock(sockets,
    		SocketCloseBlock(css, sockets,
    			StdinBlock(
    				Timeout(1000)))));
    				## note: default action { print } disabled by -ni
    # $1 = "SocketAccept", $2 = socket handle
    $1 == "SocketAccept" {
    	socket = SocketAccept($2)
    	sockets[socket] = 1
    }
    
    # $1 = "SocketInput", $2 = socket handle
    $1 == "SocketInput" {
    	## echo server action:
    
    	socket = $2
    	line = SocketRead(socket)
    	SocketWrite(socket, line)
    }
    
    # $1 = "SocketClose", $2 = socket handle
    $1 == "SocketClose" {
    	socket = $2
    	SocketClose(socket)
    	delete sockets[socket]
    }
    ## display a . for every second the server is running
    $0 == "Timeout" {
    	printf "."
    }
    ## stdin block is last because StdinGetline writes directly to $0
    ## $0 == "Stdin"
    $0 == "Stdin" {
    	## broadcast message to all sockets
    	retcode = StdinGetline()
    	if (retcode != 1)
    		exit
    	for (socket in sockets)
    		SocketWrite(socket, "From server : " $0)
    	print "(message sent)"
    }
    
    

    Each extension function used in the script above is covered in some detail below:

    • CServerSocket - Creates a character-based server socket. SocketRead returns lines of text (with newlines stripped) for character-based sockets, while it returns blocks of bytes (converted to a String) for sockets accepted by ServerSocket. Use character-based sockets for interactive or line-based input, and use ordinary sockets to achieve high throughput, since arbitrary byte blocks are returned. To create a client socket, use CSocket for character-based sockets, or Socket for byte-block-based sockets.
    • SocketAcceptBlock/SocketInputBlock/SocketCloseBlock/StdinBlock/Timeout - Each of these is a blocking extension that blocks for a particular event, such as a server socket being ready to accept an incoming connection, a connected socket having input to be read, or a certain amount of time having elapsed. Socket*Block extension functions come from SocketExtension, StdinBlock comes from StdinExtension, and Timeout comes from CoreExtension. Each Socket*Block extension returns a string of the format:
      extension-label-prefix OFS parameter
      
      while StdinBlock and Timeout return
      extension-label-prefix
      
    • SocketAccept/SocketRead/SocketWrite/SocketClose - Socket operations, as the names of the extension functions suggest. Each will block until it is able to complete the operation.
    • StdinGetline - Get a line of input from stdin. If no input is available, block until there is some. This is why blocking is a valuable tool: the script can wait for other events while waiting for stdin, which takes AWK beyond focused text processing and into event processing.

    As stated in the comments, -ni disables stdin processing (as provided by Jawk itself, not the StdinExtension) and the default blank rule of { print }. Disabling stdin processing is essential to extension processing because, otherwise, it would be confusing, if not completely impossible, to multiplex extension blocking with Jawk's default stdin processing. Disabling the default blank rule allows for easy-to-read blocking statements (like the one provided in the sample script) without the weird side effect of printing the result.
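
The dispatch style used in the echo server can be tried without Jawk. The sketch below is plain awk, not Jawk: simulated event strings on standard input stand in for the values that the blocking extensions return, and the function name dispatch_events and the printed messages are assumptions made for the example.

```shell
# Each input line plays the role of one block event:
# "label parameter" for socket events, a bare label for Timeout.
dispatch_events() {
    awk '
    $1 == "SocketAccept" { open[$2] = 1; print "accept " $2; next }
    $1 == "SocketClose"  { delete open[$2]; print "close " $2; next }
    $0 == "Timeout"      { ticks++; next }
    END {
        n = 0
        for (s in open) n++             # count sockets still open
        print "open=" n " ticks=" (ticks + 0)
    }'
}

printf '%s\n' 'SocketAccept 5' 'Timeout' 'SocketAccept 6' \
    'SocketClose 5' 'Timeout' | dispatch_events
```

As in the Jawk script, each rule tests the event label in $1 (or the whole record for events without a parameter), so adding a new event type means adding one more pattern-action rule.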

    Author

    Dan: ddaglas at users.sourceforge.net.


    categories: Xgawk,XML,Awk100,Apr,2009,JurgenK

    XMLgawk

    Editor's note: Programmers often take awk "as is", never thinking to use it as a lab in which they can explore other language extensions. An alternate approach is to treat the Awk code base as a reusable library of parsers, regular expression engines, etc., and to make modifications to the language. This second approach is taken in the Awk A* project and, as shown here, in XMLgawk.

    IMHO, XMLgawk is one of the most exciting new innovations seen in Gawk for many years. It shows that Awk is more than "just" a text processor: rather, it is also a candidate technology for modern XML-based web applications.

    Purpose

    Extends standard gawk with built-in XML processing.

    Developers

    Main developers: Jurgen Kahrs and Andrew Schorr.

    Conceptual guidance: Manuel Collado.

    MS Windows build expert: Victor Paeza.

    Contributor of ideas for new features: Peter Saveliev.

    Domain

    XML processing, plus libraries for other extensions to Gawk.

    Description

    XMLgawk is an experimental extension of the GNU Awk interpreter. It includes a small XML parsing library which is built upon the Expat XML parser. The parsing library is a very thin layer on top of Expat (implementing a pull-interface) and can also be used without GNU Awk to read XML data files.

    Both XMLgawk and its XML puller library require only an ANSI C compatible compiler (GCC works, as do most vendors' ANSI C compilers) and a 'make' program.

    XMLgawk provides the following functionality:

    • AWK's way of reading data line by line is supplemented by reading XML files node by node.
    • XMLgawk can load .awk files as well as shared libraries.
    • Adds support for an @include directive in the source code. This is the same feature provided by the current igawk script.

    Current

    3=Released

    Use

    3=Free/public domain.

    Date Deployed

    November 2003.

    Dated

    April 28, 2009.

    Url


    categories: Xgawk,XML,Dec,2009,WimVB

    Xgawk on Windows

    After some hard work I seem to be able to build XMLgawk for native Windows :-). Jurgen, Victor and Manuel: thanks for all the tips!

    If you're interested, have a look at http://www.wimdows.info/project/xgawk and have fun.

    -- Wim van Blitterswijk


    categories: Games,Awk100,Apr,2009,Ronl

    Soccer

    Purpose

    AI Programming lab class challenge.

    Installation

    Download from LAWKER. Look at the first line of each file for something that looks like this:

    #!/usr/bin/gawk -f
    
    Replace this with the full path to the local version of Gawk.
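
One way to make that replacement is sketched below as a small filter. The function name fix_shebang and the path /usr/local/bin/gawk are placeholders, not part of the distribution; substitute the path reported on the local system.

```shell
# Rewrite the first line of a script read from stdin when it is an
# awk "#!" line; every other line is passed through unchanged.
fix_shebang() {
    awk -v gawk_path="$1" '
    NR == 1 && /^#!.*awk/ { print "#!" gawk_path " -f"; next }
    { print }'
}

printf '%s\n' '#!/usr/bin/gawk -f' 'BEGIN { print "kick off" }' |
    fix_shebang /usr/local/bin/gawk
```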

    Developers

    Ronald Loui (programmer and designer)

    Organization

    Washington University in St. Louis

    Country

    USA

    Domain

    Text-based game simulation.

    Contact

    Ronald P. Loui

    Email

    r.p.loui@gmail.com

    Description

    Ronald Loui writes: Most people are surprised when I tell them what language we use in our undergraduate AI programming class. That's understandable. We use GAWK, Gnu's version of Aho, Weinberger, and Kernighan's old pattern scanning language.

    This code manages a CGI interface to a process that simulates a soccer game, polling for inputs from two student programs.

    A repeated observation in this class is that only the scripting programmers can generate code fast enough to keep up with the demands of the class. Even though students were allowed to choose any language they wanted, and many had to unlearn the java ways of doing things in order to benefit from scripting, there were few who could develop ideas into code effectively and rapidly without scripting.

    In the puny language, GAWK, which Aho, Weinberger, and Kernighan thought not much more important than grep or sed, I find lessons in AI's trends, AI's history, and the foundations of AI. What I have found not only surprising but also hopeful, is that when I have approached the AI people who still enjoy programming, some of them are not the least bit surprised.

    Awk

    Was written for gawk in 1995 but should run on almost any awk dialect; some css positioning commands will not work in all browsers; try IE6.

    Platform

    Was written on Redhat Linux with multiple hardware platforms in mind.

    Uses

    Intended to be run on a nearby server to minimize delays.

    Lines

    605 lines in main cgi with several small aux control programs.

    DevelopmentEffort

    Minimal compared to development effort, but potentially will require css for new browsers.

    MaintenanceEffort

    Number of person-months since, including enhancements

    Current

    2=Evaluation.

    Users

    50 students in artificial intelligence project classes had to use some version of this code over seven years

    DateDeployed

    October 2004

    Dated

    April 2009


    categories: Top10,Awk100,Papers,Os,Apr,2009,YungC

    Awk-Linux

    Awk-Linux Educational Operating Systems

    Purpose

    Teaching operating systems.

    Developers

    Yung-Pin Cheng

    Email

    ypc@csie.ntnu.edu.tw

    Organization

    Software Engineering Lab., Department of Computer Science and Information Engineering, National Taiwan Normal University

    Country

    TAIWAN

    Domain

    Educators of Operating Systems

    Description

    Most well-known instructional operating systems are complex, particularly if their companion software is taken into account. It takes considerable time and effort to craft these systems, and their complexity may introduce maintenance and evolution problems. In this project, a courseware called Awk-Linux is proposed. The basic hardware functions provided by Awk-Linux include timer interrupt and page-fault interrupt, which are simulated through program instrumentation over user programs.

    A major advantage of the use of Awk for this tool is platform independence. Awk-Linux can be crafted relatively easily and it does not depend on any hardware simulator or platform. Stable Awk versions run on many platforms, so this tool can be readily ported to other machines. The same cannot be said for other, more complex operating systems courseware that may be much harder to port to new environments.

    In practice, using Awk-Linux is very simple for the instructor and students:

    • Course projects based on Awk-Linux provide source code extracted and simplified from a Linux kernel.
    • Results of our study indicate that the projects helped students better understand the inner workings of operating systems.

    Awk

    Gawk under cygwin or Linux

    Platform

    Windows (CYGWIN required) or Linux

    Uses

    C programming language

    Current

    Status 3 (Released)

    Use

    3(Free/public domain)

    DateDeployed

    2004

    References

    Yung-Pin Cheng, Janet Mei-Chuen Lin, "Awk-Linux: A Lightweight Operating Systems Courseware," IEEE Transactions on Education, vol. 51, issue 4, pp. 461-467, 2008.

    Url

    www.csie.ntnu.edu.tw/~ypc/awklinux.htm


    categories: Top10,Awk100,Mar,2009,NelsonB,Spell,ArnoldR

    spell.awk

    Synopsis

    awk [-v Dictionaries="sysdict1 sysdict2 ..."] -f spell.awk -- \
        [=suffixfile1 =suffixfile2 ...] [+dict1 +dict2 ...] \
        [-strip] [-verbose] [file(s)]
    

    Download

    Download from LAWKER.

    Description

    Why Study This Code?

    This program is an example par excellence of the power of awk. Yes, if written in "C", it would run faster. But goodness me, it would be much longer to code. These few lines implement a powerful spell checker, with user-specifiable exception lists. The built-in dictionary is constructed from a list of standard Unix spelling dictionaries, overridable on the command line.

    It also offers some tips on how to structure larger-than-ten-line awk programs. In the code below, note the following:

    • The code is hundreds of lines long. Yes folks, it's true, Awk is not just a tool for writing one-liners.
    • The code is well-structured. Note, for example, how the BEGIN block is used to initialize the system from files/functions.
    • The code uses two tricks that encourage function reuse:
      • Much of the functionality has been moved out of PATTERN-ACTION and into functions.
      • The number of globals is restricted: note the frequent use of local variables in functions.
    • There is an example, in scan_options, of how to parse command line arguments.
    • The use of "print pipes" in report_exceptions shows how to link Awk code to other commands.

    (And to write even larger programs, divided into many files, see runawk.)

    Dictionaries

    Dictionaries are simple text files, with one word per line. Unlike those for Unix spell(1), the dictionaries need not be sorted, and there is no dependence on the locale in this program that can affect which exceptions are reported, although the locale can affect their reported order in the exception list. A default list of dictionaries can be supplied via the environment variable DICTIONARIES, but that can be overridden on the command line.

    For the purposes of this program, words are located by replacing ASCII control characters, digits, and punctuation (except apostrophe) with ASCII space (32). What remains are the words to be matched against the dictionary lists. Thus, files in ASCII and ISO-8859-n encodings are supported, as well as Unicode files in UTF-8 encoding.

    All word matching is case insensitive (subject to the workings of tolower()).
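
The word-location step described above can be sketched in a few lines. This is a simplified illustration rather than the program's own code: it keeps only ASCII letters and the apostrophe (the real NonWordChars pattern also keeps the bytes above 127), and the function name split_words is an assumption made for the example.

```shell
# Replace every run of non-word characters with a space, then print
# each remaining word in lower case, one per line.
split_words() {
    awk -v apos="'" '
    BEGIN { nonword = "[^A-Za-z" apos "]+" }   # simplified: ASCII letters only
    {
        gsub(nonword, " ")                     # modifies $0, so fields are re-split
        for (k = 1; k <= NF; k++)
            print tolower($k)
    }'
}

printf '%s\n' "Don't panic: 42 towels, please!" | split_words
```

Each word printed this way is what would be looked up in the Dictionary array.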

    In this simple version, which is intended to support multiple languages, no attempt is made to strip word suffixes, unless the -strip option is supplied.

    Suffixes

    Suffixes are defined as regular expressions, and may be supplied from suffix files (one per name) named on the command line, or from an internal default set of English suffixes. Comments in the suffix file run from sharp (#) to end of line. Each suffix regular expression should end with $, to anchor the expression to the end of the word. Each suffix expression may be followed by a list of one or more strings that can replace it, with the special convention that "" represents an empty string. For example:

    	ies$	ie ies y	# flies -> fly, series -> series, ties -> tie
    	ily$	y ily		# happily -> happy, wily -> wily
    	nnily$	n		# funnily -> fun
    

    Although it is permissible to include the suffix in the replacement list, it is not necessary to do so, since words are looked up before suffix stripping.

    Suffixes are tested in order of decreasing length, so that the longest matches are tried first.
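
To make the suffix rules concrete, the sketch below applies a single rule to a single word and prints the candidate words that would then be looked up in the dictionary. It is an illustration under stated assumptions, not the program's own suffix-stripping code: the function name suffix_candidates and its three arguments are invented for the example.

```shell
# $1 = word, $2 = suffix regular expression, $3 = replacement list
suffix_candidates() {
    awk -v word="$1" -v suffix="$2" -v repl="$3" '
    BEGIN {
        if (match(word, suffix)) {
            stem = substr(word, 1, RSTART - 1)   # the word minus the suffix
            n = split(repl, parts, " ")
            if (n == 0) print stem               # no replacements: bare stem
            for (k = 1; k <= n; k++)             # "" stands for the empty string
                print stem (parts[k] == "\"\"" ? "" : parts[k])
        }
    }'
}

suffix_candidates flies 'ies$' 'ie ies y'
```

For the rule shown, the candidates for "flies" are "flie", "flies", and "fly"; the word is accepted if any candidate is in the dictionary.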

    Output

    The default output is just a sorted list of unique spelling exceptions, one per line. With the -verbose option, output lines instead take the form

    	filename:linenumber:exception
    

    Some Unix text editors recognize such lines, and can use them to move quickly to the indicated location.

    Code

    Top-Level

    BEGIN	{ initialize() }
    	    { spell_check_line() }
    END	    { report_exceptions() }
    

    get_dictionaries

    function get_dictionaries(        files, key)
    {
        if ((Dictionaries == "") && ("DICTIONARIES" in ENVIRON))
    	Dictionaries = ENVIRON["DICTIONARIES"]
        if (Dictionaries == "")	# Use default dictionary list
        {
    	DictionaryFiles["/usr/dict/words"]++
    	DictionaryFiles["/usr/local/share/dict/words.knuth"]++
        }
        else			# Use system dictionaries from command line
        {
    	split(Dictionaries, files)
    	for (key in files)
    	    DictionaryFiles[files[key]]++
        }
    }
    

    Initialize

    function initialize()
    {
       NonWordChars = "[^" \
    	"'" \
    	"ABCDEFGHIJKLMNOPQRSTUVWXYZ" \
    	"abcdefghijklmnopqrstuvwxyz" \
    	"\200\201\202\203\204\205\206\207\210\211\212\213\214\215\216\217" \
    	"\220\221\222\223\224\225\226\227\230\231\232\233\234\235\236\237" \
    	"\240\241\242\243\244\245\246\247\250\251\252\253\254\255\256\257" \
    	"\260\261\262\263\264\265\266\267\270\271\272\273\274\275\276\277" \
    	"\300\301\302\303\304\305\306\307\310\311\312\313\314\315\316\317" \
    	"\320\321\322\323\324\325\326\327\330\331\332\333\334\335\336\337" \
    	"\340\341\342\343\344\345\346\347\350\351\352\353\354\355\356\357" \
    	"\360\361\362\363\364\365\366\367\370\371\372\373\374\375\376\377" \
    	"]"
        get_dictionaries()
        scan_options()
        load_dictionaries()
        load_suffixes()
        order_suffixes()
    }
    

    load_dictionaries

    function load_dictionaries(        file, word)
    {
        for (file in DictionaryFiles)
        {
    	## print "DEBUG: Loading dictionary " file > "/dev/stderr"
    	while ((getline word < file) > 0)
    	    Dictionary[tolower(word)]++
    	close(file)
        }
    }
    

    load_suffixes

    function load_suffixes(        file, k, line, n, parts)
    {
        if (NSuffixFiles > 0)		# load suffix regexps from files
        {
    	for (file in SuffixFiles)
    	{
    	    ## print "DEBUG: Loading suffix file " file > "/dev/stderr"
    	    while ((getline line < file) > 0)
    	    {
    		sub(" *#.*$", "", line)		# strip comments
    		sub("^[ \t]+", "", line)	# strip leading whitespace
    		sub("[ \t]+$", "", line)	# strip trailing whitespace
    		if (line == "")
    		    continue
    		n = split(line, parts)
    		Suffixes[parts[1]]++
    		Replacement[parts[1]] = parts[2]
    		for (k = 3; k <= n; k++)
    		  Replacement[parts[1]]= Replacement[parts[1]] " " parts[k]
    	    }
    	    close(file)
    	}
        }
        else	      # load default table of English suffix regexps
        {
    	split("'$ 's$ ed$ edly$ es$ ing$ ingly$ ly$ s$", parts)
    	for (k in parts)
    	{
    	    Suffixes[parts[k]] = 1
    	    Replacement[parts[k]] = ""
    	}
        }
    }
    

    order_suffixes

    function order_suffixes(        i, j, key)
    {
        # Order suffixes by decreasing length
        NOrderedSuffix = 0
        for (key in Suffixes)
    	OrderedSuffix[++NOrderedSuffix] = key
        for (i = 1; i < NOrderedSuffix; i++)
    	for (j = i + 1; j <= NOrderedSuffix; j++)
    	    if (length(OrderedSuffix[i]) < length(OrderedSuffix[j]))
    		swap(OrderedSuffix, i, j)
    }
    

    report_exceptions

    function report_exceptions(        key, sortpipe)
    {
        sortpipe = Verbose ? "sort -f -t: -u -k1,1 -k2n,2 -k3" : "sort -f -u -k1"
        for (key in Exception)
            print Exception[key] | sortpipe
        close(sortpipe)
    }
    

    scan_options

    function scan_options(        k)
    {
        for (k = 1; k < ARGC; k++)
        {
    	if (ARGV[k] == "-strip")
    	{
    	    ARGV[k] = ""
    	    Strip = 1
    	}
    	else if (ARGV[k] == "-verbose")
    	{
    	    ARGV[k] = ""
    	    Verbose = 1
    	}
    	else if (ARGV[k] ~ /^=/)	# suffix file
    	{
    	    NSuffixFiles++
    	    SuffixFiles[substr(ARGV[k], 2)]++
    	    ARGV[k] = ""
    	}
    	else if (ARGV[k] ~ /^[+]/)	# private dictionary
    	{
    	    DictionaryFiles[substr(ARGV[k], 2)]++
    	    ARGV[k] = ""
    	}
        }
    
        # Remove trailing empty arguments (for nawk)
        while ((ARGC > 0) && (ARGV[ARGC-1] == ""))
            ARGC--
    }
    

    spell_check_line

    function spell_check_line(        k, word)
    {
        ## for (k = 1; k <= NF; k++) print "DEBUG: word[" k "] = \"" $k "\""
        gsub(NonWordChars, " ")		# eliminate nonword chars
        for (k = 1; k <= NF; k++)
        {
    	word = $k
    	sub("^'+", "", word)		# strip leading apostrophes
    	sub("'+$", "", word)		# strip trailing apostrophes
    	if (word != "")
    	    spell_check_word(word)
        }
    }
    

    spell_check_word

    function spell_check_word(word,        key, lc_word, location, w, wordlist)
    {
        lc_word = tolower(word)
        ## print "DEBUG: spell_check_word(" word ") -> tolower -> " lc_word
        if (lc_word in Dictionary)		# acceptable spelling
    	return
        else				# possible exception
        {
    	if (Strip)
    	{
    	    strip_suffixes(lc_word, wordlist)
    	    ## for (w in wordlist) print "DEBUG: wordlist[" w "]"
    	    for (w in wordlist)
    		if (w in Dictionary)
    		    break
    	    if (w in Dictionary)
    		return
    	}
    	## print "DEBUG: spell_check():", word
    	location = Verbose ? (FILENAME ":" FNR ":") : ""
    	if (lc_word in Exception)
    	    Exception[lc_word] = Exception[lc_word] "\n" location word
    	else
    	    Exception[lc_word] = location word
        }
    }
    

    strip_suffixes

    function strip_suffixes(word, wordlist,        ending, k, n, regexp)
    {
        ## print "DEBUG: strip_suffixes(" word ")"
        split("", wordlist)
        for (k = 1; k <= NOrderedSuffix; k++)
        {
    	regexp = OrderedSuffix[k]
    	## print "DEBUG: strip_suffixes(): Checking \"" regexp "\""
    	if (match(word, regexp))
    	{
    	    word = substr(word, 1, RSTART - 1)
    	    if (Replacement[regexp] == "")
    		wordlist[word] = 1
    	    else
    	    {
    		split(Replacement[regexp], ending)
    		for (n in ending)
    		{
    		    if (ending[n] == "\"\"")
    			ending[n] = ""
    		    wordlist[word ending[n]] = 1
    		}
    	    }
    	    break
    	}
        }
         ## for (n in wordlist) print "DEBUG: strip_suffixes() -> \"" n "\""
    }
    

    swap

    function swap(a, i, j,        temp)
    {
        temp = a[i]
        a[i] = a[j]
        a[j] = temp
    }
    

    Author

    Arnold Robbins and Nelson H.F. Beebe in "Classic Shell Scripting", O'Reilly Books


    categories: Nov,2009,Spell,GregoryG

    spellcheck.awk

    Contents

    Author

    (For the original version of this code, see http://feedback.exalead.com/feedbacks/191466-spell-checking.)

    Peter Norvig of Google describes "How to Write a Spelling Corrector" at http://norvig.com/spell-correct.html. He gave a Python solution and points to a number of other implementations. I saw that one was missing for awk/gawk, so here it is. It uses the "big.txt" file found at http://norvig.com/big.txt.

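    function words(text,    line) {
       # "> 0" stops the loop at end of file and also when the file cannot be read
       while ((getline line < text) > 0) {
          line = tolower(line)
          while (match(line, /[a-z]+/)) {
             NWORDS[substr(line, RSTART, RLENGTH)]++
             line = substr(line, RSTART + RLENGTH)
          }
       }
       close(text)
    }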
    BEGIN { words("big.txt"); } 
    
    BEGIN { alph="abcdefghijklmnopqrstuvwxyz"; 
          for(i=1;i<=26;i++) 
             alphabet[substr(alph,i,1)]++ }
    
    function edits1 (word,set) {
       n = length(word); 
       delete set;
       for (i=1;i<=n+1;i++) {
        if(i<=n) # deletion 
          set[substr(word,1,i-1)""substr(word,i+1)]++; 
        if(i<n)  # transposition
         set[substr(word,1,i-1)""substr(word,i+1,1)""substr(word,i,1)""substr(word,i+2)]++; 
        if(i<=n) 
          for (c in alphabet)  # alteration
             set[substr(word,1,i-1)""c""substr(word,i+1)]++; 
        for (c in alphabet)    # insertion: runs for every i, including i == n+1
           set[substr(word,1,i-1)""c""substr(word,i)]++; } 
    }
    function known_edits2(oneChange,twoChanges) { 
       delete twoChanges;
       for (e2 in oneChange) { 
          edits1(e2,set); 
          known(set,goods) ; 
          for (w in goods) { 
             twoChanges[w]=goods[w]}} 
    }
    function known(words,knowntable) { 
       delete knowntable; 
       found=0;
       for (w in words) 
          if(w in NWORDS) {
             found++; 
             knowntable[w]=NWORDS[w] }
       return (found) 
    }
    function maxtable(tab) { 
       maxval=0; 
       for(i in tab) { 
          if(tab[i]>maxval) {
             maxval=tab[i]; 
             max=i}} 
       return(max)
    }
    function correct(word) { 
       delete candidates; 
       candidates[word]=1;
       if( known(candidates,good) ) { }
       else {    edits1(word, candidates); 
             if ( known(candidates,good) ) { }
       else { known_edits2(candidates,candidates2); 
             if ( known(candidates2,good) ) { }
       else {    delete good; 
             good[word]=1;}}}
       print maxtable(good);
    }
    
    # correct, one word per line
    { gsub(" ",""); 
      correct(tolower($0)) }
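
    To get a feel for how many candidates a single edit step produces, the counts can be worked out from the loops in edits1: for a word of length n there are n deletions, n-1 transpositions, 26n alterations and 26(n+1) insertions, or 54n+25 strings before duplicates are merged. The sketch below checks that arithmetic; count_edits1() is an illustrative helper, not part of the program above, and it walks the alphabet by index instead of using the alphabet array.

```awk
# Sketch: count the candidate strings that an edits1-style generator
# produces for a word of length n, before duplicates are merged.
# count_edits1() is an illustrative helper, not part of the program above.
function count_edits1(word, seen,    n, i, k, c, total, alph) {
    alph = "abcdefghijklmnopqrstuvwxyz"
    n = length(word)
    total = 0
    split("", seen)                      # portable way to empty an array
    for (i = 1; i <= n + 1; i++) {
        if (i <= n) {                    # deletion
            seen[substr(word, 1, i-1) substr(word, i+1)]++; total++
        }
        if (i < n) {                     # transposition
            seen[substr(word, 1, i-1) substr(word, i+1, 1) substr(word, i, 1) substr(word, i+2)]++; total++
        }
        for (k = 1; k <= 26; k++) {
            c = substr(alph, k, 1)
            if (i <= n) {                # alteration
                seen[substr(word, 1, i-1) c substr(word, i+1)]++; total++
            }
            seen[substr(word, 1, i-1) c substr(word, i)]++; total++   # insertion
        }
    }
    return total
}

BEGIN {
    word = "awk"
    total = count_edits1(word, seen)
    distinct = 0
    for (w in seen) distinct++
    printf "%s: %d generated (54n+25 = %d), %d distinct\n", word, total, 54 * length(word) + 25, distinct
}
```

    The distinct count is smaller than the generated count because different edits can yield the same string; the set-style array in edits1 merges those automatically.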
    

    Author

    Gregory Grefenstette, Nov 24, 2008


    categories: Yawk,Awk100,Feb,2009,WolfganZ

    Yawk

    Purpose

    Run a WIKI using Gawk.

    Download

    Download from LAWKER or Wolfgan Zekol's web site.

    Url

    For a live demo, see the Yawk home page.

    Developers

    Wolfgan Zekol.

    Domain

    Web application.

    Contact

    Wolfgan Zekol.

    Email

    dag@awk-scripting.de

    Description

    Yawk is "yet another wiki klone", one among many. Yawk was written because the available wikis were missing some formatting capabilities, used strange formatting rules (and you might not like mine), or imposed too many requirements for setting up a wiki (a MySQL database installation, with or without PHP installed).

    Awk

    Gawk 3.1.4 or later.

    Platform

    CGI

    Lines

    6000 lines.

    Current

    3=Released.

    Use

    3=Free/public domain.

    DateDeployed

    2004

    Dated

    2009


    categories: AwkLisp,Awk100,Feb,2009,DariusB

    AwkLisp

    Purpose

    Code up a LISP/Scheme interpreter in Awk.

    For more details..

    See awklisp.

    Developers

    1

    Domain

    Domain-specific language.

    Contact

    Darius Bacon dairus@wry.me

    Email

    dairus@wry.me

    Description

    At my previous job I had to use MapBasic, an interpreter so astoundingly slow (around 100 times slower than GWBASIC) that one must wonder if it itself is implemented in an interpreted language. I still wonder, but it clearly could be: a bare-bones Lisp in awk, hacked up in a few hours, ran substantially faster.

    Awk

    Awk/Gawk

    Lines

    350

    Current

    1=Prototype

    Use

    1=Personal use.

    DateDeployed

    1994

    Dated

    2009


    categories: Name,Awk100,Feb,2009,BillP

    Name

    Not a single program.

    Purpose

    Generate TeX code for a bilingual dictionary from a flat file database. This system has been used to generate multiple editions of dictionaries for several dialects of Carrier, the endangered language of a large portion of the central interior of British Columbia.

    Developers

    Bill Poser

    Organization

    Country

    Canada

    Domain

    linguistics - dictionary publishing

    Contact

    Bill Poser

    Email

    billposer@alum.mit.edu

    Description

    A dictionary database consists of four flat files containing records in which fields are identified by tags, in a format isomorphic to Standard Dictionary Format. The four files contain: main entries, example sentences with translations, verb roots, verb stems. This provides a modest degree of relativization. Awk scripts controlled by a makefile do the bulk of the work of generating TeX code for printing dictionaries containing front matter, a Carrier-English section, an English-Carrier section, a topical index, an alphabetical root list, a list of roots sorted by English gloss, an alphabetical list of verb stems, a list of verb stems sorted by root, an alphabetical list of affixes, a list of affixes sorted by English gloss, a list of scientific names, a list of placenames, and credits for illustrations.

    Awk

    gawk

    Shell

    The awk scripts are executed from a make file.

    Platform

    GNU/Linux on x86.

    Uses

    The awk scripts are executed from a makefile by GNU make. The other program used extensively is the sort utility msort.

    Lines

    5500

    DevelopmentEffort

    The first usable version took no more than a day (plus the time to create the TeX template into which the generated code is inserted).

    MaintenanceEffort

    Pure maintenance due to changes in environment, bit rot, etc. has been just about nil. The effort devoted to adding features is very difficult to estimate, as it has taken place at irregular intervals over a period of 15 years.

    Current

    3=Released, I guess. The code is mature but not really released since the author is the only one who normally uses it.

    Use

    1=Personal use

    Users

    1

    DateDeployed

    June 1993.

    References

    A paper describing these databases and the process for generating dictionaries from them is available: Lexical Databases for Carrier

    Url

    Some information about the resulting dictionaries: http://www.ydli.org/products/dicts.htm


    categories: Top10,Boris,Awk100,Feb,2009,Ronl

    Boris

    Purpose

    Demonstration to DoD of a clustering algorithm suitable for streaming data.

    Source code

    gawk/awk100/boris

    Live demo

    http://www.cse.wustl.edu/~loui/boris.cgi.

    Developers

    Ronald Loui and a programmer named Boris.

    Organization

    Washington University in St. Louis, CS Dept.

    Country

    USA

    Domain

    This is an evolutionary algorithm and visualization of a clustering algorithm that could be turned from O(n^4) to O(nlogn) with a few judicious uses of constants. Later developments added other interactive devices, including progress meters and mouse-and-click behavior.

    Contact

    Ronald Loui

    Email

    r.p.loui@gmail.com

    Description

    The code is an excellent example of the power of Awk as a prototyping tool: after getting the code running, with the least development time, a quirk was observed in the code that allowed a reduction from O(n^4) to O(nlogn).

    • Two of the n's are lost (n^2) by noticing that when there is a swap, the delta in the scoring function falls off by the squared distance from the point of a swap. So if you just set a constant, such as 10 or 20, or 100, based on the expected size of your clusters, then you can stop calculating the scoring function when you get past that constant.
    • The other n comes from either fixing the size of the matrix, and occasionally flushing new candidates in and out, or else by sampling over a subset of the n when you calculate the score.
    • The nlogn remains because there is a sort every now and then.
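
    The first of these savings can be illustrated with a toy model. The sketch below is not the Boris scoring function: it simply assumes that a position at distance d from the swap contributes 1/d^2 to the change in score, and compares summing over every position with stopping at a fixed cutoff. The function delta(), the unit weights and the numbers in the demo are all hypothetical.

```awk
# Sketch under assumptions: a position at distance d from a swap at
# position p contributes 1/d^2 to the change in score. This model and
# all names here are hypothetical; they only illustrate the cutoff idea.
function delta(p, n, cutoff,    d, sum) {
    sum = 0
    for (d = 1; d <= cutoff; d++) {
        if (p - d >= 1) sum += 1 / (d * d)   # neighbour on the left
        if (p + d <= n) sum += 1 / (d * d)   # neighbour on the right
    }
    return sum
}

BEGIN {
    n = 10000; p = 5000
    full  = delta(p, n, n)      # looks at every distance: work grows with n
    quick = delta(p, n, 20)     # looks at 20 distances: constant work per swap
    printf "full=%.4f truncated=%.4f missed=%.4f\n", full, quick, full - quick
}
```

    Under this model the terms past the cutoff add little, which is the property the bullet above relies on; whether a cutoff of 10, 20 or 100 is adequate depends on the expected cluster size, as the text says.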

    Awk

    Gawk

    Platform

    Intended for fast servers, 1+ ghz.

    Uses

    Html.

    Lines

    158.

    Development Effort

    One weekend.

    Maintenance Effort

    None.

    Current

    2=Evaluation.

    Use

    2=in-House use.

    Users

    5

    DateDeployed

    2004.

    Dated

    Feb 2009.

    References

    Looks, M.; Levine, A.; Covington, G.A.; Loui, R.P.; Lockwood, J.W.; Cho, Y.H. Streaming Hierarchical Clustering for Concept Mining. IEEE Aerospace Conference, 3-10 March 2007, pages 1-12. Digital Object Identifier 10.1109/AERO.2007.352792


    categories: WWW,Awk100,Jan,2009,PeterK

    Get_YouTube_Vids

    Purpose

    Download videos from youtube.

    Source code

    gawk/www/get_youtube_vids.awk

    Developers

    Peter Krumin: Downloading YouTube Videos With Gawk

    Domain

    World wide web, slurping, file sharing.

    Contact

    Peter Krumin

    Description

    How to download YouTube videos.

    Awk

    Gawk

    Lines

    331 lines

    Current

    3=Released

    Use

    1=Personal use

    DateDeployed

    July 2007

    Dated

    Sat Feb 21 19:46:10 EST 2009

    Url

    Downloading YouTube Videos With Gawk


    categories: Sudoku,Awk100,Jan,2009,Jimh

    sudoku

    This is an Awk 100 program.

    Submitted by

    Jim Hart

    Purpose

    Solve sudoku puzzles using the same strategies as a person would, not by brute force.

    Source

    gawk/awk100/sudoku

    Developers

    Jim Hart

    Country

    US

    Domain

    command line games

    Contact

    Jim Hart

    Email

    jhart50@gmail.com

    Description

    see Purpose

    AWK versions

    gawk

    Platform

    Mac OS X, PowerPC

    Lines

    529

    Development Effort

    1

    Maintenance Effort

    0

    Date Deployed

    /2006


    categories: Negotiate,Awk100,Jan,2009,Ronl

    Anne's Negotiation Game

    An Awk100 program.

    Purpose

    Research on a model of negotiation incorporating search, dialogue, and changing expectations

    Source code

    See gawk/awk100/negotiate.

    Developers

    Ronald Loui (programmer and designer), Anne Jump (adversary)

    Organization

    National Science Foundation grant at Washington University in St. Louis

    Country

    USA

    Domain

    Prototype of a new idea for cognitive modelling (in artificial intelligence/economics/organizational behavior)

    Contact

    Ronald P. Loui

    Email

    r.p.loui@gmail.com

    Description

    Program generates a game board upon which players take turn searching or declaring according to a protocol. It is based on the same game bimatrix made famous by people like von Neumann and Nash, but invents a new approach to negotiation based on process instead of solution.

    Awk

    Was written for gawk in 1997 but should run on almost any awk dialect

    Platform

    Was written on Redhat Linux with multiple hardware platforms in mind

    Uses

    Was intended to be self-contained

    Lines

    658 lines, of which 39 are comments

    DevelopmentEffort

    One day, 6-8 hours total

    MaintenanceEffort

    Two revisions are available, mainly to permit programs to negotiate instead of humans, and to provide a web-based dashboard to monitor the events

    CurrentStatus

    2=Evaluation

    Use

    2=in-House use

    Users

    50 students in artificial intelligence project classes had to use some version of this code over three years

    DateDeployed

    October 1997

    Dated

    January 2008

    References

    There is a draft article (unpublished), and several talks, e.g.

    The paper in Harper and Wheeler, Probability and Inference: Essays in Honour of Henry E. Kyburg Jr. (Paperback), Publisher: College Publications (23 April 2007) ISBN-10: 1904987184 ISBN-13: 978-1904987185 also refers to the theory implemented here. Diana Moore's thesis on negotiation and draft article http://citeseer.ist.psu.edu/11983.html contains some precursor ideas.

    Url

    http://www.cs.wustl.edu/~loui/313f97/anne4.expl.html


    categories: Baseballsim,Awk100,Jan,2009,Ronl

    Baseball sim

    This is an Awk 100 program.

    Purpose

    A quick and dirty baseball simulator for investigating the efficiency of batting lineups

    Source

    See gawk/awk100/baseballsim.

    Developers

    Ronald P. Loui

    Organization

    Washington University in St. Louis

    Country

    USA

    Domain

    Research/Decision Support

    Contact

    Ronald P. Loui

    Email

    r.p.loui@gmail.com

    Description

    This was written for the AI course, and for several investigations, including the determination of whether it is a good idea to bat the pitcher in the 8th spot. One hypothesis that emerges from this program that deserves further study is that the most potent offense is one that spreads rather than concentrates the batting threats.

    Awk

    Gawk around 2002

    Platform

    Linux around 2002

    Uses

    None

    Lines

    409

    DevelopmentEffort

    Approximately one day

    MaintenanceEffort

    Further simulators were developed for improved domain modeling and for successive addition of functionality; no other code maintenance was required.

    CurrentStatus

    1=Prototype

    Use

    1=Personal use

    Users

    About 50 students used this program over three years in AI classes, and two undergraduate theses and one Master's thesis on evolutionary computing made use of this simulator.

    DateDeployed

    October 2002

    Dated

    January 2009

    References

    None, but see Tony LaRussa's comments on batting order while managing the St. Louis Cardinals


    categories: Argcol,Awk100,Jan,2009,Ronl

    Argcol

    An Awk100 program.

    Purpose

    A tool inspired by fmt that could be used while working in vi to maintain a multi-column pro-con argument format.

    Source code

    See gawk/awk100/argcol.

    Developers

    Mark Foltz, Ronald Loui, Thieu Dang, Jeremy Frens

    Organization

    Washington University in St. Louis

    Country

    USA

    Domain

    Application/text support for text editor.

    Contact

    Ronald Loui

    Email

    r.p.loui@gmail.com

    Awk

    Gawk circa 1994, Solaris and MS-DOS-based awk such as mawk.

    Platform

    Solaris and MS-DOS

    Uses

    Vi and variants such as stevie.

    Lines

    278

    DevelopmentEffort

    One week.

    MaintenanceEffort

    No maintenance, eventually rewritten as cgi/web program in Room5 project.

    Current

    4=No longer supported

    Use

    3=Free/public domain

    Users

    2

    DateDeployed

    May 1994

    Dated

    Jan 2009

    References

    Progress on Room 5: a testbed for public interactive semi-formal legal argumentation. Proceedings of the 6th International Conference on Artificial Intelligence and Law, Melbourne, Australia, 1997, pages 207-214. ISBN 0-89791-924-6


    categories: Xgawk,XML,May,2009,JurgenK

    XML Well-Formedness

    (This page comes from the XML Gawk tutorial.)

    One of the advantages of using the XML format for storing data is that there are formalized methods of checking correctness of the data. Whether the data is written by hand or generated automatically, it is always advantageous to have tools for finding out if the new data obeys certain rules (is a tag misspelt? another one missing? a third one in the wrong place?).

    These mechanisms for checking correctness are applied at different levels. The lowest level is well-formedness. The next higher levels of correctness checking are the DTD and (even higher, but not yet required by standards) the Schema. If you have a DTD (or Schema) specification for your XML file, you can hand it over to a validation tool, which applies the specification, checks for conformance and tells you the result. A simple tool for validation against a DTD is xmllint, which is part of libxml and therefore installed on most GNU/Linux systems. Validation against a Schema can be done with more recent versions of xmllint or with the xsv tool.

    There are two reasons why validation is currently not incorporated into the gawk interpreter.

    1. Validation is not trivial and only DTD-validation has reached a proper level of standardization, support and stability.
    2. We want a tool that can process all well-formed XML files, not just a tool for processing clean data. A good tool is one that you can rely on and use for fixing problems. What would you think of a car that refused to drive outside just because there is some mud on the street and the sun isn't shining?

    Here is a script for testing well-formedness of XML data. The real work of checking well-formedness is done by the XML parser incorporated into gawk. We are only interested in the result and some details for error diagnostics and recovery.
         @load xml
         END {
           if (XMLERROR)
             printf("XMLERROR '%s' at row %d col %d len %d\n",
                     XMLERROR, XMLROW, XMLCOL, XMLLEN)
           else
             print "file is well-formed"
         }
    

    As usual, the script starts by switching gawk into XML mode. We are not interested in the content of the nodes being traversed, so no action is triggered for a node. Only at the end (when the XML file is already closed) do we look at some variables reporting success or failure. If the variable XMLERROR ever contains anything other than 0 or the empty string, there is a parsing error and the parser stops tree traversal at the place where the error is. An explanatory message is contained in XMLERROR (its content depends on the specific parser used on this platform). The other variables in the example contain the line number and the column at which the XML file is badly formed.
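
    To make "well-formed" a little more concrete without the XML extension, here is a toy nesting check in plain awk. It is a sketch of one aspect only (every closing tag must match the most recently opened tag) and is in no way a substitute for the parser built into gawk's XML mode: it ignores CDATA sections, entities, and '>' characters inside comments or attribute values. The name well_nested() is hypothetical.

```awk
# Sketch: a toy nesting check using a stack of tag names. It is NOT an
# XML parser; the real check in the script above is done by the parser
# built into gawk's XML mode. well_nested() is a hypothetical helper.
function well_nested(text,    depth, stack, tag, name) {
    depth = 0
    while (match(text, /<[^>]*>/)) {
        tag  = substr(text, RSTART + 1, RLENGTH - 2)   # text between < and >
        text = substr(text, RSTART + RLENGTH)
        if (tag ~ /^[?!]/ || tag ~ /\/$/)              # declaration, comment or <empty/>
            continue
        if (tag ~ /^\//) {                             # closing tag
            name = substr(tag, 2)
            gsub(/[ \t]/, "", name)
            if (depth == 0 || stack[depth] != name)
                return 0
            depth--
        } else {                                       # opening tag
            name = tag
            sub(/[ \t].*$/, "", name)                  # drop attributes
            stack[++depth] = name
        }
    }
    return depth == 0
}

BEGIN {
    print well_nested("<book><title>Hello</title></book>")   # properly nested
    print well_nested("<book><title>Hello</book></title>")   # tags cross over
}
```

    The function returns 1 for properly nested input and 0 otherwise, so the demo prints 1 followed by 0.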

    Author

    Jurgen Kahrs

    Copyright

    Copyright (C) 2000, 2001, 2002, 2004, 2005, 2006, 2007 Free Software Foundation, Inc.

    Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with the Invariant Sections being "GNU General Public License", the Front-Cover texts being (1) (see below), and with the Back-Cover Texts being (2) (see below).

    1. A GNU Manual
    2. You have freedom to copy and modify this GNU Manual, like GNU software. Copies published by the Free Software Foundation raise funds for GNU development.

    categories: XML,June,2009,JurgenK

    Dealing with DTDs

    (This page comes from the XML Gawk tutorial.)

    The declaration of a document type in the header of an XML file is an optional part of the data, not a mandatory one. If such a declaration is present, the reference to the DTD will not be resolved and its contents will not be parsed. However, the presence of the declaration will be reported by gawk. When the declaration starts, the variable XMLSTARTDOCT contains the name of the root element's tag; and later, when the declaration ends, the variable XMLENDDOCT is set to 1. In between, the array variable XMLATTR will be populated with the values of the public identifier of the DTD (if any) and the value of the system's identifier of the DTD (if any). Other parts of the declaration (elements, attributes and entities) will not be reported.

         @load xml
         XMLDECLARATION {
           version    = XMLATTR["VERSION"        ]
           encoding   = XMLATTR["ENCODING"       ]
           standalone = XMLATTR["STANDALONE"     ]
         }
         XMLSTARTDOCT {
           root       = XMLSTARTDOCT
           pub_id     = XMLATTR["PUBLIC"         ]
           sys_id     = XMLATTR["SYSTEM"         ]
           intsubset  = XMLATTR["INTERNAL_SUBSET"]
         }
         XMLENDDOCT {
           print FILENAME
           print "  version    '" version    "'"
           print "  encoding   '" encoding   "'"
           print "  standalone '" standalone "'"
           print "  root   id  '" root   "'"
           print "  public id  '" pub_id "'"
           print "  system id  '" sys_id "'"
           print "  intsubset  '" intsubset "'"
           print ""
           version = encoding = standalone = ""
           root = pub_id = sys_id = intsubset = ""
         }
    

    Most users can safely ignore these variables if they are only interested in the data itself. But some users may take advantage of these variables for checking requirements of the XML data. If your database consists of thousands of XML files of diverse origins, the public identifiers of their DTDs will help you gain an overview of the kind of data you have to handle and of potential version conflicts. The script shown above will assist you in analyzing your data files. It watches for the variables mentioned above and evaluates their content. At the start of the DTD, the tag name of the root element is stored; the identifiers are also stored, and finally those values are printed along with the name of the file that was analyzed. After each DTD, the remembered values are set to an empty string until the DTD of the next file arrives.

    In the following, you can see an example output of the script shown above. Obviously, the first entry is a DocBook file (English version 4.2) containing a book element which has to be validated against a local copy of the DTD at CERN in Switzerland. The second file is a chapter element of DocBook (English version 4.1.2) to be validated against a DTD on the Internet. Finally, the third entry is a file describing a project of the GanttProject application. There is only a tag name for the root element specified, a DTD does not seem to exist.

         data/dbfile.xml
           version    ''
           encoding   ''
           standalone ''
           root   id  'book'
           public id  '-//OASIS//DTD DocBook XML V4.2//EN'
           system id  '/afs/cern.ch/sw/XML/XMLBIN/share/www.oasis-open.org/docbook/xmldtd-4.2/docbookx.dtd'
           intsubset  ''
         
         data/docbook_chapter.xml
           version    ''
           encoding   ''
           standalone ''
           root   id  'chapter'
           public id  '-//OASIS//DTD DocBook XML V4.1.2//EN'
           system id  'http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd'
           intsubset  ''
         
         data/exampleGantt.gan
           version    '1.0'
           encoding   'UTF-8'
           standalone ''
           root   id  'ganttproject.sourceforge.net'
           public id  ''
           system id  ''
           intsubset  ''
    

    You may wish to make changes to this script if you need it in daily work. For example, the script currently reports nothing for files which have no DTD declaration in them. You can easily change this by appending an action for the END rule which reports in case all the variables root, pub_id and sys_id are empty. As it is, the script parses the entire XML file, although the DTD is always positioned at the top, before the root element. Parsing the root element is unnecessary and you can improve the speed of the script significantly if you tell it to stop parsing when the first element (the root element) comes in.

      XMLSTARTELEM { nextfile } 
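
    For readers who want to experiment with the three values the DTD script records (root element name, public identifier, system identifier) without the XML extension, here is a sketch in plain awk. It is a hypothetical stand-in that uses ordinary string functions on a single declaration, handles only double-quoted identifiers, and does none of the parsing that gawk's XML mode does; parse_doctype() is an invented name.

```awk
# Sketch: pull root name, public id and system id out of one DOCTYPE
# declaration with plain string functions. parse_doctype() is a
# hypothetical helper and handles only double-quoted identifiers.
function parse_doctype(decl, out,    n, q) {
    split("", out)                      # start with an empty result
    if (!match(decl, /<!DOCTYPE[ \t\n]+[^ \t\n>]+/))
        return 0
    out["root"] = substr(decl, RSTART, RLENGTH)
    sub(/^<!DOCTYPE[ \t\n]+/, "", out["root"])
    decl = substr(decl, RSTART + RLENGTH)
    n = split(decl, q, "\"")            # q[2] and q[4] hold the quoted strings
    if (decl ~ /^[ \t\n]*PUBLIC/) {
        out["public"] = q[2]
        out["system"] = q[4]
    } else if (decl ~ /^[ \t\n]*SYSTEM/)
        out["system"] = q[2]
    return 1
}

BEGIN {
    decl = "<!DOCTYPE chapter PUBLIC \"-//OASIS//DTD DocBook XML V4.1.2//EN\" \"http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd\">"
    if (parse_doctype(decl, d))
        printf "root '%s'\npublic id '%s'\nsystem id '%s'\n", d["root"], d["public"], d["system"]
}
```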
    

    Author

    Jurgen Kahrs

    Copyright

    Copyright (C) 2000, 2001, 2002, 2004, 2005, 2006, 2007 Free Software Foundation, Inc.

    Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with the Invariant Sections being "GNU General Public License", the Front-Cover texts being (1) (see below), and with the Back-Cover Texts being (2) (see below).

    1. A GNU Manual
    2. You have freedom to copy and modify this GNU Manual, like GNU software. Copies published by the Free Software Foundation raise funds for GNU development.

    categories: Xgawk,XML,May,2009,JurgenK

    Printing an Outline of an XML file

    (This page comes from the XML Gawk tutorial.)

    When working with XML files, it is sometimes necessary to get an overview of the structure of an XML file. Ordinary editors confront us with a view that is not so pretty. For example:

         
         <book id="hello-world" lang="en">
         
         <bookinfo>
         <title>Hello, world</title>
         </bookinfo>
    
         
         <chapter id="introduction">
         <title>Introduction</title>
         
         <para>This is the introduction. It has two sections</para>
         
         <sect1 id="about-this-book">
         <title>About this book</title>
    
         
         <para>This is my first DocBook file.</para>
         
         </sect1>
         
         <sect1 id="work-in-progress">
         <title>Warning</title>
         
         <para>This is still under construction.</para>
    
         
         </sect1>
         
         </chapter>
         </book>
    

    Software developers are used to reading text files with proper indentation like this:

         book lang='en' id='hello-world'
           bookinfo
             title
           chapter id='introduction'
             title
             para
             sect1 id='about-this-book'
               title
               para
             sect1 id='work-in-progress'
               title
               para
    

    Here, it is a bit harder to recognize hierarchical dependencies among the nodes. But proper indentation allows you to keep an overview of files with more than 100 elements (a purely graphical view of such large files gets unbearable).

    The outline tool produces such an indented output and we will now write a script that imitates this kind of output.

         @load xml
         XMLSTARTELEM {
           printf("%*s%s", 2*XMLDEPTH-2, "", XMLSTARTELEM)
           for (i=1; i<=NF; i++)
             printf(" %s='%s'", $i, XMLATTR[$i])
           print ""
         }
    

    For the first time, we don't just check if the XMLSTARTELEM variable contains a tag name, but we also print the name out, properly indented with a printf format statement (two blank characters for each indentation level).

    Note the use of the associative array XMLATTR. Whenever we enter a markup block (and XMLSTARTELEM is non-empty), the array XMLATTR contains all the attributes of the tag. You can find the value of an attribute by accessing the array with the attribute's name as the array index. In a well-formed XML file, all the attribute names of one tag are distinct, so we can be sure that each attribute has its own place in the array. The only thing left to do is to iterate over all the entries in the array and print each name and value in a formatted way. Earlier versions of this script iterated over the associative array with a for (i in XMLATTR) loop. Doing so is still an option, but in this case we wanted to make sure that attributes are printed in exactly the same order as in the original XML data. The exact order of attribute names is reproduced in the fields $1 .. $NF, so the for loop can iterate over the attribute names in the fields $1 .. $NF and print the attribute values XMLATTR[$i].
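
For readers without xgawk, the indentation idea can be approximated in plain awk. The following is only a rough sketch under stated assumptions: the input is simple, well-formed XML with no comments, no CDATA sections, and no ">" inside attribute values, and the sample document in the `xml` variable is made up for illustration. It prints element names only (no attributes) and is not a substitute for a real XML parser.

```shell
#!/bin/sh
# Hypothetical sample input: a shortened version of the DocBook file shown above.
xml='<book id="hello-world" lang="en">
<bookinfo>
<title>Hello, world</title>
</bookinfo>
<chapter id="introduction">
<title>Introduction</title>
</chapter>
</book>'

outline=$(printf '%s\n' "$xml" | awk '
{
    line = $0
    # Pick off one <...> tag at a time from the current line
    while (match(line, /<[^>]*>/)) {
        tag  = substr(line, RSTART + 1, RLENGTH - 2)
        line = substr(line, RSTART + RLENGTH)
        if (tag ~ /^\//)   { depth--; continue }   # closing tag: step back out
        if (tag ~ /^[?!]/) continue                # skip <?...?> and <!...>
        name = tag
        sub(/[ \t\/].*$/, "", name)                # drop attributes and a trailing slash
        for (i = 0; i < depth; i++) printf("  ")   # two blanks per level
        print name
        if (tag !~ /\/$/) depth++                  # <tag/> does not open a level
    }
}')
echo "$outline"
```

The depth counter plays the role that XMLDEPTH plays in the xgawk script; everything else that xgawk handles (attribute order, entities, comments) is deliberately left out here.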

    Author

    Jurgen Kahrs

    Copyright

    Copyright (C) 2000, 2001, 2002, 2004, 2005, 2006, 2007 Free Software Foundation, Inc.

    Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with the Invariant Sections being "GNU General Public License", the Front-Cover texts being (1) (see below), and with the Back-Cover Texts being (2) (see below).

    • A GNU Manual
    • You have freedom to copy and modify this GNU Manual, like GNU software. Copies published by the Free Software Foundation raise funds for GNU development.

    categories: Xgawk,XML,May,2009,JurgenK

    Pulling data from an XML file

    (This page comes from the XML Gawk tutorial.)

    In a procedural language, the software developer expects to determine the control flow within a program herself. She writes down what has to be done first, second, third and so on. In the pattern-action model of AWK, the novice software developer often has the oppressive feeling that

    • she is not in control
    • events seem to crackle down on her from nowhere
    • data flow seems chaotic and invariants don't exist
    • assertions seem impossible

    This feeling is characteristic of a whole class of programming environments. Most people would never think that the following programming environments have something in common, but they do. It is the absence of a static control flow that unites these environments under one roof:

    • In GUI frameworks like the X Window system, the main program is a trivial event loop – the main program does nothing but wait for events and invoke event-handlers.
    • In the Prolog programming language, the main program has the form of a query – and then the Prolog interpreter decides which rules to apply to solve the query.
    • When writing a compiler with the lex and yacc tools, the main program only invokes a function yyparse() and the exact control flow depends on the input source which controls invocation of certain rules.
    • When writing an XML parser with the Expat XML parser, the main program registers some callback handler functions and passes the XML source to the Expat parser; the detailed invocation of the callback functions depends on the XML source.
    • Finally, AWK's pattern-action model encourages writing scripts that have no main program at all.

    Within the context of XML, a terminology has been invented which distinguishes the procedural pull style from the event-guided push style. The script in the previous section was an example of a push-style script. Recognizing that most developers don't like their program's control flow to be pushed around, we will now present a script which pulls one item after the other from the XML file and decides what to do next in a more obvious way.

         @load xml
         BEGIN {
           while (getline > 0) {
             switch (XMLEVENT) {
               case "STARTELEM": {
                 printf("%*s%s", 2*XMLDEPTH-2, "", XMLSTARTELEM)
                 for (i=1; i<=NF; i++)
                   printf(" %s='%s'", $i, XMLATTR[$i])
                 print ""
               }
             }
           }
         }
    
    

    One XML event after the other is pulled out of the data with the getline command. It's like feeling each grain of sand pour through your fingers. Users who prefer this style of reading input will also appreciate another novelty: the variable XMLEVENT. While the push-style script on another page used the event-specific variable XMLSTARTELEM to detect the occurrence of a new XML element, our pull-style script always looks at the value of the same universal variable XMLEVENT to detect a new XML element.

    Formally, we have a script that consists of one BEGIN pattern followed by an action which is always invoked. This is a corner case of the pattern-action model, reduced so far that its essence has disappeared. Instead of the patterns you now see the cases of a switch statement, embedded in a while loop (for reading the file item by item). We now have explicit conditionals instead of the implicit ones we used formerly. The actions invoked within the case conditions are the same ones we have seen in the push approach.
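
The same contrast between push and pull can be seen without the XML extension. The following is a minimal sketch in plain awk, assuming a small made-up input file (the temporary file and its "start"/"text" records are hypothetical and only stand in for XML events). The first program relies on awk's implicit main loop; the second pulls each record itself with getline inside a single BEGIN rule.

```shell
#!/bin/sh
# Hypothetical sample data, created only for this illustration.
tmp=$(mktemp)
printf 'start book\ntext hello\nstart chapter\n' > "$tmp"

# Push style: the implicit main loop hands each record to the matching rule.
push=$(awk '$1 == "start" { print "element " $2 }' "$tmp")

# Pull style: one BEGIN rule reads the records itself and decides what to do.
pull=$(awk 'BEGIN {
    while ((getline) > 0) {        # plain getline sets $0 and NF from the input file
        if ($1 == "start")
            print "element " $2
    }
}' "$tmp")

echo "$push"
echo "$pull"
rm -f "$tmp"
```

Both programs print the same two lines; only the place where the control flow lives differs.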

    Author

    Jurgen Kahrs

    Copyright

    Copyright (C) 2000, 2001, 2002, 2004, 2005, 2006, 2007 Free Software Foundation, Inc.

    Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with the Invariant Sections being "GNU General Public License", the Front-Cover texts being (1) (see below), and with the Back-Cover Texts being (2) (see below).

    • A GNU Manual
    • You have freedom to copy and modify this GNU Manual, like GNU software. Copies published by the Free Software Foundation raise funds for GNU development.

    categories: XML,May,2009,MarkB

    xmldump

    Contents

    Displays components within a set of named XML files. With no options, displays the XML files much like the cat command. When options are supplied, displays only the selected components.

    Editor's note: for those who do not want to take the plunge into xgawk, dumpxml shows that standard Awk can handle XML. For a discussion of this file, see comp.lang.awk.

    Synopsis

    xmldump [-cdit] xmlfile ...

    Download

    This code requires awk and ksh. To download:

    wget  http://lawker.googlecode.com/svn/fridge/lib/ksh/dumpxml
    chmod +x dumpxml
    

    Description

    One reason I have a distinct loathing for XML, esp. in configuration files, is it's very difficult to parse (with line-based editors) and it's not very readable either. In my book, this breaks both of the fundamental tests for a useable configuration standard .... whoever first thought XML was a good idea for anything except document mark-up should be shot (steps off soap box before he gets lynched for posting off-topic).

    Anyway, personal grievances aside, here's a script I was forced to write, unhappy and at gun-point, to try and make some XML files I was dealing with more readable. This demonstrates how much work it takes in AWK just to parse the structure alone. This doesn't even take into consideration reading attribute values or parsing DTDs.

    The next person who thinks it's a good idea to write a configuration file in XML will have to personally answer to my wrath ........ perhaps I should set-up a new website banxml.org or xmlboycott.com with the sole intent to make the world see reason. Anyone with me? :-)

    Code

    Set up

    #!/bin/ksh
    CALL=$(basename "$0")
    USAGE="Syntax: $CALL [-cdit] xmlfile ..."
    

    DisplayXML()

    Displays selected components of a named XML file. Arguments:

    arg 1
    0 no doc content, 1 display doc content
    arg 2
    0 no tags, 1 display tags
    arg 3
    0 no comments, 1 display comments
    arg 4
    0 do not change indentation, 1 recalculate indents
    arg 5
    filename

    DisplayXML()
    {
        nawk -v shdoc=$1 -v shtags=$2 -v shcomm=$3 -v indent=$4 '
        {
            pushline=levhigh=0
    
            ### If indenting strip any leading blanks from input
            CloseFlags()
            if (indent && !comment) sub(/^[ \t]+/,"")   # leading blanks and tabs
    
            ### Strip carriage returns
            gsub("\\r","")
    
            ### Scan line one character at a time
            for (c=1;c<=length($0);c++)
            {
                CloseFlags()
                ReadChars()
                DisplayChars()
            }
    
            if (newline)
            {
                print ""
                newline=0
            }
        }
    
        function CloseFlags()
        {
            if (comment==2) comment=0       # close comment
            if (tag==2) tag=0               # close tag
            if (quotes==2) quotes=0         # close quote
        }
    
        function ReadChars()
        {
            ch=substr($0,c,1)
    
            if (!comment)
            {
                if (ch=="<" && substr($0,c,4)=="<!--")
                {
                    comment=1                       # opening comment
                    ch=substr($0,c,4)               # stretch chars
                    c+=3
                }
                else if (!tag && ch=="<")
                {
                    tag=1                            # opening tag
    
                    ### Increase or decrease indent depending
                    ### on tag style <tag> or </tag> 
                    ### but not <?tag?> or <!tag>
                    ch2=substr($0,c,2)
                    if (ch2=="</") level--
                    else if (ch2!="<?" && ch2!="<!")
                    {
                        level++
                        levhigh=1
                    }
                }
                else if (tag)
                {
                    if (!quotes && ch=="\"") quotes=1 # opening quote
                    else if (quotes && ch=="\"") quotes=2   # closing 
                    else if (!quotes && ch==">")
                    {
                        tag=2                   # closing tag
    
                        ### Catch <tag/> style where
                        ### indent level should not change
                        if (c>1 && substr($0,c-1,2)=="/>") level--
                    }
                }
            }
            else
            {
                if (ch=="-" && substr($0,c,3)=="-->")
                {
                    comment=2                 # closing comment
                    ch=substr($0,c,3)         # stretch chars
                    c+=2
                }
            }
        }
    
        function DisplayChars()
        {
            ### Work out whether to display this character or not
            dispch=0
            if (comment && shcomm) dispch=1
            if (tag && shtags) dispch=1
            if (!comment && !tag && shdoc) dispch=1
            if (dispch)
            {
                if (indent) IndentLine()
                printf("%s",ch)
                if (!newline) newline=1
            }
        }
    
        function IndentLine()
        {
            if (pushline || comment) return
            pushline=1
    
            ### Have begun processing first tag so indent level
            ### may already be one level too high
            if ((thislevel=(levhigh?level-1:level))<0) thislevel=0
            for (lev=0;lev<thislevel;lev++) printf("  ")
        }' "$5"
    
    }
    

    Start Up

    comments=0
    doc=0
    indent=0
    tags=0
    help=0
    
    while getopts cdit c
    do
        case $c in
            c) comments=1;;
            d) doc=1;;
            i) indent=1;;
            t) tags=1;;
            ?) help=1;;
        esac
    done
    shift $(($OPTIND - 1))
    

    Display help message

    if [ $help -eq 1 -o $# -eq 0 ]; then
        cat << EOF
    
    Displays components within a set of named XML files.
    With no options, displays the XML files much like the cat command.
    When options are supplied, displays only the selected components.
    
    $USAGE
    
    where   -c      displays comments
            -d      displays document contents
            -i      indent properly
            -t      displays tags
    
    EOF
        exit 2
    fi
    

    If no options supplied, then display entire XML files

    if [ $comments -eq 0 -a $doc -eq 0 -a $tags -eq 0 ]; then
        comments=1
        doc=1
        tags=1
    fi
    
    first=1
    while [ $# -gt 0 ]
    do
        if [ $first -eq 1 ]; then
             first=0
        else printf '\f\n'  ### form-feed (Ctrl+L) between files
        fi
    
        echo "<!-- --- $1 --- -->"
        DisplayXML $doc $tags $comments $indent "$1"
        shift
    done 
    

    Author

    Mark R.Bannister <markb at freedomware.co.uk>.


    categories: Sept,2009,Admin

    Ethiopian Multiplication

    Here is some Awk code from the Rosetta Code wiki that multiplies integers using only addition, doubling, and halving.

    How?

    1. Take two numbers to be multiplied and write them down at the top of two columns.
    2. In the left-hand column repeatedly halve the last number, discarding any remainders, and write the result below the last in the same column, until you write a value of 1.
    3. In the right-hand column repeatedly double the last number and write the result below. Stop when you add a result in the same row as where the left-hand column shows 1.
    4. Examine the table produced and discard any row where the value in the left column is even.
    5. Sum the values in the right-hand column that remain to produce the result of multiplying the original two numbers together.

    For example: 17 X 34

           17    34
    
    Halving the first column:
           17    34
            8
            4
            2
            1
    
    Doubling the second column:
           17    34
            8    68
            4   136 
            2   272
            1   544
    
    Strike-out rows whose first cell is even:
           17    34
            8    -- 
            4   --- 
            2   --- 
            1   544
    
    Sum the remaining numbers in the right-hand column:
           17    34
            8    -- 
            4   --- 
            2   --- 
            1   544
               ====
                578
    
    So 17 multiplied by 34, by the Ethiopian method, is 578.

    The task is to define three functions/methods/procedures/subroutines:

    1. one to halve an integer,
    2. one to double an integer, and
    3. one to state if an integer is even.

    Code

    function halve(x)  { return(int(x/2)) }
    function double(x) { return(x*2) }
    function iseven(x) { return((x%2) == 0) }
    
    # r is listed as an extra parameter so that it is local to the function
    function ethiopian(plier, plicand,    r) {
      r = 0
      while(plier >= 1) {
        if ( !iseven(plier) ) {
          r += plicand
        }
        plier = halve(plier)
        plicand = double(plicand)
      }
      return(r)
    }
    
    BEGIN { print ethiopian(17, 34) }
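
To see the table from the worked example take shape, the loop can print each row as it goes. This is only an illustrative variation on the code above, wrapped in a shell script so it can be run directly; the "kept"/"struck" labels and the `table` variable are additions made here and are not part of the Rosetta Code task.

```shell
#!/bin/sh
# Print every row of the halving/doubling table for 17 x 34
# and mark which rows contribute to the final sum.
table=$(awk '
function halve(x)  { return int(x / 2) }
function double(x) { return x * 2 }
function iseven(x) { return (x % 2) == 0 }

BEGIN {
    plier = 17; plicand = 34; sum = 0
    while (plier >= 1) {
        if (iseven(plier)) {
            mark = "struck"          # even left-hand value: row is discarded
        } else {
            mark = "kept"            # odd left-hand value: row is summed
            sum += plicand
        }
        printf("%4d %5d  %s\n", plier, plicand, mark)
        plier   = halve(plier)
        plicand = double(plicand)
    }
    printf("sum = %d\n", sum)
}')
echo "$table"
```

Only the rows for 17 and 1 are kept, so the sum is 34 + 544 = 578, matching the table above.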
    

    categories: Sept,2009,Admin

    A Tale of Two TAWKs

    In the Awk-verse, there are two TAWKs.

    TAWK #1 is the TAWK Compiler from Thompson Automation Software (no longer trading)

    • Is 100% compatible with Awk.
    • Generates executables
    • Comes with an interactive debugger
    • In some test cases, code that was written 4 to 15 times faster than in "C" runs as fast as "C", or better.

    TAWK #2 was an ultra-cut-down version of AWK written in C++ by Bruce Eckel in 1989. Eckel writes:

    • The program is called TAWK for "tiny awk," since the problem it solves is vaguely reminiscent of the "awk" pattern-matching language found on Unix (versions have also been created for DOS).
    • It demonstrates one of the thornier problems in computer science: parsing and executing a programming language.
    • The data-encapsulation features of C++ prove most useful here, and a recursive-descent technique is used to read arbitrarily long fields and records.

    categories: XML,May,2009,JanW

    getXML.awk

    Contents

    Synopsis

    gawk -f getXML.awk

    Download

    Download from LAWKER

    Example

    BEGIN {
        while ( getXML(ARGV[1],1) ) {
            print XTYPE, XITEM;
            for (attrName in XATTR)
                print "\t" attrName "=" XATTR[attrName];
        }
        if (XERROR) {
            print XERROR;
            exit 1;
        }
    }
    

    Details

    Main function: reads the next XML item into XTYPE, XITEM, XATTR.

    getXML( file, skipData ): 
    
    file
    path to xml file
    skipData
    flag: do not read "DAT" (data between tags) sections

    External variables:

    XTYPE
    type of item read, e.g. "TAG"(tag), "END"(end tag), "COM"(comment), "DAT"(data)
    XITEM
    value of item, e.g. tagname if type is "TAG" or "END"
    XATTR
    Map of attributes, only set if XTYPE=="TAG"
    XPATH
    Path to current tag, e.g. /TopLevelTag/SubTag1/SubTag2
    XLINE
    current line number in input file
    XNODE
    XTYPE, XITEM, XATTR combined into a single string
    XERROR
    error text, set on parse error

    Returns

    1
    on successful read: XTYPE, XITEM, XATTR are set accordingly
    ""
    at end of file or parse error, XERROR is set on error

    Private Data

    _XMLIO
    buffer, XLINE, XPATH for open files

    Code

    function getXML( file, skipData           \
    				,end,p,q,tag,att,accu,mline,mode,S0,ex,dtd) {
        XTYPE=XITEM=XERROR=XNODE=""; split("",XATTR);
        S0=_XMLIO[file,"S0"]; XLINE=_XMLIO[file,"line"]; 
    	XPATH=_XMLIO[file,"path"]; dtd=_XMLIO[file,"dtd"];
        while (!XTYPE) {
            if (S0=="") { if (1!=(getline S0 <file)) break; XLINE++; S0=S0 RS; }
            if ( mode == "" ) {
                mline=XLINE; accu=""; p=substr(S0,1,1);
                if ( p!="<" && !(dtd && p=="]") )         
    				mode="DAT";
                else if ( p=="]" ) 
    				{ S0=substr(S0,2);  mode="DTE"; end=">"; dtd=0; }
                else if ( substr(S0,1,4)=="<!--" ) 
    				{ S0=substr(S0,5);  mode="COM"; end="-->"; }
                else if ( substr(S0,1,9)=="<!DOCTYPE" ) 
                    { S0=substr(S0,10); mode="DTB"; end=">"; }
                else if ( substr(S0,1,9)=="<![CDATA[" ) 
                    { S0=substr(S0,10); mode="CDA"; end="]]>"; }
                else if ( substr(S0,1,2)=="<!" ) 
    				{ S0=substr(S0,3);  mode="DEC"; end=">"; }
                else if ( substr(S0,1,2)=="<?" ) 
    				{ S0=substr(S0,3);  mode="PIN"; end="?>"; }
                else if ( substr(S0,1,2)=="</" ) 
    				{ S0=substr(S0,3);  mode="END"; end=">";
                    tag=S0;sub(/[ \n\r\t>].*$/,"",tag);
    				S0=substr(S0,length(tag)+1);
                    ex=XPATH;sub(/\/[^\/]*$/,"",XPATH);
    				ex=substr(ex,length(XPATH)+2);
                    if (tag!=ex) { 
                       	XERROR="unexpected close tag <" ex ">..</" tag ">"; 
    					break; } }
                else{                                     
    				S0=substr(S0,2);  mode="TAG";
                    tag=S0;sub(/[ \n\r\t\/>].*$/,"",tag);
    				S0=substr(S0,length(tag)+1);
                    if ( tag !~ /^[A-Za-z:_][0-9A-Za-z:_.-]*$/ ) { 
                        XERROR="invalid tag name '" tag "'"; break; }
                    XPATH = XPATH "/" tag; } }
            else if ( mode == "DAT" ) {                            
                p=index(S0,"<"); 
    			if ( dtd && (q=index(S0,"]")) && (!p || q<p) ) p=q;
                if (p) {
                    if (!skipData) { XTYPE="DAT"; 
                           XITEM=accu unescapeXML(substr(S0,1,p-1)); }
                    S0=substr(S0,p); mode=""; }
                else{ if (!skipData) accu=accu unescapeXML(S0); S0=""; } }
            else if ( mode == "TAG" ) {   
    			sub(/^[ \n\r\t]*/,"",S0); if (S0=="") continue;
                if ( substr(S0,1,2)=="/>" ) {
                    S0=substr(S0,3); mode=""; XTYPE="TAG"; 
    				XITEM=tag; S0="</"tag">"S0; }
                else if ( substr(S0,1,1)==">" ) {
                    S0=substr(S0,2); mode=""; XTYPE="TAG"; XITEM=tag; }
                else{
                    att=S0; sub(/[= \n\r\t\/>].*$/,"",att); 
    				S0=substr(S0,length(att)+1); mode="ATTR";
                    if ( att !~ /^[A-Za-z:_][0-9A-Za-z:_.-]*$/ ) { 
                        XERROR="invalid attribute name '" att "'"; 
    					break; } } }
            else if ( mode == "ATTR" ) {  
    				sub(/^[ \n\r\t]*/,"",S0); if (S0=="") continue;
                if ( substr(S0,1,1)=="=" ) { S0=substr(S0,2); mode="EQ"; }
                else                       { XATTR[att]=att; mode="TAG"; 
                                             XNODE=XNODE att"="att"\001"; } }
            else if ( mode == "EQ" ) {    
    					sub(/^[ \n\r\t]*/,"",S0); if (S0=="") continue;
                end=substr(S0,1,1);
                if ( end=="\"" || end=="'" ) {
    					S0=substr(S0,2);accu="";mode="VALUE";}
                else{
                    accu=S0; sub(/[ \n\r\t\/>].*$/,"",accu); 
    				S0=substr(S0,length(accu)+1);
                    XATTR[att]=unescapeXML(accu); mode="TAG"; 
    				XNODE=XNODE att"="XATTR[att]"\001"; } }
            else if ( mode == "VALUE" ) { # terminated by end
                if ( p=index(S0,end) ) {
                    XATTR[att]=accu unescapeXML(substr(S0,1,p-1)); 
    				XNODE=XNODE att"="XATTR[att]"\001";
                    S0=substr(S0,p+length(end)); mode="TAG"; }
                else{ accu=accu unescapeXML(S0); S0=""; } }
            else if ( mode == "DTB" ) { # terminated by "[" or ">"
                if ( (q=index(S0,"[")) && (!(p=index(S0,end)) || q<p ) ) {
                    XTYPE=mode; XITEM= accu substr(S0,1,q-1); 
    				S0=substr(S0,q+1); mode=""; dtd=1; }
                else if ( p=index(S0,end) ) {
                    XTYPE=mode; XITEM= accu substr(S0,1,p-1); 
    				S0="]"substr(S0,p); mode=""; dtd=1; }
                else{ accu=accu S0; S0=""; } }
            else if ( p=index(S0,end) ) {  # terminated by end
                XTYPE=mode; XITEM= ( mode=="END" ? tag : accu substr(S0,1,p-1) );
                S0=substr(S0,p+length(end)); mode=""; }
            else{ accu=accu S0; S0=""; } }
        _XMLIO[file,"S0"]=S0; _XMLIO[file,"line"]=XLINE; 
    	_XMLIO[file,"path"]=XPATH; _XMLIO[file,"dtd"]=dtd;
        if (mode=="DAT") { mode=""; if (accu!="") XTYPE="DAT"; XITEM=accu; }
        if (XTYPE) { XNODE=XTYPE"\001"XITEM"\001"XNODE; return 1; }
        close(file);
        delete _XMLIO[file,"S0"]; delete _XMLIO[file,"line"]; 
    	delete _XMLIO[file,"path"]; delete _XMLIO[file,"dtd"];
        if (XERROR) XERROR=file ":" XLINE ": " XERROR;
        else if (mode) XERROR=file ":" mline ": " "unterminated " mode;
        else if (XPATH) XERROR=file ":" XLINE ": "  "unclosed tag(s) " XPATH;
    } 
    

    Unescape data and attribute values, used by getXML.

    function unescapeXML( text ) {
        gsub( "&apos;", "'",  text );
        gsub( "&quot;", "\"", text );
        gsub( "&gt;",   ">",  text );
        gsub( "&lt;",   "<",  text );
        gsub( "&amp;",  "\\&",  text );   # must come last
        return text
    }
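
A minimal sketch of how an unescape function of this kind behaves, assuming the five predefined XML entities (&apos;, &quot;, &gt;, &lt;, &amp;). The sample string is made up, and the function is restated inside a shell wrapper (with \047 standing for the single quote) only so that the example runs on its own.

```shell
#!/bin/sh
out=$(awk '
function unescapeXML(text) {
    gsub(/&apos;/, "\047", text)   # \047 is the single quote character
    gsub(/&quot;/, "\"",   text)
    gsub(/&gt;/,   ">",    text)
    gsub(/&lt;/,   "<",    text)
    gsub(/&amp;/,  "\\&",  text)   # last, so "&amp;lt;" yields "&lt;" and not "<"
    return text
}
BEGIN { print unescapeXML("a &lt; b &amp;&amp; c &gt; d, &amp;lt; stays escaped once") }
')
echo "$out"
```

Handling &amp; last matters: if it were replaced first, a doubly escaped sequence such as &amp;lt; would be unescaped twice.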
    

    Close xml file

    function closeXML( file ) {
        close(file);
        delete _XMLIO[file,"S0"]; delete _XMLIO[file,"line"]; 
        delete _XMLIO[file,"path"]; delete _XMLIO[file,"dtd"];
        delete _XMLIO[file,"open"]; delete _XMLIO[file,"IND"];
    }
    

    Author

    Jan Weber


    categories: Papers,Os,Apr,2009,SallyF

    Simulations for Equation-Based Congestion Control for Unicast Applications

    (Editor's Note: This page is a mirror of the original web site. It describes a collection of shell/awk/tcl scripts used for modeling complex domains. This code illustrates how language choice is not a matter of "awk" vs "X". Rather, systems can be a menagerie of different languages, including Awk.)

    Description

    This page has pointers to the simulation scripts for the Equation-Based Congestion Control for Unicast Applications by Sally Floyd, Mark Handley, Jitendra Padhye, and Joerg Widmer, May 2000, SIGCOMM 2000.

    Download

    These simulation scripts are also available from LAWKER.

    To test the code:

    • Unpack this zip file:
      contents.zip
      cd contents
      

      To use these scripts, you must do the following:

      gcc bwcnt2.c -o bwcnt2
      gcc bwcnt2a.c -o bwcnt2a
      

      Then, put a copy of "ns" in the current directory, for example:

      ln -s ~/vint/ns-2/ns ns
      

      To run the tests:

      ./single.com
      ./tfrm12.com
      ./queue2.com
      ./increase.com
      ./reduce.com
      ./reduce1.com
      

    Details

    These scripts are quick amalgams of shell scripts, awk, tcl, and whatever else was handy at the time, so they are not intended as an example of good programming style. They are run in a directory with a "graphs" subdirectory for saved output and *.mf files (gnuplot command files), and an "awk" subdirectory for awk files. Some of these scripts use supporting *.awk files that are available in the awk directory, but are not listed separately below. Some of the scripts (tfrm12.run) also use "bwcnt" C programs for processing output data; the C code for these is in the scripts directory. Possibly one day we will clean this all up to reduce the proliferation of scripts and languages involved.

    The implementation of TFRC in the NS simulator is still occasionally being modified, so the precise results of simulations can change with different versions of NS.

    Some of these simulations must be run with SBSIZE in scoreboard.h set to 10000 instead of to 1024, to allow larger TCP congestion windows.

    From Scripts to Figures

    The simulation for Figure 2 on "Illustration of the Average Loss Interval" can be run with "contents/single.com", with supporting files "contents/single.run", "contents/single.tcl", and "contents/queueSize.tcl". Generating the postscript file also uses the following files:
    "contents/graphs/s0.interval.mf", "contents/graphs/s0.loss.mf", and "contents/graphs/s0.rate.mf".

    The simulations for Figure 5 on "TCP flow sending rate" can be run with "contents/tfrm-full.CA.DropTail.run", "contents/tfrm-full.CA.RED.run" with supporting files "contents/tfrm-full.CA.tcl", "contents/queueSize.tcl", "contents/getmean-full.tcl". These scripts will produce data files called

    graphs/s-full-RED.CA.tcpmean
    graphs/s-full-DropTail.CA.tcpmean
    
    There are three values for each data point (from three runs) in these output files. To merge them, use "contents/merge2.tcl":

    merge2.tcl graphs/s-full-RED.CA.tcpmean > graphs/s-full-RED.CA.tcp
    merge2.tcl graphs/s-full-DropTail.CA.tcpmean > graphs/s-full-DropTail.CA.tcp
    
    Unfortunately, we no longer have the *.mf gnuplot script for generating the postscript from "s-full-RED.CA.tcp" and "s-full-DropTail.CA.tcp". BTW, on a 450MHz Xeon, each graph takes about 7 hours to generate.

    The simulations for Figure 6 can be run with "contents/tfrm12.com", with supporting files "contents/tfrm12.run", "contents/tfrm12.tcl", "contents/awk/plotdrops.awk" and "contents/queueSize.tcl". The supporting programs "bwcnt2" and "bwcnt2a" for processing the output data are compiled from "contents/bwcnt2.c" and "contents/bwcnt2a.c". FYI: On Sally's computer, this simulation set took 13 minutes. The following supporting files were also required for generating the postscript file: "contents/tfrm12.run1", "contents/graphs/getmean.tcl", "contents/graphs/s0.12.mf", "contents/graphs/s0.loss3.mf".

    The simulations for Figure 7 on "Coefficient of variation of throughput between flows" can be run with "contents/tfrmvar.run" with supporting files "contents/tfrmvar.tcl", "contents/queueSize.tcl", and "contents/graphs/getvar.tcl". The script "contents/fixcov.tcl" combines the many output files together, and gnuplot requires "contents/graphs/s3xxx.mf" to generate the postscript.

    When we have collected the scripts for Figure 8, we will put them on-line.

    The simulations for Figures 9 and 10 can be run with the script "contents/long/doit". The supporting scripts are in the tar file. The simulation takes perhaps one hour.

    The simulations for Figures 11-13 can be run with the script "contents/short/doit". The simulation takes up to three days.

    The simulations for Figure 14 on 40 long-lived flows can be run with "contents/queue2.com", with supporting files "contents/queue.run", "contents/queue.tcl", "contents/queueSize.tcl", "contents/tracequeue.tcl", "contents/awk/plotaveq.awk", and "contents/awk/plotqueue.awk". Generating the postscript file also uses the following file: "contents/graphs/s0.queue.mf".

    Figures 15-18 are from experiments.

    The simulations for Figure 19 on "A TFRC flow with an end to congestion" can be run with "contents/increase.com", with supporting files "contents/increase.run", "contents/increase.tcl", "contents/queueSize.tcl", "contents/awk/increase.awk", and "contents/graphs/s0.packetrate.mf".

    The simulations for Figure 20 on "A TFRC flow with persistent congestion" can be run with "contents/reduce.com", with supporting files "contents/reduce.run", "contents/reduce.tcl", "contents/queueSize.tcl", "contents/awk/reduce.awk", and "contents/awk/reduce1.awk". Generating the postscript file also uses the following file: "contents/graphs/s0.rate1.mf".

    The simulations for Figure 21 on "Number of round-trip times to reduce the sending rate" can be run with "contents/reduce1.com", with supporting files "contents/reduce1.run", "contents/reduce.tcl", "contents/queueSize.tcl", "contents/awk/reduce1.awk", and "contents/awk/reduce2.awk". Generating the postscript file also uses the following file: "contents/graphs/s0.half.mf".


    categories: Papers,May,2009,JonB

    Template-Driven Interfaces for Numerical Subroutines

    Jon L. Bentley, Mary F. Fernandez, Brian W. Kernighan, and Norman L. Schryer, ACM Transactions on Mathematical Software, Vol. 19, No. 3, September 1993, Pages 265-287

    This paper describes a set of interfaces for numerical subroutines. Typing a short (often one-line) description allows one to solve problems in application domains including least-squares data fitting, differential equations, minimization, root finding, and integration. Our approach of "template-driven programming" makes it easy to build such an interface: a simple one takes a few hours to construct, while a few days suffice to build the most complex program we describe.

    It is straightforward to implement this approach on many systems. We have tailored our implementation to our computing environment: our numerical routines are from the Port library, we call the routines from Fortran programs, and our interfaces are implemented in Awk.

    An appendix to the paper describes "L2fit". This program performs only the least-squares regression to calculate the parameters; it does not prepare the graphical summary. It is implemented as a 50-line Awk program and a 40-line Fortran template. The complete L2fit is a 330-line Awk program that uses a 45-line Fortran template; it also uses a 60-line Troff and Grap template to produce the output.

    Download pdf.


    categories: Papers,Os,Apr,2009,KimD

    Intrusion Alert Normalization with Awk

    From Intrusion Alert Normalization method using AWK scripts and attack name database. Dongyoung Kim, HyoChan Bang, Jung-Chan Na, Advanced Communication Technology, 2005, ICACT 2005. The 7th International Conference on Publication Date: 21-23 Feb. 2005 Volume: 1, On page(s): 608- 611 Vol. 1

    The several current classes of intrusion alert have various formats and semantics, and they are transferred using a variety of protocols, including IDXP, SNMP trap, and SYSLOG. This variety of formats makes it difficult to use the alerts together. Intrusion alert normalization converts the various alerts into data with the same structure and the same semantics. We need this normalization process to unify alerts from a variety of security equipment. This paper describes how to normalize alerts from several IDS and security equipment.


    categories: TextMining,Mar,2009,Admin

    Text Mining

    Some of the code at awk.info is somewhat historical in nature. For example, Scott Pakin's gender predictor was written in 1991. Given that, it might be mistakenly concluded that Awk is somehow old-fashioned and not suitable for modern tasks.

    Text mining, on the other hand, could be the killer app for Awk in the 21st century. The language excels at creating one-off reports that handle the quirks of a particular file format.

    There is a growing interest in using Awk for this kind of work. All the examples presented below come from work conducted in 2007 and 2008:

    Why Text Mining?

    If we could properly understand unstructured text, this would be a result of tremendous practical importance. A recent study concluded that:

    • 80 percent of business is conducted on unstructured information;
    • 85 percent of all data stored is held in an unstructured format;
    • Unstructured data doubles every three months;

    That is, if we can tame the text mining problem, it would be possible to reason and learn from a much wider range of business data than ever before.

    Results (with Awk)

    Note that, in the Menzies/Marcus and Schmitt/Christianson tool kits, Awk by itself was not enough. Both of these toolkits were intricate combinations of Awk, sed, bash, and other tools. Within that combination, Awk was very useful for handling the specifics not managed by the other tools.


    categories: TextMining,May,2009,YasumasaS

    Lexical and Grammar Analysis

    Yasumasa Someya describes an entire natural language processing kit, written in Awk, at http://someya-net.com/09-MA/.

    The toolkit is an excellent example of Awk-in-the-large. Appendix C1 of that documentation lists the Awk programs used in that study. It is a fascinating combination of tiny filters and complex code, which can be combined in multiple ways to produce an intricate analysis:

    • 63 Awk files...
    • ... used in 11 batch files ...
    • ... that utilize data in 8 dictionary and other files

    The Awk file list is shown below.

    Num  File  Dated  Description
    1 ad_sp_ed.awk 980628 Insert space before the return mark
    2 add.awk 980820 Adds all the values contained in $1 through $n respectively.
    3 bun_fre2.awk 980724 The main program of "Sentence Profiler (Ver.1)." Prints sentence-length profile table and graph.
    4 bun_fre4.awk 980730 Revised version of "bun_fre2.awk"
    5 cnt_freq.awk Counts the number of each tag sequence and produces a list of modal verb-structures with frequency information.
    6 capital.awk 980622 Prints text lines beginning with a capital letter (for extracting proper nouns from a wordlist).
    7 chikan.awk 980814 Compares an input file and a specified dictionary. If the words in $1 of the dictionary match words in the input file, the latter will be replaced with the $2 data in the former. (See "fmatch.awk").
    8 cleantag.awk 980818 Cleans up a file tagged with the Brill Tagger, and replaces the default slash symbol (/) with the underbar (_).
    9 countme.awk 981002 Counts the number of words in a text, either as type or token.
    10 del_hyph.awk 971117 Deletes line-end hyphens.
    11 del_nbr.awk 980623 Deletes line-initial numbers and symbols.
    12 del_null.awk 980205 Deletes excess blank lines, leaving only one blank line.
    13 del_rtn.awk 980518 Deletes the return mark at the end of each record
    14 del{_}.awk 981007 Deletes the idiom mark from the output of "dmfreq.awk".
    15 delblank.awk 980601 Deletes all blank lines.
    16 delkigou.awk 980721 Deletes all symbols and marks in $2.
    17 delslash.awk 970831 Replaces the slash with a space.
    18 ex_there.awk 980628 Extracts all the "Ex-There" constructions.
    19 f1_del.awk 980417 Prints all the data except those in $1.
    20 fmatch.awk 980814 The main program of "Collocation and Idiom Finder (Ver.1)." Marks all the matched strings in the format of "{ idiom }_IDM."
    21 hv_vbn.awk 980628 Extracts all the present perfect constructions from a tagged corpus.
    22 ichigyo.awk 980205 Same as "del_null.awk"
    23 idmfreq.awk 981007 Produces a frequency comparison table of specified collocations and idioms. Used as part of Collocation and Idiom Finder (Ver.1).
    24 if$2none.awk 980821 Prints records whose $2 is not blank.
    25 if_md.awk 980628 Extracts all the IF+MD constructions from a tagged corpus.
    26 JJ.awk 980912 Extracts all the adjectives from a tagged wordlist.
    27 kaihi-1.awk 980523 Prints the data as is, except for those marked by #.
    28 kaihi-2.awk 980730 Prints the data as is if marked by #. If not, adds sentence ID numbers before printing.
    29 karamoji.awk 980417 Deletes sentence-initial space.
    30 kensaku.awk 980201 Regular expression search from the command line.
    31 l_sp_del.awk 971004 Deletes excess line-initial space.
    32 line_nbr.awk 980518 Adds sentence numbers.
    33 makeline.awk 980201 Inserts a return code at the end of sentence-initial punctuation marks and symbols, except at specified abbreviations (used in conjunction with "txt_id.awk").
    34 matching.awk 980620 Replaces each entry word in the input file with a corresponding WL tag as defined in the WL-tag dictionary file. Non-match strings are printed as is (used as part of "Word Level Checker").
    35 matchnew.awk 980825 Replaces each entry word in the input file with a corresponding WL tag as defined in the WL-tag dictionary file (See endnote 8, Chapter 3).
    36 merge0.awk 980623 Merges two wordlists (Add FILE1 to FILE2, and prints FILE3, for $0).
    37 merge1.awk 980703 Merges two wordlists (Add FILE1 to FILE2, and prints FILE3, for $1).
    38 nandoprn.awk 980624 Sorts and prints the results of "matching.awk" (used as part of "wlc.bat" and "w_nando.bat").
    39 NN.awk 980915 Extracts all the words with NN tags.
    40 non_cap.awk 980624 Prints all lines starting with a lower case letter (for extracting data other than proper nouns from a wordlist).
    41 open_con.awk 980529 Opens contractions (e.g. I'm, we'd, we'll, couldn't, etc.) . Used before executing the Brill Tagger.
    42 predcnt1.awk 980725 Counts the number of predications (mentioned in Endnote 19, Chapter 2).
    43 prn_!tag.awk 980825 Prints text data only from a POS-tagged text.
    44 prn_tag.awk 980601 Extracts POS tag data from a tagged text, and prints them onto a separate file (See Endnote 6, Chapter 2).
    45 prn{_}.awk 981007 Prints lines that include strings marked "{...}_IDM" (used as part of "fmatch1.bat" and "fmatch2.bat").
    46 prn_MD.awk 980601 Extracts all MD tags from a tagged text, and prints them onto a separate file (See Endnote 30, Chapter 4).
    47 r_sp_del.awk 980207 Deletes space between the return code and the last word of each sentence.
    48 RB.awk 980912 Extracts all the words with RB tags.
    49 rtn@}.awk 981007 Inserts a return code at the X mark to the output of "prn{_}.awk".
    50 sentence.awk 980718 Main program of "Sentence Profiler (Ver.1)." Counts the numbers of words and sentences, and the average number of words per sentence, and prints the result.
    51 shiage.awk 980107 Deletes unnecessary data from the output of "prn{_}.awk to rtn@}.awk" and prints the result after sorting.
    52 sp_kigou.awk 980523 Adds space before and/or after specified punctuation marks and symbols (used in conjunction with the Brill Tagger).
    53 tagme.awk 980623 Experimental tagging program.
    54 TagToSyn.awk 980720 Extracts syntactic information from POS tag data.
    55 tokei.awk 980801 Calculates sum, mean, variance, SD, dispersion and usage.
    56 txt_id.awk 980623 Adds "Sentence ID and Number" to a plain running text.
    57 VB.awk 980830 Extracts all the words with VB tags (See Endnote 25, Chapter 3).
    58 voc_lev1.awk 980823 Processes input data for "voc_lev2.awk"
    59 voc_lev2.awk 980823 Prints the results of "matching.awk to nandoprn.awk" with a graph and a table (used as part of "wlc.bat").
    60 mk_list.awk 980801 A multi-function wordlist compiler, mk_list.awk. Mentioned in Endnote 14, Chapter 3. See Appendix C2 for program source.
    61 word.awk 980828 Produces a simple wordlist from a plain running text file.
    62 wordlist.awk 980718 Produces a simple wordlist with frequency information from a plain running text file.
    63 wrdlevel.awk 980623 Replaces entries in a wordlist with WL tags.
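
    As a rough illustration of how tiny filters like these compose, here is a minimal sketch in shell and Awk. The two functions are hypothetical stand-ins written in the spirit of "del_null.awk" and "line_nbr.awk"; they are not Someya's code, and the tab-separated numbering format is an assumption.

```shell
#!/bin/sh
# Hypothetical stand-in for a "del_null.awk"-style filter:
# squeeze each run of blank lines down to a single blank line.
squeeze_blank() {
  awk '
    NF       { print; blank = 0; next }  # non-blank line: print it and reset the run
    !blank++ { print "" }                # first blank line of a run: keep exactly one
  '
}

# Hypothetical stand-in for a "line_nbr.awk"-style filter:
# prefix every non-blank line with a running number.
number_lines() {
  awk '
    NF { printf "%d\t%s\n", ++n, $0; next }
       { print }
  '
}

# Small filters like these are combined with pipes.
printf 'one\n\n\n\ntwo\n\nthree\n' | squeeze_blank | number_lines
```

    Each filter does one job and reads standard input, so a batch file only has to decide the order in which they are chained.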

    categories: TextMining,Mar,2009,LotharS

    Awk and Sed for Language Analysis

    References

    Lothar M. Schmitt and Kiel T. Christianson:

    Description

    The authors show how to construct tools for language analysis in research and teaching using Awk, the Bourne shell, and sed under UNIX. Applications include the following:
    • searches for words, phrases, grammatical patterns and phonemic patterns in text;
    • statistical evaluation of texts in regard to such searches;
    • transformation of phonetic, phonemic or typographic transcriptions;
    • comparison of texts in various respects;
    • lexical-etymological analysis;
    • concordance;
    • assistance in translating text;
    • assistance in learning languages;
    • assistance in teaching languages;
    • and text processing and formatting. This latter includes the generation of on-line dictionaries for the Internet from files that were generated with what-you-see-is-what-you-get editors representing only the linear structure of the dictionary (i.e., the book).
    All of the above can be achieved with particularly simple and short code. In that regard, they illustrate how sed and awk can be combined in the pipe mechanism of UNIX to create very powerful processing devices.
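
    A minimal sketch of the kind of sed-and-awk pipeline described above is a word frequency list. This is not Schmitt and Christianson's code; the function name word_freq and the choice to treat every non-letter as a separator are assumptions made for illustration.

```shell
#!/bin/sh
# word_freq (hypothetical name): word frequency list for plain text.
# sed does the character-level clean-up, awk does the counting.
word_freq() {
  # 1. sed:  turn every non-letter into a space
  # 2. awk:  tally each word, ignoring case
  # 3. sort: most frequent first, ties in alphabetical order
  sed -e 's/[^A-Za-z]/ /g' "$@" |
  awk '
    { for (i = 1; i <= NF; i++) count[tolower($i)]++ }
    END { for (w in count) print count[w], w }
  ' |
  sort -k1,1nr -k2,2
}

printf 'The cat saw the dog. The dog ran!\n' | word_freq
```

    For the sample sentence the first two lines of output are "3 the" and "2 dog". Swapping the sed expression or the awk tally changes the analysis without touching the rest of the pipe.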

    Their notes include a short introduction to programming the Bourne-shell and rather short, but complete descriptions of sed and awk customized in regard to language analysis.


    categories: TextMining,Mar,2009,Timm

    Text Mining Issue Reports

    References

    Tim Menzies and Andrian Marcus:

    Description

    Severis is a set of Awk, bash, sed, and other scripts for finding predictors of high severity issues in text reports. Test engineers write such issue reports whenever they encounter anomalies in the code they are inspecting.

    Severis was designed to be an audit tool for test engineers, a second "look over the shoulder" to alert a senior engineer if a junior test engineer was doing something strange.

    At least for the text issue reports studied by Severis, very simple tools were enough to determine the terms that predict different issue severities.


    categories: TextMining,Mar,2009,DonaldM

    Text Munging in Awk (and Perl and Python)

    Donald 'Paddy' McCarthy reports an interesting comparison of Awk vs Perl vs Python for doing some text pre-processing.

    The example shows off Awk's ability to quickly prototype a one-off specialized report for a particular data format.

    It also offers some comment on the language wars between Awk and <insert your favorite scripting language here>: there is no evidence in the following code that dear old-fashioned Awk is more complex or arcane or slower than more recent, supposedly better, languages.

    • Tests on 1MB of data of the form
      <string:date> [ <float:data-n> <int:flag-n> ]*24
      

      e.g.

      1991-03-31 10.000  1 10.000  1  ... 20.000      1       35.000  1
      
    • Time to process 1MB of data (over 5000 records of the above form):
      • Awk: 1.069s
      • Perl: 2.450s
      • Python: 1.138s

    Awk

    The awk example:

    # Author Donald 'Paddy' McCarthy Jan 01 2007
    
    BEGIN{
      nodata = 0;             # Current run of consecutive flags < 0 in lines of file
      nodata_max=-1;          # Max consecutive flags < 0 in lines of file
      nodata_maxline="!";     # ... and line number(s) where it occurs
    }
    FNR==1 {
      # Accumulate input file names
      if(infiles){
        infiles = infiles "," FILENAME
      } else {
        infiles = FILENAME
      }
    }
    {
      tot_line=0;             # sum of line data
      num_line=0;             # number of line data items with flag>0
    
      # extract field info, skipping initial date field
      for(field=2; field<=NF; field+=2){
        datum=$field;
        flag=$(field+1);
        if(flag < 1){
          nodata++
        }else{
          # check run of data-absent fields
          if(nodata_max==nodata && (nodata>0)){
            nodata_maxline=nodata_maxline ", " $1
          }
          if(nodata_max < nodata && (nodata>0)){
            nodata_max=nodata
            nodata_maxline=$1
          }
          # re-initialise run of nodata counter
          nodata=0;
          # gather values for averaging
          tot_line+=datum
          num_line++;
        }
      }
    
      # totals for the file so far
      tot_file += tot_line
      num_file += num_line
    
      printf "Line: %11s  Reject: %2i  Accept: %2i  Line_tot: %10.3f  Line_avg: %10.3f\n", \
             $1, ((NF -1)/2) -num_line, num_line, tot_line, (num_line>0)? tot_line/num_line: 0
    
      # debug prints of original data plus some of the computed values
      #printf "%s  %15.3g  %4i\n", $0, tot_line, num_line
      #printf "%s\n  %15.3f  %4i  %4i  %4i  %s\n", $0, tot_line, num_line,  nodata, nodata_max, nodata_maxline
    
    
    }
    
    END{
      printf "\n"
      printf "File(s)  = %s\n", infiles
      printf "Total    = %10.3f\n", tot_file
      printf "Readings = %6i\n", num_file
      printf "Average  = %10.3f\n", tot_file / num_file
    
      printf "\nMaximum run(s) of %i consecutive false readings ends at line starting with date(s): %s\n", nodata_max, nodata_maxline
    }
    

    Perl

    The same functionality in Perl is very similar to the Awk program:

    # Author Donald 'Paddy' McCarthy Jan 01 2007
    
    BEGIN {
      $nodata = 0;             # Current run of consecutive flags < 0 in lines of file
      $nodata_max=-1;          # Max consecutive flags < 0 in lines of file
      $nodata_maxline="!";     # ... and line number(s) where it occurs
    }
    foreach (@ARGV) {
      # Accumulate input file names
      if($infiles ne ""){
        $infiles = "$infiles, $_";
      } else {
        $infiles = $_;
      }
    }
    
    while (<>){
      $tot_line=0;             # sum of line data
      $num_line=0;             # number of line data items with flag>0
    
      # extract field info, skipping initial date field
      chomp;
      @fields = split(/\s+/);
      $nf = @fields;
      $date = $fields[0];
      for($field=1; $field < $nf; $field+=2){
        $datum = $fields[$field] +0.0;
        $flag  = $fields[$field+1] +0;
        if(($flag+1 < 2)){
          $nodata++;
        }else{
          # check run of data-absent fields
          if($nodata_max==$nodata and ($nodata>0)){
            $nodata_maxline = "$nodata_maxline, $fields[0]";
          }
          if($nodata_max < $nodata and ($nodata>0)){
            $nodata_max = $nodata;
            $nodata_maxline=$fields[0];
          }
          # re-initialise run of nodata counter
          $nodata = 0;
          # gather values for averaging
          $tot_line += $datum;
          $num_line++;
        }
      }
    
      # totals for the file so far
      $tot_file += $tot_line;
      $num_file += $num_line;
    
      printf "Line: %11s  Reject: %2i  Accept: %2i  Line_tot: %10.3f  Line_avg: %10.3f\n",
             $date, (($nf -1)/2) -$num_line, $num_line, $tot_line, ($num_line>0)? $tot_line/$num_line: 0;
    
    }
    
    printf "\n";
    printf "File(s)  = %s\n", $infiles;
    printf "Total    = %10.3f\n", $tot_file;
    printf "Readings = %6i\n", $num_file;
    printf "Average  = %10.3f\n", $tot_file / $num_file;
    
    printf "\nMaximum run(s) of %i consecutive false readings ends at line starting with date(s): %s\n",
           $nodata_max, $nodata_maxline;
    

    Python

    The Python program, however, splits the fields in the line slightly differently (although it could use the method used in the Perl and Awk programs too):

    # Author Donald 'Paddy' McCarthy Jan 01 2007
    
    import fileinput
    import sys
    
    nodata = 0;             # Current run of consecutive flags < 0 in lines of file
    nodata_max=-1;          # Max consecutive flags < 0 in lines of file
    nodata_maxline=[];      # ... and line number(s) where it occurs
    
    tot_file = 0            # Sum of file data
    num_file = 0            # Number of file data items with flag>0
    
    infiles = sys.argv[1:]
    
    for line in fileinput.input():
      tot_line=0;             # sum of line data
      num_line=0;             # number of line data items with flag>0
    
      # extract field info
      field = line.split()
      date  = field[0]
      data  = [float(f) for f in field[1::2]]
      flags = [int(f)   for f in field[2::2]]
    
      for datum, flag in zip(data, flags):
        if flag < 1:
          nodata += 1
        else:
          # check run of data-absent fields
          if nodata_max==nodata and nodata>0:
            nodata_maxline.append(date)
          if nodata_max < nodata and nodata>0:
            nodata_max=nodata
            nodata_maxline=[date]
          # re-initialise run of nodata counter
          nodata=0;
          # gather values for averaging
          tot_line += datum
          num_line += 1
    
      # totals for the file so far
      tot_file += tot_line
      num_file += num_line
    
      print "Line: %11s  Reject: %2i  Accept: %2i  Line_tot: %10.3f  Line_avg: %10.3f" % (
            date,
            len(data) -num_line,
            num_line, tot_line,
            tot_line/num_line if (num_line>0) else 0)
    
    print ""
    print "File(s)  = %s" % (", ".join(infiles),)
    print "Total    = %10.3f" % (tot_file,)
    print "Readings = %6i" % (num_file,)
    print "Average  = %10.3f" % (tot_file / num_file,)
    
    print "\nMaximum run(s) of %i consecutive false readings ends at line starting with date(s): %s" % (
        nodata_max, ", ".join(nodata_maxline))
    
    

    categories: Apr,2009,WilhelmW,OsamuA,ArnoldR

    99 Bottles of Beer

    You know the song:

      99 bottles of beer on the wall, 99 bottles of beer. Take one down and pass it around, 98 bottles of beer on the wall.

      98 bottles of beer on the wall, 98 bottles of beer. Take one down and pass it around, 97 bottles of beer on the wall.

      97 bottles of beer on the wall, 97 bottles of beer. Take one down and pass it around, 96 bottles of beer on the wall.

      ....

    But how do you code it? Here's Wilhelm Weske's version. It is kind of fun but it's a little hard to read:

    #!/usr/bin/awk -f
    
            BEGIN{
           split( \
           "no mo"\
           "rexxN"\
           "o mor"\
           "exsxx"\
           "Take "\
          "one dow"\
         "n and pas"\
        "s it around"\
       ", xGo to the "\
      "store and buy s"\
      "ome more, x bot"\
      "tlex of beerx o"\
      "n the wall" , s,\
      "x"); for( i=99 ;\
      i>=0; i--){ s[0]=\
      s[2] = i ; print \
      s[2 + !(i) ] s[8]\
      s[4+ !(i-1)] s[9]\
      s[10]", " s[!(i)]\
      s[8] s[4+ !(i-1)]\
      s[9]".";i?s[0]--:\
      s[0] = 99; print \
      s[6+!i]s[!(s[0])]\
      s[8] s[4 +!(i-2)]\
      s[9]s[10] ".\n";}}
    

    Osamu Aoki has a more maintainable version. Note how all the screen I/O is localized via functions that return strings, rather than printing straight to the screen. This is very useful for maintenance purposes or for including the code as a library in other Awk programs.

    BEGIN { 
       for(i = 99; i >= 0; i--) {
          print ubottle(i), "on the wall,", lbottle(i) "."
          print action(i), lbottle(inext(i)), "on the wall."
          print
       }
    }
    function ubottle(n) {
       return \
         sprintf("%s bottle%s of beer", n ? n : "No more", n - 1 ? "s" : "")
    }
    function lbottle(n) {
       return \
         sprintf("%s bottle%s of beer", n ? n : "no more", n - 1 ? "s" : "")
    }
    function action(n) {
       return \
          sprintf("%s", n ? "Take one down and pass it around," : \
                             "Go to the store and buy some more,")
    }
    function inext(n) {
       return n ? n - 1 : 99
    }
    

    Osamu's version is very similar to how it'd be done in C or other languages and it does not take full advantage of Awk's features. So Arnold Robbins wrote a third version that is more data driven. Most of the work is done in a pre-processor and the actual runtime just dumps text decided before the run. This solution might take more time (to do the setup) but it does allow for the simple switching of the interface (just change the last 10 lines).

    BEGIN {
            # Setup
            take = "Take one down, pass it around"
            buy = "Go to the store and buy some more"
    
            Instruction[0] = buy
            Next[0] = 99
            Count[0, 1] = "No more"
            Count[0, 0] = "no more"
    
            for (i = 99; i >= 1; i--) {
                    Instruction[i] = take
                    Next[i] = i - 1
                    Count[i, 0] = Count[i, 1] = (i "")
                    Bottles[i] = "bottles"
            }
            Bottles[1] = "bottle"
            Bottles[0] = "bottles"
            # Execution
            for (i = 99; i >= 0; i--) {
                    printf("%s %s of beer on the wall, %s %s of beer.\n",
                            Count[i, 1],
                            Bottles[i],
                            Count[i, 0],
                            Bottles[i])
                    printf("%s, %s %s of beer on the wall.\n\n",
                            Instruction[i],
                            Count[Next[i], 0],
                            Bottles[Next[i]])
            }
    }
    

    I'll drink to that.


    categories: Mail,Apr,2009,Admin

    Awk and Mail

    These pages focus on using Awk to implement filters on Unix mail files.


    categories: Mail,Apr,2009,StevenH

    Shell Statistical Spam Filter and Whitelist

    About

    Author

    Steven Hauser.

    Origin

    http://www.tc.umn.edu/~hause011/article/Statistical_spam_filter.html

    Client Side Unix Shell - AWK with updating email address "Whitelist"

    I now use a "Statistical Spam Filter". Wow, the scummy sewer of internet mail is cleansed, refreshed and usable again. Just using the delete button was getting too difficult: I got 8 to 10 spam for every good piece of mail. As a spam detector I am not as good a filter as you might think, since just the subject and address are not always enough. An anti-spam tool I am not: I would occasionally open a spam, to my great annoyance.

    My interpretation of Paul Graham's Spam Article

    My filter was inspired by Paul Graham's article about a Naive Bayesian spam filter. The article is at "A Plan for Spam". He basically says that you get statistics on how often tokens show up in two bodies of mail (spam and good), and then calculate a statistical value that a single mail is spam by looking at the tokens in it. The more mail in the good and spam mail bodies, the better the filter is "trained". Jeez, he made it sound so easy. And it is. I slapped an anti-spam tool together as a ksh and awk script for use as a personal filter on a Unix type system. To implement it I put it in the ~/.forward file. The code is at the bottom of the article, less than 100 lines. The total code for the filter and training script is less than 200 lines.

    This filter differs in lots of ways from the Paul Graham article. I took out some of the biases he describes and simplified it, maybe it is too simple. What I find most interesting is that the differences do not seem to matter much, I still filter out 96+% of spams. I got those results with a spam sample that is at least 500 emails and a good email sample that is at least 700 emails. With smaller training samples or a different mail mix it may not get as good results, or it may be better. Note: I later changed the training body to be more like the proportion of real spam to good mail, which is much more spam than good mail, about 8-10 spam to every good mail received and the anti-spam tool worked better.

    How the Spam Filter Works in Unix

    First I run the training script on two bodies of mail, ~/Mail/received (good mail) and ~/Mail/junk (saved spam mail.) The ~/Mail/received file is already created on my unix box and holds mail that I have read and not deleted. The training script finds all the tokens in the emails and gives them a probability depending on how the token is found in the "spam mail" and the "good mail". The training script also creates the whitelist of addresses from the "good mail." As the mail flows through the system the training script will then "learn" each time it is run.
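
    The training step can be sketched as follows. This is a simplified illustration, not the author's training script: the helper name token_probs, the "token count" input format, and the use of summed counts as the token totals are all assumptions. It gives tokens seen in only one body a fixed value of .01 or .99, and otherwise compares the token's relative frequency in the two bodies without doubling the good counts. It leaves out the minimum token frequency that the author's script applies.

```shell
#!/bin/sh
# token_probs GOOD_COUNTS SPAM_COUNTS   (hypothetical helper)
# Each input file holds "token count" lines. Prints "token probability" lines.
token_probs() {
  awk '
    FILENAME == ARGV[1] { good[$1] = $2; ngood += $2; next }
                        { bad[$1]  = $2; nbad  += $2 }
    END {
      for (t in good) {
        if (t in bad) {
          # seen in both bodies: relative frequency in spam vs. good mail
          printf "%s %.2f\n", t, (bad[t] / nbad) / (good[t] / ngood + bad[t] / nbad)
        } else {
          # seen only in good mail
          printf "%s %.2f\n", t, 0.01
        }
      }
      for (t in bad) {
        if (!(t in good)) {
          # seen only in spam
          printf "%s %.2f\n", t, 0.99
        }
      }
    }
  ' "$1" "$2" | sort
}

# Tiny made-up token counts, just to exercise the function.
tmp=$(mktemp -d)
printf 'meeting 3\nfree 1\n' > "$tmp/good"
printf 'free 3\npills 1\n'   > "$tmp/spam"
token_probs "$tmp/good" "$tmp/spam"
rm -r "$tmp"
```

    With these made-up counts, "free" gets 0.75, "meeting" 0.01 and "pills" 0.99.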

    I run the actual spam filter script from the .forward file which allows a user to process mail before it hits your inbox. (Look up "man forward" at the shell prompt for further information on the .forward file.) The script first checks the whitelist for a good address, if it is found it passes the filter. If the address is not found it is passed to the statistical spam filter, the tokens are checked and the email is given a spaminess value. Above a certain value the email is classified as spam and put in the ~/Mail/junk file, below the value it passes to /var/spool/mail/mylogin where I read it as god intended email to be read, with a creaky old unix client. However, I can still read it with any other client I want, POP or IMAP.
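
    The whitelist check that runs before any scoring might look like the sketch below. It is an illustration only: the helper name sender_is_whitelisted is hypothetical, and it assumes a whitelist file with one lower-case address per line and a message whose headers end at the first blank line.

```shell
#!/bin/sh
# sender_is_whitelisted WHITELIST < message   (hypothetical helper)
# Exit status 0 if a "From " or "Return-Path: " address of the message on
# standard input appears in WHITELIST, and 1 otherwise.
sender_is_whitelisted() {
  awk -v wl="$1" '
    BEGIN { while ((getline addr < wl) > 0) ok[addr] = 1 }
    /^$/  { exit }                       # blank line: end of the mail headers
    /^From / || /^Return-Path: / {
      addr = tolower($2)
      gsub(/[<>]/, "", addr)             # strip angle brackets
      if (addr in ok) { found = 1; exit }
    }
    END   { exit !found }                # status 0 only when a match was seen
  '
}

# Made-up whitelist and message, just to exercise the function.
tmp=$(mktemp -d)
printf 'alice@example.com\n' > "$tmp/whitelist"
if printf 'From Alice@Example.com Mon Mar  2 10:00:00 2009\nSubject: lunch\n\nhi\n' |
   sender_is_whitelisted "$tmp/whitelist"
then echo "deliver"
else echo "score with the statistical filter"
fi
rm -r "$tmp"
```

    Because the check stops at the first matching header, whitelisted mail never pays the cost of loading the token probabilities.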

    Testing the Spam Filter

    I included a little test script below that I used to check my results. I just split emails into files and run them one at a time and check the value the filter gives.

    Testing on email that has been used to train the filter will give results that are very good and not valid, so I tested on email not seen by the training script. The filter does get much better at filtering as the training sample gets bigger, just like the other statistical spam filters. For example, at lower sample sizes (trained with 209 good mails, 301 spam mails) the filter was pretty bad. When the average spam value cutoff was raised to .51 so no good mail was blocked, 44% of the spam email passed through on a set of 320 spam and 683 good email. Even so, that means 56% of the spam was blocked. Small sample sizes are not perfect, but are usable, and I began using the mail filter with a sample set of about 600 good mail and 300 spam. As the training sample increased the results improved. As I changed the mail mix to reflect the real spam proportions it got even better, around 96-98% of the spam blocked. I think the lower early results were because of the proportions of spam to good email: they should reflect the real proportion received on the system used by the filter.

    Paul Graham or others may have superior filters and better mathematics for anti-spam algorithms but I am not sure that it matters all that much, the amount of spam that gets through is small enough not to bother me.

    Filter Performance

    I used gawk in the filter and checked it with the gawk profiler to look for performance problems. The largest performance constraint is creating the spam-probability associative array in memory, the key-value pairs of tokens and the spam value I assign to them. Creating this associative array is more than 95% of the current time to process an email through the filter and gets worse when the set of tokens gets larger. Perl and other language users can get around this performance problem with DBM file interfaces, currently not available to my gawk filter.

    White List Filter Improves Performance and Cuts Errors

    I added a "whitelist" of good email addresses, a feature that helps keep good email from a bad classification and improves performance by a huge amount (at least a factor of 100) by not having to further filter the message. The white list is not one of the "challenge-response" things that annoy me so much that I toss any such email away. It simply learns from the email used to train the filter: it saves addresses from email that has passed the filter and landed in my "received" file. I figure that if I receive a good email from someone, chances are 100% that I want to receive email from that address. Note there is a place in the white list script to get rid of commonly forged email addresses, like your own address.

    Why Differences With Bayes Filters Do Not Matter

    The main concept put forward by Paul Graham holds true and seems ungodly robust: applying statistics to filter spam works very well compared to lame rule sets and black lists. My program just proves the robustness of the solution; apparently any half-baked formula (like what I used) seems to work as long as the base probability of the tokens is computed.

    Here are some of the many differences between this filter and the filter in the Paul Graham article in no particular order of importance:

    • I do not lower-case the tokens; one result is that the token frequency required for inclusion in the spam-probability associative array is set to three instead of five. I think that "case" is an important token characteristic.
    • "Number of mails" is replaced with "number of tokens." My explanation is that I am looking at a token frequency in an interval of a stream of tokens. It seems simpler to think of it that way, instead of number of mails. And when I tried "number of mails" I got the same result values on the messages for the formula I used.
    • "Interesting tokens" were tokens in the message with a spam statistic "greater than 0.95" and "less than 0.05" Easy to implement. I did not figure out the fifteen most interesting tokens, the limit used by Paul Graham. As a result, most of my mail has more than 15 interesting tokens, a few have fewer, which could be a weakness, but does not seem to matter too much.
    • Paul Graham's Naive Bayesian formula goes to 0 or 1 pretty quickly, which is fine, I tried it out in awk too. But now I just sum the "interesting token" probabilities and divide by the number of "interesting tokens" per message. Yes, it is just an average of the probability of "interesting tokens" and it is easy to implement and spreads the values over a 0-1 interval, spam towards 1 and good mail towards 0. I did this to implement some spam filtering as soon as possible. Even with a small sample of mail I was able to adjust the average probability value up to keep all the good mail and still get rid of a good proportion of spam. As I acquired more sample mail the filter caught more spam and I adjusted the average probability value down.
    • I have a "training" program that generates the token probabilities and an address "whitelist" to be run as a batch job at intervals (like once a day or week) and a separate filter program run out of ".forward"
    • I did not double the frequency value of the "good tokens" to bias them in the base spam probability calculation of each token.
    • Tokens not seen by the filter before are ignored. Paul Graham gives them a 0.4 probability of spaminess. Under most other ways of assigning a probability to unknown tokens, my formula would ignore them anyway, because the value would fall outside the "interesting token" ranges.
    • I noticed that Paul Graham ignores HTML comments. When I looked at some of the spam I found out why: some spammers load the recipient address and common words into HTML comments spread through the text to get past rule-based filters. The statistical spam filter seems to find them anyway, so I include tags, comments, everything.
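
    The averaging described in the list above can be sketched in a few lines of awk called from the shell. This is only an illustration under assumptions: the function name score_message, the probability file layout (one "token probability" pair per line), and the neutral 0.50 score for a message with no interesting tokens are not taken from the original scripts.

```shell
# score_message PROBFILE: read one message on stdin and print the average
# probability of its "interesting" tokens (hypothetical helper).
score_message () {
  awk -v probfile="$1" '
    BEGIN {
      # Load the "token probability" pairs written by the training step.
      while ((getline line < probfile) > 0) {
        split(line, f, " ")
        prob[f[1]] = f[2] + 0
      }
      close(probfile)
    }
    {
      # Only tokens with an extreme probability count as interesting.
      for (i = 1; i <= NF; i++)
        if (($i in prob) && (prob[$i] < 0.05 || prob[$i] > 0.95)) {
          sum += prob[$i]
          n++
        }
    }
    # Assumption: a message with no interesting tokens scores 0.50.
    END { printf "%.2f\n", (n ? sum / n : 0.5) }'
}

# Example: score_message ~/Mail/spamprobability < one_message.txt
```

    A message would then be treated as spam when the printed value exceeds a chosen threshold, which is the value the author describes adjusting as more sample mail arrives.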

    Code

    Test SpamFilter

    Note: Do not test with mail that was used to train the filter; test with mail the training program has not seen.

    #!/bin/ksh
    filter_test () {
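      # Split a file of unix email into many mail files (test/good0000, ...).
      # The output prefix must match the directory read by the loop below.
      mkdir -p test
      cat ~/Mail/rece* | csplit -k -f test/good -n 4 - '/^From /' '{900}'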
    
      # Run a modified filter that displays the spam value for each mail file.
      # I just commented out the last part of the filter and added a 
      # print statement of the Subject line and spam value the filter found.
      for I in test/good*
      do
         cat $I | [filter_program-that_shows_the_value_only]
      done | sort -n 
    }
    

    Train SpamFilter.

    Call from the command line or in a crontab file.

    #!/bin/ksh
    number_of_tokens (){
      zcat $1 | cat $2 - | wc -w
    }
    
     # Note: Replace the "My-Own-Address" string below with an address that
     #       is commonly forged (such as your own) to keep it off the whitelist.
    address_white_list (){
      zcat $1 | 
      cat $2 - | 
      egrep '^From |^Return-Path: ' | 
      nawk '{print tolower($2)}'| 
      nawk '{gsub ("<",""); gsub (">","");print;}'| 
      grep -v 'My-Own-Address'| 
      sort -u > ~/Mail/address_whitelist
    }
    
     # Create a hash with probability of spaminess per token.
     #       Words only in good hash get .01, words only in spam hash get .99
    spaminess () {
    nawk 'BEGIN {goodnum=ENVIRON["GOODNUM"]; junknum=ENVIRON["JUNKNUM"];}
           FILENAME ~ "spamwordfrequency" {bad_hash[$1]=$2}
           FILENAME ~ "goodwordfrequency" {good_hash[$1]=$2}
    
        END    {
        for (word in good_hash) {
            if (word in bad_hash) { print word, 
                (bad_hash[word]/junknum)/ \
                ((good_hash[word]/goodnum)+(bad_hash[word]/junknum)) }
            else { print word, "0.01"}
        }
        for (word in bad_hash) {
            if (!(word in good_hash)) { print word, "0.99" }
        }}' ~/Mail/spamwordfrequency ~/Mail/goodwordfrequency 
    
    }
    
     # Print list of word frequencies
    frequency (){
      nawk ' { for (i = 1; i <= NF; i++)
            freq[$i]++ }
        END    {
        for (word in freq){
            if (freq[word] > 2) {
              printf "%s\t%d\n", word, freq[word];
            }
        } 
      }'
    }
     # Note: I store the email in compressed files to keep my storage space small,
     #       so I have the gzipped mail that I run through the filter training 
     #       script as well as current uncompressed "good" and spam files.
     #       
    prepare_data () {
      export JUNKNUM=$(number_of_tokens '/Your/home/Mail/*junk*.gz' '/Your/home/Mail/junk')
      export GOODNUM=$(number_of_tokens '/Your/home/Mail/*received*.gz' '/Your/home/Mail/received')
      address_white_list '/Your/home/Mail/*received*.gz' '/Your/home/Mail/received'
    
      echo $JUNKNUM $GOODNUM
    
      zcat ~/Mail/*junk*.gz | cat ~/Mail/junk - |
        frequency|
        sort -nr -k 2,2 > ~/Mail/spamwordfrequency
      zcat ~/Mail/*received*.gz | cat ~/Mail/received - |
        frequency|
        sort -nr -k 2,2 > ~/Mail/goodwordfrequency
    
      spaminess| 
        sort -nr -k 2,2 > ~/Mail/spamprobability
      # Clean up files
      rm ~/Mail/spamwordfrequency ~/Mail/goodwordfrequency 
    }
    
     #########
     # Main
    
    prepare_data
    exit
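
    The per-token calculation in the spaminess function above can be tried on its own. The sketch below applies the same formula to a single token; the function name token_probability and its argument order are hypothetical and chosen only for this illustration.

```shell
# token_probability BAD GOOD JUNKNUM GOODNUM
#   BAD, GOOD: how often the token occurs in spam and in good mail
#   JUNKNUM, GOODNUM: total token counts of the two corpora
token_probability () {
  awk -v b="$1" -v g="$2" -v junknum="$3" -v goodnum="$4" 'BEGIN {
    if (b == 0)      p = 0.01    # token seen only in good mail
    else if (g == 0) p = 0.99    # token seen only in spam
    else             p = (b / junknum) / ((g / goodnum) + (b / junknum))
    printf "%.4f\n", p
  }'
}

token_probability 30 10 1000 1000    # prints 0.7500
```

    Dividing each count by the size of its own corpus is what keeps the result fair when far more good mail than spam (or the reverse) has been collected.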
    

    Spamfilter using statistical filtering.

    Inspired by the Paul Graham article "A Plan for Spam" www.paulgraham.com

    Implement in the .forward file like so:

    "| /Your/path/to/bin/spamfilter"
    

    If the mail is spam it is appended to the spam file; otherwise it is appended to the good mail file.

    #!/bin/ksh
    spamly () {
    /usr/bin/nawk '
    
       { message[k++]=$0; }
    
       END { if (k==0) {exit;} # empty message or was in the whitelist.
    
             good_mail_file="/usr/spool/mail/your_user";
             spam_mail_file="/Your/home/Mail/junk";
             spam_probability_file="/Your/home/Mail/spamprobability";
             total_tokens=0.01;
    
             # Compare with > 0 so a missing file (getline returns -1) cannot loop forever.
             while ((getline < spam_probability_file) > 0)
                bad_hash[$1]=$2;
             close(spam_probability_file);
    
             for (line in message){ 
               token_number=split(message[line],tokens);
               for (i = 1; i <= token_number; i++){   # split() numbers elements from 1
                 if (tokens[i] in bad_hash) { 
                   if (bad_hash[tokens[i]] <= 0.06 || bad_hash[tokens[i]] >= 0.94){
                      total_tokens+=1;
                      spamtotal+=bad_hash[tokens[i]];
                    }
                  }
                }
             }
    
             if (spamtotal/total_tokens > 0.50) { 
                for (j = 0; j < k; j++){ print message[j] >> spam_mail_file}
                print "\n\n" >> spam_mail_file;
             }
             else {
                for (j = 0; j < k; j++){ print message[j] >> good_mail_file}
                print "\n\n" >> good_mail_file;
             }
       }'
    }
    
     # Check whitelist for good address. 
     # if in whitelist then put in good_mail_file
     #   else Pass message through filter.
    whitelister () {
      /usr/bin/nawk '
          BEGIN { whitelist_file="/Your/home/Mail/address_whitelist";
                  good_mail_file="/usr/spool/mail/your_user";
                  found="no";
                  while ((getline < whitelist_file) > 0)
                      whitelist[$1]="address";
                  close(whitelist_file);
          }
          { message[k++]=$0;}
          /^From / {sender=tolower($2); 
                gsub (/[<>]/,"",sender);
                if (sender in whitelist) { found="yes";}
          }
          /^Return-Path: / {sender=tolower($2); 
                gsub (/[<>]/,"",sender);
                if (sender in whitelist) { found="yes";}
          }
          END { if (found=="yes") { 
                   for (j = 0; j < k; j++){ print message[j] >> good_mail_file}
                   print "\n\n" >> good_mail_file;
                }
                else {
                   for (j = 0; j < k; j++){ print message[j];}
                }
          }'
    }
    
     #####################################
     # Main
     # The mail is first checked by the white list, if it is not found in the
     # white list it is piped to the spam filter.
    whitelister | spamly 
    exit
    

    categories: Mail,Apr,2009,ArnoldR

    Mail Sort

    Contents

    Author

    Arnold Robbins

    Download

    Download from LAWKER.

    Description

    Sorts a Unix style mailbox by "thread", in date+subject order.

    This is a script I use quite a lot. It requires gawk, although with some work it could be ported to standard awk. The timezone offset from GMT has to be adjusted to one's local offset (the LocalOffset variable), although I could probably eliminate that if I wanted to work on it hard enough.

    This took me a while to write and get right, but it's been working flawlessly for a few years now. The script uses the Message-ID header to detect and remove duplicates. It requires GNU Awk for the time/date functions and for an efficiency hack in string concatenation, but could be made to run on a POSIX awk with some work.

    Code

    Main

    BEGIN {
           TRUE = 1
           FALSE = 0
    
           split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", months, " ")
           for (i in months)
                   Month[months[i]] = i    # map name to number
    
           MonthDays[1] = 31
           MonthDays[2] = 28       # not used
           MonthDays[3] = 31
           MonthDays[4] = 30
           MonthDays[5] = 31
           MonthDays[6] = 30
           MonthDays[7] = 31
           MonthDays[8] = 31
           MonthDays[9] = 30
           MonthDays[10] = 31
           MonthDays[11] = 30
           MonthDays[12] = 31
    
           In_header = FALSE
           Body = ""
    
           LocalOffset = 2 # We are two hours ahead of GMT
    
           # These keep --lint happier
           Debug = 0
           MessageNum = 0
           Duplicates = FALSE
    }
    
    /^From / {
           In_header = TRUE
           if (MessageNum)
                   Text[MessageNum] = Body
           MessageNum++
           Body = ""
     # print MessageNum
    }
    
    In_header && /^Date: / {
           Date[MessageNum] = compute_date($0)
    }
    
    In_header && /^Subject: / {
           Subject[MessageNum] = canonacalize_subject($0)
    }
    
    In_header && /^Message-[Ii][Dd]: / {
           if (NF == 1) {
                   getline junk
                   $0 = $0 RT junk # Preserve original input text!
           }
    
           # Note: Do not use $0 directly; it's needed as the Body text
           # later on.
    
           line = tolower($0)
           split(line, linefields)
    
           message_id = linefields[2]
           Mesg_ID[MessageNum] = message_id        # needed for disambiguating message
           if (message_id in Message_IDs) {
                   printf("Message %d is duplicate of %s (%s)\n",
                           MessageNum, Message_IDs[message_id],
                           message_id) > "/dev/stderr"
                   Message_IDs[message_id] = (Message_IDs[message_id] ", " MessageNum)
                   Duplicates++
           } else {
                   Message_IDs[message_id] = MessageNum ""
           }
    }
    
    
    In_header && /^$/ {
           In_header = FALSE
           # map subject and date to index into text
    
           if (Debug && (Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]) in SubjectDateId) {
                   printf(\
           ("Message %d: Subject <%s> Date <%s> Message-ID <%s> already in" \
           " SubjectDateId (Message %d, s: <%s>, d <%s> i <%s>)!\n"),
                   MessageNum, Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum],
                   SubjectDateId[Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]],
                   Subject[SubjectDateId[Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]]],
                   Date[SubjectDateId[Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]]],
                   Mesg_ID[SubjectDateId[Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]]]) \
                           > "/dev/stderr"
           }
    
           SubjectDateId[Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]] = MessageNum
    
           if (Debug) {
                   printf("\tMessage Num = %d, length(SubjectDateId) = %d\n",
                           MessageNum, length(SubjectDateId)) > "/dev/stderr"
                   if (MessageNum != length(SubjectDateId) && ! Printed1) {
                           Printed1++
                           printf("---> Message %d <---\n", MessageNum) > "/dev/stderr"
                   }
           }
    
           # build up mapping of subject to earliest date for that subject
           if (! (Subject[MessageNum] in FirstDates) ||
               FirstDates[Subject[MessageNum]] > Date[MessageNum])
                   FirstDates[Subject[MessageNum]] = Date[MessageNum]
    }
    
    {
           Body = Body ($0 "\n")
    }
    
    END {
           Text[MessageNum] = Body # get last message
    
           if (Debug) {
                   printf("length(SubjectDateId) = %d, length(Subject) = %d, length(Date) = %d\n",
                           length(SubjectDateId), length(Subject), length(Date))
                   printf("length(FirstDates) = %d\n", length(FirstDates))
           }
    
           # Create new array to sort by thread. Subscript is
           # earliest date, subject, actual date
           for (i in SubjectDateId) {
                   n = split(i, t, SUBSEP)
                   if (n != 3) {
                           printf("yowsa! n != 3 (n == %d)\n", n) > "/dev/stderr"
                           exit 1
                   }
                   # now have subject, date, message-id in t
                   # create index into Text
                   Thread[FirstDates[t[1]], i] = SubjectDateId[i]
           }
    
           n = asorti(Thread, SortedThread)        # Shazzam!
    
           if (Debug) {
                   printf("length(Thread) = %d, length(SortedThread) = %d\n",
                           length(Thread), length(SortedThread))
           }
           if (n != MessageNum && ! Duplicates) {
                   printf("yowsa! n != MessageNum (n == %d, MessageNum == %d)\n",
                           n, MessageNum) > "/dev/stderr"
    	#               exit 1
           }
    
           if (Debug) {
                   for (i = 1; i <= n; i++)
                           printf("SortedThread[%d] = %s, Thread[SortedThread[%d]] = %d\n",
                                   i, SortedThread[i], i, Thread[SortedThread[i]]) > "DUMP1"
                   close("DUMP1")
                   if (Debug ~ /exit/)
                           exit 0
           }
    
           for (i = 1; i <= MessageNum; i++) {
                   if (Debug) {
                           printf("Date[%d] = %s\n",
                                   i, strftime("%c", Date[i]))
                           printf("Subject[%d] = %s\n", i, Subject[i])
                   }
    
                   printf("%s", Text[Thread[SortedThread[i]]]) > "OUTPUT"
           }
           close("OUTPUT")
    
           close("/dev/stderr")    # shuts up --lint
    }
    

    compute_date

    Pull apart a date string and convert to timestamp.

    function compute_date(date_rec,         fields, year, month, day,
                                           hour, min, sec, tzoff, timestamp)
    {
           split(date_rec, fields, "[:, ]+")
           if ($2 ~ /Sun|Mon|Tue|Wed|Thu|Fri|Sat/) {
                   # Date: Thu, 05 Jan 2006 17:11:26 -0500
                   year = fields[5]
                   month = Month[fields[4]]
                   day = fields[3] + 0
                   hour = fields[6]
                   min = fields[7]
                   sec = fields[8]
                   tzoff = fields[9] + 0
           } else {
                   # Date: 05 Jan 2006 17:11:26 -0500
                   year = fields[4]
                   month = Month[fields[3]]
                   day = fields[2] + 0
                   hour = fields[5]
                   min = fields[6]
                   sec = fields[7]
                   tzoff = fields[8] + 0
           }
           if (tzoff == "GMT" || tzoff == "gmt")
                   tzoff = 0
           tzoff /= 100    # assume offsets are in whole hours
           tzoff = -tzoff
    
           # crude compensation for timezone
           # mktime() wants a local time:
           #       hour + tzoff yields GMT
           #       GMT + LocalOffset yields local time
           hour += tzoff + LocalOffset
    
           # if moved into next day, reset other values
           if (hour > 23) {
                   hour %= 24
                   day++
                   if (day > days_in_month(month, year)) {
                           day = 1
                           month++
                           if (month > 12) {
                                   month = 1
                                   year++
                           }
                   }
           }
    
           timestamp = mktime(sprintf("%d %d %d %d %d %d -1",
                                   year, month, day, hour, min, sec))
    
           # timestamps can be 9 or 10 digits.
           # canonicalize them into 11 digits with leading zeros
           return sprintf("%011d", timestamp)
    }
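
    compute_date above assumes that timezone offsets are whole hours. As a hedged sketch of how that assumption could be relaxed, the helper below converts an offset such as -0500 or +0530 into minutes east of GMT. The name tz_minutes is hypothetical, and anything that is not a signed four-digit offset (for example "GMT") is treated as zero, which is an assumption of this sketch and not behaviour of the original script.

```shell
# tz_minutes OFFSET: print the offset in minutes (negative means west of GMT).
tz_minutes () {
  awk -v tz="$1" 'BEGIN {
    # Anything other than +hhmm or -hhmm is treated as GMT in this sketch.
    if (tz !~ /^[+-][0-9][0-9][0-9][0-9]$/) { print 0; exit }
    sign = (substr(tz, 1, 1) == "-") ? -1 : 1
    print sign * (substr(tz, 2, 2) * 60 + substr(tz, 4, 2))
  }'
}

tz_minutes +0530    # prints 330
```

    Subtracting this value (in minutes) from the local time in the header yields GMT, including for half-hour zones.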
    

    days_in_month

    How many days in the given month?

    function days_in_month(month, year)
    {
           if (month != 2)
                   return MonthDays[month]
    
           # Gregorian rule: every 4th year, except centuries not divisible by 400
           if ((year % 4 == 0 && year % 100 != 0) || year % 400 == 0)
                   return 29
    
           return 28
    }
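
    The Gregorian leap-year rule that days_in_month relies on can be checked in isolation: a year is a leap year if it is divisible by 4, except for century years, which must also be divisible by 400. The helper name is_leap is hypothetical and used only for this illustration.

```shell
# is_leap YEAR: print 1 for a leap year, 0 otherwise.
is_leap () {
  awk -v y="$1" 'BEGIN {
    leap = ((y % 4 == 0 && y % 100 != 0) || y % 400 == 0)
    print leap
  }'
}

is_leap 2000    # prints 1 (divisible by 400)
is_leap 1900    # prints 0 (century year not divisible by 400)
```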
    

    canonacalize_subject

    Trim out "Re:", white space.

    function canonacalize_subject(subj_line)
    {
           subj_line = tolower(subj_line)
           sub(/^subject: +/, "", subj_line)
           sub(/^(re: *)+/, "", subj_line)
           sub(/[[:space:]]+$/, "", subj_line)
           gsub(/[[:space:]]+/, " ", subj_line)
    
           return subj_line
    }
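
    The thread ordering used by the script (earliest date in the thread, then subject, then each message's own date) can be imitated without gawk's asorti() by building a sortable key and handing it to sort. The sketch below is not the author's method. It assumes a simplified, hypothetical input of one message per line in the form "timestamp TAB subject", and the function name thread_order is made up for the example.

```shell
# thread_order: read "timestamp<TAB>subject" lines, print them grouped by thread.
thread_order () {
  awk -F '\t' '
    {
      subj = tolower($2)
      sub(/^(re:[ \t]*)+/, "", subj)      # replies share the original subject
      date[NR] = $1; subject[NR] = subj; line[NR] = $0
      if (!(subj in first) || $1 < first[subj])
        first[subj] = $1                  # earliest date seen in this thread
    }
    END {
      # Key: earliest thread date, subject, own date; zero-padded so that
      # a plain text sort orders the numbers correctly.
      for (i = 1; i <= NR; i++)
        printf "%011d\t%s\t%011d\t%s\n",
               first[subject[i]], subject[i], date[i], line[i]
    }' | LC_ALL=C sort | cut -f 4-
}

printf '300\tRe: Awk tips\n100\tGawk debugger\n200\tAwk tips\n150\tRe: Gawk debugger\n' |
  thread_order
```

    The zero padding plays the same role as the "%011d" format in compute_date: it lets timestamps of different lengths compare correctly as strings.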
    

    Copyright

    Copyright 2007, 2008, Arnold David Robbins arnold@skeeve.com


    categories: Engineering,June,2009,Admin

    Awk for Engineering

    These pages focus on using Awk for analysis in engineering domains.


    categories: Engineering,June,2009,EisaA

    Awk for Chemical Engineers

    A style seen in many Awk libraries is lots of small scripts, each handling a very specific task.

    A good example of this style is Eiso Ab's library of scripts for chemical engineering. Shown below are dozens of his scripts. His library is an interesting example of real-world Awk programming.

    You can download a tgz of all awk and other scripts from http://www.nmr.chem.uu.nl/~eiso/scripts.tgz. Please direct all bug reports and questions to eiso@nmr.chem.uu.nl

    help ass2shift.awk (Jan-23-2008) purpose: read in anything with assignments or chemical shifts, check consistency, and write a shift list
    help ppm2prot.awk (Jun-12-12:15) generate an xeasy .prot file from another shift.list or ppm.out file
    help xpk2peaks.awk (Sep-20-2007)
    help pdb2iupac.awk (Feb-11-2005)
    help pdb2pdb.awk (Jun-12-12:17) purpose: - reformat ATOM records for various conventions - set B-factors for residues and/or atoms
    help seq2shift.awk (Nov-21-2007) fill shift list with chemical shift statistics from a database, currently the cyana lib file.
    help predict.awk (Jun-12-12:16) purpose : make list of predicted peaks from shift-lists
    help addass.awk (Feb-11-2005)
    help reref.awk (Jun-12-12:18) compare referencing for one or two peaklists in a 2D-histogram see also plotpeaks.awk
    help plotpeaks.awk (Jun-12-12:15) plot peaks in a 2D graph useful for comparing referencing between peak files or within domains of one peakfile, see also reref.awk
    help calib.awk (Jul-13-2005) determining calibration parameters from bruker acqu file use: calib.awk temp=298 acqu [1] Wishart, D.S; Sykes, B.D. (1995) J. Biomol. NMR., 6, 135-140 1H, 13C and 15N chemical shift referencing in biomolecular NMR
    help mergeshift.awk (Jun-12-12:12)
    help gmx2nmr.awk (Jun-12-12:10) opposite of nmr2gmx.awk convert gmx topology distance and orientation restraints
    help nmr2gmx.awk (Jun-16-13:55) make gromacs topology files for distance restraints and dipolar coupling data
    help diffshift.awk (Apr-23-10:33) compare shift lists e.g. diffshift.awk [ options ] shiftlist1 shiftlist2
    help complete_assignments.awk (Sep-20-2007) adds assignments in xeasy peaklist where one atom of a proton-heteroatom couple is assigned and the remaining assignment is clear.
    help peaks-project.awk (Jun-12-12:13) make lower dimension projections from xeasy peak files
    help peaks-unfold.awk (Jun-12-12:14) unfolds peaks in xeasy peaks files
    help unwatergate.awk (Feb-14-2005) undo the effect of watergate water suppression on peak intensities in nmrview xpk files
    help colorchain2mac.awk (Feb-15-2005) make molmol macro for rainbow-colored spline
    help seq2seq.awk (Sep-20-2007) convert protein aminoacid sequence files from oneletter to threeletter format and vice versa.
    help makehbonds.awk (Feb-11-2005)
    help sparkysave2peaks.awk (Sep-21-2006) convert sparky save files to xeasy peakslist
    help peaks2sparky.awk (Jun-12-12:14) create sparky readable peaklists (.list) example: peaks2sparky.awk protein.seq protein.prot c13-cycle7.peaks
    help addass2sparkysave.awk (Feb-11-2005)
    help shifts2sparky-rl.awk (Sep-20-2007)
    help check_hetero_atom.awk (May-15-2007)
    help splitass.awk (Feb-11-2005)
    help splitnoa.awk (Feb-11-2005)
    help pdb2ariapdb.awk (Feb-11-2005)
    help tblcount.awk (Aug-15-2005) make a table with numbers of ambiguous and unambiguous intra/seq/medium/long/inter-dom restraints it needs a2ps to format the output.
    help upl2tbl.awk (Jun-12-12:20) very simple converter for xeasy .upl files to xplor/cns .tbl files
    help tbl2upl.awk (Apr-22-2005) convert (ambiguous) distance restraints in xplor *.tbl file to xeasy/cyana .upl/lol file Use: tbl2upl.awk name.(seq|pdb) unambig.tbl > out.upl
    help filterpeaks.awk (May-26-2005) filter peaklist for diagonal,water,lowest/highest intensities
    help tabstat.awk (Jun-12-12:21) get stats on values in columns of tables
    help linestat.awk (Jun-12-12:10) perform statistic on a certain number of columns of each line
    help qual-col.awk (Jun-12-12:17) purpose: color residues according to whatcheck bad/poor scores. creates molmol macros
    help make_IDR.awk (Jun-12-12:11) purpose: create restraints for working with proxy residues
    help add-linkers.awk (Sep-20-2007)
    help cyana-renum-lib.awk (Sep-20-2007) renumber the atoms of a cyana residue library entry

    categories: Engineering,June,2009,DavidL

    Awk for Mechanical Engineers

    (Editor's note: This page is adapted from David Leo's excellent website of Awk scripts for mechanical engineering.)

    Here is yet another Awk library for engineering applications. Elsewhere, we have seen an extensive library of chemical engineering scripts. Here, David Leo applies Awk to numerous mechanical engineering tasks. Interestingly, the style of David's code is similar to that seen in the chemical engineering library; i.e. lots of small scripts, each doing a different specific task.

    Library

    Shown below are lists of David's scripts. This library is an interesting example of real-world Awk programming.

    To learn more about these scripts, go to David's Awk scripts site. At that site, you will find:

    • sample input/output files for all these scripts
    • mini-tutorials on the science behind each script.

    Heat Transfer Through a Multi-Layered Wall or Flat Plate

    This script calculates the heat transfer through a flat wall or plate made up of several material layers and having convection heat transfer on both sides of the wall or plate.
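
    To give a feel for this kind of calculation, here is a minimal sketch of the standard series-resistance formula for a flat wall: the heat flux is the temperature difference divided by the sum of the two convection resistances (1/h) and the conduction resistance of each layer (thickness/conductivity). It is not David's script. The function name wall_heat_flux and the input layout (first line "T_hot h_hot T_cold h_cold", then one "thickness conductivity" line per layer, SI units) are assumptions made for the example.

```shell
# wall_heat_flux: print the heat flux in W/m^2 for the wall described on stdin.
wall_heat_flux () {
  awk '
    NR == 1 { t_hot = $1; h_hot = $2; t_cold = $3; h_cold = $4
              r = 1 / h_hot + 1 / h_cold      # convection on both sides
              next }
    NF >= 2 { r += $1 / $2 }                  # conduction: thickness / k
    END     { if (r > 0) printf "%.2f\n", (t_hot - t_cold) / r }'
}

# 20 C inside, 0 C outside, h = 10 W/m^2K on both sides, two layers:
printf '20 10 0 10\n0.1 0.5\n0.3 0.5\n' | wall_heat_flux    # prints 20.00
```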

    Heat Transfer From a House

    This script calculates the overall average heat transfer out of a house through a winter. It also calculates and compares oil heat to a geothermal heat pump.

    External Flat Plate Flow (Laminar-to-Turbulent HTC's)

    This script calculates the convection heat transfer coefficient on the surface of a flat plate with fluid flowing over it. The boundary layer may transition from laminar to turbulent, as established by the critical Reynolds number (a user input value).

    Heat Transfer Through the Wall of a Multi-Layered Pipe, Tube or Duct

    This script calculates the heat transfer through a pipe, tube or duct made up of several material layers and having convection heat transfer on both sides of the wall.
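
    For a cylindrical wall the same series-resistance idea applies, but each conduction layer contributes ln(r_outer/r_inner)/(2*pi*k) per unit length, and each convection film contributes 1/(2*pi*r*h). The sketch below illustrates that textbook formula and is not David's script. The name pipe_heat_loss and the input layout (first line "T_in h_in T_out h_out inner_radius", then one "outer_radius conductivity" line per layer, SI units) are assumptions.

```shell
# pipe_heat_loss: print the heat flow per metre of pipe (W/m).
pipe_heat_loss () {
  awk '
    BEGIN   { pi = atan2(0, -1) }
    NR == 1 { t_in = $1; h_in = $2; t_out = $3; h_out = $4; r = $5
              res = 1 / (2 * pi * r * h_in)            # inner film
              next }
    NF >= 2 { res += log($1 / r) / (2 * pi * $2)        # one wall layer
              r = $1 }
    END     { if (NR == 0) exit
              res += 1 / (2 * pi * r * h_out)           # outer film
              printf "%.1f\n", (t_in - t_out) / res }'
}

# 100 C fluid, 0 C surroundings, 0.05 m bore, one 0.05 m thick layer:
printf '100 1000 0 10 0.05\n0.10 0.5\n' | pipe_heat_loss
```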

    Internal Forced Convection Coefficients in a Pipe, Tube or Duct

    This script calculates the internal heat transfer coefficients for flow through an internal passage (pipe, tube, duct). It includes an "entrance effect" where the coefficients are larger at the inlet, as the boundary layer builds up. It also includes the effects of fluid being heated or cooled, and uses a laminar boundary layer if the Reynolds number is below 2300.
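
    Which correlations David's script uses is not stated here. As an illustration only, the sketch below uses two common textbook choices: the Dittus-Boelter correlation Nu = 0.023 Re^0.8 Pr^n for turbulent flow (n = 0.4 when the fluid is heated, 0.3 when it is cooled) and the fully developed laminar value Nu = 3.66 for a constant wall temperature. It ignores the entrance effect described above, and the function name pipe_nusselt is hypothetical.

```shell
# pipe_nusselt REYNOLDS PRANDTL MODE   (MODE is "heat" or "cool")
pipe_nusselt () {
  awk -v re="$1" -v pr="$2" -v mode="$3" 'BEGIN {
    if (re < 2300)
      nu = 3.66                                   # laminar, fully developed
    else
      nu = 0.023 * re^0.8 * pr^(mode == "heat" ? 0.4 : 0.3)
    printf "%.2f\n", nu
  }'
}

pipe_nusselt 10000 1 heat    # prints 36.45
```

    The heat transfer coefficient then follows from h = Nu * k / D, with k the fluid conductivity and D the passage diameter.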

    External Forced Convection Coefficients on a Pipe, Tube or Duct

    This script calculates the average heat transfer coefficient on the external wall of a pipe, with forced convection (fluid flowing across the pipe at some prescribed velocity).

    Impingement Jet Heat Transfer Coefficients

    Script.

    1DOF Vibrations

    This script calculates the forced, damped response of a 1 degree of freedom mass & spring system. Input file, script file, sample output file

    This is the classic textbook 1DOF response to an applied force of fixed magnitude and varying frequency. A crude bar chart is plotted for a quick visual check.
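
    The textbook steady-state amplitude for a damped 1DOF system is X = (F0/k) / sqrt((1 - r^2)^2 + (2*zeta*r)^2), where r is the ratio of forcing frequency to natural frequency and zeta is the damping ratio. The sketch below evaluates it for a list of frequency ratios and draws a crude bar of four stars per unit of magnification. The function name response_1dof, the argument order, and the bar scale are assumptions, not details of David's script.

```shell
# response_1dof F0 K ZETA: frequency ratios on stdin, one per line.
response_1dof () {
  awk -v f0="$1" -v k="$2" -v zeta="$3" '{
    r = $1
    x = (f0 / k) / sqrt((1 - r * r)^2 + (2 * zeta * r)^2)
    stars = int(x / (f0 / k) * 4 + 0.5)      # 4 stars per unit magnification
    bar = ""
    for (i = 0; i < stars; i++) bar = bar "*"
    printf "%.2f %.4f %s\n", r, x, bar
  }'
}

printf '0\n0.5\n1\n2\n' | response_1dof 10 100 0.1
```

    At r = 1 the amplitude is limited only by damping, X = (F0/k) / (2*zeta), which is where the bar chart peaks.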

    Beam Vibration Frequencies

    This script calculates the first few natural frequencies of beams with common end conditions. It allows added distributed weight and a G-level for simulating static shock, and calculates the resulting peak deflection and peak stress. Input file, script file, sample output file

    Rotor Shaft Vibrations

    This script calculates the first natural frequency of a uniform rotor shaft on resilient bearings. A distributed weight may be added. It also calculates the damped, forced response to a specified (oz-in) unbalance at the midspan of the shaft. Input file, script file, sample output file

    The unbalance force is F = m*r*(RPM * pi / 30)^2, which increases with speed. The m*r term is converted from the commonly specified oz-in to the correct units for the calculation. The response is calculated as a function of frequency ratio, using the classic textbook equation for a 1 degree of freedom system.

    Critical frequencies are explicitly calculated as well. Finally, a crude bar chart is plotted for a quick visual check.
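
    As a worked illustration of the unbalance force formula, the sketch below converts an unbalance given in oz-in to a force in lbf. The conversion used here is an assumption about what "the correct units" means: dividing by 16 turns ounces into pounds-force, and dividing by g = 386.09 in/s^2 turns weight into mass. The function name unbalance_force_lbf is hypothetical.

```shell
# unbalance_force_lbf OZ_IN RPM: print the rotating unbalance force in lbf.
unbalance_force_lbf () {
  awk -v u="$1" -v rpm="$2" 'BEGIN {
    pi = atan2(0, -1)
    omega = rpm * pi / 30          # shaft speed in rad/s
    mr = u / 16 / 386.09           # oz-in converted to lbf-s^2
    printf "%.1f\n", mr * omega^2
  }'
}

unbalance_force_lbf 1 3600    # prints 23.0
```

    Doubling the speed quadruples the force, since the force grows with the square of RPM.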

    Basic Heat Exchanger Sizing

    This script is a simple calculation of the heat transfer requirements for a heat exchanger. This is taken as U * A, where U = the overall heat transfer coefficient of the design, and A = the total heat transfer area between the two fluids.

    The user can tweak any or all of the input variables until Qhot = Qcool. At that condition, the total heat lost by one fluid equals the total heat gained by the other fluid. This equality must be achieved (by user input variables) or the resulting answers will be incorrect. The script was written this way to allow the user infinite latitude for tweaking whatever variables desired. But the requirement that Qhot = Qcool must be met.
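
    The energy balance described above can be sketched as follows. Each stream's heat duty is mass flow times specific heat times temperature change, and the two duties must match before U * A means anything. To turn the duty into U * A, this sketch uses the log mean temperature difference for a counterflow arrangement, which is an assumption of the example and may differ from what David's script does. The function name, argument order, and the 0.1% balance tolerance are also hypothetical.

```shell
# heat_exchanger_ua M_HOT CP_HOT TH_IN TH_OUT M_COLD CP_COLD TC_IN TC_OUT
# Units assumed: kg/s, J/(kg K), degrees C. Prints U*A in W/K.
heat_exchanger_ua () {
  awk -v mh="$1" -v cph="$2" -v thi="$3" -v tho="$4" \
      -v mc="$5" -v cpc="$6" -v tci="$7" -v tco="$8" 'BEGIN {
    q_hot  = mh * cph * (thi - tho)      # heat given up by the hot fluid
    q_cold = mc * cpc * (tco - tci)      # heat picked up by the cold fluid
    diff = q_hot - q_cold
    if (diff < 0) diff = -diff
    if (diff > 0.001 * q_hot) { print "unbalanced"; exit }
    dt1 = thi - tco                      # counterflow end differences
    dt2 = tho - tci
    lmtd = (dt1 == dt2) ? dt1 : (dt1 - dt2) / log(dt1 / dt2)
    printf "%.1f\n", q_hot / lmtd
  }'
}

heat_exchanger_ua 1 4000 90 60 2 4000 20 35    # prints 2547.6
```

    When the two duties disagree the sketch prints "unbalanced" instead of a number, mirroring the requirement that Qhot = Qcool must be met before the answer can be trusted.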
