Awk.Info

"Cause a little auk awk
goes a long way."

About awk.info
 »  table of contents
 »  featured topics
 »  page tags


About Awk
 »  advocacy
 »  learning
 »  history
 »  Wikipedia entry
 »  mascot
Implementations
 »  Awk (rarely used)
 »  Nawk (the-one-true, old)
 »  Gawk (widely used)
 »  Mawk
 »  Xgawk (gawk + xml + ...)
 »  Spawk (SQL + awk)
 »  Jawk (Awk in Java JVM)
 »  QTawk (extensions to gawk)
 »  Runawk (a runtime tool)
 »  platform support
Coding
 »  one-liners
 »  ten-liners
 »  tips
 »  the Awk 100
Community
 »  read our blog
 »  read/write the awk wiki
 »  discussion news group

Libraries
 »  Gawk
 »  Xgawk
 »  the Lawker library
Online doc
 »  reference card
 »  cheat sheet
 »  manual pages
 »  FAQ

Reading
 »  articles
 »  books:

WHAT'S NEW?

Mar 01: Michael Sanders demos an X-windows GUI for AWK.

Mar 01: Awk100#24: A. Lahm and E. de Rinaldis' patent search, in AWK

Feb 28: Tim Menzies asks this community to write an AWK cookbook.

Feb 28: Arnold Robbins announces a new debugger for GAWK.

Feb 28: Awk100#23: Premysl Janouch offers a IRC bot, In AWK

Feb 28: Updated: the AWK FAQ

Feb 28: Tim Menzies offers a tiny content management system, in Awk.

Jan 31: Comment system added to awk.info. For example, see discussion bottom of ?keys2awk

Jan 31: Martin Cohen shows that Gawk can handle massively long strings (300 million characters).

Jan 31: The AWK FAQ is being updated. For comments/ corrections/ extensions, please mail tim@menzies.us

Jan 31: Martin Cohen finds Awk on the Android platform.

Jan 31: Aleksey Cheusov released a new version of runawk.

Jan 31: Hirofumi Saito contributes a candidate Awk mascot.

Jan 31: Michael Sanders shows how to quickly build an AWK GUI for windows.

Jan 31: Hyung-Hwan Chung offers QSE, an embeddable Awk Interpreter.

[More ...]

Bookmark and Share

categories: Sitemap,Apr,2009,Admin

Featured Topics

These pages are grouped into the topics, listed below (latest one shown first):


categories: Sitemap,Apr,2009,Admin

Table of Contents

210 pages.

3300 million characters (in an Awk string)
999 bottles of beer
AA Tale of Two Tawks
A Web Server in Awk
Advocacy
Amazing Awk Assembler
Amazing Awk Formatter
An Awk Dungeon Adventure Game
Argcol
Arnold Robbins
array.awk
Automated Results Verification
Awk + Ansi-C = OO
Awk and Mail
Awk Cookbook Project
Awk for AI
Awk for Chemical Engineers
Awk for Engineering
Awk for Mechanical Engineers
Awk for system programming
Awk Games
Awk Mug
Awk on Android
Awk snake
Awk's Equivalent to VI's J
Awk++
Awk-Linux
Awk.info
Awk.info Gaining Popularity
Awk100
Awkbot:
Awklisp
awkwords
BBasebase Sim
Brainfuck to C
Building Interpreters with A*
CCheckers
Coding
Columnate
Community
Contact
Convert Code Comments to Latex
Correlation between numbers
Credits
DDatabases
Davinci mascot
Debugger and Assertion Checker
Domain-Specific Langauges
EEd Morton
Eliza
Errata: WHINY_USERS slows down Awk
Ethiopian Multiplication
Explaining Awk One Liners
FFast Clustering
Faster Hashing in Mawk
Finite State Machine Generator
Forloops
Format Shell Scripts
Four Keys to Awk
Functional Challenge
Functional Enumeration
Functional Gawk
GGenerating random sigs
Get YouTube Vids
Getline
GetXML.awk
Graph
Great Auks
HHandy One Liners
Hiding Email Address
History
Holiday date routines
How to call Awk form "C" with Libmawk
How to contribute
How to Read Minds
IIn praise of scripting
Interpreters
Interview with Arnold Robbins
Intrusion alert normalization
IRC agent in AWK
Issue report mining
JJawk = Java + Awk
Jim Hart
join.awk
LLanguage Analysis
Learning Awk
levenshtein.awk
Lexical and Grammar Analysis
List of Tags
Mm1 : simple macros
m5 : macro processor
Macro pre-processors
Mail sort
Markdown
Mascot
Mastermind
Mastermind (again)
Mawk: faster than C, C++, Java, Perl, Ruby....
md2html : Update to Markdown.awk
MicroTracer
Mike Langman
Monty Hall Problem
Moving Files with Awk
Music analysis workbench
Music tools and Awk
MySql
NNaive Bayes Classifier
Negotiate
Network monitoring in Awk
New AWK debugger
New Awk Mascot ('AWK-eye the Dwarf)?
New mascot
NoSQL
OOne Liners
OO tools in Awk
Operating Systems and Awk
PParallel Awk
Parser Generator
Patent search for genes or proteins
Playing music
Postscript tricks
Predicting Gender
Pretty print
Print an Array
Print Some Postscript Pages
Printing ranges
Processing Bitmaps in Gawk
Project Tools
QQSE: an embeddable Awk Interpreter
QTawk
Quicksort
Quicksort2
RRandom Numbers in Gawk
Reading RSS feeds
Regular Expression Matching Can Be Simple And Fast
Resistor Calculation
Reverse Postscript pages
Ronald Loui
Rot13 in Awk
runawk
Runawk 0.16
Runawk 0.17
Runawk 0.18
Runawk 0.19
SSamples of AWK
Sed in Awk
Sed to Awk
Sed-clones (in Awk)
Shorten my pipes
Shuffle.awk
Simple Awk GUIs in Windows
Simple Stream Editor
Simulations Unicast Applications
Soccer
Sorting
Sorting Arrays Via the Shell
Spam Filtering
Spawk for SuSE Linux
Spawk in GoogleCode
spell.awk
spellcheck.awk
Spreadsheets and Awk.
SQL Powered AWK
Steffen Schuler
Sudoku
Super-For loops
Sys Admin tricks in Awk
SysAdmins: Awk is Your Friend
TTable of Contents
Teaching Awk
Template-driven programming
Ten Liners
Tex-to-bilingual Dictionary
Text Mining
Text Munging
The Awk Book's Code
The Secret WHINY_YSERS flag
The TinyTim Content Management System
Tic-tac-toe
Tim Menzies
Top 10 pages
Top 10 posters
Top 10 posters last year
Top 10 subjects
Top 10 subjects last year
Topics
Towers of Hanoi
UUML: sequence diagrams
Unit tests
Using Awk for Databases
Using field names to reference columns
VVerification
Visual Awk
WWaclaw Sierpinski's Triangle
Why Gawk?
Widen bitmaps, using Gawk
Word-processing in Awk
Writing SciFi
XXgawk for Windows
XML and Awk
XML: Checking for Well-Formednes
XML: Dealing with DTDs
XML: Display components
XML: printing an outline
XML: pulling out data
XMLgawk
xmlparse.awk
Xmonth: Gawk+X-windows GUI
YYawk
ZZipf's Law

categories: Sitemap,Apr,2009,Admin

Page Tags

NumberTag
42 Admin
24 Tools
24 Awk100
19 Timm
13 Tips
13 TenLiners
12 ArnoldR
11 Top10
11 Papers
10 XML
10 Ronl
10 Misc
10 June
10 Games
9 Who
9 Dsl
8 Learn
7 Wp
7 Project
7 EdM
7 Databases
6 TextMining
6 Sept
6 Interpreters
5 Xgawk
5 WhyAwk
5 Steffen
5 Runawk
5 Os
5 Mascot
5 JurgenK
5 AlexC
4 Verification
4 SysAdmin
4 Sorting
4 Sed
4 PanosP
4 Newsgroup
4 Jimh
4 HenryS
4 Funky
4 Engineering
3 Spawk
3 Sitemap
3 Ps
3 Oo
3 OneLiners
3 Music
3 Mawk
3 Mail
3 Macros
3 Contribute
3 Arrays
2 Ysa
2 TedD
2 Spell
2 Sigs
2 ScottS
2 MichealS
2 MichaelS
2 MartinC
2 JonB
2 JesusG
2 Graphics
2 GrantC
2 GUI
2 Function
2 Eliza
2 DonaldM
2 DavidL
2 DariusB
2 BrianK
2 AwkLisp
2 Anon
2 AaronH
1 Zazzle
1 YungC
1 Yawk
1 YasumasaS
1 WolfganZ
1 WmM
1 WimVB
1 WillW
1 WilhelmW
1 Web
1 WWW
1 VictorA
1 VenkatesanS
1 TimS
1 TimM
1 TiborP
1 TerryB
1 Sudoku
1 StevenH
1 SteveL
1 SteveJ
1 SteveC
1 StephenJ
1 Stats
1 ScottP
1 SallyF
1 RussC
1 Rss
1 PremyslJ
1 PierreG
1 PhilipB
1 PeterW
1 PeterK
1 PeterI
1 PPuri
1 OsamuA
1 News
1 NelsonB
1 Negotiate
1 Name
1 MikhailA
1 Mikel
1 MartinF
1 MarkB
1 M0J0
1 LotharS
1 Libmawk
1 KimD
1 KennyM
1 JuergenK
1 JohnF
1 JohnD
1 JiirL
1 JanisP
1 JanW
1 JamesL
1 JMellander
1 Irc
1 HyungC
1 HiroS
1 HermannP
1 GregoryG
1 Getline
1 GerardH
1 Forloop
1 Errata
1 EricP
1 EisaA
1 DickL
1 DebbieF
1 DavidH
1 Dates
1 DataMining
1 DanN
1 Dab
1 Cookbook
1 CarloS
1 CMS
1 BrianJ
1 BrendanO
1 Boris
1 BobO
1 BillP
1 Baseballsim
1 BalkhisB
1 Awk
1 Argcol
1 April
1 Android
1 AlfredA
1 AlexS
1 AlexR
1 AlanL
1 ALahm

categories: Top10,Misc,Papers,WhyAwk,Apr,2009,Ronl

In Praise of Scripting: Real Programming Pragmatism

by Ronald P. Loui
Associate Professor of CSE
Washington University in St. Louis

(Pre-publication draft; copyright reserved by author. A subsequent version of this document appeared as IEEE Computer, vol. 41, no. 7, July 2008).

    This article's main purpose is to review the changes in programming practices known collectively as the "rise of scripting," as predicted in 1998 IEEE COMPUTER by Ousterhout. This attempts to be both brief and definitive, drawing on many of the essays that have appeared in online forums. The main new idea is that programming language theory needs to move beyond semantics and take language pragmatics more seriously.

To the credit of this journal, it had the courage to publish the signal paper on scripting, John Ousterhout's "Scripting: Higher Level Programming for the 21st Century" in 1998. Today, that document rolls forward with an ever-growing list of positive citations. More importantly, every major observation in that paper seems now to be entrenched in software practice today; every benefit claimed for scripting appears to be genuine (flexibility of typelessness, rapid turnaround of interpretation, higher level semantics, development speed, appropriateness for gluing components and internet programming, ease of learning and increase in amount of casual programming).

Interestingly, IEEE COMPUTER also just printed one of the most canonical attacks on scripting, by one Diomidis Spinellis, 2005, "Java Makes Scripting Languages Irrelevant?" Part of what makes this attack interesting is that the author seems unconvinced of his own title; the paper concludes with more text devoted to praising scripting languages than it expends in its declaration of Java's progress toward improved usability. It is unclear what is a better recommendation for scripting: the durability of Ousterhout's text or the indecisiveness of this recent critic's.

The real shock is that the academic programming language community continues to reject the sea change in programming practices brought about by scripting. Enamored of the object-oriented paradigm, especially in the undergraduate curriculum, unwilling to accept the LAMP (Linux-Apache-MySQL-Perl/Python/Php) tool set, and firmly believing that more programming theory leads to better programming practice, the academics seem blind to the facts on the ground. The ACM flagship, COMMUNICATIONS OF THE ACM for example, has never published a paper recognizing the scripting philosophy, and the references throughout the ACM Digital Library to scripting are not encouraging.

Part of the problem is that scripting has risen in the shadow of object-oriented programming and highly publicized corporate battles between Sun, Netscape, and Microsoft with their competing software practices. Scripting has been appearing language by language, including object-oriented scripting languages now. Another part of the problem is that scripting is only now mature enough to stand up against its legitimate detractors. Today, there are answers to many of the persistent questions about scripting: is there a scripting language appropriate for the teaching of CS1 (the first programming course for majors in the undergraduate computing curriculum)? Is there a scripting language for enterprise or real-time applications? Is there a way for scripting practices to scale to larger software engineering projects?

I intend to review the recent history briefly for those who have not yet joined the debate, then present some of the answers that scripting advocates now give to those nagging questions. Finally, I will describe how a real pragmatism of academic interest in programming languages would have better prepared the academic computing community to see the changes that have been afoot.

1996-1998 are perhaps the most interesting years in the phylogeny of scripting. In those years, perl "held the web together", and together with a new POSIX awk and GNU gawk, was shipping with every new Linux. Meanwhile javascript was being deployed furiously (javascript bearing no important relation to java, having been renamed from "livescript" for purely corporate purposes, apparently a sign of Netscape's solidarity with Sun, and even renamed "jscript" under Microsoft). Also, a handoff from tcl/tk to python was taking place as the language of choice for GUI developers who would not yield to Microsoft's VisualBasic. Php appeared in those years, though it would take another round of development before it would start displacing server-side perl, cold fusion, and asp. Every one of these languages is now considered a classic, even prototypical, scripting language.

Already by mid-decade, the shift from scheme to java as the dominant CS1 language was complete, and the superiority of c++ over c was unquestioned in industry. But java applets were not well supported in browsers, so the appeal of "write once, run everywhere" quickly became derided as "write once, debug everywhere." Web page forms, which used the common gateway interface (cgi) were proliferating, and systems programming languages like c became recognized as overkill for server-side programming. Developers quickly discovered the main advantage of perl for cgi forms processing, especially in the dot-com setting: it minimized the programmer's write-time. What about performance? The algorithms were simple, network latency masked small delays, and database performance was built into the database software. It turned out that the bottleneck was the programming. Even at run-time, the network and disk properties were the problems, not the cpu processing. What about maintenance? The developers and management were both happy to rewrite code for redesigned services rather than deal with legacy code. Scripting, it turns out, was so powerful and programmer-friendly that it was easier to create new scripts from scratch than to modify old programs. What about user interface? After all, by 1990, most of the programming effort had become the writing of the GUI, and the object-oriented paradigm had much of its momentum in the inheritance of interface widget behaviors. Surprisingly, the interface that most programmers needed could be had in a browser. The html/javascript/cgi trio became the GUI, and if more was needed, then ambitious client-side javascript was more reliable than the browser's java virtual machine. Moreover, the server-side program was simply a better way to distribute automation in a heterogeneous internet than the downloadable client-side program, regardless of whether the download was in binary or bytecode.

Although there was not agreement on what exact necessary and sufficient properties characterized scripting and distinguished it from "more serious" programming, several things were clear:

  • scripting permitted rapid development, often regarded as merely "rapid prototyping," but subsequently recognized as a kind of agile programming;
  • scripting was the kind of high-level programming that had always been envisioned, in the ascent from low-level assembly language programming to higher levels of abstraction: it was concise, and it removed programmers from concerning themselves with many performance and memory management details;
  • scripting was well suited to the majority of a programming task, usually the accumulation, extraction, and transformation of data, followed eventually by its presentation, so that only the performance-critical portion of a project had to be written in a more cumbersome, high-performance language;
  • it was easier to get things right when source code was short, when behavior was determined by code that fit on a page, all types were easily coerced into strings for trace-printing, code fragments could be interpreted, identifiers were short, and when the programmer could turn ideas into code quickly without losing focus.

This last point was extremely counterintuitive. Strong typing, naming regimen, and verbosity were motivated mainly by a desire to help the programmer avoid errors. But the programmer who had to generate too many keystrokes and consult too many pages, who had to search through many different files to discover semantics, and who had to follow too many rules, who had to sustain motivation and concentration over a long period of time, was a distracted and consequently inefficient programmer. Just as vast libraries did not deliver the promise of greater reusability, and virtual machines did not deliver the promise of platform-independence, the language's promise to discipline the programmer quite simply did not reduce the tendency of humans to err. It exchanged one kind of frequent error for another.

Scripting languages became the favorite tools of the independent-minded programmers: the "hackers" yes, but also the gifted and genius programmers who tended to drive a project's design and development. As Paul Graham noted (in a column reprinted in "Hackers and Painters" or this), one of the lasting and legitimate benefits of java is that it permits managers to level the playing field and extract considerable productivity from the less talented and less motivated programmers (hence, more disposable). There was a corollary to this difference between the mundane and the liberating:

  • scripting was not enervating but was actually renewing: programmers who viewed code generation as tedious and tiresome in contrast viewed scripting as rewarding self-expression or recreation.

The distinct features of scripting languages that produce these effects are usually enumerated as semantic features, starting with low I/O specification costs, the use of implicit coercion and weak typing, automatic variable initialization with optional declaration, predominant use of associative arrays for storage and regular expressions for pattern matching, reduced syntax, and powerful control structures. But the main reason for the productivity gains may be found in the name "scripting" itself. To script an environment is to be powerfully embedded in that environment. In the same way that the dolphin reigns over the open ocean, lisp is a powerful language for those who would customize their emacs, javascript is feral among browsers, and gawk and perl rule the linux jungle.

There is even a hint of AI in the idea of scripting: the scripting language is the way to get high level control, to automate by capturing the intentions and routines normally provided by the human. If recording and replaying macros is a kind of autopilot, then scripting is a kind of proxy for human decisionmaking. Nowhere is this clearer than in simple server-side php, or in sysadmin shell scripting.

So where do we stand now? While it may have been risky for Ousterhout to proclaim scripting on the rise in 1998, it would be folly to dismiss the success of scripting today. It is even possible that java will yield its position of dominance in the near future. (By the time this essay is printed, LAMP and AJAX might be the new darlings of the tech press; see recent articles in Business Week, this IEEE COMPUTER, and even James Gosling's blog where he concedes he was wanting to write a scripting language when he was handed the java project. Java is very much in full retreat.) Is scripting ready to fill the huge vacuum that would be produced?

I personally believe that CS1 java is the greatest single mistake in the history of computing curricula. I believe this because of the empirical evidence, not because I have an a priori preference (I too voted to shift from scheme to java in our CS1, over a decade ago, so I am complicit in the java debacle). I reported in SIGPLAN 1996 ("Why gawk for AI?") that only the scripting programmers could generate code fast enough to keep up with the demands of the artificial intelligence laboratory class. Even though students were allowed to choose any language they wanted, and many had to unlearn the java ways of doing things in order to benefit from scripting, there were few who could develop ideas into code effectively and rapidly without scripting. In the intervening decade, little has changed. We actually see more scripting, as students are happy to compress images so that they can script their computer vision projects rather than stumble around in c and c++. In fact, students who learn to script early are empowered throughout their college years, especially in the crucial UNIX and web environments. Those who learn only java are stifled by enterprise-sized correctness and the chimerae of just-in-time compilation, swing, JRE, JINI, etc. Young programmers need to practice and produce, and to learn through mistakes why discipline is needed. They need to learn design patterns by solving problems, not by reading interfaces to someone else's black box code. It is imperative that programmers learn to be creative and inventive, and they need programming tools that support code exploration rather than code production.

What scripting language could be used for CS1? My personal preferences are gawk, javascript, php, and asp, mainly because of their very gentle learning curves. I don't think perl would be a disaster; its imperfection would create many teaching moments. But there is emerging consensus in the scripting community that python is the right choice for freshman programming. Ruby would also be a defensible choice. Python and ruby have the enviable properties that almost no one dislikes them, and almost everyone respects them. Both languages support a wide variety of programming styles and paradigms and satisfy practitioners and theoreticians equally. Both languages are carefully enough designed that "correct" programming practices can be demonstrated and high standards of code quality can be enforced. The fact that Google stands by python is an added motivation for undergraduate majors.

But do scripting solutions scale? What about the performance gap when the polynomial, or worse the exponential, algorithm faces large n, and the algorithm is written in an interpreted or weakly compiled language? What about software engineering in the large, on big projects? There has been a lot of discussion about scalability of scripts recently. In the past, debates have simply ended with the concession that large systems would have to be rewritten in c++, or a similar language, once the scripting had served its prototyping duty.

The enterprise question is the easier of the two. Just as the individual programmer reaps benefits from a division of labor among tools, writing most of the code in scripts, and writing all bottleneck code in a highly optimizable language, the group of programmers benefits from the use of multiple paradigms and multiple languages. In a recent large project, we used vhdl for fpga's with a lot of gawk to configure the vhdl. We used python and php to generate dynamic html with svg and javascript for the interfaces. We used c and c++ for high performance communications wrappers, which communicated xml to higher level scripts that managed databases and processes. We saw sysadmin and report-generation in perl, ruby, and gawk, data scrubbing in perl and gawk, user scripting in bash, tcl, and gawk, and prototyping in perl and gawk. Only one module was written in java (because that programmer loved java): it was late, it was slow, it failed, and it was eventually rewritten in c++. In retrospect, neither the high performance components nor the lightweight code components were appropriate for the java language. Does scripting scale to enterprise software? I would not manage a project that did not include a lot of scripting, to minimize the amount of "hard" programming, to increase flexibility and reduce delivery time at all stages, to take testing to a higher level, and to free development resources for components where performance is actually critical. I nearly weep when I think about the text processing that was written in c under my managerial watch, because the programmer did not know perl. We write five hundred line scripts in gawk that would be ten thousand line modules in java or c++. In view of the fact that there are much better scripting tools for most of what gets programmed in java and c++, perhaps the question is whether java and c++ scale.

How about algorithmic complexity? Don't scripting languages take too long to perform nested loops? The answer here is that a cpu-bound tight loop such as a matrix multiplication is indeed faster in a language like c. But such bottlenecks are easy to identify and indeed easy to rewrite in c. True system bottlenecks are things like paging, chasing pointers on disk, process initialization, garbage collection, fragmentation, cache mismanagement, and poor data organization. Often, we see that better data organization was unimplemented because it would have required more code, code that would have been attempted in an "easier" programming language like a scripting language, but which was too difficult to attempt in a "harder" programming language. We saw this in the AI class with heuristic search and computer vision, where brute force is better in c, but complex heuristics are better than brute force, and scripting is better for complex heuristics. When algorithms are exponential, it usually doesn't matter what language you use because most practical n will incur too great a cost. Again, the solution is to write heuristics, and scripting is the top dog in that house. Cpu's are so much faster than disks these days that a single extra disk read can erase the CPU advantage of using compiled c instead of interpreted gawk. In any case, java is hardly the first choice for those who have algorithmic bottlenecks.

The real reason why academics were blindsided by scripting is their lack of practicality. Academic computing was generally late to adopt Wintel architectures, late to embrace cgi programming, and late to accept Linux in the same decade that brought scripting's rise. Academia understandably holds industry at a distance. Still, there is a purely intellectual reason why programming language courses are only now warming to scripting. The historical concerns of programming language theory have been syntax and semantics. Java's amazing contribution to computer science is that it raised so many old-fashioned questions that tickled the talents of existing programming language experts: e.g., how can it be compiled? But there are new questions that can be asked, too, such as what a particular language is well-suited to achieve inexpensively, quickly, or elegantly, especially with the new mix of platforms. The proliferation of scripting languages represents a new age of innovation in programming practice.

Linguists recognize something above syntax and semantics, and they call it "pragmatics". Pragmatics has to do with more abstract social and cognitive functions of language: situations, speakers and hearers, discourse, plans and actions, and performance. We are entering an era of comparative programming language study when the issues are higher-level, social, and cognitive too.

My old friend, Michael Scott, has a popular textbook called PROGRAMMING LANGUAGE PRAGMATICS. But it is a fairly traditional tome concerned with parameter passing, types, and bindings (it's hard to see why it merits "pragmatics" in its title, even as it goes to second edition with a chapter on scripting added!). A real programming pragmatics would ask questions like:

  • how well does each language mate to the other UNIX tools?
  • what is the propensity in each language for programmers at various expertise levels to produce a memory leak?
  • what is the likelihood in each language that unmodified code will still function in five years?
  • what is the demand of a programmer's concentration, what is the load on her short-term memory of ontology, and what is the support for visual metaphor in each language?

There have been programming language "shootouts" and "scriptometers" on the internet that have sought to address some of the questions that are relevant to the choice of scripting language, but they have been just first steps. For example, one site reports on the shortest script in each scripting language that can perform a simple task. But absolute brevity for trivial tasks, such as "print hello world" is not as illuminating as typical brevity for real tasks, such as xml parsing.

Pragmatic questions are not the easiest questions for mathematically-inclined computer scientists to address. They refer by their nature to people, their habits, their sociology, and the technological demands of the day. But it is the importance of such questions that makes programmers choose scripting languages. Ousterhout declared scripting on the rise, but perhaps so too are programming language pragmatics.

Acknowledgements

I have to thank Charlie Comstock for contributing many ideas and references over the past two years that have shaped my views, especially the commitment to the idea of pragmatics.

About the Author

Prof. Dr. Loui and his students are the usual winners of the department programming contest and have contributed to current gnu releases of gawk and malloc. He has lectured on AI for two decades on five continents, taught AI programming for two decades, and is currently funded on a project delivering hardware and software on U.S. government contracts.

References


categories: SysAdmin,Papers,WhyAwk,Apr,2009,HenryS

Awk: A Systems Programming Language?

At the Proceedings of the Winter Usenix Conference (Dallas'91) Henry Spencer wrote in Awk As A Major Systems Programming Language that...

    ...even experienced Unix programmers often don't know awk, or know it but view it as a counterpart of sed: useful "glue" for sticking things together in shell programming, but quite unsuited for major programming tasks. This is a major underestimate of a very powerful tool, and has hampered the development of support software that would make awk much more useful.

    There is no fundamental reason why awk programs have to be small "glue" programs: even the "old" awk is a powerful programming language in its own right. Effective use of its data structures and its stream-oriented structure takes some adjustment for C programmers, but the results can be quite striking.

    On the other hand, getting there can be a bit painful, and improvements in both the language and its support tools would help.

In 2009, Arnold Robbins comments:

    The paper is still interesting, although some bits are outdated (we now have a profiler, for instance).

categories: Games,Apr,2009,Admin

Awk Games

These pages focus on games, written in Awk.


categories: Games,TenLiners,Eliza,Apr,2009,JuergenK

Eliza

Contents

Synopsis

gawk -f eliza.awk

Download

Download from LAWKER.

Description

From Wikipedia: ELIZA was a computer program and an early example of (by modern standards) primitive natural language processing. ELIZA operated by processing users' responses to scripts, the most famous of which was DOCTOR, a simulation of a Rogerian psychotherapist. In this mode, ELIZA mostly rephrased the user's statements as questions and posed those to the 'patient.' ELIZA was written by Joseph Weizenbaum between 1964 to 1966.

Example dialogue

This is a Gawk implementation of the famous ELIZA program by Joseph Weizenbaum.
To end, type SHUT UP

HI, IM ELIZA, TELL ME YOUR PROBLEM
I hate dogs
CAN YOU ELABORATE ON THAT ?
Dogs remind me of Mylie Cyrus
CAN YOU ELABORATE ON THAT ?
You're kidding, right? Mylie Cyrus is one of the biggest stars on the planet
WHAT MAKES YOU THINK I AM KIDDING, RIGHT? MYLIE CYRUS IS ONE OF THE BIGGEST STARS ON THE PLANET ?
Cause you did not know about Mylie
WE WERE DISCUSSING YOU -- NOT ME
No,  I was discussing Mylie
Oh, I ID NOT KNOW ABOUT MYLIE
Well aren't you stupid
Oh, I TUPID
can you get her number?
WE WERE DISCUSSING YOU -- NOT ME
sorry
PLEASE DON'T APOLOGIZE !
but you sound so angry
YOU'RE NOT REALLY TALKING ABOUT ME, ARE YOU ?
i think you are  a robot
Oh, I RE  A ROBOT
yes
I UNDERSTAND
no you don't
YOU'RE NOT REALLY TALKING ABOUT ME, ARE YOU ?
hey- is this eliza?
WHAT'S ABOUT ME ?
I want to take you to dinner
YOU'RE NOT REALLY TALKING ABOUT ME, ARE YOU ?
I feel you are really stupid
WE WERE DISCUSSING YOU -- NOT ME
SHUT UP
WELL, PLEASE PAY YOUR BILL. ITS EXACTLY ... $101

Code

Set up

BEGIN {
	SetUpEliza() 
	print "This is a Gawk implementation of the "\
          "famous ELIZA program by Joseph Weizenbaum. "\
	      "To end, type SHUT UP\n";
	print ElizaSays("");
}
{ print ElizaSays($0) }

ElizaSays

function ElizaSays(YouSay) {
 if (YouSay == "") {
   cost = 0
   answer = "HI, IM ELIZA, TELL ME YOUR PROBLEM"
 } else {
   q = toupper(YouSay)
   gsub("'", "", q)
   if(q == qold) {
     answer = "PLEASE DONT REPEAT YOURSELF !"
   } else {
     if (index(q, "SHUT UP") > 0) {
       answer = "WELL, PLEASE PAY YOUR BILL. ITS EXACTLY ... $"\
                int(100*rand()+30+cost/100)
		1;
     } else {
       qold = q
       w = "-"                 # no keyword recognized yet
       for (i in k) {          # search for keywords
         if (index(q, i) > 0) {
           w = i
           break
         }
       }
       if (w == "-") {         # no keyword, take old subject
         w    = wold
         subj = subjold
       } else {                # find subject
         subj = substr(q, index(q, w) + length(w)+1)
         wold = w
         subjold = subj        #  remember keyword and subject
       }
       for (i in conj)
          gsub(i, conj[i], q)   # conjugation
       # from all answers to this keyword, select one randomly
       answer = r[indices[int(split(k[w], indices) * rand()) + 1]]
       # insert subject into answer
       gsub("_", subj, answer)
     }
   }
 }
 cost += length(answer) # for later payment : 1 cent per character
 return answer
}

SetUpEliza

function SetUpEliza() {
 srand()
 wold = "-"
 subjold = " "

 # table for conjugation
 conj[" ARE "     ] = " AM "
 conj["WERE "     ] = "WAS "
 conj[" YOU "     ] = " I "
 conj["YOUR "     ] = "MY "
 conj[" IVE "     ] =\
 conj[" I HAVE "  ] = " YOU HAVE "
 conj[" YOUVE "   ] =\
 conj[" YOU HAVE "] = " I HAVE "
 conj[" IM "      ] =\
 conj[" I AM "    ] = " YOU ARE "
 conj[" YOURE "   ] =\
 conj[" YOU ARE " ] = " I AM "

 # table of all answers
 r[1]   = "DONT YOU BELIEVE THAT I CAN  _"
 r[2]   = "PERHAPS YOU WOULD LIKE TO BE ABLE TO _ ?"
 r[3]   = "YOU WANT ME TO BE ABLE TO _ ?"
 r[4]   = "PERHAPS YOU DONT WANT TO _ "
 r[5]   = "DO YOU WANT TO BE ABLE TO _ ?"
 r[6]   = "WHAT MAKES YOU THINK I AM _ ?"
 r[7]   = "DOES IT PLEASE YOU TO BELIEVE I AM _ ?"
 r[8]   = "PERHAPS YOU WOULD LIKE TO BE _ ?"
 r[9]   = "DO YOU SOMETIMES WISH YOU WERE _ ?"
 r[10]  = "DONT YOU REALLY _ ?"
 r[11]  = "WHY DONT YOU _ ?"
 r[12]  = "DO YOU WISH TO BE ABLE TO _ ?"
 r[13]  = "DOES THAT TROUBLE YOU ?"
 r[14]  = "TELL ME MORE ABOUT SUCH FEELINGS"
 r[15]  = "DO YOU OFTEN FEEL _ ?"
 r[16]  = "DO YOU ENJOY FEELING _ ?"
 r[17]  = "DO YOU REALLY BELIEVE I DONT _ ?"
 r[18]  = "PERHAPS IN GOOD TIME I WILL _ "
 r[19]  = "DO YOU WANT ME TO _ ?"
 r[20]  = "DO YOU THINK YOU SHOULD BE ABLE TO _ ?"
 r[21]  = "WHY CANT YOU _ ?"
 r[22]  = "WHY ARE YOU INTERESTED IN WHETHER OR NOT I AM _ ?"
 r[23]  = "WOULD YOU PREFER IF I WERE NOT _ ?"
 r[24]  = "PERHAPS IN YOUR FANTASIES I AM _ "
 r[25]  = "HOW DO YOU KNOW YOU CANT _ ?"
 r[26]  = "HAVE YOU TRIED ?"
 r[27]  = "PERHAPS YOU CAN NOW _ "
 r[28]  = "DID YOU COME TO ME BECAUSE YOU ARE _ ?"
 r[29]  = "HOW LONG HAVE YOU BEEN _ ?"
 r[30]  = "DO YOU BELIEVE ITS NORMAL TO BE _ ?"
 r[31]  = "DO YOU ENJOY BEING _ ?"
 r[32]  = "WE WERE DISCUSSING YOU -- NOT ME"
 r[33]  = "Oh, I _"
 r[34]  = "YOU'RE NOT REALLY TALKING ABOUT ME, ARE YOU ?"
 r[35]  = "WHAT WOULD IT MEAN TO YOU, IF YOU GOT _ ?"
 r[36]  = "WHY DO YOU WANT _ ?"
 r[37]  = "SUPPOSE YOU SOON GOT _"
 r[38]  = "WHAT IF YOU NEVER GOT _ ?"
 r[39]  = "I SOMETIMES ALSO WANT _"
 r[40]  = "WHY DO YOU ASK ?"
 r[41]  = "DOES THAT QUESTION INTEREST YOU ?"
 r[42]  = "WHAT ANSWER WOULD PLEASE YOU THE MOST ?"
 r[43]  = "WHAT DO YOU THINK ?"
 r[44]  = "ARE SUCH QUESTIONS IN YOUR MIND OFTEN ?"
 r[45]  = "WHAT IS IT THAT YOU REALLY WANT TO KNOW ?"
 r[46]  = "HAVE YOU ASKED ANYONE ELSE ?"
 r[47]  = "HAVE YOU ASKED SUCH QUESTIONS BEFORE ?"
 r[48]  = "WHAT ELSE COMES TO MIND WHEN YOU ASK THAT ?"
 r[49]  = "NAMES DON'T INTEREST ME"
 r[50]  = "I DONT CARE ABOUT NAMES -- PLEASE GO ON"
 r[51]  = "IS THAT THE REAL REASON ?"
 r[52]  = "DONT ANY OTHER REASONS COME TO MIND ?"
 r[53]  = "DOES THAT REASON EXPLAIN ANYTHING ELSE ?"
 r[54]  = "WHAT OTHER REASONS MIGHT THERE BE ?"
 r[55]  = "PLEASE DON'T APOLOGIZE !"
 r[56]  = "APOLOGIES ARE NOT NECESSARY"
 r[57]  = "WHAT FEELINGS DO YOU HAVE WHEN YOU APOLOGIZE ?"
 r[58]  = "DON'T BE SO DEFENSIVE"
 r[59]  = "WHAT DOES THAT DREAM SUGGEST TO YOU ?"
 r[60]  = "DO YOU DREAM OFTEN ?"
 r[61]  = "WHAT PERSONS APPEAR IN YOUR DREAMS ?"
 r[62]  = "ARE YOU DISTURBED BY YOUR DREAMS ?"
 r[63]  = "HOW DO YOU DO ... PLEASE STATE YOUR PROBLEM"
 r[64]  = "YOU DON'T SEEM QUITE CERTAIN"
 r[65]  = "WHY THE UNCERTAIN TONE ?"
 r[66]  = "CAN'T YOU BE MORE POSITIVE ?"
 r[67]  = "YOU AREN'T SURE ?"
 r[68]  = "DON'T YOU KNOW ?"
 r[69]  = "WHY NO _ ?"
 r[70]  = "DON'T SAY NO, IT'S ALWAYS SO NEGATIVE"
 r[71]  = "WHY NOT ?"
 r[72]  = "ARE YOU SURE ?"
 r[73]  = "WHY NO ?"
 r[74]  = "WHY ARE YOU CONCERNED ABOUT MY _ ?"
 r[75]  = "WHAT ABOUT YOUR OWN _ ?"
 r[76]  = "CAN'T YOU THINK ABOUT A SPECIFIC EXAMPLE ?"
 r[77]  = "WHEN ?"
 r[78]  = "WHAT ARE YOU THINKING OF ?"
 r[79]  = "REALLY, ALWAYS ?"
 r[80]  = "DO YOU REALLY THINK SO ?"
 r[81]  = "BUT YOU ARE NOT SURE YOU _ "
 r[82]  = "DO YOU DOUBT YOU _ ?"
 r[83]  = "IN WHAT WAY ?"
 r[84]  = "WHAT RESEMBLANCE DO YOU SEE ?"
 r[85]  = "WHAT DOES THE SIMILARITY SUGGEST TO YOU ?"
 r[86]  = "WHAT OTHER CONNECTION DO YOU SEE ?"
 r[87]  = "COULD THERE REALLY BE SOME CONNECTIONS ?"
 r[88]  = "HOW ?"
 r[89]  = "YOU SEEM QUITE POSITIVE"
 r[90]  = "ARE YOU SURE ?"
 r[91]  = "I SEE"
 r[92]  = "I UNDERSTAND"
 r[93]  = "WHY DO YOU BRING UP THE TOPIC OF FRIENDS ?"
 r[94]  = "DO YOUR FRIENDS WORRY YOU ?"
 r[95]  = "DO YOUR FRIENDS PICK ON YOU ?"
 r[96]  = "ARE YOU SURE YOU HAVE ANY FRIENDS ?"
 r[97]  = "DO YOU IMPOSE ON YOUR FRIENDS ?"
 r[98]  = "PERHAPS YOUR LOVE FOR FRIENDS WORRIES YOU"
 r[99]  = "DO COMPUTERS WORRY YOU ?"
 r[100] = "ARE YOU TALKING ABOUT ME IN PARTICULAR ?"
 r[101] = "ARE YOU FRIGHTENED BY MACHINES ?"
 r[102] = "WHY DO YOU MENTION COMPUTERS ?"
 r[103] = "WHAT DO YOU THINK MACHINES HAVE TO DO WITH YOUR PROBLEMS ?"
 r[104] = "DON'T YOU THINK COMPUTERS CAN HELP PEOPLE ?"
 r[105] = "WHAT IS IT ABOUT MACHINES THAT WORRIES YOU ?"
 r[106] = "SAY, DO YOU HAVE ANY PSYCHOLOGICAL PROBLEMS ?"
 r[107] = "WHAT DOES THAT SUGGEST TO YOU ?"
 r[108] = "I SEE"
 r[109] = "IM NOT SURE I UNDERSTAND YOU FULLY"
 r[110] = "COME COME ELUCIDATE YOUR THOUGHTS"
 r[111] = "CAN YOU ELABORATE ON THAT ?"
 r[112] = "THAT IS QUITE INTERESTING"
 r[113] = "WHY DO YOU HAVE PROBLEMS WITH MONEY ?"
 r[114] = "DO YOU THINK MONEY IS EVERYTHING ?"
 r[115] = "ARE YOU SURE THAT MONEY IS THE PROBLEM ?"
 r[116] = "I THINK WE WANT TO TALK ABOUT YOU, NOT ABOUT ME"
 r[117] = "WHAT'S ABOUT ME ?"
 r[118] = "WHY DO YOU ALWAYS BRING UP MY NAME ?"
 # table for looking up answers that
 # fit to a certain keyword
 k["CAN YOU"]      = "1 2 3"
 k["CAN I"]        = "4 5"
 k["YOU ARE"]      =\
 k["YOURE"]        = "6 7 8 9"
 k["I DONT"]       = "10 11 12 13"
 k["I FEEL"]       = "14 15 16"
 k["WHY DONT YOU"] = "17 18 19"
 k["WHY CANT I"]   = "20 21"
 k["ARE YOU"]      = "22 23 24"
 k["I CANT"]       = "25 26 27"
 k["I AM"]         =\
 k["IM "]          = "28 29 30 31"
 k["YOU "]         = "32 33 34"
 k["I WANT"]       = "35 36 37 38 39"
 k["WHAT"]         =\
 k["HOW"]          =\
 k["WHO"]          =\
 k["WHERE"]        =\
 k["WHEN"]         =\
 k["WHY"]          = "40 41 42 43 44 45 46 47 48"
 k["NAME"]         = "49 50"
 k["CAUSE"]        = "51 52 53 54"
 k["SORRY"]        = "55 56 57 58"
 k["DREAM"]        = "59 60 61 62"
 k["HELLO"]        =\
 k["HI "]          = "63"
 k["MAYBE"]        = "64 65 66 67 68"
 k[" NO "]         = "69 70 71 72 73"
 k["YOUR"]         = "74 75"
 k["ALWAYS"]       = "76 77 78 79"
 k["THINK"]        = "80 81 82"
 k["LIKE"]         = "83 84 85 86 87 88 89"
 k["YES"]          = "90 91 92"
 k["FRIEND"]       = "93 94 95 96 97 98"
 k["COMPUTER"]     = "99 100 101 102 103 104 105"
 k["-"]            = "106 107 108 109 110 111 112"
 k["MONEY"]        = "113 114 115"
 k["ELIZA"]        = "116 117 118"
}

Author

Juergen Kahrs


categories: TenLiners,Apr,2009,AlanL

Towers of Hanoi

Contents

Synopsis

Description

Options

Example

Details

Globals

Code

Author

Synopsis

gawk -f hanoi.awk [-n Disks]

Description

The objective is to move N discks from stack 0 to stack 1, always putting a smaller disc on top of a larger one. or on an empty stack

Options

-n
Number of disks, defaults to 5.

Example

gawk -f hanoi.awk -n 4
0 4321
1 
2 

0 432
1 
2 1

0 43
1 2
2 1

0 43
1 21
2 

0 4
1 21
2 3

0 41
1 2
2 3

0 41
1 
2 32

0 4
1 
2 321

0 
1 4
2 321

0 
1 41
2 32

0 2
1 41
2 3

0 21
1 4
2 3

0 21
1 43
2 

0 2
1 43
2 1

0 
1 432
2 1

0 
1 4321
2 

Details

Globals

sp[i]
stack pointer for the ith stack = next free space
stack[i,j]
value of stack i at position j

Code

Main:

BEGIN {
  n = arg("-n",5)
  for (j=0; j<n; j++) push(0,n-j)
  showstacks()
  hanoi(n,0,1,2)
}

function hanoi(n,a,b,c) {
  if (n==1) {
    move(a,b)
  } else {
    hanoi(n-1,a,c,b)
    move(a,b)
    hanoi(n-1,c,b,a)
  }
}
function move(i,j) {
  push(j,pop(i))
  showstacks()
}

Showing the stack:

function showstacks(  i,j) {
  for (i=0; i<=2; i++) {
    printf "%s ", i
    for (j=0; j<sp[i]; j++) printf "%s", stack[i,j]
    print "" }
  print ""
}

Standard stuff:

function arg(tag,default) {
  for(i in ARGV) 
	if (ARGV[i] ~ tag) 
		return ARGV[i+1]
  return default
}
function push(i,v) { stack[i,sp[i]++]=v }
function pop(i)    { return stack[i,--sp[i]] }

Author

Alan Linton, 2001


categories: Games,TenLiners,Apr,2009,Anon

Mind-Reading Machine

Contents

Synposis

gawk -f readminds.awk

(then type "h" or "t").

Download

Download from LAWKER.

Description

Theory

Shannon's 1953 memo, A Mind-Reading(?) Machine, describes a machine built out of relays at Bell Labs.

    This machine is a somewhat simplified model of a machine designed by D.W. Hagelbarger. It plays what is essentially the old game of matching pennies or "odds and evens". This games has been discussed from the game theoretic angle by von Neumann and Morgenstern, and from the psychological point of view by Edgar Allen Poe in "The Purloined Letter". Oddly enough, the machine is aimed more nearly at Poe's method of play than von Neumann's.

The machine took advantage of the difficulty of generating truly random behavior in wetware by using a small (8-state) markov model to predict its human opponents.

Practice

We implement a 1970's version of this 1950's algorithm, using AWK instead of mechanical relays.

Our markov model is based on behavior over the last two rounds, with hpa and hpb recording the history of the player's plays, and hca and hcb recording the history of the computer's guesses. The possible cases are: the player won or lost two rounds ago, changed plays or stayed with the same play, and won or lost the last round, for a total of 23 = 8 histories, with any bias towards changing or staying in the upcoming round kept in the tally array.

If the player has repeated their behavior for a given history at least twice, we guess according to their predicted behavior. After the first observation, we guess with a 75%/25%, split, weighted towards the bias. If the player hasn't shown any bias (or during the first two rounds of the game), we guess at hazard.

Code

Begin

BEGIN	{
 print "+---------------------------------------------------------+"
 print "| An AWKward mind-reading machine                         |"
 print "|         (this retrogame inspired by the Bell Labs Memo: |"
 print "|          Shannon, 1953, 'A Mind-Reading (?) Machine')   |"
 print "+---------------------------------------------------------+"
 print "Shall we play a game?"
 print "Tell me either 'heads' or 'tails'."
 print "If I guess what you picked, I win.  Otherwise, you win."
 print "The match goes for 100 rounds, or someone gets 20."
 printf "your play? "
}

set seed

BEGIN	{ "date +%s" | getline seed; srand(seed) }

consult model

BEGIN	{ t = 0 }
NR > 2	{
	case = (hpa!=hca)"/"(hpa!=hpb)"/"(hpb!=hcb)
	t = tally[case]
	}

t < -1	{ guess=!hpb }
t == -1 { guess=(int(rand()+.75)?!hpb:hpb) }
t == 0	{ guess=int(rand()+.5) }
t == 1  { guess=(int(rand()+.75)?hpb:!hpb) }
t > 1	{ guess= hpb }

get input

/^[hH]/		{ play=1 }
/^[tT]/		{ play=0 }
/^[^hHtT]/	{ printf "heads or tails? "; next }

report results

We also report the results of the round to the player (in case they wish to update their internal models). En passant, we update pw and cw, the number of player (resp. computer) wins.

	{
	printf "You played " (play?"heads":"tails")
	printf "; I guessed " (guess?"heads":"tails")
	printf ".  "(play==guess?"I":"You")" win. "
	print "("(pw+=(play!=guess))"-"(cw+=(play==guess))")"
	}

update model

After finishing a round, we update the history with the results, including updating tally according to the player's behavior. Again, we wait for two rounds before touching the tally counters, at which point the history will have been fully initialized.

NR > 2	{ tally[case] += (hpb == play ? 1 : -1) }
	{
	hpa = hpb; hpb = play
	hca = hcb; hcb = guess
	}

check for victory

At the end of each round, if we haven't met a victory condition, we prompt for the next round.

cw+pw==100	{ printf (cw>pw?"I":"You")" won the match "
		  print  "by "(cw>pw?cw-pw:pw-cw)" games."
		  exit }
pw-cw==20	{ print "You win -- up by 20"; exit }
cw-pw==20	{ print "I win -- up by 20"; exit }
		{ printf "? " }

end

END	{ 
	print " T H A N K   Y O U   F O R   P L A Y I N G "
	}

Copyright

Copyright (c) 2009 the authors listed at the following URL, and/or the authors of referenced articles or incorporated external code: http://en.literateprograms.org/Mind_reading_machine_(AWK)?action=history&offset=20070207160312

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.


categories: Games,TenLiners,Apr,2009,Ysa

mastermind.awk

Contents

Synopsis

Download

Description

Example

Code

See Also

Author

Synopsis

gawk -f mastermind.awk

Download

Download from LAWKER.

Description

The aim of the game is to guess 4 numbers from 0,1,2,3,4,5,6,7,8,9. A "hit" is the right number in the right position and a "blow" is the right number in a wrong position.

You lose the game if you fail to guess after 10 rounds.

Example

 +++  Hit & Blow  +++   <Push Enter>

[ 1] >> 1234
              ##  1 Hit  2 Blow
[ 2] >> 1256
              ##  1 Hit  1 Blow
[ 3] >> 1789
              ##  1 Hit  0 Blow
[ 4] >> 1243 
              ##  1 Hit  2 Blow
[ 5] >> 1340
              ##  3 Hit  0 Blow
[ 6] >> 1320

  Congratulations !!  (1320)

Code

BEGIN{ 
	srand();  
	c=1;  
	print "\n\n +++  Hit & Blow  +++   <Push Enter>\n";
	q[z=p=int(9*rand())+1]=1;  
	for(i=2; i<=4;) 
		if(q[p=int(10*rand())]<1){ 
			q[p]=i++;  
			z=z*10+p; }
}

Note that the range 1023 ... 9876 are the smallest and largerst 4 digit integers with no repeates.

{ if((n=int($0+0))>=1023 && n<=9876) { 
		++c;
   		v=0;  
		for(i=4; i>0; n=int(n/10)) 
			v+=(q[p=n%10]==i--)?10:(q[p]>0)?1:0;
    			if	(v==40) exit; 
				else printf("%16s %2d Hit %2d Blow\n", "##", v/10, v%10);
 	}
 	if	(c>10) exit; 
	else printf("[%2d] >> ", c);
}
END{ 
	printf("\n  %s  (%d)\n", (v==40)?"Congratulations !!":"Over times", z); 
}

See Also

mastermind2.awk.

Author

The author's name is YSA.


categories: Sorting,TenLiners,Apr,2009,Ysa

Quicksort.awk

Contents

Synopsis

cat numbers | gawk -f quicksort.awk

Download

Download from LAWKER.

Description

Some Awk implementations come with built in sort routines (e.g. Gawk's asort and asorti functions). But it can be useful to code these yourself, especially in you are doing data structure tricks.

Quicksort selects a pivot and divides the data into values above and below the pivot. Sorting then recurses on these sub-lists.

Code

Loading the data

BEGIN { RS = ""; FS = "\n" }
      { A[NR] = $0 } 
END {
	qsort(A, 1, NR)
	for (i = 1; i <= NR; i++) {
		print A[i]
		if (i == NR) break
		print ""
	}
}

Sorting the data

function qsort(A, left, right,   i, last) {
	if (left >= right)
		return
	swap(A, left, left+int((right-left+1)*rand()))
	last = left
	for (i = left+1; i <= right; i++)
		if (A[i] < A[left])
			swap(A, ++last, i)
	swap(A, left, last)
	qsort(A, left, last-1)
	qsort(A, last+1, right)
}
function swap(A, i, j,   t) {
	t = A[i]; A[i] = A[j]; A[j] = t
}

See also

quicksort2.awk

Authors

Alfred Aho, Peter Weinberger, Brian Kernighan, 1988.


categories: Runawk,Project,Tools,Apr,2009,AlexC

New release: Runawk 0.16

In comp.lang.awk, Aleksey Cheusov writes:

I've made runawk-0.16.0 release. This release has lots of important improvements and additions. Sources are available from

What is runawk?

RUNAWK is a small wrapper for AWK interpreter that helps to write the standalone programs in AWK. It provides MODULES for AWK similar to PERL's "use" command and other powerful features. Dozens of ready to use modules are also provided.

(For more information, see details from the last release.)

Major changes in this release

Lots of demo programs for most runawk modules were created and they are in examples/ subdirectory now.

New MEGA module ;-) power_getopt.awk See the documentation and demo program examples/demo_power_getopt. It makes options handling REALLY easy (see below).

New modules:

  • embed_str.awk has_suffix.awk
  • has_prefix.awk
  • readfile.awk
  • modinfo.awk

Minor fixes and improvements in dirname.awk and basename.awk. Now they are fully compatible with dirname(1) and basename(1)

RUNAWK sets the following environment variables for the child awk subprocess:

  • RUNAWK_MODC - A number of modules (-f filename) passed to AWK
  • RUNAWK_MODV_<n> - Full path to the module #n, where n is in [0..RUNAWK_MODC) range.

RUNAWK sets RUNAWK_ART_STDIN environment variable for the child awk subprocess to 1 if additional/artificial `-' was added to the list to awk's arguments.

Makefile:

  • bmake-ism were removed. Now Makefile is fully compatible with FreeBSD make.
  • CLEANFILES target is used instead of hand-made rules
  • Minor fix in 'test_all' target

Power_GetOpt.awk

The most powerful feature of this release is power_getopt.awk module. It provides a very powerful and very easy way to handle options. Everything is in the usage message, you should do anything at all. I think example below is easy.

Example Code

% cat 1.awk
#!/usr/bin/env runawk

#use "power_getopt.awk"

#.begin-str help
# power_getopt - program demonstrating a power of power_getopt.awk module
# usage: power_getopt [OPTIONS]
# OPTIONS:
#    -h|--help                  display this screen
#    -f|--flag                  flag
#       --long-flag             long flag only
#    -s                         short flag only
#    =F|--FLAG           flag with value
#.end-str

BEGIN {
        print "f         --- " getarg("f")
        print "flag      --- " getarg("flag")
        print "long-flag --- " getarg("long-flag")
        print "s         --- " getarg("s")
        print "F         --- " getarg("F", "default1")
        print "FLAG      --- " getarg("FLAG", "default2")

        exit 0
}

./1.awk

% ./1.awk
f         --- 0
flag      --- 0
long-flag --- 0
s         --- 0
F         --- default1
FLAG      --- default2

./1.awk -h

% ./1.awk -h
power_getopt - program demonstrating a power of power_getopt.awk module
usage: power_getopt [OPTIONS]
OPTIONS:
   -h|--help                  display this screen
   -f|--flag                  flag
      --long-flag             long flag only
   -s                         short flag only
   -F|--FLAG           flag with value

./1.awk -f

% ./1.awk -f
f         --- 1
flag      --- 1
long-flag --- 0
s         --- 0
F         --- default1
FLAG      --- default2

./1.awk -F value

% ./1.awk -F value
f         --- 0
flag      --- 0
long-flag --- 0
s         --- 0
F         --- value
FLAG      --- value

./1.awk --FLAG=value

% ./1.awk --FLAG=value
f         --- 0
flag      --- 0
long-flag --- 0
s         --- 0
F         --- value
FLAG      --- value

categories: Mascot,Apr,2009,VenkatesanS

New Mascot

Venkatesan Satish offers a new entry in our Awk mascot competition:


categories: Wp,Apr,2009,Admin

Word Processing in Awk

These pages focus on word processing tools in Awk.


categories: Interpreters,Apr,2009,Admin

Writing Interpreters

These pages focus on language interpreters, written in Awk.


categories: Awk100,Interpreters,Apr,2009,HenryS

AASL: Parser Genrator in Awk

Download

Download from LAWKER

Synopsis

aaslg [ -x ] [ file ... ]
aaslr [ -x ] table [ file ... ]

Description

Aaslg and aaslr implement the Amazing Awk Syntax Language, AASL (pro- nounced ``hassle''). Aaslg (pronounced ``hassling'') takes an AASL specification from the concatenation of the file(s) (default standard input) and emits the corresponding AASL table on standard output. Aaslr parses the contents of the file(s) (default standard input) according to the AASL table in file table, emitting the table's output on standard output.

Both take a -x option to turn on verbose and cryptic debugging output. Both look in a library directory for pieces of the AASL system; the AASLDIR environment variable, if present, overrides the default notion of the location of this directory.

Aaslr expects input to consist of input tokens, one per line. For sim- ple tokens, the line is just the text of the token. For metatokens like ``identifier'', the line is the metatoken's name, a tab, and the text of the token. [xxx discuss `#' lines]

Aaslr output, in the absence of syntax errors, consists of the input tokens plus action tokens, which are lines consisting of `#!' followed immediately by an identifier. If the syntax of the input does not match that specified in the AASL table, aaslr emits complaint(s) on standard error and attempts to repair the input into a legal form; see ``ERROR REPAIR'' below. Unless errors have cascaded to the point where aaslr gives up (in which case it emits the action token ``#!aargh'' to inform later passes of this), the output will always conform to the AASL syntax given in the table.

Normally, a complete program using AASL consists of three passes, the middle one being an invocation of aaslr. The first pass is a lexical analyzer, which breaks free-form input down into input tokens in some suitable way. The third pass is a semantics interpreter, which typi- cally responds to input tokens by momentarily remembering them and to action tokens by executing some action, often using the remembered value of the previous input token. Aaslg is in fact implemented using AASL, following this structure; it implements the -x option by just passing it to aaslr.

AASL Specifications

An AASL specification consists of class definitions, text definitions, and rules, in arbitrary order (except that class definitions must pre- cede use of the classes they define). A `#' (not enclosed in a string) begins a comment; characters from it to the end of the line are ignored. An identifier follows the same rules as a C identifier, except that in most contexts it can be at most 16 characters long. A string is enclosed in double quotes ("") and generally follows C syn- tax. Most strings denote input tokens, and references to ``input token'' as part of AASL specification syntax should be read as ``string denoting input token''.

A class definition is an identifier enclosed in angle brackets (<>) followed by one or more input tokens followed by a semicolon (;). It gives a name to a set of input tokens. Classes whose names start with capital letters are user abbreviations; see below. Classes whose names start with lowercase letters are special classes, used for internal purposes. The current special classes are:

trivial
tokens which the parser can discard at will, in the expectation that they might be inserted erroneously; see ``ERROR REPAIR'' for details.
lineterm
tokens which terminate a logical line for purposes of resyn- chronization in error repair; see ``ERROR REPAIR'' for details.
endmarker
xxx

For example, the class definitions used for AASL itself are:

<trivial> "," ";"   ;
<lineterm> ";" ;
<endmarker> "EOF"   ;

When AASL error repair is invoked, the parser sometimes needs to gener- ate input tokens. In the case of a metatoken, the parser knows the token's name but needs to generate a text for it as well. A text defi- nition consists of an input token, an arrow (->), and a string specify- ing what text should be generated for that token. For example, the text definitions used for AASL itself are:

"id" -> "___"
"string" -> "\"___\""

The rules of a specification define the syntax that the parser should accept. The order of rules is not significant, except that the first rule is considered to be the top level of the specification. The spec- ification is executed by calling the first rule; when execution of that rule terminates, execution of the specification terminates. If the user wishes this to occur only at end of input, he should arrange for the lexical analyzer to produce an endmarker token (conventionally ``EOF'') at the end of the input, and should write the first rule to require that token at the end.

Note that an input token may be recognized considerably before it is accepted, but the parser emits it to the output only on acceptance.

A rule consists of an identifier naming it, a colon (:), a sequence of items which is the body of the rule, and a semicolon (;). When a rule is called, it is executed by executing the individual items of the body in order (as modified by control structures) until either one of them explicitly terminates execution of the rule or the last item is exe- cuted.

An item which is an input token requires that that token appear in the input at that point, and accepts it (causing it to be emitted as out- put).

An item which is an identifier denotes a call to another rule, which executes the body of that rule and then returns to the caller. It is an error to call a nonexistent rule.

An item which is an identifier preceded by `!' causes that identifier to be emitted as an action token; the identifier has no other signifi- cance.

An item which is `<<' causes execution of the current rule to terminate immediately, returning to the calling rule.

An item which is `>>' causes the execution of the innermost enclosing loop (see below) to terminate immediately, with execution continuing after the end of that loop. The loop must be within the same rule.

An item which is an identifier preceded by `@%&!' causes an internal semantic action to be executed within the parser; this is normally needed only for bizarre situations like C's typedef. [xxx should give details I suppose]

A choice is a sequence of branches enclosed in parentheses (()) and separated by vertical bars (|). The first of the branches that can be executed, is, after which execution continues after the end of the choice.

A loop is a sequence of branches enclosed in braces ({}) and separated by vertical bars (|). The first of the branches that can be executed, is, and this is done repeatedly until the loop is terminated by `>>', after which execution continues after the end of the loop. (A loop can also be terminated by `<<' terminating execution of the whole rule.)

A branch is just a sequence of items, like a rule body, except that it must begin with either an input token or a lookahead. If it begins with an input token, it can be executed only when that token is the next token in the input, and execution starts with acceptance of that token.

A lookahead specifies conditions for execution of a branch based on recognizing but not accepting input token(s). The simplest form is just an input token enclosed in brackets ([]), in which case execution of that branch is possible only when that token is the next token in the input. The brackets can also contain multiple input tokens sepa- rated by commas, in which case the parser looks for any of those tokens. If a user-abbreviation class name appears, either by itself or as an element of a comma-separated list, it stands for the list of tokens given in its definition.

If a lookahead's brackets contain only a `*', this is a default branch, executable regardless of the state of the input.

As a very special case, a lookahead's brackets can contain two input tokens separated by slash (/), in which case that branch is executable only when those two tokens, in sequence, are next in the input. Warn- ing: this is implemented by a delicate perversion of the error-repair machinery, and if the first of those tokens is not then accepted, the parser will die in convulsions. A further restriction is that the same input token may not appear as the first token of a double lookahead and as a normal lookahead token in the same choice/loop.

Certain simple choice/loop structures appear frequently, and there are abbreviations for them:

abbreviation	    expansion
( items ?)	        ( items  |  [*] )
{ items ?}	        { items  |  [*] >> }
( ! [look] items ?) ( [ look]  |  items )
{ ! [look] items ?} { [ look] >>  |  items }

For example, here are the rules of the AASL specification for AASL, minus the actions (which add considerable clutter and are unintelligi- ble without the third pass):

	       rules: {
				   "id" ":" contents ";"
				   | "<" "id" ">" {"string" ?} ";"
				   | "string" "->" "string"
				   | "EOF" >>
	       };
	       contents: {
				   ">>"
				   | "<<"
				   | "id"
				   | "!" "id"
				   | "@%&!" "id"
				   | "string"
				   | "(" branches ")"
				   | "{" branches "}"
				   | [*] >>
	       };
	       branches: (
				   "!" "[" look "]" contents "?"
				   | [*] branch (
				   ["|"] {"|" branch ?}
				   | "?" !endbranch
				   | [*]
				   )
	       );
	       branch: (
				   "string" contents
				   | "[" look "]" contents
	       );
	       look: (
				   ["string"/"/"] "string" "/" "string"
				   | "*"
				   | [*] looker {"," looker ?}
	       );
	       looker: ( "string" | "id" ) ;

Error Repair

When the input token is not one of those desired, either because the item being executed is an input token and a different token appears on the input, or because none of the branches of a choice/loop is exe- cutable, error repair is invoked to try to fix things up. Sometimes it can actually guess right and fix the error, but more frequently it merely supplies a legal output so that later passes will not be thrown into chaos by a minor syntax error.

The general error-repair strategy of an AASL parser is to give the parser what it wants and then attempt to resynchronize the input with the parser.

[xxx long discussion of how ``what it wants'' is determined when there are multiple possibilities]

Resynchronization is performed in three stages. The first stage attempts to resynchronize within a logical line, and is applied only if neither the input token nor the desired token is a line terminator (a member of the ``lineterm'' class). If the input token is trivial (a member of the ``trivial'' class), it is discarded. Otherwise it is retained, in hopes that it will be the next token that the parser asks for.

Either way, an error message is produced, indicating what was desired, what was seen, and what was handed to the parser. If too many of these messages have been produced for a single line, the parser gives up, produces a last despairing message, emits a ``#!aargh'' action token to alert later pases, and exits. Barring this disaster, parsing then con- tinues. If the parser at some point is willing to accept the input token, it is accepted and error repair terminates. If a line termina- tor is seen in input, or the parser requests one, before the parser is willing to accept the input token, the second phase begins.

The second stage of resynchronization attempts to line both input and parser up on a line terminator. If the desired token is a line termi- nator and the input token is not, input is discarded until a line ter- minator appears. If the desired token is not a line terminator and the input token is, the input token is retained and parsing continues until the parser asks for a line terminator. Either way, the third phase then begins.

The third stage of resynchronization attempts to reconcile line termi- nators. If the desired and input tokens are identical, the input token is accepted and error repair terminates. If they are not identical and the input token is trivial (yes, line terminators can be trivial, and ones like `;' probably should be), the input token is discarded. If the desired token is the endmarker, then the input token is discarded. Otherwise, the input token continues to be retained in hopes that it will eventually be accepted. [xxx this needs more thought] In any case, the second phase begins again.

Files

all in $AASLDIR:
interp  table interpreter
lex     first pass of aaslg
syn     AASL table for aaslg
sem     third pass of aaslg

See Also

awk(1), yacc(1)

Diagnostics

``error-repair disaster'' means that the first token of a double looka- head could not be accepted and error repair was invoked on it.

History

Written at University of Toronto by Henry Spencer, somewhat in the spirit of S/SL (see ACM TOPLAS April 1982).

Bugs

Some of the restrictions on double lookahead are annoying.

Most of the C string escapes are recognized but disregarded, with only a backslashed double-quote interpreted properly during text generation.

Error repair needs further tuning; it has an annoying tendency to infi- nite-loop in certain odd situations (although the messages/line limit eventually breaks the loop).

Complex choices/loops with many branches can result in very long lines in the table.

Assessment

The implementation of AASL was fairly straight forward, with AASL itself used to describe its own syntax. An AASL specification is compiled into a table, which is then processed by a table-walking interpreter. The interpreter expects input to be as tokens, one per line, much likethe output of a traditional scanner. A complete program using AASL (for example, the AASL table generator) is normally three passes: thescanner,the parser (tables plus interpreter), and a semantics pass. The first set of tables was generated byhand for bootstrapping.

Apart from the minor nuisance of repeated iterations of language design, the biggest problem ofimplementing AASL wasthe question of semantic actions. Inserting awk semantic routines into the table interpreter, in the style of yacc,would not be impossible, but it seemed clumsy and inelegant. Awks lack of anyprovision for compile time initialization of tables strongly suggested reading them in at run time, rather than taking up space with a huge BEGIN action whose only purpose was to initialize the tables. This makes insertions into the interpreters code awkward.

The problem was solved by a crucial observation: traditional compilers (etc.) merge a two-stepprocess, first validating a token stream and inserting semantic action cookiesinto it, then interpreting thestream and the cookies to interface to semantics. Forexample, yaccs grammar notation can be viewed asinserting fragments of C code into a parsed output, and then interpreting that output. This approach yieldsan extremely natural pass structure for an AASL parser,with the parsersoutput stream being (in the absenceof syntax errors) a copy of its input stream with annotations. The following semantic pass then processesthis, momentarily remembering normal tokens and interpreting annotations as operations on the remembered values. (The semantic pass is, in fact, a classic pattern+action awk program, with a pattern and anaction for each annotation, and a general save the value in a variableaction for normal tokens.)

The one difficulty that arises with this method is when the language definition involves feedbackloops between semantics and parsing, an obvious example being Cs typedef.Dealing with this reallydoes require some imbedding of semantics into the interpreter,although with care it need not be much: thein-parser code for recognizing C typedefs, including the complications introduced by block structure andnested redeclarations of type names, is about 40 lines of awk.The in-parser actions are invoked by a special variant of the AASL emit semantic annotationsyntax.

Aside benefit of top-down parsing is that the context of errors is known, and it is relatively easy to implement automatic error recovery. When the interpreter is faced with an input token that does not appearin the list of possibilities in the parser table, it givesthe parser one of the possibilities anyway, and then usessimple heuristics to try to adjust the input to resynchronize. The result is that the parser,and subsequentpasses, always see a syntactically-correct program. (This approach is borrowed from S/SL and its predecessors.) Although the detailed error-recovery algorithm is still experimental, and the current one is notentirely satisfactory when a complex AASL specification does certain things, in general it deals with minorsyntax errors simply and cleanly without anyneed for complicating the specification with details of errorrecovery.Knowing the context of errors also makes it much easier to generate intelligible error messagesautomatically.

The AASL implementation is not large. The scanner is 78 lines of awk,the parser is 61 lines of AASL (using a fairly low-density paragraphing style and a good manycomments), and the semantics pass is 290 lines of awk. The table interpreter is 340 lines, about half of which (and most of the complexity) can be attributed to the automatic error recovery.

As an experiment with a more ambitious AASL specification, one for ANSI C was written. This occupies 374 lines excluding comments and blank lines, andwith the exception of the messy details of Cdeclaratorsis mostly a fairly straightforward transcription of the syntax given in the ANSI standard. Generating tables for this takes about three minutes of CPU time on a Sun 3/180; the tables are about 10K bytes.

The performance of the resulting ANSI C parser is not impressive: in very round numbers, averagedoveralarge program, it parses about one line of C per CPU second. (The scanner,164 lines of awk, accounts for a negligible fraction of this.) Some attention to optimization of both the tables and the interpreter might speed this up somewhat, but remarkable improvements are unlikely. As things stand in the absence of better awk implementations or a rewrite of the table interpreter in C, its a cute toy, possibly of some pedagogical value, but not a useful production tool. On the other hand, there does not appear to be any fundamental reason for the performance shortfall: itspurely the result of the slowexecution of awk programs.

Lessons From AASL

The scanner would be much faster with better regular-expression matching, because it can use regular expressions to determine whether a string is a plausible token but must use substr to extract the string first. Nawk functions would be very handy for modularizing code, especially the complicated and seldom-invoked error-recovery procedure. A switch statement modelled on the pattern+action scheme would be useful in several places.

Another troublesome issue is that arrays are second-class citizens in awk (and continue to be so in nawk): there is no array assignment. This lack leads to endless repetitions of code like:

for (i in array) 
    arraystack[i ":" sp] = array[i] 

whenever block structuring or a stack is desired. Nawk's multi-dimensional arrays supply some syntactic sugar for this but don't really fix the problem. Not only is this code clumsy, it is woefully inefficient compared to something like

arraystack[sp] = array 

even if the implementation is very clever. This significantly reduces the usefulness of arrays as symboltables and the like, a role for which they are otherwise very well suited.

It would also be of some use if there were some way to initialize arrays as constant tables, or alternatively a guarantee that the BEGIN action would be implemented cleverly and would not occupy space after it had finished executing.

A minor nuisance that surfaces constantly is that getting an error message out to the standard-error descriptor is painfully clumsy: one gets to choose between putting error messages out to a temporary file and having a shell "wrapper" process them later, or piping them into "cat >&2" (!).

The multi-pass input-driven structure that awk naturally lends itself to produces very clean and readable code with different phases neatly separated, but creates substantial difficulties when feedback loops appear. (In the case of AASL,this perhaps says more about language design than about awk.)

Author

Henry Spencer.


categories: Awk100,Top10,Interpreters,Dsl,Apr,2009,HenryS

Amazing Awk Assembler

Download from

Download from LAWKER.

Description

"aaa" (the Amazing Awk Assembler) is a primitive assembler written entirely in awk and sed. It was done for fun, to establish whether it was possible. It is; it works. It's quite slow, the input syntax is eccentric and rather restricted, and error-checking is virtually nonexistent, but it does work. Furthermore it's very easy to adapt to a new machine, provided the machine falls into the generic "8-bit-micro" category. It is supplied "as is", with no guarantees of any kind. I can't be bothered to do any more work on it right now, but even in its imperfect state it may be useful to someone.

aaa is the mainline shell file.

aux is a subdirectory with machine-independent stuff. Anon, 6801, and 6809 are subdirectories with machine-dependent stuff, choice specified by a -m option (default is "anon"). Actually, even the stuff that is supposedly machine-independent does have some machine-dependent assumptions; notably, it knows that bytes are 8 bits (not serious) and that the byte is the basic unit of instructions (more serious). These would have to change for the 68000 (going to 16-bit "bytes" might be sufficient) and maybe for the 32016 (harder).

aaa thinks that the machine subdirectories and the aux subdirectory are in the current directory, which is almost certainly wrong.

abst is an abstract for a paper. "card", in each machine directory, is a summary card for the slightly-eccentric input language. There is no real manual at present; sorry.

try.s is a sample piece of 6809 input; it is semantic trash, purely for test purposes. The assembler produces try.a, try.defs, and try.x as outputs from "aaa try.s". try.a is an internal file that looks somewhat like an assembly listing. try.defs is another internal file that looks somewhat like a symbol table. These files are preserved because of possible usefulness; tmp[123] are non-preserved temporaries. try.x is the Intel-hex output. try.x.good is identical to try.x and is a saved copy for regression testing of new work.

01pgm.s is a self-programming program for a 68701, based on the one in the Motorola ap note. 01pgm.x.good is another regression-test file.

If your C library (used by awk) has broken "%02x" so it no longer means "two digits of hex, *zero-filled*" (as some SysV libraries have), you will have to fall back from aux/hex to aux/hex.argh, which does it the hard way. Oh yes, you'll note that aaa feeds settings into awk on the command line; don't assume your awk won't do this until you try it.

Author

Henry Spencer


categories: Interpreters,Apr,2009,DavidL

Awk A*

Programmers often take awk "as is", never thinking to use it as a lab in which we can explore other language extensions. This is of course, only one way to treat the Awk code base.

An alternate approach is to treat the Awk code base as a reusable library of parsers, regular expression engines, etc etc and to make modifications to the lanugage. This second approach was take by David Ladd and J. Christopher Raming in their A* system.

They write:

    While there are a number of systems that will help one construct full-blown metaprograms such as compilers and interpreters, we wanted something with extremely low overhead. We set out to build a something with the property that it would help even inexperienced users build simple meta-programs in a matter of minutes with a few lines of code. A* is the result; it is more than anything else an engineering exercise, as most of its ideas are not new. It is the arrangement of these ideas and the purpose to which they are directed distinguish A* from other tools.

    A* is an experimental language designed to facilitate the creation of language-processing tools. It is analogous either to an interpreted yacc with Awk as its statement language, or to a version of Awk which processes programs rather than records. A* offers two principal advantages over the combination of lex, yacc, and C:

    1. a high-level interpreted base language
    2. built-in parse tree construction.
A* programmers are thus able to accomplish many useful tasks with little code.

Reference: A*: a language for implementing language processors Ladd, D.A.; Ramming, J.C.; Software Engineering, IEEE Transactions on Volume 21, Issue 11, Nov. 1995 Page(s):894 - 901


categories: Sed,Tips,Apr,2009,Admin

Sed-clones (in Awk)

These pages focus on Sed-like stream editors, written in Awk.


categories: Tips,Apr,2009,ArnoldR

Moving Files with Awk

Andrew Eaton wrote at comp.lang.awk:

I just started with awk and sed, I am more of a perl/C/C++ person. I have a quick question reguarding the pipe. In Awk, I am trying to use this construct.

while ((getline < "somedata.txt") > 0)
            {print | "mv"} #or could be "mv -v" for verbose. 

Is it possible that "print" is no longer printing the value of getline, if so how do I correct it?

Arnold Robbins comments:

The problem here is that `mv' doesn't read standard input, it only processes command lines. Assuming that your data is something like:

oldfile newfile

You can do things two ways:

# build the command and execute it
while ((getline < "somedata.txt") > 0) {
          command = "mv " $1 " " $2
          system(command)
}
close("somedata.txt")

or this way:

# send commands to the shell
while ((getline < "somedata.txt") > 0) {
          printf("mv %s %s\n", $1, $2) | "sh"
}
close("somedata.txt")
close("sh")

The latter is more efficient.


categories: Sed,Tips,Apr,2009,ArnoldR

AwkSed: A Simple Stream Editor

by Arnold Robbins

From the Gawk Manual.

The sed utility is a stream editor, a program that reads a stream of data, makes changes to it, and passes it on. It is often used to make global changes to a large file or to a stream of data generated by a pipeline of commands. While sed is a complicated program in its own right, its most common use is to perform global substitutions in the middle of a pipeline:

command1 < orig.data | sed 's/old/new/g' | command2 > result

Here, s/old/new/g tells sed to look for the regexp old on each input line and globally replace it with the text new, i.e., all the occurrences on a line. This is similar to awk's gsub function.

The following program, awksed.awk, accepts at least two command-line arguments: the pattern to look for and the text to replace it with. Any additional arguments are treated as data file names to process. If none are provided, the standard input is used:

# awksed.awk --- do s/foo/bar/g using just print
#    Thanks to Michael Brennan for the idea

function usage()
{
  print "usage: awksed pat repl [files...]" > "/dev/stderr"
  exit 1
}

BEGIN {
    # validate arguments
    if (ARGC < 3)
        usage()

    RS = ARGV[1]
    ORS = ARGV[2]

    # don't use arguments as files
    ARGV[1] = ARGV[2] = ""
}

# look ma, no hands!
{
    if (RT == "")
        printf "%s", $0
    else
        print
}

The program relies on gawk's ability to have RS be a regexp, as well as on the setting of RT to the actual text that terminates the record.

The idea is to have RS be the pattern to look for. gawk automatically sets $0 to the text between matches of the pattern. This is text that we want to keep, unmodified. Then, by setting ORS to the replacement text, a simple print statement outputs the text we want to keep, followed by the replacement text.

There is one wrinkle to this scheme, which is what to do if the last record doesn't end with text that matches RS. Using a print statement unconditionally prints the replacement text, which is not correct. However, if the file did not end in text that matches RS, RT is set to the null string. In this case, we can print $0 using printf.

The BEGIN rule handles the setup, checking for the right number of arguments and calling usage if there is a problem. Then it sets RS and ORS from the command-line arguments and sets ARGV[1] and ARGV[2] to the null string, so that they are not treated as file names.

The usage function prints an error message and exits. Finally, the single rule handles the printing scheme outlined above, using print or printf as appropriate, depending upon the value of RT.


categories: Sed,Tips,Apr,2009,JamesL

s2a: sed to Awk

Contents

Download

Description

Bugs

Author

Code

Download

Download from LAWKER.

Description

The s2a project is a sed to awk conversion utility written in awk. As input it takes sed scripts, and it outputs an equivalent awk script.

This version should be fully functional as far as the following sed commands are concerned: a,d,s,p,q,c,i,n. Commands to be implemented in the future: {},=,h,g,N,P,r,x,y,l,H,G,D,b,t,:

Bugs

$ is not a valid line address. Also, line continuation with '\' is not implemented.

Author

James Lyons, Feb 2008.

For more excellent awk code, visit Lyon's awk.dsplab web site.

Code

BEGIN{RS=";|\n"; FS=""; var=1;}
{
    i=1; case1=""; case2="";
    while($i==" ")i++;
    if($i=="\\"||$i=="/"||$i~/[0-9]/) case1=matchaddr();
    if($i==","){i++; case2=matchaddr()};
 handle sed commands
####################################################################################################
    if($i == "d"){ a1=a2="next;";
    }else if($i == "p"){ a1=a2="print;";
    }else if($i == "a"){ rest="";
        for(c=i+2;c<=NF;c++) rest=rest$c;
        a1=a2="$0=$0\"\\n"rest"\";"; 
    }else if($i == "q"){ a1=a2="print; exit;"; 
    }else if($i == "n"){ a1=a2="print; if(getline <= 0) next;"
    }else if($i == "s"){
        re=substr($0, i); p=substr(re,2,1); match(re,"s"p"((\\"p"|.)*)"p"((\\"p"|.)*)"p"([a-zA-Z])?",tmp);
        tmp[3]=gensub(/\\[0-9]/,"\\\\&","g",tmp[3]); 
        tmp[1]=gensub(/\\\(/,"(","g",tmp[1]); tmp[1]=gensub(/\\\)/,")","g",tmp[1]);
        if(tmp[3]=="") a1=a2="$0=gensub(/"tmp[1]"/,\""tmp[3]"\",1);";
        else a1=a2="$0=gensub(/"tmp[1]"/,\""tmp[3]"\",\""tmp[5]"\");";
    }else if($i == "c"){ rest="";
        for(c=i+2;c<=NF;c++) rest=rest$c;
        a1="$0=\""rest"\";"; 
        a2="next;";
    }else if($i == "i"){ rest="";
        for(c=i+2;c<=NF;c++) rest=rest$c;
        a1=a2="$0=\""rest"\\n\"$0;"; 
    }else{
        print "ERROR: invalid syntax. Unkown command in expression "$0" (expr number "NR")"; exit;
    }
####################################################################################################
 output awk commands
    if(case1=="" && case2=="") print "{"a1"}";
    else if(case1~/^[0-9]/ && case2=="") print "NR=="case1"{"a1"}";
    else if(case2 == "") print "/"case1"/{"a1"}";
    else if(case1~/^[0-9]/ && case2~/^[0-9]/) print "temp"var"==1&&NR=="case2"{temp"var"=0;"a2"}temp"var"==1{"a2"}NR=="case1"{temp"var"=1;"a1"}";
    else if(case1~/^[0-9]/)  print "temp"var"==1&&/"case2"/{temp"var"=0;"a2"}temp"var"==1{"a2"}NR=="case1"{temp"var"=1;"a1"}";
    else if(case2~/^[0-9]/)  print "temp"var"==1&&NR=="case2"{temp"var"=0;"a2"}temp"var"==1{"a2"}/"case1"/{temp"var"=1;"a1"}";
    else print "temp"var"==1&&/"case2"/{temp"var++"=0;"a2"}temp"var"==1{"a2"}/"case1"/{temp"var"=1;"a1"}";
    var++;
}

function matchaddr(){
    str=substr($0, i); p=1;
    if($i == "\\"){ p=substr(str,2,1); match(str,p"([^"p"]*)"p,arr); i++}
    else if($i == "/"){ p=substr(str,1,1); match(str,p"([^"p"]*)"p,arr); }
    else { match(str,/^([0-9]*)/,arr) };
    i += RLENGTH;
    return arr[1];
}
END{print "{print}";}

categories: Arrays,Apr,2009,Timm

saya

Synopsis

saya(array [,label,sep,before,after,eq])

Description

Array printing function. Contents printed, sorted on key.

Arguments

array
An array.
label
(OPTIONAL) A prefix before every item.
sep
(OPTIONAL) A string to print between each item. Defaults to new line.
before
(OPTIONAL) A string to print before the array. Defaults to "".
after
(OPTIONAL) A string to print after the array. Defaults to new line.
eq
(OPTIONAL) A string to print between each key/value pair. Defaults to " = ".

Returns

Size of the array

Notes

The most common usage is to just use the first two arguments; e.g.

saya(a,"name") ==>

name[1] = tim
name[2] = menzies

For other usages, see the examples, below.

Source

function saya(a,s, sep0,b4,after,eq,   c,m,n,key,val,i,j,tmp,sep) {
	sep0  = sep0  ? sep0  : "\n"
	b4    = b4    ? b4    : "\n"
	after = after ? after : "\n"
	eq    = eq    ? eq    : " = "
	pre   = s     ? s"["  : ""
	post  = s     ? "]"   : ""
	m     = asorti(a,b)
	printf("%s",b4)
	for(i=1;i<=m;i++)  {
		key=b[i]
		val=a[b[i]]
		printf("%s", sep pre  )
		n=split(key,tmp,SUBSEP)
		c = ""
		for(j=1;j<=n;j++)	{	
			printf("%s", c tmp[j]  )
			c=","
		}
		printf("%s", post eq val )
		sep=sep0;
	};
	printf("%s",after)
	return m
}

Example

gawk/array/eg/saya »

gawk -f saya.awk --source '
BEGIN { 	
	A["fname"  ] = "tim"
	A["lname"  ] = "menzies"
	A["address"] = "usa"
	saya(A,"",", ","[","]")
	print ""
	saya(A,"message")
	B[2,3,9]   = 100
	B[10,1,11] = 200
	B[1,3,10]  = 300
	saya(B,"b")
}'

gawk/array/eg/saya.out »

[address = usa, fname = tim, lname = menzies]

message[address] = usa
message[fname] = tim
message[lname] = menzies

b[1,3,10] = 300
b[10,1,11] = 200
b[2,3,9] = 100

Author

Tim Menzies


categories: Sorting,TenLiners,Apr,2009,Awk

quicksort2.awk

Contents

Synopsis

Download

Description

Code

Bugs

See also

Copyright

Author

Synopsis

cat numbers | gawk -f quicksort2.awk

Download

Download from LAWKER.

Description

Quicksort divides the input data around a randomly selected pivot, then recurses on the divided data.

In quicksort2, the pivot is selected from the first line of input. Each data division is handled by a different UNIX pipe and recursive gawk processes are called on the divided data.

Yes, this is not the fastest way to do it but (in theory anyway) it should be able to handle very big data sets.

Code

BEGIN   { 
         recurse1 = "gawk -f quicksort2.awk #" rand()
         recurse2 = "gawk -f quicksort2.awk #" rand()
        }
NR == 1 { pivot=$0; next }
NR > 1  { if($0 < pivot) print | recurse1
          if($0 > pivot) print | recurse2
        }
END     { close(recurse1)
          if(NR > 0) print pivot
	      close(recurse2)
        }

Bugs

The output ignores repeated input values. I thought it was a problem with repeating the name of the pipes (hence the "rand()" labelling) but that did not fix the issues.

See also

quicksort.awk

Copyright

Copyright (c) 2009 by David Long.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Author

Original version: David Long, 2004. Tim Menzies added some modifications in 2009 to call recursive Gawk pipes on both sides of the pivot.


categories: Sigs,Tools,Apr,2009,Anon

Hiding Email Address

Contents

Synopsis

Download

Description

Code

Author

Synopsis

gawk -f cryptosig.awk tim@menzies.us

Download

Download from LAWKER.

Description

Generates a one-line Awk program that can print your email, from a seemingly jumbled string. This program can then become your email sig and only the Awk cognoscente can generate a reply.

Example

% gawk -f cryptosig.awk tim@menzies.us
BEGIN{a="7059631863556476595569007169";while(a){printf("%c",46+substr(a,1,2));a=substr(a,3)}}

This can be tested as follows:

echo 'BEGIN{a="7059631863556476595569007169";while(a){printf("%c",46+substr(a,1,2));a=substr(a,3)}}' | gawk -f -

or

gawk -f crypotsig.awk tim@menzies.us | gawk -f -

both of which should print "tim@menzies.us".

Code

BEGIN {
  for (i=0; i<=255; i++) {           # build table of char=value pairs
    ord_arr[sprintf("%c",i)] = i     # character = ordinal value
  }
  for (i=1; i<=ARGC-1; i++) {
    str = ""
    for (j=1; j<=length(ARGV[i]); j++) {
      str = sprintf("%s%02d",str,ord_arr[substr(ARGV[i],j,1)]-46)
    }
    printf("BEGIN{a=\"%s\";while(a){printf(\"%%c\",46+substr(a,1,2));a=substr(a,3)}}\n",str)
  }
  exit(0)
}

Author

BEGIN{a="535170696159626207061118755158656500536563";
      while(a){
          printf("%c",46+substr(a,1,2));a=substr(a,3)};
      print("")
}

categories: Sigs,Tools,Apr,2009,Timm

Random Signatures

Contents

Synopsis

chmod +x sigs; ./sigs

Download

Download from LAWKER.

Description

Generates random signtures. Signatures and generation code included in same file so installation is just a matter of calling one file.

Most of the file is a large "here" document. Paragraph 1 of that document is always added to the signatures, followed one of the folowing paragraphs, selected at radonom.

To add to the signtures, include them in the here document, with one preceeding blank line.

Code

Pick1

pick1() {
    gawk 'BEGIN { srand(); RS=""    }
          NR==1 { print $0 "\n"     }
          NR>1  { Recs[rand()] = $0 }
          END   { for ( R in Recs ) {print Recs[R]; exit}}
        ' $1
}

The Signatures

cat << SoMEI_mpOSSIblE_sYMBOl | pick1
tim.menzies {
  title:   dr (Ph.D.) and associate professor;
  align:   csee, west virginia university;
  cell:   esb 841A; 
  url:   http://menzies.us;
  fyi:   unless marked "URGENT", i usually won't get 2 your email b4 5pm; 
}

Doing a job RIGHT the first time gets the job done. Doing the job WRONG
fourteen times gives you job security.

Rome did not create a great empire by having meetings, they did it by
killing all those who opposed them.

INDECISION is the key to FLEXIBILITY.

"When a subject becomes totally obsolete we make it a required
course."  Peter Drucker

I saw two shooting stars last night but they were only satellites .
Its wrong to wish on space hardware. I wish, I wish, I wish you cared.
-- Billy Bragg

Then, in 1995, came the most amazing event in the
history of programming languages: the introduction
of Java.  -- Programming Languages: Principles and Practice

Suburbia is where the developer bulldozes out the trees, then names
the streets after them. --Bill Vaughan

Instant gratification takes too long.
-- Carrie Fisher

Complexity is easy. Simplicity is hard.
--Unknown

Author

Tim Menzies


categories: Wp,Awk100,Wp,Tools,Apr,2009,HenryS

awf

The amazingly workable (text) formatter

Synopsis

awf -macros [ file ] ...

Download

Download from LAWKER. Type "make r" to run a regression test, formatting the manual page (awf.1) and comparing it to a preformatted copy (awf.1.out). Type "make install" to install it. Pathnames may need changing.

Description

Awf formats the text from the input file(s) (standard input if none) in an imitation of nroff's style with the -man or -ms macro packages. The -macro option is mandatory and must be `-man' or `-ms'.

Awf is slow and has many restrictions, but does a decent job on most manual pages and simple -ms documents, and isn't subject to AT&T's brain-damaged licensing that denies many System V users any text formatter at all. It is also a text formatter that is simple enough to be tinkered with, for people who want to experiment.

Awf implements the following raw nroff requests:

.\"  .ce  .fi  .in  .ne  .pl  .sp
.ad  .de  .ft  .it  .nf  .po  .ta
.bp  .ds  .ie  .ll  .nr  .ps  .ti
.br  .el  .if  .na  .ns  .rs  .tm

and the following in-text codes:

\$   \%   \*   \c   \f   \n   \s

plus the full list of nroff/troff special characters in the original V7 troff manual.

Many restrictions are present; the behavior in general is a subset of nroff's. Of particular note are the following:

  • Point sizes do not exist; .ps and \s are ignored.
  • Conditionals implement only numeric comparisons on \n(.$, string com- parisons between a macro parameter and a literal, and n (always true) and t (always false).
  • The implementation of strings is generally primitive.
  • Expressions in (e.g.) .sp are fairly general, but the |, &, and : operators do not exist, and the implementation of \w requires that quote (') be used as the delimiter and simply counts the characters inside (so that, e.g., \w'\(bu' equals 4).

White space at the beginning of lines, and imbedded white space within lines, is dealt with properly. Sentence terminators at ends of lines are understood to imply extra space afterward in filled lines. Tabs are implemented crudely and not quite correctly, although in most cases they work as expected. Hyphenation is done only at explicit hyphens, emdashes, and nroff discretionary hyphens.

MAN Macros

The -man macro set implements the full V7 manual macros, plus a few semi- random oddballs. The full list is:

.B   .DT  .IP  .P   .RE  .SM
.BI  .HP  .IR  .PD  .RI  .TH
.BR  .I   .LP  .PP  .RS  .TP
.BY  .IB  .NB  .RB  .SH  .UC

.BY and .NB each take a single string argument (respectively, an indi- cation of authorship and a note about the status of the manual page) and arrange to place it in the page footer.

MS Macros

The -ms macro set is a substantial subset of the V7 manuscript macros. The implemented macros are:

.AB  .CD  .ID  .ND  .QP  .RS  .UL
.AE  .DA  .IP  .NH  .QS  .SH  .UX
.AI  .DE  .LD  .NL  .R   .SM
.AU  .DS  .LG  .PP  .RE  .TL
.B   .I   .LP  .QE  .RP  .TP

Size changes are recognized but ignored, as are .RP and .ND. .UL just prints its argument in italics. .DS/.DE does not do a keep, nor do any of the other macros that normally imply keeps.

Assignments to the header/footer string variables are recognized and implemented, but there is otherwise no control over header/footer formatting. The DY string variable is available. The PD, PI, and LL number registers exist and can be changed.

Output

The only output format supported by awf, in its distributed form, is that appropriate to a dumb terminal, using overprinting for italics (via underlining) and bold. The nroff special characters are printed as some vague approximation (it's sometimes very vague) to their correct appearance.

Awf's knowledge of the output device is established by a device file, which is read before the user's input. It is sought in awf's library directory, first as dev.term (where term is the value of the TERM environment variable) and, failing that, as dev.dumb. The device file uses special internal commands to set up resolution, special characters, fonts, etc., and more normal nroff commands to set up page length etc.

FiLes

All in /usr/lib/awf (this can be overridden by the AWFLIB environment variable):

common     common device-independent initialization
dev.*      device-specific initialization
mac.m*     macro packages
pass1      macro substituter
pass2.base central formatter
pass2.m*   macro-package-specific bits of formatter
pass3      line and page composer

See Also

awk(1), nroff(1), man(7), ms(7)

Diagnostics

Unlike nroff, awf complains whenever it sees unknown commands and macros. All diagnostics (these and some internal ones) appear on standard error at the end of the run.

Author

Written at University of Toronto by Henry Spencer, more or less as a supplement to the C News project.

Copyright

Copyright 1990 University of Toronto. All rights reserved. Written by Henry Spencer. This software is not subject to any license of the American Telephone and Telegraph Company or of the Regents of the University of California.

Permission is granted to anyone to use this software for any purpose on any computer system, and to alter it and redistribute it freely, subject to the following restrictions:

  1. The author is not responsible for the consequences of use of this software, no matter how awful, even if they arise from flaws in it.
  2. The origin of this software must not be misrepresented, either by explicit claim or by omission. Since few users ever read sources, credits must appear in the documentation.
  3. Altered versions must be plainly marked as such, and must not be misrepresented as being the original software. Since few users ever read sources, credits must appear in the documentation.
  4. This notice may not be removed or altered.

Bugs

There are plenty, but what do you expect for a text formatter written entirely in (old) awk?

The -ms stuff has not been checked out very thoroughly.


categories: Ps,Apr,2009,Admin

Postscript Tricks

These pages focus on postscript tricks, written in Awk.


categories: Ps,Apr,2009,ArnoldR

pschoose.awk

Contents

Synopsis

Download

Description

Details

Code

Author

Synopsis

gawk -f pschoose

Download

Download from LAWKER

Description

Pulls out a range of pages from postscript and just print those.

Details

Pagerange : list of pages from command line.

Pages : array with broken out list.

At end: "(n in Pages)" is true if page n should be printed

Code

Set up the list of paes to print.
function set_pagerange(        n, m, i, j, f, g)
{
	delete Pages

	n = split(Pagerange, f, ",")
	for (i = 1; i <= n; i++) {
		if (index(f[i], "-") != 0) { # a range
			m = split(f[i], g, "-")
			if (m != 2 || g[1] >= g[2]) {
				printf("bad list of pages: %s\n",
					f[i]) > "/dev/stderr"
				exit 1
			}
			for (j = g[1]; j <= g[2]; j++)
				Pages[j] = 1
		} else
			Pages[f[i]] = 1
	}
}

BEGIN {
	# constants
	TRUE = 1
	FALSE = 0

	if (ARGC != 3) {
		print "usage: pschoose range-spec file\n" > "/dev/stderr"
		exit 1
	}
	Pagerange = ARGV[1]
	delete ARGV[1]
	set_pagerange()
}

NR == 1, /^%%Page:/ {
	if (! /^%%Page/) {
		Prolog[++nprolog] = $0
		next
	}
}

/^%%Trailer/ || In_trailer {
	In_trailer = TRUE
	Epilog[++nepilog] = $0
	next
}

/^%%Page: /	{
	++Npage
	line = 0
}

 for all non-special lines
{
	# only save it if we will want to print it
	if (Npage in Pages)
		Page[Npage, ++line] = $0
}

END {
	# print the prologue
	for (i = 1; i in Prolog; i++)
		print Prolog[i]

	# print the actual body
	for (i = 1; i <= Npage; i++) {
		if (i in Pages) {
			for (j = 1; (i, j) in Page; j++) {
				print Page[i, j]
			}
		}
	}

	# print the epilog
	for (i = 1; i in Epilog; i++)
		print Epilog[i]
}

Author

Arnold Robbins


categories: Ps,Apr,2009,ArnoldR

psrev.awk

Contents

Synopsis

Download

Description

Code

Author

Synopsis

gawk -f psrev.awk

Download

Download from LAWKER

Description

Reverse the pages in a postscript file.

Code

BEGIN {
	# constants
	TRUE = 1
	FALSE = 0

	# Initialize global booleans
	Twoup = FALSE

	# process command line flags
	for (i = 1; i in ARGV && ARGV[i] ~ /^-/; i++) {
		if (ARGV[i] == "-2")
			Twoup = TRUE
		else
			printf("psrev: unrecognized option %s\n",
				ARGV[i]) > "/dev/stderr"
		delete ARGV[i]
	}
}

NR == 1, /^%%Page:/ {
	if (! /^%%Page/) {
		Prolog[++nprolog] = $0
		next
	}
}

/^%%Trailer/ || In_trailer {
	In_trailer = TRUE
	Epilog[++nepilog] = $0
	next
}

/^%%Page: /	{
	++Npage
	line = 0
}

 for all non-special lines
{
	Page[Npage, ++line] = $0
}

END {
	# print the prologue
	for (i = 1; i in Prolog; i++)
		print Prolog[i]

	# print the actual body
	if (Twoup) {
		hasodd = (Npage %2 == 1)
		if (hasodd) {
			# print last page
			for (j = 1; (Npage, j) in Page; j++)
				print Page[Npage, j]
			# make a fake last page for psnup
			printf "%%%%Page: %d %d\n", Npage+1, Npage+1
			printf "showpage\n"
			print "%%BeginPageSetup"
			print "BP"
			print "%%EndPageSetup"
			print "EP"
		}
		lastpage = (hasodd ? Npage - 1 : Npage)
		for (i = lastpage; i > 0; i -= 2) {
			for (k = i - 1; k <= i; k++)
				for (j = 1; (k, j) in Page; j++)
					print Page[k, j]
		}
	} else {
		# regular 1 up printing
		for (i = Npage; i > 0; i--)
			for (j = 1; (i, j) in Page; j++)
				print Page[i, j]
	}

	# print the epilog
	for (i = 1; i in Epilog; i++)
		print Epilog[i]
}

Author

Arnold Robbins


categories: TenLiners,Apr,2009,PhilipB

indent.awk

Contents

Synopsis

gawk -f indent.awk file.sh

Download

Download from LAWKER

Description

This is part of Phil's AWK tutorial at http://www.bolthole.com/AWK.html. This program adjusts the indentation level based on which keywords are found in each line it encounters.

Code

doindent

function doindent(){
	tmpindent=indent;
	if(indent<0){
		print "ERROR; indent level == " indent
	}
	while(tmpindent >0){
		printf("    ");
		tmpindent-=1;
	}
}

Out-denting

$1 == "done" 	{ indent -=1; }
$1 == "fi" 	{ indent -=1; }
$0 ~ /}/	{ if(indent!=0) indent-=1;  }

Worker

This is the 'default' action, that actually prints a line out. This gets called AS WELL AS any other matching clause, in the order they appear in this program. An "if" match is run AFTER we run this clause. A "done" match is run BEFORE we run this clause.

		{ 
		  doindent();
		  print $0;
		}

In-denting

$0 ~ /if.*;[ ]*then/	{ indent+=1; }
$0 ~ /for.*;[ ]*do/	{ indent+=1; }
$0 ~ /while.*;[ ]*do/	{ indent+=1; }

$1 == "then"		{ indent+=1; }
$1 == "do"		{ indent+=1; }
$0 ~ /{$/		{ indent+=1; }

Author

Philip Brown phil@bolthole.com


categories: Os,Apr,2009,Admin

Awk and Operating Systems

These pages focus on Awk and operating systems.


categories: XML,Apr,2009,Admin

XML

These pages focus on XML tools and Awk.


categories: Xgawk,XML,Awk100,Apr,2009,JurgenK

XMLgawk

Editor's note: Programmers often take awk "as is", never thinking to use it as a lab in which they can explore other language extensions. An alternate approach is to treat the Awk code base as a reusable library of parsers, regular expression engines, etc etc and to make modifications to the lanugage. This second approach is taken in the Awk A* project and, as shown here, in XMLgawk.

IMHO, XMLgawk is one of the most exciting new innovations seen in Gawk for many years. It shows that Awk is more than "just" a text processor: rather it is also a candidate technology for modern XML-based web applications. )

Purpose

Extends standard gawk with built-in XML processing.

Developers

Main developers: Jurgen Kahrs and Andrew Schorr.

Conceptual guidance: Manuel Collado.

MS Windows build expert: Victor Paeza.

Contributor of ideas for new features: Peter Saveliev.

Domain

XML processing, plus libraries for other extensions to Gawk.

Description

XMLgawk is an experimental extension of the GNU Awk interpreter. It includes a small XML parsing library which is built upon the Expat XML parser. The parsing library is a very thin layer on top of Expat (implementing a pull-interface) and can also be used without GNU Awk to read XML data files.

Both, XMLgawk and its XML puller library only require an ANSI C compatible compiler (GCC works, as do most vendors' ANSI C compilers) and a 'make' program.

XMLgawk provides the following functionality including:

  • AWK's way of reading data line by line is supplemented by reading XML files node by node.
  • XMLgawk can load .awk file as as well as shared libraries.
  • Adds support for an @include directive in the source code. This is the same feature provided by the current igawk script.

Current

3=Released

Use

3=Free/public domain.

Date Deployed

November 2003.

Dated

April 28, 2009.

Url


categories: Games,Awk100,Apr,2009,Ronl

Soccer

Purpose

AI Programming lab class challenge .

Installation

Download from LAWKER. Look at the first line of each file for something that looks like thos:

#!/usr/bin/gawk -f
Replace this with the full path to the local version of Gawk.

Developers

Ronald Loui (programmer and designer)

Organization

Washington University in St. Louis

Country

USA

Domain

Text-based game simulation.

Contact

Ronald P. Loui

Email

r.p.loui@gmail.com

Description

Ronald Loui writes: Most people are surprised when I tell them what language we use in our undergraduate AI programming class. That's understandable. We use GAWK. GAWK, Gnu's version of Aho, Weinberger, and Kernighan's old pattern scanning language

This code manages a CGI interface to a process that simulates a soccer game, polling for inputs from two student programs.

A repeated observation in this class is that only the scripting programmers can generate code fast enough to keep up with the demands of the class. Even though students were allowed to choose any language they wanted, and many had to unlearn the java ways of doing things in order to benefit from scripting, there were few who could develop ideas into code effectively and rapidly without scripting.

In the puny language, GAWK, which Aho, Weinberger, and Kernighan thought not much more important than grep or sed, I find lessons in AI's trends, Airs history, and the foundations of AI. What I have found not only surprising but also hopeful, is that when I have approached the AI people who still enjoy programming, some of them are not the least bit surprised.

Awk

Was written for gawk in 1995 but should run on almost any awk dialect; some css positioning commands will not work in all browsers; try IE6.

Platform

Was written on Redhat Linux with multiple hardware platforms in mind.

Uses

Intended to be run on close server to minimize delays.

Lines

605 lines in main cgi with several small aux control programs.

DevelopmentEffort

Minimal compared to development effort, but potentially will require css for new browsers.

MaintenanceEffort

Number of person-months since, including enhancements

Current

2=Evaluation.

Users

50 students in artificial intelligence project classes had to use some version of this code over seven years

DateDeployed

October 2004

Dated

April 2009


categories: Top10,Awk100,Papers,Os,Apr,2009,YungC

Awk-Linux

Awk-Linux Educational Operating Systems

Purpose

Teaching operating systems.

Developers

Yung-Pin Cheng

Email

ypc@csie.ntnu.edu.tw

Organization

Software Engineering Lab. Department of Computer Science and Information Engineering National Taiwan Normal University

Country

TAIWAN

Domain

Educators of Operating Systems

Description

Most well-known instructional operating systems are complex, particularly if their companion software is taken into account. It takes considerable time and effort to craft these systems, and their complexity may introduce maintenance and evolution problems. In this project, a courseware called Awk-Linux is proposed. The basic hardware functions provided by Awk-Linux include timer interrupt and page-fault interrupt, which are simulated through program instrumentation over user programs.

A major advantange of the use of Awk for this tool is platform independence. Awk-Linux can be crafted relatively more easily and it does not depend on any hardware simulator or platform. Stable Awk versions run on many platforms so this tool can be readily and easily ported to other machines. The same can not be said for other, more complex operating systems courseware that may be much harder to port to new environments.

In practice, using Awk-Linux is very simple for the instructor and students:

  • Course projects based on Awk-Linux provides source code extracted and simplified from a Linux kernel.
  • Results of our study indicate that the projects helped students better to understand inner workings of operating systems.

Awk

Gawk under cygwin or Linux

Platform

Windows (CYGWIN required) or Linux

Uses

C programming language

Current

Status 3 (Released)

Use

3(Free/public domain)

DateDeployed

2004

References

Yung-Pin Cheng, Janet Mei-Chuen Lin, Awk-Linux: A Lightweight Operating Systems Courseware IEEE Transactions on Education, vol. 51, issue 4, pp. 461-467, 2008.

Url

www.csie.ntnu.edu.tw/~ypc/awklinux.htm


categories: Papers,Os,Apr,2009,SallyF

Simulations for Equation-Based Congestion Control for Unicast Applications

(Editor's Note: This page is a mirror of the original web site. It describes a collection of shell/awk/tcl scripts used for modeling complex domains. This code illustrates how language choice is not a matter of "awk" vs "X". Rather, systems can be a menagerie of different languages, including Awk.)

Description

This page has pointers to the simulation scripts for the Equation-Based Congestion Control for Unicast Applications by Sally Floyd, Mark Handley, Jitendra Padhye, and Joerg Widmer, May 2000, SIGCOMM 2000.

Download

These simulation scripts are also available from in LAWKER.

To test the code:

  • Unpack this zip file:
    contents.zip
    cd contents
    

    To use these scripts, you must go the following:

    gcc bwcnt2.c -o bwcnt2
    gcc bwcnt2a.c -o bwcnt2a
    

    Then, put a copy of "ns" in the current directory, for example:

    ln -s ~/vint/ns-2/ns ns
    

    To run the tests:

    ./single.com
    ./tfrm12.com
    ./queue2.com
    ./increase.com
    ./reduce.com
    ./reduce1.com
    

Details

These scripts are quick amalgams of shell scripts, awk, tcl, and whatever else was handy at the time, so they are not intended as an example of good programming style. They are run in a directory with a "graphs" subdirectory for saved output and *.mf files (gnuplot command files), and an "awk" subdirectory for awk files. Some of these scripts use supporting *.awk files that are available in the awk directory, but are not listed separately below. Some of the scripts (tfrm12.run) also use "bwcnt" C programs for processing output data; the C code for these is in the scripts directory. Possibly one day we will clean this all up to reduce the proliferation of scripts and languages involved.

The implementation of TFRC in the NS simulator is still occasionally being modified, so the precise results of simulations can change with different versions of NS.

Some of these simulations must be run with SBSIZE in scoreboard.h set to 10000 instead of to 1024, to allow larger TCP congestion windows.

From Scripts to Figures

The simulation for Figure 2 on "Illustration of the Average Loss Interval" can be run with "contents/single.com", with supporting files "contents/single.run", "contents/single.tcl", and "contents/queueSize.tcl". Generating the postscript file also uses the following files:
"contents/graphs/s0.interval.mf", "contents/graphs/s0.loss.mf", and "contents/graphs/s0.rate.mf".

The simulations for Figure 5 on "TCP flow sending rate" can be run with "contents/tfrm-full.CA.DropTail.run", "contents/tfrm-full.CA.RED.run" with supporting files "contents/tfrm-full.CA.tcl", "contents/queueSize.tcl", "contents/getmean-full.tcl". These scripts will produce data files called

graphs/s-full-RED.CA.tcpmean
graphs/s-full-DropTail.CA.tcpmean
There are three values for each data point (from three runs) in these output files. To merge them, use "contents/merge2.tcl":
merge2.tcl graphs/graphs/s-full-RED.CA.tcpmean > graphs/s-full-RED.CA.tcp
merge2.tcl graphs/graphs/s-full-DropTail.CA.tcpmean > graphs/s-full-DropTail.CA.tcp
Unfortunately, we no longer have the *.mf gnuplot script for generating the postscript from "s-full-RED.CA.tcp" and "s-full-DropTail.CA.tcp". BTW, on a 450MHz Xeon, each graph takes about 7 hours to generate

The simulations for Figure 6 on can be run with "contents/tfrm12.com", with supporting files "contents/tfrm12.run", "contents/tfrm12.tcl", "contents/awk/plotdrops.awk" and "contents/queueSize.tcl". The supporting programs "bwcnt2" and "bwcnt2a" for processing the output data are compiled from "contents/bwcnt2.c" and "contents/bwcnt2a.c". FYI: On Sally's computer, this simulation set took 13 minutes. The following supporting files were also required for generating the postscript file "contents/tfrm12.run1", "contents/graphs/getmean.tcl", "contents/graphs/s0.12.mf", "contents/graphs/s0.loss3.mf".

The simulations for Figure 7 on "Coefficient of variation of throughput between flows" can be run with "contents/tfrmvar.run" with supporting files "contents/tfrmvar.tcl", "contents/queueSize.tcl", and "contents/graphs/getvar.tcl". The scripts "contents/fixcov.tcl" combines the many output files together, and gnuplot requires "contents/graphs/s3xxx.mf" to generate the postscript.

When we have collected the scripts for Figure 8, we will put them on-line.

The simulations for Figures 9 and 10 can be run with the script "contents/long/doit". The supporting scripts are in the tar file. The simulation takes perhaps one hour.

The simulations for Figures 11-13 can be run with the script "contents/short/doit". The simulation takes up to three days.

The simulations for Figure 14 on 40 long-lived flows can be run with "contents/queue2.com", with supporting files "contents/queue.run", "contents/queue.tcl", "contents/queueSize.tcl", "contents/tracequeue.tcl", awk/"contents/awk/plotaveq.awk", and awk/"contents/awk/plotqueue.awk". Generating the postscript file also uses the following file: "contents/graphs/s0.queue.mf".

Figures 15-18 are from experiments.

The simulations for Figure 19 on "A TFRC flow with an end to congestion" can be run with "contents/increase.com", with supporting files "contents/increase.run", "contents/increase.tcl", "contents/queueSize.tcl", "contents/awk/increase.awk", and graphs/"scriptsTR/graphs/s0.packetrate.mf".

The simulations for Figure 20 on "A TFRC flow with persistent congestion" can be run with "contents/reduce.com", with supporting files "contents/reduce.run", "contents/reduce.tcl", "contents/queueSize.tcl", "contents/awk/reduce.awk", and "contents/awk/reduce1.awk". Generating the postscript file also uses the following file: "contents/graphs/s0.rate1.mf".

The simulations for Figure 21 on "Number of round-trip times to reduce the sending rate" can be run with "contents/reduce1.com", with supporting files "contents/reduce1.run", "contents/reduce.tcl", "contents/queueSize.tcl", "contents/awk/reduce1.awk", and "contents/awk/reduce2.awk". Generating the postscript file also uses the following file: graphs/"contents/graphs/s0.half.mf".


categories: Papers,Os,Apr,2009,KimD

Intrusion Alert Normalization with Awk

From Intrusion Alert Normalization method using AWK scripts and attack name database. Dongyoung Kim, HyoChan Bang, Jung-Chan Na, Advanced Communication Technology, 2005, ICACT 2005. The 7th International Conference on Publication Date: 21-23 Feb. 2005 Volume: 1, On page(s): 608- 611 Vol. 1

The current several classes of intrusion alert have various formats and semantics. And it is transferred using a variety of protocols. The protocols that transfer intrusion alert are IDXP, SNMP trap, SYSLOG protocol, etc. These varieties of intrusion alert formats make it difticult to use that together. Intrusion alert normalization makes various intrusion alert to same structure data and same semantics. We need this normalition process to unify alerts from a variety of security equipments. This paper describes how to normalize alerts from several IDS and security equipments.


categories: Apr,2009,WilhelmW,OsamuA,ArnoldR

99 Bottles of Beer

You know the song:

    99 bottles of beer on the wall, 99 bottles of beer. Take one down and pass it around, 98 bottles of beer on the wall.

    98 bottles of beer on the wall, 98 bottles of beer. Take one down and pass it around, 97 bottles of beer on the wall.

    97 bottles of beer on the wall, 97 bottles of beer. Take one down and pass it around, 96 bottles of beer on the wall.

    ....

But how do you code it? Here's Wilhelm Weske's version. It is kind of fun but its a little hard to read:

#!/usr/bin/awk -f

        BEGIN{
       split( \
       "no mo"\
       "rexxN"\
       "o mor"\
       "exsxx"\
       "Take "\
      "one dow"\
     "n and pas"\
    "s it around"\
   ", xGo to the "\
  "store and buy s"\
  "ome more, x bot"\
  "tlex of beerx o"\
  "n the wall" , s,\
  "x"); for( i=99 ;\
  i>=0; i--){ s[0]=\
  s[2] = i ; print \
  s[2 + !(i) ] s[8]\
  s[4+ !(i-1)] s[9]\
  s[10]", " s[!(i)]\
  s[8] s[4+ !(i-1)]\
  s[9]".";i?s[0]--:\
  s[0] = 99; print \
  s[6+!i]s[!(s[0])]\
  s[8] s[4 +!(i-2)]\
  s[9]s[10] ".\n";}}

Osamu Aoki has a more maintainable version. Note how all the screen I/O is localized via functions that return strings, rather than printing straight to the screen. This is very useful for maintaince purposes or including code as libraries into other Awk programs.

BEGIN { 
   for(i = 99; i >= 0; i--) {
      print ubottle(i), "on the wall,", lbottle(i) "."
      print action(i), lbottle(inext(i)), "on the wall."
      print
   }
}
function ubottle(n) {
   return \ 
     sprintf("%s bottle%s of beer", n ? n : "No more", n - 1 ? "s" : "")
}
function lbottle(n) {
   return \
     sprintf("%s bottle%s of beer", n ? n : "no more", n - 1 ? "s" : "")
}
function action(n) {
   return \
      sprintf("%s", n ? "Take one down and pass it around," : \
                         "Go to the store and buy some more,")
}
function inext(n) {
   return n ? n - 1 : 99
}

Osamu's version is very similar to how it'd be done in C or other languages and it does not take full advantage of Awk's features. So Arnold Robbins wrote a third version that is more data driven. Most of the work is done in a pre-processor and the actual runtime just dumps text decided before the run. This solution might take more time (to do the setup) but it does allow for the simple switching of the interface (just change the last 10 lines).

BEGIN {
        # Setup
        take = "Take one down, pass it around"
        buy = "Go to the store and buy some more"

        Instruction[0] = buy
        Next[0] = 99
        Count[0, 1] = "No more"
        Count[0, 0] = "no more"

        for (i = 99; i >= 1; i--) {
                Instruction[i] = take
                Next[i] = i - 1
                Count[i, 0] = Count[i, 1] = (i "")
                Bottles[i] = "bottles"
        }
        Bottles[1] = "bottle"
        Bottles[0] = "bottles"
        # Execution
        for (i = 99; i >= 0; i--) {
                printf("%s %s of beer on the wall, %s %s of beer.\n",
                        Count[i, 1],
                        Bottles[i],
                        Count[i, 0],
                        Bottles[i])
                printf("%s, %s %s of beer on the wall.\n\n",
                        Instruction[i],
                        Count[Next[i], 0],
                        Bottles[Next[i]])
        }
}

I'll drink to that.


categories: Mail,Apr,2009,Admin

Awk and Mail

These pages focused on using Awk to implement filters on Unix mail files.


categories: Mail,Apr,2009,StevenH

Shell Statistical Spam Filter and Whitelist

Contents

About

Author

Steven Hauser.

Origin

http://www.tc.umn.edu/~hause011/article/Statistical_spam_filter.html

Client Side Unix Shell - AWK with updating email address "Whitelist"

I now use a "Statistical Spam Filter". Wow, the scummy sewer of internet mail is cleansed, refreshed and usable again. Just using the delete button was getting too difficult, I got 8 to 10 spam for every good piece of mail. As a spam detector I am not as good a filter as you might think, just the subject and address is not always enough, an anti-spam tool I am not, I would occasionally open a spam to my great annoyance.

My interpretation of Paul Graham's Spam Article

My filter was inspired by Paul Graham's article about a Naive Bayesian spam filter. The article is at "A Plan for Spam". He basically says that you get statistics on how often tokens show up in two bodies of mail, (spam and good,) and then calculate the a statistical value that a single mail is spam by looking at the tokens in it. The more mail in the good and spam mail bodies, the better the filter is "trained". Jeez, he made it sound so easy. And it is. I slapped an anti-spam tool together as a ksh and awk script for use as a personal filter on a Unix type system. To implement it I put it in the ~/.forward file. The code is at the bottom of the article, less than 100 The total code for the filter and training script is less than 200

This filter differs in lots of ways from the Paul Graham article. I took out some of the biases he describes and simplified it, maybe it is too simple. What I find most interesting is that the differences do not seem to matter much, I still filter out 96+% of spams. I got those results with a spam sample that is at least 500 emails and a good email sample that is at least 700 emails. With smaller training samples or a different mail mix it may not get as good results, or it may be better. Note: I later changed the training body to be more like the proportion of real spam to good mail, which is much more spam than good mail, about 8-10 spam to every good mail received and the anti-spam tool worked better.

How the Spam Filter Works in Unix

First I run the training script on two bodies of mail, ~/Mail/received (good mail) and ~/Mail/junk (saved spam mail.) The ~/Mail/received file is already created on my unix box and holds mail that I have read and not deleted. The training script finds all the tokens in the emails and gives them a probability depending on how the token is found in the "spam mail" and the "good mail". The training script also creates the whitelist of addresses from the "good mail." As the mail flows through the system the training script will then "learn" each time it is run.

I run the actual spam filter script from the .forward file which allows a user to process mail before it hits your inbox. (Look up "man forward" at the shell prompt for further information on the .forward file.) The script first checks the whitelist for a good address, if it is found it passes the filter. If the address is not found it is passed to the statistical spam filter, the tokens are checked and the email is given a spaminess value. Above a certain value the email is classified as spam and put in the ~/Mail/junk file, below the value it passes to /var/spool/mail/mylogin where I read it as god intended email to be read, with a creaky old unix client. However, I can still read it with any other client I want, POP or IMAP.

Testing the Spam Filter

I included a little test script below that I used to check my results. I just split emails into files and run them one at a time and check the value the filter gives.

Testing on email that has been used to train the filter will give results that are very good and not valid, so I tested on email not seen by the training script. The filter does get much better at filtering as the training sample gets bigger, just like the other statistical spam filters. For example, at lower sample sizes (trained with 209 good mails, 301 spammails) the filter was pretty bad. When the average spam value cutoff was raised to .51 so no good mail was blocked, 44% of the spam email passed through on a set of 320 spam and 683 good email. Even so, that means %56 of the spam was blocked. Small sample sizes are not perfect, but are usable and I began using the mail filter with a sample set of about 600 good mail and 300 spam. As the training sample increased the results improved. As I changed the mail mix to reflect the real spam proportions it got even better, around 96-98% of the spam blocked. I think the lower early results were because of the proportions of spam to good email, they should reflect the real proportion received on the system used by the filter.

Paul Graham or others may have superior filters and better mathematics for anti-spam algorithms but I am not sure that it matters all that much, the amount of spam that gets through is small enough not to bother me.

Filter Performance

I used gawk in the filter and checked it with the gawk profiler to look for performance problems. The largest performance constraint is creating the spam-probability associative array in memory, the key-value pairs of tokens and the spam value I assign to them. Creating this associative array is more that 95% of the current time to process an email through the filter and gets worse when the set of tokens gets larger. Perl and other language users can get around this performance problem with DBM file interfaces, currently not available to my gawk filter.

White List Filter Improves Performance and Cuts Errors

I added a "whitelist" of good email addresses, a feature that helps keep good email from a bad classification and improves performance by a huge amount (at least a magnitude of 100) by not having to further filter the message. The white list is not one of the "challenge-response" things that annoys me so much that I toss any such email away, it simply learns from the email used to train the filter, it saves addresses that are from email that has passed the filter and gets in my "received" file. I figure that if I receive a good email from someone, chances are 100% that I want to receive email from that address. Note there is a place in the white list script to get rid of commonly forged email addresses, like your own address.

Why Differences With Bayes Filters Do Not Matter

The main concept put forward by Paul Graham holds true and seems ungodly robust: applying statistics to filter spam works very well compared to lame rule sets and black lists. My program just proves the robustness of the solution; apparently any half-baked formula (like what I used) seems to work as long as the base probability of the tokens is computed.

Here are some of the many differences between this filter and the filter in the Paul Graham article in no particular order of importance:

  • I do not lower-case the tokens, one result is that token frequency is set to three instead of five to be included in in the spam-probability associative array. I think that "case" is an important token characteristic.
  • "Number of mails" is replaced with "number of tokens." My explanation is that I am looking at a token frequency in an interval of a stream of tokens. It seems simpler to think of it that way, instead of number of mails. And when I tried "number of mails" I got the same result values on the messages for the formula I used.
  • "Interesting tokens" were tokens in the message with a spam statistic "greater than 0.95" and "less than 0.05" Easy to implement. I did not figure out the fifteen most interesting tokens, the limit used by Paul Graham. As a result, most of my mail has more than 15 interesting tokens, a few have fewer, which could be a weakness, but does not seem to matter too much.
  • Paul Graham's Naive Bayesian formula goes to 0 or 1 pretty quickly, which is fine, I tried it out in awk too. But now I just sum the "interesting token" probabilities and divide by the number of "interesting tokens" per message. Yes, it is just an average of the probability of "interesting tokens" and it is easy to implement and spreads the values over a 0-1 interval, spam towards 1 and good mail towards 0. I did this to implement some spam filtering as soon as possible. Even with a small sample of mail I was able to adjust the average probability value up to keep all the good mail and still get rid of a good proportion of spam. As I acquired more sample mail the filter caught more spam and I adjusted the average probability value down.
  • I have a "training" program that generates the token probabilities and an address "whitelist" to be run as a batch job at intervals (like once a day or week) and a separate filter program run out of ".forward"
  • I did not double the frequency value of the "good tokens" to bias them in the base spam probability calculation of each token.
  • Tokens not seen by the filter before are ignored. Paul Graham gives them a 0.4 probability of spaminess. Most other methods of calculating the probability of unknown tokens end up being ignored by my formula as they would have a probability outside the "interesting token" ranges.
  • I noticed that Paul Graham ignores HTML comments. When I looked at some of the spam I found out why, some spammers load recipient address and common words into HTML comments spread through the text to pass rule filters but the statistical spam filter seems to find them anyway so I include tags, comments, everything.

Code

Test SpamFilter

Note: Do not test mail that has been used to train the filter, test mail not seen by the training program.

#!/bin/ksh
filter_test () {
  # Split a file of unix email into many mail files with this:
  cat ~/Mail/rece* |csplit -k -f good -n 4 - '/^From /' {900}

  # Run a modified filter that displays the spam value for each mail file.
  # I just commented out the last part of the filter and added a 
  # print statement of the Subject line and spam value the filter found.
  for I in test/good*
  do
     cat $I | [filter_program-that_shows_the_value_only]
  done | sort -n 
}

Train SpamFilter.

Call from the command line or in a crontab file.

#!/bin/ksh
number_of_tokens (){
  zcat $1 | cat $2 - | wc -w
}

 # Note: Get rid of addresses that are commonly forged at the
 #       "My-Own-Address" string.
address_white_list (){
  zcat $1 | 
  cat $2 - | 
  egrep '^From |^Return-Path: ' | 
  nawk '{print tolower($2)}'| 
  nawk '{gsub ("<",""); gsub (">","");print;}'| 
  grep -v 'My-Own-Address'| 
  sort -u > ~/Mail/address_whitelist
}

 # Create a hash with probability of spaminess per token.
 #       Words only in good hash get .01, words only in spam hash get .99
spaminess () {
nawk 'BEGIN {goodnum=ENVIRON["GOODNUM"]; junknum=ENVIRON["JUNKNUM"];}
       FILENAME ~ "spamwordfrequency" {bad_hash[$1]=$2}
       FILENAME ~ "goodwordfrequency" {good_hash[$1]=$2}

    END    {
    for (word in good_hash) {
        if (word in bad_hash) { print word, 
            (bad_hash[word]/junknum)/ \
            ((good_hash[word]/goodnum)+(bad_hash[word]/junknum)) }
        else { print word, "0.01"}
    }
    for (word in bad_hash) {
        if (word in good_hash) { done="already"}
        else { print word, "0.99"}
    }}' ~/Mail/spamwordfrequency ~/Mail/goodwordfrequency 

}

 # Print list of word frequencies
frequency (){
  nawk ' { for (i = 1; i <= NF; i++)
        freq[$i]++ }
    END    {
    for (word in freq){
        if (freq[word] > 2) {
          printf "%s\t%d\n", word, freq[word];
        }
    } 
  }'
}
 # Note: I store the email in compressed files to keep my storage space small,
 #       so I have the gzipped mail that I run through the filter training 
 #       script as well as current uncompressed "good" and spam files.
 #       
prepare_data () {
  export JUNKNUM=$(number_of_tokens '/Your/home/Mail/*junk*.gz' '/Your/home/Mail/junk')
  export GOODNUM=$(number_of_tokens '/Your/home/Mail/*received*.gz' '/Your/home//Mail/received')
  address_white_list '/Your/home/Mail/*received*.gz' '/Your/home/Mail/received'

  echo $JUNKNUM $GOODNUM

  zcat ~/Mail/*junk*.gz | cat ~/Mail/junk - |
    frequency|
    sort -nr -k 2,2 > ~/Mail/spamwordfrequency
  zcat ~/Mail/*received*.gz | cat ~/Mail/received - |
    frequency|
    sort -nr -k 2,2 > ~/Mail/goodwordfrequency

  spaminess| 
    sort -nr -k 2,2 > ~/Mail/spamprobability
  # Clean up files
  rm ~/Mail/spamwordfrequency ~/Mail/goodwordfrequency 
}

 #########
 # Main

prepare_data
exit

Spamfilter using statistical filtering.

Inspired by the Paul Graham article "A Plan for Spam" www.paulgraham.com

Implement in the .forward file like so:

"| /Your/path/to/bin/spamfilter"

If mail is spam then put in a spam file else put in the good mail file.

#!/bin/ksh
spamly () {
/usr/bin/nawk '

   { message[k++]=$0; }

   END { if (k==0) {exit;} # empty message or was in the whitelist.

         good_mail_file="/usr/spool/mail/your_user";
         spam_mail_file="/Your/home/Mail/junk";
         spam_probability_file="/Your/home/Mail/spamprobability";
         total_tokens=0.01;

         while (getline < spam_probability_file)
            bad_hash[$1]=$2; close(spam_probability_file);

         for (line in message){ 
           token_number=split(message[line],tokens);
           for (i = 0; i <= token_number; i++){
             if (tokens[i] in bad_hash) { 
               if (bad_hash[tokens[i]] <= 0.06 || bad_hash[tokens[i]] >= 0.94){
                  total_tokens+=1;
                  spamtotal+=bad_hash[tokens[i]];
                }
              }
            }
         }

         if (spamtotal/total_tokens > 0.50) { 
            for (j = 0; j <= k; j++){ print message[j] >> spam_mail_file}
            print "\n\n" >> spam_mail_file;
         }
         else {
            for (j = 0; j <= k; j++){ print message[j] >> good_mail_file}
            print "\n\n" >> good_mail_file;
         }
   }'
}

 # Check whitelist for good address. 
 # if in whitelist then put in good_mail_file
 #   else Pass message through filter.
whitelister () {
  /usr/bin/nawk '
      BEGIN { whitelist_file="/Your/home/Mail/address_whitelist";
              good_mail_file="/usr/spool/mail/your_user";
              found="no";
              while (getline < whitelist_file)
              whitelist[$1]="address"; close(whitelist_file);
      }
      { message[k++]=$0;}
      /^From / {sender=tolower($2); 
            gsub ("\<","",sender);
            gsub ("\>","",sender); 
            if (whitelist[sender]) { found="yes";}
      }
      /^Return-Path: / {sender=tolower($2); 
            gsub ("\<","",sender);
            gsub ("\>","",sender); 
            if (whitelist[sender]) { found="yes";}
      }
      END { if (found=="yes") { 
               for (j = 0; j <= k; j++){ print message[j] >> good_mail_file}
               print "\n\n" >> good_mail_file;
            }
            else {
               for (j = 0; j <= k; j++){ print message[j];}
            }
      }'
}

 #####################################
 # Main
 # The mail is first checked by the white list, if it is not found in the
 # white list it is piped to the spam filter.
whitelister | spamly 
exit

categories: Mail,Apr,2009,ArnoldR

Mail Sort

Contents

Author

Arnold Robbins

Download

Download from LAWKER.

Description

Sorts a Unix style mailbox by "thread", in date+subject order.

This is a script I use quite a lot. It requires gawk although with some work could be ported to standard awk. The timezone offset from GMT has to be adjust to one's local offset, although I could probably eliminate that if I wanted to work on it hard enough.

This took me a while to write and get right, but it's been working flawlessly for a few years now. The script uses Message-ID header to detect and remove duplicates. It requires GNU Awk for time/date functions and for efficiency hack in string concatenation but could be made to run on a POSIX awk with some work.

Code

Main

BEGIN {
       TRUE = 1
       FALSE = 0

       split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", months, " ")
       for (i in months)
               Month[months[i]] = i    # map name to number

       MonthDays[1] = 31
       MonthDays[2] = 28       # not used
       MonthDays[3] = 31
       MonthDays[4] = 30
       MonthDays[5] = 31
       MonthDays[6] = 30
       MonthDays[7] = 31
       MonthDays[8] = 31
       MonthDays[9] = 30
       MonthDays[10] = 31
       MonthDays[11] = 30
       MonthDays[12] = 31

       In_header = FALSE
       Body = ""

       LocalOffset = 2 # We are two hours ahead of GMT

       # These keep --lint happier
       Debug = 0
       MessageNum = 0
       Duplicates = FALSE
}

/^From / {
       In_header = TRUE
       if (MessageNum)
               Text[MessageNum] = Body
       MessageNum++
       Body = ""
 # print MessageNum
}

In_header && /^Date: / {
       Date[MessageNum] = compute_date($0)
}

In_header && /^Subject: / {
       Subject[MessageNum] = canonacalize_subject($0)
}

In_header && /^Message-[Ii][Dd]: / {
       if (NF == 1) {
               getline junk
               $0 = $0 RT junk # Preserve original input text!
       }

       # Note: Do not use $0 directly; it's needed as the Body text
       # later on.

       line = tolower($0)
       split(line, linefields)

       message_id = linefields[2]
       Mesg_ID[MessageNum] = message_id        # needed for disambiguating message
       if (message_id in Message_IDs) {
               printf("Message %d is duplicate of %s (%s)\n",
                       MessageNum, Message_IDs[message_id],
                       message_id) > "/dev/stderr"
               Message_IDs[message_id] = (Message_IDs[message_id] ", " MessageNum)
               Duplicates++
       } else {
               Message_IDs[message_id] = MessageNum ""
       }
}


In_header && /^$/ {
       In_header = FALSE
       # map subject and date to index into text

       if (Debug && (Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]) in SubjectDateId) {
               printf(\
       ("Message %d: Subject <%s> Date <%s> Message-ID <%s> already in" \
       " SubjectDateId (Message %d, s: <%s>, d <%s> i <%s>)!\n"),
               MessageNum, Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum],
               SubjectDateId[Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]],
               Subject[SubjectDateId[Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]]],
               Date[SubjectDateId[Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]]],
               Mesg_ID[SubjectDateId[Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]]]) \
                       > "/dev/stderr"
       }

       SubjectDateId[Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]] = MessageNum

       if (Debug) {
               printf("\tMessage Num = %d, length(SubjectDateId) = %d\n",
                       MessageNum, length(SubjectDateId)) > "/dev/stderr"
               if (MessageNum != length(SubjectDateId) && ! Printed1) {
                       Printed1++
                       printf("---> Message %d <---\n", MessageNum) > "/dev/stderr"
               }
       }

       # build up mapping of subject to earliest date for that subject
       if (! (Subject[MessageNum] in FirstDates) ||
           FirstDates[Subject[MessageNum]] > Date[MessageNum])
               FirstDates[Subject[MessageNum]] = Date[MessageNum]
}

{
       Body = Body ($0 "\n")
}

END {
       Text[MessageNum] = Body # get last message

       if (Debug) {
               printf("length(SubjectDateId) = %d, length(Subject) = %d, length(Date) = %d\n",
                       length(SubjectDateId), length(Subject), length(Date))
               printf("length(FirstDates) = %d\n", length(FirstDates))
       }

       # Create new array to sort by thread. Subscript is
       # earliest date, subject, actual date
       for (i in SubjectDateId) {
               n = split(i, t, SUBSEP)
               if (n != 3) {
                       printf("yowsa! n != 3 (n == %d)\n", n) > "/dev/stderr"
                       exit 1
               }
               # now have subject, date, message-id in t
               # create index into Text
               Thread[FirstDates[t[1]], i] = SubjectDateId[i]
       }

       n = asorti(Thread, SortedThread)        # Shazzam!

       if (Debug) {
               printf("length(Thread) = %d, length(SortedThread) = %d\n",
                       length(Thread), length(SortedThread))
       }
       if (n != MessageNum && ! Duplicates) {
               printf("yowsa! n != MessageNum (n == %d, MessageNum == %d)\n",
                       n, MessageNum) > "/dev/stderr"
	#               exit 1
       }

       if (Debug) {
               for (i = 1; i <= n; i++)
                       printf("SortedThread[%d] = %s, Thread[SortedThread[%d]] = %d\n",
                               i, SortedThread[i], i, Thread[SortedThread[i]]) > "DUMP1"
               close("DUMP1")
               if (Debug ~ /exit/)
                       exit 0
       }

       for (i = 1; i <= MessageNum; i++) {
               if (Debug) {
                       printf("Date[%d] = %s\n",
                               i, strftime("%c", Date[i]))
                       printf("Subject[%d] = %s\n", i, Subject[i])
               }

               printf("%s", Text[Thread[SortedThread[i]]]) > "OUTPUT"
       }
       close("OUTPUT")

       close("/dev/stderr")    # shuts up --lint
}

compute_date

Pull apart a date string and convert to timestamp.

function compute_date(date_rec,         fields, year, month, day,
                                       hour, min, sec, tzoff, timestamp)
{
       split(date_rec, fields, "[:, ]+")
       if ($2 ~ /Sun|Mon|Tue|Wed|Thu|Fri|Sat/) {
               # Date: Thu, 05 Jan 2006 17:11:26 -0500
               year = fields[5]
               month = Month[fields[4]]
               day = fields[3] + 0
               hour = fields[6]
               min = fields[7]
               sec = fields[8]
               tzoff = fields[9] + 0
       } else {
               # Date: 05 Jan 2006 17:11:26 -0500
               year = fields[4]
               month = Month[fields[3]]
               day = fields[2] + 0
               hour = fields[5]
               min = fields[6]
               sec = fields[7]
               tzoff = fields[8] + 0
       }
       if (tzoff == "GMT" || tzoff == "gmt")
               tzoff = 0
       tzoff /= 100    # assume offsets are in whole hours
       tzoff = -tzoff

       # crude compensation for timezone
       # mktime() wants a local time:
       #       hour + tzoff yields GMT
       #       GMT + LocalOffset yields local time
       hour += tzoff + LocalOffset

       # if moved into next day, reset other values
       if (hour > 23) {
               hour %= 24
               day++
               if (day > days_in_month(month, year)) {
                       day = 1
                       month++
                       if (month > 12) {
                               month = 1
                               year++
                       }
               }
       }

       timestamp = mktime(sprintf("%d %d %d %d %d %d -1",
                               year, month, day, hour, min, sec))

       # timestamps can be 9 or 10 digits.
       # canonicalize them into 11 digits with leading zeros
       return sprintf("%011d", timestamp)
}

days_in_month

How many days in the given month?

function days_in_month(month, year)
{
       if (month != 2)
               return MonthDays[month]

       if (year % 4 == 0 && year % 400 != 0)
               return 29

       return 28
}

canonacalize_subject

Trim out "Re:", white space.

function canonacalize_subject(subj_line)
{
       subj_line = tolower(subj_line)
       sub(/^subject: +/, "", subj_line)
       sub(/^(re: *)+/, "", subj_line)
       sub(/[[:space:]]+$/, "", subj_line)
       gsub(/[[:space:]]+/, " ", subj_line)

       return subj_line
}

Copyright

Copyright 2007, 2008, Arnold David Robbins arnold@skeeve.com

blog comments powered by Disqus