About awk.info
Mar 01: Michael Sanders demos an X-windows GUI for AWK.
Mar 01: Awk100#24: A. Lahm and E. de Rinaldis' patent search, in AWK
Feb 28: Tim Menzies asks this community to write an AWK cookbook.
Feb 28: Arnold Robbins announces a new debugger for GAWK.
Feb 28: Awk100#23: Premysl Janouch offers an IRC bot, in AWK
Feb 28: Updated: the AWK FAQ
Feb 28: Tim Menzies offers a tiny content management system, in Awk.
Jan 31: Comment system added to awk.info. For example, see the discussion at the bottom of ?keys2awk
Jan 31: Martin Cohen shows that Gawk can handle massively long strings (300 million characters).
Jan 31: The AWK FAQ is being updated. For comments/ corrections/ extensions, please mail tim@menzies.us
Jan 31: Martin Cohen finds Awk on the Android platform.
Jan 31: Aleksey Cheusov released a new version of runawk.
Jan 31: Hirofumi Saito contributes a candidate Awk mascot.
Jan 31: Michael Sanders shows how to quickly build an AWK GUI for windows.
Jan 31: Hyung-Hwan Chung offers QSE, an embeddable Awk Interpreter.
These pages are grouped into the topics, listed below (latest one shown first):
210 pages.
Awk is being used all around the world for real programming problems, but the news is not getting out.
We are aiming to create a database of at least one hundred Awk programs which will:
If you, or your colleagues or friends have written a program which has been used for purposes small or large, why not take five minutes to record the facts, so that others can see what you've done?
To contribute, fill in this template and mail it to mail@awk.info with the subject line Awk 100 contribution.
(Recent additions are shown first.)
These pages focus on sys admin tools in Awk.
The Awk.info Top 10 pages highlight the "best" (most impressive, most insightful, most fun, most visited) pages on this site.
Awk.info is maintained by the international awk community. There are many ways you can contribute and get listed below.
Author of awk++.
Jim has been a Great Auk since Feb'09.
2009: consultant, CycCorp.
2009: assoc Prof, LCSEE, WVU email: tim@menzies.us web site: http://menzies.us.
Tim has been a Great Auk since Feb'09.
2009: frequent poster to comp.lang.awk
From: Tim Menzies <tim@menzies.us>
To: mikelangman@blueyonder.co.uk
Subject: auk images
I write to see if you would be gracious enough to grant us usage rights for your auk paintings to use on this site, in exchange for appropriate credit such as:
From: Mike Langman <mikelangman@blueyonder.co.uk>
Date: Mon, Jan 19, 2009 at 2:55 AM
Subject: Re: auk images
I normally charge for the use of images but as there is no money involved please carry on using the images and include a link to my website as suggested.
Many thanks for asking.
- Mike
Arnold Robbins, an Atlanta native, is a professional programmer and technical author. He has worked with Unix systems since 1980, when he was introduced to a PDP-11 running a version of Sixth Edition Unix.
He has been a heavy AWK user since 1987, when he became involved with gawk, the GNU project's version of AWK. As a member of the POSIX 1003.2 balloting group, he helped shape the POSIX standard for AWK. He is currently the maintainer of gawk and its documentation.
Since late 1997, he and his family have been living happily in Israel.
2009: gnu utils developer and monitor of comp.lang.awk
Some must lead, some must follow, and some have to fix the typos.
A Great Auk is someone with write permission to our repository. Since the source for this web site is stored in that repository, it also means that they are webmasters of this site. So they (try) to:
If you want to be a Great Auk, please start contributing to this site using any of the usual methods. Once it is clear that you know what you are doing and that you play nice with others, then you should ask a current Great Auk to nominate you. Then, all the current Great Auks will vote on giving you write access.
The current Great Auks are
"Because easy is not wrong." - Anon
From various sources:
Quotes:
From Project Management Advice:
From Awk programming:
From Awk as a Major Systems Programming Language:
According to Ramesh Natarajan:
From the NoSQL pages:
To join our community, consider contributing to this site.
For a list of authors of this site, see our credits pages.
The Awk Wiki.
USENET discussion group: comp.lang.awk.
For discussions on Awk, see the Awk discussion group.
For comments/ complaints/ corrections/ extensions to this site, contact mail@awk.info.
Awk is a stable, cross platform computer language named for its
authors
Alfred Aho,
Peter Weinberger &
Brian Kernighan. They write:
"Awk is a convenient and expressive programming language that can be
applied to a wide variety of computing and data-manipulation tasks".
In Classic Shell Scripting, Arnold Robbins & Nelson Beebe confess their Awk bias: "We like it. A lot. The simplicity and power of Awk often make it just the right tool for the job."
Besides the Bourne shell, Awk is the only other scripting language available in the standard Unix environment. Implementations of AWK exist as installed software for almost all other operating systems.
Awk is a mature language: it was first implemented in the 1970s. As a tool from the golden age, it is sometimes called primitive. It is more accurate to call it elemental, so tightly focused is the language on what it does best: quickly converting this into that.
Consequently, throughout history, Awk has been the language of choice for many famous scientists such as Leonardo daVinci.
LAWKER is a repository of Awk code divided into:
See How to Contribute.
Use our issue tracking system.
Many communities have a mascot, a banner that they proudly wave high. So where's the Awk mascot?
I made one up, but you gotta say, it is kinda lame:
So if you have any ideas for such a mascot, please email mail@awk.info with the subject line "suggestion for mascot".
Not to stifle anyone's creativity but the mascot might be based on the mantra "less, but better" or "easy is not wrong" or "a little awk goes a long way".
Chris writes "more of a logo rather than a mascot":
by R. Loui
ACM Sigplan Notices, Volume 31, Number 8, August 1996
Most people are surprised when I tell them what language we use in our undergraduate AI programming class. That's understandable. We use GAWK. GAWK, Gnu's version of Aho, Weinberger, and Kernighan's old pattern scanning language isn't even viewed as a programming language by most people. Like PERL and TCL, most prefer to view it as a `scripting language.' It has no objects; it is not functional; it does no built-in logic programming. Their surprise turns to puzzlement when I confide that (a) while the students are allowed to use any language they want; (b) with a single exception, the best work consistently results from those working in GAWK. (footnote: The exception was a PASCAL programmer who is now an NSF graduate fellow getting a Ph.D. in mathematics at Harvard.) Programmers in C, C++, and LISP haven't even been close (we have not seen work in PROLOG or JAVA).
There are some quick answers that have to do with the pragmatics of undergraduate programming. Then there are more instructive answers that might be valuable to those who debate programming paradigms or to those who study the history of AI languages. And there are some deep philosophical answers that expose the nature of reasoning and symbolic AI. I think the answers, especially the last ones, can be even more surprising than the observed effectiveness of GAWK for AI.
First it must be confessed that PERL programmers can cobble together AI projects well, too. Most of GAWK's attractiveness is reproduced in PERL, and the success of PERL forebodes some of the success of GAWK. Both are powerful string-processing languages that allow the programmer to exploit many of the features of a UNIX environment. Both provide powerful constructions for manipulating a wide variety of data in reasonably efficient ways. Both are interpreted, which can reduce development time. Both have short learning curves. The GAWK manual can be consumed in a single lab session and the language can be mastered by the next morning by the average student. GAWK's automatic initialization, implicit coercion, I/O support and lack of pointers forgive many of the mistakes that young programmers are likely to make. Those who have seen C but not mastered it are happy to see that GAWK retains some of the same sensibilities while adding what must be regarded as a spoonful of syntactic sugar. Some will argue that PERL has superior functionality, but for quick AI applications, the additional functionality is rarely missed. In fact, PERL's terse syntax is not friendly when regular expressions begin to proliferate and strings contain fragments of HTML, WWW addresses, or shell commands. PERL provides new ways of doing things, but not necessarily ways of doing new things.
In the end, despite minor differences, both PERL and GAWK minimize programmer time. Neither really provides the programmer the setting in which to worry about minimizing run-time.
There are further simple answers. Probably the best is the fact that increasingly, undergraduate AI programming is involving the Web. Oren Etzioni (University of Washington, Seattle) has for a while been arguing that the "softbot" is replacing the mechanical engineers' robot as the most glamorous AI test bed. If the artifact whose behavior needs to be controlled in an intelligent way is the software agent, then a language that is well-suited to controlling the software environment is the appropriate language. That would imply a scripting language. If the robot is KAREL, then the right language is turn left; turn right. If the robot is Netscape, then the right language is something that can generate Netscape -remote 'openURL(http://cs.wustl.edu/~loui)' with elan.
Of course, there are deeper answers. Jon Bentley found two pearls in GAWK: its regular expressions and its associative arrays. GAWK asks the programmer to use the file system for data organization and the operating system for debugging tools and subroutine libraries. There is no issue of user-interface. This forces the programmer to return to the question of what the program does, not how it looks. There is no time spent programming a binsort when the data can be shipped to /bin/sort in no time. (footnote: I am reminded of my IBM colleague Ben Grosof's advice for Palo Alto: Don't worry about whether it's highway 101 or 280. Don't worry if you have to head south for an entrance to go north. Just get on the highway as quickly as possible.)
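The two "pearls" and the hand-off to the system sort can be sketched in a few lines. This is only an illustration, not code from the essay: the function name `word_freq` and the sample input are hypothetical, and it assumes a POSIX awk and sort(1) are on the path.

```sh
# Hypothetical helper: count word frequencies with an awk associative
# array, then let sort(1) do the ordering instead of coding a sort.
word_freq() {
  awk '
    {
      # A regular expression splits each line on runs of non-letters.
      n = split(tolower($0), words, /[^a-z]+/)
      for (i = 1; i <= n; i++)
        if (words[i] != "")
          count[words[i]]++      # unseen keys start out as zero
    }
    END {
      for (w in count)
        print count[w], w        # traversal order is unspecified
    }
  ' | sort -k1,1nr -k2,2         # highest count first, ties by word
}

printf 'the cat saw the dog\nThe dog ran\n' | word_freq
# 3 the
# 2 dog
# 1 cat
# 1 ran
# 1 saw
```

The awk program holds no ordering logic at all; the pipe to sort supplies it, which is the division of labor the paragraph above describes.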
There are some similarities between GAWK and LISP that are illuminating. Both provided a powerful uniform data structure (the associative array implemented as a hash table for GAWK and the S-expression, or list of lists, for LISP). Both were well-supported in their environments (GAWK being a child of UNIX, and LISP being the heart of lisp machines). Both have trivial syntax and find their power in the programmer's willingness to use the simple blocks to build a complex approach.
Deeper still, is the nature of AI programming. AI is about functionality and exploratory programming. It is about bottom-up design and the building of ambitions as greater behaviors can be demonstrated. Woe be to the top-down AI programmer who finds that the bottom-level refinements, `this subroutine parses the sentence,' cannot actually be implemented. Woe be to the programmer who perfects the data structures for that heap sort when the whole approach to the high-level problem needs to be rethought, and the code is sent to the junk heap the next day.
AI programming requires high-level thinking. There have always been a few gifted programmers who can write high-level programs in assembly language. Most however need the ambient abstraction to have a higher floor.
Now for the surprising philosophical answers. First, AI has discovered that brute-force combinatorics, as an approach to generating intelligent behavior, does not often provide the solution. Chess, neural nets, and genetic programming show the limits of brute computation. The alternative is clever program organization. (footnote: One might add that the former are the AI approaches that work, but that is easily dismissed: those are the AI approaches that work in general, precisely because cleverness is problem-specific.) So AI programmers always want to maximize the content of their program, not optimize the efficiency of an approach. They want minds, not insects. Instead of enumerating large search spaces, they define ways of reducing search, ways of bringing different knowledge to the task. A language that maximizes what the programmer can attempt rather than one that provides tremendous control over how to attempt it, will be the AI choice in the end.
Second, inference is merely the expansion of notation. No matter whether the logic that underlies an AI program is fuzzy, probabilistic, deontic, defeasible, or deductive, the logic merely defines how strings can be transformed into other strings. A language that provides the best support for string processing in the end provides the best support for logic, for the exploration of various logics, and for most forms of symbolic processing that AI might choose to call "reasoning" instead of "logic." The implication is that PROLOG, which saves the AI programmer from having to write a unifier, saves perhaps two dozen lines of GAWK code at the expense of strongly biasing the logic and representational expressiveness of any approach.
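As a rough illustration of inference as string transformation, the sketch below treats facts and rules as plain strings and applies modus ponens until nothing new can be derived. It is deliberately propositional: it is not the unifier the essay mentions, and the `premise -> conclusion` rule syntax, the function name `forward_chain`, and the sample facts are all assumptions made for this example.

```sh
# Hypothetical sketch: propositional forward chaining over strings.
# Input lines of the form "premise -> conclusion" are rules; any other
# non-empty line is a fact. No variables, so no unification is needed.
forward_chain() {
  awk '
    / -> / {
      split($0, parts, / -> /)
      n++
      premise[n] = parts[1]
      conclusion[n] = parts[2]
      next
    }
    NF { known[$0] = 1 }
    END {
      # Repeat modus ponens until a full pass derives nothing new.
      do {
        changed = 0
        for (i = 1; i <= n; i++)
          if ((premise[i] in known) && !(conclusion[i] in known)) {
            known[conclusion[i]] = 1
            changed = 1
          }
      } while (changed)
      for (f in known) print f
    }
  ' | sort
}

printf 'rain\nrain -> wet\nwet -> slippery\nsun -> dry\n' | forward_chain
# rain
# slippery
# wet
```

Swapping in a different logic would mean changing how one string licenses another inside the `END` loop, which is the kind of exploration the paragraph argues a string-processing language makes cheap.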
I view these last two points as news not only to the programming language community, but also to much of the AI community that has not reflected on the past decade's lessons.
In the puny language, GAWK, which Aho, Weinberger, and Kernighan thought not much more important than grep or sed, I find lessons in AI's trends, AI's history, and the foundations of AI. What I have found not only surprising but also hopeful, is that when I have approached the AI people who still enjoy programming, some of them are not the least bit surprised.
by Ronald P. Loui
Associate Professor of CSE
Washington University in St. Louis
(Pre-publication draft; copyright reserved by author. A subsequent version of this document appeared as IEEE Computer, vol. 41, no. 7, July 2008).
To the credit of this journal, it had the courage to publish the signal paper on scripting, John Ousterhout's "Scripting: Higher Level Programming for the 21st Century" in 1998. Today, that document rolls forward with an ever-growing list of positive citations. More importantly, every major observation in that paper seems now to be entrenched in software practice today; every benefit claimed for scripting appears to be genuine (flexibility of typelessness, rapid turnaround of interpretation, higher level semantics, development speed, appropriateness for gluing components and internet programming, ease of learning and increase in amount of casual programming).
Interestingly, IEEE COMPUTER also just printed one of the most canonical attacks on scripting, by one Diomidis Spinellis, 2005, "Java Makes Scripting Languages Irrelevant?" Part of what makes this attack interesting is that the author seems unconvinced of his own title; the paper concludes with more text devoted to praising scripting languages than it expends in its declaration of Java's progress toward improved usability. It is unclear what is a better recommendation for scripting: the durability of Ousterhout's text or the indecisiveness of this recent critic's.
The real shock is that the academic programming language community continues to reject the sea change in programming practices brought about by scripting. Enamored of the object-oriented paradigm, especially in the undergraduate curriculum, unwilling to accept the LAMP (Linux-Apache-MySQL-Perl/Python/Php) tool set, and firmly believing that more programming theory leads to better programming practice, the academics seem blind to the facts on the ground. The ACM flagship, COMMUNICATIONS OF THE ACM for example, has never published a paper recognizing the scripting philosophy, and the references throughout the ACM Digital Library to scripting are not encouraging.
Part of the problem is that scripting has risen in the shadow of object-oriented programming and highly publicized corporate battles between Sun, Netscape, and Microsoft with their competing software practices. Scripting has been appearing language by language, including object-oriented scripting languages now. Another part of the problem is that scripting is only now mature enough to stand up against its legitimate detractors. Today, there are answers to many of the persistent questions about scripting: is there a scripting language appropriate for the teaching of CS1 (the first programming course for majors in the undergraduate computing curriculum)? Is there a scripting language for enterprise or real-time applications? Is there a way for scripting practices to scale to larger software engineering projects?
I intend to review the recent history briefly for those who have not yet joined the debate, then present some of the answers that scripting advocates now give to those nagging questions. Finally, I will describe how a real pragmatism of academic interest in programming languages would have better prepared the academic computing community to see the changes that have been afoot.
1996-1998 are perhaps the most interesting years in the phylogeny of scripting. In those years, perl "held the web together", and together with a new POSIX awk and GNU gawk, was shipping with every new Linux. Meanwhile javascript was being deployed furiously (javascript bearing no important relation to java, having been renamed from "livescript" for purely corporate purposes, apparently a sign of Netscape's solidarity with Sun, and even renamed "jscript" under Microsoft). Also, a handoff from tcl/tk to python was taking place as the language of choice for GUI developers who would not yield to Microsoft's VisualBasic. Php appeared in those years, though it would take another round of development before it would start displacing server-side perl, cold fusion, and asp. Every one of these languages is now considered a classic, even prototypical, scripting language.
Already by mid-decade, the shift from scheme to java as the dominant CS1 language was complete, and the superiority of c++ over c was unquestioned in industry. But java applets were not well supported in browsers, so the appeal of "write once, run everywhere" quickly became derided as "write once, debug everywhere." Web page forms, which used the common gateway interface (cgi) were proliferating, and systems programming languages like c became recognized as overkill for server-side programming. Developers quickly discovered the main advantage of perl for cgi forms processing, especially in the dot-com setting: it minimized the programmer's write-time. What about performance? The algorithms were simple, network latency masked small delays, and database performance was built into the database software. It turned out that the bottleneck was the programming. Even at run-time, the network and disk properties were the problems, not the cpu processing. What about maintenance? The developers and management were both happy to rewrite code for redesigned services rather than deal with legacy code. Scripting, it turns out, was so powerful and programmer-friendly that it was easier to create new scripts from scratch than to modify old programs. What about user interface? After all, by 1990, most of the programming effort had become the writing of the GUI, and the object-oriented paradigm had much of its momentum in the inheritance of interface widget behaviors. Surprisingly, the interface that most programmers needed could be had in a browser. The html/javascript/cgi trio became the GUI, and if more was needed, then ambitious client-side javascript was more reliable than the browser's java virtual machine. Moreover, the server-side program was simply a better way to distribute automation in a heterogeneous internet than the downloadable client-side program, regardless of whether the download was in binary or bytecode.
Although there was not agreement on what exact necessary and sufficient properties characterized scripting and distinguished it from "more serious" programming, several things were clear:
This last point was extremely counterintuitive. Strong typing, naming regimen, and verbosity were motivated mainly by a desire to help the programmer avoid errors. But the programmer who had to generate too many keystrokes and consult too many pages, who had to search through many different files to discover semantics, and who had to follow too many rules, who had to sustain motivation and concentration over a long period of time, was a distracted and consequently inefficient programmer. Just as vast libraries did not deliver the promise of greater reusability, and virtual machines did not deliver the promise of platform-independence, the language's promise to discipline the programmer quite simply did not reduce the tendency of humans to err. It exchanged one kind of frequent error for another.
Scripting languages became the favorite tools of the independent-minded programmers: the "hackers" yes, but also the gifted and genius programmers who tended to drive a project's design and development. As Paul Graham noted (in a column reprinted in "Hackers and Painters" or this), one of the lasting and legitimate benefits of java is that it permits managers to level the playing field and extract considerable productivity from the less talented and less motivated programmers (hence, more disposable). There was a corollary to this difference between the mundane and the liberating:
The distinct features of scripting languages that produce these effects are usually enumerated as semantic features, starting with low I/O specification costs, the use of implicit coercion and weak typing, automatic variable initialization with optional declaration, predominant use of associative arrays for storage and regular expressions for pattern matching, reduced syntax, and powerful control structures. But the main reason for the productivity gains may be found in the name "scripting" itself. To script an environment is to be powerfully embedded in that environment. In the same way that the dolphin reigns over the open ocean, lisp is a powerful language for those who would customize their emacs, javascript is feral among browsers, and gawk and perl rule the linux jungle.
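Several of the listed features (implicit coercion, automatic initialization, associative arrays, regular expressions, and low I/O cost) show up together even in a very small awk program. The sketch below is illustrative only; the function name `total_by_user` and the two-column input format are hypothetical, and it assumes a POSIX awk.

```sh
# Hypothetical example: sum the second column per name in the first.
total_by_user() {
  awk '
    /^#/ { next }                  # regex pattern skips comment lines
    $2 ~ /^[0-9]+(\.[0-9]+)?$/ {   # keep rows whose 2nd field is numeric
      total[$1] += $2              # text field is coerced to a number;
                                   # total[$1] needs no declaration
    }
    END {
      for (u in total)
        printf "%s %.2f\n", u, total[u]
    }
  ' | sort
}

printf '# log\nann 1.5\nbob 2\nann 2.5\nbob n/a\n' | total_by_user
# ann 4.00
# bob 2.00
```

There is no file opening, no parsing code, and no type declaration here; input splitting into `$1` and `$2` is supplied by the language, which is what "low I/O specification costs" amounts to in practice.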
There is even a hint of AI in the idea of scripting: the scripting language is the way to get high level control, to automate by capturing the intentions and routines normally provided by the human. If recording and replaying macros is a kind of autopilot, then scripting is a kind of proxy for human decisionmaking. Nowhere is this clearer than in simple server-side php, or in sysadmin shell scripting.
So where do we stand now? While it may have been risky for Ousterhout to proclaim scripting on the rise in 1998, it would be folly to dismiss the success of scripting today. It is even possible that java will yield its position of dominance in the near future. (By the time this essay is printed, LAMP and AJAX might be the new darlings of the tech press; see recent articles in Business Week, this IEEE COMPUTER, and even James Gosling's blog where he concedes he was wanting to write a scripting language when he was handed the java project. Java is very much in full retreat.) Is scripting ready to fill the huge vacuum that would be produced?
I personally believe that CS1 java is the greatest single mistake in the history of computing curricula. I believe this because of the empirical evidence, not because I have an a priori preference (I too voted to shift from scheme to java in our CS1, over a decade ago, so I am complicit in the java debacle). I reported in SIGPLAN 1996 ("Why gawk for AI?") that only the scripting programmers could generate code fast enough to keep up with the demands of the artificial intelligence laboratory class. Even though students were allowed to choose any language they wanted, and many had to unlearn the java ways of doing things in order to benefit from scripting, there were few who could develop ideas into code effectively and rapidly without scripting. In the intervening decade, little has changed. We actually see more scripting, as students are happy to compress images so that they can script their computer vision projects rather than stumble around in c and c++. In fact, students who learn to script early are empowered throughout their college years, especially in the crucial UNIX and web environments. Those who learn only java are stifled by enterprise-sized correctness and the chimerae of just-in-time compilation, swing, JRE, JINI, etc. Young programmers need to practice and produce, and to learn through mistakes why discipline is needed. They need to learn design patterns by solving problems, not by reading interfaces to someone else's black box code. It is imperative that programmers learn to be creative and inventive, and they need programming tools that support code exploration rather than code production.
What scripting language could be used for CS1? My personal preferences are gawk, javascript, php, and asp, mainly because of their very gentle learning curves. I don't think perl would be a disaster; its imperfection would create many teaching moments. But there is emerging consensus in the scripting community that python is the right choice for freshman programming. Ruby would also be a defensible choice. Python and ruby have the enviable properties that almost no one dislikes them, and almost everyone respects them. Both languages support a wide variety of programming styles and paradigms and satisfy practitioners and theoreticians equally. Both languages are carefully enough designed that "correct" programming practices can be demonstrated and high standards of code quality can be enforced. The fact that Google stands by python is an added motivation for undergraduate majors.
But do scripting solutions scale? What about the performance gap when the polynomial, or worse the exponential, algorithm faces large n, and the algorithm is written in an interpreted or weakly compiled language? What about software engineering in the large, on big projects? There has been a lot of discussion about scalability of scripts recently. In the past, debates have simply ended with the concession that large systems would have to be rewritten in c++, or a similar language, once the scripting had served its prototyping duty.
The enterprise question is the easier of the two. Just as the individual programmer reaps benefits from a division of labor among tools, writing most of the code in scripts, and writing all bottleneck code in a highly optimizable language, the group of programmers benefits from the use of multiple paradigms and multiple languages. In a recent large project, we used vhdl for fpga's with a lot of gawk to configure the vhdl. We used python and php to generate dynamic html with svg and javascript for the interfaces. We used c and c++ for high performance communications wrappers, which communicated xml to higher level scripts that managed databases and processes. We saw sysadmin and report-generation in perl, ruby, and gawk, data scrubbing in perl and gawk, user scripting in bash, tcl, and gawk, and prototyping in perl and gawk. Only one module was written in java (because that programmer loved java): it was late, it was slow, it failed, and it was eventually rewritten in c++. In retrospect, neither the high performance components nor the lightweight code components were appropriate for the java language. Does scripting scale to enterprise software? I would not manage a project that did not include a lot of scripting, to minimize the amount of "hard" programming, to increase flexibility and reduce delivery time at all stages, to take testing to a higher level, and to free development resources for components where performance is actually critical. I nearly weep when I think about the text processing that was written in c under my managerial watch, because the programmer did not know perl. We write five hundred line scripts in gawk that would be ten thousand line modules in java or c++. In view of the fact that there are much better scripting tools for most of what gets programmed in java and c++, perhaps the question is whether java and c++ scale.
How about algorithmic complexity? Don't scripting languages take too long to perform nested loops? The answer here is that a cpu-bound tight loop such as a matrix multiplication is indeed faster in a language like c. But such bottlenecks are easy to identify and indeed easy to rewrite in c. True system bottlenecks are things like paging, chasing pointers on disk, process initialization, garbage collection, fragmentation, cache mismanagement, and poor data organization. Often, we see that better data organization was unimplemented because it would have required more code, code that would have been attempted in an "easier" programming language like a scripting language, but which was too difficult to attempt in a "harder" programming language. We saw this in the AI class with heuristic search and computer vision, where brute force is better in c, but complex heuristics are better than brute force, and scripting is better for complex heuristics. When algorithms are exponential, it usually doesn't matter what language you use because most practical n will incur too great a cost. Again, the solution is to write heuristics, and scripting is the top dog in that house. Cpu's are so much faster than disks these days that a single extra disk read can erase the CPU advantage of using compiled c instead of interpreted gawk. In any case, java is hardly the first choice for those who have algorithmic bottlenecks.
The real reason why academics were blindsided by scripting is their lack of practicality. Academic computing was generally late to adopt Wintel architectures, late to embrace cgi programming, and late to accept Linux in the same decade that brought scripting's rise. Academia understandably holds industry at a distance. Still, there is a purely intellectual reason why programming language courses are only now warming to scripting. The historical concerns of programming language theory have been syntax and semantics. Java's amazing contribution to computer science is that it raised so many old-fashioned questions that tickled the talents of existing programming language experts: e.g., how can it be compiled? But there are new questions that can be asked, too, such as what a particular language is well-suited to achieve inexpensively, quickly, or elegantly, especially with the new mix of platforms. The proliferation of scripting languages represents a new age of innovation in programming practice.
Linguists recognize something above syntax and semantics, and they call it "pragmatics". Pragmatics has to do with more abstract social and cognitive functions of language: situations, speakers and hearers, discourse, plans and actions, and performance. We are entering an era of comparative programming language study when the issues are higher-level, social, and cognitive too.
My old friend, Michael Scott, has a popular textbook called PROGRAMMING LANGUAGE PRAGMATICS. But it is a fairly traditional tome concerned with parameter passing, types, and bindings (it's hard to see why it merits "pragmatics" in its title, even as it goes to second edition with a chapter on scripting added!). A real programming pragmatics would ask questions like:
There have been programming language "shootouts" and "scriptometers" on the internet that have sought to address some of the questions that are relevant to the choice of scripting language, but they have been just first steps. For example, one site reports on the shortest script in each scripting language that can perform a simple task. But absolute brevity for trivial tasks, such as "print hello world" is not as illuminating as typical brevity for real tasks, such as xml parsing.
Pragmatic questions are not the easiest questions for mathematically-inclined computer scientists to address. They refer by their nature to people, their habits, their sociology, and the technological demands of the day. But it is the importance of such questions that makes programmers choose scripting languages. Ousterhout declared scripting on the rise, but perhaps so too are programming language pragmatics.
I have to thank Charlie Comstock for contributing many ideas and references over the past two years that have shaped my views, especially the commitment to the idea of pragmatics.
Prof. Dr. Loui and his students are the usual winners of the department programming contest and have contributed to current gnu releases of gawk and malloc. He has lectured on AI for two decades on five continents, taught AI programming for two decades, and is currently funded on a project delivering hardware and software on U.S. government contracts.
From awk.freeshell.org:
It's a bit embarrassing to note that the exact origins of each are a bit hazy. This whole section requires further work, including the addition of links pointing to source repositories and binary distribution points.
Historical list of Awk implementations.
by T. Menzies
"The Enlightened Ones say that....
Awk is a good old-fashioned UNIX filtering tool invented in the 1970s. The language is simple and Awk programs are generally very short. Awk is useful when the overhead of more sophisticated approaches is not worth the bother. Also, the cost of learning Awk is very low.
But aren't there better scripting languages? Faster? Well, maybe yes and maybe no.
And Awk is old (mid-70s). Aren't modern languages more productive? Well again, maybe yes and maybe no. One measure of the productivity of a language is how many lines of code are required to code up one business level `function point'. Compared to many popular languages, GAWK scores very highly:
loc/fp language
------ --------
6, excel 5
13, sql
21, awk <================
21, perl
21, eiffel
21, clos
21, smalltalk
29, delphi
29, visual basic 5
49, ada 95
49, ai shells
53, c++
53, java
64, lisp
71, ada 83
71, fortran 95
80, 3rd generation default
91, ansi cobol 85
91, pascal
107, 2nd generation default
107, algol 68
107, cobol
107, fortran
128, c
320, 1st generation default
640, machine language
3200, natural language
Anyway, there are other considerations. Awk is real succinct, simple enough to teach, and easy enough to recode in C (if you want raw speed). For example, here's the complete listing of someone's Awk spell-checking program.
BEGIN {while ((getline<"Usr.Dict.Words") > 0) dict[$0]=1}
!dict[$1] {print $1}
Sure, there's about a gazillion enhancements you'd like to make on this one but you gotta say, this is real succinct.
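As one example of those enhancements, here is a minimal sketch (not the original author's code) that checks every word on a line, ignores case and punctuation, and uses the "in" operator so that looking a word up does not add it to the array. The shell function name spellcheck and the variable Dict are made up for this illustration; the dictionary is assumed to hold one word per line.

```shell
# spellcheck DICTFILE: read text on stdin, print each word not found in DICTFILE.
spellcheck() {
  awk -v Dict="$1" '
    BEGIN {
      # Compare with > 0 so that a missing file (getline returns -1) ends the loop.
      while ((getline word < Dict) > 0) dict[tolower(word)] = 1
      close(Dict)
    }
    {
      for (i = 1; i <= NF; i++) {
        w = tolower($i)
        gsub(/[^a-z]/, "", w)        # strip punctuation and digits
        # "in" tests membership without creating dict[w]
        if (w != "" && !(w in dict)) print w
      }
    }'
}
```

Usage: spellcheck Usr.Dict.Words < letter.txt (letter.txt being any text file you want checked).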
Awk is the cure for late execution of software syndrome (a.k.a. LESS). The symptoms of LESS are a huge time delay before a new idea is executable. Awk programmers can hack up usable systems in the time it takes other programmers to boot their IDE. And, as a result of that easy exploration, it is possible to find loopholes missed by other analysts, loopholes that lead to innovative, better solutions (e.g. see Ronald Loui's O(nlogn) clustering tool).
Certainly, we can drool over the language features offered by more advanced languages like pointers, generic iterators, continuations, etc etc. And Awk's lack of data structures (except num, string, and array) requires some discipline to handle properly.
But experienced Awk programmers know that the cleverer the program, the smaller the audience gets. If it is possible to explain something succinctly in a simple language like Awk, then it is also possible that more folks will read that code.
Finally, and this may be the most important point, it might be misguided to argue about Awk vs LanguageX in terms of the specifics of those languages. Awk programmers can't over-elaborate their solutions: they are forced to code the solution in the simplest manner possible. This push to simplicity, to the essence of the problem, can be an insightful process. Coding in Awk is like preserving fruit: you boil off everything that is superfluous, everything that needlessly bloats the material you are working with. It is amazing how little code is required to code the core of an idea (e.g. see Darius Bacon's LISP interpreter, written in Awk).
In the Proceedings of the Winter Usenix Conference (Dallas '91), Henry Spencer wrote in Awk As A Major Systems Programming Language that...
There is no fundamental reason why awk programs have to be small "glue" programs: even the "old" awk is a powerful programming language in its own right. Effective use of its data structures and its stream-oriented structure takes some adjustment for C programmers, but the results can be quite striking.
On the other hand, getting there can be a bit painful, and improvements in both the language and its support tools would help.
In 2009, Arnold Robbins comments:
These pages focus on program verification tools, written in Awk.
These pages focus on databases and Awk.
These pages focus on games, written in Awk.
gawk -f game.awk
Download from LAWKER.
I wrote a small text-adventure game in awk - just to stretch the perception of awk, and show that it can be used as a programming language.
This game is small, but gives a taste of the fantasy adventure games of the 80's - like Zork from Infocom.
In this adventure, you are in a cave complex, and need to find the hidden gold to win. The adventure lets you move around, search, pick up objects, and use them. It uses a menu - not free-form entries.
Here is the awk code:
function intro() {
print
print "You are a brave adventurer. You have entered a hidden"
print "cave just outside town, that is rumored to hold gold!"
print "To win this adventure, you need to get the gold."
}
function invent() {
if (coin || axe || sword)
print "You are carrying: "
if (coin) print "coin"
if (axe) print "big, rusty battle axe"
if (sword) print "small sword"
}
function input( x ) {
printf( "\nCOMMAND> ")
getline x
return x
}
function cave() {
print
print "You are standing in a cave. Sunlight gleams behind you"
print "from the entrance. In front of you, is a wooden door."
print "You see an opening to the left, and one to the right."
print
invent()
print
print "What do you want to do? "
print
print "(o)pen wooden door"
print "go (l)eft"
print "go (r)ight"
print "leave thru the (e)ntrance"
if (sword) print "break door with your (s)word"
if (axe) print "break door with your (a)xe"
print "(y)ell Open Sesame"
print "e(x)amine area"
print "read (i)ntroduction"
x = input()
if (x=="o") {print "The wooden door is shut tight."; cave()}
if (x=="l") {deadend()}
if (x=="r") {cave2()}
if (x=="e") {print "You decide to quit. Goodbye!";exit}
if (sword&&x=="s") {print "your sword breaks!";sword=0;cave()}
if (axe&&x=="a") {
print "You chop down the door and find the gold!!"
print "Great job, bold adventurer!"
print "This is the end of this adventure, but"
print "you have a promising career ahead of you!"
exit;
}
if (x=="y") {
print "A band of evil goblins passing by the entrance"
print "hear you, enter the cave, and kill you"
exit;
}
if (x=="x") {print "You find nothing";cave()}
if (x=="i") {intro();cave()}
print "What do you want to do?";cave()
}
function deadend() {
print
print "You are in a dead end"
print
invent()
print
print "What do you want to do? "
print
print "go (b)ack"
print "e(x)amine area"
print "read (i)ntroduction"
x= input();
if (x=="b") {cave()}
if (x=="x") {print "You find a sword!";sword=1;deadend()}
if (x=="i") {intro();deadend()}
print "What do you want to do?";deadend()
}
function cave2() {
print
print "You are in another cave."
print "You can go back, or explore a niche to the left."
print
invent()
print
print "What do you want to do? "
print
print "go (b)ack"
print "enter (n)iche"
if (rubble) print "(s)earch rubble"
print "e(x)amine area"
print "read (i)ntroduction"
x = input()
if (x=="b") {cave()}
if (x=="n") {niche()}
if (rubble&&x=="s"&&!coin) {print "you found a coin!";coin=1;cave2()}
if (rubble&&x=="s"&&coin) {print "you found nothing!";cave2()}
if (x=="x") {print "You see a pile of rubble";rubble=1;cave2()}
if (x=="i") {intro();cave2()}
print "What do you want to do?";cave2()
}
function niche() {
print
print "You are in a niche."
print "There is a dwarf here!"
print
invent()
print
print "What do you want to do? "
print
print "go (b)ack"
print "(t)alk to dwarf"
if (!sword&&!axe) print "(f)ight dwarf"
if (sword) print "fight dwarf with (s)word"
if (axe) print "fight dwarf with (a)xe"
if (coin) print "(o)ffer coin to dwarf"
print "e(x)amine area"
print "read (i)ntroduction"
x = input()
if (x=="b") {cave2()}
if (x=="t") {print "The dwarf grunts";niche()}
if (x=="f") {print "The dwarf kills you";exit}
if (x=="s") {print "The dwarf kills you";exit}
if (x=="a") {print "The dwarf kills you";exit}
if (coin&&x=="o") {print "The dwarf takes the coin and gives you an axe!";coin=0;axe=1;niche()}
if (x=="x") {print "You find nothing";niche()}
if (x=="i") {intro();niche()}
print "What do you want to do?";niche()
}
BEGIN { intro(); cave() }
This is one of the longest awk programs that I have written. Notice that it is function-driven. I have created functions to give the introduction, and the inventory, and I have created functions for each room.
The awk program is kicked off by the BEGIN section, which runs intro() and cave() to put you in the first room.
Each object is represented by a variable of the same name (e.g. sword for the sword) and is either 0 (off) or 1 (on), depending on whether you have the object.
Each function prints descriptions and gives options, depending on the setting of these boolean variables.
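In the listing, each room function calls the next room directly, so the call stack grows with every command, and at end of input the prompt would keep repeating because getline leaves x empty. The sketch below is one possible variation, not the author's design: each room function returns the name of the next room and a loop in BEGIN does the dispatching. It is cut down to two rooms, and the shell function name adventure is invented for the example.

```shell
# adventure: a two-room cut-down game; commands are read from stdin.
adventure() {
  awk '
    function input(   x) {
      printf("\nCOMMAND> ")
      if ((getline x) <= 0) return "e"    # treat end of input as "leave"
      return x
    }
    function deadend(   x) {
      print "You are in a dead end."
      x = input()
      if (x == "x" && !sword) { print "You find a sword!"; sword = 1; return "deadend" }
      if (x == "b") return "cave"
      if (x == "e") return ""
      return "deadend"
    }
    function cave(   x) {
      print "You are standing in a cave."
      if (sword) print "You are carrying: small sword"
      x = input()
      if (x == "l") return "deadend"
      return ""                            # any other command leaves the game
    }
    BEGIN {
      room = "cave"
      while (room != "") {                 # the loop replaces the recursive calls
        if (room == "cave") room = cave()
        else                room = deadend()
      }
      print "\nGoodbye!"
    }'
}
```

The object flags work exactly as described above: sword starts out as 0 (unset) and flips to 1 when the dead end is examined.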
Praveen Puri has been a programmer and full-time trader. He is the author of Stock Trading Riches which teaches his stock trading system.
echo Goal | gawk -f story.awk [ -v Grammar=FILE ] [ -v Seed=NUMBER ]
echo Goal | gawk -f storyp.awk [ -v Grammar=FILE ] [ -v Seed=NUMBER ]
Download from LAWKER.
This code inputs a set of productions and outputs a string of words that satisfy the production rules.
This page describes two versions of that system: story.awk and storyp.awk. The former selects productions at random with equal probability. The latter allows the user to bias the selection by adding weights at the end of line, after each production.
This grammar..
Sentence -> Nounphrase Verbphrase
Nounphrase -> the boy
Nounphrase -> the girl
Verbphrase -> Verb Modlist Adverb
Verb -> runs
Verb -> walks
Modlist ->
Modlist -> very Modlist
Adverb -> quickly
Adverb -> slowly
... and this input ...
for i in 1 2 3 4 5 6 7 8 9 10;do
  echo Sentence | gawk -f ../story.awk -v Grammar=english.rules -v Seed=$i | fmt
done
... generates these sentences:
the boy runs very slowly
the girl runs slowly
the boy runs very slowly
the girl walks very very quickly
the boy runs quickly
the girl walks very very slowly
the boy walks very very very very very very quickly
the boy walks very quickly
the girl runs slowly
the girl runs very quickly
Here is Gahan Wilson's sci-fi plot generator ...
Using the above, we can generate the following stories:
Earth scientists invent giant bugs who want Our Women, And Take A Few And Leave
Earth is Attacked By tiny lunar superbeings who Under Stand and Are Not radioactive and can not be killed by the Navy but They Die From Catching A Cold
Earth scientists invent enormous bugs who are Friendly and and They Get Married And Live Happily Forever After
Earth is Struck By A Giant cloud and Magically Saved
Earth scientists invent giant bugs who Under Stand and Are Not radioactive and can not be killed by the Air Force so They Kill Us
Earth is Attacked By enormous extra Galactic blobs who Under Stand and Are Not radioactive and can be killed by the Air Force
Earth scientists discover enormous blobs who Under Stand and Are Not radioactive and can be killed by a Crowd Of Peasants
Earth falls Into Sun and Some Resuced
Earth is Struck By A Giant comet but Is Saved
Earth is Struck By A Giant comet and Is Destroyed
This is generated from the following code:
for i in 1 2 3 4 5 6 7 8 9 10;do
  echo
  echo Start | gawk -f ../story.awk -v Grammar=scifi.rules -v Seed=$i | fmt
done
running on the following grammar:
Start -> Earth IsStressed
IsStressed -> Catestrophes
IsStressed -> Science
IsStressed -> Attack
IsStressed -> Collision
Catestrophes -> Catestrophe and PossibleMegaDeath
Catestrophe -> burnsUp
Catestrophe -> freezes
Catestrophe -> fallsIntoSun
Collision -> isStruckByAGiant Floater AndThen
Floater -> comet
Floater -> asteroid
Floater -> cloud
AndThen -> butIsSaved
AndThen -> andIsDestroyed
AndThen -> andMagicallySaved
PossibleMegaDeath -> everybodyDies
PossibleMegaDeath -> Some GoOn
SomeSaved -> somePeople
SomeSaved -> everybody
SomeSaved -> almostEverybody
GoOn -> dies
GoOn -> Resuced
GoOn -> Saved
Rescued -> isRescuedBy Sizes Extraterestrial Beings
Saved -> butIsSavedBy SomeOne scientists the Science
SomeOne -> earth
SomeOne -> extraterestrial
Science -> scientists DoSomething Sizes Beings Whichetc
DoSomething -> invent
DoSomething -> discover
Attack -> isAttackedBy Sizes Extraterestrial Beings Whichetc
Sizes -> tiny
Sizes -> giant
Sizes -> enormous
Extraterestrial -> martian
Extraterestrial -> lunar
Extraterestrial -> extraGalactic
Beings -> bugs
Beings -> reptiles
Beings -> blobs
Beings -> superbeings
Whichetc -> who WantSomething
WantSomething -> WantWomen
WantSomething -> areFriendly and DenoumentOrHappyEnding
WantSomething -> UnderStand ButEtc
Understand -> areFriendly butMisunderstood
Understand -> misunderstandUs
Understand -> understandUsAllTooWell
Understand -> hungry
DenoumentOrHappyEnding -> Denoument
DenoumentOrHappyEnding -> HappyEnding
Dine -> Hungry and eat us Denoument?
WhichEtc ->
Hungry -> lookUponUsAsASourceOfNourishment
WantWomen -> wantOurWomen, AndTakeAFewAndLeave
ButEtc -> AndAre radioactive and TryToKill
AndAre -> andAre
AndAre -> andAreNot
Killers -> Killer
Killers -> Killer and Killer
Killer -> aCrowdOfPeasants
Killer -> theArmy
Killer -> theNavy
Killer -> theAirForce
Killer -> theMarines
Killer -> theCoastGuard
Killer -> theAtomBomb
TryToKill -> can be killed by Killers
TryToKill -> can not be killed by Killers SoEtc
SoEtc -> butTheyDieFromCatchingACold
SoEtc -> soTheyKillUs
SoEtc -> soTheyPutUsUnderABenignDictatorShip
SoEtc -> soTheyEatUs
SoEtc -> soScientistsInventAWeapon Which
SeEtc -> but Denoument
Which -> whichTurnsThemIntoDisgustingLumps
Which -> whichKillsThem
Which -> whichFails SoEtc
Denomument? ->
Denomument? -> Denoument
Denoument -> aCuteLittleKidConvincesThemPeopleAreOk Ending
Denoument -> aPriestTalksToThemOfGod Ending
Denoument -> theyFallInLoveWithThisBeautifulGirl EndSadOrHappy
EndSadOrHappy -> Ending
EndSadOrHappy -> HappyEnding
Ending -> andTheyDie
Ending -> andTheyLeave
Ending -> andTheyTurnIntoDisgustingLumps
HappyEnding -> andTheyGetMarriedAndLiveHappilyForeverAfter
Here is a grammar suitable for storyp.awk. Note the number at the end of each line, which biases how often that production is selected. For example, "runs" and "slowly" are nine times more likely than the other Verb and Adverb.
Sentence -> Nounphrase Verbphrase 1
Nounphrase -> the boy 0.75
Nounphrase -> the girl 0.25
Verbphrase -> Verb Modlist Adverb 1
Verb -> runs 0.9
Verb -> walks 0.1
Modlist -> 0.5
Modlist -> very Modlist 0.5
Adverb -> quickly 0.1
Adverb -> slowly 0.9
The following code executes the biased story generation:
for((i=1;i<=10;i++)); do echo Sentence ; done | gawk -f ../storyp.awk -v Grammar=englishp.rules
This produces the following output. Note that, usually, we run slowly.
the boy runs very slowly
the boy runs slowly
the girl runs very slowly
the boy runs slowly
the boy runs slowly
the girl walks very slowly
the boy walks slowly
the girl runs slowly
the boy runs slowly
the boy runs slowly
BEGIN {
srand(Seed ? Seed : 1)
Grammar = Grammar ? Grammar : "grammar"
while ((getline < Grammar) > 0)
if ($2 == "->") {
i = ++lhs[$1] # count lhs
rhscnt[$1, i] = NF-2 # how many in rhs
for (j = 3; j <= NF; j++) # record them
rhslist[$1, i, j-2] = $j
} else
if ($0 !~ /^[ \t]*$/)
print "illegal production: " $0
}
{ if ($1 in lhs) { # nonterminal to expand
gen($1)
printf("\n")
} else
print "unknown nonterminal: " $0
}
function gen(sym, i, j) {
if (sym in lhs) { # a nonterminal
i = int(lhs[sym] * rand()) + 1 # random production
for (j = 1; j <= rhscnt[sym, i]; j++) # expand rhs's
gen(rhslist[sym, i, j])
} else {
gsub(/[A-Z]/," &",sym)
printf("%s ", sym) }
}
Storyp.awk is almost the same as story.awk but it is assumed that each line ends in a number that will bias how often that production gets selected.
BEGIN {
srand(Seed ? Seed : 1)
Grammar = Grammar ? Grammar : "grammar"
while ((getline < Grammar) > 0)
if ($2 == "->") {
i = ++lhs[$1] # count lhs
rhsprob[$1, i] = $NF # 0 <= probability <= 1
rhscnt[$1, i] = NF-3 # how many in rhs
for (j = 3; j < NF; j++) # record them
rhslist[$1, i, j-2] = $j
} else
print "illegal production: " $0
for (sym in lhs)
for (i = 2; i <= lhs[sym]; i++)
rhsprob[sym, i] += rhsprob[sym, i-1]
}
{ if ($1 in lhs) { # nonterminal to expand
gen($1)
printf("\n")
} else
print "unknown nonterminal: " $0
}
function gen(sym, i, j) {
if (sym in lhs) { # a nonterminal
j = rand() # random production
for (i = 1; i <= lhs[sym] && j > rhsprob[sym, i]; i++) ;
for (j = 1; j <= rhscnt[sym, i]; j++) # expand rhs's
gen(rhslist[sym, i, j])
} else
printf("%s ", sym)
}
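The weighted choice in storyp.awk works by turning the weights into running totals and then walking them until the random draw is covered. Here is that one step in isolation, as a small sketch using the Adverb weights from the grammar above. The shell function pick and the variable U (a number in [0,1) standing in for rand(), so the result is repeatable) are assumptions of this example. Unlike the listing, the loop here stops at the last production, so a set of weights that sums to a little less than the draw still selects something.

```shell
# pick U: choose an adverb for a draw U in [0,1), using weights 0.1 and 0.9.
pick() {
  awk -v U="$1" '
    BEGIN {
      n = split("quickly slowly", word, " ")
      split("0.1 0.9", weight, " ")
      # Running totals: cum[1] = 0.1, cum[2] = 1.0 (cum[0] starts out as 0)
      for (i = 1; i <= n; i++) cum[i] = cum[i-1] + weight[i]
      # Take the first production whose running total reaches U.
      # Stopping at i < n clamps the index to the last production.
      for (i = 1; i < n && U > cum[i]; i++) ;
      print word[i]
    }'
}
```

For example, pick 0.05 selects "quickly" and pick 0.5 selects "slowly".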
The code comes from Alfred Aho, Brian Kernighan, and Peter Weinberger from the book "The AWK Programming Language", Addison-Wesley, 1988.
The scifi grammar was written by Tim Menzies, 2009, and is based on Gahan Wilson's sci-fi plot generator: "The Science Fiction Horror Movie Pocket Computer" ( in "The Year's Best Science Fiction No. 5", edited by Harry Harrison and Brian Aldiss, Sphere, London, 1972).
Donald 'Paddy' McCarthy has a nice Awk solution to the Monty Hall Problem, which he describes as follows:
It turns out that if the contestant follows a strategy of always switching when asked, then he will maximise his chances of winning. Donald's simulator shows that:
BEGIN {
srand()
doors = 3
iterations = 10000
# Behind a door:
EMPTY = "empty"; PRIZE = "prize"
# Algorithm used
KEEP = "keep"; SWITCH="switch"; RAND="random";
}
function monty_hall( choice, algorithm ) { # Set up doors
for ( i=0; i<doors; i++ ) {
door[i] = EMPTY
}
door[int(rand()*doors)] = PRIZE # One door with prize
chosen = door[choice]
delete door[choice]
#if you didn't choose the prize first time around then
# that will be the alternative
alternative = (chosen == PRIZE) ? EMPTY : PRIZE
if( algorithm == KEEP) {
return chosen
}
if( algorithm == SWITCH) {
return alternative
}
return rand() <0.5 ? chosen : alternative
}
function simulate(algo){
prizecount = 0
for(j=0; j< iterations; j++){
if( monty_hall( int(rand()*doors), algo) == PRIZE) {
prizecount ++
}
}
printf " Algorithm %7s: prize count = %i, = %6.2f%%\n", \
algo, prizecount,prizecount*100/iterations
}
BEGIN {
print "\nMonty Hall problem simulation:"
print doors, "doors,", iterations, "iterations.\n"
simulate(KEEP)
simulate(SWITCH)
simulate(RAND)
}
gawk -f montyHall.awk

Monty Hall problem simulation:
3 doors, 10000 iterations.

 Algorithm    keep: prize count = 3411, =  34.11%
 Algorithm  switch: prize count = 6655, =  66.55%
 Algorithm  random: prize count = 4991, =  49.91%
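The simulated figures sit close to 1/3, 2/3 and 1/2. As a cross-check, the sketch below counts the cases exactly instead of sampling them. It assumes the usual three-door rules (the host always opens an empty door that the contestant did not pick); the function name monty_exact is made up here and is not part of Donald's program.

```shell
# monty_exact: enumerate every (prize door, first choice) pair for 3 doors.
monty_exact() {
  awk '
    BEGIN {
      doors = 3
      for (prize = 0; prize < doors; prize++)
        for (choice = 0; choice < doors; choice++) {
          cases++
          if (choice == prize) keep++   # keeping wins only when the first pick was right
          else                 swap++   # otherwise the one unopened door holds the prize
        }
      printf "keep: %d/%d, switch: %d/%d\n", keep, cases, swap, cases
    }'
}
```

Running monty_exact prints "keep: 3/9, switch: 6/9", i.e. 1/3 and 2/3; the random strategy averages the two, which gives 1/2.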
echo name | gawk -f gender.awk
Download from LAWKER
The following code predicts gender, given a first name.
This code is an excellent example of rule-based programming in Awk.
For a full description of the code, see
{ sex = "m" } # Assume male.
/^.*[aeiy]$/ { sex = "f" } # Female names ending in a/e/i/y.
/^All?[iy]((ss?)|z)on$/ { sex = "f" } # Allison (and variations)
/^.*een$/ { sex = "f" } # Cathleen, Eileen, Maureen,...
/^[^S].*r[rv]e?y?$/ { sex = "m" } # Barry, Larry, Perry,...
/^[^G].*v[ei]$/ { sex = "m" } # Clive, Dave, Steve,...
/^[^BD].*(b[iy]|y|via)nn?$/ { sex = "f" } # Carolyn,Gwendolyn,Vivian,...
/^[^AJKLMNP][^o][^eit]*([glrsw]ey|lie)$/ { sex = "m" } # Dewey, Stanley, Wesley,...
/^[^GKSW].*(th|lv)(e[rt])?$/ { sex = "f" } # Heather, Ruth, Velvet,...
/^[CGJWZ][^o][^dnt]*y$/ { sex = "m" } # Gregory, Jeremy, Zachary,...
/^.*[Rlr][abo]y$/ { sex = "m" } # Leroy, Murray, Roy,...
/^[AEHJL].*il.*$/ { sex = "f" } # Abigail, Jill, Lillian,...
/^.*[Jj](o|o?[ae]a?n.*)$/ { sex = "f" } # Janet, Jennifer, Joan,...
/^.*[GRguw][ae]y?ne$/ { sex = "m" } # Duane, Eugene, Rene,...
/^[FLM].*ur(.*[^eotuy])?$/ { sex = "f" } # Fleur, Lauren, Muriel,...
/^[CLMQTV].*[^dl][in]c.*[ey]$/ { sex = "m" } # Lance, Quincy, Vince,...
/^M[aei]r[^tv].*([^cklnos]|([^o]n))$/ { sex = "f" } # Margaret, Marylou, Miriam,...
/^.*[ay][dl]e$/ { sex = "m" } # Clyde, Kyle, Pascale,...
/^[^o]*ke$/ { sex = "m" } # Blake, Luke, Mike,...
/^[CKS]h?(ar[^lst]|ry).+$/ { sex = "f" } # Carol, Karen, Sharon,...
/^[PR]e?a([^dfju]|qu)*[lm]$/ { sex = "f" } # Pam, Pearl, Rachel,...
/^.*[Aa]nn.*$/ { sex = "f" } # Annacarol, Leann, Ruthann,...
/^.*[^cio]ag?h$/ { sex = "f" } # Deborah, Leah, Sarah,...
/^[^EK].*[grsz]h?an(ces)?$/ { sex = "f" } # Frances, Megan, Susan,...
/^[^P]*([Hh]e|[Ee][lt])[^s]*[ey].*[^t]$/ { sex = "f" } # Ethel, Helen, Gretchen,...
/^[^EL].*o(rg?|sh?)?(e|ua)$/ { sex = "m" } # George, Joshua, Theodore,..
/^[DP][eo]?[lr].*se$/ { sex = "f" } # Delores, Doris, Precious,...
/^[^JPSWZ].*[denor]n.*y$/ { sex = "m" } # Anthony, Henry, Rodney,...
/^K[^v]*i.*[mns]$/ { sex = "f" } # Karin, Kim, Kristin,...
/^Br[aou][cd].*[ey]$/ { sex = "m" } # Bradley, Brady, Bruce,...
/^[ACGK].*[deinx][^aor]s$/ { sex = "f" } # Agnes, Alexis, Glynis,...
/^[ILW][aeg][^ir]*e$/ { sex = "m" } # Ignace, Lee, Wallace,...
/^[^AGW][iu][gl].*[drt]$/ { sex = "f" } # Juliet, Mildred, Millicent,...
/^[ABEIUY][euz]?[blr][aeiy]$/ { sex = "m" } # Ari, Bela, Ira,...
/^[EGILP][^eu]*i[ds]$/ { sex = "f" } # Iris, Lois, Phyllis,...
/^[ART][^r]*[dhn]e?y$/ { sex = "m" } # Randy, Timothy, Tony,...
/^[BHL].*i.*[rtxz]$/ { sex = "f" } # Beatriz, Bridget, Harriet,...
/^.*oi?[mn]e$/ { sex = "m" } # Antoine, Jerome, Tyrone,...
/^D.*[mnw].*[iy]$/ { sex = "m" } # Danny, Demetri, Dondi,...
/^[^BG](e[rst]|ha)[^il]*e$/ { sex = "m" } # Pete, Serge, Shane,...
/^[ADFGIM][^r]*([bg]e[lr]|il|wn)$/ { sex = "f" } # Angel, Gail, Isabel,...
{ print sex } # Output prediction
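The program relies on Awk testing every pattern, in order, against each input line: a later rule that matches overwrites the guess made by an earlier one, so general rules go first and exceptions follow. The cut-down sketch below shows just that mechanism with the default rule, the first general rule and one exception taken from the listing. The shell wrapper guess_gender is an invented name for this illustration.

```shell
# guess_gender NAME...: print each name followed by the guess from three rules.
guess_gender() {
  printf '%s\n' "$@" | awk '
    { sex = "m" }                        # default, runs for every name
    /[aeiy]$/           { sex = "f" }    # general rule
    /^[^S].*r[rv]e?y?$/ { sex = "m" }    # exception, listed later so it wins
    { print $0, sex }'
}
```

With these three rules, guess_gender Anna Barry Peter prints "Anna f", "Barry m" and "Peter m": Barry first matches the general rule and is then corrected by the exception.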
gawk -f eliza.awk
Download from LAWKER.
From Wikipedia: ELIZA was a computer program and an early example of (by modern standards) primitive natural language processing. ELIZA operated by processing users' responses to scripts, the most famous of which was DOCTOR, a simulation of a Rogerian psychotherapist. In this mode, ELIZA mostly rephrased the user's statements as questions and posed those to the 'patient.' ELIZA was written by Joseph Weizenbaum between 1964 to 1966.
This is a Gawk implementation of the famous ELIZA program by Joseph Weizenbaum. To end, type SHUT UP

HI, IM ELIZA, TELL ME YOUR PROBLEM
I hate dogs
CAN YOU ELABORATE ON THAT ?
Dogs remind me of Mylie Cyrus
CAN YOU ELABORATE ON THAT ?
You're kidding, right? Mylie Cyrus is one of the biggest stars on the planet
WHAT MAKES YOU THINK I AM KIDDING, RIGHT? MYLIE CYRUS IS ONE OF THE BIGGEST STARS ON THE PLANET ?
Cause you did not know about Mylie
WE WERE DISCUSSING YOU -- NOT ME
No, I was discussing Mylie
Oh, I ID NOT KNOW ABOUT MYLIE
Well aren't you stupid
Oh, I TUPID
can you get her number?
WE WERE DISCUSSING YOU -- NOT ME
sorry
PLEASE DON'T APOLOGIZE !
but you sound so angry
YOU'RE NOT REALLY TALKING ABOUT ME, ARE YOU ?
i think you are a robot
Oh, I RE A ROBOT
yes
I UNDERSTAND
no you don't
YOU'RE NOT REALLY TALKING ABOUT ME, ARE YOU ?
hey- is this eliza?
WHAT'S ABOUT ME ?
I want to take you to dinner
YOU'RE NOT REALLY TALKING ABOUT ME, ARE YOU ?
I feel you are really stupid
WE WERE DISCUSSING YOU -- NOT ME
SHUT UP
WELL, PLEASE PAY YOUR BILL. ITS EXACTLY ... $101
BEGIN {
SetUpEliza()
print "This is a Gawk implementation of the "\
"famous ELIZA program by Joseph Weizenbaum. "\
"To end, type SHUT UP\n";
print ElizaSays("");
}
{ print ElizaSays($0) }
function ElizaSays(YouSay) {
if (YouSay == "") {
cost = 0
answer = "HI, IM ELIZA, TELL ME YOUR PROBLEM"
} else {
q = toupper(YouSay)
gsub("'", "", q)
if(q == qold) {
answer = "PLEASE DONT REPEAT YOURSELF !"
} else {
if (index(q, "SHUT UP") > 0) {
answer = "WELL, PLEASE PAY YOUR BILL. ITS EXACTLY ... $"\
int(100*rand()+30+cost/100)
1;
} else {
qold = q
w = "-" # no keyword recognized yet
for (i in k) { # search for keywords
if (index(q, i) > 0) {
w = i
break
}
}
if (w == "-") { # no keyword, take old subject
w = wold
subj = subjold
} else { # find subject
subj = substr(q, index(q, w) + length(w)+1)
wold = w
subjold = subj # remember keyword and subject
}
for (i in conj)
gsub(i, conj[i], q) # conjugation
# from all answers to this keyword, select one randomly
answer = r[indices[int(split(k[w], indices) * rand()) + 1]]
# insert subject into answer
gsub("_", subj, answer)
}
}
}
cost += length(answer) # for later payment : 1 cent per character
return answer
}
function SetUpEliza() {
srand()
wold = "-"
subjold = " "
# table for conjugation
conj[" ARE " ] = " AM "
conj["WERE " ] = "WAS "
conj[" YOU " ] = " I "
conj["YOUR " ] = "MY "
conj[" IVE " ] =\
conj[" I HAVE " ] = " YOU HAVE "
conj[" YOUVE " ] =\
conj[" YOU HAVE "] = " I HAVE "
conj[" IM " ] =\
conj[" I AM " ] = " YOU ARE "
conj[" YOURE " ] =\
conj[" YOU ARE " ] = " I AM "
# table of all answers
r[1] = "DONT YOU BELIEVE THAT I CAN _"
r[2] = "PERHAPS YOU WOULD LIKE TO BE ABLE TO _ ?"
r[3] = "YOU WANT ME TO BE ABLE TO _ ?"
r[4] = "PERHAPS YOU DONT WANT TO _ "
r[5] = "DO YOU WANT TO BE ABLE TO _ ?"
r[6] = "WHAT MAKES YOU THINK I AM _ ?"
r[7] = "DOES IT PLEASE YOU TO BELIEVE I AM _ ?"
r[8] = "PERHAPS YOU WOULD LIKE TO BE _ ?"
r[9] = "DO YOU SOMETIMES WISH YOU WERE _ ?"
r[10] = "DONT YOU REALLY _ ?"
r[11] = "WHY DONT YOU _ ?"
r[12] = "DO YOU WISH TO BE ABLE TO _ ?"
r[13] = "DOES THAT TROUBLE YOU ?"
r[14] = "TELL ME MORE ABOUT SUCH FEELINGS"
r[15] = "DO YOU OFTEN FEEL _ ?"
r[16] = "DO YOU ENJOY FEELING _ ?"
r[17] = "DO YOU REALLY BELIEVE I DONT _ ?"
r[18] = "PERHAPS IN GOOD TIME I WILL _ "
r[19] = "DO YOU WANT ME TO _ ?"
r[20] = "DO YOU THINK YOU SHOULD BE ABLE TO _ ?"
r[21] = "WHY CANT YOU _ ?"
r[22] = "WHY ARE YOU INTERESTED IN WHETHER OR NOT I AM _ ?"
r[23] = "WOULD YOU PREFER IF I WERE NOT _ ?"
r[24] = "PERHAPS IN YOUR FANTASIES I AM _ "
r[25] = "HOW DO YOU KNOW YOU CANT _ ?"
r[26] = "HAVE YOU TRIED ?"
r[27] = "PERHAPS YOU CAN NOW _ "
r[28] = "DID YOU COME TO ME BECAUSE YOU ARE _ ?"
r[29] = "HOW LONG HAVE YOU BEEN _ ?"
r[30] = "DO YOU BELIEVE ITS NORMAL TO BE _ ?"
r[31] = "DO YOU ENJOY BEING _ ?"
r[32] = "WE WERE DISCUSSING YOU -- NOT ME"
r[33] = "Oh, I _"
r[34] = "YOU'RE NOT REALLY TALKING ABOUT ME, ARE YOU ?"
r[35] = "WHAT WOULD IT MEAN TO YOU, IF YOU GOT _ ?"
r[36] = "WHY DO YOU WANT _ ?"
r[37] = "SUPPOSE YOU SOON GOT _"
r[38] = "WHAT IF YOU NEVER GOT _ ?"
r[39] = "I SOMETIMES ALSO WANT _"
r[40] = "WHY DO YOU ASK ?"
r[41] = "DOES THAT QUESTION INTEREST YOU ?"
r[42] = "WHAT ANSWER WOULD PLEASE YOU THE MOST ?"
r[43] = "WHAT DO YOU THINK ?"
r[44] = "ARE SUCH QUESTIONS IN YOUR MIND OFTEN ?"
r[45] = "WHAT IS IT THAT YOU REALLY WANT TO KNOW ?"
r[46] = "HAVE YOU ASKED ANYONE ELSE ?"
r[47] = "HAVE YOU ASKED SUCH QUESTIONS BEFORE ?"
r[48] = "WHAT ELSE COMES TO MIND WHEN YOU ASK THAT ?"
r[49] = "NAMES DON'T INTEREST ME"
r[50] = "I DONT CARE ABOUT NAMES -- PLEASE GO ON"
r[51] = "IS THAT THE REAL REASON ?"
r[52] = "DONT ANY OTHER REASONS COME TO MIND ?"
r[53] = "DOES THAT REASON EXPLAIN ANYTHING ELSE ?"
r[54] = "WHAT OTHER REASONS MIGHT THERE BE ?"
r[55] = "PLEASE DON'T APOLOGIZE !"
r[56] = "APOLOGIES ARE NOT NECESSARY"
r[57] = "WHAT FEELINGS DO YOU HAVE WHEN YOU APOLOGIZE ?"
r[58] = "DON'T BE SO DEFENSIVE"
r[59] = "WHAT DOES THAT DREAM SUGGEST TO YOU ?"
r[60] = "DO YOU DREAM OFTEN ?"
r[61] = "WHAT PERSONS APPEAR IN YOUR DREAMS ?"
r[62] = "ARE YOU DISTURBED BY YOUR DREAMS ?"
r[63] = "HOW DO YOU DO ... PLEASE STATE YOUR PROBLEM"
r[64] = "YOU DON'T SEEM QUITE CERTAIN"
r[65] = "WHY THE UNCERTAIN TONE ?"
r[66] = "CAN'T YOU BE MORE POSITIVE ?"
r[67] = "YOU AREN'T SURE ?"
r[68] = "DON'T YOU KNOW ?"
r[69] = "WHY NO _ ?"
r[70] = "DON'T SAY NO, IT'S ALWAYS SO NEGATIVE"
r[71] = "WHY NOT ?"
r[72] = "ARE YOU SURE ?"
r[73] = "WHY NO ?"
r[74] = "WHY ARE YOU CONCERNED ABOUT MY _ ?"
r[75] = "WHAT ABOUT YOUR OWN _ ?"
r[76] = "CAN'T YOU THINK ABOUT A SPECIFIC EXAMPLE ?"
r[77] = "WHEN ?"
r[78] = "WHAT ARE YOU THINKING OF ?"
r[79] = "REALLY, ALWAYS ?"
r[80] = "DO YOU REALLY THINK SO ?"
r[81] = "BUT YOU ARE NOT SURE YOU _ "
r[82] = "DO YOU DOUBT YOU _ ?"
r[83] = "IN WHAT WAY ?"
r[84] = "WHAT RESEMBLANCE DO YOU SEE ?"
r[85] = "WHAT DOES THE SIMILARITY SUGGEST TO YOU ?"
r[86] = "WHAT OTHER CONNECTION DO YOU SEE ?"
r[87] = "COULD THERE REALLY BE SOME CONNECTIONS ?"
r[88] = "HOW ?"
r[89] = "YOU SEEM QUITE POSITIVE"
r[90] = "ARE YOU SURE ?"
r[91] = "I SEE"
r[92] = "I UNDERSTAND"
r[93] = "WHY DO YOU BRING UP THE TOPIC OF FRIENDS ?"
r[94] = "DO YOUR FRIENDS WORRY YOU ?"
r[95] = "DO YOUR FRIENDS PICK ON YOU ?"
r[96] = "ARE YOU SURE YOU HAVE ANY FRIENDS ?"
r[97] = "DO YOU IMPOSE ON YOUR FRIENDS ?"
r[98] = "PERHAPS YOUR LOVE FOR FRIENDS WORRIES YOU"
r[99] = "DO COMPUTERS WORRY YOU ?"
r[100] = "ARE YOU TALKING ABOUT ME IN PARTICULAR ?"
r[101] = "ARE YOU FRIGHTENED BY MACHINES ?"
r[102] = "WHY DO YOU MENTION COMPUTERS ?"
r[103] = "WHAT DO YOU THINK MACHINES HAVE TO DO WITH YOUR PROBLEMS ?"
r[104] = "DON'T YOU THINK COMPUTERS CAN HELP PEOPLE ?"
r[105] = "WHAT IS IT ABOUT MACHINES THAT WORRIES YOU ?"
r[106] = "SAY, DO YOU HAVE ANY PSYCHOLOGICAL PROBLEMS ?"
r[107] = "WHAT DOES THAT SUGGEST TO YOU ?"
r[108] = "I SEE"
r[109] = "IM NOT SURE I UNDERSTAND YOU FULLY"
r[110] = "COME COME ELUCIDATE YOUR THOUGHTS"
r[111] = "CAN YOU ELABORATE ON THAT ?"
r[112] = "THAT IS QUITE INTERESTING"
r[113] = "WHY DO YOU HAVE PROBLEMS WITH MONEY ?"
r[114] = "DO YOU THINK MONEY IS EVERYTHING ?"
r[115] = "ARE YOU SURE THAT MONEY IS THE PROBLEM ?"
r[116] = "I THINK WE WANT TO TALK ABOUT YOU, NOT ABOUT ME"
r[117] = "WHAT'S ABOUT ME ?"
r[118] = "WHY DO YOU ALWAYS BRING UP MY NAME ?"
# table for looking up answers that
# fit to a certain keyword
k["CAN YOU"] = "1 2 3"
k["CAN I"] = "4 5"
k["YOU ARE"] =\
k["YOURE"] = "6 7 8 9"
k["I DONT"] = "10 11 12 13"
k["I FEEL"] = "14 15 16"
k["WHY DONT YOU"] = "17 18 19"
k["WHY CANT I"] = "20 21"
k["ARE YOU"] = "22 23 24"
k["I CANT"] = "25 26 27"
k["I AM"] =\
k["IM "] = "28 29 30 31"
k["YOU "] = "32 33 34"
k["I WANT"] = "35 36 37 38 39"
k["WHAT"] =\
k["HOW"] =\
k["WHO"] =\
k["WHERE"] =\
k["WHEN"] =\
k["WHY"] = "40 41 42 43 44 45 46 47 48"
k["NAME"] = "49 50"
k["CAUSE"] = "51 52 53 54"
k["SORRY"] = "55 56 57 58"
k["DREAM"] = "59 60 61 62"
k["HELLO"] =\
k["HI "] = "63"
k["MAYBE"] = "64 65 66 67 68"
k[" NO "] = "69 70 71 72 73"
k["YOUR"] = "74 75"
k["ALWAYS"] = "76 77 78 79"
k["THINK"] = "80 81 82"
k["LIKE"] = "83 84 85 86 87 88 89"
k["YES"] = "90 91 92"
k["FRIEND"] = "93 94 95 96 97 98"
k["COMPUTER"] = "99 100 101 102 103 104 105"
k["-"] = "106 107 108 109 110 111 112"
k["MONEY"] = "113 114 115"
k["ELIZA"] = "116 117 118"
}
Juergen Kahrs
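The core of ElizaSays() is a two-table lookup: k maps a keyword to a string of answer numbers, split() turns that string into an array, a random index picks one answer from r, and the text after the keyword replaces the "_" placeholder. The sketch below isolates that step with a single keyword and two answers copied from the tables above. The shell function reply and the variable U (a number in [0,1) standing in for rand(), so the result is repeatable) are assumptions of this example.

```shell
# reply "SENTENCE" U: answer a sentence containing the keyword "I FEEL".
reply() {
  awk -v Q="$1" -v U="$2" '
    BEGIN {
      r[15] = "DO YOU OFTEN FEEL _ ?"
      r[16] = "DO YOU ENJOY FEELING _ ?"
      k["I FEEL"] = "15 16"

      q = toupper(Q)
      w = "I FEEL"
      if (index(q, w) == 0) exit                      # keyword not present
      subj = substr(q, index(q, w) + length(w) + 1)   # text after the keyword
      n = split(k[w], idx, " ")                       # idx[1]=15, idx[2]=16
      answer = r[idx[int(n * U) + 1]]                 # map U onto 1..n
      gsub("_", subj, answer)                         # insert the subject
      print answer
    }'
}
```

For example, reply "I feel tired" 0.7 prints DO YOU ENJOY FEELING TIRED ?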
gawk -f hanoi.awk [-n Disks]
The objective is to move N discs from stack 0 to stack 1, always putting a smaller disc on top of a larger one or on an empty stack.
gawk -f hanoi.awk -n 4
0 4321
1
2

0 432
1
2 1

0 43
1 2
2 1

0 43
1 21
2

0 4
1 21
2 3

0 41
1 2
2 3

0 41
1
2 32

0 4
1
2 321

0
1 4
2 321

0
1 41
2 32

0 2
1 41
2 3

0 21
1 4
2 3

0 21
1 43
2

0 2
1 43
2 1

0
1 432
2 1

0
1 4321
2
Main:
BEGIN {
n = arg("-n",5)
for (j=0; j<n; j++) push(0,n-j)
showstacks()
hanoi(n,0,1,2)
}
function hanoi(n,a,b,c) {
if (n==1) {
move(a,b)
} else {
hanoi(n-1,a,c,b)
move(a,b)
hanoi(n-1,c,b,a)
}
}
function move(i,j) {
push(j,pop(i))
showstacks()
}
Showing the stack:
function showstacks( i,j) {
for (i=0; i<=2; i++) {
printf "%s ", i
for (j=0; j<sp[i]; j++) printf "%s", stack[i,j]
print "" }
print ""
}
Standard stuff:
function arg(tag,dflt, i) { # "default" is a reserved word in gawk
for(i in ARGV)
if (ARGV[i] ~ tag)
return ARGV[i+1]
return dflt
}
function push(i,v) { stack[i,sp[i]++]=v }
function pop(i) { return stack[i,--sp[i]] }
Alan Linton, 2001
gawk -f readminds.awk
(then type "h" or "t").
Shannon's 1953 memo, A Mind-Reading(?) Machine, describes a machine built out of relays at Bell Labs.
The machine took advantage of the difficulty of generating truly random behavior in wetware by using a small (8-state) Markov model to predict its human opponents.
We implement a 1970's version of this 1950's algorithm, using AWK instead of mechanical relays.
Our Markov model is based on behavior over the last two rounds, with hpa and hpb recording the history of the player's plays, and hca and hcb recording the history of the computer's guesses. The possible cases are: the player won or lost two rounds ago, changed plays or stayed with the same play, and won or lost the last round, for a total of 2^3 = 8 histories, with any bias towards changing or staying in the upcoming round kept in the tally array.
If the player has repeated their behavior for a given history at least twice, we guess according to their predicted behavior. After the first observation, we guess with a 75%/25% split, weighted towards the bias. If the player hasn't shown any bias (or during the first two rounds of the game), we guess at hazard.
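The guessing policy described above can be read as a small lookup from the tally value to the probability that the machine bets on the player repeating their last play. The following is a minimal sketch of that reading, not part of the original program; the helper name p_repeat is hypothetical, and it assumes (as in the program's update rule) that a positive tally means the player tends to repeat.

```shell
# p_repeat (hypothetical helper): map a tally value to the probability
# that the machine guesses "the player repeats their last play".
p_repeat() {
  awk -v t="$1" 'BEGIN {
    if      (t < -1)  p = 0      # seen at least twice: player changes
    else if (t == -1) p = 0.25   # seen once: lean towards a change
    else if (t == 0)  p = 0.5    # no bias: guess at hazard
    else if (t == 1)  p = 0.75   # seen once: lean towards a repeat
    else              p = 1      # seen at least twice: player repeats
    print p
  }'
}

for t in -2 -1 0 1 2; do
  echo "tally=$t p_repeat=$(p_repeat "$t")"
done
```

The five branches mirror the five pattern lines of the program that set guess from t.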
BEGIN {
print "+---------------------------------------------------------+"
print "| An AWKward mind-reading machine |"
print "| (this retrogame inspired by the Bell Labs Memo: |"
print "| Shannon, 1953, 'A Mind-Reading (?) Machine') |"
print "+---------------------------------------------------------+"
print "Shall we play a game?"
print "Tell me either 'heads' or 'tails'."
print "If I guess what you picked, I win. Otherwise, you win."
print "The match goes for 100 rounds, or someone gets 20."
printf "your play? "
}
BEGIN { "date +%s" | getline seed; srand(seed) }
BEGIN { t = 0 }
NR > 2 {
# "case" is a reserved word in recent gawk releases, so the
# history key is kept in hist
hist = (hpa!=hca)"/"(hpa!=hpb)"/"(hpb!=hcb)
t = tally[hist]
}
t < -1 { guess=!hpb }
t == -1 { guess=(int(rand()+.75)?!hpb:hpb) }
t == 0 { guess=int(rand()+.5) }
t == 1 { guess=(int(rand()+.75)?hpb:!hpb) }
t > 1 { guess= hpb }
/^[hH]/ { play=1 }
/^[tT]/ { play=0 }
/^[^hHtT]/ { printf "heads or tails? "; next }
We also report the results of the round to the player (in case they wish to update their internal models). En passant, we update pw and cw, the number of player (resp. computer) wins.
{
printf "You played " (play?"heads":"tails")
printf "; I guessed " (guess?"heads":"tails")
printf ". "(play==guess?"I":"You")" win. "
print "("(pw+=(play!=guess))"-"(cw+=(play==guess))")"
}
After finishing a round, we update the history with the results, including updating tally according to the player's behavior. Again, we wait for two rounds before touching the tally counters, at which point the history will have been fully initialized.
NR > 2 { tally[hist] += (hpb == play ? 1 : -1) }
{
hpa = hpb; hpb = play
hca = hcb; hcb = guess
}
At the end of each round, if we haven't met a victory condition, we prompt for the next round.
cw+pw==100 { printf (cw>pw?"I":"You")" won the match "
print "by "(cw>pw?cw-pw:pw-cw)" games."
exit }
pw-cw==20 { print "You win -- up by 20"; exit }
cw-pw==20 { print "I win -- up by 20"; exit }
{ printf "? " }
END {
print " T H A N K Y O U F O R P L A Y I N G "
}
Copyright (c) 2009 the authors listed at the following URL, and/or the authors of referenced articles or incorporated external code: http://en.literateprograms.org/Mind_reading_machine_(AWK)?action=history&offset=20070207160312
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
gawk -f mastermind.awk
Download from LAWKER.
The aim of the game is to guess 4 numbers from 0,1,2,3,4,5,6,7,8,9. A "hit" is the right number in the right position and a "blow" is the right number in a wrong position.
You lose the game if you fail to guess after 10 rounds.
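The scoring rule can be sketched on its own, separately from the compact arithmetic the game uses. This is a minimal illustration under the assumption that the secret has no repeated digits (as the game guarantees); the helper name hit_blow is hypothetical and is not part of mastermind.awk.

```shell
# hit_blow (hypothetical helper): score a guess against a secret.
# A digit in the right position is a hit; a digit that occurs
# elsewhere in the secret is a blow.
hit_blow() {
  awk -v secret="$1" -v guess="$2" 'BEGIN {
    n = length(secret)
    for (i = 1; i <= n; i++) {
      g = substr(guess, i, 1)
      if (g == substr(secret, i, 1)) hit++
      else if (index(secret, g) > 0) blow++
    }
    printf "%d Hit %d Blow\n", hit, blow
  }'
}

hit_blow 1320 1234
hit_blow 1320 1340
```

With the secret 1320, these two calls reproduce the scores of rounds 1 and 5 in the sample session that follows.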
+++ Hit & Blow +++ <Push Enter>
[ 1] >> 1234
## 1 Hit 2 Blow
[ 2] >> 1256
## 1 Hit 1 Blow
[ 3] >> 1789
## 1 Hit 0 Blow
[ 4] >> 1243
## 1 Hit 2 Blow
[ 5] >> 1340
## 3 Hit 0 Blow
[ 6] >> 1320
Congratulations !! (1320)
BEGIN{
srand();
c=1;
print "\n\n +++ Hit & Blow +++ <Push Enter>\n";
q[z=p=int(9*rand())+1]=1;
for(i=2; i<=4;)
if(q[p=int(10*rand())]<1){
q[p]=i++;
z=z*10+p; }
}
Note that 1023 and 9876 are the smallest and largest 4-digit integers with no repeated digits.
{ if((n=int($0+0))>=1023 && n<=9876) {
++c;
v=0;
for(i=4; i>0; n=int(n/10))
v+=(q[p=n%10]==i--)?10:(q[p]>0)?1:0;
if (v==40) exit;
else printf("%16s %2d Hit %2d Blow\n", "##", v/10, v%10);
}
if (c>10) exit;
else printf("[%2d] >> ", c);
}
END{
printf("\n %s (%d)\n", (v==40)?"Congratulations !!":"Over times", z);
}
The author's name is YSA.
gawk -f mastermind2.awk [breaker]
Download from LAWKER.
This is an interactive game of Mastermind, played against the evil computer.
The game shows the recursive power of the Awk language. It also demonstrates a winning technique for the game Mastermind.
The game has two roles, breaker and maker of Mastermind codes. A 5-digit code, with each digit from 0 to 9, must be broken. The maker responds with one + for every digit guessed in the correct position and one - for every correct digit in the wrong position in the code. A code breaker (human or this program) must use those clues to determine the code. A score is kept; low score wins.
In the following example, the goal is "12345".
gawk -f mastermind2.awk br
I'll start, I'll break your code, you respond with +-
my guess #1 12413 ++--
my guess #2 12531 ++--
my guess #3 13211 +--
my guess #4 14523 +----
my guess #5 15432 +----
my guess #6 12345 +++++
BEGIN{
srand();
if (index(ARGV[1],"br")) {
print "I'll start, I'll break your code, you respond with +-"
ARGV[1] = ""
mscore += breaker(randguess())
}
do {
printscore()
print "Guess my code 5 digits 0 to 9"
yscore += maker(randguess())
printscore()
print "I'll break your code, you respond with +-"
mscore += breaker(randguess())
} while (1)
}
END{
printscore()
}
function printscore() {
print("\nlow score wins! my score =", mscore, "yours =", yscore)
}
function randguess() {
return incr(int((10*10*10*10*10)*rand()))
}
function smudge(ins,n,ch) {
return substr(ins, 1, n-1) ch substr(ins, n+1)
}
function grade(val, guess, i, rtn, t){
# return + for exact hits, - for "close" for all 5 digits
for (i = 1;i < 6; i++) {
if (substr(val, i, 1) == substr(guess, i, 1)) {
#exact match
rtn = rtn "+"
val = smudge(val, i, "x");
guess = smudge(guess, i, "y");
#print i, val, guess, rtn
}
}
for (i = 1;i < 6; i++) {
t = index(val, substr(guess, i, 1))
if (t) {
rtn = rtn "-"
val = smudge(val, t, "u")
guess = smudge(guess, i, "v");
#print t, i, val, guess, rtn
}
}
return rtn
}
#passed guess and old guess array
#A good guess matches all previous scores with the new guess
function checkguess(g, oldg, i,score) {
#print "guess " g
for (i in oldg) {
if (g == i) return 2 #bad, repeated guess
if (grade(g,i) != oldg[i]) return 1 #reject this guess
}
return 0 #success, this is an ok guess
}
function incr(old, new) {
new = sprintf("%05d",old + 1)
#print "old new", old, new
return substr(new, length(new) -4)
}
function alignres(res, tem) {
for (i=1;i<=length(res);i++) {
if (substr(res, i, 1) == "+") tem = "+" tem
else tem = tem "-"
}
#print "alignres ",res, tem
return tem
}
function breaker(g1, guess, res, hisinput, tries){
guess = g1
do {
printf("my guess #%d %s ", ++tries, guess)
do {
if (getline hisinput <= 0) {
print "whoa, some error, giving up"
exit
}
if (!match(hisinput, /^[-+]*$/)) {
print "invalid response, use only +-"
}
} while (RSTART == 0)
hisinput = alignres(hisinput)
res[guess] = hisinput
#print "hisinput ", hisinput, res[guess]
#for (i in res) print "res[" i "]=" res[i]
if (res[guess] == "+++++") return tries
# make another guess
do {
guess = incr(guess)
r = checkguess(guess, res)
} while (r == 1)
} while (g1 != guess)
print "you must have made a mistake, no answer is possible"
exit
}
function maker(original, his, tries)
{
#print original," cheater!"
do {
if (getline his <= 0) {
print "whoa, some error, giving up"
exit
}
res = grade(original, his)
print "try " ++tries " results",res
if (res == "+++++") return tries
} while (1)
}
Steve Calfee, USA.
In early 2004, Aaron Hawley threw himself into a programming contest held by the University of Vermont Computer Science Student Association. The contest was a variation on checkers where competitors had their artificial computer players compete in a "virtual tournament".
It made for an interesting problem, and he chose to make it more interesting by writing his checker player in Awk (specifically, GNU Awk). He wasn't able to submit a working version then because of a technical problem, and the contest itself was never finalized due to a lack of submissions.
Recently, he overcame the technical problems and finally put together a working version (not to be confused with winning). The heuristic used in this checker player is not a winning strategy, but at least it plays. There is also the full build distribution, that shows what a large Awk project looks like, and some tricks on how to survive (hint: GNU Makefiles).
To let the computer play first, run:
awk -f 15.awk -v start=1
To play first, run:
awk -f 15.awk -v start=2
Each move is one square (in the range 1..9).
gawk -f 15.awk -v start=1
6
9
1
3
8
I win!
BEGIN {
winning_sum = 15;
max_play = 9;
used[0] = 1;
my_sum = your_sum = 0;
if (start == 1) {
answer = ftw(used, my_sum);
used[answer] = 1;
my_sum += answer;
print answer;
}
halted=0;
}
! /^[1-9]$/ {
print "Illegal play: " $0;
}
{
if ($0 in used) {
print "Illegal play: " $0;
} else {
used[$0] = 1;
your_sum += $0;
if (your_sum == winning_sum) {
print "You win!";
halted=1
exit 0
} else if (your_sum > winning_sum && my_sum > winning_sum) {
print "Draw";
halted=1;
exit 2;
} else {
answer = block = winning_sum - your_sum;
winning_move = ftw(used, my_sum);
if (block > max_play \
|| block <= 0 \
|| block in used) {
answer = winning_move;
}
while (answer <= 0 || answer > max_play || answer in used) {
answer++;
}
my_sum += answer;
used[answer] = 1;
print answer;
if (my_sum == winning_sum) {
print "I win!";
halted=1;
exit 1;
}
}
}
}
END {
if (halted == 1) {
exit;
}
if (your_sum != winning_sum && my_sum != winning_sum) {
print "I win by forfeit";
exit 1;
}
}
function ftw(used, sum) {
strlst = "";
for (v in used) {
strlst = strlst "" v;
}
to_win = try(strlst, max_play "", sum);
if (to_win == "") {
return -1;
}
return substr(to_win, 1, 1);
}
function try(used, hunches, sum) {
curr_sum = strsum(hunches) + sum;
curr_hunch = substr(hunches, 1, 1);
next_hunch = curr_hunch - 1;
if (hunches == "") {
return "";
} else if (curr_hunch < 1) {
return substr(hunches, 2);
} else if (index(used, curr_hunch) || curr_sum > winning_sum) {
return try(used, next_hunch "" substr(hunches, 2), sum);
} else if (curr_sum == winning_sum) {
return hunches;
}
return try(curr_hunch "" used, next_hunch "" hunches, sum);
}
function strsum(str) {
s = 0;
str_length = length(str);
for (i = 1; i <= str_length; i++) {
s += substr(str, i, 1);
}
return s;
}
Aaron S. Hawley
cat numbers | gawk -f quicksort.awk
Download from LAWKER.
Some Awk implementations come with built-in sort routines (e.g. Gawk's asort and asorti functions). But it can be useful to code these yourself, especially if you are doing data structure tricks.
Quicksort selects a pivot and divides the data into values above and below the pivot. Sorting then recurses on these sub-lists.
BEGIN { RS = ""; FS = "\n" }
{ A[NR] = $0 }
END {
qsort(A, 1, NR)
for (i = 1; i <= NR; i++) {
print A[i]
if (i == NR) break
print ""
}
}
function qsort(A, left, right, i, last) {
if (left >= right)
return
swap(A, left, left+int((right-left+1)*rand()))
last = left
for (i = left+1; i <= right; i++)
if (A[i] < A[left])
swap(A, ++last, i)
swap(A, left, last)
qsort(A, left, last-1)
qsort(A, last+1, right)
}
function swap(A, i, j, t) {
t = A[i]; A[i] = A[j]; A[j] = t
}
Alfred Aho, Peter Weinberger, Brian Kernighan, 1988.
The QTAwk utility is an extension to standard Awk that makes it possible to handle simple data-reformatting jobs easily with just a few lines of code.
Differences to standard Awk:
Nov 28, 2009
This site is moving up the page rankings:
Other indicators also look good. Since the site was launched (Feb 15, 2009), the number of visits has been steadily increasing:
These 19,268 visits come from 2,765 cities:
(BTW: Anyone got any ideas why these cities visit here so often?)
In other news, Website Outlook reports that:
To put that report in perspective, the same source notes that:
URL: http://www.blisted.org/wiki/projects/awkbot.
Awkbot is a small bot written in 100% GNU Awk; it requires GNU Awk version 3.1.1.
Awkbot can search Google and search the awk man page for descriptions of functions and built-in variables.
The tool accepts a simple configuration file, and has a small wrapper written in sh for automatic restarts.
The goal of the tool is to (eventually) become a clone of info bot with awk adaptations to prove to those fools in #perl on freenode that awk really is a programming language.
AWKBot uses mysql.awk to connect to, and query, a MySQL database where it will store information you give it, and recall it later. It also uses this to track karma points, and maybe more in the future. It similarly uses some interesting pipelining to do IPC, to support awkpaste.
Scott S. McCoy
Zazzle.com is offering their great "I love Awk mug", starting at $12.
From John David Duncan's parallel-awk.org site.
Parallel Awk is an effort to link Awk with MPI, enabling the everyday analysis of large plain-text files to be parallelized, allowing rapid prototyping of parallel applications, preserving the syntax and style of Awk, and hiding the details of MPI.
The Awk programming language, first developed at Bell Labs in 1977, is a standard part of Unix operating system distributions. It is a compact language, commonly used in systems administration and in commercial (as opposed to scientific) computing. The half dozen books about awk include the original slim and very readable Awk book by Aho, Kernighan, and Weinberger. Awk is standardized in POSIX, and the most actively maintained current implementation is GNU awk. While awk, like sed, is perhaps most often used for "one-liners," its regular expression handling and rich C-like syntax make it well-suited for many small applications and domain-specific languages.
MPI is a standard Message Passing Interface for parallel computing created by the MPI Forum, implemented in two widely-used free distributions (LAM/MPI and MPICH) and in optimized versions provided by many hardware vendors. MPI libraries are often linked with Fortran or C code in scientific computing tasks, such as matrix calculations, and run on supercomputers or Beowulf clusters. For some of these applications, runtime is actually greater than development time; nonetheless, a language for rapid prototyping is a handy tool to have around.
# pi.awk: approximate pi by integrating f(x) = 4/(1+x^2)
# n = number of intervals to calculate
#
# e.g.: mpiexec -n 4 mpawk -v n=10000 -f pi.awk
BEGIN {
h = 1/n
for(i = RANK+1 ; i <= n ; i += SIZE) {
x = h * (i - 0.5)
sum += 4 / (1 + x^2)
}
pi = reduce(sum(h * sum))
if(!RANK) printf("n=%d, pi is %1.20f\n",n,pi)
}
pi.awk requires about 20% as many lines of code as its equivalents in C or Fortran. The output is printed by the process with RANK = 0 and looks like this:
sh% mpiexec -n 4 mpawk -v n=100000 -f pi.awk
n=100000, pi is 3.14159265359811668006
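For readers without an MPI installation, the way pi.awk divides the intervals among processes can be imitated in ordinary Awk. The sketch below is an assumption-laden stand-in, not Parallel Awk: a loop over rank plays the role of the SIZE processes, and adding up the partial results plays the role of the reduce step. The helper name pi_midpoint and its two arguments (intervals, simulated process count) are hypothetical.

```shell
# pi_midpoint (hypothetical helper): serial imitation of pi.awk.
#   $1 = number of intervals, $2 = number of simulated processes
pi_midpoint() {
  awk -v n="$1" -v size="$2" 'BEGIN {
    h = 1 / n
    for (rank = 0; rank < size; rank++) {        # one pass per simulated process
      partial = 0
      for (i = rank + 1; i <= n; i += size) {    # same striding as pi.awk
        x = h * (i - 0.5)                        # midpoint of interval i
        partial += 4 / (1 + x * x)
      }
      total += h * partial                       # stand-in for the reduce step
    }
    printf "%.5f\n", total
  }'
}

pi_midpoint 1000 4
```

Each interval is visited by exactly one simulated rank, so the total does not depend on the process count beyond floating-point rounding.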
The latest beta release of Parallel Awk is version 0.8. In this release, any Awk expression (including numbers, strings, and arrays) can be sent from one process to another using the functions send and recv. The comm_split() function, an interface to MPI_Comm_split, allows the creation of intra-communicators, while a companion function comm_set() is used to set the default MPI communicator implicitly used for all other MPI operations. Supported collective operations include reduce(), which can be applied to both numeric and string expressions, and barrier(). A function called assign() is used to divide the lines of input among the set of processes, as can a hash() function that is applied to array keys or other strings.
gawk -f wst.awk [-v X=anychar] iterations
gawk -f wst.awk -v X=* 2
               *
              * *
             *   *
            * * * *
           *       *
          * *     * *
         *   *   *   *
        * * * * * * * *
       *               *
      * *             * *
     *   *           *   *
    * * * *         * * * *
   *       *       *       *
  * *     * *     * *     * *
 *   *   *   *   *   *   *   *
* * * * * * * * * * * * * * * *
BEGIN {
n = ARGV[1] + 0 # iterations
if (n !~ /^[0-9]+$/) { exit(1) }
if (n == 0) { width = 3 }
row = split("X,X X,X   X,X X X X",A,",") # seed the array (third row is X, three spaces, X)
for (i=1; i<=n; i++) { # build triangle
width = length(A[row])
for (j=1; j<=row; j++) {
str = A[j]
# if (n <= 9) { gsub(/[^ ]/,i,str) } # show structure
A[j+row] = sprintf("%-*s %-*s",width,str,width,str)
}
row *= 2
}
for (j=1; j<=row; j++) { # print triangle
if (X != "") { gsub(/X/,substr(X,1,1),A[j]) }
sub(/ +$/,"",A[j])
printf("%*s%s\n",width-j+1,"",A[j])
}
exit(0)
}
Dan Nielsen
Server.awk - a simple, single user, web server built with gawk.
Download from LAWKER.
This code creates an html menu of local applications which you can season to taste. The usage requires two steps...
This code is based on the examples in the TCP/IP Internetworking With `gawk' manual and is licensed under GPL 3.0. For updates to this code, see http://topcat.hypermart.net/index.html.
BEGIN {
x = 1 # script exits if x < 1
port = 8080 # port number
host = "/inet/tcp/" port "/0/0" # host string
url = "http://localhost:" port # server url
status = 200 # 200 == OK
reason = "OK" # server response
RS = ORS = "\r\n" # header line terminators
doc = Setup() # html document
len = length(doc) + length(ORS) # length of document
while (x) {
if ($1 == "GET") RunApp(substr($2, 2))
if (! x) break
print "HTTP/1.0", status, reason |& host
print "Connection: Close" |& host
print "Pragma: no-cache" |& host
print "Content-length:", len |& host
print ORS doc |& host
close(host) # close client connection
host |& getline # wait for new client request
}
# server terminated...
doc = Bye()
len = length(doc) + length(ORS)
print "HTTP/1.0", status, reason |& host
print "Connection: Close" |& host
print "Pragma: no-cache" |& host
print "Content-length:", len |& host
print ORS doc |& host
close(host)
}
function Setup() {
tmp = "<html>\
<head><title>Simple gawk server</title></head>\
<body>\
<p><a href=" url "/xterm>xterm</a>\
<p><a href=" url "/xcalc>xcalc</a>\
<p><a href=" url "/xload>xload</a>\
<p><a href=" url "/exit>terminate script</a>\
</body>\
</html>"
return tmp
}
function Bye() {
tmp = "<html>\
<head><title>Simple gawk server</title></head>\
<body><p>Script Terminated...</body>\
</html>"
return tmp
}
function RunApp(app) {
if (app == "xterm") {system("xterm&"); return}
if (app == "xcalc" ) {system("xcalc&"); return}
if (app == "xload" ) {system("xload&"); return}
if (app == "exit") {x = 0}
}
Michael Sanders
myrss("rss;url;N" [,between])
The function myrss("rss;url;N") returns the first N items from an rss feed found in url.
This code is a nice example of the brevity of Awk. I've used many PHP and Perl-based RSS readers and this code is by far the simplest, the shortest, and the easiest to modify.
The function optionally accepts a between string that is printed between each item. The following example prints a "<li>" between each RSS item; i.e. it converts a text string into an HTML list.
The code is designed to be customized. Quirks in the RSS stream, or quirks in the formatting, are handled by a set of separate my functions that can be quickly altered to return the desired strings.
The code uses a slurp function that reads the entire stream as one string using wget then splits it into an array on the < character.
After a few simplifications, the approach turns out to be very fast. For example, using
wget -O -
is faster than
wget -O tmpfile; cat tmpfile
Also, version one of this code split the RSS feed using the disjunction [<>]. This proved to be much slower than slurping the feed in, splitting on "\n", and then splitting on "<".
The above two optimizations changed the runtimes for the following example from 0.9 seconds to 0.88 seconds. This is very fast considering that just wgetting the RSS feed takes 0.08 seconds.
% gawk -f myrss.awk --source 'BEGIN {
print "<ul>"
print myrss("rss;lawker.blogspot.com/feeds/posts/default?alt=rss;5","<li>\n")
print "</ul>"
}'
This generates the following list from the AWK.INFO rss feed
function myrss(rss, between, tmp) {
split(rss,tmp,";");
return myrss1(tmp[2],tmp[3],between);
}
function myrss1(feed,max, between, n,all,sep,out,date,url,txt,seen) {
n = slurp("wget -q -O - http://" feed,">",all);
for(i=1;i<=n; i++) {
if (all[i] ~ /^<pubDate/)
date = myDate(all[i+1])
else if (all[i] ~ /^<description/)
txt = myText(all[i+1])
else if (all[i] ~ /^<enclosure/) {
url = myUrl(all[i]);
out = out sep myReport(url,date,txt);
sep = between ? between : "\n";
if (++seen >= max)
return out;
}}
return out;
}
slurp reads an entire file into an array.
function slurp(com,sep,all) { slurp0(com); return split($0,all,sep) }
function slurp0(com) { RS=""; FS="\n"; com | getline; close(com) }
Most of the formatting control is isolated in the following functions. Change these to change the appearance of the feeds.
function myDate(str, tmp) { split(str,tmp," "); return tmp[3] " " tmp[2]}
function myText(str) { sub(/<.*/,"",str); return str }
function myUrl(str) { sub(/<.*/,"",str); return str }
function myReport(url,dat,txt) { return "<a href=\""url"\">"dat"</a>" txt}
Tim Menzies
#eg gawk -v target=89000 -f rcalc.awk
Download from LAWKER.
Calculate resistor pair value from e24 series to make up arbitrary value
When designing and building electronic projects I mostly use 1% resistors that come in the E24 series (24 values per decade).
Frequently there's a need for some arbitrary value (between 10R and 1M in this script) resistor that can be made with a series or parallel combination of two standard values.
This script searches the E24 standard value space for pairs of resistors that will produce or come close to the desired arbitrary resistor value.
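The two combinations the script evaluates for each candidate pair are the series value Ra + Rb and the parallel value Ra*Rb/(Ra + Rb), each compared with the target as a percentage error. A minimal sketch of that arithmetic for a single pair follows; the helper name pair and its argument order are hypothetical and are not part of rcalc.awk.

```shell
# pair (hypothetical helper): series and parallel value of two
# resistors, with the percentage error relative to a target.
#   $1 = Ra, $2 = Rb, $3 = target (all in ohms)
pair() {
  awk -v a="$1" -v b="$2" -v target="$3" 'BEGIN {
    s = a + b                 # series connection
    p = a * b / (a + b)       # parallel connection
    printf "series %.2f %+.2f%%\n",   s, 100 * (s - target) / target
    printf "parallel %.2f %+.2f%%\n", p, 100 * (p - target) / target
  }'
}

pair 200000 160000 89000
pair 56000 33000 89000
```

For a target of 89000, the parallel line of the first call and the series line of the second correspond to rows in the sample output that follows.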
$ gawk -v target=89000 -f rcalc.awk
Result Ra Rb Connect Error
88800.00 82000 6800 series -0.22%
88888.89 200000 160000 parallel -0.12%
89000.00 56000 33000 series
89000.00 62000 27000 series
89130.43 820000 100000 parallel +0.15%
89137.93 470000 110000 parallel +0.15%
89189.19 220000 150000 parallel +0.21%
BEGIN {
print "Result Ra Rb Connect Error"
max_error = 0.005 # +/- 0.5%
max_multiplier = 10000 # try four decades
format = "%8.2f %7d %7d %-8s %+4.2f%%"
formnz = "%8.2f %7d %7d %-8s"
limit_hi = target * (1 + max_error)
limit_lo = target * (1 - max_error)
$0 = "10 11 12 13 15 16 18 20 22 24 27 30 33 36 39 43 47 51 56 62 68 75 82 91"
for (i = 1; i < 25; i++) {
e24[i] = $i
}
for (u = 1; u < 25; u++) {
for (v = 1; v < 25; v++) {
for (i = 1; i <= max_multiplier; i *= 10) {
x = e24[u] * i
if (x == target) {
continue
}
for (j = 1; j <= max_multiplier; j *= 10) {
y = e24[v] * j
if (y == target) {
continue
}
combo(e24[u] * i, e24[v] * j)
}
}
}
}
exit # skip file reader
}
function combo(a, b, c) {
# parallel
c = a * b / (a + b)
combo2(a, b, c, "parallel")
# series
c = a + b
combo2(a, b, c, "series")
}
function combo2(a, b, c, d, e, f) {
# avoid duplicates and ignore result when error too big
if (a < b || c < limit_lo || c > limit_hi) { return }
e = 100 * (c - target) / target # percentage error
f = (e == 0 ? formnz : format) # select output format
result[n++] = sprintf(f, c, a, b, d, e)
}
END {
# sort by result value, print list
n = asort(result, sort_result)
for (i = 1; i <= n; i++) {
print sort_result[i]
}
}
Copyright (c) 2009 Grant Coady <http://bugsplatter.id.au> GPLv2
These notes come from John Fry's Counting with Awk lecture in his subject Linguistics 115: Corpus Linguistics, Fall 2007, SJSU.
Much research has reported that human writing follows well-defined laws. For example, natural language text and software programs conform tightly to simple and regular statistical models. "Zipf's Law" states that multiplying a word's rank r by its frequency f produces (roughly) a constant value C; i.e. r times f is a constant. The frequency f of a word is obtained by counting the number of times it occurs in a text, and r is obtained by ranking all the words by frequency (1. the; 2. and; 3. I; etc.) Example of Zipf's Law for five words in the London-Lund corpus of spoken conversation:
 r  X        f  =  C
35  very   836  =  29,260
45  see    674  =  30,330
55  which  563  =  30,965
65  get    469  =  30,485
75  out    422  =  31,650
Another way of expressing Zipf's Law is to say that frequency is inversely proportional to rank. For example, the 2nd-ranked word ("and") appears half as often as the 1st-ranked word ("the"). More generally, the nth-ranked word appears 1/n as often as "the".
Here is a short awk program, saved as ~jfry/zipf.awk, that reads in a ranked frequency list and computes r times f.
BEGIN {printf "%20s%7s%7s%10s\n", "WORD","RANK","FREQ","C"}
{printf "%20s%7d%7d%10d\n", $2, NR, $1, NR*$1}
This program can be run with
awk -f ~jfry/zipf.awk
Testing Zipf's Law on Shakespeare :
$ tr A-Z a-z < shakespeare.txt | tr -sc a-z '\n' | sort | uniq -c | sort -rn | awk -f ~jfry/zipf.awk
WORD  RANK   FREQ       C    WORD  RANK   FREQ       C
the      1  27378   27378    s i     17   7721  131257
and      2  26084   52168    for     18   7655  137790
i        3  22538   67614    be      19   6897  131043
to       4  19771   79084    his     20   6859  137180
of       5  17481   87405    he      21   6679  140259
a        6  14725   88350    your    22   6657  146454
you      7  13826   96782    this    23   6608  151984
my       8  12489   99912    but     24   6277  150648
that     9  11318  101862    have    25   5902  147550
in      10  11112  111120    as      26   5749  149474
is      11   9319  102509    thou    27   5549  149823
d       12   8960  107520    him     28   5205  145740
not     13   8512  110656    so      29   5058  146682
with    14   7791  109074    will    30   5008  150240
me      15   7777  116655    what    31   4808  149048
it      16   7725  123600    thy     32   4034  129088
Testing Zipf's Law on newswire
$ cd /corpora/newswire/data
$ zcat -r .|grep -v '^<' | tr A-Z a-z|tr -sc a-z '\n' | sort| uniq -c | sort -rn | awk -f /home/jfry/zipf.awk
WORD  RANK  FREQ     C    WORD  RANK  FREQ     C
the      1  142M  142M    by      16   14M  224M
to       2   60M  120M    he      17   13M  235M
of       3   60M  180M    at      18   13M  244M
a        4   53M  214M    as      19   12M  230M
and      5   51M  257M    from    20   10M  216M
in       6   51M  307M    be      21    9M  201M
s        7   28M  202M    his     22    9M  205M
for      8   22M  178M    has     23    9M  208M
that     9   21M  195M    have    24    9M  217M
said    10   19M  199M    but     25    8M  212M
on      11   19M  214M    are     26    8M  218M
is      12   16M  200M    an      27    8M  225M
with    13   15M  197M    will    28    7M  207M
was     14   14M  203M    i       29    7M  213M
it      15   14M  211M    not     30    7M  217M
J. Mellander reports in comp.lang.awk how to make Mawk's hashing run 20+ times faster.
Recently, for a project, I had the occasion to use mawk - I have a list of ~12,000,000 Unix timestamps to nanosecond precision that I needed to match the first field of every record in a number of huge files. Gawk couldn't handle the number of records, and so I used mawk, as being more memory thrifty. The program was a one-liner like this:
mawk 'FNR==NR {x[$1]++;next} $1 in x' timestamp_file log_file
which works perfectly, but the run time seemed excessive - many hours per log file - which made me think that the hashing function was causing many collisions, and thus hash chaining.....
When stuck in a slow meeting, I started looking at the mawk source code, specifically the hashing functions, of which there are 2: hash() in hash.c & ahash() in array.c
I was surprised to find that the hashing functions in both cases essentially just sum the bytes of the key to create the hash - this means that 123, 321, 213, etc. would all hash to the same location and cause collisions, and hash chaining.
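The collision described above is easy to reproduce with a byte-sum hash written in Awk itself. The sketch below is only an illustration of the idea, not mawk's actual C code; the helper name byte_sum is hypothetical, and the character table is limited to ASCII.

```shell
# byte_sum (hypothetical helper): hash a key by adding up its byte values.
byte_sum() {
  awk -v key="$1" 'BEGIN {
    for (i = 1; i < 128; i++) ord[sprintf("%c", i)] = i   # ASCII character -> code
    for (i = 1; i <= length(key); i++) h += ord[substr(key, i, 1)]
    print h
  }'
}

for k in 123 321 213; do
  echo "$k -> $(byte_sum "$k")"
done
```

Because addition ignores order, every permutation of the same characters lands on the same value, which is exactly the situation a list of similar-looking timestamps creates.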
Modifying the hashing to a more efficient hash caused an enormous gain in efficiency, as in this test:
$ wc -l j
2999999 j
$ time mawk-1.3.3/mawk '{x[$1]++}' j >/dev/null
real 2m24.362s
user 2m20.174s
sys 0m0.663s
$ time mawk-1.3.3a/mawk '{x[$1]++}' j >/dev/null
real 0m6.607s
user 0m6.146s
sys 0m0.241s
mawk-1.3.3a has the below modifications. In hash.c I replaced the 'hash' function with:
/*
FNV-1 hash function, per en.wikipedia.org/wiki/Fowler-Noll-Vo_hash_function
*/
unsigned hash(s)
register char *s ;
{
register unsigned h = 2166136261 ;
while (*s) h = (h * 16777619) ^ *s++ ;
return h ;
}
and in array.c replaced 'ahash' with:
/*
FNV-1 hash function, per en.wikipedia.org/wiki/Fowler-Noll-Vo_hash_function
*/
static unsigned ahash(sval)
STRING* sval ;
{
register unsigned h = 2166136261 ;
register char *s = sval->str;
while (*s) h = (h * 16777619) ^ *s++ ;
return h ;
}
Brendan O'Connor writes in his blog:
When one of these new fangled 'Big Data' sets comes your way, the very first thing you have to do is data munging: shuffling around file formats, renaming fields and the like. Once you're dealing with hundreds of megabytes of data, even simple operations can take plenty of time.
For one recent ad-hoc task I had - reformatting 1GB of textual feature data into a form Matlab and R can read - I tried writing implementations in several languages, with help from my classmate Elijah.
To be clear, the problem is to take several files of (item name, feature name, value) triples, like:
000794107-10-K-19960401 limited 1
000794107-10-K-19960401 colleges 1
000794107-10-K-19960401 code 2
...
004334108-10-K-19961230 recognition 1
004334108-10-K-19961230 gross 8
...
And then rename items and features into sequential numbers as a sparse matrix: (i, j, value) triples. Items should count up from inside each file; but features should be shared across files, so they need a shared counter. Finally, we need to write a mapping of feature IDs back to their names for later inspection; this can just be a list.
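The renaming task maps naturally onto Awk's associative arrays. What follows is a minimal sketch of one way to do it, not the code from the benchmark: it assumes whitespace-separated triples, numbers items and features from 1, and prints the feature mapping at the end of the output on lines starting with "#" instead of writing a separate file. The helper name to_sparse is hypothetical.

```shell
# to_sparse (hypothetical helper): turn (item, feature, value) triples
# into (i, j, value) triples, then list the feature id -> name mapping.
to_sparse() {
  awk '
    FNR == 1      { nitems = 0; split("", item) }   # item ids restart in each file
    !($1 in item) { item[$1] = ++nitems }
    !($2 in feat) { feat[$2] = ++nfeats; name[nfeats] = $2 }   # shared across files
                  { print item[$1], feat[$2], $3 }
    END           { for (j = 1; j <= nfeats; j++) print "#", j, name[j] }
  ' "$@"
}

printf '%s\n' 'docA limited 1' 'docA code 2' 'docB code 8' | to_sparse
```

Resetting the item table on FNR == 1 gives each input file its own item numbering, while the feature table is never cleared and so acts as the shared counter.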
Since Awk is a standardized language, many implementations exist. One of them, MAWK, is incredibly efficient. It outperforms all other languages, including statically typed compiled ones like Java and C++! It wins on both LOC and performance criteria, a rare feat indeed, transcending the usual competition of slow-but-easy scripting languages versus fast-but-hard compiled languages.
All the code, results, and data can be obtained at github.com/brendano/awkspeed. I'd love to see results for more languages.
Editor's note: one reply to this blog entry, by Eric Young, optimized Brendan's Ruby solution and re-ran all the tests. Eric reported the following runtimes. Note that they confirm Brendan's results: mawk runs faster than everything else.
33.8s mawk
36.3s gcc c
51.0s java
67.0s perl Fletch.pl
71.7s python
87.8s perl
95.8s nawk
101.4s gawk
114.0s gcc
133.0s ruby1.9 eay.rb
136.8s ruby1.8 eay.rb
327.6s ruby1.8
372.9s ruby1.9
Aharon Robbins, the maintainer of GNU Awk, answers some questions from Tim Menzies.
Q: What is your favorite programming language (besides gawk)? And why?
A: It depends for what. A long time ago I was a big Korn shell junkie, although these days I would do most high level things in a mixture of bash and awk, with awk doing the heavy lifting.
For lower level things I prefer C++, although I have something of a love/hate relationship with the language. It's possible to write completely unreadable and unmaintainable code in it. It's also possible to write beautiful, clear, absolutely amazing code in it.
I find that going back to C after working daily in C++ is hard, although I do it for gawk maintenance. For new programs I would work in C++, not C. For something big, I'd use the Qt framework for support and portability.
I've been recently living in the C# world for my day job. The development environment is very addictive, but C# hasn't seduced me away from C++.
Q: The open source world is a fascinating development paradigm. I'm therefore very curious to know what prompted you to write gawk?
A: I didn't write it from scratch. I got involved shortly after picking up and reading the Aho, Weinberger & Kernighan book in late 1987 when it came out.
New awk wasn't widely available. I had been involved with USENET since around 1983, and knew about the GNU project. I also had a strong interest in compilers and interpreters, so I got in touch with the GNU project to see if they had an awk clone and to see if I could get involved in upgrading it to "new" awk.
It turned out that they already had a volunteer, David Trueman, who was working on it, but he was happy to have help. He and I worked together until circa 1993 or 1994 when he had to stop being involved, and I became the sole maintainer.
It was a lot of fun. The number of emails of the "I could not get my work done without gawk" sort was amazing; Unix awk would often roll over and die on some of the data sets people were running through gawk.
Things really got shaken down when gawk became part of GNU/Linux distributions; then people were using it as the only awk, instead of alongside Unix awk.
Q: In retrospect, what are the best/worst features of gawk?
A: The best feature is the pattern/action paradigm. The implicit read-a-record loop is wonderful. This is the language's data-driven nature, as opposed to the imperative nature of most languages.
Associative arrays rank second; they are quite powerful.
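As a small illustration of the two features praised above, here is a minimal sketch (not from the interview) that combines pattern/action rules with an associative array. The function name `total_by_user` and the two-column "user bytes" input are assumptions made up for the example.

```shell
# total_by_user: hypothetical helper; reads "user bytes" records on stdin
# and prints one total per user. awk supplies the read-a-record loop.
total_by_user() {
  awk '
    /^#/     { next }                   # pattern: skip comment lines
    NF == 2  { total[$1] += $2 }        # associative array keyed by user name
    END      { for (u in total) print u, total[u] }
  ' | sort                              # for-in order is unspecified, so sort
}

printf '# user bytes\nann 10\nbob 5\nann 7\n' | total_by_user
```

There is no explicit loop over the input and no declaration for `total`; both come from the language itself.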
There are some warts inherited from Unix awk and left unspecified by POSIX. These are relatively minor.
The lack of an explicit concatenation operator is an obvious one.
The lack of real multi-dimensional arrays is another.
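The two warts just mentioned can be seen in a few lines. This is an illustrative sketch rather than anything from the interview; the function name `warts_demo` is made up. It relies only on standard awk behaviour: concatenation is written by juxtaposition, and a subscript list such as `grid[1, 2]` is stored under a single string key joined with `SUBSEP`.

```shell
# warts_demo: hypothetical name; shows implicit concatenation and
# simulated multi-dimensional subscripts.
warts_demo() {
  awk 'BEGIN {
    a = "x"; b = "y"
    s = a b                  # no operator: juxtaposition concatenates
    print s
    print "a" 1 + 2          # + binds tighter than concatenation
    grid[1, 2] = "cell"      # really one string key: 1 SUBSEP 2
    key = 1 SUBSEP 2
    if (key in grid) print "same key"
    if ((1, 2) in grid) print "found"
  }'
}

warts_demo
```

The second `print` is the kind of surprise an explicit concatenation operator would avoid: the addition is performed first and its result is then appended to the string.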
There are features just in gawk that in retrospect seem to have been a waste of time, such as bringing out to the awk level the possibility to internationalize a program. I don't think anyone uses that.
IGNORECASE was a huge pain to get right; if I'd known how long it would take, I wouldn't have bothered.
The biggest "lack" is that there isn't an easy, standard way to provide extensibility; there are way too many things in the C library today (and even yesterday) that the awk programmer just can't get to. (Like the chdir system call!) I hope to eventually provide some better mechanisms for this, but I don't know how much actual filling in I can do also.
Q: Under what circumstances would you recommend/not recommend it?
A: Gawk is good for small to medium level programs that have to process text and/or do simple numeric work (summing up columns, averaging, VERY simple statistics work). It has a central place in traditional Unix / Linux shell scripting when portability is a must.
But I wouldn't care to try to write a military air traffic command and control system in gawk, for example. :-)
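For the "summing up columns, averaging" kind of job mentioned above, a minimal sketch might look like the following. The function name `col_stats`, its column argument, and the output format are assumptions chosen for the example.

```shell
# col_stats: hypothetical helper; prints count, sum and mean of one column.
# Usage: col_stats COLUMN < file    (COLUMN defaults to 1)
col_stats() {
  awk -v col="${1:-1}" '
    # only count fields that look like plain decimal numbers
    $col ~ /^-?[0-9]+(\.[0-9]+)?$/ { sum += $col; n++ }
    END { if (n) printf "n=%d sum=%g mean=%g\n", n, sum, sum / n }
  '
}

printf 'a 2\nb 4\nc x\nd 6\n' | col_stats 2
```

Non-numeric fields (the `x` on the third line) are skipped rather than silently treated as zero.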
Q: Gawk has a reputation of being slow...
A: "Slow" compared to what? As far as I've seen, gawk is always faster than Unix awk. Michael Brennan's mawk is even faster, but until recently it has been unmaintained, and it lacks many important, modern features.
Relative to C? Of course. So what? You have to write 5 - 10 times as much C as you do awk to do the same or less. (I remember one program I wrote in C at around 1200 lines and rewrote in under 300 lines of awk, and the awk was clearer and did more.)
Relative to perl? It depends. I have had emails telling me that gawk was faster than perl for what the users were doing. And if not, do I care? Not really - perl is a write-only language, and don't get me started on Perl 6. :-)
All that said, this got me to thinking about a possible bottleneck that I'll be investigating in the near future.
Q: Awk also has a reputation of not being suitable for "real" projects. Is that reputation deserved?
A: I don't think that contention is true: it may be that scripting languages in general have such a reputation - Ronald Loui has written about this, but I don't think the contention is true for scripting languages either.
As is always the case, the answer is "it depends". What is the scale of what you're trying to do? Who is the customer? When Rick Adams was still running UUNET, he used a suite of awk programs to do his accounting. That's as "real" a project as you can get: billing your (hundreds or thousands of) customers for their resource usage. And he used gawk, since Unix awk would just roll over and die. (Unix awk has gotten better as a result of the "competition", but that's a different story. :-)
Q: Are you aware of any landmark projects that use gawk?
A: GNU/Linux. :-)
Not really. Gawk "just works", and that in and of itself is a testimony to its quality and value.
Q: Looking a decade into the future, can you see gawk disappearing? Why (not)?
A: I don't think so. The bigger question is will I still be involved with it 10 years from now? I don't know.
I still have some things I'd like to see happen with it that are interesting and valuable and may even end up being relatively unique. I just have to find the time (or some other volunteers :-) to work on them.
Q: Currently, how are you filling your time?
A: I have a full time job as a software engineer with Intel. I have a wife and four wonderful children, as well as a dog. That's enough right there to keep me busy.
I am the series editor for the Prentice Hall Open Source Software Development Series which also takes some of my time.
And I still try to do some gawk work in between everything else!
Libmawk is a fork of mawk 1.3.3 restructured for embedding. This means the user gets libmawk.h and libmawk.so and can embed the awk scripting language in any application written in C.
The project can be downloaded here.
Libmawk has the following main features:
Since mawk is licensed under the GNU GPL v2 and libmawk is a fork of mawk, libmawk is licensed under the GNU GPL v2 too.
Tibor Palinkas
by Dick L.
I write to suggest that the Awk mascot's name is Hawk-eye (usually spoken as 'AWK-eye with a silent H).
I suggest 'AWK-eye is a DWARF, based on the following analogy:
I can't draw, but 'AWK-eye looks about half way between Gimli from Lord of the Rings, and Doc from Disney's Snow White and the Seven Dwarfs. (He has been known to sing "hi ho, hi ho, it's off to work I go". He likes to work!)
I know many spirits and sprites from the first age - LISP, APL, Assembler, Basic, Fortran and Algol. However, I have lost contact with most of these old friends, but ask 'AWK-eye to do new work most weeks. Why?
Yes, I love python, and javascript and all those creatures of later ages. And for some projects, functions as first class citizens, objects and the works is just what I want.
But for many daily jobs, 'AWK-eye is on the sweet spot of enough expressiveness to do the job, but not so much as to be hard to remember, and is small enough I have him everywhere.
Brian Kernighan has granted permission for this site to host the code from the original Awk book:
The code can be viewed here.
runawk is a small wrapper for the AWK interpreter that helps one write standalone AWK scripts. Its main feature is to provide a module/library system for AWK which is somewhat similar to Perl's "use" command. It also allows you to select a preferred AWK interpreter and to set up the environment for your scripts. It also provides other helpful features; for example, it includes numerous useful modules.
Makefile:
New modules:
Improvements, clean-ups and fixes in regression tests.
Also, runawk-0-18-0 was successfully tested on the following platforms: NetBSD-5.0/x86, NetBSD-2.0/alpha, OpenBSD-4.5/x86, FreeBSD-7.1/x86, FreeBSD-7.1/sparc, Linux/x86 and Darwin/ppc.
Aleksey Cheusov
RUNAWK is a small wrapper for the AWK interpreter that helps one write standalone AWK scripts. Its main feature is to provide a module/library system for AWK which is somewhat similar to Perl's "use" command. It also allows you to select a preferred AWK interpreter and to set up the environment for your scripts. RUNAWK makes programming AWK easy and efficient. RUNAWK also provides many useful AWK modules.
Version 0.17.0, by Aleksey Cheusov, Sat, 12 Sep 2009
runawk:
runawk -f abs.awk -e 'BEGIN {print abs(-123); exit}'
alt_getopt.awk and power_getopt.awk:
power_getopt.awk:
New modules:
In comp.lang.awk, Aleksey Cheusov writes:
I've made runawk-0.16.0 release. This release has lots of important improvements and additions. Sources are available from
RUNAWK is a small wrapper for AWK interpreter that helps to write the standalone programs in AWK. It provides MODULES for AWK similar to PERL's "use" command and other powerful features. Dozens of ready to use modules are also provided.
(For more information, see details from the last release.)
Lots of demo programs for most runawk modules were created and they are in examples/ subdirectory now.
New MEGA module ;-) power_getopt.awk. See the documentation and demo program examples/demo_power_getopt. It makes options handling REALLY easy (see below).
New modules:
Minor fixes and improvements in dirname.awk and basename.awk. Now they are fully compatible with dirname(1) and basename(1)
RUNAWK sets the following environment variables for the child awk subprocess:
RUNAWK sets the RUNAWK_ART_STDIN environment variable for the child awk subprocess to 1 if an additional/artificial `-' was added to the list of awk's arguments.
Makefile:
The most powerful feature of this release is power_getopt.awk module.
It provides a very powerful and very easy way to handle options.
Everything is in the usage message; you don't need to do anything else at all.
Panos Papadopoulos offers the latest entry in our Awk mascot competition:
Scary, yes?
Venkatesan Satish offers a new entry in our Awk mascot competition:
These pages focus on word processing tools in Awk.
These pages focus on language interpreters, written in Awk.
Download from LAWKER
Aaslg and aaslr implement the Amazing Awk Syntax Language, AASL
(pronounced ``hassle''). Aaslg (pronounced ``hassling'') takes an AASL
specification from the concatenation of the file(s) (default standard
input) and emits the corresponding AASL table on standard output.
Aaslr parses the contents of the file(s) (default standard input)
according to the AASL table in file table, emitting the table's output
on standard output.
Both take a -x option to turn on verbose and cryptic debugging output.
Both look in a library directory for pieces of the AASL system; the
AASLDIR environment variable, if present, overrides the default notion
of the location of this directory.
Aaslr expects input to consist of input tokens, one per line. For
simple tokens, the line is just the text of the token. For metatokens
like ``identifier'', the line is the metatoken's name, a tab, and the
text of the token. [xxx discuss `#' lines]
Aaslr output, in the absence of syntax errors, consists of the input
tokens plus action tokens, which are lines consisting of `#!' followed
immediately by an identifier. If the syntax of the input does not
match that specified in the AASL table, aaslr emits complaint(s) on
standard error and attempts to repair the input into a legal form; see
``ERROR REPAIR'' below. Unless errors have cascaded to the point where
aaslr gives up (in which case it emits the action token ``#!aargh'' to
inform later passes of this), the output will always conform to the
AASL syntax given in the table.
Normally, a complete program using AASL consists of three passes, the
middle one being an invocation of aaslr. The first pass is a lexical
analyzer, which breaks free-form input down into input tokens in some
suitable way. The third pass is a semantics interpreter, which
typically responds to input tokens by momentarily remembering them and to
action tokens by executing some action, often using the remembered
value of the previous input token. Aaslg is in fact implemented using
AASL, following this structure; it implements the -x option by just
passing it to aaslr.
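To make the three-pass structure concrete, here is a minimal sketch of what a third pass might look like in awk. It is not taken from the AASL distribution. It assumes the token format described above (simple tokens as plain lines, metatokens as name, tab, text, and action tokens as `#!` plus an identifier). The function name `semantics_pass` and the action names `defrule` and `call` are hypothetical.

```shell
# semantics_pass: hypothetical third pass; reads aaslr-style output on stdin.
semantics_pass() {
  awk -F '\t' '
    $0 == "#!aargh"   { print "parser gave up"; exit }
    $0 == "#!defrule" { print "define rule " last; next }  # hypothetical action
    $0 == "#!call"    { print "call rule " last; next }    # hypothetical action
    /^#!/             { next }                             # ignore other actions
                      { last = (NF > 1) ? $2 : $1 }        # remember token text
  '
}

printf 'identifier\texpr\n#!defrule\n:\nidentifier\tterm\n#!call\n;\n' |
  semantics_pass
```

Each action token gets its own pattern, and the final pattern-less rule remembers the text of the most recent input token so that the next action can use it.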
An AASL specification consists of class definitions, text definitions,
and rules, in arbitrary order (except that class definitions must
precede use of the classes they define). A `#' (not enclosed in a string)
begins a comment; characters from it to the end of the line are
ignored. An identifier follows the same rules as a C identifier,
except that in most contexts it can be at most 16 characters long. A
string is enclosed in double quotes ("") and generally follows C
syntax. Most strings denote input tokens, and references to ``input
token'' as part of AASL specification syntax should be read as ``string
denoting input token''.
A class definition is an identifier enclosed in angle brackets (<>)
followed by one or more input tokens followed by a semicolon (;). It
gives a name to a set of input tokens. Classes whose names start with
capital letters are user abbreviations; see below. Classes whose names
start with lowercase letters are special classes, used for internal
purposes. The current special classes are:
For example, the class definitions used for AASL itself are:
When AASL error repair is invoked, the parser sometimes needs to
generate input tokens. In the case of a metatoken, the parser knows the
token's name but needs to generate a text for it as well. A text
definition consists of an input token, an arrow (->), and a string
specifying what text should be generated for that token. For example, the
text definitions used for AASL itself are:
The rules of a specification define the syntax that the parser should
accept. The order of rules is not significant, except that the first
rule is considered to be the top level of the specification. The
specification is executed by calling the first rule; when execution of that
rule terminates, execution of the specification terminates. If the
user wishes this to occur only at end of input, he should arrange for
the lexical analyzer to produce an endmarker token (conventionally
``EOF'') at the end of the input, and should write the first rule to
require that token at the end.
Note that an input token may be recognized considerably before it is
accepted, but the parser emits it to the output only on acceptance.
A rule consists of an identifier naming it, a colon (:), a sequence of
items which is the body of the rule, and a semicolon (;). When a rule
is called, it is executed by executing the individual items of the body
in order (as modified by control structures) until either one of them
explicitly terminates execution of the rule or the last item is
executed.
An item which is an input token requires that that token appear in the
input at that point, and accepts it (causing it to be emitted as
output).
An item which is an identifier denotes a call to another rule, which
executes the body of that rule and then returns to the caller. It is
an error to call a nonexistent rule.
An item which is an identifier preceded by `!' causes that identifier
to be emitted as an action token; the identifier has no other
significance.
An item which is `<<' causes execution of the current rule to terminate
immediately, returning to the calling rule.
An item which is `>>' causes the execution of the innermost enclosing
loop (see below) to terminate immediately, with execution continuing
after the end of that loop. The loop must be within the same rule.
An item which is an identifier preceded by `@%&!' causes an internal
semantic action to be executed within the parser; this is normally
needed only for bizarre situations like C's typedef. [xxx should give
details I suppose]
A choice is a sequence of branches enclosed in parentheses (()) and
separated by vertical bars (|). The first of the branches that can be
executed, is, after which execution continues after the end of the
choice.
A loop is a sequence of branches enclosed in braces ({}) and separated
by vertical bars (|). The first of the branches that can be executed,
is, and this is done repeatedly until the loop is terminated by `>>',
after which execution continues after the end of the loop. (A loop can
also be terminated by `<<' terminating execution of the whole rule.)
A branch is just a sequence of items, like a rule body, except that it
must begin with either an input token or a lookahead. If it begins
with an input token, it can be executed only when that token is the
next token in the input, and execution starts with acceptance of that
token.
A lookahead specifies conditions for execution of a branch based on
recognizing but not accepting input token(s). The simplest form is
just an input token enclosed in brackets ([]), in which case execution
of that branch is possible only when that token is the next token in
the input. The brackets can also contain multiple input tokens
separated by commas, in which case the parser looks for any of those
tokens. If a user-abbreviation class name appears, either by itself or
as an element of a comma-separated list, it stands for the list of
tokens given in its definition.
If a lookahead's brackets contain only a `*', this is a default branch,
executable regardless of the state of the input.
As a very special case, a lookahead's brackets can contain two input
tokens separated by slash (/), in which case that branch is executable
only when those two tokens, in sequence, are next in the input.
Warning: this is implemented by a delicate perversion of the error-repair
machinery, and if the first of those tokens is not then accepted, the
parser will die in convulsions. A further restriction is that the same
input token may not appear as the first token of a double lookahead and
as a normal lookahead token in the same choice/loop.
Certain simple choice/loop structures appear frequently, and there are
abbreviations for them:
For example, here are the rules of the AASL specification for AASL,
minus the actions (which add considerable clutter and are
unintelligible without the third pass):
When the input token is not one of those desired, either because the
item being executed is an input token and a different token appears on
the input, or because none of the branches of a choice/loop is
executable, error repair is invoked to try to fix things up. Sometimes it
can actually guess right and fix the error, but more frequently it
merely supplies a legal output so that later passes will not be thrown
into chaos by a minor syntax error.
The general error-repair strategy of an AASL parser is to give the
parser what it wants and then attempt to resynchronize the input with
the parser.
[xxx long discussion of how ``what it wants'' is determined when there
are multiple possibilities]
Resynchronization is performed in three stages. The first stage
attempts to resynchronize within a logical line, and is applied only if
neither the input token nor the desired token is a line terminator (a
member of the ``lineterm'' class). If the input token is trivial (a
member of the ``trivial'' class), it is discarded. Otherwise it is
retained, in hopes that it will be the next token that the parser asks
for.
Either way, an error message is produced, indicating what was desired,
what was seen, and what was handed to the parser. If too many of these
messages have been produced for a single line, the parser gives up,
produces a last despairing message, emits a ``#!aargh'' action token to
alert later passes, and exits. Barring this disaster, parsing then
continues. If the parser at some point is willing to accept the input
token, it is accepted and error repair terminates. If a line
terminator is seen in input, or the parser requests one, before the parser is
willing to accept the input token, the second phase begins.
The second stage of resynchronization attempts to line both input and
parser up on a line terminator. If the desired token is a line
terminator and the input token is not, input is discarded until a line
terminator appears. If the desired token is not a line terminator and the
input token is, the input token is retained and parsing continues until
the parser asks for a line terminator. Either way, the third phase
then begins.
The third stage of resynchronization attempts to reconcile line
terminators. If the desired and input tokens are identical, the input token
is accepted and error repair terminates. If they are not identical and
the input token is trivial (yes, line terminators can be trivial, and
ones like `;' probably should be), the input token is discarded. If
the desired token is the endmarker, then the input token is discarded.
Otherwise, the input token continues to be retained in hopes that it
will eventually be accepted. [xxx this needs more thought] In any
case, the second phase begins again.
awk(1), yacc(1)
``error-repair disaster'' means that the first token of a double
lookahead could not be accepted and error repair was invoked on it.
Written at University of Toronto by Henry Spencer, somewhat in the
spirit of S/SL (see ACM TOPLAS April 1982).
Some of the restrictions on double lookahead are annoying.
Most of the C string escapes are recognized but disregarded, with only
a backslashed double-quote interpreted properly during text generation.
Error repair needs further tuning; it has an annoying tendency to
infinite-loop in certain odd situations (although the messages/line limit
eventually breaks the loop).
Complex choices/loops with many branches can result in very long lines
in the table.
The implementation of AASL was fairly straightforward, with AASL
itself used to describe its own syntax. An AASL specification is
compiled into a table, which is then processed by a table-walking
interpreter. The interpreter expects input to be as tokens, one
per line, much like the output of a traditional scanner. A complete
program using AASL (for example, the AASL table generator) is
normally three passes: the scanner, the parser (tables plus interpreter),
and a semantics pass. The first set of tables was generated by hand
for bootstrapping.
Apart from the minor nuisance of repeated iterations of language
design, the biggest problem of implementing AASL was the question of
semantic actions. Inserting awk semantic routines into the table
interpreter, in the style of yacc, would not be impossible, but it
seemed clumsy and inelegant. Awk's lack of any provision for compile
time initialization of tables strongly suggested reading them in at
run time, rather than taking up space with a huge BEGIN action whose
only purpose was to initialize the tables. This makes insertions into
the interpreter's code awkward.
The problem was solved by a crucial observation: traditional compilers
(etc.) merge a two-step process, first validating a token stream and
inserting semantic action cookies into it, then interpreting the stream
and the cookies to interface to semantics. For example, yacc's grammar
notation can be viewed as inserting fragments of C code into a parsed
output, and then interpreting that output. This approach yields an
extremely natural pass structure for an AASL parser, with the
parser's output stream being (in the absence of syntax errors) a copy
of its input stream with annotations. The following semantic pass
then processes this, momentarily remembering normal tokens and
interpreting annotations as operations on the remembered values.
(The semantic pass is, in fact, a classic pattern+action awk program,
with a pattern and an action for each annotation, and a general "save
the value in a variable" action for normal tokens.)
The one difficulty that arises with this method is when the language
definition involves feedback loops between semantics and parsing,
an obvious example being C's typedef. Dealing with this really does
require some imbedding of semantics into the interpreter, although
with care it need not be much: the in-parser code for recognizing C
typedefs, including the complications introduced by block structure
and nested redeclarations of type names, is about 40 lines of awk. The
in-parser actions are invoked by a special variant of the AASL "emit
semantic annotation" syntax.
A side benefit of top-down parsing is that the context of errors is
known, and it is relatively easy to implement automatic error
recovery. When the interpreter is faced with an input token that
does not appear in the list of possibilities in the parser table,
it gives the parser one of the possibilities anyway, and then uses simple
heuristics to try to adjust the input to resynchronize. The result
is that the parser, and subsequent passes, always see a syntactically-correct
program. (This approach is borrowed from S/SL and its predecessors.)
Although the detailed error-recovery algorithm is still experimental,
and the current one is not entirely satisfactory when a complex AASL
specification does certain things, in general it deals with minor syntax
errors simply and cleanly without any need for complicating the
specification with details of error recovery. Knowing the context of
errors also makes it much easier to generate intelligible error
messages automatically.
The AASL implementation is not large. The scanner is 78 lines of
awk, the parser is 61 lines of AASL (using a fairly low-density
paragraphing style and a good many comments), and the semantics pass
is 290 lines of awk. The table interpreter is 340 lines, about half
of which (and most of the complexity) can be attributed to the
automatic error recovery.
As an experiment with a more ambitious AASL specification, one for
ANSI C was written. This occupies 374 lines excluding comments and
blank lines, and with the exception of the messy details of
C declarators is mostly a fairly straightforward transcription of the
syntax given in the ANSI standard. Generating tables for this takes
about three minutes of CPU time on a Sun 3/180; the tables are about
10K bytes.
The performance of the resulting ANSI C parser is not impressive:
in very round numbers, averaged over a large program, it parses about
one line of C per CPU second. (The scanner, 164 lines of awk, accounts
for a negligible fraction of this.) Some attention to optimization
of both the tables and the interpreter might speed this up somewhat,
but remarkable improvements are unlikely. As things stand in the absence
of better awk implementations or a rewrite of the table interpreter
in C, it's a cute toy, possibly of some pedagogical value, but not a
useful production tool. On the other hand, there does not appear
to be any fundamental reason for the performance shortfall: it's purely
the result of the slow execution of awk programs.
The scanner would be much faster with better regular-expression
matching, because it can use regular expressions to determine whether
a string is a plausible token but must use substr to extract the string
first. Nawk functions would be very handy for modularizing code,
especially the complicated and seldom-invoked error-recovery procedure.
A switch statement modelled on the pattern+action scheme would be
useful in several places.
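For comparison, here is a sketch of how a scanner can test and extract in one step using the `match()` function of new (POSIX) awk, which sets `RSTART` and `RLENGTH`. This is not the AASL scanner; the function name `scan_tokens` and the token classes are assumptions, and the output follows the one-token-per-line convention (metatoken name, tab, text) that the parser expects.

```shell
# scan_tokens: hypothetical scanner; emits one token per line.
scan_tokens() {
  awk '{
    line = $0
    while (line != "") {
      if (match(line, /^[ \t]+/)) {
        n = RLENGTH                                # skip blanks
      } else if (match(line, /^[A-Za-z_][A-Za-z_0-9]*/)) {
        n = RLENGTH
        print "identifier\t" substr(line, 1, n)    # metatoken: name, tab, text
      } else {
        n = 1
        print substr(line, 1, 1)                   # anything else: simple token
      }
      line = substr(line, n + 1)                   # consume what was matched
    }
  }'
}

printf 'expr: term ;\n' | scan_tokens
```

The regular expression both decides whether a token is present and reports its length, so no trial extraction with `substr` is needed before matching.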
Another troublesome issue is that arrays are second-class citizens in
awk (and continue to be so in nawk): there is no array assignment. This
lack leads to endless repetitions of code like:
whenever block structuring or a stack is desired. Nawk's
multi-dimensional arrays supply some syntactic sugar for this but don't
really fix the problem. Not only is this code clumsy, it is woefully
inefficient compared to something like
even if the implementation is very clever. This significantly reduces
the usefulness of arrays as symbol tables and the like, a role for which
they are otherwise very well suited.
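The element-by-element copying complained about above might look roughly like this. It is a sketch, not the paper's own code: it uses new-awk functions and `SUBSEP` keys, and the names `push`, `pop`, `sym`, `saved` and `push_pop_demo` are all made up for the illustration.

```shell
# push_pop_demo: hypothetical example of saving and restoring a symbol table.
push_pop_demo() {
  awk '
    # Save a copy of tab in stack under keys (depth, name).
    function push(tab, stack, depth,    k) {
      for (k in tab) stack[depth, k] = tab[k]
    }
    # Replace the contents of tab with the copy saved at depth.
    function pop(tab, stack, depth,    k, parts) {
      for (k in tab) delete tab[k]
      for (k in stack) {
        split(k, parts, SUBSEP)
        if (parts[1] == depth) { tab[parts[2]] = stack[k]; delete stack[k] }
      }
    }
    BEGIN {
      split("", saved)                 # portable way to create an empty array
      sym["x"] = "int"
      push(sym, saved, 1)              # entering an inner block
      sym["x"] = "char"; sym["y"] = "long"
      pop(sym, saved, 1)               # leaving it again
      state = ("y" in sym) ? "y kept" : "y gone"
      print sym["x"], state
    }'
}

push_pop_demo
```

Every save and restore walks the whole table, which is the cost the author objects to; a true array assignment would let the implementation share or copy the table in one step.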
It would also be of some use if there were some way to initialize
arrays as constant tables, or alternatively a guarantee that the BEGIN
action would be implemented cleverly and would not occupy space after
it had finished executing.
A minor nuisance that surfaces constantly is that getting an error
message out to the standard-error descriptor is painfully clumsy: one
gets to choose between putting error messages out to a temporary file
and having a shell "wrapper" process them later, or piping them into
"cat >&2" (!).
The multi-pass input-driven structure that awk naturally lends itself
to produces very clean and readable code with different phases neatly
separated, but creates substantial difficulties when feedback loops
appear. (In the case of AASL, this perhaps says more about language
design than about awk.)
Henry Spencer.
(Editor's note: One of the benefits of gawk is its ability to quickly code filters that convert artifacts from one form to another. For example, here's a BrainFuck to C translator.)
(From Wikipedia.)
The BrainFuck programming language is an esoteric programming language noted for its extreme minimalism. It is a Turing tarpit, designed to challenge and amuse programmers, and is not suitable for practical use.
Urban Muller created BrainFuck in 1993 with the intention of designing a language which could be implemented with the smallest possible compiler, inspired by the 1024-byte compiler for the FALSE programming language. Several BrainFuck compilers have been made smaller than 200 bytes. The classic distribution is Muller's version 2, containing a compiler for the Amiga, an interpreter, example programs, and a readme document.
The language consists of eight commands:
A Brainfuck program is a sequence of these commands, possibly interspersed with other characters (which are ignored). The commands are executed sequentially, except as noted below; an instruction pointer begins at the first command, and each command it points to is executed, after which it normally moves forward to the next command. The program terminates when the instruction pointer moves past the last command.
I wrote a BrainFuck to C translator in awk. It only takes a few minutes and I noticed that no awk version of this existed.
I haven't run it through its paces (I just wrote a few small BrainFuck programs to test it out) so if you find a bug, please let me know.
Steve Johnson http://saladwithsteve.com/
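For readers who want to see the shape of such a translator, here is a minimal sketch. It is not Steve Johnson's original code. It uses the conventional one-to-one mapping from the eight BrainFuck commands to C statements. The function name `bf2c` and the 30000-cell tape size are assumptions, and all other input characters are ignored.

```shell
# bf2c: hypothetical translator; reads BrainFuck on stdin, writes C on stdout.
bf2c() {
  awk '
    BEGIN {
      c[">"] = "++ptr;";           c["<"] = "--ptr;"
      c["+"] = "++*ptr;";          c["-"] = "--*ptr;"
      c["."] = "putchar(*ptr);";   c[","] = "*ptr = getchar();"
      c["["] = "while (*ptr) {";   c["]"] = "}"
      print "#include <stdio.h>"
      print "static char tape[30000];"   # tape size is an assumption
      print "int main(void) {"
      print "  char *ptr = tape;"
    }
    {
      for (i = 1; i <= length($0); i++) {
        ch = substr($0, i, 1)
        if (ch in c) print "  " c[ch]    # non-command characters are skipped
      }
    }
    END { print "  return 0;"; print "}" }
  '
}

printf '+[-].\n' | bf2c
```

The sketch does not check that brackets balance; an unbalanced program simply produces C that fails to compile.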
These pages focus on object-oriented tools in Awk.
These pages focus on domain-specific languages
(a.k.a. "little languages") written in Awk.
These little languages can range from the simple to the quite intricate. For example,
LAWKER contains code for
Interestingly, without comments, the LISP interpreter is only three times longer than the HTML markup language.
This comments either on the power of Awk, the regularity of LISP's core semantics, or both.
gawk -f graph.awk graphFile
A processor for a little language, specialized for graph-drawing.
For example, here is an input specification:
It produces the following output
Set frame dimensions: height and width; offset for x and y axes.
Skip comments
Simple tags
Handling numerics.
Line functions, defined by a slope "m" and a y-intercept "b".
Final case: input error.
Draw the graph
Expand the "x" and "y" boundaries to include all points.
Draw the frame around the graph.
Create tick marks for both axes.
Center labels under x-axis.
Create data points.
Print graph from array.
Scale x-values, y-values.
Put one character into array.
Put string "s" into array.
This code comes from the original Awk book by Alfred Aho, Peter Weinberger & Brian Kernighan and contains some small
modifications by Tim Menzies.
This program will turn SDML into simple ASCII text UML sequence
diagrams. SDML is an extremely simplistic UML Sequence Diagram
Markup Language. SDML is specified as:
Given this input:
this code generates:
Martin Fick
The -v profiling=1 option turns call-count profiling on.
If you want to use it interactively, be sure to include '-' (for the standard
input) among the source files. For example:
This program arose out of one-upmanship. At my previous job I had to
use MapBasic, an interpreter so astoundingly slow (around 100 times
slower than GWBASIC) that one must wonder if it itself is implemented
in an interpreted language. I still wonder, but it clearly could be:
a bare-bones Lisp in awk, hacked up in a few hours, ran substantially
faster. Since then I've added features and polish, in the hope of
taking over the burgeoning market for stately language
implementations.
This version tries to deal with as many of the essential issues in
interpreter implementation as is reasonable in awk (though most would
call this program utterly unreasonable from start to finish, perhaps...).
Awk's impoverished control structures put error recovery and tail-call
optimization out of reach, in that I can't see a non-painful way to code
them. The scope of variables is dynamic because that was easier to
implement efficiently. Subject to all those constraints, the language
is as Schemely as I could make it: it has a single namespace with
uniform evaluation of expressions in the function and argument positions,
and the Scheme names for primitives and special forms.
The rest of this file is a reference manual. My favorite tutorial would be
The Little LISPer (see section 5, References); don't let the cute name
and the cartoons turn you off, because it's a really excellent book with
some mind-stretching material towards the end. All of its code will work
with awklisp, except for the last two chapters. (You'd be better off
learning with a serious Lisp implementation, of course.)
For more details on the implementation,
see the Implementation notes (below).
Code:
Command line:
Output:
Here are the standard ELIZA dialogue patterns:
Command line:
Interaction:
Lisp evaluates expressions, which can be simple (atoms) or compound (lists).
An atom is a string of characters, which can be letters, digits, and most
punctuation; the characters may -not- include spaces, quotes, parentheses,
brackets, '.', '#', or ';' (the comment character). In this Lisp, case is
significant ( X is different from x ).
A list is a '(', followed by zero or more objects (each of which is an atom
or a list), followed by a ')'.
The special object nil is both an atom and the empty list. That is,
nil = (). A non-nil list is called a -pair-, because it is represented by a
pair of pointers, one to the first element of the list (its -car-), and one to
the rest of the list (its -cdr-). For example, the car of ((a list) of stuff)
is (a list), and the cdr is (of stuff). It's also possible to have a pair
whose cdr is not a list; the pair with car A and cdr B is printed as (A . B).
That's the syntax of programs and data. Now let's consider their meaning. You
can use Lisp like a calculator: type in an expression, and Lisp prints its
value. If you type 25, it prints 25. If you type (+ 2 2), it prints 4. In
general, Lisp evaluates a particular expression in a particular environment
(set of variable bindings) by following this algorithm:
If the procedure's body has more than one expression -- e.g.,
(lambda () (write 'Hello) (write 'world!)) -- evaluate them each in turn, and
return the value of the last one.
We still need the rules for special forms. They are:
It's possible to define new special forms using the macro facility provided in
the startup file. The macros defined there are:
Since the code should be self-explanatory to anyone knowledgeable
about Lisp implementation, these notes assume you know Lisp but not
interpreters. I haven't got around to writing up a complete
discussion of everything, though.
The code for an interpreter can be pretty low on redundancy -- this is
natural because the whole reason for implementing a new language is to
avoid having to code a particular class of programs in a redundant
style in the old language. We implement what that class of programs
has in common just once, then use it many times. Thus an interpreter
has a different style of code, perhaps denser, than a typical
application program.
Conceptually, a Lisp datum is a tagged pointer, with the tag giving
the datatype and the pointer locating the data. We follow the common
practice of encoding the tag into the two lowest-order bits of the
pointer. This is especially easy in awk, since arrays with
non-consecutive indices are just as efficient as dense ones (so we can
use the tagged pointer directly as an index, without having to mask
out the tag bits). (But, by the way, mawk accesses negative indices
much more slowly than positive ones, as I found out when trying a
different encoding.)
This Lisp provides three datatypes: integers, lists, and symbols. (A
modern Lisp provides many more.)
For an integer, the tag bits are zero and the pointer bits are simply
the numeric value; thus, N is represented by N*4. This choice of the
tag value has two advantages. First, we can add and subtract without
fiddling with the tags. Second, negative numbers fit right in.
(Consider what would happen if N were represented by 1+N*4 instead,
and we tried to extract the tag as N%4, where N may be either positive
or negative. Because of this problem and the above-mentioned
inefficiency of negative indices, all other datatypes are represented
by positive numbers.)
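A minimal sketch of the integer encoding described above, written in plain awk and wrapped in a shell function. The name tag_demo and the sample values are illustrative assumptions, not code from awklisp. It shows that two tagged integers can be added without removing the tag, and why a hypothetical 1+N*4 encoding would misbehave for negative N when the tag is extracted with %.

```sh
# Hypothetical demo of the N*4 integer encoding (not awklisp's own code).
tag_demo() {
  awk 'BEGIN {
    a = 5 * 4; b = -7 * 4        # tagged forms of 5 and -7 (tag bits are zero)
    sum = a + b                  # add the tagged values directly
    print sum / 4                # untag the result: 5 + -7
    print (sum % 4 == 0)         # 1 if the tag of the sum is still zero
    n = -1; alt = 1 + n * 4      # the rejected 1+N*4 encoding, for N = -1
    print alt % 4                # the tag we would extract from it
  }'
}
tag_demo
```

Awk's % takes the sign of the dividend, so the last line prints -3 rather than the intended tag of 1, which is the problem the parenthetical above alludes to.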
The following is from an email discussion; it doesn't develop
everything from first principles but is included here in the hope
it will be helpful.
Hi. I just took a look at awklisp, and remembered that there's more
to your question about why we need a stack -- it's a good question.
The real reason is because a stack is accessible to the garbage
collector.
We could have had apply() evaluate the arguments itself, and stash
the results into variables like arg0 and arg1 -- then the case for
ADD would look like
The obvious problem with that approach is how to handle calls to
user-defined procedures, which could have any number of arguments.
Say we're evaluating ((lambda (x) (+ x 1)) 42). (lambda (x) (+ x 1))
is the procedure, and 42 is the argument.
A (wrong) solution could be to evaluate each argument in turn, and
bind the corresponding parameter name (like x in this case) to the
resulting value (while saving the old value to be restored after we
return from the procedure). This is wrong because we must not
change the variable bindings until we actually enter the procedure --
for example, with that algorithm ((lambda (x y) y) 1 x) would return
1, when it should return whatever the value of x is in the enclosing
environment. (The eval_rands()-type sequence would be: eval the 1,
bind x to 1, eval the x -- yielding 1 which is *wrong* -- and bind
y to that, then eval the body of the lambda.)
Okay, that's easily fixed -- evaluate all the operands and stash them
away somewhere until you're done, and *then* do the bindings. So
the question is where to stash them. How about a global array?
Like
followed by the equivalent of extend_env(). This will not do, because
the global array will get clobbered in recursive calls to eval().
Consider (+ 2 (* 3 4)) -- first we evaluate the arguments to the +,
like this: global_temp[0] gets 2, and then global_temp[1] gets the
eval of (* 3 4). But in evaluating (* 3 4), global_temp[0] gets set
to 3 and global_temp[1] to 4 -- so the original assignment of 2 to
global_temp[0] is clobbered before we get a chance to use it. By
using a stack[] instead of a global_temp[], we finesse this problem.
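The point can be sketched with a toy evaluator for prefix arithmetic. This is an illustration under stated assumptions, not awklisp's code: the names lisp_calc, tok, pos and sp are hypothetical, and only + and * on numbers are handled. Each call to eval() pushes its operand values above the caller's entries on one global stack[], so the nested evaluation of (* 3 4) cannot clobber the 2 already saved for the +.

```sh
# Hypothetical toy evaluator: lisp_calc "(+ 2 (* 3 4))"
lisp_calc() {
  awk -v expr="$1" '
    # Operands stay on the global stack[] until the whole argument
    # list has been evaluated; base marks where this call started.
    function eval(   op, base, v) {
      if (tok[pos] != "(") return tok[pos++] + 0    # a plain number
      pos++; op = tok[pos++]                        # skip "(", read operator
      base = sp
      while (pos <= n && tok[pos] != ")") { v = eval(); stack[++sp] = v }
      pos++                                         # skip ")"
      v = (op == "*") ? 1 : 0
      while (sp > base) {                           # pop and combine operands
        if (op == "*") v *= stack[sp]; else v += stack[sp]
        sp--
      }
      return v
    }
    BEGIN {
      gsub(/[()]/, " & ", expr)                     # isolate the parentheses
      n = split(expr, tok, " ")
      pos = 1; sp = 0
      print eval()
    }'
}
```

Calling lisp_calc "(+ 2 (* 3 4))" evaluates the inner product while the 2 is still safely on the stack.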
You may object that we can solve that by just making the global array
local, and that's true; lots of small local arrays may or may not be
more efficient than one big global stack, in awk -- we'd have to try
it out to see. But the real problem I alluded to at the start of this
message is this: the garbage collector has to be able to find all the
live references to the car[] and cdr[] arrays. If some of those
references are hidden away in local variables of recursive procedures,
we're stuck. With the global stack, they're all right there for the
gc().
(In C we could use the local-arrays approach by threading a chain of
pointers from each one to the next; but awk doesn't have pointers.)
(You may wonder how the code gets away with having a number of local
variables holding lisp values, then -- the answer is that in every
such case we can be sure the garbage collector can find the values
in question from some other source. That's what this comment is
about:
In some cases where the values would not otherwise be guaranteed to
be available to the gc, we call protect().)
Oh, there's another reason why apply() doesn't evaluate the arguments
itself: it's called by do_apply(), which handles lisp calls like
(apply car '((x))) -- where we *don't* want the x to get evaluated
by apply().
Roger Rohrbach wrote a Lisp interpreter, in old awk (which has no
procedures!), called walk . It can't do as much as this Lisp, but it
certainly has greater hack value. Cooler name, too. It's available at
http://www-2.cs.cmu.edu/afs/cs/project/ai-repository/ai/lang/lisp/impl/awk/0.html
Eval doesn't check the syntax of expressions. This is a probably-misguided
attempt to bump up the speed a bit, that also simplifies some of the code.
The macroexpander in the startup file would be the best place to add syntax-
checking.
Darius Bacon dairus@wry.me
Copyright (c) 1994, 2001 by Darius Bacon.
Permission is granted to anyone to use this software for any
purpose on any computer system, and to redistribute it freely,
subject to the following restrictions:
Download from
LAWKER.
"aaa" (the Amazing Awk Assembler) is a primitive assembler written entirely
in awk and sed. It was done for fun, to establish whether it was possible.
It is; it works. It's quite slow, the input syntax is eccentric and rather
restricted, and error-checking is virtually nonexistent, but it does work.
Furthermore it's very easy to adapt to a new machine, provided the machine
falls into the generic "8-bit-micro" category. It is supplied "as is",
with no guarantees of any kind. I can't be bothered to do any more work on
it right now, but even in its imperfect state it may be useful to someone.
aaa is the mainline shell file.
aux is a subdirectory with machine-independent stuff. Anon, 6801, and
6809 are subdirectories with machine-dependent stuff, choice specified
by a -m option (default is "anon"). Actually, even the stuff that is
supposedly machine-independent does have some machine-dependent
assumptions; notably, it knows that bytes are 8 bits (not serious) and
that the byte is the basic unit of instructions (more serious). These
would have to change for the 68000 (going to 16-bit "bytes" might be
sufficient) and maybe for the 32016 (harder).
aaa thinks that the machine subdirectories and the aux subdirectory are
in the current directory, which is almost certainly wrong.
abst is an abstract for a paper. "card", in each machine directory,
is a summary card for the slightly-eccentric input language. There is no
real manual at present; sorry.
try.s is a sample piece of 6809 input; it is semantic trash, purely for
test purposes. The assembler produces try.a, try.defs, and try.x as
outputs from "aaa try.s". try.a is an internal file that looks
somewhat like an assembly listing. try.defs is another internal file
that looks somewhat like a symbol table. These files are preserved
because of possible usefulness; tmp[123] are non-preserved temporaries.
try.x is the Intel-hex output. try.x.good is identical to try.x and is a
saved copy for regression testing of new work.
01pgm.s is a self-programming program for a 68701, based on the one
in the Motorola ap note. 01pgm.x.good is another regression-test file.
If your C library (used by awk) has broken "%02x" so it no longer means
"two digits of hex, *zero-filled*" (as some SysV libraries have), you
will have to fall back from aux/hex to aux/hex.argh, which does it the
hard way. Oh yes, you'll note that aaa feeds settings into awk on the
command line; don't assume your awk won't do this until you try it.
Henry Spencer
This is adoC, version 1.1. Generates Latex files from
source code comments.
Download from
LAWKER or
http://www.sect.mce.hw.ac.uk
Usage: adoc [options] files_to_parse
Options:
adoC is a source code documenting system written in awk and shell
script. It produces documentation in LaTeX format which resembles
the Unix man pages. The documentation is generated from comment
sections in the source code. The comment sections are marked by
two special character sequences and internally divided into sub-
parts by keywords. The system can be used with almost any kind of
programming language.
The idea is based on ROBODoc
http://www.xs4all.nl/~rfsber/Robo/robodoc.html
The system requires a working gawk and LaTeX installation. For the
LaTeX document the "refart.sty" style should be installed.
adoC is documented by
itself .
For the detailed documentation about the system and its implementation
execute the following:
GPL v2.0. Share and enjoy.
Jesus Galan (yiyus)
(yiyu DOT jgl AT gmail DOT com)
has updated his markdown system.
His new
md2html.awk
code adds several new features and fixes numerous bugs.
For more on this new code, see his history of a rewrite.
Download from
LAWKER.
awk -f markdown.awk file.txt > file.html
Download from
LAWKER.
(Note: this code was originally called txt2html.awk by its author but that caused a name
clash inside LAWKER. Hence, I've taken the liberty of renaming it. --Timm)
The following code implements a subset of John Gruber's Markdown language: a widely-used, ultra light-weight markup language for html.
The number of leading "#" characters encodes the heading level.
Note: the beginning and end of a list are automatically inferred, maybe not always correctly.
Denoted by a number at start-of-line.
The following code demonstrates an "exception-style" of Awk programming. Note
how all the processing relating to each mark-up tag is localized (the exception being the carrying
around of prior text and environments). The modularity of the following code should make it
easily hackable.
(Plus h3 with underscores.) Does not implement the full Markdown syntax. Jesus Galan (yiyus) 2006
This is an updated revision (#21), released August 1, 2009.
In this new version:
Download awkpp21.zip from
LAWKER
Awk++ is a preprocessor; that is, it reads in a program written in
the awk++ language and outputs a new program.
However, it's
different than awka. The output from the awk++ preprocessor is awk code, not C
or an executable program. So, some version of AWK, such as awk or gawk, has
to be used to run the preprocessed program. awka can be used, in a second step,
to turn the preprocessed awk++ program into an executable, if desired.
The awk++ language provides object oriented programming for AWK that includes:
Awk++ adds new keywords to standard Awk:
To define a class (similar to C++ but no public/private):
To define a class with inheritance:
To add local/private variables (persistent variables; syntax is unique to awk++):
To help programmers who are used to other OO languages, "attribute",
"property", "element", and "variable", along with their 4-letter abbreviations,
are interchangeable.
Note: these persistent variables cannot be accessed directly. The programmer
must define method(s) to return them, if their values are to be made available
to code that's outside the class.
To add methods
To create an object
To call an object method
The dot isn't used for concatenation in awk/gawk, so it's a natural choice
for the separator between the object and method.
To reclaim the memory used by an object, use the delete method, i.e.:
but don't define delete() in your classes. awk++ recognizes delete() as a special
method and will take care of deleting the object. Deleting objects is
only necessary, though, if they hold a lot of
data. Overhead for objects themselves is insignificant.
OO syntax goals:
The OO syntax is based partly on C++, partly on Javascript, partly on Ruby and
partly on the book "The Object-Oriented Thought Process". It isn't lifted in
toto from one language because other languages provide features that gawk can't
accomplish or have syntax that is hard to parse.
In awk++, if a method is called that isn't in the object's class and there
are inherited classes (superclasses) specified, the inherited classes are called
in left to right order until one of them returns a value. That value
becomes the result of the method call.
This is the way awk++ resolves the
diamond problem. As a programmer, you control the sequence in which
superclasses are called by the left to
right order of the list of inherited classes in the class definition.
There are two important things to note.
Calls to undefined methods do nothing and return nothing, silently.
The command to preprocess an awk++ program looks like this:
There is a bug in the standard AWK distributions that affects the preprocessor.
Additionally, the preprocessor uses the optional third (array) argument of the match() function, which is a GAWK extension.
So, it's best to use GAWK to run the preprocessor.
On the other hand, the AWK code created by translating awk++ is intended
to work with all versions of AWK. If you find otherwise, please notify the
developer(s).
Copyright (c) 2008, 2009
Jim Hart, jhart@mail.avcnet.org
All rights reserved. The awk++ code is licensed under the GNU General Public License (GPL), any version.
awk++ documentation, including this page, may be copied only in unmodified
form, subject to fair use guidelines.
ooc is an awk program which reads
class descriptions and performs the
routine coding tasks necessary to do
object-oriented coding in ANSI C.
The tool is exceptionally well documented in
Object oriented programming with ANSI-C.
Download a 2002 copy of this code from
LAWKER.
Or go to the
author's web site.
ooc is a technique to do object-oriented programming (classes,
methods, dynamic linkage, simple inheritance, polymorphisms,
persistent objects, method existence testing, message forwarding,
exception handling, etc.) using ANSI-C.
ooc is a preprocessor to simplify the coding task by converting
class descriptions and method implementations into ANSI-C as required
by the technique. You implement the algorithms inside the methods
and the ooc preprocessor produces the boilerplate.
ooc consists of a shell script driving a modular awk script (with
provisions for debugging), a set of reports -- code generation
templates -- interpreted by the script, and the source of a root
class to provide basic functionality. Everything is designed to
be changed if desired. There are manual pages, lots of examples,
among them a calculator based on curses and X11, and you can ask
me about the book.
ooc as a technique requires an ANSI-C system -- classic C would
necessitate substantial changes. The preprocessor needs a healthy
Bourne-Shell and "new" awk as described in Aho, Weinberger, and
Kernighan's book.
ooc was developed primarily to teach about object-oriented programming
without having to learn a new language. If you see how it is done
in a familiar setting, it is much easier to grasp the concepts and
to know what miracles to expect from the technique and what not.
Conceivably, the preprocessor can be used for production programming
but this was not the original intent. Being able to roll your own
object-oriented coding techniques has its possibilities, however...
Most sources should be viewed with tab stops set at 4 characters.
The original system ran on NeXTSTEP 3.2 and older, ESIX (System
V) 4.0.4, and Linux 0.99.pl4-49. This rerelease was tested on MacOS X
version 10.1.2 and Solaris version 5.8. You need to review paths in the
script 'ooc/ooc' before running anything. Make sure the first line
of this script points to a Bourne-style shell. Also make sure that
the first line of '09/munch' points to a (new) awk.
The rereleased 'ooc' awk-programs have been tested with GNU awk versions
3.0.1 and 3.0.3. Previous versions did not support AWKPATH properly
(but this is not essential).
The makefiles could be smarter but they are naive enough for all
systems. This is a heterogeneous system -- set the environment
variable $OSTYPE to an architecture-specific name. 'make' in the current
directory will create everything by calling 'make' in the various
subdirectories. Each 'makefile' includes 'make/Makefile.$OSTYPE', review
your 'make/Makefile.$OSTYPE' before you start.
The following make calls are supported throughout:
Make dependencies can be built with the -MM option of the GNU C
compiler. They are stored in a file 'depend' in each subdirectory.
They should apply to all systems. 'makefile.$OSTYPE' may include a target
'depend' to recreate 'depend' -- check 'makefile.darwin1.4' for an
example.
The following is a walk through the file hierarchy in the order of
the book:
Copyright (c) 1993
While you may use this software package, neither I nor my employers can
be made responsible for whatever problems you might cause or encounter.
While you may give away this package and/or software derived with
it, you should not charge for it, you should not claim that ooc is
your work, and I have published my own book about ooc before you did.
The same restrictions apply to whoever might get this package from
you.
Programmers often take awk "as is", never thinking to use it as a lab in which
we can explore other language extensions. This is, of course, only one way to treat
the Awk code base.
An alternate approach is to treat the Awk code base as a reusable library
of parsers, regular expression engines, etc., and to make modifications
to the language. This second approach was taken by David Ladd and J. Christopher
Ramming in their A* system.
They write:
A* is an experimental language designed to facilitate
the creation of language-processing tools. It is analogous either to
an interpreted yacc with Awk as its statement language, or to a
version of Awk which processes programs rather than records.
A* offers two principal advantages over the combination of lex,
yacc, and C:
Reference:
A*: a language for implementing language processors
Ladd, D.A.; Ramming, J.C.;
Software Engineering, IEEE Transactions on
Volume 21, Issue 11, Nov. 1995 Page(s):894 - 901
These pages are focused on Functional Gawk (a.k.a.
"Funky").
Funky is enabled by a new feature added to Gawk 3.2: indirect functions.
For example:
At the time of this writing, Gawk 3.2 is pre-release
and indirect functions can be accessed using the
gawk-devel CVS tree:
Indirect functions enable a new view on library management in Gawk
and, perhaps, a way to emulate functional abstraction in languages
like Lisp.
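As a hedged illustration of the idea, here is a small map() written with gawk's indirect call syntax, in which @f(...) calls the function whose name is stored in the string f. It assumes a gawk that supports indirect calls; the names funky_demo, map, twice and square are made up for this sketch and do not come from the Funky pages.

```sh
# Hypothetical example; needs a gawk with indirect function calls.
funky_demo() {
  gawk '
    function twice(x)  { return 2 * x }
    function square(x) { return x * x }
    # Apply the function named by f to every element of arr[1..n].
    function map(f, arr, n,   i) {
      for (i = 1; i <= n; i++) arr[i] = @f(arr[i])
    }
    BEGIN {
      n = split("1 2 3", a, " ")
      map("square", a, n)          # 1 4 9
      map("twice", a, n)           # 2 8 18
      for (i = 1; i <= n; i++) printf "%s%s", a[i], (i < n ? " " : "\n")
    }'
}
```

Because the function is chosen by a string at run time, a library can accept behaviour as an argument, which is the first step toward the Lisp-like functional style mentioned above.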
So, anyone care to try, say:
These pages focus on Sed-like stream editors, written in Awk.
I was lurking around on twitter during my lunch hour (yes, even freelancers need a lunch hour), and @bitprophet tweeted thusly:
Followed by this:
Interested to see if anyone can shorten my previous tweet's command line,
outside of using 'cut' instead of the awk bit.)
I happen to love puzzles like this, and my lunch was almost immediately followed by a long, boring conference call.
@bitprophet's pipeline above is translated by my brain into the English:
Find non-commented lines, grab the second space-delimited field,
then filter out the ones that start with "*" or "|", then delete any blank lines, and strip any leading "-" from the result.
My brain usually attempts to think of the English version of the solution *first*, and then try to emulate that in the code/command I write. So, the issue here is we want to find file paths (and apparently sockets are ok, too, as "@" is a valid leading character in the initial definition of the problem). If it's a file path, we want to see it in a form that would be suitable for passing it to something like "ls -l", which means leading symbols like "-" and "|" should be omitted.
In a syslog.conf file, the main meat is the area where you specify the warning levels, and the file you want messages at that warning level sent to (this is a simplistic explanation, but good enough to understand the solution I came up with). The file is also littered with comments. Here's the file on my Mac:
So, in English, my brain parses the problem like this:
Skip blank lines, commented lines, and lines where the file name is "*", and give me everything else, but strip off characters "-" and
"|" before sending it to the screen.
And here's my awk one-liner for doing that:
Knowing a few key things about awk will help parse the above:
Awk automatically breaks up each line of input into fields. If you don't tell it what to use as a delimiter, it'll just use any number of spaces as the delimiter. If you have a CSV file, you'd likely use "awk -F," to tell awk to use a comma. For /etc/passwd, use "awk -F:". From there, you can reference the first field as $1, the second as $2, etc. $0 represents the whole line. There are more, but that's enough for this example.
Though I think most sysadmins can get a lot done with simple usage like "awk -F: '{print $2}'", sometimes more power is needed, and awk delivers. It has a built-in regular expression engine, and enables you to check a field (or the whole line: $0, like I do above) against a regex as a precondition for performing some action with the line or a field on that line. So, in the above awk command, I check to see if the line is either empty, or a comment. I then use a logical AND to check if field 2 starts with "*". If the current line is a match for any of these rules it is skipped.
Another nice thing about awk is that it actually is a Turing-complete programming language. After I check the lines of input against the rules mentioned above, I immediately know that I definitely want at least some portion of $2 in the remaining lines. What I *don't* want are preceding characters like "-" or "|". I need to strip them from the file name. I use awk's built in "sub()" function to handle that, and with that out of the way I call "print" to send the result to the screen.
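The steps described above can be sketched as follows. This is a reconstruction from the English description, not necessarily the author's original one-liner, and the function name syslog_files is an assumption made for the example.

```sh
# Hypothetical reconstruction: list the log targets named in a syslog.conf.
syslog_files() {
  awk '
    # Skip comments, blank or one-field lines, and "*" targets.
    !/^[ \t]*#/ && NF >= 2 && $2 !~ /^\*/ {
      sub(/^[-|]/, "", $2)       # strip a leading "-" or "|"
      print $2
    }' "$@"
}
```

Run it as syslog_files /etc/syslog.conf, or pipe a configuration into it on standard input.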
Writing in comp.lang.awk
Ed Morton ports numerous complex sed expressions to Awk:
A comp.lang.awk author asks the question:
I have a file that has a series of lists
and I want to make it look like
IMHO the clearest sed solution given was:
while the awk one was:
As I've said repeatedly, sed is an excellent tool for simple
substitutions on a single line. For anything else you should use awk,
perl, etc.
Having said that, let's take a look at the awk equivalents for the
posted sed examples below that are not simple substitutions on a single
line so people can judge for themselves (i.e. quietly - this is not a
contest and not a religious war!) which code is clearer, more
consistent, and more obvious. When reading this, just imagine yourself
having to figure out what the given script does in order to debug or
enhance it or write your own similar one later.
Note that in awk as in shell there are many ways to solve a problem so
I'm trying to stick to the solutions that I think would be the most
useful to a beginner since that's who'd be reading an examples page like
this, and without using any GNU awk extensions. Also note I didn't test
any of this but it's all pretty basic stuff so it should mostly be right.
For those who know absolutely nothing about awk, I think all you need to
know to understand the scripts below is that, like sed, it loops through
input files evaluating conditions against the current input record (a
line by default) and executing the actions you specify (printing the
current input record if none specified) if those conditions are true,
and it has the following pre-defined symbols:
Oh, and setting RS to the NULL string (-v RS='') tells awk to read
paragraphs instead of lines as individual records, and setting FS to the
NULL string (-v FS='') tells awk to treat each individual character as a
field.
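A small sketch of paragraph mode, assuming a hypothetical helper named para_grep. With RS set to the empty string each blank-line-separated paragraph is one record, so a single pattern test selects whole paragraphs; ORS is set to two newlines so the paragraphs stay separated on output.

```sh
# Hypothetical helper: print every paragraph that matches a regex.
para_grep() {
  awk -v RS='' -v ORS='\n\n' -v pat="$1" '$0 ~ pat'
}
```

For example, para_grep AAA < file prints each paragraph containing AAA. Per-character splitting with an empty FS works in gawk and mawk, but as far as I know it is not guaranteed by POSIX, so treat it as an extension.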
For more info on awk, see http://www.awk.info.
Double space a file:
Sed:
Awk:
Double space a file which already has blank lines in it. Output file
should contain no more than one blank line between lines of text.
Sed:
Awk:
Triple space a file
Sed: Awk:
Undo double-spacing (assumes even-numbered lines are always blank):
Sed: Awk:
Insert a blank line above every line which matches "regex":
Sed: Awk:
Insert a blank line below every line which matches "regex":
Sed: Awk:
Insert a blank line above and below every line which matches "regex":
Sed: Awk:
Number each line of a file (simple left alignment). Using a tab (see
note on '\t' at end of file) instead of space will preserve margins:
Sed: Awk:
Number each line of a file (number on left, right-aligned):
Sed: Awk:
Number each line of file, but only print numbers if line is not blank:
Sed: Awk:
Count lines (emulates "wc -l")
Sed: Awk:
Align all text flush right on a 79-column width:
Sed: Awk:
Center all text in the middle of 79-column width. In method 1,
spaces at the beginning of the line are significant, and trailing
spaces are appended at the end of the line. In method 2, spaces at
the beginning of the line are discarded in centering the line, and
no trailing spaces appear at the end of lines.
Sed: Awk:
Reverse order of lines (emulates "tac")
Bug/feature in sed v1.5 causes blank lines to be deleted
Sed: Awk:
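One possible awk version, offered as a sketch rather than as the solution originally posted: buffer every line in an array and print the array backwards in the END block. The name awk_tac is an assumption.

```sh
# Hypothetical helper: print lines in reverse order.
awk_tac() {
  awk '{ line[NR] = $0 }
       END { for (i = NR; i >= 1; i--) print line[i] }' "$@"
}
```

The whole input is held in memory, so this suits files that fit comfortably in RAM.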
Reverse each character on the line (emulates "rev")
Sed: Awk:
Join pairs of lines side-by-side (like "paste")
Sed: Awk:
If a line ends with a backslash, append the next line to it
Sed: Awk:
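A sketch of one way to do this in awk (not necessarily the posted solution): when a line ends in a backslash, remove the backslash and print the line without a newline, so the next line lands on the same output line. The name join_continued is an assumption.

```sh
# Hypothetical helper: join lines that end with a backslash.
join_continued() {
  awk '/\\$/ { sub(/\\$/, ""); printf "%s", $0; next }
       { print }' "$@"
}
```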
If a line begins with an equal sign, append it to the previous line
and replace the "=" with a single space
Sed: Awk:
Add a blank line every 5 lines (after lines 5, 10, 15, 20, etc.)
Sed: Awk:
Print first 10 lines of file (emulates behavior of "head")
Sed: Awk:
Print first line of file (emulates "head -1")
Sed: Awk:
Print the last 10 lines of a file (emulates "tail")
Sed: Awk:
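A hedged sketch of a "tail" in awk that keeps only the last n lines in a ring buffer indexed by NR % n. The name awk_tail and its optional count argument are assumptions for illustration; n is expected to be a positive integer.

```sh
# Hypothetical helper: print the last n lines of standard input (default 10).
awk_tail() {
  awk -v n="${1:-10}" '
    { buf[NR % n] = $0 }                       # overwrite the oldest slot
    END {
      for (i = NR - n + 1; i <= NR; i++)
        if (i >= 1) print buf[i % n]           # skip slots before line 1
    }'
}
```

Only n lines are ever stored, so memory use does not grow with the size of the input.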
Print the last 2 lines of a file (emulates "tail -2")
Sed: Awk:
Print the last line of a file (emulates "tail -1")
Sed: Awk:
Print the next-to-the-last line of a file
Sed: Awk:
Print only lines which match regular expression (emulates "grep")
Sed: Awk:
Print only lines which do NOT match regexp (emulates "grep -v")
Sed: Awk:
Print the line immediately before a regexp, but not the line
containing the regexp
Sed: Awk:
Print the line immediately after a regexp, but not the line
containing the regexp
Sed: Awk:
Print 1 line of context before and after regexp, with line number
indicating where the regexp occurred (similar to "grep -A1 -B1")
Sed: Awk:
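A sketch of one way to get this effect in awk, under the assumption that printing the line number on its own line (as sed's "=" does) is acceptable. The name grep_context and the variables prev and after are hypothetical. The previous line is remembered in prev, and a flag requests that the line following a match be printed.

```sh
# Hypothetical helper: one line of context around each match, with line number.
grep_context() {
  awk -v pat="$1" '
    $0 ~ pat {
      print NR                             # line number of the match
      if (NR > 1 && !after) print prev     # line before, unless already shown
      print                                # the matching line itself
      after = 1; prev = $0; next
    }
    after { print; after = 0 }             # the line after a match
    { prev = $0 }'
}
```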
Grep for AAA and BBB and CCC (in any order)
Sed: Awk:
Grep for AAA and BBB and CCC (in that order)
Sed: Awk:
Grep for AAA or BBB or CCC (emulates "egrep")
Sed: Awk:
Print paragraph if it contains AAA (blank lines separate paragraphs).
Sed v1.5 must insert a 'G;' after 'x;' in the next 3 scripts below
Sed: Awk:
Print paragraph if it contains AAA and BBB and CCC (in any order)
Sed: Awk:
Print paragraph if it contains AAA or BBB or CCC
Sed: Awk:
Print only lines of 65 characters or longer
Sed: Awk:
Print only lines of less than 65 characters
Sed: Awk:
Print section of file from regular expression to end of file
Sed: Awk:
Print section of file based on line numbers (lines 8-12, inclusive)
Sed: Awk:
Print line number 52
Sed: Awk:
Beginning at line 3, print every 7th line
Sed: Awk:
print section of file between two regular expressions (inclusive)
Sed: Awk:
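Awk has a range pattern for exactly this job. The sketch below wraps it in a hypothetical helper, awk_range, that takes the two regular expressions as arguments; it is an illustration, not necessarily the solution originally posted.

```sh
# Hypothetical helper: print from a line matching $1 through a line matching $2.
awk_range() {
  awk -v from="$1" -v to="$2" '$0 ~ from, $0 ~ to'
}
```

For example, awk_range '^BEGIN$' '^END$' < file prints each BEGIN...END block, boundaries included.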
Print all lines of FileID up to 1st line containing
Sed: Awk:
Print all lines of FileID from 1st line containing
until eof
Sed: Awk:
Print all lines of FileID from 1st line containing
until 1st line containing [boundaries inclusive]
Sed: Awk:
Print all of file EXCEPT section between 2 regular expressions
Sed: Awk:
Delete duplicate, consecutive lines from a file (emulates "uniq").
First line in a set of duplicate lines is kept, rest are deleted.
Sed: Awk:
Delete duplicate, nonconsecutive lines from a file. Beware not to
overflow the buffer size of the hold space, or else use GNU sed.
Sed: Awk:
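In awk this is commonly written with an associative array that counts how often each line has been seen. A minimal sketch follows; the wrapper name dedupe is an assumption.

```sh
# Hypothetical helper: keep only the first occurrence of each line.
dedupe() {
  awk '!seen[$0]++' "$@"
}
```

The expression is true only when the count was zero before the increment, so later copies are dropped wherever they appear. There is no hold-space limit to worry about, although every distinct line is kept in memory.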
Delete all lines except duplicate lines (emulates "uniq -d").
Sed: Awk:
Delete the first 10 lines of a file
Sed: Awk:
Delete the last line of a file
Sed: Awk:
Delete the last 2 lines of a file
Sed: Awk:
Delete the last 10 lines of a file
Sed: Awk:
Delete every 8th line
Sed: Awk:
Delete lines matching pattern
Sed: Awk:
Delete ALL blank lines from a file (same as "grep '.' ")
Sed: Awk:
Delete all CONSECUTIVE blank lines from file except the first; also
deletes all blank lines from top and end of file (emulates "cat -s")
Sed: Awk:
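One way to sketch this in awk, assuming that lines containing only spaces or tabs should also count as blank: remember that a blank run was seen, and emit a single empty line only when more text follows. The name squeeze_blank is an assumption.

```sh
# Hypothetical helper: squeeze blank runs, drop leading and trailing blanks.
squeeze_blank() {
  awk '
    NF { if (blank && printed) print ""    # one separator between text blocks
         print; printed = 1; blank = 0; next }
    { blank = 1 }                          # remember the blank run' "$@"
}
```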
Delete all leading blank lines at top of file
Sed: Awk:
Delete all trailing blank lines at end of file
Sed: Awk:
Delete the last line of each paragraph
Sed: Awk:
Get Usenet/e-mail message header
Sed: Awk:
Get Usenet/e-mail message body
Sed: Awk:
Get Subject header, but remove initial "Subject: " portion
Sed: Awk:
Parse out the address proper. Pulls out the e-mail address by itself
from the 1-line return address header (see preceding script)
Sed: Awk:
Add a leading angle bracket and space to each line (quote a message)
Sed: Awk:
Delete leading angle bracket & space from each line (unquote
a message)
Sed: Awk:
(This page is a summary of Russ Cox's excellent article Regular Expression Matching Can Be Simple and Fast.)
Russ Cox writes that Awk's regular expression library is surprisingly faster than that used in Perl, Ruby, and Python:
Let's use superscripts (written here with ^) to denote string repetition, so that a?^3a^3 is shorthand for a?a?a?aaa. This lets us
define timing experiments that use regular expressions to match a?^n a^n against the string a^n.
If we conduct those experiments, Perl requires over sixty seconds to match a 29-character string. The other approach, labeled Thompson NFA for reasons that will be explained later, requires twenty microseconds to match the string. That's not a typo. ...
the Thompson NFA implementation is a million times faster than Perl when running on a miniscule 29-character string.
This trend grows as we increase "n": the Thompson NFA handles a 100-character string in under 200 microseconds, while Perl would require over 10^15 years. (Perl is only the most conspicuous example of a large number of popular programs that use the same algorithm; the above graph could have been Python, or PHP, or Ruby, or many other languages.)
For some details of his results, see the following graph. Note that the y-axis is logarithmic (increases by a power of ten for each tick) so
these differences are really big differences:
The reason for these differences is very technical- but Cox's article offers an excellent and clear description of those details.
In short, the RE matcher used in Perl, Ruby, and Python is a recursive backtracking algorithm that follows only one
candidate match state at a time.
A Thompson NFA, as used in Awk and Grep, on the other hand, follows all the possible match states at once. Using Thompson's NFA,
the whole match process can be
pre-computed and cached at compile time, thus removing the backtrack-on-failure process.
And what is the lesson here? Next time someone tells you Awk is old-fashioned, cough politely and mention that at least in some
aspects, certain supposedly-more-modern languages do not offer all the support provided by dear-"old"- Awk.
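To try the experiment with whatever awk is installed, the pattern a?^n a^n and the string a^n can be built inside awk itself. This is a small sketch, not Cox's test harness; the value of n and the variable names are assumptions:

```sh
n=20
result=$(awk -v n="$n" 'BEGIN {
    for (i = 1; i <= n; i++) { opt = opt "a?"; str = str "a" }
    re = "^" opt str "$"          # a?^n a^n, anchored at both ends
    print ((str ~ re) ? "match" : "no match")
}')
printf 'n=%s: %s\n' "$n" "$result"
```

Wrapping the awk call in the shell's time command and raising n gives a rough feel for how the run time grows. The numbers will depend on which awk implementation is in use.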
On Nov 30'09, Hermann Peifer found and fixed a bug in an older version of the test code at the end of
http://awk.info/?tip/whinyUsers .
With the older, incorrect, version it was reported that keeping all Awk arrays sorted had very little impact on performance.
With Hermann's fix, we can now show that sorting slows down processing by
15% (at least, for the example explored on that page.)
Thanks to Hermann for that correction.
(Editor's note: On Nov 30'09, Hermann Peifer found and fixed a bug in an older version of the test code at the end of this file.)
Writing in comp.lang.awk, Ed Morton reveals the secret WHINY_USERS flag.
"Nag" asked:
Hi,
I am creating a file like...
How can I sort file1 within awk code?
Ed Morton writes:
Your editor coded up the following test for the runtime costs of WHINY_USERS. The following code
is called twice (once with, and once without setting WHINY_USERS):
And the results? Sorting added 15% to runtimes:
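WHINY_USERS is a gawk-only setting, and the test code is not reproduced in this copy. For awks without it, a portable (if slower) way to answer Nag's question is to pipe the output of the END loop through the sort command. This is a sketch with made-up data, not Ed Morton's code:

```sh
sorted=$(printf '%s\n' 'pear 3' 'apple 5' 'pear 2' | awk '
    { count[$1] += $2 }
    END {
        # "for (k in count)" visits keys in no particular order,
        # so let sort(1) put the output lines in order.
        for (k in count) print k, count[k] | "sort"
        close("sort")
    }')
printf '%s\n' "$sorted"
```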
In comp.lang.awk, Ed Morton offers advice on how to print ranges of Awk records.
Suppose you are looking to extract a section of code from a text file based on
two regular expressions.
Say the file looks like this:
newspaper
magazing
hiking
hiking trails in the city
muir hike
black mountain hike
summer meados hike
end hiking
phone
cell
skype
and you want to
extract
What to do?
Personally, I rarely if ever use
as I'm never immediately sure what it'd output for input such as:
and whenever you want to do something just slightly different with the
selection you need to change the script a lot.
Not being sure of the semantics is probably a catch 22 since I rarely
use it but the benefit of using that syntax vs spelling it out:
just doesn't really seem worthwhile, and then if you want to do
something extra like test for some other condition over the block
this:
is about as brief as:
and if you want to exclude the start (or end) of the block you're
printing then you just move the "f" test to the obvious place and you
don't need to duplicate the condition:
and note the different semantics now. This:
will exclude the line at the start of the block you're printing,
whereas this:
will exclude that line plus every other occurrence of "start" within
the block which is probably not what you'd want. To simply exclude
only the first line of the block but stay with the /start/,/end/
approach you'd need to do something like:
(which is getting fairly obscure.)
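The code from this discussion is missing in this copy. As a hedged illustration of the flag-based style described above, here is one way to write it, using a cut-down version of the sample file and the flag name "f" from the text. The exact patterns are assumptions:

```sh
data='newspaper
hiking
muir hike
end hiking
phone'

# Inclusive: set the flag at the start line, print while it is set,
# clear it after the end line has been printed.
inclusive=$(printf '%s\n' "$data" | awk '/^hiking$/ { f = 1 } f; /^end hiking$/ { f = 0 }')

# Exclusive: move the "f" test so that neither boundary line is printed.
exclusive=$(printf '%s\n' "$data" | awk '/^end hiking$/ { f = 0 } f; /^hiking$/ { f = 1 }')

printf '%s\n' "$inclusive" "--" "$exclusive"
```

Only the position of the bare "f" test changes between the two versions; the start and end conditions are each written once.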
Download all the following example code and support data files from
LAWKER
This page contains a set of sample Awk scripts
to manage different kinds of databases.
In all cases, we'll use a text editor such as edit.exe to create and edit the data files,
and Awk scripts will be used to query and manipulate the data.
OK, so it's not a fancy GUI-based system,
but this method is flexible and
the scripts execute relatively quickly.
Also, your data won't be locked in some company's
proprietary binary file format.
There is also the benefit of portability:
If your PC can run DOS, you can also run these scripts on your PC.
Awk is also available on Linux and on other operating systems.
This page assumes that you are already familiar with database terms
like 'record', 'field', and 'search keyword'.
Awk is an interpreted programming language that is designed for
managing and converting data files and generating reports from the data.
Awk will automatically read an input file and
parse it into records and fields, one record at a time.
A typical Awk script will then manipulate the fields using
predefined variables like $1 (the first field), $2 (the second field), etc.
To use Awk, you create an Awk script, and then run it
with the Awk program (gawk.exe in this case).
Many Awk scripts are small, and the language lends itself to
writing "one-time use" programs.
All the files on this page are available in the ZIP archive
at this link.
Feel free to reuse and customize them.
You will need the GNU Awk program gawk.exe to be installed on your
QuickPAD Pro.
See the programming page for instructions on installing GNU Awk.
Here is the general format of a gawk command line:
That command line will not modify the input file and all the output
will be directed to the screen.
If a script creates a new data file (for example, a sort script),
the command line will be:
If you use a particular script often and get tired of typing in a
long command line, you can create a batch file to execute the long
command line for you.
We are currently limited to 64K files for our data.
We can work around this restriction by using the chop
utility program that is described in the software page.
In this section we demonstrate some Awk scripts to manage an 'index card' database.
This type of database can be used for any type of simple text
lists, like lists of books, music CDs, recipes, quotations, etc.
Our information will be stored into 'cards'.
Each card will have a 'title' and a 'body':
For example,
let's create a sample card file called 'cards.txt'
and use it to store a list of our goals.
Let's begin with an Awk script to print out the titles of
all the cards in the file.
Here is the script called 'titles':
Here is a sample run:
Another useful script is one that can be used for searching the data file,
ignoring uppercase and lowercase distinctions.
The following script called 'search' will display the cards that contain the
keyword 'write'.
Here is a sample run:
To search for other strings, edit the 'search' script and replace 'write'
with another search keyword.
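The 'search' script itself is not reproduced here. A minimal sketch of a case-insensitive card search follows, assuming cards are separated by blank lines. Unlike the script described above, it takes the keyword from a hypothetical command-line variable called key rather than hardcoding it:

```sh
cards='Book
Write the first chapter

Garden
plant tomatoes

Blog
finish and WRITE up notes'

# RS = "" makes each blank-line-separated card one record.
# Lowercasing both sides makes the comparison ignore case.
found=$(printf '%s\n' "$cards" |
    awk -v key="write" 'BEGIN { RS = ""; ORS = "\n\n" } index(tolower($0), tolower(key))')

printf '%s\n' "$found"
```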
Sorting the cards based on the titles would also be a useful operation.
Here is a script called 'sort' which reads the entire data file into
an array and then uses the QuickSort algorithm to sort it:
And here is a sample run:
However, the 'sort' script had some trouble with large files
because it reads in all the cards into an array in RAM.
In my tests,
the largest file I was able to sort was only about 100K.
Index cards can also be used for memorization.
The title of the card can contain a question
and the body of the card
contains the answer that you want to memorize.
Let's write a program that randomly chooses
a card from our 'cards.txt' file, displays its title,
asks the user to press the 'Enter' key,
and then displays the body of that card.
First, we need a text file which contains the questions
and answers that we want to memorize.
Let's name the file 'question.txt'.
Note that the answer can contain multiple lines:
Here is the Awk script called 'memorize'.
It will read the data file into an array,
randomly shuffle the array,
and then it will loop through the array and
display each question and answer.
Here is a sample run.
The script will randomly choose cards until it either finishes going
through all the cards,
or until the user enters a 'q' to quit.
The databases above used a simple 'index card' analogy.
That data model works fine for simple lists with free form data,
but there are also cases where we need
to manage records with specialized data fields.
Let's create a data file and some scripts for an 'address book' database.
Our data file will be a text file where every line is one record.
Within a line of the file, the data will be separated into fields.
When choosing a delimiter for our fields, we need to make sure
that it won't appear accidentally within a field itself.
For example,
an address book has fields like name, company name, address, etc.,
and in this case, each of those fields can contain spaces within them
(e.g. "ACME Mail Order Company").
Therefore, we can't use a space to separate the fields of the line.
Instead, let's use commas to separate the fields,
and we'll need a rule that commas cannot appear within a field.
Here is a sample data file called 'address.txt':
It may also be useful to extract just the phone numbers from our data
file.
Here is the script called 'phones' which will extract only the names
and phone numbers from the data file:
Here is a sample run:
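The 'phones' script and its sample run are missing from this copy. As a rough sketch of the idea, assume a hypothetical record layout of name, company, address, phone (the real 'address.txt' layout may differ):

```sh
phones=$(printf '%s\n' \
    'Jane Doe,ACME Mail Order Company,1 Main St,555-0100' \
    'John Roe,Widgets Inc,2 Oak Ave,555-0199' |
    awk -F, '{ print $1 ": " $4 }')   # -F, splits each line on commas

printf '%s\n' "$phones"
```

Because the separator is a comma, the spaces inside "ACME Mail Order Company" stay within one field.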
Awk can also be used for mathematical computation of fields.
Let's demonstrate this with a data file called 'grades.txt' that contains
grades of students.
Here is a longer script that will take all the grades, average them equally,
and compute the final average and the final grade for each student.
At the end, it will compute some statistics about the entire class.
Here is the script called 'grades'.
Here is a sample run:
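The 'grades' script and its output are not shown in this copy. Here is a much shorter sketch of the same kind of computation, assuming each line holds last name, first name, and then the grades (the sample data is invented, and letter grades are left out):

```sh
report=$(printf '%s\n' \
    'Smith Ann 90 80 100 70 90 80' \
    'Jones Bob 60 70 80 90 70 80' |
    awk '
    {
        sum = 0
        for (i = 3; i <= NF; i++) sum += $i   # fields 3..NF are grades
        avg = sum / (NF - 2)                  # equal weight per grade
        total += avg
        printf "%s %s %.1f\n", $2, $1, avg
    }
    END { printf "class average %.1f\n", total / NR }')

printf '%s\n' "$report"
```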
Another useful script is the following program that computes a
histogram of the grades.
It is hardcoded to only read the third column ($3),
but you can edit it and change it to read any of the columns in
the input file.
Here is the script called 'histo':
The output shows that there were six grades,
and most of them were in the 80-89 range.
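The 'histo' script is not reproduced here. A minimal sketch of a histogram over the third column might look like this, with made-up data and ten-point buckets as assumptions:

```sh
histo=$(printf '%s\n' \
    'Smith Ann 85' 'Jones Bob 92' 'Lee Cy 78' \
    'Diaz Eva 88' 'Khan Ali 81' 'Moss Jo 95' |
    awk '
    { bucket = int($3 / 10) * 10; count[bucket]++ }
    END {
        # Walk the buckets in numeric order; "for (b in count)" would
        # not guarantee any order.
        for (b = 0; b <= 100; b += 10)
            if (b in count) {
                stars = ""
                for (i = 0; i < count[b]; i++) stars = stars "*"
                printf "%d-%d: %s\n", b, b + 9, stars
            }
    }')

printf '%s\n' "$histo"
```

Changing $3 to another field number reads a different column.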
This program takes a data file which lists your checkbook entries
and your deposits,
and calculates the totals.
Here is what a sample input file called 'checks.txt' looks like:
Here is the script called 'check' which will calculate the totals:
And this is a sample run:
Awk works well with data files that are stored in text files.
Awk assumes that the data file is organized into records,
within each record the data is divided into fields,
and there are unique characters in the file that are used as the field
separators and record separators.
By default, Awk assumes that newline characters are the record
separators and whitespace characters (spaces and tabs) are the field separators.
It is also possible to redefine the field separators to other characters,
like a comma or a tab character,
which means that
Awk can process the commonly used "comma separated" and
"tab separated" format for data files.
But note that if a file uses newline characters as record separators,
it means that a newline cannot appear within a field.
For example, a data file with one record per line
cannot contain a text field
(e.g. a "notes" field) that contains free form text with newline
characters within it.
That would confuse Awk unless we added special code to handle that notes field.
The same restrictions apply to the field separators.
If a file is defined to be comma separated, it means that no field
is allowed to contain comma characters within it
(e.g. a Name field that contains "Alvarado, Victor")
because Awk would parse that as two fields, not one.
That is why tab separated files tend to be used more often.
That way, the fields are allowed to contain spaces and commas.
Another way to format data for use by Awk is to use the "multiline"
format, which is what we used for our index card databases above.
Awk will treat each line as a field, and a blank line is the
record separator.
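A small sketch of the multiline format: setting RS to the empty string makes a blank line the record separator, and setting FS to a newline makes each line a field. The card contents below are invented:

```sh
cards='Goal one
write a book

Goal two
learn to sail
buy a boat'

# $1 is the card title; the remaining NF - 1 fields are the body lines.
titles=$(printf '%s\n' "$cards" |
    awk 'BEGIN { RS = ""; FS = "\n" } { print $1 ": " (NF - 1) " body line(s)" }')

printf '%s\n' "$titles"
```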
To export data to Excel,
all we need to do is to convert the data file into tab-delimited format,
and store it in a text file with a *.xls extension.
When that file is opened in Microsoft Windows,
Excel will open it automatically as if it were a spreadsheet.
As an example, let's export our grades.txt file to Excel.
Here is our 'grades.txt' file:
The file uses spaces as the field separator,
so we'll need a script that will convert the field separators
into tabs.
Here is a script called 'conv2xls':
And here is the sample run, where we store the tab-delimited output
into a text file called grades.xls:
We can then copy the grades.xls text file to a Windows PC,
double-click on it,
and Excel will open it as if it were a spreadsheet:
You can then do a "Save As" in Excel to save it as the regular
Excel binary format.
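The 'conv2xls' script is missing from this copy. One common way to do this conversion, offered here as a sketch rather than the original script, is to set the output field separator to a tab and force awk to rebuild each record:

```sh
# Assigning $1 to itself makes awk rejoin the fields using OFS.
tsv=$(printf '%s\n' 'Smith Ann 90' 'Jones Bob 75' |
    awk -v OFS='\t' '{ $1 = $1; print }')

printf '%s\n' "$tsv"
```

The same command with OFS set to a comma produces comma-separated output.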
To export our data to a web page,
we will need a script that will input our data file and
generate HTML.
Let's start with our 'grades.txt' data file:
Here is a script called 'html' that will do the conversion.
Note that the data will appear as rows of a table in HTML.
Here is the sample run.
The output will be placed in a file called 'grades.htm'.
This is what the resulting 'grades.htm' file looks like:
And here is a link to the
grades.htm file
so you can see what the web page looks like in your browser.
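The 'html' script and its output are not reproduced in this copy. A minimal sketch of the idea, with one table row per record and one cell per field, follows. It does no escaping of characters such as < or &, which is an assumption that the data is plain names and numbers:

```sh
page=$(printf '%s\n' 'Smith Ann 90' 'Jones Bob 75' | awk '
    BEGIN { print "<table>" }
    {
        printf "<tr>"
        for (i = 1; i <= NF; i++) printf "<td>%s</td>", $i
        print "</tr>"
    }
    END { print "</table>" }')

printf '%s\n' "$page"
```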
First, we will need to install a database program on the Palm.
There are several database programs to choose from,
but let's use the freeware database program called
Pilot-DB
(available here from PalmGear).
Next, we will need the freeware DOS tools that come with Pilot-DB
to help us create the PDB data file.
The DB-tools package
is available here at PalmGear.
You can download it and install it on your Windows PC.
Those are DOS tools, but they were compiled to run in DOS under Windows,
so we can't run them on the QuickPAD Pro.
(Note: DB-tools is an open source project,
so the source code is available.)
The DB-tools package contains a program called 'csv2pdb.exe'.
It will do the conversion into a Palm PDB file.
Let's use the 'grades.txt' data file as an example:
Before we can run the 'csv2pdb.exe' program
we first need to convert our data into "csv"
(comma separated values)
format.
We can do that with the following awk script called 'conv2csv':
Here is the command line to create the comma-delimited data file,
which we will call 'grades.csv':
This is what the 'grades.csv' file looks like:
Next,
we need to create an "info" file which will describe the format of our data.
The 'csv2pdb.exe' program will need this information for the conversion
to Palm format.
The info file will give our database a title and describe the fields of
each record.
In grades.csv, the first field is the student's last name, the second field
is the student's first name,
and the other six fields are the grades.
Here is the resulting info file called 'grades.ifo':
The numbers at the end of the lines are the field widths in pixels;
we can make a guess for the field widths,
and then fine-tune them on the Palm Pilot.
The last line will set the backup bit on the PDB file so that it will
be backed up at every hotsync.
From this point on, the rest of the steps must be done on your Windows PC.
C:\> csv2pdb -i grades.ifo grades.csv grades.pdb
C:\>
It will create a new file called 'grades.pdb' in the current directory.
This is the Palm database file.
The last step is to install the PDB file to the Palm Pilot:
in the Windows Explorer
double-click on the PDB file and then hotsync your Palm Pilot
as usual.
Here is a screen shot of the Palm Pilot running Pilot-DB
with our grades database.
(Make sure you have selected the blank unnamed view from menu
at the top-right corner of the screen):
As you can see, storing data as text files gives you a lot of flexibility
in manipulating the data and exporting it to other formats.
Victor Alvarado
(Summarized and extended from a recent discussion at comp.lang.awk.)
A standard idiom in Gawk is to reset the random number generator in a BEGIN block.
Sadly, when called with no arguments, this "reseeding" uses time-in-seconds. So if the same "random"
task runs multiple times in the same second, it will get the same random number seed.
"Ben" writes:
I have a Gawk script that puts random comments into a file. It is run 3
times in a row in quick succession. I found that seeding the random
number generator using gawk did not work because all 3 times it was run
was done within the same second (and it uses the time).
I was wondering if anyone could give me some suggestions as to
what can be done to get around this
problem.
Kenny McCormack writes:
When last I ran into this problem, what I did was to save the last value
returned by rand() to a file, then on the next run, read that in and use
that value as the arg to srand(). Worked well.
(Editor's comment: Kenny's solution does work well but incurs the cost of maintaining and reading/writing that
"last value" file.)
Tim Menzies writes:
How about setting the seed using the BASH $RANDOM variable:
If referenced multiple times in a second, it always generates
a different number.
In the above usage, if we have a seed, use it. Else, no seed so start all "random" at the same place. If you prefer to
use the default "seed from time-in-seconds" then use:
(Editor's comment: Tim's solution incurs the overhead of additional command-line syntax. However, it does allow
the process calling Gawk to control the seed. This is important when trying to, say, debug code by recreating the sequence
of random numbers that lead to the bug.)
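Tim's command line is not reproduced in this copy. The following sketch shows the general shape of seeding from the caller; the function name draw and the variable name seed are made up for illustration:

```sh
# draw SEED: print the first random number produced for a given seed.
# With an empty seed, srand() is not called at all.
draw() {
    awk -v seed="$1" 'BEGIN { if (seed != "") srand(seed); print rand() }'
}

a=$(draw 42)
b=$(draw 42)    # same seed, so the same "random" number as $a
printf '%s %s\n' "$a" "$b"

# In bash, a fresh seed per call could be passed as:  draw "$RANDOM"
```

Passing a fixed seed recreates a run exactly, which is the debugging benefit mentioned above.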
Thomas Weidenfeller writes:
Is that good enough (random enough) for your task?
(Editor's comment: Nice. Thomas' solution reminds us that "Gawk" can access a whole host of operating system
facilities.)
Aharon Robbins writes:
You could do something like add PROCINFO["pid"] to the value of the time,
or use that as the seed.
(Editor's comment: Aharon's solution is the fastest of all the ones shown here. For example, on Mac OS X, his
solution takes 6ms to run:
while Thomas' solution is somewhat slower:
Note that while Aharon's solution is the fastest, it does not let some master process set the seed for the Gawk process (e.g. as in Tim's approach).)
If you want raw speed, use Aharon's approach.
If you want seed control, see Tim's approach.
In this exchange from comp.lang.awk,
Jason Quinn discusses his super-for loop trick.
Arnold Robbins then chimes in to say that, with indirect functions, super-for loops
could become a generic tool.
Jason Quinn writes:
Arnold Robbins replies:
In comp.lang.awk, Janis Papanagnou comments on how Awk can read a CSV file where the headers are named in line one.
Suppose you have a CSV file with headers for field names.
Gawk can use those headers for field names- which makes the code more
intuitive and easier to work with. Given that awk is
expected to work on tabular data, this seems to be a good alternative
to just field numbers.
This script can be called with an arbitrary list of column names
as defined in the first line of your data file and separated by
the same field separator as your data.
For example, suppose the above code is in bycolname.sh
and we have data that looks like this:
Now, calling this command...
Non existing column names will expand to $0 each, which may
be surprising if there's an unnoticed typo in your field list.
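Janis's script is missing from this copy. As a rough sketch of the technique (not the original bycolname.sh), the header line can be used to build a map from column name to field number. The variable cols and the sample data are assumptions:

```sh
out=$(printf '%s\n' 'name,age,city' 'ann,31,oslo' 'bob,27,rome' |
    awk -F, -v cols="city,name" '
    NR == 1 {                        # header line: remember where each name is
        for (i = 1; i <= NF; i++) col[$i] = i
        n = split(cols, want, ",")
        next
    }
    {
        line = ""; sep = ""
        for (j = 1; j <= n; j++) {
            # An unknown name gives an empty index, which awk treats as $0.
            line = line sep $(col[want[j]])
            sep = ","
        }
        print line
    }')

printf '%s\n' "$out"
```

This sketch shares the caveat noted above: a mistyped column name silently expands to the whole record.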
by Ed Morton (and friends)
The following summary, composed to address the recurring
issue of getline (mis)use, was based primarily on information from the
book "Effective Awk Programming", Third Edition By Arnold Robbins;
(http://www.oreilly.com/catalog/awkprog3) with review and additional
input from many of the comp.lang.awk regulars, including
getline is fine when used correctly (see below for a list of those
cases), but it's best avoided by default because:
As the book "Effective Awk Programming", Third Edition, by Arnold
Robbins (http://www.oreilly.com/catalog/awkprog3), which provides much
of the source for this discussion, says:
The following summarises the eight variants of getline applications,
listing which variables are set by each one:
The "command |& ..." variants are GNU awk (gawk) extensions. gawk also
populates the ERRNO builtin variable if getline fails.
Although calling getline is very rarely the right approach (see
below), if you need to do it the safest ways to invoke getline are:
since those do not affect any of the builtin variables and they allow
you to correctly test for getline succeeding or failing. If you
need the input record split into separate fields, just call "split()"
to do that.
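The list of safe invocations is missing from this copy. A sketch of the general pattern: read into a named variable, parenthesize the whole getline expression, and compare the result with 0. The command and the file name below are arbitrary examples:

```sh
out=$(awk 'BEGIN {
    cmd = "echo hello world"
    # getline returns 1 on success, 0 at end of input, and -1 on error.
    while ((cmd | getline line) > 0) {
        n = split(line, word, " ")    # $0 and NF are left untouched
        print n, word[2]
    }
    close(cmd)

    # A bare "while (getline line < file)" would loop forever here,
    # because the error value -1 counts as true.
    if ((getline line < "/no/such/file") < 0)
        print "cannot read /no/such/file"
}')

printf '%s\n' "$out"
```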
Users of getline have to be aware of the following non-obvious effects
of using it:
getline is an appropriate solution for the following:
In all other cases, it's clearest, simplest, less error-prone, and
easiest to maintain to let awk's normal text-processing read the records.
In the case of "c", whether to use the BEGIN+getline approach or just
collect the data within the awk condition/action part after
testing for the first file is largely a style choice.
"a" above calls the UNIX command "ls" to list the current directory
contents, then prints the result one line at a time.
"b" above writes the letters of the alphabet in reverse order, one per
line, down the two-way pipe to the UNIX "sort" command. It then closes
the write end of the pipe, so that sort receives an end-of-file
indication. This causes sort to sort the data and write the sorted
data back to the gawk program. Once all of the data has been read,
gawk terminates the coprocess and exits. This is particularly necessary
in order to use the UNIX "sort" utility as part of a coprocess since
sort must read all of its input data before it can produce any output.
The sort program does not receive an end-of-file indication until gawk
closes the write end of the pipe. Other programs can be invoked as just:
Note that calling close() with a second argument is also gawk-specific.
"c" above reads every record of the first file passed as an argument to
awk into an array and then for every subsequent file passed as an
argument will print every record from that file that matches any of
the records that appeared in the first file (and so are stored in the
"data" array). This could alternatively have been implemented as:
or:
or:
or (gawk only):
"d" above not only expands all the lines that say "include subfile", but
by writing the result to a tmp file, resetting ARGV[1] (the highest
level input file) and not resetting ARGV[2] (the tmp file), it then lets
awk do any normal record parsing on the result of the expansion since
that's now stored in the tmp file. If you don't need that, just do the
"print" to stdout and remove any other references to a tmp file or
ARGV[2]. In this case, since it's convenient to use $1 and $2, and no
other part of the program references any builtin variables, getline was
used without populating an explicit variable. This method is limited in
its recursion depth to the total number of open files the OS permits at
one time.
The following tips may help if, after reading the above, you discover
you have an appropriate application for getline or if you're looking for
an alternative solution to using getline:
In this example there are no blank lines and the output is all aligned
with the left hand column and you want to print $0 for the second record
following the record that contains some pattern, e.g. the number 3:
That works just fine. Now let's see the concise way to do it without
getline:
It's not quite so obvious at a glance what that does, but it uses an
idiom that most awk programmers could do well to learn and it is briefer
and avoids all those getline caveats.
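Both versions are missing from this copy. The sketch below shows a plausible getline version next to the counter idiom quoted later in this article ("c&&!--c;/3/{c=5}"), here with a count of 2. The input of nine numbered lines is made up:

```sh
input=$(printf '%s\n' 1 2 3 4 5 6 7 8 9)

# getline version: on a match, read two more records, then print.
with_getline=$(printf '%s\n' "$input" | awk '/3/ { getline; getline; print }')

# Counter version: a match sets c to 2; each later record decrements c,
# and the record on which c reaches zero is printed.
with_counter=$(printf '%s\n' "$input" | awk 'c && !--c; /3/ { c = 2 }')

printf '%s %s\n' "$with_getline" "$with_counter"
```

Changing the 2 to a 5 in the counter version is all it takes to print the 5th record after the match.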
Now let's say we want to print the 5th line after the pattern instead of
the 2nd line. Then we'd have:
i.e. we have to add a whole series of additional getline calls to the
getline version, as opposed to just changing the counter from 2 to 5 for
the non-getline version. In reality, you'd probably completely rewrite
the getline version to use a loop:
Still not as concise as the non-getline version, has all the getline
caveats and required a redesign of the code just to change a counter.
Now let's say we also have to print the word "Eureka" if the number 4
appears in the input file. With the getline version, you now have to do
something like:
whereas with the non-getline version you just have to do:
i.e. with the getline version, you have to work around the fact that
you're now processing records outside of the normal awk work-loop,
whereas with the non-getline version you just have to drop your test for
"4" into the normal place and let awk's normal record processing deal
with it like it always does.
Actually, if you look closely at the above you'll notice we just
unintentionally introduced a bug in the getline version. Consider what
would happen in both versions if 3 and 4 appear on the same line. The
non-getline version would behave correctly, but to fix the getline
version, you'd need to duplicate the condition somewhere, e.g. perhaps
something like this:
Now consider how the above would behave when there aren't 5 lines left
in the input file or when the last line of the file contains both a 3
and a 4. i.e. there are still design questions to be answered and bugs
that will appear at the limits of the input space.
Ignoring those bugs since this is not intended as a discussion on
debugging getline programs, let's say you no longer need to print the
5th record after the number 3 but still have to do the Eureka on 4. With
the getline version, you'd strip out the test for 3 and the getline
stuff to be left with:
which you'd then presumably rewrite as:
which is what you get just by removing everything involving the test for
3 and counter in the non-getline version (i.e. "c&&!--c;/3/{c=5}"):
i.e. again, one small requirement change required a complete redesign of
the getline code, but just the absolute minimum necessary tweak to the
non-getline version.
So, what you see above in the getline case was significant redesign
required for every tiny requirement change, much larger amounts of
handwritten code required, insidious bugs introduced during development
and challenging design questions at the limits of your input space,
whereas the non-getline version always had less code, was much easier to
modify as requirements changed, and was much more obvious, predictable,
and correct in how it would behave at the limits of the input space.
by Jim Hart
I've written this kind of thing
so often, it's tedious. I like this better:
Easier to type. And, in cases where front-to-back or back-to-front
doesn't matter, it's even simpler:
And, yes,
works, too. But, some loops don't involve arrays. :-)
This tip has been
discussed on comp.lang.awk.
Andrew Eaton wrote at comp.lang.awk:
I just started with awk and sed, I am more of a perl/C/C++ person. I
have a quick question regarding the pipe. In Awk, I am trying to use this
construct.
Is it possible that "print" is no longer printing the value of
getline, if so how do I correct it?
Arnold Robbins comments:
The problem here is that `mv' doesn't read standard input, it only
processes command lines. Assuming that your data is something like:
You can do things two ways:
or this way:
The latter is more efficient.
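The two code fragments are missing from this copy. A hedged sketch of the pipe-to-shell style: have awk print one mv command per input line and send them all to a single sh process. The file names and the temporary directory are invented for the demonstration:

```sh
dir=$(mktemp -d)
touch "$dir/old1" "$dir/old2"

# Each input line holds an old name and a new name. awk builds the
# command text, and one "sh" process runs all of the commands.
printf '%s\n' 'old1 new1' 'old2 new2' |
    (cd "$dir" && awk '{ print "mv " $1 " " $2 | "sh" } END { close("sh") }')

listing=$(ls "$dir")
rm -rf "$dir"
printf '%s\n' "$listing"
```

This sketch assumes file names without spaces or shell metacharacters. Names like that would need quoting before being handed to the shell.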
by Arnold Robbins
From the Gawk Manual.
The sed utility is a stream editor, a program that reads a stream of data, makes changes to it, and passes it on. It is often used to make global changes to a large file or to a stream of data generated by a pipeline of commands. While sed is a complicated program in its own right, its most common use is to perform global substitutions in the middle of a pipeline:
Here, s/old/new/g tells sed to look for the regexp old on each input line and globally replace it with the text new, i.e., all the occurrences on a line. This is similar to awk's gsub function.
The following program, awksed.awk, accepts at least two command-line arguments: the pattern to look for and the text to replace it with. Any additional arguments are treated as data file names to process. If none are provided, the standard input is used:
The program relies on gawk's ability to have RS be a regexp, as well as on the setting of RT to the actual text that terminates the record.
The idea is to have RS be the pattern to look for. gawk automatically sets $0 to the text between matches of the pattern. This is text that we want to keep, unmodified. Then, by setting ORS to the replacement text, a simple print statement outputs the text we want to keep, followed by the replacement text.
There is one wrinkle to this scheme, which is what to do if the last record doesn't end with text that matches RS. Using a print statement unconditionally prints the replacement text, which is not correct. However, if the file did not end in text that matches RS, RT is set to the null string. In this case, we can print $0 using printf.
The BEGIN rule handles the setup, checking for the right number of arguments and calling usage if there is a problem. Then it sets RS and ORS from the command-line arguments and sets ARGV[1] and ARGV[2] to the null string, so that they are not treated as file names.
The usage function prints an error message and exits. Finally, the single rule handles the printing scheme outlined above, using print or printf as appropriate, depending upon the value of RT.
Download from
LAWKER.
The s2a project is a sed to awk conversion utility written in awk. As input it takes sed scripts, and it outputs an equivalent awk script.
This version should be fully functional as far as the following sed commands are concerned: a,d,s,p,q,c,i,n.
Commands to be implemented in the future: {},=,h,g,N,P,r,x,y,l,H,G,D,b,t,:
$ is not a valid line address.
Also, line continuation with '\' is not implemented.
James Lyons, Feb 2008.
For more excellent awk code, visit Lyons' awk.dsplab
web site.
Reference: Visual AWK: A Model for Text Processing by Demonstration, by
Jürgen Landauer and Masahito Hirakawa. 11th International IEEE Symposium on Visual Languages, 1995.
Download from
LAWKER.
In text editing users are often confronted
with reformatting tasks which involve large
portions of texts, sometimes consisting of hundreds of lines. For example, let us assume we
want to create mailing labels out of a
given address list. The task seems to
be easy to automate since all paragraphs are
similarly structured, containing a name, an
address, and a phone number each.
However, both the built-in find and replace
function and the macro recorder of the editor
prove to be not flexible enough to handle the
task, because their facilities for specifying
search patterns and for dealing with special
cases and exceptions are limited.
On the other hand, most current end-users estimate solving such tasks with one of today's
programming languages as too difficult for
them.
Programming by Demonstration (PBD) is a
promising remedy here since, by contrast,
it promises nearly unlimited programming
power through ease of learning and usage.
Therefore, a variety of PBD systems were
proposed for this application domain in
the past. But PBD is not yet very widespread in
commercial text editors because of some serious weaknesses.
This paper examines these weaknesses
and presents a new approach for the solution of
the deficiencies of PBD. We introduce Visual
AWK, a prototype text processing system developed at the Information Systems Lab of
Hiroshima University based on the programming language AWK which incorporates
the new design approach. Extensive visual
feedback and program visualization via spreadsheets improve both usability and expressive
power.
Visual AWK is aimed at users without previous knowledge in programming, but with
experience in text editor use.
The application domain is semi-structured
texts. That is, texts that consist of equally structured entities, for instance lines or paragraphs,
but may contain a few syntactically classifiable
sets of exceptions with a different structure.
by Gerard Holzmann
Micro-tracer is a little awk-script for verifying state machines; quite possibly the
world's smallest working verifier.
Some comments on the working of the script, plus a sample input
for the X.21 protocol, are given below.
Reproduce and use freely, at your own risk of course.
The micro-tracer was first described in this report:
This script was written to show how little code is
needed to write a working verifier for safety properties.
The hard problem in writing a practical verifier
is to make the search efficient, to support a useful logic,
and a sensible specification language... (see the
Spin
homepage.)
The first three lines of the script deal with the
input. Data are stored in two arrays. The initial state of machine A
is stored in array element proc[A]. The transitions
that machine A can make from state s are stored in
move[A,s]. All data are stored as strings, and most arrays are
also indexed with strings. All valid moves for A in state s,
for instance, are concatenated into the same array element move[A,s],
and later unwound as needed in function run().
The line starting with END is executed when the end of the
input file has been reached and the complete protocol specification
has been read. It initializes the signals
and calls the symbolic execution routine run().
The program contains three function definitions: run(), mkstate(),
and unwrap().
The global system state, state, is represented as a concatenation
of strings encoding process and signal states. The function
mkstate() creates the composite, and the function unwrap()
restores the arrays proc and signal to the contents that correspond
to the description in state. (The recursive step in run()
alters their contents.) Function run() uses three local variables,
but only one real parameter state that is passed by the calling
routine.
The analyzer runs by inspecting the possible moves for each
process in turn, checking for valid inp or out moves,
and performing a complete depth-first search. Any state that
has no successors is flagged as a deadlock. A backtrace of
transitions leading into a deadlock is maintained in array Level
and can be printed when a deadlock is found.
The first line in run() is a complete state space handler. The
composite state is used to index a large array space. If the
array element was indexed before it returns a count larger than zero:
the state was analyzed before, and the search can be truncated.
After the analysis completes, the contents of array space are
available for other types of probing. In this case, the micro tracer
just counts the number of states and prints it as a statistic,
together with the number of deadlocks
found.
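The state space handler described above can be illustrated with a much smaller program. The following is a minimal sketch, not the micro-tracer itself: the input format (`init` and `move` lines), the shell function name `explore`, and the array `succ` are assumptions made for this example. It keeps two ideas from the text: successors are concatenated into one string per state and unwound with split(), and the array space truncates the search when a state is revisited.

```shell
# Hypothetical input format (not the micro-tracer's own):
#   init <state>        the starting state
#   move <from> <to>    one transition
explore() {
    awk '
    $1 == "init" { start = $2 }
    $1 == "move" { succ[$2] = succ[$2] " " $3 }   # all moves in one string
    END {
        run(start)
        printf "%d states, %d deadlocks\n", nstates, deadlocks
    }
    function run(state,    n, i, kids) {
        if (space[state]++) return             # seen before: truncate search
        nstates++
        n = split(succ[state], kids, " ")      # unwind the successor list
        if (n == 0) { deadlocks++; return }    # no successors: deadlock
        for (i = 1; i <= n; i++) run(kids[i])  # depth-first search
    }'
}

printf 'init A\nmove A B\nmove A C\nmove B C\nmove C A\nmove B D\n' | explore
# prints: 4 states, 1 deadlocks
```

State D has no outgoing moves, so it is counted as the single deadlock; the cycle A, B, C, A is cut off by the space check.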
For each step number, the error listings give the name of the
executing machine followed by its state and an arrow.
Behind the arrow is the transition rule: inp or out, the
new state, the required or provided signal value, and
the signal name.
From "AUI - the Debugger and Assertion Checker
for the Awk Programming Language" by Mikhail Auguston, Subhankar Banerjee, Manish Mamnani, Ghulam Nabi, Juris Reinfelds,
Ugis Sarkans, and Ivan Strnad.
Proceedings of the 1996 International Conference on Software Engineering: Education and Practice (SE:EP '96)
Download from
LAWKER.
This paper describes the design of Awk User Interface (AUI). AUI is a graphical
programming environment for editing, running, testing and debugging of Awk
programs. The AUI environment supports tracing of Awk programs, setting
breakpoints, and inspection of variable values.
An assertion language to describe the
relationship between the input and output of an Awk program is provided. Assertions can
be checked after the program run, and if violated, informative and readable
messages can be generated. The assertions and debugging rules for the Awk
program are written in a separate text file. Assertions are useful not only for
testing and debugging but can also be considered a means of formal program
specification and documentation.
The input file contains a list of all states of U.S.A. There are 50 records separated by newlines,
one for each of the states. The number of fields in a record is variable. The first field is the name of
the state, and the subsequent fields are names of neighbor states. Fields are separated by tabs. For
example, the first two records in the database are
The task is to color the U.S.A. map in such a way that any two neighboring states are in different
colors. We will do it in a greedy manner (without backtracking), assigning to every state the first
possible color. The Awk program for this task is the following:
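A greedy coloring of this kind can be sketched in a few lines. This is an independent sketch, not the program from the paper: the shell function name `color_map`, the use of numbers 1, 2, 3, ... as colors, and the array names are assumptions. It reads tab-separated records (a name followed by its neighbors), processes them in input order, and gives each one the lowest color not already taken by a colored neighbor.

```shell
color_map() {
    awk -F '\t' '
    {
        split("", used)                    # clear the scratch array
        for (i = 2; i <= NF; i++)          # colors taken by neighbors so far
            if ($i in color) used[color[$i]] = 1
        c = 1
        while (c in used) c++              # first color that is still free
        color[$1] = c
        print $1, c
    }'
}

printf 'A\tB\tC\nB\tA\tC\nC\tA\tB\tD\nD\tC\n' | color_map
# prints: A 1, B 2, C 3, D 1 (one pair per line)
```

Because there is no backtracking, the number of colors used depends on the order of the records.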
We can check the correctness of the coloring using the following assertion:
From B.A. Bakar, T. Janowski,
"Automated Result Verification with AWK",
Sixth IEEE International Conference on Complex Computer Systems (ICECCS'00), p. 188, 2000
Download from LAWKER.
This paper proposes a technical framework to apply this technique
in practice. We show how to write formal
result-based specifications, how to generate a verifier program to check a given specification and
to carry out result-verification according to the generated program.
The execution result is written as a text file, the verifier is written
in AWK (a special-purpose language for text processing) and
verification is done automatically by the AWK interpreter,
given the verifier and the execution result as inputs.
all(fun, array [,max])
collect(fun, array1, array2 [,max])
select(fun, array1, array2 [,max])
reject(fun, array1, array2 [,max])
detect(fun, array [,max])
inject(fun, array, carry [,max])
All these functions return the size of array or array2
An interesting new feature in Gawk 3.1.7 is
indirect functions.
This allows the function name to be a variable, passed
as an argument to a function, and called using the syntax @fun(args).
This enables a new kind of functional programming style
in Gawk. For example, generic enumeration patterns
can be coded once, then called many different ways
with different function names passed as arguments.
This document illustrates this style of programming.
For example, here are some standard enumeration functions:
Applies the function fun to all items in the array.
If called with the max
argument, then they are iterated in the order i=1 .. max,
otherwise we use for(i in a).
Applies fun to each item in array1 and collects the
results in array2.
Find all the items in array1 that satisfy fun and
add them to array2.
Find all the items in array1 that do not satisfy fun and
add them to array2.
Return the first item found in array that satisfies fun.
If no such item is found, then return the magic global value Fail.
(This one is a little tricky.)
The result of applying fun to each item in array
is carried into the processing of the next item. Initially, the
carried value is carry. This function returns the final carry.
To illustrate the above, consider the following functions. Each of these is defined for
one array item.
When we run this ... eg/enum1
we see every item in arr printed using the above show function ... eg/enum1.out
When we run this ... eg/enum2
we see every item in arr divided in two ... eg/enum2.out
When we run this ... eg/enum3
we see every item in arr that satisfies odd.... eg/enum3.out
When we run this ... eg/enum4
we see every item in arr that does not satisfy odd.... eg/enum4.out
When we run this ... eg/enum5
we see the first item in arr that satisfies odd.... eg/enum5.out
When we run this ... eg/enum6
we see the result of multiplying every item in arr by its predecessor. eg/enum6.out
Note one design principle in the following: any newly generated arrays have indexes 1..max
where max is the number of elements in that array.
The above code does not pass around any state information that
the fun functions can use. So all their deliberations are either
with the current array values (integers or strings) or with global state.
It might be worthwhile writing new versions of the above with one more argument,
to carry that state.
This web site is a front end to a
repository
of Awk code.
The site, and the code, is maintained
by the international awk community (which includes you)
so there are many ways you can contribute:
Using this logo, link to http://awk.info:
(By the way, our current logo is pretty lame.
Want to contribute a better one? Please, be our guest!)
When writing a page, please follow these guidelines:
To contribute code, zip up the directory and mail it to
All function and file names are global to
our code so please ensure your new function/file
name does not clobber an old one.
Optionally, you might consider adding:
In the language of this site, a function file is a 100% standalone file containing one or more
functions with no dependencies on other files.
Note that if your function file depends on other files, then it becomes a package (see below).
Functions are stored in a file called myfunc.awk.
In the language of this site, a package is a file that depends on other files
(and the other files may depend on yet others, recursively).
Following a recent discussion in comp.lang.awk, we say that these dependencies are commented with
where file.awk is some
file (e.g. a file in the current directory).
Note that:
file.awk will be loaded before the file containing the reference to #use file.awk.
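The load-order rule can be sketched as a small dependency walk. This is a minimal sketch under stated assumptions, not the site's own loader: the shell function name `load_order` is made up, a dependency line is assumed to look exactly like `#use file.awk` at the start of a line, and file names are resolved relative to the current directory. It prints each file once, with every dependency listed before the file that uses it.

```shell
load_order() {
    awk '
    function visit(file,    line, parts) {
        if (seen[file]++) return              # load each file only once
        while ((getline line < file) > 0)
            if (line ~ /^#use[ \t]+/) {
                split(line, parts, /[ \t]+/)
                visit(parts[2])               # dependencies come first
            }
        close(file)
        print file
    }
    BEGIN { for (i = 1; i < ARGC; i++) visit(ARGV[i]) }' "$@"
}

# Usage (hypothetical file names):  load_order main.awk
```

The resulting list could then be turned into a series of -f options for awk; that step is left out here.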
The code that renders the awk.info web site can "pretty print" awk code. For example:
To enable that pretty print, add some html syntax inside your
code and apply the following conventions.
Note that if you want to check that your code is "looking pretty", you
can see how it looks using our preview tool:
For example, the file http://menzies.us/tmp/xx.awk
can be previewed using
http://awk.info/?awk:menzies.us/tmp/xx.awk
Once you've got it "looking pretty", please consider contributing that code to awk.info,
so our code library can grow. To do so, either email mail@awk.info
with the URL of your pretty code or zip up the files and email them across.
The first paragraph of the file will be ignored. Use this first para
for copyright notices or comments about down-in-the weeds trivia. Note: the
first para ends with one blank line.
The next paragraph should start with
The code should be topped and tailed as follows:
All other comment lines should start with a single "#" at front-of-line.
These comment characters will be stripped away by the awk.info renderer.
Awk.info's renderer adopts the following html shorthand. If a line starts with
then this is replaced with
If no other words follow #.WORD then the line becomes just <WORD>
Awk.info's renderer supports a few HTML extensions:
That's it. Now you can pretty print your code on the web just by adding
a little html in the comments.
Ideally, all code in our code repository
comes with unit tests:
Accordingly code offered to this site can contain unit tests, using the methods
described in this page.
But before going on, we stress that
awk.info gratefully accepts
awk contributions in any form. That is, including unit tests with code is optional.
If your code is in directory yourcode then create a sub-directory yourcode/eg
Write a test in a file yourcode/eg/yourtest. Divide that test into two parts:
Write the expected output of that test case in yourcode/eg/yourtest.out
The above file conventions mean that an automatic tool can run over the entire code base and perform a regression test (checking that each test
reproduces its *.out file).
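Such a regression run could look roughly like the following. This is a sketch under stated assumptions, not the tool used by awk.info: the function name `run_eg_tests` is made up, and each test file is assumed to be a shell script that is run from the code directory and prints its result on standard output.

```shell
run_eg_tests() {
    dir=$1
    failures=0
    for expected in "$dir"/eg/*.out; do
        [ -f "$expected" ] || continue            # no *.out files at all
        name=$(basename "$expected" .out)         # eg/yourtest.out -> yourtest
        actual=$(cd "$dir" && sh "eg/$name" 2>&1) # run the test case
        if [ "$actual" = "$(cat "$expected")" ]; then
            echo "PASS $name"
        else
            echo "FAIL $name"
            failures=$((failures + 1))
        fi
    done
    [ "$failures" -eq 0 ]                         # status 0 only if all passed
}

# Usage:  run_eg_tests yourcode
```

The function prints one PASS or FAIL line per test and returns a non-zero status if any test's output differs from its .out file.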
Another advantage of the above scheme is that you can use the tests to document your code.
To show the test case, add the following into your .awk file:
Then zip the directory yourcode (including yourcode/eg) and send it to awk.info. Once
we install those files on our site then
when awk.info displays that file, the test case trivia is hidden and the users only see the essential details.
For an example of this, see http://awk.info/?gawk/array/join.awk.
The following list is sorted by newbie-ness (so best to start at the top):
The following list is sorted by the number of times this material
is tagged at delicious.com (most tagged at top):
(For tutorial material on Awk, see Learning Awk page.)
R. Loui loui@ai.wustl.edu is Associate Professor of Computer Science, at Washington University in St. Louis. He has published in AI Journal, Computational Intelligence, ACM SIGART, AI Magazine, AI and Law, the ACM Computing Surveys Symposium on AI, Cognitive Science, Minds and Machines, Journal of Philosophy.
Whenever Ronald Loui teaches GAWK, he gives the students the choice of learning PERL instead. Ninety percent will choose GAWK after looking at a few simple examples of each language (samples shown below). Those who choose PERL do so because someone told them to learn PERL.
After one laboratory, more than half of the GAWK students are confident with their GAWK skills and can begin designing. Almost no student can become confident in PERL that quickly.
After a week, 90% of those who have attempted GAWK have mastered it, compared to fewer than 50% of PERL students attaining similar facility with the language (it would be unfair to require one to `master' PERL).
By the end of the semester, over 90% who have attempted GAWK have succeeded, and about two-thirds of those who have attempted PERL have succeeded.
To be fair, within a year, half of the GAWK programmers have also studied PERL. Most are doing so in order to read PERL and will not switch to writing PERL. No one who learns PERL migrates to GAWK.
PERL and GAWK appear to have similar programming, development, and debugging cycle times.
Finally, there seems to be a small advantage for GAWK over PERL, after a year, for the programmer's willingness to begin a new program. That is, both GAWK and PERL programmers tend to enjoy writing a lot of programs, but GAWK has the slight edge here.
by T. Menzies
Imagine Gawk as a kind of a cut-down C language with four tricks:
What do all these do? Well....
You don't need to define variables; they appear as you use them. There are only three types: strings, numbers, and arrays. To ensure a number is a number, add zero to it. To ensure a string is a string, add an empty string to it. To ensure your variables aren't global, use them within a function and add more variables to the call. For example, if a function is passed two variables, define it with two PLUS the local variables:
Note that it's good practice to add white space between passed and local variables.
Gawk programs can contain functions AND pattern/action pairs. If the pattern is satisfied, the action is called.
Two magic patterns are BEGIN and END. These are true before and after all the input files are read. Use END for end actions (e.g. final reports) and BEGIN for start-up actions such as initializing default variables, setting the field separator, or resetting the seed of the random number generator:
The default action is {print $0}; i.e. print the whole line. The default pattern is 1; i.e. true. Patterns are checked, top to bottom, in source-code order.
Patterns can contain regular expressions. In the above example /^\.P1/ means "front of line followed by a full stop followed by P1".
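To make the points about types and local variables concrete, here is a minimal sketch. The names `demo_locals` and `total` are made up for this example. It shows a function with two passed parameters plus two locals, that the locals do not disturb globals of the same name, and how adding zero or an empty string changes the way two values are compared.

```shell
demo_locals() {
    awk '
    # a and b are passed in; sum and i are locals, set apart by white space
    function total(a, b,    sum, i) {
        for (i = a; i <= b; i++) sum += i
        return sum
    }
    BEGIN {
        i = 99; sum = 99            # globals that share the local names
        print total(1, 4)           # 1+2+3+4
        print i, sum                # the globals are untouched
        n = 10; m = 9
        num = (n + 0 < m + 0)       # numeric comparison: 10 < 9 is false
        str = (n "" < m "")         # string comparison: "10" < "9" is true
        print num, str
    }'
}

demo_locals
# prints three lines: 10, then 99 99, then 0 1
```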
Regular expressions are important enough for their own section.
Ok, so now we know enough to explain a simple report function.
How does hist.awk work
in the following?
hist.awk reads
the maximum width from line one (when NR==1), then scales it to some maximum width value.
For each line, it then
prints the line ($0) with some stars at front.
Do you know what these mean? Well, the first two are leading and trailing blank spaces on a line and the last one is the definition of an IEEE-standard number written as a regular expression. Once we know that, we can do a bunch of common tasks like trimming away white space around a string:
or recognize something that isn't a number:
Regular expressions are an astonishingly useful tool supported
by many languages (e.g. Awk, Perl, Python, Java). The
following notes review the basics. For full details, see
http://www.gnu.org/manual/Gawk-3.1.1/html_node/Regexp.html#Regexp.
Syntax: Here are the basic building blocks of regular expressions:
c
\c
.
^
$
[abc...]
[^ac...]
r*
And that's enough to understand our trim function shown above. The regular expression /[ \t]*$/ means trailing whitespace; i.e. zero-or-more spaces or tabs followed by the end of line.
But that's only the start of regular expressions. There's lots more. For example:
r+
r?
r1|r2
r1r2
(r)
Now we can read ^[+-]?([0-9]+[.]?[0-9]*|[.][0-9]+)([eE][+-]?[0-9]+)?$ like this:
^[+-]? ...
...[0-9]+...
...[.]?[0-9]*...
...|[.][0-9]+...
.... ([eE]...)?$
...[+-]?[0-9]+)?$
Gawk has arrays, but they are only indexed by strings. This can be very useful, but it can also be annoying. For example, we can count the frequency of words in a document (ignoring the icky part about printing them out):
The array will hold an integer value for each word that occurred in the file. Unfortunately, this treats "foo", "Foo", and "foo," as different words. Oh well. How do we print out these frequencies? Gawk has a special "for" construct that loops over the indices of an array. This script is longer than most command lines, so it will be expressed as an executable script:
You can find out if an element exists in an array at a certain index with the expression:
This expression tests whether or not the particular index exists, without the side effect of creating that element if it is not present.
You can remove an individual element of an array using the delete statement:
It is not an error to delete an element which does not exist.
Gawk has a special kind of for statement for scanning an array:
This loop executes body once for each different value that your program has previously used as an index in array, with the variable var set to that index. The order in which the array is scanned is not defined.
To scan an array in some numeric order, you need to use keys 1,2,3,... and store somewhere that the array is N long. Then you can do the following:
Here are some useful array functions. We begin with the usual stack stuff. These stacks have items 1,2,3,... and position 0 is reserved for the size of the stack:
The pop function can be used in the usual way:
We can concatenate everything in an array into a string:
And we can go the other way and convert a string into an array using the built-in split function.
These pod files were built using a recursive include function that seeks patterns of the form:
^=include file
This function splits lines on space characters into the array `a' then looks for =include in a[1]. If found, it calls itself recursively on a[2]. Otherwise, it just prints the line:
Note that the third argument of the split function can be any regular expression.
By the way, here's a nice trick with arrays. To print the lines in a file in a random order:
Short, heh? This is not a perfect solution. Gawk can only generate 1,000,000 different random numbers so the birthday theorem cautions that there is a small chance that the lines will be lost when different lines are written to the same randomly selected location. After some experiments, I can report that you lose around one item after 1,000 inserts and 10 to 12 items after 10,000 random inserts. Nothing to write home about really. But for larger item sets, the above three liner is not what you want to use. For example, 10,000 to 12,000 items (more than 10%) are lost after 100,000 random inserts. Not good!
Awk is famous for how much it can do in one
line.
This site has many samples of that capability.
And if you have any more to add, please
send them in.
Eric Pement
Latest version of this file is usually at:
Most of my experience comes from versions of GNU awk (gawk) compiled for
Win32. Note in particular that DJGPP compilations permit the awk script
to follow Unix quoting syntax '/like/ {"this"}'. However, the user must
know that single quotes under DOS/Windows do not protect the redirection
arrows (<, >) nor do they protect pipes (|). Both are special symbols
for the DOS/CMD command shell and their special meaning is ignored only
if they are placed within "double quotes." Likewise, DOS/Win users must
remember that the percent sign (%) is used to mark DOS/Win environment
variables, so it must be doubled (%%) to yield a single percent sign
visible to awk.
If I am sure that a script will NOT need to be quoted in Unix, DOS, or
CMD, then I normally omit the quote marks. If an example is peculiar to
GNU awk, the command 'gawk' will be used. Please notify me if you find
errors or new commands to add to this list (total length under 65
characters). I usually try to put the shortest script first.
Double space a file
Double space a file which already has blank lines in it. Output file
should contain no more than one blank line between lines of text.
NOTE: On Unix systems, DOS lines which have only CRLF (\r\n) are
often treated as non-blank, and thus 'NF' alone will return TRUE.
Triple space a file
Precede each line by its line number FOR THAT FILE (left alignment).
Using a tab (\t) instead of space will preserve margins.
Precede each line by its line number FOR ALL FILES TOGETHER, with tab.
Number each line of a file (number on left, right-aligned)
Double the percent signs if typing from the DOS command prompt.
Number each line of file, but only print numbers if line is not blank
Remember caveats about Unix treatment of \r (mentioned above)
Count lines (emulates "wc -l")
Print the sums of the fields of every line
Add all fields in all lines and print the sum
Print every line after replacing each field with its absolute value
Print the total number of fields ("words") in all lines
Print the total number of lines that contain "Beth"
Print the largest first field and the line that contains it
Intended for finding the longest string in field #1
Print the number of fields in each line, followed by the line
Print the last field of each line
Print the last field of the last line
Print every line with more than 4 fields
Print every line where the value of the last field is > 4
IN UNIX ENVIRONMENT: convert DOS newlines (CR/LF) to Unix format
IN UNIX ENVIRONMENT: convert Unix newlines (LF) to DOS format
IN DOS ENVIRONMENT: convert Unix newlines (LF) to DOS format
IN DOS ENVIRONMENT: convert DOS newlines (CR/LF) to Unix format
Cannot be done with DOS versions of awk, other than gawk:
Use "tr" instead.
Delete leading whitespace (spaces, tabs) from front of each line
aligns all text flush left
Delete trailing whitespace (spaces, tabs) from end of each line
Delete BOTH leading and trailing whitespace from each line
Insert 5 blank spaces at beginning of each line (make page offset)
Align all text flush right on a 79-column width
Center all text on a 79-character width
Substitute (find and replace) "foo" with "bar" on each line
Substitute "foo" with "bar" ONLY for lines which contain "baz"
Substitute "foo" with "bar" EXCEPT for lines which contain "baz"
Change "scarlet" or "ruby" or "puce" to "red"
Reverse order of lines (emulates "tac")
If a line ends with a backslash, append the next line to it
(fails if there are multiple lines ending with backslash...)
Print and sort the login names of all users
Print the first 2 fields, in opposite order, of every line
Switch the first 2 fields of every line
Print every line, deleting the second field of that line
Print in reverse order the fields of every line
Remove duplicate, consecutive lines (emulates "uniq")
Remove duplicate, nonconsecutive lines
Concatenate every 5 lines of input, using a comma separator
between fields
Print first 10 lines of file (emulates behavior of "head")
Print first line of file (emulates "head -1")
Print the last 2 lines of a file (emulates "tail -2")
Print the last line of a file (emulates "tail -1")
Print only lines which match regular expression (emulates "grep")
Print only lines which do NOT match regex (emulates "grep -v")
Print the line immediately before a regex, but not the line
containing the regex
Print the line immediately after a regex, but not the line
containing the regex
Grep for AAA and BBB and CCC (in any order)
Grep for AAA and BBB and CCC (in that order)
Print only lines of 65 characters or longer
Print only lines of less than 65 characters
Print section of file from regular expression to end of file
Print section of file based on line numbers (lines 8-12, inclusive)
Print line number 52
Print section of file between two regular expressions (inclusive)
Delete ALL blank lines from a file (same as "grep '.' ")
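A few of the tasks listed above can be written as follows. These are common awk idioms given as a sketch, not necessarily the exact wording of the original list; the shell function names are made up so that each one-liner can be tried on standard input.

```shell
double_space()   { awk '1; { print "" }'; }               # double space a file
count_lines()    { awk 'END { print NR }'; }              # emulates "wc -l"
number_lines()   { awk '{ print NR "\t" $0 }'; }          # line number, then tab
last_field()     { awk '{ print $NF }'; }                 # last field of each line
trim_both()      { awk '{ gsub(/^[ \t]+|[ \t]+$/, "") } 1'; }  # strip both ends
reverse_lines()  { awk '{ a[i++] = $0 } END { for (j = i - 1; j >= 0; j--) print a[j] }'; }
dedupe()         { awk '!seen[$0]++'; }                   # nonconsecutive duplicates
dos2unix_lines() { awk '{ sub(/\r$/, "") } 1'; }          # CR/LF to LF, on Unix

printf 'b\na\nb\n' | dedupe
# prints: b, then a
```

The trailing `1` in several of these is a pattern that is always true, so the default action prints the (possibly modified) line.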
Special thanks to Peter S. Tillier for helping me with the first release
of this FAQ file.
For additional syntax instructions, including the way to apply editing
commands from a disk file instead of the command line, consult:
To fully exploit the power of awk, one must understand "regular
expressions." For detailed discussion of regular expressions, see
The manual ("man") pages on Unix systems may be helpful (try "man awk",
"man nawk", "man regexp", or the section on regular expressions in "man
ed"), but man pages are notoriously difficult. They are not written to
teach awk use or regexps to first-time users, but as a reference text
for those already acquainted with these tools.
USE OF '\t' IN awk SCRIPTS: For clarity in documentation, we have used
the expression '\t' to indicate a tab character (0x09) in the scripts.
All versions of awk, even the UNIX System 7 version should recognize
the '\t' abbreviation.
Peteris Krumins explaining Eric Pement's Awk one-liners:
Awk is famous for how much it can do in (around) 101
lines. Here are
some samples of that capability.
(And if you have any more to add, please
send them in.)
by R. Loui
Here are a few short programs that do the same thing in each language. When reading these examples, the question to ask is `how many language features do I need to understand in order to understand the syntax of these examples'. Some of these are longer than they need to be since they don't exploit some (e.g.) command line trick to wrap the code in for each line do X. And that is the point: for teach-ability, the preferred language is the one you need to know LESS about before you can be useful in it.
hello world: PERL, GAWK
One plus one: PERL, GAWK
Printing: PERL, GAWK
Printing the first field in a file: PERL, GAWK
Printing lines, reversing fields: PERL, GAWK
Concatenation of variables: PERL, GAWK
Looping: PERL, GAWK
Pairs of numbers: PERL, GAWK
List of words into a hash: PERL, GAWK
Printing a hash in some key order: PERL, AWK
Printing all lines in a file: PERL, GAWK
Printing a string: PERL, GAWK
Building and printing an array: PERL, GAWK
Sorting an array: PERL, GAWK
Sorting an array (#2): GAWK
Print all lines, vowels changed to stars: PERL, GAWK
Report from file: PERL, GAWK
Web Slurping: PERL, GAWK
saya(array [,label,sep,before,after,eq])
Array printing function. Contents printed, sorted on key.
Size of the array
The most common usage is to just use the first two arguments; e.g.
For other usages, see the examples, below.
Tim Menzies
join(a [,start,end,sep])
Joins an array into a string.
If sep is set to the magic value SUBSEP
then internally, join adds nothing between the items.
A string of a's contents.
In earlier gawks, length(a) did not work in functions. Hence....
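The behaviour described for join can be sketched as follows. This is a minimal sketch and not the LAWKER source: the defaults (start at 1, end at the last item, a single space as separator) are assumptions, and the array size is counted by hand rather than with length(a), in line with the remark about earlier gawks. The wrapper name `join_demo` is made up.

```shell
join_demo() {
    awk '
    function join(a, start, end, sep,    result, i, n) {
        if (start == "") start = 1                # assumed default
        if (end == "") {                          # assumed default: last item
            for (i in a) n++                      # count items by hand
            end = n + 0
        }
        if (sep == "") sep = " "                  # assumed default separator
        else if (sep == SUBSEP) sep = ""          # magic value: add nothing
        result = a[start]
        for (i = start + 1; i <= end; i++)
            result = result sep a[i]
        return result
    }
    BEGIN {
        split("a b c d", arr, " ")
        print join(arr)                  # a b c d
        print join(arr, 2, 3, "-")       # b-c
        print join(arr, 1, 4, SUBSEP)    # abcd
    }'
}

join_demo
```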
Arnold Robbins, then Tim Menzies
arrray(a)
Ensure that an array is empty
Download from
LAWKER.
Below is a script I wrote to demonstrate how to use arrays, functions,
numerical vs string comparison, etc.
It also provides a framework
for people to implement sorting algorithms for comparison. I've
implemented a couple and I'm hoping others will contribute more in
the same style.
I put very few comments in deliberately because I
think the only parts that are hard to understand given some small
amount of reading awk manuals are the actual sorting algorithms,
and those should be well documented already given a reference except
my made-up "Key Sort" but I think that's very easy to understand.
Selection Sort, O(n^2): http://en.wikipedia.org/wiki/Selection_sort
Key Sort O(n^2): made up by Ed Morton for simplicity.
This code demonstrates the use
of arrays, functions, and string vs numeric comparisons in awk.
It also provides a framework for people to implement various
sorting algorithms in awk such as those listed at
http://en.wikipedia.org/wiki/Sorting_algorithm
Traverses the input array, storing its indices in the output
array in sorted order of the input array elements. e.g.
Can sort on specific fields given a field number and
field separator.
sortType of "n" means sort by numerical comparison, sort by
string comparison otherwise.
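The index-sorting idea can be sketched with a plain selection sort. This is a simplified sketch, not Ed Morton's script: the names `sel_sort`, `before`, and `sort_demo` are made up, and sorting on a field within each element is left out. The output array receives the indices of the input array, ordered by the input array's values, and a sortType of "n" selects numeric comparison.

```shell
sort_demo() {
    awk '
    # Fill out[1..n] with the indices of arr, ordered by the values of arr.
    function sel_sort(arr, out, sortType,    n, i, j, min, tmp, k) {
        split("", out)                             # start with an empty output
        n = 0
        for (k in arr) out[++n] = k
        for (i = 1; i < n; i++) {
            min = i
            for (j = i + 1; j <= n; j++)
                if (before(arr[out[j]], arr[out[min]], sortType)) min = j
            if (min != i) { tmp = out[i]; out[i] = out[min]; out[min] = tmp }
        }
        return n
    }
    # Adding zero forces a numeric test; adding "" forces a string test.
    function before(x, y, sortType) {
        return sortType == "n" ? (x + 0 < y + 0) : (x "" < y "")
    }
    BEGIN {
        v["a"] = 10; v["b"] = 9; v["c"] = 100
        n = sel_sort(v, o, "n")
        for (i = 1; i <= n; i++) printf "%s%s", o[i], (i < n ? " " : "\n")
        n = sel_sort(v, o, "s")
        for (i = 1; i <= n; i++) printf "%s%s", o[i], (i < n ? " " : "\n")
    }'
}

sort_demo
# prints: b a c   (numeric: 9, 10, 100)
#         a c b   (string: "10", "100", "9")
```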
Ed Morton
A recent discussion
in comp.lang.awk demonstrated a very cute, and very succinct, awk trick.
Neil Harris wanted to clean up this output:
He was using an uppercase J in vi to manually move the hostname's
IP address up onto the same line as its hostname.
But he wanted to automate the task with awk.
Kenny McCormack offered:
(Yes, that is the whole program.)
Ed Morton offered a more elegant version:
Finally, Kenny McCormack commented:
Much has been written in
comp.lang.awk and awk.info about using
Awk code to sort Awk arrays. While all that code is clever and
good, I wondered if a little shell scripting would simplify the task.
On the plus side:
On the negative side:
All that said, I use this code all the time- it is very useful during debugging to dump
the contents of the internal structures in my Awk code.
By the way, if you want to see an even shorter sort routine (that uses a platform
independent shell programming
trick), check out
David Long's amazing quicksort.
Input:
Output:
Print the array, no control string. Defaults to sorting on the index.
Print array, passing a numeric control string. Prints only the first three items, sorted on the index.
Print array, sorted on the contents.
Print an array with strings for keys. Prints in array label order.
Print an array with strings for keys. Prints in reverse array label order.
The code is short, yes?
Debbie Forbes
cat numbers | gawk -f quicksort2.awk
Download from
LAWKER.
Quicksort divides the input data around a randomly selected pivot, then recurses
on the divided data.
In quicksort2, the pivot is selected from
the first line of input.
Each data division is handled by a different UNIX pipe
and recursive gawk processes are called on the divided data.
Yes, this is not the fastest way to do it but (in theory anyway) it should be able
to handle very big data sets.
The output ignores repeated input values. I thought it was a problem with repeating the name of the pipes (hence the "rand()" labelling)
but that did not fix the issues.
Copyright (c) 2009 by David Long.
Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:
The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Original version: David Long, 2004. Tim Menzies added some modifications in 2009
to call recursive Gawk pipes on both sides of the pivot.
(The above code should print "3").
Download from
LAWKER.
The Levenshtein edit distance calculation is useful for comparing text strings for similarity, such as would be done with a spell checker.
Hi_saito (from awk.freeshell.org) has written what looks like a straightforward implementation of the reference algorithm described in the above-linked Wikipedia article. hi_saito's code is linked to rather than included outright because no licensing terms appear on the page.
Gnomon (from awk.freeshell.org) is planning to write a more compact (and hopefully speedier) implementation that will appear here soon. The plan is to compute and retain only those values that are necessary to calculate the edit distance, rather than calculating the entire NxM matrix. The lazy-evaluation method, which can post substantial speed improvements, probably requires more effort and code complexity than the performance gains would be worth; still, for short strings, the lazy code could perhaps be modeled via recursion by executing from the end of the string rather than the beginning. If experiments are run, the results will also appear here.
Here is the abovementioned streamlined implementation. There were eleven previous versions, all of which were benchmarked across gawk, mawk and busybox awk. The approaches started with a naive implementation and explored table-based, recursive (with no, single and shared memoization) and lazy models. As expected, the lazy version was incredibly fiddly and not pleasant to read or pursue. Findings will appear here later, but for now, here's the code.
Run demo.awk using gawk -f levenshtein.awk -f demo.awk.
Run utests.awk using gawk -f levenshtein.awk -f utests.awk.
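The listing itself is not reproduced on this page. As a minimal sketch of the "retain only the values that are needed" idea, the function below keeps just two rows of the matrix at a time. It is not Gnomon's code; the names levdist and min3 are illustrative.

```awk
function min3(a, b, c) { if (b < a) a = b; return (c < a) ? c : a }

# Levenshtein edit distance between s and t using two rows (prev, cur)
# instead of the full length(s) x length(t) matrix.
function levdist(s, t,    m, n, i, j, prev, cur, cost) {
    m = length(s); n = length(t)
    for (j = 0; j <= n; j++) prev[j] = j          # distance from "" to t[1..j]
    for (i = 1; i <= m; i++) {
        cur[0] = i                                # distance from s[1..i] to ""
        for (j = 1; j <= n; j++) {
            cost = (substr(s, i, 1) == substr(t, j, 1)) ? 0 : 1
            cur[j] = min3(prev[j] + 1,            # deletion
                          cur[j - 1] + 1,         # insertion
                          prev[j - 1] + cost)     # substitution (or match)
        }
        for (j = 0; j <= n; j++) prev[j] = cur[j] # current row becomes previous
    }
    return prev[n]
}

BEGIN { print levdist("kitten", "sitting") }      # prints 3
```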
pierre.gaston <a.t> gmail.com
That afternoon, I wrote a gawk script that widens the lines in a 256 color BMP version of the image - I can convert it back to a transparent background GIF later.
That script was presented in awk.info July 30, 2009.
What follows is an updated and extended version.
The script widens lines in .bmp files to make them more visible
when converted to TV video images. For the complete conversion, it
is also necessary to mung the line colors to get rid of interpolated
colors and to give some lines more contrast, but that is done elsewhere.
This function converts byte strings (binary numbers) into their
corresponding numeric strings so that they can be processed as gawk numbers.
The lookup table (CharString) is a global variable.
This code assumes that binary numbers are big-endian (most significant
byte first) - it is up to the calling program to order the bytes.
On the first use, the (global) LUT is created, then left for later use. It
consists of a list of characters from \000 to \377 in order - the (index
value minus 1) of a character multiplied by the power of 256 corresponding
to its position in the string is the byte's numerical weight. The function
doesn't care about the length of the byte string (within the integer limits
of the gawk version and port).
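The original function is not shown here, so the following is only a sketch of the lookup-table idea. It borrows the names Bytes2Number and CharString from the text but deliberately uses a variant table: it starts at \001 rather than \000, so index() returns the byte value directly and a NUL byte (not found, index 0) still maps to 0. It assumes each byte is read as one character (for example, by running under LC_ALL=C).

```awk
# Convert a big-endian byte string to a number using a lookup table.
# CharString is a global, built once on first use.
function Bytes2Number(bytes,    i, n) {
    if (CharString == "")
        for (i = 1; i <= 255; i++)
            CharString = CharString sprintf("%c", i)
    n = 0
    for (i = 1; i <= length(bytes); i++)
        # each earlier byte is worth 256 times more than the next one
        n = n * 256 + index(CharString, substr(bytes, i, 1))
    return n
}

BEGIN { print Bytes2Number("\001\002") }   # 1*256 + 2 = 258
```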
Uses a brute force approach to factor the image size into width and height
numbers that actually match the real image size. It searches around the
nominal values for a pair of numbers that, when multiplied together, produce
the known size of the image in pixels.
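One way such a brute-force search could look is sketched below. The function name RealSize comes from the text, but the parameter list, the search radius (span) and the dims result array are assumptions made for this illustration.

```awk
# Search widths near the nominal width w0 for one that divides the pixel
# count exactly and gives a height within span of the nominal height h0.
# On success, stores the result in dims["width"], dims["height"] and returns 1.
function RealSize(npixels, w0, h0, span, dims,    d, s, w, h) {
    for (d = 0; d <= span; d++)              # try the closest widths first
        for (s = -1; s <= 1; s += 2) {       # below, then above, the nominal width
            w = w0 + s * d
            if (w < 1 || npixels % w != 0) continue
            h = npixels / w
            if (h >= h0 - span && h <= h0 + span) {
                dims["width"] = w
                dims["height"] = h
                return 1
            }
        }
    return 0                                 # nothing plausible found
}

BEGIN {
    # 330000 pixels with slightly wrong nominal dimensions of 598 x 551
    if (RealSize(330000, 598, 551, 5, d))
        print d["width"] " x " d["height"]
}
```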
It is necessary to tell gawk to read/write the file as binary, especially under
Windows where ^Z in files is a killer. Setting BINMODE to 3 will also work,
but it throws error messages.
Setting FS to null causes gawk to make each byte a separate field.
Testing indicates that, in Windows at least, it is necessary to specify RS, even though
it would appear redundant to set it to \n - not doing so results in 0A0D being
replaced with 0A in the output, with the loss of one byte for each occurrence.
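Taken together, the three settings described above might be collected in a BEGIN block like the following sketch. BINMODE and the empty-FS behaviour are gawk features; the string value "rw" is used here on the assumption that it is the quieter alternative to the numeric 3 mentioned above.

```awk
BEGIN {
    BINMODE = "rw"   # gawk: binary mode for both reading and writing
    FS = ""          # gawk: an empty FS makes every character a separate field
    RS = "\n"        # set explicitly, as recommended for Windows
}
```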
The value is arbitrary - it has been tested using one of the line colors.
Read the file into an array. If there are multiple lines, that is, if RS appears
in the file, insert the record separator back into the array at the end of
each line for which RT exists.
Closing FILENAME here allows overwriting the original file - if that is
desired, comment out the next line (which creates a new filename for the
output).
Regarding image parameters: Width and Height are in pixels; Depth is the number of bytes
per pixel; Data is the zero based index of the actual image in the file; Size
refers to the bytes in the file, not the image; ImgSize is the number of pixels
in the image.
Unfortunately, Width and Height may be wrong: RealSize() calculates the actual values
as found from the data block.
Once the image parameters are set, the two arrays for the image can be built: one to contain an unmodified
copy (A) and one to contain a copy to be modified (B). These arrays are
indexed by line and dots (Height, Width); data are complete pixels. The C
array is used to determine the background color: it uses the pixel data as indexes
and the count of the number of copies of that pixel as values - the largest value
represents the most common color, and assuming that the image is mostly background,
therefore the background color. This assumption will be true for almost all line art.
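The background-color step can be sketched as a small helper that scans a count array such as C. The function name MostCommon and the one-character "pixels" in the demo are illustrative only.

```awk
# Return the index of count[] with the largest value, i.e. the most
# common pixel value when count[] is keyed by pixel data.
function MostCommon(count,    k, best, bestn) {
    bestn = -1
    for (k in count)
        if (count[k] > bestn) { bestn = count[k]; best = k }
    return best
}

BEGIN {
    row = "....#....#...."                  # toy image: "." background, "#" line
    for (i = 1; i <= length(row); i++)
        C[substr(row, i, 1)]++              # pixel data as index, tally as value
    print "background is", MostCommon(C)
}
```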
When performing line widening: for each pixel that is not part of
the background, copy its color to the four surrounding pixels, provided that
they are background. This approach prevents one line from encroaching on another,
but does not prevent the ends of lines that do not intersect other lines from
growing by one pixel on each pass through the program for each free end.
u, v, w, and z (z has been reused) are the coordinates of the four pixels
surrounding the one in work (defined by x and y).
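A minimal sketch of one widening pass is shown below. It is not the author's code: the names Widen and Paint are hypothetical, the arrays are indexed as [row, column], and the "is it background?" test is made against the unmodified copy A, which is one reading of the description above.

```awk
# Paint B[y, x] with color if (y, x) is inside the image and is background in A.
function Paint(A, B, y, x, H, W, bg, color) {
    if (y >= 1 && y <= H && x >= 1 && x <= W && A[y, x] == bg)
        B[y, x] = color
}

# One widening pass: every non-background pixel of A spreads its color
# to its four neighbours in B. B must start out as a copy of A.
function Widen(A, B, H, W, bg,    x, y) {
    for (y = 1; y <= H; y++)
        for (x = 1; x <= W; x++)
            if (A[y, x] != bg) {
                Paint(A, B, y - 1, x, H, W, bg, A[y, x])   # above
                Paint(A, B, y + 1, x, H, W, bg, A[y, x])   # below
                Paint(A, B, y, x - 1, H, W, bg, A[y, x])   # left
                Paint(A, B, y, x + 1, H, W, bg, A[y, x])   # right
            }
}

BEGIN {
    H = 3; W = 5
    for (y = 1; y <= H; y++)
        for (x = 1; x <= W; x++) { A[y, x] = "."; B[y, x] = "." }
    A[2, 3] = "#"; B[2, 3] = "#"            # a single line pixel in the middle
    Widen(A, B, H, W, ".")
    for (y = 1; y <= H; y++) {              # show the widened image
        line = ""
        for (x = 1; x <= W; x++) line = line B[y, x]
        print line
    }
}
```

Reading from A while writing to B is what keeps a freshly painted pixel from being widened again in the same pass.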
Note the final nested for loops in the above code.
After the B array has been modified, the target file can be completed
by reading that array out to the file pixel by pixel. The array cannot be
output during processing because pixels that have already been through
the processor can still be changed.
Ted Davis tdavis@mst.edu.
by Ted Davis
(For an update to this page, see wdenbmp.awk).
My boss wants to put NOAA weather radar images in a looping presentation
that is displayed as 720 video on the 1040 LCD TV in the atrium. He
couldn't figure out how to download the various layers needed, so he gave
me the task. Of course, I had a sample composite image for him in half an
hour. It looked terrible on the TV: the writing came out as just a blur
and the county and state lines (single pixel mostly) were essentially
invisible. Obviously, I could make my own 'cities' overlay, but no tools
I had would convert the 'counties' image to any usable vector format for
line resizing.
This afternoon, I wrote a gawk script that widens the lines in a 256 color
BMP version of the image - I can convert it back to a transparent
background GIF later.
The power and range of gawk never cease to amaze me - a 42 line (pretty
printed) program was all it took.
The script uses FS="" to convert the entire file into 331 078 single byte
fields. The first 1078 went into a header string and were printf()ed to the
outfile. The rest went into a pair of 550 row by 600 column arrays.
Then I looked at each pixel in the A array, and if it was not the
background color, made the four surrounding pixels in the B array the same
color, provided they were background color (not part of an existing line).
Then I read out the array in order and printf()ed it to the outfile. The
resulting overlay should be readable after changing the colors to make the
dark lines brighter and moving its location in the stack to be on top of
the other images.
There is one known flaw that I have no intention of addressing: lines that
do not intersect other lines grow longer by one pixel for each pass
through the program.
While the actual code is proprietary, the following code snippets show most of the
idioms required to handle binaries.
The following code initializes the CharString variable needed by Bytes2Number.
The above code generates the list of bytes for the Bytes2Number function.
Note that the string "ABC" does not appear in any of the image files processed by this code. Hence, the above lines
mean that the whole image ends up in one record.
The next block analyzes the header to extract useful information.
(note: I found that the image size in the header may be wrong, notably in files resized by Paint Shop Pro. Calculating it proved more reliable.)
I've just installed openSUSE Milestone 8 (11.2) in a virtual machine on my PC.
In about half an hour, I also downloaded the MySQL, gawk and SPAWK (SQL + AWK) sources,
and compiled and built the SPAWK libraries (/usr/lib/libspawk.so and /usr/lib/libspawk_r.so).
I tested the module and it worked just fine, so I've uploaded the binary tarball for this distro to the SPAWK project
(http://code.google.com/p/spawk/downloads/list).
Have a Happy New Year!
He has also written extensive tutorial notes at the SPAWK wiki.
SPAWK is an elegant collection of functions for accessing and
updating MySQL databases from within GNU awk programs. The SPAWK
module consists of a single awk extension library, namely libspawk.so,
which may be loaded in awk programs using the standard extension
awk function:
Here's a short example of using SPAWK (for more details, see
http://sites.google.com/site/spawkinfo/Home/manual).
When spawk_select is called, SPAWK sends the query given so far
(some spawk_query calls may have preceded the spawk_select) to the
current server (recall that a "server", from SPAWK's point of view,
is a connection to the actual MySQL server mysqld). After calling
spawk_select, the server is ready to return the results to the awk
process via spawk_data, spawk_first or spawk_last calls. Alternatively,
at any time we can clear the result set and release the server
with a spawk_clear function call.
The main data receiver is the spawk_data function. This function is
usually called with one or two arguments. The first argument is an
array to be used as a data transfer vehicle, while the optional second
argument may be used to hold the null valued columns.
spawk_data returns the number of columns of each returned data row,
or zero if there are no more data to return (EOD). The spawk_first
function's arguments and return values are exactly the same as those
of spawk_data, but the rest of the
data will be lost; that is, it gets the next available data row and
releases the server. The spawk_last function is similar, but the row
returned is the last row of the result set. By the way, the
spawk_last function is less efficient than spawk_first; actually,
there is no particular reason to call spawk_last at all! Let's see
some examples:
Things need to be explained:
These pages focus on macro pre-processors (a natural application for Awk).
Download from
LAWKER
In general, specify the state machine in FILE.fsm and define the
action functions in FILE_actions.c. Then run
fsm.awk, then
compile and link
fsm.c,
fsm_FILE.c and any driver file. That's it.
Multiple fsms may be built and run in the same application using the
function fsm_allocFsm(). Moreover, calls to fsm() may be nested
using the same state machine as long as a different context is used.
fsm_allocFsm() returns a context number that must be stored and passed
to fsm() on each invocation. In the provided sample, the context is
stored in myContext in test_driver.c.
Fsm() may be called either by polling for events or from inside an
interrupt service routine. If fsm() is called from an interrupt
service routine, it must be protected from nested calls using the
same context. Interrupting calls using other contexts is permitted.
Note that the function fsminit() is called only once and should not
be called for each fsm. If there are special requirements for a
given fsm, an appropriate init function should be provided and
called for that particular fsm.
Currently, fsm traceEnable is set to true and cannot be disabled
(without changing fsm_allocFsm()). An array is maintained within
each fsm context wherein each state and event are recorded for each
call to fsm().
Fsm.awk is an awk script designed to read a finite state machine (fsm)
specification and produce C files which implement that fsm. The file
fsm.c,
included in the distribution, provides the actual state transition
function, and the user provides the state transition "action" functions
and any special initialization.
The fsm distribution consists mainly of
fsm.awk and
fsm.c, although
there are a number of header files for declarations - doesn't get
much simpler than that.
Typically, the fsm specification is named in the form fsm_name.fsm, but
may be named any legal filename. The action functions may be placed in
any number of files by any name the user chooses. Each function should
return either true or false so that the appropriate next state may be
chosen.
The chief benefit of using fsm.awk
is easy to read, consistent state
machine specifications and reuse of existing, tested code. Multiple
tables and multiple users are happily accommodated. It's not hi-tech,
but it provides an easy avenue to generalization and consistency where
fsms are required.
This distribution represents a rewrite of an earlier version written
many years ago - rewritten with newer versions of awk and gcc in mind.
Consequently, it has not been tested using other compiler suites.
There are no known bugs, but, it IS a rewrite.
Although a good candidate for C++, C was used because C++ was not being
used in any of the systems currently using fsm-gen. Maybe a C++ version
will be in a subsequent release.
The distribution provides the following files:
To build the sample,
When fsm.awk is run, (run via fsm.awk fsmName.fsm) it produces two
files, fsm_fsmName.c and fsm_fsmName.h. Fsm_fsmName.c will contain
an array of struct fsm_s tagged as fsm_fsmName, eg.,
In the fsm distribution, the files fsm_test.c, fsm_test.h and
test_actions.c may be built as an executable sample.
The file fsm.c should be compiled and linked with the final executable
as it contains the C code necessary to read the generated tables and
update context.
Building the example should compile error free with the exception of
a warning about using "gets()" in the sample driver. Hey - it's just
a driver for a test.
In its purest form, a fsm specifies state, event, action, new state.
For example, a rudimentary ftp server might be specified as follows:
It is useful on occasion to make the next state depend on the success
or failure of the action function. Here, "ok" and "fail" mean "true"
and "false", respectively. For example, as each buffer is sent
it would be useful to specify a different state if sendFile() returns
fail (indicating EOF).
State, event, action, and new state may be specified according to the
same rules as C variables/functions. In the above table, the words
CONNECTED, GET_REQ, SENDING, and IDLE are used to generate #defines,
and the action sendBuffer is the name of a user supplied function.
The file test.fsm illustrates several idioms: means, when receiving event EVENT_1 or EVENT_2 in state S1,
Included in the distribution are test.fsm and test_actions.c which
implement a very simple state machine called "test". After the
executable "test" is produced (via make), it may be used to show the
behavior of the fsm.
The example fsm was built and tested with gcc version 4.0.2 and awk
version 3.1.4.
On running "test", first the line "testing fsm test" is printed, then a
line indicating the initial state. It then asks for the next event.
All events in the example are the lowercase letters 'a' thru 'd',
entered from the keyboard. A special event 'z' will cause the trace to
be dumped. Entering 'q' will cause test to exit. Note that to keep
the example simple, other than special events 'z' and 'q', there is no
checking of input for being outside the known set of events. A sample
session might look like this:
Copyright 2008 Wm Miller
This file is part of fsm-gen, and is distributed under the terms of the
GNU Lesser General Public License .
Copies of the GNU General Public License and the GNU Lesser General Public
License are included with this distribution in the files COPYING and
COPYING.LESSER, respectively.
Fsm-gen is free software: you can redistribute it and/or modify it under the
terms of the GNU Lesser General Public License as published by the Free
Software Foundation, either version 3 of the License, or (at your option)
any later version.
Fsm-gen is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for
more details.
You should have received a copy of the GNU Lesser General Public License
along with fsm-gen. If not, see
http://www.gnu.org/licenses.
gawk -f cryptosig.awk tim@menzies.us
Download from
LAWKER.
Generates a one-line Awk program that can print your email, from a seemingly jumbled string.
This program can then become your email sig and only the Awk cognoscente can generate a reply.
Example
This
can be tested as follows:
or
both of which should print "tim@menzies.us".
Download from
LAWKER.
Generates random signatures. Signatures and generation code are included
in the same file, so installation is just a matter of calling one file.
Most of the file is a large "here" document. Paragraph 1 of that
document is always added to the signatures, followed by one of
the following paragraphs, selected at random.
To add to the signatures, include them in the here document,
with one preceding blank line.
Tim Menzies
This script calculates the correlation between two columns of numbers.
For more Sherwood scripts, see
Some useful Awk scripts.
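The script itself is not reproduced on this page. As a rough sketch of what such a calculation involves (not Tim Sherwood's code), the function below computes the Pearson correlation coefficient of two arrays; the name pearson and the array-based interface are assumptions made for this example.

```awk
# Pearson correlation of x[1..n] and y[1..n].
# Returns 0 when either column has no variance (the coefficient is undefined).
function pearson(x, y, n,    i, sx, sy, sxx, syy, sxy, den) {
    for (i = 1; i <= n; i++) {
        sx  += x[i];        sy  += y[i]
        sxx += x[i] * x[i]; syy += y[i] * y[i]
        sxy += x[i] * y[i]
    }
    den = sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return (den == 0) ? 0 : (n * sxy - sx * sy) / den
}

# To use it on two columns of input, one could add rules such as:
#   { x[NR] = $1; y[NR] = $2 }
#   END { print pearson(x, y, NR) }

BEGIN {
    split("1 2 3 4", a, " ")
    split("2 4 6 8", b, " ")
    print pearson(a, b, 4)      # b is exactly 2*a, so the columns are perfectly correlated
}
```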
This outputs
Tim Sherwood
These pages focus on music players and music analysis
tools in Awk.
These pages focus on tools for larger Gawk programs; e.g. ways to
load multiple files or
auto-generate documentation
straight from the source code.
Download from
LAWKER.
Scan a string for mysql escaped tokens and replace them with the appropriate
character. This is a fairly slow operation for large strings but it's
necessary.
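The original function is not shown on this page. The sketch below illustrates the character-by-character scan such a routine needs; the name mysql_unescape is hypothetical, and the set of escapes handled is the one documented for MySQL string literals.

```awk
# Replace MySQL backslash escapes in s with the characters they stand for.
function mysql_unescape(s,    out, i, c, n) {
    n = length(s)
    out = ""
    for (i = 1; i <= n; i++) {
        c = substr(s, i, 1)
        if (c == "\\" && i < n) {
            c = substr(s, ++i, 1)                    # the character after the backslash
            if      (c == "n") c = "\n"
            else if (c == "t") c = "\t"
            else if (c == "r") c = "\r"
            else if (c == "b") c = "\b"
            else if (c == "Z") c = "\032"            # Control-Z
            else if (c == "0") c = sprintf("%c", 0)  # NUL; only reliable in gawk
            else if (c == "%" || c == "_") c = "\\" c   # these keep their backslash
            # any other escaped character (\\, \', \") stands for itself
        }
        out = out c
    }
    return out
}

BEGIN { print mysql_unescape("it\\'s a tab:\\t<-") }
```

Building the result one character at a time is the reason this kind of scan is slow on large strings.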
"THE BEER-WARE LICENSE" (Revision 43) borrowed from FreeBSD's jail.c:
Scott S. McCoy
By Carlo Strozzi (carlo@strozzi.it).
NoSQL is a fast, portable, relational database management system
without arbitrary limits, (other than memory and processor speed)
that runs under, and interacts with, the UNIX Operating System.
It uses the "Operator-Stream Paradigm" described in Unix Review
(March, 1991, page 24, "A 4GL Language") where there are a number
of "operators" that each perform a unique function on the data.
These operators are written in Awk and C and are designed to be lightweight
(that is, to have a small memory footprint and to start up quickly).
The main reason the original RDB system was turned
into NoSQL is precisely that the former is written entirely in Perl.
Perl is a good programming language for writing self-contained
programs, but its pre-compilation phase and long start-up time are
worth paying only if once the program has loaded it can do everything
in one go. This contrasts sharply with the Operator-stream Paradigm,
where operators are chained together in pipelines of two, three or
more programs. The overhead associated with initializing Perl at
every stage of the pipeline makes pipelining Perl inefficient. A
better way of manipulating structured ASCII files is to use the AWK
programming language, which is much smaller than Perl, is more
specialized for this task, and is very fast at startup.
For more information on NoSQL, see the
NoSQL home page.
Download from
LAWKER
or, for the latest version, from
SourceForge
Plaiter (pronounced "player") is a command line front end to command
line music players. It uses shell scripting to try to create the
command line music player that Plait would have used if it already
existed. It complements Plait but is also quite useful on its own,
especially if you already use mpg123 or similar programs and find
yourself wanting more features.
What does Plaiter do that (say) mpg123 can't already? It queues tracks,
first of all. Secondly, it understands commands like play, pause,
stop, next and prev. Finally, unlike most of the command line music
players out there, Plaiter can handle a play list with more than one
type of audio file, selecting the proper helper app to handle each
type of file you throw at it.
Plaiter will automatically configure itself to use ogg123, mpg123,
and/or mpg321, if they are installed on your system. If you have a
helper application that plays other types of audio, Plaiter can be
configured to use it as well.
Like many of us, Plaiter is part daemon and part controller. The
controller builds a play list from the files you provide on the
command line and forwards commands to the daemon. The daemon reads
commands and executes them by running helper applications.
Copyright (C) 2005, 2006 by Stephen Jungels. Released under the GPL.
Written by Stephen Jungels (sjungels@gmail.com)
http://www.music-cog.ohio-state.edu/HumdrumDownload/downloading.html.
The Humdrum Toolkit provides a set of free software tools intended
to assist in music research. The toolkit is suitable for use in a
wide variety of computer-based musical tasks.
The Humdrum web site contains a
comprehensive collection of over 200 web pages providing both
detailed and summary information concerning all aspects of the
Humdrum Toolkit.
About 15% of the code is written in C,
another 15% in kornshell, and about 2% using the
LEX lexical parser and YACC compiler-compiler.
The bulk of the code is written in AWK.
Questions that can be answered in Humdrum are:
(For a longer list of such questions, see the Humdrum
sample problems page.)
David Huron
Go to
http://www.music-cog.ohio-state.edu/Humdrum/.
To rearrange the items in the input list:
To rearrange the items in a copy of the input list:
The above calls assume that array item zero stores the length of the array.
If this is not the case, use:
Download from LAWKER.
Suppose we want to shuffle the items of an array into
a random order. This shuffle sort does so in linear
time and memory.
The algorithm comes from the dawn of computer time but
I first heard of it from Bart Massey (at Portland State). Thank Bart for the
clarity of the explanation and blame me for any silliness in the
implementation.
A simple way to shuffle an input array of elements is to:
This
algorithm is clearly correct. However, the algorithm
requires time quadratic in the size of the list, and 2x
space.
We can easily reduce the time complexity to O(N).
The only thing done with the input array is to
select random elements from it, the order of the elements
in it is irrelevant. Therefore, instead of closing the
hole left by a removed element by shifting elements,
we'll close it by moving the first remaining element of the
input array to fill the gap.
Note an important invariant of the algorithm:
This means that once an element is
removed from the input array and the hole filled, there is
a fresh hole created right at the beginning of the input
array. Let us put the newly removed element in that hole.
Now we can dispense with the output array altogether, and
just return the input array. Now the space complexity is
just x+1.
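The in-place idea described above can be sketched as follows. This is not the LAWKER code: the name shuffle_in_place is illustrative, and the array length is passed as an argument instead of being read from element zero.

```awk
# Shuffle a[1..n] in place. At step i, a[1..i-1] holds the output so far and
# a[i..n] holds the remaining input; a random remaining element is swapped
# into slot i, which both removes it and fills the hole it leaves.
function shuffle_in_place(a, n,    i, j, tmp) {
    for (i = 1; i < n; i++) {
        j = i + int(rand() * (n - i + 1))   # random slot in i..n
        tmp = a[i]; a[i] = a[j]; a[j] = tmp
    }
}

BEGIN {
    srand()
    n = split("a b c d e f", items, " ")
    shuffle_in_place(items, n)
    for (i = 1; i <= n; i++) printf "%s%s", items[i], (i < n ? " " : "\n")
}
```

Each element is touched a constant number of times, which is where the linear running time comes from.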
This code assumes that the array "a" stores its size at "a[0]".
nshuffle is fast, but rearranges the order of
items in the original list.
shuffle generates a new
copy of the list with the items in a random order.
nshuffle also assumes that the list stores its size
at position zero. If this is not the case, use shuffles.
By number of loop iterations
One way to use the above is to run down a list in a random order. For example:
The above can be run using
If you run this twice, you'll see two different orderings. Here's one:
And here's another:
If you are generating the above lists very quickly, then be aware that
srand() initializes its random number generator using the time of day in seconds.
So, if you are calling the above command line many times per second, you can
get repeated outputs.
The fix is to supply a seed from the Bash $RANDOM variable:
Even when called much faster than once a second, the above call will generate (far) fewer repeats.
If you want to repeat some prior run (say, during debugging),
set the Seed variable on the
command line using (e.g.)
This will always print out the same ordering.
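The command lines are not reproduced on this page; the sketch below shows one way a Seed variable could be honoured. The helper name seed_rng is hypothetical, and it assumes Seed is set before BEGIN runs, for example with gawk -v Seed=$RANDOM -f script.awk (script.awk being a placeholder name).

```awk
# Seed the generator from a caller-supplied value when there is one;
# otherwise fall back to srand()'s default, the time of day in seconds.
function seed_rng(seed) {
    if (seed != "") srand(seed)   # same seed, same sequence: repeatable runs
    else            srand()
}

BEGIN {
    seed_rng(Seed)                # Seed is empty unless set on the command line
    print rand()
}
```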
Tim Menzies (Note: see recent update.)
Download from
LAWKER
or
a tar file
or from
SourceForge.
runawk - wrapper for AWK interpreter
runawk [options] program_file
runawk -e program
After years of using AWK for programming I've found that despite
its simplicity and limitations AWK is good enough for scripting a wide
range of different tasks. AWK is not as powerful as its bigger
counterparts like Perl, Ruby, TCL and others, but it has its own
advantages like compactness, simplicity and availability on almost all
UNIX-like systems. I personally also like its data-driven nature and
token orientation, a very useful technique for simple text processing
utilities.
But! Unfortunately awk interpreters lack some important features and
sometimes do not work as well as they should.
Problems I see (some of them, of course):
AWK lacks support for modules. Even if I create small programs, I
often want to use the functions created earlier and already used in
other scripts. That is, it would be great to organise functions into
so-called libraries (modules).
In order to pass arguments to
Example: awk_program:
Shell session:
In my opinion awk_program script should work like this
It is possible using runawk.
When
Example: awk_program:
Shell session:
Ideally awk_program should work like this
runawk was created to solve all these problems.
Display help information.
Display version information.
Turn on a debugging mode in which runawk prints the argument list
with which the real awk interpreter will be run.
Always add the stdin file name to the list of awk arguments.
Do not add the stdin file name to the list of awk arguments.
Specify program. If -e is not specified, the program is read from
program_file.
Under UNIX-like OSes you can use runawk
by beginning your script with line or something like this instead of or similar.
In order to activate modules you should add them into the awk script like this
that is, the line that specifies the module name is treated as a comment line
by a normal AWK interpreter but is processed specially by runawk.
Note that #use should begin in column 0:
no spaces are allowed before it and no spaces are allowed between
# and use.
Also note that AWK modules can also "use" other modules and so forth.
All of them are collected in depth-first order
and each one is added to the list of
awk interpreter arguments, prepended with the -f option.
That is, the #use directive is *NOT* similar to #include in
the C programming language:
runawk's module code is not inserted in place of #use.
Runawk's modules are closer to Perl's "use" command.
If a module is mentioned more than once, only one -f
will be added for it, i.e. duplicates are removed automatically.
The position of the #use directive in a source file does matter, i.e.
the earlier a module is mentioned, the earlier -f will be generated for it.
Example: If you run or the following command will actually run. You can check this by running
Modules are first searched for in the directory where the main
program (or the module in which the #use directive is specified) is placed.
If a module is not found there, then the
AWKPATH environment variable is
checked. AWKPATH keeps a colon-separated
list of search directories.
Finally, the module is searched for in the system runawk modules directory,
by default PREFIX/share/runawk, but this can be changed at build time.
An absolute path of the module can also be specified.
In order to pass arguments to the AWK script correctly, runawk
treats arguments beginning with a `-' sign (minus) specially.
The following command or will actually run therefore -s, -f, -o options will be passed to awk's ARGV/ARGC variables
together with file1 and file2.
If all arguments begin with `-' (minus),
runawk will add the stdin filename to the end of the argument list
(unless the -I option is specified), i.e. running or will actually run the following
Like some other interpreters,
runawk can obtain the script from the command line like this
For some reason you may prefer one AWK interpreter or another; you can select it with the help of the
#interp command like this
The reason may be efficiency for a particular task, useful but
non-standard extensions or anything else.
Note that the #interp directive should also begin in column 0:
no spaces are allowed before it or between # and interp.
In some cases you may want to run the AWK interpreter with a
specific environment. For example, your script may be oriented to
process ASCII text only. In this case you can run AWK with the LC_CTYPE=C
environment and use regexp ranges.
runawk provides the #env directive for this. Strings inside double quotes
are passed to the putenv(3) libc function.
Example:
If the AWK interpreter exits normally, runawk exits with its exit
status. If the AWK interpreter was killed by a signal, runawk
exits with exit status 128+signal.
Colon-separated list of directories where awk modules are searched for.
Sets the path to the AWK interpreter used by default,
i.e. this variable overrides the compile-time default.
Note that the #interp directive overrides this.
Copyright (c) 2007-2008 Aleksey Cheusov <vle@gmx.net>
Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:
The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Please send any comments, questions, bug reports etc. to me by e-mail
or (even better) register them at the sourceforge project home. Feature
requests are also welcome.
Download from
LAWKER.
M1 is a simple macro language that
supports the essential operations of defining strings and replacing strings in text by
their definitions. It also provides facilities for file inclusion and for conditional
expansion of text. It is not designed for any particular application, so it is mildly useful
across several applications, including document preparation and programming. This
paper describes the evolution of the program; the final version is implemented in about
110 lines of Awk.
M1 copies its input file(s) to its output unchanged except as modified by
certain "macro expressions." The following lines define macros for
subsequent processing:
A definition may extend across many lines by ending each line with
a backslash, thus quoting the following newline.
Any occurrence of @name@ in the input is replaced in the output by
the corresponding value.
@name at beginning of line is treated the same as @name@.
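The define-and-substitute core can be sketched in a few lines. This is a simplified illustration rather than the m1 program itself: the names define, expand and symtab are chosen for this example, and only single-line @define and the @name@ form are handled (no @default, @if, @include or backslash continuation).

```awk
# Record a definition from a line of the form: @define NAME value text...
function define(line,    name) {
    sub(/^@define[ \t]+/, "", line)         # drop the keyword
    name = line
    sub(/[ \t].*$/, "", name)               # first word is the macro name
    sub(/^[^ \t]+[ \t]*/, "", line)         # the rest is its value
    symtab[name] = line
}

# Replace every @name@ in s whose name is defined; other @ signs are kept.
function expand(s,    out, i, j, name) {
    out = ""
    while ((i = index(s, "@")) > 0) {
        out = out substr(s, 1, i - 1)
        s = substr(s, i + 1)                # text after the opening @
        j = index(s, "@")
        name = (j > 0) ? substr(s, 1, j - 1) : ""
        if (j > 0 && (name in symtab))
            s = symtab[name] substr(s, j + 1)   # rescan the value for nested macros
        else
            out = out "@"                   # not a macro call: keep the @ literally
    }
    return out s
}

BEGIN {
    define("@define who the committee")
    print expand("Dear @who@, thank you for writing.")
}
```

Because the substituted value is rescanned, a macro may refer to other macros, but a macro defined in terms of itself would loop forever in this sketch.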
We'll start with a toy example that illustrates some simple uses of m1. Here's a form letter that
I've often been tempted to use:
If that file is named sayno.mac, it might be invoked with this text:
Recall that a @default takes effect only if its variable was not previously @defined.
I've found m1 to be a handy Troff preprocessor. Many of my text files (including this one) start
with m1 definitions like:
Even a simple form of arithmetic would be useful in numeric sequences of definitions. The longer m1
variables get around Troff's dreadful two-character limit on string names; these variables are also
available to Troff preprocessors like Pic and Eqn. Various forms of the @define, @if, and @include
facilities are present in some of the Troff-family languages (Pic and Troff) but not others (Tbl); m1
provides a consistent mechanism.
I include figures in documents with lines like this:
The two @defines are a hack to supply the two parameters of number and title to the figure. The
figure might be set off by horizontal lines or enclosed in a box, the number and title might be printed at
the top or the bottom, and the figures might be graphs, pictures, or animations of algorithms. All
figures, though, are presented in the consistent format defined by FIGSTART and FIGEND.
I have also used m1 as a preprocessor for Awk programs. The @include statement allows one
to build simple libraries of Awk functions (though some, but not all, Awk implementations provide
this facility by allowing multiple program files). File inclusion was used in an earlier version of this
paper to include individual functions in the text and then wrap them all together into the complete m1
program. The conditional statements allow one to customize a program with macros rather than run-time
if statements, which can reduce both run time and compile time.
The most interesting application for which I've used this macro language is unfortunately too
complicated to describe in detail. The job for which I wrote the original version of m1 was to control a
set of experiments. The experiments were described in a language with a lexical structure that forced
me to make substitutions inside text strings; that was the original reason that substitutions are bracketed
by at-signs. The experiments are currently controlled by text files that contain descriptions in the experiment
language, data extraction programs written in Awk, and graphical displays of data written in Grap;
all the programs are tailored by m1 commands.
Most experiments are driven by short files that set a few key parameters and then @include a
large file with many @defaults. Separate files describe the fields of shared databases:
These files are @included in both the experiment files and in Troff files that display data from the
databases. I had tried to conduct a similar set of experiments before I built m1, and got mired in muck.
The few hours I spent building the tool were paid back handsomely in the first days I used it.
M1 uses a fast substitution function.
The idea is to process the string from left to right, searching for the first substitution to be made.
We then make the substitution, and rescan the string starting at the fresh text. We implement this idea
by keeping two strings: the text processed so far is in L (for Left), and unprocessed text is in
R (for Right). Here is the pseudocode for dosubs:
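A minimal sketch of this L/R scheme, written as an Awk function inside a small shell function so that it can be run directly. It is an illustration under assumptions rather than the original dosubs: the names dosubs_demo, symtab, who, and name are hypothetical, and an @...@ pair whose name is undefined is simply copied through.

```sh
# Hypothetical demo wrapper: reads text on stdin and expands @name@ references.
dosubs_demo() {
  awk '
  # L holds finished text, R holds text that still has to be scanned.
  function dosubs(s,    L, R, i, j, name) {
      L = ""; R = s
      while ((i = index(R, "@")) > 0) {
          L = L substr(R, 1, i - 1)          # text before the at-sign is done
          R = substr(R, i + 1)               # drop the opening at-sign
          j = index(R, "@")
          if (j == 0) { L = L "@"; break }   # no closing at-sign: keep it literally
          name = substr(R, 1, j - 1)
          if (name in symtab)
              R = symtab[name] substr(R, j + 1)   # rescan starting at the fresh text
          else
              L = L "@"                      # not a macro: emit at-sign, keep scanning
      }
      return L R
  }
  BEGIN { symtab["who"] = "@name@"; symtab["name"] = "world" }   # sample definitions
  { print dosubs($0) }
  '
}
```

Because the replacement text is placed back at the front of R, a value that itself contains @name@ is expanded on the next trip around the loop. A macro that refers to itself would loop forever in this sketch.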
There are many ways in which the m1 program could be extended. Here are some of the biggest
temptations to "creeping creaturism":
The following code is short (around 100 lines), which is
significantly shorter than other macro processors; see,
for instance, Chapter 8 of Kernighan and Plauger [1981].
The program uses several techniques that can be applied in many Awk programs.
Put next input line into global string "buffer".
Return "EOF" or "" (null string).
M1 is three steps lower than m4. You'll probably miss something
you have learned to expect.
M1 was documented in the 1997 sedawk book by Dale Dougherty & Arnold Robbins (ISBN 1-56592-225-5)
but may have been written earlier.
This page was adapted from
131.191.66.141:8181/UNIX_BS/sedawk/examples/ch13/m1.pdf (download from
LAWKER).
Jon L. Bentley.
Download from
LAWKER.
M5 is a Bourne shell script for invoking m5.awk, which actually
performs the macro processing. m5, unlike many
macroprocessors, does not directly interpret its input.
Instead it uses a two-pass approach in which the first pass
translates the input to an awk program, and the second pass
executes the awk program to produce the final output.
Details of usage are provided below.
This two-pass system
means that macros can contain awk commands, to be
executed on the second pass. This greatly extends the expressiveness
of the m5 macro system.
As noted in the synopsis above, its invocation may require
specification of awk, gawk, or nawk, depending on the version
of awk available on your system. This choice is
further complicated on some systems, e.g. Sun, which have
both awk (original awk) and nawk (new awk). Other systems
appear to have new awk, but have named it just awk. New awk
should be used, regardless of what it has been named. The
macro processor translator will not work using original awk
because the former, for example, uses the built-in function
match().
The following options are supported:
The program that performs the first pass noted above is
called the m5 translator and is named m5.awk. The input to
the translator may be either standard input or one or more
files listed on the command line. An input line with the
directive prefix character (# by default) in column 1 is
treated as a directive statement in the MP directive
language (awk). All other input lines are processed as text
lines. Simple macros are created using awk assignment
statements and their values referenced using the substitution
prefix character ($ by default). The backslash (\) is
the escape character; its presence forces the next character
to literally appear in the output. This is most useful when
forcing the appearance of the directive prefix character,
the substitution prefix character, and the escape character
itself.
All input lines are scanned for macro references that are
indicated by the substitution prefix character. Assuming
the default value of that character, macro references may be
of the form $var, $(var), $(expr), $[str], $var[expr], or
$func(args). These are replaced by an awk variable, awk
variable, awk expression, awk array reference to the special
array M[], regular awk array reference, or awk function
call, respectively. These are, in effect, macros. The MP
translator checks for proper nesting of parentheses and double
quotes when translating $(expr) and $func(args) macros,
and checks for proper nesting of square brackets and double
quotes when translating $[expr] and $var[expr] macros. The
substitution prefix character indicates a macro reference
unless it is (i) escaped (e.g., \$abc), (ii) followed by a
character other than A-Z, a-z, (, or [ (e.g., $@), or (iii)
inside a macro reference (e.g., $($abc); probably an error).
An understanding of the implementation of macro substitution
will help in its proper usage. When a text line is encountered,
it is scanned for macros, embedded in an awk print
statement, and copied to the output program. For example,
the input line
is transformed into
Obviously the use of this transformation technique relies completely
on the presence of the awk concatenation operator (one or more blanks).
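To make the transformation concrete, here is a rough sketch of how a text line might be turned into a print statement. It is not the m5 translator: it handles only the $var form, it assumes the text contains no double quotes or backslashes, and the names text_to_print and name are made up for the illustration.

```sh
# Hypothetical first-pass filter: text lines in, awk print statements out.
text_to_print() {
  awk '
  {
      line = $0
      out = "print \""
      # Each $var closes the string, names the variable, and reopens the
      # string, so awk concatenates the pieces when the output program runs.
      while (match(line, /\$[A-Za-z][A-Za-z0-9_]*/)) {
          out = out substr(line, 1, RSTART - 1) "\" " substr(line, RSTART + 1, RLENGTH - 1) " \""
          line = substr(line, RSTART + RLENGTH)
      }
      print out line "\""
  }
  '
}
```

Feeding it the line Hello, $name! yields print "Hello, " name "!", where the blanks around name are the concatenation operator the paragraph above refers to.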
As already noted, a macro reference inside another macro
reference will not result in substitution and will probably
cause an awk execution-time error. Furthermore, a
substitution prefix character in the substituted string is
also generally not significant because the substitution prefix
character is detected at translation time, and macro
values are assigned at execution time. However, macro
references of the form $[expr] provide a simple nested
referencing capability. For example, if $[abc] is in a text
line, or in a directive line and not on the left hand side
of an assignment statement, it is replaced by
eval(M["abc"]). When the output program is executed, the
m5 runtime routine eval() substitutes the value of M["abc"],
examining it for further macro references of the form $[str]
(where "str" denotes an arbitrary string). If one is found,
substitution and scanning proceed recursively. Function-type
macro references may result in references to other macros,
thus providing an additional form of nested referencing.
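The nested lookup can be sketched as follows. This is a stand-in for the idea behind eval(), not the m5 runtime routine itself: the function name expand_refs, the wrapper eval_demo, and the sample macros are assumptions, and the sketch rescans in a loop instead of recursing, which gives the same result for this simple $[str] form.

```sh
# Hypothetical demo wrapper: expands $[str] references found on stdin.
eval_demo() {
  awk '
  # Replace each $[str] by M["str"], then rescan the result, so that
  # references inside substituted values are expanded as well.
  function expand_refs(s,    name, head, tail) {
      while (match(s, /\$\[[^]]*\]/)) {
          head = substr(s, 1, RSTART - 1)
          name = substr(s, RSTART + 2, RLENGTH - 3)   # text between $[ and ]
          tail = substr(s, RSTART + RLENGTH)
          s = head M[name] tail
      }
      return s
  }
  BEGIN { M["greeting"] = "hello $[who]"; M["who"] = "world" }   # sample macros
  { print expand_refs($0) }
  '
}
```

An undefined name expands to the empty string here, and a macro whose value refers to itself would never terminate; a real implementation would need to guard against both.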
Except for the include directive, when a directive line is
detected, the directive prefix is removed, the line is
scanned for macros, and then the line is copied to the output
program (as distinct from the final output). Any valid
awk construct, including the function statement, is allowed
in a directive line. Further information on writing awk
programs may be found in Aho, Kernighan, and Weinberger,
Dougherty and Robbins, and Robbins.
A single non-awk directive has been provided: the include
directive. Assuming that # is the directive prefix,
#include(filename) directs the MP translator to immediately
read from the indicated file, processing lines from it in
the normal manner. This processing mode makes the include
directive the only type of directive to take effect at
translation time. Nested includes are allowed. Include
directives must appear on a line by themselves. More elaborate
types of file processing may be directly programmed
using appropriate awk statements in the input file.
The MP translator builds the resulting awk program in one of
two ways, depending on the form of the first input line. If
that line begins with "function", it is assumed that the
user is providing one or more functions, including the function
"main" required by m5. If the first line does not
begin with "function", then the entire input file is
translated into awk statements that are placed inside
"main". If some input lines are inside functions, and others
are not, awk will detect this and complain. The MP
by design has little awareness of the syntax of directive
lines (awk statements), and as a consequence syntax errors
in directive lines are not detected until the output program
is executed.
Finally, unless the -c (compile only) option is specified on
the command line, the output program is executed to produce
the final output (directed by default to standard output).
The version of awk specified in ARGV[0] (a built-in awk
variable containing the command name) is used to execute the
program. If ARGV[0] is null, awk is used.
Understanding this example requires recognition that macro
substitution is a two-step process: (i) the input text is
translated into an output awk program, and (ii) the awk
program is executed to produce the final output with the
macro substitutions actually accomplished. The examples
below illustrate this process. # and $ are assumed to be
the directive and substitution prefix characters. This
example was successfully executed using awk on a Cray C90
running UNICOS 10.0.0.3, gawk on a Gateway E-3200 running
SuSE Linux Version 6.0, and nawk on a Sun Ultra 2 Model 2200
running Solaris 2.5.1.
William A. Ward, Jr., School of Computer and Information
Sciences, University of South Alabama, Mobile, Alabama, July
23, 1999.
awkwords --title "Title" file > file.html
awkwords file > file.html
This code requires gawk and bash.
To download:
To test the code, apply it to itself:
AwkWords is a simple-to-use markup language
for writing documentation for programs whose comment lines
start with "#" and whose comments contain HTML code.
For example,
awk.info?tools/awkwords
shows the html generated from
this bash script.
When used with the --title option, a stand-alone web page is generated
(to control the style of that page, see the CSS function, discussed below).
When used without --title, it generates HTML suitable for inclusion
into other pages.
Also, AwkWords finds all the <h2>, <h3>, <h4>, <h5>,
<h6>, <h7>, <h8>, <h9> headings and copies them to a table
of contents at the front of the file.
Note that AwkWords assumes that the file contains only one
<h1> heading; this is printed before the table of contents.
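The heading collection can be sketched in a few lines of Awk. This is an assumption-laden illustration, not the AwkWords toc function: it expects each heading to sit on one line in lowercase tags without attributes, it ignores the special handling of <h1>, and the wrapper name toc_demo is hypothetical. The htmltoc class name is the one mentioned in the style notes on this page.

```sh
# Hypothetical demo wrapper: prints a table of contents, then the input.
toc_demo() {
  awk '
  # Remember every line so the body can be printed after the contents list.
  { body[++n] = $0 }
  # Note the text of each <h2>..<h9> heading.
  match($0, /<h[2-9]>[^<]*<\/h[2-9]>/) {
      h = substr($0, RSTART, RLENGTH)          # e.g. <h2>Usage</h2>
      toc[++t] = substr(h, 5, length(h) - 9)   # strip <hN> and </hN>
  }
  END {
      print "<ul class=\"htmltoc\">"
      for (i = 1; i <= t; i++) print "<li>" toc[i] "</li>"
      print "</ul>"
      for (i = 1; i <= n; i++) print body[i]
  }
  '
}
```

Holding the whole file in an array is what lets the contents appear at the front even though the headings are only discovered while reading.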
AwkWords adds some short cuts for HTML markup, as well as including
nested contents (see below: "including nested content"). This is useful for including, say,
program output along with the actual program.
Awkwords is divided into three functions:
unhtml fixes the printing of pre-formatted blocks;
toc adds the table of contents while
includes handles the details of the extra mark-up.
The xpand function controls recursive inclusion of content. Note that
The xpand1 function controls the printing of a single line.
If we are formatting verbatim text, we must remove the start-of-html character "<".
Otherwise, we expand any html shortcuts.
The function xpandHtml controls the html short cuts.
The rest of the code is just some book-keeping and managing the recursive addition of content.
If used to generate a full web page, then the following styles are added.
Note that the htmltoc class controls the appearance of the table of contents.
There's no checking for valid input (e.g. pre-formatting tags that never close).
If the input file contains no html mark up, the results are pretty messy.
Recursive includes fail silently if the referenced file does not exist.
I don't like the way I need a separate pass to do "unhtml". I tried making it work
within the code but it got messy.
The amazingly workable (text) formatter
awf -macros [ file ] ...
Download from
LAWKER.
Type "make r" to run a regression test, formatting the manual page
(awf.1) and comparing it to a preformatted copy (awf.1.out). Type
"make install" to install it. Pathnames may need changing.
Awf formats the text from the input file(s) (standard input if none) in
an imitation of nroff's style with the -man or -ms macro packages. The
-macro option is mandatory and must be `-man' or `-ms'.
Awf is slow and has many restrictions, but does a decent job on most
manual pages and simple -ms documents, and isn't subject to AT&T's
brain-damaged licensing that denies many System V users any text
formatter at all. It is also a text formatter that is simple enough
to be tinkered with, for people who want to experiment.
Awf implements the following raw nroff requests:
and the following in-text codes:
plus the full list of nroff/troff special characters in the original V7
troff manual.
Many restrictions are present; the behavior in general is a subset of
nroff's. Of particular note are the following:
White space at the beginning of lines, and imbedded white space within
lines, is dealt with properly. Sentence terminators at ends of lines are
understood to imply extra space afterward in filled lines. Tabs are implemented crudely and not quite correctly, although in most cases they
work as expected. Hyphenation is done only at explicit hyphens, emdashes, and nroff discretionary hyphens.
The -man macro set implements the full V7 manual macros, plus a few semi-random
oddballs. The full list is:
.BY and .NB each take a single string argument (respectively, an indication
of authorship and a note about the status of the manual page) and
arrange to place it in the page footer.
The -ms macro set is a substantial subset of the V7 manuscript macros.
The implemented macros are:
Size changes are recognized but ignored, as are .RP and .ND. .UL just
prints its argument in italics. .DS/.DE does not do a keep, nor do any
of the other macros that normally imply keeps.
Assignments to the header/footer string variables are recognized and
implemented, but there is otherwise no control over header/footer
formatting. The DY string variable is available. The PD, PI, and LL
number registers exist and can be changed.
The only output format supported by awf, in its distributed form, is that
appropriate to a dumb terminal, using overprinting for italics (via
underlining) and bold. The nroff special characters are printed as some
vague approximation (it's sometimes very vague) to their correct
appearance.
Awf's knowledge of the output device is established by a device file,
which is read before the user's input. It is sought in awf's library
directory, first as dev.term (where term is the value of the TERM
environment variable) and, failing that, as dev.dumb. The device file
uses special internal commands to set up resolution, special characters,
fonts, etc., and more normal nroff commands to set up page length etc.
All in /usr/lib/awf (this can be overridden by the AWFLIB environment
variable):
awk(1), nroff(1), man(7), ms(7)
Unlike nroff, awf complains whenever it sees unknown commands and macros.
All diagnostics (these and some internal ones) appear on standard error
at the end of the run.
Written at University of Toronto by Henry Spencer, more or less as a
supplement to the C News project.
Copyright 1990 University of Toronto. All rights reserved.
Written by Henry Spencer.
This software is not subject to any license of the American Telephone
and Telegraph Company or of the Regents of the University of California.
Permission is granted to anyone to use this software for any purpose on
any computer system, and to alter it and redistribute it freely, subject
to the following restrictions:
There are plenty, but what do you expect for a text formatter written
entirely in (old) awk?
The -ms stuff has not been checked out very thoroughly.
Axel Reinhold's
MacroCALC (mc) interactive spreadsheet calculator
is an interactive, macro-programmable tool. mc has no graphic features, and can therefore also run on terminals. It uses a convenient, well-known user interface and has some special features especially interesting in the UNIX environment.
mc has an elaborate interface to the operating
system via piping.
That is, mc and Unix tools like Awk can be easily integrated.
A "cell" statement has the syntax:
The output
is read line by line into the rows of the range. The columns, which
have to be separated by "tab" in the output of the command, are
placed into the columns of the range.
At the end of the data a
special cell value designated 'EOF' (end of file) is placed in the
cell below the data.
This offers great flexibility based upon the Unix operating
system's piping mechanism.
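On the Awk side, the only requirement stated above is that columns be separated by tabs. A small sketch of a filter that produces such output is shown here; the helper name to_tsv is hypothetical, and how the command is attached to a cell in mc is not shown.

```sh
# Hypothetical helper: turn whitespace-separated input into tab-separated rows.
to_tsv() {
  # Assigning $1 to itself makes awk rebuild the record using OFS.
  awk 'BEGIN { OFS = "\t" } { $1 = $1; print }' "$@"
}
```

Each output line then maps to one row of the range, and each tab-separated field to one column.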
For more details, see the
MacroCALC
home page.
These pages focus on postscript tricks, written in Awk.
gawk -f pschoose
Download from LAWKER
Pagerange : list of pages from command line.
Pages : array with broken out list.
At end:
"(n in Pages)" is true if page n should be printed
Arnold Robbins
gawk -f psrev.awk
Download from LAWKER
Reverse the pages in a postscript file.
Arnold Robbins
gawk -f indent.awk file.sh
Download from LAWKER
This is part of Phil's AWK tutorial at
http://www.bolthole.com/AWK.html.
This program adjusts the indentation level based on which keywords are
found in each line it encounters.
This is the 'default' action, that actually prints a line out.
This gets called AS WELL AS any other matching clause, in the
order they appear in this program.
An "if" match is run AFTER we run this clause.
A "done" match is run BEFORE we run this clause.
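The ordering described above (closers before the default print clause, openers after it) can be sketched like this. It is a simplified illustration, not the tutorial's indent.awk: the wrapper name reindent, the four-space indent, and the choice to key on a trailing "do" or "then" and a leading "done" or "fi" are all assumptions.

```sh
# Hypothetical demo wrapper: re-indents simple shell code read from stdin.
reindent() {
  awk '
  function pad(n,    s) { while (n-- > 0) s = s "    "; return s }
  # Closing keywords take effect BEFORE the line is printed.
  /^[ \t]*(done|fi)([ \t;]|$)/ { if (depth > 0) depth-- }
  # Default clause: runs for every line, in addition to the other clauses.
  { sub(/^[ \t]+/, ""); print pad(depth) $0 }
  # Opening keywords take effect AFTER the line is printed.
  /(^|[ \t;])(do|then)[ \t]*$/ { depth++ }
  '
}
```

Since Awk tries every pattern in program order for each line, the position of the print clause relative to the other two is what decides whether a keyword line itself is shifted.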
Philip Brown phil@bolthole.com
For the 7 day period ending Monday April 27, 2009.
Example Code
% cat 1.awk
#!/usr/bin/env runawk
#use "power_getopt.awk"
#.begin-str help
# power_getopt - program demonstrating a power of power_getopt.awk module
# usage: power_getopt [OPTIONS]
# OPTIONS:
# -h|--help display this screen
# -f|--flag flag
# --long-flag long flag only
# -s short flag only
# =F|--FLAG
./1.awk
% ./1.awk
f --- 0
flag --- 0
long-flag --- 0
s --- 0
F --- default1
FLAG --- default2
./1.awk -h
% ./1.awk -h
power_getopt - program demonstrating a power of power_getopt.awk module
usage: power_getopt [OPTIONS]
OPTIONS:
-h|--help display this screen
-f|--flag flag
--long-flag long flag only
-s short flag only
-F|--FLAG
./1.awk -f
% ./1.awk -f
f --- 1
flag --- 1
long-flag --- 0
s --- 0
F --- default1
FLAG --- default2
./1.awk -F value
% ./1.awk -F value
f --- 0
flag --- 0
long-flag --- 0
s --- 0
F --- value
FLAG --- value
./1.awk --FLAG=value
% ./1.awk --FLAG=value
f --- 0
flag --- 0
long-flag --- 0
s --- 0
F --- value
FLAG --- value
Killer Awk Snake
New Mascot
Word Processing in Awk
Writing Interpreters
AASL: Parser Generator in Awk
Download
Synopsis
aaslg [ -x ] [ file ... ]
aaslr [ -x ] table [ file ... ]
Description
AASL Specifications
<trivial> "," ";" ;
<lineterm> ";" ;
<endmarker> "EOF" ;
"id" -> "___"
"string" -> "\"___\""
abbreviation expansion
( items ?) ( items | [*] )
{ items ?} { items | [*] >> }
( ! [look] items ?) ( [ look] | items )
{ ! [look] items ?} { [ look] >> | items }
rules: {
"id" ":" contents ";"
| "<" "id" ">" {"string" ?} ";"
| "string" "->" "string"
| "EOF" >>
};
contents: {
">>"
| "<<"
| "id"
| "!" "id"
| "@%&!" "id"
| "string"
| "(" branches ")"
| "{" branches "}"
| [*] >>
};
branches: (
"!" "[" look "]" contents "?"
| [*] branch (
["|"] {"|" branch ?}
| "?" !endbranch
| [*]
)
);
branch: (
"string" contents
| "[" look "]" contents
);
look: (
["string"/"/"] "string" "/" "string"
| "*"
| [*] looker {"," looker ?}
);
looker: ( "string" | "id" ) ;
Error Repair
Files
all in $AASLDIR:
interp table interpreter
lex first pass of aaslg
syn AASL table for aaslg
sem third pass of aaslg
See Also
Diagnostics
History
Bugs
Assessment
Lessons From AASL
for (i in array)
arraystack[i ":" sp] = array[i]
arraystack[sp] = array
Author
Brainfuck to C
About BrainFuck
The Translator
#!/sw/bin/awk -f
# a brainfuck to C translator.
# Needs a recent version of gawk, if on OS X,
# try using Fink's version.
#
# steve jenson
BEGIN {
print "#include <stdio.h>\n";
print "int main() {";
print " int c = 0;";
print " static int b[30000];\n";
}
{
#Note: the order in which these are
#substituted is very important.
gsub(/\]/, " }\n");
gsub(/\[/, " while(b[c] != 0) {\n");
gsub(/\+/, " ++b[c];\n");
gsub(/\-/, " --b[c];\n");
gsub(/>/, " ++c;\n");
gsub(/</, " --c;\n");
gsub(/\./, " putchar(b[c]);\n");
gsub(/\,/, " b[c] = getchar();\n");
print $0
}
END {
print "\n return 0;";
print "}";
}
Updates
Author
OO tools in AWK
Domain-Specific Languages
Graph.awk
Contents
Synopsis
Description
label here's some stuff
bottom ticks 1 5 10
left ticks 1 2 10 20
range 1 1 10 22
height 10
width 30
1 2 *
2 4 *
3 6 *
4 8 *
7 14 +
8 12 +
9 10 +
mb 0.9 11 =
|----------------------|
20 - = = =
| = = = = |
= = = + + |
10 - + |
| * * |
| * |
2 *---------|------------|
1 5 10
here's some stuff
Code
Initialization
BEGIN {
ht = 24; wid = 80
ox = 6; oy = 2
number = "^[-+]?([0-9]+[.]?[0-9]*|[.][0-9]+)" \
"([eE][-+]?[0-9]+)?$"
}
Handling patterns
/^[ \t]*#/ { next }
$1 == "height" { ht = $2; next }
$1 == "width" { wid = $2; next }
$1 == "label" { # for bottom
sub(/^ *label */, "")
botlab = $0
next
}
$1 == "bottom" && $2 == "ticks" { # ticks for x-axis
for (i = 3; i <= NF; i++) bticks[++nb] = $i
next
}
$1 == "left" && $2 == "ticks" { # ticks for y-axis
for (i = 3; i <= NF; i++) lticks[++nl] = $i
next
}
$1 == "range" { # xmin ymin xmax ymax
xmin = $2; ymin = $3; xmax = $4; ymax = $5
next
}
$1 ~ number && $2 ~ number { # pair of numbers
nd++ # count number of data points
x[nd] = $1; y[nd] = $2
ch[nd] = $3 # optional plotting character
next
}
$1 ~ number && $2 !~ number { # single number
nd++ # count number of data points
x[nd] = nd; y[nd] = $1; ch[nd] = $2
next
}
$1 == "mb" { # m b [mark]
expand()
for(i=xmin;i<=xmax;i++) {
nd++; x[nd]=i; y[nd]=$2*i + $3; ch[nd]=$4
}
next;
}
{ print "?? line " NR ": ["$0"]" >"/dev/stderr" }
END { expand(); frame(); ticks(); label(); data(); draw() }
Functions
function expand(note) { if (xmin == "") expand1(note) }
function expand1(note) {
xmin = xmax = x[1]
ymin = ymax = y[1]
for (i = 2; i <= nd; i++) {
if (x[i] < xmin) xmin = x[i]
if (x[i] > xmax) xmax = x[i]
if (y[i] < ymin) ymin = y[i]
if (y[i] > ymax) ymax = y[i] }
}
function frame() {
for (i = ox; i < wid; i++) plot(i, oy, "-") # bottom
for (i = ox; i < wid; i++) plot(i, ht-1, "-") # top
for (i = oy; i < ht; i++) plot(ox, i, "|") # left
for (i = oy; i < ht; i++) plot(wid-1, i, "|") # right
}
function ticks( i) {
for (i = 1; i <= nb; i++) {
plot(xscale(bticks[i]), oy, "|")
splot(xscale(bticks[i])-1, 1, bticks[i])
}
for (i = 1; i <= nl; i++) {
plot(ox, yscale(lticks[i]), "-")
splot(0, yscale(lticks[i]), lticks[i])
}
}
function label() {
splot(int((wid + ox - length(botlab))/2), 0, botlab)
}
function data( i) {
for (i = 1; i <= nd; i++)
plot(xscale(x[i]),yscale(y[i]),ch[i]=="" ? "*" : ch[i])
for(i in mark) print mark[i]
}
function draw( i, j) {
for (i = ht-1; i >= 0; i--) {
for (j = 0; j < wid; j++)
printf("%s", (j,i) in array ? array[j,i] : " ") # "%s" keeps a plotted "%" from being read as a format
printf("\n")
}
}
function xscale(x) {
return int((x-xmin)/(xmax-xmin) * (wid-1-ox) + ox + 0.5)
}
function yscale(y) {
return int((y-ymin)/(ymax-ymin) * (ht-1-oy) + oy + 0.5)
}
function plot(x, y, c) {
array[x,y] = c
}
function splot(x, y, s, i, n) {
n = length(s)
for (i = 0; i < n; i++)
array[x+i, y] = substr(s, i+1, 1)
}
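As a worked example of the scaling step, the sketch below evaluates xscale for the sample input shown earlier (width 30, x range 1 to 10) with the default origin ox = 6. The wrapper name xscale_demo and the way values are passed in are assumptions made for the demonstration.

```sh
# Hypothetical demo wrapper: prints each x value next to its screen column.
xscale_demo() {   # usage: xscale_demo X...
  awk -v wid=30 -v ox=6 -v xmin=1 -v xmax=10 -v xs="$*" '
  # Map a data value linearly onto columns ox .. wid-1, rounding to nearest.
  function xscale(x) {
      return int((x - xmin) / (xmax - xmin) * (wid - 1 - ox) + ox + 0.5)
  }
  BEGIN {
      n = split(xs, v, " ")
      for (i = 1; i <= n; i++) print v[i], xscale(v[i])
  }
  '
}
```

Running xscale_demo 1 5 10 places the smallest value on the left frame (column 6) and the largest on the right frame (column 29); yscale works the same way with ht and oy.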
Author
UML in Awk
Contents
Synopsis
gawk -f uml.sh file.sdml > sequence_diagram
Description
Example
[Client, Proxy, DNS, Server
Query Name->
Answer IP<-
http GET >->
<<-html
Client Proxy DNS Server
| | | |
|----------Query Name-------->| |
|<---------Answer IP----------| |
|--http GET -->|----------http GET -------->|
|<----html-----|<-----------html------------|
Code
if [ "$1" = "--awkprog" ] ; then
cat - <<"EOF"
BEGIN {
EFS="[|<>-]";
AFS="[<>-]";
RAFS="[{}RL]";
FS= EFS;
ARROWS = 2 ; # Arrowhead constant
ST=1;
ARG["EP"] = 1; # Event Padding
ARG["ES"] = 0; # Event Spacing (lines below)
ARG["EA"] = 0; # Events Above
ARG["HP"] = 2; # Header Padding
ARG["HS"] = 1; # Header Spacing (lines below)
ARG["LM"] = 0; # Left Margin
ARG["SP"] = 2; # Start Row Padding (For continuous operation)
ARG["TSM"] = 1; # Text Spacing Margin (lines above & below)
ARG["TD"] = 1; # Text Dots (instead of bars in text margins)
ARG["SS"] = 1; # Enable Single Arrow Spans (|---A-->|, not |-A-+-A>|)
}
function padding(outter, inner, extra ,p,m) {
p = (outter - inner);
m = p % 2 ;
p = ((p - m)/2) + (extra ? m:0);
if(p<0) return 0;
return p;
}
function pad(char, count ,i,r) {
for(i=1 ; i <= count ; i++) { r = r char };
return r;
}
function ltrim(s) { gsub(/^[ ]*/, "", s) ; return s; }
function center(string, width, padchar, favor ,p,r,sw) {
sw = length(string);
p = padding(width, sw, favor=="r"?1:0);
r = pad(padchar, p);
r = r string;
p = padding(width, sw, favor=="r"?0:1);
return r pad(padchar, p);
}
function getevent_rev(row, field ,p) {
for(p=field-1; p>0; p--) { # search to the left
if(RF_s[row,p] !~ AFS) return "";
if(RF_f[row,p] != "") return RF_f[row,p];
}
return "";
}
function getevent_for(row, field ,n) {
for(n=field+1; n <= R_nf[row]; n++) { # search to the right
if(RF_s[row,n-1] !~ AFS) return "";
if(RF_f[row,n] != "") return RF_f[row,n];
}
return "";
}
function rlarrow(arrow, prevarrow) {
if(arrow == ">") return "R";
if(arrow == "<") return "L";
if(arrow == "R" || arrow=="L") return arrow;
return prevarrow;
}
function debug_events(s) {
for(r=1; r <= NRS; r++) debug_row(r, s);
}
function debug_row(r, s) {
if(!DEBUG_ROW) return;
printf("Row["r"]/Stage["s"]: ");
for(f=1; f <= R_nf[r]; f++) {
printf(f"="RF_f[r,f]"("RF_s[r,f]") ");
}
printf("\n");
}
function print_bars(num, char ,i,out) {
if(char == "") char = "|";
while(num--) {
# Center the bars under the Headers
out = pad(" ", F_width[0]);
for(i=1; i<= NH; i++) {
out = out char pad(" ", F_width[i]);
}
print out;
}
}
function print_event(r, type ,i,bar,out,aspad,span_width,arrow){
out = pad(" ", F_width[0]);
for(i=1; i<= MAXNF; i++) {
out = out "|";
arrow=" ";
if(type == "both" || type == "arrow") {
if(RF_s[r,i] == "{") arrow = "<";
if(RF_s[r,i] ~ /[}RL]/) arrow = "-";
}
out = out arrow;
aspad = "-"; # arrow or space pad
if(RF_s[r,i] == "|" || RF_s[r,i] == ""|| type == "event") aspad = " ";
span_width = F_width[i];
if(ARG["SS"]) while(RF_s[r,i] == "R" || RF_s[r,i+1] == "L") {
span_width += 1 + F_width[++i]; # include bar
}
event ="";
if(type == "both" || type == "event") {
event = RF_f[r,i];
}
out = out center(event, span_width - ARROWS, aspad, i>MAXNF/2? "r":"l");
if(type == "both" || type == "arrow") {
if(RF_s[r,i] == "}") arrow = ">";
if(RF_s[r,i] ~ /[{RL]/) arrow = "-";
}
out = out arrow;
}
out = out "|";
print out;
}
function print_sd(start_row) {
print " 1 2 3 4 5 6"
print "123456789012345678901234567890123456789012345678901234567890"
if(start_row!=1) { for(i=0; i<ARG["SP"];i++) print ""; }
for(j=start_row; j<= NRS; j++) {
if(R_ltype[j] == "Header") {
NH = R_nf[j];
out = pad(" ", ARG["HP"]+ARG["LM"]);
i =1;
out = out RF_f[j,i];
hp = ARG["HP"] + ARG["LM"] + RF_l[j,i]; # header pointer (last char)
bp = F_width[0] + 1 + F_width[i] + 1; # bar pointer
print "HP:" hp " BP: "bp
for(i=2; i<= NH; i++) {
l = int(RF_l[j,i]/2); r = RF_l[j,i] -l; # Header left & right
lp = (bp - hp) - (l + 1); # left padding
out = out pad(" ", lp) RF_f[j,i];
hp = bp + r - 1;
bp = bp + F_width[i] + 1;
print "HP:" hp " BP: "bp " LP:"lp " r:"r" l:"l
}
print out;
print_bars(ARG["HS"]);
}
if(R_ltype[j] == "Text") {
if(R_ltype[j-1] != "Text") {
if(ARG["TD"]) {
print_bars(ARG["TSM"], ".");
} else {
for(l=0;l<ARG["TSM"]; l++) print "";
}
}
if(T_type[j] == "indent") printf(pad(" ", F_width[0]));
print RF_f[j,1];
if(R_ltype[j+1] != "Text") {
if(ARG["TD"]) {
print_bars(ARG["TSM"], ".");
} else {
for(l=0;l<ARG["TSM"]; l++) print "";
}
}
}
if(R_ltype[j] == "Event") {
if (ARG["EA"]) {
print_event(j, "event");
print_event(j, "arrow");
} else print_event(j, "both");
print_bars(ARG["ES"]);
}
}
return j;
}
/^[ ]*#/ {next} # we don't want bars for comment only lines!
/#/ { sub(/#.*$/, ""); } # strip trailing comment; sub() edits $0 in place and returns a count
/^:/ {
print "Argument Variable Assignment" $0
i = split(substr($0,2), v, /,/);
for(;i>0;i--) {
j = split(v[i], kv, "=");
if(j==1) { ARG[kv[1]]= ""; }
if(j==2) { ARG[kv[1]]=kv[2]; }
}
for(k in ARG) { printf("ARG["k"]='"ARG[k]"' "); } ; print "";
next ;
}
{
NRS++; # NRSequences
}
/^;/ { ST=print_sd(ST); next; } # Allow continuous operation
/^@/ {
print "text line"
R_ltype[NRS] = "Text";
T_type[NRS] = "left";
sub(/^@/,"");
RF_f[NRS,1]=$0;
next;
}
/^"/ {
print "text line"
R_ltype[NRS] = "Text";
T_type[NRS] = "indent";
sub(/^"/,"");
RF_f[NRS,1]=$0;
next;
}
/^\[/ {
print "Event Headers (Titles)" $0
R_ltype[NRS] = "Header";
sub(/^\[/,"");
FS=","; $0 = $0; # resplit line
R_nf[NRS] = NF;
if(MAXNF < R_nf[NRS]-1) MAXNF= R_nf[NRS]-1; # print MAXNF;
for(i=1; i<= NF; i++) {
f= ltrim($i);
RF_f[NRS,i]=f;
RF_l[NRS,i]= length(f);
RF_s[NRS,i]= ",";
}
for(i=1; i<= NF; i++) {
F_width[i] = padding(RF_l[NRS,i] + 2*ARG["HP"], 1, 1) +\
padding(RF_l[NRS,i+1] + 2*ARG["HP"], 1, 0)\
-1; # Do not include width of bar
if(F_width[i] < 2*ARG["HP"]) F_width[i] = 2*ARG["HP"];
print padding(RF_l[NRS,i] + 2*ARG["HP"], 1, 1) " "\
padding(RF_l[NRS,i+1] + 2*ARG["HP"], 1, 0);
}
F_width[0] = padding(RF_l[NRS,1] + 2*ARG["HP"], 1, 1);
print padding(RF_l[NRS,1] + 2*ARG["HP"],1,0);
if(F_width[0] < ARG["HP"]) F_width["0"] = ARG["HP"];
F_width[0] += ARG["LM"];
for(i=0; i<= MAXNF; i++) printf("FW["i"]="F_width[i]" "); print ""
FS=EFS;
next;
}
{
print "Event Line: " $0 ; DEBUG_ROW=1;
R_ltype[NRS] = "Event";
stl=0;
for(i=1; i<= NF; i++) {
f = $i;
l = length(f);
stl += l +1;
s = substr($0, stl, 1);
RF_f[NRS,i]= f;
RF_s[NRS,i]= s;
}
R_nf[NRS] = NF;
debug_row(NRS, 1);
# Fill in missing (assumed) fields
for(i=1; i<= R_nf[NRS]; i++) {
if (RF_f[NRS,i]=="") RF_f[NRS,i] = getevent_rev(NRS, i);
if (RF_f[NRS,i]=="") RF_f[NRS,i] = getevent_for(NRS, i);
}
debug_row(NRS, 2);
# -> <- ->> >-> <-< <<-
# >- -< >>- -<<
# R> <L R>> >R> <L< <<L
for(i=1; i<= R_nf[NRS]; ) {
if(RF_s[NRS,i] ~ AFS) {
if(RF_s[NRS,i] == "-") { # left tail
for(n=i+1; n<= R_nf[NRS]; n++) {
if(RF_s[NRS,n]==">") {
pi=i; i=n; RF_s[NRS,n]="}";
for(n--; n>=pi; n--) RF_s[NRS,n]="R"; n= R_nf[NRS];
} else if(RF_s[NRS,n]=="<") {
pi=i; i=n; RF_s[NRS,pi]="{";
for(; n>pi; n--) RF_s[NRS,n]="L"; n= R_nf[NRS];
}
}
i++;
} else if(RF_s[NRS,i+1] != "-") { # singleton
RF_s[NRS,i]= RF_s[NRS,i]==">" ? "}":"{";
i++;
} else {
rl= rlarrow(RF_s[NRS,i], "");
for(n=i+1; n<= R_nf[NRS] && RF_s[NRS,n] ~ AFS; n++) {
rl= rlarrow(RF_s[NRS,n], rl);
}
n--;
if (RF_s[NRS,n] == "-") { # right tail
if (rl=="R") RF_s[NRS,n--]="}";
for(; n>=i && RF_s[NRS,n] == "-"; n--) RF_s[NRS,n]=rl;
if (rl=="L") RF_s[NRS,n]="{"; else RF_s[NRS,n]="R";
} else if (RF_s[NRS,n-1] != "-") { # singleton
RF_s[NRS,n]= RF_s[NRS,n]==">" ? "}":"{";
} else { # double ended -
if(RF_s[NRS,i]=="<") { # trumps no matter what
RF_s[NRS,i]="{";
for(i++; i<= R_nf[NRS] && RF_s[NRS,i]=="-"; i++) {
RF_s[NRS,i]="L";
}
} else {
for(n=i+1; n<= R_nf[NRS] && RF_s[NRS,n] =="-"; n++) ;
if(RF_s[NRS,n]==">") {
RF_s[NRS,n]="}";
for(n--; n>i && RF_s[NRS,n]=="-"; n--) {
RF_s[NRS,n]="R";
}
} else { # >-< # > is on the right and trumps
for(; i<= R_nf[NRS] && RF_s[NRS,i]=="-"; i++) {
RF_s[NRS,i]="R";
}
RF_s[NRS,i]="}";
}
}
}
}
} else i++;
}
debug_row(NRS, 3);
# ~ we need to test this with multi shifts (arrow/bar/arrow)
shift = 0;
for(i=1; i<= R_nf[NRS]+1; i++) {
if(RF_s[NRS,i-1] ~ RAFS && RF_s[NRS,i] !~ RAFS) shift++;
if(shift) RF_f[NRS,i-shift]=RF_f[NRS,i];
}
R_nf[NRS] = R_nf[NRS] - shift;
debug_row(NRS, 4);
# Trim empty trailing fields
for(i= R_nf[NRS]; i>0 && RF_f[NRS,i]==""; i--) R_nf[NRS]--;
debug_row(NRS, 5);
# Get event wlength and adjust the max length of each event
for(i=1; i<= R_nf[NRS]; i++) {
RF_l[NRS,i]= length(RF_f[NRS,i]);
if(RF_l[NRS,i] > E_ml[i]) E_ml[i] = RF_l[NRS,i];
}
# Adjust the max width of each column (headers/events)
if(MAXNF < R_nf[NRS]) MAXNF= R_nf[NRS]; # print MAXNF;
for(i=1; i<= MAXNF; i++) {
w = E_ml[i] + 2 * ARG["EP"] + ARROWS;
if (F_width[i] < w) F_width[i] = w;
printf("FW:"F_width[i]" W:"w" ");
}
print ""
}
END { ST=print_sd(ST); }
EOF
exit
fi
Usage()
{
cat - <<-EOF
use(v1.0): $0 file.sdml > sequence_diagram
This program will turn SDML into simple ascii text uml sequence
diagrams. SDML is an extremely simplistic uml Sequence Diagram
Markup Language. SDML is specified as:
.Lines starting with a [ are a comma separated list
of actors (bar headers)
.Events are defined easily by the following symbols:
> rightward event
< leftward event
- extension of the previous event
.Actors can be skipped with a |
.Text on a line after a # is a comment
.Lines starting with a @ are text lines
.Lines starting with a " are indented text lines
.Lines starting with a : are comma separated list of
parameter assignment lines. Parameters are:
EP Event Padding (spaces on each side)
ES Event Spacing (lines below)
EA Events Above (put event text above arrows)
HP Header Padding (spaces on each side)
HS Header Spacing (lines below)
LM Left Margin (spaces on the left)
TSM Text Spacing Margin (lines above & below)
TD Text Dots (instead of bars in text margins)
SS Enable Single Arrow Spans (|---A-->|, not |-A-+-A>|)
Example SDML Input:
[Client, Proxy, DNS, Server
Query Name->
Answer IP<-
http GET >->
<<-html
Sequence Diagram Output:
Client Proxy DNS Server
| | | |
|----------Query Name-------->| |
|<---------Answer IP----------| |
|--http GET -->|----------http GET -------->|
|<----html-----|<-----------html------------|
Copyright: Martin Fick <mogulguy@yahoo.com>, Date: 2008-02-15
License: None. This is released into the public domain: do
as you wish.
EOF
exit
}
[ "$1" = "--help" -o "$1" = "-h" -o "$1" = "-u" ] && Usage
# Hack to attempt to make this somewhat portable
AWK_PROG="`"$0" --awkprog`"
AWK=awk # default (should work most places)
[ -x /usr/bin/nawk ] && AWK=/usr/bin/nawk # solaris
$AWK "$AWK_PROG" "$@"
Author
AWKLISP v1.2
Download from
Synopsis
awk [-v profiling=1] -f awklisp [optional-Lisp-source-files]
gawk -f awklisp startup numbers lists -
Description
Overview
Examples
fib.lsp
(define fib
(lambda (n)
(if (< n 2)
1
(+ (fib (- n 1))
(fib (- n 2))))))
(fib 20)
gawk -f awklisp startup numbers lists fib.lsp
10946
Eliza
(define rules
'(((hello)
(How do you do -- please state your problem))
((I want)
(What would it mean if you got -R-)
(Why do you want -R-)
(Suppose you got -R- soon))
((if)
(Do you really think its likely that -R-)
(Do you wish that -R-)
(What do you think about -R-)
(Really-- if -R-))
((I was)
(Were you really?)
(Perhaps I already knew you were -R-)
(Why do you tell me you were -R- now?))
((I am)
(In what way are you -R-)
(Do you want to be -R-))
((because)
(Is that the real reason?)
(What other reasons might there be?)
(Does that reason seem to explain anything else?))
((I feel)
(Do you often feel -R-))
((I felt)
(What other feelings do you have?))
((yes)
(You seem quite positive)
(You are sure)
(I understand))
((no)
(Why not?)
(You are being a bit negative)
(Are you saying no just to be negative?))
((someone)
(Can you be more specific?))
((everyone)
(Surely not everyone)
(Can you think of anyone in particular?)
(Who for example?)
(You are thinking of a special person))
((perhaps)
(You do not seem quite certain))
((are)
(Did you think they might not be -R-)
(Possibly they are -R-))
(()
(Very interesting)
(I am not sure I understand you fully)
(What does that suggest to you?)
(Please continue)
(Go on)
(Do you feel strongly about discussing such things?))))
gawk -f awklisp startup numbers lists eliza.lsp -
> (eliza)
Hello-- please state your problem
> (I feel sick)
Do you often feel sick
> (I am in love with awk)
In what way are you in love with awk
> (because it is so easy to use)
Is that the real reason?
> (I was laughed at by the other kids at space camp)
Were you really?
> (everyone hates me)
Can you think of anyone in particular?
> (everyone at space camp)
Surely not everyone
> (perhaps not tina fey)
You do not seem quite certain
> (I want her to laugh at me)
What would it mean if you got her to laugh at me
Expressions and their evaluation
We'll see all the primitive procedures in the next section. A user-defined
procedure is represented as a list of the form (lambda <parameters> <body>),
such as (lambda (x) (+ x 1)). To apply such a procedure, evaluate its body
in the environment obtained by extending the current environment so that the
parameters are bound to the corresponding arguments. Thus, to apply the above
procedure to the argument 41, evaluate (+ x 1) in the same environment as the
current one except that x is bound to 41.
(let ((<var> <expr>)...)
<body>...)
Bind each <var> to its corresponding <expr> (evaluated in the current
environment), and evaluate <body> in the resulting environment.
(cond (<test-expr> <result-expr>...)... (else <result-expr>...))
where the final else clause is optional. Evaluate each <test-expr> in
turn, and for the first non-nil result, evaluate its <result-expr>. If
none are non-nil, and there's no else clause, return nil.
(and <expr>...)
Evaluate each <expr> in order, until one returns nil; then return nil.
If none are nil, return the value of the last <expr>.
(or <expr>...)
Evaluate each <expr> in order, until one returns non-nil; return that value.
If all are nil, return nil.
Built-in procedures
List operations:
Numbers:
I/O:
Meta-operations:
Miscellany:
Implementation Notes
Overview
Data representation
The evaluation/saved-bindings stack
if (proc == ADD) return is(a_number, arg0) + is(a_number, arg1)
for (i = 0; arglist != NIL; ++i) {
global_temp[i] = eval(car[arglist])
arglist = cdr[arglist]
}
# All the interpretation routines have the precondition that their
# arguments are protected from garbage collection.
References
Bugs
Author
Copyright
Amazing Awk Assembler
Download from
Description
Author
Convert Comments to Latex
Download
Synopsis
About
Requirements
Example
adoc -s -t "adoc" adoc > doc.tex
For the detailed documentation about the system and its implementation execute the following:
$ adoc -s -t "adoc" adoc > doc.tex
$ latex doc
$ makeindex doc
$ latex doc
$ makeindex doc
$ latex doc
$ latex doc
$ dvips doc
The created documentation can be downloaded in Pdf format from
here.
Reporting Bugs
In case of bug reports, suggestions, criticism e-mail
peteri@carme.sect.mce.hw.ac.uk
LICENSE
Author
Peter Ivanyi and Roman Putanowicz
md2html : Update to Markdown.awk
Download
Markdown.awk
Contents
• Synopsis
• Download
• Description
• Code
• Globals
• Images
• Links
• Code
• Emphasis
• Setext-style Headers
• Atx-style headers
• Unordered Lists
• Paragraphs
• Default
• End
• Bugs
• Author
Synopsis
Download
Description

Level 1 Header
===============
Level 2 Header
--------------
Level 3 Header
______________
# Level 1 Header
#### Level 4 Header
- List item 1
- List item 2
1 A numbered list item
Code
Globals
BEGIN {
env = "none";
text = "";
}
Images
/^!\[.+\] *\(.+\)/ {
split($0, a, /\] *\(/);
split(a[1], b, /\[/);
imgtext = b[2];
split(a[2], b, /\)/);
imgaddr = b[1];
print "<p><img src=\"" imgaddr "\" alt=\"" imgtext "\" title=\"\" /></p>\n";
text = "";
next;
}
Links
/\] *\(/ {
do {
na = split($0, a, /\] *\(/);
split(a[1], b, "[");
linktext = b[2];
nc = split(a[2], c, ")");
linkaddr = c[1];
text = text b[1] "<a href=\"" linkaddr "\">" linktext "</a>" c[2];
for(i = 3; i <= nc; i++)
text = text ")" c[i];
for(i = 3; i <= na; i++)
text = text "](" a[i];
$0 = text;;
text = "";
}
while (na > 2);
}
Code
/`/ {
while (match($0, /`/) != 0) {
if (env == "code") {
sub(/`/, "</code>");
env = pcenv;
}
else {
sub(/`/, "<code>");
pcenv = env;
env = "code";
}
}
}
Emphasis
/\*\*/ {
while (match($0, /\*\*/) != 0) {
if (env == "emph") {
sub(/\*\*/, "</emph>");
env = peenv;
}
else {
sub(/\*\*/, "<emph>");
peenv = env;
env = "emph";
}
}
}
Setext-style Headers
/^=+$/ {
print "<h1>" text "</h1>\n";
text = "";
next;
}
/^-+$/ {
print "<h2>" text "</h2>\n";
text = "";
next;
}
/^_+$/ {
print "<h3>" text "</h3>\n";
text = "";
next;
}
Atx-style headers
/^#/ {
match($0, /#+/);
n = RLENGTH;
if(n > 6)
n = 6;
print "<h" n ">" substr($0, RLENGTH + 1) "</h" n ">\n";
next;
}
Unordered Lists
/^[-*+]/ {
if (env == "none") {
env = "ul";
print "<ul>";
}
print "<li>" substr($0, 3) "</li>";
text = "";
next;
}
/^[0-9]./ {
if (env == "none") {
env = "ol";
print "<ol>";
}
print "<li>" substr($0, 3) "</li>";
next;
}
Paragraphs
/^[ \t]*$/ {
if (env != "none") {
if (text)
print text;
text = "";
print "</" env ">\n";
env = "none";
}
if (text)
print "<p>" text "</p>\n";
text = "";
next;
}
Default
// {
text = text $0;
}
End
END {
if (env != "none") {
if (text)
print text;
text = "";
print "</" env ">\n";
env = "none";
}
if (text)
print "<p>" text "</p>\n";
text = "";
}
Bugs
Author
Awk++
Contents
• Synopsis
• Download
• Description
• OO in AWK++
• Syntax
• Samples:
• Details
• Naming and behavior rules:
• Syntax notes
• Multiple Inheritance
• Running awk++
• Bugs
• Copyright
• Author
Synopsis
gawk -f awkpp file-name-of-awk++-program
This command is platform independent and sends the translated program to standard
output (stdout). See Running awk++ for variations.
This document may be copied only as part of an awk++ distribution
and in unmodified form.
Download
Description
OO in AWK++
Syntax
Samples:
a = class1.new[(optional parameters)] *** similar to Ruby
b = a.get("aProperty")
a.delete
class class1 {
property aProperty
method new([optional parameters]) {
# put initialization stuff here
}
method get(propName) {
if(propName == "aProperty")
return aProperty ### Note the use of 'return'. It behaves
### exactly the same as in an AWK function.
}
}
Details
class class_name {.....}
class class_name : inherited_class_name [ : inherited_class_name...] {.....}
class class_name {
attribute|attr|property|prop|element|elem|variable|var variable_name
..... }
class class_name {
attribute variable_name1
method method_name(parameters) {
...any awk code....
}
..other method definitions...
}
object_variable = class_name.new[(optional parameters)]
(runs the method named "new", if it exists; returns the object ID)
object_variable.method_name(parameters)
object_variable.delete
Naming and behavior rules:
Syntax notes
Multiple Inheritance
Running awk++
gawk -f awkpp file-name-of-awk++-program
or, if the "she-bang" line (line 1 in awkpp) has the right path to gawk, and awkpp is executable and in a directory in PATH,
awkpp file-name-of-awk++-program
To run the output program immediately,
gawk -f awkpp -r file-name-of-awk++-program [awk options] data-files-to-be-processed
or
awkpp -r file-name-of-awk++-program [awk options] data-files-to-be-processed
When running an awk++ program immediately, standard input (stdin) cannot be
used for data. One or more data file paths must be listed on the command line.
Bugs
Copyright
Author
Jim Hart, jhart@mail.avcnet.org
Awk + ANSI-C = OO
Description
Download
Description
Technical Details
make [all] create examples
make test [make and] run examples
make clean remove all but sources
make depend make dependencies (if makefile.$OSTYPE supports it)
Contents
Copyright
Author
Axel T. Schreiner,
http://www.cs.rit.edu/~ats/
Awk A*
While there are a number of systems that will help one construct full-blown metaprograms such as compilers and interpreters, we wanted something with extremely low overhead.
We set out to build something with the property that it would
help even inexperienced users build simple meta-programs in a
matter of minutes with a few lines of code. A* is the result; it
is more than anything else an engineering exercise, as most of
its ideas are not new. It is the arrangement of these ideas and
the purpose to which they are directed that distinguish A* from
other tools.
A* programmers are thus able to accomplish many useful tasks with little code.
Funky: Functional Gawk
function foo() { print "foo" }
function bar() { print "bar" }
BEGIN {
the_func = "foo"
@the_func() # calls foo()
the_func = "bar"
@the_func() # calls bar()
}
cvs -d:pserver:anonymous@cvs.sv.gnu.org:/sources/gawk co gawk-devel
The Functional Challenge
Sed-clones (in Awk)
Shorten Your Pipes
Get syslog-owned log names from syslog.conf:
grep -v "^#" syslog.conf |
awk '{print $2}' | egrep -v "^(\*|\|)" |
sed '/^$/ d' | sed 's/^-//'
*.err;kern.*;auth.notice;authpriv,remoteauth,install.none;mail.crit /dev/console
*.notice;authpriv,remoteauth,ftp,install.none;kern.debug;mail.crit /var/log/system.log
# Send messages normally sent to the console also to the serial port.
# To stop messages from being sent out the serial port, comment out this line.
#*.err;kern.*;auth.notice;authpriv,remoteauth.none;mail.crit /dev/tty.serial
# The authpriv log file should be restricted access; these
# messages shouldn't go to terminals or publically-readable
# files.
auth.info;authpriv.*;remoteauth.crit /var/log/secure.log
lpr.info /var/log/lpr.log
mail.* /var/log/mail.log
ftp.* /var/log/ftp.log
install.* /var/log/install.log
install.* @127.0.0.1:32376
local0.* /var/log/appfirewall.log
local1.* /var/log/ipfw.log
stuff.* -/boo
things.* |/var/log
*.emerg *
awk '$0 !~ /^$|^#/ && $2 !~ /^\*/ {sub(/^-|^\|/,"",$2);print $2}' syslog.conf
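A quick way to check the one-liner is to run it against a cut-down config. This is only a sketch: the sample file is written to a temporary path, and the function name log_names is made up for the example. The selector lines are borrowed from the listing above.

```sh
# Hypothetical sample: a few lines in syslog.conf format.
conf=$(mktemp)
cat > "$conf" <<'EOF'
# a comment line

mail.*          /var/log/mail.log
stuff.*         -/boo
things.*        |/var/log
*.emerg         *
EOF

# Skip blank lines and comments, skip "*" targets,
# strip a leading "-" or "|" from the target, then print it.
log_names() {
    awk '$0 !~ /^$|^#/ && $2 !~ /^\*/ {sub(/^-|^\|/,"",$2); print $2}' "$1"
}

log_names "$conf"
rm -f "$conf"
```

One awk process replaces the grep, awk, egrep and two sed stages of the original pipeline.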
Sed in Awk
(qqq)
aaa 111
bbb 222
aaa 111 (qqq)
bbb 222 (qqq)
sed -e '
/^([^)]*)/{
h; # remember the (qqq) part
d
}
/ [1-9][0-9]*$/{
G; # strap the (qqq) part to the list
s/\n/ /
}
' yourfile
awk '/^\(/{ h=$0;next } { print $0,h }' file
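The awk one-liner appends the remembered header to every remaining line, while the sed script only touches lines that end in a number. A closer equivalent might look like the sketch below (the function name tag_numbered and the extra "note" line are assumptions for the demonstration):

```sh
tag_numbered() {
    awk '
        /^\([^)]*\)/    { h = $0; next }        # remember the (qqq) part, print nothing
        / [1-9][0-9]*$/ { print $0, h; next }   # numbered line: append the remembered part
                        { print }               # anything else passes through unchanged
    '
}

printf '(qqq)\naaa 111\nnote\nbbb 222\n' | tag_numbered
```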
NR = Number of Records read so far
NF = Number of Fields in current record
FS = the Field Separator
RS = the Record Separator
BEGIN = a pattern that's only true before processing any input
END = a pattern that's only true after processing all input.
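To see these variables in action, here is a small sketch (the function name describe is hypothetical) that reports the record number, the field count and the last field of each line, then the total record count from the END rule:

```sh
describe() {
    awk '{ print NR ": " NF " fields, last = " $NF }
         END { print NR " records" }' "$@"
}

printf 'a b c\nd e\n' | describe
```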
Introductory Examples
sed G
awk '{print $0 "\n"}'
sed '/^$/d;G'
awk 'NF{print $0 "\n"}'
sed 'G;G'
awk '{print $0 "\n\n"}'
sed 'n;d'
awk 'NF'
sed '/regex/{x;p;x;}'
awk '{print (/regex/ ? "\n" : "") $0}'
sed '/regex/G'
awk '{print $0 (/regex/ ? "\n" : "")}'
sed '/regex/{x;p;x;G;}'
awk '{print (/regex/ ? "\n" $0 "\n" : $0)}'
Numbering
sed = filename | sed 'N;s/\n/\t/'
awk '{print NR "\t" $0}'
sed = filename | sed 'N; s/^/ /; s/ *\(.\{6,\}\)\n/\1 /'
awk '{printf "%6s %s\n",NR,$0}'
sed '/./=' filename | sed '/./N; s/\n/ /'
awk 'NF{print NR "\t" $0}'
sed -n '$='
awk 'END{print NR}'
Text Conversion and Substitution
sed -e :a -e 's/^.\{1,78\}$/ &/;ta' # set at 78 plus 1 space
awk '{printf "%79s\n",$0}'
sed -e :a -e 's/^.\{1,77\}$/ & /;ta' # method 1
sed -e :a -e 's/^.\{1,77\}$/ &/;ta' -e 's/\( *\)\1/\1/' # method 2
awk '{printf "%"int((79+length)/2)"s\n",$0}'
sed '1!G;h;$!d' # method 1
sed -n '1!G;h;$p' # method 2
awk '{a[NR]=$0} END{for (i=NR;i>=1;i--) print a[i]}'
sed '/\n/!G;s/\(.\)\(.*\n\)/&\2\1/;//D;s/.//'
awk -v FS='' '{for (i=NF;i>=1;i--) printf "%s",$i; print ""}'
sed '$!N;s/\n/ /'
awk '{printf "%s%s",$0,(NR%2 ? " " : "\n")}'
sed -e :a -e '/\\$/N; s/\\\n//; ta'
awk '{printf "%s",(sub(/\\$/,"") ? $0 : $0 "\n")}'
sed -e :a -e '$!N;s/\n=/ /;ta' -e 'P;D'
awk '{printf "%s%s",(sub(/^=/," ") ? "" : "\n"),$0} END{print ""}'
gsed '0~5G' # GNU sed only
sed 'n;n;n;n;G;' # other seds
awk '{print $0} !(NR%5){print ""}'
Selective Printing of Certain Lines
sed 10q
awk '{print $0} NR==10{exit}'
sed q
awk 'NR==1{print $0; exit}'
sed -e :a -e '$q;N;11,$D;ba'
awk '{a[NR]=$0} END{for (i=(NR>10 ? NR-9 : 1);i<=NR;i++) print a[i]}'
sed '$!N;$!D'
awk '{a[NR]=$0} END{for (i=(NR>2 ? NR-1 : 1);i<=NR;i++) print a[i]}'
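The array-based awk versions above keep every input line in memory. For large files, a ring buffer that holds only the last n lines may be preferable. The following is a minimal sketch; the function name tail_n and its argument are assumptions for the example:

```sh
tail_n() {   # usage: tail_n COUNT < input
    awk -v n="$1" '
        { buf[NR % n] = $0 }                 # overwrite the oldest slot
        END {
            for (i = NR - n + 1; i <= NR; i++)
                if (i > 0) print buf[i % n]  # skip slots before line 1
        }'
}

printf '%s\n' 1 2 3 4 5 | tail_n 3
```

If the input has fewer than n lines, the whole input is printed.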
sed '$!d' # method 1
sed -n '$p' # method 2
awk 'END{print $0}'
sed -e '$!{h;d;}' -e x # for 1-line files, print blank line
sed -e '1{$q;}' -e '$!{h;d;}' -e x # for 1-line files, print the line
sed -e '1{$d;}' -e '$!{h;d;}' -e x # for 1-line files, print nothing
awk '{prev=curr; curr=$0} END{print prev}'
sed -n '/regexp/p' # method 1
sed '/regexp/!d' # method 2
awk '/regexp/'
sed -n '/regexp/!p' # method 1, corresponds to above
sed '/regexp/d' # method 2, simpler syntax
awk '!/regexp/'
sed -n '/regexp/{g;1!p;};h'
awk '/regexp/{print prev} {prev=$0}'
sed -n '/regexp/{n;p;}'
awk 'found{print $0} {found=(/regexp/ ? 1 : 0)}'
sed -n -e '/regexp/{=;x;1!p;g;$!N;p;D;}' -e h
awk 'found {print preLine "\n" hitLine "\n" $0; found=0}
/regexp/ {preLine=prev; hitLine=NR " " $0; found=1}
{prev=$0}'
sed '/AAA/!d; /BBB/!d; /CCC/!d'
awk '/AAA/&&/BBB/&&/CCC/'
sed '/AAA.*BBB.*CCC/!d'
awk '/AAA.*BBB.*CCC/'
sed -e '/AAA/b' -e '/BBB/b' -e '/CCC/b' -e d # most seds
gsed '/AAA\|BBB\|CCC/!d' # GNU sed only
awk '/AAA|BBB|CCC/'
sed -e '/./{H;$!d;}' -e 'x;/AAA/!d;'
awk -v RS='' '/AAA/'
sed -e '/./{H;$!d;}' -e 'x;/AAA/!d;/BBB/!d;/CCC/!d'
awk -v RS='' '/AAA/&&/BBB/&&/CCC/'
sed -e '/./{H;$!d;}' -e 'x;/AAA/b' -e '/BBB/b' -e '/CCC/b' -e d
gsed '/./{H;$!d;};x;/AAA\|BBB\|CCC/b;d' # GNU sed only
awk -v RS='' '/AAA|BBB|CCC/'
sed -n '/^.\{65\}/p'
awk -v FS='' 'NF>=65'
sed -n '/^.\{65\}/!p' # method 1, corresponds to above
sed '/^.\{65\}/d' # method 2, simpler syntax
awk -v FS='' 'NF<65'
sed -n '/regexp/,$p'
awk '/regexp/{found=1} found'
sed -n '8,12p' # method 1
sed '8,12!d' # method 2
awk 'NR>=8 && NR<=12'
sed -n '52p' # method 1
sed '52!d' # method 2
sed '52q;d' # method 3, efficient on large files
awk 'NR==52{print $0; exit}'
gsed -n '3~7p' # GNU sed only
sed -n '3,${p;n;n;n;n;n;n;}' # other seds
awk '!((NR-3)%7)'
sed -n '/Iowa/,/Montana/p' # case sensitive
awk '/Iowa/,/Montana/'
sed '/string/q' FileID
awk '{print $0} /string/{exit}'
sed '/string/,$!d' FileID
awk '/string/{found=1} found'
sed '/string1/,$!d;/string2/q' FileID
awk '/string1/{found=1} found{print $0} /string2/{exit}'
Selective Deletion of Certain Lines
sed '/Iowa/,/Montana/d'
awk '/Iowa/,/Montana/{next} {print $0}' file
sed '$!N; /^\(.*\)\n\1$/!P; D'
awk '$0!=prev{print $0} {prev=$0}'
sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'
awk '!a[$0]++'
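The terse `!a[$0]++` idiom relies on an unset array element comparing equal to zero. Spelled out, it does roughly the following (a sketch; the names dedupe and seen are chosen for the example):

```sh
dedupe() {
    awk '{
        if (seen[$0] == 0)   # first time this exact line appears
            print
        seen[$0]++           # count it so later copies are skipped
    }'
}

printf 'b\na\nb\nc\na\n' | dedupe
```

The first occurrence of each line is kept in its original position, so the input does not need to be sorted.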
sed '$!N; s/^\(.*\)\n\1$/\1/; t; D'
awk '$0==prev{print $0} {prev=$0}' # works only on consecutive
awk 'a[$0]++' # works on non-consecutive
sed '1,10d'
awk 'NR>10'
sed '$d'
awk 'NR>1{print prev} {prev=$0}'
sed 'N;$!P;$!D;$d'
awk 'NR>2{print prev[2]} {prev[2]=prev[1]; prev[1]=$0}' # method 1
awk '{a[NR]=$0} END{for (i=1;i<=NR-2;i++) print a[i]}' # method 2
awk -v num=2 'NR>num{print prev[num]}
{for (i=num;i>1;i--) prev[i]=prev[i-1]; prev[1]=$0}' # method 3
sed -e :a -e '$d;N;2,10ba' -e 'P;D' # method 1
sed -n -e :a -e '1,10!{P;N;D;};N;ba' # method 2
awk -v num=10 '...same as deleting last 2 method 3 above...'
gsed '0~8d' # GNU sed only
sed 'n;n;n;n;n;n;n;d;' # other seds
awk 'NR%8'
sed '/pattern/d'
awk '!/pattern/'
sed '/^$/d' # method 1
sed '/./!d' # method 2
awk '!/^$/' # method 1
awk '/./' # method 2
sed '/./,/^$/!d'
awk '/./,/^$/'
sed '/./,$!d'
awk 'NF{found=1} found'
sed -e :a -e '/^\n*$/{$d;N;ba' -e '}' # works on all seds
sed -e :a -e '/^\n*$/N;/\n$/ba' # ditto, except for gsed 3.02.*
awk '{a[NR]=$0} NF{nbNr=NR} END{for (i=1;i<=nbNr;i++) print a[i]}'
sed -n '/^$/{p;h;};/./{x;/./p;}'
awk -v FS='\n' -v RS='' '{for (i=1;i<=NF;i++) print $i; print ""}'
Special Applications
sed '/^$/q' # deletes everything after first blank line
awk '{print $0} /^$/{exit}'
sed '1,/^$/d' # deletes everything up to first blank line
awk 'found{print $0} /^$/{found=1}'
sed '/^Subject: */!d; s///;q'
awk 'sub(/^Subject: */,""){print $0; exit}'
sed 's/ *(.*)//; s/>.*//; s/.*[:<] *//'
awk '{sub(/ *\(.*\)/,""); sub(/>.*/,""); sub(/.*[:<] */,""); print $0}'
sed 's/^/> /'
awk '{print "> " $0}'
sed 's/^> //'
awk '{sub(/^> /,""); print $0}'
Awk's RE Match Very Fast
This is a tale of two approaches to regular expression matching. One of them is in widespread use in the standard interpreters for many languages, including Perl. The other is used only in a few places, notably most implementations of awk and grep. The two approaches have wildly different performance characteristics.
Errata: WHINY_USERS slows down Awk
The Secret WHINY_USERS Flag
awk '{
....
...
..
printf"%4s %4s\n",$1,$2 > "file1"
}' input
There's also the undocumented WHINY_USERS flag for GNU awk that allows for sorted processing of
arrays:
$ cat file
2
1
4
3
$ gawk '{a[$0]}END{for (i in a) print i}' file
4
1
2
3
$ WHINY_USERS=1 gawk '{a[$0]}END{for (i in a) print i}' file
1
2
3
4
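Because WHINY_USERS is undocumented and specific to gawk, scripts that depend on it may not be portable. One alternative, sketched here on the assumption that a POSIX sort(1) is on the PATH, is to pipe the keys through sort (the function name sorted_keys is made up for this example). gawk 4.0 and later also document PROCINFO["sorted_in"] for controlling traversal order.

```sh
sorted_keys() {
    awk '{ a[$0] }                       # referencing a[$0] creates the key
         END {
             for (i in a) print i | "sort -n"
             close("sort -n")            # wait for sort to finish
         }' "$@"
}

printf '2\n1\n4\n3\n' | sorted_keys
```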
Execution Cost
runWhin() {
WHINY_USERS=1 gawk -v M=1000000 --source '
BEGIN {
M = M ? M : 50
N = M
print N
while(N-- > 0) {
key = rand()" "rand()" "rand()" "rand()" "rand()
A[key] = M - N
}
for(i in A)
N++
}'
}
runNoWhin() {
gawk -v M=1000000 --source '
BEGIN {
M = M ? M : 50
N = M
print N
while(N-- > 0) {
key = rand()" "rand()" "rand()" "rand()" "rand()
A[key] = M - N
}
for(i in A)
N++
}'
}
time runWhin
time runNoWhin
% bash whiny.sh
1000000
real 0m18.897s
user 0m15.826s
sys 0m2.445s
1000000
real 0m16.345s
user 0m13.469s
sys 0m2.435s
Print Ranges
Problem
hiking trails in the city
muir hike
black mountain hike
summer meados hike
The following range pattern won't work right:
awk '/hiking/,/end hiking/{print}' myfile
since that returns some spurious information.
Solution
/start/,/end/
start
a
start
b
end
c
end
/start/{f=1} f; /end/{f=0}
/start/{f=1} f&&cond; /end/{f=0}
/start/,/end/{if (cond) print}
f; /start/{f=1} /end/{f=0}
vs
/start/,/end/{if (!/start/) print}
f; /start/{f=1} /end/{f=0}
/start/,/end/{if (!/start/) print}
/start/,/end/{if (!nr++) print; if (/end/) nr=0}
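The order of the flag rules decides whether the marker lines themselves are printed. As a sketch (the function name between is hypothetical), clearing the flag before the print rule and setting it afterwards prints only the lines strictly between the markers:

```sh
between() {
    awk '
        /end/   { f = 0 }   # clear the flag first, so the end marker is dropped
        f                   # print while inside the block
        /start/ { f = 1 }   # set the flag last, so the start marker is dropped
    '
}

printf 'x\nstart\na\nb\nend\ny\n' | between
```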
Using Awk for Databases
Contents
• Download
• General Information
• Introduction
• Introduction to Awk
• Using the Scripts
• Index Card Databases
• Card File
• "Flash Cards" for Memorization
• Custom Databases
• Address Book
• Grading Program
• Checkbook Program
• Importing and Exporting Data
• Importing Data for use by Awk
• Exporting Data to Microsoft Excel
• Exporting Data to a Web Page
• Exporting Data to a Palm Pilot
• On Your Windows PC
• Author
Download
General Information
Introduction
Introduction to Awk
Using the Scripts
gawk -f SCRIPT DATAFILE
where SCRIPT is the name of the file that contains the Awk script
and DATAFILE is the name of the text file that contains the input data.
gawk -f SCRIPT DATAFILE > NEWFILE
where NEWFILE is the name of the new data file that will be created.
Index Card Databases
Card File
Title of Card
-------------------------
Free-formatted field of
information about this
particular card, but
without any blank lines.
Let's take this information and store it in a text file.
To keep things simple,
the cards within the file are separated with a blank line,
and the first line of each card will be the title.
Write a book and become famous
This is a long range
goal. I need a good book
idea first. And writing
skills.
Solve the problems of society
This might take
a little longer
than expected.
Take out the garbage
It's stinking up
the garage.
# titles - Print the titles of all the cards in the
# index card file.
BEGIN { RS = ""; FS = "\n" }
{ print $1 }
[B:\] gawk -f titles cards.txt
Write a book and become famous
Solve the problems of society
Take out the garbage
[B:\]
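With RS set to the empty string, each blank-line-separated card is one record, and with FS set to a newline, each line of the card is one field. That makes NF the number of lines on the card. A small sketch built on the same settings (the function name card_sizes is an assumption) lists each title with the size of its body:

```sh
card_sizes() {
    awk 'BEGIN { RS = ""; FS = "\n" }
         { print $1 ": " (NF - 1) " body lines" }' "$@"
}

printf 'Take out the garbage\nIt is stinking up\nthe garage.\n\nWrite a book\nskills.\n' | card_sizes
```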
# search - Print the index card that contains a string
BEGIN { RS = ""; FS = "\n"; IGNORECASE=1 }
/write/ { print $0, "\n" }
[B:\] gawk -f search cards.txt
Write a book and become famous
This is a long range
goal. I need a good book
idea first. And writing
skills.
[B:\]
# sort - Sort index card file by the card titles
BEGIN { RS = ""; FS = "\n" }
{ A[NR] = $0 }
END {
qsort(A, 1, NR)
for (i = 1; i <= NR; i++) {
print A[i]
if (i == NR) break
print ""
}
}
# QuickSort
# Source: "The AWK Programming Language", by Aho, et.al., p.161
function qsort(A, left, right, i, last) {
if (left >= right)
return
swap(A, left, left+int((right-left+1)*rand()))
last = left
for (i = left+1; i <= right; i++)
if (A[i] < A[left])
swap(A, ++last, i)
swap(A, left, last)
qsort(A, left, last-1)
qsort(A, last+1, right)
}
function swap(A, i, j, t) {
t = A[i]; A[i] = A[j]; A[j] = t
}
[B:\] awk -f sort cards.txt > new.txt
[B:\] rename cards.txt cards.bak
[B:\] rename new.txt cards.txt
[B:\] type cards.txt
Solve the problems of society
This might take
a little longer
than expected.
Take out the garbage
It's stinking up
the garage.
Write a book and become famous
This is a long range
goal. I need a good book
idea first. And writing
skills.
[B:\]
Note that we renamed our old data file to cards.bak,
instead of deleting the file.
It's always good to keep backups of old databases.
"Flash Cards" for Memorization
What is your name?
My name is
Sir Lancelot
of Camelot.
What is your quest?
To seek the
Holy Grail.
What is your favorite color?
Blue.
# memorize - randomly display an index card title, ask user to
# press return, then display the corresponding body of the card
BEGIN { RS=""; FS="\n" }
{ A[NR] = $0 }
END {
RS="\n"; FS=" "
shuffle(A, NR)
for (i = 1; i <= NR; i++) {
print "\nQUESTION: ", substr(A[i], 1, index(A[i], "\n")-1)
printf "\nPress return for the answer: "
getline < "-"
print "\nANSWER: "
print substr(A[i], index(A[i], "\n")+1)
if (i == NR) break
printf "\nPress return to continue, or 'q' to quit: "
getline < "-"
if ($1 == "q") break
}
}
# Shuffle the array
function shuffle(A, n, i, j, t) { # i, j, t are local variables
srand()
# Moses/Oakford shuffle algorithm
for (i = n; i > 1; i--) {
j = int(i * rand()) + 1 # pick j uniformly from 1..i
t = A[j]; A[j] = A[i]; A[i] = t
}
}
[B:\] gawk -f memorize question.txt
QUESTION: What is your quest?
Press return for the answer:
ANSWER:
To seek the
Holy Grail.
Press return to continue, or 'q' to quit:
QUESTION: What is your favorite color?
Press return for the answer:
ANSWER:
Blue.
Press return to continue, or 'q' to quit:
QUESTION: What is your name?
Press return for the answer:
ANSWER:
My name is
Sir Lancelot
of Camelot.
[B:\] gawk -f memorize question.txt
QUESTION: What is your favorite color?
Press return for the answer:
ANSWER:
Blue.
Press return to continue, or 'q' to quit: q
[B:\]
Custom Databases
Address Book
John Robinson,Koren Inc.,978 4th Ave,Boston,MA 01760,617-696-0987
Phyllis Chapman,GVE Corp.,34 Sea Drive,Amesbury,MA 01881,781-879-0900
Here is the script called 'labels' which will print all the data and
format it like mailing labels:
# labels - Format the addresses for printing labels
# Source: blocklist.awk from "Sed & Awk", by Dale Dougherty, p.148
BEGIN { FS = "," }
{
print "" # blank line
print $1 # name
print $2 # company
print $3 # street
print $4, $5 # city, state zip
}
This is the sample run:
[B:\] gawk -f labels address.txt
John Robinson
Koren Inc.
978 4th Ave
Boston MA 01760
Phyllis Chapman
GVE Corp.
34 Sea Drive
Amesbury MA 01881
[B:\]
# phones
# Source: phonelist.awk, from "Sed & Awk", by Dale Dougherty, p.148
BEGIN { FS="," }
{ print $1 ", " $6 }
Here is a sample run:
[B:\] gawk -f phones address.txt
John Robinson, 617-696-0987
Phyllis Chapman, 781-879-0900
[B:\]
We'll also need a script to search our data file for a name.
Here is a script called 'searchad' which will search for the string 'robinson':
# searchad - Return the record that matches a string
BEGIN { FS = ","; IGNORECASE=1 }
/robinson/ {
print "" # blank line
print $1 # name
print $2 # company
print $3 # street
print $4, $5 # city, state zip
}
[B:\] gawk -f searchad address.txt
John Robinson
Koren Inc.
978 4th Ave
Boston MA 01760
[B:\]
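The 'searchad' script hard-codes the search string and relies on gawk's IGNORECASE. A possible variation, sketched here for a Unix-style shell rather than the DOS prompt used in the sample runs, passes the string in with -v and uses tolower(), which is available in any POSIX awk. The wrapper name search_address is hypothetical.

```sh
search_address() {   # usage: search_address STRING [FILE...]
    name=$1; shift
    awk -F, -v name="$name" '
        index(tolower($0), tolower(name)) {   # case-insensitive substring match
            print ""          # blank line
            print $1          # name
            print $2          # company
            print $3          # street
            print $4, $5      # city, state zip
        }' "$@"
}

printf '%s\n' 'John Robinson,Koren Inc.,978 4th Ave,Boston,MA 01760,617-696-0987' |
    search_address robinson
```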
Grading Program
Allen Mona 70 77 85 83 70 89
Baker John 85 92 78 94 88 91
Jones Andrea 89 90 85 94 90 95
Smith Jasper 84 88 80 92 84 82
Turner Dunce 64 80 60 60 61 62
Wells Ellis 90 98 89 96 96 92
# grades -- average student grades and determine
# letter grade as well as class averages
# Source: "Sed & Awk", by Dale Dougherty, p.192
# set output field separator to tab.
BEGIN { OFS = "\t" }
# action applied to all input lines
{
# add up the grades
total = 0
for (i = 3; i <= NF; ++i)
total += $i
# calculate average
avg = total / (NF - 2)
# assign student's average to element of array
class_avg[NR] = avg
# determine letter grade
if (avg >= 90) grade="A"
else if (avg >= 80) grade="B"
else if (avg >= 70) grade="C"
else if (avg >= 60) grade="D"
else grade="F"
# increment counter for letter grade array
++class_grade[grade]
# print student name, average, and letter grade
print $1 " " $2, avg, grade
}
# print out class statistics
END {
# calculate class average
for (x = 1; x <= NR; x++)
class_avg_total += class_avg[x]
class_average = class_avg_total / NR
# determine how many above/below average
for (x = 1; x <= NR; x++)
if (class_avg[x] >= class_average)
++above_average
else
++below_average
# print results
print ""
print "Class Average: ", class_average
print "At or Above Average: ", above_average
print "Below Average: ", below_average
# print number of students per letter grade
for (letter_grade in class_grade)
print letter_grade ":", class_grade[letter_grade]
}
[B:\] gawk -f grades grades.txt
Allen Mona 79 C
Baker John 88 B
Jones Andrea 90.5 A
Smith Jasper 85 B
Turner Dunce 64.5 D
Wells Ellis 93.5 A
Class Average: 83.4167
At or Above Average: 4
Below Average: 2
A: 2
B: 2
C: 1
D: 1
[B:\]
# histogram
# Source: "The AWK Programming Language", by Aho, et.al., p.70
{ x[int($3/10)]++ } # use the third column of input data
END {
for (i = 0; i < 10; i++)
printf(" %2d - %2d: %3d %s\n",
10*i, 10*i+9, x[i], rep(x[i],"*"))
printf("100: %3d %s\n", x[10], rep(x[10],"*"))
}
function rep(n, s, t) { # return string of n s's
while (n-- > 0)
t = t s
return t
}
And here is the sample run:
[B:\] gawk -f histo grades.txt
0 - 9: 0
10 - 19: 0
20 - 29: 0
30 - 39: 0
40 - 49: 0
50 - 59: 0
60 - 69: 1 *
70 - 79: 1 *
80 - 89: 3 ***
90 - 99: 1 *
100: 0
[B:\]
Checkbook Program
check 1021
to Champagne Unlimited
amount 123.10
date 1/1/87
deposit
amount 500.00
date 1/1/87
check 1022
date 1/2/87
amount 45.10
to Getwell Drug Store
tax medical
check 1023
amount 125.00
to International Travel
date 1/3/87
check 1024
amount 50.00
to Carnegie Hall
date 1/3/87
tax charitable contribution
check 1025
to American Express
amount 75.75
date 1/5/87
# check - print total deposits and checks
# Source: "The AWK Programming Language", by Aho, et.al., p.87
BEGIN { RS=""; FS="\n" }
/(^|\n)deposit/ { deposits += field("amount"); next }
/(^|\n)check/ { checks += field("amount"); next }
END { printf("Deposits: $%.2f, Checks: $%.2f\n",
deposits, checks)
}
function field(name, i, f) {
for (i = 1; i <= NF; i++) {
split($i, f, "\t")
if (f[1] == name)
return f[2]
}
printf("Error: no field %s in record\n%s\n", name, $0)
}
[B:\] gawk -f check checks.txt
Deposits: $500.00, Checks: $418.95
[B:\]
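The same record layout can answer other questions. As a sketch, assuming (as the field() function does) that each line holds a name and a value separated by a tab, the following totals the amounts per "tax" category. The function name tax_totals is made up for the example, and the output is piped through sort because the order of "for (t in total)" is unspecified.

```sh
tax_totals() {
    awk 'BEGIN { RS = ""; FS = "\n" }
         {
             amount = 0; tax = ""
             for (i = 1; i <= NF; i++) {
                 split($i, f, "\t")
                 if (f[1] == "amount") amount = f[2]
                 if (f[1] == "tax")    tax = f[2]
             }
             if (tax != "") total[tax] += amount   # only records with a tax line
         }
         END { for (t in total) printf("%s: $%.2f\n", t, total[t]) }' "$@" | sort
}

printf 'check\t1022\namount\t45.10\ntax\tmedical\n\ncheck\t1025\namount\t75.75\n' | tax_totals
```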
Importing and Exporting Data
Importing Data for use by Awk
Exporting Data to Microsoft Excel
Allen Mona 70 77 85 83 70 89
Baker John 85 92 78 94 88 91
Jones Andrea 89 90 85 94 90 95
Smith Jasper 84 88 80 92 84 82
Turner Dunce 64 80 60 60 61 62
Wells Ellis 90 98 89 96 96 92
# conv2xls - Convert a data file into tab-separated format
BEGIN {
FS=" " # input field separator is a space (awk's variable is FS)
OFS="\t" # output field separator is a tab
}
{ print $1, $2, $3, $4, $5, $6, $7, $8 }
[B:\] gawk -f conv2xls grades.txt > grades.xls
[B:\]
Here is the contents of the 'grades.xls' text file:
Allen Mona 70 77 85 83 70 89
Baker John 85 92 78 94 88 91
Jones Andrea 89 90 85 94 90 95
Smith Jasper 84 88 80 92 84 82
Turner Dunce 64 80 60 60 61 62
Wells Ellis 90 98 89 96 96 92
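A variation, offered as a sketch rather than as part of the original program: instead of listing $1 through $8, assigning a field to itself makes awk rebuild the record with OFS, so the same conversion works for any number of columns. The wrapper name to_tsv is made up for this illustration.

```sh
# to_tsv: hypothetical helper name; rewrites each record with tab separators.
to_tsv() {
    awk 'BEGIN { OFS = "\t" }   # output field separator is a tab
         { $1 = $1; print }     # touching a field makes awk rejoin $0 with OFS
    ' "$@"
}

printf 'Allen Mona 70 77\n' | to_tsv
```

The result is the same tab-separated line that the explicit field list produces, without hard-coding the column count.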

Exporting Data to a Web Page
Allen Mona 70 77 85 83 70 89
Baker John 85 92 78 94 88 91
Jones Andrea 89 90 85 94 90 95
Smith Jasper 84 88 80 92 84 82
Turner Dunce 64 80 60 60 61 62
Wells Ellis 90 98 89 96 96 92
# html - Convert a data file into an HTML web page with a table
BEGIN {
print "<HTML><HEAD><TITLE>Grades Database</TITLE></HEAD>"
print "<BODY BGCOLOR=\"#ffffff\">"
print "<CENTER><H1>Grades Database</H1></CENTER>"
print "<HR noshade size=4 width=75%>"
print "<P><CENTER><TABLE BORDER>"
printf "<TR><TH>Last<TH>First"
print "<TH>G1<TH>G2<TH>G3<TH>G4<TH>G5<TH>G6"
}
{ # Print the data in table rows
printf "<TR><TD>%s<TD>%s", $1, $2 # pass data as arguments, not as the format
printf "<TD>%s<TD>%s<TD>%s", $3, $4, $5
print "<TD>" $6 "<TD>" $7 "<TD>" $8
}
END {
print "</TABLE></CENTER><P>"
print "<HR noshade size=4 width=75%>"
print "</BODY></HTML>"
}
[B:\] gawk -f html grades.txt > grades.htm
[B:\]
<HTML><HEAD><TITLE>Grades Database</TITLE></HEAD>
<BODY BGCOLOR="#ffffff">
<CENTER><H1>Grades Database</H1></CENTER>
<HR noshade size=4 width=75%>
<P><CENTER><TABLE BORDER>
<TR><TH>Last<TH>First<TH>G1<TH>G2<TH>G3<TH>G4<TH>G5<TH>G6
<TR><TD>Allen<TD>Mona<TD>70<TD>77<TD>85<TD>83<TD>70<TD>89
<TR><TD>Baker<TD>John<TD>85<TD>92<TD>78<TD>94<TD>88<TD>91
<TR><TD>Jones<TD>Andrea<TD>89<TD>90<TD>85<TD>94<TD>90<TD>95
<TR><TD>Smith<TD>Jasper<TD>84<TD>88<TD>80<TD>92<TD>84<TD>82
<TR><TD>Turner<TD>Dunce<TD>64<TD>80<TD>60<TD>60<TD>61<TD>62
<TR><TD>Wells<TD>Ellis<TD>90<TD>98<TD>89<TD>96<TD>96<TD>92
</TABLE></CENTER><P>
<HR noshade size=4 width=75%>
</BODY></HTML>
Exporting Data to a Palm Pilot
Allen Mona 70 77 85 83 70 89
Baker John 85 92 78 94 88 91
Jones Andrea 89 90 85 94 90 95
Smith Jasper 84 88 80 92 84 82
Turner Dunce 64 80 60 60 61 62
Wells Ellis 90 98 89 96 96 92
# conv2csv - Convert a data file into comma-separated format
BEGIN {
FS=" " # input field separator is a space (awk's variable is FS, not IFS)
OFS="," # output field separator is a comma
}
{ print $1, $2, $3, $4, $5, $6, $7, $8 }
[B:\] gawk -f conv2csv grades.txt > grades.csv
[B:\]
Allen,Mona,70,77,85,83,70,89
Baker,John,85,92,78,94,88,91
Jones,Andrea,89,90,85,94,90,95
Smith,Jasper,84,88,80,92,84,82
Turner,Dunce,64,80,60,60,61,62
Wells,Ellis,90,98,89,96,96,92
title "GradesDB"
field "Last" string 38
field "First" string 38
field "G1" integer 14
field "G2" integer 14
field "G3" integer 14
field "G4" integer 14
field "G5" integer 14
field "G6" integer 14
option backup on
On Your Windows PC
Now we create the PDB file on our PC with this command line:

Author
Random Numbers in Gawk
Background
BEGIN {srand() }
Houston, We Have a Problem
Solution #1: Persistent Memory
Solution #2: Use Bash
gawk -v Seed=$RANDOM --source 'BEGIN { srand(Seed ? Seed : 1) }'
BEGIN { if (Seed) { srand(Seed) } else { srand() } }
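As a minimal sketch of why passing the seed in from outside is useful: the same seed reproduces the same sequence, which helps when debugging. The helper name seeded_rand is hypothetical, and unlike the one-liner above this version tests for an empty seed rather than a zero one.

```sh
# seeded_rand: hypothetical helper; prints one rand() value for a given seed.
seeded_rand() {
    awk -v Seed="$1" 'BEGIN {
        if (Seed != "") srand(Seed)   # explicit seed: repeatable sequence
        else            srand()       # no seed given: fall back to time of day
        print rand()
    }'
}

first=$(seeded_rand 42)
second=$(seeded_rand 42)
[ "$first" = "$second" ] && echo "same seed, same number"
```

Calling it with $RANDOM instead of a constant gives the varying behaviour that Solution #2 is after.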
Solution #3: Query the OS
BEGIN {
"od -tu4 -N4 -A n /dev/random" | getline
srand(0+$0)
}
Solution #4: Use the Process Id
$ gawk 'BEGIN { srand(systime() + PROCINFO["pid"]); print rand() }'
0.405889
$ gawk 'BEGIN { srand(systime() + PROCINFO["pid"]); print rand() }'
0.671906
$ time gawk 'BEGIN { srand(systime() + PROCINFO["pid"]) }'
real 0m0.006s
user 0m0.002s
sys 0m0.004s
$ time gawk 'BEGIN { "od -tu4 -N4 -A n /dev/random" | getline; srand($0+0) }'
real 0m0.039s
user 0m0.004s
sys 0m0.034s
Conclusion
Super-For Loops
#shows an example of a superfor loop
BEGIN {
#define loop maximums
loopmax[1]=4
loopmax[2]=6
loopmax[3]=8
loopmax[4]=10
loopmax[5]=12
loopmax[6]=20
#call the loop
superfor(6)
}
function superfor(loopdepth, zz) { # zz is a local variable
currloopnum++
#start of prologue
#end of prologue
for(loopcounter[currloopnum]=1;
loopcounter[currloopnum]<=loopmax[currloopnum];
loopcounter[currloopnum]++) {
if ( loopdepth==1 ) {
#start of superfor body
for (zz=1;zz<=currloopnum;zz++) {
printf loopcounter[zz] FS
}
print ""
#end of superfor body
}
else if ( loopdepth>1 )
superfor(loopdepth-1)
}
#start of epilog
#end of epilog
loopdepth++ ; currloopnum--
}
function superfor(loopdepth, prologue, body, epilogue, zz)
{
currloopnum++
@prologue()
for(loopcounter[currloopnum]=1;
loopcounter[currloopnum]<=loopmax[currloopnum];
loopcounter[currloopnum]++) {
if ( loopdepth==1 ) {
@body()
}
else if ( loopdepth>1 )
superfor(loopdepth-1, prologue,
body, epilogue)
}
@epilogue()
loopdepth++ ; currloopnum--
}
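For awks without indirect function calls, the same idea can be sketched with plain recursion. This is a simplified variant, not the author's code: it counts levels upward, keeps the loop variable local, and the names nested_loops, counter and depth are assumptions made for the example.

```sh
nested_loops() {
    # $1 is a space-separated list of loop maximums, outermost first
    awk -v maxes="$1" '
    function superfor(level,    i, k, line) {
        if (level > depth) {                 # innermost body: print all counters
            line = counter[1]
            for (k = 2; k <= depth; k++) line = line " " counter[k]
            print line
            return
        }
        for (i = 1; i <= loopmax[level]; i++) {
            counter[level] = i               # remember this level, then go deeper
            superfor(level + 1)
        }
    }
    BEGIN { depth = split(maxes, loopmax, " "); superfor(1) }'
}

nested_loops "2 3"    # six lines, from "1 1" to "2 3"
```

Because i is a local parameter, each recursion level has its own loop counter and no global bookkeeping of the current depth is needed.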
Using Field Names to Reference Columns
Problem
Solution
Try this shell script:
#!/bin/sh
awk -F, -v cols="${1:?}" '
BEGIN {
n=split(cols,col)
for (i=1; i<=n; i++) s[col[i]]=i
}
NR==1 {
for (f=1; f<=NF; f++)
if ($f in s) c[s[$f]]=f
next
}
{ sep=""
for (f=1; f<=n; f++) {
printf("%s%s",sep,$c[f])
sep=FS
}
print ""
}
'
hello,world,region_name,foo,bar,xyz,dummy
11111,22222,aspac,77777,8888888,xyz,zzzzz
21111,22222,ASPAC,77777,8888888,xyz,zzzzz
31111,22222,ASPAC,77777,8888888,XYZ,zzzzz
41111,22222,aspac,77777,8888888,XYZ,zzzzz
sh bycolname.sh world,hello
... would produce:
22222,11111
22222,21111
22222,31111
22222,41111
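The core of the technique is a map from header name to column position, built from the first line. A stripped-down sketch for a single column follows; by_name, want and idx are names invented for the illustration.

```sh
by_name() {
    # $1: one column name (hypothetical single-column variant of the script)
    awk -F, -v want="$1" '
    NR == 1 { for (f = 1; f <= NF; f++) idx[$f] = f; next }  # header: name to position
    (want in idx) { print $(idx[want]) }
    '
}

printf 'hello,world,region_name\n11111,22222,aspac\n21111,22222,ASPAC\n' |
    by_name region_name
```

With the sample data this prints aspac and ASPAC, one per line. An unknown column name simply prints nothing.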
Bugs
Use (and Abuse) of Getline
Getline
"The getline command is used in several different ways and should not be
used by beginners. ... come back and study the getline command after you
have reviewed the rest ... and have a good knowledge of how awk works."
Variants
Variant                  Variables Set
-------                  -------------
getline                  $0, ${1...NF}, NF, FNR, NR, FILENAME
getline var              var, FNR, NR, FILENAME
getline < file           $0, ${1...NF}, NF
getline var < file       var
command | getline        $0, ${1...NF}, NF, NR
command | getline var    var, NR
command |& getline       $0, ${1...NF}, NF
command |& getline var   var
if/while ( (getline var < file) > 0)
if/while ( (command | getline var) > 0)
if/while ( (command |& getline var) > 0)
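The explicit "> 0" in these forms matters because getline returns 1 on success, 0 at end of input and -1 on error, and -1 counts as true in a bare while condition. A small sketch that distinguishes the three cases; the function name and the file path are hypothetical.

```sh
read_status() {
    awk -v file="$1" 'BEGIN {
        status = (getline line < file)
        if (status > 0)       print "first line: " line
        else if (status == 0) print "empty file"
        else                  print "cannot read file"   # status is -1
        if (status >= 0) close(file)
    }'
}

read_status /nonexistent/hypothetical.txt    # prints "cannot read file"
```

Without the comparison, a loop over an unreadable file would never terminate.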
Caveats
FNR==1 { ... start of file actions ... }
File transitions can occur at getlines, so FNR==1 needs to also be
checked after each unredirected (from a specific file name) getline.
e.g. if you want to print the first line of each of these files:
$ cat file1
a
b
$ cat file2
c
d
you'd normally do:
$ awk 'FNR==1{print}' file1 file2
a
c
but if a "getline" snuck in, it could have the unexpected consequence of
skipping the test for FNR==1 and so not printing the first line of the
second file.
$ awk 'FNR==1{print}/b/{getline}' file1 file2
a
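One way to keep the start-of-file action working, sketched here under the assumption that the getline really has to stay, is to put the action in a function and call it again after every unredirected getline. The names first_lines and start_of_file are made up for this example.

```sh
first_lines() {
    awk '
    function start_of_file() { if (FNR == 1) print }
    { start_of_file() }                          # normal per-record check
    /b/ { if ((getline) > 0) start_of_file() }   # repeat the check after getline
    ' "$@"
}

dir=$(mktemp -d)
printf 'a\nb\n' > "$dir/file1"
printf 'c\nd\n' > "$dir/file2"
first_lines "$dir/file1" "$dir/file2"    # prints a, then c
rm -r "$dir"
```

The duplication is exactly the kind of cost that makes the getline design less attractive than letting awk's main loop read the records.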
some header line
----------------
data line 1
data line 2
...
data line 10000
you may consider using...
BEGIN { getline header; getline }
{ whatever_using_header_and_data_on_the_line() }
instead of...
FNR == 1 { header = $0 }
FNR < 3 { next }
{ whatever_using_header_and_data_on_the_line() }
but the getline version would not work on multiple files since the BEGIN
section would only be executed once, before the first file is processed,
whereas the non-getline version would work as-is. This is one example of
the common case where the getline command itself isn't directly causing
the problem, but the type of design you can end up with if you select a
getline approach is not ideal.
Applications
command = "ls"
while ( (command | getline var) > 0) {
print var
}
close(command)
command = "LC_ALL=C sort"
n = split("abcdefghijklmnopqrstuvwxyz", a, "")
for (i = n; i > 0; i--)
print a[i] |& command
close(command, "to")
while ((command |& getline var) > 0)
print "got", var
close(command)
BEGIN {
while ( (getline var < ARGV[1]) > 0) {
data[var]++
}
close(ARGV[1])
ARGV[1]=""
}
$0 in data
awk 'function read(file) {
while ( (getline < file) > 0) {
if ($1 == "include") {
read($2)
} else {
print > ARGV[2]
}
}
close(file)
}
BEGIN{
read(ARGV[1])
ARGV[1]=""
close(ARGV[2])
}1' file1 tmp
command = "program"
do {
print data |& command
command |& getline var
} while (data left to process)
close(command)
# fails if first file is empty
NR==FNR{ data[$0]++; next }
$0 in data
FILENAME==ARGV[1] { data[$0]++; next }
$0 in data
FILENAME=="specificFileName" { data[$0]++; next }
$0 in data
ARGIND==1 { data[$0]++; next }
$0 in data
Tips
cmd="some command"
do something with cmd
close(cmd)
awk 'c&&!--c;/pattern/{c=N}' file
awk 'c&&!--c{next}/pattern/{c=N}' file
awk 'c&&c--;/pattern/{c=N}' file
awk 'c&&c--{next}/pattern/{c=N}' file
$ cat file
line 1
line 2
line 3
line 4
line 5
line 6
line 7
line 8
$ awk '/3/{getline;getline;print}' file
line 5
$ awk 'c&&!--c;/3/{c=2}' file
line 5
$ awk '/3/{getline;getline;getline;getline;getline;print}' file
line 8
$ awk 'c&&!--c;/3/{c=5}' file
line 8
$ awk '/3/{for (c=1;c<=5;c++) getline; print}' file
line 8
$ awk '/3/{for (c=1;c<=5;c++) { getline; if ($0 ~ /4/) print "Eureka!" }
print}' file
Eureka!
line 8
$ awk 'c&&!--c;/3/{c=5}/4/{print "Eureka!"}' file
Eureka!
line 8
$ awk '/3/{for (c=1;c<=5;c++) { if ($0 ~ /4/) print "Eureka!"; getline }
if ($0 ~ /4/) print "Eureka!"; print}' file
Eureka!
line 8
$ awk '{if ($0 ~ /4/) print "Eureka!"}' file
Eureka!
$ awk '/4/{print "Eureka!"}' file
Eureka!
$ awk '/4/{print "Eureka!"}' file
Eureka!
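The countdown idiom c&&!--c from the tips can be wrapped with the pattern and N as parameters. This is only a sketch; nth_after, re and N are illustrative names.

```sh
nth_after() {
    # $1: regexp, $2: N; prints the Nth line after each match
    awk -v re="$1" -v N="$2" '
    c && !--c { print }     # countdown reached zero: this is the Nth line
    $0 ~ re   { c = N }     # start (or restart) the countdown on every match
    '
}

printf 'line %s\n' 1 2 3 4 5 6 7 8 | nth_after 3 2    # prints "line 5"
```

Since every record still passes through the main loop, other patterns (such as the test for 4 above) keep working without being repeated.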
Never write for(i=1;i<=n;i++).. again?
n = split(something,arr,/re/)
for(i=1;i<=n;i++) {
print arr[i]
}
n = split(something,arr,/re/)
while(n--) {
print arr[++i] # i starts unset (0), so ++i yields 1, 2, ... n
}
# copy a number indexed array, assuming n contains the number of
# elements
while(n) { arr2[n] = arr1[n]; n-- } # copies indices n down to 1
for(i in arr1) arr2[i] = arr1[i]
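A quick check of the two copying styles, as a sketch: a countdown over numeric indices and an order-independent for-in copy should yield the same elements. The array names and copy_demo are arbitrary.

```sh
copy_demo() {
    awk 'BEGIN {
        n = split("a b c", arr1, " ")
        for (i = n; i > 0; i--) arr2[i] = arr1[i]   # countdown copy, indices n..1
        for (k in arr1) arr3[k] = arr1[k]            # order-independent copy
        print arr2[1] arr2[2] arr2[3], arr3[1] arr3[2] arr3[3]
    }'
}

copy_demo    # prints "abc abc"
```

The for-in form is the only one that also works for arrays whose indices are not 1..n.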
Want more?
Moving Files with Awk
while ((getline < "somedata.txt") > 0)
{print | "mv"} #or could be "mv -v" for verbose.
oldfile newfile
# build the command and execute it
while ((getline < "somedata.txt") > 0) {
command = "mv " $1 " " $2
system(command)
}
close("somedata.txt")
# send commands to the shell
while ((getline < "somedata.txt") > 0) {
printf("mv %s %s\n", $1, $2) | "sh"
}
close("somedata.txt")
close("sh")
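Both working versions hand unquoted names to the shell, so a name containing a quote or other shell metacharacter would break the command. A hedged sketch of a quoting helper follows; shell_quote and mv_commands are hypothetical names, and the function only prints the commands so they can be inspected before being piped to sh.

```sh
mv_commands() {
    awk '
    # shell_quote: hypothetical helper; wraps s in single quotes for sh.
    # Each embedded single quote becomes: quote, backslash-quote, quote.
    function shell_quote(s,    n, parts, i, out) {
        n = split(s, parts, "\047")
        out = parts[1]
        for (i = 2; i <= n; i++) out = out "\047\\\047\047" parts[i]
        return "\047" out "\047"
    }
    NF == 2 { print "mv " shell_quote($1) " " shell_quote($2) }
    ' "$@"
}

printf 'oldfile newfile\n' | mv_commands    # prints: mv 'oldfile' 'newfile'
```

Names containing whitespace still need a different field separator in the data file, since the default FS splits on blanks.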
AwkSed: A Simple Stream Editor
command1 < orig.data | sed 's/old/new/g' | command2 > result
# awksed.awk --- do s/foo/bar/g using just print
# Thanks to Michael Brennan for the idea
function usage()
{
print "usage: awksed pat repl [files...]" > "/dev/stderr"
exit 1
}
BEGIN {
# validate arguments
if (ARGC < 3)
usage()
RS = ARGV[1]
ORS = ARGV[2]
# don't use arguments as files
ARGV[1] = ARGV[2] = ""
}
# look ma, no hands!
{
if (RT == "")
printf "%s", $0
else
print
}
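The program above relies on gawk's regexp RS and the RT variable. Where those are unavailable, a plainer technique gives similar results for patterns that do not span lines: an ordinary gsub per record. This is a different approach from the RS/ORS trick, shown only as a sketch; awksed_lite is a made-up name.

```sh
awksed_lite() {
    # $1: regexp, $2: replacement; remaining arguments are files
    pat=$1; repl=$2; shift 2
    awk -v pat="$pat" -v repl="$repl" '{ gsub(pat, repl); print }' "$@"
}

printf 'old dog, old tricks\n' | awksed_lite old new    # prints "new dog, new tricks"
```

Note that -v processes backslash escapes in its value, and that an & in the replacement stands for the matched text.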
s2a: sed to Awk
Contents
Download
Description
Bugs
Author
Code
BEGIN{RS=";|\n"; FS=""; var=1;}
{
i=1; case1=""; case2="";
while($i==" ")i++;
if($i=="\\"||$i=="/"||$i~/[0-9]/) case1=matchaddr();
if($i==","){i++; case2=matchaddr()};
# handle sed commands
####################################################################################################
if($i == "d"){ a1=a2="next;";
}else if($i == "p"){ a1=a2="print;";
}else if($i == "a"){ rest="";
for(c=i+2;c<=NF;c++) rest=rest$c;
a1=a2="$0=$0\"\\n"rest"\";";
}else if($i == "q"){ a1=a2="print; exit;";
}else if($i == "n"){ a1=a2="print; if(getline <= 0) next;"
}else if($i == "s"){
re=substr($0, i); p=substr(re,2,1); match(re,"s"p"((\\"p"|.)*)"p"((\\"p"|.)*)"p"([a-zA-Z])?",tmp);
tmp[3]=gensub(/\\[0-9]/,"\\\\&","g",tmp[3]);
tmp[1]=gensub(/\\\(/,"(","g",tmp[1]); tmp[1]=gensub(/\\\)/,")","g",tmp[1]);
if(tmp[3]=="") a1=a2="$0=gensub(/"tmp[1]"/,\""tmp[3]"\",1);";
else a1=a2="$0=gensub(/"tmp[1]"/,\""tmp[3]"\",\""tmp[5]"\");";
}else if($i == "c"){ rest="";
for(c=i+2;c<=NF;c++) rest=rest$c;
a1="$0=\""rest"\";";
a2="next;";
}else if($i == "i"){ rest="";
for(c=i+2;c<=NF;c++) rest=rest$c;
a1=a2="$0=\""rest"\\n\"$0;";
}else{
print "ERROR: invalid syntax. Unknown command in expression "$0" (expr number "NR")"; exit;
}
####################################################################################################
# output awk commands
if(case1=="" && case2=="") print "{"a1"}";
else if(case1~/^[0-9]/ && case2=="") print "NR=="case1"{"a1"}";
else if(case2 == "") print "/"case1"/{"a1"}";
else if(case1~/^[0-9]/ && case2~/^[0-9]/) print "temp"var"==1&&NR=="case2"{temp"var"=0;"a2"}temp"var"==1{"a2"}NR=="case1"{temp"var"=1;"a1"}";
else if(case1~/^[0-9]/) print "temp"var"==1&&/"case2"/{temp"var"=0;"a2"}temp"var"==1{"a2"}NR=="case1"{temp"var"=1;"a1"}";
else if(case2~/^[0-9]/) print "temp"var"==1&&NR=="case2"{temp"var"=0;"a2"}temp"var"==1{"a2"}/"case1"/{temp"var"=1;"a1"}";
else print "temp"var"==1&&/"case2"/{temp"var"=0;"a2"}temp"var"==1{"a2"}/"case1"/{temp"var"=1;"a1"}";
var++;
}
function matchaddr(){
str=substr($0, i); p=1;
if($i == "\\"){ p=substr(str,2,1); match(str,p"([^"p"]*)"p,arr); i++}
else if($i == "/"){ p=substr(str,1,1); match(str,p"([^"p"]*)"p,arr); }
else { match(str,/^([0-9]*)/,arr) };
i += RLENGTH;
return arr[1];
}
END{print "{print}";}
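Reading the generator by hand, a sed script such as 2d appears to translate to the rule NR==2{next;} followed by the final {print} emitted in END. The sketch below runs that hand-derived translation; it was not produced by running s2a, and drop_second is an illustrative name.

```sh
drop_second() {
    awk 'NR==2{next;}
         {print}' "$@"
}

printf 'one\ntwo\nthree\n' | drop_second    # prints one, then three (like sed 2d)
```

The trailing {print} plays the role of sed's automatic printing of the pattern space.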
Visual Awk
Download
Abstract
Programming by Demonstration (PBD) systems often
have problems with control structure inference
and user-intended generalization. We propose
a new solution for these weaknesses based on
concepts of AWK and present a prototype system
for text processing. It utilizes vertical
demonstration, extensive visual feedback, and
program visualization via spreadsheets to
achieve improved usability and expressive
power.
Introduction
MicroTrace
Description
Code
$1 == "init" { proc[$2] = $3 }
$1 == "inp" { move[$2,$3]=move[$2,$3] $1 "/" $4 "/" $5 "/" $6 "/;" }
$1 == "out" { move[$2,$3]=move[$2,$3] $1 "/" $4 "/" $5 "/" $6 "/;" }
END { verbose=0; for (i in proc) signal[i] = "-"
run(mkstate(state))
for (i in space) nstates++;
print nstates " states, " deadlocks " deadlocks"
}
function run(state, i,str,moved) # 1 parameter, 3 local vars
{
if (space[state]++) return # been here before
level++; moved=0
for (i in proc)
{ str = move[i,proc[i]]
while (str)
{ v = substr(str, 1, index(str, ";"))
sub(v, "", str)
split(v, arr, "/")
if (arr[1] == "inp" && arr[3] == signal[arr[4]])
{ Level[level] = i " " proc[i] " -> " v
proc[i] = arr[2]
run(mkstate(k))
unwrap(state); moved=1
} else if (arr[1] == "out")
{ Level[level] = i " " proc[i] " -> " v
proc[i] = arr[2]; signal[arr[4]] = arr[3]
run(mkstate(k))
unwrap(state); moved=1
} } }
if (!moved)
{ deadlocks++
print "deadlock " deadlocks ":"
for (i in proc) print "\t" i, proc[i], signal[i]
if (verbose)
for (i = 1; i < level; i++) print i, Level[i]
}
level--
}
function mkstate(state, m)
{ state = ""
for (m in proc) state = state " " proc[m] " " signal[m]
return state
}
function unwrap(state, m)
{ split(state, arr, " "); nxt=0
for (m in proc) { proc[m] = arr[++nxt]; signal[m] = arr[++nxt] }
}
A Sample Application -- X21
The transition rules are based on the classic two-process model
for the call establishment phase of CCITT Recommendation X.21.
Interface signal pairs T, C and R, I
are combined. Each possible combination of values on these
line pairs is represented by a distinct lower-case ASCII
character below. Note that since the lines are modeled as
true signals, the receiving process can indeed miss signals
if the sending process changes them rapidly and does not wait
for the peer process to respond.
Transition rules for the `dte' process.
inp dte state01 state08 u dte
inp dte state01 state18 m dte
inp dte state02 state03 v dte
inp dte state02 state15 u dte
inp dte state02 state19 m dte
inp dte state04 state19 m dte
inp dte state05 state19 m dte
inp dte state05 state6A r dte
inp dte state07 state19 m dte
inp dte state07 state6B r dte
inp dte state08 state19 m dte
inp dte state09 state10B q dte
inp dte state09 state19 m dte
inp dte state10 state19 m dte
inp dte state10 state6C r dte
inp dte state10B state19 m dte
inp dte state10B state6C r dte
inp dte state11 state12 n dte
inp dte state11 state19 m dte
inp dte state12 state19 m dte
inp dte state14 state19 m dte
inp dte state15 state03 v dte
inp dte state15 state19 m dte
inp dte state16 state17 m dte
inp dte state17 state21 l dte
inp dte state18 state01 l dte
inp dte state18 state19 m dte
inp dte state20 state21 l dte
inp dte state6A state07 q dte
inp dte state6A state19 m dte
inp dte state6B state07 q dte
inp dte state6B state10 q dte
inp dte state6B state19 m dte
inp dte state6C state11 l dte
inp dte state6C state19 m dte
out dte state01 state02 d dce
out dte state01 state14 i dce
out dte state01 state21 b dce
out dte state02 state16 b dce
out dte state03 state04 e dce
out dte state04 state05 c dce
out dte state04 state16 b dce
out dte state05 state16 b dce
out dte state07 state16 b dce
out dte state08 state09 c dce
out dte state08 state15 d dce
out dte state08 state16 b dce
out dte state09 state16 b dce
out dte state10 state16 b dce
out dte state10B state16 b dce
out dte state11 state16 b dce
out dte state12 state16 b dce
out dte state14 state01 a dce
out dte state14 state16 b dce
out dte state15 state16 b dce
out dte state18 state16 b dce
out dte state19 state20 b dce
out dte state21 state01 a dce
out dte state6A state16 b dce
out dte state6B state16 b dce
out dte state6C state16 b dce
Transition rules for the `dce' process.
inp dce state01 state02 d dce
inp dce state01 state14 i dce
inp dce state01 state21 b dce
inp dce state02 state16 b dce
inp dce state03 state04 e dce
inp dce state04 state05 c dce
inp dce state04 state16 b dce
inp dce state05 state16 b dce
inp dce state07 state16 b dce
inp dce state08 state09 c dce
inp dce state08 state15 d dce
inp dce state08 state16 b dce
inp dce state09 state16 b dce
inp dce state10 state16 b dce
inp dce state10B state16 b dce
inp dce state11 state16 b dce
inp dce state12 state16 b dce
inp dce state14 state01 a dce
inp dce state14 state16 b dce
inp dce state15 state16 b dce
inp dce state18 state16 b dce
inp dce state19 state20 b dce
inp dce state21 state01 a dce
inp dce state6A state16 b dce
inp dce state6B state16 b dce
inp dce state6C state16 b dce
out dce state01 state08 u dte
out dce state01 state18 m dte
out dce state02 state03 v dte
out dce state02 state15 u dte
out dce state02 state19 m dte
out dce state04 state19 m dte
out dce state05 state19 m dte
out dce state05 state6A r dte
out dce state07 state19 m dte
out dce state07 state6B r dte
out dce state08 state19 m dte
out dce state09 state10B q dte
out dce state09 state19 m dte
out dce state10 state19 m dte
out dce state10 state6C r dte
out dce state10B state19 m dte
out dce state10B state6C r dte
out dce state11 state12 n dte
out dce state11 state19 m dte
out dce state12 state19 m dte
out dce state14 state19 m dte
out dce state15 state03 v dte
out dce state15 state19 m dte
out dce state16 state17 m dte
out dce state17 state21 l dte
out dce state18 state01 l dte
out dce state18 state19 m dte
out dce state20 state21 l dte
out dce state6A state07 q dte
out dce state6A state19 m dte
out dce state6B state07 q dte
out dce state6B state10 q dte
out dce state6B state19 m dte
out dce state6C state11 l dte
out dce state6C state19 m dte
Initialization
init dte state01
init dce state01
Error Listings (verbose mode)
deadlock 1:
dce state21 b
dte state16 l
1 dce state01 -> out/state08/u/dte/;
2 dce state08 -> out/state19/m/dte/;
3 dte state01 -> inp/state18/m/dte/;
4 dte state18 -> inp/state19/m/dte/;
5 dte state19 -> out/state20/b/dce/;
6 dce state19 -> inp/state20/b/dce/;
7 dce state20 -> out/state21/l/dte/;
8 dte state20 -> inp/state21/l/dte/;
9 dte state21 -> out/state01/a/dce/;
10 dce state21 -> inp/state01/a/dce/;
11 dce state01 -> out/state08/u/dte/;
12 dce state08 -> out/state19/m/dte/;
13 dte state01 -> inp/state18/m/dte/;
14 dte state18 -> out/state16/b/dce/;
15 dce state19 -> inp/state20/b/dce/;
16 dce state20 -> out/state21/l/dte/;
deadlock 2:
dce state03 b
dte state16 v
1 dce state01 -> out/state08/u/dte/;
2 dce state08 -> out/state19/m/dte/;
3 dte state01 -> inp/state18/m/dte/;
4 dte state18 -> inp/state19/m/dte/;
5 dte state19 -> out/state20/b/dce/;
6 dce state19 -> inp/state20/b/dce/;
7 dce state20 -> out/state21/l/dte/;
8 dte state20 -> inp/state21/l/dte/;
9 dte state21 -> out/state01/a/dce/;
10 dce state21 -> inp/state01/a/dce/;
11 dce state01 -> out/state08/u/dte/;
12 dce state08 -> out/state19/m/dte/;
13 dte state01 -> out/state21/b/dce/;
14 dce state19 -> inp/state20/b/dce/;
15 dte state21 -> out/state01/a/dce/;
16 dte state01 -> inp/state18/m/dte/;
17 dce state20 -> out/state21/l/dte/;
18 dce state21 -> inp/state01/a/dce/;
19 dce state01 -> out/state18/m/dte/;
20 dte state18 -> inp/state19/m/dte/;
21 dce state18 -> out/state01/l/dte/;
22 dte state19 -> out/state20/b/dce/;
23 dte state20 -> inp/state21/l/dte/;
24 dce state01 -> out/state08/u/dte/;
25 dce state08 -> inp/state16/b/dce/;
26 dte state21 -> out/state01/a/dce/;
27 dte state01 -> inp/state08/u/dte/;
28 dce state16 -> out/state17/m/dte/;
29 dce state17 -> out/state21/l/dte/;
30 dce state21 -> inp/state01/a/dce/;
31 dce state01 -> out/state08/u/dte/;
32 dte state08 -> out/state15/d/dce/;
33 dce state08 -> inp/state15/d/dce/;
34 dce state15 -> out/state03/v/dte/;
35 dte state15 -> inp/state03/v/dte/;
36 dte state03 -> out/state04/e/dce/;
37 dte state04 -> out/state05/c/dce/;
38 dte state05 -> out/state16/b/dce/;
deadlock 3:
dce state03 b
dte state20 v
1 dce state01 -> out/state08/u/dte/;
2 dce state08 -> out/state19/m/dte/;
3 dte state01 -> inp/state18/m/dte/;
4 dte state18 -> inp/state19/m/dte/;
5 dte state19 -> out/state20/b/dce/;
6 dce state19 -> inp/state20/b/dce/;
7 dce state20 -> out/state21/l/dte/;
8 dte state20 -> inp/state21/l/dte/;
9 dte state21 -> out/state01/a/dce/;
10 dce state21 -> inp/state01/a/dce/;
11 dce state01 -> out/state08/u/dte/;
12 dce state08 -> out/state19/m/dte/;
13 dte state01 -> out/state21/b/dce/;
14 dce state19 -> inp/state20/b/dce/;
15 dte state21 -> out/state01/a/dce/;
16 dte state01 -> inp/state18/m/dte/;
17 dce state20 -> out/state21/l/dte/;
18 dce state21 -> inp/state01/a/dce/;
19 dce state01 -> out/state18/m/dte/;
20 dte state18 -> inp/state19/m/dte/;
21 dce state18 -> out/state01/l/dte/;
22 dte state19 -> out/state20/b/dce/;
23 dte state20 -> inp/state21/l/dte/;
24 dce state01 -> out/state08/u/dte/;
25 dce state08 -> inp/state16/b/dce/;
26 dte state21 -> out/state01/a/dce/;
27 dte state01 -> inp/state08/u/dte/;
28 dce state16 -> out/state17/m/dte/;
29 dce state17 -> out/state21/l/dte/;
30 dce state21 -> inp/state01/a/dce/;
31 dce state01 -> out/state18/m/dte/;
32 dte state08 -> out/state15/d/dce/;
33 dte state15 -> inp/state19/m/dte/;
34 dce state18 -> out/state01/l/dte/;
35 dce state01 -> inp/state02/d/dce/;
36 dce state02 -> out/state03/v/dte/;
37 dte state19 -> out/state20/b/dce/;
deadlock 4:
dce state21 b
dte state16 -
1 dte state01 -> out/state02/d/dce/;
2 dte state02 -> out/state16/b/dce/;
3 dce state01 -> inp/state21/b/dce/;
307 states, 4 deadlocks
An AWK Debugger and Assertion Checker
Abstract
Example
Alabama Mississippi Tennessee Georgia Florida
Alaska
# Greedy map coloring
BEGIN { FS= "\t"; OFS= "\t" # fields separated by tabs
color[0]= "yellow" # color names
color[1]= "blue"
color[2]= "red"
color[3]= "green"
color[4]= "black"
}
{ i=0
while (a[$1,i] ) i++ # find first acceptable color for
# state $1
print $1"\t" color[i] # assign that color
for (j=2; j<=NF; j++) a[$j,i]=1 # make that color
# unacceptable for
# states $2..$NF
}
/* Checks the correctness of map coloring - any two neighbor
states should be colored in different colors */
FOREACH r1: RECORD FROM FILE input
(EXISTS r2: RECORD FROM FILE output
(r1.$1 == r2.$1 AND
FOREACH i IN 2..FIELD_NUM(r1)
(EXISTS r3: RECORD FROM FILE output
(r3.$1 == r1.$i AND r3.$2 != r2.$2)
)
)
)
SAY "Map colored correctly"
ONFAIL SAY r1.$1 "and" r1.$i "are of the same color"
SAY "although they are neighboring states"
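The same property can be approximated in plain awk, without the assertion language. This sketch is a simplification: it skips states that have no color in the output rather than failing on them, and check_coloring and its argument order are assumptions for the example.

```sh
check_coloring() {
    # $1: tab-separated adjacency file, $2: "state<TAB>color" output of the coloring
    awk -F '\t' '
    FILENAME == ARGV[1] { color[$1] = $2; next }     # first file read: the coloring
    {
        for (j = 2; j <= NF; j++)
            if (($1 in color) && ($j in color) && color[$1] == color[$j]) {
                print $1 " and " $j " are of the same color"
                bad = 1
            }
    }
    END { if (!bad) print "Map colored correctly" }
    ' "$2" "$1"
}

dir=$(mktemp -d)
printf 'A\tB\nB\tA\n' > "$dir/map"
printf 'A\tyellow\nB\tblue\n' > "$dir/colors"
check_coloring "$dir/map" "$dir/colors"    # prints "Map colored correctly"
rm -r "$dir"
```

The coloring is read first so that every neighbor's color is known before the adjacency lines are checked.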
Automated Result Verification with Awk
Source
Download
Abstract
The goal of result-verification is to prove that one execution
run of a program satisfies its specification. Compared with
implementation-verification, result-verification has a larger
scope for applications in practice, gives more opportunities
for automation and, based on the execution record not the
implementation, is particularly suitable for complex systems.
In this paper...
In this paper we propose a technical framework to carry
out automated result-verification in practice.
Its main features are:
Functional Enumeration in Gawk 3.1.7
Contents
• Synopsis
• Description
• Enumerators
• all(fun,array [,max])
• collect(fun,array1,array2 [,max])
• select(fun,array1,array2 [,max])
• reject(fun,array1,array2 [,max])
• detect(fun,array [,max])
• inject(fun,array,carry [,max])
• Sample Functions
• Using the Functions
• Code
• all
• collect
• select
• reject
• detect
• inject
• Bugs
• Author
Synopsis
Description
@fun(arg1,arg2,...)
Enumerators
all(fun,array [,max])
collect(fun,array1,array2 [,max])
select(fun,array1,array2 [,max])
reject(fun,array1,array2 [,max])
detect(fun,array [,max])
inject(fun,array,carry [,max])
Sample Functions
function odd(x) { return (x % 2) == 1 }
function show(x) { print "[" x "]" }
function mult(x,y) { return x * y }
function halve(x) { return x/2 }
Using the Functions
function do_all( arr) {
split("22 23 24 25 26 27 28",arr)
all("show",arr)
}
gawk317="$HOME/opt/gawk/bin/gawk"
$gawk317 -f ../enumerate.awk --source 'BEGIN { do_all() }'
[25]
[26]
[27]
[28]
[22]
[23]
[24]
function do_collect( max,arr1,arr2,i) {
max=split("22 23 24 25 26 27 28",arr1)
collect("halve",arr1,arr2,max)
for(i=1;i<=max;i++) print arr2[i]
}
gawk317="$HOME/opt/gawk/bin/gawk"
$gawk317 -f ../enumerate.awk --source 'BEGIN { do_collect() }'
11
11.5
12
12.5
13
13.5
14
function do_select( all,less,arr1,arr2,i) {
all = split("22 23 24 25 26 27 28",arr1)
less = select("odd",arr1,arr2,all)
for(i=1;i<=less;i++) print arr2[i]
}
gawk317="$HOME/opt/gawk/bin/gawk"
$gawk317 -f ../enumerate.awk --source 'BEGIN { do_select() }'
23
25
27
function do_reject( all,less,arr1,arr2,i) {
all = split("22 23 24 25 26 27 28",arr1)
less = reject("odd",arr1,arr2,all)
for(i=1;i<=less;i++) print arr2[i]
}
gawk317="$HOME/opt/gawk/bin/gawk"
$gawk317 -f ../enumerate.awk --source 'BEGIN { do_reject() }'
22
24
26
28
function do_detect( all,arr1) {
all = split("22 23 24 25 26 27 28",arr1)
print detect("odd",arr1,all)
}
gawk317="$HOME/opt/gawk/bin/gawk"
$gawk317 -f ../enumerate.awk --source 'BEGIN { do_detect() }'
23
function do_inject( all,less,arr1,arr2,i) {
split("1 2 3 4 5",arr1)
print inject("mult",arr1,1)
}
gawk317="$HOME/opt/gawk/bin/gawk"
$gawk317 -f ../enumerate.awk --source 'BEGIN { do_inject() }'
120
Code
all
function all (fun,a,max, i) {
if (max)
for(i=1;i<=max;i++) @fun(a[i])
else
for(i in a) @fun(a[i])
}
collect
function collect (fun,a,b,max, i,n) {
if (max)
for(i=1;i<=max;i++) {n++; b[i]= @fun(a[i]) }
else
for(i in a) {n++; b[i]= @fun(a[i])}
return n
}
select
function select (fun,a,b,max, i,n) {
if (max)
for(i=1;i<=max;i++) {
if (@fun(a[i])) {n++; b[n]= a[i] }}
else
for(i in a) {
if (@fun(a[i])) {n++; b[n]= a[i] }}
return n
}
reject
function reject (fun,a,b,max, i,n) {
if (max)
for(i=1;i<=max;i++) {
if (! @fun(a[i])) {n++; b[n]= a[i] }}
else
for(i in a) {
if (! @fun(a[i])) {n++; b[n]= a[i] }}
return n
}
detect
BEGIN {Fail="someUnLIKELYSymbol"}
function detect (fun,a,max, i) {
if (max)
for(i=1;i<=max;i++) {
if (@fun(a[i])) return a[i] }
else
for(i in a) {
if (@fun(a[i])) return a[i] }
return Fail
}
inject
function inject (fun,a,carry,max, i) {
if (max)
for(i=1;i<=max;i++)
carry = @fun(a[i],carry)
else
for(i in a)
carry = @fun(a[i],carry)
return carry
}
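Indirect calls (@fun) are a gawk extension. On other awks the enumerators can be approximated with an explicit dispatcher, sketched below. The names call1, select_by and select_odd are hypothetical, and every callable function has to be listed by hand in the dispatcher.

```sh
select_odd() {
    awk '
    function odd(x)   { return (x % 2) == 1 }
    function halve(x) { return x / 2 }
    # call1: hypothetical dispatcher standing in for @fun(x)
    function call1(fun, x) {
        if (fun == "odd")   return odd(x)
        if (fun == "halve") return halve(x)
        print "call1: unknown function " fun > "/dev/stderr"
        exit 1
    }
    function select_by(fun, a, b, max,    i, n) {
        n = 0
        for (i = 1; i <= max; i++)
            if (call1(fun, a[i])) b[++n] = a[i]
        return n
    }
    BEGIN {
        all  = split("22 23 24 25 26 27 28", arr1, " ")
        less = select_by("odd", arr1, arr2, all)
        for (i = 1; i <= less; i++) print arr2[i]
    }'
}

select_odd    # prints 23, 25 and 27, one per line
```

The trade-off is plain: portability in exchange for a dispatcher that must be kept in step with the function list.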
Bugs
Author
Tim Menzies
How to Contribute
Link to this site from your home page
Improve a Page
Found a Typo? A Rendering Problem? Want to clarify something?
Want to add some links?
How to Write Pages for this Site
1 2 3 4 5 6 7
012345678901234567890123456789012345678901234567890123456789012345678901234567890
Contributing Code
Coding Standards
Add a Library Function Files
Add a Package
#use file.awk
Pretty Print AWK Code
Preview Engine
http://awk.info/?awk:urlWithoutHTTPprefix
Contributing Pretty Code
HTML-based Commenting Conventions
#.H1 <join>Title</join>
#<pre>
code
#</pre>
#.WORD other words
<WORD> other words</WORD>
Show Unit Tests
Files
# assumes
# - the LAWKER trunk has been checked out and
# - .bash_profile contains: export Lawker="$HOME/svns/lawker/fridge"
. $Lawker/lib/bash/setup
gawk -f join.awk --source '
BEGIN { split("tim tom tam",a)
print join(a,2)
}'
Regression Tests
Displaying the Tests (and Output)
#.BODY yourcode/eg/yourtest
#.CODE yourcode/eg/yourtest.out
Learning Awk
Short Overviews
Longer Tutorials
Other Stuff
Teaching Awk
Four Keys to Gawk
Self-initializing variables.
x=x+0
x= x "" "the string you really want to add"
function haslocals(passed1,passed2, local1,local2,local3) {
passed1=passed1+1 # a scalar argument is a copy; only array arguments change externally
local1=7 # only changed locally
}
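A short demonstration of the scoping rules, as a sketch with invented names (bump, scope_demo): extra parameters act as locals, scalar arguments are copies, and array arguments are shared with the caller.

```sh
scope_demo() {
    awk '
    function bump(scalar, arr,    tmp) {   # tmp is local: an extra, never-passed parameter
        scalar = scalar + 1                # changes only the private copy
        arr["n"] = arr["n"] + 1            # arrays are shared with the caller
        tmp = 7                            # does not touch the global tmp
    }
    BEGIN {
        x = 1; a["n"] = 1; tmp = "untouched"
        bump(x, a)
        print x, a["n"], tmp
    }'
}

scope_demo    # prints "1 2 untouched"
```

The run of spaces before the local names in the parameter list is only a convention; awk itself does not distinguish them.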
Pattern-based programming
/^\.P1/ { if (p != 0) print ".P1 after .P1, line", NR;
p = 1;
}
/^\.P2/ { if (p != 1) print ".P2 with no preceding .P1, line", NR;
p = 0;
}
END { if (p != 0) print "missing .P2 at end" }
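To see the pattern-action pairs at work, the checker can be fed a small document with a repeated .P1. The wrapper name check_pairs and the sample input are made up for this sketch.

```sh
check_pairs() {
    awk '
    /^\.P1/ { if (p != 0) print ".P1 after .P1, line", NR; p = 1 }
    /^\.P2/ { if (p != 1) print ".P2 with no preceding .P1, line", NR; p = 0 }
    END     { if (p != 0) print "missing .P2 at end" }
    ' "$@"
}

printf '.P1\ntext\n.P1\n.P2\n' | check_pairs    # prints ".P1 after .P1, line 3"
```

The variable p is never initialized explicitly: it starts out as zero, which is the "not inside a .P1 block" state.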
BEGIN {
while ((getline < "Usr.Dict.Words") > 0) #slurp in dictionary; "> 0" stops on a missing file
dict[$0] = 1
FS=","; #set field separator
srand(); #reset random seed
Round=10; #always start globals with U.C.
}
A Small Example
% cat /etc/passwd | grep -v \# | cut -d: -f 6|sort |
uniq -c | sort -r -n | gawk -f hist.awk
************************** 26 /var/empty
** 2 /var/virusmails
** 2 /var/root
* 1 /var/xgrid/controller
* 1 /var/xgrid/agent
* 1 /var/teamsserver
* 1 /var/spool/uucp
* 1 /var/spool/postfix
* 1 /var/spool/cups
* 1 /var/pcast/server
* 1 /var/pcast/agent
* 1 /var/imap
* 1 /Library/WebServer
NR==1 { Width = Width ? Width : 40 ; # sets Width if it is missing
Scale = $1 > Width ? $1 / Width : 1
}
{ Stars=int($1/Scale); # divide, so the largest count fills at most Width columns
print str(Width - Stars," ") str(Stars,"*") $0
}
# note that, in the following "tmp" is a local variable
function str(n,c, tmp) { # returns a string, size "n", of all "c"
while((n--) > 0 ) tmp= c tmp
return tmp
}
Regular Expressions
function trim(s, t) {
t=s;
sub(/^[ \t\n]*/,"",t);
sub(/[ \t\n]*$/,"",t);
return t
}
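A usage sketch for trim, wrapping each cleaned line in brackets so the removed whitespace is visible. The wrapper name trim_demo is hypothetical.

```sh
trim_demo() {
    awk '
    function trim(s,    t) {
        t = s
        sub(/^[ \t\n]*/, "", t)    # strip leading blanks, tabs, newlines
        sub(/[ \t\n]*$/, "", t)    # strip trailing ones
        return t
    }
    { print "[" trim($0) "]" }' "$@"
}

printf '   padded text \t\n' | trim_demo    # prints "[padded text]"
```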
if ( $i !~ /^[+-]?([0-9]+[.]?[0-9]*|[.][0-9]+)([eE][+-]?[0-9]+)?$/ )
{print "ERROR: " $i " not a number"}
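The number pattern can be tried against a few sample values. In this sketch the test is applied to the whole line instead of to field $i, and is_number is an illustrative name.

```sh
is_number() {
    awk '{
        if ($0 ~ /^[+-]?([0-9]+[.]?[0-9]*|[.][0-9]+)([eE][+-]?[0-9]+)?$/)
            print $0 ": number"
        else
            print "ERROR: " $0 " not a number"
    }'
}

printf '%s\n' 42 -3.5 .5 1e10 1.2.3 abc | is_number
```

The first four values are accepted; 1.2.3 and abc are reported as errors.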
c          matches the character c (assuming c is a character with no special meaning in regexps).
\c         matches the literal character c; e.g. tabs and newlines are \t and \n respectively.
.          matches any character except newline.
^          matches the beginning of a line or a string.
$          matches the end of a line or a string.
[abc...]   matches any of the characters abc... (character class).
[^abc...]  matches any character except abc... and newline (negated character class).
r*         matches zero or more r's.
More Syntax:
r+         matches one or more r's.
r?         matches zero or one r's.
r1|r2      matches either r1 or r2 (alternation).
r1r2       matches r1, and then r2 (concatenation).
(r)        matches r (grouping).
Numbers begin with zero or one plus or minus signs.
Simple numbers are just one or more digits,
which may be followed by a decimal point and zero or more digits.
Alternatively, a number can have no leading digits and just start with a decimal point.
Also, there may be an exponent added,
and that exponent is a positive or negative bunch of digits.
Associative arrays
gawk '{for(i=1;i <=NF;i++) freq[$i]++ }' filename
#!/usr/bin/awk -f
{for(i=1;i <=NF;i++) freq[$i]++ }
END{for(word in freq) print word, freq[word] }
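Because for (word in freq) visits keys in no particular order, a common follow-up is to pipe the counts through sort. A sketch, with word_freq as an assumed wrapper name and the count printed first to make sorting easy:

```sh
word_freq() {
    awk '{ for (i = 1; i <= NF; i++) freq[$i]++ }
         END { for (word in freq) print freq[word], word }' "$@" |
    sort -k1,1nr -k2,2    # highest count first, ties in alphabetical order
}

printf 'the cat the dog\nthe end\n' | word_freq
```

For the sample input the first line is "3 the", followed by the three words that occur once.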
index in array
delete array[index]
for (var in array)
body
function top(a) {return a[a[0]]}
function push(a,x, i) {i=++a[0]; a[i]=x; return i}
function pop(a, x,i) {
i=a[0]--;
if (!i) {return ""} else {x=a[i]; delete a[i]; return x}}
BEGIN {push(a,1); push(a,2); push(a,3);
while(x=pop(a)) print x }
3
2
1
function a2s(a, i,s) {
s="";
for (i in a) {s=s " " i "= [" a[i]"]\n"};
return s}
BEGIN {push(L,1); push(L,2); push(L,3);
print a2s(L);}
0= [3]
1= [1]
2= [2]
3= [3]
function rinclude (line, x,a) {
split(line,a,/ /);
if ( a[1] ~ /^\=include/ ) {
while ( ( getline x < a[2] ) > 0) rinclude(x);
close(a[2])}
else {print line}
}
BEGIN {srand()}
{Array[rand()]=$0}
END {for(I in Array) print Array[I]}
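The random-key trick depends on the traversal order of for (I in Array) and can drop a line if two keys collide. An alternative technique, named plainly, is the Fisher-Yates shuffle over a numerically indexed array; the sketch below uses the made-up name shuffle_lines.

```sh
shuffle_lines() {
    awk 'BEGIN { srand() }
         { line[NR] = $0 }
         END {
             for (i = NR; i > 1; i--) {          # Fisher-Yates shuffle
                 j = int(rand() * i) + 1         # random index in 1..i
                 tmp = line[i]; line[i] = line[j]; line[j] = tmp
             }
             for (i = 1; i <= NR; i++) print line[i]
         }' "$@"
}

printf '1\n2\n3\n' | shuffle_lines
```

Every input line appears exactly once in the output; only the order varies from run to run.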
Awk one-liners
Handy One-Liners For Awk (v0.22)
pemente@northpark.edu
http://www.student.northpark.edu/pemente/awk/awk1line.txt
USAGE
Unix: awk '/pattern/ {print "$1"}' # standard Unix shells
DOS/Win: awk '/pattern/ {print "$1"}' # okay for DJGPP compiled
awk "/pattern/ {print \"$1\"}" # required for Mingw32
File Spacing
awk '1;{print ""}'
awk 'BEGIN{ORS="\n\n"};1'
awk 'NF{print $0 "\n"}'
awk '1;{print "\n"}'
Numbering and Calculations
awk '{print FNR "\t" $0}' files*
awk '{print NR "\t" $0}' files*
awk '{printf("%5d : %s\n", NR,$0)}'
awk 'NF{$0=++a " :" $0};{print}'
awk '{print (NF? ++a " :" :"") $0}'
awk 'END{print NR}'
awk '{s=0; for (i=1; i<=NF; i++) s=s+$i; print s}'
awk '{for (i=1; i<=NF; i++) s=s+$i}; END{print s}'
awk '{for (i=1; i<=NF; i++) if ($i < 0) $i = -$i; print }'
awk '{for (i=1; i<=NF; i++) $i = ($i < 0) ? -$i : $i; print }'
awk '{ total = total + NF }; END {print total}' file
awk '/Beth/{n++}; END {print n+0}' file
awk '$1 > max {max=$1; maxline=$0}; END{ print max, maxline}'
awk '{ print NF ":" $0 } '
awk '{ print $NF }'
awk '{ field = $NF }; END{ print field }'
awk 'NF > 4'
awk '$NF > 4'
Text Conversion and Substitution
awk '{sub(/\r$/,"");print}' # assumes EACH line ends with Ctrl-M
awk '{sub(/$/,"\r");print}'
awk 1
gawk -v BINMODE="w" '1' infile >outfile
tr -d '\r'
awk '{sub(/^[ \t]+/, ""); print}'
awk '{sub(/[ \t]+$/, "");print}'
awk '{gsub(/^[ \t]+|[ \t]+$/,"");print}'
awk '{$1=$1;print}' # also removes extra space between fields
awk '{sub(/^/, " ");print}'
awk '{printf "%79s\n", $0}' file*
awk '{l=length();s=int((79-l)/2); printf "%"(s+l)"s\n",$0}' file*
awk '{sub(/foo/,"bar");print}' # replaces only 1st instance
gawk '{$0=gensub(/foo/,"bar",4);print}' # replaces only 4th instance
awk '{gsub(/foo/,"bar");print}' # replaces ALL instances in a line
awk '/baz/{gsub(/foo/, "bar")};{print}'
awk '!/baz/{gsub(/foo/, "bar")};{print}'
awk '{gsub(/scarlet|ruby|puce/, "red"); print}'
awk '{a[i++]=$0} END {for (j=i-1; j>=0;) print a[j--] }' file*
awk '/\\$/ {sub(/\\$/,""); getline t; print $0 t; next}; 1' file*
awk -F ":" '{ print $1 | "sort" }' /etc/passwd
awk '{print $2, $1}' file
awk '{temp = $1; $1 = $2; $2 = temp; print}' file
awk '{ $2 = ""; print }'
awk '{for (i=NF; i>0; i--) printf("%s ",$i);printf ("\n")}' file
awk 'a != $0; {a=$0}'
awk '! a[$0]++' # most concise script
awk '!($0 in a) {a[$0];print}' # most efficient script
awk 'ORS=NR%5?",":"\n"' file
Selective Printing of Certain Lines
awk 'NR < 11'
awk 'NR>1{exit};1'
awk '{y=x "\n" $0; x=$0};END{print y}'
awk 'END{print}'
awk '/regex/'
awk '!/regex/'
awk '/regex/{print x};{x=$0}'
awk '/regex/{print (x=="" ? "match on line 1" : x)};{x=$0}'
awk '/regex/{getline;print}'
awk '/AAA/; /BBB/; /CCC/'
awk '/AAA.*BBB.*CCC/'
awk 'length > 64'
awk 'length < 64'
awk '/regex/,0'
awk '/regex/,EOF'
awk 'NR==8,NR==12'
awk 'NR==52'
awk 'NR==52 {print;exit}' # more efficient on large files
awk '/Iowa/,/Montana/' # case sensitive
Selective Deletion of Certain Lines:
awk NF
awk '/./'
Credits and Thanks
Explaining Pement's One Liners
Awk ten-liners
Some Gawk (and PERL) Samples
print "hello world\n"
BEGIN { print "hello world" }
$x= $x+1;
x= x+1
print $x, $y, $z;
print x,y,z
while (<>) {
split(/ /);
print "@_[0]\n"
}
{ print $1 }
while (<>) {
split(/ /);
print "@_[1] @_[0]\n"
}
{ print $2, $1 }
command = "cat $fname1 $fname2 > $fname3"
command = "cat " fname1 " " fname2 " > " fname3
for (1..10) { print $_,"\n" }
BEGIN {
for (i=1; i<=10; i++) print i
}
for (1..10) { print "$_ ",$_-1 }
print "\n"
BEGIN {
for (i=1; i<=10; i++) printf i " " i-1
print ""
}
foreach $x ( split(/ /,"this is not stored linearly") )
{ print "$x\n" }
BEGIN {
split("this is not stored linearly",temp)
for (i in temp) print temp[i]
}
$n = split(/ /,"this is not stored linearly");
for $i (0..$n-1) { print "$i @_[$i]\n" }
print "\n";
for $i (@_) { print ++$j," ",$i,"\n" }
BEGIN {
n = split("this is not stored linearly",temp)
for (i=1; i<=n; i++) print i, temp[i]
print ""
for (i in temp) print i, temp[i]
}
open file,"/etc/passwd";
while (<file>) { print $_ }
BEGIN {
while (getline < "/etc/passwd") print
}
$x = "this " . "that " . "\n";
print $x
BEGIN {
x = "this " "that " "\n" ; printf x
}
$assoc{"this"} = 4;
$assoc{"that"} = 4;
$assoc{"the other thing"} = 15;
for $i (keys %assoc) { print "$i $assoc{$i}\n" }
BEGIN {
assoc["this"] = 4
assoc["that"] = 4
assoc["the other thing"] = 15
for (i in assoc) print i,assoc[i]
}
split(/ /,"this will be sorted once in an array");
foreach $i (sort @_) { print "$i\n" }
BEGIN {
split("this will be sorted once in an array",temp," ")
for (i in temp) print temp[i] | "sort"
close("sort")
}
BEGIN {
split("this will be sorted once in an array",temp," ")
n=asort(temp)
for (i=1;i<=n;i++) print temp[i]
}
while (<STDIN>) {
s/[aeiou]/*/g;
print $_
}
{gsub(/[aeiou]/,"*"); print }
#!/pkg/gnu/bin/perl
# this is a comment
#
open(stream1,"w | ");
while ($line = <stream1>) {
($user, $tty, $login, $junk) = split(/ +/, $line, 4);
print "$user $login ",substr($line,49)
}
#!/pkg/gnu/bin/gawk -f
# this is a comment
#
BEGIN {
while ("w" | getline) {
user = $1; tty = $2; login = $3
print user, login, substr($0,49)
}
}
open(stream1,"lynx -dump 'cs.wustl.edu/~loui' | ");
while ($line = <stream1>) {
if ($flag && $line =~ /[0-9]/) { print $line }
if ($line =~ /References/) { $flag = 1 }
}
BEGIN {
com = "lynx -dump 'cs.wustl.edu/~loui' &> /dev/stdout"
while (com | getline line) {
if (flag && line ~ /[0-9]/) { print line }
if (line ~ /References/) { flag = 1 }
}
}
saya
Synopsis
Description
Arguments
Returns
Notes
saya(a,"name") ==>
name[1] = tim
name[2] = menzies
Source
function saya(a,s, sep0,b4,after,eq, c,m,n,key,val,i,j,tmp,sep,b,pre,post) {
sep0 = sep0 ? sep0 : "\n"
b4 = b4 ? b4 : "\n"
after = after ? after : "\n"
eq = eq ? eq : " = "
pre = s ? s"[" : ""
post = s ? "]" : ""
m = asorti(a,b)
printf("%s",b4)
for(i=1;i<=m;i++) {
key=b[i]
val=a[b[i]]
printf("%s", sep pre )
n=split(key,tmp,SUBSEP)
c = ""
for(j=1;j<=n;j++) {
printf("%s", c tmp[j] )
c=","
}
printf("%s", post eq val )
sep=sep0;
};
printf("%s",after)
return m
}
Example
gawk -f saya.awk --source '
BEGIN {
A["fname" ] = "tim"
A["lname" ] = "menzies"
A["address"] = "usa"
saya(A,"",", ","[","]")
print ""
saya(A,"message")
B[2,3,9] = 100
B[10,1,11] = 200
B[1,3,10] = 300
saya(B,"b")
}'
[address = usa, fname = tim, lname = menzies]
message[address] = usa
message[fname] = tim
message[lname] = menzies
b[1,3,10] = 300
b[10,1,11] = 200
b[2,3,9] = 100
Author
join
Synopsis
Description
Arguments
Returns
Example
gawk -f join.awk --source '
BEGIN { split("tim tom tam",a)
print join(a,2)
}'
tom tam
Source
function join(a,start,end,sep, result,i) {
sep = sep ? sep : " "
start = start ? start : 1
end = end ? end : sizeof(a)
if (sep == SUBSEP) # magic value
sep = ""
result = a[start]
for (i = start + 1; i <= end; i++)
result = result sep a[i]
return result
}
Helper
function sizeof(a, i,n) { for(i in a) n++ ; return n }
Change Log
Author
array
Synopsis
Description
Arguments
Example
gawk -f array.awk --source '
BEGIN { array(A);
A[1]=2;
print length(A);
array(A);
print length(A);
}'
1
0
Source
function array(a) { split("",a,"") }
Sorting in Awk
Contents
Download
About
Code
selSort
function selSort(keyArr,outArr, swap,thisIdx,minIdx,cmpIdx,numElts) {
for (thisIdx in keyArr) {
outArr[++numElts] = thisIdx
}
for (thisIdx=1; thisIdx<=numElts; thisIdx++) {
minIdx = thisIdx
for (cmpIdx=thisIdx + 1; cmpIdx <= numElts; cmpIdx++) {
if (keyArr[outArr[minIdx]] > keyArr[outArr[cmpIdx]]) {
minIdx = cmpIdx
}
}
if (thisIdx != minIdx) {
swap = outArr[thisIdx]
outArr[thisIdx] = outArr[minIdx]
outArr[minIdx] = swap
}
}
return numElts+0
}
keySort
function keySort(keyArr,outArr, \
occArr,thisIdx,thisKey,cmpIdx,outIdx,numElts) {
for (thisIdx in keyArr) {
thisKey = keyArr[thisIdx]
outIdx=++occArr[thisKey] # start at 1 plus num occurrences
for (cmpIdx in keyArr) {
if (thisKey > keyArr[cmpIdx]) {
outIdx++
}
}
outArr[outIdx] = thisIdx
numElts++
}
return numElts+0
}
genSort
in: inArr["foo"]="b"; inArr["bar"]="a"; inArr["xyz"]="b"
outArr[] is empty
out: inArr["foo"]="b"; inArr["bar"]="a"; inArr["xyz"]="b"
outArr[1]="bar"; outArr[2]="foo"; outArr[3]="xyz"
function genSort(sortAlg,sortType,inArr,outArr,fldNum,fldSep, \
keyArr,thisIdx,thisArr) {
if (fldNum) {
if (sortType == "n") {
for (thisIdx in inArr) {
split(inArr[thisIdx],thisArr,fldSep)
keyArr[thisIdx] = thisArr[fldNum]+0
}
} else {
for (thisIdx in inArr) {
split(inArr[thisIdx],thisArr,fldSep)
keyArr[thisIdx] = thisArr[fldNum]""
}
}
} else {
if (sortType == "n") {
for (thisIdx in inArr) {
keyArr[thisIdx] = inArr[thisIdx]+0
}
} else {
for (thisIdx in inArr) {
keyArr[thisIdx] = inArr[thisIdx]""
}
}
}
if (sortAlg ~ /^sel/) {
numElts = selSort(keyArr,outArr)
} else {
numElts = keySort(keyArr,outArr)
}
return numElts
}
Main Loop
{ inArr[NR]=$0 }
Output
END {
numElts = genSort(sortAlg,sortType,inArr,outArr,fldNum,FS)
for (outIdx=1;outIdx<=numElts;outIdx++) {
print inArr[outArr[outIdx]]
}
}
Author
Awk's Equivalent to VI's J
host1name.com
10.10.10.1
host2name.com
10.10.10.2
host3name.com
10.10.10.3
ORS=NR%2?" ":"\n"
ORS=NR%2?FS:RS
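A minimal sketch of how these two programs behave, wrapped in a shell function (the name join_pairs is hypothetical, chosen only for this illustration). The assignment sits in the pattern position, so ORS is set before the default print action runs: odd-numbered records are terminated by a space and even-numbered records by a newline. Both assigned values are non-empty strings, so the pattern is always true and every record is printed.

```sh
# join_pairs: hypothetical helper that joins each odd-numbered line
# with the even-numbered line that follows it.
join_pairs() {
    awk 'ORS = NR % 2 ? " " : "\n"'
}

printf '%s\n' host1name.com 10.10.10.1 host2name.com 10.10.10.2 | join_pairs
# prints:
# host1name.com 10.10.10.1
# host2name.com 10.10.10.2
```

With the default settings of FS (a space) and RS (a newline), the second form, ORS=NR%2?FS:RS, produces the same output.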
Sorting Arrays Via the Shell
Contents
Synopsis
o(array [,string,control])
Download
Download from LAWKER.
Notes
Code
Example
function odemo( a,b,i,n) {
n = split("watermelon,banana,apple,grape",a,/,/);
print "\nEG1"; o(a,"fruit")
print "\nEG2"; o(a,"fruit",3)
print "\nEG3"; o(a,"fruit","-k 6")
for(i in a)
b[a[i]] = i
print "\nEG4"; o(b,"fruit")
print "\nEG5"; o(b,"fruit","-r -k 2")
}
gawk -f o.awk --source "BEGIN { odemo() }"
EG1
fruit[ 1 ] = [ watermelon ]
fruit[ 2 ] = [ banana ]
fruit[ 3 ] = [ apple ]
fruit[ 4 ] = [ grape ]
EG2
fruit[ 1 ] = [ watermelon ]
fruit[ 2 ] = [ banana ]
fruit[ 3 ] = [ apple ]
EG3
fruit[ 3 ] = [ apple ]
fruit[ 2 ] = [ banana ]
fruit[ 4 ] = [ grape ]
fruit[ 1 ] = [ watermelon ]
EG4
fruit[ apple ] = [ 3 ]
fruit[ banana ] = [ 2 ]
fruit[ grape ] = [ 4 ]
fruit[ watermelon ] = [ 1 ]
EG5
fruit[ watermelon ] = [ 1 ]
fruit[ grape ] = [ 4 ]
fruit[ banana ] = [ 2 ]
fruit[ apple ] = [ 3 ]
Main driver
function o(a, str,control, i) {
if (control ~ /^[0-9]/)
for(i=1;i<=control;i++)
print str "[ " i " ]\t=\t [ " a[i] " ]"
else {
com = control ? control : " -n -k 2"
com = "sort " com " #" rand(); # ensure com is unique
for(i in a)
print str "[ " i " ]\t=\t [ " a[i] " ]" | com;
close(com);
}}
Author
quicksort2.awk
Contents
Synopsis
Download
Description
Code
BEGIN {
recurse1 = "gawk -f quicksort2.awk #" rand()
recurse2 = "gawk -f quicksort2.awk #" rand()
}
NR == 1 { pivot=$0; next }
NR > 1 { if($0 < pivot) print | recurse1
if($0 > pivot) print | recurse2
}
END { close(recurse1)
if(NR > 0) print pivot
close(recurse2)
}
Bugs
See also
Copyright
Author
levenshtein.awk
Contents
Synopsis
gawk -f levenshtein.awk --source 'BEGIN {
print levdist("kitten", "sitting")}'
Download
Notes
Code
levdist
function levdist(str1, str2, l1, l2, tog, arr, i, j, a, b, c) {
if (str1 == str2) {
return 0
} else if (str1 == "" || str2 == "") {
return length(str1 str2)
} else if (substr(str1, 1, 1) == substr(str2, 1, 1)) {
a = 2
while (substr(str1, a, 1) == substr(str2, a, 1)) a++
return levdist(substr(str1, a), substr(str2, a))
} else if (substr(str1, l1=length(str1), 1) == substr(str2, l2=length(str2), 1)) {
b = 1
while (substr(str1, l1-b, 1) == substr(str2, l2-b, 1)) b++
return levdist(substr(str1, 1, l1-b), substr(str2, 1, l2-b))
}
for (i = 0; i <= l2; i++) arr[0, i] = i
for (i = 1; i <= l1; i++) {
arr[tog = ! tog, 0] = i
for (j = 1; j <= l2; j++) {
a = arr[! tog, j ] + 1
b = arr[ tog, j-1] + 1
c = arr[! tog, j-1] + (substr(str1, i, 1) != substr(str2, j, 1))
arr[tog, j] = (((a<=b)&&(a<=c)) ? a : ((b<=a)&&(b<=c)) ? b : c)
}
}
return arr[tog, j-1]
}
Demo code
#demo.awk
BEGIN {OFS = "\t"}
{words[NR] = $0}
END {
max = 0
for (i = 1; i in words; i++) {
for (j = i + 1; j in words; j++) {
new = levdist(words[i], words[j])
print words[i], words[j], new
if (new > max) {
max = new
bestpair = (words[i] " - " words[j] ": " new)
}
}
}
print bestpair
}
Unit tests
#utests.awk
function testlevdist(str1, str2, correctval, testval) {
testval = levdist(str1, str2)
if (testval == correctval) {
printf "%s:\tCorrect distance between '%s' and '%s'\n", testval, str1, str2
return 1
} else {
printf "MISMATCH on words '%s' and '%s' (wanted %s, got %s)\n", str1, str2, correctval, testval
return 0
}
}
BEGIN {
testlevdist("kitten", "sitting", 3)
testlevdist("Saturday", "Sunday", 3)
testlevdist("acc", "ac", 1)
testlevdist("foo", "four", 2)
testlevdist("foo", "foo", 0)
testlevdist("cow", "cat", 2)
testlevdist("cat", "moocow", 5)
testlevdist("cat", "cowmoo", 5)
testlevdist("sebastian", "sebastien", 1)
testlevdist("more", "cowbell", 5)
testlevdist("freshpack", "freshpak", 1)
testlevdist("freshpak", "freshpack", 1)
}
Author
Columnate
Contents
Synopsis
#e.g.
gawk -F: -f columnate.awk /etc/passwd
Download
Download from LAWKER.
About
This script columnates the input file, so that columns line up like in the GNU column(1) command. Its output is like that of column -t. First, awk reads the whole file, keeps track of the maximum width of each field, and saves all the lines/records. At the END, the lines are printed in columnated format. If your terminal is not too narrow, you'll get a handsome display of the file.
Code
{ line[NR] = $0 # saves the line
for (f=1; f<=NF; f++) {
len = length($f)
if (len>max[f])
max[f] = len } # an array of maximum field widths
}
END {
for(nr=1; nr<=NR; nr++) {
nf = split(line[nr], fields)
for (f=1; f<nf; f++)
printf "%-*s", max[f]+2, fields[f]
print fields[f] } # the last field need not be padded
}
Author
WidenBmp.awk
Contents
Background
My boss wants to put NOAA weather radar images in a looping presentation that is displayed as 720 video on the 1040 LCD TV in the atrium. He couldn't figure out how to download the various layers needed, so he gave me the task. Of course, I had a sample composite image for him in half an hour. It looked terrible on the TV: the writing came out as just a blur and the county and state lines (single pixel mostly) were essentially invisible. Obviously, I could make my own 'cities' overlay, but no tools I had would convert the 'counties' image to any usable vector format for line resizing.
Code
Bytes2Number
function Bytes2Number( String, x, y, z, Number ) {
if( !CharString ) {
for( x = 0; x <= 255; x++ ) CharString = CharString sprintf( "%c", x )
}
x = split( String, Scratch, "" )
Number = 0
for( y = 1; y <= x; y++ ) {
z = index( CharString, Scratch[ y ] ) -1
Number = Number + z * (256^(x - y))
}
return Number # Note that Number is a regular gawk scalar variable.
}
RealSize
function RealSize( Wide, High, Pixels, x, y ) {
for( x = Wide - 5; x <= Wide +5; x++ ) {
for( y = High - 5; y <= High + 5; y++ ) {
if( x * y == Pixels ) {
Width = x
Height = y
}
}
}
}
BEGIN
BEGIN{
BINMODE = "rw"
FS= ""
# The next two lines are not strictly necessary-
# they are here for clarity.
Header = ""
ByteCount = 0
RS = "\n"
}
For Each Record...
{
for( x = 1; x <= NF; x++ ) Bytes[ ++ByteCount ] = $(x)
if( RT ) { Bytes[ ++ByteCount ] = RT }
}
END
END{
if( !OutFile ) OutFile = FILENAME
close( FILENAME )
sub( /[bB][mM][pP]$/, "widened.bmp" Arr[1], OutFile )
Width = Bytes2Number( Bytes[ 22 ] Bytes[ 21 ] Bytes[ 20 ] Bytes[ 19 ] )
Height = Bytes2Number( Bytes[ 26 ] Bytes[ 25 ] Bytes[ 24 ] Bytes[ 23 ] )
Data = Bytes2Number( Bytes[ 14 ] Bytes[ 13 ] Bytes[ 12 ] Bytes[ 11 ] )
Size = Bytes2Number( Bytes[ 6 ] Bytes[ 5 ] Bytes[ 4 ] Bytes[ 3 ] )
Depth = Bytes2Number( Bytes[ 30 ] Bytes[ 29 ] ) / 8
ImgSize = Bytes2Number( Bytes[ 38 ] Bytes[ 37 ] Bytes[ 36 ] Bytes[ 35 ] )
RealSize( Width, Height, ImgSize / Depth )
# Output the header in its original form to the target file.
for( x = 1; x <= Data; x++ ) Header = Header Bytes[ x ]
printf( "%s", Header ) > OutFile
# Build the two arrays
for( x = 1; x <= Height; x++) {
for( y = 1; y <= Width; y++ ) {
S = ""
# Values for the A & B array entries are strings of
# bytes representing the color of the pixel, either directly or
# as a pointer into a palette.
for( z = 1; z <= Depth; z++ ) S = S Bytes[ ++Data ]
A[x,y] = S
B[x,y] = S
C[ S ]++
}
}
z = 0
# Bkg is the (assumed) background color.
# The code is a simple maximum value loop.
for( x in C ) {
y = C[x]
if( y > z ) {
Bkg = x
z = y
}
}
# Begin the actual line widening code.
for( x = 1; x <= Height; x++) {
for( y = 1; y <= Width; y++ ) {
if( A[x,y] !~ Bkg ) {
u = x + 1
v = x - 1
w = y + 1
z = y - 1
if( B[u,y] ~ Bkg ) B[u,y] = A[x,y]
if( B[v,y] ~ Bkg ) B[v,y] = A[x,y]
if( B[x,w] ~ Bkg ) B[x,w] = A[x,y]
if( B[x,z] ~ Bkg ) B[x,z] = A[x,y]
if( B[u,w] ~ Bkg ) B[u,w] = A[x,y]
if( B[u,z] ~ Bkg ) B[u,z] = A[x,y]
if( B[v,w] ~ Bkg ) B[v,w] = A[x,y]
if( B[v,z] ~ Bkg ) B[v,z] = A[x,y]
}
}
}
for( x = 1; x <= Height; x++) {
for( y = 1; y <= Width; y++ ) {
printf( "%s", B[x,y] ) > OutFile
}
}
}
Author
Processing Binary (BMP) files in Gawk
Updates
Description
Code Fragments
function Bytes2Number( String, x, y, z, Number ) {
x = split( String, Scratch, "" )
Number = 0
for( y = 1; y <= x; y++ ) {
z = index( CharString, Scratch[ y ] ) -1
Number = Number + z * (256^(x - y))
}
return Number
}
BEGIN{
for( x = 0; x <= 255; x++ ) {
CharString = CharString sprintf( "%c", x )
}
FS= ""
RS = "ABC" # a string: a bare /ABC/ would be evaluated as a match against $0
}
{ Width = Bytes2Number( $22 $21 $20 $19 )
Height = Bytes2Number( $26 $25 $24 $23 )
Data = Bytes2Number( $14 $13 $12 $11 )
Size = Bytes2Number( $6 $5 $4 $3 )
Depth = Bytes2Number( $30 $29 ) / 8
ImgSize = Bytes2Number( $38 $37 $36 $35 )
....
}
Spawk for SUSE Linux
-- Panos Papadopoulos to tim
SPAWK moves to GoogleCode
Panos I. Papadopoulos reports that he has moved the SPAWK project (SQL and AWK) to Mercurial and spawk.googlecode.com.
SQL Powered AWK
Website
http://sites.google.com/site/spawkinfo.
Author
Panos I. Papadopoulos (panos1962@gmail.com).
Description
BEGIN {
extension("libspawk.so", "dlload")
...
A Short Example
BEGIN {
extension("libspawk.so", "dlload")
SPAWKINFO["database"] = "information_schema"
spawk_select("SELECT TABLE_SCHEMA, TABLE_NAME FROM TABLES")
while (spawk_data(data))
print data[0]
exit(0)
}
Macros
Finite State Machine Generator
Contents
• Download
• Usage
• DESCRIPTION
• Building the Sample FSM
• Example FSM Specification File
• The Example FSM
• Example Output from the Sample
• Copyright
• Author
Download
Usage
DESCRIPTION
Building the Sample FSM
COPYING and FSF licenses
COPYING.LESSER
filelist the "packing list"
fsm.awk the code generator
fsm.c the context and transition code
fsm.h definitions for the API
makefile simple makefile for the test driver code
utils.h error and utility definitions
test.fsm a sample fsm specification named "test"
test_actions.c action functions for the sample
struct fsm_s fsm_fsmName [STATES_COUNT][EVENTS_COUNT].
Example FSM Specification File
# current event action next
# state state
# --------------+----------+---------------+------------
IDLE CONN_REQ makeConnection CONNECTED
CONNECTED GET_REQ sendBuffer SENDING
SENDING FILE_SENT closeFile IDLE
# current event action next next
# state state state
# ok fail
# --------------+----------+----------+---------+-----
CONNECTED GET_REQ sendBuffer SENDING IDLE
# current event action next next
# state state state
# ok fail
# --------------+----------+----------+---------+-----
S1 EVENT_1 action_1 S2 S3
# current event action next next
# state state state
# ok fail
# --------------+----------+----------+---------+-----
S1 EVENT_1 action_1 S2 -
means, when receiving event EVENT_1 in state S1, execute action
action_1 and go to state S2 irrespective of the return value of
action_1().
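The table format above can be read directly as a transition map. The following is only a sketch under stated assumptions, not the fsm.awk generator itself (which emits C code): run_fsm is a hypothetical helper that loads a specification file, then reads events from standard input, one per line. An event line may carry a second word, fail, to simulate an action that reports failure; otherwise the action is assumed to succeed.

```sh
# run_fsm SPEC START: hypothetical table-driven interpreter for the
# specification format shown above. Events arrive on standard input.
run_fsm() {
    awk -v state="$2" '
    NR == FNR {                          # first file: the specification table
        if ($0 ~ /^[ \t]*(#|$)/) next    # skip comment and blank lines
        key = $1 SUBSEP $2
        action[key] = $3
        next_ok[key] = $4
        # a missing or "-" fifth column: same next state whatever the action returns
        next_fail[key] = (NF >= 5 && $5 != "-") ? $5 : $4
        next
    }
    {                                    # standard input: one event per line
        key = state SUBSEP $1
        if (!(key in action)) { print "invalid event " $1 " in " state; next }
        printf "%s + %s -> %s(), ", state, $1, action[key]
        state = ($2 == "fail") ? next_fail[key] : next_ok[key]
        print "now " state
    }' "$1" -
}

spec=$(mktemp)
cat > "$spec" <<'EOF'
# current   event      action          next       next (fail)
IDLE        CONN_REQ   makeConnection  CONNECTED
CONNECTED   GET_REQ    sendBuffer      SENDING    IDLE
SENDING     FILE_SENT  closeFile       IDLE
EOF
printf '%s\n' CONN_REQ 'GET_REQ fail' CONN_REQ | run_fsm "$spec" IDLE
rm -f "$spec"
```

Keying both maps on the (state, event) pair mirrors the two-dimensional table the generator builds; the sketch only names the action and does not call anything.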
The Example FSM
Example Output from the Sample
$>
$> ./test
testing fsm test
starting in state 1
next event: a
got a (0) ----> called fsm_s2_ab ----> ,went to state 0
next event: d
got d (3) ----> invalid eventwent to state 0
next event: b
got b (1) ----> called fsm_s1_b ----> ,went to state 1
next event: c
got c (2) ----> went to state 1
next event: z
trace index is 4
event state
0 0
3 0
1 1
2 1
0 0 <-- next/oldest
0 0
0 0
0 0
next event: q
bye
$>
Copyright
Author
Wm Miller.
The author may be contacted at wmmsf at users.sourceforge.net.
Hiding Email Address
Contents
Synopsis
Download
Description
% gawk -f cryptosig.awk tim@menzies.us
BEGIN{a="7059631863556476595569007169";while(a){printf("%c",46+substr(a,1,2));a=substr(a,3)}}
echo 'BEGIN{a="7059631863556476595569007169";while(a){printf("%c",46+substr(a,1,2));a=substr(a,3)}}' | gawk -f -
gawk -f cryptosig.awk tim@menzies.us | gawk -f -
Code
BEGIN {
for (i=0; i<=255; i++) { # build table of char=value pairs
ord_arr[sprintf("%c",i)] = i # character = ordinal value
}
for (i=1; i<=ARGC-1; i++) {
str = ""
for (j=1; j<=length(ARGV[i]); j++) {
str = sprintf("%s%02d",str,ord_arr[substr(ARGV[i],j,1)]-46)
}
printf("BEGIN{a=\"%s\";while(a){printf(\"%%c\",46+substr(a,1,2));a=substr(a,3)}}\n",str)
}
exit(0)
}
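The scheme in the code above stores each character as a two-digit number: its character code minus 46. A minimal round-trip sketch follows; the helper names encode_sig and decode_sig are hypothetical, and the sketch assumes every character has a code between 46 and 126 so that each offset fits in two digits.

```sh
# encode_sig / decode_sig: hypothetical helpers illustrating the
# "character code minus 46, two digits per character" scheme.
encode_sig() {
    awk -v s="$1" 'BEGIN {
        for (i = 46; i <= 126; i++) ord[sprintf("%c", i)] = i   # char -> code table
        for (j = 1; j <= length(s); j++)
            out = out sprintf("%02d", ord[substr(s, j, 1)] - 46)
        print out
    }'
}

decode_sig() {
    awk -v a="$1" 'BEGIN {
        while (a != "") {
            printf "%c", 46 + substr(a, 1, 2)   # a numeric %c argument prints that character code
            a = substr(a, 3)
        }
        print ""
    }'
}

decode_sig 7059631863556476595569007169     # prints tim@menzies.us
encode_sig tim@menzies.us                   # prints 7059631863556476595569007169
```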
Author
BEGIN{a="535170696159626207061118755158656500536563";
while(a){
printf("%c",46+substr(a,1,2));a=substr(a,3)};
print("")
}
Random Signatures
Contents
Synopsis
chmod +x sigs; ./sigs
Download
Description
Code
Pick1
pick1() {
gawk 'BEGIN { srand(); RS="" }
NR==1 { print $0 "\n" }
NR>1 { Recs[rand()] = $0 }
END { for ( R in Recs ) {print Recs[R]; exit}}
' $1
}
The Signatures
cat << SoMEI_mpOSSIblE_sYMBOl | pick1
tim.menzies {
title: dr (Ph.D.) and associate professor;
align: csee, west virginia university;
cell: esb 841A;
url: http://menzies.us;
fyi: unless marked "URGENT", i usually won't get 2 your email b4 5pm;
}

Doing a job RIGHT the first time gets the job done. Doing the job WRONG
fourteen times gives you job security.

Rome did not create a great empire by having meetings, they did it by
killing all those who opposed them.

INDECISION is the key to FLEXIBILITY.

"When a subject becomes totally obsolete we make it a required
course." Peter Drucker

I saw two shooting stars last night but they were only satellites .
Its wrong to wish on space hardware. I wish, I wish, I wish you cared.
-- Billy Bragg

Then, in 1995, came the most amazing event in the
history of programming languages: the introduction
of Java. -- Programming Languages: Principles and Practice

Suburbia is where the developer bulldozes out the trees, then names
the streets after them. --Bill Vaughan

Instant gratification takes too long.
-- Carrie Fisher

Complexity is easy. Simplicity is hard.
--Unknown
SoMEI_mpOSSIblE_sYMBOl
Author
Correlate.awk
Contents
Synopsis
cat data | gawk -f correlate.awk
Notes
Example
cat <<EOF | gawk -f correlate.awk
1 1.417600305
2 2.265271781
3 3.241368347
4 4.367711955
5 5.390612315
6 6.296879718
7 7.43218197
8 8.117831008
9 9.338019481
10 10.01823657
EOF
NR=10
ssx=82.5
ssy=79.0584
ssxy=80.6985
r=0.999227
Code
{ xy+=($1*$2);
x+=$1;
y+=$2;
x2+=($1*$1);
y2+=($2*$2);
}
END {
print "NR=" NR;
ssx=x2-((x*x)/NR);
print "ssx=" ssx;
ssy=y2-((y*y)/NR);
print "ssy=" ssy;
ssxy = xy - ((x*y)/NR);
print "ssxy=" ssxy;
r=ssxy/sqrt(ssx*ssy);
print "r=" r;
}
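For reference, the quantities printed by the END block correspond to the usual computational form of the Pearson correlation coefficient. Written out, with n standing for NR and the sums taken over all input records:

```latex
\begin{align*}
SS_x    &= \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}, &
SS_y    &= \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n}, \\
SS_{xy} &= \sum x_i y_i - \frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n}, &
r       &= \frac{SS_{xy}}{\sqrt{SS_x \, SS_y}}.
\end{align*}
```

Plugging in the example output above gives r = 80.6985 / sqrt(82.5 × 79.0584) ≈ 0.9992, matching the printed value.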
Author
Music and Awk
Project Tools
A MySql Client
Contents
Download
Code
Set Up
BEGIN {
if (!mysql["path"]) {
mysql["path"] = "/usr/bin/mysql"
}
if (mysql["user"]) mysql["user"] = "-u" mysql["user"]
if (mysql["pass"]) mysql["pass"] = "-p" mysql["pass"]
if (!mysql["tempfile_command"]) {
mysql["tempfile_command"] = "mktemp /tmp/__mysql.awk.XXXXXX"
}
mysql["resource_id"] = 1
__mysql_dequote["r"] = "\r"
__mysql_dequote["n"] = "\n"
__mysql_dequote["t"] = "\t"
__mysql_dequote["\\"] = "\\"
}
Main Functions
function mysql_db (db) { mysql["database"] = db }
function mysql_path (path) { mysql["path"] = path }
function mysql_tempfile_command (command) {
mysql["tempfile_command"] = command
}
function mysql_login (username, password, host, args) {
mysql["user"] = "-u" username
mysql["pass"] = "-p" password
if (host) mysql["host"] = "-h" host
if (args) mysql["args"] = args
}
function mysql_query (query ,input,key,i,call,resource) {
resource = mysql["resource_id"]++
mysql["tempfile_command"] | getline mysql[resource]
close(mysql["tempfile_command"])
call = sprintf("%s %s %s %s %s %s > %s",
mysql["path"], mysql["user"], mysql["pass"], mysql["host"],
mysql["args"], mysql["database"],
mysql[resource])
print query | call
close(call)
if (getline input < mysql[resource]) {
for (i = split(input, key, "\t"); i > 0; i--)
mysql[resource, i] = key[i]
}
return resource
}
function mysql_fetch_assoc (resource,row ,input,i,fields) {
fields = 0
if (getline input < mysql[resource]) {
fields = mysql_split(row, input)
for (i = 1; i <= fields; i++)
row[mysql[resource, i]] = row[i]
}
return fields
}
function mysql_split (row, input, r,i) {
r = split(input, row, "\t")
for (i = 0; i <= r; i++) {
row[i] = mysql_dequote(row[i])
}
return r
}
function mysql_fetch_row (resource,row ,input,r,i) {
if (getline input < mysql[resource]) {
return mysql_split(row, input)
}
return 0
}
function mysql_index (resource, id) {
return mysql[resource, id]
}
function mysql_finish (resource, i) {
close(mysql[resource])
system(sprintf("rm %s", mysql[resource]))
delete mysql[resource]
i = 1
while (mysql[resource,i])
delete mysql[resource, i++]
}
function mysql_cleanup ( i) {
for (i = 1; i < mysql["resource_id"]; i++)
if (mysql[i]) {
mysql_finish(i) # closes and removes the temp file, then deletes the cached entries
}
}
Support Utils
function mysql_dequote (string, result,i,l,c) {
result = ""
l = length(string)
for (i = 1; i <= l; i++) {
c = substr(string, i, 1)
if (c == "\\") {
# This simply shouldn't happen...
## if ((i + 1) == l) continue;
c = substr(string, ++i, 1)
result = result __mysql_dequote[c]
}
else {
result = result c
}
}
return result
}
function mysql_quote (string, result) {
gsub(/\\/, "\\\\", string)
gsub(/'/, "\\'", string)
return "'" string "'"
}
Copyright
Author
NoSQL
Plaiter: a music player
Synopsis
plaiter [options] [file, playlist, directory or stream ...]
Download
Description
Options
Copyright
Author
Humdrum
Download
Description
Author
For more information
shuffle.awk
Contents
• Synopsis
• Download
• Description
• The Slow Way
• The Better Way
• Code
• Correctness proof
• Examples
• Random orders
• Fast sampling
• Repeats
• Author
Synopsis
nshuffle(Array)
shuffle(Array,Copy)
shuffles(Array,Copy)
Download
Description
The Slow Way
The Better Way
the number of elements left in the input array
+ the number of elements in the output array
------------------------------------------------
= the number of elements initially passed in.
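The bookkeeping above can be checked mechanically. The sketch below is an illustration under stated assumptions rather than the code from this page: shuffle_lines is a hypothetical helper that moves one randomly chosen item at a time from the input array to the output array and verifies after every move that the two counts still add up to the original size.

```sh
# shuffle_lines [SEED]: hypothetical helper; shuffles standard input,
# one item per line, checking the size invariant after every move.
shuffle_lines() {
    awk -v seed="$1" '
    BEGIN { if (seed != "") srand(seed); else srand() }
    { item[NR] = $0 }
    END {
        left = NR; done = 0
        while (left > 0) {
            j = 1 + int(rand() * left)    # uniform pick from item[1..left]
            out[++done] = item[j]
            item[j] = item[left]          # fill the hole with the last remaining item
            delete item[left--]
            if (left + done != NR) {      # the invariant stated above
                print "invariant broken" | "cat 1>&2"
                exit 1
            }
        }
        for (i = 1; i <= done; i++) print out[i]
    }'
}

printf '%s\n' aa bb cc dd | shuffle_lines 23
```

Passing the same seed twice to the same awk implementation is expected to repeat the order; omitting the seed falls back to srand() with its time-based default.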
Code
function nshuffle(a, i,j,n,tmp) {
n=a[0]; # a has items at 1...n
for(i=1;i<=n;i++) {
j=i+int(rand()*(n-i+1)); # uniform pick from i..n (rounding would under-weight i and n)
tmp=a[j];
a[j]=a[i];
a[i]=tmp;
};
return n;
}
function round(x) { return int(x + 0.5) }
function shuffle(a,b, i) {
for(i in a) b[i]=a[i];
nshuffle(b);
}
function shuffles(a,b, c,n,i) {
for(i in a) {n++; c[i]=a[i]};
c[0]=n;
shuffle(c,b);
}
Correctness proof
Examples
Random orders
BEGIN {
if (ShuffleDemo) {
if (Seed) { srand(Seed) } else { srand() };
s2i(ShuffleDemo,L1," ");
shuffles(L1,L2);
while(Item =pop(L2)) print Item;
}
}
function s2i(str,a,sep, n,i,tmp) {
n=split(str,tmp,sep);
for(i=1;i<=n;i++) a[i]=tmp[i];
return n;
}
function pop(a, x,i) {
i=a[0]--;
if (!i) {return ""} else {x=a[i]; delete a[i]; return x}
}
gawk -f shuffle.awk -v ShuffleDemo="aa bb cc dd"
cc
aa
dd
bb
dd
bb
cc
aa
Fast sampling
gawk -f shuffle.awk -v ShuffleDemo="aa bb cc dd" -v Seed=$RANDOM
Repeats
gawk -f shuffle.awk -v ShuffleDemo="aa bb cc dd" -v Seed=23
Author
runawk - wrapper for AWK interpreter
Contents
Download from...
NAME
SYNOPSIS
DESCRIPTION
To pass arguments to a #!/usr/bin/awk -f script (not to the awk
interpreter), it is necessary to prepend the list of
arguments with -- (two minus signs). In my view, this looks bad.
#!/usr/bin/awk -f
BEGIN {
for (i=1; i < ARGC; ++i){
printf "ARGV [%d]=%s\n", i, ARGV [i]
}
}
% awk_program --opt1 --opt2
/usr/bin/awk: unknown option --opt1 ignored
/usr/bin/awk: unknown option --opt2 ignored
% awk_program -- --opt1 --opt2
ARGV [1]=--opt1
ARGV [2]=--opt2
%
% awk_program --opt1 --opt2
ARGV [1]=--opt1
ARGV [2]=--opt2
%
When a #!/usr/bin/awk -f script handles arguments (options) and wants
to read from stdin, it is necessary to add
/dev/stdin (or `-') as the last argument explicitly.
#!/usr/bin/awk -f
BEGIN {
if (ARGV [1] == "--flag"){
flag = 1
ARGV [1] = "" # to not read file named "--flag"
}
}
{
print "flag=" flag " $0=" $0
}
% echo test | awk_program -- --flag
% echo test | awk_program -- --flag /dev/stdin
flag=1 $0=test
%
% echo test | awk_program --flag
flag=1 $0=test
%
OPTIONS
DETAILS/INTERNALS
Standalone script
#!/usr/local/bin/runawk
#!/usr/bin/awk -f
AWK modules
#use "module1.awk"
#use "module2.awk"
file prog:
#!/usr/local/bin/runawk
#use "A.awk"
#use "B.awk"
#use "E.awk"
PROG code
...
file B.awk:
#use "A.awk"
#use "C.awk"
B code
...
file C.awk:
#use "A.awk"
#use "D.awk"
C code
...
A.awk and D.awk do not contain a #use directive.
runawk prog file1 file2
/path/to/prog file1 file2
awk -f A.awk -f D.awk -f C.awk -f B.awk -f E.awk -f prog -- file1 file2
runawk -d prog file1 file2
Module search strategy
AWK interpreter and its arguments
runawk prog2 -x -f=file -o=output file1 file2
/path/to/prog2 -x -f=file -o=output file1 file2
awk -f prog2 -- -x -f=file -o=output file1 file2
runawk prog3 --value=value
/path/to/prog3 --value=value
awk -f prog3 -- --value=value /dev/stdin
Program as an argument
/path/to/runawk -e '
#use "alt_assert.awk"
{
assert($1 >= 0 && $1 <= 10, "Bad value: " $1)
# your code below
...
}'
Selecting a preferred AWK interpreter
file prog:
#!/usr/local/bin/runawk
#use "A.awk"
#use "B.awk"
#interp "/usr/pkg/bin/nbawk"
# your code here
...
Setting environment
file prog:
#!/usr/local/bin/runawk
#env "LC_ALL=C"
$1 ~ /^[A-Z]+$/ { # A-Z is valid if LC_CTYPE=C
print $1
}
EXIT STATUS
ENVIRONMENT
AUTHOR/LICENSE
BUGS/FEEDBACK
m1 : A Micro Macro Processor
Contents
• Synopsis
• Download
• Description
• Applications
• Form Letters
• Troff Pre-Processing
• Awk Library Management
• Controlling Experiments
• The Substitution Function
• Possible Extensions
• Code
• error
• dofile
• readline
• gobble
• dosubs
• dodef
• BEGIN
• Bugs
• History
• Author
Synopsis
awk -f m1.awk [file...]
Download
Description
@comment Any text
@@ same as @comment
@define name value
@default name value set if name undefined
@include filename
@if varname include subsequent text if varname != 0
@unless varname include subsequent text if varname == 0
@fi terminate @if or @unless
@ignore DELIM ignore input until line that begins with DELIM
@stderr stuff send diagnostics to standard error
Applications
Form Letters
@default MYNAME Jon Bentley
@default TASK respond to your special offer
@default EXCUSE the dog ate my homework
Dear @NAME@:
Although I would dearly love to @TASK@,
I am afraid that I am unable to do so because @EXCUSE@.
I am sure that you have been in this situation
many times yourself.
Sincerely,
@MYNAME@
@define NAME Mr. Smith
@define TASK subscribe to your magazine
@define EXCUSE I suddenly forgot how to read
Troff Pre-Processing
@define ArrayFig @StructureSec@.2
@define HashTabFig @StructureSec@.3
@define TreeFig @StructureSec@.4
@define ProblemSize 100
@define FIGNUM @FIGMFMOVIE@
@define FIGTITLE The Multiple Fragment heuristic.
@FIGSTART@
<PS> <@THISDIR@/mfmovie.pic</PS>
@FIGEND@
Awk Library Management
Controlling Experiments
@define N ($1)
@define NODES ($2)
@define CPU ($3)
...
The Substitution Function
L = Empty
R = Input String
while R contains an "@" sign do
let R = A @ B; set L = L A and R = B
if R contains no "@" then
L = L "@"
break
let R = A @ B; set M = A and R = B
if M is in SymTab then
R = SymTab[M] R
else
L = L "@" M
R = "@" R
return L R
Possible Extensions
Code
error
function error(s) {
print "m1 error: " s | "cat 1>&2"; exit 1
}
dofile
function dofile(fname, savefile, savebuffer, newstring) {
if (fname in activefiles)
error("recursively reading file: " fname)
activefiles[fname] = 1
savefile = file; file = fname
savebuffer = buffer; buffer = ""
while (readline() != EOF) {
if (index($0, "@") == 0) {
print $0
} else if (/^@define[ \t]/) {
dodef()
} else if (/^@default[ \t]/) {
if (!($2 in symtab))
dodef()
} else if (/^@include[ \t]/) {
if (NF != 2) error("bad include line")
dofile(dosubs($2))
} else if (/^@if[ \t]/) {
if (NF != 2) error("bad if line")
if (!($2 in symtab) || symtab[$2] == 0)
gobble()
} else if (/^@unless[ \t]/) {
if (NF != 2) error("bad unless line")
if (($2 in symtab) && symtab[$2] != 0)
gobble()
} else if (/^@fi([ \t]?|$)/) { # Could do error checking here
} else if (/^@stderr[ \t]?/) {
print substr($0, 9) | "cat 1>&2"
} else if (/^@(comment|@)[ \t]?/) {
} else if (/^@ignore[ \t]/) { # Dump input until $2
delim = $2
l = length(delim)
while (readline() != EOF)
if (substr($0, 1, l) == delim)
break
} else {
newstring = dosubs($0)
if ($0 == newstring || index(newstring, "@") == 0)
print newstring
else
buffer = newstring "\n" buffer
}
}
close(fname)
delete activefiles[fname]
file = savefile
buffer = savebuffer
}
readline
function readline( i, status) {
status = ""
if (buffer != "") {
i = index(buffer, "\n")
$0 = substr(buffer, 1, i-1)
buffer = substr(buffer, i+1)
} else {
# Hume: special case for non v10: if (file == "/dev/stdin")
if ((getline <file) <= 0)
status = EOF
}
# Hack: allow @Mname at start of line w/o closing @
if ($0 ~ /^@[A-Z][a-zA-Z0-9]*[ \t]*$/)
sub(/[ \t]*$/, "@")
return status
}
gobble
function gobble( ifdepth) {
ifdepth = 1
while (readline() != EOF) {
if (/^@(if|unless)[ \t]/)
ifdepth++
if (/^@fi[ \t]?/ && --ifdepth <= 0)
break
}
}
dosubs
function dosubs(s, l, r, i, m) {
if (index(s, "@") == 0)
return s
l = "" # Left of current pos; ready for output
r = s # Right of current; unexamined at this time
while ((i = index(r, "@")) != 0) {
l = l substr(r, 1, i-1)
r = substr(r, i+1) # Currently scanning @
i = index(r, "@")
if (i == 0) {
l = l "@"
break
}
m = substr(r, 1, i-1)
r = substr(r, i+1)
if (m in symtab) {
r = symtab[m] r
} else {
l = l "@" m
r = "@" r
}
}
return l r
}
dodef
function dodef(name, str, x) {
name = $2
sub(/^[ \t]*[^ \t]+[ \t]+[^ \t]+[ \t]*/, "") # OLD BUG: last * was +
str = $0
while (str ~ /\\$/) {
if (readline() == EOF)
error("EOF inside definition")
# OLD BUG: sub(/\\$/, "\n" $0, str)
x = $0
sub(/^[ \t]+/, "", x)
str = substr(str, 1, length(str)-1) "\n" x
}
symtab[name] = str
}
BEGIN
BEGIN {
EOF = "EOF"
if (ARGC == 1)
dofile("/dev/stdin")
else if (ARGC >= 2) {
for (i = 1; i < ARGC; i++)
dofile(ARGV[i])
} else
error("usage: m1 [fname...]")
}
Bugs
History
Author
m5 - macro processor
Download
Synopsis
m5 [ -Dname ] [ -Dname=def ] [-c] [ -dp char ]
[ -o file ] [-sp char ] [ file ... ]
[g|n]awk -f m5.awk X [ -Dname ] [ -Dname=def ] [-c] [ -dp char ]
[ -o file ] [ -sp char ] [ file ... ]
Description
Options
Usage
Overview
Macro Substitution
The quick $fox jumped over the lazy $dog.
print "The quick " fox " jumped over the lazy " dog "."
Macros Containing Macros
Directive Lines
Include Directive
Main Program and Functions
Output
EXAMPLE
Input Text
#function main() {
Example 1: Simple Substitution
------------------------------
# br = "brown"
The quick $br fox.
Example 2: Substitution inside a String
---------------------------------------
# r = "row"
The quick b$(r)n fox.
Example 3: Expression Substitution
----------------------------------
# a = 4
# b = 3
The quick $(2*a + b) foxes.
Example 4: Macros References inside a Macro
-------------------------------------------
# $[fox] = "\$[q] \$[b] \$[f]"
# $[q] = "quick"
# $[b] = "brown"
# $[f] = "fox"
The $[fox].
Example 5: Array Reference Substitution
---------------------------------------
# x[7] = "brown"
# b = 3
The quick $x[2*b+1] fox.
Example 6: Function Reference Substitution
------------------------------------------
The quick $color(1,2) fox.
Example 7: Substitution of Special Characters
---------------------------------------------
\# The \$ quick \\ brown $# fox. $$
#}
#include(testincl.m5)
Included File testincl.m5
#function color(i,j) {
The lazy dog.
# if (i == j)
# return "blue"
# else
# return "brown"
#}
Output Program
function main() {
print
print " Example 1: Simple Substitution"
print " ------------------------------"
br = "brown"
print " The quick " br " fox."
print
print " Example 2: Substitution inside a String"
print " ---------------------------------------"
r = "row"
print " The quick b" r "n fox."
print
print " Example 3: Expression Substitution"
print " ----------------------------------"
a = 4
b = 3
print " The quick " 2*a + b " foxes."
print
print " Example 4: Macros References inside a Macro"
print " -------------------------------------------"
M["fox"] = "$[q] $[b] $[f]"
M["q"] = "quick"
M["b"] = "brown"
M["f"] = "fox"
print " The " eval(M["fox"]) "."
print
print " Example 5: Array Reference Substitution"
print " ---------------------------------------"
x[7] = "brown"
b = 3
print " The quick " x[2*b+1] " fox."
print
print " Example 6: Function Reference Substitution"
print " ------------------------------------------"
print " The quick " color(1,2) " fox."
print
print " Example 7: Substitution of Special Characters"
print " ---------------------------------------------"
print "\# The \$ quick \\ brown $# fox. $$"
}
function color(i,j) {
print " The lazy dog."
if (i == j)
return "blue"
else
return "brown"
}
function eval(inp ,isplb,irb,out,name) {
splb = SP "["
out = ""
while( isplb = index(inp, splb) ) {
irb = index(inp, "]")
if ( irb == 0 ) {
out = out substr(inp,1,isplb+1)
inp = substr( inp, isplb+2 )
} else {
name = substr( inp, isplb+2, irb-isplb-2 )
sub( /^ +/, "", name )
sub( / +$/, "", name )
out = out substr(inp,1,isplb-1) eval(M[name])
inp = substr( inp, irb+1 )
}
}
out = out inp
return out
}
BEGIN {
SP = "$"
main()
exit
}
Final Output
Example 1: Simple Substitution
------------------------------
The quick brown fox.
Example 2: Substitution inside a String
---------------------------------------
The quick brown fox.
Example 3: Expression Substitution
----------------------------------
The quick 11 foxes.
Example 4: Macros References inside a Macro
-------------------------------------------
The quick brown fox.
Example 5: Array Reference Substitution
---------------------------------------
The quick brown fox.
Example 6: Function Reference Substitution
------------------------------------------
The lazy dog.
The quick brown fox.
Example 7: Substitution of Special Characters
---------------------------------------------
# The $ quick \ brown $# fox. $$
File
a.awk is the default output program file.
See Also
awk(1), cpp(1), gawk(1), m4(1), nawk(1), vi(1)
Author
AWKWORDS
Contents
• Synopsis
• Download
• Description
• Extra Markup
• Short cuts for HTML
• Including nested content:
• Programmer's Guide
• Functions
• unhtml
• toc
• includes
• CSS styles
• Main command line
• Bugs
• Author
Synopsis
Download
wget http://lawker.googlecode.com/svn/fridge/lib/bash/awkwords
chmod +x awkwords
Description
Extra Markup
Short cuts for HTML
Including nested content:
Programmer's Guide
Functions
unhtml
unhtml() { cat $1| gawk '
BEGIN {IGNORECASE=1}
/^<PRE>/ {In=1; print; next}
/^<\/PRE>/ {In=0; print; next}
In {gsub("<","\\&lt;",$0); print; next }
{print $0 }'
}
toc
toc() { cat $1 | gawk '
BEGIN { IGNORECASE = 1 }
/^<[h]1>/ { Header=$0; next}
/^[<]h[23456789]>/ {
T++ ;
Toc[T] = gensub(/(.*)<h(.*)>[ \t]*(.*)[ \t]*<\/h(.*)>(.*)/,
"<""h\\2><""font color=black>\\&bull;</font></a> <""a href=#" T ">\\3</a></h\\4>",
"g",$0)
Pre="<a name="T"></a>" }
{ Line[++N] = Pre $0; Pre="" }
END { print Header;
print "<" "h2>Contents</h2>"
print "<" "div id=\"htmltoc\">"
for(I=1;I<=T;I++) print Toc[I]
print "<" "/div><!--- htmltoc --->"
print "<" "div id=\"htmlbody\">"
for(I=1;I<=N;I++) print Line[I]
print "</" "div><!--- htmlbody --->"
}'
}
includes
includes() { cat $1 | gawk '
function xpand(pre, tmp) {
if ($1 ~ "^#.IN") xpands($2,pre)
else if ($1 ~ "^#.BODY" ) xpandsBody($2,pre)
else if ($1 ~ "^#.LISTING") {
print "<" "pre>"
xpands($2,1) # <===== note the recursive call with "1"
print "<" "/pre>" }
else if ($1 ~ "^#.CODE") {
print "<" "p>" $2 "\n<" "pre>"
xpands($2,1) # <===== note the recursive call with "1"
print "<" "/pre>" }
else if ($1 ~ "^#.URL") {
tmp = $2; $1=$2="";
print "<" "a href=\""tmp"\">" trim($0) "</a>"
}
else if ($1 ~ "^#.TO") {
tmp = $2; $1=$2="";
print "<" "a href=\"mailto:"tmp"\">" trim($0) "</a>"
}
else
xpand1(pre)
}
function xpand1(pre) {
if (pre)
gsub("<","\\&lt;",$0) # <=== escape the start-of-html-character
else {
$0= xpandHtml($0) # <=== expand html short cuts
sub(/^#/,"",$0) }
print $0
}
function xpandHtml( str,tag) {
if ($0 ~ /^#\.H1/) {
$1=""
return "<" "h""1><join>" $0 "</join></" "h1>" }
if (sub(/^#\./,"",$1)) {
tag=$1; $1=""
return "<" tag ">" (($0 ~ /^[ \t]*$/) ? "" : $0"</"tag">")
}
return $0
}
function xpands(f,pre) {
if (newFile(f)) {
while((getline <f) > 0) xpand(pre)
close(f) }
}
function xpandsBody(f,pre, using) {
if (newFile(f)) {
while((getline <f) >0) {
if ( !using && ($0 ~ /^[\t ]*$/) ) using = 1
if ( using ) xpand(pre)}
close(f) }
}
function newFile(f) { return ++Seen[f]==1 }
function trim (s) { sub(/^[ \t]*/,"",s); sub(/[ \t]*$/,"",s); return s }
BEGIN { IGNORECASE=1 }
{ xpand() }'
}
CSS styles
css() {
echo "<""STYLE type=\"text/css\">"
cat<<-'EOF'
div#htmltoc h2 { font-size: medium; font-weight: normal;
margin: 0 0 0 0; margin-left: 30px;}
div#htmltoc h3 { font-size: medium; font-weight: normal;
margin: 0 0 0 0; margin-left: 60px;}
div#htmltoc h4 { font-size: medium; font-weight: normal;
margin: 0 0 0 0; margin-left: 90px;}
div#htmltoc h5 { font-size: medium; font-weight: normal;
margin: 0 0 0 0; margin-left: 120px;}
div#htmltoc h6 { font-size: medium; font-weight: normal;
margin: 0 0 0 0; margin-left: 150px;}
div#htmltoc h7 { font-size: medium; font-weight: normal;
margin: 0 0 0 0; margin-left: 180px; }
</STYLE>
EOF
}
Main command line
main() { cat $1 | includes | unhtml | toc; }
if [ "$1" = "--title" ]
then
echo "<""html><""head><""title>$2</title>`css`</head><""body>";
shift 2
main $1
echo "<""/body><""/html>"
else
main $1
fi
Bugs
Author
Tim Menzies
awf
Synopsis
Download
Description
.\" .ce .fi .in .ne .pl .sp
.ad .de .ft .it .nf .po .ta
.bp .ds .ie .ll .nr .ps .ti
.br .el .if .na .ns .rs .tm
\$ \% \* \c \f \n \s
MAN Macros
.B .DT .IP .P .RE .SM
.BI .HP .IR .PD .RI .TH
.BR .I .LP .PP .RS .TP
.BY .IB .NB .RB .SH .UC
MS Macros
.AB .CD .ID .ND .QP .RS .UL
.AE .DA .IP .NH .QS .SH .UX
.AI .DE .LD .NL .R .SM
.AU .DS .LG .PP .RE .TL
.B .I .LP .QE .RP .TP
Output
Files
common      common device-independent initialization
dev.*       device-specific initialization
mac.m*      macro packages
pass1       macro substituter
pass2.base  central formatter
pass2.m*    macro-package-specific bits of formatter
pass3       line and page composer
See Also
Diagnostics
Author
Copyright
Bugs
Linking Awk to Spreadsheets
cell < command
(and "command" is any Unix script, e.g. using Awk). When such a
cell is entered, it will:
Postscript Tricks
pschoose.awk
Contents
Synopsis
Download
Description
Pulls a range of pages out of a PostScript file and prints only those.
Details
Code
Set up the list of pages to print.
function set_pagerange( n, m, i, j, f, g)
{
delete Pages
n = split(Pagerange, f, ",")
for (i = 1; i <= n; i++) {
if (index(f[i], "-") != 0) { # a range
m = split(f[i], g, "-")
if (m != 2 || g[1] >= g[2]) {
printf("bad list of pages: %s\n",
f[i]) > "/dev/stderr"
exit 1
}
for (j = g[1]; j <= g[2]; j++)
Pages[j] = 1
} else
Pages[f[i]] = 1
}
}
BEGIN {
# constants
TRUE = 1
FALSE = 0
if (ARGC != 3) {
print "usage: pschoose range-spec file\n" > "/dev/stderr"
exit 1
}
Pagerange = ARGV[1]
delete ARGV[1]
set_pagerange()
}
NR == 1, /^%%Page:/ {
if (! /^%%Page/) {
Prolog[++nprolog] = $0
next
}
}
/^%%Trailer/ || In_trailer {
In_trailer = TRUE
Epilog[++nepilog] = $0
next
}
/^%%Page: / {
++Npage
line = 0
}
for all non-special lines
{
# only save it if we will want to print it
if (Npage in Pages)
Page[Npage, ++line] = $0
}
END {
# print the prologue
for (i = 1; i in Prolog; i++)
print Prolog[i]
# print the actual body
for (i = 1; i <= Npage; i++) {
if (i in Pages) {
for (j = 1; (i, j) in Page; j++) {
print Page[i, j]
}
}
}
# print the epilog
for (i = 1; i in Epilog; i++)
print Epilog[i]
}
Author
psrev.awk
Contents
Synopsis
Download
Description
Code
BEGIN {
# constants
TRUE = 1
FALSE = 0
# Initialize global booleans
Twoup = FALSE
# process command line flags
for (i = 1; i in ARGV && ARGV[i] ~ /^-/; i++) {
if (ARGV[i] == "-2")
Twoup = TRUE
else
printf("psrev: unrecognized option %s\n",
ARGV[i]) > "/dev/stderr"
delete ARGV[i]
}
}
NR == 1, /^%%Page:/ {
if (! /^%%Page/) {
Prolog[++nprolog] = $0
next
}
}
/^%%Trailer/ || In_trailer {
In_trailer = TRUE
Epilog[++nepilog] = $0
next
}
/^%%Page: / {
++Npage
line = 0
}
for all non-special lines
{
Page[Npage, ++line] = $0
}
END {
# print the prologue
for (i = 1; i in Prolog; i++)
print Prolog[i]
# print the actual body
if (Twoup) {
hasodd = (Npage %2 == 1)
if (hasodd) {
# print last page
for (j = 1; (Npage, j) in Page; j++)
print Page[Npage, j]
# make a fake last page for psnup
printf "%%%%Page: %d %d\n", Npage+1, Npage+1
printf "showpage\n"
print "%%BeginPageSetup"
print "BP"
print "%%EndPageSetup"
print "EP"
}
lastpage = (hasodd ? Npage - 1 : Npage)
for (i = lastpage; i > 0; i -= 2) {
for (k = i - 1; k <= i; k++)
for (j = 1; (k, j) in Page; j++)
print Page[k, j]
}
} else {
# regular 1 up printing
for (i = Npage; i > 0; i--)
for (j = 1; (i, j) in Page; j++)
print Page[i, j]
}
# print the epilog
for (i = 1; i in Epilog; i++)
print Epilog[i]
}
Author
indent.awk
Contents
Synopsis
Download
Description
Code
doindent
function doindent(){
tmpindent=indent;
if(indent<0){
print "ERROR; indent level == " indent
}
while(tmpindent >0){
printf(" ");
tmpindent-=1;
}
}
Out-denting
$1 == "done" { indent -=1; }
$1 == "fi" { indent -=1; }
$0 ~ /}/ { if(indent!=0) indent-=1; }
Worker
{
doindent();
print $0;
}
In-denting
$0 ~ /if.*;[ ]*then/ { indent+=1; }
$0 ~ /for.*;[ ]*do/ { indent+=1; }
$0 ~ /while.*;[ ]*do/ { indent+=1; }
$1 == "then" { indent+=1; }
$1 == "do" { indent+=1; }
$0 ~ /{$/ { indent+=1; }
Author
Top posters at comp.lang.awk
| posts | kbytes | name | address |
|---|---|---|---|
| 13 | 28.4 | roby | elleroroberto@katamail.com |
| 7 | 11.6 | Steffen Schuler | schuler.steffen@gmail.com |
| 4 | 10.9 | pmarin | pacogeek@gmail.com |
| 3 | 9.7 | Ed Morton | mortonspam@gmail.com |
| 3 | 5.2 | Janis Papanagnou | janis_papanagnou@hotmail.com |
| 3 | 5.1 | nag | visitnag@gmail.com |
| 2 | 6.5 | Tim Menzies | menzies.tim@gmail.com |
| 2 | 6.1 | r.p.loui@gmail.com | r.p.loui@gmail.com |
| 2 | 5.8 | Hermann Peifer | peifer@gmx.net |
| 2 | 5.7 | kielhd | kielhd@freenet.de |
| 41 | 95.0 | Total for top 10 | |
Totals for the newsgroup
For the 7 day period ending Monday April 27, 2009.
| posts | kbytes | subject |
|---|---|---|
| 10 | 33.5 | OS-variables in awk |
| 9 | 17.9 | user functions with variable number of parameters |
| 5 | 8.9 | File infos |
| 3 | 8.5 | Interpreter Informations |
| 3 | 5.0 | Log/History Files |
| 3 | 4.9 | Help with an input file |
| 3 | 4.8 | gawk can't run an awk program... |
| 3 | 4.6 | Log/History File |
| 2 | 5.6 | pgawk.exe.stackdump |
| 2 | 4.7 | OT: Re: Interpreter Informations |
For the 365 day period ending Sunday April 26, 2009.
| posts | kbytes | name | address |
|---|---|---|---|
| 156 | 530.8 | Ed Morton | mortonspam@gmail.com |
| 156 | 388.3 | Janis Papanagnou | janis_papanagnou@hotmail.com |
| 146 | 256.1 | pk | pk@pk.invalid |
| 109 | 306.6 | Ed Morton | morton@lsupcaemnt.com |
| 84 | 146.5 | Steffen Schuler | schuler.steffen@gmail.com |
| 83 | 139.4 | Kenny McCormack | gazelle@shell.xmission.com |
| 77 | 174.1 | Aharon Robbins | arnold@skeeve.com |
| 64 | 162.2 | Dave B | daveb@addr.invalid |
| 54 | 194.9 | r.p.loui@gmail.com | r.p.loui@gmail.com |
| 50 | 107.7 | Hermann Peifer | peifer@gmx.eu |
| 979 | 2406.6 | Total for top 10 | |
For the 365 day period ending Sunday April 26, 2009.
| posts | kbytes | subject |
|---|---|---|
| 61 | 219.6 | changing a field without recompiling the record |
| 44 | 71.3 | Top 10 subjects comp.lang.awk |
| 42 | 88.1 | GAWK: A fix for "missing file is a fatal error" |
| 34 | 59.6 | Top 10 posters comp.lang.awk |
| 30 | 75.3 | Indirect function calls patch for gawk available |
| 29 | 65.0 | gawk for windows: system() does not yield exit status |
| 26 | 67.1 | split field by delimiter |
| 24 | 63.6 | Is there an simple way to initialise arrays in bulk? |
| 23 | 63.5 | Sed1liners in Awk? |
| 23 | 62.6 | Gawk match() and numbers in scientific notation |
[gn]awk -f holidays.awk "opts" holidayfile
Download from LAWKER.
Job scheduling around holidays has always been a pain. To prevent messing around with crons several times a year, I used to place a "holidays" file in, for example, /usr/local/bin. The file contained the holiday date in yyyymmdd format, followed by the holiday name. (See Dateplus program for easy date manipulation.) That worked, but every year I had to refresh the file with those dates that fall on, for example, the last Monday in May. This meant remembering to edit the holidays file after the company calendar was set for the year.
Then I came across the American Secular Holidays web site by Marcos J. Montes. Montes cites Claus Tondering as his primary source, and credits Timothy Barmann and Bobby Cossum for their contributions in simplifying the equations used in the algorithms. This is significant because these algorithms provide a robust yet elegant method for identifying whether a given date is a holiday without constantly updating a configuration file.
To make these algorithms and routines as portable as possible (as long as the porting OS has nawk or gawk), I rewrote the whole thing in [gn]awk. Now practically any program with access to AWK can avail itself of these holiday date capabilities. The AWK version of the program can return the nth business day, a multi-line yyyymmdd date list, or a single line of yyyymmdd holiday dates. With those, you can easily determine whether the date you have is a holiday or specific business day.
None of my holiday work in the following code would be possible without the algorithms presented by Montes, Tondering, Barmann, and Cossum. The holidays file and the logic to process it are my contributions.
Although second to the algorithms, the holidays file is central to this system. The file's directives allow for the handling of, for example, the Friday after U.S. Thanksgiving Day (Thursday). For those organizations and companies that grant a Friday holiday when a day like Christmas or New Year's Day falls on a Saturday, or give a Monday holiday when those holidays fall on a Sunday, the holidays file provides the necessary vehicle.
After a brief description of the holidays file layout, I'll discuss the file itself and show how three holidays are handled: Memorial Day, Thanksgiving Day (including the Friday after), and Christmas.
The file itself is a simple ASCII file available to all programs. It contains values that allow the calling program to calculate holidays either by a given (fixed) month and day, or by the day of a given week. The general layout is as follows:
# Mm N.Day Adj Holiday name # Comments
Mm = Month number (leading zeros NOT required)
N.Day = Nth day (1-5 and "last") "." weekday (0-6)
(Not every part is required.)
Adj = Can be either +|- n days,
or a weekday followed immediately by +|- n days.
Holiday name = How you want it spelled out--your call.
Comments = ignored.
Leading white space is ignored, as is everything following and including the octothorpe (#-sign). Here are the entries for the three holidays:
#-----------------------------------------------------------------------#
# Mm N.Day.OnOrA Adj Holiday name # Comments #
# -- ----------- --- -------------------------------- ----------------- #
05 last.1 Memorial Day # Last Mon in May
11 4.4 Thanksgiving Day (US) # 4th Thu in Nov
11 4.4 +1 Thanksgiving Day II (US)
12 25 Christmas Day # M-F
12 25 6-1 Christmas Day (pre-holiday obs) # Sat? Use Fri
12 25 0+1 Christmas Day (post-holiday obs) # Sun? Use Mon
Memorial Day is the last Monday in May. In the table the month is "05" (again, leading zero is unnecessary). The last Monday is specified by the word "last" and not a 5 because the last Monday may not be the 5th Monday (there is no 5th Monday in May, 2003). Monday is identified by the 1 following the dot (".1"). This is based on the 0-6 convention for representing Sunday through Saturday.
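The "nth weekday" and "last weekday" rules can be sketched in a few lines of portable awk. This is only an illustration of the idea, not the holidays.awk source: the names nth_weekday, dow, and mdays are made up here, and the day-of-week routine is a common table-based formula, which may differ from the equations Montes presents.

```shell
# nth_weekday YEAR MONTH N WEEKDAY -> yyyymmdd
# N is 1-5 or "last"; WEEKDAY is 0 (Sunday) through 6 (Saturday).
# Hypothetical helper for illustration; not part of holidays.awk.
nth_weekday() {
    awk -v y="$1" -v m="$2" -v n="$3" -v wd="$4" '
    # Day of week for a Gregorian date, 0 = Sunday (table-based method).
    function dow(y, m, d,    t) {
        split("0 3 2 5 0 3 5 1 4 6 2 4", t, " ")
        if (m < 3) y--
        return (y + int(y/4) - int(y/100) + int(y/400) + t[m] + d) % 7
    }
    # Number of days in a month, allowing for leap years.
    function mdays(y, m) {
        if (m == 2) return (y % 4 == 0 && (y % 100 != 0 || y % 400 == 0)) ? 29 : 28
        return (m == 4 || m == 6 || m == 9 || m == 11) ? 30 : 31
    }
    BEGIN {
        y += 0; m += 0; wd += 0                  # force numbers ("05" -> 5)
        if (n == "last") {
            d = mdays(y, m)                      # start at the end of the month
            d -= (dow(y, m, d) - wd + 7) % 7     # step back to the weekday
        } else {
            d = 1 + (wd - dow(y, m, 1) + 7) % 7  # first such weekday
            d += 7 * (n - 1)                     # then n-1 weeks later
            if (d > mdays(y, m)) exit 1          # no such day in this month
        }
        printf "%04d%02d%02d\n", y, m, d
    }'
}
```

With this sketch, "nth_weekday 2003 05 last 1" prints 20030526, while "nth_weekday 2003 5 5 1" prints nothing and returns a non-zero status, because May 2003 has no fifth Monday.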
Thanksgiving Day (U.S. observance) is the fourth Thursday in November. November is identified by the "11". The fourth (nth) day is the first "4". Thursday is the ".4". This is the same method used for Memorial Day. The day after Thanksgiving, Friday, is a little tricky.
Contrary to what you might think, you cannot specify:
11 4.5 Thanksgiving Day II (Friday)
since the fourth Friday might not follow the fourth Thursday of a given month. Consider Thanksgiving Day, 2002--the fourth Thursday was November 28. The fourth Friday fell on the 22nd. So, to accurately capture the Friday after Thanksgiving Day, specify the same parameters for Thanksgiving, and an adjustment of +1:
11 4.4 +1 Thanksgiving Day II (Friday)
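To see why the adjustment is needed, here is a small sketch that prints, for a given year, the fourth Thursday of November, the fourth Friday, and the day after the fourth Thursday. The function names are hypothetical and the day-of-week formula is a common table-based one, not necessarily what holidays.awk uses.

```shell
# thanksgiving_friday YEAR -> "4th-Thu 4th-Fri day-after-4th-Thu" (days of November)
# Hypothetical helper for illustration; not part of holidays.awk.
thanksgiving_friday() {
    awk -v y="$1" '
    # Day of week for a Gregorian date, 0 = Sunday (table-based method).
    function dow(y, m, d,    t) {
        split("0 3 2 5 0 3 5 1 4 6 2 4", t, " ")
        if (m < 3) y--
        return (y + int(y/4) - int(y/100) + int(y/400) + t[m] + d) % 7
    }
    # Day of the month of the nth given weekday (0 = Sunday).
    function nth(y, m, n, wd) {
        return 1 + (wd - dow(y, m, 1) + 7) % 7 + 7 * (n - 1)
    }
    BEGIN {
        y += 0
        thu = nth(y, 11, 4, 4)      # the "11 4.4" rule
        fri = nth(y, 11, 4, 5)      # the tempting but wrong "11 4.5" rule
        print thu, fri, thu + 1     # thu + 1 is the "11 4.4 +1" rule
    }'
}
```

For 2002 this prints "28 22 29": the fourth Friday (the 22nd) comes before Thanksgiving, while the +1 adjustment gives the 29th. For 2003 it prints "27 28 28", a year in which the two rules happen to agree.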
Christmas is December 25. Like New Year's Day (January 1) and Independence Day (July 4), Christmas is a fixed date. Simply specifying "12 25 Christmas Day" in the holidays file returns "yyyy1225". However, with many companies, if Christmas falls on a Saturday (day 6), the Friday before is observed by adjusting it by -1. If it falls on a Sunday (day 0), the Monday following is observed by adjusting it by +1. Hence, the three entries:
12 25 Christmas Day # M-F
12 25 6-1 Christmas Day (pre-holiday obs) # Sat? Use Fri
12 25 0+1 Christmas Day (post-holiday obs) # Sun? Use Mon
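One way to read the weekday-plus-offset adjustments is sketched below. The function name observed and its argument order are assumptions made for this example, and the sketch assumes the adjusted day stays inside the same month (true for the 6-1 and 0+1 rules on December 25).

```shell
# observed YEAR MONTH DAY "ADJ ..." -> yyyymmdd of the observed holiday
# Each ADJ is a weekday digit (0-6) followed by a signed offset, e.g. 6-1 or 0+1.
# Hypothetical helper for illustration; not part of holidays.awk.
observed() {
    awk -v y="$1" -v m="$2" -v d="$3" -v adjs="$4" '
    # Day of week for a Gregorian date, 0 = Sunday (table-based method).
    function dow(y, m, d,    t) {
        split("0 3 2 5 0 3 5 1 4 6 2 4", t, " ")
        if (m < 3) y--
        return (y + int(y/4) - int(y/100) + int(y/400) + t[m] + d) % 7
    }
    BEGIN {
        y += 0; m += 0; d += 0
        w = dow(y, m, d)                          # weekday of the fixed date
        n = split(adjs, a, " ")
        for (i = 1; i <= n; i++)
            if (substr(a[i], 1, 1) + 0 == w) {    # rule for this weekday?
                d += substr(a[i], 2)              # apply "+1", "-1", "+2", ...
                break
            }
        printf "%04d%02d%02d\n", y, m, d
    }'
}
```

For example, observed 2004 12 25 "6-1 0+1" prints 20041224 (Christmas fell on a Saturday), observed 2005 12 25 "6-1 0+1" prints 20051226 (a Sunday), and observed 2003 12 25 "6-1 0+1" prints 20031225 unchanged (a Thursday).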
New Year's Day is a fixed date, January 1, and like Christmas and Independence Day, it can be observed on the Friday before a Saturday occurrence or the Monday after a Sunday occurrence simply by setting it up like the Christmas example above. However, some organizations use a post-holiday observance of New Year's Day when it falls on a Saturday simply so the holiday falls in the correct year. You can do that by specifying New Year's Day as follows:
01 01 New Year's Day # M-F
01 01 6+2 New Year's Day (post-holiday obs) # Sat? Use Mon
01 01 0+1 New Year's Day (post-holiday obs) # Sun? Also Mon
Remember, the "6" in our "6+2" means the actual date, January 1st, falls on a Saturday (day 6 in the 0-6 day-numbering scheme), so that date is adjusted by +2 days (i.e., Saturday's date, 01/01, plus two days gives 01/03).
While the program is incapable of handling Daylight Savings dates in Iran where DST starts on the first day of Farvardin and ends the first day of Mehr, holidays.awk (v1.22) is capable of handling at least one set of unique Daylight Savings Time (DST) dates. In the Falkland Islands, DST begins on the first Sunday on or after September 8th and ends on the first Sunday on or after April 6th. Those exceptions (starting on or after a date in the month) are handled by specifying a holidays line like this:
04 1.0.6 Falklands ST # 1st Sun on/after Apr 6
09 1.0.8 Falklands DST # 1st Sun on/after Sep 8
The ".6" in our "1.0.6" means Standard Time (ST) begins on the first Sunday (1.0) in April that falls on or after the 6th of April. Likewise, the ".8" in our "1.0.8" means DST begins on the first Sunday in September that comes on or after the 8th of September.
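The "on or after" rule comes down to one line of arithmetic: start at the given day and move forward zero to six days until the weekday matches. The sketch below shows that arithmetic; first_on_or_after is a hypothetical name, and the sketch assumes the result stays within the same month, as it does for the Falklands dates.

```shell
# first_on_or_after YEAR MONTH DAY WEEKDAY -> yyyymmdd
# WEEKDAY is 0 (Sunday) through 6 (Saturday).
# Hypothetical helper for illustration; not part of holidays.awk.
first_on_or_after() {
    awk -v y="$1" -v m="$2" -v k="$3" -v wd="$4" '
    # Day of week for a Gregorian date, 0 = Sunday (table-based method).
    function dow(y, m, d,    t) {
        split("0 3 2 5 0 3 5 1 4 6 2 4", t, " ")
        if (m < 3) y--
        return (y + int(y/4) - int(y/100) + int(y/400) + t[m] + d) % 7
    }
    BEGIN {
        y += 0; m += 0; k += 0; wd += 0
        d = k + (wd - dow(y, m, k) + 7) % 7   # 0-6 days forward from day k
        printf "%04d%02d%02d\n", y, m, d
    }'
}
```

For 2003, first_on_or_after 2003 4 6 0 prints 20030406 (April 6 was itself a Sunday) and first_on_or_after 2003 9 8 0 prints 20030914.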
Since Daylight Savings dates are not usually holidays, you can also retrieve the Daylight Savings Time dates via the -d option and bypass the need for the holidays file altogether. Here are the commands for the United States (DST begins the second Sunday in March) and the Falklands (DST begins on the first Sunday on/after September 8):
holidays.awk -- -d 2.0 -m 3
holidays.awk -- -d 1.0.8 -m 9
You can even set up a cron to test for Daylight Savings Time and perform some action if true.
05 00 * 03 * [ `/usr/local/bin/holidays.awk -- -d 2.0 -m 3 -t` -eq 1 ] \
&& ... Some action ...
I incorporated the business day calculation into my date routines because of a need to run a given process on the second business day of the month. Once the holidays are known, business day calculation is relatively simple--just grab the month's days and remove holidays, Saturdays and Sundays. For example, to provide the second business day, just pass a "-b 2" option to the program:
bizday=`nawk -f holidays.awk -- -b 2 holidays`
if [ `date "+%Y%m%d"` = $bizday ]; then
echo "Today is the 2nd business day of the month."
# Do whatever
fi
The last business day, and business days offset from the last business day (negative numbers), are also available in holidays.awk. To retrieve the last business day of the month, specify the "last" option argument (optarg) for the -b option (i.e., "-b last"). For the next-to-last business day of the month, provide "-b -1" as the option and optarg.
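The "grab the month's days and remove holidays, Saturdays and Sundays" idea can be sketched as follows. This is not the holidays.awk implementation: bizday is a hypothetical name, the holidays are passed as a plain list of yyyymmdd dates instead of being computed from a holidays file, and the day-of-week formula is a common table-based one.

```shell
# bizday YEAR MONTH N ["HOLIDAYS"] -> yyyymmdd
# N > 0: nth business day; "last": the last one; N < 0: that many before the last.
# HOLIDAYS is a space-separated list of yyyymmdd dates to skip.
# Hypothetical helper for illustration; not part of holidays.awk.
bizday() {
    awk -v y="$1" -v m="$2" -v n="$3" -v hol="$4" '
    # Day of week for a Gregorian date, 0 = Sunday (table-based method).
    function dow(y, m, d,    t) {
        split("0 3 2 5 0 3 5 1 4 6 2 4", t, " ")
        if (m < 3) y--
        return (y + int(y/4) - int(y/100) + int(y/400) + t[m] + d) % 7
    }
    # Number of days in a month, allowing for leap years.
    function mdays(y, m) {
        if (m == 2) return (y % 4 == 0 && (y % 100 != 0 || y % 400 == 0)) ? 29 : 28
        return (m == 4 || m == 6 || m == 9 || m == 11) ? 30 : 31
    }
    BEGIN {
        y += 0; m += 0
        k = split(hol, h, " ")
        for (i = 1; i <= k; i++) skip[h[i]] = 1
        for (d = 1; d <= mdays(y, m); d++) {
            w = dow(y, m, d)
            date = sprintf("%04d%02d%02d", y, m, d)
            if (w != 0 && w != 6 && !(date in skip))
                biz[++nb] = date             # weekdays that are not holidays
        }
        if (n == "last") idx = nb
        else if (n < 0)  idx = nb + n        # -1 is the next-to-last
        else             idx = n + 0
        if (!(idx in biz)) exit 1            # no such business day
        print biz[idx]
    }'
}
```

For June 2003, bizday 2003 6 2 prints 20030603, bizday 2003 6 last prints 20030630, and bizday 2003 6 -1 prints 20030627. With a holiday supplied, bizday 2003 7 4 "20030704" prints 20030707, because July 4 and the following weekend are skipped.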
Holidays.awk is a well-behaved program in that it uses exit status to indicate success or failure. As indicated in the documentation, for all options except business day (-b), a zero exit status means the program completed successfully; non-zero indicates failure. However, with the business day option, non-zero indicates success because the status is the day of the month on which the business day falls. Therefore, use the holidays.awk exit status as the test comparand:
nawk -f holidays.awk -- -b last holidays > /dev/null 2>&1
if [ $? -eq `date +%d` ]; then
echo "Today is the last business day of the month."
# Do whatever
fi
You can also combine -b with -m and -y to return the nth business day for a given month and year. If you request a business day (positive or negative) that is not found in the month, you receive an error message, and a 0 exit status indicating an error.
For those needing only an indication that today is a given business day, you can use the -t option in conjunction with -b. For example, using Unix cron (scheduler) we combine those options to set up a job to run only on the second business day of the month with as little as the following:
00 02 2-5 * * /usr/local/bin/holidays.awk -- -b 2 -t \
|| some_program > some_program.out 2>&1
In this example, no holidays file is specified because we use the default, /usr/local/bin/holidays (you can change the program to point to wherever you wish to locate the file). No nawk -f is used because the first line of holidays.awk uses the shebang syntax (#!/usr/bin/nawk -f) to execute itself. (Obviously, the program must have the necessary execution permissions to run this way.) With the -t option, holidays.awk returns true or false (which is not the same as success or failure), only running the called program if the day is, indeed, the second business day of the month.
There appears to be as much interest in determining the nth weekday as there is in business days, so I added an option to holidays.awk to return that. To get the first Monday in the current month, simply pass a "-d 1.Mon" option to the program:
fst_monday=`nawk -f holidays.awk -- -d 1.Mon`
An alternative syntax is also provided:
nawk -f holidays.awk -- -d 1.1
You can expand this to report the first Monday in any month and year like this.
yyyy=2005
for mm in 1 2 3 4 5 6 7 8 9 10 11 12
do
nawk -f holidays.awk -- -y $yyyy -m $mm -d 1.1
done
For the last Sunday in a month use
nawk -f holidays.awk -- -d last.Sun
For those preferring a simpler syntax: if your OS recognizes the #! (shebang) syntax, you can place a #!/usr/bin/nawk -f (or gawk) at the start of holidays.awk, thereby allowing you to skip the [gn]awk -f during invocation and simply call it like this:
holidays.awk -- -d last.Sun
holidays.awk -- -d last.0
holidays.awk -- -d 5.0
Holidays.sh executes holidays.awk, providing examples of holiday and business day testing. Provided the holidays file is located properly, executing holidays.sh on June 21, 2003 displays:
Today's no holiday, get busy. :-((
20030101 Wed. New Year's Day
20030120 Mon. M.L.King Jr. Birthday
20030526 Mon. Memorial Day
20030704 Fri. Independence Day
20030901 Mon. Labor Day
20031127 Thu. Thanksgiving Day (US)
20031128 Fri. Thanksgiving Day II (US)
20031225 Thu. Christmas Day
Today is NOT the 2nd business day (20030603) of the month.
Today is NOT the last business day (20030630) of the month.
Today is NOT the next-to-the-last business day (20030627) of the month.
As a real acid test, I include the next-to-last and last business days of every month from 2000 to 2010. The holidays.sh script concludes with a report for all holidays for the 21st century.
Copyright (c) 1995-2005 by Bob Orlando. All rights reserved.
Permission to use, copy, modify and distribute this software and its documentation for any purpose and without fee is hereby granted, provided that the above copyright notice appear in all copies, and that both the copyright notice and this permission notice appear in supporting documentation, and that the name of Bob Orlando not be used in advertising or publicity pertaining to distribution of the software without specific, written prior permission. Bob Orlando makes no representations about the suitability of this software for any purpose. It is provided "as is" without express or implied warranty.
Bob Orlando disclaims all warranties with regard to this software, including all implied warranties of merchantability and fitness. In no event shall Bob Orlando be liable for any special, indirect or consequential damages or any damages whatsoever resulting from loss of use, data or profits, whether in an action of contract, negligence or other tortious action, arising out of or in connection with the use or performance of this software.
Bob Orlando
./nbc -[hlx] train test
Download from Lawker
Is less more? Can a few lines of gawk developed in a day or two stand in for a sophisticated state-of-the-art JAVA package? In the general case there may be software engineering advantages to working with rich languages like JAVA. However, in the specific case of a Naive Bayes classifier for discrete data, it is interesting to test if less is indeed more.
A Naive Bayes classifier collects frequency counts of old events, grouped into "classes". Then, if a new event arrives without a classification, it checks through the old list of classes looking for the one with the highest frequency counts for this new event.
The method is called "naive" because it assumes that the attributes are independent of one another within each class. This assumption allows us to collect frequency counts just on each attribute value (and not pairs, or triples, or quads of values).
In practice, this "naive" strategy works remarkably well. It often performs as well as other schemes that try to model interactions between frequencies.
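As a tiny worked example of counting and scoring, the sketch below trains on five made-up weather/decision pairs with a single attribute and then classifies one value. The data, the function name nb_toy, and the variable names are invented for illustration; the score follows the same shape as the classify() function in the nbc.awk listing further down this page (log of the class frequency plus the log of a count-plus-one estimate for the attribute value).

```shell
# nb_toy VALUE -> predicted class for a one-attribute toy data set
# Hypothetical example; the training data below is made up.
nb_toy() {
    awk -v q="$1" '
    # Record one training example: attribute value v seen with class c.
    function train(v, c) {
        Total++
        Classes[c]++
        Freq[c, v]++
        if (++Seen[v] == 1) Values++     # number of distinct attribute values
    }
    BEGIN {
        train("sunny", "yes"); train("sunny", "yes"); train("rainy", "no")
        train("rainy", "no");  train("rainy", "yes")
        best = ""
        for (c in Classes) {
            # log P(class) + log P(value | class), with add-one smoothing
            score = log(Classes[c] / Total) \
                  + log((Freq[c, q] + 1) / (Classes[c] + Values))
            if (best == "" || score > like) { like = score; best = c }
        }
        print best
    }'
}
```

Here nb_toy rainy prints "no" (0.4 * 3/4 = 0.30 for "no" against 0.6 * 2/5 = 0.24 for "yes"), and nb_toy sunny prints "yes" (0.6 * 3/5 = 0.36 against 0.4 * 1/4 = 0.10).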
Hence, we call this system a not-so-naive Bayes Classifier.
In summary, the performance of this simple gawk-based Naive Bayes classifier is quite remarkable.
The following table compares classification accuracies between nbc.awk and WEKA's weka.classifiers.NaiveBayes.
All the data sets were discrete so no kernel estimation was used.
Results come from a 10-way cross-val (but no initial randomization of data set order).
The table is sorted by increasing mean difference. Nbc.awk does better than WEKA Bayes on the datasets shown at the bottom of the table.
| data | mean difference in accuracy | std. dev. | significant difference? (alpha=0.01) |
|---|---|---|---|
| soybean | -3.02 | 2.02 | y |
| iris | -2.14 | 4.51 | n |
| zoo | -0.38 | 6.54 | n |
| primary-tumor | -0.30 | 3.28 | n |
| audiology | -0.27 | 2.67 | n |
| mushroom | -0.25 | 0.23 | n |
| splice | 0.00 | 0.00 | n |
| kr-vs-kp | 0.00 | 0.00 | n |
| breast-cancer | 0.00 | 0.00 | n |
| contact-lenses | 0.00 | 0.00 | n |
| vote | 0.27 | 1.47 | n |
| lymph | 0.73 | 5.01 | n |
| breast-w | 1.63 | 1.32 | y |
| credit-a | 7.40 | 5.60 | y |
| letter | 9.44 | 1.40 | y |
On the whole, nbc.awk works as well as WEKA Bayes.
The following table compares the runtimes of nbc.awk (awk) vs WEKA BAYES (java) measured in seconds.
Each line shows the total time for ten training+test runs (one for each item in the cross val). E.g. each letter run took 4.92 seconds (on average), and this was repeated 10 times.
Note: the time for dividing files for the x-val is not shown.
The table is sorted on the ratio of awk vs java runtimes. Ratios less than one mean awk ran faster than java. Nbc.awk runs faster than WEKA Bayes on the datasets shown at the bottom of the table (below the middle line).
runtimes (secs) |
-----------------|---------------------
data awk java ratio| insts attrs classes
--------------------------------|---------------------
letter 49.2 17.6 2.8 20,000 17 27
mushroom 10.1 5.9 1.7 8,124 23 3
kr-vs-kp 8.1 5.1 1.6 3,916 37 3
splice 11.3 7.8 1.4 3,190 62 4
soybean 4.2 3.4 1.2 683 36 20
------------------------------------------------------
audiology 2.9 3.4 0.9 226 70 25
primary-tumor 1.3 2.8 0.5 339 18 23
vote 1.0 2.4 0.4 435 17 3
contact-lenses 0.6 2.0 0.3 24 5 4
breast-cancer 0.7 2.4 0.3 286 10 3
credit-a 1.1 3.3 0.3 690 16 3
breast-w 1.0 3.5 0.3 699 10 3
lymph 0.6 2.4 0.2 148 19 5
iris 0.5 2.5 0.2 150 5 4
zoo 0.6 2.4 0.2 101 18 8
------------------------------------------------------
total 93.1 66.8 1.4
All up, the awk-based learner was 40% slower than the JAVA version. For larger data sets, JAVA was always faster. However, for smaller datasets (under 1000 instances) the awk version was nearly as fast or faster.
We have run this small Awk script on 100s of megabytes of data, without crashes or core dumps. The code is very memory efficient, unlike WEKA, which loads all the data into RAM.
It is hardly surprising that a state-of-the-art tool kit built and optimized by JAVA gurus can out-perform awk code on large examples. However, what is surprising is that a 32 line AWK script built and debugged in a weekend often works nearly as well, or better.
Perhaps "nbc" is not-so-naive after all.
To check the download, unzip the contents.zip then
chmod +x nbc
./nbc nbceg.train nbceg.test |
gawk -F, '{print $0 "\t " ($1 !=$2 ? " <== bad" : "")}'
This should print:
malign_lymph,malign_lymph
metastases,metastases
malign_lymph,malign_lymph
metastases,metastases
malign_lymph,metastases <== bad
malign_lymph,malign_lymph
malign_lymph,malign_lymph
metastases,metastases
metastases,metastases
metastases,metastases
malign_lymph,malign_lymph
metastases,metastases
malign_lymph,malign_lymph
Here is the nbc.awk code called by the Bash script (shown below).
BEGIN {
#Internal globals:
Total=0 # count of all instances
# Classes # table of class names/frequencies
# Freq # table of counters for values in attributes in classes
# Seen # table of counters for values in attributes
# Attributes # table of number of values per attribute
}
Pass==1 {train()}
Pass==2 {print $NF "," classify()}
function train( i,c) {
Total++;
c=$NF;
Classes[c]++;
for(i=1;i<=NF;i++) {
if ($i=="?") continue;
Freq[c,i,$i]++
if (++Seen[i,$i]==1) Attributes[i]++}
}
function classify( i,temp,what,like,c) {
like = -100000; # smaller than any log
for(c in Classes) {
temp=log(Classes[c]/Total); #uses logs to stop numeric errors
for(i=1;i<NF;i++) {
if ( $i=="?" ) continue;
temp += log((Freq[c,i,$i]+1)/(Classes[c]+Attributes[i]));
};
if ( temp >= like ) {like = temp; what=c}
};
return what;
}
copyleft() { cat<<EOF
nbc: a naive bayes classifier
Copyright (C) 2004 Tim Menzies
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation, version 2.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
EOF
}
usage() { cat<<-EOF
Usage: nbc [FLAGs] TRAIN TEST
Naive bayes classifier.
TRAIN and TEST are comma-separated data files with the same
number of columns. The last column of each is the class
symbol. This classifier learns from TRAIN and then tries
to classify the examples in TEST.
Flags:
-h print this help text
-l copyright notice
-x run an example
EOF
exit
}
nbcDemo() {
main nbceg.train nbceg.test
}
main() {
gawk -F, -f nbc.awk Pass=1 $1 Pass=2 $2
}
demo=""
while getopts "hlx" flag
do case "$flag" in
l) copyleft; exit;;
h) usage; exit ;;
x) demo="nbcDemo";;
esac
done
shift $(($OPTIND - 1))
if [ -n "$demo" ]
then $demo
exit
else main $1 $2
fi
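The main function above relies on a standard awk feature: a var=value argument on the command line is assigned at the moment awk reaches it in the argument list, so Pass is 1 while the first file is read and 2 while the second is read. Here is a minimal sketch of that idiom in isolation. The helper name two_pass and the temporary files are made up for this illustration and are not part of nbc.

```sh
# two_pass FILE1 FILE2: count the lines of FILE1 (pass 1), then print
# each line of FILE2 followed by that count (pass 2).
# two_pass is a hypothetical name used only for this illustration.
two_pass() {
    awk 'Pass == 1 { n++ }
         Pass == 2 { print $0, n }' Pass=1 "$1" Pass=2 "$2"
}

# Example run on two small temporary files.
tmp1=$(mktemp) && tmp2=$(mktemp)
printf 'a\nb\nc\n' > "$tmp1"
printf 'x\ny\n'    > "$tmp2"
two_pass "$tmp1" "$tmp2"
rm -f "$tmp1" "$tmp2"
```

In nbc the first pass fills the frequency tables and the second pass only reads them, which is why one awk process is enough for both training and testing.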
Tim Menzies
These pages focus on Awk and operating systems.
Brian Jones writes at linux.com:
The nice thing about humans is that they're at least somewhat predictable. Given the choice between having data randomly strewn about, and having it in some predictable pattern, humans will generally choose predictable patterns (Microsoft filesystem management issues notwithstanding). These patterns are what make awk, a pattern-matching programming language, a wonderful tool for systems administrators, database administrators, and even command-line junkies who use their box mainly for pleasure. The notion of being able to write a one-line command to do almost anything draws ever closer with awk in your tool belt. For most things administrators use awk for, it's an extremely simple language. As you get into writing more advanced awk scripts, at some point it becomes a bit cumbersome, and you realize that Perl is also your friend. But for now, let's focus on how awk can get you the most bang for your keyboard strokes, shall we?
The first thing you should know is that awk is actually a rather powerful language. Entire books have been written about its use. If you're so inclined, you can write extremely complex 1000-line scripts using awk. However, as a systems administrator (the intended audience for this article), 99% of your use of awk will consist of relatively short scripts, and one-off one-liners typed right on the command line. Here's an example of a common use of awk:
[jonesy@newhotness jonesy]$ cat access_log |
awk '{print $1}' | sort | uniq -c | sort -rn
The above one-liner uses awk to slim down the amount of data coming from the web server's access log. The access log is space-delimited, and I only want to see the first field (hence "print $1"). Once I have that data, I want to sort it, then I have "uniq -c" provide a count of each occurrence for each unique value, and then I produce a reverse sort based on the numeric count provided by "uniq". The result has the number of hits in the left column, and the host in the right column, and the most frequent visitors are at the top of the list. Give it a shot! Even if you're hosted by an ISP, you should be able to access this log.
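The counting step in that pipeline can also be done inside awk with an associative array, leaving only the final sort to the shell. This is a sketch of an alternative, not the article's method. The function name top_hosts and the sample log lines are invented for the example.

```sh
# top_hosts [FILE...]: print "count host" pairs, most frequent first.
# top_hosts is a hypothetical helper, not part of the original article.
top_hosts() {
    awk '{ hits[$1]++ }
         END { for (host in hits) print hits[host], host }' "$@" |
    sort -rn
}

# Three made-up log lines: two hits from one host, one from another.
printf '10.0.0.1 GET /\n10.0.0.2 GET /\n10.0.0.1 GET /a\n' | top_hosts
```

The order in which "for (host in hits)" visits the array is unspecified, which is why the result is still piped through sort -rn.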
Awk is perfect for ripping data into smaller chunks, to make it more bite-size for other applications or manipulation. To use it on the command line on files that are not space-delimited, you can use the "-F" flag, and indicate a delimiter. This is useful for tearing apart /etc/passwd and /etc/shadow files. For example:
[jonesy@tux jonesy]$ cat /etc/passwd | awk -F: '{print $5}' | awk -F, '{print NF}'
I actually used something kinda similar to that during a NIS to LDAP migration to see if the gecos field ($5 in /etc/passwd) had consistent enough data to be useful. One of the tests is to see how consistent the number of datapoints held in the gecos field is from record to record. To figure out the number of fields in each record's gecos field, I tell awk to use ":" as the delimiter, and, based on that, print the fifth field. I then pipe that to another awk one-liner, which uses an awk built-in variable, "NF" and a different delimiter (gecos is generally comma-delimited, if it's even used for useful data).
When one-liners just aren't enough for you, you can store a whole bunch of awk one-liners in a file, and call awk with "-f script" to tell it which file to read its commands from. Additionally, since awk needs to act on some data, you should also tag on something to take care of feeding awk the data it so desperately needs. For example, if I have a script called "getuname", which looks like this:
BEGIN { FS=":" }
{print $1}
I can now call that script, feeding it anything that I know ahead of time has the user name as the first field in a given record. So I can say "awk -f getuname < /etc/passwd", or "ypcat passwd | awk -f getuname". There are two rather important things I did in this script that will save you some headaches. First, notice the "BEGIN" statement. This statement exists to give you some space to do some tasks before awk starts reading any data. In this example, I want awk to know before it processes any data, that it should use a colon as its field separator. Sure, I could've called awk differently to get around this, i.e. "awk -F: -f getuname < /etc/passwd", but this way is shorter, and that's the point! It should also be noted that, if you have the need, you can also have an "END" section to your script, which will perform any actions, once, after the last data record has been processed.
On the second line, I've just called a simple awk "action" statement, just like on the command line, with one important exception: I didn't use single quotes around it. On the command line the quotes are only there to protect the program from the shell; inside a script file, awk reads them as part of the program and chokes. I know, because it happened while I was testing this script. Bad admin!
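To see BEGIN, a main action and END working together, here is a small sketch in the same spirit as "getuname". The function name list_users and the two sample records are assumptions made for the example. Because the program is passed on the command line here, it does need the single quotes.

```sh
# list_users [FILE...]: print the first colon-separated field of every
# record, then a total once all input has been read.
# list_users is a hypothetical name used only for this sketch.
list_users() {
    awk 'BEGIN { FS = ":" }            # runs once, before any input
         { print $1; n++ }             # runs for every record
         END { print "total:", n + 0 } # runs once, after the last record
        ' "$@"
}

printf 'root:x:0:0\njonesy:x:12000:13\n' | list_users
```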
Awk has some built-in functions, like most scripting languages, which make life a bit easier. It also has some built-in variables that awk keeps track of for you -- and you get their values for free, just for asking, which is nice. The most useful variable I've had the pleasure to use as an admin is the "NF" variable, which will tell you, based on the field separator given (space by default), how many fields are in the current record. Conversely, the most useful function I've used as an awk scripter is the "split" function, which can break a single field into another array of separate fields. First, here's a quick example of NF in action:
cat /etc/passwd | awk -F: '{print NF}'
This is the lazy man's way to get the users' shells from the /etc/passwd file without having to remember how many fields are in the file. But wait! This doesn't print the last field in the record! It prints the number of fields in the record! Simple enough -- add a "$" to the front of "NF", and you'll get what you're looking for. Pipe the output to a couple of "sort" and "uniq" commands like we did earlier with the web log, and you'll get a snapshot of what the most commonly used shells are.
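A minimal sketch of the "$NF" version described above, wrapped in a hypothetical function last_field and run on two made-up passwd records so the result does not depend on the local system:

```sh
# last_field [FILE...]: print the last colon-separated field of each
# record. last_field is a hypothetical helper name.
last_field() {
    awk -F: '{ print $NF }' "$@"
}

printf 'root:x:0:0:root:/root:/bin/bash\ndaemon:x:1:1::/usr/sbin:/usr/sbin/nologin\n' |
last_field
```

Running "last_field /etc/passwd | sort | uniq -c | sort -rn" would then give the shell popularity snapshot.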
Now let's have a look at the split function. Let's say you use your gecos field to store a bunch of datapoints, and the datapoints within the gecos field are comma-delimited. This is not nearly so contrived as it might sound -- this happens in more than two environments I've done work in. Here's what it might look like:
jonesy:x:12000:13:Brian K. Jones,LUSER,101B,NONE:/home/jonesy:/bin/bash
Now let's say your PHB comes along and says he's tired of referring to me as "jonesy" and wants to know my real name. You can use awk's "split" function to help you here, and the code for doing so is fairly short:
BEGIN { FS=":" }
{
gfields = split ( $5, gecos, ",")
chunkname = split ( gecos[1], fullname, " " )
print fullname[chunkname], fullname[1]
}
Let's translate that into English, shall we? Of course, you now know what the BEGIN statement does here -- nothing new. We'll start by looking at the "gfields" line, where I use "split" to break up the 5th field of the record (the gecos field), using the comma as a delimiter, and storing all of the resulting fields in an array called "gecos". This can be counterintuitive, as you may be tempted to think that the resulting array is called "gfields". However, "split" returns the number of pieces it produced, so the "gfields" variable actually holds the number of fields in the "gecos" array. You get a look at how this works in the following two lines. "chunkname" likewise holds the number of fields in the "fullname" array. The "fullname" array is created by splitting the first field of the "gecos" array (in this case, the field holding my full name), using a space as the delimiter. On the next line, I reference "fullname[chunkname]", which will print the last name of the person, even if (as in my case) they have a middle name or initial. Then I print the very first field in the fullname array, so the output generated by this script acting on my passwd record would be "Jones Brian".
Whew! That was a mouthful. Awk has so many cool little hacks and built-in features that there has been more than one book published just on Awk. Undoubtedly, I'll utilize some of these features in future articles that involve putting together sysadmin solutions using various scripts as duct tape.
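The walk-through can be checked by running the same script on the sample passwd record. In this sketch the two counts returned by "split" are printed as well, so you can see what they hold. The wrapper function real_name is a name invented for the example.

```sh
# real_name [FILE...]: the script from the article, with the two counts
# returned by split printed in front of the name.
# real_name is a hypothetical wrapper, not part of the article.
real_name() {
    awk 'BEGIN { FS = ":" }
    {
        gfields   = split($5, gecos, ",")           # pieces in the gecos field
        chunkname = split(gecos[1], fullname, " ")  # words in the full name
        print gfields, chunkname, fullname[chunkname], fullname[1]
    }' "$@"
}

printf 'jonesy:x:12000:13:Brian K. Jones,LUSER,101B,NONE:/home/jonesy:/bin/bash\n' |
real_name
# prints: 4 3 Jones Brian
```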
These pages focus on XML tools and Awk.
A simple XML parser for awk
awk -f xmlparse.awk [FILESPEC]...
From LAWKER.
This script is a simple XML parser for (modern variants of) awk. Input in XML format is saved to two arrays, "type" and "item".
The term, "item", as used here, refers to a distinct XML element, such as a tag, an attribute name, an attribute value, or data.
The indexes into the arrays are the sequence number that a particular item was encountered. For example, the third item's type is described by type[3], and its value is stored in item[3].
The "type" array contains the type of the item encountered for each sequence number. Types are expressed as a single word: "error" (invalid item or other error), "begin" (open tag), "attrib" (attribute name), "value" (attribute value), "end" (close tag), and "data" (data between tags).
The "item" array contains the value of the item encountered for each sequence number. For types "begin" and "end", the item value is the name of the tag. For "error", the value is the text of the error message. For "attrib", the value is the attribute name. For "value", the value is the attribute value. For "data", the value is the raw data.
WARNING: XML-quoted values ("entities") in the data and attribute values are *NOT* unquoted; they are stored as-is.
BEGIN {
In XML, literal "<" and ">" are only valid as tag delimiters; to include a "<" or ">" as data, they must be quoted: "&lt;" and "&gt;". So we know that if we encounter a ">", we have reached the end of a tag. This makes a convenient end-of-record marker, as the end-of-tag delimiter marks a special event, whereas a new-line is simply whitespace in XML.
RS = ">";
lineno = 1;
sptr = 0;
}
Count input lines.
{
data = $0;
lineno += gsub( /\n/, "", data );
data = "";
}
Special modes of operation. These handle special XML sections, such as literal character data containing XML meta-characters ("cdata" sections), comments, and processing instructions ("pi") for other document processors.
"Cdata" sections are terminated by the sequence, "]]>".
( mode == "cdata" ) {
if ( $0 ~ /\]\]$/ ) {
sub( /\]\]$/, "", $0 );
mode = "";
};
item[idx] = item[idx] RS $0;
next;
}
Comment sections are terminated by the sequence, "-->".
( mode == "comment" ) {
if ( $0 ~ /--$/ ) {
sub( /--$/, "", $0 );
mode = "";
};
item[idx] = item[idx] RS $0;
next;
}
Processing instruction sections are terminated by the sequence, "?>".
( mode == "pi" ) {
if ( $0 ~ /\?$/ ) {
sub( /\?$/, "", $0 );
mode = "";
};
item[idx] = item[idx] RS $0;
next;
}
( !mode ) {
mline = 0;
Our record separator is the end-of-tag marker, ">". If we've encountered an end-of-tag marker, we should have a beginning-of-tag marker ("<") somewhere in the input record. If not, either there is a spurious end-of-tag marker, or the record was terminated by the end-of-file.
p = index( $0, "<" );
Any data preceding the beginning-of-tag marker is raw data. If no beginning-of-tag marker is present, everything in the input is data.
if ( !p || ( p > 1 )) {
idx += 1;
type[idx] = "data";
item[idx] = ( p ? substr( $0, 1, ( p - 1 )) : $0 );
if ( !p ) next;
$0 = substr( $0, p );
};
Recognize special XML sections. Sections are not processed as XML, but handled specially. If the section ends with the current input record, we continue processing XML in the next record; otherwise, we enter a special mode and perform special processing.
Character data ("cdata") sections contain literal character data containing XML meta-characters that should not be processed. Character data sections begin with the sequence, "<![CDATA[" and end with "]]>". This section may span input records.
if ( $0 ~ /^<!\[[Cc][Dd][Aa][Tt][Aa]\[/ ) {
idx += 1;
type[idx] = "cdata";
$0 = substr( $0, 10 );
if ( $0 ~ /\]\]$/ ) sub( /\]\]$/, "", $0 );
else {
mode = "cdata";
mline = lineno;
};
item[idx] = $0;
next;
}
Comments begin with the sequence, "<!--" and end with "-->". This section may span input records.
else if ( $0 ~ /^<!--/ ) {
idx += 1;
type[idx] = "comment";
$0 = substr( $0, 5 );
if ( $0 ~ /--$/ ) sub( /--$/, "", $0 );
else {
mode = "comment";
mline = lineno;
};
item[idx] = $0;
next;
}
Declarations begin with the sequence, "<!". This section may *NOT* span input records.
else if ( $0 ~ /^<!/ ) {
idx += 1;
type[idx] = "decl";
$0 = substr( $0, 3 );
item[idx] = $0;
next;
}
Processing instructions ("pi") begin with the sequence, "<?" and end with "?>". This section may span input records.
else if ( $0 ~ /^<\?/ ) {
idx += 1;
type[idx] = "pi";
$0 = substr( $0, 3 );
if ( $0 ~ /\?$/ ) sub( /\?$/, "", $0 );
else {
mode = "pi";
mline = lineno;
};
item[idx] = $0;
next;
};
Beyond this point, we're dealing strictly with a tag.
idx += 1;
A tag that begins with "</" is a close tag: it closes a tag-enclosed block.
if ( substr( $0, 1, 2 ) == "</" ) {
type[idx] = "end";
tag = $0 = substr( $0, 3 );
}
A tag that begins simply with "<" is an open tag: it starts a tag-enclosed block. Note that a stand-alone tag (one that ends with "/>") will be handled later, and will appear as an open tag and close tag, with no data between.
else {
type[idx] = "begin";
tag = $0 = substr( $0, 2 );
};
The tag name is saved in "tag" so that we can retrieve it later should we find that the tag is stand-alone and need to save a close tag item.
sub( /[ \n\t/].*$/, "", tag );
tag = toupper( tolower( tag ));
item[idx] = tag;
Validate the tag name. If invalid, indicate so and exit.
if ( tag !~ /^[A-Za-z][-+_.:0-9A-Za-z]*$/ )
{
type[idx] = "error";
item[idx] = "line " lineno ": " tag ": invalid tag name";
exit( 1 );
}
If an open tag is encountered, its name is recorded on the stack. If a close tag is encountered, its name is compared against the name on the top of the stack. If the names differ, an error is generated (XML does not allow overlapping tags).
if ( type[idx] == "begin" ) {
sptr += 1;
lstack[sptr] = lineno;
tstack[sptr] = tag;
}
else if ( type[idx] == "end" ) {
if ( tag != tstack[sptr] ) {
type[idx] = "error";
item[idx] = "line " lineno ": " tag \
": unexpected close tag, expecting " \
tstack[sptr];
exit( 1 );
};
delete tstack[sptr];
sptr -= 1;
};
sub( /[^ \n\t/]*[ \n\t]*/, "", $0 );
Beyond this point, we're dealing with the tag attributes, if any, and/or the stand-alone end-of-tag marker.
while ( $0 ) {
If $0 contains only a slash (/), then the tag we're processing is stand-alone (it ended with "/>"), so we generate a close tag, but no data between the open and close tags.
if ( $0 == "/" )
{
idx += 1;
type[idx] = "end";
item[idx] = tag;
delete lstack[sptr];
delete tstack[sptr];
sptr -= 1;
break;
};
The attribute name is determined. Note that the attribute name is also saved to "attrib" so that we can reference it should the attribute not include a value. If the attribute does not include a value, its name is given as its value.
idx += 1;
type[idx] = "attrib";
attrib = $0;
sub( /=.*$/, "", attrib );
attrib = tolower( attrib );
item[idx] = attrib;
Validate the attribute name. If invalid, indicate so and exit.
if ( attrib !~ /^[A-Za-z][-+_0-9A-Za-z]*$/ )
{
type[idx] = "error";
item[idx] = "line " lineno ": " attrib \
": invalid attribute name";
exit( 1 );
}
sub( /^[^=]*/, "", $0 );
Each attribute must have a value. If one isn't explicit in the input, we assign it one equal to the name of the attribute itself. Attribute values in the input may be in one of three forms: enclosed in double quotes ("), enclosed in single quotes/apostrophes ('), or a single word.
idx += 1;
type[idx] = "value";
if ( substr( $0, 1, 1 ) == "=" ) {
if ( substr( $0, 2, 1 ) == "\"" ) {
item[idx] = substr( $0, 3 );
sub( /".*$/, "", item[idx] );
sub( /^="[^"]*"/, "", $0 );
}
else if ( substr( $0, 2, 1 ) == "'" ) {
item[idx] = substr( $0, 3 );
sub( /'.*$/, "", item[idx] );
sub( /^='[^']*'/, "", $0 );
}
else {
item[idx] = substr( $0, 2 );
sub( /[ \n\t\/].*$/, "", item[idx] );
sub( /^=[^ \n\t/]*/, "", $0 );
};
}
else item[idx] = attrib;
sub( /^[ \n\t]*/, "", $0 );
};
attrib = "";
tag = "";
next;
}
END {
If mode is defined, the input stream ended without terminating an XML section. Thus, the input contains invalid XML.
if ( mode ) {
idx += 1;
type[idx] = "error";
if ( mode == "cdata" ) mode = "character data";
else if ( mode == "pi" ) mode = "processing instruction";
item[idx] = "line " mline ": unterminated " mode;
};
If an open tag occurred with no corresponding close tag, we have invalid XML.
for ( n = sptr; n; n -= 1 ) {
idx += 1;
type[idx] = "error";
item[idx] = "line " lstack[n] ": " \
tstack[n] ": unclosed tag";
};
}
The following simple examples demonstrate the use of the accumulated data from the XML input stream.
END {
If errors occurred, generate appropriate messages and exit without further processing.
if ( type[idx] == "error" ) {
for ( n = idx; n && ( type[n] == "error" ); n -= 1 );
for ( n += 1; n <= idx; n += 1 ) print "ERROR:", item[n];
exit 1;
};
# Print simplified XML. If output completes successfully and the stack
# is not empty, close tags are generated for each tag on the stack.
# in_tag = 0;
#
# for ( n = 1; n <= idx; n += 1 ) {
#
# if ( type[n] == "attrib" ) printf( " %s", item[n] );
#
# else if ( type[n] == "begin" ) {
# if ( in_tag ) printf( ">" );
# else in_tag = 1;
# printf( "<%s", item[n] );
# }
#
# else if ( type[n] == "cdata" ) {
# if ( in_tag ) {
# printf( ">" );
# in_tag = 0;
# };
# printf( "<![CDATA[%s]]>", item[n] );
# }
#
# else if ( type[n] == "comment" ) {
# if ( in_tag ) {
# printf( ">" );
# in_tag = 0;
# };
# printf( "<!--%s-->", item[n] );
# }
#
# else if ( type[n] == "data" ) {
# if ( in_tag ) {
# printf( ">" );
# in_tag = 0;
# };
# printf( "%s", item[n] );
# }
#
# else if ( type[n] == "decl" ) {
# if ( in_tag ) {
# printf( ">" );
# in_tag = 0;
# }
# printf( "<!%s>", item[n] );
# }
#
# else if ( type[n] == "end" ) {
# if ( in_tag ) {
# printf( "/>" );
# in_tag = 0;
# }
# else printf( "</%s>", item[n] );
# }
#
# else if ( type[n] == "error" ) {
# if ( in_tag ) {
# printf( ">" );
# in_tag = 0;
# };
# print "";
# print "<!-- ERROR:", item[n], "-->";
# break;
# }
#
# else if ( type[n] == "pi" ) {
# if ( in_tag ) {
# printf( ">" );
# in_tag = 0;
# };
# printf( "<?%s?>", item[n] );
# }
#
# else if ( type[n] == "value" ) {
# if ( item[n] ~ /"/ ) printf( "='%s'", item[n] );
# else printf( "=\"%s\"", item[n] );
# };
# };
#
# if ( in_tag ) printf( "\>" );
#
# for ( n = sptr; n; n -= 1 ) printf( "</%s>", tstack[n] );
# Print an object tree, identifying tags and attributes. Nesting is
# emphasized by indenting.
# indent = "";
# for ( n = 1; n <= idx; n += 1 ) {
# if ( type[n] == "attrib" ) print indent "attrib", item[n];
# else if ( type[n] == "begin" ) {
# print indent "begin", item[n];
# indent = indent " ";
# }
# else if ( type[n] == "end" ) {
# indent = substr( indent, 3 );
# print indent "end", item[n];
# }
# else if ( type[n] == "error" ) print "ERROR:", item[n];
# else print indent type[n];
# };
Print in a linear format suitable for parsing by shell scripts. Multi-line values have the new-lines replaced with the character sequence, "\n" (backslash, n) to ensure the entire name/value pair occurs on a single line. All occurrences of backslashes (\) in the original value are themselves backslash quoted.
for ( n = 1; n <= idx; n += 1 ) {
value = item[n];
gsub( /\\/, "\\\\", value );
gsub( /\n/, "\\n", value );
print type[n], value;
};
for ( n = sptr; n; n -= 1 ) print "end", tstack[n];
Print attribute values and data in a linear format suitable for searching (e.g. with grep). Attributes are represented as:
[TAG/]...TAG/ATTRIB=VALUE
Data is represented as:
[TAG/]...TAG: DATA
Note that all tag names are displayed in upper-case. All attribute names are displayed in lower-case.
Multi-line values have the new-lines replaced with the character sequence, "\n" (backslash, n) to ensure the entire name/value pair occurs on a single line. All occurrences of backslashes (\) in the original value are themselves backslash quoted.
# sptr = 0;
# for ( n = 1; n <= idx; n += 1 ) {
# if ( type[n] == "attrib" ) {
# lead = stack[1];
# for ( m = 2; m <= sptr; m += 1 ) \
# lead = lead "/" stack[m];
# lead = lead "/" item[n] "=";
# }
# else if ( type[n] == "begin" ) stack[++sptr] = item[n];
# else if (( type[n] == "cdata" ) || ( type[n] == "data" )) {
# lead = stack[1];
# for ( m = 2; m <= sptr; m += 1 ) \
# lead = lead "/" stack[m];
# lead = lead ": ";
# }
# else if ( type[n] == "end" ) sptr -= 1;
# if (( type[n] == "data" ) || ( type[n] == "value" )) {
# value = item[n];
# gsub( /\\/, "\\\\", value );
# gsub( /\n/, "\\n", value );
# print lead value;
# };
# };
}
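The central trick of the parser above, using ">" as the record separator so that every record ends at the close of a tag, can be tried in isolation. The following is a much-reduced sketch, not the LAWKER script: it only distinguishes data, open tags and close tags, and ignores attributes, comments, cdata and error checking. The function name xml_items is invented for this example.

```sh
# xml_items: print one "type value" line per item found on stdin.
# A reduced illustration of the RS = ">" idea; xml_items is a
# hypothetical name and this is not the full parser.
xml_items() {
    awk 'BEGIN { RS = ">" }
    {
        p = index($0, "<")
        if (p == 0) {               # no tag in this record: all data
            if ($0 != "") print "data", $0
            next
        }
        if (p > 1) print "data", substr($0, 1, p - 1)
        tag = substr($0, p + 1)     # text after the "<"
        if (substr(tag, 1, 1) == "/") print "end", substr(tag, 2)
        else                          print "begin", tag
    }'
}

printf '<a><b>hi</b></a>' | xml_items
```

For this input the sketch prints "begin a", "begin b", "data hi", "end b" and "end a", one per line, which mirrors the type/item pairs that the full script stores in its two arrays.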
Steve Coile
Download from Source Forge.
Jawk runs on any platform which supports, at minimum, J2SE 5.
java -jar jawk.jar {command-line-arguments}
To view the command line argument usage summary, execute
java -jar jawk.jar -h
The output of this command is shown below:
java ... org.jawk.Awk [-F fs_val] [-f script-filename]
[-o output-filename] [-c] [-z] [-Z]
[-d dest-directory] [-S] [-s] [-x] [-y] [-r]
[-ext] [-ni] [-t] [-v name=val]...
[script] [name=val | input_filename]...
-F fs_val = Use fs_val for FS.
-f filename = Use contents of filename for script.
-v name=val = Initial awk variable assignments.
-t = (extension) Maintain array keys in sorted order.
-c = (extension) Compile to intermediate file. (default: a.ai)
-o = (extension) Specify output file.
-z = (extension) | Compile for JVM. (default: AwkScript.class)
-Z = (extension) | Compile for JVM and execute it. (default: AwkScript.class)
-d = (extension) | Compile to destination directory. (default: pwd)
-S = (extension) Write the syntax tree to file. (default: syntax_tree.lst)
-s = (extension) Write the intermediate code to file. (default: avm.lst)
-x = (extension) Enable _sleep, _dump as keywords, and exec as a builtin func.
(Note: exec enabled only in interpreted mode.)
-y = (extension) Enable _INTEGER, _DOUBLE, and _STRING casting keywords.
-r = (extension) Do NOT hide IllegalFormatExceptions for [s]printf.
-ext= (extension) Enable user-defined extensions. (default: not enabled)
-ni = (extension) Do NOT process stdin or ARGC/V through input rules.
(Useful for blocking extensions.)
(Note: -ext & -ni available only in interpreted mode.)
-h or -? = (extension) This help screen.
The Jawk extension facility allows for arbitrary Java code to be called as Awk functions in a Jawk script. These extensions can come from the user (developer) or 3rd party providers (i.e., the Jawk project team). Jawk extensions are opt-in. In other words, the -ext flag is required to use Jawk extensions, and extensions must be explicitly registered to the Jawk instance via the -Djawk.extensions property (except for core extensions bundled with Jawk).
Jawk extensions also support blocking, which you can think of as a tool for extension event management. A Jawk script can block on a collection of blockable services, such as socket input availability, database triggers, user input, GUI dialog input response, or a simple fixed timeout. Together with the -ni option, action rules can then act on block events instead of input text. This takes a powerful AWK construct originally intended for text processing and uses it to process blockable events. A sample enhanced echo server script is included in this article. It uses blocking to handle socket events, standard input from the user, and timeout events, all within the 47-line script (including comments).
## to run: java ... -jar jawk.jar -ext -ni -f {filename}
BEGIN {
css = CServerSocket(7777);
print "(echo server socket created)"
}
## note: default input processing disabled by -ni
$0 = SocketAcceptBlock(css,
SocketInputBlock(sockets,
SocketCloseBlock(css, sockets,
StdinBlock(
Timeout(1000)))));
## note: default action { print } disabled by -ni
# $1 = "SocketAccept", $2 = socket handle
$1 == "SocketAccept" {
socket = SocketAccept($2)
sockets[socket] = 1
}
# $1 = "SocketInput", $2 = socket handle
$1 == "SocketInput" {
## echo server action:
socket = $2
line = SocketRead(socket)
SocketWrite(socket, line)
}
# $1 = "SocketClose", $2 = socket handle
$1 == "SocketClose" {
socket = $2
SocketClose(socket)
delete sockets[socket]
}
## display a . for every second the server is running
$0 == "Timeout" {
printf "."
}
## stdin block is last because StdinGetline writes directly to $0
## $0 == "Stdin"
$0 == "Stdin" {
## broadcast message to all sockets
retcode = StdinGetline()
if (retcode != 1)
exit
for (socket in sockets)
SocketWrite(socket, "From server : " $0)
print "(message sent)"
}
Each extension function used in the script above is covered in some detail below:
The socket block functions return a string of the form "extension-label-prefix OFS parameter", while StdinBlock and Timeout return only "extension-label-prefix".
As stated by the comments, -ni disables stdin processing (as provided by Jawk itself, not the StdinExtension) and the default blank rule of { print }. Disabling stdin processing is essential to extension processing because, otherwise, it would be confusing, if not completely impossible, to multiplex extension blocking with Jawk's default stdin processing. And disabling the default blank rule allows for easy-to-read blocking statements (like the one provided in the sample script) without the weird side effect of printing the result.
Dan: ddaglas at users.sourceforge.net.
Editor's note:
Programmers often take awk "as is", never thinking to use it as a lab in which they can explore other language extensions. An alternate approach is to treat the Awk code base as a reusable library of parsers, regular expression engines, etc., and to make modifications to the language. This second approach is taken in the Awk A* project and, as shown here, in XMLgawk.
IMHO, XMLgawk is one of the most exciting new innovations seen in Gawk for many years. It shows that Awk is more than "just" a text processor: rather it is also a candidate technology for modern XML-based web applications.
Extends standard gawk with built-in XML processing.
Main developers: Jurgen Kahrs and Andrew Schorr.
Conceptual guidance: Manuel Collado.
MS Windows build expert: Victor Paeza.
Contributor of ideas for new features: Peter Saveliev.
XML processing, plus libraries for other extensions to Gawk.
XMLgawk is an experimental extension of the GNU Awk interpreter. It includes a small XML parsing library which is built upon the Expat XML parser. The parsing library is a very thin layer on top of Expat (implementing a pull-interface) and can also be used without GNU Awk to read XML data files.
Both XMLgawk and its XML puller library require only an ANSI C compatible compiler (GCC works, as do most vendors' ANSI C compilers) and a 'make' program.
XMLgawk provides the following functionality including:
3=Released
3=Free/public domain.
November 2003.
April 28, 2009.
After some hard work I seem to be able to build XMLgawk for native Windows :-). Jurgen, Victor and Manuel: thanks for all the tips!
If you're interested, have a look at http://www.wimdows.info/project/xgawk and have fun.
-- Wim van Blitterswijk
AI Programming lab class challenge .
Download from LAWKER. Look at the first line of each file for something that looks like this:
#!/usr/bin/gawk -f
Replace this with the full path to the local version of Gawk.
Ronald Loui (programmer and designer)
Washington University in St. Louis
USA
Text-based game simulation.
Ronald P. Loui
r.p.loui@gmail.com
Ronald Loui writes: Most people are surprised when I tell them what language we use in our undergraduate AI programming class. That's understandable. We use GAWK. GAWK, Gnu's version of Aho, Weinberger, and Kernighan's old pattern scanning language
A repeated observation in this class is that only the scripting programmers can generate code fast enough to keep up with the demands of the class. Even though students were allowed to choose any language they wanted, and many had to unlearn the java ways of doing things in order to benefit from scripting, there were few who could develop ideas into code effectively and rapidly without scripting.
In the puny language, GAWK, which Aho, Weinberger, and Kernighan thought not much more important than grep or sed, I find lessons in AI's trends, AI's history, and the foundations of AI. What I have found not only surprising but also hopeful, is that when I have approached the AI people who still enjoy programming, some of them are not the least bit surprised.
Was written for gawk in 1995 but should run on almost any awk dialect; some css positioning commands will not work in all browsers; try IE6.
Was written on Redhat Linux with multiple hardware platforms in mind.
Intended to be run on close server to minimize delays.
605 lines in main cgi with several small aux control programs.
Minimal compared to development effort, but potentially will require css for new browsers.
Number of person-months since, including enhancements
2=Evaluation.
50 students in artificial intelligence project classes had to use some version of this code over seven years
October 2004
April 2009
Awk-Linux Educational Operating Systems
Teaching operating systems.
Yung-Pin Cheng
ypc@csie.ntnu.edu.tw
Software Engineering Lab. Department of Computer Science and Information Engineering National Taiwan Normal University
TAIWAN
Educators of Operating Systems
Most well-known instructional operating systems are complex, particularly if their companion software is taken into account. It takes considerable time and effort to craft these systems, and their complexity may introduce maintenance and evolution problems. In this project, a courseware called Awk-Linux is proposed. The basic hardware functions provided by Awk-Linux include timer interrupt and page-fault interrupt, which are simulated through program instrumentation over user programs.
A major advantage of using Awk for this tool is platform independence. Awk-Linux can be crafted relatively easily and does not depend on any hardware simulator or platform. Stable Awk versions run on many platforms, so this tool can be readily ported to other machines. The same cannot be said for other, more complex operating systems courseware, which may be much harder to port to new environments.
In practice, using Awk-Linux is very simple for the instructor and students:
Gawk under cygwin or Linux
Windows (CYGWIN required) or Linux
C programming language
Status 3 (Released)
3(Free/public domain)
2004
Yung-Pin Cheng, Janet Mei-Chuen Lin, Awk-Linux: A Lightweight Operating Systems Courseware IEEE Transactions on Education, vol. 51, issue 4, pp. 461-467, 2008.
www.csie.ntnu.edu.tw/~ypc/awklinux.htm
awk [-v Dictionaries="sysdict1 sysdict2 ..."] -f spell.awk -- \
[=suffixfile1 =suffixfile2 ...] [+dict1 +dict2 ...] \
[-strip] [-verbose] [file(s)]
Download from LAWKER.
This program is an example par excellence of the power of awk. Yes, if written in "C", it would run faster. But goodness me, it would be much longer to code. These few lines implement a powerful spell checker, with user-specifiable exception lists. The built-in dictionary is constructed from a list of standard Unix spelling dictionaries, overridable on the command line.
It also offers some tips on how to structure larger-than-ten-line awk programs. In the code below, note the:
(And to write even larger programs, divided into many files, see runawk.)
Dictionaries are simple text files, with one word per line. Unlike those for Unix spell(1), the dictionaries need not be sorted, and there is no dependence on the locale in this program that can affect which exceptions are reported, although the locale can affect their reported order in the exception list. A default list of dictionaries can be supplied via the environment variable DICTIONARIES, but that can be overridden on the command line.
For the purposes of this program, words are located by replacing ASCII control characters, digits, and punctuation (except apostrophe) with ASCII space (32). What remains are the words to be matched against the dictionary lists. Thus, files in ASCII and ISO-8859-n encodings are supported, as well as Unicode files in UTF-8 encoding.
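The word-location step described above can be tried on its own. The following is only a sketch under stated assumptions: split_words is a hypothetical helper name, and it handles ASCII letters only, whereas the real program also keeps the bytes \200 through \377 as word characters.

```sh
# Sketch only: ASCII-only simplification of the word-location step.
# split_words is a hypothetical helper, not part of spell.awk.
split_words() {
    awk '{
        gsub("[^A-Za-z\047]", " ")       # \047 is the apostrophe; all else becomes a space
        for (k = 1; k <= NF; k++) {
            word = $k
            sub("^\047+", "", word)      # strip leading apostrophes
            sub("\047+$", "", word)      # strip trailing apostrophes
            if (word != "")
                print word               # one candidate word per line
        }
    }'
}
```

Piping a line of text into split_words prints each surviving word on its own line; digits and punctuation vanish, while an apostrophe inside a word is kept.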
All word matching is case insensitive (subject to the workings of tolower()).
In this simple version, which is intended to support multiple languages, no attempt is made to strip word suffixes, unless the -strip option is supplied.
Suffixes are defined as regular expressions, and may be supplied from suffix files (one per name) named on the command line, or from an internal default set of English suffixes. Comments in the suffix file run from sharp (#) to end of line. Each suffix regular expression should end with $, to anchor the expression to the end of the word. Each suffix expression may be followed by a list of one or more strings that can replace it, with the special convention that "" represents an empty string. For example:
ies$ ie ies y # flies -> fly, series -> series, ties -> tie
ily$ y ily # happily -> happy, wily -> wily
nnily$ n # funnily -> fun
Although it is permissible to include the suffix in the replacement list, it is not necessary to do so, since words are looked up before suffix stripping.
Suffixes are tested in order of decreasing length, so that the longest matches are tried first.
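To make the suffix-rule format concrete, here is a minimal sketch that applies a single rule to a single word and prints the candidate words that would then be looked up. The helper name apply_rule and its two-argument interface are assumptions made for illustration; the real program loads many rules and tries them longest first.

```sh
# Sketch only: apply_rule WORD 'REGEXP [REPLACEMENT ...]'
# apply_rule is a hypothetical helper, not part of spell.awk.
apply_rule() {
    awk -v word="$1" -v rule="$2" 'BEGIN {
        n = split(rule, parts, " ")
        if (!match(word, parts[1]))
            exit                         # the suffix does not occur in this word
        stem = substr(word, 1, RSTART - 1)
        if (n == 1)
            print stem                   # no replacement list: the bare stem is the candidate
        for (k = 2; k <= n; k++)         # "" in a rule stands for the empty string
            print stem (parts[k] == "\"\"" ? "" : parts[k])
    }'
}
```

For example, apply_rule flies 'ies$ ie ies y' prints the candidates flie, flies and fly, one per line; only fly would be found in a typical dictionary.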
The default output is just a sorted list of unique spelling exceptions, one per line. With the -verbose option, output lines instead take the form
filename:linenumber:exception
Some Unix text editors recognize such lines, and can use them to move quickly to the indicated location.
BEGIN { initialize() }

{ spell_check_line() }

END { report_exceptions() }

function get_dictionaries(    files, key)
{
    if ((Dictionaries == "") && ("DICTIONARIES" in ENVIRON))
        Dictionaries = ENVIRON["DICTIONARIES"]
    if (Dictionaries == "")     # Use default dictionary list
    {
        DictionaryFiles["/usr/dict/words"]++
        DictionaryFiles["/usr/local/share/dict/words.knuth"]++
    }
    else                        # Use system dictionaries from command line
    {
        split(Dictionaries, files)
        for (key in files)
            DictionaryFiles[files[key]]++
    }
}

function initialize()
{
    NonWordChars = "[^" \
        "'" \
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ" \
        "abcdefghijklmnopqrstuvwxyz" \
        "\200\201\202\203\204\205\206\207\210\211\212\213\214\215\216\217" \
        "\220\221\222\223\224\225\226\227\230\231\232\233\234\235\236\237" \
        "\240\241\242\243\244\245\246\247\250\251\252\253\254\255\256\257" \
        "\260\261\262\263\264\265\266\267\270\271\272\273\274\275\276\277" \
        "\300\301\302\303\304\305\306\307\310\311\312\313\314\315\316\317" \
        "\320\321\322\323\324\325\326\327\330\331\332\333\334\335\336\337" \
        "\340\341\342\343\344\345\346\347\350\351\352\353\354\355\356\357" \
        "\360\361\362\363\364\365\366\367\370\371\372\373\374\375\376\377" \
        "]"
    get_dictionaries()
    scan_options()
    load_dictionaries()
    load_suffixes()
    order_suffixes()
}

function load_dictionaries(    file, word)
{
    for (file in DictionaryFiles)
    {
        ## print "DEBUG: Loading dictionary " file > "/dev/stderr"
        while ((getline word < file) > 0)
            Dictionary[tolower(word)]++
        close(file)
    }
}

function load_suffixes(    file, k, line, n, parts)
{
    if (NSuffixFiles > 0)       # load suffix regexps from files
    {
        for (file in SuffixFiles)
        {
            ## print "DEBUG: Loading suffix file " file > "/dev/stderr"
            while ((getline line < file) > 0)
            {
                sub(" *#.*$", "", line)     # strip comments
                sub("^[ \t]+", "", line)    # strip leading whitespace
                sub("[ \t]+$", "", line)    # strip trailing whitespace
                if (line == "")
                    continue
                n = split(line, parts)
                Suffixes[parts[1]]++
                Replacement[parts[1]] = parts[2]
                for (k = 3; k <= n; k++)
                    Replacement[parts[1]] = Replacement[parts[1]] " " parts[k]
            }
            close(file)
        }
    }
    else                        # load default table of English suffix regexps
    {
        split("'$ 's$ ed$ edly$ es$ ing$ ingly$ ly$ s$", parts)
        for (k in parts)
        {
            Suffixes[parts[k]] = 1
            Replacement[parts[k]] = ""
        }
    }
}

function order_suffixes(    i, j, key)
{
    # Order suffixes by decreasing length
    NOrderedSuffix = 0
    for (key in Suffixes)
        OrderedSuffix[++NOrderedSuffix] = key
    for (i = 1; i < NOrderedSuffix; i++)
        for (j = i + 1; j <= NOrderedSuffix; j++)
            if (length(OrderedSuffix[i]) < length(OrderedSuffix[j]))
                swap(OrderedSuffix, i, j)
}

function report_exceptions(    key, sortpipe)
{
    sortpipe = Verbose ? "sort -f -t: -u -k1,1 -k2n,2 -k3" : "sort -f -u -k1"
    for (key in Exception)
        print Exception[key] | sortpipe
    close(sortpipe)
}

function scan_options(    k)
{
    for (k = 1; k < ARGC; k++)
    {
        if (ARGV[k] == "-strip")
        {
            ARGV[k] = ""
            Strip = 1
        }
        else if (ARGV[k] == "-verbose")
        {
            ARGV[k] = ""
            Verbose = 1
        }
        else if (ARGV[k] ~ /^=/)        # suffix file
        {
            NSuffixFiles++
            SuffixFiles[substr(ARGV[k], 2)]++
            ARGV[k] = ""
        }
        else if (ARGV[k] ~ /^[+]/)      # private dictionary
        {
            DictionaryFiles[substr(ARGV[k], 2)]++
            ARGV[k] = ""
        }
    }
    # Remove trailing empty arguments (for nawk)
    while ((ARGC > 0) && (ARGV[ARGC-1] == ""))
        ARGC--
}

function spell_check_line(    k, word)
{
    ## for (k = 1; k <= NF; k++) print "DEBUG: word[" k "] = \"" $k "\""
    gsub(NonWordChars, " ")             # eliminate nonword chars
    for (k = 1; k <= NF; k++)
    {
        word = $k
        sub("^'+", "", word)            # strip leading apostrophes
        sub("'+$", "", word)            # strip trailing apostrophes
        if (word != "")
            spell_check_word(word)
    }
}

function spell_check_word(word,    key, lc_word, location, w, wordlist)
{
    lc_word = tolower(word)
    ## print "DEBUG: spell_check_word(" word ") -> tolower -> " lc_word
    if (lc_word in Dictionary)          # acceptable spelling
        return
    else                                # possible exception
    {
        if (Strip)
        {
            strip_suffixes(lc_word, wordlist)
            ## for (w in wordlist) print "DEBUG: wordlist[" w "]"
            for (w in wordlist)
                if (w in Dictionary)
                    break
            if (w in Dictionary)
                return
        }
        ## print "DEBUG: spell_check():", word
        location = Verbose ? (FILENAME ":" FNR ":") : ""
        if (lc_word in Exception)
            Exception[lc_word] = Exception[lc_word] "\n" location word
        else
            Exception[lc_word] = location word
    }
}

function strip_suffixes(word, wordlist,    ending, k, n, regexp)
{
    ## print "DEBUG: strip_suffixes(" word ")"
    split("", wordlist)
    for (k = 1; k <= NOrderedSuffix; k++)
    {
        regexp = OrderedSuffix[k]
        ## print "DEBUG: strip_suffixes(): Checking \"" regexp "\""
        if (match(word, regexp))
        {
            word = substr(word, 1, RSTART - 1)
            if (Replacement[regexp] == "")
                wordlist[word] = 1
            else
            {
                split(Replacement[regexp], ending)
                for (n in ending)
                {
                    if (ending[n] == "\"\"")
                        ending[n] = ""
                    wordlist[word ending[n]] = 1
                }
            }
            break
        }
    }
    ## for (n in wordlist) print "DEBUG: strip_suffixes() -> \"" n "\""
}

function swap(a, i, j,    temp)
{
    temp = a[i]
    a[i] = a[j]
    a[j] = temp
}
Arnold Robbins and Nelson H.F. Beebe in "Classic Shell Scripting", O'Reilly Books
(For the original version of this code, see http://feedback.exalead.com/feedbacks/191466-spell-checking.)
Peter Norvig of Google describes "How to Write a Spelling Corrector" at
http://norvig.com/spell-correct.html.
He gave a Python solution, and points to a number of other implementations.
I saw one was missing for awk/gawk, so here it is.
It uses the "big.txt" file found at
http://norvig.com/big.txt.
function words(text) {
    # Count every lowercase word in the corpus file.
    # Testing for > 0 ends the loop on end of file (0) and on a read error (-1).
    while ((getline line < text) > 0) {
        line = tolower(line)
        while (match(line, /[a-z]+/)) {
            NWORDS[substr(line, RSTART, RLENGTH)]++
            line = substr(line, RSTART + RLENGTH)
        }
    }
}

BEGIN { words("big.txt") }

BEGIN {
    alph = "abcdefghijklmnopqrstuvwxyz"
    for (i = 1; i <= 26; i++)
        alphabet[substr(alph, i, 1)]++
}

# Fill set with every string one edit away from word
function edits1(word, set) {
    n = length(word)
    delete set
    for (i = 1; i <= n + 1; i++) {
        if (i <= n)                     # deletion
            set[substr(word, 1, i-1) "" substr(word, i+1)]++
        if (i < n)                      # transposition
            set[substr(word, 1, i-1) "" substr(word, i+1, 1) "" substr(word, i, 1) "" substr(word, i+2)]++
        if (i <= n)
            for (c in alphabet)         # alteration
                set[substr(word, 1, i-1) "" c "" substr(word, i+1)]++
        for (c in alphabet)             # insertion
            set[substr(word, 1, i-1) "" c "" substr(word, i)]++
    }
}

# Fill twoChanges with the known words two edits away
function known_edits2(oneChange, twoChanges) {
    delete twoChanges
    for (e2 in oneChange) {
        edits1(e2, set)
        known(set, goods)
        for (w in goods) {
            twoChanges[w] = goods[w]
        }
    }
}

# Copy the words seen in the corpus into knowntable; return how many
function known(words, knowntable) {
    delete knowntable
    found = 0
    for (w in words)
        if (w in NWORDS) {
            found++
            knowntable[w] = NWORDS[w]
        }
    return (found)
}

# Return the key with the largest count
function maxtable(tab) {
    maxval = 0
    for (i in tab) {
        if (tab[i] > maxval) {
            maxval = tab[i]
            max = i
        }
    }
    return (max)
}

function correct(word) {
    delete candidates
    candidates[word] = 1
    if (known(candidates, good)) { }
    else {
        edits1(word, candidates)
        if (known(candidates, good)) { }
        else {
            known_edits2(candidates, candidates2)
            if (known(candidates2, good)) { }
            else {
                delete good
                good[word] = 1
            }
        }
    }
    print maxtable(good)
}

# correct, one word per line
{
    gsub(" ", "")
    correct(tolower($0))
}
Gregory Grefenstette, Nov 24, 2008
Run a WIKI using Gawk.
Download from LAWKER or Wolfgan Zekol's web site.
For a live demo, see the Yawk home page.
Wolfgan Zekol.
Web application.
Wolfgan Zekol.
dag@awk-scripting.de
Yawk is "yet another wiki klone", one among many. Yawk was written because the available wikis were missing some formatting capabilities, used strange formatting rules (and you might not like mine), or imposed too many requirements for understanding a wiki (a MySQL database installation, with or without PHP installed).
Gawk 3.1.4 or later.
CGI
6000 lines.
Status 3=Released.
3=Free/public domain.
2004
2009
Code up a LISP/Scheme interpreter in Awk.
See awklisp.
1
Domain-specific language.
Darius Bacon dairus@wry.me
dairus@wry.me
At my previous job I had to use MapBasic, an interpreter so astoundingly slow (around 100 times slower than GWBASIC) that one must wonder if it itself is implemented in an interpreted language. I still wonder, but it clearly could be: a bare-bones Lisp in awk, hacked up in a few hours, ran substantially faster.
Awk/Gawk
350
1=Prototype
1=Personal use.
1994
2009
Not a single program.
Generate TeX code for a bilingual dictionary from a flat file database. This system has been used to generate multiple editions of dictionaries for several dialects of Carrier, the endangered language of a large portion of the central interior of British Columbia.
Bill Poser
Canada
linguistics - dictionary publishing
Bill Poser
billposer@alum.mit.edu
A dictionary database consists of four flat files containing records in which fields are identified by tags, in a format isomorphic to Standard Dictionary Format. The four files contain: main entries, example sentences with translations, verb roots, and verb stems. This provides a modest degree of relativization. Awk scripts controlled by a makefile do the bulk of the work of generating TeX code for printing dictionaries containing front matter, a Carrier-English section, an English-Carrier section, a topical index, an alphabetical root list, a list of roots sorted by English gloss, an alphabetical list of verb stems, a list of verb stems sorted by root, an alphabetical list of affixes, a list of affixes sorted by English gloss, a list of scientific names, a list of placenames, and credits for illustrations.
gawk
The awk scripts are executed from a make file.
GNU/Linux on x86.
The awk scripts are executed from a makefile by GNU make. The other program used extensively is the sort utility msort.
5500
The first usable version took no more than a day (plus the time to create the TeX template into which the generated code is inserted).
Pure maintenance due to changes in environment, bit rot, etc. has been just about nil. The effort devoted to adding features is very difficult to estimate, as it has taken place at irregular intervals over a period of 15 years.
3=Released, I guess. The code is mature but not really released, since the author is the only one who normally uses it.
1=Personal use
1
June 1993.
A paper describing these databases and the process for generating dictionaries from them is available: Lexical Databases for Carrier
Some information about the resulting dictionaries: http://www.ydli.org/products/dicts.htm
Demonstration to DoD of a clustering algorithm suitable for streaming data.
http://www.cse.wustl.edu/~loui/boris.cgi.
Ronald Loui and a programmer named Boris.
Washington University in St. Louis, CS Dept.
USA
This is an evolutionary algorithm and visualization of a clustering algorithm that could be turned from O(n^4) to O(nlogn) with a few judicious uses of constants. Later developments added other interactive devices, including progress meters and mouse-and-click behavior.
Ronald Loui
r.p.loui@gmail.com
The code is an excellent example of the power of Awk as a prototyping tool: after getting the code running, with the least development time, a quirk was observed in the code that allowed a reduction from O(n^4) to O(nlogn).
Gawk
Intended for fast servers, 1+ GHz.
Html.
158.
One weekend.
None.
2=Evaluation.
2=in-House use.
5
2004.
Feb 2009.
Looks, M.; Levine, A.; Covington, G.A.; Loui, R.P.; Lockwood, J.W.; Cho, Y.H. Streaming Hierarchical Clustering for Concept Mining. 2007 IEEE Aerospace Conference, 3-10 March 2007, pp. 1-12. DOI: 10.1109/AERO.2007.352792
Download videos from youtube.
Peter Krumin: Downloading YouTube Videos With Gawk
World wide web, slurping, file sharing.
Peter Krumin
How to download YouTube videos.
Gawk
331 lines
3=Released
1=Personal use
July 2007
Sat Feb 21 19:46:10 EST 2009
Downloading YouTube Videos With Gawk
This is an Awk 100 program.
Jim Hart
Solve sudoku puzzles using the same strategies as a person would, not by brute force.
Jim Hart
US
Jim Hart
jhart50@gmail.com
see Purpose
gawk
Mac OS X, PowerPC
529
1
0
/2006
An Awk100 program.
Research on a model of negotiation incorporating search, dialogue, and changing expectations
Ronald Loui (programmer and designer), Anne Jump (adversary)
National Science Foundation grant at Washington University in St. Louis
USA
Prototype of a new idea for cognitive modelling (in artificial intelligence/economics/organizational behavior)
Ronald P. Loui
r.p.loui@gmail.com
Program generates a game board upon which players take turn searching or declaring according to a protocol. It is based on the same game bimatrix made famous by people like von Neumann and Nash, but invents a new approach to negotiation based on process instead of solution.
Was written for gawk in 1997 but should run on almost any awk dialect
Was written on Redhat Linux with multiple hardware platforms in mind
Was intended to be self-contained
658 lines, of which 39 are comments
One day, 6-8 hours total
Two revisions are available, mainly to permit programs to negotiate instead of humans, and to provide a web-based dashboard to monitor the events
2=Evaluation
2=in-House use
50 students in artificial intelligence project classes had to use some version of this code over three years
October 1997
January 2008
There is a draft article (unpublished), and several talks, e.g.
The paper in Harper and Wheeler, Probability and Inference: Essays in Honour of Henry E. Kyburg Jr. (Paperback), Publisher: College Publications (23 April 2007) ISBN-10: 1904987184 ISBN-13: 978-1904987185 also refers to the theory implemented here. Diana Moore's thesis on negotiation and draft article http://citeseer.ist.psu.edu/11983.html contains some precursor ideas.
http://www.cs.wustl.edu/~loui/313f97/anne4.expl.html
This is an Awk 100 program.
A quick and dirty baseball simulator for investigating the efficiency of batting lineups
Ronald P. Loui
Washington University in St. Louis
USA
Research/Decision Support
Ronald P. Loui
r.p.loui@gmail.com
This was written for the AI course, and for several investigations, including the determination of whether it is a good idea to bat the pitcher in the 8th spot. One hypothesis that emerges from this program that deserves further study is that the most potent offense is one that spreads rather than concentrates the batting threats.
Gawk around 2002
Linux around 2002
None
409
Approximately one day
Further simulators were developed for improved domain modeling and for successive addition of functionality; no other code maintenance was required.
1=Prototype
1=Personal use
About 50 students used this program over three years in AI classes, and two undergraduate theses and one Master's thesis on evolutionary computing made use of this simulator.
October 2002
January 2009
None, but see Tony LaRussa's comments on batting order while managing the St. Louis Cardinals
An Awk100 program.
A tool inspired by fmt that could be used while working in vi to maintain a multi-column pro-con argument format.
See gawk/awk100/argcol.
Mark Foltz, Ronald Loui, Thieu Dang, Jeremy Frens
Washington University in St. Louis
USA
Application/text support for text editor.
Ronald Loui
r.p.loui@gmail.com
Gawk circa 1994, Solaris and MS-DOS-based awk such as mawk.
Solaris and MS-DOS
Vi and variants such as stevie.
278
One week.
No maintenance, eventually rewritten as cgi/web program in Room5 project.
4=No longer supported
3=Free/public domain
2
May 1994
Jan 2009
Progress on Room 5: a testbed for public interactive semi-formal legal argumentation. Proceedings of the 6th International Conference on Artificial Intelligence and Law, Melbourne, Australia, 1997, pp. 207-214. ISBN 0-89791-924-6
(This page comes from the XML Gawk tutorial.)
One of the advantages of using the XML format for storing data is that there are formalized methods of checking correctness of the data. Whether the data is written by hand or generated automatically, it is always advantageous to have tools for finding out if the new data obeys certain rules (is a tag misspelt? another one missing? a third one in the wrong place?).
These mechanisms for checking correctness are applied at different levels. The lowest level is well-formedness. The next higher levels of correctness check are the level of the DTD and (even higher, but not required yet by standards) the Schema. If you have a DTD (or Schema) specification for your XML file, you can hand it over to a validation tool, which applies the specification, checks for conformance and tells you the result. A simple tool for validation against a DTD is xmllint, which is part of libxml and therefore installed on most GNU/Linux systems. Validation against a Schema can be done with more recent versions of xmllint or with the xsv tool.
There are two reasons why validation is currently not incorporated into the gawk interpreter.
@load xml
END {
    if (XMLERROR)
        printf("XMLERROR '%s' at row %d col %d len %d\n",
               XMLERROR, XMLROW, XMLCOL, XMLLEN)
    else
        print "file is well-formed"
}
As usual, the script starts with switching gawk into XML mode. We are not interested in the content of the nodes being traversed, therefore we have no action to be triggered for a node. Only at the end (when the XML file is already closed) do we look at some variables reporting success or failure. If the variable XMLERROR ever contains anything other than 0 or the empty string, there is an error in parsing and the parser will stop tree traversal at the place where the error is. An explanatory message is contained in XMLERROR (whose contents depend on the specific parser used on this platform). The other variables in the example contain the line number and the column at which the XML file is badly formed.
Jurgen Kahrs
Copyright (C) 2000, 2001, 2002, 2004, 2005, 2006, 2007 Free Software Foundation, Inc.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with the Invariant Sections being "GNU General Public License", the Front-Cover texts being (1) (see below), and with the Back-Cover Texts being (2) (see below).
(This page comes from the XML Gawk tutorial.)
The declaration of a document type in the header of an XML file is an optional part of the data, not a mandatory one. If such a declaration is present, the reference to the DTD will not be resolved and its contents will not be parsed. However, the presence of the declaration will be reported by gawk. When the declaration starts, the variable XMLSTARTDOCT contains the name of the root element's tag; and later, when the declaration ends, the variable XMLENDDOCT is set to 1. In between, the array variable XMLATTR will be populated with the values of the public identifier of the DTD (if any) and the value of the system's identifier of the DTD (if any). Other parts of the declaration (elements, attributes and entities) will not be reported.
@load xml
XMLDECLARATION {
    version    = XMLATTR["VERSION"   ]
    encoding   = XMLATTR["ENCODING"  ]
    standalone = XMLATTR["STANDALONE"]
}
XMLSTARTDOCT {
    root      = XMLSTARTDOCT
    pub_id    = XMLATTR["PUBLIC"         ]
    sys_id    = XMLATTR["SYSTEM"         ]
    intsubset = XMLATTR["INTERNAL_SUBSET"]
}
XMLENDDOCT {
    print FILENAME
    print " version '" version "'"
    print " encoding '" encoding "'"
    print " standalone '" standalone "'"
    print " root id '" root "'"
    print " public id '" pub_id "'"
    print " system id '" sys_id "'"
    print " intsubset '" intsubset "'"
    print ""
    # Reset the remembered values before the next file's DTD arrives
    version = encoding = standalone = ""
    root = pub_id = sys_id = intsubset = ""
}
Most users can safely ignore these variables if they are only interested in the data itself. But some users may take advantage of these variables for checking requirements of the XML data. If your database consists of thousands of XML files of diverse origins, the public identifiers of their DTDs will help you gain an overview of the kind of data you have to handle and of potential version conflicts. The script shown above will assist you in analyzing your data files. It searches for the variables mentioned above and evaluates their content. At the start of the DTD, the tag name of the root element is stored; the identifiers are also stored and finally, those values are printed along with the name of the file which was analyzed. After each DTD, the remembered values are set to an empty string until the DTD of the next file arrives.
In the following, you can see an example output of
the script shown above. Obviously, the first
entry is a DocBook file (English version 4.2) containing a
book element which has to be validated against a local
copy of the DTD at CERN in Switzerland. The second file is a
chapter element of DocBook (English version 4.1.2) to
be validated against a DTD on the Internet. Finally, the third
entry is a file describing a project of the GanttProject application.
There is only a tag name for the root element specified, a DTD
does not seem to exist.
data/dbfile.xml
version ''
encoding ''
standalone ''
root id 'book'
public id '-//OASIS//DTD DocBook XML V4.2//EN'
system id '/afs/cern.ch/sw/XML/XMLBIN/share/www.oasis-open.org/docbook/xmldtd-4.2/docbookx.dtd'
intsubset ''
data/docbook_chapter.xml
version ''
encoding ''
standalone ''
root id 'chapter'
public id '-//OASIS//DTD DocBook XML V4.1.2//EN'
system id 'http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd'
intsubset ''
data/exampleGantt.gan
version '1.0'
encoding 'UTF-8'
standalone ''
root id 'ganttproject.sourceforge.net'
public id ''
system id ''
intsubset ''
You may wish to make changes to this script if you need it in daily work. For example, the script currently reports nothing for files which have no DTD declaration in them. You can easily change this by appending an action for the END rule which reports in case all the variables root, pub_id and sys_id are empty. As it is, the script parses the entire XML file, although the DTD is always positioned at the top, before the root element. Parsing the root element is unnecessary and you can improve the speed of the script significantly if you tell it to stop parsing when the first element (the root element) comes in.
XMLSTARTELEM { nextfile }
Jurgen Kahrs
(This page comes from the XML Gawk tutorial.)
When working with XML files, it is sometimes necessary to gain an overview of the structure of an XML file. Ordinary editors confront us with a view that is not-so-pretty. For example:
<book id="hello-world" lang="en">
  <bookinfo>
    <title>Hello, world</title>
  </bookinfo>
  <chapter id="introduction">
    <title>Introduction</title>
    <para>This is the introduction. It has two sections</para>
    <sect1 id="about-this-book">
      <title>About this book</title>
      <para>This is my first DocBook file.</para>
    </sect1>
    <sect1 id="work-in-progress">
      <title>Warning</title>
      <para>This is still under construction.</para>
    </sect1>
  </chapter>
</book>
Software developers are used to reading text files with proper indentation like this:
book lang='en' id='hello-world'
  bookinfo
    title
  chapter id='introduction'
    title
    para
    sect1 id='about-this-book'
      title
      para
    sect1 id='work-in-progress'
      title
      para
Compared with a graphical tree view, it is a bit harder here to recognize hierarchical dependencies among the nodes. But proper indentation lets you survey files with more than 100 elements (a purely graphical view of such large files gets unbearable).
The outline tool produces such an indented output
and we will now write a script that imitates this kind
of output.
@load xml
XMLSTARTELEM {
    printf("%*s%s", 2*XMLDEPTH-2, "", XMLSTARTELEM)
    for (i=1; i<=NF; i++)
        printf(" %s='%s'", $i, XMLATTR[$i])
    print ""
}
For the first time, we don't
just check if the XMLSTARTELEM variable contains
a tag name, but we also print the name out, properly indented
with a printf format statement (two blank characters
for each indentation level).
Note the use of the
associative
array XMLATTR. Whenever we enter a markup block
(and XMLSTARTELEM is non-empty), the array XMLATTR
contains all the attributes of the tag. You can find out the
value of an attribute by accessing the array with the attribute's
name as an array index. In a well-formed XML file, all the attribute
names of one tag are distinct, so we can be sure that each attribute
has its own place in the array. The only thing that's left to do is
to iterate over all the entries in the array and print name and value
in a formatted way. Earlier versions of this script really iterated
over the associative array with the for (i in XMLATTR)
loop. Doing so is still an option, but in this case we wanted to
make sure that attributes are printed in exactly the same order
that is given in the original XML data. The exact order of attribute
names is reproduced in the fields $1 .. $NF. So the
for loop can iterate over the attribute names in the
fields $1 .. $NF and print the attribute values
XMLATTR[$i].
Jurgen Kahrs
(This page comes from the XML Gawk tutorial.)
In a procedural language, the software developer expects that he himself determines control flow within a program. He writes down what has to be done first, second, third and so on. In the pattern-action model of AWK, the novice software developer often has the oppressive feeling that
This feeling is characteristic for a whole class of programming environments. Most people would never think of the following programming environments to have something in common, but they have. It is the absence of a static control flow which unites these environments under one roof:
With the lex and yacc
tools, the main program only invokes a function yyparse(),
and the exact control flow depends on the input source, which
controls invocation of certain rules.
Within the context of XML, a terminology has been invented which distinguishes the procedural pull style from the event-guided push style. The script in the previous section was an example of a push-style script. Recognizing that most developers don't like their program's control flow to be pushed around, we will now present a script which pulls one item after the other from the XML file and decides what to do next in a more obvious way.
@load xml
BEGIN {
    while (getline > 0) {
        switch (XMLEVENT) {
            case "STARTELEM": {
                printf("%*s%s", 2*XMLDEPTH-2, "", XMLSTARTELEM)
                for (i=1; i<=NF; i++)
                    printf(" %s='%s'", $i, XMLATTR[$i])
                print ""
            }
        }
    }
}
One XML event after the other is pulled out of the data with the getline command. It's like feeling each grain of sand pour through your fingers. Users who prefer this style of reading input will also appreciate another novelty: the variable XMLEVENT. While the push-style script on another page used the event-specific variable XMLSTARTELEM to detect the occurrence of a new XML element, our pull-style script always looks at the value of the same universal variable XMLEVENT to detect a new XML element.
Formally, we have a script that consists of one BEGIN pattern followed by an action which is always invoked. You see, this is a corner case of the pattern-action model, which has been reduced so far that its essence has disappeared. Instead of the patterns you now see the cases of a switch statement, embedded in a while loop (for reading the file item by item).
Obviously, we now have explicit conditionals instead of the implicit ones we used formerly. The actions invoked within the case conditions are the same ones we have seen in the push approach.
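The same contrast can be tried without the XML extension. The following is a minimal sketch in plain awk, not part of the tutorial: a single BEGIN action pulls lines one at a time with getline and dispatches on each line explicitly. The classify() helper and the sample input produced by printf are assumptions made purely for illustration, and an if-chain stands in for switch, which is a gawk extension.

```awk
# Pull style in plain awk: one BEGIN action owns the read loop.
# classify() and the sample data are hypothetical, for illustration only.
function classify(line) {
  if (line ~ /^#/) return "COMMENT"
  if (line == "")  return "BLANK"
  return "TEXT"
}

BEGIN {
  # Stand-in for an input file: three lines produced by a shell command.
  cmd = "printf '# header\\n\\nbody\\n'"
  while ((cmd | getline line) > 0) {
    event = classify(line)
    if (event == "COMMENT")   print "comment: " line
    else if (event == "TEXT") print "text: " line
    # BLANK lines are pulled too, but deliberately ignored
  }
  close(cmd)
}
```

In the push style, the three branches would instead be three pattern-action rules, and awk itself would drive the loop.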
Jurgen Kahrs
Displays components within a set of named XML files. With no options, displays the XML files much like the cat command. When options are supplied, displays only the selected components.
Editor's note: for those who do not want to take the plunge into xgawk, dumpxml shows that standard Awk supports XML. For a discussion of this file, see comp.lang.awk.
xmldump -[cdit] file
This code requires awk and ksh. To download:
wget http://lawker.googlecode.com/svn/fridge/lib/ksh/dumpxml
chmod +x dumpxml
One reason I have a distinct loathing for XML, esp. in configuration files, is it's very difficult to parse (with line-based editors) and it's not very readable either. In my book, this breaks both of the fundamental tests for a useable configuration standard .... whoever first thought XML was a good idea for anything except document mark-up should be shot (steps off soap box before he gets lynched for posting off-topic).
Anyway, personal grievances aside, here's a script I was forced to write, unhappy and at gun-point, to try and make some XML files I was dealing with more readable. This demonstrates how much work it takes in AWK just to parse the structure alone. This doesn't even take into consideration reading attribute values or parsing DTDs.
The next person who thinks it's a good idea to write a configuration file in XML will have to personally answer to my wrath ........ perhaps I should set-up a new website banxml.org or xmlboycott.com with the sole intent to make the world see reason. Anyone with me? :-)
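#!/bin/ksh
CALL=$(basename $0)
USAGE="Syntax: $CALL [-cdit] xmlfile ..."
# Displays selected components of a named XML file. Arguments:
#   $1  display document contents (0 or 1)
#   $2  display tags (0 or 1)
#   $3  display comments (0 or 1)
#   $4  indent output (0 or 1)
#   $5  name of the XML file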
DisplayXML()
{
  nawk -v shdoc=$1 -v shtags=$2 -v shcomm=$3 -v indent=$4 '
  {
    pushline=levhigh=0
    CloseFlags()
    ### If indenting strip any leading blanks from input
    if (indent && !comment) sub("^[ ][ ]*","")
    ### Strip carriage returns
    gsub("\\r","")
    ### Scan line one character at a time
    for (c=1;c<=length($0);c++)
    {
      CloseFlags()
      ReadChars()
      DisplayChars()
    }
    if (newline)
    {
      print ""
      newline=0
    }
  }
  function CloseFlags()
  {
    if (comment==2) comment=0   # close comment
    if (tag==2) tag=0           # close tag
    if (quotes==2) quotes=0     # close quote
  }
  function ReadChars()
  {
    ch=substr($0,c,1)
    if (!comment)
    {
      if (ch=="<" && substr($0,c,4)=="<!--")
      {
        comment=1               # opening comment
        ch=substr($0,c,4)       # stretch chars
        c+=3
      }
      else if (!tag && ch=="<")
      {
        tag=1                   # opening tag
        ### Increase or decrease indent depending
        ### on tag style <tag> or </tag>
        ### but not <?tag?> or <!tag>
        ch2=substr($0,c,2)
        if (ch2=="</") level--
        else if (ch2!="<?" && ch2!="<!")
        {
          level++
          levhigh=1
        }
      }
      else if (tag)
      {
        if (!quotes && ch=="\"") quotes=1      # opening quote
        else if (quotes && ch=="\"") quotes=2  # closing
        else if (!quotes && ch==">")
        {
          tag=2                 # closing tag
          ### Catch <tag/> style where
          ### indent level should not change
          if (c>1 && substr($0,c-1,2)=="/>") level--
        }
      }
    }
    else
    {
      if (ch=="-" && substr($0,c,3)=="-->")
      {
        comment=2               # closing comment
        ch=substr($0,c,3)       # stretch chars
        c+=2
      }
    }
  }
  function DisplayChars()
  {
    ### Work out whether to display this character or not
    dispch=0
    if (comment && shcomm) dispch=1
    if (tag && shtags) dispch=1
    if (!comment && !tag && shdoc) dispch=1
    if (dispch)
    {
      if (indent) IndentLine()
      printf("%s",ch)
      if (!newline) newline=1
    }
  }
  function IndentLine()
  {
    if (pushline || comment) return
    pushline=1
    ### Have begun processing first tag so indent level
    ### may already be one level too high
    if ((thislevel=(levhigh?level-1:level))<0) thislevel=0
    for (lev=0;lev<thislevel;lev++) printf(" ")
  }' "$5"
}
comments=0
doc=0
indent=0
tags=0
help=0
while getopts cdit c
do
  case $c in
    c) comments=1;;
    d) doc=1;;
    i) indent=1;;
    t) tags=1;;
    ?) help=1;;
  esac
done
shift $(($OPTIND - 1))
# Display help message
if [ $help -eq 1 -o $# -eq 0 ]; then
  cat << EOF
Displays components within a set of named XML files.
With no options, displays the XML files much like the cat command.
When options are supplied, displays only the selected components.
$USAGE
where -c displays comments
-d displays document contents
-i indent properly
-t displays tags
EOF
  exit 2
fi
# If no options supplied, then display entire XML files
if [ $comments -eq 0 -a $doc -eq 0 -a $tags -eq 0 ]; then
  comments=1
  doc=1
  tags=1
fi
first=1
while [ $# -gt 0 ]
do
  if [ $first -eq 1 ]; then
    first=0
  else echo " " ### this should be Ctrl+L for a form-feed
  fi
  echo "<!-- --- $1 --- -->"
  DisplayXML $doc $tags $comments $indent "$1"
  shift
done
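The heart of the indentation logic is working out how each tag changes the nesting level. The sketch below isolates that one idea as a small function. It is a simplification written for this page, not taken from dumpxml: depthDelta() is a hypothetical name, and unlike the script it assumes that every tag starts and ends on the same line and that no attribute value contains ">".

```awk
# Net change in nesting depth contributed by one line of XML.
# Hypothetical helper; ignores multi-line tags and ">" inside quotes.
function depthDelta(line,    delta, tag) {
  delta = 0
  while (match(line, /<[^>]*>/)) {
    tag  = substr(line, RSTART, RLENGTH)
    line = substr(line, RSTART + RLENGTH)
    if (tag ~ /^<[?!]/)     continue   # <?...?>, <!...> and comments: no change
    else if (tag ~ /^<\//)  delta--    # closing tag
    else if (tag !~ /\/>$/) delta++    # opening tag, but not <tag/>
  }
  return delta
}

BEGIN {
  print depthDelta("<a><b/>")                             # prints 1
  print depthDelta("</a>")                                # prints -1
  print depthDelta("<?xml version=\"1.0\"?><r><x>t</x>")  # prints 1
}
```

The character-by-character scan in dumpxml exists precisely to cover the cases this shortcut ignores: comments, quoted attribute values, and tags that span several lines.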
Mark R.Bannister <markb at freedomware.co.uk>.
Here is some Awk code from the Rosetta Code wiki that multiplies integers using only addition, doubling, and halving.
For example: 17 X 34
17 34
Halving the first column:
17 34
8
4
2
1
Doubling the second column:
17 34
8 68
4 136
2 272
1 544
Strike-out rows whose first cell is even:
17 34
8 --
4 ---
2 ---
1 544
Sum the remaining numbers in the right-hand column:
17 34
8 --
4 ---
2 ---
1 544
====
578
So 17 multiplied by 34, by the Ethiopian method is 578.
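The task is to define three functions/methods/procedures/subroutines: one to halve an integer, one to double an integer, and one to test whether an integer is even. These are then combined into a function that performs Ethiopian multiplication: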
function halve(x)  { return(int(x/2)) }
function double(x) { return(x*2) }
function iseven(x) { return((x%2) == 0) }
# r is declared as an extra parameter so that it stays local to the function
function ethiopian(plier, plicand,    r) {
  r = 0
  while(plier >= 1) {
    if ( !iseven(plier) ) {
      r += plicand
    }
    plier = halve(plier)
    plicand = double(plicand)
  }
  return(r)
}
BEGIN { print ethiopian(17, 34) }
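To connect the code back to the worked table, here is a traced variant that prints every row and says whether it is kept or struck out. It is an illustrative sketch, not part of the Rosetta Code entry; the name ethiopian_trace() is made up for this example.

```awk
# Traced variant of Ethiopian multiplication (illustrative sketch).
# Prints each row of the halving/doubling table and marks rows that are
# struck out because the left-hand cell is even.
function ethiopian_trace(plier, plicand,    r) {
  r = 0
  while (plier >= 1) {
    if (plier % 2 == 0) {
      printf("%6d %6d  struck\n", plier, plicand)
    } else {
      printf("%6d %6d  kept\n", plier, plicand)
      r += plicand
    }
    plier   = int(plier / 2)
    plicand = plicand * 2
  }
  return r
}

BEGIN { print "product:", ethiopian_trace(17, 34) }
```

For 17 and 34 only the rows starting with 17 and 1 are kept, so the sum is 34 + 544 = 578, matching the table above.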
In the Awk-verse, there are two TAWKs.
TAWK #1 is the TAWK Compiler from Thompson Automation Software (no longer trading).
TAWK #2 was an ultra-cut-down version of AWK written in C++ by Bruce Eckel in 1989. Eckel writes:
gawk -f getXML.awk Download from LAWKER
Main function, read next XML data into XTYPE, XITEM, XATTR
Unescape data and attribute values, used by getXML.
Close xml file
Jan Weber Download
Example
BEGIN {
  while ( getXML(ARGV[1],1) ) {
    print XTYPE, XITEM;
    for (attrName in XATTR)
      print "\t" attrName "=" XATTR[attrName];
  }
  if (XERROR) {
    print XERROR;
    exit 1;
  }
}
Details
getXML( file, skipData ):
External variables:
Returns
Private Data
Code
function getXML( file, skipData \
                 ,end,p,q,tag,att,accu,mline,mode,S0,ex,dtd) {
  XTYPE=XITEM=XERROR=XNODE=""; split("",XATTR);
  S0=_XMLIO[file,"S0"]; XLINE=_XMLIO[file,"line"];
  XPATH=_XMLIO[file,"path"]; dtd=_XMLIO[file,"dtd"];
  while (!XTYPE) {
    if (S0=="") { if (1!=(getline S0 <file)) break; XLINE++; S0=S0 RS; }
    if ( mode == "" ) {
      mline=XLINE; accu=""; p=substr(S0,1,1);
      if ( p!="<" && !(dtd && p=="]") )
        mode="DAT";
      else if ( p=="]" )
        { S0=substr(S0,2); mode="DTE"; end=">"; dtd=0; }
      else if ( substr(S0,1,4)=="<!--" )
        { S0=substr(S0,5); mode="COM"; end="-->"; }
      else if ( substr(S0,1,9)=="<!DOCTYPE" )
        { S0=substr(S0,10); mode="DTB"; end=">"; }
      else if ( substr(S0,1,9)=="<![CDATA[" )
        { S0=substr(S0,10); mode="CDA"; end="]]>"; }
      else if ( substr(S0,1,2)=="<!" )
        { S0=substr(S0,3); mode="DEC"; end=">"; }
      else if ( substr(S0,1,2)=="<?" )
        { S0=substr(S0,3); mode="PIN"; end="?>"; }
      else if ( substr(S0,1,2)=="</" )
        { S0=substr(S0,3); mode="END"; end=">";
          tag=S0;sub(/[ \n\r\t>].*$/,"",tag);
          S0=substr(S0,length(tag)+1);
          ex=XPATH;sub(/\/[^\/]*$/,"",XPATH);
          ex=substr(ex,length(XPATH)+2);
          if (tag!=ex) {
            XERROR="unexpected close tag <" ex ">..</" tag ">";
            break; } }
      else{
        S0=substr(S0,2); mode="TAG";
        tag=S0;sub(/[ \n\r\t\/>].*$/,"",tag);
        S0=substr(S0,length(tag)+1);
        if ( tag !~ /^[A-Za-z:_][0-9A-Za-z:_.-]*$/ ) {
          XERROR="invalid tag name '" tag "'"; break; }
        XPATH = XPATH "/" tag; } }
    else if ( mode == "DAT" ) {
      p=index(S0,"<");
      if ( dtd && (q=index(S0,"]")) && (!p || q<p) ) p=q;
      if (p) {
        if (!skipData) { XTYPE="DAT";
          XITEM=accu unescapeXML(substr(S0,1,p-1)); }
        S0=substr(S0,p); mode=""; }
      else{ if (!skipData) accu=accu unescapeXML(S0); S0=""; } }
    else if ( mode == "TAG" ) {
      sub(/^[ \n\r\t]*/,"",S0); if (S0=="") continue;
      if ( substr(S0,1,2)=="/>" ) {
        S0=substr(S0,3); mode=""; XTYPE="TAG";
        XITEM=tag; S0="</"tag">"S0; }
      else if ( substr(S0,1,1)==">" ) {
        S0=substr(S0,2); mode=""; XTYPE="TAG"; XITEM=tag; }
      else{
        att=S0; sub(/[= \n\r\t\/>].*$/,"",att);
        S0=substr(S0,length(att)+1); mode="ATTR";
        if ( att !~ /^[A-Za-z:_][0-9A-Za-z:_.-]*$/ ) {
          XERROR="invalid attribute name '" att "'";
          break; } } }
    else if ( mode == "ATTR" ) {
      sub(/^[ \n\r\t]*/,"",S0); if (S0=="") continue;
      if ( substr(S0,1,1)=="=" ) { S0=substr(S0,2); mode="EQ"; }
      else { XATTR[att]=att; mode="TAG";
        XNODE=XNODE att"="att"\001"; } }
    else if ( mode == "EQ" ) {
      sub(/^[ \n\r\t]*/,"",S0); if (S0=="") continue;
      end=substr(S0,1,1);
      if ( end=="\"" || end=="'" ) {
        S0=substr(S0,2);accu="";mode="VALUE";}
      else{
        accu=S0; sub(/[ \n\r\t\/>].*$/,"",accu);
        S0=substr(S0,length(accu)+1);
        XATTR[att]=unescapeXML(accu); mode="TAG";
        XNODE=XNODE att"="XATTR[att]"\001"; } }
    else if ( mode == "VALUE" ) { # terminated by end
      if ( p=index(S0,end) ) {
        XATTR[att]=accu unescapeXML(substr(S0,1,p-1));
        XNODE=XNODE att"="XATTR[att]"\001";
        S0=substr(S0,p+length(end)); mode="TAG"; }
      else{ accu=accu unescapeXML(S0); S0=""; } }
    else if ( mode == "DTB" ) { # terminated by "[" or ">"
      if ( (q=index(S0,"[")) && (!(p=index(S0,end)) || q<p ) ) {
        XTYPE=mode; XITEM= accu substr(S0,1,q-1);
        S0=substr(S0,q+1); mode=""; dtd=1; }
      else if ( p=index(S0,end) ) {
        XTYPE=mode; XITEM= accu substr(S0,1,p-1);
        S0="]"substr(S0,p); mode=""; dtd=1; }
      else{ accu=accu S0; S0=""; } }
    else if ( p=index(S0,end) ) { # terminated by end
      XTYPE=mode; XITEM= ( mode=="END" ? tag : accu substr(S0,1,p-1) );
      S0=substr(S0,p+length(end)); mode=""; }
    else{ accu=accu S0; S0=""; } }
  _XMLIO[file,"S0"]=S0; _XMLIO[file,"line"]=XLINE;
  _XMLIO[file,"path"]=XPATH; _XMLIO[file,"dtd"]=dtd;
  if (mode=="DAT") { mode=""; if (accu!="") XTYPE="DAT"; XITEM=accu; }
  if (XTYPE) { XNODE=XTYPE"\001"XITEM"\001"XNODE; return 1; }
  close(file);
  delete _XMLIO[file,"S0"]; delete _XMLIO[file,"line"];
  delete _XMLIO[file,"path"]; delete _XMLIO[file,"dtd"];
  if (XERROR) XERROR=file ":" XLINE ": " XERROR;
  else if (mode) XERROR=file ":" mline ": " "unterminated " mode;
  else if (XPATH) XERROR=file ":" XLINE ": " "unclosed tag(s) " XPATH;
}
function unescapeXML( text ) {
  gsub( "&apos;", "'", text );
  gsub( "&quot;", "\"", text );
  gsub( "&gt;", ">", text );
  gsub( "&lt;", "<", text );
  gsub( "&amp;", "\\&", text );
  return text
}
function closeXML( file ) {
  close(file);
  delete _XMLIO[file,"S0"]; delete _XMLIO[file,"line"];
  delete _XMLIO[file,"path"]; delete _XMLIO[file,"dtd"];
  delete _XMLIO[file,"open"]; delete _XMLIO[file,"IND"];
}
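One detail of the unescaping step deserves a note: the ampersand entity is decoded last. The following sketch, written for this page with two hypothetical helper names, shows what would go wrong with the opposite order. It handles only two entities to keep the example short.

```awk
# Why "&amp;" should be decoded last (sketch; helper names are hypothetical).
function decodeAmpLast(text) {
  gsub("&lt;", "<", text)
  gsub("&amp;", "\\&", text)   # "\\&" yields a literal ampersand
  return text
}
function decodeAmpFirst(text) {
  gsub("&amp;", "\\&", text)
  gsub("&lt;", "<", text)      # now also hits text that was escaped twice
  return text
}

BEGIN {
  s = "&amp;lt;"               # the escaped form of the literal text "&lt;"
  print decodeAmpLast(s)       # prints: &lt;
  print decodeAmpFirst(s)      # prints: <
}
```

Decoding the ampersand first turns already-decoded text into new entities, so the input is effectively unescaped twice.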
Author
(Editor's Note: This page is a mirror of the original web site. It describes a collection of shell/awk/tcl scripts used for modeling complex domains. This code illustrates how language choice is not a matter of "awk" vs "X". Rather, systems can be a menagerie of different languages, including Awk.)
These simulation scripts are also available from LAWKER.
To test the code:
contents.zip cd contents
To use these scripts, you must do the following:
gcc bwcnt2.c -o bwcnt2
gcc bwcnt2a.c -o bwcnt2a
Then, put a copy of "ns" in the current directory, for example:
ln -s ~/vint/ns-2/ns ns
To run the tests:
./single.com
./tfrm12.com
./queue2.com
./increase.com
./reduce.com
./reduce1.com
These scripts are quick amalgams of shell scripts, awk, tcl, and whatever else was handy at the time, so they are not intended as an example of good programming style. They are run in a directory with a "graphs" subdirectory for saved output and *.mf files (gnuplot command files), and an "awk" subdirectory for awk files. Some of these scripts use supporting *.awk files that are available in the awk directory, but are not listed separately below. Some of the scripts (tfrm12.run) also use "bwcnt" C programs for processing output data; the C code for these is in the scripts directory. Possibly one day we will clean this all up to reduce the proliferation of scripts and languages involved.
The implementation of TFRC in the NS simulator is still occasionally being modified, so the precise results of simulations can change with different versions of NS.
Some of these simulations must be run with SBSIZE in scoreboard.h set to 10000 instead of to 1024, to allow larger TCP congestion windows.
The simulation for Figure 2 on "Illustration of the Average Loss Interval" can be run with "contents/single.com", with supporting files "contents/single.run", "contents/single.tcl", and "contents/queueSize.tcl". Generating the postscript file also uses the following files: "contents/graphs/s0.interval.mf", "contents/graphs/s0.loss.mf", and "contents/graphs/s0.rate.mf".
The simulations for Figure 5 on "TCP flow sending rate" can be run with "contents/tfrm-full.CA.DropTail.run" and "contents/tfrm-full.CA.RED.run", with supporting files "contents/tfrm-full.CA.tcl", "contents/queueSize.tcl", and "contents/getmean-full.tcl". These scripts will produce data files called
graphs/s-full-RED.CA.tcpmean
graphs/s-full-DropTail.CA.tcpmean
There are three values for each data point (from three runs) in these output files. To merge them, use "contents/merge2.tcl":
merge2.tcl graphs/graphs/s-full-RED.CA.tcpmean > graphs/s-full-RED.CA.tcp
merge2.tcl graphs/graphs/s-full-DropTail.CA.tcpmean > graphs/s-full-DropTail.CA.tcp
Unfortunately, we no longer have the *.mf gnuplot script for generating the postscript from "s-full-RED.CA.tcp" and "s-full-DropTail.CA.tcp". BTW, on a 450MHz Xeon, each graph takes about 7 hours to generate.
The simulations for Figure 6 can be run with "contents/tfrm12.com", with supporting files "contents/tfrm12.run", "contents/tfrm12.tcl", "contents/awk/plotdrops.awk", and "contents/queueSize.tcl". The supporting programs "bwcnt2" and "bwcnt2a" for processing the output data are compiled from "contents/bwcnt2.c" and "contents/bwcnt2a.c". FYI: On Sally's computer, this simulation set took 13 minutes. The following supporting files were also required for generating the postscript file: "contents/tfrm12.run1", "contents/graphs/getmean.tcl", "contents/graphs/s0.12.mf", and "contents/graphs/s0.loss3.mf".
The simulations for Figure 7 on "Coefficient of variation of throughput between flows" can be run with "contents/tfrmvar.run", with supporting files "contents/tfrmvar.tcl", "contents/queueSize.tcl", and "contents/graphs/getvar.tcl". The script "contents/fixcov.tcl" combines the many output files together, and gnuplot requires "contents/graphs/s3xxx.mf" to generate the postscript.
When we have collected the scripts for Figure 8, we will put them on-line.
The simulations for Figures 9 and 10 can be run with the script "contents/long/doit". The supporting scripts are in the tar file. The simulation takes perhaps one hour.
The simulations for Figures 11-13 can be run with the script "contents/short/doit". The simulation takes up to three days.
The simulations for Figure 14 on 40 long-lived flows can be run with "contents/queue2.com", with supporting files "contents/queue.run", "contents/queue.tcl", "contents/queueSize.tcl", "contents/tracequeue.tcl", "contents/awk/plotaveq.awk", and "contents/awk/plotqueue.awk". Generating the postscript file also uses the following file: "contents/graphs/s0.queue.mf".
Figures 15-18 are from experiments.
The simulations for Figure 19 on "A TFRC flow with an end to congestion" can be run with "contents/increase.com", with supporting files "contents/increase.run", "contents/increase.tcl", "contents/queueSize.tcl", "contents/awk/increase.awk", and "scriptsTR/graphs/s0.packetrate.mf".
The simulations for Figure 20 on "A TFRC flow with persistent congestion" can be run with "contents/reduce.com", with supporting files "contents/reduce.run", "contents/reduce.tcl", "contents/queueSize.tcl", "contents/awk/reduce.awk", and "contents/awk/reduce1.awk". Generating the postscript file also uses the following file: "contents/graphs/s0.rate1.mf".
The simulations for Figure 21 on "Number of round-trip times to reduce the sending rate" can be run with "contents/reduce1.com", with supporting files "contents/reduce1.run", "contents/reduce.tcl", "contents/queueSize.tcl", "contents/awk/reduce1.awk", and "contents/awk/reduce2.awk". Generating the postscript file also uses the following file: "contents/graphs/s0.half.mf".
Jon L. Bentley, Mary F. Fernandez, Brian W. Kernighan, and Norman L. Schryer, ACM Transactions on Mathematical Software, Vol. 19, No. 3, September 1993, Pages 265-287
This paper describes a set of interfaces for numerical subroutines. Typing a short (often one-line) description allows one to solve problems in application domains including least-squares data fitting, differential equations, minimization, root finding, and integration. Our approach of "template-driven programming" makes it easy to build such an interface: a simple one takes a few hours to construct, while a few days suffice to build the most complex program we describe.
It is straightforward to implement this approach on many systems. We have tailored our implementation to our computing environment: our numerical routines are from the Port library, we call the routines from Fortran programs, and our interfaces are implemented in Awk.
An appendix to the paper describes "L2fit". This program performs only the least-squares regression to calculate the parameters; it does not prepare the graphical summary. It is implemented as a 50-line Awk program and a 40-line Fortran template. The complete L2fit is a 330-line Awk program that uses a 45-line Fortran template; it also uses a 60-line Troff and Grap template to produce the output.
From "Intrusion Alert Normalization method using AWK scripts and attack name database", Dongyoung Kim, HyoChan Bang, Jung-Chan Na, The 7th International Conference on Advanced Communication Technology (ICACT 2005), 21-23 Feb. 2005, Vol. 1, pp. 608-611.
The current several classes of intrusion alert have various formats and semantics. And it is transferred using a variety of protocols. The protocols that transfer intrusion alert are IDXP, SNMP trap, SYSLOG protocol, etc. These varieties of intrusion alert formats make it difficult to use that together. Intrusion alert normalization makes various intrusion alert to same structure data and same semantics. We need this normalization process to unify alerts from a variety of security equipments. This paper describes how to normalize alerts from several IDS and security equipments.
Some of the code at awk.info is somewhat historical in nature. For example, Scott Pakin's gender predictor was written in 1991. Given that, it might be mistakenly concluded that Awk is somehow old-fashioned and not suitable for modern tasks.
Text mining, on the other hand, could be the killer app for Awk in the 21st century. The language excels at creating one-off reports that handle the quirks of a particular file format.
There is a growing interest in using Awk for this kind of work. All the examples presented below come from work conducted in 2007, 2008:
If we could properly understand unstructured text, this would be a result of tremendous practical importance. A recent study concluded that:
That is, if we can tame the text mining problem, it would be possible to reason and learn from a much wider range of business data than ever before.
Note that, in the Menzies/Marcus and Schmitt/Christianson tool kits, Awk by itself was not enough. The two data mining toolkits mentioned above were both intricate combinations of Awk, sed, bash, and other tools. Within that combination, Awk was very useful for handling the specifics not managed by the other tools.
Yasumasa Someya describes an entire natural language processing kit, written in Awk, at http://someya-net.com/09-MA/.
In this sense, the toolkit is an excellent example of Awk-in-the-large. Appendix C1 of that documentation lists the Awk programs used in that study. It is a fascinating combination of tiny filters and complex code, which can be combined in multiple ways to result in an intricate analysis:
The Awk file list is shown below.
Num | File | Dated | Description
1 | ad_sp_ed.awk | 980628 | Insert space before the return mark
2 | add.awk | 980820 | Adds all the values contained in $1 through $n respectively.
3 | bun_fre2.awk | 980724 | The main program of "Sentence Profiler (Ver.1)." Print sentence-length profile table and graph.
4 | bun_fre4.awk | 980730 | Revised version of "bun_fre2.awk"
5 | cnt_freq.awk | | Counts the number of each tag sequence and produces a list of modal verb-structures with frequency information.
6 | capital.awk | 980622 | Prints text lines beginning with a capital letter (for extracting proper nouns from a wordlist).
7 | chikan.awk | 980814 | Compares an input file and a specified dictionary. If the words in $1 of the dictionary match words in the input file, the latter will be replaced with the $2 data in the former. (See "fmatch.awk").
8 | cleantag.awk | 980818 | Cleans up a file tagged with the Brill Tagger, and replaces the default slash symbol (/) with the underbar (_).
9 | countme.awk | 981002 | Counts the number of words in a text, either as type or token.
10 | del_hyph.awk | 971117 | Deletes line-end hyphens.
11 | del_nbr.awk | 980623 | Deletes line-initial numbers and symbols.
12 | del_null.awk | 980205 | Deletes excess blank lines, leaving only one blank line.
13 | del_rtn.awk | 980518 | Deletes the return mark at the end of each record
14 | del{_}.awk | 981007 | Deletes the idiom mark from the output of "dmfreq.awk".
15 | delblank.awk | 980601 | Deletes all blank lines.
16 | delkigou.awk | 980721 | Deletes all symbols and marks in $2.
17 | delslash.awk | 970831 | Replaces the slash with a space.
18 | ex_there.awk | 980628 | Extract all the "Ex-There" constructions.
19 | f1_del.awk | 980417 | Print all the data except those in $1.
20 | fmatch.awk | 980814 | The main program of "Collocation and Idiom Finder (Ver.1)." Marks all the matched strings in the format of "{ idiom }_IDM."
21 | hv_vbn.awk | 980628 | Extracts all the present perfect constructions from a tagged corpus.
22 | ichigyo.awk | 980205 | Same as "del_null.awk"
23 | idmfreq.awk | 981007 | Produces a frequency comparison table of specified collocations and idioms. Used as part of Collocation and Idiom Finder (Ver.1).
24 | if$2none.awk | 980821 | Prints records whose $2 is not blank.
25 | if_md.awk | 980628 | Extracts all the IF+MD constructions from a tagged corpus.
26 | JJ.awk | 980912 | Extracts all the adjectives from a tagged wordlist.
27 | kaihi-1.awk | 980523 | Prints the data as is, except for those marked by #.
28 | kaihi-2.awk | 980730 | Prints the data as is if marked by #. If not, adds sentence ID numbers before printing.
29 | karamoji.awk | 980417 | Deletes sentence-initial space.
30 | kensaku.awk | 980201 | Regular expression search from the command line.
31 | l_sp_del.awk | 971004 | Deletes excess line-initial space.
32 | line_nbr.awk | 980518 | Adds sentence numbers.
33 | makeline.awk | 980201 | Inserts a return code at the end of sentence-initial punctuation marks and symbols, except at specified abbreviations (used in conjunction with "txt_id.awk").
34 | matching.awk | 980620 | Replaces each entry word in the input file with a corresponding WL tag as defined in the WL-tag dictionary file. Non-match strings are printed as is (used as part of "Word Level Checker").
35 | matchnew.awk | 980825 | Replaces each entry word in the input file with a corresponding WL tag as defined in the WL-tag dictionary file (See endnote 8, Chapter 3).
36 | merge0.awk | 980623 | Merges two wordlists (Add FILE1 to FILE2, and prints FILE3, for $0).
37 | merge1.awk | 980703 | Merges two wordlists (Add FILE1 to FILE2, and prints FILE3, for $1).
38 | nandoprn.awk | 980624 | Sorts and prints the results of "matching.awk" (used as part of "wlc.bat" and "w_nando.bat").
39 | NN.awk | 980915 | Extracts all the words with NN tags.
40 | non_cap.awk | 980624 | Prints all lines starting with a lower case letter (for extracting data other than proper nouns from a wordlist).
41 | open_con.awk | 980529 | Opens contractions (e.g. I'm, we'd, we'll, couldn't, etc.). Used before executing the Brill Tagger.
42 | predcnt1.awk | 980725 | Counts the number of predications (mentioned in End note 19, Chapter 2)
43 | prn_!tag.awk | 980825 | Prints text data only from a POS-tagged text.
44 | prn_tag.awk | 980601 | Extracts POS tag data from a tagged text, and prints them onto a separate file (See Endnote 6, Chapter 2).
45 | prn{_}.awk | 981007 | Prints lines that include strings marked "{…}_IDM" (used as part of "fmatch1.bat" and "fmatch2.bat").
46 | prn_MD.awk | 980601 | Extracts all MD tags from a tagged text, and prints them onto a separate file (See Endnote 30, Chapter 4).
47 | r_sp_del.awk | 980207 | Deletes space between the return code and the last word of each sentence.
48 | RB.awk | 980912 | Extracts all the words with RB tags.
49 | rtn@}.awk | 981007 | Inserts a return code at the X mark to the output of "prn{_}.awk".
50 | sentence.awk | 980718 | Main program of "Sentence Profiler (Ver.1)." Counts the numbers of words and sentences, and the average number of words per sentence, and prints the result.
51 | shiage.awk | 980107 | Deletes unnecessary data from the output of "prn{_}.awk to rtn@}.awk" and prints the result after sorting.
52 | sp_kigou.awk | 980523 | Adds space before and/or after specified punctuation marks and symbols (used in conjunction with the Brill Tagger).
53 | tagme.awk | 980623 | Experimental tagging program.
54 | TagToSyn.awk | 980720 | Extract syntactic information from POS tag data.
55 | tokei.awk | 980801 | Calculates sum, mean, variance, SD, dispersion and usage.
56 | txt_id.awk | 980623 | Adds "Sentence ID and Number" to a plain running text.
57 | VB.awk | 980830 | Extracts all the words with VB tags (See Endnote 25, Chapter 3).
58 | voc_lev1.awk | 980823 | Processes input data for "voc_lev2.awk"
59 | voc_lev2.awk | 980823 | Prints the results of "matching.awk to nandoprn.awk" with a graph and a table (used as part of "wlc.bat").
60 | mk_list.awk | 980801 | A multi-function wordlist compiler, mk_list.awk. Mentioned in Endnote 14, Chapter 3. See Appendix C2 for program source.
61 | word.awk | 980828 | Produces a simple wordlist from a plain running text file.
62 | wordlist.awk | 980718 | Produces a simple wordlist with frequency information from a plain running text file.
63 | wrdlevel.awk | 980623 | Replaces entries in a wordlist with WL tags.
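To give a feel for how small such filters can be, here is a sketch of a word-frequency counter in the spirit of the wordlist entries above. It is not Someya's code: countWords() is a hypothetical helper written for this page, and it assumes plain English text in which anything other than the letters a to z separates words.

```awk
# Sketch of a tiny word-frequency filter (not taken from the toolkit).
# countWords() adds the words of one line of text to the freq array.
function countWords(text, freq,    n, i, words, w) {
  n = split(tolower(text), words, /[^a-z]+/)
  for (i = 1; i <= n; i++) {
    w = words[i]
    if (w != "") freq[w]++       # skip empty pieces at the line edges
  }
}

BEGIN {
  countWords("The cat saw the other cat.", freq)
  print freq["the"], freq["cat"], freq["saw"]   # prints: 2 2 1
}
```

To use it as a real filter, one could replace the BEGIN block with a rule such as `{ countWords($0, freq) }` and an END block that prints every entry of freq, then pipe the output through sort.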
Lothar M. Schmitt and Kiel T. Christianson:
Their notes include a short introduction to programming the Bourne-shell and rather short, but complete descriptions of sed and awk customized in regard to language analysis.
Tim Menzies and Andrian Marcus:
Severis is a set of Awk, bash, sed, etc scripts for finding predictors of high severity issues in text reports. Test engineers write such issue reports whenever they encounter anomalies in the code they are inspecting.
Severis was designed to be an audit tool for test engineers, a second "look over the shoulder" to alert a senior engineer if a junior test engineer was doing something strange.
At least for the text issue reports studied by Severis, very simple tools were enough to determine the terms that predict different issue severities.
Donald 'Paddy' McCarthy reports an interesting comparison of Awk vs Perl vs Python for doing some text pre-processing.
The example shows off Awk's ability to quickly prototype a one-off specialized report for a particular data format.
It also offers some comment on the language wars between Awk and <insert your favorite scripting language here>: there is no evidence in the following code that dear old-fashioned Awk is more complex or arcane or slower than more recent, supposedly better, languages.
<string:date> [ <float:data-n> <int:flag-n> ]*24
e.g.
1991-03-31 10.000 1 10.000 1 ... 20.000 1 35.000 1
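A record in the format above can be sanity-checked before any statistics are computed. The following is a minimal sketch under stated assumptions: validRecord() is a hypothetical helper that is not part of McCarthy's program, and it assumes dates of the form YYYY-MM-DD and whitespace-separated fields.

```awk
# Sketch: check that a record matches "<date> [<value> <flag>]*24".
# validRecord() is a hypothetical helper, not part of the original program.
function validRecord(line,    f, n, i) {
  n = split(line, f, " ")
  if (n != 1 + 2 * 24) return 0              # date plus 24 value/flag pairs
  if (f[1] !~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/) return 0
  for (i = 3; i <= n; i += 2)
    if (f[i] !~ /^-?[0-9]+$/) return 0       # every flag must be an integer
  return 1
}

BEGIN {
  rec = "1991-03-31"
  for (i = 1; i <= 24; i++) rec = rec " 10.000 1"
  print validRecord(rec)                     # prints: 1
  print validRecord("1991-03-31 10.000 1")   # prints: 0
}
```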
The awk example:
# Author Donald 'Paddy' McCarthy Jan 01 2007
BEGIN{
  nodata = 0; # Current run of consecutive flags < 0 in lines of file
  nodata_max=-1; # Max consecutive flags < 0 in lines of file
  nodata_maxline="!"; # ... and line number(s) where it occurs
}
FNR==1 {
  # Accumulate input file names
  if(infiles){
    infiles = infiles "," FILENAME
  } else {
    infiles = FILENAME
  }
}
{
  tot_line=0; # sum of line data
  num_line=0; # number of line data items with flag>0
  # extract field info, skipping initial date field
  for(field=2; field<=NF; field+=2){
    datum=$field;
    flag=$(field+1);
    if(flag < 1){
      nodata++
    }else{
      # check run of data-absent fields
      if(nodata_max==nodata && (nodata>0)){
        nodata_maxline=nodata_maxline ", " $1
      }
      if(nodata_max < nodata && (nodata>0)){
        nodata_max=nodata
        nodata_maxline=$1
      }
      # re-initialise run of nodata counter
      nodata=0;
      # gather values for averaging
      tot_line+=datum
      num_line++;
    }
  }
  # totals for the file so far
  tot_file += tot_line
  num_file += num_line
  printf "Line: %11s Reject: %2i Accept: %2i Line_tot: %10.3f Line_avg: %10.3f\n", \
    $1, ((NF -1)/2) -num_line, num_line, tot_line, (num_line>0)? tot_line/num_line: 0
  # debug prints of original data plus some of the computed values
  #printf "%s %15.3g %4i\n", $0, tot_line, num_line
  #printf "%s\n %15.3f %4i %4i %4i %s\n", $0, tot_line, num_line, nodata, nodata_max, nodata_maxline
}
END{
  printf "\n"
  printf "File(s) = %s\n", infiles
  printf "Total = %10.3f\n", tot_file
  printf "Readings = %6i\n", num_file
  printf "Average = %10.3f\n", tot_file / num_file
  printf "\nMaximum run(s) of %i consecutive false readings ends at line starting with date(s): %s\n", nodata_max, nodata_maxline
}
The same functionality in perl is very similar to the awk program:
# Author Donald 'Paddy' McCarthy Jan 01 2007
BEGIN {
  $nodata = 0; # Current run of consecutive flags < 0 in lines of file
  $nodata_max=-1; # Max consecutive flags < 0 in lines of file
  $nodata_maxline="!"; # ... and line number(s) where it occurs
}
foreach (@ARGV) {
  # Accumulate input file names
  if($infiles ne ""){
    $infiles = "$infiles, $_";
  } else {
    $infiles = $_;
  }
}
while (<>){
  $tot_line=0; # sum of line data
  $num_line=0; # number of line data items with flag>0
  # extract field info, skipping initial date field
  chomp;
  @fields = split(/\s+/);
  $nf = @fields;
  $date = $fields[0];
  for($field=1; $field < $nf; $field+=2){
    $datum = $fields[$field] +0.0;
    $flag = $fields[$field+1] +0;
    if(($flag+1 < 2)){
      $nodata++;
    }else{
      # check run of data-absent fields
      if($nodata_max==$nodata and ($nodata>0)){
        $nodata_maxline = "$nodata_maxline, $fields[0]";
      }
      if($nodata_max < $nodata and ($nodata>0)){
        $nodata_max = $nodata;
        $nodata_maxline=$fields[0];
      }
      # re-initialise run of nodata counter
      $nodata = 0;
      # gather values for averaging
      $tot_line += $datum;
      $num_line++;
    }
  }
  # totals for the file so far
  $tot_file += $tot_line;
  $num_file += $num_line;
  printf "Line: %11s Reject: %2i Accept: %2i Line_tot: %10.3f Line_avg: %10.3f\n",
    $date, (($nf -1)/2) -$num_line, $num_line, $tot_line, ($num_line>0)? $tot_line/$num_line: 0;
}
printf "\n";
printf "File(s) = %s\n", $infiles;
printf "Total = %10.3f\n", $tot_file;
printf "Readings = %6i\n", $num_file;
printf "Average = %10.3f\n", $tot_file / $num_file;
printf "\nMaximum run(s) of %i consecutive false readings ends at line starting with date(s): %s\n",
  $nodata_max, $nodata_maxline;
The python program however splits the fields in the line slightly differently (although it could use the method used in the perl and awk programs too):
# Author Donald 'Paddy' McCarthy Jan 01 2007
import fileinput
import sys
nodata = 0           # Current run of consecutive flags < 0 in lines of file
nodata_max = -1      # Max consecutive flags < 0 in lines of file
nodata_maxline = []  # ... and line number(s) where it occurs
tot_file = 0         # Sum of file data
num_file = 0         # Number of file data items with flag>0
infiles = sys.argv[1:]
for line in fileinput.input():
    tot_line = 0     # sum of line data
    num_line = 0     # number of line data items with flag>0
    # extract field info
    field = line.split()
    date = field[0]
    data = [float(f) for f in field[1::2]]
    flags = [int(f) for f in field[2::2]]
    for datum, flag in zip(data, flags):
        if flag < 1:
            nodata += 1
        else:
            # check run of data-absent fields
            if nodata_max == nodata and nodata > 0:
                nodata_maxline.append(date)
            if nodata_max < nodata and nodata > 0:
                nodata_max = nodata
                nodata_maxline = [date]
            # re-initialise run of nodata counter
            nodata = 0
            # gather values for averaging
            tot_line += datum
            num_line += 1
    # totals for the file so far
    tot_file += tot_line
    num_file += num_line
    print("Line: %11s Reject: %2i Accept: %2i Line_tot: %10.3f Line_avg: %10.3f" % (
        date,
        len(data) - num_line,
        num_line, tot_line,
        tot_line / num_line if (num_line > 0) else 0))
print("")
print("File(s) = %s" % (", ".join(infiles),))
print("Total = %10.3f" % (tot_file,))
print("Readings = %6i" % (num_file,))
print("Average = %10.3f" % (tot_file / num_file,))
print("\nMaximum run(s) of %i consecutive false readings ends at line starting with date(s): %s" % (
    nodata_max, ", ".join(nodata_maxline)))
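To make the "slightly different" field splitting concrete, here is a small sketch of the slicing idea used in the Python program: every odd-numbered field after the date is a reading, and every even-numbered field is its flag. The function name parse_line and the sample line are only illustrations, not part of the original program.

```python
def parse_line(line):
    """Split one record into its date and a list of (datum, flag) pairs."""
    fields = line.split()
    date = fields[0]
    # fields[1], fields[3], ... are readings; fields[2], fields[4], ... are flags
    data = [float(f) for f in fields[1::2]]
    flags = [int(f) for f in fields[2::2]]
    return date, list(zip(data, flags))


if __name__ == "__main__":
    # Hypothetical input line: a date followed by reading/flag pairs
    print(parse_line("1991-03-31 10.0 1 20.0 -2"))
```

The awk and Perl programs reach the same pairs by stepping an index two fields at a time; the slices simply do that stepping in one expression.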
You know the song:
99 bottles of beer on the wall, 99 bottles of beer. Take one down and pass it around, 98 bottles of beer on the wall.
98 bottles of beer on the wall, 98 bottles of beer. Take one down and pass it around, 97 bottles of beer on the wall.
97 bottles of beer on the wall, 97 bottles of beer. Take one down and pass it around, 96 bottles of beer on the wall.
....
But how do you code it? Here's Wilhelm Weske's version. It is kind of fun, but it's a little hard to read:
#!/usr/bin/awk -f
BEGIN{
split( \
"no mo"\
"rexxN"\
"o mor"\
"exsxx"\
"Take "\
"one dow"\
"n and pas"\
"s it around"\
", xGo to the "\
"store and buy s"\
"ome more, x bot"\
"tlex of beerx o"\
"n the wall" , s,\
"x"); for( i=99 ;\
i>=0; i--){ s[0]=\
s[2] = i ; print \
s[2 + !(i) ] s[8]\
s[4+ !(i-1)] s[9]\
s[10]", " s[!(i)]\
s[8] s[4+ !(i-1)]\
s[9]".";i?s[0]--:\
s[0] = 99; print \
s[6+!i]s[!(s[0])]\
s[8] s[4 +!(i-2)]\
s[9]s[10] ".\n";}}
Osamu Aoki has a more maintainable version. Note how all the screen I/O is localized via functions that return strings, rather than printing straight to the screen. This is very useful for maintenance purposes or for including code as libraries in other Awk programs.
BEGIN {
for(i = 99; i >= 0; i--) {
print ubottle(i), "on the wall,", lbottle(i) "."
print action(i), lbottle(inext(i)), "on the wall."
print
}
}
function ubottle(n) {
return \
sprintf("%s bottle%s of beer", n ? n : "No more", n - 1 ? "s" : "")
}
function lbottle(n) {
return \
sprintf("%s bottle%s of beer", n ? n : "no more", n - 1 ? "s" : "")
}
function action(n) {
return \
sprintf("%s", n ? "Take one down and pass it around," : \
"Go to the store and buy some more,")
}
function inext(n) {
return n ? n - 1 : 99
}
Osamu's version is very similar to how it'd be done in C or other languages and it does not take full advantage of Awk's features. So Arnold Robbins wrote a third version that is more data driven. Most of the work is done in a pre-processor and the actual runtime just dumps text decided before the run. This solution might take more time (to do the setup) but it does allow for the simple switching of the interface (just change the last 10 lines).
BEGIN {
# Setup
take = "Take one down, pass it around"
buy = "Go to the store and buy some more"
Instruction[0] = buy
Next[0] = 99
Count[0, 1] = "No more"
Count[0, 0] = "no more"
for (i = 99; i >= 1; i--) {
Instruction[i] = take
Next[i] = i - 1
Count[i, 0] = Count[i, 1] = (i "")
Bottles[i] = "bottles"
}
Bottles[1] = "bottle"
Bottles[0] = "bottles"
# Execution
for (i = 99; i >= 0; i--) {
printf("%s %s of beer on the wall, %s %s of beer.\n",
Count[i, 1],
Bottles[i],
Count[i, 0],
Bottles[i])
printf("%s, %s %s of beer on the wall.\n\n",
Instruction[i],
Count[Next[i], 0],
Bottles[Next[i]])
}
}
I'll drink to that.
These pages focused on using Awk to implement filters on Unix mail files.
http://www.tc.umn.edu/~hause011/article/Statistical_spam_filter.html
I now use a "Statistical Spam Filter". Wow, the scummy sewer of internet mail is cleansed, refreshed and usable again. Just using the delete button was getting too difficult: I got 8 to 10 spam for every good piece of mail. As a spam detector I am not as good a filter as you might think; just the subject and address are not always enough. An anti-spam tool I am not, and I would occasionally open a spam, to my great annoyance.
My filter was inspired by Paul Graham's article about a Naive Bayesian spam filter, "A Plan for Spam". He basically says that you get statistics on how often tokens show up in two bodies of mail (spam and good), and then calculate a statistical value that a single mail is spam by looking at the tokens in it. The more mail in the good and spam mail bodies, the better the filter is "trained". Jeez, he made it sound so easy. And it is. I slapped an anti-spam tool together as a ksh and awk script for use as a personal filter on a Unix type system. To implement it I put it in the ~/.forward file. The code is at the bottom of the article, less than 100 lines. The total code for the filter and training script is less than 200 lines.
This filter differs in lots of ways from the Paul Graham article. I took out some of the biases he describes and simplified it, maybe it is too simple. What I find most interesting is that the differences do not seem to matter much, I still filter out 96+% of spams. I got those results with a spam sample that is at least 500 emails and a good email sample that is at least 700 emails. With smaller training samples or a different mail mix it may not get as good results, or it may be better. Note: I later changed the training body to be more like the proportion of real spam to good mail, which is much more spam than good mail, about 8-10 spam to every good mail received and the anti-spam tool worked better.
First I run the training script on two bodies of mail, ~/Mail/received (good mail) and ~/Mail/junk (saved spam mail.) The ~/Mail/received file is already created on my unix box and holds mail that I have read and not deleted. The training script finds all the tokens in the emails and gives them a probability depending on how the token is found in the "spam mail" and the "good mail". The training script also creates the whitelist of addresses from the "good mail." As the mail flows through the system the training script will then "learn" each time it is run.
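As a rough illustration of the training step, the sketch below mirrors the arithmetic of the training script listed further down: count tokens in each body of mail, keep those seen more than twice, and give each token a spam probability. It is a Python paraphrase under my own naming (token_counts, spaminess), not the author's code; the 0.01 and 0.99 defaults are the ones the script uses for tokens seen in only one body.

```python
from collections import Counter


def token_counts(text, min_count=3):
    """Count whitespace-separated tokens, keeping those seen at least min_count times."""
    counts = Counter(text.split())
    return {tok: n for tok, n in counts.items() if n >= min_count}


def spaminess(good_counts, spam_counts, good_total, spam_total):
    """Map each token to a spam probability.

    good_total and spam_total are the total token counts of the two mail
    bodies, used to turn raw counts into rates.
    """
    probs = {}
    for tok, good_n in good_counts.items():
        if tok in spam_counts:
            spam_rate = spam_counts[tok] / spam_total
            good_rate = good_n / good_total
            probs[tok] = spam_rate / (good_rate + spam_rate)
        else:
            probs[tok] = 0.01  # seen only in good mail
    for tok in spam_counts:
        if tok not in good_counts:
            probs[tok] = 0.99  # seen only in spam
    return probs
```

Dividing by the size of each mail body is what keeps a large spam archive from making every shared token look spammy.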
I run the actual spam filter script from the .forward file which allows a user to process mail before it hits your inbox. (Look up "man forward" at the shell prompt for further information on the .forward file.) The script first checks the whitelist for a good address, if it is found it passes the filter. If the address is not found it is passed to the statistical spam filter, the tokens are checked and the email is given a spaminess value. Above a certain value the email is classified as spam and put in the ~/Mail/junk file, below the value it passes to /var/spool/mail/mylogin where I read it as god intended email to be read, with a creaky old unix client. However, I can still read it with any other client I want, POP or IMAP.
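The scoring step can be sketched in a few lines. This is a simplified Python paraphrase of what the filter script further down does: only tokens with a strongly good or strongly bad probability are counted, their values are averaged, and the mail is called spam if the average is above a cutoff. The names spam_score and is_spam are mine, and the thresholds (0.06, 0.94, 0.50) are taken from that script.

```python
def spam_score(message, probs, low=0.06, high=0.94):
    """Average the spam probability of the 'decisive' tokens in a message."""
    total = 0.0
    count = 0
    for tok in message.split():
        p = probs.get(tok)
        # Ignore unknown tokens and tokens that say little either way
        if p is not None and (p <= low or p >= high):
            total += p
            count += 1
    # Assumption: a message with no decisive tokens scores 0 (treated as good)
    return total / count if count else 0.0


def is_spam(message, probs, cutoff=0.50):
    return spam_score(message, probs) > cutoff
```

The whitelist check happens before any of this, so mail from a known good address never pays the cost of loading the probability table.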
I included a little test script below that I used to check my results. I just split emails into files and run them one at a time and check the value the filter gives.
Testing on email that has been used to train the filter gives results that are very good but not valid, so I tested on email not seen by the training script. The filter does get much better at filtering as the training sample gets bigger, just like the other statistical spam filters. For example, at lower sample sizes (trained with 209 good mails and 301 spam mails) the filter was pretty bad. When the average spam value cutoff was raised to .51 so that no good mail was blocked, 44% of the spam passed through on a test set of 320 spam and 683 good emails. Even so, that means 56% of the spam was blocked. Small sample sizes are not perfect, but they are usable, and I began using the mail filter with a sample set of about 600 good mails and 300 spam. As the training sample increased, the results improved. As I changed the mail mix to reflect the real spam proportions it got even better, with around 96-98% of the spam blocked. I think the lower early results were because of the proportions of spam to good email; they should reflect the real proportion received on the system used by the filter.
Paul Graham or others may have superior filters and better mathematics for anti-spam algorithms but I am not sure that it matters all that much, the amount of spam that gets through is small enough not to bother me.
I used gawk in the filter and checked it with the gawk profiler to look for performance problems. The largest performance constraint is creating the spam-probability associative array in memory, the key-value pairs of tokens and the spam value I assign to them. Creating this associative array is more than 95% of the current time to process an email through the filter, and it gets worse when the set of tokens gets larger. Perl and other language users can get around this performance problem with DBM file interfaces, currently not available to my gawk filter.
I added a "whitelist" of good email addresses, a feature that helps keep good email from a bad classification and improves performance by a huge amount (at least a factor of 100) by not having to further filter the message. The whitelist is not one of the "challenge-response" things that annoy me so much that I toss any such email away; it simply learns from the email used to train the filter, saving addresses from email that has passed the filter and landed in my "received" file. I figure that if I receive a good email from someone, chances are 100% that I want to receive email from that address. Note there is a place in the whitelist script to get rid of commonly forged email addresses, like your own address.
The main concept put forward by Paul Graham holds true and seems ungodly robust: applying statistics to filter spam works very well compared to lame rule sets and black lists. My program just proves the robustness of the solution; apparently any half-baked formula (like what I used) seems to work as long as the base probability of the tokens is computed.
Here are some of the many differences between this filter and the filter in the Paul Graham article in no particular order of importance:
Note: Do not test mail that has been used to train the filter, test mail not seen by the training program.
#!/bin/ksh
filter_test () {
# Split a file of unix email into many mail files with this:
cat ~/Mail/rece* |csplit -k -f good -n 4 - '/^From /' {900}
# Run a modified filter that displays the spam value for each mail file.
# I just commented out the last part of the filter and added a
# print statement of the Subject line and spam value the filter found.
for I in test/good*
do
cat $I | [filter_program-that_shows_the_value_only]
done | sort -n
}
Call from the command line or in a crontab file.
#!/bin/ksh
number_of_tokens (){
zcat $1 | cat $2 - | wc -w
}
# Note: Get rid of addresses that are commonly forged at the
# "My-Own-Address" string.
address_white_list (){
zcat $1 |
cat $2 - |
egrep '^From |^Return-Path: ' |
nawk '{print tolower($2)}'|
nawk '{gsub ("<",""); gsub (">","");print;}'|
grep -v 'My-Own-Address'|
sort -u > ~/Mail/address_whitelist
}
# Create a hash with probability of spaminess per token.
# Words only in good hash get .01, words only in spam hash get .99
spaminess () {
nawk 'BEGIN {goodnum=ENVIRON["GOODNUM"]; junknum=ENVIRON["JUNKNUM"];}
FILENAME ~ "spamwordfrequency" {bad_hash[$1]=$2}
FILENAME ~ "goodwordfrequency" {good_hash[$1]=$2}
END {
for (word in good_hash) {
if (word in bad_hash) { print word,
(bad_hash[word]/junknum)/ \
((good_hash[word]/goodnum)+(bad_hash[word]/junknum)) }
else { print word, "0.01"}
}
for (word in bad_hash) {
if (word in good_hash) { done="already"}
else { print word, "0.99"}
}}' ~/Mail/spamwordfrequency ~/Mail/goodwordfrequency
}
# Print list of word frequencies
frequency (){
nawk ' { for (i = 1; i <= NF; i++)
freq[$i]++ }
END {
for (word in freq){
if (freq[word] > 2) {
printf "%s\t%d\n", word, freq[word];
}
}
}'
}
# Note: I store the email in compressed files to keep my storage space small,
# so I have the gzipped mail that I run through the filter training
# script as well as current uncompressed "good" and spam files.
#
prepare_data () {
export JUNKNUM=$(number_of_tokens '/Your/home/Mail/*junk*.gz' '/Your/home/Mail/junk')
export GOODNUM=$(number_of_tokens '/Your/home/Mail/*received*.gz' '/Your/home/Mail/received')
address_white_list '/Your/home/Mail/*received*.gz' '/Your/home/Mail/received'
echo $JUNKNUM $GOODNUM
zcat ~/Mail/*junk*.gz | cat ~/Mail/junk - |
frequency|
sort -nr -k 2,2 > ~/Mail/spamwordfrequency
zcat ~/Mail/*received*.gz | cat ~/Mail/received - |
frequency|
sort -nr -k 2,2 > ~/Mail/goodwordfrequency
spaminess|
sort -nr -k 2,2 > ~/Mail/spamprobability
# Clean up files
rm ~/Mail/spamwordfrequency ~/Mail/goodwordfrequency
}
#########
# Main
prepare_data
exit
Inspired by the Paul Graham article "A Plan for Spam" www.paulgraham.com
Implement in the .forward file like so:
"| /Your/path/to/bin/spamfilter"
If mail is spam then put in a spam file else put in the good mail file.
#!/bin/ksh
spamly () {
/usr/bin/nawk '
{ message[k++]=$0; }
END { if (k==0) {exit;} # empty message or was in the whitelist.
good_mail_file="/usr/spool/mail/your_user";
spam_mail_file="/Your/home/Mail/junk";
spam_probability_file="/Your/home/Mail/spamprobability";
total_tokens=0.01;
while (getline < spam_probability_file)
bad_hash[$1]=$2; close(spam_probability_file);
for (line in message){
token_number=split(message[line],tokens);
for (i = 1; i <= token_number; i++){ # split() fills tokens[1..token_number]
if (tokens[i] in bad_hash) {
if (bad_hash[tokens[i]] <= 0.06 || bad_hash[tokens[i]] >= 0.94){
total_tokens+=1;
spamtotal+=bad_hash[tokens[i]];
}
}
}
}
if (spamtotal/total_tokens > 0.50) {
for (j = 0; j < k; j++){ print message[j] >> spam_mail_file}
print "\n\n" >> spam_mail_file;
}
else {
for (j = 0; j < k; j++){ print message[j] >> good_mail_file}
print "\n\n" >> good_mail_file;
}
}'
}
# Check whitelist for good address.
# if in whitelist then put in good_mail_file
# else Pass message through filter.
whitelister () {
/usr/bin/nawk '
BEGIN { whitelist_file="/Your/home/Mail/address_whitelist";
good_mail_file="/usr/spool/mail/your_user";
found="no";
while (getline < whitelist_file)
whitelist[$1]="address"; close(whitelist_file);
}
{ message[k++]=$0;}
/^From / {sender=tolower($2);
gsub ("\<","",sender);
gsub ("\>","",sender);
if (whitelist[sender]) { found="yes";}
}
/^Return-Path: / {sender=tolower($2);
gsub ("\<","",sender);
gsub ("\>","",sender);
if (whitelist[sender]) { found="yes";}
}
END { if (found=="yes") {
for (j = 0; j < k; j++){ print message[j] >> good_mail_file}
print "\n\n" >> good_mail_file;
}
else {
for (j = 0; j < k; j++){ print message[j];}
}
}'
}
#####################################
# Main
# The mail is first checked by the white list, if it is not found in the
# white list it is piped to the spam filter.
whitelister | spamly
exit
Download from LAWKER.
Sorts a Unix style mailbox by "thread", in date+subject order.
This is a script I use quite a lot. It requires gawk, although with some work it could be ported to standard awk. The timezone offset from GMT has to be adjusted to one's local offset, although I could probably eliminate that if I wanted to work on it hard enough.
This took me a while to write and get right, but it's been working flawlessly for a few years now. The script uses the Message-ID header to detect and remove duplicates. It requires GNU Awk for the time/date functions and for an efficiency hack in string concatenation, but could be made to run on a POSIX awk with some work.
BEGIN {
TRUE = 1
FALSE = 0
split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", months, " ")
for (i in months)
Month[months[i]] = i # map name to number
MonthDays[1] = 31
MonthDays[2] = 28 # not used
MonthDays[3] = 31
MonthDays[4] = 30
MonthDays[5] = 31
MonthDays[6] = 30
MonthDays[7] = 31
MonthDays[8] = 31
MonthDays[9] = 30
MonthDays[10] = 31
MonthDays[11] = 30
MonthDays[12] = 31
In_header = FALSE
Body = ""
LocalOffset = 2 # We are two hours ahead of GMT
# These keep --lint happier
Debug = 0
MessageNum = 0
Duplicates = FALSE
}
/^From / {
In_header = TRUE
if (MessageNum)
Text[MessageNum] = Body
MessageNum++
Body = ""
# print MessageNum
}
In_header && /^Date: / {
Date[MessageNum] = compute_date($0)
}
In_header && /^Subject: / {
Subject[MessageNum] = canonacalize_subject($0)
}
In_header && /^Message-[Ii][Dd]: / {
if (NF == 1) {
getline junk
$0 = $0 RT junk # Preserve original input text!
}
# Note: Do not use $0 directly; it's needed as the Body text
# later on.
line = tolower($0)
split(line, linefields)
message_id = linefields[2]
Mesg_ID[MessageNum] = message_id # needed for disambiguating message
if (message_id in Message_IDs) {
printf("Message %d is duplicate of %s (%s)\n",
MessageNum, Message_IDs[message_id],
message_id) > "/dev/stderr"
Message_IDs[message_id] = (Message_IDs[message_id] ", " MessageNum)
Duplicates++
} else {
Message_IDs[message_id] = MessageNum ""
}
}
In_header && /^$/ {
In_header = FALSE
# map subject and date to index into text
if (Debug && (Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]) in SubjectDateId) {
printf(\
("Message %d: Subject <%s> Date <%s> Message-ID <%s> already in" \
" SubjectDateId (Message %d, s: <%s>, d <%s> i <%s>)!\n"),
MessageNum, Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum],
SubjectDateId[Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]],
Subject[SubjectDateId[Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]]],
Date[SubjectDateId[Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]]],
Mesg_ID[SubjectDateId[Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]]]) \
> "/dev/stderr"
}
SubjectDateId[Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]] = MessageNum
if (Debug) {
printf("\tMessage Num = %d, length(SubjectDateId) = %d\n",
MessageNum, length(SubjectDateId)) > "/dev/stderr"
if (MessageNum != length(SubjectDateId) && ! Printed1) {
Printed1++
printf("---> Message %d <---\n", MessageNum) > "/dev/stderr"
}
}
# build up mapping of subject to earliest date for that subject
if (! (Subject[MessageNum] in FirstDates) ||
FirstDates[Subject[MessageNum]] > Date[MessageNum])
FirstDates[Subject[MessageNum]] = Date[MessageNum]
}
{
Body = Body ($0 "\n")
}
END {
Text[MessageNum] = Body # get last message
if (Debug) {
printf("length(SubjectDateId) = %d, length(Subject) = %d, length(Date) = %d\n",
length(SubjectDateId), length(Subject), length(Date))
printf("length(FirstDates) = %d\n", length(FirstDates))
}
# Create new array to sort by thread. Subscript is
# earliest date, subject, actual date
for (i in SubjectDateId) {
n = split(i, t, SUBSEP)
if (n != 3) {
printf("yowsa! n != 3 (n == %d)\n", n) > "/dev/stderr"
exit 1
}
# now have subject, date, message-id in t
# create index into Text
Thread[FirstDates[t[1]], i] = SubjectDateId[i]
}
n = asorti(Thread, SortedThread) # Shazzam!
if (Debug) {
printf("length(Thread) = %d, length(SortedThread) = %d\n",
length(Thread), length(SortedThread))
}
if (n != MessageNum && ! Duplicates) {
printf("yowsa! n != MessageNum (n == %d, MessageNum == %d)\n",
n, MessageNum) > "/dev/stderr"
# exit 1
}
if (Debug) {
for (i = 1; i <= n; i++)
printf("SortedThread[%d] = %s, Thread[SortedThread[%d]] = %d\n",
i, SortedThread[i], i, Thread[SortedThread[i]]) > "DUMP1"
close("DUMP1")
if (Debug ~ /exit/)
exit 0
}
for (i = 1; i <= MessageNum; i++) {
if (Debug) {
printf("Date[%d] = %s\n",
i, strftime("%c", Date[i]))
printf("Subject[%d] = %s\n", i, Subject[i])
}
printf("%s", Text[Thread[SortedThread[i]]]) > "OUTPUT"
}
close("OUTPUT")
close("/dev/stderr") # shuts up --lint
}
Pull apart a date string and convert to timestamp.
function compute_date(date_rec, fields, year, month, day,
hour, min, sec, tzoff, timestamp)
{
split(date_rec, fields, "[:, ]+")
if ($2 ~ /Sun|Mon|Tue|Wed|Thu|Fri|Sat/) {
# Date: Thu, 05 Jan 2006 17:11:26 -0500
year = fields[5]
month = Month[fields[4]]
day = fields[3] + 0
hour = fields[6]
min = fields[7]
sec = fields[8]
tzoff = fields[9] + 0
} else {
# Date: 05 Jan 2006 17:11:26 -0500
year = fields[4]
month = Month[fields[3]]
day = fields[2] + 0
hour = fields[5]
min = fields[6]
sec = fields[7]
tzoff = fields[8] + 0
}
if (tzoff == "GMT" || tzoff == "gmt")
tzoff = 0
tzoff /= 100 # assume offsets are in whole hours
tzoff = -tzoff
# crude compensation for timezone
# mktime() wants a local time:
# hour + tzoff yields GMT
# GMT + LocalOffset yields local time
hour += tzoff + LocalOffset
# if moved into next day, reset other values
if (hour > 23) {
hour %= 24
day++
if (day > days_in_month(month, year)) {
day = 1
month++
if (month > 12) {
month = 1
year++
}
}
}
timestamp = mktime(sprintf("%d %d %d %d %d %d -1",
year, month, day, hour, min, sec))
# timestamps can be 9 or 10 digits.
# canonicalize them into 11 digits with leading zeros
return sprintf("%011d", timestamp)
}
How many days in the given month?
function days_in_month(month, year)
{
if (month != 2)
return MonthDays[month]
if ((year % 4 == 0 && year % 100 != 0) || year % 400 == 0)
return 29
return 28
}
Trim out "Re:", white space.
function canonacalize_subject(subj_line)
{
subj_line = tolower(subj_line)
sub(/^subject: +/, "", subj_line)
sub(/^(re: *)+/, "", subj_line)
sub(/[[:space:]]+$/, "", subj_line)
gsub(/[[:space:]]+/, " ", subj_line)
return subj_line
}
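For readers who want the ordering idea without the mailbox parsing, here is a minimal Python sketch of the sort key the script builds: each message is keyed by the earliest date seen for its (canonical) subject, then the subject, then its own date and Message-ID. The tuple layout and function names are my own assumptions; the awk script gets the same effect by sorting zero-padded strings with asorti().

```python
import re


def canonical_subject(subject):
    """Lower-case, strip leading 'Re:' prefixes and squeeze white space."""
    s = subject.lower()
    s = re.sub(r"^(re: *)+", "", s)
    return re.sub(r"\s+", " ", s.strip())


def sort_by_thread(messages):
    """messages: iterable of (timestamp, subject, message_id, text) tuples.

    Returns the message texts in thread order. Messages that agree on
    subject, date and Message-ID collapse into one entry.
    """
    by_key = {}
    first_date = {}
    for ts, subject, msg_id, text in messages:
        subj = canonical_subject(subject)
        by_key[(subj, ts, msg_id)] = text
        # Remember the earliest date at which each subject appears
        if subj not in first_date or ts < first_date[subj]:
            first_date[subj] = ts
    ordered = sorted(by_key, key=lambda k: (first_date[k[0]], k[0], k[1], k[2]))
    return [by_key[k] for k in ordered]
```

Sorting on the thread's first date keeps replies next to the message that started the thread, even when other threads were active in between.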
Copyright 2007, 2008, Arnold David Robbins arnold@skeeve.com
These pages focused on using Awk for analysis in engineering domains.
A style seen in many Awk libraries is lots of small scripts, each handling a very specific task.
A good example of this style is Eiso Ab's library of scripts for chemical engineering. Shown below are dozens of his scripts. His library is an interesting example of real-world Awk programming.
You can download a tgz of all awk and other scripts from http://www.nmr.chem.uu.nl/~eiso/scripts.tgz. Please direct all bugs and questions to eiso@nmr.chem.uu.nl
ass2shift.awk (Jan-23-2008) purpose: read in anything with assignments or chemical shifts, check consistency and write a shift list
ppm2prot.awk (Jun-12-12:15) generate an xeasy .prot file from another shift.list or ppm.out file
xpk2peaks.awk (Sep-20-2007)
pdb2iupac.awk (Feb-11-2005)
pdb2pdb.awk (Jun-12-12:17) purpose: reformat ATOM records for various conventions; set B-factors for residues and/or atoms
seq2shift.awk (Nov-21-2007) fill shift list with chemical shift statistics from a database, currently the cyana lib file.
predict.awk (Jun-12-12:16) purpose: make list of predicted peaks from shift-lists
addass.awk (Feb-11-2005)
reref.awk (Jun-12-12:18) compare referencing for one or two peaklists in a 2D-histogram; see also plotpeaks.awk
plotpeaks.awk (Jun-12-12:15) plot peaks in a 2D graph; useful for comparing referencing between peak files or within domains of one peakfile; see also reref.awk
calib.awk (Jul-13-2005) determining calibration parameters from bruker acqu file; use: calib.awk temp=298 acqu [1] Wishart, D.S; Sykes, B.D. (1995) J. Biomol. NMR., 6, 135-140 1H, 13C and 15N chemical shift referencing in biomolecular NMR
mergeshift.awk (Jun-12-12:12)
gmx2nmr.awk (Jun-12-12:10) opposite of nmr2gmx.awk: convert gmx topology distance and orientation restraints
nmr2gmx.awk (Jun-16-13:55) make gromacs topology files for distance restraints and dipolar coupling data
diffshift.awk (Apr-23-10:33) compare shift lists, e.g. diffshift.awk [ options ] shiftlist1 shiftlist2
complete_assignments.awk (Sep-20-2007) adds assignments in xeasy peaklist where one atom of a proton-heteroatom couple is assigned and the remaining assignment is clear.
peaks-project.awk (Jun-12-12:13) make lower dimension projections from xeasy peak files
peaks-unfold.awk (Jun-12-12:14) unfolds peaks in xeasy peaks files
unwatergate.awk (Feb-14-2005) undo the effect of watergate water suppression on peak intensities in nmrview xpk files
colorchain2mac.awk (Feb-15-2005) make molmol macro for rainbow-colored spline
seq2seq.awk (Sep-20-2007) convert protein aminoacid sequence files from oneletter to threeletter format and vice versa.
makehbonds.awk (Feb-11-2005)
sparkysave2peaks.awk (Sep-21-2006) convert sparky save files to xeasy peakslist
peaks2sparky.awk (Jun-12-12:14) create sparky readable peaklists (.list); example: peaks2sparky.awk protein.seq protein.prot c13-cycle7.peaks
addass2sparkysave.awk (Feb-11-2005)
shifts2sparky-rl.awk (Sep-20-2007)
check_hetero_atom.awk (May-15-2007)
splitass.awk (Feb-11-2005)
splitnoa.awk (Feb-11-2005)
pdb2ariapdb.awk (Feb-11-2005)
tblcount.awk (Aug-15-2005) make a table with numbers of ambiguous and unambiguous intra/seq/medium/long/inter-dom restraints; it needs a2ps to format the output.
upl2tbl.awk (Jun-12-12:20) very simple converter for xeasy .upl files to xplor/cns .tbl files
tbl2upl.awk (Apr-22-2005) convert (ambiguous) distance restraints in xplor *.tbl file to xeasy/cyana .upl/lol file; use: tbl2upl.awk name.(seq|pdb) unambig.tbl > out.upl
filterpeaks.awk (May-26-2005) filter peaklist for diagonal, water, lowest/highest intensities
tabstat.awk (Jun-12-12:21) get stats on values in columns of tables
linestat.awk (Jun-12-12:10) perform statistics on a certain number of columns of each line
qual-col.awk (Jun-12-12:17) purpose: color residues according to whatcheck bad/poor scores; creates molmol macros
make_IDR.awk (Jun-12-12:11) purpose: create restraints for working with proxy residues
add-linkers.awk (Sep-20-2007)
cyana-renum-lib.awk (Sep-20-2007) renumber the atoms of a cyana residue library entry
(Editor's note: This page is adapted from David Leo's excellent mechanical engineering using Awk scripts website.)
Here is yet another Awk library for engineering applications. Elsewhere, we have seen an extensive library of chemical engineering scripts. Here, David Leo applies Awk to numerous mechanical engineering tasks. Interestingly, the style of David's code is similar to that seen in the chemical engineering library; i.e. lots of small scripts, each doing a different specific task.
To learn more about these scripts, go to David's Awk scripts site. At that site, you will find:
This script calculates the heat transfer through a flat wall or plate made up of several material layers and having convection heat transfer on both sides of the wall or plate.
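As a rough idea of what such a calculation involves, here is a sketch of the standard textbook series-resistance model for a layered wall with convection on both sides. It is not David's script: the function name, argument layout and SI units are assumptions made for illustration.

```python
def wall_heat_flux(t_hot, t_cold, h_hot, h_cold, layers):
    """Steady heat flux (W/m^2) through a layered flat wall.

    h_hot, h_cold: convection coefficients on each side (W/m^2.K)
    layers: list of (thickness_m, conductivity_W_per_mK) tuples
    """
    # Thermal resistances in series: two convection films plus each layer
    resistance = 1.0 / h_hot + 1.0 / h_cold
    for thickness, conductivity in layers:
        resistance += thickness / conductivity
    return (t_hot - t_cold) / resistance
```

Adding a layer only ever adds resistance, so each extra material layer lowers the flux for the same temperature difference.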
This script calculates the overall average heat transfer out of a house through a winter. It also calculates and compares oil heat to a geothermal heat pump.
This script calculates the convection heat transfer coefficient on the surface of a flat plate with fluid flowing over it. The boundary layer may transition from laminar to turbulent, as established by the critical Reynolds number (a user input value).
This script calculates the heat transfer through a pipe, tube or duct made up of several material layers and having convection heat transfer on both sides of the wall.
This script calculates the internal heat transfer coefficients for flow through an internal passage (pipe, tube, duct). It includes an "entrance effect" where the coefficients are larger at the inlet, as the boundary layer builds up. It also includes the effects of fluid being heated or cooled, and uses a laminar boundary layer if the Reynolds number is below 2300.
This script calculates the average heat transfer coefficient on the external wall of a pipe, with forced convection (fluid flowing across the pipe at some prescribed velocity).
This script calculates the forced, damped response of a 1 degree of freedom mass & spring system. Input file, script file, sample output file
This is the classic textbook 1DOF response to an applied force of fixed magnitude and varying frequency. A crude bar chart is plotted for a quick visual check.
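The "classic textbook" response mentioned here is usually written as the static deflection F/k divided by a term that depends on the frequency ratio and the damping ratio. The sketch below assumes that standard form; the function and parameter names are mine, not those of the script.

```python
import math


def steady_state_amplitude(force, stiffness, freq_ratio, damping_ratio):
    """Steady-state amplitude of a forced, damped 1DOF mass and spring system.

    freq_ratio is forcing frequency / natural frequency;
    damping_ratio is the fraction of critical damping.
    """
    denom = math.sqrt((1.0 - freq_ratio ** 2) ** 2
                      + (2.0 * damping_ratio * freq_ratio) ** 2)
    return (force / stiffness) / denom
```

At a frequency ratio of 1 the first term vanishes and only damping limits the amplitude, which is why a lightly damped system shows a sharp peak there (and why zero damping at resonance would divide by zero in this sketch).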
This script calculates the first few natural frequencies of beams with common end conditions. It allows added distributed weight and a G-level for simulating static shock, and calculates the resulting peak deflection and peak stress. Input file, script file, sample output file
This script calculates the first natural frequency of a uniform rotor shaft on resilient bearings. A distributed weight may be added. It also calculates the damped, forced response to a specified (oz-in) unbalance at the midspan of the shaft. Input file, script file, sample output file
The unbalance force is F = m*r*(RPM * pi / 30)^2, which increases with speed. The m*r term is converted from the commonly specified oz-in to the correct units for the calculation. The response is calculated as a function of frequency ratio, using the classic textbook equation for a 1 degree of freedom system.
Critical frequencies are explicitly calculated as well. Finally, a crude bar chart is plotted for a quick visual check.
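To show the unit handling in the unbalance force, here is a small sketch that converts an oz-in unbalance to kg·m and evaluates m*r*omega^2 with omega = RPM * pi / 30. Working in SI units and returning newtons is my assumption; the original script may well use other units.

```python
import math

OZ_TO_KG = 0.028349523125  # one avoirdupois ounce in kilograms
IN_TO_M = 0.0254           # one inch in metres


def unbalance_force_newtons(unbalance_oz_in, rpm):
    """Rotating unbalance force F = m*r*omega^2, with m*r given in oz-in."""
    mr = unbalance_oz_in * OZ_TO_KG * IN_TO_M  # kg*m
    omega = rpm * math.pi / 30.0               # rad/s
    return mr * omega ** 2
```

For example, unbalance_force_newtons(1.0, 3000) comes to roughly 71 N, and doubling the speed quadruples the force.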
This script is a simple calculation of the heat transfer requirements for a heat exchanger. This is taken as U * A, where U = the overall heat transfer coefficient of the design, and A = the total heat transfer area between the two fluids.
The user can tweak any or all of the input variables until Qhot = Qcool. At that condition, the total heat lost by one fluid equals the total heat gained by the other fluid. This equality must be achieved (by user input variables) or the resulting answers will be incorrect. The script was written this way to allow the user infinite latitude for tweaking whatever variables desired. But the requirement that Qhot = Qcool must be met.
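The balance requirement can be illustrated with a short sketch. It assumes each stream's heat rate is mass flow times specific heat times temperature change, and that U*A is found from the log-mean temperature difference of a counterflow exchanger. Both are common textbook choices rather than something stated on this page, and all the names are hypothetical.

```python
import math


def heat_rate(mass_flow, cp, t_in, t_out):
    """Heat given up or picked up by one stream: Q = m_dot * cp * |dT|."""
    return mass_flow * cp * abs(t_in - t_out)


def lmtd_counterflow(th_in, th_out, tc_in, tc_out):
    """Log-mean temperature difference for a counterflow arrangement."""
    d1 = th_in - tc_out
    d2 = th_out - tc_in
    if math.isclose(d1, d2):
        return d1
    return (d1 - d2) / math.log(d1 / d2)


def required_ua(m_hot, cp_hot, th_in, th_out,
                m_cold, cp_cold, tc_in, tc_out, rel_tol=1e-3):
    """Return U*A, refusing to answer unless Qhot and Qcool balance."""
    q_hot = heat_rate(m_hot, cp_hot, th_in, th_out)
    q_cold = heat_rate(m_cold, cp_cold, tc_in, tc_out)
    if not math.isclose(q_hot, q_cold, rel_tol=rel_tol):
        raise ValueError("Qhot != Qcool; adjust the inputs until they balance")
    return q_hot / lmtd_counterflow(th_in, th_out, tc_in, tc_out)
```

Raising an error when the two heat rates disagree makes the page's warning explicit: an unbalanced set of inputs gives a number, but not a meaningful one.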