Awk.Info

"Cause a little auk awk
goes a long way."

WHAT'S NEW?

Mar 01: Michael Sanders demos an X-windows GUI for AWK.

Mar 01: Awk100#24: A. Lahm and E. de Rinaldis' patent search, in AWK

Feb 28: Tim Menzies asks this community to write an AWK cookbook.

Feb 28: Arnold Robbins announces a new debugger for GAWK.

Feb 28: Awk100#23: Premysl Janouch offers an IRC bot, in AWK

Feb 28: Updated: the AWK FAQ

Feb 28: Tim Menzies offers a tiny content management system, in Awk.

Jan 31: Comment system added to awk.info. For example, see the discussion at the bottom of ?keys2awk

Jan 31: Martin Cohen shows that Gawk can handle massively long strings (300 million characters).

Jan 31: The AWK FAQ is being updated. For comments/ corrections/ extensions, please mail tim@menzies.us

Jan 31: Martin Cohen finds Awk on the Android platform.

Jan 31: Aleksey Cheusov released a new version of runawk.

Jan 31: Hirofumi Saito contributes a candidate Awk mascot.

Jan 31: Michael Sanders shows how to quickly build an AWK GUI for windows.

Jan 31: Hyung-Hwan Chung offers QSE, an embeddable Awk Interpreter.


categories: Awk100,Jan,2009,Admin

The Awk 100

Goals

Awk is being used all around the world for real programming problems, but the news is not getting out.

We are aiming to create a database of at least one hundred Awk programs which will:

  • Identify the tasks that Awk is really being used for
  • Enable analysis of the benefits of the language for practical programming
  • Serve as an information exchange for applications

Contribute

If you, or your colleagues or friends, have written a program which has been used for purposes small or large, why not take five minutes to record the facts, so that others can see what you've done?

To contribute, fill in this template and mail it to mail@awk.info with the subject line Awk 100 contribution.

Current Listing

(Recent additions are shown first.)

  1. A. Lahm and E. de Rinaldis' Patent Matrix
    • PatentMatrix is an automated tool to survey patents related to large sets of genes or proteins. The tool allows a rapid survey of patents associated with genes or proteins in a particular area of interest as defined by keywords. It can be efficiently used to evaluate the IP-related novelty of scientific findings and to rank genes or proteins according to their IP position.
  2. P Janouch's AWK IRC agent:
    • VitaminA IRC bot is an experiment on what can be done with GNU AWK. It's a very simple though powerful scripting language. Using the coprocess feature, plugins can be implemented very easily and in a language-independent way as a side-effect. The project runs only on Unix-derived systems.
  3. Stephen Jungels' music player:
    • Plaiter (pronounced "player") is a command line front end to command line music players. What does Plaiter do that (say) mpg123 can't already? It queues tracks, first of all. Secondly, it understands commands like play, pause, stop, next and prev. Finally, unlike most of the command line music players out there, Plaiter can handle a play list with more than one type of audio file, selecting the proper helper app to handle each type of file you throw at it.
  4. Dan at sourceforge's Jawk system:
    • Awk, implemented in the Java virtual machine. Very useful for extending lightweight scripting in Awk with (e.g.) network and GUI facilities from Java.
  5. Axel T. Schreiner's OOC system:
    • ooc is an awk program which reads class descriptions and performs the routine coding tasks necessary to do object-oriented coding in ANSI C.
  6. Ladd and Ramming's Awk A-star system:
    • Programmers often take awk "as is", never thinking to use it as a lab in which we can explore other language extensions. This is, of course, only one way to treat the Awk code base. An alternate approach is to treat the Awk code base as a reusable library of parsers, regular expression engines, etc., and to make modifications to the language. This second approach was taken by David Ladd and J. Christopher Ramming in their A* system.
  7. Henry Spencer's Amazing Awk Syntax Language system:
    • Aaslg and aaslr implement the Amazing Awk Syntax Language, AASL (pronounced "hassle"). Aaslg (pronounced "hassling") takes an AASL specification from the concatenation of the file(s) (default standard input) and emits the corresponding AASL table on standard output.
    • The AASL implementation is not large. The scanner is 78 lines of awk, the parser is 61 lines of AASL (using a fairly low-density paragraphing style and a good many comments), and the semantics pass is 290 lines of awk. The table interpreter is 340 lines, about half of which (and most of the complexity) can be attributed to the automatic error recovery.
    • As an experiment with a more ambitious AASL specification, one for ANSI C was written. This occupies 374 lines excluding comments and blank lines, and with the exception of the messy details of C declarators is mostly a fairly straightforward transcription of the syntax given in the ANSI standard.
  8. Jurgen Kahrs (and others) XMLgawk system:
    • XMLgawk is an experimental extension of the GNU Awk interpreter. It includes a small XML parsing library which is built upon the Expat XML parser.
    • The same tool that can load the XML shared library can also add other libraries (e.g. PostgreSQL).
  9. Henry Spencer's Amazing Awk Assembler
    • "aaa" (the Amazing Awk Assembler) is a primitive assembler written entirely in awk and sed. It was done for fun, to establish whether it was possible. It is; it works. Using "aaa", it's very easy to adapt to a new machine, provided the machine falls into the generic "8-bit-micro" category.
  10. Ronald Loui's AI programming lab.
    • For many years, Ronald Loui has taught AI using Awk. He writes:
      • Most people are surprised when I tell them what language we use in our undergraduate AI programming class. That's understandable. We use GAWK.
      • A repeated observation in this class is that only the scripting programmers can generate code fast enough to keep up with the demands of the class. Even though students were allowed to choose any language they wanted, and many had to unlearn the Java ways of doing things in order to benefit from scripting, there were few who could develop ideas into code effectively and rapidly without scripting.
      • What I have found not only surprising but also hopeful, is that when I have approached the AI people who still enjoy programming, some of them are not the least bit surprised.
  11. Henry Spencer's Amazing Awk Formatter.
    • Awf may not be lightning fast, and it has certain restrictions, but it does a decent job on most manual pages and simple -ms documents, and isn't subject to AT&T's brain-damaged licensing that denies many System V users any text formatter at all. It is also a text formatter that is simple enough to be tinkered with, for people who want to experiment.
  12. Yung-Pin Cheng's Awk-Linux Course ware.
    • The stable and cross-platform nature of Awk enabled the simple creation of a robust toolkit for teaching operating system concepts to university students. The toolkit is much simpler, and easier to port to new platforms, than alternative and more elaborate courseware tools.
    • This work was the basis for a quite prestigious publication in the IEEE Transactions on Education journal, 2008, Vol 51, Issue 4. Who said Awk was an old-fashioned tool?
  13. Jon Bentley's m1 micro macro processor.
    • Supports the essential operations of defining strings and replacing strings in text by their definitions. All in 110 lines. A little awk goes a long way.
  14. Arnold Robbins and Nelson Beebe's classic spell checker
    • A powerful spell checker, and a case-study on how to best write applications using hundreds of lines of Awk.
  15. Jim Hart's awk++
    • An object-oriented Awk.
  16. Wolfgang Zekoll's Yawk
    • WIKI written in Awk
  17. Darius Bacon: AwkLisp
    • LISP written in Awk
  18. Bill Poser: Name
    • Generate TeX code for a bilingual dictionary.
  19. Ronald Loui: Faster clustering
    • Demonstration to DoD of a clustering algorithm suitable for streaming data
  20. Peteris Krumins: Get YouTube videos
    • Download YouTube videos
  21. Jim Hart: Sudoku
    • Solve sudoku puzzles using the same strategies as a person would, not by brute force.
  22. Ronald Loui: Anne's Negotiation Game
    • Research on a model of negotiation incorporating search, dialogue, and changing expectations.
  23. Ronald Loui: Baseball Sim
    • A baseball simulator for investigating the efficiency of batting lineups.
  24. Ronald Loui: Argcol
    • A tool inspired by fmt that could be used while working in vi to maintain a multi-column pro-con argument format.

categories: Who,Jan,2009,Mikel

Mikel = Mike Langman

From: Tim Menzies <tim@menzies.us>
To: mikelangman@blueyonder.co.uk
Subject: auk images

I write to see if you would be gracious enough to grant us usage rights for your auk paintings to use on this site, in exchange for appropriate credit such as:

  • your name + links to your site on every page of this site;
  • a link with an image that takes our users to your site.

From: Mike Langman <mikelangman@blueyonder.co.uk>
Date: Mon, Jan 19, 2009 at 2:55 AM
Subject: Re: auk images

I normally charge for the use of images but as there is no money involved please carry on using the images and include a link to my website as suggested.

Many thanks for asking.

- Mike


categories: Misc,WhyAwk,Jan,2009,Admin

Awk Advocacy

"Because easy is not wrong." - Anon

From various sources:

Quotes:

  • "Listen to people who program, not to people who want to tell you how to program."
    - Ronald P. Loui
  • "Good design is as little design as possible."
    - Dieter Rams
  • "When we have on occasion rewritten an Awk program in a conventional programming language like C or C++, the result was usually much longer, and much harder to debug."
    - Arnold Robbins & Nelson Beebe

From Project Management Advice:

  • More programming theory does not make better programmers.
  • Don't let old/compiler people tell you what language to use.
  • If there is already a way of doing something, do not invent a harder way.

From Awk programming:

  • Awk is a simple and elegant pattern scanning and processing language.
  • Awk is also the most portable scripting language in existence.
  • But why use it rather than Perl (or PHP or Ruby or...):
    • Awk is simpler (especially important if deciding which to learn first);
    • Awk syntax is far more regular (another advantage for the beginner, even without considering syntax-highlighting editors);
    • You may already know Awk well enough for the task at hand;
    • You may have only Awk installed;
    • Awk can be smaller, thus much quicker to execute for small programs.

From Awk as a Major Systems Programming Language:

  • Effective use of its data structures and its stream-oriented structure takes some adjustment for C programmers, but the results can be quite striking.

According to Ramesh Natarajan:

  • AWK is a superb language for testing algorithms and applications with some complexity, especially where the problem can be broken into chunks which can be streamed as part of a pipe. It's an ideal tool for augmenting the features of shell programming as it is ubiquitous; found in some form on almost all Unix/Linux/BSD systems. Many problems dealing with text, log lines or symbol tables are handily solved or at the very least prototyped with awk along with the other tools found on Unix/Linux systems.

From the NoSQL pages:

  • (A language like Perl is) a good programming language for writing self-contained programs, but pre-compilation and long start-up time are worth paying only if once the program has loaded it can do everything in one go. This contrasts sharply with the Operator-stream Paradigm, where operators are chained together in pipelines of two, three or more programs. The overhead associated with initializing (say) Perl at every stage of the pipeline makes pipelining inefficient. A better way of manipulating structured ASCII files is to use the AWK programming language, which is much smaller, more specialized for this task, and is very fast at startup.
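
The operator-stream idea in that last quote can be sketched as a pipeline of tiny Awk programs, each doing one job on a tab-separated file. This is only an illustration: the sample data and the function names (data, select_eng, salary, total) are made up here and do not come from the NoSQL pages.

```sh
# Hypothetical sample data: tab-separated name, department, salary.
data() {
  printf 'alice\teng\t100\n'
  printf 'bob\tops\t80\n'
  printf 'carol\teng\t120\n'
}

# Operator 1: keep only the rows whose second field is "eng".
select_eng() { awk -F'\t' '$2 == "eng"'; }

# Operator 2: project the third (salary) column.
salary() { awk -F'\t' '{ print $3 }'; }

# Operator 3: add up a single column of numbers ("+ 0" prints 0 on empty input).
total() { awk '{ sum += $1 } END { print sum + 0 }'; }

data | select_eng | salary | total    # prints 220
```

Each stage starts a fresh awk process, which is why the quote stresses start-up cost.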

categories: Misc,Jan,2009,Admin

Community

To join our community, consider contributing to this site.

For a list of authors of this site, see our credits pages.

The Awk Wiki.

USENET discussion group: comp.lang.awk.


categories: Misc,Jan,2009,Admin

Contact

For discussions on Awk, see the Awk discussion group.

For comments/ complaints/ corrections/ extensions to this site, contact mail@awk.info.


categories: Misc,Jan,2009,Admin

Welcome to the Awk Community Portal

Awk is a stable, cross-platform computer language named for its authors Alfred Aho, Peter Weinberger & Brian Kernighan. They write: "Awk is a convenient and expressive programming language that can be applied to a wide variety of computing and data-manipulation tasks".

In Classic Shell Scripting, Arnold Robbins & Nelson Beebe confess their Awk bias: "We like it. A lot. The simplicity and power of Awk often make it just the right tool for the job."

Besides the Bourne shell, Awk is the only other scripting language available in the standard Unix environment. Implementations of AWK exist as installed software for almost all other operating systems.

Awk is a mature language: it was first implemented in the 1970s. As a tool from the golden age, it is sometimes called primitive. It is more accurate to call it elemental, so tightly focused is the language on what it does best: quickly converting this into that.

Consequently, throughout history, Awk has been the language of choice for many famous scientists such as Leonardo daVinci.



categories: Misc,Jan,2009,Admin

Code

LAWKER is a repository of Awk code divided into:

fridge
Fresh code (for the current trunk). Best place to start is fridge/gawk.
block
Place to chop up and experiment with code. Usually, avoid this one.
freezer
Frozen code. A place to store tags. Currently empty, but we plan to grow this one.
wiki
Wiki pages. Useful for documentation but, where possible, use the in-line pretty print method, described below.

How to contribute to LAWKER

See How to Contribute.

How to report a bug

Use our issue tracking system.


categories: Mascot,Misc,Jan,2009,Admin

Mascot

Missing: the Awk Mascot

Many communities have a mascot, a banner that they proudly wave high. So where's the Awk mascot?

I made one up, but you gotta say, it is kinda lame:

So if you have any ideas for such a mascot, please email mail@awk.info with the subject line "suggestion for mascot".

Not to stifle anyone's creativity, but the mascot might be based on the mantra "less, but better" or "easy is not wrong" or "a little awk goes a long way".

Current Offerings

Chris Johnson

Chris writes "more of a logo rather than a mascot":

Other Mascots

Lisp: Aliens

Perl: Camel

Linux: Tux

Java: Duke


categories: Top10,Papers,Misc,WhyAwk,Jan,2009,Ronl

GAWK for AI

by R. Loui

ACM Sigplan Notices, Volume 31, Number 8, August 1996

Most people are surprised when I tell them what language we use in our undergraduate AI programming class. That's understandable. We use GAWK. GAWK, Gnu's version of Aho, Weinberger, and Kernighan's old pattern scanning language isn't even viewed as a programming language by most people. Like PERL and TCL, most prefer to view it as a `scripting language.' It has no objects; it is not functional; it does no built-in logic programming. Their surprise turns to puzzlement when I confide that (a) while the students are allowed to use any language they want; (b) with a single exception, the best work consistently results from those working in GAWK. (footnote: The exception was a PASCAL programmer who is now an NSF graduate fellow getting a Ph.D. in mathematics at Harvard.) Programmers in C, C++, and LISP haven't even been close (we have not seen work in PROLOG or JAVA).

There are some quick answers that have to do with the pragmatics of undergraduate programming. Then there are more instructive answers that might be valuable to those who debate programming paradigms or to those who study the history of AI languages. And there are some deep philosophical answers that expose the nature of reasoning and symbolic AI. I think the answers, especially the last ones, can be even more surprising than the observed effectiveness of GAWK for AI.

First it must be confessed that PERL programmers can cobble together AI projects well, too. Most of GAWK's attractiveness is reproduced in PERL, and the success of PERL forebodes some of the success of GAWK. Both are powerful string-processing languages that allow the programmer to exploit many of the features of a UNIX environment. Both provide powerful constructions for manipulating a wide variety of data in reasonably efficient ways. Both are interpreted, which can reduce development time. Both have short learning curves. The GAWK manual can be consumed in a single lab session and the language can be mastered by the next morning by the average student. GAWK's automatic initialization, implicit coercion, I/O support and lack of pointers forgive many of the mistakes that young programmers are likely to make. Those who have seen C but not mastered it are happy to see that GAWK retains some of the same sensibilities while adding what must be regarded as a spoonful of syntactic sugar. Some will argue that PERL has superior functionality, but for quick AI applications, the additional functionality is rarely missed. In fact, PERL's terse syntax is not friendly when regular expressions begin to proliferate and strings contain fragments of HTML, WWW addresses, or shell commands. PERL provides new ways of doing things, but not necessarily ways of doing new things.

In the end, despite minor differences, both PERL and GAWK minimize programmer time. Neither really provides the programmer the setting in which to worry about minimizing run-time.

There are further simple answers. Probably the best is the fact that increasingly, undergraduate AI programming is involving the Web. Oren Etzioni (University of Washington, Seattle) has for a while been arguing that the "softbot" is replacing the mechanical engineers' robot as the most glamorous AI test bed. If the artifact whose behavior needs to be controlled in an intelligent way is the software agent, then a language that is well-suited to controlling the software environment is the appropriate language. That would imply a scripting language. If the robot is KAREL, then the right language is turn left; turn right. If the robot is Netscape, then the right language is something that can generate Netscape -remote 'openURL(http://cs.wustl.edu/~loui)' with elan.

Of course, there are deeper answers. Jon Bentley found two pearls in GAWK: its regular expressions and its associative arrays. GAWK asks the programmer to use the file system for data organization and the operating system for debugging tools and subroutine libraries. There is no issue of user-interface. This forces the programmer to return to the question of what the program does, not how it looks. There is no time spent programming a binsort when the data can be shipped to /bin/sort in no time. (footnote: I am reminded of my IBM colleague Ben Grosof's advice for Palo Alto: Don't worry about whether it's highway 101 or 280. Don't worry if you have to head south for an entrance to go north. Just get on the highway as quickly as possible.)

There are some similarities between GAWK and LISP that are illuminating. Both provided a powerful uniform data structure (the associative array implemented as a hash table for GAWK and the S-expression, or list of lists, for LISP). Both were well-supported in their environments (GAWK being a child of UNIX, and LISP being the heart of lisp machines). Both have trivial syntax and find their power in the programmer's willingness to use the simple blocks to build a complex approach.

Deeper still, is the nature of AI programming. AI is about functionality and exploratory programming. It is about bottom-up design and the building of ambitions as greater behaviors can be demonstrated. Woe be to the top-down AI programmer who finds that the bottom-level refinements, `this subroutine parses the sentence,' cannot actually be implemented. Woe be to the programmer who perfects the data structures for that heap sort when the whole approach to the high-level problem needs to be rethought, and the code is sent to the junk heap the next day.

AI programming requires high-level thinking. There have always been a few gifted programmers who can write high-level programs in assembly language. Most however need the ambient abstraction to have a higher floor.

Now for the surprising philosophical answers. First, AI has discovered that brute-force combinatorics, as an approach to generating intelligent behavior, does not often provide the solution. Chess, neural nets, and genetic programming show the limits of brute computation. The alternative is clever program organization. (footnote: One might add that the former are the AI approaches that work, but that is easily dismissed: those are the AI approaches that work in general, precisely because cleverness is problem-specific.) So AI programmers always want to maximize the content of their program, not optimize the efficiency of an approach. They want minds, not insects. Instead of enumerating large search spaces, they define ways of reducing search, ways of bringing different knowledge to the task. A language that maximizes what the programmer can attempt rather than one that provides tremendous control over how to attempt it, will be the AI choice in the end.

Second, inference is merely the expansion of notation. No matter whether the logic that underlies an AI program is fuzzy, probabilistic, deontic, defeasible, or deductive, the logic merely defines how strings can be transformed into other strings. A language that provides the best support for string processing in the end provides the best support for logic, for the exploration of various logics, and for most forms of symbolic processing that AI might choose to call "reasoning" instead of "logic." The implication is that PROLOG, which saves the AI programmer from having to write a unifier, saves perhaps two dozen lines of GAWK code at the expense of strongly biasing the logic and representational expressiveness of any approach.

I view these last two points as news not only to the programming language community, but also to much of the AI community that has not reflected on the past decade's lessons.

In the puny language, GAWK, which Aho, Weinberger, and Kernighan thought not much more important than grep or sed, I find lessons in AI's trends, AI's history, and the foundations of AI. What I have found not only surprising but also hopeful, is that when I have approached the AI people who still enjoy programming, some of them are not the least bit surprised.


categories: Misc,Jan,2009,Timm

History

Recipe for a Language

  • 1 part egrep
  • 1 part snobol
  • 2 parts ed
  • 3 parts C
  • Blend all parts well using lex and yacc. Document minimally and release.
  • After eight years, add another part egrep and two more parts C. Document very well and release.

    Historical Notes

    From awk.freeshell.org:

    • 1977-1985: awk and nawk (now also known as 'old awk' or 'the old true awk'): the original version of the language, lacking many of the features that make it fun to play with now
    • 1985-1996: The GNU implementation, Gawk, was written in 1986 by Paul Rubin and Jay Fenlason, with advice from Richard Stallman. John Woods contributed parts of the code as well. In 1988 and 1989, David Trueman, with help from Arnold Robbins, thoroughly reworked Gawk for compatibility with the newer Awk.
    • 1996: BWK awk was released under an open license. Huzzah!
    • Sometime before the present, mawk, xgawk, jawk, awkcc, Kernighan's nameless awk-to-C++ compiler, awka, tawk and busybox awk came to be.

    It's a bit embarrassing to note that the exact origins of each are a bit hazy. This whole section requires further work, including the addition of links pointing to source repositories and binary distribution points.

    Awk Implementations

    Historical list of Awk implementations.

    Awk's Authors: Interviews


    categories: Misc,WhyAwk,Jan,2009,Timm

    Why Gawk?

    by T. Menzies

    "The Enlightened Ones say that....

    • You should never use C if you can do it with a script;
    • You should never use a script if you can do it with awk;
    • Never use awk if you can do it with sed;
    • Never use sed if you can do it with grep."

    Awk is a good old-fashioned UNIX filtering tool invented in the 1970s. The language is simple and Awk programs are generally very short. Awk is useful when the overhead of more sophisticated approaches is not worth the bother. Also, the cost of learning Awk is very low.

    But aren't there better scripting languages? Faster? Well, maybe yes and maybe no.

    And Awk is old (mid-70s). Aren't modern languages more productive? Well again, maybe yes and maybe no. One measure of the productivity of a language is how many lines of code are required to code up one business-level "function point". Compared to many popular languages, GAWK scores very highly:

    loc/fp   language
    ------   --------
    
        6,   excel 5
       13,   sql
       21,   awk       <================
       21,   perl
       21,   eiffel
       21,   clos
       21,   smalltalk
       29,   delphi
       29,   visual basic 5
       49,   ada 95
       49,   ai shells
       53,   c++
       53,   java
       64,   lisp
       71,   ada 83
       71,   fortran 95
       80,   3rd generation default
       91,   ansi cobol 85
       91,   pascal
      107,   2nd generation default
      107,   algol 68
      107,   cobol
      107,   fortran
      128,   c
      320,   1st generation default
      640,   machine language
     3200,   natural language
    

    Anyway, there are other considerations. Awk is real succinct, simple enough to teach, and easy enough to recode in C (if you want raw speed). For example, here's the complete listing of someone's Awk spell-checking program.

    # Load the dictionary; "> 0" stops the loop if the file is missing or unreadable.
    BEGIN     {while ((getline < "Usr.Dict.Words") > 0) dict[$0]=1}
    !dict[$1] {print $1}
    

    Sure, there's about a gazillion enhancements you'd like to make on this one but you gotta say, this is real succinct.
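
As a rough sketch of a few of those enhancements (not the original author's code), the version below checks every word on a line rather than only the first, ignores case and punctuation, and reports each unknown word once. The shell function name spell, the variable names and the tiny dictionary are all made up for this illustration.

```sh
# spell DICT [FILE...]: print each word not found in DICT, one per line.
spell() {
  dict=$1; shift
  awk -v dict="$dict" '
    BEGIN {
      # Load the word list; "> 0" stops cleanly if the file cannot be read.
      while ((getline word < dict) > 0) known[tolower(word)] = 1
      close(dict)
    }
    {
      for (i = 1; i <= NF; i++) {
        w = tolower($i)
        gsub(/[^a-z]/, "", w)        # drop punctuation and digits
        if (w != "" && !(w in known) && !(w in seen)) {
          seen[w] = 1                # report each unknown word only once
          print w
        }
      }
    }' "$@"
}

# Usage with a tiny hypothetical dictionary.
tmp=$(mktemp)
printf 'a\nlittle\nawk\ngoes\nlong\nway\n' > "$tmp"
echo 'A little awk goes a lnog way.' | spell "$tmp"    # prints lnog
rm -f "$tmp"
```

Even with these additions the checker is still about a dozen lines of Awk.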

    Awk is the cure for late execution of software syndrome (a.k.a. LESS). The symptoms of LESS are a huge time delay before a new idea is executable. Awk programmers can hack up usable systems in the time it takes other programmers to boot their IDE. And, as a result of that easy exploration, it is possible to find loopholes missed by other analysts that lead to innovative, better solutions to problems (e.g. see Ronald Loui's O(n log n) clustering tool).

    Certainly, we can drool over the language features offered by more advanced languages like pointers, generic iterators, continuations, etc etc. And Awk's lack of data structures (except num, string, and array) requires some discipline to handle properly.

    But experienced Awk programmers know that the cleverer the program, the smaller the audience gets. If it is possible to explain something succinctly in a simple language like Awk, then it is also possible that more folks will read that code.

    Finally, and this may be the most important point, it might be misguided to argue about Awk vs LanguageX in terms of the specifics of those languages. Awk programmers can't over-elaborate their solutions: they are forced to code the solution in the simplest manner possible. This push to simplicity, to the essence of the problem, can be an insightful process. Coding in Awk is like preserving fruit: you boil off everything that is superfluous, everything that needlessly bloats the material that you are working with. It is amazing how little code is required to code the core of an idea (e.g. see Darius Bacon's LISP interpreter, written in Awk).


    categories: Mascot,Jan,2010,HiroS

    A new Mascot for Awk

    Hirofumi Saito contributes a candidate Awk mascot from the Japan GNU AWK Users Club (http://gauc.no-ip.org/).


    categories: Runawk,Project,Tools,Jan,2010,AlexC

    Runawk 0.19 Released

    Download

    http://sourceforge.net/projects/runawk

    About

    runawk is a small wrapper for the AWK interpreter that helps one write standalone AWK scripts. Its main feature is to provide a module/library system for AWK which is somewhat similar to Perl's "use" command. It also allows you to select a preferred AWK interpreter and to set up the environment for your scripts. It also provides other helpful features; for example, it includes numerous useful modules.

    Major Changes IN RUNAWK-0.19.0

    • fix in runawk.c: \n was missed in "running '%s' failed: %s" error message. The problem was seen on ancient (12 years old) HP-UX
    • fix in tests/test.mk: "diff -u" is not portable (SunOS, HP-UX), DIFF_PROG variable is introduced to fix the problem
    • fix in modules/power_getopt.awk: after printing help message we should exit immediately not running END section, s/exit/exitnow/
    • new function heapsort_values in heapsort.awk module
    • new function quicksort_values in quicksort.awk module
    • new function sort_values in sort.awk module

    Author

    Aleksey Cheusov


    categories: Android,Jan,2010,MartinC

    Awk on Android

    Recently, on comp.lang.awk, Michael Sanders asked:

      Is gawk (or a reasonable version of awk) available for Google's Android? How about Nokia's N900?

      It is available for Nokia's N810 (as part of busybox) and, I would hope for the N900.

    Martin Cohen answers:

      Well, it looks like maybe it is. I did another Google search for "awk android" and this was one of the results: Downloading File: /43862/awk4j-1.6.1-android-src.zip - awk4j (AWK ...

      When I followed the link, I got the file awk4j-1.6.1-android-src.zip - when I unpacked it there were a number of files including a directory called "Sample" with a number of awk programs (with odd characters at the end of each line - I opened them in emacs) and a file named "awk4jAndroid.apk" which is, I guess, awk for Android (duh!).


    categories: Jan,2010,MichealS

    Rot-13 in Awk

    The following ROT13 is a slight modification of the example found at http://www.miranda.org/~jkominek/rot13/awk/.

    #!/bin/awk -f 
    BEGIN { 
      from = "NOPQRSTUVWXYZABCDEFGHIJKLMnopqrstuvwxyzabcdefghijklm0987654321" 
      to   = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890" 
      for (i = 1; i <= length(from); i++) { 
          letter[substr(from, i, 1)] = substr(to, i, 1) 
      } 
    } 
    { 
      for (i = 1; i <= length($0); i++) { 
          char = substr($0, i, 1) 
          if (match(char, "[a-zA-Z]|[0-9]") != 0) { 
              printf("%c", letter[char])
          } else {
              printf("%c", char)
          } 
      } 
      printf("\n") 
    } 
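
A quick way to check a table-driven cipher like this one is to run it twice: both the letter shift and the digit swap in the tables above undo themselves, so the second pass should give back the original text. The sketch below wraps the same two tables in a shell function (the function name rot13 and the `in` lookup are choices made for this illustration, not part of the listing above).

```sh
rot13() {
  awk '
    BEGIN {
      from = "NOPQRSTUVWXYZABCDEFGHIJKLMnopqrstuvwxyzabcdefghijklm0987654321"
      to   = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890"
      # Build the lookup table one character at a time.
      for (i = 1; i <= length(from); i++)
        letter[substr(from, i, 1)] = substr(to, i, 1)
    }
    {
      out = ""
      for (i = 1; i <= length($0); i++) {
        c = substr($0, i, 1)
        # Translate characters found in the table; pass the rest through.
        out = out ((c in letter) ? letter[c] : c)
      }
      print out
    }'
}

echo 'Hello, Awk 2010!' | rot13            # prints Uryyb, Njx 9101!
echo 'Hello, Awk 2010!' | rot13 | rot13    # prints Hello, Awk 2010!
```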
    

    Author

    Michael Sanders.


    categories: Os,Jan,2010,GrantC

    Network Monitoring in Awk

    Download

    Tarball (12K).

    About

    NetDraw is a suite of awk and bash scripts to log and display Internet connection activity.

    NetDraw supports live monitoring of the network interface and displays the most recent 24 hours' activity.

    The tool takes a snapshot of network activity every five minutes and updates the scrolling image.

    Written in awk and bash, netdraw uses fly to drive gd to draw the chart.

    Author

    Grant Coady


    categories: Jan,2010,HyungC

    QSE: an Embeddable Awk Interpreter

    Hyung-Hwan Chung offers QSE, an embeddable Awk.

    Download

    See QSE.

    About QSE

    QSE is a code library that implements various Unix utilities in an embeddable form and provides a set of APIs to embed them into an application. The APIs have been designed to be flexible enough to access various aspects of an embedding application and an embedded object from each other.

    By embedding a Unix utility into an application, a developer is relieved of problems caused by interacting with external programs and can have tighter control over it. Currently the library contains the following utilities:

    • AWK Interpreter
    • CUT Text Cutter
    • SED Stream Editor

    QSEAWK is an embeddable AWK interpreter and is a part of the QSE library. The interpreter implements the language described in the book The AWK Programming Language, with some extensions. Its design focuses on building a flexible and robust embedding API with minimal platform dependency. An embedding application is capable of:

    • adding new global variables and functions.
    • getting and setting the value of a global variable.
    • calling a function with or without parameters and getting its return value.
    • customizing I/O handlers for file, pipe, console I/O.
    • creating multiple interpreters independent of each other.
    • running a single script with different I/O streams independently.
    • changing language features by setting options.
    • and more

    But Why QSE?

    An advantage of embedding a scripting language into an application is that you can extend an application by changing scripts instead of recompiling the whole application. As an AWK lover, I was a bit disappointed that I could not find any embedded implementations of the AWK programming language that I could squeeze into my applications.

    QSE is designed to embed Awk into other applications, rather than to be used as a standalone tool (though that is not impossible). Why did I choose AWK as an embedded language? Simple. Both I and my clients liked it and were too lazy to learn a new scripting language.

    Also, an embedded solution is better than calling an external AWK interpreter, for these reasons:

    • There is the extra overhead of forking an external process.
    • There was no AWK interpreter on the target platform.
    • I ran into version issues with AWK interpreters.
    • I was unable to extend the interpreter itself (e.g. by adding an application-specific builtin function like hash_password).

    Hence, my conclusion was to implement an embeddable awk interpreter myself.

    Example

    One of the applications I wrote implements its password change policy in an AWK script. Before accepting a password entered by a user, the application calls the "is_password_acceptable" function with that password and uses the return value to decide whether to accept it.

    The engine is set up at application start-up, using the flexible QSEAWK API functions, with the global variables PASSWD_HISTORY_SIZE and PASSWD_HISTORY_FILE and a builtin function hash_password().

    Here is the sample AWK function:

    function is_password_acceptable(passwd)
    {
        # check the password length
        if (length(passwd) < 8) return 0;
        # reject a password made up only of letters or only of digits
        if (passwd ~ /^([[:alpha:]]+|[[:digit:]]+)$/)
            return 0;
        if (PASSWD_HISTORY_SIZE > 0)
        {
            hashed = hash_password(passwd);
            # check if the password is found in the history file
            while ((getline entry < PASSWD_HISTORY_FILE) > 0)
            {
                if (hashed == entry)
                {
                    # an entry is found in the history.
                    # reject the password
                    close (PASSWD_HISTORY_FILE);
                    return 0;
                }
            }
            close (PASSWD_HISTORY_FILE);
        }
        return 1;
    }
    

    The C application's password policy function is roughly as shown below. Note that this application uses the embedded QSEAWK interpreter in an event-driven way (the event being a password change), without entering the usual BEGIN, pattern-action, END cycle.

    int is_password_acceptable (qse_awk_rtx_t* rtx, const char* passwd)
    {
        qse_awk_val_t* ret, * arg[1];
        qse_bool_t ok;
        
        ... abbreviated ...
        
        /* transform a character string to an AWK value */
        arg[0] = qse_awk_rtx_makestrval0 (rtx, passwd);
        
        ... abbreviated ...
        
        /* increment the reference counter of arg[0] */
        qse_awk_rtx_refupval (rtx, arg[0]);
        /* call "is_password_acceptable" */
        ret = qse_awk_rtx_call (rtx, "is_password_acceptable", arg, 1);
        /* decrement the reference counter of arg[0] */
        qse_awk_rtx_refdownval (rtx, arg[0]);
        
        ... abbreviated ...
        
        /* get the boolean value from the return value */
        ok = qse_awk_rtx_valtobool (rtx, ret);
        /* decrement the reference counter of the return value */
        qse_awk_rtx_refdownval (rtx, ret);
        /* accept or reject? */
        return ok? 0: -1;
    }
    
    In the end, I no longer need to recompile and redeploy the whole application whenever a client asks for a password policy change.

    categories: GUI,Jan,2010,MichaelS

    AWK GUIs for Windows

    Here's a dirt simple method of sprucing up your AWK output under Windows.

    Requires Windows and the Windows Script Host (WSH), both of which are native to any modern Windows installation.

    To use this method, follow these three steps:

    1. Save a script like the following to a file with an extension of 'HTA', e.g. 'example.hta' (HTA, meaning 'HTML Application', is a proprietary format for Internet Explorer).
    2. Edit the line shown below containing 'program.awk datafile' and replace with your AWK program file, and your datafile.
    3. Double click your HTA file to run it.

    Example

    This example will capture the output of your AWK program and render that output dynamically to an HTML stream within a graphical window.

    <html>
       <head>
          <title>My application</title>
    
          <hta:application
          id="MyApp"
          applicationName="My application"
          border="thick"
          borderStyle="normal"
          caption="yes"
          contextMenu="yes"
          icon=""
          innerBorder="no"
          maximizeButton="yes"
          minimizeButton="yes"
          navigable="yes"
          scroll="yes"
          scrollFlat="no"
          selection="yes"
          showInTaskBar="yes"
          singleInstance="yes"
          sysMenu="yes"
          version="1.0"
    	  windowState="normal">
       </head>
    
       <body>
          <script type="text/vbscript">
             Set WshShell = CreateObject("Wscript.shell")
             Set objExec = WshShell.Exec("%comspec% /c gawk -f program.awk datafile")
             output = objExec.StdOut.ReadAll
             document.write("<pre>" & output & "</pre>")
          </script>
    
       </body>
    </html>
    

    Author

    Michael Sanders: http://topcat.hypermart.net.


    categories: Jan,2010,MartinC

    Very, Very, Very Long Strings in Gawk

    In this discussion from comp.lang.awk, Martin Cohen builds a really, really, really long string in Gawk (300 million characters). He writes....

    I had to extract 25-bit fields from a 90MB binary file, with frames of 10,000 fields indicated by a 33-bit sync value. The words I was interested in were indicated by being preceded by a special tag word.

    My first step was to convert the binary file to hex text using od. I then wrote some gawk code to read the text file and extract the (32-bit) words preceded by the tag word. There were 9 million of them.

    I concatenated them into a single string of 72 million hex characters (had to do byte-swapping along the way), and then, one character at a time, converted that into a string of 0's and 1's 300 million characters long. I could then easily (using index) search for the sync pattern (independent of any word boundaries) and find the data I wanted.

    The total run time was just under 7 minutes (under Red Hat 5.1).

    Some optimizations I had to do:

    • To build up the string of 9 million hex words, I had to group them 256 words at a time before concatenating them to the big string. When I just did one word at a time, it took forever - I had to stop it.
    • Similarly, when converting the hex to binary, I converted groups of 256 characters at a time before appending them to the big binary string.
    • Thinking about it now, I could probably combine the gathering of the hex words with the conversion to binary - my program was a revision of one where that combining wasn't done.
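
    The batching described in the first two bullets can be sketched as follows. This is only an illustration, not Martin Cohen's actual program: the function name hex2bits and the variables chunk, big, pending and CHUNK are made up for this example, and the batch size of 256 is simply the figure quoted above.

```shell
# Hypothetical sketch: convert hex digits to a string of 0s and 1s,
# appending to the big string once per batch instead of once per character.
hex2bits() {
  awk '
  BEGIN {
      # lookup table: hex digit -> 4-bit string
      n = split("0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111", b, " ")
      for (i = 1; i <= n; i++)
          bits[substr("0123456789abcdef", i, 1)] = b[i]
      CHUNK = 256                      # hex characters per batch
  }
  {
      hex = tolower($0)
      len = length(hex)
      for (i = 1; i <= len; i++) {
          chunk = chunk bits[substr(hex, i, 1)]
          if (++pending == CHUNK) {    # batch is full: append it in one go
              big = big chunk
              chunk = ""
              pending = 0
          }
      }
  }
  END {
      big = big chunk                  # flush the last partial batch
      print big
  }'
}

printf 'a5\n' | hex2bits   # prints 10100101
```

    Once the bit string exists, a sync pattern can be located with index(big, pattern), regardless of word boundaries.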

    Anyway, it's nice that gawk can handle really long strings.

    Author

    Martin Cohen

    categories: Getline,Tips,Jan,2009,EdM

    Use (and Abuse) of Getline

    by Ed Morton (and friends)

    The following summary, composed to address the recurring issue of getline (mis)use, is based primarily on information from the book "Effective Awk Programming", Third Edition, by Arnold Robbins (http://www.oreilly.com/catalog/awkprog3), with review and additional input from many of the comp.lang.awk regulars, including

    • Steve Calfee,
    • Martin Cohen,
    • Manuel Collado,
    • Jürgen Kahrs,
    • Kenny McCormack,
    • Janis Papanagnou,
    • Anton Treuenfels,
    • Thomas Weidenfeller,
    • John LaBadie and
    • Edward Rosten.

    Getline

    getline is fine when used correctly (see below for a list of those cases), but it's best avoided by default because:

    1. It allows people to stick to their preconceived ideas of how to program rather than learning the easier way that awk was designed to read input. It's like C programmers continuing to do procedural programming in C++ rather than learning the new paradigm and the supporting language constructs.
    2. It has many insidious caveats that come back to bite you either immediately or in future. The succeeding discussion captures some of those and explains when getline IS appropriate.

    As the book "Effective Awk Programming", Third Edition, by Arnold Robbins (http://www.oreilly.com/catalog/awkprog3), which provides much of the source for this discussion, says:

      "The getline command is used in several different ways and should not be used by beginners. ... come back and study the getline command after you have reviewed the rest ... and have a good knowledge of how awk works."

    Variants

    The following summarises the eight variants of getline applications, listing which variables are set by each one:

    Variant                 Variables Set 
    -------                 -------------
    getline                 $0, ${1...NF}, NF, FNR, NR, FILENAME 
    getline var             var, FNR, NR, FILENAME 
    getline < file          $0, ${1...NF}, NF 
    getline var < file      var 
    command | getline       $0, ${1...NF}, NF 
    command | getline var   var 
    command |& getline      $0, ${1...NF}, NF 
    command |& getline var  var 
    

    The "command |& ..." variants are GNU awk (gawk) extensions. gawk also populates the ERRNO builtin variable if getline fails.

    Although calling getline is very rarely the right approach (see below), if you need to do it the safest ways to invoke getline are:

    if/while ( (getline var < file) > 0) 
    if/while ( (command | getline var) > 0) 
    if/while ( (command |& getline var) > 0) 
    

    since those do not affect any of the builtin variables and they allow you to correctly test for getline succeeding or failing. If you need the input record split into separate fields, just call "split()" to do that.
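
    As a minimal sketch of that safe form, the following reads a file with "getline var < file", splits each record by hand with split(), and uses the saved return value to tell end-of-file from a read error. The function name second_fields, the variable names and the colon separator are assumptions chosen for this example only.

```shell
# Hypothetical helper: print the second colon-separated field of each
# line of the file named in $1, without touching $0, NF, NR or FNR.
second_fields() {
  awk -v file="$1" '
  BEGIN {
      while ((e = (getline line < file)) > 0) {
          n = split(line, f, ":")      # split by hand; $0 and NF are untouched
          if (n >= 2) print f[2]
      }
      if (e < 0) {                     # -1: the file could not be opened or read
          print "cannot read " file > "/dev/stderr"
          exit 1
      }
      close(file)
  }'
}
```

    Called as "second_fields /etc/passwd", this would print the second field of each line of that file; a missing file produces the error message and a non-zero exit status instead.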

    Caveats

    Users of getline have to be aware of the following non-obvious effects of using it:

    1. Normally FILENAME is not set within a BEGIN section, but a non-redirected call to getline will set it.
    2. Calling "getline < FILENAME" is NOT the same as calling "getline". The second form will read the next record from FILENAME while the first form will read the first record again.
    3. Calling getline without a var to be set will update $0 and NF so they will have a different value for subsequent processing than they had for prior processing in the same condition/action block.
    4. Many of the getline variants above set some but not all of the builtin variables, so you need to be very careful that it's setting the ones you need/expect it to.
    5. According to POSIX, `getline < expression' is ambiguous if expression contains unparenthesized operators other than `$'; for example, `getline < dir "/" file' is ambiguous because the concatenation operator is not parenthesized. You should write it as `getline < (dir "/" file)' if you want your program to be portable to other awk implementations.
    6. In POSIX-compliant awks (e.g. gawk --posix) a failure of getline (e.g. trying to read from a non-readable file) will be fatal to the program, otherwise it won't.
    7. Unredirected getline can defeat the simple and usual rule to handle input file transitions:
      FNR==1 { ... start of file actions ... }
      
      File transitions can occur at getlines, so FNR==1 needs to also be checked after each unredirected (from a specific file name) getline. e.g. if you want to print the first line of each of these files:
      $ cat file1 
      a 
      b 
      $ cat file2 
      c 
      d 
      
      you'd normally do:
      $ awk 'FNR==1{print}' file1 file2 
      a 
      c 
      
      but if a "getline" snuck in, it could have the unexpected consequence of skipping the test for FNR==1 and so not printing the first line of the second file.
      $ awk 'FNR==1{print}/b/{getline}' file1 file2 
      a 
      
    8. Using getline in the BEGIN section to skip lines makes your program difficult to apply to multiple files. e.g. with data like...
      some header line 
      ---------------- 
      data line 1 
      data line 2 
      ... 
      data line 10000 
      
      you may consider using...
      BEGIN { getline header; getline } 
      { whatever_using_header_and_data_on_the_line() } 
      
      instead of...
      FNR == 1 { header = $0 } 
      FNR < 3 { next } 
      { whatever_using_header_and_data_on_the_line() } 
      
      but the getline version would not work on multiple files since the BEGIN section would only be executed once, before the first file is processed, whereas the non-getline version would work as-is. This is one example of the common case where the getline command itself isn't directly causing the problem, but the type of design you can end up with if you select a getline approach is not ideal.

    Applications

    getline is an appropriate solution for the following:

    1. Reading from a pipe, e.g.:
      command = "ls" 
      while ( (command | getline var) > 0) { 
          print var 
      } 
      close(command) 
      
    2. Reading from a coprocess, e.g.:
      command = "LC_ALL=C sort" 
      n = split("abcdefghijklmnopqrstuvwxyz", a, "") 
      for (i = n; i > 0; i--) 
           print a[i] |& command 
      close(command, "to") 
      while ((command |& getline var) > 0) 
          print "got", var 
      close(command) 
      
    3. In the BEGIN section, reading some initial data that's referenced during processing multiple subsequent input files, e.g.:
      BEGIN { 
         while ( (getline var < ARGV[1]) > 0) { 
                data[var]++ 
         } 
         close(ARGV[1]) 
         ARGV[1]="" 
       } 
       $0 in data 
      
    4. Recursive-descent parsing of an input file or files, e.g.:
      awk 'function read(file) { 
                  while ( (getline < file) > 0) { 
                      if ($1 == "include") { 
                           read($2) 
                      } else { 
                           print > ARGV[2] 
                      } 
                  } 
                  close(file) 
            } 
            BEGIN{ 
               read(ARGV[1]) 
               ARGV[1]="" 
               close(ARGV[2]) 
           }1' file1 tmp 
      

    In all other cases, it's clearest, simplest, least error-prone, and easiest to maintain to let awk's normal text-processing read the records. In the case of application 3, whether to use the BEGIN+getline approach or just collect the data within the awk condition/action part after testing for the first file is largely a style choice.

    Application 1 above calls the UNIX command "ls" to list the current directory contents, then prints the result one line at a time.

    Application 2 above writes the letters of the alphabet in reverse order, one per line, down the two-way pipe to the UNIX "sort" command. It then closes the write end of the pipe, so that sort receives an end-of-file indication. This causes sort to sort the data and write the sorted data back to the gawk program. Once all of the data has been read, gawk terminates the coprocess and exits. This is particularly necessary in order to use the UNIX "sort" utility as part of a coprocess since sort must read all of its input data before it can produce any output. The sort program does not receive an end-of-file indication until gawk closes the write end of the pipe. Other programs can be invoked as just:

    command = "program" 
    do { 
          print data |& command 
          command |& getline var 
    } while (data left to process) 
    close(command) 
    

    Note that calling close() with a second argument is also gawk-specific.

    Application 3 above reads every record of the first file passed as an argument to awk into an array and then, for every subsequent file passed as an argument, prints every record from that file that matches any of the records that appeared in the first file (and so are stored in the "data" array). This could alternatively have been implemented as:

    # fails if first file is empty 
    NR==FNR{ data[$0]++; next } 
    $0 in data 
    

    or:

    FILENAME==ARGV[1] { data[$0]++; next } 
    $0 in data 
    

    or:

    FILENAME=="specificFileName" { data[$0]++; next } 
    $0 in data 
    

    or (gawk only):

    ARGIND==1 { data[$0]++; next } 
    $0 in data
    

    Application 4 above not only expands all the lines that say "include subfile", but by writing the result to a tmp file, resetting ARGV[1] (the highest level input file) and not resetting ARGV[2] (the tmp file), it then lets awk do any normal record parsing on the result of the expansion since that's now stored in the tmp file. If you don't need that, just do the "print" to stdout and remove any other references to a tmp file or ARGV[2]. In this case, since it's convenient to use $1 and $2, and no other part of the program references any builtin variables, getline was used without populating an explicit variable. This method is limited in its recursion depth to the total number of open files the OS permits at one time.

    Tips

    The following tips may help if, after reading the above, you discover you have an appropriate application for getline or if you're looking for an alternative solution to using getline:

    1. If you need to distinguish between a normal EOF and a read or opening error, you have to use gawk's ERRNO variable or code it as:
      if/while ( (e = (getline var < file)) > 0) { ... }
      close(file)
      if (e < 0) some_error_handling
    2. Don't forget to close() any file you open for reading. The common idiom for getline and other methods of opening files/streams is:
      cmd="some command" 
      do something with cmd 
      close(cmd) 
      
    3. A common misapplication of getline is to just skip a few lines of an input file. The following discusses how to do that without using getline with all that implies as discussed above. This discussion builds on the common awk idiom to "decrement a variable to zero" by putting the decrement of the variable as the second term in an "and" clause with the first part being the variable itself, so the decrement only occurs if the variable is non-zero:
      • Print the Nth record after some pattern:
        awk 'c&&!--c;/pattern/{c=N}' file 
      • Print every record except the Nth record after some pattern:
        awk 'c&&!--c{next}/pattern/{c=N}1' file
      • Print the N records after some pattern:
        awk 'c&&c--;/pattern/{c=N}' file 
      • Print every record except the N records after some pattern:
        awk 'c&&c--{next}/pattern/{c=N}1' file

    For example, suppose you want to print $0 for the second record following the record that contains some pattern, e.g. the number 3 (the sample input below has no blank lines and every line starts in the first column):

    $ cat file 
    line 1 
    line 2 
    line 3 
    line 4 
    line 5 
    line 6 
    line 7 
    line 8 
    $ awk '/3/{getline;getline;print}' file 
    line 5 
    

    That works just fine. Now let's see the concise way to do it without getline:

    $ awk 'c&&!--c;/3/{c=2}' file 
    line 5

    It's not quite so obvious at a glance what that does, but it uses an idiom that most awk programmers could do well to learn and it is briefer and avoids all those getline caveats.

    Now let's say we want to print the 5th line after the pattern instead of the 2nd line. Then we'd have:

    $ awk '/3/{getline;getline;getline;getline;getline;print}' file 
    line 8 
    $ awk 'c&&!--c;/3/{c=5}' file 
    line 8
    

    i.e. we have to add a whole series of additional getline calls to the getline version, as opposed to just changing the counter from 2 to 5 for the non-getline version. In reality, you'd probably completely rewrite the getline version to use a loop:

    $ awk '/3/{for (c=1;c<=5;c++) getline; print}' file 
    line 8

    Still not as concise as the non-getline version, has all the getline caveats and required a redesign of the code just to change a counter.

    Now let's say we also have to print the word "Eureka" if the number 4 appears in the input file. With the getline version, you now have to do something like:

    $ awk '/3/{for (c=1;c<=5;c++) { getline; if ($0 ~ /4/) print "Eureka!" } 
    print}' file 
    Eureka! 
    line 8

    whereas with the non-getline version you just have to do:

    $ awk 'c&&!--c;/3/{c=5}/4/{print "Eureka!"}' file 
    Eureka! 
    line 8

    i.e. with the getline version, you have to work around the fact that you're now processing records outside of the normal awk work-loop, whereas with the non-getline version you just have to drop your test for "4" into the normal place and let awk's normal record processing deal with it like it always does.

    Actually, if you look closely at the above you'll notice we just unintentionally introduced a bug in the getline version. Consider what would happen in both versions if 3 and 4 appear on the same line. The non-getline version would behave correctly, but to fix the getline version, you'd need to duplicate the condition somewhere, e.g. perhaps something like this:

    $ awk '/3/{for (c=1;c<=5;c++) { if ($0 ~ /4/) print "Eureka!"; getline } 
    if ($0 ~ /4/) print "Eureka!"; print}' file 
    Eureka! 
    line 8 
    

    Now consider how the above would behave when there aren't 5 lines left in the input file or when the last line of the file contains both a 3 and a 4. i.e. there are still design questions to be answered and bugs that will appear at the limits of the input space.

    Ignoring those bugs since this is not intended as a discussion on debugging getline programs, let's say you no longer need to print the 5th record after the number 3 but still have to do the Eureka on 4. With the getline version, you'd strip out the test for 3 and the getline stuff to be left with:

    $ awk '{if ($0 ~ /4/) print "Eureka!"}' file 
    Eureka!
    

    which you'd then presumably rewrite as:

    $ awk '/4/{print "Eureka!"}' file 
    Eureka! 
    

    which is what you get just by removing everything involving the test for 3 and counter in the non-getline version (i.e. "c&&!--c;/3/{c=5}"):

    $ awk '/4/{print "Eureka!"}' file 
    Eureka! 
    

    i.e. again, one small requirement change required a complete redesign of the getline code, but just the absolute minimum necessary tweak to the non-getline version.

    So, what you see above in the getline case was significant redesign required for every tiny requirement change, much larger amounts of handwritten code required, insidious bugs introduced during development and challenging design questions at the limits of your input space, whereas the non-getline version always had less code, was much easier to modify as requirements changed, and was much more obvious, predictable, and correct in how it would behave at the limits of the input space.
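
    The four counter idioms from the tips can be tried side by side on the same eight-line sample. This is a small sketch: the shell function names are invented for this example, the pattern is fixed to /3/ as in the walk-through, and N is passed in with -v. Note that the two "everything except" variants need a trailing 1 (an always-true pattern) so that the records which are not skipped get printed.

```shell
# Sample input: eight lines, "line 1" .. "line 8".
sample() { awk 'BEGIN { for (i = 1; i <= 8; i++) print "line " i }'; }

# Hypothetical wrappers around the four idioms.
nth_after()       { awk -v N="$1" 'c&&!--c; /3/{c=N}'; }         # only the Nth record after
all_but_nth()     { awk -v N="$1" 'c&&!--c{next} /3/{c=N} 1'; }  # everything except that record
n_after()         { awk -v N="$1" 'c&&c--; /3/{c=N}'; }          # the N records after
all_but_n_after() { awk -v N="$1" 'c&&c--{next} /3/{c=N} 1'; }   # everything except those records

sample | nth_after 2   # prints: line 5
sample | n_after 2     # prints: line 4, then line 5
```

    Changing the requirement only means changing the argument, e.g. "sample | nth_after 5" prints line 8.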


    categories: Forloop,Tips,Jan,2009,Jimh

    Never write for(i=1;i<=n;i++).. again?

    by Jim Hart

    I've written this kind of thing

    n = split(something,arr,/re/)
    for(i=1;i<=n;i++) {
       print arr[i]
    }
    

    so often, it's tedious. I like this better:

    n = split(something,arr,/re/)
    while(n--) {
       print arr[++i]
    }
    

    Easier to type. And, in cases where front-to-back or back-to-front doesn't matter, it's even simpler:

    # copy a number-indexed array whose indices run from 0 to n-1,
    # assuming n contains the number of elements
    
    while(n--) arr2[n] = arr1[n]
    

    And, yes,

    for(i in arr1) arr2[i] = arr1[i]
    

    works, too. But, some loops don't involve arrays. :-)
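
    Here is a small runnable sketch of the countdown loop, under two assumptions worth stating: the index variable must start at zero and be pre-incremented (arrays built by split() start at 1), and the loop uses up n, so save it first if it is needed later. The function name countdown_demo is made up for this example.

```shell
# Hypothetical demo: walk a split() array front to back with while(n--).
countdown_demo() {
  awk 'BEGIN {
      n = split("tim tom tam", arr, " ")
      i = 0
      while (n--)            # the body runs once per element
          print arr[++i]     # pre-increment: first index used is 1
  }'
}

countdown_demo   # prints tim, tom, tam on separate lines
```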

    Want more?

    This tip has been discussed on comp.lang.awk.


    categories: Contribute,Jan,2009,Admin

    How to Contribute

    This web site is a front end to a repository of Awk code. The site, and the code, is maintained by the international awk community (which includes you) so there are many ways you can contribute:

    Link to this site from your home page

    Using this logo, link to http://awk.info:

    (By the way, our current logo is pretty lame. Want to contribute a better one? Please, be our guest!)

    Improve a Page

    Found a Typo? A Rendering Problem? Want to clarify something?

    Want to add some links?

    See the above instructions.

    How to Write Pages for this Site

    1. Write the page.
    2. Test the page by placing it on a publicly readable site, then see if it renders ok.
    3. Email the url of that page to mail@awk.info. Do NOT send the page.

    When writing a page, please follow these guidelines:

    • Do not use <hr> tags: these are reserved for dividing pages in a multi-page view.
    • Use only one <h1> tag at the top of page. Everything else should be <h2> or below.
    • Try to avoid tricky CSS/HTML styling. Vanilla HTML is best.
    • The page you write will end up being rendered as the middle pane of this site (around 550 pixels wide). So don't write wide pages.
    • If you include code samples, note that our CSS wraps pre-formatted code if it gets too wide. For example, at the time of this writing, the following pre-formatted text gets ugly after about 75 characters:
              1         2         3         4         5         6         7
    012345678901234567890123456789012345678901234567890123456789012345678901234567890
    

    Contributing Code

    To contribute code, zip up the directory and mail it to

    Coding Standards

    All function and file names are global to our code so please ensure your new function/file name does not clobber an old one.

    Optionally, you might consider adding:

    Add a Library Function File

    In the language of this site, a function file is a 100% standalone file containing one or more functions with no dependencies on other files. Note that if your function file depends on other files, then it becomes a package (see below).

    Functions are stored in a file called myfunc.awk.

    Add a Package

    In the language of this site, a package is a file that depends on other files (and the other files may depend on yet others, recursively).

    Following a recent discussion in comp.lang.awk, we say that these dependencies are commented with

    #use file.awk 
    

    where file.awk is some file (e.g. a file in the current directory).

    Note that file.awk will be loaded before the file containing the reference to #use file.awk.
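
    One way such "#use" comments could be resolved into a load order is sketched below. This is not the site's actual loader: the function name use_order, the seen array and the rule that paths are taken relative to the current directory are all assumptions made for illustration.

```shell
# Hypothetical sketch: print files in load order, dependencies first,
# by following "#use file.awk" comments recursively.
use_order() {
  awk '
  function load(file,    line, dep) {
      if (file in seen) return         # load each file only once
      seen[file] = 1
      while ((getline line < file) > 0)
          if (line ~ /^#use[ \t]+/) {
              dep = line
              sub(/^#use[ \t]+/, "", dep)
              sub(/[ \t]+$/, "", dep)
              load(dep)                # dependency goes before this file
          }
      close(file)
      print file
  }
  BEGIN { for (i = 1; i < ARGC; i++) load(ARGV[i]) }
  ' "$@"
}
```

    For a package a.awk that uses b.awk, which in turn uses c.awk, "use_order a.awk" would list c.awk, b.awk, a.awk in that order.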


    categories: Contribute,Jan,2009,Timm

    Pretty Print AWK Code

    The code that renders the awk.info web site can "pretty print" awk code. For example:

    To enable that pretty print, add some html syntax inside your code and apply the following conventions.

    Preview Engine

    Note that if you want to see your code "looking pretty", you can check how it renders using our preview tool:

    http://awk.info/?awk:urlWithoutHTTPprefix
    

    For example, the file http://menzies.us/tmp/xx.awk can be previewed using http://awk.info/?awk:menzies.us/tmp/xx.awk

    Contributing Pretty Code

    Once you've got it "looking pretty", please consider contributing that code to awk.info, so our code library can grow. To do so, either email mail@awk.info with the URL of your pretty code or zip up the files and email them across.

    HTML-based Commenting Conventions

    The first paragraph of the file will be ignored. Use this first para for copyright notices or comments about down-in-the weeds trivia. Note: the first para ends with one blank line.

    The next paragraph should start with

    #.H1 <join>Title</join>

    The code should be topped and tailed as follows:

    #<pre>
    code
    #</pre>
    

    All other comment lines should start with a single "#" at front-of-line. These comment characters will be stripped away by the awk.info renderer.

    Awk.info's renderer adopts the following html shorthand. If a line starts with

    #.WORD other words 
    

    then this is replaced with

    <WORD> other words</WORD>
    

    If no other words follow #.WORD then the line becomes just <WORD>
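
    The shorthand rule above can be sketched in a few lines of awk. This is an illustration of the convention as described here, not the actual awk.info renderer; the function name render_shorthand is made up, and the special #.IN, #.CODE and #.BODY forms are not handled.

```shell
# Hypothetical sketch of the "#.WORD other words" shorthand.
render_shorthand() {
  awk '
  match($0, /^#\.[A-Za-z0-9]+/) {
      tag  = substr($0, 3, RLENGTH - 2)    # the WORD after "#."
      rest = substr($0, RLENGTH + 1)       # whatever follows it
      if (rest ~ /[^ \t]/) print "<" tag ">" rest "</" tag ">"
      else                 print "<" tag ">"
      next
  }
  /^#/ { sub(/^#/, ""); print; next }      # strip the comment character
  { print }                                # code lines pass through
  '
}

printf '#.H2 Usage\n' | render_shorthand   # prints <H2> Usage</H2>
```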

    Awk.info's renderer supports a few HTML extensions:

    • #.IN path includes a file found in the LAWKER repository at some path inside the trunk.
    • #.CODE path includes the contents of path, wrapped in <pre> tags, and prefixed by the path.
    • #.BODY path is the same as #.CODE but it skips the first paragraph (this is useful when the first paragraph includes tedious details you want to hide from the user).
    • Note that, for #.IN, #.CODE, #.BODY, the path must appear after a single space.

    That's it. Now you can pretty print your code on the web just by adding a little html in the comments.


    categories: Contribute,Jan,2009,Timm

    Show Unit Tests

    Ideally, all code in our code repository comes with unit tests:

    • Either demo scripts to show off functionality
    • Or a regression suite that checks that new changes do not mess up existing code.

    Accordingly, code offered to this site can contain unit tests, using the methods described in this page.

    But before going on, we stress that awk.info gratefully accepts awk contributions in any form. That is, including unit tests with code is optional.

    Files

    If your code is in directory yourcode then create a sub-directory yourcode/eg

    Write a test in a file yourcode/eg/yourtest. Divide that test into two parts:

    1. In the first paragraph of that file, write any tedious set up required to get the system ready for the test.
    2. In the second, third, etc paragraph, write the code that shows the test
    3. For example, in the following code, the real test comes after some low-level environmental set up:
      # assumes
      # - the LAWKER trunk has been checked out and
      # - .bash_profile contains: export Lawker="$HOME/svns/lawker/fridge"
      . $Lawker/lib/bash/setup
      
      gawk -f join.awk --source '
      BEGIN { split("tim tom tam",a)
              print join(a,2)
      }'
      

    Write the expected output of that test case in yourcode/eg/yourtest.out

    Regression Tests

    The above file conventions mean that an automatic tool can run over the entire code base and perform a regression test (checking if all the tests generate all the *.out files).
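
    One way such a tool might look is sketched below. This is only an assumption about how a regression runner could work, not the tool used at awk.info: the function name run_regression is invented, and the sketch assumes each test file can be run with sh (the sample test above sources a bash setup file, so a real runner may need bash).

```shell
# Hypothetical regression runner: run every <dir>/eg/<test> and compare its
# output with <dir>/eg/<test>.out. A sketch only, not the awk.info tool.
run_regression() {
  dir=$1
  failed=0
  for t in "$dir"/eg/*; do
    case $t in *.out) continue ;; esac           # skip the expected-output files
    [ -f "$t" ] && [ -f "$t.out" ] || continue   # need both test and expectation
    if [ "$(sh "$t" 2>&1)" = "$(cat "$t.out")" ]; then
      echo "PASS $t"
    else
      echo "FAIL $t"
      failed=1
    fi
  done
  return $failed
}
```

    Calling run_regression yourcode would then print one PASS or FAIL line per test and return non-zero if any test failed.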

    Displaying the Tests (and Output)

    Another advantage of the above scheme is that you can use the tests to document your code.

    To show the test case, add the following into your .awk file:

     #.BODY       yourcode/eg/yourtest
     #.CODE       yourcode/eg/yourtest.out
    

    Then zip the directory yourcode (including yourcode/eg) and send it to awk.info. Once we install those files on our site then when awk.info displays that file, the test case trivia is hidden and the users only see the essential details. For an example of this, see http://awk.info/?gawk/array/join.awk.


    categories: Learn,Jan,2009,Admin

    Learning Awk

    Short Overviews

    The following list is sorted by newbie-ness (so best to start at the top):

    Longer Tutorials

    The following list is sorted by the number of times this material is tagged at delicious.com (most tagged at top):

    Other Stuff


    categories: Learn,Jan,2009,Ronl

    Teaching Awk

    (For tutorial material on Awk, see the Learning Awk page.)

    R. Loui loui@ai.wustl.edu is Associate Professor of Computer Science, at Washington University in St. Louis. He has published in AI Journal, Computational Intelligence, ACM SIGART, AI Magazine, AI and Law, the ACM Computing Surveys Symposium on AI, Cognitive Science, Minds and Machines, Journal of Philosophy.

    Whenever Ronald Loui teaches GAWK, he gives the students the choice of learning PERL instead. Ninety percent will choose GAWK after looking at a few simple examples of each language (samples shown below). Those who choose PERL do so because someone told them to learn PERL.

    After one laboratory, more than half of the GAWK students are confident with their GAWK skills and can begin designing. Almost no student can become confident in PERL that quickly.

    After a week, 90% of those who have attempted GAWK have mastered it, compared to fewer than 50% of PERL students attaining similar facility with the language (it would be unfair to require one to `master' PERL).

    By the end of the semester, over 90% who have attempted GAWK have succeeded, and about two-thirds of those who have attempted PERL have succeeded.

    To be fair, within a year, half of the GAWK programmers have also studied PERL. Most are doing so in order to read PERL and will not switch to writing PERL. No one who learns PERL migrates to GAWK.

    PERL and GAWK appear to have similar programming, development, and debugging cycle times.

    Finally, there seems to be a small advantage for GAWK over PERL, after a year, for the programmer's willingness to begin a new program. That is, both GAWK and PERL programmers tend to enjoy writing a lot of programs, but GAWK has the slight edge here.


    categories: Learn,Jan,2009,Timm

    Four Keys to Gawk

    by T. Menzies

    Imagine Gawk as a kind of a cut-down C language with four tricks:

    1. self-initializing variables
    2. pattern-based programming
    3. regular expressions
    4. associative arrays.

    What do all these do? Well....

    Self-initializing variables.

    You don't need to define variables; they appear as you use them.

    There are only three types: strings, numbers, and arrays.

    To ensure a number is a number, add zero to it.

    x=x+0
    

    To ensure a string is a string, add an empty string to it.

    x= x "" "the string you really want to add"
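
    A small sketch of both coercions, run from the shell. The variable names (coerced, n, s, unset) are made up for the example.

```shell
coerced=$(awk 'BEGIN {
  x = "3 apples"        # a string that happens to start with a number
  n = x + 0             # adding zero forces a number: 3
  y = 10
  s = y ""              # appending "" forces a string: "10"
  print n, (s < 9 ? "string compare" : "numeric compare")
  print unset + 0, "[" unset "]"   # never-assigned variables act as 0 and ""
}')
printf '%s\n' "$coerced"
```

    Because s is a string, s < 9 is decided character by character ("1" sorts before "9"), which is the kind of surprise the x+0 trick avoids.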
    

    To ensure your variables aren't global, use them within a function and add more variables to the call. For example if a function is passed two variables, define it with two PLUS the local variables:

     function haslocals(passed1,passed2,         local1,local2,local3) {
            passed1=passed1+1  # scalar: only the local copy changes (arrays would change externally)
            local1=7           # only changed locally
     }
    

    Note that it's good practice to add white space between passed and local variables.
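
    The sketch below checks how this behaves. Awk passes scalars by value and arrays by reference, and the extra parameters stay local to the function. The names bump, count, seen and tmp are invented for the example.

```shell
scope=$(awk '
function bump(n, arr,     tmp) {   # n and arr are passed in; tmp is a local
  n = n + 1          # scalar: only the local copy changes
  arr["hits"]++      # array: the caller sees this change
  tmp = "scratch"    # local: gone when the function returns
}
BEGIN {
  count = 1
  seen["hits"] = 0
  bump(count, seen)
  print count, seen["hits"], "[" tmp "]"
}')
printf '%s\n' "$scope"
```

    So count is still 1 after the call, seen["hits"] has become 1, and the global tmp was never touched.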

    Pattern-based programming

    Gawk programs can contain functions AND pattern/action pairs.

    If the pattern is satisfied, the action is called.

     /^\.P1/ { if (p != 0) print ".P1 after .P1, line", NR;
               p = 1;
             }
     /^\.P2/ { if (p != 1) print ".P2 with no preceding .P1, line", NR;
               p = 0;
             }
     END     { if (p != 0) print "missing .P2 at end" }
    

    Two magic patterns are BEGIN and END. These are true before and after all the input files are read. Use END for end actions (e.g. final reports) and BEGIN for start up actions such as initializing default variables, setting the field separator, resetting the seed of the random number generator:

     BEGIN {
            while ((getline < "Usr.Dict.Words") > 0) #slurp in dictionary; stops on EOF or error
                    dict[$0] = 1
            FS=",";                            #set field separator
            srand();                           #reset random seed
            Round=10;                          #always start globals with U.C.
     }
    

    The default action is {print $0}; i.e. print the whole line.

    The default pattern is 1; i.e. true.

    Patterns are checked, top to bottom, in source-code order.

    Patterns can contain regular expressions. In the above example /^\.P1/ means "front of line followed by a full stop followed by P1". Regular expressions are important enough for their own section.
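
    Here is a small sketch that puts these rules together: a BEGIN block, a pattern with no action, an action with no pattern, and an END block. The sample data and the variable name report are made up.

```shell
report=$(printf '%s\n' 'alice,30' 'bob,25' 'carol,41' | awk '
BEGIN  { FS = "," }                 # runs before any input is read
$2 > 28                             # pattern only: the default action prints the line
       { total += $2 }              # action only: runs for every line
END    { print "total", total }     # runs after the last line
')
printf '%s\n' "$report"
```

    For each input line the rules are tried top to bottom, so a line is printed (if its second field exceeds 28) before it is added to the total.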

    A Small Example

    Ok, so now we know enough to explain a simple report function. How does hist.awk work in the following?

     
    % cat /etc/passwd | grep -v \# | cut -d: -f 6|sort |
                        uniq -c | sort -r -n | gawk -f hist.awk
    
                  **************************  26 /var/empty
                                          **   2 /var/virusmails
                                          **   2 /var/root
                                           *   1 /var/xgrid/controller
                                           *   1 /var/xgrid/agent
                                           *   1 /var/teamsserver
                                           *   1 /var/spool/uucp
                                           *   1 /var/spool/postfix
                                           *   1 /var/spool/cups
                                           *   1 /var/pcast/server
                                           *   1 /var/pcast/agent
                                           *   1 /var/imap
                                           *   1 /Library/WebServer
    

    hist.awk reads the largest count from line one (when NR==1), then works out a scale that fits that count into some maximum Width. For each line, it then prints the line ($0) with some stars at front.

    NR==1  { Width = Width ? Width : 40   # sets Width if it is missing
             Scale = $1 > Width ? Width / $1 : 1   # shrink only if the biggest count is too wide
           }
           { Stars=int($1*Scale);  
             print str(Width - Stars," ") str(Stars,"*") $0 
           }
    
    # note that, in the following "tmp" is a local variable
    function str(n,c, tmp) { # returns a string, size "n", of all  "c" 
        while((n--) > 0 ) tmp= c tmp 
        return tmp 
    }
    

    Regular Expressions

    Do you know what these mean?

    • /^[ \t\n]*/
    • /[ \t\n]*$/
    • /^[+-]?([0-9]+[.]?[0-9]*|[.][0-9]+)([eE][+-]?[0-9]+)?$/

    Well, the first two are leading and trailing blank spaces on a line and the last one is the definition of an IEEE-standard number written as a regular expression. Once we know that, we can do a bunch of common tasks like trimming away white space around a string:

      function trim(s,     t) {
        t=s;
        sub(/^[ \t\n]*/,"",t);
        sub(/[ \t\n]*$/,"",t);
        return t
     }
    

    or recognize something that isn't a number:

    if ( $i !~ /^[+-]?([0-9]+[.]?[0-9]*|[.][0-9]+)([eE][+-]?[0-9]+)?$/ ) 
        {print "ERROR: " $i " not a number"}
    

    Regular expressions are an astonishingly useful tool supported by many languages (e.g. Awk, Perl, Python, Java). The following notes review the basics. For full details, see http://www.gnu.org/manual/Gawk-3.1.1/html_node/Regexp.html#Regexp.

    Syntax: Here's the basic building blocks of regular expressions:

    c
    matches the character c (assuming c is a character with no special meaning in regexps).

    \c
    matches the literal character c; e.g. tabs and newlines are \t and \n respectively.

    .
    matches any character except newline.

    ^
    matches the beginning of a line or a string.

    $
    matches the end of a line or a string.

    [abc...]
    matches any of the characters abc... (character class).

    [^abc...]
    matches any character except abc... and newline (negated character class).

    r*
    matches zero or more r's.

    And that's enough to understand our trim function shown above. The regular expression /[ \t\n]*$/ means trailing whitespace; i.e. zero-or-more spaces, tabs or newlines followed by the end of line.

    More Syntax:

    But that's only the start of regular expressions. There's lots more. For example:

    r+
    matches one or more r's.

    r?
    matches zero or one r's.

    r1|r2
    matches either r1 or r2 (alternation).

    r1r2
    matches r1, and then r2 (concatenation).

    (r)
    matches r (grouping).

    Now we can read ^[+-]?([0-9]+[.]?[0-9]*|[.][0-9]+)([eE][+-]?[0-9]+)?$ like this:

    ^[+-]? ...
    Numbers begin with zero or one plus or minus signs.

    ...[0-9]+...
    Simple numbers are just one or more digits.

    ...[.]?[0-9]*...
    which may be followed by a decimal point and zero or more digits.

    ...|[.][0-9]+...
    Alternatively, a number can have no leading digits and just start with a decimal point.

    .... ([eE]...)?$
    Also, there may be an exponent added

    ...[+-]?[0-9]+)?$
    and that exponent is a positive or negative bunch of digits.
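
    To see the pattern at work, the sketch below feeds it a handful of made-up strings (the variable name checked is also invented) and reports which ones it accepts.

```shell
checked=$(printf '%s\n' '42' '-3.14' '.5' '6.02e23' '1e' 'abc' '1.2.3' | awk '
{
  if ($0 ~ /^[+-]?([0-9]+[.]?[0-9]*|[.][0-9]+)([eE][+-]?[0-9]+)?$/) {
    print $0, "number"
  } else {
    print $0, "not a number"
  }
}')
printf '%s\n' "$checked"
```

    Note that "1e" is rejected because an exponent, once started, needs at least one digit, and "1.2.3" is rejected because the trailing $ insists that nothing follows the number.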

    Associative arrays

    Gawk has arrays, but they are only indexed by strings. This can be very useful, but it can also be annoying. For example, we can count the frequency of words in a document (ignoring the icky part about printing them out):

    gawk '{for(i=1;i <=NF;i++) freq[$i]++ }' filename
    

    The array will hold an integer value for each word that occurred in the file. Unfortunately, this treats "foo", "Foo", and "foo," as different words. Oh well. How do we print out these frequencies? Gawk has a special "for" construct that loops over the indices of an array. This script is longer than most command lines, so it will be expressed as an executable script:

     #!/usr/bin/awk -f
      {for(i=1;i <=NF;i++) freq[$i]++ }
      END{for(word in freq) print word, freq[word]  }
    

    You can find out if an element exists in an array at a certain index with the expression:

    index in array
    

    This expression tests whether or not the particular index exists, without the side effect of creating that element if it is not present.

    You can remove an individual element of an array using the delete statement:

    delete array[index]
    

    It is not an error to delete an element which does not exist.
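
    The difference between testing with in and simply referencing an element can be checked with a short sketch. The helper function size and the sample keys are invented for the example.

```shell
membership=$(awk '
function size(a,    k, n) { n = 0; for (k in a) n++; return n }
BEGIN {
  freq["awk"] = 2
  if ("perl" in freq) { print "perl found" } else { print "perl absent" }
  print "size after the in test:", size(freq)
  if (freq["perl"] == "") print "size after referencing freq[perl]:", size(freq)
  delete freq["perl"]
  delete freq["never there"]          # deleting a missing element is not an error
  print "size after delete:", size(freq)
}')
printf '%s\n' "$membership"
```

    The in test leaves the array alone, while the innocent-looking comparison freq["perl"] == "" quietly creates an empty element.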

    Gawk has a special kind of for statement for scanning an array:

     for (var in array)
            body
    

    This loop executes body once for each different value that your program has previously used as an index in array, with the variable var set to that index.

    The order in which the array is scanned is not defined.

    To scan an array in some numeric order, you need to use keys 1,2,3,... and store somewhere that the array is N long. Then you can loop over the keys from 1 to N.

    Here are some useful array functions. We begin with the usual stack stuff. These stacks have items 1,2,3,.... and position 0 is reserved for the size of the stack:

     function top(a)        {return a[a[0]]}
     function push(a,x,  i) {i=++a[0]; a[i]=x; return i}
     function pop(a,   x,i) {
       i=a[0];
       if (!i) {return ""}   # empty stack: leave the size at 0
       x=a[i]; delete a[i]; a[0]=i-1; return x}
    

    The pop function can be used in the usual way:

     BEGIN {push(a,1); push(a,2); push(a,3);
            while(x=pop(a)) print x }
     3
     2
     1
    

    We can convert everything in an array to a string:

     function a2s(a,  i,s) {
            s=""; 
            for (i in a) {s=s " " i "= [" a[i]"]\n"}; 
            return s}
    
      BEGIN {push(L,1); push(L,2); push(L,3);
            print a2s(L);}
      0= [3]
      1= [1]
      2= [2]
      3= [3]
    

    And we can go the other way and convert a string into an array using the built in split function. These pod files were built using a recursive include function that seeks patterns of the form:

    ^=include file

    This function splits lines on space characters into the array `a' then looks for =include in a[1]. If found, it calls itself recursively on a[2]. Otherwise, it just prints the line:

     function rinclude (line,    x,a) {
       split(line,a,/ /);
       if ( a[1] ~ /^=include/ ) { 
         while ( ( getline x < a[2] ) > 0) rinclude(x);
         close(a[2])}
       else {print line}
     }
    

    Note that the third argument of the split function can be any regular expression.

    By the way, here's a nice trick with arrays. To print the lines in a file in a random order:

     BEGIN {srand()}
           {Array[rand()]=$0}
     END   {for(I in Array) print Array[I]}
    

    Short, eh? This is not a perfect solution. Gawk can only generate 1,000,000 different random numbers so the birthday theorem cautions that there is a small chance that lines will be lost when different lines are written to the same randomly selected location. After some experiments, I can report that you lose around one item after 1,000 inserts and 10 to 12 items after 10,000 random inserts. Nothing to write home about really. But for larger item sets, the above three liner is not what you want to use. For example, 10,000 to 12,000 items (more than 10%) are lost after 100,000 random inserts. Not good!
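
    If losing lines is not acceptable, one alternative (swapped in here, not part of the original trick) is to store the lines under the keys 1..NR and then shuffle them in place with a Fisher-Yates shuffle. The function name shuffle is invented for this sketch.

```shell
shuffle() {
  awk '
    BEGIN { srand() }
          { line[NR] = $0 }              # keys 1..NR, so no line can be overwritten
    END   {
      for (i = NR; i > 1; i--) {         # Fisher-Yates: swap line i with a random j <= i
        j = int(rand() * i) + 1
        tmp = line[i]; line[i] = line[j]; line[j] = tmp
      }
      for (i = 1; i <= NR; i++) print line[i]
    }
  ' "$@"
}

shuffled=$(printf '%s\n' one two three four five | shuffle)
printf '%s\n' "$shuffled"
```

    It is longer than three lines, but every input line comes out exactly once. Since srand() seeds from the clock, two runs within the same second may give the same order.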


    categories: OneLiners,Learn,Jan,2009,Admin

    Awk one-liners

    Awk is famous for how much it can do in one line.

    This site has many samples of that capability. And if you have any more to add, please send them in.


    categories: OneLiners,Learn,Jan,2009,EricP

    Handy One-Liners For Awk (v0.22)

    Eric Pement
    pemente@northpark.edu

    Latest version of this file is usually at:
    http://www.student.northpark.edu/pemente/awk/awk1line.txt

    USAGE

    Unix:     awk '/pattern/ {print "$1"}'    # standard Unix shells
    DOS/Win:  awk '/pattern/ {print "$1"}'    # okay for DJGPP compiled
              awk "/pattern/ {print \"$1\"}"  # required for Mingw32
    

    Most of my experience comes from versions of GNU awk (gawk) compiled for Win32. Note in particular that DJGPP compilations permit the awk script to follow Unix quoting syntax '/like/ {"this"}'. However, the user must know that single quotes under DOS/Windows do not protect the redirection arrows (<, >) nor do they protect pipes (|). Both are special symbols for the DOS/CMD command shell and their special meaning is ignored only if they are placed within "double quotes." Likewise, DOS/Win users must remember that the percent sign (%) is used to mark DOS/Win environment variables, so it must be doubled (%%) to yield a single percent sign visible to awk.

    If I am sure that a script will NOT need to be quoted in Unix, DOS, or CMD, then I normally omit the quote marks. If an example is peculiar to GNU awk, the command 'gawk' will be used. Please notify me if you find errors or new commands to add to this list (total length under 65 characters). I usually try to put the shortest script first.

    File Spacing

    Double space a file

     awk '1;{print ""}'
     awk 'BEGIN{ORS="\n\n"};1'
    

    Double space a file which already has blank lines in it. Output file should contain no more than one blank line between lines of text. NOTE: On Unix systems, DOS lines which have only CRLF (\r\n) are often treated as non-blank, and thus 'NF' alone will return TRUE.

    awk 'NF{print $0 "\n"}'
    

    Triple space a file

    awk '1;{print "\n"}'

    Numbering and Calculations

    Precede each line by its line number FOR THAT FILE (left alignment). Using a tab (\t) instead of space will preserve margins.

    awk '{print FNR "\t" $0}' files*

    Precede each line by its line number FOR ALL FILES TOGETHER, with tab.

    awk '{print NR "\t" $0}' files*

    Number each line of a file (number on left, right-aligned). Double the percent signs if typing from the DOS command prompt.

    awk '{printf("%5d : %s\n", NR,$0)}'

    Number each line of file, but only print numbers if line is not blank. Remember caveats about Unix treatment of \r (mentioned above).

    awk 'NF{$0=++a " :" $0};{print}'
     awk '{print (NF? ++a " :" :"") $0}'
    

    Count lines (emulates "wc -l")

    awk 'END{print NR}'

    Print the sums of the fields of every line

    awk '{s=0; for (i=1; i<=NF; i++) s=s+$i; print s}'

    Add all fields in all lines and print the sum

    awk '{for (i=1; i<=NF; i++) s=s+$i}; END{print s}'

    Print every line after replacing each field with its absolute value

     awk '{for (i=1; i<=NF; i++) if ($i < 0) $i = -$i; print }'
     awk '{for (i=1; i<=NF; i++) $i = ($i < 0) ? -$i : $i; print }'
    

    Print the total number of fields ("words") in all lines

     awk '{ total = total + NF }; END {print total}' file

    Print the total number of lines that contain "Beth"

     awk '/Beth/{n++}; END {print n+0}' file

    Print the largest first field and the line that contains it. Intended for finding the longest string in field #1.

    awk '$1 > max {max=$1; maxline=$0}; END{ print max, maxline}'

    Print the number of fields in each line, followed by the line

    awk '{ print NF ":" $0 } '

    Print the last field of each line

    awk '{ print $NF }'

    Print the last field of the last line

    awk '{ field = $NF }; END{ print field }'

    Print every line with more than 4 fields

    awk 'NF > 4'

    Print every line where the value of the last field is > 4

    awk '$NF > 4'

    Text Conversion and Substitution

    IN UNIX ENVIRONMENT: convert DOS newlines (CR/LF) to Unix format

    awk '{sub(/\r$/,"");print}'   # assumes EACH line ends with Ctrl-M

    IN UNIX ENVIRONMENT: convert Unix newlines (LF) to DOS format

    awk '{sub(/$/,"\r");print}'

    IN DOS ENVIRONMENT: convert Unix newlines (LF) to DOS format

    awk 1

    IN DOS ENVIRONMENT: convert DOS newlines (CR/LF) to Unix format. Cannot be done with DOS versions of awk, other than gawk:

    gawk -v BINMODE="w" '1' infile >outfile

    Use "tr" instead.

     tr -d \r <infile >outfile # GNU tr version 1.22 or higher

    Delete leading whitespace (spaces, tabs) from front of each line; aligns all text flush left.

    awk '{sub(/^[ \t]+/, ""); print}'

    Delete trailing whitespace (spaces, tabs) from end of each line

    awk '{sub(/[ \t]+$/, "");print}'
    

    Delete BOTH leading and trailing whitespace from each line

    awk '{gsub(/^[ \t]+|[ \t]+$/,"");print}'
    awk '{$1=$1;print}'           # also removes extra space between fields
    

    Insert 5 blank spaces at beginning of each line (make page offset)

    awk '{sub(/^/, "     ");print}'
    

    Align all text flush right on a 79-column width

    awk '{printf "%79s\n", $0}' file*
    

    Center all text on a 79-character width

    awk '{l=length();s=int((79-l)/2); printf "%"(s+l)"s\n",$0}' file*
    

    Substitute (find and replace) "foo" with "bar" on each line

    awk '{sub(/foo/,"bar");print}'           # replaces only 1st instance
    gawk '{$0=gensub(/foo/,"bar",4);print}'  # replaces only 4th instance
    awk '{gsub(/foo/,"bar");print}'          # replaces ALL instances in a line
    

    Substitute "foo" with "bar" ONLY for lines which contain "baz"

    awk '/baz/{gsub(/foo/, "bar")};{print}'
    

    Substitute "foo" with "bar" EXCEPT for lines which contain "baz"

    awk '!/baz/{gsub(/foo/, "bar")};{print}'
    

    Change "scarlet" or "ruby" or "puce" to "red"

    awk '{gsub(/scarlet|ruby|puce/, "red"); print}'
    

    Reverse order of lines (emulates "tac")

    awk '{a[i++]=$0} END {for (j=i-1; j>=0;) print a[j--] }' file*
    

    If a line ends with a backslash, append the next line to it (fails if there are multiple lines ending with backslash...)

    awk '/\\$/ {sub(/\\$/,""); getline t; print $0 t; next}; 1' file*
    

    Print and sort the login names of all users

    awk -F ":" '{ print $1 | "sort" }' /etc/passwd
    

    Print the first 2 fields, in opposite order, of every line

    awk '{print $2, $1}' file
    

    Switch the first 2 fields of every line

    awk '{temp = $1; $1 = $2; $2 = temp; print}' file
    

    Print every line, deleting the second field of that line

    awk '{ $2 = ""; print }'
    

    Print in reverse order the fields of every line

    awk '{for (i=NF; i>0; i--) printf("%s ",$i);printf ("\n")}' file
    

    Remove duplicate, consecutive lines (emulates "uniq")

    awk 'NR==1 || a != $0; {a=$0}'   # plain string comparison, so regex characters in a line are harmless
    

    Remove duplicate, nonconsecutive lines

    awk '! a[$0]++'                     # most concise script
    awk '!($0 in a) {a[$0];print}'      # most efficient script
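
    A quick check of the first form on made-up input. The array is named seen here instead of a, purely for readability.

```shell
unique=$(printf '%s\n' b a b c a | awk '!seen[$0]++')
printf '%s\n' "$unique"
```

    The first time a line shows up, seen[$0] is still 0, so !seen[$0] is true and the line is printed; the ++ then bumps the count, so every later copy tests false.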
    

    Concatenate every 5 lines of input, using a comma separator between fields

    awk 'ORS=NR%5?",":"\n"' file
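
    The trick here is that an assignment can serve as the pattern. A small sketch, with the one-liner written as ORS = NR%5 ? "," : "\n" and fed seven made-up lines:

```shell
joined=$(printf '%s\n' 1 2 3 4 5 6 7 | awk 'ORS = NR%5 ? "," : "\n"')
printf '%s\n' "$joined"
```

    The output record separator is a comma except after every fifth line, and since the assigned value is always a non-empty string the pattern is true and each line is printed. Note that a final, incomplete group ends with a comma rather than a newline.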
    

    Selective Printing of Certain Lines

    Print first 10 lines of file (emulates behavior of "head")

    awk 'NR < 11'
    

    Print first line of file (emulates "head -1")

    awk 'NR>1{exit};1'
    

    Print the last 2 lines of a file (emulates "tail -2")

    awk '{y=x "\n" $0; x=$0};END{print y}'
    

    Print the last line of a file (emulates "tail -1")

    awk 'END{print}'
    

    Print only lines which match regular expression (emulates "grep")

    awk '/regex/'
    

    Print only lines which do NOT match regex (emulates "grep -v")

    awk '!/regex/'
    

    Print the line immediately before a regex, but not the line containing the regex

    awk '/regex/{print x};{x=$0}'
     awk '/regex/{print (x=="" ? "match on line 1" : x)};{x=$0}'
    

    Print the line immediately after a regex, but not the line containing the regex

    awk '/regex/{getline;print}'
    

    Grep for AAA and BBB and CCC (in any order)

    awk '/AAA/ && /BBB/ && /CCC/'
    

    Grep for AAA and BBB and CCC (in that order)

    awk '/AAA.*BBB.*CCC/'
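
    The two forms can be compared on a few made-up lines. Joining the three tests with && demands that all of them match, in any order; a single regex with .* between the words also fixes the order. The variable names are invented for the sketch.

```shell
sample=$(printf '%s\n' 'AAA only' 'CCC BBB AAA' 'AAA BBB CCC')

any_order=$(printf '%s\n' "$sample" | awk '/AAA/ && /BBB/ && /CCC/')
in_order=$(printf '%s\n' "$sample" | awk '/AAA.*BBB.*CCC/')

printf 'any order:\n%s\n' "$any_order"
printf 'in that order:\n%s\n' "$in_order"
```

    Three separate patterns (one per rule) would behave differently again: each rule prints on its own, so a line matching any one word is printed, and a line matching all three is printed three times.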
    

    Print only lines of 65 characters or longer

    awk 'length > 64'
    

    Print only lines of less than 65 characters

    awk 'length < 65'
    

    Print section of file from regular expression to end of file

    awk '/regex/,0'
    awk '/regex/,EOF'
    

    Print section of file based on line numbers (lines 8-12, inclusive)

    awk 'NR==8,NR==12'
    

    Print line number 52

    awk 'NR==52'
    awk 'NR==52 {print;exit}'          # more efficient on large files
    

    Print section of file between two regular expressions (inclusive)

    awk '/Iowa/,/Montana/'             # case sensitive
    

    Selective Deletion of Certain Lines:

    Delete ALL blank lines from a file (same as "grep '.' ")

    awk NF
    awk '/./'
    

    Credits and Thanks

    Special thanks to Peter S. Tillier for helping me with the first release of this FAQ file.

    For additional syntax instructions, including the way to apply editing commands from a disk file instead of the command line, consult:

    • "sed & awk, 2nd Edition," by Dale Dougherty and Arnold Robbins O'Reilly, 1997
    • "UNIX Text Processing," by Dale Dougherty and Tim O'Reilly Hayden Books, 1987
    • "Effective awk Programming, 3rd Edition." by Arnold Robbins O'Reilly, 2001

    To fully exploit the power of awk, one must understand "regular expressions." For detailed discussion of regular expressions, see

    • "Mastering Regular Expressions, 2d edition" by Jeffrey Friedl (O'Reilly, 2002).

    The manual ("man") pages on Unix systems may be helpful (try "man awk", "man nawk", "man regexp", or the section on regular expressions in "man ed"), but man pages are notoriously difficult. They are not written to teach awk use or regexps to first-time users, but as a reference text for those already acquainted with these tools.

    USE OF '\t' IN awk SCRIPTS: For clarity in documentation, we have used the expression '\t' to indicate a tab character (0x09) in the scripts. All versions of awk, even the UNIX System 7 version should recognize the '\t' abbreviation.


    categories: OneLiners,Learn,Jan,2009,Admin

    Explaining Pement's One Liners

    Peteris Krumins explaining Eric Pement's Awk one-liners:


    categories: TenLiners,Learn,Jan,2009,Admin

    Awk ten-liners

    Awk is famous for how much it can do in (around) ten lines. Here are some samples of that capability.

    (And if you have any more to add, please send them in.)


    categories: TenLiners,Learn,Jan,2009,Ronl

    Some Gawk (and PERL) Samples

    by R. Loui

    Here are a few short programs that do the same thing in each language. When reading these examples, the question to ask is `how many language features do I need to understand in order to understand the syntax of these examples'.

    Some of these are longer than they need to be since they don't exploit some (e.g.) command line trick to wrap the code in for each line do X. And that is the point- for teach-ability, the preferred language is the one you need to know LESS about before you can be useful in it.

    hello world

    PERL:

     print "hello world\n"
    

    GAWK:

     BEGIN { print "hello world" }
    

    One plus one

    PERL

     $x= $x+1;
    

    GAWK

     x= x+1
    

    Printing

    PERL

     print $x, $y, $z;
    

    GAWK

     print x,y,z
    

    Printing the first field in a file

    PERL

     while (<>) { 
       split(/ /);
       print "@_[0]\n" 
     }
    
    

    GAWK

     { print $1 }
    

    Printing lines, reversing fields

    PERL

     while (<>) { 
      split(/ /);
      print "@_[1] @_[0]\n" 
     }
    

    GAWK

     { print $2, $1 }
    

    Concatenation of variables

    PERL

     $command = "cat $fname1 $fname2 > $fname3";
    

    GAWK

     command = "cat " fname1 " " fname2 " > " fname3
    

    Looping

    PERL:

     for (1..10) { print $_,"\n" }
    

    GAWK:

     BEGIN { 
      for (i=1; i<=10; i++) print i
     }
    

    Pairs of numbers

    PERL:

     for (1..10) { print "$_ ",$_-1 }
     print "\n"
    

    GAWK:

     BEGIN { 
      for (i=1; i<=10; i++) printf i " " i-1
      print ""
     }
    

    List of words into a hash

    PERL

      foreach $x ( split(/ /,"this is not stored linearly") ) 
      { print "$x\n" }
    

    GAWK

     BEGIN { 
      split("this is not stored linearly",temp)
      for (i in temp) print temp[i]
     }
    

    Printing a hash in some key order

    PERL

     $n = split(/ /,"this is not stored linearly");
     for $i (0..$n-1) { print "$i @_[$i]\n" }
     print "\n";
     for $i (@_) { print ++$j," ",$i,"\n" }
    

    GAWK

     BEGIN { 
      n = split("this is not stored linearly",temp)
      for (i=1; i<=n; i++) print i, temp[i]
      print ""
      for (i in temp) print i, temp[i]
     }
    

    Printing all lines in a file

    PERL

     open file,"/etc/passwd";
     while (<file>) { print $_ }
    

    GAWK

      BEGIN { 
      while ((getline < "/etc/passwd") > 0) print
     }
    

    Printing a string

    PERL

     $x = "this " . "that " . "\n";
     print $x
    

    GAWK

     BEGIN {
      x = "this " "that " "\n" ; printf x
     }
    

    Building and printing an array

    PERL

     $assoc{"this"} = 4;
     $assoc{"that"} = 4;
     $assoc{"the other thing"} = 15;
     for $i (keys %assoc) { print "$i $assoc{$i}\n" }
    

    GAWK

     BEGIN {
       assoc["this"] = 4
       assoc["that"] = 4
       assoc["the other thing"] = 15
       for (i in assoc) print i,assoc[i]
     }
    

    Sorting an array

    PERL

     split(/ /,"this will be sorted once in an array");
     foreach $i (sort @_) { print "$i\n" }
    

    GAWK

     BEGIN {
      split("this will be sorted once in an array",temp," ")
      for (i in temp) print temp[i] | "sort"
      close("sort")    # closing the pipe lets sort finish and print its output
     }
    

    Sorting an array (#2)

    GAWK

     BEGIN {
      split("this will be sorted once in an array",temp," ")
      n=asort(temp)
      for (i=1;i<=n;i++) print temp[i] 
     }
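
    asort is specific to Gawk. For other awks, a small hand-written sort does the same job without a pipe. The sketch below is one such alternative (a plain insertion sort, with the invented function name isort), not something taken from the original samples.

```shell
sorted=$(awk '
function isort(a, n,     i, j, v) {      # sort a[1..n] in place, ascending
  for (i = 2; i <= n; i++) {
    v = a[i]
    for (j = i - 1; j >= 1 && a[j] > v; j--) a[j + 1] = a[j]
    a[j + 1] = v
  }
}
BEGIN {
  n = split("this will be sorted once in an array", temp, " ")
  isort(temp, n)
  for (i = 1; i <= n; i++) printf "%s%s", temp[i], (i < n ? " " : "\n")
}')
printf '%s\n' "$sorted"
```

    Insertion sort is quadratic, so this is fine for short lists like the one above; for big inputs, piping through the external sort command is usually the better choice.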
    

    Print all lines, vowels changed to stars

    PERL

     while (<STDIN>) {
      s/[aeiou]/*/g;
      print $_
     }
    
    

    GAWK

     {gsub(/[aeiou]/,"*"); print }
    

    Report from file

    PERL

     #!/pkg/gnu/bin/perl
     # this is a comment
     #
     open(stream1,"w | ");
     while ($line = <stream1>) {
       ($user, $tty, $login, $junk) = split(/ +/, $line, 4);
       print "$user $login ",substr($line,49)
     }
    

    GAWK

    #!/pkg/gnu/bin/gawk -f
     # this is a comment
     #
     BEGIN {
       while (("w" | getline) > 0) {
         user = $1; tty = $2; login = $3
         print user, login, substr($0,49)
       }
     }
    

    Web Slurping

    PERL

     open(stream1,"lynx -dump 'cs.wustl.edu/~loui' | ");
     while ($line = <stream1>) {
       if ($flag && $line =~ /[0-9]/) { print $line }
       if ($line =~ /References/) { $flag = 1 }
     }
    
    

    GAWK

     BEGIN {
      com = "lynx -dump 'cs.wustl.edu/~loui' 2>&1"
      while ((com | getline line) > 0) {
        if (flag && line ~ /[0-9]/) { print line }
        if (line ~ /References/) { flag = 1 }
      }
     }
    

    categories: Newsgroup,Jan,2009,Steffen

    Top posters at comp.lang.awk

    For the 7 day period ending Monday April 27, 2009.

    posts  kbytes  name                address
       13    28.4  roby                elleroroberto@katamail.com
        7    11.6  Steffen Schuler     schuler.steffen@gmail.com
        4    10.9  pmarin              pacogeek@gmail.com
        3     9.7  Ed Morton           mortonspam@gmail.com
        3     5.2  Janis Papanagnou    janis_papanagnou@hotmail.com
        3     5.1  nag                 visitnag@gmail.com
        2     6.5  Tim Menzies         menzies.tim@gmail.com
        2     6.1  r.p.loui@gmail.com  r.p.loui@gmail.com
        2     5.8  Hermann Peifer      peifer@gmx.net
        2     5.7  kielhd              kielhd@freenet.de
       41    95.0  Total for top 10

    Totals for the newsgroup

    • 19 posters
    • 52 articles
    • 115.1 kbytes

    The top 10 accounted for

    • 52.6% of the posters
    • 78.8% of the articles
    • 82.5% of the bytes

    Averages

    • 2.7 articles / poster
    • 2.2 kbytes / article
    • 6.1 kbytes / poster


    Provided as a public service by Steffen Schuler

    categories: Newsgroup,Jan,2009,Steffen

    Top subjects at comp.lang.awk

    For the 7 day period ending Monday April 27, 2009.

    posts  kbytes  subject
       10    33.5  OS-variables in awk
        9    17.9  user functions with variable number of parameters
        5     8.9  File infos
        3     8.5  Interpreter Informations
        3     5.0  Log/History Files
        3     4.9  Help with an input file
        3     4.8  gawk can't run an awk program...
        3     4.6  Log/History File
        2     5.6  pgawk.exe.stackdump
        2     4.7  OT: Re: Interpreter Informations

    52 articles on 18 subjects

    • 38 were followups (73.1%)
    • 0 were crossposts (0.0%)

    115.1 kbytes total

    • headers: 54.4kb (47.3%)
    • quoted text: 32.9kb (28.6%)
    • original text: 27.2kb (23.6%)
    • signatures: 0.6kb (0.5%)

    Averages

    • 2.9 articles / subject
    • 2.2 kbytes / article
    • 6.4 kbytes / subject


    Provided as a public service by Steffen Schuler

    categories: Newsgroup,Jan,2009,Steffen

    Top posters at comp.lang.awk

    For the 365 day period ending Sunday April 26, 2009.

    posts  kbytes  name                address
      156   530.8  Ed Morton           mortonspam@gmail.com
      156   388.3  Janis Papanagnou    janis_papanagnou@hotmail.com
      146   256.1  pk                  pk@pk.invalid
      109   306.6  Ed Morton           morton@lsupcaemnt.com
       84   146.5  Steffen Schuler     schuler.steffen@gmail.com
       83   139.4  Kenny McCormack     gazelle@shell.xmission.com
       77   174.1  Aharon Robbins      arnold@skeeve.com
       64   162.2  Dave B              daveb@addr.invalid
       54   194.9  r.p.loui@gmail.com  r.p.loui@gmail.com
       50   107.7  Hermann Peifer      peifer@gmx.eu
      979  2406.6  Total for top 10

    Totals for the newsgroup

    • 271 posters
    • 2272 articles
    • 5542.5 kbytes

    The top 10 accounted for

    • 3.7% of the posters
    • 43.1% of the articles
    • 43.4% of the bytes

    Averages

    • 8.4 articles / poster
    • 2.4 kbytes / article
    • 20.5 kbytes / poster


    Provided as a public service by Steffen Schuler

    categories: Newsgroup,Jan,2009,Steffen

    Top subjects at comp.lang.awk

    For the 365 day period ending Sunday April 26, 2009.

    posts  kbytes  subject
       61   219.6  changing a field without recompiling the record
       44    71.3  Top 10 subjects comp.lang.awk
       42    88.1  GAWK: A fix for "missing file is a fatal error"
       34    59.6  Top 10 posters comp.lang.awk
       30    75.3  Indirect function calls patch for gawk available
       29    65.0  gawk for windows: system() does not yield exit status
       26    67.1  split field by delimiter
       24    63.6  Is there an simple way to initialise arrays in bulk?
       23    63.5  Sed1liners in Awk?
       23    62.6  Gawk match() and numbers in scientific notation

    2272 articles on 389 subjects

    • 1865 were followups (82.1%)
    • 8 were crossposts (0.4%)

    5540.0 kbytes total

    • headers: 2356.9kb (42.5%)
    • quoted text: 1591.2kb (28.7%)
    • original text: 1531.2kb (27.6%)
    • signatures: 60.7kb (1.1%)

    Averages

    • 5.8 articles / subject
    • 2.4 kbytes / article
    • 14.2 kbytes / subject


    Provided as a public service by Steffen Schuler

    categories: WWW,Awk100,Jan,2009,PeterK

    Get_YouTube_Vids

    Purpose

    Download videos from YouTube.

    Source code

    gawk/www/get_youtube_vids.awk

    Developers

    Peteris Krumins: Downloading YouTube Videos With Gawk

    Domain

    World wide web, slurping, file sharing.

    Contact

    Peteris Krumins

    Description

    How to download YouTube videos.

    Awk

    Gawk

    Lines

    331 lines

    Current

    3=Released

    Use

    1=Personal use

    DateDeployed

    July 2007

    Dated

    Sat Feb 21 19:46:10 EST 2009

    Url

    Downloading YouTube Videos With Gawk


    categories: Sudoku,Awk100,Jan,2009,Jimh

    sudoku

    This is an Awk 100 program.

    Submitted by

    Jim Hart

    Purpose

    Solve sudoku puzzles using the same strategies as a person would, not by brute force.

    Source

    gawk/awk100/sudoku

    Developers

    Jim Hart

    Country

    US

    Domain

    command line games

    Contact

    Jim Hart

    Email

    jhart50@gmail.com

    Description

    see Purpose

    AWK versions

    gawk

    Platform

    Mac OS X, PowerPC

    Lines

    529

    Development Effort

    1

    Maintenance Effort

    0

    Date Deployed

    /2006
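
The Purpose field above contrasts human-style strategies with brute force. As an illustration of what one such strategy can look like, here is a minimal sketch of the "sole candidate" rule: fill any blank cell for which only one digit is still possible, and repeat until nothing changes. The rule choice, the input format, and the function name sole_candidate are assumptions for this example; it is not taken from Jim Hart's source and will not finish harder puzzles on its own.

```shell
# Hypothetical sketch of one human-style rule ("sole candidate").
# Input (assumed): 9 lines of 9 characters, digits 1-9 for givens, 0 for blanks.
sole_candidate() {
  awk '
    { for (c = 1; c <= 9; c++) g[NR, c] = substr($0, c, 1) + 0 }

    # True if digit d already appears in the row, column, or 3x3 box of (r, c).
    function seen(r, c, d,    i, br, bc) {
      br = r - (r - 1) % 3; bc = c - (c - 1) % 3      # top-left cell of the box
      for (i = 1; i <= 9; i++)
        if (g[r, i] == d || g[i, c] == d) return 1
      for (i = 0; i < 9; i++)
        if (g[br + int(i / 3), bc + i % 3] == d) return 1
      return 0
    }

    END {
      do {                                   # repeat until a pass fills nothing
        changed = 0
        for (r = 1; r <= 9; r++) for (c = 1; c <= 9; c++) {
          if (g[r, c]) continue              # cell already filled
          n = 0
          for (d = 1; d <= 9; d++) if (!seen(r, c, d)) { n++; only = d }
          if (n == 1) { g[r, c] = only; changed = 1 }
        }
      } while (changed)
      for (r = 1; r <= 9; r++) {             # print the grid, blanks still as 0
        line = ""
        for (c = 1; c <= 9; c++) line = line g[r, c]
        print line
      }
    }'
}

# Example: a grid with three blanks, each of which has a single candidate.
sole_candidate <<'EOF'
023456789
456789123
789123456
234567891
567801234
891234567
345678912
678912345
912345670
EOF
```

Cells the rule cannot decide are left as 0, which is where further strategies would take over.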


    categories: Negotiate,Awk100,Jan,2009,Ronl

    Anne's Negotiation Game

    An Awk100 program.

    Purpose

    Research on a model of negotiation incorporating search, dialogue, and changing expectations

    Source code

    See gawk/awk100/negotiate.

    Developers

    Ronald Loui (programmer and designer), Anne Jump (adversary)

    Organization

    National Science Foundation grant at Washington University in St. Louis

    Country

    USA

    Domain

    Prototype of a new idea for cognitive modelling (in artificial intelligence/economics/organizational behavior)

    Contact

    Ronald P. Loui

    Email

    r.p.loui@gmail.com

    Description

    The program generates a game board on which players take turns searching or declaring according to a protocol. It is based on the same game bimatrix made famous by von Neumann and Nash, but takes a new approach to negotiation based on process instead of solution.

    Awk

    Was written for gawk in 1997 but should run on almost any awk dialect

    Platform

    Was written on Redhat Linux with multiple hardware platforms in mind

    Uses

    Was intended to be self-contained

    Lines

    658 lines, of which 39 are comments

    DevelopmentEffort

    One day, 6-8 hours total

    MaintenanceEffort

    Two revisions are available, mainly to permit programs to negotiate instead of humans, and to provide a web-based dashboard to monitor the events

    CurrentStatus

    2=Evaluation

    Use

    2=in-House use

    Users

    50 students in artificial intelligence project classes had to use some version of this code over three years

    DateDeployed

    October 1997

    Dated

    January 2008

    References

    There is a draft article (unpublished), and several talks, e.g.

    The paper in Harper and Wheeler, Probability and Inference: Essays in Honour of Henry E. Kyburg Jr., College Publications (23 April 2007), ISBN-10: 1904987184, ISBN-13: 978-1904987185, also refers to the theory implemented here. Diana Moore's thesis on negotiation and draft article http://citeseer.ist.psu.edu/11983.html contain some precursor ideas.

    Url

    http://www.cs.wustl.edu/~loui/313f97/anne4.expl.html


    categories: Baseballsim,Awk100,Jan,2009,Ronl

    Baseball sim

    This is an Awk 100 program.

    Purpose

    A quick and dirty baseball simulator for investigating the efficiency of batting lineups

    Source

    See gawk/awk100/baseballsim.

    Developers

    Ronald P. Loui

    Organization

    Washington University in St. Louis

    Country

    USA

    Domain

    Research/Decision Support

    Contact

    Ronald P. Loui

    Email

    r.p.loui@gmail.com

    Description

    This was written for the AI course, and for several investigations, including the determination of whether it is a good idea to bat the pitcher in the 8th spot. One hypothesis that emerges from this program that deserves further study is that the most potent offense is one that spreads rather than concentrates the batting threats.

    Awk

    Gawk around 2002

    Platform

    Linux around 2002

    Uses

    None

    Lines

    409

    DevelopmentEffort

    Approximately one day

    MaintenanceEffort

    Further simulators were developed for improved domain modeling and for successive addition of functionality; no other code maintenance was required.

    CurrentStatus

    1=Prototype

    Use

    1=Personal use

    Users

    About 50 students used this program over three years in AI classes, and two undergraduate theses and one Master's thesis on evolutionary computing made use of this simulator.

    DateDeployed

    October 2002

    Dated

    January 2009

    References

    None, but see Tony LaRussa's comments on batting order while managing the St. Louis Cardinals
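
To make the idea of comparing batting orders concrete, here is a minimal Monte Carlo sketch. It is not the baseballsim source: the model is a deliberate simplification (every successful at-bat is a single and every runner advances exactly one base), and the function name sim_lineup, the GAMES and SEED variables, and the input format are all assumptions for this example.

```shell
# Hypothetical sketch: average runs per game for a batting order.
# Input (assumed): nine on-base probabilities (each below 1), in batting order.
# Model (assumed): every hit is a single and each runner advances one base,
# so a run scores only when a batter reaches base with the bases loaded.
sim_lineup() {
  awk -v games="${GAMES:-1000}" -v seed="${SEED:-1}" '
    { p[NR] = $1 }
    END {
      srand(seed)
      for (g = 1; g <= games; g++) {
        b = 0                                  # index of the last batter
        for (inning = 1; inning <= 9; inning++) {
          outs = 0; on = 0                     # on = runners on base (0-3)
          while (outs < 3) {
            b = b % NR + 1                     # next batter, wrapping 9 -> 1
            if (rand() < p[b]) { if (on == 3) runs++; else on++ }
            else outs++
          }
        }
      }
      printf "%.3f\n", runs / games
    }'
}

# Example: strong hitters bunched at the top versus spread through the order.
printf '%s\n' 0.40 0.40 0.40 0.30 0.30 0.30 0.25 0.25 0.25 | sim_lineup
printf '%s\n' 0.40 0.30 0.25 0.40 0.30 0.25 0.40 0.30 0.25 | sim_lineup
```

The exact numbers depend on the awk implementation's random number generator, so only the comparison between the two runs is meaningful, and only within this simplified model.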


    categories: Argcol,Awk100,Jan,2009,Ronl

    Argcol

    An Awk100 program.

    Purpose

    A tool inspired by fmt that could be used while working in vi to maintain a multi-column pro-con argument format.

    Source code

    See gawk/awk100/argcol.

    Developers

    Mark Foltz, Ronald Loui, Thieu Dang, Jeremy Frens

    Organization

    Washington University in St. Louis

    Country

    USA

    Domain

    Application/text support for text editor.

    Contact

    Ronald Loui

    Email

    r.p.loui@gmail.com

    Awk

    Gawk circa 1994, Solaris and MS-DOS-based awk such as mawk.

    Platform

    Solaris and MS-DOS

    Uses

    Vi and variants such as stevie.

    Lines

    278

    DevelopmentEffort

    One week.

    MaintenanceEffort

    No maintenance, eventually rewritten as cgi/web program in Room5 project.

    Current

    4=No longer supported

    Use

    3=Free/public domain

    Users

    2

    DateDeployed

    May 1994

    Dated

    Jan 2009

    References

    Progress on Room 5: a testbed for public interactive semi-formal legal argumentation. Proceedings of the 6th International Conference on Artificial Intelligence and Law, Melbourne, Australia, 1997, pp. 207-214. ISBN 0-89791-924-6.
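
For readers unfamiliar with a multi-column pro-con layout, the following sketch shows one way such a format could be produced as a text filter. It is not the argcol source: the "pro: " and "con: " line prefixes, the WIDTH variable, and the function name procon are assumptions made for this example.

```shell
# Hypothetical sketch of a two-column pro/con layout (not the argcol source).
# Input (assumed): lines starting with "pro: " or "con: "; others are ignored.
procon() {
  awk -v width="${WIDTH:-30}" '
    /^pro: / { pro[++np] = substr($0, 6) }     # text after the 5-char prefix
    /^con: / { con[++nc] = substr($0, 6) }
    END {
      fmt = "%-" width "s | %s\n"              # left column padded to width
      n = (np > nc) ? np : nc                  # rows = longer of the two lists
      printf fmt, "PRO", "CON"
      for (i = 1; i <= n; i++)
        printf fmt, pro[i], con[i]             # missing entries print as empty
    }'
}

# Example: pair up the arguments in the order they were written.
procon <<'EOF'
pro: runs on almost any awk
con: no wrapping of long lines
pro: works as a plain filter
EOF
```

Because it reads standard input and writes standard output, a filter of this shape can be applied to a range of lines from inside an editor, in the same way fmt is.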
