About awk.info
Mar 01: Michael Sanders demos an X-windows GUI for AWK.
Mar 01: Awk100#24: A. Lahm and E. de Rinaldis' patent search, in AWK
Feb 28: Tim Menzies asks this community to write an AWK cookbook.
Feb 28: Arnold Robbins announces a new debugger for GAWK.
Feb 28: Awk100#23: Premysl Janouch offers an IRC bot, in AWK
Feb 28: Updated: the AWK FAQ
Feb 28: Tim Menzies offers a tiny content management system, in Awk.
Jan 31: Comment system added to awk.info. For example, see the discussion at the bottom of ?keys2awk
Jan 31: Martin Cohen shows that Gawk can handle massively long strings (300 million characters).
Jan 31: The AWK FAQ is being updated. For comments/ corrections/ extensions, please mail tim@menzies.us
Jan 31: Martin Cohen finds Awk on the Android platform.
Jan 31: Aleksey Cheusov released a new version of runawk.
Jan 31: Hirofumi Saito contributes a candidate Awk mascot.
Jan 31: Michael Sanders shows how to quickly build an AWK GUI for windows.
Jan 31: Hyung-Hwan Chung offers QSE, an embeddable Awk Interpreter.
Awk is being used all around the world for real programming problems, but the news is not getting out.
We are aiming to create a database of at least one hundred Awk programs which will:
If you, or your colleagues or friends have written a program which has been used for purposes small or large, why not take five minutes to record the facts, so that others can see what you've done?
To contribute, fill in this template and mail it to mail@awk.info with the subject line Awk 100 contribution.
(Recent additions are shown first.)
From: Tim Menzies <tim@menzies.us>
To: mikelangman@blueyonder.co.uk
Subject: auk images
I write to see if you would be gracious enough to grant us usage rights for your auk paintings to use on this site, in exchange for appropriate credit such as:
From: Mike Langman <mikelangman@blueyonder.co.uk>
Date: Mon, Jan 19, 2009 at 2:55 AM
Subject: Re: auk images
I normally charge for the use of images but as there is no money involved please carry on using the images and include a link to my website as suggested.
Many thanks for asking.
- Mike
"Because easy is not wrong." - Anon
From various sources:
Quotes:
From Project Management Advice:
From Awk programming:
From Awk as a Major Systems Programming Language:
According to Ramesh Natarajan:
From the NoSQL pages:
To join our community, consider contributing to this site.
For a list of authors of this site, see our credits pages.
The Awk Wiki.
USENET discussion group: comp.lang.awk.
For discussions on Awk, see the Awk discussion group.
For comments/ complaints/ corrections/ extensions to this site, contact mail@awk.info.
Awk is a stable, cross platform computer language named for its authors Alfred Aho, Peter Weinberger & Brian Kernighan. They write: "Awk is a convenient and expressive programming language that can be applied to a wide variety of computing and data-manipulation tasks".
In Classic Shell Scripting, Arnold Robbins & Nelson Beebe confess their Awk bias: "We like it. A lot. The simplicity and power of Awk often make it just the right tool for the job."
Besides the Bourne shell, Awk is the only other scripting language available in the standard Unix environment. Implementations of AWK exist as installed software for almost all other operating systems.
Awk is a mature language- it was first implemented in the 1970s. As a tool from the golden age, it is sometimes called primitive. It is more accurate to call it elemental, so tightly focused is the language on what it does best: quickly converting this into that.
Consequently, throughout history, Awk has been the language of choice for many famous scientists such as Leonardo daVinci.
LAWKER is a repository of Awk code divided into:
See How to Contribute.
Use our issue tracking system.
Many communities have a mascot, a banner that they proudly wave high. So where's the Awk mascot?
I made one up, but you gotta say, it is kinda lame:
So if you have any ideas for such a mascot, please email mail@awk.info with the subject line "suggestion for mascot".
Not to stifle anyone's creativity but the mascot might be based on the mantra "less, but better" or "easy is not wrong" or "a little awk goes a long way".
Chris writes "more of a logo rather than a mascot":
by R. Loui
ACM Sigplan Notices, Volume 31, Number 8, August 1996
Most people are surprised when I tell them what language we use in our undergraduate AI programming class. That's understandable. We use GAWK. GAWK, Gnu's version of Aho, Weinberger, and Kernighan's old pattern scanning language isn't even viewed as a programming language by most people. Like PERL and TCL, most prefer to view it as a `scripting language.' It has no objects; it is not functional; it does no built-in logic programming. Their surprise turns to puzzlement when I confide that (a) while the students are allowed to use any language they want; (b) with a single exception, the best work consistently results from those working in GAWK. (footnote: The exception was a PASCAL programmer who is now an NSF graduate fellow getting a Ph.D. in mathematics at Harvard.) Programmers in C, C++, and LISP haven't even been close (we have not seen work in PROLOG or JAVA).
There are some quick answers that have to do with the pragmatics of undergraduate programming. Then there are more instructive answers that might be valuable to those who debate programming paradigms or to those who study the history of AI languages. And there are some deep philosophical answers that expose the nature of reasoning and symbolic AI. I think the answers, especially the last ones, can be even more surprising than the observed effectiveness of GAWK for AI.
First it must be confessed that PERL programmers can cobble together AI projects well, too. Most of GAWK's attractiveness is reproduced in PERL, and the success of PERL forebodes some of the success of GAWK. Both are powerful string-processing languages that allow the programmer to exploit many of the features of a UNIX environment. Both provide powerful constructions for manipulating a wide variety of data in reasonably efficient ways. Both are interpreted, which can reduce development time. Both have short learning curves. The GAWK manual can be consumed in a single lab session and the language can be mastered by the next morning by the average student. GAWK's automatic initialization, implicit coercion, I/O support and lack of pointers forgive many of the mistakes that young programmers are likely to make. Those who have seen C but not mastered it are happy to see that GAWK retains some of the same sensibilities while adding what must be regarded as a spoonful of syntactic sugar. Some will argue that PERL has superior functionality, but for quick AI applications, the additional functionality is rarely missed. In fact, PERL's terse syntax is not friendly when regular expressions begin to proliferate and strings contain fragments of HTML, WWW addresses, or shell commands. PERL provides new ways of doing things, but not necessarily ways of doing new things.
In the end, despite minor differences, both PERL and GAWK minimize programmer time. Neither really provides the programmer the setting in which to worry about minimizing run-time.
There are further simple answers. Probably the best is the fact that increasingly, undergraduate AI programming is involving the Web. Oren Etzioni (University of Washington, Seattle) has for a while been arguing that the "softbot" is replacing the mechanical engineers' robot as the most glamorous AI test bed. If the artifact whose behavior needs to be controlled in an intelligent way is the software agent, then a language that is well-suited to controlling the software environment is the appropriate language. That would imply a scripting language. If the robot is KAREL, then the right language is turn left; turn right. If the robot is Netscape, then the right language is something that can generate Netscape -remote 'openURL(http://cs.wustl.edu/~loui)' with elan.
Of course, there are deeper answers. Jon Bentley found two pearls in GAWK: its regular expressions and its associative arrays. GAWK asks the programmer to use the file system for data organization and the operating system for debugging tools and subroutine libraries. There is no issue of user-interface. This forces the programmer to return to the question of what the program does, not how it looks. There is no time spent programming a binsort when the data can be shipped to /bin/sort in no time. (footnote: I am reminded of my IBM colleague Ben Grosof's advice for Palo Alto: Don't worry about whether it's highway 101 or 280. Don't worry if you have to head south for an entrance to go north. Just get on the highway as quickly as possible.)
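To make the two "pearls" and the /bin/sort remark concrete, here is a minimal sketch (not from Loui's article) that counts words with an associative array and ships the ordering to the system sort. The function name word_freq and the letters-only cleanup are assumptions made for illustration.

```shell
# word_freq is a hypothetical helper: text on stdin, "count word" lines on stdout.
word_freq() {
    awk '
    {
        for (i = 1; i <= NF; i++) {
            w = tolower($i)
            gsub(/[^a-z]/, "", w)        # keep ASCII letters only (an assumption)
            if (w != "") count[w]++      # associative array keyed by the word
        }
    }
    END {
        # No hand-written sort: pipe the counts to the system sort instead.
        sorter = "sort -k1,1nr -k2,2"
        for (w in count) print count[w], w | sorter
        close(sorter)
    }'
}

freq=$(printf 'the cat and the dog\nThe end.\n' | word_freq)
echo "$freq"
```

The awk part only says what to count; the ordering is delegated to a tool that already exists, which is the division of labour the paragraph above describes.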
There are some similarities between GAWK and LISP that are illuminating. Both provided a powerful uniform data structure (the associative array implemented as a hash table for GAWK and the S-expression, or list of lists, for LISP). Both were well-supported in their environments (GAWK being a child of UNIX, and LISP being the heart of lisp machines). Both have trivial syntax and find their power in the programmer's willingness to use the simple blocks to build a complex approach.
Deeper still, is the nature of AI programming. AI is about functionality and exploratory programming. It is about bottom-up design and the building of ambitions as greater behaviors can be demonstrated. Woe be to the top-down AI programmer who finds that the bottom-level refinements, `this subroutine parses the sentence,' cannot actually be implemented. Woe be to the programmer who perfects the data structures for that heap sort when the whole approach to the high-level problem needs to be rethought, and the code is sent to the junk heap the next day.
AI programming requires high-level thinking. There have always been a few gifted programmers who can write high-level programs in assembly language. Most however need the ambient abstraction to have a higher floor.
Now for the surprising philosophical answers. First, AI has discovered that brute-force combinatorics, as an approach to generating intelligent behavior, does not often provide the solution. Chess, neural nets, and genetic programming show the limits of brute computation. The alternative is clever program organization. (footnote: One might add that the former are the AI approaches that work, but that is easily dismissed: those are the AI approaches that work in general, precisely because cleverness is problem-specific.) So AI programmers always want to maximize the content of their program, not optimize the efficiency of an approach. They want minds, not insects. Instead of enumerating large search spaces, they define ways of reducing search, ways of bringing different knowledge to the task. A language that maximizes what the programmer can attempt rather than one that provides tremendous control over how to attempt it, will be the AI choice in the end.
Second, inference is merely the expansion of notation. No matter whether the logic that underlies an AI program is fuzzy, probabilistic, deontic, defeasible, or deductive, the logic merely defines how strings can be transformed into other strings. A language that provides the best support for string processing in the end provides the best support for logic, for the exploration of various logics, and for most forms of symbolic processing that AI might choose to call `reasoning' instead of `logic.' The implication is that PROLOG, which saves the AI programmer from having to write a unifier, saves perhaps two dozen lines of GAWK code at the expense of strongly biasing the logic and representational expressiveness of any approach.
I view these last two points as news not only to the programming language community, but also to much of the AI community that has not reflected on the past decade's lessons.
In the puny language, GAWK, which Aho, Weinberger, and Kernighan thought not much more important than grep or sed, I find lessons in AI's trends, AI's history, and the foundations of AI. What I have found not only surprising but also hopeful, is that when I have approached the AI people who still enjoy programming, some of them are not the least bit surprised.
From awk.freeshell.org:
It's a bit embarrassing to note that the exact origins of each are a bit hazy. This whole section requires further work, including the addition of links pointing to source repositories and binary distribution points.
Historical list of Awk implementations.
by T. Menzies
"The Enlightened Ones say that....
Awk is a good old-fashioned UNIX filtering tool invented in the 1970s. The language is simple and Awk programs are generally very short. Awk is useful when the overhead of more sophisticated approaches is not worth the bother. Also, the cost of learning Awk is very low.
But aren't there better scripting languages? Faster? Well, maybe yes and maybe no.
And Awk is old (mid-70s). Aren't modern languages more productive? Well again, maybe yes and maybe no. One measure of the productivity of a language is how many lines of code are required to code up one business level `function point'. Compared to many popular languages, GAWK scores very highly:
loc/fp language
------ --------
6, excel 5
13, sql
21, awk <================
21, perl
21, eiffel
21, clos
21, smalltalk
29, delphi
29, visual basic 5
49, ada 95
49, ai shells
53, c++
53, java
64, lisp
71, ada 83
71, fortran 95
80, 3rd generation default
91, ansi cobol 85
91, pascal
107, 2nd generation default
107, algol 68
107, cobol
107, fortran
128, c
320, 1st generation default
640, machine language
3200, natural language
Anyway, there are other considerations. Awk is real succinct, simple enough to teach, and easy enough to recode in C (if you want raw speed). For example, here's the complete listing of someone's Awk spell-checking program.
BEGIN {while ((getline < "Usr.Dict.Words") > 0) dict[$0]=1}
!dict[$1] {print $1}
Sure, there's about a gazillion enhancements you'd like to make on this one but you gotta say, this is real succinct.
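As a hedged sketch of one such enhancement (not the original author's code), the version below checks every word on a line rather than only $1, folds case, and strips punctuation. The file names, the spell helper and the letters-only cleanup are all assumptions made for the example.

```shell
# Hypothetical sample files; the dictionary holds one word per line.
tmp=$(mktemp -d)
printf 'the\ncat\nsat\non\nmat\n' > "$tmp/dict.txt"
printf 'The cat zat on the mat.\nThe dog sat.\n' > "$tmp/text.txt"

spell() {   # usage: spell DICTIONARY TEXTFILE
    awk -v dictfile="$1" '
    BEGIN {
        # Compare getline against > 0 so a missing dictionary cannot loop forever.
        while ((getline word < dictfile) > 0) dict[tolower(word)] = 1
        close(dictfile)
    }
    {
        for (i = 1; i <= NF; i++) {
            w = tolower($i)
            gsub(/[^a-z]/, "", w)                 # drop punctuation and digits
            if (w != "" && !(w in dict)) print w  # "in" avoids creating entries
        }
    }' "$2"
}

misspelled=$(spell "$tmp/dict.txt" "$tmp/text.txt")
echo "$misspelled"
rm -rf "$tmp"
```

It is still only a dozen lines or so, which is rather the point.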
Awk is the cure for late execution of software syndrome (a.k.a. LESS). The symptoms of LESS are a huge time delay before a new idea is executable. Awk programmers can hack up usable systems in the time it takes other programmers to boot their IDE. And, as a result of that easy exploration, it is possible to find loopholes missed by other analysts that lead to an innovative, better solution to the problem (e.g. see Ronald Loui's O(nlogn) clustering tool).
Certainly, we can drool over the language features offered by more advanced languages like pointers, generic iterators, continuations, etc etc. And Awk's lack of data structures (except num, string, and array) requires some discipline to handle properly.
But experienced Awk programmers know that the cleverer the program, the smaller the audience gets. If it is possible to explain something succinctly in a simple language like Awk, then it is also possible that more folks will read that code.
Finally, and this may be the most important point, it might be misguided to argue about Awk vs LanguageX in terms of the specifics of those languages. Awk programmers can't over-elaborate their solutions- they are forced to code the solution in the simplest manner possible. This push to simplicity, to the essence of the problem, can be an insightful process. Coding in Awk is like preserving fruit- you boil off everything that is superfluous, that needlessly bloats the material you are working with. It is amazing how little code is required to code the core of an idea (e.g. see Darius Bacon's LISP interpreter, written in Awk).
Hirofumi Saito contributes a candidate Awk mascot from the http://gauc.no-ip.org/ Japan GNU AWK Users Club.
runawk is a small wrapper for the AWK interpreter that helps one write standalone AWK scripts. Its main feature is to provide a module/library system for AWK which is somewhat similar to Perl's "use" command. It also allows you to select a preferred AWK interpreter and to set up the environment for your scripts. It also provides other helpful features; for example, it includes numerous useful modules.
Aleksey Cheusov
Recently, on comp.lang.awk, Michael Sanders asked:
It is available for Nokia's N810 (as part of busybox) and, I would hope for the N900.
Martin Cohen answers:
When I followed the link, I got the file awk4j-1.6.1-android-src.zip - when I unpacked it there were a number of files including a directory called "Sample" with a number of awk programs (with odd characters at the end of each line - I opened them in emacs) and a file named "awk4jAndroid.apk" which is, I guess, awk for Android (duh!).
The following ROT13 is a slight modification of the example found at http://www.miranda.org/~jkominek/rot13/awk/.
#!/bin/awk -f
BEGIN {
    from = "NOPQRSTUVWXYZABCDEFGHIJKLMnopqrstuvwxyzabcdefghijklm0987654321"
    to = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890"
    for (i = 1; i <= length(from); i++) {
        letter[substr(from, i, 1)] = substr(to, i, 1)
    }
}
{
    for (i = 1; i <= length($0); i++) {
        char = substr($0, i, 1)
        if (match(char, "[a-zA-Z]|[0-9]") != 0) {
            printf("%c", letter[char])
        } else {
            printf("%c", char)
        }
    }
    printf("\n")
}
Michael Sanders.
Tarball (12K).
Netdraw supports live monitoring of the network interface and displays the most recent 24 hours' activity.
The tool snapshots network activity every five minutes and updates the scrolling image. For example:
Written in awk and bash, netdraw uses fly to drive gd to draw the chart.
Grant Coady
Hyung-Hwan Chung offers QSE, an embeddable Awk.
See QSE.
QSE is a code library that implements various Unix utilities in an embeddable form and provides a set of APIs to embed them into an application. The APIs have been designed to be flexible enough to access various aspects of an embedding application and an embedded object from each other.
By embedding a Unix utility into an application, a developer is relieved of problems caused by interacting with external programs and can have tighter control over it. Currently the library contains the following utilities:
QSEAWK is an embeddable AWK interpreter and is a part of the QSE library. The interpreter implements the language described in the book The AWK Programming Language, with some extensions. Its design focuses on building a flexible and robust embedding API with minimal platform dependency. An embedding application is capable of:
An advantage of embedding a scripting language into an application is that you can extend an application by changing scripts instead of recompiling the whole application. As an AWK lover, I was a bit disappointed that I could not find any embedded implementations of the AWK programming language that I could squeeze into my applications.
QSE is designed to embed Awk into other applications, rather than being used as a standalone tool (though it is not impossible). Why did I choose AWK as an embedded language? Simple. Both I and my clients liked it and were too lazy to learn a new scripting language.
Also, an embedded solution is a better solution than calling an external AWK interpreter:
Hence, my conclusion was to implement an embeddable awk interpreter myself.
One of the applications I wrote implements its password change policy in an AWK script. Before accepting a password entered by a user, the application calls the "is_password_acceptable" function with that password, checks the return value, and decides whether to accept the password.
Of course, upon application start-up the engine is prearranged, using the flexible QSEAWK API functions, with the global variables PASSWD_HISTORY_SIZE and PASSWD_HISTORY_FILE and a built-in function hash_passwd().
Here is the sample AWK function.
function is_password_acceptable(passwd)
{
    # check the password length
    if (length(passwd) < 8) return 0;
    # check if the password is composed of alphabets or digits only
    if (passwd ~ /^([[:alpha:]]+|[[:digit:]]+)$/)
        return 0;
    if (PASSWD_HISTORY_SIZE > 0)
    {
        hashed = hash_passwd(passwd);
        # check if the password is found in the history file
        while ((getline entry < PASSWD_HISTORY_FILE) > 0)
        {
            if (hashed == entry)
            {
                # an entry is found in the history.
                # reject the password
                close (PASSWD_HISTORY_FILE);
                return 0;
            }
        }
        close (PASSWD_HISTORY_FILE);
    }
    return 1;
}
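The policy can also be tried out with a plain awk, outside QSE. The sketch below is an illustration under stated assumptions, not part of QSE: hash_passwd() is a stand-in for the built-in that the host application would register, check_password is a hypothetical wrapper, PASSWD_HISTORY_SIZE is left unset so the history branch is skipped, and the POSIX character classes are swapped for ASCII ranges so that older awks accept the regular expression.

```shell
check_password() {   # hypothetical wrapper; usage: check_password CANDIDATE
    awk -v candidate="$1" '
    # Stand-in only: the real hash_passwd() is supplied by the embedding application.
    function hash_passwd(p) { return p }

    function is_password_acceptable(passwd,    hashed, entry) {
        if (length(passwd) < 8) return 0
        # all letters or all digits is rejected (ASCII ranges, an assumption)
        if (passwd ~ /^([A-Za-z]+|[0-9]+)$/) return 0
        if (PASSWD_HISTORY_SIZE > 0) {
            hashed = hash_passwd(passwd)
            while ((getline entry < PASSWD_HISTORY_FILE) > 0)
                if (hashed == entry) { close(PASSWD_HISTORY_FILE); return 0 }
            close(PASSWD_HISTORY_FILE)
        }
        return 1
    }

    BEGIN { print (is_password_acceptable(candidate) ? "accept" : "reject") }'
}

check_password "short1"        # too short
check_password "abcdefghij"    # letters only
check_password "1234567890"    # digits only
check_password "abc12345xyz"   # mixed and long enough
```

Declaring hashed and entry as extra parameters keeps them local to the function, which matters once the script grows beyond one function.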
The C application's password policy function is roughly as shown below. Note that this application uses the embedded QSEAWK interpreter in an event-driven way (on password change), without entering the BEGIN, pattern-action, END loop.
int is_password_acceptable (qse_awk_rtx_t* rtx, const char* passwd)
{
    qse_awk_val_t* ret, * arg[1];
    qse_bool_t ok;
    ... abbreviated ...
    /* transform a character string to an AWK value */
    arg[0] = qse_awk_rtx_makestrval0 (rtx, passwd);
    ... abbreviated ...
    /* increment the reference counter of arg[0] */
    qse_awk_rtx_refupval (rtx, arg[0]);
    /* call "is_password_acceptable" */
    ret = qse_awk_rtx_call (rtx, "is_password_acceptable", arg, 1);
    /* decrement the reference counter of arg[0] */
    qse_awk_rtx_refdownval (rtx, arg[0]);
    ... abbreviated ...
    /* get the boolean value from the return value */
    ok = qse_awk_rtx_valtobool (rtx, ret);
    /* decrement the reference counter of the return value */
    qse_awk_rtx_refdownval (rtx, ret);
    /* accept or reject? */
    return ok? 0: -1;
}
In the end, I got rid of any need to recompile the whole application and redeploy it whenever a client asks for a password policy change.
Here's a dirt simple method of sprucing up your AWK output under Windows.
Requires Windows, and the WSH scripting host, both of which are native to any modern Windows installation.
To use this method, follow these three steps:
This example will capture the output of your AWK program and render that output dynamically to an HTML stream within a graphical window.
<html>
<head>
<title>My application</title>
<hta:application
id="MyApp"
applicationName="My application"
border="thick"
borderStyle="normal"
caption="yes"
contextMenu="yes"
icon=""
innerBorder="no"
maximizeButton="yes"
minimizeButton="yes"
navigable="yes"
scroll="yes"
scrollFlat="no"
selection="yes"
showInTaskBar="yes"
singleInstance="yes"
sysMenu="yes"
version="1.0"
windowState="normal">
</head>
<body>
<script type="text/vbscript">
Set WshShell = CreateObject("Wscript.shell")
Set objExec= WshShell.Exec("%comspec% /c gawk -f program.awk datafile")
output = objExec.StdOut.ReadAll
document.write("<pre>" & output & "</pre>")
</script>
</body>
</html>
Michael Sanders: http://topcat.hypermart.net.
In this discussion from comp.lang.awk, Martin Cohen builds a really, really, really long string in Gawk (300 million characters). He writes....
I had to extract 25-bit fields from a 90MB binary file, with frames of 10,000 fields indicated by a 33-bit sync value. The words I was interested in were indicated by being preceded by a special tag word.
My first step was to convert the binary file to hex text using od. I then wrote some gawk code to read the text file and extract the (32-bit) words preceded by the tag word. There were 9 million of them.
I concatenated them into a single string of 72 million hex characters (had to do byte-swapping along the way), and then, one character at a time, converted that into a string of 0's and 1's 300 million characters long. I could then easily (using index) search for the sync pattern (independent of any word boundaries) and find the data I wanted.
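The core of that approach can be sketched on a toy input. This is not Martin Cohen's code: the function name find_sync, the sample hex string and the bit pattern are made up for illustration, and none of his optimizations or byte-swapping are attempted.

```shell
find_sync() {   # hypothetical helper; usage: find_sync HEXSTRING BITPATTERN
    awk -v hex="$1" -v sync="$2" '
    BEGIN {
        # Lookup table: one hex digit -> four characters of 0s and 1s.
        split("0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111", nib, " ")
        digits = "0123456789abcdef"
        for (i = 1; i <= 16; i++) bits[substr(digits, i, 1)] = nib[i]

        # Convert one hex character at a time into one long bit string.
        hex = tolower(hex)
        out = ""
        for (i = 1; i <= length(hex); i++) out = out bits[substr(hex, i, 1)]

        print out
        # index() finds the pattern at any bit offset, not just on digit boundaries.
        print index(out, sync)
    }'
}

result=$(find_sync a5f0 11111)
echo "$result"
```

Here the five 1s start at position 8, straddling the boundary between the hex digits 5 and f, which is exactly the "independent of any word boundaries" property described above.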
The total run time was just under 7 minutes (under Red Hat 5.1).
Some optimizations I had to do:
Anyway, it's nice that gawk can handle really long strings.
by Ed Morton (and friends)
The following summary, composed to address the recurring issue of getline (mis)use, was based primarily on information from the book "Effective Awk Programming", Third Edition, by Arnold Robbins (http://www.oreilly.com/catalog/awkprog3), with review and additional input from many of the comp.lang.awk regulars.
getline is fine when used correctly (see below for a list of those cases), but it's best avoided by default because:
As the book "Effective Awk Programming", Third Edition, by Arnold Robbins (http://www.oreilly.com/catalog/awkprog3), which provides much of the source for this discussion, says:
The following summarises the eight variants of getline applications, listing which variables are set by each one:
Variant                 Variables Set
-------                 -------------
getline                 $0, ${1...NF}, NF, FNR, NR, FILENAME
getline var             var, FNR, NR, FILENAME
getline < file          $0, ${1...NF}, NF
getline var < file      var
command | getline       $0, ${1...NF}, NF
command | getline var   var
command |& getline      $0, ${1...NF}, NF
command |& getline var  var
The "command |& ..." variants are GNU awk (gawk) extensions. gawk also populates the ERRNO builtin variable if getline fails.
Although calling getline is very rarely the right approach (see below), if you need to do it the safest ways to invoke getline are:
if/while ( (getline var < file) > 0)
if/while ( (command | getline var) > 0)
if/while ( (command |& getline var) > 0)
since those do not affect any of the builtin variables and they allow you to correctly test for getline succeeding or failing. If you need the input record split into separate fields, just call "split()" to do that.
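A minimal sketch of the first of those forms, under assumed file names, shows the variable-plus-split() pattern and what the return value looks like when the file cannot be opened:

```shell
# Hypothetical sample data.
tmp=$(mktemp -d)
printf 'alice 30\nbob 25\n' > "$tmp/people.txt"

report=$(awk -v file="$tmp/people.txt" -v missing="$tmp/no-such-file" '
BEGIN {
    # Read into a variable; $0, NF, NR and FNR are left alone.
    while ((getline line < file) > 0) {
        n = split(line, f)                 # fields go to f[1..n]
        print f[1] " is " f[2] " (" n " fields)"
    }
    close(file)
    print "NR=" NR " NF=" NF

    # A file that cannot be opened yields -1, which "> 0" treats as failure.
    status = (getline line < missing)
    print "missing file status: " status
}')
echo "$report"
rm -rf "$tmp"
```

A bare `while (getline line < missing)` would spin forever on that -1, which is why the comparison against 0 is part of the idiom.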
Users of getline have to be aware of the following non-obvious effects of using it:
FNR==1 { ... start of file actions ... }
File transitions can occur at getlines, so FNR==1 needs to also be
checked after each unredirected (from a specific file name) getline.
e.g. if you want to print the first line of each of these files:
$ cat file1
a
b
$ cat file2
c
d
you'd normally do:
$ awk 'FNR==1{print}' file1 file2
a
c
but if a "getline" snuck in, it could have the unexpected consequence of
skipping the test for FNR==1 and so not printing the first line of the
second file.
$ awk 'FNR==1{print}/b/{getline}' file1 file2
a
some header line
----------------
data line 1
data line 2
...
data line 10000
you may consider using...
BEGIN { getline header; getline }
{ whatever_using_header_and_data_on_the_line() }
instead of...
FNR == 1 { header = $0 }
FNR < 3 { next }
{ whatever_using_header_and_data_on_the_line() }
but the getline version would not work on multiple files since the BEGIN
section would only be executed once, before the first file is processed,
whereas the non-getline version would work as-is. This is one example of
the common case where the getline command itself isn't directly causing
the problem, but the type of design you can end up with if you select a
getline approach is not ideal.
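The difference shows up as soon as two such files are passed. In this sketch the file names and contents are invented, and the placeholder function from the text is replaced by a simple print so the result is visible:

```shell
# Two hypothetical files, each with a header line and an underline.
tmp=$(mktemp -d)
printf 'name\n----\nx1\nx2\n' > "$tmp/a.txt"
printf 'city\n----\ny1\n'     > "$tmp/b.txt"

# BEGIN+getline: the header is captured once, before the first file only.
with_getline=$(awk 'BEGIN { getline header; getline }
                    { print header ": " $0 }' "$tmp/a.txt" "$tmp/b.txt")

# FNR-based: the header is refreshed at the start of every file.
without_getline=$(awk 'FNR == 1 { header = $0 }
                       FNR < 3  { next }
                       { print header ": " $0 }' "$tmp/a.txt" "$tmp/b.txt")

echo "$with_getline"
echo "--"
echo "$without_getline"
rm -rf "$tmp"
```

The getline version treats the second file's header and underline as data and keeps labelling everything with the first header; the FNR version needs no change at all.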
getline is an appropriate solution for the following:
a)
    command = "ls"
    while ( (command | getline var) > 0) {
        print var
    }
    close(command)
b)
    command = "LC_ALL=C sort"
    n = split("abcdefghijklmnopqrstuvwxyz", a, "")
    for (i = n; i > 0; i--)
        print a[i] |& command
    close(command, "to")
    while ((command |& getline var) > 0)
        print "got", var
    close(command)
c)
    BEGIN {
        while ( (getline var < ARGV[1]) > 0) {
            data[var]++
        }
        close(ARGV[1])
        ARGV[1]=""
    }
    $0 in data
d)
    awk 'function read(file) {
        while ( (getline < file) > 0) {
            if ($1 == "include") {
                read($2)
            } else {
                print > ARGV[2]
            }
        }
        close(file)
    }
    BEGIN{
        read(ARGV[1])
        ARGV[1]=""
        close(ARGV[2])
    }1' file1 tmp
In all other cases, it's clearest, simplest, less error-prone, and easiest to maintain to let awk's normal text-processing read the records. In the case of "c", whether to use the BEGIN+getline approach or just collect the data within the awk condition/action part after testing for the first file is largely a style choice.
"a" above calls the UNIX command "ls" to list the current directory contents, then prints the result one line at a time.
"b" above writes the letters of the alphabet in reverse order, one per line, down the two-way pipe to the UNIX "sort" command. It then closes the write end of the pipe, so that sort receives an end-of-file indication. This causes sort to sort the data and write the sorted data back to the gawk program. Once all of the data has been read, gawk terminates the coprocess and exits. This is particularly necessary in order to use the UNIX "sort" utility as part of a coprocess since sort must read all of its input data before it can produce any output. The sort program does not receive an end-of-file indication until gawk closes the write end of the pipe. Other programs can be invoked as just:
command = "program"
do {
print data |& command
command |& getline var
} while (data left to process)
close(command)
Note that calling close() with a second argument is also gawk-specific.
"c" above reads every record of the first file passed as an argument to awk into an array and then for every subsequent file passed as an argument will print every record from that file that matches any of the records that appeared in the first file (and so are stored in the "data" array). This could alternatively have been implemented as:
# fails if first file is empty
NR==FNR{ data[$0]++; next }
$0 in data
or:
FILENAME==ARGV[1] { data[$0]++; next }
$0 in data
or:
FILENAME=="specificFileName" { data[$0]++; next }
$0 in data
or (gawk only):
ARGIND==1 { data[$0]++; next }
$0 in data
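The "fails if first file is empty" caveat is easiest to see with a slight variation, assumed here for illustration, that prints the records of the second file that are NOT in the first. The file names are hypothetical.

```shell
tmp=$(mktemp -d)
: > "$tmp/seen.txt"                  # empty first file
printf 'a\nb\n' > "$tmp/new.txt"

# NR==FNR stays true throughout the second file when the first one is empty,
# so every record is swallowed into data[] and nothing is printed.
by_nr=$(awk 'NR==FNR { data[$0]++; next } !($0 in data)' "$tmp/seen.txt" "$tmp/new.txt")

# Testing the file name does not depend on the first file having any records.
by_name=$(awk 'FILENAME==ARGV[1] { data[$0]++; next } !($0 in data)' "$tmp/seen.txt" "$tmp/new.txt")

echo "NR==FNR version printed: [$by_nr]"
echo "FILENAME version printed: [$by_name]"
rm -rf "$tmp"
```

With an empty first file nothing has been seen before, so both lines of the second file should be printed; only the FILENAME test gets that right.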
"d" above not only expands all the lines that say "include subfile", but by writing the result to a tmp file, resetting ARGV[1] (the highest level input file) and not resetting ARGV[2] (the tmp file), it then lets awk do any normal record parsing on the result of the expansion since that's now stored in the tmp file. If you don't need that, just do the "print" to stdout and remove any other references to a tmp file or ARGV[2]. In this case, since it's convenient to use $1 and $2, and no other part of the program references any builtin variables, getline was used without populating an explicit variable. This method is limited in its recursion depth to the total number of open files the OS permits at one time.
The following tips may help if, after reading the above, you discover you have an appropriate application for getline or if you're looking for an alternative solution to using getline:
cmd="some command"
do something with cmd
close(cmd)
awk 'c&&!--c;/pattern/{c=N}' file
awk 'c&&!--c{next}/pattern/{c=N}' file
awk 'c&&c--;/pattern/{c=N}' file
awk 'c&&c--{next}/pattern/{c=N}' file
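What each of the four counter idioms does is easiest to see by running them. In this sketch the input is the numbers 1 to 8, the pattern is /3/ and N is 2 (all assumptions for the example). As listed, the two {next} forms only skip records and contain no rule that prints, so the sketch appends a 1 pattern to them to print whatever is not skipped.

```shell
nums() { printf '%s\n' 1 2 3 4 5 6 7 8; }   # sample input, one number per line
join_lines() { paste -sd ' ' -; }            # show each result on one line

# The Nth record after the match.
nth=$(nums | awk 'c&&!--c;/3/{c=2}' | join_lines)

# Everything except the Nth record after the match.
all_but_nth=$(nums | awk 'c&&!--c{next}/3/{c=2}1' | join_lines)

# The N records after the match.
next_n=$(nums | awk 'c&&c--;/3/{c=2}' | join_lines)

# Everything except the N records after the match.
all_but_n=$(nums | awk 'c&&c--{next}/3/{c=2}1' | join_lines)

echo "$nth"
echo "$all_but_nth"
echo "$next_n"
echo "$all_but_n"
```

In every case c is a countdown that is armed when the pattern matches; `!--c` is true only on the record where it reaches zero, while `c--` is true on every record while it is still counting.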
In this example there are no blank lines and the output is all aligned with the left hand column and you want to print $0 for the second record following the record that contains some pattern, e.g. the number 3:
$ cat file
line 1
line 2
line 3
line 4
line 5
line 6
line 7
line 8
$ awk '/3/{getline;getline;print}' file
line 5
That works just fine. Now let's see the concise way to do it without getline:
$ awk 'c&&!--c;/3/{c=2}' file
line 5
It's not quite so obvious at a glance what that does, but it uses an idiom that most awk programmers could do well to learn and it is briefer and avoids all those getline caveats.
Now let's say we want to print the 5th line after the pattern instead of the 2nd line. Then we'd have:
$ awk '/3/{getline;getline;getline;getline;getline;print}' file
line 8
$ awk 'c&&!--c;/3/{c=5}' file
line 8
i.e. we have to add a whole series of additional getline calls to the getline version, as opposed to just changing the counter from 2 to 5 for the non-getline version. In reality, you'd probably completely rewrite the getline version to use a loop:
$ awk '/3/{for (c=1;c<=5;c++) getline; print}' file
line 8
Still not as concise as the non-getline version, has all the getline caveats and required a redesign of the code just to change a counter.
Now let's say we also have to print the word "Eureka" if the number 4 appears in the input file. With the getline version, you now have to do something like:
$ awk '/3/{for (c=1;c<=5;c++) { getline; if ($0 ~ /4/) print "Eureka!" }
print}' file
Eureka!
line 8
whereas with the non-getline version you just have to do:
$ awk 'c&&!--c;/3/{c=5}/4/{print "Eureka!"}' file
Eureka!
line 8
i.e. with the getline version, you have to work around the fact that you're now processing records outside of the normal awk work-loop, whereas with the non-getline version you just have to drop your test for "4" into the normal place and let awk's normal record processing deal with it like it always does. Actually, if you look closely at the above you'll notice we just unintentionally introduced a bug in the getline version. Consider what would happen in both versions if 3 and 4 appear on the same line. The non-getline version would behave correctly, but to fix the getline version, you'd need to duplicate the condition somewhere, e.g. perhaps something like this:
$ awk '/3/{for (c=1;c<=5;c++) { if ($0 ~ /4/) print "Eureka!"; getline }
if ($0 ~ /4/) print "Eureka!"; print}' file
Eureka!
line 8
Now consider how the above would behave when there aren't 5 lines left in the input file or when the last line of the file contains both a 3 and a 4. i.e. there are still design questions to be answered and bugs that will appear at the limits of the input space.
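To make the first of those limits concrete, here is a small sketch that feeds both versions an input with only two records after the match. The helper name short_input is made up for this illustration.

```shell
# Five-line input, so only two records follow the match on "line 3".
# (short_input is a hypothetical helper used only for this illustration.)
short_input() { printf 'line %s\n' 1 2 3 4 5; }

# getline version: once the input runs out, getline fails and leaves $0
# alone, so the final print emits a record that is not the 5th one after
# the match.
short_input | awk '/3/{for (c=1;c<=5;c++) getline; print}'

# counter version: the countdown never reaches zero, so nothing is printed.
short_input | awk 'c&&!--c;/3/{c=5}'
```

The getline version prints "line 5" here, even though there is no 5th record after the match; the counter version prints nothing.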
Ignoring those bugs since this is not intended as a discussion on debugging getline programs, let's say you no longer need to print the 5th record after the number 3 but still have to do the Eureka on 4. With the getline version, you'd strip out the test for 3 and the getline stuff to be left with:
$ awk '{if ($0 ~ /4/) print "Eureka!"}' file
Eureka!
which you'd then presumably rewrite as:
$ awk '/4/{print "Eureka!"}' file
Eureka!
which is what you get just by removing everything involving the test for 3 and counter in the non-getline version (i.e. "c&&!--c;/3/{c=5}"):
$ awk '/4/{print "Eureka!"}' file
Eureka!
i.e. again, one small requirement change required a complete redesign of the getline code, but just the absolute minimum necessary tweak to the non-getline version.
So, what you see above in the getline case was significant redesign required for every tiny requirement change, much larger amounts of handwritten code required, insidious bugs introduced during development and challenging design questions at the limits of your input space, whereas the non-getline version always had less code, was much easier to modify as requirements changed, and was much more obvious, predictable, and correct in how it would behave at the limits of the input space.
by Jim Hart
I've written this kind of thing
n = split(something,arr,/re/)
for(i=1;i<=n;i++) {
print arr[i]
}
so often, it's tedious. I like this better:
n = split(something,arr,/re/)
while(n--) {
print arr[++i]
}
Easier to type. And, in cases where front-to-back or back-to-front doesn't matter, it's even simpler:
# copy a number indexed array (indices 0 to n-1), assuming n contains
# the number of elements
while(n--) arr2[n] = arr1[n]
And, yes,
for(i in arr1) arr2[i] = arr1[i]
works, too. But, some loops don't involve arrays. :-)
This tip has been discussed on comp.lang.awk.
This web site is a front end to a repository of Awk code. The site, and the code, is maintained by the international awk community (which includes you) so there are many ways you can contribute:
Using this logo, link to http://awk.info:
(By the way, our current logo is pretty lame. Want to contribute a better one? Please, be our guest!)
When writing a page, please follow these guidelines:
1 2 3 4 5 6 7
012345678901234567890123456789012345678901234567890123456789012345678901234567890
To contribute code, zip up the directory and mail it to
All function and file names are global to our code so please ensure your new function/file name does not clobber an old one.
Optionally, you might consider adding:
In the language of this site, a function file is a 100% standalone file containing one or more functions with no dependencies on other files. Note that if your function file depends on other files, then it becomes a package (see below).
Functions are stored in a file called myfunc.awk.
In the language of this site, a package is a file that depends on other files (and the other files may depend on yet others, recursively).
Following a recent discussion in comp.lang.awk, we say that these dependencies are commented with
#use file.awk
where file.awk is some file (e.g. a file in the current directory).
Note that file.awk will be loaded before the file containing the reference to #use file.awk.
The code that renders the awk.info web site can "pretty print" awk code. For example:
To enable that pretty print, add some html syntax inside your code and apply the following conventions.
Note that if you want to see your code "looking pretty", then you could see how it looks using our preview tool:
http://awk.info/?awk:urlWithoutHTTPprefix
For example, the file http://menzies.us/tmp/xx.awk can be previewed using http://awk.info/?awk:menzies.us/tmp/xx.awk
Once you've got it "looking pretty", please consider contributing that code to awk.info, so our code library can grow. To do so, either email mail@awk.info with the URL of your pretty code or zip up the files and email them across.
The first paragraph of the file will be ignored. Use this first para for copyright notices or comments about down-in-the weeds trivia. Note: the first para ends with one blank line.
The next paragraph should start with
#.H1 <join>Title</join>
The code should be topped and tailed as follows:
#<pre>
code
#</pre>
All other comment lines should start with a single "#" at front-of-line. These comment characters will be stripped away by the awk.info renderer.
Awk.info's renderer adopts the following html shorthand. If a line starts with
#.WORD other words
then this is replaced with
<WORD> other words</WORD>
If no other words follow #.WORD then the line becomes just <WORD>
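The shorthand rule can be sketched in a few lines of awk. This is not the awk.info renderer, just a minimal illustration of the rule as stated above; the function name expand_shorthand is invented, and the assumption that WORD consists of letters and digits is mine.

```shell
# expand_shorthand: rewrite "#.WORD text" lines as "<WORD> text</WORD>".
# (Hypothetical name; a sketch of the rule, not the site renderer.)
expand_shorthand() {
    awk '
        match($0, /^#\.[A-Za-z0-9]+/) {
            tag  = substr($0, 3, RLENGTH - 2)   # WORD, without the leading "#."
            rest = substr($0, RLENGTH + 1)      # everything after WORD
            if (rest ~ /[^ \t]/)
                print "<" tag ">" rest "</" tag ">"
            else
                print "<" tag ">"               # no other words: opening tag only
            next
        }
        { print }                               # other lines pass through
    '
}

printf '%s\n' '#.H1 Title here' '#.P' 'x = 1' | expand_shorthand
```

With that input, the first line becomes an H1 element, the bare #.P becomes just the opening tag, and the code line is passed through untouched.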
Awk.info's renderer supports a few HTML extensions:
That's it. Now you can pretty print your code on the web just by adding a little html in the comments.
Ideally, all code in our code repository comes with unit tests:
Accordingly code offered to this site can contain unit tests, using the methods described in this page.
But before going on, we stress that awk.info gratefully accepts awk contributions in any form. That is, including unit tests with code is optional.
If your code is in directory yourcode then create a sub-directory yourcode/eg
Write a test in a file yourcode/eg/yourtest. Divide that test into two parts:
# assumes
# - the LAWKER trunk has been checked out and
# - .bash_profile contains: export Lawker="$HOME/svns/lawker/fridge"
. $Lawker/lib/bash/setup
gawk -f join.awk --source '
BEGIN { split("tim tom tam",a)
print join(a,2)
}'
Write the expected output of that test case in yourcode/eg/yourtest.out
The above file conventions mean that an automatic tool can run over the entire code base and perform a regression test (checking if all the tests generate all the *.out files).
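Such a tool could be as small as the following sketch. It is not the site's actual tool: the name run_tests is hypothetical, and it assumes every test in the eg directory is a shell script (as in the join.awk example above) whose combined output should match the .out file byte for byte.

```shell
# run_tests DIR: run every test script in DIR/eg and compare its output
# with the matching .out file. (run_tests is a hypothetical helper name;
# it assumes each test is a shell script.)
run_tests() {
    failed=0
    for t in "$1"/eg/*; do
        case "$t" in *.out) continue ;; esac   # skip the expected-output files
        [ -f "$t.out" ] || continue            # no expectation recorded
        if sh "$t" 2>&1 | cmp -s - "$t.out"; then
            echo "PASS $t"
        else
            echo "FAIL $t"
            failed=1
        fi
    done
    return "$failed"
}
```

Running "run_tests yourcode" would then print one PASS or FAIL line per test and return a non-zero status if anything failed.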
Another advantage of the above scheme is that you can use the tests to document your code.
To show the test case, add the following into your .awk file:
#.BODY yourcode/eg/yourtest
#.CODE yourcode/eg/yourtest.out
Then zip the directory yourcode (including yourcode/eg) and send it to awk.info. Once we install those files on our site then when awk.info displays that file, the test case trivia is hidden and the users only see the essential details. For an example of this, see http://awk.info/?gawk/array/join.awk.
The following list is sorted by newbie-ness (so best to start at the top):
The following list is sorted by the number of times this material is tagged at delicious.com (most tagged at top):
(For tutorial material on Awk, see Learning Awk page.)
R. Loui loui@ai.wustl.edu is Associate Professor of Computer Science, at Washington University in St. Louis. He has published in AI Journal, Computational Intelligence, ACM SIGART, AI Magazine, AI and Law, the ACM Computing Surveys Symposium on AI, Cognitive Science, Minds and Machines, Journal of Philosophy.
Whenever Ronald Loui teaches GAWK, he gives the students the choice of learning PERL instead. Ninety percent will choose GAWK after looking at a few simple examples of each language (samples shown below). Those who choose PERL do so because someone told them to learn PERL.
After one laboratory, more than half of the GAWK students are confident with their GAWK skills and can begin designing. Almost no student can become confident in PERL that quickly.
After a week, 90% of those who have attempted GAWK have mastered it, compared to fewer than 50% of PERL students attaining similar facility with the language (it would be unfair to require one to `master' PERL).
By the end of the semester, over 90% who have attempted GAWK have succeeded, and about two-thirds of those who have attempted PERL have succeeded.
To be fair, within a year, half of the GAWK programmers have also studied PERL. Most are doing so in order to read PERL and will not switch to writing PERL. No one who learns PERL migrates to GAWK.
PERL and GAWK appear to have similar programming, development, and debugging cycle times.
Finally, there seems to be a small advantage for GAWK over PERL, after a year, for the programmer's willingness to begin a new program. That is, both GAWK and PERL programmers tend to enjoy writing a lot of programs, but GAWK has the slight edge here.
by T. Menzies
Imagine Gawk as a kind of a cut-down C language with four tricks:
What do all these do? Well....
You don't need to define variables- they appear as you use them.
There are only three types: strings, numbers, and arrays.
To ensure a number is a number, add zero to it.
x=x+0
To ensure a string is a string, add an empty string to it.
x= x "" "the string you really want to add"
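Why bother forcing a type? Because comparisons behave differently for strings and numbers. A small sketch (the wrapper name coerce_demo is invented for this illustration):

```shell
# coerce_demo is a hypothetical wrapper name used only for this illustration.
coerce_demo() {
    awk 'BEGIN {
        x = "10"; y = "9"                 # both start life as strings
        print (x < y)                     # string comparison: "1" sorts before "9"
        print (x + 0 < y + 0)             # numeric comparison after adding zero
        print ((x "") < (y ""))           # forced back to strings
    }'
}
coerce_demo
```

This prints 1, 0 and 1: as strings "10" sorts before "9", but as numbers 10 is not less than 9.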
To ensure your variables aren't global, use them within a function and add more variables to the call. For example if a function is passed two variables, define it with two PLUS the local variables:
function haslocals(passed1,passed2, local1,local2,local3) {
passed1=passed1+1 # scalars are passed by value: only the local copy changes
local1=7 # only changed locally
}
Note that it's good practice to add white space between passed and local variables.
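Here is a small sketch of how the extra-parameter trick behaves in practice. The names locals_demo, bump and fill are invented for this illustration. It also shows that a scalar argument is copied into the function, while an array argument is shared with the caller.

```shell
# locals_demo, bump and fill are hypothetical names for this illustration.
locals_demo() {
    awk '
        # count is a parameter; tmp is a local (note the extra spaces)
        function bump(count,    tmp) {
            tmp = count + 1
            count = tmp          # changes only the copy inside bump
            return tmp
        }
        function fill(arr) { arr["k"] = "set" }   # arrays are passed by reference
        BEGIN {
            n = 1
            r = bump(n)
            fill(a)
            print n, r, (tmp == "" ? "tmp is unset" : tmp), a["k"]
        }
    '
}
locals_demo
```

The output is "1 2 tmp is unset set": n is untouched, tmp never leaks out of bump, and the array filled inside fill is visible afterwards.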
Gawk programs can contain functions AND pattern/action pairs.
If the pattern is satisfied, the action is called.
/^\.P1/ { if (p != 0) print ".P1 after .P1, line", NR;
p = 1;
}
/^\.P2/ { if (p != 1) print ".P2 with no preceding .P1, line", NR;
p = 0;
}
END { if (p != 0) print "missing .P2 at end" }
Two magic patterns are BEGIN and END. These are true before and after all the input files are read. Use END for end actions (e.g. final reports) and BEGIN for start up actions such as initializing default variables, setting the field separator, resetting the seed of the random number generator:
BEGIN {
while ((getline < "Usr.Dict.Words") > 0) #slurp in dictionary
dict[$0] = 1
FS=","; #set field separator
srand(); #reset random seed
Round=10; #always start globals with U.C.
}
The default action is {print $0}; i.e. print the whole line.
The default pattern is 1; i.e. true.
Patterns are checked, top to bottom, in source-code order.
Patterns can contain regular expressions. In the above example /^\.P1/ means "front of line followed by a full stop followed by P1". Regular expressions are important enough for their own section.
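The defaults are easiest to see side by side. In this sketch (the wrapper name defaults_demo and the three input lines are made up), one rule has only a pattern, one has only an action, and one has both:

```shell
# defaults_demo is a hypothetical wrapper name used only for this illustration.
defaults_demo() {
    printf '%s\n' 'alpha 1' 'beta 2' 'gamma 3' |
    awk '
        NR == 2                       # pattern only: default action prints the line
        { n++ }                       # action only: runs for every line
        /gamma/ { print "saw", $1 }   # pattern and action
        END { print n, "lines" }
    '
}
defaults_demo
```

This prints "beta 2", then "saw gamma", then "3 lines".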
Ok, so now we know enough to explain a simple report function. How does hist.awk work in the following?
% cat /etc/passwd | grep -v \# | cut -d: -f 6|sort |
uniq -c | sort -r -n | gawk -f hist.awk
************************** 26 /var/empty
** 2 /var/virusmails
** 2 /var/root
* 1 /var/xgrid/controller
* 1 /var/xgrid/agent
* 1 /var/teamsserver
* 1 /var/spool/uucp
* 1 /var/spool/postfix
* 1 /var/spool/cups
* 1 /var/pcast/server
* 1 /var/pcast/agent
* 1 /var/imap
* 1 /Library/WebServer
hist.awk reads the largest count from line one (when NR==1), then works out a scale factor so that the longest bar fits within Width columns. For each line, it then prints the line ($0) with some stars at front.
NR==1 { Width = Width ? Width : 40 # sets Width if it is missing
Scale = $1 > Width ? Width / $1 : 1
}
{ Stars=int($1*Scale);
print str(Width - Stars," ") str(Stars,"*") $0
}
# note that, in the following "tmp" is a local variable
function str(n,c, tmp) { # returns a string, size "n", of all "c"
while((n--) > 0 ) tmp= c tmp
return tmp
}
Do you know what these mean?
Well, the first two are leading and trailing blank spaces on a line and the last one is the definition of an IEEE-standard number written as a regular expression. Once we know that, we can do a bunch of common tasks like trimming away white space around a string:
function trim(s, t) {
t=s;
sub(/^[ \t\n]*/,"",t);
sub(/[ \t\n]*$/,"",t);
return t
}
or recognize something that isn't a number:
if ( $i !~ /^[+-]?([0-9]+[.]?[0-9]*|[.][0-9]+)([eE][+-]?[0-9]+)?$/ )
{print "ERROR: " $i " not a number"}
Regular expressions are an astonishingly useful tool supported by many languages (e.g. Awk, Perl, Python, Java). The following notes review the basics. For full details, see http://www.gnu.org/manual/Gawk-3.1.1/html_node/Regexp.html#Regexp.
Syntax: Here's the basic building blocks of regular expressions:
c
matches the character c (assuming c is a character with no special meaning in regexps).
\c
matches the literal character c; e.g. tabs and newlines are \t and \n respectively.
.
matches any character except newline.
^
matches the beginning of a line or a string.
$
matches the end of a line or a string.
[abc...]
matches any of the characters abc... (character class).
[^abc...]
matches any character except abc... and newline (negated character class).
r*
matches zero or more r's.
And that's enough to understand our trim function shown above. The regular expression /[ \t\n]*$/ means trailing whitespace; i.e. zero-or-more spaces, tabs, or newlines followed by the end of line.
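As a quick check of that reading, here is a sketch that runs a trim function over a padded string. The wrapper name trim_demo is invented, and this variant edits its parameter directly rather than using a separate local.

```shell
# trim_demo is a hypothetical wrapper name used only for this illustration.
trim_demo() {
    awk '
        function trim(s) {
            sub(/^[ \t\n]*/, "", s)   # leading blanks
            sub(/[ \t\n]*$/, "", s)   # trailing blanks
            return s
        }
        BEGIN { print "[" trim("  \thello world \t ") "]" }
    '
}
trim_demo
```

The brackets make the result visible: the output is [hello world], with the inner space kept.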
But that's only the start of regular expressions. There's lots more. For example:
r+
matches one or more r's.
r?
matches zero or one r's.
r1|r2
matches either r1 or r2 (alternation).
r1r2
matches r1, and then r2 (concatenation).
(r)
matches r (grouping).
Now we can read ^[+-]?([0-9]+[.]?[0-9]*|[.][0-9]+)([eE][+-]?[0-9]+)?$ like this:
^[+-]? ...
Numbers begin with zero or one plus or minus signs.
...[0-9]+...
Simple numbers are just one or more digits.
...[.]?[0-9]*...
which may be followed by a decimal point and zero or more digits.
...|[.][0-9]+...
Alternatively, a number can have zero leading digits and just start with a decimal point.
.... ([eE]...)?$
Also, there may be an exponent added
...[+-]?[0-9]+)?$
and that exponent is a positive or negative bunch of digits.
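One way to convince yourself of that reading is to try the expression on a few candidates. In this sketch the function name isnum, the wrapper isnum_demo and the sample values are all made up for illustration:

```shell
# isnum_demo and isnum are hypothetical names used only for this illustration.
isnum_demo() {
    printf '%s\n' 42 -3.14 .5 1e10 +2.5E-3 abc 1.2.3 e5 |
    awk '
        function isnum(s) {
            return s ~ /^[+-]?([0-9]+[.]?[0-9]*|[.][0-9]+)([eE][+-]?[0-9]+)?$/
        }
        { print $0, (isnum($0) ? "number" : "not a number") }
    '
}
isnum_demo
```

The first five values are accepted; abc, 1.2.3 and e5 are rejected (the last because an exponent needs some digits in front of it).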
Gawk has arrays, but they are only indexed by strings. This can be very useful, but it can also be annoying. For example, we can count the frequency of words in a document (ignoring the icky part about printing them out):
gawk '{for(i=1;i <=NF;i++) freq[$i]++ }' filename
The array will hold an integer value for each word that occurred in the file. Unfortunately, this treats "foo", "Foo", and "foo," as different words. Oh well. How do we print out these frequencies? Gawk has a special "for" construct that loops over the indices of an array. This script is longer than most command lines, so it will be expressed as an executable script:
#!/usr/bin/awk -f
{for(i=1;i <=NF;i++) freq[$i]++ }
END{for(word in freq) print word, freq[word] }
You can find out if an element exists in an array at a certain index with the expression:
index in array
This expression tests whether or not the particular index exists, without the side effect of creating that element if it is not present.
You can remove an individual element of an array using the delete statement:
delete array[index]
It is not an error to delete an element which does not exist.
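The difference between testing with "in" and simply referencing an element is easy to miss, so here is a small sketch (the wrapper name membership_demo and the keys are made up):

```shell
# membership_demo is a hypothetical wrapper name used only for this illustration.
membership_demo() {
    awk 'BEGIN {
        freq["foo"] = 2
        print (("bar" in freq) ? "bar present" : "bar absent")
        if (freq["baz"] == "") print "baz looks empty"   # this reference creates baz
        print (("baz" in freq) ? "baz present" : "baz absent")
        delete freq["baz"]
        delete freq["nothing"]                           # deleting a missing key is fine
        n = 0; for (k in freq) n++
        print n, "element(s) left"
    }'
}
membership_demo
```

Testing "bar" with in leaves the array alone, while merely looking at freq["baz"] brings that element into existence until it is deleted.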
Gawk has a special kind of for statement for scanning an array:
for (var in array)
body
This loop executes body once for each different value that your program has previously used as an index in array, with the variable var set to that index.
The order in which the array is scanned is not defined.
To scan an array in some numeric order, you need to use keys 1,2,3,... and store somewhere that the array is N long. Then you can loop over the indices from 1 to N.
Here are some useful array functions. We begin with the usual stack stuff. These stacks have items 1,2,3,.... and position 0 is reserved for the size of the stack:
function top(a) {return a[a[0]]}
function push(a,x, i) {i=++a[0]; a[i]=x; return i}
function pop(a, x,i) {
i=a[0];
if (!i) {return ""} else {a[0]--; x=a[i]; delete a[i]; return x}}
The pop function can be used in the usual way:
BEGIN {push(a,1); push(a,2); push(a,3);
while(x=pop(a)) print x}
3
2
1
We can catch everything in an array to a string:
function a2s(a, i,s) {
s="";
for (i in a) {s=s " " i "= [" a[i]"]\n"};
return s}
BEGIN {push(L,1); push(L,2); push(L,3);
print a2s(L);}
0= [3]
1= [1]
2= [2]
3= [3]
And we can go the other way and convert a string into an array using the built in split function. These pod files were built using a recursive include function that seeks patterns of the form:
^=include file
This function splits lines on space characters into the array `a' then looks for =include in a[1]. If found, it calls itself recursively on a[2]. Otherwise, it just prints the line:
function rinclude (line, x,a) {
split(line,a,/ /);
if ( a[1] ~ /^\=include/ ) {
while ( ( getline x < a[2] ) > 0) rinclude(x);
close(a[2])}
else {print line}
}
Note that the third argument of the split function can be any regular expression.
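For instance, a separator can be "one or more digits" or "a comma with optional spaces around it". A small sketch (the names split_demo and show, and the sample strings, are made up for illustration):

```shell
# split_demo and show are hypothetical names used only for this illustration.
split_demo() {
    awk '
        function show(arr, n,    i, out) {
            for (i = 1; i <= n; i++) out = out (i > 1 ? " " : "") i "=" arr[i]
            print out
        }
        BEGIN {
            n = split("a1b22c333d", parts, /[0-9]+/)   # separator: one or more digits
            show(parts, n)
            n = split("x, y ,z", parts, /[ ]*,[ ]*/)   # comma with optional spaces
            show(parts, n)
        }
    '
}
split_demo
```

The first call yields the four pieces a, b, c and d; the second yields x, y and z with the surrounding spaces swallowed by the separator.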
By the way, here's a nice trick with arrays. To print the lines in a file in a random order:
BEGIN {srand()}
{Array[rand()]=$0}
END {for(I in Array) print Array[I]}
Short, heh? This is not a perfect solution. Gawk can only generate 1,000,000 different random numbers so the birthday theorem cautions that there is a small chance that the lines will be lost when different lines are written to the same randomly selected location. After some experiments, I can report that you lose around one item after 1,000 inserts and 10 to 12 items after 10,000 random inserts. Nothing to write home about really. But for larger item sets, the above three liner is not what you want to use. For example, 10,000 to 12,000 items (more than 10%) are lost after 100,000 random inserts. Not good!
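For larger inputs, one alternative that cannot lose lines is to store each line under its line number and then shuffle the stored lines in place. The sketch below swaps in the well-known Fisher-Yates shuffle; it is not the three-liner's author's method, and the function name shuffle_lines is invented for this illustration.

```shell
# shuffle_lines: print the input lines in random order without losing any.
# (Hypothetical name; uses a Fisher-Yates shuffle, not the rand()-as-key trick.)
shuffle_lines() {
    awk '
        BEGIN { srand() }
        { line[NR] = $0 }                       # keep every line, keyed by NR
        END {
            for (i = NR; i > 1; i--) {          # Fisher-Yates shuffle
                j = int(rand() * i) + 1         # random index in 1..i
                tmp = line[i]; line[i] = line[j]; line[j] = tmp
            }
            for (i = 1; i <= NR; i++) print line[i]
        }
    '
}
```

Because the keys are the line numbers, two lines can never land in the same slot; the price is a few more lines of code.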
Awk is famous for how much it can do in one line.
This site has many samples of that capability. And if you have any more to add, please send them in.
Eric Pement
pemente@northpark.edu
Latest version of this file is usually at:
http://www.student.northpark.edu/pemente/awk/awk1line.txt
Unix: awk '/pattern/ {print "$1"}' # standard Unix shells
DOS/Win: awk '/pattern/ {print "$1"}' # okay for DJGPP compiled
awk "/pattern/ {print \"$1\"}" # required for Mingw32
Most of my experience comes from versions of GNU awk (gawk) compiled for Win32. Note in particular that DJGPP compilations permit the awk script to follow Unix quoting syntax '/like/ {"this"}'. However, the user must know that single quotes under DOS/Windows do not protect the redirection arrows (<, >) nor do they protect pipes (|). Both are special symbols for the DOS/CMD command shell and their special meaning is ignored only if they are placed within "double quotes." Likewise, DOS/Win users must remember that the percent sign (%) is used to mark DOS/Win environment variables, so it must be doubled (%%) to yield a single percent sign visible to awk.
If I am sure that a script will NOT need to be quoted in Unix, DOS, or CMD, then I normally omit the quote marks. If an example is peculiar to GNU awk, the command 'gawk' will be used. Please notify me if you find errors or new commands to add to this list (total length under 65 characters). I usually try to put the shortest script first.
Double space a file
awk '1;{print ""}'
awk 'BEGIN{ORS="\n\n"};1'
Double space a file which already has blank lines in it. Output file should contain no more than one blank line between lines of text. NOTE: On Unix systems, DOS lines which have only CRLF (\r\n) are often treated as non-blank, and thus 'NF' alone will return TRUE.
awk 'NF{print $0 "\n"}'
Triple space a file
awk '1;{print "\n"}'
Precede each line by its line number FOR THAT FILE (left alignment). Using a tab (\t) instead of space will preserve margins.
awk '{print FNR "\t" $0}' files*
Precede each line by its line number FOR ALL FILES TOGETHER, with tab.
awk '{print NR "\t" $0}' files*
Number each line of a file (number on left, right-aligned) Double the percent signs if typing from the DOS command prompt.
awk '{printf("%5d : %s\n", NR,$0)}'
Number each line of file, but only print numbers if line is not blank Remember caveats about Unix treatment of \r (mentioned above)
awk 'NF{$0=++a " :" $0};{print}'
awk '{print (NF? ++a " :" :"") $0}'
Count lines (emulates "wc -l")
awk 'END{print NR}'
Print the sums of the fields of every line
awk '{s=0; for (i=1; i<=NF; i++) s=s+$i; print s}'
Add all fields in all lines and print the sum
awk '{for (i=1; i<=NF; i++) s=s+$i}; END{print s}'
Print every line after replacing each field with its absolute value
awk '{for (i=1; i<=NF; i++) if ($i < 0) $i = -$i; print }'
awk '{for (i=1; i<=NF; i++) $i = ($i < 0) ? -$i : $i; print }'
Print the total number of fields ("words") in all lines
awk '{ total = total + NF }; END {print total}' file
Print the total number of lines that contain "Beth"
awk '/Beth/{n++}; END {print n+0}' file
Print the largest first field and the line that contains it Intended for finding the longest string in field #1
awk '$1 > max {max=$1; maxline=$0}; END{ print max, maxline}'
Print the number of fields in each line, followed by the line
awk '{ print NF ":" $0 } '
Print the last field of each line
awk '{ print $NF }'
Print the last field of the last line
awk '{ field = $NF }; END{ print field }'
Print every line with more than 4 fields
awk 'NF > 4'
Print every line where the value of the last field is > 4
awk '$NF > 4'
IN UNIX ENVIRONMENT: convert DOS newlines (CR/LF) to Unix format
awk '{sub(/\r$/,"");print}' # assumes EACH line ends with Ctrl-M
IN UNIX ENVIRONMENT: convert Unix newlines (LF) to DOS format
awk '{sub(/$/,"\r");print}'
IN DOS ENVIRONMENT: convert Unix newlines (LF) to DOS format
awk 1
IN DOS ENVIRONMENT: convert DOS newlines (CR/LF) to Unix format Cannot be done with DOS versions of awk, other than gawk:
gawk -v BINMODE="w" '1' infile >outfile
Use "tr" instead.
tr -d \r <infile >outfile # GNU tr version 1.22 or higher
Delete leading whitespace (spaces, tabs) from front of each line aligns all text flush left
awk '{sub(/^[ \t]+/, ""); print}'
Delete trailing whitespace (spaces, tabs) from end of each line
awk '{sub(/[ \t]+$/, "");print}'
Delete BOTH leading and trailing whitespace from each line
awk '{gsub(/^[ \t]+|[ \t]+$/,"");print}'
awk '{$1=$1;print}' # also removes extra space between fields
Insert 5 blank spaces at beginning of each line (make page offset)
awk '{sub(/^/, "     ");print}'
Align all text flush right on a 79-column width
awk '{printf "%79s\n", $0}' file*
Center all text on a 79-character width
awk '{l=length();s=int((79-l)/2); printf "%"(s+l)"s\n",$0}' file*
Substitute (find and replace) "foo" with "bar" on each line
awk '{sub(/foo/,"bar");print}' # replaces only 1st instance
gawk '{$0=gensub(/foo/,"bar",4);print}' # replaces only 4th instance
awk '{gsub(/foo/,"bar");print}' # replaces ALL instances in a line
Substitute "foo" with "bar" ONLY for lines which contain "baz"
awk '/baz/{gsub(/foo/, "bar")};{print}'
Substitute "foo" with "bar" EXCEPT for lines which contain "baz"
awk '!/baz/{gsub(/foo/, "bar")};{print}'
Change "scarlet" or "ruby" or "puce" to "red"
awk '{gsub(/scarlet|ruby|puce/, "red"); print}'
Reverse order of lines (emulates "tac")
awk '{a[i++]=$0} END {for (j=i-1; j>=0;) print a[j--] }' file*
If a line ends with a backslash, append the next line to it (fails if there are multiple lines ending with backslash...)
awk '/\\$/ {sub(/\\$/,""); getline t; print $0 t; next}; 1' file*
Print and sort the login names of all users
awk -F ":" '{ print $1 | "sort" }' /etc/passwd
Print the first 2 fields, in opposite order, of every line
awk '{print $2, $1}' file
Switch the first 2 fields of every line
awk '{temp = $1; $1 = $2; $2 = temp; print}' file
Print every line, deleting the second field of that line
awk '{ $2 = ""; print }'
Print in reverse order the fields of every line
awk '{for (i=NF; i>0; i--) printf("%s ",$i);printf ("\n")}' file
Remove duplicate, consecutive lines (emulates "uniq")
awk 'NR==1 || ($0 "") != a; {a=$0}'
Remove duplicate, nonconsecutive lines
awk '! a[$0]++' # most concise script
awk '!($0 in a) {a[$0];print}' # most efficient script
Concatenate every 5 lines of input, using a comma separator between fields
awk 'ORS=NR%5?",":"\n"' file
Print first 10 lines of file (emulates behavior of "head")
awk 'NR < 11'
Print first line of file (emulates "head -1")
awk 'NR>1{exit};1'
Print the last 2 lines of a file (emulates "tail -2")
awk '{y=x "\n" $0; x=$0};END{print y}'
Print the last line of a file (emulates "tail -1")
awk 'END{print}'
Print only lines which match regular expression (emulates "grep")
awk '/regex/'
Print only lines which do NOT match regex (emulates "grep -v")
awk '!/regex/'
Print the line immediately before a regex, but not the line containing the regex
awk '/regex/{print x};{x=$0}'
awk '/regex/{print (x=="" ? "match on line 1" : x)};{x=$0}'
Print the line immediately after a regex, but not the line containing the regex
awk '/regex/{getline;print}'
Grep for AAA and BBB and CCC (in any order)
awk '/AAA/ && /BBB/ && /CCC/'
Grep for AAA and BBB and CCC (in that order)
awk '/AAA.*BBB.*CCC/'
Print only lines of 65 characters or longer
awk 'length > 64'
Print only lines of less than 65 characters
awk 'length < 65'
Print section of file from regular expression to end of file
awk '/regex/,0'
awk '/regex/,EOF'
Print section of file based on line numbers (lines 8-12, inclusive)
awk 'NR==8,NR==12'
Print line number 52
awk 'NR==52'
awk 'NR==52 {print;exit}' # more efficient on large files
Print section of file between two regular expressions (inclusive)
awk '/Iowa/,/Montana/' # case sensitive
Delete ALL blank lines from a file (same as "grep '.' ")
awk NF
awk '/./'
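Two of the one-liners above have natural generalisations, sketched here. The backslash-joining one-liner is noted as failing when several consecutive lines end with a backslash, and the "tail -2" one-liner is fixed at two lines. The function names join_continued and last_n are invented for this illustration, and both are sketches rather than drop-in replacements.

```shell
# join_continued: join any run of lines ending in a backslash, without getline.
# (Hypothetical name.)
join_continued() {
    awk '{
        if (sub(/\\$/, ""))      # line ended with a backslash: strip it ...
            printf "%s", $0      # ... and print without a newline
        else
            print                # ordinary line (or the last piece of a run)
    }'
}

# last_n N: print the last N lines of the input (emulates "tail -N").
# (Hypothetical name.)
last_n() {
    awk -v n="$1" '
        { buf[NR % n] = $0 }                      # ring buffer of n slots
        END {
            for (i = NR - n + 1; i <= NR; i++)
                if (i > 0) print buf[i % n]       # skip slots before line 1
        }
    '
}

printf '%s\n' 'one \' 'two \' 'three' 'four' | join_continued
printf '%s\n' a b c d e | last_n 3
```

Note that if the very last input line ends with a backslash, join_continued leaves the output without a final newline.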
Special thanks to Peter S. Tillier for helping me with the first release of this FAQ file.
For additional syntax instructions, including the way to apply editing commands from a disk file instead of the command line, consult:
To fully exploit the power of awk, one must understand "regular expressions." For detailed discussion of regular expressions, see
The manual ("man") pages on Unix systems may be helpful (try "man awk", "man nawk", "man regexp", or the section on regular expressions in "man ed"), but man pages are notoriously difficult. They are not written to teach awk use or regexps to first-time users, but as a reference text for those already acquainted with these tools.
USE OF '\t' IN awk SCRIPTS: For clarity in documentation, we have used the expression '\t' to indicate a tab character (0x09) in the scripts. All versions of awk, even the UNIX System 7 version should recognize the '\t' abbreviation.
Peteris Krumins explaining Eric Pement's Awk one-liners:
Awk is famous for how much it can do in (around) 101 lines. Here are some samples of that capability.
(And if you have any more to add, please send them in.)
by R. Loui
Here are a few short programs that do the same thing in each language. When reading these examples, the question to ask is `how many language features do I need to understand in order to understand the syntax of these examples'.
Some of these are longer than they need to be since they don't exploit some (e.g.) command line trick to wrap the code in for each line do X. And that is the point- for teach-ability, the preferred language is the one you need to know LESS about before you can be useful in it.
hello world
PERL:
print "hello world\n"
GAWK:
BEGIN { print "hello world" }
One plus one
PERL
$x= $x+1;
GAWK
x= x+1
Printing
PERL
print $x, $y, $z;
GAWK
print x,y,z
Printing the first field in a file
PERL
while (<>) {
split(/ /);
print "@_[0]\n"
}
GAWK
{ print $1 }
Printing lines, reversing fields
PERL
while (<>) {
split(/ /);
print "@_[1] @_[0]\n"
}
GAWK
{ print $2, $1 }
Concatenation of variables
PERL
$command = "cat $fname1 $fname2 > $fname3"
GAWK
command = "cat " fname1 " " fname2 " > " fname3
Looping
PERL:
for (1..10) { print $_,"\n" }
GAWK:
BEGIN {
for (i=1; i<=10; i++) print i
}
Pairs of numbers
PERL:
for (1..10) { print "$_ ",$_-1 }
print "\n"
GAWK:
BEGIN {
for (i=1; i<=10; i++) printf i " " i-1
print ""
}
List of words into a hash
PERL
foreach $x ( split(/ /,"this is not stored linearly") )
{ print "$x\n" }
GAWK
BEGIN {
split("this is not stored linearly",temp)
for (i in temp) print temp[i]
}
Printing a hash in some key order
PERL
$n = split(/ /,"this is not stored linearly");
for $i (0..$n-1) { print "$i @_[$i]\n" }
print "\n";
for $i (@_) { print ++$j," ",$i,"\n" }
AWK
BEGIN {
n = split("this is not stored linearly",temp)
for (i=1; i<=n; i++) print i, temp[i]
print ""
for (i in temp) print i, temp[i]
}
Printing all lines in a file
PERL
open file,"/etc/passwd";
while (<file>) { print $_ }
GAWK
BEGIN {
while ((getline < "/etc/passwd") > 0) print
}
Printing a string
PERL
$x = "this " . "that " . "\n"; print $x
GAWK
BEGIN {
x = "this " "that " "\n" ; printf x
}
Building and printing an array
PERL
$assoc{"this"} = 4;
$assoc{"that"} = 4;
$assoc{"the other thing"} = 15;
for $i (keys %assoc) { print "$i $assoc{$i}\n" }
GAWK
BEGIN {
assoc["this"] = 4
assoc["that"] = 4
assoc["the other thing"] = 15
for (i in assoc) print i,assoc[i]
}
Sorting an array
PERL
split(/ /,"this will be sorted once in an array");
foreach $i (sort @_) { print "$i\n" }
GAWK
BEGIN {
split("this will be sorted once in an array",temp," ")
for (i in temp) print temp[i] | "sort"
close("sort")
}
Sorting an array (#2)
GAWK
BEGIN {
split("this will be sorted once in an array",temp," ")
n=asort(temp)
for (i=1;i<=n;i++) print temp[i]
}
Print all lines, vowels changed to stars
PERL
while (<STDIN>) {
s/[aeiou]/*/g;
print $_
}
GAWK
{gsub(/[aeiou]/,"*"); print }
Report from file
PERL
#!/pkg/gnu/bin/perl
# this is a comment
#
open(stream1,"w | ");
while ($line = <stream1>) {
($user, $tty, $login, $junk) = split(/ +/, $line, 4);
print "$user $login ",substr($line,49)
}
GAWK
#!/pkg/gnu/bin/gawk -f
# this is a comment
#
BEGIN {
while ("w" | getline) {
user = $1; tty = $2; login = $3
print user, login, substr($0,49)
}
}
Web Slurping
PERL
open(stream1,"lynx -dump 'cs.wustl.edu/~loui' | ");
while ($line = <stream1>) {
if ($flag && $line =~ /[0-9]/) { print $line }
if ($line =~ /References/) { $flag = 1 }
}
GAWK
BEGIN {
com = "lynx -dump 'cs.wustl.edu/~loui' &> /dev/stdout"
while (com | getline line) {
if (flag && line ~ /[0-9]/) { print line }
if (line ~ /References/) { flag = 1 }
}
}
For the 7 day period ending Monday April 27, 2009.
| posts | kbytes | name | address |
|---|---|---|---|
| 13 | 28.4 | roby | elleroroberto@katamail.com |
| 7 | 11.6 | Steffen Schuler | schuler.steffen@gmail.com |
| 4 | 10.9 | pmarin | pacogeek@gmail.com |
| 3 | 9.7 | Ed Morton | mortonspam@gmail.com |
| 3 | 5.2 | Janis Papanagnou | janis_papanagnou@hotmail.com |
| 3 | 5.1 | nag | visitnag@gmail.com |
| 2 | 6.5 | Tim Menzies | menzies.tim@gmail.com |
| 2 | 6.1 | r.p.loui@gmail.com | r.p.loui@gmail.com |
| 2 | 5.8 | Hermann Peifer | peifer@gmx.net |
| 2 | 5.7 | kielhd | kielhd@freenet.de |
| 41 | 95.0 | Total for top 10 | |
For the 7 day period ending Monday April 27, 2009.
| posts | kbytes | subject |
|---|---|---|
| 10 | 33.5 | OS-variables in awk |
| 9 | 17.9 | user functions with variable number of parameters |
| 5 | 8.9 | File infos |
| 3 | 8.5 | Interpreter Informations |
| 3 | 5.0 | Log/History Files |
| 3 | 4.9 | Help with an input file |
| 3 | 4.8 | gawk can't run an awk program... |
| 3 | 4.6 | Log/History File |
| 2 | 5.6 | pgawk.exe.stackdump |
| 2 | 4.7 | OT: Re: Interpreter Informations |
For the 365 day period ending Sunday April 26, 2009.
| posts | kbytes | name | address |
|---|---|---|---|
| 156 | 530.8 | Ed Morton | mortonspam@gmail.com |
| 156 | 388.3 | Janis Papanagnou | janis_papanagnou@hotmail.com |
| 146 | 256.1 | pk | pk@pk.invalid |
| 109 | 306.6 | Ed Morton | morton@lsupcaemnt.com |
| 84 | 146.5 | Steffen Schuler | schuler.steffen@gmail.com |
| 83 | 139.4 | Kenny McCormack | gazelle@shell.xmission.com |
| 77 | 174.1 | Aharon Robbins | arnold@skeeve.com |
| 64 | 162.2 | Dave B | daveb@addr.invalid |
| 54 | 194.9 | r.p.loui@gmail.com | r.p.loui@gmail.com |
| 50 | 107.7 | Hermann Peifer | peifer@gmx.eu |
| 979 | 2406.6 | Total for top 10 | |
Top 10 subjects for the 365-day period ending Sunday April 26, 2009.
| posts | kbytes | subject |
|---|---|---|
| 61 | 219.6 | changing a field without recompiling the record |
| 44 | 71.3 | Top 10 subjects comp.lang.awk |
| 42 | 88.1 | GAWK: A fix for "missing file is a fatal error" |
| 34 | 59.6 | Top 10 posters comp.lang.awk |
| 30 | 75.3 | Indirect function calls patch for gawk available |
| 29 | 65.0 | gawk for windows: system() does not yield exit status |
| 26 | 67.1 | split field by delimiter |
| 24 | 63.6 | Is there an simple way to initialise arrays in bulk? |
| 23 | 63.5 | Sed1liners in Awk? |
| 23 | 62.6 | Gawk match() and numbers in scientific notation |
Download videos from YouTube.
Peter Krumin: Downloading YouTube Videos With Gawk
World wide web, slurping, file sharing.
Peter Krumin
How to download YouTube videos.
Gawk
331 lines
3=Released
1=Personal use
July 2007
Sat Feb 21 19:46:10 EST 2009
Downloading YouTube Videos With Gawk
This is an Awk 100 program.
Jim Hart
Solve sudoku puzzles using the same strategies as a person would, not by brute force.
Jim Hart
US
Jim Hart
jhart50@gmail.com
see Purpose
gawk
Mac OS X, PowerPC
529
1
0
/2006
An Awk100 program.
Research on a model of negotiation incorporating search, dialogue, and changing expectations
Ronald Loui (programmer and designer), Anne Jump (adversary)
National Science Foundation grant at Washington University in St. Louis
USA
Prototype of a new idea for cognitive modelling (in artificial intelligence/economics/organizational behavior)
Ronald P. Loui
r.p.loui@gmail.com
The program generates a game board upon which players take turns searching or declaring according to a protocol. It is based on the same game bimatrix made famous by people like von Neumann and Nash, but invents a new approach to negotiation based on process instead of solution.
Was written for gawk in 1997 but should run on almost any awk dialect
Was written on Redhat Linux with multiple hardware platforms in mind
Was intended to be self-contained
658 lines, of which 39 are comments
One day, 6-8 hours total
Two revisions are available, mainly to permit programs to negotiate instead of humans, and to provide a web-based dashboard to monitor the events
2=Evaluation
2=In-house use
50 students in artificial intelligence project classes had to use some version of this code over three years
October 1997
January 2008
There is a draft article (unpublished), and several talks, e.g.
The paper in Harper and Wheeler, Probability and Inference: Essays in Honour of Henry E. Kyburg Jr. (paperback; College Publications, 23 April 2007; ISBN-10: 1904987184, ISBN-13: 978-1904987185), also refers to the theory implemented here. Diana Moore's thesis on negotiation and draft article (http://citeseer.ist.psu.edu/11983.html) contain some precursor ideas.
http://www.cs.wustl.edu/~loui/313f97/anne4.expl.html
This is an Awk 100 program.
A quick and dirty baseball simulator for investigating the efficiency of batting lineups
Ronald P. Loui
Washington University in St. Louis
USA
Research/Decision Support
Ronald P. Loui
r.p.loui@gmail.com
This was written for the AI course, and for several investigations, including the determination of whether it is a good idea to bat the pitcher in the 8th spot. One hypothesis that emerges from this program that deserves further study is that the most potent offense is one that spreads rather than concentrates the batting threats.
Gawk around 2002
Linux around 2002
None
409
Approximately one day
Further simulators were developed for improved domain modeling and for successive addition of functionality; no other code maintenance was required.
1=Prototype
1=Personal use
About 50 students used this program over three years in AI classes, and two undergraduate theses and one Master's thesis on evolutionary computing made use of this simulator.
October 2002
January 2009
None, but see Tony LaRussa's comments on batting order while managing the St. Louis Cardinals
An Awk100 program.
A tool inspired by fmt that could be used while working in vi to maintain a multi-column pro-con argument format.
See gawk/awk100/argcol.
Mark Foltz, Ronald Loui, Thieu Dang, Jeremy Frens
Washington University in St. Louis
USA
Application/text support for text editor.
Ronald Loui
r.p.loui@gmail.com
Gawk circa 1994, Solaris and MS-DOS-based awk such as mawk.
Solaris and MS-DOS
Vi and variants such as stevie.
278
One week.
No maintenance; eventually rewritten as a CGI/web program in the Room5 project.
4=No longer supported
3=Free/public domain
2
May 1994
Jan 2009
Progress on Room 5: a testbed for public interactive semi-formal legal argumentation. Proceedings of the 6th International Conference on Artificial Intelligence and Law, Melbourne, Australia, 1997, pages 207-214. ISBN 0-89791-924-6.