Mar 01: Michael Sanders demos an X-windows GUI for AWK.
Mar 01: Awk100#24: A. Lahm and E. de Rinaldis' patent search, in AWK
Feb 28: Tim Menzies asks this community to write an AWK cookbook.
Feb 28: Arnold Robbins announces a new debugger for GAWK.
Feb 28: Awk100#23: Premysl Janouch offers an IRC bot, in AWK
Feb 28: Updated: the AWK FAQ
Feb 28: Tim Menzies offers a tiny content management system, in Awk.
Jan 31: Comment system added to awk.info. For example, see discussion bottom of ?keys2awk
Jan 31: Martin Cohen shows that Gawk can handle massively long strings (300 million characters).
Jan 31: The AWK FAQ is being updated. For comments/ corrections/ extensions, please mail tim@menzies.us
Jan 31: Martin Cohen finds Awk on the Android platform.
Jan 31: Aleksey Cheusov released a new version of runawk.
Jan 31: Hirofumi Saito contributes a candidate Awk mascot.
Jan 31: Michael Sanders shows how to quickly build an AWK GUI for windows.
Jan 31: Hyung-Hwan Chung offers QSE, an embeddable Awk Interpreter.
2009: frequent poster to comp.lang.awk
Writing in comp.lang.awk Ed Morton ports numerous complex sed expressions to Awk:
A comp.lang.awk author asks the question:
I have a file that has a series of lists
(qqq) aaa 111 bbb 222
and I want to make it look like
aaa 111 (qqq) bbb 222 (qqq)
IMHO the clearest sed solution given was:
sed -e '
/^([^)]*)/{
h; # remember the (qqq) part
d
}
/ [1-9][0-9]*$/{
G; # strap the (qqq) part to the list
s/\n/ /
}
' yourfile
while the awk one was:
awk '/^\(/{ h=$0;next } { print $0,h }' file
As I've said repeatedly, sed is an excellent tool for simple substitutions on a single line. For anything else you should use awk, perl, etc.
Having said that, let's take a look at the awk equivalents for the posted sed examples below that are not simple substitutions on a single line so people can judge for themselves (i.e. quietly - this is not a contest and not a religious war!) which code is clearer, more consistent, and more obvious. When reading this, just imagine yourself having to figure out what the given script does in order to debug or enhance it or write your own similar one later.
Note that in awk as in shell there are many ways to solve a problem so I'm trying to stick to the solutions that I think would be the most useful to a beginner since that's who'd be reading an examples page like this, and without using any GNU awk extensions. Also note I didn't test any of this but it's all pretty basic stuff so it should mostly be right.
For those who know absolutely nothing about awk, I think all you need to know to understand the scripts below is that, like sed, it loops through input files evaluating conditions against the current input record (a line by default) and executing the actions you specify (printing the current input record if none specified) if those conditions are true, and it has the following pre-defined symbols:
NR = Number of Records read so far
NF = Number of Fields in current record
FS = the Field Separator
RS = the Record Separator
BEGIN = a pattern that's only true before processing any input
END = a pattern that's only true after processing all input.
Oh, and setting RS to the NULL string (-v RS='') tells awk to read paragraphs instead of lines as individual records, and setting FS to the NULL string (-v FS='') tells awk to treat each individual character as a field.
For more info on awk, see http://www.awk.info.
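As a quick illustration of the symbols just listed, here is a minimal sketch. The sample input and the shell function names `awk_builtins_demo` and `awk_paragraph_count` are made up for this example and are not part of the original post.

```shell
# Hypothetical demo of BEGIN, NR, NF and END on a made-up three-line input.
awk_builtins_demo() {
    printf 'alpha beta\ngamma\ndelta epsilon zeta\n' |
    awk '
        BEGIN { print "start" }                  # runs before any input is read
              { print NR ": " NF " field(s)" }   # runs once per record (a line by default)
        END   { print "total records: " NR }     # runs after the last record
    '
}

# Hypothetical demo of paragraph mode: with RS set to the empty string,
# blank-line-separated blocks are read as single records.
awk_paragraph_count() {
    printf 'a\nb\n\nc\n' | awk -v RS='' 'END { print NR }'
}

awk_builtins_demo
awk_paragraph_count
```

The first function reports the field count of each line and the record total; the second should count two records, because the blank line splits the input into two paragraphs.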
Double space a file:
Sed:
sed G
Awk:
awk '{print $0 "\n"}'
Double space a file which already has blank lines in it. Output file should contain no more than one blank line between lines of text.
Sed:
sed '/^$/d;G'
Awk:
awk 'NF{print $0 "\n"}'
Triple space a file
Sed:
sed 'G;G'
Awk:
awk '{print $0 "\n\n"}'
Undo double-spacing (assumes even-numbered lines are always blank):
Sed:
sed 'n;d'
Awk:
awk 'NF'
Insert a blank line above every line which matches "regex":
Sed:
sed '/regex/{x;p;x;}'
Awk:
awk '{print (/regex/ ? "\n" : "") $0}'
Insert a blank line below every line which matches "regex":
Sed:
sed '/regex/G'
Awk:
awk '{print $0 (/regex/ ? "\n" : "")}'
Insert a blank line above and below every line which matches "regex":
Sed:
sed '/regex/{x;p;x;G;}'
Awk:
awk '{print (/regex/ ? "\n" $0 "\n" : $0)}'
Number each line of a file (simple left alignment). Using a tab (see note on '\t' at end of file) instead of space will preserve margins:
Sed:
sed = filename | sed 'N;s/\n/\t/'
Awk:
awk '{print NR "\t" $0}'
Number each line of a file (number on left, right-aligned):
Sed:
sed = filename | sed 'N; s/^/     /; s/ *\(.\{6,\}\)\n/\1  /'
Awk:
awk '{printf "%6s %s\n",NR,$0}'
Number each line of file, but only print numbers if line is not blank:
Sed:
sed '/./=' filename | sed '/./N; s/\n/ /'
Awk:
awk '{print (NF ? NR "\t" : "") $0}'
Count lines (emulates "wc -l")
Sed:
sed -n '$='
Awk:
awk 'END{print NR}'
Align all text flush right on a 79-column width:
Sed:
sed -e :a -e 's/^.\{1,78\}$/ &/;ta' # set at 78 plus 1 space
Awk:
awk '{printf "%79s\n",$0}'
Center all text in the middle of 79-column width. In method 1, spaces at the beginning of the line are significant, and trailing spaces are appended at the end of the line. In method 2, spaces at the beginning of the line are discarded in centering the line, and no trailing spaces appear at the end of lines.
Sed:
sed -e :a -e 's/^.\{1,77\}$/ & /;ta' # method 1
sed -e :a -e 's/^.\{1,77\}$/ &/;ta' -e 's/\( *\)\1/\1/' # method 2
Awk:
awk '{printf "%"int((79+length)/2)"s\n",$0}'
Reverse order of lines (emulates "tac") Bug/feature in sed v1.5 causes blank lines to be deleted
Sed:
sed '1!G;h;$!d' # method 1
sed -n '1!G;h;$p' # method 2
Awk:
awk '{a[NR]=$0} END{for (i=NR;i>=1;i--) print a[i]}'
Reverse each character on the line (emulates "rev")
Sed:
sed '/\n/!G;s/\(.\)\(.*\n\)/&\2\1/;//D;s/.//'
Awk:
awk -v FS='' '{for (i=NF;i>=1;i--) printf "%s",$i; print ""}'
Join pairs of lines side-by-side (like "paste")
Sed:
sed '$!N;s/\n/ /'
Awk:
awk '{printf "%s%s",$0,(NR%2 ? " " : "\n")}'
If a line ends with a backslash, append the next line to it
Sed:
sed -e :a -e '/\\$/N; s/\\\n//; ta'
Awk:
awk '{printf "%s",(sub(/\\$/,"") ? $0 : $0 "\n")}'
if a line begins with an equal sign, append it to the previous line and replace the "=" with a single space
Sed:
sed -e :a -e '$!N;s/\n=/ /;ta' -e 'P;D'
Awk:
awk '{printf "%s%s",(sub(/^=/," ") ? "" : "\n"),$0} END{print ""}'
Add a blank line every 5 lines (after lines 5, 10, 15, 20, etc.)
Sed:
gsed '0~5G' # GNU sed only
sed 'n;n;n;n;G;' # other seds
Awk:
awk '{print $0} !(NR%5){print ""}'
Print first 10 lines of file (emulates behavior of "head")
Sed:
sed 10q
Awk:
awk '{print $0} NR==10{exit}'
Print first line of file (emulates "head -1")
Sed:
sed q
Awk:
awk 'NR==1{print $0; exit}'
Print the last 10 lines of a file (emulates "tail")
Sed:
sed -e :a -e '$q;N;11,$D;ba'
Awk:
awk '{a[NR]=$0} END{for (i=(NR>10 ? NR-9 : 1);i<=NR;i++) print a[i]}'
Print the last 2 lines of a file (emulates "tail -2")
Sed:
sed '$!N;$!D'
Awk:
awk '{a[NR]=$0} END{for (i=(NR>2 ? NR-1 : 1);i<=NR;i++) print a[i]}'
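The array-based "tail" emulations keep every input line in memory. A possible alternative, sketched here and not taken from the original post, is a circular buffer that holds only the last n lines. The function name `awk_tail` is made up, and the sketch assumes n is at least 1.

```shell
# Hypothetical helper: print the last n lines of stdin (default 10),
# keeping only n lines in memory. Assumes n >= 1.
awk_tail() {
    awk -v n="${1:-10}" '
        { buf[NR % n] = $0 }                        # overwrite the oldest slot
        END {
            start = (NR > n) ? NR - n + 1 : 1       # first line still held in buf
            for (i = start; i <= NR; i++) print buf[i % n]
        }
    '
}

printf '%s\n' 1 2 3 4 5 | awk_tail 2
```

With the sample input this should print lines 4 and 5. On inputs shorter than n it prints whatever is there, without padding.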
Print the last line of a file (emulates "tail -1")
Sed:
sed '$!d' # method 1
sed -n '$p' # method 2
Awk:
awk 'END{print $0}'
Print the next-to-the-last line of a file
Sed:
sed -e '$!{h;d;}' -e x # for 1-line files, print blank line
sed -e '1{$q;}' -e '$!{h;d;}' -e x # for 1-line files, print the line
sed -e '1{$d;}' -e '$!{h;d;}' -e x # for 1-line files, print nothing
Awk:
awk '{prev=curr; curr=$0} END{print prev}'
Print only lines which match regular expression (emulates "grep")
Sed:
sed -n '/regexp/p' # method 1
sed '/regexp/!d' # method 2
Awk:
awk '/regexp/'
Print only lines which do NOT match regexp (emulates "grep -v")
Sed:
sed -n '/regexp/!p' # method 1, corresponds to above
sed '/regexp/d' # method 2, simpler syntax
Awk:
awk '!/regexp/'
Print the line immediately before a regexp, but not the line containing the regexp
Sed:
sed -n '/regexp/{g;1!p;};h'
Awk:
awk '/regexp/{print prev} {prev=$0}'
Print the line immediately after a regexp, but not the line containing the regexp
Sed:
sed -n '/regexp/{n;p;}'
Awk:
awk 'found{print $0} {found=(/regexp/ ? 1 : 0)}'
Print 1 line of context before and after regexp, with line number indicating where the regexp occurred (similar to "grep -A1 -B1")
Sed:
sed -n -e '/regexp/{=;x;1!p;g;$!N;p;D;}' -e h
Awk:
awk 'found {print preLine "\n" hitLine "\n" $0; found=0}
/regexp/ {preLine=prev; hitLine=NR " " $0; found=1}
{prev=$0}'
Grep for AAA and BBB and CCC (in any order)
Sed:
sed '/AAA/!d; /BBB/!d; /CCC/!d'
Awk:
awk '/AAA/&&/BBB/&&/CCC/'
Grep for AAA and BBB and CCC (in that order)
Sed:
sed '/AAA.*BBB.*CCC/!d'
Awk:
awk '/AAA.*BBB.*CCC/'
Grep for AAA or BBB or CCC (emulates "egrep")
Sed:
sed -e '/AAA/b' -e '/BBB/b' -e '/CCC/b' -e d # most seds
gsed '/AAA\|BBB\|CCC/!d' # GNU sed only
Awk:
awk '/AAA|BBB|CCC/'
Print paragraph if it contains AAA (blank lines separate paragraphs). Sed v1.5 must insert a 'G;' after 'x;' in the next 3 scripts below
Sed:
sed -e '/./{H;$!d;}' -e 'x;/AAA/!d;'
Awk:
awk -v RS='' '/AAA/'
Print paragraph if it contains AAA and BBB and CCC (in any order)
Sed:
sed -e '/./{H;$!d;}' -e 'x;/AAA/!d;/BBB/!d;/CCC/!d'
Awk:
awk -v RS='' '/AAA/&&/BBB/&&/CCC/'
Print paragraph if it contains AAA or BBB or CCC
Sed:
sed -e '/./{H;$!d;}' -e 'x;/AAA/b' -e '/BBB/b' -e '/CCC/b' -e d
gsed '/./{H;$!d;};x;/AAA\|BBB\|CCC/b;d' # GNU sed only
Awk:
awk -v RS='' '/AAA|BBB|CCC/'
Print only lines of 65 characters or longer
Sed:
sed -n '/^.\{65\}/p'
Awk:
awk -v FS='' 'NF>=65'
Print only lines of less than 65 characters
Sed:
sed -n '/^.\{65\}/!p' # method 1, corresponds to above
sed '/^.\{65\}/d' # method 2, simpler syntax
Awk:
awk -v FS='' 'NF<65'
Print section of file from regular expression to end of file
Sed:
sed -n '/regexp/,$p'
Awk:
awk '/regexp/{found=1} found'
Print section of file based on line numbers (lines 8-12, inclusive)
Sed:
sed -n '8,12p' # method 1
sed '8,12!d' # method 2
Awk:
awk 'NR>=8 && NR<=12'
Print line number 52
Sed:
sed -n '52p' # method 1
sed '52!d' # method 2
sed '52q;d' # method 3, efficient on large files
Awk:
awk 'NR==52{print $0; exit}'
Beginning at line 3, print every 7th line
Sed:
gsed -n '3~7p' # GNU sed only
sed -n '3,${p;n;n;n;n;n;n;}' # other seds
Awk:
awk '!((NR-3)%7)'
print section of file between two regular expressions (inclusive)
Sed:
sed -n '/Iowa/,/Montana/p' # case sensitive
Awk:
awk '/Iowa/,/Montana/'
Print all lines of FileID up to 1st line containing string
Sed:
sed '/string/q' FileID
Awk:
awk '{print $0} /string/{exit}'
Print all lines of FileID from 1st line containing string until EOF
Sed:
sed '/string/,$!d' FileID
Awk:
awk '/string/{found=1} found'
Print all lines of FileID from 1st line containing string1 until 1st line containing string2 [boundaries inclusive]
Sed:
sed '/string1/,$!d;/string2/q' FileID
Awk:
awk '/string1/{found=1} found{print $0} found&&/string2/{exit}'
Print all of file EXCEPT section between 2 regular expressions
Sed:
sed '/Iowa/,/Montana/d'
Awk:
awk '/Iowa/,/Montana/{next} {print $0}' file
Delete duplicate, consecutive lines from a file (emulates "uniq"). First line in a set of duplicate lines is kept, rest are deleted.
Sed:
sed '$!N; /^\(.*\)\n\1$/!P; D'
Awk:
awk '$0!=prev{print $0} {prev=$0}'
Delete duplicate, nonconsecutive lines from a file. Beware not to overflow the buffer size of the hold space, or else use GNU sed.
Sed:
sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'
Awk:
awk '!a[$0]++'
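The `!a[$0]++` one-liner is dense, so here is a longhand sketch of what I understand it to do: the array counts how often each line has been seen, and a line is printed only while its count is still zero. The function name `awk_dedup` and the array name `seen` are made up for this illustration.

```shell
# Hypothetical helper: keep the first occurrence of each line, drop later repeats.
awk_dedup() {
    awk '
        {
            if (!($0 in seen)) print    # first time this exact line appears
            seen[$0] = 1                # remember it for later records
        }
    '
}

printf '%s\n' b a b c a | awk_dedup
```

The input order of first occurrences is preserved, which is the main difference from piping through `sort -u`.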
Delete all lines except duplicate lines (emulates "uniq -d").
Sed:
sed '$!N; s/^\(.*\)\n\1$/\1/; t; D'
Awk:
awk '$0==prev{print $0} {prev=$0}' # works only on consecutive
awk 'a[$0]++' # works on non-consecutive
Delete the first 10 lines of a file
Sed:
sed '1,10d'
Awk:
awk 'NR>10'
Delete the last line of a file
Sed:
sed '$d'
Awk:
awk 'NR>1{print prev} {prev=$0}'
Delete the last 2 lines of a file
Sed:
sed 'N;$!P;$!D;$d'
Awk:
awk 'NR>2{print prev[2]} {prev[2]=prev[1]; prev[1]=$0}' # method 1
awk '{a[NR]=$0} END{for (i=1;i<=NR-2;i++) print a[i]}' # method 2
awk -v num=2 'NR>num{print prev[num]}
{for (i=num;i>1;i--) prev[i]=prev[i-1]; prev[1]=$0}' # method 3
Delete the last 10 lines of a file
Sed:
sed -e :a -e '$d;N;2,10ba' -e 'P;D' # method 1
sed -n -e :a -e '1,10!{P;N;D;};N;ba' # method 2
Awk:
awk -v num=10 '...same as deleting last 2 method 3 above...'
Delete every 8th line
Sed:
gsed '0~8d' # GNU sed only
sed 'n;n;n;n;n;n;n;d;' # other seds
Awk:
awk 'NR%8'
Delete lines matching pattern
Sed:
sed '/pattern/d'
Awk:
awk '!/pattern/'
Delete ALL blank lines from a file (same as "grep '.' ")
Sed:
sed '/^$/d' # method 1
sed '/./!d' # method 2
Awk:
awk '!/^$/' # method 1
awk '/./' # method 2
Delete all CONSECUTIVE blank lines from file except the first; also deletes all blank lines from top and end of file (emulates "cat -s")
Sed:
sed '/./,/^$/!d'
Awk:
awk '/./,/^$/'
Delete all leading blank lines at top of file
Sed:
sed '/./,$!d'
Awk:
awk 'NF{found=1} found'
Delete all trailing blank lines at end of file
Sed:
sed -e :a -e '/^\n*$/{$d;N;ba' -e '}' # works on all seds
sed -e :a -e '/^\n*$/N;/\n$/ba' # ditto, except for gsed 3.02.*
Awk:
awk '{a[NR]=$0} NF{nbNr=NR} END{for (i=1;i<=nbNr;i++) print a[i]}'
Delete the last line of each paragraph
Sed:
sed -n '/^$/{p;h;};/./{x;/./p;}'
Awk:
awk -v FS='\n' -v RS='' '{for (i=1;i<NF;i++) print $i; print ""}'
Get Usenet/e-mail message header
Sed:
sed '/^$/q' # deletes everything after first blank line
Awk:
awk '{print $0} /^$/{exit}'
Get Usenet/e-mail message body
Sed:
sed '1,/^$/d' # deletes everything up to first blank line
Awk:
awk 'found{print $0} /^$/{found=1}'
Get Subject header, but remove initial "Subject: " portion
Sed:
sed '/^Subject: */!d; s///;q'
Awk:
awk 'sub(/^Subject: */,""){print $0; exit}'
Parse out the address proper. Pulls out the e-mail address by itself from the 1-line return address header (see preceding script)
Sed:
sed 's/ *(.*)//; s/>.*//; s/.*[:<] *//'
Awk:
awk '{sub(/ *\(.*\)/,""); sub(/>.*/,""); sub(/.*[:<] */,""); print $0}'
Add a leading angle bracket and space to each line (quote a message)
Sed:
sed 's/^/> /'
Awk:
awk '{print "> " $0}'
Delete leading angle bracket & space from each line (unquote a message)
Sed:
sed 's/^> //'
Awk:
awk '{sub(/^> /,""); print $0}'
(Editor's note: On Nov 30'09, Hermann Peifer found and fixed bug in an older version of the test code at the end of this file.)
Writing in comp.lang.awk, Ed Morton reveals the secret WHINY_USERS flag.
"Nag" asked:
Hi,
I am creating a file like...
awk '{
....
...
..
printf"%4s %4s\n",$1,$2 > "file1"
}' input
How can I sort file1 within awk code?
Ed Morton writes:
$ cat file
2
1
4
3
$ gawk '{a[$0]}END{for (i in a) print i}' file
4
1
2
3
$ WHINY_USERS=1 gawk '{a[$0]}END{for (i in a) print i}' file
1
2
3
4
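The order of `for (i in a)` is unspecified in awk, and WHINY_USERS is a gawk-only switch. A portable way to get sorted output from inside an awk program, sketched here as an assumption and not as part of the original thread, is to pipe the output through the external `sort` command. The function name `awk_sorted_out` and the sample input are made up. Newer gawk releases also document a `PROCINFO["sorted_in"]` setting for controlling traversal order, so check the manual for your version.

```shell
# Hypothetical sketch: sort the output of awk by piping it to the external sort.
awk_sorted_out() {
    awk '
        BEGIN { cmd = "sort -n" }                 # any sort invocation would do
        { printf "%4s %4s\n", $1, $2 | cmd }      # every record goes down the pipe
        END { close(cmd) }                        # sort emits its output when the pipe closes
    '
}

printf '2 b\n1 a\n4 d\n3 c\n' | awk_sorted_out
```

To land the sorted result in a file, as in the question, the command string could be changed to something like `"sort -n > file1"`.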
Your editor coded up the following test for the runtime costs of WHINY_USERS. The following code is called twice (once with, and once without setting WHINY_USERS):
runWhin() {
    WHINY_USERS=1 gawk -v M=1000000 --source '
    BEGIN {
        M = M ? M : 50
        N = M
        print N
        while(N-- > 0) {
            key = rand()" "rand()" "rand()" "rand()" "rand()
            A[key] = M - N
        }
        for(i in A)
            N++
    }'
}
runNoWhin() {
    gawk -v M=1000000 --source '
    BEGIN {
        M = M ? M : 50
        N = M
        print N
        while(N-- > 0) {
            key = rand()" "rand()" "rand()" "rand()" "rand()
            A[key] = M - N
        }
        for(i in A)
            N++
    }'
}
time runWhin
time runNoWhin
And the results? Sorting added about 15% to runtimes:
% bash whiny.sh
1000000
real 0m18.897s
user 0m15.826s
sys 0m2.445s
1000000
real 0m16.345s
user 0m13.469s
sys 0m2.435s
In comp.lang.awk, Ed Morton offers advice on how to print ranges of Awk records.
Suppose you are looking to extract a section of code from a text file based on two regular expressions.
Say the file looks like this:
newspaper
magazing
hiking
hiking trails in the city
muir hike
black mountain hike
summer meados hike
end hiking
phone
cell
skype
and you want to extract
hiking trails in the city
muir hike
black mountain hike
summer meados hike
The following regular expression won't work right:
awk '/hiking/,/end hiking/{print}' myfile
since that returns some spurious information.
What to do?
Personally, I rarely if ever use
/start/,/end/
as I'm never immediately sure what it'd output for input such as:
start
a
start
b
end
c
end
and whenever you want to do something just slightly different with the selection you need to change the script a lot.
Not being sure of the semantics is probably a catch 22 since I rarely use it but the benefit of using that syntax vs spelling it out:
/start/{f=1} f; /end/{f=0}
just doesn't really seem worthwhile, and then if you want to do something extra like test for some other condition over the block this:
/start/{f=1} f&&cond; /end/{f=0}
is about as brief as:
/start/,/end/{if (cond) print}
and if you want to exclude the start (or end) of the block you're printing then you just move the "f" test to the obvious place and you don't need to duplicate the condition:
f; /start/{f=1} /end/{f=0}
vs
/start/,/end/{if (!/start/) print}
and note the different semantics now. This:
f; /start/{f=1} /end/{f=0}
will exclude the line at the start of the block you're printing, whereas this:
/start/,/end/{if (!/start/) print}
will exclude that line plus every other occurrence of "start" within the block which is probably not what you'd want. To simply exclude only the first line of the block but stay with the /start/,/end/ approach you'd need to do something like:
/start/,/end/{if (nr++) print; if (/end/) nr=0}
(which is getting fairly obscure.)
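Applied to the hiking example, the flag approach can be sketched as follows. This is a minimal illustration that assumes the "hiking" and "end hiking" markers each sit on a line of their own. The function name `extract_block` and the shortened sample input are made up.

```shell
# Hypothetical sketch: print the lines strictly between the two marker lines.
extract_block() {
    awk '
        /^end hiking$/ { f = 0 }    # clear the flag before the print test: end marker excluded
        f                           # print while inside the block
        /^hiking$/     { f = 1 }    # set the flag after the print test: start marker excluded
    '
}

printf '%s\n' newspaper hiking 'hiking trails in the city' 'muir hike' 'end hiking' phone | extract_block
```

Moving either flag assignment to the other side of the bare `f` test would include the corresponding marker line in the output.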
by Ed Morton (and friends)
The following summary, composed to address the recurring issue of getline (mis)use, was based primarily on information from the book "Effective Awk Programming", Third Edition, by Arnold Robbins (http://www.oreilly.com/catalog/awkprog3), with review and additional input from many of the comp.lang.awk regulars.
getline is fine when used correctly (see below for a list of those cases), but it's best avoided by default because:
As the book "Effective Awk Programming", Third Edition, by Arnold Robbins (http://www.oreilly.com/catalog/awkprog3), which provides much of the source for this discussion, says:
The following summarises the eight variants of getline applications, listing which variables are set by each one:
Variant                  Variables Set
-------                  -------------
getline                  $0, ${1...NF}, NF, FNR, NR, FILENAME
getline var              var, FNR, NR, FILENAME
getline < file           $0, ${1...NF}, NF
getline var < file       var
command | getline        $0, ${1...NF}, NF
command | getline var    var
command |& getline       $0, ${1...NF}, NF
command |& getline var   var
The "command |& ..." variants are GNU awk (gawk) extensions. gawk also populates the ERRNO builtin variable if getline fails.
Although calling getline is very rarely the right approach (see below), if you need to do it the safest ways to invoke getline are:
if/while ( (getline var < file) > 0)
if/while ( (command | getline var) > 0)
if/while ( (command |& getline var) > 0)
since those do not affect any of the builtin variables and they allow you to correctly test for getline succeeding or failing. If you need the input record split into separate fields, just call "split()" to do that.
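A minimal sketch of the first of those forms, with the fields recovered through split(). The function name `read_side_file`, the variable names, and the temporary sample file are all made up for this illustration.

```shell
# Hypothetical sketch: read a side file with the "getline var < file" form.
read_side_file() {
    awk -v file="$1" '
        BEGIN {
            while ((getline line < file) > 0) {    # the > 0 test also stops on a read error (-1)
                n = split(line, f, " ")            # fields on demand; $0 and NF stay untouched
                print n ": " f[1]
            }
            close(file)
        }
    '
}

tmp=$(mktemp)
printf 'a b c\nd e\n' > "$tmp"
read_side_file "$tmp"
rm -f "$tmp"
```

If the file cannot be opened, getline returns -1 and the loop body never runs, whereas a bare `while (getline line < file)` would spin forever on that same -1.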
Users of getline have to be aware of the following non-obvious effects of using it:
FNR==1 { ... start of file actions ... }
File transitions can occur at getlines, so FNR==1 needs to also be
checked after each unredirected (from a specific file name) getline.
e.g. if you want to print the first line of each of these files:
$ cat file1
a
b
$ cat file2
c
d
you'd normally do:
$ awk 'FNR==1{print}' file1 file2
a
c
but if a "getline" snuck in, it could have the unexpected consequence of
skipping the test for FNR==1 and so not printing the first line of the
second file.
$ awk 'FNR==1{print}/b/{getline}' file1 file2
a
some header line
----------------
data line 1
data line 2
...
data line 10000
you may consider using...
BEGIN { getline header; getline }
{ whatever_using_header_and_data_on_the_line() }
instead of...
FNR == 1 { header = $0 }
FNR < 3 { next }
{ whatever_using_header_and_data_on_the_line() }
but the getline version would not work on multiple files since the BEGIN
section would only be executed once, before the first file is processed,
whereas the non-getline version would work as-is. This is one example of
the common case where the getline command itself isn't directly causing
the problem, but the type of design you can end up with if you select a
getline approach is not ideal.
getline is an appropriate solution for the following:
a)
command = "ls"
while ( (command | getline var) > 0) {
    print var
}
close(command)
b)
command = "LC_ALL=C sort"
n = split("abcdefghijklmnopqrstuvwxyz", a, "")
for (i = n; i > 0; i--)
    print a[i] |& command
close(command, "to")
while ((command |& getline var) > 0)
    print "got", var
close(command)
c)
BEGIN {
    while ( (getline var < ARGV[1]) > 0) {
        data[var]++
    }
    close(ARGV[1])
    ARGV[1]=""
}
$0 in data
d)
awk 'function read(file) {
    while ( (getline < file) > 0) {
        if ($1 == "include") {
            read($2)
        } else {
            print > ARGV[2]
        }
    }
    close(file)
}
BEGIN{
    read(ARGV[1])
    ARGV[1]=""
    close(ARGV[2])
}1' file1 tmp
In all other cases, it's clearest, simplest, least error-prone, and easiest to maintain to let awk's normal text-processing read the records. In the case of "c", whether to use the BEGIN+getline approach or just collect the data within the awk condition/action part after testing for the first file is largely a style choice.
"a" above calls the UNIX command "ls" to list the current directory contents, then prints the result one line at a time.
"b" above writes the letters of the alphabet in reverse order, one per line, down the two-way pipe to the UNIX "sort" command. It then closes the write end of the pipe, so that sort receives an end-of-file indication. This causes sort to sort the data and write the sorted data back to the gawk program. Once all of the data has been read, gawk terminates the coprocess and exits. This is particularly necessary in order to use the UNIX "sort" utility as part of a coprocess since sort must read all of its input data before it can produce any output. The sort program does not receive an end-of-file indication until gawk closes the write end of the pipe. Other programs can be invoked as just:
command = "program"
do {
print data |& command
command |& getline var
} while (data left to process)
close(command)
Note that calling close() with a second argument is also gawk-specific.
"c" above reads every record of the first file passed as an argument to awk into an array and then for every subsequent file passed as an argument will print every record from that file that matches any of the records that appeared in the first file (and so are stored in the "data" array). This could alternatively have been implemented as:
# fails if first file is empty
NR==FNR{ data[$0]++; next }
$0 in data
or:
FILENAME==ARGV[1] { data[$0]++; next }
$0 in data
or:
FILENAME=="specificFileName" { data[$0]++; next }
$0 in data
or (gawk only):
ARGIND==1 { data[$0]++; next }
$0 in data
"d" above not only expands all the lines that say "include subfile", but by writing the result to a tmp file, resetting ARGV[1] (the highest level input file) and not resetting ARGV[2] (the tmp file), it then lets awk do any normal record parsing on the result of the expansion since that's now stored in the tmp file. If you don't need that, just do the "print" to stdout and remove any other references to a tmp file or ARGV[2]. In this case, since it's convenient to use $1 and $2, and no other part of the program references any builtin variables, getline was used without populating an explicit variable. This method is limited in its recursion depth to the total number of open files the OS permits at one time.
The following tips may help if, after reading the above, you discover you have an appropriate application for getline or if you're looking for an alternative solution to using getline:
cmd="some command"
do something with cmd
close(cmd)
awk 'c&&!--c;/pattern/{c=N}' file
awk 'c&&!--c{next}/pattern/{c=N}' file
awk 'c&&c--;/pattern/{c=N}' file
awk 'c&&c--{next}/pattern/{c=N}' file
In this example there are no blank lines and the output is all aligned with the left hand column and you want to print $0 for the second record following the record that contains some pattern, e.g. the number 3:
$ cat file
line 1
line 2
line 3
line 4
line 5
line 6
line 7
line 8
$ awk '/3/{getline;getline;print}' file
line 5
That works just fine. Now let's see the concise way to do it without getline:
$ awk 'c&&!--c;/3/{c=2}' file
line 5
It's not quite so obvious at a glance what that does, but it uses an idiom that most awk programmers could do well to learn and it is briefer and avoids all those getline caveats.
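For readers meeting the `c&&!--c` idiom for the first time, here is a longhand sketch of the same countdown. The function name `nth_after` and its two parameters are made up, and the sketch assumes a positive count.

```shell
# Hypothetical helper: print the nth record after each record matching a regex.
nth_after() {
    awk -v re="$1" -v n="$2" '
        c > 0 && --c == 0 { print }     # same test as c&&!--c, spelled out
        $0 ~ re { c = n }               # start (or restart) the countdown on every match
    '
}

printf 'line %s\n' 1 2 3 4 5 6 7 8 | nth_after 3 2
```

The counter is decremented on each record after the match and the record is printed only when it reaches zero, so changing which record is printed only means changing the second argument.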
Now let's say we want to print the 5th line after the pattern instead of the 2nd line. Then we'd have:
$ awk '/3/{getline;getline;getline;getline;getline;print}' file
line 8
$ awk 'c&&!--c;/3/{c=5}' file
line 8
i.e. we have to add a whole series of additional getline calls to the getline version, as opposed to just changing the counter from 2 to 5 for the non-getline version. In reality, you'd probably completely rewrite the getline version to use a loop:
$ awk '/3/{for (c=1;c<=5;c++) getline; print}' file
line 8
Still not as concise as the non-getline version, has all the getline caveats and required a redesign of the code just to change a counter.
Now let's say we also have to print the word "Eureka" if the number 4 appears in the input file. With the getline version, you now have to do something like:
$ awk '/3/{for (c=1;c<=5;c++) { getline; if ($0 ~ /4/) print "Eureka!" }
print}' file
Eureka!
line 8
whereas with the non-getline version you just have to do:
$ awk 'c&&!--c;/3/{c=5}/4/{print "Eureka!"}' file
Eureka!
line 8
i.e. with the getline version, you have to work around the fact that you're now processing records outside of the normal awk work-loop, whereas with the non-getline version you just have to drop your test for "4" into the normal place and let awk's normal record processing deal with it like it always does. Actually, if you look closely at the above you'll notice we just unintentionally introduced a bug in the getline version. Consider what would happen in both versions if 3 and 4 appear on the same line. The non-getline version would behave correctly, but to fix the getline version, you'd need to duplicate the condition somewhere, e.g. perhaps something like this:
$ awk '/3/{for (c=1;c<=5;c++) { if ($0 ~ /4/) print "Eureka!"; getline }
if ($0 ~ /4/) print "Eureka!"; print}' file
Eureka!
line 8
Now consider how the above would behave when there aren't 5 lines left in the input file or when the last line of the file contains both a 3 and a 4. i.e. there are still design questions to be answered and bugs that will appear at the limits of the input space.
Ignoring those bugs since this is not intended as a discussion on debugging getline programs, let's say you no longer need to print the 5th record after the number 3 but still have to do the Eureka on 4. With the getline version, you'd strip out the test for 3 and the getline stuff to be left with:
$ awk '{if ($0 ~ /4/) print "Eureka!"}' file
Eureka!
which you'd then presumably rewrite as:
$ awk '/4/{print "Eureka!"}' file
Eureka!
which is what you get just by removing everything involving the test for 3 and counter in the non-getline version (i.e. "c&&!--c;/3/{c=5}"):
$ awk '/4/{print "Eureka!"}' file
Eureka!
i.e. again, one small requirement change required a complete redesign of the getline code, but just the absolute minimum necessary tweak to the non-getline version.
So, what you see above in the getline case was significant redesign required for every tiny requirement change, much larger amounts of handwritten code required, insidious bugs introduced during development and challenging design questions at the limits of your input space, whereas the non-getline version always had less code, was much easier to modify as requirements changed, and was much more obvious, predictable, and correct in how it would behave at the limits of the input space.
Download from LAWKER.
Below is a script I wrote to demonstrate how to use arrays, functions, numerical vs string comparison, etc.
It also provides a framework for people to implement sorting algorithms for comparison. I've implemented a couple and I'm hoping others will contribute more in the same style.
I put very few comments in deliberately because I think the only parts that are hard to understand given some small amount of reading awk manuals are the actual sorting algorithms, and those should be well documented already given a reference except my made-up "Key Sort" but I think that's very easy to understand.
Selection Sort, O(n^2): http://en.wikipedia.org/wiki/Selection_sort
function selSort(keyArr,outArr,   swap,thisIdx,minIdx,cmpIdx,numElts) {
    for (thisIdx in keyArr) {
        outArr[++numElts] = thisIdx
    }
    for (thisIdx=1; thisIdx<=numElts; thisIdx++) {
        minIdx = thisIdx
        for (cmpIdx=thisIdx + 1; cmpIdx <= numElts; cmpIdx++) {
            if (keyArr[outArr[minIdx]] > keyArr[outArr[cmpIdx]]) {
                minIdx = cmpIdx
            }
        }
        if (thisIdx != minIdx) {
            swap = outArr[thisIdx]
            outArr[thisIdx] = outArr[minIdx]
            outArr[minIdx] = swap
        }
    }
    return numElts+0
}
Key Sort O(n^2): made up by Ed Morton for simplicity.
function keySort(keyArr,outArr, \
                 occArr,thisIdx,thisKey,cmpIdx,outIdx,numElts) {
    for (thisIdx in keyArr) {
        thisKey = keyArr[thisIdx]
        outIdx=++occArr[thisKey] # start at 1 plus num occurrences
        for (cmpIdx in keyArr) {
            if (thisKey > keyArr[cmpIdx]) {
                outIdx++
            }
        }
        outArr[outIdx] = thisIdx
        numElts++
    }
    return numElts+0
}
This code demonstrates the use of arrays, functions, and string vs numeric comparisons in awk. It also provides a framework for people to implement various sorting algorithms in awk such as those listed at http://en.wikipedia.org/wiki/Sorting_algorithm
Traverses the input array, storing its indices in the output array in sorted order of the input array elements. e.g.
in: inArr["foo"]="b"; inArr["bar"]="a"; inArr["xyz"]="b"
outArr[] is empty
out: inArr["foo"]="b"; inArr["bar"]="a"; inArr["xyz"]="b"
outArr[1]="bar"; outArr[2]="foo"; outArr[3]="xyz"
Can sort on specific fields given a field number and field separator.
sortType of "n" means sort by numerical comparison, sort by string comparison otherwise.
function genSort(sortAlg,sortType,inArr,outArr,fldNum,fldSep, \
                 keyArr,thisIdx,thisArr) {
    if (fldNum) {
        if (sortType == "n") {
            for (thisIdx in inArr) {
                split(inArr[thisIdx],thisArr,fldSep)
                keyArr[thisIdx] = thisArr[fldNum]+0
            }
        } else {
            for (thisIdx in inArr) {
                split(inArr[thisIdx],thisArr,fldSep)
                keyArr[thisIdx] = thisArr[fldNum]""
            }
        }
    } else {
        if (sortType == "n") {
            for (thisIdx in inArr) {
                keyArr[thisIdx] = inArr[thisIdx]+0
            }
        } else {
            for (thisIdx in inArr) {
                keyArr[thisIdx] = inArr[thisIdx]""
            }
        }
    }
    if (sortAlg ~ /^sel/) {
        numElts = selSort(keyArr,outArr)
    } else {
        numElts = keySort(keyArr,outArr)
    }
    return numElts
}
{ inArr[NR]=$0 }
END {
    numElts = genSort(sortAlg,sortType,inArr,outArr,fldNum,FS)
    for (outIdx=1;outIdx<=numElts;outIdx++) {
        print inArr[outArr[outIdx]]
    }
}
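In the spirit of the invitation to contribute more algorithms in the same style, here is a hedged sketch of an insertion sort that follows the same (keyArr, outArr) calling convention as selSort and keySort. The function `insSort` and the small driver around it are my own illustration, not part of Ed Morton's script, and the driver skips genSort to stay self-contained.

```shell
# Hypothetical sketch: insertion sort with the same (keyArr, outArr) interface,
# wrapped in a tiny driver so it can be run on its own.
ins_sort_lines() {
    awk '
        # Fill outArr[1..n] with the indices of keyArr, ordered by keyArr values.
        function insSort(keyArr, outArr,    thisIdx, j, numElts) {
            for (thisIdx in keyArr) {
                # Shift larger keys one slot right, then drop thisIdx into the gap.
                j = ++numElts
                while (j > 1 && keyArr[outArr[j-1]] > keyArr[thisIdx]) {
                    outArr[j] = outArr[j-1]
                    j--
                }
                outArr[j] = thisIdx
            }
            return numElts + 0
        }
        { inArr[NR] = $0 }
        END {
            numElts = insSort(inArr, outArr)
            for (outIdx = 1; outIdx <= numElts; outIdx++) print inArr[outArr[outIdx]]
        }
    '
}

printf '%s\n' pear apple fig | ins_sort_lines
```

Like the two algorithms above it is O(n^2) in the worst case. To plug it into the full script, one more branch on sortAlg in genSort would be needed.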
Ed Morton
A recent discussion in comp.lang.awk demonstrated a very cute, and very succinct, awk trick.
Neil Harris wanted to clean up this output:
host1name.com
10.10.10.1
host2name.com
10.10.10.2
host3name.com
10.10.10.3
He was using an uppercase J in vi to manually move the hostname's IP address up onto the same line as its hostname. But he wanted to automate the task with awk.
Kenny McCormack offered:
ORS=NR%2?" ":"\n"
(Yes, that is the whole program.)
Ed Morton offered a more elegant version:
ORS=NR%2?FS:RS
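For anyone puzzling over why a bare assignment prints anything, here is a longhand sketch of what I take the one-liner to be doing. The function name `join_pairs` is made up, and the sketch assumes the default FS (a space) and RS (a newline).

```shell
# Hypothetical helper: join each pair of input lines onto one output line.
join_pairs() {
    awk '
        {
            ORS = (NR % 2) ? FS : RS    # odd record: end with a space; even: end with a newline
            print                       # print $0 followed by the ORS just chosen
        }
    '
}

printf '%s\n' host1name.com 10.10.10.1 host2name.com 10.10.10.2 | join_pairs
```

In the one-line version the assignment itself sits in the pattern position. Its value is the non-empty string just assigned, which counts as true, so awk applies the default action of printing the record, using the freshly set ORS.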
Finally, Kenny McCormack commented: