About awk.info
Mar 01: Michael Sanders demos an X-windows GUI for AWK.
Mar 01: Awk100#24: A. Lahm and E. de Rinaldis' patent search, in AWK
Feb 28: Tim Menzies asks this community to write an AWK cookbook.
Feb 28: Arnold Robbins announces a new debugger for GAWK.
Feb 28: Awk100#23: Premysl Janouch offers an IRC bot, in AWK
Feb 28: Updated: the AWK FAQ
Feb 28: Tim Menzies offers a tiny content management system, in Awk.
Jan 31: Comment system added to awk.info. For example, see the discussion at the bottom of ?keys2awk
Jan 31: Martin Cohen shows that Gawk can handle massively long strings (300 million characters).
Jan 31: The AWK FAQ is being updated. For comments, corrections, or extensions, please mail tim@menzies.us
Jan 31: Martin Cohen finds Awk on the Android platform.
Jan 31: Aleksey Cheusov released a new version of runawk.
Jan 31: Hirofumi Saito contributes a candidate Awk mascot.
Jan 31: Michael Sanders shows how to quickly build an AWK GUI for windows.
Jan 31: Hyung-Hwan Chung offers QSE, an embeddable Awk Interpreter.
These pages focus on sys admin tools in Awk.
Zazzle.com is offering their great "I love Awk mug", starting at $12.
From John David Duncan's parallel-awk.org site.
Parallel Awk is an effort to link Awk with MPI, enabling the everyday analysis of large plain-text files to be parallelized, allowing rapid prototyping of parallel applications, preserving the syntax and style of Awk, and hiding the details of MPI.
The Awk programming language, first developed at Bell Labs in 1977, is a standard part of Unix operating system distributions. It is a compact language, commonly used in systems administration and in commercial (as opposed to scientific) computing. The half dozen books about awk include the original slim and very readable Awk book by Aho, Kernighan, and Weinberger. Awk is standardized in POSIX, and the most actively maintained current implementation is GNU awk. While awk, like sed, is perhaps most often used for "one-liners," its regular expression handling and rich C-like syntax make it well-suited for many small applications and domain-specific languages.
MPI is a standard Message Passing Interface for parallel computing created by the MPI Forum, implemented in two widely-used free distributions (LAM/MPI and MPICH) and in optimized versions provided by many hardware vendors. MPI libraries are often linked with Fortran or C code in scientific computing tasks, such as matrix calculations, and run on supercomputers or Beowulf clusters. For some of these applications, runtime is actually greater than development time; nonetheless, a language for rapid prototyping is a handy tool to have around.
# pi.awk: approximate pi by integrating f(x) = 4/(1+x^2)
# n = number of intervals to calculate
#
# e.g.: mpiexec -n 4 mpawk -v n=10000 -f pi.awk
BEGIN {
    h = 1/n
    for(i = RANK+1 ; i <= n ; i += SIZE) {
        x = h * (i - 0.5)
        sum += 4 / (1 + x^2)
    }
    pi = reduce(sum(h * sum))
    if(!RANK) printf("n=%d, pi is %1.20f\n",n,pi)
}
pi.awk requires about 20% as many lines of code as its equivalents in C or Fortran. The output is printed by the process with RANK = 0 and looks like this:
sh% mpiexec -n 4 mpawk -v n=100000 -f pi.awk
n=100000, pi is 3.14159265359811668006
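For readers without an MPI installation, the same midpoint-rule sum can be tried in plain awk. This is only a sketch and not part of Parallel Awk: it assumes a single process, so RANK and SIZE are set by hand to 0 and 1, and the reduce step disappears because there is nothing to combine. The function name serial_pi is made up for this example.

```shell
# Serial sketch of pi.awk. RANK=0 and SIZE=1 are set by hand (an assumption),
# so one process handles every interval and no reduce() call is needed.
serial_pi() {
  awk -v n="$1" 'BEGIN {
    RANK = 0; SIZE = 1
    h = 1 / n
    for (i = RANK + 1; i <= n; i += SIZE) {
      x = h * (i - 0.5)          # midpoint of the i-th interval
      sum += 4 / (1 + x^2)
    }
    printf("%.10f\n", h * sum)
  }'
}

serial_pi 1000
```

With SIZE processes, each rank would visit every SIZE-th interval, which is why the loop steps by SIZE rather than by 1.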
The latest beta release of Parallel Awk is version 0.8. In this release, any Awk expression (including numbers, strings, and arrays) can be sent from one process to another using the functions send and recv. The comm_split() function, an interface to MPI_Comm_split, allows the creation of intra-communicators, while a companion function comm_set() is used to set the default MPI communicator implicitly used for all other MPI operations. Supported collective operations include reduce(), which can be applied to both numeric and string expressions, and barrier(). A function called assign() is used to divide the lines of input among the set of processes, as can a hash() function that is applied to array keys or other strings.
These notes come from John Fry's Counting with Awk lecture in his subject Linguistics 115: Corpus Linguistics, Fall 2007, SJSU.
Much research has reported that human writings follow well-defined laws. For example, natural language text and software programs conform tightly to simple and regular statistical models. "Zipf's Law" states that multiplying a word's rank r by its frequency f produces (roughly) a constant value C: i.e. r times f is a constant. The frequency f of a word is obtained by counting the number of times it occurs in a text, and r is obtained by ranking all the words by frequency (1. the; 2. and; 3. I; etc.). Example of Zipf's Law for five words in the London-Lund corpus of spoken conversation:
 r   X        f   =   C
35   very   836   =   29,260
45   see    674   =   30,330
55   which  563   =   30,965
65   get    469   =   30,485
75   out    422   =   31,650
Another way of expressing Zipf's Law is to say that frequency is inversely proportional to rank. For example, the 2nd-ranked word ("and") appears half as often as the 1st-ranked word ("the"). More generally, the nth-ranked word appears 1/n as often as "the".
Here is a short awk program, saved as ~jfry/zipf.awk, that reads in a ranked frequency list and computes r times f.
BEGIN {printf "%20s%7s%7s%10s\n", "WORD","RANK","FREQ","C"}
{printf "%20s%7d%7d%10d\n", $2, NR, $1, NR*$1}
This program can be run with
awk -f ~jfry/zipf.awk
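To try the program without a corpus, the ranked frequency list can be built from any text on standard input. The sketch below wraps the pipeline in a shell function and gives the two zipf.awk rules inline instead of via -f; the function name zipf and the sample sentence are made up for illustration.

```shell
# Build a ranked frequency list from standard input and feed it to the
# two-rule zipf.awk program (given inline here instead of via -f).
zipf() {
  tr A-Z a-z | tr -sc a-z '\n' | sort | uniq -c | sort -rn |
  awk 'BEGIN {printf "%20s%7s%7s%10s\n", "WORD","RANK","FREQ","C"}
       {printf "%20s%7d%7d%10d\n", $2, NR, $1, NR*$1}'
}

# Made-up sample text: "the" occurs 3 times, "and" twice, the rest once.
printf 'the cat and the dog and the bird\n' | zipf
```

Because the list arrives already sorted by count, the record number NR doubles as the rank, which is why the program needs no counter of its own.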
Testing Zipf's Law on Shakespeare :
$ tr A-Z a-z < shakespeare.txt | tr -sc a-z '\n' | sort | uniq -c | sort -rn | awk -f ~jfry/zipf.awk
  WORD   RANK    FREQ        C      WORD   RANK    FREQ        C
   the      1   27378    27378         s     17    7721   131257
   and      2   26084    52168       for     18    7655   137790
     i      3   22538    67614        be     19    6897   131043
    to      4   19771    79084       his     20    6859   137180
    of      5   17481    87405        he     21    6679   140259
     a      6   14725    88350      your     22    6657   146454
   you      7   13826    96782      this     23    6608   151984
    my      8   12489    99912       but     24    6277   150648
  that      9   11318   101862      have     25    5902   147550
    in     10   11112   111120        as     26    5749   149474
    is     11    9319   102509      thou     27    5549   149823
     d     12    8960   107520       him     28    5205   145740
   not     13    8512   110656        so     29    5058   146682
  with     14    7791   109074      will     30    5008   150240
    me     15    7777   116655      what     31    4808   149048
    it     16    7725   123600       thy     32    4034   129088
Testing Zipf's Law on newswire
$ cd /corpora/newswire/data
$ zcat -r .|grep -v '^<' | tr A-Z a-z|tr -sc a-z '\n' | sort| uniq -c | sort -rn | awk -f /home/jfry/zipf.awk
  WORD   RANK   FREQ      C      WORD   RANK   FREQ      C
   the      1   142M   142M        by     16    14M   224M
    to      2    60M   120M        he     17    13M   235M
    of      3    60M   180M        at     18    13M   244M
     a      4    53M   214M        as     19    12M   230M
   and      5    51M   257M      from     20    10M   216M
    in      6    51M   307M        be     21     9M   201M
     s      7    28M   202M       his     22     9M   205M
   for      8    22M   178M       has     23     9M   208M
  that      9    21M   195M      have     24     9M   217M
  said     10    19M   199M       but     25     8M   212M
    on     11    19M   214M       are     26     8M   218M
    is     12    16M   200M        an     27     8M   225M
  with     13    15M   197M      will     28     7M   207M
   was     14    14M   203M         i     29     7M   213M
    it     15    14M   211M       not     30     7M   217M
J. Mellander reports in comp.lang.awk how to make Mawk's hashing run 20+ times faster.
Recently, for a project, I had the occasion to use mawk - I have a list of ~12,000,000 Unix timestamps to nanosecond precision that I needed to match the first field of every record in a number of huge files. Gawk couldn't handle the number of records, and so I used mawk, as being more memory thrifty. The program was a one-liner like this:
mawk 'FNR==NR {x[$1]++;next} $1 in x' timestamp_file log_file
which works perfectly, but the run time seemed excessive - many hours per log file - which made me think that the hashing function was causing many collisions, and thus hash chaining.....
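The FNR==NR idiom in that one-liner is worth a small illustration. The sketch below runs the same program on two tiny made-up files; match_keys, keys.txt and log.txt are hypothetical names standing in for the real timestamp and log files.

```shell
# FNR==NR holds only while the first file is being read, so that rule stores
# its keys. For the second file, records whose first field is a stored key
# are printed by the bare pattern "$1 in x".
# Caveat: if the first file is empty, FNR==NR also holds for the second file.
match_keys() {
  awk 'FNR==NR {x[$1]++; next} $1 in x' "$1" "$2"
}

tmp=$(mktemp -d)
printf '100.000000001\n100.000000003\n' > "$tmp/keys.txt"
printf '100.000000001 start\n100.000000002 skip\n100.000000003 stop\n' > "$tmp/log.txt"
match_keys "$tmp/keys.txt" "$tmp/log.txt"
rm -rf "$tmp"
```

Only the "start" and "stop" records are printed, since their first fields appear in the key file.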
When stuck in a slow meeting, I started looking at the mawk source code, specifically the hashing functions, of which there are 2: hash() in hash.c & ahash() in array.c
I was surprised to find that the hashing functions in both cases essentially just sum the bytes of the key to create the hash - this means that 123, 321, 213, etc. would all hash to the same location and cause collisions, and hash chaining.
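The collision is easy to reproduce outside mawk. The sketch below is not mawk's code; it is an awk imitation of an additive hash that sums byte values, with a hand-built ord[] table because awk has no built-in ord() function. The name bytesum is made up.

```shell
# Additive hash sketch: sum the byte values of each input key.
# ord[] maps each ASCII character to its code (codes 1..127 only).
bytesum() {
  awk 'BEGIN { for (i = 1; i < 128; i++) ord[sprintf("%c", i)] = i }
       {
         h = 0
         for (i = 1; i <= length($0); i++) h += ord[substr($0, i, 1)]
         print $0, h
       }'
}

printf '123\n321\n213\n' | bytesum
```

All three keys get the value 150 (49 + 50 + 51), so any permutation of the same digits lands in the same bucket. Timestamps, which are built from a small alphabet of digits, are close to a worst case for such a hash.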
Modifying the hashing to a more efficient hash caused an enormous gain in efficiency, as in this test:
$ wc -l j
2999999 j
$ time mawk-1.3.3/mawk '{x[$1]++}' j >/dev/null
real 2m24.362s
user 2m20.174s
sys 0m0.663s
$ time mawk-1.3.3a/mawk '{x[$1]++}' j >/dev/null
real 0m6.607s
user 0m6.146s
sys 0m0.241s
mawk-1.3.3a has the below modifications. In hash.c I replaced the 'hash' function with:
/*
FNV-1 hash function, per
en.wikipedia.org/wiki/Fowler-Noll-Vo_hash_function
*/
unsigned hash(s)
register char *s ;
{
register unsigned h = 2166136261 ;
while (*s) h = (h * 16777619) ^ *s++ ;
return h ;
}
and in array.c replaced 'ahash' with:
/*
FNV-1 hash function, per
en.wikipedia.org/wiki/Fowler-Noll-Vo_hash_function
*/
static unsigned ahash(sval)
STRING* sval ;
{
register unsigned h = 2166136261 ;
register char *s = sval->str;
while (*s) h = (h * 16777619) ^ *s++ ;
return h ;
}
Brendan O'Connor writes in his blog:
When one of these new fangled 'Big Data' sets comes your way, the very first thing you have to do is data munging: shuffling around file formats, renaming fields and the like. Once you're dealing with hundreds of megabytes of data, even simple operations can take plenty of time.
For one recent ad-hoc task I had - reformatting 1GB of textual feature data into a form Matlab and R can read - I tried writing implementations in several languages, with help from my classmate Elijah.
To be clear, the problem is to take several files of (item name, feature name, value) triples, like:
000794107-10-K-19960401 limited 1
000794107-10-K-19960401 colleges 1
000794107-10-K-19960401 code 2
...
004334108-10-K-19961230 recognition 1
004334108-10-K-19961230 gross 8
...
And then rename items and features into sequential numbers as a sparse matrix: (i, j, value) triples. Items should count up from inside each file; but features should be shared across files, so they need a shared counter. Finally, we need to write a mapping of feature IDs back to their names for later inspection; this can just be a list.
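As a rough idea of what such a program can look like in awk, here is a sketch written for this page. It is not Brendan's code (that is in the repository mentioned below); the function name renumber, the file names and the output layout are assumptions.

```shell
# renumber VOCAB FILE...  (hypothetical helper)
# Prints "item_id feature_id value" triples and writes feature names to
# VOCAB, one per line, so that line j holds the name of feature j.
renumber() {
  vocab=$1; shift
  awk -v vocab="$vocab" '
    FNR == 1         { nitems = 0; split("", item) }   # item ids restart in each file
    !($1 in item)    { item[$1] = ++nitems }
    !($2 in feature) { feature[$2] = ++nfeatures; print $2 > vocab }  # shared across files
    { print item[$1], feature[$2], $3 }
  ' "$@"
}

tmp=$(mktemp -d)
printf '%s\n' '000794107-10-K-19960401 limited 1' \
              '000794107-10-K-19960401 colleges 1' \
              '000794107-10-K-19960401 code 2' > "$tmp/a.txt"
printf '%s\n' '004334108-10-K-19961230 recognition 1' \
              '004334108-10-K-19961230 gross 8' > "$tmp/b.txt"
renumber "$tmp/features.txt" "$tmp/a.txt" "$tmp/b.txt"
rm -rf "$tmp"
```

The two associative arrays do all the work: item is cleared at the first record of every file, while feature keeps growing, which gives the per-file and shared counters the task asks for.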
Since AWK is a standardized language, many implementations exist. One of them, MAWK, is incredibly efficient. It outperforms all other languages, including statically typed compiled ones like Java and C++! It wins on both LOC and performance criteria, a rare feat indeed, transcending the usual competition of slow-but-easy scripting languages versus fast-but-hard compiled languages.
All the code, results, and data can be obtained at github.com/brendano/awkspeed. I'd love to see results for more languages.
Editor's note: one reply to this blog entry, by Eric Young, optimized Brendan's Ruby solution and re-ran all the tests. Eric reported the following runtimes. Note that they confirm Brendan's results: mawk runs faster than everything else.
33.8s mawk
36.3s gcc c
51.0s java
67.0s perl Fletch.pl
71.7s python
87.8s perl
95.8s nawk
101.4s gawk
114.0s gcc
133.0s ruby1.9 eay.rb
136.8s ruby1.8 eay.rb
327.6s ruby1.8
372.9s ruby1.9
I was lurking around on twitter during my lunch hour (yes, even freelancers need a lunch hour), and @bitprophet tweeted thusly:
grep -v "^#" syslog.conf |
awk "{print $2}" | egrep -v "^(\*|\|)" |
sed "/^$/ d" | sed "s/^-//"
Followed by this:
Interested to see if anyone can shorten my previous tweet's command line, outside of using 'cut' instead of the awk bit.)
I happen to love puzzles like this, and my lunch was almost immediately followed by a long, boring conference call.
@bitprophet's pipeline above is translated by my brain into the English:
Find non-commented lines, grab the second space-delimited field, then filter out the ones that start with "*" or "|", then delete any blank lines, and strip any leading "-" from the result.
My brain usually attempts to think of the English version of the solution *first*, and then try to emulate that in the code/command I write. So, the issue here is we want to find file paths (and apparently sockets are ok, too, as "@" is a valid leading character in the initial definition of the problem). If it's a file path, we want to see it in a form that would be suitable for passing it to something like "ls -l", which means leading symbols like "-" and "|" should be omitted.
In a syslog.conf file, the main meat is the area where you specify the warning levels, and the file you want messages at that warning level sent to (this is a simplistic explanation, but good enough to understand the solution I came up with). The file is also littered with comments. Here's the file on my Mac:
*.err;kern.*;auth.notice;authpriv,remoteauth,install.none;mail.crit /dev/console
*.notice;authpriv,remoteauth,ftp,install.none;kern.debug;mail.crit /var/log/system.log
# Send messages normally sent to the console also to the serial port.
# To stop messages from being sent out the serial port, comment out this line.
#*.err;kern.*;auth.notice;authpriv,remoteauth.none;mail.crit /dev/tty.serial
# The authpriv log file should be restricted access; these
# messages shouldn't go to terminals or publically-readable
# files.
auth.info;authpriv.*;remoteauth.crit /var/log/secure.log
lpr.info /var/log/lpr.log
mail.* /var/log/mail.log
ftp.* /var/log/ftp.log
install.* /var/log/install.log
install.* @127.0.0.1:32376
local0.* /var/log/appfirewall.log
local1.* /var/log/ipfw.log
stuff.* -/boo
things.* |/var/log
*.emerg *
So, in English, my brain parses the problem like this:
Skip blank lines, commented lines, and lines where the file name is "*", and give me everything else, but strip off characters "-" and "|" before sending it to the screen.
And here's my awk one-liner for doing that:
awk '$0 !~ /^$|^#/ && $2 !~ /^\*/ {sub(/^-|^\|/,"",$2);print $2}' syslog.conf
Knowing a few key things about awk will help parse the above:
Awk automatically breaks up each line of input into fields. If you don't tell it what to use as a delimiter, it'll just use any number of spaces as the delimiter. If you have a CSV file, you'd likely use "awk -F," to tell awk to use a comma. For /etc/passwd, use "awk -F:". From there, you can reference the first field as $1, the second as $2, etc. $0 represents the whole line. There are more, but that's enough for this example.
Though I think most sysadmins can get a lot done with simple usage like "awk -F: '{print $2}'", sometimes more power is needed, and awk delivers. It supports extended regular expressions, and enables you to check a field (or the whole line: $0, like I do above) against a regex as a precondition for performing some action with the line or a field on that line. So, in the above awk command, I check that the line is neither empty nor a comment. I then use a logical AND to check that field 2 does not start with "*". If the current line fails either check, it is skipped.
Another nice thing about awk is that it actually is a Turing-complete programming language. After I check the lines of input against the rules mentioned above, I immediately know that I definitely want at least some portion of $2 in the remaining lines. What I *don't* want are preceding characters like "-" or "|". I need to strip them from the file name. I use awk's built in "sub()" function to handle that, and with that out of the way I call "print" to send the result to the screen.
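To watch the one-liner work without touching a real syslog.conf, it can be fed a few representative lines on standard input. The wrapper name targets and the sample lines are made up for this sketch; the awk program is the one given above.

```shell
# The one-liner, reading standard input or any files given as arguments.
targets() {
  awk '$0 !~ /^$|^#/ && $2 !~ /^\*/ {sub(/^-|^\|/,"",$2); print $2}' "$@"
}

# A comment, a blank line, a plain path, and targets starting with "-", "|" and "*".
printf '%s\n' \
  '# a comment' \
  '' \
  'mail.*    /var/log/mail.log' \
  'stuff.*   -/boo' \
  'things.*  |/var/log' \
  '*.emerg   *' | targets
```

This prints /var/log/mail.log, /boo and /var/log, one per line: the comment, the blank line and the "*" target are skipped, and the leading "-" and "|" are stripped.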
Writing in comp.lang.awk Ed Morton ports numerous complex sed expressions to Awk:
A comp.lang.awk author asks the question:
I have a file that has a series of lists
(qqq)
aaa 111
bbb 222
and I want to make it look like
aaa 111 (qqq)
bbb 222 (qqq)
IMHO the clearest sed solution given was:
sed -e '
/^([^)]*)/{
h; # remember the (qqq) part
d
}
/ [1-9][0-9]*$/{
G; # strap the (qqq) part to the list
s/\n/ /
}
' yourfile
while the awk one was:
awk '/^\(/{ h=$0;next } { print $0,h }' file
As I've said repeatedly, sed is an excellent tool for simple substitutions on a single line. For anything else you should use awk, perl, etc.
Having said that, let's take a look at the awk equivalents for the posted sed examples below that are not simple substitutions on a single line so people can judge for themselves (i.e. quietly - this is not a contest and not a religious war!) which code is clearer, more consistent, and more obvious. When reading this, just imagine yourself having to figure out what the given script does in order to debug or enhance it or write your own similar one later.
Note that in awk as in shell there are many ways to solve a problem so I'm trying to stick to the solutions that I think would be the most useful to a beginner since that's who'd be reading an examples page like this, and without using any GNU awk extensions. Also note I didn't test any of this but it's all pretty basic stuff so it should mostly be right.
For those who know absolutely nothing about awk, I think all you need to know to understand the scripts below is that, like sed, it loops through input files evaluating conditions against the current input record (a line by default) and executing the actions you specify (printing the current input record if none specified) if those conditions are true, and it has the following pre-defined symbols:
NR = Number of Records read so far
NF = Number of Fields in current record
FS = the Field Separator
RS = the Record Separator
BEGIN = a pattern that's only true before processing any input
END = a pattern that's only true after processing all input.
Oh, and setting RS to the NULL string (-v RS='') tells awk to read paragraphs instead of lines as individual records, and setting FS to the NULL string (-v FS='') tells awk to treat each individual character as a field.
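A quick way to see paragraph mode in action is to count records and fields on a two-paragraph input. This is only a sketch with made-up input, and paragraph_stats is a hypothetical name. Note that POSIX leaves the effect of an empty FS unspecified, so the second command relies on behaviour that common implementations such as gawk and mawk provide.

```shell
# With RS='' each blank-line-separated paragraph is one record, and newlines
# inside a paragraph also separate fields. Prints "record-number field-count".
paragraph_stats() {
  awk -v RS='' '{print NR, NF}' "$@"
}

printf 'a b\nc d\n\ne\n' | paragraph_stats

# With FS='' each character of the line becomes a field (not required by POSIX).
printf 'abc\n' | awk -v FS='' '{print NF}'
```

The first command reports two records, the first with four fields and the second with one.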
For more info on awk, see http://www.awk.info.
Double space a file:
Sed:
sed G
Awk
awk '{print $0 "\n"}'
Double space a file which already has blank lines in it. Output file should contain no more than one blank line between lines of text.
Sed:
sed '/^$/d;G'
Awk:
awk 'NF{print $0 "\n"}'
Triple space a file
Sed:
sed 'G;G'
Awk:
awk '{print $0 "\n\n"}'
Undo double-spacing (assumes even-numbered lines are always blank):
Sed:
sed 'n;d'
Awk:
awk 'NF'
Insert a blank line above every line which matches "regex":
Sed:
sed '/regex/{x;p;x;}'
Awk:
awk '{print (/regex/ ? "\n" : "") $0}'
Insert a blank line below every line which matches "regex":
Sed:
sed '/regex/G'
Awk:
awk '{print $0 (/regex/ ? "\n" : "")}'
Insert a blank line above and below every line which matches "regex":
Sed:
sed '/regex/{x;p;x;G;}'
Awk:
awk '{print (/regex/ ? "\n" $0 "\n" : $0)}'
Number each line of a file (simple left alignment). Using a tab (see note on '\t' at end of file) instead of space will preserve margins:
Sed:
sed = filename | sed 'N;s/\n/\t/'
Awk:
awk '{print NR "\t" $0}'
Number each line of a file (number on left, right-aligned):
Sed:
sed = filename | sed 'N; s/^/     /; s/ *\(.\{6,\}\)\n/\1  /'
Awk:
awk '{printf "%6s %s\n",NR,$0}'
Number each line of file, but only print numbers if line is not blank:
Sed:
sed '/./=' filename | sed '/./N; s/\n/ /'
Awk:
awk '{print (NF ? NR "\t" : "") $0}'
Count lines (emulates "wc -l")
Sed:
sed -n '$='
Awk:
awk 'END{print NR}'
Align all text flush right on a 79-column width:
Sed:
sed -e :a -e 's/^.\{1,78\}$/ &/;ta' # set at 78 plus 1 space
Awk:
awk '{printf "%79s\n",$0}'
Center all text in the middle of 79-column width. In method 1, spaces at the beginning of the line are significant, and trailing spaces are appended at the end of the line. In method 2, spaces at the beginning of the line are discarded in centering the line, and no trailing spaces appear at the end of lines.
Sed:
sed -e :a -e 's/^.\{1,77\}$/ & /;ta' # method 1
sed -e :a -e 's/^.\{1,77\}$/ &/;ta' -e 's/\( *\)\1/\1/' # method 2
Awk:
awk '{printf "%"int((79+length)/2)"s\n",$0}'
Reverse order of lines (emulates "tac") Bug/feature in sed v1.5 causes blank lines to be deleted
Sed:
sed '1!G;h;$!d' # method 1
sed -n '1!G;h;$p' # method 2
Awk:
awk '{a[NR]=$0} END{for (i=NR;i>=1;i--) print a[i]}'
Reverse each character on the line (emulates "rev")
Sed:
sed '/\n/!G;s/\(.\)\(.*\n\)/&\2\1/;//D;s/.//'
Awk:
awk -v FS='' '{for (i=NF;i>=1;i--) printf "%s",$i; print ""}'
Join pairs of lines side-by-side (like "paste")
Sed:
sed '$!N;s/\n/ /'
Awk:
awk '{printf "%s%s",$0,(NR%2 ? " " : "\n")}'
If a line ends with a backslash, append the next line to it
Sed:
sed -e :a -e '/\\$/N; s/\\\n//; ta'
Awk:
awk '{printf "%s",(sub(/\\$/,"") ? $0 : $0 "\n")}'
If a line begins with an equal sign, append it to the previous line and replace the "=" with a single space
Sed:
sed -e :a -e '$!N;s/\n=/ /;ta' -e 'P;D'
Awk:
awk '{printf "%s%s",(sub(/^=/," ") ? "" : "\n"),$0} END{print ""}'
Add a blank line every 5 lines (after lines 5, 10, 15, 20, etc.)
Sed:
gsed '0~5G' # GNU sed only
sed 'n;n;n;n;G;' # other seds
Awk:
awk '{print $0} !(NR%5){print ""}'
Print first 10 lines of file (emulates behavior of "head")
Sed:
sed 10q
Awk:
awk '{print $0} NR==10{exit}'
Print first line of file (emulates "head -1")
Sed:
sed q
Awk:
awk 'NR==1{print $0; exit}'
Print the last 10 lines of a file (emulates "tail")
Sed:
sed -e :a -e '$q;N;11,$D;ba'
Awk:
awk '{a[NR]=$0} END{for (i=(NR>10 ? NR-9 : 1);i<=NR;i++) print a[i]}'
Print the last 2 lines of a file (emulates "tail -2")
Sed:
sed '$!N;$!D'
Awk:
awk '{a[NR]=$0} END{for (i=(NR>2 ? NR-1 : 1);i<=NR;i++) print a[i]}'
Print the last line of a file (emulates "tail -1")
Sed:
sed '$!d' # method 1
sed -n '$p' # method 2
Awk:
awk 'END{print $0}'
Print the next-to-the-last line of a file
Sed:
sed -e '$!{h;d;}' -e x # for 1-line files, print blank line
sed -e '1{$q;}' -e '$!{h;d;}' -e x # for 1-line files, print the line
sed -e '1{$d;}' -e '$!{h;d;}' -e x # for 1-line files, print nothing
Awk:
awk '{prev=curr; curr=$0} END{print prev}'
Print only lines which match regular expression (emulates "grep")
Sed:
sed -n '/regexp/p' # method 1
sed '/regexp/!d' # method 2
Awk:
awk '/regexp/'
Print only lines which do NOT match regexp (emulates "grep -v")
Sed:
sed -n '/regexp/!p' # method 1, corresponds to above
sed '/regexp/d' # method 2, simpler syntax
Awk:
awk '!/regexp/'
Print the line immediately before a regexp, but not the line containing the regexp
Sed:
sed -n '/regexp/{g;1!p;};h'
Awk:
awk '/regexp/{print prev} {prev=$0}'
Print the line immediately after a regexp, but not the line containing the regexp
Sed:
sed -n '/regexp/{n;p;}'
Awk:
awk 'found{print $0} {found=(/regexp/ ? 1 : 0)}'
Print 1 line of context before and after regexp, with line number indicating where the regexp occurred (similar to "grep -A1 -B1")
Sed:
sed -n -e '/regexp/{=;x;1!p;g;$!N;p;D;}' -e h
Awk:
awk 'found {print preLine "\n" hitLine "\n" $0; found=0}
/regexp/ {preLine=prev; hitLine=NR " " $0; found=1}
{prev=$0}'
Grep for AAA and BBB and CCC (in any order)
Sed:
sed '/AAA/!d; /BBB/!d; /CCC/!d'
Awk:
awk '/AAA/&&/BBB/&&/CCC/'
Grep for AAA and BBB and CCC (in that order)
Sed:
sed '/AAA.*BBB.*CCC/!d'
Awk:
awk '/AAA.*BBB.*CCC/'
Grep for AAA or BBB or CCC (emulates "egrep")
Sed:
sed -e '/AAA/b' -e '/BBB/b' -e '/CCC/b' -e d # most seds
gsed '/AAA\|BBB\|CCC/!d' # GNU sed only
Awk:
awk '/AAA|BBB|CCC/'
Print paragraph if it contains AAA (blank lines separate paragraphs). Sed v1.5 must insert a 'G;' after 'x;' in the next 3 scripts below
Sed:
sed -e '/./{H;$!d;}' -e 'x;/AAA/!d;'
Awk:
awk -v RS='' '/AAA/'
Print paragraph if it contains AAA and BBB and CCC (in any order)
Sed:
sed -e '/./{H;$!d;}' -e 'x;/AAA/!d;/BBB/!d;/CCC/!d'
Awk:
awk -v RS='' '/AAA/&&/BBB/&&/CCC/'
Print paragraph if it contains AAA or BBB or CCC
Sed:
sed -e '/./{H;$!d;}' -e 'x;/AAA/b' -e '/BBB/b' -e '/CCC/b' -e d
gsed '/./{H;$!d;};x;/AAA\|BBB\|CCC/b;d' # GNU sed only
Awk:
awk -v RS='' '/AAA|BBB|CCC/'
Print only lines of 65 characters or longer
Sed:
sed -n '/^.\{65\}/p'
Awk:
awk -v FS='' 'NF>=65'
Print only lines of less than 65 characters
Sed:
sed -n '/^.\{65\}/!p' # method 1, corresponds to above
sed '/^.\{65\}/d' # method 2, simpler syntax
Awk:
awk -v FS='' 'NF<65'
Print section of file from regular expression to end of file
Sed:
sed -n '/regexp/,$p'
Awk:
awk '/regexp/{found=1} found'
Print section of file based on line numbers (lines 8-12, inclusive)
Sed:
sed -n '8,12p' # method 1
sed '8,12!d' # method 2
Awk:
awk 'NR>=8 && NR<=12'
Print line number 52
Sed:
sed -n '52p' # method 1
sed '52!d' # method 2
sed '52q;d' # method 3, efficient on large files
Awk:
awk 'NR==52{print $0; exit}'
Beginning at line 3, print every 7th line
Sed:
gsed -n '3~7p' # GNU sed only
sed -n '3,${p;n;n;n;n;n;n;}' # other seds
Awk:
awk '!((NR-3)%7)'
Print section of file between two regular expressions (inclusive)
Sed:
sed -n '/Iowa/,/Montana/p' # case sensitive
Awk:
awk '/Iowa/,/Montana/'
Print all lines of FileID up to 1st line containing string
Sed:
sed '/string/q' FileID
Awk:
awk '{print $0} /string/{exit}'
Print all lines of FileID from 1st line containing string until eof
Sed:
sed '/string/,$!d' FileID
Awk:
awk '/string/{found=1} found'
Print all lines of FileID from 1st line containing string1 until 1st line containing string2 [boundaries inclusive]
Sed:
sed '/string1/,$!d;/string2/q' FileID
Awk:
awk '/string1/{found=1} found{print $0} found && /string2/{exit}'
Print all of file EXCEPT section between 2 regular expressions
Sed:
sed '/Iowa/,/Montana/d'
Awk:
awk '/Iowa/,/Montana/{next} {print $0}' file
Delete duplicate, consecutive lines from a file (emulates "uniq"). First line in a set of duplicate lines is kept, rest are deleted.
Sed:
sed '$!N; /^\(.*\)\n\1$/!P; D'
Awk:
awk '$0!=prev{print $0} {prev=$0}'
Delete duplicate, nonconsecutive lines from a file. Beware not to overflow the buffer size of the hold space, or else use GNU sed.
Sed:
sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'
Awk:
awk '!a[$0]++'
Delete all lines except duplicate lines (emulates "uniq -d").
Sed:
sed '$!N; s/^\(.*\)\n\1$/\1/; t; D'
Awk:
awk '$0==prev{print $0} {prev=$0}' # works only on consecutive
awk 'a[$0]++' # works on non-consecutive
Delete the first 10 lines of a file
Sed:
sed '1,10d'
Awk:
awk 'NR>10'
Delete the last line of a file
Sed:
sed '$d'
Awk:
awk 'NR>1{print prev} {prev=$0}'
Delete the last 2 lines of a file
Sed:
sed 'N;$!P;$!D;$d'
Awk:
awk 'NR>2{print prev[2]} {prev[2]=prev[1]; prev[1]=$0}' # method 1
awk '{a[NR]=$0} END{for (i=1;i<=NR-2;i++) print a[i]}' # method 2
awk -v num=2 'NR>num{print prev[num]}
{for (i=num;i>1;i--) prev[i]=prev[i-1]; prev[1]=$0}' # method 3
Delete the last 10 lines of a file
Sed:
sed -e :a -e '$d;N;2,10ba' -e 'P;D' # method 1
sed -n -e :a -e '1,10!{P;N;D;};N;ba' # method 2
Awk:
awk -v num=10 '...same as deleting last 2 method 3 above...'
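Spelled out in full, the num=10 case looks like the sketch below. The wrapper name drop_last and the twelve-line test input are made up; the awk program is method 3 from the previous entry, unchanged.

```shell
# drop_last N: print the input without its last N lines.
# prev[1..N] holds the N most recent lines; a line is printed only once
# N newer lines have been read after it.
drop_last() {
  awk -v num="$1" 'NR>num{print prev[num]}
       {for (i=num;i>1;i--) prev[i]=prev[i-1]; prev[1]=$0}'
}

printf '%s\n' 1 2 3 4 5 6 7 8 9 10 11 12 | drop_last 10
```

On this twelve-line input only lines 1 and 2 are printed. Unlike method 2, this version never holds more than num lines in memory.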
Delete every 8th line
Sed:
gsed '0~8d' # GNU sed only
sed 'n;n;n;n;n;n;n;d;' # other seds
Awk:
awk 'NR%8'
Delete lines matching pattern
Sed:
sed '/pattern/d'
Awk:
awk '!/pattern/'
Delete ALL blank lines from a file (same as "grep '.' ")
Sed:
sed '/^$/d' # method 1
sed '/./!d' # method 2
Awk:
awk '!/^$/' # method 1
awk '/./' # method 2
Delete all CONSECUTIVE blank lines from file except the first; also deletes all blank lines from top and end of file (emulates "cat -s")
Sed:
sed '/./,/^$/!d'
Awk:
awk '/./,/^$/'
Delete all leading blank lines at top of file
Sed:
sed '/./,$!d'
Awk:
awk 'NF{found=1} found'
Delete all trailing blank lines at end of file
Sed:
sed -e :a -e '/^\n*$/{$d;N;ba' -e '}' # works on all seds
sed -e :a -e '/^\n*$/N;/\n$/ba' # ditto, except for gsed 3.02.*
Awk:
awk '{a[NR]=$0} NF{nbNr=NR} END{for (i=1;i<=nbNr;i++) print a[i]}'
Delete the last line of each paragraph
Sed:
sed -n '/^$/{p;h;};/./{x;/./p;}'
Awk:
awk -v FS='\n' -v RS='' '{for (i=1;i<NF;i++) print $i; print ""}'
Get Usenet/e-mail message header
Sed:
sed '/^$/q' # deletes everything after first blank line
Awk:
awk '/^$/{exit} {print $0}'
Get Usenet/e-mail message body
Sed:
sed '1,/^$/d' # deletes everything up to first blank line
Awk:
awk 'found{print $0} /^$/{found=1}'
Get Subject header, but remove initial "Subject: " portion
Sed:
sed '/^Subject: */!d; s///;q'
Awk:
awk 'sub(/^Subject: */,""){print $0; exit}'
Parse out the address proper. Pulls out the e-mail address by itself from the 1-line return address header (see preceding script)
Sed:
sed 's/ *(.*)//; s/>.*//; s/.*[:<] *//'
Awk:
awk '{sub(/ *\(.*\)/,""); sub(/>.*/,""); sub(/.*[:<] */,""); print $0}'
Add a leading angle bracket and space to each line (quote a message)
Sed:
sed 's/^/> /'
Awk:
awk '{print "> " $0}'
Delete leading angle bracket & space from each line (unquote a message)
Sed:
sed 's/^> //'
Awk:
awk '{sub(/^> /,""); print $0}'
Download from LAWKER.
BEGIN {
    if (!mysql["path"]) {
        mysql["path"] = "/usr/bin/mysql"
    }
    if (mysql["user"]) mysql["user"] = "-u" mysql["user"]
    if (mysql["pass"]) mysql["pass"] = "-p" mysql["pass"]
    if (!mysql["tempfile_command"]) {
        mysql["tempfile_command"] = "mktemp /tmp/__mysql.awk.XXXXXX"
    }
    mysql["resource_id"] = 1
    # Escape letters understood by mysql_dequote()
    __mysql_dequote["r"] = "\r"
    __mysql_dequote["n"] = "\n"
    __mysql_dequote["t"] = "\t"
    __mysql_dequote["\\"] = "\\"
}
function mysql_db (db) { mysql["database"] = db }
function mysql_path (path) { mysql["path"] = path }
function mysql_tempfile_command (command) {
    mysql["tempfile_command"] = command
}
function mysql_login (username, password, host, args) {
    mysql["user"] = "-u" username
    mysql["pass"] = "-p" password
    if (host) mysql["host"] = "-h" host
    if (args) mysql["args"] = args
}
# Run a query, store its output in a temporary file, and return a resource
# id. The first output line (the column names) is kept in mysql[resource, i].
function mysql_query (query    ,input,key,i,call,resource) {
    resource = mysql["resource_id"]++
    mysql["tempfile_command"] | getline mysql[resource]
    close(mysql["tempfile_command"])
    call = sprintf("%s %s %s %s %s %s > %s",
        mysql["path"], mysql["user"], mysql["pass"], mysql["host"],
        mysql["args"], mysql["database"],
        mysql[resource])
    print query | call
    close(call)
    if (getline input < mysql[resource]) {
        for (i = split(input, key, "\t"); i > 0; i--)
            mysql[resource, i] = key[i]
    }
    return resource
}
# Fetch the next row, indexed both by position and by column name.
function mysql_fetch_assoc (resource,row    ,input,i,fields) {
    fields = 0
    if (getline input < mysql[resource]) {
        fields = mysql_split(row, input)
        for (i = 1; i <= fields; i++)
            row[mysql[resource, i]] = row[i]
    }
    return fields
}
function mysql_split (row, input,    r,i) {
    r = split(input, row, "\t")
    # split() numbers fields from 1, so start there rather than at 0
    for (i = 1; i <= r; i++) {
        row[i] = mysql_dequote(row[i])
    }
    return r
}
function mysql_fetch_row (resource,row    ,input,r,i) {
    if (getline input < mysql[resource]) {
        return mysql_split(row, input)
    }
    return 0
}
function mysql_index (resource, id) {
    return mysql[resource, id]
}
# Release one result set: remove its temporary file and column names.
function mysql_finish (resource,    i) {
    close(mysql[resource])
    system(sprintf("rm %s", mysql[resource]))
    delete mysql[resource]
    i = 1
    while ((resource, i) in mysql)
        delete mysql[resource, i++]
}
# Release every result set that is still open.
function mysql_cleanup (    i, j) {
    for (i = 1; i < mysql["resource_id"]; i++)
        if (mysql[i]) {
            close(mysql[i])
            system(sprintf("rm %s", mysql[i]))
            delete mysql[i]
            # use a separate counter so the outer loop index is not clobbered
            j = 1
            while ((i, j) in mysql)
                delete mysql[i, j++]
        }
}
Scan a string for mysql escaped tokens and replace them with the appropriate character. This is a fairly slow operation for large strings but it's necessary.
function mysql_dequote (string,    result,i,l,c) {
    result = ""
    l = length(string)
    for (i = 1; i <= l; i++) {
        c = substr(string, i, 1)
        if (c == "\\") {
            # This simply shouldn't happen...
            ## if ((i + 1) == l) continue;
            c = substr(string, ++i, 1)
            result = result __mysql_dequote[c]
        }
        else {
            result = result c
        }
    }
    return result
}
function mysql_quote (string, result) {
    gsub(/\\/, "\\\\", string)
    gsub(/'/, "\\'", string)
    return "'" string "'"
}
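The escape handling can be tried without a MySQL server. The sketch below is a standalone copy of the same table-driven idea, not the library itself: dequote and unesc are made-up names, and it reads escaped text from standard input instead of from a query result.

```shell
# Replace \r, \n, \t and \\ escape pairs with the characters they stand for,
# scanning each input line one character at a time.
dequote() {
  awk 'BEGIN {
         unesc["r"] = "\r"; unesc["n"] = "\n"
         unesc["t"] = "\t"; unesc["\\"] = "\\"
       }
       {
         result = ""
         l = length($0)
         for (i = 1; i <= l; i++) {
           c = substr($0, i, 1)
           if (c == "\\") c = unesc[substr($0, ++i, 1)]   # look up the escape letter
           result = result c
         }
         print result
       }'
}

# The %s format passes the backslashes through untouched.
printf '%s\n' 'a\tb\\c' | dequote
```

The output is "a", a real tab, "b", a single backslash and "c". As in the library function, an escape letter that is not in the table is dropped.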
"THE BEER-WARE LICENSE" (Revision 43) borrowed from FreeBSD's jail.c:
Scott S. McCoy Author
Brian Jones writes at linux.com:
The nice thing about humans is that they're at least somewhat predictable. Given the choice between having data randomly strewn about, and having it in some predictable pattern, humans will generally choose predictable patterns (Microsoft filesystem management issues notwithstanding). These patterns are what make awk, a pattern-matching programming language, a wonderful tool for systems administrators, database administrators, and even command-line junkies who use their box mainly for pleasure. The notion of being able to write a one-line command to do almost anything draws ever closer with awk in your tool belt. For most things administrators use awk for, it's an extremely simple language. As you get into writing more advanced awk scripts, at some point it becomes a bit cumbersome, and you realize that Perl is also your friend. But for now, let's focus on how awk can get you the most bang for your keyboard strokes, shall we?
The first thing you should know is that awk is actually a rather powerful language. Entire books have been written about its use. If you're so inclined, you can write extremely complex 1000-line scripts using awk. However, as a systems administrator (the intended audience for this article), 99% of your use of awk will consist of relatively short scripts, and one-off one-liners typed right on the command line. Here's an example of a common use of awk:
[jonesy@newhotness jonesy]$ cat access_log |
awk '{print $1}' | sort | uniq -c | sort -rn
The above one-liner uses awk to slim down the amount of data coming from the web server's access log. The access log is space-delimited, and I only want to see the first field (hence "print $1"). Once I have that data, I want to sort it, then I have "uniq -c" provide a count of each occurrence for each unique value, and then I produce a reverse sort based on the numeric count provided by "uniq". The result has the number of hits in the left column, and the host in the right column, and the most frequent visitors are at the top of the list. Give it a shot! Even if you're hosted by an ISP, you should be able to access this log.
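The same tally can be kept inside awk itself, using an associative array instead of "sort | uniq -c". This is only a sketch: "top_hosts" is a made-up helper name, and it assumes the client host is the first space-delimited field, as in the log described above.

```shell
# top_hosts is a hypothetical helper; it reads an access log on stdin.
# awk counts each distinct first field in the array "hits", then the
# END block prints "count host" pairs once all input has been read.
top_hosts() {
    awk '{ hits[$1]++ }
         END { for (h in hits) print hits[h], h }' |
    sort -rn
}
```

You would call it as "top_hosts < access_log". Unlike "uniq -c", the counts come out without leading padding, which makes the output a little easier to feed to another script.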
Awk is perfect for ripping data into smaller chunks, to make it more bite-size for other applications or manipulation. To use it on the command line on files that are not space-delimited, you can use the "-F" flag, and indicate a delimiter. This is useful for tearing apart /etc/passwd and /etc/shadow files. For example:
[jonesy@tux jonesy]$ cat /etc/passwd | awk -F: '{print $5}' | awk -F, '{print NF}'
I actually used something kinda similar to that during a NIS to LDAP migration to see if the gecos field ($5 in /etc/passwd) had consistent enough data to be useful. One of the tests is to see how consistent the number of datapoints held in the gecos field is from record to record. To figure out the number of fields in each record's gecos field, I tell awk to use ":" as the delimiter, and, based on that, print the fifth field. I then pipe that to another awk one-liner, which uses an awk built-in variable, "NF" and a different delimiter (gecos is generally comma-delimited, if it's even used for useful data).
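If all you want is a verdict on how consistent the gecos data is, the output of that pipeline can be boiled down to a small histogram. A rough sketch, assuming passwd-format input on stdin ("gecos_field_counts" is a hypothetical name, not a standard tool):

```shell
# gecos_field_counts is a hypothetical helper; it reads passwd records
# on stdin and prints "number-of-gecos-items number-of-records" pairs.
gecos_field_counts() {
    awk -F: '{ print $5 }' |
    awk -F, '{ seen[NF]++ }
             END { for (n in seen) print n, seen[n] }' |
    sort -n
}
```

A single output line means every record has the same number of datapoints; several lines mean the field is filled in inconsistently.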
When one-liners just aren't enough for you, you can store a whole bunch of awk one-liners in a file, and call awk with "-f script" to tell it which file to read its commands from. Additionally, since awk needs to act on some data, you should also tag on something to take care of feeding awk the data it so desperately needs. For example, if I have a script called "getuname", which looks like this:
BEGIN { FS=":" }
{print $1}
I can now call that script, feeding it anything that I know ahead of time has the user name as the first field in a given record. So I can say "awk -f getuname < /etc/passwd", or "ypcat passwd | awk -f getuname". There are two rather important things I did in this script that will save you some headaches. First, notice the "BEGIN" statement. This statement exists to give you some space to do some tasks before awk starts reading any data. In this example, I want awk to know, before it processes any data, that it should use a colon as its field separator. Sure, I could've called awk differently to get around this, i.e., "awk -F: -f getuname < /etc/passwd", but this way is shorter, and that's the point! It should also be noted that, if you have the need, you can also have an "END" section in your script, which will perform its actions once, after the last data record has been processed.
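On the second line, I've just called a simple awk "action" statement, just like on the command line, with one important exception: I didn't use single quotes around it. On the command line, the quotes are there to keep the shell from interpreting the awk program, and the shell strips them before awk ever sees them. In a script file, awk reads the text directly, so it would've seen the quotes as part of the program and choked on a syntax error. I know, because it happened while I was testing this script. Bad admin!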
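Here is a sketch of a script file that uses both a "BEGIN" and an "END" section. The temporary file and the "count_users" wrapper are just scaffolding so the example can be pasted into a shell; in practice you would save the three awk lines in a file of your own, like "getuname" above. It assumes "mktemp" is available, which it is on most systems.

```shell
# Write a small awk script to a temporary file. Note that the awk code
# inside the file carries no shell quoting at all.
script=$(mktemp)
cat > "$script" <<'EOF'
BEGIN { FS = ":" }                   # runs once, before any input is read
      { print $1; users++ }          # runs for every input record
END   { print users + 0 " users" }   # runs once, after the last record
EOF

# count_users is a hypothetical wrapper; it reads passwd records on stdin.
count_users() {
    awk -f "$script"
}
```

Running "count_users < /etc/passwd" prints each user name followed by a final line with the total.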
Awk has some built-in functions, like most scripting languages, which make life a bit easier. It also has some built-in variables that awk keeps track of for you -- and you get their values for free, just for asking, which is nice. The most useful variable I've had the pleasure to use as an admin is the "NF" variable, which will tell you, based on the field separator given (whitespace by default), how many fields are in the current record. The most useful function I've used as an awk scripter is the "split" function, which can break a single field into an array of separate pieces. First, here's a quick example of NF in action:
cat /etc/passwd | awk -F: '{print NF}'
This is the lazy man's way to get the users' shells from the /etc/passwd file without having to remember how many fields are in the file. But wait! This doesn't print the last field in the record! It prints the number of fields in the record! Simple enough -- add a "$" to the front of "NF", and you'll get what you're looking for. Pipe the output to a couple of "sort" and "uniq" commands like we did earlier with the web log, and you'll get a snapshot of what the most commonly used shells are.
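Put together, the corrected command might look something like this. It is a sketch that assumes passwd-format input where the shell is the last colon-delimited field; "login_shells" is a hypothetical helper name.

```shell
# login_shells is a hypothetical helper; it reads passwd records on stdin.
# NF is the number of fields in the record, so $NF is the value of the
# last field, whatever its position happens to be.
login_shells() {
    awk -F: '{ print $NF }' |
    sort | uniq -c | sort -rn
}
```

Call it as "login_shells < /etc/passwd" and the most common shell ends up at the top of the list.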
Now let's have a look at the split function. Let's say you use your gecos field to store a bunch of datapoints, and the datapoints within the gecos field are comma-delimited. This is not nearly so contrived as it might sound -- this happens in more than two environments I've done work in. Here's what it might look like:
jonesy:x:12000:13:Brian K. Jones,LUSER,101B,NONE:/home/jonesy:/bin/bash
Now let's say your PHB comes along and says he's tired of referring to me as "jonesy" and wants to know my real name. You can use awk's "split" function to help you here, and the code for doing so is fairly short:
BEGIN { FS=":" }
{
    gfields = split ( $5, gecos, "," )
    chunkname = split ( gecos[1], fullname, " " )
    print fullname[chunkname], fullname[1]
}
Let's translate that into English, shall we? Of course, you now know what the BEGIN statement does here -- nothing new. We'll start by looking at the "gfields" line, where I use "split" to break up the 5th field of the record (the gecos field), using the comma as a delimiter, and storing all of the resulting fields in an array called "gecos". This can be counterintuitive, as you may be tempted to think that the resulting array is called "gfields". However, "split" returns the number of pieces it produced, so the "gfields" variable actually holds the number of fields in the "gecos" array (four, for the record above). You get a look at how this works in the following two lines. "chunkname" holds the number of fields in the "fullname" array. The "fullname" array is created by splitting the first field of the "gecos" array (in this case, the field holding my full name), using a space as the delimiter. On the next line, I reference "fullname[chunkname]", which will print the last name of the person, even if (as in my case) they have a middle name or initial. Then I print the very first field in the fullname array, so the output generated by this script acting on my passwd record would be "Jones Brian".
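To see what each variable holds, you can print the return values of "split" alongside the name. A minimal sketch, using the same variable names as the script above, wrapped in a hypothetical "real_name" shell function that reads passwd records on stdin:

```shell
# real_name is a hypothetical helper; it reads passwd records on stdin.
# split() fills the named array and returns how many pieces it made.
real_name() {
    awk -F: '{
        gfields   = split($5, gecos, ",")           # items in the gecos field
        chunkname = split(gecos[1], fullname, " ")  # words in the full name
        print "gfields=" gfields, "chunkname=" chunkname
        print fullname[chunkname], fullname[1]
    }'
}
```

Fed the sample record shown earlier, this prints "gfields=4 chunkname=3" followed by "Jones Brian".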
Whew! That was a mouthful. Awk has so many cool little hacks and built-in features that there has been more than one book published just on Awk. Undoubtedly, I'll utilize some of these features in future articles that involve putting together sysadmin solutions using various scripts as duct tape.