Awk.Info

"Cause a little auk awk
goes a long way."


WHAT'S NEW?

Mar 01: Michael Sanders demos an X-windows GUI for AWK.

Mar 01: Awk100#24: A. Lahm and E. de Rinaldis' patent search, in AWK

Feb 28: Tim Menzies asks this community to write an AWK cookbook.

Feb 28: Arnold Robbins announces a new debugger for GAWK.

Feb 28: Awk100#23: Premysl Janouch offers an IRC bot, in AWK

Feb 28: Updated: the AWK FAQ

Feb 28: Tim Menzies offers a tiny content management system, in Awk.

Jan 31: Comment system added to awk.info. For example, see the discussion at the bottom of ?keys2awk

Jan 31: Martin Cohen shows that Gawk can handle massively long strings (300 million characters).

Jan 31: The AWK FAQ is being updated. For comments/corrections/extensions, please mail tim@menzies.us

Jan 31: Martin Cohen finds Awk on the Android platform.

Jan 31: Aleksey Cheusov released a new version of runawk.

Jan 31: Hirofumi Saito contributes a candidate Awk mascot.

Jan 31: Michael Sanders shows how to quickly build an AWK GUI for windows.

Jan 31: Hyung-Hwan Chung offers QSE, an embeddable Awk Interpreter.



categories: SysAdmin,Oct,2009,Admin

Sys Admin

These pages focus on sys admin tools in Awk.


categories: Oct,2009,Zazzle

Awk Mug

Zazzle.com is offering their great "I love Awk" mug, starting at $12.


categories: Oct,2009,JohnD

Parallel Awk

From John David Duncan's parallel-awk.org site.

Parallel Awk is an effort to link Awk with MPI. It lets the everyday analysis of large plain-text files be parallelized, allows rapid prototyping of parallel applications, preserves the syntax and style of Awk, and hides the details of MPI.

Awk and MPI

The Awk programming language, first developed at Bell Labs in 1977, is a standard part of Unix operating system distributions. It is a compact language, commonly used in systems administration and in commercial (as opposed to scientific) computing. The half dozen books about awk include the original slim and very readable Awk book by Aho, Kernighan, and Weinberger. Awk is standardized in POSIX, and the most actively maintained current implementation is GNU awk. While awk, like sed, is perhaps most often used for "one-liners," its regular expression handling and rich C-like syntax make it well-suited for many small applications and domain-specific languages.

MPI is a standard Message Passing Interface for parallel computing created by the MPI Forum, implemented in two widely-used free distributions (LAM/MPI and MPICH) and in optimized versions provided by many hardware vendors. MPI libraries are often linked with Fortran or C code in scientific computing tasks, such as matrix calculations, and run on supercomputers or Beowulf clusters. For some of these applications, runtime is actually greater than development time; nonetheless, a language for rapid prototyping is a handy tool to have around.

Example: Calculating Pi

# pi.awk: approximate pi by integrating f(x) = 4/(1+x^2)
# n = number of intervals to calculate 
#
# e.g.: mpiexec -n 4 mpawk -v n=10000 -f pi.awk 

BEGIN {
    h = 1/n
    for(i = RANK+1 ; i <= n ; i += SIZE) {
        x = h * (i - 0.5)
        sum +=  4 / (1 + x^2)
    }
    pi = reduce(h * sum)    # combine the partial sums from all ranks
    if(!RANK) printf("n=%d, pi is %1.20f\n",n,pi)
}

pi.awk requires about 20% as many lines of code as its equivalents in C or Fortran. The output is printed by the process with RANK = 0 and looks like this:

sh% mpiexec -n 4 mpawk -v n=100000 -f pi.awk
n=100000, pi is 3.14159265359811668006

Status

The latest beta release of Parallel Awk is version 0.8. In this release, any Awk expression (including numbers, strings, and arrays) can be sent from one process to another using the functions send and recv. The comm_split() function, an interface to MPI_Comm_split, allows the creation of intra-communicators, while a companion function comm_set() is used to set the default MPI communicator implicitly used for all other MPI operations. Supported collective operations include reduce(), which can be applied to both numeric and string expressions, and barrier(). A function called assign() divides the lines of input among the set of processes; a hash() function, applied to array keys or other strings, can serve the same purpose.
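
The page doesn't show send and recv in action; here is a minimal sketch of passing a token once around a ring of processes, assuming MPI-style send(value, dest) and recv(source) signatures (the argument forms are my assumption, not documented above):

# ring.awk: pass a token once around the ring of processes
# e.g.: mpiexec -n 4 mpawk -f ring.awk
BEGIN {
    if (RANK == 0) {
        send("token", 1)                 # start the ring
        msg = recv(SIZE - 1)             # token comes back from the last rank
        print "rank 0 received: " msg
    } else {
        msg = recv(RANK - 1)             # receive from the left neighbour
        send(msg, (RANK + 1) % SIZE)     # forward to the right neighbour
    }
}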


categories: ,TextMining,Oct,2009,JohnF

Zipf's Law

These notes come from John Fry's Counting with Awk lecture in his course Linguistics 115: Corpus Linguistics, Fall 2007, SJSU.

Much research has reported that human writings follow well-defined laws: natural language text and software programs conform tightly to simple and regular statistical models. For example, Zipf's Law states that multiplying a word's rank r by its frequency f produces a (roughly) constant value C: i.e., r times f is a constant. The frequency f of a word is obtained by counting the number of times it occurs in a text, and r is obtained by ranking all the words by frequency (1. the; 2. and; 3. I; etc.). Here is an example of Zipf's Law for five words in the London-Lund corpus of spoken conversation:

r  X     f   = C 
35 very  836 = 29,260 
45 see   674 = 30,330 
55 which 563 = 30,965 
65 get   469 = 30,485 
75 out   422 = 31,650 
Another way of expressing Zipf's Law is to say that frequency is reciprocally proportional to rank. For example, the 2nd-ranked word ("and") appears half as often as the 1st-ranked word ("the"). More generally, the nth-ranked word appears 1/n as often as "the".
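
As a quick numeric check (my illustration, not from the lecture), take C to be about 30,000 from the London-Lund table above and predict each word's frequency as C/r:

awk -v C=30000 'BEGIN { n = split("35 45 55 65 75", r, " ")
                        for (i = 1; i <= n; i++) print r[i], int(C/r[i]) }'
35 857
45 666
55 545
65 461
75 400

The predictions track the observed frequencies (836, 674, 563, 469, 422) reasonably well.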

Here is a short awk program, saved as ~jfry/zipf.awk, that reads in a ranked frequency list and computes r times f.

BEGIN {printf "%20s%7s%7s%10s\n", "WORD","RANK","FREQ","C"} 
      {printf "%20s%7d%7d%10d\n", $2, NR, $1, NR*$1} 

This program can be run with

awk -f ~jfry/zipf.awk 

Testing Zipf's Law on Shakespeare :

$ tr A-Z a-z < shakespeare.txt | tr -sc a-z '\n' | sort | 
uniq -c | sort -rn | awk -f ~jfry/zipf.awk 
WORD RANK  FREQ      C WORD RANK  FREQ      C 
the     1 27378  27378 s      17  7721 131257 
and     2 26084  52168 for    18  7655 137790 
i       3 22538  67614 be     19  6897 131043 
to      4 19771  79084 his    20  6859 137180 
of      5 17481  87405 he     21  6679 140259 
a       6 14725  88350 your   22  6657 146454 
you     7 13826  96782 this   23  6608 151984 
my      8 12489  99912 but    24  6277 150648 
that    9 11318 101862 have   25  5902 147550 
in     10 11112 111120 as     26  5749 149474 
is     11  9319 102509 thou   27  5549 149823 
d      12  8960 107520 him    28  5205 145740 
not    13  8512 110656 so     29  5058 146682 
with   14  7791 109074 will   30  5008 150240 
me     15  7777 116655 what   31  4808 149048 
it     16  7725 123600 thy    32  4034 129088 

Testing Zipf's Law on newswire

$ cd /corpora/newswire/data 
$ zcat -r .|grep -v '^<' | tr A-Z a-z|tr -sc a-z '\n' | sort| 
uniq -c | sort -rn | awk -f /home/jfry/zipf.awk 
WORD RANK FREQ    C WORD RANK FREQ    C 
the     1 142M 142M by     16 14M  224M 
to      2  60M 120M he     17 13M  235M 
of      3  60M 180M at     18 13M  244M 
a       4  53M 214M as     19 12M  230M 
and     5  51M 257M from   20 10M  216M 
in      6  51M 307M be     21  9M  201M 
s       7  28M 202M his    22  9M  205M 
for     8  22M 178M has    23  9M  208M 
that    9  21M 195M have   24  9M  217M 
said   10  19M 199M but    25  8M  212M 
on     11  19M 214M are    26  8M  218M 
is     12  16M 200M an     27  8M  225M 
with   13  15M 197M will   28  7M  207M 
was    14  14M 203M i      29  7M  213M 
it     15  14M 211M not    30  7M  217M 

categories: ,Mawk,Oct,2009,JMellander

Faster Hashing in Mawk

J. Mellander reports in comp.lang.awk how to make Mawk's hashing run 20+ times faster.

Recently, for a project, I had the occasion to use mawk - I have a list of ~12,000,000 Unix timestamps to nanosecond precision that I needed to match against the first field of every record in a number of huge files. Gawk couldn't handle the number of records, and so I used mawk, it being more memory-thrifty. The program was a one-liner like this:

mawk 'FNR==NR {x[$1]++; next} $1 in x' timestamp_file log_file

which works perfectly, but the run time seemed excessive - many hours per log file - which made me think that the hashing function was causing many collisions, and thus hash chaining.....

When stuck in a slow meeting, I started looking at the mawk source code, specifically the hashing functions, of which there are two: hash() in hash.c and ahash() in array.c.

I was surprised to find that the hashing functions in both cases essentially just sum the bytes of the key to create the hash - this means that 123, 321, 213, etc. would all hash to the same location and cause collisions, and hash chaining.
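
A quick way to see the problem (my illustration, not part of the original post): any hash that just sums the bytes of the key assigns every permutation of the same characters to the same bucket.

$ awk 'BEGIN {
    n = split("123 321 213 132", keys, " ")
    for (i = 1; i <= n; i++) {
        s = keys[i]; h = 0
        for (j = 1; j <= length(s); j++)              # sum a per-character
            h += index("0123456789", substr(s, j, 1)) # value, as byte-summing does
        print s, "->", h
    }
}'
123 -> 9
321 -> 9
213 -> 9
132 -> 9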

Modifying the hashing to a more efficient hash caused an enormous gain in efficiency, as in this test:

$ wc -l j
2999999 j

$ time mawk-1.3.3/mawk '{x[$1]++}' j >/dev/null

real    2m24.362s
user    2m20.174s
sys     0m0.663s

$ time mawk-1.3.3a/mawk '{x[$1]++}' j >/dev/null

real    0m6.607s
user    0m6.146s
sys     0m0.241s

mawk-1.3.3a has the below modifications. In hash.c I replaced the 'hash' function with:

/*
FNV-1 hash function, per
en.wikipedia.org/wiki/Fowler-Noll-Vo_hash_function
*/
unsigned hash(s)
register char *s ;
{
	register unsigned h = 2166136261 ;
	while (*s) h = (h * 16777619) ^ *s++ ;
	return h ;
}

and in array.c replaced 'ahash' with:

/*
FNV-1 hash function, per
en.wikipedia.org/wiki/Fowler-Noll-Vo_hash_function
*/
static unsigned ahash(sval)
STRING* sval ;
{
	register unsigned h = 2166136261 ;
	register char *s = sval->str;

	while (*s) h = (h * 16777619) ^ *s++ ;
	return h ; 
}

categories: ,Mawk,Oct,2009,BrendanO

Mawk: faster than C, C++, Java, Perl, Ruby,...

Brendan O'Connor writes in his blog:

    When one of these newfangled 'Big Data' sets comes your way, the very first thing you have to do is data munging: shuffling around file formats, renaming fields and the like. Once you're dealing with hundreds of megabytes of data, even simple operations can take plenty of time.

    For one recent ad-hoc task I had - reformatting 1GB of textual feature data into a form Matlab and R can read - I tried writing implementations in several languages, with help from my classmate Elijah.

    To be clear, the problem is to take several files of (item name, feature name, value) triples, like:

    000794107-10-K-19960401 limited 1
    000794107-10-K-19960401 colleges 1
    000794107-10-K-19960401 code 2
    ...
    004334108-10-K-19961230 recognition 1
    004334108-10-K-19961230 gross 8
    ...
    
    And then rename items and features into sequential numbers as a sparse matrix: (i, j, value) triples. Items should count up from 1 inside each file; but features should be shared across files, so they need a shared counter. Finally, we need to write a mapping of feature IDs back to their names for later inspection; this can just be a list.

    Since it's a standardized language, many implementations exist. One of them, MAWK, is incredibly efficient. It outperforms all other languages, including statically typed compiled ones like Java and C++! It wins on both LOC and performance criteria - a rare feat indeed, transcending the usual competition of slow-but-easy scripting languages versus fast-but-hard compiled languages.

    All the code, results, and data can be obtained at github.com/brendano/awkspeed. I'd love to see results for more languages.
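
As orientation only, here is a hypothetical awk sketch of the reformatting task described above; the file layout and names are my assumptions, and the real implementations are in the awkspeed repository:

# sparsify.awk: (item, feature, value) triples -> (i, j, value) triples
# usage: awk -f sparsify.awk file1 file2 ... > matrix.txt
FNR == 1 { split("", item); ni = 0 }    # item IDs restart in each file
{
    if (!($1 in item)) item[$1] = ++ni  # number items in order of appearance
    if (!($2 in feat)) {                # feature IDs are shared across files
        feat[$2] = ++nf
        print $2 > "feature_names.txt"  # line number = feature ID
    }
    print item[$1], feat[$2], $3
}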

Editor's note: one reply to this blog entry, by Eric Young, optimized Brendan's Ruby solution and re-ran all the tests. Eric reported the following runtimes. Note that they confirm Brendan's results: mawk runs faster than everything else.

 33.8s     mawk
 36.3s     gcc c
 51.0s     java
 67.0s     perl Fletch.pl
 71.7s     python
 87.8s     perl
 95.8s     nawk
101.4s     gawk
114.0s     gcc
133.0s     ruby1.9 eay.rb
136.8s     ruby1.8 eay.rb
327.6s     ruby1.8
372.9s     ruby1.9

categories: SysAdmin,Oct,2009,M0J0

Shorten Your Pipes

m0j0 writes in his blog...

I was lurking around on twitter during my lunch hour (yes, even freelancers need a lunch hour), and @bitprophet tweeted thusly:

    Get syslog-owned log names from syslog.conf:
    grep -v "^#" syslog.conf | 
    awk "{print $2}" | egrep -v "^(\*|\|)" | 
    sed "/^$/ d" | sed "s/^-//"
    

Followed by this:

    Interested to see if anyone can shorten my previous tweet's command line, outside of using 'cut' instead of the awk bit.)

I happen to love puzzles like this, and my lunch was almost immediately followed by a long, boring conference call.

@bitprophet's pipeline above is translated by my brain into the English:

Find non-commented lines, grab the second space-delimited field, then filter out the ones that start with "*" or "|", then delete any blank lines, and strip any leading "-" from the result.

My brain usually attempts to think of the English version of the solution *first*, and then tries to emulate that in the code/command I write. So, the issue here is that we want to find file paths (and apparently sockets are ok, too, as "@" is a valid leading character in the initial definition of the problem). If it's a file path, we want to see it in a form that would be suitable for passing to something like "ls -l", which means leading symbols like "-" and "|" should be omitted.

In a syslog.conf file, the main meat is the area where you specify the warning levels, and the file you want messages at that warning level sent to (this is a simplistic explanation, but good enough to understand the solution I came up with). The file is also littered with comments. Here's the file on my Mac:

*.err;kern.*;auth.notice;authpriv,remoteauth,install.none;mail.crit        /dev/console
*.notice;authpriv,remoteauth,ftp,install.none;kern.debug;mail.crit    /var/log/system.log

# Send messages normally sent to the console also to the serial port.
# To stop messages from being sent out the serial port, comment out this line.
#*.err;kern.*;auth.notice;authpriv,remoteauth.none;mail.crit        /dev/tty.serial

# The authpriv log file should be restricted access; these
# messages shouldn't go to terminals or publically-readable
# files.
auth.info;authpriv.*;remoteauth.crit            /var/log/secure.log

lpr.info                        /var/log/lpr.log
mail.*                            /var/log/mail.log
ftp.*                            /var/log/ftp.log

install.*                        /var/log/install.log
install.*                        @127.0.0.1:32376
local0.*                        /var/log/appfirewall.log
local1.*                        /var/log/ipfw.log
stuff.*                            -/boo
things.*                        |/var/log
*.emerg                            *

So, in English, my brain parses the problem like this:

    Skip blank lines, commented lines, and lines where the file name is "*", and give me everything else, but strip off characters "-" and "|" before sending it to the screen.

And here's my awk one-liner for doing that:

awk '$0 !~ /^$|^#/ && $2 !~ /^\*/ {sub(/^-|^\|/,"",$2);print $2}' syslog.conf

Knowing a few key things about awk will help parse the above:

Awk automatically breaks up each line of input into fields. If you don't tell it what to use as a delimiter, it'll just use any run of whitespace as the delimiter. If you have a CSV file, you'd likely use "awk -F," to tell awk to use a comma. For /etc/passwd, use "awk -F:". From there, you can reference the first field as $1, the second as $2, etc. $0 represents the whole line. There are more, but that's enough for this example.
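
For instance (a generic illustration, not from the original post), to print each user's login name and shell from the colon-delimited /etc/passwd:

awk -F: '{print $1, $7}' /etc/passwd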

Though I think most sysadmins can get a lot done with simple usage like "awk -F: '{print $2}'", sometimes more power is needed, and awk delivers. It uses extended regular expressions, and enables you to check a field (or the whole line: $0, like I do above) against a regex as a precondition for performing some action with the line or a field on that line. So, in the above awk command, I check to see if the line is either empty or a comment. I then use a logical AND to check if field 2 starts with "*". If the current line is a match for any of these rules it is skipped.

Another nice thing about awk is that it actually is a Turing-complete programming language. After I check the lines of input against the rules mentioned above, I immediately know that I definitely want at least some portion of $2 in the remaining lines. What I *don't* want are preceding characters like "-" or "|". I need to strip them from the file name. I use awk's built-in "sub()" function to handle that, and with that out of the way I call "print" to send the result to the screen.


categories: Sed,Tips,Oct,2009,EdM

Sed in Awk

Writing in comp.lang.awk Ed Morton ports numerous complex sed expressions to Awk:

A comp.lang.awk author asked the question:

    I have a file that has a series of lists

    (qqq)
    aaa 111
    bbb 222
    

    and I want to make it look like

    aaa 111 (qqq)
    bbb 222 (qqq)
    

IMHO the clearest sed solution given was:

sed -e '
   /^([^)]*)/{
      h; # remember the (qqq) part
      d
   }

   / [1-9][0-9]*$/{
      G; # strap the (qqq) part to the list
      s/\n/ /
   }
' yourfile

while the awk one was:

awk '/^\(/{ h=$0;next } { print $0,h }' file

As I've said repeatedly, sed is an excellent tool for simple substitutions on a single line. For anything else you should use awk, perl, etc.

Having said that, let's take a look at the awk equivalents for the posted sed examples below that are not simple substitutions on a single line so people can judge for themselves (i.e. quietly - this is not a contest and not a religious war!) which code is clearer, more consistent, and more obvious. When reading this, just imagine yourself having to figure out what the given script does in order to debug or enhance it or write your own similar one later.

Note that in awk as in shell there are many ways to solve a problem so I'm trying to stick to the solutions that I think would be the most useful to a beginner since that's who'd be reading an examples page like this, and without using any GNU awk extensions. Also note I didn't test any of this but it's all pretty basic stuff so it should mostly be right.

For those who know absolutely nothing about awk, I think all you need to know to understand the scripts below is that, like sed, it loops through input files evaluating conditions against the current input record (a line by default) and executing the actions you specify (printing the current input record if none specified) if those conditions are true, and it has the following pre-defined symbols:

NR = Number of Records read so far
NF = Number of Fields in current record
FS = the Field Separator
RS = the Record Separator
BEGIN = a pattern that's only true before processing any input
END = a pattern that's only true after processing all input.

Oh, and setting RS to the NULL string (-v RS='') tells awk to read paragraphs instead of lines as individual records, and setting FS to the NULL string (-v FS='') tells awk to treat each individual character as a field.

For more info on awk, see http://www.awk.info.

Introductory Examples

Double space a file:

    Sed:

    sed G
    

    Awk

    awk '{print $0 "\n"}'
    

Double space a file which already has blank lines in it. Output file should contain no more than one blank line between lines of text.

    Sed:

    sed '/^$/d;G'
    

    Awk:

    awk 'NF{print $0 "\n"}'
    

Triple space a file

    Sed:

    sed 'G;G'
    

    Awk:

    awk '{print $0 "\n\n"}'
    

Undo double-spacing (assumes even-numbered lines are always blank):

    Sed:

    sed 'n;d'
    

    Awk:

    awk 'NF'
    

Insert a blank line above every line which matches "regex":

    Sed:

    sed '/regex/{x;p;x;}'
    

    Awk:

    awk '{print (/regex/ ? "\n" : "") $0}'
    

Insert a blank line below every line which matches "regex":

    Sed:

    sed '/regex/G'
    

    Awk:

    awk '{print $0 (/regex/ ? "\n" : "")}'
    

Insert a blank line above and below every line which matches "regex":

    Sed:

    sed '/regex/{x;p;x;G;}'
    

    Awk:

    awk '{print (/regex/ ? "\n" $0 "\n" : $0)}'
    

Numbering

Number each line of a file (simple left alignment). Using a tab instead of a space will preserve margins:

    Sed:

    sed = filename | sed 'N;s/\n/\t/'
    

    Awk:

    awk '{print NR "\t" $0}'
    

Number each line of a file (number on left, right-aligned):

    Sed:

    sed = filename | sed 'N; s/^/     /; s/ *\(.\{6,\}\)\n/\1  /'
    

    Awk:

    awk '{printf "%6s  %s\n",NR,$0}'
    

Number each line of file, but only print numbers if line is not blank:

    Sed:

    sed '/./=' filename | sed '/./N; s/\n/ /'
    

    Awk:

    awk 'NF{print NR "\t" $0}'
    

Count lines (emulates "wc -l")

    Sed:

    sed -n '$='
    

    Awk:

    awk 'END{print NR}'
    

Text Conversion and Substitution

Align all text flush right on a 79-column width:

    Sed:

    sed -e :a -e 's/^.\{1,78\}$/ &/;ta'  # set at 78 plus 1 space
    

    Awk:

    awk '{printf "%79s\n",$0}'
    

Center all text in the middle of 79-column width. In method 1, spaces at the beginning of the line are significant, and trailing spaces are appended at the end of the line. In method 2, spaces at the beginning of the line are discarded in centering the line, and no trailing spaces appear at the end of lines.

    Sed:

    sed  -e :a -e 's/^.\{1,77\}$/ & /;ta'                     # method 1
    sed  -e :a -e 's/^.\{1,77\}$/ &/;ta' -e 's/\( *\)\1/\1/'  # method 2
    

    Awk:

    awk '{printf "%"int((79+length)/2)"s\n",$0}'
    

Reverse order of lines (emulates "tac"). A bug/feature in sed v1.5 causes blank lines to be deleted.

    Sed:

    sed '1!G;h;$!d'               # method 1
    sed -n '1!G;h;$p'             # method 2
    

    Awk:

    awk '{a[NR]=$0} END{for (i=NR;i>=1;i--) print a[i]}'
    

Reverse each character on the line (emulates "rev")

    Sed:

    sed '/\n/!G;s/\(.\)\(.*\n\)/&\2\1/;//D;s/.//'
    

    Awk:

    awk -v FS='' '{for (i=NF;i>=1;i--) printf "%s",$i; print ""}'
    

Join pairs of lines side-by-side (like "paste")

    Sed:

    sed '$!N;s/\n/ /'
    

    Awk:

    awk '{printf "%s%s",$0,(NR%2 ? " " : "\n")}'
    

If a line ends with a backslash, append the next line to it

    Sed:

    sed -e :a -e '/\\$/N; s/\\\n//; ta'
    

    Awk:

    awk '{printf "%s",(sub(/\\$/,"") ? $0 : $0 "\n")}'
    

if a line begins with an equal sign, append it to the previous line and replace the "=" with a single space

    Sed:

    sed -e :a -e '$!N;s/\n=/ /;ta' -e 'P;D'
    

    Awk:

    awk '{printf "%s%s",(sub(/^=/," ") ? "" : (NR>1 ? "\n" : "")),$0} END{print ""}'
    

Add a blank line every 5 lines (after lines 5, 10, 15, 20, etc.)

    Sed:

    gsed '0~5G'                  # GNU sed only
    sed 'n;n;n;n;G;'             # other seds
    

    Awk:

    awk '{print $0} !(NR%5){print ""}'
    

Selective Printing of Certain Lines

Print first 10 lines of file (emulates behavior of "head")

    Sed:

    sed 10q
    

    Awk:

    awk '{print $0} NR==10{exit}'
    

Print first line of file (emulates "head -1")

    Sed:

    sed q
    

    Awk:

    awk 'NR==1{print $0; exit}'
    

Print the last 10 lines of a file (emulates "tail")

    Sed:

    sed -e :a -e '$q;N;11,$D;ba'
    

    Awk:

    awk '{a[NR]=$0} END{for (i=NR-9;i<=NR;i++) print a[i]}'
    

Print the last 2 lines of a file (emulates "tail -2")

    Sed:

    sed '$!N;$!D'
    

    Awk:

    awk '{a[NR]=$0} END{for (i=NR-1;i<=NR;i++) print a[i]}'
    

Print the last line of a file (emulates "tail -1")

    Sed:

    sed '$!d'                    # method 1
    sed -n '$p'                  # method 2
    

    Awk:

    awk 'END{print $0}'
    

Print the next-to-the-last line of a file

    Sed:

    sed -e '$!{h;d;}' -e x  # for 1-line files, print blank line
    sed -e '1{$q;}' -e '$!{h;d;}' -e x  # for 1-line files, print the line
    sed -e '1{$d;}' -e '$!{h;d;}' -e x  # for 1-line files, print nothing
    

    Awk:

    awk '{prev=curr; curr=$0} END{print prev}'
    

Print only lines which match regular expression (emulates "grep")

    Sed:

    sed -n '/regexp/p'           # method 1
    sed '/regexp/!d'             # method 2
    

    Awk:

    awk '/regexp/'
    

Print only lines which do NOT match regexp (emulates "grep -v")

    Sed:

    sed -n '/regexp/!p'          # method 1, corresponds to above
    sed '/regexp/d'              # method 2, simpler syntax
    

    Awk:

    awk '!/regexp/'
    

Print the line immediately before a regexp, but not the line containing the regexp

    Sed:

    sed -n '/regexp/{g;1!p;};h'
    

    Awk:

    awk '/regexp/{print prev} {prev=$0}'
    

Print the line immediately after a regexp, but not the line containing the regexp

    Sed:

    sed -n '/regexp/{n;p;}'
    

    Awk:

    awk 'found{print $0} {found=(/regexp/ ? 1 : 0)}'
    

Print 1 line of context before and after regexp, with line number indicating where the regexp occurred (similar to "grep -A1 -B1")

    Sed:

    sed -n -e '/regexp/{=;x;1!p;g;$!N;p;D;}' -e h
    

    Awk:

    awk 'found    {print preLine "\n" hitLine "\n" $0;   found=0}
          /regexp/ {preLine=prev;   hitLine=NR " " $0;    found=1}
          {prev=$0}'
    

Grep for AAA and BBB and CCC (in any order)

    Sed:

    sed '/AAA/!d; /BBB/!d; /CCC/!d'
    

    Awk:

    awk '/AAA/&&/BBB/&&/CCC/'
    

Grep for AAA and BBB and CCC (in that order)

    Sed:

    sed '/AAA.*BBB.*CCC/!d'
    

    Awk:

    awk '/AAA.*BBB.*CCC/'
    

Grep for AAA or BBB or CCC (emulates "egrep")

    Sed:

    sed -e '/AAA/b' -e '/BBB/b' -e '/CCC/b' -e d    # most seds
    gsed '/AAA\|BBB\|CCC/!d'                        # GNU sed only
    

    Awk:

    awk '/AAA|BBB|CCC/'
    

Print paragraph if it contains AAA (blank lines separate paragraphs). Sed v1.5 must insert a 'G;' after 'x;' in the next 3 scripts below

    Sed:

    sed -e '/./{H;$!d;}' -e 'x;/AAA/!d;'
    

    Awk:

    awk -v RS='' '/AAA/'
    

Print paragraph if it contains AAA and BBB and CCC (in any order)

    Sed:

    sed -e '/./{H;$!d;}' -e 'x;/AAA/!d;/BBB/!d;/CCC/!d'
    

    Awk:

    awk -v RS='' '/AAA/&&/BBB/&&/CCC/'
    

Print paragraph if it contains AAA or BBB or CCC

    Sed:

    sed -e '/./{H;$!d;}' -e 'x;/AAA/b' -e '/BBB/b' -e '/CCC/b' -e d
    gsed '/./{H;$!d;};x;/AAA\|BBB\|CCC/b;d'         # GNU sed only
    

    Awk:

    awk -v RS='' '/AAA|BBB|CCC/'
    

Print only lines of 65 characters or longer

    Sed:

    sed -n '/^.\{65\}/p'
    

    Awk:

    awk -v FS='' 'NF>=65'
    

Print only lines of less than 65 characters

    Sed:

    sed -n '/^.\{65\}/!p'        # method 1, corresponds to above
    sed '/^.\{65\}/d'            # method 2, simpler syntax
    

    Awk:

    awk -v FS='' 'NF<65'
    

Print section of file from regular expression to end of file

    Sed:

    sed -n '/regexp/,$p'
    

    Awk:

    awk '/regexp/{found=1} found'
    

Print section of file based on line numbers (lines 8-12, inclusive)

    Sed:

    sed -n '8,12p'               # method 1
    sed '8,12!d'                 # method 2
    

    Awk:

    awk 'NR>=8 && NR<=12'
    

Print line number 52

    Sed:

    sed -n '52p'                 # method 1
    sed '52!d'                   # method 2
    sed '52q;d'                  # method 3, efficient on large files
    

    Awk:

    awk 'NR==52{print $0; exit}'
    

Beginning at line 3, print every 7th line

    Sed:

    gsed -n '3~7p'               # GNU sed only
    sed -n '3,${p;n;n;n;n;n;n;}' # other seds
    

    Awk:

    awk '!((NR-3)%7)'
    

print section of file between two regular expressions (inclusive)

    Sed:

    sed -n '/Iowa/,/Montana/p'             # case sensitive
    

    Awk:

    awk '/Iowa/,/Montana/'
    

Print all lines of FileID up to the 1st line containing a string

    Sed:

    sed '/string/q' FileID
    

    Awk:

    awk '{print $0} /string/{exit}'
    

Print all lines of FileID from the 1st line containing a string until EOF

    Sed:

    sed '/string/,$!d' FileID
    

    Awk:

    awk '/string/{found=1} found'
    

Print all lines of FileID from the 1st line containing string1 until the 1st line containing string2 [boundaries inclusive]

    Sed:

    sed '/string1/,$!d;/string2/q' FileID
    

    Awk:

    awk '/string1/{found=1} found{print $0} /string2/{exit}'
    

Selective Deletion of Certain Lines

Print all of file EXCEPT section between 2 regular expressions

    Sed:

    sed '/Iowa/,/Montana/d'
    

    Awk:

    awk '/Iowa/,/Montana/{next} {print $0}' file
    

Delete duplicate, consecutive lines from a file (emulates "uniq"). First line in a set of duplicate lines is kept, rest are deleted.

    Sed:

    sed '$!N; /^\(.*\)\n\1$/!P; D'
    

    Awk:

    awk '$0!=prev{print $0} {prev=$0}'
    

Delete duplicate, nonconsecutive lines from a file. Beware not to overflow the buffer size of the hold space, or else use GNU sed.

    Sed:

    sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'
    

    Awk:

    awk '!a[$0]++'
    

Delete all lines except duplicate lines (emulates "uniq -d").

    Sed:

    sed '$!N; s/^\(.*\)\n\1$/\1/; t; D'
    

    Awk:

    awk '$0==prev{print $0} {prev=$0}'      # works only on consecutive
    awk 'a[$0]++'                           # works on non-consecutive
    

Delete the first 10 lines of a file

    Sed:

    sed '1,10d'
    

    Awk:

    awk 'NR>10'
    

Delete the last line of a file

    Sed:

    sed '$d'
    

    Awk:

    awk 'NR>1{print prev} {prev=$0}'
    

Delete the last 2 lines of a file

    Sed:

    sed 'N;$!P;$!D;$d'
    

    Awk:

    awk 'NR>2{print prev[2]} {prev[2]=prev[1]; prev[1]=$0}'    # method 1
    awk '{a[NR]=$0} END{for (i=1;i<=NR-2;i++) print a[i]}'     # method 2
    awk -v num=2 'NR>num{print prev[num]}
        {for (i=num;i>1;i--) prev[i]=prev[i-1]; prev[1]=$0}'    # method 3
    

Delete the last 10 lines of a file

    Sed:

    sed -e :a -e '$d;N;2,10ba' -e 'P;D'   # method 1
    sed -n -e :a -e '1,10!{P;N;D;};N;ba'  # method 2
    

    Awk:

    awk -v num=10 '...same as deleting last 2 method 3 above...'
    

Delete every 8th line

    Sed:

    gsed '0~8d'                           # GNU sed only
    sed 'n;n;n;n;n;n;n;d;'                # other seds
    

    Awk:

    awk 'NR%8'
    

Delete lines matching pattern

    Sed:

    sed '/pattern/d'
    

    Awk:

    awk '!/pattern/'
    

Delete ALL blank lines from a file (same as "grep '.' ")

    Sed:

    sed '/^$/d'                           # method 1
    sed '/./!d'                           # method 2
    

    Awk:

    awk '!/^$/'                             # method 1
    awk '/./'                               # method 2
    

Delete all CONSECUTIVE blank lines from file except the first; also deletes all blank lines from top and end of file (emulates "cat -s")

    Sed:

    sed '/./,/^$/!d'
    

    Awk:

    awk '/./,/^$/'
    

Delete all leading blank lines at top of file

    Sed:

    sed '/./,$!d'
    

    Awk:

    awk 'NF{found=1} found'
    

Delete all trailing blank lines at end of file

    Sed:

    sed -e :a -e '/^\n*$/{$d;N;ba' -e '}'  # works on all seds
    sed -e :a -e '/^\n*$/N;/\n$/ba'        # ditto, except for gsed 3.02.*
    

    Awk:

    awk '{a[NR]=$0} NF{nbNr=NR} END{for (i=1;i<=nbNr;i++) print a[i]}'
    

Delete the last line of each paragraph

    Sed:

    sed -n '/^$/{p;h;};/./{x;/./p;}'
    

    Awk:

    awk -v FS='\n' -v RS='' '{for (i=1;i<NF;i++) print $i; print ""}'
    

Special Applications

Get Usenet/e-mail message header

    Sed:

    sed '/^$/q'        # deletes everything after first blank line
    

    Awk:

    awk '/^$/{exit} {print $0}'
    

Get Usenet/e-mail message body

    Sed:

    sed '1,/^$/d'              # deletes everything up to first blank line
    

    Awk:

    awk 'found{print $0} /^$/{found=1}'
    

Get Subject header, but remove initial "Subject: " portion

    Sed:

    sed '/^Subject: */!d; s///;q'
    

    Awk:

    awk 'sub(/^Subject: */,""){print $0; exit}'
    

Parse out the address proper. Pulls out the e-mail address by itself from the 1-line return address header (see preceding script)

    Sed:

    sed 's/ *(.*)//; s/>.*//; s/.*[:<] *//'
    

    Awk:

    awk '{sub(/ *\(.*\)/,""); sub(/>.*/,""); sub(/.*[:<] */,""); print $0}'
    

Add a leading angle bracket and space to each line (quote a message)

    Sed:

    sed 's/^/> /'
    

    Awk:

    awk '{print "> " $0}'
    

Delete leading angle bracket & space from each line (unquote a message)

    Sed:

    sed 's/^> //'
    

    Awk:

    awk '{sub(/^> /,""); print $0}'
    

categories: Databases,Oct,2009,ScottS

A MySql Client

Contents

Download

Download from LAWKER.

Code

Set Up

BEGIN {
    if (!mysql["path"]) {
        mysql["path"] = "/usr/bin/mysql"
    }
    if (mysql["user"]) mysql["user"] = "-u" mysql["user"]
    if (mysql["pass"]) mysql["pass"] = "-p" mysql["pass"]

    if (!mysql["tempfile_command"]) {
        mysql["tempfile_command"] = "mktemp /tmp/__mysql.awk.XXXXXX"
    }
    mysql["resource_id"] = 1
    __mysql_dequote["r"]  = "\r"
    __mysql_dequote["n"]  = "\n"
    __mysql_dequote["t"]  = "\t"
    __mysql_dequote["\\"] = "\\"
}

Main Functions

function mysql_db (db)      { mysql["database"] = db    }
function mysql_path (path)  { mysql["path"]     = path  }

function mysql_tempfile_command (command) {
    mysql["tempfile_command"] = command
}
function mysql_login (username, password, host, args) {
    mysql["user"] = "-u" username
    mysql["pass"] = "-p" password
        if (host) mysql["host"] = "-h" host
        if (args) mysql["args"] = args
}
function mysql_query (query    ,input,key,i,call,resource) {
    resource = mysql["resource_id"]++
    mysql["tempfile_command"] | getline mysql[resource]
    close(mysql["tempfile_command"])
    call = sprintf("%s %s %s %s %s %s > %s",
            mysql["path"], mysql["user"], mysql["pass"], mysql["host"],
                        mysql["args"], mysql["database"],
            mysql[resource])
    print query | call
    close(call)
    if (getline input < mysql[resource]) {
        for (i = split(input, key, "\t"); i > 0; i--)
            mysql[resource, i] = key[i]
    }
    return resource
}
function mysql_fetch_assoc (resource,row  ,input,i,fields) {
    fields = 0
    if (getline input < mysql[resource]) {
        fields = mysql_split(row, input)
        for (i = 1; i <= fields; i++)
            row[mysql[resource, i]] = row[i]
    }
    return fields
}
function mysql_split (row, input,   r,i) {
     r = split(input, row, "\t")
     for (i = 1; i <= r; i++) {
         row[i] = mysql_dequote(row[i])
     }
     return r
}
function mysql_fetch_row (resource,row  ,input,r,i) {
    if (getline input < mysql[resource]) {
        return mysql_split(row, input)
    }
    return 0
}
function mysql_index (resource, id) {
    return mysql[resource, id]
}
function mysql_finish (resource, i) {
    close(mysql[resource])
    system(sprintf("rm %s", mysql[resource]))
    delete mysql[resource]
    i = 1
    while (mysql[resource,i])
        delete mysql[resource, i++]
}
function mysql_cleanup (  i,j) {
    for (i = 1; i < mysql["resource_id"]; i++)
        if (mysql[i]) {
            close(mysql[i])
            system(sprintf("rm %s", mysql[i]))
            delete mysql[i]
            j = 1
            while (mysql[i, j])
                delete mysql[i, j++]
        }
}

Support Utils

Scan a string for mysql escaped tokens and replace them with the appropriate character. This is a fairly slow operation for large strings but it's necessary.

function mysql_dequote (string, result,i,l,c) {
    result = ""
    l = length(string)
    for (i = 1; i <= l; i++) {
        c = substr(string, i, 1)
        if (c == "\\") {
            # This simply shouldn't happen...
            ## if ((i + 1) == l) continue;
            c = substr(string, ++i, 1)
            result = result __mysql_dequote[c]
        }
        else {
            result = result c
        }
    }
    return result
}
function mysql_quote (string,   result) {
    gsub(/\\/, "\\\\", string)
    gsub(/'/, "\\'", string)
    return "'" string "'"
}

Copyright

"THE BEER-WARE LICENSE" (Revision 43), borrowed from FreeBSD's jail.c: Scott S. McCoy wrote this file. As long as you retain this notice you can do whatever you want with this stuff. If we meet some day, and you think this stuff is worth it, you can buy me a beer in return.

Author

Scott S. McCoy


categories: SysAdmin,Oct,2009,BrianJ

SysAdmins: Awk is Your Friend

Brian Jones writes at linux.com:

The nice thing about humans is that they're at least somewhat predictable. Given the choice between having data randomly strewn about, and having it in some predictable pattern, humans will generally choose predictable patterns (Microsoft filesystem management issues notwithstanding). These patterns are what make awk, a pattern-matching programming language, a wonderful tool for systems administrators, database administrators, and even command-line junkies who use their box mainly for pleasure. The notion of being able to write a one-line command to do almost anything draws ever closer with awk in your tool belt. For most things administrators use awk for, it's an extremely simple language. As you get into writing more advanced awk scripts, at some point it becomes a bit cumbersome, and you realize that Perl is also your friend. But for now, let's focus on how awk can get you the most bang for your keyboard strokes, shall we?

The first thing you should know is that awk is actually a rather powerful language. Entire books have been written about its use. If you're so inclined, you can write extremely complex 1000-line scripts using awk. However, as a systems administrator (the intended audience for this article), 99% of your use of awk will consist of relatively short scripts, and one-off one-liners typed right on the command line. Here's an example of a common use of awk:

[jonesy@newhotness jonesy]$ cat access_log | 
     awk '{print $1}' | sort | uniq -c | sort -rn

The above one-liner uses awk to slim down the amount of data coming from the web server's access log. The access log is space-delimited, and I only want to see the first field (hence "print $1"). Once I have that data, I want to sort it, then I have "uniq -c" provide a count of each occurrence for each unique value, and then I produce a reverse sort based on the numeric count provided by "uniq". The result has the number of hits in the left column, and the host in the right column, and the most frequent visitors are at the top of the list. Give it a shot! Even if you're hosted by an ISP, you should be able to access this log.

Awk is perfect for ripping data into smaller chunks, to make it more bite-size for other applications or manipulation. To use it on the command line on files that are not space-delimited, you can use the "-F" flag, and indicate a delimiter. This is useful for tearing apart /etc/passwd and /etc/shadow files. For example:

[jonesy@tux jonesy]$ cat /etc/passwd | awk -F: '{print $5}' | awk -F, '{print NF}'

I actually used something kinda similar to that during a NIS to LDAP migration to see if the gecos field ($5 in /etc/passwd) had consistent enough data to be useful. One of the tests is to see how consistent the number of datapoints held in the gecos field is from record to record. To figure out the number of fields in each record's gecos field, I tell awk to use ":" as the delimiter, and, based on that, print the fifth field. I then pipe that to another awk one-liner, which uses an awk built-in variable, "NF" and a different delimiter (gecos is generally comma-delimited, if it's even used for useful data).

Awk in Scripts

When one-liners just aren't enough for you, you can store a whole bunch of awk one-liners in a file, and call awk with "-f script" to tell it which file to read its commands from. Additionally, since awk needs to act on some data, you should also tag on something to take care of feeding awk the data it so desperately needs. For example, if I have a script called "getuname", which looks like this:

BEGIN { FS=":" }
      {print $1}

I can now call that script, feeding it anything that I know ahead of time has the user name as the first field in a given record. So I can say "awk -f getuname < /etc/passwd", or "ypcat passwd | awk -f getuname". There are two rather important things I did in this script that will save you some headaches. First, notice the "BEGIN" statement. This statement exists to give you some space to do some tasks before awk starts reading any data. In this example, I want awk to know before it processes any data, that it should use a colon as its field separator. Sure, I could've called awk differently to get around this, ie "awk -F: -f getuname < /etc/passwd", but this way is shorter, and that's the point! It should also be noted that, if you have the need, you can also have an "END" section to your script, which will perform any actions, once, after the last data record has been processed.

On the second line, I've just called a simple awk "action" statement, just like on the command line, with one important exception: I didn't use single quotes around it. If I had, the shell would've tried to interpret this part of the script and choked. I know, because it happened while I was testing this script. Bad admin!

Built-in Goodness

Awk has some built-in functions, like most scripting languages, which make life a bit easier. It also has some built-in variables that awk keeps track of for you -- and you get their values for free, just for asking, which is nice. The most useful variable I've had the pleasure to use as an admin is the "NF" variable, which will tell you, based on the field separator given (space by default), how many fields are in the current record. Conversely, the most useful function I've used as an awk scripter is the "split" function, which can break a single field into another array of separate fields. First, here's a quick example of NF in action:

cat /etc/passwd | awk -F: '{print NF}'

This is the lazy man's way to get the users' shells from the /etc/passwd file without having to remember how many fields are in the file. But wait! This doesn't print the last field in the record! It prints the number of fields in the record! Simple enough -- add a "$" to the front of "NF", and you'll get what you're looking for. Pipe the output to a couple of "sort" and "uniq" commands like we did earlier with the web log, and you'll get a snapshot of what the most commonly used shells are.
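
Putting that together (my composite of the two ideas above):

cat /etc/passwd | awk -F: '{print $NF}' | sort | uniq -c | sort -rn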

Now let's have a look at the split function. Let's say you use your gecos field to store a bunch of datapoints, and the datapoints within the gecos field are comma-delimited. This is not nearly so contrived as it might sound -- this happens in more than two environments I've done work in. Here's what it might look like:

jonesy:x:12000:13:Brian K. Jones,LUSER,101B,NONE:/home/jonesy:/bin/bash

Now let's say your PHB comes along and says he's tired of referring to me as "jonesy" and wants to know my real name. You can use awk's "split" function to help you here, and the code for doing so is fairly short:

BEGIN { FS=":" }
      {
        gfields = split ( $5, gecos, ",")
        chunkname = split ( gecos[1], fullname, " " )
        print fullname[chunkname], fullname[1]
      }

Let's translate that into English, shall we? Of course, you now know what the BEGIN statement does here -- nothing new. We'll start by looking at the "gfields" line, where I use "split" to break up the 5th field of the record (the gecos field), using the comma as a delimiter, and storing all of the resulting fields in an array called "gecos". This can be counterintuitive, as you may be tempted to think that the resulting array is called "gfields". However, the "gfields" variable actually holds the number of fields produced by the split, which doubles as the index of the last element in the "gecos" array. You get a look at how this works in the following two lines. "chunkname" holds the number of fields in the "fullname" array. The "fullname" array is created by splitting the first field of the "gecos" array (in this case, the field holding my full name), using a space as the delimiter. On the next line, I reference "fullname[chunkname]", which will print the last name of the person, even if (as in my case) they have a middle name or initial. Then I print the very first field in the fullname array, so the output generated by this script acting on my passwd record would be "Jones Brian".

In conclusion

Whew! That was a mouthful. Awk has so many cool little hacks and built-in features that there has been more than one book published just on Awk. Undoubtedly, I'll utilize some of these features in future articles that involve putting together sysadmin solutions using various scripts as duct tape.
