Awk.Info

"Cause a little auk awk
goes a long way."

WHAT'S NEW?

Mar 01: Michael Sanders demos an X-windows GUI for AWK.

Mar 01: Awk100#24: A. Lahm and E. de Rinaldis' patent search, in AWK

Feb 28: Tim Menzies asks this community to write an AWK cookbook.

Feb 28: Arnold Robbins announces a new debugger for GAWK.

Feb 28: Awk100#23: Premysl Janouch offers an IRC bot, in AWK

Feb 28: Updated: the AWK FAQ

Feb 28: Tim Menzies offers a tiny content management system, in Awk.

Jan 31: Comment system added to awk.info. For example, see the discussion at the bottom of ?keys2awk

Jan 31: Martin Cohen shows that Gawk can handle massively long strings (300 million characters).

Jan 31: The AWK FAQ is being updated. For comments/corrections/extensions, please mail tim@menzies.us

Jan 31: Martin Cohen finds Awk on the Android platform.

Jan 31: Aleksey Cheusov released a new version of runawk.

Jan 31: Hirofumi Saito contributes a candidate Awk mascot.

Jan 31: Michael Sanders shows how to quickly build an AWK GUI for windows.

Jan 31: Hyung-Hwan Chung offers QSE, an embeddable Awk Interpreter.

[More ...]


categories: Sed,Tips,Apr,2009,Admin

Sed-clones (in Awk)

These pages focus on Sed-like stream editors, written in Awk.


categories: Sed,Tips,Oct,2009,EdM

Sed in Awk

Writing in comp.lang.awk, Ed Morton ports numerous complex sed expressions to Awk.

A comp.lang.awk author asks the question:

    I have a file that has a series of lists

    (qqq)
    aaa 111
    bbb 222
    

    and I want to make it look like

    aaa 111 (qqq)
    bbb 222 (qqq)
    

IMHO the clearest sed solution given was:

sed -e '
   /^([^)]*)/{
      h; # remember the (qqq) part
      d
   }

   / [1-9][0-9]*$/{
      G; # strap the (qqq) part to the list
      s/\n/ /
   }
' yourfile

while the awk one was:

awk '/^\(/{ h=$0;next } { print $0,h }' file

As I've said repeatedly, sed is an excellent tool for simple substitutions on a single line. For anything else you should use awk, perl, etc.

Having said that, let's take a look at the awk equivalents for the posted sed examples below that are not simple substitutions on a single line so people can judge for themselves (i.e. quietly - this is not a contest and not a religious war!) which code is clearer, more consistent, and more obvious. When reading this, just imagine yourself having to figure out what the given script does in order to debug or enhance it or write your own similar one later.

Note that in awk as in shell there are many ways to solve a problem so I'm trying to stick to the solutions that I think would be the most useful to a beginner since that's who'd be reading an examples page like this, and without using any GNU awk extensions. Also note I didn't test any of this but it's all pretty basic stuff so it should mostly be right.

For those who know absolutely nothing about awk, I think all you need to know to understand the scripts below is that, like sed, it loops through input files evaluating conditions against the current input record (a line by default) and executing the actions you specify (printing the current input record if none specified) if those conditions are true, and it has the following pre-defined symbols:

NR = Number of Records read so far
NF = Number of Fields in current record
FS = the Field Separator
RS = the Record Separator
BEGIN = a pattern that's only true before processing any input
END = a pattern that's only true after processing all input.

Oh, and setting RS to the NULL string (-v RS='') tells awk to read paragraphs instead of lines as individual records, and setting FS to the NULL string (-v FS='') tells awk to treat each individual character as a field.
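For instance, here is a quick demonstration of both settings (a small sketch; output shown inline, and note that the FS='' behavior is an extension supported by gawk and most modern awks):

$ printf 'a b\nc d\n\ne f\n' | awk -v RS='' '{print "record " NR ": " $0}'
record 1: a b
c d
record 2: e f
$ printf 'hello\n' | awk -v FS='' '{print NF}'
5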

For more info on awk, see http://www.awk.info.

Introductory Examples

Double space a file:

    Sed:

    sed G
    

    Awk

    awk '{print $0 "\n"}'
    

Double space a file which already has blank lines in it. Output file should contain no more than one blank line between lines of text.

    Sed:

    sed '/^$/d;G'
    

    Awk:

    awk 'NF{print $0 "\n"}'
    

Triple space a file

    Sed:

    sed 'G;G'
    

    Awk:

    awk '{print $0 "\n\n"}'
    

Undo double-spacing (assumes even-numbered lines are always blank):

    Sed:

    sed 'n;d'
    

    Awk:

    awk 'NF'
    

Insert a blank line above every line which matches "regex":

    Sed:

    sed '/regex/{x;p;x;}'
    

    Awk:

    awk '{print (/regex/ ? "\n" : "") $0}'
    

Insert a blank line below every line which matches "regex":

    Sed:

    sed '/regex/G'
    

    Awk:

    awk '{print $0 (/regex/ ? "\n" : "")}'
    

Insert a blank line above and below every line which matches "regex":

    Sed:

    sed '/regex/{x;p;x;G;}'
    

    Awk:

    awk '{print (/regex/ ? "\n" $0 "\n" : $0)}'
    

Numbering

Number each line of a file (simple left alignment). Using a tab (see note on '\t' at end of file) instead of space will preserve margins:

    Sed:

    sed = filename | sed 'N;s/\n/\t/'
    

    Awk:

    awk '{print NR "\t" $0}'
    

Number each line of a file (number on left, right-aligned):

    Sed:

    sed = filename | sed 'N; s/^/     /; s/ *\(.\{6,\}\)\n/\1  /'
    

    Awk:

    awk '{printf "%6s  %s\n",NR,$0}'
    

Number each line of file, but only print numbers if line is not blank:

    Sed:

    sed '/./=' filename | sed '/./N; s/\n/ /'
    

    Awk:

    awk 'NF{print NR "\t" $0}'
    

Count lines (emulates "wc -l")

    Sed:

    sed -n '$='
    

    Awk:

    awk 'END{print NR}'
    

Text Conversion and Substitution

Align all text flush right on a 79-column width:

    Sed:

    sed -e :a -e 's/^.\{1,78\}$/ &/;ta'  # set at 78 plus 1 space
    

    Awk:

    awk '{printf "%79s\n",$0}'
    

Center all text in the middle of 79-column width. In method 1, spaces at the beginning of the line are significant, and trailing spaces are appended at the end of the line. In method 2, spaces at the beginning of the line are discarded in centering the line, and no trailing spaces appear at the end of lines.

    Sed:

    sed  -e :a -e 's/^.\{1,77\}$/ & /;ta'                     # method 1
    sed  -e :a -e 's/^.\{1,77\}$/ &/;ta' -e 's/\( *\)\1/\1/'  # method 2
    

    Awk:

    awk '{printf "%"int((79+length)/2)"s\n",$0}'
    

Reverse order of lines (emulates "tac"). A bug/feature in sed v1.5 causes blank lines to be deleted

    Sed:

    sed '1!G;h;$!d'               # method 1
    sed -n '1!G;h;$p'             # method 2
    

    Awk:

    awk '{a[NR]=$0} END{for (i=NR;i>=1;i--) print a[i]}'
    

Reverse each character on the line (emulates "rev")

    Sed:

    sed '/\n/!G;s/\(.\)\(.*\n\)/&\2\1/;//D;s/.//'
    

    Awk:

    awk -v FS='' '{for (i=NF;i>=1;i--) printf "%s",$i; print ""}'
    

Join pairs of lines side-by-side (like "paste")

    Sed:

    sed '$!N;s/\n/ /'
    

    Awk:

    awk '{printf "%s%s",$0,(NR%2 ? " " : "\n")}'
    

If a line ends with a backslash, append the next line to it

    Sed:

    sed -e :a -e '/\\$/N; s/\\\n//; ta'
    

    Awk:

    awk '{printf "%s",(sub(/\\$/,"") ? $0 : $0 "\n")}'
    

If a line begins with an equal sign, append it to the previous line and replace the "=" with a single space

    Sed:

    sed -e :a -e '$!N;s/\n=/ /;ta' -e 'P;D'
    

    Awk:

    awk '{printf "%s%s",(sub(/^=/," ") ? "" : "\n"),$0} END{print ""}'
    

Add a blank line every 5 lines (after lines 5, 10, 15, 20, etc.)

    Sed:

    gsed '0~5G'                  # GNU sed only
    sed 'n;n;n;n;G;'             # other seds
    

    Awk:

    awk '{print $0} !(NR%5){print ""}'
    

Selective Printing of Certain Lines

Print first 10 lines of file (emulates behavior of "head")

    Sed:

    sed 10q
    

    Awk:

    awk '{print $0} NR==10{exit}'
    

Print first line of file (emulates "head -1")

    Sed:

    sed q
    

    Awk:

    awk 'NR==1{print $0; exit}'
    

Print the last 10 lines of a file (emulates "tail")

    Sed:

    sed -e :a -e '$q;N;11,$D;ba'
    

    Awk:

    awk '{a[NR]=$0} END{for (i=NR-9;i<=NR;i++) print a[i]}'
    

Print the last 2 lines of a file (emulates "tail -2")

    Sed:

    sed '$!N;$!D'
    

    Awk:

    awk '{a[NR]=$0} END{for (i=NR-1;i<=NR;i++) print a[i]}'
    

Print the last line of a file (emulates "tail -1")

    Sed:

    sed '$!d'                    # method 1
    sed -n '$p'                  # method 2
    

    Awk:

    awk 'END{print $0}'
    

Print the next-to-the-last line of a file

    Sed:

    sed -e '$!{h;d;}' -e x  # for 1-line files, print blank line
    sed -e '1{$q;}' -e '$!{h;d;}' -e x  # for 1-line files, print the line
    sed -e '1{$d;}' -e '$!{h;d;}' -e x  # for 1-line files, print nothing
    

    Awk:

    awk '{prev=curr; curr=$0} END{print prev}'
    

Print only lines which match regular expression (emulates "grep")

    Sed:

    sed -n '/regexp/p'           # method 1
    sed '/regexp/!d'             # method 2
    

    Awk:

    awk '/regexp/'
    

Print only lines which do NOT match regexp (emulates "grep -v")

    Sed:

    sed -n '/regexp/!p'          # method 1, corresponds to above
    sed '/regexp/d'              # method 2, simpler syntax
    

    Awk:

    awk '!/regexp/'
    

Print the line immediately before a regexp, but not the line containing the regexp

    Sed:

    sed -n '/regexp/{g;1!p;};h'
    

    Awk:

    awk '/regexp/{print prev} {prev=$0}'
    

Print the line immediately after a regexp, but not the line containing the regexp

    Sed:

    sed -n '/regexp/{n;p;}'
    

    Awk:

    awk 'found{print $0} {found=(/regexp/ ? 1 : 0)}'
    

Print 1 line of context before and after regexp, with line number indicating where the regexp occurred (similar to "grep -A1 -B1")

    Sed:

    sed -n -e '/regexp/{=;x;1!p;g;$!N;p;D;}' -e h
    

    Awk:

    awk 'found    {print preLine "\n" hitLine "\n" $0;   found=0}
          /regexp/ {preLine=prev;   hitLine=NR " " $0;    found=1}
          {prev=$0}'
    

Grep for AAA and BBB and CCC (in any order)

    Sed:

    sed '/AAA/!d; /BBB/!d; /CCC/!d'
    

    Awk:

    awk '/AAA/&&/BBB/&&/CCC/'
    

Grep for AAA and BBB and CCC (in that order)

    Sed:

    sed '/AAA.*BBB.*CCC/!d'
    

    Awk:

    awk '/AAA.*BBB.*CCC/'
    

Grep for AAA or BBB or CCC (emulates "egrep")

    Sed:

    sed -e '/AAA/b' -e '/BBB/b' -e '/CCC/b' -e d    # most seds
    gsed '/AAA\|BBB\|CCC/!d'                        # GNU sed only
    

    Awk:

    awk '/AAA|BBB|CCC/'
    

Print paragraph if it contains AAA (blank lines separate paragraphs). Sed v1.5 users must insert a 'G;' after 'x;' in the next 3 scripts below

    Sed:

    sed -e '/./{H;$!d;}' -e 'x;/AAA/!d;'
    

    Awk:

    awk -v RS='' '/AAA/'
    

Print paragraph if it contains AAA and BBB and CCC (in any order)

    Sed:

    sed -e '/./{H;$!d;}' -e 'x;/AAA/!d;/BBB/!d;/CCC/!d'
    

    Awk:

    awk -v RS='' '/AAA/&&/BBB/&&/CCC/'
    

Print paragraph if it contains AAA or BBB or CCC

    Sed:

    sed -e '/./{H;$!d;}' -e 'x;/AAA/b' -e '/BBB/b' -e '/CCC/b' -e d
    gsed '/./{H;$!d;};x;/AAA\|BBB\|CCC/b;d'         # GNU sed only
    

    Awk:

    awk -v RS='' '/AAA|BBB|CCC/'
    

Print only lines of 65 characters or longer

    Sed:

    sed -n '/^.\{65\}/p'
    

    Awk:

    awk -v FS='' 'NF>=65'
    

Print only lines of less than 65 characters

    Sed:

    sed -n '/^.\{65\}/!p'        # method 1, corresponds to above
    sed '/^.\{65\}/d'            # method 2, simpler syntax
    

    Awk:

    awk -v FS='' 'NF<65'
    

Print section of file from regular expression to end of file

    Sed:

    sed -n '/regexp/,$p'
    

    Awk:

    awk '/regexp/{found=1} found'
    

Print section of file based on line numbers (lines 8-12, inclusive)

    Sed:

    sed -n '8,12p'               # method 1
    sed '8,12!d'                 # method 2
    

    Awk:

    awk 'NR>=8 && NR<=12'
    

Print line number 52

    Sed:

    sed -n '52p'                 # method 1
    sed '52!d'                   # method 2
    sed '52q;d'                  # method 3, efficient on large files
    

    Awk:

    awk 'NR==52{print $0; exit}'
    

Beginning at line 3, print every 7th line

    Sed:

    gsed -n '3~7p'               # GNU sed only
    sed -n '3,${p;n;n;n;n;n;n;}' # other seds
    

    Awk:

    awk '!((NR-3)%7)'
    

Print section of file between two regular expressions (inclusive)

    Sed:

    sed -n '/Iowa/,/Montana/p'             # case sensitive
    

    Awk:

    awk '/Iowa/,/Montana/'
    

Print all lines of FileID up to the 1st line containing 'string'

    Sed:

    sed '/string/q' FileID
    

    Awk:

    awk '{print $0} /string/{exit}'
    

Print all lines of FileID from the 1st line containing 'string' until EOF

    Sed:

    sed '/string/,$!d' FileID
    

    Awk:

    awk '/string/{found=1} found'
    

Print all lines of FileID from the 1st line containing 'string1' until the 1st line containing 'string2' [boundaries inclusive]

    Sed:

    sed '/string1/,$!d;/string2/q' FileID
    

    Awk:

    awk '/string1/{found=1} found{print $0} /string2/{exit}'
    

Selective Deletion of Certain Lines

Print all of file EXCEPT section between 2 regular expressions

    Sed:

    sed '/Iowa/,/Montana/d'
    

    Awk:

    awk '/Iowa/,/Montana/{next} {print $0}' file
    

Delete duplicate, consecutive lines from a file (emulates "uniq"). First line in a set of duplicate lines is kept, rest are deleted.

    Sed:

    sed '$!N; /^\(.*\)\n\1$/!P; D'
    

    Awk:

    awk '$0!=prev{print $0} {prev=$0}'
    

Delete duplicate, nonconsecutive lines from a file. Beware not to overflow the buffer size of the hold space, or else use GNU sed.

    Sed:

    sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'
    

    Awk:

    awk '!a[$0]++'
    

Delete all lines except duplicate lines (emulates "uniq -d").

    Sed:

    sed '$!N; s/^\(.*\)\n\1$/\1/; t; D'
    

    Awk:

    awk '$0==prev{print $0} {prev=$0}'      # works only on consecutive
    awk 'a[$0]++'                           # works on non-consecutive
    

Delete the first 10 lines of a file

    Sed:

    sed '1,10d'
    

    Awk:

    awk 'NR>10'
    

Delete the last line of a file

    Sed:

    sed '$d'
    

    Awk:

    awk 'NR>1{print prev} {prev=$0}'
    

Delete the last 2 lines of a file

    Sed:

    sed 'N;$!P;$!D;$d'
    

    Awk:

    awk 'NR>2{print prev[2]} {prev[2]=prev[1]; prev[1]=$0}'    # method 1
    awk '{a[NR]=$0} END{for (i=1;i<=NR-2;i++) print a[i]}'     # method 2
    awk -v num=2 'NR>num{print prev[num]}
        {for (i=num;i>1;i--) prev[i]=prev[i-1]; prev[1]=$0}'    # method 3
    

Delete the last 10 lines of a file

    Sed:

    sed -e :a -e '$d;N;2,10ba' -e 'P;D'   # method 1
    sed -n -e :a -e '1,10!{P;N;D;};N;ba'  # method 2
    

    Awk:

    awk -v num=10 '...same as deleting last 2 method 3 above...'
    

Delete every 8th line

    Sed:

    gsed '0~8d'                           # GNU sed only
    sed 'n;n;n;n;n;n;n;d;'                # other seds
    

    Awk:

    awk 'NR%8'
    

Delete lines matching pattern

    Sed:

    sed '/pattern/d'
    

    Awk:

    awk '!/pattern/'
    

Delete ALL blank lines from a file (same as "grep '.' ")

    Sed:

    sed '/^$/d'                           # method 1
    sed '/./!d'                           # method 2
    

    Awk:

    awk '!/^$/'                             # method 1
    awk '/./'                               # method 2
    

Delete all CONSECUTIVE blank lines from file except the first; also deletes all blank lines from top and end of file (emulates "cat -s")

    Sed:

    sed '/./,/^$/!d'
    

    Awk:

    awk '/./,/^$/'
    

Delete all leading blank lines at top of file

    Sed:

    sed '/./,$!d'
    

    Awk:

    awk 'NF{found=1} found'
    

Delete all trailing blank lines at end of file

    Sed:

    sed -e :a -e '/^\n*$/{$d;N;ba' -e '}'  # works on all seds
    sed -e :a -e '/^\n*$/N;/\n$/ba'        # ditto, except for gsed 3.02.*
    

    Awk:

    awk '{a[NR]=$0} NF{nbNr=NR} END{for (i=1;i<=nbNr;i++) print a[i]}'
    

Delete the last line of each paragraph

    Sed:

    sed -n '/^$/{p;h;};/./{x;/./p;}'
    

    Awk:

    awk -v FS='\n' -v RS='' '{for (i=1;i<=NF;i++) print $i; print ""}'
    

Special Applications

Get Usenet/e-mail message header

    Sed:

    sed '/^$/q'        # deletes everything after first blank line
    

    Awk:

    awk '{print $0} /^$/{exit}'
    

Get Usenet/e-mail message body

    Sed:

    sed '1,/^$/d'              # deletes everything up to first blank line
    

    Awk:

    awk 'found{print $0} /^$/{found=1}'
    

Get Subject header, but remove initial "Subject: " portion

    Sed:

    sed '/^Subject: */!d; s///;q'
    

    Awk:

    awk 'sub(/^Subject: */,"")'
    

Parse out the address proper. Pulls out the e-mail address by itself from the 1-line return address header (see preceding script)

    Sed:

    sed 's/ *(.*)//; s/>.*//; s/.*[:<] *//'
    

    Awk:

    awk '{sub(/ *\(.*\)/,""); sub(/>.*/,""); sub(/.*[:<] */,""); print $0}'
    

Add a leading angle bracket and space to each line (quote a message)

    Sed:

    sed 's/^/> /'
    

    Awk:

    awk '{print "> " $0}'
    

Delete leading angle bracket & space from each line (unquote a message)

    Sed:

    sed 's/^> //'
    

    Awk:

    awk '{sub(/^> /,""); print $0}'
    

categories: Tips,Sept,2009,EdM

The Secret WHINY_USERS Flag

(Editor's note: On Nov 30 '09, Hermann Peifer found and fixed a bug in an older version of the test code at the end of this file.)

Writing in comp.lang.awk, Ed Morton reveals the secret WHINY_USERS flag.

"Nag" asked:

    Hi,

    I am creating a file like...

    awk '{
     ....
     ...
     ..
     printf"%4s %4s\n",$1,$2 > "file1"
    
    }' input
    

    How can I sort file1 within awk code?

Ed Morton writes:

    There's also the undocumented WHINY_USERS flag for GNU awk that allows for sorted processing of arrays:
    $ cat file
    2
    1
    4
    3
    $ gawk '{a[$0]}END{for (i in a) print i}' file
    4
    1
    2
    3
    $ WHINY_USERS=1 gawk '{a[$0]}END{for (i in a) print i}' file
    1
    2
    3
    4
    

Execution Cost

Your editor coded up the following test for the runtime costs of WHINY_USERS. The following code is called twice (once with, and once without setting WHINY_USERS):

runWhin() {
WHINY_USERS=1 gawk -v M=1000000 --source '
        BEGIN { 
                M = M ? M : 50
                N = M
                print N
                while(N-- > 0) {
                        key = rand()" "rand()" "rand()" "rand()" "rand() 
                        A[key] = M - N
                }
                for(i in A)
                        N++
        }' 
}
runNoWhin() {
gawk -v M=1000000 --source '
        BEGIN { 
                M = M ? M : 50
                N = M
                print N
                while(N-- > 0) {
                        key = rand()" "rand()" "rand()" "rand()" "rand() 
                        A[key] = M - N
                }
                for(i in A)
                        N++
        }' 
}
time runWhin
time runNoWhin

And the results? Sorting added about 15% to the runtime:

% bash whiny.sh
1000000

real    0m18.897s
user    0m15.826s
sys     0m2.445s
1000000

real    0m16.345s
user    0m13.469s
sys     0m2.435s

categories: Tips,Aug,2009,EdM

Print Ranges

In comp.lang.awk, Ed Morton offers advice on how to print ranges of Awk records.

Problem

Suppose you are looking to extract a section of code from a text file based on two regular expressions.

Say the file looks like this:

newspaper
magazing
hiking
hiking trails in the city
muir hike
black mountain hike
summer meados hike
end hiking
phone
cell
skype

and you want to extract

hiking trails in the city
muir hike
black mountain hike
summer meados hike
The following range expression won't work right:

awk '/hiking/,/end hiking/{print}' myfile

since that returns some spurious information.

What to do?

Solution

Personally, I rarely if ever use

/start/,/end/

as I'm never immediately sure what it'd output for input such as:

start
a
start
b
end
c
end

and whenever you want to do something just slightly different with the selection you need to change the script a lot.

Not being sure of the semantics is probably a catch-22, since I rarely use it; but the benefit of using that syntax vs. spelling it out:

/start/{f=1} f; /end/{f=0}

just doesn't really seem worthwhile, and then if you want to do something extra like test for some other condition over the block this:

/start/{f=1} f&&cond; /end/{f=0}

is about as brief as:

/start/,/end/{if (cond) print}

and if you want to exclude the start (or end) of the block you're printing then you just move the "f" test to the obvious place and you don't need to duplicate the condition:

f; /start/{f=1} /end/{f=0}
vs
/start/,/end/{if (!/start/) print}

and note the different semantics now. This:

f; /start/{f=1} /end/{f=0}

will exclude the line at the start of the block you're printing, whereas this:

/start/,/end/{if (!/start/) print}

will exclude that line plus every other occurrence of "start" within the block which is probably not what you'd want. To simply exclude only the first line of the block but stay with the /start/,/end/ approach you'd need to do something like:

/start/,/end/{if (nr++) print; if (/end/) nr=0}

(which is getting fairly obscure.)
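
Coming back to the original hiking example, the flag approach might look like this (a sketch; the markers are matched as whole lines so that "hiking trails in the city" cannot restart the range):

awk '$0=="hiking"{f=1; next} $0=="end hiking"{f=0} f' myfile

which prints:

hiking trails in the city
muir hike
black mountain hike
summer meados hike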


categories: Databases,Tips,Jul,2009,VictorA

Using Awk for Databases


Download

Download all the following example code and support data files from LAWKER

General Information

Introduction

This page contains a set of sample Awk scripts to manage different kinds of databases. In all cases, we'll use a text editor such as edit.exe to create and edit the data files, and Awk scripts will be used to query and manipulate the data.

OK, so it's not a fancy GUI-based system, but this method is flexible and the scripts execute relatively quickly. Also, your data won't be locked in some company's proprietary binary file format. There is also the benefit of portability: If your PC can run DOS, you can also run these scripts on your PC. Awk is also available on Linux and on other operating systems.

This page assumes that you are already familiar with database terms like 'record', 'field', and 'search keyword'.

Introduction to Awk

Awk is an interpreted programming language that is designed for managing and converting data files and generating reports from the data.

Awk will automatically read an input file and parse it into records and fields, one record at a time. A typical Awk script will then manipulate the fields using predefined variables like $1 (the first field), $2 (the second field), etc.

To use Awk, you create an Awk script, and then run it with the Awk program (gawk.exe in this case). Many Awk scripts are small, and the language lends itself to writing "one-time use" programs.

Using the Scripts

All the files on this page are available in a ZIP archive (see the Download section above). Feel free to reuse and customize them.

You will need the GNU Awk program gawk.exe to be installed on your QuickPAD Pro. See the programming page for instructions on installing GNU Awk.

Here is the general format of a gawk command line:

	gawk -f SCRIPT DATAFILE
where SCRIPT is the name of the file that contains the Awk script and DATAFILE is the name of the text file that contains the input data.

That command line will not modify the input file and all the output will be directed to the screen.

If a script creates a new data file (for example, a sort script), the command line will be:

	gawk -f SCRIPT DATAFILE > NEWFILE
where NEWFILE is the name of the new data file that will be created.

If you use a particular script often and get tired of typing in a long command line, you can create a batch file to execute the long command line for you.

Note that we are currently limited to 64K files for our data. We can work around this restriction by using the chop utility program that is described on the software page.

Index Card Databases

Card File

In this section we demonstrate some Awk scripts to manage a simple "index card" file. This type of database can be used for any type of simple text list, like lists of books, music CDs, recipes, quotations, etc.

Our information will be stored into 'cards'. Each card will have a 'title' and a 'body':

	Title of Card
	-------------------------
	Free-formatted field of 
	information about this 
	particular card, but
	without any blank lines.

Let's take this information and store it in a text file. To keep things simple, the cards within the file are separated with a blank line, and the first line of each card will be the title.

For example, let's create a sample card file called 'cards.txt' and use it to store a list of our goals.

	Write a book and become famous
	This is a long range
	goal. I need a good book
	idea first. And writing
	skills.

	Solve the problems of society
	This might take
	a little longer
	than expected.

	Take out the garbage
	It's stinking up
	the garage.

Let's begin with an Awk script to print out the titles of all the cards in the file. Here is the script called 'titles':

	# titles - Print the titles of all the cards in the
	# index card file.

	BEGIN { RS = ""; FS = "\n" }
	        { print $1 }

Here is a sample run:

	[B:\] gawk -f titles cards.txt
	Write a book and become famous
	Solve the problems of society
	Take out the garbage
	[B:\]

Another useful script is one that can be used for searching the data file, ignoring uppercase and lowercase distinctions. The following script called 'search' will display the cards that contain the keyword 'write'.

	# search - Print the index card that contains a string

	BEGIN   { RS = ""; FS = "\n"; IGNORECASE=1 }

	/write/ { print $0, "\n" }

Here is a sample run:

	[B:\] gawk -f search cards.txt
	Write a book and become famous
	This is a long range
	goal. I need a good book
	idea first. And writing
	skills.

	[B:\]

To search for other strings, edit the 'search' script and replace 'write' with another search keyword.

Sorting the cards based on the titles would also be a useful operation. Here is a script called 'sort' which reads the entire data file into an array and then uses the QuickSort algorithm to sort it:

	# sort - Sort index card file by the card titles

	BEGIN { RS = ""; FS = "\n" }

	      { A[NR] = $0 } 

	END   {
		qsort(A, 1, NR)
		for (i = 1; i <= NR; i++) {
			print A[i]
			if (i == NR) break
			print ""
		}
	      }

	# QuickSort
	# Source: "The AWK Programming Language", by Aho, et.al., p.161
	function qsort(A, left, right,   i, last) {
		if (left >= right)
			return
		swap(A, left, left+int((right-left+1)*rand()))
		last = left
		for (i = left+1; i <= right; i++)
			if (A[i] < A[left])
				swap(A, ++last, i)
		swap(A, left, last)
		qsort(A, left, last-1)
		qsort(A, last+1, right)
	}
	function swap(A, i, j,   t) {
		t = A[i]; A[i] = A[j]; A[j] = t
	}

And here is a sample run:

	[B:\] awk -f sort cards.txt > new.txt
	[B:\] rename cards.txt cards.bak
	[B:\] rename new.txt cards.txt
	[B:\] type cards.txt
	Solve the problems of society
	This might take
	a little longer
	than expected.

	Take out the garbage
	It's stinking up
	the garage.

	Write a book and become famous
	This is a long range
	goal. I need a good book
	idea first. And writing
	skills.
	[B:\]

Note that we renamed our old data file to cards.bak, instead of deleting the file. It's always good to keep backups of old databases.

However, the 'sort' script had some trouble with large files because it reads in all the cards into an array in RAM. In my tests, the largest file I was able to sort was only about 100K.

"Flash Cards" for Memorization

Index cards can also be used for memorization. The title of the card can contain a question and the body of the card contains the answer that you want to memorize.

Let's write a program that randomly chooses a card from our 'cards.txt' file, displays its title, asks the user to press the 'Enter' key, and then displays the body of that card.

First, we need a text file which contains the questions and answers that we want to memorize. Let's name the file 'question.txt'. Note that the answer can contain multiple lines:

	What is your name?
	My name is
	Sir Lancelot
	of Camelot.

	What is your quest?
	To seek the
	Holy Grail.

	What is your favorite color?
	Blue.

Here is the Awk script called 'memorize'. It will read the data file into an array, randomly shuffle the array, and then it will loop through the array and display each question and answer.

	# memorize - randomly display an index card title, ask user to
	# press return, then display the corresponding body of the card

	BEGIN { RS=""; FS="\n" }

	      { A[NR] = $0 } 

	END   {
		RS="\n"; FS=" "
		shuffle(A, NR)
		for (i = 1; i <= NR; i++) {
			print "\nQUESTION: ", substr(A[i], 1, index(A[i], "\n")-1)
			printf "\nPress return for the answer: "
			getline < "-"
			print "\nANSWER: "
			print substr(A[i], index(A[i], "\n")+1)
			if (i == NR) break
			printf "\nPress return to continue, or 'q' to quit: "
			getline < "-"
			if ($1 == "q") break
		}
	      }

	# Shuffle the array
	function shuffle(A, n,   t) {
		srand()
		# Moses/Oakford shuffle algorithm
		for (i = n; i > 1; i--) {
			j = int((i-1) * rand()) + 1
			t = A[j]; A[j] = A[i]; A[i] = t
		}
	}

Here is a sample run. The script will randomly choose cards until it either finishes going through all the cards, or until the user enters a 'q' to quit.

	[B:\] gawk -f memorize question.txt

	QUESTION:  What is your quest?

	Press return for the answer:

	ANSWER:
	To seek the
	Holy Grail.

	Press return to continue, or 'q' to quit:

	QUESTION:  What is your favorite color?

	Press return for the answer:

	ANSWER:
	Blue.

	Press return to continue, or 'q' to quit:

	QUESTION:  What is your name?

	Press return for the answer:

	ANSWER:
	My name is
	Sir Lancelot
	of Camelot.
	[B:\] gawk -f memorize question.txt
	
	QUESTION:  What is your favorite color?
	
	Press return for the answer:

	ANSWER:
	Blue.

	Press return to continue, or 'q' to quit: q
	[B:\] 

Custom Databases

Address Book

The databases above used a simple 'index card' analogy. That data model works fine for simple lists with free form data, but there are also cases where we need to manage records with specialized data fields.

Let's create a data file and some scripts for an 'address book' database. Our data file will be a text file where every line is one record. Within a line of the file, the data will be separated into fields.

When choosing a delimiter for our fields, we need to make sure that it won't appear accidentally within a field itself. For example, an address book has fields like name, company name, address, etc., and in this case, each of those fields can contain spaces within them (e.g. "ACME Mail Order Company"). Therefore, we can't use a space to separate the fields of the line.

Instead, let's use commas to separate the fields, and we'll need a rule that commas cannot appear within a field.
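
Since nothing in the file format enforces that rule, a quick integrity check is handy. Here is a minimal sketch (it assumes the six-field address.txt layout shown below) that flags any record with the wrong number of fields:

	# checkcsv - flag records whose field count is not 6
	BEGIN { FS = "," }
	NF != 6 { print "line " NR ": expected 6 fields, got " NF }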

Here is a sample data file called 'address.txt':

John Robinson,Koren Inc.,978 4th Ave,Boston,MA 01760,617-696-0987
Phyllis Chapman,GVE Corp.,34 Sea Drive,Amesbury,MA 01881,781-879-0900

Here is the script called 'labels' which will print all the data and format it like mailing labels:

	# labels - Format the addresses for printing labels
	# Source: blocklist.awk from "Sed & Awk", by Dale Dougherty, p.148

	BEGIN { FS = "," }

	{
	        print ""        # blank line
	        print $1        # name
	        print $2        # company
	        print $3        # street
	        print $4, $5    # city, state zip
	}

This is the sample run:

	[B:\] gawk -f labels address.txt
	
	John Robinson
	Koren Inc.
	978 4th Ave
	Boston MA 01760
	
	Phyllis Chapman
	GVE Corp.
	34 Sea Drive
	Amesbury MA 01881	
	[B:\] 

It may also be useful to extract just the phone numbers from our data file. Here is the script called 'phones' which will extract only the names and phone numbers from the data file:

	# phones
	# Source: phonelist.awk, from "Sed & Awk", by Dale Dougherty, p.148

	BEGIN { FS="," }

	{ print $1 ", " $6 }

Here is a sample run:

	[B:\] gawk -f phones address.txt
	John Robinson, 617-696-0987
	Phyllis Chapman, 781-879-0900
	[B:\] 

We'll also need a script to search our data file for a name. Here is a script called 'searchad' which will search for the string 'robinson':

	# searchad - Return the record that matches a string

	BEGIN { FS = ","; IGNORECASE=1 }

	/robinson/ {
	        print ""        # blank line
	        print $1        # name
	        print $2        # company
	        print $3        # street
	        print $4, $5    # city, state zip
	}

Here is a sample run:

	[B:\] gawk -f searchad address.txt

	John Robinson
	Koren Inc.
	978 4th Ave
	Boston MA 01760
	[B:\] 

Grading Program

Awk can also be used for mathematical computation of fields. Let's demonstrate this with a data file called 'grades.txt' that contains grades of students.

	Allen Mona 70 77 85 83 70 89
	Baker John 85 92 78 94 88 91
	Jones Andrea 89 90 85 94 90 95
	Smith Jasper 84 88 80 92 84 82
	Turner Dunce 64 80 60 60 61 62
	Wells Ellis 90 98 89 96 96 92

Here is a longer script that will take all the grades, average them equally, and compute the final average and the final grade for each student. At the end, it will compute some statistics about the entire class. Here is the script called 'grades'.

	# grades -- average student grades and determine
	# letter grade as well as class averages
	# Source: "Sed & Awk", by Dale Dougherty, p.192

	# set output field separator to tab.
	BEGIN { OFS = "\t" }

	# action applied to all input lines
	{
		# add up the grades
		total = 0
		for (i = 3; i <= NF; ++i)
			total += $i
		# calculate average
		avg = total / (NF - 2)
		# assign student's average to element of array
		class_avg[NR] = avg
		# determine letter grade
		if (avg >= 90) grade="A"
		else if (avg >= 80) grade="B"
		else if (avg >= 70) grade="C"
		else if (avg >= 60) grade="D"
		else grade="F"
		# increment counter for letter grade array
		++class_grade[grade]
		# print student name, average, and letter grade
		print $1 " " $2, avg, grade
	}

	# print out class statistics
	END  {
		# calculate class average
		for (x = 1; x <= NR; x++)
			class_avg_total += class_avg[x]
		class_average = class_avg_total / NR
		# determine how many above/below average
		for (x = 1; x <= NR; x++)
			if (class_avg[x] >= class_average)
				++above_average
			else
				++below_average
		# print results
		print ""
		print "Class Average: ", class_average
		print "At or Above Average: ", above_average
		print "Below Average: ", below_average
		# print number of students per letter grade
		for (letter_grade in class_grade)
			print letter_grade ":", class_grade[letter_grade]
	}

Here is a sample run:

	[B:\] gawk -f grades grades.txt
	Allen Mona      79      C
	Baker John      88      B
	Jones Andrea    90.5    A
	Smith Jasper    85      B
	Turner Dunce    64.5    D
	Wells Ellis     93.5    A

	Class Average:  83.4167
	At or Above Average:    4
	Below Average:  2
	A:      2
	B:      2
	C:      1
	D:      1
	[B:\]

Another useful script is the following program that computes a histogram of the grades. It is hardcoded to only read the third column ($3), but you can edit it and change it to read any of the columns in the input file. Here is the script called 'histo':

	# histogram
	# Source: "The AWK Programming Language", by Aho, et.al., p.70

	     { x[int($3/10)]++ } # use the third column of input data

	END  {
	        for (i = 0; i < 10; i++)
	                printf(" %2d - %2d: %3d %s\n",
	                       10*i, 10*i+9, x[i], rep(x[i],"*"))
	        printf("100:      %3d %s\n", x[10], rep(x[10],"*"))
	     }

	function rep(n, s,   t) {   # return string of n s's
	        while (n--> 0)
	                t = t s
	        return t
	}

And here is the sample run:

	[B:\] gawk -f histo grades.txt
	  0 -  9:   0
	 10 - 19:   0
	 20 - 29:   0
	 30 - 39:   0
	 40 - 49:   0
	 50 - 59:   0
	 60 - 69:   1 *
	 70 - 79:   1 *
	 80 - 89:   3 ***
	 90 - 99:   1 *
	100:        0	
	[B:\]

The output shows that there were six grades, and most of them were in the 80-89 range.
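
The script is hardcoded to column $3, but you can avoid editing it for each column by passing the column number on the command line. Here is a minimal sketch that prints only the nonempty buckets (col=4 selects the second grade column):

	gawk -v col=4 '{ x[int($col/10)]++ }
	     END { for (i=0; i<=10; i++) if (x[i]) printf(" %2d: %3d\n", 10*i, x[i]) }' grades.txt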

Checkbook Program

This program takes a data file which lists your checkbook entries and your deposits, and calculates the totals.

Here is what a sample input file called 'checks.txt' looks like:

	check	1021
	to	Champagne Unlimited
	amount	123.10
	date	1/1/87

	deposit	
	amount	500.00
	date	1/1/87

	check	1022
	date	1/2/87
	amount	45.10
	to	Getwell Drug Store
	tax	medical

	check	1023
	amount	125.00
	to	International Travel
	date	1/3/87

	check	1024
	amount	50.00
	to	Carnegie Hall
	date	1/3/87
	tax	charitable contribution

	check	1025
	to	American Express
	amount	75.75
	date	1/5/87

Here is the script called 'check' which will calculate the totals:

	# check - print total deposits and checks
	# Source: "The AWK Programming Language", by Aho, et.al., p.87

	BEGIN { RS=""; FS="\n" }

	/(^|\n)deposit/ { deposits += field("amount"); next }
	/(^|\n)check/   { checks += field("amount"); next }

	END   { printf("Deposits: $%.2f, Checks: $%.2f\n", 
		       deposits, checks)
	      }

	function field(name,   i, f) {
		for (i = 1; i <= NF; i++) {
			split($i, f, "\t")
			if (f[1] == name)
				return f[2]
		}
		printf("Error: no field %s in record\n%s\n", name, $0)
	}

And this is a sample run:

	[B:\] gawk -f check checks.txt
	Deposits: $500.00, Checks: $418.95
	[B:\]

Importing and Exporting Data

Importing Data for use by Awk

Awk works well with data files that are stored in text files. Awk assumes that the data file is organized into records, within each record the data is divided into fields, and there are unique characters in the file that are used as the field separators and record separators.

By default, Awk assumes that newline characters are the record separators and whitespace characters (spaces and tabs) are the field separators. It is also possible to redefine the field separators to other characters, like a comma or a tab character, which means that Awk can process the commonly used "comma separated" and "tab separated" format for data files.

But note that if a file uses newline characters as record separators, it means that a newline cannot appear within a field. For example, a data file file with one record per line cannot contain a text field (e.g. a "notes" field) that contains free form text with newline characters within it. That would confuse Awk unless we added special code to handle that notes field.

The same restrictions apply to the field separators. If a file is defined to be comma separated, it means that no field is allowed to contain comma characters within it (e.g. a Name field that contains "Alvarado, Victor") because Awk would parse that as two fields, not one.

That is why tab separated files tend to be used more often. That way, the fields are allowed to contain spaces and commas.

Another way to format data for use by Awk is to use the "multiline" format, which is what we used for our index card databases above. Awk will treat each line as a field, and a blank line is the record separator.
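
For example, with the multiline format, counting the records in a file is a one-liner. Using the cards.txt file from above (Unix-style quoting shown):

	[B:\] gawk -v RS='' 'END { print NR, "cards" }' cards.txt
	3 cards
	[B:\]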

Exporting Data to Microsoft Excel

To export data to Excel, all we need to do is to convert the data file into tab-delimited format, and store it in a text file with a *.xls extension. When that file is opened in Microsoft Windows, Excel will open it automatically as if it were a spreadsheet.

As an example, let's export our grades.txt file to Excel. Here is our 'grades.txt' file:

	Allen Mona 70 77 85 83 70 89
	Baker John 85 92 78 94 88 91
	Jones Andrea 89 90 85 94 90 95
	Smith Jasper 84 88 80 92 84 82
	Turner Dunce 64 80 60 60 61 62
	Wells Ellis 90 98 89 96 96 92

The file uses spaces as the field separator, so we'll need a script that will convert the field separators into tabs. Here is a script called 'conv2xls':

	# conv2xls - Convert a data file into tab-separated format

	BEGIN {
	        IFS=" "    # input field separator is a space
	        OFS="\t"   # output field separator is a tab
	      }

	      { print $1, $2, $3, $4, $5, $6, $7, $8 }

And here is the sample run, where we store the tab-delimited output into a text file called grades.xls:

	[B:\] gawk -f conv2xls grades.txt > grades.xls
	[B:\]

Here are the contents of the 'grades.xls' text file:

	Allen   Mona    70      77      85      83      70      89
	Baker   John    85      92      78      94      88      91
	Jones   Andrea  89      90      85      94      90      95
	Smith   Jasper  84      88      80      92      84      82
	Turner  Dunce   64      80      60      60      61      62
	Wells   Ellis   90      98      89      96      96      92

We can then copy the grades.xls text file to a Windows PC, double-click on it, and Excel will open it as if it were a spreadsheet.

You can then do a "Save As" in Excel to save it as the regular Excel binary format.

Exporting Data to a Web Page

To export our data to a web page, we will need a script that will input our data file and generate HTML.

Let's start with our 'grades.txt' data file:

	Allen Mona 70 77 85 83 70 89
	Baker John 85 92 78 94 88 91
	Jones Andrea 89 90 85 94 90 95
	Smith Jasper 84 88 80 92 84 82
	Turner Dunce 64 80 60 60 61 62
	Wells Ellis 90 98 89 96 96 92

Here is a script called 'html' that will do the conversion. Note that the data will appear as rows of a table in HTML.

	# html - Convert a data file into an HTML web page with a table
	
	BEGIN {
		print "<HTML><HEAD><TITLE>Grades Database</TITLE></HEAD>"
		print "<BODY BGOLOR=\"#ffffff\">"
		print "<CENTER><H1>Grades Database</H1></CENTER>"
		print "<HR noshade size=4 width=75%>"
		print "<P><CENTER><TABLE BORDER>"
		printf "<TR><TH>Last<TH>First"
		print "<TH>G1<TH>G2<TH>G3<TH>G4<TH>G5<TH>G6"
	      }
	
	      { # Print the data in table rows
		printf "<TR><TD>" $1 "<TD>" $2 
		printf "<TD>" $3 "<TD>" $4 "<TD>" $5 
		print  "<TD>" $6 "<TD>" $7 "<TD>" $8 
	      }
	
	END   {
		print "</TABLE></CENTER><P>"
		print "<HR noshade size=4 width=75%>"
		print "</BODY></HTML>"
	      }

Here is the sample run. The output will be placed in a file called 'grades.htm'.

	[B:\] gawk -f html grades.txt > grades.htm
	[B:\]

This is what the resulting 'grades.htm' file looks like:

	<HTML><HEAD><TITLE>Grades Database</TITLE></HEAD>
	<BODY BGCOLOR="#ffffff">
	<CENTER><H1>Grades Database</H1></CENTER>
	<HR noshade size=4 width=75%>
	<P><CENTER><TABLE BORDER>
	<TR><TH>Last<TH>First<TH>G1<TH>G2<TH>G3<TH>G4<TH>G5<TH>G6
	<TR><TD>Allen<TD>Mona<TD>70<TD>77<TD>85<TD>83<TD>70<TD>89
	<TR><TD>Baker<TD>John<TD>85<TD>92<TD>78<TD>94<TD>88<TD>91
	<TR><TD>Jones<TD>Andrea<TD>89<TD>90<TD>85<TD>94<TD>90<TD>95
	<TR><TD>Smith<TD>Jasper<TD>84<TD>88<TD>80<TD>92<TD>84<TD>82
	<TR><TD>Turner<TD>Dunce<TD>64<TD>80<TD>60<TD>60<TD>61<TD>62
	<TR><TD>Wells<TD>Ellis<TD>90<TD>98<TD>89<TD>96<TD>96<TD>92
	</TABLE></CENTER><P>
	<HR noshade size=4 width=75%>
	</BODY></HTML>

You can open the resulting grades.htm file in a browser to see what the web page looks like.

Exporting Data to a Palm Pilot

First, we will need to install a database program on the Palm. There are several database programs to choose from, but let's use the freeware database program called Pilot-DB (available from PalmGear).

Next, we will need the freeware DOS tools that come with Pilot-DB to help us create the PDB data file. The DB-tools package is also available at PalmGear; you can download it and install it on your Windows PC. These are DOS tools, compiled to run in DOS under Windows, so we can't run them on the QuickPAD Pro. (Note: DB-tools is an open source project, so the source code is available.)

The DB-tools package contains a program called 'csv2pdb.exe'. It will do the conversion into a Palm PDB file.

Let's use the 'grades.txt' data file as an example:

	Allen Mona 70 77 85 83 70 89
	Baker John 85 92 78 94 88 91
	Jones Andrea 89 90 85 94 90 95
	Smith Jasper 84 88 80 92 84 82
	Turner Dunce 64 80 60 60 61 62
	Wells Ellis 90 98 89 96 96 92

Before we can run the 'csv2pdb.exe' program we first need to convert our data into "csv" (comma separated values) format. We can do that with the following awk script called 'conv2csv':

	# conv2csv - Convert a data file into comma-separated format

	BEGIN {
	        IFS=" "    # input field separator is a space
	        OFS=","    # output field separator is a comma
	      }

	      { print $1, $2, $3, $4, $5, $6, $7, $8 }

Here is the command line to create the comma-delimited data file, which we will call 'grades.csv':

	[B:\] gawk -f conv2csv grades.txt > grades.csv
	[B:\]

This is what the 'grades.csv' file looks like:

	Allen,Mona,70,77,85,83,70,89
	Baker,John,85,92,78,94,88,91
	Jones,Andrea,89,90,85,94,90,95
	Smith,Jasper,84,88,80,92,84,82
	Turner,Dunce,64,80,60,60,61,62
	Wells,Ellis,90,98,89,96,96,92

Next, we need to create an "info" file which will describe the format of our data. The 'csv2pdb.exe' program will need this information for the conversion to Palm format.

The info file will give our database a title and describe the fields of each record. In grades.csv, the first field is the student's last name, the second field is the student's first name, and the other six fields are the grades. Here is the resulting info file called 'grades.ifo':

	title "GradesDB"
	field "Last" string 38
	field "First" string 38
	field "G1" integer 14
	field "G2" integer 14
	field "G3" integer 14
	field "G4" integer 14
	field "G5" integer 14
	field "G6" integer 14
	option backup on

The numbers at the end of the lines are the field widths in pixels; we can make a guess for the field widths, and then fine-tune them on the Palm Pilot. The last line will set the backup bit on the PDB file so that it will be backed up at every hotsync.

From this point on, the rest of the steps must be done on your Windows PC.

On Your Windows PC

Now we create the PDB file on our PC with this command line:

	C:\> csv2pdb -i grades.ifo grades.csv grades.pdb
	C:\>

It will create a new file called 'grades.pdb' in the current directory. This is the Palm database file.

The last step is to install the PDB file to the Palm Pilot: in the Windows Explorer double-click on the PDB file and then hotsync your Palm Pilot as usual.

Here is a screen shot of the Palm Pilot running Pilot-DB with our grades database. (Make sure you have selected the blank unnamed view from the menu at the top-right corner of the screen.)

As you can see, storing data as text files gives you a lot of flexibility in manipulating the data and exporting it to other formats.

Author

Victor Alvarado


categories: Tips,Jul,2009,Admin

Random Numbers in Gawk

(Summarized and extended from a recent discussion at comp.lang.awk.)

Background

A standard idiom in Gawk is to reset the random number generator in a BEGIN block.

BEGIN {srand() }

Sadly, when called with no arguments, this "reseeding" uses time-in-seconds. So if the same "random" task runs multiple times in the same second, it will get the same random number seed.
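
You can see the problem directly from the shell; when the following loop finishes within a single second, all three runs typically print the identical "random" value:

for i in 1 2 3; do gawk 'BEGIN { srand(); print rand() }'; done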

Houston, We Have a Problem

"Ben" writes:

I have a Gawk script that puts random comments into a file. It is run 3 times in a row in quick succession. I found that seeding the random number generator from gawk did not work, because all 3 runs happened within the same second (and the default seed is the time).

I was wondering if anyone could give me some suggestions as to what can be done to get around this problem.

Solution #1: Persistent Memory

Kenny McCormack writes:

When last I ran into this problem, what I did was to save the last value returned by rand() to a file, then on the next run, read that in and use that value as the arg to srand(). Worked well.

(Editor's comment: Kenny's solution does work well but incurs the cost of maintaining and reading/writing that "last value" file.)

Solution #2: Use Bash

Tim Menzies writes:

How about setting the seed using the BASH $RANDOM variable:

gawk -v Seed=$RANDOM --source 'BEGIN { srand(Seed ? Seed : 1) }' 

If referenced multiple times in a second, it always generates a different number.

In the above usage, if we have a seed, use it. Else, no seed so start all "random" at the same place. If you prefer to use the default "seed from time-in-seconds" then use:

BEGIN { if (Seed) { srand(Seed) } else { srand() } }

(Editor's comment: Tim's solution incurs the overhead of additional command-line syntax. However, it does allow the process calling Gawk to control the seed. This is important when trying to, say, debug code by recreating the sequence of random numbers that lead to the bug.)

Solution #3: Query the OS

Thomas Weidenfeller writes:

Is that good enough (random enough) for your task?

BEGIN {
        "od -tu4 -N4 -A n /dev/random" | getline
        srand(0+$0)
}

(Editor's comment: Nice. Thomas' solution reminds us that "Gawk" can access a whole host of operating system facilities.)

Solution #4: Use the Process Id

Aharon Robbins writes:

You could do something like add PROCINFO["pid"] to the value of the time, or use that as the seed.

$ gawk 'BEGIN { srand(systime() + PROCINFO["pid"]); print rand() }'
0.405889
$ gawk 'BEGIN { srand(systime() + PROCINFO["pid"]); print rand() }'
0.671906

(Editor's comment: Aharon's solution is the fastest of all the ones shown here. For example, on Mac OS/X, his solution takes 6ms to run:

$ time gawk 'BEGIN { srand(systime() + PROCINFO["pid"]) }'

real    0m0.006s
user    0m0.002s
sys     0m0.004s

while Thomas' solution is somewhat slower:

$ time gawk 'BEGIN { "od -tu4 -N4 -A n /dev/random" | getline; srand($0+0) }'

real    0m0.039s
user    0m0.004s
sys     0m0.034s

Note that while Aharon's solution is the fastest, it does not let some master process set the seed for the Gawk process (e.g. as in Tim's approach).)

Conclusion

If you want raw speed, use Aharon's approach.

If you want seed control, see Tim's approach.


categories: Funky,Tips,Mar,2009,ArnoldR

Super-For Loops

In this exchange from comp.lang.awk, Jason Quinn discusses his super-for loop trick. Arnold Robbins then chimes in to say that, with indirect functions, super-for loops could become a generic tool.

Jason Quinn writes:

  • Frequently when programming, situations arise for me where I need a number of nested for-loops. Such a case arose for me again just recently while I was inventing a dice game. Anyway, here is the implementation that I ended up using to create a "super-for" loop in AWK (a little trickier than C).
  • This simple example merely lists all possible outcomes of rolling 4, 6, 8, 10, 12, and 20 sided dice at once. A super-for loop requires an array to specify the loop indices... here we have 6 dice and the number of sides determines the indices. The code is easily modified for an arbitrary number of dice (which is the whole point).
  • I identify three parts of a super-for which I called the prologue, body, and epilog. Under most circumstances, I think the main body only would get used.
  • For example:
    #shows an example of a superfor loop
    BEGIN {
    	#define loop maximums
    	loopmax[1]=4
    	loopmax[2]=6
    	loopmax[3]=8
    	loopmax[4]=10
    	loopmax[5]=12
    	loopmax[6]=20
    	#call the loop
    	superfor(6)
    }
    function superfor(loopdepth, zz) { # zz is a local variable
            currloopnum++
    
            #start of prologue
            #end of prologue
    
            for(loopcounter[currloopnum]=1; 
                loopcounter[currloopnum]<=loopmax[currloopnum]; 
                loopcounter[currloopnum]++) {
                    if ( loopdepth==1 ) {
                            #start of superfor body
                            for (zz=1;zz<=currloopnum;zz++) {
                                    printf loopcounter[zz] FS
                                    }
                            print ""
                            #end of superfor body
                            }
                    else if ( loopdepth>1 )
                            superfor(loopdepth-1)
                    }
    
            #start of epilog
            #end of epilog
    
            loopdepth++ ; currloopnum--
            }
    

Arnold Robbins replies:

  • I think this would make a great application for indirect function calls. For example:
    function superfor(loopdepth, prologue, body, epilogue,     zz)
    {
            currloopnum++
    
            @prologue()
    
            for(loopcounter[currloopnum]=1; 
                loopcounter[currloopnum]<=loopmax[currloopnum]; 
                loopcounter[currloopnum]++) {
                    if ( loopdepth==1 ) {
                            @body()
                    }
                    else if ( loopdepth>1 )
                            superfor(loopdepth-1, prologue, 
                                     body, epilogue)
                    }
    
            @epilogue()
    
            loopdepth++ ; currloopnum--
    }
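
    For completeness, a call site for this version might look like the following sketch (it assumes a gawk with indirect function calls, i.e. gawk 4.0 or later; the function names are purely illustrative):

    function noop() { }
    function show(   zz) {
            for (zz = 1; zz <= currloopnum; zz++)
                    printf loopcounter[zz] FS
            print ""
    }
    BEGIN {
            loopmax[1] = 4
            loopmax[2] = 6
            superfor(2, "noop", "show", "noop")   # prints all 4*6 index pairs
    }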
    

categories: Tips,Aug,2009,JanisP

Using Field Names to Reference Columns

In comp.lang.awk, Janis Papanagnou comments on how Awk can read a CSV files where the headers are named in line one.

Problem

Suppose you have a CSV file with headers for field names. Gawk can use those headers as field names, which makes the code more intuitive and easier to work with. Given that awk is expected to work on tabular data, this seems to be a good alternative to bare field numbers.

Solution

Try this shell script:
#!/bin/sh
awk -F, -v cols="${1:?}" '
   BEGIN {
     n=split(cols,col)
     for (i=1; i<=n; i++) s[col[i]]=i
   }
   NR==1 {
     for (f=1; f<=NF; f++)
       if ($f in s) c[s[$f]]=f
     next
   }
   { sep=""
     for (f=1; f<=n; f++) {
       printf("%c%s",sep,$c[f])
       sep=FS
     }
     print ""
   }
'

This script can be called with an arbitrary list of column names as defined in the first line of your data file and separated by the same field separator as your data.

For example, suppose the above code is in bycolname.sh and we have data that looks like this:

hello,world,region_name,foo,bar,xyz,dummy
11111,22222,aspac,77777,8888888,xyz,zzzzz
21111,22222,ASPAC,77777,8888888,xyz,zzzzz
31111,22222,ASPAC,77777,8888888,XYZ,zzzzz
41111,22222,aspac,77777,8888888,XYZ,zzzzz

Now, calling this command...

sh bycolname.sh world,hello

... would produce:

22222,11111
22222,21111
22222,31111
22222,41111

Bugs

Non-existent column names will expand to $0 each, which may be surprising if there's an unnoticed typo in your field list.
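
A defensive tweak (a sketch) is to verify, while reading the header line, that every requested column was actually found, and bail out otherwise:

   NR==1 {
     for (f=1; f<=NF; f++)
       if ($f in s) c[s[$f]]=f
     for (i=1; i<=n; i++)
       if (!(i in c)) { printf("unknown column: %s\n", col[i]) > "/dev/stderr"; exit 1 }
     next
   }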


categories: Getline,Tips,Jan,2009,EdM

Use (and Abuse) of Getline

by Ed Morton (and friends)

The following summary, composed to address the recurring issue of getline (mis)use, was based primarily on information from the book "Effective Awk Programming", Third Edition By Arnold Robbins; (http://www.oreilly.com/catalog/awkprog3) with review and additional input from many of the comp.lang.awk regulars, including

  • Steve Calfee,
  • Martin Cohen,
  • Manuel Collado,
  • Jürgen Kahrs,
  • Kenny McCormack,
  • Janis Papanagnou,
  • Anton Treuenfels,
  • Thomas Weidenfeller,
  • John LaBadie and
  • Edward Rosten.

Getline

getline is fine when used correctly (see below for a list of those cases), but it's best avoided by default because:

  1. It allows people to stick to their preconceived ideas of how to program rather than learning the easier way that awk was designed to read input. It's like C programmers continuing to do procedural programming in C++ rather than learning the new paradigm and the supporting language constructs.
  2. It has many insidious caveats that come back to bite you either immediately or in future. The succeeding discussion captures some of those and explains when getline IS appropriate.

As the book "Effective Awk Programming", Third Edition By Arnold Robbins; http://www.oreilly.com/catalog/awkprog3) which provides much of the source for this discussion says:

    "The getline command is used in several different ways and should not be used by beginners. ... come back and study the getline command after you have reviewed the rest ... and have a good knowledge of how awk works."

Variants

The following summarises the eight variants of getline, listing which variables are set by each one:

Variant                 Variables Set 
-------                 -------------
getline                 $0, ${1...NF}, NF, FNR, NR, FILENAME 
getline var             var, FNR, NR, FILENAME 
getline < file          $0, ${1...NF}, NF 
getline var < file      var 
command | getline       $0, ${1...NF}, NF 
command | getline var   var 
command |& getline      $0, ${1...NF}, NF 
command |& getline var  var 

The "command |& ..." variants are GNU awk (gawk) extensions. gawk also populates the ERRNO builtin variable if getline fails.

Although calling getline is very rarely the right approach (see below), if you need to do it, the safest ways to invoke it are:

if/while ( (getline var < file) > 0) 
if/while ( (command | getline var) > 0) 
if/while ( (command |& getline var) > 0) 

since those do not affect any of the builtin variables and they allow you to correctly test whether getline succeeded or failed. If you need the input record split into separate fields, just call split() to do that.
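For example, here is a minimal sketch (the file name lookup.txt and the two-field layout are illustrative) of reading a file safely in a BEGIN section and splitting each record yourself:

BEGIN {
    file = "lookup.txt"
    while ( (getline line < file) > 0 ) {
        n = split(line, f)     # split manually; $0, NF etc. stay untouched
        map[f[1]] = f[2]       # later rules can consult map[]
    }
    close(file)
}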

Caveats

Users of getline have to be aware of the following non-obvious effects of using it:

  1. Normally FILENAME is not set within a BEGIN section, but a non-redirected call to getline will set it.
  2. Calling "getline < FILENAME" is NOT the same as calling "getline". The second form will read the next record from FILENAME while the first form will read the first record again.
  3. Calling getline without a var to be set will update $0, NF and the individual fields, so they will have different values for subsequent processing than they had for prior processing in the same condition/action block.
  4. Many of the getline variants above set some but not all of the builtin variables, so you need to be very careful that it's setting the ones you need/expect it to.
  5. According to POSIX, `getline < expression' is ambiguous if expression contains unparenthesized operators other than `$'; for example, `getline < dir "/" file' is ambiguous because the concatenation operator is not parenthesized. You should write it as `getline < (dir "/" file)' if you want your program to be portable to other awk implementations.
  6. In POSIX-compliant awks (e.g. gawk --posix) a failure of getline (e.g. trying to read from a non-readable file) is fatal to the program; in other awks it is not.
  7. Unredirected getline can defeat the simple and usual rule to handle input file transitions:
    FNR==1 { ... start of file actions ... }
    
    File transitions can occur at getlines, so FNR==1 also needs to be checked after each unredirected getline (i.e. one not reading from a specific file name). E.g. if you want to print the first line of each of these files:
    $ cat file1 
    a 
    b 
    $ cat file2 
    c 
    d 
    
    you'd normally do:
    $ awk 'FNR==1{print}' file1 file2 
    a 
    c 
    
    but if a "getline" snuck in, it could have the unexpected consequence of skipping the test for FNR==1 and so not printing the first line of the second file (a corrected sketch follows this list):
    $ awk 'FNR==1{print}/b/{getline}' file1 file2 
    a 
    
  8. Using getline in the BEGIN section to skip lines makes your program difficult to apply to multiple files. e.g. with data like...
    some header line 
    ---------------- 
    data line 1 
    data line 2 
    ... 
    data line 10000 
    
    you may consider using...
    BEGIN { getline header; getline } 
    { whatever_using_header_and_data_on_the_line() } 
    
    instead of...
    FNR == 1 { header = $0 } 
    FNR < 3 { next } 
    { whatever_using_header_and_data_on_the_line() } 
    
    but the getline version would not work on multiple files, since the BEGIN section is only executed once, before the first file is processed, whereas the non-getline version works as-is. This is one example of the common case where the getline command itself isn't directly causing the problem, but the design you can end up with by choosing a getline approach is not ideal.
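As promised under caveat 7, here is one way (a sketch, not from the original post) to keep the start-of-file handling correct even when a getline sneaks in: re-test FNR after each unredirected getline, so the start-of-file action is repeated when getline crosses a file boundary:

FNR == 1 { print }                 # normal start-of-file action
/b/ {
    if ((getline) > 0 && FNR == 1)
        print                      # getline just crossed into a new file,
                                   # so repeat the start-of-file action
}

With file1 and file2 from caveat 7, this prints "a" and "c" as intended.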

Applications

getline is an appropriate solution for the following:

  a. Reading from a pipe, e.g.:
    command = "ls" 
    while ( (command | getline var) > 0) { 
        print var 
    } 
    close(command) 
    
  b. Reading from a coprocess, e.g.:
    command = "LC_ALL=C sort" 
    n = split("abcdefghijklmnopqrstuvwxyz", a, "") 
    for (i = n; i > 0; i--) 
         print a[i] |& command 
    close(command, "to") 
    while ((command |& getline var) > 0) 
        print "got", var 
    close(command) 
    
  c. In the BEGIN section, reading some initial data that's referenced during processing multiple subsequent input files, e.g.:
    BEGIN { 
       while ( (getline var < ARGV[1]) > 0) { 
              data[var]++ 
       } 
       close(ARGV[1]) 
       ARGV[1]="" 
     } 
     $0 in data 
    
  d. Recursive-descent parsing of an input file or files, e.g.:
    awk 'function read(file) { 
                while ( (getline < file) > 0) { 
                    if ($1 == "include") { 
                         read($2) 
                    } else { 
                         print > ARGV[2] 
                    } 
                } 
                close(file) 
          } 
          BEGIN{ 
             read(ARGV[1]) 
             ARGV[1]="" 
             close(ARGV[2]) 
         }1' file1 tmp 
    

In all other cases, it's clearest, simplest, least error-prone, and easiest to maintain to let awk's normal text-processing loop read the records. In the case of "c", whether to use the BEGIN+getline approach or just collect the data within the awk condition/action part after testing for the first file is largely a style choice.

"a" above calls the UNIX command "ls" to list the current directory contents, then prints the result one line at a time.

"b" above writes the letters of the alphabet in reverse order, one per line, down the two-way pipe to the UNIX "sort" command. It then closes the write end of the pipe, so that sort receives an end-of-file indication. This causes sort to sort the data and write the sorted data back to the gawk program. Once all of the data has been read, gawk terminates the coprocess and exits. This is particularly necessary in order to use the UNIX "sort" utility as part of a coprocess since sort must read all of its input data before it can produce any output. The sort program does not receive an end-of-file indication until gawk closes the write end of the pipe. Other programs can be invoked as just:

command = "program" 
do { 
      print data |& command 
      command |& getline var 
} while (data left to process) 
close(command) 

Note that calling close() with a second argument is also gawk-specific.

"c" above reads every record of the first file passed as an argument to awk into an array and then for every subsequent file passed as an argument will print every record from that file that matches any of the records that appeared in the first file (and so are stored in the "data" array). This could alternatively have been implemented as:

# fails if first file is empty 
NR==FNR{ data[$0]++; next } 
$0 in data 

or:

FILENAME==ARGV[1] { data[$0]++; next } 
$0 in data 

or:

FILENAME=="specificFileName" { data[$0]++; next } 
$0 in data 

or (gawk only):

ARGIND==1 { data[$0]++; next } 
$0 in data 

"d" above not only expands all the lines that say "include subfile", but by writing the result to a tmp file, resetting ARGV[1] (the highest level input file) and not resetting ARGV[2] (the tmp file), it then lets awk do any normal record parsing on the result of the expansion since that's now stored in the tmp file. If you don't need that, just do the "print" to stdout and remove any other references to a tmp file or ARGV[2]. In this case, since it's convenient to use $1 and $2, and no other part of the program references any builtin variables, getline was used without populating an explicit variable. This method is limited in its recursion depth to the total number of open files the OS permits at one time.

Tips

The following tips may help if, after reading the above, you discover you have an appropriate application for getline or if you're looking for an alternative solution to using getline:

  1. If you need to distinguish between a normal EOF and a read or opening error, you have to use gawk's ERRNO variable or code it as:
    if/while ( (e = (getline var < file)) > 0) { ... }
    close(file)
    if (e < 0) some_error_handling
  2. Don't forget to close() any file you open for reading. The common idiom for getline and other methods of opening files/streams is:
    cmd="some command" 
    do something with cmd 
    close(cmd) 
    
  3. A common misapplication of getline is to just skip a few lines of an input file. The following discusses how to do that without using getline with all that implies as discussed above. This discussion builds on the common awk idiom to "decrement a variable to zero" by putting the decrement of the variable as the second term in an "and" clause with the first part being the variable itself, so the decrement only occurs if the variable is non-zero:
    • Print the Nth record after some pattern:
      awk 'c&&!--c;/pattern/{c=N}' file 
    • Print every record except the Nth record after some pattern:
      awk 'c&&!--c{next}/pattern/{c=N}' file 
    • Print the N records after some pattern:
      awk 'c&&c--;/pattern/{c=N}' file 
    • Print every record except the N records after some pattern:
      awk 'c&&c--{next}/pattern/{c=N}' file

In this example there are no blank lines, the input is aligned with the left-hand column, and you want to print $0 for the second record following the record that contains some pattern, e.g. the number 3:

$ cat file 
line 1 
line 2 
line 3 
line 4 
line 5 
line 6 
line 7 
line 8 
$ awk '/3/{getline;getline;print}' file 
line 5 

That works just fine. Now let's see the concise way to do it without getline:

$ awk 'c&&!--c;/3/{c=2}' file 
line 5

It's not quite so obvious at a glance what that does, but it uses an idiom that most awk programmers would do well to learn, and it is briefer and avoids all those getline caveats.

Now let's say we want to print the 5th line after the pattern instead of the 2nd line. Then we'd have:

$ awk '/3/{getline;getline;getline;getline;getline;print}' file 
line 8 
$ awk 'c&&!--c;/3/{c=5}' file 
line 8

i.e. we have to add a whole series of additional getline calls to the getline version, as opposed to just changing the counter from 2 to 5 for the non-getline version. In reality, you'd probably completely rewrite the getline version to use a loop:

$ awk '/3/{for (c=1;c<=5;c++) getline; print}' file 
line 8

Still not as concise as the non-getline version, it retains all the getline caveats, and it required a redesign of the code just to change a counter.

Now let's say we also have to print the word "Eureka" if the number 4 appears in the input file. With the getline version, you now have to do something like:

$ awk '/3/{for (c=1;c<=5;c++) { getline; if ($0 ~ /4/) print "Eureka!" } 
print}' file 
Eureka! 
line 8

whereas with the non-getline version you just have to do:

$ awk 'c&&!--c;/3/{c=5}/4/{print "Eureka!"}' file 
Eureka! 
line 8

i.e. with the getline version, you have to work around the fact that you're now processing records outside of the normal awk work-loop, whereas with the non-getline version you just have to drop your test for "4" into the normal place and let awk's normal record processing deal with it like it always does.

Actually, if you look closely at the above you'll notice we just unintentionally introduced a bug in the getline version. Consider what would happen in both versions if 3 and 4 appear on the same line. The non-getline version would behave correctly, but to fix the getline version, you'd need to duplicate the condition somewhere, e.g. perhaps something like this:

$ awk '/3/{for (c=1;c<=5;c++) { if ($0 ~ /4/) print "Eureka!"; getline } 
if ($0 ~ /4/) print "Eureka!"; print}' file 
Eureka! 
line 8 

Now consider how the above would behave when there aren't 5 lines left in the input file or when the last line of the file contains both a 3 and a 4. i.e. there are still design questions to be answered and bugs that will appear at the limits of the input space.

Ignoring those bugs since this is not intended as a discussion on debugging getline programs, let's say you no longer need to print the 5th record after the number 3 but still have to do the Eureka on 4. With the getline version, you'd strip out the test for 3 and the getline stuff to be left with:

$ awk '{if ($0 ~ /4/) print "Eureka!"}' file 
Eureka!

which you'd then presumably rewrite as:

$ awk '/4/{print "Eureka!"}' file 
Eureka! 

which is what you get just by removing everything involving the test for 3 and the counter in the non-getline version (i.e. "c&&!--c;/3/{c=5}"):

$ awk '/4/{print "Eureka!"}' file 
Eureka! 

i.e. again, one small requirement change required a complete redesign of the getline code, but just the absolute minimum necessary tweak to the non-getline version.

So, what you saw above in the getline case was significant redesign required for every tiny requirement change, much more handwritten code, insidious bugs introduced during development, and challenging design questions at the limits of your input space, whereas the non-getline version always had less code, was much easier to modify as requirements changed, and was much more obvious, predictable, and correct in how it behaved at the limits of the input space.


categories: Forloop,Tips,Jan,2009,Jimh

Never write for(i=1;i<=n;i++).. again?

by Jim Hart

I've written this kind of thing

n = split(something,arr,/re/)
for(i=1;i<=n;i++) {
   print arr[i]
}

so often, it's tedious. I like this better:

n = split(something,arr,/re/)
while(n--) {
   print arr[++i]   # i starts at 0, so this walks arr[1]..arr[n]
}
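If back-to-front order is acceptable, the extra index variable can be dropped entirely (a small variation on the same idiom, not from the original post):

n = split(something,arr,/re/)
while(n) print arr[n--]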

Both are easier to type. And, in cases where front-to-back or back-to-front doesn't matter, it's even simpler:

# copy a number indexed array, assuming n contains the number of
# elements and that they are indexed 0..n-1

while(n--) arr2[n] = arr1[n]

And, yes,

for(i in arr1) arr2[i] = arr1[i]

works, too. But, some loops don't involve arrays. :-)

Want more?

This tip has been discussed on comp.lang.awk.


categories: Tips,Apr,2009,ArnoldR

Moving Files with Awk

Andrew Eaton wrote at comp.lang.awk:

I just started with awk and sed; I am more of a perl/C/C++ person. I have a quick question regarding the pipe. In Awk, I am trying to use this construct:

while ((getline < "somedata.txt") > 0)
            {print | "mv"} #or could be "mv -v" for verbose. 

Is it possible that "print" is no longer printing the value read by getline, and if so, how do I correct it?

Arnold Robbins comments:

The problem here is that `mv' doesn't read standard input; it only processes command lines. Assuming that your data is something like:

oldfile newfile

You can do things two ways:

# build the command and execute it
while ((getline < "somedata.txt") > 0) {
          command = "mv " $1 " " $2
          system(command)
}
close("somedata.txt")

or this way:

# send commands to the shell
while ((getline < "somedata.txt") > 0) {
          printf("mv %s %s\n", $1, $2) | "sh"
}
close("somedata.txt")
close("sh")

The latter is more efficient: it starts a single shell that executes all the mv commands, instead of spawning a new shell for each system() call.
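As written, either loop is a fragment that needs to live inside a rule; wrapped in a BEGIN block, the second version becomes a complete standalone program (a sketch; the script name is illustrative and somedata.txt is from the question above):

# movefiles.awk
BEGIN {
    while ((getline < "somedata.txt") > 0)
        printf("mv %s %s\n", $1, $2) | "sh"
    close("somedata.txt")
    close("sh")
}

Run it as: awk -f movefiles.awk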


categories: Sed,Tips,Apr,2009,ArnoldR

AwkSed: A Simple Stream Editor

by Arnold Robbins

From the Gawk Manual.

The sed utility is a stream editor, a program that reads a stream of data, makes changes to it, and passes it on. It is often used to make global changes to a large file or to a stream of data generated by a pipeline of commands. While sed is a complicated program in its own right, its most common use is to perform global substitutions in the middle of a pipeline:

command1 < orig.data | sed 's/old/new/g' | command2 > result

Here, s/old/new/g tells sed to look for the regexp old on each input line and globally replace it with the text new, i.e., all the occurrences on a line. This is similar to awk's gsub function.
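For comparison, here is the same global substitution written with gsub (a sketch of an equivalent pipeline):

command1 < orig.data | awk '{ gsub(/old/, "new") } 1' | command2 > result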

The following program, awksed.awk, accepts at least two command-line arguments: the pattern to look for and the text to replace it with. Any additional arguments are treated as data file names to process. If none are provided, the standard input is used:

# awksed.awk --- do s/foo/bar/g using just print
#    Thanks to Michael Brennan for the idea

function usage()
{
  print "usage: awksed pat repl [files...]" > "/dev/stderr"
  exit 1
}

BEGIN {
    # validate arguments
    if (ARGC < 3)
        usage()

    RS = ARGV[1]
    ORS = ARGV[2]

    # don't use arguments as files
    ARGV[1] = ARGV[2] = ""
}

# look ma, no hands!
{
    if (RT == "")
        printf "%s", $0
    else
        print
}

The program relies on gawk's ability to have RS be a regexp, as well as on the setting of RT to the actual text that terminates the record.

The idea is to have RS be the pattern to look for. gawk automatically sets $0 to the text between matches of the pattern. This is text that we want to keep, unmodified. Then, by setting ORS to the replacement text, a simple print statement outputs the text we want to keep, followed by the replacement text.

There is one wrinkle to this scheme, which is what to do if the last record doesn't end with text that matches RS. Using a print statement unconditionally prints the replacement text, which is not correct. However, if the file did not end in text that matches RS, RT is set to the null string. In this case, we can print $0 using printf.

The BEGIN rule handles the setup, checking for the right number of arguments and calling usage if there is a problem. Then it sets RS and ORS from the command-line arguments and sets ARGV[1] and ARGV[2] to the null string, so that they are not treated as file names.

The usage function prints an error message and exits. Finally, the single rule handles the printing scheme outlined above, using print or printf as appropriate, depending upon the value of RT.
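So the pipeline shown at the start of this article could be rewritten as (an illustrative invocation; gawk is required, since a multi-character RS is treated as a regexp only in gawk):

command1 < orig.data | gawk -f awksed.awk 'old' 'new' | command2 > result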


categories: Sed,Tips,Apr,2009,JamesL

s2a: sed to Awk

Contents

Download

Description

Bugs

Author

Code

Download

Download from LAWKER.

Description

The s2a project is a sed-to-awk conversion utility written in awk. As input it takes a sed script, and it outputs an equivalent awk script.

This version should be fully functional as far as the following sed commands are concerned: a,d,s,p,q,c,i,n. Commands to be implemented in the future: {},=,h,g,N,P,r,x,y,l,H,G,D,b,t,:
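A typical invocation might look like this (a sketch; the file names are illustrative, and gawk is needed for both steps since the converter uses gensub() and the three-argument match(), and the generated script uses gensub()):

$ gawk -f s2a.awk commands.sed > commands.awk
$ gawk -f commands.awk input.txt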

Bugs

$ is not a valid line address. Also, line continuation with '\' is not implemented.

Author

James Lyons, Feb 2008.

For more excellent awk code, visit Lyons' awk.dsplab web site.

Code

BEGIN{RS=";|\n"; FS=""; var=1;}
{
    i=1; case1=""; case2="";
    while($i==" ")i++;
    if($i=="\\"||$i=="/"||$i~/[0-9]/) case1=matchaddr();
    if($i==","){i++; case2=matchaddr()};
# handle sed commands
####################################################################################################
    if($i == "d"){ a1=a2="next;";
    }else if($i == "p"){ a1=a2="print;";
    }else if($i == "a"){ rest="";
        for(c=i+2;c<=NF;c++) rest=rest$c;
        a1=a2="$0=$0\"\\n"rest"\";"; 
    }else if($i == "q"){ a1=a2="print; exit;"; 
    }else if($i == "n"){ a1=a2="print; if(getline <= 0) next;"
    }else if($i == "s"){
        re=substr($0, i); p=substr(re,2,1); match(re,"s"p"((\\"p"|.)*)"p"((\\"p"|.)*)"p"([a-zA-Z])?",tmp);
        tmp[3]=gensub(/\\[0-9]/,"\\\\&","g",tmp[3]); 
        tmp[1]=gensub(/\\\(/,"(","g",tmp[1]); tmp[1]=gensub(/\\\)/,")","g",tmp[1]);
        if(tmp[5]=="") a1=a2="$0=gensub(/"tmp[1]"/,\""tmp[3]"\",1);";      # no flag: replace first match
        else a1=a2="$0=gensub(/"tmp[1]"/,\""tmp[3]"\",\""tmp[5]"\");";     # pass the s///g-style flag through
    }else if($i == "c"){ rest="";
        for(c=i+2;c<=NF;c++) rest=rest$c;
        a1="$0=\""rest"\";"; 
        a2="next;";
    }else if($i == "i"){ rest="";
        for(c=i+2;c<=NF;c++) rest=rest$c;
        a1=a2="$0=\""rest"\\n\"$0;"; 
    }else{
        print "ERROR: invalid syntax. Unkown command in expression "$0" (expr number "NR")"; exit;
    }
####################################################################################################
# output awk commands
    if(case1=="" && case2=="") print "{"a1"}";
    else if(case1~/^[0-9]/ && case2=="") print "NR=="case1"{"a1"}";
    else if(case2 == "") print "/"case1"/{"a1"}";
    else if(case1~/^[0-9]/ && case2~/^[0-9]/) print "temp"var"==1&&NR=="case2"{temp"var"=0;"a2"}temp"var"==1{"a2"}NR=="case1"{temp"var"=1;"a1"}";
    else if(case1~/^[0-9]/)  print "temp"var"==1&&/"case2"/{temp"var"=0;"a2"}temp"var"==1{"a2"}NR=="case1"{temp"var"=1;"a1"}";
    else if(case2~/^[0-9]/)  print "temp"var"==1&&NR=="case2"{temp"var"=0;"a2"}temp"var"==1{"a2"}/"case1"/{temp"var"=1;"a1"}";
    else print "temp"var"==1&&/"case2"/{temp"var++"=0;"a2"}temp"var"==1{"a2"}/"case1"/{temp"var"=1;"a1"}";
    var++;
}

function matchaddr(){
    str=substr($0, i); p=1;
    if($i == "\\"){ p=substr(str,2,1); match(str,p"([^"p"]*)"p,arr); i++}
    else if($i == "/"){ p=substr(str,1,1); match(str,p"([^"p"]*)"p,arr); }
    else { match(str,/^([0-9]*)/,arr) };
    i += RLENGTH;
    return arr[1];
}
END{print "{print}";}