Awk.Info

"Cause a little auk awk
goes a long way."

WHAT'S NEW?

Mar 01: Michael Sanders demos an X-windows GUI for AWK.

Mar 01: Awk100#24: A. Lahm and E. de Rinaldis' patent search, in AWK

Feb 28: Tim Menzies asks this community to write an AWK cookbook.

Feb 28: Arnold Robbins announces a new debugger for GAWK.

Feb 28: Awk100#23: Premysl Janouch offers an IRC bot, in AWK

Feb 28: Updated: the AWK FAQ

Feb 28: Tim Menzies offers a tiny content management system, in Awk.

Jan 31: Comment system added to awk.info. For example, see discussion bottom of ?keys2awk

Jan 31: Martin Cohen shows that Gawk can handle massively long strings (300 million characters).

Jan 31: The AWK FAQ is being updated. For comments/ corrections/ extensions, please mail tim@menzies.us

Jan 31: Martin Cohen finds Awk on the Android platform.

Jan 31: Aleksey Cheusov released a new version of runawk.

Jan 31: Hirofumi Saito contributes a candidate Awk mascot.

Jan 31: Michael Sanders shows how to quickly build an AWK GUI for windows.

Jan 31: Hyung-Hwan Chung offers QSE, an embeddable Awk Interpreter.


categories: Top10,Mar,2009,Admin

Top 10

The Awk.info Top 10 page highlights the "best" (most impressive, most insightful, most fun, most visited) pages on this site.


categories: Games,Top10,TenLiners,Mar,2009,BrianK

Story.awk

Synopsis

echo Goal | gawk -f story.awk [ -v Grammar=FILE ] [ -v Seed=NUMBER ] 
echo Goal | gawk -f storyp.awk [ -v Grammar=FILE ] [ -v Seed=NUMBER ] 

Download

Download from LAWKER.

Description

This code inputs a set of productions and outputs a string of words that satisfy the production rules.

This page describes two versions of that system: story.awk and storyp.awk. The former selects productions at random with equal probability. The latter lets the user bias the selection by adding a weight at the end of each production line.

Options

-v Grammar=FILE
Sets the FILE containing the productions. Defaults to "grammar".
-v Seed=NUM
Sets the seed for the random number generator. Defaults to "1". A useful idiom for generating random text is to use Seed=$RANDOM
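
The effect of a fixed seed can be checked in isolation. The sketch below is illustrative only: the shell function name draws is made up here, and it simply repeats the srand(Seed ? Seed : 1) call used by story.awk. Running it twice with the same seed prints the same numbers, which is why a given Seed always yields the same story.

```shell
# draws SEED: print the first three rand() values for a given seed.
# (draws is a hypothetical helper, not part of story.awk.)
draws() {
  awk -v Seed="$1" 'BEGIN {
      srand(Seed ? Seed : 1)        # same default as story.awk
      print rand(), rand(), rand()
  }'
}

draws 7    # run twice: both lines are identical
draws 7
draws 8    # a different seed usually gives different numbers
```

Since Seed=$RANDOM changes on every shell invocation, it gives a fresh story each time while a literal seed makes a run repeatable.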

Examples

A Short Example

This grammar ...

Sentence -> Nounphrase Verbphrase   
Nounphrase -> the boy              
Nounphrase -> the girl           
Verbphrase -> Verb Modlist Adverb 
Verb -> runs                    
Verb -> walks                  
Modlist ->                    
Modlist -> very Modlist      
Adverb -> quickly           
Adverb -> slowly           
... and this input ...
for i in 1 2 3 4 5 6 7 8 9 10;do
	echo Sentence | 
	gawk -f ../story.awk -v Grammar=english.rules -v Seed=$i | 
	fmt
done
... generates these sentences:
the boy runs very slowly
the girl runs slowly
the boy runs very slowly
the girl walks very very quickly
the boy runs quickly
the girl walks very very slowly
the boy walks very very very very very very quickly
the boy walks very quickly
the girl runs slowly
the girl runs very quickly

A Longer Example

Here is Gahan Wilson's sci-fi plot generator ...

Using the above, we can generate the following stories:


 Earth scientists invent giant bugs who want Our Women,  And Take
 A Few And Leave

 Earth is Attacked By tiny lunar superbeings who  Under Stand and
 Are Not radioactive and can not be killed by the Navy but They Die
 From Catching A Cold

 Earth scientists invent enormous bugs who are Friendly and and
 They Get Married And Live Happily Forever After

 Earth is Struck By A Giant cloud and Magically Saved

 Earth scientists invent giant bugs who  Under Stand and Are Not
 radioactive and can not be killed by the Air Force so They Kill
 Us

 Earth is Attacked By enormous extra Galactic blobs who  Under Stand
 and Are Not radioactive and can be killed by the Air Force

 Earth scientists discover enormous blobs who  Under Stand and Are
 Not radioactive and can be killed by a Crowd Of Peasants

 Earth falls Into Sun and  Some  Resuced

 Earth is Struck By A Giant comet but Is Saved

 Earth is Struck By A Giant comet and Is Destroyed

This is generated from the following code:

for i in 1 2 3 4 5 6 7 8 9 10;do
	echo
	echo Start | 
	gawk -f ../story.awk -v Grammar=scifi.rules -v Seed=$i | 
	fmt
done

running on the following grammar:

Start      -> Earth IsStressed
IsStressed -> Catestrophes 
IsStressed -> Science 
IsStressed -> Attack 
IsStressed -> Collision

Catestrophes -> Catestrophe and PossibleMegaDeath

Catestrophe -> burnsUp 
Catestrophe -> freezes
Catestrophe -> fallsIntoSun

Collision -> isStruckByAGiant Floater AndThen

Floater -> comet
Floater -> asteroid
Floater -> cloud

AndThen -> butIsSaved
AndThen -> andIsDestroyed
AndThen -> andMagicallySaved


PossibleMegaDeath -> everybodyDies
PossibleMegaDeath -> Some GoOn 

SomeSaved ->  somePeople
SomeSaved ->  everybody
SomeSaved ->  almostEverybody
  
GoOn -> dies
GoOn -> Resuced
GoOn -> Saved
 
Rescued -> isRescuedBy Sizes Extraterestrial Beings
Saved   -> butIsSavedBy SomeOne scientists the  Science

SomeOne -> earth
SomeOne -> extraterestrial

Science -> scientists DoSomething Sizes Beings Whichetc

DoSomething -> invent
DoSomething -> discover

Attack -> isAttackedBy Sizes Extraterestrial Beings Whichetc

Sizes -> tiny 
Sizes -> giant 
Sizes -> enormous
 
Extraterestrial -> martian
Extraterestrial -> lunar
Extraterestrial -> extraGalactic

Beings -> bugs
Beings -> reptiles
Beings -> blobs
Beings -> superbeings

Whichetc -> who WantSomething

WantSomething -> WantWomen
WantSomething -> areFriendly  and DenoumentOrHappyEnding
WantSomething -> UnderStand ButEtc

Understand -> areFriendly butMisunderstood
Understand -> misunderstandUs
Understand -> understandUsAllTooWell
Understand -> hungry

DenoumentOrHappyEnding -> Denoument
DenoumentOrHappyEnding -> HappyEnding
 
Dine -> Hungry and eat us Denoument?

WhichEtc -> 
Hungry -> lookUponUsAsASourceOfNourishment

WantWomen -> wantOurWomen, AndTakeAFewAndLeave

ButEtc -> AndAre radioactive and TryToKill

AndAre -> andAre
AndAre -> andAreNot

Killers -> Killer 
Killers -> Killer and Killer

Killer -> aCrowdOfPeasants
Killer -> theArmy
Killer -> theNavy
Killer -> theAirForce
Killer -> theMarines
Killer -> theCoastGuard
Killer -> theAtomBomb

TryToKill -> can be killed by Killers
TryToKill -> can not be killed by Killers SoEtc

SoEtc -> butTheyDieFromCatchingACold
SoEtc -> soTheyKillUs
SoEtc -> soTheyPutUsUnderABenignDictatorShip
SoEtc -> soTheyEatUs
SoEtc -> soScientistsInventAWeapon Which
SeEtc -> but Denoument

Which -> whichTurnsThemIntoDisgustingLumps
Which -> whichKillsThem
Which -> whichFails SoEtc

Denomument? ->  
Denomument? -> Denoument  

Denoument ->  aCuteLittleKidConvincesThemPeopleAreOk Ending
Denoument -> aPriestTalksToThemOfGod Ending
Denoument -> theyFallInLoveWithThisBeautifulGirl EndSadOrHappy

EndSadOrHappy -> Ending
EndSadOrHappy -> HappyEnding

Ending -> andTheyDie
Ending -> andTheyLeave
Ending -> andTheyTurnIntoDisgustingLumps

HappyEnding -> andTheyGetMarriedAndLiveHappilyForeverAfter

Biasing the Story

Here is a grammar suitable for storyp.awk. Note the number at the end of each line, which biases how often that production is selected. For example, "runs" and "slowly" are nine times more likely than the other Verb and Adverb.

Sentence -> Nounphrase Verbphrase   1
Nounphrase -> the boy               0.75
Nounphrase -> the girl              0.25
Verbphrase -> Verb Modlist Adverb   1
Verb -> runs                        0.9
Verb -> walks                       0.1
Modlist ->                          0.5
Modlist -> very Modlist             0.5
Adverb -> quickly                   0.1
Adverb -> slowly                    0.9
The following code executes the biased story generation:
for((i=1;i<=10;i++)); do echo Sentence ;  done |
gawk -f ../storyp.awk -v Grammar=englishp.rules 

This produces the following output. Note that, usually, we run slowly.

the boy runs very slowly 
the boy runs slowly 
the girl runs very slowly 
the boy runs slowly 
the boy runs slowly 
the girl walks very slowly 
the boy walks slowly 
the girl runs slowly 
the boy runs slowly 
the boy runs slowly 

Code

Story.awk

BEGIN {
    srand(Seed ? Seed : 1)
    Grammar = Grammar ? Grammar : "grammar"
    while ((getline < Grammar) > 0)    # parenthesized to avoid parsing ambiguity
        if ($2 == "->") {
            i = ++lhs[$1]              # count lhs
            rhscnt[$1, i] = NF-2       # how many in rhs
            for (j = 3; j <= NF; j++)  # record them
                rhslist[$1, i, j-2] = $j
        } else if ($0 !~ /^[ \t]*$/)   # ignore blank lines
            print "illegal production: " $0
}
{   if ($1 in lhs) {  # nonterminal to expand
        gen($1)
        printf("\n")
    } else
        print "unknown nonterminal: " $0
}
function gen(sym,    i, j) {
    if (sym in lhs) {       # a nonterminal
        i = int(lhs[sym] * rand()) + 1   # random production
        for (j = 1; j <= rhscnt[sym, i]; j++) # expand rhs's
            gen(rhslist[sym, i, j])
    } else {                # a terminal
        gsub(/[A-Z]/, " &", sym)   # split camelCase words before printing
        printf("%s ", sym)
    }
}
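
The gsub call in gen() is what turns the camelCase terminals of the sci-fi grammar into separate words. A minimal sketch of that one step follows; the shell function name spaced is an assumption made for this illustration, and only the gsub line is taken from story.awk.

```shell
# spaced WORD: show how story.awk prints a camelCase terminal.
# (spaced is a hypothetical wrapper around the gsub call in gen().)
spaced() {
  printf '%s\n' "$1" | awk '{
      sym = $1
      gsub(/[A-Z]/, " &", sym)   # "&" stands for the matched capital letter
      print sym
  }'
}

spaced isStruckByAGiant    # is Struck By A Giant
spaced UnderStand          # " Under Stand", with a leading space
```

A terminal that starts with a capital letter gains a leading space, which accounts for the doubled spaces in some of the generated stories.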

Storyp.awk

Storyp.awk is almost the same as story.awk but it is assumed that each line ends in a number that will bias how often that production gets selected.

BEGIN {
    srand(Seed ? Seed : 1) 
    Grammar = Grammar ? Grammar : "grammar"
    while ((getline < Grammar) > 0)
        if ($2 == "->") {
            i = ++lhs[$1]              # count lhs
            rhsprob[$1, i] = $NF       # 0 <= probability <= 1
            rhscnt[$1, i] = NF-3       # how many in rhs
            for (j = 3; j < NF; j++)   # record them
               rhslist[$1, i, j-2] = $j
        } else if ($0 !~ /^[ \t]*$/)   # ignore blank lines, as story.awk does
            print "illegal production: " $0
    for (sym in lhs)
         for (i = 2; i <= lhs[sym]; i++)
            rhsprob[sym, i] += rhsprob[sym, i-1]
}
{   if ($1 in lhs) {  # nonterminal to expand
         gen($1)
         printf("\n")
     } else 
         print "unknown nonterminal: " $0   
}
function gen(sym,    i, j) {
    if (sym in lhs) {       # a nonterminal
        j = rand()          # random production
        for (i = 1; i <= lhs[sym] && j > rhsprob[sym, i]; i++) ;       
        for (j = 1; j <= rhscnt[sym, i]; j++) # expand rhs's
            gen(rhslist[sym, i, j])
    } else
        printf("%s ", sym)
}
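
The selection step in storyp.awk works on running totals of the weights: the BEGIN block turns each production's weight into a cumulative sum, and gen() picks the first production whose running total is not below the random draw. The sketch below isolates that idea for the two Adverb productions. The function name pick is hypothetical, and the random draw is passed in as an argument (instead of calling rand()) so the result is repeatable.

```shell
# pick R: choose an Adverb for a given draw R, 0 <= R < 1.
# (pick is a hypothetical helper; weights are those of the Adverb rules.)
pick() {
  awk -v r="$1" 'BEGIN {
      n = split("quickly slowly", word, " ")
      prob[1] = 0.1; prob[2] = 0.9              # weights from the grammar
      for (i = 2; i <= n; i++)                  # running totals: 0.1, 1.0
          prob[i] += prob[i-1]
      for (i = 1; i <= n && r > prob[i]; i++) ; # first total not below r
      print word[i]
  }'
}

pick 0.05   # quickly
pick 0.50   # slowly
```

This only behaves as intended when the weights for each left-hand side sum to 1; otherwise a draw can fall past the last running total and nothing is selected.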

Author

The code comes from Alfred Aho, Brian Kernighan, and Peter Weinberger from the book "The AWK Programming Language", Addison-Wesley, 1988.

The scifi grammar was written by Tim Menzies, 2009, and is based on Gahan Wilson's sci-fi plot generator: "The Science Fiction Horror Movie Pocket Computer" (in "The Year's Best Science Fiction No. 5", edited by Harry Harrison and Brian Aldiss, Sphere, London, 1972).


categories: TenLiners,Mar,2009,DonaldM

The Monty Hall Problem

Donald 'Paddy' McCarthy has a nice Awk solution to the Monty Hall Problem, which he describes as follows:

  • The contestant is in front of three doors that he cannot see behind.
  • The three doors conceal one prize and two booby prizes, arranged randomly.
  • The host asks the contestant to choose a door.
  • The host then goes behind the doors where only he can see what is concealed, then always opens one door, out of the others not chosen by the contestant, that must reveal a booby prize to the contestant.
  • The host then asks the contestant if he would like either to stick with his previous choice, or switch and choose the other remaining closed door.

It turns out that if the contestant follows a strategy of always switching when asked, then he will maximise his chances of winning. Donald's simulator shows that:

  • A strategy of never switching wins 1/3rd of the time.
  • A strategy of randomly switching wins 1/2 of the time.
  • A strategy of always switching wins 2/3rds of the time.
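
These fractions can also be obtained without simulation by enumerating every combination of prize door and first choice. The following is a minimal sketch, separate from Donald's simulator; the function name monty_exact and the counter names are assumptions made for this illustration. It relies on the same observation as the simulator: after the host opens a booby-prize door, switching wins exactly when the first choice was wrong.

```shell
# monty_exact: count wins over all 9 equally likely (prize, choice) pairs.
# (monty_exact is a hypothetical helper, not part of montyHall.awk.)
monty_exact() {
  awk 'BEGIN {
      doors = 3
      for (prize = 0; prize < doors; prize++)
          for (choice = 0; choice < doors; choice++) {
              total++
              if (choice == prize) keep++   # sticking wins
              else                 swap++   # the other closed door hides the prize
          }
      printf "keep %d/%d switch %d/%d\n", keep, total, swap, total
  }'
}

monty_exact   # keep 3/9 switch 6/9
```

Choosing between the two strategies with a coin flip averages the two counts, giving the 1/2 quoted above.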

Code

BEGIN {
    srand()
    doors = 3
    iterations = 10000
    # Behind a door:
    EMPTY = "empty"; PRIZE = "prize"
    # Algorithm used
    KEEP = "keep"; SWITCH = "switch"; RAND = "random"
}
function monty_hall(choice, algorithm,    i, chosen, alternative) {
    for (i = 0; i < doors; i++)          # Set up doors
        door[i] = EMPTY
    door[int(rand()*doors)] = PRIZE      # One door with prize

    chosen = door[choice]
    delete door[choice]

    # If you didn't choose the prize first time around then
    # that will be the alternative
    alternative = (chosen == PRIZE) ? EMPTY : PRIZE

    if (algorithm == KEEP)
        return chosen
    if (algorithm == SWITCH)
        return alternative
    return rand() < 0.5 ? chosen : alternative
}
function simulate(algo,    j, prizecount) {
    prizecount = 0
    for (j = 0; j < iterations; j++)
        if (monty_hall(int(rand()*doors), algo) == PRIZE)
            prizecount++
    printf "  Algorithm %7s: prize count = %d, = %6.2f%%\n", \
        algo, prizecount, prizecount*100/iterations
}
BEGIN {
    print "\nMonty Hall problem simulation:"
    print doors, "doors,", iterations, "iterations.\n"
    simulate(KEEP)
    simulate(SWITCH)
    simulate(RAND)
}

Sample Output

gawk -f montyHall.awk

Monty Hall problem simulation:
3 doors, 10000 iterations.

  Algorithm    keep: prize count = 3411, =  34.11%
  Algorithm  switch: prize count = 6655, =  66.55%
  Algorithm  random: prize count = 4991, =  49.91%

categories: Top10,TenLiners,Mar,2009,ScottP

Predicting Gender

Synopsis

echo name | gawk -f gender.awk

Download

Download from LAWKER

Description

The following code predicts gender, given a first name.

This code is an excellent example of rule-based programming in Awk.

For a full description of the code, see

Code

                                          { sex = "m" } # Assume male.

/^.*[aeiy]$/                              { sex = "f" }  # Female names ending in a/e/i/y.
/^All?[iy]((ss?)|z)on$/                   { sex = "f" }  # Allison (and variations)
/^.*een$/                                 { sex = "f" }  # Cathleen, Eileen, Maureen,...
/^[^S].*r[rv]e?y?$/                       { sex = "m" }  # Barry, Larry, Perry,...
/^[^G].*v[ei]$/                           { sex = "m" }  # Clive, Dave, Steve,...
/^[^BD].*(b[iy]|y|via)nn?$/               { sex = "f" }  # Carolyn,Gwendolyn,Vivian,...
/^[^AJKLMNP][^o][^eit]*([glrsw]ey|lie)$/  { sex = "m" }  # Dewey, Stanley, Wesley,...
/^[^GKSW].*(th|lv)(e[rt])?$/              { sex = "f" }  # Heather, Ruth, Velvet,...
/^[CGJWZ][^o][^dnt]*y$/                   { sex = "m" }  # Gregory, Jeremy, Zachary,...
/^.*[Rlr][abo]y$/                         { sex = "m" }  # Leroy, Murray, Roy,...
/^[AEHJL].*il.*$/                         { sex = "f" }  # Abigail, Jill, Lillian,...
/^.*[Jj](o|o?[ae]a?n.*)$/                 { sex = "f" }  # Janet, Jennifer, Joan,...
/^.*[GRguw][ae]y?ne$/                     { sex = "m" }  # Duane, Eugene, Rene,...
/^[FLM].*ur(.*[^eotuy])?$/                { sex = "f" }  # Fleur, Lauren, Muriel,...
/^[CLMQTV].*[^dl][in]c.*[ey]$/            { sex = "m" }  # Lance, Quincy, Vince,...
/^M[aei]r[^tv].*([^cklnos]|([^o]n))$/     { sex = "f" }  # Margaret, Marylou, Miriam,...
/^.*[ay][dl]e$/                           { sex = "m" }  # Clyde, Kyle, Pascale,...
/^[^o]*ke$/                               { sex = "m" }  # Blake, Luke, Mike,...
/^[CKS]h?(ar[^lst]|ry).+$/                { sex = "f" }  # Carol, Karen, Sharon,...
/^[PR]e?a([^dfju]|qu)*[lm]$/              { sex = "f" }  # Pam, Pearl, Rachel,...
/^.*[Aa]nn.*$/                            { sex = "f" }  # Annacarol, Leann, Ruthann,...
/^.*[^cio]ag?h$/                          { sex = "f" }  # Deborah, Leah, Sarah,...
/^[^EK].*[grsz]h?an(ces)?$/               { sex = "f" }  # Frances, Megan, Susan,...
/^[^P]*([Hh]e|[Ee][lt])[^s]*[ey].*[^t]$/  { sex = "f" }  # Ethel, Helen, Gretchen,...
/^[^EL].*o(rg?|sh?)?(e|ua)$/              { sex = "m" }  # George, Joshua, Theodore,..
/^[DP][eo]?[lr].*se$/                     { sex = "f" }  # Delores, Doris, Precious,...
/^[^JPSWZ].*[denor]n.*y$/                 { sex = "m" }  # Anthony, Henry, Rodney,...
/^K[^v]*i.*[mns]$/                        { sex = "f" }  # Karin, Kim, Kristin,...
/^Br[aou][cd].*[ey]$/                     { sex = "m" }  # Bradley, Brady, Bruce,...
/^[ACGK].*[deinx][^aor]s$/                { sex = "f" }  # Agnes, Alexis, Glynis,...
/^[ILW][aeg][^ir]*e$/                     { sex = "m" }  # Ignace, Lee, Wallace,...
/^[^AGW][iu][gl].*[drt]$/                 { sex = "f" }  # Juliet, Mildred, Millicent,...
/^[ABEIUY][euz]?[blr][aeiy]$/             { sex = "m" }  # Ari, Bela, Ira,...
/^[EGILP][^eu]*i[ds]$/                    { sex = "f" }  # Iris, Lois, Phyllis,...
/^[ART][^r]*[dhn]e?y$/                    { sex = "m" }  # Randy, Timothy, Tony,...
/^[BHL].*i.*[rtxz]$/                      { sex = "f" }  # Beatriz, Bridget, Harriet,...
/^.*oi?[mn]e$/                            { sex = "m" }  # Antoine, Jerome, Tyrone,...
/^D.*[mnw].*[iy]$/                        { sex = "m" }  # Danny, Demetri, Dondi,...
/^[^BG](e[rst]|ha)[^il]*e$/               { sex = "m" }  # Pete, Serge, Shane,...
/^[ADFGIM][^r]*([bg]e[lr]|il|wn)$/        { sex = "f" }  # Angel, Gail, Isabel,...

                                          { print sex }  # Output prediction
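
The rule-based style works because Awk tests every pattern in order for each input line, so a later rule that matches overrides an earlier one. The sketch below cuts the program down to the default, one general rule, and one exception, all copied from the listing above; the shell function name guess is an assumption for this illustration, and it prints the name next to the prediction.

```shell
# guess NAME...: three rules from gender.awk, applied in order.
# (guess is a hypothetical wrapper; the last matching rule wins.)
guess() {
  printf '%s\n' "$@" | awk '
                          { sex = "m" }   # default: assume male
      /^.*[aeiy]$/        { sex = "f" }   # general rule: ends in a/e/i/y
      /^[^S].*r[rv]e?y?$/ { sex = "m" }   # exception, listed later so it wins
                          { print $0, sex }'
}

guess Barry Mary John
# Barry m
# Mary f
# John m
```

"Barry" matches both patterns and ends up "m" only because the exception comes after the general rule, so the order of the rules is part of the program's logic.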

Author

by Scott Pakin, August 1991

categories: CMS,Tools,Mar,2010,Timm

TinyTim: a Content Management System

TINY TIM is a tiny web-site manager written in AWK. For a live demo of the site, see http://at.ttoy.net/?tinytim. The site supports runtime content generation; e.g. the quote shown top right of the demo site is auto-generated each time you refresh the page.

The site was written to demonstrate that a little AWK goes a long way. At the time of this writing, the current system is under 100 lines of code (excluding a separate formatter, which is another 170 lines of code). It took longer to write this doco and the various HTML/CSS theme files than the actual code itself (fyi: 6 hours for the themes/doc and 3 hours for the code).

TINY TIM has the following features:

  • Pages can be accessed by their (lowercase) name, or by their (uppercase) tags.
  • Pages can be displayed using a set of customizable themes.
  • Page contents can be written using a HTML shorthand language called MARKUP.
  • Pages can be searched using a Google search box.
  • Source code is auto-displayed using a syntax highlighter.
  • Page content can be auto-created via programmer-modifiable plugins.

Install

In a web accessible directory, type

 svn export http://knit.googlecode.com/svn/branches/0.2/tinytim/ 

In the resulting directory, perform the local juju required to make index.cgi web-runnable (e.g. on my ISP, chmod u+rx index.cgi).

Follow the directions in the next section to customize the site.

Using TINY TIM

index.cgi

TINY TIM is controlled by the following index.cgi file. To select a theme, comment out all but one of the last lines (using the "#" character). For screenshots of the current themes, see below.

#!/bin/bash
 
[ -n "$1" ] && export QUERY_STRING="$1"
 
tinytim() {
  cat content/* themes/$1/theme.txt |
  gawk -f lib/tinytim.awk |
  sed 's/^<pre>/<script type="syntaxhighlighter" class="brush: cpp"><![CDATA[/' |
  sed 's/^<\/pre>/<\/script>/'
} 
  
 #tinytim auklet
 #tinytim trendygreen
 tinytim wink

Notes:

  • The sed commands: these render normal <pre> using Alex Gorbatchev's excellent syntax highlighter. To change the highlighting rules for a different language, change brush: cpp to one of the supported aliases.
  • The cat command: this assembles the content for the system. Multiple authors can write multiple files in the sub-directory content.

Themes

Themes are defined in the sub-directory themes/themename. Each theme is defined by a theme.txt file that holds:

  • The HTML template for the theme.
  • The in-line style sheet for the theme.
  • The page contents with pre-defined string names marked with ``; e.g. ``title``. To change those strings, see the instructions at the end of this page.
  • If a `` entry contains a semi-colon (e.g. ``quotes;``) then it is a plugin. Plugin content is generated at runtime using a method described at the end of this document.

To write a new theme:

  1. Create a new folder themes/new.
  2. Copy (e.g.) wink/theme.txt to new.
  3. Using the copied theme as a template, start tinkering.

The following themes are defined in the directory themes.

Auklet:

Trendygreen (adapted from GetTemplates):

Wink:

Defining String Values

The first entry in the content defines strings that can slip into the theme templates. For example, the following slots define the title of a site; the name of the formatter script that renders each page; the url of the home directory of the site; a menu to add to the top of each page; a footer to add to the bottom of each page; and a web-accessible directory for storing images.

 ``title``       Just another Tiny Tim demo
 ``formatter``   lib/markup.awk
 ``description`` (simple cms)
 ``home``        http://at.ttoy.net
 ``menu``        <a href="?index">Home</a> | 
                 <a href="?contact">Contact</a>  |
                 <a href="?about">About</a>
 ``footer``      <p>Powered by <a href="?tinytim">TINY TIM</a>. 
                                 © 2010 by Tim Menzies 
 ``images``      http://at.ttoy.net/img

Note the following important convention. TINY TIM auto-generates some of its own strings. The names of these strings start with an uppercase letter. To avoid confusing your strings with those that are auto-generated, it is best to start your strings with a lower-case letter (e.g. like all those in the above example).
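
To make the idea of slots concrete, here is a minimal sketch of filling a template from string definitions. It is not the code in lib/tinytim.awk, which is not shown on this page: the function name fill, the one-line-per-value restriction, and the literal (index-based) replacement are all assumptions made for this illustration.

```shell
# fill: read ``name`` definitions, then substitute them into template lines.
# (Hypothetical sketch; one-line values only, lower-case names only.)
fill() {
  awk '
      # A definition line: ``name`` followed by the value.
      /^``[a-z]+``/ {
          name = $1
          sub(/^``[a-z]+``[ \t]*/, "")
          slot[name] = $0
          next
      }
      # Any other line is template text: replace each ``name`` literally.
      {
          for (name in slot)
              while ((i = index($0, name)) > 0)
                  $0 = substr($0, 1, i - 1) slot[name] substr($0, i + length(name))
          print
      }'
}

printf '%s\n' '``title`` Just another Tiny Tim demo' '<h1>``title``</h1>' | fill
# <h1>Just another Tiny Tim demo</h1>
```

Using index and substr instead of gsub avoids surprises when a value contains "&" or a backslash. A value that contains its own slot name would loop forever in this sketch.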

Adding a Search Engine

Google offers a nice free site-specific search engine. It takes a few days for the spiders to find the site but after that, it works fine. To set this up, follow the instructions at Google custom search, then

  • Add the appropriate magic strings into the first entry of the content (usually content/0config.txt).
  • Add references to those strings to your template.

For example, look for google-search in the current templates and content/0config.txt.

Writing pages

After the first entry, the rest of the entries in content/* define the pages of a site:

  • Each entry must begin with the magic string #12345
  • The entry consists of paragraphs (separated by blank lines).
  • Paragraph one contains the (short) page name (on line one) followed by the page tags (on line two).
      • Note that the page name must start with a lower case letter.
      • And the tags must start with an upper case letter.
  • Paragraph two contains the heading of the page.
  • The remaining paragraphs are the page contents.

For example, this site contains a missing page report. This page is defined as follows. In the following definition of that page, the name is "404"; the tags are "Admin Feb10" and the title is "Sorry".

 #12345####################################################################################
 
 404
 Admin Feb10
  
 Sorry
  
 I have bad news:
 
 <center>
 [img/404book.jpg]
 </center>

The contents can contain HTML and MARKUP tags.
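
The entry layout maps naturally onto Awk's paragraph mode, where an empty RS makes each blank-line-separated paragraph one record. The following sketch pulls the name, tags, and heading out of an entry. It is an illustration only, not the parser in lib/tinytim.awk; the function name page_header and the "|" separated output are assumptions.

```shell
# page_header: print name|tags|heading for each entry on standard input.
# (Hypothetical sketch of reading the entry format described above.)
page_header() {
  awk 'BEGIN { RS = ""; FS = "\n" }          # paragraph mode, one line per field
      /^#12345/ { n = 0; next }              # magic string starts a new entry
                { n++ }                      # count paragraphs in this entry
      n == 1    { name = $1; tags = $2 }     # paragraph one: name, then tags
      n == 2    { print name "|" tags "|" $1 }   # paragraph two: the heading
  '
}

printf '%s\n' '#12345####' '' '404' 'Admin Feb10' '' 'Sorry' '' 'I have bad news:' |
page_header
# 404|Admin Feb10|Sorry
```

Paragraph mode only splits on lines that are completely empty, so this sketch assumes the separating lines contain no stray spaces.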

MARKUP

MARKUP is a shorthand for writing HTML pages based on MARKDOWN:

  • Italics, bold, and typewriter font are marked by matching _, *, and ` characters (respectively).
  • Lists are marked by leading "+" characters.
  • Numbered lists are marked by leading "1." strings.
  • Links are enclosed in [square brackets]. The first word in the bracket is the URL and subsequent words are the text for the URL link.
  • Images are marked up with the same [square brackets], but the first word must end in one of .png, .gif, .jpg. Any subsequent words are passed as tags to the <img> tag.

Also, in MARKUP, major, minor, sub-, and sub-sub- headings are two line paragraphs where the second line contains two or more "=", "-", "+", "_" (respectively). MARKUP collects these headings as a table of contents, which is added to the top of the page.

Note that MARKUP is separate from TINY TIM. To change the formatting of pages, write your own AWK code and change the string ``formatter`` in the first entry of content/0config.txt.
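
As a starting point for a formatter of your own, the sketch below handles only the three inline markers. It is not the code in lib/markup.awk: the function names markup_inline and wrap, and the choice of the <i>, <b>, and <code> tags, are assumptions made for this illustration.

```shell
# markup_inline: convert _italics_, *bold* and `typewriter` spans to HTML.
# (Hypothetical sketch; lists, links, images and headings are not handled.)
markup_inline() {
  awk '
      # Wrap each span matched by re in <tag>...</tag>, dropping the
      # one-character marker at each end of the span.
      function wrap(s, re, tag,    out) {
          out = ""
          while (match(s, re)) {
              out = out substr(s, 1, RSTART - 1)
              out = out "<" tag ">" substr(s, RSTART + 1, RLENGTH - 2) "</" tag ">"
              s = substr(s, RSTART + RLENGTH)
          }
          return out s
      }
      {
          line = wrap($0,   "_[^_]+_",     "i")
          line = wrap(line, "[*][^*]+[*]", "b")
          line = wrap(line, "`[^`]+`",     "code")
          print line
      }'
}

printf '%s\n' 'a _little_ *awk* goes `far`' | markup_inline
# a <i>little</i> <b>awk</b> goes <code>far</code>
```

The loop uses match and substr because plain Awk's gsub cannot refer to part of a match; gawk users could use gensub instead.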

Plugins

If a `` entry contains a semi-colon (e.g. ``quotes;``) then it is a plugin. Plugin content is generated at runtime. To write a plugin, modify the file lib/plugins.awk. Currently, that file looks like this:

 function slotsPlugIns(str,slots,   tmp) {
    split(str,tmp,";")
    if (tmp[1]=="quotes")
        return quotes(str,slots)
    return str
 }
 function quotes(str,slots,    n,tmp) {
    srand(systime() + PROCINFO["pid"])
    n=split(slots["quotes"],tmp,"\n")
    return tmp[int(rand()*n) + 1]
 }

The function slotsPlugIns is a "traffic-cop" who decides what plugin to call (in the above, there is only one current plugin: quotes).

Each plugin function (e.g. quotes) is passed the string from the template (see str) and an array of key/value pairs holding all the defined string values (see slots). These functions must return a string to be inserted into the rendered HTML.

In the example above, quotes just returns a random quote. It assumes that the predefined strings includes a set of quotes, one per line:

 ``quotes`` Small  things with great love. <br>-- Mother Teresa
     It's hard work to make it look effortless.<br>-- Katarina Witt
    "God bless us every one!".<br>-- Tiny Tim

The quote generated by this plugin can be viewed at the top right of this page.
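
Adding a plugin means adding one branch to the traffic-cop and one function that returns a string. The sketch below shows the shape of such a change with a made-up plugin called count, which reports how many quotes are defined. The plugin name, the function plugin_demo, and the test data are assumptions; the quotes branch is left out so the example runs without gawk-only features.

```shell
# plugin_demo: a dispatcher with one hypothetical plugin, "count".
plugin_demo() {
  awk '
      function slotsPlugIns(str, slots,    tmp) {
          split(str, tmp, ";")
          if (tmp[1] == "count")          # hypothetical new plugin
              return count(str, slots)
          return str                      # not a plugin: leave unchanged
      }
      # Return the number of lines stored in the "quotes" slot.
      function count(str, slots,    tmp) {
          return split(slots["quotes"], tmp, "\n")
      }
      BEGIN {
          slots["quotes"] = "first quote\nsecond quote\nthird quote"
          print slotsPlugIns("count;", slots)   # 3
          print slotsPlugIns("title", slots)    # title
      }'
}

plugin_demo
```

Like quotes, the new function receives the template string and the slots array and must return the text to insert.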


categories: News,Mar,2009,Admin

The Awk Book's Code

Brian Kernighan has granted permission for this site to host the code from the original Awk book:

  • The AWK Programming Language
  • by Alfred V. Aho, Brian W. Kernighan, and Peter J. Weinberger,
  • Addison-Wesley, 1988.
  • ISBN 0-201-07981-X.

The code can be viewed here.


categories: Dsl,Mar,2009,Admin

Domain-Specific Languages

These pages focus on domain-specific languages (a.k.a. "little languages") written in Awk.

These little languages can range from the simple to the quite intricate. For example, LAWKER contains code for

  • Simple:
    • Graph- a simple ascii graph generator;
    • Markdown- an ultra lightweight HTML markup language;
  • Intricate:
    • Awk++- enables object-oriented programming in Awk;
    • AwkLisp- a fully functioning LISP interpreter, written in Awk.

Interestingly, without comments, the LISP interpreter is only three times longer than the HTML markup language. This comments either on the power of Awk, the regularity of LISP's core semantics, or both.


categories: Dsl,Mar,2009,BrianK,PeterW,AlfredA

Graph.awk

Synopsis

gawk -f graph.awk graphFile

Description

A processor for a little language, specialized for graph-drawing.

The code inputs data, which includes a specification of a graph. The output is the data plotted in the specified area.

For example, here is an input specification:

label here's some stuff
bottom ticks 1 5 10 
left ticks 1 2 10 20
range 1 1 10 22
height 10
width 30
1 2 *
2 4 * 
3 6 *
4 8 *
7 14 +
8 12 +
9 10 +
mb 0.9 11 =

It produces the following output:

      |----------------------|
20    -                 = =  =
      |       = =  = =       |
      =  = =         +  +    |
10    -                   +  |
      |    *  *              |
      |  *                   |
2     *---------|------------|
     1         5            10
         here's some stuff    

Code

Initialization

Set frame dimensions: height and width; offset for x and y axes.

BEGIN {                
    ht = 24; wid = 80  
    ox = 6; oy = 2     
    number = "^[-+]?([0-9]+[.]?[0-9]*|[.][0-9]+)" \
                            "([eE][-+]?[0-9]+)?$"
}

Handling patterns

Skip comments

/^[ \t]*#/     { next } 

Simple tags

$1 == "height" { ht = $2;  next }
$1 == "width"  { wid = $2; next }
$1 == "label"  {                       # for bottom
    sub(/^ *label */, "")
    botlab = $0
    next
}
$1 == "bottom" && $2 == "ticks" {     # ticks for x-axis
    for (i = 3; i <= NF; i++) bticks[++nb] = $i
    next
}
$1 == "left" && $2 == "ticks" {       # ticks for y-axis
    for (i = 3; i <= NF; i++) lticks[++nl] = $i
    next
}
$1 == "range" {                       # xmin ymin xmax ymax
    xmin = $2; ymin = $3; xmax = $4; ymax = $5
    next
}

Handling numerics.

$1 ~ number && $2 ~ number {  # pair of numbers
    nd++                      # count number of data points
    x[nd] = $1; y[nd] = $2
    ch[nd] = $3               # optional plotting character
    next
}
$1 ~ number && $2 !~ number { # single number
    nd++                      # count number of data points
    x[nd] = nd; y[nd] = $1; ch[nd] = $2
    next
}
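
The two numeric rules depend on the number pattern set in the BEGIN block, so it helps to see what that pattern accepts. The sketch below applies the same regular expression to a single string; the shell function name is_number is an assumption made for this illustration.

```shell
# is_number STRING: test a string against graph.awk's number pattern.
# (is_number is a hypothetical helper; the regex is copied from BEGIN.)
is_number() {
  awk -v s="$1" 'BEGIN {
      number = "^[-+]?([0-9]+[.]?[0-9]*|[.][0-9]+)([eE][-+]?[0-9]+)?$"
      print ((s ~ number) ? "number" : "other")
  }'
}

is_number 12      # number
is_number -3.5    # number
is_number .5e3    # number
is_number 1e      # other (exponent needs digits)
is_number abc     # other
```

A line such as "7 14 +" therefore takes the pair-of-numbers rule, while "14 +" takes the single-number rule because "+" alone is not a number.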

Line functions, defined by a slope "m" and a y-intercept "b".

$1 == "mb" {  # m b [mark]
    expand()
    for (i = xmin; i <= xmax; i++) {
        nd++; x[nd] = i; y[nd] = $2*i + $3; ch[nd] = $4
    }
    next
}

Final case: input error.

{ print "?? line " NR ": ["$0"]" >"/dev/stderr" }

Draw the graph

END { expand();   frame(); ticks(); label(); data(); draw() }

Functions

Expand the "x" and "y" boundaries to include all points.

function expand(note) { if (xmin == "") expand1(note) }
function expand1(note) {
    xmin = xmax = x[1]
    ymin = ymax = y[1]
    for (i = 2; i <= nd; i++) {
        if (x[i] < xmin) xmin = x[i]
        if (x[i] > xmax) xmax = x[i]
        if (y[i] < ymin) ymin = y[i]
        if (y[i] > ymax) ymax = y[i]
    }
}

Draw the frame around the graph.

function frame() {        
    for (i = ox; i < wid; i++) plot(i, oy, "-")     # bottom
    for (i = ox; i < wid; i++) plot(i, ht-1, "-")   # top
    for (i = oy; i < ht; i++) plot(ox, i, "|")      # left
    for (i = oy; i < ht; i++) plot(wid-1, i, "|")   # right
}

Create tick marks for both axes.

function ticks(    i) {   
    for (i = 1; i <= nb; i++) {
        plot(xscale(bticks[i]), oy, "|")
        splot(xscale(bticks[i])-1, 1, bticks[i])
    }
    for (i = 1; i <= nl; i++) {
        plot(ox, yscale(lticks[i]), "-")
        splot(0, yscale(lticks[i]), lticks[i])
    }
}

Center labels under x-axis.

function label() {        
    splot(int((wid + ox - length(botlab))/2), 0, botlab)
}

Create data points.

function data(    i) {    
    for (i = 1; i <= nd; i++)
        plot(xscale(x[i]),yscale(y[i]),ch[i]=="" ? "*" : ch[i])
    for(i in mark) print mark[i]
}

Print graph from array.

function draw(    i, j) { 
    for (i = ht-1; i >= 0; i--) {
        for (j = 0; j < wid; j++)
            printf((j,i) in array ? array[j,i] : " ")
        printf("\n")
    }
}

Scale x-values, y-values.

function xscale(x) {      
    return int((x-xmin)/(xmax-xmin) * (wid-1-ox) + ox + 0.5)
}
function yscale(y) {      
    return int((y-ymin)/(ymax-ymin) * (ht-1-oy) + oy + 0.5)
}
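
A worked example shows how the scaling lines up with the sample graph. Using the sample input (width 30, x range 1 to 10) and the default x offset of 6, the sketch below computes the columns for the bottom ticks 1, 5, and 10. The function name scale_demo is an assumption; xscale is copied from the listing above.

```shell
# scale_demo: columns of the bottom ticks in the sample graph.
# (scale_demo is a hypothetical wrapper around xscale.)
scale_demo() {
  awk 'function xscale(x) {
          return int((x-xmin)/(xmax-xmin) * (wid-1-ox) + ox + 0.5)
      }
      BEGIN {
          wid = 30; ox = 6          # "width 30" and the default x offset
          xmin = 1; xmax = 10       # from "range 1 1 10 22"
          print xscale(1), xscale(5), xscale(10)
      }'
}

scale_demo   # 6 16 29
```

Counting columns from zero, these are the positions of the "*", the middle "|", and the right-hand "|" on the bottom axis of the sample output.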

Put one character into array.

function plot(x, y, c) {  
    array[x,y] = c
}

Put string "s" into array.

function splot(x, y, s,    i, n) { 
    n = length(s)
    for (i = 0; i < n; i++)
        array[x+i, y] = substr(s, i+1, 1)
}

Author

This code comes from the original Awk book by Alfred Aho, Peter Weinberger & Brian Kernighan and contains some small modifications by Tim Menzies.


categories: Eliza,Top10,AwkLisp,Interpreters,Dsl,Mar,2009,DariusB

AWKLISP v1.2

Download from

Synopsis

awk [-v profiling=1] -f awklisp [optional-Lisp-source-files]

The -v profiling=1 option turns call-count profiling on.

If you want to use it interactively, be sure to include '-' (for the standard input) among the source files. For example:

gawk -f awklisp startup numbers lists -

Description

Overview

This program arose out of one-upmanship. At my previous job I had to use MapBasic, an interpreter so astoundingly slow (around 100 times slower than GWBASIC) that one must wonder if it itself is implemented in an interpreted language. I still wonder, but it clearly could be: a bare-bones Lisp in awk, hacked up in a few hours, ran substantially faster. Since then I've added features and polish, in the hope of taking over the burgeoning market for stately language implementations.

This version tries to deal with as many of the essential issues in interpreter implementation as is reasonable in awk (though most would call this program utterly unreasonable from start to finish, perhaps...). Awk's impoverished control structures put error recovery and tail-call optimization out of reach, in that I can't see a non-painful way to code them. The scope of variables is dynamic because that was easier to implement efficiently. Subject to all those constraints, the language is as Schemely as I could make it: it has a single namespace with uniform evaluation of expressions in the function and argument positions, and the Scheme names for primitives and special forms.

The rest of this file is a reference manual. My favorite tutorial would be The Little LISPer (see section 5, References); don't let the cute name and the cartoons turn you off, because it's a really excellent book with some mind-stretching material towards the end. All of its code will work with awklisp, except for the last two chapters. (You'd be better off learning with a serious Lisp implementation, of course.)

For more details on the implementation, see the Implementation notes (below).

Examples

fib.lsp

Code:

(define fib
  (lambda (n)
    (if (< n 2)
        1
        (+ (fib (- n 1))
           (fib (- n 2))))))
(fib 20)

Command line:

gawk -f awklisp startup numbers  lists fib.lsp

Output:

10946

Eliza

Here are the standard ELIZA dialogue patterns:

(define rules
  '(((hello)
     (How do you do -- please state your problem))
    ((I want)
     (What would it mean if you got -R-)
     (Why do you want -R-)
     (Suppose you got -R- soon))
    ((if)
     (Do you really think its likely that -R-)
     (Do you wish that -R-)
     (What do you think about -R-)
     (Really-- if -R-))
    ((I was)
     (Were you really?)
     (Perhaps I already knew you were -R-)
     (Why do you tell me you were -R- now?))
    ((I am)
     (In what way are you -R-)
     (Do you want to be -R-))
    ((because)
     (Is that the real reason?)
     (What other reasons might there be?)
     (Does that reason seem to explain anything else?))
    ((I feel)
     (Do you often feel -R-))
    ((I felt)
     (What other feelings do you have?))
    ((yes)
     (You seem quite positive)
     (You are sure)
     (I understand))
    ((no)
     (Why not?)
     (You are being a bit negative)
     (Are you saying no just to be negative?))
    ((someone)
     (Can you be more specific?))
    ((everyone)
     (Surely not everyone)
     (Can you think of anyone in particular?)
     (Who for example?)
     (You are thinking of a special person))
    ((perhaps)
     (You do not seem quite certain))
    ((are)
     (Did you think they might not be -R-)
     (Possibly they are -R-))
    (()
     (Very interesting)
     (I am not sure I understand you fully)
     (What does that suggest to you?)
     (Please continue)
     (Go on)
     (Do you feel strongly about discussing such things?))))

Command line:

gawk -f awklisp startup numbers  lists eliza.lsp -

Interaction:

> (eliza)
Hello-- please state your problem 
> (I feel sick)
Do you often feel sick 
> (I am in love with awk)
In what way are you in love with awk 
> (because it is so easy to use)
Is that the real reason? 
> (I was laughed at by the other kids at space camp)
Were you really? 
> (everyone hates me)
Can you think of anyone in particular? 
> (everyone at space camp)
Surely not everyone 
> (perhaps not tina fey)
You do not seem quite certain 
> (I want her to laugh at me)
What would it mean if you got her to laugh at me 

Expressions and their evaluation

Lisp evaluates expressions, which can be simple (atoms) or compound (lists).

An atom is a string of characters, which can be letters, digits, and most punctuation; the characters may -not- include spaces, quotes, parentheses, brackets, '.', '#', or ';' (the comment character). In this Lisp, case is significant ( X is different from x ).

  • Atoms: atom 42 1/137 + ok? hey:names-with-dashes-are-easy-to-read
  • Not atoms: don't-include-quotes (or spaces or parentheses)

A list is a '(', followed by zero or more objects (each of which is an atom or a list), followed by a ')'.

  • Lists: () (a list of atoms) ((a list) of atoms (and lists))
  • Not lists: ) ((()) (two) (lists)

The special object nil is both an atom and the empty list. That is, nil = (). A non-nil list is called a -pair-, because it is represented by a pair of pointers, one to the first element of the list (its -car-), and one to the rest of the list (its -cdr-). For example, the car of ((a list) of stuff) is (a list), and the cdr is (of stuff). It's also possible to have a pair whose cdr is not a list; the pair with car A and cdr B is printed as (A . B).
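
As a rough illustration of the "pair of pointers" idea, the sketch below stores pairs in two parallel awk arrays. The car[] and cdr[] array names echo the implementation notes later in this document, but cons(), show(), ncells and the choice of 0 for NIL are assumptions made for this example and are not awklisp's actual code. For simplicity the list elements are plain strings.

```awk
# Pairs as indices into two parallel arrays (illustrative sketch only).
function cons(a, d) {
    ncells++                # allocate a fresh cell number
    car[ncells] = a         # first element of the list
    cdr[ncells] = d         # rest of the list
    return ncells
}

# Render a flat list of atoms as text.
function show(p,    s) {
    s = "("
    while (p != NIL) {
        s = s car[p]
        p = cdr[p]
        if (p != NIL) s = s " "
    }
    return s ")"
}

BEGIN {
    NIL = 0
    lst = cons("a", cons("b", cons("c", NIL)))
    print show(lst)         # prints: (a b c)
    print car[lst]          # prints: a
    print show(cdr[lst])    # prints: (b c)
}
```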

That's the syntax of programs and data. Now let's consider their meaning. You can use Lisp like a calculator: type in an expression, and Lisp prints its value. If you type 25, it prints 25. If you type (+ 2 2), it prints 4. In general, Lisp evaluates a particular expression in a particular environment (set of variable bindings) by following this algorithm:

  • If the expression is a number, return that number.
  • If the expression is a non-numeric atom (a -symbol-), return the value of that symbol in the current environment. If the symbol is currently unbound, that's an error.
  • Otherwise the expression is a list. If its car is one of the symbols: quote, lambda, if, begin, while, set!, or define, then the expression is a -special- -form-, handled by special rules. Otherwise it's just a procedure call, handled like this: evaluate each element of the list in the current environment, and then apply the operator (the value of the car) to the operands (the values of the rest of the list's elements). For example, to evaluate (+ 2 3), we first evaluate each of its subexpressions: the value of + is (at least in the initial environment) the primitive procedure that adds, the value of 2 is 2, and the value of 3 is 3. Then we call the addition procedure with 2 and 3 as arguments, yielding 5. For another example, take (- (+ 2 3) 1). Evaluating each subexpression gives the subtraction procedure, 5, and 1. Applying the procedure to the arguments gives 4.
We'll see all the primitive procedures in the next section. A user-defined procedure is represented as a list of the form (lambda <parameters> <body>), such as (lambda (x) (+ x 1)). To apply such a procedure, evaluate its body in the environment obtained by extending the current environment so that the parameters are bound to the corresponding arguments. Thus, to apply the above procedure to the argument 41, evaluate (+ x 1) in the same environment as the current one except that x is bound to 41.

If the procedure's body has more than one expression -- e.g., (lambda () (write 'Hello) (write 'world!)) -- evaluate them each in turn, and return the value of the last one.

We still need the rules for special forms. They are:

  • The value of (quote <x>) is <x>. There's a shorthand for this form: '. E.g., the value of '(+ 2 2) is (+ 2 2), -not- 4.
  • (lambda <parameters> <body>) returns itself: e.g., the value of (lambda (x) x) is (lambda (x) x).
  • To evaluate (if <test-expr> <then-exp> <else-exp>), first evaluate <test-expr>. If the value is true (non-nil), then return the value of <then-exp>, otherwise return the value of <else-exp>. (<else-exp> is optional; if it's left out, pretend there's a nil there.) Example: (if nil 'yes 'no) returns no.
  • To evaluate (begin <expr-1> <expr-2>...), evaluate each of the subexpressions in order, returning the value of the last one.
  • To evaluate (while <test> <expr-1> <expr-2>...), first evaluate <test>. If it's nil, return nil. Otherwise, evaluate <expr-1>, <expr-2>,... in order, and then repeat.
  • To evaluate (set! <variable> <expr>), evaluate <expr>, and then set the value of <variable> in the current environment to the result. If the variable is currently unbound, that's an error. The value of the whole set! expression is the value of <expr>.
  • (define <variable> <expr>) is like set!, except it's used to introduce new bindings, and the value returned is <variable>.

It's possible to define new special forms using the macro facility provided in the startup file. The macros defined there are:

  • (let ((<var> <expr>)...)
      <body>...)
    Bind each <var> to its corresponding <expr> (evaluated in the current environment), and evaluate <body> in the resulting environment.
  • (cond (<test-expr> <result-expr>...)... (else <result-expr>...))
    where the final else clause is optional. Evaluate each <test-expr> in turn, and for the first non-nil result, evaluate its <result-expr>. If none are non-nil, and there's no else clause, return nil.
  • (and <expr>...)
    Evaluate each <expr> in order, until one returns nil; then return nil. If none are nil, return the value of the last <expr>.
  • (or <expr>...)
    Evaluate each <expr> in order, until one returns non-nil; return that value. If all are nil, return nil.

Built-in procedures

List operations:

  • (null? <x>) returns true (non-nil) when <x> is nil.
  • (atom? <x>) returns true when <x> is an atom.
  • (pair? <x>) returns true when <x> is a pair.
  • (car <pair>) returns the car of <pair>.
  • (cdr <pair>) returns the cdr of <pair>.
  • (cadr <pair>) returns the car of the cdr of <pair>. (i.e., the second element.)
  • (cddr <pair>) returns the cdr of the cdr of <pair>.
  • (cons <x> <y>) returns a new pair whose car is <x> and whose cdr is <y>.
  • (list <x>...) returns a list of its arguments.
  • (set-car! <pair> <x>) changes the car of <pair> to <x>.
  • (set-cdr! <pair> <x>) changes the cdr of <pair> to <x>.
  • (reverse! <list>) reverses <list> in place, returning the result.

Numbers:

  • (number? <x>) returns true when <x> is a number.
  • (+ <n> <n>) returns the sum of its arguments.
  • (- <n> <n>) returns the difference of its arguments.
  • (* <n> <n>) returns the product of its arguments.
  • (quotient <n> <n>) returns the quotient. Rounding is towards zero.
  • (remainder <n> <n>) returns the remainder.
  • (< <n1> <n2>) returns true when <n1> is less than <n2>.

I/O:

  • (write <x>) writes <x> followed by a space.
  • (newline) writes the newline character.
  • (read) reads the next expression from standard input and returns it.

Meta-operations:

  • (eval <x>) evaluates <x> in the current environment, returning the result.
  • (apply <proc> <list>) calls <proc> with arguments <list>, returning the result.

Miscellany:

  • (eq? <x> <y>) returns true when <x> and <y> are the same object. Be careful using eq? with lists, because (eq? (cons <x> <y>) (cons <x> <y>)) is false.
  • (put <x> <y> <z>)
  • (get <x> <y>) returns the last value <z> that was put for <x> and <y>, or nil if there is no such value.
  • (symbol? <x>) returns true when <x> is a symbol.
  • (gensym) returns a new symbol distinct from all symbols that can be read.
  • (random <n>) returns a random integer between 0 and <n>-1 (if <n> is positive).
  • (error <x>...) writes its arguments and aborts with error code 1.

Implementation Notes

Overview

Since the code should be self-explanatory to anyone knowledgeable about Lisp implementation, these notes assume you know Lisp but not interpreters. I haven't got around to writing up a complete discussion of everything, though.

The code for an interpreter can be pretty low on redundancy -- this is natural because the whole reason for implementing a new language is to avoid having to code a particular class of programs in a redundant style in the old language. We implement what that class of programs has in common just once, then use it many times. Thus an interpreter has a different style of code, perhaps denser, than a typical application program.

Data representation

Conceptually, a Lisp datum is a tagged pointer, with the tag giving the datatype and the pointer locating the data. We follow the common practice of encoding the tag into the two lowest-order bits of the pointer. This is especially easy in awk, since arrays with non-consecutive indices are just as efficient as dense ones (so we can use the tagged pointer directly as an index, without having to mask out the tag bits). (But, by the way, mawk accesses negative indices much more slowly than positive ones, as I found out when trying a different encoding.)

This Lisp provides three datatypes: integers, lists, and symbols. (A modern Lisp provides many more.)

For an integer, the tag bits are zero and the pointer bits are simply the numeric value; thus, N is represented by N*4. This choice of the tag value has two advantages. First, we can add and subtract without fiddling with the tags. Second, negative numbers fit right in. (Consider what would happen if N were represented by 1+N*4 instead, and we tried to extract the tag as N%4, where N may be either positive or negative. Because of this problem and the above-mentioned inefficiency of negative indices, all other datatypes are represented by positive numbers.)
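
The arithmetic can be checked with a few lines of awk. The helper names make_int(), int_value() and tag_of() are made up for this sketch and are not taken from awklisp; the point is only to show why a zero tag is convenient and why the 1+N*4 alternative goes wrong for negative N.

```awk
# Integers tagged with 0 in the two low-order bits: N is stored as N*4.
# Helper names are hypothetical, for illustration only.
function make_int(n)  { return n * 4 }
function int_value(d) { return d / 4 }
function tag_of(d)    { return d % 4 }

BEGIN {
    a = make_int(5)                 # 20
    b = make_int(-7)                # -28
    # Sums of tagged integers are themselves correctly tagged.
    print int_value(a + b)          # prints: -2
    print (tag_of(b) == 0)          # prints: 1 (the tag is still zero)

    # The alternative encoding 1+N*4 breaks for negative N, because
    # awk's % keeps the sign of the dividend.
    alt = 1 + (-7) * 4              # -27
    print alt % 4                   # prints: -3, not the expected tag 1
}
```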

The evaluation/saved-bindings stack

The following is from an email discussion; it doesn't develop everything from first principles but is included here in the hope it will be helpful.

Hi. I just took a look at awklisp, and remembered that there's more to your question about why we need a stack -- it's a good question. The real reason is because a stack is accessible to the garbage collector.

We could have had apply() evaluate the arguments itself, and stash the results into variables like arg0 and arg1 -- then the case for ADD would look like

if (proc == ADD) return is(a_number, arg0) + is(a_number, arg1)

The obvious problem with that approach is how to handle calls to user-defined procedures, which could have any number of arguments. Say we're evaluating ((lambda (x) (+ x 1)) 42). (lambda (x) (+ x 1)) is the procedure, and 42 is the argument.

A (wrong) solution could be to evaluate each argument in turn, and bind the corresponding parameter name (like x in this case) to the resulting value (while saving the old value to be restored after we return from the procedure). This is wrong because we must not change the variable bindings until we actually enter the procedure -- for example, with that algorithm ((lambda (x y) y) 1 x) would return 1, when it should return whatever the value of x is in the enclosing environment. (The eval_rands()-type sequence would be: eval the 1, bind x to 1, eval the x -- yielding 1 which is *wrong* -- and bind y to that, then eval the body of the lambda.)

Okay, that's easily fixed -- evaluate all the operands and stash them away somewhere until you're done, and *then* do the bindings. So the question is where to stash them. How about a global array? Like

for (i = 0; arglist != NIL; ++i) {
    global_temp[i] = eval(car[arglist])
    arglist = cdr[arglist]
}

followed by the equivalent of extend_env(). This will not do, because the global array will get clobbered in recursive calls to eval(). Consider (+ 2 (* 3 4)) -- first we evaluate the arguments to the +, like this: global_temp[0] gets 2, and then global_temp[1] gets the eval of (* 3 4). But in evaluating (* 3 4), global_temp[0] gets set to 3 and global_temp[1] to 4 -- so the original assignment of 2 to global_temp[0] is clobbered before we get a chance to use it. By using a stack[] instead of a global_temp[], we finesse this problem.
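
The clobbering is easy to reproduce in miniature. The sketch below is not awklisp code: it encodes (+ 2 (* 3 4)) as a tiny tree in the hypothetical arrays op[], lhs[], rhs[] and val[], then evaluates it once with a shared global_temp[] and once with a stack whose pointer sp is saved and restored around each call.

```awk
# Evaluate (+ 2 (* 3 4)) two ways (illustrative sketch only).
function apply_op(o, a, b) { return o == "+" ? a + b : a * b }

# Operands stashed in one shared global array: inner calls overwrite them.
function eval_global(n) {
    if (op[n] == "") return val[n]
    global_temp[0] = eval_global(lhs[n])
    global_temp[1] = eval_global(rhs[n])    # clobbers global_temp[0]
    return apply_op(op[n], global_temp[0], global_temp[1])
}

# Operands pushed on a stack: each call uses its own slots above 'base'.
function eval_stack(n,    base, v) {
    if (op[n] == "") return val[n]
    base = sp
    v = eval_stack(lhs[n]); stack[sp++] = v
    v = eval_stack(rhs[n]); stack[sp++] = v
    v = apply_op(op[n], stack[base], stack[base + 1])
    sp = base                               # pop this call's operands
    return v
}

BEGIN {
    op[1] = "+"; lhs[1] = 2; rhs[1] = 3     # node 1: (+ node2 node3)
    val[2] = 2                              # node 2: 2
    op[3] = "*"; lhs[3] = 4; rhs[3] = 5     # node 3: (* node4 node5)
    val[4] = 3; val[5] = 4
    print "global_temp:", eval_global(1)    # prints: global_temp: 15
    print "stack:", eval_stack(1)           # prints: stack: 14
}
```

The global version computes 3 + 12 because the 2 was overwritten while evaluating (* 3 4); the stack version gets the correct 14.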

You may object that we can solve that by just making the global array local, and that's true; lots of small local arrays may or may not be more efficient than one big global stack, in awk -- we'd have to try it out to see. But the real problem I alluded to at the start of this message is this: the garbage collector has to be able to find all the live references to the car[] and cdr[] arrays. If some of those references are hidden away in local variables of recursive procedures, we're stuck. With the global stack, they're all right there for the gc().

(In C we could use the local-arrays approach by threading a chain of pointers from each one to the next; but awk doesn't have pointers.)

(You may wonder how the code gets away with having a number of local variables holding lisp values, then -- the answer is that in every such case we can be sure the garbage collector can find the values in question from some other source. That's what this comment is about:

  # All the interpretation routines have the precondition that their
  # arguments are protected from garbage collection.

In some cases where the values would not otherwise be guaranteed to be available to the gc, we call protect().)

Oh, there's another reason why apply() doesn't evaluate the arguments itself: it's called by do_apply(), which handles lisp calls like (apply car '((x))) -- where we *don't* want the x to get evaluated by apply().

References

  • Harold Abelson and Gerald J. Sussman, with Julie Sussman. Structure and Interpretation of Computer Programs. MIT Press, 1985.
  • John Allen. Anatomy of Lisp. McGraw-Hill, 1978.
  • Daniel P. Friedman and Matthias Felleisen. The Little LISPer. Macmillan, 1989.

Roger Rohrbach wrote a Lisp interpreter, in old awk (which has no procedures!), called walk. It can't do as much as this Lisp, but it certainly has greater hack value. Cooler name, too. It's available at http://www-2.cs.cmu.edu/afs/cs/project/ai-repository/ai/lang/lisp/impl/awk/0.html

Bugs

Eval doesn't check the syntax of expressions. This is a probably-misguided attempt to bump up the speed a bit, which also simplifies some of the code. The macroexpander in the startup file would be the best place to add syntax-checking.

Author

Darius Bacon dairus@wry.me

Copyright

Copyright (c) 1994, 2001 by Darius Bacon.

Permission is granted to anyone to use this software for any purpose on any computer system, and to redistribute it freely, subject to the following restrictions:

  1. The author is not responsible for the consequences of use of this software, no matter how awful, even if they arise from defects in it.
  2. The origin of this software must not be misrepresented, either by explicit claim or by omission.
  3. Altered versions must be plainly marked as such, and must not be misrepresented as being the original software.

categories: Top10,Wp,Dsl,Mar,2009,JesusG

Markdown.awk

Synopsis

awk -f markdown.awk file.txt > file.html

Download

Download from LAWKER.

Description

(Note: this code was originally called txt2html.awk by its author but that caused a name clash inside LAWKER. Hence, I've taken the liberty of renaming it. --Timm)

The following code implements a subset of John Gruber's Markdown language: a widely-used, ultra light-weight markup language for HTML.

  • Paragraphs: denoted by a leading blank line.
  • Images:
    ![alt text](/path/img.jpg "Title")
  • Emphasis: **To be in italics**
  • Code: `<code>` spans are delimited by backticks.
  • Headings (Setext style):
    Level 1 Header
    ===============

    Level 2 Header
    --------------

    Level 3 Header
    ______________

  • Headings (Atx style):

    The number of leading "#" characters sets the heading level:

    # Level 1 Header
    #### Level 4 Header

  • Unordered lists:

    - List item 1
    - List item 2

    Note: the beginning and end of a list are inferred automatically, and not always correctly.

  • Ordered lists: denoted by a number at start-of-line.

    1 A numbered list item
    

Code

The following code demonstrates an "exception-style" of Awk programming. Note how all the processing relating to each mark-up tag is localized (the exception being the prior text and environments that are carried around between rules). The modularity of the following code should make it easily hackable.

Globals

BEGIN {
	env = "none";
	text = "";
}

Images

/^!\[.+\] *\(.+\)/ {
	split($0, a, /\] *\(/);
	split(a[1], b, /\[/);
	imgtext = b[2];
	split(a[2], b, /\)/);
	imgaddr = b[1];
	print "<p><img src=\"" imgaddr "\" alt=\"" imgtext "\" title=\"\" /></p>\n";
	text = "";
	next;
}

Links

/\] *\(/ {
	do {
		na = split($0, a, /\] *\(/);
		split(a[1], b, "[");
		linktext = b[2];
		nc = split(a[2], c, ")");
		linkaddr = c[1];
		text = text b[1] "<a href=\"" linkaddr "\">" linktext "</a>" c[2];
		for(i = 3; i <= nc; i++)
			text = text ")" c[i];
		for(i = 3; i <= na; i++)
			text = text "](" a[i];
		$0 = text;
		text = "";
	}
	while (na > 2);
}

Code

/`/ {
	while (match($0, /`/) != 0) {
		if (env == "code") {
			sub(/`/, "</code>");
			env = pcenv;
		}
		else {
			sub(/`/, "<code>");
			pcenv = env;
			env = "code";
		}
	}
}

Emphasis

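/\*\*/ {
	while (match($0, /\*\*/) != 0) {
		if (env == "emph") {
			# Closing marker: emit the end tag, restore the previous environment.
			sub(/\*\*/, "</emph>");
			env = peenv;
		}
		else {
			# Opening marker: emit the start tag, remember the current environment.
			sub(/\*\*/, "<emph>");
			peenv = env;
			env = "emph";
		}
	}
}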

Setext-style Headers

(Plus h3 with underscores.)

/^=+$/ {
	print "<h1>" text "</h1>\n";
	text = "";
	next;
}

/^-+$/ {
	print "<h2>" text "</h2>\n";
	text = "";
	next;
}

/^_+$/ {
	print "<h3>" text "</h3>\n";
	text = "";
	next;
}

Atx-style headers

/^#/ {
	match($0, /#+/);
	n = RLENGTH;
	if(n > 6)
		n = 6;
	print "<h" n ">" substr($0, RLENGTH + 1) "</h" n ">\n";
	next;
}

Unordered Lists

/^[-*+] / {
	if (env == "none") {
		env = "ul";
		print "<ul>";
	}
	print "<li>" substr($0, 3) "</li>";
	text = "";
	next;
}

Ordered Lists

/^[0-9]./ {
	if (env == "none") {
		env = "ol";
		print "<ol>";
	}
	print "<li>" substr($0, 3) "</li>";
	next;
}

Paragraphs

/^[ \t]*$/ {
	if (env != "none") {
		if (text)
			print text;
		text = "";
		print "</" env ">\n";
		env = "none";
	}
	if (text)
		print "<p>" text "</p>\n";
	text = "";
	next;
}

Default

// {
	text = text $0;
}

End

END {
        if (env != "none") {
                if (text)
                        print text;
                text = "";
                print "</" env ">\n";
                env = "none";
        }
        if (text)
                print "<p>" text "</p>\n";
        text = "";
}

Bugs

Does not implement the full Markdown syntax.

Author

Jesus Galan (yiyus) 2006

<yiyu DOT jgl AT gmail DOT com>

categories: Awk100,Oo,Dsl,Mar,2009,Jimh

Awk++

Synopsis

 gawk -f awkpp file-name-of-awk++-program
This command is platform independent and sends the translated program to standard output (stdout). See Running awk++ for variations.

This is an updated revision (#21), released August 1, 2009. In this new version:

  • The code no longer needs a shell script or batch file to launch awkpp.
  • Multiple inheritance has been improved.
  • Configuration items have been added at the top of the program.

This document may be copied only as part of an awk++ distribution and in unmodified form.

Download

Download awkpp21.zip from LAWKER

Description

Awk++ is a preprocessor; that is, it reads in a program written in the awk++ language and outputs a new program. However, it differs from awka: the output from the awk++ preprocessor is awk code, not C or an executable program. So some version of AWK, such as awk or gawk, has to be used to run the preprocessed program. If desired, awka can be used in a second step to turn the preprocessed awk++ program into an executable.

OO in AWK++

The awk++ language provides object oriented programming for AWK that includes:

  • classes
  • class properties (persistent object variables)
  • methods
  • inheritance, including multiple inheritance

Awk++ adds new keywords to standard Awk:

  • class
  • method
  • prop
  • property
  • attr
  • attribute
  • elem
  • element
  • var
  • variable

Syntax

Samples:

 a = class1.new[(optional parameters)] *** similar to Ruby
 b = a.get("aProperty")
 a.delete

 class class1 {
     property aProperty

     method new([optional parameters]) {
         # put initialization stuff here
     }

     method get(propName) {
         if (propName == "aProperty")
             return aProperty   ### Note the use of 'return'. It behaves
                                ### exactly the same as in an AWK function.
     }
 }

Details

To define a class (similar to C++ but no public/private):

class class_name {.....}

To define a class with inheritance:

class class_name : inherited_class_name [ : inherited_class_name...] {.....}

To add local/private variables (persistent variables; syntax is unique to awk++):

class class_name {
 attribute|attr|property|prop|element|elem|variable|var variable_name
 ..... }

To help programmers who are used to other OO languages, "attribute", "property", "element", and "variable", along with their 4-letter abbreviations, are interchangeable.

Note: these persistent variables cannot be accessed directly. The programmer must define method(s) to return them, if their values are to be made available to code that's outside the class.

To add methods

class class_name {
 attribute variable_name1

 method method_name(parameters) {
 ...any awk code....
 }
 ..other method definitions...
 }

To create an object

 object_variable = class_name.new[(optional parameters)]
(runs the method named "new", if it exists; returns the object ID)

To call an object method

object_variable.method_name(parameters)

The dot isn't used for concatenation in awk/gawk, so it's a natural choice for the separator between the object and method.

To reclaim the memory used by an object, use the delete method, i.e.:

object_variable.delete

but don't define delete() in your classes. awk++ recognizes delete() as a special method and will take care of deleting the object. Deleting objects is only necessary, though, if they hold a lot of data. Overhead for objects themselves is insignificant.

Naming and behavior rules:

  • Class names must obey the same rules as user defined function names.
  • Method names must follow the same rules as AWK user defined function names.
  • Class "local" variables (properties, attributes, etc.) must follow the same naming rules as AWK variables.
  • Objects are number variables, so they must obey number variable rules. However, the values in variables holding objects should never be changed, as they are simply identifiers. Performing math operations on them is meaningless.

Syntax notes

OO syntax goals:

  • easy to parse and match to awk code using an awk program as the "preprocessor"
  • easy to understand
  • easy to remember
  • easy and fast to type
  • distinct from existing AWK syntax

The OO syntax is based partly on C++, partly on Javascript, partly on Ruby and partly on the book "The Object-Oriented Thought Process". It isn't lifted in toto from one language because other languages provide features that gawk can't accomplish or have syntax that is hard to parse.

Multiple Inheritance

In awk++, if a method is called that isn't in the object's class and there are inherited classes (superclasses) specified, the inherited classes are called in left to right order until one of them returns a value. That value becomes the result of the method call. This is the way awk++ resolves the diamond problem. As a programmer, you control the sequence in which superclasses are called by the left to right order of the list of inherited classes in the class definition.

There are two important things to note.

  1. The search will proceed up through as many ancestors as it takes to find a matching method.
  2. A "match" is made when a value is returned. If a superclass has a matching method that returns nothing, the search will continue. Thus, it's possible that more than one method could be executed, resulting in unintended consequences. Be careful!

Calls to undefined methods do nothing and return nothing, silently.
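
The lookup rule can be mimicked in plain gawk to see what it implies. This is only a sketch of the behaviour described above, not the code that awkpp generates: the defined[] and supers[] tables, the Class_method naming scheme, and the depth-first search through ancestors are all assumptions made for the example. It needs a gawk with indirect function calls.

```awk
# Left-to-right superclass search that stops at the first returned value.
# Illustrative only; awkpp's generated code may look quite different.
function Logger_describe() { }                    # matches, but returns nothing
function Shape_describe()  { return "a shape" }

function search_supers(class, method,    n, i, parents, fn, result) {
    n = split(supers[class], parents, " ")
    for (i = 1; i <= n; i++) {
        fn = parents[i] "_" method
        if (fn in defined) {
            result = @fn()                        # the method runs even if it returns nothing
            if (result != "") return result
        }
        result = search_supers(parents[i], method)    # then try its ancestors
        if (result != "") return result
    }
    return ""                                     # undefined: silently nothing
}

function send(class, method,    fn) {
    fn = class "_" method
    if (fn in defined) return @fn()
    return search_supers(class, method)
}

BEGIN {
    defined["Logger_describe"]; defined["Shape_describe"]
    supers["Square"] = "Logger Shape"
    print send("Square", "describe")              # prints: a shape
    print "[" send("Square", "missing") "]"       # prints: []
}
```

Here Logger_describe() is executed first and, because it returns nothing, the search moves on to Shape. That is the "more than one method could be executed" case from the second note above.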

Running awk++

The command to preprocess an awk++ program looks like this:

gawk -f awkpp file-name-of-awk++-program
or, if the "she-bang" line (line 1 in awkpp) has the right path to gawk, and awkpp is executable and in a directory in PATH,
awkpp file-name-of-awk++-program
To run the output program immediately,
gawk -f awkpp -r file-name-of-awk++-program [awk options] data-files-to-be-processed
or
awkpp -r file-name-of-awk++-program [awk options] data-files-to-be-processed
When running an awk++ program immediately, standard input (stdin) cannot be used for data. One or more data file paths must be listed on the command line.

Bugs

There is a bug in the standard AWK distributions that affects the preprocessor. Additionally, the preprocessor uses the 3rd array option of the match() function. So, it's best to use GAWK to run the preprocessor.

On the other hand, the AWK code created by translating awk++ is intended to work with all versions of AWK. If you find otherwise, please notify the developer(s).

Copyright

Copyright (c) 2008, 2009 Jim Hart, jhart@mail.avcnet.org All rights reserved. The awk++ code is licensed under the GNU Public license (GPL) any version. awk++ documentation, including this page, may be copied only in unmodified form, subject to fair use guidelines.

Author

Jim Hart, jhart@mail.avcnet.org

categories: Funky,Mar,2009,Timm

Funky: Functional Gawk

These pages are focused on Functional Gawk (a.k.a. "Funky").

Funky is enabled by a new feature added to Gawk 3.2: indirect functions. For example:

function foo() { print "foo" }
function bar() { print "bar" }

BEGIN {
                the_func = "foo"
                @the_func()     # calls foo()
                the_func = "bar"
                @the_func()     # calls bar()
}

At the time of this writing, Gawk 3.2 is pre-release and indirect functions can be accessed using the gawk-devel CVS tree:

cvs -d:pserver:anonymous@cvs.sv.gnu.org:/sources/gawk co gawk-devel

categories: Funky,Mar,2009,Timm

The Functional Challenge

Indirect functions enable a new view on library management in Gawk and, perhaps, a way to emulate functional abstraction in languages like Lisp.

So, anyone care to try, say:


categories: Funky,Tips,Mar,2009,ArnoldR

Super-For Loops

In this exchange from comp.lang.awk, Jason Quinn discusses his super-for loop trick. Arnold Robbins then chimes in to say that, with indirect functions, super-for loops could become a generic tool.

Jason Quinn writes:

  • Frequently when programming, situations arise for me where I need a nested number of for-loops. Such case arose for me again just recently while I was inventing a dice game. Anyway, here is the implementation that I ended up using to create a "super-for" loop in AWK (a little trickier than C).
  • This simple example merely lists all possible outcomes of rolling 4, 6, 8, 10, 12, and 20 sided dice at once. A super-for loop requires an array to specify the loop indices... here we have 6 dice and the number of sides determines the indices. The code is easily modified for an arbitrary number of dice (which is the whole point).
  • I identify three parts of a super-for which I called the prologue, body, and epilog. Under most circumstances, I think the main body only would get used.
  • For example:
    #shows an example of a superfor loop
    BEGIN {
    	#define loop maximums
    	loopmax[1]=4
    	loopmax[2]=6
    	loopmax[3]=8
    	loopmax[4]=10
    	loopmax[5]=12
    	loopmax[6]=20
    	#call the loop
    	superfor(6)
    }
    function superfor(loopdepth, zz) { # zz is a local variable
            currloopnum++
    
            #start of prologue
            #end of prologue
    
            for(loopcounter[currloopnum]=1; 
                loopcounter[currloopnum]<=loopmax[currloopnum]; 
                loopcounter[currloopnum]++) {
                    if ( loopdepth==1 ) {
                            #start of superfor body
                            for (zz=1;zz<=currloopnum;zz++) {
                                    printf loopcounter[zz] FS
                                    }
                            print ""
                            #end of superfor body
                            }
                    else if ( loopdepth>1 )
                            superfor(loopdepth-1)
                    }
    
            #start of epilog
            #end of epilog
    
            loopdepth++ ; currloopnum--
            }
    

Arnold Robbins replies:

  • I think this would make a great application for indirect function calls. For example:
    function superfor(loopdepth, prologue, body, epilogue,     zz)
    {
            currloopnum++
    
            @prologue()
    
            for(loopcounter[currloopnum]=1; 
                loopcounter[currloopnum]<=loopmax [currloopnum]; 
                loopcounter[currloopnum]++) {
                    if ( loopdepth==1 ) {
                            @body()
                    }
                    else if ( loopdepth>1 )
                            superfor(loopdepth-1, prologue, 
                                     body, epilogue)
                    }
    
            @epilogue()
    
            loopdepth++ ; currloopnum--
    }
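
    A small driver makes the indirect-call version easier to try out. The sketch below assumes a gawk that supports indirect function calls; noop and show_roll are hypothetical helper names, while loopmax, loopcounter, and currloopnum are the globals used in the post above. It rolls two small "dice" (2 and 3 sides) so the output stays short.

```awk
# Sketch only: requires a gawk with indirect function calls (@name()).
function noop() { }                        # hypothetical empty prologue/epilogue

function show_roll(   zz, line) {          # hypothetical body: print the counters
    for (zz = 1; zz <= currloopnum; zz++)
        line = (zz > 1 ? line " " : "") loopcounter[zz]
    print line
    rolls++                                # global tally of outcomes
}

function superfor(loopdepth, prologue, body, epilogue) {
    currloopnum++
    @prologue()
    for (loopcounter[currloopnum] = 1;
         loopcounter[currloopnum] <= loopmax[currloopnum];
         loopcounter[currloopnum]++) {
        if (loopdepth == 1)
            @body()
        else if (loopdepth > 1)
            superfor(loopdepth - 1, prologue, body, epilogue)
    }
    @epilogue()
    currloopnum--
}

BEGIN {
    loopmax[1] = 2; loopmax[2] = 3         # a 2-sided and a 3-sided "die"
    superfor(2, "noop", "show_roll", "noop")
    print rolls " outcomes"                # 2 * 3 = 6 outcomes
}
```

    Passing the names as strings keeps superfor itself free of application code; a different body only needs a different function name.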
    

categories: Funky,Mar,2009,Timm

Functional Enumeration in Gawk 3.1.7

Contents

Synopsis

all( fun, array [,max])

collect( fun, array1, array2 [,max])

select( fun, array1, array2 [,max])

reject( fun, array1, array2 [,max])

detect( fun, array [,max])

inject( fun, array, carry [,max])

collect, select, and reject return the size of array2; detect returns the matching item (or Fail) and inject returns the final carry.

Description

An interesting new feature in Gawk 3.1.7 is indirect functions. This allows a function name to be held in a variable, passed as an argument to a function, and called using the syntax

@fun(arg1,arg2,...)    

This enables a new kind of functional programming style in Gawk. For example, generic enumeration patterns can be coded once, then called many different ways with different function names passed as arguments.

This document illustrates this style of programming.

Enumerators

For example, here are some standard enumeration functions:

all(fun,array [,max])

Applies the function fun to all items in the array. If called with the max argument, the items are visited in the order i=1 .. max; otherwise we use for(i in array), whose order is unspecified.

collect(fun,array1,array2 [,max])

Applies fun to each item in array1 and collects the results in array2.

select(fun,array1,array2 [,max])

Find all the items in array1 that satisfy fun and add them to array2.

reject(fun,array1,array2 [,max])

Find all the items in array1 that do not satisfy fun and add them to array2.

detect(fun,array [,max])

Return the first item found in array that satisfies fun. If no such item is found, then return the magic global value Fail.

inject(fun,array,carry [,max])

(This one is a little tricky.) The result of applying fun to each item in array is carried into the processing of the next item. Initially, the carried value is carry. This function returns the final carry.
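
The carry in inject is easiest to follow as a trace. The sketch below is plain awk and calls mult directly instead of going through @fun, so it only illustrates the order of evaluation; the variable trace is a hypothetical name added for the illustration.

```awk
function mult(x, y) { return x * y }

BEGIN {
    n = split("1 2 3 4 5", arr)
    carry = 1                              # the initial carry
    for (i = 1; i <= n; i++) {
        carry = mult(arr[i], carry)        # result feeds the next step
        trace = (i > 1 ? trace " " : "") carry
    }
    print trace                            # 1 2 6 24 120
}
```

The last value of the trace is what inject returns.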

Sample Functions

To illustrate the above, consider the following functions. Each of these is defined for one array item.

function odd(x)    { return (x % 2) == 1 }
function show(x)   { print "[" x "]" }
function mult(x,y) { return x * y }
function halve(x)  { return x/2 }

Using the Functions

  • All-ing...
  • function do_all(   arr) { 
        split("22 23 24 25 26 27 28",arr)
        all("show",arr)
    }
    

    When we run this ...

    eg/enum1

    gawk317="$HOME/opt/gawk/bin/gawk"
    $gawk317 -f ../enumerate.awk --source 'BEGIN { do_all() }'
    

    we see every item in arr printed using the above show function (all was called without max, so the items appear in for(i in a) order, which is arbitrary) ...

    eg/enum1.out

    [25]
    [26]
    [27]
    [28]
    [22]
    [23]
    [24]
    
  • Collect-ing...
  • function do_collect(        max,arr1,arr2,i) {
        max=split("22 23 24 25 26 27 28",arr1)
        collect("halve",arr1,arr2,max)
        for(i=1;i<=max;i++) print arr2[i]
    }
    

    When we run this ...

    eg/enum2

    gawk317="$HOME/opt/gawk/bin/gawk"
    $gawk317 -f ../enumerate.awk --source 'BEGIN { do_collect() }'
    

    we see every item in arr divided in two ...

    eg/enum2.out

    11
    11.5
    12
    12.5
    13
    13.5
    14
    
  • Select-ing...
  • function do_select(        all,less,arr1,arr2,i) {
        all  = split("22 23 24 25 26 27 28",arr1)
        less = select("odd",arr1,arr2,all)
        for(i=1;i<=less;i++) print arr2[i]
    }
    

    When we run this ...

    eg/enum3

    gawk317="$HOME/opt/gawk/bin/gawk"
    $gawk317 -f ../enumerate.awk --source 'BEGIN { do_select() }'
    

    we see every item in arr that satisfies odd....

    eg/enum3.out

    23
    25
    27
    
  • Reject-ing...
  • function do_reject(        all,less,arr1,arr2,i) {
        all  = split("22 23 24 25 26 27 28",arr1)
        less = reject("odd",arr1,arr2,all)
        for(i=1;i<=less;i++) print arr2[i]
    }
    

    When we run this ...

    eg/enum4

    gawk317="$HOME/opt/gawk/bin/gawk"
    $gawk317 -f ../enumerate.awk --source 'BEGIN { do_reject() }'
    

    we see every item in arr that does not satisfy odd....

    eg/enum4.out

    22
    24
    26
    28
    
  • Detect-ing
  • function do_detect(        all,arr1) {
        all  = split("22 23 24 25 26 27 28",arr1)
        print detect("odd",arr1,all)   
    }
    

    When we run this ...

    eg/enum5

    gawk317="$HOME/opt/gawk/bin/gawk"
    $gawk317 -f ../enumerate.awk --source 'BEGIN { do_detect() }'
    

    we see the first item in arr that satisfies odd....

    eg/enum5.out

    23
    
  • Inject-ing...
  • function do_inject(        all,less,arr1,arr2,i) {
        split("1 2 3 4 5",arr1)
        print inject("mult",arr1,1)
    }
    

    When we run this ...

    eg/enum6

    gawk317="$HOME/opt/gawk/bin/gawk"
    $gawk317 -f ../enumerate.awk --source 'BEGIN { do_inject() }'
    

    we see the result of multiplying together every item in arr.

    eg/enum6.out

    120
    

Code

Note one design principle in the following: any newly generated arrays have indexes 1..max where max is the number of elements in that array.

all

function all (fun,a,max,   i) {
	if (max) 
		for(i=1;i<=max;i++) @fun(a[i]) 
	else  
		for(i in a) @fun(a[i])
}

collect

function collect (fun,a,b,max,   i,n) {
	if (max)
	    for(i=1;i<=max;i++) {n++; b[i]= @fun(a[i]) }
	else
	    for(i in a) {n++; b[i]= @fun(a[i])}
	return n
}

select

function select (fun,a,b,max,   i,n) {
	if (max)
		for(i=1;i<=max;i++) {
		    if (@fun(a[i])) {n++; b[n]= a[i] }}
	else
		for(i in a) {
		    if (@fun(a[i])) {n++; b[n]= a[i] }}
	return n
}

reject

function reject (fun,a,b,max,   i,n) {
	if (max)
		for(i=1;i<=max;i++) {
		    if (! @fun(a[i])) {n++; b[n]= a[i] }}
	else
		for(i in a) {
		    if (! @fun(a[i])) {n++; b[n]= a[i] }}
	return n
}

detect

BEGIN {Fail="someUnLIKELYSymbol"}
function detect (fun,a,max,   i) {
	if (max)
		for(i=1;i<=max;i++) {
			if (@fun(a[i])) return a[i] }
	else	
		for(i in a) {
			if (@fun(a[i])) return a[i] }
	return Fail
}

inject

function inject (fun,a,carry,max,   i) {
	if (max)
		for(i=1;i<=max;i++)
			 carry = @fun(a[i],carry) 
	else
		for(i in a)
			 carry = @fun(a[i],carry) 
	return carry
}

Bugs

The above code does not pass around any state information that the fun functions can use. So all their deliberations are either with the current array values (integers or strings) or with global state. It might be worthwhile writing new versions of the above with one more argument, to carry that state.
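
One way such a version might look is sketched below. It assumes a gawk with indirect function calls; select_with and above are hypothetical names, not part of enumerate.awk, and only the max form is shown to keep it short.

```awk
# Sketch only: requires a gawk with indirect function calls (@name()).
# select_with() is a hypothetical variant of select() that forwards
# one extra "state" argument to the predicate.
function select_with(fun, a, b, state, max,   i, n) {
    n = 0
    for (i = 1; i <= max; i++)
        if (@fun(a[i], state))
            b[++n] = a[i]
    return n
}

# Hypothetical predicate that uses the state as a threshold.
function above(x, limit) { return x > limit }

BEGIN {
    total = split("22 23 24 25 26 27 28", arr1)
    less  = select_with("above", arr1, arr2, 25, total)
    for (i = 1; i <= less; i++) print arr2[i]    # 26, 27, 28
}
```

The threshold now travels with the call instead of living in a global variable.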

Author

Tim Menzies

categories: Macros,Tools,Mar,2009,Timm

Macros

These pages focus on macro pre-processors (a natural application for Awk).


categories: Project,Tools,Mar,2009,Admin

Project Tools

These pages focus on tools for larger Gawk programs; e.g. ways to load multiple files or auto-generate documentation straight from the source code.


categories: Runawk,Project,Tools,Mar,2009,AlexC

runawk - wrapper for AWK interpreter

(Note: see recent update.)

Contents

Download from...

Download from LAWKER or a tar file or from SourceForge.

NAME

runawk - wrapper for AWK interpreter

SYNOPSIS

runawk [options] program_file

runawk -e program

DESCRIPTION

After years of using AWK for programming, I've found that despite its simplicity and limitations AWK is good enough for scripting a wide range of different tasks. AWK is not as powerful as its bigger counterparts like Perl, Ruby, TCL and others, but it has its own advantages: compactness, simplicity, and availability on almost all UNIX-like systems. I personally also like its data-driven nature and token orientation, a very useful technique for simple text processing utilities.

But! Unfortunately, awk interpreters lack some important features and sometimes do not work as well as they should.

Problems I see (some of them, of course)

  1. AWK lacks support for modules. Even when I create small programs, I often want to use functions created earlier and already used in other scripts. That is, it would be great to organise functions into so-called libraries (modules).

  2. In order to pass arguments to a #!/usr/bin/awk -f script (not to the awk interpreter), it is necessary to prepend the list of arguments with -- (two minus signs). In my view, this looks bad.

    Example:

    awk_program:

        #!/usr/bin/awk -f
    
        BEGIN {
           for (i=1; i < ARGC; ++i){
              printf "ARGV [%d]=%s\n", i, ARGV [i]
           }
        }

    Shell session:

        % awk_program --opt1 --opt2
        /usr/bin/awk: unknown option --opt1 ignored
        /usr/bin/awk: unknown option --opt2 ignored
    
        % awk_program -- --opt1 --opt2
        ARGV [1]=--opt1
        ARGV [2]=--opt2
        %

    In my opinion awk_program script should work like this

        % awk_program --opt1 --opt2
        ARGV [1]=--opt1
        ARGV [2]=--opt2
        %

    It is possible using runawk.

  3. When a #!/usr/bin/awk -f script handles arguments (options) and wants to read from stdin, it is necessary to add /dev/stdin (or `-') explicitly as the last argument.

    Example:

    awk_program:

        #!/usr/bin/awk -f
    
        BEGIN {
           if (ARGV [1] == "--flag"){
              flag = 1
              ARGV [1] = "" # to not read file named "--flag"
           }
        }
        {
           print "flag=" flag " $0=" $0
        }

    Shell session:

        % echo test | awk_program -- --flag
        % echo test | awk_program -- --flag /dev/stdin
        flag=1 $0=test
        %

    Ideally awk_program should work like this

        % echo test | awk_program --flag
        flag=1 $0=test
        %

runawk was created to solve all these problems.

OPTIONS

-h|--help

Display help information.

-V|--version

Display version information.

-d|--debug

Turn on a debugging mode in which runawk prints the argument list with which the real awk interpreter will be run.

-i|--with-stdin

Always add the stdin file name to the list of awk arguments.

-I|--without-stdin

Do not add the stdin file name to the list of awk arguments.

-e|--execute program

Specify the program. If -e is not specified, the program is read from program_file.

DETAILS/INTERNALS

Standalone script

Under UNIX-like OS-es you can use runawk by beginning your script with

   #!/usr/local/bin/runawk

line or something like this instead of

   #!/usr/bin/awk -f

or similar.

AWK modules

To activate modules, add them to the awk script like this

  #use "module1.awk"
  #use "module2.awk"

That is, the line that specifies a module name is treated as a comment line by a normal AWK interpreter but is processed specially by runawk.

Note that #use should begin in column 0; no spaces are allowed before it and no spaces are allowed between # and use.

Also note that AWK modules can "use" other modules, and so forth. All of them are collected in depth-first order, and each one is added to the list of awk interpreter arguments, prepended with the -f option. That is, the #use directive is *NOT* similar to #include in the C programming language: runawk's module code is not inserted in place of #use. Runawk's modules are closer to Perl's "use" command. If a module is mentioned more than once, only one -f is added for it, i.e. duplicates are removed automatically.

The position of a #use directive in a source file does matter: the earlier a module is mentioned, the earlier its -f is generated.

Example:

  file prog:
     #!/usr/local/bin/runawk

     #use "A.awk"
     #use "B.awk"
     #use "E.awk"

     PROG code
     ...
  file B.awk:
     #use "A.awk"
     #use "C.awk"
     B code
     ...
  file C.awk:
     #use "A.awk"
     #use "D.awk"

     C code
     ...
A.awk and D.awk don't contain #use directive.

If you run

  runawk prog file1 file2

or

  /path/to/prog file1 file2

the following command

  awk -f A.awk -f D.awk -f C.awk -f B.awk -f E.awk -f prog -- file1 file2

will actually run.

You can check this by running

  runawk -d prog file1 file2
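
The ordering in this example can be reproduced with a short depth-first walk. The sketch below is not runawk's code: it keeps the #use lists in a hypothetical in-memory table deps[] instead of reading the files, and visit, seen, and order are names invented for the illustration.

```awk
# Emit "-f module" for each dependency before the file that uses it,
# skipping modules that were already emitted.
function visit(mod,   n, i, list) {
    if (mod in seen) return
    seen[mod] = 1
    n = split(deps[mod], list, " ")
    for (i = 1; i <= n; i++)
        visit(list[i])
    order = (order == "" ? "" : order " ") "-f " mod
}

BEGIN {
    # The #use lines of the example above, in source order.
    deps["prog"]  = "A.awk B.awk E.awk"
    deps["B.awk"] = "A.awk C.awk"
    deps["C.awk"] = "A.awk D.awk"
    visit("prog")
    print "awk " order " -- file1 file2"
}
```

It prints the same command line as shown above, with A.awk listed once even though three files use it.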

Module search strategy

Modules are searched for first in the directory where the main program (or the module containing the #use directive) is located. If a module is not found there, the AWKPATH environment variable is checked. AWKPATH holds a colon-separated list of search directories. Finally, the module is searched for in the system runawk modules directory, by default PREFIX/share/runawk, but this can be changed at build time.

An absolute path of the module can also be specified.

AWK interpreter and its arguments

In order to pass arguments to an AWK script correctly, runawk treats arguments beginning with a `-' (minus) sign specially. The following command

  runawk prog2 -x -f=file -o=output file1 file2

or

  /path/to/prog2 -x -f=file -o=output file1 file2

will actually run

  awk -f prog2 -- -x -f=file -o=output file1 file2

therefore the -x, -f, and -o options will be passed to awk's ARGV/ARGC variables together with file1 and file2. If all arguments begin with `-' (minus), runawk adds the stdin filename to the end of the argument list (unless the -I option is specified), i.e. running

  runawk prog3 --value=value

or

  /path/to/prog3 --value=value

will actually run the following

  awk -f prog3 -- --value=value /dev/stdin
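
Inside the script, the options then have to be picked out of ARGV by hand. The following is a minimal sketch of how a script such as prog3 might do that in plain awk; parse_opts and the opt array are hypothetical names, and only the --name and --name=value forms are handled.

```awk
# Consume leading --name[=value] options from ARGV into opt[].
function parse_opts(   i, eq, arg) {
    for (i = 1; i < ARGC; i++) {
        arg = ARGV[i]
        if (substr(arg, 1, 2) != "--")
            break                          # first non-option ends parsing
        eq = index(arg, "=")
        if (eq)
            opt[substr(arg, 3, eq - 3)] = substr(arg, eq + 1)
        else
            opt[substr(arg, 3)] = 1
        ARGV[i] = ""                       # awk skips empty ARGV entries
    }
}

BEGIN {
    parse_opts()
    for (name in opt)
        print name "=" opt[name]
}
```

Blanking the consumed entries keeps awk from trying to open the options as input files, the same trick as in the --flag example earlier.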

Program as an argument

Like some other interpreters, runawk can take the script from the command line, like this

 /path/to/runawk -e '
 #use "alt_assert.awk"

 {
   assert($1 >= 0 && $1 <= 10, "Bad value: " $1)

   # your code below
   ...
 }'

Selecting a preferred AWK interpreter

For some reason you may prefer one AWK interpreter over another. You can select it with the #interp command, like this

  file prog:
     #!/usr/local/bin/runawk

     #use "A.awk"
     #use "B.awk"

     #interp "/usr/pkg/bin/nbawk"

     # your code here
     ...

The reason may be efficiency for a particular task, useful but non-standard extensions, or anything else.

Note that the #interp directive should also begin in column 0; no spaces are allowed before it or between # and interp.

Setting environment

In some cases you may want to run the AWK interpreter with a specific environment. For example, your script may be designed to process ASCII text only. In this case you can run AWK with the LC_CTYPE=C environment and use regexp ranges.

runawk provides the #env directive for this. The string inside double quotes is passed to the putenv(3) libc function.

Example:

  file prog:
     #!/usr/local/bin/runawk

     #env "LC_ALL=C"

     $1 ~ /^[A-Z]+$/ { # A-Z is valid if LC_CTYPE=C
         print $1
     }

EXIT STATUS

If the AWK interpreter exits normally, runawk exits with its exit status. If the AWK interpreter was killed by a signal, runawk exits with exit status 128+signal.

ENVIRONMENT

AWKPATH

Colon separated list of directories where awk modules are searched.

RUNAWK_AWKPROG

Sets the path to the AWK interpreter, used by default, i.e. this variable overrides the compile-time default. Note that #interp directive overrides this.

AUTHOR/LICENSE

Copyright (c) 2007-2008 Aleksey Cheusov <vle@gmx.net>

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

BUGS/FEEDBACK

Please send any comments, questions, bug reports etc. to me by e-mail or (even better) register them at sourceforge project home. Feature requests are also welcomed.


categories: Awk100,Macros,Tools,Mar,2009,JonB

m1 : A Micro Macro Processor

Contents

Synopsis

awk -f m1.awk [file...]

Download

Download from LAWKER.

Description

M1 is a simple macro language that supports the essential operations of defining strings and replacing strings in text by their definitions. It also provides facilities for file inclusion and for conditional expansion of text. It is not designed for any particular application, so it is mildly useful across several applications, including document preparation and programming. This paper describes the evolution of the program; the final version is implemented in about 110 lines of Awk.

M1 copies its input file(s) to its output unchanged except as modified by certain "macro expressions." The following lines define macros for subsequent processing:

 @comment Any text
 @@                     same as @comment
 @define name value
 @default name value    set if name undefined
 @include filename
 @if varname            include subsequent text if varname != 0
 @unless varname        include subsequent text if varname == 0
 @fi                    terminate @if or @unless
 @ignore DELIM          ignore input until line that begins with DELIM
 @stderr stuff          send diagnostics to standard error

A definition may extend across many lines by ending each line with a backslash, thus quoting the following newline.

Any occurrence of @name@ in the input is replaced in the output by the corresponding value.

@name at beginning of line is treated the same as @name@.

Applications

Form Letters

We'll start with a toy example that illustrates some simple uses of m1. Here's a form letter that I've often been tempted to use:

@default MYNAME Jon Bentley 
@default TASK respond to your special offer 
@default EXCUSE the dog ate my homework 
Dear @NAME@: 
    Although I would dearly love to @TASK@, 
I am afraid that I am unable to do so because @EXCUSE@. 
I am sure that you have been in this situation 
many times yourself. 
            Sincerely, 
            @MYNAME@ 

If that file is named sayno.mac, it might be invoked with this text:

@define NAME Mr. Smith 
@define TASK subscribe to your magazine 
@define EXCUSE I suddenly forgot how to read 

Recall that a @default takes effect only if its variable was not previously @defined.

Troff Pre-Processing

I've found m1 to be a handy Troff preprocessor. Many of my text files (including this one) start with m1 definitions like:

@define ArrayFig @StructureSec@.2 
@define HashTabFig @StructureSec@.3 
@define TreeFig @StructureSec@.4 
@define ProblemSize 100 

Even a simple form of arithmetic would be useful in numeric sequences of definitions. The longer m1 variables get around Troff's dreadful two-character limit on string names; these variables are also available to Troff preprocessors like Pic and Eqn. Various forms of the @define, @if, and @include facilities are present in some of the Troff-family languages (Pic and Troff) but not others (Tbl); m1 provides a consistent mechanism.

I include figures in documents with lines like this:

@define FIGNUM @FIGMFMOVIE@ 
@define FIGTITLE The Multiple Fragment heuristic. 
@FIGSTART@ 
.PS <@THISDIR@/mfmovie.pic
@FIGEND@ 

The two @defines are a hack to supply the two parameters of number and title to the figure. The figure might be set off by horizontal lines or enclosed in a box, the number and title might be printed at the top or the bottom, and the figures might be graphs, pictures, or animations of algorithms. All figures, though, are presented in the consistent format defined by FIGSTART and FIGEND.

Awk Library Management

I have also used m1 as a preprocessor for Awk programs. The @include statement allows one to build simple libraries of Awk functions (though some, but not all, Awk implementations provide this facility by allowing multiple program files). File inclusion was used in an earlier version of this paper to include individual functions in the text and then wrap them all together into the complete m1 program. The conditional statements allow one to customize a program with macros rather than run-time if statements, which can reduce both run time and compile time.

Controlling Experiments

The most interesting application for which I've used this macro language is unfortunately too complicated to describe in detail. The job for which I wrote the original version of m1 was to control a set of experiments. The experiments were described in a language with a lexical structure that forced me to make substitutions inside text strings; that was the original reason that substitutions are bracketed by at-signs. The experiments are currently controlled by text files that contain descriptions in the experiment language, data extraction programs written in Awk, and graphical displays of data written in Grap; all the programs are tailored by m1 commands.

Most experiments are driven by short files that set a few key parameters and then @include a large file with many @defaults. Separate files describe the fields of shared databases:

 @define N ($1) 
 @define NODES ($2) 
 @define CPU ($3) 
 ... 

These files are @included in both the experiment files and in Troff files that display data from the databases. I had tried to conduct a similar set of experiments before I built m1, and got mired in muck. The few hours I spent building the tool were paid back handsomely in the first days I used it.

The Substitution Function

M1 uses a fast substitution function. The idea is to process the string from left to right, searching for the first substitution to be made. We then make the substitution, and rescan the string starting at the fresh text. We implement this idea by keeping two strings: the text processed so far is in L (for Left), and unprocessed text is in R (for Right). Here is the pseudocode for dosubs:

L = Empty 
R = Input String 
while R contains an "@" sign do 
	let R = A @ B; set L = L A and R = B 
	if R contains no "@" then 
		L = L "@" 
		break 
	let R = A @ B; set M = A and R = B 
	if M is in SymTab then 
		R = SymTab[M] R 
	else 
		L = L "@" M 
		R = "@" R 
return L R 

Possible Extensions

There are many ways in which the m1 program could be extended. Here are some of the biggest temptations to "creeping creaturism":

  • A long definition with a trail of backslashes might be more graciously expressed by a @longdefine statement terminated by a @longend.
  • An @undefine statement would remove a definition from the symbol table.
  • I've been tempted to add parameters to macros, but so far I have gotten around the problem by using an idiom described in the next section.
  • It would be easy to add stack-based arithmetic and strings to the language by adding @push and @pop commands that read and write variables.
  • As soon as you try to write interesting macros, you need to have mechanisms for quoting strings (to postpone evaluation) and for forcing immediate evaluation.

Code

The following code is short (around 100 lines), which is significantly shorter than other macro processors; see, for instance, Chapter 8 of Kernighan and Plauger [1981]. The program uses several techniques that can be applied in many Awk programs.

  • Symbol tables are easy to implement with Awk's associative arrays.
  • The program makes extensive use of Awk's string-handling facilities: regular expressions, string concatenation, gsub, index, and substr.
  • Awk's file handling makes the dofile procedure straightforward.
  • The readline function and pushback mechanism associated with buffer are of general utility.

error

function error(s) {
	print "m1 error: " s | "cat 1>&2"; exit 1
}

dofile

function dofile(fname,  savefile, savebuffer, newstring) {
	if (fname in activefiles)
		error("recursively reading file: " fname)
	activefiles[fname] = 1
	savefile = file; file = fname
	savebuffer = buffer; buffer = ""
	while (readline() != EOF) {
		if (index($0, "@") == 0) {
			print $0
		} else if (/^@define[ \t]/) {
			dodef()
		} else if (/^@default[ \t]/) {
			if (!($2 in symtab))
				dodef()
		} else if (/^@include[ \t]/) {
			if (NF != 2) error("bad include line")
			dofile(dosubs($2))
		} else if (/^@if[ \t]/) {
			if (NF != 2) error("bad if line")
			if (!($2 in symtab) || symtab[$2] == 0)
				gobble()
		} else if (/^@unless[ \t]/) {
			if (NF != 2) error("bad unless line")
			if (($2 in symtab) && symtab[$2] != 0)
				gobble()
		} else if (/^@fi([ \t]?|$)/) { # Could do error checking here
		} else if (/^@stderr[ \t]?/) {
			print substr($0, 9) | "cat 1>&2"
		} else if (/^@(comment|@)[ \t]?/) {
		} else if (/^@ignore[ \t]/) { # Dump input until $2
			delim = $2
			l = length(delim)
			while (readline() != EOF)
				if (substr($0, 1, l) == delim)
					break
		} else {
			newstring = dosubs($0)
			if ($0 == newstring || index(newstring, "@") == 0)
				print newstring
			else
				buffer = newstring "\n" buffer
		}
	}
	close(fname)
	delete activefiles[fname]
	file = savefile
	buffer = savebuffer
}

readline

Read the next input line into $0, taking it from the global pushback string "buffer" if that is non-empty and from the current file otherwise. Return "EOF" or "" (null string).

function readline(  i, status) {
	status = ""
	if (buffer != "") {
		i = index(buffer, "\n")
		$0 = substr(buffer, 1, i-1)
		buffer = substr(buffer, i+1)
	} else {
		# Hume: special case for non v10: if (file == "/dev/stdin")
		if ((getline < file) <= 0)
			status = EOF
	}
	# Hack: allow @Mname at start of line w/o closing @
	if ($0 ~ /^@[A-Z][a-zA-Z0-9]*[ \t]*$/)
		sub(/[ \t]*$/, "@")
	return status
}
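
The pushback idea can be tried in isolation. The sketch below is a simplified stand-in, not m1's readline: pushback, nextline, and current are hypothetical names, and it reads only from the buffer, never from a file.

```awk
# Put a line back so that it is the next one returned.
function pushback(text) { buffer = text "\n" buffer }

# Fetch the next buffered line into the global `current`.
# Returns 1 on success and 0 when the buffer is empty.
function nextline(   i) {
    if (buffer == "") return 0
    i = index(buffer, "\n")
    current = substr(buffer, 1, i - 1)
    buffer  = substr(buffer, i + 1)
    return 1
}

BEGIN {
    pushback("second")
    pushback("first")                      # pushed last, so read first
    while (nextline())
        out = out current ";"
    print out                              # first;second;
}
```

In m1 the same mechanism lets dofile push a partly expanded line back onto the input so that it is scanned again.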

gobble

function gobble(  ifdepth) {
	ifdepth = 1
	while (readline() != EOF) {
		if (/^@(if|unless)[ \t]/)
			ifdepth++
		if (/^@fi[ \t]?/ && --ifdepth <= 0)
			break
	}
}

dosubs

function dosubs(s,  l, r, i, m) {
	if (index(s, "@") == 0)
		return s
	l = ""	# Left of current pos; ready for output
	r = s	# Right of current; unexamined at this time
	while ((i = index(r, "@")) != 0) {
		l = l substr(r, 1, i-1)
		r = substr(r, i+1)	# Currently scanning @
		i = index(r, "@")
		if (i == 0) {
			l = l "@"
			break
		}
		m = substr(r, 1, i-1)
		r = substr(r, i+1)
		if (m in symtab) {
			r = symtab[m] r
		} else {
			l = l "@" m
			r = "@" r
		}
	}
	return l r
}
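
A short driver shows the rescanning behaviour. The sketch below repeats dosubs from above so that it runs on its own; the symbol table entries and the input string are made up for the illustration.

```awk
function dosubs(s,  l, r, i, m) {
    if (index(s, "@") == 0)
        return s
    l = ""                                 # left of current pos; ready for output
    r = s                                  # right of current; unexamined
    while ((i = index(r, "@")) != 0) {
        l = l substr(r, 1, i-1)
        r = substr(r, i+1)
        i = index(r, "@")
        if (i == 0) {
            l = l "@"
            break
        }
        m = substr(r, 1, i-1)
        r = substr(r, i+1)
        if (m in symtab) {
            r = symtab[m] r                # fresh text is scanned again
        } else {
            l = l "@" m
            r = "@" r
        }
    }
    return l r
}

BEGIN {
    symtab["NAME"]     = "Mr. Smith"
    symtab["GREETING"] = "Dear @NAME@"     # a definition that uses another macro
    result = dosubs("@GREETING@: see @Fig@")
    print result                           # Dear Mr. Smith: see @Fig@
}
```

The nested @NAME@ is expanded because the replacement text is put back on the right-hand string, while the undefined @Fig@ passes through unchanged.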

dodef

function dodef(fname,  str, x) {
	name = $2
	sub(/^[ \t]*[^ \t]+[ \t]+[^ \t]+[ \t]*/, "")  # OLD BUG: last * was +
	str = $0
	while (str ~ /\\$/) {
		if (readline() == EOF)
			error("EOF inside definition")
		# OLD BUG: sub(/\\$/, "\n" $0, str)
		x = $0
		sub(/^[ \t]+/, "", x)
		str = substr(str, 1, length(str)-1) "\n" x
	}
	symtab[name] = str
}

BEGIN

BEGIN {	
    EOF = "EOF"
	if (ARGC == 1)
		dofile("/dev/stdin")
	else if (ARGC >= 2) {
		for (i = 1; i < ARGC; i++)
			dofile(ARGV[i])
	} else
		error("usage: m1 [fname...]")
}

Bugs

M1 is three steps lower than m4. You'll probably miss something you have learned to expect.

History

M1 was documented in the 1997 sedawk book by Dale Dougherty & Arnold Robbins (ISBN 1-56592-225-5) but may have been written earlier.

This page was adapted from 131.191.66.141:8181/UNIX_BS/sedawk/examples/ch13/m1.pdf (download from LAWKER).

Author

Jon L. Bentley.


categories: Macros,Tools,Mar,2009,WillW

m5 - macro processor

Download

Download from LAWKER.

Synopsis

m5 [ -Dname ] [ -Dname=def ] [-c] [ -dp char ] 
   [ -o file ] [-sp char ] [ file ... ]
 
[g|n]awk -f m5.awk X [ -Dname ] [ -Dname=def ]  [-c]  [ -dp char ] 
                     [ -o file ] [ -sp char ] [ file ... ]

Description

M5 is a Bourne shell script for invoking m5.awk, which actually performs the macro processing. m5, unlike many macroprocessors, does not directly interpret its input. Instead it uses a two-pass approach in which the first pass translates the input to an awk program, and the second pass executes the awk program to produce the final output. Details of usage are provided below.

This two-pass system means that macros can contain awk commands, to be executed on the second pass. This greatly extends the expressiveness of the m5 macro system.

As noted in the synopsis above, its invocation may require specification of awk, gawk, or nawk, depending on the version of awk available on your system. This choice is further complicated on some systems, e.g. Sun, which have both awk (original awk) and nawk (new awk). Other systems appear to have new awk, but have named it just awk. New awk should be used, regardless of what it has been named. The macro processor translator will not work with original awk because the translator uses, for example, the built-in function match().

Options

The following options are supported:

-Dname
Following the cpp convention, define name as 1 (one). This is the same as if a -Dname=1 appeared as an option or #name=1 appeared as an input line. Names specified using -D are awk variables defined just before main is invoked.
-Dname=def
Define name as "def". This is the same as if #name="def" appeared as an input line. Names specified using -D are awk variables defined just before main is invoked.
X
Yes, that really is a capital "X". The version of nawk on Sun Solaris 2.5.1 apparently does its own argument processing before passing the arguments on to the awk program. In this case, X (and all succeeding options) are believed by nawk to be file names and are passed on to the macro processor translator (m5.awk) for its own argument processing. Without the X, Sun nawk attempts to process succeeding options (e.g., -Dname) as valid nawk arguments or files, thus causing an error. This may not be a problem for all awks.
-c
Compile only. The output program is still produced, but the final output is not.
-dp char
The directive prefix character (default is #).
-o file
The output program file (default is a.awk).
-sp char
The substitution prefix character (default is $).

Usage

Overview

The program that performs the first pass noted above is called the m5 translator and is named m5.awk. The input to the translator may be either standard input or one or more files listed on the command line. An input line with the directive prefix character (# by default) in column 1 is treated as a directive statement in the MP directive language (awk). All other input lines are processed as text lines. Simple macros are created using awk assignment statements and their values referenced using the substitution prefix character ($ by default). The backslash (\) is the escape character; its presence forces the next character to literally appear in the output. This is most useful when forcing the appearance of the directive prefix character, the substitution prefix character, and the escape character itself.

Macro Substitution

All input lines are scanned for macro references that are indicated by the substitution prefix character. Assuming the default value of that character, macro references may be of the form $var, $(var), $(expr), $[str], $var[expr], or $func(args). These are replaced by an awk variable, awk variable, awk expression, awk array reference to the special array M[], regular awk array reference, or awk function call, respectively. These are, in effect, macros. The MP translator checks for proper nesting of parentheses and dou- ble quotes when translating $(expr) and $func(args) macros, and checks for proper nesting of square brackets and double quotes when translating $[expr] and $var[expr] macros. The substitution prefix character indicates a a macro reference unless it is (i) escaped (e.g., \$abc), (ii) followed by a character other than A-Z, a-z, (, or [ (e.g., $@), or (iii) inside a macro reference (e.g., $($abc); probably an error).

An understanding of the implementation of macro substitution will help in its proper usage. When a text line is encountered, it is scanned for macros, embedded in an awk print statement, and copied to the output program. For example, the input line

The quick $fox jumped over the lazy $dog.

is transformed into

print "The quick " fox " jumped over the lazy " dog "."

Obviously the use of this transformation technique relies completely on the presence of the awk concatenation operator (one or more blanks).
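
A minimal sketch of this transformation, restricted to the simple $var form, might look as follows. The function name m5_text_line is hypothetical and is not part of m5.awk. The sketch assumes macro names consist of a letter followed by letters, digits, or underscores, and it ignores escapes, the other macro forms, and any double quotes or backslashes already present in the text.

```shell
# Hypothetical helper, not part of m5.awk: translate text lines that contain
# only simple $var references into awk print statements.
m5_text_line() {
  awk '{
    line = $0
    out = "print \""
    # Find each "$" followed by a letter, then letters, digits or underscores.
    while (match(line, /\$[A-Za-z][A-Za-z0-9_]*/)) {
      # Close the string, emit the bare variable name, reopen the string.
      out = out substr(line, 1, RSTART - 1) "\" " substr(line, RSTART + 1, RLENGTH - 1) " \""
      line = substr(line, RSTART + RLENGTH)
    }
    print out line "\""
  }'
}

printf '%s\n' 'The quick $fox jumped over the lazy $dog.' | m5_text_line
```

For the input line used above, the sketch prints the same print statement as the one shown for the translator.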

Macros Containing Macros

As already noted, a macro reference inside another macro reference will not result in substitution and will probably cause an awk execution-time error. Furthermore, a substitution prefix character in the substituted string is also generally not significant because the substitution prefix character is detected at translation time, and macro values are assigned at execution time. However, macro references of the form $[str] provide a simple nested referencing capability. For example, if $[abc] is in a text line, or in a directive line and not on the left hand side of an assignment statement, it is replaced by eval(M["abc"]). When the output program is executed, the m5 runtime routine eval() substitutes the value of M["abc"], examining it for further macro references of the form $[str] (where "str" denotes an arbitrary string). If one is found, substitution and scanning proceed recursively. Function type macro references may result in references to other macros, thus providing an additional form of nested referencing.

Directive Lines

Except for the include directive, when a directive line is detected, the directive prefix is removed, the line is scanned for macros, and then the line is copied to the output program (as distinct from the final output). Any valid awk construct, including the function statement, is allowed in a directive line. Further information on writing awk programs may be found in Aho, Kernighan, and Weinberger; Dougherty and Robbins; and Robbins.

Include Directive

A single non-awk directive has been provided: the include directive. Assuming that # is the directive prefix, #include(filename) directs the MP translator to immediately read from the indicated file, processing lines from it in the normal manner. This processing mode makes the include directive the only type of directive to take effect at translation time. Nested includes are allowed. Include directives must appear on a line by themselves. More elaborate types of file processing may be directly programmed using appropriate awk statements in the input file.

Main Program and Functions

The MP translator builds the resulting awk program in one of two ways, depending on the form of the first input line. If that line begins with "function", it is assumed that the user is providing one or more functions, including the function "main" required by m5. If the first line does not begin with "function", then the entire input file is translated into awk statements that are placed inside "main". If some input lines are inside functions, and others are not, awk will detect this and complain. The MP translator by design has little awareness of the syntax of directive lines (awk statements), and as a consequence syntax errors in directive lines are not detected until the output program is executed.

Output

Finally, unless the -c (compile only) option is specified on the command line, the output program is executed to produce the final output (directed by default to standard output). The version of awk specified in ARGV[0] (a built-in awk variable containing the command name) is used to execute the program. If ARGV[0] is null, awk is used.

EXAMPLE

Understanding this example requires recognition that macro substitution is a two-step process: (i) the input text is translated into an output awk program, and (ii) the awk program is executed to produce the final output with the macro substitutions actually accomplished. The examples below illustrate this process. # and $ are assumed to be the directive and substitution prefix characters. This example was successfully executed using awk on a Cray C90 running UNICOS 10.0.0.3, gawk on a Gateway E-3200 running SuSE Linux Version 6.0, and nawk on a Sun Ultra 2 Model 2200 running Solaris 2.5.1.

Input Text

#function main() {

   Example 1: Simple Substitution
   ------------------------------
#  br = "brown"
   The quick $br fox.

   Example 2: Substitution inside a String
   ---------------------------------------
#  r = "row"
   The quick b$(r)n fox.

   Example 3: Expression Substitution
   ----------------------------------
#  a = 4
#  b = 3
   The quick $(2*a + b) foxes.

   Example 4: Macros References inside a Macro
   -------------------------------------------
#  $[fox] = "\$[q] \$[b] \$[f]"
#  $[q] = "quick"
#  $[b] = "brown"
#  $[f] = "fox"
   The $[fox].

   Example 5: Array Reference Substitution
   ---------------------------------------
#  x[7] = "brown"
#  b = 3
   The quick $x[2*b+1] fox.

   Example 6: Function Reference Substitution
   ------------------------------------------
   The quick $color(1,2) fox.

   Example 7: Substitution of Special Characters
   ---------------------------------------------
\#  The \$ quick \\ brown $# fox. $$
#}
#include(testincl.m5)

Included File testincl.m5

#function color(i,j) {
   The lazy dog.
#  if (i == j)
#     return "blue"
#  else
#     return "brown"
#}

Output Program

function main() {
   print
   print "   Example 1: Simple Substitution"
   print "   ------------------------------"
   br = "brown"
   print "   The quick " br " fox."
   print
   print "   Example 2: Substitution inside a String"
   print "   ---------------------------------------"
   r = "row"
   print "   The quick b" r "n fox."
   print
   print "   Example 3: Expression Substitution"
   print "   ----------------------------------"
   a = 4
   b = 3
   print "   The quick " 2*a + b " foxes."
   print
   print "   Example 4: Macros References inside a Macro"
   print "   -------------------------------------------"
   M["fox"] = "$[q] $[b] $[f]"
   M["q"] = "quick"
   M["b"] = "brown"
   M["f"] = "fox"
   print "   The " eval(M["fox"]) "."
   print
   print "   Example 5: Array Reference Substitution"
   print "   ---------------------------------------"
   x[7] = "brown"
   b = 3
   print "   The quick " x[2*b+1] " fox."
   print
   print "   Example 6: Function Reference Substitution"
   print "   ------------------------------------------"
   print "   The quick " color(1,2) " fox."
   print
   print "   Example 7: Substitution of Special Characters"
   print "   ---------------------------------------------"
   print "\#  The \$ quick \\ brown $# fox. $$"
}
function color(i,j) {
   print "   The lazy dog."
   if (i == j)
      return "blue"
   else
      return "brown"
}

function eval(inp   ,isplb,irb,out,name) {

   splb = SP "["
   out = ""

   while( isplb = index(inp, splb) ) {
      irb = index(inp, "]")
      if ( irb == 0 ) {
         out = out substr(inp,1,isplb+1)
         inp = substr( inp, isplb+2 )
      } else {
         name = substr( inp, isplb+2, irb-isplb-2 )
         sub( /^ +/, "", name )
         sub( / +$/, "", name )
         out = out substr(inp,1,isplb-1) eval(M[name])
         inp = substr( inp, irb+1 )
      }
   }

   out = out inp

   return out
}
BEGIN {
   SP = "$"
   main()
   exit
}

Final Output

   Example 1: Simple Substitution
   ------------------------------
   The quick brown fox.

   Example 2: Substitution inside a String
   ---------------------------------------
   The quick brown fox.

   Example 3: Expression Substitution
   ----------------------------------
   The quick 11 foxes.

   Example 4: Macros References inside a Macro
   -------------------------------------------
   The quick brown fox.

   Example 5: Array Reference Substitution
   ---------------------------------------
   The quick brown fox.

   Example 6: Function Reference Substitution
   ------------------------------------------
   The lazy dog.
   The quick brown fox.

   Example 7: Substitution of Special Characters
   ---------------------------------------------
#  The $ quick \ brown $# fox. $$

File

a.awk is the default output program file.

See Also

awk(1), cpp(1), gawk(1), m4(1), nawk(1), vi(1)

Author

William A. Ward, Jr., School of Computer and Information Sciences, University of South Alabama, Mobile, Alabama, July 23, 1999.


categories: Wp,Project,Tools,Mar,2009,Timm

AWKWORDS

Contents

Synopsis

awkwords --title "Title" file > file.html

awkwords file > file.html

Download

This code requires gawk and bash. To download:

wget  http://lawker.googlecode.com/svn/fridge/lib/bash/awkwords
chmod +x awkwords

To test the code, apply it to itself:

  • ./awkwords --title "Does this work?" awkwords > awkwords.html

Description

AwkWords is a simple-to-use markup language for writing documentation for programs whose comment lines start with "#" and whose comments contain HTML code.

For example, awk.info?tools/awkwords shows the html generated from this bash script.

When used with the --title option, a stand-alone web page is generated (to control the style of that page, see the css function, discussed below). When used without --title, it generates HTML suitable for inclusion in other pages.

Also, AwkWords finds all the <h2>, <h3>, <h4>, <h5>, <h6>, <h7>, <h8>, <h9> headings and copies them to a table of contents at the front of the file. Note that AwkWords assumes that the file contains only one <h1> heading; this is printed before the table of contents.

AwkWords adds some short cuts for HTML markup, as well as including nested contents (see below: "including nested content"). This is useful for including, say, program output along with the actual program.

Extra Markup

Short cuts for HTML

#.XX
This is replaced by <XX>.
#.XX words
This is replaced by <XX>words</XX>. Note that this tag won't work properly if the source text spills over more than one line.
#.TO url words
This is replaced by a mailto: link to url, with words as the link text.
#.URL url words
This is replaced by a hyperlink to url, with words as the link text.
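
The generic #.XX short cut can be approximated with a few lines of awk. This is a stand-alone sketch, not the code used by AwkWords: the function name awkwords_shortcut is made up for illustration, tag names are assumed to be alphanumeric, and the special tags (#.TO, #.URL and the include tags) are not handled.

```shell
# Hypothetical stand-alone version of the "#.XX" short cut, for illustration.
awkwords_shortcut() {
  awk '{
    if ($0 ~ /^#\.[A-Za-z0-9]+/) {
      tag = substr($1, 3)                      # drop the leading "#."
      rest = $0
      sub(/^#\.[A-Za-z0-9]+[ \t]*/, "", rest)  # text following the tag
      if (rest == "") {
        print "<" tag ">"                      # "#.XX" alone: opening tag only
      } else {
        print "<" tag ">" rest "</" tag ">"    # "#.XX words": wrap the words
      }
    } else {
      print                                    # ordinary line: copy unchanged
    }
  }'
}

printf '%s\n' '#.P' '#.EM important words' 'plain text' | awkwords_shortcut
```

Because the closing tag is added at the end of the same line, the sketch shares the limitation noted above: text that spills over more than one line is not wrapped properly.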

Including nested content:

#.IN file
This line is replaced by the contents of file.
#.LISTING file
This line is replaced by the name of the file, followed by a verbatim display of file (no formatting).
#.CODE file
This line is replaced by the name of the file, followed verbatim by file (no formatting).
#.BODY file
This line is replaced by file, less the lines before the first blank line.

Programmer's Guide

Awkwords is divided into three functions: unhtml fixes the printing of pre-formatted blocks; toc adds the table of contents while includes handles the details of the extra mark-up.

Functions

unhtml

unhtml() { cat $1| gawk '
  BEGIN {IGNORECASE=1}
  /^<PRE>/   {In=1; print; next}
  /^<\/PRE>/ {In=0; print; next}
  In         {gsub("<","\\&lt;",$0); print; next }
             {print $0 }'
}

toc

toc() { cat $1 | gawk '
 BEGIN             { IGNORECASE = 1 }
 /^<[h]1>/         { Header=$0; next}
 /^[<]h[23456789]>/  { 
       T++ ;
      Toc[T]  = gensub(/(.*)<h(.*)>[ \t]*(.*)[ \t]*<\/h(.*)>(.*)/,
      "<""h\\2><""font color=black>\\&bull;</font></a> <""a href=#" T ">\\3</a></h\\4>",
                "g",$0)
		Pre="<a name="T"></a>" }
     { Line[++N] = Pre $0; Pre="" }
 END { print Header;
       print "<" "h2>Contents</h2>"
       print "<" "div id=\"htmltoc\">"
       for(I=1;I<=T;I++) print Toc[I]	
       print "<" "/div><!--- htmltoc --->"
       print "<" "div id=\"htmlbody\">"
       for(I=1;I<=N;I++) print Line[I]
       print "</" "div><!--- htmlbody --->"		
     }'
}

includes

The xpand function controls recursive inclusion of content. Note that

  • The last act of this function must be to call xpand1.
  • When including verbatim text, the recursive call to xpands must pass "1" as the second parameter.
includes() { cat $1 | gawk '
function xpand(pre,  tmp) {
   if      ($1 ~ "^#.IN")    xpands($2,pre) 
   else if ($1 ~ "^#.BODY" ) xpandsBody($2,pre)
   else if ($1 ~ "^#.LISTING")  {
  	    print "<" "pre>"
	    xpands($2,1)     # <===== note the recursive call with "1"
	    print "<" "/pre>" } 
   else if ($1 ~ "^#.CODE")  {
  	    print "<" "p>" $2 "\n<" "pre>"
	    xpands($2,1)     # <===== note the recursive call with "1"
	    print "<" "/pre>" } 
   else if ($1 ~ "^#.URL") {
	    tmp = $2; $1=$2="";
	    print "<" "a href=\""tmp"\">" trim($0) "</a>"
	    }
   else if ($1 ~ "^#.TO") {
	    tmp = $2; $1=$2="";
	    print "<" "a href=\"mailto:"tmp"\">" trim($0) "</a>"
	    }
   else 
	xpand1(pre)
}

The xpand1 function controls the printing of a single line. If we are formatting verbatim text, we must remove the start-of-html character "<". Otherwise, we expand any html shortcuts.

function xpand1(pre) {
   if (pre)
        gsub("<","\\&lt;",$0)  # <=== escape the start-of-html character
   else {
        $0= xpandHtml($0)      # <=== expand html short cuts
        sub(/^#/,"",$0) }
        print $0 
}

The function xpandHtml controls the html short cuts.

function xpandHtml(    str,tag) {
   if ($0 ~ /^#\.H1/) {         
	   $1=""
	   return "<" "h""1><join>" $0 "</join></" "h1>" }
   if (sub(/^#\./,"",$1)) {
	   tag=$1;  $1=""
	   return "<" tag ">"  (($0 ~ /^[ \t]*$/) ? "" : $0"</"tag">")
   }
   return $0
}

The rest of the code is just some book-keeping and managing the recursive addition of content.

function xpands(f,pre) {
     if (newFile(f)) {
	  while((getline <f) > 0) xpand(pre)
          close(f) }
}
function xpandsBody(f,pre, using) {
     if (newFile(f)) { 
	  while((getline <f) >0) {
	    if ( !using && ($0 ~ /^[\t ]*$/) ) using = 1
	    if ( using ) xpand(pre)}
	  close(f) }
}
function newFile(f) { return ++Seen[f]==1 }
function trim (s)   { sub(/^[ \t]*/,"",s);  sub(/[ \t]*$/,"",s); return s } 

BEGIN { IGNORECASE=1 }
      { xpand()      }'
}

CSS styles

If used to generate a full web page, then the following styles are added. Note that the rules are keyed on the htmltoc id, which toc assigns to the div that holds the table of contents, so they control the appearance of the table of contents.

css() { 
      echo "<""STYLE type=\"text/css\">"
      cat<<-'EOF'
         div#htmltoc h2 { font-size: medium; font-weight: normal; 
                          margin: 0 0 0 0; margin-left: 30px;}
         div#htmltoc h3 { font-size: medium; font-weight: normal; 
                          margin: 0 0 0 0; margin-left: 60px;}
         div#htmltoc h4 { font-size: medium; font-weight: normal; 
                          margin: 0 0 0 0; margin-left: 90px;}
         div#htmltoc h5 { font-size: medium; font-weight: normal; 
                          margin: 0 0 0 0; margin-left: 120px;}
         div#htmltoc h6 { font-size: medium; font-weight: normal; 
                          margin: 0 0 0 0; margin-left: 150px;}
         div#htmltoc h7 { font-size: medium; font-weight: normal; 
                          margin: 0 0 0 0; margin-left: 180px; }
      </STYLE>
EOF
}

Main command line

main() { cat $1 | includes | unhtml | toc; }

if [ "$1" = "--title" ]
then 
     echo "<""html><""head><""title>$2</title>`css`</head><""body>"; 
     shift 2
     main $1
     echo "<""/body><""/html>"
else 
     main $1
fi 

Bugs

There's no checking for valid input (e.g. pre-formatting tags that never close).

If the input file contains no html mark up, the results are pretty messy.

Recursive includes fail silently if the referenced file does not exist.

I don't like the way I need a separate pass to do "unhtml". I tried making it work within the code but it got messy.

Author

Tim Menzies

categories: Top10,Awk100,Mar,2009,NelsonB,Spell,ArnoldR

spell.awk

Contents

Synopsis

awk [-v Dictionaries="sysdict1 sysdict2 ..."] -f spell.awk -- \
    [=suffixfile1 =suffixfile2 ...] [+dict1 +dict2 ...] \
    [-strip] [-verbose] [file(s)]

Download

Download from LAWKER.

Description

Why Study This Code?

This program is an example par excellence of the power of awk. Yes, if written in "C", it would run faster. But goodness me, it would be much longer to code. These few lines implement a powerful spell checker, with user-specifiable exception lists. The built-in dictionary is constructed from a list of standard Unix spelling dictionaries, overridable on the command line.

It also offers some tips on how to structure larger-than-ten-line awk programs. In the code below, note the following:

  • The code is hundreds of lines long. Yes folks, it's true: Awk is not just a tool for writing one-liners.
  • The code is well-structured. Note, for example, how the BEGIN block is used to initialize the system from files/functions.
  • The code uses two tricks that encourage function reuse:
    • Much of the functionality has been moved out of PATTERN-ACTION and into functions.
    • The number of globals is restricted: note the frequent use of local variables in functions.
  • There is an example, in scan_options, of how to parse command line arguments.
  • The use of "print pipes" in report_exceptions shows how to link Awk code to other commands.

(And to write even larger programs, divided into many files, see runawk.)

Dictionaries

Dictionaries are simple text files, with one word per line. Unlike those for Unix spell(1), the dictionaries need not be sorted, and there is no dependence on the locale in this program that can affect which exceptions are reported, although the locale can affect their reported order in the exception list. A default list of dictionaries can be supplied via the environment variable DICTIONARIES, but that can be overridden on the command line.

For the purposes of this program, words are located by replacing ASCII control characters, digits, and punctuation (except apostrophe) with ASCII space (32). What remains are the words to be matched against the dictionary lists. Thus, files in ASCII and ISO-8859-n encodings are supported, as well as Unicode files in UTF-8 encoding.
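
The word-splitting rule can be tried out in isolation with a short sketch. The helper name split_words is hypothetical, and for brevity the sketch keeps only ASCII letters and the apostrophe, whereas the program's NonWordChars pattern also keeps bytes 128 to 255 so that ISO-8859-n and UTF-8 text survives.

```shell
# Hypothetical helper: print the words found on each input line, one per line.
# The apostrophe is passed in as variable q to avoid shell quoting problems.
split_words() {
  awk -v q="'" '
  BEGIN { nonword = "[^" q "A-Za-z]" }   # anything but letters and apostrophe
  {
    gsub(nonword, " ")                   # non-word characters become spaces
    for (k = 1; k <= NF; k++) {
      word = $k
      sub("^" q "+", "", word)           # strip leading apostrophes
      sub(q "+$", "", word)              # strip trailing apostrophes
      if (word != "") print word
    }
  }'
}

printf '%s\n' "It's 3 o'clock: 'quoted' words." | split_words
```

Apostrophes inside a word (as in "It's") are kept, while quoting apostrophes at either end of a word are dropped, which mirrors what spell_check_line does before looking a word up.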

All word matching is case insensitive (subject to the workings of tolower()).

In this simple version, which is intended to support multiple languages, no attempt is made to strip word suffixes, unless the -strip option is supplied.

Suffixes

Suffixes are defined as regular expressions, and may be supplied from suffix files (one per name) named on the command line, or from an internal default set of English suffixes. Comments in the suffix file run from sharp (#) to end of line. Each suffix regular expression should end with $, to anchor the expression to the end of the word. Each suffix expression may be followed by a list of one or more strings that can replace it, with the special convention that "" represents an empty string. For example:

	ies$	ie ies y	# flies -> fly, series -> series, ties -> tie
	ily$	y ily		# happily -> happy, wily -> wily
	nnily$	n		# funnily -> fun

Although it is permissible to include the suffix in the replacement list, it is not necessary to do so, since words are looked up before suffix stripping.

Suffixes are tested in order of decreasing length, so that the longest matches are tried first.
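
To make the suffix rules concrete, here is a small sketch that prints the candidate root words for a few inputs. It is not the program's own code: the helper name suffix_candidates and its three-rule table (taken from the example above) are assumptions for illustration, the rules are listed by hand in decreasing length rather than sorted, and the "" convention for an empty replacement is not handled.

```shell
# Hypothetical helper: for each input word, print the candidate root words
# produced by the first (longest) matching suffix rule.
suffix_candidates() {
  awk '
  BEGIN {
    # Rules are listed longest suffix first.
    n = split("nnily$ ies$ ily$", suffix)
    repl["nnily$"] = "n"
    repl["ies$"]   = "ie ies y"
    repl["ily$"]   = "y ily"
  }
  {
    out = $1 ":"
    for (k = 1; k <= n; k++) {
      if (match($1, suffix[k])) {
        stem = substr($1, 1, RSTART - 1)        # word with the suffix removed
        m = split(repl[suffix[k]], ending)
        for (j = 1; j <= m; j++)
          out = out " " stem ending[j]          # stem plus each replacement
        break                                   # only the first match is used
      }
    }
    print out
  }'
}

printf '%s\n' flies happily funnily | suffix_candidates
```

Trying the longest suffix first matters for a word such as "funnily": it ends in both "ily" and "nnily", and only the longer rule yields "fun".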

Output

The default output is just a sorted list of unique spelling exceptions, one per line. With the -verbose option, output lines instead take the form

	filename:linenumber:exception

Some Unix text editors recognize such lines, and can use them to move quickly to the indicated location.

Code

Top-Level

BEGIN	{ initialize() }
	    { spell_check_line() }
END	    { report_exceptions() }

get_dictionaries

function get_dictionaries(        files, key)
{
    if ((Dictionaries == "") && ("DICTIONARIES" in ENVIRON))
	Dictionaries = ENVIRON["DICTIONARIES"]
    if (Dictionaries == "")	# Use default dictionary list
    {
	DictionaryFiles["/usr/dict/words"]++
	DictionaryFiles["/usr/local/share/dict/words.knuth"]++
    }
    else			# Use system dictionaries from command line
    {
	split(Dictionaries, files)
	for (key in files)
	    DictionaryFiles[files[key]]++
    }
}

Initialize

function initialize()
{
   NonWordChars = "[^" \
	"'" \
	"ABCDEFGHIJKLMNOPQRSTUVWXYZ" \
	"abcdefghijklmnopqrstuvwxyz" \
	"\200\201\202\203\204\205\206\207\210\211\212\213\214\215\216\217" \
	"\220\221\222\223\224\225\226\227\230\231\232\233\234\235\236\237" \
	"\240\241\242\243\244\245\246\247\250\251\252\253\254\255\256\257" \
	"\260\261\262\263\264\265\266\267\270\271\272\273\274\275\276\277" \
	"\300\301\302\303\304\305\306\307\310\311\312\313\314\315\316\317" \
	"\320\321\322\323\324\325\326\327\330\331\332\333\334\335\336\337" \
	"\340\341\342\343\344\345\346\347\350\351\352\353\354\355\356\357" \
	"\360\361\362\363\364\365\366\367\370\371\372\373\374\375\376\377" \
	"]"
    get_dictionaries()
    scan_options()
    load_dictionaries()
    load_suffixes()
    order_suffixes()
}

load_dictionaries

function load_dictionaries(        file, word)
{
    for (file in DictionaryFiles)
    {
	## print "DEBUG: Loading dictionary " file > "/dev/stderr"
	while ((getline word < file) > 0)
	    Dictionary[tolower(word)]++
	close(file)
    }
}

load_suffixes

function load_suffixes(        file, k, line, n, parts)
{
    if (NSuffixFiles > 0)		# load suffix regexps from files
    {
	for (file in SuffixFiles)
	{
	    ## print "DEBUG: Loading suffix file " file > "/dev/stderr"
	    while ((getline line < file) > 0)
	    {
		sub(" *#.*$", "", line)		# strip comments
		sub("^[ \t]+", "", line)	# strip leading whitespace
		sub("[ \t]+$", "", line)	# strip trailing whitespace
		if (line == "")
		    continue
		n = split(line, parts)
		Suffixes[parts[1]]++
		Replacement[parts[1]] = parts[2]
		for (k = 3; k <= n; k++)
		  Replacement[parts[1]]= Replacement[parts[1]] " " parts[k]
	    }
	    close(file)
	}
    }
    else	      # load default table of English suffix regexps
    {
	split("'$ 's$ ed$ edly$ es$ ing$ ingly$ ly$ s$", parts)
	for (k in parts)
	{
	    Suffixes[parts[k]] = 1
	    Replacement[parts[k]] = ""
	}
    }
}

order_suffixes

function order_suffixes(        i, j, key)
{
    # Order suffixes by decreasing length
    NOrderedSuffix = 0
    for (key in Suffixes)
	OrderedSuffix[++NOrderedSuffix] = key
    for (i = 1; i < NOrderedSuffix; i++)
	for (j = i + 1; j <= NOrderedSuffix; j++)
	    if (length(OrderedSuffix[i]) < length(OrderedSuffix[j]))
		swap(OrderedSuffix, i, j)
}

report_exceptions

function report_exceptions(        key, sortpipe)
{
  sortpipe= Verbose ? "sort -f -t: -u -k1,1 -k2n,2 -k3" : "sort -f -u -k1"
  for (key in Exception)
  print Exception[key] | sortpipe
  close(sortpipe)
}

scan_options

function scan_options(        k)
{
    for (k = 1; k < ARGC; k++)
    {
	if (ARGV[k] == "-strip")
	{
	    ARGV[k] = ""
	    Strip = 1
	}
	else if (ARGV[k] == "-verbose")
	{
	    ARGV[k] = ""
	    Verbose = 1
	}
	else if (ARGV[k] ~ /^=/)	# suffix file
	{
	    NSuffixFiles++
	    SuffixFiles[substr(ARGV[k], 2)]++
	    ARGV[k] = ""
	}
	else if (ARGV[k] ~ /^[+]/)	# private dictionary
	{
	    DictionaryFiles[substr(ARGV[k], 2)]++
	    ARGV[k] = ""
	}
    }

    # Remove trailing empty arguments (for nawk)
    while ((ARGC > 0) && (ARGV[ARGC-1] == ""))
        ARGC--
}

spell_check_line

function spell_check_line(        k, word)
{
    ## for (k = 1; k <= NF; k++) print "DEBUG: word[" k "] = \"" $k "\""
    gsub(NonWordChars, " ")		# eliminate nonword chars
    for (k = 1; k <= NF; k++)
    {
	word = $k
	sub("^'+", "", word)		# strip leading apostrophes
	sub("'+$", "", word)		# strip trailing apostrophes
	if (word != "")
	    spell_check_word(word)
    }
}

spell_check_word

function spell_check_word(word,        key, lc_word, location, w, wordlist)
{
    lc_word = tolower(word)
    ## print "DEBUG: spell_check_word(" word ") -> tolower -> " lc_word
    if (lc_word in Dictionary)		# acceptable spelling
	return
    else				# possible exception
    {
	if (Strip)
	{
	    strip_suffixes(lc_word, wordlist)
	    ## for (w in wordlist) print "DEBUG: wordlist[" w "]"
	    for (w in wordlist)
		if (w in Dictionary)
		    break
	    if (w in Dictionary)
		return
	}
	## print "DEBUG: spell_check():", word
	location = Verbose ? (FILENAME ":" FNR ":") : ""
	if (lc_word in Exception)
	    Exception[lc_word] = Exception[lc_word] "\n" location word
	else
	    Exception[lc_word] = location word
    }
}

strip_suffixes

function strip_suffixes(word, wordlist,        ending, k, n, regexp)
{
    ## print "DEBUG: strip_suffixes(" word ")"
    split("", wordlist)
    for (k = 1; k <= NOrderedSuffix; k++)
    {
	regexp = OrderedSuffix[k]
	## print "DEBUG: strip_suffixes(): Checking \"" regexp "\""
	if (match(word, regexp))
	{
	    word = substr(word, 1, RSTART - 1)
	    if (Replacement[regexp] == "")
		wordlist[word] = 1
	    else
	    {
		split(Replacement[regexp], ending)
		for (n in ending)
		{
		    if (ending[n] == "\"\"")
			ending[n] = ""
		    wordlist[word ending[n]] = 1
		}
	    }
	    break
	}
    }
     ## for (n in wordlist) print "DEBUG: strip_suffixes() -> \"" n "\""
}

swap

function swap(a, i, j,        temp)
{
    temp = a[i]
    a[i] = a[j]
    a[j] = temp
}

Author

Arnold Robbins and Nelson H.F. Beebe in "Classic Shell Scripting", O'Reilly Books


categories: TextMining,Mar,2009,Admin

Text Mining

Some of the code at awk.info is somewhat historical in nature. For example, Scott Pakin's gender predictor was written in 1991. Given that, it might be mistakenly concluded that Awk is somehow old-fashioned and not suitable for modern tasks.

Text mining, on the other hand, could be the killer app for Awk in the 21st century. The language excels at creating one-off reports that handle the quirks of a particular file format.

There is a growing interest in using Awk for this kind of work. All the examples presented below come from work conducted in 2007 and 2008.

Why Text Mining?

If we could properly understand unstructured text, this would be a result of tremendous practical importance. A recent study concluded that:

  • 80 percent of business is conducted on unstructured information;
  • 85 percent of all data stored is held in an unstructured format;
  • Unstructured data doubles every three months.

That is, if we can tame the text mining problem, it would be possible to reason and learn from a much wider range of business data than ever before.

Results (with Awk)

Note that, in the Menzies/Marcus and Schmitt/Christianson toolkits, Awk by itself was not enough. Both toolkits were intricate combinations of Awk, sed, bash, and other tools. Within that combination, Awk was very useful for handling the specifics not managed by the other tools.


categories: TextMining,Mar,2009,LotharS

Awk and Sed for Language Analysis

References

Lothar M. Schmitt and Kiel T. Christianson:

Description

The authors show how to construct tools for language analysis in research and teaching using Awk, the Bourne shell, and sed under UNIX. Applications include the following:
  • searches for words, phrases, grammatical patterns and phonemic patterns in text;
  • statistical evaluation of texts in regard to such searches;
  • transformation of phonetic, phonemic or typographic transcriptions;
  • comparison of texts in various respects;
  • lexical-etymological analysis;
  • concordance;
  • assistance in translating text;
  • assistance in learning languages;
  • assistance in teaching languages;
  • and text processing and formatting. This latter includes the generation of on-line dictionaries for the Internet from files that were generated with what-you-see-is-what-you-get editors representing only the linear structure of the dictionary (i.e., the book).
All of the above can be achieved with particularly simple and short code. In that regard, they illustrate how sed and awk can be combined in the pipe mechanism of UNIX to create very powerful processing devices.

Their notes include a short introduction to programming the Bourne-shell and rather short, but complete descriptions of sed and awk customized in regard to language analysis.


categories: TextMining,Mar,2009,Timm

Text Mining Issue Reports

References

Tim Menzies and Andrian Marcus:

Description

Severis is a set of Awk, bash, sed, etc scripts for finding predictors of high severity issues in text reports. Test engineers write such issue reports whenever they encounter anomalies in the code they are inspecting.

Severis was designed to be an audit tool for test engineers, a second "look over the shoulder" to alert a senior engineer if a junior test engineer was doing something strange.

At least for the text issue reports studied by Severis, very simple tools were enough to determine the terms that predict different issue severities.


categories: TextMining,Mar,2009,DonaldM

Text Munging in Awk (and Perl and Python)

Donald 'Paddy' McCarthy reports an interesting comparison of Awk vs Perl vs Python for doing some text pre-processing.

The example shows off Awk's ability to quickly prototype a one-off specialized report for a particular data format.

It also offers some comment on the language wars between Awk and <insert your favorite scripting language here>: there is no evidence in the following code that dear old-fashioned Awk is more complex or arcane or slower than more recent, supposedly better, languages.

  • Tests on 1MB of data of the form
    <string:date> [ <float:data-n> <int:flag-n> ]*24
    

    e.g.

    1991-03-31 10.000  1 10.000  1  ... 20.000      1       35.000  1
    
  • Time to process 1MB of data (over 5000 records of the above form):
    • Awk: 1.069s
    • Perl: 2.450s
    • Python: 1.138s

Awk

The awk example:

# Author Donald 'Paddy' McCarthy Jan 01 2007

BEGIN{
  nodata = 0;             # Current run of consecutive flags < 0 in lines of file
  nodata_max=-1;          # Max consecutive flags < 0 in lines of file
  nodata_maxline="!";     # ... and line number(s) where it occurs
}
FNR==1 {
  # Accumulate input file names
  if(infiles){
    infiles = infiles "," FILENAME
  } else {
    infiles = FILENAME
  }
}
{
  tot_line=0;             # sum of line data
  num_line=0;             # number of line data items with flag>0

  # extract field info, skipping initial date field
  for(field=2; field <= NF; field+=2){
    datum=$field;
    flag=$(field+1);
    if(flag < 1){
      nodata++
    }else{
      # check run of data-absent fields
      if(nodata_max==nodata && (nodata>0)){
        nodata_maxline=nodata_maxline ", " $1
      }
      if(nodata_max < nodata && (nodata>0)){
        nodata_max=nodata
        nodata_maxline=$1
      }
      # re-initialise run of nodata counter
      nodata=0;
      # gather values for averaging
      tot_line+=datum
      num_line++;
    }
  }

  # totals for the file so far
  tot_file += tot_line
  num_file += num_line

  printf "Line: %11s  Reject: %2i  Accept: %2i  Line_tot: %10.3f  Line_avg: %10.3f\n", \
         $1, ((NF -1)/2) -num_line, num_line, tot_line, (num_line>0)? tot_line/num_line: 0

  # debug prints of original data plus some of the computed values
  #printf "%s  %15.3g  %4i\n", $0, tot_line, num_line
  #printf "%s\n  %15.3f  %4i  %4i  %4i  %s\n", $0, tot_line, num_line,  nodata, nodata_max, nodata_maxline


}

END{
  printf "\n"
  printf "File(s)  = %s\n", infiles
  printf "Total    = %10.3f\n", tot_file
  printf "Readings = %6i\n", num_file
  printf "Average  = %10.3f\n", tot_file / num_file

  printf "\nMaximum run(s) of %i consecutive false readings ends at line starting with date(s): %s\n", nodata_max, nodata_maxline
}

Perl

The same functionality in perl is very similar to the awk program:

# Author Donald 'Paddy' McCarthy Jan 01 2007

BEGIN {
  $nodata = 0;             # Current run of consecutive flags < 0 in lines of file
  $nodata_max=-1;          # Max consecutive flags < 0 in lines of file
  $nodata_maxline="!";     # ... and line number(s) where it occurs
}
foreach (@ARGV) {
  # Accumulate input file names
  if($infiles ne ""){
    $infiles = "$infiles, $_";
  } else {
    $infiles = $_;
  }
}

while (<>){
  $tot_line=0;             # sum of line data
  $num_line=0;             # number of line data items with flag>0

  # extract field info, skipping initial date field
  chomp;
  @fields = split(/\s+/);
  $nf = @fields;
  $date = $fields[0];
  for($field=1; $field < $nf; $field+=2){
    $datum = $fields[$field] +0.0;
    $flag  = $fields[$field+1] +0;
    if(($flag+1 < 2)){
      $nodata++;
    }else{
      # check run of data-absent fields
      if($nodata_max==$nodata and ($nodata>0)){
        $nodata_maxline = "$nodata_maxline, $fields[0]";
      }
      if($nodata_max < $nodata and ($nodata>0)){
        $nodata_max = $nodata;
        $nodata_maxline=$fields[0];
      }
      # re-initialise run of nodata counter
      $nodata = 0;
      # gather values for averaging
      $tot_line += $datum;
      $num_line++;
    }
  }

  # totals for the file so far
  $tot_file += $tot_line;
  $num_file += $num_line;

  printf "Line: %11s  Reject: %2i  Accept: %2i  Line_tot: %10.3f  Line_avg: %10.3f\n",
         $date, (($nf -1)/2) -$num_line, $num_line, $tot_line, ($num_line>0)? $tot_line/$num_line: 0;

}

printf "\n";
printf "File(s)  = %s\n", $infiles;
printf "Total    = %10.3f\n", $tot_file;
printf "Readings = %6i\n", $num_file;
printf "Average  = %10.3f\n", $tot_file / $num_file;

printf "\nMaximum run(s) of %i consecutive false readings ends at line starting with date(s): %s\n",
       $nodata_max, $nodata_maxline;

Python

The Python program, however, splits the fields in each line slightly differently (although it could also use the method used in the Perl and awk programs):

# Author Donald 'Paddy' McCarthy Jan 01 2007

import fileinput
import sys

nodata = 0;             # Current run of consecutive flags < 1 in lines of file
nodata_max=-1;          # Max consecutive flags < 1 in lines of file
nodata_maxline=[];      # ... and line number(s) where it occurs

tot_file = 0            # Sum of file data
num_file = 0            # Number of file data items with flag>0

infiles = sys.argv[1:]

for line in fileinput.input():
  tot_line=0;             # sum of line data
  num_line=0;             # number of line data items with flag>0

  # extract field info
  field = line.split()
  date  = field[0]
  data  = [float(f) for f in field[1::2]]
  flags = [int(f)   for f in field[2::2]]

  for datum, flag in zip(data, flags):
    if flag < 1:
      nodata += 1
    else:
      # check run of data-absent fields
      if nodata_max==nodata and nodata>0:
        nodata_maxline.append(date)
      if nodata_max < nodata and nodata>0:
        nodata_max=nodata
        nodata_maxline=[date]
      # re-initialise run of nodata counter
      nodata=0;
      # gather values for averaging
      tot_line += datum
      num_line += 1

  # totals for the file so far
  tot_file += tot_line
  num_file += num_line

  print("Line: %11s  Reject: %2i  Accept: %2i  Line_tot: %10.3f  Line_avg: %10.3f" % (
        date,
        len(data) -num_line,
        num_line, tot_line,
        tot_line/num_line if (num_line>0) else 0))

print("")
print("File(s)  = %s" % (", ".join(infiles),))
print("Total    = %10.3f" % (tot_file,))
print("Readings = %6i" % (num_file,))
print("Average  = %10.3f" % (tot_file / num_file,))

print("\nMaximum run(s) of %i consecutive false readings ends at line starting with date(s): %s" % (
    nodata_max, ", ".join(nodata_maxline)))
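
The slice-based split used by the Python program can be seen in isolation in the following minimal sketch. The sample line is made up for illustration (a date followed by value/flag pairs), and split_record is a hypothetical helper name, not part of the original program.

```python
def split_record(line):
    """Split a 'date value flag value flag ...' line into its parts."""
    field = line.split()
    date = field[0]
    # field[1::2] picks every second field starting at the first value
    data = [float(f) for f in field[1::2]]
    # field[2::2] picks every second field starting at the first flag
    flags = [int(f) for f in field[2::2]]
    return date, data, flags


# Hypothetical sample line: a date followed by three value/flag pairs
date, data, flags = split_record("1991-03-31 10.000 1 20.000 -2 30.000 1")

# Keep only the values whose flag marks them as valid (flag >= 1)
accepted = [d for d, f in zip(data, flags) if f >= 1]
print(date, data, flags, accepted)
```

Slicing keeps the values and flags in two parallel lists, so len(data) gives the number of readings on the line directly, whereas the awk and Perl versions derive it from the field count.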


categories: GUI,,2010,Mar,MichaelS

Xmonthly: Xwindows Interface for Gawk

xmonthly is a hybrid shell script (part bash, part gawk, part X Toolkit Intrinsics) that displays an overview of reminders for the current month.

It is a simple example of how to use X-window tools with Gawk.

xmonthly uses gawk's strftime() function to determine the current month. Non-English users must use the first three letters of each month name in the xmonthly database, as determined by the end user's locale.
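
The month lookup described above can be sketched roughly as follows. This is written in Python rather than gawk, and the reminder format (a month abbreviation followed by free text) is an assumption made for illustration, not xmonthly's actual database layout; reminders_for is a hypothetical helper name.

```python
import time


def reminders_for(month_abbrev, lines):
    """Return reminder texts whose first word matches the month abbreviation."""
    wanted = month_abbrev.lower()
    found = []
    for line in lines:
        parts = line.split(None, 1)   # first word, then the rest of the line
        if len(parts) == 2 and parts[0].lower() == wanted:
            found.append(parts[1])
    return found


# Hypothetical reminder database: "<month abbreviation> <text>"
database = [
    "Jan Renew domain",
    "Mar Back up home directory",
    "Mar Pay insurance",
]

# %b is the abbreviated month name for the active locale (for example "Mar")
current = time.strftime("%b")
print(current, reminders_for(current, database))
```

Because the abbreviation comes from the locale, the entries in the database have to be spelled the same way the locale spells them, which is why non-English users need locale-specific month prefixes.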

Download: 16K

Author: Michael S. Sanders
