Awk.Info

"Cause a little auk awk
goes a long way."

About awk.info
 »  table of contents
 »  featured topics
 »  page tags


About Awk
 »  advocacy
 »  learning
 »  history
 »  Wikipedia entry
 »  mascot
Implementations
 »  Awk (rarely used)
 »  Nawk (the-one-true, old)
 »  Gawk (widely used)
 »  Mawk
 »  Xgawk (gawk + xml + ...)
 »  Spawk (SQL + awk)
 »  Jawk (Awk in Java JVM)
 »  QTawk (extensions to gawk)
 »  Runawk (a runtime tool)
 »  platform support
Coding
 »  one-liners
 »  ten-liners
 »  tips
 »  the Awk 100
Community
 »  read our blog
 »  read/write the awk wiki
 »  discussion news group

Libraries
 »  Gawk
 »  Xgawk
 »  the Lawker library
Online doc
 »  reference card
 »  cheat sheet
 »  manual pages
 »  FAQ

Reading
 »  articles
 »  books:

WHAT'S NEW?

Mar 01: Michael Sanders demos an X-windows GUI for AWK.

Mar 01: Awk100#24: A. Lahm and E. de Rinaldis' patent search, in AWK

Feb 28: Tim Menzies asks this community to write an AWK cookbook.

Feb 28: Arnold Robbins announces a new debugger for GAWK.

Feb 28: Awk100#23: Premysl Janouch offers an IRC bot, in AWK

Feb 28: Updated: the AWK FAQ

Feb 28: Tim Menzies offers a tiny content management system, in Awk.

Jan 31: Comment system added to awk.info. For example, see the discussion at the bottom of ?keys2awk

Jan 31: Martin Cohen shows that Gawk can handle massively long strings (300 million characters).

Jan 31: The AWK FAQ is being updated. For comments/ corrections/ extensions, please mail tim@menzies.us

Jan 31: Martin Cohen finds Awk on the Android platform.

Jan 31: Aleksey Cheusov released a new version of runawk.

Jan 31: Hirofumi Saito contributes a candidate Awk mascot.

Jan 31: Michael Sanders shows how to quickly build an AWK GUI for windows.

Jan 31: Hyung-Hwan Chung offers QSE, an embeddable Awk Interpreter.

[More ...]


categories: Verification,Jul,2009,Admin

Awk and Verification

These pages focus on program verification tools, written in Awk.


categories: Databases,Jul,2009,Admin

Awk and Databases

These pages focus on databases and Awk.


categories: Mascot,Jul,2009,DickL

New Awk Mascot: 'AWK-eye the Dwarf?

by Dick L.

I write to suggest that the Awk mascot's name is Hawk-eye (usually spoken as 'AWK-eye, with a silent H).

I suggest 'AWK-eye is a DWARF, based on the following analogy:

  • My good friend hAWK-eye is a dwarf, from the first ages, long ago. The dwarves are renowned for their skill in mining and metalwork.
  • My friend is known as Hawk-eye, because even among dwarves, he can mine precious metals and jewels from the dross with great ease and precision. And, having found these precious things, he is able to quickly fashion them into all manner of things, both practical and beautiful.
  • As one from the first age, he is sometimes called primitive. I prefer to call him elemental, so tightly focused is he on what he does best: mining and making from the mountains of text I throw to him. Like all dwarves, he is small - but sinewy and unbreakable.
  • He carries with him his tools of trade - a strong sieve of subtle REGEXP for separating the jewels and metal from the dross, and his hammer that he uses to fashion the gold, silver and jewels into useful and beautiful objects. In appearance he is old and gnarly, but don't be put off - he knows his stuff and works willingly.

I can't draw, but 'AWK-eye looks about halfway between Gimli from Lord of the Rings and Doc from Disney's Snow White and the Seven Dwarfs. (He has been known to sing "hi ho, hi ho, it's off to work I go". He likes to work!)

I know many spirits and sprites from the first age - LISP, APL, Assembler, Basic, Fortran and Algol. I have lost contact with most of these old friends, but I ask 'AWK-eye to do new work most weeks. Why?

  • He is small enough that I can take him everywhere. No esoteric installs and fancy GUIs and other bloat.
  • He is so focused on doing the work that I often need done: simple mining and re-fashioning of text, voluminous text.
  • I can say what I want so simply. 'AWK-eye speaks filtering and fashioning. Usually a few lines to 'AWK-eye accomplish what would take ten or twenty lines with the new age creatures.

Yes, I love python, and javascript and all those creatures of later ages. And for some projects, functions as first-class citizens, objects, and the works are just what I want.

But for many daily jobs, 'AWK-eye is on the sweet spot of enough expressiveness to do the job, but not so much as to be hard to remember, and is small enough I have him everywhere.


categories: Wp,Dsl,Jul,2009,JesusG

md2html : Update to Markdown.awk

Jesus Galan (yiyus) (yiyu DOT jgl AT gmail DOT com) has updated his markdown system.

His new md2html.awk code adds several new features and fixes numerous bugs.

For more on this new code, see his history of a rewrite.

Download

Download from LAWKER.


categories: Jul,2009,RussC

Awk's RE Matching is Very Fast

(This page is a summary of Russ Cox's excellent article Regular Expression Matching Can Be Simple and Fast.)

Russ Cox writes that Awk's regular expression library is far faster than those used in Perl, Ruby, and Python:

    This is a tale of two approaches to regular expression matching. One of them is in widespread use in the standard interpreters for many languages, including Perl. The other is used only in a few places, notably most implementations of awk and grep. The two approaches have wildly different performance characteristics.

    Let's use superscripts to denote string repetition, so that a?^3a^3 is shorthand for a?a?a?aaa. This lets us define timing experiments in which the regular expression a?^n a^n is matched against the string a^n.

    If we conduct those experiments, Perl requires over sixty seconds to match a 29-character string. The other approach, labeled Thompson NFA for reasons that will be explained later, requires twenty microseconds to match the string. That's not a typo. ... the Thompson NFA implementation is a million times faster than Perl when running on a minuscule 29-character string. This trend grows as we increase "n": the Thompson NFA handles a 100-character string in under 200 microseconds, while Perl would require over 10^15 years. (Perl is only the most conspicuous example of a large number of popular programs that use the same algorithm; the above graph could have been Python, or PHP, or Ruby, or many other languages.).

For some details of his results, see the following graph. Note that the y-axis is logarithmic (increases by a power of ten for each tick) so these differences are really big differences:

The reason for these differences is very technical, but Cox's article offers an excellent and clear description of the details. In short, the RE matcher used in Perl, Ruby, and Python is a recursive backtracking algorithm that keeps the match in only one state at a time. The Thompson NFA used in Awk and Grep, on the other hand, allows a match to be in multiple states at once. Using Thompson's NFA, the whole set of possible transitions can be pre-computed and cached at compile time, removing the backtrack-on-failure step.
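You can try the awk side of this experiment yourself. The following one-liner (a sketch; the value of n and the timings will vary by machine) builds the pathological pattern a?^n a^n as a string and matches it against a^n; awk finishes essentially instantly:

	gawk 'BEGIN {
		n = 29
		for (i = 1; i <= n; i++) { re = re "a?"; s = s "a" }
		for (i = 1; i <= n; i++)   re = re "a"
		if (s ~ re) print "matched, n=" n
	}'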

And what is the lesson here? Next time someone tells you Awk is old-fashioned, cough politely and mention that at least in some aspects, certain supposedly-more-modern languages do not offer all the support provided by dear-"old"- Awk.


categories: Databases,Tips,Jul,2009,VictorA

Using Awk for Databases

Contents

Download

Download all the following example code and support data files from LAWKER

General Information

Introduction

This page contains a set of sample Awk scripts to manage different kinds of databases. In all cases, we'll use a text editor such as edit.exe to create and edit the data files, and Awk scripts will be used to query and manipulate the data.

OK, so it's not a fancy GUI-based system, but this method is flexible and the scripts execute relatively quickly. Also, your data won't be locked in some company's proprietary binary file format. There is also the benefit of portability: if your PC can run DOS, you can run these scripts on it. Awk is also available on Linux and on other operating systems.

This page assumes that you are already familiar with database terms like 'record', 'field', and 'search keyword'.

Introduction to Awk

Awk is an interpreted programming language that is designed for managing and converting data files and generating reports from the data.

Awk will automatically read an input file and parse it into records and fields, one record at a time. A typical Awk script will then manipulate the fields using predefined variables like $1 (the first field), $2 (the second field), etc.

To use Awk, you create an Awk script, and then run it with the Awk program (gawk.exe in this case). Many Awk scripts are small, and the language lends itself to writing "one-time use" programs.
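For instance, a "one-time use" program can be typed directly on the command line. This one-liner (the data file name is just an example) swaps the first two fields of every line:

	gawk "{ print $2, $1 }" data.txt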

Using the Scripts

All the files on this page are available in the ZIP archive at this link. Feel free to reuse and customize them.

You will need the GNU Awk program gawk.exe to be installed on your QuickPAD Pro. See the programming page for instructions on installing GNU Awk.

Here is the general format of a gawk command line:

	gawk -f SCRIPT DATAFILE
where SCRIPT is the name of the file that contains the Awk script and DATAFILE is the name of the text file that contains the input data.

That command line will not modify the input file and all the output will be directed to the screen.

If a script creates a new data file (for example, a sort script), the command line will be:

	gawk -f SCRIPT DATAFILE > NEWFILE
where NEWFILE is the name of the new data file that will be created.

If you use a particular script often and get tired of typing in a long command line, you can create a batch file to execute the long command line for you.
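For example, a batch file called 'report.bat' (the names here are just placeholders) containing the single line

	gawk -f myscript grades.txt > report.txt

lets you run the whole command by typing just 'report'.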

We are currently limited to 64K files for our data. We can work around this restriction by using the chop utility program that is described on the software page.

Index Card Databases

Card File

In this section we demonstrate some Awk scripts to manage an 'index card' file. This type of database can be used for any kind of simple text list: lists of books, music CDs, recipes, quotations, etc.

Our information will be stored into 'cards'. Each card will have a 'title' and a 'body':

	Title of Card
	-------------------------
	Free-formatted field of 
	information about this 
	particular card, but
	without any blank lines.
Let's take this information and store it in a text file. To keep things simple, the cards within the file are separated with a blank line, and the first line of each card will be the title.

For example, let's create a sample card file called 'cards.txt' and use it to store a list of our goals.

	Write a book and become famous
	This is a long range
	goal. I need a good book
	idea first. And writing
	skills.

	Solve the problems of society
	This might take
	a little longer
	than expected.

	Take out the garbage
	It's stinking up
	the garage.

Let's begin with an Awk script to print out the titles of all the cards in the file. Here is the script called 'titles':

	# titles - Print the titles of all the cards in the
	# index card file.

	BEGIN { RS = ""; FS = "\n" }
	        { print $1 }

Here is a sample run:

	[B:\] gawk -f titles cards.txt
	Write a book and become famous
	Solve the problems of society
	Take out the garbage
	[B:\]

Another useful script is one that can be used for searching the data file, ignoring uppercase and lowercase distinctions. The following script called 'search' will display the cards that contain the keyword 'write'.

	# search - Print the index card that contains a string

	BEGIN   { RS = ""; FS = "\n"; IGNORECASE=1 }

	/write/ { print $0, "\n" }

Here is a sample run:

	[B:\] gawk -f search cards.txt
	Write a book and become famous
	This is a long range
	goal. I need a good book
	idea first. And writing
	skills.

	[B:\]

To search for other strings, edit the 'search' script and replace 'write' with another search keyword.
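Alternatively, here is a sketch of a version that takes the keyword from the command line with gawk's -v option, so the script itself never needs editing:

	# searchv - print the index cards that match a given keyword
	# usage: gawk -v pat=write -f searchv cards.txt

	BEGIN    { RS = ""; FS = "\n"; IGNORECASE=1 }

	$0 ~ pat { print $0, "\n" }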

Sorting the cards based on the titles would also be a useful operation. Here is a script called 'sort' which reads the entire data file into an array and then uses the QuickSort algorithm to sort it:

	# sort - Sort index card file by the card titles

	BEGIN { RS = ""; FS = "\n" }

	      { A[NR] = $0 } 

	END   {
		qsort(A, 1, NR)
		for (i = 1; i <= NR; i++) {
			print A[i]
			if (i == NR) break
			print ""
		}
	      }

	# QuickSort
	# Source: "The AWK Programming Language", by Aho et al., p. 161
	function qsort(A, left, right,   i, last) {
		if (left >= right)
			return
		swap(A, left, left+int((right-left+1)*rand()))
		last = left
		for (i = left+1; i <= right; i++)
			if (A[i] < A[left])
				swap(A, ++last, i)
		swap(A, left, last)
		qsort(A, left, last-1)
		qsort(A, last+1, right)
	}
	function swap(A, i, j,   t) {
		t = A[i]; A[i] = A[j]; A[j] = t
	}

And here is a sample run:

	[B:\] awk -f sort cards.txt > new.txt
	[B:\] rename cards.txt cards.bak
	[B:\] rename new.txt cards.txt
	[B:\] type cards.txt
	Solve the problems of society
	This might take
	a little longer
	than expected.

	Take out the garbage
	It's stinking up
	the garage.

	Write a book and become famous
	This is a long range
	goal. I need a good book
	idea first. And writing
	skills.
	[B:\]
Note that we renamed our old data file to cards.bak, instead of deleting the file. It's always good to keep backups of old databases.

However, the 'sort' script has some trouble with large files because it reads all the cards into an array in RAM. In my tests, the largest file I was able to sort was only about 100K.

"Flash Cards" for Memorization

Index cards can also be used for memorization. The title of the card can contain a question and the body of the card contains the answer that you want to memorize.

Let's write a program that randomly chooses a card from our 'cards.txt' file, displays its title, asks the user to press the 'Enter' key, and then displays the body of that card.

First, we need a text file which contains the questions and answers that we want to memorize. Let's name the file 'question.txt'. Note that the answer can contain multiple lines:

	What is your name?
	My name is
	Sir Lancelot
	of Camelot.

	What is your quest?
	To seek the
	Holy Grail.

	What is your favorite color?
	Blue.

Here is the Awk script called 'memorize'. It will read the data file into an array, randomly shuffle the array, and then it will loop through the array and display each question and answer.

	# memorize - randomly display an index card title, ask user to
	# press return, then display the corresponding body of the card

	BEGIN { RS=""; FS="\n" }

	      { A[NR] = $0 } 

	END   {
		RS="\n"; FS=" "
		shuffle(A, NR)
		for (i = 1; i <= NR; i++) {
			print "\nQUESTION: ", substr(A[i], 1, index(A[i], "\n")-1)
			printf "\nPress return for the answer: "
			getline < "-"
			print "\nANSWER: "
			print substr(A[i], index(A[i], "\n")+1)
			if (i == NR) break
			printf "\nPress return to continue, or 'q' to quit: "
			getline < "-"
			if ($1 == "q") break
		}
	      }

	# Shuffle the array
	function shuffle(A, n,   t) {
		srand()
		# Moses/Oakford shuffle algorithm
		for (i = n; i > 1; i--) {
			j = int((i-1) * rand()) + 1
			t = A[j]; A[j] = A[i]; A[i] = t
		}
	}

Here is a sample run. The script will randomly choose cards until it either finishes going through all the cards, or until the user enters a 'q' to quit.

	[B:\] gawk -f memorize question.txt

	QUESTION:  What is your quest?

	Press return for the answer:

	ANSWER:
	To seek the
	Holy Grail.

	Press return to continue, or 'q' to quit:

	QUESTION:  What is your favorite color?

	Press return for the answer:

	ANSWER:
	Blue.

	Press return to continue, or 'q' to quit:

	QUESTION:  What is your name?

	Press return for the answer:

	ANSWER:
	My name is
	Sir Lancelot
	of Camelot.
	[B:\] gawk -f memorize question.txt
	
	QUESTION:  What is your favorite color?
	
	Press return for the answer:

	ANSWER:
	Blue.

	Press return to continue, or 'q' to quit: q
	[B:\] 

Custom Databases

Address Book

The databases above used a simple 'index card' analogy. That data model works fine for simple lists with free form data, but there are also cases where we need to manage records with specialized data fields.

Let's create a data file and some scripts for an 'address book' database. Our data file will be a text file where every line is one record. Within a line of the file, the data will be separated into fields.

When choosing a delimiter for our fields, we need to make sure that it won't appear accidentally within a field itself. For example, an address book has fields like name, company name, address, etc., and each of those fields can contain spaces (e.g. "ACME Mail Order Company"). Therefore, we can't use a space to separate the fields of the line.

Instead, let's use commas to separate the fields, and we'll need a rule that commas cannot appear within a field.

Here is a sample data file called 'address.txt':

John Robinson,Koren Inc.,978 4th Ave,Boston,MA 01760,617-696-0987
Phyllis Chapman,GVE Corp.,34 Sea Drive,Amesbury,MA 01881,781-879-0900
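Before formatting any reports, we can check the comma rule with a quick one-liner that flags every record that does not have exactly six fields (a sketch):

	gawk -F, "NF != 6 { print NR, $0 }" address.txt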
Here is the script called 'labels' which will print all the data and format it like mailing labels:
	# labels - Format the addresses for printing labels
	# Source: blocklist.awk from "Sed & Awk", by Dale Dougherty, p.148

	BEGIN { FS = "," }

	{
	        print ""        # blank line
	        print $1        # name
	        print $2        # company
	        print $3        # street
	        print $4, $5    # city, state zip
	}
This is the sample run:
	[B:\] gawk -f labels address.txt
	
	John Robinson
	Koren Inc.
	978 4th Ave
	Boston MA 01760
	
	Phyllis Chapman
	GVE Corp.
	34 Sea Drive
	Amesbury MA 01881	
	[B:\] 

It may also be useful to extract just the phone numbers from our data file. Here is the script called 'phones' which will extract only the names and phone numbers from the data file:

	# phones
	# Source: phonelist.awk, from "Sed & Awk", by Dale Dougherty, p.148

	BEGIN { FS="," }

	{ print $1 ", " $6 }
Here is a sample run:
	[B:\] gawk -f phones address.txt
	John Robinson, 617-696-0987
	Phyllis Chapman, 781-879-0900
	[B:\] 
We'll also need a script to search our data file for a name. Here is a script called 'searchad' which will search for the string 'robinson':
	# searchad - Return the record that matches a string

	BEGIN { FS = ","; IGNORECASE=1 }

	/robinson/ {
	        print ""        # blank line
	        print $1        # name
	        print $2        # company
	        print $3        # street
	        print $4, $5    # city, state zip
	}

Here is a sample run:

	[B:\] gawk -f searchad address.txt

	John Robinson
	Koren Inc.
	978 4th Ave
	Boston MA 01760
	[B:\] 

Grading Program

Awk can also be used for mathematical computation of fields. Let's demonstrate this with a data file called 'grades.txt' that contains grades of students.

	Allen Mona 70 77 85 83 70 89
	Baker John 85 92 78 94 88 91
	Jones Andrea 89 90 85 94 90 95
	Smith Jasper 84 88 80 92 84 82
	Turner Dunce 64 80 60 60 61 62
	Wells Ellis 90 98 89 96 96 92

Here is a longer script, called 'grades', that will take all the grades, weight them equally, and compute the final average and the final letter grade for each student. At the end, it will compute some statistics about the entire class.

	# grades -- average student grades and determine
	# letter grade as well as class averages
	# Source: "Sed & Awk", by Dale Dougherty, p.192

	# set output field separator to tab.
	BEGIN { OFS = "\t" }

	# action applied to all input lines
	{
		# add up the grades
		total = 0
		for (i = 3; i <= NF; ++i)
			total += $i
		# calculate average
		avg = total / (NF - 2)
		# assign student's average to element of array
		class_avg[NR] = avg
		# determine letter grade
		if (avg >= 90) grade="A"
		else if (avg >= 80) grade="B"
		else if (avg >= 70) grade="C"
		else if (avg >= 60) grade="D"
		else grade="F"
		# increment counter for letter grade array
		++class_grade[grade]
		# print student name, average, and letter grade
		print $1 " " $2, avg, grade
	}

	# print out class statistics
	END  {
		# calculate class average
		for (x = 1; x <= NR; x++)
			class_avg_total += class_avg[x]
		class_average = class_avg_total / NR
		# determine how many above/below average
		for (x = 1; x <= NR; x++)
			if (class_avg[x] >= class_average)
				++above_average
			else
				++below_average
		# print results
		print ""
		print "Class Average: ", class_average
		print "At or Above Average: ", above_average
		print "Below Average: ", below_average
		# print number of students per letter grade
		for (letter_grade in class_grade)
			print letter_grade ":", class_grade[letter_grade]
	}

Here is a sample run:

	[B:\] gawk -f grades grades.txt
	Allen Mona      79      C
	Baker John      88      B
	Jones Andrea    90.5    A
	Smith Jasper    85      B
	Turner Dunce    64.5    D
	Wells Ellis     93.5    A

	Class Average:  83.4167
	At or Above Average:    4
	Below Average:  2
	A:      2
	B:      2
	C:      1
	D:      1
	[B:\]

Another useful script is the following program that computes a histogram of the grades. It is hardcoded to only read the third column ($3), but you can edit it and change it to read any of the columns in the input file. Here is the script called 'histo':

	# histogram
	# Source: "The AWK Programming Language", by Aho et al., p. 70

	     { x[int($3/10)]++ } # use the third column of input data

	END  {
	        for (i = 0; i < 10; i++)
	                printf(" %2d - %2d: %3d %s\n",
	                       10*i, 10*i+9, x[i], rep(x[i],"*"))
	        printf("100:      %3d %s\n", x[10], rep(x[10],"*"))
	     }

	function rep(n, s,   t) {   # return string of n s's
	        while (n--> 0)
	                t = t s
	        return t
	}
And here is the sample run:
	[B:\] gawk -f histo grades.txt
	  0 -  9:   0
	 10 - 19:   0
	 20 - 29:   0
	 30 - 39:   0
	 40 - 49:   0
	 50 - 59:   0
	 60 - 69:   1 *
	 70 - 79:   1 *
	 80 - 89:   3 ***
	 90 - 99:   1 *
	100:        0	
	[B:\]

The output shows that there were six grades, and most of them were in the 80-89 range.

Checkbook Program

This program takes a data file which lists your checkbook entries and your deposits, and calculates the totals.

Here is what a sample input file called 'checks.txt' looks like:

	check	1021
	to	Champagne Unlimited
	amount	123.10
	date	1/1/87

	deposit	
	amount	500.00
	date	1/1/87

	check	1022
	date	1/2/87
	amount	45.10
	to	Getwell Drug Store
	tax	medical

	check	1023
	amount	125.00
	to	International Travel
	date	1/3/87

	check	1024
	amount	50.00
	to	Carnegie Hall
	date	1/3/87
	tax	charitable contribution

	check	1025
	to	American Express
	amount	75.75
	date	1/5/87

Here is the script called 'check' which will calculate the totals:

	# check - print total deposits and checks
	# Source: "The AWK Programming Language", by Aho et al., p. 87

	BEGIN { RS=""; FS="\n" }

	/(^|\n)deposit/ { deposits += field("amount"); next }
	/(^|\n)check/   { checks += field("amount"); next }

	END   { printf("Deposits: $%.2f, Checks: $%.2f\n", 
		       deposits, checks)
	      }

	function field(name,   i, f) {
		for (i = 1; i <= NF; i++) {
			split($i, f, "\t")
			if (f[1] == name)
				return f[2]
		}
		printf("Error: no field %s in record\n%s\n", name, $0)
	}

And this is a sample run:

	[B:\] gawk -f check checks.txt
	Deposits: $500.00, Checks: $418.95
	[B:\]

Importing and Exporting Data

Importing Data for use by Awk

Awk works well with data files that are stored in text files. Awk assumes that the data file is organized into records, within each record the data is divided into fields, and there are unique characters in the file that are used as the field separators and record separators.

By default, Awk assumes that newline characters are the record separators and whitespace characters (spaces and tabs) are the field separators. It is also possible to redefine the field separators to other characters, like a comma or a tab character, which means that Awk can process the commonly used "comma separated" and "tab separated" format for data files.
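For example, processing a tab-separated file only requires resetting FS in a BEGIN block (or with -F on the command line); the rest of the script is unchanged:

	BEGIN { FS = "\t" }    # one record per line, fields split on tabs
	      { print $1 }     # print the first field of every record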

But note that if a file uses newline characters as record separators, a newline cannot appear within a field. For example, a data file with one record per line cannot contain a text field (e.g. a "notes" field) with free-form text and embedded newlines. That would confuse Awk unless we added special code to handle that notes field.

The same restrictions apply to the field separators. If a file is defined to be comma separated, it means that no field is allowed to contain comma characters within it (e.g. a Name field that contains "Alvarado, Victor") because Awk would parse that as two fields, not one.

That is why tab separated files tend to be used more often. That way, the fields are allowed to contain spaces and commas.

Another way to format data for use by Awk is to use the "multiline" format, which is what we used for our index card databases above. Awk will treat each line as a field, and a blank line is the record separator.

Exporting Data to Microsoft Excel

To export data to Excel, all we need to do is to convert the data file into tab-delimited format, and store it in a text file with a *.xls extension. When that file is opened in Microsoft Windows, Excel will open it automatically as if it were a spreadsheet.

As an example, let's export our grades.txt file to Excel. Here is our 'grades.txt' file:

	Allen Mona 70 77 85 83 70 89
	Baker John 85 92 78 94 88 91
	Jones Andrea 89 90 85 94 90 95
	Smith Jasper 84 88 80 92 84 82
	Turner Dunce 64 80 60 60 61 62
	Wells Ellis 90 98 89 96 96 92

The file uses spaces as the field separator, so we'll need a script that will convert the field separators into tabs. Here is a script called 'conv2xls':

	# conv2xls - Convert a data file into tab-separated format

	BEGIN {
	        FS=" "     # input field separator is a space (awk's FS)
	        OFS="\t"   # output field separator is a tab
	      }

	      { print $1, $2, $3, $4, $5, $6, $7, $8 }

And here is the sample run, where we store the tab-delimited output into a text file called grades.xls:

	[B:\] gawk -f conv2xls grades.txt > grades.xls
	[B:\]
Here is the contents of the 'grades.xls' text file:
	Allen   Mona    70      77      85      83      70      89
	Baker   John    85      92      78      94      88      91
	Jones   Andrea  89      90      85      94      90      95
	Smith   Jasper  84      88      80      92      84      82
	Turner  Dunce   64      80      60      60      61      62
	Wells   Ellis   90      98      89      96      96      92

We can then copy the grades.xls text file to a Windows PC, double-click on it, and Excel will open it as if it were a spreadsheet:

You can then do a "Save As" in Excel to save it as the regular Excel binary format.

Exporting Data to a Web Page

To export our data to a web page, we will need a script that will input our data file and generate HTML.

Let's start with our 'grades.txt' data file:

	Allen Mona 70 77 85 83 70 89
	Baker John 85 92 78 94 88 91
	Jones Andrea 89 90 85 94 90 95
	Smith Jasper 84 88 80 92 84 82
	Turner Dunce 64 80 60 60 61 62
	Wells Ellis 90 98 89 96 96 92

Here is a script called 'html' that will do the conversion. Note that the data will appear as rows of a table in HTML.

	# html - Convert a data file into an HTML web page with a table
	
	BEGIN {
		print "<HTML><HEAD><TITLE>Grades Database</TITLE></HEAD>"
		print "<BODY BGCOLOR=\"#ffffff\">"
		print "<CENTER><H1>Grades Database</H1></CENTER>"
		print "<HR noshade size=4 width=75%>"
		print "<P><CENTER><TABLE BORDER>"
		printf "<TR><TH>Last<TH>First"
		print "<TH>G1<TH>G2<TH>G3<TH>G4<TH>G5<TH>G6"
	      }
	
	      { # Print the data in table rows
		printf "<TR><TD>" $1 "<TD>" $2 
		printf "<TD>" $3 "<TD>" $4 "<TD>" $5 
		print  "<TD>" $6 "<TD>" $7 "<TD>" $8 
	      }
	
	END   {
		print "</TABLE></CENTER><P>"
		print "<HR noshade size=4 width=75%>"
		print "</BODY></HTML>"
	      }

Here is the sample run. The output will be placed in a file called 'grades.htm'.

	[B:\] gawk -f html grades.txt > grades.htm
	[B:\]

This is what the resulting 'grades.htm' file looks like:

	<HTML><HEAD><TITLE>Grades Database</TITLE></HEAD>
	<BODY BGCOLOR="#ffffff">
	<CENTER><H1>Grades Database</H1></CENTER>
	<HR noshade size=4 width=75%>
	<P><CENTER><TABLE BORDER>
	<TR><TH>Last<TH>First<TH>G1<TH>G2<TH>G3<TH>G4<TH>G5<TH>G6
	<TR><TD>Allen<TD>Mona<TD>70<TD>77<TD>85<TD>83<TD>70<TD>89
	<TR><TD>Baker<TD>John<TD>85<TD>92<TD>78<TD>94<TD>88<TD>91
	<TR><TD>Jones<TD>Andrea<TD>89<TD>90<TD>85<TD>94<TD>90<TD>95
	<TR><TD>Smith<TD>Jasper<TD>84<TD>88<TD>80<TD>92<TD>84<TD>82
	<TR><TD>Turner<TD>Dunce<TD>64<TD>80<TD>60<TD>60<TD>61<TD>62
	<TR><TD>Wells<TD>Ellis<TD>90<TD>98<TD>89<TD>96<TD>96<TD>92
	</TABLE></CENTER><P>
	<HR noshade size=4 width=75%>
	</BODY></HTML>

And here is a link to the grades.htm file so you can see what the web page looks like in your browser.

Exporting Data to a Palm Pilot

First, we will need to install a database program on the Palm. There are several database programs to choose from, but let's use the freeware database program called Pilot-DB (available here from PalmGear).

Next, we will need the freeware DOS tools that come with Pilot-DB to help us create the PDB data file. The DB-tools package is available here at PalmGear. You can download it and install it on your Windows PC. Those are DOS tools, but they were compiled to run in DOS under Windows, so we can't run them on the QuickPAD Pro. (Note: DB-tools is an open source project, so the source code is available.)

The DB-tools package contains a program called 'csv2pdb.exe'. It will do the conversion into a Palm PDB file.

Let's use the 'grades.txt' data file as an example:

	Allen Mona 70 77 85 83 70 89
	Baker John 85 92 78 94 88 91
	Jones Andrea 89 90 85 94 90 95
	Smith Jasper 84 88 80 92 84 82
	Turner Dunce 64 80 60 60 61 62
	Wells Ellis 90 98 89 96 96 92

Before we can run the 'csv2pdb.exe' program we first need to convert our data into "csv" (comma separated values) format. We can do that with the following awk script called 'conv2csv':

	# conv2csv - Convert a data file into comma-separated format

	BEGIN {
	        FS=" "     # input field separator is a space (awk's FS)
	        OFS=","    # output field separator is a comma
	      }

	      { print $1, $2, $3, $4, $5, $6, $7, $8 }

Here is the command line to create the comma-delimited data file, which we will call 'grades.csv':

	[B:\] gawk -f conv2csv grades.txt > grades.csv
	[B:\]

This is what the 'grades.csv' file looks like:

	Allen,Mona,70,77,85,83,70,89
	Baker,John,85,92,78,94,88,91
	Jones,Andrea,89,90,85,94,90,95
	Smith,Jasper,84,88,80,92,84,82
	Turner,Dunce,64,80,60,60,61,62
	Wells,Ellis,90,98,89,96,96,92

Next, we need to create an "info" file which will describe the format of our data. The 'csv2pdb.exe' program will need this information for the conversion to Palm format.

The info file will give our database a title and describe the fields of each record. In grades.csv, the first field is the student's last name, the second field is the student's first name, and the other six fields are the grades. Here is the resulting info file called 'grades.ifo':

	title "GradesDB"
	field "Last" string 38
	field "First" string 38
	field "G1" integer 14
	field "G2" integer 14
	field "G3" integer 14
	field "G4" integer 14
	field "G5" integer 14
	field "G6" integer 14
	option backup on

The numbers at the end of the lines are the field widths in pixels; we can make a guess for the field widths, and then fine-tune them on the Palm Pilot. The last line will set the backup bit on the PDB file so that it will be backed up at every hotsync.

From this point on, the rest of the steps must be done on your Windows PC.

On Your Windows PC

Now we create the PDB file on our PC with this command line:

	C:\> csv2pdb -i grades.ifo grades.csv grades.pdb
	C:\>

It will create a new file called 'grades.pdb' in the current directory. This is the Palm database file.

The last step is to install the PDB file to the Palm Pilot: in the Windows Explorer double-click on the PDB file and then hotsync your Palm Pilot as usual.

Here is a screen shot of the Palm Pilot running Pilot-DB with our grades database. (Make sure you have selected the blank unnamed view from the menu at the top-right corner of the screen):

As you can see, storing data as text files gives you a lot of flexibility in manipulating the data and exporting it to other formats.

Author

Victor Alvarado


categories: Tips,Jul,2009,Admin

Random Numbers in Gawk

(Summarized and extended from a recent discussion at comp.lang.awk.)

Background

A standard idiom in Gawk is to reset the random number generator in a BEGIN block.

BEGIN {srand() }

Sadly, when called with no arguments, this "reseeding" uses time-in-seconds. So if the same "random" task runs multiple times in the same second, it will get the same random number seed.

Houston, We Have a Problem

"Ben" writes:

I have a Gawk script that puts random comments into a file. It is run 3 times in a row in quick succession. I found that seeding the random number generator from gawk did not work, because all 3 runs happened within the same second (and the default seed is the time).

I was wondering if anyone could give me some suggestions as to what can be done to get around this problem.

Solution #1: Persistent Memory

Kenny McCormack writes:

When last I ran into this problem, what I did was to save the last value returned by rand() to a file, then on the next run, read that in and use that value as the arg to srand(). Worked well.

(Editor's comment: Kenny's solution does work well but incurs the cost of maintaining and reading/writing that "last value" file.)
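Here is a minimal sketch of the idea (the seed file name 'seed.txt' is an assumption):

	# seeded - reuse a value saved by the previous run as this run's seed
	BEGIN {
		seedfile = "seed.txt"      # assumed file name
		if ((getline s < seedfile) > 0)
			srand(s + 0)       # seed saved by the previous run
		else
			srand()            # first run: fall back to time-in-seconds
		close(seedfile)
	}
	END	{ print rand() > seedfile }  # save a value for the next run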

Solution #2: Use Bash

Tim Menzies writes:

How about setting the seed using the BASH $RANDOM variable:

gawk -v Seed=$RANDOM --source 'BEGIN { srand(Seed ? Seed : 1) }' 

If referenced multiple times in a second, it always generates a different number.

In the above usage, if we have a seed, we use it. Else, with no seed, all "random" runs start at the same place. If you prefer to use the default "seed from time-in-seconds" then use:

BEGIN { if (Seed) { srand(Seed) } else { srand() } }

(Editor's comment: Tim's solution incurs the overhead of additional command-line syntax. However, it does allow the process calling Gawk to control the seed. This is important when trying to, say, debug code by recreating the sequence of random numbers that lead to the bug.)

Solution #3: Query the OS

Thomas Weidenfeller writes:

Is that good enough (random enough) for your task?

BEGIN {
        "od -tu4 -N4 -A n /dev/random" | getline
        srand(0+$0)
}

(Editor's comment: Nice. Thomas' solution reminds us that "Gawk" can access a whole host of operating system facilities.)

Solution #4: Use the Process Id

Aharon Robbins writes:

You could do something like adding PROCINFO["pid"] to the value of the time, or use that as the seed.

$ gawk 'BEGIN { srand(systime() + PROCINFO["pid"]); print rand() }'
0.405889
$ gawk 'BEGIN { srand(systime() + PROCINFO["pid"]); print rand() }'
0.671906

(Editor's comment: Aharon's solution is the fastest of all the ones shown here. For example, on Mac OS/X, his solution takes 6ms to run:

$ time gawk 'BEGIN { srand(systime() + PROCINFO["pid"]) }'

real    0m0.006s
user    0m0.002s
sys     0m0.004s

while Thomas' solution is somewhat slower:

$ time gawk 'BEGIN { "od -tu4 -N4 -A n /dev/random" | getline; srand($0+0) }'

real    0m0.039s
user    0m0.004s
sys     0m0.034s

Note that while Aharon's solution is the fastest, it does not let some master process set the seed for the Gawk process (e.g. as in Tim's approach).)

Conclusion

If you want raw speed, use Aharon's approach.

If you want seed control, see Tim's approach.


categories: Papers,Jul,2009,JiirL

Visual Awk

Reference: Visual AWK: A Model for Text Processing by Demonstration, by Jürgen Landauer and Masahito Hirakawa. 11th International IEEE Symposium on Visual Languages, 1995.

Download

Download from LAWKER.

Abstract

Programming by Demonstration (PBD) systems often have problems with control structure inference and user-intended generalization. We propose a new solution for these weaknesses based on concepts of AWK and present a prototype system for text processing. It utilizes vertical demonstration, extensive visual feedback, and program visualization via spreadsheets to achieve improved usability and expressive power.

Introduction

In text editing users are often confronted with reformatting tasks which involve large portions of text, sometimes consisting of hundreds of lines. For example, let us assume we want to create mailing labels out of a given address list. The task seems to be easy to automate since all paragraphs are similarly structured, each containing a name, an address, and a phone number. However, both the built-in find and replace function and the macro recorder of the editor prove to be not flexible enough to handle the task, because their facilities for specifying search patterns and for dealing with special cases and exceptions are limited.

On the other hand, most current end-users consider solving such tasks with one of today's programming languages too difficult for them. Programming by Demonstration (PBD) is a promising remedy here since, by contrast, it promises nearly unlimited programming power through ease of learning and usage. Therefore, a variety of PBD systems were proposed for this application domain in the past. But PBD is not yet very widespread in commercial text editors because of some serious weaknesses.

This paper examines these weaknesses and presents a new approach to addressing the deficiencies of PBD. We introduce Visual AWK, a prototype text processing system developed at the Information Systems Lab of Hiroshima University, based on the programming language AWK, which incorporates the new design approach. Extensive visual feedback and program visualization via spreadsheets improve both usability and expressive power.

Visual AWK is aimed at users without previous knowledge of programming, but with experience in text editor use. The application domain is semi-structured texts; that is, texts that consist of equally structured entities, for instance lines or paragraphs, but may contain a few syntactically classifiable sets of exceptions with a different structure.


categories: Papers,Verification,Jul,2009,GerardH

MicroTrace

by Gerard Holzmann

Description

Micro-tracer is a little awk-script for verifying state machines; quite possibly the world's smallest working verifier. Some comments on the working of the script, plus a sample input for the X.21 protocol, are given below.

Reproduce and use freely, at your own risk of course. The micro-tracer was first described in this report:

  • Gerard Holzmann, X.21 Analysis Revisited: the Micro-Tracer, AT&T Bell Laboratories, Technical Memorandum 11271-8710230-12, October 23, 1987. (PDF)

Code

This script was written to show how little code is needed to write a working verifier for safety properties. The hard problem in writing a practical verifier is to make the search efficient, to support a useful logic, and a sensible specification language... (see the Spin homepage.)

$1 == "init"	{	proc[$2] = $3	}
$1 == "inp"	{	move[$2,$3]=move[$2,$3] $1 "/" $4 "/" $5 "/" $6 "/;" }
$1 == "out"	{	move[$2,$3]=move[$2,$3] $1 "/" $4 "/" $5 "/" $6 "/;" }
END		{	verbose=0; for (i in proc) signal[i] = "-"
			run(mkstate(state))
			for (i in space) nstates++;
			print nstates " states, " deadlocks " deadlocks"
		}

function run(state,  i,str,moved)	# 1 parameter, 3 local vars
{
	if (space[state]++) return	# been here before

	level++; moved=0
	for (i in proc)
	{	str = move[i,proc[i]]
		while (str)
		{	v = substr(str, 1, index(str, ";"))
			sub(v, "", str)
			split(v, arr, "/")
			if (arr[1] == "inp" && arr[3] == signal[arr[4]])
			{	Level[level] = i " " proc[i] " -> " v
				proc[i] = arr[2]
				run(mkstate(k))
				unwrap(state); moved=1
			} else if (arr[1] == "out")
			{	Level[level] = i " " proc[i] " -> " v
				proc[i] = arr[2]; signal[arr[4]] = arr[3]
				run(mkstate(k))
				unwrap(state); moved=1
	}	}	}
	if (!moved)
	{	deadlocks++
		print "deadlock " deadlocks ":"
		for (i in proc) print "\t" i, proc[i], signal[i]
		if (verbose)
			for (i = 1; i < level; i++) print i, Level[i]
	}
	level--
}
function mkstate(state, m)
{	state = ""
	for (m in proc) state = state " " proc[m] " " signal[m]
	return state
}
function unwrap(state, m)
{	split(state, arr, " "); nxt=0
	for (m in proc) { proc[m] = arr[++nxt]; signal[m] = arr[++nxt] }
}

The first three lines of the script deal with the input. Data are stored in two arrays. The initial state of machine A is stored in array element proc[A]. The transitions that machine A can make from state s are stored in move[A,s]. All data are stored as strings, and most arrays are also indexed with strings. All valid moves for A in state s, for instance, are concatenated into the same array element move[A,s], and later unwound as needed in function run().

The line starting with END is executed when the end of the input file has been reached and the complete protocol specification has been read. It initializes the signals and calls the symbolic execution routine run().

The program contains three function definitions: run(), mkstate(), and unwrap(). The global system state, state, is represented as a concatenation of strings encoding process and signal states. The function mkstate() creates the composite, and the function unwrap() restores the arrays proc and signal to the contents that correspond to the description in state. (The recursive step in run() alters their contents.) Function run() uses three local variables, but only one real parameter state that is passed by the calling routine.

The analyzer runs by inspecting the possible moves for each process in turn, checking for valid inp or out moves, and performing a complete depth-first search. Any state that has no successors is flagged as a deadlock. A backtrace of transitions leading into a deadlock is maintained in array Level and can be printed when a deadlock is found.

The first line in run() is a complete state space handler. The composite state is used to index a large array space. If the array element was indexed before, it returns a count larger than zero: the state was analyzed before, and the search can be truncated.

After the analysis completes, the contents of array space is available for other types of probing. In this case, the micro tracer just counts the number of states and prints it as a statistic, together with the number of deadlocks found.

A Sample Application -- X21

The transition rules are based on the classic two-process model for the call establishment phase of CCITT Recommendation X.21. Interface signal pairs T, C and R, I are combined. Each possible combination of values on these line pairs is represented by a distinct lower-case ASCII character below. Note that since the lines are modeled as true signals, the receiving process can indeed miss signals if the sending process changes them rapidly and does not wait for the peer process to respond.

Transition rules for the `dte' process.

inp dte state01 state08 u dte
inp dte state01 state18 m dte
inp dte state02 state03 v dte
inp dte state02 state15 u dte
inp dte state02 state19 m dte
inp dte state04 state19 m dte
inp dte state05 state19 m dte
inp dte state05 state6A r dte
inp dte state07 state19 m dte
inp dte state07 state6B r dte
inp dte state08 state19 m dte
inp dte state09 state10B q dte
inp dte state09 state19 m dte
inp dte state10 state19 m dte
inp dte state10 state6C r dte
inp dte state10B state19 m dte
inp dte state10B state6C r dte
inp dte state11 state12 n dte
inp dte state11 state19 m dte
inp dte state12 state19 m dte
inp dte state14 state19 m dte
inp dte state15 state03 v dte
inp dte state15 state19 m dte
inp dte state16 state17 m dte
inp dte state17 state21 l dte
inp dte state18 state01 l dte
inp dte state18 state19 m dte
inp dte state20 state21 l dte
inp dte state6A state07 q dte
inp dte state6A state19 m dte
inp dte state6B state07 q dte
inp dte state6B state10 q dte
inp dte state6B state19 m dte
inp dte state6C state11 l dte
inp dte state6C state19 m dte
out dte state01 state02 d dce
out dte state01 state14 i dce
out dte state01 state21 b dce
out dte state02 state16 b dce
out dte state03 state04 e dce
out dte state04 state05 c dce
out dte state04 state16 b dce
out dte state05 state16 b dce
out dte state07 state16 b dce
out dte state08 state09 c dce
out dte state08 state15 d dce
out dte state08 state16 b dce
out dte state09 state16 b dce
out dte state10 state16 b dce
out dte state10B state16 b dce
out dte state11 state16 b dce
out dte state12 state16 b dce
out dte state14 state01 a dce
out dte state14 state16 b dce
out dte state15 state16 b dce
out dte state18 state16 b dce
out dte state19 state20 b dce
out dte state21 state01 a dce
out dte state6A state16 b dce
out dte state6B state16 b dce
out dte state6C state16 b dce

Transition rules for the `dce' process.

inp dce state01 state02 d dce
inp dce state01 state14 i dce
inp dce state01 state21 b dce
inp dce state02 state16 b dce
inp dce state03 state04 e dce
inp dce state04 state05 c dce
inp dce state04 state16 b dce
inp dce state05 state16 b dce
inp dce state07 state16 b dce
inp dce state08 state09 c dce
inp dce state08 state15 d dce
inp dce state08 state16 b dce
inp dce state09 state16 b dce
inp dce state10 state16 b dce
inp dce state10B state16 b dce
inp dce state11 state16 b dce
inp dce state12 state16 b dce
inp dce state14 state01 a dce
inp dce state14 state16 b dce
inp dce state15 state16 b dce
inp dce state18 state16 b dce
inp dce state19 state20 b dce
inp dce state21 state01 a dce
inp dce state6A state16 b dce
inp dce state6B state16 b dce
inp dce state6C state16 b dce
out dce state01 state08 u dte
out dce state01 state18 m dte
out dce state02 state03 v dte
out dce state02 state15 u dte
out dce state02 state19 m dte
out dce state04 state19 m dte
out dce state05 state19 m dte
out dce state05 state6A r dte
out dce state07 state19 m dte
out dce state07 state6B r dte
out dce state08 state19 m dte
out dce state09 state10B q dte
out dce state09 state19 m dte
out dce state10 state19 m dte
out dce state10 state6C r dte
out dce state10B state19 m dte
out dce state10B state6C r dte
out dce state11 state12 n dte
out dce state11 state19 m dte
out dce state12 state19 m dte
out dce state14 state19 m dte
out dce state15 state03 v dte
out dce state15 state19 m dte
out dce state16 state17 m dte
out dce state17 state21 l dte
out dce state18 state01 l dte
out dce state18 state19 m dte
out dce state20 state21 l dte
out dce state6A state07 q dte
out dce state6A state19 m dte
out dce state6B state07 q dte
out dce state6B state10 q dte
out dce state6B state19 m dte
out dce state6C state11 l dte
out dce state6C state19 m dte

Initialization

init dte state01
init dce state01

Error Listings (verbose mode)

The error listings give, for each step number, the name of the executing machine followed by its state and an arrow. Behind the arrow is the transition rule: inp or out, the new state, the required or provided signal value, and the signal name.

deadlock 1:
	dce state21 b
	dte state16 l
1 dce state01 -> out/state08/u/dte/;
2 dce state08 -> out/state19/m/dte/;
3 dte state01 -> inp/state18/m/dte/;
4 dte state18 -> inp/state19/m/dte/;
5 dte state19 -> out/state20/b/dce/;
6 dce state19 -> inp/state20/b/dce/;
7 dce state20 -> out/state21/l/dte/;
8 dte state20 -> inp/state21/l/dte/;
9 dte state21 -> out/state01/a/dce/;
10 dce state21 -> inp/state01/a/dce/;
11 dce state01 -> out/state08/u/dte/;
12 dce state08 -> out/state19/m/dte/;
13 dte state01 -> inp/state18/m/dte/;
14 dte state18 -> out/state16/b/dce/;
15 dce state19 -> inp/state20/b/dce/;
16 dce state20 -> out/state21/l/dte/;
deadlock 2:
	dce state03 b
	dte state16 v
1 dce state01 -> out/state08/u/dte/;
2 dce state08 -> out/state19/m/dte/;
3 dte state01 -> inp/state18/m/dte/;
4 dte state18 -> inp/state19/m/dte/;
5 dte state19 -> out/state20/b/dce/;
6 dce state19 -> inp/state20/b/dce/;
7 dce state20 -> out/state21/l/dte/;
8 dte state20 -> inp/state21/l/dte/;
9 dte state21 -> out/state01/a/dce/;
10 dce state21 -> inp/state01/a/dce/;
11 dce state01 -> out/state08/u/dte/;
12 dce state08 -> out/state19/m/dte/;
13 dte state01 -> out/state21/b/dce/;
14 dce state19 -> inp/state20/b/dce/;
15 dte state21 -> out/state01/a/dce/;
16 dte state01 -> inp/state18/m/dte/;
17 dce state20 -> out/state21/l/dte/;
18 dce state21 -> inp/state01/a/dce/;
19 dce state01 -> out/state18/m/dte/;
20 dte state18 -> inp/state19/m/dte/;
21 dce state18 -> out/state01/l/dte/;
22 dte state19 -> out/state20/b/dce/;
23 dte state20 -> inp/state21/l/dte/;
24 dce state01 -> out/state08/u/dte/;
25 dce state08 -> inp/state16/b/dce/;
26 dte state21 -> out/state01/a/dce/;
27 dte state01 -> inp/state08/u/dte/;
28 dce state16 -> out/state17/m/dte/;
29 dce state17 -> out/state21/l/dte/;
30 dce state21 -> inp/state01/a/dce/;
31 dce state01 -> out/state08/u/dte/;
32 dte state08 -> out/state15/d/dce/;
33 dce state08 -> inp/state15/d/dce/;
34 dce state15 -> out/state03/v/dte/;
35 dte state15 -> inp/state03/v/dte/;
36 dte state03 -> out/state04/e/dce/;
37 dte state04 -> out/state05/c/dce/;
38 dte state05 -> out/state16/b/dce/;
deadlock 3:
	dce state03 b
	dte state20 v
1 dce state01 -> out/state08/u/dte/;
2 dce state08 -> out/state19/m/dte/;
3 dte state01 -> inp/state18/m/dte/;
4 dte state18 -> inp/state19/m/dte/;
5 dte state19 -> out/state20/b/dce/;
6 dce state19 -> inp/state20/b/dce/;
7 dce state20 -> out/state21/l/dte/;
8 dte state20 -> inp/state21/l/dte/;
9 dte state21 -> out/state01/a/dce/;
10 dce state21 -> inp/state01/a/dce/;
11 dce state01 -> out/state08/u/dte/;
12 dce state08 -> out/state19/m/dte/;
13 dte state01 -> out/state21/b/dce/;
14 dce state19 -> inp/state20/b/dce/;
15 dte state21 -> out/state01/a/dce/;
16 dte state01 -> inp/state18/m/dte/;
17 dce state20 -> out/state21/l/dte/;
18 dce state21 -> inp/state01/a/dce/;
19 dce state01 -> out/state18/m/dte/;
20 dte state18 -> inp/state19/m/dte/;
21 dce state18 -> out/state01/l/dte/;
22 dte state19 -> out/state20/b/dce/;
23 dte state20 -> inp/state21/l/dte/;
24 dce state01 -> out/state08/u/dte/;
25 dce state08 -> inp/state16/b/dce/;
26 dte state21 -> out/state01/a/dce/;
27 dte state01 -> inp/state08/u/dte/;
28 dce state16 -> out/state17/m/dte/;
29 dce state17 -> out/state21/l/dte/;
30 dce state21 -> inp/state01/a/dce/;
31 dce state01 -> out/state18/m/dte/;
32 dte state08 -> out/state15/d/dce/;
33 dte state15 -> inp/state19/m/dte/;
34 dce state18 -> out/state01/l/dte/;
35 dce state01 -> inp/state02/d/dce/;
36 dce state02 -> out/state03/v/dte/;
37 dte state19 -> out/state20/b/dce/;
deadlock 4:
	dce state21 b
	dte state16 -
1 dte state01 -> out/state02/d/dce/;
2 dte state02 -> out/state16/b/dce/;
3 dce state01 -> inp/state21/b/dce/;
307 states, 4 deadlocks

categories: Papers,Verification,Jul,2009,MikhailA

An AWK Debugger and Assertion Checker

From "AUI - the Debugger and Assertion Checker for the Awk Programming Language" by Mikhail Auguston, Subhankar Banerjee, Manish Mamnani, Ghulam Nabi, Juris Reinfelds, Ugis Sarkans, and Ivan Strnad. Proceedings of the 1996 International Conference on Software Engineering: Education and Practice (SE:EP '96).

Download from LAWKER.

Abstract

This paper describes the design of Awk User Interface (AUI). AUI is a graphical programming environment for editing, running, testing and debugging of Awk programs. The AUI environment supports tracing of Awk programs, setting breakpoints, and inspection of variable values.

An assertion language to describe the relationship between input and output of an Awk program is provided. Assertions can be checked after the program run, and if violated, informative and readable messages can be generated. The assertions and debugging rules for the Awk program are written in a separate text file. Assertions are useful not only for testing and debugging but can also be considered a means for formal program specification and documentation.

Example

The input file contains a list of all states of U.S.A. There are 50 records separated by newlines, one for each of the states. The number of fields in a record is variable. The first field is the name of the state, and the subsequent fields are names of neighbor states. Fields are separated by tabs. For example, the first two records in the database are

Alabama Mississippi Tennessee Georgia Florida 
Alaska 

The task is to color the U.S.A. map in such a way that any two neighboring states are in different colors. We will do it in a greedy manner (without backtracking), assigning to every state the first possible color. The Awk program for this task is the following:

# Greedy map coloring 
BEGIN { FS= "\t"; OFS= "\t" # fields separated by tabs 
		color[0]= "yellow"  # color names 
		color[1]= "blue" 
		color[2]= "red" 
		color[3]= "green" 
		color[4]= "black" 
} 
{ 		i=0 
		while (a[$1,i] ) i++ # find first acceptable color for 
		                     # state $1 
		print $1"\t" color[i] # assign that color 
		for (j=2; j<=NF; j++) a[$j,i]=1	# make that color 
                                            # unacceptable for 
                                            # states $2..$NF 
} 

We can check the correctness of the coloring using the following assertion:

/* Checks the correctness of map coloring - any two neighbor
   states should be colored in different colors */
	FOREACH r1: RECORD FROM FILE input 
		(EXISTS r2: RECORD FROM FILE output 
			(r1.$1 == r2.$1 AND 
 			FOREACH i IN 2..FIELD_NUM(r1) 
				(EXISTS r3: RECORD FROM FILE output 
					(r3.$1 == r1.$i AND r3.$2 != r2.$2) 
				) 
			) 
		)		 
SAY "Map colored correctly" 
ONFAIL  SAY r1.$1 "and" r1.$i "are of the same color" 
        SAY "although they are neighboring states" 

categories: Papers,Verification,Jul,2009,BalkhisB

Automated Result Verification with Awk

Source

From B.A. Bakar and T. Janowski, Automated Result Verification with AWK, Proceedings of the Sixth IEEE International Conference on Complex Computer Systems (ICECCS'00), p. 188, 2000.

Download

Download from LAWKER.

Abstract

The goal of result-verification is to prove that one execution run of a program satisfies its specification. Compared with implementation-verification, result-verification has a larger scope for applications in practice, gives more opportunities for automation and, based on the execution record not the implementation, is particularly suitable for complex systems.

This paper proposes a technical framework to apply this technique in practice. We show how to write formal result-based specifications, how to generate a verifier program to check a given specification and to carry out result-verification according to the generated program.

The execution result is written as a text file, the verifier is written in AWK (special-purpose language for text processing) and verification is done automatically by the AWK interpreter; given the verifier and the execution result as inputs.

In this paper...

In this paper we propose a technical framework to carry out automated result-verification in practice. Its main features are:
  • The execution result is a simple text file. Many programs produce such (log) files during their normal operations, for administrative purposes. A general technique to record exactly the information needed for verification is to introduce a program wrapper.
  • The execution result is given as input to the verifier program, which does the actual verification. Given the execution result in a text file, we consider result-verification as the text-processing task. Accordingly, the verifier is written in AWK, which is a special-purpose language for text processing, implemented for most computing platforms. Verification is done by the AWK interpreter, given the execution result and the verifier program as inputs.

categories: Graphics,Jul,2009,TedD

Processing Binary (BMP) files in Gawk

by Ted Davis

Updates

(For an update to this page, see wdenbmp.awk).

Description

My boss wants to put NOAA weather radar images in a looping presentation that is displayed as 720 video on the 1040 LCD TV in the atrium. He couldn't figure out how to download the various layers needed, so he gave me the task. Of course, I had a sample composite image for him in half an hour. It looked terrible on the TV: the writing came out as just a blur and the county and state lines (single pixel mostly) were essentially invisible. Obviously, I could make my own 'cities' overlay, but no tools I had would convert the 'counties' image to any usable vector format for line resizing.

This afternoon, I wrote a gawk script that widens the lines in a 256 color BMP version of the image - I can convert it back to a transparent background GIF later.

The power and range of gawk never cease to amaze me - a 42 line (pretty printed) program was all it took.

The script uses FS="" to convert the entire file into 331,078 single-byte fields. The first 1078 went into a header string that was printf()ed to the outfile. The rest went into a pair of 550 row by 600 column arrays. Then I looked at each pixel in the A array and, if it was not the background color, made the four surrounding pixels in the B array the same color, provided they were background color (not part of an existing line). Then I read out the B array in order and printf()ed it to the outfile. The resulting overlay should be readable after changing the colors to make the dark lines brighter and moving its location in the stack to be on top of the other images.

There is one known flaw that I have no intention of addressing: lines that do not intersect other lines grow longer by one pixel for each pass through the program.

Code Fragments

While the actual code is proprietary, the following code snippets show most of the idioms required to handle binaries.

function Bytes2Number( String,  x, y, z, Number ) {
      # Interpret String as raw bytes, most significant byte first,
      # using CharString (built below) as a 0..255 lookup table.
      x = split( String, Scratch, "" )
      Number = 0
      for( y = 1; y <= x; y++ ) {
              z = index( CharString, Scratch[ y ] ) - 1
              Number = Number + z * (256 ^ (x - y))
      }
      return Number
}

The following code initializes the CharString variable needed by Bytes2Number.

BEGIN{
     for( x = 0; x <= 255; x++ ) {
          CharString = CharString sprintf( "%c", x )
     }

The above code generates the list of bytes for the Bytes2Number function.

     FS= ""
     RS = /ABC/
}

Note that the string "ABC" does not appear in any of the image files processed by this code. Hence, the lines above mean that the whole image ends up in one record. (RS must be assigned a string here; a regexp constant such as /ABC/ on the right-hand side of an assignment would evaluate to 0 or 1, not to the pattern.)

The next block analyzes the header to extract useful information.

    {     # BMP headers store multi-byte values least significant byte
          # first, so the single-byte fields are concatenated in
          # reverse order before conversion.
          Width   = Bytes2Number( $22 $21 $20 $19 )
          Height  = Bytes2Number( $26 $25 $24 $23 )
          Data    = Bytes2Number( $14 $13 $12 $11 )  # offset of pixel data
          Size    = Bytes2Number( $6 $5 $4 $3 )      # file size
          Depth   = Bytes2Number( $30 $29 ) / 8      # bytes per pixel
          ImgSize = Bytes2Number( $38 $37 $36 $35 )
           ....
    }

(Note: I found that the image size in the header may be wrong, notably in files resized by Paint Shop Pro. Calculating it proved more reliable.)
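
To make the widening pass concrete, here is a minimal sketch of the idea described above (it is not the proprietary script). It assumes the pixel bytes have already been read into A[row, col] for a Height-row by Width-column image, that the B array starts as a copy of A, and that BG holds the background color byte:

function widen(    row, col, c) {
    for (row = 1; row <= Height; row++)
        for (col = 1; col <= Width; col++) {
            c = A[row, col]
            if (c == BG)
                continue               # only drawn pixels are widened
            # Copy the color into the four neighbors, provided they
            # are background (not part of an existing line).
            if (row > 1      && A[row-1, col] == BG) B[row-1, col] = c
            if (row < Height && A[row+1, col] == BG) B[row+1, col] = c
            if (col > 1      && A[row, col-1] == BG) B[row, col-1] = c
            if (col < Width  && A[row, col+1] == BG) B[row, col+1] = c
        }
}

Reading the B array back out in row order and printf()ing each byte after the saved header then yields the widened image; it also reproduces the known flaw noted above, since an isolated line end gains one pixel per pass.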


categories: Databases,Spawk,Jul,2009,PanosP

SPAWK moves to GoogleCode

Panos I. Papadopoulos reports that he has moved the SPAWK project (SQL and AWK) to Mercurial and spawk.googlecode.com.

He has also written extensive tutorial notes at the SPAWK wiki.


categories: Spawk,Databases,Jul,2009,PanosP

SQL Powered AWK

Website

http://sites.google.com/site/spawkinfo.

Author

Panos I. Papadopoulos (panos1962@gmail.com).

Description

SPAWK is an elegant collection of functions for accessing and updating MySQL databases from within GNU awk programs. The SPAWK module consists of a single awk extension library, namely libspawk.so, which may be loaded in awk programs using the standard extension awk function:

BEGIN {
   extension("libspawk.so", "dlload")
   ...
}

A Short Example

Here's a short example of using SPAWK (for more details, see http://sites.google.com/site/spawkinfo/Home/manual).

When calling spawk_select, SPAWK sends the query already given (maybe some spawk_query calls preceded the spawk_select) to the current server (recall that a "server", from SPAWK's point of view, is a connection to the actual MySQL server, mysqld). After calling spawk_select, the server is ready to return the results to the awk process via spawk_data, spawk_first or spawk_last calls. Alternatively, at any time we can clear the result set and release the server with a spawk_clear function call.

The main data receiver is the spawk_data function. This function is usually called with one or two arguments. The first argument is an array to be used as a data transfer vehicle, while the second argument may optionally be used to hold the null-valued columns. spawk_data returns the number of columns of each returned data row, or zero if there are no more data to return (EOD). The spawk_first function takes the same arguments and returns the same values as spawk_data, but the rest of the data will be lost; that is, it gets the next available data row and releases the server. The spawk_last function is similar, but the row returned is the last row of the result set. Incidentally, spawk_last is less efficient than spawk_first; actually, there is no particular reason to call spawk_last at all! Let's see some examples:

BEGIN {
     extension("libspawk.so", "dlload")
     SPAWKINFO["database"] = "information_schema"
     spawk_select("SELECT TABLE_SCHEMA, TABLE_NAME FROM TABLES")
     while (spawk_data(data))
          print data[0]
     exit(0)
}

A few things need to be explained:

  • extension is used to load the SPAWK module.
  • The SPAWKINFO array is used to specify connection options; its "database" index denotes the default database (schema) to connect to.
  • spawk_select is used to execute the desired query.
  • spawk_data is used repeatedly to get the results, row by row. spawk_data returns 2 (the number of columns selected) while there are results to be returned, and 0 on EOD.
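
Building on that example, here is a hypothetical sketch of the other calls described above: the optional second argument of spawk_data collects the null-valued columns, and spawk_first fetches a single row and releases the server (the exact argument conventions here are assumptions; see the SPAWK manual):

BEGIN {
     extension("libspawk.so", "dlload")
     SPAWKINFO["database"] = "information_schema"

     # Pass a second array so spawk_data can flag null-valued columns.
     spawk_select("SELECT TABLE_SCHEMA, TABLE_NAME FROM TABLES")
     while (spawk_data(data, null))
          print data[0]

     # spawk_first returns the next data row and releases the server;
     # the rest of the result set is discarded.
     spawk_select("SELECT COUNT(*) FROM TABLES")
     spawk_first(data)
     print "number of tables: " data[0]
     exit(0)
}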

categories: Tools,Jul,2009,WmM

Finite State Machine Generator

Contents

Download

Download from LAWKER

Usage

In general, specify the state machine in FILE.fsm and define the action functions in FILE_actions.c. Then run fsm.awk, and compile and link fsm.c, fsm_FILE.c and any driver file. That's it.

Multiple fsms may be built and run in the same application using the function fsm_allocFsm(). Moreover, calls to fsm() may be nested using the same state machine, as long as a different context is used. fsm_allocFsm() returns a context number that must be stored and passed to fsm() on each invocation. In the provided sample, the context is stored in myContext in test_driver.c.

Fsm() may be called either by polling for events or from inside an interrupt service routine. If fsm() is called from an interrupt service routine, it must be protected from nested calls that use the same context. Interrupting calls that use other contexts is permitted.

Note that the function fsminit() is called only once and should not be called for each fsm. If there are special requirements for a given fsm, an appropriate init function should be provided and called for that particular fsm.

Currently, fsm traceEnable is set to true and cannot be disabled (without changing fsm_allocFsm()). An array is maintained within each fsm context wherein each state and event are recorded for each call to fsm().

Description

Fsm.awk is an awk script designed to read a finite state machine (fsm) specification and produce C files which implement that fsm. The file fsm.c, included in the distribution, provides the actual state transition function, and the user provides the state transition "action" functions and any special initialization.

The fsm distribution consists mainly of fsm.awk and fsm.c, although there are a number of header files for declarations. It doesn't get much simpler than that.

Typically, the fsm specification is named in the form fsm_name.fsm, but may be named any legal filename. The action functions may be placed in any number of files by any name the user chooses. Each function should return either true or false so that the appropriate next state may be chosen.

The chief benefit of using fsm.awk is easy-to-read, consistent state machine specifications and reuse of existing, tested code. Multiple tables and multiple users are happily accommodated. It's not hi-tech, but it provides an easy avenue to generalization and consistency where fsms are required.

This distribution represents a rewrite of an earlier version written many years ago - rewritten with newer versions of awk and gcc in mind. Consequently, it has not been tested using other compiler suites. There are no known bugs, but, it IS a rewrite.

Although a good candidate for C++, C was used because C++ was not being used in any of the systems currently using fsm-gen. Maybe a C++ version will be in a subsequent release.

Building the Sample FSM

The distribution provides the following files:

COPYING and      FSF licenses
COPYING.LESSER
filelist         the "packing list"
fsm.awk          the code generator
fsm.c            the context and transition code
fsm.h            definitions for the API
makefile         simple makefile for the test driver code
utils.h          error and utility definitions
test.fsm         a sample fsm specification named "test"
test_actions.c   action functions for the sample

To build the sample,

  1. Download the .zip.
  2. Extract the files from the zip: unzip contents.zip
  3. Build the example fsm: ./fsm.awk test.fsm. This step will produce fsm_test.c and fsm_test.h.
  4. Compile and link the executable (test) using make.
  5. Run the sample; the executable produced by the makefile is "test". See the section "The Example FSM" below for information on using the example.

When fsm.awk is run (via fsm.awk fsmName.fsm), it produces two files, fsm_fsmName.c and fsm_fsmName.h. Fsm_fsmName.c will contain an array of struct fsm_s tagged as fsm_fsmName, e.g.,

    struct fsm_s fsm_fsmName [STATES_COUNT][EVENTS_COUNT];

In the fsm distribution, the files fsm_test.c, fsm_test.h and test_actions.c may be built as an executable sample.

The file fsm.c should be compiled and linked with the final executable, as it contains the C code necessary to read the generated tables and update the context.

Building the example should compile error free, with the exception of a warning about using "gets()" in the sample driver. Hey - it's just a driver for a test.

Example FSM Specification File

In its purest form, an fsm specification gives state, event, action, new state. For example, a rudimentary ftp server might be specified as follows:

# current     event     action          next 
# state                                  state
# --------------+----------+---------------+------------
IDLE            CONN_REQ   makeConnection  CONNECTED
CONNECTED       GET_REQ    sendBuffer      SENDING
SENDING         FILE_SENT  closeFile       IDLE

It is useful on occasion to make the next state depend on the success or failure of the action function. Here, "ok" and "fail" mean "true" and "false", respectively. For example, as each buffer is sent, it would be useful to specify a different state if sendBuffer() returns fail (indicating EOF).

# current     event     action     next         next 
# state                             state        state
#                                    ok           fail
# --------------+----------+----------+---------+-----
CONNECTED       GET_REQ    sendBuffer SENDING   IDLE

State, event, action, and new state may be specified according to the same rules as C variables/functions. In the above table, the words CONNECTED, GET_REQ, SENDING, and IDLE are used to generate #defines, and the action sendBuffer is the name of a user supplied function.

The file test.fsm illustrates several idioms (a small sketch combining them follows this list):

  • an event may be a single event or a comma-separated list of events that all result in the same action and same next state. For example, the specification
    # current     event           action     next      next 
    # state                                   state     state
    #                                          ok        fail
    # --------------+---------------+----------+--------+-----
    S1              EVENT_1,EVENT_2  action_1   S2       S3
    

    means, when receiving event EVENT_1 or EVENT_2 in state S1,

    execute action action_1 and go to state S2 if the return value of action_1() is true, or go to state S3 if it is false.
  • note that all events must be specified for each state. See the example specification file, test.fsm.
  • an action specified as "-" means, "do nothing". fsm.awk will generate a NULL in the state transition tables which will be treated as "do nothing". When so specified, the next state will always be the next-state-ok state.
  • an action specified as fsm_invalid_event will call the function fsm_invalid_event(void), which always returns false. This function may be edited to suit the situation at hand. When fsm_invalid_event is specified, both next states may be left unspecified - fsm.awk will generate next state information as being the current state (i.e., no change in the current state).
  • a fail next state specified as "-" means the fail next state is the same as the success next state. That is, the specification
    # current     event     action     next         next 
    # state                             state        state
    #                                    ok           fail
    # --------------+----------+----------+---------+-----
    S1              EVENT_1    action_1   S2        -
    
    means, when receiving event EVENT_1 in state S1, execute action action_1 and go to state S2 irrespective of the return value of action_1().
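
Purely as an illustration of how little code such a generator needs, the following sketch (it is not the real fsm.awk, and the emitted C fragment is hypothetical) turns each specification line, including the idioms above, into one table entry per event:

# A sketch, not the real generator: one C table entry per event.
/^#/ || NF == 0 { next }                # skip comment and blank lines
{
    act  = ($3 == "-") ? "NULL" : $3    # "-" action means "do nothing"
    ok   = ($4 == "") ? $1 : $4         # unspecified next state: stay put
    fail = (NF < 5 || $5 == "-") ? ok : $5  # "-" fail state = ok state
    n = split($2, ev, ",")              # an event may be a comma list
    for (i = 1; i <= n; i++)
        printf "    { %s, %s, %s, %s },\n", ev[i], act, ok, fail
}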

The Example FSM

Included in the distribution are test.fsm and test_actions.c which implement a very simple state machine called "test". After the executable "test" is produced (via make), it may be used to show the behavior of the fsm.

The example fsm was built and tested with gcc version 4.0.2 and awk version 3.1.4.

Example Output from the Sample

On running "test", first the line "testing fsm test" is printed, then a line indicating the initial state. It then asks for the next event. All events in the example are the lowercase letters 'a' thru 'd', entered from the keyboard. A special event 'z' will cause the trace to be dumped. Entering 'q' will cause test to exit. Note that to keep the example simple, other than special events 'z' and 'q', there is no checking of input for being outside the known set of events. A sample session might look like this:

$>
$> ./test
testing fsm test

starting in state 1
next event: a
got a (0)  ----> called fsm_s2_ab ----> ,went to state 0
next event: d
got d (3)  ----> invalid eventwent to state 0
next event: b
got b (1)  ----> called fsm_s1_b ----> ,went to state 1
next event: c
got c (2)  ----> went to state 1
next event: z
trace index is 4
event      state
0          0
3          0
1          1
2          1
0          0 <-- next/oldest
0          0
0          0
0          0

next event: q
bye
$>

Copyright

Copyright 2008 Wm Miller

This file is part of fsm-gen, and is distributed under the terms of the GNU Lesser General Public License.

Copies of the GNU General Public License and the GNU Lesser General Public License are included with this distribution in the files COPYING and COPYING.LESSER, respectively.

Fsm-gen is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

Fsm-gen is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with fsm-gen. If not, see http://www.gnu.org/licenses.

Author

Wm Miller. The author may be contacted at wmmsf at users.sourceforge.net.

categories: Databases,Jul,2009,CarloS

NoSQL

By Carlo Strozzi (carlo@strozzi.it).

NoSQL is a fast, portable, relational database management system without arbitrary limits (other than memory and processor speed) that runs under, and interacts with, the UNIX Operating System. It uses the "Operator-Stream Paradigm" described in Unix Review (March, 1991, page 24, "A 4GL Language"), where there are a number of "operators" that each perform a unique function on the data. These operators are written in Awk and C and are designed to be lightweight: they have a small memory footprint and start up quickly.

The main reason for turning the original RDB system into NoSQL is precisely that the former is entirely written in Perl. Perl is a good programming language for writing self-contained programs, but its pre-compilation phase and long start-up time are worth paying only if, once the program has loaded, it can do everything in one go. This contrasts sharply with the Operator-Stream Paradigm, where operators are chained together in pipelines of two, three or more programs. The overhead associated with initializing Perl at every stage of the pipeline makes pipelining Perl inefficient. A better way of manipulating structured ASCII files is to use the AWK programming language, which is much smaller than Perl, more specialized for this task, and very fast at startup.
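
To give the flavor of such an operator, here is a minimal sketch (it is not an actual NoSQL operator), assuming NoSQL's convention of tab-separated tables whose first line names the columns. It projects one named column and passes the header through, so operators can still be chained:

#!/usr/bin/awk -f
# Usage: column.awk -v col=NAME < table
BEGIN { FS = OFS = "\t" }
NR == 1 {                             # locate the requested column
    for (i = 1; i <= NF; i++)
        if ($i == col) want = i
    if (!want) {
        print "no such column: " col > "/dev/stderr"
        exit 1
    }
}
{ print $want }                       # header line included, for chaining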

For more information on NoSQL, see the NoSQL home page.
