Awk.Info

"Cause a little auk awk
goes a long way."

About awk.info
 »  table of contents
 »  featured topics
 »  page tags


About Awk
 »  advocacy
 »  learning
 »  history
 »  Wikipedia entry
 »  mascot
Implementations
 »  Awk (rarely used)
 »  Nawk (the-one-true, old)
 »  Gawk (widely used)
 »  Mawk
 »  Xgawk (gawk + xml + ...)
 »  Spawk (SQL + awk)
 »  Jawk (Awk in Java JVM)
 »  QTawk (extensions to gawk)
 »  Runawk (a runtime tool)
 »  platform support
Coding
 »  one-liners
 »  ten-liners
 »  tips
 »  the Awk 100
Community
 »  read our blog
 »  read/write the awk wiki
 »  discussion news group

Libraries
 »  Gawk
 »  Xgawk
 »  the Lawker library
Online doc
 »  reference card
 »  cheat sheet
 »  manual pages
 »  FAQ

Reading
 »  articles
 »  books:

WHAT'S NEW?

Mar 01: Michael Sanders demos an X-windows GUI for AWK.

Mar 01: Awk100#24: A. Lahm and E. de Rinaldis' patent search, in AWK

Feb 28: Tim Menzies asks this community to write an AWK cookbook.

Feb 28: Arnold Robbins announces a new debugger for GAWK.

Feb 28: Awk100#23: Premysl Janouch offers an IRC bot, in AWK

Feb 28: Updated: the AWK FAQ

Feb 28: Tim Menzies offers a tiny content management system, in Awk.

Jan 31: Comment system added to awk.info. For example, see the discussion at the bottom of ?keys2awk

Jan 31: Martin Cohen shows that Gawk can handle massively long strings (300 million characters).

Jan 31: The AWK FAQ is being updated. For comments/ corrections/ extensions, please mail tim@menzies.us

Jan 31: Martin Cohen finds Awk on the Android platform.

Jan 31: Aleksey Cheusov released a new version of runawk.

Jan 31: Hirofumi Saito contributes a candidate Awk mascot.

Jan 31: Michael Sanders shows how to quickly build an AWK GUI for windows.

Jan 31: Hyung-Hwan Chung offers QSE, an embeddable Awk Interpreter.

[More ...]


categories: Who,Feb,2009,Timm

Timm= Tim Menzies

2009: Associate Professor, LCSEE, WVU. Email: tim@menzies.us. Web site: http://menzies.us.

Tim has been a Great Auk since Feb'09.


categories: Misc,Jan,2009,Timm

History

Recipe for a Language

  • 1 part egrep
  • 1 part snobol
  • 2 parts ed
  • 3 parts C
  • Blend all parts well using lex and yacc. Document minimally and release.
  • After eight years, add another part egrep and two more parts C. Document very well and release.

    Historical Notes

    From awk.freeshell.org:

    • 1977-1985: awk and nawk (now also known as 'old awk' or 'the old true awk'): the original version of the language, lacking many of the features that make it fun to play with now
    • 1985-1996: The GNU implementation, Gawk, was written in 1986 by Paul Rubin and Jay Fenlason, with advice from Richard Stallman. John Woods contributed parts of the code as well. In 1988 and 1989, David Trueman, with help from Arnold Robbins, thoroughly reworked Gawk for compatibility with the newer Awk.
    • 1996: BWK awk was released under an open license. Huzzah!
    • Sometime before the present, mawk, xgawk, jawk, awkcc, Kernighan's nameless awk-to-C++ compiler, awka, tawk and busybox awk came to be.

    It's a bit embarrassing to note that the exact origins of each are hazy. This whole section requires further work, including the addition of links pointing to source repositories and binary distribution points.

    Awk Implementations

    Historical list of Awk implementations.

    Awk's Authors: Interviews


    categories: Misc,WhyAwk,Jan,2009,Timm

    Why Gawk?

    by T. Menzies

    "The Enlightened Ones say that....

    • You should never use C if you can do it with a script;
    • You should never use a script if you can do it with awk;
    • Never use awk if you can do it with sed;
    • Never use sed if you can do it with grep."

    Awk is a good old-fashioned UNIX filtering tool invented in the 1970s. The language is simple and Awk programs are generally very short. Awk is useful when the overheads of more sophisticated approaches are not worth the bother. Also, the cost of learning Awk is very low.

    But aren't there better scripting languages? Faster? Well, maybe yes and maybe no.

    And Awk is old (mid-70s). Aren't modern languages more productive? Well again, maybe yes and maybe no. One measure of the productivity of a language is how many lines of code are required to code up one business-level `function point'. Compared to many popular languages, GAWK scores very highly:

    loc/fp   language
    ------   --------
    
        6,   excel 5
       13,   sql
       21,   awk       <================
       21,   perl
       21,   eiffel
       21,   clos
       21,   smalltalk
       29,   delphi
       29,   visual basic 5
       49,   ada 95
       49,   ai shells
       53,   c++
       53,   java
       64,   lisp
       71,   ada 83
       71,   fortran 95
       80,   3rd generation default
       91,   ansi cobol 85
       91,   pascal
      107,   2nd generation default
      107,   algol 68
      107,   cobol
      107,   fortran
      128,   c
      320,   1st generation default
      640,   machine language
     3200,   natural language
    

    Anyway, there are other considerations. Awk is real succinct, simple enough to teach, and easy enough to recode in C (if you want raw speed). For example, here's the complete listing of someone's Awk spell-checking program.

    BEGIN     {while (getline<"Usr.Dict.Words") dict[$0]=1}
    !dict[$1] {print $1}
    

    Sure, there's about a gazillion enhancements you'd like to make on this one but you gotta say, this is real succinct.
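    To see the two-liner in action, here is a minimal sketch. The dictionary filename Usr.Dict.Words comes from the listing above; the sample words are invented for illustration:

```shell
# Build a tiny dictionary, one word per line; the program reads it
# under the name Usr.Dict.Words (that filename comes from the listing).
printf 'the\ncat\nsat\n' > Usr.Dict.Words

# Any word whose first field is not in the dictionary is reported.
printf 'the\ncat\nzat\n' |
awk 'BEGIN     {while (getline<"Usr.Dict.Words") dict[$0]=1}
     !dict[$1] {print $1}'
# prints: zat
```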

    Awk is the cure for late execution of software syndrome (a.k.a. LESS). The symptom of LESS is a huge time delay before a new idea is executable. Awk programmers can hack up usable systems in the time it takes other programmers to boot their IDE. And, as a result of that easy exploration, it is possible to find loopholes missed by other analysts, leading to innovative, better solutions to the problem (e.g. see Ronald Loui's O(nlogn) clustering tool).

    Certainly, we can drool over the language features offered by more advanced languages like pointers, generic iterators, continuations, etc etc. And Awk's lack of data structures (except num, string, and array) requires some discipline to handle properly.

    But experienced Awk programmers know that the cleverer the program, the smaller the audience gets. If it is possible to explain something succinctly in a simple language like Awk, then it is also possible that more folks will read that code.

    Finally, and this may be the most important point, it might be misguided to argue about Awk vs LanguageX in terms of the specifics of those languages. Awk programmers can't over-elaborate their solutions; they are forced to code the solution in the simplest manner possible. This push to simplicity, to the essence of the problem, can be an insightful process. Coding in Awk is like preserving fruit: you boil off everything that is superfluous, that needlessly bloats the material you are working with. It is amazing how little code is required to code the core of an idea (e.g. see Darius Bacon's LISP interpreter, written in Awk).


    categories: Cookbook,Feb,2010,Timm

    Awk Cook Book Project

    Tim Menzies writes...

    Is anyone interested in collaborating on an Awk Cook Book, one that we might get published with (say) O'Reilly?

    After browsing the excellent entries from the PHP cookbook by David Sklar and Adam Trachtenberg, I think the following topics look relevant.

    But what other topics are useful/ simple/ practical/ etc?

    I propose the following timetable:

    • March 2010:
      • Revise the following list of headings (making comments using the comment system, below);
      • Identify possible authors
    • April 2010:
      • Every author writes 5 new entries (call them "yellow")
    • May 2010:
      • Every author reviews 5 entries for other authors, to label them "green" (ok to go) or "red" (needs more work).
      • Every author writes 5 new entries (call them "yellow")
      • Try to interest new authors in the project.
    • Repeat the above, till done.

    So, according to that timetable, the first step is for everyone to jump in and propose topics. Looking forward to your input!

    = Strings =

     * Accessing Substrings
     * Replacing Substrings
     * Processing a String One Character at a Time
     * Reversing a String by Word or Character
     * Expanding and Compressing Tabs
     * Controlling Case
     * Interpolating Functions and Expressions Within Strings
     * Trimming Blanks from a String
     * Parsing Comma-Separated Data
     * Parsing Fixed-Width Delimited Data
     * Taking Strings Apart
     * Wrapping Text at a Certain Line Length
     * Storing Binary Data in Strings
     * Building a String From Parts
     * Substituting Variables Into Strings
     * Converting Between Characters and Values
     * Managing Whitespace
     * Word-wrapping Lines of Text
     * Validating an Email Address

    = Numbers =

     * Checking Whether a String Contains a Valid Number
     * Comparing Floating-Point Numbers
     * Rounding Floating-Point Numbers
     * Operating on a Series of Integers
     * Generating Random Numbers Within a Range
     * Generating Biased Random Numbers
     * Taking Logarithms
     * Calculating Exponents
     * Formatting Numbers
     * Printing Correct Plurals
     * Calculating Trigonometric Functions
     * Doing Trigonometry in Degrees, not Radians
     * Handling Very Large or Very Small Numbers
     * Converting Between Bases
     * Calculating Using Numbers in Bases Other Than Decimal

    = Dates and Times =

     * Finding the Current Date and Time
     * Converting Time and Date Parts to an Epoch Timestamp
     * Converting an Epoch Timestamp to Time and Date Parts
     * Printing a Date or Time in a Specified Format
     * Finding the Difference of Two Dates
     * Finding the Difference of Two Dates with Julian Days
     * Finding the Day in a Week, Month, Year, or the Week Number in a Year
     * Validating a Date
     * Parsing Dates and Times from Strings
     * Adding to or Subtracting from a Date
     * Calculating Time with Time Zones
     * Accounting for Daylight Saving Time
     * Generating a High-Precision Time
     * Generating Time Ranges
     * Using Non-Gregorian Calendars
     * Program: Calendar

    = Arrays =

     * Specifying an Array Not Beginning at Element 0
     * Storing Multiple Elements per Key in an Array
     * Initializing an Array to a Range of Integers
     * Iterating Through an Array
     * Deleting Elements from an Array
     * Changing Array Size
     * Appending One Array to Another
     * Turning an Array into a String
     * Printing an Array with Commas
     * Checking if a Key Is in an Array
     * Checking if an Element Is in an Array
     * Finding the Position of an Element in an Array
     * Finding Elements That Pass a Certain Test
     * Finding the Largest or Smallest Valued Element in an Array
     * Reversing an Array
     * Sorting an Array
     * Sorting an Array by a Computable Field
     * Sorting Multiple Arrays
     * Sorting an Array Using a Method Instead of a Function
     * Randomizing an Array
     * Shuffling a Deck of Cards
     * Removing Duplicate Elements from an Array
     * Finding the Union, Intersection, or Difference of Two Arrays
     * Finding All Element Combinations of an Array
     * Finding All Permutations of an Array
     * Program: Printing an Array in a Horizontally Columned HTML Table

    = Hashes =

     * Introduction
     * Adding an Element to a Hash
     * Testing for the Presence of a Key in a Hash
     * Deleting from a Hash
     * Traversing a Hash
     * Printing a Hash
     * Retrieving from a Hash in Insertion Order
     * Hashes with Multiple Values Per Key
     * Inverting a Hash
     * Sorting a Hash
     * Merging Hashes
     * Finding Common or Different Keys in Two Hashes
     * Hashing References
     * Presizing a Hash
     * Finding the Most Common Anything
     * Representing Relationships Between Data
     * Program: dutree

    = Variables =

     * Avoiding == Versus = Confusion
     * Establishing a Default Value
     * Exchanging Values Without Using Temporary Variables
     * Creating a Dynamic Variable Name
     * Using Static Variables
     * Sharing Variables Between Processes
     * Encapsulating Complex Data Types as a String
     * Dumping Variable Contents as Strings

    = Functions =

     * Accessing Function Parameters
     * Setting Default Values for Function Parameters
     * Passing Values by Reference
     * Using Named Parameters
     * Creating Functions That Take a Variable Number of Arguments
     * Returning Values by Reference
     * Returning More Than One Value
     * Skipping Selected Return Values
     * Returning Failure
     * Calling Variable Functions
     * Accessing a Global Variable Inside a Function
     * Creating Dynamic Functions

    = XML =

     * Generating XML Manually
     * Generating XML with the DOM
     * Parsing XML with the DOM
     * Parsing XML with SAX
     * Transforming XML with XSLT
     * Sending XML-RPC Requests
     * Receiving XML-RPC Requests
     * Sending SOAP Requests
     * Receiving SOAP Requests
     * Exchanging Data with WDDX
     * Reading RSS Feeds

    = Regular Expressions =

     * Switching From ereg to preg
     * Matching Words
     * Finding the nth Occurrence of a Match
     * Choosing Greedy or Nongreedy Matches
     * Matching a Valid Email Address
     * Finding All Lines in a File That Match a Pattern
     * Capturing Text Inside HTML Tags
     * Escaping Special Characters in a Regular Expression
     * Reading Records with a Pattern Separator

    = Files =

     * Creating or Opening a Local File
     * Creating a Temporary File
     * Opening a Remote File
     * Reading from Standard Input
     * Reading a File into a String
     * Counting Lines, Paragraphs, or Records in a File
     * Processing Every Word in a File
     * Reading a Particular Line in a File
     * Processing a File Backward by Line or Paragraph
     * Picking a Random Line from a File
     * Randomizing All Lines in a File
     * Processing Variable Length Text Fields
     * Reading Configuration Files
     * Reading from or Writing to a Specific Location in a File
     * Removing the Last Line of a File
     * Modifying a File in Place Without a Temporary File
     * Flushing Output to a File
     * Writing to Standard Output
     * Writing to Many Filehandles Simultaneously
     * Escaping Shell Metacharacters
     * Passing Input to a Program
     * Reading Standard Output from a Program
     * Reading Standard Error from a Program
     * Locking a File
     * Reading and Writing Compressed Files
     * Program: Unzip

    = Directories =

     * Getting and Setting File Timestamps
     * Getting File Information
     * Changing File Permissions or Ownership
     * Splitting a Filename into Its Component Parts
     * Deleting a File
     * Copying or Moving a File
     * Processing All Files in a Directory
     * Getting a List of Filenames Matching a Pattern
     * Processing All Files in a Directory
     * Making New Directories
     * Removing a Directory and Its Contents
     * Program: Web Server Directory Listing
     * Program: Site Search

    categories: CMS,Tools,Mar,2010,Timm

    TinyTim: a Content Management System

    TINY TIM is a tiny web-site manager written in AWK. For a live demo of the site, see http://at.ttoy.net/?tinytim. The site supports runtime content generation; e.g. the quote shown top right of the demo site is auto-generated each time you refresh the page.

    The site was written to demonstrate that a little AWK goes a long way. At the time of this writing, the current system is under 100 lines of code (excluding a separate formatter, which is another 170 lines of code). It took longer to write this doco and the various HTML/CSS theme files than the actual code itself (fyi: 6 hours for the themes/doc and 3 hours for the code).

    TINY TIM has the following features:

    • Pages can be accessed by their (lowercase) name, or by their (uppercase) tags.
    • Pages can be displayed using a set of customizable themes.
    • Page contents can be written using an HTML shorthand language called MARKUP.
    • Pages can be searched using a Google search box.
    • Source code is auto-displayed using a syntax highlighter.
    • Page content can be auto-created via programmer-modifiable plugins.

    Install

    In a web accessible directory, type

     svn export http://knit.googlecode.com/svn/branches/0.2/tinytim/ 
    

    In the resulting directory, perform the local juju required to make index.cgi web-runnable (e.g. on my ISP, chmod u+rx index.cgi).

    Follow the directions in the next section to customize the site.

    Using TINY TIM

    index.cgi

    TINY TIM is controlled by the following index.cgi file. To select a theme, comment out all but one of the last lines (using the "#" character). For screen-shots of the current themes, see below.

    #!/bin/bash
     
    [ -n "$1" ] && export QUERY_STRING="$1"
     
    tinytim() {
      cat content/* themes/$1/theme.txt |
      gawk -f lib/tinytim.awk |
      sed 's/^<pre>/<script type="syntaxhighlighter" class="brush: cpp"><![CDATA[/' |
      sed 's/^<\/pre>/<\/script>/'
    } 
      
     #tinytim auklet
     #tinytim trendygreen
     tinytim wink
    

    Notes:

    • The sed commands: these render normal <pre> blocks using Alex Gorbatchev's excellent syntax highlighter. To change the highlighting rules for a different language, change brush: cpp to one of the supported aliases.
    • The cat command: this assembles the content for the system. Multiple authors can write multiple files in the sub-directory content.

    Themes

    Themes are defined in the sub-directory themes/themename. Each theme is defined by a theme.txt file that holds:

    • The HTML template for the theme.
    • The in-line style sheet for the theme.
    • The page contents with pre-defined string names marked with ``; e.g. ``title``. To change those strings, see the instructions at the end of this page.
    • If a `` entry contains a semi-colon (e.g. ``quotes;``) then it is a plugin. Plugin content is generated at runtime using a method described at the end of this document.

    To write a new theme:

    1. Create a new folder themes/new.
    2. Copy (e.g.) wink/theme.txt to new.
    3. Using the copied theme as a template, start tinkering.

    The following themes are defined in the directory themes.

    Auklet:

    Trendygreen (adapted from GetTemplates):

    Wink:

    Defining String Values

    The first entry in the content defines strings that can slip into the theme templates. For example, the following slots define the title of a site; the name of the formatter script that renders each page; the url of the home directory of the site; a menu to add to the top of each page; a footer to add to the bottom of each page; and a web-accessible directory for storing images.

     ``title``       Just another Tiny Tim demo
     ``formatter``   lib/markup.awk
     ``description`` (simple cms)
     ``home``        http://at.ttoy.net
     ``menu``        <a href="?index">Home</a> | 
                     <a href="?contact">Contact</a>  |
                     <a href="?about">About</a>
     ``footer``      <p>Powered by <a href="?tinytim">TINY TIM</a>. 
                                     © 2010 by Tim Menzies 
     ``images``      http://at.ttoy.net/img
    

    Note the following important convention. TINY TIM auto-generates some of its own strings. The names of these strings start with an uppercase letter. To avoid confusing your strings with those that are auto-generated, it is best to start your strings with a lower-case letter (e.g. like all those in the above example).
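    As an illustration of how such slot filling might work, here is a hypothetical sketch (this is not the actual tinytim.awk code; the gsub approach and variable name are assumptions):

```shell
# Hypothetical sketch of filling a ``title`` slot; tinytim.awk may differ.
echo '<title>``title``</title>' |
awk -v title='Just another Tiny Tim demo' '
    { gsub(/``title``/, title); print }'
# prints: <title>Just another Tiny Tim demo</title>
```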

    Adding a Search Engine

    Google offers a nice free site-specific search engine. It takes a few days for the spiders to find the site but after that, it works fine. To set this up, follow the instructions at Google custom search, then

    • Add the appropriate magic strings into the first entry of the content (usually content/0config.txt).
    • Add references to those strings to your template.

    For example, look for google-search in the current templates and content/0config.txt.

    Writing pages

    After the first entry, the rest of the entries in content/* define the pages of a site.

    • Each entry must begin with the magic string #12345.
    • The entry consists of paragraphs (separated by blank lines).
    • Paragraph one contains the (short) page name (on line one) followed by the page tags (on line two).
        • Note that the page name must start with a lower-case letter.
        • And the tags must start with an upper-case letter.
    • Paragraph two contains the heading of the page.
    • The remaining paragraphs are the page contents.

    For example, this site contains a missing page report, defined as follows. In this definition, the name is "404"; the tags are "Admin Feb10"; and the title is "Sorry".

     #12345####################################################################################
     
     404
     Admin Feb10
      
     Sorry
      
     I have bad news:
     
     <center>
     [img/404book.jpg]
     </center>
    

    The contents can contain HTML and MARKUP tags.

    MARKUP

    MARKUP is a shorthand for writing HTML pages based on MARKDOWN:

    • Italics, bold, and typewriter font are marked by matching _, *, and ` characters (respectively).
    • Lists are marked by leading "+" characters.
    • Numbered lists are marked by leading "1." strings.
    • Links are enclosed in [square brackets]. The first word in the bracket is the URL and subsequent words are the text for the URL link.
    • Images are marked up with the same [square brackets], but the first word must end in one of .png, .gif, .jpg. Any subsequent words are passed as attributes to the <img> tag.

    Also, in MARKUP, major, minor, sub-, and sub-sub- headings are two line paragraphs where the second line contains two or more "=", "-", "+", "_" (respectively). MARKUP collects these headings as a table of contents, which is added to the top of the page.

    Note that MARKUP is separate from TINY TIM. To change the formatting of pages, write your own AWK code and change the string ``formatter`` in the first entry of content/0config.txt.
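    For flavor, here is a toy sketch of a formatter that handles just two of the rules above (leading "+" bullets and [square bracket] links). It is not the real markup.awk; every name in it is illustrative:

```shell
# Toy formatter: handles only "+" bullets and [url text] links.
printf '+ see [http://awk.info the site]\n' |
awk '
    /^\+/ { sub(/^\+ */, "<li>"); $0 = $0 "</li>" }
    match($0, /\[[^]]+\]/) {
        inner = substr($0, RSTART+1, RLENGTH-2)    # text between the brackets
        url   = inner; sub(/ .*/,      "", url)    # first word is the URL
        text  = inner; sub(/^[^ ]+ */, "", text)   # the rest is the link text
        $0 = substr($0, 1, RSTART-1) "<a href=\"" url "\">" text "</a>" substr($0, RSTART+RLENGTH)
    }
    { print }'
# prints: <li>see <a href="http://awk.info">the site</a></li>
```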

    Plugins

    If a `` entry contains a semi-colon (e.g. ``quotes;``) then it is a plugin. Plugin content is generated at runtime. To write a plugin, modify the file lib/plugins.awk. Currently, that file looks like this:

     function slotsPlugIns(str,slots,   tmp) {
        split(str,tmp,";")
        if (tmp[1]=="quotes")
            return quotes(str,slots)
        return str
     }
     function quotes(str,slots,    n,tmp) {
        srand(systime() + PROCINFO["pid"])
        n=split(slots["quotes"],tmp,"\n")
        return tmp[int(rand()*n) + 1]
     }
    

    The function slotsPlugIns is a "traffic cop" that decides which plugin to call (in the above, there is only one current plugin: quotes).

    Each plugin function (e.g. quotes) is passed the string from the template (see str) and an array of key/value pairs holding all the defined string values (see slots). These functions must return a string to be inserted into the rendered HTML.

    In the example above, quotes just returns a random quote. It assumes that the predefined strings includes a set of quotes, one per line:

     ``quotes`` Small  things with great love. <br>-- Mother Teresa
     It's hard work to make it look effortless.<br>-- Katarina Witt
        "God bless us every one!".<br>-- Tiny Tim
    

    The quote generated by this plugin can be viewed at the top right of this page.


    categories: Funky,Mar,2009,Timm

    Funky: Functional Gawk

    These pages are focused on Functional Gawk (a.k.a. "Funky").

    Funky is enabled by a new feature added to Gawk 3.2: indirect functions. For example:

    function foo() { print "foo" }
    function bar() { print "bar" }
    
    BEGIN {
                    the_func = "foo"
                    @the_func()     # calls foo()
                    the_func = "bar"
                    @the_func()     # calls bar()
    }
    

    At the time of this writing, Gawk 3.2 is pre-release and indirect functions can be accessed using the gawk-devel CVS tree:

    cvs -d:pserver:anonymous@cvs.sv.gnu.org:/sources/gawk co gawk-devel
    

    categories: Funky,Mar,2009,Timm

    The Functional Challenge

    Indirect functions enable a new view on library management in Gawk and, perhaps, a way to emulate functional abstraction in languages like Lisp.

    So, anyone care to try, say:


    categories: Funky,Mar,2009,Timm

    Functional Enumeration in Gawk 3.1.7

    Contents

    Synopsis

    all(fun, array [,max])

    collect( fun, array1, array2 [,max])

    select( fun, array1, array2 [,max])

    reject( fun, array1, array2 [,max])

    detect( fun, array [,max])

    inject( fun, array, carry [,max])

    The collect, select, and reject functions return the size of array2; detect returns the first matching item (or Fail); inject returns the final carry.

    Description

    An interesting new feature in Gawk 3.1.7 is indirect functions. This allows a function name to be stored in a variable or an array, passed as an argument, and called using the syntax

    @fun(arg1,arg2,...)    
    

    This enables a new kind of functional programming style in Gawk. For example, generic enumeration patterns can be coded once, then called many different ways with different function names passed as arguments.

    This document illustrates this style of programming.

    Enumerators

    For example, here are some standard enumeration functions:

    all(fun,array [,max])

    Applies the function fun to all items in the array. If called with the max argument, the items are iterated in the order i=1 .. max; otherwise we use for(i in a).

    collect(fun,array1,array2 [,max])

    Applies fun to each item in array1 and collects the results in array2.

    select(fun,array1,array2 [,max])

    Finds all the items in array1 that satisfy fun and adds them to array2.

    reject(fun,array1,array2 [,max])

    Finds all the items in array1 that do not satisfy fun and adds them to array2.

    detect(fun,array [,max])

    Returns the first item found in array that satisfies fun. If no such item is found, then returns the magic global value Fail.

    inject(fun,array,carry [,max])

    (This one is a little tricky.) The result of applying fun to each item in array is carried into the processing of the next item. Initially, the carried value is carry. This function returns the final carry.

    Sample Functions

    To illustrate the above, consider the following functions. Each of these is defined for one array item.

    function odd(x)    { return (x % 2) == 1 }
    function show(x)   { print "[" x "]" }
    function mult(x,y) { return x * y }
    function halve(x)  { return x/2 }
    

    Using the Functions

    • All-ing...
    • function do_all(   arr) { 
          split("22 23 24 25 26 27 28",arr)
          all("show",arr)
      }
      

      When we run this ...

      eg/enum1

      gawk317="$HOME/opt/gawk/bin/gawk"
      $gawk317 -f ../enumerate.awk --source 'BEGIN { do_all() }'
      

      we see every item in arr printed using the above show function ...

      eg/enum1.out

      [25]
      [26]
      [27]
      [28]
      [22]
      [23]
      [24]
      
    • Collect-ing...
    • function do_collect(        max,arr1,arr2,i) {
          max=split("22 23 24 25 26 27 28",arr1)
          collect("halve",arr1,arr2,max)
          for(i=1;i<=max;i++) print arr2[i]
      }
      

      When we run this ...

      eg/enum2

      gawk317="$HOME/opt/gawk/bin/gawk"
      $gawk317 -f ../enumerate.awk --source 'BEGIN { do_collect() }'
      

      we see every item in arr divided in two ...

      eg/enum2.out

      11
      11.5
      12
      12.5
      13
      13.5
      14
      
    • Select-ing...
    • function do_select(        all,less,arr1,arr2,i) {
          all  = split("22 23 24 25 26 27 28",arr1)
          less = select("odd",arr1,arr2,all)
          for(i=1;i<=less;i++) print arr2[i]
      }
      

      When we run this ...

      eg/enum3

      gawk317="$HOME/opt/gawk/bin/gawk"
      $gawk317 -f ../enumerate.awk --source 'BEGIN { do_select() }'
      

      we see every item in arr that satisfies odd....

      eg/enum3.out

      23
      25
      27
      
    • Reject-ing...
    • function do_reject(        all,less,arr1,arr2,i) {
          all  = split("22 23 24 25 26 27 28",arr1)
          less = reject("odd",arr1,arr2,all)
          for(i=1;i<=less;i++) print arr2[i]
      }
      

      When we run this ...

      eg/enum4

      gawk317="$HOME/opt/gawk/bin/gawk"
      $gawk317 -f ../enumerate.awk --source 'BEGIN { do_reject() }'
      

      we see every item in arr that does not satisfy odd....

      eg/enum4.out

      22
      24
      26
      28
      
    • Detect-ing
    • function do_detect(        all,arr1) {
          all  = split("22 23 24 25 26 27 28",arr1)
          print detect("odd",arr1,all)   
      }
      

      When we run this ...

      eg/enum5

      gawk317="$HOME/opt/gawk/bin/gawk"
      $gawk317 -f ../enumerate.awk --source 'BEGIN { do_detect() }'
      

      we see the first item in arr that satisfies odd....

      eg/enum5.out

      23
      
    • Inject-ing...
    • function do_inject(        all,less,arr1,arr2,i) {
          split("1 2 3 4 5",arr1)
          print inject("mult",arr1,1)
      }
      

      When we run this ...

      eg/enum6

      gawk317="$HOME/opt/gawk/bin/gawk"
      $gawk317 -f ../enumerate.awk --source 'BEGIN { do_inject() }'
      

      we see the product of multiplying together every item in arr.

      eg/enum6.out

      120
      

    Code

    Note one design principle in the following: any newly generated arrays have indexes 1..max where max is the number of elements in that array.

    all

    function all (fun,a,max,   i) {
    	if (max) 
    		for(i=1;i<=max;i++) @fun(a[i]) 
    	else  
    		for(i in a) @fun(a[i])
    }
    

    collect

    function collect (fun,a,b,max,   i,n) {
    	if (max)
    	    for(i=1;i<=max;i++) {n++; b[i]= @fun(a[i]) }
    	else
    	    for(i in a) {n++; b[i]= @fun(a[i])}
    	return n
    }
    
    

    select

    function select (fun,a,b,max,   i,n) {
    	if (max)
    		for(i=1;i<=max;i++) {
    		    if (@fun(a[i])) {n++; b[n]= a[i] }}
    	else
    		for(i in a) {
    		    if (@fun(a[i])) {n++; b[n]= a[i] }}
    	return n
    }
    

    reject

    function reject (fun,a,b,max,   i,n) {
    	if (max)
    		for(i=1;i<=max;i++) {
    		    if (! @fun(a[i])) {n++; b[n]= a[i] }}
    	else
    		for(i in a) {
    		    if (! @fun(a[i])) {n++; b[n]= a[i] }}
    	return n
    }
    

    detect

    BEGIN {Fail="someUnLIKELYSymbol"}
    function detect (fun,a,max,   i) {
    	if (max)
    		for(i=1;i<=max;i++) {
    			if (@fun(a[i])) return a[i] }
    	else	
    		for(i in a) {
    			if (@fun(a[i])) return a[i] }
    	return Fail
    }
    

    inject

    function inject (fun,a,carry,max,   i) {
    	if (max)
    		for(i=1;i<=max;i++)
    			 carry = @fun(a[i],carry) 
    	else
    		for(i in a)
    			 carry = @fun(a[i],carry) 
    	return carry
    }
    

    Bugs

    The above code does not pass around any state information that the fun functions can use. So all their deliberations are based either on the current array values (integers or strings) or on global state. It might be worthwhile writing new versions of the above with one more argument, to carry that state.

    Author

    Tim Menzies

    categories: Contribute,Jan,2009,Timm

    Pretty Print AWK Code

    The code that renders the awk.info web site can "pretty print" awk code. For example:

    To enable that pretty print, add some html syntax inside your code and apply the following conventions.

    Preview Engine

    Note that if you want to see your code "looking pretty", you can check how it looks using our preview tool:

    http://awk.info/?awk:urlWithoutHTTPprefix
    

    For example, the file http://menzies.us/tmp/xx.awk can be previewed using http://awk.info/?awk:menzies.us/tmp/xx.awk

    Contributing Pretty Code

    Once you've got it "looking pretty", please consider contributing that code to awk.info, so our code library can grow. To do so, either email mail@awk.info with the URL of your pretty code or zip up the files and email them across.

    HTML-based Commenting Conventions

    The first paragraph of the file will be ignored. Use this first para for copyright notices or comments about down-in-the-weeds trivia. Note: the first para ends with one blank line.

    The next paragraph should start with

    #.H1 <join>Title</join>

    The code should be topped and tailed as follows:

    #<pre>
    code
    #</pre>
    

    All other comment lines should start with a single "#" at front-of-line. These comment characters will be stripped away by the awk.info renderer.

    Awk.info's renderer adopts the following html shorthand. If a line starts with

    #.WORD other words 
    

    then it is replaced with

    <WORD> other words</WORD>
    

    If no other words follow #.WORD then the line becomes just <WORD>

    Awk.info's renderer supports a few HTML extensions:

    • #.IN path includes a file found in the LAWKER repository at some path inside the trunk.
    • #.CODE path includes the contents of path, wrapped in <pre> tags, and prefixed by the path.
    • #.BODY path is the same as #.CODE but it skips the first paragraph (this is useful when the first paragraph includes tedious details you want to hide from the user).
    • Note that, for #.IN, #.CODE, #.BODY, the path must appear after a single space.

    That's it. Now you can pretty print your code on the web just by adding a little html in the comments.


    categories: Contribute,Jan,2009,Timm

    Show Unit Tests

    Ideally, all code in our code repository comes with unit tests:

    • Either demo scripts to show off functionality
    • Or a regression suite that checks that new changes do not mess up existing code.

    Accordingly code offered to this site can contain unit tests, using the methods described in this page.

    But before going on, we stress that awk.info gratefully accepts awk contributions in any form. That is, including unit tests with code is optional.

    Files

    If your code is in directory yourcode then create a sub-directory yourcode/eg

    Write a test in a file yourcode/eg/yourtest. Divide that test into two parts:

    1. In the first paragraph of that file, write any tedious set up required to get the system ready for the test.
    2. In the second, third, etc. paragraphs, write the code that shows the test.
    3. For example, in the following code, the real test comes after some low-level environmental set up:
      # assumes
      # - the LAWKER trunk has been checked out and
      # - .bash_profile contains: export Lawker="$HOME/svns/lawker/fridge"
      . $Lawker/lib/bash/setup
      
      gawk -f join.awk --source '
      BEGIN { split("tim tom tam",a)
              print join(a,2)
      }'
      

    Write the expected output of that test case in yourcode/eg/yourtest.out

    Regression Tests

    The above file conventions mean that an automatic tool can run over the entire code base and perform a regression test (checking that all the tests still generate their *.out files).
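Such a regression runner might be sketched in a few lines of shell. This is hypothetical, not the site's actual tool, and it assumes each test is a plain shell script:

```shell
# regress: run every <dir>/eg/<test> and diff its output against <test>.out.
# A sketch only; assumes each test file under eg/ is a plain shell script.
regress() {
  for t in "$1"/eg/*; do
    case "$t" in *.out) continue ;; esac   # skip the recorded outputs
    if sh "$t" 2>/dev/null | diff -q - "$t.out" >/dev/null 2>&1
    then echo "PASS $t"
    else echo "FAIL $t"
    fi
  done
}
```

Calling "regress yourcode" then prints one PASS or FAIL line per test.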

    Displaying the Tests (and Output)

    Another advantage of the above scheme is that you can use the tests to document your code.

    To show the test case, add the following into your .awk file:

     #.BODY       yourcode/eg/yourtest
     #.CODE       yourcode/eg/yourtest.out
    

    Then zip the directory yourcode (including yourcode/eg) and send it to awk.info. Once we install those files on our site then when awk.info displays that file, the test case trivia is hidden and the users only see the essential details. For an example of this, see http://awk.info/?gawk/array/join.awk.


    categories: Learn,Jan,2009,Timm

    Four Keys to Gawk

    by T. Menzies

    Imagine Gawk as a kind of a cut-down C language with four tricks:

    1. self-initializing variables
    2. pattern-based programming
    3. regular expressions
    4. associative arrays.

    What do all these do? Well....

    Self-initializing variables.

    You don't need to define variables; they appear as you use them.

    There are only three types: strings, numbers, and arrays.

    To ensure a number is a number, add zero to it.

    x=x+0
    

    To ensure a string is a string, add an empty string to it.

    x= x "" "the string you really want to add"
    

    To ensure your variables aren't global, use them within a function and add more variables to the call. For example if a function is passed two variables, define it with two PLUS the local variables:

     function haslocals(passed1,passed2,         local1,local2,local3) {
            passed1=passed1+1  # a passed parameter
            local1=7           # a local variable, invisible to the caller
     }
    

    Note that it's good practice to add white space between the passed and the local variables.

    Pattern-based programming

    Gawk programs can contain functions AND pattern/action pairs.

    If the pattern is satisfied, the action is called.

     /^\.P1/ { if (p != 0) print ".P1 after .P1, line", NR;
               p = 1;
             }
     /^\.P2/ { if (p != 1) print ".P2 with no preceding .P1, line", NR;
               p = 0;
             }
     END     { if (p != 0) print "missing .P2 at end" }
    

    Two magic patterns are BEGIN and END. These are true before and after all the input files are read. Use END for end actions (e.g. final reports) and BEGIN for start-up actions such as initializing default variables, setting the field separator, or resetting the seed of the random number generator:

     BEGIN {
            while (getline < "Usr.Dict.Words") #slurp in dictionary 
                    dict[$0] = 1
            FS=",";                            #set field separator
            srand();                           #reset random seed
            Round=10;                          #always start globals with U.C.
     }
    

    The default action is {print $0}; i.e. print the whole line.

    The default pattern is 1; i.e. true.

    Patterns are checked, top to bottom, in source-code order.

    Patterns can contain regular expressions. In the above example /^\.P1/ means "front of line followed by a full stop followed by P1". Regular expressions are important enough for their own section.

    A Small Example

    Ok, so now we know enough to explain a simple report function. How does hist.awk work in the following?

     
    % cat /etc/passwd | grep -v \# | cut -d: -f 6|sort |
                        uniq -c | sort -r -n | gawk -f hist.awk
    
                  **************************  26 /var/empty
                                          **   2 /var/virusmails
                                          **   2 /var/root
                                           *   1 /var/xgrid/controller
                                           *   1 /var/xgrid/agent
                                           *   1 /var/teamsserver
                                           *   1 /var/spool/uucp
                                           *   1 /var/spool/postfix
                                           *   1 /var/spool/cups
                                           *   1 /var/pcast/server
                                           *   1 /var/pcast/agent
                                           *   1 /var/imap
                                           *   1 /Library/WebServer
    

    hist.awk reads the maximum width from line one (when NR==1), then scales it to some maximum width value. For each line, it then prints the line ($0) with some stars at front.

    NR==1  { Width = Width ? Width : 40   # sets Width if it is missing
             Scale = $1 > Width ? $1 / Width : 1 
           }
           { Stars=int($1*Scale);  
             print str(Width - Stars," ") str(Stars,"*") $0 
           }
    
    # note that, in the following "tmp" is a local variable
    function str(n,c, tmp) { # returns a string, size "n", of all  "c" 
        while((n--) > 0 ) tmp= c tmp 
        return tmp 
    }
    

    Regular Expressions

    Do you know what these mean?

    • /^[ \t\n]*/
    • /[ \t\n]*$/
    • /^[+-]?([0-9]+[.]?[0-9]*|[.][0-9]+)([eE][+-]?[0-9]+)?$/

    Well, the first two are leading and trailing blank spaces on a line and the last one is the definition of an IEEE-standard number written as a regular expression. Once we know that, we can do a bunch of common tasks like trimming away white space around a string:

      function trim(s,     t) {
        t=s;
        sub(/^[ \t\n]*/,"",t);
        sub(/[ \t\n]*$/,"",t);
        return t
     }
    

    or recognize something that isn't a number:

    if ( $i !~ /^[+-]?([0-9]+[.]?[0-9]*|[.][0-9]+)([eE][+-]?[0-9]+)?$/ ) 
        {print "ERROR: " $i " not a number"}
    

    Regular expressions are an astonishingly useful tool supported by many languages (e.g. Awk, Perl, Python, Java). The following notes review the basics. For full details, see http://www.gnu.org/manual/Gawk-3.1.1/html_node/Regexp.html#Regexp.

    Syntax: Here's the basic building blocks of regular expressions:

    c
    matches the character c (assuming c is a character with no special meaning in regexps).

    \c
    matches the literal character c; e.g. tabs and newlines are \t and \n respectively.

    .
    matches any character except newline.

    ^
    matches the beginning of a line or a string.

    $
    matches the end of a line or a string.

    [abc...]
    matches any of the characters abc... (character class).

    [^abc...]
    matches any character except abc... and newline (negated character class).

    r*
    matches zero or more r's.

    And that's enough to understand our trim function shown above. The regular expression /[ \t]*$/ means trailing whitespace; i.e. zero-or-more spaces or tabs followed by the end of line.

    More Syntax:

    But that's only the start of regular expressions. There's lots more. For example:

    r+
    matches one or more r's.

    r?
    matches zero or one r's.

    r1|r2
    matches either r1 or r2 (alternation).

    r1r2
    matches r1, and then r2 (concatenation).

    (r)
    matches r (grouping).

    Now we can read ^[+-]?([0-9]+[.]?[0-9]*|[.][0-9]+)([eE][+-]?[0-9]+)?$ like this:

    ^[+-]? ...
    Numbers begin with zero or one plus or minus signs.

    ...[0-9]+...
    Simple numbers are just one or more digits.

    ...[.]?[0-9]*...
    which may be followed by a decimal point and zero or more digits.

    ...|[.][0-9]+...
    Alternatively, a number can have no leading digits and just start with a decimal point.

    .... ([eE]...)?$
    Also, there may be an exponent added

    ...[+-]?[0-9]+)?$
    and that exponent is a positive or negative bunch of digits.

    Associative arrays

    Gawk has arrays, but they are only indexed by strings. This can be very useful, but it can also be annoying. For example, we can count the frequency of words in a document (ignoring the icky part about printing them out):

    gawk '{for(i=1;i<=NF;i++) freq[$i]++ }' filename
    

    The array will hold an integer value for each word that occurred in the file. Unfortunately, this treats "foo", "Foo", and "foo," as different words. Oh well. How do we print out these frequencies? Gawk has a special "for" construct that loops over the indices of an array. This script is longer than most command lines, so it will be expressed as an executable script:

     #!/usr/bin/awk -f
      {for(i=1;i <=NF;i++) freq[$i]++ }
      END{for(word in freq) print word, freq[word]  }
    

    You can find out if an element exists in an array at a certain index with the expression:

    index in array
    

    This expression tests whether or not the particular index exists, without the side effect of creating that element if it is not present (which mere reference would do).

    You can remove an individual element of an array using the delete statement:

    delete array[index]
    

    It is not an error to delete an element which does not exist.

    Gawk has a special kind of for statement for scanning an array:

     for (var in array)
            body
    

    This loop executes body once for each different value that your program has previously used as an index in array, with the variable var set to that index.

    The order in which the array is scanned is not defined.

    To scan an array in some numeric order, you need to use keys 1,2,3,... and store the array's length somewhere. Here are some useful array functions. We begin with the usual stack stuff. These stacks have items at 1,2,3,..., and position 0 is reserved for the size of the stack.

     function top(a)        {return a[a[0]]}
     function push(a,x,  i) {i=++a[0]; a[i]=x; return i}
     function pop(a,   x,i) {
       i=a[0]--;  
       if (!i) {return ""} else {x=a[i]; delete a[i]; return x}}
    

    The pop function can be used in the usual way:

     BEGIN {push(a,1); push(a,2); push(a,3);
            while(x=pop(a)) print x }
     3
     2
     1
    

    We can collect everything in an array into a string:

     function a2s(a,  i,s) {
            s=""; 
            for (i in a) {s=s " " i "= [" a[i]"]\n"}; 
            return s}
    
      BEGIN {push(L,1); push(L,2); push(L,3);
            print a2s(L);}
      0= [3]
      1= [1]
      2= [2]
      3= [3]
    

    And we can go the other way and convert a string into an array using the built in split function. These pod files were built using a recursive include function that seeks patterns of the form:

    ^=include file

    This function splits lines on space characters into the array `a' then looks for =include in a[1]. If found, it calls itself recursively on a[2]. Otherwise, it just prints the line:

     function rinclude (line,    x,a) {
       split(line,a,/ /);
       if ( a[1] ~ /^\=include/ ) { 
         while ( ( getline x < a[2] ) > 0) rinclude(x);
         close(a[2])}
       else {print line}
     }
    

    Note that the third argument of the split function can be any regular expression.

    By the way, here's a nice trick with arrays. To print the lines in a files in a random order:

     BEGIN {srand()}
           {Array[rand()]=$0}
     END   {for(I in Array) print Array[I]}
    

    Short, eh? This is not a perfect solution. Gawk can only generate 1,000,000 different random numbers, so the birthday theorem cautions that there is a small chance that lines will be lost when different lines are written to the same randomly selected location. After some experiments, I can report that you lose around one item after 1,000 inserts and 10 to 12 items after 10,000 random inserts. Nothing to write home about, really. But for larger item sets, the above three-liner is not what you want to use. For example, 10,000 to 12,000 items (more than 10%) are lost after 100,000 random inserts. Not good!


    categories: Arrays,Apr,2009,Timm

    saya

    Synopsis

    saya(array [,label,sep,before,after,eq])

    Description

    Array printing function. Contents printed, sorted on key.

    Arguments

    array
    An array.
    label
    (OPTIONAL) A prefix before every item.
    sep
    (OPTIONAL) A string to print between each item. Defaults to new line.
    before
    (OPTIONAL) A string to print before the array. Defaults to "".
    after
    (OPTIONAL) A string to print after the array. Defaults to new line.
    eq
    (OPTIONAL) A string to print between each key/value pair. Defaults to " = ".

    Returns

    Size of the array

    Notes

    The most common usage is to just use the first two arguments; e.g.

    saya(a,"name") ==>
    
    name[1] = tim
    name[2] = menzies
    

    For other usages, see the examples, below.

    Source

    function saya(a,s,sep0,b4,after,eq,   b,c,m,n,key,val,i,j,tmp,sep,pre,post) {
    	sep0  = sep0  ? sep0  : "\n"
    	b4    = b4    ? b4    : "\n"
    	after = after ? after : "\n"
    	eq    = eq    ? eq    : " = "
    	pre   = s     ? s"["  : ""
    	post  = s     ? "]"   : ""
    	m     = asorti(a,b)
    	printf("%s",b4)
    	for(i=1;i<=m;i++)  {
    		key=b[i]
    		val=a[b[i]]
    		printf("%s", sep pre  )
    		n=split(key,tmp,SUBSEP)
    		c = ""
    		for(j=1;j<=n;j++)	{	
    			printf("%s", c tmp[j]  )
    			c=","
    		}
    		printf("%s", post eq val )
    		sep=sep0;
    	};
    	printf("%s",after)
    	return m
    }
    

    Example

    gawk/array/eg/saya »

    gawk -f saya.awk --source '
    BEGIN { 	
    	A["fname"  ] = "tim"
    	A["lname"  ] = "menzies"
    	A["address"] = "usa"
    	saya(A,"",", ","[","]")
    	print ""
    	saya(A,"message")
    	B[2,3,9]   = 100
    	B[10,1,11] = 200
    	B[1,3,10]  = 300
    	saya(B,"b")
    }'
    
    

    gawk/array/eg/saya.out »

    [address = usa, fname = tim, lname = menzies]
    
    message[address] = usa
    message[fname] = tim
    message[lname] = menzies
    
    b[1,3,10] = 300
    b[10,1,11] = 200
    b[2,3,9] = 100
    

    Author

    Tim Menzies


    categories: Timm,Arrays,Function,Feb,2009,ArnoldR

    join

    Synopsis

    join(a [,start,end,sep])

    Description

    Joins an array into a string.

    Arguments

    a
    input array
    start
    Index for where to start in the array a. Default=1.
    end
    Index for where to stop in the array a. Default=size of array
    sep
    (OPTIONAL) What to write between each item. Defaults to blank space.

    If sep is set to the magic value SUBSEP then internally, join adds nothing between the items.

    Returns

    A string of a's contents.

    Example

    gawk/array/eg/join »

    gawk -f join.awk --source '
    BEGIN { split("tim tom tam",a)
            print join(a,2)
    }'
    

    gawk/array/eg/join.out »

    tom tam
    

    Source

    function join(a,start,end,sep,    result,i) {
        sep   = sep   ? sep   :  " "
        start = start ? start : 1
        end   = end   ? end   : sizeof(a)
        if (sep == SUBSEP) # magic value
           sep = ""
        result = a[start]
        for (i = start + 1; i <= end; i++)
            result = result sep a[i]
        return result
    }
    

    Helper

    In earlier gawks, length(a) did not work in functions. Hence....

    function sizeof(a,   i,n) { for(i in a) n++ ; return n }
    

    Change Log

    • Jan 24'08: defaults extended to include start,stop
    • Jan 24'08: Sizeof added to handle old gawk bug

    Author

    Arnold Robbins, then Tim Menzies


    categories: Macros,Tools,Mar,2009,Timm

    Macros

    These pages focus on macro pre-processors (a natural application for Awk).


    categories: Sigs,Tools,Apr,2009,Timm

    Random Signatures

    Contents

    Synopsis

    chmod +x sigs; ./sigs

    Download

    Download from LAWKER.

    Description

    Generates random signatures. Signatures and generation code are included in the same file, so installation is just a matter of calling one file.

    Most of the file is a large "here" document. Paragraph 1 of that document is always added to the signature, followed by one of the following paragraphs, selected at random.

    To add to the signatures, include them in the here document, with one preceding blank line.

    Code

    Pick1

    pick1() {
        gawk 'BEGIN { srand(); RS=""    }
              NR==1 { print $0 "\n"     }
              NR>1  { Recs[rand()] = $0 }
              END   { for ( R in Recs ) {print Recs[R]; exit}}
            ' $1
    }
    

    The Signatures

    cat << SoMEI_mpOSSIblE_sYMBOl | pick1
    tim.menzies {
      title:   dr (Ph.D.) and associate professor;
      align:   csee, west virginia university;
      cell:   esb 841A; 
      url:   http://menzies.us;
      fyi:   unless marked "URGENT", i usually won't get 2 your email b4 5pm; 
    }
    
    Doing a job RIGHT the first time gets the job done. Doing the job WRONG
    fourteen times gives you job security.
    
    Rome did not create a great empire by having meetings, they did it by
    killing all those who opposed them.
    
    INDECISION is the key to FLEXIBILITY.
    
    "When a subject becomes totally obsolete we make it a required
    course."  Peter Drucker
    
    I saw two shooting stars last night but they were only satellites .
    Its wrong to wish on space hardware. I wish, I wish, I wish you cared.
    -- Billy Bragg
    
    Then, in 1995, came the most amazing event in the
    history of programming languages: the introduction
    of Java.  -- Programming Languages: Principles and Practice
    
    Suburbia is where the developer bulldozes out the trees, then names
    the streets after them. --Bill Vaughan
    
    Instant gratification takes too long.
    -- Carrie Fisher
    
    Complexity is easy. Simplicity is hard.
    --Unknown
    

    Author

    Tim Menzies


    categories: TenLiners,Tools,June,2009,Timm

    shuffle.awk

    Contents

    Synopsis

    To rearrange the items in the input list:

     nshuffle(Array)
    

    To rearrange the items in a copy of the input list:

     shuffle(Array,Copy)
    

    The above calls assume that array item zero stores the length of the array. If this is not the case, use:

     shuffles(Array,Copy)
    

    Download

    Download from LAWKER.

    Description

    Suppose we want to shuffle the items of an array into a random order. This shuffle does so in linear time, in place.

    The algorithm comes from the dawn of computer time but I first heard of it from Bart Massey (at Portland State). Thank Bart for the clarity of the explanation and blame me for any silliness in the implementation.

    The Slow Way

    A simple way to shuffle an input array of elements is to:

    • Allocate an output array of the same size.
    • Copy items selected at random from the input to the output array.
    • Compact the input array by sliding the first part of the array down to fill the hole left by the removed item.

    This algorithm is clearly correct. However, the algorithm requires time quadratic in the size of the list, and 2x space.

    The Better Way

    We can easily reduce the time complexity to O(N). The only thing done with the input array is to select random elements from it; the order of the elements in it is irrelevant. Therefore, instead of closing the hole left by a removed element by shifting elements, we'll close it by moving the first remaining element of the input array into the gap.

    Note an important invariant of the algorithm:

       the number of elements left in the input array 
     + the number of elements in the output array 
     ------------------------------------------------
     = the number of elements initially passed in.  
    

    This means that once an element is removed from the input array and the hole filled, there is a fresh hole created right at the beginning of the input array. Let us put the newly removed element in that hole. Now we can dispense with the output array altogether, and just return the input array. Now the extra space needed is just one temporary variable.

    Code

    This code assumes that the array "a" stores its size at "a[0]".

    function nshuffle(a,  i,j,n,tmp) {
      n=a[0]; # a has items at 1...n
      for(i=1;i<=n;i++) {
        j=i+round(rand()*(n-i));
        tmp=a[j];
        a[j]=a[i];
        a[i]=tmp;
      };
      return n;
    }
    function round(x) { return int(x + 0.5) }
    

    nshuffle is fast, but rearranges the order of items in the original list. shuffle generates a new copy of the list with the items in a random order.

    function shuffle(a,b,   i) {
      for(i in a) b[i]=a[i];
      nshuffle(b);
    }
    

    nshuffle also assumes that the list stores the list size at position zero. If this is not the case, use shuffles.

    function shuffles(a,b,   c,n,i) {
      for(i in a) {n++; c[i]=a[i]};
      c[0]=n;
      shuffle(c,b);
    }
    

    Correctness proof

    By number of loop iterations

    Base case:
    When i = 0 the 0 array elements in a below i form a shuffled list of 0 elements. All remaining elements are candidates for append.
    Inductive case:
    Assume that i = k and that the sequence of elements in a below k are a random subsequence of the input values of length k. Now every possible remaining candidate is equally likely to occur at position k in this iteration. Thus at the end of the iteration i = k + 1 and the sequence of elements in a below k + 1 are a random subsequence of the input values of length k + 1.

    Examples

    Random orders

    One way to use the above is to run down a list in a random order. For example:

    BEGIN {
      if (ShuffleDemo) {
      		if (Seed) { srand(Seed) } else { srand() };
      		s2i(ShuffleDemo,L1," ");
      		shuffles(L1,L2);
      		while(Item =pop(L2)) print Item;
      }
    }
    function s2i(str,a,sep,   n,i,tmp) {
      n=split(str,tmp,sep);
      for(i=1;i<=n;i++) a[i]=tmp[i];
      return n;
    }
    function pop(a,   x,i) {
      i=a[0]--;  
      if (!i) {return ""} else {x=a[i]; delete a[i]; return x}
    } 
    

    The above can be run using

     gawk -f shuffle.awk  -v ShuffleDemo="aa bb cc dd"
    

    If you run this twice, you'll see two different orderings. Here's one:

     cc
     aa
     dd
     bb
    

    And here's another:

     dd
     bb
     cc
     aa
    

    Fast sampling

    If you are generating the above lists very quickly, then be aware that srand() seeds its random number generator from the current time, in seconds. So, if you are calling the above command line many times per second, you can get repeated outputs.

    The fix is to supply a seed from the Bash $RANDOM variable:

     gawk -f shuffle.awk -v ShuffleDemo="aa bb cc dd" -v Seed=$RANDOM
    

    Even when called much faster than once a second, the above call will generate (far) fewer repeats.

    Repeats

    If you want to repeat some prior run (say, during debugging), set the Seed variable on the command line using (e.g.)

     gawk -f shuffle.awk -v ShuffleDemo="aa bb cc dd" -v Seed=23
    

    This will always print out the same ordering.

    Author

    Tim Menzies


    categories: Wp,Project,Tools,Mar,2009,Timm

    AWKWORDS

    Contents

    Synopsis

    awkwords --title "Title" file > file.html

    awkwords file > file.html

    Download

    This code requires gawk and bash. To download:

    wget  http://lawker.googlecode.com/svn/fridge/lib/bash/awkwords
    chmod +x awkwords
    

    To test the code, apply it to itself:

    • ./awkwords --title "Does this work?" awkwords > awkwords.html

    Description

    AwkWords is a simple-to-use markup language for writing documentation for programs whose comment lines start with "#" and whose comments contain HTML code.

    For example, awk.info?tools/awkwords shows the html generated from this bash script.

    When used with the --title option, a stand-alone web page is generated (to control the style of that page, see the CSS function, discussed below). When used without --title, it generates some html suitable for inclusion into other pages.

    Also, AwkWords finds all the <h2>, <h3>, <h4>, <h5>, <h6>, <h7>, <h8>, <h9> headings and copies them to a table of contents at the front of the file. Note that AwkWords assumes that the file contains only one <h1> heading; this is printed before the table of contents.

    AwkWords adds some short cuts for HTML markup, as well as including nested contents (see below: "including nested content"). This is useful for including, say, program output along with the actual program.

    Extra Markup

    Short cuts for HTML

    #.XX
    This is replaced by <XX>.
    #.XX words
    This is replaced by <XX>words</XX>. Note that this tag won't work properly if the source text spills over more than one line.
    #.TO url words
    This is replaced by a link to mail to url.
    #.URL url words
    This is replaced by a link to url.

    Including nested content:

    #.IN file
    This line is replaced by the contents of file.
    #.LISTING file
    This line is replaced by a verbatim display of file (no formatting).
    #.CODE file
    This line is replaced by the name of the file, followed by the contents of file, verbatim (no formatting).
    #.BODY file
    This line is replaced by file, less the lines before the first blank line.

    Programmer's Guide

    Awkwords is divided into three functions: unhtml fixes the printing of pre-formatted blocks; toc adds the table of contents while includes handles the details of the extra mark-up.

    Functions

    unhtml

    unhtml() { cat $1| gawk '
      BEGIN {IGNORECASE=1}
      /^<PRE>/   {In=1; print; next}
      /^<\/PRE>/ {In=0; print; next}
      In         {gsub("<","\\<",$0); print; next }
                 {print $0 }'
    }
    

    toc

    toc() { cat $1 | gawk '
     BEGIN             { IGNORECASE = 1 }
     /^<[h]1>/         { Header=$0; next}
     /^[<]h[23456789]>/  { 
           T++ ;
          Toc[T]  = gensub(/(.*)<h(.*)>[ \t]*(.*)[ \t]*<\/h(.*)>(.*)/,
          "<""h\\2><""font color=black>\\•</font></a> <""a href=#" T ">\\3</a></h\\4>",
                    "g",$0)
    		Pre="<a name="T"></a>" }
         { Line[++N] = Pre $0; Pre="" }
     END { print Header;
           print "<" "h2>Contents</h2>"
           print "<" "div id=\"htmltoc\">"
           for(I=1;I<=T;I++) print Toc[I]	
           print "<" "/div><!--- htmltoc --->"
           print "<" "div id=\"htmlbody\">"
           for(I=1;I<=N;I++) print Line[I]
           print "</" "div><!--- htmlbody --->"		
         }'
    }
    

    includes

    The xpand function controls recursive inclusion of content. Note that

    • The last act of this function must be to call xpand1.
    • When including verbatim text, the recursive call to xpands must pass "1" as the second parameter.
    includes() { cat $1 | gawk '
    function xpand(pre,  tmp) {
       if      ($1 ~ "^#.IN")    xpands($2,pre) 
       else if ($1 ~ "^#.BODY" ) xpandsBody($2,pre)
       else if ($1 ~ "^#.LISTING")  {
      	    print "<" "pre>"
    	    xpands($2,1)     # <===== note the recursive call with "1"
    	    print "<" "/pre>" } 
       else if ($1 ~ "^#.CODE")  {
      	    print "<" "p>" $2 "\n<" "pre>"
    	    xpands($2,1)     # <===== note the recursive call with "1"
    	    print "<" "/pre>" } 
       else if ($1 ~ "^#.URL") {
    	    tmp = $2; $1=$2="";
    	    print "<" "a href=\""tmp"\">" trim($0) "</a>"
    	    }
       else if ($1 ~ "^#.TO") {
    	    tmp = $2; $1=$2="";
    	    print "<" "a href=\"mailto:"tmp"\">" trim($0) "</a>"
    	    }
       else 
    	xpand1(pre)
    }
    

    The xpand1 function controls the printing of a single line. If we are formatting verbatim text, we must remove the start-of-html character "<". Otherwise, we expand any html shortcuts.

    function xpand1(pre) {
       if (pre)
            gsub("<","\\<",$0)  # <=== remove start-of-html-character
       else {
            $0= xpandHtml($0)      # <=== expand html short cuts
            sub(/^#/,"",$0) }
       print $0 
    }
    

    The xpandHtml function expands the html shortcuts.

    function xpandHtml(    str,tag) {
       if ($0 ~ /^#\.H1/) {         
    	   $1=""
    	   return "<" "h""1><join>" $0 "</join></" "h1>" }
       if (sub(/^#\./,"",$1)) {
    	   tag=$1;  $1=""
    	   return "<" tag ">"  (($0 ~ /^[ \t]*$/) ? "" : $0"</"tag">")
       }
       return $0
    }
    

    The rest of the code is just some book-keeping and managing the recursive addition of content.

    function xpands(f,pre) {
         if (newFile(f)) {
    	  while((getline <f) > 0) xpand(pre)
              close(f) }
    }
    function xpandsBody(f,pre, using) {
         if (newFile(f)) { 
    	  while((getline <f) >0) {
    	    if ( !using && ($0 ~ /^[\t ]*$/) ) using = 1
    	    if ( using ) xpand(pre)}
    	  close(f) }
    }
    function newFile(f) { return ++Seen[f]==1 }
    function trim (s)   { sub(/^[ \t]*/,"",s);  sub(/[ \t]*$/,"",s); return s } 
    
    BEGIN { IGNORECASE=1 }
          { xpand()      }'
    }
    

    CSS styles

    If used to generate a full web page, then the following styles are added. Note that the htmltoc class controls the appearance of the table of contents.

    css() { 
          echo "<""STYLE type=\"text/css\">"
          cat<<-'EOF'
             div.htmltoc h2 { font-size: medium; font-weight: normal; 
                              margin: 0 0 0 0; margin-left: 30px;}
    	 div.htmltoc h3 { font-size: medium; font-weight: normal; 
                              margin: 0 0 0 0; margin-left: 60px;}
             div.htmltoc h4 { font-size: medium; font-weight: normal; 
                              margin: 0 0 0 0; margin-left: 90px;}
             div.htmltoc h5 { font-size: medium; font-weight: normal; 
                              margin: 0 0 0 0; margin-left: 120px;}
             div.htmltoc h6 { font-size: medium; font-weight: normal; 
                              margin: 0 0 0 0; margin-left: 150px;}
             div.htmltoc h7 { font-size: medium; font-weight: normal; 
                              margin: 0 0 0 0; margin-left: 180px; }
          </STYLE>
    EOF
    }
    

    Main command line

    main() { cat $1 | includes | unhtml | toc; }
    
    if [ "$1" = "--title" ]
    then 
         echo "<""html><""head><""title>$2</title>`css`</head><""body>"; 
         shift 2
         main $1
         echo "<""/body><""/html>"
    else 
         main $1
    fi 
    

    Bugs

    There's no checking for valid input (e.g. pre-formatting tags that never close).

    If the input file contains no html markup, the results are pretty messy.

    Recursive includes fail silently if the referenced file does not exist.

    I don't like the way I need a separate pass to do "unhtml". I tried making it work within the code but it got messy.

    Author

    Tim Menzies

    categories: DataMining,May,2009,Timm

    NBC: a not-so-naive Bayes Classifier

    Contents

    Synopsis

     ./nbc -[hlx] train test
    

    Download

    Download from Lawker

    Description

    Is less more? Can a few lines of gawk developed in a day or two stand in for a sophisticated state-of-the-art JAVA package? In the general case there may be software engineering advantages to working with rich languages like JAVA. However, in the specific case of a Naive Bayes classifier for discrete data, it is interesting to test if less is indeed more.

    A Naive Bayes classifier collects frequency counts of old events, grouped into "classes". Then, if a new event arrives without a classification, it checks through the old list of classes looking for the one with the highest frequency counts for this new event.

    The method is called "naive" since it assumes that the attributes are statistically independent of each other, given the class. This assumption allows us to collect frequency counts on each attribute value in isolation (and not on pairs, or triples, or quads of values).

    In practice, this "naive" strategy works remarkably well: it often performs as well as other schemes that try to model the interactions between frequencies.

    Hence, we call this system a not-so-naive Bayes Classifier.

    Summary of Results

    In summary, the performance of this simple gawk-based Naive Bayes classifier is quite remarkable.

    • Not surprisingly, nbc.awk's accuracy is very similar to that of a standard Bayes classifier (from the WEKA system). That is, this implementation is adequate.
    • Quite surprisingly, the awk version does not run much slower than java. In fact, for under 1000 instances, this implementation is comparably fast or faster than a much more mature JAVA-based tool.

    Details: Accuracy

    The following table compares classification accuracies between nbc.awk and WEKA's weka.classifiers.NaiveBayes.

    All the data sets were discrete so no kernel estimation was used.

    Results come from a 10-way cross-val (but no initial randomization of data set order).

    The table is sorted by increasing mean difference. Nbc.awk does better than WEKA Bayes on the datasets shown at the bottom of the table.

                    mean                significant 
                    difference   std.   difference? 
          data      in accuracy  dev.   (alpha=0.01)
     -----------------------------------------------
            soybean | -3.02 |   2.02 |  y
               iris | -2.14 |   4.51 |  n
                zoo | -0.38 |   6.54 |  n
      primary-tumor | -0.30 |   3.28 |  n
          audiology | -0.27 |   2.67 |  n
           mushroom | -0.25 |   0.23 |  n
             splice |  0.00 |   0.00 |  n
           kr-vs-kp |  0.00 |   0.00 |  n
      breast-cancer |  0.00 |   0.00 |  n
     contact-lenses |  0.00 |   0.00 |  n
               vote |  0.27 |   1.47 |  n
              lymph |  0.73 |   5.01 |  n
           breast-w |  1.63 |   1.32 |  y
           credit-a |  7.40 |   5.60 |  y
             letter |  9.44 |   1.40 |  y
    

    On the whole, nbc.awk works as well as WEKA Bayes.

    Details: Runtimes

    The following table compares the runtimes of nbc.awk (awk) vs WEKA BAYES (java) measured in seconds.

    Each line shows the total time for ten training+test runs (one for each item in the cross-val). E.g. letter actually ran in 4.92 seconds (on average) and this was called 10 times.

    Note: the time for dividing files for the x-val is not shown.

    The table is sorted on the ratio of awk vs java runtimes. Ratios less than one mean awk ran faster than java. Nbc.awk does better than WEKA Bayes on the datasets shown at the bottom of the table (below the middle line).

                     runtimes (secs) | 
                    -----------------|---------------------
               data  awk   java ratio| insts  attrs classes
     --------------------------------|---------------------
             letter  49.2  17.6  2.8  20,000   17      27
           mushroom  10.1   5.9  1.7   8,124   23       3
           kr-vs-kp   8.1   5.1  1.6   3,916   37       3
             splice  11.3   7.8  1.4   3,190   62       4
            soybean   4.2   3.4  1.2     683   36      20
     ------------------------------------------------------
          audiology   2.9   3.4  0.9     226   70      25
      primary-tumor   1.3   2.8  0.5     339   18      23
               vote   1.0   2.4  0.4     435   17       3
     contact-lenses   0.6   2.0  0.3      24    5       4
      breast-cancer   0.7   2.4  0.3     286   10       3
           credit-a   1.1   3.3  0.3     690   16       3
           breast-w   1.0   3.5  0.3     699   10       3
              lymph   0.6   2.4  0.2     148   19       5
               iris   0.5   2.5  0.2     150    5       4
                zoo   0.6   2.4  0.2     101   18       8
     ------------------------------------------------------
              total  93.1  66.8  1.4
    

    All up, the awk-based learner was 40% slower than JAVA. For larger data sets, JAVA was always faster. However, for smaller datasets (under 1000 instances) the awk version was nearly as fast, or faster.

    Details: Memory

    We have run this small Awk script on 100s of megabytes of data, without crashes or core dumps. The code is very memory efficient, unlike WEKA, which loads all the data into RAM.

    Discussion

    It is hardly surprising that a state-of-the-art tool kit built and optimized by JAVA gurus can out-perform awk code on large examples. However, what is surprising is that a 32-line AWK script built and debugged in a weekend often works nearly as well, or better.

    Perhaps "nbc" is not-so-naive after all.

    Options

    -x
    run an example
    -h
    print help text
    -l
    print legal notice

    Installation

    To check the download, unzip contents.zip, then run:

     chmod +x nbc
     ./nbc nbceg.train nbceg.test  | 
     gawk -F, '{print $0  "\t " ($1 !=$2 ? " <== bad" : "")}' 
    

    This should print:

     malign_lymph,malign_lymph        
     metastases,metastases    
     malign_lymph,malign_lymph        
     metastases,metastases    
     malign_lymph,metastases   <== bad
     malign_lymph,malign_lymph        
     malign_lymph,malign_lymph        
     metastases,metastases    
     metastases,metastases    
     metastases,metastases    
     malign_lymph,malign_lymph        
     metastases,metastases    
     malign_lymph,malign_lymph        
    

    Awk code

    Here is the nbc.awk code called by the Bash script (shown below).

     BEGIN {
      #Internal globals:
         Total=0    # count of all instances
       # Classes    # table of class names/frequencies
       # Freg       # table of counters for values in attributes in classes
       # Seen       # table of counters for values in attributes
       # Attributes # table of number of values per attribute
       }
    
     Pass==1 {train()}
     Pass==2 {print $NF "," classify()}
    
     function train(    i,c) { 
       Total++;
       c=$NF;
       Classes[c]++;
       for(i=1;i<=NF;i++) {
         if ($i=="?") continue;
         Freq[c,i,$i]++
         if (++Seen[i,$i]==1) Attributes[i]++}
     }
    
     function classify(         i,temp,what,like,c) {  
       like = -100000; # smaller than any log
       for(c in Classes) {  
         temp=log(Classes[c]/Total); #uses logs to stop numeric errors
         for(i=1;i<NF;i++) {  
           if ( $i=="?" ) continue;
           temp += log((Freq[c,i,$i]+1)/(Classes[c]+Attributes[i]));
         };
         if ( temp >= like ) {like = temp; what=c}
       };
       return what;
     }
    

    Bash Code

    Copyright

    copyleft() { cat<<EOF 
    nbc: a naive bayes classifier
    Copyright (C) 2004 Tim Menzies
    
    This program is free software; you can redistribute it and/or
    modify it under the terms of the GNU General Public License
    as published by the Free Software Foundation, version 2.
    
    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.
    
    You should have received a copy of the GNU General Public License
    along with this program; if not, write to the Free Software
    Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.
    EOF
    }
    

    Usage

     usage() { cat<<-EOF
    	Usage: nbc  [FLAGs] TRAIN TEST
    	Naive bayes classifier.
    
    	TRAIN and TEST are comma-separated data files with the same
    	number of columns. The last column of each is the class
    	symbol. This classifier learns from TRAIN and then tries
    	to classify the examples in TEST.
    
    	Flags: 
    	  -h        print this help text
    	  -l        copyright notice
    	  -x	    run an example
    	EOF
    	exit
     } 
    

    Demo code

     nbcDemo() { 
    	main nbceg.train nbceg.test
     } 
    

    Main

     main() {
    	gawk -F, -f nbc.awk Pass=1 $1 Pass=2 $2
     }
    

    Command-line Processing

     demo=""
     while getopts "hlx" flag
     do case "$flag" in
            l)  copyleft; exit;;
            h)  usage; exit ;;
            x)  demo="nbcDemo";;
        esac
     done
     shift $(($OPTIND - 1))
     if [ -n "$demo" ]
     then $demo
          exit
     else  main $1 $2
     fi
    

    Author

    Tim Menzies


    categories: TextMining,Mar,2009,Timm

    Text Mining Issue Reports

    References

    Tim Menzies and Andrian Marcus:

    Description

    Severis is a set of Awk, bash, sed, etc. scripts for finding predictors of high severity issues in text reports. Test engineers write such issue reports whenever they encounter anomalies in the code they are inspecting.

    Severis was designed to be an audit tool for test engineers, a second "look over the shoulder" to alert a senior engineer if a junior test engineer was doing something strange.

    At least for the text issue reports studied by Severis, very simple tools were enough to find the terms that predict different issue severities.
