Awk.Info

"Cause a little auk awk
goes a long way."


categories: Xgawk,XML,Awk100,Apr,2009,JurgenK

XMLgawk

Editor's note: Programmers often take awk "as is", never thinking to use it as a lab in which they can explore other language extensions. An alternative approach is to treat the Awk code base as a reusable library of parsers, regular expression engines, and so on, and to make modifications to the language. This second approach is taken in the Awk A* project and, as shown here, in XMLgawk.

IMHO, XMLgawk is one of the most exciting innovations seen in Gawk for many years. It shows that Awk is more than "just" a text processor: it is also a candidate technology for modern XML-based web applications.

Purpose

Extends standard gawk with built-in XML processing.

Developers

Main developers: Jurgen Kahrs and Andrew Schorr.

Conceptual guidance: Manuel Collado.

MS Windows build expert: Victor Paeza.

Contributor of ideas for new features: Peter Saveliev.

Domain

XML processing, plus libraries for other extensions to Gawk.

Description

XMLgawk is an experimental extension of the GNU Awk interpreter. It includes a small XML parsing library which is built upon the Expat XML parser. The parsing library is a very thin layer on top of Expat (implementing a pull-interface) and can also be used without GNU Awk to read XML data files.

Both XMLgawk and its XML puller library require only an ANSI C compatible compiler (GCC works, as do most vendors' ANSI C compilers) and a 'make' program.

XMLgawk provides the following functionality:

  • AWK's way of reading data line by line is supplemented by reading XML files node by node.
  • XMLgawk can load .awk files as well as shared libraries.
  • Adds support for an @include directive in the source code. This is the same feature provided by the current igawk script.

Current

3=Released

Use

3=Free/public domain.

Date Deployed

November 2003.

Dated

April 28, 2009.

Url


categories: Xgawk,XML,Dec,2009,WimVB

Xgawk on Windows

After some hard work I seem to be able to build XMLgawk for native Windows :-). Jurgen, Victor and Manuel: thanks for all the tips!

If you're interested, have a look at http://www.wimdows.info/project/xgawk and have fun.

-- Wim van Blitterswijk


categories: Xgawk,XML,May,2009,JurgenK

XML Well-Formedness

(This page comes from the XML Gawk tutorial.)

One of the advantages of using the XML format for storing data is that there are formalized methods of checking the correctness of the data. Whether the data is written by hand or generated automatically, it is always advantageous to have tools for finding out whether the new data obeys certain rules (is a tag misspelt? is another one missing? is a third one in the wrong place?).

These mechanisms for checking correctness are applied at different levels, the lowest being well-formedness. The next higher levels of correctness checking are the DTD and (even higher, but not yet required by standards) the Schema. If you have a DTD (or Schema) specification for your XML file, you can hand it over to a validation tool, which applies the specification, checks for conformance and tells you the result. A simple tool for validation against a DTD is xmllint, which is part of libxml and therefore installed on most GNU/Linux systems. Validation against a Schema can be done with more recent versions of xmllint or with the xsv tool.
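The first two levels can be checked from the shell with xmllint. The following is a minimal sketch, assuming xmllint is installed and that the file declares its DTD in a DOCTYPE; the function name check_levels and its return codes are hypothetical choices for this illustration.

```shell
# check_levels FILE   (hypothetical helper, not part of xmllint or gawk)
# Returns 0 if FILE is well-formed and valid against its own DTD,
#         1 if FILE is not well-formed,
#         2 if FILE is well-formed but fails DTD validation.
check_levels() {
  # Level 1: parse only. --noout suppresses the echoed document.
  xmllint --noout "$1" 2>/dev/null || return 1
  # Level 2: validate against the DTD named in the DOCTYPE declaration.
  xmllint --noout --valid "$1" 2>/dev/null || return 2
  return 0
}
```

A call such as check_levels book.xml then tells a script which level of correctness the file has reached before any further processing is attempted.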

There are two reasons why validation is currently not incorporated into the gawk interpreter.

  1. Validation is not trivial and only DTD-validation has reached a proper level of standardization, support and stability.
  2. We want a tool that can process all well-formed XML files, not just a tool for processing clean data. A good tool is one that you can rely on and use for fixing problems. What would you think of a car that refused to drive just because there is some mud on the street and the sun isn't shining?

Here is a script for testing well-formedness of XML data. The real work of checking well-formedness is done by the XML parser incorporated into gawk. We are only interested in the result and some details for error diagnostics and recovery.

     @load xml
     END {
       if (XMLERROR)
         printf("XMLERROR '%s' at row %d col %d len %d\n",
                 XMLERROR, XMLROW, XMLCOL, XMLLEN)
       else
         print "file is well-formed"
     }

As usual, the script starts by switching gawk into XML mode. We are not interested in the content of the nodes being traversed, so no action is triggered for any node. Only at the end (when the XML file is already closed) do we look at the variables reporting success or failure. If the variable XMLERROR ever contains anything other than 0 or the empty string, there is a parsing error and the parser stops tree traversal at the place where the error is. An explanatory message is contained in XMLERROR (its content depends on the specific parser used on this platform). The other variables in the example contain the row (XMLROW) and the column (XMLCOL) at which the XML file is badly formed.

Author

Jurgen Kahrs

Copyright

Copyright (C) 2000, 2001, 2002, 2004, 2005, 2006, 2007 Free Software Foundation, Inc.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with the Invariant Sections being "GNU General Public License", the Front-Cover Texts being (1) (see below), and with the Back-Cover Texts being (2) (see below).

  • A GNU Manual
  • You have freedom to copy and modify this GNU Manual, like GNU software. Copies published by the Free Software Foundation raise funds for GNU development.

categories: Xgawk,XML,May,2009,JurgenK

Printing an Outline of an XML file

(This page comes from the XML Gawk tutorial.)

When working with XML files, it is sometimes necessary to get an overview of the structure of an XML file. Ordinary editors confront us with a view that is not so pretty. For example:

     
     <book id="hello-world" lang="en">
     <bookinfo>
     <title>Hello, world</title>
     </bookinfo>
     <chapter id="introduction">
     <title>Introduction</title>
     <para>This is the introduction. It has two sections</para>
     <sect1 id="about-this-book">
     <title>About this book</title>
     <para>This is my first DocBook file.</para>
     </sect1>
     <sect1 id="work-in-progress">
     <title>Warning</title>
     <para>This is still under construction.</para>
     </sect1>
     </chapter>
     </book>

Software developers are used to reading text files with proper indentation like this:

     book lang='en' id='hello-world'
       bookinfo
         title
       chapter id='introduction'
         title
         para
         sect1 id='about-this-book'
           title
           para
         sect1 id='work-in-progress'
           title
           para

Compared with a graphical tree view, it is a bit harder here to recognize hierarchical dependencies among the nodes. But proper indentation lets you keep an overview of files with more than 100 elements (a purely graphical view of such large files becomes unbearable).

The outline tool produces such an indented output and we will now write a script that imitates this kind of output.

     @load xml
     XMLSTARTELEM {
       printf("%*s%s", 2*XMLDEPTH-2, "", XMLSTARTELEM)
       for (i=1; i<=NF; i++)
         printf(" %s='%s'", $i, XMLATTR[$i])
       print ""
     }

For the first time, we don't just check if the XMLSTARTELEM variable contains a tag name, but we also print the name out, properly indented with a printf format statement (two blank characters for each indentation level).
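The %*s conversion takes its field width from the argument list, so printing an empty string with a width of 2*XMLDEPTH-2 produces the indentation. Here is a small sketch in plain awk, with the depth supplied as the first input field as a stand-in for XMLDEPTH (no XML extension is loaded); the function name indent_demo is hypothetical.

```shell
# indent_demo: read lines of the form "DEPTH NAME" and print NAME
# indented by two blanks per level below the first.
indent_demo() {
  # %*s consumes two arguments: the width (2*depth-2) and the string ("").
  awk '{ printf("%*s%s\n", 2*$1-2, "", $2) }'
}
```

Feeding it the three lines "1 book", "2 bookinfo" and "3 title" prints the names at indentation 0, 2 and 4, the same stepping the outline script uses.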

Note the use of the associative array XMLATTR. Whenever we enter a markup block (and XMLSTARTELEM is non-empty), the array XMLATTR contains all the attributes of the tag. You can find out the value of an attribute by accessing the array with the attribute's name as an array index. In a well-formed XML file, all the attribute names of one tag are distinct, so we can be sure that each attribute has its own place in the array. The only thing that's left to do is to iterate over all the entries in the array and print name and value in a formatted way. Earlier versions of this script actually iterated over the associative array with the for (i in XMLATTR) loop. Doing so is still an option, but in this case we wanted to make sure that attributes are printed in exactly the same order that is given in the original XML data. The exact order of attribute names is reproduced in the fields $1 .. $NF. So the for loop can iterate over the attribute names in the fields $1 .. $NF and print the attribute values XMLATTR[$i].
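The difference between the two loops can be tried without the XML extension. In the sketch below, the array attr and the record "id lang" are hypothetical stand-ins for XMLATTR and for the fields that XMLgawk fills in on the book start tag of the sample file; the function name attrs_in_order is also an assumption of this example.

```shell
# attrs_in_order: print the attributes of a simulated <book> start tag
# in document order by walking the fields $1 .. $NF.
attrs_in_order() {
  awk 'BEGIN {
    # Stand-in for the fields set by the XML reader: names in source order.
    $0 = "id lang"
    # Stand-in for XMLATTR: attribute values indexed by attribute name.
    attr["id"] = "hello-world"
    attr["lang"] = "en"
    printf("book")
    # \047 is a single quote, written in octal to survive shell quoting.
    for (i = 1; i <= NF; i++)
      printf(" %s=\047%s\047", $i, attr[$i])
    print ""
  }'
}
```

Replacing the loop with for (name in attr) would print the same pairs, but awk leaves the traversal order of an associative array unspecified, which is why the field-based loop is the safer choice when order matters.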

Author

Jurgen Kahrs


categories: Xgawk,XML,May,2009,JurgenK

Pulling data from an XML file

(This page comes from the XML Gawk tutorial.)

In a procedural language, software developers expect to determine the control flow within a program themselves. They write down what has to be done first, second, third and so on. In the pattern-action model of AWK, novice developers often have the oppressive feeling that

  • they are not in control
  • events seem to crackle down on them from nowhere
  • data flow seems chaotic and invariants don't exist
  • assertions seem impossible

This feeling is characteristic of a whole class of programming environments. Most people would never think that the following programming environments have something in common, but they do. It is the absence of a static control flow which unites these environments under one roof:

  • In GUI frameworks like the X Window system, the main program is a trivial event loop – the main program does nothing but wait for events and invoke event-handlers.
  • In the Prolog programming language, the main program has the form of a query – and then the Prolog interpreter decides which rules to apply to solve the query.
  • When writing a compiler with the lex and yacc tools, the main program only invokes a function yyparse() and the exact control flow depends on the input source which controls invocation of certain rules.
  • When writing an XML parser with the Expat XML parser, the main program registers some callback handler functions, passes the XML source to the Expat parser and the detailed invocation of the callback functions depends on the XML source.
  • Finally, AWK's pattern-action model encourages writing scripts that have no main program at all.

Within the context of XML, a terminology has been invented which distinguishes the procedural pull style from the event-guided push style. The script in the previous section was an example of a push-style script. Recognizing that most developers don't like their program's control flow to be pushed around, we will now present a script which pulls one item after the other from the XML file and decides what to do next in a more obvious way.

     @load xml
     BEGIN {
       while (getline > 0) {
         switch (XMLEVENT) {
           case "STARTELEM": {
             printf("%*s%s", 2*XMLDEPTH-2, "", XMLSTARTELEM)
             for (i=1; i<=NF; i++)
               printf(" %s='%s'", $i, XMLATTR[$i])
             print ""
           }
         }
       }
     }

One XML event after the other is pulled out of the data with the getline command. It's like feeling each grain of sand pour through your fingers. Users who prefer this style of reading input will also appreciate another novelty: the variable XMLEVENT. While the push-style script on another page used the event-specific variable XMLSTARTELEM to detect the occurrence of a new XML element, our pull-style script always looks at the value of the same universal variable XMLEVENT to detect a new XML element.

Formally, we have a script that consists of one BEGIN pattern followed by an action which is always invoked. You see, this is a corner case of the pattern-action model, reduced so far that its essence has disappeared. Instead of the patterns you now see the cases of a switch statement, embedded in a while loop (for reading the file item by item). Obviously, we now have explicit conditionals instead of the implicit ones we used formerly. The actions invoked within the case conditions are the same as those we have seen in the push approach.
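The same contrast can be seen in plain awk on ordinary lines of text. The sketch below is only an analogy under that assumption: it counts lines containing a title start tag by pattern matching on text, not by XML parsing, and the function names count_push and count_pull are hypothetical.

```shell
# Push style: awk's implicit main loop hands each line to the rule;
# the script never asks for input itself.
count_push() {
  awk '/<title>/ { n++ } END { print n + 0 }'
}

# Pull style: a single BEGIN action owns the loop and fetches
# one line after the other with getline.
count_pull() {
  awk 'BEGIN {
    while ((getline line) > 0)
      if (line ~ /<title>/)
        n++
    print n + 0
  }'
}
```

Both functions read standard input and print the same count; only the place where the loop lives differs, which is exactly the difference between the two XMLgawk scripts.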

Author

Jurgen Kahrs
