Awk.Info

"Cause a little auk awk
goes a long way."


categories: XML,Apr,2009,Admin

XML

These pages focus on XML tools and Awk.


categories: XML,June,2009,SteveC

xmlparse.awk

Contents

Synopsis

Download

Description:

Code

Author

A simple XML parser for awk

Synopsis

awk -f xmlparse.awk [FILESPEC]...

Download

From LAWKER.

Description:

This script is a simple XML parser for (modern variants of) awk. Input in XML format is saved to two arrays, "type" and "item".

The term, "item", as used here, refers to a distinct XML element, such as a tag, an attribute name, an attribute value, or data.

The index into each array is the sequence number in which an item was encountered. For example, the third item's type is held in type[3], and its value is stored in item[3].

The "type" array contains the type of the item encountered for each sequence number. Types are expressed as a single word: "error" (invalid item or other error), "begin" (open tag), "attrib" (attribute name), "value" (attribute value), "end" (close tag), and "data" (data between tags).

The "item" array contains the value of the item encountered for each sequence number. For types "begin" and "end", the item value is the name of the tag. For "error", the value is the text of the error message. For "attrib", the value is the attribute name. For "value", the value is the attribute value. For "data", the value is the raw data.
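To make the parallel-array layout concrete, here is a small sketch in which the two arrays are filled in by hand for the input <a href="x">hi</a> and then walked in sequence order. The values are typed in for illustration, not produced by running the parser, and the name walk_items is hypothetical.

```sh
# Hand-filled stand-in for the arrays xmlparse.awk would build for
#   <a href="x">hi</a>
walk_items() {
    awk 'BEGIN {
        n = 0
        type[++n] = "begin";  item[n] = "A"
        type[++n] = "attrib"; item[n] = "href"
        type[++n] = "value";  item[n] = "x"
        type[++n] = "data";   item[n] = "hi"
        type[++n] = "end";    item[n] = "A"

        # type[i] and item[i] always describe the same (i-th) item.
        for ( i = 1; i <= n; i++ )
            printf( "%d %-6s %s\n", i, type[i], item[i] )
    }'
}

walk_items
```

Because both arrays share one index, a caller can look ahead or behind by simple arithmetic: the value belonging to an "attrib" item at position i is at position i + 1.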

WARNING: XML-quoted values ("entities") in the data and attribute values are *NOT* unquoted; they are stored as-is.
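Since entities are stored as-is, a caller that needs the literal characters has to unquote them itself. One possible sketch follows; it handles only the five predefined XML entities (numeric character references are not handled), and the names unquote_entities and unquote are made up for this example.

```sh
unquote_entities() {
    awk '
    # Replace the five predefined XML entities with the characters they
    # stand for. "&amp;" goes last so that "&amp;lt;" decodes to "&lt;"
    # and not to "<".
    function unquote( s ) {
        gsub( /&lt;/,   "<",    s )
        gsub( /&gt;/,   ">",    s )
        gsub( /&quot;/, "\"",   s )
        gsub( /&apos;/, "\047", s )   # \047 is the apostrophe
        gsub( /&amp;/,  "\\&",  s )   # "\\&" yields a literal ampersand
        return s
    }
    { print unquote( $0 ) }'
}

printf '%s\n' 'a &lt; b &amp;&amp; c &gt; d' | unquote_entities
```

In the parser's terms, such a function would be applied to item[n] for items of type "data" and "value" only; "cdata" items are literal text and should be left alone.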

Code

BEGIN {

In XML, literal "<" and ">" are only valid as tag delimiters; to include a "<" or ">" as data, they must be quoted: "&lt;" and "&gt;". So we know that if we encounter a ">", we have reached the end of a tag. This makes a convenient end-of-record marker, as the end-of-tag delimiter marks a special event, whereas a new-line is simply whitespace in XML.

        RS = ">";

        lineno = 1;
        sptr = 0;
}
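The effect of this record separator can be seen in isolation with a few lines of awk. This is a stand-alone sketch, not part of xmlparse.awk, and split_on_gt is a hypothetical name.

```sh
split_on_gt() {
    awk 'BEGIN { RS = ">" }
         # Each record is everything up to (not including) the next ">".
         { printf( "record %d: [%s]\n", NR, $0 ) }'
}

printf '%s' '<a><b>text</b></a>' | split_on_gt
```

The third record here is "text</b": raw data and the tag that follows it arrive in the same record, which is why the main rule has to look for "<" inside each record.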
Count input lines.
{
        data = $0;
        lineno += gsub( /\n/, "", data );
        data = "";
}

Special modes of operation. These handle special XML sections, such as literal character data containing XML meta-characters ("cdata" sections), comments, and processing instructions ("pi") for other document processors.

"Cdata" sections are terminated by the sequence, "]]>".

( mode == "cdata" ) {
        if ( $0 ~ /\]\]$/ ) {
                sub( /\]\]$/, "", $0 );
                mode = "";
        };
        item[idx] = item[idx] RS $0;
        next;
}

Comment sections are terminated by the sequence, "-->".

( mode == "comment" ) {
        if ( $0 ~ /--$/ ) {
                sub( /--$/, "", $0 );
                mode = "";
        };
        item[idx] = item[idx] RS $0;
        next;
}
Processing instruction sections are terminated by the sequence, "?>".
( mode == "pi" ) {
        if ( $0 ~ /\?$/ ) {
                sub( /\?$/, "", $0 );
                mode = "";
        };
        item[idx] = item[idx] RS $0;
        next;
}

( !mode ) {
        mline = 0;

Our record separator is the end-of-tag marker, ">". If we've encountered an end-of-tag marker, we should have a beginning-of-tag marker ("<") somewhere in the input record. If not, either there is a spurious end-of-tag marker, or the record was terminated by the end-of-file.

        p = index( $0, "<" );

Any data preceding the beginning-of-tag marker is raw data. If no beginning-of-tag marker is present, everything in the input is data.

        if ( !p || ( p > 1 )) {
                idx += 1;
                type[idx] = "data";
                item[idx] = ( p ? substr( $0, 1, ( p - 1 )) : $0 );
                if ( !p ) next;
                $0 = substr( $0, p );
        };

Recognize special XML sections. Sections are not processed as XML, but handled specially. If the section ends with the current input record, we continue processing XML in the next record; otherwise, we enter a special mode and perform special processing.

Character data ("cdata") sections contain literal character data containing XML meta-characters that should not be processed. Character data sections begin with the sequence, "<![CDATA[" and end with "]]>". This section may span input records.

        if ( $0 ~ /^<!\[[Cc][Dd][Aa][Tt][Aa]\[/ ) {
                idx += 1;
                type[idx] = "cdata";
                $0 = substr( $0, 10 );
                if ( $0 ~ /\]\]$/ ) sub( /\]\]$/, "", $0 );
                else {
                        mode = "cdata";
                        mline = lineno;
                };
                item[idx] = $0;
                next;
        }

Comments begin with the sequence, "<!--" and end with "-->". This section may span input records.

        else if ( $0 ~ /^<!--/ ) {
                idx += 1;
                type[idx] = "comment";
                $0 = substr( $0, 5 );
                if ( $0 ~ /--$/ ) sub( /--$/, "", $0 );
                else {
                        mode = "comment";
                        mline = lineno;
                };
                item[idx] = $0;
                next;
        }

Declarations begin with the sequence, "<!" and end with ">". This section may *NOT* span input records.

        else if ( $0 ~ /^<!/ ) {
                idx += 1;
                type[idx] = "decl";
                $0 = substr( $0, 3 );
                item[idx] = $0;
                next;
        }

Processing instructions ("pi") begin with the sequence, "<?" and end with "?>". This section may span input records.

        else if ( $0 ~ /^<\?/ ) {
                idx += 1;
                type[idx] = "pi";
                $0 = substr( $0, 3 );
                if ( $0 ~ /\?$/ ) sub( /\?$/, "", $0 );
                else {
                        mode = "pi";
                        mline = lineno;
                };
                item[idx] = $0;
                next;
        };

Beyond this point, we're dealing strictly with a tag.

        idx += 1;

A tag that begins with "</" is a close tag: it closes a tag-enclosed block.

        if ( substr( $0, 1, 2 ) == "</" ) {
                type[idx] = "end";
                tag = $0 = substr( $0, 3 );
        }

A tag that begins simply with "<" is an open tag: it starts a tag-enclosed block. Note that a stand-alone tag (one that ends in "/>") will be handled later, and will appear as an open tag and close tag, with no data between.

        else {
                type[idx] = "begin";
                tag = $0 = substr( $0, 2 );
        };

The tag name is saved in "tag" so that we can retrieve it later should we find that the tag is stand-alone and need to save a close tag item.

        sub( /[ \n\t/].*$/, "", tag );
        tag = toupper( tolower( tag ));
        item[idx] = tag;

Validate the tag name. If invalid, indicate so and exit.

        if ( tag !~ /^[A-Za-z][-+_.:0-9A-Za-z]*$/ )
        {
                type[idx] = "error";
                item[idx] = "line " lineno ": " tag ": invalid tag name";
                exit( 1 );
        }

If an open tag is encountered, its name is recorded on the stack. If a close tag is encountered, its name is compared against the name on the top of the stack. If the names differ, an error is generated (XML does not allow overlapping tags).

        if ( type[idx] == "begin" ) {
                sptr += 1;
                lstack[sptr] = lineno;
                tstack[sptr] = tag;
        }
        else if ( type[idx] == "end" ) {
                if ( tag != tstack[sptr] ) {
                        type[idx] = "error";
                        item[idx] = "line " lineno ": " tag \
                                    ": unexpected close tag, expecting " \
                                        tstack[sptr];
                        exit( 1 );
                };
                delete tstack[sptr];
                sptr -= 1;
        };
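The stack discipline described above can be tried separately from the parser. The sketch below assumes its input is a list of events, one per line, in the form "begin NAME" or "end NAME" (the same words the parser uses for its item types). The name check_nesting is hypothetical; this is not part of xmlparse.awk.

```sh
check_nesting() {
    awk '
    # An open tag pushes its name on the stack.
    $1 == "begin" { stack[++sp] = $2; next }

    # A close tag must match the name on top of the stack.
    $1 == "end" {
        if ( sp == 0 || stack[sp] != $2 ) {
            printf( "line %d: %s: unexpected close tag, expecting %s\n",
                    NR, $2, ( sp ? stack[sp] : "(none)" ) )
            bad = 1
            exit 1
        }
        delete stack[sp--]
        next
    }

    # Anything still on the stack at the end was never closed.
    END {
        if ( bad ) exit 1
        for ( ; sp > 0; sp-- ) {
            printf( "%s: unclosed tag\n", stack[sp] )
            bad = 1
        }
        if ( !bad ) print "balanced"
        exit ( bad ? 1 : 0 )
    }'
}

printf 'begin A\nbegin B\nend B\nend A\n' | check_nesting
```

Feeding it the overlapping sequence "begin A", "begin B", "end A" reports the close tag on line 3 as unexpected, since B is still open.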

        sub( /[^ \n\t/]*[ \n\t]*/, "", $0 );

Beyond this point, we're dealing with the tag attributes, if any, and/or the stand-alone end-of-tag marker.

        while ( $0 ) {

If $0 contains only a slash (/), then the tag we're processing is stand-alone (that is, it ends in "/>"), so we generate a close tag, but no data between the open and close tags.

                if ( $0 == "/" )
                {
                        idx += 1;
                        type[idx] = "end";
                        item[idx] = tag;
                        delete lstack[sptr];
                        delete tstack[sptr];
                        sptr -= 1;
                        break;
                };

The attribute name is determined. Note that the attribute name is also saved to "attrib" so that we can reference it should the attribute not include a value. If the attribute does not include a value, its name is given as its value.

                idx += 1;
                type[idx] = "attrib";
                attrib = $0;
                sub( /=.*$/, "", attrib );
                attrib = tolower( attrib );

                item[idx] = attrib;

Validate the attribute name. If invalid, indicate so and exit.

                if ( attrib !~ /^[A-Za-z][-+_0-9A-Za-z]*$/ )
                {
                        type[idx] = "error";
                        item[idx] = "line " lineno ": " attrib \
                                        ": invalid attribute name";
                        exit( 1 );
                }

                sub( /^[^=]*/, "", $0 );

Each attribute must have a value. If one isn't explicit in the input, we assign it one equal to the name of the attribute itself. Attribute values in the input may be in one of three forms: enclosed in double quotes ("), enclosed in single quotes/apostrophes ('), or a single word.

                idx += 1;
                type[idx] = "value";

                if ( substr( $0, 1, 1 ) == "=" ) {
                        if ( substr( $0, 2, 1 ) == "\"" ) {
                                item[idx] = substr( $0, 3 );
                                sub( /".*$/, "", item[idx] );
                                sub( /^="[^"]*"/, "", $0 );
                        }
                        else if ( substr( $0, 2, 1 ) == "'" ) {
                                item[idx] = substr( $0, 3 );
                                sub( /'.*$/, "", item[idx] );
                                sub( /^='[^']*'/, "", $0 );
                        }
                        else {
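                                # Drop the leading "=", then cut the value at
                                # the first whitespace character or slash.
                                item[idx] = substr( $0, 2 );
                                sub( /[ \n\t/].*$/, "", item[idx] );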
                                sub( /^=[^ \n\t/]*/, "", $0 );
                        };
                }
                else item[idx] = attrib;

                sub( /^[ \n\t]*/, "", $0 );

        };

        attrib = "";
        tag = "";
        next;
}

END {

If mode is defined, the input stream ended without terminating an XML section. Thus, the input contains invalid XML.

        if ( mode ) {
                idx += 1;
                type[idx] = "error";
                if ( mode == "cdata" ) mode = "character data";
                else if ( mode == "pi" ) mode = "processing instruction";
                item[idx] = "line " mline ": unterminated " mode;
        };

If an open tag occurred with no corresponding close tag, we have invalid XML.

        for ( n = sptr; n; n -= 1 ) {
                idx += 1;
                type[idx] = "error";
                item[idx] = "line " lstack[n] ": " \
                                tstack[n] ": unclosed tag";
        };
}

The following simple examples demonstrate the use of the accumulated data from the XML input stream.

END {
If errors occurred, generate appropriate messages and exit without further processing.
        if ( type[idx] == "error" ) {
                for ( n = idx; n && ( type[n] == "error" ); n -= 1 );
                for ( n += 1; n <= idx; n += 1 ) print "ERROR:", item[n];
                exit 1;
        };
# Print simplified XML. If output completes successfully and the stack
# is not empty, close tags are generated for each tag on the stack.
#       in_tag = 0;
#
#       for ( n = 1; n <= idx; n += 1 ) {
#
#               if ( type[n] == "attrib" ) printf( " %s", item[n] );
#
#               else if ( type[n] == "begin" ) {
#                       if ( in_tag ) printf( ">" );
#                       else in_tag = 1;
#                       printf( "<%s", item[n] );
#               }
#
#               else if ( type[n] == "cdata" ) {
#                       if ( in_tag ) {
#                               printf( ">" );
#                               in_tag = 0;
#                       };
#                       printf( "<![CDATA[%s]]>", item[n] );
#               }
#
#               else if ( type[n] == "comment" ) {
#                       if ( in_tag ) {
#                               printf( ">" );
#                               in_tag = 0;
#                       };
#                       printf( "<!--%s-->", item[n] );
#               }
#
#               else if ( type[n] == "data" ) {
#                       if ( in_tag ) {
#                               printf( ">" );
#                               in_tag = 0;
#                       };
#                       printf( "%s", item[n] );
#               }
#
#               else if ( type[n] == "decl" ) {
#                       if ( in_tag ) {
#                               printf( ">" );
#                               in_tag = 0;
#                       }
#                       printf( "<!%s>", item[n] );
#               }
#
#               else if ( type[n] == "end" ) {
#                       if ( in_tag ) {
#                               printf( "/>" );
#                               in_tag = 0;
#                       }
#                       else printf( "</%s>", item[n] );
#               }
#
#               else if ( type[n] == "error" ) {
#                       if ( in_tag ) {
#                               printf( ">" );
#                               in_tag = 0;
#                       };
#                       print "";
#                       print "<!-- ERROR:", item[n], "-->";
#                       break;
#               }
#
#               else if ( type[n] == "pi" ) {
#                       if ( in_tag ) {
#                               printf( ">" );
#                               in_tag = 0;
#                       };
#                       printf( "<?%s?>", item[n] );
#               }
#
#               else if ( type[n] == "value" ) {
#                       if ( item[n] ~ /"/ ) printf( "='%s'", item[n] );
#                       else printf( "=\"%s\"", item[n] );
#               };
#       };
#
#       if ( in_tag ) printf( ">" );
#
#       for ( n = sptr; n; n -= 1 ) printf( "</%s>", tstack[n] );

# Print an object tree, identifying tags and attributes. Nesting is
# emphasized by indenting.

#       indent = "";
#       for ( n = 1; n <= idx; n += 1 ) {
#               if ( type[n] == "attrib" ) print indent "attrib", item[n];
#               else if ( type[n] == "begin" ) {
#                       print indent "begin", item[n];
#                       indent = indent "  ";
#               }
#               else if ( type[n] == "end" ) {
#                       indent = substr( indent, 3 );
#                       print indent "end", item[n];
#               }
#               else if ( type[n] == "error" ) print "ERROR:", item[n];
#               else print indent type[n];
#       };

Print in a linear format suitable for parsing by shell scripts. Multi-line values have the new-lines replaced with the character sequence, "\n" (backslash, n) to ensure the entire name/value pair occurs on a single line. All occurrences of backslashes (\) in the original value are themselves backslash quoted.

        for ( n = 1; n <= idx; n += 1 ) {
                value = item[n];
                gsub( /\\/, "\\\\", value );
                gsub( /\n/, "\\n", value );
                print type[n], value;
        };

        for ( n = sptr; n; n -= 1 ) print "end", tstack[n];
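A consumer of this linear format has to reverse the quoting. A minimal sketch, assuming each input line has the form "TYPE VALUE" as printed by the loop above; decode_linear and unquote are hypothetical names.

```sh
decode_linear() {
    awk '
    # Undo the quoting applied by the printing loop: "\n" becomes a
    # new-line and "\\" becomes a single backslash. A left-to-right scan
    # is used so that an escaped backslash followed by "n" is not
    # mistaken for "\n".
    function unquote( s,    out, i, c, d ) {
        out = ""
        for ( i = 1; i <= length( s ); i++ ) {
            c = substr( s, i, 1 )
            if ( c == "\\" && i < length( s ) ) {
                d = substr( s, ++i, 1 )
                if ( d == "n" ) d = "\n"
                out = out d
            }
            else out = out c
        }
        return out
    }
    {
        kind = $1
        value = $0
        sub( /^[^ ]+ ?/, "", value )      # drop the leading type word
        printf( "%s=[%s]\n", kind, unquote( value ) )
    }'
}

printf '%s\n' 'data first\nsecond' 'value C:\\temp\\new' | decode_linear
```

The second sample line shows why two global substitutions would not be enough: the quoted text contains a backslash followed by "n" that belongs to an escaped backslash and the word "new", not to a new-line.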

Print attribute values and data in a linear format suitable for searching (e.g. with grep). Attributes are represented as:

      [TAG/]...TAG/ATTRIB=VALUE
Data is represented as:
      [TAG/]...TAG: DATA

Note that all tag names are displayed in upper-case. All attribute names are displayed in lower-case.

Multi-line values have the new-lines replaced with the character sequence, "\n" (backslash, n) to ensure the entire name/value pair occurs on a single line. All occurrences of backslashes (\) in the original value are themselves backslash quoted.

#       sptr = 0;
#       for ( n = 1; n <= idx; n += 1 ) {
#               if ( type[n] == "attrib" ) {
#                       lead = stack[1];
#                       for ( m = 2; m <= sptr; m += 1 ) \
#                               lead = lead "/" stack[m];
#                       lead = lead "/" item[n] "=";
#               }
#               else if ( type[n] == "begin" ) stack[++sptr] = item[n];
#               else if (( type[n] == "cdata" ) || ( type[n] == "data" )) {
#                       lead = stack[1];
#                       for ( m = 2; m <= sptr; m += 1 ) \
#                               lead = lead "/" stack[m];
#                       lead = lead ": ";
#               }
#               else if ( type[n] == "end" ) sptr -= 1;
#               if (( type[n] == "data" ) || ( type[n] == "value" )) {
#                       value = item[n];
#                       gsub( /\\/, "\\\\", value );
#                       gsub( /\n/, "\\n", value );
#                       print lead value;
#               };
#       };
}
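As a rough illustration of the searchable format, the sketch below builds the same kind of TAG/.../TAG paths from a stream of "TYPE VALUE" lines rather than from the arrays directly. It assumes well-formed input and uses hypothetical names (to_paths, path); it is a separate sketch, not the commented-out code above.

```sh
to_paths() {
    awk '
    # Join the names currently on the tag stack with "/".
    function path(    p, m ) {
        p = stack[1]
        for ( m = 2; m <= sp; m++ ) p = p "/" stack[m]
        return p
    }
    # Split every line into its type word and the rest.
    {
        kind = $1
        rest = $0
        sub( /^[^ ]+ ?/, "", rest )
    }
    kind == "begin"  { stack[++sp] = rest }
    kind == "end"    { delete stack[sp--] }
    kind == "attrib" { name = rest }
    kind == "value"  { print path() "/" name "=" rest }
    kind == "data"   { print path() ": " rest }'
}

to_paths <<'EOF'
begin BOOK
attrib id
value hello
begin TITLE
data Hello, world
end TITLE
end BOOK
EOF
```

With output in this shape, a search such as grep '^BOOK/TITLE: ' finds every title regardless of where it occurred in the original file.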

Author

Steve Coile


categories: Xgawk,XML,Awk100,Apr,2009,JurgenK

XMLgawk

Editor's note: Programmers often take awk "as is", never thinking to use it as a lab in which they can explore other language extensions. An alternate approach is to treat the Awk code base as a reusable library of parsers, regular expression engines, and so on, and to make modifications to the language. This second approach is taken in the Awk A* project and, as shown here, in XMLgawk.

IMHO, XMLgawk is one of the most exciting innovations seen in Gawk for many years. It shows that Awk is more than "just" a text processor: it is also a candidate technology for modern XML-based web applications.

Purpose

Extends standard gawk with built-in XML processing.

Developers

Main developers: Jurgen Kahrs and Andrew Schorr.

Conceptual guidance: Manuel Collado.

MS Windows build expert: Victor Paeza.

Contributor of ideas for new features: Peter Saveliev.

Domain

XML processing, plus libraries for other extensions to Gawk.

Description

XMLgawk is an experimental extension of the GNU Awk interpreter. It includes a small XML parsing library which is built upon the Expat XML parser. The parsing library is a very thin layer on top of Expat (implementing a pull-interface) and can also be used without GNU Awk to read XML data files.

Both XMLgawk and its XML puller library require only an ANSI C compatible compiler (GCC works, as do most vendors' ANSI C compilers) and a 'make' program.

XMLgawk provides functionality including:

  • AWK's way of reading data line by line is supplemented by reading XML files node by node.
  • XMLgawk can load .awk files as well as shared libraries.
  • It adds support for an @include directive in the source code. This is the same feature provided by the current igawk script.

Current

3=Released

Use

3=Free/public domain.

Date Deployed

November 2003.

Dated

April 28, 2009.

Url


categories: Xgawk,XML,Dec,2009,WimVB

Xgawk on Windows

After some hard work I seem to be able to build XMLgawk for native Windows :-). Jurgen, Victor and Manuel: thanks for all the tips!

If you're interested, have a look at http://www.wimdows.info/project/xgawk and have fun.

-- Wim van Blitterswijk


categories: Xgawk,XML,May,2009,JurgenK

XML Well-Formedness

(This page comes from the XML Gawk tutorial.)

One of the advantages of using the XML format for storing data is that there are formalized methods of checking correctness of the data. Whether the data is written by hand or generated automatically, it is always advantageous to have tools for finding out if the new data obeys certain rules (is a tag misspelt? another one missing? a third one in the wrong place?).

These mechanisms for checking correctness are applied at different levels. The lowest level is well-formedness. The next higher levels of correctness check are the level of the DTD and (even higher, but not required yet by standards) the Schema. If you have a DTD (or Schema) specification for your XML file, you can hand it over to a validation tool, which applies the specification, checks for conformance and tells you the result. A simple tool for validation against a DTD is xmllint, which is part of libxml and therefore installed on most GNU/Linux systems. Validation against a Schema can be done with more recent versions of xmllint or with the xsv tool.

There are two reasons why validation is currently not incorporated into the gawk interpreter.

  1. Validation is not trivial and only DTD-validation has reached a proper level of standardization, support and stability.
  2. We want a tool that can process all well-formed XML files, not just a tool for processing clean data. A good tool is one that you can rely on and use for fixing problems. What would you think of a car that refused to drive outside just because there is some mud on the street and the sun isn't shining?

Here is a script for testing well-formedness of XML data. The real work of checking well-formedness is done by the XML parser incorporated into gawk. We are only interested in the result and some details for error diagnostics and recovery.
     @load xml
     END {
       if (XMLERROR)
         printf("XMLERROR '%s' at row %d col %d len %d\n",
                 XMLERROR, XMLROW, XMLCOL, XMLLEN)
       else
         print "file is well-formed"
     }

As usual, the script starts by switching gawk into XML mode. We are not interested in the content of the nodes being traversed, so we have no action to be triggered for a node. Only at the end (when the XML file is already closed) do we look at some variables reporting success or failure. If the variable XMLERROR ever contains anything other than 0 or the empty string, there is an error in parsing and the parser will stop tree traversal at the place where the error is. An explanatory message is contained in XMLERROR (whose contents depend on the specific parser used on this platform). The other variables in the example contain the line number and the column at which the XML file is malformed.

Author

Jurgen Kahrs

Copyright

Copyright (C) 2000, 2001, 2002, 2004, 2005, 2006, 2007 Free Software Foundation, Inc.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with the Invariant Sections being "GNU General Public License", the Front-Cover Texts being (1) (see below), and with the Back-Cover Texts being (2) (see below).

  1. A GNU Manual
  2. You have freedom to copy and modify this GNU Manual, like GNU software. Copies published by the Free Software Foundation raise funds for GNU development.

categories: XML,June,2009,JurgenK

Dealing with DTDs

(This page comes from the XML Gawk tutorial.)

The declaration of a document type in the header of an XML file is an optional part of the data, not a mandatory one. If such a declaration is present, the reference to the DTD will not be resolved and its contents will not be parsed. However, the presence of the declaration will be reported by gawk. When the declaration starts, the variable XMLSTARTDOCT contains the name of the root element's tag; and later, when the declaration ends, the variable XMLENDDOCT is set to 1. In between, the array variable XMLATTR will be populated with the public identifier of the DTD (if any) and the system identifier of the DTD (if any). Other parts of the declaration (elements, attributes and entities) will not be reported.

     @load xml
     XMLDECLARATION {
       version    = XMLATTR["VERSION"        ]
       encoding   = XMLATTR["ENCODING"       ]
       standalone = XMLATTR["STANDALONE"     ]
     }
     XMLSTARTDOCT {
       root       = XMLSTARTDOCT
       pub_id     = XMLATTR["PUBLIC"         ]
       sys_id     = XMLATTR["SYSTEM"         ]
       intsubset  = XMLATTR["INTERNAL_SUBSET"]
     }
     XMLENDDOCT {
       print FILENAME
       print "  version    '" version    "'"
       print "  encoding   '" encoding   "'"
       print "  standalone '" standalone "'"
       print "  root   id  '" root   "'"
       print "  public id  '" pub_id "'"
       print "  system id  '" sys_id "'"
       print "  intsubset  '" intsubset "'"
       print ""
       version = encoding = standalone = ""
       root = pub_id = sys_id = intsubset = ""
     }

Most users can safely ignore these variables if they are only interested in the data itself. But some users may take advantage of these variables for checking requirements of the XML data. If your database consists of thousands of XML files of diverse origins, the public identifiers of their DTDs will help you gain an overview of the kind of data you have to handle and of potential version conflicts. The script shown above will assist you in analyzing your data files. It looks for the variables mentioned above and evaluates their content. At the start of the DTD, the tag name of the root element is stored; the identifiers are also stored and finally, those values are printed along with the name of the file which was analyzed. After each DTD, the remembered values are set to an empty string until the DTD of the next file arrives.

In the following, you can see an example output of the script shown above. Obviously, the first entry is a DocBook file (English version 4.2) containing a book element which has to be validated against a local copy of the DTD at CERN in Switzerland. The second file is a chapter element of DocBook (English version 4.1.2) to be validated against a DTD on the Internet. Finally, the third entry is a file describing a project of the GanttProject application. Only a tag name for the root element is specified; a DTD does not seem to exist.

     data/dbfile.xml
       version    ''
       encoding   ''
       standalone ''
       root   id  'book'
       public id  '-//OASIS//DTD DocBook XML V4.2//EN'
       system id  '/afs/cern.ch/sw/XML/XMLBIN/share/www.oasis-open.org/docbook/xmldtd-4.2/docbookx.dtd'
       intsubset  ''
     
     data/docbook_chapter.xml
       version    ''
       encoding   ''
       standalone ''
       root   id  'chapter'
       public id  '-//OASIS//DTD DocBook XML V4.1.2//EN'
       system id  'http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd'
       intsubset  ''
     
     data/exampleGantt.gan
       version    '1.0'
       encoding   'UTF-8'
       standalone ''
       root   id  'ganttproject.sourceforge.net'
       public id  ''
       system id  ''
       intsubset  ''

You may wish to make changes to this script if you need it in daily work. For example, the script currently reports nothing for files which have no DTD declaration in them. You can easily change this by appending an action for the END rule which reports in case all the variables root, pub_id and sys_id are empty. As it is, the script parses the entire XML file, although the DTD is always positioned at the top, before the root element. Parsing the root element is unnecessary and you can improve the speed of the script significantly if you tell it to stop parsing when the first element (the root element) comes in.

  XMLSTARTELEM { nextfile } 

Author

Jurgen Kahrs

Copyright

Copyright (C) 2000, 2001, 2002, 2004, 2005, 2006, 2007 Free Software Foundation, Inc.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with the Invariant Sections being "GNU General Public License", the Front-Cover Texts being (1) (see below), and with the Back-Cover Texts being (2) (see below).

  1. A GNU Manual
  2. You have freedom to copy and modify this GNU Manual, like GNU software. Copies published by the Free Software Foundation raise funds for GNU development.

categories: Xgawk,XML,May,2009,JurgenK

Printing an Outline of an XML file

(This page comes from the XML Gawk tutorial.)

When working with XML files, it is sometimes necessary to get an overview of the structure of an XML file. Ordinary editors confront us with a view that is not-so-pretty. For example:

     
     <book id="hello-world" lang="en">
     
     <bookinfo>
     <title>Hello, world</title>
     </bookinfo>

     
     <chapter id="introduction">
     <title>Introduction</title>
     
     <para>This is the introduction. It has two sections</para>
     
     <sect1 id="about-this-book">
     <title>About this book</title>

     
     <para>This is my first DocBook file.</para>
     
     </sect1>
     
     <sect1 id="work-in-progress">
     <title>Warning</title>
     
     <para>This is still under construction.</para>

     
     </sect1>
     
     </chapter>
     </book>

Software developers are used to reading text files with proper indentation like this:

     book lang='en' id='hello-world'
       bookinfo
         title
       chapter id='introduction'
         title
         para
         sect1 id='about-this-book'
           title
           para
         sect1 id='work-in-progress'
           title
           para

Here, it is a bit harder to recognize hierarchical dependencies among the nodes. But proper indentation allows you to oversee files with more than 100 elements (a purely graphical view of such large files gets unbearable).

The outline tool produces such an indented output and we will now write a script that imitates this kind of output.

     @load xml
     XMLSTARTELEM {
       printf("%*s%s", 2*XMLDEPTH-2, "", XMLSTARTELEM)
       for (i=1; i<=NF; i++)
         printf(" %s='%s'", $i, XMLATTR[$i])
       print ""
     }

For the first time, we don't just check if the XMLSTARTELEM variable contains a tag name, but we also print the name out, properly indented with a printf format statement (two blank characters for each indentation level).
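The indentation trick in that printf call does not depend on the XML extension. As a minimal sketch in plain awk (the function name indent_outline and the "depth name" input format are made up for this illustration, and an awk that accepts "*" as a printf width, such as gawk or mawk, is assumed):

```shell
# Hypothetical helper: reads "depth name" pairs and prints an indented outline.
# Plain awk only; no XML extension is involved.
indent_outline() {
    awk '{
        # "%*s" takes its width from the next argument. Printing the empty
        # string at width 2*depth-2 gives two blanks per level below the first.
        printf("%*s%s\n", 2 * $1 - 2, "", $2)
    }'
}

outline=$(printf '1 book\n2 bookinfo\n3 title\n2 chapter\n' | indent_outline)
printf '%s\n' "$outline"
```

Here the first field plays the role of XMLDEPTH and the second the role of XMLSTARTELEM.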

Note the use of the associative array XMLATTR. Whenever we enter a markup block (and XMLSTARTELEM is non-empty), the array XMLATTR contains all the attributes of the tag. You can find out the value of an attribute by accessing the array with the attribute's name as an array index. In a well-formed XML file, all the attribute names of one tag are distinct, so we can be sure that each attribute has its own place in the array. The only thing that's left to do is to iterate over all the entries in the array and print name and value in a formatted way.

Earlier versions of this script iterated over the associative array with the for (i in XMLATTR) loop. Doing so is still an option, but in this case we wanted to make sure that attributes are printed in exactly the same order that is given in the original XML data. The exact order of attribute names is reproduced in the fields $1 .. $NF. So the for loop can iterate over the attribute names in the fields $1 .. $NF and print the attribute values XMLATTR[$i].
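The same idea, an associative array for the values plus an ordered list of names, can be tried out in plain awk. The following is only a sketch under assumptions: the function name attrs_in_order and the "name=value" input format are invented here, the array attr stands in for XMLATTR, and the numbered array name stands in for the fields $1 .. $NF.

```shell
# Hypothetical helper: prints name='value' pairs in their input order.
attrs_in_order() {
    awk '{
        line = ""
        # Store each value under its name (as XMLATTR does) and remember
        # the names in a numbered list that preserves the input order.
        for (i = 1; i <= NF; i++) {
            eq = index($i, "=")
            name[i] = substr($i, 1, eq - 1)
            attr[name[i]] = substr($i, eq + 1)
        }
        # Walking the numbered list, not "for (k in attr)", keeps the order.
        # \047 is the octal escape for an apostrophe.
        for (i = 1; i <= NF; i++)
            line = line sprintf(" %s=\047%s\047", name[i], attr[name[i]])
        print substr(line, 2)
    }'
}

ordered=$(printf 'lang=en id=hello-world\n' | attrs_in_order)
printf '%s\n' "$ordered"
```

A for (k in attr) loop would visit the same entries, but awk makes no promise about the order in which it does so.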

Author

Jurgen Kahrs

Copyright

Copyright (C) 2000, 2001, 2002, 2004, 2005, 2006, 2007 Free Software Foundation, Inc.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with the Invariant Sections being "GNU General Public License", the Front-Cover texts being (1) (see below), and with the Back-Cover Texts being (2) (see below).

  • A GNU Manual
  • You have freedom to copy and modify this GNU Manual, like GNU software. Copies published by the Free Software Foundation raise funds for GNU development.

categories: Xgawk,XML,May,2009,JurgenK

Pulling data from an XML file

(This page comes from the XML Gawk tutorial.)

In a procedural language, software developers expect to determine the control flow within a program themselves. They write down what has to be done first, second, third and so on. In the pattern-action model of AWK, novice software developers often have the oppressive feeling that

  • they are not in control
  • events seem to rain down on them from nowhere
  • data flow seems chaotic and invariants don't exist
  • assertions seem impossible

This feeling is characteristic of a whole class of programming environments. Most people would never think that the following programming environments have something in common, but they do. It is the absence of a static control flow which unites these environments under one roof:

  • In GUI frameworks like the X Window system, the main program is a trivial event loop – the main program does nothing but wait for events and invoke event-handlers.
  • In the Prolog programming language, the main program has the form of a query – and then the Prolog interpreter decides which rules to apply to solve the query.
  • When writing a compiler with the lex and yacc tools, the main program only invokes a function yyparse() and the exact control flow depends on the input source which controls invocation of certain rules.
  • When writing an XML parser with the Expat XML parser, the main program registers some callback handler functions, passes the XML source to the Expat parser and the detailed invocation of callback functions depends on the XML source.
  • Finally, AWK's pattern-action model encourages writing scripts that have no main program at all.

Within the context of XML, a terminology has been invented which distinguishes the procedural pull style from the event-guided push style. The script in the previous section was an example of a push-style script. Recognizing that most developers don't like their program's control flow to be pushed around, we will now present a script which pulls one item after the other from the XML file and decides what to do next in a more obvious way.

     @load xml
     BEGIN {
       while (getline > 0) {
         switch (XMLEVENT) {
           case "STARTELEM": {
             printf("%*s%s", 2*XMLDEPTH-2, "", XMLSTARTELEM)
             for (i=1; i<=NF; i++)
               printf(" %s='%s'", $i, XMLATTR[$i])
             print ""
           }
         }
       }
     }

One XML event after the other is pulled out of the data with the getline command. It's like feeling each grain of sand pour through your fingers. Users who prefer this style of reading input will also appreciate another novelty: the variable XMLEVENT. While the push-style outline script used the event-specific variable XMLSTARTELEM to detect the occurrence of a new XML element, our pull-style script always looks at the value of the same universal variable XMLEVENT to detect a new XML element.

Formally, we have a script that consists of one BEGIN pattern followed by an action which is always invoked. You see, this is a corner case of the pattern-action model which has been reduced so far that its essence has disappeared. Instead of the patterns you now see the cases of a switch statement, embedded into a while loop (for reading the file item-wise). Obviously, we have explicit conditionals now, instead of the implicit ones we used formerly. The actions invoked within the case conditions are the same we have seen in the push approach.
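The contrast between the two styles can also be seen without the XML extension. As a minimal sketch in plain awk (the function names push_count and pull_count are made up for this illustration), both of the following count the non-empty lines of their input:

```shell
# Hypothetical helpers, not part of xgawk: both count non-empty input lines.

# Push style: the implicit main loop reads records and fires the rule.
push_count() {
    awk '$0 != "" { n++ } END { print n + 0 }'
}

# Pull style: one BEGIN action owns the loop and fetches each record itself.
pull_count() {
    awk 'BEGIN {
        while ((getline) > 0)       # plain getline sets $0, like the main loop
            if ($0 != "") n++
        print n + 0
    }'
}

pushed=$(printf 'a\n\nb\nc\n' | push_count)
pulled=$(printf 'a\n\nb\nc\n' | pull_count)
printf 'push: %s, pull: %s\n' "$pushed" "$pulled"
```

In the pull version the pattern has turned into an ordinary if statement inside the loop, which is the same shift the switch statement makes in the script above.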

Author

Jurgen Kahrs

Copyright

Copyright (C) 2000, 2001, 2002, 2004, 2005, 2006, 2007 Free Software Foundation, Inc.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with the Invariant Sections being "GNU General Public License", the Front-Cover texts being (1) (see below), and with the Back-Cover Texts being (2) (see below).

  • A GNU Manual
  • You have freedom to copy and modify this GNU Manual, like GNU software. Copies published by the Free Software Foundation raise funds for GNU development.

categories: XML,May,2009,MarkB

xmldump

Displays components within a set of named XML files. With no options, displays the XML files much like the cat command. When options are supplied, displays only the selected components.

Editor's note: for those who do not want to take the plunge into xgawk, dumpxml shows that standard Awk can process XML. For a discussion of this file, see comp.lang.awk.

Synopsis

xmldump [-cdit] xmlfile ...

Download

This code requires awk and ksh. To download:

wget  http://lawker.googlecode.com/svn/fridge/lib/ksh/dumpxml
chmod +x dumpxml

Description

One reason I have a distinct loathing for XML, esp. in configuration files, is it's very difficult to parse (with line-based editors) and it's not very readable either. In my book, this breaks both of the fundamental tests for a useable configuration standard .... whoever first thought XML was a good idea for anything except document mark-up should be shot (steps off soap box before he gets lynched for posting off-topic).

Anyway, personal grievances aside, here's a script I was forced to write, unhappy and at gun-point, to try and make some XML files I was dealing with more readable. This demonstrates how much work it takes in AWK just to parse the structure alone. This doesn't even take into consideration reading attribute values or parsing DTDs.

The next person who thinks it's a good idea to write a configuration file in XML will have to personally answer to my wrath ........ perhaps I should set-up a new website banxml.org or xmlboycott.com with the sole intent to make the world see reason. Anyone with me? :-)

Code

Set up

#!/bin/ksh
CALL=$(basename $0)
USAGE="Syntax: $CALL [-cdit] xmlfile ..."

DisplayXML()

Displays selected components of a named XML file. Arguments:

  • arg 1: 0 no doc content, 1 display doc content
  • arg 2: 0 no tags, 1 display tags
  • arg 3: 0 no comments, 1 display comments
  • arg 4: 0 do not change indentation, 1 recalculate indents
  • arg 5: filename

DisplayXML()
{
    nawk -v shdoc=$1 -v shtags=$2 -v shcomm=$3 -v indent=$4 '
    {
        pushline=levhigh=0

        ### If indenting strip any leading blanks from input
        CloseFlags()
        if (indent && !comment) sub("^[ \t][ \t]*","")

        ### Strip carriage returns
        gsub("\\r","")

        ### Scan line one character at a time
        for (c=1;c<=length($0);c++)
        {
            CloseFlags()
            ReadChars()
            DisplayChars()
        }

        if (newline)
        {
            print ""
            newline=0
        }
    }

    function CloseFlags()
    {
        if (comment==2) comment=0       # close comment
        if (tag==2) tag=0               # close tag
        if (quotes==2) quotes=0         # close quote
    }

    function ReadChars()
    {
        ch=substr($0,c,1)

        if (!comment)
        {
            if (ch=="<" && substr($0,c,4)=="<!--")
            {
                comment=1                       # opening comment
                ch=substr($0,c,4)               # stretch chars
                c+=3
            }
            else if (!tag && ch=="<")
            {
                tag=1                            # opening tag

                ### Increase or decrease indent depending
                ### on tag style <tag> or </tag> 
                ### but not <?tag?> or <!tag>
                ch2=substr($0,c,2)
                if (ch2=="</") level--
                else if (ch2!="<?" && ch2!="<!")
                {
                    level++
                    levhigh=1
                }
            }
            else if (tag)
            {
                if (!quotes && ch=="\"") quotes=1 # opening quote
                else if (quotes && ch=="\"") quotes=2   # closing 
                else if (!quotes && ch==">")
                {
                    tag=2                   # closing tag

                    ### Catch <tag/> style where
                    ### indent level should not change
                    if (c>1 && substr($0,c-1,2)=="/>") level--
                }
            }
        }
        else
        {
            if (ch=="-" && substr($0,c,3)=="-->")
            {
                comment=2                 # closing comment
                ch=substr($0,c,3)         # stretch chars
                c+=2
            }
        }
    }

    function DisplayChars()
    {
        ### Work out whether to display this character or not
        dispch=0
        if (comment && shcomm) dispch=1
        if (tag && shtags) dispch=1
        if (!comment && !tag && shdoc) dispch=1
        if (dispch)
        {
            if (indent) IndentLine()
            printf("%s",ch)
            if (!newline) newline=1
        }
    }

    function IndentLine()
    {
        if (pushline || comment) return
        pushline=1

        ### Have begun processing first tag so indent level
        ### may already be one level too high
        if ((thislevel=(levhigh?level-1:level))<0) thislevel=0
        for (lev=0;lev<thislevel;lev++) printf("  ")
    }' "$5"

}
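The indent bookkeeping in DisplayXML() is easier to follow in isolation. The following is a much simplified sketch, not the author's method: the function name tag_depths is made up, tags are found with match() instead of a character-by-character scan, and comments, quoted ">" characters inside attributes and tags split across lines are all ignored.

```shell
# Hypothetical helper: prints each opening tag indented by its nesting depth.
tag_depths() {
    awk '{
        rest = $0
        # Walk over every "<...>" on the line, left to right.
        while (match(rest, /<[^>]*>/)) {
            t = substr(rest, RSTART, RLENGTH)
            rest = substr(rest, RSTART + RLENGTH)
            if (t ~ /^<[?!]/) continue             # <?tag?> and <!tag>: no change
            if (t ~ /^<\//) { level--; continue }  # </tag>: back out one level
            printf("%*s%s\n", 2 * level, "", t)
            if (t !~ /\/>$/) level++               # <tag/> leaves the level alone
        }
    }'
}

depths=$(printf '<a><b/><c>text</c></a>\n' | tag_depths)
printf '%s\n' "$depths"
```

The full script reaches the same result for <tag/> by a different route: it raises the level when the tag opens and lowers it again when it meets "/>".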

Start Up

comments=0
doc=0
indent=0
tags=0
help=0

while getopts cdit c
do
    case $c in
        c) comments=1;;
        d) doc=1;;
        i) indent=1;;
        t) tags=1;;
        ?) help=1;;
    esac
done
shift $(($OPTIND - 1))

Display help message

if [ $help -eq 1 -o $# -eq 0 ]; then
    cat << EOF

Displays components within a set of named XML files.
With no options, displays the XML files much like the cat command.
When options are supplied, displays only the selected components.

$USAGE

where   -c      displays comments
        -d      displays document contents
        -i      indent properly
        -t      displays tags

EOF
    exit 2
fi

If no options supplied, then display entire XML files

if [ $comments -eq 0 -a $doc -eq 0 -a $tags -eq 0 ]; then
    comments=1
    doc=1
    tags=1
fi

first=1
while [ $# -gt 0 ]
do
    if [ $first -eq 1 ]; then
         first=0
    else printf '\f'  ### form feed (Ctrl+L) between files
    fi

    echo "<!-- --- $1 --- -->"
    DisplayXML $doc $tags $comments $indent "$1"
    shift
done 

Author

Mark R.Bannister <markb at freedomware.co.uk>.


categories: XML,May,2009,JanW

getXML.awk

Synopsis

gawk -f getXML.awk

Download

Download from LAWKER

Example

BEGIN {
    while ( getXML(ARGV[1],1) ) {
        print XTYPE, XITEM;
        for (attrName in XATTR)
            print "\t" attrName "=" XATTR[attrName];
    }
    if (XERROR) {
        print XERROR;
        exit 1;
    }
}

Details

Main function: reads the next XML item into XTYPE, XITEM, XATTR.

getXML( file, skipData ):

  • file: path to xml file
  • skipData: flag: do not read "DAT" (data between tags) sections

External variables:

  • XTYPE: type of item read, e.g. "TAG" (tag), "END" (end tag), "COM" (comment), "DAT" (data)
  • XITEM: value of item, e.g. tagname if type is "TAG" or "END"
  • XATTR: map of attributes, only set if XTYPE=="TAG"
  • XPATH: path to current tag, e.g. /TopLevelTag/SubTag1/SubTag2
  • XLINE: current line number in input file
  • XNODE: XTYPE, XITEM, XATTR combined into a single string
  • XERROR: error text, set on parse error

Returns

  • 1: on successful read; XTYPE, XITEM, XATTR are set accordingly
  • "": at end of file or on parse error; XERROR is set on error

Private Data

  • _XMLIO: buffer, XLINE, XPATH for open files

Code

function getXML( file, skipData           \
				,end,p,q,tag,att,accu,mline,mode,S0,ex,dtd) {
    XTYPE=XITEM=XERROR=XNODE=""; split("",XATTR);
    S0=_XMLIO[file,"S0"]; XLINE=_XMLIO[file,"line"]; 
	XPATH=_XMLIO[file,"path"]; dtd=_XMLIO[file,"dtd"];
    while (!XTYPE) {
        if (S0=="") { if (1!=(getline S0 <file)) break; XLINE++; S0=S0 RS; }
        if ( mode == "" ) {
            mline=XLINE; accu=""; p=substr(S0,1,1);
            if ( p!="<" && !(dtd && p=="]") )         
				mode="DAT";
            else if ( p=="]" ) 
				{ S0=substr(S0,2);  mode="DTE"; end=">"; dtd=0; }
            else if ( substr(S0,1,4)=="<!--" ) 
				{ S0=substr(S0,5);  mode="COM"; end="-->"; }
            else if ( substr(S0,1,9)=="<!DOCTYPE" ) 
                { S0=substr(S0,10); mode="DTB"; end=">"; }
            else if ( substr(S0,1,9)=="<![CDATA[" ) 
                { S0=substr(S0,10); mode="CDA"; end="]]>"; }
            else if ( substr(S0,1,2)=="<!" ) 
				{ S0=substr(S0,3);  mode="DEC"; end=">"; }
            else if ( substr(S0,1,2)=="<?" ) 
				{ S0=substr(S0,3);  mode="PIN"; end="?>"; }
            else if ( substr(S0,1,2)=="</" ) 
				{ S0=substr(S0,3);  mode="END"; end=">";
                tag=S0;sub(/[ \n\r\t>].*$/,"",tag);
				S0=substr(S0,length(tag)+1);
                ex=XPATH;sub(/\/[^\/]*$/,"",XPATH);
				ex=substr(ex,length(XPATH)+2);
                if (tag!=ex) { 
                   	XERROR="unexpected close tag <" ex ">..</" tag ">"; 
					break; } }
            else{                                     
				S0=substr(S0,2);  mode="TAG";
                tag=S0;sub(/[ \n\r\t\/>].*$/,"",tag);
				S0=substr(S0,length(tag)+1);
                if ( tag !~ /^[A-Za-z:_][0-9A-Za-z:_.-]*$/ ) { 
                    XERROR="invalid tag name '" tag "'"; break; }
                XPATH = XPATH "/" tag; } }
        else if ( mode == "DAT" ) {                            
            p=index(S0,"<"); 
			if ( dtd && (q=index(S0,"]")) && (!p || q<p) ) p=q;
            if (p) {
                if (!skipData) { XTYPE="DAT"; 
                       XITEM=accu unescapeXML(substr(S0,1,p-1)); }
                S0=substr(S0,p); mode=""; }
            else{ if (!skipData) accu=accu unescapeXML(S0); S0=""; } }
        else if ( mode == "TAG" ) {   
			sub(/^[ \n\r\t]*/,"",S0); if (S0=="") continue;
            if ( substr(S0,1,2)=="/>" ) {
                S0=substr(S0,3); mode=""; XTYPE="TAG"; 
				XITEM=tag; S0="</"tag">"S0; }
            else if ( substr(S0,1,1)==">" ) {
                S0=substr(S0,2); mode=""; XTYPE="TAG"; XITEM=tag; }
            else{
                att=S0; sub(/[= \n\r\t\/>].*$/,"",att); 
				S0=substr(S0,length(att)+1); mode="ATTR";
                if ( att !~ /^[A-Za-z:_][0-9A-Za-z:_.-]*$/ ) { 
                    XERROR="invalid attribute name '" att "'"; 
					break; } } }
        else if ( mode == "ATTR" ) {  
				sub(/^[ \n\r\t]*/,"",S0); if (S0=="") continue;
            if ( substr(S0,1,1)=="=" ) { S0=substr(S0,2); mode="EQ"; }
            else                       { XATTR[att]=att; mode="TAG"; 
                                         XNODE=XNODE att"="att"\001"; } }
        else if ( mode == "EQ" ) {    
					sub(/^[ \n\r\t]*/,"",S0); if (S0=="") continue;
            end=substr(S0,1,1);
            if ( end=="\"" || end=="'" ) {
					S0=substr(S0,2);accu="";mode="VALUE";}
            else{
                accu=S0; sub(/[ \n\r\t\/>].*$/,"",accu); 
				S0=substr(S0,length(accu)+1);
                XATTR[att]=unescapeXML(accu); mode="TAG"; 
				XNODE=XNODE att"="XATTR[att]"\001"; } }
        else if ( mode == "VALUE" ) { # terminated by end
            if ( p=index(S0,end) ) {
                XATTR[att]=accu unescapeXML(substr(S0,1,p-1)); 
				XNODE=XNODE att"="XATTR[att]"\001";
                S0=substr(S0,p+length(end)); mode="TAG"; }
            else{ accu=accu unescapeXML(S0); S0=""; } }
        else if ( mode == "DTB" ) { # terminated by "[" or ">"
            if ( (q=index(S0,"[")) && (!(p=index(S0,end)) || q<p ) ) {
                XTYPE=mode; XITEM= accu substr(S0,1,q-1); 
				S0=substr(S0,q+1); mode=""; dtd=1; }
            else if ( p=index(S0,end) ) {
                XTYPE=mode; XITEM= accu substr(S0,1,p-1); 
				S0="]"substr(S0,p); mode=""; dtd=1; }
            else{ accu=accu S0; S0=""; } }
        else if ( p=index(S0,end) ) {  # terminated by end
            XTYPE=mode; XITEM= ( mode=="END" ? tag : accu substr(S0,1,p-1) );
            S0=substr(S0,p+length(end)); mode=""; }
        else{ accu=accu S0; S0=""; } }
    _XMLIO[file,"S0"]=S0; _XMLIO[file,"line"]=XLINE; 
	_XMLIO[file,"path"]=XPATH; _XMLIO[file,"dtd"]=dtd;
    if (mode=="DAT") { mode=""; if (accu!="") XTYPE="DAT"; XITEM=accu; }
    if (XTYPE) { XNODE=XTYPE"\001"XITEM"\001"XNODE; return 1; }
    close(file);
    delete _XMLIO[file,"S0"]; delete _XMLIO[file,"line"]; 
	delete _XMLIO[file,"path"]; delete _XMLIO[file,"dtd"];
    if (XERROR) XERROR=file ":" XLINE ": " XERROR;
    else if (mode) XERROR=file ":" mline ": " "unterminated " mode;
    else if (XPATH) XERROR=file ":" XLINE ": "  "unclosed tag(s) " XPATH;
} 

Unescape data and attribute values, used by getXML.

function unescapeXML( text ) {
    gsub( "&apos;", "'",  text );
    gsub( "&quot;", "\"", text );
    gsub( "&gt;",   ">",  text );
    gsub( "&lt;",   "<",  text );
    gsub( "&amp;",  "\\&",  text );
    return text
}
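The order of the substitutions matters: the ampersand entity has to be decoded last, otherwise an escaped entity name would be decoded twice. A small self-contained sketch that repeats the function inside a shell wrapper (the wrapper name unescape_demo is made up, and the apostrophe is written as the octal escape \047 only because the awk program sits inside single quotes):

```shell
# Hypothetical wrapper: decodes the five predefined XML entities per line.
unescape_demo() {
    awk '
    function unescapeXML(text) {
        gsub("&apos;", "\047", text)
        gsub("&quot;", "\"", text)
        gsub("&gt;", ">", text)
        gsub("&lt;", "<", text)
        gsub("&amp;", "\\&", text)   # last, so "&amp;lt;" yields "&lt;", not "<"
        return text
    }
    { print unescapeXML($0) }'
}

decoded=$(printf '%s\n' 'a &lt; b &amp;&amp; &quot;c&quot; &amp;lt;' | unescape_demo)
printf '%s\n' "$decoded"
```

The replacement string "\\&" is needed because a bare & in a gsub replacement stands for the matched text.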

Close xml file

function closeXML( file ) {
    close(file);
    delete _XMLIO[file,"S0"]; delete _XMLIO[file,"line"]; 
    delete _XMLIO[file,"path"]; delete _XMLIO[file,"dtd"];
    delete _XMLIO[file,"open"]; delete _XMLIO[file,"IND"];
}

Author

Jan Weber
