These pages focus on XML tools and Awk.
A simple XML parser for awk
awk -f xmlparse.awk [FILESPEC]...
From LAWKER.
This script is a simple XML parser for (modern variants of) awk. Input in XML format is saved to two arrays, "type" and "item".
The term, "item", as used here, refers to a distinct XML element, such as a tag, an attribute name, an attribute value, or data.
The indexes into the arrays are the sequence numbers in which the items were encountered. For example, the third item's type is described by type[3], and its value is stored in item[3].
The "type" array contains the type of the item encountered for each sequence number. Types are expressed as a single word: "error" (invalid item or other error), "begin" (open tag), "attrib" (attribute name), "value" (attribute value), "end" (close tag), and "data" (data between tags). The script also records the special sections it recognizes as "cdata", "comment", "decl" (declaration), and "pi" (processing instruction).
The "item" array contains the value of the item encountered for each sequence number. For types "begin" and "end", the item value is the name of the tag. For "error", the value is the text of the error message. For "attrib", the value is the attribute name. For "value", the value is the attribute value. For "data", the value is the raw data.
WARNING: XML-quoted values ("entities") in the data and attribute values are *NOT* unquoted; they are stored as-is.
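Because entities are stored as-is, a caller that wants plain text has to unquote them itself. The following is a minimal sketch of one way to do that for the five predefined XML entities. The shell wrapper unquote_entities and the awk function unquote are hypothetical names chosen for illustration; they are not part of xmlparse.awk, and numeric character references are not handled.

```shell
# Hypothetical helper, not part of xmlparse.awk: decode the five
# predefined XML entities in each line read from standard input.
unquote_entities() {
  awk '
    function unquote(s) {
      gsub(/&lt;/,   "<",    s)
      gsub(/&gt;/,   ">",    s)
      gsub(/&quot;/, "\"",   s)
      gsub(/&apos;/, "\047", s)   # \047 is the apostrophe, written in octal
      gsub(/&amp;/,  "\\&",  s)   # done last; "\\&" yields a literal ampersand
      return s
    }
    { print unquote($0) }
  '
}

printf '%s\n' 'a &lt; b &amp;&amp; &quot;c&quot;' | unquote_entities
```

The ampersand entity is replaced last so that a doubly quoted sequence such as "&amp;lt;" decodes to the literal text "&lt;" rather than to "<".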
BEGIN {
In XML, literal "<" and ">" are only valid as tag delimiters; to include a "<" or ">" as data, they must be quoted: "&lt;" and "&gt;". So we know that if we encounter a ">", we have reached the end of a tag. This makes a convenient end-of-record marker, as the end-of-tag delimiter marks a special event, whereas a new-line is simply whitespace in XML.
RS = ">";
lineno = 1;
sptr = 0;
}
Count input lines.
{
data = $0;
lineno += gsub( /\n/, "", data );
data = "";
}
Special modes of operation. These handle special XML sections, such as literal character data containing XML meta-characters ("cdata" sections), comments, and processing instructions ("pi") for other document processors.
"Cdata" sections are terminated by the sequence, "]]>".
( mode == "cdata" ) {
if ( $0 ~ /\]\]$/ ) {
sub( /\]\]$/, "", $0 );
mode = "";
};
item[idx] = item[idx] RS $0;
next;
}
Comment sections are terminated by the sequence, "-->".
( mode == "comment" ) {
if ( $0 ~ /--$/ ) {
sub( /--$/, "", $0 );
mode = "";
};
item[idx] = item[idx] RS $0;
next;
}
Processing instruction sections are terminated by the sequence, "?>".
( mode == "pi" ) {
if ( $0 ~ /\?$/ ) {
sub( /\?$/, "", $0 );
mode = "";
};
item[idx] = item[idx] RS $0;
next;
}
( !mode ) {
mline = 0;
Our record separator is the end-of-tag marker, ">". If we've encountered an end-of-tag marker, we should have a beginning-of-tag marker ("<") somewhere in the input record. If not, either there is a spurious end-of-tag marker, or the record was terminated by the end-of-file.
p = index( $0, "<" );
Any data preceding the beginning-of-tag marker is raw data. If no beginning-of-tag marker is present, everything in the input is data.
if ( !p || ( p > 1 )) {
idx += 1;
type[idx] = "data";
item[idx] = ( p ? substr( $0, 1, ( p - 1 )) : $0 );
if ( !p ) next;
$0 = substr( $0, p );
};
Recognize special XML sections. Sections are not processed as XML, but handled specially. If the section ends with the current input record, we continue processing XML in the next record; otherwise, we enter a special mode and perform special processing.
Character data ("cdata") sections contain literal character data containing XML meta-characters that should not be processed. Character data sections begin with the sequence, "<![CDATA[" and end with "]]>". This section may span input records.
if ( $0 ~ /^<!\[[Cc][Dd][Aa][Tt][Aa]\[/ ) {
idx += 1;
type[idx] = "cdata";
$0 = substr( $0, 10 );
if ( $0 ~ /\]\]$/ ) sub( /\]\]$/, "", $0 );
else {
mode = "cdata";
mline = lineno;
};
item[idx] = $0;
next;
}
Comments begin with the sequence, "<!--" and end with "-->". This section may span input records.
else if ( $0 ~ /^<!--/ ) {
idx += 1;
type[idx] = "comment";
$0 = substr( $0, 5 );
if ( $0 ~ /--$/ ) sub( /--$/, "", $0 );
else {
mode = "comment";
mline = lineno;
};
item[idx] = $0;
next;
}
Declarations begin with the sequence, "<!" and end with ">". This section may *NOT* span input records.
else if ( $0 ~ /^<!/ ) {
idx += 1;
type[idx] = "decl";
$0 = substr( $0, 3 );
item[idx] = $0;
next;
}
Processing instructions ("pi") begin with the sequence, "<?" and end with "?>". This section may span input records.
else if ( $0 ~ /^<\?/ ) {
idx += 1;
type[idx] = "pi";
$0 = substr( $0, 3 );
if ( $0 ~ /\?$/ ) sub( /\?$/, "", $0 );
else {
mode = "pi";
mline = lineno;
};
item[idx] = $0;
next;
};
Beyond this point, we're dealing strictly with a tag.
idx += 1;
A tag that begins with "</" is a close tag: it closes a tag-enclosed block.
if ( substr( $0, 1, 2 ) == "</" ) {
type[idx] = "end";
tag = $0 = substr( $0, 3 );
}
A tag that begins simply with "<" is an open tag: it starts a tag-enclosed block. Note that a stand-alone tag (one that ends with "/>") will be handled later, and will appear as an open tag and close tag, with no data between.
else {
type[idx] = "begin";
tag = $0 = substr( $0, 2 );
};
The tag name is saved in "tag" so that we can retrieve it later should we find that the tag is stand-alone and need to save a close tag item.
sub( /[ \n\t/].*$/, "", tag );
tag = toupper( tolower( tag ));
item[idx] = tag;
Validate the tag name. If invalid, indicate so and exit.
if ( tag !~ /^[A-Za-z][-+_.:0-9A-Za-z]*$/ )
{
type[idx] = "error";
item[idx] = "line " lineno ": " tag ": invalid tag name";
exit( 1 );
}
If an open tag is encountered, its name is recorded on the stack. If a close tag is encountered, its name is compared against the name on the top of the stack. If the names differ, an error is generated (XML does not allow overlapping tags).
if ( type[idx] == "begin" ) {
sptr += 1;
lstack[sptr] = lineno;
tstack[sptr] = tag;
}
else if ( type[idx] == "end" ) {
if ( tag != tstack[sptr] ) {
type[idx] = "error";
item[idx] = "line " lineno ": " tag \
": unexpected close tag, expecting " \
tstack[sptr];
exit( 1 );
};
delete tstack[sptr];
sptr -= 1;
};
sub( /[^ \n\t/]*[ \n\t]*/, "", $0 );
Beyond this point, we're dealing with the tag attributes, if any, and/or the stand-alone end-of-tag marker.
while ( $0 ) {
If $0 contains only a slash (/), then the tag we're processing is stand-alone (it ended with "/>"), so we generate a close tag, but no data between the open and close tags.
if ( $0 == "/" )
{
idx += 1;
type[idx] = "end";
item[idx] = tag;
delete lstack[sptr];
delete tstack[sptr];
sptr -= 1;
break;
};
The attribute name is determined. Note that the attribute name is also saved to "attrib" so that we can reference it should the attribute not include a value. If the attribute does not include a value, its name is given as its value.
idx += 1;
type[idx] = "attrib";
attrib = $0;
sub( /=.*$/, "", attrib );
attrib = tolower( attrib );
item[idx] = attrib;
Validate the attribute name. If invalid, indicate so and exit.
if ( attrib !~ /^[A-Za-z][-+_0-9A-Za-z]*$/ )
{
type[idx] = "error";
item[idx] = "line " lineno ": " attrib \
": invalid attribute name";
exit( 1 );
}
sub( /^[^=]*/, "", $0 );
Each attribute must have a value. If one isn't explicit in the input, we assign it one equal to the name of the attribute itself. Attribute values in the input may be in one of three forms: enclosed in double quotes ("), enclosed in single quotes/apostrophes ('), or a single word.
idx += 1;
type[idx] = "value";
if ( substr( $0, 1, 1 ) == "=" ) {
if ( substr( $0, 2, 1 ) == "\"" ) {
item[idx] = substr( $0, 3 );
sub( /".*$/, "", item[idx] );
sub( /^="[^"]*"/, "", $0 );
}
else if ( substr( $0, 2, 1 ) == "'" ) {
item[idx] = substr( $0, 3 );
sub( /'.*$/, "", item[idx] );
sub( /^='[^']*'/, "", $0 );
}
else {
item[idx] = substr( $0, 2 );
sub( /[ \n\t/].*$/, "", item[idx] );
sub( /^=[^ \n\t/]*/, "", $0 );
};
}
else item[idx] = attrib;
sub( /^[ \n\t]*/, "", $0 );
};
attrib = "";
tag = "";
next;
}
END {
If mode is defined, the input stream ended without terminating an XML section. Thus, the input contains invalid XML.
if ( mode ) {
idx += 1;
type[idx] = "error";
if ( mode == "cdata" ) mode = "character data";
else if ( mode == "pi" ) mode = "processing instruction";
item[idx] = "line " mline ": unterminated " mode;
};
If an open tag occurred with no corresponding close tag, we have invalid XML.
for ( n = sptr; n; n -= 1 ) {
idx += 1;
type[idx] = "error";
item[idx] = "line " lstack[n] ": " \
tstack[n] ": unclosed tag";
};
}
The following simple examples demonstrate the use of the accumulated data from the XML input stream.
END {
If errors occurred, generate appropriate messages and exit without further processing.
if ( type[idx] == "error" ) {
for ( n = idx; n && ( type[n] == "error" ); n -= 1 );
for ( n += 1; n <= idx; n += 1 ) print "ERROR:", item[n];
exit 1;
};
# Print simplified XML. If output completes successfully and the stack
# is not empty, close tags are generated for each tag on the stack.
# in_tag = 0;
#
# for ( n = 1; n <= idx; n += 1 ) {
#
# if ( type[n] == "attrib" ) printf( " %s", item[n] );
#
# else if ( type[n] == "begin" ) {
# if ( in_tag ) printf( ">" );
# else in_tag = 1;
# printf( "<%s", item[n] );
# }
#
# else if ( type[n] == "cdata" ) {
# if ( in_tag ) {
# printf( ">" );
# in_tag = 0;
# };
# printf( "<![CDATA[%s]]>", item[n] );
# }
#
# else if ( type[n] == "comment" ) {
# if ( in_tag ) {
# printf( ">" );
# in_tag = 0;
# };
# printf( "<!--%s-->", item[n] );
# }
#
# else if ( type[n] == "data" ) {
# if ( in_tag ) {
# printf( ">" );
# in_tag = 0;
# };
# printf( "%s", item[n] );
# }
#
# else if ( type[n] == "decl" ) {
# if ( in_tag ) {
# printf( ">" );
# in_tag = 0;
# }
# printf( "<!%s>", item[n] );
# }
#
# else if ( type[n] == "end" ) {
# if ( in_tag ) {
# printf( "/>" );
# in_tag = 0;
# }
# else printf( "</%s>", item[n] );
# }
#
# else if ( type[n] == "error" ) {
# if ( in_tag ) {
# printf( ">" );
# in_tag = 0;
# };
# print "";
# print "<!-- ERROR:", item[n], "-->";
# break;
# }
#
# else if ( type[n] == "pi" ) {
# if ( in_tag ) {
# printf( ">" );
# in_tag = 0;
# };
# printf( "<?%s?>", item[n] );
# }
#
# else if ( type[n] == "value" ) {
# if ( item[n] ~ /"/ ) printf( "='%s'", item[n] );
# else printf( "=\"%s\"", item[n] );
# };
# };
#
# if ( in_tag ) printf( ">" );
#
# for ( n = sptr; n; n -= 1 ) printf( "</%s>", tstack[n] );
# Print an object tree, identifying tags and attributes. Nesting is
# emphasized by indenting.
# indent = "";
# for ( n = 1; n <= idx; n += 1 ) {
# if ( type[n] == "attrib" ) print indent "attrib", item[n];
# else if ( type[n] == "begin" ) {
# print indent "begin", item[n];
# indent = indent " ";
# }
# else if ( type[n] == "end" ) {
# indent = substr( indent, 3 );
# print indent "end", item[n];
# }
# else if ( type[n] == "error" ) print "ERROR:", item[n];
# else print indent type[n];
# };
Print in a linear format suitable for parsing by shell scripts. Multi-line values have the new-lines replaced with the character sequence, "\n" (backslash, n) to ensure the entire name/value pair occurs on a single line. All occurrences of backslashes (\) in the original value are themselves backslash quoted.
for ( n = 1; n <= idx; n += 1 ) {
value = item[n];
gsub( /\\/, "\\\\", value );
gsub( /\n/, "\\n", value );
print type[n], value;
};
for ( n = sptr; n; n -= 1 ) print "end", tstack[n];
Print attribute values and data in a linear format suitable for searching (e.g. with grep). Attributes are represented as:
[TAG/]...TAG/ATTRIB=VALUE
Data is represented as:
[TAG/]...TAG: DATA
Note that all tag names are displayed in upper-case. All attribute names are displayed in lower-case.
Multi-line values have the new-lines replaced with the character sequence, "\n" (backslash, n) to ensure the entire name/value pair occurs on a single line. All occurrences of backslashes (\) in the original value are themselves backslash quoted.
# sptr = 0;
# for ( n = 1; n <= idx; n += 1 ) {
# if ( type[n] == "attrib" ) {
# lead = stack[1];
# for ( m = 2; m <= sptr; m += 1 ) \
# lead = lead "/" stack[m];
# lead = lead "/" item[n] "=";
# }
# else if ( type[n] == "begin" ) stack[++sptr] = item[n];
# else if (( type[n] == "cdata" ) || ( type[n] == "data" )) {
# lead = stack[1];
# for ( m = 2; m <= sptr; m += 1 ) \
# lead = lead "/" stack[m];
# lead = lead ": ";
# }
# else if ( type[n] == "end" ) sptr -= 1;
# if (( type[n] == "data" ) || ( type[n] == "value" )) {
# value = item[n];
# gsub( /\\/, "\\\\", value );
# gsub( /\n/, "\\n", value );
# print lead value;
# };
# };
}
Steve Coile
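The end-of-record trick that the script relies on can be tried in isolation. The following is a minimal sketch, not taken from xmlparse.awk, that only sets RS to ">" and reports what each record contains: any data in front of the "<", and the tag text after it. The function name split_records is hypothetical, and the sketch ignores comments, CDATA sections, and processing instructions.

```shell
# Hypothetical demo of the RS = ">" idea: one record per tag,
# with any preceding character data at the front of the record.
split_records() {
  awk '
    BEGIN { RS = ">" }
    {
      p = index($0, "<")
      data = (p ? substr($0, 1, p - 1) : $0)
      if (data !~ /^[ \t\n]*$/) print "data: " data   # skip whitespace-only data
      if (p) print "tag: " substr($0, p + 1)
    }
  '
}

printf '<a x="1"><b>hi</b></a>\n' | split_records
```

For this input the records are '<a x="1"', '<b', 'hi</b', '</a', and a trailing new-line, which is why the data "hi" arrives in the same record as the close tag that follows it.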
Editor's note:
Programmers often take awk "as is", never thinking to use it as a lab in which
they can explore other language extensions.
An alternate approach is to treat the Awk code base as a reusable library
of parsers, regular expression engines, and so on, and to make modifications
to the language. This second approach is taken in the Awk A*
project and, as shown here, in XMLgawk.
IMHO,
XMLgawk is one of the most exciting new innovations
seen in Gawk for many years.
It shows that Awk is more than "just" a text processor: rather
it is also a candidate technology for modern XML-based web applications.
Extends standard gawk with built-in XML processing.
Main developers: Jurgen Kahrs and Andrew Schorr.
Conceptual guidance: Manuel Collado.
MS Windows build expert: Victor Paeza.
Contributor of ideas for new features: Peter Saveliev.
XML processing, plus libraries for other extensions to Gawk.
XMLgawk is an experimental extension of the GNU Awk interpreter. It includes a small XML parsing library which is built upon the Expat XML parser. The parsing library is a very thin layer on top of Expat (implementing a pull-interface) and can also be used without GNU Awk to read XML data files.
Both XMLgawk and its XML puller library require only an ANSI C compatible compiler (GCC works, as do most vendors' ANSI C compilers) and a 'make' program.
XMLgawk provides the following functionality including:
3=Released
3=Free/public domain.
November 2003.
April 28, 2009.
After some hard work I seem to be able to build XMLgawk for native Windows :-). Jurgen, Victor and Manuel: thanks for all the tips!
If you're interested, have a look at http://www.wimdows.info/project/xgawk and have fun.
-- Wim van Blitterswijk
(This page comes from the XML Gawk tutorial.)
One of the advantages of using the XML format for storing data is that there are formalized methods of checking correctness of the data. Whether the data is written by hand or generated automatically, it is always advantageous to have tools for finding out if the new data obeys certain rules (is a tag misspelt? another one missing? a third one in the wrong place?).
These mechanisms for checking correctness are applied at different levels, the lowest being well-formedness. The next higher levels of correctness check are the DTD and (even higher, but not yet required by standards) the Schema. If you have a DTD (or Schema) specification for your XML file, you can hand it over to a validation tool, which applies the specification, checks for conformance and tells you the result. A simple tool for validation against a DTD is xmllint, which is part of libxml and therefore installed on most GNU/Linux systems. Validation against a Schema can be done with more recent versions of xmllint or with the xsv tool.
There are two reasons why validation is currently not incorporated into the gawk interpreter.
@load xml
END {
if (XMLERROR)
printf("XMLERROR '%s' at row %d col %d len %d\n",
XMLERROR, XMLROW, XMLCOL, XMLLEN)
else
print "file is well-formed"
}
As usual, the script starts with switching gawk into XML mode. We are not interested in the content of the nodes being traversed, therefore we have no action to be triggered for a node. Only at the end (when the XML file is already closed) do we look at some variables reporting success or failure. If the variable XMLERROR ever contains anything other than 0 or the empty string, there is an error in parsing and the parser will stop tree traversal at the place where the error is. An explanatory message is contained in XMLERROR (whose contents depend on the specific parser used on this platform). The other variables in the example contain the line number and the column in which the XML file is formed badly.
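Readers without xgawk can approximate one small part of this check in standard awk. The sketch below is not a well-formedness checker and is not part of the tutorial: it only verifies that open and close tags nest properly, reusing the RS = ">" idea from xmlparse.awk. The function name check_balance is hypothetical, and the sketch assumes that comments, CDATA sections, and attribute values contain no "<" or ">" characters.

```shell
# Hypothetical, simplified tag-balance check in standard awk.
# Prints "balanced", "mismatch: NAME" or "unclosed: NAME";
# the exit status is 0 only when the tags are balanced.
check_balance() {
  awk '
    BEGIN { RS = ">"; sp = 0; bad = 0 }
    {
      p = index($0, "<")
      if (p == 0) next
      t = substr($0, p + 1)
      if (t ~ /^[?!]/) next              # skip PIs, comments, declarations
      if (t ~ /\/$/) next                # stand-alone tag: nothing to balance
      closing = (substr(t, 1, 1) == "/")
      if (closing) t = substr(t, 2)
      sub(/[ \t\n].*$/, "", t)           # keep the tag name only
      if (closing) {
        if (sp == 0 || stack[sp] != t) { print "mismatch: " t; bad = 1; exit }
        sp--
      }
      else stack[++sp] = t
    }
    END {
      if (!bad && sp > 0) { print "unclosed: " stack[sp]; bad = 1 }
      if (!bad) print "balanced"
      exit bad
    }
  '
}

printf '<a><b>x</b><br/></a>' | check_balance
```

A real parser such as the Expat library used by XMLgawk checks far more than nesting (attribute syntax, entity references, character encoding), so a "balanced" result here says nothing about those.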
Jurgen Kahrs
Copyright (C) 2000, 2001, 2002, 2004, 2005, 2006, 2007 Free Software Foundation, Inc.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with the Invariant Sections being "GNU General Public License", the Front-Cover texts being (1) (see below), and with the Back-Cover Texts being (2) (see below).
(This page comes from the XML Gawk tutorial.)
The declaration of a document type in the header of an XML file is an optional part of the data, not a mandatory one. If such a declaration is present, the reference to the DTD will not be resolved and its contents will not be parsed. However, the presence of the declaration will be reported by gawk. When the declaration starts, the variable XMLSTARTDOCT contains the name of the root element's tag; and later, when the declaration ends, the variable XMLENDDOCT is set to 1. In between, the array variable XMLATTR will be populated with the values of the public identifier of the DTD (if any) and the value of the system's identifier of the DTD (if any). Other parts of the declaration (elements, attributes and entities) will not be reported.
@load xml
XMLDECLARATION {
version = XMLATTR["VERSION" ]
encoding = XMLATTR["ENCODING" ]
standalone = XMLATTR["STANDALONE" ]
}
XMLSTARTDOCT {
root = XMLSTARTDOCT
pub_id = XMLATTR["PUBLIC" ]
sys_id = XMLATTR["SYSTEM" ]
intsubset = XMLATTR["INTERNAL_SUBSET"]
}
XMLENDDOCT {
print FILENAME
print " version '" version "'"
print " encoding '" encoding "'"
print " standalone '" standalone "'"
print " root id '" root "'"
print " public id '" pub_id "'"
print " system id '" sys_id "'"
print " intsubset '" intsubset "'"
print ""
version = encoding = standalone = ""
root = pub_id = sys_id = intsubset = ""
}
Most users can safely ignore these variables if they are only interested in the data itself. But some users may take advantage of these variables for checking requirements of the XML data. If your database consists of thousands of XML files of diverse origins, the public identifiers of their DTDs will help you gain an overview of the kind of data you have to handle and of potential version conflicts. The script shown above will assist you in analyzing your data files. It searches for the variables mentioned above and evaluates their content. At the start of the DTD, the tag name of the root element is stored; the identifiers are also stored and finally, those values are printed along with the name of the file which was analyzed. After each DTD, the remembered values are set to an empty string until the DTD of the next file arrives.
In the following, you can see an example output of
the script shown above. Obviously, the first
entry is a DocBook file (English version 4.2) containing a
book element which has to be validated against a local
copy of the DTD at CERN in Switzerland. The second file is a
chapter element of DocBook (English version 4.1.2) to
be validated against a DTD on the Internet. Finally, the third
entry is a file describing a project of the GanttProject application.
There is only a tag name for the root element specified, a DTD
does not seem to exist.
data/dbfile.xml
version ''
encoding ''
standalone ''
root id 'book'
public id '-//OASIS//DTD DocBook XML V4.2//EN'
system id '/afs/cern.ch/sw/XML/XMLBIN/share/www.oasis-open.org/docbook/xmldtd-4.2/docbookx.dtd'
intsubset ''
data/docbook_chapter.xml
version ''
encoding ''
standalone ''
root id 'chapter'
public id '-//OASIS//DTD DocBook XML V4.1.2//EN'
system id 'http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd'
intsubset ''
data/exampleGantt.gan
version '1.0'
encoding 'UTF-8'
standalone ''
root id 'ganttproject.sourceforge.net'
public id ''
system id ''
intsubset ''
You may wish to make changes to this script if you need it in daily work. For example, the script currently reports nothing for files which have no DTD declaration in them. You can easily change this by appending an action for the END rule which reports in case all the variables root, pub_id and sys_id are empty. As it is, the script parses the entire XML file, although the DTD is always positioned at the top, before the root element. Parsing the root element is unnecessary and you can improve the speed of the script significantly if you tell it to stop parsing when the first element (the root element) comes in.
XMLSTARTELEM { nextfile }
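For readers without xgawk, the same three pieces of information (root name, public identifier, system identifier) can often be pulled from a simple DOCTYPE line with standard awk. The following is a rough sketch under several assumptions: the identifiers are enclosed in double quotes, the declaration has no internal subset, and the function name doctype_ids is hypothetical. Like the nextfile rule above, it stops reading as soon as it has what it needs.

```shell
# Hypothetical sketch: extract root name and DTD identifiers from the
# first DOCTYPE declaration on standard input, using standard awk only.
doctype_ids() {
  awk '
    BEGIN { RS = ">" }
    {
      p = index($0, "<!DOCTYPE")
      if (p == 0) next
      decl = substr($0, p + 9)      # text after "<!DOCTYPE"
      split(decl, q, "\"")          # q[2] and q[4] hold the quoted identifiers
      split(q[1], w, " ")           # w[1] is the root name, w[2] the keyword
      print "root=" w[1]
      if (w[2] == "PUBLIC") { print "public=" q[2]; print "system=" q[4] }
      else if (w[2] == "SYSTEM") print "system=" q[2]
      exit                          # nothing after the DOCTYPE is needed
    }
  '
}

printf '<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">\n<chapter/>\n' | doctype_ids
```

Unlike the XMLgawk script, this sketch reports nothing about the XML declaration (version, encoding, standalone) and will misread declarations that use single quotes or contain an internal subset.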
Jurgen Kahrs
(This page comes from the XML Gawk tutorial.)
When working with XML files, it is sometimes necessary to gain an overview of the structure of an XML file. Ordinary editors confront us with a view that is not so pretty. For example:
<book id="hello-world" lang="en">
<bookinfo>
<title>Hello, world</title>
</bookinfo>
<chapter id="introduction">
<title>Introduction</title>
<para>This is the introduction. It has two sections</para>
<sect1 id="about-this-book">
<title>About this book</title>
<para>This is my first DocBook file.</para>
</sect1>
<sect1 id="work-in-progress">
<title>Warning</title>
<para>This is still under construction.</para>
</sect1>
</chapter>
</book>
Software developers are used to reading text files with proper indentation like this:
book lang='en' id='hello-world'
  bookinfo
    title
  chapter id='introduction'
    title
    para
    sect1 id='about-this-book'
      title
      para
    sect1 id='work-in-progress'
      title
      para
Here, it is a bit harder to recognize hierarchical dependencies among the nodes. But proper indentation allows you to oversee files with more than 100 elements (a purely graphical view of such large files gets unbearable).
The outline tool produces such an indented output
and we will now write a script that imitates this kind
of output.
@load xml
XMLSTARTELEM {
printf("%*s%s", 2*XMLDEPTH-2, "", XMLSTARTELEM)
for (i=1; i<=NF; i++)
printf(" %s='%s'", $i, XMLATTR[$i])
print ""
}
For the first time, we don't
just check if the XMLSTARTELEM variable contains
a tag name, but we also print the name out, properly indented
with a printf format statement (two blank characters
for each indentation level).
Note the use of the
associative
array XMLATTR. Whenever we enter a markup block
(and XMLSTARTELEM is non-empty), the array XMLATTR
contains all the attributes of the tag. You can find out the
value of an attribute by accessing the array with the attribute's
name as an array index. In a well-formed XML file, all the attribute
names of one tag are distinct, so we can be sure that each attribute
has its own place in the array. The only thing that's left to do is
to iterate over all the entries in the array and print name and value
in a formatted way. Earlier versions of this script really iterated
over the associative array with the for (i in XMLATTR)
loop. Doing so is still an option, but in this case we wanted to
make sure that attributes are printed in exactly the same order
that is given in the original XML data. The exact order of attribute
names is reproduced in the fields $1 .. $NF. So the
for loop can iterate over the attribute names in the
fields $1 .. $NF and print the attribute values
XMLATTR[$i].
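The depth-based indentation can also be imitated in standard awk, which may help readers who want to experiment before installing xgawk. The sketch below is an approximation and not part of the tutorial: it keeps its own depth counter in place of XMLDEPTH, prints element names only (no attributes), and assumes that comments and attribute values contain no "<" or ">". The function name outline_plain is hypothetical.

```shell
# Hypothetical sketch: indented outline of element names, two blanks
# per nesting level, using RS = ">" and a hand-kept depth counter.
outline_plain() {
  awk '
    BEGIN { RS = ">"; depth = 0 }
    {
      p = index($0, "<")
      if (p == 0) next
      t = substr($0, p + 1)
      if (t ~ /^[?!]/) next                       # PIs, comments, declarations
      if (substr(t, 1, 1) == "/") { depth--; next }
      name = t
      sub(/[ \t\n\/].*$/, "", name)               # strip attributes and "/"
      indent = ""
      for (i = 0; i < depth; i++) indent = indent "  "
      print indent name
      if (t !~ /\/$/) depth++                     # stand-alone tags do not nest
    }
  '
}

printf '<book id="hello-world"><bookinfo><title>Hello, world</title></bookinfo></book>' | outline_plain
```

The indentation is built with a loop rather than the "%*s" format used in the XMLgawk script, on the assumption that not every awk implementation supports the "*" width specifier.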
Jurgen Kahrs
(This page comes from the XML Gawk tutorial.)
In a procedural language, the software developer expects that he himself determines control flow within a program. He writes down what has to be done first, second, third and so on. In the pattern-action model of AWK, the novice software developer often has the oppressive feeling that
This feeling is characteristic of a whole class of programming environments. Most people would never think that the following programming environments have something in common, but they do. It is the absence of a static control flow which unites these environments under one roof:
lex and yacc
tools, the main program only invokes a function yyparse()
and the exact control flow depends on the input source which
controls invocation of certain rules.
Within the context of XML, a terminology has been invented which distinguishes the procedural pull style from the event-guided push style. The script in the previous section was an example of a push-style script. Recognizing that most developers don't like their program's control flow to be pushed around, we will now present a script which pulls one item after the other from the XML file and decides what to do next in a more obvious way.
@load xml
BEGIN {
while (getline > 0) {
switch (XMLEVENT) {
case "STARTELEM": {
printf("%*s%s", 2*XMLDEPTH-2, "", XMLSTARTELEM)
for (i=1; i<=NF; i++)
printf(" %s='%s'", $i, XMLATTR[$i])
print ""
}
}
}
}
One XML event after the other is pulled out of the data
with the getline command. It's like feeling each grain
of sand pour through your fingers. Users who prefer this style
of reading input will also appreciate another novelty: The variable
XMLEVENT. While the push-style script in
another page used the event-specific variable
XMLSTARTELEM to detect the occurrence of a new XML element,
our pull-style script always looks at the value of the same
universal variable XMLEVENT to detect a new XML element.
Formally, we have a script that consists of one BEGIN
pattern followed by an action which is always invoked. You
see, this is a corner case of the pattern-action model
which has been reduced so far that its essence has disappeared.
Instead of the patterns you now see the cases of a switch
statement, embedded into a while loop (for reading the
file item-wise).
Obviously, we have explicit conditionals now, instead of the
implicit ones we used formerly. The actions invoked within
the case conditions are the same we have seen in the
push approach.
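The pull style itself does not depend on XMLgawk: standard awk can also read its input explicitly with getline inside a BEGIN action. The following sketch, which is not from the tutorial, shows that control flow with plain records split at ">" standing in for XML events. The function name pull_events is hypothetical, a stand-alone tag is reported as a start event only, and comments containing "<" or ">" are not handled.

```shell
# Hypothetical sketch of pull-style reading in standard awk: a single
# BEGIN action fetches one record at a time and decides what to do.
pull_events() {
  awk '
    BEGIN {
      RS = ">"
      while ((getline rec) > 0) {           # pull the next record
        p = index(rec, "<")
        if (p == 0) continue
        t = substr(rec, p + 1)
        if (t ~ /^[?!]/) continue           # PIs, comments, declarations
        if (substr(t, 1, 1) == "/") print "end " substr(t, 2)
        else {
          sub(/[ \t\n\/].*$/, "", t)        # keep the element name only
          print "start " t
        }
      }
    }
  ' "$@"
}

printf '<a><b>x</b></a>' | pull_events
```

Because the program consists of a BEGIN action only, awk never enters its implicit read loop; every record is consumed by the explicit getline call, which is the essence of the pull approach described above.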
Jurgen Kahrs
Displays components within a set of named XML files. With no options, displays the XML files much like the cat command. When options are supplied, displays only the selected components.
Editor's note: for those who do not want to take the plunge into xgawk, dumpxml shows that standard Awk can process XML. For a discussion of this file, see comp.lang.awk.
dumpxml [-cdit] xmlfile ...
This code requires awk and ksh. To download:
wget http://lawker.googlecode.com/svn/fridge/lib/ksh/dumpxml
chmod +x dumpxml
One reason I have a distinct loathing for XML, esp. in configuration files, is it's very difficult to parse (with line-based editors) and it's not very readable either. In my book, this breaks both of the fundamental tests for a useable configuration standard .... whoever first thought XML was a good idea for anything except document mark-up should be shot (steps off soap box before he gets lynched for posting off-topic).
Anyway, personal grievances aside, here's a script I was forced to write, unhappy and at gun-point, to try and make some XML files I was dealing with more readable. This demonstrates how much work it takes in AWK just to parse the structure alone. This doesn't even take into consideration reading attribute values or parsing DTDs.
The next person who thinks it's a good idea to write a configuration file in XML will have to personally answer to my wrath ........ perhaps I should set-up a new website banxml.org or xmlboycott.com with the sole intent to make the world see reason. Anyone with me? :-)
#!/bin/ksh
CALL=$(basename $0)
USAGE="Syntax: $CALL [-cdit] xmlfile ..."
Displays selected components of a named XML file. Arguments:
DisplayXML()
{
nawk -v shdoc=$1 -v shtags=$2 -v shcomm=$3 -v indent=$4 '
{
pushline=levhigh=0
### If indenting strip any leading blanks from input
CloseFlags()
if (indent && !comment) sub("^[ ][ ]*","")
### Strip carriage returns
gsub("\\r","")
### Scan line one character at a time
for (c=1;c<=length($0);c++)
{
CloseFlags()
ReadChars()
DisplayChars()
}
if (newline)
{
print ""
newline=0
}
}
function CloseFlags()
{
if (comment==2) comment=0 # close comment
if (tag==2) tag=0 # close tag
if (quotes==2) quotes=0 # close quote
}
function ReadChars()
{
ch=substr($0,c,1)
if (!comment)
{
if (ch=="<" && substr($0,c,4)=="<!--")
{
comment=1 # opening comment
ch=substr($0,c,4) # stretch chars
c+=3
}
else if (!tag && ch=="<")
{
tag=1 # opening tag
### Increase or decrease indent depending
### on tag style <tag> or </tag>
### but not <?tag?> or <!tag>
ch2=substr($0,c,2)
if (ch2=="</") level--
else if (ch2!="<?" && ch2!="<!")
{
level++
levhigh=1
}
}
else if (tag)
{
if (!quotes && ch=="\"") quotes=1 # opening quote
else if (quotes && ch=="\"") quotes=2 # closing
else if (!quotes && ch==">")
{
tag=2 # closing tag
### Catch <tag/> style where
### indent level should not change
if (c>1 && substr($0,c-1,2)=="/>") level--
}
}
}
else
{
if (ch=="-" && substr($0,c,3)=="-->")
{
comment=2 # closing comment
ch=substr($0,c,3) # stretch chars
c+=2
}
}
}
function DisplayChars()
{
### Work out whether to display this character or not
dispch=0
if (comment && shcomm) dispch=1
if (tag && shtags) dispch=1
if (!comment && !tag && shdoc) dispch=1
if (dispch)
{
if (indent) IndentLine()
printf("%s",ch)
if (!newline) newline=1
}
}
function IndentLine()
{
if (pushline || comment) return
pushline=1
### Have begun processing first tag so indent level
### may already be one level too high
if ((thislevel=(levhigh?level-1:level))<0) thislevel=0
for (lev=0;lev<thislevel;lev++) printf(" ")
}' "$5"
}
comments=0
doc=0
indent=0
tags=0
help=0
while getopts cdit c
do
    case $c in
        c) comments=1;;
        d) doc=1;;
        i) indent=1;;
        t) tags=1;;
        ?) help=1;;
    esac
done
shift $(($OPTIND - 1))
### Display help message
if [ $help -eq 1 -o $# -eq 0 ]; then
    cat << EOF
Displays components within a set of named XML files.
With no options, displays the XML files much like the cat command.
When options are supplied, displays only the selected components.
$USAGE
where -c displays comments
      -d displays document contents
      -i indents properly
      -t displays tags
EOF
    exit 2
fi
### If no options supplied, then display entire XML files
if [ $comments -eq 0 -a $doc -eq 0 -a $tags -eq 0 ]; then
    comments=1
    doc=1
    tags=1
fi
first=1
while [ $# -gt 0 ]
do
    if [ $first -eq 1 ]; then
        first=0
    else
        echo " " ### this should be Ctrl+L for a form-feed
    fi
    echo "<!-- --- $1 --- -->"
    DisplayXML $doc $tags $comments $indent "$1"
    shift
done
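
The indent-level rules used by ReadChars() can be tried in isolation. The following is only a minimal sketch and not part of the script above: the function name xml_depth is hypothetical, it assumes every tag starts and ends on one line, and it ignores comments and quoted ">" characters. It prints each tag with its nesting depth, following the same rules: <tag> steps in, </tag> steps out, <?tag?> and <!tag> leave the level alone, and <tag/> causes no net change.

```shell
# Hypothetical helper, not from the original script.
# Reads XML on stdin and prints "depth tag" for every tag found.
xml_depth() {
    awk '
    BEGIN { level = 0 }
    {
        line = $0
        # match() sets RSTART and RLENGTH for the next <...> on the line
        while (match(line, /<[^>]*>/)) {
            t = substr(line, RSTART, RLENGTH)
            line = substr(line, RSTART + RLENGTH)
            two = substr(t, 1, 2)
            if (two == "</") {
                level--                  # close tag: step out, then print
                print level, t
            } else if (two == "<?" || two == "<!") {
                print level, t           # declarations do not nest
            } else if (substr(t, length(t) - 1) == "/>") {
                print level, t           # empty element: no net change
            } else {
                print level, t           # open tag: print, then step in
                level++
            }
        }
    }'
}

printf '%s\n' '<?xml version="1.0"?>' '<a><b/>' '<c>x</c></a>' | xml_depth
```

Printing an open tag before incrementing plays the same role as the levhigh adjustment in IndentLine(): the tag that raises the level is still shown at the old depth.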
Mark R.Bannister <markb at freedomware.co.uk>.
gawk -f getXML.awk
Download from LAWKER.
getXML: main function; reads the next XML data into XTYPE, XITEM, XATTR.
unescapeXML: unescapes data and attribute values; used by getXML.
closeXML: closes the XML file.
Jan Weber Download
Example
BEGIN {
while ( getXML(ARGV[1],1) ) {
print XTYPE, XITEM;
for (attrName in XATTR)
print "\t" attrName "=" XATTR[attrName];
}
if (XERROR) {
print XERROR;
exit 1;
}
}
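
Besides XTYPE, XITEM and XATTR, the getXML code on this page also builds a string XNODE: the type, the item, and each attribute as name=value, every part followed by the character \001. The sketch below shows one way such a string could be taken apart again. It is an illustration only: the function name xnode_fields is hypothetical, and the sample value is hand-built to mimic what the code would produce for <a href="x" id="1">.

```shell
# Hypothetical helper, not part of getXML.awk.
# $1 is an XNODE-style string; awk -v turns the \001 escapes into
# real control characters before the program runs.
xnode_fields() {
    awk -v node="$1" 'BEGIN {
        n = split(node, part, "\001")
        print "type=" part[1], "item=" part[2]
        # parts 3..n-1 are name=value pairs; the last part is empty
        # because the string ends with a separator
        for (i = 3; i < n; i++) {
            eq = index(part[i], "=")
            print substr(part[i], 1, eq - 1) " -> " substr(part[i], eq + 1)
        }
    }'
}

xnode_fields 'TAG\001a\001href=x\001id=1\001'
```

Splitting at the first "=" is safe here because the attribute-name check in getXML does not allow "=" in a name, while a value may contain one.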
Details
getXML( file, skipData ):
External variables:
Returns
Private Data
Code
function getXML( file, skipData \
                 ,end,p,q,tag,att,accu,mline,mode,S0,ex,dtd) {
    XTYPE=XITEM=XERROR=XNODE=""; split("",XATTR);
    S0=_XMLIO[file,"S0"]; XLINE=_XMLIO[file,"line"];
    XPATH=_XMLIO[file,"path"]; dtd=_XMLIO[file,"dtd"];
    while (!XTYPE) {
        if (S0=="") { if (1!=(getline S0 <file)) break; XLINE++; S0=S0 RS; }
        if ( mode == "" ) {
            mline=XLINE; accu=""; p=substr(S0,1,1);
            if ( p!="<" && !(dtd && p=="]") )
                mode="DAT";
            else if ( p=="]" )
                { S0=substr(S0,2); mode="DTE"; end=">"; dtd=0; }
            else if ( substr(S0,1,4)=="<!--" )
                { S0=substr(S0,5); mode="COM"; end="-->"; }
            else if ( substr(S0,1,9)=="<!DOCTYPE" )
                { S0=substr(S0,10); mode="DTB"; end=">"; }
            else if ( substr(S0,1,9)=="<![CDATA[" )
                { S0=substr(S0,10); mode="CDA"; end="]]>"; }
            else if ( substr(S0,1,2)=="<!" )
                { S0=substr(S0,3); mode="DEC"; end=">"; }
            else if ( substr(S0,1,2)=="<?" )
                { S0=substr(S0,3); mode="PIN"; end="?>"; }
            else if ( substr(S0,1,2)=="</" )
                { S0=substr(S0,3); mode="END"; end=">";
                  tag=S0;sub(/[ \n\r\t>].*$/,"",tag);
                  S0=substr(S0,length(tag)+1);
                  ex=XPATH;sub(/\/[^\/]*$/,"",XPATH);
                  ex=substr(ex,length(XPATH)+2);
                  if (tag!=ex) {
                      XERROR="unexpected close tag <" ex ">..</" tag ">";
                      break; } }
            else{
                S0=substr(S0,2); mode="TAG";
                tag=S0;sub(/[ \n\r\t\/>].*$/,"",tag);
                S0=substr(S0,length(tag)+1);
                if ( tag !~ /^[A-Za-z:_][0-9A-Za-z:_.-]*$/ ) {
                    XERROR="invalid tag name '" tag "'"; break; }
                XPATH = XPATH "/" tag; } }
        else if ( mode == "DAT" ) {
            p=index(S0,"<");
            if ( dtd && (q=index(S0,"]")) && (!p || q<p) ) p=q;
            if (p) {
                if (!skipData) { XTYPE="DAT";
                    XITEM=accu unescapeXML(substr(S0,1,p-1)); }
                S0=substr(S0,p); mode=""; }
            else{ if (!skipData) accu=accu unescapeXML(S0); S0=""; } }
        else if ( mode == "TAG" ) {
            sub(/^[ \n\r\t]*/,"",S0); if (S0=="") continue;
            if ( substr(S0,1,2)=="/>" ) {
                S0=substr(S0,3); mode=""; XTYPE="TAG";
                XITEM=tag; S0="</"tag">"S0; }
            else if ( substr(S0,1,1)==">" ) {
                S0=substr(S0,2); mode=""; XTYPE="TAG"; XITEM=tag; }
            else{
                att=S0; sub(/[= \n\r\t\/>].*$/,"",att);
                S0=substr(S0,length(att)+1); mode="ATTR";
                if ( att !~ /^[A-Za-z:_][0-9A-Za-z:_.-]*$/ ) {
                    XERROR="invalid attribute name '" att "'";
                    break; } } }
        else if ( mode == "ATTR" ) {
            sub(/^[ \n\r\t]*/,"",S0); if (S0=="") continue;
            if ( substr(S0,1,1)=="=" ) { S0=substr(S0,2); mode="EQ"; }
            else { XATTR[att]=att; mode="TAG";
                XNODE=XNODE att"="att"\001"; } }
        else if ( mode == "EQ" ) {
            sub(/^[ \n\r\t]*/,"",S0); if (S0=="") continue;
            end=substr(S0,1,1);
            if ( end=="\"" || end=="'" ) {
                S0=substr(S0,2);accu="";mode="VALUE";}
            else{
                accu=S0; sub(/[ \n\r\t\/>].*$/,"",accu);
                S0=substr(S0,length(accu)+1);
                XATTR[att]=unescapeXML(accu); mode="TAG";
                XNODE=XNODE att"="XATTR[att]"\001"; } }
        else if ( mode == "VALUE" ) { # terminated by end
            if ( p=index(S0,end) ) {
                XATTR[att]=accu unescapeXML(substr(S0,1,p-1));
                XNODE=XNODE att"="XATTR[att]"\001";
                S0=substr(S0,p+length(end)); mode="TAG"; }
            else{ accu=accu unescapeXML(S0); S0=""; } }
        else if ( mode == "DTB" ) { # terminated by "[" or ">"
            if ( (q=index(S0,"[")) && (!(p=index(S0,end)) || q<p ) ) {
                XTYPE=mode; XITEM= accu substr(S0,1,q-1);
                S0=substr(S0,q+1); mode=""; dtd=1; }
            else if ( p=index(S0,end) ) {
                XTYPE=mode; XITEM= accu substr(S0,1,p-1);
                S0="]"substr(S0,p); mode=""; dtd=1; }
            else{ accu=accu S0; S0=""; } }
        else if ( p=index(S0,end) ) { # terminated by end
            XTYPE=mode; XITEM= ( mode=="END" ? tag : accu substr(S0,1,p-1) );
            S0=substr(S0,p+length(end)); mode=""; }
        else{ accu=accu S0; S0=""; } }
    _XMLIO[file,"S0"]=S0; _XMLIO[file,"line"]=XLINE;
    _XMLIO[file,"path"]=XPATH; _XMLIO[file,"dtd"]=dtd;
    if (mode=="DAT") { mode=""; if (accu!="") XTYPE="DAT"; XITEM=accu; }
    if (XTYPE) { XNODE=XTYPE"\001"XITEM"\001"XNODE; return 1; }
    close(file);
    delete _XMLIO[file,"S0"]; delete _XMLIO[file,"line"];
    delete _XMLIO[file,"path"]; delete _XMLIO[file,"dtd"];
    if (XERROR) XERROR=file ":" XLINE ": " XERROR;
    else if (mode) XERROR=file ":" mline ": " "unterminated " mode;
    else if (XPATH) XERROR=file ":" XLINE ": " "unclosed tag(s) " XPATH;
}
function unescapeXML( text ) {
    gsub( "&apos;", "'", text );
    gsub( "&quot;", "\"", text );
    gsub( "&gt;", ">", text );
    gsub( "&lt;", "<", text );
    gsub( "&amp;", "\\&", text );
    return text
}
function closeXML( file ) {
    close(file);
    delete _XMLIO[file,"S0"]; delete _XMLIO[file,"line"];
    delete _XMLIO[file,"path"]; delete _XMLIO[file,"dtd"];
    delete _XMLIO[file,"open"]; delete _XMLIO[file,"IND"];
}
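
The order of the substitutions in unescapeXML matters: the ampersand entity has to be replaced last, otherwise text such as "&amp;lt;" would be unescaped twice and come out as "<". The sketch below is a quick way to check this from the shell. It is a stand-alone illustration under stated assumptions: the wrapper name unescape_demo is hypothetical, the function body spells out the five predefined XML entities, and the apostrophe is written as "\047" only because the awk program sits inside shell single quotes.

```shell
# Hypothetical wrapper for trying unescapeXML on lines from stdin.
unescape_demo() {
    awk '
    function unescapeXML( text ) {
        gsub( "&apos;", "\047", text )   # \047 is the apostrophe
        gsub( "&quot;", "\"", text )
        gsub( "&gt;", ">", text )
        gsub( "&lt;", "<", text )
        gsub( "&amp;", "\\&", text )     # last; "\\&" yields a literal &
        return text
    }
    { print unescapeXML($0) }'
}

printf '%s\n' 'a &lt; b &amp;&amp; c' '&amp;lt;' | unescape_demo
```

The first input line should come out as "a < b && c" and the second as "&lt;", which is the escaped form of the text the author originally wrote.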
Author