Mar 01: Michael Sanders demos an X-windows GUI for AWK.
Mar 01: Awk100#24: A. Lahm and E. de Rinaldis' patent search, in AWK
Feb 28: Tim Menzies asks this community to write an AWK cookbook.
Feb 28: Arnold Robbins announces a new debugger for GAWK.
Feb 28: Awk100#23: Premysl Janouch offers an IRC bot, in AWK
Feb 28: Updated: the AWK FAQ
Feb 28: Tim Menzies offers a tiny content management system, in Awk.
Jan 31: Comment system added to awk.info. For example, see the discussion at the bottom of ?keys2awk
Jan 31: Martin Cohen shows that Gawk can handle massively long strings (300 million characters).
Jan 31: The AWK FAQ is being updated. For comments/ corrections/ extensions, please mail tim@menzies.us
Jan 31: Martin Cohen finds Awk on the Android platform.
Jan 31: Aleksey Cheusov released a new version of runawk.
Jan 31: Hirofumi Saito contributes a candidate Awk mascot.
Jan 31: Michael Sanders shows how to quickly build an AWK GUI for windows.
Jan 31: Hyung-Hwan Chung offers QSE, an embeddable Awk Interpreter.
Editor's note:
Programmers often take awk "as is", never thinking to use it as a lab in which
they can explore language extensions.
An alternative approach is to treat the Awk code base as a reusable library
of parsers, regular expression engines, and so on, and to make modifications
to the language. This second approach is taken in the Awk A*
project and, as shown here, in XMLgawk.
IMHO,
XMLgawk is one of the most exciting innovations
seen in Gawk for many years.
It shows that Awk is more than "just" a text processor: rather,
it is also a candidate technology for modern XML-based web applications.
Extends standard gawk with built-in XML processing.
Main developers: Jurgen Kahrs and Andrew Schorr.
Conceptual guidance: Manuel Collado.
MS Windows build expert: Victor Paeza.
Contributor of ideas for new features: Peter Saveliev.
XML processing, plus libraries for other extensions to Gawk.
XMLgawk is an experimental extension of the GNU Awk interpreter. It includes a small XML parsing library which is built upon the Expat XML parser. The parsing library is a very thin layer on top of Expat (implementing a pull-interface) and can also be used without GNU Awk to read XML data files.
Both XMLgawk and its XML puller library require only an ANSI C compatible compiler (GCC works, as do most vendors' ANSI C compilers) and a 'make' program.
XMLgawk provides the following functionality including:
3=Released
3=Free/public domain.
November 2003.
April 28, 2009.
(This page comes from the XML Gawk tutorial.)
One of the advantages of using the XML format for storing data is that there are formalized methods of checking the correctness of the data. Whether the data is written by hand or generated automatically, it is always advantageous to have tools for finding out whether new data obeys certain rules (is a tag misspelt? another one missing? a third one in the wrong place?).
These mechanisms for checking correctness are applied at different levels. The lowest level is well-formedness. The next higher levels of correctness checking are the DTD and (even higher, but not yet required by standards) the Schema. If you have a DTD (or Schema) specification for your XML file, you can hand it over to a validation tool, which applies the specification, checks for conformance and tells you the result. A simple tool for validation against a DTD is xmllint, which is part of libxml and therefore installed on most GNU/Linux systems. Validation against a Schema can be done with more recent versions of xmllint or with the xsv tool.
There are two reasons why validation is currently not incorporated into the gawk interpreter.
@load xml
END {
  if (XMLERROR)
    printf("XMLERROR '%s' at row %d col %d len %d\n",
           XMLERROR, XMLROW, XMLCOL, XMLLEN)
  else
    print "file is well-formed"
}
As usual, the script starts by switching gawk into XML mode. We are not interested in the content of the nodes being traversed, so we have no action to be triggered for a node. Only at the end (when the XML file is already closed) do we look at some variables reporting success or failure. If the variable XMLERROR ever contains anything other than 0 or the empty string, there is an error in parsing and the parser will stop tree traversal at the place where the error is. An explanatory message is contained in XMLERROR (its content depends on the specific parser used on this platform). The other variables in the example contain the line number and the column at which the XML file is badly formed.
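For readers without XMLgawk at hand, the same well-formedness check can be sketched against Expat directly. The following is a minimal sketch in Python, not XMLgawk code: it uses Python's standard xml.parsers.expat module, and the function name check_well_formed is hypothetical. As in the script above, nothing is done with the document's content; only the parser's error report is examined.

```python
import xml.parsers.expat


def check_well_formed(xml_text):
    """Return None if xml_text is well-formed, else (message, row, col)."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        # No handlers are installed: the content of the nodes is ignored.
        parser.Parse(xml_text, True)
    except xml.parsers.expat.ExpatError as err:
        # err.lineno is the 1-based line, err.offset the column of the error.
        message = xml.parsers.expat.ErrorString(err.code)
        return (message, err.lineno, err.offset)
    return None


for document in ("<a><b/></a>", "<a><b></a>"):
    problem = check_well_formed(document)
    if problem:
        print("error '%s' at row %d col %d" % problem)
    else:
        print("document is well-formed")
```

Like XMLERROR, the message text comes from the underlying parser, so its exact wording may differ between Expat versions.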
Jurgen Kahrs
Copyright (C) 2000, 2001, 2002, 2004, 2005, 2006, 2007 Free Software Foundation, Inc.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with the Invariant Sections being "GNU General Public License", the Front-Cover texts being (1) (see below), and with the Back-Cover Texts being (2) (see below).
(This page comes from the XML Gawk tutorial.)
The declaration of a document type in the header of an XML file is an optional part of the data, not a mandatory one. If such a declaration is present, the reference to the DTD will not be resolved and its contents will not be parsed. However, the presence of the declaration will be reported by gawk. When the declaration starts, the variable XMLSTARTDOCT contains the name of the root element's tag; later, when the declaration ends, the variable XMLENDDOCT is set to 1. In between, the array variable XMLATTR is populated with the public identifier of the DTD (if any) and the system identifier of the DTD (if any). Other parts of the declaration (elements, attributes and entities) will not be reported.
@load xml
# Remember the fields of the XML declaration (<?xml ... ?>).
XMLDECLARATION {
  version    = XMLATTR["VERSION"   ]
  encoding   = XMLATTR["ENCODING"  ]
  standalone = XMLATTR["STANDALONE"]
}
# Remember the root tag name and the identifiers of the DTD.
XMLSTARTDOCT {
  root      = XMLSTARTDOCT
  pub_id    = XMLATTR["PUBLIC"         ]
  sys_id    = XMLATTR["SYSTEM"         ]
  intsubset = XMLATTR["INTERNAL_SUBSET"]
}
# Report everything once the declaration ends, then reset for the next file.
XMLENDDOCT {
  print FILENAME
  print " version '" version "'"
  print " encoding '" encoding "'"
  print " standalone '" standalone "'"
  print " root id '" root "'"
  print " public id '" pub_id "'"
  print " system id '" sys_id "'"
  print " intsubset '" intsubset "'"
  print ""
  version = encoding = standalone = ""
  root = pub_id = sys_id = intsubset = ""
}
Most users can safely ignore these variables if they are only interested in the data itself. But some users may take advantage of these variables for checking requirements of the XML data. If your database consists of thousands of XML files of diverse origins, the public identifiers of their DTDs will help you gain an overview of the kind of data you have to handle and of potential version conflicts. The script shown above will assist you in analyzing your data files. It looks for the variables mentioned above and evaluates their content. At the start of the DTD, the tag name of the root element is stored; the identifiers are also stored and finally, those values are printed along with the name of the file which was analyzed. After each DTD, the remembered values are set to an empty string until the DTD of the next file arrives.
In the following, you can see an example output of
the script shown above. Obviously, the first
entry is a DocBook file (English version 4.2) containing a
book element which has to be validated against a local
copy of the DTD at CERN in Switzerland. The second file is a
chapter element of DocBook (English version 4.1.2) to
be validated against a DTD on the Internet. Finally, the third
entry is a file describing a project of the GanttProject application.
Only a tag name for the root element is specified; a DTD
does not seem to exist.
data/dbfile.xml
 version ''
 encoding ''
 standalone ''
 root id 'book'
 public id '-//OASIS//DTD DocBook XML V4.2//EN'
 system id '/afs/cern.ch/sw/XML/XMLBIN/share/www.oasis-open.org/docbook/xmldtd-4.2/docbookx.dtd'
 intsubset ''

data/docbook_chapter.xml
 version ''
 encoding ''
 standalone ''
 root id 'chapter'
 public id '-//OASIS//DTD DocBook XML V4.1.2//EN'
 system id 'http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd'
 intsubset ''

data/exampleGantt.gan
 version '1.0'
 encoding 'UTF-8'
 standalone ''
 root id 'ganttproject.sourceforge.net'
 public id ''
 system id ''
 intsubset ''
You may wish to make changes to this script if you need it in daily work. For example, the script currently reports nothing for files which have no DTD declaration in them. You can easily change this by appending an action for the END rule which reports in case all the variables root, pub_id and sys_id are empty. As it is, the script parses the entire XML file, although the DTD is always positioned at the top, before the root element. Parsing the root element is unnecessary and you can improve the speed of the script significantly if you tell it to stop parsing when the first element (the root element) comes in.
XMLSTARTELEM { nextfile }
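The same idea (collect the declaration data, then stop as soon as the root element begins) can be sketched outside gawk as well. The following is a minimal sketch in Python using the standard xml.parsers.expat module; it is not XMLgawk code, and the names doctype_info and RootReached are hypothetical. Raising an exception from the element handler plays the role of nextfile here.

```python
import xml.parsers.expat


class RootReached(Exception):
    """Raised to abandon parsing once the first element starts."""


def doctype_info(xml_bytes):
    """Collect XML declaration and DOCTYPE details, stopping at the root element."""
    info = {"version": "", "encoding": "", "standalone": "",
            "root": "", "public": "", "system": ""}

    def on_xml_decl(version, encoding, standalone):
        info["version"] = version or ""
        info["encoding"] = encoding or ""
        # Expat reports standalone as 1 (yes), 0 (no) or -1 (not given).
        info["standalone"] = {1: "yes", 0: "no"}.get(standalone, "")

    def on_doctype(name, system_id, public_id, has_internal_subset):
        info["root"] = name
        info["public"] = public_id or ""
        info["system"] = system_id or ""

    def on_start_element(name, attrs):
        # The declaration always precedes the root element, so stop here.
        raise RootReached

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = on_xml_decl
    parser.StartDoctypeDeclHandler = on_doctype
    parser.StartElementHandler = on_start_element
    try:
        parser.Parse(xml_bytes, True)
    except RootReached:
        pass
    return info
```

A file without any declaration simply yields empty strings for all fields, which is one way to detect the "no DTD" case mentioned above.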
(This page comes from the XML Gawk tutorial.)
When working with XML files, it is sometimes necessary to get an overview of the structure of an XML file. Ordinary editors confront us with a view that is not so pretty. For example:
<book id="hello-world" lang="en">
<bookinfo>
<title>Hello, world</title>
</bookinfo>
<chapter id="introduction">
<title>Introduction</title>
<para>This is the introduction. It has two sections</para>
<sect1 id="about-this-book">
<title>About this book</title>
<para>This is my first DocBook file.</para>
</sect1>
<sect1 id="work-in-progress">
<title>Warning</title>
<para>This is still under construction.</para>
</sect1>
</chapter>
</book>
Software developers are used to reading text files with proper indentation like this:
book lang='en' id='hello-world'
  bookinfo
    title
  chapter id='introduction'
    title
    para
    sect1 id='about-this-book'
      title
      para
    sect1 id='work-in-progress'
      title
      para
Here, it is a bit harder to recognize hierarchical dependencies among the nodes. But proper indentation allows you to keep an overview of files with more than 100 elements (a purely graphical view of such large files becomes unwieldy).
The outline tool produces such an indented output
and we will now write a script that imitates this kind
of output.
@load xml
XMLSTARTELEM {
  printf("%*s%s", 2*XMLDEPTH-2, "", XMLSTARTELEM)
  for (i=1; i<=NF; i++)
    printf(" %s='%s'", $i, XMLATTR[$i])
  print ""
}
For the first time, we don't
just check whether the XMLSTARTELEM variable contains
a tag name, but we also print the name out, properly indented
with a printf format statement (two blank characters
for each indentation level).
Note the use of the associative
array XMLATTR. Whenever we enter a markup block
(and XMLSTARTELEM is non-empty), the array XMLATTR
contains all the attributes of the tag. You can find out the
value of an attribute by accessing the array with the attribute's
name as an array index. In a well-formed XML file, all the attribute
names of one tag are distinct, so we can be sure that each attribute
has its own place in the array. The only thing left to do is
to iterate over all the entries in the array and print name and value
in a formatted way. Earlier versions of this script iterated
over the associative array with the for (i in XMLATTR)
loop. Doing so is still an option, but in this case we wanted to
make sure that attributes are printed in exactly the same order
as in the original XML data. The exact order of attribute
names is reproduced in the fields $1 .. $NF. So the
for loop can iterate over the attribute names in the
fields $1 .. $NF and print the attribute values
XMLATTR[$i].
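The push-style outline can also be sketched directly on top of Expat. The following is a minimal sketch in Python using the standard xml.parsers.expat module, not XMLgawk code; the function name outline is hypothetical. A depth counter stands in for XMLDEPTH, and Expat's ordered_attributes option plays the role of the fields $1 .. $NF by delivering attributes in document order.

```python
import xml.parsers.expat


def outline(xml_text):
    """Return the element outline of xml_text as a list of indented lines."""
    lines = []
    depth = 0

    def on_start(name, attrs):
        nonlocal depth
        depth += 1
        # With ordered_attributes set, attrs is a flat list
        # [name1, value1, name2, value2, ...] in document order.
        text = " " * (2 * depth - 2) + name
        for key, value in zip(attrs[0::2], attrs[1::2]):
            text += " %s='%s'" % (key, value)
        lines.append(text)

    def on_end(name):
        nonlocal depth
        depth -= 1

    parser = xml.parsers.expat.ParserCreate()
    parser.ordered_attributes = True
    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.Parse(xml_text, True)
    return lines


print("\n".join(outline(
    '<book id="hello-world" lang="en"><bookinfo><title>Hello, world</title>'
    '</bookinfo></book>')))
```

The handlers are called by the parser, so this is still the push style: the program does not decide when the next element arrives.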
(This page comes from the XML Gawk tutorial.)
In a procedural language, the software developer expects that he himself determines control flow within a program. He writes down what has to be done first, second, third and so on. In the pattern-action model of AWK, the novice software developer often has the oppressive feeling that
This feeling is characteristic of a whole class of programming environments. Most people would never think that the following programming environments have something in common, but they do. It is the absence of a static control flow which unites these environments under one roof:
With the lex and yacc
tools, the main program only invokes a function yyparse()
and the exact control flow depends on the input source, which
controls invocation of certain rules.
Within the context of XML, a terminology has been invented which distinguishes the procedural pull style from the event-guided push style. The script in the previous section was an example of a push-style script. Recognizing that most developers don't like their program's control flow to be pushed around, we will now present a script which pulls one item after the other from the XML file and decides what to do next in a more obvious way.
@load xml
BEGIN {
  while (getline > 0) {
    switch (XMLEVENT) {
      case "STARTELEM": {
        printf("%*s%s", 2*XMLDEPTH-2, "", XMLSTARTELEM)
        for (i=1; i<=NF; i++)
          printf(" %s='%s'", $i, XMLATTR[$i])
        print ""
      }
    }
  }
}
One XML event after the other is pulled out of the data
with the getline command. It's like feeling each grain
of sand pour through your fingers. Users who prefer this style
of reading input will also appreciate another novelty: the variable
XMLEVENT. While the push-style script on
another page used the event-specific variable
XMLSTARTELEM to detect the occurrence of a new XML element,
our pull-style script always looks at the value of the same
universal variable XMLEVENT to detect a new XML element.
Formally, we have a script that consists of one BEGIN
pattern followed by an action which is always invoked. You
see, this is a corner case of the pattern-action model,
which has been reduced so far that its essence has disappeared.
Instead of the patterns you now see the cases of a switch
statement, embedded in a while loop (for reading the
file item by item).
Obviously, we have explicit conditionals now, instead of the
implicit ones we used formerly. The actions invoked within
the case conditions are the same as those we have seen in the
push approach.
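The pull style is not specific to gawk. The following is a minimal sketch in Python, assuming the standard xml.etree.ElementTree.iterparse function as the event source; it is not XMLgawk code, and the function name outline_pull is hypothetical. The for loop corresponds to the while (getline > 0) loop, and the test on the event name corresponds to the switch on XMLEVENT.

```python
import io
import xml.etree.ElementTree as ET


def outline_pull(xml_bytes):
    """Return the element outline, pulling one parser event at a time."""
    lines = []
    depth = 0
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
    # The loop, not the parser, decides what happens with each event.
    for event, elem in events:
        if event == "start":
            depth += 1
            text = " " * (2 * depth - 2) + elem.tag
            for key, value in elem.attrib.items():
                text += " %s='%s'" % (key, value)
            lines.append(text)
        elif event == "end":
            depth -= 1
    return lines


print("\n".join(outline_pull(
    b'<chapter id="introduction"><title>Introduction</title>'
    b'<sect1 id="about-this-book"><title>About this book</title></sect1>'
    b'</chapter>')))
```

Because the control flow is explicit, it is easy to leave the loop early, for example with a break once the element of interest has been seen.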