categories: Xgawk,XML,May,2009,JurgenK

Printing an Outline of an XML file

(This page comes from the XML Gawk tutorial.)

When working with XML files, it is sometimes necessary to get an overview of the structure of an XML file. Ordinary editors confront us with a view that is not so pretty. For example:

     
     <book id="hello-world" lang="en">
     
     <bookinfo>
     <title>Hello, world</title>
     </bookinfo>

     
     <chapter id="introduction">
     <title>Introduction</title>
     
     <para>This is the introduction. It has two sections</para>
     
     <sect1 id="about-this-book">
     <title>About this book</title>

     
     <para>This is my first DocBook file.</para>
     
     </sect1>
     
     <sect1 id="work-in-progress">
     <title>Warning</title>
     
     <para>This is still under construction.</para>

     
     </sect1>
     
     </chapter>
     </book>

Software developers are used to reading text files with proper indentation like this:

     book lang='en' id='hello-world'
       bookinfo
         title
       chapter id='introduction'
         title
         para
         sect1 id='about-this-book'
           title
           para
         sect1 id='work-in-progress'
           title
           para

In this textual form, it is a bit harder to recognize hierarchical dependencies among the nodes than in a graphical view. But proper indentation lets you keep an overview of files with more than 100 elements, where a purely graphical view becomes unwieldy.

The outline tool produces such an indented output and we will now write a script that imitates this kind of output.

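     @load xml
     # Runs for every start tag; XMLSTARTELEM then holds the tag name.
     XMLSTARTELEM {
       # Indent by two blanks per nesting level (no indent at depth 1).
       printf("%*s%s", 2*XMLDEPTH-2, "", XMLSTARTELEM)
       # $1 .. $NF hold the attribute names in document order.
       for (i=1; i<=NF; i++)
         printf(" %s='%s'", $i, XMLATTR[$i])
       print ""
     }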

For the first time, we don't just check if the XMLSTARTELEM variable contains a tag name, but we also print the name out, properly indented with a printf format statement (two blank characters for each indentation level).
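The indentation trick can be tried without any XML input. The following is a minimal sketch in plain awk, wrapped in a shell function so it can be run directly. The function name `indent_demo` and the hard-coded depth values are assumptions made for illustration; they merely stand in for the XMLDEPTH values that the XML extension would supply.

```shell
# indent_demo: hypothetical helper that mimics the printf call of the script.
# The depth values are hard-coded here; in the real script they come from XMLDEPTH.
indent_demo() {
  awk 'BEGIN {
    # "depth name" pairs, standing in for the start tags of an XML document
    n = split("1 book 2 bookinfo 3 title 2 chapter", t, " ")
    for (i = 1; i < n; i += 2) {
      depth = t[i]; name = t[i + 1]
      # %*s takes its field width from the argument list: 2*depth-2 blanks,
      # produced by padding an empty string
      printf("%*s%s\n", 2 * depth - 2, "", name)
    }
  }'
}

indent_demo
```

Padding an empty string to a computed width avoids building the indentation in a loop. Note that some minimal awk implementations may not accept the `*` width in printf.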

Note the use of the associative array XMLATTR. Whenever we enter a markup block (and XMLSTARTELEM is non-empty), the array XMLATTR contains all the attributes of the tag. You can find out the value of an attribute by accessing the array with the attribute's name as an array index. In a well-formed XML file, all the attribute names of one tag are distinct, so we can be sure that each attribute has its own place in the array. The only thing left to do is to iterate over all the entries in the array and print name and value in a formatted way.

Earlier versions of this script actually iterated over the associative array with the for (i in XMLATTR) loop. Doing so is still an option, but in this case we wanted to make sure that attributes are printed in exactly the same order as in the original XML data. The exact order of attribute names is reproduced in the fields $1 .. $NF. So the for loop can iterate over the attribute names in the fields $1 .. $NF and print the attribute values XMLATTR[$i].
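The difference between the two loops can be illustrated without the XML extension. The sketch below is a simulation in plain awk, not the extension itself: the array `ATTR`, the assignment to `$0`, and the function name `attr_order_demo` are assumptions that stand in for XMLATTR and for the fields a start tag would fill in.

```shell
# attr_order_demo: hypothetical helper; ATTR and the assignment to $0 simulate
# what the XML extension provides as XMLATTR and the fields $1 .. $NF.
attr_order_demo() {
  awk -v q="'" 'BEGIN {
    # attribute values, indexed by attribute name (stand-in for XMLATTR)
    ATTR["id"] = "hello-world"
    ATTR["lang"] = "en"
    # attribute names in document order (stand-in for the fields of a start tag)
    $0 = "id lang"
    printf("book")
    for (i = 1; i <= NF; i++)
      printf(" %s=%s%s%s", $i, q, ATTR[$i], q)
    print ""
  }'
}

attr_order_demo
```

Because the loop walks the fields, the attributes come out in the order of the simulated input line: book id='hello-world' lang='en'. A for (name in ATTR) loop would visit the same entries, but awk does not specify the order in which it does so.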

Author

Jurgen Kahrs

Copyright

Copyright (C) 2000, 2001, 2002, 2004, 2005, 2006, 2007 Free Software Foundation, Inc.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with the Invariant Sections being "GNU General Public License", the Front-Cover texts being (1) (see below), and with the Back-Cover Texts being (2) (see below).

  (1) A GNU Manual
  (2) You have freedom to copy and modify this GNU Manual, like GNU software. Copies published by the Free Software Foundation raise funds for GNU development.