Awk.Info

categories: XML,June,2009,JurgenK

Dealing with DTDs

(This page comes from the XML Gawk tutorial.)

The document type declaration in the header of an XML file is optional. If such a declaration is present, gawk does not resolve the reference to the DTD and does not parse the DTD's contents, but it does report the presence of the declaration. When the declaration starts, the variable XMLSTARTDOCT contains the tag name of the root element; later, when the declaration ends, the variable XMLENDDOCT is set to 1. In between, the array variable XMLATTR is populated with the public identifier of the DTD (if any) and the system identifier of the DTD (if any). Other parts of the declaration (elements, attributes and entities) are not reported.

     @load xml
     # XML declaration (<?xml ... ?>): remember its attributes.
     XMLDECLARATION {
       version    = XMLATTR["VERSION"        ]
       encoding   = XMLATTR["ENCODING"       ]
       standalone = XMLATTR["STANDALONE"     ]
     }
     # Start of the DTD declaration: XMLSTARTDOCT holds the root element's tag name.
     XMLSTARTDOCT {
       root       = XMLSTARTDOCT
       pub_id     = XMLATTR["PUBLIC"         ]
       sys_id     = XMLATTR["SYSTEM"         ]
       intsubset  = XMLATTR["INTERNAL_SUBSET"]
     }
     # End of the DTD declaration: print what was collected for this file.
     XMLENDDOCT {
       print FILENAME
       print "  version    '" version    "'"
       print "  encoding   '" encoding   "'"
       print "  standalone '" standalone "'"
       print "  root   id  '" root       "'"
       print "  public id  '" pub_id     "'"
       print "  system id  '" sys_id     "'"
       print "  intsubset  '" intsubset  "'"
       print ""
       # Reset everything so values do not leak into the next file's report.
       version = encoding = standalone = ""
       root = pub_id = sys_id = intsubset = ""
     }

Most users can safely ignore these variables if they are only interested in the data itself. But some users may use them to check requirements of the XML data. If your database consists of thousands of XML files of diverse origins, the public identifiers of their DTDs will help you gain an overview of the kinds of data you have to handle and of potential version conflicts. The script shown above will assist you in analyzing your data files. It watches for the variables mentioned above and evaluates their contents. At the start of the DTD declaration, the tag name of the root element and the identifiers are stored; at its end, those values are printed along with the name of the file being analyzed. After each DTD declaration, the remembered values are reset to the empty string until the declaration of the next file arrives.

The following is example output of the script shown above. The first entry is a DocBook file (English version 4.2) containing a book element, which has to be validated against a local copy of the DTD at CERN in Switzerland. The second file is a chapter element of DocBook (English version 4.1.2), to be validated against a DTD on the Internet. Finally, the third entry is a file describing a project of the GanttProject application. Only a tag name for the root element is specified; a DTD does not seem to exist.

     data/dbfile.xml
       version    ''
       encoding   ''
       standalone ''
       root   id  'book'
       public id  '-//OASIS//DTD DocBook XML V4.2//EN'
       system id  '/afs/cern.ch/sw/XML/XMLBIN/share/www.oasis-open.org/docbook/xmldtd-4.2/docbookx.dtd'
       intsubset  ''
     
     data/docbook_chapter.xml
       version    ''
       encoding   ''
       standalone ''
       root   id  'chapter'
       public id  '-//OASIS//DTD DocBook XML V4.1.2//EN'
       system id  'http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd'
       intsubset  ''
     
     data/exampleGantt.gan
       version    '1.0'
       encoding   'UTF-8'
       standalone ''
       root   id  'ganttproject.sourceforge.net'
       public id  ''
       system id  ''
       intsubset  ''

You may wish to change this script if you need it in daily work. For example, the script currently reports nothing for files that have no DTD declaration. You can change this by appending an action for the END rule that reports the case in which all of the variables root, pub_id and sys_id are empty. As it stands, the script parses the entire XML file, although the DTD declaration is always positioned at the top, before the root element. Parsing the root element is unnecessary, and you can speed up the script significantly by telling it to stop parsing when the first element (the root element) comes in:

     XMLSTARTELEM { nextfile }
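
The two changes suggested above can be combined in one small script. The following is only a sketch, not part of the original tutorial. It assumes the same xml extension and the same variables as the script above. The names has_dtd, id and count are hypothetical and introduced here for illustration. Instead of an END rule, which runs only once after all files have been read, this variant sets a flag per file and checks it when the root element arrives. It also tallies the public identifiers, which gives the overview of the data mentioned earlier.

```awk
@load xml
# has_dtd (hypothetical name): set when the current file has a DTD declaration.
# count   (hypothetical name): number of files seen per public identifier.
XMLSTARTDOCT {
  has_dtd = 1
  id = XMLATTR["PUBLIC"]
  count[id == "" ? "(no public id)" : id]++
}
# The first element is the root element, so the header of the file is over.
XMLSTARTELEM {
  if (!has_dtd)
    print FILENAME ": no DTD declaration, root element '" XMLSTARTELEM "'"
  has_dtd = 0   # reset the flag for the next file
  nextfile      # skip the rest of this file
}
# After all files: one line per public identifier with its file count.
END {
  for (id in count)
    print count[id], id
}
```

Because the check happens at the first element, a file that contains no element at all is not reported by this sketch.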

Author

Jurgen Kahrs

Copyright

Copyright (C) 2000, 2001, 2002, 2004, 2005, 2006, 2007 Free Software Foundation, Inc.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with the Invariant Sections being "GNU General Public License", the Front-Cover texts being (1) (see below), and with the Back-Cover Texts being (2) (see below).

  • A GNU Manual
  • You have freedom to copy and modify this GNU Manual, like GNU software. Copies published by the Free Software Foundation raise funds for GNU development.