"Cause a little auk awk
goes a long way."

categories: Jan,2010,MartinC

Very, Very, Very Long Strings in Gawk

In this discussion from comp.lang.awk, Martin Cohen builds a really, really, really long string in Gawk (300 million characters). He writes:

I had to extract 25-bit fields from a 90MB binary file, with frames of 10,000 fields indicated by a 33-bit sync value. The words I was interested in were indicated by being preceded by a special tag word.

My first step was to convert the binary file to hex text using od. I then wrote some gawk code to read the text file and extract the (32-bit) words preceded by the tag word. There were 9 million of them.
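The extraction step might look roughly like the sketch below. This is not Martin's code: the function name `words_after_tag`, the tag value `deadbeef`, and the file name `data.bin` are all hypothetical, and it assumes the hex words arrive whitespace-separated, as `od -An -v -tx4` prints them. Note that `od -tx4` groups bytes in the host's byte order, which may be why byte-swapping was needed later.

```shell
# words_after_tag TAG
# Read whitespace-separated hex words on stdin and print every word
# that immediately follows the (hypothetical) tag word TAG.
words_after_tag() {
  awk -v tag="$1" '
    {
      for (i = 1; i <= NF; i++) {
        if (prev == tag) print $i   # previous word was the tag: keep this one
        prev = $i                   # carried across line boundaries
      }
    }'
}

# Possible use (file name and tag are placeholders):
#   od -An -v -tx4 data.bin | words_after_tag deadbeef
```

Keeping `prev` outside the per-line loop means a tag at the end of one line still matches the first word of the next line.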

I concatenated them into a single string of 72 million hex characters (had to do byte-swapping along the way), and then, one character at a time, converted that into a string of 0's and 1's 300 million characters long. I could then easily (using index) search for the sync pattern (independent of any word boundaries) and find the data I wanted.
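A small sketch of that idea, under stated assumptions: the helper names `hex_to_bits` and `find_sync` are made up for illustration, each input line is treated as one hex string, and the real sync pattern is not given in the post, so the one used in the example is arbitrary. Only standard awk functions (`split`, `substr`, `tolower`, `index`) are used.

```shell
# hex_to_bits: turn each line of hex digits on stdin into a line of 0s and 1s.
hex_to_bits() {
  awk '
    BEGIN {
      # Lookup table: one hex digit -> four binary digits.
      lo = "0000 0001 0010 0011 0100 0101 0110 0111"
      hi = "1000 1001 1010 1011 1100 1101 1110 1111"
      split(lo " " hi, nib, " ")
      digits = "0123456789abcdef"
      for (i = 1; i <= 16; i++) bits[substr(digits, i, 1)] = nib[i]
    }
    {
      hex = tolower($0)
      out = ""
      for (i = 1; i <= length(hex); i++) out = out bits[substr(hex, i, 1)]
      print out
    }'
}

# find_sync PATTERN: print the 1-based position of PATTERN in each input
# line, or 0 if it does not occur.
find_sync() {
  awk -v pat="$1" '{ print index($0, pat) }'
}
```

For example, `printf 'a5f0\n' | hex_to_bits | find_sync 1011` reports position 6: the match starts in the middle of the second hex digit and ends inside the third, which is the point of searching a bit string rather than whole words.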

The total run time was just under 7 minutes (under Red Hat 5.1).

Some optimizations I had to do:

  • To build up the string of 9 million hex words, I had to group them 256 words at a time before concatenating them to the big string. When I just did one word at a time, it took forever - I had to stop it.
  • Similarly, when converting the hex to binary, I converted groups of 256 characters at a time before appending them to the big binary string.
  • Thinking about it now, I could probably combine the gathering of the hex words with the conversion to binary - my program was a revision of one where that combining wasn't done.
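The batching trick from the first two points can be sketched as follows. The function name `join_words_batched` is hypothetical and this is an illustration, not the original program. The likely reason it helps (an assumption, not something the post states) is that each append to a very large string can force that string to be copied, so doing one large append per 256 words costs far less than 256 small ones.

```shell
# join_words_batched [BATCH]
# Concatenate all input lines into one string, appending to the big
# string only once per BATCH lines (default 256, as in the post).
join_words_batched() {
  awk -v batch="${1:-256}" '
    {
      chunk = chunk $0            # cheap: append to the small buffer
      if (++n % batch == 0) {     # once per batch...
        big = big chunk           # ...do a single append to the big string
        chunk = ""
      }
    }
    END {
      big = big chunk             # flush the final partial batch
      print big
    }'
}
```

The batch size is a tuning knob: 256 is simply the value reported to work, and the best value on another system or gawk version may differ.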

Anyway, it's nice that gawk can handle really long strings.


Martin Cohen