Awk.Info

"Cause a little auk awk
goes a long way."

About awk.info
 »  table of contents
 »  featured topics
 »  page tags


About Awk
 »  advocacy
 »  learning
 »  history
 »  Wikipedia entry
 »  mascot
Implementations
 »  Awk (rarely used)
 »  Nawk (the-one-true, old)
 »  Gawk (widely used)
 »  Mawk
 »  Xgawk (gawk + xml + ...)
 »  Spawk (SQL + awk)
 »  Jawk (Awk in Java JVM)
 »  QTawk (extensions to gawk)
 »  Runawk (a runtime tool)
 »  platform support
Coding
 »  one-liners
 »  ten-liners
 »  tips
 »  the Awk 100
Community
 »  read our blog
 »  read/write the awk wiki
 »  discussion news group

Libraries
 »  Gawk
 »  Xgawk
 »  the Lawker library
Online doc
 »  reference card
 »  cheat sheet
 »  manual pages
 »  FAQ

Reading
 »  articles
 »  books:

WHAT'S NEW?

Mar 01: Michael Sanders demos an X-windows GUI for AWK.

Mar 01: Awk100#24: A. Lahm and E. de Rinaldis' patent search, in AWK

Feb 28: Tim Menzies asks this community to write an AWK cookbook.

Feb 28: Arnold Robbins announces a new debugger for GAWK.

Feb 28: Awk100#23: Premysl Janouch offers a IRC bot, In AWK

Feb 28: Updated: the AWK FAQ

Feb 28: Tim Menzies offers a tiny content management system, in Awk.

Jan 31: Comment system added to awk.info. For example, see discussion bottom of ?keys2awk

Jan 31: Martin Cohen shows that Gawk can handle massively long strings (300 million characters).

Jan 31: The AWK FAQ is being updated. For comments/ corrections/ extensions, please mail tim@menzies.us

Jan 31: Martin Cohen finds Awk on the Android platform.

Jan 31: Aleksey Cheusov released a new version of runawk.

Jan 31: Hirofumi Saito contributes a candidate Awk mascot.

Jan 31: Michael Sanders shows how to quickly build an AWK GUI for windows.

Jan 31: Hyung-Hwan Chung offers QSE, an embeddable Awk Interpreter.

[More ...]

Bookmark and Share

categories: Who,Feb,2009,Admin

Credits

Awk.info is maintained by the international awk community. There are many ways you can contribute and get listed below.


categories: Who,Feb,2009,Jimh

Jimh= Jim Hart

Author of awk++.

Jim has been a Great Auk since Feb'09.


categories: Who,Feb,2009,Ronl

Ronl= Ronald Loui

2009: consultant, CycCorp.


categories: Who,Feb,2009,Timm

Timm= Tim Menzies

2009: assoc Prof, LCSEE, WVU email: tim@menzies.us web site: http://menzies.us.

Tim has been a Great Auk since Feb'09.


categories: Who,Feb,2009,EdM

Ed= Ed Morton

2009: frequent poster to comp.lang.awk


categories: Who,Feb,2009,ArnoldR

Arnoldr= Arnold Robbins

Arnold Robbins, an Atlanta native, is a professional programmer and technical author. e has worked with Unix systems since 1980, when he was introduced to a PDP-11 running a version of Sixth Edition Unix.

He has been a heavy AWK user since 1987, when he became involved with gawk, the GNU project's version of AWK. As a member of the POSIX 1003.2 balloting group, he helped shape the POSIX standard for AWK. He is currently the maintainer of gawk and its documentation.

Since late 1997, he and his family have been living happily in Israel.


categories: Who,Feb,2009,Steffen

Steffen= Steffen Schuler

2009: gnu utils developer and monitor of comp.lang.awk


categories: Who,Feb,2009,Admin

Great Auks: awk.info's ringmasters

Some must lead, some must follow, and some have to fix the typos.

A Great Auk is someone with write permission to our repository. Since the source for this web site is stored in that repoistory, it also means that they are webmasters of this site. So they (try) to:

  1. keep the code and pages in a (somewhat) consistent form,
  2. encourage code documentation and test suites,
  3. watch comp.lang.awk for cool stuff to add to this site,
  4. write little demo programs,
  5. handle queries about this site,
  6. work the issue reports,
  7. etc.

If you want to be a Great Auk, please start contributing to this site using any of the usual methods. Once it is clear that you know what you are doing and that you play nice with others, then you should ask a current Great Auk to nominate you. Then, all the current Great Auks will vote about giving your write access.

The current Great Auks are


categories: Cookbook,Feb,2010,Timm

Awk Cook Book Project

Tim Menzies writes...

Anyone interested in collaborating on an Awk Cook Book? That we get published with (say) O'Reilly?

After browsing the excellent entries from the PHP cookbook by David Sklar and Adam Trachtenberg, I think the following topics look relevant.

But what other topics are useful/ simple/ practical/ etc?

I propose the following timetable:

  • March 2010:
    • Revise the following list of headings (making comments using the comment system, below);
    • Identify possible authors
  • April 2010:
    • Every author writes 5 new entries (call them "yellow")
  • May 2010:
    • Every author reviews 5 entries for other authors, to label them "green" (ok to go) or "red" (needs more work).
    • Every author writes 5 new entries (call them "yellow")
    • Try to interest new authors in the project.
  • Repeat the above, till done.

So, according to that timetable, the first step is for everyone to jump in and propose topics. Looking forward to your input!

     1	= Strings =
     2	
     3	 * Accessing Substrings
     4	 * Replacing Substrings
     5	 * Processing a String One Character at a Time
     6	 * Reversing a String by Word or Character
     7	 * Expanding and Compressing Tabs
     8	 * Controlling Case
     9	 * Interpolating Functions and Expressions Within Strings
    10	 * Trimming Blanks from a String
    11	 * Parsing Comma-Separated Data
    12	 * Parsing Fixed-Width Delimited Data
    13	 * Taking Strings Apart
    14	 * Wrapping Text at a Certain Line Length
    15	 * Storing Binary Data in Strings
    16	 * Building a String From Parts
    17	 * Substituting Variables Into Strings
    18	 * Converting Between Characters and Values
    19	 * Managing Whitespace
    20	 * Word-wrapping Lines of Text
    21	 * Validating an Email Address
    22	
    23	= Numbers =
    24	
    25	 * Checking Whether a String Contains a Valid Number
    26	 * Comparing Floating-Point Numbers
    27	 * Rounding Floating-Point Numbers
    28	 * Operating on a Series of Integers
    29	 * Generating Random Numbers Within a Range
    30	 * Generating Biased Random Numbers
    31	 * Taking Logarithms
    32	 * Calculating Exponents
    33	 * Formatting Numbers
    34	 * Printing Correct Plurals
    35	 * Calculating Trigonometric Functions
    36	 * Doing Trigonometry in Degrees, not Radians
    37	 * Handling Very Large or Very Small Numbers
    38	 * Converting Between Bases
    39	 * Calculating Using Numbers in Bases Other Than Decimal
    40	
    41	= Dates and Times =
    42	
    43	 * Finding the Current Date and Time
    44	 * Converting Time and Date Parts to an Epoch Timestamp
    45	 * Converting an Epoch Timestamp to Time and Date Parts
    46	 * Printing a Date or Time in a Specified Format
    47	 * Finding the Difference of Two Dates
    48	 * Finding the Difference of Two Dates with Julian Days
    49	 * Finding the Day in a Week, Month, Year, or the Week Number in a Year
    50	 * Validating a Date
    51	 * Parsing Dates and Times from Strings
    52	 * Adding to or Subtracting from a Date
    53	 * Calculating Time with Time Zones
    54	 * Accounting for Daylight Saving Time
    55	 * Generating a High-Precision Time
    56	 * Generating Time Ranges
    57	 * Using Non-Gregorian Calendars
    58	 * Program: Calendar
    59	
    60	= Arrays =
    61	
    62	 * Specifying an Array Not Beginning at Element 0
    63	 * Storing Multiple Elements per Key in an Array
    64	 * Initializing an Array to a Range of Integers
    65	 * Iterating Through an Array
    66	 * Deleting Elements from an Array
    67	 * Changing Array Size
    68	 * Appending One Array to Another
    69	 * Turning an Array into a String
    70	 * Printing an Array with Commas 
    71	 * Checking if a Key Is in an Array
    72	 * Checking if an Element Is in an Array
    73	 * Finding the Position of an Element in an Array
    74	 * Finding Elements That Pass a Certain Test
    75	 * Finding the Largest or Smallest Valued Element in an Array
    76	 * Reversing an Array
    77	 * Sorting an Array
    78	 * Sorting an Array by a Computable Field
    79	 * Sorting Multiple Arrays
    80	 * Sorting an Array Using a Method Instead of a Function
    81	 * Randomizing an Array
    82	 * Shuffling a Deck of Cards
    83	 * Removing Duplicate Elements from an Array
    84	 * Finding the Union, Intersection, or Difference of Two Arrays
    85	 * Finding All Element Combinations of an Array
    86	 * Finding All Permutations of an Array
    87	 * Program: Printing an Array in a Horizontally Columned HTML Table
    88	
    89	= Hashes =
    90	
    91	 * Introduction
    92	 * Adding an Element to a Hash
    93	 * Testing for the Presence of a Key in a Hash
    94	 * Deleting from a Hash
    95	 * Traversing a Hash
    96	 * Printing a Hash
    97	 * Retrieving from a Hash in Insertion Order 
    98	 * Hashes with Multiple Values Per Key
    99	 * Inverting a Hash
   100	 * Sorting a Hash
   101	 * Merging Hashes
   102	 * Finding Common or Different Keys in Two Hashes
   103	 * Hashing References
   104	 * Presizing a Hash
   105	 * Finding the Most Common Anything
   106	 * Representing Relationships Between Data
   107	 * Program: dutree
   108	
   109	= Variables =
   110	
   111	 * Avoiding == Versus = Confusion
   112	 * Establishing a Default Value
   113	 * Exchanging Values Without Using Temporary Variables
   114	 * Creating a Dynamic Variable Name
   115	 * Using Static Variables
   116	 * Sharing Variables Between Processes
   117	 * Encapsulating Complex Data Types as a String
   118	 * Dumping Variable Contents as Strings
   119	
   120	= Functions =
   121	
   122	 * Accessing Function Parameters
   123	 * Setting Default Values for Function Parameters
   124	 * Passing Values by Reference
   125	 * Using Named Parameters
   126	 * Creating Functions That Take a Variable Number of Arguments
   127	 * Returning Values by Reference
   128	 * Returning More Than One Value
   129	 * Skipping Selected Return Values
   130	 * Returning Failure
   131	 * Calling Variable Functions
   132	 * Accessing a Global Variable Inside a Function
   133	 * Creating Dynamic Functions
   134	
   135	= XML =
   136	
   137	 * Generating XML Manually
   138	 * Generating XML with the DOM
   139	 * Parsing XML with the DOM
   140	 * Parsing XML with SAX
   141	 * Transforming XML with XSLT
   142	 * Sending XML-RPC Requests
   143	 * Receiving XML-RPC Requests
   144	 * Sending SOAP Requests
   145	 * Receiving SOAP Requests
   146	 * Exchanging Data with WDDX
   147	 * Reading RSS Feeds
   148	
   149	= Regular Expressions =
   150	
   151	 * Switching From ereg to preg
   152	 * Matching Words
   153	 * Finding the nth Occurrence of a Match
   154	 * Choosing Greedy or Nongreedy Matches
   155	 * Matching a Valid Email Address
   156	 * Finding All Lines in a File That Match a Pattern
   157	 * Capturing Text Inside HTML Tags
   158	 * Escaping Special Characters in a Regular Expression
   159	 * Reading Records with a Pattern Separator
   160	
   161	
   162	= Files =
   163	 * Creating or Opening a Local File
   164	 * Creating a Temporary File
   165	 * Opening a Remote File
   166	 * Reading from Standard Input
   167	 * Reading a File into a String
   168	 * Counting Lines, Paragraphs, or Records in a File
   169	 * Processing Every Word in a File
   170	 * Reading a Particular Line in a File
   171	 * Processing a File Backward by Line or Paragraph
   172	 * Picking a Random Line from a File
   173	 * Randomizing All Lines in a File
   174	 * Processing Variable Length Text Fields
   175	 * Reading Configuration Files
   176	 * Reading from or Writing to a Specific Location in a File
   177	 * Removing the Last Line of a File
   178	 * Modifying a File in Place Without a Temporary File
   179	 * Flushing Output to a File
   180	 * Writing to Standard Output
   181	 * Writing to Many Filehandles Simultaneously
   182	 * Escaping Shell Metacharacters
   183	 * Passing Input to a Program
   184	 * Reading Standard Output from a Program
   185	 * Reading Standard Error from a Program
   186	 * Locking a File
   187	 * Reading and Writing Compressed Files
   188	 * Program: Unzip
   189	
   190	= Directories =
   191	
   192	 * Getting and Setting File Timestamps
   193	 * Getting File Information
   194	 * Changing File Permissions or Ownership
   195	 * Splitting a Filename into Its Component Parts
   196	 * Deleting a File
   197	 * Copying or Moving a File
   198	 * Processing All Files in a Directory
   199	 * Getting a List of Filenames Matching a Pattern
   200	 * Processing All Files in a Directory
   201	 * Making New Directories
   202	 * Removing a Directory and Its Contents
   203	 * Program: Web Server Directory Listing
   204	 * Program: Site Search

categories: Feb,2010,ArnoldR

New Awk Debugger

Arnold Robbins writes in Feb 2010..

I am pleased to announce the availability of a test version of gawk. This version uses a byte-code execution engine, and most importantly, it includes a debugger that works at the level of awk statements! The distribution is available at http://www.skeeve.com/gawk/gawk-3.1.7-bc-d.tar.gz.

This version is the same as 3.1.7, but with a new execution engine and a debugging version of gawk named, rather imaginatively, "dgawk". There is a story here. Circa 2003, a gentleman by the name of Jon Haque developed the byte-code execution engine and debugger, in the context of a development gawk version, somewhere between 3.1.3 and 3.1.4.

I never integrated the changes as they were massive and I was busy, and I wasn't able to review them.

The changes languished, and Jon disappeared.

Last fall, Stephen Davies, one of my portability team members, agreed to take on the task of bringing the code into the present. With modest help from me, he succeeded. We then went through additional work to get this version portable to some of the more esoteric systems that gawk supports (64 bit Linux, z/OS and VMS).

I thought it was ready for release at the end of December, until another one of my testers found a severe memory leak in the byte code version. It was a bear to track down, and once again Stephen came through. The debugger uses the readline library, and it is purposely similar to GDB. There is only minimal documentation on the debugger; I'd love to have someone volunteer to write a chapter for the gawk manual that explains it fully.

Example

./dgawk -f ../share/awk/round.awk 
dgawk> help
backtrace      backtrace [N] 
break          break [[filename:]N|function] 
clear          clear [[filename :]N|function] 
continue       continue [COUNT] - continue program being debugged.
delete         delete [breakpoints] [range] 
disable        disable [breakpoints] [range] 
display        display var - print value of variable
down           down [N] - move N frames down the stack.
dump           dump [filename] - dump bytecode.
enable         enable [once|del] [breakpoints] [range] 
finish         finish - execute until selected stack frame returns.
frame          frame [N] - select and print stack frame number N.
help           help - print list of commands.
ignore         ignore N COUNT - set ignore-count of breakpoint
info           info topic 
list           list [-|+|[filename:]lineno|function|range] 
next           next [COUNT] - step program
nexti          nexti [COUNT] - step one instruction
print          print var [var] - print value of a variable
quit           quit - exit debugger.
return         return [value] 
run            run - start executing program.
set            set var = value - assign value to a scalar
step           step [COUNT] - step program
stepi          stepi [COUNT] - step one instruction
tbreak         tbreak [[filename:]N|function] 
trace          trace on|off - print instruction
undisplay      undisplay [N] - remove variable(s)
until          until [[filename:]N|function] 
unwatch        unwatch [N] - remove variable(s) from watch list.
up             up [N] - move N frames up the stack.
watch          watch var - set a watchpoint for a variable.

Here is the debugger printing a function definition:

dgawk> list
1       # round.awk --- do normal rounding
2       #
3       # Arnold Robbins, arnold@skeeve.com, Public Domain
4       # August, 1996
5
6       function round(x,   ival, aval, fraction)
7       {
8          ival = int(x)    # integer part, int() truncates
9
10         # see if fractional part
11         if (ival == x)   # no fraction
12            return ival   # ensure no decimals
13
14         if (x < 0) {
15            aval = -x     # absolute value

categories: Awk100,Feb,2010,ALahm

PatentMatrix: survey gene/protien patents

(From Source Code Biol Med. 2007 Sep 6;2:4. by A. Lahm, E. de Rinaldis)

BACKGROUND: The number of patents associated with genes and proteins and the amount of information contained in each patent often present a real obstacle to the rapid evaluation of the novelty of findings associated to genes from an intellectual property (IP) perspective. This assessment, normally carried out by expert patent professionals, can therefore become cumbersome and time consuming. Here we present PatentMatrix, a novel software tool for the automated analysis of patent sequence text entries.

METHODS AND RESULTS: PatentMatrix is written in the Awk language and requires installation of the Derwent GENESEQtrade mark patent sequence database under the sequence retrieval system SRS.The software works by taking as input two files: i) a list of genes or proteins with the associated GENESEQtrade mark patent sequence accession numbers ii) a list of keywords describing the research context of interest (e.g. 'lung', 'cancer', 'therapeutics', 'diagnostics'). The GENESEQtrade mark database is interrogated through the SRS system and each patent entry of interest is screened for the occurrence of user-defined keywords. Moreover, the software extracts the basic information useful for a preliminary assessment of the IP coverage of each patent from the GENESEQtrade mark database. As output, two tab-delimited files are generated which provide the user with a detailed and an aggregated view of the results.An example is given where the IP position of five genes is evaluated in the context of 'development of antibodies for cancer treatment'.

CONCLUSION: PatentMatrix allows a rapid survey of patents associated with genes or proteins in a particular area of interest as defined by keywords. It can be efficiently used to evaluate the IP-related novelty of scientific findings and to rank genes or proteins according to their IP position.


categories: Awk100,Irc,Feb,2010,PremyslJ

VitaminA IRC Bot

Purpose

A modular IRC bot written in GNU AWK.

Developers

Premysl Janouch

Country

Czech republic

Domain

IRC Bot

Contact

See Developers.

Email

p.janouch@gmail.com

Awk

GNU AWK

Shell

Bourne shell

Platform

POSIX-compatible

Lines

1000

Current

Released (3)

Use

Free/Public Domain (3)

Users

N/A

DateDeployed

2010

Dated

February 2010

Url

http://vitamina.googlecode.com

Note: A regular release is planned in something like a month. I've typed Current: Released so you won't have to update the page.


categories: Timm,Arrays,Function,Feb,2009,ArnoldR

join

Synopsis

join(a [,start,end,sep])

Description

Joins at array into a string

Arguments

a
input array
start
Index for where to start in the array a. Default=1.
end
Index for where to start/stop in the array a. Default=size of array
sep
(OPTIONAL) What to write between each item. Defaults to blank space.

If sep is set to the magic value SUBSEP then internally, join adds nothing between the items.

Returns

A string of a's contents.

Example

gawk/array/eg/join »

gawk -f join.awk --source '
BEGIN { split("tim tom tam",a)
        print join(a,2)
}'

gawk/array/eg/join.out »

tom tam

Source

function join(a,start,end,sep,    result,i) {
    sep   = sep   ? start :  " "
    start = start ? start : 1
    end   = end   ? end   : sizeof(a)
    if (sep == SUBSEP) # magic value
       sep = ""
    result = a[start]
    for (i = start + 1; i <= end; i++)
        result = result sep a[i]
    return result
}

Helper

In earlier gawks, length(a) did not work in functions. Hence....

function sizeof(a,   i,n) { for(i in a) n++ ; return n }

Change Log

  • Jan 24'08: defaults extended to include start,stop
  • Jan 24'08: Sizeof added to handle old gawk bug

Author

Arnold Robbins, then Tim Menzies


categories: Arrays,Function,Feb,2009,Admin

array

Synopsis

arrray(a)

Description

Ensure that an array is empty

Arguments

a
input array

Example

gawk/array/eg/array »

gawk -f array.awk --source '
BEGIN { array(A);
        A[1]=2;
	print length(A);
	array(A);
	print length(A);
}'

gawk/array/eg/array.out »

1
0

Source

function array(a) { split("",a,"") }

categories: Yawk,Awk100,Feb,2009,WolfganZ

Yawk

Purpose

Run a WIKI using Gawk.

Download

Download from LAWKER or Wolfgan Zekol's web site.

Url

For a live demo, see the Yawk home page.

Developers

Wolfgan Zekol.

Domain

Web application.

Contact

Wolfgan Zekol.

Email

dag@awk-scripting.de

Description

Yawk is "yet another wiki klone", one among a lot of others. Yawk was written because the available wikis were missing some formatting capabilities or used strange formatting rules (and you might not like mine) or imposed too much requirements for understanding a wiki (mysql database installation with or without php installed).

Awk

Gawk 3.1.4 or later.

Platform

CGI

Lines

6000 lines.

Current

Status 3=Released.

Use

3=Free/public domain.

DateDeployed

2004

Dated

2009


categories: AwkLisp,Awk100,Feb,2009,DariusB

AwkLisp

Purpose

Code up a LISP/Scheme interpreter in Awk.

For more details..

See awklisp.

Developers

1

Domain

Domain-specific language.

Contact

Darius Bacon dairus@wry.me

Email

dairus@wry.me

Description

At my previous job I had to use MapBasic, an interpreter so astoundingly slow (around 100 times slower than GWBASIC) that one must wonder if it itself is implemented in an interpreted language. I still wonder, but it clearly could be: a bare-bones Lisp in awk, hacked up in a few hours, ran substantially faster.

Awk

Awk/Gawk

Lines

350

Current

1=Prototype

Use

1=Personal use.

DateDeployed

1994

Dated

2009


categories: Name,Awk100,Feb,2009,BillP

Name

Not a single program.

Purpose

Generate TeX code for a bilingual dictionary from a flat file database. This system has been used to generate multiple editions of dictionaries for several dialects of Carrier, the endangered language of a large portion of the central interior of British Columbia.

Developers

Bill Poser

Organization

Country

Canada

Domain

linguistics - dictionary publishing

Contact

Bill Poser

Email

billposer@alum.mit.edu

Description

A dictionary database consists of four flat files containing records in which fields are identified by tags, in a format isomorphic to Standard Dictionary Format. The four files contain: main entries, example sentences with translations, verb roots, verb stems. This provides modest degree of relativization. Awk scripts controlled by a makefile do the bulk of the work of generating TeX code for printing dictionaries containing front matter, a Carrier-English section, an English-Carrier section, a topical index, an alphabetical root list, a list of roots sorted by English gloss, an alphabetical list of verb stems, a list of verb stems sorted by root, an alphabetical list of affixes, a list of affixes sorted by English gloss, a list of scientific names , a list of placenames, and credits for illustrations.

Awk

gawk

Shell

The awk scripts are executed from a make file.

Platform

GNU/Linux on x86.

Uses

The awk scripts are executed from a makefile by GNU make. The other program used extensively is the sort utility msort.

Lines

5500

DevelopmentEffort

The first usable version took no more than a day (plus the time to create the TeX template into which the generated code is inserted).

MaintenanceEffort

Pure maintenance due to changes in environment, bit rot, etc. has been just about nil. The effort devoted to adding features very difficult to estimate as it has taken place at irregular intervals over a period of 15 years.

Current

Status 1=Prototype, 2=Evaluation, 3=Released, 4=No longer supported, 5=Dead 3, I guess. The code is mature but not really released since the author is the only one who normally uses it.

Use

1=Personal use, 2=in-House use, 3=Free/public domain, 4=Licensed, 5=Sold product 1

Users

1

DateDeployed

June 1993.

References

A paper describing these databases and the process for generating dictionaries from them is available: Lexical Databases for Carrier

Url

Some information about the resulting dictionaries: http://www.ydli.org/products/dicts.htm


categories: Top10,Boris,Awk100,Feb,2009,Ronl

Boris

Purpose

Demonstration to DoD of a clustering algorithm suitable for streaming data.

Source code

gawk/awk100/boris

Live demo

http://www.cse.wustl.edu/~loui/boris.cgi.

Developers

Ronald Loui and a programmer named Boris.

Organization

Washington University in St. Louis, CS Dept.

Country

USA

Domain

This is an evolutionary algorithm and visualization of a clustering algorithm that could be turned from O(n^4) to O(nlogn) with a few judicious uses of constants. Later developments added other interactive devices, including progress meters and mouse-and-click behavior.

Contact

Ronald Loui

Email

r.p.loui@gmail.com

Description

The code is an excellent example of the power of Awk as a prototyping tool: after getting the code running, with the least development time, a quirk was observed in the code that allowed a reduction from O(n^4) to O(nlogn).

  • Two of the n's are lost (n^2) by noticing that when there is a swap, the delta in the scoring function falls off by the squared distance from the point of a swap. So if you just set a constant, such as 10 or 20, or 100, based on the expected size of your clusters, then you can stop calculating the scoring function when you get past that constant.
  • The other n comes from either fixing the size of the matrix, and occasionally flushing new candidates in and out, or else by sampling over a subset of the n when you calculate the score.
  • The nlogn remains because there is a sort every now and then.

Awk

Gawk

Platform

Intended for fast servers, 1+ ghz.

Uses

Html.

Lines

158.

Development Effort

One weekend.

Maintenance Effort

None.

Current

2=Evaluation.

Use

2=in-House use.

Users

5

DateDeployed

2004.

Dated

Feb 2009.

References

Streaming Hierarchical Clustering for Concept Mining Looks, M.; Levine, A.; Covington, G.A.; Loui, R.P.; Lockwood, J.W.; Cho, Y.H. Aerospace Conference, 2007 IEEE Volume , Issue , 3-10 March 2007 Page(s):1 - 12 Digital Object Identifier 10.1109/AERO.2007.352792

blog comments powered by Disqus