About awk.info
» table of contents
» featured topics
» page tags
|
|
|
|
|
|
Mar 01: Michael Sanders demos an X-windows GUI for AWK.
Mar 01: Awk100#24: A. Lahm and E. de Rinaldis' patent search, in AWK
Feb 28: Tim Menzies asks this community to write an AWK cookbook.
Feb 28: Arnold Robbins announces a new debugger for GAWK.
Feb 28: Awk100#23: Premysl Janouch offers a IRC bot, In AWK
Feb 28: Updated: the AWK FAQ
Feb 28: Tim Menzies offers a tiny content management system, in Awk.
Jan 31: Comment system added to awk.info. For example, see discussion bottom of ?keys2awk
Jan 31: Martin Cohen shows that Gawk can handle massively long strings (300 million characters).
Jan 31: The AWK FAQ is being updated. For comments/ corrections/ extensions, please mail tim@menzies.us
Jan 31: Martin Cohen finds Awk on the Android platform.
Jan 31: Aleksey Cheusov released a new version of runawk.
Jan 31: Hirofumi Saito contributes a candidate Awk mascot.
Jan 31: Michael Sanders shows how to quickly build an AWK GUI for windows.
Jan 31: Hyung-Hwan Chung offers QSE, an embeddable Awk Interpreter.
These pages are grouped into the topics, listed below (latest one shown first):
;
210 pages.
by Ronald P. Loui
Associate Professor of CSE
Washington University in St. Louis
(Pre-publication draft; copyright reserved by author. A subsequent version of this document appeared as IEEE Computer, vol. 41, no. 7, July 2008).
To the credit of this journal, it had the courage to publish the signal paper on scripting, John Ousterhout's "Scripting: Higher Level Programming for the 21st Century" in 1998. Today, that document rolls forward with an ever-growing list of positive citations. More importantly, every major observation in that paper seems now to be entrenched in software practice today; every benefit claimed for scripting appears to be genuine (flexibility of typelessness, rapid turnaround of interpretation, higher level semantics, development speed, appropriateness for gluing components and internet programming, ease of learning and increase in amount of casual programming).
Interestingly, IEEE COMPUTER also just printed one of the most canonical attacks on scripting, by one Diomidis Spinellis, 2005, "Java Makes Scripting Languages Irrelevant?" Part of what makes this attack interesting is that the author seems unconvinced of his own title; the paper concludes with more text devoted to praising scripting languages than it expends in its declaration of Java's progress toward improved usability. It is unclear what is a better recommendation for scripting: the durability of Ousterhout's text or the indecisiveness of this recent critic's.
The real shock is that the academic programming language community continues to reject the sea change in programming practices brought about by scripting. Enamored of the object-oriented paradigm, especially in the undergraduate curriculum, unwilling to accept the LAMP (Linux-Apache-MySQL-Perl/Python/Php) tool set, and firmly believing that more programming theory leads to better programming practice, the academics seem blind to the facts on the ground. The ACM flagship, COMMUNICATIONS OF THE ACM for example, has never published a paper recognizing the scripting philosophy, and the references throughout the ACM Digital Library to scripting are not encouraging.
Part of the problem is that scripting has risen in the shadow of object-oriented programming and highly publicized corporate battles between Sun, Netscape, and Microsoft with their competing software practices. Scripting has been appearing language by language, including object-oriented scripting languages now. Another part of the problem is that scripting is only now mature enough to stand up against its legitimate detractors. Today, there are answers to many of the persistent questions about scripting: is there a scripting language appropriate for the teaching of CS1 (the first programming course for majors in the undergraduate computing curriculum)? Is there a scripting language for enterprise or real-time applications? Is there a way for scripting practices to scale to larger software engineering projects?
I intend to review the recent history briefly for those who have not yet joined the debate, then present some of the answers that scripting advocates now give to those nagging questions. Finally, I will describe how a real pragmatism of academic interest in programming languages would have better prepared the academic computing community to see the changes that have been afoot.
1996-1998 are perhaps the most interesting years in the phylogeny of scripting. In those years, perl "held the web together", and together with a new POSIX awk and GNU gawk, was shipping with every new Linux. Meanwhile javascript was being deployed furiously (javascript bearing no important relation to java, having been renamed from "livescript" for purely corporate purposes, apparently a sign of Netscape's solidarity with Sun, and even renamed "jscript" under Microsoft). Also, a handoff from tcl/tk to python was taking place as the language of choice for GUI developers who would not yield to Microsoft's VisualBasic. Php appeared in those years, though it would take another round of development before it would start displacing server-side perl, cold fusion, and asp. Every one of these languages is now considered a classic, even prototypical, scripting language.
Already by mid-decade, the shift from scheme to java as the dominant CS1 language was complete, and the superiority of c++ over c was unquestioned in industry. But java applets were not well supported in browsers, so the appeal of "write once, run everywhere" quickly became derided as "write once, debug everywhere." Web page forms, which used the common gateway interface (cgi) were proliferating, and systems programming languages like c became recognized as overkill for server-side programming. Developers quickly discovered the main advantage of perl for cgi forms processing, especially in the dot-com setting: it minimized the programmer's write-time. What about performance? The algorithms were simple, network latency masked small delays, and database performance was built into the database software. It turned out that the bottleneck was the programming. Even at run-time, the network and disk properties were the problems, not the cpu processing. What about maintenance? The developers and management were both happy to rewrite code for redesigned services rather than deal with legacy code. Scripting, it turns out, was so powerful and programmer-friendly that it was easier to create new scripts from scratch than to modify old programs. What about user interface? After all, by 1990, most of the programming effort had become the writing of the GUI, and the object-oriented paradigm had much of its momentum in the inheritance of interface widget behaviors. Surprisingly, the interface that most programmers needed could be had in a browser. The html/javascript/cgi trio became the GUI, and if more was needed, then ambitious client-side javascript was more reliable than the browser's java virtual machine. Moreover, the server-side program was simply a better way to distribute automation in a heterogeneous internet than the downloadable client-side program, regardless of whether the download was in binary or bytecode.
Although there was not agreement on what exact necessary and sufficient properties characterized scripting and distinguished it from "more serious" programming, several things were clear:
This last point was extremely counterintuitive. Strong typing, naming regimen, and verbosity were motivated mainly by a desire to help the programmer avoid errors. But the programmer who had to generate too many keystrokes and consult too many pages, who had to search through many different files to discover semantics, and who had to follow too many rules, who had to sustain motivation and concentration over a long period of time, was a distracted and consequently inefficient programmer. Just as vast libraries did not deliver the promise of greater reusability, and virtual machines did not deliver the promise of platform-independence, the language's promise to discipline the programmer quite simply did not reduce the tendency of humans to err. It exchanged one kind of frequent error for another.
Scripting languages became the favorite tools of the independent-minded programmers: the "hackers" yes, but also the gifted and genius programmers who tended to drive a project's design and development. As Paul Graham noted (in a column reprinted in "Hackers and Painters" or this), one of the lasting and legitimate benefits of java is that it permits managers to level the playing field and extract considerable productivity from the less talented and less motivated programmers (hence, more disposable). There was a corollary to this difference between the mundane and the liberating:
The distinct features of scripting languages that produce these effects are usually enumerated as semantic features, starting with low I/O specification costs, the use of implicit coercion and weak typing, automatic variable initialization with optional declaration, predominant use of associative arrays for storage and regular expressions for pattern matching, reduced syntax, and powerful control structures. But the main reason for the productivity gains may be found in the name "scripting" itself. To script an environment is to be powerfully embedded in that environment. In the same way that the dolphin reigns over the open ocean, lisp is a powerful language for those who would customize their emacs, javascript is feral among browsers, and gawk and perl rule the linux jungle.
There is even a hint of AI in the idea of scripting: the scripting language is the way to get high level control, to automate by capturing the intentions and routines normally provided by the human. If recording and replaying macros is a kind of autopilot, then scripting is a kind of proxy for human decisionmaking. Nowhere is this clearer than in simple server-side php, or in sysadmin shell scripting.
So where do we stand now? While it may have been risky for Ousterhout to proclaim scripting on the rise in 1998, it would be folly to dismiss the success of scripting today. It is even possible that java will yield its position of dominance in the near future. (By the time this essay is printed, LAMP and AJAX might be the new darlings of the tech press; see recent articles in Business Week, this IEEE COMPUTER, and even James Gosling's blog where he concedes he was wanting to write a scripting language when he was handed the java project. Java is very much in full retreat.) Is scripting ready to fill the huge vacuum that would be produced?
I personally believe that CS1 java is the greatest single mistake in the history of computing curricula. I believe this because of the empirical evidence, not because I have an a priori preference (I too voted to shift from scheme to java in our CS1, over a decade ago, so I am complicit in the java debacle). I reported in SIGPLAN 1996 ("Why gawk for AI?") that only the scripting programmers could generate code fast enough to keep up with the demands of the artificial intelligence laboratory class. Even though students were allowed to choose any language they wanted, and many had to unlearn the java ways of doing things in order to benefit from scripting, there were few who could develop ideas into code effectively and rapidly without scripting. In the intervening decade, little has changed. We actually see more scripting, as students are happy to compress images so that they can script their computer vision projects rather than stumble around in c and c++. In fact, students who learn to script early are empowered throughout their college years, especially in the crucial UNIX and web environments. Those who learn only java are stifled by enterprise-sized correctness and the chimerae of just-in-time compilation, swing, JRE, JINI, etc. Young programmers need to practice and produce, and to learn through mistakes why discipline is needed. They need to learn design patterns by solving problems, not by reading interfaces to someone else's black box code. It is imperative that programmers learn to be creative and inventive, and they need programming tools that support code exploration rather than code production.
What scripting language could be used for CS1? My personal preferences are gawk, javascript, php, and asp, mainly because of their very gentle learning curves. I don't think perl would be a disaster; its imperfection would create many teaching moments. But there is emerging consensus in the scripting community that python is the right choice for freshman programming. Ruby would also be a defensible choice. Python and ruby have the enviable properties that almost no one dislikes them, and almost everyone respects them. Both languages support a wide variety of programming styles and paradigms and satisfy practitioners and theoreticians equally. Both languages are carefully enough designed that "correct" programming practices can be demonstrated and high standards of code quality can be enforced. The fact that Google stands by python is an added motivation for undergraduate majors.
But do scripting solutions scale? What about the performance gap when the polynomial, or worse the exponential, algorithm faces large n, and the algorithm is written in an interpreted or weakly compiled language? What about software engineering in the large, on big projects? There has been a lot of discussion about scalability of scripts recently. In the past, debates have simply ended with the concession that large systems would have to be rewritten in c++, or a similar language, once the scripting had served its prototyping duty.
The enterprise question is the easier of the two. Just as the individual programmer reaps benefits from a division of labor among tools, writing most of the code in scripts, and writing all bottleneck code in a highly optimizable language, the group of programmers benefits from the use of multiple paradigms and multiple languages. In a recent large project, we used vhdl for fpga's with a lot of gawk to configure the vhdl. We used python and php to generate dynamic html with svg and javascript for the interfaces. We used c and c++ for high performance communications wrappers, which communicated xml to higher level scripts that managed databases and processes. We saw sysadmin and report-generation in perl, ruby, and gawk, data scrubbing in perl and gawk, user scripting in bash, tcl, and gawk, and prototyping in perl and gawk. Only one module was written in java (because that programmer loved java): it was late, it was slow, it failed, and it was eventually rewritten in c++. In retrospect, neither the high performance components nor the lightweight code components were appropriate for the java language. Does scripting scale to enterprise software? I would not manage a project that did not include a lot of scripting, to minimize the amount of "hard" programming, to increase flexibility and reduce delivery time at all stages, to take testing to a higher level, and to free development resources for components where performance is actually critical. I nearly weep when I think about the text processing that was written in c under my managerial watch, because the programmer did not know perl. We write five hundred line scripts in gawk that would be ten thousand line modules in java or c++. In view of the fact that there are much better scripting tools for most of what gets programmed in java and c++, perhaps the question is whether java and c++ scale.
How about algorithmic complexity? Don't scripting languages take too long to perform nested loops? The answer here is that a cpu-bound tight loop such as a matrix multiplication is indeed faster in a language like c. But such bottlenecks are easy to identify and indeed easy to rewrite in c. True system bottlenecks are things like paging, chasing pointers on disk, process initialization, garbage collection, fragmentation, cache mismanagement, and poor data organization. Often, we see that better data organization was unimplemented because it would have required more code, code that would have been attempted in an "easier" programming language like a scripting language, but which was too difficult to attempt in a "harder" programming language. We saw this in the AI class with heuristic search and computer vision, where brute force is better in c, but complex heuristics are better than brute force, and scripting is better for complex heuristics. When algorithms are exponential, it usually doesn't matter what language you use because most practical n will incur too great a cost. Again, the solution is to write heuristics, and scripting is the top dog in that house. Cpu's are so much faster than disks these days that a single extra disk read can erase the CPU advantage of using compiled c instead of interpreted gawk. In any case, java is hardly the first choice for those who have algorithmic bottlenecks.
The real reason why academics were blindsided by scripting is their lack of practicality. Academic computing was generally late to adopt Wintel architectures, late to embrace cgi programming, and late to accept Linux in the same decade that brought scripting's rise. Academia understandably holds industry at a distance. Still, there is a purely intellectual reason why programming language courses are only now warming to scripting. The historical concerns of programming language theory have been syntax and semantics. Java's amazing contribution to computer science is that it raised so many old-fashioned questions that tickled the talents of existing programming language experts: e.g., how can it be compiled? But there are new questions that can be asked, too, such as what a particular language is well-suited to achieve inexpensively, quickly, or elegantly, especially with the new mix of platforms. The proliferation of scripting languages represents a new age of innovation in programming practice.
Linguists recognize something above syntax and semantics, and they call it "pragmatics". Pragmatics has to do with more abstract social and cognitive functions of language: situations, speakers and hearers, discourse, plans and actions, and performance. We are entering an era of comparative programming language study when the issues are higher-level, social, and cognitive too.
My old friend, Michael Scott, has a popular textbook called PROGRAMMING LANGUAGE PRAGMATICS. But it is a fairly traditional tome concerned with parameter passing, types, and bindings (it's hard to see why it merits "pragmatics" in its title, even as it goes to second edition with a chapter on scripting added!). A real programming pragmatics would ask questions like:
There have been programming language "shootouts" and "scriptometers" on the internet that have sought to address some of the questions that are relevant to the choice of scripting language, but they have been just first steps. For example, one site reports on the shortest script in each scripting language that can perform a simple task. But absolute brevity for trivial tasks, such as "print hello world" is not as illuminating as typical brevity for real tasks, such as xml parsing.
Pragmatic questions are not the easiest questions for mathematically-inclined computer scientists to address. They refer by their nature to people, their habits, their sociology, and the technological demands of the day. But it is the importance of such questions that makes programmers choose scripting languages. Ousterhout declared scripting on the rise, but perhaps so too are programming language pragmatics.
I have to thank Charlie Comstock for contributing many ideas and references over the past two years that have shaped my views, especially the commitment to the idea of pragmatics.
Prof. Dr. Loui and his students are the usual winners of the department programming contest and have contributed to current gnu releases of gawk and malloc. He has lectured on AI for two decades on five continents, taught AI programming for two decades, and is currently funded on a project delivering hardware and software on U.S. government contracts.
At the Proceedings of the Winter Usenix Conference (Dallas'91) Henry Spencer wrote in Awk As A Major Systems Programming Language that...
There is no fundamental reason why awk programs have to be small "glue" programs: even the "old" awk is a powerful programming language in its own right. Effective use of its data structures and its stream-oriented structure takes some adjustment for C programmers, but the results can be quite striking.
On the other hand, getting there can be a bit painful, and improvements in both the language and its support tools would help.
In 2009, Arnold Robbins comments:
These pages focus on games, written in Awk.
gawk -f eliza.awk
Download from LAWKER.
From Wikipedia: ELIZA was a computer program and an early example of (by modern standards) primitive natural language processing. ELIZA operated by processing users' responses to scripts, the most famous of which was DOCTOR, a simulation of a Rogerian psychotherapist. In this mode, ELIZA mostly rephrased the user's statements as questions and posed those to the 'patient.' ELIZA was written by Joseph Weizenbaum between 1964 to 1966.
This is a Gawk implementation of the famous ELIZA program by Joseph Weizenbaum. To end, type SHUT UP HI, IM ELIZA, TELL ME YOUR PROBLEM I hate dogs CAN YOU ELABORATE ON THAT ? Dogs remind me of Mylie Cyrus CAN YOU ELABORATE ON THAT ? You're kidding, right? Mylie Cyrus is one of the biggest stars on the planet WHAT MAKES YOU THINK I AM KIDDING, RIGHT? MYLIE CYRUS IS ONE OF THE BIGGEST STARS ON THE PLANET ? Cause you did not know about Mylie WE WERE DISCUSSING YOU -- NOT ME No, I was discussing Mylie Oh, I ID NOT KNOW ABOUT MYLIE Well aren't you stupid Oh, I TUPID can you get her number? WE WERE DISCUSSING YOU -- NOT ME sorry PLEASE DON'T APOLOGIZE ! but you sound so angry YOU'RE NOT REALLY TALKING ABOUT ME, ARE YOU ? i think you are a robot Oh, I RE A ROBOT yes I UNDERSTAND no you don't YOU'RE NOT REALLY TALKING ABOUT ME, ARE YOU ? hey- is this eliza? WHAT'S ABOUT ME ? I want to take you to dinner YOU'RE NOT REALLY TALKING ABOUT ME, ARE YOU ? I feel you are really stupid WE WERE DISCUSSING YOU -- NOT ME SHUT UP WELL, PLEASE PAY YOUR BILL. ITS EXACTLY ... $101
BEGIN {
SetUpEliza()
print "This is a Gawk implementation of the "\
"famous ELIZA program by Joseph Weizenbaum. "\
"To end, type SHUT UP\n";
print ElizaSays("");
}
{ print ElizaSays($0) }
function ElizaSays(YouSay) {
if (YouSay == "") {
cost = 0
answer = "HI, IM ELIZA, TELL ME YOUR PROBLEM"
} else {
q = toupper(YouSay)
gsub("'", "", q)
if(q == qold) {
answer = "PLEASE DONT REPEAT YOURSELF !"
} else {
if (index(q, "SHUT UP") > 0) {
answer = "WELL, PLEASE PAY YOUR BILL. ITS EXACTLY ... $"\
int(100*rand()+30+cost/100)
1;
} else {
qold = q
w = "-" # no keyword recognized yet
for (i in k) { # search for keywords
if (index(q, i) > 0) {
w = i
break
}
}
if (w == "-") { # no keyword, take old subject
w = wold
subj = subjold
} else { # find subject
subj = substr(q, index(q, w) + length(w)+1)
wold = w
subjold = subj # remember keyword and subject
}
for (i in conj)
gsub(i, conj[i], q) # conjugation
# from all answers to this keyword, select one randomly
answer = r[indices[int(split(k[w], indices) * rand()) + 1]]
# insert subject into answer
gsub("_", subj, answer)
}
}
}
cost += length(answer) # for later payment : 1 cent per character
return answer
}
function SetUpEliza() {
srand()
wold = "-"
subjold = " "
# table for conjugation
conj[" ARE " ] = " AM "
conj["WERE " ] = "WAS "
conj[" YOU " ] = " I "
conj["YOUR " ] = "MY "
conj[" IVE " ] =\
conj[" I HAVE " ] = " YOU HAVE "
conj[" YOUVE " ] =\
conj[" YOU HAVE "] = " I HAVE "
conj[" IM " ] =\
conj[" I AM " ] = " YOU ARE "
conj[" YOURE " ] =\
conj[" YOU ARE " ] = " I AM "
# table of all answers
r[1] = "DONT YOU BELIEVE THAT I CAN _"
r[2] = "PERHAPS YOU WOULD LIKE TO BE ABLE TO _ ?"
r[3] = "YOU WANT ME TO BE ABLE TO _ ?"
r[4] = "PERHAPS YOU DONT WANT TO _ "
r[5] = "DO YOU WANT TO BE ABLE TO _ ?"
r[6] = "WHAT MAKES YOU THINK I AM _ ?"
r[7] = "DOES IT PLEASE YOU TO BELIEVE I AM _ ?"
r[8] = "PERHAPS YOU WOULD LIKE TO BE _ ?"
r[9] = "DO YOU SOMETIMES WISH YOU WERE _ ?"
r[10] = "DONT YOU REALLY _ ?"
r[11] = "WHY DONT YOU _ ?"
r[12] = "DO YOU WISH TO BE ABLE TO _ ?"
r[13] = "DOES THAT TROUBLE YOU ?"
r[14] = "TELL ME MORE ABOUT SUCH FEELINGS"
r[15] = "DO YOU OFTEN FEEL _ ?"
r[16] = "DO YOU ENJOY FEELING _ ?"
r[17] = "DO YOU REALLY BELIEVE I DONT _ ?"
r[18] = "PERHAPS IN GOOD TIME I WILL _ "
r[19] = "DO YOU WANT ME TO _ ?"
r[20] = "DO YOU THINK YOU SHOULD BE ABLE TO _ ?"
r[21] = "WHY CANT YOU _ ?"
r[22] = "WHY ARE YOU INTERESTED IN WHETHER OR NOT I AM _ ?"
r[23] = "WOULD YOU PREFER IF I WERE NOT _ ?"
r[24] = "PERHAPS IN YOUR FANTASIES I AM _ "
r[25] = "HOW DO YOU KNOW YOU CANT _ ?"
r[26] = "HAVE YOU TRIED ?"
r[27] = "PERHAPS YOU CAN NOW _ "
r[28] = "DID YOU COME TO ME BECAUSE YOU ARE _ ?"
r[29] = "HOW LONG HAVE YOU BEEN _ ?"
r[30] = "DO YOU BELIEVE ITS NORMAL TO BE _ ?"
r[31] = "DO YOU ENJOY BEING _ ?"
r[32] = "WE WERE DISCUSSING YOU -- NOT ME"
r[33] = "Oh, I _"
r[34] = "YOU'RE NOT REALLY TALKING ABOUT ME, ARE YOU ?"
r[35] = "WHAT WOULD IT MEAN TO YOU, IF YOU GOT _ ?"
r[36] = "WHY DO YOU WANT _ ?"
r[37] = "SUPPOSE YOU SOON GOT _"
r[38] = "WHAT IF YOU NEVER GOT _ ?"
r[39] = "I SOMETIMES ALSO WANT _"
r[40] = "WHY DO YOU ASK ?"
r[41] = "DOES THAT QUESTION INTEREST YOU ?"
r[42] = "WHAT ANSWER WOULD PLEASE YOU THE MOST ?"
r[43] = "WHAT DO YOU THINK ?"
r[44] = "ARE SUCH QUESTIONS IN YOUR MIND OFTEN ?"
r[45] = "WHAT IS IT THAT YOU REALLY WANT TO KNOW ?"
r[46] = "HAVE YOU ASKED ANYONE ELSE ?"
r[47] = "HAVE YOU ASKED SUCH QUESTIONS BEFORE ?"
r[48] = "WHAT ELSE COMES TO MIND WHEN YOU ASK THAT ?"
r[49] = "NAMES DON'T INTEREST ME"
r[50] = "I DONT CARE ABOUT NAMES -- PLEASE GO ON"
r[51] = "IS THAT THE REAL REASON ?"
r[52] = "DONT ANY OTHER REASONS COME TO MIND ?"
r[53] = "DOES THAT REASON EXPLAIN ANYTHING ELSE ?"
r[54] = "WHAT OTHER REASONS MIGHT THERE BE ?"
r[55] = "PLEASE DON'T APOLOGIZE !"
r[56] = "APOLOGIES ARE NOT NECESSARY"
r[57] = "WHAT FEELINGS DO YOU HAVE WHEN YOU APOLOGIZE ?"
r[58] = "DON'T BE SO DEFENSIVE"
r[59] = "WHAT DOES THAT DREAM SUGGEST TO YOU ?"
r[60] = "DO YOU DREAM OFTEN ?"
r[61] = "WHAT PERSONS APPEAR IN YOUR DREAMS ?"
r[62] = "ARE YOU DISTURBED BY YOUR DREAMS ?"
r[63] = "HOW DO YOU DO ... PLEASE STATE YOUR PROBLEM"
r[64] = "YOU DON'T SEEM QUITE CERTAIN"
r[65] = "WHY THE UNCERTAIN TONE ?"
r[66] = "CAN'T YOU BE MORE POSITIVE ?"
r[67] = "YOU AREN'T SURE ?"
r[68] = "DON'T YOU KNOW ?"
r[69] = "WHY NO _ ?"
r[70] = "DON'T SAY NO, IT'S ALWAYS SO NEGATIVE"
r[71] = "WHY NOT ?"
r[72] = "ARE YOU SURE ?"
r[73] = "WHY NO ?"
r[74] = "WHY ARE YOU CONCERNED ABOUT MY _ ?"
r[75] = "WHAT ABOUT YOUR OWN _ ?"
r[76] = "CAN'T YOU THINK ABOUT A SPECIFIC EXAMPLE ?"
r[77] = "WHEN ?"
r[78] = "WHAT ARE YOU THINKING OF ?"
r[79] = "REALLY, ALWAYS ?"
r[80] = "DO YOU REALLY THINK SO ?"
r[81] = "BUT YOU ARE NOT SURE YOU _ "
r[82] = "DO YOU DOUBT YOU _ ?"
r[83] = "IN WHAT WAY ?"
r[84] = "WHAT RESEMBLANCE DO YOU SEE ?"
r[85] = "WHAT DOES THE SIMILARITY SUGGEST TO YOU ?"
r[86] = "WHAT OTHER CONNECTION DO YOU SEE ?"
r[87] = "COULD THERE REALLY BE SOME CONNECTIONS ?"
r[88] = "HOW ?"
r[89] = "YOU SEEM QUITE POSITIVE"
r[90] = "ARE YOU SURE ?"
r[91] = "I SEE"
r[92] = "I UNDERSTAND"
r[93] = "WHY DO YOU BRING UP THE TOPIC OF FRIENDS ?"
r[94] = "DO YOUR FRIENDS WORRY YOU ?"
r[95] = "DO YOUR FRIENDS PICK ON YOU ?"
r[96] = "ARE YOU SURE YOU HAVE ANY FRIENDS ?"
r[97] = "DO YOU IMPOSE ON YOUR FRIENDS ?"
r[98] = "PERHAPS YOUR LOVE FOR FRIENDS WORRIES YOU"
r[99] = "DO COMPUTERS WORRY YOU ?"
r[100] = "ARE YOU TALKING ABOUT ME IN PARTICULAR ?"
r[101] = "ARE YOU FRIGHTENED BY MACHINES ?"
r[102] = "WHY DO YOU MENTION COMPUTERS ?"
r[103] = "WHAT DO YOU THINK MACHINES HAVE TO DO WITH YOUR PROBLEMS ?"
r[104] = "DON'T YOU THINK COMPUTERS CAN HELP PEOPLE ?"
r[105] = "WHAT IS IT ABOUT MACHINES THAT WORRIES YOU ?"
r[106] = "SAY, DO YOU HAVE ANY PSYCHOLOGICAL PROBLEMS ?"
r[107] = "WHAT DOES THAT SUGGEST TO YOU ?"
r[108] = "I SEE"
r[109] = "IM NOT SURE I UNDERSTAND YOU FULLY"
r[110] = "COME COME ELUCIDATE YOUR THOUGHTS"
r[111] = "CAN YOU ELABORATE ON THAT ?"
r[112] = "THAT IS QUITE INTERESTING"
r[113] = "WHY DO YOU HAVE PROBLEMS WITH MONEY ?"
r[114] = "DO YOU THINK MONEY IS EVERYTHING ?"
r[115] = "ARE YOU SURE THAT MONEY IS THE PROBLEM ?"
r[116] = "I THINK WE WANT TO TALK ABOUT YOU, NOT ABOUT ME"
r[117] = "WHAT'S ABOUT ME ?"
r[118] = "WHY DO YOU ALWAYS BRING UP MY NAME ?"
# table for looking up answers that
# fit to a certain keyword
k["CAN YOU"] = "1 2 3"
k["CAN I"] = "4 5"
k["YOU ARE"] =\
k["YOURE"] = "6 7 8 9"
k["I DONT"] = "10 11 12 13"
k["I FEEL"] = "14 15 16"
k["WHY DONT YOU"] = "17 18 19"
k["WHY CANT I"] = "20 21"
k["ARE YOU"] = "22 23 24"
k["I CANT"] = "25 26 27"
k["I AM"] =\
k["IM "] = "28 29 30 31"
k["YOU "] = "32 33 34"
k["I WANT"] = "35 36 37 38 39"
k["WHAT"] =\
k["HOW"] =\
k["WHO"] =\
k["WHERE"] =\
k["WHEN"] =\
k["WHY"] = "40 41 42 43 44 45 46 47 48"
k["NAME"] = "49 50"
k["CAUSE"] = "51 52 53 54"
k["SORRY"] = "55 56 57 58"
k["DREAM"] = "59 60 61 62"
k["HELLO"] =\
k["HI "] = "63"
k["MAYBE"] = "64 65 66 67 68"
k[" NO "] = "69 70 71 72 73"
k["YOUR"] = "74 75"
k["ALWAYS"] = "76 77 78 79"
k["THINK"] = "80 81 82"
k["LIKE"] = "83 84 85 86 87 88 89"
k["YES"] = "90 91 92"
k["FRIEND"] = "93 94 95 96 97 98"
k["COMPUTER"] = "99 100 101 102 103 104 105"
k["-"] = "106 107 108 109 110 111 112"
k["MONEY"] = "113 114 115"
k["ELIZA"] = "116 117 118"
}
Juergen Kahrs
gawk -f hanoi.awk [-n Disks]
The objective is to move N discks from stack 0 to stack 1, always putting a smaller disc on top of a larger one. or on an empty stack
gawk -f hanoi.awk -n 4 0 4321 1 2 0 432 1 2 1 0 43 1 2 2 1 0 43 1 21 2 0 4 1 21 2 3 0 41 1 2 2 3 0 41 1 2 32 0 4 1 2 321 0 1 4 2 321 0 1 41 2 32 0 2 1 41 2 3 0 21 1 4 2 3 0 21 1 43 2 0 2 1 43 2 1 0 1 432 2 1 0 1 4321 2
Main:
BEGIN {
n = arg("-n",5)
for (j=0; j<n; j++) push(0,n-j)
showstacks()
hanoi(n,0,1,2)
}
function hanoi(n,a,b,c) {
if (n==1) {
move(a,b)
} else {
hanoi(n-1,a,c,b)
move(a,b)
hanoi(n-1,c,b,a)
}
}
function move(i,j) {
push(j,pop(i))
showstacks()
}
Showing the stack:
function showstacks( i,j) {
for (i=0; i<=2; i++) {
printf "%s ", i
for (j=0; j<sp[i]; j++) printf "%s", stack[i,j]
print "" }
print ""
}
Standard stuff:
function arg(tag,default) {
for(i in ARGV)
if (ARGV[i] ~ tag)
return ARGV[i+1]
return default
}
function push(i,v) { stack[i,sp[i]++]=v }
function pop(i) { return stack[i,--sp[i]] }
Alan Linton, 2001
gawk -f readminds.awk
(then type "h" or "t").
Shannon's 1953 memo, A Mind-Reading(?) Machine, describes a machine built out of relays at Bell Labs.
The machine took advantage of the difficulty of generating truly random behavior in wetware by using a small (8-state) markov model to predict its human opponents.
We implement a 1970's version of this 1950's algorithm, using AWK instead of mechanical relays.
Our markov model is based on behavior over the last two rounds, with hpa and hpb recording the history of the player's plays, and hca and hcb recording the history of the computer's guesses. The possible cases are: the player won or lost two rounds ago, changed plays or stayed with the same play, and won or lost the last round, for a total of 23 = 8 histories, with any bias towards changing or staying in the upcoming round kept in the tally array.
If the player has repeated their behavior for a given history at least twice, we guess according to their predicted behavior. After the first observation, we guess with a 75%/25%, split, weighted towards the bias. If the player hasn't shown any bias (or during the first two rounds of the game), we guess at hazard.
BEGIN {
print "+---------------------------------------------------------+"
print "| An AWKward mind-reading machine |"
print "| (this retrogame inspired by the Bell Labs Memo: |"
print "| Shannon, 1953, 'A Mind-Reading (?) Machine') |"
print "+---------------------------------------------------------+"
print "Shall we play a game?"
print "Tell me either 'heads' or 'tails'."
print "If I guess what you picked, I win. Otherwise, you win."
print "The match goes for 100 rounds, or someone gets 20."
printf "your play? "
}
BEGIN { "date +%s" | getline seed; srand(seed) }
BEGIN { t = 0 }
NR > 2 {
case = (hpa!=hca)"/"(hpa!=hpb)"/"(hpb!=hcb)
t = tally[case]
}
t < -1 { guess=!hpb }
t == -1 { guess=(int(rand()+.75)?!hpb:hpb) }
t == 0 { guess=int(rand()+.5) }
t == 1 { guess=(int(rand()+.75)?hpb:!hpb) }
t > 1 { guess= hpb }
/^[hH]/ { play=1 }
/^[tT]/ { play=0 }
/^[^hHtT]/ { printf "heads or tails? "; next }
We also report the results of the round to the player (in case they wish to update their internal models). En passant, we update pw and cw, the number of player (resp. computer) wins.
{
printf "You played " (play?"heads":"tails")
printf "; I guessed " (guess?"heads":"tails")
printf ". "(play==guess?"I":"You")" win. "
print "("(pw+=(play!=guess))"-"(cw+=(play==guess))")"
}
After finishing a round, we update the history with the results, including updating tally according to the player's behavior. Again, we wait for two rounds before touching the tally counters, at which point the history will have been fully initialized.
NR > 2 { tally[case] += (hpb == play ? 1 : -1) }
{
hpa = hpb; hpb = play
hca = hcb; hcb = guess
}
At the end of each round, if we haven't met a victory condition, we prompt for the next round.
cw+pw==100 { printf (cw>pw?"I":"You")" won the match "
print "by "(cw>pw?cw-pw:pw-cw)" games."
exit }
pw-cw==20 { print "You win -- up by 20"; exit }
cw-pw==20 { print "I win -- up by 20"; exit }
{ printf "? " }
END {
print " T H A N K Y O U F O R P L A Y I N G "
}
Copyright (c) 2009 the authors listed at the following URL, and/or the authors of referenced articles or incorporated external code: http://en.literateprograms.org/Mind_reading_machine_(AWK)?action=history&offset=20070207160312
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
gawk -f mastermind.awk
Download from LAWKER.
The aim of the game is to guess 4 numbers from 0,1,2,3,4,5,6,7,8,9. A "hit" is the right number in the right position and a "blow" is the right number in a wrong position.
You lose the game if you fail to guess after 10 rounds.
+++ Hit & Blow +++ <Push Enter>
[ 1] >> 1234
## 1 Hit 2 Blow
[ 2] >> 1256
## 1 Hit 1 Blow
[ 3] >> 1789
## 1 Hit 0 Blow
[ 4] >> 1243
## 1 Hit 2 Blow
[ 5] >> 1340
## 3 Hit 0 Blow
[ 6] >> 1320
Congratulations !! (1320)
BEGIN{
srand();
c=1;
print "\n\n +++ Hit & Blow +++ <Push Enter>\n";
q[z=p=int(9*rand())+1]=1;
for(i=2; i<=4;)
if(q[p=int(10*rand())]<1){
q[p]=i++;
z=z*10+p; }
}
Note that the range 1023 ... 9876 are the smallest and largerst 4 digit integers with no repeates.
{ if((n=int($0+0))>=1023 && n<=9876) {
++c;
v=0;
for(i=4; i>0; n=int(n/10))
v+=(q[p=n%10]==i--)?10:(q[p]>0)?1:0;
if (v==40) exit;
else printf("%16s %2d Hit %2d Blow\n", "##", v/10, v%10);
}
if (c>10) exit;
else printf("[%2d] >> ", c);
}
END{
printf("\n %s (%d)\n", (v==40)?"Congratulations !!":"Over times", z);
}
The author's name is YSA.
cat numbers | gawk -f quicksort.awk
Download from LAWKER.
Some Awk implementations come with built in sort routines (e.g. Gawk's asort and asorti functions). But it can be useful to code these yourself, especially in you are doing data structure tricks.
Quicksort selects a pivot and divides the data into values above and below the pivot. Sorting then recurses on these sub-lists.
BEGIN { RS = ""; FS = "\n" }
{ A[NR] = $0 }
END {
qsort(A, 1, NR)
for (i = 1; i <= NR; i++) {
print A[i]
if (i == NR) break
print ""
}
}
function qsort(A, left, right, i, last) {
if (left >= right)
return
swap(A, left, left+int((right-left+1)*rand()))
last = left
for (i = left+1; i <= right; i++)
if (A[i] < A[left])
swap(A, ++last, i)
swap(A, left, last)
qsort(A, left, last-1)
qsort(A, last+1, right)
}
function swap(A, i, j, t) {
t = A[i]; A[i] = A[j]; A[j] = t
}
Alfred Aho, Peter Weinberger, Brian Kernighan, 1988.
In comp.lang.awk, Aleksey Cheusov writes:
I've made runawk-0.16.0 release. This release has lots of important improvements and additions. Sources are available from
RUNAWK is a small wrapper for AWK interpreter that helps to write the standalone programs in AWK. It provides MODULES for AWK similar to PERL's "use" command and other powerful features. Dozens of ready to use modules are also provided.
(For more information, see details from the last release.)
Lots of demo programs for most runawk modules were created and they are in examples/ subdirectory now.
New MEGA module ;-) power_getopt.awk See the documentation and demo program examples/demo_power_getopt. It makes options handling REALLY easy (see below).
New modules:
Minor fixes and improvements in dirname.awk and basename.awk. Now they are fully compatible with dirname(1) and basename(1)
RUNAWK sets the following environment variables for the child awk subprocess:
RUNAWK sets RUNAWK_ART_STDIN environment variable for the child awk subprocess to 1 if additional/artificial `-' was added to the list to awk's arguments.
Makefile:
The most powerful feature of this release is power_getopt.awk module.
It provides a very powerful and very easy way to handle options.
Everything is in the usage message, you should do anything at all.
Venkatesan Satish offers a new
entry in our Awk mascot competition:
These pages focus on word processing tools in Awk.
These pages focus on language interpreters, written in Awk.
Download from LAWKER
Aaslg and aaslr implement the Amazing Awk Syntax Language, AASL (pro-
nounced ``hassle''). Aaslg (pronounced ``hassling'') takes an AASL
specification from the concatenation of the file(s) (default standard
input) and emits the corresponding AASL table on standard output.
Aaslr parses the contents of the file(s) (default standard input)
according to the AASL table in file table, emitting the table's output
on standard output.
Both take a -x option to turn on verbose and cryptic debugging output.
Both look in a library directory for pieces of the AASL system; the
AASLDIR environment variable, if present, overrides the default notion
of the location of this directory.
Aaslr expects input to consist of input tokens, one per line. For sim-
ple tokens, the line is just the text of the token. For metatokens
like ``identifier'', the line is the metatoken's name, a tab, and the
text of the token. [xxx discuss `#' lines]
Aaslr output, in the absence of syntax errors, consists of the input
tokens plus action tokens, which are lines consisting of `#!' followed
immediately by an identifier. If the syntax of the input does not
match that specified in the AASL table, aaslr emits complaint(s) on
standard error and attempts to repair the input into a legal form; see
``ERROR REPAIR'' below. Unless errors have cascaded to the point where
aaslr gives up (in which case it emits the action token ``#!aargh'' to
inform later passes of this), the output will always conform to the
AASL syntax given in the table.
Normally, a complete program using AASL consists of three passes, the
middle one being an invocation of aaslr. The first pass is a lexical
analyzer, which breaks free-form input down into input tokens in some
suitable way. The third pass is a semantics interpreter, which typi-
cally responds to input tokens by momentarily remembering them and to
action tokens by executing some action, often using the remembered
value of the previous input token. Aaslg is in fact implemented using
AASL, following this structure; it implements the -x option by just
passing it to aaslr.
An AASL specification consists of class definitions, text definitions,
and rules, in arbitrary order (except that class definitions must pre-
cede use of the classes they define). A `#' (not enclosed in a string)
begins a comment; characters from it to the end of the line are
ignored. An identifier follows the same rules as a C identifier,
except that in most contexts it can be at most 16 characters long. A
string is enclosed in double quotes ("") and generally follows C syn-
tax. Most strings denote input tokens, and references to ``input
token'' as part of AASL specification syntax should be read as ``string
denoting input token''.
A class definition is an identifier enclosed in angle brackets (<>)
followed by one or more input tokens followed by a semicolon (;). It
gives a name to a set of input tokens. Classes whose names start with
capital letters are user abbreviations; see below. Classes whose names
start with lowercase letters are special classes, used for internal
purposes. The current special classes are:
For example, the class definitions used for AASL itself are:
When AASL error repair is invoked, the parser sometimes needs to gener-
ate input tokens. In the case of a metatoken, the parser knows the
token's name but needs to generate a text for it as well. A text defi-
nition consists of an input token, an arrow (->), and a string specify-
ing what text should be generated for that token. For example, the
text definitions used for AASL itself are:
The rules of a specification define the syntax that the parser should
accept. The order of rules is not significant, except that the first
rule is considered to be the top level of the specification. The spec-
ification is executed by calling the first rule; when execution of that
rule terminates, execution of the specification terminates. If the
user wishes this to occur only at end of input, he should arrange for
the lexical analyzer to produce an endmarker token (conventionally
``EOF'') at the end of the input, and should write the first rule to
require that token at the end.
Note that an input token may be recognized considerably before it is
accepted, but the parser emits it to the output only on acceptance.
A rule consists of an identifier naming it, a colon (:), a sequence of
items which is the body of the rule, and a semicolon (;). When a rule
is called, it is executed by executing the individual items of the body
in order (as modified by control structures) until either one of them
explicitly terminates execution of the rule or the last item is exe-
cuted.
An item which is an input token requires that that token appear in the
input at that point, and accepts it (causing it to be emitted as out-
put).
An item which is an identifier denotes a call to another rule, which
executes the body of that rule and then returns to the caller. It is
an error to call a nonexistent rule.
An item which is an identifier preceded by `!' causes that identifier
to be emitted as an action token; the identifier has no other signifi-
cance.
An item which is `<<' causes execution of the current rule to terminate
immediately, returning to the calling rule.
An item which is `>>' causes the execution of the innermost enclosing
loop (see below) to terminate immediately, with execution continuing
after the end of that loop. The loop must be within the same rule.
An item which is an identifier preceded by `@%&!' causes an internal
semantic action to be executed within the parser; this is normally
needed only for bizarre situations like C's typedef. [xxx should give
details I suppose]
A choice is a sequence of branches enclosed in parentheses (()) and
separated by vertical bars (|). The first of the branches that can be
executed, is, after which execution continues after the end of the
choice.
A loop is a sequence of branches enclosed in braces ({}) and separated
by vertical bars (|). The first of the branches that can be executed,
is, and this is done repeatedly until the loop is terminated by `>>',
after which execution continues after the end of the loop. (A loop can
also be terminated by `<<' terminating execution of the whole rule.)
A branch is just a sequence of items, like a rule body, except that it
must begin with either an input token or a lookahead. If it begins
with an input token, it can be executed only when that token is the
next token in the input, and execution starts with acceptance of that
token.
A lookahead specifies conditions for execution of a branch based on
recognizing but not accepting input token(s). The simplest form is
just an input token enclosed in brackets ([]), in which case execution
of that branch is possible only when that token is the next token in
the input. The brackets can also contain multiple input tokens sepa-
rated by commas, in which case the parser looks for any of those
tokens. If a user-abbreviation class name appears, either by itself or
as an element of a comma-separated list, it stands for the list of
tokens given in its definition.
If a lookahead's brackets contain only a `*', this is a default branch,
executable regardless of the state of the input.
As a very special case, a lookahead's brackets can contain two input
tokens separated by slash (/), in which case that branch is executable
only when those two tokens, in sequence, are next in the input. Warn-
ing: this is implemented by a delicate perversion of the error-repair
machinery, and if the first of those tokens is not then accepted, the
parser will die in convulsions. A further restriction is that the same
input token may not appear as the first token of a double lookahead and
as a normal lookahead token in the same choice/loop.
Certain simple choice/loop structures appear frequently, and there are
abbreviations for them:
For example, here are the rules of the AASL specification for AASL,
minus the actions (which add considerable clutter and are unintelligi-
ble without the third pass):
When the input token is not one of those desired, either because the
item being executed is an input token and a different token appears on
the input, or because none of the branches of a choice/loop is exe-
cutable, error repair is invoked to try to fix things up. Sometimes it
can actually guess right and fix the error, but more frequently it
merely supplies a legal output so that later passes will not be thrown
into chaos by a minor syntax error.
The general error-repair strategy of an AASL parser is to give the
parser what it wants and then attempt to resynchronize the input with
the parser.
[xxx long discussion of how ``what it wants'' is determined when there
are multiple possibilities]
Resynchronization is performed in three stages. The first stage
attempts to resynchronize within a logical line, and is applied only if
neither the input token nor the desired token is a line terminator (a
member of the ``lineterm'' class). If the input token is trivial (a
member of the ``trivial'' class), it is discarded. Otherwise it is
retained, in hopes that it will be the next token that the parser asks
for.
Either way, an error message is produced, indicating what was desired,
what was seen, and what was handed to the parser. If too many of these
messages have been produced for a single line, the parser gives up,
produces a last despairing message, emits a ``#!aargh'' action token to
alert later pases, and exits. Barring this disaster, parsing then con-
tinues. If the parser at some point is willing to accept the input
token, it is accepted and error repair terminates. If a line termina-
tor is seen in input, or the parser requests one, before the parser is
willing to accept the input token, the second phase begins.
The second stage of resynchronization attempts to line both input and
parser up on a line terminator. If the desired token is a line termi-
nator and the input token is not, input is discarded until a line ter-
minator appears. If the desired token is not a line terminator and the
input token is, the input token is retained and parsing continues until
the parser asks for a line terminator. Either way, the third phase
then begins.
The third stage of resynchronization attempts to reconcile line termi-
nators. If the desired and input tokens are identical, the input token
is accepted and error repair terminates. If they are not identical and
the input token is trivial (yes, line terminators can be trivial, and
ones like `;' probably should be), the input token is discarded. If
the desired token is the endmarker, then the input token is discarded.
Otherwise, the input token continues to be retained in hopes that it
will eventually be accepted. [xxx this needs more thought] In any
case, the second phase begins again.
awk(1), yacc(1)
``error-repair disaster'' means that the first token of a double looka-
head could not be accepted and error repair was invoked on it.
Written at University of Toronto by Henry Spencer, somewhat in the
spirit of S/SL (see ACM TOPLAS April 1982).
Some of the restrictions on double lookahead are annoying.
Most of the C string escapes are recognized but disregarded, with only
a backslashed double-quote interpreted properly during text generation.
Error repair needs further tuning; it has an annoying tendency to infi-
nite-loop in certain odd situations (although the messages/line limit
eventually breaks the loop).
Complex choices/loops with many branches can result in very long lines
in the table.
The implementation of AASL was fairly straight forward, with AASL
itself used to describe its own syntax. An AASL specification is
compiled into a table, which is then processed by a table-walking
interpreter. The interpreter expects input to be as tokens, one
per line, much likethe output of a traditional scanner. A complete
program using AASL (for example, the AASL table generator) is
normally three passes: thescanner,the parser (tables plus interpreter),
and a semantics pass. The first set of tables was generated byhand
for bootstrapping.
Apart from the minor nuisance of repeated iterations of language
design, the biggest problem ofimplementing AASL wasthe question of
semantic actions. Inserting awk semantic routines into the table
interpreter, in the style of yacc,would not be impossible, but it
seemed clumsy and inelegant. Awks lack of anyprovision for compile
time initialization of tables strongly suggested reading them in at
run time, rather than taking up space with a huge BEGIN action whose
only purpose was to initialize the tables. This makes insertions into
the interpreters code awkward.
The problem was solved by a crucial observation: traditional compilers
(etc.) merge a two-stepprocess, first validating a token stream and
inserting semantic action cookiesinto it, then interpreting thestream
and the cookies to interface to semantics. Forexample, yaccs grammar
notation can be viewed asinserting fragments of C code into a parsed
output, and then interpreting that output. This approach yieldsan
extremely natural pass structure for an AASL parser,with the
parsersoutput stream being (in the absenceof syntax errors) a copy
of its input stream with annotations. The following semantic pass
then processesthis, momentarily remembering normal tokens and
interpreting annotations as operations on the remembered values.
(The semantic pass is, in fact, a classic pattern+action awk program,
with a pattern and anaction for each annotation, and a general save
the value in a variableaction for normal tokens.)
The one difficulty that arises with this method is when the language
definition involves feedbackloops between semantics and parsing,
an obvious example being Cs typedef.Dealing with this reallydoes
require some imbedding of semantics into the interpreter,although
with care it need not be much: thein-parser code for recognizing C
typedefs, including the complications introduced by block structure
andnested redeclarations of type names, is about 40 lines of awk.The
in-parser actions are invoked by a special variant of the AASL emit
semantic annotationsyntax.
Aside benefit of top-down parsing is that the context of errors is
known, and it is relatively easy to implement automatic error
recovery. When the interpreter is faced with an input token that
does not appearin the list of possibilities in the parser table,
it givesthe parser one of the possibilities anyway, and then usessimple
heuristics to try to adjust the input to resynchronize. The result
is that the parser,and subsequentpasses, always see a syntactically-correct
program. (This approach is borrowed from S/SL and its predecessors.)
Although the detailed error-recovery algorithm is still experimental,
and the current one is notentirely satisfactory when a complex AASL
specification does certain things, in general it deals with minorsyntax
errors simply and cleanly without anyneed for complicating the
specification with details of errorrecovery.Knowing the context of
errors also makes it much easier to generate intelligible error
messagesautomatically.
The AASL implementation is not large. The
scanner is 78 lines of
awk,the parser is 61 lines of AASL (using a fairly low-density
paragraphing style and a good manycomments), and the
semantics pass
is 290
lines of awk. The
table interpreter is 340 lines, about half
of which (and most of the complexity) can be attributed to the
automatic error recovery.
As an experiment with a more ambitious AASL specification, one for
ANSI C was written. This occupies 374 lines excluding comments and
blank lines, andwith the exception of the messy details of
Cdeclaratorsis mostly a fairly straightforward transcription of the
syntax given in the ANSI standard. Generating tables for this takes
about three minutes of CPU time on a Sun 3/180; the tables are about
10K bytes.
The performance of the resulting ANSI C parser is not impressive:
in very round numbers, averagedoveralarge program, it parses about
one line of C per CPU second. (The scanner,164 lines of awk, accounts
for a negligible fraction of this.) Some attention to optimization
of both the tables and the interpreter might speed this up somewhat,
but remarkable improvements are unlikely. As things stand in the absence
of better awk implementations or a rewrite of the table interpreter
in C, its a cute toy, possibly of some pedagogical value, but not a
useful production tool. On the other hand, there does not appear
to be any fundamental reason for the performance shortfall: itspurely
the result of the slowexecution of awk programs.
The scanner would be much
faster with better regular-expression matching, because it can use regular expressions to determine whether
a string is a plausible token but must use substr
to extract the string first. Nawk functions would be very
handy for modularizing code, especially the complicated and seldom-invoked
error-recovery procedure. A
switch statement modelled on the pattern+action scheme would be useful in several places.
Another troublesome issue is that arrays are second-class citizens in awk (and continue to be so in
nawk): there is no array assignment. This lack leads to endless repetitions of code like:
whenever
block structuring or a stack is desired. Nawk's multi-dimensional arrays supply some syntactic
sugar for this but don't
really fix the problem. Not only is this code clumsy, it is woefully inefficient compared
to something like
even if the implementation is very clever. This significantly reduces the usefulness of arrays as symboltables and the like, a role for which they are otherwise very well suited.
It would also be of some use if there were some way to initialize arrays as constant tables, or alternatively
a guarantee that the BEGIN action would be implemented cleverly and would not occupy space after
it had finished executing.
A
minor nuisance that surfaces constantly is that getting an error message
out to the standard-error descriptor is painfully clumsy: one gets to choose between putting error messages
out to a temporary file and having a shell "wrapper" process them later, or piping them into "cat >&2" (!).
The multi-pass input-driven
structure that awk naturally lends itself to produces very
clean and readable code with different phases neatly separated, but creates substantial difficulties
when
feedback loops appear.
(In the case of AASL,this perhaps says more about language design than about
awk.) Henry Spencer.
Download from
LAWKER.
"aaa" (the Amazing Awk Assembler) is a primitive assembler written entirely
in awk and sed. It was done for fun, to establish whether it was possible.
It is; it works. It's quite slow, the input syntax is eccentric and rather
restricted, and error-checking is virtually nonexistent, but it does work.
Furthermore it's very easy to adapt to a new machine, provided the machine
falls into the generic "8-bit-micro" category. It is supplied "as is",
with no guarantees of any kind. I can't be bothered to do any more work on
it right now, but even in its imperfect state it may be useful to someone.
aaa is the mainline shell file.
aux is a subdirectory with machine-independent stuff. Anon, 6801, and
6809 are subdirectories with machine-dependent stuff, choice specified
by a -m option (default is "anon"). Actually, even the stuff that is
supposedly machine-independent does have some machine-dependent
assumptions; notably, it knows that bytes are 8 bits (not serious) and
that the byte is the basic unit of instructions (more serious). These
would have to change for the 68000 (going to 16-bit "bytes" might be
sufficient) and maybe for the 32016 (harder).
aaa thinks that the machine subdirectories and the aux subdirectory are
in the current directory, which is almost certainly wrong.
abst is an abstract for a paper. "card", in each machine directory,
is a summary card for the slightly-eccentric input language. There is no
real manual at present; sorry.
try.s is a sample piece of 6809 input; it is semantic trash, purely for
test purposes. The assembler produces try.a, try.defs, and try.x as
outputs from "aaa try.s". try.a is an internal file that looks
somewhat like an assembly listing. try.defs is another internal file
that looks somewhat like a symbol table. These files are preserved
because of possible usefulness; tmp[123] are non-preserved temporaries.
try.x is the Intel-hex output. try.x.good is identical to try.x and is a
saved copy for regression testing of new work.
01pgm.s is a self-programming program for a 68701, based on the one
in the Motorola ap note. 01pgm.x.good is another regression-test file.
If your C library (used by awk) has broken "%02x" so it no longer means
"two digits of hex, *zero-filled*" (as some SysV libraries have), you
will have to fall back from aux/hex to aux/hex.argh, which does it the
hard way. Oh yes, you'll note that aaa feeds settings into awk on the
command line; don't assume your awk won't do this until you try it.
Henry Spencer
Programmers often take awk "as is", never thinking to use it as a lab in which
we can explore other language extensions. This is of course, only one way to treat
the Awk code base.
An alternate approach is to treat the Awk code base as a reusable library
of parsers, regular expression engines, etc etc and to make modifications
to the lanugage. This second approach was take by David Ladd and J. Christopher
Raming in their A* system.
They write:
A* is an experimental language designed to facilitate
the creation of language-processing tools. It is analogous either to
an interpreted yacc with Awk as its statement language, or to a
version of Awk which processes programs rather than records.
A* offers two principal advantages over the combination of lex,
yacc, and C:
Reference:
A*: a language for implementing language processors
Ladd, D.A.; Ramming, J.C.;
Software Engineering, IEEE Transactions on
Volume 21, Issue 11, Nov. 1995 Page(s):894 - 901
These pages focus on Sed-like stream editors, written in Awk.
Andrew Eaton wrote at comp.lang.awk:
I just started with awk and sed, I am more of a perl/C/C++ person. I
have a quick question reguarding the pipe. In Awk, I am trying to use this
construct.
Is it possible that "print" is no longer printing the value of
getline, if so how do I correct it?
Arnold Robbins comments:
The problem here is that `mv' doesn't read standard input, it only
processes command lines. Assuming that your data is something like:
You can do things two ways:
or this way:
The latter is more efficient.
by Arnold Robbins
From the Gawk Manual.
The sed utility is a stream editor, a program that reads a stream of data, makes changes to it, and passes it on. It is often used to make global changes to a large file or to a stream of data generated by a pipeline of commands. While sed is a complicated program in its own right, its most common use is to perform global substitutions in the middle of a pipeline:
Here, s/old/new/g tells sed to look for the regexp old on each input line and globally replace it with the text new, i.e., all the occurrences on a line. This is similar to awk's gsub function.
The following program, awksed.awk, accepts at least two command-line arguments: the pattern to look for and the text to replace it with. Any additional arguments are treated as data file names to process. If none are provided, the standard input is used:
The program relies on gawk's ability to have RS be a regexp, as well as on the setting of RT to the actual text that terminates the record.
The idea is to have RS be the pattern to look for. gawk automatically sets $0 to the text between matches of the pattern. This is text that we want to keep, unmodified. Then, by setting ORS to the replacement text, a simple print statement outputs the text we want to keep, followed by the replacement text.
There is one wrinkle to this scheme, which is what to do if the last record doesn't end with text that matches RS. Using a print statement unconditionally prints the replacement text, which is not correct. However, if the file did not end in text that matches RS, RT is set to the null string. In this case, we can print $0 using printf.
The BEGIN rule handles the setup, checking for the right number of arguments and calling usage if there is a problem. Then it sets RS and ORS from the command-line arguments and sets ARGV[1] and ARGV[2] to the null string, so that they are not treated as file names.
The usage function prints an error message and exits. Finally, the single rule handles the printing scheme outlined above, using print or printf as appropriate, depending upon the value of RT.
Download from
LAWKER.
The s2a project is a sed to awk conversion utility written in awk. As input it takes sed scripts, and it outputs an equivalent awk script.
This version should be fully functional as far as the following sed commands are concerned: a,d,s,p,q,c,i,n.
Commands to be implemented in the future: {},=,h,g,N,P,r,x,y,l,H,G,D,b,t,:
$ is not a valid line address.
Also, line continuation with '\' is not implemented.
James Lyons, Feb 2008.
For more excellent awk code, visit Lyon's awk.dsplab
web site.
saya(array [,label,sep,before,after,eq]) Array printing function. Contents printed, sorted on key. Size of the array The most common usage is to just use the first two arguments; e.g. For other usages, see the examples, below. Tim Menzies cat numbers | gawk -f quicksort2.awk
Download from
LAWKER.
Quicksort divides the input data around a randomly selected pivot, then recurses
on the divided data.
In quicksort2, the pivot is selected from
the first line of input.
Each data division is handled by a different UNIX pipe
and recursive gawk processes are called on the divided data.
Yes, this is not the fastest way to do it but (in theory anyway) it should be able
to handle very big data sets.
The output ignores repeated input values. I thought it was a problem with repeating the name of the pipes (hence the "rand()" labelling)
but that did not fix the issues.
Copyright (c) 2009 by David Long.
Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:
The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Original version: David Long, 2004. Tim Menzies added some modifications in 2009
to call recursive Gawk pipes on both sides of the pivot.
gawk -f cryptosig.awk tim@menzies.us
Download from
LAWKER.
Generates a one-line Awk program that can print your email, from a seemingly jumbled string.
This program can then become your email sig and only the Awk cognoscente can generate a reply.
Example
This
can be tested as follows:
or
both of which should print "tim@menzies.us".
Download from
LAWKER.
Generates random signtures. Signatures and generation code included
in same file so installation is just a matter of calling one file.
Most of the file is a large "here" document. Paragraph 1 of that
document is always added to the signatures, followed one of
the folowing paragraphs, selected at radonom.
To add to the signtures, include them in the here document,
with one preceeding blank line.
Tim Menzies
The amazingly workable (text) formatter
awf -macros [ file ] ...
Download from
LAWKER.
Type "make r" to run a regression test, formatting the manual page
(awf.1) and comparing it to a preformatted copy (awf.1.out). Type
"make install" to install it. Pathnames may need changing.
Awf formats the text from the input file(s) (standard input if none) in
an imitation of nroff's style with the -man or -ms macro packages. The
-macro option is mandatory and must be `-man' or `-ms'.
Awf is slow and has many restrictions, but does a decent job on most
manual pages and simple -ms documents, and isn't subject to AT&T's
brain-damaged licensing that denies many System V users any text
formatter at all. It is also a text formatter that is simple enough
to be tinkered with, for people who want to experiment.
Awf implements the following raw nroff requests:
and the following in-text codes:
plus the full list of nroff/troff special characters in the original V7
troff manual.
Many restrictions are present; the behavior in general is a subset of
nroff's. Of particular note are the following:
White space at the beginning of lines, and imbedded white space within
lines, is dealt with properly. Sentence terminators at ends of lines are
understood to imply extra space afterward in filled lines. Tabs are implemented crudely and not quite correctly, although in most cases they
work as expected. Hyphenation is done only at explicit hyphens, emdashes, and nroff discretionary hyphens.
The -man macro set implements the full V7 manual macros, plus a few semi-
random oddballs. The full list is:
.BY and .NB each take a single string argument (respectively, an indi-
cation of authorship and a note about the status of the manual page) and
arrange to place it in the page footer.
The -ms macro set is a substantial subset of the V7 manuscript macros.
The implemented macros are:
Size changes are recognized but ignored, as are .RP and .ND. .UL just
prints its argument in italics. .DS/.DE does not do a keep, nor do any
of the other macros that normally imply keeps.
Assignments to the header/footer string variables are recognized and
implemented, but there is otherwise no control over header/footer
formatting. The DY string variable is available. The PD, PI, and LL
number registers exist and can be changed.
The only output format supported by awf, in its distributed form, is that
appropriate to a dumb terminal, using overprinting for italics (via
underlining) and bold. The nroff special characters are printed as some
vague approximation (it's sometimes very vague) to their correct
appearance.
Awf's knowledge of the output device is established by a device file,
which is read before the user's input. It is sought in awf's library
directory, first as dev.term (where term is the value of the TERM
environment variable) and, failing that, as dev.dumb. The device file
uses special internal commands to set up resolution, special characters,
fonts, etc., and more normal nroff commands to set up page length etc.
All in /usr/lib/awf (this can be overridden by the AWFLIB environment
variable):
awk(1), nroff(1), man(7), ms(7)
Unlike nroff, awf complains whenever it sees unknown commands and macros.
All diagnostics (these and some internal ones) appear on standard error
at the end of the run.
Written at University of Toronto by Henry Spencer, more or less as a
supplement to the C News project.
Copyright 1990 University of Toronto. All rights reserved.
Written by Henry Spencer.
This software is not subject to any license of the American Telephone
and Telegraph Company or of the Regents of the University of California.
Permission is granted to anyone to use this software for any purpose on
any computer system, and to alter it and redistribute it freely, subject
to the following restrictions:
There are plenty, but what do you expect for a text formatter written
entirely in (old) awk?
The -ms stuff has not been checked out very thoroughly.
These pages focus on postscript tricks, written in Awk.
gawk -f pschoose Download from LAWKER Pagerange : list of pages from command line. Pages : array with broken out list.
At end:
"(n in Pages)" is true if page n should be printed
Arnold Robbins gawk -f psrev.awk Download from LAWKER Reverse the pages in a postscript file. Arnold Robbins gawk -f indent.awk file.sh Download from LAWKER
This is part of Phil's AWK tutorial at
http://www.bolthole.com/AWK.html.
This program adjusts the indentation level based on which keywords are
found in each line it encounters.
This is the 'default' action, that actually prints a line out.
This gets called AS WELL AS any other matching clause, in the
order they appear in this program.
An "if" match is run AFTER we run this clause.
A "done" match is run BEFORE we run this clause.
Philip Brown phil@bolthole.com These pages focus on Awk and operating systems.
These pages focus on XML tools and Awk.
Editor's note:
Programmers often take awk "as is", never thinking to use it as a lab in which
they can explore other language extensions.
An alternate approach is to treat the Awk code base as a reusable library
of parsers, regular expression engines, etc etc and to make modifications
to the lanugage. This second approach is taken in the Awk A*
project and, as shown here, in XMLgawk.
IMHO,
XMLgawk is one of the most exciting new innovations
seen in Gawk for many years.
It shows that Awk is more than "just" a text processor: rather
it is also a candidate technology for modern XML-based web applications.
)
Extends standard gawk with built-in XML processing.
Main developers: Jurgen Kahrs and Andrew Schorr.
Conceptual guidance: Manuel Collado.
MS Windows build expert: Victor Paeza.
Contributor of ideas for new features: Peter Saveliev.
XML processing, plus libraries for other extensions to Gawk.
XMLgawk is an experimental extension of the GNU Awk interpreter.
It includes a small XML parsing library which is built upon the
Expat XML parser. The parsing library is a very thin layer on top
of Expat (implementing a pull-interface) and can also be used without
GNU Awk to read XML data files.
Both, XMLgawk and its XML puller
library only require an ANSI C compatible compiler (GCC works, as
do most vendors' ANSI C compilers) and a 'make' program.
XMLgawk provides the following functionality including:
3=Released
3=Free/public domain.
November 2003.
April 28, 2009.
AI Programming lab class challenge .
Download from LAWKER.
Look at the first line of each file for something that looks like thos:
Ronald Loui (programmer and designer)
Washington University in St. Louis
USA
Text-based game simulation.
Ronald P. Loui
r.p.loui@gmail.com
Ronald Loui writes: Most people are surprised when I tell them what language we use in our undergraduate AI programming class. That's understandable. We use GAWK. GAWK, Gnu's version of Aho, Weinberger, and Kernighan's old pattern scanning language
A repeated observation in this class is that only the scripting programmers can generate code fast enough to keep up with the demands of the class. Even though students were allowed to choose any language they wanted, and many had to unlearn the java ways of doing things in order to benefit from scripting, there were few who could develop ideas into code effectively and rapidly without scripting.
In the puny language, GAWK, which Aho, Weinberger, and Kernighan thought not much more important than grep or sed,
I find lessons in AI's trends, Airs history, and the foundations of AI.
What I have found not only surprising but also hopeful, is that when I have approached the AI people who still enjoy
programming,
some of them are not the least bit surprised.
Was written for gawk in 1995 but should run on almost any awk dialect; some css positioning commands will not work in all browsers; try IE6.
Was written on Redhat Linux with multiple hardware platforms in mind.
Intended to be run on close server to minimize delays.
605 lines in main cgi with several small aux control programs.
Minimal compared to development effort, but potentially will require css for new browsers.
Number of person-months since, including enhancements
2=Evaluation.
50 students in artificial intelligence project classes had to use some version of this code over seven years
October 2004
April 2009
Awk-Linux Educational Operating Systems
Teaching operating systems.
Yung-Pin Cheng
ypc@csie.ntnu.edu.tw
Software Engineering Lab.
Department of Computer Science and Information Engineering
National Taiwan Normal University
TAIWAN
Educators of Operating Systems
Most well-known instructional operating systems are complex,
particularly if their companion software is taken into account.
It takes considerable time and effort to craft these systems,
and their complexity may introduce maintenance and evolution problems.
In this project, a courseware called Awk-Linux is proposed.
The basic hardware functions provided by Awk-Linux include timer interrupt and page-fault interrupt,
which are simulated through program instrumentation over user programs.
A major advantange of the use of Awk for this tool is platform independence.
Awk-Linux can be crafted relatively more easily and
it does not depend on any hardware simulator or platform.
Stable
Awk versions run on many platforms so this tool can be readily and easily ported to other machines.
The same can not be said for other, more complex operating systems courseware that may be
much harder to port to new environments.
In practice, using Awk-Linux is very simple for the instructor and students:
Gawk under cygwin or Linux
Windows (CYGWIN required) or Linux
C programming language
Status 3 (Released)
3(Free/public domain)
2004
Yung-Pin Cheng, Janet Mei-Chuen Lin, Awk-Linux:
A Lightweight Operating Systems Courseware
IEEE Transactions on Education, vol. 51, issue 4, pp. 461-467, 2008.
www.csie.ntnu.edu.tw/~ypc/awklinux.htm
(Editor's Note: This page is a mirror of the original
web site.
It describes a collection of shell/awk/tcl scripts used
for modeling complex domains. This code illustrates
how language choice is not a matter of "awk" vs "X". Rather,
systems can be a menagerie of different languages, including Awk.)
These simulation scripts are also available from in
LAWKER.
To test the code:
To use these scripts, you must go the following:
Then, put a copy of "ns" in the current directory, for example:
To run the tests:
These scripts are quick amalgams of shell scripts, awk, tcl, and
whatever else was handy at the time, so they are not intended as an
example of good programming style. They are run in a directory with a
"graphs" subdirectory for saved output and *.mf files (gnuplot command files),
and an "awk" subdirectory for awk files. Some of these scripts use
supporting *.awk files that are available in the awk directory, but are not listed separately
below. Some of the scripts (tfrm12.run) also use "bwcnt" C programs
for processing output data; the C code for these is in the scripts directory. Possibly one day we will
clean this all up to reduce the proliferation of scripts and languages
involved.
The implementation of TFRC in the NS simulator
is still occasionally being modified, so the precise results of
simulations can change with different versions of NS.
Some of these
simulations must be run with SBSIZE in scoreboard.h set to
10000 instead of to 1024, to allow larger TCP congestion windows.
The simulation for Figure 2 on
"Illustration of the Average Loss Interval"
can be run with
"contents/single.com",
with supporting files
"contents/single.run",
"contents/single.tcl",
and
"contents/queueSize.tcl".
Generating the postscript file also uses the following files:
The simulations for Figure 5 on
"TCP flow sending rate"
can be run with
"contents/tfrm-full.CA.DropTail.run",
"contents/tfrm-full.CA.RED.run"
with supporting files
"contents/tfrm-full.CA.tcl",
"contents/queueSize.tcl",
"contents/getmean-full.tcl".
These scripts will produce data files called
The simulations for Figure 6 on
can be run
with
"contents/tfrm12.com",
with supporting files
"contents/tfrm12.run",
"contents/tfrm12.tcl",
"contents/awk/plotdrops.awk"
and
"contents/queueSize.tcl".
The supporting programs
"bwcnt2" and "bwcnt2a" for processing the output data are compiled from
"contents/bwcnt2.c"
and "contents/bwcnt2a.c".
FYI:
On Sally's computer, this simulation set took 13 minutes.
The following supporting files were also required for generating
the postscript file
"contents/tfrm12.run1",
"contents/graphs/getmean.tcl",
"contents/graphs/s0.12.mf",
"contents/graphs/s0.loss3.mf".
The simulations for Figure 7 on "Coefficient
of variation of throughput between flows" can be run with
"contents/tfrmvar.run" with supporting files
"contents/tfrmvar.tcl",
"contents/queueSize.tcl", and
"contents/graphs/getvar.tcl".
The scripts
"contents/fixcov.tcl" combines the many output files together, and gnuplot requires "contents/graphs/s3xxx.mf" to generate the postscript.
When we have collected the scripts for Figure 8, we will put them on-line.
The simulations for Figures 9 and 10 can be run with the
script
"contents/long/doit".
The supporting scripts are in the tar file.
The simulation takes perhaps one hour.
The simulations for Figures 11-13 can be run with the
script
"contents/short/doit".
The simulation takes up to three days.
The simulations for
Figure 14 on
40 long-lived flows
can be run with
"contents/queue2.com",
with supporting files
"contents/queue.run",
"contents/queue.tcl",
"contents/queueSize.tcl",
"contents/tracequeue.tcl",
awk/"contents/awk/plotaveq.awk",
and
awk/"contents/awk/plotqueue.awk".
Generating the postscript file also uses the following file:
"contents/graphs/s0.queue.mf".
Figures 15-18 are from experiments.
The simulations for Figure 19 on
"A TFRC flow with an end to congestion"
can be run with
"contents/increase.com",
with supporting files
"contents/increase.run",
"contents/increase.tcl",
"contents/queueSize.tcl",
"contents/awk/increase.awk",
and graphs/"scriptsTR/graphs/s0.packetrate.mf".
The simulations for Figure 20 on
"A TFRC flow with persistent congestion"
can be run with
"contents/reduce.com",
with supporting files
"contents/reduce.run",
"contents/reduce.tcl",
"contents/queueSize.tcl",
"contents/awk/reduce.awk",
and
"contents/awk/reduce1.awk".
Generating the postscript file also uses the following file:
"contents/graphs/s0.rate1.mf".
The simulations for Figure 21 on
"Number of round-trip times to reduce the
sending rate" can be run with
"contents/reduce1.com",
with supporting files
"contents/reduce1.run",
"contents/reduce.tcl",
"contents/queueSize.tcl",
"contents/awk/reduce1.awk",
and
"contents/awk/reduce2.awk".
Generating the postscript file also uses the following file:
graphs/"contents/graphs/s0.half.mf".
From Intrusion Alert Normalization method using AWK
scripts and attack name database.
Dongyoung Kim, HyoChan Bang, Jung-Chan Na,
Advanced Communication Technology, 2005, ICACT 2005. The 7th International Conference on
Publication Date: 21-23 Feb. 2005
Volume: 1, On page(s): 608- 611 Vol. 1
The current several classes of intrusion alert have various
formats and semantics. And it is transferred using a variety of
protocols. The protocols that transfer intrusion alert are IDXP,
SNMP trap, SYSLOG protocol, etc. These varieties of intrusion
alert formats make it difticult to use that together. Intrusion alert
normalization makes various intrusion alert to same structure
data and same semantics. We need this normalition process to
unify alerts from a variety of security equipments. This paper
describes how to normalize alerts from several IDS and security
equipments.
You know the song:
99 bottles of beer on the wall, 99 bottles of beer.
Take one down and pass it around, 98 bottles of beer on the wall.
98 bottles of beer on the wall, 98 bottles of beer.
Take one down and pass it around, 97 bottles of beer on the wall.
97 bottles of beer on the wall, 97 bottles of beer.
Take one down and pass it around, 96 bottles of beer on the wall.
....
But how do you code it? Here's
Wilhelm Weske's version.
It is kind of fun but its a little hard to read:
Osamu Aoki has a more maintainable version. Note
how all the screen I/O is localized via functions that return strings, rather than printing
straight to the screen. This is very useful for maintaince purposes or including
code as libraries into other Awk programs.
Osamu's version is very similar to how it'd be done in C or other languages and it
does not take full advantage of Awk's features.
So
Arnold Robbins wrote a third version that is more data driven.
Most of the work is done in a
pre-processor and the actual runtime just dumps text decided before
the run. This solution might take more time (to do the setup) but it
does allow for the simple switching of the interface (just change the
last 10 lines).
I'll drink to that.
These pages focused on using Awk to implement filters on Unix mail files.
http://www.tc.umn.edu/~hause011/article/Statistical_spam_filter.html
I now use a "Statistical Spam Filter". Wow, the scummy sewer
of internet mail is cleansed, refreshed and usable again.
Just using the delete button was getting too difficult, I got 8 to
10 spam for every good piece of mail. As a spam detector I am
not as good a filter as you might think, just the subject and address
is not always enough, an anti-spam tool I am not,
I would occasionally open a spam to my great annoyance.
My filter was inspired by Paul Graham's article about a
Naive Bayesian spam filter. The article is at
"A Plan for Spam".
He basically says that you get statistics on how often tokens show
up in two bodies of mail, (spam and good,) and then calculate the
a statistical value that a single mail is spam by looking at the tokens in it.
The more mail in the good and spam mail bodies, the better the
filter is "trained". Jeez, he made it sound so easy. And it is.
I slapped an anti-spam tool together as a ksh and awk script
for use as a personal filter on a Unix type system. To implement it
I put it in the
~/.forward file. The code is at the bottom of the article, less than 100
The total code for the filter and training script is less than 200
This filter differs in lots of ways from the Paul Graham article.
I took out some of the biases he describes and simplified it,
maybe it is too simple. What I find most interesting is that
the differences do not seem to matter much, I still filter out 96+% of spams.
I got those results with a spam sample that is at least
500 emails and a good email sample that is at least 700 emails.
With smaller training samples or a different mail mix it
may not get as good results, or it may be better.
Note: I later changed the training body to be more like the proportion of
real spam to good mail, which is much more spam than good mail, about
8-10 spam to every good mail received and the anti-spam tool worked better.
First I run the training script on two bodies of mail, ~/Mail/received
(good mail) and ~/Mail/junk (saved spam mail.) The ~/Mail/received file is
already created on my unix box and holds mail that I have read and
not deleted. The training script finds all the tokens in the
emails and gives them a probability depending on how the token
is found in the "spam mail" and the "good mail". The training script
also creates the whitelist of addresses from the "good mail." As the
mail flows through the system the training script will then "learn"
each time it is run.
I run the actual spam filter script from the .forward file which
allows a user to process mail before it hits your inbox.
(Look up "man forward" at the shell prompt for further information
on the .forward file.) The script first checks the whitelist for a
good address, if it is found it passes the filter. If the address is
not found it is passed to the statistical spam filter, the tokens
are checked and the email is given a spaminess value. Above a
certain value the email is classified as spam and put in the ~/Mail/junk
file, below the value it passes to /var/spool/mail/mylogin where
I read it as god intended email to be read, with a creaky old unix client.
However, I can still read it with any other client I want, POP or IMAP.
I included a little test script below that I used to check my results.
I just split emails into files and run them one at a time
and check the value the filter gives.
Testing on email that has been used to train the filter will
give results that are very good and not valid, so I tested on email
not seen by the training script. The filter does get much better at
filtering as the training sample gets bigger, just like the other
statistical spam filters. For example, at lower sample sizes (trained
with 209 good mails, 301 spammails) the filter was pretty bad. When
the average spam value cutoff was raised to .51 so no good mail was blocked,
44% of the spam email passed through on a set of 320 spam and
683 good email. Even so, that means %56 of the spam was blocked.
Small sample sizes are not perfect, but are usable
and I began using the mail filter with a sample set of about
600 good mail and 300 spam. As the training sample increased the results
improved. As I changed the mail mix to reflect the real spam proportions
it got even better, around 96-98% of the spam blocked. I think the lower
early results were because of the proportions of spam to good email, they
should reflect the real proportion received on the system used by
the filter.
Paul Graham or others may have superior filters and better mathematics
for anti-spam algorithms but I am not
sure that it matters all that much, the amount of spam that
gets through is small enough not to bother me.
I used gawk in the filter and checked it with the gawk profiler to
look for performance problems. The largest performance constraint is
creating the spam-probability associative array in memory, the key-value
pairs of tokens and the spam value I assign to them. Creating this
associative array is more that 95% of the current time to process an
email through the filter and gets worse when the set of tokens gets larger.
Perl and other language users can get around this performance
problem with DBM file interfaces, currently not available to my gawk filter.
I added a "whitelist" of good email addresses, a feature
that helps keep good email from a bad classification and improves
performance by a huge amount (at least a magnitude of 100) by not
having to further filter the message.
The white list is not one of the "challenge-response" things that annoys me
so much that I toss any such email away, it simply learns from the email
used to train the filter, it saves addresses that are from email that
has passed the filter and gets in my "received" file. I figure that
if I receive a good email from someone, chances are 100% that I want
to receive email from that address. Note there is a place in the white
list script to get rid of commonly forged email
addresses, like your own address.
The main concept put forward by Paul Graham holds
true and seems ungodly robust: applying statistics to filter
spam works very well compared to lame rule sets and black lists.
My program just proves the robustness of the solution; apparently
any half-baked formula (like what I used) seems to work as long
as the base probability of the tokens is computed.
Here are some of the many differences between this filter
and the filter in the Paul Graham article in no particular
order of importance:
Note: Do not test mail that has been used to train the filter,
test mail not seen by the training program.
Call from the command line or in a crontab file.
Inspired by the Paul Graham article "A Plan for Spam" www.paulgraham.com
Implement in the .forward file like so:
If mail is spam then put in a spam file
else put in the good mail file.
Download from
LAWKER.
Sorts a Unix style mailbox by "thread", in date+subject order.
This is a script I use quite a lot. It requires gawk although with some work could
be ported to standard awk. The timezone offset from GMT has to be
adjust to one's local offset, although I could probably eliminate
that if I wanted to work on it hard enough.
This took me a while to write and get right, but it's been working
flawlessly for a few years now.
The script uses Message-ID header to detect and remove duplicates. It requires GNU Awk for
time/date functions and for efficiency hack in string concatenation but could
be made to run on a POSIX awk with some work.
Pull apart a date string and convert to timestamp.
How many days in the given month?
Trim out "Re:", white space.
Copyright 2007, 2008, Arnold David Robbins
arnold@skeeve.com
Example Code
% cat 1.awk
#!/usr/bin/env runawk
#use "power_getopt.awk"
#.begin-str help
# power_getopt - program demonstrating a power of power_getopt.awk module
# usage: power_getopt [OPTIONS]
# OPTIONS:
# -h|--help display this screen
# -f|--flag flag
# --long-flag long flag only
# -s short flag only
# =F|--FLAG
./1.awk
% ./1.awk
f --- 0
flag --- 0
long-flag --- 0
s --- 0
F --- default1
FLAG --- default2
./1.awk -h
% ./1.awk -h
power_getopt - program demonstrating a power of power_getopt.awk module
usage: power_getopt [OPTIONS]
OPTIONS:
-h|--help display this screen
-f|--flag flag
--long-flag long flag only
-s short flag only
-F|--FLAG
./1.awk -f
% ./1.awk -f
f --- 1
flag --- 1
long-flag --- 0
s --- 0
F --- default1
FLAG --- default2
./1.awk -F value
% ./1.awk -F value
f --- 0
flag --- 0
long-flag --- 0
s --- 0
F --- value
FLAG --- value
./1.awk --FLAG=value
% ./1.awk --FLAG=value
f --- 0
flag --- 0
long-flag --- 0
s --- 0
F --- value
FLAG --- value
New Mascot
Word Processing in Awk
Writing Interpreters
AASL: Parser Genrator in Awk
Download
Synopsis
aaslg [ -x ] [ file ... ]
aaslr [ -x ] table [ file ... ]
Description
AASL Specifications
<trivial> "," ";" ;
<lineterm> ";" ;
<endmarker> "EOF" ;
"id" -> "___"
"string" -> "\"___\""
abbreviation expansion
( items ?) ( items | [*] )
{ items ?} { items | [*] >> }
( ! [look] items ?) ( [ look] | items )
{ ! [look] items ?} { [ look] >> | items }
rules: {
"id" ":" contents ";"
| "<" "id" ">" {"string" ?} ";"
| "string" "->" "string"
| "EOF" >>
};
contents: {
">>"
| "<<"
| "id"
| "!" "id"
| "@%&!" "id"
| "string"
| "(" branches ")"
| "{" branches "}"
| [*] >>
};
branches: (
"!" "[" look "]" contents "?"
| [*] branch (
["|"] {"|" branch ?}
| "?" !endbranch
| [*]
)
);
branch: (
"string" contents
| "[" look "]" contents
);
look: (
["string"/"/"] "string" "/" "string"
| "*"
| [*] looker {"," looker ?}
);
looker: ( "string" | "id" ) ;
Error Repair
Files
all in $AASLDIR:
interp table interpreter
lex first pass of aaslg
syn AASL table for aaslg
sem third pass of aaslg
See Also
Diagnostics
History
Bugs
Assessment
Lessons From AASL
for (i in array)
arraystack[i ":" sp] = array[i]
arraystack[sp] = array
Author
Amazing Awk Assembler
Download from
Description
Author
Awk A*
While there are a number of systems that will help one construct full-blown metaprograms such as compilers and interpreters, we wanted something with extremely low overhead.
We set out to build a something with the property that it would
help even inexperienced users build simple meta-programs in a
matter of minutes with a few lines of code. A* is the result; it
is more than anything else an engineering exercise, as most of
its ideas are not new. It is the arrangement of these ideas and
the purpose to which they are directed distinguish A* from
other tools.
A* programmers are thus able to accomplish many useful tasks with little code.
Sed-clones (in Awk)
Moving Files with Awk
while ((getline < "somedata.txt") > 0)
{print | "mv"} #or could be "mv -v" for verbose.
oldfile newfile
# build the command and execute it
while ((getline < "somedata.txt") > 0) {
command = "mv " $1 " " $2
system(command)
}
close("somedata.txt")
# send commands to the shell
while ((getline < "somedata.txt") > 0) {
printf("mv %s %s\n", $1, $2) | "sh"
}
close("somedata.txt")
close("sh")
AwkSed: A Simple Stream Editor
command1 < orig.data | sed 's/old/new/g' | command2 > result
# awksed.awk --- do s/foo/bar/g using just print
# Thanks to Michael Brennan for the idea
function usage()
{
print "usage: awksed pat repl [files...]" > "/dev/stderr"
exit 1
}
BEGIN {
# validate arguments
if (ARGC < 3)
usage()
RS = ARGV[1]
ORS = ARGV[2]
# don't use arguments as files
ARGV[1] = ARGV[2] = ""
}
# look ma, no hands!
{
if (RT == "")
printf "%s", $0
else
print
}
s2a: sed to Awk
Contents
Download
Description
Bugs
Author
Code
BEGIN{RS=";|\n"; FS=""; var=1;}
{
i=1; case1=""; case2="";
while($i==" ")i++;
if($i=="\\"||$i=="/"||$i~/[0-9]/) case1=matchaddr();
if($i==","){i++; case2=matchaddr()};
handle sed commands
####################################################################################################
if($i == "d"){ a1=a2="next;";
}else if($i == "p"){ a1=a2="print;";
}else if($i == "a"){ rest="";
for(c=i+2;c<=NF;c++) rest=rest$c;
a1=a2="$0=$0\"\\n"rest"\";";
}else if($i == "q"){ a1=a2="print; exit;";
}else if($i == "n"){ a1=a2="print; if(getline <= 0) next;"
}else if($i == "s"){
re=substr($0, i); p=substr(re,2,1); match(re,"s"p"((\\"p"|.)*)"p"((\\"p"|.)*)"p"([a-zA-Z])?",tmp);
tmp[3]=gensub(/\\[0-9]/,"\\\\&","g",tmp[3]);
tmp[1]=gensub(/\\\(/,"(","g",tmp[1]); tmp[1]=gensub(/\\\)/,")","g",tmp[1]);
if(tmp[3]=="") a1=a2="$0=gensub(/"tmp[1]"/,\""tmp[3]"\",1);";
else a1=a2="$0=gensub(/"tmp[1]"/,\""tmp[3]"\",\""tmp[5]"\");";
}else if($i == "c"){ rest="";
for(c=i+2;c<=NF;c++) rest=rest$c;
a1="$0=\""rest"\";";
a2="next;";
}else if($i == "i"){ rest="";
for(c=i+2;c<=NF;c++) rest=rest$c;
a1=a2="$0=\""rest"\\n\"$0;";
}else{
print "ERROR: invalid syntax. Unkown command in expression "$0" (expr number "NR")"; exit;
}
####################################################################################################
output awk commands
if(case1=="" && case2=="") print "{"a1"}";
else if(case1~/^[0-9]/ && case2=="") print "NR=="case1"{"a1"}";
else if(case2 == "") print "/"case1"/{"a1"}";
else if(case1~/^[0-9]/ && case2~/^[0-9]/) print "temp"var"==1&&NR=="case2"{temp"var"=0;"a2"}temp"var"==1{"a2"}NR=="case1"{temp"var"=1;"a1"}";
else if(case1~/^[0-9]/) print "temp"var"==1&&/"case2"/{temp"var"=0;"a2"}temp"var"==1{"a2"}NR=="case1"{temp"var"=1;"a1"}";
else if(case2~/^[0-9]/) print "temp"var"==1&&NR=="case2"{temp"var"=0;"a2"}temp"var"==1{"a2"}/"case1"/{temp"var"=1;"a1"}";
else print "temp"var"==1&&/"case2"/{temp"var++"=0;"a2"}temp"var"==1{"a2"}/"case1"/{temp"var"=1;"a1"}";
var++;
}
function matchaddr(){
str=substr($0, i); p=1;
if($i == "\\"){ p=substr(str,2,1); match(str,p"([^"p"]*)"p,arr); i++}
else if($i == "/"){ p=substr(str,1,1); match(str,p"([^"p"]*)"p,arr); }
else { match(str,/^([0-9]*)/,arr) };
i += RLENGTH;
return arr[1];
}
END{print "{print}";}
saya
Synopsis
Description
Arguments
Returns
Notes
saya(a,"name") ==>
name[1] = tim
name[2] = menzies
Source
function saya(a,s, sep0,b4,after,eq, c,m,n,key,val,i,j,tmp,sep) {
sep0 = sep0 ? sep0 : "\n"
b4 = b4 ? b4 : "\n"
after = after ? after : "\n"
eq = eq ? eq : " = "
pre = s ? s"[" : ""
post = s ? "]" : ""
m = asorti(a,b)
printf("%s",b4)
for(i=1;i<=m;i++) {
key=b[i]
val=a[b[i]]
printf("%s", sep pre )
n=split(key,tmp,SUBSEP)
c = ""
for(j=1;j<=n;j++) {
printf("%s", c tmp[j] )
c=","
}
printf("%s", post eq val )
sep=sep0;
};
printf("%s",after)
return m
}
Example
gawk -f saya.awk --source '
BEGIN {
A["fname" ] = "tim"
A["lname" ] = "menzies"
A["address"] = "usa"
saya(A,"",", ","[","]")
print ""
saya(A,"message")
B[2,3,9] = 100
B[10,1,11] = 200
B[1,3,10] = 300
saya(B,"b")
}'
[address = usa, fname = tim, lname = menzies]
message[address] = usa
message[fname] = tim
message[lname] = menzies
b[1,3,10] = 300
b[10,1,11] = 200
b[2,3,9] = 100
Author
quicksort2.awk
Contents
Synopsis
Download
Description
Code
BEGIN {
recurse1 = "gawk -f quicksort2.awk #" rand()
recurse2 = "gawk -f quicksort2.awk #" rand()
}
NR == 1 { pivot=$0; next }
NR > 1 { if($0 < pivot) print | recurse1
if($0 > pivot) print | recurse2
}
END { close(recurse1)
if(NR > 0) print pivot
close(recurse2)
}
Bugs
See also
Copyright
Author
Hiding Email Address
Contents
Synopsis
Download
Description
% gawk -f cryptosig.awk tim@menzies.us
BEGIN{a="7059631863556476595569007169";while(a){printf("%c",46+substr(a,1,2));a=substr(a,3)}}
echo 'BEGIN{a="7059631863556476595569007169";while(a){printf("%c",46+substr(a,1,2));a=substr(a,3)}}' | gawk -f -
gawk -f crypotsig.awk tim@menzies.us | gawk -f -
Code
BEGIN {
for (i=0; i<=255; i++) { # build table of char=value pairs
ord_arr[sprintf("%c",i)] = i # character = ordinal value
}
for (i=1; i<=ARGC-1; i++) {
str = ""
for (j=1; j<=length(ARGV[i]); j++) {
str = sprintf("%s%02d",str,ord_arr[substr(ARGV[i],j,1)]-46)
}
printf("BEGIN{a=\"%s\";while(a){printf(\"%%c\",46+substr(a,1,2));a=substr(a,3)}}\n",str)
}
exit(0)
}
Author
BEGIN{a="535170696159626207061118755158656500536563";
while(a){
printf("%c",46+substr(a,1,2));a=substr(a,3)};
print("")
}
Random Signatures
Contents
Synopsis
chmod +x sigs; ./sigs
Download
Description
Code
Pick1
pick1() {
gawk 'BEGIN { srand(); RS="" }
NR==1 { print $0 "\n" }
NR>1 { Recs[rand()] = $0 }
END { for ( R in Recs ) {print Recs[R]; exit}}
' $1
}
The Signatures
cat << SoMEI_mpOSSIblE_sYMBOl | pick1
tim.menzies {
title: dr (Ph.D.) and associate professor;
align: csee, west virginia university;
cell: esb 841A;
url: http://menzies.us;
fyi: unless marked "URGENT", i usually won't get 2 your email b4 5pm;
}
Doing a job RIGHT the first time gets the job done. Doing the job WRONG
fourteen times gives you job security.
Rome did not create a great empire by having meetings, they did it by
killing all those who opposed them.
INDECISION is the key to FLEXIBILITY.
"When a subject becomes totally obsolete we make it a required
course." Peter Drucker
I saw two shooting stars last night but they were only satellites .
Its wrong to wish on space hardware. I wish, I wish, I wish you cared.
-- Billy Bragg
Then, in 1995, came the most amazing event in the
history of programming languages: the introduction
of Java. -- Programming Languages: Principles and Practice
Suburbia is where the developer bulldozes out the trees, then names
the streets after them. --Bill Vaughan
Instant gratification takes too long.
-- Carrie Fisher
Complexity is easy. Simplicity is hard.
--Unknown
Author
awf
Synopsis
Download
Description
.\" .ce .fi .in .ne .pl .sp
.ad .de .ft .it .nf .po .ta
.bp .ds .ie .ll .nr .ps .ti
.br .el .if .na .ns .rs .tm
\$ \% \* \c \f \n \s
MAN Macros
.B .DT .IP .P .RE .SM
.BI .HP .IR .PD .RI .TH
.BR .I .LP .PP .RS .TP
.BY .IB .NB .RB .SH .UC
MS Macros
.AB .CD .ID .ND .QP .RS .UL
.AE .DA .IP .NH .QS .SH .UX
.AI .DE .LD .NL .R .SM
.AU .DS .LG .PP .RE .TL
.B .I .LP .QE .RP .TP
Output
FiLes
common common device-independent initialization
dev.* device-specific initialization
mac.m* macro packages
pass1 macro substituter
pass2.base central formatter
pass2.m* macro-package-specific bits of formatter
pass3 line and page composer
See Also
Diagnostics
Author
Copyright
Bugs
Postscript Tricks
pschoose.awk
Contents
Synopsis
Download
Description
Pulls out a range of pages from postscript and just print those.
Details
Code
Set up the list of paes to print.
function set_pagerange( n, m, i, j, f, g)
{
delete Pages
n = split(Pagerange, f, ",")
for (i = 1; i <= n; i++) {
if (index(f[i], "-") != 0) { # a range
m = split(f[i], g, "-")
if (m != 2 || g[1] >= g[2]) {
printf("bad list of pages: %s\n",
f[i]) > "/dev/stderr"
exit 1
}
for (j = g[1]; j <= g[2]; j++)
Pages[j] = 1
} else
Pages[f[i]] = 1
}
}
BEGIN {
# constants
TRUE = 1
FALSE = 0
if (ARGC != 3) {
print "usage: pschoose range-spec file\n" > "/dev/stderr"
exit 1
}
Pagerange = ARGV[1]
delete ARGV[1]
set_pagerange()
}
NR == 1, /^%%Page:/ {
if (! /^%%Page/) {
Prolog[++nprolog] = $0
next
}
}
/^%%Trailer/ || In_trailer {
In_trailer = TRUE
Epilog[++nepilog] = $0
next
}
/^%%Page: / {
++Npage
line = 0
}
for all non-special lines
{
# only save it if we will want to print it
if (Npage in Pages)
Page[Npage, ++line] = $0
}
END {
# print the prologue
for (i = 1; i in Prolog; i++)
print Prolog[i]
# print the actual body
for (i = 1; i <= Npage; i++) {
if (i in Pages) {
for (j = 1; (i, j) in Page; j++) {
print Page[i, j]
}
}
}
# print the epilog
for (i = 1; i in Epilog; i++)
print Epilog[i]
}
Author
psrev.awk
Contents
Synopsis
Download
Description
Code
BEGIN {
# constants
TRUE = 1
FALSE = 0
# Initialize global booleans
Twoup = FALSE
# process command line flags
for (i = 1; i in ARGV && ARGV[i] ~ /^-/; i++) {
if (ARGV[i] == "-2")
Twoup = TRUE
else
printf("psrev: unrecognized option %s\n",
ARGV[i]) > "/dev/stderr"
delete ARGV[i]
}
}
NR == 1, /^%%Page:/ {
if (! /^%%Page/) {
Prolog[++nprolog] = $0
next
}
}
/^%%Trailer/ || In_trailer {
In_trailer = TRUE
Epilog[++nepilog] = $0
next
}
/^%%Page: / {
++Npage
line = 0
}
for all non-special lines
{
Page[Npage, ++line] = $0
}
END {
# print the prologue
for (i = 1; i in Prolog; i++)
print Prolog[i]
# print the actual body
if (Twoup) {
hasodd = (Npage %2 == 1)
if (hasodd) {
# print last page
for (j = 1; (Npage, j) in Page; j++)
print Page[Npage, j]
# make a fake last page for psnup
printf "%%%%Page: %d %d\n", Npage+1, Npage+1
printf "showpage\n"
print "%%BeginPageSetup"
print "BP"
print "%%EndPageSetup"
print "EP"
}
lastpage = (hasodd ? Npage - 1 : Npage)
for (i = lastpage; i > 0; i -= 2) {
for (k = i - 1; k <= i; k++)
for (j = 1; (k, j) in Page; j++)
print Page[k, j]
}
} else {
# regular 1 up printing
for (i = Npage; i > 0; i--)
for (j = 1; (i, j) in Page; j++)
print Page[i, j]
}
# print the epilog
for (i = 1; i in Epilog; i++)
print Epilog[i]
}
Author
indent.awk
Contents
Synopsis
Download
Description
Code
doindent
function doindent(){
tmpindent=indent;
if(indent<0){
print "ERROR; indent level == " indent
}
while(tmpindent >0){
printf(" ");
tmpindent-=1;
}
}
Out-denting
$1 == "done" { indent -=1; }
$1 == "fi" { indent -=1; }
$0 ~ /}/ { if(indent!=0) indent-=1; }
Worker
{
doindent();
print $0;
}
In-denting
$0 ~ /if.*;[ ]*then/ { indent+=1; }
$0 ~ /for.*;[ ]*do/ { indent+=1; }
$0 ~ /while.*;[ ]*do/ { indent+=1; }
$1 == "then" { indent+=1; }
$1 == "do" { indent+=1; }
$0 ~ /{$/ { indent+=1; }
Author
Awk and Operating Systems
XML
XMLgawk
Purpose
Developers
Domain
Description
Current
Use
Date Deployed
Dated
Url
Soccer
Purpose
Installation
#!/usr/bin/gawk -f
Replace this with the full path to the local version of Gawk.
Developers
Organization
Country
Domain
Contact
Email
Description
Awk
Platform
Uses
Lines
DevelopmentEffort
MaintenanceEffort
Current
Users
DateDeployed
Dated
Awk-Linux
Purpose
Developers
Email
Organization
Country
Domain
Description
Awk
Platform
Uses
Current
Use
DateDeployed
References
Url
Simulations for Equation-Based Congestion Control for Unicast Applications
Description
This page has pointers to the simulation scripts for the
Equation-Based Congestion Control for Unicast Applications
by
Sally Floyd, Mark Handley, Jitendra Padhye, and Joerg Widmer,
May 2000,
SIGCOMM 2000.
Download
contents.zip
cd contents
gcc bwcnt2.c -o bwcnt2
gcc bwcnt2a.c -o bwcnt2a
ln -s ~/vint/ns-2/ns ns
./single.com
./tfrm12.com
./queue2.com
./increase.com
./reduce.com
./reduce1.com
Details
From Scripts to Figures
"contents/graphs/s0.interval.mf",
"contents/graphs/s0.loss.mf", and
"contents/graphs/s0.rate.mf".
graphs/s-full-RED.CA.tcpmean
graphs/s-full-DropTail.CA.tcpmean
There are three values for
each data point (from three runs) in these output files. To merge them, use
"contents/merge2.tcl":
merge2.tcl graphs/graphs/s-full-RED.CA.tcpmean > graphs/s-full-RED.CA.tcp
merge2.tcl graphs/graphs/s-full-DropTail.CA.tcpmean > graphs/s-full-DropTail.CA.tcp
Unfortunately, we no longer have the *.mf gnuplot script for
generating the postscript from "s-full-RED.CA.tcp" and
"s-full-DropTail.CA.tcp".
BTW, on a 450MHz Xeon, each graph takes about 7 hours to generate
Intrusion Alert Normalization with Awk
99 Bottles of Beer
#!/usr/bin/awk -f
BEGIN{
split( \
"no mo"\
"rexxN"\
"o mor"\
"exsxx"\
"Take "\
"one dow"\
"n and pas"\
"s it around"\
", xGo to the "\
"store and buy s"\
"ome more, x bot"\
"tlex of beerx o"\
"n the wall" , s,\
"x"); for( i=99 ;\
i>=0; i--){ s[0]=\
s[2] = i ; print \
s[2 + !(i) ] s[8]\
s[4+ !(i-1)] s[9]\
s[10]", " s[!(i)]\
s[8] s[4+ !(i-1)]\
s[9]".";i?s[0]--:\
s[0] = 99; print \
s[6+!i]s[!(s[0])]\
s[8] s[4 +!(i-2)]\
s[9]s[10] ".\n";}}
BEGIN {
for(i = 99; i >= 0; i--) {
print ubottle(i), "on the wall,", lbottle(i) "."
print action(i), lbottle(inext(i)), "on the wall."
print
}
}
function ubottle(n) {
return \
sprintf("%s bottle%s of beer", n ? n : "No more", n - 1 ? "s" : "")
}
function lbottle(n) {
return \
sprintf("%s bottle%s of beer", n ? n : "no more", n - 1 ? "s" : "")
}
function action(n) {
return \
sprintf("%s", n ? "Take one down and pass it around," : \
"Go to the store and buy some more,")
}
function inext(n) {
return n ? n - 1 : 99
}
BEGIN {
# Setup
take = "Take one down, pass it around"
buy = "Go to the store and buy some more"
Instruction[0] = buy
Next[0] = 99
Count[0, 1] = "No more"
Count[0, 0] = "no more"
for (i = 99; i >= 1; i--) {
Instruction[i] = take
Next[i] = i - 1
Count[i, 0] = Count[i, 1] = (i "")
Bottles[i] = "bottles"
}
Bottles[1] = "bottle"
Bottles[0] = "bottles"
# Execution
for (i = 99; i >= 0; i--) {
printf("%s %s of beer on the wall, %s %s of beer.\n",
Count[i, 1],
Bottles[i],
Count[i, 0],
Bottles[i])
printf("%s, %s %s of beer on the wall.\n\n",
Instruction[i],
Count[Next[i], 0],
Bottles[Next[i]])
}
}
Awk and Mail
Shell Statistical Spam Filter and Whitelist
Contents
• About
• Author
• Origin
• Client Side Unix Shell - AWK with updating email address "Whitelist"
• My interpretation of Paul Graham's Spam Article
• How the Spam Filter Works in Unix
• Testing the Spam Filter
• Filter Performance
• White List Filter Improves Performance and Cuts Errors
• Why Differences With Bayes Filters Do Not Matter
• Code
• Test SpamFilter
• Train SpamFilter.
• Spamfilter using statistical filtering.
About
Author
Origin
Client Side Unix Shell - AWK with updating email address "Whitelist"
My interpretation of Paul Graham's Spam Article
How the Spam Filter Works in Unix
Testing the Spam Filter
Filter Performance
White List Filter Improves Performance and Cuts Errors
Why Differences With Bayes Filters Do Not Matter
Code
Test SpamFilter
#!/bin/ksh
filter_test () {
# Split a file of unix email into many mail files with this:
cat ~/Mail/rece* |csplit -k -f good -n 4 - '/^From /' {900}
# Run a modified filter that displays the spam value for each mail file.
# I just commented out the last part of the filter and added a
# print statement of the Subject line and spam value the filter found.
for I in test/good*
do
cat $I | [filter_program-that_shows_the_value_only]
done | sort -n
}
Train SpamFilter.
#!/bin/ksh
number_of_tokens (){
zcat $1 | cat $2 - | wc -w
}
# Note: Get rid of addresses that are commonly forged at the
# "My-Own-Address" string.
address_white_list (){
zcat $1 |
cat $2 - |
egrep '^From |^Return-Path: ' |
nawk '{print tolower($2)}'|
nawk '{gsub ("<",""); gsub (">","");print;}'|
grep -v 'My-Own-Address'|
sort -u > ~/Mail/address_whitelist
}
# Create a hash with probability of spaminess per token.
# Words only in good hash get .01, words only in spam hash get .99
spaminess () {
nawk 'BEGIN {goodnum=ENVIRON["GOODNUM"]; junknum=ENVIRON["JUNKNUM"];}
FILENAME ~ "spamwordfrequency" {bad_hash[$1]=$2}
FILENAME ~ "goodwordfrequency" {good_hash[$1]=$2}
END {
for (word in good_hash) {
if (word in bad_hash) { print word,
(bad_hash[word]/junknum)/ \
((good_hash[word]/goodnum)+(bad_hash[word]/junknum)) }
else { print word, "0.01"}
}
for (word in bad_hash) {
if (word in good_hash) { done="already"}
else { print word, "0.99"}
}}' ~/Mail/spamwordfrequency ~/Mail/goodwordfrequency
}
# Print list of word frequencies
frequency (){
nawk ' { for (i = 1; i <= NF; i++)
freq[$i]++ }
END {
for (word in freq){
if (freq[word] > 2) {
printf "%s\t%d\n", word, freq[word];
}
}
}'
}
# Note: I store the email in compressed files to keep my storage space small,
# so I have the gzipped mail that I run through the filter training
# script as well as current uncompressed "good" and spam files.
#
prepare_data () {
export JUNKNUM=$(number_of_tokens '/Your/home/Mail/*junk*.gz' '/Your/home/Mail/junk')
export GOODNUM=$(number_of_tokens '/Your/home/Mail/*received*.gz' '/Your/home//Mail/received')
address_white_list '/Your/home/Mail/*received*.gz' '/Your/home/Mail/received'
echo $JUNKNUM $GOODNUM
zcat ~/Mail/*junk*.gz | cat ~/Mail/junk - |
frequency|
sort -nr -k 2,2 > ~/Mail/spamwordfrequency
zcat ~/Mail/*received*.gz | cat ~/Mail/received - |
frequency|
sort -nr -k 2,2 > ~/Mail/goodwordfrequency
spaminess|
sort -nr -k 2,2 > ~/Mail/spamprobability
# Clean up files
rm ~/Mail/spamwordfrequency ~/Mail/goodwordfrequency
}
#########
# Main
prepare_data
exit
Spamfilter using statistical filtering.
"| /Your/path/to/bin/spamfilter"
#!/bin/ksh
spamly () {
/usr/bin/nawk '
{ message[k++]=$0; }
END { if (k==0) {exit;} # empty message or was in the whitelist.
good_mail_file="/usr/spool/mail/your_user";
spam_mail_file="/Your/home/Mail/junk";
spam_probability_file="/Your/home/Mail/spamprobability";
total_tokens=0.01;
while (getline < spam_probability_file)
bad_hash[$1]=$2; close(spam_probability_file);
for (line in message){
token_number=split(message[line],tokens);
for (i = 0; i <= token_number; i++){
if (tokens[i] in bad_hash) {
if (bad_hash[tokens[i]] <= 0.06 || bad_hash[tokens[i]] >= 0.94){
total_tokens+=1;
spamtotal+=bad_hash[tokens[i]];
}
}
}
}
if (spamtotal/total_tokens > 0.50) {
for (j = 0; j <= k; j++){ print message[j] >> spam_mail_file}
print "\n\n" >> spam_mail_file;
}
else {
for (j = 0; j <= k; j++){ print message[j] >> good_mail_file}
print "\n\n" >> good_mail_file;
}
}'
}
# Check whitelist for good address.
# if in whitelist then put in good_mail_file
# else Pass message through filter.
whitelister () {
/usr/bin/nawk '
BEGIN { whitelist_file="/Your/home/Mail/address_whitelist";
good_mail_file="/usr/spool/mail/your_user";
found="no";
while (getline < whitelist_file)
whitelist[$1]="address"; close(whitelist_file);
}
{ message[k++]=$0;}
/^From / {sender=tolower($2);
gsub ("\<","",sender);
gsub ("\>","",sender);
if (whitelist[sender]) { found="yes";}
}
/^Return-Path: / {sender=tolower($2);
gsub ("\<","",sender);
gsub ("\>","",sender);
if (whitelist[sender]) { found="yes";}
}
END { if (found=="yes") {
for (j = 0; j <= k; j++){ print message[j] >> good_mail_file}
print "\n\n" >> good_mail_file;
}
else {
for (j = 0; j <= k; j++){ print message[j];}
}
}'
}
#####################################
# Main
# The mail is first checked by the white list, if it is not found in the
# white list it is piped to the spam filter.
whitelister | spamly
exit
Mail Sort
Contents
• Author
• Download
• Description
• Code
• Main
• compute_date
• days_in_month
• canonacalize_subject
• Copyright
Author
Download
Description
Code
Main
BEGIN {
TRUE = 1
FALSE = 0
split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", months, " ")
for (i in months)
Month[months[i]] = i # map name to number
MonthDays[1] = 31
MonthDays[2] = 28 # not used
MonthDays[3] = 31
MonthDays[4] = 30
MonthDays[5] = 31
MonthDays[6] = 30
MonthDays[7] = 31
MonthDays[8] = 31
MonthDays[9] = 30
MonthDays[10] = 31
MonthDays[11] = 30
MonthDays[12] = 31
In_header = FALSE
Body = ""
LocalOffset = 2 # We are two hours ahead of GMT
# These keep --lint happier
Debug = 0
MessageNum = 0
Duplicates = FALSE
}
/^From / {
In_header = TRUE
if (MessageNum)
Text[MessageNum] = Body
MessageNum++
Body = ""
# print MessageNum
}
In_header && /^Date: / {
Date[MessageNum] = compute_date($0)
}
In_header && /^Subject: / {
Subject[MessageNum] = canonacalize_subject($0)
}
In_header && /^Message-[Ii][Dd]: / {
if (NF == 1) {
getline junk
$0 = $0 RT junk # Preserve original input text!
}
# Note: Do not use $0 directly; it's needed as the Body text
# later on.
line = tolower($0)
split(line, linefields)
message_id = linefields[2]
Mesg_ID[MessageNum] = message_id # needed for disambiguating message
if (message_id in Message_IDs) {
printf("Message %d is duplicate of %s (%s)\n",
MessageNum, Message_IDs[message_id],
message_id) > "/dev/stderr"
Message_IDs[message_id] = (Message_IDs[message_id] ", " MessageNum)
Duplicates++
} else {
Message_IDs[message_id] = MessageNum ""
}
}
In_header && /^$/ {
In_header = FALSE
# map subject and date to index into text
if (Debug && (Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]) in SubjectDateId) {
printf(\
("Message %d: Subject <%s> Date <%s> Message-ID <%s> already in" \
" SubjectDateId (Message %d, s: <%s>, d <%s> i <%s>)!\n"),
MessageNum, Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum],
SubjectDateId[Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]],
Subject[SubjectDateId[Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]]],
Date[SubjectDateId[Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]]],
Mesg_ID[SubjectDateId[Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]]]) \
> "/dev/stderr"
}
SubjectDateId[Subject[MessageNum], Date[MessageNum], Mesg_ID[MessageNum]] = MessageNum
if (Debug) {
printf("\tMessage Num = %d, length(SubjectDateId) = %d\n",
MessageNum, length(SubjectDateId)) > "/dev/stderr"
if (MessageNum != length(SubjectDateId) && ! Printed1) {
Printed1++
printf("---> Message %d <---\n", MessageNum) > "/dev/stderr"
}
}
# build up mapping of subject to earliest date for that subject
if (! (Subject[MessageNum] in FirstDates) ||
FirstDates[Subject[MessageNum]] > Date[MessageNum])
FirstDates[Subject[MessageNum]] = Date[MessageNum]
}
{
Body = Body ($0 "\n")
}
END {
Text[MessageNum] = Body # get last message
if (Debug) {
printf("length(SubjectDateId) = %d, length(Subject) = %d, length(Date) = %d\n",
length(SubjectDateId), length(Subject), length(Date))
printf("length(FirstDates) = %d\n", length(FirstDates))
}
# Create new array to sort by thread. Subscript is
# earliest date, subject, actual date
for (i in SubjectDateId) {
n = split(i, t, SUBSEP)
if (n != 3) {
printf("yowsa! n != 3 (n == %d)\n", n) > "/dev/stderr"
exit 1
}
# now have subject, date, message-id in t
# create index into Text
Thread[FirstDates[t[1]], i] = SubjectDateId[i]
}
n = asorti(Thread, SortedThread) # Shazzam!
if (Debug) {
printf("length(Thread) = %d, length(SortedThread) = %d\n",
length(Thread), length(SortedThread))
}
if (n != MessageNum && ! Duplicates) {
printf("yowsa! n != MessageNum (n == %d, MessageNum == %d)\n",
n, MessageNum) > "/dev/stderr"
# exit 1
}
if (Debug) {
for (i = 1; i <= n; i++)
printf("SortedThread[%d] = %s, Thread[SortedThread[%d]] = %d\n",
i, SortedThread[i], i, Thread[SortedThread[i]]) > "DUMP1"
close("DUMP1")
if (Debug ~ /exit/)
exit 0
}
for (i = 1; i <= MessageNum; i++) {
if (Debug) {
printf("Date[%d] = %s\n",
i, strftime("%c", Date[i]))
printf("Subject[%d] = %s\n", i, Subject[i])
}
printf("%s", Text[Thread[SortedThread[i]]]) > "OUTPUT"
}
close("OUTPUT")
close("/dev/stderr") # shuts up --lint
}
compute_date
function compute_date(date_rec, fields, year, month, day,
hour, min, sec, tzoff, timestamp)
{
split(date_rec, fields, "[:, ]+")
if ($2 ~ /Sun|Mon|Tue|Wed|Thu|Fri|Sat/) {
# Date: Thu, 05 Jan 2006 17:11:26 -0500
year = fields[5]
month = Month[fields[4]]
day = fields[3] + 0
hour = fields[6]
min = fields[7]
sec = fields[8]
tzoff = fields[9] + 0
} else {
# Date: 05 Jan 2006 17:11:26 -0500
year = fields[4]
month = Month[fields[3]]
day = fields[2] + 0
hour = fields[5]
min = fields[6]
sec = fields[7]
tzoff = fields[8] + 0
}
if (tzoff == "GMT" || tzoff == "gmt")
tzoff = 0
tzoff /= 100 # assume offsets are in whole hours
tzoff = -tzoff
# crude compensation for timezone
# mktime() wants a local time:
# hour + tzoff yields GMT
# GMT + LocalOffset yields local time
hour += tzoff + LocalOffset
# if moved into next day, reset other values
if (hour > 23) {
hour %= 24
day++
if (day > days_in_month(month, year)) {
day = 1
month++
if (month > 12) {
month = 1
year++
}
}
}
timestamp = mktime(sprintf("%d %d %d %d %d %d -1",
year, month, day, hour, min, sec))
# timestamps can be 9 or 10 digits.
# canonicalize them into 11 digits with leading zeros
return sprintf("%011d", timestamp)
}
days_in_month
function days_in_month(month, year)
{
if (month != 2)
return MonthDays[month]
if (year % 4 == 0 && year % 400 != 0)
return 29
return 28
}
canonacalize_subject
function canonacalize_subject(subj_line)
{
subj_line = tolower(subj_line)
sub(/^subject: +/, "", subj_line)
sub(/^(re: *)+/, "", subj_line)
sub(/[[:space:]]+$/, "", subj_line)
gsub(/[[:space:]]+/, " ", subj_line)
return subj_line
}
Copyright