(2023-04-29) AWK is underrated, even in the POSIX variant
---------------------------------------------------------

I want to take a little break from writing new stuff. But still, there is one thing that bothers me a lot. Whenever I search for any information on how to do this or that with AWK, especially on StackOverflow-like forums, I constantly stumble upon "solutions" using Bash, Coreutils, sed, Perl and even Python or Ruby. Anything but the AWK that the question authors originally asked about.

I don't know, maybe forum know-it-alls think it's a kind of "XY problem" (which carries a bag of bullshit of its own, but that's another topic), that whoever asked the question chose the wrong tool for the job, that the tool they offer is better and so on, but damn! I'm fluent in Bash, Dash, Python 3.6+, JS (from ES3 to ES6 and whatever came next), C89 and VTL-2, and as such, I have a lot of options to choose from when writing new stuff, but I want to get fluent in AWK as well. So, if I (hypothetically) ask how to do something in AWK, I want an answer about AWK, not about Bash or Python, which I can already write just about everything in, or about Perl, which honestly should have died already. The know-it-alls can't even imagine a situation where someone is left with Busybox and nothing else, and wants to learn how to solve problems with AWK alone for that very reason (it's the only proper programming language available on some systems, and Busybox sed is much more limited than GNU sed too), not because they don't know Perl or whatever.

This is why I have given up on trying to find answers on forums and turned to the sole point of authority: POSIX.1-2017, 2018 edition ([1]). It has some external links (e.g. for printf/sprintf format specifiers ([2]) or for the extended regular expression format ([3])), but this is where everything becomes crystal clear in terms of the features we can use: anything not in there is a non-standard extension.
Compared to the real-life AWK versions I'm using right now (Busybox and GAWK), I'm still missing bitwise operations but, to be honest, they are not necessary everywhere and can be emulated with normal integer arithmetic if required, although that would definitely be slower. To make sure you're on the safe side (mostly), GAWK even has a --posix (or -P) flag to turn on the POSIX compatibility mode. I say "mostly" because no matter which options you set, different implementations handle null bytes in strings differently, and POSIX states the behavior is undefined in this case, so no one is to blame. For instance, in Busybox, you can't have null bytes inside any string as they automatically truncate its contents, while in GAWK they are handled normally even if you don't explicitly pass the -b flag (treat all characters as raw bytes regardless of locale). The POSIX specification is also missing GAWK's epic TCP/UDP socket pseudo-filenames (starting with /inet) and the bidirectional process communication operator (|&).

Yet, despite all this, I consider even the standard AWK criminally underrated. Why? Well, think about how much of the programming around us really boils down to processing text in one way or another. Rendering templates, parsing logs, scraping web pages, collecting reports, emulating terminals, marshalling objects between client and server, most popular client-server protocols and APIs themselves... Not to mention how much smaller Bopher-NG could become if rewritten in AWK, but first, it couldn't be called Bopher anymore, and second, I don't have time for this effort right now. But you get the idea, right? Any task involving text where using C is too tedious is a job for AWK, with its record- and field-oriented engine and extended regular expressions available out of the box. And, if you really need it, basic math is already there too, up to square roots, logarithms, sines, cosines and arctangents, as well as your basic built-in PRNG with rand() and srand().
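To give an idea of what emulating bitwise operations with plain integer arithmetic might look like, here is a minimal sketch. It assumes non-negative integer operands that fit exactly into AWK's floating-point numbers, and the names bit_op and bitwise_demo are made up for this example, not part of any standard. The shell function around the AWK program is only there so the snippet can be run as-is.

```sh
# Hypothetical demo: bitwise AND/OR/XOR using only POSIX AWK arithmetic.
bitwise_demo() {
    awk '
    # Combine two non-negative integers bit by bit.
    # op is "and", "or" or "xor" (names chosen for this sketch only).
    # r, p, x, y are extra parameters used as local variables.
    function bit_op(op, a, b,    r, p, x, y) {
        r = 0; p = 1
        while (a > 0 || b > 0) {
            x = a % 2; y = b % 2            # lowest bit of each operand
            if (op == "and") {
                r += p * (x && y)
            } else if (op == "or") {
                r += p * (x || y)
            } else {
                r += p * (x != y)           # xor
            }
            a = int(a / 2); b = int(b / 2)  # shift both operands right
            p *= 2                          # weight of the next bit
        }
        return r
    }
    BEGIN {
        print bit_op("and", 12, 10)   # 8
        print bit_op("or",  12, 10)   # 14
        print bit_op("xor", 12, 10)   # 6
    }'
}
bitwise_demo
```

The loop handles one bit per iteration, which is exactly why this approach is slower than a native operator would be.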
I don't really know what prevented them from adding bitwise operations to the standard, but the language is already pretty functional for such a tiny package (and I already mentioned that even Busybox AWK, which has them, is just under 3K SLOC long). Of course, this tininess comes at the cost of some convenience: no way of explicitly declaring variables as local (only implicitly, as unused function parameters), 1-based string indexing (as opposed to C-like languages where 0-based indexing is commonplace), no multi-assignment in the initializing clause of for loops (Busybox supports it, but even GAWK doesn't), a single format for numbers (stored as floating-point, even when explicitly cast to integers with int()), a single format for arrays (strictly associative, with all keys cast to strings), but all these are minor quirks compared to what this language is really capable of.

Another thing I'd like to mention is that the AWK specification, while getting minor updates to clarify things from time to time, has stayed like this for a good 35 years or so, and this means that as long as you adhere to POSIX, your programs will run on some ancient systems just as successfully as on the current ones. Yes, you may struggle to replicate the behavior of old C compilers and runtime libraries, you may find incompatibilities across various versions of Perl (not to mention Bash, Lua and Python), you might have issues with compiling J2ME or other old Java 2/3 code on OpenJDK higher than 8 or running REXX on anything modern and non-IBM, you can find your entire JS code not working on KaiOS 2.x because of some ES6 feature not yet present in Gecko 48 back then... but as long as you have an AWK there and an AWK here, and you're not using any non-standard extensions or null bytes in your strings, you can be sure your program will be fully portable to any standard-compatible implementation from 35 years ago and probably 35 years from now.
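Going back to the quirks for a moment, a few of them are easier to see in code than in words. This is only an illustrative sketch: count_char and quirks_demo are hypothetical names invented for it, and the shell function is just a wrapper to make the AWK program runnable as-is.

```sh
# Hypothetical demo of some POSIX AWK quirks.
quirks_demo() {
    awk '
    # i and n are never passed by callers, so they act as local variables.
    function count_char(s, c,    i, n) {
        n = 0
        for (i = 1; i <= length(s); i++)     # strings are 1-based
            if (substr(s, i, 1) == c) n++
        return n
    }
    BEGIN {
        i = 42                               # global i
        print count_char("banana", "a")      # 3
        print i                              # still 42: the function i is local
        print substr("hello", 1, 1)          # h (first character is at 1)
        print index("hello", "l")            # 3
        arr[1] = "x"                         # numeric key is stored as "1"
        print (("1" in arr) ? "same key" : "different keys")
        print int(7.9) + 0.5                 # 7.5 (int() result is still a float)
    }'
}
quirks_demo
```

The extra spaces before i and n in the parameter list are a common convention to separate real parameters from locals, not something the language requires.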
And this is probably where the lack of big-market interest is actually somewhat good: no one is going to try to shove in fancy useless "features" like OOP, template-based programming, decorators and other BS that breaks all compatibility and makes the codebase slower and much bulkier. And, as a good example of "don't try to fix what's not broken", AWK is definitely worth learning and using as an everyday tool.

--- Luxferre ---

[1]: https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/utilities/awk.html
[2]: https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/basedefs/V1_chap05.html#tag_05
[3]: https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/basedefs/V1_chap09.html#tag_09_04