(2023-04-11) Sharballs > Tarballs
---------------------------------

Nowadays, when people hear the words "archive file", they usually think of something like .zip or .7z (or, if they are completely braindead, something like .rar) files that contain some directory structure in compressed form. Most of them associate archiving with compression and don't have the slightest clue that these are two completely different processes, and that even the Info-ZIP format supports the "store" method that doesn't compress any data. Those who live in a healthier kind of environment surely know about tarballs, but still not every one of them might understand or intuitively get why these tarballs often have two separate suffixes in their names, like .tar.gz, .tar.bz2, .tar.xz and so on. And only those who have seen and worked with .cpio.gz files definitely know the truth about this, because otherwise they wouldn't be able to create a single file in this format. And the truth is that compression algorithms don't work with filesystem structures like directories and files themselves. They only work with continuous streams of data. Turning the former into the latter is the sole task of an archiver. There is a whole lot of software that only archives files and directories without any compression, with tar, ar and cpio being the most famous and popular examples. Yes, modern GNU tar can automatically call the compressor (gzip, bzip2, xz) if we tell it to, but it still is a fully separate stage. We can gunzip a .tar.gz file and still work with the bare .tar file as if it hadn't been created with the gzipping option. This is why archive formats are NOT the same as compression formats, and are an interesting topic on their own.

By the way, I'm not really sure why tar took over cpio for general usage. The cpio format itself is more straightforward and doesn't require 512-byte block alignment, and now allows (and even recommends) using plain ASCII-based file headers, making it a fully plaintext format in case your files are also plaintext. The only _major_ difference is that the cpio command itself (which, in the GNU version, even supports tar/ustar format creation!), in the archival mode, only accepts the _full_ list of files/directories from the standard input and outputs the resulting stream to the standard output. In the extraction mode, it accepts the stream from the standard input. I like this behavior more than tar's, because it is implemented in the true Unix way and serves the initial purposes of cpio (passing complex directory structures over the network or between linear-access storage devices like tapes) much better. The tar command could always simulate this experience, but its default mode is accepting the flags and the archive file first, and the files/directories to add (in the archival mode) afterwards. And, unlike cpio, if you add a single directory to the list, tar will automatically add all the underlying elements recursively. Maybe not having to use the find command for this purpose made tar more appealing to noobs, as well as not having to pipe the output to gzip or whatever for further compression, as it's just a matter of a single-letter flag you pass to the tar command. That's probably why tarballs, whether compressed or not, became a de-facto standard in the modern Unix-like ecosystem, despite cpio being much more suitable for backup-restore scenarios.
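To make the difference tangible, here's roughly what the two workflows look like. These are illustrative commands only, with made-up names like my-dir; the flags are the common GNU/Busybox spellings and details may vary on other systems.

Archive with cpio (the full file list comes from standard input):
find my-dir | cpio -o > my-dir.cpio

Extract with cpio (-d creates the needed directories):
cpio -id < my-dir.cpio

The same with tar, which recurses into the directory and compresses by itself:
tar -czf my-dir.tar.gz my-dir
tar -xzf my-dir.tar.gz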
But what if I told you that there exists an archive format that's even more noob-friendly in terms of unpacking (kinda like SFX-type archives in Faildows), has minimal dependencies to create the archives and zero dependencies to unpack them, and is portable across all POSIX-compliant systems? Interested? Well, this format is called shar (SHell ARchive) and it doesn't have a single specification... well, because it's just a generated shell script that recreates the initial files from the data stored in it. So, there is no separate shar unpacker: the files unpack themselves when passed to sh. And all major differences between various shar flavors come down to the following things: how they store the data internally, which dependencies the archiver requires and which dependencies the self-unpacking script requires. Historically, the shar archiver was a shell script too, but most current shar versions are written in C, and the shell scripts generated by them depend on the echo and mkdir commands and on monsters like sed and uudecode. I personally don't support this approach, as echo can have some caveats in different OS implementations, uuencode and uudecode might not be installed at all, and sed is a Turing-complete language by itself. That's why I naturally decided to create my own version of shar. As a shell script, of course, but, for the first time in all these years, not a Bash-specific one.

Writing a shar clone may seem a very straightforward task until you start thinking about minimizing external dependencies as much as possible. I decided that the minimum requirement for my shar is that it must work at least on Busybox and on the bare KaiOS 2.5.x/Android 6 ADB shell with Toybox or whatever it has there. By "work" I mean both archiving and extraction. The questions that I had put before myself were:

1) What to use instead of echo?
2) What to use instead of uuencode to pack binary files?
3) How to read binaries in a non-Bash-specific way?
4) How to ensure we don't have duplicate input files and directories?
5) How to ensure we don't have EOF markers in our packed content?

The answer to the first two questions came almost instantly: printf. Alas, POSIX printf doesn't have the wonderful %q specifier that would solve 90% of my problems, and I didn't want to make Bash a dependency even for packing only. As for the EOF markers all current shell-only shar versions use, we could use them with variable reading and some end-of-line manipulation, and this is what I tried first. But Android's shell reminded me that this is the case when dumber is smarter. So instead of using EOF markers, I ditched this approach altogether and wrote a function to serialize any file into a series of shell printf calls with a fixed chunk length (because we don't want to overflow the 130K command buffer, do we?). And this function also addresses question number three: use as standard a version of the read builtin as possible with an empty IFS value, and read the file byte by byte. It is slow but reliable. Then, using another printf call, the ASCII code of the byte is retrieved, and then, depending on its value, it is output "as is" or as a hex-encoded \x-sequence. With what? With printf, of course! Now, since the final printf call in the shar file that actually unrolls the bytes will be invoked with the %b specifier, we must also make sure that all single quotes and backslashes are passed in there hex-encoded. That's another two conditions added into our loop.
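To make the encoding step a bit more tangible, here is a rough per-character sketch of the idea. This is not the actual lshar.sh code: the function name encode_char and the exact boundaries are my own illustration, and newline and NUL bytes need extra handling that is omitted here.

encode_char() {                # $1 holds one character read from the input file
  code=$(printf '%d' "'$1")    # the leading-quote trick yields the character's numeric code
  if [ "$code" -ge 32 ] && [ "$code" -le 126 ] && [ "$1" != "'" ] && [ "$1" != '\' ]
  then
    printf '%s' "$1"           # plain printable byte: emit it as is
  else
    printf '\\x%02x' "$code"   # everything else, plus quotes and backslashes, becomes \xHH
  fi
}

A chunk of bytes encoded this way can then land in the generated sharball as an argument to something along the lines of printf '%b' '...' appended to the target file, which is what lets the archive unpack itself with nothing but the shell and printf.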
Once a proper serializer is created, that already is 80% of the success. Now, as shar traditionally accepts the exhaustive list of input files from the command line arguments only, and they can come from various sources, there's no guarantee that the input won't contain duplicates, and, of course, we don't want duplicates in our archive. This is where we must make use of an external command dependency, namely sort -u. Some might argue that sort|uniq might be more portable, but I've actually never seen any sort command version - in GNU/Linux, macOS, Busybox or even Toybox - that wouldn't support the -u flag. Looks portable enough to me, at least from the archive creation standpoint. Apart from that, I wanted the generated shar script to create the entire directory structure _before_ writing any files into it, so a separate loop for this was implemented.

And that actually is it. The entire archiving script, lshar.sh, which I have published in the downloads section of my main hoi.st Gophermap, is exactly 60 SLOC of simple and well-commented code that is portable across various shells. And, just like the very first versions of shar, this script is also released into the public domain. I guess this will be my primary tool for publishing new code and the code migrated from Git repos (something I already suggested in my previous post, by the way). Obviously, just like with any other shar, Lshar can be combined with gzip or a similar tool to achieve compression. Examples:

Archive and compress:
find my-dir | xargs sh lshar.sh | gzip -9 > my-sharball.shar.gz

Decompress and unroll:
gzip -d -c my-sharball.shar.gz | sh

Note that, due to the nature of the script, unrolling is always fast but archiving isn't. Since the serializer processes one byte at a time, it's not the fastest thing in the world (and on KaiOS phones, it's very noticeable), so I'm probably going to walk the path of the original shar creators and write a portable ANSI C89 version of the same tool at some point in the future. For the time being though, it serves its purpose and also is a cool example of working with individual bytes in shell scripts in a way that isn't Bash-specific.

Using tarballs to show respect to the Unix way? Switch to sharballs if you truly love it.

--- Luxferre ---