(2023-04-11) Sharballs > Tarballs
---------------------------------

Nowadays, when people hear the words "archive file", they usually think of something like .zip or .7z (or, if they are completely braindead, something like .rar) files that contain some directory structure in compressed form. Most of them associate archiving with compression and don't have the slightest clue that these are two completely different processes, and that even the Info-ZIP format supports the "store" method that doesn't compress any data. Those who live in a healthier kind of environment surely know about tarballs, but still not every one of them might understand or intuitively get why these tarballs often have two separate suffixes in their names, like .tar.gz, .tar.bz2, .tar.xz and so on. And only those who have seen and worked with .cpio.gz files definitely know the truth about this, because otherwise they wouldn't be able to create a single file in this format. And the truth is that compression algorithms don't work with filesystem structures like directories and files themselves. They only work with continuous streams of data. Turning the former into the latter is the sole task of an archiver. There is a whole lot of software that only archives files and directories without any compression, with tar, ar and cpio being the most famous and popular examples. Yes, modern GNU tar can automatically call the compressor (gzip, bzip2, xz) if we tell it to, but it still is a fully separate stage. We can gunzip a .tar.gz file and still work with the bare .tar file as if it hadn't been created with the gzipping option. This is why archive formats are NOT the same as compression formats, and are an interesting topic on their own.

By the way, I'm not really sure why tar took over cpio for general usage. The cpio format itself is more straightforward and doesn't require 512-byte block alignment, and now allows (and even recommends) using plain ASCII-based file headers, making it a fully plaintext format in case your files are also plaintext. The only _major_ difference is that the cpio command itself (which, in the GNU version, even supports tar/ustar format creation!), in the archival mode, only accepts the _full_ list of files/directories from the standard input and outputs the resulting stream to the standard output. In the extraction mode, it accepts the stream from the standard input. I like this behavior more than tar's, because it is implemented in the true Unix way and serves the initial purposes of cpio (passing complex directory structures over the network or between linear-access storage devices like tapes) much better. The tar command could always simulate this experience, but its default mode is accepting the flags and the archive file first, and the files/directories to add (in the archival mode) afterwards. And, unlike cpio, if you add a single directory to the list, tar will automatically add all the underlying elements recursively. Maybe not having to use the find command for this purpose made tar more appealing to noobs, as well as not having to pipe the output to gzip or whatever for further compression, as it's just a matter of a single-letter flag you pass to the tar command. That's probably why tarballs, whether compressed or not, became a de-facto standard in the modern Unix-like ecosystem, despite cpio being much more suitable for backup-restore scenarios.
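To make the difference tangible, here's roughly what the two workflows look like. These are illustrative commands only, with made-up names like my-dir; the flags are the common GNU/Busybox spellings and details may vary on other systems.

Archive with cpio (the full file list comes from standard input):
find my-dir | cpio -o > my-dir.cpio

Extract with cpio (-d creates the needed directories):
cpio -id < my-dir.cpio

The same with tar, which recurses into the directory and compresses by itself:
tar -czf my-dir.tar.gz my-dir
tar -xzf my-dir.tar.gz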
But what if I told you that there exists an archive format that's even more noob-friendly in terms of unpacking (kinda like SFX-type archives in Faildows), has minimal dependencies to create the archives and zero dependencies to unpack them, and is portable across all POSIX-compliant systems? Interested? Well, this format is called shar (SHell ARchive) and it doesn't have a single specification... well, because it's just a generated shell script that recreates the initial files from the data stored in it. So, there is no separate shar unpacker: the files unpack themselves when passed to sh. And all major differences between various shar flavors come down to the following things: how they store the data internally, which dependencies the archiver requires and which dependencies the self-unpacking script requires. Historically, the shar archiver was a shell script too, but most current shar versions are written in C, and the shell scripts generated by them depend on the echo and mkdir commands and on monsters like sed and uudecode. I personally don't support this approach, as echo can have some caveats in different OS implementations, uuencode and uudecode might not be installed at all, and sed is a Turing-complete language by itself. That's why I naturally decided to create my own version of shar. As a shell script, of course, but, for the first time in all these years, not a Bash-specific one.

Writing a shar clone may seem a very straightforward task until you start thinking about minimizing external dependencies as much as possible. I decided that the minimum requirement for my shar is that it must work at least on Busybox and on the bare KaiOS 2.5.x/Android 6 ADB shell with Toybox or whatever it has there. By "work" I mean both archiving and extraction. The questions that I had put before myself were:

1) What to use instead of echo?
2) What to use instead of uuencode to pack binary files?
3) How to read binaries in a non-Bash-specific way?
4) How to ensure we don't have duplicate input files and directories?
5) How to ensure we don't have EOF markers in our packed content?

The answer to the first two questions came almost instantly: printf. Alas, POSIX printf doesn't have the wonderful %q specifier that would solve 90% of my problems, and I didn't want to make Bash a dependency even for packing only. As for the EOF markers all current shell-only shar versions use, we could use them with variable reading and some end-of-line manipulation, and this is what I tried first. But Android's shell reminded me that this is the case when dumber is smarter. So instead of using EOF markers, I ditched this approach altogether and wrote a function to serialize any file into a series of shell printf calls with a fixed chunk length (because we don't want to overflow the 130K command buffer, do we?). And this function also addresses question number three: use as standard a version of the read builtin as possible with an empty IFS value, and read the file byte by byte. It is slow but reliable. Then, using another printf call, the ASCII code of the byte is retrieved, and then, depending on its value, it is output "as is" or as a hex-encoded \x-sequence. With what? With printf, of course! Now, since the final printf call in the shar file that actually unrolls the bytes will be invoked with the %b specifier, we must also make sure that all single quotes and backslashes are passed in there hex-encoded. That's another two conditions added into our loop.
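To make the encoding step a bit more tangible, here is a rough per-character sketch of the idea. This is not the actual lshar.sh code: the function name encode_char and the exact boundaries are my own illustration, and newline and NUL bytes need extra handling that is omitted here.

encode_char() {                # $1 holds one character read from the input file
  code=$(printf '%d' "'$1")    # the leading-quote trick yields the character's numeric code
  if [ "$code" -ge 32 ] && [ "$code" -le 126 ] && [ "$1" != "'" ] && [ "$1" != '\' ]
  then
    printf '%s' "$1"           # plain printable byte: emit it as is
  else
    printf '\\x%02x' "$code"   # everything else, plus quotes and backslashes, becomes \xHH
  fi
}

A chunk of bytes encoded this way can then land in the generated sharball as an argument to something along the lines of printf '%b' '...' appended to the target file, which is what lets the archive unpack itself with nothing but the shell and printf.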
Once a proper serializer is created, that already is 80% of the success. Now, as shar traditionally accepts the exhaustive list of input files from the command line arguments only, and they can come from various sources, there's no guarantee that the input won't contain duplicates, and, of course, we don't want duplicates in our archive. This is where we must make use of an external command dependency, namely sort -u. Some might argue that sort|uniq might be more portable, but I've actually never seen any sort command version - in GNU/Linux, macOS, Busybox or even Toybox - that wouldn't support the -u flag. Looks portable enough to me, at least from the archive creation standpoint. Apart from that, I wanted the generated shar script to create the entire directory structure _before_ writing any files into it, so a separate loop for this was implemented.

And that actually is it. The entire archiving script, lshar.sh, which I have published in the downloads section of my main hoi.st Gophermap, is exactly 60 SLOC of simple and well-commented code that is portable across various shells. And, just like the very first versions of shar, this script is also released into the public domain. I guess this will be my primary tool for publishing new code and the code migrated from Git repos (something I already suggested in my previous post, by the way). Obviously, just like with any other shar, Lshar can be combined with gzip or a similar tool to achieve compression. Examples:

Archive and compress:
find my-dir | xargs sh lshar.sh | gzip -9 > my-sharball.shar.gz

Decompress and unroll:
gzip -d -c my-sharball.shar.gz | sh

Note that, due to the nature of the script, unrolling is always fast but archiving isn't. Since the serializer processes one byte at a time, it's not the fastest thing in the world (and on KaiOS phones, it's very noticeable), so I'm probably going to walk the path of the original shar creators and write a portable ANSI C89 version of the same tool at some point in the future. For the time being though, it serves its purpose and also is a cool example of working with individual bytes in shell scripts in a way that isn't Bash-specific.

Using tarballs to show respect to the Unix way? Switch to sharballs if you truly love it.

--- Luxferre ---