(2023-04-12) On structured text data formats
--------------------------------------------

From time to time, I come across various articles about why this or that format rulez or suxx, endless debates like XML vs. JSON vs. YAML vs. TOML and so on. I'm astounded by the fact that these debates are held in all possible seriousness. As if the readability of a format were determined solely by the syntax itself and not by the way you choose to write your own documents in it. In my practice, I have seen a lot, including nearly unreadable YAMLs and perfectly readable JSONs and XMLs, although, of course, the opposite can be seen a bit more often.

For me, any of these discussions just don't make sense. When choosing an export/configuration format for your project, you only need to answer three simple questions:

1) Does the data need to be readable/editable by humans?
2) How critical is the performance of saving/loading the data by the machine?
3) What is the deepest level of hierarchy that _really_ needs to be stored?

Trust me, everything else is not as important as you think. Moreover, once you answer these questions, the results might surprise you, as they most probably won't match your first assumptions. But it's not about the format, it's about how you arrange your data. And even then, you can make an informed choice towards BOTH readability AND ease of parsing.

Here, I'm going to overview two of the most easily overlooked and underrated structured text data storage formats that would make our lives so much easier if everyone used them in appropriate situations instead of all this zoo.

The first format is something that you, since you're reading this on Gopher, must already be familiar with: TSV, tab-separated values. Yes, Gophermaps are just TSV files and nothing else. The name makes it look like the format was derived from CSV, but in fact it's far more ingenious in its simplicity. Unlike CSV, where you have to escape commas (and don't even get me started on the fucking M$ that actually allowed _semicolon_ instead of comma as a delimiter for some locales, and still calls that abomination CSV), ideally quote all strings with whitespace and commas, escape all the quotes inside and so on, TSV allows you to just write things as is. Because no one uses the tab character, CR or LF in their tabular data or configuration values anyway. If they really have to be used there, they are just replaced with \t, \r and \n respectively, with the backslash itself being escaped as \\ in this case. But that's really it. This format is extremely easy to parse and write, and probably offers the highest machine-friendliness to readability ratio.

The only precaution you must take, if you write TSV files manually in a text editor and not programmatically, is to make sure tabs are saved as tabs and your editor doesn't automatically convert them to spaces. Other than that, it just works. Nothing to complain about, really. Except one thing. The TSV format, exactly because of being so simple, doesn't offer a straightforward way to store hierarchical data of deeper levels than just "list of key-value objects" or "key - fixed-size list of values". Sure, just like with Gophermaps, you can use the first field to store the value (and optionally the type) and all the subsequent fields to store keys and subkeys, but a variable number of fields in each row would greatly reduce both readability and parsing efficiency, and that's not what we want.
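In fact, the whole escaping business fits into a couple of tiny functions. Here's a rough sketch in Python (purely illustrative; the function names are mine, not from any standard library):

def tsv_escape(field):
    # backslash, tab, CR and LF are the only characters ever encoded
    return (field.replace('\\', '\\\\')
                 .replace('\t', '\\t')
                 .replace('\r', '\\r')
                 .replace('\n', '\\n'))

def tsv_unescape(field):
    subst = {'t': '\t', 'r': '\r', 'n': '\n', '\\': '\\'}
    out, i = [], 0
    while i < len(field):
        if field[i] == '\\' and i + 1 < len(field):
            out.append(subst.get(field[i + 1], field[i + 1]))
            i += 2
        else:
            out.append(field[i])
            i += 1
    return ''.join(out)

def write_row(fields):
    # a list of strings -> one TSV line
    return '\t'.join(tsv_escape(f) for f in fields) + '\n'

def read_row(line):
    # one TSV line -> a list of strings
    return [tsv_unescape(f) for f in line.rstrip('\n').split('\t')]

And that's essentially the entire format: one join, one split and four escape sequences.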
So, in case these levels of hierarchy are not enough, we must find something even more ingenious to describe more complex structures, while not introducing another JSON, YAML or, gods forbid, XML level of complexity for machine parsing, and still keeping the format entirely humanly readable AND writeable.

Enter Recfiles. This is a format created under the GNU umbrella to describe relational database-like structures using simple plaintext files. By the way, I won't talk about using Recutils here, I don't care much about them. For now, I'd like to focus on the format itself. As far as I understood, it's about as simple as Gemtext.

1. Comments:
* Any line starting with the # character is a comment
* Comments can only be written on a separate line, # must be the first character in it

2. Fields:
* They are name-value pairs separated with a colon and a whitespace (": " or ":\t")
* Field names are case-sensitive
* Any field name must match this regexp: ^[a-zA-Z%][a-zA-Z0-9_]*$
* Field names starting with % denote metadata, not data
* Any field value must be terminated with LF (except the \LF or LF+ cases)
* If a line ends with \LF instead of LF, the next line continues the value
* Newlines in values are encoded as LF+ and a single optional whitespace
* Fully blank lines are allowed and not counted as fields
* In all other cases, the line after LF must begin with a valid field name

3. Records:
* A record is a group of fields written one after another
* It can contain multiple fields with identical names and/or values
* Records are separated by one or more blank lines (like paragraphs in MD)
* Record size is the number of fields it contains

And this is where the syntax itself ends. Everything else documented about Recfiles, including the notion of record sets and how we describe them using record descriptors (which are just records containing metadata fields only, something like database table schemas), is completely optional, built upon this syntax, and constitutes implementation details specific to a particular set of tools (GNU Recutils). If you're interested in diving deeper into the canonical implementation, GNU Recutils, I recommend their full manual ([1]) for further reading. It really is fascinating. However, with Recfiles being a fully open format, its implementations are not limited to just one, and some other tools adopt simpler modes of operation. The reference recfile parser in Tcl ([2]), for instance, only recognizes the %rec metadata field in the descriptor to turn its value (the record type) into a top-level key in the output dictionary.

I really like this format for the same reason I like Gemtext among others: because it is fully line-oriented. That is, after splitting your text by LF, you can unambiguously determine the type of each line based on the character it starts with. In fact, a single record parser can be defined with a very simple informal algorithm:

1. Initialize an empty string buffer BUF. Set the literal reading mode to off.
2. Read the next line L.
3. If the literal reading mode is on, append the contents of L to BUF and go to step 9.
4. If L is empty, go to step 11.
5. If L starts with #, go to step 2.
6. If L starts with +, skip an optional whitespace after it, append an LF and the rest of L to BUF, emit the flag to update the previously emitted field, then go to step 9.
7. Read all characters until the first colon (:) in L as NAME. If NAME matches the ^[a-zA-Z%][a-zA-Z0-9_]*$ regexp, then save it, otherwise discard it.
8. Clear the BUF value. Read all characters after the first colon (:) in L into BUF. If BUF now starts with a whitespace (0x20 or 0x9), remove this whitespace.
9. If BUF ends with a backslash, then remove it from BUF, turn on the literal reading mode and go to step 2.
10. Turn off the literal reading mode and emit NAME as the current field name and BUF as the current field value. Go to step 2.
11. Report the end of the record. End of algorithm.

Note that the algorithm doesn't tell us what to do with duplicate field names. We determine this ourselves, as well as how to concatenate the + lines.
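Just to show how directly this maps to code, here's a rough single-record reader in Python (an illustrative sketch only - not my readreclx and not a part of Recutils; it defers emitting each field until the next one starts, which produces the same set of fields, and it simply keeps duplicate names in the order they appear):

import re

NAME_RE = re.compile(r'^[a-zA-Z%][a-zA-Z0-9_]*$')

def read_record(lines):
    # reads one record from an iterator of LF-stripped lines;
    # returns a list of (name, value) pairs, an empty list for a run of
    # blank lines, or None when the input is exhausted
    fields = []
    name, buf = None, ''
    literal = False                # the previous line ended with a backslash
    got_line = False
    for line in lines:
        got_line = True
        if literal:                         # step 3: raw continuation
            buf += line
        elif line == '':                    # step 4: end of the record
            break
        elif line.startswith('#'):          # step 5: comment, skip
            continue
        elif line.startswith('+'):          # step 6: encoded newline
            rest = line[1:]
            buf += '\n' + (rest[1:] if rest[:1] in (' ', '\t') else rest)
        else:                               # steps 7-8: a new "name: value" line
            if name and NAME_RE.match(name):
                fields.append((name, buf))  # emit the previous field
            head, sep, tail = line.partition(':')
            name = head if sep else None
            buf = tail[1:] if tail[:1] in (' ', '\t') else tail
        if buf.endswith('\\'):              # step 9: literal continuation follows
            buf = buf[:-1]
            literal = True
        else:
            literal = False
    if name and NAME_RE.match(name):        # steps 10-11: emit the last field
        fields.append((name, buf))
    return fields if got_line else None

# usage: collect all records from a file, skipping extra blank lines
# (example.rec is just a placeholder name)
with open('example.rec') as f:
    stripped = (l.rstrip('\n') for l in f)
    records = []
    while (rec := read_record(stripped)) is not None:
        if rec:
            records.append(rec)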
Now, even if we aren't aiming for a full SQLite3 or TextQL replacement, what can we use this bare format for?

Tabular data? That's the native mode of recfiles, although just using them as a drop-in replacement for CSV or TSV probably won't showcase their full potential. Every record with the same set of unique field names naturally corresponds to a table row. For example, any Gophermap line has a 1-to-1 mapping to a Recfile record.

INI/TOML/.properties-style configuration? Easy! Just use the metadata fields like %rec to name your sections and unique field names in each record. Everything else is the same key-value structure.

JSON/YAML-style configuration of any depth? Also easy:

* all objects and lists of objects are named via the %rec descriptor;
* objects with primitive values are just records with unique key fields;
* lists of primitive values are just records (or parts of records) with non-unique key fields;
* nesting is done by referencing something like '%rec/[name]' instead of the primitive value (see the resolving sketch after the examples below).

Let's take a look at the example YAML from some CloudBees tutorial:

---
doe: "a deer, a female deer"
ray: "a drop of golden sun"
pi: 3.14159
xmas: true
french-hens: 3
calling-birds:
  - huey
  - dewey
  - louie
  - fred
xmas-fifth-day:
  calling-birds: four
  french-hens: 3
  golden-rings: 5
  partridges:
    count: 1
    location: "a pear tree"
  turtle-doves: two

Now, here's how it might look as a Recfile (note that's just one option out of many):

# top record descriptor - may be omitted
%rec: top

doe: a deer, a female deer
ray: a drop of golden sun
pi: 3.14159
xmas: true
french-hens: 3
calling-birds: huey
calling-birds: dewey
calling-birds: louie
calling-birds: fred
xmas-fifth-day: %rec/xmas-fifth-day

# subrecord descriptor
%rec: xmas-fifth-day

calling-birds: four
french-hens: 3
golden-rings: 5
partridges: %rec/partridges
turtle-doves: two

# another subrecord descriptor
%rec: partridges

count: 1
location: a pear tree

Another example might be this highly nested JSON that contains some arrays of objects:

{
  "id": "0001",
  "type": "donut",
  "name": "Cake",
  "ppu": 0.55,
  "batters": {
    "batter": [
      { "id": "1001", "type": "Regular" },
      { "id": "1002", "type": "Chocolate" },
      { "id": "1003", "type": "Blueberry" },
      { "id": "1004", "type": "Devil's Food" }
    ]
  },
  "topping": [
    { "id": "5001", "type": "None" },
    { "id": "5002", "type": "Glazed" },
    { "id": "5005", "type": "Sugar" },
    { "id": "5007", "type": "Powdered Sugar" },
    { "id": "5006", "type": "Chocolate with Sprinkles" },
    { "id": "5003", "type": "Chocolate" },
    { "id": "5004", "type": "Maple" }
  ]
}

And here is how I'd represent it as a Recfile (omitting the toplevel descriptor and comments for brevity this time):

id: 0001
type: donut
name: Cake
ppu: 0.55
batters: %rec/batter
topping: %rec/topping

%rec: batter

batter: %rec/batterlist

%rec: batterlist

id: 1001
type: Regular

id: 1002
type: Chocolate

id: 1003
type: Blueberry

id: 1004
type: Devil's Food

%rec: topping

id: 5001
type: None

id: 5002
type: Glazed

id: 5005
type: Sugar

id: 5007
type: Powdered Sugar

id: 5006
type: Chocolate with Sprinkles

id: 5003
type: Chocolate

id: 5004
type: Maple
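And, purely as an illustration of that referencing convention (my own convention here, not something mandated by any Recfile specification), here's a rough Python sketch of resolving such '%rec/[name]' references back into nested objects and lists, assuming the records have already been parsed and grouped by their %rec name:

def resolve_value(value, sets):
    # a value like '%rec/batter' refers to a whole record set by its name
    if isinstance(value, str) and value.startswith('%rec/'):
        return resolve_set(value[len('%rec/'):], sets)
    return value

def resolve_record(fields, sets):
    # fields is a list of (name, value) pairs from one record;
    # a repeated name turns into a list under that key
    out = {}
    for name, value in fields:
        value = resolve_value(value, sets)
        if name in out:
            if not isinstance(out[name], list):
                out[name] = [out[name]]
            out[name].append(value)
        else:
            out[name] = value
    return out

def resolve_set(name, sets):
    records = [resolve_record(r, sets) for r in sets[name]]
    # one of many possible conventions: a single record maps to an object,
    # several records map to a list
    return records[0] if len(records) == 1 else records

# abridged records of the donut example, grouped by their %rec name
# (None holds the records that appear before any %rec descriptor)
sets = {
    None: [[('id', '0001'), ('type', 'donut'), ('name', 'Cake'),
            ('ppu', '0.55'), ('batters', '%rec/batter'),
            ('topping', '%rec/topping')]],
    'batter': [[('batter', '%rec/batterlist')]],
    'batterlist': [[('id', '1001'), ('type', 'Regular')],
                   [('id', '1002'), ('type', 'Chocolate')]],
    'topping': [[('id', '5001'), ('type', 'None')],
                [('id', '5002'), ('type', 'Glazed')]],
}

print(resolve_set(None, sets))

Of course, under this particular convention a one-element list of objects becomes indistinguishable from a plain object, but that's the price of keeping the mapping this simple.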
Although this example, and even more complex ones, are obviously machine-generated (i.e. the data came as a result of calling some API) and don't resemble anything meant for configuration purposes, such data is still more human-manageable in this format (which also remains easy to write and read programmatically) than when you're trying to guess where exactly a closing bracket went missing and whether it was a curly or a square one. It also doesn't cause eyestrain from the abundance of quotation marks and the need to escape them whenever they are encountered inside your string values. I also could provide an example of how to handle XML-structured data, but I hope you already get the idea.

Regarding non-Recutils implementations of Recfiles, it was also interesting for me to find out that Bash itself, being a GNU project, was supposed to include a readrec builtin command that would facilitate reading a whole record from a file, as opposed to parsing the lines obtained via the read builtin. In fact, however, this "builtin" never became a real builtin shipped with Bash. For it to work, you still need to install Recutils separately (and on my Arch, I had to do this from AUR) and then plug in the readrec.so library like this:

enable -f /usr/lib/readrec.so readrec

Even without the entire package overhead, this particular library, on the x86_64 architecture, weighs about 14K. I'm not really sure whether all this is really necessary just to parse a simple record format and handle special newline cases within field values, especially since the command itself doesn't do much else. Also, contrary to GNU's own specification, this command doesn't enforce a whitespace character after the colon to delimit field values from their names in the input (although it does insert one in the output).

That's why I created a sourceable script for modern Bash versions (4.3 and up) with my own version of the command, readreclx, that mimics readrec's behavior (although it doesn't set the REPLY_REC variable) and weighs under 3K bytes. You can consider it a reference implementation of the algorithm mentioned above. And it looks like it deals with edge cases just fine, although more thorough testing might be required. As usual, I have published this script in my downloads section on hoi.st.

Why did I do this? Because such formats really deserve more attention, more love and more independent implementations.

--- Luxferre ---

[1]: https://www.gnu.org/software/recutils/manual/index.html
[2]: https://wiki.tcl-lang.org/page/recfile