Some More Home Cooking

2020-10-06

After removing GNU coreutils and taking closer stock of what we have as a "base system", I am mostly happy with the selection but now wish to spend some time making the behaviors, documentation and source code itself more uniform. I have also done some work towards correcting a couple of omissions left over from the previous surgery. The commit that I pushed this morning has a few notable changes.

  • hostid - This is a non-POSIX utility of GNU origin that is present on systems with GNU coreutils but not on BSD based systems. I have written a replacement from scratch in C, which was a fairly trivial task as mentioned previously. The new utility, much like the rest of our userland, behaves like the GNU counterpart with the exception of not accepting long options and not having a help option (help is available via the man-pages, so '--help' is always redundant IMO).
  • nproc - Another scratch implementation of a utility with a GNU origin.
  • rev - A scratch rewrite for HitchHiker that should be functionally identical to the BSD and/or Suckless versions. Notably, my version, the BSD version (which we had previously) and the Suckless version are all quite a bit faster than the rev utility provided by the util-linux package due to architectural differences which I'll explain below.
  • mkdir - Another scratch rewrite. These are being done partially as an exercise in C programming and partially to lessen our dependence on the portability libraries used to port BSD utilities to Linux. However, this utility is currently a special case in that regard (see below).
  • base64 - This one had been ported from NetBSD but is actually broken on every platform when it comes to processing its data from a file rather than stdin. It's now fixed, at least in HitchHiker.

I have also begun formatting the source code imported from other sources using clang-format to enforce more uniform standards, and going through the man-pages a few at a time to do some similar housekeeping. It's a small niggle, but both BSD and Suckless use tabs for indentation where I prefer two spaces for compactness. There are also some inconsistencies in brace styles and function declarations. I prefer the function declaration to be on one line followed by the opening brace on the same line, and for things like loops to include the opening brace on the same line as the loop initialization. Basically, the defaults provided by clang-format make for nice, readable and consistent code.

Now, as mentioned above, let's get into some detail with the implementations of rev and mkdir. Let's begin with rev, which is a simple utility that just reverses the characters in each line. A simple implementation of rev, and basically where I started, would just read each line using getline (or fgetln on a BSD system) and reverse the bytes, excluding the newline, which is put back at the end of the output. This works fine for ASCII characters, which all by definition fit into a single byte of data. However, UTF-8 is a fact of life, and we quickly run into a situation where we have reversed the byte order of a multibyte character, printing gibberish to the terminal. The Util-Linux utility solves this by reading each line into an array as a sequence of wide characters, essentially reading character by character, and then doing the reversal.

The BSD, Suckless, and now HitchHiker utilities all use a similar and more efficient approach. The line is read as described above, then, starting with the character preceding the newline, each byte is tested to see whether it is the beginning of a character or the middle of a multibyte character. If we run into a multibyte character, we find the beginning of it and then read forward again to the end of the character, before skipping backwards again past the multibyte character. While sounding convoluted, this is quite a bit faster when processing large amounts of data, as we're only looking at the first few bits of each byte rather than reading each line in character by character.

Now on to mkdir. Most of this was straightforward to implement, including the -p option, as we just have to construct the path directory by directory by processing it as a string, using, in our case, the strtok function. However, mkdir is also expected to be able to set the Unix permissions of the directories that it creates, and to accept the mode argument in either octal or symbolic format. It is trivial to implement the octal permissions, but implementing symbolic permissions entails coding a parser which must accept a fairly wide range of possible permutations. After looking at how the functionality has been implemented in the BSD, GNU, and Suckless utilities, it quickly became apparent that the BSD implementation makes the most sense in terms of efficient use of code, as the BSD C library already contains the getmode and setmode functions. These have already been ported to Linux via libbsd, which is already present in HitchHiker because it has been used to port much code from both NetBSD and FreeBSD, so I decided to just go ahead and use it. On a side note, it would be fantastic to see some of these functions present in the GNU C library. The getmode function is quite useful, and fgetln is quite a bit more, let's say graceful, to use than getline.

It should be noted that statically linking libbsd into a Beerware licensed utility is somewhat problematic. I have been considering extending the build system somewhat, to also build shared libraries alongside the static archives. I am somewhat hesitant, as I feel that low level base utilities such as this should only depend on the system C library at runtime. Alternatively, if at some point in the future HitchHiker does make the switch to Musl libc (which is still on the table) then it might be possible to patch musl to incorporate the functions right into the system C library for HitchHiker, allowing for their removal from libbsd and lessening our reliance on it for porting utilities. I rather like this train of thought...

The end goal is, of course, the kind of tightly integrated userland that BSD systems are noted for. When looking at the various base utility implementations, it is striking that commonly there is a library of common functions built and linked into the utilities. This tactic is employed by sbase, ubase, GNU coreutils, Util-Linux, and lobase (the port of OpenBSD userland to Linux, from which I have borrowed heavily). I'm employing it myself by using libbsd to port NetBSD and FreeBSD utilities. In the long run, I want to eliminate this trend and fully integrate everything to where we're only depending on the C library both at build and run times. My own implementations, with the exception of mkdir, currently do this. As an example, on BSD systems a program is aware of the name under which it was invoked via the getprogname and setprogname functions. As GNU libc does not have these functions, we have a global constant char __progname, which is set to the program name by calling basename(argv[0]) in the main function. While this results in some additional boilerplate code, it's actually scarcely more than what BSD already has due to the need to call getprogname and/or setprogname.

Similarly, Suckless abstracts away certain things like printing to stderr (their eprintf function) and getting program arguments, which are easily done just using fprintf(stderr, "msg") and getopt, respectively.

A possible eventuality might involve refactoring much of this code somewhat to remove the dependencies on their respective utility libraries and replacing it with more "generic" programming. While this might increase source code size somewhat, it would not likely impact compiled size or efficiency, as we're just removing the abstractions and putting them back into the programs. This has the benefit of making the code easier to understand without having to look at the library and what it does.


Tags for this post:

C programming Utilities NonGNU