Simpler Dataflow Language

In my year at TAI, I learned something very important: bootstrap all new designs off existing technology because, frankly, your design could use some work. There is no substitute for working code, especially when your goal is to refine that code.

Well, I took that advice yesterday and made a little language that lets you set up a pipeline between *NIX commands, where the commands can have multiple input and output streams. I believe this is equivalent to a Hartmann pipeline if one is clever enough with xargs. The utility is called 'lace', and its source code, which contains the following example, can be found at the bottom of this post.

So why all this dataflow business? Well, pipelines in the shell always bugged me, since many problems have a split/join attribute to them. To solve this in a script, you have to use temporary files, explicit file descriptors, or named pipes. All of these have downsides. Temporary files are a pain to manage and clean up. File descriptors require cunning in your command layout and are difficult to maintain since numbers are being hard-coded. Named pipes are also a pain to manage and have strange blocking semantics in the shell, causing elusive deadlocks.
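To make that bookkeeping concrete, here is a minimal sketch of a split/join done with named pipes in plain shell. The function name split_join, the pipe names, and both branch commands are made up for illustration. It behaves on small inputs; I make no promises about larger ones, where buffering decides whether the branches stall.

```shell
#!/bin/sh
# Sketch only: split stdin into two branches and join them line by line.
# split_join, the pipe names, and both branch commands are hypothetical.
split_join() {
    dir=$(mktemp -d) || return 1
    mkfifo "$dir/left" "$dir/left_out" || { rm -rf "$dir"; return 1; }

    # Left branch: print the length of every line it receives.
    awk '{ print length($0) }' < "$dir/left" > "$dir/left_out" &

    # tee is the split, tr is the right branch, paste is the join.
    tee "$dir/left" | tr 'a-z' 'A-Z' | paste "$dir/left_out" -

    wait              # reap the background left branch
    rm -rf "$dir"     # the named pipes are ours to clean up
}

# Prints each line's length, a tab, then the upper-cased line.
printf 'ab\ncde\n' | split_join
```

Even for two branches, the script has to create the pipes, start one branch in the background, and remember to clean up afterwards.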

Anyway, below is a Lace script. At the top, a computer lab map is defined. This computer lab is totally unrealistic because all the machines are named localhost. One can just as easily substitute real machine names into the map. The script visits each host, checks what user is signed on, and replaces the host name with the user name in the map. This happens for each host and the modified map is output at the end. Note that some of these steps happen simultaneously.

I wrote this script to hand back assignments faster, since I TA an introductory programming lab. At this point, I know most people's names, but I am tired of gawking over the shoulders of those I don't know.

## This code replaces all the computer names in /lab/
## with the user id on each machine!
## Non-hostnames in the map must start with exclamation points.
## You should have auto-login to these hosts.

$(H lab)
 !door!  localhost   127.0.0.1     !door
      !  localhost   localhost     !
      !  localhost   localhost     !
      !                            !
      !          !projector        !
      !                            !
      !  localhost   localhost     !door
$(H lab)

echo $(H lab) $(O a)

$(H clientcmd)
who | awk '{ print "host " $0 ; }'
$(H clientcmd)

sed $(X a) $(O a) \
    -e 's/^[[:space:]]\+/ /' \
    -e 's/^ //' \
    -e 's/ $//'

tr $(X a) ' ' '\n' $(O a)

grep $(X a) -v '^!' $(O a)

#cat $(XF a)

#$(H comment)
xargs $(X a) -n 1 -ihost ssh -o StrictHostKeyChecking=no host \
    sh -c $(H clientcmd) $(O users)


$(H subst)
{ printf ("s/\\<%s\\>/%-*s/\n", $1, length($1), $2); }
$(H subst)

awk $(X users) $(H subst) $(O a)

echo $(H lab) $(O b)

sed $(X b) -f $(XF a)
#$(H comment)


Here's the rundown.

The "$(H name)" directive defines a Here document (like in shell scripting) when it appears at the start of a line. The Here document ends when "$(H name)" is encountered again on its own line. When the directive appears anywhere else on a line, it is treated as a variable: the Here document's text expands into a single argument.
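For comparison, here is roughly how the same idea looks in plain shell: capture a Here document in a variable, then pass it quoted so that it arrives as one argument. This is a sketch, not how Lace works internally. The awk program is the "subst" document from the script above, the input line "localhost alex" is invented, and it assumes an awk whose printf accepts a "*" width (gawk and mawk do).

```shell
#!/bin/sh
# Shell analogue of defining $(H subst) and later using it as an argument.
# The quoted 'EOF' keeps the backslashes and dollar signs literal.
subst=$(cat <<'EOF'
{ printf ("s/\\<%s\\>/%-*s/\n", $1, length($1), $2); }
EOF
)

# "$subst" is passed to awk as a single argument, newlines and all.
# The input stands in for one line of the users stream: host, then user.
printf 'localhost alex\n' | awk "$subst"
```

The line printed is a Sed command that swaps the host name for the user name, padded to the host name's width so the map's columns stay aligned.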

The "$(X name)" and "$(O name)" directives specify standard input and standard output for a command. If either is unspecified, Lace's own standard input or output is used. These directives can appear anywhere on a command line; placement doesn't matter, and they are removed from the final command arguments. Also note that an output must appear before the input that consumes it. Outputs and inputs have a 1-to-1 correspondence, so no output can feed two inputs (use tee to get around this).

The "$(XF name)" and "$(OF name)" directives also define inputs and outputs for a command, but these expand to file names (/dev/fd/XX). They can be paired with the "$(X name)" and "$(O name)" forms (i.e., they are part of the same 1-to-1 correspondence). A real difference from standard *NIX pipelining begins to show with these directives. The final Sed command is
  sed $(X b) -f $(XF a)
which takes the lab map on standard input and reads a dynamically generated Sed script through a file name in /dev/fd/ that refers to an open file descriptor. This is doable in standard shell programming, but is more awkward to write. More complex routing is possible, of course.
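For a sense of the standard-shell version, here is one way that final Sed step might be wired up by hand with explicit descriptors. It is a sketch under assumptions: emit_map and emit_script are hypothetical stand-ins for the two upstream branches, the descriptor numbers 3 and 4 are arbitrary, and the system must provide /dev/fd.

```shell
#!/bin/sh
# Hypothetical stand-ins for the "b" stream (the map) and the "a" stream
# (the generated Sed script) in the Lace example.
emit_map()    { printf '%s\n' '! hostA hostB !'; }
emit_script() { printf '%s\n' 's/hostA/alice/' 's/hostB/bob/'; }

rewrite_map() {
    emit_map | {
        # Here stdin is the script pipe. Park it on fd 3, then restore
        # the map (saved on fd 4 by the outer redirection) as stdin.
        emit_script | sed -f /dev/fd/3 3<&0 0<&4
    } 4<&0
}

rewrite_map
```

The numbers 3 and 4 are exactly the kind of hard-coded detail that gets harder to maintain as a pipeline grows; in Lace the names a and b carry that routing instead.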

All in all, I actually found this code easier to write than if I had tried writing a big pipeline within standard shell script syntax. The problem was easier to dissect without worrying where data would have to travel. That said, the two scripting languages are not at odds. Lace scripts can be called from shell scripts, much like Awk and Sed.

The two-days-in source code of lace is here: lace-orig.tar.gz

It's in the public domain, enjoy.
-- Alex

2011.10.17 - last edit >> Thu, 10 Nov 2011 00:00:25 -0500