Practical strace: Retrofitting Build Caching

If you look at a build process abstractly, it is basically a function that takes some files as inputs and produces some files as outputs. We can peek into this process by running the build script under strace and asking it to log all file-related system calls. Once we recover the inputs and outputs we can retrofit a caching mechanism on top of the build by hashing the inputs and using that hash as the key under which the outputs are saved.

To make things more concrete I’m going to use a simple script as a stand-in for a build process

#!/bin/bash -eu
set -o pipefail

# Outputs
cat input/a input/b > output/ab
cat input/a input/b input/c > output/abc

If you’re following along then your folder structure should look like below

.
├── build.sh
├── output
└── input
    ├── a
    ├── b
    └── c

In practice the build script and folder structure won’t be so simple but this is good enough for a demonstration.

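Now if I invoke this script with strace I can monitor all file operations and “reverse engineer” what the build script is doing. Here -f follows child processes, -e trace=file restricts the log to system calls that take a file name, -s 500 raises the limit on how much of each string argument is printed, and -o writes the log to build.output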

$ strace -f -s 500 -e trace=file -o build.output ./build.sh

Your output won’t match mine exactly but it should be close enough

589   execve("./build.sh", ["./build.sh"], 0x7fffd6c4ac88 /* 18 vars */) = 0
589   access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
589   access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
589   openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
589   access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
589   openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libtinfo.so.5", O_RDONLY|O_CLOEXEC) = 3
589   access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
589   openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
589   access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
589   openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
589   openat(AT_FDCWD, "/dev/tty", O_RDWR|O_NONBLOCK) = 3
589   openat(AT_FDCWD, "/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
589   stat("/mnt/c/code/strace", {st_mode=S_IFDIR|0777, st_size=512, ...}) = 0
589   stat(".", {st_mode=S_IFDIR|0777, st_size=512, ...}) = 0
589   stat("/mnt", {st_mode=S_IFDIR|0755, st_size=512, ...}) = 0
589   stat("/mnt/c", {st_mode=S_IFDIR|0755, st_size=512, ...}) = 0
589   stat("/mnt/c/code", {st_mode=S_IFDIR|0755, st_size=512, ...}) = 0
589   stat("/mnt/c/code/strace", {st_mode=S_IFDIR|0777, st_size=512, ...}) = 0
589   stat("/mnt/c/code", {st_mode=S_IFDIR|0755, st_size=512, ...}) = 0
589   openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache", O_RDONLY) = 3
# ...

I was surprised by how many file operations are involved in such a basic script. Real build scripts will have even more going on, but the calls we’re interested in are stat and openat, because those are the system calls that tell us which files the build script is reading from and writing to.

After I posted this, nebkor pointed out that there is already a build system built on these ideas. If you dig around you’ll see that the full set of system calls it uses to reverse engineer the inputs and outputs consists of a lot more than stat and openat.

If we look at just the stat calls we see nothing interesting, mostly the shell checking directories and searching PATH for cat, but in a real build script this would probably contain some useful information

$ grep 'stat' build.output
589   stat("/mnt/c/code/strace", {st_mode=S_IFDIR|0777, st_size=512, ...}) = 0
589   stat(".", {st_mode=S_IFDIR|0777, st_size=512, ...}) = 0
589   stat("/mnt", {st_mode=S_IFDIR|0755, st_size=512, ...}) = 0
589   stat("/mnt/c", {st_mode=S_IFDIR|0755, st_size=512, ...}) = 0
589   stat("/mnt/c/code", {st_mode=S_IFDIR|0755, st_size=512, ...}) = 0
589   stat("/mnt/c/code/strace", {st_mode=S_IFDIR|0777, st_size=512, ...}) = 0
589   stat("/mnt/c/code", {st_mode=S_IFDIR|0755, st_size=512, ...}) = 0
589   stat("./build.sh", {st_mode=S_IFREG|0777, st_size=116, ...}) = 0
589   stat(".", {st_mode=S_IFDIR|0777, st_size=512, ...}) = 0
589   stat("/usr/local/sbin/cat", 0x7ffff6c82a20) = -1 ENOENT (No such file or directory)
589   stat("/usr/local/bin/cat", 0x7ffff6c82a20) = -1 ENOENT (No such file or directory)
589   stat("/usr/sbin/cat", 0x7ffff6c82a20) = -1 ENOENT (No such file or directory)
589   stat("/usr/bin/cat", 0x7ffff6c82a20) = -1 ENOENT (No such file or directory)
589   stat("/sbin/cat", 0x7ffff6c82a20) = -1 ENOENT (No such file or directory)
589   stat("/bin/cat", {st_mode=S_IFREG|0755, st_size=35064, ...}) = 0
589   stat("/bin/cat", {st_mode=S_IFREG|0755, st_size=35064, ...}) = 0
589   stat("/bin/cat", {st_mode=S_IFREG|0755, st_size=35064, ...}) = 0
589   stat("/bin/cat", {st_mode=S_IFREG|0755, st_size=35064, ...}) = 0
589   stat("/bin/cat", {st_mode=S_IFREG|0755, st_size=35064, ...}) = 0
589   stat("/bin/cat", {st_mode=S_IFREG|0755, st_size=35064, ...}) = 0

Now let’s look at which files the script was reading from (there is again some extraneous cruft, which can easily be filtered out if desired)

$ grep 'O_RDONLY' build.output
# ...
590   openat(AT_FDCWD, "input/a", O_RDONLY) = 3
590   openat(AT_FDCWD, "input/b", O_RDONLY) = 3
# ...
591   openat(AT_FDCWD, "input/a", O_RDONLY) = 3
591   openat(AT_FDCWD, "input/b", O_RDONLY) = 3
591   openat(AT_FDCWD, "input/c", O_RDONLY) = 3

This mirrors exactly what we had in our build script (the first column is the process ID, so 590 and 591 are the two cat invocations). We first read 2 files to create one output and then read 3 files to create another. In a real script things would be more complicated but the gist is that looking at which files are read during the build tells you what the inputs are. So we just recovered the inputs to our build process

input/a
input/b
input/c
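The filtering can also be scripted rather than done by eye. Here is a minimal sketch, assuming the log has the same shape as mine (one system call per line, with the path as the first quoted argument). The name list_inputs is made up for illustration, and dropping absolute paths rests on the assumption that the build refers to its own inputs relative to the working directory.

```shell
# list_inputs LOG: print the unique relative paths that were successfully
# opened read-only in an strace log. (Hypothetical helper, not part of strace.)
list_inputs() {
  # Splitting on double quotes makes $2 the first quoted argument, i.e. the
  # path. Failed calls (" = -1 ") and absolute paths such as shared libraries
  # are skipped.
  awk -F'"' '/openat\(/ && /O_RDONLY/ && !/ = -1 / && $2 !~ /^\// { print $2 }' "$1" \
    | sort -u
}
```

Calling list_inputs build.output on the log from above should print the same three paths. Paths that themselves contain a double quote would confuse the field splitting, so treat this as a starting point.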

Now we need to recover the output(s) but that’s easy: instead of looking for files that were opened for reading we look for files that were opened for writing (a build that opens files with O_RDWR would need that flag in the pattern as well)

$ grep 'O_WRONLY' build.output
590   openat(AT_FDCWD, "output/ab", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
591   openat(AT_FDCWD, "output/abc", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3

This again mirrors what we had in the script so this gives us the output files

output/ab
output/abc
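The output side can be scripted in the same spirit. This is a sketch under the same assumptions about the log format. The name list_outputs is hypothetical, and it matches O_RDWR as well as O_WRONLY since either flag permits writing.

```shell
# list_outputs LOG: print the unique relative paths that were successfully
# opened for writing in an strace log. (Hypothetical helper.)
list_outputs() {
  # O_WRONLY and O_RDWR both allow writes. $2 is the first quoted argument
  # (the path); failed calls and absolute paths like /dev/tty are skipped.
  awk -F'"' '/openat\(/ && /O_WRONLY|O_RDWR/ && !/ = -1 / && $2 !~ /^\// { print $2 }' "$1" \
    | sort -u
}
```

Running list_outputs build.output against the log from above should print output/ab and output/abc.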

We now have everything to retrofit a cache system on top of the build process.

The first thing we need is a key for the outputs, and that key has to depend on the content of the inputs so that any change to an input produces a different key. One way to accomplish this is to hash the contents of the inputs. I’m piping a tar stream into shasum because in practice this seems to be the fastest way to generate a content hash that depends on various files and folders. Save this as keygen.sh

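#!/bin/bash -eu
set -o pipefail

# List every input file, NUL-delimited and sorted so the archive order is stable.
find input -type f -print0 | sort -z > inputs
# Stream an uncompressed tar of those files into shasum. Pinning the mtime
# keeps timestamps out of the hash; ownership and permissions still count.
tar -P --mtime='1970-01-01' --null \
  --format=ustar \
  --files-from=inputs \
  -cf - | shasum -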

Running that script should give you output like the following (your hash will differ unless your input files match mine exactly) and we can use the hash as a key for saving the output(s)

$ ./keygen.sh
6428f5771007cf005037d47c9aeac9bfcc8925f9  -

The actual caching script is pretty simple after we generate the key

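#!/bin/bash -eu
set -o pipefail

# The first field of the keygen output is the hash; the second is the "-" file name.
key="$(./keygen.sh | awk '{print $1}')"
# Replace any stale archive for this key, then store the outputs xz-compressed.
rm -f "${key}.txz"
tar cJf "${key}.txz" output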

And that’s it. We just retrofitted a caching system onto a build process after reverse engineering its inputs and outputs with strace. To use this cache in production you’d write one more script that computes the key and checks whether the matching cache file exists. If it does, skip the build and decompress the cache file instead; if it doesn’t, run the build and save the outputs under that key.
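A minimal sketch of that last script might look like the following. It is written as a function so it can be dropped into an existing wrapper. The name cached_build and the cache directory are my own assumptions, and it expects keygen.sh and build.sh from above to be in the current directory.

```shell
# cached_build: restore output/ from the cache when an archive for the current
# key exists, otherwise run the build and store its outputs under that key.
# Hypothetical wrapper: assumes ./keygen.sh and ./build.sh exist, and "cache"
# is an arbitrary directory name.
cached_build() {
  local cache_dir key archive
  cache_dir="cache"
  mkdir -p "$cache_dir" || return 1

  # keygen.sh prints "<hash>  -"; keep only the hash.
  key="$(./keygen.sh | awk '{print $1}')"
  [ -n "$key" ] || { echo "could not compute cache key" >&2; return 1; }
  archive="${cache_dir}/${key}.txz"

  if [ -f "$archive" ]; then
    echo "cache hit: ${key}"
    tar xJf "$archive"
  else
    echo "cache miss: ${key}"
    ./build.sh || return 1
    # Write under a temporary name first so an interrupted run never leaves
    # a truncated archive behind under the real key.
    tar cJf "${archive}.tmp" output && mv "${archive}.tmp" "$archive"
  fi
}
```

The first call with a given set of inputs should report a miss and run the build. A second call with unchanged inputs should report a hit and only unpack the archive. Keeping the archives in their own directory also means they never end up next to the inputs being hashed.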