file system as a database

So it turns out that mv is an atomic operation if the source and destination are on the same file system, because in that case it boils down to a single rename(2) call. This means that transforming and shuttling data between files can in theory be done in a transactional manner, at the cost of some extra space.
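In code the trick is the classic write-then-rename pattern. Here is a minimal sketch (the helper name and error handling are mine, not anything prescribed):

```python
import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` with `data`; readers see the old file or the new one, never a mix."""
    directory = os.path.dirname(path) or "."
    # The temporary file must live on the same file system as the destination,
    # otherwise the final rename degrades into a non-atomic copy.
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes are on disk before the rename
        os.replace(tmp, path)  # a single rename(2): this is the atomic step
    except BaseException:
        os.unlink(tmp)
        raise
```

Here is an example workflow that came up at work: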

  1. A user uploads a file to the server
  2. The file is moved to a non-temporary folder and a symlink to it is created in another folder to indicate that it needs to be processed. Notice that the move is atomic, so at no point is the file visible to anyone in a half-copied state. It also means that even if we later move another file on top of it, any work that has already started on the old file will not be corrupted, because open handles keep the old inode alive
  3. A cron job or some other process reads the symlink and begins processing the file, as sketched after this list. The results are written incrementally to a staging folder. If something goes wrong at this point the only price we pay is repeated work, because we only remove the symlink once we get as far as atomically moving the result into another folder to indicate successful completion
  4. Once processing is complete the result is moved into a third directory and the symlink is removed to indicate completion. Once again, notice that if something goes wrong and we don’t remove the symlink, the only price we pay is extra work; at no point is any data in a half-finished state
  5. Now another process comes along and does further work using the same pattern: save progress in a temporary place, atomically move it into another folder for further processing, and indicate completion by removing any symlinks from the previous place
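Here is roughly what one stage of that pipeline looks like. This is a sketch: the directory names and the transform stand-in are made up for illustration:

```python
import os

INCOMING = "incoming"   # symlinks to files that still need processing
STAGING = "staging"     # partial results; safe to throw away after a crash
DONE = "done"           # finished results, moved here atomically

def transform(data: bytes) -> bytes:
    return data  # stand-in: the real processing goes here

def process_pending() -> None:
    for name in os.listdir(INCOMING):
        link = os.path.join(INCOMING, name)
        source = os.path.realpath(link)   # follow the symlink to the real file
        staged = os.path.join(STAGING, name)
        # Write results incrementally; a crash anywhere before the rename below
        # only costs us repeated work on the next run.
        with open(source, "rb") as src, open(staged, "wb") as dst:
            dst.write(transform(src.read()))
        os.replace(staged, os.path.join(DONE, name))  # atomically publish the result
        os.unlink(link)                               # mark the work item as done
```

The key property is that every externally visible state change is a single rename, so any observer only ever sees the state before a step or the state after it.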

This procedure gets us almost all the way there, but there is a subtle bug in the pipeline: a race condition between an upload and the completion of the first processing stage. While we are processing a file it is possible for an upload to overwrite it. That by itself is not a problem, since the previous file will still be processed to completion, but when we finish we will remove a symlink that now points at a file newer than the one we were processing. That newer file will then never be processed, because its symlink is deleted the moment we are done with the older one.

Fortunately there is a fix for this problem. Before the first stage of the pipeline begins, we acquire a lock and release it when we are done processing the file. The upload process must also acquire the lock to create the symlink, so it blocks until the first stage of the pipeline is done. Once unblocked, the upload process re-creates the symlink, and on the next iteration the pipeline picks up the newer file. We only need to guard the first stage because that is the only place where the race condition exists.
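In code the scheme might look like the following sketch. Nothing here prescribes a particular locking primitive, so this assumes flock-style advisory locks; the lock file name and helper functions are my own invention:

```python
import fcntl
import os
from contextlib import contextmanager

LOCK_PATH = "stage1.lock"  # hypothetical lock file shared by both processes

@contextmanager
def stage1_lock():
    """Advisory lock serializing uploads against the first pipeline stage."""
    fd = os.open(LOCK_PATH, os.O_CREAT | os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the current holder releases
        yield
    finally:
        os.close(fd)  # closing the descriptor also releases the flock

# Upload process: the move and the symlink only happen while holding the lock,
# so an upload that lands mid-run blocks until stage one is finished.
def publish_upload(tmp_path: str, store_path: str, link_path: str) -> None:
    with stage1_lock():
        os.replace(tmp_path, store_path)  # atomic move into place
        # os.symlink refuses to overwrite, so create the link under a temporary
        # name and rename over the old one to atomically (re)point it.
        tmp_link = link_path + ".tmp"
        os.symlink(store_path, tmp_link)
        os.replace(tmp_link, link_path)

# First pipeline stage: hold the lock for the whole processing run.
def run_stage1() -> None:
    with stage1_lock():
        process_pending()  # from the earlier sketch
```

Of course, all of this assumes nothing crashes while performing these operations. Nothing bad happens if the processing pipeline crashes, because in all of those cases we just do extra work. There is a problem, though, if the upload process crashes after the file is moved into place but before the symlink is created to indicate that we need to process the file.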

To handle the crashed upload process, we touch a marker file before moving the upload into place and creating the symlink, and delete the marker once the symlink is created. If we crash at any point in between, the marker survives as an indicator that something went wrong. We can use those leftover markers to do weekly maintenance on the state of the file system: stop both the upload process and the processing pipeline, then repair any stale data left behind by crashes.
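As a sketch, here is the upload path with the marker added (again, the path names are placeholders):

```python
import os

def publish_upload_safely(tmp_path: str, store_path: str,
                          link_path: str, marker_path: str) -> None:
    # The marker records that a move-plus-symlink transaction is in flight; if
    # we crash before removing it, the weekly maintenance job knows to repair.
    open(marker_path, "w").close()        # "touch" the marker
    os.replace(tmp_path, store_path)      # atomic move into place
    os.symlink(store_path, link_path)     # schedule the file for processing
    os.unlink(marker_path)                # transaction committed, clear the marker
```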

So what have we accomplished? With some atomic operations and file locks we have turned the file system into a rudimentary transactional database that is pretty robust in the face of several failure modes. When things do fail we just repeat some work and waste some disk space. That could be a problem if the pipeline were longer or the files bigger, but it isn’t and they aren’t, so the trade-off is perfectly acceptable.