A Pretty Little Bit of Rsync

The unexpectedly lengthy story about improving one line in a script

Posted April 14, 2020 with tags #rsync

Recently I’ve grown to love using rsync, a tool for transferring files between local and remote destinations.

I recently hit a snag while improving an rsync command in the script that deploys my website, so what better opportunity to write about it?

For context, I use Hugo to build my website. Output comes as a collection of static files, and rsync is perfect for copying these files to a web-facing location. In my case, this is a directory served by Nginx.

Using rsync

With rsync, it’s easy to copy a local file to a remote computer.

rsync ~/Documents/hello.txt [email protected]:~/

Under the hood, rsync establishes an SSH connection and copies over hello.txt into the remote user’s home directory. If this file already exists, it’s updated instead.

It has mechanisms to make this efficient as well, reducing the amount of data in transit over a (comparatively) slow network connection:

Skipping files that have likely not changed (same size and modification time)
Transmitting only the changes needed to update a file, instead of the whole file
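As a rough illustration of the first point, the size-and-timestamp comparison can be approximated in a few lines of shell. This is only a sketch of the idea, not how rsync implements it: quick_check_same is a hypothetical helper, and both stat -c and cp --preserve assume GNU coreutils.

```shell
# Hypothetical helper: succeeds when two files share the same size and mtime.
# Assumes GNU stat (%s = size in bytes, %Y = mtime in seconds since the epoch).
quick_check_same() {
    [ "$(stat -c '%s %Y' "$1")" = "$(stat -c '%s %Y' "$2")" ]
}

workdir=$(mktemp -d)
printf 'hello\n' > "$workdir/a.txt"
cp --preserve=timestamps "$workdir/a.txt" "$workdir/b.txt"   # same size, same mtime
printf 'hello\n' > "$workdir/c.txt"
touch -d '2001-01-01 00:00:00' "$workdir/c.txt"              # same size, older mtime

if quick_check_same "$workdir/a.txt" "$workdir/b.txt"; then
    echo "a.txt vs b.txt: looks unchanged, skip"
fi
if ! quick_check_same "$workdir/a.txt" "$workdir/c.txt"; then
    echo "a.txt vs c.txt: looks changed, transfer"
fi
```

Note that c.txt has exactly the same contents as a.txt, yet it still counts as changed because only the timestamp is consulted.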

When it came to my website, I had a few motivations to improve how I was using rsync to deploy:

Timestamps were not reliable in identifying new/updated files to transfer
Files were being updated on the remote even if their contents had not changed

The second point was a concern because when Nginx serves static files for my website, it uses a file’s modification timestamp to set the Last-Modified header for caching purposes. The less often this timestamp changed, the more often clients would be able to save resources by falling back to a cached copy of my site.

It was in my interests to help rsync avoid making unnecessary updates while deploying.

Improving my script

Let’s take a look at how the command started out in my deployment script.

rsync -a --delete $PWD/public/ nicholas@$DEPLOYMENT_IP:/var/www/nicholas.cloud

This works well enough, updating my website with the latest build artifacts and, with --delete, cleaning up any files on the remote that should no longer exist. The dangerous option here is that unassuming -a, as we’ll see later.

The trouble here is that most of these artifacts are generated during a build. Even if they don’t change between builds, rsync will still waste time on them because they have a “newer” timestamp.

Since timestamps can’t be used to identify a change, another option is to compare a hash of the file contents. Thankfully, rsync supports a --checksum option, so now files will be updated if they have a different size or checksum!

- rsync -a            --delete $PWD/public/ nicholas@$DEPLOYMENT_IP:/var/www/nicholas.cloud
+ rsync -a --checksum --delete $PWD/public/ nicholas@$DEPLOYMENT_IP:/var/www/nicholas.cloud
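To make the difference concrete, here is a small local sketch of why a content hash is a better signal than a timestamp for regenerated build output. It doesn’t involve rsync at all: content_same is a hypothetical helper, and sha256sum (GNU coreutils) is only a stand-in for rsync’s own file checksum.

```shell
# Hypothetical helper: succeeds when two files have identical contents.
content_same() {
    [ "$(sha256sum < "$1")" = "$(sha256sum < "$2")" ]
}

old_build=$(mktemp -d)
new_build=$(mktemp -d)
printf '<h1>Hello</h1>\n' > "$old_build/index.html"
printf '<h1>Hello</h1>\n' > "$new_build/index.html"
# Simulate an earlier build: same bytes, different modification time.
touch -d '2020-04-01 00:00:00' "$old_build/index.html"

if content_same "$old_build/index.html" "$new_build/index.html"; then
    echo "unchanged: skip transfer"
else
    echo "changed: transfer"
fi
```

The two files disagree on modification time but agree on their hash, which is the situation every rebuild of an unchanged page creates.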

I thought this would work well enough, but looking at the remote I could see that timestamps on many files were still being updated. I hadn’t changed any content, so why was this happening?

I added in some options to start debugging. Aside from the usual -v/--verbose for extra information, I used --itemize-changes to see exactly how rsync was performing updates.

- rsync -a                      --checksum --delete $PWD/public/ nicholas@$DEPLOYMENT_IP:/var/www/nicholas.cloud
+ rsync -a -v --itemize-changes --checksum --delete $PWD/public/ nicholas@$DEPLOYMENT_IP:/var/www/nicholas.cloud

How does the command play out in my build logs?

sending incremental file list
.d..t...... ./
.f..t...... 404.html
.f..t...... index.html
.f..t...... index.xml
.f..t...... sitemap.xml
.d..t...... blog/
.f..t...... blog/index.html
.f..t...... blog/index.xml
.d..t...... blog/2017-reflection/
.f..t...... blog/2017-reflection/index.html
...

While rsync wasn’t modifying any file contents, it was modifying file timestamps. What could be causing that?
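Those cryptic strings follow a fixed layout that the rsync man page describes as YXcstpoguax: the first character is the type of update, the second is the file type, and the rest flag attributes being changed. The decoder below is a hedged sketch based on my reading of the man page. describe_itemized is a hypothetical helper and it only covers the first few attribute columns.

```shell
# Hypothetical helper: describe an --itemize-changes string such as ".f..t......".
# Only the update type, file type and the c/s/t/p/o/g columns are decoded.
describe_itemized() {
    update=$(printf '%s' "$1" | cut -c1)
    kind=$(printf '%s' "$1" | cut -c2)
    attrs=$(printf '%s' "$1" | cut -c3-)

    case $update in
        '<') action="sent" ;;
        '>') action="received" ;;
        c)   action="created or changed locally" ;;
        h)   action="hard-linked" ;;
        .)   action="not transferred" ;;
        *)   action="unknown" ;;
    esac

    case $kind in
        f) ftype="file" ;;
        d) ftype="directory" ;;
        L) ftype="symlink" ;;
        D) ftype="device" ;;
        S) ftype="special file" ;;
        *) ftype="unknown" ;;
    esac

    # Each attribute has a fixed column, so match by position.
    changes=""
    case $attrs in c*)      changes="$changes checksum" ;; esac
    case $attrs in ?s*)     changes="$changes size" ;; esac
    case $attrs in ??t*)    changes="$changes mtime" ;; esac
    case $attrs in ???p*)   changes="$changes permissions" ;; esac
    case $attrs in ????o*)  changes="$changes owner" ;; esac
    case $attrs in ?????g*) changes="$changes group" ;; esac

    echo "$ftype $action; changed:${changes:- nothing}"
}

describe_itemized '.f..t......'
```

Read this way, every line in my log says the same thing: nothing was transferred, but the modification time was being reset.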

This was the point where I thought to look back over exactly what that -a option was doing, to see if it was responsible for this mess. First port of call, the man page!

-a, --archive       archive mode; equals -rlptgoD (no -H,-A,-X)

So -a implies several options, many related to file metadata: owner, group, permissions and, sure enough, timestamp.
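To spell out what is hiding behind -rlptgoD, here is a small sketch that prints the long option for each letter. expand_short_flags is a hypothetical helper for illustration only; the mapping itself comes from the rsync man page.

```shell
# Hypothetical helper: print the long option(s) for each letter in a flag bundle.
expand_short_flags() {
    printf '%s\n' "$1" | fold -w1 | while IFS= read -r letter; do
        case $letter in
            r) echo "--recursive" ;;
            l) echo "--links" ;;
            p) echo "--perms" ;;
            t) echo "--times" ;;
            g) echo "--group" ;;
            o) echo "--owner" ;;
            D) echo "--devices --specials" ;;
            *) echo "(not covered by this sketch: -$letter)" ;;
        esac
    done
}

expand_short_flags rlptgoD
```

Seen this way, --times is the piece that asks rsync to copy modification times across, which lines up with the t column in the itemized output.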

I had my culprit.

With a further option, we can apply a workaround to skip updating timestamps.

- rsync -a            -v --itemize-changes --checksum --delete $PWD/public/ nicholas@$DEPLOYMENT_IP:/var/www/nicholas.cloud
+ rsync -a --no-times -v --itemize-changes --checksum --delete $PWD/public/ nicholas@$DEPLOYMENT_IP:/var/www/nicholas.cloud

To go the extra mile though, it’s much better to replace --archive with the only option I actually need: --recursive. It was also a good time to use long names for each option!

- rsync --archive --no-times -v          --itemize-changes --checksum --delete $PWD/public/ nicholas@$DEPLOYMENT_IP:/var/www/nicholas.cloud
+ rsync --recursive          --verbose   --itemize-changes --checksum --delete $PWD/public/ nicholas@$DEPLOYMENT_IP:/var/www/nicholas.cloud
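To sanity-check this combination of options without touching a real server, the behaviour can be reproduced between two local temporary directories. This is a sketch under a few assumptions: rsync is installed, stat and touch are the GNU versions, and demo_unchanged_deploy is a hypothetical name. The paths are throwaway and have nothing to do with my actual deployment.

```shell
# Hypothetical local demo: does a redeploy of unchanged content touch the mtime?
demo_unchanged_deploy() {
    command -v rsync >/dev/null 2>&1 || { echo "rsync not installed"; return 0; }

    src=$(mktemp -d)
    dst=$(mktemp -d)
    printf 'hello\n' > "$src/index.html"

    # First deploy, then backdate the deployed copy so the two timestamps disagree.
    rsync --recursive --checksum --delete "$src/" "$dst/"
    touch -d '2020-04-10 04:09:00' "$dst/index.html"
    before=$(stat -c '%Y' "$dst/index.html")

    # "Rebuild" the same content (fresh timestamp on the source) and deploy again.
    printf 'hello\n' > "$src/index.html"
    rsync --recursive --checksum --delete "$src/" "$dst/"
    after=$(stat -c '%Y' "$dst/index.html")

    if [ "$before" = "$after" ]; then
        echo "timestamp untouched"
    else
        echo "timestamp changed"
    fi
    rm -r "$src" "$dst"
}

demo_unchanged_deploy
```

Swapping --recursive for -a in the second rsync call is a quick way to see the original problem come back.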

Checking back on it a few days later, it seems to be faring well enough!

$ ls -lR /var/www/nicholas.cloud | grep 'Apr' | sed 's;.*A;A;' | cut -b -12 | sort --unique
Apr 10 04:09
Apr 10 05:49
Apr 11 03:12
Apr 13 06:32
Apr 13 07:34
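The ls pipeline leans on the month name and the column layout of ls -l. A possible alternative, assuming GNU find (for -printf), asks for each file’s modification time directly. distinct_mtimes is a hypothetical helper, and unlike the ls version it only looks at regular files, not directories.

```shell
# Hypothetical helper: list the distinct modification times (to the minute)
# of regular files under a directory. Assumes GNU find for -printf and %T.
distinct_mtimes() {
    find "$1" -type f -printf '%TY-%Tm-%Td %TH:%TM\n' | sort --unique
}

# Usage: distinct_mtimes /var/www/nicholas.cloud
```

Because the year and month are part of the output, this also keeps working once the files are no longer from April.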

Is this better/optimal?

While I prefer this updated rsync command, it’s worth noting that there are a lot of moving parts to consider.

The delta-transfer algorithm rsync employs can greatly reduce the amount of data transferred when two files are largely similar/identical
Generating/comparing checksums may actually be slower than relying on timestamps and the false positives they entail
Nginx also employs the ETag header in addition to the Last-Modified header for certain resources

Is my solution really faster? Does it lead to better caching performance for end users? I don’t think you can tell without benchmarking, but that’s something to investigate another day.

If you’d like to know more about how rsync functions, I’d recommend checking out the man page and this higher-level overview!