Recently I’ve grown to love using rsync, a tool for transferring files between local and remote destinations.
I hit a snag, though, while improving the rsync command in the script that deploys my website, so what better opportunity to write about it?
For context, I use Hugo to build my website. Output comes as a collection of static files, and rsync is perfect for copying these files to a web-facing location. In my case, this is a directory served by Nginx.
With rsync, it’s easy to copy a local file to a remote computer.
rsync ~/Documents/hello.txt [email protected]:~/
Under the hood, rsync establishes an SSH connection and copies hello.txt into the remote user’s home directory. If the file already exists there, it’s updated instead.
It has mechanisms to make this efficient as well, reducing the amount of data in transit over a (comparatively) slow network connection:
- Skipping files that have likely not changed (same size and modification time)
- Transmitting only the changes needed to update a file, instead of the whole file
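The first of these mechanisms (the rsync man page calls it the "quick check") is easy to watch locally. Here’s a minimal sketch, assuming rsync is installed. The scratch directory and file names are made up for illustration, and --times is passed so the copy keeps the source’s modification time.

```shell
set -euo pipefail
command -v rsync >/dev/null || { echo "rsync not installed, skipping"; exit 0; }

# Hypothetical scratch layout, not part of my deploy script.
work=$(mktemp -d)
mkdir "$work/src" "$work/dest"
echo "hello" > "$work/src/hello.txt"

# First run: hello.txt is missing from dest/, so it is transferred.
first=$(rsync --times --itemize-changes "$work/src/hello.txt" "$work/dest/")

# Second run: size and modification time already match, so the file is
# skipped and --itemize-changes has nothing to report.
second=$(rsync --times --itemize-changes "$work/src/hello.txt" "$work/dest/")

echo "first run:  ${first:-(nothing to do)}"
echo "second run: ${second:-(nothing to do)}"
rm -rf "$work"
```

The first run should itemize hello.txt as a new file, while the second should report nothing at all.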
When it came to my website, I had a few motivations to improve how I was using rsync to deploy:
- Timestamps were not reliable in identifying new/updated files to transfer
- Files were being updated on the remote even if their contents had not changed
The last point was of concern because when Nginx serves static files for my website, it uses a file’s modification timestamp to set the Last-Modified header for caching purposes. The less often this timestamp changed, the more often clients could save resources by falling back to a cached copy of my site.
It was in my interests to help rsync avoid making unnecessary updates while deploying.
Improving my script
Let’s take a look at how the command started out in my deployment script.
rsync -a --delete $PWD/public/ [email protected]$DEPLOYMENT_IP:/var/www/nicholas.cloud
This works well enough, updating my website with the latest build artifacts and using --delete to clean up any (removed) files that should no longer exist on the remote. The dangerous option here is that unassuming -a, as we’ll see later.
The trouble here is that most of these artifacts are generated during a build. Even if they don’t change between builds, rsync will still waste time on them because they have a “newer” timestamp.
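To make the problem concrete, here’s a small sketch of what a rebuild does to an unchanged artifact. It assumes GNU coreutils (touch -d, stat -c, sha256sum), and the file name and date are arbitrary stand-ins rather than anything from my real build.

```shell
set -euo pipefail

# Hypothetical stand-in for a build artifact; assumes GNU coreutils.
work=$(mktemp -d)
page="$work/index.html"

printf '<h1>Hello</h1>\n' > "$page"
touch -d '2020-01-01 00:00:00' "$page"   # pretend this came from an older build
old_mtime=$(stat -c %Y "$page")
old_sum=$(sha256sum "$page" | cut -d ' ' -f 1)

# "Rebuild": the same bytes are written again, which bumps the timestamp.
printf '<h1>Hello</h1>\n' > "$page"
new_mtime=$(stat -c %Y "$page")
new_sum=$(sha256sum "$page" | cut -d ' ' -f 1)

[ "$old_sum" = "$new_sum" ] && echo "contents: unchanged"
[ "$new_mtime" -gt "$old_mtime" ] && echo "timestamp: newer"
rm -rf "$work"
```

Both messages should print: the hash is identical, yet the modification time has moved forward.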
Since timestamps can’t be used to identify a change, another option is to compare a hash of the file contents. Thankfully, rsync supports a --checksum option, so now files will be updated if they have a different size or checksum!
- rsync -a --delete $PWD/public/ [email protected]$DEPLOYMENT_IP:/var/www/nicholas.cloud
+ rsync -a --checksum --delete $PWD/public/ [email protected]$DEPLOYMENT_IP:/var/www/nicholas.cloud
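To see what --checksum changes in isolation, here’s a local sketch that deliberately leaves -a out. It assumes rsync is installed and GNU touch is available; the public/ and www/ scratch directories are hypothetical, and --dry-run means nothing is actually copied.

```shell
set -euo pipefail
command -v rsync >/dev/null || { echo "rsync not installed, skipping"; exit 0; }

# Hypothetical scratch layout; assumes GNU touch for the -d option.
work=$(mktemp -d)
mkdir "$work/public" "$work/www"
printf '<h1>Hello</h1>\n' > "$work/public/index.html"
cp "$work/public/index.html" "$work/www/index.html"
touch -d '2020-01-01 00:00:00' "$work/www/index.html"   # same bytes, older timestamp

# Quick check only: the differing timestamp marks the file for transfer.
without_checksum=$(rsync --recursive --dry-run --itemize-changes \
    "$work/public/" "$work/www/")

# With --checksum: size and checksum match, so the file is left alone.
with_checksum=$(rsync --recursive --checksum --dry-run --itemize-changes \
    "$work/public/" "$work/www/")

echo "without --checksum: ${without_checksum:-(nothing to do)}"
echo "with --checksum:    ${with_checksum:-(nothing to do)}"
rm -rf "$work"
```

Only the first command should have anything to report.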
I thought this would work well enough, but looking at the remote I could see that timestamps on many files were still being updated. I hadn’t changed any content, so why was this happening?
I added in some options to start debugging. Aside from the usual --verbose for extra information, I used --itemize-changes to see exactly how rsync was performing updates.
- rsync -a --checksum --delete $PWD/public/ [email protected]$DEPLOYMENT_IP:/var/www/nicholas.cloud
+ rsync -a -v --itemize-changes --checksum --delete $PWD/public/ [email protected]$DEPLOYMENT_IP:/var/www/nicholas.cloud
How does the command play out in my build logs?
sending incremental file list
.d..t...... ./
.f..t...... 404.html
.f..t...... index.html
.f..t...... index.xml
.f..t...... sitemap.xml
.d..t...... blog/
.f..t...... blog/index.html
.f..t...... blog/index.xml
.d..t...... blog/2017-reflection/
.f..t...... blog/2017-reflection/index.html
...
While rsync wasn’t modifying any file contents, it was modifying file timestamps. What could be causing that?
This was the point where I thought to look back over exactly what that -a option was doing, to see if it was responsible for this mess. First port of call, the man page!
-a, --archive archive mode; equals -rlptgoD (no -H,-A,-X)
-a implies several options, many related to file metadata: owner, group, permissions and, sure enough, timestamp.
I had my culprit.
With a further option, we can apply a workaround to skip updating timestamps.
- rsync -a -v --itemize-changes --checksum --delete $PWD/public/ [email protected]$DEPLOYMENT_IP:/var/www/nicholas.cloud
+ rsync -a --no-times -v --itemize-changes --checksum --delete $PWD/public/ [email protected]$DEPLOYMENT_IP:/var/www/nicholas.cloud
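The effect of --no-times can be reproduced locally as well. This is only a sketch, assuming rsync and GNU touch are available; the scratch directories are hypothetical and --dry-run keeps rsync from changing anything.

```shell
set -euo pipefail
command -v rsync >/dev/null || { echo "rsync not installed, skipping"; exit 0; }

# Hypothetical scratch layout; assumes GNU touch for the -d option.
work=$(mktemp -d)
mkdir "$work/public" "$work/www"
printf '<h1>Hello</h1>\n' > "$work/public/index.html"
cp "$work/public/index.html" "$work/www/index.html"
touch -d '2020-01-01 00:00:00' "$work/www/index.html"   # same bytes, older timestamp

# --archive includes --times, so the timestamp alone is reported as a change.
with_times=$(rsync --archive --checksum --dry-run --itemize-changes \
    "$work/public/" "$work/www/")

# --no-times switches that one implied option back off.
no_times=$(rsync --archive --no-times --checksum --dry-run --itemize-changes \
    "$work/public/" "$work/www/")

echo "--archive:            ${with_times:-(nothing to do)}"
echo "--archive --no-times: ${no_times:-(nothing to do)}"
rm -rf "$work"
```

The first command should show the same kind of .f..t...... line as my build logs, and the second should stay quiet.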
To go the extra mile though, it’s much better to replace --archive with the only option I actually need: --recursive. It was also a good time to use long names for each option!
- rsync --archive --no-times -v --itemize-changes --checksum --delete $PWD/public/ [email protected]$DEPLOYMENT_IP:/var/www/nicholas.cloud
+ rsync --recursive --verbose --itemize-changes --checksum --delete $PWD/public/ [email protected]$DEPLOYMENT_IP:/var/www/nicholas.cloud
Checking back on it a few days later, it seems to be faring well enough!
$ ls -lR /var/www/nicholas.cloud | grep 'Apr' | sed 's;.*A;A;' | cut -b -12 | sort --unique
Apr 10 04:09
Apr 10 05:49
Apr 11 03:12
Apr 13 06:32
Apr 13 07:34
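That ls pipeline leans on the month name and on fixed column widths. A sketch of the same check using find instead, assuming GNU findutils (for -printf); distinct_mtimes is a hypothetical helper, and the scratch directory and dates stand in for the real web root:

```shell
set -euo pipefail

# Print each distinct modification time (to the minute) found under a directory.
# Assumes GNU find, whose -printf supports the %T time directives.
distinct_mtimes() {
    find "$1" -type f -printf '%TY-%Tm-%Td %TH:%TM\n' | sort --unique
}

# Hypothetical scratch directory standing in for the web root.
site=$(mktemp -d)
mkdir "$site/blog"
touch -d '2020-04-10 04:09:00' "$site/index.html" "$site/blog/index.html"
touch -d '2020-04-13 06:32:00' "$site/sitemap.xml"

distinct_mtimes "$site"
rm -rf "$site"
```

Three files go in, but only two distinct timestamps should come out, one per simulated deploy that actually changed something.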
Is this better/optimal?
While I prefer this updated rsync command, it’s worth noting that there are a lot of moving parts to consider.
- The delta-transfer algorithm rsync employs greatly reduces the amount of data transferred when two files are largely similar/identical
- Generating/comparing checksums may actually be slower than relying on timestamps and the false positives they entail
- Nginx also employs the ETag header in addition to the Last-Modified header for certain resources
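The last point matters because, as far as I know, the ETag Nginx generates for a static file is built from the file’s modification time and size, both in hexadecimal, so a touched timestamp changes it even when the bytes don’t. Here’s a rough sketch of that derivation, assuming GNU stat and touch; nginx_style_etag is a hypothetical helper, not something Nginx ships, and the epoch values are arbitrary.

```shell
set -euo pipefail

# Hypothetical helper: approximate Nginx's static-file ETag as
# "<mtime in hex>-<size in hex>". Assumes GNU stat.
nginx_style_etag() {
    printf '"%x-%x"\n' "$(stat -c %Y "$1")" "$(stat -c %s "$1")"
}

page=$(mktemp)
printf 'hello\n' > "$page"

touch -d @1593835520 "$page"    # arbitrary epoch timestamp
before=$(nginx_style_etag "$page")

touch -d @1593835521 "$page"    # one second later, contents untouched
after=$(nginx_style_etag "$page")

echo "before: $before"
echo "after:  $after"
rm -f "$page"
```

If that model holds, keeping timestamps stable helps both headers at once, since the two values differ despite identical contents.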
Is my solution really faster? Does it lead to better caching performance for end users? I don’t think you can tell without benchmarking, but that’s something to investigate another day.