Title : Synchronization files software
Date : 04 May 2021
Tags : unix
In this article I will introduce you to various opensource file synchronization programs and their according workflows. I may not know them all, obviously.
I can't give a full explanation of each of them, but I will tell you enough so you can know if it could be of any interest to you.
There are many software out there, with pros and cons, to match our file synchronization requirements.
rsync is the leader for simple file replication, it can take care that the destination will exactly match the source data. It's available mostly everywhere and using ssh as a transport it's also secure.
rsync is really the reference for a one-way synchronization.
lsyncd is meant to be used in an environment for near to realtime synchronization. It will check for changes in the monitored directories and will replicate the changes on a remote system (using rsync by default).
unison is like rsync but can synchronize in both way, meaning you can keep two directories synchronized without having to think in which order you need to transfer. Obviously, in case of conflict you will have to resolve and pick which file you want to keep. This is a well established software that is very reliable.
rclone is like rsync but will support many backend instead of relying on ssh to connect to a remote source. It's mostly used to transfer files from or to Cloud services by making a glue between core rclone and the service API.
I covered rclone in a previous article if you want more information.
syncthing is a fantastic tool to keep directories synchronized between computers/phones. It's a service you run, you define what directories you want to export, and on other syncthing instances you can add those exports and it will be kept synchronized together without tuning. It uses a public tracker to find peers so you don't have to mess with NAT or redirections, and if you want full privacy you can use direct IPs. Data are encrypted during transfers.
It has the advantages of working in full automatic mode and can exchange in both ways in a same directory, with multiples instance on a same share, it can also keep previous copies of deleted / replaced files and support many other features.
SparkleShare isn't well known but still does the job very efficiently. It offers automatic synchronization of a directory with other peers based on a git directory, basically, if you add a file or make a change, it's committed and pushed to the remote repositories. If someone make a change, you will receive it too.
While it works very well, it's mostly suited for non binary data because of the git backend. You can't really delete old data so the sparkleshare share will grow over time.
Nextcloud has a file synchronization capability, it's mostly used to upload your data to a remote server and be able to access it from remote, but also share a file or a directory in read only or read/write to other people. It's really a huge toolbox that requires a 24/7 server but provide many features for sharing files. A not so well known feature is the ability to share a directory between Nextcloud instances.
Nextcloud has its core in PHP for the www access but also phone or desktop applications.
Nextcloud can encrypt stored data.
Seafile is a centralized server to store data, like netxtcloud. It's more focused on file storage than nextcloud, but will provide solid features and also companions apps for phones and desktop.
I kept the best for the end. Git-annex is a special beast that would have deserved a full article for it but I never found how to approach it.
git-annex is a command line tool to manage a library of data and will delegate actual transfer to the according protocol.
WHAT DOES IT MEAN? Let's try an analogy.
You are in a house, you have many things in your house: movies, music, books, papers. If you want to keep track of where is stored something, you need an inventory, in which you will label where you stored this paper, this DVD, this book etc... This is what git-annex is doing.
git-annex will allow you to entirely manage data and spread it on different location (with redundancy possible) and let you access natively (or at least tell you where to get it). A real life example would be to use an external hard drive to store big files like music or movies but use a remote server to backup important documents. But you may want your documents to also be on the external hard drive, or even two hard drives, you can tell git-annex to manage that.
git-annex can give you the current state of your library without having the files locally, it will replace the whole hierarchy with symlinks to the real files if they are on your computer, meaning you can get the files when you need them or simply work on that index to remove files and then tell git-annex to proceed to deletion if possible (or when it can, like when you get internet access or you connect that external hard drive).
The draw back is that all the tracked files are symbolic links to a potentially non existing file and that you need a specific workflow of unlocking file in order to make changes, and then store it again.
I've been using it for years for data that doesn't change much (administrative documents, music, pictures) but it's certainly not suitable for tracking logs or often modified files.
The name contains "git" but git-annex only use gits to store the whole metadata, the data themselves are not in git.
There are different strategies to synchronize files between computers, they can be one way, both way, allow other people to use them, manage at huge scale, realtime etc...
From my experience, we all manage our files in very different ways so I'm glad we have that many ways to synchronize them.
PS: don't forget to backup, it's not because you replicate your data that you don't need backup, sometimes it's easy to destroy all the data at once with a simple mistake.