Solving the final pass challenge :
==================================

Until version 0.9.3, fpsync only worked with file lists. That mode was perfect
to have balanced partitions with fine-grained specifications: a precise maximum
number of files and a maximum size. In that mode, fpsync is blazing fast, but
there is a catch: it was only able to perform *incremental* synchronizations.

As a consequence, a so-called 'final pass' was needed to delete extra files that
could have been deleted meanwhile (but already migrated) by users on the source
area.

When using fpsync to migrate big data shares, the recommended way was first to
perform those several successive synchronizations on live data and then stop the
service to keep the source share inaccessible and finalize the migration (as
fast as possible) with a single -manual- rsync --delete pass.

That final pass had to be performed through a single rsync command because
rsync's --delete option is incompatible with a parallel usage when using lists
of files, so fpsync could not pass that option to rsync jobs.

The problem is, when you have to migrate several hundreds of TB of data, that
the final pass can be tricky. With such a data quantity, a single rsync can take
weeks just to crawl the filesystem, thus ruining all the benefits of fpsync. It
can also consume very large chunks of memory to store the file list. Last but
not least, rsync starts data transfer (or file deletion) only once crawling has
finished.

As that final pass is done while the service is offline to users, it should be
fast and reliable, which was obviously not the case.

I've wondered for several months how we could boost that final pass. For memory,
and for the curious minds, here are the different steps I took to finally
implement fpsync's option -E.

[SIGSTOP]

My first ideas were to try to speedup the single rsync pass, mostly by avoiding
crawling the FS again. Here they are.

[SIGCONT]

Idea #1 :
---------

Use last fpsync run's partitions to generate a single -temporary- file
fulllist.txt containing all files present in /data/src/ and use that with a
command such as :

$ rsync -av --delete-missing-args --ignore-existing --files-from=fulllist.txt \
    /data/src/ /data/dst/

That command would remove files not present in fulllist.txt and skip updating
other files.

Pros:
- easy to generate fulllist.txt, no need to rescan filesystem

Cons:
- fulllist.txt can be huge
- file list can be outdated, leading to removing a file in /data/dst/ if it is
  missing from the list
- not really interesting as rsync will stat() files anyway
- single rsync

Idea #2 :
---------

Use first pass' file list and last pass' file list to generate a diff and remove
extra files and directories manually with simple (recursive, forced) rm's.

Pros :
- really fast, no need to stat()
- blindly erasing will work

Cons :
- again, file lists can be huge to handle
- at least two passes are needed before the final one
- even the most recent file list can be outdated

[SIGSTOP]

Rsync's --delete option can only work if the tool can be aware of every single
file that must be present witin a specific area (to be able to determine extra
files to remove). When working with files, that means we are stuck with a
single rsync process.

If we want to split and parallelize cleaning jobs, we must work with
directories. But how can we handle recursion ? How can we split a filesystem
hierarchy into directories that do not overlap ? Wouldn't that always lead to
the root directory ?

I then realized that the problem here was recursivity. If we could perform a
cleaning pass without recursivity for each single directory, we would be able
to clean the entire tree in a parallel way: each jobs can independently clean
its own directory. This is definitely the way to go, so I'll now try to work
with directories instead of files.

[SIGCONT]

Idea #3 :
---------

Use last pass' file list to determine *directories* (only). For each directory,
diff the first depth (no recursivity) and spawn a deletion job to remove extra
files and directories from the destination.

Pros :
- no need to stat() all files: 'ls -1', sort, comm and rm are enough
- easy to spawn a job immediately after having determining a directory from the
  list
- small jobs can be run independently and parallelized

Cons :
- we must determine directories from file lists, which implies concatenating
  and sorting them. Again, they can be huge so expect a big memory usage.
- again, even the most recent file list can be outdated

[SIGSTOP]

Ideas #1 to #3 were trying to avoid re-analyzing the FS but I realized that the
counterpart of re-using file lists is that they will *always* be outdated (I
mean, there is nearly no chance they are up-to-date if the FS is live), which is
not an acceptable tradeoff. So I re-oriented my ideas to try to produce a stream
of deletion jobs as well as synchronization jobs to group those two tasks
and parallelize them, if possible.

[SIGCONT]

Idea #4 :
---------

Instead of determining the directory list from the file lists, add a
'post-directory' hook capability to fpart and use that hook to spawn a
'cleaning' job as described in idea #3 once a directory has been visited.

Pros :
- again, no need to stat() all files: 'ls -1', sort, comm and rm are enough
- cleaning jobs are independent from synchronizing jobs and can be run after
  or before (provided the destination directory exists) file-synchronizing jobs 
- so those jobs can be easily sheduled and parallelized
- no more need to handle huge file lists to determine directories

Cons :
- need to add a specific hook to fpart which does not seem to make sense when
  not used by fpsync
- need to write the cleaning script executed by the hook and reinvent the wheel

Idea #5 :
---------

With a feeling that it would be better if rsync could do the single-depth
synchronization job for me (instead of having to script something more), and
crawling into the man page, I've finally discovered the great --exclude="/*/*"
option. It allows what I was looking for: synchronizing a directory on a
single-depth basis and skip further depths. It is also compatible with the
required --delete option. Used with the --relative option, it gives something
like :

$ cd /data/src/ ; \
    rsync -av --delete --relative --exclude="/foo/bar/*/*" foo/bar/ /data/dst/

if you want to synchronize the foo/bar/ directory without recursing.

So idea #5 is based on that discovery as well as the will to avoid having to
add a specific hook to fpart that would only do what a 'find -type d' can.

Pros :
- no need to modify fpart, use find instead

Cons :
- exclude pattern is variable and must include the source directory
- with the double-star exclude pattern, rsync stat()s each files and directories
  under depth+1 (matching the second star), which means we will be stat()ing far
  too many files

Idea #6 :
---------

Digging again in rsync's man page, I've discovered an even better option. I had
always used rsync's -a option, but it includes option -r (recurse) which is,
after all, *exactly* what I do not want. And luckily, the opposite is available
and is called option -d.

The following will thus do better than the command line from idea #5 :

$ cd /data/src/ ; \
    rsync -dlptgoD -v --delete --relative foo/bar/ /data/dst/

as it will no more stat() each files and directories under depth+1.

So why not jus use, for the final pass, a 'find -type d -exec' to spawn a
single-depth 'rsync --delete' process, over each directory ?

I ended up patching fpsync to test that solution and implemented a final pass
option. But... it was far from optimal and was quite slow as too many cleaning
jobs were scheduled (one per directory).

Pros :
- efficient FS crawling
- no more exclude pattern to handle

Cons :
- too many small jobs spawned, which ruins performances (too much overhead)

[SIGSTOP]

Anyway, it worked! That first implementation proved that cleaning a FS hierarchy
directory after directory and without recursion, works. Slowly, but it works.

It could be nice if we could produce less cleaning jobs by grouping directories
to avoid that overhead.

[SIGCONT]

Idea #7 :
---------

Could rsync handle a list of directories in a non-recursive way ?

Yes!!! It can:

$ cd /data/src/ ; \
    rsync -dlptgoD -v --files-from=dirlist --delete --relative \
        /data/src/ /data/dst/

File dirlist contains a directory list such as:

./foo/bar/
./foo/baz/
[...]

Great! But how can we generate those directory lists efficiently? Wait... Fpart
can generate file lists, it could generate directory lists as well!

This idea lead to modifying fpart to make it able to generate single-depth
directory lists when passing option -E.

Fpsync has then been patched too (see option -E) to use that option and work
with directory lists instead of file lists. That way, cleaning jobs work on
groups of directories and can be started in parallel :)

Pros :
- That solution fits into global fpart and fpsync philosophy,
  patching is trivial
- You get all the benefits from a standard fpsync pass: resumability, status,
  parallelism, FS crawling speed.

Cons :
- directory lists are coarse-grained and will probably be less balanced than
  file lists so maybe use file-based incremental synchronizations (standard
  mode) first and keep option -E (directory mode) for the final pass only
- FS crawling could not be avoided... but is that something we really want if
  we want to work with up-to-date lists?

Well, I hope that those thoughts will help you understand why I decided to
implement the final pass that way. If you have any questions or remarks, do not
hesitate to contact me!

--
Ganael LAPLANCHE <ganael.laplanche@martymac.org>, Nov. 2017
