Describing Incremental File Transfer – How it Saves Time and Data; And, Why I Wrote This :) (plus some art)
I was visualizing the protocol CrashPlan might use, after noticing it was backing up my recently-created VM’s files, and had an idea which turned out to be wrong -- but since I had already started to write this up (to explore that idea), I’ll share it with you! :)
How This Started
I made a VM to experiment with. This in itself is cool; I’m starting to do things I used to do -- I'm recovering from multiple concussions. Since my CrashPlan subscription is ending, I’m quite busy retrieving the backups of machines I no longer have or have access to, and backing up the ones I have. I checked CrashPlan on this laptop, and noticed it was backing up the VDI file (virtual disk image) of the VM.
I had just shut the VM down, so it likely changed some portion of the VDI file. Since CrashPlan was in the middle of backing that file up, it means it’ll need to back it up again. So I was thinking about how it could copy only the parts that changed. This is also what rsync does -- an open source program maintained by the Samba team.
Incremental Vs. Complete
If it sees the file has changed, it could just choose to send the entire file over again. But, the destination computer could be behind a slow link, and/or a long distance away. Sending the entire file might take a very long time, compared to just sending the changed parts.
Sending only the changed parts is called “incremental”, because it just sends a new “increment” for the file, rather than the entire thing. The simplest increment can be seen in log files, which are generally appended to while the earlier information never changes; thus, the new log file will always be the same as the previous one, plus a little extra data (generally, another line).
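As a minimal sketch of that log-file case (in Python; the function name and the fall-back behavior are my own assumptions, not any particular tool’s): if the old file is still an exact prefix of the new one, only the tail needs to be sent.

```python
import os

def appended_tail(old_path, new_path):
    """If the old file is an unchanged prefix of the new file,
    return just the newly appended bytes; otherwise return None."""
    old_size = os.path.getsize(old_path)
    with open(old_path, "rb") as old, open(new_path, "rb") as new:
        # Confirm the old contents are still an exact prefix.
        if old.read() != new.read(old_size):
            return None      # changed mid-file; fall back to a full diff
        return new.read()    # everything past the old end is the increment
```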
But for files like the VDI file, changes could happen within the file at any location, so we need an algorithm that can handle this as well.
Algorithm
First, it would check the destination folder; if it doesn’t exist, the file can’t exist either, so create the folder and copy the file.
Next, it would check the file name; if it doesn’t exist, copy it.
Comparing the date really isn’t important, although it should update the date on the destination after it’s done copying. (With that in mind, it should copy into a new, temporary file, so that in the event of a power interruption the old file is still the one that was there; if it can resume the interrupted copy, it will, and otherwise it discards it and starts the copy over.)
If the size is the same, and the date is the same, it can consider the file identical – although there should be an option to always binary-verify the contents.
If the size or date is different (or the option was set to always check), it would compare the contents.
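Here’s a minimal sketch of that decision logic in Python. The helper names (copy_whole_file, send_changed_parts) are mine, send_changed_parts is only a placeholder (the incremental part is what the rest of this post works out), and for simplicity both paths are assumed to be visible locally, whereas the real thing would query the destination machine:

```python
import os
import shutil

def copy_whole_file(src, dst):
    """Copy to a temporary name first, then rename, so a power
    interruption never destroys the old destination file."""
    tmp = dst + ".partial"
    shutil.copyfile(src, tmp)
    os.replace(tmp, dst)  # atomic rename on most filesystems

def send_changed_parts(src, dst):
    """Placeholder: a real version would send only the changed chunks."""
    copy_whole_file(src, dst)

def sync_file(src, dst, always_verify=False):
    """Sketch of the top-level decision: create, copy, skip, or update."""
    folder = os.path.dirname(dst) or "."
    if not os.path.isdir(folder):        # no folder -> the file can't exist
        os.makedirs(folder)
        copy_whole_file(src, dst)
    elif not os.path.exists(dst):        # folder exists, file doesn't
        copy_whole_file(src, dst)
    else:
        s, d = os.stat(src), os.stat(dst)
        if (s.st_size == d.st_size and int(s.st_mtime) == int(d.st_mtime)
                and not always_verify):
            return                       # same size and date: assume identical
        send_changed_parts(src, dst)
    s = os.stat(src)
    os.utime(dst, (s.st_atime, s.st_mtime))  # update the destination's date
```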
Comparing the Contents
Obviously, comparing every byte directly would mean sending the entire file anyway, so it wouldn’t do that. One way is to break the file into chunks, perform a computation on each chunk, and send the result of that computation. If the other side performs the same computation on the same chunk of the file on their end and gets the same result, then they can be very close to 100% certain that they have identical contents. Such computations go by several names: CRC (cyclic redundancy check), hash, checksum, MD5 sum [1], and others.
Generating the hash of the entire file, on both sides, would require reading the entire file. To be efficient, the algorithm should read the entire file a minimum number of times, preferably once – I don’t think it can read less than once, like some search algorithms can, which is really neat in itself! https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm [2]
So, let’s say it divides the file into 4 KB (4,096-byte) chunks, creates a 16-byte MD5 sum from each, and sends those across; that’s 1/256th as much data to send (16 sent / 4,096 original). However, if the file is large, that’s still a lot of data – many 16-byte sums.
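In Python, those per-chunk sums could look like this (hashlib’s MD5 is in the standard library; the 4 KB chunk size matches the numbers above):

```python
import hashlib

CHUNK = 4096  # 4 KB per chunk, as above

def chunk_hashes(path):
    """Return a 16-byte MD5 digest for each 4 KB chunk of the file."""
    hashes = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK)
            if not chunk:
                break
            hashes.append(hashlib.md5(chunk).digest())  # 16 bytes per 4,096
    return hashes
```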
Hash the Hashes
So, a solution to that is to “hash the hashes” – generate a hash from all of the hashes, and compare that. Thus, each file that’s identical has only 16 bytes sent across the much slower wire, to compare on the other side.
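Continuing the sketch above, that rollup is one line – reusing the chunk_hashes function from the previous snippet (the file name here is just an example):

```python
import hashlib

# One 16-byte digest summarizing all of the file's chunk hashes;
# only this needs to cross the wire when the files are identical.
top_hash = hashlib.md5(b"".join(chunk_hashes("disk.vdi"))).digest()
```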
And then, keep dividing in two until the different data is found, and then send it. By “dividing in two” I mean: if the hash of the entire file differs, each side calculates the hash of the first half of the chunk hashes and compares it. If it’s the same, the difference is in the second half, so continue halving that; if it’s different, continue halving the first half – and since changes can occur in both halves at once, the second half still needs to be checked too. Repeat this until the smallest chunk that’s sent over the wire is found (or the smallest difference, if it’s larger than that), and then send it across, so the receiving end can put it in the right place and end up with the same file.
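Here’s a sketch of that halving, simulated locally with both sides’ chunk-hash lists in hand – in reality, each “hash of a half” would be one 16-byte message over the wire:

```python
import hashlib

def differing_chunks(local, remote, lo=0, hi=None):
    """Given each side's list of per-chunk digests, recursively halve
    the range to find which chunk indices differ.  Simulated locally;
    a real implementation would exchange each hash over the network."""
    if hi is None:
        hi = max(len(local), len(remote))
    if lo >= hi:
        return []
    # Hash this range's hashes on both sides and compare the results.
    mine = hashlib.md5(b"".join(local[lo:hi])).digest()
    theirs = hashlib.md5(b"".join(remote[lo:hi])).digest()
    if mine == theirs:
        return []                  # this whole range matches
    if hi - lo == 1:
        return [lo]                # narrowed down to a single chunk
    mid = (lo + hi) // 2
    # Unlike a classic binary search, both halves may contain changes.
    return (differing_chunks(local, remote, lo, mid) +
            differing_chunks(local, remote, mid, hi))
```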
I think this describes the least amount of data needed to send across the wire to update a file. Looks like it’s going to need to read some parts of the file twice, though, if there are differences – the source machine will read the differing parts twice, while the destination will read them once, and write them once after the source machine sends them across.
If the files are the same, then it’ll only need to read each once -- and if the size and datestamp are the same, and the “binary check” option wasn’t specified, then it will need to read each file zero times.
Saves Data?
A similar process is used in databases and version control systems, in order to “save only the differences” -- instead of having to save an entire copy of a file, when only one line had been changed for instance.
My Initial Thought
It turned out not to be the most efficient, so I didn’t even include it in the above, but for posterity I’ll include what sparked this post:
I had wanted to make a checksum for each 1K, then every 1K checksums make a “higher-level” checksum. And if there were enough of those, make another “higher level” from 1K of those, etc etc.
So, basically, the first level of checksums is “1K of original data per checksum”; the next level is “1M”, next is “1G”, then “1T” – rarely would it need to go beyond that, but, it could.
But while writing the above, I realized that this doesn’t make much sense, and it would require a lot more calculations than the “best” way, which I ended up describing above. If you can think of a better way, like Ross Perot said in the 1992 debates, “I’m all ears!” :)
Uh-Oh… There Is a Better Way…
Similar to the search algorithm, this would be better as powers of two; that is, at each level, divide the search space in two. So the first hash would cover just two components – in the above example, 1K blocks, so one hash would describe 2K (and the block would probably be better at 4K, since that’s the block size most hard drives are moving toward these days – I updated the above algorithm with that after writing this). Then the hashes at each level would be combined into the next level up – but how many at a time? There’s a minimum size below which an MD5 sum is no smaller than the data it describes – trivially, 17 bytes – but since these are 16-byte MD5 sums we’re dealing with, even two of them combine to 16+16=32 bytes, and the MD5 sum of that is 16 bytes, so pairing already gives a two-to-one reduction at every level.
I’m losing myself, I think, so I wonder how my audience is doing… Anyway, here’s the new algorithm:
4K blocks, each is hashed;
Every two hashes are hashed (and saved);
Every two of those are hashed (and saved), and so on, until there’s only one hash at the end.
That final hash is sent across and compared. If it differs, no more reading of the file, and no more calculating of hashes, is necessary to find all of the different blocks – the saved tree has everything needed. Send the first child’s hash across: if it’s the same, the difference is in the other child; if it’s different, search this child (remembering, again, that both children can contain changes). Repeat downwards until the changed blocks are pinpointed.
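Here’s a sketch of both halves of that in Python: building the levels of hashes bottom-up (this kind of structure is generally known as a Merkle tree, or hash tree), then walking two such trees from the top down, descending only where the hashes disagree. It assumes both files produce trees of the same height:

```python
import hashlib

BLOCK = 4096  # 4 KB blocks, as above

def hash_tree(path):
    """Level 0 is one MD5 per 4 KB block; each level above hashes
    pairs from the level below (an odd hash at the end is hashed
    alone), until a single hash remains.  Returns all the levels."""
    with open(path, "rb") as f:
        level = []
        while True:
            block = f.read(BLOCK)
            if not block:
                break
            level.append(hashlib.md5(block).digest())
    levels = [level]
    while len(level) > 1:
        level = [hashlib.md5(b"".join(level[i:i + 2])).digest()
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def differing_blocks(a, b, depth=None, i=0):
    """Compare node i at the given level of each tree, descending
    only where the hashes differ; returns the changed block indices."""
    if depth is None:
        depth = len(a) - 1                  # start at the single top hash
    in_a, in_b = i < len(a[depth]), i < len(b[depth])
    if not in_a and not in_b:
        return []                           # past the end of both trees
    if in_a and in_b and a[depth][i] == b[depth][i]:
        return []                           # identical subtree: skip it all
    if depth == 0:
        return [i]                          # this 4 KB block changed
    return (differing_blocks(a, b, depth - 1, 2 * i) +
            differing_blocks(a, b, depth - 1, 2 * i + 1))
```

(For what it’s worth, rsync itself attacks a harder version of the problem: its rolling checksums can find data that has shifted position within the file, not just changed in place.)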
Okay, now I think I’m done. :) Note that it’s the third time I think I’m done, so there’s probably more to this. Please let me know if you can make it more efficient!
And, for a thumbnail I just drew “Pictures of Hash-Stick Men”, referring to the song below it, “Pictures of Matchstick Men” (which itself referred to paintings, I just read). I used Flame Painter, and The Gimp to make the copies of hashes, each line having twice the hashes of the line above it, imitating the doubling I mentioned above. Since each line has a “head,” like a match, it reminded me of the song.
I had first heard it by Camper Van Beethoven, decades ago, and thought that Strawberry Alarm Clock had done it originally; turns out, it was Status Quo:
And also, the Camper Van Beethoven version (the lead singer from Cracker, David Lowery, was also the lead singer in this earlier band; I really, really enjoyed them back in college):
[1] – The following was up above, but my investigation got a little long-winded so I decided to move it down here, so as not to break the flow.
I have no idea why it’s called “MD5”, even though I’ve used it for decades, so I just looked it up: https://linux.die.net/man/1/md5sum doesn’t tell me the answer, but gives me a hint:
Author
Written by Ulrich Drepper, Scott Miller, and David Madore.
The last names are two Ms and a D – “MD” – but I’m not sure where the “5” comes in. Looking further: the above “man” page (“man” for “manual”, like a help page in Unix/Linux) mentions an RFC (“Request For Comments”, a way of getting consensus on a new file format, protocol, etc. in the open source world), so I searched for that and found: https://www.ietf.org/rfc/rfc1321.txt
That mentions, in the table of contents, that the “5” is because it’s the next iteration after “4”:
- Differences Between MD4 and MD5
So, I searched for “rfc for md4” and found this page – turns out, it’s also one RFC prior to the above! (1320 vs. 1321) --
https://tools.ietf.org/html/rfc1320
This is a great example of why it’s “RFC” – a previous one is mentioned, which I opened in another tab and looked at ( https://tools.ietf.org/html/rfc1186 ) – and, at the very top, it states “Obsoleted by: 1320”. So, comments were requested for 1186, they were given, and the proposal was updated and submitted later, with a higher number, 1320.
So, I searched again (this isn’t supposed to be “fun with MD5 flags” :) ), and Wikipedia gave the answer: https://en.wikipedia.org/wiki/MD5
The abbreviation "MD" stands for "Message Digest."
So, it’s just coincidence that the authors’ last names align.
Hilariously, also, looking back at the “man” page up above, I see the “Name” section clearly specifies it, at the end of the brief description!
md5sum - compute and check MD5 message digest
Oh well!
[2] – This also got long-winded, so enjoy if you like the technical stuff! :)
In a nutshell, generally you’re searching for a word in a sentence, i.e., the thing you’re searching for is considerably smaller than the thing you’re searching in. The algorithm analyzes the word, and then searches for the last letter of the word, at the position the last letter should be in. If that’s not it, then it knows the word can’t match there, and moves forward.
The amount it moves forward depends on the “shape” of the word – if searching for “aaaaa” and the fifth letter of the string being searched doesn’t match the “a” at the end of the word, then it can’t match any of the middle letters of the word either, since they’re all “a”; it can’t match the start of the word, for the same reason. So the word can’t overlap that position at all, and the algorithm skips the whole word’s length forward before trying the match again.
However, if it’s searching for, e.g., “xyz”, then it’s a little more complicated. If the third letter doesn’t match “z”, it then compares that same letter (which it already retrieved) to “y”; if that’s a match, it compares the next letter to “z” and the previous one to “x” to complete the match, before moving forward. But if it wasn’t “y”, it compares it to “x”, and if it is, it compares the next two letters to “y” and “z” before moving forward.
Note that the previous assumes that retrieving the value at a certain position is “expensive” so it wants to do so as little as possible. For instance, it could be retrieving it from disk, into a register in the CPU – so, it would want to do as much with the data in that register as possible, before discarding it, and especially try to avoid needing to read it a second time.
When moving forward, to move “as far as it can” means: after checking the retrieved value and finding it’s not “z”, “y”, or “x”, it knows the word can’t overlap this position at all, so it shifts the whole word past that point and next compares the letter three positions later to “z”, and repeats. So, in the best case, this reads only one third of the letters of the string, unless it finds a partial match and needs to explore further. It’s more complicated than that; I’ve tried to simplify it, and moved this to a reference at the end.
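If you’d like to play with it, here’s the simplified Boyer-Moore-Horspool variant in Python, which captures the skip-ahead idea described above:

```python
def horspool_search(text, pattern):
    """Boyer-Moore-Horspool: align the pattern, compare from its last
    character backwards, and on a mismatch skip ahead based on which
    character of the text sat under the pattern's last position."""
    m, n = len(pattern), len(text)
    if m == 0 or n < m:
        return -1
    # How far we may safely shift for each character that can appear
    # under the pattern's last slot (anything else allows a full skip).
    skip = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    pos = 0
    while pos <= n - m:
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1                      # matched so far; keep going left
        if j < 0:
            return pos                  # every character matched
        pos += skip.get(text[pos + m - 1], m)
    return -1

# e.g. horspool_search("find xyz in this sentence", "xyz") returns 5
```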
But isn't that neat? To be able to search an entire string for another string, and not have to visit every character? :) Pretty sure I learned that in Freshman year, but it's still a really neat trick. Like in Doom and other 3D shooters, where they don't have to calculate walls that aren't seen, so that saves a bunch of computations. It's great to be able to save time, to be more efficient. And thus, I'm sorry to have spent your time with this wall of text, if it did not lead to your increased efficiency, or at least, enjoyment. :)