My *nix world

Optimize the virtual machine backup process

I have a few KVM virtual machines that, for various reasons, I back up from time to time.

Some of their virtual disks (let's call them VM-hdd) are stored as plain, uncompressed raw files (compare with qcow2), while others are stored directly on a physical disk partition (the partition holds no file system of its own, only the VM-hdd content).

To back up a VM-hdd that lives on a partition, just use the dd tool to dump the partition content to a raw file (e.g. dd if=/dev/sdX of=/tmp/sdX.raw), then create a compressed archive/copy of it that you can store wherever you like. Likewise, to back up a VM-hdd that is stored as a plain raw file, just compress that file and store it wherever you like.

This post is not about how to back up a virtual machine. It is about how to optimize the whole process, meaning:

  1. dump the VM-hdd from the physical partition to a raw file (if it is not stored as a file already)
  2. compress the VM-hdd image as much as possible
  3. process all the steps using as few resources as possible (CPU, disk, time) while getting the optimal compressed backup copy
  1. dump the VM-hdd from the physical partition to a raw file
dd if=/dev/<partition> of=<dir>/<imagefile>.raw
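
A minimal sketch of this step, assuming a POSIX shell and a VM that is shut down while its disk is being read. The dump_image helper name and the 4 MiB block size are my own illustrative choices, not part of the original setup. A block size larger than dd's 512-byte default usually speeds up the copy, and cmp confirms that the dump matches the source.

```shell
#!/bin/sh
# dump_image SRC DEST: copy SRC (a partition or a file) to DEST, then verify it.
# Hypothetical helper; the name and the block size are illustrative assumptions.
dump_image() {
    src=$1
    dest=$2
    # bs=4194304 (4 MiB) instead of dd's 512-byte default: fewer, larger reads
    dd if="$src" of="$dest" bs=4194304 2>/dev/null || return 1
    # cmp -s exits 0 only if both are byte-for-byte identical
    cmp -s "$src" "$dest"
}
```

It could be called as dump_image /dev/<partition> <dir>/<imagefile>.raw, typically as root since reading a partition needs the right permissions.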
  2. compress the VM-hdd as much as possible

By just running bzip2 (or another compression tool) against the raw file you will, of course, shrink the file to some extent. But what if you have a 100GB VM-hdd that has 99% free space: how much can your compression tool shrink the file? Common sense demands an archive of at most 1GB (which is in fact the disk space used). But can any compression algorithm do that? Honestly, I have my doubts.

The problem is that when you delete a file, the space it occupied is not automatically zeroed. In fact, its content remains there until it is overwritten, byte by byte, by the contents of other files. So if you have used your disk for a while, there is a big chance that the disk, even if it looks empty, still contains more or less random information at the lowest level. And random information cannot be compressed efficiently (unlike, for example, a few billion consecutive zeros). In fact, if you used to have JPEG, MPEG or MP3 files on your disk before it became nearly empty (only 1% disk usage), then the chance of compressing the disk down to at most 1GB (because 99% is free space) is very, very tiny. Because, you know, free space does not mean free of data; it only means that it's free for your OS/applications to use. As said before, at a low level it still contains non-zeroed data.
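
The difference is easy to see on a small scale. The sketch below compresses 1 MiB of zeros and 1 MiB of random bytes and prints both compressed sizes. The compressed_size helper is a name I made up for illustration, and the fallback to gzip when bzip2 is not installed is an assumption added so the sketch runs on minimal systems.

```shell
#!/bin/sh
# compressed_size SRC: read 1 MiB from SRC and print its compressed size in bytes.
# Hypothetical helper; uses bzip2 if installed, otherwise gzip (an assumption).
compressed_size() {
    if command -v bzip2 >/dev/null 2>&1; then tool=bzip2; else tool=gzip; fi
    size=$(dd if="$1" bs=1024 count=1024 2>/dev/null | "$tool" -9 -c | wc -c)
    echo $((size))   # arithmetic expansion strips any padding printed by wc
}

zeros=$(compressed_size /dev/zero)
random=$(compressed_size /dev/urandom)
echo "1 MiB of zeros  -> $zeros bytes"
echo "1 MiB of random -> $random bytes"
```

The zeros should shrink to almost nothing, while the random megabyte should stay at roughly its original size (or even grow slightly).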

So the trick is to zero the unused disk space and only then compress the image. All you have to do is write a file that takes up all the free space until the disk is full (100%). The file should contain as many bytes as necessary to fill the whole disk, but they must all have the same value: billions of spaces, billions of A-s, or billions of zeros. Then you remove that file so that you free the space it took. Because the space that the file once used no longer contains random bytes (as in the JPEG example) but only zeros, a compression tool can shrink it to a tiny fraction of its size. It can do that because it sees that you have, for example, 99GB (~10^11 bytes) of zeros, so all it has to write into the archive is, in effect, "everything that came before + 10^11 zeros + everything that comes after".

After you have zeroed the free space of your disk (and released it back to your OS), it is a good idea to defragment the disk, and even to consolidate the free space if you have software for that, so that the compression gets even better.

If you have a *nix VM guest, zeroing the free space is easy:

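dd if=/dev/zero of=/tmp/zero.file bs=512 count=<free-space-in-bytes/512>   # fill the free space with zeros
rm /tmp/zero.file                                                          # then release the space again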
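
A variant that does not need the free space computed up front is to let dd run until the file system is full and then delete the file. The sketch below assumes a POSIX shell inside the guest. The zero_free_space name and the optional MAX_MIB cap are my own additions, the cap being there only so the helper can be tried safely on a small scale. Keep in mind that a completely full disk can upset running services for the short time before the file is removed.

```shell
#!/bin/sh
# zero_free_space DIR [MAX_MIB]
# Writes zeros into DIR/zero.file, then deletes the file again.
# Without MAX_MIB, dd runs until the file system is full (the intended use);
# MAX_MIB is an optional cap added here for trying the helper safely.
zero_free_space() {
    dir=$1
    max_mib=${2:-}
    zero_file="$dir/zero.file"
    if [ -n "$max_mib" ]; then
        dd if=/dev/zero of="$zero_file" bs=1048576 count="$max_mib" 2>/dev/null
    else
        # dd stops with "No space left on device"; that failure is expected
        dd if=/dev/zero of="$zero_file" bs=1048576 2>/dev/null || true
    fi
    sync                              # flush the zeros to disk before deleting
    written=$(wc -c < "$zero_file")
    rm -f "$zero_file"                # hand the space back to the OS
    echo $((written))                 # number of bytes that were zeroed
}
```

For example, zero_free_space /tmp 100 zeroes only 100 MiB, while zero_free_space /tmp fills the whole file system that holds /tmp.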

If you have a Windows guest, you can zero the free space with SDelete, a free tool written by Mark Russinovich (Winternals co-founder). Like many other tools by the same author, it can be downloaded and used freely.

It is worth mentioning that if you want to zero the free space on volume/partition x:, you should run the SDelete command as below:

x:                 # first change the active partition to X
sdelete.exe -z     # execute the tool from that partition

If you are still on Windows, you can use MyDefrag or UltraDefrag (both freeware) to consolidate/defragment the free disk space.

Now, finally, let's compress that hyper-optimized VM-hdd disk image that you've gotten.

If you are on Linux then I would recommend the bzip2 compression filter. Or, even better, the parallel version of it, pbzip2. Do you have a cluster? Then, even better, you may use the MPI version of bzip2.

Usually I back up both the virtual disk(s) and the VM XML definition.

tar cf - <dir>/vm-hdd-image.raw <dir>/vm-definition.xml | pbzip2 -9 -m2000 -v -c > <dir>/vm-hdd-image.tar.bz2
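
It is worth checking that both files really ended up in the archive. The sketch below wraps the same tar-plus-compressor pipeline in a helper and lists the archive members afterwards. The function names (pick_compressor, archive_vm, list_archive) are hypothetical, and the fallback from pbzip2 to bzip2 and then gzip is an assumption so the sketch still runs where pbzip2 is missing (with gzip, a .tar.bz2 file name would of course be misleading).

```shell
#!/bin/sh
# pick_compressor: prefer pbzip2, fall back to bzip2, then gzip.
# The fallback chain is an assumption, not part of the original setup.
pick_compressor() {
    for tool in pbzip2 bzip2 gzip; do
        if command -v "$tool" >/dev/null 2>&1; then echo "$tool"; return 0; fi
    done
    return 1
}

# archive_vm DIR ARCHIVE FILE...: tar the named files (relative to DIR)
# and compress the stream into ARCHIVE.
archive_vm() {
    dir=$1; archive=$2; shift 2
    tar cf - -C "$dir" "$@" | "$(pick_compressor)" -9 -c > "$archive"
}

# list_archive ARCHIVE: decompress on the fly and print the member names.
list_archive() {
    "$(pick_compressor)" -dc "$1" | tar tf -
}
```

Using -C keeps the member names relative, so the archive can later be unpacked into any directory.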

If your VM-hdd is stored on a physical partition (so it's not just a simple raw file), you can even dump and compress the whole thing in a single line:

dd if=/dev/<partition> | pbzip2 -9 -m2000 -v -c > <dir>/vm-hdd-image.raw.bz2
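
Since no intermediate raw file is left behind in this variant, a checksum is a cheap way to confirm that the compressed dump is good. The sketch below streams the source through the compressor and then compares the cksum of the source with the cksum of the decompressed archive. The names backup_device and verify_backup are hypothetical, and the pbzip2, bzip2, gzip fallback chain is again an assumption.

```shell
#!/bin/sh
# Prefer pbzip2, then bzip2, then gzip (the fallback chain is an assumption).
COMPRESSOR=$(for t in pbzip2 bzip2 gzip; do
    command -v "$t" >/dev/null 2>&1 && { echo "$t"; break; }
done)

# backup_device SRC ARCHIVE: stream SRC through the compressor,
# without writing a temporary raw file.
backup_device() {
    dd if="$1" bs=4194304 2>/dev/null | "$COMPRESSOR" -9 -c > "$2"
}

# verify_backup SRC ARCHIVE: succeed only if the decompressed archive has the
# same CRC and length as SRC (cksum prints "<crc> <bytes>" when reading stdin).
verify_backup() {
    want=$(cksum < "$1")
    got=$("$COMPRESSOR" -dc "$2" | cksum)
    [ "$want" = "$got" ]
}
```

The verification only makes sense while the source is unchanged, so the VM should stay shut down between the dump and the check.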
  3. process all the steps using as few resources as possible (CPU, disk, time) while getting the optimal compressed backup copy

Overall, it was an easy and straightforward process: just zero the free space, defragment the disk, then dump/tar the disk image to your backup media.

To give you a picture of why all this effort is worthwhile, I will briefly present two experiments I've done:

  • an XP Home + SP3 system with 3GB free space on a 10GB VM-hdd
    • defrag + zeroing free space + bzip2 => ~4GB bz2 archive (~40% of VM-hdd size)
  • an XP Home + SP3 system with 8GB free space on a 10GB VM-hdd
    • defrag + zeroing free space + bzip2 => ~0.9GB bz2 archive (~9% of VM-hdd size)
    • defrag + zeroing free space + qcow2 => ~1.1GB qcow2 disk (~11% of VM-hdd size)

Without the disk defragmentation/free space consolidation this compression rate would be hard to achieve, because the data on the disk is spread so randomly, and the free space holds such random leftovers, that it is hard to find repetitive patterns to shrink effectively.

BTW: I am using the last XP Home archive (the 0.9GB .bz2 file) as a backup of a pre-installed image that I can use directly in case of "emergency" and/or when I want/have to start from scratch with an XP VM.

Now, if you think this article was interesting, don't forget to rate it. It shows me that you care, and thus I will continue to write about these things.

Eugen Mihailescu

Founder/programmer/one-man-show at Cubique Software
Always looking to learn more about *nix world, about the fundamental concepts of math, physics, electronics. I am also passionate about programming, database and systems administration. 16+ yrs experience in software development, designing enterprise systems, IT support and troubleshooting.

One thought on “Optimize the virtual machine backup process”

  1. PaulL

    Nice suggestion - I did something similar for my vm backups when configuring drbdadm as a backing store. I'd noticed the poor compression and figured it was through not zeroing the file system before starting the build. I hadn't gotten quite as far as working out how to zero it afterwards.
