I recently ran into a situation where I needed to copy a large amount of data from one box to another in a LAN environment. In a situation like this, the following things are usually true; at least for this project they were:
- Just creating a Tar archive on the source box and transferring it over isn’t gonna fly.
- The source contains many files and directories (in the millions); enough that it’s not even practical to use file-based methods to move the data over.
- The disk the data resides on is not exactly “fast” and may be exceptionally “old”.
- We need to maximize transfer speed and we don’t care about “syncing”; we just want a raw dump of the data from one place to another.
- We don’t really care about the surrounding protocols involved in the network transfer; we just want maximum throughput with little overhead.
For this project, more specifically, I ended up needing to do a block-level copy utilizing the fantastic dd utility. The answer? Pipes and netcat (or SSH if you really need encryption)!
Not sure what the difference between file-level and block-level storage is? It’s worth reading up on before continuing.
Let’s jump in!
Before doing block-level operations, you may want to look into zeroing out unused blocks, which makes the resulting image much smaller if you compress it or store it sparsely.
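A common way to do this, assuming the source filesystem is still mounted (the /mnt/src mount point below is just a placeholder), is to fill the free space with zeroes and then delete the filler file:
dd if=/dev/zero of=/mnt/src/zero.fill bs=1M
sync
rm /mnt/src/zero.fill
Alternatively, the zerofree utility can zero unused blocks on an unmounted ext2/3/4 filesystem without needing the temporary file.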
On the destination, get one of these going in a screen/tmux:
nc -l -p 2342 | pv | dd of=/dev/sdc
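A couple of notes on that listener: the -l -p 2342 syntax is for the traditional/GNU flavor of netcat, while the BSD flavor takes the port directly after -l. Also, dd defaults to 512-byte blocks, so giving it a larger block size usually helps throughput. The same listener under those assumptions:
nc -l 2342 | pv | dd of=/dev/sdc bs=1M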
On the source, start the actual transfer:
dd if=/dev/sdb | nc dest.hostname.net 2342
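If the network turns out to be the bottleneck rather than the disk, one variant worth trying is compressing the stream in flight. This is just a sketch using cheap gzip -1 compression (pigz would let you use multiple cores):
dd if=/dev/sdb bs=1M | gzip -1 | nc dest.hostname.net 2342
with a matching gunzip inserted on the destination side:
nc -l -p 2342 | gunzip | pv | dd of=/dev/sdc bs=1M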
On the destination, since we used pv, you can see the progress of the stream, including the speed and how much data has been streamed over.
If netcat isn’t your style, this can be done with an ssh one-liner on the source box utilizing less resource-intensive encryption and HMAC settings (assuming, of course, that both src and dest have the same HMACs and ciphers available):
dd if=/dev/sda | ssh -o 'compression no' -c arcfour128 -m hmac-md5 user@dest.hostname.net dd of=/dev/sdc
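Fair warning: arcfour128 and hmac-md5 have been dropped from recent OpenSSH releases, so on newer systems you’d substitute whatever lightweight cipher and MAC both ends actually support. Newer OpenSSH versions can list what’s available:
ssh -Q cipher
ssh -Q mac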
You could also use netcat with tar or other utilities which allow you to push data through a pipe(line) and have a way to receive on the other end.
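As a rough sketch of the tar variant (the port and /data paths are placeholders), start the receiver on the destination:
nc -l -p 2342 | tar -xf - -C /data
then push from the source:
tar -cf - -C /data . | nc dest.hostname.net 2342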
If you are also moving over a full disk partition (like I am), the following steps walk through expanding the filesystem onto the larger disk and, if it’s ext3, upgrading it to ext4:
Resize to expand onto a larger dest. disk:
resize2fs /dev/sdc
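Two gotchas here: resize2fs will usually insist on a fresh filesystem check first, so expect to run
e2fsck -f /dev/sdc
before it will proceed. Also, resize2fs grows the filesystem, not the partition; these examples put the filesystem on the raw disk, but if yours lives on a partition like /dev/sdc1, expand the partition with fdisk or parted first.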
Convert to ext4 (from ext3):
a) fsck.ext3 -pf /dev/sdc
b) tune2fs -O extents,uninit_bg,dir_index /dev/sdc
c) fsck.ext4 -yfD /dev/sdc
d) Edit /etc/fstab to change any old filesystem declaration (an example entry follows below)
Ensure the filesystem is configured correctly:
tune2fs -l /dev/sdc | grep features
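For step d), only the filesystem type field needs to change; a typical entry (the /data mount point is just an example) would end up looking like:
/dev/sdc    /data    ext4    defaults    0    2
In the tune2fs output above, you should now see the extent, uninit_bg, and dir_index features listed.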
If you used mkfs.ext4 to create the filesystem, and its intention is purely for data (no /boot, /usr, /var, or other operating system needs), you may want to disable reserved blocks (by default, 5% of the filesystem is set aside for root):
tune2fs -m 0 /dev/device
- I would also recommend you take the time to run e4defrag, which will enable the extent option (specific to ext4) for all your existing files:
e4defrag /dev/sdc
Note: An e4defrag could take a very long time! Good thing you can run it online :)
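If you want a sense of how fragmented the filesystem actually is before committing to a full pass, e4defrag has a check-only mode that reports fragmentation without moving any data:
e4defrag -c /dev/sdc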
Well, that’s all folks! I hope I’ve helped some people out there, as it can be frustrating to deal with old disks/servers holding large amounts of data. Trying to tar/rsync millions of files can take ages because of all the extra operations involved (recursing directories, processing each location, opening and reading every file), whereas a block-level copy provides one constant stream of data and can speed things up dramatically.
One last thing: This all depends on your dataset, source/dest, and storage needs. In my case, I have over 10 million files and directories on a small disk in an old server.