I recently ran into a situation where I needed to copy a large amount of data from one box to another in a LAN environment. In a situation like this, the following things are usually true; at least for this project they were:
- Just creating a Tar archive on the source box and transferring it over isn’t gonna fly.
- The source contains many files and directories (in the millions); enough that it’s not even practical to use file-based methods to move the data over.
- The disk the data resides on is not exactly “fast” and may be exceptionally “old”.
- We need to maximize transfer speed and we don’t care about “syncing”; we just want a raw dump of the data from one place to another.
- We don’t really care about the surrounding protocols involved in the network transfer; we just want maximum throughput with little overhead.
For this project, more specifically, I ended up needing to do a block-level copy utilizing the fantastic dd utility. The answer? Pipes and netcat (or SSH if you really need encryption)!
Not sure what the difference between file-level and block-level storage is? It’s worth reading up on before continuing.
Let’s jump in!
Before doing block-level operations, you may want to look into zeroing out unused blocks, which makes the resulting image much smaller if you compress it or store it sparsely.
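A common way to do this, assuming the source filesystem is still mounted (the /mnt/src mount point below is just a placeholder), is to fill the free space with zeroes and then delete the filler file:
dd if=/dev/zero of=/mnt/src/zero.fill bs=1M
sync
rm /mnt/src/zero.fill
Alternatively, the zerofree utility can zero unused blocks on an unmounted ext2/3/4 filesystem without needing the temporary file.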
On the destination, get one of these going in a screen/tmux:
nc -l -p 2342 | pv | dd of=/dev/sdc
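A couple of notes on that listener: the -l -p 2342 syntax is for the traditional/GNU flavor of netcat, while the BSD flavor takes the port directly after -l. Also, dd defaults to 512-byte blocks, so giving it a larger block size usually helps throughput. The same listener under those assumptions:
nc -l 2342 | pv | dd of=/dev/sdc bs=1M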
On the source, start the actual transfer:
dd if=/dev/sdb | nc dest.hostname.net 2342
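If the network turns out to be the bottleneck rather than the disk, one variant worth trying is compressing the stream in flight. This is just a sketch using cheap gzip -1 compression (pigz would let you use multiple cores):
dd if=/dev/sdb bs=1M | gzip -1 | nc dest.hostname.net 2342
with a matching gunzip inserted on the destination side:
nc -l -p 2342 | gunzip | pv | dd of=/dev/sdc bs=1M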
On the destination, since we used pv, you can see the progress of the stream, including the speed and how much data has been streamed over.
If netcat isn’t your style, this can be done with an ssh one-liner on the source box utilizing less resource-intensive encryption and HMAC settings (assuming, of course, that both src and dest have the same HMACs and ciphers available):
dd if=/dev/sda | ssh -o 'compression no' -c arcfour128 -m hmac-md5 user@dest.hostname.net dd of=/dev/sdc
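Fair warning: arcfour128 and hmac-md5 have been dropped from recent OpenSSH releases, so on newer systems you’d substitute whatever lightweight cipher and MAC both ends actually support. Newer OpenSSH versions can list what’s available:
ssh -Q cipher
ssh -Q mac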
You could also use netcat with tar or other utilities which allow you to push data through a pipe(line) and have a way to receive on the other end.
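As a rough sketch of the tar variant (the port and /data paths are placeholders), start the receiver on the destination:
nc -l -p 2342 | tar -xf - -C /data
then push from the source:
tar -cf - -C /data . | nc dest.hostname.net 2342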
If you are also moving over a full disk partition (like I am), the following steps walk through expanding the filesystem onto the larger disk and, if it’s ext3, upgrading it to ext4:
Resize to expand onto a larger dest. disk:
resize2fs /dev/sdc
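Two gotchas here: resize2fs will usually insist on a fresh filesystem check first, so expect to run
e2fsck -f /dev/sdc
before it will proceed. Also, resize2fs grows the filesystem, not the partition; these examples put the filesystem on the raw disk, but if yours lives on a partition like /dev/sdc1, expand the partition with fdisk or parted first.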
Convert to ext4 (from ext3):
a) fsck.ext3 -pf /dev/sdc
b) tune2fs -O extents,uninit_bg,dir_index /dev/sdc
c) fsck.ext4 -yfD /dev/sdc
d) Edit /etc/fstab to change any old filesystem declaration (an example entry follows below)
Ensure the filesystem is configured correctly:
tune2fs -l /dev/sdc | grep features
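For step d), only the filesystem type field needs to change; a typical entry (the /data mount point is just an example) would end up looking like:
/dev/sdc    /data    ext4    defaults    0    2
In the tune2fs output above, you should now see the extent, uninit_bg, and dir_index features listed.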
If you used mkfs.ext4 to create the filesystem, and its intention is purely for data (no /boot, /usr, /var, or other operating system needs), you may want to disable reserved blocks (by default, 5% of the filesystem is set aside for root):
tune2fs -m 0 /dev/device
- I would also recommend you take the time to run e4defrag, which will enable the extent option (specific to ext4) for all your existing files:
e4defrag /dev/sdc
Note: An e4defrag could take a very long time! Good thing you can run it online :)
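If you want a sense of how fragmented the filesystem actually is before committing to a full pass, e4defrag has a check-only mode that reports fragmentation without moving any data:
e4defrag -c /dev/sdc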
Well, that’s all folks! I hope I’ve helped some people out there, as it can be frustrating to deal with old disks/servers holding large amounts of data. Trying to tar/rsync millions of files can take ages because of all the extra operations involved (recursing directories, processing each location, opening and reading every file), whereas a block-level copy provides one constant stream of data and can speed things up dramatically.
One last thing: This all depends on your dataset, source/dest, and storage needs. In my case, I have over 10 million files and directories on a small disk in an old server.