
Building a Cheap ZFS Server

I store my data around the house on a variety of FireWire and USB hard drives and I've been feeling the need to consolidate. Partly it's getting everything into one device. Partly it's fault tolerance: right now, if a drive dies I lose everything on it. Drobo is one contender, especially now that they've added FireWire, but I've been hearing vague, unsettling things about its reliability. ReadyNAS is another option, but it's almost $900 on Amazon (disks extra). At the high end for personal storage servers there's the five-bay Thecus N5200 Pro ($800) and an eight-bay ReadyNAS Pro (no price yet).

Of course at the really high end there are network appliances like NetApp, but those are totally out of my league. Or are they? Sun has been trying to drink NetApp's milkshake, first by developing ZFS, an open-source filesystem like WAFL on steroids. Then they added CIFS to Solaris and started selling Thumper, a storage server with up to 48 SATA drives. And Solaris is free, and PC hardware is cheap. What if I built myself a mini-Thumper with a $50 Celeron instead of dual Opterons and a few drives instead of 48?

Take a minute and check out the ZFS presentation. It's impressive and compelling: ZFS is an open-source, transactional, consistent, self-monitoring, self-healing, scalable file system with complete data-integrity guarantees. It is some seriously powerful secret sauce. After reading that presentation and a few entries on Jeff Bonwick's blog I was convinced that Sun is onto something, and I decided to build my own mini-Thumper serving files from a disk pool managed by ZFS. It looked very promising, and even if things went completely wrong I'd still end up with a cheap PC that I could give to a friend or sell on Craigslist.


Design Goals
  • Cheaper than ReadyNAS or Drobo

  • Faster than ReadyNAS or Drobo

  • More capacity than ReadyNAS or Drobo

  • Compatible with Windows and OSX

  • Low power

  • Quiet/silent operation



Which OS?

The first question was the operating system. FreeBSD 7.0 includes ZFS, but there it's still considered "experimental". Linux can't use ZFS directly due to the incompatibility between the CDDL and the GPL, though it's available via FUSE. OSX doesn't support it yet, although it's been promised in 10.6. I eventually decided that OpenSolaris seemed like the best idea: it's what the Thumper runs, it's the reference ZFS implementation, and it provides a kernel and tools designed to work well with ZFS.


Hardware

Next question: what's this going to run on? I spent a few weeks checking the hardware compatibility list and reading the forums before picking the following components.

  • Intel DP35DP motherboard: Intel's basic no-frills socket-775 board is as standard as they come. It provides ports for all six of the ICH9R southbridge's SATA channels. The Intel 82566 gigabit ethernet chip does hardware packet checksumming to maintain full gigabit speeds and is supported by the same e1000g driver that powers the Thumper. ($110, Frys)

  • Intel Celeron E1200 1.6GHz dual-core CPU: Runs cool and quiet, and supports 64-bit execution. You don't need a lot of power (or money) if all you're doing is computing block checksums, and I can always replace it with something faster if I need to. ($50, NewEgg)

  • A-Data 2x2GB DDR2 800 (PC2 6400): 4GB of RAM will keep things speedy by giving ZFS room for big buffers, and I'll be able to address all of it since Solaris has supported 64-bit since the UltraSPARC days. The RAM is cheap but it passes memtest86+ with no errors. ($80, NewEgg)

  • 80GB Toshiba MK8037GSX 2.5" SATA laptop hard drive: With all that memory you won't be hitting the disk very much after everything has been loaded. Laptop drives are smaller, quieter, and run cooler than 3.5" drives. (used, $25, Amazon.com)

  • Antec Three Hundred case: relatively small, with enough internal space to hold six hard drives, plus three 5.25" drive bays that can be converted to hold four more 2.5" drives if it ever comes to that. ($70, NewEgg)

  • e-GeForce 7200 GS PCI-E graphics card: Cheapest PCI-E card on the Frys shelf. Fanless means zero noise and probably low power consumption. If you've got a spare PCI video card lying around you can just use that. ($70 - $30 rebate, Frys)

  • Corsair HX520W power supply: 80plus certified and highly recommended by SilentPC. Modular cabling lets you install only the cables you need. Very quiet. ($100 - $30 rebate, Frys)

  • Not included: optical drive, keyboard, mouse, monitor. Just borrow these from a friend or another computer for a few days. You don't need these once everything is up and running.

  • Total: $445 (assuming all my rebates actually get processed). That's $42 cheaper than a Drobo FW800 and less than half the price of a ReadyNAS. Of course I kinda splurged with this setup - I could have gotten away with 2GB instead of 4GB and an efficient 300W power supply instead of a fancy modular 520W, which would have put this at $380 without sacrificing very much.

    If you've got an old spare PC lying around then it's free, which is even better.



Installation

Set-up was fairly simple: boot from the LiveCD, run the installer. If you want to experiment before diving in you can install Solaris inside VirtualBox on your own computer; VirtualBox is free and works very well. The only real difficulty came from the realization that Solaris is not Linux. A bunch of stuff works differently: "sudo" is "pfexec", system services are provisioned and managed differently, and so on. Even if you're reasonably familiar with Linux you'll have at least a few moments of confusion. There's also not a lot of support for n00bs, since most of the people who know Solaris really well are the sort of uberpowerful and cranky BOFHs who will respond to your ignorance with appalled condescension, if they can be troubled to respond at all. You probably ought to RTFM anyway, and between the official documentation (extensive and well-written) and the unofficial wiki (not so much) you'll probably be able to find what you need.
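
A few Solaris-isms are worth writing down before you start. These are real commands; "ssh" is just an example service:

pfexec zpool status          # pfexec is (roughly) Solaris's sudo
svcs -a | grep ssh           # system services are managed by SMF; svcs lists them
pfexec svcadm enable ssh     # svcadm enables and starts a service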

Here are the various guides that I ended up finding useful. Just follow the examples:



Benchmarks

So how fast is it? For bare network speed, iperf says I'm getting 928Mbits/s between the server and my MacBook Pro over a GS105 switch:

[ 3] local 10.0.1.4 port 51462 connected with 10.0.1.2 port 5001
[ 3] 0.0-10.0 sec 1.08 GBytes 928 Mbits/sec
[ 3] MSS size 1448 bytes (MTU 1500 bytes, ethernet)
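
For the record the test is stock iperf, nothing fancy - something like this, with the server running on the Solaris box:

sbox$ iperf -s
gmb$ iperf -c 10.0.1.2 -m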


That number increases to 971Mbits/s when I turn on jumbo frames. That's good enough that I'm not worried about the network being misconfigured.

[ 3] local 10.0.1.4 port 51465 connected with 10.0.1.2 port 5001
[ 3] 0.0-10.0 sec 1.13 GBytes 971 Mbits/sec
[ 3] MSS size 8948 bytes (MTU 8988 bytes, unknown interface)


This is all traffic between my MacBook Pro and the server. Just for kicks I installed a Pro1000gt card in my Windows desktop and turned on jumbo frames, and ended up exceeding the gigabit spec. I'm not quite sure how that's possible, but OK.

[ 4] local 10.0.1.2 port 5001 connected with 10.0.1.3 port 1858
[ 4] 0.0-10.0 sec 1.51 GBytes 1.26 Gbits/sec
[ 4] MSS size 8960 bytes (MTU 9000 bytes, unknown interface)


Not all gigabit ethernet cards are the same! The desktop's Gigabyte motherboard has an onboard RTL 8111C chip that supposedly speaks gigabit but only puts out 300Kbits/sec.

For drive speed, the used $25 laptop drive writes at 48MB/s, and the four full-size drives don't do that much better at 64MB/s. I've heard that RAID write performance can be slow, but because I'm comparing my setup to other RAID solutions I wanted this to be apples-to-apples. I could reconfigure the pool as two mirrors rather than a 3+1 RAID-Z, which I'm told would be a bit faster; there's a sketch of both layouts after the read numbers below.

soren@sbox:~$ dd if=/dev/zero of=${HOME}/tmp.bin bs=1024 count=1048576
1048576+0 records in
1048576+0 records out
1073741824 bytes (1.1 GB) copied, 22.1752 s, 48.4 MB/s
soren@sbox:~$ dd if=/dev/zero of=/tank/soren/tmp.bin bs=1024 count=1048576
1048576+0 records in
1048576+0 records out
1073741824 bytes (1.1 GB) copied, 16.6704 s, 64.4 MB/s


Read speed is a bit faster.

soren@sbox:~$ dd if=${HOME}/tmp.bin of=/dev/null bs=1024 count=1048576
1048576+0 records in
1048576+0 records out
1073741824 bytes (1.1 GB) copied, 13.4813 s, 79.6 MB/s
soren@sbox:~$ time dd if=/tank/soren/tmp.bin of=/dev/null bs=1024 count=1048576
1048576+0 records in
1048576+0 records out
1073741824 bytes (1.1 GB) copied, 9.1267 s, 118 MB/s
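
If I ever rebuild the pool, the two layouts would be created roughly like this (a sketch; the device names are placeholders, not my actual controller numbers):

zpool create tank raidz c3t0d0 c3t1d0 c3t2d0 c3t3d0           # current layout: one 3+1 RAID-Z vdev
zpool create tank mirror c3t0d0 c3t1d0 mirror c3t2d0 c3t3d0   # alternative: two striped mirrors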


Over FTP I can sustain a respectable 104MB/s, which is faster than any of my local drives can write.

ftp> get tmp.bin /dev/null
local: /dev/null remote: tmp.bin
229 Entering Extended Passive Mode (|||64338|)
150 Opening BINARY mode data connection for tmp.bin (1073741824 bytes).
226 Transfer complete.
1073741824 bytes received in 00:09 (103.94 MB/s)


Over CIFS (Samba) I'm reading a 1GB file in 23 to 27 seconds, which works out to somewhere between 37.4MB/s and 43.9MB/s. This guy says that you can't expect more than 30MB/s over ethernet, and I'm pleased to exceed his expectations. I can do CIFS writes at 34MB/s, which also seems fairly respectable.

gmb:~ soren$ time cp /Volumes/soren_new/tmp.bin /dev/null
real 0m23.321s
user 0m0.007s
sys 0m2.631s
gmb:~ soren$ time cp /Volumes/TB/tmp.bin /Volumes/soren_new/tmp2.bin
real 0m29.930s
user 0m0.005s
sys 0m3.517s
gmb:~ soren$ time cat /Volumes/soren/foo.bin > /dev/null
real 0m27.364s
user 0m0.042s
sys 0m3.868s
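
For reference, sharing a dataset only takes a couple of commands, at least if you use the in-kernel CIFS server (a sketch; the dataset name is mine):

pfexec svcadm enable -r smb/server
pfexec zfs set sharesmb=on tank/soren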


After a bit of futzing around I was able to install netatalk and start up an AFP server. It gives me 48MB/s write, and either 58MB/s or 94MB/s read depending on whether I'm doing a block copy with dd or a file copy. (I checked Activity Monitor and yes, bits really are coming off the ethernet device that fast.)

gmb:~ soren$ time dd if=/dev/zero of=/Volumes/soren/foo.bin bs=1024 count=1048576
1048576+0 records in
1048576+0 records out
1073741824 bytes transferred in 21.050765 secs (51007259 bytes/sec)
gmb:~ soren$ time dd if=/Volumes/soren/foo.bin of=/dev/null bs=1024
1048576+0 records in
1048576+0 records out
1073741824 bytes transferred in 17.356553 secs (61863771 bytes/sec)
gmb:~ soren$ touch /Volumes/soren/foo.bin
gmb:~ soren$ time cat /Volumes/soren/foo.bin > /dev/null
real 0m10.865s
user 0m0.015s
sys 0m1.988s
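
The netatalk side boils down to one line in AppleVolumes.default once the daemon is built and running (a sketch; the path, volume name, and config location depend on your build):

# e.g. in /usr/local/etc/netatalk/AppleVolumes.default
/tank/soren "soren" allow:soren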


The Intel motherboard doesn't support overclocking, so I couldn't see how much better this would do at a higher clock speed, but I did try disabling one of the cores to see how big a hit performance would take. Surprisingly, it didn't seem to make much of a difference. You could probably save $10, use less power, and get just as good performance with a 35W Celeron 430.

I didn't have a ReadyNAS to test, but Tom's Hardware says it'll do 30MB/s write and 35MB/s read. The Thecus N5200 is a Celeron-powered Linux server and gives similar performance: 49.6MB/s read, 37.7MB/s write.

Update: I've learned that dd is not a good disk benchmark. Here's the bonnie++ output for local access:

Version     1.93c   ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
sbox             8G    75  99 92921  26 64215  29   165  99 122683  19 395.0  10
Latency               161ms    3501ms    5294ms     139ms     504ms     643ms
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
sbox             16  8659  59 +++++ +++ 11411  84 12291  81 +++++ +++ 12244  82
Latency             32546us     365us   10085us   24350us     126us     483us


1.93c,1.94,sbox,1,1242278321,8G,,75,99,92921,26,64215,29,165,99,122683,19,395.0,10,16,,,,,8659,59,+++++,+++,11411,84,12291,81,+++++,+++,12244,82,161ms,3501ms,5294ms,139ms,504ms,643ms,32546us,365us,10085us,24350us,126us,483us

And here are the numbers over a CIFS share:

Version 1.93c       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
gmb.local      300M     1   6  5704   5  4300   5     2  10 14806   9  3784 104
Latency             16687ms    2001ms    2002ms    9738ms    2000ms     458ms
Version 1.93c       ------Sequential Create------ --------Random Create--------
gmb.local           -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16   266  18  9996  22   550  26   301  22  2358  11   462  26
Latency              2008ms    7874us   75222us     269ms   16440us     230ms


1.93c,1.93c,gmb.local,1,1242276150,300M,,1,6,5704,5,4300,5,2,10,14806,9,3784,104,16,,,,,266,18,9996,22,550,26,301,22,2358,11,462,26,16687ms,2001ms,2002ms,9738ms,2000ms,458ms,2008ms,7874us,75222us,269ms,16440us,230ms
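
If you want to reproduce these: bonnie++ just takes a target directory and a file size, so the invocations look something like this (the sizes match the runs above):

sbox$ bonnie++ -d /tank/soren -s 8g
gmb$ bonnie++ -d /Volumes/soren -s 300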




Frustrations, Discoveries, and Other Notes

  • It's pretty quiet - very little of the fan noise that some people complain about from Drobo or ReadyNAS. The loudest thing is the stock fan on the Celeron, and there are even quieter options if I wanted to go that route.

  • One nice trick: once you get everything working, launch VNC instead of X. Then you can stick it in a closet and use the screen remotely if SSH isn't enough for what you're doing. For bonus points start up SSH (svcadm enable ssh), register with dyndns, set port forwarding on your router, and set up an SSH tunnel to your home machine (ssh -CN soren@myserver.dyndns.org -L 2000/localhost/5900). Now you can access your desktop over an encrypted pipe from anywhere in the world.

  • People seem to like the ReadyNAS because it speaks both CIFS and AFP. Some NAS boxes can also act as bittorrent clients. A properly-configured Solaris box can run bittorrent while serving CIFS, AFP, NFS, iSCSI, FTP, SCP, and pretty much anything else you can think of.
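
    With ZFS and SMF most of those are one- or two-liners apiece, e.g. (a sketch; the dataset and volume names are mine):

    pfexec zfs set sharenfs=on tank/soren        # NFS export
    pfexec zfs create -V 50G tank/scratchvol     # iSCSI targets want a zvol
    pfexec zfs set shareiscsi=on tank/scratchvol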

  • There are a few weird file-permission problems that I'm still sorting out. AFP/netatalk leaves extra lock files lying around, and CIFS sometimes won't let me move directories around or creates files with mode 000.

  • My Kill-a-Watt says it's using 80 watts at idle, 95 watts while shuffling files around, and 112 watts while scrubbing all the drives. That's quite a bit more than the ReadyNAS's 30/60 watts; I thought I could do better. The 520W power supply is probably overkill - I could have saved $20 by picking a smaller 300W supply, but those don't come with modular cables.

  • I originally bought a DQ35JO hoping to save some money and watts with the integrated video, but I returned it after Solaris had problems: integrated video uses main memory as VRAM, and Solaris seemed to be writing into that area when it shouldn't have. I could have chopped $30 more off the price if OpenSolaris could handle integrated video. 4GB of RAM is also probably overkill - you could save another $40 by going with 2GB, which ought to be sufficient.

  • I haven't mentioned the actual drives for storage. You can put anything you want in a Drobo or ReadyNAS, and you can put anything you want in this box too. I picked four WD Caviar Green 750GB drives at $120 each. They're not the fastest drives on the market, but they're the coolest and quietest. Unlike the Drobo or ReadyNAS you can expand this setup to ten drives total before running out of room in the case, or even more with other cases. (You'll need a secondary SATA controller card.)

  • The motherboard can configure the SATA drives as AHCI devices and OpenSolaris is fine with this. I didn't even need to do anything special, unlike Windows where you have to hit F6 and load special AHCI drivers.

  • It was hard to decide what to run - or indeed, to figure out what's available and what (if any) the difference is. There's a confusing plethora of x86 Solaris distributions to choose from. There's OpenSolaris aka Indiana, and there's Solaris 11 aka Solaris Express Community Edition aka SXCE aka SXDE aka ONNV aka Nevada. I don't pretend to know what the hell is going on with all these names, nor do I represent that I'm relaying the situation accurately. Both are free, but you can download Indiana straight from a URL whereas you have to log in to download Nevada. The difference seems to come down to whether you want a Solaris that looks like Ubuntu and comes on a CD ISO or a Solaris that looks like Solaris and comes on a DVD ISO. (There are also Schillix, MartUX, BeleniX, and Nexenta, which are OpenSolaris-based distributions not produced by Sun.)

  • With four 3.5" drives, a laptop drive, and an optical drive the inside of my case was a rats' nest of SATA cables. I ended up buying a bunch of 5 and 9 inch cables from satacables.com which cleaned things up nicely.

  • I intended to boot opensolaris from an 8GB USB flash drive but never got this to work. The kernel gave me some complaint about not being able to load boot caches.

  • If you want full gigabit speeds, cable quality matters. I found a few cat5 cables that wouldn't autonegotiate to full gigabit. I found one cable that worked at gigabit speeds until I zip-tied it into a bundle. Crimping my own custom length cables with a spool of cat5e ended up working best. Monoprice also sells very cheap cat6 patch cable.

  • One limitation of ZFS is that you can't expand a RAID-Z vdev once it's set up. You can add more vdevs to the pool and you can replace disks within a vdev, but you can't make a 4+1 vdev into a 5+1 or a 3+1. Keep this in mind when you're setting up your pool; you might want to keep each vdev small enough to be manageable.
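
    For example (a sketch; device names are hypothetical):

    zpool add tank raidz c5t0d0 c5t1d0 c5t2d0 c5t3d0    # fine: adds a second 3+1 vdev to the pool
    zpool replace tank c3t0d0 c6t0d0                    # fine: swaps out one member disk
    # but nothing turns the existing 3+1 raidz into a 4+1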

  • I haven't yet figured out a good way to script a weekly scrub of the ZFS pool that mails me a report. That's one of the key features of ZFS - it tells you about your small problems before they start getting big - but you need to make sure it actually tells you.
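
    My best guess so far is a script along these lines, kicked off weekly from cron (an untested sketch; the pool name and address are placeholders):

    #!/bin/sh
    # run from root's crontab, e.g.: 0 3 * * 0 /root/scrub-report.sh
    # scrub the pool, wait for the scrub to finish, then mail the report
    POOL=tank
    zpool scrub $POOL
    while zpool status $POOL | grep -q "scrub in progress"; do
        sleep 300
    done
    zpool status $POOL | mailx -s "weekly scrub: $POOL" soren@example.com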

  • A manual scrub has uncovered a few errors on the secondhand laptop drive that's serving as my root filesystem. These are new errors indicating bit rot - an earlier scrub didn't detect them. Fortunately "no known data errors" means that my files dodged the bullet. I'm going to buy a second cheap laptop drive and add it to the pool as a mirror, which will allow ZFS to detect and repair bit rot when it sees it. This is the sort of error that would have just gone undetected if I hadn't been using ZFS.
    sbox:~$ zpool status -xv
      pool: rpool
     state: ONLINE
    status: One or more devices has experienced an unrecoverable error. An
            attempt was made to correct the error. Applications are unaffected.
    action: Determine if the device needs to be replaced, and clear the errors
            using 'zpool clear' or replace the device with 'zpool replace'.
       see: http://www.sun.com/msg/ZFS-8000-9P
     scrub: scrub completed after 0h10m with 2 errors on Sun Aug 3 00:16:33 2008
    config:

            NAME        STATE     READ WRITE CKSUM
            rpool       ONLINE       0     0     4
              c4t0d0s0  ONLINE       0     0     4

    errors: No known data errors
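
    Attaching the mirror should be a one-liner, plus installgrub so the second disk is bootable (the new device name here is hypothetical):

    pfexec zpool attach rpool c4t0d0s0 c4t1d0s0
    pfexec installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c4t1d0s0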

  • For people like my employer, who need very many computers accessing very large data sets without certain drives getting overwhelmed by the popularity of the files they're storing, Sun is working on combining ZFS with Lustre to improve client-side scalability. Unless you have a few thousand computers or more you might not care about this.

  • If this is all too much for you, Nexenta sells NexentaStor, a simplified distribution with a nice menu and management console. On the downside, a NexentaStor license will set you back $800, or about twice as much as the hardware it's running on.

  • "Open source software is only free if your time has no value." There's something to be said for off-the-shelf convenience. It took a while to get all this together. Even though a ReadyNAS or Drobo is more expensive with worse performance, if you don't enjoy futzing around with this stuff it might still not be worth it.
Tags: nas, solaris, zfs