Gelb Research Group at Washington University in St. Louis

Building a Linux File Server - 2004

Configuration, installation, and performance


Hardware

We were running out of shared disk space, so we began a project to build a new file server. The equipment was all purchased in late February and early March of 2004, so you can "date" the pricing. Here's what we got:
  • Intel SHG2 ServerWorks Grand Champion LE Dual Xeon MP Motherboard ($133!).
  • Intel Xeon 2.4GHz, 400MHz FSB, with 512KB cache ($223).
  • 2x 512MB Kingston PC2100 ECC RAM ($105 each).
  • Supermicro SC833 3U rackmount case with 8-drive SATA backplane ($499).
  • ASUS CD-ROM ($20).
  • Intel Pro/1000MT dual-gigabit server ethernet card ($166).
  • WD 10K RPM U160 9.1GB SCSI drive, for system disk ($20).
  • 3ware Escalade 8506-9 SATA RAID card ($499).
  • 8x WD 250GB 7200RPM SATA drives, 2500JD model ($205 each).
Total cost: $3410, plus shipping, etc., for 2TB raw storage. (Compare with approx. $4500 from Aberdeen Comp. or similar.)
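
As a quick check of the total, the list above can be tallied in a few lines of Python. This is only a sketch: the short part labels are ours, and the prices and quantities are copied from the list.

```python
# Prices and quantities copied from the parts list above (USD, early 2004).
# The short labels are ours, for illustration only.
parts = [
    ("motherboard", 133, 1),
    ("Xeon 2.4GHz CPU", 223, 1),
    ("512MB ECC RAM", 105, 2),
    ("SC833 case", 499, 1),
    ("CD-ROM", 20, 1),
    ("dual-gigabit NIC", 166, 1),
    ("SCSI system disk", 20, 1),
    ("SATA RAID card", 499, 1),
    ("250GB SATA drive", 205, 8),
]

total = sum(price * qty for _, price, qty in parts)
raw_gb = 8 * 250                      # eight 250GB data drives
dollars_per_gb = total / raw_gb

print(f"total: ${total}")             # total: $3410
print(f"raw capacity: {raw_gb} GB")   # raw capacity: 2000 GB
print(f"cost per raw GB: ${dollars_per_gb:.3f}")
```

That works out to roughly $1.70 per raw gigabyte, before any RAID overhead.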

Installation

Putting it all together was straightforward. The SC833 case is very nice to work with: good cable routing, fan arrangement, and so on. Documentation on the SATA backplane is sketchy, but the only problem we found was that the fan alarms are jumpered "active" by default, and we'd plugged the case fans into the motherboard instead of the backplane. The SHG2 motherboard fits in this case, but the power cable only barely reaches the connector. Getting all the lights and switches connected required finding the pin diagram in the SHG2 documentation. Finally, with this backplane and the 3ware card, it is not possible to get the hard-drive light to register activity of the SATA drives.

Red Hat 9 installed without a hitch, correctly detecting everything (sans RAID; we put that in afterwards).

RAID configuration and testing

This machine will provide medium-term storage for the results of molecular simulations, which tend to be large numbers of files somewhat smaller than 1 GB. Many of these will be output directly to the machine via NFS, so network and NFS performance will be important to us. Some level of RAID redundancy is necessary, as this server will not be backed up. The 3ware RAID card provides RAID levels 0, 1, 10, and 5, as well as JBOD. We have therefore benchmarked numerous configurations using both hardware RAID and software RAID (the Linux "md" driver, as provided by default with Red Hat 9). We used Bonnie++ for this testing, running on the machine itself (no NFS, yet). The full "raw" results, including file create/destroy information, are available here.
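
All of the Bonnie++ tables on this page share one column layout: machine, file size, then a throughput figure and a %CPU figure for each of the six tests. For anyone wanting to compare rows programmatically, a minimal parsing sketch follows. The field names are our own labels, not Bonnie++ identifiers, and we assume that entries made up of "+" signs mark tests that finished too quickly for Bonnie++ to time reliably.

```python
# Field names are our own labels for the Bonnie++ 1.02c text columns.
FIELDS = [
    "putc_kps", "putc_cpu",        # sequential output, per character
    "write_kps", "write_cpu",      # sequential output, block
    "rewrite_kps", "rewrite_cpu",  # sequential output, rewrite
    "getc_kps", "getc_cpu",        # sequential input, per character
    "read_kps", "read_cpu",        # sequential input, block
    "seeks_ps", "seeks_cpu",       # random seeks
]

def parse_row(line):
    """Split one result row into a dict; '+++' style entries become None."""
    tokens = line.split()
    machine, size, values = tokens[0], tokens[1], tokens[2:]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} values, got {len(values)}")
    row = {"machine": machine, "size": size}
    for name, tok in zip(FIELDS, values):
        row[name] = None if set(tok) == {"+"} else float(tok)
    return row

# The 8G row of the hardware RAID 5 / ReiserFS table.
row = parse_row(
    "castle           8G  5752  96 21127  11 12386   3  5666  89 57845   8 319.5   0"
)
print(row["write_kps"] / 1024)   # block output in MB/sec, about 20.6
```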

Hardware RAID, using the 3ware card

  • RAID level 5, ReiserFS filesystem:
    
    Version 1.02c       ------Sequential Output------ --Sequential Input- --Random-
                        -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
    Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
    castle         256M  5984  99 259184  99 497269 10  5820  99 +++++ +++ +++++ +++
    castle         512M  5954  98 252266  98 467629  99  5812  99 +++++ +++ +++++ +++
    castle           1G  5878  97 105843  47 18131   4  5368  91 127510  13  1266   2
    castle           2G  5787  96 36603  18 12539   3  5425  86 75505   7 531.2   1
    castle           4G  5719  95 24290  12 12872   3  5657  89 63078   8 392.3   1
    castle           8G  5752  96 21127  11 12386   3  5666  89 57845   8 319.5   0
    

  • RAID level 5, EXT3 filesystem, default options:
    
    Version 1.02c       ------Sequential Output------ --Sequential Input- --Random-
                        -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
    Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
    castle         256M  6053  99 354460  96 472571  99  5841  99 +++++ +++ +++++ +++
    castle         512M  6078  99 193925  52 463586  99  5850  99 +++++ +++ +++++ +++
    castle           1G  6081  99 68234  18 14029   3  5656  96 197022  13  3156   3
    castle           2G  6205  99 27359   8 12948   3  5720  91 65156   6 499.1   0
    castle           4G  6150  99 21101   6 13838   3  5773  91 58134   5 339.6   0
    castle           8G  6163  99 20062   6 14239   3  5698  89 52153   6 296.7   0
    

    The filesystem differences are relatively small, with ReiserFS appearing to win over EXT3 under these conditions, especially for relatively small files. Even though Bonnie++ must be told to use only a fraction of the machine's 1GB of RAM for the first three tests, the OS is clearly caching; only for files of 2G and up does the actual RAID performance become visible. Needless to say, we were quite disappointed with these results; 20 MB/sec block output for large files is slower than we would expect from a single drive.

  • RAID level 10, ReiserFS
    
    Version 1.02c       ------Sequential Output------ --Sequential Input- --Random-
                        -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
    Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
    castle         256M  6003  99 242145 100 456771  99  5847  99 +++++ +++ +++++ +++
    castle         512M  5987  99 237664  99 419935 100  5847  99 +++++ +++ +++++ +++
    castle           1G  5974  99 153404  67 44920  12  5626  95 202321  13  2419   4
    castle           2G  6071  99 96810  46 30223   9  5704  92 87888   9 580.5   1
    castle           4G  5923  99 84649  42 28195   7  5849  92 74219   9 424.4   0
    castle           8G  5876  98 83718  42 27240   7  5866  92 67335   9 343.0   1
    
    Much better! The 3ware card does a good job in striped/mirrored mode; the only problem is that fully half of our 2TB of raw disk storage is gone, which is not acceptable. At this point, we decided to try out Linux's software RAID driver, using the 3ware card in JBOD mode, which presents the 8 drives to the OS as individual SCSI disks.
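
The capacity trade-off between the RAID levels can be made concrete with the usual textbook formulas. The sketch below uses nominal drive sizes and ignores filesystem overhead; the function name is ours.

```python
def usable_gb(level, drives, drive_gb):
    """Usable capacity in GB for one array, using the textbook formulas."""
    if level == "0":
        return drives * drive_gb            # pure striping, no redundancy
    if level == "10":
        return drives * drive_gb // 2       # every drive is mirrored
    if level == "5":
        return (drives - 1) * drive_gb      # one drive's worth of parity
    raise ValueError(f"unsupported RAID level: {level}")

for level in ("0", "10", "5"):
    print(f"RAID {level}: {usable_gb(level, 8, 250)} GB")
# RAID 0: 2000 GB
# RAID 10: 1000 GB
# RAID 5: 1750 GB
```

With eight 250GB drives, RAID 5 keeps 1750 GB usable, against 1000 GB for RAID 10.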

Software RAID performance

  • RAID level 5, 64KB "chunk size", ReiserFS
    
    Version 1.02c       ------Sequential Output------ --Sequential Input- --Random-
                        -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
    Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
    castle         256M  4801  81 219174  92 454535  98  4812  83 +++++ +++ +++++ +++
    castle         512M  5001  83 177323  77 309866  82  4781  83 +++++ +++ +++++ +++
    castle           1G  4852  82 108248  53 41367  13  4201  72 295542  28  1840   2
    castle           2G  4814  82 57421  31 26558   8  3953  63 154145  26 528.7   1
    castle           4G  4744  81 47526  26 26054   9  4105  65 144300  27 417.2   1
    castle           8G  4746  81 42858  24 26356   9  4184  66 122137  26 343.0   1
    

  • RAID level 5, 128KB "chunk size", ReiserFS
    
    Version 1.02c       ------Sequential Output------ --Sequential Input- --Random-
                        -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
    Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
    castle         256M  4814  82 254884 100 496005 100  4992  86 +++++ +++ +++++ +++
    castle         512M  4956  83 192262  83 327311  74  4933  84 +++++ +++ +++++ +++
    castle           1G  4873  83 109966  53 40895  12  4338  74 271537  34  1623   2
    castle           2G  4875  82 59728  31 28797   9  4122  66 190785  30 526.4   1
    castle           4G  4818  82 49067  26 29307  10  4318  69 193799  38 364.5   1
    castle           8G  4799  82 43793  24 29106  10  4394  70 169292  35 346.6   1
    
    Ahh.... much better. Large-file block output is about double the hardware RAID 5 figure, and block input is two to three times higher. You can see that the processor utilization has been reduced to 70-80%; that's because the software RAID driver itself is taking up the other 20-30%. A dual-processor box ought to perform better in this regard, though we doubt it will matter for NFS serving.

  • RAID level 5, 128KB "chunk size", EXT3 (default options)
    
    Version 1.02c       ------Sequential Output------ --Sequential Input- --Random-
                        -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
    Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
    castle         256M  5082  85 85934  26 259821  66  4961  85 +++++ +++ +++++ +++
    castle         512M  5033  84 220605  69 307081  76  4887  84 +++++ +++ +++++ +++
    castle           1G  5026  84 104240  32 47792  14  4426  76 277147  20  2787   2
    castle           2G  5061  84 61395  19 40743  13  4184  67 212945  28 452.8   1
    castle           4G  5048  84 58734  18 40743  13  4205  67 182419  29 367.2   1
    castle           8G  5054  83 58077  19 40541  13  4167  66 181217  31 337.0   0
    
    The EXT3 filesystem provides much better block output, and slightly worse block input, than ReiserFS. Then, of course, I read the manual and found out about the tunable EXT3 "stride" parameter:

  • RAID level 5, 128KB "chunk size", EXT3: "mkfs -t ext3 -b 4096 -m 0 -R stride=16 /dev/md0"
    
    Version 1.02c       ------Sequential Output------ --Sequential Input- --Random-
                        -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
    Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
    castle         256M  5008  83 347390  94 81200  18  4714  82 +++++ +++ +++++ +++
    castle         512M  4985  83 111411  33 94203  21  4902  84 +++++ +++ +++++ +++
    castle           1G  5035  84 84395  26 61275  17  4345  74 217801  19  3237   3
    castle           2G  5096  84 69271  22 37681  12  6006  95 213366  27 560.9   1
    castle           4G  6013  97 59336  19 42032  13  5959  94 195055  30 392.9   1
    castle           8G  6045  97 59485  19 42121  14  5993  94 178706  31 339.3   1
    
    Small improvements in some areas, mostly per-character throughput on the larger files. Finally, we pulled another 1GB of RAM out of a different box, and re-ran the benchmark with double the system memory:

  • RAID level 5, 128KB "chunk size", EXT3, 2GB RAM: "mkfs -t ext3 -b 4096 -m 0 -R stride=16 /dev/md0"
    
    Version 1.02c       ------Sequential Output------ --Sequential Input- --Random-
                        -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
    Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
    castle         256M  5952  97 282165  88 415818  99  5736  97 +++++ +++ +++++ +++
    castle         512M  5960  97 269846  86 292862  75  5728  97 +++++ +++ +++++ +++
    castle           1G  5950  97 197219  67 63362  17  5809  99 1545895 100 +++++ +++
    castle           2G  5947  97 98936   33 71277  20  6038  98 450913  42  2663   2
    castle           4G  5989  97 81899   28 50738  17  6154  97 265835  38 543.0   1
    castle           8G  6012  97 76457   26 52119  17  6092  96 229029  37 396.6   1
    
    Clearly, spending another few hundred dollars on RAM is going to get us more performance than anything else at this point!
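
A note on the "stride" value: the mke2fs manual describes it as the number of filesystem blocks in one RAID chunk, so it can be computed from the chunk size and the block size. The helper below is a sketch with a name of our own choosing. By that definition, the stride=16 in the commands above matches a 64KB chunk with 4096-byte blocks, while a 128KB chunk would give 32. The text does not say which was intended, so treat this as an observation rather than a correction of the benchmark setup.

```python
def ext3_stride(chunk_kb, block_bytes=4096):
    """Filesystem blocks per RAID chunk (the mke2fs 'stride' value)."""
    chunk_bytes = chunk_kb * 1024
    if chunk_bytes % block_bytes:
        raise ValueError("chunk size must be a multiple of the block size")
    return chunk_bytes // block_bytes

print(ext3_stride(64))    # 16
print(ext3_stride(128))   # 32
```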

Summary

We're going with software RAID 5 from here. We will get some NFS benchmarks together once the system is configured, and we'll see if it is possible to get the network working fast enough to saturate the RAID (it ought to be; the machine has three gigabit ethernet interfaces, which I can link-aggregate). The big (unanswered) question is, of course, whether the 3ware card is worth $500 just as an 8-port SATA interface. It may well be; I've no idea how good Linux SATA performance really is, whereas the SCSI subsystem is excellent. The 64-bit/66MHz PCI bus should saturate at about 500 MB/sec file transfer, so we're clearly not near that limit yet. Unfortunately, we haven't got any other RAID cards or SATA controllers available for further testing at this time.
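
The bandwidth figures in this summary can be put side by side with a little arithmetic. This is a back-of-the-envelope sketch: it uses the nominal 66 MHz PCI clock and raw gigabit line rates, ignores bus and protocol overhead, and takes the array figures from the 8G row of the 2GB-RAM software RAID 5 table.

```python
# Theoretical peak of a 64-bit bus at a nominal 66 MHz clock.
pci_peak_mb = (64 // 8) * 66_000_000 / 1e6            # 528.0 MB/sec

# Three gigabit links aggregated, raw line rate only.
gige_mb = 3 * 1_000_000_000 / 8 / 1e6                 # 375.0 MB/sec

# Best large-file (8G) software RAID 5 results, converted from K/sec.
raid_read_mb = 229029 * 1024 / 1e6                    # block input
raid_write_mb = 76457 * 1024 / 1e6                    # block output

print(f"PCI 64-bit/66MHz peak: {pci_peak_mb:.0f} MB/sec")
print(f"3x gigabit ethernet:   {gige_mb:.0f} MB/sec")
print(f"RAID block input:      {raid_read_mb:.1f} MB/sec")
print(f"RAID block output:     {raid_write_mb:.1f} MB/sec")
```

On these numbers the array's measured throughput sits below both the aggregated network rate and the PCI ceiling, which is consistent with the expectation above. A single gigabit link (125 MB/sec raw) would not keep up with block input.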

Other tests

Just for yuks, we tried a few more speculative benchmarks, none of which worked well (using ReiserFS for these):
  • Software RAID 50 (two striped software RAID-5 arrays):
    
    Version 1.02c       ------Sequential Output------ --Sequential Input- --Random-
                        -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
    Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
    castle         256M  4901  83 235452  94 397748  83  4825  83 +++++ +++ +++++ +++
    castle         512M  4786  81 148968  67 316798  76  4681  82 +++++ +++ +++++ +++
    castle           1G  4754  81 96042  48 34655  11  4622  79 183327  22  1253   2
    castle           2G  4703  80 55631  30 23220   7  4747  76 95241  15 541.3   1
    
    Not bad, but not as good as a single RAID 5, and smaller!

  • RAID 50, software-striped across two 4-drive hardware RAID-5 arrays:
    
    Version 1.02c       ------Sequential Output------ --Sequential Input- --Random-
                        -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
    Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
    castle         256M  5900  98 223742  99 456602 101  5835  99 +++++ +++ +++++ +++
    castle         512M  5615  93 220022  99 81589  18  5711  97 +++++ +++ +++++ +++
    castle           1G  5765  96 116467  54 15541   4  5212  89 226567  17  1724   1
    castle           2G  5670  92 30382  16  9801   2  5841  93 108015  14 519.7   0
    castle           4G  5656  93 19545  10 10787   3  5816  92 82008  12 375.3   0
    castle           8G  5779  96 18956  10 10634   3  6039  95 73641  12 306.0   0
    
    Sucks! Why? Let's look at a single 4-drive hardware RAID-5 array:

  • Hardware RAID 5, 4 drives:
    
    Version 1.02c       ------Sequential Output------ --Sequential Input- --Random-
                        -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
    Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
    castle         256M  5971  99 235095 100 490467  99  5882  99 +++++ +++ +++++ +++
    castle         512M  5871  97 221081  95 62062  13  5831  99 +++++ +++ +++++ +++
    castle           1G  5910  98 70075  32 12646   3  5355  91 93778   7  1048   2
    castle           2G  6015  98 17549   9  8600   2  5547  88 52798   6 415.4   0
    castle           4G  5926  98 14047   7  9099   2  5582  88 41071   5 293.9   0
    castle           8G  5826  98 12863   6  9165   2  5618  88 38265   5 234.0   0
    
    That's why: the 3ware card's performance on RAID 5 varies with the number of drives attached. This suggests that using two 4-drive hardware RAID cards and striping them via software might be competitive with the all-software solution above, but it would depend very much on the performance of the RAID cards.
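
To compare the configurations at a glance, the 8G rows of the tables on this page can be collected and ranked. The sketch below does that for block output and block input. The configuration labels are our own shorthand, and the software RAID 50 run is left out because it stopped at 2G.

```python
# 8G rows from the tables on this page: (block output, block input) in K/sec.
# The configuration labels are our own shorthand.
results_8g = {
    "hw RAID 5, 8 drives, ReiserFS": (21127, 57845),
    "hw RAID 5, 8 drives, EXT3": (20062, 52153),
    "hw RAID 10, ReiserFS": (83718, 67335),
    "sw RAID 5, 64KB chunk, ReiserFS": (42858, 122137),
    "sw RAID 5, 128KB chunk, ReiserFS": (43793, 169292),
    "sw RAID 5, 128KB chunk, EXT3": (58077, 181217),
    "sw RAID 5, EXT3 + stride": (59485, 178706),
    "sw RAID 5, EXT3 + stride, 2GB RAM": (76457, 229029),
    "hw RAID 5, 4 drives, ReiserFS": (12863, 38265),
    "RAID 50 over two hw RAID 5": (18956, 73641),
}

# Rank by block output, fastest first.
ranked = sorted(results_8g.items(), key=lambda kv: kv[1][0], reverse=True)
for name, (out_k, in_k) in ranked:
    print(f"{name:36s} out {out_k:6d}  in {in_k:6d}")

four = results_8g["hw RAID 5, 4 drives, ReiserFS"]
eight = results_8g["hw RAID 5, 8 drives, ReiserFS"]
pair = results_8g["RAID 50 over two hw RAID 5"]
print(f"4-drive vs 8-drive hw RAID 5 output: {four[0] / eight[0]:.2f}x")
print(f"striped pair vs one 4-drive array:   {pair[0] / four[0]:.2f}x")
```

Hardware RAID 10 tops the block-output ranking and the 2GB-RAM software RAID 5 run tops block input, while the 4-drive hardware RAID 5 array is slowest on both.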


Department of Chemistry and Center for Materials Innovation
Washington University in St. Louis
Last modification: Fri Aug 17 16:24:37 2007
gelb@wustl.edu