Gelb Research Group at Washington University in St. Louis

"Fjord" - a Linux cluster


This page describes our experiences with the procurement, installation, configuration, troubleshooting, and use of a 24-node dual-processor Athlon cluster.

The hardware
The software: OSCAR
Post-install configuration
Problems and fixes
Performance

The hardware

Our hardware configuration consists of:

Final system cost came to just about $55,000. Here's a picture of it during assembly, with 15 nodes installed. The assembly went reasonably smoothly, except that the rack-mounting screws that came with the nodes don't fit this particular rack, so we had to order extra screws from APC (the rack comes with only about 60 screws, and we used six per node!).








Here are some pictures of the inside of the Appro 1100 server. It's pretty intelligently laid out, though not as easy to get into as one might like, as the top cover consists of three separate pieces:

No CPU fans! Instead, a big plastic duct (far right) funnels air from the four big fans directly through the heat-sink fins.

Network configuration

Since every computer has two NICs on the motherboard, we got two switches. There are two private subnets:
  1. "Service network" - this connects the 10/100 ports on all the nodes with the DLINK switch. The Gigabit port from the server connects to the Gigabit uplink port on the switch. This network is used for software installation, routine communication, and file-sharing by NFS.
  2. "Fast network" - This connects the Gigabit ports on all the nodes with the SMC switch. The server does not participate, and this network is meant only for communication during parallel job execution.

Power

The rack system is now plugged into three dedicated 20A lines. I'm not sure what the actual power draw under running conditions is, but will attempt to find out.

Problems

Of the sixteen nodes initially delivered (we added eight more later), two would not boot at all. These were returned to Appro, repaired, burned in, and re-installed. They both worked fine upon return, until we started working with the Gigabit ethernet network, and discovered that one of them had a bad controller. It was returned for repair a second time.

Future expansion

There's room on both switches and the rack for an additional eight nodes, if we so desire. There's also room in the bottom of the rack for enough UPS capacity to run the whole system, if it becomes necessary.

The software: OSCAR

We installed OSCAR 1.4 on the cluster. This is an open-source suite of cluster software that handles the installation of operating systems on the nodes, configuration of networking and parallel libraries, setting up the PBS/Maui queueing system, and some other goodies.

Basically, the install went quite smoothly (and as described in the OSCAR manual) except for a few false starts and the following issues:

  1. The BIOS that came on the node motherboards had a bug that wouldn't allow them to boot without a keyboard attached (even if the "boot w/o keybd" option was selected) -- this slowed things down considerably, until we got a new BIOS from Appro which fixed the problem.
  2. The most current kernel from RedHat 7.3 for the nodes is the one labeled "kernel-smp-2.4.18-17.7.x.athlon.rpm". Its somewhat unusual numbering scheme was not detected by OSCAR in the image-setup section, so we had to install the RH7.3 default kernel. This would ordinarily be OK, except that only the newest kernel has support for the built-in Gigabit ethernet! We tried this a few times, then gave up and decided to put on the newest kernel afterwards.
  3. The documentation on setting up your own disk partition file is a bit weak - it turns out that you can use ext3 and/or Reiser filesystems, but this is not mentioned anywhere.
  4. In OSCAR there is really no way to go "back" and redo a step (say, "node definitions") if you messed up, or decided that another option should have been checked. Instead, you have to go back all the way to the beginning, re-install OSCAR, and start over. This gets frustrating after a while.
  5. In order to get PBS working right, the first name listed in the /etc/hosts entry for the server's internal network connection has to match the output of "hostname" on the server. This must be true in the /etc/hosts file on every machine! This is a problem, because you can go through the entire install without it, and then PBS won't work....
  6. OSCAR cannot deal with multiple NICs per node. Even if we could have got the right kernel installed, there is no facility in OSCAR for configuring a second private network. While the configuration of the second network by hand was not so difficult, really, we were not able (quickly, at any rate) to do further software modification within the OSCAR framework, e.g., using SystemImager, since when OSCAR installs the first private network, it does it through a system of shell scripts that are difficult to adapt. If you only have a single network adapter per node, of course this won't be a problem.
Our overall impression of OSCAR, however, was quite good - in one day, without previous experience with the software, we got everything up and running, except for the post-install stuff described below. It would have taken much longer had we had to separately install all the components of OSCAR by hand.
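The /etc/hosts requirement in item 5 is easy to miss, so a small check may help. The following is a minimal sketch in Python, not something OSCAR or PBS provides; the IP address and the host names in the sample text are hypothetical and used only for illustration.

```python
def first_name_for_ip(hosts_text, ip):
    """Return the first name listed for `ip` in /etc/hosts-style text, or None."""
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        if fields[0] == ip and len(fields) > 1:
            return fields[1]
    return None


def hosts_entry_matches(hosts_text, ip, hostname):
    """True if the first name on the line for `ip` equals `hostname`."""
    return first_name_for_ip(hosts_text, ip) == hostname


# Hypothetical sample: addresses and names are assumptions, not the real setup.
SAMPLE_HOSTS = """\
127.0.0.1   localhost.localdomain localhost
10.0.0.1    fjord.cluster fjord   # server, internal interface
10.0.0.2    node1.cluster node1
"""

print(hosts_entry_matches(SAMPLE_HOSTS, "10.0.0.1", "fjord.cluster"))
print(hosts_entry_matches(SAMPLE_HOSTS, "10.0.0.1", "fjord"))
```

On a real machine one would presumably pass the contents of /etc/hosts and the value of `socket.gethostname()` (or the output of "hostname") instead of the sample text, and repeat the check on every node.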

Post-install configuration

A number of configuration issues remained at the end of the OSCAR installation. These included:
  1. Additional NFS-shared filesystems from the server node. This was easy to set up using commands from the C3 package. Basically, edit the /etc/fstab on one node until you like it, and then propagate the changes to all of them.
  2. Installation of new kernels. This was also easy to arrange, by using "cexec" to run the appropriate RPM commands on each node.
  3. Configuration of gigabit ethernet adapters. This required producing and installing a unique "ifcfg-eth1" file on each node. Since there were only sixteen nodes at the time, we did this by hand (it took only a few minutes). Were there more, a shell script to handle it would be easy to write. We also had to copy an updated /etc/hosts to all the nodes.
  4. Configuration of PBS-run jobs to use the gigabit network. Haven't got to this yet!
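For a larger cluster, the per-node "ifcfg-eth1" files from item 3 could be generated rather than typed. Below is a minimal sketch, written in Python rather than as a shell script. The subnet (192.168.2.x), the netmask, and the node naming scheme (node1, node2, ...) are all assumptions for illustration; the actual addresses used on this cluster are not given here.

```python
def ifcfg_eth1(ip, netmask="255.255.255.0"):
    """Build the text of a static ifcfg-eth1 file for one node."""
    lines = [
        "DEVICE=eth1",
        "BOOTPROTO=static",
        f"IPADDR={ip}",
        f"NETMASK={netmask}",
        "ONBOOT=yes",
    ]
    return "\n".join(lines) + "\n"


def node_configs(count, subnet="192.168.2", first_host=1):
    """Map hypothetical node names to their ifcfg-eth1 contents."""
    return {
        f"node{i}": ifcfg_eth1(f"{subnet}.{first_host + i - 1}")
        for i in range(1, count + 1)
    }


configs = node_configs(16)
print(configs["node1"])
```

Each generated string would then be written to the node's network-scripts directory by whatever means is convenient, for example the C3 tools already used for the other post-install steps.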

    Problems and fixes

    As mentioned above, there were some hardware issues. Appro has been helpful in addressing these, providing new BIOS files, etc.

    Performance

    We are principally concerned with network performance here, as that is likely to be the limiting factor in the efficient use of the cluster for parallel computation.

    Ping benchmarks

    Here are some basic "ping" benchmarks. These were gathered from the output of "ping -f -s 64000 nodename".
    1. 10/100BaseT network: average return time was 11.95ms, which translates to a 10.215 MB/sec transfer rate, about 82% of line maximum.
    2. 1000BaseT network: average return time was 2.532ms, which translates to a 48.21 MB/sec transfer rate, about 40% of line maximum. This is measured with MTU=1500 (regular ethernet). Still, it's 4.7x faster than regular fast ethernet.
    3. 1000BaseT network, MTU=9000: average return time was 2.585ms, essentially the same speed as with "small" frames.
    4. 1000BaseT network, MTU=3000: average return time was 2.396ms, about 50.95 MB/sec transfer rate.
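The transfer rates quoted above can be reproduced from the return times. This sketch assumes that each ping moves the 64000-byte payload twice (out and back) and that 1 MB means 2^20 bytes; both are inferences from the numbers rather than something stated on this page, and packet headers are ignored.

```python
PAYLOAD_BYTES = 64000   # from "ping -f -s 64000"
MB = 2 ** 20            # assumption: "MB" here means 2^20 bytes


def ping_throughput_mb(rtt_ms, payload=PAYLOAD_BYTES):
    """Approximate transfer rate in MB/sec from a ping round-trip time.

    The payload travels to the target and back, so 2 * payload bytes
    cross the wire in one round trip.
    """
    return 2 * payload / (rtt_ms / 1000.0) / MB


measurements = [
    ("fast ethernet, MTU=1500", 11.95),
    ("gigabit, MTU=1500", 2.532),
    ("gigabit, MTU=9000", 2.585),
    ("gigabit, MTU=3000", 2.396),
]
for label, rtt_ms in measurements:
    print(f"{label}: {ping_throughput_mb(rtt_ms):.2f} MB/sec")
```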

    NetPIPE benchmarks

    We have used both the TCP and MPI (LAM) parts of the NetPIPE 3.2 performance benchmark suite to gauge gigabit ethernet performance between two nodes.

    In order to get good (reliable and fast) performance, we ended up using the following:

    1. Version 4.3.15 of the e1000 driver (from Intel)
    2. MTU = 9000 (best TCP performance) or MTU = 1500 (best LAM/MPI performance)
    3. RxIntDelay=0 TxIntDelay=0 (parameters passed to e1000.o via /etc/modules.conf)
    4. /proc/sys/net/core/rmem_max = 1 MB, /proc/sys/net/core/wmem_max = 1 MB
    These values are not really optimal, but they are substantially better than those of the packaged kernel e1000.o module and default Delay and memory values. The RxIntDelay and TxIntDelay values, especially, make a big difference in the consistency of network performance with different message sizes. Communication latencies with these parameters, estimated from the transmission of very small messages, tend to run around 70 microseconds.
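As a sketch of how the settings in items 3 and 4 might be written down, the helper below renders an "options" line of the kind /etc/modules.conf accepts and lists the two /proc entries with their target values. It only builds text and does not touch the system. Taking "1 MB" to mean 1048576 bytes is an assumption, and the function names are hypothetical.

```python
def e1000_options_line(rx_int_delay=0, tx_int_delay=0):
    """Render a modules.conf 'options' line for the e1000 module."""
    return f"options e1000 RxIntDelay={rx_int_delay} TxIntDelay={tx_int_delay}"


def socket_buffer_settings(size_bytes=1024 * 1024):
    """Map the /proc entries named above to the value to write into them."""
    return {
        "/proc/sys/net/core/rmem_max": size_bytes,
        "/proc/sys/net/core/wmem_max": size_bytes,
    }


print(e1000_options_line())
for path, value in socket_buffer_settings().items():
    print(f"{path} <- {value}")
```

The /proc values do not persist across a reboot unless they are also applied from a startup script or the system's sysctl configuration.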
    TCP performance examples:

    With the TCP version of NetPIPE, and the above parameters, we were able to get up to 920 Mbps for large messages and jumbo frames. It appears that an MTU of 3000, though, gives substantially better performance for messages on the order of 10Kb to 100Kb.

    MPI/LAM performance examples:

    With the LAM/MPI version of NetPIPE, the best performance achieved with large messages is approximately 690 Mbps. Performance over different message sizes is somewhat rocky; the use of MTU=1500 smooths this out substantially. The LAM/MPI implementation overhead is thus approximately 25% of the achieved TCP bandwidth (690 versus 920 Mbps), not too bad. For comparison we show a single trace (green) using default values for the Rx/Tx Delay parameters, which gave both substantially degraded performance and extremely slow speeds for certain message sizes.

    Evaluation

    For just TCP traffic, this hardware/software is capable of achieving better than 90% of theoretical network line speed. Even within the MPI implementation, 70% of maximum is probably realizable with further parameter tuning (especially within the LAM implementation). This is a substantial improvement over previous generations of Gigabit ethernet hardware and drivers, and on par with the best one might reasonably expect. For other benchmarks of this type, see the Gigabit Over Copper Evaluation page.
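The percentages in this evaluation follow directly from the peak NetPIPE figures reported above. A short worked calculation, assuming a nominal line rate of 1000 Mbps for gigabit ethernet:

```python
LINE_RATE_MBPS = 1000.0   # nominal gigabit ethernet line rate (assumed)
TCP_PEAK_MBPS = 920.0     # best NetPIPE TCP result, jumbo frames
MPI_PEAK_MBPS = 690.0     # best NetPIPE LAM/MPI result

tcp_fraction = TCP_PEAK_MBPS / LINE_RATE_MBPS      # share of line rate over TCP
mpi_fraction = MPI_PEAK_MBPS / LINE_RATE_MBPS      # share of line rate over MPI
mpi_overhead = 1.0 - MPI_PEAK_MBPS / TCP_PEAK_MBPS  # loss relative to raw TCP

print(f"TCP: {tcp_fraction:.0%} of line rate")
print(f"MPI: {mpi_fraction:.0%} of line rate")
print(f"LAM/MPI overhead relative to TCP: {mpi_overhead:.0%}")
```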

    Comparison with Myrinet: We are achieving approximately 1/3 the transmission rates, at 7 times the latency, of the latest generation of Myrinet hardware. However, the cost of the hardware used here (approximately $50 for an equivalent Intel NIC, and $2800 for the gigabit switch) is only about 13% of the cost of a 16-way Myrinet switch, cards, and cables.

    Further "real-world" performance numbers will be posted once we've measured them...



Department of Chemistry and Center for Materials Innovation
Washington University in St. Louis
Last modification: Fri Aug 17 16:24:36 2007
gelb@wustl.edu