"Fjord" - a linux cluster
This page describes our experiences with the procurement,
installation, configuration, troubleshooting, and use of
a 24-node dual-processor Athlon cluster.
The hardware
The software: OSCAR
Post-install configuration
Problems and fixes
Performance
Our hardware configuration consists of:
Final system cost came to just about $55,000. Here's a picture
of it during assembly, with 15 nodes installed. The assembly went reasonably
smoothly, except that the screws for rackmounting that came with
the nodes don't fit into this particular rack, so we had to order
extra screws from APC (since the rack only comes with about 60 screws,
and we used six per node!)
Here are some pictures of the inside of the Appro 1100 server. It's
pretty intelligently laid out, though not as easy to get into as
one might like, as the top cover consists of three separate pieces:
No CPU fans! Instead, a big plastic duct (far right)
funnels air from the four big fans
directly through the heat-sink fins.
Network configuration
Since every computer has two NICs on the motherboard, we got two switches.
There are two private subnets:
- "Service network" - this connects the 10/100 ports on all the nodes
with the DLINK switch. The Gigabit port from the server connects to the
Gigabit uplink port on the switch. This network is used for software
installation, routine communication, and file-sharing by NFS.
- "Fast network" - This connects the Gigabit ports on all the nodes with
the SMC switch. The server does not participate, and this network is meant
only for communication during parallel job execution.
Power
The rack system is now plugged into three dedicated 20A lines. I'm not sure
what the actual power draw under running conditions is, but will attempt
to find out.
Problems
Of the sixteen nodes initially delivered (we added eight more later),
two would not boot at all. These were
returned to Appro, repaired, burned in, and re-installed. They both
worked fine upon return, until we started working with the Gigabit
ethernet network, and discovered that one of them had a bad controller.
It was returned for repair a second time.
Future expansion
There's room on both switches and the rack for an additional eight nodes,
if we so desire. There's also room in the bottom of the rack for enough
UPS capacity to run the whole system, if it becomes necessary.
We installed OSCAR 1.4 on the
cluster. This is an open-source suite of cluster software that handles
the installation of operating systems on the nodes, configuration of
networking and parallel libraries, setting up the PBS/Maui queueing system,
and some other goodies.
Basically, the install went quite smoothly (and as described in the OSCAR
manual) except for a few false starts and the following issues:
- The BIOS that came on the node motherboards had a bug that wouldn't
allow them to boot without a keyboard attached (even if the "boot w/o keybd"
option was selected) -- this slowed things down considerably, until we
got a new BIOS from Appro which fixed the problem.
- The most current RedHat 7.3 kernel for the nodes is the
one labeled "kernel-smp-2.4.18-17.7.x.athlon.rpm". OSCAR did not detect
the somewhat unusual numbering scheme used for this in the
image-setup section, meaning that we had to install the RH7.3 default kernel.
This would ordinarily be OK, except that only the newest kernel has
support for the built-in Gigabit ethernet! We tried this a few times, and
finally gave up and decided to put on the newest kernel afterwards.
- The documentation on setting up your own disk partition file is
a bit weak - it turns out that you can use ext3 and/or Reiser filesystems,
but this is not mentioned anywhere.
- In OSCAR there is really no way to go "back" and redo a step
(say, "node definitions") if you messed up, or decided that another option
should have been checked. Instead, you have to go back all the way to
the beginning, re-install OSCAR, and start over. This gets frustrating after
a while.
- In order to get PBS working right, the first name listed in the
/etc/hosts entry for the server's internal network connection has to
match the output of "hostname" on the server.
This must be true in /etc/hosts on all the machines! This is a problem, because
you can go through the entire install without it, and then PBS won't work....
- OSCAR cannot deal with multiple NICs per node. Even if we
could have got the right kernel installed, there is no facility in OSCAR
for configuring a second private network. While the configuration of
the second network by hand was not so difficult, really, we were not able
(quickly, at any rate) to do further software modification within the
OSCAR framework, e.g., using SystemImager, since when OSCAR installs the
first private network, it does it through a system of shell scripts that are
difficult to adapt. If you only have a single network adapter per node,
of course this won't be a problem.
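The PBS hostname requirement in the list above is easy to check mechanically before starting the queueing system. Here is a minimal Python sketch of such a check; it is not part of OSCAR or PBS, and the function names, the server name "fjord", and the 192.168.1.x addresses are all made up for illustration.

```python
def first_name_for_ip(hosts_text, ip):
    """Return the first name listed for `ip` in /etc/hosts-style text, or None."""
    for line in hosts_text.splitlines():
        # Drop any trailing comment, then split the rest on whitespace.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == ip:
            return fields[1]
    return None


def pbs_hosts_entry_ok(hosts_text, server_ip, server_hostname):
    """True if the first name for the server's internal IP equals its hostname."""
    return first_name_for_ip(hosts_text, server_ip) == server_hostname


# Hypothetical /etc/hosts contents; the names and addresses are illustrative only.
SAMPLE_HOSTS = """\
127.0.0.1    localhost
192.168.1.1  fjord fjord.cluster   # server, internal interface
192.168.1.2  node01
"""

print(pbs_hosts_entry_ok(SAMPLE_HOSTS, "192.168.1.1", "fjord"))
```

On a real machine you would pass in the contents of /etc/hosts and the output of "hostname" as run on the server, and repeat the check on every node.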
Our overall impression of OSCAR, however, was quite good - in one day,
without previous experience with the software, we got everything up and
running, except for the post-install stuff described below. It would
have taken much longer had we had to separately install all the
components of OSCAR by hand.
A number of configuration issues remained at the end of the OSCAR
installation. These included:
- Additional NFS-shared filesystems from the server node. This
was easy to set up using commands from the
C3 package. Basically,
edit the /etc/fstab on one node until you like it, and then propagate
the changes to all of them.
- Installation of new kernels.
This was also easy to arrange, by using "cexec"
to run the appropriate RPM commands on each node.
- Configuration of gigabit ethernet adapters. This required producing
and installing a unique "ifcfg-eth1" file on each node. Since there
are only sixteen, we did this by hand (it only took a few minutes). Were
there more, a shell script that handled it would be easy to write.
We also had to copy an updated /etc/hosts to all the nodes.
- Configuration of PBS-run jobs to use the gigabit network.
Haven't got to this yet!
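For a larger cluster, generating the per-node "ifcfg-eth1" files could be automated along the following lines. This is a sketch in Python rather than the shell script mentioned above, and it rests on assumptions: the 192.168.2.x subnet, the netmask, and the "nodeNN" names are hypothetical, and only the basic static-address keys are written.

```python
def ifcfg_eth1(node_index, subnet="192.168.2", netmask="255.255.255.0"):
    """Return ifcfg-eth1 contents for node number `node_index` (1-based).

    The subnet and netmask defaults are assumptions, not the cluster's values.
    """
    if not 1 <= node_index <= 254:
        raise ValueError("node_index must fit in the last octet of the address")
    return (
        "DEVICE=eth1\n"
        "BOOTPROTO=static\n"
        f"IPADDR={subnet}.{node_index}\n"
        f"NETMASK={netmask}\n"
        "ONBOOT=yes\n"
    )


def all_configs(node_count):
    """Map a hypothetical node name (node01, node02, ...) to its file contents."""
    return {f"node{i:02d}": ifcfg_eth1(i) for i in range(1, node_count + 1)}


configs = all_configs(16)
print(configs["node03"])
```

Each generated string would then be written to /etc/sysconfig/network-scripts/ifcfg-eth1 on the matching node, by whatever copy mechanism is convenient.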
As mentioned above, there were some hardware issues. Appro has been
helpful in addressing these, providing new BIOS files, etc.
We are principally concerned with network performance here, as
that is likely to be the limiting factor in the efficient use of
the cluster for parallel computation.
Ping benchmarks
Here are some basic "ping" benchmarks. These were gathered from
the output of "ping -f -s 64000 nodename".
- 10/100BaseT network: average return time was 11.95ms, which translates
to 10.215 MB/sec transfer rate, about 82% of line maximum.
- 1000BaseT network: average return time was 2.532ms, which translates
to 48.21 MB/sec transfer rate, about 40% of line maximum. This is
measured with MTU=1500 (regular ethernet). Still, it's 4.7x faster than
regular fast ethernet.
- 1000BaseT network, MTU=9000: average return time was 2.585ms, essentially
the same speed as with "small" frames.
- 1000BaseT network, MTU=3000: average return time was 2.396ms, about
50.95 MB/sec transfer rate.
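The transfer rates quoted above appear to come from counting the 64000-byte payload once in each direction, dividing by the average return time, and expressing the result in units of 2^20 bytes. That reading is an inference from the numbers themselves (it ignores packet headers), so treat the following Python sketch as a reconstruction of the arithmetic rather than a record of how it was done; the function name is illustrative.

```python
PAYLOAD_BYTES = 64000      # from "ping -f -s 64000"
MIB = 1024 * 1024          # assumption: "MB" here means 2**20 bytes


def ping_rate(rtt_ms, payload=PAYLOAD_BYTES):
    """Transfer rate in 2**20 bytes/sec, counting the payload once each way."""
    seconds = rtt_ms / 1000.0
    return 2 * payload / seconds / MIB


measurements = [
    ("10/100", 11.95),
    ("gigabit, MTU=1500", 2.532),
    ("gigabit, MTU=3000", 2.396),
]
for label, rtt_ms in measurements:
    print(f"{label}: {ping_rate(rtt_ms):.2f}")

# Ratio of return times, i.e. how much faster gigabit is than fast ethernet.
print(f"speed-up: {11.95 / 2.532:.1f}x")
```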
NetPIPE benchmarks
We have used both the TCP and MPI (LAM) parts of the
NetPIPE 3.2 performance
benchmark suite to gauge gigabit ethernet performance between two nodes.
In order to get good (reliable and fast) performance, we ended up using the following:
- Version 4.3.15 of the e1000 driver (from Intel)
- MTU = 9000 (best TCP performance) or MTU = 1500 (best LAM/MPI performance)
- RxIntDelay=0 TxIntDelay=0 (parameters passed to e1000.o via /etc/modules.conf)
- /proc/sys/net/core/rmem_max = 1 MB, /proc/sys/net/core/wmem_max = 1 MB
These values are not really optimal, but they are substantially better than
those of the packaged kernel e1000.o module and default Delay and memory
values. The RxIntDelay and TxIntDelay values, especially, make a big difference
in the consistency of network performance with different message sizes.
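For reference, the driver and buffer settings listed above could be written out roughly as follows. This Python sketch only renders the text one would put in /etc/modules.conf and the values for the two /proc entries; it changes nothing on the system. Reading "1 MB" as 2^20 bytes is an assumption, and the function names are illustrative.

```python
ONE_MB = 1024 * 1024  # assumption: "1 MB" is read as 2**20 bytes


def modules_conf_line(rx_delay=0, tx_delay=0):
    """Options line for /etc/modules.conf passing the delay parameters to e1000."""
    return f"options e1000 RxIntDelay={rx_delay} TxIntDelay={tx_delay}"


def socket_buffer_settings(size=ONE_MB):
    """Map each /proc path named in the list above to the value to write into it."""
    return {
        "/proc/sys/net/core/rmem_max": size,
        "/proc/sys/net/core/wmem_max": size,
    }


print(modules_conf_line())
for path, value in socket_buffer_settings().items():
    # Shown as the equivalent shell command; run as root to apply it.
    print(f"echo {value} > {path}")
```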
Communication latencies with these parameters, estimated from the transmission
of very small messages, tend to run around 70 microseconds.
TCP performance examples:
With the TCP version of NetPIPE, and the above parameters, we were able
to get up to 920 Mbps for large messages and jumbo frames. It appears that
an MTU of 3000, though, gives substantially better performance for
messages on the order of 10Kb to 100Kb.
MPI/LAM performance examples:
With the LAM/MPI version of NetPIPE, the best performance achieved with
large messages is approximately 690 Mbps. Performance over different message
sizes is somewhat rocky; the use of MTU=1500 smooths this out substantially.
The LAM/MPI implementation overhead is thus approximately 25% of achieved
network bandwidth, not too bad. For comparison we show a single trace (green)
using default values for the Rx/Tx Delay parameters, which gave both substantially
degraded performance, and extremely slow speeds for certain message sizes.
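The 25% overhead figure is the gap between the TCP and LAM/MPI peaks, taken relative to the TCP peak. A short Python sketch of that arithmetic, together with each peak as a fraction of line speed, assuming a nominal gigabit line rate of 1000 Mbps (the variable names are illustrative):

```python
LINE_RATE_MBPS = 1000.0  # assumption: nominal gigabit ethernet line rate
TCP_PEAK_MBPS = 920.0    # best NetPIPE TCP result quoted above
MPI_PEAK_MBPS = 690.0    # best NetPIPE LAM/MPI result quoted above

# Bandwidth lost to the MPI layer, relative to what TCP alone achieved.
mpi_overhead = (TCP_PEAK_MBPS - MPI_PEAK_MBPS) / TCP_PEAK_MBPS

# Each peak as a fraction of the nominal line rate.
tcp_fraction = TCP_PEAK_MBPS / LINE_RATE_MBPS
mpi_fraction = MPI_PEAK_MBPS / LINE_RATE_MBPS

print(f"MPI overhead: {mpi_overhead:.0%}")
print(f"TCP fraction of line rate: {tcp_fraction:.0%}")
print(f"MPI fraction of line rate: {mpi_fraction:.0%}")
```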
Evaluation
For just TCP traffic, this hardware/software is capable of achieving better
than 90% of theoretical network line speed.
Even within the MPI implementation, 70% of maximum is probably realizable
with further parameter tuning (especially within the LAM implementation).
This is a substantial improvement over previous generations of Gigabit ethernet
hardware and drivers, and on par with the best one might reasonably expect.
For other benchmarks of this type, see the Gigabit Over Copper Evaluation page.
Comparison with Myrinet: We are achieving approximately 1/3 the transmission
rates, at 7 times the latency, of the latest generation of Myrinet hardware.
However, the cost of the hardware used here (approximately $50 for an
equivalent Intel NIC, and $2800 for the gigabit switch) is only about 13%
of the cost of a 16-way Myrinet switch, cards, and cables.
Further "real-world" performance numbers will be posted once we've measured
them...