Scaling Performance of CitcomS
Scaling Performance of CitcomS on Caltech's CITerra cluster
Model Description
The mesh is a regional cap with 129x129x129 nodes, giving about 6.5 million velocity unknowns in total. I ran each model for 11 time steps. The reported result is the wall-clock time elapsed between step 1 and step 11. Each node on our cluster has two quad-core 2.33 GHz Intel Xeon CPUs and 12 GB of RAM, that is, 8 cores per node. The interconnect is Myrinet using the GM protocol.
Test 1
Table 1:
Partitioning  Processors  Time (s)  Speedup  Efficiency
1x1x1         1           4508.80   1.00     1.000
1x1x2         2           2387.18   1.89     0.944
1x1x4         4           1587.46   2.84     0.710
2x2x1         4           1455.36   3.10     0.775
2x2x2         8           975.25    4.62     0.578
2x2x4         16          594.48    7.58     0.474
4x4x1         16          596.55    7.56     0.472
4x4x2         32          392.11    11.50    0.359
4x4x4         64          303.27    14.87    0.232
The first five cases all ran on a single node, so they mainly test the impact of limited bus bandwidth. Parallel efficiency drops significantly going from 2 to 4 cores, and again from 4 to 8 cores. However, I am not sure how much of the drop is due to parallel communication and how much is due to bandwidth. I will run more tests that use only 2 cores per node to find out.
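The Speedup and Efficiency columns in Table 1 are consistent with the usual definitions: speedup is the single-processor time divided by the parallel time, and efficiency is the speedup divided by the number of processors. A minimal Python sketch, assuming those definitions and using the times copied from Table 1 (the function and variable names are illustrative and not part of CitcomS):

```python
# Wall-clock times in seconds, copied from Table 1:
# (partitioning, number of processors, time)
TABLE1 = [
    ("1x1x1", 1, 4508.80),
    ("1x1x2", 2, 2387.18),
    ("1x1x4", 4, 1587.46),
    ("2x2x1", 4, 1455.36),
    ("2x2x2", 8, 975.25),
    ("2x2x4", 16, 594.48),
    ("4x4x1", 16, 596.55),
    ("4x4x2", 32, 392.11),
    ("4x4x4", 64, 303.27),
]


def speedup(t_serial, t_parallel):
    """Speedup relative to the single-processor run."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, nproc):
    """Parallel efficiency: speedup divided by the processor count."""
    return speedup(t_serial, t_parallel) / nproc


def scaling_rows(rows):
    """Return (partitioning, nproc, time, speedup, efficiency) for each row,
    using the single-processor row as the baseline."""
    t_serial = next(t for _, n, t in rows if n == 1)
    return [
        (p, n, t, speedup(t_serial, t), efficiency(t_serial, t, n))
        for p, n, t in rows
    ]


if __name__ == "__main__":
    for p, n, t, s, e in scaling_rows(TABLE1):
        print(f"{p:<8}{n:>4}{t:>10.2f}{s:>8.2f}{e:>8.3f}")
```

The same functions can be applied to the timings from any other run, as long as a single-processor baseline is available.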
Test 2
The following cases use different numbers of processors per node (ppn). Note that each node on this machine has two CPU sockets, with four cores per CPU.
Table 2:
Partitioning  Processors  ppn  Time (s)  Speedup  Efficiency
2x2x2         8           8    975.25    4.62     0.578
2x2x2         8           4    771.19    5.84     0.731
2x2x2         8           2    688.03    6.55     0.819
2x2x2         8           1    720.66    6.26     0.782
4x4x1         16          8    596.55    7.56     0.472
4x4x1         16          4    477.53    9.67     0.604
4x4x1         16          2    440.79    10.23    0.639
From ppn=1 to ppn=2, performance improves slightly, most likely because our Myrinet MPI library has efficient inter-core communication. From ppn=2 to ppn=4, the run slows down a little, and from ppn=4 to ppn=8 it slows down significantly, which can be attributed to the bandwidth limit.
The conclusion is that ppn=2 is the fastest setting on this machine (two quad-core CPUs per node), and ppn=4 is about 10% slower than ppn=2. Setting ppn=4 is probably the most efficient use of computational resources.
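To compare ppn settings directly, the slowdown of each run relative to the fastest run of the same partitioning can be computed from the Table 2 times. A minimal Python sketch, using only the times listed in Table 2 (the helper names are illustrative):

```python
# Wall-clock times in seconds from Table 2, keyed by partitioning, then by ppn.
TABLE2 = {
    "2x2x2": {8: 975.25, 4: 771.19, 2: 688.03, 1: 720.66},
    "4x4x1": {8: 596.55, 4: 477.53, 2: 440.79},
}


def percent_slower(t, t_ref):
    """How much slower time t is than the reference time t_ref, in percent."""
    return 100.0 * (t / t_ref - 1.0)


def slowdown_vs_fastest(times_by_ppn):
    """Return (fastest ppn, {ppn: percent slower than the fastest run})."""
    best_ppn = min(times_by_ppn, key=times_by_ppn.get)
    t_best = times_by_ppn[best_ppn]
    slowdown = {
        ppn: percent_slower(t, t_best)
        for ppn, t in sorted(times_by_ppn.items())
    }
    return best_ppn, slowdown


if __name__ == "__main__":
    for partitioning, times in TABLE2.items():
        best_ppn, slowdown = slowdown_vs_fastest(times)
        print(f"{partitioning}: fastest at ppn={best_ppn}")
        for ppn, pct in slowdown.items():
            print(f"  ppn={ppn}: {pct:5.1f}% slower")
```

This only compares wall-clock time. It does not account for how many nodes each run occupies, which also matters when judging the overall use of the cluster.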
