Scaling Performance of CitcomS

Scaling Performance of CitcomS on Caltech's CITerra cluster

Model Description

The mesh is a regional cap with 129x129x129 nodes. The total number of velocity unknowns is about 6.4 million (three components at each of the 129^3 nodes). I ran each model for 11 time steps. The reported result is the wall-clock time elapsed between step 1 and step 11. Each node on our cluster has two quad-core 2.33 GHz Intel Xeon CPUs and 12 GB of RAM, that is, 8 cores per node. The interconnect is Myrinet, using the GM protocol.
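
As a rough check on the problem size, the sketch below counts three velocity components per mesh node, which lands near the figure quoted above. It also estimates how many nodes each process holds under a given partitioning. The helper `local_nodes` is hypothetical, and it assumes that elements are split evenly along each axis and that neighbouring processes both keep a copy of the nodes on their shared boundary.

```python
# Rough size estimate for the 129x129x129 regional mesh.
NODES_PER_DIM = 129

total_nodes = NODES_PER_DIM ** 3
velocity_unknowns = 3 * total_nodes  # three velocity components per node


def local_nodes(partitioning, nodes_per_dim=NODES_PER_DIM):
    """Nodes per process along each axis.

    Assumption: the (nodes_per_dim - 1) elements along an axis are split
    evenly among the processes on that axis, and boundary nodes are
    duplicated on both neighbouring processes.
    """
    elements = nodes_per_dim - 1
    dims = []
    for nproc in partitioning:
        if elements % nproc != 0:
            raise ValueError(f"{elements} elements do not divide evenly by {nproc}")
        dims.append(elements // nproc + 1)
    return tuple(dims)


print(f"total nodes:       {total_nodes:,}")
print(f"velocity unknowns: {velocity_unknowns:,}")
for part in [(1, 1, 1), (2, 2, 2), (4, 4, 4)]:
    print(part, "->", local_nodes(part), "nodes per process")
```

Under these assumptions the 4x4x4 case leaves each process with a 33x33x33 block, so the work per process shrinks much faster than the surface it has to exchange with its neighbours.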

Test 1

Table 1:

      Partitioning   Proc#   Time (sec.)  Speedup   Efficiency
      1x1x1            1       4508.80     1.00        1.000  
      1x1x2            2       2387.18     1.89        0.944  
      1x1x4            4       1587.46     2.84        0.710  
      2x2x1            4       1455.36     3.10        0.775  
      2x2x2            8        975.25     4.62        0.578  
      2x2x4           16        594.48     7.58        0.474  
      4x4x1           16        596.55     7.56        0.472  
      4x4x2           32        392.11    11.50        0.359  
      4x4x4           64        303.27    14.87        0.232  

The first 5 cases all ran on a single node, so they mostly test the impact of limited bus bandwidth. Going from 2 cores to 4 cores, and from 4 cores to 8 cores, shows a significant drop in parallel efficiency. However, I am not sure how much of the drop is due to parallel communication and how much is due to bandwidth. I will run more tests that use only 2 cores per node to find out.
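
For reference, the Speedup and Efficiency columns in Table 1 can be reproduced from the wall-clock times alone. The sketch below assumes the usual definitions, speedup = T(1 process) / T(p processes) and efficiency = speedup / p. The function name is illustrative and not part of CitcomS.

```python
# Wall-clock times (seconds) from Table 1: (partitioning, processes, time).
table1 = [
    ("1x1x1", 1, 4508.80),
    ("1x1x2", 2, 2387.18),
    ("1x1x4", 4, 1587.46),
    ("2x2x1", 4, 1455.36),
    ("2x2x2", 8, 975.25),
    ("2x2x4", 16, 594.48),
    ("4x4x1", 16, 596.55),
    ("4x4x2", 32, 392.11),
    ("4x4x4", 64, 303.27),
]


def speedup_and_efficiency(t_serial, t_parallel, nproc):
    """Return (speedup, efficiency) relative to the single-process run."""
    speedup = t_serial / t_parallel
    return speedup, speedup / nproc


t_serial = table1[0][2]  # the 1x1x1 run is the baseline
for part, nproc, t in table1:
    s, e = speedup_and_efficiency(t_serial, t, nproc)
    print(f"{part:8s} {nproc:3d} {t:9.2f} {s:7.2f} {e:7.3f}")
```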

Test 2

The following cases use different numbers of processors per node (ppn). Note that this machine has two CPU sockets per node and four cores per CPU.

Table 2:

      Partitioning   Proc#   ppn  Time (sec.)  Speedup   Efficiency
      2x2x2            8      8     975.25      4.62        0.578
      2x2x2            8      4     771.19      5.85        0.731
      2x2x2            8      2     688.03      6.55        0.819
      2x2x2            8      1     720.66      6.26        0.782
      4x4x1           16      8     596.55      7.56        0.472
      4x4x1           16      4     477.53      9.44        0.590
      4x4x1           16      2     440.79     10.23        0.639

From ppn=1 to ppn=2, the performance increases slightly (about 5% faster in the 2x2x2 case), most likely because our Myrinet MPI library has efficient inter-core communication. From ppn=2 to ppn=4, it slows down a little, and from ppn=4 to ppn=8, it slows down significantly (about 25% in both partitionings), which can be attributed to the bus bandwidth limit.

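The conclusion is that ppn=2 is the fastest setting on this machine (two quad-core CPUs per node), and ppn=4 is roughly 8-12% slower than ppn=2 in wall-clock time. Since ppn=4 needs only half as many nodes as ppn=2 for the same job, setting ppn=4 is probably the most efficient use of computational resources.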
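
To make the ppn comparison concrete, the sketch below takes the wall-clock times from Table 2 and reports how much slower each setting is than ppn=2, along with the number of nodes each setting occupies. The helper names are illustrative, and the node count simply assumes processes / ppn.

```python
# Wall-clock times (seconds) from Table 2, keyed by partitioning and ppn.
table2 = {
    "2x2x2": {8: 975.25, 4: 771.19, 2: 688.03, 1: 720.66},
    "4x4x1": {8: 596.55, 4: 477.53, 2: 440.79},
}
nprocs = {"2x2x2": 8, "4x4x1": 16}


def slowdown_vs(times, ref_ppn=2):
    """Percent increase in wall-clock time relative to the reference ppn."""
    ref = times[ref_ppn]
    return {ppn: (t / ref - 1.0) * 100.0 for ppn, t in times.items()}


def nodes_used(nproc, ppn):
    """Nodes occupied when nproc processes are placed ppn per node."""
    return nproc // ppn


for part, times in table2.items():
    slow = slowdown_vs(times)
    for ppn in sorted(times, reverse=True):
        print(f"{part}  ppn={ppn}  nodes={nodes_used(nprocs[part], ppn):2d}  "
              f"time={times[ppn]:7.2f}  vs ppn=2: {slow[ppn]:+6.1f}%")
```

Reading the slowdown next to the node count shows the trade-off behind the choice of ppn: fewer processes per node run faster, but tie up more nodes for the same job.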
