The scope of this page is to summarize the results of some benchmarks run on a set of 96 jobs (S6B HL LF) as a function of the pipeline stage and of various attempted optimizations.

---
Compilation flags optimization @Atlas
In the following table, we report the results of a few tests conducted to estimate the impact of various combinations of compilers and compilation flags on the cumulative job elapsed time (JET) and on the throughput. The JET, averaged over the chosen set of 96 jobs, is reported at various stages of the pipeline, together with the final average throughput per core. JET distributions are shown in the links labelled "benchmark". The "report" link shows a "standard" GW results page produced on the 96 jobs. The analysis label is S6B_BKG_LF_L1H1_2G_RSRA_run2a. The pool of machines: 8 Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz x 12 cores (a9201-a9208, reserved on the Atlas cluster).
Code version:
WAT Version 6.0.1.1
SVN Revision 4105M
LAL Version 6.13.4.1
FRLIB Version 8.21
Based on ROOT 5.34/25
Glossary
baseline gcc/icc = gcc/icc -c -O2 -fPIC -Wno-deprecated -fopenmp -fPIC -fexceptions
cpu2 = 6 concurrent jobs per node instead of 12
<JET> = Job Elapsed Time averaged over the jobs
Average Throughput = Average rate of production, i.e. Processed data over JET
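As a concrete illustration of the two glossary quantities, here is a minimal sketch with made-up job times and an arbitrary data-per-job figure (not numbers from the benchmark set):

```python
# Hypothetical per-job elapsed times [hr]; illustrative values only
jets = [1.3, 1.5, 1.6, 1.4, 1.5]
data_per_job = 250.0  # processed data per job, in the (unstated) units of the tables

mean_jet = sum(jets) / len(jets)                                # <JET>
var = sum((t - mean_jet) ** 2 for t in jets) / (len(jets) - 1)  # sample variance
spread = var ** 0.5                                             # the +/- quoted in the tables
throughput = data_per_job / mean_jet                            # average rate of production

print(f"<JET> = {mean_jet:.2f} +/- {spread:.2f} hr, throughput = {throughput:.0f}")
```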
Table1
| *#* | *Notes* | *<JET> [hr] STRAIN* | *<JET> [hr] CSTRAIN* | *<JET> [hr] COHERENCE* | *<JET> [hr] SUPERCLUSTER* | *<JET> [hr] TOTAL* | *Average throughput per core* | *Links* |
| 1 | baseline gcc | 0.027 +/- 0.016 | 0.034 +/- 0.017 | 0.215 +/- 0.043 | 0.69 +/- 0.15 | 1.46 +/- 0.33 | 172 +/- 48 | benchmark report |
| 2 | baseline icc | 0.016 +/- 0.004 | 0.022 +/- 0.005 | 0.16 +/- 0.03 | 0.53 +/- 0.17 | *1.28 +/- 0.36* | *199 +/- 67* | benchmark report |
| 3 | gcc + cpu2 | 0.014 +/- 0.008 | 0.019 +/- 0.010 | 0.14 +/- 0.03 | 0.47 +/- 0.12 | 1.12 +/- 0.31 | 114 +/- 37 | benchmark report |
| 4 | gcc + O3 | 0.018 +/- 0.004 | 0.024 +/- 0.005 | 0.19 +/- 0.04 | 0.58 +/- 0.18 | 1.47 +/- 0.41 | 173 +/- 58 | benchmark report |
| 5 | icc + O3 | 0.022 +/- 0.007 | 0.029 +/- 0.011 | 0.19 +/- 0.04 | 0.64 +/- 0.18 | 1.47 +/- 0.40 | 172 +/- 52 | benchmark report |
| 6 | gcc + mtune/march=native | | | | | 1.61 +/- 0.45 | 159 +/- 51 | benchmark report |
| 7 | icc + mtune/march=native | | | | | 1.28 +/- 0.32 | 198 +/- 67 | benchmark report |
| 8 | gcc + O3 + mtune/march=native | 0.018 +/- 0.003 | 0.024 +/- 0.005 | 0.19 +/- 0.04 | 0.61 +/- 0.17 | 1.50 +/- 0.42 | 170 +/- 53 | benchmark report |
| 9 | icc + O3 + mtune/march=native | 0.018 +/- 0.003 | 0.023 +/- 0.004 | 0.17 +/- 0.04 | 0.57 +/- 0.17 | 1.29 +/- 0.30 | 193 +/- 59 | benchmark report |
| 10 | gcc + O3 + mtune/march=native + cpu2 | 0.014 +/- 0.006 | 0.018 +/- 0.006 | 0.13 +/- 0.03 | 0.45 +/- 0.11 | 1.07 +/- 0.31 | 120 +/- 41 | benchmark report |
| 11 | icc + O3 + mtune/march=native + cpu2 | 0.013 +/- 0.005 | 0.016 +/- 0.006 | 0.12 +/- 0.02 | 0.41 +/- 0.11 | 0.95 +/- 0.26 | 133 +/- 41 | benchmark report |
| 12 | icc + xHost | | | | | 1.53 +/- 0.42 | 167 +/- 53 | |
| 13 | icc + netrho=5.5 | | | | | 1.96 +/- 0.59 | 135 +/- 64 | |
Compilation flags optimization @CIT
In the following table, we report the results of a few tests conducted to estimate the impact of various combinations of compilers and compilation flags on the cumulative job elapsed time (JET) and on the throughput. The JET, averaged over the chosen set of 96 jobs, is reported together with the final average throughput per core. The analysis label is S6B_BKG_LF_L1H1_2G_RSRA_run2a. The pool of machines: 4 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.60GHz x 32 cores (node514-516 and node518, reserved on the CIT cluster).
Code version:
WAT Version 6.0.1.1
SVN Revision 4112M
FRLIB Version 8.21
Based on ROOT 5.34/25
Glossary
baseline gcc/icc = gcc/icc -c -O2 -fPIC -Wno-deprecated -fopenmp -fPIC -fexceptions
cpuN = 32/N concurrent jobs per node instead of 32
<JET> = Job Elapsed Time averaged over the jobs
Average Throughput = Average rate of production, i.e. Processed data over JET
Table2
| *#* | *Notes* | *<JET> [hr] TOTAL* | *Average throughput per core* | *Links* |
| 1 | baseline gcc(4.4) | 2.35 +/- 0.50 | 106 +/- 31 | |
| 2 | baseline gcc(4.9) | *2.48 +/- 0.60* | *101 +/- 28* | |
| 3 | baseline icc(2015) | 2.16 +/- 0.53 | 116 +/- 32 | |
| 4 | gcc(4.9) + cpu2 | 1.72 +/- 0.42 | (146 +/- 42)/2 | |
| 5 | gcc(4.9) + cpu4 | 1.52 +/- 0.42 | (168 +/- 55)/4 | |
| 6 | gcc(4.9) + cpu8 | 1.42 +/- 0.40 | (182 +/- 77)/8 | |
| 7 | icc + cpu2 | 1.45 +/- 0.36 | (174 +/- 52)/2 | |
| 8 | icc + cpu4 | 1.27 +/- 0.33 | (197 +/- 52)/4 | |
| 9 | icc + cpu8 | 1.23 +/- 0.35 | (210 +/- 85)/8 | |
| 10 | gcc(4.9) + native | 2.29 +/- 0.59 | 111 +/- 35 | |
| 11 | icc + native | 2.07 +/- 0.48 | 121 +/- 36 | |
| 12 | gcc(4.9) + O3 | 2.37 +/- 0.67 | 108 +/- 34 | |
| 13 | icc + O3 | 2.30 +/- 0.55 | 108 +/- 29 | |
| 14 | icc + xHost | 2.28 +/- 0.62 | 111 +/- 31 | |
Tests @CIT for optimization report
Since the cWB code has not yet been parallelized, but one can still saturate the cores of a CPU by running many single-threaded jobs in parallel, we ran sixteen identical copies of a job, one per core of an E5-2670 node, and compared the average runtime to that of a single job running without resource competition on one core of the same node. The job (#1560) was chosen among the 96 from the previous tests to be as close as possible to the average throughput. We selected the best compilation options (i.e. icc + native, see entries 11 and 7 of Table 2).
In Table 3 below, we report the results of a few tests with different options for the single-job test and for the 16 concurrent replicas of the same job. All tests were done on an Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.60GHz x 16 cores (node518, no Hyper-Threading, reserved on the CIT cluster).
Table3
| *#* | *Notes* | *<JET> [hr] TOTAL* | *Average throughput per core* |
| 1 | icc + native, single | 1.14 +/- 0 | (239.5 +/- 0)/16 |
| 2 | icc + native, 16 jobs | 1.56 +/- 0.01 | 174.4 +/- 1.2 |
| 3 | icc + native, single + wat4159 | 1.06 +/- 0 | (258.0 +/- 0)/16 |
| 4 | icc + native, 16 jobs + wat4159 | 1.13 +/- 0.02 | 240.3 +/- 3.1 |
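A quick way to quantify how close the 16-job run is to ideal scaling is to compare entries 3 and 4 of Table 3. The efficiency figure below is derived from those table values (the percentage itself is not quoted in the original):

```python
# JET values from Table 3, with the wat r4159 libraries
jet_single = 1.06   # entry 3: one pinned job, no resource competition [hr]
jet_16jobs = 1.13   # entry 4: 16 concurrent copies, one per core [hr]

# Each job slows down by only ~7% when all 16 cores are busy
efficiency = jet_single / jet_16jobs
print(f"parallel efficiency ~ {efficiency:.1%}")   # ~93.8%
```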
Profiling with perf
We tried following the indications on this page (perf link). We ran the single job (entry 1 of Table 3) with perf stat and all the suggested options.
18,343,695,280,058 r5300c0 [62.50%]
93,691,855,831 r530110 [62.49%]
1,148,154,220,148 r532010 [62.50%]
1,134,488,997,034 r534010 [62.50%]
403,702,858,397 r538010 [62.50%]
32,853,515,497 r531010 [50.01%]
286 r530111 [50.00%]
19,984,550,066 r530211 [50.00%]
4099.968195141 seconds time elapsed
perf stat -e r5300c0 -e r530110 -e r532010 -e r534010 -e r538010 -e r531010 - 3669.43s user 7.95s system 89% cpu 1:08:20.01 total
(93691855831+1148154220148+1134488997034*4+403702858397+32853515497*2+286*8+19984550066*4)/4099.968195141
(const double)1.54370713498677206e+09
The estimated single-job floating-point rate was 1.54 GFLOPS.
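The arithmetic above weights each raw counter by the number of floating-point operations per retired instruction and divides by the elapsed time. Reproduced as a sketch (the weights are those used in the formula above; the per-event labels in the comments are the standard FP_COMP_OPS_EXE / SIMD_FP_256 assignments for this CPU family and are our assumption):

```python
# perf counters from the run above, weighted by FLOPs per instruction
flops = (
      93_691_855_831           # r530110  x87                    x1
    + 1_148_154_220_148        # r532010  SSE scalar single      x1
    + 1_134_488_997_034 * 4    # r534010  SSE packed single      x4
    + 403_702_858_397          # r538010  SSE scalar double      x1
    + 32_853_515_497 * 2       # r531010  SSE packed double      x2
    + 286 * 8                  # r530111  256-bit packed single  x8
    + 19_984_550_066 * 4       # r530211  256-bit packed double  x4
)
elapsed = 4099.968195141       # seconds time elapsed
rate = flops / elapsed
print(f"{rate / 1e9:.2f} GFLOPS")   # 1.54 GFLOPS, as quoted above
```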
We repeated the perf test with the new version of the WAT libraries (r4159); see entry 3 of Table 3.
Performance counter stats for 'taskset -c 0 /home/salemi/ROOT/root-v5-34-25_icc/bin/root -b -q -l /home/salemi/SVN/watrepo/wat.test/trunk/tools/cwb/macros/cwb2G_parameters.C ./config/user_parameters.C /home/salemi/SVN/watrepo/wat.test/trunk/tools/cwb/macros/cwb_xnet.C("./config/user_parameters.C",CWB_STAGE_FULL,"./config/user_parameters.C",true,false)':
9,617,958,974,829 r53003c
18,326,601,093,105 r5300c0
90,835,528,242 r530110
1,154,503,618,542 r532010
1,131,789,600,923 r534010
395,865,816,661 r538010
38,630,751,078 r531010
4,661 r530111
20,203,977,011 r530211
3806.594204513 seconds time elapsed
perf stat -e r53003c -e r5300c0 -e r530110 -e r532010 -e r534010 -e r538010 - 3687.17s user 32.95s system 97% cpu 1:03:26.60 total
(90835528242+1154503618542+395865816661+4*1131789600923+2*38630751078+8*4661+4*20203977011)/3806.594204513
(const double)1.66196880327420640e+09
The estimated single-job floating-point rate after the WAT upgrade is now 1.66 GFLOPS.
Summary:
Stuart Anderson warned us that the 16 concurrent jobs from the tests in entry 2 were spending a significant amount of time in D-state (i.e. uninterruptible sleep while doing I/O) and were actually writing at a very slow pace (4 bytes at a time). It was then easy to pinpoint the part of the code responsible for those D-states. After some investigation, Gabriele Vedovato found two distinct problems and changed the code. Thorough testing over the last few days has confirmed that the changes work fine in a wide range of configurations; these changes have now been committed to the main cWB SVN repository (svn=4159).
Technical details:
We use some streamer classes within ROOT to do the I/O. In particular, when passing from one stage of the pipeline to the next, some I/O is done by means of these streamers: a garbage-collector (GC) file is filled, and some (in some cases very long) lists of objects need to be deleted. The problem with the GC was due to a combination of causes: after a ROOT upgrade (ROOT > 5.34), the streamers changed, and a patch added to support more recent ROOT versions was very inefficient. Still, this was a relatively marginal problem. The deletion of a large number of objects was responsible for the lengthiest D-states. A new implementation of the algorithm handling these lists has practically solved the problem.
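The I/O pathology described above (many tiny 4-byte writes keeping the process in D-state) can be illustrated with a minimal, self-contained sketch. This is not the actual cWB/ROOT code; it only shows how buffering collapses many small writes into a few large ones:

```python
import io

class CountingRaw(io.RawIOBase):
    """Raw sink that counts how many low-level write() calls it receives."""
    def __init__(self):
        self.calls = 0
    def writable(self):
        return True
    def write(self, b):
        self.calls += 1
        return len(b)

record = b"\x00\x00\x00\x00"   # a 4-byte payload, as in the reported pathology
n = 100_000

# Unbuffered: one low-level write per 4-byte record -> n tiny I/O operations
raw = CountingRaw()
for _ in range(n):
    raw.write(record)
print(raw.calls)               # 100000 tiny writes

# Buffered: the same data coalesced into a handful of large writes
raw2 = CountingRaw()
buf = io.BufferedWriter(raw2, buffer_size=64 * 1024)
for _ in range(n):
    buf.write(record)
buf.flush()
print(raw2.calls)              # a handful of large writes instead
```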
Conclusion:
The new version of cWB has shrunk the D-state time practically to zero. These D-states were a minor problem for the single job, but quite a serious one for the 16-concurrent-jobs test, where they lasted up to ~24 minutes at the end of the supercluster stage. This explains the moderate improvement of the single-job test and of the FLOP estimate, and the impressive improvement of the 16-concurrent-jobs test. It should be noted that this result shows that cWB is very close to being optimally parallelized.
--
FrancescoSalemi - 19 Feb 2015