Running the Pulsar Timing Code on Xeon Phi (Aug 2015)
Intel's Xeon Phi:
http://www.intel.de/content/www/de/de/processors/xeon/xeon-phi-detail.html
Status plot of cards in xeonphi02:
https://www.atlas.aei.uni-hannover.de/~fehrmann/MicLoad/micload.png
xeonphi02: mic0, mic1, mic2, ..., mic7
Note: The directory /local/user/$USER/ on xeonphi02 is available on mic0 at /home/$USER/
Optional: Modify your ~/.bash_profile to include
if [[ `hostname -s` = xeonphi* ]]; then
    echo "Recognized xeonphi host! Sourcing intel compiler."
    source /opt/intel/2015/intel.sh
    source /opt/intel/2015/composer_xe_2015.0.090/bin/compilervars.sh intel64
    source /opt/intel/2015/impi_5.0.1/bin64/mpivars.sh
    export DAPL_DBG_TYPE=0
    export I_MPI_MIC=1
fi
git clone git@gitmaster.atlas.aei.uni-hannover.de:gamma-ray-project/fgrptiming.git fgrptiming_mic
cd fgrptiming_mic
git checkout xeonphi
ssh xeonphi02
./build.sh --linux-mic
cd testing/J1035
./run_J1035-6720_mics.sh
Mount Atlas via sshfs
Then:
python triplot_iamcmc.py --infile /Users/holger/sshfs/atlasB/FermiLAT/fgrptiming_mic/testing/J1035/PSR_J1035-6720_v19.dat --outfile /Users/holger/sshfs/atlasB/FermiLAT/fgrptiming_mic/testing/J1035/tripl_PSR_J1035-6720_v19.dat --parfile /Users/holger/sshfs/atlasB/FermiLAT/fgrptiming_mic/testing/J1035/PSR_J1035-6720_v19.dat.par --phaseplots
Xeon Phi Computing examples (Jan 2015)
Here are two example codes showing how to perform parallel computing with the Xeon Phi (MIC).
Example of offloading to the Xeon Phi using OpenMP: test-openmp.c
Example using Intel MPI to parallelise the computing: test-mpi.c
First, log in to atlas8 and do the following:
source /opt/intel/2015/intel.sh
source /opt/intel/2015/impi_latest/intel64/bin/mpivars.sh
source /opt/intel/2015/composer_xe_2015/bin/compilervars.sh intel64
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib64
To compile the OpenMP example (remove the -openmp flag to compare against a normal serial run):
icc -openmp test-openmp.c -o test-openmp
This code simply "offloads" the expensive for-loop to the Xeon Phi using shared memory, and OpenMP handles the parallelisation. All pointers and arrays that need to be passed to the Xeon Phi must be listed explicitly in the offload pragma, e.g. with in(pointer:length(array_length)). Any variables which are iterated over in the loop should be declared "private" to prevent threads from interfering with one another.
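A rough sketch of this offload + OpenMP pattern is given below (the array names and the trivial squaring computation are purely illustrative, not the actual contents of test-openmp.c):

#include <stdio.h>
#include <stdlib.h>

#define N 1000000

int main(void)
{
    double *a = malloc(N * sizeof(double));   /* input array  */
    double *b = malloc(N * sizeof(double));   /* output array */
    long i;

    for (i = 0; i < N; i++)
        a[i] = (double)i;

    /* Offload the expensive loop to mic0; the in/out clauses copy the
       arrays to and from the coprocessor's memory. */
    #pragma offload target(mic:0) in(a:length(N)) out(b:length(N))
    {
        /* OpenMP spreads the loop over the MIC's threads; the loop
           index is declared private so threads do not interfere. */
        #pragma omp parallel for private(i)
        for (i = 0; i < N; i++)
            b[i] = a[i] * a[i];
    }

    printf("b[N-1] = %g\n", b[N-1]);
    free(a);
    free(b);
    return 0;
}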
The MPI example can be run in a few different modes: only on the head node, only on the Xeon Phi, or "symmetrically" on both at once. Using the Xeon Phi like this requires passwordless SSH access to the MIC (ask Henning!).
First, compile for both the head node and the Xeon Phi separately with:
#Head node only:
mpiicc -lmpi test-mpi.c -o test-mpi.host
#Xeon Phi version:
mpiicc -mmic -lmpi test-mpi.c -o test-mpi.mic
#The mic version must be copied to the Xeon Phi, using e.g.:
cp test-mpi.mic /local/user/fermi/
You can then run them using the commands below (-n X launches X MPI processes on that host):
#Head node only:
mpirun -n 6 ./test-mpi.host
#XeonPhi only:
mpirun -host fermi@192.168.1.1 -n 240 /home/fermi/test-mpi.mic
#Both symmetrically:
mpirun -host atlas8.atlas.local -n 6 ./test-mpi.host : -host fermi@192.168.1.1 -n 240 /home/fermi/test-mpi.mic
Using MPI means that the code runs as separate processes (each keeping its own copy of all variables in memory), which communicate only at the points where you tell them to; in this case only with MPI_Reduce(), which gathers every process's partial result and sums them. This method is a bit more cumbersome, but it gives you a huge amount of control over exactly what is computed on each node.
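As a minimal sketch of that pattern (the partial-sum computation is only illustrative and is not the actual contents of test-mpi.c):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    long i, n = 1000000;
    double partial = 0.0, total = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each MPI process works independently on its own slice of the range. */
    for (i = rank; i < n; i += size)
        partial += (double)i * (double)i;

    /* The only communication: sum every process's partial result onto rank 0. */
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total = %g (computed by %d processes)\n", total, size);

    MPI_Finalize();
    return 0;
}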