NeSI Maui/Mahuika


                  Maui                           Mahuika
Model             Cray XC50                      Cray CS400
Number of CPUs    18,650 x 2.4 GHz Skylake       8,424 x 2.1 GHz Broadwell
                  (1 node = 80 virtual cores)    (1 node = 72 virtual cores)
Total Memory      66.8 TB                        30 TB
Scheduler         SLURM                          SLURM

Max num of submission per user

Maui

Queue            Wall-clock limit    Nodes    CPU/Node    Max Mem/Node
nesi_research    24 h                264      40 (80)     80 or 160 GB

Max CPU request: 240 nodes = 9,600 physical = 19,200 virtual cores

Max node hours: 1200 node-hours

e.g. requesting 240 nodes limits the wall clock to 5 hours.
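
The node-hour cap can be turned into a wall-clock limit with a few lines of shell. This is a minimal sketch, assuming the cap is simply nodes x hours and that the 24 h queue limit still applies on top of it. The function name max_wall_hours is made up for illustration and is not a NeSI tool.

```shell
#!/usr/bin/env bash
# Longest wall clock (whole hours) allowed on Maui for a given node count.
# Assumes: 1200 node-hour cap and 24 h queue limit, both taken from the notes above.
max_wall_hours() {
    local nodes=$1
    local node_hour_cap=1200
    local queue_limit=24
    local hours=$(( node_hour_cap / nodes ))   # integer division rounds down
    if (( hours > queue_limit )); then
        hours=$queue_limit                     # the queue limit wins for small jobs
    fi
    echo "$hours"
}

max_wall_hours 240   # prints 5
max_wall_hours 40    # prints 24 (1200/40 = 30, capped by the queue limit)
```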



Max num of jobs (submit): 1000

Mahuika

Queue         Wall-clock limit    Nodes    CPU/Node    Max Mem/CPU    Max Mem/Node
large         3 days              226      72          1500 MB        108 GB
long          3 weeks             69       72          1500 MB        108 GB
prepost       3 h                 5        72          6800 MB        480 GB
bigmem        7 days              4        72          6800 MB        480 GB
hugemem       7 days              0.5      128         30 GB          4000 GB
gpu           3 days              4        8           13500 MB       108 GB
ga_bigmem     7 days              1        72          6800 MB        480 GB
ga_hugemem    7 days              1        128         30 GB          4000 GB

Max CPU request: 576 CPUs (8 full nodes)

Max num of jobs (submit): 1000

Max Core hours per job: 20,000 hrs.
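
The core-hour cap also bounds how long a job of a given size can run. This is a rough sketch, assuming the "CPUs" in the maximum request are counted in the same units as the core hours. The function name mahuika_max_hours is hypothetical, and the partition's own wall-clock limit still applies.

```shell
#!/usr/bin/env bash
# Longest run (whole hours) that fits the 20,000 core-hour per-job cap on Mahuika.
# Assumes core hours = CPUs requested x wall-clock hours.
mahuika_max_hours() {
    local cpus=$1
    local core_hour_cap=20000
    echo $(( core_hour_cap / cpus ))   # integer division rounds down
}

mahuika_max_hours 576   # prints 34 (largest allowed request)
mahuika_max_hours 72    # prints 277 (one full node)
```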

Dev env.

File system

Gotchas

Useful command

Fairshare score: nn_corehour_usage nesi00213 (e.g. 0.336420 out of 1.0)

TACC Stampede2


Stampede2 (TACC)
Model             Dell PowerEdge C6320P/C6420
Number of CPUs    367,024 (Xeon Phi 7250 68C 1.4 GHz)
Total Memory      736 TB
Scheduler         SLURM
Max num of submission per user

KNL: 1 node = 68 cores (1 socket) = 272 hyperthreads, but 64-68 MPI tasks per node are advisable; 4,200 nodes, (96 GB + 16 GB)/node

SKX: 1 node = 48 cores (2 sockets x 24 cores/socket) = 96 hyperthreads; 1,736 nodes

Queue            Wall-clock limit    Max Nodes/Job            Max active jobs (running+waiting)
KNL
development      2 h                 16 (1,088 cores)         1
normal           48 h                256 (17,408 cores)       50
large            48 h                2048 (139,264 cores)     5
long             120 h               32 (2,176 cores)         2
flat-quadrant    48 h                32 (2,176 cores)         5
SKX
skx-dev          2 h                 4 (192 cores)            1
skx-normal       48 h                128 (6,144 cores)        25
skx-large        48 h                868 (41,664 cores)       3

SKX is slightly more expensive than KNL

Dev env.
Default compiler: Intel 18.
File system

$HOME: 10 GB (200,000 files)
$WORK: 1 TB (3 million files): not for high IO, large files. No backup, no purge.

$SCRATCH: unlimited. No backup; files are deleted if not accessed for 10 days.

 /nesi/project/nesi00213 == $HOME/project

/nesi/nobackup/nesi00213 == $HOME/nobackup or $SCRATCH/nobackup



Gotchas
Building

Intel

module add fftw3/3.3.8 intel/18.0.2 impi/18.0.2 cmake/3.10.2


MPI_C_LIB_NAMES = mpifort;mpi;mpigi;dl;rt;pthread
MPI_dl_LIBRARY = /usr/lib64/libdl.so
MPI_pthread_LIBRARY = /usr/lib64/libpthread.so
MPI_rt_LIBRARY = /usr/lib64/librt.so

By default gcc-6.5 creeps in and the build attempts to use it instead of icc. Force icc with CC=icc.

I found "make VERBOSE=1" extremely useful for debugging build issues.


GCC

  1. Load the correct modules:
    module add git/2.24.1 cmake/3.16.1 TACC impi/17.0.3 libfabric/1.7.0 \
      autotools/1.1 xalt/2.8 gcc/7.1.0 python3/3.6.1 hdf5/1.10.4

  2. Build FFTW (3.3.8):

    ./configure --enable-float --enable-sse --enable-threads \
      --host=x86_64-pc-linux --enable-shared --prefix=$SOMEWHERE
    make all install

    Once this initial setup has been completed, the following commands can be used for the GCC workflow:
      1. activate_env /work/06833/sungbae/stampede2/Environments/stampede_gcc
      2. module restore gcc_modules
      3. export LIBRARY_PATH=$LIBRARY_PATH:~
  3. Build EMOD3D:
    mkdir build
    cd build
    FFTW_DIR=$SOMEWHERE cmake ../
    make


Issue

EMOD3D has a rounding-error issue with icc and returns the wrong "ny", which fails the post-emod3d test. Rob Graves fixed this by converting float to double in the function get_n1n2() in misc.c. The fix is included in 3.0.6. (On Nurion, however, this fix was found to be insufficient.)

Running

 

Project name must be CamelCase: DesignSafe-Graves

The Slurm script needs -N for the number of nodes:

#SBATCH -N 4
#SBATCH --ntasks=160

MPI programs are launched with "ibrun" instead of "srun".
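
Putting the three points together, a job script can be generated as below. This is a sketch only: -A, -p and -t are standard sbatch options that the notes above do not spell out, the one-hour time limit is a placeholder, and the function name make_stampede2_job is made up for illustration.

```shell
#!/usr/bin/env bash
# Print a minimal Stampede2 job script to stdout.
# Arguments: node count, MPI task count, queue name, command to launch.
make_stampede2_job() {
    local nodes=$1 ntasks=$2 queue=$3 cmd=$4
    cat <<EOF
#!/bin/bash
#SBATCH -A DesignSafe-Graves
#SBATCH -p ${queue}
#SBATCH -N ${nodes}
#SBATCH --ntasks=${ntasks}
#SBATCH -t 01:00:00

ibrun ${cmd}
EOF
}

# Example: 4 nodes and 160 tasks, as in the snippet above.
make_stampede2_job 4 160 normal ./my_mpi_program
```

Redirect the output to a file and submit it with sbatch.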


Workflow

A number of hardcoded bits that assume a NeSI machine need to be updated. Check the "stampede" branches of workflow and qcore.

https://github.com/ucgmsim/slurm_gm_workflow/tree/stampede

https://github.com/ucgmsim/qcore/tree/stampede

Usage check
(python3_stampede) sungbae@stampede21(1):~$ /usr/local/etc/taccinfo
---------------------- Project balances for user sungbae ----------------------
| Name                Avail SUs     Expires |                                 |
| DesignSafe-Graves       19974  2020-09-30 |                                 |
------------------------ Disk quotas for user sungbae -------------------------
| Disk          Usage (GB)     Limit   %Used   File Usage      Limit    %Used |
| /home1               0.8      10.0    7.82         1853     200000     0.93 |
| /work               10.0    1024.0    0.97        55539    3000000     1.85 |
| /scratch            11.0       0.0    0.00         4032          0     0.00 |
-------------------------------------------------------------------------------


Available 19974 SUs out of 20000.


KISTI Nurion



Nurion (KISTI)
Model             Cray CS500
Number of CPUs    570,020 (Xeon Phi 7250 68C 1.4 GHz)
Total Memory
Scheduler         PBS
Max num of submission per user

KNL: 1 node = 68 cores (1 socket); 8,305 nodes, (96 GB + 16 GB)/node

SKL: 1 node = 40 cores (2 sockets x 20 cores/socket); 132 nodes, 192 GB/node

Queue             Wall-clock limit    Max Nodes/Job            Max running jobs    Max active jobs (running+waiting)
KNL
exclusive         unlimited           2600 (176,800 cores)     100                 200
normal (82 GB)    48 h                4970 (337,960 cores)     550                 600
long (82 GB)      120 h               300                      25                  30
flat (102 GB)     48 h                180                      35                  40
debug (82 GB)     48 h                2 (20 avail)             2                   2
SKL
commercial        48 h                118 (4,720 cores)        2                   6
norm_skl          48 h                118 (4,720 cores)        15                  20
Dev env.
File system

Gotchas

Building EMOD3D was somewhat tricky. I ended up building my own CMake 3.9 (the existing module has no ccmake, and later versions of CMake are buggy) and my own fftw3 (the existing module didn't have fftw3f, and CMake failed to pick it up).


It was originally built with the Intel tool chain, but EMOD3D had rounding-error issues and generated incompatible random numbers (different from Maui). For the best (and consistent) results, the GNU tool chain is highly recommended.


The following modules are used.

craype-network-opa

gcc

craype-mic-knl

mvapich2


mvapich2 is required, as mpi4py doesn't seem to work properly with openmpi.


Don't bother with the fftw3 module. We need to build fftw3 from scratch; only the fftw3f (single-precision) version is needed.

FFTW3

export MPICC='mpicc -fPIC -march=knl'

export CC='gcc -fPIC -march=knl'

./configure --enable-float --enable-sse --enable-threads --host=x86_64-pc-linux --enable-shared --prefix=/home01/hpc11a02/gmsim/Environments/nurion/ROOT/local/gnu

make all install


EMOD3D


mkdir build

cd build

export FFTW_DIR=/home01/hpc11a02/gmsim/Environments/nurion/ROOT/local/gnu

cmake ..

cmake --build . --target all -j 8


GMT

Prerequisites

  • curl
  • sqlite-snapshot-202004061816
  • zlib-1.2.11
  • libpng-1.6.37
  • tiff-4.1.0
  • GraphicsMagick-1.3.35
  • proj-7.0
  • gdal-3.0.1

Except for GDAL, this works:

Here $HOME is /home01/x2319a02.

$ export PKG_CONFIG_PATH=$HOME/gmsim/Environments/nurion/ROOT/local/gnu/lib/pkgconfig
$ ./configure --prefix=$HOME/gmsim/Environments/nurion/ROOT/local/gnu && make all install


For GDAL,

module add netcdf

CPPFLAGS=-I$HOME/gmsim/Environments/nurion/ROOT/local/gnu/include PKG_CONFIG_PATH=$HOME/gmsim/Environments/nurion/ROOT/local/gnu/lib/pkgconfig \
./configure --prefix=$HOME/gmsim/Environments/nurion/ROOT/local/gnu --with-proj=$HOME/gmsim/Environments/nurion/ROOT/local/gnu && make all install

(Edit 2022/11/25: I had to manually add CPPFLAGS into config.status.)

For GMT, go to the build directory and run:

cmake -DDCW_PATH:PATH=$HOME/gmsim/Environments/nurion/ROOT/share/dcw-gmt-1.1.4 -DGSHHG_PATH:PATH=$HOME/gmsim/Environments/nurion/ROOT/share/gshhg-gmt-2.3.7 ../


make all install


!WARNING!

"qsub" MUST be executed in $SCRATCH directory.


Usage check

isam

$ lfs quota -h /home01

$ lfs quota -h /scratch


1 gujwa (allocation unit) = 6,400 KNL node-hours (100 SRU time) = 435,000 core hours

XXX sec * 4350/3600 = core hours
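
The conversion above can be wrapped in a small helper. This is a sketch that applies the formula exactly as written; the factor 4350 is taken from the note as-is (its derivation is not stated here), and the function name secs_to_core_hours is made up for illustration.

```shell
#!/usr/bin/env bash
# Convert a job's elapsed seconds to core hours using the note's formula:
#   core hours = seconds * 4350 / 3600
secs_to_core_hours() {
    # awk does the floating-point arithmetic that bash lacks
    awk -v secs="$1" 'BEGIN { printf "%.1f\n", secs * 4350 / 3600 }'
}

secs_to_core_hours 3600   # prints 4350.0
secs_to_core_hours 1800   # prints 2175.0
```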

For details of PBS, see PBS page.


