

Access to the P755 Cluster

Access to the P755 cluster is via the Secure Shell (SSH) to the login nodes p1n14.canterbury.ac.nz (which runs the AIX operating system) or p2n14.canterbury.ac.nz (which runs the Linux operating system). Once logged in, the user can compile and run jobs on the cluster.

It is strongly recommended that you change the password provided by BlueFern and create a new one. To change your password:

  1. Log into the system using your old password, if you are not already logged in.
  2. At the prompt, use the command passwd.

  3. Usually, you will be asked to enter your old password first. This is a security measure, designed to help prevent someone else from changing your password.
  4. You will be asked to enter your new password. Note that choosing a good password is not as simple as just thinking of a word; see the useful guide Choosing a password.
  5. You will then be asked to re-enter your new password.
  6. You should then be informed that the change has been successful, and be returned to the prompt.

Compiling Programmes

Introduction

The compiler versions running on the P755 cluster that explicitly support the POWER7 architecture are:

  • XL C/C++ (C/C++ compiler) version 11.1
  • XLF Fortran version 13.1

Code generated by previous versions of the compilers will run on the P755 cluster; however, to fully exploit POWER7 features it is recommended that you recompile your programmes.

Compiling Power7 programmes

Compiling Serial programmes

A reasonable starting point to compile and produce optimised serial Power7 executable programmes is as follows:

On AIX or Linux

For C:
xlc_r -q64 -qarch=pwr7 -qtune=pwr7 -O3 -qhot myprog.c

For Fortran:
xlf_r -q64 -qarch=pwr7 -qtune=pwr7 -O3 -qhot myprog.f

For Fortran90:
xlf90_r -q64 -qarch=pwr7 -qtune=pwr7 -O3 -qhot myprog.f90

Note 1


By default the IBM XL compilers use 32-bit addressing. To change this to 64-bit addressing, use the -q64 option. All object files making up the same executable must be compiled and linked with the -q64 flag.
This flag increases the amount of memory that a program can use and is a good choice for most codes. One possible disadvantage is that all pointers double in size, so there may be a performance impact on codes that use a lot of pointers; this mainly concerns codes written in C or C++.

Compiling MPI programmes

To compile and produce an optimised parallel Message Passing Interface (MPI) Power7 programme a reasonable starting point is as follows:

On AIX

For C:
mpcc_r -qarch=pwr7 -qtune=pwr7 -O3 -qhot myprog.c

For Fortran:
mpxlf_r -qarch=pwr7 -qtune=pwr7 -O3 -qhot myprog.f

For Fortran 90:
mpxlf90_r -qarch=pwr7 -qtune=pwr7 -O3 -qhot myprog.f90

On Linux

For C:
mpcc -qarch=pwr7 -qtune=pwr7 -O3 -qhot myprog.c

For Fortran:
mpfort -qarch=pwr7 -qtune=pwr7 -O3 -qhot myprog.f

For Fortran90:
export MP_COMPILER=xlf90_r

mpfort -qarch=pwr7 -qtune=pwr7 -O3 -qhot myprog.f90

Compiling OpenMP programmes

To compile and produce an optimised Open Multi-Processing (OpenMP) Power7 programme a reasonable starting point is as follows:

On AIX or Linux

For C:
xlc_r -q64 -qsmp=omp -qarch=pwr7 -qtune=pwr7 -O3 -qhot myprog.c

For Fortran:
xlf_r -q64 -qsmp=omp -qarch=pwr7 -qtune=pwr7 -O3 -qhot myprog.f

For Fortran90:
xlf90_r  -qsmp=omp -q64 -qarch=pwr7 -qtune=pwr7 -O3 -qhot myprog.f90

Running Programmes

Introduction

LoadLeveler provides the facility for submitting and monitoring batch jobs on the P755 cluster. There are four classes (queues) defined: two supporting jobs that run on the AIX nodes and two for running jobs on the Linux nodes.
Their characteristics are as follows:

Class Name (Queue)   Max Number of Nodes   Max MPI Tasks per Node   Threads per Task (Recommended)   Max Elapsed Time per Job
p7aix                2                     32                       4                                24 hours
p7aix_dev            1                     8                        4                                0.5 hours
p7linux              11                    32                       4                                24 hours
p7linux_dev          1                     8                        4                                0.5 hours

Note


As the names suggest, users are encouraged to use the p7aix_dev and p7linux_dev classes when developing their applications.

More information on running applications on the Power 755 is in the Performance Guide for HPC Applications.

Examples

Running a serial job through LoadLeveler

Step 1:

Create a LoadLeveler job file similar to below. In this example it will be called serial.ll.

# Example Serial LoadLeveler Job file
# @ shell              = /bin/bash 
# @ job_name           = myrun
# @ job_type           = serial
# @ wall_clock_limit   = 10:00:00
# @ class              = p7linux
# @ group              = UC 
# @ account_no         = bfcs00000
# To receive an email when your job has completed:
# @ notification       = complete
# @ notify_user        = myemail@gmail.com    
# Output and error files 
# @ output             = $(job_name).$(schedd_host).$(jobid).out
# @ error              = $(job_name).$(schedd_host).$(jobid).err
# @ task_affinity      = core(1)
# @ rset               = rset_mcm_affinity 
# @ queue

# All commands that follow will be run as part of my serial job

# Display name of host running serial job
hostname

# Display current time
date

# Run a serial C program
./serial_linux.exe

This job is called "myrun" and has a wall clock limit of 10 hours. All the errors or warnings produced by the execution of this serial code will be written to "$(job_name).$(schedd_host).$(jobid).err", e.g. myrun.p2n14-c.245695.err. The output of the serial code will be written to myrun.p2n14-c.245695.out.

Note


The key differences when running a serial job compared to a parallel job are as follows:

@ job_type = serial -- specifies to LoadLeveler that the job is a serial job

There is no need to specify the tasks_per_node or node keywords in order to run the job.

Step 2:

Submit the file just created to LoadLeveler using the following command:

llsubmit serial.ll
Step 3:

Monitor the progress of the job using the following command:

llq
 

For example, to see why a job has not started, issue the command llq -s jobID.

The output will explain the reason: for instance, that the job could not start because of incorrect affinity requirements in the LoadLeveler script, or, if the cluster is particularly busy at the time of your job submission, that there are not enough free nodes for your job to start.

Another useful command is llstatus, which shows the current status of each node of the Power7 cluster:

  • Down (node not available)
  • Idle (node completely free)
  • Run (node with some tasks running but with some cores free and available)
  • Busy (node with the maximum of tasks reached, all cores or maximum number of tasks are used)

Running an MPI job through LoadLeveler

Step1:

An example LoadLeveler job file to run a parallel MPI job is shown below, followed by an explanation of each line. In this example the file will be called mympi.ll.
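As a sketch, the directives explained below can be assembled into a mympi.ll file like this (the group, account_no, and executable name are placeholders, and the value given for network.MPI_LAPI is an assumption -- check it against your site's LoadLeveler documentation):

```shell
# Example MPI LoadLeveler job file (sketch assembled from the keyword
# descriptions that follow; adjust the placeholder values to your project)
# @ shell              = /bin/bash
# @ job_name           = my_run
# @ job_type           = parallel
# @ node               = 2
# @ tasks_per_node     = 32
# @ wall_clock_limit   = 00:20:00
# @ group              = NZ
# @ account_no         = bfcs00000
# @ output             = $(job_name).$(schedd_host).$(jobid).out
# @ error              = $(job_name).$(schedd_host).$(jobid).err
# @ notification       = never
# @ class              = p7linux
# @ rset               = rset_mcm_affinity
# @ task_affinity      = core(1)
# @ network.MPI_LAPI   = sn_all,US       # value is an assumption
# @ environment        = COPY_ALL
# @ queue

export MP_EAGER_LIMIT=65536
export MP_SHARED_MEMORY=yes
export MEMORY_AFFINITY=MCM

poe ./my_executable
```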

Note


All the lines in this command file are required, unless stated otherwise below. The meaning of each line in this command file is as follows:

# Example.... - is a comment, provided that a @ symbol does not follow the # symbol.
Any line starting with # and followed by a @ symbol is interpreted by LoadLeveler.

# @ shell = /bin/bash - Specifies the Unix shell to be used for the job.

# @ job_name = my_run - This allows the user to give a name to the job. This is not mandatory, but is useful for identifying output files.

# @ job_type = parallel - This informs LoadLeveler that this is a parallel job which requires scheduling on multiple processors.

# @ node = 2 - specifies that 2 nodes (each with 4 x 8-core chips) are to be allocated to the job. If an MPI job requires no more than 32 MPI tasks then it is recommended to set node = 1 for best performance.

# @ tasks_per_node = 32 - specifies that 32 MPI tasks are to be started on each node. This number can be varied from 1 to 32.

# @ wall_clock_limit = 00:20:00 - Specifies a wall clock limit of 20 minutes for the job. The wall clock limit has the format hh:mm:ss or mm:ss.

# @ group = NZ - specifies the group that the user belongs to. The name of the group will be provided when you register to use the system. LoadLeveler recognizes 4 groups only: NZ, NZ_merit, UC, or UC_merit. If you are unsure of your group, run the command "whatgroupami" on the login node.

# @ account_no = bfcs00000 - is the project number. This is the number we issue to you when you register a project. It is either bfcs (UC LoadLeveler groups) or nesi (NZ LoadLeveler groups) followed by five digits. You can find all the active projects you are participating in by running the command "whatprojectami" on the login node.

# @ output = $(job_name).$(schedd_host).$(jobid).out
# @ error = $(job_name).$(schedd_host).$(jobid).err
The above lines specify the files to which stdout and stderr from the job will be redirected. There is no default, so the user must set something here. The use of $(schedd_host).$(jobid) is recommended as this matches the hostid/jobid reported by the LoadLeveler command llq.

# @ notification = never - Suppresses email notification of job completion.

# @ class = p7linux - specifies the job is to run on nodes running the Linux operating system. To run a job on a node that runs the AIX operating system, specify a class name of p7aix.

# @ rset = rset_mcm_affinity - allows the user to make use of scheduling affinity to improve the performance of their program

# @ task_affinity = core(1) - causes LoadLeveler to run each MPI task on a separate core. There are 32 cores in each of the p755 nodes.
# @ network.MPI_LAPI - causes LoadLeveler to allocate the high-performance InfiniBand interconnect, which will be used for communication between MPI tasks on different nodes.

Note


This statement should be included only for Linux production jobs (p7linux class). On AIX (and on the p7linux_dev and p7linux_serial classes), this directive needs to be removed, as the AIX nodes and the Linux development node do not have InfiniBand available; jobs submitted to these classes requesting an InfiniBand adapter would wait in the job queue indefinitely.

# @ environment = COPY_ALL - this causes all the variables defined in the script to be exported to the remote nodes. When an MPI job is started, the script part of the LoadLeveler file is executed on one node, and when poe/mpirun is called the process is started on as many extra nodes as requested. Variables defined in the script are not usually exported to the remote nodes. This is especially important if you use "module", as the PATH and other variables modified or created when loading a module will not be passed on to the remote nodes without this statement.

# @ queue - This line tells LoadLeveler that this is the last LoadLeveler command in the job file.

export MP_EAGER_LIMIT=65536 - most applications perform best with a value of 65536; however, you might want to check this with your application.
For development work, set export MP_EAGER_LIMIT=0. If the code does not work with MP_EAGER_LIMIT=0, there is a problem with the way it uses MPI, which needs fixing.

export MP_SHARED_MEMORY=yes - Use shared memory inside a node. Don't change.

export MEMORY_AFFINITY=MCM - Use the memory closest to the cpu.

poe ./my_executable - this line executes an MPI executable called my_executable in the current directory.
poe is the MPI job launcher. Note that the user does not have to specify the number of processes here: it is derived automatically from the numbers requested with the LoadLeveler keywords tasks_per_node and node.

Step 2:

Submit the file just created to LoadLeveler using the following command:

llsubmit mympi.ll
Step 3:

Monitor the progress of the job using the following command:

llq

Running an OpenMP job through LoadLeveler

Step 1:

Create a LoadLeveler job file similar to below. In this example it will be called openmp.ll.

# @ shell = /bin/bash
#
# @ job_name = myrun
#
# @ job_type = parallel
# @ tasks_per_node = 1
# @ node = 1
#
# @ wall_clock_limit = 00:20:00
# @ group = UC
# @ account_no = bfcs00000
#
# @ output = $(job_name).$(schedd_host).$(jobid).out
# @ error  = $(job_name).$(schedd_host).$(jobid).err
#
# @ notification = never
# @ class                = p7linux
#
#
# @ parallel_threads = 8
# @ task_affinity=  core(8)
# @ cpus_per_core = 1
# @ rset                 = rset_mcm_affinity
# @ queue

export OMP_NUM_THREADS=8
./my_ompprog.exe

Note the key differences when running an OpenMP job compared to an MPI job are as follows:
@ parallel_threads = 8 -- in this example, requests LoadLeveler to perform OpenMP thread-level binding for 8 threads
@ task_affinity = core(8) -- in conjunction with cpus_per_core = 1, requests LoadLeveler to bind 1 thread per core.
@ cpus_per_core = 1 -- specifies the number of logical CPUs per processor core to be allocated to each task of a job with the processor-core affinity requirement.

Step 2:

Submit the file just created to LoadLeveler using the following command:

llsubmit openmp.ll
Step 3:

Monitor the progress of the job using the following command:

llq
 

Running a large memory job through LoadLeveler (>26 GB of RAM)

This section contains tips on running large memory jobs on the Power 7, whether parallel or serial, and ensuring optimum performance.

Each Power 7 node contains a total of 32 physical cores and 128 GB of RAM. However, the underlying architecture is a little more subtle, and to avoid memory thrashing (swapping) and poor performance you need to be aware of the following hardware set-up:

  • Each node contains 4 chips of 8 physical cores each (hence a total of 4*8=32 cores);
  • Each chip (MCM) has direct access to a maximum of 32 GB of RAM (hence a total of 4*32=128 GB of RAM);
  • Each chip is connected to the other 3 and can therefore share memory, as long as at least one core per MCM is part of the allocated resources of your job;
  • Roughly 4-6 GB of the 128 GB of RAM is taken by the operating system, so in reality the maximum memory a job could have per node is ~122-124 GB of RAM. Note that the 4-6 GB used by the system is not distributed across all 4 chips (MCMs) but taken from only 1 of them. So on each Power 7 node, 3 MCMs have direct access to 32 GB of RAM whereas 1 will only have access to <26 GB of RAM.

If you run a parallel job that uses an entire node (32 tasks), then your job will have 8 cores allocated per MCM and therefore access to the entire memory available (~122-124 GB); you will have no issues whatsoever.

On the other hand, if your parallel job has fewer than 32 tasks (e.g. <8) and needs at least 25 GB of RAM, you need to ensure that LoadLeveler allocates at least 1 core per MCM on that node, so that your job can access memory from the other MCMs. Otherwise, by default, LoadLeveler will accumulate your 8 tasks onto 1 MCM for faster performance, which will limit the amount of memory available to your job to 32 GB of RAM, or <26 GB of RAM if you are unlucky.

To ensure that your job has a core on more than 1 MCM for your large memory job, add the following memory affinity options to your LoadLeveler script:

  • For a parallel job:
  • For a serial job: 
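As a hedged sketch only (the mcm_affinity_options values below are an assumption, not taken from this page -- please confirm them with BlueFern staff), the idea is to ask LoadLeveler to distribute the tasks across the MCMs rather than accumulate them on one:

```shell
# Sketch only: keyword values are assumptions, check with BlueFern staff.
# For a parallel job, spread the MPI tasks across the MCMs:
# @ rset                 = rset_mcm_affinity
# @ mcm_affinity_options = mcm_distribute mcm_mem_req

# For a serial job, the same keywords apply, with task_affinity
# requesting a single core:
# @ task_affinity        = core(1)
```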

Please do not hesitate to contact a BlueFern staff member at bluefern@canterbury.ac.nz if you have any questions on running large memory jobs on the Power7.

Running multiple serial jobs through LoadLeveler

The following methods are primarily for running multiple serial jobs with different input data that are independent of each other but have similar runtimes (methods 1 and 2). For multiple serial jobs whose runtimes vary with the input data, you will need to use method 3, which uses dynamic load balancing. There is a maximum number of LoadLeveler jobs per user that can run concurrently (typically 4 at a time); these methods therefore allow a greater number of serial jobs to run at the same time using a single LoadLeveler script instead of an individual script per job.

There are 3 ways to submit multiple serial jobs:

  • Method 1: each task processes 1 input file; each task needs to have a similar runtime. It involves listing all the input data files in your LoadLeveler script and downloading a wrapper file from this website. (See below for more information.)
  • Method 2: each task processes 1 input file; each task needs to have a similar runtime. It involves using the built-in MPMD (Multiple Program Multiple Data) poe mechanism and writing a file containing all the executable names and input data. (See below for more information.)
  • Method 3: 1 task is the "leader" and dynamically distributes a queue of jobs (which can have different runtimes) to the other "worker" tasks. In this case you can process more input files than the number of tasks requested from LoadLeveler. It involves using batcher from the ADLB (Asynchronous Dynamic Load Balancing) library and creating a script of commands to pass to batcher. (See below for more information.)

Method 1:

Step 1:

You will need to create the following wrapper file "multiple_input_data_wrapper.sh" in the directory where you will be submitting your multiple serial jobs, or download this file here multiple_input_data_wrapper.sh
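For illustration, a wrapper of this kind typically uses the MP_CHILD environment variable, which poe sets to each task's index, to pick that task's input file. The sketch below is hypothetical (the INPUT_FILES variable, file names, and executable name are not taken from the real wrapper):

```shell
#!/bin/bash
# Hypothetical sketch of a wrapper like multiple_input_data_wrapper.sh.
# poe sets MP_CHILD to this task's index (0 .. total_tasks-1); the
# LoadLeveler script is assumed to export INPUT_FILES, a space-separated
# list of input files, via "# @ environment = COPY_ALL".
: "${MP_CHILD:=0}"
: "${INPUT_FILES:=case0.in case1.in case2.in}"   # placeholder defaults
files=($INPUT_FILES)
input=${files[$MP_CHILD]}
echo "task $MP_CHILD processing $input"
# In the real wrapper, the echo above would instead run the serial
# executable, e.g.: ./serial_linux.exe "$input" > "out.$MP_CHILD.log" 2>&1
```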

Note that this multiple_input_data_wrapper.sh file needs to have execute permission to work. For example, ls -al should show the x flag (for executable) for the user.

If it does not, grant execute permission on this file by running the following in the terminal:

chmod u+x multiple_input_data_wrapper.sh

If you are unsure how to do this, please do not hesitate to contact a BlueFern staff member at bluefern@canterbury.ac.nz.

Step 2:

Create a LoadLeveler job file similar to below. In this example it will be called multiple_serial_jobs.ll:
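A hedged sketch of what multiple_serial_jobs.ll might look like with the wrapper approach (the task counts, file names, and INPUT_FILES variable are assumptions, not the contents of the original file):

```shell
# Sketch of a multiple-serial-jobs LoadLeveler file (values are placeholders)
# @ shell              = /bin/bash
# @ job_name           = multi_serial
# @ job_type           = parallel
# @ node               = 1
# @ tasks_per_node     = 8
# @ wall_clock_limit   = 10:00:00
# @ class              = p7linux
# @ group              = UC
# @ account_no         = bfcs00000
# @ output             = $(job_name).$(schedd_host).$(jobid).out
# @ error              = $(job_name).$(schedd_host).$(jobid).err
# @ environment        = COPY_ALL
# @ queue

# One input file per task; the wrapper picks its file by task index.
export INPUT_FILES="case0.in case1.in case2.in case3.in case4.in case5.in case6.in case7.in"

poe ./multiple_input_data_wrapper.sh
```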

 

Step 3:

Submit the file just created to LoadLeveler using the following command:

llsubmit multiple_serial_jobs.ll
Step 4:

Monitor the progress of the job using the following command:

llq
  

Method 2, MPMD:

Step 1:

You will need to create a file, e.g. cmdfile.txt, that contains a list of all the executables to run, each with its appropriate input data.
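As an illustration (the executable and input names are hypothetical), cmdfile.txt lists one command per line, one line per MPI task:

```shell
# Hypothetical cmdfile.txt: one line per task, executable plus its input
./serial_linux.exe case0.in
./serial_linux.exe case1.in
./serial_linux.exe case2.in
./serial_linux.exe case3.in
```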

 

Step 2:

 

Create a LoadLeveler job file similar to below. In this example it will be called multiple_serial_mpmd_jobs.ll:
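A hedged sketch of such a job file: poe runs in MPMD mode when MP_PGMMODEL=mpmd is set, and reads the per-task commands from the file named by MP_CMDFILE (the resource values below are placeholders):

```shell
# Sketch of an MPMD LoadLeveler job file (values are placeholders)
# @ shell              = /bin/bash
# @ job_name           = multi_mpmd
# @ job_type           = parallel
# @ node               = 1
# @ tasks_per_node     = 4        # should match the number of lines in cmdfile.txt
# @ wall_clock_limit   = 10:00:00
# @ class              = p7linux
# @ group              = UC
# @ account_no         = bfcs00000
# @ output             = $(job_name).$(schedd_host).$(jobid).out
# @ error              = $(job_name).$(schedd_host).$(jobid).err
# @ environment        = COPY_ALL
# @ queue

export MP_PGMMODEL=mpmd
export MP_CMDFILE=cmdfile.txt

# With MP_PGMMODEL=mpmd, poe is invoked without a program name:
poe
```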

Step 3:

Submit the file just created to LoadLeveler using the following command:

llsubmit multiple_serial_mpmd_jobs.ll
Step 4:

Monitor the progress of the job using the following command:

llq
   

 

Method 3, batcher with Dynamic Load Balancing:

This method uses "batcher", an executable from the ADLB (Asynchronous Dynamic Load Balancing) library. The ADLB software library from Argonne National Laboratory was designed to help rapidly build scalable parallel programs. The batcher program is an MPI/ADLB parallel program that executes shell commands in parallel (1 command per line) using the leader/worker model: 1 task out of the total number of tasks requested plays the role of the leader and distributes a dynamic queue of commands to the other tasks (the workers).

Step 1:

You will need to create the following  script "create_input_command.sh" in the directory where you will be submitting your multiple serial jobs, or download this file here create_input_command.sh:

Once your job has been submitted by loadleveler, this script will produce a command input file for batcher to execute, and will look like this:

Note that, alternatively, you can create this command file directly by hand, without redirecting the output from each job into the subdirectory "out-$LOADL_STEP_ID" and separate output files.
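A hypothetical sketch of what a script like create_input_command.sh might do (the executable and input names are placeholders, not the contents of the real script): it writes one shell command per line into a file for batcher to distribute.

```shell
#!/bin/bash
# Hypothetical sketch: generate a batcher command file, one command per
# line, redirecting each job's output into a per-step subdirectory.
outdir="out-${LOADL_STEP_ID:-local}"   # LOADL_STEP_ID is set by LoadLeveler
cmdfile="batcher_commands.txt"
: > "$cmdfile"                         # truncate/create the command file
for input in case0.in case1.in case2.in; do
  echo "./serial_linux.exe $input > $outdir/$input.log 2>&1" >> "$cmdfile"
done
cat "$cmdfile"
```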

Step 2:

Create a LoadLeveler job file using poe and batcher, in this example it will be called batcher_jobs.ll:

Note that batcher uses MPI to run the various jobs: there is a batcher "master" which goes on to start batcher "slaves". Batcher will not run properly if total_tasks is unset or equal to one.
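A hedged sketch of batcher_jobs.ll (resource values are placeholders, and the exact batcher invocation syntax is an assumption -- check the ADLB documentation):

```shell
# Sketch of a batcher LoadLeveler job file (values are placeholders)
# @ shell              = /bin/bash
# @ job_name           = batcher_run
# @ job_type           = parallel
# @ node               = 1
# @ total_tasks        = 8        # must be greater than 1: one leader plus workers
# @ wall_clock_limit   = 10:00:00
# @ class              = p7linux
# @ group              = UC
# @ account_no         = bfcs00000
# @ output             = $(job_name).$(schedd_host).$(jobid).out
# @ error              = $(job_name).$(schedd_host).$(jobid).err
# @ environment        = COPY_ALL
# @ queue

# Invocation form is an assumption; batcher reads the command file and
# hands one line at a time to each worker task:
poe ./batcher batcher_commands.txt
```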

Step 3:

Submit the file just created to LoadLeveler using the following command:

llsubmit batcher_jobs.ll

Step 4:

Monitor the progress of the job using the following command:

llq

 

Running a parallel job interactively through LoadLeveler

Step 1:

Create a LoadLeveler job file similar to below. In this example it will be called interactive.ll.

Note the key difference here is that the launching of the parallel job is not done within the LoadLeveler job file.

# Example Interactive LoadLeveler Job file
# @ shell              = /bin/bash
# @ job_type           = parallel
# @ job_name           = parjob
# @ class              = p7linux
# @ group              = UC
# @ account_no         = bfcs00000
# @ wall_clock_limit   = 0:05:00
# @ tasks_per_node = 2
# @ node = 1
# @ rset = rset_mcm_affinity
# @ queue
Step 2:

From the command line type:

poe ./myprog -rmfile ./interactive.ll

If LoadLeveler can immediately satisfy the request to allocate the resources, control returns to poe and the parallel job, called myprog in this example, will run; any output will appear on the terminal.
If LoadLeveler cannot immediately satisfy the request, it will display the reasons why; control then returns to poe, which terminates, and the parallel job does not run.


More Information

Here are some PDF documents from IBM on using the Power 755 cluster.
