This page is for asking questions about UCSC.
- I Can't Log in!
- How do I use the BlueGene/P?
- How do I use the power755 cluster?
- Why won't my job run?
- How can I check what is running under my account and kill any runaway processes?
- Can I check resources on all available machines and possibly specify one manually?
- My POE jobs fail with permission errors
- Loadleveler is asking for my "group": What group am I?
I Can't Log in!
- In all cases, please send in a support request to support.nesi.org.nz with your BlueFern username. If you've never logged in to our systems before then it's possible there is a problem with the password that we have emailed you. If you have logged in before, there may be issues such as being over disc quota, etc.
- We run a program that blocks access for hosts from which repeated login failures originate. We get tens of thousands of such failures every month on our systems from malicious internet hosts. One side effect of this is that legitimate off-campus hosts may be blocked, if a user on that host has more than a few login failures. The indicator of this effect is that ssh breaks the connection, without you ever getting a chance to enter your password. If you think this is occurring for you or a colleague, please send in a support request to support.nesi.org.nz with the name or IP address of the host you're using and we will re-enable access for this host.
How do I use the BlueGene/P?
How do I use the power755 cluster?
Why won't my job run?
How can I check what is running under my account and kill any runaway processes?
Llkill is a script to search the nodes on our HPC and visualization clusters. It presents a list of processes that you own and gives you the option of killing them all, or selectively. See the wiki page for more information or run "llkill --help".
To check what processes are running under your account, use the ps command. To list every process you own on the machine you are logged in to, run ps -U $USER and this will display something like:
Suppose you decide to stop the "mysim.exe" process listed above. It has the process ID (PID) 545060, so run kill 545060 to send a TERMINATE signal (SIGTERM) to that process. Alternatively, kill -INT 545060 sends an INTERRUPT signal, which has the same effect as typing control-C when running the process interactively. kill -INT -1 interrupts every process you are permitted to signal, which for an ordinary user means all of your own processes on that machine rather than every process on it. This is sometimes a handy "scattergun" approach, but use it with care because it is not limited to a single job.
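The list-then-kill sequence above can be tried safely on a throwaway process. The following is a minimal sketch, not a site-supplied script: sleep 300 is a hypothetical stand-in for mysim.exe, and $(id -un) is used in place of $USER in case that variable is unset.

```shell
#!/bin/sh
# Start a harmless stand-in for a runaway job (hypothetical; replaces mysim.exe).
sleep 300 &
pid=$!

# List processes owned by the current user; the stand-in appears with its PID.
ps -U "$(id -un)" -o pid,comm

# Ask the process to exit with SIGTERM, the default signal sent by kill.
kill "$pid"
# Reap the background process so its PID is no longer in use.
wait "$pid" 2>/dev/null

# kill -0 sends no signal; it only tests whether the PID still exists.
if kill -0 "$pid" 2>/dev/null; then
    result="still running"
else
    result="gone"
fi
echo "process $pid is $result"
```

SIGTERM is used here rather than -INT because a background process started from a non-interactive script usually ignores the INTERRUPT signal.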
The situation gets more messy if you have to check on all of our compute nodes (use llstatus to get a list of nodes). You can script the commands like this:
To signal a process on a remote machine, use ssh to run the kill command there, e.g. ssh p1n07-c kill -INT 545060 sends an INTERRUPT signal to PID 545060 on p1n07-c.
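Checking every compute node can be scripted as a loop over ssh. The following is a rough sketch under stated assumptions: the node names in NODES are placeholders to be replaced with the names reported by llstatus, check_nodes is a hypothetical helper name, and the SSH variable exists only so the loop can be dry-run without contacting any node.

```shell
#!/bin/sh
# Hypothetical node list; substitute the node names reported by llstatus.
NODES="p1n07-c p1n10"
# Command used to reach each node; override it (e.g. SSH=echo) for a dry run.
SSH="${SSH:-ssh}"

check_nodes() {
    for node in $NODES; do
        echo "== $node =="
        # Run ps on the remote node for the current user.
        $SSH "$node" ps -U "$(id -un)" -o pid,etime,comm
    done
}
```

Call check_nodes to print your processes node by node, then use the ssh form of kill on any PID that needs stopping.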
Can I check resources on all available machines and possibly specify one manually?
You can use llstatus to check the load on machines (it's the LdAvg column) and add the clause
to specify that this job has to run on, for example: p1n10. However, such a requirement is not advisable because you may well have an extra-long wait before a particular machine becomes available.
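In a Loadleveler job command file, a node restriction is normally expressed with the requirements keyword. The fragment below is a sketch under that assumption, using p1n10 from the example above; confirm the exact machine name against the llstatus output before relying on it.

```shell
# Assumed form of the clause: restrict this job to the machine named p1n10.
# @ requirements = (Machine == "p1n10")
```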
My POE jobs fail with permission errors
See the POE page
Loadleveler is asking for my "group": What group am I?
To submit a job via the scheduling system Loadleveler, you will need to specify the group you belong to. Loadleveler recognizes only 4 groups: NZ, NZ_merit, UC, or UC_merit. "Merit" means that funding has been obtained for a supercomputing resource, and jobs associated with a "merit" group will be scheduled to run at a higher priority than jobs in other groups. If you are unsure which group or groups you belong to, run the following command on one of the login nodes.
To specify your group in a Loadleveler script, just add:
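As a hedged sketch, a group is normally set in a Loadleveler job command file with the group keyword. UC is used here only as a placeholder taken from the list of groups above; replace it with the group you actually belong to.

```shell
# Assumed directive: run this job under the UC group (placeholder value).
# @ group = UC
```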