How do I submit a job?
The cluster consists of 20 space-sharing nodes. This means that jobs are assigned to the processors according to the amount of space they need and the space that is currently available. The alternative is time-sharing nodes, in which jobs are allocated to the nodes and load balanced according to a specified load on each machine.
1. SERIAL JOBS:
i. Running
Serial jobs have to be started via a PBS script. Assuming that the program called myprogram is in the directory ~/mycode, this is what the script will look like:
#PBS -l nodes=1:ppn=1
#PBS -N myjob15
#PBS -j oe
cd ~/mycode
./myprogram
-N myjob15 specifies that the name of the job will be myjob15. -l nodes=1:ppn=1 specifies that the job will use 1 node with 1 processor per node. -j oe merges the job's standard error into its standard output file.
ii. Debugging
To spare the master node from CPU- and I/O-intensive applications, debug your code on one of the compute nodes. This can be done by starting an interactive PBS session with the qsub command: qsub -I. For example, to start an interactive PBS session for a small job using 1 node and 1 processor, use qsub -q small -I -l nodes=1:ppn=1. PBS will give you a shell on one of the compute nodes without using rsh or rlogin, thus saving resources. Note that PBS allocates and reserves the node for you. Check the qsub man page for more information: man qsub
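Section 1 shows the serial script but not the submission step itself. A minimal sketch of that step follows, assuming the script is saved under the hypothetical file name serial.pbs; the guard around qsub is only there so the snippet does nothing harmful on a machine without PBS.

```shell
#!/bin/sh
# Save the serial job script from Section 1 to a file.
# The file name serial.pbs is an assumption; any name works.
cat > serial.pbs <<'EOF'
#PBS -l nodes=1:ppn=1
#PBS -N myjob15
#PBS -j oe
cd ~/mycode
./myprogram
EOF

# Hand the script to PBS; qsub prints the job identifier on success.
if command -v qsub >/dev/null 2>&1; then
    qsub serial.pbs
else
    echo "qsub not found: run this on the cluster" >&2
fi
```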
2. PARALLEL JOBS:
Parallel jobs are run and debugged through PBS as described in Section 1. However, parallel jobs using MPI should be run using mpiexec. Mpiexec uses the task manager library of PBS to spawn copies of the executable on the nodes in a PBS allocation. Mpiexec helps clean up the slave MPI processes when the job is aborted on the master node. Another benefit is that resources used by the spawned processes are accounted for correctly with mpiexec and reported in the PBS logs. To use mpiexec, place something like this in your PBS script:
#PBS -l nodes=20:ppn=2
#PBS -N myjob15
#PBS -j oe
cd ~/mycode
mpiexec [-n cpu] [-comm=type] program
where cpu is _optional_ and should be inherited from PBS -l nodes=, but can be useful for debugging purposes. type should match the type of program: 'p4' or 'none'. type states whether the program is an MPI program or not.
3. SAMPLE PARALLEL SCRIPT FOR BAYWULF
#!/bin/sh
# Request the amount of memory you need (Maximum 1024 Mb)
# Set the name of the job
# Request the cpu time (Maximum 24 hr)
# Choose the number or type of nodes:
# This makes sure to output the logs into the directory from which
# you submit your job.
cd $PBS_O_WORKDIR
echo "This job was submitted by user: $PBS_O_LOGNAME"
# Execute the parallel code using mpiexec...also supplies date
# before and after running code
echo This job started at `date`
where ./a.out is your executable compiled using mpicc, mpif77, or mpif90.
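The sample script in Section 3 describes its resource requests only in comments. The following is a sketch of what a complete script along those lines might look like. It is not the original Baywulf script: every #PBS value is an assumed placeholder chosen within the stated limits, count_slots is a hypothetical helper, and the fallbacks for unset PBS variables exist only so the script can be tried outside a PBS job.

```shell
#!/bin/sh
# Illustrative only: every #PBS value below is an assumed placeholder.
#PBS -l mem=200mb
#PBS -N myjob15
#PBS -l cput=01:00:00
#PBS -l nodes=4:ppn=2
#PBS -j oe

# count_slots: print the number of processor slots PBS allocated.
# PBS writes one line per slot to the file named by $PBS_NODEFILE;
# outside a PBS job the variable is unset, so fall back to 1.
count_slots() {
    if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
        wc -l < "$PBS_NODEFILE" | tr -d ' '
    else
        echo 1
    fi
}

# Run from the directory the job was submitted from
# (falls back to the current directory outside PBS).
cd "${PBS_O_WORKDIR:-.}" || exit 1
echo "This job was submitted by user: ${PBS_O_LOGNAME:-unknown}"

NP=$(count_slots)
echo "This job started at $(date) on $NP processor slot(s)"

# Launch only when both the executable and mpiexec are present.
if [ -x ./a.out ] && command -v mpiexec >/dev/null 2>&1; then
    mpiexec -n "$NP" ./a.out
else
    echo "skipping launch: ./a.out or mpiexec not found"
fi

echo "This job ended at $(date)"
```

Passing -n explicitly is redundant when mpiexec inherits the count from PBS, but it makes it easy to substitute a smaller number while debugging.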
How much memory can I use and why is my job being placed in the queue even though the machine I'm trying to run on seems free?
Each node has 1 Gb of total RAM, and you are free to use all of it. However, a bug in Linux memory management prevents unused memory from being freed. Therefore, if you specify the amount of memory your job needs in your PBS script with #PBS -l mem=200mb, your job may be placed in the queue even though no other jobs are running on that machine. The solution, albeit an inelegant one, is to request only 10mb in your script even though your job needs more. PBS will run your job even if it is bigger than 10 mb.
How do I find out how big my job is?
Use the size command on your executable. For example, size a.out results in
text data bss dec hex filename
which shows that your executable uses 19952660 bytes of statically allocated RAM. If you are using F90 or some other compiler which allows dynamic allocation, you need to determine how much space you are going to need either by looking at your code or by checking how much your job uses while it is running with the top command.
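To compare the dec figure with the 1 Gb available per node, the size output can be piped through a small filter. This is a sketch that assumes the layout shown above (one header line, then one line of values with dec in the fourth column); static_mb is a hypothetical helper name.

```shell
#!/bin/sh
# static_mb: read the output of `size` on stdin and print the statically
# allocated size (the dec column) in megabytes, rounded up.
# Assumes one header line followed by one line of values, dec in column 4.
static_mb() {
    awk 'NR == 2 { printf "%d\n", int(($4 + 1048575) / 1048576) }'
}

# Example use, only when the executable and the size command exist.
if [ -f a.out ] && command -v size >/dev/null 2>&1; then
    size a.out | static_mb
fi
```

For the 19952660-byte example this prints 20. Dynamically allocated memory is not included, so the figure is a lower bound.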
How can I see what's happening on the processors?
You can check the status of your executables on the processors from any machine in the cluster using the qstat command. To check the status of all jobs, including the nodes used, run
qstat -an
To list every node together with its state and attributes, run
pbsnodes -a
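The pbsnodes listing is long on a 20-node cluster, so a quick count of free nodes can be handy. The sketch below assumes that each node's entry contains a line of the form "state = free", which is typical PBS output but is not confirmed for this cluster; count_free is a hypothetical helper name.

```shell
#!/bin/sh
# count_free: read `pbsnodes -a` output on stdin and print how many nodes
# report "state = free". The "state = ..." line format is an assumption
# based on typical PBS output.
count_free() {
    grep -c '^[[:space:]]*state = free[[:space:]]*$'
}

# Only query PBS when pbsnodes is actually installed.
if command -v pbsnodes >/dev/null 2>&1; then
    echo "free nodes: $(pbsnodes -a | count_free)"
fi
```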
Who do I contact in case of problems?
Please send emails to Yi-Ju Chou for questions about the cluster. Response times may vary between 1 minute and 1 week.
Where can I find more information?