Saturday, November 21st 2009 11:09:01 AM PST
This page contains a brief summary of access and job submission information for the Triton Resource.
Note: The Triton Resource is now available to users in full production mode. Configuration and testing of Triton is complete. The Triton Compute Cluster (TCC) and Petascale Data Analysis Facility (PDAF) are using TAPP accounts to charge users for compute time as of Monday, October 5, 2009.
Early Adopter accounts have been converted to trial accounts and provisioned with 1000 complimentary SUs.
TAPP, the Triton Affiliates and Partners Program, is the prescribed way to manage your access.
Triton staff maintain a Discussion List to which all Triton users are encouraged to subscribe. Members can post questions and comments to the Triton Discussion List (triton-discuss@sdsc.edu) to get help with issues and to share feedback with the community.
To log in to the UCSD Triton Resource, use the following hostname:
triton-login.sdsc.edu
The following are examples of Secure Shell (ssh) commands that may be used to log in to the Triton Resource:
ssh <your_username>@triton-login.sdsc.edu
ssh -l <your_username> triton-login.sdsc.edu
More information about Secure Shell may be found in the First-time Login guide. SDSC security policy may be found at the SDSC Security site.
Triton uses the TORQUE Resource Manager (also known by its historical name Portable Batch System, or PBS) with the Moab Workload Manager to define and manage job queues. TORQUE allows the user to submit one or more jobs for execution, using parameters specified in a job script.
| Queue Name | Limit | Value |
|---|---|---|
| batch | max wallclock | 72 hours |
| | default wallclock | 18 hours |
| | max user queuable | 50 |
| large | max wallclock | 72 hours |
| | default wallclock | 18 hours |
| | max user queuable | 50 |
| | max user run | 5 |
| | default memory | 126GB |
| small | max wallclock | 72 hours |
| | default wallclock | 18 hours |
| | max user queuable | 50 |
| | default nodes | 1 |
| | max nodect | 10 |
| express | max wallclock | 2 hours |
| | default wallclock | 2 hours |
| | max user queuable | 2 |
| | max user run | 1 |
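The wallclock limits above can be checked before a job is submitted. The following is a minimal sketch, not part of the Triton tooling: the function names `to_seconds`, `max_walltime_seconds`, and `walltime_ok` are hypothetical, and the limits are copied by hand from the queue limits listed above.

```bash
# Hypothetical helpers: compare a requested walltime (HH:MM:SS) with the
# maximum wallclock listed for each queue (72 hours; 2 hours for express).

# Convert HH:MM:SS to seconds. awk treats the fields as decimal numbers,
# so leading zeros such as "08" are handled safely.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Print the maximum wallclock for a queue, in seconds.
max_walltime_seconds() {
    case "$1" in
        express)            echo $(( 2 * 3600 )) ;;
        batch|large|small)  echo $(( 72 * 3600 )) ;;
        *)                  return 1 ;;
    esac
}

# Succeed if the requested walltime fits within the queue's maximum.
walltime_ok() {
    queue=$1; requested=$2
    limit=$(max_walltime_seconds "$queue") || return 2
    [ "$(to_seconds "$requested")" -le "$limit" ]
}

walltime_ok batch 72:00:00 && echo "batch 72:00:00 fits"
walltime_ok express 02:30:00 || echo "express 02:30:00 exceeds the limit"
```

A check like this only catches requests that the scheduler would reject anyway; it does not predict how long a job will wait in the queue.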
In general, jobs will be charged for all cores on a node regardless of how many cores the job actually uses. Only the small queue is excepted from this.
Queue batch is available for all batch jobs and runs on the TCC nodes. Its wallclock limit is 72 hours.
Queue small is defined for single-CPU and other small jobs that can run on fewer than the full set of processing cores of one node. Jobs will only be charged for the number of cores actually used, but may be required to share the node with other small jobs. Since all of a node's memory is available to any of its cores, there may be contention between jobs running simultaneously on the shared node.
For a suggestion on how to use multiple processors on a shared node, see the Jobs section of the FAQ page.
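The charging rule described above (whole nodes for most queues, only the cores used for the small queue) can be sketched as a small shell function. This is an illustration under stated assumptions: `charged_cores` is a hypothetical name, and the number of cores per node is passed in as a parameter because this page does not state it. The value 8 in the examples is a placeholder, not a documented figure.

```bash
# Hypothetical helper: number of cores a job is charged for.
# Arguments: queue name, nodes requested, ppn requested, cores per node.
# The cores-per-node value is an assumption supplied by the caller.
charged_cores() {
    queue=$1; nodes=$2; ppn=$3; cores_per_node=$4
    if [ "$queue" = "small" ]; then
        # small queue: charged only for the cores actually requested
        echo $(( nodes * ppn ))
    else
        # all other queues: charged for every core on each node
        echo $(( nodes * cores_per_node ))
    fi
}

charged_cores batch 2 1 8   # prints 16
charged_cores small 1 1 8   # prints 1
```

The actual SU calculation is described in the Job Charging Examples on the Policies page; this sketch covers only the core count.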
Queue express is defined to improve wait times for interactive jobs. This queue is only available between 8 a.m. and 8 p.m. Monday-Friday. The nodes return to the batch and large queues during other times of day. Also, users are only allowed to have one running job at a time in this queue, and a maximum of two total jobs in the queue.
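A script that chooses between the express and batch queues can test the availability window first. The sketch below assumes that the window is evaluated in the local time of the machine running the script, and that 8 p.m. itself is outside the window; neither detail is stated on this page. The function name `express_open` is hypothetical.

```bash
# Hypothetical helper: succeed if the express queue window is open.
# Arguments: day of week (1 = Monday ... 7 = Sunday) and hour (0-23).
# Assumption: the window is 08:00 up to, but not including, 20:00.
express_open() {
    dow=$1; hour=$2
    [ "$dow" -ge 1 ] && [ "$dow" -le 5 ] && [ "$hour" -ge 8 ] && [ "$hour" -lt 20 ]
}

# Check the current local time.
now_dow=$(date +%u)
now_hour=$(date +%H)
now_hour=${now_hour#0}   # strip a leading zero, e.g. "08" becomes "8"

if express_open "$now_dow" "$now_hour"; then
    echo "express queue window is open"
else
    echo "express queue window is closed; use batch instead"
fi
```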
Queue large specifies the PDAF and PDAFM nodes. When this is requested without a memory size, the default of 128 gigabytes will be allocated. Job requests that specify 128GB or 256GB may be run in the larger PDAFM nodes, which have a 4x premium charge. To avoid this, specify nodes=1:ppn=32 with the lower memory size.
When 128GB is requested, the job will actually have only about 126GB due to system overhead. Likewise, if 512GB is requested, only about 504GB is available.
Memory requests must be in 128GB increments, i.e. 128GB, 256GB, 384GB, 512GB. For example,
#PBS -l mem=128GB
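When the memory a job needs is not already a multiple of 128GB, the request has to be rounded up to the next increment. A minimal sketch of that arithmetic follows; `round_mem_gb` is a hypothetical helper, and the 512GB upper bound is taken from the largest increment listed above.

```bash
# Hypothetical helper: round a memory need (in GB) up to the next
# 128GB increment, the granularity required for the large queue.
round_mem_gb() {
    want=$1
    if [ "$want" -lt 1 ] || [ "$want" -gt 512 ]; then
        echo "memory must be between 1 and 512 GB" >&2
        return 1
    fi
    # Integer division rounds down, so add 127 first to round up.
    echo $(( (want + 127) / 128 * 128 ))
}

echo "#PBS -l mem=$(round_mem_gb 300)GB"   # prints "#PBS -l mem=384GB"
```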
To run jobs on the PDAFM (512GB) nodes, specify queue large and request either 384GB or 512GB memory. Do not use pmem (memory per processor) for large jobs.
To specify whether a job runs on the PDAF (256GB) or PDAFM (512GB) nodes, specify the memory feature and submit to the large queue:
#PBS -l nodes=1:mem256gb
or
#PBS -l nodes=1:mem512gb
Users may submit as many as 50 batch jobs. On the large queue no more than five jobs can be run simultaneously.
To reduce email load on the mailservers, please specify an email address in your TORQUE script. For example,
#!/bin/bash
#PBS -l walltime=00:20:00
#PBS -M <your_username@ucsd.edu>
#PBS -m mail_options
or using the command line:
qsub -m mail_options -M <your_username@ucsd.edu> <batch_script>
These mail_options are available:
n no mail
a mail is sent when the job is aborted by the batch system.
b mail is sent when the job begins execution.
e mail is sent when the job terminates.
For a more detailed discussion about the charging algorithm, and to learn more about accounting and the Triton queuing system, please read the Job Charging Examples on the Policies page and the Accounting section of the FAQ page.
Submit a script to TORQUE:
qsub <batch_script>
The following is an example of a TORQUE batch script for running an MPI job. The line numbers refer to the comments that follow and are not part of the script.
1 #!/bin/csh
2 #PBS -q <batch>
3 #PBS -N <my_job>
4 #PBS -l nodes=10:ppn=2
5 #PBS -l walltime=0:50:00
6 #PBS -o <file.out>
7 #PBS -e <file.err>
8 #PBS -V
9 #PBS -M <username@ucsd.edu>
10 #PBS -m abe
11 #PBS -A <your-account-number>
12 cd /lustre/<username>
13 mpirun -v -machinefile $PBS_NODEFILE -np 20 <./mpi.out>
Comments for the above script:
- Line 2: use the queue named batch
- Line 3: set the job name to my_job
- Line 4: request 10 nodes and 2 processors per node
- Line 5: reserve the requested nodes for 50 minutes
- Line 6: send standard output to file.out
- Line 7: send standard error to file.err
- Line 8: export all of the user's environment variables to the job
- Line 9: list of users to whom email is sent (comma-separated)
- Line 10: set of conditions under which the execution server will send email about the job: (a)bort, (b)egin, (e)nd
- Line 11: account to be charged for running the job; optional if the user has only one account; if more than one account is available and this line is omitted, the job will be charged to the default account
- Line 12: change to the working directory /lustre/<username>
- Line 13: run the MPI executable mpi.out as a parallel job on 20 processes (10 nodes with 2 processors each), using the node list in $PBS_NODEFILE
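In the example script, the `-np` value (20) is the product of the requested nodes (10) and processors per node (2). To keep these in step when the request changes, the script can be generated rather than edited by hand. The following is a sketch only: `make_mpi_job` is a hypothetical function, it emits a reduced set of the directives shown above, and the working directory and program name are supplied by the caller.

```bash
# Hypothetical helper: print a TORQUE script for an MPI job, computing
# the mpirun process count from the node and ppn request.
# Arguments: nodes, ppn, walltime, working directory, program.
make_mpi_job() {
    nodes=$1; ppn=$2; walltime=$3; workdir=$4; program=$5
    np=$(( nodes * ppn ))
    # \$PBS_NODEFILE is escaped so it is expanded when the job runs,
    # not when this function generates the script.
    cat <<EOF
#!/bin/csh
#PBS -q batch
#PBS -l nodes=${nodes}:ppn=${ppn}
#PBS -l walltime=${walltime}
#PBS -V
cd ${workdir}
mpirun -v -machinefile \$PBS_NODEFILE -np ${np} ${program}
EOF
}

# Reproduce the request from the example: 10 nodes, 2 processors each.
make_mpi_job 10 2 0:50:00 '/lustre/<username>' ./mpi.out
```

The output can be redirected to a file and passed to qsub once the placeholder directory has been replaced with a real path.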
| Command | Description |
|---|---|
| qstat -a | Display the status of batch jobs |
| qdel <pbs_jobid> | Delete (cancel) a queued job |
| qstat -r | Show all running jobs on system |
| qstat -f <pbs_jobid> | Show detailed information of the specified job |
| qstat -q | Show all queues on system |
| qstat -Q | Show queue limits for all queues |
| qstat -B | Show summary information about the server |
| pbsnodes -a | Show node status |
*View the qstat manpage for more options.
Users can monitor batch queues using these commands:
qstat
The command output shows the job IDs and queues, for example:
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
90.triton-46              PBStest          hocks                  0 R batch
91.triton-46              PBStest          hocks                  0 Q batch
92.triton-46              PBStest          hocks                  0 Q batch
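The state column (S) in this output can be summarized with a short filter. The sketch below counts jobs per state; `count_job_states` is a hypothetical name, and the filter assumes the default qstat layout shown above, in which the state is the second-to-last field and the first two lines are headers.

```bash
# Hypothetical helper: count jobs per state from default qstat output
# read on standard input. Skips the two header lines.
count_job_states() {
    awk 'NR > 2 && NF >= 6 { n[$(NF-1)]++ }
         END { for (s in n) print s, n[s] }' | sort
}

# Demonstration using the sample output from this page.
count_job_states <<'EOF'
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
90.triton-46              PBStest          hocks                  0 R batch
91.triton-46              PBStest          hocks                  0 Q batch
92.triton-46              PBStest          hocks                  0 Q batch
EOF
```

For the sample above this prints `Q 2` and `R 1`. On the cluster, the same filter would be used as `qstat | count_job_states`.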
showq
This command shows the jobs that are running, queued, and blocked:
active jobs------------------------
JOBID USERNAME STATE PROCS REMAINING STARTTIME
94 hocks Running 8 00:09:53 Fri Apr 3 13:40:43
1 active job 8 of 16 processors in use by local jobs (50.00%)
8 of 8 nodes active (100.00%)
eligible jobs----------------------
JOBID USERNAME STATE PROCS WCLIMIT QUEUETIME
95 hocks Idle 8 00:10:00 Fri Apr 3 13:40:04
96 hocks Idle 8 00:10:00 Fri Apr 3 13:40:05
2 eligible jobs
blocked jobs-----------------------
JOBID USERNAME STATE PROCS WCLIMIT QUEUETIME
0 blocked jobs
Total jobs: 3
showbf
This command gives information on available time slots:
Partition     Tasks  Nodes      Duration   StartOffset       StartDate
---------     -----  -----  ------------  ------------  --------------
ALL               8      8      INFINITY      00:00:00  13:45:30_04/03
Users who are trying to choose parameters that allow their jobs to run more quickly may find this a convenient way to find open nodes and time slots.
Open a Ticket with Triton Resource Support using the Support Ticket Form.
Join the Discussion Forum: sign up for our Email Discussion List.
FAQ: read the FAQ Page.
