Friday, January 4th 2013 11:25:01 PM PST
This page contains a brief summary of access and job submission information for the Triton Resource.
Note: The Triton Resource is now running in full production. The Triton Compute Cluster (TCC) and Petascale Data Analysis Facility (PDAF) are available to anyone with a TAPP account as of October 5, 2009.
The trial account phase of Triton Resource will be discontinued after January 31, 2013. Please refer to the Research Cyberinfrastructure website for information regarding possible startup opportunities for research computing at UCSD and SDSC.
TAPP, the Triton Affiliates and Partners Program, is the prescribed way to manage your access.
Triton staff maintain a Discussion List to which all Triton users are encouraged to subscribe. Members can post questions and comments to the Triton Discussion List (triton-discuss@sdsc.edu) to get help with issues and to share community feedback.
To log in to the UCSD Triton Resource, use the following hostname:
triton-login.sdsc.edu
Following are examples of Secure Shell (ssh) commands that may be used to log in to the Triton Resource:
ssh <your_username>@triton-login.sdsc.edu
ssh -l <your_username> triton-login.sdsc.edu
More information about Secure Shell may be found in the First-time Login guide. SDSC security policy may be found at the SDSC Security site.
Triton uses the TORQUE Resource Manager (also known by its historical name Portable Batch System, or PBS) with the Maui Cluster Scheduler to define and manage job queues. TORQUE allows the user to submit one or more jobs for execution, using parameters specified in a job script.
Queue Name   Limits
------------------------------------------
Standard queues (available without special permission):
batch        max walltime = 120 hours
             default walltime = 18 hours
             max user queuable = 150
large        max walltime = 72 hours
             default walltime = 18 hours
             max user queuable = 50
             max user run = 10
             default memory = 126GB
small        max walltime = 120 hours
             default walltime = 18 hours
             max user queuable = 300
             max user run = 160
             default nodes = 1
             max nodect = 10
express      max walltime = 2 hours
             default walltime = 2 hours
             max user queuable = 2
             max user run = 1
big          max walltime = 12 hours
             max user queuable = 5
pdaf         max walltime = 500 hours
             default walltime = 18 hours
             max memory = 256gb
             max user run = 15
             max user queuable = 50
pdafm        max walltime = 500 hours
             min memory = 257gb
             max user run = 15
             max user queuable = 50
Special queues (permission by Access Control List only):
long         max walltime = 336 hours
             default walltime = 18 hours
             max user queuable = 15
             max user run = 15
XXL          max walltime = 336 hours
             max user run = 1
For details about special permission to run longer jobs, see the Jobs section of the FAQ page.
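A requested walltime can be checked against these limits before submitting. The following is a minimal sketch only: the function names queue_max_walltime_hours and walltime_fits are hypothetical helpers, and the hour values are copied by hand from the queue table, so they would need updating whenever the limits change.

```shell
# Maximum walltime in hours per queue, copied from the queue table above.
# Prints nothing and returns 1 for an unknown queue name.
queue_max_walltime_hours() {
    case "$1" in
        batch|small) echo 120 ;;
        large)       echo 72 ;;
        express)     echo 2 ;;
        big)         echo 12 ;;
        pdaf|pdafm)  echo 500 ;;
        long|XXL)    echo 336 ;;
        *)           return 1 ;;
    esac
}

# walltime_fits QUEUE HOURS
# Exit status 0 if HOURS is within the queue limit, 1 if it is too long,
# 2 if the queue name is not in the table.
walltime_fits() {
    queue=$1
    hours=$2
    max=$(queue_max_walltime_hours "$queue") || return 2
    [ "$hours" -le "$max" ]
}

walltime_fits large 48 && echo "48 hours fits in the large queue"
walltime_fits express 3 || echo "3 hours is too long for the express queue"
```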
In general, jobs will be charged for all cores on a node regardless of how many cores the job actually uses. Only the small queue is excepted from this.
Queue batch is available for all batch jobs. The queue limit is a 72 hour walltime. This queue specifies the TCC nodes.
Queue small is defined for single-CPU and other small jobs that can run on fewer than the full set of processing cores of one node. Jobs will only be charged for the number of cores actually used, but may be required to share the node with other small jobs. Since all of a node's memory is available to any of its cores, there may be contention between jobs running simultaneously on the shared node.
For a suggestion on how to use multiple processors on a shared node, see the Jobs section of the FAQ page.
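The charging rule for whole nodes versus the small queue can be illustrated with a little arithmetic. This is a sketch under stated assumptions: charged_cores is a hypothetical helper, the cores-per-node value is passed in by the caller because it is not given on this page (8 is used below purely for illustration), and the result is a core count only, not a charge in service units.

```shell
# charged_cores QUEUE NODES PPN CORES_PER_NODE
# Prints the number of cores a job would be charged for:
#   small queue  -> only the cores actually requested (NODES * PPN)
#   other queues -> every core on each node (NODES * CORES_PER_NODE)
charged_cores() {
    queue=$1
    nodes=$2
    ppn=$3
    cores_per_node=$4
    if [ "$queue" = "small" ]; then
        echo $(( $nodes * $ppn ))
    else
        echo $(( $nodes * $cores_per_node ))
    fi
}

# Illustrative values only: 8 cores per node is an assumption.
charged_cores batch 2 4 8
charged_cores small 1 4 8
```

With the illustrative value of 8 cores per node, the first call prints 16 (two full nodes) and the second prints 4 (only the cores requested).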
Queue express is defined to improve wait times for interactive jobs. This queue is only available between 8 a.m. and 8 p.m. Monday-Friday. The nodes return to the batch and large queues during other times of day. Also, users are only allowed to have one running job at a time in this queue, and a maximum of two total jobs in the queue.
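A script can test whether the express window is open before choosing a queue. The sketch below assumes the window is judged in the local time of the machine running the script, and that 8 p.m. itself counts as closed; express_open is a hypothetical helper name.

```shell
# express_open DAY HOUR
# DAY is 1-7 with Monday = 1 (as printed by `date +%u`); HOUR is 0-23.
# Exit status 0 when the time falls on Monday-Friday, 8:00 to 19:59.
express_open() {
    day=$1
    hour=$2
    [ "$day" -ge 1 ] && [ "$day" -le 5 ] && [ "$hour" -ge 8 ] && [ "$hour" -lt 20 ]
}

# Check the current local time; sed strips a leading zero from the hour.
if express_open "$(date +%u)" "$(date +%H | sed 's/^0//')"; then
    echo "express queue window is open"
else
    echo "express queue window is closed"
fi
```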
Queue large specifies the PDAF and PDAFM nodes. When this is requested without a memory size, the default of 126 gigabytes will be allocated. Job requests that specify 126GB or 252GB may be run in the larger PDAFM nodes, which have a (currently 2x, normally 4x) premium charge. To avoid this, specify nodes=1:ppn=32 with the lower memory size.
Although configured with 256GB and 512GB respectively, the PDAF and PDAFM nodes practically have slightly less than this available. The limits are 252GB on PDAF and 504GB on PDAFM due to system overhead.
Memory requests must be in 126GB increments, i.e. 126GB, 252GB, 378GB, 504GB. For example,
#PBS -l mem=126gb
To run jobs on the PDAFM (512GB) nodes, specify queue large and request either 378GB or 504GB memory. Do not use pmem (memory per processor) for large jobs.
To specify whether a job is charged at the PDAF (currently 1x, normally 2x) or PDAFM (currently 2x, normally 4x) rate, specify the memory attribute and submit to the large queue. For example,
#PBS -q large
#PBS -l nodes=1:ppn=32
and either
#PBS -l mem=252gb
or
#PBS -l mem=504gb
A job that specifically requests PDAF may actually get scheduled on a PDAFM node, but it will be charged at the lower PDAF rate.
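The 126GB increment rule for the large queue can be applied mechanically. The following sketch rounds a memory need up to the next allowed value and reports which charge class that request falls under, based on the thresholds described above (up to 252GB for PDAF, 378GB or 504GB for PDAFM). The function names round_large_mem and charge_class_for_mem are hypothetical.

```shell
# round_large_mem GB
# Rounds a memory need in GB up to the next 126 GB increment (126..504).
round_large_mem() {
    need=$1
    if [ "$need" -lt 1 ] || [ "$need" -gt 504 ]; then
        echo "memory must be between 1 and 504 GB" >&2
        return 1
    fi
    echo $(( ( ($need + 125) / 126 ) * 126 ))
}

# charge_class_for_mem GB
# Prints PDAF for requests of 252 GB or less, PDAFM otherwise.
charge_class_for_mem() {
    if [ "$1" -le 252 ]; then
        echo PDAF
    else
        echo PDAFM
    fi
}

# Example: a job that needs about 300 GB.
mem=$(round_large_mem 300)
echo "#PBS -q large"
echo "#PBS -l nodes=1:ppn=32"
echo "#PBS -l mem=${mem}gb"
echo "charged at the $(charge_class_for_mem "$mem") rate"
```

For a 300GB need, the sketch prints a mem=378gb directive and reports the PDAFM rate.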
Users may submit as many as 150 batch queue jobs. On the large queue, no more than ten jobs can be run simultaneously and no more than 50 may be queued at one time.
To reduce email load on the mailservers, please specify an email address in your TORQUE script. For example,
#!/bin/bash
#PBS -l walltime=00:20:00
#PBS -M <your_username@ucsd.edu>
#PBS -m mail_options
or using the command line:
qsub -m mail_options -M <your_username@ucsd.edu>
These mail_options are available:
n no mail
a mail is sent when the job is aborted by the batch system.
b mail is sent when the job begins execution.
e mail is sent when the job terminates.
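A wrapper script that builds job scripts may want to check a mail_options string before writing it out. This sketch assumes only what the list above states: n stands alone, and a, b and e may be combined. The helper name valid_mail_options is hypothetical.

```shell
# valid_mail_options STRING
# Exit status 0 if STRING is exactly "n", or a non-empty combination of
# the letters a, b and e; exit status 1 otherwise.
valid_mail_options() {
    case "$1" in
        n)           return 0 ;;
        ""|*[!abe]*) return 1 ;;
        *)           return 0 ;;
    esac
}

opts=abe
if valid_mail_options "$opts"; then
    echo "#PBS -m $opts"
else
    echo "invalid mail options: $opts" >&2
fi
```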
For a more detailed discussion about the charging algorithm, and to learn more about accounting and the Triton queuing system, please read the Job Charging Examples on the Policies page and the Accounting section of the FAQ page.
Submit a script to TORQUE:
qsub <batch_script>
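qsub prints the identifier of the newly submitted job on standard output, so a script can capture it for later use with qstat or qdel. The sketch below shows the capture in comments (it needs a real cluster and job script to run) and adds a small hypothetical helper, job_number, that extracts the numeric part of an identifier such as 90.triton-46. The file name my_job.pbs is a placeholder.

```shell
# job_number FULL_ID
# Prints the numeric part of a job ID such as 90.triton-46.
job_number() {
    echo "${1%%.*}"
}

# On the cluster the ID can be captured at submission time, for example:
#   jobid=$(qsub my_job.pbs)
#   qstat -f "$jobid"
# Here a sample ID is used so that the snippet runs anywhere.
jobid="90.triton-46"
echo "submitted job number $(job_number "$jobid")"
```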
The following is an example of a TORQUE batch script for running an MPI job. The script lines are discussed in the comments that follow.
#!/bin/csh
#PBS -q <queue name>
#PBS -N <job name>
#PBS -l nodes=10:ppn=2
#PBS -l walltime=0:50:00
#PBS -o <output file>
#PBS -e <error file>
#PBS -V
#PBS -M <email address list>
#PBS -m abe
#PBS -A <account name>
cd /phase1/<user name>
mpirun -v -machinefile $PBS_NODEFILE -np 20 <./mpi.out>
Comments for the above script:
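#PBS -q <queue name>
Submit the job to the named queue.
#PBS -N <job name>
Give the job a name.
#PBS -l nodes=10:ppn=2
Request 10 nodes with 2 processors per node (20 processors in total).
#PBS -l walltime=0:50:00
Request a maximum wall-clock time of 50 minutes.
#PBS -o <output file>
Write the job's standard output to the named file.
#PBS -e <error file>
Write the job's standard error to the named file.
#PBS -V
Export the environment variables of the submitting shell to the job.
#PBS -M <email address list>
Send mail about the job to the listed addresses.
#PBS -m abe
Send mail when the job is aborted (a), begins (b), and ends (e).
#PBS -A <account name>
Charge the job to the named account. To ensure the correct account is charged, it is recommended that the -A option always be used.
cd /phase1/<user name>
Change to the working directory /phase1/<user name>.
mpirun -v -machinefile $PBS_NODEFILE -np 20 <./mpi.out>
Run as a parallel job, in verbose output mode, using 20 processes, on the nodes specified by the list contained in the file referenced by $PBS_NODEFILE, and send the output to file mpi.out in the current working directory.
| Command | Description |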
|---|---|
| qstat -a | Display the status of batch jobs |
| qdel <pbs_jobid> | Delete (cancel) a queued job |
| qstat -r | Show all running jobs on system |
| qstat -f <pbs_jobid> | Show detailed information of the specified job |
| qstat -q | Show all queues on system |
| qstat -Q | Show queue limits for all queues |
| qstat -B | Show summary information about the server |
| pbsnodes -a | Show node status |
*View the qstat manpage for more options.
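The MPI batch script earlier on this page hard-codes -np 20, which matches nodes=10:ppn=2. A process count can instead be derived from the node file so that it follows the resource request. This sketch assumes that TORQUE writes one line per allocated processor slot to the file named by $PBS_NODEFILE; count_slots is a hypothetical helper, and a stand-in node file is created so that the snippet runs outside a job.

```shell
# count_slots FILE
# Prints the number of non-empty lines in a node file.
count_slots() {
    grep -c . "$1"
}

# Stand-in node file for nodes=2:ppn=2; inside a real job, the file named
# by $PBS_NODEFILE would be used instead of this temporary file.
demo_nodefile=$(mktemp)
printf 'node1\nnode1\nnode2\nnode2\n' > "$demo_nodefile"

NP=$(count_slots "$demo_nodefile")
# Only print the command here; a real job script would run it.
echo "mpirun -v -machinefile \$PBS_NODEFILE -np $NP ./mpi.out"

rm -f "$demo_nodefile"
```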
The following is an example of a TORQUE command for running an interactive job.
qsub -I -l nodes=10:ppn=2 -l walltime=0:50:00
The standard input, output, and error streams of the job are connected through qsub to the terminal session in which qsub is running.
Users can monitor batch queues using these commands:
qstat
The command output shows the job IDs and queues, for example:
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
90.triton-46              PBStest          hocks                  0 R batch
91.triton-46              PBStest          hocks                  0 Q batch
92.triton-46              PBStest          hocks                  0 Q batch
showq
This command shows the jobs running, queued and blocked:
active jobs------------------------
JOBID USERNAME STATE PROCS REMAINING STARTTIME
94 hocks Running 8 00:09:53 Fri Apr 3 13:40:43
1 active job 8 of 16 processors in use by local jobs (50.00%)
8 of 8 nodes active (100.00%)
eligible jobs----------------------
JOBID USERNAME STATE PROCS WCLIMIT QUEUETIME
95 hocks Idle 8 00:10:00 Fri Apr 3 13:40:04
96 hocks Idle 8 00:10:00 Fri Apr 3 13:40:05
2 eligible jobs
blocked jobs-----------------------
JOBID USERNAME STATE PROCS WCLIMIT QUEUETIME
0 blocked jobs
Total jobs: 3
showbf
This command gives information on available time slots:
Partition     Tasks  Nodes      Duration   StartOffset       StartDate
---------     -----  -----  ------------  ------------  --------------
ALL               8      8      INFINITY      00:00:00  13:45:30_04/03
Users who are trying to choose parameters that allow their jobs to run more quickly may find this a convenient way to find open nodes and time slots.
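The default qstat listing shown above can be summarized with a short filter. This is a sketch that assumes the six-column layout in that example, with the job state in the fifth column and two header lines; job names containing spaces would break it. The helper name count_states is hypothetical.

```shell
# count_states
# Reads default `qstat` output on standard input and prints one line per
# job state (for example R = running, Q = queued) with the number of jobs.
count_states() {
    awk 'NR > 2 && NF >= 6 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# Sample input taken from the qstat example on this page.
count_states <<'EOF'
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
90.triton-46              PBStest          hocks                  0 R batch
91.triton-46              PBStest          hocks                  0 Q batch
92.triton-46              PBStest          hocks                  0 Q batch
EOF
```

On the cluster the same filter could be fed directly, as in `qstat | count_states`.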
Open a Ticket with Triton Resource Support using the Support Ticket Form.
Join the Discussion Forum: sign up for our Email Discussion List.
FAQ: read the FAQ Page.
