UCSD Triton Resource @ SDSC

Quick Status

Triton Resource Node Status

Friday, January 4th 2013 11:25:01 PM PST


Total TCC Nodes Up: 243

Total 256GB (PDAF) Nodes Up: 20

Total 512GB (PDAFM) Nodes Up: 8

Rack 2 Up Count: 80

Rack 3 Up Count: 79

Rack 4 Up Count: 5

Rack 5 Up Count: 79


Quick Start Guide to Running Jobs on Triton

Basic Steps of Running Jobs

This page contains a brief summary of access and job submission info for the Triton Resource. Here you can find basic information on:

Running Jobs on Triton

Note: The Triton Resource is now running in full production. The Triton Compute Cluster (TCC) and Petascale Data Analysis Facility (PDAF) are available to anyone with a TAPP account as of October 5, 2009.

The trial account phase of Triton Resource will be discontinued after January 31, 2013. Please refer to the Research Cyberinfrastructure website for information regarding possible startup opportunities for research computing at UCSD and SDSC.

TAPP, the Triton Affiliates and Partners Program, is the prescribed way to manage your access.

Triton staff maintain a Discussion List to which all Triton users are encouraged to subscribe. Members can post questions and comments to the Triton Discussion List (triton-discuss@sdsc.edu) to get help with issues and to share feedback with the community.

  1. System Access - Logging In
    To log in to the UCSD Triton Resource, use the following hostname:

    • triton-login.sdsc.edu

    The following are examples of Secure Shell (ssh) commands that may be used to log in to the Triton Resource:

    • ssh <your_username>@triton-login.sdsc.edu
    • ssh -l <your_username> triton-login.sdsc.edu
    More information about Secure Shell may be found in the First-time Login guide. SDSC security policy may be found at the SDSC Security site.
  2. Running Jobs
    1. Running Jobs with TORQUE
      Triton uses the TORQUE Resource Manager (also known by its historical name Portable Batch System, or PBS) with the Maui Cluster Scheduler to define and manage job queues. TORQUE allows the user to submit one or more jobs for execution, using parameters specified in a job script.

    2. Job Queue Basics
         Queue Name   Limits
      ------------------------
         Standard queues (available without special permission):
      
         batch        max walltime = 120 hours
                      default walltime = 18 hours
                      max user queuable = 150
      
         large        max walltime = 72 hours
                      default walltime = 18 hours
                      max user queuable = 50
                      max user run = 10
                      default memory = 126GB
      
         small        max walltime = 120 hours
                      default walltime = 18 hours
                      max user queuable = 300
                      max user run = 160
                      default nodes = 1
                      max nodect = 10
      
         express      max walltime = 2 hours
                      default walltime = 2 hours
                      max user queuable = 2
                      max user run = 1
      
         big          max walltime = 12 hours
                      max user queuable = 5
      
         pdaf         max walltime = 500 hours
                      default walltime = 18 hours
                      max memory = 256gb
                      max user run = 15
                      max user queuable = 50
      
         pdafm        max walltime = 500 hours
                      min memory = 257gb
                      max user run = 15
                      max user queuable = 50
      
         Special queues (permission by Access Control List only):
      
         long         max walltime = 336 hours
                      default walltime = 18 hours
                      max user queuable = 15
                      max user run = 15
      
         XXL          max walltime = 336 hours
                      max user run = 1
      
      

      For details about special permission to run longer jobs, see the Jobs section of the FAQ page.

      In general, jobs are charged for all cores on a node regardless of how many cores the job actually uses. The small queue is the only exception.

      Queue batch is available for all batch jobs. The queue limit is a 72 hour walltime. This queue specifies the TCC nodes.

      Queue small is defined for single-CPU and other small jobs that can run on fewer than the full set of processing cores of one node. Jobs will only be charged for the number of cores actually used, but may be required to share the node with other small jobs. Since all of a node's memory is available to any of its cores, there may be contention between jobs running simultaneously on the shared node.
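
      The charging rule described above can be sketched as a small shell function. This only illustrates the rule as stated on this page, not the accounting system's actual algorithm. The function name charged_cores and its arguments are made up for the example, and the value of 8 cores per node in the usage lines is a placeholder.

```shell
# Hypothetical helper: number of cores a job is charged for.
# Usage: charged_cores <queue> <nodes> <cores_used_per_node> <cores_per_node>
charged_cores() (
    queue=$1; nodes=$2; used=$3; per_node=$4
    if [ "$queue" = "small" ]; then
        # small queue: charged only for the cores actually used
        echo $(( nodes * used ))
    else
        # every other queue: charged for all cores on each node
        echo $(( nodes * per_node ))
    fi
)

charged_cores batch 2 2 8   # two nodes, 2 cores used on each: prints 16
charged_cores small 1 2 8   # one shared node, 2 cores used: prints 2
```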

      For a suggestion on how to use multiple processors on a shared node, see the Jobs section of the FAQ page.

      Queue express is defined to improve wait times for interactive jobs. This queue is only available between 8 a.m. and 8 p.m. Monday-Friday. The nodes return to the batch and large queues during other times of day. Also, users are only allowed to have one running job at a time in this queue, and a maximum of two total jobs in the queue.
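
      A submission wrapper could check the express queue window before choosing a queue. The sketch below is an illustration only: the function name express_open is hypothetical, it uses the local clock of the machine it runs on, and it assumes the window closes exactly at 8 p.m.

```shell
# Hypothetical helper: succeed if the express queue window is open.
# Usage: express_open <day_of_week 1-7, Monday=1> <hour 0-23>
express_open() (
    dow=$1
    hour=${2#0}        # drop a leading zero such as "08"
    hour=${hour:-0}    # "0" became empty above; restore it
    # Monday-Friday, from 8 a.m. up to (assumed: not including) 8 p.m.
    [ "$dow" -ge 1 ] && [ "$dow" -le 5 ] && [ "$hour" -ge 8 ] && [ "$hour" -lt 20 ]
)

# date +%u prints 1-7 (Monday=1) and date +%H prints 00-23.
if express_open "$(date +%u)" "$(date +%H)"; then
    echo "express queue window is open"
else
    echo "express queue window is closed"
fi
```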

      Queue large selects the PDAF and PDAFM nodes. When this queue is requested without a memory size, the default of 126GB is allocated. Job requests that specify 126GB or 252GB may run on the larger PDAFM nodes, which carry a premium charge (currently 2x, normally 4x). To avoid this, specify nodes=1:ppn=32 with the lower memory size.

      Although configured with 256GB and 512GB respectively, the PDAF and PDAFM nodes have slightly less memory available in practice: system overhead limits jobs to 252GB on PDAF and 504GB on PDAFM.

      Memory requests must be in 126GB increments, i.e. 126GB, 252GB, 378GB, or 504GB.
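
      A memory need of arbitrary size can be rounded up to the next allowed increment before it is passed to qsub. The helper below is a minimal sketch: the name round_mem_gb is hypothetical, and it assumes that the memory attribute on this page is the standard TORQUE mem resource (as in -l mem=252gb).

```shell
# Hypothetical helper: round a memory request (in GB) up to the next
# multiple of 126GB and print it with the "gb" suffix.
round_mem_gb() (
    want=$1
    if [ "$want" -lt 1 ] || [ "$want" -gt 504 ]; then
        echo "memory request must be between 1 and 504 GB" >&2
        exit 1
    fi
    echo "$(( (want + 125) / 126 * 126 ))gb"
)

round_mem_gb 200   # prints 252gb
round_mem_gb 300   # prints 378gb
```

      The result could then be used as, for instance, qsub -q large -l mem=$(round_mem_gb 200) with the rest of the usual options.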

      To run jobs on the PDAFM (512GB) nodes, specify queue large and request either 378GB or 504GB memory. Do not use pmem (memory per processor) for large jobs.

      To control whether a job is charged at the PDAF (currently 1x, normally 2x) or PDAFM (currently 2x, normally 4x) rate, specify the memory attribute and submit to the large queue.

      A job that specifically requests PDAF may actually get scheduled on a PDAFM node, but it will be charged at the lower PDAF rate.
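
      The mapping from memory request to charging rate can be sketched as follows. This is an illustration of the rules on this page only: the function name large_rate and the script name job.pbs are placeholders, and -l mem= is assumed to be the memory attribute meant here. The qsub line is printed rather than executed.

```shell
# Hypothetical helper: which rate a large-queue memory request (in GB) maps to.
# 126GB and 252GB are charged at the PDAF rate, 378GB and 504GB at the PDAFM rate.
large_rate() (
    case $1 in
        126|252) echo PDAF ;;
        378|504) echo PDAFM ;;
        *) echo "memory must be 126, 252, 378 or 504 (GB)" >&2; exit 1 ;;
    esac
)

# Print a matching qsub command line for a PDAF-rate job.
mem=252
echo "qsub -q large -l nodes=1:ppn=32 -l mem=${mem}gb job.pbs"
echo "charged at the $(large_rate "$mem") rate"
```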

      Users may submit as many as 150 batch queue jobs. On the large queue, no more than ten jobs can be run simultaneously and no more than 50 may be queued at one time.

      To reduce email load on the mailservers, please specify an email address in your TORQUE script with the #PBS -M directive, or on the command line:

      qsub -m mail_options -M <your_username@ucsd.edu>

      These mail_options are available:

          n no mail
          a mail is sent when the job is aborted by the batch system.
          b mail is sent when the job begins execution.
          e mail is sent when the job terminates.
      

      For a more detailed discussion about the charging algorithm, and to learn more about accounting and the Triton queuing system, please read the Job Charging Examples on the Policies page and the Accounting section of the FAQ page.

    3. Submitting with a Job Script
      Submit a script to TORQUE:

      qsub <batch_script>

      The following is an example of a TORQUE batch script for running an MPI job. The script lines are discussed in the comments that follow.
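
      A sketch of such a script, assembled from the directives described in the comments that follow. The queue, job name and output file names are example values, the bracketed items are placeholders to fill in, and the executable name (./my_mpi_program) and the output redirection on the mpirun line are assumptions. The script is written to a file named mpi_job.pbs (also a made-up name) with a here-document:

```shell
# Write an example job script. The quoted 'EOF' keeps $PBS_NODEFILE
# and the bracketed placeholders from being interpreted by the shell.
cat > mpi_job.pbs <<'EOF'
#!/bin/bash
#PBS -q batch
#PBS -N mpi_test
#PBS -l nodes=10:ppn=2
#PBS -l walltime=0:50:00
#PBS -o mpi_test.out
#PBS -e mpi_test.err
#PBS -V
#PBS -M <email address list>
#PBS -m abe
#PBS -A <account name>
cd /phase1/<user name>
mpirun -v -machinefile $PBS_NODEFILE -np 20 ./my_mpi_program > mpi.out
EOF

# Show the directives that were written.
grep '^#PBS' mpi_job.pbs
```

      After the placeholders are replaced with real values, the file can be submitted with qsub mpi_job.pbs. The 20 processes on the mpirun line match the 10 nodes with 2 processors each requested in the nodes directive.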

      Comments for the above script:

      • #PBS -q <queue name>
        Specify queue to which job is being submitted, one of:
        • batch
        • small
        • express
        • large
        • big
      • #PBS -N <job name>
        Specify name of job
      • #PBS -l nodes=10:ppn=2
        Request 10 nodes and 2 processors per node.
      • #PBS -l walltime=0:50:00
        Reserve the requested nodes for 50 minutes
      • #PBS -o <output file>
        Redirect standard output to a file
      • #PBS -e <error file>
        Redirect standard error to a file
      • #PBS -V
        Export all my environment variables to the job
      • #PBS -M <email address list>
        Comma-separated list of users to whom email is sent
      • #PBS -m abe
        Set of conditions under which the execution server will send email about the job: (a)bort, (b)egin, (e)nd
      • #PBS -A <account name>
        Specify account to be charged for running the job; optional if user has only one account. If more than one account is available and this line is omitted, job will be charged to default account.

        To ensure the correct account is charged, it is recommended that the -A option always be used.

      • cd /phase1/<user name>
        Change to user's working directory in the Lustre filesystem
      • mpirun -v -machinefile $PBS_NODEFILE -np 20 <./mpi.out>
        Run as a parallel job, in verbose output mode, using 20 processes, on the nodes specified by the list contained in the file referenced by $PBS_NODEFILE, and send the output to file mpi.out in the current working directory
      TORQUE Commands

      Command                Description
      ---------------------  -----------------------------------------------
      qstat -a               Display the status of batch jobs
      qdel <pbs_jobid>       Delete (cancel) a queued job
      qstat -r               Show all running jobs on the system
      qstat -f <pbs_jobid>   Show detailed information about the specified job
      qstat -q               Show all queues on the system
      qstat -Q               Show queue limits for all queues
      qstat -B               Show summary information about the server
      pbsnodes -a            Show node status

      *View the qstat manpage for more options.

    4. Submitting an Interactive Job
      The following is an example of a TORQUE command for running an interactive job.

      qsub -I -l nodes=10:ppn=2 -l walltime=0:50:00

      The standard input, output, and error streams of the job are connected through qsub to the terminal session in which qsub is running.

    5. Monitoring Batch Queues

    Users can monitor batch queues using these commands:

    • qstat

      The command output shows the job IDs and queues, for example:

      Job id                    Name             User            Time Use S Queue
      ------------------------- ---------------- --------------- -------- - -----
      90.triton-46              PBStest          hocks                  0 R batch
      91.triton-46              PBStest          hocks                  0 Q batch
      92.triton-46              PBStest          hocks                  0 Q batch
      

    • showq

      This command shows the jobs running, queued and blocked:

      active jobs------------------------
      JOBID              USERNAME      STATE PROCS   REMAINING            STARTTIME
      94                    hocks    Running     8    00:09:53  Fri Apr  3 13:40:43
      1 active job               8 of 16 processors in use by local jobs (50.00%)
                                  8 of 8 nodes active      (100.00%)
      
      eligible jobs----------------------
      JOBID              USERNAME      STATE PROCS     WCLIMIT              QUEUETIME
      95                    hocks       Idle     8    00:10:00  Fri Apr  3  13:40:04
      96                    hocks       Idle     8    00:10:00  Fri Apr  3  13:40:05
      2 eligible jobs
      
      blocked jobs-----------------------
      JOBID              USERNAME      STATE PROCS     WCLIMIT             QUEUETIME
      0 blocked jobs
      Total jobs:  3
      

    • showbf

      This command gives information on available time slots:

      Partition     Tasks  Nodes      Duration   StartOffset       StartDate
      ---------     -----  -----  ------------  ------------  --------------
      ALL               8      8      INFINITY      00:00:00  13:45:30_04/03
      

      Users who are trying to choose parameters that allow their jobs to run more quickly may find this a convenient way to find open nodes and time slots.
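
      The default qstat output shown above is easy to summarise with standard tools. The following is a minimal sketch, not part of TORQUE: the function name count_states is hypothetical, and it assumes the default qstat layout, in which the fifth column holds the job state.

```shell
# Hypothetical helper: count jobs per state in default qstat output.
# Reads qstat-style text on stdin and skips the two header lines.
count_states() {
    awk 'NR > 2 && NF >= 6 { n[$5]++ }
         END { for (s in n) print s, n[s] }' | sort
}

# Sample input copied from the qstat output shown above.
count_states <<'EOF'
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
90.triton-46              PBStest          hocks                  0 R batch
91.triton-46              PBStest          hocks                  0 Q batch
92.triton-46              PBStest          hocks                  0 Q batch
EOF
```

      For the sample above this reports two queued jobs (Q 2) and one running job (R 1). On the cluster the same function could be fed directly with qstat | count_states.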

Contact Us

Open a Ticket with Triton Resource Support using the Support Ticket Form.

Join the Discussion Forum Sign up for our Email Discussion List.
