UCSD Triton Resource @ SDSC

Quick Status

Triton Resource Node Status

Saturday, November 21st 2009 11:09:01 AM PST


TCC Rack 3 Nodes Down (1)

tcc-3-71.local

Total TCC Nodes Up: 247

Total 256GB (PDAF) Nodes Up: 20

Total 512GB (PDAFM) Nodes Up: 8

Rack 2 Up Count: 80

Rack 3 Up Count: 77

Rack 4 Up Count: 11

Rack 5 Up Count: 79

Quick Start Guide to Running Jobs on Triton

Basic Steps of Running Jobs

This page contains a brief summary of access and job submission info for the Triton Resource. Here you can find basic information on:

Running Jobs on Triton

Note: The Triton Resource is now available to users in full production mode. Configuration and testing of Triton is complete. The Triton Compute Cluster (TCC) and Petascale Data Analysis Facility (PDAF) are using TAPP accounts to charge users for compute time as of Monday, October 5, 2009.

Early Adopter accounts have been converted to trial accounts and provisioned with 1000 complimentary SUs.

TAPP, the Triton Affiliates and Partners Program, is the prescribed way to manage your access.

Triton staff maintain a Discussion List to which all Triton users are encouraged to subscribe. Members can post questions and comments to the Triton Discussion List (triton-discuss@sdsc.edu) to get help with issues and to share feedback with the community.

  1. System Access - Logging In

    To log in to the UCSD Triton Resource, use the following hostname:

    • triton-login.sdsc.edu

    The following Secure Shell (ssh) commands may be used to log in to the Triton Resource:

    • ssh <your_username>@triton-login.sdsc.edu
    • ssh -l <your_username> triton-login.sdsc.edu
    More information about Secure Shell may be found in the First-time Login guide. SDSC security policy may be found at the SDSC Security site.
  2. Running Jobs
    1. Running Jobs with TORQUE

      Triton uses the TORQUE Resource Manager (also known by its historical name Portable Batch System, or PBS) with the Moab Workload Manager to define and manage job queues. TORQUE allows the user to submit one or more jobs for execution, using parameters specified in a job script.

    2. Job Queue Basics

         Queue Name   Limits
      ------------------------------------------
         batch        max wallclock = 72 hours
                      default wallclock = 18 hours
                      max user queuable = 50
      
         large        max wallclock = 72 hours
                      default wallclock = 18 hours
                      max user queuable = 50
                      max user run = 5
                      default memory = 126GB
      
         small        max wallclock = 72 hours
                      default wallclock = 18 hours
                      max user queuable = 50
                      default nodes = 1
                      max nodect = 10
      
         express      max wallclock = 2 hours
                      default wallclock = 2 hours
                      max user queuable = 2
                      max user run = 1
      
      

      In general, jobs are charged for all cores on a node regardless of how many cores the job actually uses. The small queue is the only exception.

      Queue batch is available for all batch jobs, with a maximum wallclock time of 72 hours. Jobs in this queue run on the TCC nodes.

      Queue small is defined for single-CPU and other small jobs that can run on fewer than the full set of processing cores of one node. Jobs will only be charged for the number of cores actually used, but may be required to share the node with other small jobs. Since all of a node's memory is available to any of its cores, there may be contention between jobs running simultaneously on the shared node.
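
The whole-node rule above can be illustrated with a short shell sketch. It only counts the cores a job would be billed for and is not the Triton charging algorithm itself, which is described on the Policies page. The function name `charged_cores` and its arguments are hypothetical, and the number of cores per node is passed in by the caller because it depends on the node type.

```shell
# charged_cores QUEUE NODES CORES_USED_PER_NODE CORES_PER_NODE
# Prints the number of cores a job would be charged for under the rule
# described above. All names here are illustrative, not Triton tools.
charged_cores() {
    queue=$1; nodes=$2; used=$3; per_node=$4
    if [ "$queue" = "small" ]; then
        # small queue: charged only for the cores actually used
        echo $(( nodes * used ))
    else
        # every other queue: charged for every core on each node
        echo $(( nodes * per_node ))
    fi
}

# Example, assuming nodes with 8 cores and a job using 2 cores on 1 node
charged_cores batch 1 2 8   # prints 8
charged_cores small 1 2 8   # prints 2
```

Under these assumptions, the same two-core job is billed for four times as many cores in the batch queue as in the small queue.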

      For a suggestion on how to use multiple processors on a shared node, see the Jobs section of the FAQ page.

      Queue express is defined to improve wait times for interactive jobs. This queue is only available between 8 a.m. and 8 p.m. Monday-Friday. The nodes return to the batch and large queues during other times of day. Also, users are only allowed to have one running job at a time in this queue, and a maximum of two total jobs in the queue.
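
A script that chooses a queue automatically could test the express window before submitting. The sketch below is one way to do that; the function name `express_open` is hypothetical, and it assumes the window runs from 08:00 (inclusive) to 20:00 (exclusive) and that the machine running the check uses the same time zone as the cluster.

```shell
# express_open DAY HOUR
# DAY is the ISO weekday (1 = Monday ... 7 = Sunday), HOUR is 0-23.
# Succeeds if the express queue window (Mon-Fri, 8 a.m. to 8 p.m.) is open.
express_open() {
    day=$1; hour=$2
    [ "$day" -ge 1 ] && [ "$day" -le 5 ] &&
        [ "$hour" -ge 8 ] && [ "$hour" -lt 20 ]
}

# Check the current local time; strip a leading zero from the hour
now_hour=$(date +%H)
now_hour=${now_hour#0}
if express_open "$(date +%u)" "$now_hour"; then
    echo "express queue window is open"
else
    echo "express queue window is closed; consider another queue"
fi
```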

      Queue large specifies the PDAF and PDAFM nodes. When this queue is requested without a memory size, the default of 128 gigabytes is allocated. Job requests that specify 128GB or 256GB may run on the larger PDAFM nodes, which carry a 4x premium charge. To avoid this, specify nodes=1:ppn=32 with the lower memory size.

      When 128GB is requested, the job will actually have only about 126GB due to system overhead. Likewise, if 512GB is requested, only about 504GB is available.

      Memory requests must be in 128GB increments, i.e. 128GB, 256GB, 384GB, or 512GB.

      To run jobs on the PDAFM (512GB) nodes, specify queue large and request either 384GB or 512GB memory. Do not use pmem (memory per processor) for large jobs.
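
The increment rule and the node selection described above can be sketched as a small helper that rounds a memory need up to the next 128GB step and reports which node type that request targets. The function name `large_mem_request` is hypothetical. The helper prints only a size and a node type; how the size is written in the actual resource request is not shown here.

```shell
# large_mem_request GB
# Rounds a memory need (whole gigabytes) up to the next 128GB increment
# and prints "<GB> <node type>". Requests above 512GB are rejected.
large_mem_request() {
    need=$1
    if [ "$need" -lt 1 ] || [ "$need" -gt 512 ]; then
        echo "memory request must be between 1 and 512 GB" >&2
        return 1
    fi
    rounded=$(( (need + 127) / 128 * 128 ))
    if [ "$rounded" -ge 384 ]; then
        # 384GB and 512GB requests run on the PDAFM (512GB) nodes
        echo "$rounded PDAFM"
    else
        # 128GB and 256GB fit PDAF nodes but may also be placed on PDAFM
        echo "$rounded PDAF"
    fi
}

large_mem_request 200   # prints "256 PDAF"
large_mem_request 300   # prints "384 PDAFM"
```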

      To control whether a job runs on the PDAF (256GB) or PDAFM (512GB) nodes, request the corresponding memory feature and submit the job to the large queue.

      Users may queue as many as 50 batch jobs. In the large queue, a user can run no more than five jobs simultaneously.

      To reduce email load on the mail servers, please specify an email address in your TORQUE script or on the command line:

      qsub -m mail_options -M <your_username@ucsd.edu>

      These mail_options are available:

          n no mail
          a mail is sent when the job is aborted by the batch system.
          b mail is sent when the job begins execution.
          e mail is sent when the job terminates.
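
Inside a job script, the same two options are written as `#PBS` directive lines. The sketch below generates those lines and rejects option strings that qsub would not accept: the value is either `n` alone or any combination of `a`, `b` and `e`. The function name `mail_directives` is hypothetical.

```shell
# mail_directives ADDRESS OPTIONS
# Prints the two #PBS lines for a job script. OPTIONS is either "n" or
# any combination of the letters a, b and e (for example "ae").
mail_directives() {
    addr=$1; opts=$2
    case $opts in
        n) ;;                       # no mail
        ""|*[!abe]*)                # empty, or contains another letter
            echo "invalid mail options: $opts" >&2
            return 1 ;;
    esac
    printf '#PBS -M %s\n#PBS -m %s\n' "$addr" "$opts"
}

# Mail on abort and on termination
mail_directives "<your_username@ucsd.edu>" ae
```

The two printed lines can be pasted near the top of a batch script, alongside the other `#PBS` directives.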
      

      For a more detailed discussion about the charging algorithm, and to learn more about accounting and the Triton queuing system, please read the Job Charging Examples on the Policies page and the Accounting section of the FAQ page.

    3. Submitting a Job

      Submit a script to TORQUE:

      qsub <batch_script>

      A TORQUE batch script for running an MPI job consists of #PBS directives that describe the request (queue, nodes, wallclock time), followed by the commands to run.
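
A minimal sketch of such a script is shown below, written to a file with a here-document so it can be inspected before submission. The file name `mpi_job.pbs`, the job name, the program `./my_mpi_program`, the `ppn=8` core count and the `mpirun` flags are all assumptions for illustration; check them against the MPI installation and node type you are using.

```shell
# Write a sample job script to mpi_job.pbs; submit it with: qsub mpi_job.pbs
cat > mpi_job.pbs <<'EOF'
#!/bin/bash
#PBS -q batch
#PBS -N mpi_test
#PBS -l nodes=2:ppn=8
#PBS -l walltime=00:30:00
#PBS -o mpi_test.out
#PBS -e mpi_test.err
#PBS -V

# Start in the directory the job was submitted from
cd "$PBS_O_WORKDIR"

# Launch one MPI rank per requested core (2 nodes x 8 cores = 16);
# the mpirun flags and ./my_mpi_program are assumptions
mpirun -np 16 -machinefile "$PBS_NODEFILE" ./my_mpi_program
EOF
```

The directives select the batch queue, name the job, request two nodes for 30 minutes, and redirect standard output and error. `PBS_O_WORKDIR` and `PBS_NODEFILE` are set by TORQUE when the job starts.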

      TORQUE Commands

      Command                Description
      ------------------------------------------------------------------
      qstat -a               Display the status of batch jobs
      qdel <pbs_jobid>       Delete (cancel) a queued job
      qstat -r               Show all running jobs on system
      qstat -f <pbs_jobid>   Show detailed information of the specified job
      qstat -q               Show all queues on system
      qstat -Q               Show queue limits for all queues
      qstat -B               Show quick information of the server
      pbsnodes -a            Show node status

      * See the qstat man page for more options.

    4. Monitoring Batch Queues

    Users can monitor batch queues using these commands:

    • qstat

      The command output shows the job IDs and queues, for example:

      Job id                    Name             User            Time Use S Queue
      ------------------------- ---------------- --------------- -------- - -----
      90.triton-46              PBStest          hocks                  0 R batch
      91.triton-46              PBStest          hocks                  0 Q batch
      92.triton-46              PBStest          hocks                  0 Q batch
      

    • showq

      This command shows the jobs running, queued and blocked:

      active jobs------------------------
      JOBID              USERNAME      STATE PROCS   REMAINING            STARTTIME
      94                    hocks    Running     8    00:09:53  Fri Apr  3 13:40:43
      1 active job               8 of 16 processors in use by local jobs (50.00%)
                                  8 of 8 nodes active      (100.00%)
      
      eligible jobs----------------------
      JOBID              USERNAME      STATE PROCS     WCLIMIT              QUEUETIME
      95                    hocks       Idle     8    00:10:00  Fri Apr  3  13:40:04
      96                    hocks       Idle     8    00:10:00  Fri Apr  3  13:40:05
      2 eligible jobs
      
      blocked jobs-----------------------
      JOBID              USERNAME      STATE PROCS     WCLIMIT             QUEUETIME
      0 blocked jobs
      Total jobs:  3
      

    • showbf

      This command gives information on available time slots:

      Partition     Tasks  Nodes      Duration   StartOffset       StartDate
      ---------     -----  -----  ------------  ------------  --------------
      ALL               8      8      INFINITY      00:00:00  13:45:30_04/03
      

      Users who are trying to choose parameters that allow their jobs to run more quickly may find this a convenient way to find open nodes and time slots.
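
When many jobs are queued, the qstat listing is easier to digest as a count per state. The sketch below counts jobs in a given state from default-format qstat output, using the sample listing from the qstat example above as input. The function name `count_jobs_in_state` is hypothetical, and the parsing assumes the two header lines and the column order shown in that sample.

```shell
# count_jobs_in_state STATE
# Reads default-format qstat output on standard input and prints how
# many jobs are in STATE (R = running, Q = queued).
count_jobs_in_state() {
    # Skip the two header lines; the state is the second-to-last column
    awk -v s="$1" 'NR > 2 && NF >= 2 && $(NF-1) == s { n++ } END { print n + 0 }'
}

# Sample taken from the qstat output shown earlier on this page
sample='Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
90.triton-46              PBStest          hocks                  0 R batch
91.triton-46              PBStest          hocks                  0 Q batch
92.triton-46              PBStest          hocks                  0 Q batch'

printf '%s\n' "$sample" | count_jobs_in_state Q   # prints 2
```

On the login node the same function can read live data with `qstat | count_jobs_in_state Q`.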
