UCSD Triton Resource @ SDSC

Quick Status

Triton Resource Node Status

Thursday, September 29th 2011 02:29:56 PM PDT


PDAFM Nodes Down (8)

pdafm-6-4.local

pdafm-6-5.local

pdafm-6-6.local

pdafm-6-7.local

pdafm-7-4.local

pdafm-7-5.local

pdafm-7-6.local

pdafm-7-7.local

PDAF Nodes Down (20)

pdaf-6-0.local

pdaf-6-1.local

pdaf-6-2.local

pdaf-6-3.local

pdaf-7-0.local

pdaf-7-1.local

pdaf-7-2.local

pdaf-7-3.local

pdaf-8-0.local

pdaf-8-1.local

pdaf-8-2.local

pdaf-8-3.local

pdaf-8-4.local

pdaf-8-5.local

pdaf-8-6.local

pdaf-8-7.local

pdaf-9-0.local

pdaf-9-1.local

pdaf-9-2.local

pdaf-9-3.local

TCC Rack 2 Nodes Down (80)

tcc-2-0.local

tcc-2-1.local

tcc-2-2.local

tcc-2-3.local

tcc-2-4.local

tcc-2-5.local

tcc-2-6.local

tcc-2-7.local

tcc-2-8.local

tcc-2-9.local

tcc-2-10.local

tcc-2-11.local

tcc-2-12.local

tcc-2-13.local

tcc-2-14.local

tcc-2-15.local

tcc-2-16.local

tcc-2-17.local

tcc-2-18.local

tcc-2-19.local

tcc-2-20.local

tcc-2-21.local

tcc-2-22.local

tcc-2-23.local

tcc-2-24.local

tcc-2-25.local

tcc-2-26.local

tcc-2-27.local

tcc-2-28.local

tcc-2-29.local

tcc-2-30.local

tcc-2-31.local

tcc-2-32.local

tcc-2-33.local

tcc-2-34.local

tcc-2-35.local

tcc-2-36.local

tcc-2-37.local

tcc-2-38.local

tcc-2-39.local

tcc-2-40.local

tcc-2-41.local

tcc-2-42.local

tcc-2-43.local

tcc-2-44.local

tcc-2-45.local

tcc-2-46.local

tcc-2-47.local

tcc-2-48.local

tcc-2-49.local

tcc-2-50.local

tcc-2-51.local

tcc-2-52.local

tcc-2-53.local

tcc-2-54.local

tcc-2-55.local

tcc-2-56.local

tcc-2-57.local

tcc-2-58.local

tcc-2-59.local

tcc-2-60.local

tcc-2-61.local

tcc-2-62.local

tcc-2-63.local

tcc-2-64.local

tcc-2-65.local

tcc-2-66.local

tcc-2-67.local

tcc-2-68.local

tcc-2-69.local

tcc-2-70.local

tcc-2-71.local

tcc-2-72.local

tcc-2-73.local

tcc-2-74.local

tcc-2-75.local

tcc-2-76.local

tcc-2-77.local

tcc-2-78.local

tcc-2-79.local

TCC Rack 3 Nodes Down (79)

tcc-3-0.local

tcc-3-1.local

tcc-3-2.local

tcc-3-3.local

tcc-3-4.local

tcc-3-5.local

tcc-3-6.local

tcc-3-7.local

tcc-3-8.local

tcc-3-9.local

tcc-3-10.local

tcc-3-11.local

tcc-3-12.local

tcc-3-13.local

tcc-3-14.local

tcc-3-15.local

tcc-3-16.local

tcc-3-17.local

tcc-3-18.local

tcc-3-19.local

tcc-3-20.local

tcc-3-21.local

tcc-3-22.local

tcc-3-23.local

tcc-3-24.local

tcc-3-25.local

tcc-3-26.local

tcc-3-27.local

tcc-3-28.local

tcc-3-29.local

tcc-3-30.local

tcc-3-31.local

tcc-3-32.local

tcc-3-33.local

tcc-3-34.local

tcc-3-35.local

tcc-3-36.local

tcc-3-37.local

tcc-3-38.local

tcc-3-39.local

tcc-3-40.local

tcc-3-41.local

tcc-3-42.local

tcc-3-43.local

tcc-3-44.local

tcc-3-45.local

tcc-3-46.local

tcc-3-47.local

tcc-3-48.local

tcc-3-49.local

tcc-3-50.local

tcc-3-51.local

tcc-3-52.local

tcc-3-53.local

tcc-3-54.local

tcc-3-55.local

tcc-3-56.local

tcc-3-57.local

tcc-3-58.local

tcc-3-59.local

tcc-3-60.local

tcc-3-62.local

tcc-3-63.local

tcc-3-64.local

tcc-3-65.local

tcc-3-66.local

tcc-3-67.local

tcc-3-68.local

tcc-3-69.local

tcc-3-70.local

tcc-3-71.local

tcc-3-72.local

tcc-3-73.local

tcc-3-74.local

tcc-3-75.local

tcc-3-76.local

tcc-3-77.local

tcc-3-78.local

tcc-3-79.local

TCC Rack 4 Nodes Down (15)

tcc-4-0.local

tcc-4-1.local

tcc-4-2.local

tcc-4-3.local

tcc-4-4.local

tcc-4-5.local

tcc-4-6.local

tcc-4-7.local

tcc-4-8.local

tcc-4-9.local

tcc-4-10.local

tcc-4-11.local

tcc-4-12.local

tcc-4-13.local

tcc-4-15.local

TCC Rack 5 Nodes Down (80)

tcc-5-0.local

tcc-5-1.local

tcc-5-2.local

tcc-5-3.local

tcc-5-4.local

tcc-5-5.local

tcc-5-6.local

tcc-5-7.local

tcc-5-8.local

tcc-5-9.local

tcc-5-10.local

tcc-5-11.local

tcc-5-12.local

tcc-5-13.local

tcc-5-14.local

tcc-5-15.local

tcc-5-16.local

tcc-5-17.local

tcc-5-18.local

tcc-5-19.local

tcc-5-20.local

tcc-5-21.local

tcc-5-22.local

tcc-5-23.local

tcc-5-24.local

tcc-5-25.local

tcc-5-26.local

tcc-5-27.local

tcc-5-28.local

tcc-5-29.local

tcc-5-30.local

tcc-5-31.local

tcc-5-32.local

tcc-5-33.local

tcc-5-34.local

tcc-5-35.local

tcc-5-36.local

tcc-5-37.local

tcc-5-38.local

tcc-5-39.local

tcc-5-40.local

tcc-5-41.local

tcc-5-42.local

tcc-5-43.local

tcc-5-44.local

tcc-5-45.local

tcc-5-46.local

tcc-5-47.local

tcc-5-48.local

tcc-5-49.local

tcc-5-50.local

tcc-5-51.local

tcc-5-52.local

tcc-5-53.local

tcc-5-54.local

tcc-5-55.local

tcc-5-56.local

tcc-5-57.local

tcc-5-58.local

tcc-5-59.local

tcc-5-60.local

tcc-5-61.local

tcc-5-62.local

tcc-5-63.local

tcc-5-64.local

tcc-5-65.local

tcc-5-66.local

tcc-5-67.local

tcc-5-68.local

tcc-5-69.local

tcc-5-70.local

tcc-5-71.local

tcc-5-72.local

tcc-5-73.local

tcc-5-74.local

tcc-5-75.local

tcc-5-76.local

tcc-5-77.local

tcc-5-78.local

tcc-5-79.local

Total TCC Nodes Up: 2

Total 256GB (PDAF) Nodes Up: 0

Total 512GB (PDAFM) Nodes Up: 0

Rack 2 Up Count: 0

Rack 3 Up Count: 1

Rack 4 Up Count: 1

Rack 5 Up Count: 0

The Triton Resource provides easily accessible, affordable, high-performance and data-intensive compute resources to UCSD researchers, faculty, affiliates, government and commercial partners through innovative, locally supported, scalable hardware and software over multiple 10-gigabit networks extending from campus laboratories to the UC network, California, and the US.

Message Board

Friday, January 27, 2012

Recent Triton News and Announcements

Triton Storage System Update


Posted 3:00 p.m. Thursday, Sept. 22, 2011: Phase1 Replaces Data Oasis

SDSC has announced a new 850TB Lustre filesystem, the first phase of a large high-performance filesystem that will be available to Triton users early next year. Phase1 is available from Triton now, and currently has 16 storage servers, each with 4 object storage targets (64 OSTs total) and a peak measured bandwidth of 12.5 GB/s.

All current Triton users have an assigned directory in /phase1, which is accessible now. As of today, Sept. 22, 2011, the old filesystem (/oasis) is read-only. All users are requested to move their data from /oasis to /phase1 by Oct. 10, 2011, at which point /oasis will be retired from service. Users are requested to use compute nodes or the alternate Triton login node (triton-38) to move their data between filesystems to avoid overloading a single client machine. Please contact the Triton Discussion List if you need assistance.

The /phase1 filesystem is intended to be high performance scratch storage and is subject to a purge policy. The filesystem is not backed up and should not be used for long-term storage. Users are reminded that any important data must be moved to their own local storage resources.
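
One way to do the move, sketched as a small shell helper built on a tar pipeline like the one suggested for the earlier Mirage migration. The helper name copy_tree is hypothetical, and the example paths in the comment assume that your /oasis and /phase1 directories are both named after your login, which this announcement does not state for /phase1.

```shell
# Hypothetical helper: copy everything under SRC into DST with a tar pipeline.
# Archiving "." (rather than "*") also picks up hidden dot files.
copy_tree() {
    src=$1
    dst=$2
    mkdir -p "$dst" || return 1
    ( cd "$src" && tar cf - . ) | ( cd "$dst" && tar xpf - )
}

# Example, to be run from a compute node or triton-38 (paths are assumptions):
#   copy_tree "/oasis/$USER" "/phase1/$USER"
```

Since /phase1 is scratch space subject to purging, this only moves working data; anything important still needs a copy on your own storage.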

Triton Maintenance Progress Update


Updated 8:00 a.m. Wednesday, Sept. 21, 2011: SDSC Switch Maintenance Completed Successfully

The upgraded switch maintenance was completed at approximately 8:30 p.m. Tuesday, Sept. 20. Triton queues should again be accessible from external locations.

Updated 10:30 a.m. Tuesday, Sept. 20, 2011: Maintenance to Continue This Evening

After a longer-than-expected delay yesterday, the two Lustre systems are finally remounted around Triton and the queues are pushing through jobs again. We will make every effort to keep this evening's outage time to a minimum.

From 6:00-9:00 tonight the network folks will be upgrading the switches that connect SDSC to the outside world. During that time, if you're located somewhere outside SDSC, you're likely to be unable to connect to Triton. The good news is that this work shouldn't affect connectivity within SDSC, so jobs in the queues at 6:00 p.m. should be able to run without difficulty--you just won't be able to see the results until outside connections are restored. If you have jobs that happen to need outside connections (e.g. they rely on an externally-mounted disk), it would be a good idea to put a hold on them until after the work is complete. See http://status.externsdsc.org/ for status updates.

Posted 9:30 a.m. Monday, Sept. 19, 2011: Due to Reopen Around Noon

The Triton batch queues are unavailable today, Monday, September 19th, until around noon for some switch work. As part of this work, both /phase1 and /oasis will be unmounted for the duration.

We will post as soon as /phase1 and /oasis are remounted.

Triton Maintenance Windows Scheduled for Sept. 19 and 20


Posted 12:30 p.m. Friday, Sept. 16, 2011: Reservations will prevent some jobs from being scheduled until after completion of downtime

There is a system-wide reservation in place for Data Oasis maintenance on Monday, Sept. 19 from 9 a.m. to noon. Jobs which specify end times after the start of this maintenance will not be scheduled until after it ends. There is also a reservation for Tuesday, Sept. 20 from 6 p.m. to 10 p.m. with the same effect on scheduling.
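
The scheduling rule described above comes down to simple arithmetic on timestamps. The sketch below illustrates that rule only; it is not how the scheduler itself is implemented, and the function name is hypothetical. All values are epoch seconds.

```shell
# Succeeds (returns 0) if a job that starts at NOW and runs for WALLTIME
# seconds would finish at or before RESERVATION_START.
fits_before_reservation() {
    now=$1
    walltime=$2
    reservation_start=$3
    [ $(( now + walltime )) -le "$reservation_start" ]
}

# Example (the -d option assumes GNU date):
#   start=$(date -d "2011-09-19 09:00" +%s)
#   fits_before_reservation "$(date +%s)" $(( 4 * 3600 )) "$start" \
#       || echo "A 4-hour job would wait until after the maintenance"
```

In practice this means that requesting a shorter walltime can let a job run before the reservation begins.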

Triton Status After Massive San Diego Power Outage

Updated 6:30 p.m. Tuesday, Sept. 13, 2011: Rebooting Both Login Nodes Around 8:00 p.m. PT

We need to reboot/reinstall the login nodes once more to clear some hung processes and fix a problem with the MPI installation. We will start by shutting off new logins to the alternate login node (triton-38). That will probably cycle by 8:00 p.m. After it comes back up, we'll do the same for the primary login. Apologies for the instability; with luck, this will be the last service interruption we'll see in the aftermath of Thursday's outage.

Updated 11:45 p.m. Sunday, Sept. 11, 2011: Primary Login Node Available Again

triton-login is back up; triton-38 was also down. Details will be forthcoming as we learn more about the cause.

Updated 12:00 p.m. Sunday, Sept. 11, 2011: Alternate Login Node in Use Until Further Notice

Our main login machine, triton-login, seems to be down. Please use our alternate login, triton-38.sdsc.edu, until further notice.

Updated 12:00 p.m. Saturday, Sept. 10, 2011: Triton Back Online

Triton has mostly recovered from the faceplant it took during the power outage. As mentioned, we have temporarily lost a file server (but not the data stored on it), and we had to replace the cluster management server, which has some lingering repercussions. Here's what we know is presently *not* working:

  • Changing management hardware invalidated most of our software licenses. Yesterday, we replaced the ones we could. The PGI compilers aren't working at this point; you may run into an application or two that we haven't tested yet.
  • None of the /projects directories is mounting. The Triton-hosted /projects directories are hosted on the failed server. The one piece of data that we seem to have lost completely is the list of partitions from external servers that we were mounting under /projects. If your project had an external mount, please send us the server and mount point, and we'll start rebuilding the list.
  • A handful of the batch nodes (tcc-*) are still down. We have likely lost some local disks and switch ports.

We'll be working on these problems, plus anything else we discover, over the next week. Please use Triton with some caution during this shake-out, and post any problems that you encounter.

Updated 3:10 p.m. Friday, Sept. 9, 2011: Mid-afternoon update

Triton management node (New Hardware) is up and running and is busily building the rest of the nodes. The major issue we see is that one of the Project NFS servers refuses to boot. That particular issue will not be solved until Monday or Tuesday.

The specific projects that are affected are mounted under /projects:

       cgl-group
       frazer-group
       geogrid-aist
       liai-group
       nrnb-group
       ren-lab
       biogem-lab
       camera
       camera-lab
       crbs-group
       gleeson-lab
       lca-group
       mmiller-group
       sarkar-lab
       zhang-lab

If you have data in these directories, we have backups of your data as of approximately 4:00 a.m. on Sept. 8. If you need access to your replicated data, please contact Phil Papadopoulos or Jim Hayes. We can give you some options (read only, read/write, etc.), but we would want to talk to you individually. Our expectation is that the data on the primary Projects server is intact; we simply can't see it until the hardware is addressed.

Home area data (e.g. /home/user) is stored on different servers and is unaffected.

All data partitions on /oasis and /phase1 have been checked and look to be in good shape, too.

We hope to have Triton otherwise restored by the end of the day (but it may go into the weekend). So far, the rebuild is going smoothly.

Posted 1:30 p.m. Friday, Sept. 9, 2011: Damaged hardware will be replaced today

The power outage wreaked havoc with Triton's management node. We're replacing the hardware and rebuilding. We'll let you know when things are back in operation.

User data (held on different systems) are all intact.

System Maintenance Scheduled for June 27, 2011


Updated 11:00 a.m. Wednesday, June 29, 2011: Maintenance completed at 4:00 p.m.

Triton is available again after a longer-than-expected downtime. We've been struggling mostly with getting the updated batch scheduler software to work as desired; it seems like most of the kinks are worked out now.

Feel free to log in and resume working. Please keep a somewhat closer-than-usual eye on your jobs for the first few days, and post to Triton Discuss if you encounter anything that doesn't look right.

We appreciate your patience.

Updated 8:00 a.m. Tuesday, June 28, 2011: Short delay in upgrade procedure

We've run into some difficulties getting Triton to stand up properly after the software upgrade. We're continuing to work on it, and will post a follow-up when the system becomes available again. At this point, we do not expect to have it available until sometime Tuesday. We apologize for the delay.

Posted 2:00 p.m. Thursday, June 23, 2011: OS upgrade and new software will be available

Triton will be offline for a software upgrade next Monday, June 27th. We'll be moving from Rocks v5.3/CentOS v5.4 to Rocks v5.4/CentOS v5.6; additional new and updated applications are listed below. We've blocked out 8:00-5:00 Monday for the upgrade, but the actual downtime should be much shorter. Check this website, the mailing list, or the Triton Twitter feed for a follow-up message when the system is available again.

Please note that, unlike other recent downtimes, jobs in the queue when the system goes down will need to be resubmitted once it comes back up. Please contact the Triton Discuss mailing list if you have any questions.

New applications in /opt:

Application        Version
blat               v34
bowtie             v0.12.7
bwa                v0.5.9
cilk               v8053
cp2k               v2.2.184
cpmd               v3.13.2
fftw               v2.1.5 (in addition to v3.2.1)
fpmpi              v2.1f
GenomeAnalysisTK   v1.0.5336
ipython            v0.10.1
matplotlib         v1.0.1
pyfits             v2.4.0
python             v2.7 + v3.2
pytz               v2006p
samtools           v0.1.13
tecplot            v2011

Updated applications in /opt:

Application   Current Version   New Version
apbs          v1.2.1            v1.3
ddt           v2.4.1            v2.6
gamess        v1.2009           v10.2010
gold          v2.1.10.0         v2.2.0.1
lammps        v28Nov09          v18Feb11
moab          v5.3.7            v6.0.2
mpich2        v1.1.1p1          v1.3.2p1
nagios        v3.0.6            v3.2.2
namd          v2.7b             v2.8
nose          v0.11.1           v1.0.0
numpy         v1.3.0rc2         v1.6.0b1
nwchem        v5.1              v6.0
R             v2.9.2            v2.12.2
scipy         v0.7.1rc3         v0.9.0

Triton Maintenance Complete

Apologies for the extreme delay. Triton is back up with /oasis mounted, the login nodes are reopened, and the queues have started running again. Please post a note to the Discussion List if you run into any problems.

Triton Maintenance Extended to Second Day

Updated 9:30 a.m. Tuesday, May 24: Data Oasis Communications Pending

Update on the 5/23 physical relocation: the move was completed in good time yesterday; however, Data Oasis continues to exhibit communications failures. Triton will remain offline until the problem is resolved. We will post follow-ups as we find out more.

Triton Maintenance to Relocate Data Oasis Hardware

Updated 10:30 a.m. Tuesday, May 10: All-Day Downtime Set for May 23

To make room for some new hardware, the Data Oasis servers and associated switch are going to be moved to SDSC's other machine room on Monday, May 23rd. Between the move, re-racking, and re-cabling, this will be a more involved process than the switch work we did a few weeks ago; we're figuring that we'll likely have close to a full day's downtime. The Triton submission queues will be down during this period, and access to the Triton login nodes may be shut off as well. We'll post updates as more information becomes available.

Triton Switch Maintenance Complete for April 25

Updated 12:30 p.m. Monday, April 25: All Nodes Access Data Oasis

Switch maintenance is complete, and access between Triton and Data Oasis has been restored. The hold on jobs has been removed, so submissions should begin moving through the queues again. Please report any problems to the Triton Discussion List (triton-discuss@sdsc.edu).

Triton Maintenance Scheduled for April 25

Updated 3:15 p.m. Thursday, April 21: 8 a.m. Start for Switch Maintenance

Triton will be down for maintenance from 8 a.m. to 1 p.m. PT on Monday, April 25, 2011. Work will be performed on one of the switches that provides access to Data Oasis. All jobs scheduled to be running during this window will remain queued until after the connection to /oasis is restored.

Currently, there is no plan to shut off access to the Triton login nodes. However, they will probably be rebooted on short notice once the switch work is complete. Access to user home filesystems should not be affected, however most /projects filesystems will be offline during the outage.

Triton Back Up From Maintenance

Posted 10:30 a.m. Saturday, November 27: Lack of Cooperation Dooms Open Policy

The upgrade to file servers supporting the home filesystem was completed at approximately 2:30 p.m. today.

Posted 12:05 p.m. Thursday, February 3: Brief Period of Unavailability Via Login Node

We are performing maintenance on the home filesystem, which has caused Triton's login node to be inaccessible for a short time. We will post a notice on this page, the Discussion List, and Twitter when system access is back online.

Triton in the Classroom


Posted 3:30 p.m. Wednesday, January 19: Students Learning to Supercompute via Triton

Triton has become a teaching resource as well as a research one in its first year on campus. Read the latest press release at the SDSC News Center.

Storage Quotas on Triton Home Filesystem Now in Effect

Posted 10:30 a.m. Saturday, November 27: Lack of Cooperation Dooms Open Policy

Because of recurring issues with users simply consuming disk space without regard to space available, we have had to modify our operative, user-friendly space allocation which allowed users to expand to space needed and then contract after usage. That open policy has failed.

Hard quotas now exist on all home areas. The standard allocation is 100 GB per user. Currently, users who are consuming more than 100 GB have had their quotas set to accommodate active space as of 11/19/2010 so that users can effect cleanup of their home areas.

If you are consuming more than 100 gigabytes of space, you must either

  • reduce your on-disk usage to 100 gigabytes or less, OR

  • contact the TAPP Manager for a space request

At the time the policy change was instituted, there were about 60 users consuming over 100 GB of home area space. If you have not been consuming more than 100 GB of space on a long-term basis, or have been attentively clearing out your overage, we thank you!
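
To see where you stand against the 100 GB allocation, something like the following can be run from a shell. This is a sketch only: the function names are hypothetical, the default limit assumes that "100 GB" means 100 * 1024 * 1024 kilobytes, and the quota system may count space somewhat differently than du does.

```shell
# Print the disk usage of DIR in kilobytes (du -k is POSIX).
usage_kb() {
    du -sk "$1" | awk '{ print $1 }'
}

# Succeeds (returns 0) when DIR uses more than LIMIT_KB kilobytes.
# LIMIT_KB defaults to 100 * 1024 * 1024 KB, an assumed reading of "100 GB".
over_quota() {
    dir=$1
    limit_kb=${2:-104857600}
    [ "$(usage_kb "$dir")" -gt "$limit_kb" ]
}

# Example:
#   over_quota "$HOME" && echo "Home area is above the standard allocation"
```

Running du over a large home area takes a while and puts load on the file server, so it is better done occasionally than in a loop.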

Read-only Mirage Going Offline Nov. 1

Posted 8:30 a.m. Friday, October 29: Last Chance to Get Your Data

Last weekend's changes seem to have finally settled down Mirage. We hope you've had a chance to copy any data you'd like to keep to a new home on Data Oasis.

We plan to permanently disconnect Mirage from Triton this coming Monday, November 1st. Remember that the disks from Mirage will be erased and reused, so anything that hasn't been copied will be unrecoverable.

Read-only Mirage Soon to Go Offline

Posted 9:30 p.m. Tuesday, October 26: All User Data Must Be Copied to Oasis or Lost

Within the next couple days, all Mirage disks will be offlined from Triton. Any user data stored there will no longer be accessible. If you have yet to migrate all your essential data to Oasis, please inform the discussion list immediately so arrangements can be made to preserve your data.

It is recommended to use an interactive node to migrate your data to Oasis rather than a session on the login node. Here are some suggestions for how to perform that procedure:

  1. Get an interactive shell via the qsub command.

    % qsub -I

  2. Copy files

    There are many ways to do this, perhaps the slowest being cp -R. One simple and preferable method is:

    % cd /mirage/<username>

    % tar cf - * | ( cd /oasis/<username>; tar xvfBp -)

You might follow the above tar with an rsync, for example:

rsync -a /mirage/<username>/ /oasis/<username>/

Note that rsync syntax is rather sensitive. The trailing slashes on the command above are important.

Using tar will cause reads to be well buffered, putting Lustre into more of its comfort zone for "big" files. If you have many small files, there is not an efficient way to move data, mostly because Lustre is not very efficient on small files.

Many other efficient methods are available for large files.
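
Because the Mirage disks will be erased, it may be worth a quick sanity check after copying. The sketch below compares the number of regular files in the two trees. The function names are hypothetical, and a matching count is only a rough indicator: it does not compare file contents, and file names containing newlines would be miscounted.

```shell
# Count regular files under DIR (hidden files included).
count_files() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Succeeds (returns 0) when SRC and DST hold the same number of regular files.
same_file_count() {
    [ "$(count_files "$1")" -eq "$(count_files "$2")" ]
}

# Example:
#   same_file_count "/mirage/<username>" "/oasis/<username>" || echo "File counts differ"
```

If the counts differ, rerunning the copy for the affected directories is usually simpler than hunting for individual files.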

Data Oasis Now Online

Updated 4:30 p.m. Tuesday, October 12: Transition of Data Is In Progress

As of today, Data Oasis is available for use. Users should begin moving their data from Mirage and updating job scripts to write to /oasis immediately. On about Oct. 29, Mirage will be completely removed from Triton and user data there will no longer be available.

We'll start the process of flipping /mirage to read-only on Friday afternoon, Oct. 15, finishing sometime Monday. Unfortunately, this will require yet another round of rebooting, but we'll try to keep disruptions to a minimum.

The scratch directory for jobs can be found at /oasis/scratch/<login>/$PBS_JOBID. We still need to add the code to set an environment variable to this path and to clean out the directories after three days.
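
Until that environment variable exists, a job script can build the path itself. The following is a minimal sketch; the function name job_scratch_dir and the variable name SCRATCH are made up for illustration, and it assumes that <login> matches the USER environment variable inside the job.

```shell
# Print the per-job scratch path described above.
# BASE defaults to /oasis/scratch; LOGIN and JOBID default to the
# USER and PBS_JOBID environment variables.
job_scratch_dir() {
    base=${1:-/oasis/scratch}
    login=${2:-$USER}
    jobid=${3:-$PBS_JOBID}
    printf '%s/%s/%s\n' "$base" "$login" "$jobid"
}

# Inside a job script one might write:
#   SCRATCH=$(job_scratch_dir)
#   mkdir -p "$SCRATCH" && cd "$SCRATCH"
```

Keeping the path in one variable makes it easy to switch over once the system sets its own.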

Data Oasis Availability Delayed One Day

Updated 1 p.m. Monday, October 11: Transition To Data Oasis Delayed

Due to complications with Mirage, the transition will start tomorrow, Tuesday, October 12. Sorry for any inconvenience this may cause. Please follow the schedule below, offset by 24 hours.

Data Oasis Transition Set to Begin

Posted 2 p.m. Friday, October 8: Transition To Begin Monday, October 11

Over the past couple of weeks the Triton team has been testing the first incarnation of Data Oasis (DO), a new parallel filesystem intended to replace Mirage. We're ready now to go live with the system. Given the fun we've experienced with Mirage over the past 12 hours, it looks like the timing is good.

This version of DO will have approximately 250 terabytes usable capacity, more than doubling the amount of disk space available on Mirage. DO will run the latest version of the Lustre filesystem; word on the street and our own experience during testing both indicate that we should see significant improvements in stability by making the upgrade.

Next Monday (10/11) we'll begin a two-week transition period to move from Mirage to DO. Both filesystems will be mounted across Triton, Mirage at /mirage as usual, and DO at /oasis. As with /mirage, you will find a directory on /oasis named after your login id. Please modify any references you have in scripts, etc. to refer to the new filesystem, and start copying any data you want to keep from Mirage to DO. After three days (Thursday 10/14, long enough to let any jobs running Sunday complete), we'll remount Mirage read-only so that jobs won't be able to write any new data to it.

Unlike Mirage, we plan to reserve 50 terabytes on DO for job scratch space. Each job will have the associated scratch directory /oasis/scratch/<login>/<job#> that can be used to place data temporarily. (We'll set a job-specific environment variable to reference this directory.) These directories and their contents will be purged automatically three days after the job completes — this really is scratch space.

Also unlike Mirage, we'll be placing per-user quotas on DO usage, with the goal of avoiding the performance degradation we saw on Mirage when usage got into the high 90-percent range. Details will follow, but each user will have at least enough space to hold their current Mirage usage.

After the two-week transition period, Mirage will leave Triton and its disks will be recycled to other uses, so any data that hasn't been copied over really will be gone. As with Mirage, making backups of data on DO will be the responsibility of the users.

Please post any questions you have about this transition to the Triton Discussion List (triton-discuss@sdsc.edu), and we'll do our best to clarify and help in the move. There should be a /oasis set up sometime late Sunday afternoon; feel free to get an early start on transferring your data.

Latest Oasis Test Results

Posted 12:00 p.m. Friday, October 8: New Results from Storage Upgrade

These numbers are for up to 4 nodes using 32 cores. Tests were also run on up to 128 nodes and 1024 cores.

     Nodes       Cores        Max Write        Max Read
        1         8           378.12 MiB/sec   578.18 MiB/sec
        2         8           601.66 MiB/sec   849.80 MiB/sec
        4         8           744.38 MiB/sec   981.06 MiB/sec
        4         16          740.49 MiB/sec   1066.38 MiB/sec
        4         32          565.95 MiB/sec   1070.39 MiB/sec

The peak performance for this set was:

     Type        Max speed          Nodes       Cores       File size
     Read        8395.83 MiB/sec    64          128         2TB
     Write       4417.27 MiB/sec    128         256         4TB

Please visit the Data Oasis page for more information.

New Roll Source Download Location


Posted 4:00 p.m. Thursday, Sept. 30: All Triton Rolls Can Be Obtained from the Rocks Git Server

In an effort to streamline accessibility to Triton source used to build our system software, we want to inform interested users about the availability of the Rocks Git Repository. From this location, daily CVS code updates can be downloaded and users can keep abreast of the most recent changes being checked in by Triton developers. In addition to Triton source code, this repository contains the latest Rocks internal and third-party code as well.

Preliminary Oasis Test Results

Posted 5:00 p.m. Tuesday, Sept. 28: Performance Expectations Exceeded During Phase 0 Certification

Performance testing on the first phase (Phase 0) of Data Oasis over the last several days achieved approximately 3.5 gigabytes per second (GB/s) on writes and about 7.6GB/s on reads using a 2 terabyte file and 512 clients.

In networking terms, this was 64 gigabits per second, or an incredible 80% of the theoretical channel-bonded link. Another way to look at this data is:

8000/(8*1135) = 8000/9080 = 88% scaling efficiency (1135 being our best 1 OSS number)

Our goal was 7 billion bytes per second, so we beat that by 15%.
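
For anyone who wants to check the scaling figure, here is the arithmetic as a short shell sketch. The post does not state units for 8000 and 1135 (presumably MB/s), and the factor of 8 is taken to be the number of servers scaling ideally; the variable and function names are illustrative.

```shell
# Print NUM/DEN as a whole-number percentage.
percent() {
    awk -v n="$1" -v d="$2" 'BEGIN { printf "%.0f\n", 100 * n / d }'
}

aggregate=8000                 # aggregate read rate used in the post
single_oss=1135                # best single-OSS rate quoted in the post
ideal=$(( 8 * single_oss ))    # 8 servers scaling perfectly: 9080
percent "$aggregate" "$ideal"  # prints 88
```

The same helper reproduces the network figure: 64 gigabits per second at 80% utilization implies a theoretical link of 80 gigabits per second, and percent 64 80 prints 80.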

In related news, the Phase 1 Data Oasis RFP has been published. Oasis should be expanding in about two months. Our goal for Phase 1 will be approximately five times the sustained speeds achieved in Phase 0.

We hope to have Phase 0 available to all Triton users within a few days.

Campus Network Outage Scheduled for Oct. 5

Posted 3:30 p.m. Wednesday, Sept. 22: Campus Users Will Lose Access to Triton while Router is Upgraded

On Tuesday, October 5th, 2010 from 5 p.m. until approximately 7:30 p.m., SDSC and UCSD networking teams will update routes to utilize the new MX960 router at SDSC, and retire the older T320 router (known as dolphin). Network routes between SDSC and UCSD will be affected, and there may be connectivity issues to some hosts.

During the maintenance, UCSD Triton users will be unable to access Triton resources. Users connecting to UCSD or SDSC from external hosts should continue to have uninterrupted access to Triton and campus networks.

If you have questions or concerns, please email the Triton Discussion List (triton-discuss@sdsc.edu) or SDSC Network Support.

Home File System Space Availability

Posted 9:30 a.m. Tuesday, Sept. 21: Users Asked to Remove Files

We've hit a critical point on /home file usage, where there isn't enough free space for the system overhead involved in deleting files. We have recently freed approximately 1.3 terabytes, which provided headroom for cleaning up the filesystem.

We have about 530 users sharing around 44 terabytes of home space. That works out to about 83 gigabytes per user. Because not all users need that much space (we have a number of idle accounts that contain little or no data), we've so far avoided putting quotas on the system. However, if you're using considerably more than that — say, 830 gigabytes or more — then you're consuming more than a fair share of the resource. Please shift your collected data off of Triton so that we have enough room for everyone to operate.

We continue to seek additional ways to relieve the space crunch. We've shifted some users to another server, freeing up about 30 terabytes from the primary system. However, there will always be a hard limit, no matter how much disk we throw at it, as seen in the 98% usage of the 100 terabytes on /mirage. Users offloading their data onto other resources is the only long-term solution.

MATLAB License Restrictions

Update (9:30 a.m. Monday, Sept. 13): Non-UCSD personnel being temporarily denied use of software

Due to restrictions in our licensing agreement with The Mathworks, we are currently forced to limit access to both client and server MATLAB licenses strictly to UCSD users.

We regret any inconvenience this causes. We are working with the company to relax the restriction so that MATLAB may again be available to all users. We will announce policy changes when we reach a new agreement.

Please Remove Unneeded Files from Mirage

Update (9:30 a.m. Thursday, Sept. 2): Filesystem Capacity at 98%

/mirage has hit 98% of capacity. At that high a level of usage, performance bogs down and reliability can get a bit shaky. Please make a pass through your data on the filesystem and remove files you no longer need.

Updated Triton Rolls Available

Update (1:30 a.m. Saturday, July 10): Triton_RC3 Rolls Now on Download Page

Many of the source rolls from the Rocks 5.3 upgrade completed on May 18 are now posted for download on the Triton Download Page. You can also find information on how to build a cluster like Triton on our Build Your Own page.

New Features and Benefits of Data Oasis


Update (3:30 p.m. Tuesday, June 23): Benefits and effects of new filesystem

Triton will soon support a faster, more reliable, higher capacity parallel filesystem. Some of the features and benefits are listed below. We will provide more specific data at the conclusion of testing.

Expected User Benefits

  • Filesystem capacity will increase from 100 terabytes to 281 terabytes
  • Redundant metadata servers will provide increased reliability; 384 terabytes total capacity
  • Filesystem software upgrade to version 1.8.3 will add new features and bug fixes
  • Enterprise-class server hardware and interconnect fabric will provide better performance
  • System will support quotas

Anticipated User Impacts

  • Filesystem name will change from /mirage to /oasis - this may affect scripts with hard-coded paths
  • Phase 0: both /mirage and /oasis will be online at the same time for about 30 days
  • /mirage data will be destroyed after migration grace period and /mirage will be unmounted

The new hardware and software versions are currently being tested on reserved nodes of Triton. Availability to the production nodes is expected within a few weeks.

Details of Recent Upgrade

Update (5:30 p.m. Wednesday, June 9): TRITON_RC3 Upgrade Details

Following are some particulars regarding the affected packages and systems.

Updated system software:

Rocks          v5.1   --> v5.3
CentOS         v5.2   --> v5.4
PGI compiler   v8.0   --> v10.5
Lustre client  v1.6.6 --> v1.8.3
Myrinet driver v1.2.8 --> v1.2.12
Moab           v5.3.5 --> v5.3.7 (TORQUE roll update)

New applications (some of these have been in /beta; all will now have a permanent home in /opt):

  BEAST       1.5.2
  APBS        1.2.1
  LAMMPS      28Nov09
  NAMD        2.7b1
  NWChem      5.1.1
  Open Motif  2.3.2
  PDT         3.15
  TAU         2.19
  SciPy       0.7.1rc3
  FFTW        2.1.5 (in addition to 3.2.1)

In addition, performance and administrative gains include:

  • Users have more refined environment control since most applications now use loadable modules rather than an auto-loaded environment.
  • A significant refactoring of the library layout delivers easier maintenance for static and dynamic mpich/mpich2/openmpi libraries with the Intel and PGI compilers.
  • Several applications now use faster Intel-compiled binaries.
  • The /home area file space has increased, and disk space has been repartitioned to provide better support for large applications.
  • The Myrinet driver upgrade provides fixes for SRAM parity detection in the NIC. It also improves progress on codes with many outstanding (unmatched) messages.

+ How you can run extra long jobs on Triton

Update (1:00 p.m. Monday, May 3): New Support Feature! Get approval for your job to run longer than 72 hours!

For jobs requiring more than Triton's 72-hour wallclock limit, users may now request an exception to allow those jobs to be scheduled and run. Please make your request through the discussion mailman list (triton-discuss@sdsc.edu) and system administrators will make the provisions necessary to support your request.
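
Before submitting, you can check whether a requested wallclock time is over the limit and therefore needs an exception request. This is a minimal sketch: it assumes the request is written in the [[HH:]MM:]SS form commonly used for TORQUE walltime requests, and the function names are hypothetical.

```python
LIMIT_HOURS = 72  # Triton's standard wallclock limit, per the notice above


def walltime_seconds(spec):
    """Convert a walltime string such as '96:00:00' or '30:00' to seconds."""
    parts = [int(p) for p in spec.split(":")]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"unrecognized walltime: {spec!r}")
    seconds = 0
    for part in parts:
        # Each field is worth 60 times the field to its right.
        seconds = seconds * 60 + part
    return seconds


def needs_exception(spec, limit_hours=LIMIT_HOURS):
    """True when the request exceeds the limit and needs staff approval."""
    return walltime_seconds(spec) > limit_hours * 3600


for request in ("72:00:00", "96:00:00"):
    print(request, needs_exception(request))
```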

+ Login Node Patching Monday, Sept. 13

Update (9:30 a.m. Monday, Sept. 13): System should remain available

We received word this morning of a Linux security hole that requires patching our systems. We will start by reinstalling our two login nodes. Figure that triton-38 will go down at 2:30 this afternoon. Once it's back up, we'll take triton-login down. We'll also arrange a rolling reinstall of the compute nodes to avoid disrupting running jobs. With luck, the only impact to users will be a need to bounce between login nodes for an hour or so.

+ Intermittent File System Access on Thursday, August 12, 2010

Update (8:30 p.m. Thursday, August 12): Switch Firmware Upgrade Complete

The upgrade has been completed. There do not appear to be any significant issues associated with the change. Please let us know if you experienced problems with a job during this interval (approximately 3 p.m. to 9 p.m. PT on Thursday, August 12, 2010).

Notice (3:30 p.m. Thursday, August 12): Running Jobs May Be Affected by Switch Firmware Upgrade

We need to upgrade the switch firmware on Triton's Myrinet switch so that it can talk at greater than 1 x 10GbE to our other machine room. As part of that process, the 10GbE connections to home area and Lustre will go up and down. In other words, there will be outages lasting 1 - 5 minutes in which access to home servers is unavailable.

We are 99.99% certain that NFS (home area mounts) will restart without significant issue. However, we are less certain about how Lustre will react to a 1 - 5 minute network outage. We'll monitor running jobs, and if Lustre falls over, folks will need to resubmit. (We of course will do the right thing with respect to charging if the outage has a significant effect on running jobs.)

There really isn't any -good- time to perform such an upgrade, so now is by definition the "best" time (don't ask for real logic on that statement ;-)).

The outage should be quite short (and if you are not actively working on the login node, you probably won't notice). We will let you know when the upgrade has been completed.

+ Triton Nodes Back Online Monday, August 2

Update (6:00 a.m. PT Wednesday, August 4): New Switch Installed for PDAF/M Nodes

The switch failure affecting approximately 20 PDAF/M nodes that occurred on Sunday, August 1, has been resolved with the replacement of the switch controlling access to those nodes. The entire Triton cluster has been fully functional since Monday morning, August 2. If you experienced a job failure or lost time due to this outage, please request a refund through the Triton Discussion List.

+ Partial Outage of Triton Nodes on August 1, 2010

Update (8:30 a.m. Sunday, August 1): Switch Failure Affects PDAF/M Nodes

An apparent switch failure on racks six and seven has temporarily rendered most of the large-memory nodes on Triton inaccessible. The failure occurred at approximately 3:40 a.m. today. Administrators are working on the problem, and we hope to have a replacement switch installed soon. So far, this outage has not affected any TCC nodes, and those remain fully available. Likewise, the PDAF nodes in racks eight and nine remain available. More information will be posted as it becomes available. Check the Triton Status Page for the latest updates. The status page is updated every two minutes. Even more detailed information is available on the Triton Ganglia page, which is updated every minute.

+ Triton software upgrade on May 18

Update (9:00 p.m. Tuesday, May 18): Upgrade is complete as of 7:45 p.m.

User access to Triton is enabled. Please log in and submit your jobs. Promptly report any unusual behavior to the discussion list. Thanks for being patient.

Update (6:00 p.m. Tuesday, May 18): Upgrade is progressing slowly, should be completed by about 7 p.m.

We are a little behind schedule, but progress is steady and the job should finish about two hours later than anticipated. We'll post here when Triton is back up.

Update (10:00 a.m. Tuesday, May 18): Upgrade has begun...

The maintenance period has begun. All users and jobs have been removed from the system. The new software stack is being installed at this time. Check back here for availability, monitor the discussion list, or follow the progress on Twitter.

On Tuesday, May 18, system administrators will apply a major upgrade to the Triton Resource software stack. Triton will be unavailable to users beginning at 8 a.m. and should return to service by 5 p.m. (sooner if possible).

All running jobs will be terminated prior to starting the maintenance, and all temporary data will be discarded. Data on the /home and /mirage filesystems will be preserved. To avoid lost work and the need to ask for refunds, do not submit jobs that will run during the maintenance period. All existing jobs will be cleared from the queues, so users must resubmit them after the maintenance is complete.

Details of the planned changes will be announced soon. Please contact the staff via the discussion list if you have questions. Thank you for your patience as we improve Triton's capability to serve you.

+ Free MATLAB seminars on campus Thursday, May 13

Update (12:00 p.m. Monday, Apr. 25): Advanced Programming Techniques with MATLAB at UCSD

The MathWorks will present two complimentary programming seminars to the UCSD community from 10:30 a.m. to 2:30 p.m. in the Student Center's Dolores Huerta — Philip Vera Cruz Room. The morning session, which runs from 10:30 until noon, is titled "Data Acquisition, Analysis and Visualization in MATLAB". A brief Q & A and refreshment break will be followed by the second seminar, titled "Speeding Up Applications: Parallel Computing with MATLAB", which runs from 12:30 until 2:30.

Those interested may sign up at The MathWorks seminar registration web site. For more details and contact info, you can also download the MathWorks announcement (Word doc). The Student Center is located in Muir College, next to Mandeville Center. Map details, parking and driving information are available by searching UCSD MapLink for "Student Center".

+ Storage outage successful this weekend

Update (11:30 a.m. Monday, Apr. 25): Triton PFS returned to service

SDSC Facilities upgraded the machine room floor April 24-25 for enhanced stability during earthquakes. The work required Triton's parallel storage hardware to be relocated, causing the /mirage filesystem to be offline over the weekend.

The maintenance was completed successfully, and the filesystem came back online without problems. Thank you for your patience during this downtime to improve Triton's future reliability.

+ Storage outage 9 p.m. Apr 23rd — 11 a.m. Apr 25th

Update (11:30 a.m. Thursday, Apr. 22): Triton /mirage downtime

SDSC Facilities has scheduled maintenance to the SDSC machine room floor this weekend. This work requires Triton's /mirage storage hardware to be relocated, so the filesystem will be offline during the relocation.

Please plan your work accordingly and defer jobs requiring large scratch space until after the maintenance is complete. Jobs with small scratch space needs may use the /home filesystem as a temporary alternative. Do not redirect large scratch space jobs to the /home filesystem, as this has recently caused denial-of-service problems.

We apologize for the disruption of service and thank you for bearing with us as the Triton facility is upgraded.

+ Triton Research Opportunities (TRO) Program — Call for Proposals

Update (10:30 a.m. Tuesday, Feb. 9): New program offers compute time

SDSC has announced the formation of the Triton Research Opportunities (TRO) program — a program to provide campus researchers a mechanism to tap into the expertise of SDSC staff in high performance computing, data-intensive science, and cyberinfrastructure software development; and to stimulate new research collaborations. Successful applicants will partner with an SDSC staff researcher to exploit the capabilities of the Triton Resource for their research endeavors.

The TRO program consists of a campus-wide, peer-reviewed proposal competition. Awards will provide Triton cycles and seed funds that enable SDSC researchers to collaborate with campus partners to jointly seek extramural funding. TRO proposals will be solicited semi-annually.

The application deadline is March 15, 2010. For more information about the program and a listing of SDSC staff expertise, see the TRO page on the CI-RED Web site or contact Chaitan Baru.

+ New Policy to Limit Login Node Access

Update (12:00 p.m. Wednesday, Feb. 3): Intended to prevent abuse

As of today, users will be limited in the amount of memory and time that a job can use on the Triton login nodes. This is in response to a small number of users whose jobs, run inappropriately on the login nodes, have caused bottlenecks. All users depend on the login nodes for primary access to Triton, so jobs that heavily use or monopolize their limited resources prevent others from reaching any node. Such jobs should be run on the compute nodes to avoid denial of service to other users.

The new limits are as follows:

  • Maximum memory for any process: 2 gigabytes
  • Maximum cpu time for any process: 180 minutes
  • Maximum number of processes: 2048
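
To see how limits like these appear from inside a session, the sketch below prints the current process limits next to the announced values. It is illustrative only: the notice does not say how the limits are enforced, so the mapping to the POSIX limits RLIMIT_AS, RLIMIT_CPU and RLIMIT_NPROC is an assumption, as is reading "2 gigabytes" as 2 x 1024^3 bytes.

```python
import resource

# Announced login-node limits. The rlimit chosen for each one is an assumption.
ANNOUNCED = {
    "memory per process (bytes)": (resource.RLIMIT_AS, 2 * 1024**3),
    "cpu time per process (seconds)": (resource.RLIMIT_CPU, 180 * 60),
    "number of processes": (resource.RLIMIT_NPROC, 2048),
}


def describe(value):
    """Render an rlimit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def report():
    """Return one line per limit: current soft and hard values vs. announced."""
    lines = []
    for label, (which, announced) in ANNOUNCED.items():
        soft, hard = resource.getrlimit(which)
        lines.append(
            f"{label}: soft={describe(soft)} hard={describe(hard)} "
            f"announced={announced}"
        )
    return lines


print("\n".join(report()))
```

If the soft value shown on a login node is lower than you expect, that is a sign the work belongs on the compute nodes.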

+ Continuing Triton Access for UCSD Researchers

Update (3:00 p.m. Friday, Jan. 8): Second round of TAPP applications

A new TAPP application period is in effect for UCSD through Jan. 31. For details, visit the Academic Affairs Web site.

+ Computing in the New Year

Update (1:00 p.m. Monday, Jan. 4): Triton support again at full strength

The Triton Resource support staff is back at full strength after the holiday break. Compute time is readily available and job queues are temporarily short. Take advantage before the system ramps up again during the winter academic quarter.

+ UCSD Year-end Holiday Campus Closure

Update (2:30 p.m. Friday, Dec. 11): Triton support impacted by furlough

The upcoming campus closure begins Saturday, December 19, 2009, and continues through Sunday, January 3, 2010. Of those days, six are furlough days and the rest are weekends and required holidays. Triton staff are paid from state funds and are required to be furloughed.

Triton itself will be online, but there will be no guaranteed system availability during the holiday break. Staff will attempt to respond to issues posted to the Discussion List, but no assurances are made that any problem can be resolved before the campus re-opens on January 4.

SDSC Operations will monitor the system, but no Triton principals (those capable of fixing specific Triton-related issues) will be available. Even on normal work days, Triton is a best-effort support system with guaranteed problem solving only during regular business hours. Staff will try to check on the system at least daily during the break, but response times could be measured in days rather than the hours or minutes typical of routine support.

In the worst case, if something catastrophic happens, the component(s) will be disconnected from the network and remedied during the first workday (or two) of the New Year.

Please understand that Triton staff are on vacation or would be working unpaid during furlough days. Do not expect Triton issues to be resolved quickly during the campus closure. Staff will make reasonable efforts, as their personal time allows, to fix issues that arise. Thank you for your patience and understanding during this exceptional time. We wish you the best of the holiday season.

+ New Triton Login Nodes

Update (2:00 p.m. Friday, Dec. 18): New hardware in the New Year

The new Sun Front End nodes have been delivered, and will replace the existing Appro nodes in the Triton Front End configuration. Due to time constraints with the year-end closing of UCSD, this maintenance will be deferred until after the campus reopens on Jan. 4.

+ New Triton Hardware

Update (2:00 p.m. Friday, Dec. 11): Front End equipment to be installed

New login nodes and servers used for Triton administration have been delivered and will be installed in the next few days. This should provide better remote management in the event of login node unavailability or other maintenance needs on the cluster. In addition, a second login node will be added to the system. Each login node will have a unique name, so users can manually direct login connections to either one in the event of an outage.

Exact information on the time of the outage will be posted as soon as available. Impact to users should be brief if noticeable at all. The new login node name will be posted at that time also.

+ Triton Software Rolls Available

Update (11:30 a.m. Friday, Dec. 11): New Downloads Posted

The Triton Resource engineering staff have completed the first phase of Rocks roll packaging for software used in the building of the TCC, PDAF and PDAFM nodes.

You can obtain source and ISO copies of all available rolls from the Download page and the Build Your Own Triton page. Contact the Discussion List for more details.

+ Home Filesystem Issues Nov. 9 - Dec. 7

Update (3:30 p.m. Monday, Dec. 7): ZFS Automation Back On

As far as we can tell, both the primary and replica NFS servers are functioning normally. Automated replication was turned back on this morning. With luck, we will see several months of uninterrupted service. Please report any irregularities to the Discussion List.

ZFS Snapshots and User Data Replication Re-enabled

Update (10:30 a.m. Monday, Nov. 30): ZFS Scrub Complete

The upgrade is complete, and the data scrub finished early this morning.

  • The Zpool scrub has completed with 0 data errors.
    • This means that ZFS has bit-verified every stored file.
    • The scrub process completed in about 16 hours.
  • ZFS replication of user home areas has been restarted.
    • The primary has about 10 terabytes more data than the backup.
    • The backup script consumes about 45-50MB per second. Therefore, the complete resynchronization will take more than 48 hours.
    • After the resync completes, replication will be set to run nightly. The first fully automatic snapshot-and-backup process is projected to be after Wednesday, Dec. 2.
  • Users can check when the last snapshot of their home area was taken.
    • Run ls -lt $HOME/.zfs/snapshot and look at the first directory listed. This will be named with the latest snapshot date.
    • The snapshot directory name will be something like
      SNAPSHOT2009-11-06-1257567254
    • After snapshots are taken, the (read-only) data is then replicated to the backup server.
  • Triton admins will be monitoring the replication progress.
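
The resync estimate and the snapshot naming above can be checked with a few lines of Python. This is a rough sketch under stated assumptions: terabytes and megabytes are taken as decimal units, the trailing number in the snapshot name is assumed to be a Unix timestamp (it is consistent with the example date, but the naming scheme is not documented here), and both function names are hypothetical.

```python
from datetime import datetime, timezone


def resync_hours(backlog_bytes, rate_bytes_per_s):
    """Hours needed to copy backlog_bytes at a steady transfer rate."""
    return backlog_bytes / rate_bytes_per_s / 3600


def parse_snapshot_name(name):
    """Split 'SNAPSHOT<YYYY-MM-DD>-<epoch>' into (date, UTC datetime).

    Assumption: the trailing field is seconds since the Unix epoch.
    """
    prefix = "SNAPSHOT"
    if not name.startswith(prefix):
        raise ValueError(f"not a snapshot name: {name!r}")
    date_part, _, epoch_part = name[len(prefix):].rpartition("-")
    day = datetime.strptime(date_part, "%Y-%m-%d").date()
    taken_utc = datetime.fromtimestamp(int(epoch_part), tz=timezone.utc)
    return day, taken_utc


# About 10 TB of backlog at 45-50 MB per second.
for rate_mb in (45, 50):
    print(rate_mb, "MB/s:", round(resync_hours(10e12, rate_mb * 1e6), 1), "hours")

day, taken = parse_snapshot_name("SNAPSHOT2009-11-06-1257567254")
print(day.isoformat(), taken.isoformat())
```

At these rates the backlog takes roughly 56 to 62 hours to copy, which agrees with the "more than 48 hours" figure above.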

Update (6:00 p.m. Sunday, Nov. 29): NFS Filesystem Upgrade

The firmware and software on Triton's primary NFS server were upgraded today starting at 8 a.m. The upgrade was complete at 9:30 a.m. Most of that time involved flashing and rebooting the nodes. Any affected jobs that were active at the start of the upgrade will be credited.

Another maintenance task will be performed in the background today: the storage pool itself will undergo a "zpool scrub" to validate all stored data. User data will be available during the scrub, but performance will be somewhat diminished. The scrub is the best way to verify integrity. When that completes, replication will be enabled for the first time since November 6. The scrub should be completed before the end of today.

Update (3:00 p.m. Friday, Nov. 20): Login Node Replacement

The ordered replacement node, plus an additional new node, are scheduled to ship Monday, Nov. 23. We hope to have them installed by late next week. This upgrade will double our capacity on the login service, provide better front end server hardware, and improve support response times for dealing with outages by increasing remote accessibility to admins. In the meantime, the temporary node will continue to serve, and we'll keep a close watch to ensure the greatest possible availability for users.

Update (10:00 a.m. Friday, Nov. 13): Login Node Failure

There appears to be a hardware issue with the login node. The node will not boot from the network. A temporary replacement node was installed and activated prior to 10 a.m. today.

Update (2:30 p.m. Thursday, Nov. 12): Latest from Sun on Crash Dump Analysis

Sun confirmed a bug that occurs when a storage pool sees multiple simultaneous errors. The pool is suspended, and all subsequent operations then hang instead of timing out. There is no current fix for the bug, other than ensuring physical integrity of the disks.

We suspect the design rationale for suspending the pool is to avoid corrupting the file system beyond repair. It's likely that when our systems were built, disk drives from one serial number range were slightly out of spec. We identified self-monitored prefailure warnings on some drives.

A paper from Google Labs, Failure Trends in a Large Disk Drive Population (PDF), discusses the Annualized Failure Rate (AFR) of disk drives in a very large disk farm. The first three months of disk life in the study farm (about where Triton is in terms of actual usage) show approximately three times higher failure rates on high utilization drives over medium and low utilization ones. Triton is likely within this three-month usage range, so our failure rate is not unexpected.

We are still not backing up user home areas, though all data is protected against double disk failure. To get to the point where we believe that snapshot/replication will not cause hangs, we must root out the marginal drives in the storage arrays. This will take some time.

In the interim, user home area storage should be reliable, but there is the possibility that the home area server will hang. We'll keep watching and try to react quickly if it does. Thanks for your patience, and please continue to help us keep on top of reliability issues by posting to the discussion list whenever you have problems.

Update (1 p.m. Thursday, Nov. 12): On our backup server, we were able to duplicate the problem and force a core dump. All available data have been uploaded to Sun, who are doing a post-mortem to isolate the root cause.

Sun confirmed an issue with Solaris U6 (currently running on the primary ZFS) involving snapshots and incremental ZFS sends. While the support folks were happy that we upgraded to U8 on the backup server, the fact that we are still locking up the file system is puzzling.

After firmware updates and U8 installation on the backup server, two drives (of the 48 in our ZFS configuration) remained non-functional. Those will eventually be repaired, but could be related to the root cause. Our ZFS data are still intact, due to mirroring that allows any two drives to fail. For most users, the backup contains home area data up through Nov 6. Data deposited after this are not being backed up, and users are advised to make alternate plans for safekeeping of such data until the problem is resolved in production.

We are hopeful of a more definitive answer from Sun as they pore over the crash dump. Until the backup system issue is resolved, production will remain in its current configuration. Due to the extra attention on this problem, production is being very closely monitored and should be extremely reliable despite the flaw, since administrative support is likely to respond very quickly during the investigation.

Update (9 a.m. Tuesday, Nov. 10): We still do not have a root cause for the ZFS failure that is causing temporary, intermittent login node unavailability. No updates will be made to production servers until the actual cause is known. Currently, no backups are being performed on user home areas, so users may want to take extra precautions with data there until the issue is resolved. Testing to determine the root cause is continuing on our backup server, and production ZFS, login, and all compute nodes are fully functional (except for the ZFS backups).

Update (3 p.m. Monday, Nov. 9): The latest Solaris update did not resolve the ZFS problem — a failure occurred during the pool scrub on the backup server, resulting in a frozen ZFS subsystem. The root cause of the filesystem failures is still unknown at this time. The latest SAS controller patches are being installed on the backup server, and a new pool scrub test will be performed. A new downtime will be scheduled, possibly still today.

Notice to Users

Issues with the Triton ZFS server will be addressed by a brief outage at a time yet to be determined. During this outage, we expect that most running jobs should complete, but a few may experience early terminations. Running jobs will be inventoried prior to the upgrade and refunds will be available for affected jobs.

We've upgraded the backup server and are currently running tests to locate the root cause of the failure. We apologize for the ongoing inconvenience this problem has caused. If the primary server becomes inaccessible before the scheduled upgrade, we will combine this maintenance with our response to that failure so that the service requires only a single outage.

+ Early Adopter Phase Ended October 5

Production Phase Announcement: The full production phase of the Triton Resource began on Monday, October 5, 2009. The Early Adopter phase ended at that time.

What this means for users

Triton's migration to the charged-for usage model was completed on Monday, October 5 with the implementation of the usage accounting service. Early Adopter accounts are no longer being created or renewed, and TAPP or project allocations are now required to run jobs. This marks the beginning of the full production phase of Triton.

If an allocation runs out of SUs, TAPP procedures should be followed to extend or renew the account. Triton system administrators will not be authorized to replenish accounts the way they did during the Early Adopter phase.

Refunds for certain failed jobs and system errors will be considered on a case-by-case basis. Please direct requests to the discussion list.

Users can discover what their calculations will cost, view their usage statements, and check the status of their accounts by running the mybalance and gstatement -u $USER commands.

Details on the latest changes and policy decisions can be found on the following FAQ pages:

Maintenance History (Partial)

+ RC2 System Software Upgrade 18 Sept 2009

Both TCC and PDAF were upgraded to Release Candidate 2 in preparation for full production usage and accounting. The system will remain in Early Adopter mode for about two weeks. The upgrade maintenance went smoothly and required about six hours of downtime to update the login node and all compute nodes.

Security Patch 14 Aug 2009

A security patch was applied to Triton on August 14 between 2 and 3 PM PDT. This patch was necessary to close a local privilege security vulnerability first reported on August 11. The RHEL patch became available on August 14 and was installed on the Triton login node almost immediately. Details of the patch are available on Bugzilla. The login node update began at approximately 14:20 PDT and was completed by about 15:10 PDT.

Completion of this security patch accomplished the following:

  • Closed the security hole
  • Did not affect running or queued batch jobs
  • Allowed addition of the Bio roll to Triton
  • Batch nodes were updated to the same level at the first opportunity after the login node was updated

+ Cluster Reinstallation 23 Jul 2009 (TRITON_RC1)

A full reinstallation of Triton was performed on July 23, and completed within the expected 2-3 hour window, after which Triton was again running normally.

The cluster's public IP addresses were changed during this maintenance. The IP address of the login node was changed to 132.249.122.43.

User home data areas were restored intact.

Mirage Installation 20 Jul 2009

During a planned outage on July 20, the Mirage Lustre servers were physically moved to a new rack and new power. Two dead LUNs were also recovered so that all 100 storage targets are currently available. Mirage is now mounted on the login node and all compute nodes. All 100 TB are now available on /mirage.

+ General Status of Triton Resource

The Triton Resource is in full production. TRITON_RC2 (Release Candidate 2) is installed, and full job accounting is in effect.

This site will be kept up-to-date as node statuses change, or when the system has a scheduled maintenance. Currently, all of the nodes are in service and available via the scheduler. When nodes undergo unplanned maintenance, this site will be updated and messages will be posted on the discussion list and Triton's Twitter feed.

+ Early Adopter User Accounts

Early Adopter accounts were reset with a complimentary 1000 SU balance on October 5. You can contact the discussion list (triton-discuss@sdsc.edu) with usage and general access questions. You can join the list here.

Triton's exceptional data-intensive computing power is now available to the University of California HPC research community.

If you have an account and are ready to access the Triton Resource, please visit the User Access page for details and to obtain login information. For information on first-time logins to the Triton Resource, please read the New User page. To request an account, please use TAPP.

To read about the current hardware status and get details of the system building process, read the Triton Resource blog.

Triton Full Production began Oct 5, 2009

Triton's compute components moved to production on October 5, 2009. Early Adopters helped to identify software needs and support requirements starting in July. Users and potential users are encouraged to continue sending feedback and suggestions to the Triton support team.

The 28 large-memory nodes of the PDAF provide some of the most extensive data analysis power available commercially or at any research institution in the country. The cluster includes four special nodes dedicated to database server interaction.

The 256-node TCC is a Rocks cluster with 24 gigabytes of memory and eight processing cores on each node.

Triton was upgraded to Rocks 5.3 on May 18, 2010. This upgrade included many software package updates as well.

In late Summer 2011, Triton users gained access to a new Lustre PFS with 850 terabytes of work and scratch storage. The new filesystem, mounted as /phase1, replaces Data Oasis.

In early Fall, 2010, Triton users received a disk capacity increase to 250 terabytes, coincident with the replacement of /mirage by /oasis.

Phase One Is Here!

The new Parallel File System for Triton, containing 850TB of Lustre-based storage, is now available to users. Phase One has replaced Data Oasis with new, more efficient and more reliable mass storage for Triton jobs.

Obtain Triton Access via TAPP

For general and long-term access to Triton Resource, users are asked to request an allocation through the Triton Affiliates and Partners program, or TAPP. This is the primary way for users to gain access to Triton for running jobs and conducting research.

Contact Us

Open a Ticket with Triton Resource Support using the Support Ticket Form.

Join the Discussion Forum: Sign up for our Email Discussion List.

Follow Triton on Twitter

FAQ: Read the FAQ Page.
