Tuesday, February 9th 2010 03:33:01 PM PST
tcc-5-38.local
The Triton Resource provides easily accessible, affordable, high-performance and data-intensive compute resources to UCSD researchers, faculty, affiliates, government and commercial partners through innovative, locally supported, scalable hardware and software over multiple 10-gigabit networks extending from campus laboratories to the UC network, California, and the US.
Update (10:30 a.m. Tuesday, Feb. 9): New program offers compute time
SDSC has announced the formation of the Triton Research Opportunities (TRO) program — a program to provide campus researchers a mechanism to tap into the expertise of SDSC staff in high performance computing, data-intensive science, and cyberinfrastructure software development; and to stimulate new research collaborations. Successful applicants will partner with an SDSC staff researcher to exploit the capabilities of the Triton Resource for their research endeavors.
The TRO program consists of a campus-wide, peer-reviewed proposal competition. Awards will provide Triton cycles and seed funds that enable SDSC researchers to collaborate with campus partners to jointly seek extramural funding. TRO proposals will be solicited semi-annually.
The application deadline is March 15, 2010. For more information about the program and a listing of SDSC staff expertise, see the TRO page on the CI-RED Web site or contact Chaitan Baru.
Update (12:00 p.m. Wednesday, Feb. 3): Intended to prevent abuse
As of today, jobs on the Triton login nodes are limited in the amount of memory and time they can use. This is in response to a small number of users whose inappropriately run jobs have caused bottlenecks. Jobs that heavily use or monopolize the login node, which all users depend on for primary access to Triton, prevent others from reaching any of the nodes. Such jobs should be run on the compute nodes to avoid denial of service to other users.
The new limits are as follows:
Update (3:00 p.m. Friday, Jan. 8): Second round of TAPP applications
A new TAPP application period is in effect for UCSD through Jan. 31. For details, visit the Academic Affairs Web site.
Update (1:00 p.m. Monday, Jan. 4): Triton support again at full strength
The Triton Resource support staff is back at full strength after the holiday break. Compute time is readily available and job queues are temporarily short. Take advantage before the system ramps up again during the winter academic quarter.
Update (2:30 p.m. Friday, Dec. 11): Triton support impacted by furlough
The upcoming campus closure begins Saturday, December 19, 2009, and continues through Sunday, January 3, 2010. Of those days, six are furlough days and the rest are weekends and required holidays. Triton staff are paid from state funds and are required to be furloughed.
Triton itself will be online, but there will be no guaranteed system availability during the holiday break. Staff will attempt to respond to issues posted to the Discussion List, but no assurances are made that any problem can be resolved before the campus re-opens on January 4.
SDSC Operations will monitor the system, but no Triton principals (those capable of fixing specific Triton-related issues) will be available. Even on normal work days, Triton is a best-effort support system with guaranteed problem solving only during regular business hours. Staff will try to check on the system at least daily during the break, but response times could be measured in days rather than the hours or minutes typical of routine support.
In the worst case, if something catastrophic happens, the component(s) will be disconnected from the network and remedied during the first workday (or two) of the New Year.
Please understand that Triton staff are on vacation or would be working unpaid during furlough days. Do not expect Triton issues to be resolved quickly during the campus closure. Staff will make reasonable efforts, as their personal time allows, to fix issues that arise. Thank you for your patience and understanding during this exceptional time. We wish you the best of the holiday season.
Update (2:00 p.m. Friday, Dec. 18): New hardware in the New Year
The new Sun Front End nodes have been delivered, and will replace the existing Appro nodes in the Triton Front End configuration. Due to time constraints with the year-end closing of UCSD, this maintenance will be deferred until after the campus reopens on Jan. 4.
Update (2:00 p.m. Friday, Dec. 11): Front End equipment to be installed
New login nodes and servers used for Triton administration have been delivered and will be installed in the next few days. This should provide better remote management in the event of login node unavailability or other maintenance needs on the cluster. In addition, a second login node will be added to the system. Each login node will have a unique name, so users can manually direct login connections to either one in the event of an outage.
Exact information on the time of the outage will be posted as soon as available. Impact to users should be brief if noticeable at all. The new login node name will be posted at that time also.
Update (11:30 a.m. Friday, Dec. 11): New Downloads Posted
The Triton Resource engineering staff have completed the first phase of Rocks roll packaging for software used in the building of the TCC, PDAF and PDAFM nodes.
You can obtain source and ISO copies of all available rolls from the Download page and the Build Your Own Triton page. Contact the Discussion List for more details.
Update (3:30 p.m. Monday, Dec. 7): ZFS Automation Back On
As far as we can tell, both the primary and replica NFS servers are functioning normally. Automated replication was turned back on this morning. With luck, we will see several months of uninterrupted service. Please report any irregularities to the Discussion List.
Update (10:30 a.m. Monday, Nov. 30): ZFS Scrub Complete
The upgrade is complete, and the data scrub finished early this morning.
Update (6:00 p.m. Sunday, Nov. 29): NFS Filesystem Upgrade
Triton's primary NFS server firmware and software were upgraded today starting at 8 a.m. The upgrade was complete at 9:30 a.m. Most of that time involved flashing and rebooting the nodes. Any affected jobs that were active at the start of the upgrade will be credited.
Another maintenance task will be performed in the background today: the storage pool itself will undergo a "zpool scrub" to validate all stored data. User data will be available during the scrub, but performance will be somewhat diminished. The scrub is the best way to verify integrity. When that completes, replication will be enabled for the first time since November 6. The scrub should be completed before the end of today.
Update (3:00 p.m. Friday, Nov. 20): Login Node Replacement
The ordered replacement node, plus an additional new node, are scheduled to ship Monday, Nov. 23. We hope to have them installed by late next week. This upgrade will double our capacity on the login service, provide better front end server hardware, and improve support response times for dealing with outages by increasing remote accessibility to admins. In the meantime, the temporary node will continue to serve, and we'll keep a close watch to ensure the greatest possible availability for users.
Update (10:00 a.m. Friday, Nov. 13): Login Node Failure
There appears to be a hardware issue with the login node. The node will not boot from the network. A temporary replacement node was installed and activated prior to 10 a.m. today.
Update (2:30 p.m. Thursday, Nov. 12): Latest from Sun on Crash Dump Analysis
Sun confirmed a bug that is triggered when a storage pool sees multiple simultaneous errors. The storage pool is suspended, and all subsequent operations then hang instead of timing out. There is no current fix for the bug, other than ensuring physical integrity of the disks.
We suspect the design rationale for suspending is to avoid corrupting the file system beyond repair. It's likely that when our systems were built, a serial number range of disk drives was slightly out of spec. We identified self-monitored prefailure warnings on some drives.
A paper from Google Labs, Failure Trends in a Large Disk Drive Population (PDF), discusses the Annualized Failure Rate (AFR) of disk drives in a very large disk farm. The first three months of disk life in the study farm show approximately three times higher failure rates on high-utilization drives than on medium- and low-utilization ones. Triton is likely within this three-month range in terms of actual usage, so our failure rate is not unexpected.
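For readers unfamiliar with AFR, the following is a minimal sketch of one common way to estimate it: failures divided by drive-years of operation. The function name and the sample numbers are hypothetical illustrations, not Triton measurements or figures from the paper.

```python
def annualized_failure_rate(failures: int, drives: int, days_observed: float) -> float:
    """Return AFR as a fraction: failures per drive-year of operation."""
    if drives <= 0 or days_observed <= 0:
        raise ValueError("drives and days_observed must be positive")
    # One drive running for 365 days contributes one drive-year.
    drive_years = drives * days_observed / 365.0
    return failures / drive_years


# Hypothetical inputs for illustration only (not Triton data):
# 3 failures among 100 drives observed for 90 days.
afr = annualized_failure_rate(failures=3, drives=100, days_observed=90)
print(f"Estimated AFR: {afr:.1%}")
```

Because the observation window is scaled to a full year, a handful of failures in the first three months translates into a noticeably higher annualized figure than the raw count suggests.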
We are still not backing up user home areas, though all data is protected against double disk failure. To get to the point where we believe that snapshot/replication will not cause hangs, we must root out the marginal drives in the storage arrays. This will take some time.
In the interim, user home area storage should be reliable, but there is the possibility that the home area server will hang. We'll keep watching and try to react quickly if it does. Thanks for your patience, and please continue to help us keep on top of reliability issues by posting to the discussion list whenever you have problems.
Update (1 p.m. Thursday, Nov. 12): On our backup server, we were able to duplicate the problem and force a core dump. All available data have been uploaded to Sun, who are doing a post-mortem to isolate the root cause.
Sun confirmed an issue with Solaris U6 (currently running on the primary ZFS) involving snapshots and incremental ZFS sends. While the support folks were happy that we upgraded to U8 on the backup server, the fact that we are still locking up the file system is puzzling.
After firmware updates and U8 installation on the backup server, two drives (of the 48 in our ZFS configuration) remained non-functional. Those will eventually be repaired, but could be related to the root cause. Our ZFS data are still intact, due to mirroring that allows any two drives to fail. For most users, the backup contains home area data up through Nov. 6. Data deposited after this are not being backed up, and users are advised to make alternate plans for safekeeping of such data until the problem is resolved in production.
We are hopeful of a more definitive answer from Sun as they pore over the crash dump. Until the backup system issue is resolved, production will remain in its current configuration. Due to the extra attention on this problem, production is being very closely monitored and should be extremely reliable despite the flaw, since administrative support is likely to respond very quickly during the investigation.
Update (9 a.m. Tuesday, Nov. 10): We still do not have a root cause for the ZFS failure that is causing temporary, intermittent login node unavailability. No updates will be made to production servers until the actual cause is known. Currently, no backups are being performed on user home areas, so users may want to take extra precautions with data there until the issue is resolved. Testing to determine the root cause is continuing on our backup server, and production ZFS, login, and all compute nodes are fully functional (except for the ZFS backups).
Update (3 p.m. Monday, Nov. 9): The latest Solaris update did not resolve the ZFS problem — a failure occurred during the pool scrub on the backup server, resulting in a frozen ZFS subsystem. The root cause of the filesystem failures is still unknown at this time. The latest SAS controller patches are being installed on the backup server, and a new pool scrub test will be performed. A new downtime will be scheduled, possibly still today.
Issues with the Triton ZFS server will be addressed by a brief outage at a time yet to be determined. During this outage, we expect that most running jobs should complete, but a few may experience early terminations. Running jobs will be inventoried prior to the upgrade and refunds will be available for affected jobs.
We've upgraded the backup server and are currently running tests to locate the root cause of the failure. We apologize for the ongoing inconvenience this problem has caused. If the primary server becomes inaccessible before the scheduled upgrade, the maintenance will be combined with our response to that failure so that service is restored with a single outage.
Production Phase Announcement: The full production phase of the Triton Resource began on Monday, October 5, 2009. The Early Adopter phase ended at that time.
What this means for users
Triton's migration to the charged-for usage model was completed on Monday, October 5 with the implementation of the usage accounting service. Early Adopter accounts are no longer being created or renewed, and TAPP or project allocations are now required to run jobs. This marks the beginning of the full production phase of Triton.
If an allocation runs out of SUs, TAPP procedures should be followed to extend or renew the account. Triton system administrators will not be authorized to replenish accounts the way they did during the Early Adopter phase.
Refunds for certain failed jobs and system errors will be considered on a case-by-case basis. Please direct requests to the discussion list.
Users can discover what their calculations will cost, view their usage statements, and check the status of their accounts by running the mybalance and gstatement -u $USER commands.
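As a convenience, the two commands named above could be wrapped in a small script. This is a minimal sketch, assuming both commands are on the PATH of a Triton login session. The helper names account_commands and show_account_status are hypothetical, and only the invocations given above (mybalance and gstatement -u with a user name) are used.

```python
import os
import shutil
import subprocess


def account_commands(user: str) -> list[list[str]]:
    """Build the two accounting commands named above for the given user."""
    return [["mybalance"], ["gstatement", "-u", user]]


def show_account_status(user: str) -> list[str]:
    """Run each command that is installed; return the names of those skipped."""
    skipped = []
    for cmd in account_commands(user):
        # Skip commands that are not installed (e.g. when run off Triton).
        if shutil.which(cmd[0]) is None:
            skipped.append(cmd[0])
            continue
        subprocess.run(cmd, check=False)
    return skipped


if __name__ == "__main__":
    # Mirrors the $USER variable used in the command above.
    current_user = os.environ.get("USER", "unknown")
    for name in show_account_status(current_user):
        print(f"{name}: not found on this machine, skipped")
```

Running the script off Triton simply reports both commands as skipped, which makes it safe to test locally.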
Details on the latest changes and policy decisions can be found on the following FAQ pages:
Both TCC and PDAF were upgraded to Release Candidate 2 in preparation for full production usage and accounting. The system will remain in Early Adopter mode for about two weeks. The upgrade maintenance went smoothly and required about six hours of downtime to update the login node and all compute nodes.
A security patch was applied to Triton on August 14 between 2 and 3 PM PDT. This patch was necessary to close a local privilege security vulnerability first reported on August 11. The RHEL patch became available on August 14 and was installed on the Triton login node almost immediately. Details of the patch are available on Bugzilla. The login node update began at approximately 14:20 PDT and was completed by about 15:10 PDT.
Completion of this security patch accomplished the following:
A full reinstallation of Triton was performed on July 23, and completed within the expected 2-3 hour window, after which Triton was again running normally.
The cluster's public IP addresses were changed during this maintenance. The IP address of the login node was changed to 132.249.122.43.
User home data areas were restored intact.
During a planned outage on July 20, the Mirage Lustre servers were physically moved to a new rack and new power. Two dead LUNs were also recovered so that all 100 storage targets are currently available. Mirage is now mounted on the login node and all compute nodes. All 100 TB are now available on /mirage.
The Triton Resource is in full production. TRITON_RC2 (Release Candidate 2) is installed, and full job accounting is in effect.
This site will be kept up-to-date as node statuses change, or when the system has a scheduled maintenance. Currently, all of the nodes are in service and available via the scheduler. When nodes undergo unplanned maintenance, this site will be updated and messages will be posted on the discussion list and Triton's Twitter feed.
Early Adopter accounts were reset with a complimentary 1000 SU balance on October 5. You can contact the Mailman discussion list (triton-discuss@sdsc.edu) with usage and general access questions. You can join the list here.
Triton's exceptional data-intensive computing power is now available to the University of California HPC research community.
If you have an account and are ready to access the Triton Resource, please visit the User Access page for details and to obtain login information. For information on first-time logins to the Triton Resource, please read the New User page. To request an account, please use TAPP.
To read about the current hardware status and get details of the system building process, read the Triton Resource blog.
Triton's compute components moved to production on October 5, 2009. Early Adopters helped to identify software needs and support requirements starting in July. Users and potential users are encouraged to continue sending feedback and suggestions to the Triton support team.
The 28 large-memory nodes of the PDAF provide some of the most extensive data analysis power available commercially or at any research institution in the country. The cluster includes four special nodes dedicated to database server interaction.
The 256-node TCC is a Rocks cluster with 24 gigabytes of memory and eight processing cores on each node.
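The per-node figures above imply the cluster's aggregate capacity. The short sketch below works out the arithmetic, assuming every one of the 256 nodes is identical and in service; the variable names are illustrative only.

```python
# Per-node figures as stated above for the TCC.
TCC_NODES = 256
CORES_PER_NODE = 8
MEMORY_GB_PER_NODE = 24

# Aggregate capacity, assuming all nodes are identical and in service.
total_cores = TCC_NODES * CORES_PER_NODE
total_memory_gb = TCC_NODES * MEMORY_GB_PER_NODE
memory_gb_per_core = MEMORY_GB_PER_NODE / CORES_PER_NODE

print(f"Total cores: {total_cores}")
print(f"Total memory: {total_memory_gb} GB")
print(f"Memory per core: {memory_gb_per_core:.0f} GB")
```

Under these assumptions the TCC offers 2048 cores and 6144 GB of memory in total, or 3 GB per core.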
For general and long-term access to Triton Resource, users are asked to request an allocation through the Triton Affiliates and Partners program, or TAPP. Once the Early Adopter phase is completed and Triton is in full production, this will be the primary way for users to gain access to Triton for running jobs and conducting research.
The Triton compute resources are now in full production. The resource received its production certification with the deployment and acceptance of TRITON_RC2 on October 5, 2009.
All 28 PDAF nodes and all 256 TCC nodes are generally available. One or more of the compute nodes are occasionally set aside for staff testing and development of OS and software packages.
Open a Ticket with Triton Resource Support using the Support Ticket Form.
Join the Discussion Forum: Sign up for our Email Discussion List.
FAQ: Read the FAQ Page.
