Friday, November 20th 2009 09:43:01 PM PST
tcc-3-71.local
The Triton Resource provides easily accessible, affordable, high-performance and data-intensive compute resources to UCSD researchers, faculty, affiliates, government and commercial partners through innovative, locally supported, scalable hardware and software over multiple 10-gigabit networks extending from campus laboratories to the UC network, California, and the US.
Update (3:00 p.m. Friday, Nov. 20): Login Node Replacement
The ordered replacement node, plus one additional new node, is scheduled to ship Monday, Nov. 23. We hope to have both installed by late next week. This upgrade will double the capacity of our login service, provide better front-end server hardware, and improve support response times during outages by giving admins better remote access. In the meantime, the temporary node will continue to serve, and we'll keep a close watch to ensure the greatest possible availability for users.
Update (10:00 a.m. Friday, Nov. 13): Login Node Failure
There appears to be a hardware issue with the login node. The node will not boot from the network. A temporary replacement node was installed and activated prior to 10 a.m. today.
Update (2:30 p.m. Thursday, Nov. 12): Latest from Sun on Crash Dump Analysis
Sun confirmed a bug that is triggered when a storage pool sees multiple simultaneous errors. The storage pool is suspended, and all subsequent operations then hang instead of timing out. There is currently no fix for the bug other than ensuring the physical integrity of the disks.
We suspect the pool is suspended by design, to avoid corrupting the file system beyond repair. It's likely that when our systems were built, the disk drives in one serial number range were slightly out of spec. We have identified self-monitored prefailure warnings on some drives.
A paper from Google Labs, Failure Trends in a Large Disk Drive Population (PDF), discusses the Annualized Failure Rate (AFR) of disk drives in a very large disk farm. The first three months of disk life in the study farm show approximately three times higher failure rates on high-utilization drives than on medium- and low-utilization ones. In terms of actual usage, Triton is likely within this three-month range, so our failure rate is not unexpected.
We are still not backing up user home areas, though all data is protected against double disk failure. To get to the point where we believe that snapshot/replication will not cause hangs, we must root out the marginal drives in the storage arrays. This will take some time.
In the interim, user home area storage should be reliable, but there is the possibility that the home area server will hang. We'll keep watching and try to react quickly if it does. Thanks for your patience, and please continue to help us keep on top of reliability issues by posting to the discussion list whenever you have problems.
Update (1 p.m. Thursday, Nov. 12): On our backup server, we were able to duplicate the problem and force a core dump. All available data have been uploaded to Sun, who are doing a post-mortem to isolate the root cause.
Sun confirmed an issue with Solaris U6 (currently running on the primary ZFS) involving snapshots and incremental ZFS sends. While the support folks were happy that we upgraded to U8 on the backup server, the fact that we are still locking up the file system is puzzling.
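For readers unfamiliar with the mechanism, the following is a minimal sketch of a snapshot followed by an incremental ZFS send, the kind of operation the confirmed issue involves. It is not Triton's actual backup procedure: the function name replicate_incremental, the dataset names, the snapshot names, and the backup host are all hypothetical placeholders. Only the zfs snapshot, zfs send -i, and zfs receive subcommands are standard ZFS tools.

```shell
#!/bin/sh
# Sketch only: every name passed to this function is a hypothetical placeholder.
# Usage: replicate_incremental SRC_DATASET PREV_SNAP NEW_SNAP BACKUP_HOST DEST_DATASET
replicate_incremental() {
    src=$1; prev=$2; new=$3; host=$4; dest=$5

    # Take a new point-in-time snapshot of the source dataset.
    zfs snapshot "${src}@${new}" || return 1

    # Send only the blocks that changed between the previous snapshot and
    # the new one, and apply them to the copy held on the backup host.
    zfs send -i "${src}@${prev}" "${src}@${new}" |
        ssh "$host" zfs receive "$dest"
}
```

A call might look like replicate_incremental tank/home nov05 nov06 backuphost backup/home, assuming the previous snapshot already exists on both the source and the backup side.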
After firmware updates and U8 installation on the backup server, two drives (of the 48 in our ZFS configuration) remained non-functional. Those drives will eventually be repaired, but they could be related to the root cause. Our ZFS data are still intact, due to mirroring that allows any two drives to fail. For most users, the backup contains home area data up through Nov. 6. Data deposited after that date are not being backed up, and users are advised to make alternate plans for safekeeping such data until the problem is resolved in production.
We hope for a more definitive answer from Sun as they pore over the crash dump. Until the backup system issue is resolved, production will remain in its current configuration. Due to the extra attention on this problem, production is being very closely monitored and should be extremely reliable despite the flaw, since administrative support is likely to respond very quickly during the investigation.
Update (9 a.m. Tuesday, Nov. 10): We still do not have a root cause for the ZFS failure that is causing temporary, intermittent login node unavailability. No updates will be made to production servers until the actual cause is known. Currently, no backups are being performed on user home areas, so users may want to take extra precautions with data there until the issue is resolved. Testing to determine the root cause is continuing on our backup server, and production ZFS, login, and all compute nodes are fully functional (except for the ZFS backups).
Update (3 p.m. Monday, Nov. 9): The latest Solaris update did not resolve the ZFS problem — a failure occurred during the pool scrub on the backup server, resulting in a frozen ZFS subsystem. The root cause of the filesystem failures is still unknown at this time. The latest SAS controller patches are being installed on the backup server, and a new pool scrub test will be performed. A new downtime will be scheduled, possibly still today.
Issues with the Triton ZFS server will be addressed during a brief outage at a time yet to be determined. During this outage, we expect that most running jobs will complete, but a few may terminate early. Running jobs will be inventoried prior to the upgrade, and refunds will be available for affected jobs.
We've upgraded the backup server and are currently running tests to locate the root cause of the failure. We apologize for the ongoing inconvenience this problem has caused. If the primary server becomes inaccessible before the scheduled upgrade, this maintenance will be combined with our response to that failure so that the service is completed in a single outage.
Production Phase Announcement: The full production phase of the Triton Resource began on Monday, October 5, 2009. The Early Adopter phase ended at that time.
What this means for users
Triton's migration to the charged-for usage model was completed on Monday, October 5 with the implementation of the usage accounting service. Early Adopter accounts are no longer being created or renewed, and TAPP or project allocations are now required to run jobs. This marks the beginning of the full production phase of Triton.
If an allocation runs out of SUs, TAPP procedures should be followed to extend or renew the account. Triton system administrators will not be authorized to replenish accounts the way they did during the Early Adopter phase.
Refunds for certain failed jobs and system errors will be considered on a case-by-case basis. Please direct requests to the discussion list.
Users can discover what their calculations will cost and view their usage statements by running the mybalance and gstatement -u $USER commands.
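As a rough illustration, the two commands named above can be wrapped in a small shell function. This is a sketch under stated assumptions: mybalance and gstatement -u $USER come from the text, while the wrapper name account_status and its error message are hypothetical and not a Triton-provided tool. The function checks that both commands are installed before running them.

```shell
#!/bin/sh
# Show the account balance and usage statement for the current user.
# mybalance and gstatement are the commands named in the text; they exist
# only where the accounting tools are installed. The wrapper function
# account_status is a hypothetical convenience.
account_status() {
    for cmd in mybalance gstatement; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "$cmd not found; run this where the accounting tools are installed" >&2
            return 1
        fi
    done
    mybalance || return 1       # account balance
    gstatement -u "$USER"       # usage statement for this user
}
```

After defining the function in a login session, running account_status prints the balance followed by the statement.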
Details on the latest changes and policy decisions can be found on the following FAQ pages:
Both TCC and PDAF were upgraded to Release Candidate 2 in preparation for full production usage and accounting. The system will remain in Early Adopter mode for about two weeks. The upgrade maintenance went smoothly and required about six hours of downtime to update the login node and all compute nodes.
A security patch was applied to Triton on August 14 between 2 and 3 PM PDT. This patch was necessary to close a local privilege security vulnerability first reported on August 11. The RHEL patch became available on August 14 and was installed on the Triton login node almost immediately. Details of the patch are available on Bugzilla. The login node update began at approximately 14:20 PDT and was completed by about 15:10 PDT.
Completion of this security patch accomplished the following:
A full reinstallation of Triton was performed on July 23, and completed within the expected 2-3 hour window, after which Triton was again running normally.
The cluster's public IP addresses were changed during this maintenance. The IP address of the login node was changed to 132.249.122.43.
User home data areas were restored intact.
During a planned outage on July 20, the Mirage Lustre servers were physically moved to a new rack and new power. Two dead LUNs were also recovered so that all 100 storage targets are currently available. Mirage is now mounted on the login node and all compute nodes. All 100 TB are now available on /mirage.
The Triton Resource is in full production. TRITON_RC2 (Release Candidate 2) is installed, and full job accounting is in effect.
This site will be kept up-to-date as node statuses change, or when the system has a scheduled maintenance. Currently, all of the nodes are in service and available via the scheduler. When nodes undergo unplanned maintenance, this site will be updated and messages will be posted on the discussion list and Triton's Twitter feed.
Early Adopter accounts were reset with a complimentary 1000 SU balance on October 5. You can contact the discussion mailing list (triton-discuss@sdsc.edu) with usage and general access questions. You can join the list here.
Triton's exceptional data-intensive computing power is now available to the University of California HPC research community.
If you have an account and are ready to access the Triton Resource, please visit the User Access page for details and to obtain login information. For information on first-time logins to the Triton Resource, please read the New User page. To request an account, please use TAPP.
To read about the current hardware status and get details of the system building process, read the Triton Resource blog.
Triton's compute components moved to production on October 5, 2009. Early Adopters helped to identify software needs and support requirements starting in July. Users and potential users are encouraged to continue sending feedback and suggestions to the Triton support team.
The 28 large-memory nodes of the PDAF provide some of the most extensive data analysis power available commercially or at any research institution in the country. The cluster includes four special nodes dedicated to database server interaction.
The 256-node TCC is a Rocks cluster with 24 gigabytes of memory and eight processing cores on each node.
For general and long-term access to Triton Resource, users are asked to request an allocation through the Triton Affiliates and Partners program, or TAPP. Once the Early Adopter phase is completed and Triton is in full production, this will be the primary way for users to gain access to Triton for running jobs and conducting research.
The Triton compute resources are now in full production. The resource received its production certification with the deployment and acceptance of TRITON_RC2 on October 5, 2009.
All 28 PDAF nodes and all 256 TCC nodes are generally available. One or more of the compute nodes are occasionally set aside for staff testing and development of OS and software packages.
Open a Ticket with Triton Resource Support using the Support Ticket Form.
Join the Discussion Forum: Sign up for our Email Discussion List.
FAQ: Read the FAQ Page.
