Statistical Learning, Optimization, and Autonomic Resource Management
on Networked Computer Systems

Computer systems research has traditionally focused on performance and speed. As computers become increasingly networked and complex, attention has shifted toward system security and reliability. Large-scale networked computer systems can no longer be managed manually. For example, a recent IBM internal study of the 512-node ASC White machine installed at LLNL showed that the mean time to failure of a node is about 160 days. Across the whole machine, that implies about 3 node failures every day (512 / 160 = 3.2). If the same failure model is applied to IBM's latest BlueGene machines, there would be more than a hundred node failures per hour!
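
The arithmetic behind the ASC White figure can be sketched in a few lines of Python. This is only an illustration, not part of the IBM study: it assumes nodes fail independently at a constant rate of 1/MTTF, and the function names are hypothetical.

```python
def expected_failures_per_day(num_nodes: int, node_mttf_days: float) -> float:
    """Expected node failures per day across the whole machine.

    Assumption (not from the study): nodes fail independently at a
    constant rate of 1 / MTTF, so the rates simply add up.
    """
    if num_nodes < 0 or node_mttf_days <= 0:
        raise ValueError("num_nodes must be >= 0 and node_mttf_days > 0")
    return num_nodes / node_mttf_days


def machine_mtbf_hours(num_nodes: int, node_mttf_days: float) -> float:
    """Mean time, in hours, between node failures anywhere in the machine."""
    if num_nodes <= 0 or node_mttf_days <= 0:
        raise ValueError("num_nodes and node_mttf_days must be > 0")
    return 24.0 * node_mttf_days / num_nodes


# Figures quoted in the text: 512 nodes, per-node MTTF of about 160 days.
per_day = expected_failures_per_day(512, 160.0)
gap_hours = machine_mtbf_hours(512, 160.0)
print(f"expected node failures per day: {per_day:.1f}")
print(f"mean time between failures: {gap_hours:.1f} hours")
```

Under this model the failure rate grows linearly with the node count, which is why a per-node MTTF measured in months still translates into frequent failures on a machine with many more nodes.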

The objective of SILOAM is twofold: to develop statistical machine learning technologies that characterize uncertainty in networked systems, and to develop stochastic scheduling and autonomic resource management strategies for adaptive, highly reliable, and self-managing systems.

The SILOAM project was funded by the U.S. National Science Foundation under grants DMS-0624849 and CCF-0611750. See Nuggets 2006 for a summary of recent achievements.