Statistical Learning, Optimization, and Autonomic Resource Management
on Networked Computer Systems
Computer systems research has traditionally focused on performance and
speed. As computers become increasingly networked and complex, security
and reliability have become the greater concerns. Large-scale networked
computer systems can no longer be managed manually. For example, a recent
IBM internal study of the 512-node ASC White machine installed at LLNL
showed that the mean time to failure of a node is about 160 days. This
implies 3 to 4 node failures every day. If the same failure model is applied
to IBM's latest BlueGene machines, there would be more than a hundred
node failures per hour!
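The failure arithmetic above can be checked with a short sketch. It assumes that nodes fail independently at a constant rate, so the expected number of failures per day is the node count divided by the per-node mean time to failure. The function name and the independence assumption are illustrative and are not taken from the IBM study.

```python
def expected_failures_per_day(num_nodes: int, mttf_days: float) -> float:
    """Expected node failures per day.

    Assumes (for illustration only) that each node fails independently
    at a constant rate of 1 / mttf_days failures per day.
    """
    if num_nodes < 0 or mttf_days <= 0:
        raise ValueError("num_nodes must be >= 0 and mttf_days must be > 0")
    return num_nodes / mttf_days


if __name__ == "__main__":
    # Figures quoted in the text: 512 nodes, per-node MTTF of about 160 days.
    rate = expected_failures_per_day(512, 160.0)
    print(f"{rate:.1f} failures per day")  # prints "3.2 failures per day"
```

Under this model the failure rate grows linearly with the number of nodes, which is why much larger machines are expected to see failures far more often.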
The objective of SILOAM is to develop statistical machine learning techniques
that characterize uncertainty in networked systems, together with stochastic
scheduling and autonomic resource management strategies for adaptive, highly
reliable, and self-managing systems.
The SILOAM project was funded by the U.S. National Science Foundation under
grants DMS-0624849 and CCF-0611750. See Nuggets 2006 for a summary of recent achievements.
- J. Wei and C. Xu, eQoS: Provisioning of client-experienced end-to-end QoS guarantees in Web server, IEEE Trans. on Computers, 2006 (in press).
- X. Zhou, J. Wei, and C. Xu, Resource allocation for session-based 2D service differentiation on e-commerce servers, IEEE Trans. on Parallel and Distributed Systems, Vol. 17(8):838-850, August 2006.
- J. Wei, X. Zhou, and C. Xu, Robust processing rate allocation for proportional slowdown differentiation on Internet servers, IEEE Trans. on Computers, Vol. 54(8):964-977, August 2005.
- X. Zhou and C. Xu, Harmonic proportional bandwidth allocation for service differentiation on streaming servers, IEEE Trans. on Parallel and Distributed Systems, Vol. 15(9):835-848, Sept. 2004.