DieCast: Testing Distributed Systems with an Accurate Scale Model

Published: 01 May 2011

Abstract

Large-scale network services can consist of tens of thousands of machines running thousands of unique software configurations spread across hundreds of physical networks. Testing such services for complex performance problems and configuration errors remains a difficult problem. Existing testing techniques, such as simulation or running smaller instances of a service, have limitations in predicting overall service behavior at such scales.

Testing large services should ideally be done at the same scale and configuration as the target deployment, which can be technically and economically infeasible. We present DieCast, an approach to scaling network services in which we multiplex all of the nodes in a given service configuration as virtual machines across a much smaller number of physical machines in a test harness. We show how to accurately scale CPU, network, and disk to provide the illusion that each VM matches a machine in the original service in terms of both available computing resources and communication behavior. We present the architecture and evaluation of a system we built to support such experimentation and discuss its limitations. We show that for a variety of services (including a commercial high-performance cluster-based file system) and resource utilization levels, DieCast matches the behavior of the original service while using a fraction of the physical resources.


Published In

ACM Transactions on Computer Systems  Volume 29, Issue 2
May 2011
132 pages
ISSN:0734-2071
EISSN:1557-7333
DOI:10.1145/1963559
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 May 2011
Accepted: 01 December 2010
Revised: 01 December 2010
Received: 01 May 2010
Published in TOCS Volume 29, Issue 2

Author Tags

  1. Virtualization
  2. Xen
  3. network emulation

Qualifiers

  • Research-article
  • Research
  • Refereed
