research-article

WfCommons: A framework for enabling scientific workflow research and development

Published: 01 March 2022

Abstract

Scientific workflows are a cornerstone of modern scientific computing. They are used to describe complex computational applications that require efficient and robust management of large volumes of data, which are typically stored/processed on heterogeneous, distributed resources. The workflow research and development community has employed a number of methods for the quantitative evaluation of existing and novel workflow algorithms and systems. In particular, a common approach is to simulate workflow executions. In previous works, we have presented a collection of tools that have been adopted by the community for conducting workflow research. Despite their popularity, they suffer from several shortcomings that prevent easy adoption, maintenance, and consistency with the evolving structures and computational requirements of production workflows. In this work, we present WfCommons, a framework that provides a collection of tools for analyzing workflow executions, for producing generators of synthetic workflows, and for simulating workflow executions. We demonstrate the realism of the generated synthetic workflows by comparing their simulated executions to real workflow executions. We also contrast these results with results obtained when using the previously available collection of tools. We find that the workflow generators that are automatically constructed by our framework not only generate representative same-scale workflows (i.e., with structures and task characteristics distributions that resemble those observed in real-world workflows), but also do so at scales larger than that of available real-world workflows. Finally, we conduct a case study to demonstrate the usefulness of our framework for estimating the energy consumption of large-scale workflow executions.

Highlights

Archival and automatic analysis of real workflow instances.
Automatic construction of realistic synthetic workflow generators.
Use of synthetic workflow instances in the energy-efficiency context.
Open-source Python package for workflow analysis and generation.
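
The abstract describes generators that reproduce the structure and task characteristics of real workflows and can also produce larger instances. The following is a minimal sketch of that general idea using only the Python standard library. It is not the WfCommons API: the toy instance, the fork-join shape, the normal-distribution sampling, and every function name here are assumptions made for illustration.

```python
import random
import statistics
from collections import defaultdict

# Hypothetical toy "real" instance: task name -> (category, runtime in s, parents).
REAL_INSTANCE = {
    "split": ("split", 4.0, []),
    "proc_1": ("process", 10.0, ["split"]),
    "proc_2": ("process", 12.0, ["split"]),
    "proc_3": ("process", 11.0, ["split"]),
    "merge": ("merge", 6.0, ["proc_1", "proc_2", "proc_3"]),
}


def summarize_runtimes(instance):
    """Return {category: (mean, population stdev)} of task runtimes."""
    by_category = defaultdict(list)
    for category, runtime, _parents in instance.values():
        by_category[category].append(runtime)
    return {
        category: (statistics.mean(values), statistics.pstdev(values))
        for category, values in by_category.items()
    }


def generate_fork_join(stats, width, seed=0):
    """Build a synthetic split -> `width` process tasks -> merge workflow.

    Runtimes are sampled from a per-category normal distribution and clamped
    to a small positive value (a simplification chosen for this sketch).
    """
    rng = random.Random(seed)

    def sample(category):
        mean, stdev = stats[category]
        return max(0.1, rng.gauss(mean, stdev))

    workflow = {"split": ("split", sample("split"), [])}
    workers = [f"proc_{i}" for i in range(1, width + 1)]
    for name in workers:
        workflow[name] = ("process", sample("process"), ["split"])
    workflow["merge"] = ("merge", sample("merge"), workers)
    return workflow


def critical_path(workflow):
    """Length of the longest runtime-weighted path (a lower bound on makespan)."""
    finish = {}

    def finish_time(name):
        if name not in finish:
            _category, runtime, parents = workflow[name]
            latest_parent = max((finish_time(p) for p in parents), default=0.0)
            finish[name] = runtime + latest_parent
        return finish[name]

    return max(finish_time(name) for name in workflow)


stats = summarize_runtimes(REAL_INSTANCE)
synthetic = generate_fork_join(stats, width=100)
print(len(synthetic), "tasks; critical path:", round(critical_path(synthetic), 2))
```

Comparing a metric such as the critical path between the original and the synthetic instance mirrors, in a very reduced form, the kind of realism check the abstract mentions; the framework itself relies on simulated executions rather than this simple bound.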


Published In

Future Generation Computer Systems, Volume 128, Issue C
Mar 2022
570 pages

Publisher

Elsevier Science Publishers B. V.

Netherlands

Author Tags

  1. Scientific workflows
  2. Workflow management systems
  3. Simulation
  4. Distributed computing
  5. Workflow instances

Qualifiers

  • Research-article

Cited By

  • (2024) Mapping Large Memory-constrained Workflows onto Heterogeneous Platforms. Proceedings of the 53rd International Conference on Parallel Processing, pp. 305-316. doi:10.1145/3673038.3673068. Online publication date: 12-Aug-2024
  • (2024) Optimizing data regeneration and storage with data dependency for cloud scientific workflow systems. Expert Systems with Applications: An International Journal 238:PD. doi:10.1016/j.eswa.2023.121984. Online publication date: 15-Mar-2024
  • (2023) Automated generation of scientific workflow generators with WfChef. Future Generation Computer Systems 147:C, pp. 16-29. doi:10.1016/j.future.2023.04.031. Online publication date: 1-Oct-2023
  • (2023) Precise makespan optimization via hybrid genetic algorithm for scientific workflow scheduling problem. Natural Computing: an international journal 22:4, pp. 615-630. doi:10.1007/s11047-023-09950-5. Online publication date: 1-Dec-2023
  • (2023) Benchmarking DAG Scheduling Algorithms on Scientific Workflow Instances. Supercomputing, pp. 3-20. doi:10.1007/978-3-031-49435-2_1. Online publication date: 25-Sep-2023
  • (2023) Scheduling of Workflows with Task Resource Requirements in Cluster Environments. Parallel Computing Technologies, pp. 177-196. doi:10.1007/978-3-031-41673-6_14. Online publication date: 21-Aug-2023
  • (2022) Mutation and dynamic objective-based farmland fertility algorithm for workflow scheduling in the cloud. Journal of Parallel and Distributed Computing 164:C, pp. 69-82. doi:10.1016/j.jpdc.2022.02.005. Online publication date: 1-Jun-2022
