research-article

Fault-Tolerant Scheduling for Real-Time Scientific Workflows with Elastic Resource Provisioning in Virtualized Clouds

Published: 01 December 2016

Abstract

Clouds are becoming an important platform for scientific workflow applications. However, as clouds deploy many nodes, managing the reliability of resources becomes a critical issue, especially for real-time scientific workflows, whose deadlines must be met. Fault tolerance in clouds is therefore essential. Primary-backup (PB) scheduling is a popular fault-tolerance technique that has been used effectively in cluster and grid computing. However, applying it to real-time workflows in a virtualized cloud is much more complicated and has rarely been studied. In this paper, we address this problem. We first establish a real-time workflow fault-tolerant model that extends the traditional PB model by incorporating cloud characteristics. Based on this model, we develop approaches for task allocation and message transmission that ensure faults can be tolerated during workflow execution. Finally, we propose FASTER, a dynamic fault-tolerant scheduling algorithm for real-time workflows in the virtualized cloud. FASTER has three key features: 1) it employs a backward shifting method to make full use of idle resources and incorporates task overlapping and VM migration for high resource utilization; 2) it applies the vertical/horizontal scaling-up technique to quickly provision resources for a burst of workflows; and 3) it uses a vertical scaling-down scheme to avoid unnecessary and ineffective resource changes caused by fluctuating workflow requests. We evaluate FASTER with synthetic workflows and with workflows collected from real scientific and business applications, and compare it with six baseline algorithms. The experimental results demonstrate that FASTER effectively improves resource utilization and schedulability, even in the presence of node failures in virtualized clouds.
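The primary-backup idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the FASTER algorithm, and the abstract does not specify its model. The sketch only assumes the generic passive PB scheme: each task gets a primary copy and a backup copy on different hosts, the backup starts no earlier than the primary's expected finish, and both must finish by the deadline. All names (`Host`, `Placement`, `place_task`, `speed`, `ready_time`) are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Host:
    name: str
    speed: float              # work units per second (hypothetical unit)
    ready_time: float = 0.0   # earliest time the host is free


@dataclass
class Placement:
    primary: str
    backup: str
    primary_finish: float
    backup_finish: float


def place_task(work: float, deadline: float,
               hosts: List[Host]) -> Optional[Placement]:
    """Place a primary and a passive backup copy on two different hosts.

    Returns None when the task cannot be made fault-tolerant within
    its deadline under the assumptions stated above.
    """
    if len(hosts) < 2:
        return None  # a backup needs a host other than the primary's

    # Primary: the host on which the task finishes earliest.
    primary = min(hosts, key=lambda h: h.ready_time + work / h.speed)
    p_finish = primary.ready_time + work / primary.speed
    if p_finish > deadline:
        return None

    # Backup: any other host; assumed to start only after the primary's
    # expected finish, since a passive backup runs only on failure.
    def backup_finish(h: Host) -> float:
        return max(h.ready_time, p_finish) + work / h.speed

    backup = min((h for h in hosts if h is not primary), key=backup_finish)
    b_finish = backup_finish(backup)
    if b_finish > deadline:
        return None

    return Placement(primary.name, backup.name, p_finish, b_finish)
```

For example, with a fast host `h1` (speed 2.0) and a slow host `h2` (speed 1.0), a task of 4.0 work units and deadline 10.0 gets its primary on `h1` and its backup on `h2`. Tightening the deadline to 5.0 makes the backup infeasible, so the task is rejected even though the primary alone would meet the deadline.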

Published In

IEEE Transactions on Parallel and Distributed Systems, Volume 27, Issue 12
December 2016
304 pages

Publisher

IEEE Press
