Scalable Random Forest with Data-Parallel Computing

Published: 28 August 2023

Abstract

In recent years, the quantity of available data and computational resources has increased significantly, leading the scientific and industrial communities to pursue more accurate and efficient Machine Learning (ML) models. Random Forest is a well-known ML algorithm, owing to the good results it obtains across a wide range of problems. Our objective is a parallel version of the algorithm that can build a model from data distributed across different processors and that scales with the available computational resources. This paper presents two novel data-parallel variants of the algorithm. The first is implemented with the PyCOMPSs framework and its failure-management mechanism, while the second uses the new PyCOMPSs nesting paradigm, in which parallel tasks can generate further tasks within them. Both approaches are compared with each other and against the Apache Spark MLlib Random Forest using strong and weak scaling tests. Our findings indicate that while the MLlib implementation is faster when executed on a small number of nodes, the scalability of both new variants is superior. We conclude that the proposed data-parallel approaches to the Random Forest algorithm can effectively generate accurate and efficient models in a distributed computing environment and offer improved scalability over existing methods.
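As an illustration of the task-based scheme the abstract describes, the sketch below trains each tree of the forest as an independent PyCOMPSs task on a block of the distributed dataset and combines predictions by majority vote. This is a minimal sketch, not the authors' implementation: the function names (fit_tree, fit_forest), the use of scikit-learn's DecisionTreeClassifier as the per-task learner, and the assumption of integer class labels are illustrative choices.

```python
# Minimal sketch of a data-parallel Random Forest with PyCOMPSs (illustrative,
# not the paper's code). Each tree is trained as an asynchronous task on one
# block of the distributed dataset; results are gathered with compss_wait_on.
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # assumed per-task learner

from pycompss.api.task import task
from pycompss.api.api import compss_wait_on


@task(returns=1)
def fit_tree(x_block, y_block, seed):
    """Train one decision tree on a bootstrap sample of a data block."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(x_block), size=len(x_block))  # bootstrap sample
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=seed)
    return tree.fit(x_block[idx], y_block[idx])


def fit_forest(x_blocks, y_blocks, n_trees):
    """Spawn one task per tree; the PyCOMPSs runtime schedules them
    across the available nodes."""
    futures = [fit_tree(x_blocks[i % len(x_blocks)],
                        y_blocks[i % len(y_blocks)], seed=i)
               for i in range(n_trees)]
    return compss_wait_on(futures)  # block until all trees are trained


def predict(trees, x):
    """Majority vote over the ensemble (assumes integer class labels)."""
    votes = np.stack([t.predict(x) for t in trees])
    return np.apply_along_axis(
        lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)
```

This flat scheme corresponds to the first variant; in the nesting variant, a task such as fit_tree could itself spawn sub-tasks (for example, to evaluate candidate splits in parallel), which is the capability the abstract attributes to the new PyCOMPSs nesting paradigm. Under COMPSs, such a script would typically be launched with the runcompss command rather than plain python.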

References

[1]
Azizah, N., Riza, L.S., Wihardi, Y.: Implementation of random forest algorithm with parallel computing in R. J. Phys.: Conf. Ser. 1280(2) (2019)
[2]
Baldi, P., Sadowski, P., Whiteson, D.: Searching for exotic particles in high-energy physics with deep learning. Nature Commun. 5(1), 4308 (2014)
[3]
Ben-Haim, Y., Tom-Tov, E.: A streaming parallel decision tree algorithm. J. Mach. Learn. Res. 11(2) (2010)
[4]
Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and Regression Trees. Wadsworth, Belmont, CA (1984)
[5]
Chen, J., et al.: A parallel random forest algorithm for big data in a Spark cloud computing environment. IEEE Trans. Parallel Distrib. Syst. 28(4), 919–933 (2016)
[6]
Cid-Fuentes, J.Á., Solà, S., Álvarez, P., Castro-Ginard, A., Badia, R.M.: dislib: large scale high performance machine learning in Python. In: 2019 15th International Conference on eScience (eScience), pp. 96–105. IEEE (2019)
[7]
Ejarque, J., Bertran, M., Cid-Fuentes, J.Á., Conejero, J., Badia, R.M.: Managing failures in task-based parallel workflows in distributed computing environments. In: Malawski, M., Rzadca, K. (eds.) Euro-Par 2020: Parallel Processing, pp. 411–425. Springer, Cham (2020)
[8]
Ho, T.K.: Random decision forests. In: Proceedings of 3rd International Conference on Document Analysis and Recognition, vol. 1, pp. 278–282. IEEE (1995)
[9]
Lordan, F., et al.: ServiceSs: an interoperable programming framework for the cloud. J. Grid Comput. 12(1), 67–91 (2013)
[10]
Lordan, F., Lezzi, D., Badia, R.M.: Colony: parallel functions as a service on the cloud-edge continuum. In: Sousa, L., Roma, N., Tomás, P. (eds.) Euro-Par 2021: Parallel Processing, pp. 269–284. Springer, Cham (2021)
[11]
Meng, X., et al.: MLlib: machine learning in Apache Spark. J. Mach. Learn. Res. 17(1), 1235–1241 (2016)
[12]
Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
[13]
Quinlan, J.R.: Induction of decision trees. Mach. Learn. 1(1), 81–106 (1986)
[14]
Rocklin, M.: Dask: parallel computation with blocked algorithms and task scheduling. In: Proceedings of the 14th Python in Science Conference, pp. 130–136 (2015)
[15]
Salzberg, S.L.: C4.5: Programs for Machine Learning by J. Ross Quinlan. Morgan Kaufmann Publishers, Inc., 1993 (1994)
[16]
Tejedor, E., et al.: PyCOMPSs: parallel computational workflows in Python. Int. J. High Perform. Comput. Appl. 31(1), 66–82 (2017)
[17]
Van Rossum, G., Drake, F.L.: Python 3 Reference Manual. CreateSpace, Scotts Valley, CA (2009)
[18]
Zaharia, M., et al.: Apache Spark: a unified engine for big data processing. Commun. ACM 59(11), 56–65 (2016)

Published In

Euro-Par 2023: Parallel Processing: 29th International Conference on Parallel and Distributed Computing, Limassol, Cyprus, August 28 – September 1, 2023, Proceedings
Aug 2023
766 pages
ISBN: 978-3-031-39697-7
DOI: 10.1007/978-3-031-39698-4

Publisher

Springer-Verlag, Berlin, Heidelberg

Author Tags

  1. Random Forest
  2. PyCOMPSs
  3. COMPSs
  4. Parallelism
  5. Distributed Computing
  6. Dislib
  7. Machine Learning
  8. HPC

Qualifiers

  • Article
