DOI: 10.1145/3605098.3636056
Research Article | Open Access

A Large-Scale Study of ML-Related Python Projects

Published: 21 May 2024

Abstract

The rise of machine learning (ML) for solving current and future problems has increased the production of ML-enabled software systems. Unfortunately, standardized tool chains for developing, employing, and maintaining such projects are not yet mature, which can mainly be attributed to a lack of understanding of the properties of ML-enabled software. For instance, it is still unclear how to manage and evolve ML-specific assets together with other software-engineering assets. In particular, ML-specific tools and processes, such as those for managing ML experiments, are often perceived as incompatible with practitioners' software-engineering tools and processes. To design new tools for developing ML-enabled software, it is crucial to understand the properties and current problems of developing these projects by eliciting empirical data from real projects, including the evolution of the different assets involved. Moreover, while studies in this direction have recently been conducted, identifying certain types of ML-enabled projects (e.g., experiments, libraries, and software systems) remains a challenge for researchers. We present a large-scale study of 31,066 ML projects found on GitHub, with an emphasis on their development stages and evolution. Our contributions include a dataset, together with empirical data providing an overview of the existing project types, and an analysis of the projects' properties and characteristics, especially regarding the implementation of different ML development stages and their evolution. We believe that our results support researchers, practitioners, and tool builders in conducting follow-up studies and, especially, in building novel tools for managing ML projects, ideally unified with traditional software-engineering tools.
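Identifying ML projects at this scale hinges on detecting the use of ML libraries in Python code. The sketch below illustrates one such heuristic: parsing a file's imports and matching them against a list of popular ML libraries. It is a minimal illustration under our own assumptions (the library list and the AST-based matching are ours), not the pipeline the authors actually used.

import ast

# Illustrative set of popular ML libraries; the paper's author tags
# mention tensorflow and scikit-learn, the rest are assumed here.
ML_LIBRARIES = {"sklearn", "tensorflow", "torch", "keras", "xgboost"}

def imports_ml_library(source: str) -> bool:
    """Return True if the given Python source imports a known ML library."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False  # skip files that are not valid Python
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(alias.name.split(".")[0] in ML_LIBRARIES
                   for alias in node.names):
                return True
        elif isinstance(node, ast.ImportFrom) and node.module:
            if node.module.split(".")[0] in ML_LIBRARIES:
                return True
    return False

# Example: a file using scikit-learn is flagged as ML-related.
print(imports_ml_library("from sklearn import svm"))  # True
print(imports_ml_library("import os, json"))          # False

In practice, a mining study would apply such a check to files retrieved via the GitHub API and combine it with repository metadata (e.g., stars and commit history) to distinguish experiments, libraries, and software systems.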


Cited By

  • Machine learning experiment management tools: a mixed-methods empirical study. Empirical Software Engineering 29:4 (2024). DOI: 10.1007/s10664-024-10444-w. Online publication date: 29 May 2024.



Information

Published In

SAC '24: Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing
April 2024, 1898 pages
ISBN: 9798400702433
DOI: 10.1145/3605098
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 21 May 2024

Author Tags

  1. machine learning
  2. ml-enabled systems
  3. evolution
  4. mining study
  5. open-source projects
  6. large-scale study
  7. tensorflow
  8. scikit-learn

Qualifiers

  • Research-article

Conference

SAC '24
Acceptance Rates

Overall acceptance rate: 1,650 of 6,669 submissions (25%)


Article Metrics

  • Downloads (last 12 months): 150
  • Downloads (last 6 weeks): 34

Reflects downloads up to 17 Oct 2024.

