DOI: 10.1145/3605098.3636056
Research Article | Open Access

A Large-Scale Study of ML-Related Python Projects

Published: 21 May 2024

Abstract

The rise of machine learning (ML) for solving current and future problems has increased the production of ML-enabled software systems. Unfortunately, standardized tool chains for developing, employing, and maintaining such projects are not yet mature, which can mainly be attributed to a lack of understanding of the properties of ML-enabled software. For instance, it is still unclear how to manage and evolve ML-specific assets together with other software-engineering assets. In particular, ML-specific tools and processes, such as those for managing ML experiments, are often perceived as incompatible with practitioners' software-engineering tools and processes. To design new tools for developing ML-enabled software, it is crucial to understand the properties and current problems of developing these projects by eliciting empirical data from real projects, including the evolution of the different assets involved. Moreover, while studies in this direction have recently been conducted, identifying certain types of ML-enabled projects (e.g., experiments, libraries, and software systems) remains a challenge for researchers. We present a large-scale study of 31,066 ML projects found on GitHub, with an emphasis on their development stages and evolution. Our contributions include a dataset, together with empirical data providing an overview of the existing project types, and an analysis of the projects' properties and characteristics, especially regarding the implementation of different ML development stages and their evolution. We believe that our results support researchers, practitioners, and tool builders in conducting follow-up studies and, especially, in building novel tools for managing ML projects, ideally unified with traditional software-engineering tools.
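Identifying ML projects at this scale hinges on detecting the use of ML libraries in Python code. The sketch below illustrates one such heuristic: parsing a file's imports and matching them against a list of popular ML libraries. It is a minimal illustration under our own assumptions (the library list and the AST-based matching are ours), not the pipeline the authors actually used.

import ast

# Illustrative set of popular ML libraries; the paper's author tags
# mention tensorflow and scikit-learn, the rest are assumed here.
ML_LIBRARIES = {"sklearn", "tensorflow", "torch", "keras", "xgboost"}

def imports_ml_library(source: str) -> bool:
    """Return True if the given Python source imports a known ML library."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False  # skip files that are not valid Python
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(alias.name.split(".")[0] in ML_LIBRARIES
                   for alias in node.names):
                return True
        elif isinstance(node, ast.ImportFrom) and node.module:
            if node.module.split(".")[0] in ML_LIBRARIES:
                return True
    return False

# Example: a file using scikit-learn is flagged as ML-related.
print(imports_ml_library("from sklearn import svm"))  # True
print(imports_ml_library("import os, json"))          # False

In practice, a mining study would apply such a check to files retrieved via the GitHub API and combine it with repository metadata (e.g., stars and commit history) to distinguish experiments, libraries, and software systems.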


Cited By

  • Machine learning experiment management tools: a mixed-methods empirical study. Empirical Software Engineering 29:4 (2024). DOI: 10.1007/s10664-024-10444-w. Online publication date: 29 May 2024.



Information

Published In

SAC '24: Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing
April 2024, 1898 pages
ISBN: 9798400702433
DOI: 10.1145/3605098
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 21 May 2024

Author Tags

  1. machine learning
  2. ml-enabled systems
  3. evolution
  4. mining study
  5. open-source projects
  6. large-scale study
  7. tensorflow
  8. scikit-learn

Qualifiers

  • Research-article

Conference

SAC '24
Acceptance Rates

Overall acceptance rate: 1,650 of 6,669 submissions (25%)


Article Metrics

  • Downloads (last 12 months): 150
  • Downloads (last 6 weeks): 34

Reflects downloads up to 17 Oct 2024.

