DOI: 10.1145/3453483.3454045
research-article

Learning to find naming issues with big code and small supervision

Published: 18 June 2021

Abstract

We introduce a new approach for finding and fixing naming issues in source code. The method is based on a careful combination of unsupervised and supervised procedures: (i) unsupervised mining of patterns from Big Code that express common naming idioms, where program fragments violating such idioms indicate likely naming issues, and (ii) supervised learning of a classifier on a small labeled dataset, which filters potential false positives from the violations.
We implemented our method in a system called Namer and evaluated it on a large number of Python and Java programs. We demonstrate that Namer is effective in finding naming mistakes in real-world repositories with high precision (~70%). Perhaps surprisingly, we also show that existing deep learning methods are not practically effective and achieve low precision in finding naming issues (up to ~16%).
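The two-stage idea in the abstract can be illustrated with a deliberately small sketch. It is not Namer's implementation: the context keys, the thresholds, the toy corpus, the labeled examples, and the one-feature threshold "classifier" are all assumptions made for illustration. Stage (i) mines dominant names per context from an unlabeled corpus and flags deviations; stage (ii) fits a filter on a handful of labeled violations.

```python
from collections import Counter, defaultdict


def mine_idioms(observations, min_support=3, min_confidence=0.8):
    """Stage (i): keep the dominant name of a context if it is frequent enough."""
    by_context = defaultdict(Counter)
    for context, name in observations:
        by_context[context][name] += 1
    idioms = {}
    for context, counts in by_context.items():
        name, hits = counts.most_common(1)[0]
        confidence = hits / sum(counts.values())
        if hits >= min_support and confidence >= min_confidence:
            idioms[context] = (name, confidence)
    return idioms


def find_violations(observations, idioms):
    """Flag observations whose name deviates from the idiom mined for their context."""
    violations = []
    for context, name in observations:
        if context in idioms and name != idioms[context][0]:
            expected, confidence = idioms[context]
            violations.append({"context": context, "name": name,
                               "expected": expected, "confidence": confidence})
    return violations


def fit_threshold(labeled):
    """Stage (ii), toy version: choose the idiom-confidence threshold that
    best separates labeled true issues from false positives."""
    best_t, best_acc = 0.0, -1.0
    for t in sorted({conf for conf, _ in labeled}):
        acc = sum((conf >= t) == is_issue for conf, is_issue in labeled) / len(labeled)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t


# Hypothetical unlabeled corpus of (context, name) pairs.
corpus = (
    [("method_first_param", "self")] * 9 + [("method_first_param", "this")]
    + [("loop_index", "i")] * 4 + [("loop_index", "idx")]
    + [("exception_var", "e")] * 2 + [("exception_var", "err")]
)

# Hypothetical small labeled set: (idiom confidence, is it a real naming issue?).
labeled = [(0.95, True), (0.9, True), (0.82, False), (0.8, False)]

idioms = mine_idioms(corpus)
violations = find_violations(corpus, idioms)
threshold = fit_threshold(labeled)
reported = [v for v in violations if v["confidence"] >= threshold]

for v in reported:
    print(f"{v['context']}: '{v['name']}' (expected '{v['expected']}')")
```

In this toy setup the unlabeled corpus does most of the work, while the labeled set only decides which of the mined violations are worth reporting. A realistic filter would use richer features than a single confidence score.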


Published In

PLDI 2021: Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation
June 2021
1341 pages
ISBN: 9781450383912
DOI: 10.1145/3453483

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Anomaly detection
  2. Bug detection
  3. Machine learning
  4. Name-based program analysis
  5. Static analysis

Qualifiers

  • Research-article

Conference

PLDI '21
Acceptance Rates

Overall Acceptance Rate 406 of 2,067 submissions, 20%

Cited By

  • (2024) DAInfer: Inferring API Aliasing Specifications from Library Documentation via Neurosymbolic Optimization. Proceedings of the ACM on Software Engineering 1:FSE (2469-2492). DOI: 10.1145/3660816. Online publication date: 12-Jul-2024
  • (2024) Detecting and explaining Python name errors. Information and Software Technology (107592). DOI: 10.1016/j.infsof.2024.107592. Online publication date: Oct-2024
  • (2023) Pre-implementation Method Name Prediction for Object-oriented Programming. ACM Transactions on Software Engineering and Methodology 32:6 (1-35). DOI: 10.1145/3597203. Online publication date: 29-Sep-2023
  • (2023) CombTransformers: Statement-Wise Transformers for Statement-Wise Representations. IEEE Transactions on Software Engineering 49:10 (4677-4690). DOI: 10.1109/TSE.2023.3310793. Online publication date: 6-Sep-2023
  • (2022) Path-sensitive code embedding via contrastive learning for software vulnerability detection. Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis (519-531). DOI: 10.1145/3533767.3534371. Online publication date: 18-Jul-2022
  • (2022) Nalin. Proceedings of the 44th International Conference on Software Engineering (1469-1481). DOI: 10.1145/3510003.3510144. Online publication date: 21-May-2022