CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks

Puri, Ruchir; Kung, David S.; Janssen, Geert; Zhang, Wei; Domeniconi, Giacomo; Zolotov, Vladimir; Dolby, Julian; Chen, Jie; Choudhury, Mihir; Decker, Lindsey; Thost, Veronika; Buratti, Luca; Pujar, Saurabh; Ramji, Shyam; Finkler, Ulrich; Malaika, Susan; Reiss, Frederick

arXiv:2105.12655 (cs)

[Submitted on 25 May 2021 (v1), last revised 29 Aug 2021 (this version, v2)]

Abstract: Over the last several decades, software has been woven into the fabric of every aspect of our society. As software development surges and code infrastructure of enterprise applications ages, it is now more critical than ever to increase software development productivity and modernize legacy applications. Advances in deep learning and machine learning algorithms have enabled numerous breakthroughs, motivating researchers to leverage AI techniques to improve software development efficiency. Thus, the fast-emerging research area of AI for Code has garnered new interest and gathered momentum. In this paper, we present a large-scale dataset CodeNet, consisting of over 14 million code samples and about 500 million lines of code in 55 different programming languages, which is aimed at teaching AI to code. In addition to its large scale, CodeNet has a rich set of high-quality annotations to benchmark and help accelerate research in AI techniques for a variety of critical coding tasks, including code similarity and classification, code translation between a large variety of programming languages, and code performance (runtime and memory) improvement techniques. Additionally, CodeNet provides sample input and output test sets for 98.5% of the code samples, which can be used as an oracle for determining code correctness and potentially guide reinforcement learning for code quality improvements. As a usability feature, we provide several pre-processing tools in CodeNet to transform source code into representations that can be readily used as inputs into machine learning models. Results of code classification and code similarity experiments using the CodeNet dataset are provided as a reference. We hope that the scale, diversity and rich, high-quality annotations of CodeNet will offer unprecedented research opportunities at the intersection of AI and Software Engineering.
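
The abstract notes that sample input and output pairs can serve as an oracle for code correctness. The Python sketch below illustrates that general idea only; it is not part of CodeNet's tooling. The function names, the timeout, the whitespace normalization rule, and the toy submissions in `demo` are all assumptions made for illustration.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_sample_test(source_path, sample_input, expected_output, timeout_s=5.0):
    """Run a Python submission on one sample input and compare its stdout
    with the expected output (hypothetical helper, not a CodeNet API)."""
    try:
        result = subprocess.run(
            [sys.executable, source_path],
            input=sample_input,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # Treat a timeout as a failed test.
        return False
    if result.returncode != 0:
        # A crash or non-zero exit is also a failure.
        return False

    def normalize(text):
        # Assumed comparison rule: ignore trailing whitespace on each line
        # and blank lines at the start or end of the output.
        return [line.rstrip() for line in text.strip().splitlines()]

    return normalize(result.stdout) == normalize(expected_output)


def demo():
    """Check two toy submissions for an 'add two integers' problem."""
    with tempfile.TemporaryDirectory() as tmp:
        good = Path(tmp) / "good.py"
        good.write_text("a, b = map(int, input().split())\nprint(a + b)\n")
        bad = Path(tmp) / "bad.py"
        bad.write_text("a, b = map(int, input().split())\nprint(a - b)\n")
        return (
            passes_sample_test(str(good), "2 3\n", "5\n"),
            passes_sample_test(str(bad), "2 3\n", "5\n"),
        )
```

A pass or fail signal of this kind is what could, in principle, act as a reward or filter when evaluating generated or translated code; a real harness would also need sandboxing and per-language build steps.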

Comments:	22 pages including references
Subjects:	Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2105.12655 [cs.SE]
	(or arXiv:2105.12655v2 [cs.SE] for this version)
	https://doi.org/10.48550/arXiv.2105.12655

Submission history

From: Ruchir Puri
[v1] Tue, 25 May 2021 00:13:29 UTC (689 KB)
[v2] Sun, 29 Aug 2021 19:43:43 UTC (270 KB)
