research-article

Transform-data-by-example (TDE): an extensible search engine for data transformations

Published: 01 June 2018

Abstract

Today, business analysts and data scientists increasingly need to clean, standardize, and transform diverse data, such as names, addresses, date-times, and phone numbers, before they can perform analysis. This process of data transformation is an important part of data preparation and is known to be difficult and time-consuming for end users.
Traditionally, developers have dealt with these longstanding transformation problems using custom code libraries. They have built a wide variety of custom logic for tasks such as name parsing and address standardization, and shared their source code in places like GitHub. Data transformation would be much easier for end users if they could discover and reuse such existing transformation logic.
We developed Transform-Data-by-Example (TDE), which works like a search engine for data transformations. TDE "indexes" a wide variety of transformation logic in source code, DLLs, web services, and mapping tables, so that users only need to provide a few input/output examples to demonstrate a desired transformation, and TDE can interactively find relevant functions to synthesize new programs consistent with all examples. Using an index of 50K functions crawled from GitHub and Stack Overflow, TDE can already handle many common transformations not currently supported by existing systems. On a benchmark with over 200 transformation tasks, TDE generates correct transformations for 72% of the tasks, which is considerably better than the other systems evaluated. A beta version of TDE for Microsoft Excel is available via the Office Store. Part of the TDE technology also ships in Microsoft Power BI.
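The search-by-example idea described above can be illustrated with a minimal sketch. This is not TDE's actual algorithm or index: the three candidate functions, the `INDEX` dictionary, and the `consistent` and `search` helpers are all hypothetical names introduced here, and the "search" is a plain linear scan that keeps every function whose output matches all of the user's input/output examples.

```python
from datetime import datetime
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (input string, desired output string)

# Hypothetical candidate functions standing in for indexed transformation logic.
def parse_us_date(s: str) -> str:
    """Convert MM/DD/YYYY into ISO YYYY-MM-DD."""
    return datetime.strptime(s, "%m/%d/%Y").strftime("%Y-%m-%d")

def last_first(s: str) -> str:
    """Convert 'First Last' into 'Last, First'."""
    first, last = s.split(" ", 1)
    return f"{last}, {first}"

def digits_only(s: str) -> str:
    """Keep only the digit characters of a string."""
    return "".join(ch for ch in s if ch.isdigit())

# A toy "index": function name -> callable (an assumption for illustration).
INDEX: Dict[str, Callable[[str], str]] = {
    "parse_us_date": parse_us_date,
    "last_first": last_first,
    "digits_only": digits_only,
}

def consistent(fn: Callable[[str], str], examples: List[Example]) -> bool:
    """True if fn reproduces every example output without raising."""
    for inp, out in examples:
        try:
            if fn(inp) != out:
                return False
        except Exception:
            # A function that fails on an example input is not a match.
            return False
    return True

def search(examples: List[Example],
           index: Dict[str, Callable[[str], str]] = INDEX) -> List[str]:
    """Return the names of all indexed functions consistent with the examples."""
    return [name for name, fn in index.items() if consistent(fn, examples)]

if __name__ == "__main__":
    print(search([("01/06/2018", "2018-01-06"), ("12/25/2017", "2017-12-25")]))
    print(search([("John Smith", "Smith, John")]))
```

In this sketch a user who supplies two date examples gets back only the date-parsing function, because the other candidates either raise or produce a different output. A real system would also need to rank candidates and compose several functions into one program, which this toy version does not attempt.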

Published In

Proceedings of the VLDB Endowment, Volume 11, Issue 10
June 2018
248 pages
ISSN: 2150-8097

Publisher

VLDB Endowment


Citations

Cited By

  • (2024) How Do Categorical Duplicates Affect ML? A New Benchmark and Empirical Analyses. Proceedings of the VLDB Endowment 17:6 (1391-1404). DOI: 10.14778/3648160.3648178. Online publication date: 1-Feb-2024
  • (2024) Auto-Tables: Relationalize Tables without Using Examples. ACM SIGMOD Record 53:1 (76-85). DOI: 10.1145/3665252.3665269. Online publication date: 14-May-2024
  • (2024) Table-GPT: Table Fine-tuned GPT for Diverse Table Tasks. Proceedings of the ACM on Management of Data 2:3 (1-28). DOI: 10.1145/3654979. Online publication date: 30-May-2024
  • (2024) Auto-Formula: Recommend Formulas in Spreadsheets using Contrastive Learning for Table Representations. Proceedings of the ACM on Management of Data 2:3 (1-27). DOI: 10.1145/3654925. Online publication date: 30-May-2024
  • (2024) Towards Efficient Data Wrangling with LLMs using Code Generation. Proceedings of the Eighth Workshop on Data Management for End-to-End Machine Learning (62-66). DOI: 10.1145/3650203.3663334. Online publication date: 9-Jun-2024
  • (2024) DTT: An Example-Driven Tabular Transformer for Joinability by Leveraging Large Language Models. Proceedings of the ACM on Management of Data 2:1 (1-24). DOI: 10.1145/3639279. Online publication date: 26-Mar-2024
  • (2024) Variability in data transformation: towards data migration product lines. Proceedings of the 18th International Working Conference on Variability Modelling of Software-Intensive Systems (83-92). DOI: 10.1145/3634713.3634724. Online publication date: 7-Feb-2024
  • (2023) SemFORMS. Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (7106-7109). DOI: 10.24963/ijcai.2023/827. Online publication date: 19-Aug-2023
  • (2023) DataRinse: Semantic Transforms for Data Preparation Based on Code Mining. Proceedings of the VLDB Endowment 16:12 (4090-4093). DOI: 10.14778/3611540.3611628. Online publication date: 1-Aug-2023
  • (2023) Auto-Tables: Synthesizing Multi-Step Transformations to Relationalize Tables without Using Examples. Proceedings of the VLDB Endowment 16:11 (3391-3403). DOI: 10.14778/3611479.3611534. Online publication date: 24-Aug-2023
