skip to main content
article

Auto-Tables: Relationalize Tables without Using Examples

Published: 14 May 2024 Publication History

Abstract

Relational tables, where each row corresponds to an entity and each column corresponds to an attribute, have been the standard for tables in relational databases. However, such a standard cannot be taken for granted when dealing with tables "in the wild". Our survey of real spreadsheettables and web-tables shows that over 30% of such tables do not conform to the relational standard, for which complex table-restructuring transformations are needed before these tables can be queried easily using SQL-based tools. Unfortunately, the required transformations are non-trivial to program, which has become a substantial pain point for technical and non-technical users alike, as evidenced by large numbers of forum questions in places like StackOverflow and Excel/Tableau forums.

References

[1]
Auto-Tables: full version. https://arxiv.org/abs/2307.14565.
[2]
Example Excel forum question: Hard to query without transformations (Retrieved in 02/2023). https://techcommunity.microsoft.com/t5/excel/ power-query-data-cleaning-unpivot-transpose-etc/ m-p/2400300.
[3]
Example Excel forum question: Hard to query without transformations (Retrieved in 02/2023). https://techcommunity.microsoft.com/t5/excel/ unpivot-grouped-data/m-p/3686239.
[4]
Example Excel forum question: Hard to query without transformations (Retrieved in 02/2023). https://techcommunity.microsoft.com/t5/excel/ unpivot-monthly-data/m-p/1867836.
[5]
Example Excel forum question: Table analysis provides unexpected results (Retrieved in 02/2023). https://answers.microsoft.com/en-us/msoffice/forum/ all/excel-ideas-feature/ c9574cf9-dccc-4356--95d3-07d268e39d82.
[6]
Example Excel forum question to relationalize tables: Data restructuring using Excel (Retrieved in 02/2023). https://techcommunity.microsoft.com/t5/excel/ data-restructuring-using-excel/m-p/287547.
[7]
Example Excel forum question to relationalize tables: Pivot chart 4 columns (Retrieved in 02/2023). https://techcommunity.microsoft.com/t5/excel/ pivot-chart-4-columns-set-responses-to-4-questions/ m-p/2329880.
[8]
Example Excel forum question to relationalize tables: Pivot table issue. (Retrieved in 02/2023). https://techcommunity.microsoft.com/t5/excel/ pivot-table-issue/m-p/3015448.
[9]
Example Excel forum question to relationalize tables: Transpose data for analysis (Retrieved in 02/2023). https://techcommunity.microsoft.com/t5/excel/ transposing-data-for-better-analysis/m-p/1297106.
[10]
Foofah code on GitHub. https://github.com/umich-dbgroup/foofah.
[11]
Pandas API in Python. https://pandas.pydata.org/.
[12]
Pandas operator: Explode. (Retrieved in 02/2023). https://pandas.pydata.org/docs/reference/api/pandas. DataFrame.explode.html.
[13]
Pandas operator: FFill. (Retrieved in 02/2023). https://pandas.pydata.org/docs/reference/api/pandas. DataFrame.ffill.html.
[14]
Pandas operator: Melt. (Retrieved in 02/2023). https://pandas.pydata.org/docs/reference/api/pandas. melt.html.
[15]
Pandas operator: Pivot. (Retrieved in 02/2023). https://pandas.pydata.org/docs/reference/api/pandas. DataFrame.pivot.html.
[16]
Pandas operator: Stack. (Retrieved in 02/2023). https://pandas.pydata.org/docs/reference/api/pandas. DataFrame.stack.html.
[17]
Pandas operator: Transpose. (Retrieved in 02/2023). https://pandas.pydata.org/docs/reference/api/pandas. DataFrame.transpose.html.
[18]
Pandas operator: Wide-to-long. (Retrieved in 02/2023). https://pandas.pydata.org/docs/reference/api/pandas. wide_to_long.html.
[19]
PATSQL code on GitHub. https://github.com/NAIST-SE/PATSQL.
[20]
R operator: pivot-longer, which is similar to Wide-to-long. (Retrieved in 02/2023). https: //tidyr.tidyverse.org/reference/pivot_longer.html.
[21]
Reimplementation of FlashRelate code on GitHub. https://github.com/BEE-Synth/Bee/tree/ 291a824622e36fccfa43461e85be3f836e3f4eff/Eval/ Benchmarks/Spreadsheet/flashrelate-01.
[22]
Scythe code on GitHub. https://github.com/Mestway/Scythe.
[23]
Trifacta: Standardize Using Patterns. (Retrieved in 07/2023). https://docs.trifacta.com/display/DP/ Standardize+Using+Patterns.
[24]
D. W. Barowy, S. Gulwani, T. Hart, and B. Zorn. Flashrelate: extracting relational data from semi-structured spreadsheets using examples. ACM SIGPLAN Notices, 50(6):218--228, 2015.
[25]
T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877--1901, 2020.
[26]
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248--255. Ieee, 2009.
[27]
X. Deng, H. Sun, A. Lees, Y. Wu, and C. Yu. Turl: Table understanding through representation learning. ACM SIGMOD Record, 51(1):33--40, 2022.
[28]
Y. Gao, S. Huang, and A. Parameswaran. Navigating the data lake with datamaran: Automatically extracting structure from log datasets. In Proceedings of the 2018 International Conference on Management of Data, pages 943--958, 2018.
[29]
S. Gulwani, W. R. Harris, and R. Singh. Spreadsheet data manipulation using examples. Communications of the ACM, 55(8):97--105, 2012.
[30]
K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770--778, 2016.
[31]
Y. He, X. Chu, K. Ganjam, Y. Zheng, V. Narasayya, and S. Chaudhuri. Transform-data-by-example (tde) an extensible search engine for data transformations. Proceedings of the VLDB Endowment, 11(10):1165--1177, 2018.
[32]
Z. Jin, M. R. Anderson, M. Cafarella, and H. Jagadish. Foofah: Transforming data by example. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 683--698, 2017.
[33]
Z. Jin, Y. He, and S. Chauduri. Auto-transform: learning-to-transform by patterns. Proceedings of the VLDB Endowment, 13(12):2368--2381, 2020.
[34]
M. Koehler, E. Abel, A. Bogatu, C. Civili, L. Mazilu, N. Konstantinou, A. A. Fernandes, J. Keane, L. Libkin, and N. W. Paton. Incorporating data context to cost-effectively automate end-to-end data wrangling. IEEE Transactions on Big Data, 7(1):169--186, 2019.
[35]
A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. CACM, 60(6):84--90, 2017.
[36]
Z. Li, F. Liu, W. Yang, S. Peng, and J. Zhou. A survey of convolutional neural networks: analysis, applications, and prospects. IEEE transactions on neural networks and learning systems, 2021.
[37]
Y. Lin, Y. He, and S. Chaudhuri. Auto-bi: Automatically build bi-models leveraging local join prediction and global schema graph. Proceedings of the VLDB Endowment, 2023.
[38]
K. P. Murphy. Machine learning: a probabilistic perspective. MIT press, 2012.
[39]
A. D. Nobari and D. Rafiei. Efficiently transforming tables for joinability. In 2022 IEEE 38th International Conference on Data Engineering (ICDE), pages 1649--1661. IEEE, 2022.
[40]
N. Reimers and I. Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 11 2019.
[41]
H. Sch�utze, C. D. Manning, and P. Raghavan. Introduction to information retrieval, volume 39. Cambridge University Press Cambridge, 2008.
[42]
C. Shorten and T. M. Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of big data, 6(1):1--48, 2019.
[43]
K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[44]
K. Takenouchi, T. Ishio, J. Okada, and Y. Sakata. Patsql: efficient synthesis of sql queries from example tables with quick inference of projected columns. arXiv preprint arXiv:2010.05807, 2020.
[45]
Q. T. Tran, C.-Y. Chan, and S. Parthasarathy. Query by output. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, pages 535--548, 2009.
[46]
C. Wang, A. Cheung, and R. Bodik. Synthesizing highly expressive sql queries from input-output examples. In SIGPLAN, pages 452--466, 2017.
[47]
C. Yan and Y. He. Auto-suggest: Learning-to-recommend data preparation steps using data science notebooks. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pages 1539--1554, 2020.
[48]
J. Yang, Y. He, and S. Chaudhuri. Auto-pipeline: synthesizing complex data pipelines by-target using reinforcement learning and search. Proceedings of the VLDB Endowment, 2021.
[49]
P. Yin, G. Neubig, W.-t. Yih, and S. Riedel. Tabert: Pretraining for joint understanding of textual and tabular data. arXiv preprint arXiv:2005.08314, 2020.
[50]
S. Zhang and Y. Sun. Automatically synthesizing sql queries from input-output examples. In 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 224--234. IEEE, 2013.
[51]
E. Zhu, Y. He, and S. Chaudhuri. Auto-join: Joining tables by leveraging transformations. Proceedings of the VLDB Endowment, 10(10):1034--1045, 2017. SIGMOD

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM SIGMOD Record
ACM SIGMOD Record  Volume 53, Issue 1
March 2024
90 pages
DOI:10.1145/3665252
Issue’s Table of Contents
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 14 May 2024
Published in SIGMOD Volume 53, Issue 1

Check for updates

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • 0
    Total Citations
  • 71
    Total Downloads
  • Downloads (Last 12 months)71
  • Downloads (Last 6 weeks)15
Reflects downloads up to 17 Oct 2024

Other Metrics

Citations

View Options

Get Access

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media