DOI: 10.1145/1088149.1088168

Think globally, search locally

Published: 20 June 2005

Abstract

A key step in program optimization is the determination of optimal values for code optimization parameters such as cache tile sizes and loop unrolling factors. One approach, which is implemented in most compilers, is to use analytical models to determine these values. The other approach, used in library generators like ATLAS, is to perform a global empirical search over the space of parameter values.

Neither approach is completely suitable for use in general-purpose compilers that must generate high quality code for large programs running on complex architectures. Model-driven optimization may incur a performance penalty of 10-20% even for a relatively simple code like matrix multiplication. On the other hand, global search is not tractable for optimizing large programs for complex architectures because the optimization space is too large.

In this paper, we advocate a methodology for generating high-performance code without increasing search time dramatically. Our methodology has three components: (i) modeling, (ii) local search, and (iii) model refinement. We demonstrate this methodology by using it to eliminate the performance gap between code produced by a model-driven version of ATLAS described by us in prior work, and code produced by the original ATLAS system using global search.
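
The first two components of the methodology, modeling followed by local search, can be illustrated with a minimal sketch. It is not the paper's implementation: the cache-capacity model, the hill-climbing step, and the names `model_tile_size`, `local_search`, and `synthetic_mflops` are all assumptions made for illustration, and a made-up performance curve stands in for real timing runs. Model refinement is not shown.

```python
import math


def model_tile_size(cache_elems):
    """Toy analytical model (an assumption for this sketch): the largest NB
    such that three NB x NB tiles fit in a cache of cache_elems elements."""
    return max(1, math.isqrt(cache_elems // 3))


def local_search(start, measure, step=4, lo=1, hi=1024):
    """Hill-climb from the model's estimate: move to a neighbouring value
    while it measures strictly better, and stop at the first local optimum."""
    best, best_perf = start, measure(start)
    evaluations = 1
    improved = True
    while improved:
        improved = False
        # Neighbours are computed from the value of `best` at this point.
        for cand in (best - step, best + step):
            if lo <= cand <= hi:
                perf = measure(cand)
                evaluations += 1
                if perf > best_perf:
                    best, best_perf = cand, perf
                    improved = True
    return best, best_perf, evaluations


def synthetic_mflops(nb):
    """Hypothetical stand-in for a timing run: a curve that peaks at NB = 60."""
    return 1000.0 - (nb - 60) ** 2


if __name__ == "__main__":
    seed = model_tile_size(8192)  # hypothetical cache holding 8192 elements
    nb, perf, evals = local_search(seed, synthetic_mflops)
    print(f"model estimate NB={seed}, after local search NB={nb} "
          f"({perf} synthetic MFLOPS, {evals} measurements)")
```

In this toy setting the model's estimate is close to, but not at, the peak of the synthetic curve, and the local search closes the gap with a handful of measurements instead of one measurement per candidate value in the whole range.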

Published In

ICS '05: Proceedings of the 19th annual international conference on Supercomputing
June 2005
414 pages
ISBN:1595931678
DOI:10.1145/1088149

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ICS05
ICS05: International Conference on Supercomputing 2005
June 20 - 22, 2005
Cambridge, Massachusetts

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Cited By

  • (2023) Locally Linear Embedding. Elements of Dimensionality Reduction and Manifold Learning, pp. 207-247. DOI: 10.1007/978-3-031-10602-6_8. Online publication date: 3-Feb-2023
  • (2022) Real-time prediction by data-driven models applied to induction heating process. International Journal of Material Forming, 15(4). DOI: 10.1007/s12289-022-01691-7. Online publication date: 27-May-2022
  • (2020) A Neural Network-Based Optimal Tile Size Selection Model for Embedded Vision Applications. 2020 IEEE 22nd International Conference on High Performance Computing and Communications; IEEE 18th International Conference on Smart City; IEEE 6th International Conference on Data Science and Systems (HPCC/SmartCity/DSS), pp. 607-612. DOI: 10.1109/HPCC-SmartCity-DSS50907.2020.00077. Online publication date: Dec-2020
  • (2018) Revisiting Loop Tiling for Datacenters. Proceedings of the 2018 International Conference on Supercomputing, pp. 328-340. DOI: 10.1145/3205289.3205306. Online publication date: 12-Jun-2018
  • (2018) An efficient tile size selection model based on machine learning. Journal of Parallel and Distributed Computing, 121, pp. 27-41. DOI: 10.1016/j.jpdc.2018.06.005. Online publication date: Nov-2018
  • (2016) Enhancing X10 performance by auto-tuning the managed java back-end. 2016 Sixteenth International Conference on Advances in ICT for Emerging Regions (ICTer), pp. 21-28. DOI: 10.1109/ICTER.2016.7829894. Online publication date: Sep-2016
  • (2016) Data-aware tuning of scientific applications with model-based autotuning. Concurrency and Computation: Practice and Experience, 29(4). DOI: 10.1002/cpe.3885. Online publication date: 17-Jun-2016
  • (2013) A script-based autotuning compiler system to generate high-performance CUDA code. ACM Transactions on Architecture and Code Optimization, 9(4), pp. 1-25. DOI: 10.1145/2400682.2400690. Online publication date: 20-Jan-2013
  • (2013) Predictive Modeling in a Polyhedral Optimization Space. International Journal of Parallel Programming, 41(5), pp. 704-750. DOI: 10.1007/s10766-013-0241-1. Online publication date: 21-Feb-2013
  • (2012) Analytical bounds for optimal tile size selection. Proceedings of the 21st international conference on Compiler Construction, pp. 101-121. DOI: 10.1007/978-3-642-28652-0_6. Online publication date: 24-Mar-2012
