DOI: 10.1145/3097983.3097992

Peeking at A/B Tests: Why it matters, and what to do about it

Published: 13 August 2017

Abstract

This paper reports on a novel statistical methodology deployed by the commercial A/B testing platform Optimizely to communicate experimental results to its customers. Our methodology addresses the fact that traditional p-values and confidence intervals give unreliable inference, because users of A/B testing software are known to continuously monitor these measures as an experiment runs. We provide always-valid p-values and confidence intervals that are provably robust to this effect. Not only does this make it safe for a user to continuously monitor, but it empowers her to detect true effects more efficiently. This paper provides simulations and numerical studies on Optimizely's data, demonstrating an improvement in detection performance over traditional methods.
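The abstract's core claim, that repeatedly checking a fixed-horizon p-value inflates the false-positive rate while an always-valid p-value does not, can be illustrated with a small simulation. The sketch below uses the mixture sequential probability ratio test (mSPRT) for a normal mean with known variance, one standard construction of always-valid p-values; the parameter choices (`tau`, `alpha`, the sample sizes) are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def always_valid_p(xs, sigma=1.0, tau=1.0, theta0=0.0):
    """Always-valid p-value sequence from the mSPRT (normal mean, known variance)."""
    n = np.arange(1, len(xs) + 1)
    xbar = np.cumsum(xs) / n
    # Mixture likelihood ratio, mixing the alternative mean over N(theta0, tau^2)
    lam = np.sqrt(sigma**2 / (sigma**2 + n * tau**2)) * np.exp(
        n**2 * tau**2 * (xbar - theta0) ** 2
        / (2 * sigma**2 * (sigma**2 + n * tau**2))
    )
    # p_n = min over k <= n of min(1, 1 / Lambda_k): nonincreasing by construction
    return np.minimum.accumulate(np.minimum(1.0, 1.0 / lam))

rng = np.random.default_rng(0)
alpha, n_steps, n_sims = 0.05, 1000, 500
naive_rejects = msprt_rejects = 0
for _ in range(n_sims):
    xs = rng.standard_normal(n_steps)       # data generated under H0: mean = 0
    n = np.arange(1, n_steps + 1)
    z = np.abs(np.cumsum(xs)) / np.sqrt(n)  # z-statistic recomputed at every step
    naive_rejects += np.any(z > 1.96)       # "peeking" at a fixed-horizon test
    msprt_rejects += always_valid_p(xs)[-1] < alpha

naive_rate = naive_rejects / n_sims
msprt_rate = msprt_rejects / n_sims
print(f"naive peeking false-positive rate: {naive_rate:.2f}")
print(f"always-valid false-positive rate:  {msprt_rate:.2f}")
```

Because the always-valid p-value sequence is nonincreasing, checking its final value captures whether the user would have rejected at any interim look; the naive test's false-positive rate, by contrast, grows with the number of looks.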

Supplementary Material

MP4 File (walsh_peeking_tests.mp4)




Published In

KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2017
2240 pages
ISBN:9781450348874
DOI:10.1145/3097983
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. a/b testing
  2. confidence intervals
  3. p-values
  4. sequential hypothesis testing

Qualifiers

  • Research-article

Conference

KDD '17

Acceptance Rates

KDD '17 paper acceptance rate: 64 of 748 submissions (9%)
Overall acceptance rate: 1,133 of 8,635 submissions (13%)


Article Metrics

  • Downloads (Last 12 months)353
  • Downloads (Last 6 weeks)37
Reflects downloads up to 16 Oct 2024


Cited By

  • (2024) Best of Three Worlds: Adaptive Experimentation for Digital Marketing in Practice. Proceedings of the ACM Web Conference 2024, 3586-3597. DOI: 10.1145/3589334.3645504. Online publication date: 13-May-2024.
  • (2024) A/B testing. Journal of Systems and Software 211:C. DOI: 10.1016/j.jss.2024.112011. Online publication date: 2-Jul-2024.
  • (2023) Building a Foundation for More Flexible A/B Testing: Applications of Interim Monitoring to Large Scale Data. Journal of Data Science, 412-427. DOI: 10.6339/23-JDS1099. Online publication date: 21-Apr-2023.
  • (2023) Optimal treatment allocation for efficient policy evaluation in sequential decision making. Proceedings of the 37th International Conference on Neural Information Processing Systems, 48890-48905. DOI: 10.5555/3666122.3668245. Online publication date: 10-Dec-2023.
  • (2023) Should I stop or should I go. Proceedings of the 37th International Conference on Neural Information Processing Systems, 15799-15832. DOI: 10.5555/3666122.3666817. Online publication date: 10-Dec-2023.
  • (2023) Nonparametric extensions of randomized response for private confidence sets. Proceedings of the 40th International Conference on Machine Learning, 36748-36789. DOI: 10.5555/3618408.3619936. Online publication date: 23-Jul-2023.
  • (2023) Research on the Optimization of A/B Testing System Based on Dynamic Strategy Distribution. Processes 11(3), 912. DOI: 10.3390/pr11030912. Online publication date: 17-Mar-2023.
  • (2023) Online Experiments with Diminishing Marginal Effects. SSRN Electronic Journal. DOI: 10.2139/ssrn.4640583. Online publication date: 2023.
  • (2023) Sociotechnical Audits: Broadening the Algorithm Auditing Lens to Investigate Targeted Advertising. Proceedings of the ACM on Human-Computer Interaction 7(CSCW2), 1-37. DOI: 10.1145/3610209. Online publication date: 4-Oct-2023.
  • (2023) Correcting for Interference in Experiments: A Case Study at Douyin. Proceedings of the 17th ACM Conference on Recommender Systems, 455-466. DOI: 10.1145/3604915.3608808. Online publication date: 14-Sep-2023.
